Foot does not always stop on SIGTERM #651

Closed
opened 3 months ago by pranjal · 15 comments
pranjal commented 3 months ago

When I poweroff my machine with poweroff, a message shows up that
A stop job is running for Session 1 of orewbaka

On checking journalctl, I found this:

Jul 24 19:45:11 orewbaka systemd[1]: session-1.scope: Stopping timed out. Killing.
Jul 24 19:45:11 orewbaka systemd[1]: session-1.scope: Killing process 31774 (foot) with signal SIGKILL.
Jul 24 19:45:11 orewbaka systemd[1]: session-1.scope: Killing process 31780 (foot:render:1) with signal SIGKILL.
Jul 24 19:45:11 orewbaka systemd[1]: session-1.scope: Failed with result 'timeout'.
Jul 24 19:45:11 orewbaka systemd[1]: Stopped Session 1 of User orewbaka.

System Info
OS - Arch Linux
Boot Manager - systemd-boot
Wayland Compositor - Sway
Foot version - 1.8.2

When I poweroff my machine with `poweroff`, a message shows up that ```A stop job is running for Session 1 of orewbaka``` On checking `journalctl`, I found this: ``` Jul 24 19:45:11 orewbaka systemd[1]: session-1.scope: Stopping timed out. Killing. Jul 24 19:45:11 orewbaka systemd[1]: session-1.scope: Killing process 31774 (foot) with signal SIGKILL. Jul 24 19:45:11 orewbaka systemd[1]: session-1.scope: Killing process 31780 (foot:render:1) with signal SIGKILL. Jul 24 19:45:11 orewbaka systemd[1]: session-1.scope: Failed with result 'timeout'. Jul 24 19:45:11 orewbaka systemd[1]: Stopped Session 1 of User orewbaka. ``` System Info OS - Arch Linux Boot Manager - systemd-boot Wayland Compositor - Sway Foot version - 1.8.2
Owner

This is something I have never seen (which just means something is different between yours and my machine)...

Is this a foot --server instance, or a regular foot instance? If foot --server, is there anything logged by foot in the journal?

Can you reproduce this manually by sending SIGTERM to a foot instance (regular, or foot --server, depending on where the original issue is happening)?

This is something I have never seen (which just means something is different between yours and my machine)... Is this a `foot --server` instance, or a regular foot instance? If `foot --server`, is there anything logged by foot in the journal? Can you reproduce this manually by sending `SIGTERM` to a foot instance (regular, or `foot --server`, depending on where the original issue is happening)?
dnkl added the
bug
label 3 months ago
Poster

This is a regular foot instance.

On manually sending SIGTERM to a foot instance, it always terminates normally.

This is a regular foot instance. On manually sending `SIGTERM` to a foot instance, it always terminates normally.
Owner

So, this might be a systemd-boot issue...

Foot should always terminate on SIGTERM regardless of who sends it, so this is a bug in foot. But it'll be much harder to debug if we can't reproduce it easily.

I assume systemd tries to kill all user processes when you log out (i.e poweroff), and foot is just one of them?

I'm guessing this happens at roughly the same time Sway is shut down. This too will cause foot to shutdown. I'm guessing there's some kind of interferense between the two mechanisms that causes foot to get stuck.

What we'd really need is a GDB backtrace (preferably using a debug build of foot) at this point, after the SIGTERM, but before the SIGKILL. But I don't see how we're going to get that, considering we cannot reproduce it manually.

> So, this might be a systemd-boot issue... Foot should always terminate on `SIGTERM` regardless of who sends it, so this is a bug in foot. But it'll be much harder to debug if we can't reproduce it easily. I assume systemd tries to kill all user processes when you log out (i.e poweroff), and foot is just one of them? I'm guessing this happens at roughly the same time Sway is shut down. This too will cause foot to shutdown. I'm guessing there's some kind of interferense between the two mechanisms that causes foot to get stuck. What we'd really need is a GDB backtrace (preferably using a debug build of foot) at this point, after the `SIGTERM`, but before the `SIGKILL`. But I don't see how we're going to get that, considering we cannot reproduce it manually.
Owner

One guess I have is that there is some corner case in the logic that handles the case where the Wayland socket has been closed (i.e. Sway has been terminated). If we get stuck there, SIGTERM would be blocked, which would explain why systemd can't terminate foot.

Foot should log errors in this case (to stderr). Any chance you can run foot and redirect its output to a file? Then, when the hang occurs, check the log file for errors.

One guess I have is that there is some corner case in the logic that handles the case where the Wayland socket has been closed (i.e. Sway has been terminated). If we get stuck there, `SIGTERM` would be blocked, which would explain why systemd can't terminate foot. Foot _should_ log errors in this case (to stderr). Any chance you can run foot and redirect its output to a file? Then, when the hang occurs, check the log file for errors.
Poster

I'm not quite sure how to log the output of foot. I ran foot > mylog but it does not log it.
EDIT: I figured it out. foot &> mylog works

I'm not quite sure how to log the output of foot. I ran `foot > mylog` but it does not log it. EDIT: I figured it out. `foot &> mylog` works
Owner

foot > a

That only redirects stdout, but foot logs on stderr. Try foot 2> a instead.

> foot > a That only redirects `stdout`, but foot logs on `stderr`. Try `foot 2> a` instead.
Poster

Alright I'll try foot 2> a

Alright I'll try `foot 2> a`
Poster

This is the output of foot:

info: main.c:398: version: 1.8.2 +pgo +ime +graphemes
info: main.c:405: arch: Linux x86_64/64-bit
info: main.c:412: locale: en_IN.UTF-8
info: config.c:2928: loading configuration from /home/orewbaka/.config/foot/foot.ini
info: wayland.c:1225: eDP-1: 1366x768+0x0@60Hz 0x15DB 15.33" scale=1 PPI=105x109 (physical) PPI=105x109 (logical), DPI=102.20
info: wayland.c:1378: requesting SSD decorations
info: fcft.c:267: fcft: 2.4.4 +graphemes +runs
info: fcft.c:277: fontconfig: 2.13.94
info: fcft.c:283: freetype: 2.10.4
info: fcft.c:758: /usr/share/fonts/TTF/DejaVuSansMono-BoldOblique.ttf: size=13.00pt/18px, dpi=100.00
info: fcft.c:758: /usr/share/fonts/TTF/DejaVuSansMono.ttf: size=13.00pt/18px, dpi=100.00
info: fcft.c:758: /usr/share/fonts/TTF/DejaVuSansMono-Bold.ttf: size=13.00pt/18px, dpi=100.00
info: fcft.c:758: /usr/share/fonts/TTF/DejaVuSansMono-Oblique.ttf: size=13.00pt/18px, dpi=100.00
info: terminal.c:682: cell width=11, height=22
info: terminal.c:584: using 2 rendering threads
info: wayland.c:764: using SSD decorations
warn: wayland.c:1129: disconnected from Wayland

The issue was probably that I was executing sway exit in foot to close sway and go to a tty. So, foot was being disconnected from wayland and could not terminate properly.

Sorry for wasting your time.

This is the output of foot: ``` info: main.c:398: version: 1.8.2 +pgo +ime +graphemes info: main.c:405: arch: Linux x86_64/64-bit info: main.c:412: locale: en_IN.UTF-8 info: config.c:2928: loading configuration from /home/orewbaka/.config/foot/foot.ini info: wayland.c:1225: eDP-1: 1366x768+0x0@60Hz 0x15DB 15.33" scale=1 PPI=105x109 (physical) PPI=105x109 (logical), DPI=102.20 info: wayland.c:1378: requesting SSD decorations info: fcft.c:267: fcft: 2.4.4 +graphemes +runs info: fcft.c:277: fontconfig: 2.13.94 info: fcft.c:283: freetype: 2.10.4 info: fcft.c:758: /usr/share/fonts/TTF/DejaVuSansMono-BoldOblique.ttf: size=13.00pt/18px, dpi=100.00 info: fcft.c:758: /usr/share/fonts/TTF/DejaVuSansMono.ttf: size=13.00pt/18px, dpi=100.00 info: fcft.c:758: /usr/share/fonts/TTF/DejaVuSansMono-Bold.ttf: size=13.00pt/18px, dpi=100.00 info: fcft.c:758: /usr/share/fonts/TTF/DejaVuSansMono-Oblique.ttf: size=13.00pt/18px, dpi=100.00 info: terminal.c:682: cell width=11, height=22 info: terminal.c:584: using 2 rendering threads info: wayland.c:764: using SSD decorations warn: wayland.c:1129: disconnected from Wayland ``` The issue was probably that I was executing `sway exit` in foot to close sway and go to a tty. So, foot was being disconnected from wayland and could not terminate properly. Sorry for wasting your time.
Owner

Foot should still handle that. I tried it here on my machine, and foot shut down immediately upon running swaymsg exit.

One of the last things foot logs is "goodbye". The fact that we're not seeing that in your log is yet another sign of an abnormal termination.

What's systemd's timeout (i.e. how long does it way before sending SIGKILL)? Can you bump it to something large (I don't use systemd, hence my lack of knowledge)?

If so, we could try a debug build of foot, and attach GDB to it after exiting sway, before systemd's timeout kicks in. That would tell us where foot is stuck.

Foot should still handle that. I tried it here on my machine, and foot shut down immediately upon running `swaymsg exit`. One of the last things foot logs is "goodbye". The fact that we're not seeing that in your log is yet another sign of an abnormal termination. What's systemd's timeout (i.e. how long does it way before sending `SIGKILL`)? Can you bump it to something large (I don't use systemd, hence my lack of knowledge)? If so, we _could_ try a debug build of foot, and attach GDB to it after exiting sway, before systemd's timeout kicks in. That would tell us _where_ foot is stuck.
Owner

Any chance you can do a custom build with this patch applied?

diff --git a/wayland.c b/wayland.c
index a83ef59d..ffe8744f 100644
--- a/wayland.c
+++ b/wayland.c
@@ -1164,6 +1164,7 @@ fdm_wayl(struct fdm *fdm, int fd, int events, void *data)
     if (events & EPOLLHUP) {
         LOG_WARN("disconnected from Wayland");
         wl_display_cancel_read(wayl->display);
+        LOG_WARN("returned from wl_display_cancel_read()");
         return false;
     }
 

This since the last log message from your output was from here, followed by nothing. I grepped my own log files, and couldn't see this particular branch ever being taken. So, I'd like to see if we're for some weird reason I can't see, get stuck here.

If I'm wrong, we'll see "returned from wl_display_cancel_read()" in the log output. But if "disconnected from Wayland" is still the last log message, then we have a lead.

Any chance you can do a custom build with this patch applied? ```diff diff --git a/wayland.c b/wayland.c index a83ef59d..ffe8744f 100644 --- a/wayland.c +++ b/wayland.c @@ -1164,6 +1164,7 @@ fdm_wayl(struct fdm *fdm, int fd, int events, void *data) if (events & EPOLLHUP) { LOG_WARN("disconnected from Wayland"); wl_display_cancel_read(wayl->display); + LOG_WARN("returned from wl_display_cancel_read()"); return false; } ``` This since the last log message from your output was from here, followed by nothing. I grepped my own log files, and couldn't see _this_ particular branch ever being taken. So, I'd like to see if we're for some weird reason I can't see, get stuck here. If I'm wrong, we'll see _"returned from wl_display_cancel_read()"_ in the log output. But if _"disconnected from Wayland"_ is still the last log message, then we have a lead.
Owner

I think I've managed to reproduce, so there shouldn't be any need for you to try the patch above.

I think I've managed to reproduce, so there shouldn't be any need for you to try the patch above.
Owner

I think I found the problem.

I would appreciate if you could test #652, as I had to "force" (i.e. patch foot) to trigger the issue, and thus cannot be 100% sure it actually fixes it for you.

I think I found the problem. I would appreciate if you could test https://codeberg.org/dnkl/foot/pulls/652, as I had to "force" (i.e. patch foot) to trigger the issue, and thus cannot be 100% sure it actually fixes it for you.
Poster

What's systemd's timeout (i.e. how long does it way before sending SIGKILL)? Can you bump it to something large (I don't use systemd, hence my lack of knowledge)?

systemd waits 90 seconds before sending SIGKILL.

I would appreciate if you could test #652, as I has to "force" (i.e. patch foot) to trigger the issue, and thus cannot be 100% sure it actually fixes it for you.

I tested #652 with 10 instances of foot running at once and ran sway exit on one of them. The issue seems to be fixed.

The output of some of the instances are:

 err: wayland.c:1152: failed to read events from the Wayland socket: Broken pipe
 err: wayland.c:1646: failed to roundtrip Wayland display: Broken pipe
 err: wayland.c:1646: failed to roundtrip Wayland display: Broken pipe
 err: wayland.c:1646: failed to roundtrip Wayland display: Broken pipe
 err: wayland.c:1611: failed to flush wayland socket: Broken pipe
info: main.c:579: goodbye

The output of the rest of the instances are:

warn: wayland.c:1165: disconnected from Wayland
 err: wayland.c:1646: failed to roundtrip Wayland display: Broken pipe
 err: wayland.c:1646: failed to roundtrip Wayland display: Broken pipe
 err: wayland.c:1646: failed to roundtrip Wayland display: Broken pipe
 err: wayland.c:1611: failed to flush wayland socket: Broken pipe
info: main.c:579: goodbye

I followed the instructions for a debug build from INSTALL.md

> What's systemd's timeout (i.e. how long does it way before sending SIGKILL)? Can you bump it to something large (I don't use systemd, hence my lack of knowledge)? systemd waits 90 seconds before sending SIGKILL. > I would appreciate if you could test #652, as I has to "force" (i.e. patch foot) to trigger the issue, and thus cannot be 100% sure it actually fixes it for you. I tested #652 with 10 instances of foot running at once and ran `sway exit` on one of them. The issue seems to be fixed. The output of some of the instances are: ``` err: wayland.c:1152: failed to read events from the Wayland socket: Broken pipe err: wayland.c:1646: failed to roundtrip Wayland display: Broken pipe err: wayland.c:1646: failed to roundtrip Wayland display: Broken pipe err: wayland.c:1646: failed to roundtrip Wayland display: Broken pipe err: wayland.c:1611: failed to flush wayland socket: Broken pipe info: main.c:579: goodbye ``` The output of the rest of the instances are: ``` warn: wayland.c:1165: disconnected from Wayland err: wayland.c:1646: failed to roundtrip Wayland display: Broken pipe err: wayland.c:1646: failed to roundtrip Wayland display: Broken pipe err: wayland.c:1646: failed to roundtrip Wayland display: Broken pipe err: wayland.c:1611: failed to flush wayland socket: Broken pipe info: main.c:579: goodbye ``` I followed the instructions for a debug build from INSTALL.md
Owner

Awesome! Thanks for testing!

Awesome! Thanks for testing!
Owner

Fixed in 7db8e06f55

Fixed in https://codeberg.org/dnkl/foot/commit/7db8e06f554c35a3b783441f0a3f72515cd414c2
dnkl closed this issue 3 months ago
Sign in to join this conversation.
No Milestone
No Assignees
2 Participants
Notifications
Due Date

No due date set.

Dependencies

This issue currently doesn't have any dependencies.

Loading…
There is no content yet.