Rarely the window does not close when shell exits #534

Closed
opened 5 months ago by cglogic · 6 comments
cglogic commented 5 months ago

The control flow does not return from int r = epoll_pwait(fdm->epoll_fd, events, tll_length(fdm->fds), -1, &fdm->sigmask); there:
b75bd6507a/fdm.c (L424)

It happens rarely in 1 of 30..50 cases. When system is under high load the chances are bigger. Not sure this is foot's issue or epoll-shim.

The control flow does not return from `int r = epoll_pwait(fdm->epoll_fd, events, tll_length(fdm->fds), -1, &fdm->sigmask);` there: https://codeberg.org/dnkl/foot/src/commit/b75bd6507a970520784791a6ff6beb1cf771077b/fdm.c#L424 It happens rarely in 1 of 30..50 cases. When system is under high load the chances are bigger. Not sure this is foot's issue or epoll-shim.
Owner

I have a really hard time seeing how this could be foot's fault. I'm saying this because:

  • epoll_pwait() atomically (when implemented correctly) switches the process' signal mask, and then waits/polls.
  • SIGCHLD is blocked very early, long before the shell is started.

My guess would be the problem is that epoll-shims implementation of epoll_pwait() isn't atomic.

That said, can you try the following patch:

diff --git a/fdm.c b/fdm.c
index d0466a6..e8a2fd4 100644
--- a/fdm.c
+++ b/fdm.c
@@ -448,6 +448,20 @@ fdm_poll(struct fdm *fdm)
         return false;
     }
 
+    for (int i = 0; i < SIGRTMAX; i++) {
+        if (received_signals[i]) {
+
+            LOG_WARN("received signal %d, but didn't get EINTR", i);
+
+            received_signals[i] = false;
+            struct sig_handler *handler = &fdm->signal_handlers[i];
+
+            xassert(handler->callback != NULL);
+            if (!handler->callback(fdm, i, handler->callback_data))
+                return false;
+        }
+    }
+
     bool ret = true;
 
     fdm->is_polling = true;

You may have to trigger something else in the FDM (should be enough to e.g. resize the window). Let me know if it works, and if you see the warning being logged.

I have a _really_ hard time seeing how this could be foot's fault. I'm saying this because: * `epoll_pwait()` **atomically** (when implemented correctly) switches the process' signal mask, and then waits/polls. * `SIGCHLD` is blocked very early, long before the shell is started. My guess would be the problem is that epoll-shims implementation of `epoll_pwait()` isn't atomic. That said, can you try the following patch: ```diff diff --git a/fdm.c b/fdm.c index d0466a6..e8a2fd4 100644 --- a/fdm.c +++ b/fdm.c @@ -448,6 +448,20 @@ fdm_poll(struct fdm *fdm) return false; } + for (int i = 0; i < SIGRTMAX; i++) { + if (received_signals[i]) { + + LOG_WARN("received signal %d, but didn't get EINTR", i); + + received_signals[i] = false; + struct sig_handler *handler = &fdm->signal_handlers[i]; + + xassert(handler->callback != NULL); + if (!handler->callback(fdm, i, handler->callback_data)) + return false; + } + } + bool ret = true; fdm->is_polling = true; ``` You may have to trigger something else in the FDM (should be enough to e.g. resize the window). Let me know if it works, and if you see the warning being logged.
Poster

I added debug logs around epoll_pwait.

Good case:

err: fdm.c:424: fds length: 7
err: fdm.c:427: r: 1, errno: 35
err: fdm.c:424: fds length: 7
err: fdm.c:427: r: 1, errno: 35
err: fdm.c:424: fds length: 7
err: fdm.c:427: r: 2, errno: 35
err: fdm.c:424: fds length: 6
err: fdm.c:427: r: -1, errno: 4
err: fdm.c:424: fds length: 3
err: fdm.c:427: r: 2, errno: 10

Bad case:

err: fdm.c:424: fds length: 7
err: fdm.c:427: r: 1, errno: 35
err: fdm.c:424: fds length: 7
err: fdm.c:427: r: 1, errno: 35
err: fdm.c:424: fds length: 7
err: fdm.c:427: r: 1, errno: 35
err: fdm.c:424: fds length: 6
err: fdm.c:427: r: 1, errno: 35
err: fdm.c:424: fds length: 6
err: fdm.c:427: r: 1, errno: 35
err: fdm.c:424: fds length: 6
err: fdm.c:427: r: 1, errno: 35
err: fdm.c:424: fds length: 6
err: fdm.c:427: r: 1, errno: 35
err: fdm.c:424: fds length: 6
err: fdm.c:427: r: 1, errno: 35
err: fdm.c:424: fds length: 6
err: fdm.c:427: r: 1, errno: 35
err: fdm.c:424: fds length: 6
err: fdm.c:427: r: 1, errno: 35
err: fdm.c:424: fds length: 6
err: fdm.c:427: r: 1, errno: 35
err: fdm.c:424: fds length: 6
err: fdm.c:427: r: 1, errno: 35
err: fdm.c:424: fds length: 6
err: fdm.c:427: r: 1, errno: 35
err: fdm.c:424: fds length: 6
err: fdm.c:427: r: 1, errno: 35
err: fdm.c:424: fds length: 6

I added debug logs around `epoll_pwait`. Good case: > err: fdm.c:424: fds length: 7 > err: fdm.c:427: r: 1, errno: 35 > err: fdm.c:424: fds length: 7 > err: fdm.c:427: r: 1, errno: 35 > err: fdm.c:424: fds length: 7 > err: fdm.c:427: r: 2, errno: 35 > err: fdm.c:424: fds length: 6 > err: fdm.c:427: r: -1, errno: 4 > err: fdm.c:424: fds length: 3 > err: fdm.c:427: r: 2, errno: 10 Bad case: > err: fdm.c:424: fds length: 7 > err: fdm.c:427: r: 1, errno: 35 > err: fdm.c:424: fds length: 7 > err: fdm.c:427: r: 1, errno: 35 > err: fdm.c:424: fds length: 7 > err: fdm.c:427: r: 1, errno: 35 > err: fdm.c:424: fds length: 6 > err: fdm.c:427: r: 1, errno: 35 > err: fdm.c:424: fds length: 6 > err: fdm.c:427: r: 1, errno: 35 > err: fdm.c:424: fds length: 6 > err: fdm.c:427: r: 1, errno: 35 > err: fdm.c:424: fds length: 6 > err: fdm.c:427: r: 1, errno: 35 > err: fdm.c:424: fds length: 6 > err: fdm.c:427: r: 1, errno: 35 > err: fdm.c:424: fds length: 6 > err: fdm.c:427: r: 1, errno: 35 > err: fdm.c:424: fds length: 6 > err: fdm.c:427: r: 1, errno: 35 > err: fdm.c:424: fds length: 6 > err: fdm.c:427: r: 1, errno: 35 > err: fdm.c:424: fds length: 6 > err: fdm.c:427: r: 1, errno: 35 > err: fdm.c:424: fds length: 6 > err: fdm.c:427: r: 1, errno: 35 > err: fdm.c:424: fds length: 6 > err: fdm.c:427: r: 1, errno: 35 > err: fdm.c:424: fds length: 6
Owner

There is this TODO in foot. So would be nice to see if the patch above does help

        if (errno == EINTR) {
            /* TODO: is it possible to receive a signal without
             * getting EINTR here? */

There _is_ this TODO in foot. So would be nice to see if the patch above does help ```c if (errno == EINTR) { /* TODO: is it possible to receive a signal without * getting EINTR here? */ ```
Poster

There is this TODO in foot. So would be nice to see if the patch above does help

The patch helps. In some cases there is a log entry, but the window always closes:
warn: fdm.c:454: received signal 20, but didn't get EINTR.

> There _is_ this TODO in foot. So would be nice to see if the patch above does help The patch helps. In some cases there is a log entry, but the window always closes: `warn: fdm.c:454: received signal 20, but didn't get EINTR`.
Owner

Regardless of whether your issue is being caused by an epoll-shim bug or not, I think it makes sense to be defensive here. I.e don't assume signals may have been delivered and epoll_pwait() returning ready FDs. That is, don't assume it's enough to check for signals when we get EINTR.

I'll whip up a variant of the patch applied here, that doesn't have as large performance impact as this one does, and that doesn't duplicate the signal checking code.

Thanks for testing! And stayed tuned ;)

Regardless of whether your issue is being caused by an epoll-shim bug or not, I think it makes sense to be defensive here. I.e don't assume signals may have been delivered **and** `epoll_pwait()` returning ready FDs. That is, don't assume it's enough to check for signals when we get `EINTR`. I'll whip up a variant of the patch applied here, that doesn't have as large performance impact as this one does, and that doesn't duplicate the signal checking code. Thanks for testing! And stayed tuned ;)
Poster

Thanks again!

Thanks again!
dnkl closed this issue 5 months ago
Sign in to join this conversation.
No Milestone
No Assignees
2 Participants
Notifications
Due Date

No due date set.

Dependencies

This issue currently doesn't have any dependencies.

Loading…
There is no content yet.