Crashing when image-info is used #391
Reference: nsxiv/nsxiv#391
I'm using nsxiv 30-2 from the AUR, and for the past few months I've been having the issue of nsxiv crashing under normal use. I experienced the same issue with an older version I was using which I compiled myself, too. The error message looks like:
Where the number of requests and known processed is different each time. I thought this was a Wayland problem since I first noticed this happen after I switched to Wayland, but after I logged out and logged back into an X11 session, I got the same issue (except the X server was ":0").
I'd open multiple pictures via
nsxiv .
or
nsxiv -r .
, and be switching between images with n/p when nsxiv would crash with the above error. It doesn't happen on a specific image. I could have just a few images open, and it will crash after I scroll through just a few of them. Then I can reopen the images with the same command and either it won't crash or it'll crash elsewhere. I can also open thousands of images and hold n or p for a long time before the crash happens (which is how I got the above error specifically).
While trying to pinpoint the issue, I removed my image-info file, which is provided. I have yet to experience a crash with the file removed. Thinking it might have been a problem with the commands I was using to get the geometry and file sizes, I replaced those with plain text (shown in
image-info-secondary
), and experienced a crash not long afterwards.
I'm using KDE Plasma 5.26.4, Arch 6.0.10-arch2-1, and Wayland, but I've experienced this problem for a long time on both X11 and Wayland.
Can't reproduce. Does it happen on latest
master
branch?
I cloned the repo and ran
doas make install-all
, put image-info back in the exec directory, and after about a minute of going through images it crashed again.
I was able to produce a crash with the first
image-info
, but with a different error code. The output was
with the same error twice.
The crash only happened once when I issued
nsxiv -r .
in a large and nested directory with 63,000 images in it and held down "next image". When I re-do the process, the crash does not happen.
I can't say for sure, but I have a feeling there might be a race condition somewhere that only happens if the drive is not warm enough.
I'm on master
3804b50656
. I assume these crashes are related, if not the same; but I'm not sure I reproduced the original issue here, so I'm not removing the "can't reproduce" label. Feel free to remove it.
Just FYI - nsxiv passes the image w/h to
image-info
as the 2nd and 3rd arguments, so invoking an external tool isn't necessary.
That's really nice to know. I yanked this file from I don't remember where back when I was still using sxiv. I've changed it to
geometry="${2}x${3}"
Could either of you try out the following (quick and dirty) patch and report if it fixes the issue or not? The idea here is to reduce calls to
image-info
in case the user has something like
n
pressed down.
Which branch should I be working in? I am on an up-to-date master branch and can't make install-all it (even without patches).
Running
doas make install-all 2> err.txt
to get the errors, I have the following error text file.
Okay I have no idea what was going on. I removed that directory, git cloned the nsxiv repo from codeberg again, and it make-installed perfectly fine again.
But once I patched it in with the curl command you provided, it did crash after some images with:
The compiler error was because the
config.h
was outdated. But looks like the patch didn't fix the issue anyways. I'm out of ideas then, and I can't reproduce it on my system so it's difficult for me to debug this.
@KodyVB
I can attempt to take over and help resolve this.
Step 1:
First off, could you rebuild with minimal dependencies and see if you can reproduce this issue? I.e.
make clean; make CFLAGS=-g OPT_DEP_DEFAULT=0;
I'd be surprised if this solves your issue, but it eliminates some code paths, which may help us reason about the issue. Rebuild this way for the remainder of these steps.
Step 2:
In window.c::win_init, before anything else could you add a call to XInitThreads(). nsxiv is single threaded but I don't really trust imlib enough to make the same claim. If you can't reproduce the issue after adding this line, you can stop going through these steps.
Step 3:
In window.c::win_init, after we successfully open the X display (i.e. after the call to XOpenDisplay), would you be able to add in
XSynchronize(dpy, 1)
and reproduce under a debugger like gdb? If so, would you share the stack trace? We could probably get a patch ready if we knew what function in nsxiv directly caused this error.
With XSynchronize set, the crash should happen when we make the offending X call (instead of some arbitrary amount of time later) so it should show up in the stack trace.
I can't reproduce this issue either, so I'm just describing the steps I would do to troubleshoot this. Please let me know if you need clarification on anything.
Earlier this morning, I cloned the repo again to avoid having any kind of conflicts like I did previously, then ran the
make clean
and
CFLAGS
and
OPT_DEP_DEFAULT
options, but with those options the text at the bottom which was related to the crashing wouldn't appear.
So then I ran
doas make clean
and
doas make clean install
and for some reason, I can't reproduce the error anymore. It used to be that the error would pop up after a couple hundred pictures at most, but now I've gone through at least 16,000 pictures without issue.
The only variables that I can think of that changed were:
doas make clean install
instead of
doas make install
(I'd only done uninstall before, not clean)
hmmm i think i know what happened: the recent change to use posix_spawn() may have fixed some underlying issue with the old fork/dup/exec used for image-info. may be worth mentioning that i have a directory with around 400 png images that all average 4+ MB in size. i did experience nsxiv crashing while fast navigating that dir, and i noticed it "randomly" crashing on the larger images. i didn't run enough tests to find the cause of that, but i did notice the problem disappearing since adding the usage of posix_spawn()
@KodyVB Here's an easy way to verify if the
posix_spawn
change was what fixed it or not, roll back to the commit right before it:
And then try to see if you can reproduce the crash. After that, switch back to master and try again just in case - and to do that, simply change the first command to
git checkout master
and the last 2 commands remain the same.
I'd be curious if you updated x-server, xlib or anything else X11 related.
I checked out
76c2b81
and it crashed within ~700 images as usual, then checked out master again and it crashed after ~2,300 images. So maybe I just had really weird luck last night when I went through ~16,000 images without issues? Sorry if I got everyone's hopes up...
My fears would have been up if the posix_spawn changes actually fixed your issue. I could believe it making the problem rarer since it was supposed to provide a perf improvement (especially in cases like this).
@KodyVB
Could you continue with Step 2 of #391 (comment)? (no rush) Since master may make the race harder to hit, going back to
76c2b81
should be fine. If we can root cause the issue in any version, we should be able to tell if it is truly fixed in master.
Also don't think this was the source of your issue, but would recommend always running the
clean
target by itself. Pretty sure you'd get undesirable behavior depending on your system's default
MAKEFLAGS
(like having clean race with install if
-j
was set). You'd probably notice if this was a problem, but thought it was worth mentioning for future reference.
I checked out
76c2b81
, then added
XInitThreads();
to window.c underneath
void win_init(win_t *win)
(which automatically added
#include <X11/Xlib.h>
to the top thanks to my IDE), at line 110, ran
make clean
and
doas make clean install
, and had the same error after ~600 images.
Then I removed those two lines, and added
XSynchronize(e->dpy, 1);
to the line after the if statement with the only
XOpenDisplay
call I could find - making it line 125 in window.c. I hadn't used gdb before, so I looked it up and saw it should be compiled with a flag, and after searching gcc's manpage I saw
-ggdb
was appropriate, so I added that to the end of line 20 of the Makefile. I ran
make clean
,
make
, then
gdb ./nsxiv
followed by
run -r <directory>
. After ~5,000 images, these were the last few lines in gdb:
If I should have compiled nsxiv some other way, or did something else wrong, please let me know and I'll try again.
Unfortunate. And I guess I should apologize to imlib for suspecting them of doing something stupid to cause this problem.
You ran the command well enough. I had expected gdb to drop you into a shell where you could run commands like
bt
. But I forgot that gdb would only do that by default for abnormal termination, which didn't happen for this error. My bad.
Haven't given up yet though. Later tonight, I'll upload a build/patch with some extra debugging info. May also need to look at X's source to see what
XIO: fatal IO error 10 (No child processes)
actually means.
Also wanted to clarify that you only experience this problem when using an
image-info
and if that file isn't present you stop being able to reproduce this issue.
So here's a branch to try out: https://codeberg.org/TAAPArthur/nsxiv/src/branch/debugging
On Arch (and other glibc based distros) it should print a stack trace when that XIO error happens. The backtrace will replace the error string.
When you repro the issue, you should get a backtrace like:
For the lines relating to nsxiv, we can run
addr2line -e ./nsxiv 0xb507 0xbcaf
which would give us the lines right after the one that caused the crash. Please share the backtrace and the result of the addr2line command.
Also took another guess and assumed the problem could be caused by our handling of SIGCHLD. Don't see how, but the version in the branch certainly isn't a problem and should have the desired behavior on Linux. So if you suddenly stop being able to repro the issue, I guess we have a fix.
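For anyone wondering how a glibc program can print its own stack trace as described above, here is a minimal sketch. It is not the code from the debugging branch: the helper name dump_backtrace is hypothetical, and the only library calls used are glibc's backtrace() and backtrace_symbols_fd() from execinfo.h.

```c
#include <assert.h>
#include <execinfo.h>

/* Hypothetical helper (NOT taken from the nsxiv debugging branch):
 * writes the current call stack to the file descriptor fd and returns
 * the number of frames that were captured. glibc-specific. */
int dump_backtrace(int fd)
{
    void *frames[64];
    int n;

    assert(fd >= 0);
    n = backtrace(frames, 64);

    /* backtrace_symbols_fd() writes straight to fd without calling
     * malloc(), which is the safer choice inside an error handler. */
    backtrace_symbols_fd(frames, n, fd);
    return n;
}
```

One plausible use, stated as an assumption rather than what the branch does, would be to call dump_backtrace(2) from a handler registered with Xlib's XSetIOErrorHandler, so the trace is printed at the point where the XIO error is reported. The raw addresses it prints are the kind of input addr2line expects.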
I'm sorry for the late reply - it's been a really hectic week.
And I'm also sorry to say: I might not be able to help debug anymore. After nearly a decade of using the same computer, I upgraded and can't seem to reproduce the error on my new computer - I just went through 11,000 images on the
76c2b81b
branch without error where my old computer would crash after a few hundred consistently.
When I get a day off from work, I could probably throw my old hard drives back into my old computer (which is now acting as a NAS) and try debugging again, but I don't know how many others are even experiencing this issue since only @XPhyro has reproduced it other than me on this thread, so far. But there could just be people seeing the issue is open and not commenting, though.
I never bothered running nsxiv under thread sanitizer because it's a single threaded program, but TSan seems to be useful even outside of threads. In this case, it caught the sig-handler spoiling the errno: #411
Might be a bit of a stretch, but it's possible that the sighandler's spoiled errno was causing xlib to incorrectly assume some error occurred and thus causing the crash.
EDIT: See the comment below (#391 (comment)) for reproduction steps if anyone else wants to try it out.
Now that I'm looking at it again,
No child processes
and
No such process
seem awfully like a spoiled errno by
waitpid
. So I'm quite hopeful that #411 might just be the correct fix for this.
Just realized that if the sighandler was really the cause of this then it should be easy to reproduce.
var=$(pidof nsxiv); while :; do kill -s SIGCHLD $var; done
And boom! I can consistently reproduce it on
master
now within seconds. Putting the
pid
into a variable was actually important because doing
$(pidof nsxiv)
inside the loop actually makes it really hard to reproduce the issue. I presume the extra process invocation meant fewer SIGCHLDs were sent, so putting it into a variable avoids that overhead and generates more signals.
Trying to do the same on #411 works fine without any crash. So yeah, that confirms that the spoiled
errno
was definitely the cause of the crash.
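To illustrate the failure mode discussed above, here is a small standalone sketch. It is not nsxiv's actual handler nor the change in #411; the function names are hypothetical. It assumes a process with no child processes, so that waitpid() fails with ECHILD ("No child processes"), and compares a handler that lets that errno leak into the interrupted code with one that saves and restores it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Naive handler: reaps children, but when none are left waitpid()
 * fails and leaves errno set to ECHILD for the interrupted code. */
void sigchld_naive(int sig)
{
    (void)sig;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        ;
}

/* Careful handler: same reaping loop, but errno is saved on entry and
 * restored on exit, so the interrupted code never sees a change. */
void sigchld_careful(int sig)
{
    int saved_errno = errno;

    (void)sig;
    while (waitpid(-1, NULL, WNOHANG) > 0)
        ;
    errno = saved_errno;
}

/* Hypothetical test helper: installs handler for SIGCHLD, delivers one
 * SIGCHLD to this process, and returns 1 if errno was left untouched,
 * 0 if the handler spoiled it, or -1 if the handler can't be installed. */
int errno_survives(void (*handler)(int))
{
    struct sigaction sa, old;
    int survived;

    assert(handler != NULL);
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    if (sigaction(SIGCHLD, &sa, &old) != 0)
        return -1;

    errno = 0;
    raise(SIGCHLD); /* the handler has run by the time raise() returns */
    survived = (errno == 0);

    sigaction(SIGCHLD, &old, NULL); /* put the previous handler back */
    return survived;
}
```

In a process without children, errno_survives(sigchld_naive) should return 0 and errno_survives(sigchld_careful) should return 1. Any library code that checks errno after an interrupted call, as xlib is suspected of doing here, can be misled by the naive variant.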