Extremely bad performance since upgrade - worker spikes to 100% frequently #9681

Closed
opened 3 months ago by wardhale · 13 comments

The site works fine for a few minutes, then the Calckey worker spikes to 100% CPU for a while. My instance has 142 users and averages only about 6 online at a time. I am running 1 CPU, 2GB of RAM, and a 4GB swap file, and was about to upgrade the server, but someone with a larger server said they were having similar issues.

This behavior started after the upgrade. I have also noticed that my account on i.calckey.cloud is experiencing similar issues: timeouts and errors with an empty timeline. Then I'll refresh after a few minutes and it will work.

Before I go digging, does anyone know what is causing the CPU spikes (as this seems to be happening on other Calckey instances)? Any fixes?! I have attached my processes while it's running normally vs. during the spikes. The spikes seem to happen regardless of activity.

Edit: also noticed that sometimes I get this JSON error... could just be due to timeout, but I added another screenshot.
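
For anyone who wants timestamps for the spikes rather than screenshots, here is a minimal sketch of a sampler for a Linux host. It is not part of Calckey: the function names, the 5-second interval, and the 90% threshold are assumptions made for illustration. It reads `utime` and `stime` from `/proc/<pid>/stat` and prints a line whenever the process exceeds the threshold.

```python
import os
import time


def parse_cpu_ticks(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from a /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may itself
    # contain spaces, so split on the last ")" rather than on whitespace.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so utime (field 14) is rest[11]
    # and stime (field 15) is rest[12].
    return int(rest[11]) + int(rest[12])


def cpu_percent(ticks_before: int, ticks_after: int,
                elapsed_seconds: float, ticks_per_second: int) -> float:
    """CPU usage over the interval, in percent relative to one core."""
    cpu_seconds = (ticks_after - ticks_before) / ticks_per_second
    return 100.0 * cpu_seconds / elapsed_seconds


def read_ticks(pid: int) -> int:
    """Read the current CPU tick count of a running process."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_cpu_ticks(f.read())


def watch(pid: int, interval: float = 5.0, threshold: float = 90.0) -> None:
    """Print a timestamped line whenever usage reaches the threshold.

    The interval and threshold defaults are arbitrary illustration values.
    """
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    previous = read_ticks(pid)
    while True:
        time.sleep(interval)
        current = read_ticks(pid)
        usage = cpu_percent(previous, current, interval, ticks_per_second)
        if usage >= threshold:
            stamp = time.strftime("%Y-%m-%dT%H:%M:%S%z")
            print(f"{stamp} pid={pid} cpu={usage:.0f}%")
        previous = current
```

Calling `watch(pid)` with the PID of the worker process and redirecting the output to a file gives a list of spike times that can later be compared against the server logs.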

panos added the 🐛Bug and 🐢Performance labels 3 months ago
panos commented 3 months ago
Owner

We've also noticed this and we're investigating. If you find something, please let us know! Which was the last version before the upgrade that started causing this?

Poster

Cheers @panos. It was working perfectly on 13.0.5. Then we upgraded to 13.1.2 and the issues started soon after. I upgraded to the release candidate for 13.1.3 today and the problems are still there. I wonder if it's possible to roll back to 13.0.5. I'm not a pro at this, but I'll let you know if I figure anything out.

panos commented 3 months ago
Owner

Thanks! In the meantime, you may want to try restarting the calckey service once in a while. This seems to help on my server, at least temporarily. But yeah, we need to find what's causing this and fix it.

Collaborator

Thanks for the info. Now that I know that it's a worker issue, I should be able to fix it really soon. Hopefully "today" soon.

panos added the 🔥high priority label 3 months ago
panos added this to the Future improvements project 3 months ago
Poster

Update: I disabled NSFW detection, disabled the global timeline, and restarted calckey... It's like 70% better now. I still get weird spikes to 100% CPU, but they only last 20-30 seconds rather than minutes, and they seem less frequent.

jprjr commented 2 months ago

Hi there, I just wanted to volunteer and see if there's any way I can help pin this down.

I'm experiencing the same issue, and I'm not sure what triggers the worker to start using 100% CPU. If there's any way to run Calckey in a debug mode, or there are logs that could be of use, I'm happy to help however I can.

I'm running on Linux without Docker (basically just following the usual instructions from the README), so I can easily modify my systemd unit file to do things like add additional flags.

panos commented 2 months ago
Owner

> Hi there, I just wanted to volunteer and see if there's any way I can help pin this down.
>
> I'm experiencing the same issue, and I'm not sure what triggers the worker to start using 100% CPU. If there's any way to run Calckey in a debug mode, or there are logs that could be of use, I'm happy to help however I can.
>
> I'm running on Linux without Docker (basically just following the usual instructions from the README), so I can easily modify my systemd unit file to do things like add additional flags.

There have been some improvements on this, though we still get some spikes, on chat for example. So make sure you're on a recent version: if not dev (which isn't recommended for production), then perhaps the beta we recently released.
That said, help is always welcome! If possible, please join our Matrix support room at #calckey:matrix.fedibird.com as it's more convenient to use chat for direct communication (or give us a shout if you're already in it!).

Collaborator

> Hi there, I just wanted to volunteer and see if there's any way I can help pin this down.
>
> I'm experiencing the same issue, and I'm not sure what triggers the worker to start using 100% CPU. If there's any way to run Calckey in a debug mode, or there are logs that could be of use, I'm happy to help however I can.
>
> I'm running on Linux without Docker (basically just following the usual instructions from the README), so I can easily modify my systemd unit file to do things like add additional flags.

Workers are normally used to handle the queues. As far as I know, if you run calckey with the node_enviroment=develop environment variable, you should get an extensive log of what is executed in the worker, and maybe you can track down what it's doing while the spikes happen.
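
Once verbose logs are available, one way to narrow things down is to keep only the log lines written close to a known spike. The sketch below is a hedged illustration, not a Calckey tool: it assumes each log line starts with an ISO 8601 timestamp (for example, the form `journalctl -o short-iso` produces on a systemd host), and the function names and the 30-second window are made up for this example.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout at the start of every log line.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def parse_timestamp(log_line: str) -> datetime:
    """Parse the leading timestamp of a line.

    Hypothetical example line:
    '2023-03-01T04:00:01+0000 host calckey[123]: message'
    """
    return datetime.strptime(log_line.split(" ", 1)[0], TIMESTAMP_FORMAT)


def lines_near_spikes(log_lines, spike_times, window_seconds=30):
    """Return the log lines stamped within window_seconds of any spike.

    spike_times must be timezone-aware datetimes, matching the logs.
    """
    window = timedelta(seconds=window_seconds)
    selected = []
    for line in log_lines:
        try:
            stamp = parse_timestamp(line)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if any(abs(stamp - spike) <= window for spike in spike_times):
            selected.append(line)
    return selected
```

Feeding it the worker log and a list of spike times should leave a much shorter list of candidate jobs to inspect.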

yawhn commented 2 months ago

Just a temp hotfix tip until this is properly fixed:

On our server we have set up a cron job (if you use pm2 there is an option for this) that restarts the service once per day, during the early morning hours. We have also increased the workers in the config settings to 4. These changes have reduced the performance issues a lot, but we definitely need to look into this for a proper solution.

For sure there are serious performance issues when opening the chat, as Panos mentioned above.
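
A minimal sketch of the scheduled-restart workaround, for a systemd install. The unit name `calckey`, the script path, and the 04:00 schedule in the comment are assumptions, so adjust them to your own setup. The helper only wraps `systemctl restart`.

```python
import subprocess

# Hypothetical crontab entry that runs a script calling restart_service()
# once per day at 04:00:
#   0 4 * * * /usr/bin/python3 /opt/calckey/restart_calckey.py


def restart_command(service: str = "calckey") -> list:
    """Build the systemctl invocation; the unit name is an assumption."""
    return ["systemctl", "restart", service]


def restart_service(service: str = "calckey", runner=subprocess.run) -> int:
    """Restart the unit and return systemctl's exit code.

    The runner parameter exists so the call can be replaced in a dry run.
    """
    result = runner(restart_command(service), check=False)
    return result.returncode
```

A script run from cron would simply call `restart_service()` and exit with the returned code. This only hides the symptom, so it is a stopgap until the worker issue itself is fixed.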


I've also noticed performance issues recently after upgrading to v13.2.0-beta.31. I haven't really dug into it much, other than to notice that the spikes seem to occur while updating charts (`__chart_hashtag` and others). The queries take a while to resolve, but my Postgres server is on another machine, so in theory that shouldn't cause the CPU usage to increase so dramatically.

Collaborator

Will take a look, thx!

panos commented 2 months ago
Owner

BTW I think that Kaity's recent fix wasn't in dev yet when beta.31 was released, so it might be worth trying dev (or we should do a beta4 soon).

panos commented 4 weeks ago
Owner

Performance has improved significantly, so I'm closing this one.

panos closed this issue 4 weeks ago
Reference: calckey/calckey#9681