What happened? (Please give us a brief description of what happened.) Potential memory leak? Every four hours, the server's memory usage rises sharply, causing Sharkey to crash.
The issue occurs at JST 3:08, 7:08, 11:08, 15:08, 19:08, and 23:08. At these times, the memory usage of the Misskey master process spikes, sometimes by several gigabytes. The load increase causes a crash within 1–2 minutes, prompting a restart. The issue might be caused by changes specific to the latest Sharkey version, as I haven't heard similar reports from other Misskey server admins. I've observed nearly identical symptoms on calckey.7ka.org. Interestingly, when I checked shonk.social at JST 11:08, it did not experience a crash. I'm not sure if this is due to a different server time zone or simply a different occurrence time.
I have already tried increasing the max-old-space-size allocation and adjusting the memory allocator, among other modifications, but none of these have resolved the issue. Sharkey continues to crash consistently at the same times.
What did you expect to happen? (Please give us a brief description of what you expected to happen.) Normal operation without memory leaks or crashes.
Version (What version of Sharkey is your instance running? You can find this by clicking your instance's logo at the top left and then clicking instance information.) 2024.10.0-dev-stelpolva (this is a forked version of 2024.9.1, but the same problem occurs on another server running the original 2024.9.1 version).
Instance (What instance of Sharkey are you using?) minazukey.uk (The same issue occurs on a friend's server, calckey.7ka.org.)
What type of issue is this? (If this happens on your device and has to do with the user interface, it's client-side. If this happens with either the API or the backend, or you got a server-side error in the client, it's server-side.) Server-side
How do you deploy Sharkey on your server? (Server-side issues only) Manually
What operating system are you using? (Server-side issues only) Ubuntu 22.04.5
Relevant log output (Please copy and paste any relevant log output. You can find your log by inspecting the page, and going to the "console" tab. This will be automatically formatted into code, so no need for backticks.)
The attached image shows the htop screen during the issue. Below is the error log; however, it doesn’t reveal the exact process causing the issue:
Initially, I reported that shonk.social didn't seem to be crashing. However, upon reviewing its server metrics at 7:08 PM JST, it appears there is significant load on the CPU and other resources, even though it hasn't reached the point of crashing. Since this happens during the same time window in which my server experiences issues, I suspect there may be a connection. (I also observed similar server metrics on shonk.social around 3:09 PM, but I only saw it once and wasn't sure whether it was related, so I didn't report it. I don't have other Sharkey accounts to check whether this happens in other environments.) Here is a screenshot of shonk.social's server metrics around 7:09 PM JST.
If this turns out to be an entirely unrelated issue and a mistaken opinion, I apologize.
We have the same problem. I was just about to open an issue myself. Our monitoring shows that the instance is crashing every four hours. When I checked the logs, it was the JavaScript heap out of memory error mentioned above.
Interestingly, the crashes stopped for 27 days after I updated to 2024.8.2, but they started again about three weeks ago. Updating to 2024.9.1 didn't change that. I even disabled Meilisearch because I thought it was "stealing" Sharkey's memory, but that didn't help.
@Daniel @magi Can you send me screenshots of your inbox queue from the Bull dashboard? I have spent the past couple of hours looking into this on my own instance. My finding is that I have "poisoned" job queues: certain jobs keep getting rescheduled instead of removed.
ERR 6 [remote ap] error occurred while fetching following/followers collection {
stack: Error: Validate content type of AP response: Content type is not application/activity+json or application/ld+json
    at validateContentTypeSetAsActivityPub (file:///sharkey/packages/backend/built/core/activitypub/misc/validator.js:12:11)
    at ApRequestService.signedGet (file:///sharkey/packages/backend/built/core/activitypub/ApRequestService.js:215:9)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Resolver.resolve (file:///sharkey/packages/backend/built/core/activitypub/ApResolverService.js:109:36)
    at async Resolver.resolveCollection (file:///sharkey/packages/backend/built/core/activitypub/ApResolverService.js:73:56)
    at async ApPersonService.isPublicCollection (file:///sharkey/packages/backend/built/core/activitypub/models/ApPersonService.js:654:30)
    at async Promise.all (index 0)
    at async ApPersonService.updatePerson (file:///sharkey/packages/backend/built/core/activitypub/models/ApPersonService.js:434:60)
    at async ApInboxService.update (file:///sharkey/packages/backend/built/core/activitypub/ApInboxService.js:652:13)
    at async ApInboxService.performOneActivity (file:///sharkey/packages/backend/built/core/activitypub/ApInboxService.js:157:20)
    at async ApInboxService.performActivity (file:///sharkey/packages/backend/built/core/activitypub/ApInboxService.js:138:22)
    at async InboxProcessorService.process (file:///sharkey/packages/backend/built/queue/processors/InboxProcessorService.js:193:28)
    at async Worker.processJob (/sharkey/node_modules/.pnpm/bullmq@5.13.2/node_modules/bullmq/dist/cjs/classes/worker.js:455:28)
    at async Worker.retryIfFailed (/sharkey/node_modules/.pnpm/bullmq@5.13.2/node_modules/bullmq/dist/cjs/classes/worker.js:640:24)
}
When one of the jobs "failing" in this manner hits a worker, I notice a spike in memory and CPU usage (depending on the job). I found several types of "poisonous" jobs that, in the right circumstances, would stall workers or cause increased memory usage that may exceed what you have allocated. I'm going to open issues for some of them, but I'm curious what y'all have noticed.
One thing you can do to confirm this is to just promote the queue and see if the instance dies. I was able to make my instance spike in memory usage, which confirmed that the jobs were in fact having an impact on functionality.
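For reference, promoting doesn't have to go through the Bull dashboard UI; it can also be scripted. Here is a minimal sketch with BullMQ, assuming Redis is reachable locally on the default port and that the queue is registered under the name "inbox" (adjust the connection details, queue name, and any queue prefix to match your default.yml):

```ts
import { Queue } from "bullmq";

async function promoteDelayed(queueName: string): Promise<void> {
  // Connection and queue name are assumptions; match them to the redis
  // section of your default.yml (and the queue prefix, if you set one).
  const queue = new Queue(queueName, {
    connection: { host: "127.0.0.1", port: 6379 },
  });

  const delayed = await queue.getDelayed();
  console.log(`${queueName}: promoting ${delayed.length} delayed job(s)`);

  for (const job of delayed) {
    await job.promote(); // move the job from "delayed" to "waiting" immediately
  }

  await queue.close();
}

promoteDelayed("inbox").catch(console.error);
```

This should be equivalent to promoting the jobs from the dashboard; it just makes it easy to move every delayed job at once while watching memory in htop.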
Promoting the inbox and deliver queues did nothing for me. I had pretty much already ruled out the deliver and inbox queues some months ago, even before the release of 2024.8.1, when I cleared all delayed jobs out of despair. Also, there weren't any jobs that were delayed by exactly four hours.
Like Daniel, I've been regularly promoting delayed queues from the Bull dashboard and deleting any jobs that can't be processed. However, I haven't noticed any crashes occurring as a result of these actions. The issue seems to happen even when the Bull dashboard queue appears empty. To be sure, I'll try temporarily suspending federation with servers that frequently experience delivery delays and monitor for any changes. If it is indeed a queue issue, addressing it might be challenging given the large number of federated servers my instance connects to.
No need to; even after pruning the "poisonous" job queues, I have found that my instance (https://transfem.social/) is still having issues with the GC.
<--- Last few GCs --->
[3712733:0x7f9bdd322a00] 28821043 ms: Scavenge 4053.7 (4130.3) -> 4051.5 (4139.3) MB, 12.16 / 0.00 ms (average mu = 0.189, current mu = 0.094) allocation failure;
[3712733:0x7f9bdd322a00] 28825101 ms: Mark-Compact 4061.3 (4142.0) -> 4056.0 (4144.8) MB, 4040.42 / 0.00 ms (average mu = 0.125, current mu = 0.029) allocation failure; scavenge might not succeed
<--- JS stacktrace --->
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
----- Native stack trace -----
ERR * [core cluster] [192] died :(
So while, in my case, pruning some of the jobs that kept failing and would never succeed did help somewhat, it is not the solution. I have also noticed that the instance has multiple GC heap allocation errors. The only common pattern I see is that the instance is usually downloading a bunch of media when this occurs.
INFO 199 [core nest] InstanceLoader: ServerModule dependencies initialized
INFO 199 [core nest] InstanceLoader: CoreModule dependencies initialized
INFO 199 [core nest] InstanceLoader: EndpointsModule dependencies initialized
INFO 193 [url-preview] Getting preview of https://youtu.be/HEoZZInT38g@en-US ...
INFO 192 [url-preview] Getting preview of https://freedomnews.org.uk/2024/04/29/georgia-mass-protests-against-pro-russian-government-and-foreign-agents-law/@en-US ...
INFO 194 [url-preview] Getting preview of https://masto.pt/tags/Migra%C3%A7%C3%B5es@en-US ...
DONE 194 [url-preview] Got preview of https://masto.pt/tags/Migra%C3%A7%C3%B5es: Masto.PT
INFO 195 [url-preview] Getting preview of https://masto.pt/tags/Patrim%C3%B3nio@en-US ...
DONE 192 [url-preview] Got preview of https://freedomnews.org.uk/2024/04/29/georgia-mass-protests-against-pro-russian-government-and-foreign-agents-law/: Georgia: Mass protests against pro-Russian government and "foreign agents" law
INFO 195 [url-preview] Getting preview of https://my.heinzhistorycenter.org/orders/558/tickets?eventId=651722b38f4b195c3950c699&cdEventIds=651722b38f4b195c3950c699&date=2023-10-18T19:30:00-04:00@en-US ...
DONE 193 [url-preview] Got preview of https://youtu.be/HEoZZInT38g: Fancy Women Bike Ride
DONE 195 [url-preview] Got preview of https://masto.pt/tags/Patrim%C3%B3nio: Masto.PT
DONE 195 [url-preview] Got preview of https://my.heinzhistorycenter.org/orders/558/tickets?eventId=651722b38f4b195c3950c699&cdEventIds=651722b38f4b195c3950c699&date=2023-10-18T19:30:00-04:00: my.heinzhistorycenter.org
INFO 194 [download] Downloading https://cdn.transfem.social/files/5912772b-7748-43e7-a146-bd33fb477f22.png to /tmp/tmp-3715825-2J7GjdXgink8 ...
INFO 193 [download] Downloading https://cdn.transfem.social/files/67f743c9-bdf5-4641-b260-e43e45c76a8f.png to /tmp/tmp-3714059-oxYTwS6ckVjI ...
DONE 193 [download] Download finished: https://cdn.transfem.social/files/67f743c9-bdf5-4641-b260-e43e45c76a8f.png
WARNING: CPU supports 0x6000000000004000, software requires 0x4000000000005000
WARNING: CPU supports 0x6000000000004000, software requires 0x4000000000005000
DONE 194 [download] Download finished: https://cdn.transfem.social/files/5912772b-7748-43e7-a146-bd33fb477f22.png
WARNING: CPU supports 0x6000000000004000, software requires 0x4000000000005000
WARNING: CPU supports 0x6000000000004000, software requires 0x4000000000005000
INFO 198 [url-preview] Returning cache preview of https://github.com/mastodon/mastodon/issues?q=is%3Aissue%20state%3Aopen%20sort%3Areactions-%2B1-desc@en-US
INFO 195 [download] Downloading https://cdn.transfem.social/files/6bbebb56-2111-46c4-9747-29847efa800a.webp to /tmp/tmp-562043-2EtiI8fpyehM ...
INFO 192 [download] Downloading https://mastodon.vierkantor.com/system/custom_emojis/images/000/096/272/original/fd382db8ce593687.png to /tmp/tmp-3712733-ufiwva1hfD5w ...
INFO 192 [download] Downloading https://cdn.transfem.social/files/6bbebb56-2111-46c4-9747-29847efa800a.webp to /tmp/tmp-3712733-E7ERc0480v5E ...
DONE 192 [download] Download finished: https://mastodon.vierkantor.com/system/custom_emojis/images/000/096/272/original/fd382db8ce593687.png
WARNING: CPU supports 0x6000000000004000, software requires 0x4000000000005000
WARNING: CPU supports 0x6000000000004000, software requires 0x4000000000005000
DONE 195 [download] Download finished: https://cdn.transfem.social/files/6bbebb56-2111-46c4-9747-29847efa800a.webp
DONE 192 [download] Download finished: https://cdn.transfem.social/files/6bbebb56-2111-46c4-9747-29847efa800a.webp
INFO 192 [download] Downloading https://cdn.transfem.social/files/68d658b0-b264-4b5f-a4d2-04f1fde73dc4.webp to /tmp/tmp-3712733-qfXU7e8dmT2i ...
INFO 195 [download] Downloading https://cdn.transfem.social/files/1e5a61e4-926f-4beb-bdf3-55e2794d4a4d.gif to /tmp/tmp-562043-60Nn25l7KTZW ...
INFO 193 [download] Downloading https://cdn.transfem.social/files/78e07341-9006-4b69-a95f-b00fe0aa247c.webp to /tmp/tmp-3714059-afQfs1T7CV6J ...
DONE 192 [download] Download finished: https://cdn.transfem.social/files/68d658b0-b264-4b5f-a4d2-04f1fde73dc4.webp
DONE 195 [download] Download finished: https://cdn.transfem.social/files/1e5a61e4-926f-4beb-bdf3-55e2794d4a4d.gif
DONE 193 [download] Download finished: https://cdn.transfem.social/files/78e07341-9006-4b69-a95f-b00fe0aa247c.webp
<--- Last few GCs --->
[564408:0x7f3accd222c0] 14374664 ms: Scavenge 4056.0 (4132.3) -> 4053.8 (4141.3) MB, 9.82 / 0.00 ms (average mu = 0.211, current mu = 0.075) allocation failure;
[564408:0x7f3accd222c0] 14378449 ms: Mark-Compact 4063.3 (4143.8) -> 4058.2 (4146.6) MB, 3769.67 / 0.00 ms (average mu = 0.143, current mu = 0.031) allocation failure; scavenge might not succeed
<--- JS stacktrace --->
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
----- Native stack trace -----
It is a bit weird that Nest explodes, comes back up, and then explodes again. I am curious whether everyone here has authorized fetch enabled, because I have it enabled both for signed GET requests and for inbound requests.
Thanks, that helps. I was a bit concerned that it could have been related to authorized fetch, but considering you don't have it enabled by default... it's gotta be somewhere else. I wish I could split the image processing part (seen with [download], as it downloads files to a temporary directory to process them) from the rest. I doubt it is that, though, considering I've seen a bunch of download+processing in my logs with no heap out of memory errors right after... Something tells me the answer is not within the logs. I've spent today and yesterday scrolling through hundreds of thousands of log lines trying to find a pattern, and I cannot find the issue leading up to the FATAL ERROR. My instance doesn't crash simply because it has 32 GB of RAM allocated to it, but if I lowered that, I have no doubt I'd be experiencing the same crashing as y'all. Increasing your instance's memory to something so over the top is not the solution... I'll try to talk to other admins and developers.
I've also spent quite some time trying to find a pattern in the logs, without any success. Are we sure this isn't somehow related to the issue that was patched in 2024.8.2? Because, like I said, that initially fixed it for me, just not permanently.
My leading theory is that 2024.8.2 fixed the memory leak, but the merge of upstream Misskey 2024.9 into the codebase somehow reverted it, or regressed it in some way. I'm running 2024.8.2 with some misc features backported (it's 2024.8.2-transfem, available under the develop-2024.8.2 branch at https://activitypub.software/TransFem-org/sharkey-trans-fem/). I can't say for sure that it doesn't have the same memory leak, but I do not remember seeing the same jagged spikes with it. I already told @fEmber about this version.
Thank you for sharing the log information. I'm currently running with the signToActivityPubGet: true setting. Due to the heavy load on my server, logs are often not recorded for tens of seconds to around a minute before it crashes, so I haven't even been able to check the download logs (although sometimes logs for chart generation have been recorded).
I can provide some logs. The records here seem to indicate that chart generation was interrupted, but since there are many logs with time gaps, it could simply be that no logs were recorded during the problematic period due to overlapping chart events.
@Daniel I have set my server's time zone to JST (UTC+9). Regardless of whether your server is in the same time zone or a different one, I’m curious if the same issue occurs around these times: 3:08, 7:08, 11:08, 15:08, 19:08, or 23:08.
Can you share the last 30 minutes of logs from right before this happened? And if possible, include your configuration file (.config/default.yml) without the secrets / passwords.
I looked in the monitoring system and saw that the first spike appeared roughly an hour before Sharkey overloaded the system (not shown in the screenshots), so I went ahead and attached logs from 17:00 UTC to 18:15 UTC.
You don't proxy remote media, which is an unusual configuration.
Meilisearch is enabled.
Auth fetch is enabled.
The instance is running with the default clustering (1x web, 1x worker) and default job limits. If you have more than 2 cores, then you should adjust these settings.
You were already getting "query is slow" errors at the start of this log, so the trigger could have been earlier.
I see a query that I'm not familiar with: checking the contents of user.tags. I've never seen that property in use before; does anyone know what it's for?
Chart processing apparently took so much CPU that your other jobs timed out. It didn't complete for more than 17 minutes! Some of the individual queries took 10+ seconds, which is highly abnormal.
At 17:05:48, a number of outward HTTP requests suddenly failed with no reason code. I've never seen that behavior before.
Your postgres database seems to be very slow, like in general. Have you optimized it with pgTune?
At exactly 18:08:00, the user @notataboo@mastodon.social was federated to your instance. The timing is suspicious, but probably just a coincidence.
At 18:08:31, auth-fetch went crazy retrieving new users to verify keys. I suspect that someone boosted a post from your instance into an audience that hadn't federated before. The instance seemed to handle this fine, as there were no "query is slow" errors despite the heavy load.
At 18:08:53, a bunch of standard queries began running slow. This suggests heavy server load. The log entry immediately before indicates that a GIF had just been downloaded. This particular spike was probably caused by media processing, since GIFs take much longer to convert.
A similar spike happened at 18:08:53, with no clear cause.
I think I found a pattern, at least in my logs. I checked the logs of the last four crashes, and the same errors appeared every time. I don't know how high the chance of a coincidence is, though.
crash.log
I'm still experiencing crashes caused by the OOM error, and recently, I've noticed a new pattern that deviates from the previous behavior. Until now, the crashes occurred consistently every 4 hours, but now they sometimes happen at irregular intervals as well. For instance, you can see examples in the logs, such as Dec 13 20:22, Dec 17 10:21, and Dec 18 11:36.
As before, the logs don't provide much detailed information about these crashes. I often find myself wishing there was an option to enable more detailed logging in Misskey itself.
Are there other server admins facing similar issues? If so, have you observed these irregular crashes?
Thank you. If this issue is not occurring in your environment, then the periodic crashes and the new crashes happening on my server might be separate issues, even if they seem similar. Hmm, if I figure out anything new, I'll add it here or open a new issue.
I thought it would only occur after 4 hours, but it turns out the system doesn't care whether it was up for 4 hours or just 1. It seems like it will crash at a specific time instead. In this screenshot, the process was only alive for about an hour before it hit the OOM at the exact same time as yesterday (10:08 UTC).
I have previously reported the issue where the software crashes every 4 hours.
On my server, the crashes consistently occurred at JST 3:08, 7:08, 11:08, 15:08, 19:08, 23:08 for the past several months.
Even after software updates, this issue remained unresolved.
However, starting March 10, the crash timing suddenly shifted.
Now, the crashes occur at JST 2:08, 6:08, 10:08, 14:08, 18:08, 22:08 instead.
The 4-hour interval remains unchanged, but the specific times have moved.
I did not make any changes to my server on that day—no manual configuration changes or restarts.
This makes me wonder if some external factor, possibly something related to federation, is influencing this behavior. However, I cannot determine the exact cause.
Questions
Has anyone else experienced a similar shift in crash timing?
Could this change provide any clues toward identifying or solving this issue?
This issue has persisted daily for months, and it has become quite frustrating.
If there is anything else I should check on my server—such as related symptoms or specific logs—please let me know, and I will investigate further.
I personally suspect that this crash is an unknown DoS bug (or vulnerability) triggered via federation. If your instance is on version 2025.2.2 or later, would you please turn on Activity Logging with "preSave" set to true? There's a sample configuration in the "example.yml" file.
Wait - March 10th is when Daylight Saving Time started, and the times shifted by the same amount. So this is definitely a scheduled thing of some kind.
I hadn't considered the impact of daylight saving time. Analyzing the pattern of the error timestamps, I looked into time zones where daylight saving time started around the time of the shift, and it seems that regions such as US Eastern Time, parts of Canada, and Mexico line up with the error timestamps.
I live in Japan, where daylight saving time is not observed. Therefore, I believe the DST-related shift in the error timestamps is caused by something outside my own configuration (as there's no reason for me to be using the Eastern Time zone ...).
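To illustrate the Eastern Time observation, here is a small check; America/New_York is only an assumption, chosen because its DST change lines up with the observed shift:

```ts
// Sanity check of the DST theory: if the trigger is scheduled in a zone that
// observes US DST (America/New_York is an assumption), the pre-shift and
// post-shift crash instants should map to the same local wall-clock time.
const eastern = new Intl.DateTimeFormat("en-US", {
  timeZone: "America/New_York",
  dateStyle: "short",
  timeStyle: "long",
});

// JST is UTC+9, so 19:08 JST == 10:08 UTC and 18:08 JST == 09:08 UTC.
const beforeShift = new Date("2025-03-08T10:08:00Z"); // 19:08 JST, before US DST began
const afterShift = new Date("2025-03-11T09:08:00Z");  // 18:08 JST, after US DST began

console.log(eastern.format(beforeShift)); // ~ "3/8/25, 5:08:00 AM EST"
console.log(eastern.format(afterShift));  // ~ "3/11/25, 5:08:00 AM EDT"
```

Both instants come out as 5:08 AM Eastern, so the whole four-hour cycle maps onto a fixed 1:08 / 5:08 / 9:08 / 13:08 / 17:08 / 21:08 schedule in a US-Eastern time zone, both before and after the shift.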
I also think that the crashes are triggered by something remote. I put my reverse proxy into maintenance mode from 18:05 to 18:10 today, and the crash didn't happen. The next crash happened as usual at 22:08.
If the crash were caused by some instance POSTing something to the inbox, I would have expected a staggered crash between 18:10 and 22:08, when the instance would have tried to redeliver the message, but that didn't happen.
Before that, I suspected Lemmy to be the reason and blocked all Lemmy user agents on the reverse proxy, but that was not it.
Thank you for the info! This supports one of my theories: that this is caused by a malicious bot designed to intentionally kill instances. That would explain why it didn't retry.
Based on @fEmber's instructions, I have configured the default.yml inside the .config as follows:
# Save the activity before processing, then update later with the results.
# This has the advantage of capturing activities that cause a hard-crash, but doubles the number of queries used.
# Default: false
preSave: true
After making these changes, I restarted Sharkey via systemctl restart.
I then checked the logs for today’s issue using journalctl, but there is no relevant log recorded at the time of the fatal error. As usual, the load on the Misskey master process rapidly increases just before the crash and the software crashes within a few seconds.
Mar 15 10:07:42 sharkey[1145380]: ERR 2 [queue inbox] failed(Error: Error in actor https://hellsite.site/users/XXXXXX - 502) id=5979885 attempts=4/8 age=12m activity=https://hellsite.site/users/XXXXXX#><omitted>
Mar 15 10:07:42 sharkey[1145380]: stack: 'Error: Error in actor https://hellsite.site/users/XXXXXX - 502\n' +
Mar 15 10:07:42 sharkey[1145380]: ' at InboxProcessorService._process (file:///home/firefish/Sharkey/packages/backend/built/queue/processors/InboxProcessorService.js:134:27)\n' +
Mar 15 10:07:42 sharkey[1145380]: ' at process.processTicksAndRejections (node:internal/process/task_queues:105:5)\n' +
Mar 15 10:07:42 sharkey[1145380]: ' at async InboxProcessorService.process (file:///home/firefish/Sharkey/packages/backend/built/queue/processors/InboxProcessorService.js:80:20)\n' +
Mar 15 10:07:42 sharkey[1145380]: ' at async /home/firefish/Sharkey/node_modules/.pnpm/bullmq@5.26.1/node_modules/bullmq/dist/cjs/classes/worker.js:512:32\n' +
Mar 15 10:07:42 sharkey[1145380]: ' at async Worker.retryIfFailed (/home/firefish/Sharkey/node_modules/.pnpm/bullmq@5.26.1/node_modules/bullmq/dist/cjs/classes/worker.js:741:24)',
Mar 15 10:07:42 sharkey[1145380]: message: 'Error in actor https://hellsite.site/users/XXXXXX - 502',
Mar 15 10:07:42 sharkey[1145380]: name: 'Error'
Mar 15 10:07:42 sharkey[1145380]: }
Mar 15 10:07:42 sharkey[1145380]: }
Mar 15 10:08:22 sharkey[1145366]: <--- Last few GCs --->
Mar 15 10:08:22 sharkey[1145366]: [1145366:0x7fc6bd514000] 6665086 ms: Mark-Compact 8070.5 (8228.3) -> 8061.6 (8235.3) MB, pooled: 0 MB, 4174.87 / 0.00 ms (average mu = 0.235, current mu = 0.131) allocation failure; sc>
Mar 15 10:08:22 sharkey[1145366]: [1145366:0x7fc6bd514000] 6670954 ms: Mark-Compact 8072.3 (8239.8) -> 8068.0 (8241.6) MB, pooled: 0 MB, 5539.30 / 0.00 ms (average mu = 0.148, current mu = 0.056) allocation failure; GC>
Mar 15 10:08:22 sharkey[1145366]: <--- JS stacktrace --->
Mar 15 10:08:22 sharkey[1145366]: FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
Mar 15 10:08:22 sharkey[1145366]: ----- Native stack trace -----
Mar 15 10:08:22 sharkey[1145366]: 1: 0xe19de0 node::OOMErrorHandler(char const*, v8::OOMDetails const&) [Misskey (master)]
Mar 15 10:08:22 sharkey[1145366]: 2: 0x1240390 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [Misskey (master)]
Mar 15 10:08:22 sharkey[1145366]: 3: 0x1240667 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [Misskey (master)]
Mar 15 10:08:22 sharkey[1145366]: 4: 0x146e1a5 [Misskey (master)]
Mar 15 10:08:22 sharkey[1145366]: 5: 0x1487a19 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [Misskey (master)]
Mar 15 10:08:22 sharkey[1145366]: 6: 0x145c0e8 v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [Misske>
Mar 15 10:08:22 sharkey[1145366]: 7: 0x145d015 v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [Missk>
Mar 15 10:08:22 sharkey[1145366]: 8: 0x1435cee v8::internal::Factory::NewFillerObject(int, v8::internal::AllocationAlignment, v8::internal::AllocationType, v8::internal::AllocationOrigin) [Misskey (master)]
Mar 15 10:08:22 sharkey[1145366]: 9: 0x18972b0 v8::internal::Runtime_AllocateInOldGeneration(int, unsigned long*, v8::internal::Isolate*) [Misskey (master)]
Mar 15 10:08:22 sharkey[1145366]: 10: 0x7fc6ba06c476
Mar 15 10:08:23 sharkey[1145329]: Aborted (core dumped)
Mar 15 10:08:23 sharkey[1145380]: INFO 2 [core] The process is going to exit with code 0
Mar 15 10:08:23 sharkey[1145379]: INFO 1 [core] The process is going to exit with code 0
Mar 15 10:08:23 sharkey[1145133]: ELIFECYCLE Command failed with exit code 134.
Mar 15 10:08:23 systemd[1]: sharkey.service: Main process exited, code=exited, status=134/n/a
Mar 15 10:08:23 systemd[1]: sharkey.service: Failed with result 'exit-code'.
Mar 15 10:08:23 systemd[1]: sharkey.service: Consumed 13min 44.750s CPU time.
Mar 15 10:08:23 systemd[1]: sharkey.service: Scheduled restart job, restart counter is at 1.
Mar 15 10:08:23 systemd[1]: Stopped Sharkey daemon.
The log duration is probably not enough. For me the memory usage starts going up at around :06 and peaks at around :08, so the trigger is probably sooner.
In my case, the spike starts about halfway through the 7th minute and does not occur before or after that. Could the occurrence time of the issue vary depending on the environment? After that, the system restarts between crashes, but no logs containing null are output. The logs appear as shown in the attached file.
Just to clarify, there are some null values in the attached file, but based on the timestamps, they appear to be logged more than several tens of seconds after the issue occurs and seem unrelated.
those are almost certainly requests from our own mastodon emulation layer; 543738 is a bit large, even for the maximum of 100 notes… but not crash-worthy
your webserver logs should also show requests for /v1/timelines/public just before or after those: can you copy those lines?
why does timeline/public default to "global" (maybe real mastodon does the same?)
wtf does it crash? (mine doesn't, but it seems to allocate ~100MB for ~100KB response (mastodon api, local timeline, 100 notes), which feels excessive)
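For anyone who wants to observe this on their own instance, a rough reproduction sketch follows; the instance URL is a placeholder, and it is an assumption that Sharkey's Mastodon emulation layer accepts the same local/limit query parameters as Mastodon's /api/v1/timelines/public. Run it while watching the worker's memory in htop:

```ts
// Repeatedly hit the Mastodon-compatible public timeline endpoint and report
// the response size, so it can be compared against the worker's RSS growth.
const base = "https://your.instance.example"; // placeholder instance URL

async function fetchPublicTimeline(local: boolean, limit: number): Promise<number> {
  const res = await fetch(
    `${base}/api/v1/timelines/public?local=${local}&limit=${limit}`,
  );
  const body = await res.text();
  return body.length; // rough response size in bytes
}

async function main(): Promise<void> {
  for (let i = 0; i < 10; i++) {
    const size = await fetchPublicTimeline(false, 100);
    console.log(`request ${i}: ${size} bytes`);
  }
}

main().catch(console.error);
```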
I get neither requests nor crashes, and that makes sense because the bot claims to respect robots.txt. I block all indexers from all pages on my instance.
I am the developer of the CDSCbot. I have modified our program to exclude any server with "Sharkey" in the version string from any timeline queries. This should alleviate any negative effects.
After seeing @Daniel's post, I checked my nginx logs and found similar access logs on my server at the time of the issue. After temporarily blocking access from the CDSCbot user agent with WAF, the problem stopped occurring.