What happened? (Please give us a brief description of what happened.) Potential memory leak? Every four hours, the server's memory usage rises sharply, causing Sharkey to crash.
The issue occurs at JST 3:08, 7:08, 11:08, 15:08, 19:08, and 23:08. At these times, the memory usage of the Misskey master process spikes, sometimes by several gigabytes. The load increase causes a crash within 1–2 minutes, prompting a restart. The issue might be caused by changes specific to the latest Sharkey version, as I haven't heard similar reports from other Misskey server admins. I've observed nearly identical symptoms on calckey.7ka.org. Interestingly, when I checked shonk.social at JST 11:08, it did not experience a crash. I'm not sure if this is due to a different server time zone or simply a different occurrence time.
I have already tried increasing the max-old-space-size allocation and adjusting the memory allocator, among other modifications, but none of these have resolved the issue. Sharkey continues to crash consistently at the same times.
What did you expect to happen? (Please give us a brief description of what you expected to happen.) Normal operation without memory leaks or crashes.
Version (What version of Sharkey is your instance running? You can find this by clicking your instance's logo at the top left and then clicking instance information.) 2024.10.0-dev-stelpolva (this is a forked version of 2024.9.1, but the same problem occurs on another server running the original 2024.9.1 version).
Instance (What instance of Sharkey are you using?) minazukey.uk (The same issue occurs on a friend's server, calckey.7ka.org.)
What type of issue is this? (If this happens on your device and has to do with the user interface, it's client-side. If this happens with either the API or the backend, or you got a server-side error in the client, it's server-side.) Server-side
How do you deploy Sharkey on your server? (Server-side issues only) Manually
What operating system are you using? (Server-side issues only) Ubuntu 22.04.5
Relevant log output (Please copy and paste any relevant log output. You can find your log by inspecting the page, and going to the "console" tab. This will be automatically formatted into code, so no need for backticks.)
The attached image shows the htop screen during the issue. Below is the error log; however, it doesn’t reveal the exact process causing the issue:
Initially, I reported that shonk.social didn't seem to be crashing. However, upon reviewing its server metrics at 7:08 PM JST, it appears there is significant load on the CPU and other resources, even though it hasn't reached the point of crashing. Since this happens during the same time window in which my server experiences issues, I suspect there may be a connection. (I also observed similar server metrics on shonk.social around 3:09 PM, but I only saw it once and wasn't sure whether it was related, so I didn't report it. I don't have other Sharkey accounts to check whether this happens in other environments.) Here is a screenshot of shonk.social's server metrics around 7:09 PM JST.
If this turns out to be an entirely unrelated issue and a mistaken opinion, I apologize.
We have the same problem. I was just about to open an issue myself. Our monitoring shows that the instance is crashing every four hours. When I checked the logs, it was the JavaScript heap out of memory error mentioned above.
Interestingly, the crashes stopped for 27 days after I updated to 2024.8.2, but they started again about three weeks ago. Updating to 2024.9.1 didn't change that. I even disabled Meilisearch because I thought it was "stealing" Sharkey's memory, but that didn't help.
@Daniel @magi Can you send me screenshots of your inbox queue from the Bull dashboard? I have spent the past couple of hours looking into this on my own instance. My finding is that I have "poisoned" job queues: certain jobs keep getting rescheduled instead of removed.
ERR 6 [remote ap] error occurred while fetching following/followers collection {
stack: Error: Validate content type of AP response: Content type is not application/activity+json or application/ld+json
    at validateContentTypeSetAsActivityPub (file:///sharkey/packages/backend/built/core/activitypub/misc/validator.js:12:11)
    at ApRequestService.signedGet (file:///sharkey/packages/backend/built/core/activitypub/ApRequestService.js:215:9)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Resolver.resolve (file:///sharkey/packages/backend/built/core/activitypub/ApResolverService.js:109:36)
    at async Resolver.resolveCollection (file:///sharkey/packages/backend/built/core/activitypub/ApResolverService.js:73:56)
    at async ApPersonService.isPublicCollection (file:///sharkey/packages/backend/built/core/activitypub/models/ApPersonService.js:654:30)
    at async Promise.all (index 0)
    at async ApPersonService.updatePerson (file:///sharkey/packages/backend/built/core/activitypub/models/ApPersonService.js:434:60)
    at async ApInboxService.update (file:///sharkey/packages/backend/built/core/activitypub/ApInboxService.js:652:13)
    at async ApInboxService.performOneActivity (file:///sharkey/packages/backend/built/core/activitypub/ApInboxService.js:157:20)
    at async ApInboxService.performActivity (file:///sharkey/packages/backend/built/core/activitypub/ApInboxService.js:138:22)
    at async InboxProcessorService.process (file:///sharkey/packages/backend/built/queue/processors/InboxProcessorService.js:193:28)
    at async Worker.processJob (/sharkey/node_modules/.pnpm/bullmq@5.13.2/node_modules/bullmq/dist/cjs/classes/worker.js:455:28)
    at async Worker.retryIfFailed (/sharkey/node_modules/.pnpm/bullmq@5.13.2/node_modules/bullmq/dist/cjs/classes/worker.js:640:24)
}
When one of the jobs "failing" in this manner hits a worker, I notice a spike in memory and CPU usage (depending on the job). I found several types of "poisonous" jobs that, in the right circumstances, would stall workers or cause increased memory usage that may exceed what you have allocated. I'm going to open issues for some of them, but I'm curious what y'all have noticed.
One thing you can do to confirm this is to just promote the queue and see if the instance dies. I was able to make my instance spike in memory usage, which confirmed that the jobs were in fact having an impact on functionality.
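For reference, promoting doesn't have to go through the Bull dashboard UI; it can also be scripted. Here is a minimal sketch with BullMQ, assuming Redis is reachable locally on the default port and that the queue is registered under the name "inbox" (adjust the connection details, queue name, and any queue prefix to match your default.yml):

```ts
import { Queue } from "bullmq";

async function promoteDelayed(queueName: string): Promise<void> {
  // Connection and queue name are assumptions; match them to the redis
  // section of your default.yml (and the queue prefix, if you set one).
  const queue = new Queue(queueName, {
    connection: { host: "127.0.0.1", port: 6379 },
  });

  const delayed = await queue.getDelayed();
  console.log(`${queueName}: promoting ${delayed.length} delayed job(s)`);

  for (const job of delayed) {
    await job.promote(); // move the job from "delayed" to "waiting" immediately
  }

  await queue.close();
}

promoteDelayed("inbox").catch(console.error);
```

This should be equivalent to promoting the jobs from the dashboard; it just makes it easy to move every delayed job at once while watching memory in htop.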
Promoting the inbox and deliver queues did nothing for me. I had pretty much already ruled out the deliver and inbox queues some months ago, even before the release of 2024.8.1, when I cleared all delayed jobs out of despair. Also, there weren't any jobs that were delayed by exactly four hours.
Like Daniel, I've been regularly promoting delayed queues from the Bull dashboard and deleting any jobs that can't be processed. However, I haven't noticed any crashes occurring as a result of these actions. The issue seems to happen even when the Bull dashboard queue appears empty. To be sure, I'll try temporarily suspending federation with servers that frequently experience delivery delays and monitor for any changes. If it is indeed a queue issue, addressing it might be challenging given the large number of federated servers my instance connects to.
No need to; even after pruning the "poisonous" job queues, I have found that my instance (https://transfem.social/) is still having issues with the GC.
<--- Last few GCs --->
[3712733:0x7f9bdd322a00] 28821043 ms: Scavenge 4053.7 (4130.3) -> 4051.5 (4139.3) MB, 12.16 / 0.00 ms (average mu = 0.189, current mu = 0.094) allocation failure;
[3712733:0x7f9bdd322a00] 28825101 ms: Mark-Compact 4061.3 (4142.0) -> 4056.0 (4144.8) MB, 4040.42 / 0.00 ms (average mu = 0.125, current mu = 0.029) allocation failure; scavenge might not succeed
<--- JS stacktrace --->
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
----- Native stack trace -----
ERR * [core cluster] [192] died :(
So while, in my case, pruning some of the jobs that kept failing and would never succeed did help somewhat, it is not the solution. I have also noticed that the instance has multiple GC heap allocation errors. The only common pattern I see is that the instance is usually downloading a bunch of media when this occurs.
INFO 199 [core nest] InstanceLoader: ServerModule dependencies initialized
INFO 199 [core nest] InstanceLoader: CoreModule dependencies initialized
INFO 199 [core nest] InstanceLoader: EndpointsModule dependencies initialized
INFO 193 [url-preview] Getting preview of https://youtu.be/HEoZZInT38g@en-US ...
INFO 192 [url-preview] Getting preview of https://freedomnews.org.uk/2024/04/29/georgia-mass-protests-against-pro-russian-government-and-foreign-agents-law/@en-US ...
INFO 194 [url-preview] Getting preview of https://masto.pt/tags/Migra%C3%A7%C3%B5es@en-US ...
DONE 194 [url-preview] Got preview of https://masto.pt/tags/Migra%C3%A7%C3%B5es: Masto.PT
INFO 195 [url-preview] Getting preview of https://masto.pt/tags/Patrim%C3%B3nio@en-US ...
DONE 192 [url-preview] Got preview of https://freedomnews.org.uk/2024/04/29/georgia-mass-protests-against-pro-russian-government-and-foreign-agents-law/: Georgia: Mass protests against pro-Russian government and "foreign agents" law
INFO 195 [url-preview] Getting preview of https://my.heinzhistorycenter.org/orders/558/tickets?eventId=651722b38f4b195c3950c699&cdEventIds=651722b38f4b195c3950c699&date=2023-10-18T19:30:00-04:00@en-US ...
DONE 193 [url-preview] Got preview of https://youtu.be/HEoZZInT38g: Fancy Women Bike Ride
DONE 195 [url-preview] Got preview of https://masto.pt/tags/Patrim%C3%B3nio: Masto.PT
DONE 195 [url-preview] Got preview of https://my.heinzhistorycenter.org/orders/558/tickets?eventId=651722b38f4b195c3950c699&cdEventIds=651722b38f4b195c3950c699&date=2023-10-18T19:30:00-04:00: my.heinzhistorycenter.org
INFO 194 [download] Downloading https://cdn.transfem.social/files/5912772b-7748-43e7-a146-bd33fb477f22.png to /tmp/tmp-3715825-2J7GjdXgink8 ...
INFO 193 [download] Downloading https://cdn.transfem.social/files/67f743c9-bdf5-4641-b260-e43e45c76a8f.png to /tmp/tmp-3714059-oxYTwS6ckVjI ...
DONE 193 [download] Download finished: https://cdn.transfem.social/files/67f743c9-bdf5-4641-b260-e43e45c76a8f.png
WARNING: CPU supports 0x6000000000004000, software requires 0x4000000000005000
WARNING: CPU supports 0x6000000000004000, software requires 0x4000000000005000
DONE 194 [download] Download finished: https://cdn.transfem.social/files/5912772b-7748-43e7-a146-bd33fb477f22.png
WARNING: CPU supports 0x6000000000004000, software requires 0x4000000000005000
WARNING: CPU supports 0x6000000000004000, software requires 0x4000000000005000
INFO 198 [url-preview] Returning cache preview of https://github.com/mastodon/mastodon/issues?q=is%3Aissue%20state%3Aopen%20sort%3Areactions-%2B1-desc@en-US
INFO 195 [download] Downloading https://cdn.transfem.social/files/6bbebb56-2111-46c4-9747-29847efa800a.webp to /tmp/tmp-562043-2EtiI8fpyehM ...
INFO 192 [download] Downloading https://mastodon.vierkantor.com/system/custom_emojis/images/000/096/272/original/fd382db8ce593687.png to /tmp/tmp-3712733-ufiwva1hfD5w ...
INFO 192 [download] Downloading https://cdn.transfem.social/files/6bbebb56-2111-46c4-9747-29847efa800a.webp to /tmp/tmp-3712733-E7ERc0480v5E ...
DONE 192 [download] Download finished: https://mastodon.vierkantor.com/system/custom_emojis/images/000/096/272/original/fd382db8ce593687.png
WARNING: CPU supports 0x6000000000004000, software requires 0x4000000000005000
WARNING: CPU supports 0x6000000000004000, software requires 0x4000000000005000
DONE 195 [download] Download finished: https://cdn.transfem.social/files/6bbebb56-2111-46c4-9747-29847efa800a.webp
DONE 192 [download] Download finished: https://cdn.transfem.social/files/6bbebb56-2111-46c4-9747-29847efa800a.webp
INFO 192 [download] Downloading https://cdn.transfem.social/files/68d658b0-b264-4b5f-a4d2-04f1fde73dc4.webp to /tmp/tmp-3712733-qfXU7e8dmT2i ...
INFO 195 [download] Downloading https://cdn.transfem.social/files/1e5a61e4-926f-4beb-bdf3-55e2794d4a4d.gif to /tmp/tmp-562043-60Nn25l7KTZW ...
INFO 193 [download] Downloading https://cdn.transfem.social/files/78e07341-9006-4b69-a95f-b00fe0aa247c.webp to /tmp/tmp-3714059-afQfs1T7CV6J ...
DONE 192 [download] Download finished: https://cdn.transfem.social/files/68d658b0-b264-4b5f-a4d2-04f1fde73dc4.webp
DONE 195 [download] Download finished: https://cdn.transfem.social/files/1e5a61e4-926f-4beb-bdf3-55e2794d4a4d.gif
DONE 193 [download] Download finished: https://cdn.transfem.social/files/78e07341-9006-4b69-a95f-b00fe0aa247c.webp
<--- Last few GCs --->
[564408:0x7f3accd222c0] 14374664 ms: Scavenge 4056.0 (4132.3) -> 4053.8 (4141.3) MB, 9.82 / 0.00 ms (average mu = 0.211, current mu = 0.075) allocation failure;
[564408:0x7f3accd222c0] 14378449 ms: Mark-Compact 4063.3 (4143.8) -> 4058.2 (4146.6) MB, 3769.67 / 0.00 ms (average mu = 0.143, current mu = 0.031) allocation failure; scavenge might not succeed
<--- JS stacktrace --->
FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
----- Native stack trace -----
It is a bit weird that Nest explodes, comes back up, and then explodes again. I am curious whether everyone here has authorized fetch enabled, because I have it enabled both for signed GET requests and for inbound requests.
Thanks, that helps. I was a bit concerned that it could have been related to authorized fetch, but considering you don't have it enabled by default... it's gotta be somewhere else. I wish I could split the image processing part (seen with [download], as it downloads files to a temporary directory to process them) from the rest. I doubt it is that, though, considering I've seen a bunch of download+processing in my logs with no heap out of memory errors right after... Something tells me the answer is not within the logs. I've spent today and yesterday scrolling through hundreds of thousands of log lines trying to find a pattern, and I cannot find the issue leading up to the FATAL ERROR. My instance doesn't crash simply because it has 32 GB of RAM allocated to it, but if I lowered that, I have no doubt I'd be experiencing the same crashing as y'all. Increasing your instance's memory to something so over the top is not the solution... I'll try to talk to other admins and developers.
I've also spent quite some time trying to find a pattern in the logs, without any success. Are we sure this isn't somehow related to the issue that was patched in 2024.8.2? Because, like I said, that initially fixed it for me, just not permanently.
My leading theory is that 2024.8.2 fixed the memory leak, but the merge of upstream Misskey 2024.9 into the codebase somehow reverted it, or regressed it in some way. I'm running 2024.8.2 with some misc features backported (it's 2024.8.2-transfem, available under the develop-2024.8.2 branch at https://activitypub.software/TransFem-org/sharkey-trans-fem/). I can't say for sure that it doesn't have the same memory leak, but I do not remember seeing the same jagged spikes with it. I already told @fEmber about this version.
Thank you for sharing the log information. I'm currently running with the signToActivityPubGet: true setting. Due to the heavy load on my server, logs are often not recorded for tens of seconds to around a minute before it crashes, so I haven't even been able to check the download logs (although sometimes logs for chart generation have been recorded).
I can provide some logs. The records here seem to indicate that chart generation was interrupted, but since there are many logs with time gaps, it could simply be that no logs were recorded during the problematic period due to overlapping chart events.
@Daniel I have set my server's time zone to JST (UTC+9). Regardless of whether your server is in the same time zone or a different one, I’m curious if the same issue occurs around these times: 3:08, 7:08, 11:08, 15:08, 19:08, or 23:08.
Can you share the last 30 minutes of logs from right before this happened? And if possible, include your configuration file (.config/default.yml) without the secrets / passwords.
I looked in the monitoring system and saw that the first spike appeared roughly an hour before Sharkey overloaded the system (not shown in the screenshots), so I went ahead and attached logs from 17:00 UTC to 18:15 UTC.
You don't proxy remote media, which is an unusual configuration.
Meilisearch is enabled.
Auth fetch is enabled.
The instance is running with the default clustering (1x web, 1x worker) and default job limits. If you have more than 2 cores, then you should adjust these settings.
You were already getting "query is slow" errors at the start of this log, so the trigger could have been earlier.
I see a query that I'm not familiar with: checking the contents of user.tags. I've never seen that property in use before; does anyone know what it's for?
Chart processing apparently took so much CPU that your other jobs timed out. It didn't complete for more than 17 minutes! Some of the individual queries took 10+ seconds, which is highly abnormal.
At 17:05:48, a number of outward HTTP requests suddenly failed with no reason code. I've never seen that behavior before.
Your postgres database seems to be very slow, like in general. Have you optimized it with pgTune?
At exactly 18:08:00, the user @notataboo@mastodon.social was federated to your instance. The timing is suspicious, but probably just a coincidence.
At 18:08:31, auth-fetch went crazy retrieving new users to verify keys. I suspect that someone boosted a post from your instance into an audience that hadn't federated before. The instance seemed to handle this fine, as there were no "query is slow" errors despite the heavy load.
At 18:08:53, a bunch of standard queries began running slow. This suggests heavy server load. The log entry immediately before indicates that a GIF had just been downloaded. This particular spike was probably caused by media processing, since GIFs take much longer to convert.
A similar spike happened at 18:08:53, with no clear cause.
I think I found a pattern, at least in my logs. I checked the logs of the last four crashes, and the same errors appeared every time. I don't know how high the chance of a coincidence is, though.
crash.log
I'm still experiencing crashes caused by the OOM error, and recently, I've noticed a new pattern that deviates from the previous behavior. Until now, the crashes occurred consistently every 4 hours, but now they sometimes happen at irregular intervals as well. For instance, you can see examples in the logs, such as Dec 13 20:22, Dec 17 10:21, and Dec 18 11:36.
As before, the logs don't provide much detailed information about these crashes. I often find myself wishing there was an option to enable more detailed logging in Misskey itself.
Are there other server admins facing similar issues? If so, have you observed these irregular crashes?
Thank you. If this issue is not occurring in your environment, then the periodic crashes and the new crashes happening on my server might be separate issues, even if they seem similar. Hmm, if I figure out anything new, I'll add it here or open a new issue.
I thought it would only occur after 4 hours, but it turns out the system doesn't care whether it was up for 4 hours or just 1. It seems like it will crash at a specific time instead. In this screenshot, the process was only alive for about an hour before it hit the OOM at the exact same time as yesterday (10:08 UTC).
I have previously reported the issue where the software crashes every 4 hours.
On my server, the crashes consistently occurred at JST 3:08, 7:08, 11:08, 15:08, 19:08, 23:08 for the past several months.
Even after software updates, this issue remained unresolved.
However, starting March 10, the crash timing suddenly shifted.
Now, the crashes occur at JST 2:08, 6:08, 10:08, 14:08, 18:08, 22:08 instead.
The 4-hour interval remains unchanged, but the specific times have moved.
I did not make any changes to my server on that day—no manual configuration changes or restarts.
This makes me wonder if some external factor, possibly something related to federation, is influencing this behavior. However, I cannot determine the exact cause.
Questions
Has anyone else experienced a similar shift in crash timing?
Could this change provide any clues toward identifying or solving this issue?
This issue has persisted daily for months, and it has become quite frustrating.
If there is anything else I should check on my server—such as related symptoms or specific logs—please let me know, and I will investigate further.
I personally suspect that this crash is an unknown DoS bug (or vulnerability) triggered via federation. If your instance is on version 2025.2.2 or later, would you please turn on Activity Logging with "preSave" set to true? There's a sample configuration in the "example.yml" file.
Wait - March 10th is when Daylight Saving Time started, and the times shifted by the same amount. So this is definitely a scheduled thing of some kind.
I hadn't considered the impact of daylight saving time. Analyzing the pattern of the error timestamps, I looked into time zones where daylight saving time started around the time of the shift, and it seems that regions such as US Eastern Time, parts of Canada, and Mexico line up with the error timestamps.
I live in Japan, where daylight saving time is not observed. Therefore, I believe the DST-related shift in the error timestamps is caused by something outside my own configuration (as there's no reason for me to be using the Eastern Time zone ...).
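To illustrate the Eastern Time observation, here is a small check; America/New_York is only an assumption, chosen because its DST change lines up with the observed shift:

```ts
// Sanity check of the DST theory: if the trigger is scheduled in a zone that
// observes US DST (America/New_York is an assumption), the pre-shift and
// post-shift crash instants should map to the same local wall-clock time.
const eastern = new Intl.DateTimeFormat("en-US", {
  timeZone: "America/New_York",
  dateStyle: "short",
  timeStyle: "long",
});

// JST is UTC+9, so 19:08 JST == 10:08 UTC and 18:08 JST == 09:08 UTC.
const beforeShift = new Date("2025-03-08T10:08:00Z"); // 19:08 JST, before US DST began
const afterShift = new Date("2025-03-11T09:08:00Z");  // 18:08 JST, after US DST began

console.log(eastern.format(beforeShift)); // ~ "3/8/25, 5:08:00 AM EST"
console.log(eastern.format(afterShift));  // ~ "3/11/25, 5:08:00 AM EDT"
```

Both instants come out as 5:08 AM Eastern, so the whole four-hour cycle maps onto a fixed 1:08 / 5:08 / 9:08 / 13:08 / 17:08 / 21:08 schedule in a US-Eastern time zone, both before and after the shift.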
I also think that the crashes are triggered by something remote. I put my reverse proxy into maintenance mode from 18:05 to 18:10 today, and the crash didn't happen. The next crash happened as usual at 22:08.
If the crash were caused by some instance POSTing something to the inbox, I would have expected a staggered crash between 18:10 and 22:08, when the instance would have tried to redeliver the message, but that didn't happen.
Before that, I suspected Lemmy to be the reason and blocked all Lemmy user agents on the reverse proxy, but that was not it.
Thank you for the info! This supports one of my theories: that this is caused by a malicious bot designed to intentionally kill instances. That would explain why it didn't retry.
Based on @fEmber's instructions, I have configured the default.yml inside the .config as follows:
# Save the activity before processing, then update later with the results.
# This has the advantage of capturing activities that cause a hard-crash, but doubles the number of queries used.
# Default: false
preSave: true
After making these changes, I restarted Sharkey via systemctl restart.
I then checked the logs for today’s issue using journalctl, but there is no relevant log recorded at the time of the fatal error. As usual, the load on the Misskey master process rapidly increases just before the crash and the software crashes within a few seconds.
Mar 15 10:07:42 sharkey[1145380]: ERR 2 [queue inbox] failed(Error: Error in actor https://hellsite.site/users/XXXXXX - 502) id=5979885 attempts=4/8 age=12m activity=https://hellsite.site/users/XXXXXX#><omitted>
Mar 15 10:07:42 sharkey[1145380]: stack: 'Error: Error in actor https://hellsite.site/users/XXXXXX - 502\n' +
Mar 15 10:07:42 sharkey[1145380]: ' at InboxProcessorService._process (file:///home/firefish/Sharkey/packages/backend/built/queue/processors/InboxProcessorService.js:134:27)\n' +
Mar 15 10:07:42 sharkey[1145380]: ' at process.processTicksAndRejections (node:internal/process/task_queues:105:5)\n' +
Mar 15 10:07:42 sharkey[1145380]: ' at async InboxProcessorService.process (file:///home/firefish/Sharkey/packages/backend/built/queue/processors/InboxProcessorService.js:80:20)\n' +
Mar 15 10:07:42 sharkey[1145380]: ' at async /home/firefish/Sharkey/node_modules/.pnpm/bullmq@5.26.1/node_modules/bullmq/dist/cjs/classes/worker.js:512:32\n' +
Mar 15 10:07:42 sharkey[1145380]: ' at async Worker.retryIfFailed (/home/firefish/Sharkey/node_modules/.pnpm/bullmq@5.26.1/node_modules/bullmq/dist/cjs/classes/worker.js:741:24)',
Mar 15 10:07:42 sharkey[1145380]: message: 'Error in actor https://hellsite.site/users/XXXXXX - 502',
Mar 15 10:07:42 sharkey[1145380]: name: 'Error'
Mar 15 10:07:42 sharkey[1145380]: }
Mar 15 10:07:42 sharkey[1145380]: }
Mar 15 10:08:22 sharkey[1145366]: <--- Last few GCs --->
Mar 15 10:08:22 sharkey[1145366]: [1145366:0x7fc6bd514000] 6665086 ms: Mark-Compact 8070.5 (8228.3) -> 8061.6 (8235.3) MB, pooled: 0 MB, 4174.87 / 0.00 ms (average mu = 0.235, current mu = 0.131) allocation failure; sc>
Mar 15 10:08:22 sharkey[1145366]: [1145366:0x7fc6bd514000] 6670954 ms: Mark-Compact 8072.3 (8239.8) -> 8068.0 (8241.6) MB, pooled: 0 MB, 5539.30 / 0.00 ms (average mu = 0.148, current mu = 0.056) allocation failure; GC>
Mar 15 10:08:22 sharkey[1145366]: <--- JS stacktrace --->
Mar 15 10:08:22 sharkey[1145366]: FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
Mar 15 10:08:22 sharkey[1145366]: ----- Native stack trace -----
Mar 15 10:08:22 sharkey[1145366]: 1: 0xe19de0 node::OOMErrorHandler(char const*, v8::OOMDetails const&) [Misskey (master)]
Mar 15 10:08:22 sharkey[1145366]: 2: 0x1240390 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [Misskey (master)]
Mar 15 10:08:22 sharkey[1145366]: 3: 0x1240667 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, v8::OOMDetails const&) [Misskey (master)]
Mar 15 10:08:22 sharkey[1145366]: 4: 0x146e1a5 [Misskey (master)]
Mar 15 10:08:22 sharkey[1145366]: 5: 0x1487a19 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [Misskey (master)]
Mar 15 10:08:22 sharkey[1145366]: 6: 0x145c0e8 v8::internal::HeapAllocator::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [Misske>
Mar 15 10:08:22 sharkey[1145366]: 7: 0x145d015 v8::internal::HeapAllocator::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [Missk>
Mar 15 10:08:22 sharkey[1145366]: 8: 0x1435cee v8::internal::Factory::NewFillerObject(int, v8::internal::AllocationAlignment, v8::internal::AllocationType, v8::internal::AllocationOrigin) [Misskey (master)]
Mar 15 10:08:22 sharkey[1145366]: 9: 0x18972b0 v8::internal::Runtime_AllocateInOldGeneration(int, unsigned long*, v8::internal::Isolate*) [Misskey (master)]
Mar 15 10:08:22 sharkey[1145366]: 10: 0x7fc6ba06c476
Mar 15 10:08:23 sharkey[1145329]: Aborted (core dumped)
Mar 15 10:08:23 sharkey[1145380]: INFO 2 [core] The process is going to exit with code 0
Mar 15 10:08:23 sharkey[1145379]: INFO 1 [core] The process is going to exit with code 0
Mar 15 10:08:23 sharkey[1145133]: ELIFECYCLE Command failed with exit code 134.
Mar 15 10:08:23 systemd[1]: sharkey.service: Main process exited, code=exited, status=134/n/a
Mar 15 10:08:23 systemd[1]: sharkey.service: Failed with result 'exit-code'.
Mar 15 10:08:23 systemd[1]: sharkey.service: Consumed 13min 44.750s CPU time.
Mar 15 10:08:23 systemd[1]: sharkey.service: Scheduled restart job, restart counter is at 1.
Mar 15 10:08:23 systemd[1]: Stopped Sharkey daemon.
The log duration is probably not enough. For me the memory usage starts going up at around :06 and peaks at around :08, so the trigger is probably sooner.
In my case, the spike starts about halfway through the 7th minute and does not occur before or after that. Could the occurrence time of the issue vary depending on the environment? After that, the system restarts between crashes, but no logs containing null are output. The logs appear as shown in the attached file.
Just to clarify, there are some null values in the attached file, but based on the timestamps, they appear to be logged more than several tens of seconds after the issue occurs and seem unrelated.
those are almost certainly requests from our own mastodon emulation layer; 543738 is a bit large, even for the maximum of 100 notes… but not crash-worthy
your webserver logs should also show requests for /v1/timelines/public just before or after those: can you copy those lines?
why does timeline/public default to "global" (maybe real mastodon does the same?)
wtf does it crash? (mine doesn't, but it seems to allocate ~100MB for ~100KB response (mastodon api, local timeline, 100 notes), which feels excessive)
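For anyone who wants to observe this on their own instance, a rough reproduction sketch follows; the instance URL is a placeholder, and it is an assumption that Sharkey's Mastodon emulation layer accepts the same local/limit query parameters as Mastodon's /api/v1/timelines/public. Run it while watching the worker's memory in htop:

```ts
// Repeatedly hit the Mastodon-compatible public timeline endpoint and report
// the response size, so it can be compared against the worker's RSS growth.
const base = "https://your.instance.example"; // placeholder instance URL

async function fetchPublicTimeline(local: boolean, limit: number): Promise<number> {
  const res = await fetch(
    `${base}/api/v1/timelines/public?local=${local}&limit=${limit}`,
  );
  const body = await res.text();
  return body.length; // rough response size in bytes
}

async function main(): Promise<void> {
  for (let i = 0; i < 10; i++) {
    const size = await fetchPublicTimeline(false, 100);
    console.log(`request ${i}: ${size} bytes`);
  }
}

main().catch(console.error);
```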
I get neither requests nor crashes, and that makes sense because the bot claims to respect robots.txt. I block all indexers from all pages on my instance.
I am the developer of the CDSCbot. I have modified our program to exclude any server with "Sharkey" in the version string from any timeline queries. This should alleviate any negative effects.
After seeing @Daniel's post, I checked my nginx logs and found similar access logs on my server at the time of the issue. After temporarily blocking access from the CDSCbot user agent with WAF, the problem stopped occurring.