Short downtime - again due to frozen Ceph mount (umount -f doesn't work), after several waves of many requests today.

Looking for expertise 😅


To elaborate: In order to run services via multiple instances, we're slowly migrating data to the Ceph filesystem.

However, although we tested this for months, running it in practice has sometimes hit us hard. We apologize for the inconvenience and we'll do our best to restore the stability of the past years.

If you want to get involved with the infrastructure behind Codeberg and / or want to share your experiences with Ceph, please reach out 😉

Our journey continues with this issue:

Feel free to read along and join the discussion, either here, on Matrix, or in the issue itself.

Thanks to the team, it's been a long meeting tonight :)

@codeberg Sorry, can't help.

Looks like this is related. Happens when adding an attachment to an issue comment.

@marian uh, thank you for reporting! Seems there is still a bit more to clean up after this incident. Bear with us, we are at it!

@codeberg 100% <3 & respect. trying to get up & going with ceph & rook myself, very much venturing into the great depths & hoping for merciful & good times.

would that we all be able to host data well, without trouble.

forever just consigning this task to the hyper-scale ultra-industrialists is a little death i do not want us to have. we need the willing. i'm sorry for all that this is not easy. i'm sorry for your trouble & the hard time you've been forced to put in.

let us hope. hope for this capability to mature & become only ever more reasonable.

@codeberg @ops I remember you had some problems with ceph at the beginning. Do you have any ideas?

@codeberg Hasn't GitLab used Ceph in the past too? And then moved away from it due to scalability issues and management overhead. Don't recall what they use as replacement though. looks like a promising alternative, but haven't tried it out myself yet, as I don't have any personal project of this scale. And probably Codeberg has already invested too much into Ceph to try something else, at this point 🤔

@dnaka91 Having already invested doesn't mean it's not worth switching to something else.

We don't yet know if Ceph scales cross-machine, for example.

Regarding the GitLab thing: The main issue is latency for small files. We investigated this ourselves and came to the conclusion that the performance is good enough when garbage collection frequently packs the data. It's even okay to work with repos on Ceph cross-datacenter with much higher latency.
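
For anyone who wants to compare small-file latency across mounts, here is a minimal sketch of one way to measure it. It is not how the Codeberg team ran their tests (the thread doesn't say); the function name, file count, and file size are all made up for illustration. Note that client-side caching can make reads look much faster than the backend really is, so treat the numbers as a rough comparison only.

```python
import os
import tempfile
import time


def time_small_reads(directory: str, count: int = 200, size: int = 1024) -> float:
    """Write `count` files of `size` bytes each, then return mean seconds per read.

    Hypothetical helper for illustration; point `directory` at the mount to test.
    """
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"blob_{i:05d}")
        with open(path, "wb") as f:
            f.write(os.urandom(size))
        paths.append(path)

    # Time only the read phase: open, read fully, close, one file at a time.
    start = time.perf_counter()
    for path in paths:
        with open(path, "rb") as f:
            f.read()
    elapsed = time.perf_counter() - start
    return elapsed / count


if __name__ == "__main__":
    # A local temp dir is used here; swap in a directory on the mount under test.
    with tempfile.TemporaryDirectory() as d:
        print(f"mean read latency: {time_small_reads(d) * 1e6:.1f} µs")
```

Running it once against a local disk and once against the network mount gives a quick feel for the per-file overhead that many loose Git objects would incur.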

@dnaka91 Probably worth a closer look.

I'd personally give bonus points for "simple" / "manageable" right now :)

~ otto

@dnaka91 @codeberg

Never heard of it, so it didn't get compared or evaluated ...

I like this sentence in their README:

"SeaweedFS is ideal for serving relatively smaller files quickly and concurrently." ...

Migrating away from Ceph to this would take a bit of planning but is doable, so we should validate whether it performs well and can do the job.

@dnaka91 @codeberg

We need three features:
- S3 API, check
- POSIX filesystem, check
- file locks are handled transparently, as expected on a local fs, ?!?
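
The open question about file locks could be probed with something like the sketch below. It is only an assumption about what "as expected on a local fs" means here: one process holds an exclusive `flock`, and a second process must be refused the same lock. The helper name and exit codes are hypothetical, and this checks `flock`-style locks only, not POSIX byte-range locks.

```python
import fcntl
import subprocess
import sys

# Child snippet: try to take the same exclusive lock without blocking.
# Exit code 0 = lock acquired (unexpected), 3 = lock refused (expected).
CHILD = """
import fcntl, sys
with open(sys.argv[1], "r+") as f:
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        sys.exit(3)
sys.exit(0)
"""


def lock_is_exclusive(path: str) -> bool:
    """Return True if a second process is refused a lock we already hold.

    Hypothetical helper; `path` must be an existing file on the filesystem to test.
    """
    with open(path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # parent takes the exclusive lock
        try:
            result = subprocess.run([sys.executable, "-c", CHILD, path])
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
    return result.returncode == 3
```

To cover the multi-instance case, the same idea would need the two processes to run on different machines that mount the same filesystem, which this single-host sketch does not do.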

@dnaka91 @codeberg

And the complete team should ack to switch if it works of course ...

Are you using the kernel ceph client? Did you try the fuse client? (Or vice versa)
Which kernel version? Debian kernel?
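
A quick way to answer both questions on a given box might look like the sketch below. It assumes that kernel-client mounts show up in `/proc/mounts` with filesystem type `ceph` and ceph-fuse mounts with `fuse.ceph-fuse`; that matches what is commonly seen on Linux, but check it against your own system. The function name is made up for illustration.

```python
import platform


def ceph_mounts(mounts_text: str) -> list[tuple[str, str]]:
    """Return (mount_point, client) pairs for CephFS entries in /proc/mounts-style text.

    Assumption: fstype "ceph" means kernel client, "fuse.ceph-fuse" means FUSE client.
    """
    clients = {"ceph": "kernel", "fuse.ceph-fuse": "fuse"}
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fstype, options, dump, pass
        if len(fields) >= 3 and fields[2] in clients:
            found.append((fields[1], clients[fields[2]]))
    return found


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        for mount_point, client in ceph_mounts(f.read()):
            print(f"{mount_point}: {client} client")
    print("kernel:", platform.release())
```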


When I was using CephFS, I always mounted it locally on one of the ceph nodes and then used sshfs to mount it remotely. That was very stable.

The kernel stacktrace does have ceph in it, so switching to SSH for the remote mount might allow you to confirm the theory that it's the ceph mount that is the root cause of the kernel panic.

@adam From what we observed, sshfs wasn't fast enough for many small read accesses 😞


Sadness. Since you already have a VPN, NFS might be an option, especially if this is just a temporary solution.

I'm just trying to figure out some way to confirm that ceph is causing the kernel panic.

I suppose the other route would be to swap out the kernel on the box that is crashing. Though swap out for what is the real question... newer? different config? the same kernel as the ceph node is running?

@adam We could first try to upgrade the Ceph tools to a newer version, as @dachary suggested. The infra team will decide.

And yes, using something different than Ceph might also be worth another look. I tested NFS locally with Git garbage collection, it was a nightmare actually. But maybe the performance on the Codeberg servers would have been much better.

@codeberg @dachary

I have limited experience with NFS, but in that experience ASYNC_WRITE is key for performance (like an order of magnitude faster). Of course, that comes at the price of being theoretically less robust during a service interruption.

Best of luck. 🙂

make sure you do *not* mount the cephfs using the kernel ceph module on a machine holding any of your cluster storage. that may get you kernel deadlocks, and symptoms wouldn't be unlike those you're experiencing

@YoSiJo Also found that tip, but a complete restart seemed like the easier job.

The only issue was confirming that the machine had really shut down. Last time, it hung during the shutdown, waiting for some stuff, and didn't come back up ... so this time we looked into how to force a shutdown 🙈

@codeberg thanks for the transparency!

you probably already reached out to the ceph developer community? such a kernel issue might be a bug, but usually one very hard to reproduce and fix.

using a newer (or older) kernel could work better...

@davidak Not yet. We are still analysing and discussing the possible underlying causes that triggered this unusual condition, to get a more complete picture.

@codeberg @davidak I commented, let's continue the conversation there 🚀
