Question: How do short character limits on free text (e.g. on Birdsite) work for non-English languages?

I imagine it would be less expressive in some languages, or that character-based languages might have an easier time fitting more content into a single message.

@kaeedo I mean, yeah. Ideographic languages (e.g. CJK - Chinese/Japanese/Korean) have a distinct information density advantage.

Birdsite's limit is based on cell carrier limits*. Cell carrier limits were purely technical (payload size) and not at all about permitting/restricting expression.

@kaeedo
* Note that cell carrier limits were octet based, while Birdsite's limits are grapheme based. That's important since an English letter is one octet**. An accented Latin character is (typically) two octets. CJK characters are 3 or 4 octets, and that's ignoring combining sequences. Emojis can run to 20 octets or more for a single "character".

** Numbers based on using UTF-8. Early cell carriers in CJK regions often used two-octet multi-byte local encodings like SJIS.
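
For anyone who wants to check those octet counts, here is a minimal Python sketch. The sample characters and the helper names `utf8_octets` and `sjis_octets` are my own picks for illustration, not anything from Birdsite or a carrier spec.

```python
# Octet cost of one visible "character" under UTF-8, plus Shift JIS for comparison.
# Sample characters are illustrative choices, written as escapes to avoid copy/paste damage.
samples = {
    "English letter": "a",                       # U+0061
    "accented Latin": "\u00e9",                  # U+00E9, precomposed e-acute
    "CJK (common)": "\u65e5",                    # U+65E5
    "CJK (supplementary)": "\U00020000",         # U+20000, outside the 16-bit range
    # man + ZWJ + woman + ZWJ + girl: one visible emoji, five codepoints
    "emoji family": "\U0001f468\u200d\U0001f469\u200d\U0001f467",
}


def utf8_octets(text: str) -> int:
    """Number of octets the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def sjis_octets(text: str) -> int:
    """Number of octets the text occupies in Shift JIS (Python's 'shift_jis' codec)."""
    return len(text.encode("shift_jis"))


for label, text in samples.items():
    print(f"{label:22} {utf8_octets(text):2} octets in UTF-8")

# The same common CJK character in a legacy two-octet encoding.
print("U+65E5 in Shift JIS:", sjis_octets("\u65e5"), "octets")
```

The family emoji here comes to 18 octets: three 4-octet emoji plus two 3-octet joiners.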

@kaeedo I should say, Birdsite's limits *WERE* based on cell carrier limits. Two things have changed since 2007.

1/ Cell phones have, broadly speaking, all managed to support multi-part text messaging. Also, the rise of native clients has made the text component moot in many areas.

2/ Birdsite's limits are now entirely codepoint based. This fits between octets and graphemes.

@kaeedo
A codepoint is a single Unicode element (from U+0000 to U+10FFFF) and thus will take up at most 4 octets when expressed as UTF-8. Combining sequences (which Emojis use generously) allow a single grapheme (visually indivisible character) to be made up of many codepoints. So a "normal" CJK character (which can be expressed as a codepoint) is the same as an English letter for the sake of Birdsite limits. Emojis can cost multiple codepoints, so are more "expensive" in applying limits.
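
To make the octet / codepoint / grapheme distinction concrete, here is a rough Python sketch. `len()` on a Python `str` counts codepoints, which matches the counting described above. The `rough_graphemes` helper is a hypothetical heuristic of mine and is not a real grapheme segmenter (that would mean implementing Unicode's text segmentation rules or pulling in a library); it only handles combining marks, variation selectors, skin-tone modifiers, and ZWJ sequences.

```python
import unicodedata

ZWJ = "\u200d"  # zero-width joiner: glues several emoji into one visible glyph


def octets(text: str) -> int:
    """Size in octets when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def codepoints(text: str) -> int:
    """Python strings are sequences of codepoints, so len() is the codepoint count."""
    return len(text)


def rough_graphemes(text: str) -> int:
    """Very rough grapheme estimate. An assumption-laden heuristic, NOT full Unicode segmentation."""
    count = 0
    after_zwj = False  # True when the previous codepoint was a ZWJ
    for ch in text:
        extends_previous = (
            after_zwj                                  # joined onto the previous emoji
            or ch == ZWJ
            or unicodedata.combining(ch) != 0          # combining accents and similar marks
            or "\ufe00" <= ch <= "\ufe0f"              # variation selectors
            or "\U0001f3fb" <= ch <= "\U0001f3ff"      # emoji skin-tone modifiers
        )
        if not extends_previous:
            count += 1
        after_zwj = ch == ZWJ
    return count


examples = {
    "English letter": "a",
    "CJK character": "\u65e5",
    "e + combining acute": "e\u0301",
    "thumbs up + skin tone": "\U0001f44d\U0001f3fd",
    "emoji family of three": "\U0001f468\u200d\U0001f469\u200d\U0001f467",
}

for label, text in examples.items():
    print(f"{label:24} octets={octets(text):2} "
          f"codepoints={codepoints(text)} graphemes~{rough_graphemes(text)}")
```

Under codepoint counting, the CJK character and the English letter both cost 1, while the family emoji costs 5 even though it displays as a single glyph.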

@saramg
Thanks for the technical explanation. I usually rely on the runtime implementation of strings, and then don't worry about encoding until it forces me to.

@kaeedo I've had to spend entirely too long thinking about encodings so that other people don't have to. :)

Mastodon for Tech Folks

This Mastodon instance is for people interested in technology. Discussions aren't limited to technology, because tech folks shouldn't be limited to technology either!