I'm writing a bog-standard Unicode tokeniser to replace the crap one SQLite ships, and I'm wondering why I'm wasting my life writing C code again.

In the same vein as the old post I re-tooted, is there a Rust guru out there who can tell me if:

* Decent ICU bindings, or equivalent Unicode normalisation, case folding and word-break analysis, exist for Rust? (the last being key)
* Decent SQLite FTS5 custom tokeniser bindings or equivalent exist for Rust?

#Rust #Oxidisation #ICU #SQLite


@mjog Haven't worked with these problems but perhaps the unicode crates will do for the first point? lib.rs/crates/unicode-segmenta


@YaLTeR thanks for the pointer! I took a look and it might not handle word segmentation for CJK, Thai, etc., which is mostly the point of using ICU.

I'll check it out though, probably a good project to get my feet wet.
