An implementation of grapheme cluster boundaries from Unicode text segmentation (UAX 29), for Unicode version 15.0.0.
Quick start
go get "github.com/clipperhouse/uax29/v2/graphemes"
import "github.com/clipperhouse/uax29/v2/graphemes"
text := "Hello, 世界. Nice dog! 👍🐶"
tokens := graphemes.FromString(text)
for tokens.Next() { // Next() returns true until end of data
fmt.Println(tokens.Value()) // Do something with the current grapheme
}
A grapheme is a “single visible character”, which might be a simple as a single letter, or a complex emoji that consists of several Unicode code points.
Conformance
We use the Unicode test suite. Status:
APIs
If you have a string
text := "Hello, 世界. Nice dog! 👍🐶"
tokens := graphemes.FromString(text)
for tokens.Next() { // Next() returns true until end of data
fmt.Println(tokens.Value()) // Do something with the current grapheme
}
If you have an io.Reader
FromReader
embeds a bufio.Scanner
, so just use those methods.
r := getYourReader() // from a file or network maybe
tokens := graphemes.FromReader(r)
for tokens.Scan() { // Scan() returns true until error or EOF
fmt.Println(tokens.Text()) // Do something with the current grapheme
}
if tokens.Err() != nil { // Check the error
log.Fatal(tokens.Err())
}
If you have a []byte
b := []byte("Hello, 世界. Nice dog! 👍🐶")
tokens := graphemes.FromBytes(b)
for tokens.Next() { // Next() returns true until end of data
fmt.Println(tokens.Value()) // Do something with the current grapheme
}
Performance
On a Mac M2 laptop, we see around 200MB/s, or around 100 million graphemes per second. You should see ~constant memory, and no allocations.
Invalid inputs
Invalid UTF-8 input is considered undefined behavior. We test to ensure that bad inputs will not cause pathological outcomes, such as a panic or infinite loop. Callers should expect “garbage-in, garbage-out”.
Your pipeline should probably include a call to utf8.Valid()
.