README.md

An implementation of grapheme cluster boundaries from Unicode text segmentation (UAX 29), for Unicode version 15.0.0.

Quick start

go get "github.com/clipperhouse/uax29/v2/graphemes"

import "github.com/clipperhouse/uax29/v2/graphemes"

text := "Hello, 世界. Nice dog! 👍🐶"

tokens := graphemes.FromString(text)

for tokens.Next() {                     // Next() returns true until end of data
	fmt.Println(tokens.Value())         // Do something with the current grapheme
}

A grapheme is a “single visible character”, which might be as simple as a single letter, or as complex as an emoji that consists of several Unicode code points.
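To make the "several code points" point concrete, here is a small sketch that uses only the Go standard library (not this package). The helper name `sizes` and the sample strings are illustrative assumptions. Each sample renders as one visible character and should be treated as a single grapheme under UAX 29, yet the byte and code point counts differ.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// sizes reports the UTF-8 byte length and the code point (rune) count of s.
// Hypothetical helper for illustration; it does not segment graphemes.
func sizes(s string) (byteLen, runeCount int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	samples := []struct{ name, text string }{
		{"plain letter", "a"},
		{"e + combining acute accent", "e\u0301"},
		{"woman + zero width joiner + girl", "\U0001F469\u200D\U0001F467"},
	}
	for _, s := range samples {
		b, r := sizes(s.text)
		fmt.Printf("%-34s %2d bytes, %d code points\n", s.name, b, r)
	}
}
```

Neither `len` nor `utf8.RuneCountInString` answers "how many characters does the user see", which is the question grapheme segmentation is meant to answer.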

Conformance

We test conformance against the Unicode test suite.

APIs

If you have a `string`

text := "Hello, 世界. Nice dog! 👍🐶"

tokens := graphemes.FromString(text)

for tokens.Next() {                     // Next() returns true until end of data
	fmt.Println(tokens.Value())         // Do something with the current grapheme
}

If you have an `io.Reader`

`FromReader` embeds a `bufio.Scanner`, so use that type's methods: `Scan()`, `Text()`, and `Err()`.

r := getYourReader()                        // from a file or network maybe
tokens := graphemes.FromReader(r)

for tokens.Scan() {                         // Scan() returns true until error or EOF
	fmt.Println(tokens.Text())              // Do something with the current grapheme
}

if err := tokens.Err(); err != nil {        // Check the error
	log.Fatal(err)
}

If you have a `[]byte`

b := []byte("Hello, 世界. Nice dog! 👍🐶")

tokens := graphemes.FromBytes(b)

for tokens.Next() {                     // Next() returns true until end of data
	fmt.Println(tokens.Value())         // Do something with the current grapheme
}

Performance

On a Mac M2 laptop, we see around 200 MB/s, or around 100 million graphemes per second. You should see roughly constant memory use and no allocations.

Invalid inputs

Invalid UTF-8 input is considered undefined behavior. We test to ensure that bad inputs will not cause pathological outcomes, such as a panic or infinite loop. Callers should expect “garbage-in, garbage-out”.

Your pipeline should probably include a call to utf8.Valid().
