forked from toolshed/abra
		
	
		
			
				
	
	
		
			109 lines
		
	
	
		
			5.3 KiB
		
	
	
	
		
			Go
		
	
	
	
	
	
			
		
		
	
	
		
	
	
	
	
	
| /*
 | |
| Package uniseg implements Unicode Text Segmentation, Unicode Line Breaking, and
 | |
| string width calculation for monospace fonts. Unicode Text Segmentation conforms
 | |
| to Unicode Standard Annex #29 (https://unicode.org/reports/tr29/) and Unicode
 | |
| Line Breaking conforms to Unicode Standard Annex #14
 | |
| (https://unicode.org/reports/tr14/).
 | |
| 
 | |
| In short, using this package, you can split a string into grapheme clusters
 | |
| (what people would usually refer to as a "character"), into words, and into
 | |
| sentences. In the simplest case, this package lets you count the number
 | |
| of characters in a string, especially when it contains complex characters such
 | |
| as emojis, combining characters, or characters from Asian, Arabic, Hebrew, or
 | |
| other languages. Additionally, you can use it to implement line breaking (or
 | |
| "word wrapping"), that is, to determine where text can be broken over to the
 | |
| next line when the width of the line is not big enough to fit the entire text.
 | |
| Finally, you can use it to calculate the display width of a string for monospace
 | |
| fonts.
 | |
| 
 | |
| # Getting Started
 | |
| 
 | |
| If you just want to count the number of characters in a string, you can use
 | |
| [GraphemeClusterCount]. If you want to determine the display width of a string,
 | |
| you can use [StringWidth]. If you want to iterate over a string, you can use
 | |
| [Step], [StepString], or the [Graphemes] type (more convenient but less
 | |
| performant). Each of these provides all available information: grapheme clusters,
 | |
| word boundaries, sentence boundaries, line breaks, and monospace character
 | |
| widths. The specialized functions [FirstGraphemeCluster],
 | |
| [FirstGraphemeClusterInString], [FirstWord], [FirstWordInString],
 | |
| [FirstSentence], and [FirstSentenceInString] can be used if only one type of
 | |
| information is needed.
 | |
| 
 | |
| # Grapheme Clusters
 | |
| 
 | |
| Consider the rainbow flag emoji: 🏳️🌈. On most modern systems, it appears as one
 | |
| character. But its string representation actually has 14 bytes, so counting
 | |
| bytes (or using len("🏳️🌈")) will not work as expected. Counting runes won't,
 | |
| either: The flag has 4 Unicode code points, thus 4 runes. The stdlib function
 | |
| utf8.RuneCountInString("🏳️🌈") and len([]rune("🏳️🌈")) will both return 4.
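
The byte and rune counts quoted above can be checked with the standard library alone. The sketch below writes the flag as explicit escape sequences so that every code point is visible, including the zero width joiner that is invisible when the emoji is pasted as text. The constant name rainbowFlag is only an illustrative choice and is not part of this package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// rainbowFlag is the rainbow flag emoji written as explicit code points:
// U+1F3F3 (white flag), U+FE0F (Variation Selector-16),
// U+200D (zero width joiner), U+1F308 (rainbow).
const rainbowFlag = "\U0001F3F3\uFE0F\u200D\U0001F308"

func main() {
	fmt.Println(len(rainbowFlag))                    // number of bytes
	fmt.Println(utf8.RuneCountInString(rainbowFlag)) // number of code points
	fmt.Println(len([]rune(rainbowFlag)))            // same count via conversion
}
```

Neither number matches what a reader perceives as a single character, which is the gap that grapheme cluster segmentation closes.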
 | |
| 
 | |
| The [GraphemeClusterCount] function will return 1 for the rainbow flag emoji.
 | |
| The [Graphemes] type and a variety of functions in this package allow you to
 | |
| split strings into their grapheme clusters.
 | |
| 
 | |
| # Word Boundaries
 | |
| 
 | |
| Word boundaries are used in a number of different contexts. The most familiar
 | |
| ones are selection (double-click mouse selection), cursor movement ("move to
 | |
| next word" control-arrow keys), and the dialog option "Whole Word Search" for
 | |
| search and replace. This package provides methods for determining word
 | |
| boundaries.
 | |
| 
 | |
| # Sentence Boundaries
 | |
| 
 | |
| Sentence boundaries are often used for triple-click or some other method of
 | |
| selecting or iterating through blocks of text that are larger than single words.
 | |
| They are also used to determine whether words occur within the same sentence in
 | |
| database queries. This package provides methods for determining sentence
 | |
| boundaries.
 | |
| 
 | |
| # Line Breaking
 | |
| 
 | |
| Line breaking, also known as word wrapping, is the process of breaking a section
 | |
| of text into lines such that it will fit in the available width of a page,
 | |
| window or other display area. This package provides methods to determine the
 | |
| positions in a string where a line must be broken, may be broken, or must not be
 | |
| broken.
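
To make the idea of break opportunities concrete, here is a deliberately simplified sketch that uses only the standard library. It is not this package's algorithm and does not implement Unicode Standard Annex #14: it assumes that lines may be broken only at whitespace and it measures width in runes, which is wrong for wide or combining characters. The helper name wrapGreedy is hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// wrapGreedy is a hypothetical helper: it splits text at whitespace and packs
// words greedily into lines of at most width runes. A word longer than width
// is placed on a line of its own and is not split.
func wrapGreedy(text string, width int) []string {
	var lines []string
	var current string
	for _, word := range strings.Fields(text) {
		switch {
		case current == "":
			// First word on the line always fits by definition.
			current = word
		case utf8.RuneCountInString(current)+1+utf8.RuneCountInString(word) <= width:
			// The word fits after a single separating space.
			current += " " + word
		default:
			// The word does not fit: finish this line and start a new one.
			lines = append(lines, current)
			current = word
		}
	}
	if current != "" {
		lines = append(lines, current)
	}
	return lines
}

func main() {
	for _, line := range wrapGreedy("The quick brown fox jumps over the lazy dog", 10) {
		fmt.Println(line)
	}
}
```

A real implementation would replace the whitespace split with the break opportunities described in this section and the rune count with a monospace width calculation.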
 | |
| 
 | |
| # Monospace Width
 | |
| 
 | |
| Monospace width, as referred to in this package, is the width of a string in a
 | |
| monospace font. This is commonly used in terminal user interfaces or text
 | |
| displays or editors that don't support proportional fonts. A width of 1
 | |
| corresponds to a single character cell. The C function [wcswidth()] and its
 | |
| implementations in other programming languages are in widespread use for the same
 | |
| purpose. However, there is no standard for the calculation of such widths, and
 | |
| this package differs from wcswidth() in a number of ways, presumably to generate
 | |
| more visually pleasing results.
 | |
| 
 | |
| To start, we assume that every code point has a width of 1, with the following
 | |
| exceptions:
 | |
| 
 | |
|   - Code points with grapheme cluster break properties Control, CR, LF, Extend,
 | |
|     and ZWJ have a width of 0.
 | |
|   - U+2E3A, Two-Em Dash, has a width of 3.
 | |
|   - U+2E3B, Three-Em Dash, has a width of 4.
 | |
|   - Characters with the East-Asian Width properties "Fullwidth" (F) and "Wide"
 | |
|     (W) have a width of 2. (Properties "Ambiguous" (A) and "Neutral" (N) both
 | |
|     have a width of 1.)
 | |
|   - Code points with grapheme cluster break property Regional Indicator have a
 | |
|     width of 2.
 | |
|   - Code points with grapheme cluster break property Extended Pictographic have
 | |
|     a width of 2, unless their Emoji Presentation flag is "No", in which case
 | |
|     the width is 1.
 | |
| 
 | |
| For Hangul grapheme clusters composed of conjoining Jamo and for Regional
 | |
| Indicators (flags), all code points except the first one have a width of 0. For
 | |
| grapheme clusters starting with an Extended Pictographic, any additional code
 | |
| point will force a total width of 2, except if the Variation Selector-15
 | |
| (U+FE0E) is included, in which case the total width is always 1. Grapheme
 | |
| clusters ending with Variation Selector-16 (U+FE0F) have a width of 2.
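
A few of the per-code-point exceptions listed above can be sketched with the standard library. This is a partial illustration under stated assumptions, not this package's implementation: the helper name approxRuneWidth is hypothetical, the Control property is approximated with unicode.IsControl (which covers CR and LF but not every code point with that property), and the Extend, East Asian Width, Regional Indicator, and Extended Pictographic rules are left out because they require Unicode property tables.

```go
package main

import (
	"fmt"
	"unicode"
)

// approxRuneWidth is a hypothetical helper covering only part of the rules:
// control characters (including CR and LF) and ZWJ get width 0, the two long
// dashes get 3 and 4, and everything else falls back to the default of 1.
// Wide, fullwidth, and emoji code points are therefore reported as 1 here,
// which differs from the full rule set.
func approxRuneWidth(r rune) int {
	switch {
	case unicode.IsControl(r): // category Cc, which includes CR and LF
		return 0
	case r == '\u200D': // zero width joiner (ZWJ)
		return 0
	case r == '\u2E3A': // two-em dash
		return 3
	case r == '\u2E3B': // three-em dash
		return 4
	default:
		return 1
	}
}

func main() {
	for _, r := range []rune{'a', '\n', '\u200D', '\u2E3A', '\u2E3B'} {
		fmt.Printf("%U -> %d\n", r, approxRuneWidth(r))
	}
}
```

The cluster-level rules (Hangul, flags, and variation selectors) operate on whole grapheme clusters, so they cannot be expressed by a per-rune function like this one.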
 | |
| 
 | |
| Note that whether these widths appear correct depends on your application's
 | |
| rendering engine, the extent to which it conforms to the Unicode Standard, and
 | |
| its choice of font.
 | |
| 
 | |
| [wcswidth()]: https://man7.org/linux/man-pages/man3/wcswidth.3.html
 | |
| */
 | |
| package uniseg
 |