textmetrics/count
Counts of words, sentences, syllables, characters, paragraphs and
polysyllables — the primitives consumed by readability scores in
textmetrics/readability.
Functions in this module are pure, deterministic, and O(n) in
the length of their input. They iterate over extended grapheme
clusters (via gleam/string.to_graphemes), not raw bytes, so
"naïve" is 5 graphemes and 1 word regardless of whether the
ï is encoded as a single precomposed code point or as i plus a
combining mark.
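As a minimal illustration of the grapheme claim above, the sketch below compares grapheme and code point counts for the two encodings of "naïve". It uses only gleam/string and gleam/list; the helper names and constants are hypothetical and not part of this module.

```gleam
import gleam/list
import gleam/string

// Precomposed form: U+00EF LATIN SMALL LETTER I WITH DIAERESIS.
pub const precomposed = "na\u{00EF}ve"

// Decomposed form: "i" followed by U+0308 COMBINING DIAERESIS.
pub const decomposed = "nai\u{0308}ve"

/// Number of extended grapheme clusters in `text` (hypothetical helper).
pub fn grapheme_count(text: String) -> Int {
  text |> string.to_graphemes |> list.length
}

/// Number of Unicode code points in `text` (hypothetical helper).
pub fn codepoint_count(text: String) -> Int {
  text |> string.to_utf_codepoints |> list.length
}
```

Both forms yield 5 graphemes, while the code point counts differ (5 for the precomposed form, 6 for the decomposed one), which is why counting by grapheme keeps the results stable across encodings.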
All language-specific heuristics in this module are tuned for
English. Behaviour on other scripts is documented per
function and biased towards “do nothing surprising”: CJK without
whitespace stays one word; non-English text in
syllables_in_word returns 1.
Values
pub fn characters(text: String) -> Int
Count characters that contribute to readability formulas: ASCII
letters plus ASCII digits, excluding whitespace and ASCII
punctuation. Non-ASCII graphemes that look letter-like (Latin-1
accents, CJK ideographs) also count, mirroring the behaviour of
textstat’s char_count.
Examples:
count.characters("Hello, World!") // 10
count.characters("123 abc") // 6
pub fn paragraphs(text: String) -> Int
Count paragraphs. A paragraph is a maximal run of non-blank lines separated by one or more blank lines. Trailing blank lines do not produce empty paragraphs.
Examples:
count.paragraphs("a\n\nb\n\nc") // 3
count.paragraphs("one line") // 1
count.paragraphs("") // 0
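The rule above can be sketched as a single fold over lines. This is an illustrative reimplementation, not the module's source: paragraphs_sketch is a hypothetical name, and treating whitespace-only lines as blank and splitting on "\n" only are assumptions.

```gleam
import gleam/list
import gleam/string

/// Count maximal runs of non-blank lines (hypothetical sketch).
pub fn paragraphs_sketch(text: String) -> Int {
  let #(count, _in_paragraph) =
    text
    |> string.split("\n")
    |> list.fold(#(0, False), fn(state, line) {
      let #(count, in_paragraph) = state
      // Assumption: a line holding only whitespace counts as blank.
      let is_blank = string.trim(line) == ""
      case is_blank, in_paragraph {
        // A blank line closes any open paragraph.
        True, _ -> #(count, False)
        // The first non-blank line after a gap opens a new paragraph.
        False, False -> #(count + 1, True)
        // Still inside the current paragraph.
        False, True -> state
      }
    })
  count
}
```

Because a paragraph is only counted when a non-blank line opens it, trailing blank lines never add to the total.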
pub fn polysyllables(text: String) -> Int
Count words with three or more syllables in text. This is the
“polysyllable” count consumed by SMOG with no exclusions; the
stricter Gunning-Fog “complex word” count is computed inline
inside readability.gunning_fog.
pub fn sentences(text: String) -> Int
Count sentences. Sentence terminators are ., !, ?. A run of
consecutive terminators counts as one boundary (so "What?!" is
one sentence). A trailing non-empty fragment that lacks a
terminator still counts as a sentence ("hello" → 1).
This implementation does not special-case abbreviations such as
Mr., Dr. or e.g., so text dense in such abbreviations will be
over-segmented. Callers that need abbreviation-aware segmentation
should pre-process the text before counting.
Empty input returns 0.
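The boundary rules above could be approximated as follows. This is a sketch under stated assumptions rather than the module's implementation: sentences_sketch is a hypothetical name, and it assumes a terminator only closes a sentence when some non-whitespace text precedes it.

```gleam
import gleam/list
import gleam/string

/// Count sentences ended by ".", "!" or "?" (hypothetical sketch).
pub fn sentences_sketch(text: String) -> Int {
  let #(count, pending) =
    text
    |> string.to_graphemes
    |> list.fold(#(0, False), fn(state, grapheme) {
      // `pending` is True while unterminated, non-whitespace text is open.
      let #(count, pending) = state
      case grapheme {
        "." | "!" | "?" ->
          case pending {
            // First terminator of a run closes the open sentence.
            True -> #(count + 1, False)
            // Later terminators in the same run add nothing.
            False -> state
          }
        _ ->
          case string.trim(grapheme) == "" {
            // Whitespace never opens a sentence.
            True -> state
            False -> #(count, True)
          }
      }
    })
  // A trailing fragment without a terminator still counts.
  case pending {
    True -> count + 1
    False -> count
  }
}
```

Under these assumptions "Dr. Smith arrived." counts as 2 sentences, which illustrates the over-segmentation described above.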
pub fn syllables(text: String) -> Int
Count syllables in text. Sums syllables_in_word
over each word found by words.
pub fn syllables_in_word(word: String) -> Int
Count syllables in a single word using an English heuristic:
- Lowercase the word.
- Strip non-ASCII-letter graphemes.
- Count maximal vowel groups in a e i o u y, with y only counting
  at non-initial position.
- Subtract one if the word ends in a silent e (the preceding letter
  being a consonant).
- Floor at 1.
Examples:
count.syllables_in_word("the") // 1
count.syllables_in_word("hello") // 2
count.syllables_in_word("syllable") // 3
count.syllables_in_word("rhythm") // 1 (no a, e, i, o or u; the non-initial y is the only vowel group)
Returns 0 for an empty input. Returns 1 for non-English words
that contain no ASCII letters.
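The listed steps can be sketched directly in Gleam. This is a hypothetical reimplementation of those steps only (syllables_in_word_sketch and its helpers are not part of this module), and it assumes that y counts as a vowel when checking the letter before a final e. Note that the steps alone give 2 for "syllable", whereas the example above reports 3, which suggests the library applies refinements beyond the listed steps.

```gleam
import gleam/int
import gleam/list
import gleam/string

fn is_ascii_letter(grapheme: String) -> Bool {
  string.contains("abcdefghijklmnopqrstuvwxyz", grapheme)
}

// "y" is treated as a vowel only at a non-initial position.
fn is_vowel(letter: String, index: Int) -> Bool {
  string.contains("aeiou", letter) || { letter == "y" && index > 0 }
}

/// Syllable estimate following only the listed steps (hypothetical sketch).
pub fn syllables_in_word_sketch(word: String) -> Int {
  // Steps 1 and 2: lowercase, then keep ASCII letters only.
  let letters =
    word
    |> string.lowercase
    |> string.to_graphemes
    |> list.filter(is_ascii_letter)

  // Step 3: count maximal runs of vowels.
  let #(groups, _in_group) =
    list.index_fold(letters, #(0, False), fn(state, letter, index) {
      let #(groups, in_group) = state
      case is_vowel(letter, index), in_group {
        True, False -> #(groups + 1, True)
        True, True -> state
        False, _ -> #(groups, False)
      }
    })

  // Step 4: a final "e" preceded by a consonant is treated as silent.
  let silent_e = case list.reverse(letters) {
    ["e", previous, ..] -> !string.contains("aeiouy", previous)
    _ -> False
  }
  let adjusted = case silent_e {
    True -> groups - 1
    False -> groups
  }

  // Step 5: floor at 1, except that empty input returns 0.
  case word {
    "" -> 0
    _ -> int.max(adjusted, 1)
  }
}
```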
pub fn words(text: String) -> Int
Count words. A word is a maximal run of “letter-like” graphemes separated by whitespace or punctuation.
A grapheme counts as letter-like when its first code point is an
ASCII letter (a-z / A-Z), an ASCII digit, or any non-ASCII
character (covering Latin-1 letters, CJK ideographs, accented
letters delivered as a single grapheme, etc.). Whitespace and
ASCII punctuation are word boundaries.
Examples:
count.words("") // 0
count.words("hello") // 1
count.words("hello world") // 2
count.words("hello, world!") // 2
count.words("hello   world") // 2 (whitespace collapses)
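The letter-like test and the run counting described above might look like the following. It is a sketch that follows the stated rule literally, so every non-ASCII first code point is treated as letter-like; is_letter_like and words_sketch are hypothetical names, not this module's API.

```gleam
import gleam/list
import gleam/string

// Classify a grapheme by its first code point, per the rule above.
fn is_letter_like(grapheme: String) -> Bool {
  case string.to_utf_codepoints(grapheme) {
    [first, ..] -> {
      let code = string.utf_codepoint_to_int(first)
      let is_lower = code >= 97 && code <= 122
      let is_upper = code >= 65 && code <= 90
      let is_digit = code >= 48 && code <= 57
      let is_non_ascii = code > 127
      is_lower || is_upper || is_digit || is_non_ascii
    }
    [] -> False
  }
}

/// Count maximal runs of letter-like graphemes (hypothetical sketch).
pub fn words_sketch(text: String) -> Int {
  let #(count, _in_word) =
    text
    |> string.to_graphemes
    |> list.fold(#(0, False), fn(state, grapheme) {
      let #(count, in_word) = state
      case is_letter_like(grapheme), in_word {
        // A letter-like grapheme after a boundary starts a new word.
        True, False -> #(count + 1, True)
        True, True -> state
        // Whitespace and ASCII punctuation end the current word.
        False, _ -> #(count, False)
      }
    })
  count
}
```

Counting a word only at the transition into a run is what makes repeated whitespace or punctuation collapse into a single boundary.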