textmetrics/readability
Canonical English-language readability scores.
All six scores in this module — Flesch Reading Ease, Flesch–Kincaid
Grade Level, Gunning Fog, SMOG, Automated Readability Index, and
Coleman–Liau Index — are computed from the count primitives
exposed by textmetrics/count. The functions are
pure, deterministic, and O(n) in input length.
Returned scores are Float values; callers should round or
quantise to fit their reporting needs. Empty or extremely small
inputs return an Error carrying a ReadabilityError rather than a
non-finite number.
The syllable counter consumed by these formulas is an
English-tuned heuristic (see
count.syllables_in_word).
Non-English text will produce scores that match textstat’s
fallback behaviour but should not be interpreted as meaningful
grade-level estimates.
Reference scores produced by these implementations agree with
Python textstat (the de-facto reference) to within roughly
±2 on the Reading Ease 0–100 scale and ±1 on grade-level scales,
over a corpus of fixtures from the Wikipedia readability articles.
Types
Errors returned when a readability score cannot be computed because the input does not meet the minimum-size precondition of the underlying formula.
pub type ReadabilityError {
TooFewWords(at_least: Int, got: Int)
TooFewSentences(at_least: Int, got: Int)
}
Constructors
- TooFewWords(at_least: Int, got: Int)
  The input had fewer than at_least words. got is the actual count.
- TooFewSentences(at_least: Int, got: Int)
  The input had fewer than at_least sentences. got is the actual count.
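A minimal sketch of how a caller might handle both constructors. The helper name report is hypothetical and not part of this module; it also rounds the score to one decimal place with float.to_precision from the Gleam standard library, as one way to quantise for reporting.

```gleam
import gleam/float
import gleam/int
import textmetrics/readability.{TooFewSentences, TooFewWords}

/// Hypothetical helper (not part of textmetrics): turn a score or an
/// error into a short, human-readable report line.
pub fn report(text: String) -> String {
  case readability.flesch_kincaid_grade(text) {
    // Round to one decimal place before printing.
    Ok(grade) -> "grade " <> float.to_string(float.to_precision(grade, 1))
    Error(TooFewWords(at_least, got)) ->
      "need at least "
      <> int.to_string(at_least)
      <> " words, got "
      <> int.to_string(got)
    Error(TooFewSentences(at_least, got)) ->
      "need at least "
      <> int.to_string(at_least)
      <> " sentences, got "
      <> int.to_string(got)
  }
}
```

Matching on both constructors keeps the case expression exhaustive, so the compiler will flag this code if a new error variant is ever added.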
Values
pub fn automated_readability_index(
text: String,
) -> Result(Float, ReadabilityError)
Automated Readability Index (ARI), Smith & Senter (1967).
4.71 × (characters/words) + 0.5 × (words/sentences) − 21.43
The characters count is letters + digits + accented graphemes
(i.e. count.characters) and excludes
whitespace and punctuation. ARI and Coleman–Liau are the only formulas
in this module that treat digits as score-bearing characters; texts
containing large numeric runs will score correspondingly higher.
The result is clamped to [0.0, 18.0]; use
automated_readability_index_unbounded
for the raw value.
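As a worked illustration of the formula above, the sketch below computes the raw ARI from counts supplied by the caller. The function name ari_from_counts is hypothetical, and the sketch assumes words and sentences are both greater than zero (the module's own functions return a ReadabilityError in that case instead).

```gleam
import gleam/int

/// Hypothetical helper: raw (unclamped) ARI from pre-computed counts.
/// Assumes words > 0 and sentences > 0.
pub fn ari_from_counts(characters: Int, words: Int, sentences: Int) -> Float {
  let c = int.to_float(characters)
  let w = int.to_float(words)
  let s = int.to_float(sentences)
  4.71 *. { c /. w } +. 0.5 *. { w /. s } -. 21.43
}
```

With 100 words in 5 sentences, 450 characters gives roughly 9.77, while 500 characters gives roughly 12.12: every extra character per word, digit or letter, adds 4.71 to the score.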
pub fn automated_readability_index_unbounded(
text: String,
) -> Result(Float, ReadabilityError)
Automated Readability Index without the [0.0, 18.0] clamp.
pub fn coleman_liau_index(
text: String,
) -> Result(Float, ReadabilityError)
Coleman–Liau Index, Coleman & Liau (1975).
0.0588 × L − 0.296 × S − 15.8
where
- L = average number of letters per 100 words (characters / words × 100)
- S = average number of sentences per 100 words (sentences / words × 100)
The output approximates the US grade level at which a reader is
expected to read the text comfortably. Like ARI, this formula uses the
count.characters definition (letters + digits), so digit-heavy text
scores slightly higher than its pure-prose equivalent.
The result is clamped to [0.0, 18.0]; use
coleman_liau_index_unbounded for
the raw value.
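The per-100-words normalisation of L and S can be sketched as follows. The function name coleman_liau_from_counts is hypothetical, the counts are supplied by the caller, and words is assumed to be greater than zero.

```gleam
import gleam/int

/// Hypothetical helper: raw (unclamped) Coleman–Liau Index from
/// pre-computed counts. Assumes words > 0.
pub fn coleman_liau_from_counts(
  characters: Int,
  words: Int,
  sentences: Int,
) -> Float {
  let w = int.to_float(words)
  // L: characters per 100 words; S: sentences per 100 words.
  let l = int.to_float(characters) /. w *. 100.0
  let s = int.to_float(sentences) /. w *. 100.0
  0.0588 *. l -. 0.296 *. s -. 15.8
}
```

For 450 characters, 100 words, and 5 sentences, L is 450 and S is 5, giving 0.0588 × 450 − 0.296 × 5 − 15.8, or about 9.18.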
pub fn coleman_liau_index_unbounded(
text: String,
) -> Result(Float, ReadabilityError)
Coleman–Liau Index without the [0.0, 18.0] clamp.
pub fn flesch_kincaid_grade(
text: String,
) -> Result(Float, ReadabilityError)
Flesch–Kincaid Grade Level.
0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59
The output approximates the US school grade required to comprehend
the text. The result is clamped to [0.0, 18.0] (US K–12 plus
graduate range) so that degenerate inputs cannot yield values such
as -2.88 or 49+. Use
flesch_kincaid_grade_unbounded
for the raw value.
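To make the effect of the clamp concrete, here is a sketch that applies the formula to caller-supplied counts. Both function names are hypothetical, and the counts are assumed to be positive. Seven one-syllable words spread over three sentences is one input for which the raw formula gives about -2.88.

```gleam
import gleam/float
import gleam/int

/// Hypothetical helper: raw Flesch–Kincaid grade from pre-computed counts.
/// Assumes words > 0 and sentences > 0.
pub fn fk_raw(words: Int, sentences: Int, syllables: Int) -> Float {
  let w = int.to_float(words)
  let s = int.to_float(sentences)
  let syl = int.to_float(syllables)
  0.39 *. { w /. s } +. 11.8 *. { syl /. w } -. 15.59
}

/// The same value limited to the documented [0.0, 18.0] range.
pub fn fk_clamped(words: Int, sentences: Int, syllables: Int) -> Float {
  fk_raw(words, sentences, syllables)
  |> float.clamp(min: 0.0, max: 18.0)
}
```

Here fk_raw(7, 3, 7) is approximately -2.88, while fk_clamped(7, 3, 7) is 0.0.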
pub fn flesch_kincaid_grade_unbounded(
text: String,
) -> Result(Float, ReadabilityError)
Flesch–Kincaid Grade Level without the [0.0, 18.0] clamp.
pub fn flesch_reading_ease(
text: String,
) -> Result(Float, ReadabilityError)
Flesch Reading Ease, original Flesch (1948) formula.
206.835 − 1.015 × (words/sentences) − 84.6 × (syllables/words)
Higher is easier. The classic interpretation bands:
- 90–100: 5th grade reader
- 80–90: 6th grade
- 70–80: 7th grade
- 60–70: 8th–9th grade (“plain English”)
- 50–60: 10th–12th grade
- 30–50: college
- 0–30: college graduate
The result is clamped to [0.0, 100.0] to match the standard
reporting convention used by Wikipedia, Microsoft Word, Python
textstat’s default, and most readability UIs. Use
flesch_reading_ease_unbounded
when you need the raw formula output (which can exceed 100 for
unusually short or syllable-poor text, and drop below 0 for
unusually dense academic prose).
Returns TooFewWords for input with no words,
and TooFewSentences for input with no
sentence-shaped content.
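The interpretation bands listed above could be applied to a clamped score along these lines. The helper reading_ease_band is hypothetical and not part of this module. Because the published bands share their endpoints, the sketch assumes that a boundary value (for example exactly 90.0) belongs to the easier band.

```gleam
/// Hypothetical helper: map a clamped Reading Ease score to its
/// classic band label. Boundary values are assigned to the easier band,
/// which is an assumption of this sketch.
pub fn reading_ease_band(score: Float) -> String {
  case score {
    s if s >=. 90.0 -> "5th grade"
    s if s >=. 80.0 -> "6th grade"
    s if s >=. 70.0 -> "7th grade"
    s if s >=. 60.0 -> "8th-9th grade"
    s if s >=. 50.0 -> "10th-12th grade"
    s if s >=. 30.0 -> "college"
    _ -> "college graduate"
  }
}
```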
pub fn flesch_reading_ease_unbounded(
text: String,
) -> Result(Float, ReadabilityError)
Flesch Reading Ease without the standard [0.0, 100.0] clamp.
Returns the raw 206.835 − 1.015 × (words/sentences) − 84.6 ×
(syllables/words) value, which can exceed 100 for unusually
short text and drop below 0 for unusually dense prose.
pub fn gunning_fog(
text: String,
) -> Result(Float, ReadabilityError)
Gunning Fog Index, Robert Gunning (1952).
0.4 × ((words/sentences) + 100 × (polysyllables/words))
A polysyllable here is a word with three or more syllables. The
original Gunning rules excluded proper nouns, hyphenated compounds,
and inflected forms (-es / -ed / -ing) — this implementation
follows Python textstat and does not apply those
exclusions, so scores match textstat rather than the strict
1952 paper. Callers needing the strict variant can subtract their
own exclusion count from
count.polysyllables before applying
the formula directly.
Output approximates the years of formal education required to
understand the text on first reading. The result is clamped to
[0.0, 18.0]; use gunning_fog_unbounded
for the raw value.
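The strict variant mentioned above might be sketched as follows. The function strict_fog is hypothetical, and all four counts are supplied by the caller: excluded is the caller's own tally of polysyllabic proper nouns, hyphenated compounds, and inflected forms. Words and sentences are assumed to be greater than zero.

```gleam
import gleam/int

/// Hypothetical helper: Gunning Fog with the caller's exclusion count
/// subtracted from the polysyllable count. Returns the raw, unclamped value.
/// Assumes words > 0 and sentences > 0.
pub fn strict_fog(
  words: Int,
  sentences: Int,
  polysyllables: Int,
  excluded: Int,
) -> Float {
  // Never let the exclusions push the complex-word count below zero.
  let complex = int.to_float(int.max(polysyllables - excluded, 0))
  let w = int.to_float(words)
  let s = int.to_float(sentences)
  0.4 *. { w /. s +. 100.0 *. { complex /. w } }
}
```

For 100 words in 5 sentences with 12 polysyllables, excluding 2 of them lowers the index from about 12.8 to about 12.0.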
pub fn gunning_fog_unbounded(
text: String,
) -> Result(Float, ReadabilityError)
Gunning Fog Index without the [0.0, 18.0] clamp.
pub fn smog(text: String) -> Result(Float, ReadabilityError)
Simple Measure of Gobbledygook (SMOG), McLaughlin (1969).
1.043 × sqrt(polysyllables × (30/sentences)) + 3.1291
SMOG is statistically reliable only for texts of 30 sentences or
more — McLaughlin’s regression was calibrated on samples of that
size, and applying the formula to shorter passages compounds
estimation error. This implementation therefore returns
TooFewSentences when the input has fewer
than 30 sentences.
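The formula itself, including the 30/sentences scaling, can be sketched from caller-supplied counts. The function smog_from_counts is hypothetical; it does not enforce the 30-sentence gate and assumes sentences is greater than zero. It returns a Result because float.square_root in the Gleam standard library does.

```gleam
import gleam/float
import gleam/int
import gleam/result

/// Hypothetical helper: SMOG grade from pre-computed counts, with no
/// minimum-sentence check. Assumes sentences > 0.
pub fn smog_from_counts(
  polysyllables: Int,
  sentences: Int,
) -> Result(Float, Nil) {
  // Scale the polysyllable count to a 30-sentence sample.
  let scaled =
    int.to_float(polysyllables) *. { 30.0 /. int.to_float(sentences) }
  float.square_root(scaled)
  |> result.map(fn(root) { 1.043 *. root +. 3.1291 })
}
```

With 36 polysyllables in exactly 30 sentences the scaling factor is 1, so the grade is 1.043 × 6 + 3.1291, or about 9.39.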
pub fn smog_g(text: String) -> Result(Float, ReadabilityError)
SMOG-G applies the SMOG formula to texts shorter than 30 sentences,
relying on the 30 / sentences scaling already used inside SMOG to
extrapolate the polysyllable count to a 30-sentence sample.
Motivation (issue #23): real-world snippets (a Wikipedia paragraph,
a press release, a tweet, an email) almost never reach 30 sentences,
so the strict SMOG gate rejects them all. SMOG-G drops the gate
and returns the extrapolated grade for any input with at least one
sentence.
Use smog when you have 30+ sentences and need the
statistically calibrated form; use smog_g for everything else.
The two agree to within ~1 grade for inputs of 30+ sentences.
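One way to apply that guidance is to try smog first and fall back to smog_g only when the sentence gate is the reason for failure. The wrapper name smog_any_length is hypothetical and not part of this module.

```gleam
import textmetrics/readability.{type ReadabilityError, TooFewSentences}

/// Hypothetical wrapper: prefer the calibrated SMOG score and fall back
/// to SMOG-G when the input has too few sentences for it.
pub fn smog_any_length(text: String) -> Result(Float, ReadabilityError) {
  case readability.smog(text) {
    // Below the sentence gate: use the extrapolated form instead.
    Error(TooFewSentences(_, _)) -> readability.smog_g(text)
    // A successful score, or any other error, is passed through unchanged.
    other -> other
  }
}
```

A caller that needs to know which form produced the score could return a tagged value instead of a bare Float.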