textmetrics/similarity

Similarity scores in the closed interval [0.0, 1.0].

1.0 means “identical”, 0.0 means “no similarity by this metric”. No function in this module returns NaN or a negative Float.

All string-typed functions operate on extended grapheme clusters and do not normalize their inputs — callers wanting NFC equivalence must normalize first.
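
To see why normalization matters, consider the two Unicode spellings of "é". The sketch below uses only gleam/string from the Gleam standard library; same_graphemes is a hypothetical helper written for illustration, not a function of this module.

```gleam
import gleam/string

/// Hypothetical helper: compare two strings cluster by cluster,
/// without any Unicode normalization.
pub fn same_graphemes(a: String, b: String) -> Bool {
  string.to_graphemes(a) == string.to_graphemes(b)
}

pub fn main() {
  // "é" as one precomposed code point (the NFC spelling).
  let composed = "\u{00E9}"
  // "é" as "e" followed by a combining acute accent (the NFD spelling).
  let decomposed = "e\u{0301}"

  // Each spelling forms a single grapheme cluster, but the clusters are
  // different strings, so an unnormalized comparison sees a mismatch.
  let assert False = same_graphemes(composed, decomposed)
}
```

Normalizing both inputs to the same form before calling into this module removes the mismatch; the Gleam standard library does not provide a normalizer, so that step needs a separate package or target-specific code.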

Types

Validated parameter bag for jaro_winkler_with.

A value of this type is guaranteed to encode parameters that keep Jaro-Winkler output inside [0.0, 1.0]. Construct via default_jaro_winkler_config or jaro_winkler_config.

pub opaque type JaroWinklerConfig

Returned by jaro_winkler_config when its arguments fall outside the validated range.

pub type JaroWinklerConfigError {
  PrefixScaleOutOfRange(got: Float)
  PrefixMaxNegative(got: Int)
}

Returned by sorensen_dice when given an n-gram size below 1.

pub type SorensenDiceError {
  NgramSizeInvalid(got: Int)
}

Values

pub fn default_jaro_winkler_config() -> JaroWinklerConfig

Winkler-1990 defaults: prefix_scale = 0.1, prefix_max = 4.

pub fn jaro(a: String, b: String) -> Float

Jaro similarity at the grapheme level.

Edge cases (defined by convention):

  • jaro("", "") = 1.0
  • jaro("", b) = 0.0 for non-empty b
  • jaro(a, "") = 0.0 for non-empty a

Time O(m·n), space O(m + n), where m and n are the grapheme counts of a and b.
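
A short usage sketch of the edge cases above, assuming the module is imported as textmetrics/similarity. The helper is_probable_match and its 0.9 threshold are hypothetical, chosen only for illustration.

```gleam
import textmetrics/similarity

/// Hypothetical helper: treat two strings as a probable match when their
/// Jaro score reaches an illustrative threshold of 0.9.
pub fn is_probable_match(a: String, b: String) -> Bool {
  similarity.jaro(a, b) >=. 0.9
}

pub fn main() {
  // Two empty strings score 1.0 by convention, so they match.
  let assert True = is_probable_match("", "")
  // An empty string against a non-empty one scores 0.0, so it never matches.
  let assert False = is_probable_match("", "martha")
  let assert False = is_probable_match("martha", "")
}
```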

pub fn jaro_winkler(a: String, b: String) -> Float

Jaro-Winkler similarity using Winkler-1990 defaults (prefix_scale = 0.1, prefix_max = 4).
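
For orientation, the standard Winkler adjustment adds a bonus proportional to the length of the common prefix: score = jaro + prefix_length · prefix_scale · (1 − jaro). The sketch below assumes this module follows that standard formulation. common_prefix_length and winkler_boost are hypothetical helpers built on the Gleam standard library, not functions exported here.

```gleam
import gleam/int
import gleam/list
import gleam/string

/// Count the leading graphemes shared by `a` and `b`, capped at `prefix_max`.
pub fn common_prefix_length(a: String, b: String, prefix_max: Int) -> Int {
  list.zip(string.to_graphemes(a), string.to_graphemes(b))
  |> list.take_while(fn(pair) {
    let #(left, right) = pair
    left == right
  })
  |> list.length
  |> int.min(prefix_max)
}

/// Apply the standard Winkler adjustment to an existing Jaro score.
pub fn winkler_boost(
  jaro_score: Float,
  prefix_length: Int,
  prefix_scale: Float,
) -> Float {
  // The bonus shrinks as the Jaro score approaches 1.0.
  jaro_score
  +. int.to_float(prefix_length) *. prefix_scale *. { 1.0 -. jaro_score }
}

pub fn main() {
  // "martha" and "marhta" share the prefix "mar".
  let assert 3 = common_prefix_length("martha", "marhta", 4)
  // A perfect Jaro score leaves no room for a bonus.
  let assert True = winkler_boost(1.0, 4, 0.1) == 1.0
}
```

With prefix_length at most 4 and prefix_scale at most 0.25, the product prefix_length · prefix_scale never exceeds 1, so the adjusted score cannot pass 1.0.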

pub fn jaro_winkler_config(
  prefix_scale prefix_scale: Float,
  prefix_max prefix_max: Int,
) -> Result(JaroWinklerConfig, JaroWinklerConfigError)

Construct a JaroWinklerConfig, or return a JaroWinklerConfigError if either invariant below is violated.

Invariants:

  • prefix_scale must be in [0.0, 0.25] (Winkler’s upper bound that keeps the score in [0, 1]).
  • prefix_max must be >= 0.
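
A sketch of handling both outcomes of the constructor, assuming the module is imported as textmetrics/similarity. explain_config is a hypothetical helper, and the message strings are illustrative only.

```gleam
import gleam/float
import gleam/int
import textmetrics/similarity

/// Hypothetical helper: describe whether a parameter pair is accepted.
pub fn explain_config(prefix_scale: Float, prefix_max: Int) -> String {
  case
    similarity.jaro_winkler_config(
      prefix_scale: prefix_scale,
      prefix_max: prefix_max,
    )
  {
    // Read the validated values back through the accessor functions.
    Ok(config) ->
      "accepted: scale "
      <> float.to_string(similarity.prefix_scale(config))
      <> ", cap "
      <> int.to_string(similarity.prefix_max(config))
    Error(similarity.PrefixScaleOutOfRange(got)) ->
      "prefix_scale out of range: " <> float.to_string(got)
    Error(similarity.PrefixMaxNegative(got)) ->
      "prefix_max is negative: " <> int.to_string(got)
  }
}
```

Because JaroWinklerConfig is opaque, this constructor and default_jaro_winkler_config are the only ways to obtain a value, which is what lets jaro_winkler_with return a plain Float.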

pub fn jaro_winkler_with(
  a: String,
  b: String,
  config: JaroWinklerConfig,
) -> Float

Jaro-Winkler similarity with caller-supplied parameters.

pub fn prefix_max(config: JaroWinklerConfig) -> Int

Read the prefix-cap parameter of a config.

pub fn prefix_scale(config: JaroWinklerConfig) -> Float

Read the prefix-scale parameter of a config.

pub fn sorensen_dice(
  a: String,
  b: String,
  n: Int,
) -> Result(Float, SorensenDiceError)

Sørensen-Dice coefficient over grapheme n-grams of size n.

Strict variant: returns Error(NgramSizeInvalid(n)) when n < 1. For the common case of bigrams or trigrams, prefer the lenient siblings sorensen_dice_bigrams and sorensen_dice_trigrams, which fix n, return a plain Float, and need no Result handling at the call site.

Edge cases (per spec §7.5):

  • When both n-gram multisets are empty, the result is Ok(1.0) if the inputs are equal (including both empty) and Ok(0.0) otherwise.
  • When exactly one input has no n-grams the score is Ok(0.0) (no overlap is possible).
  • n < 1 returns Error(NgramSizeInvalid(n)).
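
The edge cases above can be sketched with a multiset-based Dice computation, 2·|A ∩ B| / (|A| + |B|). This is an illustration built on the Gleam standard library under the assumption that n >= 1, not the library's source; ngram_counts, total, and dice_sketch are hypothetical names.

```gleam
import gleam/dict.{type Dict}
import gleam/int
import gleam/list
import gleam/string

/// Count each grapheme n-gram of `text`; an input shorter than `n`
/// yields an empty dictionary.
fn ngram_counts(text: String, n: Int) -> Dict(String, Int) {
  text
  |> string.to_graphemes
  |> list.window(n)
  |> list.map(string.concat)
  |> list.group(fn(gram) { gram })
  |> dict.map_values(fn(_gram, occurrences) { list.length(occurrences) })
}

/// Sum the counts in an n-gram multiset.
fn total(counts: Dict(String, Int)) -> Int {
  dict.fold(counts, 0, fn(sum, _gram, count) { sum + count })
}

/// Illustrative Dice coefficient for n >= 1.
pub fn dice_sketch(a: String, b: String, n: Int) -> Float {
  let counts_a = ngram_counts(a, n)
  let counts_b = ngram_counts(b, n)
  let size = total(counts_a) + total(counts_b)
  // Multiset intersection: a shared n-gram counts min(count_a, count_b) times.
  let shared =
    dict.fold(counts_a, 0, fn(sum, gram, count_a) {
      case dict.get(counts_b, gram) {
        Ok(count_b) -> sum + int.min(count_a, count_b)
        Error(_) -> sum
      }
    })
  case size {
    // Neither input has any n-grams: fall back to plain equality.
    0 if a == b -> 1.0
    0 -> 0.0
    // If only one input has n-grams, shared is 0 and the score is 0.0.
    _ -> int.to_float(2 * shared) /. int.to_float(size)
  }
}
```

For example, "night" and "nacht" each have four bigrams and share only "ht", so the sketch yields 2·1 / 8 = 0.25.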

pub fn sorensen_dice_bigrams(a: String, b: String) -> Float

Sørensen-Dice over bigrams (n = 2), the usual choice of n for string similarity. Like jaro and jaro_winkler, it returns a plain Float in [0.0, 1.0], so call sites can compare the score against a threshold without unwrapping a Result.

Equivalent to sorensen_dice(a, b, 2), with the Error branch elided because n < 1 is impossible by construction.
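
The difference at the call site can be sketched as follows, assuming the module is imported as textmetrics/similarity. Both wrapper names are hypothetical, and the 0.0 fallback in the strict version is an arbitrary placeholder for a branch that cannot be reached when n = 2.

```gleam
import gleam/result
import textmetrics/similarity

/// Strict call: the Result must be unwrapped even though n = 2 cannot fail.
pub fn bigram_score_strict(a: String, b: String) -> Float {
  similarity.sorensen_dice(a, b, 2)
  |> result.unwrap(0.0)
}

/// Lenient call: the documented equivalent, with no Result to discard.
pub fn bigram_score(a: String, b: String) -> Float {
  similarity.sorensen_dice_bigrams(a, b)
}
```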

pub fn sorensen_dice_trigrams(a: String, b: String) -> Float

Sørensen-Dice over trigrams (n = 3). The same shape as sorensen_dice_bigrams but with a wider n-gram window; useful when inputs share long common substrings and bigrams produce a noisy score.
