MeTA v1.3.7 Release Notes

  • ๐Ÿ› Bug fixes

    • ๐Ÿ›  Fix inconsistent behavior of utf::segmenter (and thus icu_tokenizer) for different locales. Thanks @CanoeFZH and @tng-konrad for helping debug this!

    โœจ Enhancements

    • ๐Ÿ‘ Allow for specifying the language and country for locale generation in setting up utf::segmenter (and thus icu_tokenizer)
    • ๐Ÿ‘ Allow for suppression of <s> and </s> tags within icu_tokenizer, mostly useful for information retrieval experiments with unigram words. Thanks @husseinhazimeh for the suggestion!
    • โž• Add a default-unigram-chain filter chain preset which is suitable for information retrieval experiments using unigram words. Thanks @husseinhazimeh for the suggestion!