MeTA v3.0.0 Release Notes
Release Date: 2017-02-13
🆕 New features
➕ Add an embedding_analyzer that represents documents with their averaged word vectors.
➕ Add a parallel::reduction algorithm designed for parallelizing complex accumulation operations (like an E step in an EM algorithm).
Parallelize feature counting in feature selector using the new parallel::for_each_block algorithm to run functions on (relatively) equal sub-ranges of an iterator range in parallel.
➕ Add a parallel merge sort as
➕ Add a util/traits.h header for generally useful traits.
➕ Add a Markov model implementation in
➕ Add a generic unsupervised HMM implementation. This implementation supports HMMs with discrete observations (what is used most often) and sequence observations (useful for log mining applications). The forward-backward algorithm is implemented using both the scaling method and the log-space method. The scaling method is used by default, but the log-space method is useful for HMMs with sequence observations to avoid underflow issues when the output probabilities themselves are very small.
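As a rough illustration of the two numerical strategies described above, the following standalone sketch runs the forward pass of a small discrete-observation HMM both ways. It is not MeTA's implementation: hmm_params, forward_scaled, forward_log, and log_sum_exp are hypothetical names, and the model stores plain probabilities for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical parameter bundle for a discrete-observation HMM.
struct hmm_params
{
    std::vector<double> initial;            // initial[i] = P(state i at t = 0)
    std::vector<std::vector<double>> trans; // trans[i][j] = P(state j | state i)
    std::vector<std::vector<double>> emit;  // emit[i][o]  = P(symbol o | state i)
};

// Numerically stable log(sum_i exp(xs[i])).
double log_sum_exp(const std::vector<double>& xs)
{
    double max = -std::numeric_limits<double>::infinity();
    for (double x : xs)
        max = std::max(max, x);
    if (std::isinf(max))
        return max; // every term is log(0)
    double sum = 0.0;
    for (double x : xs)
        sum += std::exp(x - max);
    return max + std::log(sum);
}

// Scaling method: renormalize the forward variables at every time step and
// accumulate the logs of the normalizers. Returns log P(obs).
double forward_scaled(const hmm_params& hmm, const std::vector<std::size_t>& obs)
{
    const std::size_t n = hmm.initial.size();
    assert(hmm.trans.size() == n && hmm.emit.size() == n);
    std::vector<double> alpha(n), next(n);
    double log_likelihood = 0.0;
    for (std::size_t t = 0; t < obs.size(); ++t)
    {
        double norm = 0.0;
        for (std::size_t j = 0; j < n; ++j)
        {
            double mass = 0.0;
            if (t == 0)
                mass = hmm.initial[j];
            else
                for (std::size_t i = 0; i < n; ++i)
                    mass += alpha[i] * hmm.trans[i][j];
            next[j] = mass * hmm.emit[j][obs[t]];
            norm += next[j];
        }
        for (std::size_t j = 0; j < n; ++j)
            alpha[j] = next[j] / norm; // alpha now sums to one
        log_likelihood += std::log(norm);
    }
    return log_likelihood;
}

// Log-space method: keep log(alpha) and combine terms with log-sum-exp.
double forward_log(const hmm_params& hmm, const std::vector<std::size_t>& obs)
{
    const std::size_t n = hmm.initial.size();
    assert(hmm.trans.size() == n && hmm.emit.size() == n);
    if (obs.empty())
        return 0.0; // log(1): the empty sequence is certain
    std::vector<double> log_alpha(n), next(n), terms(n);
    for (std::size_t t = 0; t < obs.size(); ++t)
    {
        for (std::size_t j = 0; j < n; ++j)
        {
            double log_mass = std::log(hmm.initial[j]);
            if (t > 0)
            {
                for (std::size_t i = 0; i < n; ++i)
                    terms[i] = log_alpha[i] + std::log(hmm.trans[i][j]);
                log_mass = log_sum_exp(terms);
            }
            next[j] = log_mass + std::log(hmm.emit[j][obs[t]]);
        }
        log_alpha = next;
    }
    return log_sum_exp(log_alpha);
}
```

On short sequences both functions return the same log-likelihood up to rounding. The log-space variant matters when the emission probabilities would underflow a double; in that case they would be supplied as log-probabilities directly rather than converted with std::log as done here.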
Add the KL-divergence retrieval function using pseudo-relevance feedback with the two-component mixture-model approach of Zhai and Lafferty, called kl_divergence_prf. Internally, this ranker can use a language-model ranker such as jelinek_mercer to rank both the feedback set and the result documents with respect to the modified query.
🆓 The EM algorithm used for the two-component mixture model is provided as the index::feedback::unigram_mixture free function, which returns the feedback model.
➕ Add the Rocchio algorithm (rocchio) for pseudo-relevance feedback in the vector space model.
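For readers unfamiliar with Rocchio, the query update at its core can be sketched as follows. This is a minimal standalone version, assuming sparse term vectors stored in a std::map and the conventional weights alpha and beta. The names rocchio_update and term_vector are hypothetical, not MeTA's API, and the feedback documents are assumed to be the top-ranked results of an initial query.

```cpp
#include <map>
#include <string>
#include <vector>

// Sparse term-weight vector; a stand-in for whatever the index provides.
using term_vector = std::map<std::string, double>;

// Hypothetical helper: q' = alpha * q + beta * centroid(feedback documents).
// No negative-feedback term is used, since pseudo-relevance feedback only
// has presumed-relevant documents available.
term_vector rocchio_update(const term_vector& query,
                           const std::vector<term_vector>& feedback,
                           double alpha, double beta)
{
    term_vector updated;
    for (const auto& term_weight : query)
        updated[term_weight.first] += alpha * term_weight.second;
    if (feedback.empty())
        return updated; // nothing to expand the query with

    const double scale = beta / static_cast<double>(feedback.size());
    for (const auto& doc : feedback)
        for (const auto& term_weight : doc)
            updated[term_weight.first] += scale * term_weight.second;
    return updated;
}
```

The updated vector typically contains terms that were absent from the original query, which is the point of the expansion; the documents are then re-ranked against it.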
💥 Breaking Change. To facilitate the above two changes, we have also broken the ranker hierarchy into one more level. At the top we have ranker, which has a pure virtual function rank() that can be overridden to provide entirely custom ranking behavior. This is the class the KL-divergence and Rocchio methods derive from, as we need to re-define what it means to rank documents (first retrieving a feedback set, then ranking documents with respect to an updated query).
Most of the time, however, you will want to derive from the second level, ranking_function, which is what was called ranker before. This class provides a definition of rank() to perform document-at-a-time ranking, and expects deriving classes to instead provide score_one() implementations to define the scoring function used for each document. Existing code that derived from ranker prior to this version of MeTA likely needs to be changed to instead derive from ranking_function.
➕ Add a util::make_transform_iterator function for providing iterators that transform their output according to a unary function.
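The shape of the two-level ranker hierarchy described above can be sketched with simplified stand-ins. The class names ranker and ranking_function and the functions rank() and score_one() follow the release notes, but every signature below, along with toy_index, term_list, search_result, and match_count, is an assumption made for illustration and does not mirror MeTA's headers. The toy rank() also scores every document exhaustively rather than walking postings lists.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

using doc_id = std::size_t;
using search_result = std::pair<doc_id, double>; // (document, score)
using term_list = std::vector<int>;              // term ids

// Toy stand-in for an inverted index: each document is a list of term ids.
struct toy_index
{
    std::vector<term_list> docs;
};

// Top level: override rank() for entirely custom ranking behavior,
// for example a feedback method that rewrites the query first.
class ranker
{
  public:
    virtual ~ranker() = default;
    virtual std::vector<search_result>
    rank(const toy_index& idx, const term_list& query, std::size_t k) const = 0;
};

// Second level: rank() is supplied once; subclasses only score one document.
class ranking_function : public ranker
{
  public:
    std::vector<search_result>
    rank(const toy_index& idx, const term_list& query, std::size_t k) const override
    {
        std::vector<search_result> results;
        for (doc_id d = 0; d < idx.docs.size(); ++d)
            results.emplace_back(d, score_one(idx.docs[d], query));
        // Highest score first; ties keep document order.
        std::stable_sort(results.begin(), results.end(),
                         [](const search_result& a, const search_result& b) {
                             return a.second > b.second;
                         });
        if (results.size() > k)
            results.resize(k);
        return results;
    }

  protected:
    virtual double score_one(const term_list& doc, const term_list& query) const = 0;
};

// A trivial scoring function: count occurrences of query terms.
class match_count : public ranking_function
{
  protected:
    double score_one(const term_list& doc, const term_list& query) const override
    {
        double score = 0.0;
        for (int term : query)
            score += static_cast<double>(std::count(doc.begin(), doc.end(), term));
        return score;
    }
};
```

Under this layout, a scoring function that previously derived from the top-level class only needs its base class changed to the second level, while feedback-style methods implement rank() themselves.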
💥 Breaking Change. whitespace_tokenizer now emits only word tokens by default, suppressing all whitespace tokens. The old default was to emit tokens containing whitespace in addition to actual word tokens. The old behavior can be obtained by passing false to its constructor, or setting suppress-whitespace = false in its configuration group in config.toml. (Note that whitespace tokens are still needed if using a sentence_boundary filter but, in nearly all circumstances, icu_tokenizer should be preferred.)
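The difference between the new and old defaults can be shown with a small standalone function. This is only a sketch of the behavior described above: whitespace_tokens is a hypothetical name, and grouping consecutive whitespace characters into one token is an assumption rather than a statement about how whitespace_tokenizer splits its input.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Splits text into alternating runs of whitespace and non-whitespace.
// With suppress_whitespace = true (mirroring the new default), only the
// non-whitespace runs are returned.
std::vector<std::string> whitespace_tokens(const std::string& text,
                                           bool suppress_whitespace = true)
{
    auto is_space = [](char c) {
        return std::isspace(static_cast<unsigned char>(c)) != 0;
    };

    std::vector<std::string> tokens;
    std::size_t begin = 0;
    while (begin < text.size())
    {
        const bool space_run = is_space(text[begin]);
        std::size_t end = begin;
        while (end < text.size() && is_space(text[end]) == space_run)
            ++end;
        if (!space_run || !suppress_whitespace)
            tokens.push_back(text.substr(begin, end - begin));
        begin = end;
    }
    return tokens;
}
```

Passing false as the second argument corresponds to the old behavior, where downstream filters also see the whitespace tokens.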
💥 Breaking Change. Co-occurrence counting for embeddings now uses history that crosses sentence boundaries by default. The old behavior (clearing the history when starting a new sentence) can be obtained by ensuring that a tokenizer is being used that emits sentence boundary tags and by setting
break-on-tags = true in the
💥 Breaking Change. All references in the embeddings library to "coocur" have changed to "cooccur". This means that some files and binaries have been renamed. Much of the co-occurrence counting part of the embeddings library has also been moved to the public API.
🔧 Co-occurrence counting is now performed in parallel. The behavior of its merge strategy can be configured with the new merge-fanout = n setting, which specifies the maximum number of on-disk chunks to allow before kicking off a multi-way merge (default 8).
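The role of a merge fanout can be sketched with in-memory vectors standing in for on-disk chunks. Everything here is illustrative: entry, chunk, multiway_merge, and add_chunk are hypothetical names, and triggering the merge once the chunk count exceeds the fanout is one reading of the description above, not a confirmed detail of MeTA's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <tuple>
#include <utility>
#include <vector>

using entry = std::pair<std::uint64_t, double>; // (word-pair key, count)
using chunk = std::vector<entry>;               // sorted by key, unique keys

// Merges all chunks in one pass, summing the counts of equal keys.
chunk multiway_merge(const std::vector<chunk>& chunks)
{
    // (key, chunk index, position); std::greater puts the smallest key on top.
    using cursor = std::tuple<std::uint64_t, std::size_t, std::size_t>;
    std::priority_queue<cursor, std::vector<cursor>, std::greater<cursor>> heap;
    for (std::size_t c = 0; c < chunks.size(); ++c)
        if (!chunks[c].empty())
            heap.emplace(chunks[c][0].first, c, std::size_t{0});

    chunk merged;
    while (!heap.empty())
    {
        const cursor top = heap.top();
        heap.pop();
        const std::size_t c = std::get<1>(top);
        const std::size_t pos = std::get<2>(top);
        const entry& e = chunks[c][pos];
        if (!merged.empty() && merged.back().first == e.first)
            merged.back().second += e.second; // same key seen in another chunk
        else
            merged.push_back(e);
        if (pos + 1 < chunks[c].size())
            heap.emplace(chunks[c][pos + 1].first, c, pos + 1);
    }
    return merged;
}

// Adds a freshly flushed chunk and merges everything into a single chunk
// once more than `fanout` chunks are present.
void add_chunk(std::vector<chunk>& chunks, chunk fresh, std::size_t fanout)
{
    assert(fanout > 0);
    chunks.push_back(std::move(fresh));
    if (chunks.size() > fanout)
    {
        chunk merged = multiway_merge(chunks);
        chunks.clear();
        chunks.push_back(std::move(merged));
    }
}
```

A larger fanout means fewer but wider merges, each of which keeps more chunks open at once; a smaller fanout merges more often with fewer inputs per merge.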
- Add additional
- Additional functions have been added to ranker_factory to allow construction/loading of language_model_ranker subclasses (useful for the
- 🛠 Add a util::make_fixed_heap helper function to simplify the declaration of util::fixed_heap classes with lambda function comparators.
- ➕ Add regression tests for rankers' MAP and NDCG scores. This adds a new dataset, cranfield, that contains non-binary relevance judgments to facilitate these new tests.
- ⬆️ Bump bundled version of ICU to 58.2.
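The motivation for a helper like util::make_fixed_heap, listed above, is that a lambda's type cannot be spelled out, so declaring a comparator-parameterized class directly requires decltype. The sketch below shows the pattern with a toy bounded heap in a sketch namespace. It is an assumption-laden illustration: the class body, the member names push and extract_top, and the helper's signature are invented here and may differ from MeTA's.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

namespace sketch
{
// Keeps only the max_elems best items, where comp(a, b) is true when
// a should rank ahead of b.
template <class T, class Comp>
class fixed_heap
{
  public:
    fixed_heap(std::size_t max_elems, Comp comp)
        : max_elems_(max_elems), comp_(std::move(comp))
    {
        assert(max_elems > 0);
    }

    void push(T value)
    {
        items_.push_back(std::move(value));
        // With comp meaning "ranks ahead", the heap front is the worst item.
        std::push_heap(items_.begin(), items_.end(), comp_);
        if (items_.size() > max_elems_)
        {
            std::pop_heap(items_.begin(), items_.end(), comp_);
            items_.pop_back(); // drop the worst item
        }
    }

    // Returns the retained items, best first, and empties the heap.
    std::vector<T> extract_top()
    {
        std::vector<T> out = std::move(items_);
        items_.clear();
        std::sort(out.begin(), out.end(), comp_);
        return out;
    }

  private:
    std::size_t max_elems_;
    Comp comp_;
    std::vector<T> items_;
};

// The helper deduces Comp, so callers never have to name the lambda's type.
template <class T, class Comp>
fixed_heap<T, Comp> make_fixed_heap(std::size_t max_elems, Comp comp)
{
    return fixed_heap<T, Comp>(max_elems, std::move(comp));
}
} // namespace sketch
```

With the helper, a call such as sketch::make_fixed_heap<double>(10, [](double a, double b) { return a > b; }) replaces a two-step declaration that first stores the lambda and then writes fixed_heap<double, decltype(lambda)>.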
🐛 Bug Fixes
- 🛠 Fix bug in NDCG calculation (ideal-DCG was computed using the wrong sorting order for non-binary judgments)
- 🛠 Fix bug where the final chunks to be merged in index creation were not being deleted when merging completed
- 🛠 Fix bug where GloVe training would allocate the embedding matrix before starting the shuffling process, causing it to exceed the "max-ram" config parameter.
- 🛠 Fix bug with consuming MeTA from a build directory with cmake when building a static ICU library. meta-utf is now forced to be a shared library, which (1) should save on binary sizes and (2) ensures that the statically built ICU is linked into the libmeta-utf.so library to avoid undefined references to ICU functions.
- 🛠 Fix bug with consuming Release-mode MeTA libraries from another project being built in Debug mode. Before, identifiers.h would change behavior based on the NDEBUG macro's setting. This behavior has been removed, and opaque identifiers are always on.
disk_index::doc_path have been deprecated in favor of the more general (and less confusing) metadata(). They will be removed in a future major release.
- 👌 Support for 32-bit architectures is provided on a best-effort basis. MeTA makes heavy use of memory mapping, which is best paired with a 64-bit address space. Please move to a 64-bit platform for using MeTA if at all possible (most consumer machines should support 64-bit if they were made in the last 5 years or so).
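The NDCG fix listed under Bug Fixes comes down to how the ideal DCG is ordered when judgments are graded rather than binary. A minimal sketch of the computation follows, assuming the common gain 2^rel - 1 and discount log2(rank + 1). The release notes do not say which variant MeTA uses, and the function names dcg and ndcg are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// DCG over graded relevance values given in ranked order, cut off at k.
// Assumed formulation: gain 2^rel - 1, discount log2(rank + 1), rank from 1.
double dcg(const std::vector<double>& rels, std::size_t k)
{
    double total = 0.0;
    for (std::size_t i = 0; i < rels.size() && i < k; ++i)
        total += (std::exp2(rels[i]) - 1.0)
                 / std::log2(static_cast<double>(i) + 2.0);
    return total;
}

// ranked_rels: relevance of each returned document, in returned order.
// judged_rels: relevance of every judged document for the query (any order).
double ndcg(const std::vector<double>& ranked_rels,
            std::vector<double> judged_rels, std::size_t k)
{
    // The ideal ranking places the most relevant documents first, so the
    // judgments must be sorted in descending order before computing DCG.
    std::sort(judged_rels.begin(), judged_rels.end(), std::greater<double>());
    const double ideal = dcg(judged_rels, k);
    return ideal > 0.0 ? dcg(ranked_rels, k) / ideal : 0.0;
}
```

With binary judgments the sort direction rarely changes the outcome when all relevant documents fit within the cutoff, which may be why such a bug can go unnoticed. With graded judgments an ascending sort understates the ideal DCG and can push scores above 1.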
Model File Checksums (sha256)
d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283  beam-search-constituency-parser-4.tar.gz
ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e  crf.tar.gz
672b10c398c1a193ba91dc8c0493d729ad3f73d9192ef33100baeb8afd4f5cde  gigaword-embeddings-50d.tar.gz
40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0  greedy-constituency-parser.tar.gz
a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9  greedy-perceptron-tagger.tar.gz
Please note that the embeddings model has changed. Please re-download.