Changelog History

  • v3.0.2 Changes

    August 20, 2017

    Bug fixes

    • Fix issues using MAKE_NUMERIC_IDENTIFIER instead of MAKE_NUMERIC_IDENTIFIER_UDL on GCC 7.1.1.
    • Work around what we assume is a bug on MSYS2, where cmake would link in additional exception handling libraries that caused a crash during indexing. The workaround builds the mman-win32 library as shared.
    • Silence fallthrough warnings on Clang from murmur_hash.

    Model File Checksums (sha256)

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283 beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e crf.tar.gz
    672b10c398c1a193ba91dc8c0493d729ad3f73d9192ef33100baeb8afd4f5cde gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0 greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9 greedy-perceptron-tagger.tar.gz
  • v3.0.1 Changes

    March 13, 2017

    New features

    • Add an optional xz{i,o}fstream to meta::io if compiled with liblzma available.
    • util::disk_vector<const T> can now be used to specify a read-only view of a disk-backed vector.

    Bug fixes

    • ir_eval::print_stats now takes a num_docs parameter to properly display evaluation metrics at a certain cutoff point, which was always 5 beforehand. This fixes a bug in query-runner where the stats were not being computed according to the cutoff point specified in the configuration.
    • ir_eval::avg_p now correctly stops computing after num_docs. Before, if you specified num_docs as a smaller value than the size of the result list, it would erroneously keep calculating until the end of the result list instead of stopping after num_docs elements.
    • {inverted,forward}_index can now be loaded from read-only filesystems.
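
    The cutoff behavior described in the ir_eval fixes above can be sketched as follows. This is a standalone illustration, not MeTA's implementation: the function name avg_precision_at, the boolean relevance vector, and the choice to normalize by the smaller of num_docs and the number of relevant documents are all assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of MeTA: average precision computed over at
// most the first num_docs entries of a ranked result list.
// relevant[i] is true when the result at rank i (0-based) is relevant;
// total_relevant is the number of relevant documents for the query.
double avg_precision_at(const std::vector<bool>& relevant,
                        std::size_t total_relevant, std::size_t num_docs)
{
    if (total_relevant == 0 || num_docs == 0)
        return 0.0;

    // Stop at the cutoff even if the result list is longer.
    const std::size_t limit = std::min(num_docs, relevant.size());
    double sum = 0.0;
    std::size_t hits = 0;
    for (std::size_t i = 0; i < limit; ++i)
    {
        if (relevant[i])
        {
            ++hits;
            sum += static_cast<double>(hits) / static_cast<double>(i + 1);
        }
    }
    // Assumed normalization: the best achievable hit count at this cutoff.
    return sum / static_cast<double>(std::min(num_docs, total_relevant));
}
```

    With a result list longer than num_docs, the loop never looks past the cutoff, which is the behavior the fix above describes.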

    Model File Checksums (sha256)

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283 beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e crf.tar.gz
    672b10c398c1a193ba91dc8c0493d729ad3f73d9192ef33100baeb8afd4f5cde gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0 greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9 greedy-perceptron-tagger.tar.gz
  • v3.0.0 Changes

    February 13, 2017

    New features

    Add an embedding_analyzer that represents documents with their averaged word vectors.

    Add a parallel::reduction algorithm designed for parallelizing complex accumulation operations (like an E step in an EM algorithm).

    Parallelize feature counting in feature selector using the new parallel::reduction

    Add a parallel::for_each_block algorithm to run functions on (relatively) equal sub-ranges of an iterator range in parallel
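
    The general idea of reducing over roughly equal blocks in parallel can be sketched as below. This is a standalone illustration built on std::thread, not MeTA's parallel::reduction or parallel::for_each_block; the name blockwise_reduce and its parameters are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical illustration: each worker folds its own block into a private
// accumulator, and the partial results are merged once all workers finish.
template <class T, class Accumulate, class Merge>
T blockwise_reduce(const std::vector<T>& data, std::size_t num_threads,
                   T identity, Accumulate accumulate, Merge merge)
{
    num_threads = std::max<std::size_t>(1, num_threads);
    // Ceiling division so that every element lands in some block.
    const std::size_t block = (data.size() + num_threads - 1) / num_threads;

    std::vector<T> partial(num_threads, identity);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < num_threads; ++t)
    {
        const std::size_t begin = std::min(data.size(), t * block);
        const std::size_t end = std::min(data.size(), begin + block);
        // Each thread writes only to partial[t], so no locking is needed.
        workers.emplace_back([&, t, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                accumulate(partial[t], data[i]);
        });
    }
    for (auto& w : workers)
        w.join();

    // Sequential merge of the per-thread accumulators.
    T result = identity;
    for (const auto& p : partial)
        merge(result, p);
    return result;
}
```

    Keeping a private accumulator per thread is what makes this suitable for expensive accumulations such as expected counts in an E step: the workers never contend on shared state until the final merge.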

    Add a parallel merge sort as parallel::sort.

    Add a util/traits.h header for general useful traits.

    Add a Markov model implementation in sequence::markov_model.

    Add a generic unsupervised HMM implementation. This implementation supports HMMs with discrete observations (what is used most often) and sequence observations (useful for log mining applications). The forward-backward algorithm is implemented using both the scaling method and the log-space method. The scaling method is used by default, but the log-space method is useful for HMMs with sequence observations to avoid underflow issues when the output probabilities themselves are very small.
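
    To make the difference between the two methods concrete, here is a minimal sketch of the forward pass for a discrete-observation HMM, once with scaling and once in log space. It is not MeTA's HMM code: the function names, the matrix alias, and the parameter layout are assumptions, and it assumes at least one state and a sequence with nonzero probability.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using matrix = std::vector<std::vector<double>>;

// init[i]: initial probability of state i; trans[i][j]: probability of moving
// from state i to state j; emit[i][o]: probability of state i emitting o.

// Scaling method: renormalize the forward variables at every time step and
// accumulate the logs of the scaling factors.
double log_likelihood_scaled(const std::vector<double>& init,
                             const matrix& trans, const matrix& emit,
                             const std::vector<std::size_t>& obs)
{
    const std::size_t n = init.size();
    std::vector<double> alpha(n), next(n);
    double log_prob = 0.0;
    for (std::size_t t = 0; t < obs.size(); ++t)
    {
        double scale = 0.0;
        for (std::size_t j = 0; j < n; ++j)
        {
            double incoming = 0.0;
            if (t == 0)
                incoming = init[j];
            else
                for (std::size_t i = 0; i < n; ++i)
                    incoming += alpha[i] * trans[i][j];
            next[j] = incoming * emit[j][obs[t]];
            scale += next[j];
        }
        for (std::size_t j = 0; j < n; ++j)
            next[j] /= scale;
        alpha.swap(next);
        log_prob += std::log(scale);
    }
    return log_prob;
}

// log(sum(exp(x))) computed relative to the largest term for stability.
double log_sum_exp(const std::vector<double>& xs)
{
    const double m = *std::max_element(xs.begin(), xs.end());
    if (std::isinf(m))
        return m; // every term is log(0)
    double sum = 0.0;
    for (double x : xs)
        sum += std::exp(x - m);
    return m + std::log(sum);
}

// Log-space method: store log(alpha) and combine terms with log_sum_exp.
double log_likelihood_logspace(const std::vector<double>& init,
                               const matrix& trans, const matrix& emit,
                               const std::vector<std::size_t>& obs)
{
    if (obs.empty())
        return 0.0;
    const std::size_t n = init.size();
    std::vector<double> log_alpha(n), next(n), terms(n);
    for (std::size_t t = 0; t < obs.size(); ++t)
    {
        for (std::size_t j = 0; j < n; ++j)
        {
            double incoming;
            if (t == 0)
                incoming = std::log(init[j]);
            else
            {
                for (std::size_t i = 0; i < n; ++i)
                    terms[i] = log_alpha[i] + std::log(trans[i][j]);
                incoming = log_sum_exp(terms);
            }
            next[j] = incoming + std::log(emit[j][obs[t]]);
        }
        log_alpha.swap(next);
    }
    return log_sum_exp(log_alpha);
}
```

    Both functions compute the same log-likelihood. The scaled version multiplies raw probabilities within a step, so it can still underflow if a single emission probability is tiny; the log-space version only ever adds logarithms.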

    Add the KL-divergence retrieval function using pseudo-relevance feedback with the two-component mixture-model approach of Zhai and Lafferty, called kl_divergence_prf. This ranker internally can use any language_model_ranker subclass like dirichlet_prior or jelinek_mercer to perform the ranking of the feedback set and the result documents with respect to the modified query.

    The EM algorithm used for the two-component mixture model is provided as the index::feedback::unigram_mixture free function and returns the feedback model.

    Add the Rocchio algorithm (rocchio) for pseudo-relevance feedback in the vector space model.

    Breaking Change. To facilitate the above two changes, we have also broken the ranker hierarchy into one more level. At the top we have ranker, which has a pure virtual function rank() that can be overridden to provide entirely custom ranking behavior. This is the class the KL-divergence and Rocchio methods derive from, as we need to re-define what it means to rank documents (first retrieving a feedback set, then ranking documents with respect to an updated query).

    Most of the time, however, you will want to derive from the second level ranking_function, which is what was called ranker before. This class provides a definition of rank() to perform document-at-a-time ranking, and expects deriving classes to instead provide initial_score() and score_one() implementations to define the scoring function used for each document. Existing code that derived from ranker prior to this version of MeTA likely needs to be changed to instead derive from ranking_function.
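
    The shape of the two-level hierarchy can be sketched as below. The class and member names ranker, ranking_function, rank(), initial_score() and score_one() come from the text above; everything else (the document, query and scored_doc types, the parameter lists, and the term_count example) is a simplified assumption and does not match MeTA's real signatures.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified types for illustration only.
using document = std::vector<std::string>; // a document as a list of terms
using query = std::vector<std::string>;
using scored_doc = std::pair<std::size_t, double>; // (document id, score)

// Top level: rank() is pure virtual, so a subclass can redefine ranking
// entirely (for example, to run a feedback pass first).
class ranker
{
  public:
    virtual ~ranker() = default;
    virtual std::vector<scored_doc> rank(const std::vector<document>& docs,
                                         const query& q) const = 0;
};

// Second level: supplies a document-at-a-time rank() and asks subclasses
// only for the per-document scoring pieces.
class ranking_function : public ranker
{
  public:
    std::vector<scored_doc> rank(const std::vector<document>& docs,
                                 const query& q) const override
    {
        std::vector<scored_doc> results;
        for (std::size_t id = 0; id < docs.size(); ++id)
        {
            double score = initial_score(docs[id]);
            for (const auto& term : q)
                score += score_one(docs[id], term);
            results.emplace_back(id, score);
        }
        // Highest score first; ties keep document-id order.
        std::stable_sort(results.begin(), results.end(),
                         [](const scored_doc& a, const scored_doc& b) {
                             return a.second > b.second;
                         });
        return results;
    }

  protected:
    virtual double initial_score(const document& doc) const = 0;
    virtual double score_one(const document& doc,
                             const std::string& term) const = 0;
};

// Example scoring function: raw term frequency of each query term.
class term_count : public ranking_function
{
  protected:
    double initial_score(const document&) const override { return 0.0; }
    double score_one(const document& doc,
                     const std::string& term) const override
    {
        return static_cast<double>(std::count(doc.begin(), doc.end(), term));
    }
};
```

    A feedback-style ranker would derive from ranker directly and override rank(), while an ordinary scoring function such as term_count only fills in the two scoring hooks.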

    Add the util::transform_iterator class and util::make_transform_iterator function for providing iterators that transform their output according to a unary function.
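
    As an illustration of what such an iterator does, here is a minimal standalone adaptor. It is not MeTA's util::transform_iterator: the class layout, the input-iterator category, and the sum_of_squares helper are assumptions made for this sketch.

```cpp
#include <cstddef>
#include <iterator>
#include <numeric>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical minimal adaptor: dereferencing applies a unary function to the
// element the wrapped iterator points at.
template <class Iterator, class UnaryFunction>
class transform_iterator
{
  public:
    using iterator_category = std::input_iterator_tag;
    using reference = decltype(std::declval<const UnaryFunction&>()(
        *std::declval<const Iterator&>()));
    using value_type = typename std::decay<reference>::type;
    using difference_type = std::ptrdiff_t;
    using pointer = void;

    transform_iterator(Iterator it, UnaryFunction fn)
        : it_(it), fn_(std::move(fn))
    {
    }

    reference operator*() const { return fn_(*it_); }

    transform_iterator& operator++()
    {
        ++it_;
        return *this;
    }

    transform_iterator operator++(int)
    {
        auto tmp = *this;
        ++it_;
        return tmp;
    }

    bool operator==(const transform_iterator& o) const { return it_ == o.it_; }
    bool operator!=(const transform_iterator& o) const { return it_ != o.it_; }

  private:
    Iterator it_;
    UnaryFunction fn_;
};

template <class Iterator, class UnaryFunction>
transform_iterator<Iterator, UnaryFunction>
make_transform_iterator(Iterator it, UnaryFunction fn)
{
    return {it, std::move(fn)};
}

// Usage: sum the squares without materializing the squared values.
int sum_of_squares(const std::vector<int>& values)
{
    auto square = [](int x) { return x * x; };
    return std::accumulate(make_transform_iterator(values.begin(), square),
                           make_transform_iterator(values.end(), square), 0);
}
```

    The make_ helper exists so that the function type, which for a lambda cannot be spelled out, is deduced rather than written by hand.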

    Breaking Change. whitespace_tokenizer now emits only word tokens by default, suppressing all whitespace tokens. The old default was to emit tokens containing whitespace in addition to actual word tokens. The old behavior can be obtained by passing false to its constructor, or setting suppress-whitespace = false in its configuration group in config.toml. (Note that whitespace tokens are still needed if using a sentence_boundary filter but, in nearly all circumstances, icu_tokenizer should be preferred.)

    Breaking Change. Co-occurrence counting for embeddings now uses history that crosses sentence boundaries by default. The old behavior (clearing the history when starting a new sentence) can be obtained by ensuring that a tokenizer is being used that emits sentence boundary tags and by setting break-on-tags = true in the [embeddings] table of config.toml.

    Breaking Change. All references in the embeddings library to "coocur" have changed to "cooccur". This means that some files and binaries have been renamed. Much of the co-occurrence counting part of the embeddings library has also been moved to the public API.

    Co-occurrence counting is now performed in parallel. The behavior of its merge strategy can be configured with the new [embeddings] config parameter merge-fanout = n, which specifies the maximum number of on-disk chunks to allow before kicking off a multi-way merge (default 8).

    Enhancements

    • Add additional packed_write and packed_read overloads: for std::pair, stats::dirichlet, stats::multinomial, util::dense_matrix, and util::sparse_vector
    • Additional functions have been added to ranker_factory to allow construction/loading of language_model_ranker subclasses (useful for the kl_divergence_prf implementation)
    • Add a util::make_fixed_heap helper function to simplify the declaration of util::fixed_heap classes with lambda function comparators.
    • Add regression tests for rankers MAP and NDCG scores. This adds a new dataset cranfield that contains non-binary relevance judgments to facilitate these new tests.
    • Bump bundled version of ICU to 58.2.

    Bug Fixes

    • Fix bug in NDCG calculation (ideal-DCG was computed using the wrong sorting order for non-binary judgments).
    • Fix bug where the final chunks to be merged in index creation were not being deleted when merging completed.
    • Fix bug where GloVe training would allocate the embedding matrix before starting the shuffling process, causing it to exceed the "max-ram" config parameter.
    • Fix bug with consuming MeTA from a build directory with cmake when building a static ICU library. meta-utf is now forced to be a shared library, which (1) should save on binary sizes and (2) ensures that the statically built ICU is linked into the library to avoid undefined references to ICU functions.
    • Fix bug with consuming Release-mode MeTA libraries from another project being built in Debug mode. Before, identifiers.h would change behavior based on the NDEBUG macro's setting. This behavior has been removed, and opaque identifiers are always on.
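
    For the NDCG fix in the list above, the role of the sort order can be seen in a small sketch. This is not MeTA's ir_eval code: the function names are hypothetical, and the gain formula used here, (2^rel - 1) / log2(rank + 1) with 1-based ranks, is one common definition assumed for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Discounted cumulative gain over at most the first num_docs results.
// gains[i] is the graded relevance judgment of the result at 0-based rank i.
double dcg(const std::vector<double>& gains, std::size_t num_docs)
{
    double total = 0.0;
    const std::size_t limit = std::min(num_docs, gains.size());
    for (std::size_t i = 0; i < limit; ++i)
        total += (std::exp2(gains[i]) - 1.0)
                 / std::log2(static_cast<double>(i) + 2.0);
    return total;
}

// ranked_gains: judgments in the order the ranker returned the documents.
// all_judgments: every judgment available for the query, in any order.
double ndcg(const std::vector<double>& ranked_gains,
            std::vector<double> all_judgments, std::size_t num_docs)
{
    // The ideal ordering places the highest judgments first, so the sort
    // must be descending.
    std::sort(all_judgments.begin(), all_judgments.end(),
              std::greater<double>());
    const double ideal = dcg(all_judgments, num_docs);
    return ideal > 0.0 ? dcg(ranked_gains, num_docs) / ideal : 0.0;
}
```

    With judgments 3, 2 and 1, a ranker that returns them in exactly that order scores 1.0, and any other order scores less. Sorting the ideal list in ascending order instead would shrink the denominator and inflate the score.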

    Deprecation

    • disk_index::doc_name and disk_index::doc_path have been deprecated in favor of the more general (and less confusing) metadata(). They will be removed in a future major release.
    • Support for 32-bit architectures is provided on a best-effort basis. MeTA makes heavy use of memory mapping, which is best paired with a 64-bit address space. Please move to a 64-bit platform for using MeTA if at all possible (most consumer machines should support 64-bit if they were made in the last 5 years or so).

    Model File Checksums (sha256)

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283 beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e crf.tar.gz
    672b10c398c1a193ba91dc8c0493d729ad3f73d9192ef33100baeb8afd4f5cde gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0 greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9 greedy-perceptron-tagger.tar.gz

    Please note that the embeddings model has changed. Please re-download.

  • v2.4.2 Changes

    September 23, 2016

    Bug Fixes

    • Properly shuffle documents when doing an even-split classification test.
    • Make forward indexer listen to indexer-num-threads config option.
    • Use correct number of threads when deciding block sizes for parallel_for.
    • Add workaround to filesystem::remove_all for Windows systems to avoid spurious failures caused by virus scanners keeping files open after we deleted them.
    • Fix invalid memory access in gzstreambuf::underflow.

    Model File Checksums (sha256)

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283 beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e crf.tar.gz
    2a75ab9750ad2eabfe1b53889b15a31f79bd2315f71c2a4a62f6364586a6042d gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0 greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9 greedy-perceptron-tagger.tar.gz
  • v2.4.1 Changes

    September 08, 2016

    Bug fixes

    • Eliminate excess warnings on Darwin about double preprocessor definitions.
    • Fix issue finding config.h when used as a sub-project via add_subdirectory().

    Model File Checksums (sha256)

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283 beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e crf.tar.gz
    2a75ab9750ad2eabfe1b53889b15a31f79bd2315f71c2a4a62f6364586a6042d gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0 greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9 greedy-perceptron-tagger.tar.gz
  • v2.4.0 Changes

    September 07, 2016

    New features

    Add a minimal perfect hashing implementation for language_model, and unify the querying interface with the existing language model.

    Add a CMake install() command to install MeTA as a library (issue #143). For example, once the library is installed, users can do:

        find_package(MeTA 2.4 REQUIRED)

        add_executable(my-program src/my_program.cpp)
        target_link_libraries(my-program meta-index) # or whatever libs needed from MeTA

    Feature selection functionality added to multiclass_dataset and binary_dataset and views (issues #111, #149 and PR #150 thanks to @siddshuk).

        auto selector = features::make_selector(*config, training_vw);
        uint64_t total_features_selected = 20;
        selector->select(total_features_selected);
        auto filtered_dset = features::filter_dataset(dset, *selector);

    Users can now, similar to hash_append, declare standalone functions in the same scope as their type called packed_read and packed_write which will be called by io::packed::read and io::packed::write, respectively, via argument-dependent lookup.
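
    The argument-dependent lookup mechanism mentioned above can be shown with a small standalone sketch. The mini_io namespace stands in for a serialization library and is hypothetical, as are the app::point type and the exact signatures; the sketch also writes raw bytes rather than any packed encoding, so it only demonstrates how the lookup works.

```cpp
#include <cstdint>
#include <ostream>
#include <sstream>

namespace mini_io // hypothetical stand-in for a serialization library
{
// Overload for a type the library itself knows about.
inline std::uint64_t packed_write(std::ostream& os, std::uint32_t value)
{
    os.write(reinterpret_cast<const char*>(&value), sizeof(value));
    return sizeof(value);
}

// Generic entry point. The call to packed_write is unqualified, so
// argument-dependent lookup also searches the namespace of T.
template <class T>
std::uint64_t write(std::ostream& os, const T& value)
{
    return packed_write(os, value);
}
} // namespace mini_io

namespace app
{
struct point
{
    std::uint32_t x;
    std::uint32_t y;
};

// Declared in the same namespace as point, so mini_io::write finds it
// without the library having to know that point exists.
inline std::uint64_t packed_write(std::ostream& os, const point& p)
{
    return mini_io::write(os, p.x) + mini_io::write(os, p.y);
}
} // namespace app
```

    Calling mini_io::write(stream, app::point{1, 2}) dispatches to app::packed_write and returns the number of bytes written. The user type opts in simply by declaring the function next to itself, with no specialization inside the library's namespace.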

    Bug fixes

    • Fix edge-case bug in the succinct data structures
    • Fix off-by-one error in lm::diff

    Enhancements

    • Added functionality to the meta::hashing library: hash_append overload for std::vector, manually-seeded hash function
    • Further isolate ICU in MeTA to allow CMake to install()
    • Updates to EWS (UIUC) build guide
    • Add std::vector operations to io::packed
    • Consolidated all variants of chunk iterators into one template
    • Add MeTA's citation to the README!

    Model File Checksums (sha256)

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283 beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e crf.tar.gz
    2a75ab9750ad2eabfe1b53889b15a31f79bd2315f71c2a4a62f6364586a6042d gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0 greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9 greedy-perceptron-tagger.tar.gz
  • v2.3.0 Changes

    August 02, 2016

    New features

    Forward and inverted indexes are now stored in one directory. To make use of your existing indexes, you will need to move their directories. For example, a configuration that used to look like the following

        dataset = "20newsgroups"
        corpus = "line.toml"
        forward-index = "20news-fwd"
        inverted-index = "20news-inv"

    will now look like the following

        dataset = "20newsgroups"
        corpus = "line.toml"
        index = "20news-index"

    and your folder structure should now look like

        20news-index
        ├── fwd
        └── inv

    You can do this by simply moving the old folders around like so:

        mkdir 20news-index
        mv 20news-fwd 20news-index/fwd
        mv 20news-inv 20news-index/inv

    stats::multinomial now can report the number of unique event types
    counted (unique_events())

    std::vector can now be hashed via hash_append.

    Bug fixes

    • Fix rounding bug in language model-based rankers. This bug caused severely degraded performance for these rankers with short queries. The unit tests have been improved to prevent such a regression in the future.

    Enhancements

    • The bundled ICU version has been bumped to ICU 57.1.
    • MeTA will now attempt to build its own version of ICU on Windows if it fails to find a suitable ICU installed.
    • CI support for GCC 6.x was added for all three major platforms.
    • CI support also uses a fixed version of LLVM/libc++ instead of trunk.

    Model File Checksums (sha256)

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283 beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e crf.tar.gz
    2a75ab9750ad2eabfe1b53889b15a31f79bd2315f71c2a4a62f6364586a6042d gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0 greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9 greedy-perceptron-tagger.tar.gz
  • v2.2.0 Changes

    April 09, 2016

    New features

    • Parallelized versions of PageRank and Personalized PageRank have been added. A demo is available in wiki-page-rank; see the website for more information on obtaining the required data.
    • Add a disk-based streaming minimal perfect hash function library. A sub-component of this is a small memory-mapped succinct data structure library for answering rank/select queries on bit vectors.
    • Much of our CMake magic has been moved into a separate project included as a submodule, which can now be used in other projects to simplify initial build system configuration.

    Bug fixes

    • Fix parameter settings in language model rankers not being range checked (issue #134).
    • Fix incorrect incoming edge insertion in directed_graph::add_edge().
    • Fix find_first_of and find_last_of in util::string_view.

    Enhancements

    • forward_index now knows how to tokenize a document down to a feature_vector, provided it was generated with a non-LIBSVM analyzer.
    • Allow loading of an existing index where its corpus is no longer available.
    • Data is no longer shuffled in batch_train. Shuffling the data causes horrible access patterns in the postings file, so the data should instead be shuffled before indexing.
    • util::array_views can now be constructed as empty.
    • util::multiway_merge has been made more generic. You can now specify both the comparison function and merging criteria as parameters, which default to operator< and operator==, respectively.
    • Simple utility classes io::mifstream and io::mofstream have been added for places where a moveable ifstream or ofstream is desired, as a workaround for older standard libraries lacking move support for these streams.
    • The number of indexing threads can be controlled via the configuration key indexer-num-threads (which defaults to the number of threads on the system), and the number of threads allowed to concurrently write to disk can be controlled via indexer-max-writers (which defaults to 8).
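
    The idea of a multiway merge parameterized by both a comparison function and a merging criterion can be sketched as follows. This in-memory version is not MeTA's util::multiway_merge: the name merge_sorted_chunks, the vector-of-vectors input, and the separate combine callback are assumptions for illustration.

```cpp
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical in-memory sketch. Each chunk must already be sorted according
// to `less`. Consecutive records for which `should_merge` returns true are
// folded together with `combine` instead of being emitted separately.
template <class Record, class Less, class ShouldMerge, class Combine>
std::vector<Record>
merge_sorted_chunks(const std::vector<std::vector<Record>>& chunks, Less less,
                    ShouldMerge should_merge, Combine combine)
{
    using cursor = std::pair<std::size_t, std::size_t>; // (chunk, offset)
    auto later = [&](const cursor& a, const cursor& b) {
        // priority_queue is a max-heap, so invert to pop the smallest first.
        return less(chunks[b.first][b.second], chunks[a.first][a.second]);
    };
    std::priority_queue<cursor, std::vector<cursor>, decltype(later)> heap(
        later);
    for (std::size_t c = 0; c < chunks.size(); ++c)
        if (!chunks[c].empty())
            heap.emplace(c, std::size_t{0});

    std::vector<Record> output;
    while (!heap.empty())
    {
        const cursor top = heap.top();
        heap.pop();
        const Record& rec = chunks[top.first][top.second];
        if (!output.empty() && should_merge(output.back(), rec))
            combine(output.back(), rec);
        else
            output.push_back(rec);
        // Advance within the chunk the record came from.
        if (top.second + 1 < chunks[top.first].size())
            heap.emplace(top.first, top.second + 1);
    }
    return output;
}
```

    For example, with (term id, count) pairs, less can compare term ids, should_merge can test them for equality, and combine can add the counts, which yields one record per term across all chunks.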

    Model File Checksums (sha256)

    d29bf8b4cbeef21db087cf8042efe5afe25c7bd3c460997728d58b92c24ec283 beam-search-constituency-parser-4.tar.gz
    ce44c7d96a8339ff4b597f35a35534ccf93ab99b7d45cbbdddffe7e362b9c20e crf.tar.gz
    2a75ab9750ad2eabfe1b53889b15a31f79bd2315f71c2a4a62f6364586a6042d gigaword-embeddings-50d.tar.gz
    40cd87901eb29b69e57e4bca14bc2539d7d6b4ad5c186d6f3b1532a60c5163b0 greedy-constituency-parser.tar.gz
    a0a3814c1f82780f1296d600eba260f474420aa2d93f000e390c71a0ddac42d9 greedy-perceptron-tagger.tar.gz
  • v2.1.0 Changes

    February 13, 2016

    New features

    • Add the GloVe algorithm for training word embeddings and a library class word_embeddings for loading and querying trained embeddings. To facilitate returning word embeddings, a simple util::array_view class was added.
    • Add simple vector math library (and move fastapprox into the math namespace).

    Bug fixes

    • Fix probe_map::extract() for inline_key_value_storage type; old implementation forgot to delete all sentinel values before returning.
    • Fix incorrect definition of l1norm() in sgd_model.
    • Fix gmap calculation where 0 average precision was ignored.
    • Fix progress output in multiway_merge.

    Enhancements

    • Improve performance of printing::progress. Before, progress::operator() in tight loops could dramatically hurt performance, particularly due to frequent calls to std::chrono::steady_clock::now(). Now, progress::operator() simply sets an atomic iteration counter and a background thread periodically wakes to update the progress output.
    • Allow full text storage in index as metadata field. If store-full-text = true (default false) in the corpus config, the string metadata field "content" will be added. This is to simplify the creation of full text metadata: the user doesn't have to duplicate their dataset in metadata.dat, and metadata.dat will still be somewhat human-readable without large strings of full text added.
    • Allow make_index to take a user-supplied corpus object.

    • ZLIB is now a required dependency.
    • Switch to just using the standalone ./unit-test instead of ctest. There aren't really many advantages for us to using CTest at this point with the new unit test framework, so just use our unit test executable.
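
    The progress design described in the enhancements above (an atomic counter updated in the hot path, with a background thread doing the printing) can be sketched as below. This is not MeTA's printing::progress: the class name progress_reporter, its constructor parameters, and the plain sleep-based wakeup are assumptions.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <thread>

// Hypothetical sketch: the caller only stores to an atomic; a background
// thread wakes periodically and does the actual output.
class progress_reporter
{
  public:
    progress_reporter(std::uint64_t total, std::chrono::milliseconds interval)
        : total_(total),
          current_(0),
          done_(false),
          printer_([this, interval] {
              while (!done_.load())
              {
                  print();
                  std::this_thread::sleep_for(interval);
              }
          })
    {
    }

    ~progress_reporter()
    {
        done_.store(true);
        printer_.join();
        print(); // final state
        std::cerr << '\n';
    }

    // Cheap enough for tight loops: no clock reads and no I/O.
    void operator()(std::uint64_t iteration) { current_.store(iteration); }

    std::uint64_t current() const { return current_.load(); }

  private:
    void print() const
    {
        std::cerr << '\r' << current_.load() << " / " << total_ << std::flush;
    }

    std::uint64_t total_;
    std::atomic<std::uint64_t> current_;
    std::atomic<bool> done_;
    std::thread printer_; // declared last so the other members exist first
};
```

    In this sketch the destructor may wait up to one interval for the printer thread to notice the stop flag; a condition variable would allow it to stop immediately.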
  • v2.0.1 Changes

    February 05, 2016

    Bug fixes

    • Fix issue where metadata_parser would not consume spaces in string metadata fields. Thanks to hopsalot on the forum for the bug report!
    • Fix build issue on OS X with Xcode 6.4 and clang related to their shipped version of string_view lacking a const to_string() method.

    Enhancements

    • The ./profile executable ensures that the file exists before operating on it. Thanks to @domarps for the PR!
    • Add a generic util::multiway_merge algorithm for performing the merge-step of an external memory merge sort.
    • Build with the following Xcode versions on Travis CI:
      • Xcode 6.1 and OS X 10.9 (as before)
      • Xcode 6.4 and OS X 10.10 (new)
      • Xcode 7.1.1 and OS X 10.10 (new)
      • Xcode 7.2 and OS X 10.11 (new)