Changelog History

  • v2.0.0 Changes

    New features and major changes

    Indexing

    • Index format rewrite: both inverted and forward indices now use the same compressed postings format, and intermediate chunks are now also compressed on the fly. There is now a built-in tool to dump any forward index to libsvm format (libsvm is no longer the on-disk format for forward indices).
    • Metadata support: indices can now store arbitrary metadata associated with individual documents, with string, integer, unsigned integer, and floating point values
    • Corpus configuration is now stored within the corpus directory itself, allowing corpora to be distributed with their proper configurations rather than having to bake this into the main configuration file
    • RAM limits can be set for the indexing process via the configuration file. These are approximate and based on heuristics, so you should always set them lower than the available RAM.
    • Forward indices can now be created directly instead of forcing the creation of an inverted index first

    Tokenization and Analysis

    • ICU will be built and statically linked if the system-provided library is too old, on both OS X and Linux. MeTA now specifies an exact version of ICU to be used per release for consistency. That version is 56.1 as of this release.
    • Analyzers have been modified to support both integral and floating point values via the use of the featurizer object passed to tokenize()
    • Documents no longer store any count information during the analysis process

    Ranking

    • Postings lists can now be read in a streaming fashion rather than all at once via postings_stream
    • Ranking is now performed using a document-at-a-time scheme
    • Ranking functions now use fast approximate math from fastapprox
    • Rank correlation measures have been added to the evaluation library
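A minimal sketch of what document-at-a-time ranking with a bounded result set can look like. The `posting`, `postings_list`, and `rank_daat` names are hypothetical and chosen for illustration; they are not MeTA's classes, and the per-posting `weight` stands in for whatever a real ranking function would compute.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical types for illustration only (not MeTA's API).
struct posting
{
    std::uint64_t doc_id;
    double weight; // stand-in for a per-term score contribution
};

using postings_list = std::vector<posting>;          // sorted by doc_id
using scored_doc = std::pair<double, std::uint64_t>; // (score, doc_id)

// Scores one document at a time across all query term lists and keeps
// only the k best results in a min-heap of bounded size.
std::vector<scored_doc> rank_daat(const std::vector<postings_list>& lists,
                                  std::size_t k)
{
    assert(k > 0);
    std::vector<std::size_t> cursor(lists.size(), 0);
    std::priority_queue<scored_doc, std::vector<scored_doc>,
                        std::greater<scored_doc>> heap;

    while (true)
    {
        // Find the smallest doc_id any cursor currently points at.
        std::uint64_t current = std::numeric_limits<std::uint64_t>::max();
        bool any = false;
        for (std::size_t i = 0; i < lists.size(); ++i)
        {
            if (cursor[i] < lists[i].size())
            {
                any = true;
                current = std::min(current, lists[i][cursor[i]].doc_id);
            }
        }
        if (!any)
            break; // every list is exhausted

        // Sum contributions from every list positioned on that document,
        // advancing those cursors past it.
        double score = 0.0;
        for (std::size_t i = 0; i < lists.size(); ++i)
        {
            if (cursor[i] < lists[i].size()
                && lists[i][cursor[i]].doc_id == current)
            {
                score += lists[i][cursor[i]].weight;
                ++cursor[i];
            }
        }

        heap.emplace(score, current);
        if (heap.size() > k)
            heap.pop(); // drop the lowest-scoring candidate
    }

    // Drain the heap (ascending) and reverse to get best-first order.
    std::vector<scored_doc> results;
    while (!heap.empty())
    {
        results.push_back(heap.top());
        heap.pop();
    }
    std::reverse(results.begin(), results.end());
    return results;
}
```

Because each document's score is complete before the next document is considered, only k candidates need to be held at once, and each list is consumed strictly front to back, which is the access pattern a streaming postings reader supports.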

    Language Model

    • Rewrite of the language model library which can load models from the .arpa format
    • [SyntacticDiff][syndiff] implementation for comparative text mining, which may include grammatical error correction, summarization, or feature generation

    Machine Learning

    • A feature selection library has been added for selecting machine learning features using chi-square, information gain, correlation coefficient, and odds ratio
    • The API for the machine learning algorithms has been changed to use dataset classes; these are separate from the index classes and represent data that is memory-resident
    • Support for regression has been added (currently only via SGD)
    • The SGD algorithm has been improved to use a normalized adaptive gradient method, which should make it less sensitive to feature scaling
    • The SGD algorithm now supports (approximate) L1 regularization via a cumulative penalty approach
    • The libsvm modules are now also built using CMake
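As an illustration of one of the feature selection criteria named above, here is a sketch of the textbook chi-square statistic for a 2x2 term/class contingency table. The function name and parameter layout are assumptions made for this example; the toolkit's own implementation may be formulated differently (for instance, in terms of probabilities rather than counts).

```cpp
#include <cassert>

// Chi-square statistic for a 2x2 contingency table of a term and a class:
//   a: documents in the class that contain the term
//   b: documents outside the class that contain the term
//   c: documents in the class that lack the term
//   d: documents outside the class that lack the term
// Hypothetical helper for illustration; not MeTA's API.
double chi_square(double a, double b, double c, double d)
{
    const double n = a + b + c + d;
    const double denom = (a + b) * (c + d) * (a + c) * (b + d);
    assert(denom > 0.0); // every row and column total must be non-zero
    const double diff = a * d - b * c;
    return n * diff * diff / denom;
}
```

A term that is independent of the class scores 0, and larger values indicate a stronger association, so features can be ranked by this score and the top ones kept.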

    Miscellaneous

    • Packed binary I/O functions allow for writing integers/floating point values in a compressed format that can be efficiently decoded. This should be used for most binary I/O that needs to be performed in the toolkit unless there is a specific reason not to.
    • An interactive demo application has been added for the shift-reduce constituency parser
    • A string_view class is provided in the meta::util namespace to be used for non-owning references to strings. This will use std::experimental::string_view if available and our own implementation if not
    • meta::util::optional will resolve to std::experimental::optional if it is available
    • Support for jemalloc has been added to the build system. We strongly recommend installing and linking against jemalloc for improved indexing performance.
    • A tool has been added to print out the top k terms in a corpus
    • A new library for hashing has been added in namespace meta::hashing. This includes a generic framework for writing hash functions that are randomly keyed as well as (insertion only) probing-based hash sets/maps with configurable resizing and probing strategies
    • A utility class fixed_heap has been added for places where a fixed size set of maximal/minimal values should be maintained in constant space
    • The filesystem management routines have been converted to use STLsoft in the event that the filesystem library in std::experimental::filesystem is not available
    • Building MeTA on Windows is now officially supported via MSYS2 and MinGW-w64, and continuous integration now builds it on every commit in this environment
    • A small support library for things related to random number generation has been added in meta::random
    • Sparse vectors now support operator+ and operator-
    • An STL container compatible allocator aligned_allocator<T, Alignment> has been added that can over-align data (useful for performance in some situations)
    • Bandit is now used for the unit tests, and these have been substantially improved upon
    • io::parser deprecated and removed; most uses simply converted to std::fstream
    • binary_file_{reader,writer} deprecated and removed; io::packed or io::{read,write}_binary should be used instead
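To give a feel for what "packed" integer I/O means, the sketch below uses a common base-128 variable-length encoding: seven data bits per byte, with the high bit signalling that more bytes follow. This is an assumption for illustration; the changelog does not specify the byte layout that io::packed uses, and `write_varint`/`read_varint` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends value to out using 7 data bits per byte; a set high bit means
// another byte follows. Small values therefore take a single byte.
void write_varint(std::vector<std::uint8_t>& out, std::uint64_t value)
{
    while (value >= 0x80)
    {
        out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(value));
}

// Decodes one value starting at in[pos] and advances pos past it.
std::uint64_t read_varint(const std::vector<std::uint8_t>& in,
                          std::size_t& pos)
{
    std::uint64_t value = 0;
    int shift = 0;
    while (true)
    {
        assert(pos < in.size() && shift < 64); // guard against bad input
        const std::uint8_t byte = in[pos++];
        value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
        if ((byte & 0x80) == 0)
            break; // high bit clear: this was the last byte
        shift += 7;
    }
    return value;
}
```

With this scheme values below 128 occupy one byte and a full 64-bit value occupies at most ten, which is why such encodings suit data dominated by small integers such as counts and identifier gaps.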

    Bug fixes

    • knn classifier now only requests the top k when performing classification
    • Fixed an issue where uncompressed model files would not be found when using a zlib-enabled build (#101)

    Enhancements

    • Travis CI integration has been switched to their container infrastructure, and it now builds on OS X with Clang in addition to Linux with Clang and GCC
    • Appveyor CI is now used for Windows builds alongside Travis
    • Indexing speeds are dramatically faster (thanks to many changes both in the in-memory posting chunks as well as optimizations in the tokenization process)
    • If no build type is specified, MeTA will be built in Release mode
    • The cpptoml dependency version has been bumped, allowing the use of things like value_or for cleaner code
    • The identifiers library has been dramatically simplified

    [syndiff]: http://web.engr.illinois.edu/~massung1/files/bigdata-2015.pdf

  • v1.3.8 Changes

    Bug fixes

    • Fix issue with confusion_matrix where precision and recall values were swapped. Thanks to @husseinhazimeh for finding this!

    Enhancements

    • Better unit tests for confusion_matrix
    • Add functions to confusion_matrix to directly access precision, recall, and F1 score
    • Create a predicted_label opaque identifier to emphasize class_labels that are output from some model (and thus shouldn't be interchangeable)
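For reference, the standard definitions behind the precision/recall fix in this release can be sketched as follows. The `binary_counts` struct and the free functions are hypothetical and written for this example; they do not reflect the interface of confusion_matrix.

```cpp
// Per-class counts, as would be read off a confusion matrix.
// Hypothetical type for illustration; not MeTA's API.
struct binary_counts
{
    double true_positives;
    double false_positives;
    double false_negatives;
};

// Precision: of everything predicted as the class, how much was correct.
double precision(const binary_counts& c)
{
    const double predicted = c.true_positives + c.false_positives;
    return predicted > 0.0 ? c.true_positives / predicted : 0.0;
}

// Recall: of everything truly in the class, how much was found.
double recall(const binary_counts& c)
{
    const double actual = c.true_positives + c.false_negatives;
    return actual > 0.0 ? c.true_positives / actual : 0.0;
}

// F1: harmonic mean of precision and recall.
double f1_score(const binary_counts& c)
{
    const double p = precision(c);
    const double r = recall(c);
    return (p + r) > 0.0 ? 2.0 * p * r / (p + r) : 0.0;
}
```

The two measures differ only in the denominator (false positives versus false negatives), which makes them easy to swap by accident; F1 is symmetric in the two, so a swap leaves it unchanged and it cannot be used to detect the mistake.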
  • v1.3.7 Changes

    Bug fixes

    • Fix inconsistent behavior of utf::segmenter (and thus icu_tokenizer) for different locales. Thanks @CanoeFZH and @tng-konrad for helping debug this!

    Enhancements

    • Allow for specifying the language and country for locale generation in setting up utf::segmenter (and thus icu_tokenizer)
    • Allow for suppression of <s> and </s> tags within icu_tokenizer, mostly useful for information retrieval experiments with unigram words. Thanks @husseinhazimeh for the suggestion!
    • Add a default-unigram-chain filter chain preset which is suitable for information retrieval experiments using unigram words. Thanks @husseinhazimeh for the suggestion!
  • v1.3.6 Changes

    Bug fixes

    • Fix potential off-by-one when calculating the number of documents in a line_corpus when its files do not end in a newline
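The off-by-one described above is the classic pitfall of counting newline characters in a one-document-per-line file. A minimal sketch of a count that handles a missing trailing newline follows; `count_lines` is a hypothetical helper, not the line_corpus code.

```cpp
#include <cstddef>
#include <string>

// Counts documents in a one-document-per-line buffer. Counting '\n'
// alone misses the last line when the data does not end in a newline.
// Hypothetical helper for illustration; not MeTA's API.
std::size_t count_lines(const std::string& contents)
{
    std::size_t lines = 0;
    for (char c : contents)
    {
        if (c == '\n')
            ++lines;
    }
    // A non-empty final line without a terminator is still a document.
    if (!contents.empty() && contents.back() != '\n')
        ++lines;
    return lines;
}
```

With this rule, "a\nb\n" and "a\nb" both count as two documents.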

    Enhancements

    • Change score_data to support floating-point weights on query terms
  • v1.3.5 Changes

    Bug fixes

    • Fix missing support for sequence/parser analyzers in the classify tools
  • v1.3.4 Changes

    New features

    • Support building with biicode
    • Add Vagrantfile for virtual machine configuration
    • Add Dockerfile for Docker support

    Enhancements

    • Improve ir_eval unit tests

    Bug fixes

    • Fix incorrect log base in ir_eval::ndcg, and fix addition being used instead of subtraction in its IDCG calculation
    • Fix ir_eval::avg_p incorrect early termination
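For context on the NDCG fix, here is a sketch of one common formulation, with gain 2^rel - 1 and a log base 2 position discount. This is an assumption about the intended formula, inferred from the mention of a log base and a subtraction; the function names are hypothetical, and the ideal ordering here is derived only from the relevance values passed in.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Discounted cumulative gain over the first p positions, using
// gain = 2^rel - 1 and discount = log2(rank + 1) with 1-based ranks.
double dcg(const std::vector<int>& rels, std::size_t p)
{
    double total = 0.0;
    for (std::size_t i = 0; i < rels.size() && i < p; ++i)
    {
        // i is 0-based, so rank + 1 == i + 2
        total += (std::pow(2.0, rels[i]) - 1.0)
                 / std::log2(static_cast<double>(i) + 2.0);
    }
    return total;
}

// Normalizes DCG by the DCG of the ideal (descending relevance) order.
// Assumes ranked_rels holds the relevance of every judged document,
// listed in the order the system ranked them.
double ndcg(const std::vector<int>& ranked_rels, std::size_t p)
{
    std::vector<int> ideal = ranked_rels;
    std::sort(ideal.begin(), ideal.end(), std::greater<int>());
    const double idcg = dcg(ideal, p);
    return idcg > 0.0 ? dcg(ranked_rels, p) / idcg : 0.0;
}
```

A ranking that is already in ideal order scores exactly 1, and any other ordering of the same documents scores at most 1.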
  • v1.3.3 Changes

    Bug fixes

    • Fix issues with system-defined integer widths in binary model files (mainly impacted the greedy tagger and parser); please re-download any parser model files you may have had before
    • Fix bug where parser model directory is not created if a non-standard prefix is used (anything other than "parser")

    Enhancements

    • Silence inconsistent missing overrides warning on clang >= 3.6
  • v1.3.2 Changes

    Bug fixes

    • fix potentially incorrect generation of vocabulary map files on 32-bit systems (this appears to have only impacted non-default block sizes)
  • v1.3.1 Changes

    Bug fixes

    • fix calculation of average precision in ir_eval (the denominator was incorrect)
    • specify that labels are required for the file_corpus document list; this allows spaces in the path to each document
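The changelog does not say what the corrected denominator is. Under the standard definition, average precision divides by the total number of relevant documents for the query, not by the number of relevant documents that were retrieved. The sketch below follows that definition; `average_precision` and its parameters are hypothetical names for this example.

```cpp
#include <cstddef>
#include <vector>

// Average precision for one ranked result list.
//   is_relevant[i]  : whether the document at rank i + 1 is relevant
//   total_relevant  : number of relevant documents for the query,
//                     including any that were never retrieved
// Hypothetical helper for illustration; not MeTA's API.
double average_precision(const std::vector<bool>& is_relevant,
                         std::size_t total_relevant)
{
    if (total_relevant == 0)
        return 0.0;

    double sum = 0.0;
    std::size_t hits = 0;
    for (std::size_t i = 0; i < is_relevant.size(); ++i)
    {
        if (is_relevant[i])
        {
            ++hits;
            // precision at this rank: relevant seen so far / rank
            sum += static_cast<double>(hits) / static_cast<double>(i + 1);
        }
    }
    return sum / static_cast<double>(total_relevant);
}
```

Dividing by the retrieved relevant count instead would give the same score to a system that finds two of two relevant documents and one that finds two of four, which is why the choice of denominator matters.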
  • v1.3 Changes

    New features

    • additions to the graph library:
      • myopic search
      • BFS
      • preferential attachment graph generation model (supports node attractiveness from different distributions)
      • betweenness centrality
      • eigenvector centrality
    • added a new natural language parsing library:
      • parse tree library (visitor-based)
      • shift-reduce constituency parser for generating phrase structure trees
      • reimplementation of evalb metrics for evaluating parsers
      • new filter for Penn Treebank-style normalization
    • added a greedy averaged Perceptron-based tagger
    • demo application for various basic text processing (profile)
    • basic iostreams that support gzip compression (if compiled with ZLib support)
    • added iteration method for stats::multinomial seen events
    • added expected value and entropy functions to stats namespace
    • added linear_model: a generic multiclass classifier storage class
    • added gz_corpus: a compressed version of line_corpus
    • added macros for generating type safe identifiers with user defined literal suffixes
    • added a persistent stack data structure to meta::util

    Enhancements

    • added operator== for util::optional<T>
    • better CMake support for building the libsvm modules
    • better CMake support for downloading unit-test data
    • improved setup guide in README (for OS X, Ubuntu, Arch, and EWS/ENGRIT)
    • tree analyzers refactored to use the new parser library (removes dependency on outside toolkits for generating tree files)
    • analyzers that are not part of the "core" have been moved into their respective folders (so ngram_pos_analyzer is in src/sequence, tree_analyzer is in src/parser)
    • make_index now checks if the files exist before loading an index, and if they are missing creates a new one (as opposed to just throwing an exception on a nonexistent file)
    • cpptoml upgraded to support TOML v0.4.0
    • enable extra warnings (-Wextra) for clang++ and g++

    Bug fixes

    • fix sequence_analyzer::analyze() const when applied to untagged sequences (was throwing when it shouldn't)
    • ensure that the inverted index object is destroyed first before uninverting occurs in the creation of a forward_index
    • fix bug where icu_tokenizer would output spaces as tokens
    • fix bugs where index objects were not destroyed before trying to delete their files in the unit tests
    • fix bug in sparse_vector::find() where it would return a non-end iterator when asked to find an element that does not exist
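The sparse_vector::find() fix above is an instance of a common binary-search mistake: std::lower_bound returns the first element that is not less than the key, which may be a different element. A sketch of a correct lookup over a sorted (index, value) vector follows; the `entry`, `sparse`, and `find_entry` names are hypothetical and do not mirror MeTA's sparse_vector.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sparse representation: (index, value) pairs kept sorted
// by index. Not MeTA's sparse_vector.
using entry = std::pair<std::uint64_t, double>;
using sparse = std::vector<entry>;

// Returns an iterator to the entry with the given index, or vec.end()
// if no such entry exists.
sparse::const_iterator find_entry(const sparse& vec, std::uint64_t index)
{
    auto it = std::lower_bound(
        vec.begin(), vec.end(), index,
        [](const entry& e, std::uint64_t key) { return e.first < key; });

    // lower_bound only guarantees it->first >= index, so an exact-match
    // check is required before reporting success.
    if (it != vec.end() && it->first == index)
        return it;
    return vec.end();
}
```

Without the equality check, looking up index 5 in a vector holding indices 1, 4, and 9 would return the entry for 9 instead of the end iterator.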