# MeTA v2.0.0 Release Notes
## New features and major changes
### Indexing
- Index format rewrite: both inverted and forward indices now use the same compressed postings format, and intermediate chunks are now compressed on-the-fly. A built-in tool can dump any forward index to libsvm format (as this is no longer the on-disk format for that index type)
- Metadata support: indices can now store arbitrary metadata associated with individual documents, with string, integer, unsigned integer, and floating point values
- Corpus configuration is now stored within the corpus directory itself, allowing corpora to be distributed with their proper configurations rather than baking this into the main configuration file
- RAM limits can be set for the indexing process via the configuration file. These are approximate and based on heuristics, so you should always set them lower than the available RAM.
- Forward indices can now be created directly instead of forcing the creation of an inverted index first
### Tokenization and Analysis
- ICU will be built and statically linked on both OS X and Linux if the system-provided library is too old. MeTA now specifies an exact ICU version to be used per release for consistency; as of this release, that version is 56.1.
- Analyzers have been modified to support both integral and floating point values via the `featurizer` object passed to `tokenize()`
- Documents no longer store any count information during the analysis process
### Ranking
- Postings lists can now be read in a streaming fashion rather than all at once via `postings_stream`
- Ranking is now performed using a document-at-a-time scheme
- Ranking functions now use fast approximate math from fastapprox
- Rank correlation measures have been added to the evaluation library
### Language Model
- Rewrite of the language model library, which can now load models from the .arpa format
- [SyntacticDiff][syndiff] implementation for comparative text mining, which may include grammatical error correction, summarization, or feature generation
### Machine Learning
- A feature selection library has been added for selecting machine learning features using chi square, information gain, correlation coefficient, and odds ratio
- The API for the machine learning algorithms has been changed to use `dataset` classes; these are separate from the index classes and represent memory-resident data
- Support for regression has been added (currently only via SGD)
- The SGD algorithm has been improved to use a normalized adaptive gradient method which should make it less sensitive to feature scaling
- The SGD algorithm now supports (approximate) L1 regularization via a cumulative penalty approach
- The libsvm modules are now also built using CMake
### Miscellaneous
- Packed binary I/O functions allow writing integers/floating point values in a compressed format that can be decoded efficiently. This should be used for most binary I/O in the toolkit unless there is a specific reason not to.
- An interactive demo application has been added for the shift-reduce constituency parser
- A `string_view` class is provided in the `meta::util` namespace to be used for non-owning references to strings. This will use `std::experimental::string_view` if available, and our own implementation if not
- `meta::util::optional` will resolve to `std::experimental::optional` if it is available
- Support for jemalloc has been added to the build system. We strongly recommend installing and linking against jemalloc for improved indexing performance.
- A tool has been added to print out the top k terms in a corpus
- A new library for hashing has been added in the `meta::hashing` namespace. This includes a generic framework for writing randomly keyed hash functions, as well as (insertion-only) probing-based hash sets/maps with configurable resizing and probing strategies
- A utility class `fixed_heap` has been added for places where a fixed-size set of maximal/minimal values should be maintained in constant space
- The filesystem management routines have been converted to use STLsoft in the event that the filesystem library in `std::experimental::filesystem` is not available
- Building MeTA on Windows is now officially supported via MSYS2 and MinGW-w64, and continuous integration now builds it in this environment on every commit
- A small support library for things related to random number generation has been added in `meta::random`
- Sparse vectors now support `operator+` and `operator-`
- An STL container compatible allocator `aligned_allocator<T, Alignment>` has been added that can over-align data (useful for performance in some situations)
- Bandit is now used for the unit tests, and these have been substantially improved
- `io::parser` has been deprecated and removed; most uses were simply converted to `std::fstream`
- `binary_file_{reader,writer}` have been deprecated and removed; `io::packed` or `io::{read,write}_binary` should be used instead
## Bug fixes
- The knn classifier now requests only the top k documents when performing classification
- Fixed an issue where uncompressed model files would not be found when using a zlib-enabled build (#101)
## Enhancements
- Travis CI integration has been switched to their container infrastructure, and it now builds on OS X with Clang in addition to Linux with Clang and GCC
- Appveyor CI runs Windows builds alongside Travis
- Indexing speeds are dramatically faster (thanks to many changes both in the in-memory posting chunks as well as optimizations in the tokenization process)
- If no build type is specified, MeTA will be built in Release mode
- The cpptoml dependency version has been bumped, allowing the use of things like `value_or` for cleaner code
- The identifiers library has been dramatically simplified
[syndiff]: http://web.engr.illinois.edu/~massung1/files/bigdata-2015.pdf