csv v1.1 Release Notes

Release Date: 2019-04-23
    • Moved the include directory into a namespaced csv/ folder
    • Removed unnecessary shared_ptr in Dialects
    • Switched to std::string_view for keys in CSV rows (now requires C++17)
    • Switched to using robin_hood::unordered_flat_map
    • Removed std::mutex, using std::atomic instead

Previous changes from v1.0

  • CSV for Modern C++

    Highlights

    Table of Contents

    • Reading CSV files
      • Dialects
      • Configuring Custom Dialects
      • Multi-character Delimiters
      • Ignoring Columns
      • No Header?
      • Dealing with Empty Rows
      • Reading first N rows
      • Performance Benchmark
    • Writing CSV files
    • Contributing
    • License

    Reading CSV files

    Simply include reader.hpp and you're good to go.

    #include <reader.hpp>
    

    To start parsing CSV files, create a csv::Reader object and call .read(filename).

    csv::Reader foo;
    foo.read("test.csv");
    

    This .read method is non-blocking. The reader spawns multiple threads to tokenize the file stream and build a "list of dictionaries". While the reader is doing its thing, you can start post-processing the rows it has parsed so far using this iterator pattern:

    while (foo.busy()) {
      if (foo.has_row()) {
        auto row = foo.next_row();  // Each row is a robin_map (https://github.com/Tessil/robin-map)
        auto foo = row["foo"];      // You can use it just like an std::unordered_map
        auto bar = row["bar"];
        // do something
      }
    }
    

    If you would rather wait for all the rows to be processed, call .rows(), a convenience method that runs the above while loop for you:

    auto rows = foo.rows();  // blocks until the CSV is fully processed
    for (auto& row : rows) { // Example: [{"foo": "1", "bar": "2"}, {"foo": "3", "bar": "4"}, ...]
      auto foo = row["foo"];
      // do something
    }
    

    Dialects

    This csv library comes with three standard dialects:

    Name Description
    excel The excel dialect defines the usual properties of an Excel-generated CSV file
    excel_tab The excel_tab dialect defines the usual properties of an Excel-generated TAB-delimited file
    unix The unix dialect defines the usual properties of a CSV file generated on UNIX systems, i.e. using '\n' as line terminator and quoting all fields

    Configuring Custom Dialects

    Custom dialects can be constructed with .configure_dialect(...):

    csv::Reader csv;
    csv.configure_dialect("my fancy dialect")
      .delimiter("")
      .quote_character('"')
      .double_quote(true)
      .skip_initial_space(false)
      .trim_characters(' ', '\t')
      .ignore_columns("foo", "bar")
      .header(true)
      .skip_empty_rows(true);

    csv.read("foo.csv");
    for (auto& row : csv.rows()) {
      // do something
    }
    
    Property            Data Type                 Description
    delimiter           std::string
    quote_character     char
    double_quote        bool
    skip_initial_space  bool
    trim_characters     std::vector<char>
    ignore_columns      std::vector<std::string>
    header              bool
    column_names        std::vector<std::string>
    skip_empty_rows     bool                      Specifies how empty rows should be interpreted. If this is set to true, empty rows are skipped. Default = false

    The line terminator is '\n' by default. I use std::getline and handle stripping out '\r' from line endings. So, for now, this is not configurable in custom dialects.

    Multi-character Delimiters

    Consider this strange, messed up log file:

    [Thread ID] :: [Log Level] :: [Log Message] :: {Timestamp}
    04 :: INFO :: Hello World :: 1555164718
    02 :: DEBUG :: Warning! Foo has happened :: 1555463132
    

    To parse this file, simply configure a new dialect that splits on "::" and trims whitespace, braces, and bracket characters.

    csv::Reader csv;
    csv.configure_dialect("my strange dialect")
      .delimiter("::")
      .trim_characters(' ', '[', ']', '{', '}');

    csv.read("test.csv");
    for (auto& row : csv.rows()) {
      auto thread_id = row["Thread ID"];    // "04"
      auto log_level = row["Log Level"];    // "INFO"
      auto message = row["Log Message"];    // "Hello World"
      // do something
    }
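
To make the idea concrete, here is a rough standalone sketch of splitting on a multi-character delimiter and trimming each field. The names trim and split_and_trim are hypothetical and are not taken from the library; the sketch also ignores quoting, which a real reader would have to handle.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: removes any of `chars` from both ends of `field`.
std::string trim(const std::string& field, const std::string& chars) {
    const auto first = field.find_first_not_of(chars);
    if (first == std::string::npos) {
        return "";  // field is empty or consists only of trim characters
    }
    const auto last = field.find_last_not_of(chars);
    return field.substr(first, last - first + 1);
}

// Hypothetical helper: splits `line` on a (possibly multi-character)
// delimiter and trims every resulting field. Quoted fields are not handled.
std::vector<std::string> split_and_trim(const std::string& line,
                                        const std::string& delimiter,
                                        const std::string& trim_chars) {
    std::vector<std::string> fields;
    if (delimiter.empty()) {
        fields.push_back(trim(line, trim_chars));
        return fields;
    }
    std::size_t start = 0;
    while (true) {
        const auto end = line.find(delimiter, start);
        if (end == std::string::npos) {
            fields.push_back(trim(line.substr(start), trim_chars));
            break;
        }
        fields.push_back(trim(line.substr(start, end - start), trim_chars));
        start = end + delimiter.size();
    }
    return fields;
}
```

Calling split_and_trim on the header line of the log file with "::" as the delimiter and " []{}" as the trim characters yields the four column names without their brackets and braces.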
    

    Ignoring Columns

    Consider the following CSV. Let's say you don't care about the columns age and gender. Here, you can use .ignore_columns and provide a list of columns to ignore.

    name, age, gender, email, department
    Mark Johnson, 50, M, [email protected], BA
    John Stevenson, 35, M, [email protected], IT
    Jane Barkley, 25, F, [email protected], MGT
    

    You can configure the dialect to ignore these columns like so:

    csv::Reader csv;
    csv.configure_dialect("ignore meh and fez")
      .delimiter(", ")
      .ignore_columns("age", "gender");

    csv.read("test.csv");
    auto rows = csv.rows();

    // Your rows are:
    // [{"name": "Mark Johnson", "email": "[email protected]", "department": "BA"},
    //  {"name": "John Stevenson", "email": "[email protected]", "department": "IT"},
    //  {"name": "Jane Barkley", "email": "[email protected]", "department": "MGT"}]
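
One way to picture what ignoring columns means is the sketch below. It is an assumption about the general technique, not the library's implementation: make_filtered_row is a hypothetical name, and std::map stands in for the library's own row type.

```cpp
#include <algorithm>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: pairs header names with field values and leaves out
// every column whose name appears in `ignored`.
std::map<std::string, std::string> make_filtered_row(
        const std::vector<std::string>& header,
        const std::vector<std::string>& fields,
        const std::vector<std::string>& ignored) {
    std::map<std::string, std::string> row;
    // Only pair up positions that exist in both the header and the row.
    const std::size_t count = std::min(header.size(), fields.size());
    for (std::size_t i = 0; i < count; ++i) {
        const bool skip =
            std::find(ignored.begin(), ignored.end(), header[i]) != ignored.end();
        if (!skip) {
            row[header[i]] = fields[i];
        }
    }
    return row;
}
```

In this sketch the ignored values are still split out of the line and simply never stored; whether the library avoids that work earlier is not something the text above states.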
    

    No Header?

    Sometimes you have CSV files with no header row:

    9 52 1
    52 91 0
    91 135 0
    135 174 0
    174 218 0
    218 260 0
    260 301 0
    301 341 0
    341 383 0
    ...
    

    If you want to prevent the reader from parsing the first row as a header, simply:

    • Set .header to false
    • Provide a list of column names with .column_names(...)

      csv.configure_dialect("no headers")
        .header(false)
        .column_names("foo", "bar", "baz");

    The CSV rows will now look like this:

    [{"foo": "9", "bar": "52", "baz": "1"}, {"foo": "52", "bar": "91", "baz": "0"}, ...]
    

    If .column_names is not called, then the reader simply generates dictionary keys like so:

    [{"0": "9", "1": "52", "2": "1"}, {"0": "52", "1": "91", "2": "0"}, ...]
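
The two key-naming behaviours above can be sketched in a few lines. This is an illustrative guess at the logic, not the library's code: make_row is a hypothetical name, std::map stands in for the library's row type, and falling back to the index when fewer names than fields are supplied is an assumption of this sketch.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: builds a row dictionary from the fields of one line.
// Uses the supplied column names when present, otherwise the column index
// ("0", "1", "2", ...) as the key.
std::map<std::string, std::string> make_row(
        const std::vector<std::string>& fields,
        const std::vector<std::string>& column_names = {}) {
    std::map<std::string, std::string> row;
    for (std::size_t i = 0; i < fields.size(); ++i) {
        const std::string key =
            i < column_names.size() ? column_names[i] : std::to_string(i);
        row[key] = fields[i];
    }
    return row;
}
```

With the names "foo", "bar", "baz" the first data line maps to those keys; with no names it maps to "0", "1", "2".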
    

    Dealing with Empty Rows

    Sometimes you have to deal with a CSV file that has empty lines, either in the middle or at the end of the file:

    a,b,c
    1,2,3
    
    4,5,6
    
    10,11,12
    

    Here's how this gets parsed by default:

    csv::Reader csv;
    csv.read("inputs/empty_lines.csv");
    auto rows = csv.rows();
    // [{"a": "1", "b": "2", "c": "3"}, {"a": "", "b": "", "c": ""}, {"a": "4", "b": "5", "c": "6"}, {"a": "", ...}]
    

    If you don't care for these empty rows, simply call .skip_empty_rows(true):

    csv::Reader csv;
    csv.configure_dialect()
      .skip_empty_rows(true);

    csv.read("inputs/empty_lines.csv");
    auto rows = csv.rows();
    // [{"a": "1", "b": "2", "c": "3"}, {"a": "4", "b": "5", "c": "6"}, {"a": "10", "b": "11", "c": "12"}]
    

    Reading first N rows

    If you know exactly how many rows to parse, you can help out the reader by using the .read(filename, num_rows) overloaded method. This saves the reader from trying to figure out the number of lines in the CSV file. You can use this method to parse the first N rows of the file instead of parsing all of it.

    csv::Reader foo;
    foo.read("bar.csv", 1000);
    auto rows = foo.rows();
    

    Note: Do not pass a num_rows greater than the actual number of rows in the file. If you do, the reader will loop forever.

    Performance Benchmark

    // benchmark.cpp
    void parse(const std::string& filename) {
      csv::Reader foo;
      foo.read(filename);
      std::vector<csv::robin_map<std::string, std::string>> rows;
      while (foo.busy()) {
        if (foo.ready()) {
          auto row = foo.next_row();
          rows.push_back(row);
        }
      }
    }
    
    $ g++ -pthread -std=c++11 -O3 -Iinclude/ -o test benchmark.cpp
    $ time ./test
    

    Each test is run 30 times on an Intel(R) Core(TM) i7-6650-U @ 2.20 GHz CPU.

    Here are the average-case execution times:

    Dataset                             File Size  Rows       Cols  Time
    Demographic Statistics By Zip Code  27 KB      237        46    0.026s
    Simple 3-column CSV                 14.1 MB    761,817    3     0.523s
    Majestic Million                    77.7 MB    1,000,000  12    2.232s
    Crimes 2001 - Present               1.50 GB    6,846,406  22    32.411s

    Writing CSV files

    Simply include writer.hpp and you're good to go.

    #include <writer.hpp>
    

    To start writing CSV files, create a csv::Writer object and provide a filename:

    csv::Writer foo("test.csv");
    

    Constructing a writer spawns a worker thread that is ready to start writing rows. Using .configure_dialect, configure the dialect to be used by the writer. This is where you can specify the column names:

    foo.configure_dialect()
      .delimiter(", ")
      .column_names("a", "b", "c");
    

    Now it's time to write rows. You can do this in multiple ways:

    foo.write_row("1", "2", "3");                                     // parameter packing
    foo.write_row({"4", "5", "6"});                                   // std::vector
    foo.write_row(std::map<std::string, std::string>{                 // std::map
      {"a", "7"}, {"b", "8"}, {"c", "9"} });
    foo.write_row(std::unordered_map<std::string, std::string>{       // std::unordered_map
      {"a", "7"}, {"b", "8"}, {"c", "9"} });
    foo.write_row(csv::robin_map<std::string, std::string>{           // robin_map
      {"a", "7"}, {"b", "8"}, {"c", "9"} });
    

    Finally, once you're done writing rows, call .close() to stop the worker thread and close the file stream.

    foo.close();
    

    Contributing

    Contributions are welcome; have a look at the CONTRIBUTING.md document for more information.

    License

    The project is available under the MIT license.