Streaming Lines

04/15/2014, Tue
Categories: #JavaScript
Tags: #NodeJs

Line by Line

There are many streaming modules on npm for reading lines from text files, but many of them are inaccurate or slow. Inaccurate means the module fails to read some of the lines in the file; slow means the module crawls on files with a lot of text per line.

Below are the line counts and timings for reading a large JSON file, 'bird.json', which has 909 lines. The JSON is densely packed, meaning a few of its lines are very long.

[Graph: bird.json read time] [Graph: bird.json line count]

Line-reader (excluded from the graph) slows nearly to a standstill on a JSON file with dense lines, taking over 76 seconds to complete the read. This is unusual for line-reader: on less dense files its speed is comparable to that of the other streaming modules, as shown below.

The next test reads 'War and Peace', found on Project Gutenberg. This text file has 65008 lines.

[Graph: War and Peace read time] [Graph: War and Peace line count]

Linebyline errored out with 'RangeError: Maximum call stack size exceeded', while most of the other stream readers finished in the neighborhood of 300 ms.

The files for the above benchmark are here.
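
For readers who want to try a measurement like this themselves, here is a minimal sketch of a timing harness. It is not the benchmark code referred to above. The names `naiveSplitter` and `timeLineCount` are made up for illustration, and the baseline splitter is a hand-rolled stand-in built only on Node's core `stream` module, not the implementation of any npm package. It assumes a Node version that ships `stream/promises`.

```javascript
const fs = require('fs');
const { Transform } = require('stream');
const { pipeline } = require('stream/promises');
const { StringDecoder } = require('string_decoder');

// Hypothetical baseline splitter (not taken from any npm module): it buffers
// partial text and pushes one string per line.
function naiveSplitter() {
  const decoder = new StringDecoder('utf8'); // keeps multi-byte characters intact across chunks
  let tail = '';
  return new Transform({
    readableObjectMode: true,
    transform(chunk, _encoding, done) {
      const parts = (tail + decoder.write(chunk)).split('\n');
      tail = parts.pop(); // the last piece may be an incomplete line
      for (const line of parts) this.push(line);
      done();
    },
    flush(done) {
      tail += decoder.end();
      if (tail.length > 0) this.push(tail); // final line with no trailing '\n'
      done();
    },
  });
}

// Time a full read of `file` and count how many lines the splitter emits.
// `makeSplitter` is any function returning a stream that emits one line per chunk.
async function timeLineCount(file, makeSplitter) {
  let lines = 0;
  const start = process.hrtime.bigint();
  await pipeline(
    fs.createReadStream(file),
    makeSplitter(),
    async (source) => {
      for await (const _line of source) lines += 1;
    }
  );
  const ms = Number(process.hrtime.bigint() - start) / 1e6;
  return { lines, ms };
}
```

Calling `timeLineCount('bird.json', naiveSplitter)` resolves to an object with `lines` and `ms`. To measure an npm module instead, pass a factory for it, for example `() => require('split2')()`, assuming the module is installed. A single run is noisy, so averaging several runs gives a fairer comparison.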

Conclusion

It was surprising to see that some modules miscounted the lines of 'bird.json' by a significant amount, and that almost all modules misreported the line count to some degree. The last line of both the JSON file and the text file is blank, and it is often not read.
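
The blank last line is an easy place for two readers to disagree by one. The sketch below assumes that the "blank last line" is the empty string left after the file's final newline; `splitLines` and its `keepTrailingEmpty` switch are hypothetical names used only to show the two counting conventions.

```javascript
// Split text into lines. `keepTrailingEmpty` is a hypothetical switch that
// decides whether the empty string after the final newline counts as a line.
function splitLines(text, keepTrailingEmpty) {
  const lines = text.split(/\r?\n/);
  if (!keepTrailingEmpty && lines[lines.length - 1] === '') {
    lines.pop(); // drop the blank "line" that follows the last newline
  }
  return lines;
}

const sample = 'first\nsecond\n';
console.log(splitLines(sample, true).length);  // 3
console.log(splitLines(sample, false).length); // 2
```

Both answers are defensible, so when comparing modules it helps to know which convention each one follows before calling a count wrong.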

Most stream readers were not adept at reading the 'bird.json' file.

The streaming line readers I recommend as the more reliable choices are 'split2' and 'split', since both report the line count accurately and can handle densely packed files.