Click here to Skip to main content
15,394,479 members
Articles / Programming Languages / C#
Article
Posted 16 Mar 2017

Tagged as

Stats

28.5K views
440 downloads
28 bookmarked

Why to Build Your Own CSV Parser (or Maybe Not)

Rate me:
Please Sign up or sign in to vote.
4.98/5 (26 votes)
19 Mar 2017CPOL14 min read
Comparison of CSV parsers plus one custom implementation

Introduction

I will discuss various properties of a CSV parser, compare several parsers from NuGet and also provide my custom implementation.

Background

This is based on a work I did in the past for a customer in finances. We were building a calculation engine and importing CSV files was one of the main tasks there. I evaluated a lot of CSV parsers but eventually we decided to roll out our own implementation because of our unique requirements. As it's usually the case these special requirements were not that important after all but of course we learned that only after the implementation was finished. In the end, we replaced the core of our custom solution with TextFieldParser which had better handling of quoted field. Let me now share what I learned about parsing CSVs.

What is a CSV and How to Parse It

CSV format is defined by RFC4180 as this:

file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D
DQUOTE =  %x22
LF = %x0A
CRLF = CR LF
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E

Let us discuss few interesting facts there.

Comma as Sole Field Separator

This is usually not true as people will send you files which can have pretty much anything for separator. Granted that comma, pipe and semicolon are the most common.

We actually had a requirement originally that the separator can be more than one character. This has never been used.

CR+LF as Line Ending

Again, this requirement gets ignored very often. You can be sure that nobody will care that the file they are sending to you has Windows, Unix or Mac line endings or a mixture of thereof. You just have to deal with it. Ideally, we should be able to parse any reasonable combination of CR and LF much like StreamReader.ReadLine method does.

In our implementation, we were very specific and allowed the user to specify the line endings explicitly. This was never popular for good reasons - instead of solving the problem, we just dumped it on the user. In the end, we defaulted it to Windows+Unix+Mac and never touched it again.

Quoted Fields

Quoted field must begin and end with double quote. Quote inside a field is escaped by doubling it. Quoted field can contain commas (the field separator) and new line (the line separator). This is very important as you can get all kinds of characters in text fields.

Note that it is invalid for the quote to be included in the text data: abc "efg" hij. In theory, this should fail as undefined state but in real life, you want to import as much as possible so this case would just be ignored and quote will be parsed as normal character.

I need to say that implementing a custom quote character is not much of a problem. Nevertheless, I have never seen anything other than double quote. Occasionally, people try to escape quotes using backslash but that's just wrong.

Skipping Lines

There are no comments or empty lines!

In most implementations, you are allowed to use single line comments. If you start the line with a hashtag (#), or a custom character of your choice, this line is skipped.

Likewise, empty lines are not part of the grammar above. But many implementations still allow you to skip empty lines automatically.

Comments are useful - they are occasionally used. On the other hand, I didn't see files with empty lines in the production environment.

Occasionally, there is the option to skip first N lines. This may be useful if somebody sends you garbage at the beginning of the file. On the other side, if there is garbage, it is much better to use comments to skip it.

Data Conversions

Some of the implementations allow you to trim white spaces around fields automatically. While this may be useful, I feel it is not the responsibility of the parser but rather some layer above. If this functionality is provided, it should be optional.

Some of the parsers try to do more than just giving you the lines and fields - they try to convert string into types specified by the user. Some implementers feel that loading data directly into POCO entities is the only correct way of parsing CSV. I certainly think these features have their target audience, but in my case I was either disabling this functionality if possible or excluding the parser from my evaluation altogether as this didn't fit my needs.

Reading and Writing Data

There are implementations that parse the whole file into an array or list or DataTable and return the whole content to you in one piece. This is just fine for small files but becomes a blocker when you need to import gigabytes of data. I was only looking at implementations that could read the file line by line.

Some libraries provide the functionality to write CSV data. I didn't focus on this part as it was not important to my use case.

Feature Comparison

Some of the interesting features of each parser. Empty fields mean I didn't find the respective information.

Parser Access Delimiter Line Endings Quote EOL inside quote Comment Trim white spaces
Microsoft.VisualBasic.
FileIO.TextFieldParser
Read line multiple strings   " yes multiple strings optional
Nuget: Cinchoo ETL 1.0.2.4
(article on CP)
Enumerator single string single string single char yes multiple strings yes
Nuget: Csv 1.0.11 Enumerable single char StreamReader.ReadLine " no no optional
Nuget: CsvHelper 3.0.0-beta7 Read line single string windows, unix single char yes single char optional
Nuget: CsvToolkit 0.13.0 Enumerable single char windows, unix single char yes no optional
Nuget: DevLib.Csv 2.16.23.19010 Enumerable single char StreamReader.ReadLine single char no no no
Nuget: LibCsv4Net 1.8.9.1102 Enumerable single char single string single char yes no no

Nuget: LumenWorksCsvReader 3.9.1
(article on CP)

Enumerable single char windows, unix single char yes single char optional
Nuget: Net.Code.Csv 1.0.3 Data reader single char   single char yes single char optional
Nuget: Nortal.Utilities.Csv 0.9.2 Read line single char single string single char yes no no
Nuget: Uncomplicated.Csv 1.5.2 Read line single char windows single char yes no no
My implementation Read line single char CR,LF,CR+LF,LF+CR normalizes into CR+LF " yes # no
CodeProject: C# - Light and Fast CSV Parser Enumerable single char windows, unix single char yes no yes
CodeProject: C# CSV File and String Reader Classes Read line single char   single char yes no no

Explanation of data access:

  • Read line: There is a method which can read a single line in the form of a string array (or equivalent) and returns null when end of file is reached.
  • Enumerable: The parser returns an enumerable of lines (string arrays or equivalent).
  • Data reader: Parser implements IDataReader interface.

Using the Code

To replicate the results shown below, please use the SimpleCsvReader.Demo project which is a simple console application. There are three modes.

First, you need to generate a random test file. You can specify number of lines, i.e., the size in the second parameter.

SimpleCsvReader.Demo.exe /gen 10000

When you have the file generated, you can parse it using the various parsers and record the time. The second parameter specifies how many times you want to repeat the measurements, averages of the results will be calculated.

SimpleCsvReader.Demo.exe /run 3

Finally, there is a third option to verify if the parsed data matches the generated data. This is done using a SHA256 hash of the data written to and read from the file.

SimpleCsvReader.Demo.exe /verify

Performance Comparison

The table shows the average time in seconds to parse a file with given number of rows. The sample files didn't contain any comments but they did contain quoted fields with line breaks. Please see also comments below the table.

Parser 10k 100k 1M 10M 100M
Microsoft.VisualBasic.FileIO.TextFieldParser 0.21 2.04 20.17 201.21 2061.95
Nuget: Cinchoo ETL 1.0.2.4 (3) 0.71 6.80 67.72 678.22 6890.05
Nuget: Csv 1.0.11 (1)(3) 0.66 6.51 64.45 662.48 6642.64
Nuget: CsvHelper 3.0.0-beta7 0.06 0.56 5.60 55.69 584.74
Nuget: CsvToolkit 0.13.0 (3) 0.12 1.18 11.68 118.92 1221.08
Nuget: DevLib.Csv 2.16.23.19010 (1) 0.04 0.35 3.51 34.95 374.55
Nuget: LibCsv4Net 1.8.9.1102 0.25 2.46 24.50 244.65 2507.25
Nuget: LumenWorksCsvReader 3.9.1 0.02 0.21 2.11 20.96 229.66
Nuget: Net.Code.Csv 1.0.3 (3) 0.05 0.52 5.16 51.30 541.96
Nuget: Nortal.Utilities.Csv 0.9.2 0.06 0.59 5.89 59.02 613.04
Nuget: Uncomplicated.Csv 1.5.2 0.04 0.34 3.43 33.95 365.23
SimpleCsvParser (2) 0.08 0.77 7.63 77.53 790.33
SimpleCsvParserMerged (2) 0.07 0.65 6.48 64.65 675.81
CodeProject: C# - Light and Fast CSV Parser 0.06 0.56 5.53 56.26 585.87
CodeProject: C# CSV File and String Reader Classes (3) 0.05 0.51 5.12 52.02 545.95
  1. Doesn't respect line break inside quotes.
  2. My implementation.
  3. Additional overhead because output needs to be adapted to collection of strings.

The following graph shows the relative speed of the parser compared to the VB TextFieldParser which is used as reference. Parsers slower than the reference are not shown.

Image 1

LumenWorks is the best. I was able to get under 40% with my own implementation. CsvHelper beta7 performance is about 30% better than the last stable version.

Custom Implementation

First, I need to say that what I'm showing here is not the implementation I wrote for my former employer as that is proprietary work. What I show here is my own implementation done for research purposes.

The code described in the section is located in SimpleCsvReader.Lib is the attached project.

Design

CSV parser can be implemented as a finite-state machine. Parsing one line can be represented by the following diagram. Each transition is labeled with the corresponding input that triggers the transition slash the performed action. This is so called Mealy machine.

Image 2

Start Line is the start state, End Line is the final state. To consume multiple line endings, these are pre-processed so the state machine doesn't have to deal with it directly and is therefore represented by single token EOL. EOF is end of file. ELSE means any other character. When there is no more data in the input, null is returned. If the line is empty, empty string array is returned.

Append char means adding the current char to the current field. End field takes the current field, adds it to the current line and clears the current field so parsing can continue. Keep char means that the input is not advanced to the next char so the same char is processed in the next state.

Please note that there is no transition from Quoted Field on EOF - this will throw an exception as the quoted field is not closed. Also, there can be no text following Double Quote - this will again raise an exception. On the other hand, we allow quote inside Regular Field.

TextReaderWrapper Class

The underlying file is read char by char using TextReader.Read method. This returns an integer and returns -1 in case end of file is reached. To allow simple manipulation with line endings, we wrap this and return -2 in case of line end.

This class has another important responsibility and that is tracking the current position in the input file. Not only the absolute position of the current character but also logical line and column number. This is useful when reporting errors to the user.

LineBuilder Class

Here, the line is built field by field and char by char. Current field is represented by a StringBuilder, current line by list of strings. Chars are being added to the current field, line ends are automatically translated to Environment.NewLine - the original value is lost. When field ends, the string is appended to the current line and current field's StringBuilder is cleared. When the line ends, the current line is returned to the caller.

Context and State Classes

I chose to implement the parser using State design pattern. This has the benefit of having the transition logic nicely encapsulated. It is direct translation of the diagram above. Context class serves as the main hub facilitating the transitions and executing related actions while the actual logic deciding next state and what to do is in the respective state class.

CsvParser Class

This is where is all comes together. CsvParser class is the main interface to the parser - the ReadLine method resides here. In case we would need more parameters than the delimiter, this is the place where they should be processed and validated. We also catch the parsing exceptions here and inject line and column numbers to help with problem solving.

CsvParserMerged Class

My parser was working fine, unit tests were passing but I was a bit disappointed with the performance. I thought my parser would be lightning fast! And although I've surpassed the reference implementation by more than a half, there were others who were better. And that, of course, I couldn't let go.

If you look at the code, I didn't change that much. Mainly, I merged everything into one class making it private. I also implemented the state machine using simple switch statement and states using enums. With all of this, I was able to squeeze additional 5% improvement out of it.

By rewriting the state machine, I got rid of the virtual methods. This in turn allowed the JIT compiler to inline far more stuff than before. I guess this is the main reason for the performance gain.

Points of Interest

If you think that CSV files are obsolete, then you are probably right. There are better ways to exchange information. Is it dead? Not by far.

Can We Further Improve the Performance?

It depends. Our implementation - the one from the beginning of this article - was not only parsing the file but also converting number and dates, then wrapping it in a IDataReader implementation and finally inserting everything into database via SqlBulkCopy. If you imagine the whole pipeline, it looks like this: FileStream > StreamReader > Data Conversion > Data Reader > SqlBulkCopy.

I spent some time with profiler looking at the performance and as you have probably guessed, most time is spent in reading the file itself - no surprise there. The second bottleneck was data conversion. There is definitely room for improvement there.

Also, there is a lot of buffering and copying between buffers going on. FileStream reads bytes into a buffer. StreamReader uses encoding to get chars from bytes and puts those into its own buffer. You then create strings out of these chars. Then conversions take in strings but internal implementations work with char arrays. SqlBulkCopy always calls GetValue method which returns object and your numbers and dates get boxed.

With some hacking, you can save some memory allocations but I doubt it's worth the effort.

Is It Worth Writing My Own CSV Parser?

Not really. Just pick one from NuGet and be off with it. You can always replace it with something better if you need to.

Why we didn't use some existing library? Well, one reason was the ridiculous requirements. Another reason was the conversion layer that turns strings into numbers and dates. You would be surprised at how difficult it is to get this right considering you need to parse data in different formats and cultures. This is why we kept the conversion layer and only replaced the parser.

Why we didn't use some existing solution like SSIS? Now that is a very good question and some day, I just might write another article about that.

History

  • 16th March, 2017 - Initial release
  • 17th March, 2017 - CsvParser 0.5.2 removed as it parses data upfront
  • 20th March, 2017 - New parsers added to the comparison:
    • Two parsers from CodeProject articles
    • Cinchoo ETL 1.0.2.4
    • CsvHelper updated to 3.0.0-beta7

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

Tomas Takac
Software Developer (Senior)
Czech Republic Czech Republic
I started programming in Basic in 1989, learned Pascal in 1993, switched to C/C++ in 1997, discovered Java in 2001 and finally settled with C#/.NET in 2003.

Comments and Discussions

 
SuggestionNewer Library Pin
Jonathan Wood17-Feb-20 4:05
MemberJonathan Wood17-Feb-20 4:05 
QuestionMy BIG 5! Pin
Maciej Los14-Mar-18 11:41
mveMaciej Los14-Mar-18 11:41 
QuestionNice Pin
Sacha Barber30-Mar-17 3:55
MemberSacha Barber30-Mar-17 3:55 
AnswerRe: Nice Pin
Tomas Takac30-Mar-17 21:48
MemberTomas Takac30-Mar-17 21:48 
Questioncsv line parser Pin
obermd21-Mar-17 15:22
Memberobermd21-Mar-17 15:22 
AnswerRe: csv line parser Pin
Tomas Takac21-Mar-17 22:07
MemberTomas Takac21-Mar-17 22:07 
QuestionMy vote of #5 ... and, there are a number of recent CSV parsers on CP Pin
BillWoodruff16-Mar-17 23:54
mveBillWoodruff16-Mar-17 23:54 
AnswerRe: My vote of #5 ... and, there are a number of recent CSV parsers on CP Pin
Tomas Takac17-Mar-17 5:10
MemberTomas Takac17-Mar-17 5:10 
BugF# CsvParser results are not correct Pin
bvsms16-Mar-17 13:48
Memberbvsms16-Mar-17 13:48 
GeneralRe: F# CsvParser results are not correct Pin
Tomas Takac16-Mar-17 21:45
MemberTomas Takac16-Mar-17 21:45 
GeneralCRLF in QUOTEs Pin
PIEBALDconsult16-Mar-17 6:40
professionalPIEBALDconsult16-Mar-17 6:40 
GeneralRe: CRLF in QUOTEs Pin
Tomas Takac16-Mar-17 11:20
MemberTomas Takac16-Mar-17 11:20 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.