CSV is a text file format, short for Comma-Separated Values. It is a very simple way to store data: each table record is on its own line, and values are listed one after another with commas in between.
CSV is a de-facto standard for data exchange, and many applications can export their data in CSV format. Such files can grow very large. Our applications both read and write CSV in streaming mode, so we never have to load a whole file into memory – this means we can easily handle huge CSV files, many gigabytes in size. We impose no limit on file size at all. Larger files take longer to process, but we can generally handle any size.
Parsing CSV is very easy. Parsing all the variants out there is not trivial at all.
When a value itself contains commas, as free text often does, the value has to be quoted. Being such a simple format, CSV is inevitably written in slightly different ways by different applications. Some quote every value, some quote only values containing commas, some quote all textual values, and so on. There are also different ways of encoding quotes inside quoted values: some applications precede a nested quote with a backslash character, others double the quote character. Values can in fact be delimited with a character other than the comma – the TAB character (ASCII 9) is often used. Depending on whether the file was written on Windows or on Linux/Unix/Mac, line endings are either two characters (ASCII 13 followed by 10) or a single character (ASCII 10). Be careful with multiline text in values: it is legal when the value is quoted, but a naive parser will prematurely detect the end of the record.
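A couple of the variants above can be demonstrated with Python's `csv` module; the sample strings are made up for illustration. A tab-delimited file needs an explicit delimiter, and doubled quote characters inside a quoted value decode back to a single quote:

```python
import csv
import io

# Hypothetical sample data illustrating two common dialect variants.
tab_data = "name\tcity\nAda\tLondon\n"
quoted_data = 'name,remark\nAda,"Hello, ""world"""\n'

# TAB-delimited: the delimiter must be given explicitly.
tab_rows = list(csv.reader(io.StringIO(tab_data), delimiter="\t"))

# Doubled quotes inside a quoted field: the default dialect
# decodes "" back to a single " character.
quoted_rows = list(csv.reader(io.StringIO(quoted_data)))
```

A parser that accepts real-world CSV has to make all of these choices (delimiter, quoting rule, quote escaping, line endings) configurable or detect them from the data.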
We have developed our own CSV parser and employ a rich test suite to make sure it handles the common variations found in our users' CSV files. Even though the CSV format carries no metadata about field types, we analyze the data and determine the optimal data type for each field.
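The type-detection idea can be sketched as follows. This is a simplified illustration, not our production analyzer: it tries the narrowest type that fits every value in a column, falling back to text; a real analyzer would also consider dates, booleans, and empty values.

```python
def infer_type(values):
    """Guess the narrowest type that fits every value in a column.

    Tries int first, then float; if neither fits all values,
    the column is treated as text. A deliberately minimal sketch
    of column type inference.
    """
    for cast, name in ((int, "int"), (float, "float")):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "str"
```

For example, a column of `["1", "2", "3"]` comes back as `int`, while mixing in `"2.5"` widens it to `float`.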