
www.heteroclinic.net

Tiny CSV Reader
I. The Requirements

201506

CSV files are handy for bulk processing and are also human readable. Wikipedia has an article on the formalization of the format*, which we quote here:


* https://en.wikipedia.org/wiki/Comma-separated_values, last retrieved June 2015.
Tiny CSV Reader
II. The Natural Language Expression

201506

It is important to identify and count the appearances of the double quote literal.
Here we shorten it to "quote". If quotes appear consecutively an odd number of times, the first quote of the run is a field border.
A record separator can be a newline literal. Before it, field borders must have appeared an even number of times; otherwise the newline is a literal inside a field. In other words, a record separator is a newline that has an even number of odd-length quote runs before it. It follows that the next valid record separator also always has an even count of quotes before it.
A field separator can be a comma literal. As with the record separator, an even number of odd-length quote runs always appears before it.
A field can be an empty string. A record can be an empty line. A whole file can be empty. If we don't handle these cases, the program may halt before its expected exit.
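
The rules above can be sketched without any regular expression, as a hedged illustration only and not the implementation this series arrives at. An even number of odd-length quote runs is the same thing as an even total number of quotes, so one parity flag is enough. The class name QuoteParitySketch and the method parse are hypothetical, the record separator is assumed to be \n only, and quotes are left inside the fields.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: splits text into records and fields by quote parity.
final class QuoteParitySketch {

    static List<List<String>> parse(String text) {
        List<List<String>> records = new ArrayList<>();
        List<String> record = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean evenQuotes = true; // parity of all quotes seen so far

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '"') {
                evenQuotes = !evenQuotes;
                field.append(c);            // quotes are kept in the field
            } else if (c == ',' && evenQuotes) {
                record.add(field.toString()); // comma is a field separator
                field.setLength(0);
            } else if (c == '\n' && evenQuotes) {
                record.add(field.toString()); // newline is a record separator
                field.setLength(0);
                records.add(record);
                record = new ArrayList<>();
            } else {
                field.append(c);            // literal, possibly a quoted comma or newline
            }
        }
        // Flush a last record that is not terminated by a newline.
        if (field.length() > 0 || !record.isEmpty()) {
            record.add(field.toString());
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        // Second field contains a newline; the third record is an empty line.
        System.out.println(parse("a,\"b\nc\",\n\nx"));
        System.out.println(parse("").isEmpty()); // an empty file gives no records
    }
}
```

An empty line yields a record holding one empty field, and an empty file yields no records, so the three empty cases listed above do not stop the loop early.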

Tiny CSV Reader
III. Translating the Natural Language Expression into a Programmable Regular Expression

201506
Here, after a long hike along a winding trail, we implement it in Java for convenience.
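
As a minimal sketch of what such a translation can look like for the field separator, and not necessarily the expression used in this project: the pattern below accepts a comma only when an even number of quotes follows it up to the end of the record. It counts the quotes after the comma with a look-ahead instead of the quotes before it, which gives the same answer only when the record as a whole contains an even number of quotes. The names FieldSplitSketch, FIELD_SEPARATOR and splitFields are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: splits one record into fields with a look-ahead.
final class FieldSplitSketch {

    // A comma, then (non-quotes, quote, non-quotes, quote) repeated any
    // number of times, then non-quotes up to the very end of the input.
    static final Pattern FIELD_SEPARATOR =
            Pattern.compile(",(?=(?:[^\"]*\"[^\"]*\")*[^\"]*\\z)");

    static List<String> splitFields(String record) {
        // The limit -1 keeps trailing empty fields.
        return Arrays.asList(FIELD_SEPARATOR.split(record, -1));
    }

    public static void main(String[] args) {
        // The comma inside the quoted field is not treated as a separator.
        for (String field : splitFields("a,\"b,c\",,")) {
            System.out.println("[" + field + "]");
        }
    }
}
```

The look-ahead rescans the rest of the record for every comma, so this form suits short records better than whole files.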

Tiny CSV Reader
IV. The Proof of Concept Test File

201506

We post the test data file here. It covers a limited set of scenarios that in real life could make a lesser program fail.

Tiny CSV Reader
V. Program through tests

201507

There was no straight way to get the program to pass the test file. The major problem is that I tried to solve it as if I were working on an exercise from a theoretical computer science textbook (automata, regular languages).


Tiny CSV Reader
VI. It is not a straight line

201507

Getting the program to pass the test data file is not a linear process.
Some tasks are trivial, like reading a file into a string, testing multiline regexes, or testing and using greedy and reluctant quantifiers. If you put the test data in the source code file, you have to manipulate the character escape sequences by hand. Some things can only be verified by brute-force or diligent testing, such as the fact that the reluctant mode of a group needs an extra outer layer of brackets, unless you can memorize all the operator precedences. I still don't understand why we use [^\\\"] for a literal that is not a double quote; it just works, and the other ways around won't. It is also tricky that $, the dollar sign, in multiline mode is no longer a line separator (end of line); the strict literal(s) would be System.getProperty("line.separator"), but for convenience I just use \n.
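
A small sketch of these pitfalls, offered as one possible reading of them and not as a record of the tests actually run. The Java literal "[^\\\"]" reaches the regex engine as the five characters [^\"], in which \" is simply an escaped quote; as far as the Java escaping rules go, the shorter literal "[^\"]" describes the same character class. The dollar sign is a zero-width assertion and never consumes the newline, which is why matching \n explicitly is the convenient route. The class RegexPitfallsSketch and the helper firstMatch are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the escaping, quantifier and $ pitfalls.
final class RegexPitfallsSketch {

    // Returns the first match of regex in input, or null if there is none.
    static String firstMatch(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "\"a\",\"b\"\nc"; // the text "a","b" then a newline then c

        // Both spellings of "not a double quote" find the same first run.
        System.out.println(firstMatch("[^\\\"]+", 0, input)); // a
        System.out.println(firstMatch("[^\"]+", 0, input));   // a

        // Greedy runs to the last quote on the line, reluctant stops at the next one.
        System.out.println(firstMatch("\".*\"", 0, input));   // "a","b"
        System.out.println(firstMatch("\".*?\"", 0, input));  // "a"

        // With MULTILINE, $ matches before the newline but does not consume it.
        System.out.println(firstMatch("\"$", Pattern.MULTILINE, input).length()); // 1
        // Without MULTILINE, $ does not match before an inner newline at all.
        System.out.println(firstMatch("\"$", 0, input));      // null
        // Matching the newline literally does consume it.
        System.out.println(firstMatch("\"\n", 0, input).length()); // 2
    }
}
```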
Mostly, the documentation tends to give you general ideas at a high level of abstraction (not exactly; it is hard to find strict theorems, equations and formulas that lead to deterministic results in an SDK), so maybe it just tends to be brief. The SDK version you use usually lags behind or overshoots the documentation you are reading. As for going to the source code, unless you have abundant leisure time to debug through every line, you give up early; I typically give up at line twenty.

Tiny CSV Reader
VII. When it converges

201507
When the test for finding a literal that appears an even or odd number of times passed, I thought we were almost done. I supposed finding a record would be simple: we just find a line separator with no quote before it, or a line separator with an even count of quotes before it. But this 'or' condition caused some chaos, as you can see if you read or run the test testParseRecords(). In a textbook exercise on regular languages (context free or not), you define the literals and the alphabet strictly, then apply the theorems and formulas. In this case, however, the field separator literal and the record separator literal are both wildcard literals. I still have not figured out the exact, correct regular grammar (symbol rules); instead I took a trial-and-error approach, which is closer to real-life physics or engineering. The analysis of testParseRecords() led us to the conclusion that the two conditions on either side of the 'or' are not mutually exclusive, so when the matcher iterates through the input stream, the regex parser is confused and each finding consumes not just one condition but both.
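
One way to sidestep the overlapping 'or', sketched here under assumptions and not taken from the project's code: zero quotes is also an even count, so the two conditions collapse into one. In the pattern below a record is a run of items, each either a single character that is neither a quote nor a newline, or a whole quoted segment. Every quoted segment consumes exactly two quotes, so the count before the closing newline is always even, and a doubled quote inside a field simply reads as two adjacent segments. The names RecordSplitSketch, RECORD and parseRecords are hypothetical, and \n is assumed as the only line separator.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: one unambiguous condition per record.
final class RecordSplitSketch {

    // Group 1: any mix of plain characters and quoted segments.
    // Then the record separator, or the end of the input.
    static final Pattern RECORD =
            Pattern.compile("((?:[^\"\\n]|\"[^\"]*\")*)(?:\\n|\\z)");

    static List<String> parseRecords(String text) {
        List<String> records = new ArrayList<>();
        Matcher m = RECORD.matcher(text);
        int pos = 0;
        while (pos < text.length()) {
            m.region(pos, text.length());
            if (!m.lookingAt()) { // only happens when a quote is never closed
                throw new IllegalArgumentException("unbalanced quote after offset " + pos);
            }
            records.add(m.group(1));
            pos = m.end(); // continue right after the consumed separator
        }
        return records;
    }

    public static void main(String[] args) {
        // Second record holds a quoted newline; third record is an empty line.
        for (String record : parseRecords("a,b\n\"x\ny\",z\n\nlast")) {
            System.out.println("[" + record + "]");
        }
    }
}
```

Because the two alternatives inside the group start with different characters, the matcher never has two ways to consume the same input.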

When I write a program, I always think of it as designing a truth table. That always leads to fixed points. Sometimes, though, the truth table can be very big.
So we may say that without a theoretical foundation and well-designed, logically sound requirements, the work may not converge.

Tiny CSV Reader
VIII. Discussion

201507

Regular languages are the foundation for understanding how computer programs work and how to write functioning ones. We took this chance to study and review some key techniques in regular expressions and to partially solve a problem.
For this task in particular, some things are still open: removing the bordering quotes, replacing doubled quotes with a single quote, and checking the integrity of the input file. For the last one, I suggest using a pre-allocated field-record grid to hold the parsed results, so that when the integrity check fails the program does not halt abnormally and the user has a chance to review what went wrong.
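
The open items could be approached along these lines; this is a rough sketch under assumptions, not code from the repository. The class FieldCleanupSketch and the methods unquote and fillGrid are hypothetical names. The grid is assumed to be allocated by the caller with the expected number of records and fields, and mismatches are collected as messages instead of being thrown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the three open items.
final class FieldCleanupSketch {

    // Removes the bordering quotes and turns doubled quotes into single ones.
    static String unquote(String raw) {
        if (raw.length() >= 2 && raw.startsWith("\"") && raw.endsWith("\"")) {
            return raw.substring(1, raw.length() - 1).replace("\"\"", "\"");
        }
        return raw; // unquoted fields are returned unchanged
    }

    // Copies parsed fields into the pre-allocated grid and reports every
    // record that does not fit; cells that receive no field stay null.
    static List<String> fillGrid(List<List<String>> parsed, String[][] grid) {
        List<String> problems = new ArrayList<>();
        for (int r = 0; r < parsed.size(); r++) {
            List<String> record = parsed.get(r);
            if (r >= grid.length) {
                problems.add("record " + r + ": beyond the allocated rows");
                continue;
            }
            if (record.size() != grid[r].length) {
                problems.add("record " + r + ": expected " + grid[r].length
                        + " fields, found " + record.size());
            }
            int n = Math.min(record.size(), grid[r].length);
            for (int c = 0; c < n; c++) {
                grid[r][c] = unquote(record.get(c));
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        String[][] grid = new String[2][2];
        List<List<String>> parsed = Arrays.asList(
                Arrays.asList("a", "\"say \"\"hi\"\"\""),
                Arrays.asList("c")); // one field short
        System.out.println(fillGrid(parsed, grid));
        System.out.println(Arrays.deepToString(grid));
    }
}
```

The short record still lands in the grid, with the missing cell left null, so the user can inspect both the data and the list of problems.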
You can leave your comments at https://github.com/wangzhikai/TinyCSVReader.