This has been an interesting week. I've been working on the implementation of IDataReader and IEnumerable for a couple of classes in C#. The classes parse some complicated text files using regular expressions.
I've learned more about the .NET class that implements regex than I ever expected to. (I'm happy to have learned it.) I've also run into some minor cases where a regex I worked out in Perl didn't work the same way in the Microsoft component. However, the MS component has many strong points that made implementing the parsing method simpler. This isn't a Microsoft bash; there are just some minor differences when it comes to handling \r\n sequences. (I was able to work around them without issue.) I'd go so far as to say that learning regular expressions is a requirement for programming. You can accomplish your parsing task without using regex, but you can't do it in six or seven lines of code. My advice is to obtain a regex editor and start learning.
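To make the \r\n remark concrete, here is a small sketch of one line-ending quirk in the .NET Regex class. It is not necessarily the difference that bit me; the class and method names are made up for illustration. With RegexOptions.Multiline, `$` matches just before the `\n`, and `.` happily matches `\r`, so a capture in front of `$` keeps the carriage return unless you match the CRLF pair yourself.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class LineEndingSketch
{
    // '$' (Multiline) matches before '\n', and '.' matches '\r',
    // so the carriage return ends up inside the captured group.
    public static string CaptureWithDollar(string text)
    {
        return Regex.Match(text, @"^A0:(.*)$", RegexOptions.Multiline).Groups[1].Value;
    }

    // Matching the CRLF pair explicitly keeps '\r' out of the capture.
    public static string CaptureWithCrLf(string text)
    {
        return Regex.Match(text, @"^A0:(.*)\r\n", RegexOptions.Multiline).Groups[1].Value;
    }
}
```

Given the text "A0:17456789AH\r\nA1:x\r\n", the first method returns the value with a trailing carriage return and the second returns it clean, which is why the patterns later in this post spell out \r\n.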
Getting back to my story, my assembly parses data using regex and then presents it to the calling class through the IDataReader interface. The convenient part of implementing IDataReader is that you can load the data into a DataTable using its Load method. If you are clever with column names, you can easily create DataSets containing DataTables that have relationships to one another. You can do this without access to a database! It's not magic, but it is nifty. (And you can persist the DataSet to an XML file using the WriteXml method for future access.)
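A minimal sketch of the DataTable side of this, under a few assumptions: my real reader is a custom class, so a DataTableReader (which also implements IDataReader) stands in for it here, and the table names, column names (RecordId, Part, Value) and the relation name are invented for the example.

```csharp
using System.Data;
using System.IO;

// Hypothetical demo class; a DataTableReader stands in for the custom parser.
public static class ReaderLoadSketch
{
    // Wraps a few rows in an IDataReader, as the parsing class would expose them.
    private static IDataReader SampleReader(string[] columns, params object[][] rows)
    {
        DataTable source = new DataTable();
        foreach (string column in columns)
            source.Columns.Add(column, typeof(string));
        foreach (object[] row in rows)
            source.Rows.Add(row);
        return source.CreateDataReader();
    }

    // Loads two tables from readers and relates them on the shared column name.
    public static DataSet Build()
    {
        DataSet set = new DataSet("Parsed");
        DataTable records = set.Tables.Add("Record");
        DataTable readings = set.Tables.Add("Reading");

        using (IDataReader reader = SampleReader(
            new[] { "RecordId", "Part" },
            new object[] { "17456789AH", "AQZ-97567" }))
        {
            records.Load(reader);   // columns are created from the reader's schema
        }

        using (IDataReader reader = SampleReader(
            new[] { "RecordId", "Value" },
            new object[] { "17456789AH", "97.25" },
            new object[] { "17456789AH", "98.62" }))
        {
            readings.Load(reader);
        }

        // Parent/child relationship, no database involved.
        set.Relations.Add("RecordReadings",
            records.Columns["RecordId"], readings.Columns["RecordId"]);
        return set;
    }

    // Persists the whole DataSet as XML (to a string here; a file path also works).
    public static string ToXml(DataSet set)
    {
        using (StringWriter writer = new StringWriter())
        {
            set.WriteXml(writer);
            return writer.ToString();
        }
    }
}
```

Calling ReaderLoadSketch.Build() gives a DataSet whose parent row can reach its readings through GetChildRows("RecordReadings"), and ToXml shows what WriteXml would persist.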
Here's a simple setup for regex in my assembly (with exception handling removed for clarity). The text that is parsed is shown after the code.
// Requires: using System.Collections; using System.IO; using System.Text.RegularExpressions;
string loadfile = null;
TextReader loadfileReader = null;
CSomeContainerClass _someClassInstance = null;

// One capture group per field; every line, including the last, must end in CRLF.
Regex regex = new Regex(@"A0:(.*)\r\n" +
                        @"A1:(.*)\r\n" +
                        @"A2:(.*)\r\n" +
                        @"A3:(.*)\r\n" +
                        @"A4:(.*)\r\n" +
                        @"A5:(.*)\r\n");

// Read the whole file into memory so the regex can span several lines.
loadfileReader = File.OpenText(path);
loadfile = loadfileReader.ReadToEnd();
loadfileReader.Close();

if (regex.IsMatch(loadfile))
{
    MatchCollection matches = regex.Matches(loadfile);
    // Regex should only have one match in our example
    if (matches.Count > 1)
        throw new InvalidDataException("Bad match count");
    Match record = matches[0];
    ArrayList recordValues = new ArrayList();
    // Groups[0] is the entire match, so the captured fields start at index 1.
    for (int i = 1; i < record.Groups.Count; i++)
        recordValues.Add(record.Groups[i].ToString().Trim());
    _someClassInstance = new CSomeContainerClass(recordValues);
}
This code will match the following text:
A0:17456789AH
A1:8/24/2005 14:12
A2:AQZ-97567
A3:9000.10440
A4:61-1234
A5:97.25
The following values are returned in Groups[1..6] as strings (Groups[0] holds the entire match):
17456789AH
8/24/2005 14:12
AQZ-97567
9000.10440
61-1234
97.25
The match in this example is extremely simple. We would have to rework the regex string if our text looked something like the following:
A0:17456789AH 18765482AH 17211679AH 17491134AH
A1:8/24/2005 14:12 8/24/2005 14:14 8/24/2005 14:19 8/24/2005 14:22
A2:AQZ-97567
A3:9000.10440
A4:61-1234
A5:97.25 98.62 97.91 98.01
We could no longer rely on the grouping (.*) alone to gather the data for each field, because some fields now hold several values separated by whitespace. For field A0, the group would return 17456789AH 18765482AH 17211679AH 17491134AH as a single string, which is probably not what we want. It's possible to parse the data using a different grouping construct. The whitespace-matching class \s (or its complement \S) might prove useful; note that each A1 value contains a space of its own. I'll leave the solution as an exercise for the sufficiently curious.
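For anyone who would rather peek, here is one possible approach, sketched under assumptions rather than offered as the answer. It leans on a .NET-specific feature: a group that matches repeatedly keeps every match in its Captures collection, and the same group name may appear more than once in a pattern. The class name, method name and group names (a0 through a5) are invented for the example, and I assume values are separated by spaces or tabs and that every line ends in CRLF.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class MultiValueSketch
{
    // A0 and A5 hold whitespace-separated tokens; A1 holds "date time" pairs.
    // [ \t] is used instead of \s so a separator can never swallow the CRLF.
    private static readonly Regex Record = new Regex(
        @"A0:(?<a0>\S+)(?:[ \t]+(?<a0>\S+))*[ \t]*\r\n" +
        @"A1:(?<a1>\S+ \S+)(?:[ \t]+(?<a1>\S+ \S+))*[ \t]*\r\n" +
        @"A2:(?<a2>.*)\r\n" +
        @"A3:(?<a3>.*)\r\n" +
        @"A4:(?<a4>.*)\r\n" +
        @"A5:(?<a5>\S+)(?:[ \t]+(?<a5>\S+))*[ \t]*\r\n");

    // Returns every value captured for one field, in the order it appears.
    public static List<string> Values(string text, string groupName)
    {
        Match match = Record.Match(text);
        if (!match.Success)
            throw new FormatException("Record not found");

        List<string> values = new List<string>();
        foreach (Capture capture in match.Groups[groupName].Captures)
            values.Add(capture.Value);
        return values;
    }
}
```

With the second sample text, Values(text, "a0") yields the four identifiers as separate strings, and Values(text, "a1") yields the four date/time pairs. Splitting the (.*) capture afterwards with String.Split would work too for A0 and A5, but A1 needs the pair-aware pattern either way.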