Simpler is better: November 2009

Retrieve data from a CSV file with a schema.ini file

Some times our program is required to retrieve data from a CSV file then write the data into another place, such as another client's database. Suppose there is a CSV file named test.csv with the content below


Account Number,Service Number,Service Type,Service Status,Service Start Date,Service End Date,Service PostCode
1000001,0396971002,ADSL,Open,2008-03-21 08:00:00,2009-03-20 08:00:00,3218
1000002,0898701423,Fixed Line,2006-05-08 12:00:00, 2009-05-07 12:00:00,6101
1000003,0432005900,Mobile,2009-10-28 00:00:00, 2012-10-27 00:00:00,3400

In C# it can be read into a DataSet by the method like this


public static DataSet ReadCsvFile(string path)
{
    DataSet ds = new DataSet();

    string directory = Path.GetDirectoryName(path);
    string fileName = Path.GetFileName(path);

    string connString = string.Format("Provider=Microsoft.Jet.OLEDB.4.0;Data Source={0};Extended Properties='text;HDR=Yes;FMT=Delimited;IMEX=1'", directory);
    string commText = string.Format("select * from [{0}]", fileName);

    OleDbConnection conn = new OleDbConnection(connString);
    OleDbCommand command = new OleDbCommand(commText, conn);
    OleDbDataAdapter adapter = new OleDbDataAdapter(command);

    adapter.Fill(ds);

    return ds;
}

Please note that we don't have to worry about the connection open and close here. DataAdapter takes care of it automatically, even if there is an exception happened inside adapter.Fill(ds). But if the connection is opened explicitly before adapter.Fill() then it needs to be closed explicitly after that.

After it loads the content into the DataSet, if you look into the table in the DataSet you will find that in the second column 'Service Number' all the first 0 disappeared. The reason is that the Jet engine generates the data type according to the value in the CSV file, which in this case it treats '0396971002' as a number and gets rid of the first 0. More accurately, it generates the data type according to the value in the first data line. In the CSV example above the first line is the header and the second line is the first data line. So if we add spaces in the value of column 'Service Number' in the second data line as this


1000002,08 9870 1423,Fixed Line,2006-05-08 12:00:00, 2009-05-07 12:00:00,6101

Then you will find in the DataSet this value is DBNull because it fails to be converted to a number type which is defined based on the value in the first data line.

Another annoying thing is the DateTime value. In this example it uses the format 'yyyy-MM-dd hh:mm:ss". It is an international standard format so it is able to convert it to a correct DateTime type despite what the current culture is. But in the CSV file if it is another format, which is very likely, such as Australian one "03/21/2008 18:30:00" while the current culture is set to USA then it will try to convert it according to "dd/MM/yyyy hh:mm:ss" ending with a failure.

It would make life easier if we can define a sort of schema file for the CSV file. In fact there is such a thing. The problem described above can be solved by using schema.ini.

We just need to create a text file which defines information such as the CSV file name and column types, then save it as schema.ini and put it in the same folder of the CSV file. Here is an schema.ini for the CSV example (which name is 'test.csv') given at the beginning


[test.csv]
ColNameHeader=True
Format=CSVDelimited
DateTimeFormat="yyyy-mm-dd hh:nn:ss"
Col1="Account Number" Text Width 10
Col2="Service Number" Text
Col3="Service Type" Text
Col4="Service Status" Text
Col5="Service Start Date" DateTime
Col6="Service End Date" DateTime
Col7="Service PostCode" Short

It is pretty straightforward. But there are several things I would like to explain

1. The DateTime format is "yyyy-mm-dd hh:nn:ss", not the format "yyyy-MM-dd hh:mm:ss" in c#. Of cause you can define it to another form such as "dd/mm/yyyy hh:nn:ss".
2. The Width definition is optional.
3. You don't have to define all the columns in the CSV file. You can just pick some of them according to your use. The DataSet will only contain the columns defined in the schema.ini file.
4. Please be aware of the space in the beginning or at the end of the column name. For example, "Account Number", " Account Number" and "Account Number " are three different columns.
5. If there is an invalid value stored in the CSV file and can not be converted to the type defined in the schema.ini file, then it will just leave it as DBNull. For example, if there is a value "post code" stored in the "Service PostCode" column in CSV file, while in the schema.ini it defines it as Short, then in the DataSet this value will be DBNull, other than throwing an exception. I think this is a nice function and pretty useful.

You could visit MSDN's page for the details of the schema.ini definition.

If there are more than one CSV file and their names are different, then there are two obvious solutions. One is generating the schema.ini file with the CSV file name dynamically. Another one is that we can define it as temp.csv or any other name in the schema.ini then put the schema.ini file in a temp folder. Every time before it reads a CSV file, by using the method File.Move(orgPath, desPath) it can move the CSV file to the temp folder and rename it to temp.csv.

Besides using schema.ini another solution is to create a pre-defined DataTable in the code then fill it with the CSV file. But it is cumbersome and if one of the values doesn't match the pre-defined type then it will throw an exception then terminate the retrieving process.

Simpler is better

Friday, November 6, 2009

About Me

Search This Blog

Blog Archive