Thursday, June 24, 2010

Importing the Code-Point dataset into MySql

I recently realised that capturing the geocoding results from Google wasn’t such a good idea. It seems that postcodes that are only partially correct will still be geocoded, but will be returned with a default latitude and longitude which are kind of in the right area for that postcode area but obviously aren’t right, since the postcode doesn’t exist.
So I decided it was time to grasp the nettle and use the Code-Point dataset for my UK postcodes page, so the data should be much more accurate. The only issue with the Code-Point dataset is that it doesn’t contain postcodes in Northern Ireland, Guernsey, Jersey and the Isle of Man for some reason.
So the first thing to do was to download the .NET connector for MySql. Then I modified my SQL Server import code to use MySql. As mentioned in that post, the code has dependencies on this CSV parser and my .NET coordinates library. I made some modifications to add spaces to postcodes as necessary and also to not bother inserting postcodes with no geographical data (I’m not sure if this is missing data or if these are simply non-geographical postcodes).
The code is shown below. Although it works, this is a simple implementation and is not particularly fast. It will probably take a few hours running against a local MySql installation but running against a remote installation will probably take longer. I only plan to run it once, so I’m not particularly concerned with performance, but if performance is an issue, you’ll probably want to take a look at the MySqlBulkLoader class or peruse the MySql documentation on improving the performance of inserts.
using System;
using System.IO;
using DotNetCoords;
using LumenWorks.Framework.IO.Csv;
using MySql.Data.MySqlClient;

namespace ImportCodepoint
{
  class Program
  {
    static void Main(string[] args)
    {
      string[] files = Directory.GetFiles(@"C:\Users\Doogal\Downloads\codepo_gb\Code-Point Open\data\CSV");
      foreach (string file in files)
      {
        ReadFile(file);
      }

    }

    private static void ReadFile(string file)
    {
      using (StreamReader reader = new StreamReader(file))
      {
        CsvReader csvReader = new CsvReader(reader, false);
        using (MySqlConnection conn = new MySqlConnection(
          "server=127.0.0.1;uid=root;pwd=password;database=test;"))
        {
          conn.Open();
          foreach (string[] data in csvReader)
          {
            string postcode = data[0];
            // some postcodes have spaces, some don't
            if (postcode.IndexOf(' ') < 0)
              postcode = data[0].Substring(0, data[0].Length - 3) + " " + data[0].Substring(data[0].Length - 3);
            double easting = double.Parse(data[10]);
            double northing = double.Parse(data[11]);

            // there are some postcodes with no location
            if ((easting != 0) && (northing != 0))
            {
              // convert easting/northing to lat/long
              OSRef osRef = new OSRef(easting, northing);
              LatLng latLng = osRef.ToLatLng();
              latLng.ToWGS84();

              using (MySqlCommand command = conn.CreateCommand())
              {
                Console.WriteLine(postcode);
                command.CommandTimeout = 60;
                command.CommandText = string.Format("INSERT INTO Postcodes (Postcode, Latitude, Longitude) " +
                  "VALUES ('{0}', {1}, {2})",
                  postcode, latLng.Latitude, latLng.Longitude);
                command.ExecuteNonQuery();
              }
            }
          }
        }
      }
    }
  }
}

No comments: