MySQL and more

Monday, May 18, 2009

Reverse Geocoding using MySQL GIS

Normal geocoding is the process of taking an address and converting it into latitude and longitude. Reverse geocoding goes the other direction; given a latitude and longitude we want to know the closest street address.

Google Maps API, Yahoo Maps, Microsoft, GeoNames.Org and others can provide both these services, but they either cost, or have restrictive terms of service that makes them unavailable for high volume applications. See Geo::Coder::US (for US addresses only). This uses a local data source, and requires you to download the data from the US Census manually. It can import a format known as Tiger/Line into a local BerkeleyDB file, then uses that to calculate lat/lng values from street addresses. However, it provides no method for reverse lookups.

This is because the parsed data is simply a tree of values with the following form:
 /zip/street name/type//

Each key contains a list of numbers which are actually the spatial representation of the street, broken down into segments for each block. It is a series of line segments, which are called "shape files," because they describe the streets in spatial terms. For example:

/94702/Ashby/Ave// : $VAR1 = [
# record 1
37848907, # start lat (from) $value/1_000_000 = the decimal form.
122297583, # start lng
400, # even num start (range)
598, # even num end
0, # zero street marker
401, # odd num start
599, # odd num end
#record 2 .. N


# ends with final point
37853426,
122279480
];
] )

For forward geocoding, you simply start with a known address, then use that as a key to find the appropriate spatial data for that street. The lat/lon is an interpolated value.

For reverse geocoding, it's impossible to start with a lat/lng and return the street without iterating for each and every street and doing comparisons to find the "closest." To be able to do this efficiently, we need a spatial index on this data. CPAN has a module for R-Tree indexes, but mysql also provides a convenient way to store all this shape data, and do spatial queries.

This example uses a modified version of MySQL, which includes improved GIS support

Building A Reverse Geocoding DB

Get the Data
Here's a quick PERL script to spider the Census Bureau site and download all the data. It's split by State and county so there are a lot of files:

use HTML::Parser;
my $base_url = 'http://www2.census.gov/geo/tiger/tigerua';

package IdentityParse;
use base "HTML::Parser";
use Data::Dumper;
use LWP::Simple;

my @urls;
my $burl;

sub start {
my ($self, $tag ,$attr, $attrseq, $origtext) = @_;
unless ($tag =~ /^a$/) {
return;
}

if (defined $attr->{'href'}) {
if ($attr->{'href'} =~ /[A-Z]\/$/) { # directories: add to list
push @urls, "$burl/$attr->{'href'}";
}
if ($attr->{'href'} =~ /\.ZIP$/i ) { # zip file, download it
if (! -e $attr->{'href'}) {
# we don't have it yet!
`wget '$burl/$attr->{href}'`;
}
}
}
}


my $p = new IdentityParse;

push @urls, $base_url;
while (@urls) {
my $url = shift @urls;
my $content = get $url;
$burl = $url;
$p->parse($content);
}


Convert Using Geo::Coder::US

* Install Geo::Coder::US

This comes with a series of support scripts, among these is import_tiger_zip.pl which will import all the files into a BerkeleyDB file.
 find /path/to/tigerline/data/ -name "*ZIP"  \
-exec ./import_tiger_zip.pl geocoder.db {} \;


The result is:
 -rw-rw-r-- 1 root root 724M Apr 24 12:15 geocoder.db

Import into MySQL

Once you have the geocoder.db output, this script can be used to generate sql statements for inserting this into a spatially aware mysql database.

for i in `cat z2`; do perl ./bdb_mysql_export.pl \
geocoder.db $i >> geocoder.sql; done


#!/usr/bin/perl -w

use Geo::Coder::US;
use DB_File;
use Data::Dumper;
use strict;
use vars qw(%db *db);

my $filename = shift @ARGV or die "Usage: $0 \n";

Geo::Coder::US->set_db($filename);
my @streets;
my %streets;
for my $arg (@ARGV) {
my $val;
my $path = "/$arg/";
my $zip = $arg;
my ( $key, $value );
$Geo::Coder::US::DBO->seq( $key = $path, $value, R_CURSOR );
while ( $key and $value and $key =~ /^$path/i ) {
$streets{$key} =
[ split( ", ", join( ", ", unpack( "w*", $value ) ) ) ];
$Geo::Coder::US::DBO->seq( $key, $value, R_NEXT );
}

}
my $DEBUG = 0;

#print Dumper \%streets;
#my @data = values %streets;
foreach my $k (keys %streets) {
my @data = @{ $streets{$k} } ;
# if (lc($k) =~ /fairoaks/) { $DEBUG = 1; } else { $DEBUG = 0; };
# print "Doing street: $k\n" if ($DEBUG);
# print Dumper(\@data) if ($DEBUG);
# next if (!$DEBUG);
my $first = 1;
my (@from, @to, @range, @best, $matched, @range2);
shift @data;
while (@data) {
@from = splice( @data, 0, 2 ) if $data[0] > 1_000_000;
while (@data and $data[0] < 1_000_000) {
shift @data if not $data[0]; # skip street-side zero marker
@range = splice( @data, 0, 2 );
shift @data while @data and $data[0] < 1_000_000;
}
last unless @data;
@to = splice( @data, 0, 2 );

print "From : " . Dumper(\@from)."\n" if ($DEBUG);;
print "To : ". Dumper(\@to)."\n" if ($DEBUG);;
print "Range1: ". Dumper(\@range). "\n" if ($DEBUG);;
print "Range2: ". Dumper(\@range2) ."\n" if ($DEBUG);;
my %found;
@found{qw{ zip street type prefix suffix }} = split "/", substr($k, 1), 5;
@found{qw{ frlat frlong tolat tolong }} = map( $_ / 1_000_000, @from, @to );
@found{qw{ fradd toadd }} = @range;
$found{$_} *= -1 for qw/frlong tolong/;
print "Found :". Dumper(\%found)."\n" if ($DEBUG);
if ( defined($found{fradd})) {
$found{street} =~ s/'/''/g;
next if (!defined($found{tolong}));

my $q = "INSERT INTO reverse_geocode (zip, street, type, prefix, from_addr, to_addr, line_segment) VALUES ('$found{zip}','$found{street}','$found{type}', '$found{prefix}',$found{fradd}, $found{toadd}, GeomFromText('LineString($found{frlat} $found{frlong}, $found{tolat} $found{tolong})') );";
print $q ."\n"; # if ($DEBUG);
}
if (defined($data[0]) && $data[0] > 1_000_000) {
@to = splice(@data, 0,2 );
}
}
}



Here's the basic table structure:

CREATE TABLE `reverse_geocode` (
`zip` varchar(5) NOT NULL DEFAULT '0',
`street` varchar(64) NOT NULL DEFAULT '',
`type` varchar(8) NOT NULL DEFAULT '',
`from_addr` int(11) NOT NULL DEFAULT '0',
`to_addr` int(11) NOT NULL DEFAULT '0',
`line_segment` geometry NOT NULL,
`prefix` varchar(4) NOT NULL DEFAULT '',
SPATIAL KEY `line_segment` (`line_segment`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;



And the .sql file can be imported with something like:
 mysql -u -p geocode <>


The result is
mysql> show table status like 'reverse_geocode' \G
*************************** 1. row ***************************
Name: reverse_geocode
Engine: MyISAM
Version: 10
Row_format: Dynamic
Rows: 16005729
Avg_row_length: 81
Data_length: 1306599484
Max_data_length: 281474976710655
Index_length: 1021292544
Data_free: 0
Auto_increment: NULL
Create_time: 2009-04-27 13:08:31
Update_time: 2009-04-27 14:34:36
Check_time: NULL
Collation: latin1_swedish_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.00 sec)


Query the Data

Simply use the MBRContains fuction, and pass it a bounding box to cull results too far away. Sort by distance and take the closest result.

mysql> SET @center = GeomFromText('POINT(37.372241 -122.021671)');
Query OK, 0 rows affected (0.00 sec)

mysql> SET @radius = 0.005;
Query OK, 0 rows affected (0.00 sec)

mysql> SET @bbox = GeomFromText(CONCAT('POLYGON((',
-> X(@center) - @radius, ' ', Y(@center) - @radius, ',',
-> X(@center) + @radius, ' ', Y(@center) - @radius, ',',
-> X(@center) + @radius, ' ', Y(@center) + @radius, ',',
-> X(@center) - @radius, ' ', Y(@center) + @radius, ',',
-> X(@center) - @radius, ' ', Y(@center) - @radius, '))')
-> );
Query OK, 0 rows affected (0.00 sec)

mysql> select zip, prefix, street, type, from_addr, to_addr, astext(line_segment), Distance(@center,line_segment) as dist
FROM reverse_geocode where MBRContains(@bbox, line_segment) order by dist limit 1;
+-------+--------+----------+------+-----------+---------+---------------------------------------------------------+------+
| zip | prefix | street | type | from_addr | to_addr | astext(line_segment) | dist |
+-------+--------+----------+------+-----------+---------+---------------------------------------------------------+------+
| 94086 | S | Fairoaks | Ave | 323 | 331 | LINESTRING(37.375141 -122.020671,37.372241 -122.021671) | 0 |
+-------+--------+----------+------+-----------+---------+---------------------------------------------------------+------+
1 row in set (0.00 sec)

Now for my project, this was good enough; knowing the closest block of the street produced the results I wanted.

For more accuracy, there are still a few issue that need to be solved:

1. For the even/odd side of each block the same line segment is used. In this case I only keep one side of the street, however the extraction script would have to be modified a little bit if you wanted both sides. It's probably most effective to add two more from/to address columns in the table to store the even/odd numbers.

2. You would have to use some basic geometry to figure out which side of the street the point is on, then use that to choose the even or odd numbers.

3. Using some more geometry, you could also find an approximate address along the line. Essentially you would interpolate a value, which may or may not be a valid address.

11 comments:

Assaf said...

Nice.
It's just too bad the GIS functions in MySQL are only partially implemented (e.g. point in polygon)... that was enough to make me switch to PG just for the GIS abilities.

gtowey said...

They're almost all there now, and they're much faster than PG: MySQL and Geospatial Data

Mapper99 said...

More great GIS services offered here

Mengu said...

Thank you very much for these scripts. They've helped me a lot in getting started building a similar database.

I'm currently at the step where I need to generate an SQL script to insert all of the data from geocoder.db into my GIS-enabled MySQL DB, but my perl interpreter doesn't seem to like the following block in your script


while (@data) {
@from = splice( @data, 0, 2 ) if $data[0] > 1_000_000;
while ( @data and $data[0] < range =" splice(" to =" splice(" q ="INSERT INTO reverse_geocode (zip, street, type, prefix, from_addr, to_addr, line_segment) VALUES ('$found{zip}','$found{street}','$found{type}', '$found{prefix}',$found{fradd}, $found{toadd}, GeomFromText('LineString($found{frlat} $found{frlong}, $found{tolat} $found{tolong})') );"> 1_000_000 ) {
@to = splice( @data, 0, 2 );
}
}


The specific error I got is


Bareword found where operator expected at ./bdb_mysql_export.pl line 40, near "" splice(" to"
(Missing operator before to?)
Can't modify numeric lt (<) in scalar assignment at ./bdb_mysql_export.pl line 40, near "" splice(" to "
syntax error at ./bdb_mysql_export.pl line 40, near "" splice(" to "
Execution of ./bdb_mysql_export.pl aborted due to compilation errors.


Can you please explain in a sentence or two what that block is meant to accomplish, so that I can have a more guided effort while trying to solve my problem? ;-)

Once again, thanks a lot for all the information you are putting on your blog.

Cheers,
Mengu

Navya said...

Thanks for sharing knowledge about geocoding. Would like to know more on this topic. waiting for your next post.
regards
GIS Application development

gtowey said...

@mengu:

Sorry about that. The code got eaten by this blog app. I updated the post and it should be usable now.

Greg Combs said...

I'd love to figure out how to code a script that will walk a MySQL database of addresses and geocode coordinates for each entry. So far all the batch geocoding mechanisms I've found (and there aren't many) are really fricking complicated and/or work off of file data.

Chris O'Keefe said...

Import into MySQL

Once you have the geocoder.db output, this script can be used to generate sql statements for inserting this into a spatially aware mysql database.

for i in `cat z2`; do perl ./bdb_mysql_export.pl \
geocoder.db $i >> geocoder.sql; done


I am a bit confused by this `cat z2` command. What is z2 in this context?

Tx in advance.. Great information

Dave LeJeune said...

Hi -

I thought I'd post the answer to the previous poster's question about what the significance of z2 is in for i in `cat z2`;

I'm not entirely sure I understand exactly what the original context was, but to get my MySQL export completely, I produced a list of states and used that as my 'z2' file.

Jamie said...

After spending a day on this, I figured out what the "z2" file contains.

It is a list of zip codes. To get a list of zip codes, google for "zipcode.csv".

Sri said...

hi can u tell me how to reverse geocode with php by using mysql database....


i have lat and lon in my data base need to reverse geocode it plz help me its my acdemic project nly a week time mail me kumarpavan48@gmail.com plzz plzz

Followers