| PostGIS Nearest Neighbor: A Generic Solution - Much Faster than Previous Solution 
 Printer Friendly
 
 A generic solution to PostGIS nearest neighborAfter some heavy brainstorming, I have come up with a faster and more generic solution to calculating nearest neighbors than my previous solutions.  For the gory details on how I arrived at this solution - please check out the following blog entries Boston GIS Blog entries on Nearest Neighbor Search.
 In the sections that follow I will briefly describe the key parts of the solution, the strengths of the solution, the caveats of the solution, and some general use cases. The parts of the solutionThe solution is composed of a new PostgreSQL type, and 2 PostgreSQL functions.  It involves a sequence of expand boxes that fan out from the bounding box of a geometry to as far out as you want.  Below are the descriptions of each part. pgis_nn - a new type that defines a neighbor solution -  for a given  near neighbor solution it gives you, the primary key of the neighbor and the distance it is away from the reference objectexpandoverlap_metric - this is a helper function that returns the first fanned out box that an object is found in relative to the reference geometry.  It returns an integer where the integer varies from 0 (in bounding box) to maxslices where maxslices is how many slices you want to dice up your larger expand box. _pgis_fn_nn - this is the workhorse function that should never be directly called  - it is a pgsql function that returns for each geometry k near neighborspgis_fn_nn(geom1 geometry, distguess double precision, numnn integer, maxslices integer, lookupset varchar(150), swhere varchar(5000), sgid2field varchar(100), sgeom2field varchar(100))) - this is our interface function written in language sql to fool the planner into thinking we are using an sql function.  This is to get around certain anomalies of PostgreSQL that prevent pgsql set returning functions from being used if they reference fields in the from part of an SQL  statement. Description of the inputs is as follows.
 geom1 - this is the reference geometry.distguess - this is the furthest distance you want to branch out before you want the query to give up.  It is measured in units of the spatial ref system of your geometries.numnn - number of nearest neighbors to return.maxslices -  this is the max number of slices you want to slice up your whole expand box.  The more slices the thinner the slices, but the more iterations.  I'm sure there is some statistically function one can run against ones dataset to figure out the optimal number of slices and distance guess, but I haven't quite figured out what that would be.lookupset - this is usually the name of the table you want to search for nearest neighbors.  In theory you could throw in a fairly complicated subselect statement as long as you wrap it  in parethesis.  Something like '(SELECT g.*, b.name FROM sometable g INNER JOIN someothertable b ON g.id = b.id)' .  Althought I haven't had a need to do that yet so haven't tested that feature.swhere -  this is the where condition to apply to your dataset.  If you want the whole dataset to be evaluated, then put 'true'.  In theory you can have this correlated with your reference table by gluing parameters together, so can get pretty sophisticated.sgid2field - this is the name of the unique id field in the dataset you want to check for nearest neighbors.sgeom2field - this is the name of the geometry field in the dataset you want to check for nearest neighbors
 The code can be downloaded here http://www.bostongis.com/downloads/pgis_nn.txt.  I will also be posting this to the PostGIS Wiki once I clean it up a bit more.
 Key featuresMy objective in creating this solution is to give PostGIS the same expressiveness that Oracle's SDO_NN and SDO_NN_DISTANCE functions provide Oracle.  Unlike the Oracle solution, this solution is not implemented as an Operator, but instead as a set returning function.  Features are outlined below 
 Can handle more than one reference point at a time and return a single dataset.  For example you can say give me the 5 nearest neighbors for each of my records in this query and get the result back as a single table.You can limit your search to a particular distance range, slice it up as granularly as you want in order to achieve the best resulting speed.  Note I implemented this function as linearly expanding boxes to utilize GIST indexes.  In hindsight  it might have been more efficient processor wise to have the boxes expand like a normal distribution or something like that.You can apply where conditions to the sets you are searching instead of searching the whole table - e.g. only give me the 5 nearest hospitals that have more than 100 beds. You can also have the where conditions tied to fields of each record of interest.  E.g. include all near neighbors, but don't include the reference point.This solution works for all geometries - e.g. you can find point near neighbors to a polygon, polygon near neighbors to a polygon etc.  Nearest distance to a line etc.
 CaveatsThis function will only work if you give it a primary key that is an integer.  I figured an integer is what most people use for their primary keys and is the most efficient join wise.This function currently doesn't handle ties.  E.g. if you specify 5 neighbors and you have 6 where the last 2 are the same distance away, one will arbitrarily get left out. I think it will be fairly trivial to make it handle this case though.If there are no neighbors for a particular reference that meet the stated criteria, that record would get left out.  There are workarounds for this which I will try to provide in more advanced use cases.
 Use CasesAll the data we will be using is in SRID: 26986.  For a refresher course on loading data and setting up indexes, check
Part 1: Getting Started With PostGIS: An almost Idiot's Guide
This returned 1000 records in 6 seconds on my dual windows 2003 server Datasets used: buildings http://www.mass.gov/mgis/lidarbuildingfp2d.htm,
 firestations: http://www.mass.gov/mgis/firestations.htm
 
 Find nearest fire station for first 1000 buildings in Boston and sort in decreasing order of distance from the fire station such that 
the building furthest away from its nearest firestation is at the top.
 
 
SELECT g1.gid as gid_ref, f.name, g1.nn_dist/1609.344 as dist_miles, g1.nn_gid
FROM (SELECT b.gid, 
    (pgis_fn_nn(b.the_geom, 1000000, 1,1000,'fire_stations', 'true', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM buildings limit 1000) b) As g1, fire_stations f
    WHERE f.gid = g1.nn_gid 
    ORDER BY g1.nn_dist DESC
 
 Data set used: Fire stations from above and Massachusetts census tiger roads  -http://www.mass.gov/mgis/cen2000_tiger.htm
 For each street segment in zip 02109 list the 2 nearest fire stations and the distance in meters from each fire station. Order results by street name, street id and distance from nearest fire station
 
 
Total query runtime: 687 ms. 296 rows retrieved.
SELECT g1.gid as gid_ref, g1.street, g1.fraddl, g1.toaddl, f.name, g1.nn_dist as dist_meters, g1.nn_gid
FROM (SELECT b.*,
    (pgis_fn_nn(b.the_geom, 1000000, 2,1000,'fire_stations', 'true', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM census2000tiger_arc where zipl = '02109') b) As g1, fire_stations f
    WHERE f.gid = g1.nn_gid 
    ORDER BY g1.street, g1.fraddl, g1.gid, g1.nn_dist 
 For each street segment in zip 02109 list the 2 nearest buildings with building footprint > 5000 sq meters and specify how far away they are and the areas of each building.
 
 
 
--Total query runtime: 2656 ms. 296 rows retrieved.
SELECT g1.gid as gid_ref, g1.street, g1.fraddl, g1.toaddl, area(b.the_geom) as bldg_area, g1.nn_dist as dist_meters, g1.nn_gid
FROM (SELECT s.*,
    (pgis_fn_nn(s.the_geom, 1000000, 2,1000,'buildings', 'area(the_geom) > 5000', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM census2000tiger_arc where zipl = '02109') s) As g1, buildings b
    WHERE b.gid = g1.nn_gid 
    ORDER BY g1.street, g1.fraddl, g1.gid, g1.nn_dist 
 More Advanced QueriesExample of two simultaneous neighbor queries in one queryLayers usedhttp://www.mass.gov/mgis/schools.htm - schools
 http://www.mass.gov/mgis/hospitals.htm - hospitals
 
 Now for this, I think we can safely assume that no street we care about should be further than 30,000 meters from its nearest school and hospital
so we will limit our search to
30,000 meters and dice up our box into 10 expand boxes - so boxes expand at a rate of 3000 meters out per expansion.
Give me the nearest elemenary school and the nearest hospital emergency center for street segments in select zipcodes 
and how far away they are from these street segments. 
 
--Total query runtime: 5516 ms.(~5.5 secs). 291 rows retrieved.
SELECT emer.gid as gid_ref, emer.street, emer.fraddl, emer.toaddl, hospitals.name As emer_name, emer.nn_dist as emer_dist_meters, 
        schools.name as sch_name, sch.nn_dist as sch_dist_meters
FROM (SELECT b.gid, b.street,b.fraddl, b.toaddl,
    (pgis_fn_nn(b.the_geom, 30000, 1,10,'hospitals', 'er_status=''Y''', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM census2000tiger_arc where zipl IN('02109', '02110')) As b) As emer INNER JOIN
    (SELECT b.gid,
    (pgis_fn_nn(b.the_geom, 30000, 1,10,'schools', 'true', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM census2000tiger_arc where zipl IN('02109', '02110')) As b) As sch ON emer.gid = sch.gid
    INNER JOIN hospitals ON hospitals.gid = emer.nn_gid 
    INNER JOIN schools ON schools.gid = sch.nn_gid
    ORDER BY emer.street, emer.fraddl, emer.gid
 Example limit distance exclusion queriesProvide a list of all schools in Massachusetts Boston extent whose closest hospital emergency room is further than 3000 meters away
 Using our new neighbor function exclusion query
 
 
The old way doing a simple expand distance exclusion query.  SOMETIMES THE OLD WAYS ARE JUST BETTER
--Total query runtime: 2969 ms. 75 rows retrieved.
SELECT sch.gid as gid_ref, sch.name as sch_name
FROM schools sch LEFT JOIN (SELECT b.gid, 
    (pgis_fn_nn(b.the_geom, 3000, 1,2,'hospitals', 'er_status=''Y''', 'gid', 'the_geom')).*  
    FROM (SELECT * FROM schools WHERE schools.the_geom && GeomFromText('POLYGON((225446.34375 886435.5625,225446.34375 905270.8125,242278.640625 905270.8125,242278.640625 886435.5625,225446.34375 886435.5625))',26986)) b) As g1 ON sch.gid = g1.gid
    WHERE g1.gid IS NULL AND sch.the_geom && GeomFromText('POLYGON((225446.34375 886435.5625,225446.34375 905270.8125,242278.640625 905270.8125,242278.640625 886435.5625,225446.34375 886435.5625))',26986)
ORDER BY sch.name
 
 
	
	-- Total query runtime: 63 ms. 75 rows retrieved.
SELECT sch.gid as gid_ref, sch.name as sch_name
FROM schools sch LEFT JOIN 
(SELECT b.gid FROM schools b , hospitals h 
		WHERE b.the_geom && GeomFromText('POLYGON((225446.34375 886435.5625,225446.34375 905270.8125,242278.640625 905270.8125,242278.640625 886435.5625,225446.34375 886435.5625))',26986) AND h.er_status = 'Y'
				AND expand(b.the_geom, 3000) && h.the_geom AND distance(b.the_geom, h.the_geom) <= 3000) g1
ON sch.gid = g1.gid
WHERE g1.gid IS NULL AND sch.the_geom && GeomFromText('POLYGON((225446.34375 886435.5625,225446.34375 905270.8125,242278.640625 905270.8125,242278.640625 886435.5625,225446.34375 886435.5625))',26986)
ORDER BY sch.name
 Caching data by storing closest neighbor in a tableFor this exercise, we will create 2 new fields in our schools table called nearest_hospital and nearest_hospital_gid 
and update the fields with the closest hospital to that school.  
We will set our maximum distance search to 100,000 meters carved out into 10  10,000 meter expansions. 
-- Query returned successfully: 2613 rows affected, 20625 ms (~1/3 minute) execution time.
ALTER TABLE schools ADD nearest_hospital character varying(255);
ALTER TABLE schools ADD nearest_hospital_gid integer;
UPDATE schools SET nearest_hospital = h.name , nearest_hospital_gid = h.gid
FROM (SELECT s.gid As sch_gid,
    (pgis_fn_nn(s.the_geom, 100000, 1,10,'hospitals', 'er_status=''Y''', 'gid', 'the_geom')).*  
    FROM schools As s) As nn_emer INNER JOIN hospitals h ON nn_emer.nn_gid = h.gid
WHERE schools.gid = nn_emer.sch_gid
 
 
 Later on we realize we need an alternate hospital in case the first hospital is filled or because of traffic situation we can't get there,
so we create a new 
set of fields to hold the second closest hospital.  This particular exercise demonstrates where clause dependent on the
additional attribute data of the reference record.
 
 
-- Query returned successfully: 2613 rows affected, 23032 ms (~1/3 minute) execution time.
ALTER TABLE schools ADD nearest_hospital_2 character varying(255);
ALTER TABLE schools ADD nearest_hospital_gid_2 integer;
UPDATE schools SET nearest_hospital_2 = h.name , nearest_hospital_gid_2 = h.gid
FROM (SELECT s.gid As sch_gid,
    (pgis_fn_nn(s.the_geom, 100000, 1,10,'hospitals', 'er_status=''Y'' AND gid <> ' || CAST(s.nearest_hospital_gid As varchar(20)) , 'gid', 'the_geom')).*  
    FROM schools As s) As nn_emer INNER JOIN hospitals h ON nn_emer.nn_gid = h.gid
WHERE schools.gid = nn_emer.sch_gid AND nearest_hospital_gid IS NOT NULL
 
 
 
 Post Comments About PostGIS Nearest Neighbor: A Generic Solution - Much Faster than Previous Solution
 |