I am happy to report that the trick of encasing a postgresql pgsql function within a postgresql sql function was a big success.
Now for the Recap of the problems with using plain pgsql and why we went down the postgresql sql function encasement approach.
Recap: Plain old pgsql
Building on our expandoverlap_metric function, we add a new function and a new type.
CREATE TYPE pgis_nn AS
(gid integer, dist numeric(12,5));
CREATE OR REPLACE FUNCTION _pgis_fn_nn(geom1 geometry, distguess double precision, numnn integer, maxslices integer, lookupset varchar(150), swhere varchar(5000), sgid2field varchar(100), sgeom2field varchar(100))
RETURNS SETOF pgis_nn AS
$BODY$
DECLARE
strsql text;
rec pgis_nn;
ncollected integer;
it integer;
--NOTE: it: the iteration we are currently at
--start at the bounding box of the object (expand 0) and move up until it has collected more objects than we need or it = maxslices whichever event happens first
BEGIN
ncollected := 0; it := 0;
WHILE ncollected < numnn AND it <= maxslices LOOP
strsql := 'SELECT currentit.' || sgid2field || ', distance(ref.geom, currentit.' || sgeom2field || ') as dist FROM ' || lookupset || ' as currentit, (SELECT geometry(''' || CAST(geom1 As text) || ''') As geom) As ref WHERE ' || swhere || ' AND expand(ref.geom, ' || CAST(distguess*it/maxslices As varchar(100)) || ') && currentit.the_geom AND expandoverlap_metric(ref.geom, currentit.the_geom, ' || CAST(distguess As varchar(200)) || ', ' || CAST(maxslices As varchar(200)) || ') = ' || CAST(it As varchar(100)) || ' ORDER BY distance(ref.geom, currentit.the_geom) LIMIT ' || CAST((numnn - ncollected) As varchar(200));
--RAISE NOTICE 'sql: %', strsql;
FOR rec in EXECUTE (strsql) LOOP
IF ncollected < numnn THEN
ncollected := ncollected + 1;
RETURN NEXT rec;
ELSE
EXIT;
END IF;
END LOOP;
it := it + 1;
END LOOP;
END
$BODY$
LANGUAGE 'plpgsql' STABLE;
Plain old pgsql is tested and fails
/**Problem calculate the 5 nearest neighbors in table testpolys for the first 500 records **/
/**uncoated: pgsql -
fails with error ERROR: set-valued function called in context that cannot accept a set
SQL state: 0A000
Context: PL/pgSQL function "_pgis_fn_nn" line 16 at return next **/
SELECT g1.gid as gid_ref, (_pgis_fn_nn(g1.the_geom, 10000, 5,100,'testpolys', 'true', 'gid', 'the_geom')).*
FROM (SELECT * FROM testpolys tp ORDER BY tp.gid LIMIT 500) g1;
Recap 2: PgSQL function wrapped in an sql function blanket
CREATE OR REPLACE FUNCTION pgis_fn_nn(geom1 geometry, distguess double precision, numnn integer, maxslices integer, lookupset varchar(150), swhere varchar(5000), sgid2field varchar(100), sgeom2field varchar(100))
RETURNS SETOF pgis_nn AS
$BODY$
SELECT * FROM _pgis_fn_nn($1,$2, $3, $4, $5, $6, $7, $8);
$BODY$
LANGUAGE 'sql' STABLE;
This time it works!
These tests were done on a slower unoptimized machine than my prior tests so all the numbers are higher.
This was done on a windows XP 2.66GHZ single processor, 2 GIG ram.
/**coated: pgsql - IT WORKS! and is faster than all other prior solutions**/
--Total query runtime: Total query runtime: 8075 ms. 2500 rows retrieved. - using guess expand box 10000 meters, 5 neighbors, max 100 iterations, start at bounding box
SELECT g1.gid as gid_ref, (pgis_fn_nn(g1.the_geom, 10000, 5,100,'testpolys', 'gid <> ' || CAST(g1.gid As varchar(30)), 'gid', 'the_geom')).*
FROM (SELECT * FROM testpolys tp ORDER BY tp.gid LIMIT 500) g1;
--Total query runtime: 15401 ms. 2500 rows retrieved. - using guess expand box 10000 meters, 5 neighbors, max 100 iterations, start at bounding box
SELECT g1.gid as gid_ref , (fntestpolys_nn_evenbetter(g1.gid, g1.the_geom, 10000, 5,100,0)).*
FROM (SELECT * FROM testpolys tp ORDER BY tp.gid LIMIT 500) g1;
--Total query runtime: 44063 ms. 2500 rows retrieved. - using guess expand box 10000 meters, 5 neighbors, max 100 slices
SELECT g1.gid as gid_ref , (fntestpolys_nn_better(g1.gid, g1.the_geom, 10000, 5,100)).*
FROM (SELECT * FROM testpolys tp ORDER BY tp.gid LIMIT 500) g1;
--Total query runtime: Total query runtime: 109072 ms. 2500 rows retrieved. - using guess expand box 10000 meters, 5 neighbors
SELECT g1.gid as gid_ref, (fntestpolys_nn(g1.gid, g1.the_geom, 10000, 5)).*
FROM (SELECT * FROM testpolys tp ORDER BY tp.gid LIMIT 500) g1;
/**Problem calculate the 5 nearest neighbors in table testpolys for all records in testpolys (3000 records) **/
--Total query runtime: 124411 ms. 15000 rows retrieved.
SELECT g1.gid as gid_ref , (fntestpolys_nn_evenbetter(g1.gid, g1.the_geom, 10000, 5,100,0)).*
FROM (SELECT * FROM testpolys tp ORDER BY tp.gid) g1;
--Total query runtime: Total query runtime: 104105 ms. 15000 rows retrieved. - using guess expand box 10000 meters, 5 neighbors, max 100 iterations, start at bounding box
SELECT g1.gid as gid_ref, (pgis_fn_nn(g1.the_geom, 10000, 5,100,'testpolys', 'gid <> ' || CAST(g1.gid As varchar(30)), 'gid', 'the_geom')).*
FROM (SELECT * FROM testpolys tp ORDER BY tp.gid) g1;
Use Cases
Now it should be easy to answer all sorts of nearest neighbor questions with a simple SQL statement. Here is a simple one with our test data
--Grab additional attributes (in this case geometry) from the nearst neighbor
--Total query runtime: 11012 ms. 2500 rows retrieved.
SELECT g1.gid as gid_ref, g1.nn_dist, g1.nn_gid, g2.the_geom
FROM (SELECT tp.*, (pgis_fn_nn(tp.the_geom, 10000, 5,100,'testpolys', 'gid <> ' || CAST(tp.gid As varchar(30)), 'gid', 'the_geom')).gid As nn_gid,
(pgis_fn_nn(tp.the_geom, 10000, 5,100,'testpolys', 'gid <> ' || CAST(tp.gid As varchar(30)), 'gid', 'the_geom')).dist As nn_dist
FROM (SELECT * FROM testpolys ORDER BY gid LIMIT 500) tp) g1, testpolys g2
WHERE g2.gid = g1.nn_gid
ORDER BY g1.gid, nn_dist;
Here is a real live use case one, but I haven't tested it out. I'm hoping it will run and be decently fast.
/**All buildings where the nearest fire station is greater than 3 miles away (note my geometries are in meters) and give the name of the firestation
and sort by the building whose closest station is furthest away**/
/**Data sets: buildings http://www.mass.gov/mgis/lidarbuildingfp2d.htm, firestations: http://www.mass.gov/mgis/firestations.htm **/
SELECT g1.gid as gid_ref, f.name, g1.nn_dist, g1.nn_gid
FROM (SELECT b.gid,
(pgis_fn_nn(b.the_geom, 1000000, 1,1000,'fire_stations', 'true', 'gid', 'the_geom')).gid as nn_gid,
(pgis_fn_nn(b.the_geom, 1000000, 1,1000,'fire_stations', 'true', 'gid', 'the_geom')).dist as nn_dist
FROM (SELECT * FROM buildings) b) As g1, fire_stations f
WHERE f.gid = g1.nn_gid AND g1.nn_dist > 1609.344*3
ORDER BY g1.nn_dist DESC
I'll publish a tutorial and some more use cases on the main BostonGIS site next. Hopefully the timings on these tests will be decent.
UPDATE: I have published this to main BostonGIS site. The article can be found here PostGIS Nearest Neighbor: A Generic Solution. I also changed some of the type definition slightly so I don't need to alias fields so much.