database standards

BostonGIS

Sunday, October 28. 2007

Posted by Regina Obe in database standards, postgis postgresql at 04:18


Explain Analyze Geometry Relation Operators and Joins Except Where?

Geometry Operators in Joins vs. WHERE clause

I've noticed that most people, when they write queries in PostGIS (and I presume in other spatial databases as well), seem to put all their geometry relation checks (intersects, contains, etc.) in the WHERE clause instead of the FROM clause of their SQL statements, whereas I tend to do the opposite.

I've always wondered whether there is a speed advantage to doing it one way or the other, so I decided to compare the EXPLAIN ANALYZE plans in pgAdmin for these two sample queries.


SELECT t.town, count(s.gid) as totschools
FROM towns t
    INNER JOIN
    schools s ON st_intersects(t.the_geom, s.the_geom)
GROUP BY t.town


SELECT t.town, count(s.gid) as totschools
FROM towns t, schools s
WHERE st_intersects(t.the_geom, s.the_geom)
GROUP BY t.town

As I expected, the plans are identical and look like this.

I ran a couple of other types of queries and came to the same conclusion. Even for compound statements like the ones below, the EXPLAIN ANALYZE plans were identical, and the timings, if I run enough iterations, come out the same on average.


--Total runtime: 90699.236 ms
SELECT t.town, count(s.gid) as totschools
FROM towns t INNER JOIN schools s ON (st_intersects(t.the_geom, s.the_geom) AND s.grades LIKE '%K%')
GROUP BY t.town

/**HashAggregate  (cost=4459.01..4463.40 rows=351 width=15) (actual time=90698.767..90698.995 rows=328 loops=1)
  ->  Nested Loop  (cost=0.00..4332.89 rows=25224 width=15) (actual time=63.797..90681.218 rows=1496 loops=1)
        Join Filter: _st_intersects(t.the_geom, s.the_geom)
        ->  Seq Scan on towns t  (cost=0.00..131.41 rows=1241 width=8242) (actual time=0.010..1.966 rows=1241 loops=1)
        ->  Index Scan using idx_schools_the_geom2 on schools s  (cost=0.00..3.37 rows=1 width=29) (actual time=25.623..66.418 rows=2 loops=1241)
              Index Cond: (t.the_geom && s.the_geom)
              Filter: (((s.grades)::text ~~ '%K%'::text) AND (t.the_geom && s.the_geom))
Total runtime: 90699.236 ms **/


--Total runtime: 92633.716 ms
SELECT t.town, count(s.gid) as totschools
FROM towns t , schools s 
WHERE (st_intersects(t.the_geom, s.the_geom) AND s.grades LIKE '%K%')
GROUP BY t.town

/*
HashAggregate  (cost=524.90..529.29 rows=351 width=16)
  ->  Nested Loop  (cost=0.00..517.76 rows=1429 width=16)
        Join Filter: _st_intersects(t.the_geom, s.the_geom)
        ->  Seq Scan on towns t  (cost=0.00..25.51 rows=351 width=25602)
        ->  Index Scan using idx_schools_the_geom2 on schools s  (cost=0.00..1.39 rows=1 width=29)
              Index Cond: (t.the_geom && s.the_geom)
              Filter: (((s.grades)::text ~~ '%K%'::text) AND (t.the_geom && s.the_geom))
*/

[Figure: pgAdmin visual plan, the same for both queries]

So the question is why is there a preference for putting these things in the WHERE and why do I prefer JOIN?

If you think about it, JOIN is kind of a weird concept that appeals mostly to database geeks, whereas WHERE is something most people deal with every day. WHERE is more intuitive. Also, let us not forget the nonconformist empire of Oracle: in the olden days of Oracle, JOINs were done in the WHERE clause, with the (+) operator tacked onto one side of a condition to turn it into an outer join, and I suspect a lot of Oracle database users still do that even though it is not ANSI SQL standard syntax.

Now why do I prefer JOIN over WHERE for this kind of thing?

An INNER JOIN can be flipped over to a LEFT or RIGHT JOIN at the drop of a hat (give me everything in set A even if there is no matching record in set B, etc.). There is no equivalent concept for WHERE except in Oracle. In the case above, if a town has no schools I probably want to know that, so I would change to a LEFT JOIN rather than have the town left out of my result, as would happen with the condition in the WHERE clause or in an INNER JOIN.
If I have at least one JOIN condition for each table, I usually don't make the mistake of producing an accidental cartesian product by forgetting a WHERE condition. This is of course just me. Others may not have this problem.
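To see the flip on something small, here is a minimal sketch in Python using the built-in sqlite3 module. It is not the PostGIS setup from this post: SQLite has no st_intersects, so a plain town_id column (an assumption made up for this sketch) stands in for the spatial test, and the three towns and their schools are invented sample rows. It only illustrates how changing INNER JOIN to LEFT JOIN brings back the town with no schools.

```python
import sqlite3

# Toy stand-in data: town_id replaces st_intersects(t.the_geom, s.the_geom).
# All rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE towns (gid INTEGER PRIMARY KEY, town TEXT);
    CREATE TABLE schools (gid INTEGER PRIMARY KEY, town_id INTEGER, grades TEXT);
    INSERT INTO towns VALUES (1, 'Abington'), (2, 'Boston'), (3, 'Carlisle');
    INSERT INTO schools VALUES (10, 1, 'K,1,2'), (11, 2, 'K,1'), (12, 2, '9,10');
""")

# INNER JOIN: a town with no matching school drops out of the result.
inner_rows = conn.execute("""
    SELECT t.town, count(s.gid) AS totschools
    FROM towns t INNER JOIN schools s ON s.town_id = t.gid
    GROUP BY t.town ORDER BY t.town
""").fetchall()

# LEFT JOIN: same query with one word changed; every town is kept.
left_rows = conn.execute("""
    SELECT t.town, count(s.gid) AS totschools
    FROM towns t LEFT JOIN schools s ON s.town_id = t.gid
    GROUP BY t.town ORDER BY t.town
""").fetchall()

print(inner_rows)
print(left_rows)
```

count(s.gid) counts only non-null values, which is why the town without schools shows up with a count of 0 in the LEFT JOIN result instead of disappearing.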

Why ever ask WHERE?

Now while I do tend to make sure that I have at least one JOIN condition for each of my tables, I also use WHERE. If you can do LEFT, RIGHT and INNER with JOINs, and WHERE can only simulate INNER JOINs, why is WHERE ever important?

Well, I can think of several cases where I would use WHERE in conjunction with JOIN, some cases where I arbitrarily choose JOIN and no WHERE out of habit, and some where I just have a WHERE: when I have a VIEW, so I put the JOIN in the view and a WHERE in my query against the view; when I have a single table; or when my audience doesn't understand the concept of JOIN, so it's easier to just use WHERE. Below is an example where WHERE is a useful thing and can't be replaced with just a JOIN.

EXCEPTION Queries: Give me a list of all towns that have no kindergartens


--Total runtime: 90916.939 ms
SELECT t.town
FROM towns t
    LEFT JOIN
    schools s ON
        (st_intersects(t.the_geom, s.the_geom)
        AND s.grades LIKE '%K%')
WHERE s.gid IS NULL
ORDER BY t.town


Sort  (cost=517.77..517.77 rows=1 width=12) (actual time=17839.682..17839.687 rows=23 loops=1)
  Sort Key: t.town
  ->  Nested Loop Left Join  (cost=0.00..517.76 rows=1 width=12) (actual time=37.866..17839.617 rows=23 loops=1)
        Join Filter: _st_intersects(t.the_geom, s.the_geom)
        Filter: (s.gid IS NULL)
        ->  Seq Scan on towns t  (cost=0.00..25.51 rows=351 width=25602) (actual time=0.006..0.303 rows=351 loops=1)
        ->  Index Scan using idx_schools_the_geom2 on schools s  (cost=0.00..1.39 rows=1 width=29) (actual time=6.792..29.123 rows=7 loops=351)
              Index Cond: (t.the_geom && s.the_geom)
              Filter: (((s.grades)::text ~~ '%K%'::text) AND (t.the_geom && s.the_geom))
Total runtime: 17839.764 ms

It is really quite hard to answer the above question without a WHERE, and using a NOT IN (subselect) or EXCEPT clause to avoid a JOIN is in general slower in the DBMSs I have tried. Not in all cases, though. As shown further down, my NOT IN for this particular case beats the LEFT JOIN 2 out of 3 times. This could have more to do with the fact that the planner has an easier time optimizing geometric INNER JOINs than LEFT JOINs.

What the above query is basically doing is:

Match up the towns with schools with kindergartens that fall in the town region and, if there is no match, create a placeholder (filled with all nulls where we would otherwise have school data) so that we always have at least one school record for each town in our result.
I like to think of this as creating an imaginary kindergarten for towns that can't afford a real one so that no town is left behind. Kind of like Boston's Silver Line, affectionately called The Silver Lie.
In a nutshell, that is what a LEFT JOIN is. It makes the impossible possible by saying: create a placeholder in the second data set wherever there is no match.
Now give me only a list of towns that have a non-existent kindergarten.

Now some may ask:
Q: Why can't you put the s.gid IS NULL in the LEFT JOIN clause and skip the WHERE?
A: Because the LEFT JOIN is what creates the concept of a nonexistent kindergarten, so if you ask for such a thing before the concept exists, you get all towns, because no real school is imaginary. It doesn't take the planner long to figure out that there is no such school and to create a virtual school placeholder for every town. In fact, if you run such a query, you get all the towns back very quickly, because the planner skips the costly intersect check entirely: it needs very little analysis to see that a primary key such as s.gid can never be NULL in a row that actually exists (a real school is never imaginary), so doing the expensive check would be pointless.
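The difference between the two placements can be checked on a toy dataset. The following is only a sketch under stated assumptions: it uses Python's built-in sqlite3 module rather than PostgreSQL, a made-up town_id column stands in for st_intersects, and the four towns and their schools are invented rows. It says nothing about what the PostgreSQL planner does internally; it just shows the two result sets.

```python
import sqlite3

# Invented sample data; town_id stands in for the spatial intersect test.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE towns (gid INTEGER PRIMARY KEY, town TEXT);
    CREATE TABLE schools (gid INTEGER PRIMARY KEY, town_id INTEGER, grades TEXT);
    INSERT INTO towns VALUES (1, 'Abington'), (2, 'Boston'),
                             (3, 'Carlisle'), (4, 'Dover');
    INSERT INTO schools VALUES (10, 1, 'K,1,2'), (11, 2, 'K,1'),
                               (12, 2, '9,10'), (13, 4, '9,10,11,12');
""")

# IS NULL in the WHERE: filter runs after the placeholders have been created.
in_where = [row[0] for row in conn.execute("""
    SELECT t.town
    FROM towns t
        LEFT JOIN schools s
        ON (s.town_id = t.gid AND s.grades LIKE '%K%')
    WHERE s.gid IS NULL
    ORDER BY t.town
""")]

# IS NULL in the ON: no real school row has a NULL gid, so nothing ever
# matches and every town gets a placeholder.
in_on = [row[0] for row in conn.execute("""
    SELECT t.town
    FROM towns t
        LEFT JOIN schools s
        ON (s.town_id = t.gid AND s.grades LIKE '%K%' AND s.gid IS NULL)
    ORDER BY t.town
""")]

print(in_where)
print(in_on)
```

With the test in the WHERE only the towns without a kindergarten come back; with the test in the ON clause every town comes back, because the join condition can never be true for a real row.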

It's kind of strange that in most DBMSs I have tried, doing a LEFT JOIN is in general faster than doing a NOT IN or EXCEPT. My guess at the reason is that a LEFT JOIN in general takes less memory and sorting power to process. This varies depending on the properties of your dataset.

Think about the case where two geometries intersect. I can immediately throw out that result, because looking forward I know I really don't care about towns that have kindergartens. So while I need to process it, I throw the information away as soon as I see it. In fact, think about it: I never asked what schools a town has, so as soon as I see a town with a kindergarten, I couldn't care less about it. I never have to consider this town again or join it or order it with my other towns. I do not need to remember it.

If I do a NOT IN or EXCEPT, I first have to ask which towns have kindergartens and then which towns are not in that set. In general that takes more memory and more processor for sorting, because I need to know the towns that have kindergartens in order to compare them with my full town set and find the towns that don't. Well, it's not quite that simple for NOT IN and EXCEPT, but almost, because the planner sees these as two distinct questions, whereas with the LEFT JOIN plus WHERE it perceives one question.

Below is the same question - asked with a NOT IN. Note: We are asking who is in the list (our inner question) and who is not in the list that we just generated.


SELECT t.town
FROM towns t
WHERE t.gid
    NOT IN (SELECT t.gid
            FROM towns t, schools s
            WHERE (st_intersects(t.the_geom, s.the_geom) AND s.grades LIKE '%K%'))
ORDER BY t.town



/** Sort  (cost=554.28..554.72 rows=176 width=12) (actual time=17804.008..17804.014 rows=23 loops=1)
  Sort Key: town
  ->  Seq Scan on towns t  (cost=521.33..547.72 rows=176 width=12) (actual time=17803.827..17803.977 rows=23 loops=1)
        Filter: (NOT (hashed subplan))
        SubPlan
          ->  Nested Loop  (cost=0.00..517.76 rows=1429 width=4) (actual time=7.226..17800.803 rows=1496 loops=1)
                Join Filter: _st_intersects(t.the_geom, s.the_geom)
                ->  Seq Scan on towns t  (cost=0.00..25.51 rows=351 width=25594) (actual time=0.001..0.290 rows=351 loops=1)
                ->  Index Scan using idx_schools_the_geom2 on schools s  (cost=0.00..1.39 rows=1 width=25) (actual time=6.856..29.051 rows=7 loops=351)
                      Index Cond: (t.the_geom && s.the_geom)
                      Filter: (((s.grades)::text ~~ '%K%'::text) AND (t.the_geom && s.the_geom))
Total runtime: 17804.101 ms  ***/

Same question with EXCEPT - return all towns except those that have kindergartens

	
SELECT t.town
FROM towns t
EXCEPT
SELECT t.town
FROM towns t, schools s
WHERE (st_intersects(t.the_geom, s.the_geom)
    AND s.grades LIKE '%K%')
ORDER BY 1

/**SetOp Except  (cost=657.16..666.06 rows=178 width=12) (actual time=20962.974..20964.141 rows=23 loops=1)
  ->  Sort  (cost=657.16..661.61 rows=1780 width=12) (actual time=20962.956..20963.398 rows=1847 loops=1)
        Sort Key: town
        ->  Append  (cost=0.00..561.07 rows=1780 width=12) (actual time=0.007..20958.763 rows=1847 loops=1)
              ->  Subquery Scan *SELECT* 1  (cost=0.00..29.02 rows=351 width=12) (actual time=0.006..0.364 rows=351 loops=1)
                    ->  Seq Scan on towns t  (cost=0.00..25.51 rows=351 width=12) (actual time=0.005..0.178 rows=351 loops=1)
              ->  Subquery Scan *SELECT* 2  (cost=0.00..532.05 rows=1429 width=12) (actual time=7.062..20957.331 rows=1496 loops=1)
                    ->  Nested Loop  (cost=0.00..517.76 rows=1429 width=12) (actual time=7.061..20955.576 rows=1496 loops=1)
                          Join Filter: _st_intersects(t.the_geom, s.the_geom)
                          ->  Seq Scan on towns t  (cost=0.00..25.51 rows=351 width=25602) (actual time=0.001..0.293 rows=351 loops=1)
                          ->  Index Scan using idx_schools_the_geom2 on schools s  (cost=0.00..1.39 rows=1 width=25) (actual time=6.721..28.799 rows=7 loops=351)
                                Index Cond: (t.the_geom && s.the_geom)
                                Filter: (((s.grades)::text ~~ '%K%'::text) AND (t.the_geom && s.the_geom))
Total runtime: 20964.237 ms **/
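The three ways of asking the exception question (LEFT JOIN with WHERE, NOT IN, and EXCEPT) should all give the same answer. Here is a small sketch that checks this on invented data, using Python's built-in sqlite3 module instead of PostgreSQL. The town_id column replacing st_intersects, the t2 alias in the subselect, and the sample rows are all assumptions of this sketch, and it makes no claim about relative speed.

```python
import sqlite3

# Invented sample data; town_id stands in for the spatial intersect test.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE towns (gid INTEGER PRIMARY KEY, town TEXT);
    CREATE TABLE schools (gid INTEGER PRIMARY KEY, town_id INTEGER, grades TEXT);
    INSERT INTO towns VALUES (1, 'Abington'), (2, 'Boston'),
                             (3, 'Carlisle'), (4, 'Dover');
    INSERT INTO schools VALUES (10, 1, 'K,1,2'), (11, 2, 'K,1'),
                               (12, 2, '9,10'), (13, 4, '9,10,11,12');
""")

queries = {
    "left_join": """
        SELECT t.town
        FROM towns t
            LEFT JOIN schools s
            ON (s.town_id = t.gid AND s.grades LIKE '%K%')
        WHERE s.gid IS NULL
        ORDER BY t.town
    """,
    "not_in": """
        SELECT t.town
        FROM towns t
        WHERE t.gid NOT IN (SELECT t2.gid
                            FROM towns t2, schools s
                            WHERE s.town_id = t2.gid AND s.grades LIKE '%K%')
        ORDER BY t.town
    """,
    "except": """
        SELECT t.town
        FROM towns t
        EXCEPT
        SELECT t.town
        FROM towns t, schools s
        WHERE s.town_id = t.gid AND s.grades LIKE '%K%'
        ORDER BY 1
    """,
}

# Run each formulation and collect the town names it returns.
results = {name: [row[0] for row in conn.execute(sql)]
           for name, sql in queries.items()}
for name, towns_found in results.items():
    print(name, towns_found)
```

All three formulations return the same two towns here; which one is fastest on real geometry data is a separate question that only EXPLAIN ANALYZE on your own tables can answer.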

One reason why it takes less memory to do a LEFT JOIN is also what makes database programming a little harder to understand than standard procedural logic. Remember that I stressed the idea of concept versus reality. Sure, the idea of a school that doesn't actually exist is kind of silly, but it's a useful model and allows us to do one very important thing: ERADICATE INCONVENIENT EXCEPTIONS. All my towns now have kindergartens (sort of), so I am no longer asking a question I have no data for. The fact that some of these kindergartens are imaginary, and that I am now on the hunt for towns with imaginary kindergartens, is inconsequential. I have eradicated the inconvenient exception that forces me to ask which towns have kindergartens, a question I couldn't care less about.

Although JOINs happen conceptually before WHERE, in reality they don't need to, as long as the planner can guarantee that solving things out of order mimics the conceptual model. In reality they never need to happen at all if reality can mimic concept. For example, if you had towns.town LIKE 'ZZ%' in your WHERE clause and an index on town, your JOIN would most likely never happen, because the planner would run the WHERE part first since it's less costly and doing so would not affect the results (violate the conceptual model). It would then quickly realize that no such town exists, that the result is therefore the empty set, and that doing a join will not change this.

It's actually a wonderful hack that you can state a problem, see annoying exceptions to the rule, and quickly concoct conceptual models that destroy these nasty exceptions so you can treat everything the same.

Saturday, October 20. 2007

Posted by Regina Obe in database standards, postgis postgresql at 00:00


Database Information Schema Catalog, NULLS, and Code Generation

I love the idea of code generation without using IDEs, just because I know that if my IDEs of choice are not present at a particular moment, I've got another Swiss Army knife to rely on. I particularly like SQL code generation schemes that rely on a database to do it. When you are trying to generate SQL code, you often need the metadata of your table structures to do so, and using the database to get that info is often the easiest approach.

This post is in the spirit of Hubert Lubaczewski's recent PostgreSQL post "grant XXX on * ?", which details how to grant access to tables in PostgreSQL using some SQL code generation tricks. As it turns out, I had a similar situation to deal with recently, but it involved updating fields to NULL.

Empty String versus Null

Now there have been long debates about when to use Null vs. empty string and the various gotchas involved. If you want to know about these, this series on Nulls by Hugo Kornelis, "What if null if null is null null null is null?", is a pretty good one. Hugo is predominantly a Microsoft SQL Server blogger, but how null is handled is pretty much the same across most pseudo ANSI standard relational databases. So regardless of your database poisons of choice, his comments have equal merit.

I generally prefer the use of Null instead of empty string, all gotchas aside, mostly for philosophical reasons: I believe that the absence of data should be a black hole, and that is what NULL is. When you do a lot of statistical and financial reporting as we do, it's immensely useful, but you have to be cognizant of the difference between 0, empty string and NULL, and make sure each is used consistently within your database. There is one other reason I prefer NULL over an empty string: NULL can be cast to any data type because it represents the existence of nothing, whereas an empty string cannot because an empty string is a string. This gets me to my particular dilemma.

Updating Empty String to NULL

Oftentimes when I get property parcel data, it is sent to me as all varchar or some such thing, and the field names are always changing on me (the evils of DBF/ESRI shape as a transport mechanism). I then have to massage this data into my superbly structured tables where a number is a number and can be tabulated without casting. Now take the case where you have, say, a land property's assessed value or some other numeric field that comes as varchar, you need to stuff it into an integer or float (double precision) field, and a lot of these fields come through as empty strings.

You get an error like this in PostgreSQL (or SQL Server) when you do something like the following:
ERROR: invalid input syntax for double precision: ""

SELECT CAST('' AS float)

But if you do something like

SELECT CAST(NULL AS float)

that works just fine, because all kinds of data can have black holes. NULL is the universal thing that is a part of all data sets. It is analogous to the empty set in set theory.

So the way I get around this unpleasantness is to set all varchar fields that are empty to NULL before I try to insert them into my final table structure. You can imagine this gets pretty repetitive if you have to do it for, say, 20 fields. So here is my trick to generate the SQL to do this in PostgreSQL.

SELECT 'UPDATE ' || table_name || ' SET ' || column_name || ' =  NULL WHERE ' || column_name || ' = '''';'   As sqlupdate
FROM information_schema.columns 
WHERE table_name = 'sometable' AND data_type LIKE '%char%'

The above will return a row for each column in your table that is a character varying or char field, and each row will contain the update statement that sets all empty strings in that column to NULL. If you want to go one better, you can create a custom aggregate for strings (as I mentioned in "More generate series") and use it to get a single row containing all your update statements.
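For readers who want to try the generate-then-run idea end to end without a PostgreSQL server, here is a rough sketch in Python with the built-in sqlite3 module. It is a swapped-in technique, not the query above: SQLite has no information_schema, so PRAGMA table_info stands in for information_schema.columns, and the sometable layout and its two rows are made up for the example.

```python
import sqlite3

# Hypothetical parcel-like table where numeric data arrived as varchar.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sometable (pid INTEGER PRIMARY KEY, owner VARCHAR(50),
                            assessed_value VARCHAR(20), lot_size REAL);
    INSERT INTO sometable VALUES (1, 'Smith', '250000', 0.5),
                                 (2, '', '', 1.2);
""")

# PRAGMA table_info yields one row per column:
# (cid, name, declared type, notnull, default value, pk).
columns = conn.execute("PRAGMA table_info(sometable)").fetchall()

# Build one UPDATE per character column, mirroring data_type LIKE '%char%'.
updates = [
    f"UPDATE sometable SET {name} = NULL WHERE {name} = '';"
    for _cid, name, col_type, *_rest in columns
    if "char" in col_type.lower()
]

# Print each generated statement, then run it.
for stmt in updates:
    print(stmt)
    conn.execute(stmt)

rows = conn.execute(
    "SELECT pid, owner, assessed_value FROM sometable ORDER BY pid"
).fetchall()
print(rows)
```

The generated statements have the same shape as the ones the information_schema query produces; the only difference is where the column list comes from.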

Note that this same trick works in other databases that support the ISO SQL:1999+ information_schema, such as SQL Server 2005, SQL Server 2000, MySQL 5 and PostgreSQL 7.4+. (Sadly, Oracle apparently doesn't support information_schema; perhaps someday. :))

In SQL Server 2000/2005 you would replace the || with + so your code would look like

SELECT 'UPDATE ' + table_name + ' SET ' + column_name + ' =  NULL WHERE ' + column_name + ' = '''';'   As sqlupdate
FROM information_schema.columns 
WHERE table_name = 'sometable' AND data_type LIKE '%char%'

In MySQL you would have to use the CONCAT function if you are not in ANSI mode. For MySQL 5+ you can do set sql_mode = 'ANSI'; and use the standard ||.

If you want the above to return a single row in SQL Server 2005, you would similarly create an aggregate function for strings (granted, that is a bit more involved than doing the same in PostgreSQL), or use the FOR XML PATH syntax supported in SQL Server 2005.

Note that if you are making heavy use of database schemas, then you will need to qualify your tables with the schema name and do a where on the table_schema as well.

SELECT 'UPDATE ' || table_schema || '.' || table_name || ' SET ' || column_name || ' =  NULL WHERE ' || column_name || ' = '''';'   As sqlupdate
FROM information_schema.columns 
WHERE table_name = 'sometable' AND data_type LIKE '%char%' AND table_schema = 'assessing';

Explore the INFORMATION_SCHEMA

The INFORMATION_SCHEMA is chock-full of all sorts of useful metadata about your database objects. Some tables may not exist in some database management systems (DBMSs): for example, PostgreSQL has information_schema.sequences (defined in the SQL:2003 standard), which you won't find in MySQL and SQL Server because those databases don't have sequence objects. But the tables that do exist are consistently named across all DBMSs that support them, and so are the field names.

There are 3 tables (views) I find most useful in the information schema, all of which are available in the DBMSs I mentioned:

INFORMATION_SCHEMA.TABLES - contains listing of all tables and views in a database
INFORMATION_SCHEMA.COLUMNS - contains listing of all columns in a database, the table they belong to, the data type, length, ordinal position in table
INFORMATION_SCHEMA.VIEWS - contains names of views and the SQL that defines the view
I haven't explored these, but they look useful for doing security audits on your tables: INFORMATION_SCHEMA.TABLE_PRIVILEGES, INFORMATION_SCHEMA.COLUMN_PRIVILEGES

Note: for the most part, if not in all cases, the INFORMATION_SCHEMA views are implemented as views on top of the proprietary DBMS system tables. The reasons I find most compelling for using them instead of the system tables directly are the following:

It is often easier to navigate for basic metadata than trying to navigate the DBMS system tables. I have found it very rare to need information that I couldn't find in the information_schema and needed to resort to system tables.
Since it is purely there for informational purposes and is a standard, you don't have to worry too much about it changing as you would a system table - whose purpose is more for direct management of database internals.
Since it is a standard, for the databases that comply with the standard you can feel a bit more at home, since there is one less thing you need to know about to migrate. Thus less feeling of Vendor Lock-In.
