Sunday, October 28. 2007
Explain Analyze Geometry Relation ... Posted by Regina Obe in database standards, postgis postgresql at 04:18
Geometry Operators in Joins vs. WHERE clause
I've noticed that most people when they do queries in PostGIS (I presume other spatial databases as well), seem to put all there geometry relation (intersects, contains etc.) checks in the WHERE clause instead of the FROM clause of their SQL statements whereas I tend to do the opposite.
I've always wondered if there is a speed advantage of doing it one way or the other so I decided to look at the 2 EXPLAIN ANALYZE plans in pgAdmin for these two sample queries.
As I expected, the plans are identical and look like this.
I ran for a couple of other types of queries and got the same conclusions. Even for compound statements like the below, the explain analyze plans were identical and the timings if I run for enough iterations come out on average the same.
So the question is why is there a preference for putting these things in the WHERE and why do I prefer JOIN?
If you think about it, JOIN is kind of a weird concept that appeals mostly to database geeks - whereas WHERE is something most people deal with every day. WHERE is more intuitive; Also let us not forget the nonconformist empire of Oracle and how in the olden days of Oracle, JOINS were done in the WHERE clause with things like =+ = += etc., to do LEFT INNER RIGHT FULL JOINS and I suspect a lot of Oracle Database users still do that even though it violates the ANSI SQL Standard.
Now why do I prefer JOIN over WHERE for this kind of thing
Why ever ask WHERE?
Now while I do tend to make sure that I have at least one JOIN condition for each of my tables, I also use WHERE? If you can do LEFT RIGHT INNER with JOINs and WHERE can only simulate INNER JOINS, why is WHERE ever important?
Well I can think of several cases where I would use WHERE in conjunction with JOIN, some cases where I arbitrarily choose JOIN and no WHERE out of habit, and some where I just have a WHERE such as when I have a VIEW so I put the JOIN in the view and a where in my query against the view or I have a single table or my audience doesn't understand the concept of JOIN so its easier to just use WHERE. Below is an example where WHERE is a useful thing and can't be replaced with just a JOIN.EXCEPTION Queries: Give me a list of all towns that have no kindergartens
It is really quite hard to answer the above question without a WHERE and using a NOT IN(subselect) or EXCEPT clause to avoid a JOIN is in general slower in DBMSs I have tried. Not in all cases though. As shown later below my NOT IN for this particular case beats out the LEFT 2 out of 3 times. This could have more to do with the fact that the planner has an easier time optimizing geometric INNER JOINS than LEFT JOINS.What the above query is basically doing is
Now some may ask -
Its kind of strange that in most DBMSs I have tried, doing a left is in general faster than doing a NOT IN or EXCEPT. My guess for this reason is a LEFT takes in general less memory and sorting power to process. This varies depending on datasets and the properties of your dataset.
Think about the case of if the two sets of geometries intersect I can immediately throw out those results because looking forward I know I really don't care about towns that have kindergartens so as soon as I see such a thing - I throw it away. So while I need to process it, I immediately throw this information away. In fact think about it - I never asked what schools it has - so as soon as I see a town with a kindergarten - I could care less about it. I never have to consider this town again or join it or order it with my other towns. I do not need to remember it.
If I do a NOT IN or EXCEPT I first have to ask which towns have kindergartens and then which towns are not in the set of towns that have kindergartens. MORE MEMORY and processor needed for sorting in general because I need to know the towns that have kindergartens to compare with my full town set to know the towns that don't have kindergartens. Well its not quite that simple for NOT IN and EXCEPT, but almost because the planner sees these as separate distinct questions instead of in the case with the LEFT WHERE where it perceives that as one question.
Below is the same question - asked with a NOT IN. Note: We are asking who is in the list (our inner question) and who is not in the list that we just generated.
Same question with EXCEPT - return all towns except those that have kindergartens
One reason why it takes less memory to do a LEFT JOIN is what makes database programming a little harder to understand than standard procedural logic. Remember I stressed the idea of concepts verses reality. Sure the idea of a school that doesn't actually exist is kind of silly but its a useful model and allows us to do one very important thing - IRRADICATE INCONVENIENT EXCEPTIONS. All my towns now have kindergartens (sort of) so I am no longer asking a question I have no data for. The fact that some of these kindergartens are imaginary and that I am now on the hunt for towns with imaginary kindergartens is inconsequential. I have irradicated this inconvenient exception that forces me to ask, Which towns have kindergartens, a question I could care less about.
Although JOINS happen conceptually before WHERE, in reality they don't need to as long as the planner can guarantee the process of solving things out of order mimicks the conceptual model. In reality they never need to happen at all if reality can mimick concept.
For example if you had in your WHERE clause
Its actually a wonderful hack that you can state a problem, see annoying exceptions to the rule and quickly concoct conceptual models that destroy these nasty exceptions so you can treat everything the same.
Saturday, October 20. 2007
Database Information Schema Catalog, ... Posted by Regina Obe in database standards, postgis postgresql at 00:00
I love the idea of code generation without using IDEs just because I know if my IDEs of choice are not present at a particular moment, I've got another swiss army knife to rely on. I particularly like SQL code generation schemes that rely on a database to do it. When you are trying to generate SQL code, you often need the metadata of your table structures to do so and using the database to get that info is often the easiest approach.
This post is in the spirit of Hubert Lubaczewski's recent PostgreSQL post grant XXX on * ? which details how to grant access to tables in PostgreSQL using some SQL code generation tricks. As it turns out I had a similar situation I had to deal with recently, but involved updating fields to null.
Empty String verses Null
Now there have been long debates about when to use Null vs. Empty string and the various Gotchas involved. If you want to know about these, this series on Nulls What if null if null is null null null is null? by Hugo Kornelis is a pretty good one. Hugo is predominately a Microsoft SQL Server blogger, but for the most part how null is handled is pretty much the same across most pseudo ANSI standard relational databases. So regardless of your database poisons of choice, his comments have equal merit.
I generally prefer the use of Null instead of empty string, all gotchas aside, mostly for philosophical reasons; I believe that the absence of data should be a black hole and that is what NULL is. When you do a lot of statistical and financial reporting as we do, its immensely useful, but you have to be cognizant of the difference between 0, empty string and NULL and that it is used consistently within your database. There is one other reason I prefer NULL over using an empty string and that is that NULL can be cast to any data type because it represents the existence of nothing where as an empty string can not because an empty string is a string. This gets me to my particular dilemma.
Updating Empty String to NULL
Often times when I get property parcel data they send it to me as all varchar or some such thing and they are always changing the field names on me (the evils of DBF/ESRI shape as a transport mechanism) and I then have to massage this data into my superbly structured tables where a number is a number and can be tabulated without casting. Now take the case if you have say land property's assessed value or some other numeric field that comes as varchar and you need to stuff it into an integer or float (double precision) field and a lot of these fields come thru as empty string.
You get an error like this - ERROR: invalid input syntax for double precision: "" in PostgreSQL (or SQL Server) when you do something like.
But if you do something like
So the way I get around this unpleasantry is to set all varchar fields that are empty to null before I try to insert it into my final table structure. You can imagine this gets pretty repetitive if you have to do this for say 20 fields. So here is my trick to generate the SQL to do this in PostgreSQL.
SELECT 'UPDATE ' || table_name || ' SET ' || column_name || ' = NULL WHERE ' || column_name || ' = '''';' As sqlupdate FROM information_schema.columns WHERE table_name = 'sometable' AND data_type LIKE '%char%'
The above will return a row for each column in your table that is a character varying or char field and will contain the update statement to update all empty strings in that column to NULL. If you want to go one better, you can create a custom aggregate for strings (as I mentioned in More generate series) and use it to get a single row containing all your update statements.
Note that this same trick works in other databases that support the ISO-SQL:1999+ information_schema such as (SQL Server 2005, SQL Server 2000, MySQL 5, PostgreSQL 7.4 +) - (sadly Oracle apparently doesn't support information_schema - perhaps someday. :)).In SQL Server 2000/2005 you would replace the || with + so your code would look like
SELECT 'UPDATE ' + table_name + ' SET ' + column_name + ' = NULL WHERE ' + column_name + ' = '''';' As sqlupdate FROM information_schema.columns WHERE table_name = 'sometable' AND data_type LIKE '%char%'
If you want to have the above return a single row in SQL Server 2005, similarly you would create an aggregate function for strings as described here - granted a bit more involved than doing the same in PostgreSQL. Or use SQL Server's XPath syntax supported in SQL Server 2005 as described here.
Note that if you are making heavy use of database schemas, then you will need to qualify your tables with the schema name and do a where on the table_schema as well.
SELECT 'UPDATE ' || table_schema || '.' || table_name || ' SET ' || column_name || ' = NULL WHERE ' || column_name || ' = '''';' As sqlupdate FROM information_schema.columns WHERE table_name = 'sometable' AND data_type LIKE '%char%' AND table_schema = 'assessing';
Explore the INFORMATION_SCHEMA
The INFORMATION_SCHEMA is chuck-full of all sorts of useful metadata about your database objects. While some tables may not exist in some database management systems (DBMS) (e.g. PostgreSQL has a table information_schema.sequences (defined in SQL:2003 standard which you won't find in MYSQL and SQL Server because those databases don't have sequence objects), for the tables that exist, the tables are consistently named across all DBMS's that support them and so are the field names.
There are 3 tables (views) I find most useful in the information schema. These are all available in the DBMS's I mentioned, and those are
Note - for the most part if not in all cases, the INFORMATION_SCHEMA set of information are pretty much implemented as views on top of the proprietary DBMS system tables. The reasons I find most compelling to use them instead of using the direct system tables are the following
« previous page (Page 1 of 1, totaling 2 entries) next page »
Friday, August 04. 2017
Tuesday, August 01. 2017
Monday, July 31. 2017
Saturday, July 29. 2017
Monday, July 24. 2017
Friday, August 18. 2017
Wednesday, August 16. 2017
Wednesday, August 16. 2017
Tuesday, August 15. 2017
Monday, August 14. 2017