I’ve been reading about PostgreSQL. I like what I see, a lot.
I’m a database weenie, which is unusual among application developers from what I’ve seen. I’ve found that almost universally, application developers don’t understand transactions, don’t understand basic data modeling rules, and generally regard RDBMSs as a pain in the butt, to be marginalized as much as possible.
Of course, Real SQL RDBMSs (meaning, to me, ACID-compliant ones that implement ANSI SQL in addition to lots of useful extensions) used to cost an arm and a leg for a license, and then you had to pay even more for support, and even more for training. But there’s a reason the vendors charged so much: because they could! There was that much functionality in their products.
I’ve seen a few projects from the inside, plus lots of arguments from the “databases should be dumb” camp, and the pattern is consistent: application developers would rather reinvent functionality in their own code, where it doesn’t work as well as the existing functionality in the RDBMS, usually because the developer doesn’t fully understand the problem they’re solving.
(From a CTO standpoint, this is money thrown down a hole, assuming the goal is not database independence. If it is, you have to either move database functionality into your application, hopefully with an off-the-shelf persistence library that works correctly and costs you nothing, or code up a bunch of database-specific adapters that know each vendor’s special extensions for doing what you want.)
Here’s an example of application code failing to do a good job of duplicating standard database functionality: select max(id) from users, then use that value + 1 as the ID of a new record you’re inserting. It works great in testing, but there’s a race condition that crops up as soon as you have multiple threads hitting the database: if the select runs twice before the first insert happens, you get two inserts with the same ID. You could do a select... for update first, but that locks the whole table, one row at a time, before you insert the new row. So maybe we add application code to make it fast and safe: go back to select max(id)..., but before inserting, do one last select, in a loop, to make sure there’s no collision, incrementing the new ID as needed, and eventually do the insert with an ID you’re pretty sure is unique. Or put a unique constraint on the column and just try the insert, incrementing and retrying in an endless loop until it succeeds. Yes, developers really do stuff like this.
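To make the race concrete, here’s a sketch of the naive pattern (the table and columns are invented for illustration):

    -- Sessions A and B both run the "get the next ID" logic:
    SELECT max(id) FROM users;                          -- both see 41
    -- ...each application thread computes 41 + 1 = 42...
    INSERT INTO users (id, name) VALUES (42, 'alice');  -- session A: succeeds
    INSERT INTO users (id, name) VALUES (42, 'bob');    -- session B: duplicate ID
    -- Without a unique constraint you silently get two rows with id 42;
    -- with one, session B errors out and has to loop and retry.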
They could use a sequence object (or whatever the database’s particular name for it is) to do this automatically, or failing that, just make a teeny table with columns for table name and next ID to use (which is basically what a sequence is) and do a select for update on that. That way you’d lock the one-row-per-table sequence table, not the users table. But even this is probably unnecessary, since most databases already have a feature for exactly this requirement: in many cases you can just define a column as an autoincrementing number, let the assignment of an ID be implicit in your insert, and fetch the value afterwards using a guaranteed-safe function created for exactly that purpose.
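Here’s a sketch of both approaches in PostgreSQL syntax (names invented; other databases spell this differently):

    -- Do-it-yourself sequence: lock one row in a tiny allocator table,
    -- not the whole users table.
    CREATE TABLE id_allocator (
        tablename TEXT PRIMARY KEY,
        next_id   INTEGER NOT NULL
    );

    BEGIN;
    SELECT next_id FROM id_allocator WHERE tablename = 'users' FOR UPDATE;
    UPDATE id_allocator SET next_id = next_id + 1 WHERE tablename = 'users';
    -- ...insert into users with the value just read...
    COMMIT;

    -- The built-in feature: SERIAL creates a sequence and wires it up
    -- as the column default, so ID assignment is implicit in the insert.
    CREATE TABLE users (
        id   SERIAL PRIMARY KEY,
        name TEXT NOT NULL
    );

    INSERT INTO users (name) VALUES ('alice');  -- id assigned automatically
    SELECT currval('users_id_seq');             -- the value this session just got;
                                                -- safe under concurrent inserts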
There are many more complicated and dangerous examples; this is just one illustration of how single-threaded algorithms become either painfully slow or just plain incorrect in a highly multithreaded environment. A lot of these problems are already solved in a Real RDBMS. This is what they’re there for.
In the commercial world, I’ve been impressed by Oracle 8, MS SQL Server 2000, and to a lesser degree Informix and Sybase. These all cost a bunch, and MS SQL Server 2000 carries a very serious platform limitation, namely that it runs only on Windows, which is either a non-issue or a showstopper depending on what the rest of your application’s technology stack looks like.
So lately I’ve been looking at Real RDBMSs in the open source, zero-cost world. MySQL is the obvious leader, but until recently they’ve been annoyingly resistant to adding features that I consider core “Real RDBMS” functionality. These features were in Oracle, Sybase, Informix, etc. ten years ago because they were necessary; only now is MySQL AB starting to include them, and it takes quite a few releases before they get each one right. So although MySQL is decent for what it is, it ain’t a rock-solid, complete “Real RDBMS” by my definition. (See also: MySQL gotchas)
It looks like PostgreSQL has become a serious candidate recently (within the last 12 months or so). I’ve been reading about it a lot, and it seems to exceed the quality and feature set of Oracle 8, which is a pretty serious bar to reach. The second bar I’d look for is an admin GUI (or web interface) as nice as the excellent ones for MS SQL Server 7 and 2000.
Here are some niceties:
- Good documentation (Is it just me or do the MySQL docs always leave you feeling like there are a dozen strange side-effects and gotchas you have to keep in mind at all times when using any feature?)
- They claim that working with TEXT columns is just as fast as using VARCHAR for the same size data (i.e. no heinous penalty for storing 100 characters in TEXT vs. VARCHAR). I’m planning to store a lot of large text, so that’s nice to hear.
- Fast full-text indexing is available via tsearch2 (see the sketch after this list).
- Decent master-slave replication is available via Slony-I. It seems a bit complicated, but not flaky, and reliability is what really matters.
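Going by the tsearch2 documentation (I haven’t run this myself yet, and the docs table is invented), usage looks roughly like this:

    -- Add a tsvector column, populate it, and index it with GiST
    ALTER TABLE docs ADD COLUMN fti tsvector;
    UPDATE docs SET fti = to_tsvector('default', body);
    CREATE INDEX docs_fti_idx ON docs USING gist (fti);

    -- tsearch2 ships a trigger function to keep the column current
    CREATE TRIGGER docs_fti_update BEFORE INSERT OR UPDATE ON docs
        FOR EACH ROW EXECUTE PROCEDURE tsearch2(fti, body);

    -- Query with the @@ match operator
    SELECT id FROM docs WHERE fti @@ to_tsquery('default', 'postgres & replication');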
And of course there’s the completeness of the feature set: stored procs, triggers, subqueries, and working foreign keys, all of which, in MySQL, are either missing (and you’re not supposed to want them) or have only recently and begrudgingly been added.
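Here’s a quick sketch of two of those in PostgreSQL (the schema is made up, and it assumes PL/pgSQL has been installed in the database):

    -- A working foreign key: orphaned orders are simply impossible
    CREATE TABLE orders (
        id      SERIAL PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users (id),
        total   NUMERIC(10,2) NOT NULL
    );

    -- A trigger written in PL/pgSQL, PostgreSQL's stored-procedure language
    CREATE FUNCTION check_total() RETURNS trigger AS $$
    BEGIN
        IF NEW.total < 0 THEN
            RAISE EXCEPTION 'order total cannot be negative';
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER orders_check_total BEFORE INSERT OR UPDATE ON orders
        FOR EACH ROW EXECUTE PROCEDURE check_total();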
PostgreSQL has its own gotchas, and at the moment I’m definitely in grass-is-greener territory with respect to MySQL. I have to actually get my hands dirty with PostgreSQL to find out what sucks about it, and how it compares to MySQL for my needs. I’ll write about that when I have more direct experience. But for now, it sure looks green.