Wednesday, December 30, 2009

Is RDBMS pain boil down to premature optimization?

As I extract my self from stored procedure pondering how at the door step of 2010 we are still using such an inflexible, complex and surprising language to manage data? The stored procedure with its PERFORM syntax instead of SELECT when you are calling a stored procedure inside of a stored procedure and the stored procedure you are calling does not return anything. There are so many of these type of rules that 30 years ago would have felt natural. Today this is a waste of developer time. Postgresql provide many different languages but this sounds much better than it is in practice. What is really bothering me why is this so painful? Why have RDBMS not matured to make our tasks easier and why hasn't a better data storage technology come about?

In this blog I am going to explore some of the issue that I have seen with RDBMS's. All of them seem to come back to premature optimization.

Data consistency is all that matters!
This is a common argument why all data store must be ACID compliant. I have heard my co-workers laugh about how NoSQL databases store inconsistent data. This seem ironic in a way because I have read most of their code and writing a lot of code that uses ACID RDBMS and that data can easily be placed in an inconsistent state. This is for performance reasons that we do this. When I was writing EJB's and each data record locked for reading and writing this was a very consistent way to access data. The problem with this is no data can be read or writing in parallel. It is amazing how few users it takes to bring a web system to a halt that is truly consistent. In the web applications that I have written the make heavy use of caches like memcache which I am sure has created many inconstant user views of data. I can think of off the top of my head over my 10 years of coding web applications, only 3 instances where I have writing been involved or seen code that actually was consistent. It might just be me but I would guess most web app developers have had a similar experience.

Disk speed is slow?
Disk have only recently started to increase in speed with SSD coming out that are multiple times faster then the spindles. Has this been the reason we have lived with RDBMS for so long. We have not had fast enough disk to use something else? We have been able to add more disks and use raid to increase performance however with RDBMS this has come down to writing to two nodes. First writing to the master and it replicating to a minimum of one slave. But wait we can scale RDBMS simply but sharding! Except we have broken rule #1 ACID if we shard data. So if data access performance will ever exceed disk array or size exceeds a disk array then RDBMS.

Object are not relations but they store sooooo close until they don't.
As a developer this seem to be something that is constantly driving me crazy. The relational data model never exactly maps to objects of the applications so it is 75% of the time consistent between tables and objects. So we design the tables first and our objects end up looking like are schema until performance suffers really bad and then we end up changing the table schema and the objects don't make any sense. Only the most trivial applications applications does this work out but I don't tend to spend much time writing or maintaining trivial applications.

RDBMS Vendor + DBA all your problems are solved
I can agree that for specific problems specific venders have design good solutions for them however this is very rare . As for the DBA since I sell myself as a knowledgeable person I agree more knowledgeable people to help solve difficult problems is a very good approach however the problem is the data access is slow! No matter how many people you have there is only so fast a single node can write to a disk array. No matter what if the system grows though more users, more data or more features the system will get slower if it can not do parallel writes.

So maybe I am just "ranting about writing software" but I think it is time to start making the transition to better data storage systems. My future productivity depends on using data stores that get out of the way and let me use modern languages and concept to manage the data. For the situations that I need to have consistent data I will use ACID RDBMS or find a way to create consistent data in NoSQL data stores.

8 comments:

Anonymous said...

Interesting..... I have recently been researching the pros and cons of NoSQL databases and comparing same with the traditional RDBMS which currently hold sway.

As you already mentioned, there are other language bindings for writing Postgresql stored procedures. Unless there are performance issues with writing a stored procedures in a higher level language like say python I would certainly prefer python stored procedures over the dated terse syntax created ages ago.

My experience has been that ORMS like SQLAlchemy help with the back and forth mismatch between the application objects and the data tables as the data model of neccessity evolves while the application likewise is evolving. So it hasnt been that bad (for me at least).

So much has been said about how NoSql databases are more appropriate than the RDBMS for web applications (that need to scale at will in geometric fashion) where a simple key-value data store will suffice, etc. Indeed, NoSQL databases in some quarters are deemed to hold a lot of promise for the future of reliable, scalable distributed/concurrent database backends with more predictable performance characteristics than the RDBMS.

This is all fine and dandy. However dont NoSQL databases leave the developer to do most of the heavy lifting required to query the data in intelligent fashion? i.e Do you expect to see an immediate increase in your productivity w.r.t development related to data management of NOSQL based DBMS's.

I think I saw you mentioned somewhere that riak only allows writing the data management map/reduce logic in erlang - which same requires a steep learning curve in order to become proficient. Writing map/reduce logic using a functional high level language like javascript on the other hand perhaps would be a lot more fun/productive.

I would like to see general intuitive tools/libraries/apis developed to ease the development of application specific data access logic (regardless of the degree of their complexity) to ease the transition from using a traditional RDBMS to using NoSQL databases in cases where it is more appropriate to do so.

Brume

Lateef Jackson said...

SQLAlchemy really does take a lot of the pain out of SQL. If you have to get down and dirty with stored procedures or what not you still at the bottom of a cliff to learn and debug the oddities of working in the database environment.

"However dont NoSQL databases leave the developer to do most of the heavy lifting required to query the data in intelligent fashion?"
This is a really good question. The first major reason I believe NoSQL increases productivity is data model design. I have found that the document model allows the data to be organized so that there is a lot less quiring. I also find writing map/reduce to be faster and easier than SQL once you get use to it with some exceptions where it is more difficult but this might just be getting use to thinking in a document model. In general I find that either I didn't have to write the query or the savings in time that map/reduce creates is worth the extra effort on the queries that are more difficult.

"riak only allows writing the data management map/reduce logic in erlang - which same requires a steep learning curve in order to become proficient"
Yes this is true. It is however open source and if I knew enough erlang I would try to see if I could hook Python in to do the map/reduce. I don't think this would be to difficult and I expect if Riak becomes popular it will quickly get map/reduce support in javascript and php also. The SQL learning curve at the beginning might seem easy but it gets steep quickly as the data gets normalized and query are non trivial.

A Python map/reduce that could be used from python console would be an awesome tool for Riak and I think go a long way in accelerating the adoption of Riak.

dm said...

The interesting thing with a lot of the consistency discussions is that we are often comparing a single server RDBMS to a multi server NoSQL cluster. If we were to use NoSQL solutions on a single server too, we would then get instant consistency with many products just like with a SQL database.

Also product by product it varies - HBase IIRC favors consistency over availability.

Often with RDBMS, when one goes multiserver, one uses asynchronous replication which is eventually consistent.

dwight/MongoDB

Anonymous said...

i don't think the trend has anything to do with disk drives. Rather, changes in CPUs drive the trend: in the future we can have many slow cores on many machines cheaply, but one fast core becomes impossible.

horizontal scaling got the ball rolling on the NoSQL trend, but developers are now seeing too that programmer productivity is higher and administration lower, in particular with the document oriented databases

Anonymous said...

Hey Lateef,

I am trying to follow your train of thought....but not yet familiar enough with how document databases work to fully grasp what you mean....

"I have found that the document model allows the data to be organized so that there is a lot less quiring. I also find writing map/reduce to be faster and easier than SQL once you get use to it with some exceptions where it is more difficult but this might just be getting use to thinking in a document model. In general I find that either I didn't have to write the query or the savings in time that map/reduce creates is worth the extra effort on the queries that are more difficult.
"

Could you please elaborate further on this?. If the data you require is spread over so many documents in a document store wouldnt you have to manually get and filter each document in some sort of custom fashion (for each query) - in a sense building queries on the fly using primitive predicates provided by raw javascript code (for example). Or am I missing something fundamental here?.

When you say "the savings in time map/reduce creates" are you referring to time saved in returning the results from a map/reduce call vs the time it takes for the RDBMS equivalent for a complex (non-trivial) query?.

The last poster is also re-echoing your sentiments about higher programmer productivity.

Perhaps I need to get my hands dirty and/or do some actual coding with a document database in order to fully appreciate all this.

Brume

Lateef Jackson said...

@slow disk
Compared to CPU speed individual disk speeds have stood still for the last 20 years. If the performance had increased a lot of the RDBMS performance issues would not exist.

@Brume
As far as querying data model in a document based NoSQL in the applications that I have been working on the data tends to be less partitioned into documents. For example if you had a system you would have one document to store user information (name, password, authentication type), settings (colors, background, links), contact information (phone numbers, emails, address) and friends (list of other users). Well in a relational model that was normalized instead of having one table it would have maybe 7 (login, setting, phone, email, address, friend) or more potentially. Since all this is stored in a single document then there is no need to traverse other documents. So in this case we have a single object that would represent the document and instead of having 7 ORM mapped object to 7 tables we would just have one. If the use case is to find all the users who have setting X and email them for the next release we can just filter the documents by setting and we are done. In an ORM we can filter by setting then look up the email address, or we could build a join so that we don't have two round trips to the database... But that goes back to my point about performance.

When I was speaking of the map/reduce time savings I was talking development time savings. Specifically I found it much faster to write Javascript conditional to filter documents in CouchDB than write SQL.

As far as getting your hands dirty the issue I had was it took a couple little applications before I got my brain thinking document instead of relation. The first couple scripts I wrote using CouchDB looked like relational data model. I hate to say it but it is hard to "think outside the box".

Anonymous said...

Ah!.

So each document will contain all data it needs with fewer queries required to pull data of interest.

One obvious trade off being increased disk space requirements seeing as data is duplicated accross documents in order to obviate joins. As well updates of duplicated data will require more writes to different documents to accomplish.

Interesting.....

Now I see the forest for the trees :)

Duh!. It is hard to think with a non-relational hat on when all the apps you have written are for relational data models.

Brume

Lateef Jackson said...

@Brume
The documents can have foreign keys to other documents. So you don't have to have copies of the data. For example if I was to have a list of a list of friends then I would just have an array of the document ids that are the users friends. One of the reasons I like Riak so much is it support a concept of a link in a document of which you can traverse links which is really cool. In my applications that I wrote using CouchDB I just used a string that pointed to the document id of which I would use a lot in lists and dictionaries (associative arrays).

All I was trying to point out is the documents can have nested data in them which is very difficult or ugly to do in a relational database. In relational databases we end up creating lots of data that should just be nested and use queries to rebuild the nesting in our ORM. When you have a document and it has a small list of things like groups, friends, settings ect most of the time you don't need to create a document for each "thing".

Post a Comment