Thursday, April 23, 2009

Schema-less document freedom

One of the many project I am working on at the same time is content based. Different sources of content have different plugins that get the content and then store them with the relationships they have with other data. It is easy to design a base 'Content' class. Where there is an Image, Text, Video ect as subtypes of that. Then based on the source there is another subclass with specific infromation. In RDBMS this is a mess, with many different classes and each time I need special information things get ugly because I have 2 choices. Either grow a table or stuff some things in a generic field. When developing a prototype this really is a bummer I have to do a lot of thinking about data storage that I really don't want to.
Document is cool
What this boils down to in Documents is a content type (Image, Text, Video ect) and then source. Any addional information needed to be stored associated with the specifics just lives in the document. The only tricky part is populating from a view (query). I need to decide based on the type and source which class this is for and bam I am done. So much simpler.
Graph + RDBMS = Pain
Another thing this project does is create links between content, collection of content and collections of collections. Somewhere in there is a graph of content. Trying to traverse a graph with SQL without way to many queries is just hurting my head. From what I have heard writing a bunch of stored procedures code can reduce the volume of SQL however I think the point is unless you have to. Luckily I found CouchDB to try this with. I have not started to implement it but it fits in my head how I can easily traverse the graph with CouchDB.
Now to the belly of the beast to see if my theory holds any water.

Wednesday, April 22, 2009

CouchDB First impressions

For a while I have been wanting to have a different approach to database storage for a while. A good while ago I blog about object databases and mentioned taking a look at CouchDB. There has been something that has not been sitting right with me for a while about the relational model.
Observations
As a Python programmer I spend a lot of my time iterator of list, doing dictionary lookups to find and manipulate data. All the languages I have seen or written in for web are doing this. The ORM's save us time by maximizing our mental time thinking in programming language of choice. However eventually we end up switching our brain into relational mode so we can do some complex query or schema design.
Unbind my mind from relations please!
One thing that I have noticed is that relational model and how I access and store information into it changes the way I think about the information. When discussing with other developers applications I tend to start describing features like a SELECT statement (Getting all list of widgets FROM animal, primate WHERE ...). This bothers me that instead of the flexibility and easy of use of my programming language I am constrained to think about the limitation of my storage system. Ideally I would like to discussing features and functionally without any constraints.
Everything is simply upside down
After getting over the lack of schema definition which takes all of about 2 seconds when I realized I can get down to writing code right away instead of trying to constrain my ideas to the data model that I have developed. Simply create an class definition with typed attributes and you are off and running. What seems like a no brainer' is a couple of new types which are arrays and dictionaries. Not only does this seem like total no brainer' since just about every programming language these days has built int array and dictionary types. Even though Postgresql has an array type it is very hard to work with.
Then the problem is when it comes to the JOIN. Where relational database really are useful is joining tables of which the better be couse these days it seems with any non trivial system you have to join 5 tables to get anything useful from the system. I find this rather painful. Beause I spend a lot of mental energy trying to query data and filter it out when it would be much easier for my to just say "for r in data". It would be much easier to traverse all the records and filter them that way. This is one of the feature I love about couchDB you visit every record and can add javascript to filter or transform the data returned. WOW the feedom.
Documents are not tables
CouchDB also uses "Document" which takes a bit of getting use to if your like me and have been thinking about tables for the last decade. First documents fields can be typed but there is not strict schema that they have to conform to. What is nice is that if needed can just push new code supporting new fields without DDL changes. WOOT! I suppose flexability is what I like most about this. I don't want to constantly change schema during the process of developing a new application or new features on an exisitng one. I have to wipe the data every time I change the schema or data model.
Document as model we don't normalize they same way relational. This is hard to change my mind from thinking in database normal form to a document that holds data. So far it has been hard to get use to this natual fit. There are still relationships between documents but the volume of the relationships is greatly reduced. Example in a relational database we would have animal->primate->monkey as tables. In contrast this would all be represented in a single document.
Pain in the JOIN, maybe
You can't do joins the same way you do them in a RDBMS. However you can do joins. It has taken me a while to bend my mind on how to do this. It isn't complex it just hard to change the way I think about it. I have only just started to write my first joins in a simple application I am working on and I like it so far. The difference is that all types of documents are returned instead of single rows with lots of duplicate data. For example if you join the city with the state table you get [Charlotte, NC], [Concord, NC] where 'NC' is duplicate state. Where as in in couchdb we would get [NC, Charlotte, Concord, GA, Atlanta..]. On one hand going to have to deal with the different documents in the result set however on the other hand there is a lot less data and the time savings in trying to figure out the 15 table joins will more than make up for handling more complex result set.