"Not Invented Here"
When I am hacking on Frisky I generally just plow through whatever I can because I have so little time to think about what I am doing. So when I have a problem, say caching, I just write some code to solve it without considering an existing solution. Then, when I become aware of one, there are two reasons I would avoid using a third-party library. The first is momentum / laziness, which is pretty lame, but with the time crunch it is what it is. The second is that the solution I have written is sometimes a more correct fit for my problem than an external one. That is fundamentally the balance, and it is hard to figure out which side of the line I should be on.
It is not as bad as I thought, it is better!
Enough of that though, really! It is time to use more robust and faster implementations than what I could write myself. Even if I had all the time in the world, the libraries make a lot more sense from a design and simplicity standpoint than inventing a very custom solution. Now that I have read up on both of these I just need to implement them.
Beaker caching rocks!
It is amazing that it took a recent blog post to finally make the bell go off that I should be using Beaker. I had heard Mark Ramm talk about it a couple of times, and the TurboGears and Pylons mailing lists talked about it a lot. Duh! Hopefully this will be quick and easy to implement and save a lot of time.
Message Queue with STOMP
When I first started to think about building Frisky I wanted a messaging system. I described it as a worker process and really detailed it out to be anti-cron, but what I was getting at was a message queue. Of course, not until I was reading about message queue systems in the latest Python Magazine did I realize this. This will take some experimentation but should greatly reduce the code that still needs to be written to implement the original feature set. I was thinking I may be able to programmatically start one of the Python STOMP servers; that way it could just be a server setting to turn it on.
Tuesday, May 5, 2009
Thursday, April 23, 2009
Schema-less document freedom
One of the many projects I am working on at the same time is content based. Different sources of content have different plugins that fetch the content and then store it along with the relationships it has with other data. It is easy to design a base 'Content' class with Image, Text, Video, etc. as subtypes. Then, based on the source, there is another subclass with source-specific information. In an RDBMS this is a mess, with many different classes, and each time I need special information things get ugly because I have two choices: either grow a table or stuff some things in a generic field. When developing a prototype this really is a bummer; I have to do a lot of thinking about data storage that I really don't want to do.
Document is cool
What this boils down to in documents is a content type (Image, Text, Video, etc.) and then a source. Any additional information that needs to be stored for the specifics just lives in the document. The only tricky part is populating objects from a view (query): I need to decide, based on the type and source, which class each row is for, and bam, I am done. So much simpler.
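A minimal sketch of that type-and-source decision, assuming each document carries "type" and "source" fields. The registry, the class names and the "flickr" source are hypothetical, made up for illustration, and no CouchDB client is involved:

```python
# Hypothetical registry: maps a (content type, source) pair to a Python class.
REGISTRY = {}


def register(content_type, source):
    """Class decorator that records which class handles a type/source pair."""
    def decorator(cls):
        REGISTRY[(content_type, source)] = cls
        return cls
    return decorator


class Content:
    """Base class; it keeps the whole document, extra fields included."""
    def __init__(self, doc):
        self.doc = doc


@register("image", "flickr")  # "flickr" is only an example source name
class FlickrImage(Content):
    pass


def from_row(doc):
    """Pick the class for a document returned by a view; fall back to Content."""
    cls = REGISTRY.get((doc.get("type"), doc.get("source")), Content)
    return cls(doc)


photo = from_row({"type": "image", "source": "flickr", "license": "cc-by"})
note = from_row({"type": "text", "source": "unknown"})
print(type(photo).__name__, type(note).__name__)
```

Fields the class knows nothing about (the "license" key here) simply ride along in the document, which is the point of the schema-less approach.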
Graph + RDBMS = Pain
Another thing this project does is create links between content, collections of content and collections of collections. Somewhere in there is a graph of content. Trying to traverse a graph with SQL without way too many queries is just hurting my head. From what I have heard, writing a bunch of stored procedure code can reduce the volume of SQL, however I think the point is not to go there unless you have to. Luckily I found CouchDB to try this with. I have not started to implement it, but it fits in my head how I can easily traverse the graph with CouchDB.
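To make the traversal idea concrete, here is a rough sketch that walks links between documents breadth first. It is an in-memory stand-in, not CouchDB code: the DOCS dictionary, the "links" field and the ids are all assumptions, and against a real database each lookup would be a document fetch or a view query.

```python
from collections import deque

# Hypothetical documents keyed by id; "links" lists the ids each one points to.
DOCS = {
    "collection-a": {"type": "collection", "links": ["photo-1", "collection-b"]},
    "collection-b": {"type": "collection", "links": ["video-1", "photo-1"]},
    "photo-1": {"type": "image", "links": []},
    "video-1": {"type": "video", "links": ["collection-a"]},  # cycle on purpose
}


def reachable(start, docs):
    """Breadth-first walk; returns the ids reachable from start in visit order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        doc_id = queue.popleft()
        order.append(doc_id)
        for linked in docs[doc_id].get("links", []):
            # Skip dangling links and anything already queued or visited.
            if linked in docs and linked not in seen:
                seen.add(linked)
                queue.append(linked)
    return order


print(reachable("collection-a", DOCS))
```

The seen set is what keeps collections of collections from looping forever when two of them link to each other.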
Now to the belly of the beast to see if my theory holds any water.
Wednesday, April 22, 2009
CouchDB First impressions
For a while I have been wanting a different approach to database storage. A good while ago I blogged about object databases and mentioned taking a look at CouchDB. Something about the relational model has not been sitting right with me for a while.
Observations
As a Python programmer I spend a lot of my time iterating over lists and doing dictionary lookups to find and manipulate data. All the languages I have seen or written in for the web do this. ORMs save us time by maximizing the time we spend thinking in our programming language of choice. However, eventually we end up switching our brains into relational mode so we can do some complex query or schema design.
Unbind my mind from relations please!
One thing I have noticed is that the relational model, and how I access and store information in it, changes the way I think about the information. When discussing applications with other developers I tend to start describing features like a SELECT statement (getting a list of all widgets FROM animal, primate WHERE ...). It bothers me that instead of enjoying the flexibility and ease of use of my programming language, I am constrained to think about the limitations of my storage system. Ideally I would like to discuss features and functionality without any constraints.
Everything is simply upside down
Getting over the lack of a schema definition took all of about 2 seconds once I realized I can get down to writing code right away instead of trying to constrain my ideas to the data model I have developed. Simply create a class definition with typed attributes and you are off and running. What seems like a no-brainer is a couple of new types: arrays and dictionaries. It seems like a total no-brainer since just about every programming language these days has built-in array and dictionary types. Even though PostgreSQL has an array type, it is very hard to work with.
Then the problem comes with the JOIN. Where relational databases really are useful is joining tables, which they had better be, because these days it seems that with any non-trivial system you have to join 5 tables to get anything useful out of it. I find this rather painful, because I spend a lot of mental energy trying to query data and filter it when it would be much easier for me to just say "for r in data". It would be much easier to traverse all the records and filter them that way. This is one of the features I love about CouchDB: you visit every record and can add JavaScript to filter or transform the data returned. WOW, the freedom.
Documents are not tables
CouchDB also uses the "Document", which takes a bit of getting used to if you're like me and have been thinking about tables for the last decade. First, document fields can be typed but there is no strict schema that they have to conform to. What is nice is that, if needed, I can just push new code supporting new fields without DDL changes. WOOT! I suppose flexibility is what I like most about this. I don't want to constantly change the schema while developing a new application or new features on an existing one. With a relational database I have to wipe the data every time I change the schema or data model.
With a document as the model we don't normalize the same way we do with relational. It is hard to change my mind from thinking in database normal forms to a document that holds data. So far it has been hard to get used to this natural fit. There are still relationships between documents but the volume of relationships is greatly reduced. For example, in a relational database we would have animal->primate->monkey as tables. In contrast, this would all be represented in a single document.
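As an illustration of the animal->primate->monkey example, the sketch below holds everything in one document. Every field name and value is invented for the example; only the "_id" key is something CouchDB itself expects. Since documents are JSON, a plain dictionary is enough to show the shape:

```python
import json

# Hypothetical document: the fields that would otherwise be spread across
# separate animal, primate and monkey tables live together in one place.
monkey = {
    "_id": "monkey-george",
    "type": "monkey",
    "name": "George",
    "animal": {"legs": 2, "diet": "omnivore"},
    "primate": {"opposable_thumbs": True},
    "tags": ["curious", "climber"],  # arrays and dictionaries nest freely
}

# The whole record round-trips as a single JSON value, no joins required.
encoded = json.dumps(monkey, sort_keys=True)
decoded = json.loads(encoded)
print(decoded["primate"]["opposable_thumbs"])
```

Adding a new field later means adding a key to new documents; older documents simply do not have it.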
Pain in the JOIN, maybe
You can't do joins the same way you do them in an RDBMS. However, you can do joins. It has taken me a while to bend my mind around how to do this. It isn't complex, it is just hard to change the way I think about it. I have only just started to write my first joins in a simple application I am working on and I like it so far. The difference is that all types of documents are returned instead of single rows with lots of duplicate data. For example, if you join the city table with the state table you get [Charlotte, NC], [Concord, NC], where 'NC' is duplicated. Whereas in CouchDB we would get [NC, Charlotte, Concord, GA, Atlanta..]. On one hand I am going to have to deal with the different documents in the result set; on the other hand there is a lot less data, and the time saved not trying to figure out 15-table joins will more than make up for handling a more complex result set.
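The city and state example can be imitated in plain Python to show why the rows come back interleaved. This is only a sketch of the idea: real CouchDB map functions are written in JavaScript, the document fields here are made up, and Python's list ordering stands in for CouchDB's key collation.

```python
# Hypothetical documents: one per state and one per city.
docs = [
    {"type": "city", "name": "Concord", "state": "NC"},
    {"type": "state", "code": "GA"},
    {"type": "city", "name": "Charlotte", "state": "NC"},
    {"type": "city", "name": "Atlanta", "state": "GA"},
    {"type": "state", "code": "NC"},
]


def map_doc(doc):
    """Yield (key, value) pairs the way a view's map function emits rows."""
    if doc["type"] == "state":
        # The 0 sorts the state row ahead of its cities.
        yield [doc["code"], 0, ""], doc["code"]
    elif doc["type"] == "city":
        # The 1 places cities after their state, ordered by name.
        yield [doc["state"], 1, doc["name"]], doc["name"]


def run_view(documents):
    """Emit a row per document, sort by key, and return the values."""
    rows = [row for doc in documents for row in map_doc(doc)]
    rows.sort(key=lambda row: row[0])
    return [value for _key, value in rows]


result = run_view(docs)
print(result)
```

Each state appears once, followed by its cities, so the state is never repeated on every row the way it is in a SQL join result.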
Sunday, March 29, 2009
Hacking Frisky Round 3
Progress is fun
Progress continues (finally) on Frisky (an async web server I started hacking on for fun). A previous post covered my initial progress, and I have finally taken a second step. Thanks to zenhabits I have started to be very productive with my time. I decided to get back to seeing if I could continue with my original vision of creating a new kind of web server. As I review my vision I realize its core is a concept of stability and performance. Adding multiprocessor support was critical because it would impact almost all the other features and almost all the code. So I had to face the hurdle and jump.
Punching Performance in the eye
Python 2.5 with processing installed is now down to ~470 req/sec compared to ~1,000 req/sec for WSGI requests. Currently the bottleneck is IPC from the main process to the processes running the WSGI code. Let me list some of the reasons I am not going to nix processes:
- As this is the first web server I have written (for that matter, the first server of any kind that needed performance) and it is pre-alpha code, it should get faster with TLC
- Caching support is not available yet; once it is fixed, throughput for cacheable requests should increase greatly
- IPC is probably not the best way to pass information back and forth between processes but it sure is easy with the multiprocessing module!
Truthfully, most requests that are not cached are waiting on a database or IO, and 500 req/sec would be good. Since static file performance is not affected, and cached responses will be as fast as static files once I fix the caching code, I am not going to worry about the slowness in the short term.
Making coding easier
As I mentioned before, I wanted to get the multiprocessing done so I could support some other features. One of those is hot deployment of code and, in development, an autoreload feature. Basically they are one and the same: when the server creates a new process, that process gets the new code. This could be done at runtime in production, or during development to just load a new process when a file modification is noticed in the code base.
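Noticing a file modification can be as simple as comparing modification times between polls. The sketch below is one possible approach and not Frisky's actual code; the function names are made up, and starting the replacement worker process is left out.

```python
import os


def snapshot(paths):
    """Map each existing path to its last-modified time in nanoseconds."""
    return {p: os.stat(p).st_mtime_ns for p in paths if os.path.exists(p)}


def changed(paths, previous):
    """Return the paths that are new or modified since the previous snapshot."""
    current = snapshot(paths)
    return sorted(p for p in current if previous.get(p) != current[p])


# Watch this script itself when run as a file; nothing has changed yet.
watched = [__file__] if "__file__" in globals() else []
baseline = snapshot(watched)
print(changed(watched, baseline))
```

A development server would call changed() on a timer and, when the list is not empty, start a fresh worker process so that it imports the new code.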
Next please
Worker processes were one of the other really innovative features I wanted Frisky to have. This is important because it will allow me to go back and start filling in lots of missing pieces like configuration, a test suite and framework integrations (TurboGears, Django, Pylons, etc.).
PS. Code is on bitbucket http://bitbucket.org/lateefj/frisky/overview/
Tuesday, January 13, 2009
Coding Frisky (poorly)
Itch
A little while back I blogged about a vision for a new type of web server to meet the needs of a new web, along with how crazy I thought it was that WSGI web servers are so slow. FAPWS's goal of being the fastest web server is, I think, a great niche. I want Frisky to be fast at runtime and at development time, which seems to be why I am not going to dump all the code into the FAPWS contrib directory.
Scratching
The itch is a funny thing. If I am scratching the Frisky itch then I am not scratching the money-making itch or the hiking itch. It has been hard to find the time to scratch the Frisky itch. I am looking for some Zen for balance so I can enjoy coding on Frisky and still keep everything else at bay.
Coding... Poorly
The nasty rat's nest of code I have committed to the bitbucket frisky repository is very embarrassing, which is only half the motivation for me to keep coding. The other half is the carrot of seeing stuff work! These two factors are probably what most motivates me to keep carving out time I don't have to work on it more. So I guess until I get in the features that I blogged about I will be coding poorly and not circling back to clean up code and write tests.
Progress
Today I was able to integrate features I had when Frisky was using FAPWS2 as the core web server. That code had been sitting idle waiting for some features to get finished, and today I decided to give it a whirl. Now src.hackingthought.com is using the new Frisky. hackingthought.com also forwards to blog.hackingthought.com. How about some more features:
- Domain aliases (forwarding)
- Static file serving (skipping WSGI application thus much faster)
- Caching system
- Compression system
- Skeleton for mapping urls to WSGI applications
- 762 bytes static file 1801.19 [#/sec] (mean)
- 10 bytes cached and compressed 2197.04 [#/sec] (mean)
- 74 bytes static file cached and compressed 2112.66 [#/sec] (mean)
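The "skeleton for mapping urls to WSGI applications" in the feature list could look roughly like the sketch below. It is not Frisky's code: the function names and the two example routes are assumptions, and it deliberately skips details such as adjusting SCRIPT_NAME for the mounted application.

```python
def make_dispatcher(routes, fallback):
    """Return a WSGI app that hands each request to the longest matching prefix."""
    ordered = sorted(routes.items(), key=lambda item: len(item[0]), reverse=True)

    def dispatcher(environ, start_response):
        path = environ.get("PATH_INFO", "")
        for prefix, app in ordered:
            # Match the prefix itself or anything below it, but not "/blogger".
            if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
                return app(environ, start_response)
        return fallback(environ, start_response)

    return dispatcher


def text_app(body):
    """Build a tiny WSGI application that always answers with the given text."""
    def app(environ, start_response):
        data = body.encode("utf-8")
        start_response("200 OK", [("Content-Type", "text/plain"),
                                  ("Content-Length", str(len(data)))])
        return [data]
    return app


def not_found(environ, start_response):
    """Fallback application for paths that no prefix claims."""
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]


# Hypothetical mount points, for illustration only.
site = make_dispatcher({"/blog": text_app("blog"), "/src": text_app("src")},
                       not_found)
```

Because site is itself a plain WSGI callable, any WSGI server can run it, for example wsgiref.simple_server.make_server from the standard library.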
Monday, December 29, 2008
Publicly linkable resources without an integer id
For those of us who are writing Python web applications with RDBMS storage it is very tempting to create urls like http://hackingthought.com/foo/72 where the '72' is the id of a record in a table. This has never felt right because of the basic security risk: I could easily guess the ids of other records, for example 73 or 74, which I may not want publicly exposed. For many reasons this is undesirable. In the case of a blog this is acceptable because we may not be concerned about exposing the other resources. In the case of a video site like youtube I may not want just anyone downloading the entire collection of video content!
This is pretty easy to resolve by using Python's built-in uuid module. For a while I thought this was acceptable, but then I started to think the urls it would generate would be rather long. Example: hackingthought.com/foo/181067910632484385564896804811492956458! To me it is a bit excessive to use the integer that uuid generates. To make my life a lot easier there is a built-in representation of the uuid, which is hex: hackingthought.com/foo/8838693e35534e86b442f9d8b8d6192a. Better, but it only saved 7 characters.
Even though 32 characters is not that painful, it is still a bit long for my taste. Hex is not ideal; base64 draws on a much larger set of characters per position and so reduces the length. This would produce: hackingthought.com/foo/iDhpPjVTToa0QvnYuNYZKg (after stripping the trailing ==\n).
At this point I have gotten it down to 22 characters vs 32 for hex or 39 for the integer. That is a 43% reduction from the integer form!
The code:
import uuid
from base64 import urlsafe_b64encode
k = uuid.uuid4()
# Unique id as an int
print('char length: %s type:int value: %s'
      % (len(str(k.int)), k.int))
# Unique id as hex
print('char length: %s type:hex value: %s'
      % (len(k.hex), k.hex))
# Unique id as base64
# urlsafe_b64encode swaps + and / for - and _, which are safe in a url
b64k = urlsafe_b64encode(k.bytes).decode('ascii')
# Strip the trailing '=' padding characters
b64k = b64k.rstrip('=')
print('char length: %s type:base64 value: %s'
      % (len(b64k), b64k))
Performance? Well, my laptop can generate 100K in about 6 seconds using uuid1 or uuid4, so I don't think it is a bottleneck.
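The throughput figure quoted with the code is easy to recheck on any machine with the standard timeit module. This is a small sketch rather than the original benchmark: the helper names are made up, it uses urlsafe_b64encode for the url-safe alphabet, and the count is kept low so it finishes quickly.

```python
import timeit
import uuid
from base64 import urlsafe_b64encode


def short_id():
    """Random UUID as 22 url-safe base64 characters, padding stripped."""
    return urlsafe_b64encode(uuid.uuid4().bytes).rstrip(b"=").decode("ascii")


def seconds_for(count):
    """Wall-clock seconds needed to generate `count` ids."""
    return timeit.timeit(short_id, number=count)


elapsed = seconds_for(1000)
print("1000 ids in %.4f seconds" % elapsed)
```

Raising the count to 100000 gives a number that can be compared directly with the figure above; the result depends entirely on the hardware.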
I wonder if it could be even shorter. I would love to know how to make them shorter without losing the ease of use of built-in Python modules or having to write my own algorithms. Please drop me a line if you find a better way to do this.
Thursday, November 20, 2008
Pylons for FAPWS2
There was a question on my blog about FAPWS2 and Pylons. I was able to post a comment about it on the entry, however I was troubled because I could not get a patch to the maintainer of FAPWS2. Since then William has published FAPWS2 on github (http://github.com/william-os4y/fapws2/tree/master), which motivated me to create an account and publish my own contributions to the project. You can find them at http://github.com/lateefj/fapws2/tree/master. At the moment I have only been able to add a Pylons configuration example. Mainly there is a run.py and a development.ini that were modified to support Pylons. Last time I benchmarked it, performance was up by around 30%.
For the next commit I want to integrate my Frisky contributions into FAPWS2. This would help me find out whether others get the same performance from the caching utilities and the static folder configuration (WSGI applications are slow at serving static files). I just sent off an email to make sure all the features I am proposing are sane. Can't wait to start hacking on it!
