11 12 / 2011

Yes I know you are really happy with your “persistent” Key Value store. But did anybody notice hstore that comes along Postgresql. I find Postgresql to be a really great RDBMS that has been ignored all the time. It even has some great publisher/subscriber system as well (or LISTEN/NOTIFY in terms of Postgresql) that a lot of people may have implement using Redis, RabbitMQ etc. For people who have not lived anything other than MySQL. I would simply ask them to try out Postgres.

Instead of looking at benchmarks, I will be focusing on a key value store that is ACID compliant for real! Postgres takes advantage of its storage engine and has an extension on top for key value storage. So plan is to have a table can have a column that has a datatype of hstore; which in turn has a structure free storage. Thinking of this model multiple analogies throw themselves in. It can be a Column Family Store just like Cassandra where row key can be PK of the table, and each column of hstore type in table can be imagined like a super column, and each key in the hstore entry can be a column name. Similarly you can imagine it some what like Hash structures in Redis (HSET, HDEL), or 2 or 3 level MongoDB store (few modifications required). Despite being similar (when little tricks are applied) to your NoSQL store structures, this gives me an opportunity to demonstrate you some really trivial examples.

Lets setup our system first. For my experiment I will be using Postgres 9.1 and I will compile it from source. Once in source directory you can: ./configure && make install to install your Postgres. Don’t forget to install the extensions in the contrib directory: cd ./contrib && make install. Once you have setup the database you can create your own database and start the server (Hints: use initdb and pg_ctl). Then launch your psql and make sure you install your hstore extension:

CREATE EXTENSION hstore;
SELECT 'foo=>bar'::hstore;

If everything goes well you should be able to see table output. Now we are ready to do some DDL. I created a table my_store as schema definition below:

CREATE TABLE my_store
(
id character varying(1024) NOT NULL,
doc hstore,
CONSTRAINT my_store_pkey PRIMARY KEY (id)
)
WITH (
OIDS=FALSE
);

CREATE INDEX my_store_doc_idx_gist
ON my_store
USING gist
(doc);

As you can see I’ve created a table with hstore column type and one GiST index (for operators ? ?& ?| etc.). You can checkout of documentation to have a look on different type of operators you have.

Now that we have database and tables setup I wrote a simple script to populate it with about 115K rows from twitter stream. Now keep in mind that its a real life data and I was interested in querying few basic things from collected data. For example, how many people are putting hash tags, or doing mentions, or were posting links in the tweets? For doing this I wrote a simple python script using tweepy and psycopg2 and ran it for about few hours. For each tweet in my store I added a key value pair of ‘has_hashtags=>:t' if there were any hash tags in the tweet, similarly I introduced has_urls and has_mentions if they were present in tweet, I will be using these keys along with my GiST index to query my table later on.

So after populating my data with 115,142 tweets the database grew to a size of 239691780 bytes (Just 228MB). Now comes the fun part. I was totally blown away by what I can achieve by combining the power of relational and key value style under 1 store. So for example I want to  query all the tweets tweeted at unix timestamp of 1323446095 (since I stored the timestamps as a string here is what my query looks like):

SELECT doc -> ‘text’ as tweet, doc -> ‘created_at’ as created_at
FROM my_store
WHERE doc @> ‘created_at=>00001323446095’;

I can add simple count or any other SQL famous aggregate function without going into any complications of my data store specific map reduce or new language to learn hustle. Do note that I padded my timestamp value with zeros since I am only storing strings as values. Also I am utilizing @> operator, thats gonna use the GiST to really do a quick bitmap index scan instead of sequential scan. That was pretty good for starter. Lets try to fetch out all the tweets that had hash tags in them:

SELECT doc -> ‘text’ as tweet, doc -> ‘created_at’ as created_at
FROM my_store
WHERE doc @> ‘has_hashtags=>:t’;

Yes querying complete database pulling out complete data (That you won’t probably do because you page the data :) ) gives me 14689 rows just under 360ms on average. Since we have SQL at hand lets make a condition little more complicated, and use a different operator for same stuff and also sort the data by created_at:

SELECT doc -> ‘text’ as tweet, doc -> ‘created_at’ as created_at
FROM my_store
WHERE doc @> ‘has_hashtags=>:t’ AND doc ? ‘has_urls’
ORDER BY doc -> ‘created_at’ DESC;

It already sounds tasty! This is not it Postgres has more operators, so pulling out hash tagged tweets with urls or mentions is also possible,

SELECT doc -> ‘text’ as tweet, doc -> ‘created_at’ as created_at
FROM my_store
WHERE doc @> ‘has_hashtags=>:t’ AND doc ?| ARRAY[‘has_urls’, ‘has_mentions’]

This is not it! hstore comes with all sort of operators and index systems that you can ask for hash store. Check them out here. Now, despite the NoSQL boom I think we have some great examples and reasons of why RDBMS still remains core part of many market giants (Facebook being something everyone knows). Postgres just gives me one more reason to not ignore RDBMS systems, So If you have been moving around on some document stores just because the reason that RDBMS don’t provide them; think again! You can get the same rock solid durability with structure free systems.

I will be pretty soon revisiting the FriendFeed use case with MySQL to store structure free data with Postgresql approach. Stay tuned, leave your comments and thoughts.

Update 29th Sept 2012: I’ve visited the FriendFeed casestudy in my new blog post.


  1. pure-blogging reblogged this from maxpert
  2. barchuleta reblogged this from maxpert and added:
    Hmm I should try to install Postgresql on my Mac …….
  3. elyn-harrison reblogged this from maxpert
  4. carry-mitchell reblogged this from maxpert
  5. fokakier21 reblogged this from maxpert
  6. farmakeio21 reblogged this from maxpert
  7. dirpros92 reblogged this from maxpert
  8. prosdir23 reblogged this from maxpert
  9. cosminpitea reblogged this from maxpert
  10. website-designers-brighton reblogged this from maxpert
  11. zzarbi reblogged this from maxpert
  12. maxpert posted this