April 9, 2013

I’ve covered many times how SQL databases can be used smartly as schemaless stores; today we will target a specific use case and see how, with a little cleverness, we can turn SQLite3 into a schemaless document store. For years I thought I was the only one who felt the need for an embeddable document store. Typical examples are mobile applications doing some sort of offline storage, or applications storing complex configurations: any sort of application that requires you to store somewhere around tens of thousands of schemaless documents, at times process them locally, and finally sync them back once you are connected online. Recently I came across a PHP project named MongoLite that targets the same problem from a developer’s perspective, so I thought, why not share what I’ve been keeping quiet about for a long time now.

Not to mention this is the first part of an upcoming series that I will use to build a complete library around Python and SQLite3. I will later publish it on GitHub with some unit test cases and neater code, so that you guys can actually use it in your projects.

SQLite3 has this awesome feature called create_function. You can use create_function to define custom functions that can be invoked from within your SQL statements. Here is an example adapted from the Python documentation:

import hashlib
import sqlite3

def md5sum(t):
    # Hash the incoming value and return its hex digest as a string
    return hashlib.md5(t).hexdigest()

con = sqlite3.connect(":memory:")
# Expose md5sum to SQL as a one-argument function named md5
con.create_function("md5", 1, md5sum)
cur = con.cursor()
cur.execute("select md5(?)", (b"foo",))

So create_function basically lets you register an arbitrary Python function that takes parameters and returns a value, and the SQLite3 engine will invoke it whenever your SQL statement requires. Now let’s take this basic construct and see how we can fit in the missing puzzle piece to actually reproduce the effect of a document store.

So imagine a table with just two columns: an id column that is an INTEGER, and a document column that is a BLOB (which will allow us to use various formats like msgpack or cPickle rather than only JSON). The storage part is a breeze: your dictionary or document is simply serialized (in whatever format you choose) and inserted into the table as a row, so your statement:

collection.insert({'foo': 'bar', 'say': True}) 

will effectively translate into (assuming I am dumping to JSON):

sqlite_cursor.execute('INSERT INTO doc_table(document) VALUES(?)',
                      (buffer(json.dumps({'foo': 'bar', 'say': True})),) 
)
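
In case you want to follow along, the backing table needs nothing fancy. Here is a minimal sketch, assuming the doc_table name used in the statements here; the exact schema is up to you:

import sqlite3

connection = sqlite3.connect("documents.db")
# Two columns are enough: an integer key and an opaque blob that holds
# the serialized document in whatever format you picked
connection.execute("""
    CREATE TABLE IF NOT EXISTS doc_table (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        document BLOB
    )
""")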

Pretty straightforward, right? But since you have stored everything as a plain BLOB, SQLite has no idea how to look inside that field if you ask for something like:

collection.find({'say': True}) #All documents with say = True

But we can use custom functions here to do that job for us. Look closely at the problem and you will see that SQLite3 doesn’t know how to parse your JSON and match its fields, but we can ask SQLite3 to invoke our own function, which in turn can do the comparison for SQLite. So imagine an SQL statement like:

SELECT * FROM doc_table WHERE json_matches(document, '{"say": true}');

Now we can implement the matching function (doc_matches below) in Python; it does the comparison and returns True or False according to whether we want the current document (passed in as a parameter) to be part of the result set. Let’s have a quick look at a very simple version of this registered function:

def doc_matches(self, doc, query):
    # Called by SQLite for every row: doc is the stored blob, query is the
    # serialized query object passed to the SQL statement
    try:
        q = json.loads(query)
        d = json.loads(doc)
        return self.__is_sub_dict(d, q)
    except Exception as e:
        print(e)
        return False

def __is_sub_dict(self, d, q):
    # A document matches when every key/value pair in the query is
    # present in the document with an equal value
    for k, q_val in q.items():
        if k not in d:
            return False
        if q_val != d[k]:
            return False
    return True

Remember, this is really basic, just to give you the idea; in reality the matching function will be much more complex. Now we can register doc_matches and actually run the comparison with a Python statement like this:

cursor.execute('SELECT * FROM doc_table WHERE json_matches(document, ?)', 
               [json.dumps(query_object)]) 
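
The statement above assumes json_matches has already been registered on the connection, and that is again just create_function at work. Here is a minimal sketch of the wiring, with the matcher rewritten as a plain function (no self) so it can be registered directly; the find helper name is mine, not part of any fixed API:

import json
import sqlite3

connection = sqlite3.connect("documents.db")

def json_matches(doc, query):
    # Same subset check as __is_sub_dict above, written as a plain
    # function so it can be handed straight to create_function
    try:
        d = json.loads(bytes(doc).decode("utf-8"))
        q = json.loads(query)
        return all(k in d and d[k] == v for k, v in q.items())
    except Exception:
        return False

# Two arguments: the stored document blob and the serialized query
connection.create_function("json_matches", 2, json_matches)

def find(query_object):
    cursor = connection.execute(
        "SELECT document FROM doc_table WHERE json_matches(document, ?)",
        [json.dumps(query_object)])
    # Deserialize every matching blob back into a dictionary
    return [json.loads(bytes(row[0]).decode("utf-8")) for row in cursor]

With that in place, find({'say': True}) hands you back plain dictionaries instead of raw rows.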

Simple and basic, yet working. I’ve already done it in this gist. I’ve also implemented various serialization formats that you can try out yourself by passing a different adapter to the constructor of the store (see the gist’s main to see how I am doing that; by default it uses JSON as the serialization format).
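
The adapter idea itself is simple: the store only needs a pair of dumps/loads callables, so anything with that shape can be dropped in. A rough illustration of what such adapters might look like (the class names here are mine, not necessarily what the gist uses):

import json
import pickle

class JsonAdapter(object):
    # Store documents as UTF-8 encoded JSON blobs (the default)
    def dumps(self, doc):
        return json.dumps(doc).encode("utf-8")

    def loads(self, blob):
        return json.loads(bytes(blob).decode("utf-8"))

class PickleAdapter(object):
    # Store documents as pickled blobs instead of JSON
    def dumps(self, doc):
        return pickle.dumps(doc)

    def loads(self, blob):
        return pickle.loads(bytes(blob))

The store then calls adapter.dumps before inserting and adapter.loads when turning rows back into dictionaries, regardless of the format behind it.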

So that’s our new toy document store, and it works fine, but it suffers from the great dread of a complete table scan every time you run a query. We can avoid this nightmare by using SQLite indexes, and that is exactly what we will do in the next iteration of this series.

Till then chill out and let me know your comments and suggestions.

