08 10 / 2012

In my previous post I explored how to replace Whoosh's storage engine with Redis. That can give you multiple advantages in hostile environments like Heroku and Django. As I wished last time, I kept looking for solutions and didn't find any project that actually uses Redis data structures to implement inverted indexes or an indexing engine. Then I wondered: why don't I try Lucene, since it's among the state of the art and a pretty mature project? I did come across projects like Solandra, but I was not really happy, since I wanted Redis as the storage engine. I finally decided to open up the Lucene source and see if I could change anything inside some awesome code (sarcastic tone) and make something out of it. I won't advocate a bad design, so just to briefly mention my rant: I was blown away by the complexity of the IndexWriter code (I am a bad programmer, believe me). I later googled about it, and it turns out I was not alone in feeling that part suffers from bad design choices.

However, the good part was the IndexOutput and IndexInput classes, so this time I decided to launch a complete GitHub project and roll it out into the wild. I've also done some heavy data loading and basic testing with sharding. I took the famous PirateBay data dump and indexed it with Lucene.
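To make the idea concrete, here is a minimal sketch of how a Lucene file could be pushed into Redis through an IndexOutput subclass. This is not the project's actual code: it assumes Lucene 3.x and Jedis, the class and key names are made up, the whole file is buffered in memory, and seek() handling is omitted for brevity.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.lucene.store.BufferedIndexOutput;

import redis.clients.jedis.Jedis;

// Hypothetical sketch: buffer a Lucene file and persist it to Redis on close(),
// under a "{directory}:{file}" key. A real implementation also needs seek()
// support and a matching IndexInput that reads the bytes back.
public class RedisIndexOutput extends BufferedIndexOutput {

    private final Jedis jedis;
    private final byte[] key;
    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();

    public RedisIndexOutput(Jedis jedis, String directory, String fileName) {
        this.jedis = jedis;
        this.key = (directory + ":" + fileName).getBytes();
    }

    @Override
    protected void flushBuffer(byte[] b, int offset, int len) {
        bytes.write(b, offset, len); // append whatever Lucene flushed
    }

    @Override
    public long length() {
        return bytes.size();
    }

    @Override
    public void close() throws IOException {
        super.close();                       // flushes the remaining buffer
        jedis.set(key, bytes.toByteArray()); // persist the whole file to Redis
    }
}
```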

I tried it in two variations: in one I just fired up a single Redis instance on localhost; then I took 3 instances and sharded my data over them. The source code used can be found in this Gist. You can switch to any number of shards you want and even try it out yourself.
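For reference, the two setups boil down to roughly the following Jedis code; the hosts and ports here are assumptions for illustration, not the exact ones from my runs.

```java
import java.util.Arrays;
import java.util.List;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisShardInfo;
import redis.clients.jedis.ShardedJedis;

public class ShardSetup {
    public static void main(String[] args) {
        // Variation 1: a single Redis instance on localhost.
        Jedis single = new Jedis("localhost", 6379);
        single.set("myindex:_0.cfs", "...");

        // Variation 2: three instances; Jedis hashes each key to pick a shard.
        List<JedisShardInfo> shards = Arrays.asList(
                new JedisShardInfo("localhost", 6379),
                new JedisShardInfo("localhost", 6380),
                new JedisShardInfo("localhost", 6381));
        ShardedJedis sharded = new ShardedJedis(shards);
        sharded.set("myindex:_0.cfs", "..."); // the key decides which shard gets it

        single.disconnect();
        sharded.disconnect();
    }
}
```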

On a single instance the database dump file was somewhere around 200MB. On the 3-instance shard the sizes were 46MB, 83MB, and 104MB respectively. Do notice that the shards are not getting an equal load; that's because sharding currently happens on file names, and that's a poor criterion for sharding the data itself. In the next iteration I am taking the sharding down to the file's block level. Since Jedis uses hashes (configurable) on the provided key to determine its location among shards, making keys from the template @{directory}:{file}:{block_number} will considerably improve the distribution. I haven't done any serious benchmarks or speed tests yet (I was able to index the whole PirateBay data in 2-3 minutes), but I am optimistic that, once tuned and optimised, it will land somewhere between RAMDirectory and FSDirectory performance.
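A sketch of that block-level scheme (the block size and names below are illustrative assumptions): each fixed-size block gets its own key, so Jedis' key hashing spreads a single file's blocks across all the shards instead of pinning the whole file to one node.

```java
import java.util.Arrays;

import redis.clients.jedis.ShardedJedis;

public class BlockWriter {

    private static final int BLOCK_SIZE = 8 * 1024; // assumed block size

    // Write a file's bytes under "{directory}:{file}:{block_number}" keys.
    public static void writeBlocks(ShardedJedis sharded, String directory,
                                   String file, byte[] data) {
        for (int block = 0; block * BLOCK_SIZE < data.length; block++) {
            int from = block * BLOCK_SIZE;
            int to = Math.min(from + BLOCK_SIZE, data.length);
            byte[] key = (directory + ":" + file + ":" + block).getBytes();
            sharded.set(key, Arrays.copyOfRange(data, from, to));
        }
    }
}
```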

One can easily introduce slave nodes (and Sentinel really soon) to achieve redundancy and fallbacks. This solution is somewhat similar to the MongoDB approach, but without a dedicated mongos instance: the clients themselves are intelligent and handle the sharding.
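As a rough illustration of the redundancy part, attaching a replica to one shard is just Redis' SLAVEOF command, which Jedis exposes; the hosts and ports below are assumptions.

```java
import redis.clients.jedis.Jedis;

public class AttachReplica {
    public static void main(String[] args) {
        // Point a second Redis instance at the shard on 6379 as its master.
        Jedis replica = new Jedis("localhost", 6380);
        replica.slaveof("localhost", 6379);
        replica.disconnect();
    }
}
```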

In conclusion, I would like to invite hackers to fork the GitHub repo, play around, and enjoy. I would be happy to accept pull requests. Happy hacking!

Update: I've just enabled Solr support; check out the GitHub repo for details on how to set it up.
