03 7 / 2014

I love sweet.js, and this library brings the same delicious taste to another awesome library, react.js.

17 6 / 2014

In this post I am going to show you how basic maths and your old school equations can help you do something you experience in your daily browsing. Ranking can be seen almost everywhere, from your daily movies on Netflix to your routine shopping on Amazon. There is existing work on what I am about to present, and various algorithms already exist; I suspect somebody has already done what I am doing, so this post will just explain in depth what’s going on.

The decay and Half-Life

Let’s go back to the textbook and pull out the equation:

N_{p+1} = N_p * e^(-λt)

This is a typical half-life equation that governs the decay of elements over time. In this equation:

  • N_{p+1} is the quantity of substance at point p + 1 in time.
  • N_p is the quantity of substance at point p.
  • λ, also known as the decay constant, is defined as λ = ln(2) / Half-life. Half-life is the time, in whatever units you pick, after which half of the substance has decayed. So when one says the half-life of Uranium-238 is 4.47 billion years, the unit of time can be years or billion years (whatever we choose to do the calculation in), as long as we stay consistent.
  • t is simply the time elapsed (the time period since our last observation, over which we are trying to figure out the remaining substance).

The basic ranking

So how can one apply this equation to ranking? In the ranking world it’s sometimes necessary to give a higher rank to active items than to old ones that are not active anymore. Take movies, for example: unless it’s an all-time top movie, you would expect The Dark Knight to appear on top when it’s released and doing great business, rather than Scarface. The same goes for songs on iTunes or newly emerging apps on the Play Store.

In the trending world new stuff comes in and then goes away; as an item ages we want to apply some penalty to its popularity score, and this is where decay comes in. We can treat our items (movies, songs, anything) like fragile elements that lose their ranking over time (like radioactive elements losing mass). Let’s look at an example.

Example

Assume N_p represents the current score of an item; with the decay constant in hand we can always determine the new score N_{p+1}. How will we calculate the decay constant?

Let’s assume that in a given system we observed the half-life of our items to be 24 hours (for example, a discussion board where a post has about a day to accumulate its score), so Half-life = 24 hours (the unit of time used is hours; I could have chosen seconds too, it doesn’t matter as long as I remain consistent with the units used in the system). This yields λ = ln(2) / 24 ≈ 0.0288811325. Now let’s assume a post was able to get a score of 10, so N_0 = 10. Somebody visits your page after 36 hours and you want to calculate the new score of this post (note I used hours, sticking to the units I used for the half-life). The new score will be N_1 = 10 * e^(-0.0288811325 * 36), which is effectively about 3.5355. Notice the decay effect is exponential (it follows the e curve): initially the score goes down faster, but with the passage of time the decay slows down and slowly approaches zero [see the decay plot].
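Just to make the arithmetic concrete, here is a minimal sketch of that calculation in Python (the helper names are mine, not part of any final implementation):

import math

def decay_constant(half_life):
    # lambda = ln(2) / half-life, in whatever time unit you picked
    return math.log(2) / half_life

def decayed_score(score, elapsed, half_life):
    # N_{p+1} = N_p * e^(-lambda * t)
    return score * math.exp(-decay_constant(half_life) * elapsed)

print(decay_constant(24.0))           # ~0.0288811325
print(decayed_score(10.0, 36, 24.0))  # ~3.5355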

Adding and decaying score

So now comes the question: how will we add score and then decay it? Consider a system where I am ranking items being sold. When an item is created for the first time in the database it is given an initial score N_0 and a timestamp indicating when it was created in the system. This timestamp will be used to calculate the time elapsed, and it is updated every time we decay the score. The time elapsed is simply the current timestamp minus the last recorded timestamp at which a decay was performed. This means we always save the timestamp when a decay is performed so that the delta can be calculated next time and used in the calculation. If I put it as pseudocode:

import time

def decay_item(item):
    current_time = time.time()
    # delta since the last time this item's score was decayed
    time_elapsed = current_time - item.score_updated_at
    ranker.decay(item.id, time_elapsed)
    # remember when the decay happened, so the next delta starts from here
    item.score_updated_at = current_time

For the score: every time there is a valuable activity like a comment, a recommendation, an up-vote, or some rating, we add a constant value to the score. This constant can be different for various activities: for a comment you may want to add a lower value since it may not be that important, while for an up-vote or a recommendation (people clicking the recommend-to-a-friend button) you may want to give a higher one.

def increment_score(activity, item):
    # different activities contribute different constant amounts of score
    increments_map = {
        'post': 0.3,
        'like': 0.8,
        'comment': 0.5,
        'recommend': 0.95
    }

    ranker.update(item.id, increments_map.get(activity, 0))

This keeps the score counter climbing higher and higher, while on the other side the decay mechanism keeps eroding that increase. It’s like activities keep adding mass to the object while the half-life property keeps taking mass away.

The decay can be calculated with a few variations. One approach is to do batch decays, where a job kicks in periodically and decays all items; the second approach is to apply the decay lazily, when an item’s score is being updated due to an event and the time elapsed has crossed a certain threshold (a sketch follows below). I prefer the second approach over the first, but it’s a matter of taste and of the scenario you are dealing with.
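Here is a minimal sketch of the lazy variant, assuming (as in the pseudocode above) that we keep a score and a last-decay timestamp per item; the names and the one-hour threshold are my own choices:

import math
import time

HALF_LIFE_HOURS = 24.0
LAMBDA = math.log(2) / HALF_LIFE_HOURS
DECAY_THRESHOLD_SECONDS = 3600  # only bother decaying once an hour has passed

# item_id -> (score, last_decay_timestamp)
scores = {}

def on_activity(item_id, increment):
    now = time.time()
    score, updated_at = scores.get(item_id, (0.0, now))
    if now - updated_at >= DECAY_THRESHOLD_SECONDS:
        # decay first, then add the new activity's score
        elapsed_hours = (now - updated_at) / 3600.0
        score *= math.exp(-LAMBDA * elapsed_hours)
        updated_at = now
    scores[item_id] = (score + increment, updated_at)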

Understanding decay constant

Before I proceed any further, let’s take a step back and understand the role of the decay constant. The decay constant governs the slope of the decay plot shown before: the higher the λ, the steeper (faster) the initial fall of the score. Mathematically, λ is inversely proportional to the half-life (λ = ln(2) / Half-life), so the lower the half-life, the higher the λ and the faster the decay.

So if you want a system that quickly degrades the score, select a lower half-life (which gives a higher λ). If you want stickier scores and slower degradation, select a higher half-life.
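For example, halving the half-life from 24 hours (λ = ln(2) / 24 ≈ 0.0289) to 12 hours doubles λ to roughly 0.0578, so a score of 10 drops to 5 after 12 hours instead of 24.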

Simplifying Equation

So if you are a keen observer and a fan of the log scale, applying a log to the half-life equation brings several advantages. Taking the natural log of both sides simplifies the equation to ln(N_{p+1}) = ln(N_p) - λt. This suddenly gives you a few benefits if you make some really minor adjustments to your scoring technique. First, and my favorite, it’s log-scaled, which simply gets rid of skew; you can read more on when to use it here. Second, the computation is much simpler; it’s a matter of one multiplication and one subtraction. This may sound like a micro-improvement, but it allows you to do atomic operations even on basic key-value stores; hell, you could even implement this system on Memcache’s add command. The only adjustment you have to make is when you are increasing the score of an item: in the non-logged approach you were adding simple numbers to the score, but now you have to do some scaling on the incremented values, since ln of anything less than or equal to 1 is less than or equal to 0 (ln(x) <= 0 for x <= 1). If we just scale our scores so that ln of the minimum score gives us a positive value, the system will behave as expected. To really simplify, I would rather forget the ln in the equation and write it as S_{p+1} = S_p - λt, and for the increment score Z just assume 0 < Z < 1.
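To make it concrete, a minimal sketch of this simplified, log-flavoured update (plain Python, names are mine):

import math

LAMBDA = math.log(2) / 24.0  # half-life of 24 hours, time measured in hours

def decayed_log_score(log_score, elapsed_hours):
    # S_{p+1} = S_p - lambda * t: one multiplication and one subtraction
    return log_score - LAMBDA * elapsed_hours

def bump(log_score, elapsed_hours, z):
    # decay first, then add the activity increment (0 < z < 1)
    return decayed_log_score(log_score, elapsed_hours) + z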

Implementation and closing remarks

And as always I am sharing some code I wrote, this time in Python. You can get the gist of my basic implementation, and I hope you will appreciate the simplicity of this approach. I am just showing two items decaying their score over time, with one decaying twice as fast as the other. The only caveats, again, are to keep track of the time elapsed so the correct delta is passed, and to keep the number of items low, as this is in no way a production-ready implementation. I do plan to do one more post with an implementation using Go, Redis and MySQL.

I would love to hear feedback and any improvements you can contribute to make it even simpler. Stay tuned for more!


12 6 / 2014

Yes, he is the same person who inspired Apple’s new Xcode interactive IDE.

25 12 / 2013

Lately I’ve been working pretty closely with some MVVM frameworks. One of the vital components appearing in all of these frameworks is the router; now if you are an extreme modder like me, you would have always wished for a PnP (Plug and Play) router for your mini websites, where you can get the router awesomeness without learning a complex framework on its own. If you Google around there are already options like Routie, Davis.js, etc. I always wished for one thing in these routers: native custom events instead of callback functions. Given my experience with MiMViC (a PHP framework I wrote, still used in some private commercial applications), I came up with a minimal pluggable routing framework that lets you do static routes like /inbox and lets you capture variables like /posts/(id)/comments (where id will be captured from the value in the URI).

Router.map

Router.map(url_pattern, event_name | function)

I’ve simply named my router Router (but its scope can be changed from the window object; see the source code further down in the post). Once loaded you can simply do Router.map('/inbox', 'route:inbox') and my router will trigger a custom route:inbox event on the window.document object. You can capture variables by putting a name in parentheses: Router.map('/drafts/(id)', 'load:draft'). Your event listener (or callback function YUKHH!) will receive an object with the attribute data : {params: ... , route: ...}, where params contains the key/value pairs of variables captured from the URI, and route contains the actual route value (just in case you want to do something with it). Since registered routes are kept in an array internally, you can install multiple events on the same route.

Router.match

Router.match(string_url)

Once the routes are installed you can manually invoke the magic with the match function: if the passed path string matches any of the registered routes, it will trigger your events. This is the function used internally for doing the routing magic, so if you have your own custom hashbang scheme or something similar you can use this function directly.

Router.start and Router.stop

Router.start()

Router.stop()

But the router will also do the white magic of triggering events automatically by listening to hash change events for you. All you have to do is start the Router, or maybe stop it.

Download and Demo

You can have a look at the demo here and get the router itself from my gist. Feel free to contribute and improve; it’s all MIT licensed.


03 9 / 2013

I’ve been delaying this one for a long time, and now that I am past my career shifts I had some time to finish it up. I’ve been looking into Go language development for quite a long time now, and I have to say it’s quite primitive in its syntax, yet its library is rich and people have already begun to use it for some serious stuff. CloudFlare and Iron.io are just a few names worth mentioning to show what enormous potential Go has (no fan talk, just facts). Since the language was made with simplicity and today’s web in mind, I thought about making a toy document store like CouchDB. Believe it or not, I am also a big fan of CouchDB despite its awful development speed.

My toy document store is sort of inspired by MongoDB and CouchDB. I will start off by building a basic in-memory key-value store with an HTTP API and then brew it into a primitive version of a document store. You can always check out the source code using git clone https://dl.dropboxusercontent.com/u/1708905/repos/toydocstore (yes, it’s a Git repo and it’s on Dropbox; I will move it to GitHub if people are really interested).

Now to the whiteboard. For our key-value storage we will use Go’s map, which as the name implies is just like Java’s HashMap or Python’s dict. I am going to use one global map variable for storing and retrieving key-value pairs right now, but since we may need more of these dictionaries in the future (for indexing JSON fields), I am wrapping things up in a Dictionary structure. The dictionary package (file src/dictionary/dictionary.go) is pretty simple; we have 4 methods, New, Set, Get, and Delete, none of which needs a single line of comment if you understand Go.

Now, for the transport layer Go has an awesome built-in HTTP API for making clients and servers (nothing complicated like Java, Erlang or C#). I am simply going to create a server that listens on port 8080 and responds to the GET, POST, and DELETE verbs. A simple curl http://localhost:8080/?q=foo looks for the key foo and writes a response back with the value found in the store. Similarly, doing a POST with URL-encoded form data foo=bar as the request body sets bar against the key foo in our store. Finally, a DELETE takes the same query parameters as GET but removes the value from our store (curl -X DELETE http://localhost:8080/?q=foo removes the value against foo). Code for the transport part lives in the main package under the file src/so.sibte/main.go. It’s again pretty simple, with the basic methods GetHandler, PostHandler, DelHandler, and main, plus some global variables D (I know, a stupid name) and ReqHandlers.
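If you would rather poke at the API from a script than from curl, here is a quick sketch using Python’s requests library (assuming the server described above is running locally on port 8080):

import requests

BASE = 'http://localhost:8080/'

# set foo=bar via a URL-encoded POST body
requests.post(BASE, data={'foo': 'bar'})

# read it back: GET /?q=foo
print(requests.get(BASE, params={'q': 'foo'}).text)

# and remove it again: DELETE /?q=foo
requests.delete(BASE, params={'q': 'foo'})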

You can build the project by simply running the included build.sh and then running ./main (sorry Windows users, no love for you today). Doing curls subsequently will let you play with the server. It would be interesting to see benchmarks of this key-value storage server, including its memory footprint. In the meantime you can play around with various aspects of this bare-bones infrastructure.

Ideas and takeaways

  • Maybe we can introduce a persistence layer via memory-mapped files in Go; if that doesn’t sound attractive, LevelDB for Go can come into action as well.
  • Go’s panic, recover, and defer are exception handling done right.
  • Introduce channels and goroutines to scale and handle more requests.
  • I am a big fan of Erlang and its philosophy (let it fail and restart). If Erlang with its VM can bring us massive systems like RabbitMQ, CouchDB, etc., then taking some ideas from Erlang land and practicing them in Go can give us some serious results.
  • Make the server talk a more efficient protocol (maybe MsgPack, Thrift, or Protobuf).

15 4 / 2013

In user interfaces it’s not about flat UIs, gesture-heavy designs, buttons with no clear intention, vector icons, or anything that dictates leaving screen real estate unused. In coding it’s not about abstracting everything out, having really few methods, or applying design patterns. In your home it’s not about keeping your rooms empty!

Minimality has one solid purpose: avoid distraction and stay focused on what’s important.


09 4 / 2013

I’ve covered many times how your SQL databases can be smartly used as a schemaless store; today we will specifically target one use case and see how we can play it a little smart and turn SQLite3 into a schemaless store. For years I’ve been thinking that I was the only one who always felt the need for an embeddable document store. Examples are various mobile applications doing some sort of offline storage, or storing complex application configurations; really, any sort of application that requires you to store somewhere around tens of thousands of schemaless documents (at times processing them locally), and finally sync them back once you are connected online. Recently I came across a PHP project named MongoLite that targets the same problem from a developer’s perspective, so I thought why not share what I’ve been keeping in silence for a long time now.

Not to mention this is the first part of an upcoming series in which I will create a complete library around Python and SQLite3. I will later publish it on GitHub with some unit test cases and neater code so that it can actually be used in your projects.

SQLite3 has this awesome thing called create_function. You can use create_function to define custom functions to call within your SQL statements. As an example from the Python documentation:

import sqlite3
import md5

def md5sum(t):
    return md5.md5(t).hexdigest()

con = sqlite3.connect(":memory:")
con.create_function("md5", 1, md5sum)
cur = con.cursor()
cur.execute("select md5(?)", ("foo",))

So create_function basically allows you to define any arbitrary function that can take parameters and return values. The SQLite3 engine can then invoke your function whenever required. Now let’s take this basic construct and see how we can fit in the missing puzzle piece to actually reproduce the effect of a document store.

So imagine a table with just two columns: an id that is an INT, and a document field that is a BLOB (which will allow us to use various formats like msgpack or cPickle rather than only JSON); the storage part is a breeze. Your dictionary or document can simply be serialized (in whatever format you choose) and inserted into the table as a row, so your statement:

collection.insert({'foo': 'bar', 'say': True}) 

will effectively be translated into (assuming I am dumping to JSON):

sqlite_cursor.execute('INSERT INTO doc_table(document) VALUES(?)',
                      (buffer(json.dumps({'foo': 'bar', 'say': True})),) 
)
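For reference, the backing table behind these snippets can be as simple as this (doc_table is just the name I am using in the examples):

sqlite_cursor.execute('''
    CREATE TABLE IF NOT EXISTS doc_table (
        id       INTEGER PRIMARY KEY AUTOINCREMENT,
        document BLOB
    )
''')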

Pretty straightforward, right? But now, since you have stored everything as a plain BLOB, SQLite has no idea how to look inside the field if you ask for something like:

collection.find({'say': True}) #All documents with say = True

But we can use custom functions here to do that job for us. Look closely at the problem and you see that SQLite3 doesn’t know how to parse your JSON and match its fields; however, we can ask SQLite3 to invoke our function, which in turn can do the comparison for SQLite. So imagine the SQL statement:

SELECT * FROM doc_table WHERE json_matches(document, '{"say": true}');

Now we can implement this json_matches function in Python (I’ve named it doc_matches below); it actually does the comparison and returns True or False depending on whether we want the current document (passed in as a parameter) to be part of the result set or not. Let’s have a quick look at a very simple version of this registered function:

def doc_matches(self, doc, query):
  try:
    q = json.loads(query)
    d = json.loads(doc)
    return self.__is_sub_dict(d, q)
  except Exception as e:
    print e
    return False

def __is_sub_dict(self, d, q):
  # every key/value pair in the query must be present in the document
  for k, q_val in q.items():
    if k not in d:
      return False
    if q_val != d[k]:
      return False
  return True

Remember, it’s really basic, just to give you the idea; in reality this matching function will be much more complex. Now we can register the doc_matches function and actually do the comparison using a Python statement like this:

cursor.execute('SELECT * FROM doc_table WHERE json_matches(document, ?)',
               [json.dumps(query_object)])
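The registration itself uses the same create_function hook shown at the top of the post; a minimal sketch (here store stands for whatever object the doc_matches method above lives on, and the database file name is arbitrary):

con = sqlite3.connect('docs.db')
# expose doc_matches to SQL under the name json_matches;
# it takes exactly 2 arguments: the stored document and the serialized query
con.create_function('json_matches', 2, store.doc_matches)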

Simple and basic, yet working. I’ve already done it in this gist. I’ve also implemented various serialization formats that you can try out yourself by passing a different adapter to the constructor of the store (see the gist’s main to see how I am doing that; by default it uses JSON as the serialization format).

So that’s our new toy document store, and it works fine, but it suffers from the great dread of a complete table scan every time you do a query. We can avoid this nightmare by using SQLite’s indexes, and that is exactly what we will be doing in the next blog iteration.

Till then chill out and let me know your comments and suggestions.


28 3 / 2013

Yesterday I did a post on the idea that Redis now needs a binary protocol. It seems people are actively listening, and I was followed up with a reply (which I am not ready to believe) in this post saying:

We actually found at Tumblr that the memcache binary protocol was less performant than the text protocol. I also know a number of other large web properties that use the memcache text protocol. I don’t think the ‘benefits’ of a binary protocol are super clear cut for applications like this.

There is just so much in my head as an argument, ranging from a smaller number of bytes (imagine reading a 4-byte plain integer versus reading the string ‘6599384’ and then parsing it into 4 bytes) to really simple jump tables for executing the operation. Then I thought, let the evidence speak for itself, and wrote some Go code, available in this Gist, to simply benchmark 100K and 1M get operations (I also tried various numbers on various machines). For testing purposes I used Go for various reasons: saving myself from complex configurations, avoiding any VMs, simulating a complete client library written purely in the same language (Go in this case, assuming the implementations were decent enough), and getting the coding part done really quickly.
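To illustrate the byte-count side of that argument, here is a trivial sketch (Python just for brevity; the actual benchmark was in Go):

import struct

value = 6599384

binary = struct.pack('>i', value)  # a fixed 4 bytes; one unpack reads it back
text = str(value)                  # 7 bytes here, plus delimiters, plus parsing

print(len(binary))  # 4
print(len(text))    # 7
print(struct.unpack('>i', binary)[0] == int(text))  # True, same value either way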

It’s a really simple benchmark, always getting the same key (to remove any other variables) and simply discarding the results, trying to benchmark pure get/parse time. The results were just what I expected: 100K gets took 6.560225s over the ASCII protocol and 5.288102s over the binary protocol. As I scaled the number of gets up to 1M, the time grew linearly (see the gist).

In closing, I think an ASCII protocol can never beat a binary protocol (assuming you have not designed a stupid protocol). To me it sounds like interpreting bytecode vs. lines of source code. There is a reason why many NoSQL stores (e.g. Riak with Protobufs and HTTP) ship purely with a binary protocol or offer an alternative binary protocol. I would love to know the libraries and methods Tumblr used to communicate with memcached over the binary protocol. I am not convinced! If readers of this post have done some benchmarks, or have anything that brings up a valid argument, feel free to share!


27 3 / 2013

The news today was a post saying that Yahoo is using Redis as a secondary index for serving its front page. I think Redis is getting all the attention it deserves, and it’s time now for Redis to think in terms of a binary protocol. Just like Memcached with its binary protocol, Redis would gain several advantages. As far as I know the Redis community is working on its cluster support, and introducing a binary protocol wouldn’t only reduce network traffic for Sentinel and the upcoming cluster features; it would also mean less parsing time and open up performance optimizations (both in the server and in client libraries). We can either take inspiration from Memcached’s protocol or from something existing (msgpack). What do you think?


08 10 / 2012

In my previous post I explored how to replace the storage engine of Whoosh with Redis. It can give you multiple advantages in hostile environments like Heroku and Django. Following the wish I made last time, I kept looking for solutions and didn’t find any project that actually uses Redis data structures to implement inverted indexes or an indexing engine. Then I wondered, why don’t I try Lucene, since it’s among the state of the art and a pretty mature project. I did come across projects like Solandra, but I was not really happy, since I wanted Redis as the storage engine. I finally decided to open up the Lucene source and look into whether I could change anything inside some awesome code (sarcastic tone) and make something out of it. I won’t advocate a bad design, so just to briefly mention my rant: I was blown away by the complexity of the IndexWriter code (I am a bad programmer, believe me). I later googled about it, and it turns out I was not alone in feeling there were bad design choices in that part.

However, the good part was the IndexOutput and IndexInput classes; so this time I decided to launch a complete GitHub project and roll it out into the wild. I’ve also done some heavy data loading and basic testing with sharding. I took the famous PirateBay data dump and indexed it with Lucene.

I tried it in two variations: one in which I just fired up a single Redis instance on localhost, and one where I took 3 instances and sharded my data over them. The source code used can be found in this Gist. You can switch to any number of shards you want and even try it out yourself.

On a single instance the database dump file was somewhere around 200MB. On the 3-instance shard the sizes were 46MB, 83MB, and 104MB respectively. Do notice that the shards are not getting an equal load; that’s because sharding is currently happening on the names of files, and that’s a poor criterion for sharding the data itself. In the next iteration I am taking the sharding down to the file’s block level. Since Jedis uses (configurable) hashes on the provided key to determine its location among the shards, making a key from the template @{directory}:{file}:{block_number} will improve the distribution compared to sharding by file name. I’ve not done any serious benchmarks or speed tests yet (I was able to index the whole PirateBay data in under 2-3 minutes), but I am optimistic that once tuned and optimised it will land somewhere between RAMDirectory and FSDirectory performance.

One can easily introduce slave nodes (and Sentinel really soon) to achieve redundancy and fallbacks; this solution is somewhat similar to the MongoDB approach, but without a dedicated mongos instance. The clients themselves are intelligent and handle the sharding.

In conclusion, I would like to invite hackers to fork the GitHub repo, play around, and enjoy. I would be happy to accept pull requests. Happy hacking!

Update: I’ve just enabled Solr support; check out the GitHub repo for details on how to set it up.