Sunday, December 21, 2014

Realtime Data Percolation with Elasticsearch, Akka and Java 8

Finally I've got some time to play with Elasticsearch Percolator feature. In a couple of words, it is a very efficient way of evaluating your data against a set of rules. Rules are usually defined by some queries. In classic approach, one would save the data in the database, and then run a batch of queries against it to see which corresponding rules are match. The Elasticsearch Percolator approach is opposite - the queries will be placed in database, and data evaluated against them.

This approach can be beneficial when:
  • you have a large amount of queries;
  • your data does not have a long lifespan (think about application log records for example, they can be safely deleted right after evaluation); 
  • you require fast real-time processing.
The data passed to Elasticsearch percolator will be thrown away. The stream of matched queries returned back (almost) immediately.
The example I wrote for my experiment was heavily based on Andrew Easter's sample he made more than a year ago [1], so I had to alter it a bit to use a new Elasticsearch API. My Scala skills are still weak, so I decided to rewrite a whole thing in Java 8, keep Akka actors intact, and drop the Play framework completely, along with the AngularJS UI. The results you can see or clone from my repository in Github [2].

The design is very simple: a Main class is responsible of starting the embedded Elasticsearch instance, define the Akka actors system, initialize Elasticsearch index with the proper mapping (the latest 1.4.1 version of Elasticsearch requires to have a mapping ready before percolation). The result of initialization Future call is a Stream of tuple objects, each represents a search string and matched data entry. The next step will be a populating Elasticsearch percolator with queries.  Nothing prevents us to add or delete this queries in real-time (Elasticsearch has a RESTful API which I used from Jersey client), but for simplicity all the queries will be defined in advance.

The dummy data supplied by LogEntryProducerActor class. Using built-in Akka scheduler, we can force a periodical log records generated as frequently as we want. The biggest simplification I made is the way how the matched queries are returned back. I've added a BlockingQueue on Worker Actor side, which keeps a matches produced by Percolator. Using Java 8 Streaming API the matched Tuples are directed from Queue back to client and simply printed to stdout.

Basically, I would like to see how the sample works under load, with the different number of queries defined. On my laptop I tried 1 millisecond interval between logs and got a steady 900...1050 records processed per second, with less than 5% average CPU use. After increasing a number of queries to a several thousands I started to get a Jersey-specific errors, caused by the initialization process. Obviously, if we need to initialize a large amount of documents really fast, sending a huge amount of REST calls will not be a good idea. The Elasticsearch java client probably will be the better option.

Overall, I am happy with the Percolation feature of Elasticsearch, it works quite fast and efficient. I haven't tried a clustered approach of Elasticsearch or Akka (the Akka's application.conf file is included, feel free to specify your clustering stuff there). The search query optimization was also omitted, for example one can use a "filter" instead of "query_string" Elasticsearch API for a potentially better performance.

[1] Reactive Real-time Log Search With Play, Akka, AngularJS and Elasticsearch (by Andrew Easter);
[2] My akka-percolator repository in GitHub;

No comments:

Post a Comment