Wednesday, December 31, 2014

Representing Your Local Maven Repository Structure in Neo4J Graph Database

This post shows a way to use the Neo4J graph database to store metadata from your local Maven repository.

If you use Maven to manage your project dependencies, then after a while you might end up with a quite sizable local Maven cache. Maven keeps there copies of all jars downloaded from remote repositories (or installed by your local builds). Usually it is located in your home folder, under the hidden .m2 directory. For example, my local Maven repository is ~8GB, and honestly I don't remember the last time I cleaned it up.

Maven-based projects are configured to get the external artifacts (jar, zip, war files, etc.) necessary for a build by looking at the set of <dependency> nodes in the project's pom.xml file. This means you can find out which artifacts in your local Maven repository are still in use and which ones are potential waste by simply examining all your pom.xml files. Note that I am talking here about direct dependencies only, not the dependency-on-dependency chains, which Maven is also capable of handling. If we want to investigate a whole chain, we need to recursively visit each of the pom files and its dependencies.

We all know that local storage is not very expensive these days, so we will not gain much by keeping an eye on how Maven uses our local artifacts and periodically wiping out the old ones. It is always cheaper to destroy the local cache completely and let Maven download everything it needs automatically during the next build. However, having an automated way of resolving relations between your active projects and your local Maven cache might be interesting from several other points of view. For example, a software audit (do any of your commercial projects use non-licensed code?), or some code analysis (which versions of a particular dependency are used?). Maven's own Dependency plugin has some of this functionality: it can build rudimentary trees and provide the dependency information for one particular project or a group of related projects. But I can imagine a situation when you get a vulnerability report about one specific version of a dependency, and you would like to find out quickly where exactly it is used. This can be especially helpful if your organization uses a centralized Maven repository manager like Artifactory or Sonatype Nexus, and you can quickly poke into its local repository cache.

There might be other reasons why I came up with this crazy idea of keeping Maven repository information in a graph database, so to stop the speculation I will only add that Neo4J is a great tool, and the task itself is very similar to the "Digital Asset Management" use case. Enough said.

A couple of words about the implementation. Let's assume that you have a Neo4J database up and running somewhere, and you know the URL of its REST endpoint. On my machine it is

GitHub repository with source code:

The application itself is a command-line tool, which first scans your local Maven repository for *.pom files, excludes the SNAPSHOTs, and then processes the files one by one. We are interested only in a subsection of each pom file, in particular the <artifactId>, <groupId> and <version> nodes. Another point of interest is the <dependencies> node along with all its child nodes. These chunks of XML get unmarshalled into Java objects, see the corresponding source files under the com.rokhmanov.graph.sample.entity package. The last step is the graph creation - the Neo4J RESTful API gets called by a Jersey Java client. Nothing really complex. This is the command line syntax:

usage: repo-graph-maven.xxxx.jar [OPTION]...
    <directory> - (mandatory parameter) path to the local Maven repository.
    <serverURL> - (mandatory parameter) Neo4j REST server root URL.
    'clear' - (optional parameter) The existing database will be recreated if specified.
Example: java -jar repo-graph-maven.jar ~/.m2/repository <serverURL> clear
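
The scanning and parsing steps described above can be sketched roughly as follows. This is only an illustration, not the code from the repository: the class name PomSketch and all of its methods are made up for this example, and it uses the JDK's built-in DOM parser instead of unmarshalling into entity classes the way the real tool does.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration; not part of the original tool.
class PomSketch {

    /** True for file names the scan would keep: *.pom files that are not SNAPSHOTs. */
    static boolean isReleasePom(String fileName) {
        return fileName.endsWith(".pom") && !fileName.contains("SNAPSHOT");
    }

    /** Direct child elements of parent with the given tag name. */
    static List<Element> children(Element parent, String name) {
        List<Element> found = new ArrayList<>();
        NodeList nodes = parent.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            Node n = nodes.item(i);
            if (n instanceof Element && name.equals(n.getNodeName())) {
                found.add((Element) n);
            }
        }
        return found;
    }

    /** "groupId:artifactId:version" of a <project> or <dependency> element. */
    static String coordinates(Element e) {
        List<String> parts = new ArrayList<>();
        for (String tag : new String[] {"groupId", "artifactId", "version"}) {
            List<Element> match = children(e, tag);
            // "?" marks a value the pom leaves out (for example, inherited from <parent>).
            parts.add(match.isEmpty() ? "?" : match.get(0).getTextContent().trim());
        }
        return String.join(":", parts);
    }

    /** Project coordinates first, then the coordinates of each direct dependency. */
    static List<String> parse(String pomXml) {
        try {
            Element project = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(pomXml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            List<String> result = new ArrayList<>();
            result.add(coordinates(project));
            // Only <dependencies> directly under <project>, so <dependencyManagement> is skipped.
            for (Element deps : children(project, "dependencies")) {
                for (Element dep : children(deps, "dependency")) {
                    result.add(coordinates(dep));
                }
            }
            return result;
        } catch (Exception ex) {
            throw new IllegalStateException("cannot parse pom", ex);
        }
    }

    public static void main(String[] args) {
        String pom = "<project><groupId>com.example</groupId><artifactId>app</artifactId>"
                + "<version>1.0</version><dependencies><dependency><groupId>commons-io</groupId>"
                + "<artifactId>commons-io</artifactId><version>2.4</version></dependency>"
                + "</dependencies></project>";
        System.out.println(parse(pom));
    }
}
```

The first element of the returned list would become a Project node and the remaining ones Artifact nodes. A real scan would walk the repository directory and pass each file that satisfies isReleasePom to the parser.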

Structure of the Neo4J database: a single Root node "keeps" several links to Project nodes, each of which "has" links to Artifact nodes. So the graph schema is also very simple: "Root", "Project" and "Artifact" are nodes, and "keeps" and "has" are edges (relationships).


Note that some Projects might use the same Artifacts, for example the same version of the JUnit library. The application handles this scenario, and the graph includes cycles, like the one on the screenshot below:

If you have a large repository, the initial run might take several minutes. Each subsequent run adds only new projects and artifacts, so it will be shorter. Keep in mind that if you supply the "clear" parameter to the application, the whole Neo4J graph will be repopulated from scratch.

After successful execution, we can finally play with the graph. Below are sample Cypher queries along with their results:

1. What projects use the "commons-io" artifact?

MATCH p=(b:Component)-->(c:Artifact)
WHERE c.artifactId='commons-io'
RETURN b.artifactId, b.version, b.groupId


2.  What versions of "commons-io" artifact are used overall?

MATCH (n:Artifact)
WHERE n.artifactId =~ '.*commons-io.*'
RETURN n.artifactId, n.groupId, n.version


3. What are the 10 most used artifacts?

MATCH (n:Artifact)<-[r]-(x)
RETURN n.artifactId, n.groupId, n.version, COUNT(r) AS uses
ORDER BY uses DESC
LIMIT 10


Feel free to run all these queries yourself using the Neo4J Browser. Things which I think can be improved or done differently:
  • Implement variable substitution. For example, ${project.version} can be replaced by the actual value from the <parent> section of the pom file;
  • Delete old nodes and edges from the graph if the corresponding project does not exist anymore;
  • Implement batch REST calls to improve performance.
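
The variable substitution from the first item could look roughly like the sketch below. It is not implemented in the tool; the PropertyResolver class is hypothetical, and it assumes the known values (for example the version taken from <parent>) have already been collected into a map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the tool does not contain this class.
class PropertyResolver {

    // Matches ${name} and captures the name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Replaces every ${name} with its value from properties; unknown names are left as-is. */
    static String resolve(String value, Map<String, String> properties) {
        Matcher m = PLACEHOLDER.matcher(value);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = properties.getOrDefault(m.group(1), m.group(0));
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> props = new HashMap<>();
        props.put("project.version", "2.4");
        System.out.println(resolve("${project.version}", props));
    }
}
```

Leaving unknown placeholders untouched keeps unresolved versions visible in the graph instead of silently replacing them with an empty string.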
Overall this application has POC quality and serves its purpose. The task of course can be achieved using a regular "relational" approach: the amount of data and the number of joins in the database will not be very large with just two objects and a graph structure like the one above. But nothing prevents us from adding more complexity to the graph in the future, like the size of each artifact on the filesystem, or information about its internal structure or the license used. Keep also in mind that altering the schema (or graph) does not require bringing the database down, as it might when you alter a schema in a regular relational DB. This might be an important point when considering a solution like that in a Production environment.

Sunday, December 21, 2014

Realtime Data Percolation with Elasticsearch, Akka and Java 8

Finally I've got some time to play with the Elasticsearch Percolator feature. In a couple of words, it is a very efficient way of evaluating your data against a set of rules. Rules are usually defined by queries. In the classic approach, one would save the data in the database, and then run a batch of queries against it to see which rules match. The Elasticsearch Percolator approach is the opposite - the queries are stored in the database, and the data is evaluated against them.

This approach can be beneficial when:
  • you have a large number of queries;
  • your data does not have a long lifespan (think about application log records for example, they can be safely deleted right after evaluation); 
  • you require fast real-time processing.
The data passed to the Elasticsearch percolator is thrown away. The stream of matched queries is returned (almost) immediately.
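
To make the inverted idea concrete, here is a tiny in-memory sketch. It only illustrates the concept and has nothing to do with the Elasticsearch API: the TinyPercolator class is made up, and the "queries" are plain Java predicates over a log line.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Conceptual illustration only; this is not how Elasticsearch implements percolation.
class TinyPercolator {

    // Registered queries, keyed by id, kept in registration order.
    private final Map<String, Predicate<String>> queries = new LinkedHashMap<>();

    /** Stores a query; this is the part that lives "in the database". */
    void register(String id, Predicate<String> query) {
        queries.put(id, query);
    }

    /** Evaluates one document against all stored queries and returns the matching ids. */
    List<String> percolate(String document) {
        List<String> matched = new ArrayList<>();
        for (Map.Entry<String, Predicate<String>> e : queries.entrySet()) {
            if (e.getValue().test(document)) {
                matched.add(e.getKey());
            }
        }
        return matched; // the document itself is not stored anywhere
    }

    public static void main(String[] args) {
        TinyPercolator percolator = new TinyPercolator();
        percolator.register("errors", doc -> doc.contains("ERROR"));
        percolator.register("timeouts", doc -> doc.contains("timeout"));
        System.out.println(percolator.percolate("ERROR: connection timeout"));
    }
}
```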
The example I wrote for my experiment is heavily based on a sample Andrew Easter made more than a year ago [1], so I had to alter it a bit to use the newer Elasticsearch API. My Scala skills are still weak, so I decided to rewrite the whole thing in Java 8, keep the Akka actors intact, and drop the Play framework completely, along with the AngularJS UI. You can see the results or clone them from my repository on GitHub [2].

The design is very simple: a Main class is responsible for starting the embedded Elasticsearch instance, defining the Akka actor system, and initializing the Elasticsearch index with the proper mapping (the latest 1.4.1 version of Elasticsearch requires a mapping to be ready before percolation). The result of the initialization Future call is a Stream of tuple objects, each representing a search string and a matched data entry. The next step is populating the Elasticsearch percolator with queries. Nothing prevents us from adding or deleting these queries in real time (Elasticsearch has a RESTful API, which I used from a Jersey client), but for simplicity all the queries are defined in advance.

The dummy data is supplied by the LogEntryProducerActor class. Using the built-in Akka scheduler, we can have log records generated periodically, as frequently as we want. The biggest simplification I made is the way the matched queries are returned. I've added a BlockingQueue on the Worker Actor side, which keeps the matches produced by the Percolator. Using the Java 8 Streams API, the matched Tuples are directed from the Queue back to the client and simply printed to stdout.
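
The queue-to-stream bridge can be sketched in a few lines. This is a simplified illustration and not the code from the repository: the QueueStreamSketch class is hypothetical, and plain strings stand in for the Tuple objects.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.Stream;

// Hypothetical illustration of exposing a BlockingQueue as a Java 8 Stream.
class QueueStreamSketch {

    /** An endless stream that blocks on the queue until the next element arrives. */
    static <T> Stream<T> streamOf(BlockingQueue<T> queue) {
        return Stream.generate(() -> {
            try {
                return queue.take();
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop the stream with an unchecked exception.
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for a match", e);
            }
        });
    }

    public static void main(String[] args) {
        BlockingQueue<String> matches = new LinkedBlockingQueue<>();
        matches.add("query-1 matched: ERROR disk full");
        matches.add("query-2 matched: connection timeout");
        // limit() keeps this demo finite; without it the stream waits for more matches forever.
        streamOf(matches).limit(2).forEach(System.out::println);
    }
}
```

In this sketch the producer side (the Worker Actor in the post) only needs to put matches into the queue, while the consumer treats them as an ordinary stream.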

Basically, I wanted to see how the sample works under load, with different numbers of queries defined. On my laptop I tried a 1 millisecond interval between log records and got a steady 900...1050 records processed per second, with less than 5% average CPU use. After increasing the number of queries to several thousand I started to get Jersey-specific errors, caused by the initialization process. Obviously, if we need to initialize a large number of documents really fast, sending a huge number of REST calls is not a good idea. The Elasticsearch Java client will probably be the better option.

Overall, I am happy with the Percolation feature of Elasticsearch: it works quite fast and efficiently. I haven't tried a clustered setup of Elasticsearch or Akka (Akka's application.conf file is included, feel free to specify your clustering settings there). Search query optimization was also omitted; for example, one can use a "filter" instead of a "query_string" query for potentially better performance.

[1] Reactive Real-time Log Search With Play, Akka, AngularJS and Elasticsearch (by Andrew Easter);
[2] My akka-percolator repository in GitHub;

Friday, July 11, 2014

Monitoring Radon with Arduino

I live in the Midwest, in the Chicago suburbs. When I bought my house several years ago I ordered a Radon gas check and the levels were in the normal range. From what I read about Radon, its levels can change with the season and over the year. I always wanted to monitor the level in real time and, if possible, find the factors causing its fluctuation.

I bought a Safety Siren Pro Series 3 Radon Gas detector from Amazon (Model No: HS71512) and did some initial measurements. On my living floors the radon levels were OK, but I was mostly interested in the basement, which is an unfinished, non-livable "crawl space" used mostly for storage. It would be great to have the Radon measurement process automated, without the need to crawl into the basement every time to check the readings.

I found an excellent article written by Chris Nafis which targeted exactly my task. I had recently got a spare Arduino Kit from SparkFun Electronics and decided to use it for the Radon detection automation. You can get lots of basics and instructions from Chris's radon work; I will explain here only what I did differently.

First of all, the Safety Siren Radon Gas detector has to be modified, which will void your warranty. Instead of having a set of cables directly attached to the Safety Siren detector board, I installed a 25-pin DB-25 connector on top of the Radon Detector, and cut an LPT-port cable from my old PC in half. I soldered the cable wires to my Arduino and put it, along with an Ethernet Shield, into a small plastic box. See the picture below of what I have running in my basement today.

Below is the picture of LPT cable ends with soldered wire pins I used for experimenting. Later I removed all the pins and soldered cable wires directly to the Arduino board, so the whole construction would fit into the box.

An interesting challenge was detecting the Long / Short setting of the Safety Siren detector (1 week or 1 month of approximate radon data). I noticed that after a reset the detector automatically starts in the Long setting, but I wanted to rely on the actual setting and not assume anything. Chris has an Arduino program for the older Model HS80002 listed on his website, but I had the newer HS71512, and the Long / Short detection logic was missing in the program. First I thought I could read the Long / Short LED, but unfortunately soldering to it was not an easy task without disassembling the Safety Siren board, so I decided to take a programming approach.

I noticed that the Long / Short settings have some timing differences. See the two screenshots below; each shows two channels. On the first screenshot, taken in the Short setting, the DIGIT_4 strobe (lower channel) goes to 0 almost immediately with the LTL signal (upper channel). In the Long setting these two signals have a timing difference, which the second screenshot illustrates.

So to handle this timing difference, I used the interrupt functionality of the Arduino. The Arduino Uno has int0 on pin 2, which I attached to LTL from the Safety Siren detector.

attachInterrupt(0, detectShortHandler, RISING);

The interrupt service routine is very simple - if DIGIT_4 is 0, we are in Short mode; otherwise we are in Long.

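void detectShortHandler()
{
  // A short delay goes here to let DIGIT_4 settle before it is read.
  if (digitalRead(DIGIT_4) == LOW)
    state = ST_SHORT;   // DIGIT_4 already low: Short setting
  else
    state = ST_LONG;    // DIGIT_4 still high: Long setting
}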

Please note a small delay I had to add before processing the DIGIT_4 strobe value. The DIGIT_4 goes to 0 "almost" immediately, see a screenshot below, hence the pause.

Finally, I had the radon values detected, processed and periodically (once per ~30 minutes) published to the free Xively service (formerly Pachube). Below is a graph of the Radon level in my crawl space after several days of monitoring.
Note that only the "long" setting is presented. After several weeks of collecting data, my Safety Siren detector failed to switch to Short. I think it is a detector issue (probably an internal firmware failure), because it still switches into Long and stays there even after I disconnected the Arduino and did a full reset of the Safety Siren detector. Well, I cannot exchange the detector because I opened it, and I will not buy another one (~$100 is too steep for a small project), so I'm stuck with the Long setting for now.
Update: after about 4 months both the "long" and "short" settings started to operate normally. I think the reason might be the power supply, as this is the only thing I changed in my setup during that time.

For those who would like to repeat this project - I published my Arduino program on GitHub. Feel free to use it and get back to me if you have any questions.