Useful Bits and Bytes: graph

Below you will find a way to use Neo4J graph database as a storage of metadata from your local Maven repository.

If you have Maven to manage your project dependencies, then after a while you might get a quite sizable local Maven cache. Maven keeps there a copies of all jars being downloaded from the remote repositories (or being installed by your local builds). Usually it is located in your home folder, under the hidden .m2 directory. For example, my local Maven repository size is ~8GB, and honestly saying I don't remember when the last time I cleaned it up.

Maven-based projects configured to get the external artifacts (jars, zip, war files, etc.) necessary for a build by looking for a set of <dependency> nodes in the project pom.xml file. Which means you can find out which artifacts in your local Maven repository are still in use and which ones are potential waste by simply examining all your pom.xml files. Note that I am talking here about direct dependencies only, not a dependency-on-dependency chains, which Maven is also capable to handle. If we want to investigate a whole chain, we need to get recursively to each of the pom files and its dependencies.

We all know that a local storage is not very expensive this days, so we will not gain much benefit by keeping an eye on how Maven uses our local artifacts and periodically wipe out the old ones. It is always cheaper to destroy the local cache completely and let maven to download all it needs automatically during the next build. However, having an automated way of resolving relations between your active projects and your local maven cache might be interesting from several others points of view. For example software audit (if any of your commercial projects use non-licensed code), or some code analysis (what versions of a particular dependencies are used). The Maven own Dependency plugin has some of this functionality, it can build a rudimentary trees and provide the dependency information in one particular project or group of related projects. But I can imagine a situation when you get a vulnerability report about one specific version of a dependency, and you would like to find out quickly where exactly it is used. This can be especially helpful if your organization uses a centralized Maven repository like Artifactory or Sonatype, and you can quickly poke into it's local repository cache.

There might be another reasons why I came up with this crazy idea to keep Maven repository information in graph database, so to stop speculations I will only add that the Neo4J is a great tool, and the task itself is very familiar to the "Digital Asst Management" use case, enough said.

A couple of word about implementation. Let's consider that you have a Neo4J database up and running somewhere, and you know a URL to it's REST endpoint. On my machine it is http://127.0.0.1:7474/db/data/

GitHub repository with source code: https://github.com/rokhmanov/repo-graph-maven

The application itself is a command-line tool, which first scans your local Maven repository for *.pom files, excludes the SNAPSHOTs, and then process the files one-by-one. We are particularly interested only in a subsection of pom file, in particular in <artifactId>, <groupId> and <version> nodes. Another point of interest is a <dependencies> node along with all its child nodes. This chunks of XML gets unmarshalled into java objects, see the corresponding Dependency.java and Project.java source files under com.rokhmanov.graph.sample.entity package. The last step is a graph creation - Neo4J RESTful API gets called by Jersey java client. Nothing really complex. This is a command line syntax:

usage: repo-graph-maven.xxxx.jar [OPTION]...
Options:
    <directory> - (mandatory parameter) path to the local Maven repository.
    <serverURL> - (mandatory parameter) Neo4j REST server root URL.
    'clear' - (optional parameter) The existing database will be recreated if specified.

Example: java -jar repo-graph-maven.jar ~/.m2/repository http://192.168.0.1:7474/db/data/ clear

Structure of Neo4J database: a single Root node "keeps" several links to Project nodes, each of them "has" links to Artifact nodes. So the graph schema is also very simple: "Root", "Project" and "Artifact" are nodes, and "keeps" and "has" are vertices.

[Root]--keeps-->[Project]--has-->[Artifact]

Note that some Projects might use the same Artifacts, for example the same version of jUnit library. The application handles this scenario, and graph includes a cycles, like the one on a screenshot below:

If you have a large repository, the initial run might take a several minutes. Each subsequent run will add only new projects and artifacts so it will be shorter. Keep in mind that if you supply the "clear" parameter to the application, the whole Neo4J graph will be repopulated from scratch.

After successful execution, we can finally play with the graph. Below are sample Cypher queries along with their results:

1. What projects use the "commons-io" artifact?

MATCH p=(b:Component)-->(c:Artifact)
WHERE c.artifactId='commons-io'
RETURN b.artifactId, b.version, b.groupId

Result:

2. What versions of "commons-io" artifact are used overall?

MATCH (n:Artifact)
WHERE n.artifactId =~ '.*commons-io.*'
RETURN n.artifactId, n.groupId, n.version

Result:

3. What are the 10 the most used artifacts?

MATCH (n:Artifact)<-[r]-(x)
RETURN n.artifactId, n.groupId, n.version, COUNT(r)
ORDER BY COUNT(r) DESC
LIMIT 10

Result:

Feel free to run all this queries by yourself using Neo4J Browser. Things which I think can be improved or made differently:

Implement variable substitution. For example, the ${project.version} can be replaced by the actual value from <parent> section in pom file;
Delete old nodes and edges from graph if the corresponding project is not exist anymore;
Implement batch REST calls to improve performance.

Overall this application has a POC quality and serves its needs. The task of course can be achieved by using a regular "relational" approach, the amount of data and the number of joins in the database will not be very large with just two objects and the graph structure like the one above. But nothing prevents us from adding more complexity to the graph in future, like a size of each artifact on filesystem, or information about its internal structure or license used. Keep also in mind that altering the schema (or graph) does not require to bring the database down, like when you alter a schema in a regular relational DB. This might be an important point when considering a solution line that in Production environment.

I did some graph DB research and POC about six month ago for a "configuration management" task. The short description: the system might have a different set of modules, each consist from a bigger set of different parts. Almost all parts has different types of relations inside one module and across them. The task is to easily store this data, and have some complex search and retrieval logic. A typical graph database task. My weapon of choice at that point was OrientDB.
My first results were quite good - I was able to quickly import data from our Teiid relational federated database and build a graph with 322591 vertices and 322636 edges. The complete load process took about 40 minutes, mostly because slowness on our side. Our federated database consist from many different parts: fast and small H2 tables, custom data translators to a legacy data, and large and slow DB2 tables with hundreds of thousand records on mainframe.
I used both Native and Tinkerpop API and found it useful and easy to use. In my POC I took Jung for graph visualization, and for a collection of graph algorithms.
Here are just some results for the graph above:

"Find Shortest Path" (Dijkstra): 4781.0 ms
"Check Nodes Connected" using Dijkstra algo: 7392.0 ms
"Check Nodes Connected" using SQL TRAVERSE extension from OrientDB: 23829.0

Note that results were obtained from in-memory graph, with no optimization (raw quality java code, no indexes defined in OrientDB, unlimited walk depth in traversal). I am confident the results can be improved very easily.
My next step was the integration of OrientDB into Teiid. Since we have a huge federated database, the way to use SQL to send and retrieve data from graphDB considered as convenient but somewhat unnatural approach. I can explain benefits and drawbacks of using this approach separately, it deserves a whole new post. Unfortunately that time (6 months ago) I had to step back: the JDBC implementation of OrientDB was very raw and did not provide enough metadata for Teiid 7.6 to be included in federated database.
I got back to this task recently, trying a new Teiid 8.1 (it has a useful notion of defining metadata in DDL along with NATIVE support like the old Teiid) and with latest OrientDB 1.2.0-SNAPSHOT. I took the latest orientdb JDBC driver sources from git and compiled jar file by myself. Taking the latest JDBC sources was probably my biggest mistake. I've tried to connect to running OrientDB instance using Eclipse Database Explorer and run some queries. Unfortunately even simple and innocent queries against my imported graph or even against built-in "tinkerpop" database provided inconsistent results and even caused Eclipse to freeze sometime:

select * from V where o_id like 'demodata.s%';
select from 6:6 where any() traverse(0,10) (o_id = 'demodata.stats');

Something definitely wrong in the latest JDBC driver code. Obviously, the JDBC driver is not a top priority feature for a graph database. I probably can implement the needed functionality by myself - write a custom Teiid translator (might be a considerably big task, depends of the level of SQL support we want to achieve), or try another graphDB implementations. I will be looking also on neo4j - they also have a jdbc driver. The query language they support in JDBC is not SQL, but (I think) it can be worked out.
This task is on hold for now anyway, I'll make an update in two-three months.

Useful Bits and Bytes

Wednesday, December 31, 2014

Representing Your Local Maven Repository Structure in Neo4J Graph Database

Tuesday, August 7, 2012

My Experience with OrientDB JDBC