Recently I was doing some planning work for one of our larger repositories to determine how we might approach splitting it up, and wanted to start asking a lot of questions about the project dependencies within it. There are various great tools out there like NDepend to help analyze complexity and dependencies, but I found myself wanting to really query the data in a lot of different ways, as well as inject it with knowledge we had about our projects such as which ones were part of the deployable artifacts, etc.
Since dependencies are naturally represented as graphs, particularly since they can be nested several levels through chains of dependencies, I figured I'd see if I could easily get the data into a Neo4j database and start querying it that way. It ended up being very easy and worked great, so I thought I'd share a quick version of what I hacked together since it was a fun and useful experiment. For this example I'll use the Xamarin.Forms repository, since it contains a number of different projects and dependencies within it.
Loading the Data
Generating the CSVs
Neo4j makes it nice and easy to import data via CSV files, so that's what I decided to go with. First, choose this option to locate the import folder for your database:
Make note of the path of that folder, since we'll need to plug it into the next step. Next I reached for F#, my favorite scripting language for stuff like this, and wrote a quick FSX script to find all csproj files in the repository and parse out their project references. My first pass used the XML type provider, but I ran into parsing issues with it on some project files, and dropping down to System.Xml turned out to be concise enough that I stuck with that.
First, some functions for parsing project files and writing out the resulting CSV files:
open System.IO
open System.Xml
// Returns the names of the projects referenced by the project file at `path`
let getProjectReferences (path:string) =
    let doc = XmlDocument()
    doc.Load(path)
    doc.GetElementsByTagName "ProjectReference"
    |> Seq.cast<XmlNode>
    |> Seq.map (fun node ->
        Path.GetFileNameWithoutExtension node.Attributes.["Include"].Value)
let repoPath = @"C:\code\github\xamarin\Xamarin.Forms"
let neoImportPath = @"<your import path here>"
// Writes the given lines to a file in the Neo4j import folder
let writeFile name lines =
    File.WriteAllLines(Path.Combine(neoImportPath, name), Array.ofSeq lines)
This also assumes that there is only one project in the repository with a given name, which lets us strip .csproj from the file name to keep things more readable.
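Here's a small sketch of how that uniqueness assumption could be checked before generating any CSVs. The helper names (projectNameOf, findDuplicateNames) and the example paths are made up for illustration and aren't part of the original script. The backslash normalization is an extra precaution for running on Linux or macOS, where the backslashes in Windows-style paths aren't treated as separators.

```fsharp
open System.IO

// Hypothetical helper: extract a project name from a file path or Include value.
// Backslashes are normalized first so Windows-style paths also split correctly
// on Linux and macOS.
let projectNameOf (path: string) =
    Path.GetFileNameWithoutExtension(path.Replace('\\', '/'))

// Hypothetical helper: group project files by name and keep only the names
// that map to more than one file.
let findDuplicateNames (paths: seq<string>) =
    paths
    |> Seq.groupBy projectNameOf
    |> Seq.filter (fun (_, files) -> Seq.length files > 1)
    |> Seq.map (fun (name, files) -> name, List.ofSeq files)
    |> List.ofSeq

// Made-up example paths, not taken from the Xamarin.Forms repository
let examplePaths =
    [ @"src\Core\Core.csproj"
      @"tests\Core\Core.csproj"
      @"src\Maps\Maps.csproj" ]

let duplicates = findDuplicateNames examplePaths

// prints "Core appears 2 times"
duplicates
|> List.iter (fun (name, files) -> printfn "%s appears %d times" name (List.length files))
```

If this turns up any duplicates, one option would be to key the nodes on a relative path instead of the bare project name, at the cost of some readability in the graph.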
Next, we'll read all csproj files and create a map of their project dependencies:
let allDependencies =
    Directory.EnumerateFiles(repoPath, "*.csproj", SearchOption.AllDirectories)
    |> Seq.map (fun path ->
        (Path.GetFileNameWithoutExtension path), (getProjectReferences path))
That's all the data we need, so now we just need to write out those CSV files. First, the list of projects:
allDependencies
|> Seq.map (fun (project, _) -> sprintf @"""%s""" project)
|> writeFile "projects.csv"
And then the dependencies:
allDependencies
|> Seq.filter (fun (_, projectDependencies) -> not <| (Seq.isEmpty projectDependencies))
|> Seq.collect (fun (project, projectDependencies) ->
    projectDependencies
    |> Seq.map (fun dependency -> sprintf @"""%s"",""%s""" project dependency))
|> writeFile "dependencies.csv"
Importing the CSVs
Now that those are generated, we just need to import those into the database using a bit of Cypher. First we'll do the projects:
LOAD CSV FROM 'file:///projects.csv' AS row
WITH toString(row[0]) AS name
CREATE (p:Project {name: name})
That will parse out each row in the CSV file and create a Project node for each of them, assigning the name property based on the value. Next we'll load up the dependencies, match them against the project nodes we just created, and create a DEPENDS_ON relationship between each pair:
LOAD CSV FROM 'file:///dependencies.csv' AS row
WITH toString(row[0]) AS dependent, toString(row[1]) AS dependency
MATCH (dependentProject:Project {name: dependent})
MATCH (dependencyProject:Project {name: dependency})
MERGE (dependentProject)-[rel:DEPENDS_ON]->(dependencyProject)
RETURN count(rel)
You can see here that the DEPENDS_ON relationship also indicates the direction of that dependency. Just like project nodes, relationships can carry properties too, so a future version of this could include things like package dependencies and record the type of dependency as a property on the relationship.
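As a rough sketch of what the F# side of that future version might look like, the snippet below collects both ProjectReference and PackageReference entries and writes the kind of dependency as a third CSV column. Everything here is an assumption for illustration: the getTypedReferences and toCsvRow helpers, the "project"/"package" labels, and the sample project XML are all made up, and the matching Cypher import isn't shown.

```fsharp
open System.IO
open System.Xml

// Hypothetical helper: returns (dependent, dependency, kind) triples for one
// project, where kind is "project" or "package".
let getTypedReferences (projectName: string) (projectXml: string) =
    let doc = XmlDocument()
    doc.LoadXml(projectXml)
    // Collect the Include attribute of every element with the given tag name
    let includesOf (tag: string) =
        doc.GetElementsByTagName(tag)
        |> Seq.cast<XmlNode>
        |> Seq.choose (fun node ->
            match node.Attributes.["Include"] with
            | null -> None
            | attr -> Some attr.Value)
        |> List.ofSeq
    let projects =
        includesOf "ProjectReference"
        |> List.map (fun path ->
            projectName, Path.GetFileNameWithoutExtension(path.Replace('\\', '/')), "project")
    let packages =
        includesOf "PackageReference"
        |> List.map (fun packageId -> projectName, packageId, "package")
    projects @ packages

// One CSV row per reference, with the kind as a third column
let toCsvRow (dependent: string, dependency: string, kind: string) =
    sprintf @"""%s"",""%s"",""%s""" dependent dependency kind

// Made-up project file contents for illustration
let sampleXml = """
<Project Sdk="Microsoft.NET.Sdk">
  <ItemGroup>
    <ProjectReference Include="..\Core\Core.csproj" />
    <PackageReference Include="Newtonsoft.Json" Version="13.0.1" />
  </ItemGroup>
</Project>"""

let rows = getTypedReferences "App" sampleXml |> List.map toCsvRow
rows |> List.iter (printfn "%s")
```

Note that packages wouldn't have matching Project nodes, so the import would presumably need a separate node label for them.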
Now we've got all our projects and dependencies loaded into Neo4j and ready to query!
Querying the Data
Let's start simple: query out all the projects and visualize them along with their dependencies, using the following query:
MATCH (p:Project) RETURN p
This ends up looking like:
Ok, so that alone doesn't end up being super useful since there's a lot going on, but it still says a lot! The Xamarin prefix makes it a little hard to read in this form as well, but clicking through on that center node shows that it's actually Xamarin.Forms.Core, which is clearly one of the primary dependencies within this repository.
The visualization side of the graph data is cool, but let's check out some of the types of queries we can easily write based on having this data loaded into a graph database. For example, which projects have the most direct dependents?
MATCH (dependent:Project)-[:DEPENDS_ON]->(dependency:Project)
RETURN dependency.name, COUNT(dependent.name) AS numDirectDependents
ORDER BY numDirectDependents DESC
One of the nice things about Cypher is how readable these relationship queries end up being, since the syntax includes a visual representation of them. Those are direct dependencies, but what if we wanted to extend that to include indirect ones as well? All we need to do is add a * into the relationship part of that query and Neo4j takes care of the rest:
MATCH (dependent:Project)-[:DEPENDS_ON*]->(dependency:Project)
RETURN dependency.name, COUNT(DISTINCT dependent.name) AS numIndirectDependents
ORDER BY numIndirectDependents DESC
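To make it a bit more concrete what that * is doing, here's a rough in-memory equivalent in F# that follows dependency edges one or more hops out and then counts distinct dependents per project. This is only a sketch for intuition, not how Neo4j actually evaluates the query, and the edge list and the transitiveDependencies helper are made up for the example.

```fsharp
// Made-up (dependent, dependency) edges, not the real Xamarin.Forms graph
let edges =
    [ ("App", "Controls")
      ("Controls", "Core")
      ("Tests", "Core")
      ("Core", "Platform") ]

// Hypothetical helper: all projects reachable from `start` by following
// one or more dependency edges.
let transitiveDependencies (edges: (string * string) list) (start: string) =
    let rec visit (seen: Set<string>) (frontier: string list) =
        match frontier with
        | [] -> seen
        | current :: rest ->
            // Dependencies of the current project that haven't been seen yet
            let next =
                edges
                |> List.filter (fun (dependent, dependency) ->
                    dependent = current && not (seen.Contains dependency))
                |> List.map snd
            visit (Set.union seen (Set.ofList next)) (next @ rest)
    visit Set.empty [ start ]

// Count distinct direct and indirect dependents per project, highest first
let dependentCounts =
    let projects =
        edges |> List.collect (fun (a, b) -> [ a; b ]) |> List.distinct
    projects
    |> List.collect (fun project ->
        transitiveDependencies edges project
        |> Set.toList
        |> List.map (fun dependency -> dependency, project))
    |> List.countBy fst
    |> List.sortByDescending snd

dependentCounts |> List.iter (fun (name, count) -> printfn "%s: %d" name count)
```

With these example edges, Platform comes out on top with four dependents even though only Core references it directly, which is the kind of thing the direct-only query would miss.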
We can see that Xamarin.Forms.Core is clearly one of the primary dependencies, but what percentage of projects actually depend on it?
MATCH (dependent:Project)
OPTIONAL MATCH (dependent)-[:DEPENDS_ON*]->(dependency:Project {name: "Xamarin.Forms.Core"})
// Generic CASE form: a simple "CASE x WHEN NULL" never matches, since x = NULL evaluates to NULL
WITH DISTINCT dependent.name AS dependentName,
  CASE
    WHEN dependency IS NULL THEN false
    ELSE true
  END AS dependsOnCore
RETURN dependsOnCore, COUNT(*)
So five projects don't depend on Xamarin.Forms.Core...what are they?
MATCH (dependent:Project)
WHERE NOT (dependent)-[:DEPENDS_ON*]->(:Project {name: "Xamarin.Forms.Core"})
RETURN DISTINCT dependent.name
This just scratches the surface of the types of queries you can start writing here, but even just the basics have already proven to be really interesting and valuable as I start to poke at the dependency graph in different ways to see what shakes out, especially when combined with domain-specific information about our projects. Not bad for a quick hack project!