Last week I attended the latest Data Science London meetup, which had a focus on Graphs and Graph Data. The take-home message for me was that graph databases are an exciting and distinctive piece of technology, firstly because they can simplify the storage and analysis of messy data sets, and secondly because they provide a useful visual encoding that facilitates communication, interaction, and exploration of data.

First to speak was Ian Robinson from Neo Technology, co-author of the book ‘Graph Databases’. Neo Technology, the company behind Neo4j, also sponsored the meetup. Ian gave a presentation in which he described the fundamentals of the graph data model, the data structure at the core of a graph database.

He argued that the problem graph databases are trying to solve is that of ‘semi-structured’ data, and that they have certain conceptual and performance advantages compared to relational databases in situations where ‘joins’ are necessary.

Thanks to Ian I now have a clearer idea of what the graph data model is, which I illustrate below.

[Diagram: graph]

Specifically we have ‘nodes’ and ‘relationships’ where:

  • Nodes are usually nouns (in my diagram: people, places, and books) and have labels, which they may share with other nodes.

  • Relationships (or edges in traditional network terminology) have relationship types, which are usually verb-like (in this example ‘lives at’ and ‘wrote’). They are directed (author A wrote book B, not vice versa) and can also have key/value pairs associated with them, for instance dates.
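To make the model concrete, here is a minimal sketch of it in plain Python rather than in Neo4j itself. The class names, the example labels, and the property values are all my own assumptions for illustration, not anything Ian showed.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node: a set of labels plus key/value properties."""
    node_id: int
    labels: set
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    """A directed, typed relationship with its own key/value properties."""
    start: int      # id of the node the relationship leaves
    end: int        # id of the node the relationship points at
    rel_type: str   # verb-like type, e.g. "WROTE"
    properties: dict = field(default_factory=dict)


# Hypothetical example data, loosely following the diagram.
author = Node(1, {"Person", "Author"}, {"name": "Author A"})
book = Node(2, {"Book"}, {"title": "Book B"})
place = Node(3, {"Place"}, {"city": "London"})

relationships = [
    Relationship(author.node_id, book.node_id, "WROTE", {"year": 2013}),
    Relationship(author.node_id, place.node_id, "LIVES_AT", {"since": 2010}),
]


def outgoing(node_id, rel_type):
    """Return ids of nodes reached by following rel_type out of node_id."""
    return [r.end for r in relationships
            if r.start == node_id and r.rel_type == rel_type]
```

Because relationships are directed, `outgoing(1, "WROTE")` finds the book, while `outgoing(2, "WROTE")` finds nothing: the book did not write the author.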

Within Neo4j you can query the database looking for particular graph configurations or shortest paths, using Cypher, Neo4j’s query language, which doesn’t look that difficult to learn (there are a good number of drivers available for Neo4j here). For instance, a query might be written to find all authors of a particular genre from a particular country.
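As a rough illustration of what these two kinds of query do, here is a sketch in plain Python over a tiny in-memory edge list. It is not Cypher and not how Neo4j executes queries; the data, the relationship type names (`WROTE`, `HAS_GENRE`, `BORN_IN`), and the function names are all hypothetical.

```python
from collections import deque

# Hypothetical data: (start, relationship type, end) triples.
edges = [
    ("alice", "WROTE", "book1"),
    ("bob", "WROTE", "book2"),
    ("carol", "WROTE", "book3"),
    ("book1", "HAS_GENRE", "crime"),
    ("book2", "HAS_GENRE", "crime"),
    ("book3", "HAS_GENRE", "poetry"),
    ("alice", "BORN_IN", "uk"),
    ("bob", "BORN_IN", "france"),
    ("carol", "BORN_IN", "uk"),
]


def targets(start, rel_type):
    """Nodes reached from start by following one relationship of rel_type."""
    return {e for s, t, e in edges if s == start and t == rel_type}


def authors_of_genre_from(genre, country):
    """Match: (country)<-BORN_IN-(author)-WROTE->(book)-HAS_GENRE->(genre)."""
    authors = {s for s, t, _ in edges if t == "WROTE"}
    return sorted(
        a for a in authors
        if country in targets(a, "BORN_IN")
        and any(genre in targets(b, "HAS_GENRE") for b in targets(a, "WROTE"))
    )


def shortest_path(source, goal):
    """Breadth-first search ignoring direction; returns a node list or None."""
    neighbours = {}
    for s, _, e in edges:
        neighbours.setdefault(s, set()).add(e)
        neighbours.setdefault(e, set()).add(s)
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in sorted(neighbours.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None
```

Here `authors_of_genre_from("crime", "uk")` looks for one particular graph configuration, and `shortest_path("alice", "bob")` connects two authors through the genre their books share.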

In the question and answer session, Ian also pointed out two solutions for handling ‘super nodes’, which are nodes with a very large number of attached nodes, for instance Lady Gaga’s followers. Either the surrounding nodes can be abstracted away as a new node property, or a sampling-based approach can be taken, since often we are not that interested in all of the nodes around a super node.
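The two options might look something like the sketch below, again in plain Python. I am assuming that ‘abstracting away as a property’ means something like storing a summary such as a follower count on the node, which is my reading rather than a detail Ian gave; the threshold, sample size, and names are likewise made up for illustration.

```python
import random

# Hypothetical adjacency list: account -> list of follower ids.
followers = {
    "lady_gaga": [f"fan{i}" for i in range(100_000)],
    "small_account": ["fan1", "fan2", "fan3"],
}

SUPER_NODE_THRESHOLD = 1_000  # assumed cut-off for treating a node as 'super'


def follower_count_property(node):
    """Option 1: summarise the neighbourhood as a single node property."""
    return {"follower_count": len(followers[node])}


def sampled_followers(node, k=100, seed=0):
    """Option 2: for a super node, visit only a uniform random sample."""
    neighbours = followers[node]
    if len(neighbours) <= SUPER_NODE_THRESHOLD:
        return list(neighbours)  # small enough to traverse in full
    return random.Random(seed).sample(neighbours, k)
```

With this, a traversal touches at most `k` of the super node's neighbours, while ordinary nodes are still traversed in full.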

Next to speak was Tareq Abedrabbo from OpenCredo, who gave a useful overview of his experience working with Neo4j. His slides are available here, and are based on one of Tareq’s blog posts here. Tareq’s point was that problems solved with graph databases are often either ‘domain centric’, i.e. using a more predictable data model, or ‘data centric’, using less predictable data from a variety of sources. He went on to describe two applications, an impact analysis in a telecoms network and an optimization problem within an oil flow network, as well as giving best practice recommendations for working with graph databases.

Thanks to Carlos Somohano for organizing the Data Science London meetups. They’ve been fascinating and I’m looking forward to coming along to more sessions in 2014.