How to use Neo4J with LinkML-Store

This how-to guide shows you how to use linkml-store over a Neo4j database. linkml-store is a flexible framework that can be used with different backends, employing simple mappings via LinkML schemas.

One of the main use cases here is being able to provide a schema for validating a property graph database, as well as making it easier to map between a property graph database and other representations (tabular/dataframe, JSON/YAML, RDF, RDFstar, OWL, …)

We will try and use some of the same examples as in neo4j graph concepts guide, making a simple Movie-oriented KG:

Running this how-to guide interactively.

This guide is both documentation, and a Jupyter notebook.

You can run the notebook locally. You will need a local neo4j database running, with authentication disabled.

One way to do this is via Docker:

docker run
  --name myneo4j
  --publish 7474:7474
  --publish 7687:7687
  --volume $HOME/neo4j/data/:/data
  --volume $HOME/neo4j/logs/:/logs
  -e NEO4J_AUTH=none
  neo4j

IMPORTANT NOTE if you run this Notebook locally, it will overwrite the default neo4j. If you have the enterprise edition you can modify this notebook to use a different database.

[1]:

from linkml_store import Client

client = Client()

Ensure the database is empty, then connect

[2]:

db = client.attach_database("neo4j", "test")
db.drop()
db = client.attach_database("neo4j", "test")

Using a schema

It is not necessary to defined a LinkML schema in advance, but this has a lot of advantages:

validation
clarity for producers and consumers
declarative mappings and graph projections

We’ll use a predefined one here. You can see the source schemasheet here

[3]:

db.load_schema_view("input/movies_kg/schema.yaml")

Next we’ll visualization this using PlamtUML

[4]:

%%bash
gen-plantuml --directory input/movies_kg/diagrams input/movies_kg/schema.yaml

WARNING:root:File "schema.yaml", line 36, col 15: Unrecognized prefix: rdf

Note that this is in many respects a “normal” UML type schema, with classes and class relationships.

Here we define to base classes Node and Edge. We are treating edges as first-class entities in the model. We are following conventions and using subject and object for the head and tail nodes.

You don’t need to define your schema this way. In fact it’s possible to map any LinkML schema to a property graph. You don’t need classes like “Node” and “Edge”, and you don’t need a reification type model.

But having this kind of “explicit” model is easiest to make the relationship to Neo4J easier to grok.

Create Node and Edge collections

We create two collections, one for nodes, and one for edges.

Note: you can organize collections how you like - you could have multiple node collections, (e.g. different collections for persons, movies), and multiple edge collections. However, you can’t mix nodes and edges in one collection.

[5]:

node_collection = db.create_collection("Node", alias="nodes", recreate_if_exists=True)
edge_collection = db.create_collection("Edge", alias="nodes", recreate_if_exists=True)

It’s necessary to provide some kind of mechanism to indicate which collections are edge collections. This can be inferred from the schema, but here will will make it explicit.

We’ll use a default edge projections, which assigns special meaning to subject, predicate, and object.

[6]:

from linkml_store.graphs.graph_map import EdgeProjection

edge_collection.metadata.graph_projection = EdgeProjection()

Adding Data

[7]:

# For convenience, we assign constants for each of the identifiers in the database
TH = "ACTOR:TH"
RZ = "ACTOR:RZ"
FG = "MOVIE:FG"
BTTF = "MOVIE:BTTF"
FGC = "CHARACTER:FGC"

[8]:

persons = [
    { "id": TH, "category": "Actor", "name": "Tom Hanks"},
    { "id": RZ, "category": "Director", "name": "Robert Zemekis"},
]
movies = [
    { "id": FG, "category": "Movie", "name": "Forest Gump"},
]
characters = [
    { "id": FGC,"category": "Character", "name": "Forest Gump (Character)"},
]

[9]:

edges = [
    {"subject": RZ, "predicate": "Directed", "object": FG},
    {"subject": TH, "predicate": "ActedIn", "object": FG, "plays": FGC},
]

[10]:

# Modify this to experiment with different ways of adding
ADD_INDIVIDUALLY = True

[11]:

if ADD_INDIVIDUALLY:
    node_collection.insert(persons)
    node_collection.insert(movies)
    node_collection.insert(characters)
    edge_collection.insert(edges)
else:
    g = {
        "nodes": persons + movies + characters,
        "edges": edges
    }
    db.insert(g)

Visualization

We will visualize the whole database using py2neo, networkx, and matplotlib. This is not as visually attractive as the native neo4j UI but it works better in a notebook:

[12]:

import matplotlib.pyplot as plt
from linkml_store.utils.neo4j_utils import draw_neo4j_graph

plt.figure(figsize=(15, 10))
draw_neo4j_graph()
plt.title("Visualization of simple subset", fontsize=16)
plt.axis('off')
plt.tight_layout()
plt.show()

Note this doesn’t display edge properties (e.g. the “plays” edge property between Tom Hanks and Forrest Gump

Exploring in Neo4j

If we go on over to http://localhost:7474/browser/ (if you are running this locally) you can see this in the normal UI:

Validation

One of the main use cases for LinkML and LinkML-Store is validation using a schema. Most of the rich features of LinkML validation are available for property graphs, we will illustrate some simple ones here.

Validating an existing database

First we’ll validate the database. Note that by default, validation is retrospective; it’s possible to insert invalid data into the database.

[13]:

errs = list(db.iter_validate_database())
errs

[13]:

[]

Good, no errors!

Prospective Validation

We can switch on prospective validation, to prevent any future modifications making the data invalid:

[14]:

edge_collection.metadata.validate_modifications = True

Let’s test this out by inserting deliberately invalid data – making up an edge property not defined in the schema

[15]:

try:
    edge_collection.insert({ "subject": RZ, "predicate": "Directed", "object": BTTF, "date": 1985})
except Exception as e:
    print("Got an exception (we expect this!!!)")
    print(e)

Got an exception (we expect this!!!)
Validation errors: [ValidationResult(type='jsonschema validation', severity=<Severity.ERROR: 'ERROR'>, message="Additional properties are not allowed ('date' was unexpected) in /", instance={'subject': 'ACTOR:RZ', 'predicate': 'Directed', 'object': 'MOVIE:BTTF', 'date': 1985}, instance_index=0, instantiates='Directed', context=[])]

We can also test referential integrity. By default, inserting an edge that points to a non-existent node will create a dangling node in Neo4j (Neo4j enforces that there must be some kind of node).

We will change the policy and try and insert a dangling edge

[16]:

from linkml_store.api.stores.neo4j.neo4j_collection import DeletePolicy

edge_collection.delete_policy = DeletePolicy.ERROR

try:
    edge_collection.insert({ "subject": RZ, "predicate": "Directed", "object": "MOVIE:MADE_UP_NODE"})
except Exception as e:
    print("Got an exception (we expect this!!!)")
    print(e)

Got an exception (we expect this!!!)
Node with identifier MOVIE:MADE_UP_NODE not found in the database.

Loading Data from a file

Loading data into a graph is very much like loading any other data into a LinkML schema. So long as you conform to the schema we defined above (which has defined graph projections), then it can be loaded into Neo4J.

Data could be combined in a single JSON file, but it’s also convenient to use tabular or dataframe-oriented formats.

Here we will load from CSV

[17]:

import pandas as pd

[18]:

nodes_df = pd.read_csv("input/movies_kg/nodes.csv").convert_dtypes()
nodes_df

[18]:

	id	category	name	born	released
0	PERSON:TH	Actor	Tom Hanks	1956	<NA>
1	PERSON:RZ	Director	Robert Zemeckis	1951	<NA>
2	PERSON:HW	Actor	Hugo Weaving	1960	<NA>
3	PERSON:KR	Actor	Keanu Reeves	1964	<NA>
4	PERSON:MJF	Actor	Michael J. Fox	1961	<NA>
5	MOVIE:FG	Movie	Forrest Gump	<NA>	1994
6	MOVIE:TM	Movie	The Matrix	<NA>	1999
7	MOVIE:TM2	Movie	The Matrix Reloaded	<NA>	2003
8	MOVIE:TM3	Movie	The Matrix Revolutions	<NA>	2003
9	MOVIE:BTTF	Movie	Back to the Future	<NA>	1985

[19]:

edges_df = pd.read_csv("input/movies_kg/edges.csv").convert_dtypes()
edges_df

[19]:

	subject	predicate	object	plays
0	PERSON:TH	ActedIn	MOVIE:FG	CHARACTER:Forrest
1	PERSON:RZ	Directed	MOVIE:FG	<NA>
2	PERSON:RZ	Directed	MOVIE:BTTF	<NA>
3	PERSON:KR	ActedIn	MOVIE:TM	CHARACTER:Neo
4	PERSON:MJF	ActedIn	MOVIE:BTTF	CHARACTER:MartyMcFly
5	PERSON:HW	ActedIn	MOVIE:TM	CHARACTER:AgentSmith

[20]:

node_collection.insert(nodes_df.to_dict(orient="records"))

[21]:

# TODO: use type designator in validation
edge_collection.metadata.validate_modifications = True

[22]:

edge_collection.insert(edges_df.to_dict(orient="records"))

[23]:

plt.figure(figsize=(15, 10))
draw_neo4j_graph()
plt.title("Visualization of whole database", fontsize=16)
plt.axis('off')
plt.tight_layout()
plt.show()

The above display is a little cluttered.

If you are running this notebook locally you can head over to your neo4j and explore via interactive views like this:

Queries

The graph can be queried the same as any other database in linkml-store. Note that at this time we don’t expose a graph-oriented API, we instead expose everything as node or edge collections. This has some advantages and disadvantages. The goal is not to replace something like Cypher (you can and should of course just do Cypher queries over the underlying neoj4 database if and when you need to).

Let’s start by querying for the entire contents of both collections:

[24]:

node_collection.find().rows_dataframe

[24]:

	name	id	category	born	released
0	Forest Gump	MOVIE:FG	Movie	NaN	NaN
1	Forest Gump (Character)	CHARACTER:FGC	Character	NaN	NaN
2	Tom Hanks	PERSON:TH	Actor	1956.0	NaN
3	Robert Zemeckis	PERSON:RZ	Director	1951.0	NaN
4	Hugo Weaving	PERSON:HW	Actor	1960.0	NaN
5	Keanu Reeves	PERSON:KR	Actor	1964.0	NaN
6	Michael J. Fox	PERSON:MJF	Actor	1961.0	NaN
7	Forrest Gump	MOVIE:FG	Movie	NaN	1994.0
8	Back to the Future	MOVIE:BTTF	Movie	NaN	1985.0
9	Tom Hanks	ACTOR:TH	Actor	NaN	NaN
10	The Matrix	MOVIE:TM	Movie	NaN	1999.0
11	Robert Zemekis	ACTOR:RZ	Director	NaN	NaN
12	The Matrix Reloaded	MOVIE:TM2	Movie	NaN	2003.0
13	The Matrix Revolutions	MOVIE:TM3	Movie	NaN	2003.0

[25]:

edge_collection.find().rows_dataframe

[25]:

	subject	predicate	object	plays
0	ACTOR:RZ	Directed	MOVIE:FG	NaN
1	ACTOR:TH	ActedIn	MOVIE:FG	CHARACTER:FGC
2	PERSON:TH	ActedIn	MOVIE:FG	CHARACTER:Forrest
3	PERSON:TH	ActedIn	MOVIE:FG	CHARACTER:Forrest
4	PERSON:RZ	Directed	MOVIE:FG	NaN
5	PERSON:RZ	Directed	MOVIE:FG	NaN
6	PERSON:RZ	Directed	MOVIE:BTTF	NaN
7	PERSON:KR	ActedIn	MOVIE:TM	CHARACTER:Neo
8	PERSON:MJF	ActedIn	MOVIE:BTTF	CHARACTER:MartyMcFly
9	PERSON:HW	ActedIn	MOVIE:TM	CHARACTER:AgentSmith

We can also use standard key-value mongodb-like queries; over nodes:

[26]:

node_collection.find({"category": "Actor"}).rows_dataframe

[26]:

	born	name	id	category
0	1956.0	Tom Hanks	PERSON:TH	Actor
1	1960.0	Hugo Weaving	PERSON:HW	Actor
2	1964.0	Keanu Reeves	PERSON:KR	Actor
3	1961.0	Michael J. Fox	PERSON:MJF	Actor
4	NaN	Tom Hanks	ACTOR:TH	Actor

over edges:

[27]:

edge_collection.find({"predicate": "ActedIn"}).rows_dataframe

[27]:

	subject	predicate	object	plays
0	ACTOR:TH	ActedIn	MOVIE:FG	CHARACTER:FGC
1	PERSON:TH	ActedIn	MOVIE:FG	CHARACTER:Forrest
2	PERSON:TH	ActedIn	MOVIE:FG	CHARACTER:Forrest
3	PERSON:KR	ActedIn	MOVIE:TM	CHARACTER:Neo
4	PERSON:MJF	ActedIn	MOVIE:BTTF	CHARACTER:MartyMcFly
5	PERSON:HW	ActedIn	MOVIE:TM	CHARACTER:AgentSmith

Facet Counts

We can also do basic analytic queries such as Solr-style facet grouping and counting.

We’ll make use of a lightweight function that converts linkml-store facet counts to dataframes:

[28]:

from linkml_store.utils.pandas_utils import facet_summary_to_dataframe_unmelted

[29]:

r = node_collection.query_facets(facet_columns=["category"])
facet_summary_to_dataframe_unmelted(r)

[29]:

	category	Value
0	Movie	6
1	Actor	5
2	Director	2
3	Character	1

[30]:

# TODO: implement facet counts for edges
#r = edge_collection.query_facets(facet_columns=["subject", "predicate", "object"])
#facet_summary_to_dataframe_unmelted(r)

Indexing via LLM embeddings

Neo4j can be indexed and search via LLM embeddings, just like any other linkml-store wrapped database.

TODO show examples

[ ]: