How to use Neo4J with LinkML-Store

This how-to guide shows you how to use linkml-store over a Neo4j database. linkml-store is a flexible framework that can be used with different backends, employing simple mappings via LinkML schemas.

One of the main use cases here is providing a schema for validating a property graph database, as well as making it easier to map between a property graph database and other representations (tabular/dataframe, JSON/YAML, RDF, RDF-star, OWL, …).

We will try to use some of the same examples as the Neo4j graph concepts guide, building a simple movie-oriented KG:

neo4j graph concepts

Running this how-to guide interactively

This guide is both documentation and a Jupyter notebook.

You can run the notebook locally. You will need a local neo4j database running, with authentication disabled.

One way to do this is via Docker:

docker run \
  --name myneo4j \
  --publish 7474:7474 \
  --publish 7687:7687 \
  --volume $HOME/neo4j/data/:/data \
  --volume $HOME/neo4j/logs/:/logs \
  -e NEO4J_AUTH=none \
  neo4j

IMPORTANT NOTE: if you run this notebook locally, it will overwrite the default neo4j database. If you have the Enterprise Edition, you can modify this notebook to use a different database.
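
Before running the cells that follow, it can help to confirm that the container is accepting connections. Here is a minimal sketch using only the Python standard library. `port_is_open` is a hypothetical helper (it is not part of linkml-store), and the host and ports are simply the ones published by the `docker run` command above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# 7474 is the HTTP (browser) port and 7687 the Bolt port published above.
for port in (7474, 7687):
    status = "open" if port_is_open("localhost", port) else "closed"
    print(f"localhost:{port} is {status}")
```

An open port only tells you that something is listening; it does not prove that the database has finished starting up.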

[1]:
from linkml_store import Client

client = Client()

Ensure the database is empty, then connect

[2]:
db = client.attach_database("neo4j", "test")
db.drop()
db = client.attach_database("neo4j", "test")

Using a schema

It is not necessary to define a LinkML schema in advance, but doing so has a lot of advantages:

  • validation

  • clarity for producers and consumers

  • declarative mappings and graph projections

We’ll use a predefined one here. You can see the source schemasheet here

[3]:
db.load_schema_view("input/movies_kg/schema.yaml")

Next we’ll visualize this using PlantUML

[4]:
%%bash
gen-plantuml --directory input/movies_kg/diagrams input/movies_kg/schema.yaml
WARNING:root:File "schema.yaml", line 36, col 15: Unrecognized prefix: rdf

UML Schema

Note that this is in many respects a “normal” UML type schema, with classes and class relationships.

Here we define two base classes, Node and Edge. We are treating edges as first-class entities in the model. We are following conventions and using subject and object for the head and tail nodes.

You don’t need to define your schema this way. In fact it’s possible to map any LinkML schema to a property graph. You don’t need classes like “Node” and “Edge”, and you don’t need a reification type model.

But having this kind of “explicit” model makes the relationship to Neo4j easier to grok.
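
To make the reified model a little more concrete, here is a rough Python analogue of the two base classes. It is only a sketch: the dataclasses are hypothetical stand-ins (the real model lives in schema.yaml), and the field names are the ones that appear in the example records used later in this guide.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Node:
    """A node in the property graph, identified by a CURIE-like id."""
    id: str
    category: str
    name: Optional[str] = None


@dataclass
class Edge:
    """A first-class edge: subject and object hold node identifiers."""
    subject: str
    predicate: str
    object: str
    plays: Optional[str] = None  # an example of an edge property


actor = Node(id="ACTOR:TH", category="Actor", name="Tom Hanks")
movie = Node(id="MOVIE:FG", category="Movie", name="Forrest Gump")
acted_in = Edge(subject=actor.id, predicate="ActedIn", object=movie.id)

# Each instance flattens to a plain dict, similar in shape to the records
# used in the rest of this guide.
print(asdict(acted_in))
```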

Create Node and Edge collections

We create two collections, one for nodes, and one for edges.

Note: you can organize collections how you like - you could have multiple node collections, (e.g. different collections for persons, movies), and multiple edge collections. However, you can’t mix nodes and edges in one collection.

[5]:
node_collection = db.create_collection("Node", alias="nodes", recreate_if_exists=True)
edge_collection = db.create_collection("Edge", alias="edges", recreate_if_exists=True)

It’s necessary to provide some kind of mechanism to indicate which collections are edge collections. This can be inferred from the schema, but here we will make it explicit.

We’ll use the default edge projection, which assigns special meaning to subject, predicate, and object.

[6]:
from linkml_store.graphs.graph_map import EdgeProjection

edge_collection.metadata.graph_projection = EdgeProjection()
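
Conceptually, an edge projection says which fields of a record name the two endpoints and the relationship type, with everything else treated as an edge property. The sketch below illustrates that idea in plain Python. `project_edge` and its keyword defaults are hypothetical and are not linkml-store's implementation.

```python
from typing import Any, Dict, Tuple


def project_edge(
    record: Dict[str, Any],
    subject_key: str = "subject",
    predicate_key: str = "predicate",
    object_key: str = "object",
) -> Tuple[str, str, str, Dict[str, Any]]:
    """Split an edge record into (subject, predicate, object, properties)."""
    special = {subject_key, predicate_key, object_key}
    # Anything that is not one of the three special fields is an edge property.
    properties = {k: v for k, v in record.items() if k not in special}
    return record[subject_key], record[predicate_key], record[object_key], properties


edge = {
    "subject": "ACTOR:TH",
    "predicate": "ActedIn",
    "object": "MOVIE:FG",
    "plays": "CHARACTER:FGC",
}
print(project_edge(edge))
# ('ACTOR:TH', 'ActedIn', 'MOVIE:FG', {'plays': 'CHARACTER:FGC'})
```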

Adding Data

[7]:
# For convenience, we assign constants for each of the identifiers in the database
TH = "ACTOR:TH"
RZ = "ACTOR:RZ"
FG = "MOVIE:FG"
BTTF = "MOVIE:BTTF"
FGC = "CHARACTER:FGC"
[8]:
persons = [
    { "id": TH, "category": "Actor", "name": "Tom Hanks"},
    { "id": RZ, "category": "Director", "name": "Robert Zemekis"},
]
movies = [
    { "id": FG, "category": "Movie", "name": "Forest Gump"},
]
characters = [
    { "id": FGC,"category": "Character", "name": "Forest Gump (Character)"},
]

[9]:
edges = [
    {"subject": RZ, "predicate": "Directed", "object": FG},
    {"subject": TH, "predicate": "ActedIn", "object": FG, "plays": FGC},
]
[10]:
# Modify this to experiment with different ways of adding
ADD_INDIVIDUALLY = True
[11]:
if ADD_INDIVIDUALLY:
    node_collection.insert(persons)
    node_collection.insert(movies)
    node_collection.insert(characters)
    edge_collection.insert(edges)
else:
    g = {
        "nodes": persons + movies + characters,
        "edges": edges
    }
    db.insert(g)
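
One way to read the `else` branch is that each top-level key of `g` names a collection and its value is the list of records for that collection. The toy below sketches that reading with an in-memory dict standing in for the database. It is an assumption about the intent rather than a description of linkml-store internals, and `insert_graph` and `store` are hypothetical names.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def insert_graph(graph: Dict[str, List[Record]], store: Dict[str, List[Record]]) -> None:
    """Append each list of records to the toy collection named by its key."""
    for alias, records in graph.items():
        if alias not in store:
            raise KeyError(f"no collection with alias {alias!r}")
        store[alias].extend(records)


# Toy in-memory "database": one list per collection alias.
store: Dict[str, List[Record]] = {"nodes": [], "edges": []}

g = {
    "nodes": [
        {"id": "ACTOR:TH", "category": "Actor", "name": "Tom Hanks"},
        {"id": "MOVIE:FG", "category": "Movie", "name": "Forest Gump"},
    ],
    "edges": [
        {"subject": "ACTOR:TH", "predicate": "ActedIn", "object": "MOVIE:FG"},
    ],
}
insert_graph(g, store)
print({alias: len(rows) for alias, rows in store.items()})
```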

Visualization

We will visualize the whole database using py2neo, networkx, and matplotlib. This is not as visually attractive as the native neo4j UI but it works better in a notebook:

[12]:
import matplotlib.pyplot as plt
from linkml_store.utils.neo4j_utils import draw_neo4j_graph

plt.figure(figsize=(15, 10))
draw_neo4j_graph()
plt.title("Visualization of simple subset", fontsize=16)
plt.axis('off')
plt.tight_layout()
plt.show()
../_images/how-to_Use-Neo4j_21_0.png

Note this doesn’t display edge properties (e.g. the “plays” edge property between Tom Hanks and Forrest Gump).

Exploring in Neo4j

If you are running this locally, you can head over to http://localhost:7474/browser/ to see this in the normal UI:

neo4j screenshot of single edge

Validation

One of the main use cases for LinkML and LinkML-Store is validation using a schema. Most of the rich features of LinkML validation are available for property graphs; we will illustrate some simple ones here.

Validating an existing database

First we’ll validate the database. Note that by default, validation is retrospective; it’s possible to insert invalid data into the database.

[13]:
errs = list(db.iter_validate_database())
errs
[13]:
[]

Good, no errors!

Prospective Validation

We can switch on prospective validation, to prevent any future modifications making the data invalid:

[14]:
edge_collection.metadata.validate_modifications = True

Let’s test this out by inserting deliberately invalid data – making up an edge property not defined in the schema

[15]:
try:
    edge_collection.insert({ "subject": RZ, "predicate": "Directed", "object": BTTF, "date": 1985})
except Exception as e:
    print("Got an exception (we expect this!!!)")
    print(e)
Got an exception (we expect this!!!)
Validation errors: [ValidationResult(type='jsonschema validation', severity=<Severity.ERROR: 'ERROR'>, message="Additional properties are not allowed ('date' was unexpected) in /", instance={'subject': 'ACTOR:RZ', 'predicate': 'Directed', 'object': 'MOVIE:BTTF', 'date': 1985}, instance_index=0, instantiates='Directed', context=[])]
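
The difference between the two modes can be sketched with a toy in-memory collection. This is only an illustration under simplifying assumptions: `ToyCollection` is hypothetical, and its only rule is a closed set of allowed field names, which is far less than real LinkML validation does.

```python
from typing import Any, Dict, Iterator, List, Set

Record = Dict[str, Any]


class ToyCollection:
    """In-memory stand-in for a collection with a closed set of allowed fields."""

    def __init__(self, allowed_fields: Set[str], validate_modifications: bool = False):
        self.allowed_fields = allowed_fields
        self.validate_modifications = validate_modifications
        self.rows: List[Record] = []

    def _errors(self, record: Record) -> List[str]:
        extra = sorted(set(record) - self.allowed_fields)
        return [f"unexpected field: {name}" for name in extra]

    def insert(self, record: Record) -> None:
        # Prospective validation: reject invalid records before storing them.
        if self.validate_modifications:
            errors = self._errors(record)
            if errors:
                raise ValueError(f"Validation errors: {errors}")
        self.rows.append(record)

    def iter_validate(self) -> Iterator[str]:
        # Retrospective validation: report problems in records already stored.
        for record in self.rows:
            yield from self._errors(record)


toy_edges = ToyCollection({"subject", "predicate", "object", "plays"})
bad = {"subject": "ACTOR:RZ", "predicate": "Directed", "object": "MOVIE:BTTF", "date": 1985}

toy_edges.insert(bad)                     # accepted: validation is off
print(list(toy_edges.iter_validate()))    # the problem is only found afterwards

toy_edges.validate_modifications = True
try:
    toy_edges.insert(bad)                 # now rejected up front
except ValueError as e:
    print(e)
```

Note that switching on prospective validation does not clean up records that were stored earlier, so a retrospective pass is still useful.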

We can also test referential integrity. By default, inserting an edge that points to a non-existent node will create a dangling node in Neo4j (Neo4j enforces that there must be some kind of node).

We will change the policy and try to insert a dangling edge:

[16]:
from linkml_store.api.stores.neo4j.neo4j_collection import DeletePolicy

edge_collection.delete_policy = DeletePolicy.ERROR

try:
    edge_collection.insert({ "subject": RZ, "predicate": "Directed", "object": "MOVIE:MADE_UP_NODE"})
except Exception as e:
    print("Got an exception (we expect this!!!)")
    print(e)
Got an exception (we expect this!!!)
Node with identifier MOVIE:MADE_UP_NODE not found in the database.
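
A similar check can be run client-side before anything is sent to the database. The following is a minimal sketch: `find_dangling` is a hypothetical helper, and it assumes that only `subject` and `object` need to resolve to existing node identifiers.

```python
from typing import Any, Dict, Iterable, List, Set


def find_dangling(edges: Iterable[Dict[str, Any]], node_ids: Set[str]) -> List[str]:
    """Return endpoint ids used by the edges but absent from node_ids, in first-seen order."""
    missing: List[str] = []
    for edge in edges:
        for key in ("subject", "object"):
            ref = edge[key]
            if ref not in node_ids and ref not in missing:
                missing.append(ref)
    return missing


node_ids = {"ACTOR:TH", "ACTOR:RZ", "MOVIE:FG", "CHARACTER:FGC"}
candidate_edges = [
    {"subject": "ACTOR:RZ", "predicate": "Directed", "object": "MOVIE:FG"},
    {"subject": "ACTOR:RZ", "predicate": "Directed", "object": "MOVIE:MADE_UP_NODE"},
]
print(find_dangling(candidate_edges, node_ids))
# ['MOVIE:MADE_UP_NODE']
```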

Loading Data from a file

Loading data into a graph is very much like loading any other data into a LinkML schema. So long as you conform to the schema we defined above (which has defined graph projections), then it can be loaded into Neo4J.

Data could be combined in a single JSON file, but it’s also convenient to use tabular or dataframe-oriented formats.

Here we will load from CSV

[17]:
import pandas as pd

[18]:
nodes_df = pd.read_csv("input/movies_kg/nodes.csv").convert_dtypes()
nodes_df
[18]:
    id          category  name                    born  released
0   PERSON:TH   Actor     Tom Hanks               1956  <NA>
1   PERSON:RZ   Director  Robert Zemeckis         1951  <NA>
2   PERSON:HW   Actor     Hugo Weaving            1960  <NA>
3   PERSON:KR   Actor     Keanu Reeves            1964  <NA>
4   PERSON:MJF  Actor     Michael J. Fox          1961  <NA>
5   MOVIE:FG    Movie     Forrest Gump            <NA>  1994
6   MOVIE:TM    Movie     The Matrix              <NA>  1999
7   MOVIE:TM2   Movie     The Matrix Reloaded     <NA>  2003
8   MOVIE:TM3   Movie     The Matrix Revolutions  <NA>  2003
9   MOVIE:BTTF  Movie     Back to the Future      <NA>  1985
[19]:
edges_df = pd.read_csv("input/movies_kg/edges.csv").convert_dtypes()
edges_df
[19]:
    subject     predicate  object      plays
0   PERSON:TH   ActedIn    MOVIE:FG    CHARACTER:Forrest
1   PERSON:RZ   Directed   MOVIE:FG    <NA>
2   PERSON:RZ   Directed   MOVIE:BTTF  <NA>
3   PERSON:KR   ActedIn    MOVIE:TM    CHARACTER:Neo
4   PERSON:MJF  ActedIn    MOVIE:BTTF  CHARACTER:MartyMcFly
5   PERSON:HW   ActedIn    MOVIE:TM    CHARACTER:AgentSmith
[20]:
node_collection.insert(nodes_df.to_dict(orient="records"))
[21]:
# TODO: use type designator in validation
edge_collection.metadata.validate_modifications = True
[22]:
edge_collection.insert(edges_df.to_dict(orient="records"))
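
If you would rather not depend on pandas, the same list-of-dicts shape can be produced with the standard library. The sketch below makes a few assumptions: the inline text is a two-row excerpt in the same column layout as the nodes table shown above, the file is comma-delimited with a header row, and empty cells are dropped rather than stored as nulls. `rows_to_records` and `INT_COLUMNS` are hypothetical names.

```python
import csv
import io
from typing import Any, Dict, List

# Two-row excerpt in the same column layout as the nodes table (assumed CSV layout).
CSV_TEXT = """id,category,name,born,released
PERSON:TH,Actor,Tom Hanks,1956,
MOVIE:FG,Movie,Forrest Gump,,1994
"""

# Columns to convert from text to integers.
INT_COLUMNS = {"born", "released"}


def rows_to_records(text: str) -> List[Dict[str, Any]]:
    """Parse CSV text into records, skipping empty cells and casting integer columns."""
    records: List[Dict[str, Any]] = []
    for row in csv.DictReader(io.StringIO(text)):
        record: Dict[str, Any] = {}
        for key, value in row.items():
            if value == "":
                continue  # leave the field out instead of storing a null
            record[key] = int(value) if key in INT_COLUMNS else value
        records.append(record)
    return records


for record in rows_to_records(CSV_TEXT):
    print(record)
```

Whether missing values should be dropped or kept as explicit nulls depends on your schema and backend; this shows just one option.
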
[23]:
plt.figure(figsize=(15, 10))
draw_neo4j_graph()
plt.title("Visualization of whole database", fontsize=16)
plt.axis('off')
plt.tight_layout()
plt.show()
../_images/how-to_Use-Neo4j_39_0.png

The above display is a little cluttered.

If you are running this notebook locally you can head over to your neo4j and explore via interactive views like this:

neo4j screenshot of more edges

Queries

The graph can be queried in the same way as any other database in linkml-store. Note that at this time we don’t expose a graph-oriented API; instead we expose everything as node or edge collections. This has some advantages and disadvantages. The goal is not to replace something like Cypher (you can and should, of course, run Cypher queries over the underlying Neo4j database if and when you need to).

Let’s start by querying for the entire contents of both collections:

[24]:
node_collection.find().rows_dataframe
[24]:
    name                     id             category   born    released
0   Forest Gump              MOVIE:FG       Movie      NaN     NaN
1   Forest Gump (Character)  CHARACTER:FGC  Character  NaN     NaN
2   Tom Hanks                PERSON:TH      Actor      1956.0  NaN
3   Robert Zemeckis          PERSON:RZ      Director   1951.0  NaN
4   Hugo Weaving             PERSON:HW      Actor      1960.0  NaN
5   Keanu Reeves             PERSON:KR      Actor      1964.0  NaN
6   Michael J. Fox           PERSON:MJF     Actor      1961.0  NaN
7   Forrest Gump             MOVIE:FG       Movie      NaN     1994.0
8   Back to the Future       MOVIE:BTTF     Movie      NaN     1985.0
9   Tom Hanks                ACTOR:TH       Actor      NaN     NaN
10  The Matrix               MOVIE:TM       Movie      NaN     1999.0
11  Robert Zemekis           ACTOR:RZ       Director   NaN     NaN
12  The Matrix Reloaded      MOVIE:TM2      Movie      NaN     2003.0
13  The Matrix Revolutions   MOVIE:TM3      Movie      NaN     2003.0
[25]:
edge_collection.find().rows_dataframe
[25]:
    subject     predicate  object      plays
0   ACTOR:RZ    Directed   MOVIE:FG    NaN
1   ACTOR:TH    ActedIn    MOVIE:FG    CHARACTER:FGC
2   PERSON:TH   ActedIn    MOVIE:FG    CHARACTER:Forrest
3   PERSON:TH   ActedIn    MOVIE:FG    CHARACTER:Forrest
4   PERSON:RZ   Directed   MOVIE:FG    NaN
5   PERSON:RZ   Directed   MOVIE:FG    NaN
6   PERSON:RZ   Directed   MOVIE:BTTF  NaN
7   PERSON:KR   ActedIn    MOVIE:TM    CHARACTER:Neo
8   PERSON:MJF  ActedIn    MOVIE:BTTF  CHARACTER:MartyMcFly
9   PERSON:HW   ActedIn    MOVIE:TM    CHARACTER:AgentSmith

We can also use standard key-value, MongoDB-like queries. Over nodes:

[26]:
node_collection.find({"category": "Actor"}).rows_dataframe
[26]:
    born    name            id          category
0   1956.0  Tom Hanks       PERSON:TH   Actor
1   1960.0  Hugo Weaving    PERSON:HW   Actor
2   1964.0  Keanu Reeves    PERSON:KR   Actor
3   1961.0  Michael J. Fox  PERSON:MJF  Actor
4   NaN     Tom Hanks       ACTOR:TH    Actor

over edges:

[27]:
edge_collection.find({"predicate": "ActedIn"}).rows_dataframe
[27]:
    subject     predicate  object      plays
0   ACTOR:TH    ActedIn    MOVIE:FG    CHARACTER:FGC
1   PERSON:TH   ActedIn    MOVIE:FG    CHARACTER:Forrest
2   PERSON:TH   ActedIn    MOVIE:FG    CHARACTER:Forrest
3   PERSON:KR   ActedIn    MOVIE:TM    CHARACTER:Neo
4   PERSON:MJF  ActedIn    MOVIE:BTTF  CHARACTER:MartyMcFly
5   PERSON:HW   ActedIn    MOVIE:TM    CHARACTER:AgentSmith
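
The semantics of these key-value queries are simple exact-match filters in which every key in the query must match. A plain-Python sketch over a list of records looks like this. The `find` function here is a hypothetical stand-in written for illustration, not the collection method used above.

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


def find(records: List[Record], where: Optional[Dict[str, Any]] = None) -> List[Record]:
    """Return records whose fields equal every key-value pair in `where`."""
    where = where or {}
    return [r for r in records if all(r.get(k) == v for k, v in where.items())]


edge_records = [
    {"subject": "PERSON:TH", "predicate": "ActedIn", "object": "MOVIE:FG", "plays": "CHARACTER:Forrest"},
    {"subject": "PERSON:RZ", "predicate": "Directed", "object": "MOVIE:FG"},
    {"subject": "PERSON:KR", "predicate": "ActedIn", "object": "MOVIE:TM", "plays": "CHARACTER:Neo"},
    {"subject": "PERSON:RZ", "predicate": "Directed", "object": "MOVIE:BTTF"},
]

# A single condition, then two conditions combined with AND.
print(find(edge_records, {"predicate": "ActedIn"}))
print(find(edge_records, {"predicate": "Directed", "object": "MOVIE:BTTF"}))
```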

Facet Counts

We can also do basic analytic queries such as Solr-style facet grouping and counting.

We’ll make use of a lightweight function that converts linkml-store facet counts to dataframes:

[28]:
from linkml_store.utils.pandas_utils import facet_summary_to_dataframe_unmelted
[29]:
r = node_collection.query_facets(facet_columns=["category"])
facet_summary_to_dataframe_unmelted(r)
[29]:
    category   Value
0   Movie      6
1   Actor      5
2   Director   2
3   Character  1
[30]:
# TODO: implement facet counts for edges
#r = edge_collection.query_facets(facet_columns=["subject", "predicate", "object"])
#facet_summary_to_dataframe_unmelted(r)
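
The idea behind a facet count is just grouping records by the value of a column and counting each group. As a rough client-side sketch (not a replacement for `query_facets`), this can be done over a list of records with `collections.Counter`. `facet_counts` is a hypothetical helper, and the sample records mirror the six rows of the edges table loaded from CSV.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple


def facet_counts(records: List[Dict[str, Any]], column: str) -> List[Tuple[Any, int]]:
    """Count how often each value of `column` occurs, most frequent first."""
    counts = Counter(r[column] for r in records if column in r)
    return counts.most_common()


edge_rows = [
    {"subject": "PERSON:TH", "predicate": "ActedIn", "object": "MOVIE:FG"},
    {"subject": "PERSON:RZ", "predicate": "Directed", "object": "MOVIE:FG"},
    {"subject": "PERSON:RZ", "predicate": "Directed", "object": "MOVIE:BTTF"},
    {"subject": "PERSON:KR", "predicate": "ActedIn", "object": "MOVIE:TM"},
    {"subject": "PERSON:MJF", "predicate": "ActedIn", "object": "MOVIE:BTTF"},
    {"subject": "PERSON:HW", "predicate": "ActedIn", "object": "MOVIE:TM"},
]
print(facet_counts(edge_rows, "predicate"))
# [('ActedIn', 4), ('Directed', 2)]
```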

Indexing via LLM embeddings

Neo4j can be indexed and searched via LLM embeddings, just like any other linkml-store wrapped database.

TODO show examples
