How to Check Referential Integrity
This example uses MongoDB
[1]:
from linkml_store import Client
client = Client()
[2]:
db = client.attach_database("mongodb://localhost:27017", "test-ri")
db.metadata.ensure_referential_integrity = True
db.set_schema_view("../../tests/input/countries/countries.linkml.yaml")
[5]:
countries_coll = db.create_collection("Country", alias="countries", recreate_if_exists=True)
routes_coll = db.create_collection("Route", alias="routes", recreate_if_exists=True)
[6]:
COUNTRIES = "../../tests/input/countries/countries.jsonl"
ROUTES = "../../tests/input/countries/routes.csv"
[7]:
from linkml_store.utils.format_utils import load_objects
countries = load_objects(COUNTRIES)
routes = load_objects(ROUTES)
routes
[7]:
[{'origin': 'DE', 'destination': 'FR', 'method': 'rail'}]
[8]:
countries_coll.insert(countries)
routes_coll.insert(routes)
[9]:
routes_coll.find().rows
[9]:
[{'origin': 'DE', 'destination': 'FR', 'method': 'rail'}]
[10]:
for result in db.iter_validate_database():
print(result)
Inserting invalid data
We will intentionally insert an invalid row
[11]:
routes_coll.insert({"origin": "ZZZ", "destination": "YYY", "method": "rail"})
[12]:
routes_coll.find().rows_dataframe
[12]:
origin | destination | method | |
---|---|---|---|
0 | DE | FR | rail |
1 | ZZZ | YYY | rail |
[13]:
results = list(db.iter_validate_database())
type='ReferentialIntegrity' severity=<Severity.ERROR: 'ERROR'> message='Referential integrity error: Country not found' instance='ZZZ' instance_index=None instantiates='Country' context=[] source=None
type='ReferentialIntegrity' severity=<Severity.ERROR: 'ERROR'> message='Referential integrity error: Country not found' instance='YYY' instance_index=None instantiates='Country' context=[] source=None
[14]:
assert any(r for r in results if "Referential integrity" in r.message)
[16]:
for result in results:
print("Expected error: ", result)
Expected error: type='ReferentialIntegrity' severity=<Severity.ERROR: 'ERROR'> message='Referential integrity error: Country not found' instance='ZZZ' instance_index=None instantiates='Country' context=[] source=None
Expected error: type='ReferentialIntegrity' severity=<Severity.ERROR: 'ERROR'> message='Referential integrity error: Country not found' instance='YYY' instance_index=None instantiates='Country' context=[] source=None
Command Line Example using DuckDB
We’ll next show a command line example; we will use DuckDB here and CSVs, but the same principles apply to other databases and formats.
First we’ll make two CSVs, one for patients and one for samples. The samples will refer to patients.
[81]:
PATIENTS = """id,name,age
p1,John Doe,34
p2,Jane Doe,65
"""
with open("output/patients.csv", "w") as stream:
stream.write(PATIENTS)
[82]:
SAMPLES = """id,patient
s1,p1
s2,p2
s3,p2
"""
with open("output/samples.csv", "w") as stream:
stream.write(SAMPLES)
Note this dataset is well-behaved, every sample refers to a patient.
There is one issue with the data though, and that is that the default loader doesn’t perform ptype inference, so the ages will be treated as strings.
Next we’ll add a schema file
[86]:
SCHEMA = """
id: http://example.org/patients
name: patients
description: Patients and samples
prefixes:
linkml: http://w3id.org/linkml/
ex: http://example.org/
default_prefix: ex
imports:
- linkml:types
classes:
Sample:
attributes:
id:
identifier: true
patient:
range: Patient
Patient:
attributes:
id:
identifier: true
name:
required: true
age:
range: integer
"""
with open("output/patients.linkml.yaml", "w") as stream:
stream.write(SCHEMA)
Load data into DuckDB
We’ll first clear any older databases we may have created
[87]:
from pathlib import Path
Path("output/patient_samples.ddb").unlink(missing_ok=True)
Then we’ll load the data
[88]:
%%bash
linkml-store \
-d output/patient_samples.ddb \
-c Patient \
insert output/patients.csv
Inserted 2 objects from output/patients.csv into collection 'Patient'.
[89]:
%%bash
linkml-store \
-d output/patient_samples.ddb \
-c Sample \
insert output/samples.csv
Inserted 3 objects from output/samples.csv into collection 'Sample'.
Check Referential Integrity (no RI)
We don’t expect any referential integrity issues here
[90]:
%%bash
linkml-store --schema output/patients.linkml.yaml -d output/patient_samples.ddb validate -O csv
type,severity,message,instance,instance_index,instantiates,context
jsonschema validation,ERROR,"'34' is not of type 'integer', 'null' in /age","{'id': 'p1', 'name': 'John Doe', 'age': '34'}",0,Patient,[]
jsonschema validation,ERROR,"'65' is not of type 'integer', 'null' in /age","{'id': 'p2', 'name': 'Jane Doe', 'age': '65'}",0,Patient,[]
Adding dangling references
We’ll deliberately add a sample that refers to a non-existent patient
[91]:
%%bash
linkml-store \
-d output/patient_samples.ddb \
-c Sample \
insert --object '{"id": "s4", "patient": "p3"}'
Inserted 1 objects from {"id": "s4", "patient": "p3"} into collection 'Sample'.
And then re-validate
[92]:
%%bash
linkml-store --schema output/patients.linkml.yaml --set ensure_referential_integrity=true -d output/patient_samples.ddb validate -O csv
type,severity,message,instance,instance_index,instantiates,context
jsonschema validation,ERROR,"'34' is not of type 'integer', 'null' in /age","{'id': 'p1', 'name': 'John Doe', 'age': '34'}",0,Patient,[]
jsonschema validation,ERROR,"'65' is not of type 'integer', 'null' in /age","{'id': 'p2', 'name': 'Jane Doe', 'age': '65'}",0,Patient,[]
ReferentialIntegrity,ERROR,Referential integrity error: Patient not found,p3,,Patient,[]
[ ]: