How to Check Referential Integrity

This example uses MongoDB

[1]:
from linkml_store import Client

client = Client()
[2]:
db = client.attach_database("mongodb://localhost:27017", "test-ri")
db.metadata.ensure_referential_integrity = True
db.set_schema_view("../../tests/input/countries/countries.linkml.yaml")

[5]:
countries_coll = db.create_collection("Country", alias="countries", recreate_if_exists=True)
routes_coll = db.create_collection("Route", alias="routes", recreate_if_exists=True)
[6]:
COUNTRIES = "../../tests/input/countries/countries.jsonl"
ROUTES = "../../tests/input/countries/routes.csv"
[7]:
from linkml_store.utils.format_utils import load_objects

countries = load_objects(COUNTRIES)
routes = load_objects(ROUTES)
routes
[7]:
[{'origin': 'DE', 'destination': 'FR', 'method': 'rail'}]
[8]:
countries_coll.insert(countries)
routes_coll.insert(routes)
[9]:
routes_coll.find().rows
[9]:
[{'origin': 'DE', 'destination': 'FR', 'method': 'rail'}]
[10]:
for result in db.iter_validate_database():
    print(result)

Inserting invalid data

We will intentionally insert an invalid row

[11]:
routes_coll.insert({"origin": "ZZZ", "destination": "YYY", "method": "rail"})
[12]:
routes_coll.find().rows_dataframe
[12]:
origin destination method
0 DE FR rail
1 ZZZ YYY rail
[13]:
results = list(db.iter_validate_database())
type='ReferentialIntegrity' severity=<Severity.ERROR: 'ERROR'> message='Referential integrity error: Country not found' instance='ZZZ' instance_index=None instantiates='Country' context=[] source=None
type='ReferentialIntegrity' severity=<Severity.ERROR: 'ERROR'> message='Referential integrity error: Country not found' instance='YYY' instance_index=None instantiates='Country' context=[] source=None
[14]:
assert any(r for r in results if "Referential integrity" in r.message)
[16]:
for result in results:
    print("Expected error: ", result)
Expected error:  type='ReferentialIntegrity' severity=<Severity.ERROR: 'ERROR'> message='Referential integrity error: Country not found' instance='ZZZ' instance_index=None instantiates='Country' context=[] source=None
Expected error:  type='ReferentialIntegrity' severity=<Severity.ERROR: 'ERROR'> message='Referential integrity error: Country not found' instance='YYY' instance_index=None instantiates='Country' context=[] source=None

Command Line Example using DuckDB

We’ll next show a command line example; we will use DuckDB here and CSVs, but the same principles apply to other databases and formats.

First we’ll make two CSVs, one for patients and one for samples. The samples will refer to patients.

[81]:
PATIENTS = """id,name,age
p1,John Doe,34
p2,Jane Doe,65
"""
with open("output/patients.csv", "w") as stream:
    stream.write(PATIENTS)
[82]:
SAMPLES = """id,patient
s1,p1
s2,p2
s3,p2
"""
with open("output/samples.csv", "w") as stream:
    stream.write(SAMPLES)

Note this dataset is well-behaved, every sample refers to a patient.

There is one issue with the data though, and that is that the default loader doesn’t perform ptype inference, so the ages will be treated as strings.

Next we’ll add a schema file

[86]:
SCHEMA = """
id: http://example.org/patients
name: patients
description: Patients and samples
prefixes:
  linkml: http://w3id.org/linkml/
  ex: http://example.org/
default_prefix: ex
imports:
  - linkml:types
classes:
  Sample:
    attributes:
      id:
        identifier: true
      patient:
        range: Patient

  Patient:
    attributes:
      id:
        identifier: true
      name:
        required: true
      age:
        range: integer
"""
with open("output/patients.linkml.yaml", "w") as stream:
    stream.write(SCHEMA)

Load data into DuckDB

We’ll first clear any older databases we may have created

[87]:
from pathlib import Path

Path("output/patient_samples.ddb").unlink(missing_ok=True)

Then we’ll load the data

[88]:
%%bash
linkml-store \
   -d output/patient_samples.ddb \
   -c Patient \
   insert output/patients.csv

Inserted 2 objects from output/patients.csv into collection 'Patient'.
[89]:
%%bash
linkml-store \
  -d output/patient_samples.ddb \
  -c Sample \
  insert output/samples.csv
Inserted 3 objects from output/samples.csv into collection 'Sample'.

Check Referential Integrity (no RI)

We don’t expect any referential integrity issues here

[90]:
%%bash
linkml-store --schema output/patients.linkml.yaml -d output/patient_samples.ddb validate -O csv
type,severity,message,instance,instance_index,instantiates,context
jsonschema validation,ERROR,"'34' is not of type 'integer', 'null' in /age","{'id': 'p1', 'name': 'John Doe', 'age': '34'}",0,Patient,[]
jsonschema validation,ERROR,"'65' is not of type 'integer', 'null' in /age","{'id': 'p2', 'name': 'Jane Doe', 'age': '65'}",0,Patient,[]

Adding dangling references

We’ll deliberately add a sample that refers to a non-existent patient

[91]:
%%bash
linkml-store \
  -d output/patient_samples.ddb \
  -c Sample \
  insert --object '{"id": "s4", "patient": "p3"}'
Inserted 1 objects from {"id": "s4", "patient": "p3"} into collection 'Sample'.

And then re-validate

[92]:
%%bash
linkml-store --schema output/patients.linkml.yaml --set ensure_referential_integrity=true -d output/patient_samples.ddb validate -O csv
type,severity,message,instance,instance_index,instantiates,context
jsonschema validation,ERROR,"'34' is not of type 'integer', 'null' in /age","{'id': 'p1', 'name': 'John Doe', 'age': '34'}",0,Patient,[]
jsonschema validation,ERROR,"'65' is not of type 'integer', 'null' in /age","{'id': 'p2', 'name': 'Jane Doe', 'age': '65'}",0,Patient,[]
ReferentialIntegrity,ERROR,Referential integrity error: Patient not found,p3,,Patient,[]

[ ]: