Python API Tutorial¶
While linkml-data-qc is primarily a CLI tool, it also provides a Python API for programmatic access.
Core Classes¶
The main classes are:
- ComplianceAnalyzer - Analyzes data files for compliance
- SchemaIntrospector - Extracts recommended slots from schemas
- QCConfig - Configuration for weights and thresholds
- Formatters (JSONFormatter, CSVFormatter, TextFormatter) - Output formatting
Setup¶
First, let's create the same test data as in the CLI tutorial:
%%bash
# Create test schema
cat > /tmp/disease_schema.yaml << 'EOF'
id: https://example.org/disease
name: disease_schema
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
default_range: string
classes:
  Disease:
    attributes:
      id:
        identifier: true
      name:
        required: true
      description:
        recommended: true
      synonyms:
        multivalued: true
        recommended: true
      ontology_id:
        recommended: true
EOF

# Create test data
cat > /tmp/disease_good.yaml << 'EOF'
id: DISEASE:001
name: Asthma
description: A chronic respiratory condition
synonyms:
  - bronchial asthma
ontology_id: MONDO:0004979
EOF

cat > /tmp/disease_poor.yaml << 'EOF'
id: DISEASE:002
name: Unknown Disease
EOF

echo "Test files created!"
Test files created!
Basic Analysis¶
from linkml_data_qc import ComplianceAnalyzer
# Create an analyzer with your schema
analyzer = ComplianceAnalyzer("/tmp/disease_schema.yaml")
# Analyze a file
report = analyzer.analyze_file("/tmp/disease_good.yaml", "Disease")
print(f"Global Compliance: {report.global_compliance}%")
print(f"Total Checks: {report.total_checks}")
print(f"Total Populated: {report.total_populated}")
Global Compliance: 100.0%
Total Checks: 3
Total Populated: 3
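The figures above are consistent with a simple ratio: each recommended slot on an object counts as one check, and compliance is the share of checks that are populated. The sketch below reproduces that arithmetic with plain dictionaries. The `is_populated` and `compliance` helpers are hypothetical illustrations, not part of linkml-data-qc, and the rule for what counts as "populated" is an assumption.

```python
def is_populated(value):
    """Assumed rule: None, empty strings, and empty collections count as missing."""
    if value is None:
        return False
    if isinstance(value, (str, list, tuple, dict, set)):
        return len(value) > 0
    return True


def compliance(record, recommended_slots):
    """Return (populated, total, percentage) for one record."""
    total = len(recommended_slots)
    populated = sum(1 for slot in recommended_slots if is_populated(record.get(slot)))
    # Assumption: a record with no recommended slots is treated as fully compliant.
    percentage = 100.0 * populated / total if total else 100.0
    return populated, total, percentage


recommended = ["description", "synonyms", "ontology_id"]
good = {
    "id": "DISEASE:001",
    "name": "Asthma",
    "description": "A chronic respiratory condition",
    "synonyms": ["bronchial asthma"],
    "ontology_id": "MONDO:0004979",
}
poor = {"id": "DISEASE:002", "name": "Unknown Disease"}

print(compliance(good, recommended))  # (3, 3, 100.0)
print(compliance(poor, recommended))  # (0, 3, 0.0)
```

Required slots such as `name` are deliberately left out of the count here, matching the three checks reported for the Disease class.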
Accessing Detailed Results¶
# Summary by slot shows compliance for each recommended field
print("Summary by Slot:")
for slot, percentage in report.summary_by_slot.items():
    print(f" {slot}: {percentage}%")
Summary by Slot:
description: 100.0%
synonyms: 100.0%
ontology_id: 100.0%
# Path scores show per-object compliance
print("Path Scores:")
for ps in report.path_scores:
    print(f" {ps.path}: {ps.overall_percentage}%")
    for ss in ps.slot_scores:
        print(f" {ss.slot_name}: {ss.populated}/{ss.total} ({ss.percentage}%)")
Path Scores:
(root): 100.0%
description: 1/1 (100.0%)
synonyms: 1/1 (100.0%)
ontology_id: 1/1 (100.0%)
Schema Introspection¶
from linkml_data_qc import SchemaIntrospector
introspector = SchemaIntrospector("/tmp/disease_schema.yaml")
# Get all recommended slots in schema
print(f"Recommended slots: {introspector.recommended_slots}")
# Get class-specific info
class_info = introspector.get_class_slots("Disease")
print(f"\nDisease class recommended: {class_info.recommended_slots}")
Recommended slots: {'ontology_id', 'synonyms', 'description'}
Disease class recommended: ['description', 'synonyms', 'ontology_id']
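To make the introspection result concrete, here is a minimal sketch of how recommended slots could be collected from a schema that has already been parsed into a dictionary. The `recommended_slots_for` helper is hypothetical and is not how `SchemaIntrospector` is implemented; it only looks at inline `attributes` and ignores inheritance, `slot_usage`, and top-level slot definitions.

```python
# The tutorial schema, written out as the dictionary a YAML parser would produce.
schema = {
    "classes": {
        "Disease": {
            "attributes": {
                "id": {"identifier": True},
                "name": {"required": True},
                "description": {"recommended": True},
                "synonyms": {"multivalued": True, "recommended": True},
                "ontology_id": {"recommended": True},
            }
        }
    }
}


def recommended_slots_for(schema, class_name):
    """Return names of inline attributes flagged `recommended: true`, in schema order."""
    attributes = schema["classes"][class_name].get("attributes") or {}
    return [
        name
        for name, spec in attributes.items()
        if (spec or {}).get("recommended") is True
    ]


print(recommended_slots_for(schema, "Disease"))
# ['description', 'synonyms', 'ontology_id']
```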
Using Configuration¶
from linkml_data_qc import QCConfig, SlotQCConfig

# Create configuration with weights and thresholds
config = QCConfig(
    default_weight=1.0,
    slots={
        "ontology_id": SlotQCConfig(weight=2.0, min_compliance=80.0),
        "description": SlotQCConfig(weight=0.5),
    }
)

# Create analyzer with configuration
analyzer = ComplianceAnalyzer("/tmp/disease_schema.yaml", config)
report = analyzer.analyze_file("/tmp/disease_good.yaml", "Disease")
print(f"Global Compliance: {report.global_compliance}%")
print(f"Weighted Compliance: {report.weighted_compliance}%")
Global Compliance: 100.0%
Weighted Compliance: 100.0%
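Because every slot is fully populated in this file, the weights have no visible effect. The sketch below shows how weights could shift the score when slots differ, assuming weighted compliance is a weighted mean of per-slot percentages. That formula is an assumption made for illustration; the library may compute it differently, and `weighted_compliance` here is a hypothetical helper, not the library function.

```python
def weighted_compliance(slot_percentages, weights, default_weight=1.0):
    """Weighted mean of per-slot percentages (assumed formula, for illustration)."""
    weighted_sum = 0.0
    total_weight = 0.0
    for slot, percentage in slot_percentages.items():
        weight = weights.get(slot, default_weight)
        weighted_sum += weight * percentage
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0


# Same weights as the configuration above; synonyms falls back to the default.
weights = {"ontology_id": 2.0, "description": 0.5}

all_populated = {"description": 100.0, "synonyms": 100.0, "ontology_id": 100.0}
missing_description = {"description": 0.0, "synonyms": 100.0, "ontology_id": 100.0}

print(weighted_compliance(all_populated, weights))  # 100.0
print(round(weighted_compliance(missing_description, weights), 2))  # 85.71
```

Under this assumed formula, a missing `description` costs less than a missing `ontology_id` would, because its weight is 0.5 rather than 2.0.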
Checking for Violations¶
# Analyze poor compliance file with strict threshold
config = QCConfig(
    slots={
        "description": SlotQCConfig(min_compliance=50.0),
    }
)

analyzer = ComplianceAnalyzer("/tmp/disease_schema.yaml", config)
report = analyzer.analyze_file("/tmp/disease_poor.yaml", "Disease")

if report.threshold_violations:
    print("Threshold Violations:")
    for v in report.threshold_violations:
        print(f" {v.path}.{v.slot_name}: {v.actual_compliance}% < {v.min_required}%")
else:
    print("No violations!")
No violations!
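A common next step is to turn violations into a process exit code, for example in a pre-commit hook. The sketch below relies only on the attributes used in the cell above (`threshold_violations`, `path`, `slot_name`, `actual_compliance`, `min_required`). The `violations_exit_code` helper is hypothetical, and the stand-in report is built from invented values so the example runs without the library.

```python
from types import SimpleNamespace


def violations_exit_code(report):
    """Print each violation and return 1 if any exist, otherwise 0."""
    violations = getattr(report, "threshold_violations", None) or []
    for v in violations:
        print(f"{v.path}.{v.slot_name}: {v.actual_compliance}% < {v.min_required}%")
    return 1 if violations else 0


# Stand-in objects with invented values, used here instead of a real report.
failing_report = SimpleNamespace(
    threshold_violations=[
        SimpleNamespace(
            path="(root)",
            slot_name="ontology_id",
            actual_compliance=40.0,
            min_required=80.0,
        )
    ]
)
clean_report = SimpleNamespace(threshold_violations=[])

print(violations_exit_code(failing_report))  # prints the violation, then 1
print(violations_exit_code(clean_report))    # 0
```

In a script, the return value could be passed to `sys.exit` so that a failed threshold stops the surrounding pipeline.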
Formatting Output¶
from linkml_data_qc import JSONFormatter, TextFormatter
analyzer = ComplianceAnalyzer("/tmp/disease_schema.yaml")
report = analyzer.analyze_file("/tmp/disease_good.yaml", "Disease")
# Text format
print(TextFormatter.format(report))
Compliance Report: /tmp/disease_good.yaml
Target Class: Disease
Global Compliance: 100.0% (3/3)
Weighted Compliance: 100.0%
Summary by Slot:
description: 100.0%
ontology_id: 100.0%
synonyms: 100.0%
Detailed Path Scores:
(root) (Disease): 100.0%
- description: OK
- synonyms: OK
- ontology_id: OK
# JSON format
import json
json_output = JSONFormatter.format(report)
print(json.dumps(json.loads(json_output), indent=2))
{
  "file_path": "/tmp/disease_good.yaml",
  "target_class": "Disease",
  "schema_path": "/tmp/disease_schema.yaml",
  "global_compliance": 100.0,
  "weighted_compliance": 100.0,
  "total_checks": 3,
  "total_populated": 3,
  "path_scores": [
    {
      "path": "(root)",
      "parent_class": "Disease",
      "item_count": 1,
      "slot_scores": [
        {
          "path": "(root)",
          "slot_name": "description",
          "populated": 1,
          "total": 1,
          "percentage": 100.0
        },
        {
          "path": "(root)",
          "slot_name": "synonyms",
          "populated": 1,
          "total": 1,
          "percentage": 100.0
        },
        {
          "path": "(root)",
          "slot_name": "ontology_id",
          "populated": 1,
          "total": 1,
          "percentage": 100.0
        }
      ],
      "overall_percentage": 100.0
    }
  ],
  "aggregated_scores": [],
  "threshold_violations": [],
  "summary_by_slot": {
    "description": 100.0,
    "synonyms": 100.0,
    "ontology_id": 100.0
  },
  "recommended_slots": [
    "ontology_id",
    "synonyms",
    "description"
  ],
  "config_path": null,
  "timestamp": "2025-12-06T20:10:19.589874"
}
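Since the JSON output is a plain document, downstream tools can consume it with the standard library alone. The sketch below reads the `summary_by_slot` field shown above and lists slots under a cutoff. The `slots_below` helper and the sample values are hypothetical; only the field names come from the output above.

```python
import json


def slots_below(report_json, cutoff):
    """Return the sorted names of slots whose compliance is below `cutoff`."""
    report = json.loads(report_json)
    summary = report.get("summary_by_slot", {})
    return sorted(slot for slot, pct in summary.items() if pct < cutoff)


# Invented sample that mirrors the structure of the report above.
sample = json.dumps(
    {
        "file_path": "/tmp/example.yaml",
        "summary_by_slot": {
            "description": 100.0,
            "synonyms": 40.0,
            "ontology_id": 0.0,
        },
    }
)

print(slots_below(sample, 50.0))  # ['ontology_id', 'synonyms']
```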
Multi-File Analysis¶
from linkml_data_qc import analyze_directory, create_multi_file_report

# Analyze all matching files in a directory
# Note: this pattern also matches disease_schema.yaml itself, which likely
# accounts for the third file counted in the output.
reports = analyze_directory(
    schema_path="/tmp/disease_schema.yaml",
    data_dir="/tmp",
    target_class="Disease",
    pattern="disease_*.yaml"
)

# Create aggregated report
multi_report = create_multi_file_report(reports)
print(f"Files Analyzed: {multi_report.files_analyzed}")
print(f"Overall Compliance: {multi_report.global_compliance}%")
print("\nSummary by Slot:")
for slot, pct in multi_report.summary_by_slot.items():
    print(f" {slot}: {pct}%")
Files Analyzed: 3
Overall Compliance: 33.33333333333333%

Summary by Slot:
description: 33.33333333333333%
synonyms: 33.33333333333333%
ontology_id: 33.33333333333333%
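The 33.3% figure is what you would get by pooling checks across three files where one file populates all three recommended slots and the other two populate none. The sketch below shows that pooling arithmetic. The per-file counts for the second and third files are assumptions inferred from the output, and `pooled_compliance` is a hypothetical helper, not the library's aggregation code.

```python
def pooled_compliance(per_file_counts):
    """Pool (populated, total) pairs across files into one percentage."""
    populated = sum(p for p, _ in per_file_counts)
    total = sum(t for _, t in per_file_counts)
    return 100.0 * populated / total if total else 0.0


# (populated, total) per file: one complete file and two with nothing populated.
counts = [(3, 3), (0, 3), (0, 3)]

print(round(pooled_compliance(counts), 2))  # 33.33
```

Pooling counts rather than averaging per-file percentages means larger files carry more weight, which only matters when files contain different numbers of checks.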
When to Use the CLI vs Python API¶
Use the CLI when:
- Running one-off compliance checks
- Integrating with CI/CD pipelines
- Generating reports for external tools
Use the Python API when:
- Building custom analysis pipelines
- Integrating with other Python tools
- Needing programmatic access to detailed results
- Building dashboards or visualizations
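As one example of the dashboard use case, per-file slot summaries can be flattened into CSV rows that most plotting and BI tools accept. This sketch uses only the standard library. The `summaries_to_csv` helper, the column names, and the sample values are hypothetical; in practice each inner dictionary would come from a report's `summary_by_slot`.

```python
import csv
import io


def summaries_to_csv(summaries):
    """Flatten {file_name: {slot: percentage}} into CSV text, one row per slot."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["file", "slot", "percentage"])
    for file_name, summary in summaries.items():
        for slot, percentage in summary.items():
            writer.writerow([file_name, slot, percentage])
    return buffer.getvalue()


# Invented sample values standing in for real summary_by_slot dictionaries.
summaries = {
    "disease_good.yaml": {"description": 100.0, "ontology_id": 100.0},
    "disease_poor.yaml": {"description": 0.0, "ontology_id": 0.0},
}

print(summaries_to_csv(summaries))
```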