QC for Tabular Data (CSV/TSV)¶
This tutorial shows how to use linkml-data-qc with CSV and TSV files - common formats for tabular datasets.
When to Use Tabular Formats¶
CSV/TSV is great for:
- Flat datasets where each row is an instance
- Data exported from spreadsheets or databases
- Large datasets with many similar records
How It Works¶
- Each row in the CSV/TSV becomes an instance of your target class
- Column headers map to slot names
- Empty cells are treated as missing (not populated)
Example: Gene Annotation Dataset¶
Let's create a schema for gene annotations and a TSV dataset with varying completeness.
In [1]:
Copied!
%%bash
cat > /tmp/gene_schema.yaml << 'EOF'
id: https://example.org/genes
name: gene_schema
prefixes:
linkml: https://w3id.org/linkml/
imports:
- linkml:types
default_range: string
classes:
Gene:
attributes:
gene_id:
identifier: true
required: true
symbol:
required: true
name:
recommended: true
description:
recommended: true
chromosome:
recommended: true
go_terms:
recommended: true
publications:
recommended: true
EOF
echo "Gene schema created!"
%%bash
cat > /tmp/gene_schema.yaml << 'EOF'
id: https://example.org/genes
name: gene_schema
prefixes:
linkml: https://w3id.org/linkml/
imports:
- linkml:types
default_range: string
classes:
Gene:
attributes:
gene_id:
identifier: true
required: true
symbol:
required: true
name:
recommended: true
description:
recommended: true
chromosome:
recommended: true
go_terms:
recommended: true
publications:
recommended: true
EOF
echo "Gene schema created!"
Gene schema created!
Create a TSV Dataset¶
Here's a gene annotation file with some missing recommended fields:
In [2]:
Copied!
%%bash
cat > /tmp/genes.tsv << 'EOF'
gene_id symbol name description chromosome go_terms publications
ENSG00000139618 BRCA2 BRCA2 DNA repair associated Involved in DNA repair 13 GO:0006281,GO:0006310 PMID:8589730
ENSG00000141510 TP53 Tumor protein p53 Tumor suppressor gene 17 GO:0006915,GO:0008283 PMID:2052564
ENSG00000157764 BRAF B-Raf proto-oncogene 7 GO:0006468
ENSG00000133703 KRAS KRAS proto-oncogene GTPase signaling 12 PMID:2278970
ENSG00000171862 PTEN Phosphatase and tensin homolog 10
ENSG00000146648 EGFR 7 GO:0007169 PMID:3490687
ENSG00000181143 MYC MYC proto-oncogene Transcription factor 8 GO:0006355 PMID:7479893
EOF
echo "Gene TSV created! Contents:"
cat /tmp/genes.tsv
%%bash
cat > /tmp/genes.tsv << 'EOF'
gene_id symbol name description chromosome go_terms publications
ENSG00000139618 BRCA2 BRCA2 DNA repair associated Involved in DNA repair 13 GO:0006281,GO:0006310 PMID:8589730
ENSG00000141510 TP53 Tumor protein p53 Tumor suppressor gene 17 GO:0006915,GO:0008283 PMID:2052564
ENSG00000157764 BRAF B-Raf proto-oncogene 7 GO:0006468
ENSG00000133703 KRAS KRAS proto-oncogene GTPase signaling 12 PMID:2278970
ENSG00000171862 PTEN Phosphatase and tensin homolog 10
ENSG00000146648 EGFR 7 GO:0007169 PMID:3490687
ENSG00000181143 MYC MYC proto-oncogene Transcription factor 8 GO:0006355 PMID:7479893
EOF
echo "Gene TSV created! Contents:"
cat /tmp/genes.tsv
Gene TSV created! Contents:
gene_id symbol name description chromosome go_terms publications ENSG00000139618 BRCA2 BRCA2 DNA rep
air associated Involved in DNA repair 13 GO:0006281,GO:0006310 PMID:8589730 ENSG00000141510 TP53 Tum
or protein p53 Tumor suppressor gene 17 GO:0006915,GO:0008283 PMID:2052564 ENSG00000157764 BRAF B-Ra
f proto-oncogene 7 GO:0006468 ENSG00000133703 KRAS KRAS proto-oncogene GTPase signaling 12 PMID:2
278970 ENSG00000171862 PTEN Phosphatase and tensin homolog 10 ENSG00000146648 EGFR 7 GO:0007169
PMID:3490687 ENSG00000181143 MYC MYC proto-oncogene Transcription factor 8 GO:0006355 PMID:7479893
Analyze the TSV¶
In [3]:
Copied!
%%bash
linkml-data-qc /tmp/genes.tsv \
-s /tmp/gene_schema.yaml \
-t Gene
%%bash
linkml-data-qc /tmp/genes.tsv \
-s /tmp/gene_schema.yaml \
-t Gene
Compliance Report: /tmp/genes.tsv Target Class: Gene Global Compliance: 77.1% (27/35) Weighted Compl
iance: 77.1% Summary by Slot: chromosome: 100.0% description: 57.1% go_terms: 71.4% name: 8
5.7% publications: 71.4% Aggregated Scores by List Path: [].chromosome: 100.0% (7/7) [].descr
iption: 57.1% (4/7) [].go_terms: 71.4% (5/7) [].name: 85.7% (6/7) [].publications: 71.4% (5/7)
Detailed Path Scores:
[0] (Gene): 100.0%
- name: OK
- description: OK
- chromosome:
OK
- go_terms: OK
- publications: OK
[1] (Gene): 100.0%
- name: OK
- description:
OK
- chromosome: OK
- go_terms: OK
- publications: OK
[2] (Gene): 60.0%
- name: OK
- description: MISSING
- chromosome: OK
- go_terms: OK
- publications: MISSING
[3
] (Gene): 80.0%
- name: OK
- description: OK
- chromosome: OK
- go_terms: MISSING
- publications: OK
[4] (Gene): 40.0%
- name: OK
- description: MISSING
- chromosome:
OK
- go_terms: MISSING
- publications: MISSING
[5] (Gene): 60.0%
- name: MISSING
- description: MISSING
- chromosome: OK
- go_terms: OK
- publications: OK
[6] (Gene):
100.0%
- name: OK
- description: OK
- chromosome: OK
- go_terms: OK
- publicatio
ns: OK
Understanding Tabular Paths¶
For CSV/TSV data, each row gets an indexed path:
[0]- First row (BRCA2)[1]- Second row (TP53)- etc.
The aggregated path [] shows compliance across all rows.
Let's see the JSON output for details:
In [4]:
Copied!
%%bash
linkml-data-qc /tmp/genes.tsv \
-s /tmp/gene_schema.yaml \
-t Gene \
-f json | python -m json.tool | head -60
%%bash
linkml-data-qc /tmp/genes.tsv \
-s /tmp/gene_schema.yaml \
-t Gene \
-f json | python -m json.tool | head -60
{
"file_path": "/tmp/genes.tsv",
"target_class": "Gene",
"schema_path": "/tmp/gene_schem
a.yaml",
"global_compliance": 77.14285714285715,
"weighted_compliance": 77.14285714285715,
"total_checks": 35,
"total_populated": 27,
"path_scores": [
{
"path":
"[0]",
"parent_class": "Gene",
"item_count": 1,
"slot_scores":
[
{
"path": "[0]",
"slot_name": "name",
"populated": 1,
"total": 1,
"percentage": 10
0.0
},
{
"path": "[0]",
"slo
t_name": "description",
"populated": 1,
"total": 1,
"percentage": 100.0
},
{
"path": "[0
]",
"slot_name": "chromosome",
"populated": 1,
"total": 1,
"percentage": 100.0
},
{
"path": "[0]",
"slot_name": "go_terms",
"pop
ulated": 1,
"total": 1,
"percentage": 100.0
},
{
"path": "[0]",
"slot_name": "publicatio
ns",
"populated": 1,
"total": 1,
"percen
tage": 100.0
}
],
"overall_percentage": 100.0
},
{
"path": "[1]",
"parent_class": "Gene",
"item_count": 1,
"slot_scores": [
{
"path": "[1]",
"slot_name": "name",
Example: CSV Format¶
CSV works the same way:
In [5]:
Copied!
%%bash
cat > /tmp/genes.csv << 'EOF'
gene_id,symbol,name,description,chromosome,go_terms,publications
ENSG00000139618,BRCA2,BRCA2 DNA repair associated,Involved in DNA repair,13,"GO:0006281,GO:0006310",PMID:8589730
ENSG00000141510,TP53,Tumor protein p53,Tumor suppressor gene,17,"GO:0006915,GO:0008283",PMID:2052564
ENSG00000157764,BRAF,B-Raf proto-oncogene,,7,GO:0006468,
ENSG00000133703,KRAS,KRAS proto-oncogene,GTPase signaling,12,,PMID:2278970
EOF
linkml-data-qc /tmp/genes.csv \
-s /tmp/gene_schema.yaml \
-t Gene
%%bash
cat > /tmp/genes.csv << 'EOF'
gene_id,symbol,name,description,chromosome,go_terms,publications
ENSG00000139618,BRCA2,BRCA2 DNA repair associated,Involved in DNA repair,13,"GO:0006281,GO:0006310",PMID:8589730
ENSG00000141510,TP53,Tumor protein p53,Tumor suppressor gene,17,"GO:0006915,GO:0008283",PMID:2052564
ENSG00000157764,BRAF,B-Raf proto-oncogene,,7,GO:0006468,
ENSG00000133703,KRAS,KRAS proto-oncogene,GTPase signaling,12,,PMID:2278970
EOF
linkml-data-qc /tmp/genes.csv \
-s /tmp/gene_schema.yaml \
-t Gene
Compliance Report: /tmp/genes.csv Target Class: Gene Global Compliance: 85.0% (17/20) Weighted Compl
iance: 85.0% Summary by Slot: chromosome: 100.0% description: 75.0% go_terms: 75.0% name: 1
00.0% publications: 75.0% Aggregated Scores by List Path: [].chromosome: 100.0% (4/4) [].desc
ription: 75.0% (3/4) [].go_terms: 75.0% (3/4) [].name: 100.0% (4/4) [].publications: 75.0% (3/
4)
Detailed Path Scores:
[0] (Gene): 100.0%
- name: OK
- description: OK
- chromosome
: OK
- go_terms: OK
- publications: OK
[1] (Gene): 100.0%
- name: OK
- description
: OK
- chromosome: OK
- go_terms: OK
- publications: OK
[2] (Gene): 60.0%
- name:
OK
- description: MISSING
- chromosome: OK
- go_terms: OK
- publications: MISSING
[3] (Gene): 80.0%
- name: OK
- description: OK
- chromosome: OK
- go_terms: MISSING
- publications: OK
Setting Thresholds for Tabular Data¶
For large datasets, you might want to ensure a minimum percentage of records have key fields:
In [6]:
Copied!
%%bash
cat > /tmp/gene_qc_config.yaml << 'EOF'
default_weight: 1.0
slots:
name:
weight: 2.0
min_compliance: 90.0 # At least 90% of genes should have names
go_terms:
weight: 1.5
min_compliance: 70.0 # At least 70% should have GO annotations
publications:
weight: 1.0
min_compliance: 60.0 # At least 60% should have publications
EOF
linkml-data-qc /tmp/genes.tsv \
-s /tmp/gene_schema.yaml \
-t Gene \
-c /tmp/gene_qc_config.yaml
%%bash
cat > /tmp/gene_qc_config.yaml << 'EOF'
default_weight: 1.0
slots:
name:
weight: 2.0
min_compliance: 90.0 # At least 90% of genes should have names
go_terms:
weight: 1.5
min_compliance: 70.0 # At least 70% should have GO annotations
publications:
weight: 1.0
min_compliance: 60.0 # At least 60% should have publications
EOF
linkml-data-qc /tmp/genes.tsv \
-s /tmp/gene_schema.yaml \
-t Gene \
-c /tmp/gene_qc_config.yaml
Compliance Report: /tmp/genes.tsv Target Class: Gene Global Compliance: 77.1% (27/35) Weighted Compl
iance: 78.0% Config: /tmp/gene_qc_config.yaml Summary by Slot: chromosome: 100.0% description:
57.1% go_terms: 71.4% name: 85.7% publications: 71.4% Threshold Violations (1): [].name: 85
.7% < 90.0% (shortfall: 4.3%) Aggregated Scores by List Path: [].chromosome: 100.0% (7/7) [].de
scription: 57.1% (4/7) [].go_terms: 71.4% (5/7) [].name: 85.7% (6/7) [].publications: 71.4% (5
/7)
Detailed Path Scores:
[0] (Gene): 100.0%
- name: OK
- description: OK
- chromosom
e: OK
- go_terms: OK
- publications: OK
[1] (Gene): 100.0%
- name: OK
- descriptio
n: OK
- chromosome: OK
- go_terms: OK
- publications: OK
[2] (Gene): 60.0%
- name:
OK
- description: MISSING
- chromosome: OK
- go_terms: OK
- publications: MISSING
[3] (Gene): 80.0%
- name: OK
- description: OK
- chromosome: OK
- go_terms: MISSING
- publications: OK
[4] (Gene): 40.0%
- name: OK
- description: MISSING
- chromoso
me: OK
- go_terms: MISSING
- publications: MISSING
[5] (Gene): 60.0%
- name: MISSING
- description: MISSING
- chromosome: OK
- go_terms: OK
- publications: OK
[6] (Gene
): 100.0%
- name: OK
- description: OK
- chromosome: OK
- go_terms: OK
- publica
tions: OK
Analyzing a Directory of TSV Files¶
If you have multiple TSV files, analyze them together:
In [7]:
Copied!
%%bash
# Create a second TSV file
cat > /tmp/genes2.tsv << 'EOF'
gene_id symbol name description chromosome go_terms publications
ENSG00000105329 TGFB1 Transforming growth factor beta 1 Cytokine signaling 19 GO:0008083 PMID:3016893
ENSG00000127479 VEGFA Vascular endothelial growth factor A Angiogenesis regulator 6 GO:0001525 PMID:2587564
EOF
linkml-data-qc /tmp \
-s /tmp/gene_schema.yaml \
-t Gene \
--pattern "genes*.tsv"
%%bash
# Create a second TSV file
cat > /tmp/genes2.tsv << 'EOF'
gene_id symbol name description chromosome go_terms publications
ENSG00000105329 TGFB1 Transforming growth factor beta 1 Cytokine signaling 19 GO:0008083 PMID:3016893
ENSG00000127479 VEGFA Vascular endothelial growth factor A Angiogenesis regulator 6 GO:0001525 PMID:2587564
EOF
linkml-data-qc /tmp \
-s /tmp/gene_schema.yaml \
-t Gene \
--pattern "genes*.tsv"
Multi-File Compliance Report Files Analyzed: 2 Global Compliance: 82.2% Weighted Compliance: 82.2%
Summary by Slot (across all files): chromosome: 100.0% description: 66.7% go_terms: 77.8% na
me: 88.9% publications: 77.8% Summary by Path (across all files): [].chromosome: 100.0% [].de
scription: 66.7% [].go_terms: 77.8% [].name: 88.9% [].publications: 77.8% Per-File Compliance
: /tmp/genes.tsv: 77.1% /tmp/genes2.tsv: 100.0%
Finding Incomplete Records¶
Use CSV output to identify which records need attention:
In [8]:
Copied!
%%bash
echo "=== Records with lowest compliance ==="
linkml-data-qc /tmp/genes.tsv \
-s /tmp/gene_schema.yaml \
-t Gene \
-f csv | grep -v "\[\]" | sort -t, -k7 -n | head -10
%%bash
echo "=== Records with lowest compliance ==="
linkml-data-qc /tmp/genes.tsv \
-s /tmp/gene_schema.yaml \
-t Gene \
-f csv | grep -v "\[\]" | sort -t, -k7 -n | head -10
=== Records with lowest compliance ===
/tmp/genes.tsv,[2],Gene,description,0,1,0.0 /tmp/genes.tsv,[2],Gene,publications,0,1,0.0 /tmp/gen
es.tsv,[3],Gene,go_terms,0,1,0.0 /tmp/genes.tsv,[4],Gene,description,0,1,0.0 /tmp/genes.tsv,[4],Ge
ne,go_terms,0,1,0.0 /tmp/genes.tsv,[4],Gene,publications,0,1,0.0 /tmp/genes.tsv,[5],Gene,descripti
on,0,1,0.0 /tmp/genes.tsv,[5],Gene,name,0,1,0.0 file,path,class,slot,populated,total,percentage
CI/CD Integration for Tabular Data¶
Ensure your data meets quality standards before releases:
# Fail if less than 80% of genes have names
linkml-data-qc data/genes.tsv \
-s schema/gene.yaml \
-t Gene \
--min-compliance 80
Key Takeaways¶
- File extension matters - Use
.csv,.tsv, or.tabfor tabular files - Each row = one instance - Compliance is calculated per row, then aggregated
- Empty cells = missing - Empty strings in CSV/TSV are treated as unpopulated
- Use patterns - Analyze multiple files with
--pattern "*.tsv" - Same config works - Weight and threshold configs work identically for tabular data