QC for Tabular Data (CSV/TSV)¶

This tutorial shows how to use linkml-data-qc with CSV and TSV files - common formats for tabular datasets.

When to Use Tabular Formats¶

CSV/TSV is great for:

Flat datasets where each row is an instance
Data exported from spreadsheets or databases
Large datasets with many similar records

How It Works¶

Each row in the CSV/TSV becomes an instance of your target class
Column headers map to slot names
Empty cells are treated as missing (not populated)

Example: Gene Annotation Dataset¶

Let's create a schema for gene annotations and a TSV dataset with varying completeness.

In [1]:

Copied!





%%bash
cat > /tmp/gene_schema.yaml << 'EOF'
id: https://example.org/genes
name: gene_schema
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
default_range: string

classes:
  Gene:
    attributes:
      gene_id:
        identifier: true
        required: true
      symbol:
        required: true
      name:
        recommended: true
      description:
        recommended: true
      chromosome:
        recommended: true
      go_terms:
        recommended: true
      publications:
        recommended: true
EOF
echo "Gene schema created!"
%%bash
cat > /tmp/gene_schema.yaml << 'EOF'
id: https://example.org/genes
name: gene_schema
prefixes:
  linkml: https://w3id.org/linkml/
imports:
  - linkml:types
default_range: string

classes:
  Gene:
    attributes:
      gene_id:
        identifier: true
        required: true
      symbol:
        required: true
      name:
        recommended: true
      description:
        recommended: true
      chromosome:
        recommended: true
      go_terms:
        recommended: true
      publications:
        recommended: true
EOF
echo "Gene schema created!"

Gene schema created!

Create a TSV Dataset¶

Here's a gene annotation file with some missing recommended fields:

In [2]:

Copied!





%%bash
cat > /tmp/genes.tsv << 'EOF'
gene_id	symbol	name	description	chromosome	go_terms	publications
ENSG00000139618	BRCA2	BRCA2 DNA repair associated	Involved in DNA repair	13	GO:0006281,GO:0006310	PMID:8589730
ENSG00000141510	TP53	Tumor protein p53	Tumor suppressor gene	17	GO:0006915,GO:0008283	PMID:2052564
ENSG00000157764	BRAF	B-Raf proto-oncogene		7	GO:0006468	
ENSG00000133703	KRAS	KRAS proto-oncogene	GTPase signaling	12		PMID:2278970
ENSG00000171862	PTEN	Phosphatase and tensin homolog		10		
ENSG00000146648	EGFR			7	GO:0007169	PMID:3490687
ENSG00000181143	MYC	MYC proto-oncogene	Transcription factor	8	GO:0006355	PMID:7479893
EOF
echo "Gene TSV created! Contents:"
cat /tmp/genes.tsv
%%bash
cat > /tmp/genes.tsv << 'EOF'
gene_id	symbol	name	description	chromosome	go_terms	publications
ENSG00000139618	BRCA2	BRCA2 DNA repair associated	Involved in DNA repair	13	GO:0006281,GO:0006310	PMID:8589730
ENSG00000141510	TP53	Tumor protein p53	Tumor suppressor gene	17	GO:0006915,GO:0008283	PMID:2052564
ENSG00000157764	BRAF	B-Raf proto-oncogene		7	GO:0006468	
ENSG00000133703	KRAS	KRAS proto-oncogene	GTPase signaling	12		PMID:2278970
ENSG00000171862	PTEN	Phosphatase and tensin homolog		10		
ENSG00000146648	EGFR			7	GO:0007169	PMID:3490687
ENSG00000181143	MYC	MYC proto-oncogene	Transcription factor	8	GO:0006355	PMID:7479893
EOF
echo "Gene TSV created! Contents:"
cat /tmp/genes.tsv

Gene TSV created! Contents:

gene_id	symbol	name	description	chromosome	go_terms	publications
ENSG00000139618	BRCA2	BRCA2 DNA rep

air associated	Involved in DNA repair	13	GO:0006281,GO:0006310	PMID:8589730
ENSG00000141510	TP53	Tum

or protein p53	Tumor suppressor gene	17	GO:0006915,GO:0008283	PMID:2052564
ENSG00000157764	BRAF	B-Ra

f proto-oncogene		7	GO:0006468	
ENSG00000133703	KRAS	KRAS proto-oncogene	GTPase signaling	12		PMID:2

278970
ENSG00000171862	PTEN	Phosphatase and tensin homolog		10		
ENSG00000146648	EGFR			7	GO:0007169

	PMID:3490687
ENSG00000181143	MYC	MYC proto-oncogene	Transcription factor	8	GO:0006355	PMID:7479893

Analyze the TSV¶

In [3]:

Copied!





%%bash
linkml-data-qc /tmp/genes.tsv \
    -s /tmp/gene_schema.yaml \
    -t Gene
%%bash
linkml-data-qc /tmp/genes.tsv \
    -s /tmp/gene_schema.yaml \
    -t Gene

Compliance Report: /tmp/genes.tsv
Target Class: Gene
Global Compliance: 77.1% (27/35)
Weighted Compl

iance: 77.1%

Summary by Slot:
  chromosome: 100.0%
  description: 57.1%
  go_terms: 71.4%
  name: 8

5.7%
  publications: 71.4%

Aggregated Scores by List Path:
  [].chromosome: 100.0% (7/7)
  [].descr

iption: 57.1% (4/7)
  [].go_terms: 71.4% (5/7)
  [].name: 85.7% (6/7)
  [].publications: 71.4% (5/7)


Detailed Path Scores:
  [0] (Gene): 100.0%
    - name: OK
    - description: OK
    - chromosome:

OK
    - go_terms: OK
    - publications: OK
  [1] (Gene): 100.0%
    - name: OK
    - description:

OK
    - chromosome: OK
    - go_terms: OK
    - publications: OK
  [2] (Gene): 60.0%
    - name: OK

    - description: MISSING
    - chromosome: OK
    - go_terms: OK
    - publications: MISSING
  [3

] (Gene): 80.0%
    - name: OK
    - description: OK
    - chromosome: OK
    - go_terms: MISSING

  - publications: OK
  [4] (Gene): 40.0%
    - name: OK
    - description: MISSING
    - chromosome:

 OK
    - go_terms: MISSING
    - publications: MISSING
  [5] (Gene): 60.0%
    - name: MISSING

- description: MISSING
    - chromosome: OK
    - go_terms: OK
    - publications: OK
  [6] (Gene):

100.0%
    - name: OK
    - description: OK
    - chromosome: OK
    - go_terms: OK
    - publicatio

ns: OK

Understanding Tabular Paths¶

For CSV/TSV data, each row gets an indexed path:

[0] - First row (BRCA2)
[1] - Second row (TP53)
etc.

The aggregated path [] shows compliance across all rows.

Let's see the JSON output for details:

In [4]:

Copied!





%%bash
linkml-data-qc /tmp/genes.tsv \
    -s /tmp/gene_schema.yaml \
    -t Gene \
    -f json | python -m json.tool | head -60
%%bash
linkml-data-qc /tmp/genes.tsv \
    -s /tmp/gene_schema.yaml \
    -t Gene \
    -f json | python -m json.tool | head -60

{
    "file_path": "/tmp/genes.tsv",
    "target_class": "Gene",
    "schema_path": "/tmp/gene_schem

a.yaml",
    "global_compliance": 77.14285714285715,
    "weighted_compliance": 77.14285714285715,

   "total_checks": 35,
    "total_populated": 27,
    "path_scores": [
        {
            "path":

 "[0]",
            "parent_class": "Gene",
            "item_count": 1,
            "slot_scores":

[
                {
                    "path": "[0]",
                    "slot_name": "name",

                "populated": 1,
                    "total": 1,
                    "percentage": 10

0.0
                },
                {
                    "path": "[0]",
                    "slo

t_name": "description",
                    "populated": 1,
                    "total": 1,

            "percentage": 100.0
                },
                {
                    "path": "[0

]",
                    "slot_name": "chromosome",
                    "populated": 1,

       "total": 1,
                    "percentage": 100.0
                },
                {

                "path": "[0]",
                    "slot_name": "go_terms",
                    "pop

ulated": 1,
                    "total": 1,
                    "percentage": 100.0

},
                {
                    "path": "[0]",
                    "slot_name": "publicatio

ns",
                    "populated": 1,
                    "total": 1,
                    "percen

tage": 100.0
                }
            ],
            "overall_percentage": 100.0
        },

     {
            "path": "[1]",
            "parent_class": "Gene",
            "item_count": 1,

           "slot_scores": [
                {
                    "path": "[1]",

 "slot_name": "name",

Example: CSV Format¶

CSV works the same way:

In [5]:

Copied!





%%bash
cat > /tmp/genes.csv << 'EOF'
gene_id,symbol,name,description,chromosome,go_terms,publications
ENSG00000139618,BRCA2,BRCA2 DNA repair associated,Involved in DNA repair,13,"GO:0006281,GO:0006310",PMID:8589730
ENSG00000141510,TP53,Tumor protein p53,Tumor suppressor gene,17,"GO:0006915,GO:0008283",PMID:2052564
ENSG00000157764,BRAF,B-Raf proto-oncogene,,7,GO:0006468,
ENSG00000133703,KRAS,KRAS proto-oncogene,GTPase signaling,12,,PMID:2278970
EOF

linkml-data-qc /tmp/genes.csv \
    -s /tmp/gene_schema.yaml \
    -t Gene
%%bash
cat > /tmp/genes.csv << 'EOF'
gene_id,symbol,name,description,chromosome,go_terms,publications
ENSG00000139618,BRCA2,BRCA2 DNA repair associated,Involved in DNA repair,13,"GO:0006281,GO:0006310",PMID:8589730
ENSG00000141510,TP53,Tumor protein p53,Tumor suppressor gene,17,"GO:0006915,GO:0008283",PMID:2052564
ENSG00000157764,BRAF,B-Raf proto-oncogene,,7,GO:0006468,
ENSG00000133703,KRAS,KRAS proto-oncogene,GTPase signaling,12,,PMID:2278970
EOF

linkml-data-qc /tmp/genes.csv \
    -s /tmp/gene_schema.yaml \
    -t Gene

Compliance Report: /tmp/genes.csv
Target Class: Gene
Global Compliance: 85.0% (17/20)
Weighted Compl

iance: 85.0%

Summary by Slot:
  chromosome: 100.0%
  description: 75.0%
  go_terms: 75.0%
  name: 1

00.0%
  publications: 75.0%

Aggregated Scores by List Path:
  [].chromosome: 100.0% (4/4)
  [].desc

ription: 75.0% (3/4)
  [].go_terms: 75.0% (3/4)
  [].name: 100.0% (4/4)
  [].publications: 75.0% (3/

4)

Detailed Path Scores:
  [0] (Gene): 100.0%
    - name: OK
    - description: OK
    - chromosome

: OK
    - go_terms: OK
    - publications: OK
  [1] (Gene): 100.0%
    - name: OK
    - description

: OK
    - chromosome: OK
    - go_terms: OK
    - publications: OK
  [2] (Gene): 60.0%
    - name:

OK
    - description: MISSING
    - chromosome: OK
    - go_terms: OK
    - publications: MISSING

[3] (Gene): 80.0%
    - name: OK
    - description: OK
    - chromosome: OK
    - go_terms: MISSING

    - publications: OK

Setting Thresholds for Tabular Data¶

For large datasets, you might want to ensure a minimum percentage of records have key fields:

In [6]:

Copied!





%%bash
cat > /tmp/gene_qc_config.yaml << 'EOF'
default_weight: 1.0

slots:
  name:
    weight: 2.0
    min_compliance: 90.0  # At least 90% of genes should have names
  go_terms:
    weight: 1.5
    min_compliance: 70.0  # At least 70% should have GO annotations
  publications:
    weight: 1.0
    min_compliance: 60.0  # At least 60% should have publications
EOF

linkml-data-qc /tmp/genes.tsv \
    -s /tmp/gene_schema.yaml \
    -t Gene \
    -c /tmp/gene_qc_config.yaml
%%bash
cat > /tmp/gene_qc_config.yaml << 'EOF'
default_weight: 1.0

slots:
  name:
    weight: 2.0
    min_compliance: 90.0  # At least 90% of genes should have names
  go_terms:
    weight: 1.5
    min_compliance: 70.0  # At least 70% should have GO annotations
  publications:
    weight: 1.0
    min_compliance: 60.0  # At least 60% should have publications
EOF

linkml-data-qc /tmp/genes.tsv \
    -s /tmp/gene_schema.yaml \
    -t Gene \
    -c /tmp/gene_qc_config.yaml

Compliance Report: /tmp/genes.tsv
Target Class: Gene
Global Compliance: 77.1% (27/35)
Weighted Compl

iance: 78.0%
Config: /tmp/gene_qc_config.yaml

Summary by Slot:
  chromosome: 100.0%
  description:

57.1%
  go_terms: 71.4%
  name: 85.7%
  publications: 71.4%

Threshold Violations (1):
  [].name: 85

.7% < 90.0% (shortfall: 4.3%)

Aggregated Scores by List Path:
  [].chromosome: 100.0% (7/7)
  [].de

scription: 57.1% (4/7)
  [].go_terms: 71.4% (5/7)
  [].name: 85.7% (6/7)
  [].publications: 71.4% (5

/7)

Detailed Path Scores:
  [0] (Gene): 100.0%
    - name: OK
    - description: OK
    - chromosom

e: OK
    - go_terms: OK
    - publications: OK
  [1] (Gene): 100.0%
    - name: OK
    - descriptio

n: OK
    - chromosome: OK
    - go_terms: OK
    - publications: OK
  [2] (Gene): 60.0%
    - name:

 OK
    - description: MISSING
    - chromosome: OK
    - go_terms: OK
    - publications: MISSING

 [3] (Gene): 80.0%
    - name: OK
    - description: OK
    - chromosome: OK
    - go_terms: MISSING

    - publications: OK
  [4] (Gene): 40.0%
    - name: OK
    - description: MISSING
    - chromoso

me: OK
    - go_terms: MISSING
    - publications: MISSING
  [5] (Gene): 60.0%
    - name: MISSING

   - description: MISSING
    - chromosome: OK
    - go_terms: OK
    - publications: OK
  [6] (Gene

): 100.0%
    - name: OK
    - description: OK
    - chromosome: OK
    - go_terms: OK
    - publica

tions: OK

Analyzing a Directory of TSV Files¶

If you have multiple TSV files, analyze them together:

In [7]:

Copied!





%%bash
# Create a second TSV file
cat > /tmp/genes2.tsv << 'EOF'
gene_id	symbol	name	description	chromosome	go_terms	publications
ENSG00000105329	TGFB1	Transforming growth factor beta 1	Cytokine signaling	19	GO:0008083	PMID:3016893
ENSG00000127479	VEGFA	Vascular endothelial growth factor A	Angiogenesis regulator	6	GO:0001525	PMID:2587564
EOF

linkml-data-qc /tmp \
    -s /tmp/gene_schema.yaml \
    -t Gene \
    --pattern "genes*.tsv"
%%bash
# Create a second TSV file
cat > /tmp/genes2.tsv << 'EOF'
gene_id	symbol	name	description	chromosome	go_terms	publications
ENSG00000105329	TGFB1	Transforming growth factor beta 1	Cytokine signaling	19	GO:0008083	PMID:3016893
ENSG00000127479	VEGFA	Vascular endothelial growth factor A	Angiogenesis regulator	6	GO:0001525	PMID:2587564
EOF

linkml-data-qc /tmp \
    -s /tmp/gene_schema.yaml \
    -t Gene \
    --pattern "genes*.tsv"

Multi-File Compliance Report
Files Analyzed: 2
Global Compliance: 82.2%
Weighted Compliance: 82.2%

Summary by Slot (across all files):
  chromosome: 100.0%
  description: 66.7%
  go_terms: 77.8%
  na

me: 88.9%
  publications: 77.8%

Summary by Path (across all files):
  [].chromosome: 100.0%
  [].de

scription: 66.7%
  [].go_terms: 77.8%
  [].name: 88.9%
  [].publications: 77.8%

Per-File Compliance

:
  /tmp/genes.tsv: 77.1%
  /tmp/genes2.tsv: 100.0%

Finding Incomplete Records¶

Use CSV output to identify which records need attention:

In [8]:

Copied!





%%bash
echo "=== Records with lowest compliance ==="
linkml-data-qc /tmp/genes.tsv \
    -s /tmp/gene_schema.yaml \
    -t Gene \
    -f csv | grep -v "\[\]" | sort -t, -k7 -n | head -10
%%bash
echo "=== Records with lowest compliance ==="
linkml-data-qc /tmp/genes.tsv \
    -s /tmp/gene_schema.yaml \
    -t Gene \
    -f csv | grep -v "\[\]" | sort -t, -k7 -n | head -10

=== Records with lowest compliance ===

/tmp/genes.tsv,[2],Gene,description,0,1,0.0
/tmp/genes.tsv,[2],Gene,publications,0,1,0.0
/tmp/gen

es.tsv,[3],Gene,go_terms,0,1,0.0
/tmp/genes.tsv,[4],Gene,description,0,1,0.0
/tmp/genes.tsv,[4],Ge

ne,go_terms,0,1,0.0
/tmp/genes.tsv,[4],Gene,publications,0,1,0.0
/tmp/genes.tsv,[5],Gene,descripti

on,0,1,0.0
/tmp/genes.tsv,[5],Gene,name,0,1,0.0
file,path,class,slot,populated,total,percentage

CI/CD Integration for Tabular Data¶

Ensure your data meets quality standards before releases:

# Fail if less than 80% of genes have names
linkml-data-qc data/genes.tsv \
    -s schema/gene.yaml \
    -t Gene \
    --min-compliance 80

Key Takeaways¶

File extension matters - Use .csv, .tsv, or .tab for tabular files
Each row = one instance - Compliance is calculated per row, then aggregated
Empty cells = missing - Empty strings in CSV/TSV are treated as unpopulated
Use patterns - Analyze multiple files with --pattern "*.tsv"
Same config works - Weight and threshold configs work identically for tabular data