Generalizers

Generalizers take example data and generalize it to a schema.

Warning

Generalization is inherently a heuristic process; treat it as a bootstrapping step that semi-automates the creation of a new schema.

Generalizing from a single TSV

schemauto generalize-tsv tests/resources/NWT_wildfires_biophysical_2016.tsv -o wildfire.yaml

The schema will have a slot for every column, e.g.:

classes:
  Observation:
    slots:
    - site
    - plot
    - plot_size
    - date
    - observer

Ranges will be auto-inferred, e.g.:

slots:
  site:
    examples:
    - value: ZF20-105
    range: string
  plot:
    examples:
    - value: '6'
    range: integer
  plot_size:
    examples:
    - value: 10X10
    range: plot_size_enum
  date:
    examples:
    - value: '2016-07-09'
    range: datetime
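
As a rough illustration of how a range might be chosen from raw column values, the sketch below tries progressively more permissive parsers. The function name `infer_range` and the parser order are assumptions made for this example; this is not the library's actual implementation, which may apply different rules.

```python
from datetime import datetime
from typing import Iterable


def infer_range(values: Iterable[str]) -> str:
    """Return a LinkML-style range name for a column of raw string values.

    Hypothetical helper: the most specific type that parses every value wins.
    """
    vals = [v for v in values if v not in (None, "")]
    if not vals:
        return "string"

    def all_parse(parser) -> bool:
        # True only if the parser accepts every non-empty value
        for v in vals:
            try:
                parser(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    if all_parse(datetime.fromisoformat):
        return "datetime"
    return "string"


# Example values taken from the slots shown above
for column, examples in {
    "site": ["ZF20-105"],
    "plot": ["6"],
    "date": ["2016-07-09"],
}.items():
    print(column, infer_range(examples))
```

Columns such as plot_size, whose values parse as none of these types but repeat often, are candidates for enums rather than plain strings.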

Enums will be automatically inferred:

enums:
  plot_size_enum:
    permissible_values:
      10X10:
        description: 10X10
      5x5:
        description: 5x5
      2.5X2.5:
        description: 2.5X2.5
      5X5:
        description: 5X5
      3x3:
        description: 3x3
  ecosystem_enum:
    permissible_values:
      Open Fen:
        description: Open Fen
      Treed Fen:
        description: Treed Fen
      Black Spruce:
        description: Black Spruce
      Poor Fen:
        description: Poor Fen
      Fen:
        description: Fen
      Lowland:
        description: Lowland
      Upland:
        description: Upland
      Bog:
        description: Bog
      Lowland Black Spruce:
        description: Lowland Black Spruce
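
The decision to treat a column as an enum is governed by the enum_threshold, enum_strlen_threshold and max_enum_size parameters documented in the class reference. The following is a minimal sketch of such a heuristic, not the library's code: `infer_enum` is a hypothetical name, and the sketch assumes that a low ratio of distinct to total values indicates an enum.

```python
from typing import Dict, List, Optional


def infer_enum(
    values: List[str],
    enum_threshold: float = 0.1,
    enum_strlen_threshold: int = 30,
    max_enum_size: int = 50,
) -> Optional[Dict]:
    """Return an enum definition if the column looks categorical, else None."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return None
    distinct = list(dict.fromkeys(non_empty))  # keeps first-seen order
    if len(distinct) > max_enum_size:
        return None
    if any(len(v) > enum_strlen_threshold for v in distinct):
        return None
    # Assumption: few distinct values relative to the row count means "enum"
    if len(distinct) / len(non_empty) >= enum_threshold:
        return None
    return {"permissible_values": {v: {"description": v} for v in distinct}}


# Hypothetical column with 60 rows and 3 distinct values
plot_sizes = ["10X10"] * 30 + ["5x5"] * 20 + ["3x3"] * 10
print(infer_enum(plot_sizes))
```

The returned dictionary mirrors the permissible_values layout shown above, where each value's description defaults to the value itself.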

Chaining an annotator

If you provide an --annotator option you can auto-annotate enums:

schemauto generalize-tsv \
    --annotator bioportal:envo \
    tests/resources/NWT_wildfires_biophysical_2016.tsv \
    -o wildfire.yaml

The resulting enum includes ontology term mappings:

ecosystem_enum:
  from_schema: https://w3id.org/MySchema
  permissible_values:
    Open Fen:
      description: Open Fen
      meaning: ENVO:00000232
      exact_mappings:
      - ENVO:00000232
    Treed Fen:
      description: Treed Fen
      meaning: ENVO:00000232
      exact_mappings:
      - ENVO:00000232
    Black Spruce:
      description: Black Spruce
    Poor Fen:
      description: Poor Fen
      meaning: ENVO:00000232
      exact_mappings:
      - ENVO:00000232
    Fen:
      description: Fen
      meaning: ENVO:00000232
    Lowland:
      description: Lowland
    Upland:
      description: Upland
      meaning: ENVO:00000182
    Bog:
      description: Bog
      meaning: ENVO:01000534
      exact_mappings:
      - ENVO:01000535
      - ENVO:00000044
      - ENVO:01001209
      - ENVO:01000527
    Lowland Black Spruce:
      description: Lowland Black Spruce

The annotation can also be run as a separate step; see Annotators.

Generalizing from multiple TSVs

You can use the generalize-tsvs command to generalize from multiple TSVs, with foreign key linkages auto-inferred.

For example, given a file envo.tsv:

environments

envo term id  | envo term label
------------- | ---------------------
ENVO_01000752 | area of barren land
ENVO_01001570 | terrestrial ecoregion
ENVO_01001581 | sea surface layer
ENVO_01001582 | forest floor

And a file samples.tsv:

samples

BIOSAMPLE_ID | BIOSAMPLE_NAME | ENVO_BIOME_ID | ENVO_FEATURE_ID | ENVO_MATERIAL_ID
------------ | -------------- | ------------- | --------------- | ----------------
156554 | Enriched cells from forest soil in Barre Woods Harvard Forest LTER site, Petersham, Massachusetts, United States - Alteio_BWOrgControl_Nextera2 | ENVO_01000174 | ENVO_01000159 | ENVO_00002261
156649 | Enriched cells from forest soil in Barre Woods Harvard Forest LTER site, Petersham, Massachusetts, United States - Alteio_BWOrgHeat_Nextera5 | ENVO_01000174 | ENVO_01000159 | ENVO_00005781
156728 | Enriched cells from forest soil in Barre Woods Harvard Forest LTER site, Petersham, Massachusetts, United States - Alteio_BWOrgHeat_Nextera84 | ENVO_01000174 | ENVO_01000159 | ENVO_00005781
156738 | Enriched cells from forest soil in Barre Woods Harvard Forest LTER site, Petersham, Massachusetts, United States - Alteio_BWMinControl_Nextera2 | ENVO_01000174 | ENVO_01001275 | ENVO_00002261

We can create a multi-class schema, with foreign keys inferred:

schemauto generalize-tsvs --infer-foreign-keys sample.tsv envo.tsv

This will generate a schema with two classes, where the join between the sample table and the term table is inferred:


classes:
  sample:
    slots:
    - BIOSAMPLE_ID
    - BIOSAMPLE_NAME
    - ENVO_BIOME_ID
    - ENVO_FEATURE_ID
    - ENVO_MATERIAL_ID
  envo:
    slots:
    - ENVO_ID
    - ENVO_LABEL
slots:
  BIOSAMPLE_ID:
    range: integer
  BIOSAMPLE_NAME:
    range: string
  ENVO_BIOME_ID:
    examples:
    - value: ENVO_01000022
    range: envo
  ENVO_FEATURE_ID:
    range: envo
  ENVO_MATERIAL_ID:
    range: envo
  ENVO_ID:
    identifier: true
    range: string
  ENVO_LABEL:
    range: string

Generalizing from tables on the web

You can use the generalize-htmltable command:

schemauto generalize-htmltable https://www.nature.com/articles/s41467-022-31626-4/tables/1

This will generate:

name: example
description: example
id: https://w3id.org/example
imports:
- linkml:types
prefixes:
  linkml: https://w3id.org/linkml/
  example: https://w3id.org/example
default_prefix: example
slots:
  GWAS trait:
    examples:
    - value: "\xC2"
    range: string
  Peak GWAS SNP:
    examples:
    - value: rs2974298
    range: string
  Gene:
    examples:
    - value: SMIM19
    range: string
  NK cell cis eSNP:
    examples:
    - value: rs2974348
    range: string
  TWAS Z score:
    examples:
    - value: '3.809'
    range: string
  TWAS P value:
    examples:
    - value: '0.0001'
    range: string
classes:
  example:
    slots:
    - GWAS trait
    - Peak GWAS SNP
    - Gene
    - NK cell cis eSNP
    - TWAS Z score
    - TWAS P value
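
To make the step from an HTML table to generalizable rows concrete, here is a small standard-library sketch that turns the first header row and subsequent body rows into a list of dictionaries. The class `TableRowParser` and the function `html_table_to_dicts` are hypothetical names for this example; the command itself may extract tables by other means.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class TableRowParser(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell: Optional[List[str]] = None  # text chunks of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def html_table_to_dicts(html: str) -> List[Dict[str, str]]:
    """Treat the first non-empty row as the header and zip it with each body row."""
    parser = TableRowParser()
    parser.feed(html)
    parser.close()
    rows = [r for r in parser.rows if r]
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, r)) for r in body]


# Tiny inline table using two of the column names from the output above
html = (
    "<table><tr><th>Gene</th><th>Peak GWAS SNP</th></tr>"
    "<tr><td>SMIM19</td><td>rs2974298</td></tr></table>"
)
print(html_table_to_dicts(html))
```

Each dictionary corresponds to one table row, keyed by column header, which is the same row-object shape that the tabular generalizer works with.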

Generalizing from JSON

Packages

class schema_automator.generalizers.CsvDataGeneralizer(identifier_slots: List[str] = <factory>, depluralize_class_names: bool = <factory>, inflect_engine: inflect.engine = <factory>, column_separator: str = '\t', schema_name: str = 'example', robot: bool = False, data_dictionary_row_count: int = 0, enum_columns: List[str] = <factory>, enum_mask_columns: List[str] = <factory>, enum_threshold: float = 0.1, enum_strlen_threshold: int = 30, max_enum_size: int = 50, downcase_header: bool = False, snakecase_header: bool = False, infer_foreign_keys: bool = False, max_pk_len: int = 60, min_distinct_fk_val: int = 8, source_schema: SchemaDefinition | None = None)

A Generalizer that generalizes from example CSV/TSV data

column_separator: str = '\t'

character that separates columns in the input file

convert(file: str, **kwargs) -> SchemaDefinition

Converts a single TSV file to a single-class schema

Parameters:
  • file

  • kwargs

Returns:

convert_dicts(rr: List[Dict], schema_name: str = 'example', class_name: str = 'Observation', **kwargs) -> SchemaDefinition | None

Converts a list of row objects to a schema.

Each row is a data item, presumed to be of the same type, that is generalized.

Parameters:
  • rr

  • schema_name

  • class_name

  • kwargs

Returns:
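
For orientation, the row objects expected by convert_dicts are plain dictionaries keyed by column name. The sketch below builds such a list from TSV text with the standard library; `rows_from_tsv` is a hypothetical helper, and the sample values are borrowed from the example slots earlier on this page rather than from an actual data row.

```python
import csv
import io
from typing import Dict, List


def rows_from_tsv(text: str, column_separator: str = "\t") -> List[Dict[str, str]]:
    """Parse delimited text into a list of {column name: cell value} dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter=column_separator)
    return [dict(row) for row in reader]


# Illustrative input only: one header line and one data line
tsv_text = "site\tplot\tplot_size\nZF20-105\t6\t10X10\n"
rows = rows_from_tsv(tsv_text)
print(rows)
```

Going by the signature above, a list like `rows` is what would be passed as the rr argument, with schema_name and class_name chosen by the caller.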

convert_from_dataframe(df: DataFrame, **kwargs) -> SchemaDefinition

Converts a single dataframe to a single-class schema

Parameters:
  • df

  • kwargs

Returns:

convert_multiple(files: List[str], **kwargs) -> SchemaDefinition

Converts multiple TSVs to a schema

Parameters:
  • files

  • kwargs

Returns:

convert_to_edge_slots(all_tsv_rows: List, name: str = 'Observation', **kwargs) -> Dict | None

Assumes that the TSV has three relevant columns:
  1. slot name to add
  2. slot definition to add
  3. examples of values for the slot

Also assumes, for now, that these are all edge_properties. TODO: add a parameter to allow edge or node property disambiguation.

data_dictionary_row_count: int = 0

number of rows after header containing data dictionary information

downcase_header: bool = False

If true, coerce column names to be lower case

enum_columns: List[str]

List of columns that are coerced into enums

enum_mask_columns: List[str]

List of columns that are excluded from being enums

enum_strlen_threshold: int = 30

Maximum length of a string to be considered a permissible enum value

enum_threshold: float = 0.1

If the number of distinct values divided by the total number of values is less than this, then the column is considered an enum

infer_foreign_keys: bool = False

For multi-CSV files, infer linkages between rows

infer_linkages(files: List[str], **kwargs) -> List[ForeignKey]

Heuristic procedure for determining which tables are linked to others via implicit foreign keys

If all values of one column FT.FC are present in column PT.PC, then FT.FC is a potential foreign key and PC is a potential primary key of PT.

This procedure can generate false positives, so additional heuristics are applied. Each potential foreign key relationship gets an ad-hoc score:

  • links across tables score more highly than within
  • suffixes such as _id are more likely on PK and FK columns
  • the foreign key column table is likely to start with the base column name

In addition, if there are competing primary keys for a table, the top scoring one is selected.
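
The procedure described above can be sketched as a subset test followed by ad-hoc scoring. Everything in this example is illustrative: the names `CandidateKey` and `candidate_foreign_keys`, the uniqueness requirement on primary keys, and the one-point score weights are assumptions, not the library's actual values.

```python
from typing import Dict, List, NamedTuple

Table = Dict[str, List[str]]  # column name -> list of cell values


class CandidateKey(NamedTuple):
    fk_table: str
    fk_column: str
    pk_table: str
    pk_column: str
    score: int


def candidate_foreign_keys(tables: Dict[str, Table]) -> List[CandidateKey]:
    """Find FT.FC -> PT.PC pairs where every FC value occurs in PC."""
    candidates: List[CandidateKey] = []
    for pt, pk_cols in tables.items():
        for pc, pk_vals in pk_cols.items():
            # Assumption: a primary key column is non-empty and unique
            if not pk_vals or len(set(pk_vals)) != len(pk_vals):
                continue
            pk_set = set(pk_vals)
            for ft, fk_cols in tables.items():
                for fc, fk_vals in fk_cols.items():
                    if (ft, fc) == (pt, pc) or not fk_vals:
                        continue
                    if not set(fk_vals) <= pk_set:
                        continue  # some FK value is missing from the PK
                    score = 0
                    if ft != pt:
                        score += 1  # links across tables score higher
                    if pc.lower().endswith("_id"):
                        score += 1  # _id suffix on the primary key
                    if fc.lower().endswith("_id"):
                        score += 1  # _id suffix on the foreign key
                    candidates.append(CandidateKey(ft, fc, pt, pc, score))
    return sorted(candidates, key=lambda c: c.score, reverse=True)


# Hypothetical miniature tables (identifiers shortened for readability)
tables = {
    "envo": {"ENVO_ID": ["ENVO_1", "ENVO_2", "ENVO_3"], "ENVO_LABEL": ["a", "b", "c"]},
    "sample": {"BIOSAMPLE_ID": ["1", "2"], "ENVO_BIOME_ID": ["ENVO_1", "ENVO_1"]},
}
print(candidate_foreign_keys(tables)[0])
```

Because the subset test alone is permissive on small tables, a minimum number of distinct values (compare min_distinct_fk_val) helps suppress accidental matches; the sketch omits that check for brevity.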

max_enum_size: int = 50

Max number of permissible values for a column to be considered an enum

max_pk_len: int = 60

Maximum length to be considered for a primary key column. Note: URIs can be long

min_distinct_fk_val: int = 8

When inferring foreign keys, the minimum number of distinct values a column must have to be considered.

robot: bool = False

If true, conforms to robot template format. Data dictionary rows start with ‘>’

schema_name: str = 'example'

LinkML schema name (no spaces)

snakecase_header: bool = False

If true, coerce column names to be snake case
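
The effect of the downcase_header and snakecase_header options can be pictured with a short sketch. The function `normalize_header` and its exact rules (splitting camelCase boundaries, collapsing non-alphanumeric runs to underscores) are assumptions for illustration; the library may normalize names differently.

```python
import re


def normalize_header(
    name: str, downcase_header: bool = False, snakecase_header: bool = False
) -> str:
    """Apply optional lower-casing or snake-casing to a column name."""
    if snakecase_header:
        # Insert "_" at camelCase boundaries, e.g. plotSize -> plot_Size
        name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
        # Collapse spaces and punctuation into single underscores
        name = re.sub(r"[^0-9A-Za-z]+", "_", name).strip("_").lower()
    elif downcase_header:
        name = name.lower()
    return name


print(normalize_header("GWAS trait", snakecase_header=True))
print(normalize_header("BIOSAMPLE_ID", downcase_header=True))
```

Normalizing headers this way avoids slot names containing spaces, such as those in the HTML table example.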

source_schema: SchemaDefinition | None = None

Optional base schema to draw from

class schema_automator.generalizers.JsonDataGeneralizer(identifier_slots: List[str] = <factory>, depluralize_class_names: bool = <factory>, inflect_engine: inflect.engine = <factory>, mappings: dict | None = None, omit_null: bool | None = None, inline_as_dict_slot_keys: Mapping[str, str] | None = None)

A generalizer that abstracts from JSON instance data

convert(input: str | Dict, format: str = 'json', container_class_name='Container', **kwargs) -> SchemaDefinition

Generalizes from a JSON file

Parameters:
  • input

  • format

  • container_class_name

  • kwargs

Returns:

inline_as_dict_slot_keys: Mapping[str, str] = None

Mapping between the name of a dict-inlined slot and the unique key for that entity
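
One way to read this mapping: when a slot's value is a dictionary whose keys are entity identifiers, the mapping names the field that those keys stand for. The sketch below expands such a dict-inlined slot into a list of objects. It is an interpretation under that assumption, with hypothetical names (`expand_dict_inlined`, `persons`, `id`), not the generalizer's own code.

```python
from typing import Any, Dict, Mapping


def expand_dict_inlined(
    obj: Dict[str, Any], inline_as_dict_slot_keys: Mapping[str, str]
) -> Dict[str, Any]:
    """Rewrite {key: {...}} collections as lists of objects carrying the key."""
    out = dict(obj)
    for slot, key_name in inline_as_dict_slot_keys.items():
        inlined = obj.get(slot)
        if isinstance(inlined, dict):
            out[slot] = [
                # Non-dict members contribute only their key
                {key_name: k, **(v if isinstance(v, dict) else {})}
                for k, v in inlined.items()
            ]
    return out


# Hypothetical instance data: persons are keyed by their identifier
data = {"persons": {"P1": {"name": "Ann"}, "P2": {"name": "Bo"}}}
print(expand_dict_inlined(data, {"persons": "id"}))
```

After expansion, each member has an explicit identifier field, so all members can be generalized as instances of a single class.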

class schema_automator.generalizers.RdfDataGeneralizer(identifier_slots: List[str] = <factory>, depluralize_class_names: bool = <factory>, inflect_engine: inflect.engine = <factory>, mappings: dict | None = None)

A generalizer that generalizes from source RDF turtle data

convert(file: str, dir: str, **kwargs) -> SchemaDefinition

Generalizes from an RDF file

Parameters:
  • file

  • dir

  • kwargs

Returns: