Generalizers¶

Generalizers take example data and generalize it to a schema.

Warning

Generalization is inherently a heuristic process. Treat it as a bootstrapping step that semi-automates the creation of a new schema for you.

Generalizing from a single TSV¶

schemauto  generalize-tsv tests/resources/NWT_wildfires_biophysical_2016.tsv -o wildfire.yaml

The schema will have a slot for every column, e.g.:

classes:
  Observation:
    slots:
    - site
    - plot
    - plot_size
    - date
    - observer

Ranges will be auto-inferred, e.g.:

slots:
  site:
    examples:
    - value: ZF20-105
    range: string
  plot:
    examples:
    - value: '6'
    range: integer
  plot_size:
    examples:
    - value: 10X10
    range: plot_size_enum
  date:
    examples:
    - value: '2016-07-09'
    range: datetime
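
As a rough illustration of how a range might be guessed from example values, here is a minimal sketch that tries progressively looser parsers. The function name `infer_range` and the order of checks are assumptions made for this example, not schema-automator's actual logic; the sample values are taken from the output above.

```python
from datetime import datetime


def infer_range(values):
    """Guess a range name for a list of string cell values.

    Illustrative heuristic only (hypothetical helper, not the library's code).
    """
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"

    def all_parse(parser):
        # True only if every non-empty value is accepted by the parser.
        for v in non_empty:
            try:
                parser(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "datetime"
    return "string"


columns = {
    "site": ["ZF20-105"],
    "plot": ["6"],
    "date": ["2016-07-09"],
}
for name, vals in columns.items():
    print(name, infer_range(vals))
```

A real implementation also has to decide when a string column is better modelled as an enum, which this sketch does not attempt.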

Enums will be automatically inferred:

enums:
  plot_size_enum:
    permissible_values:
      10X10:
        description: 10X10
      5x5:
        description: 5x5
      2.5X2.5:
        description: 2.5X2.5
      5X5:
        description: 5X5
      3x3:
        description: 3x3
  ecosystem_enum:
    permissible_values:
      Open Fen:
        description: Open Fen
      Treed Fen:
        description: Treed Fen
      Black Spruce:
        description: Black Spruce
      Poor Fen:
        description: Poor Fen
      Fen:
        description: Fen
      Lowland:
        description: Lowland
      Upland:
        description: Upland
      Bog:
        description: Bog
      Lowland Black Spruce:
        description: Lowland Black Spruce
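
The shape of an inferred enum can be reproduced with a few lines of Python. This is a sketch under the assumption that each distinct, non-empty value becomes one permissible value whose description repeats the value; `build_enum` is a hypothetical helper, and the repetition pattern in the input list is invented for illustration.

```python
def build_enum(values):
    """Build a permissible_values mapping from the distinct values of a column."""
    permissible = {}
    for v in values:
        # Skip empty cells; keep each distinct value once, in first-seen order.
        if v and v not in permissible:
            permissible[v] = {"description": v}
    return {"permissible_values": permissible}


# Values from the plot_size column above; the repeats are hypothetical.
plot_sizes = ["10X10", "5x5", "10X10", "2.5X2.5", "5X5", "3x3", "5x5"]
enum = build_enum(plot_sizes)
print(list(enum["permissible_values"]))
# prints ['10X10', '5x5', '2.5X2.5', '5X5', '3x3']
```

Because values are compared as exact strings, `5x5` and `5X5` remain separate permissible values, as they do in the generated schema above. Such near-duplicates are worth reviewing by hand.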

Generalizing from multiple TSVs¶

You can use the generalize-tsvs command to generalize from multiple TSVs, with foreign key linkages auto-inferred.

For example, given a file envo.tsv:

environments¶
envo term id	envo term label
ENVO_01000752	area of barren land
ENVO_01001570	terrestrial ecoregion
ENVO_01001581	sea surface layer
ENVO_01001582	forest floor

And a file samples.tsv:

samples¶
BIOSAMPLE_ID	BIOSAMPLE_NAME	ENVO_BIOME_ID	ENVO_FEATURE_ID	ENVO_MATERIAL_ID
156554	Enriched cells from forest soil in Barre Woods Harvard Forest LTER site, Petersham, Massachusetts, United States - Alteio_BWOrgControl_Nextera2	ENVO_01000174	ENVO_01000159	ENVO_00002261
156649	Enriched cells from forest soil in Barre Woods Harvard Forest LTER site, Petersham, Massachusetts, United States - Alteio_BWOrgHeat_Nextera5	ENVO_01000174	ENVO_01000159	ENVO_00005781
156728	Enriched cells from forest soil in Barre Woods Harvard Forest LTER site, Petersham, Massachusetts, United States - Alteio_BWOrgHeat_Nextera84	ENVO_01000174	ENVO_01000159	ENVO_00005781
156738	Enriched cells from forest soil in Barre Woods Harvard Forest LTER site, Petersham, Massachusetts, United States - Alteio_BWMinControl_Nextera2	ENVO_01000174	ENVO_01001275	ENVO_00002261

We can create a multi-class schema, with foreign keys inferred:

schemauto generalize-tsvs --infer-foreign-keys sample.tsv envo.tsv

This will generate a schema with two classes, where the join between the sample table and the term table is inferred:

classes:
  sample:
    slots:
    - BIOSAMPLE_ID
    - BIOSAMPLE_NAME
    - ENVO_BIOME_ID
    - ENVO_FEATURE_ID
    - ENVO_MATERIAL_ID
  envo:
    slots:
    - ENVO_ID
    - ENVO_LABEL

slots:
  BIOSAMPLE_ID:
    range: integer
  BIOSAMPLE_NAME:
    range: string
  ENVO_BIOME_ID:
    examples:
    - value: ENVO_01000022
    range: envo
  ENVO_FEATURE_ID:
    range: envo
  ENVO_MATERIAL_ID:
    range: envo
  ENVO_ID:
    identifier: true
    range: string
  ENVO_LABEL:
    range: string

Generalizing from tables on the web¶

You can use the generalize-htmltable command to generalize from an HTML table on a web page:

schemauto  generalize-htmltable  https://www.nature.com/articles/s41467-022-31626-4/tables/1

This will generate:

name: example
description: example
id: https://w3id.org/example
imports:
- linkml:types
prefixes:
  linkml: https://w3id.org/linkml/
  example: https://w3id.org/example
default_prefix: example
slots:
  GWAS trait:
    examples:
    - value: "\xC2"
    range: string
  Peak GWAS SNP:
    examples:
    - value: rs2974298
    range: string
  Gene:
    examples:
    - value: SMIM19
    range: string
  NK cell cis eSNP:
    examples:
    - value: rs2974348
    range: string
  TWAS Z score:
    examples:
    - value: '3.809'
    range: string
  TWAS P value:
    examples:
    - value: '0.0001'
    range: string
classes:
  example:
    slots:
    - GWAS trait
    - Peak GWAS SNP
    - Gene
    - NK cell cis eSNP
    - TWAS Z score
    - TWAS P value

Generalizing from JSON¶

To be written.

Chaining an annotator¶

If you provide an --annotator option you can auto-annotate enums:

schemauto  generalize-tsv \
    --annotator bioportal:envo \
    tests/resources/NWT_wildfires_biophysical_2016.tsv \
    -o wildfire.yaml

ecosystem_enum:
  from_schema: https://w3id.org/MySchema
  permissible_values:
    Open Fen:
      description: Open Fen
      meaning: ENVO:00000232
      exact_mappings:
      - ENVO:00000232
    Treed Fen:
      description: Treed Fen
      meaning: ENVO:00000232
      exact_mappings:
      - ENVO:00000232
    Black Spruce:
      description: Black Spruce
    Poor Fen:
      description: Poor Fen
      meaning: ENVO:00000232
      exact_mappings:
      - ENVO:00000232
    Fen:
      description: Fen
      meaning: ENVO:00000232
    Lowland:
      description: Lowland
    Upland:
      description: Upland
      meaning: ENVO:00000182
    Bog:
      description: Bog
      meaning: ENVO:01000534
      exact_mappings:
      - ENVO:01000535
      - ENVO:00000044
      - ENVO:01001209
      - ENVO:01000527
    Lowland Black Spruce:
      description: Lowland Black Spruce

The annotation can also be run as a separate step.

See Annotators.

Packages for generalizing¶

class schema_automator.generalizers.CsvDataGeneralizer(identifier_slots: List[str] = <factory>, depluralize_class_names: bool = <factory>, inflect_engine: inflect.engine = <factory>, column_separator: str = '\t', schema_name: str = 'example', robot: bool = False, data_dictionary_row_count: int = 0, enum_columns: List[str] = <factory>, enum_mask_columns: List[str] = <factory>, enum_threshold: float = 0.1, enum_strlen_threshold: int = 30, max_enum_size: int = 50, downcase_header: bool = False, snakecase_header: bool = False, infer_foreign_keys: bool = False, max_pk_len: int = 60, min_distinct_fk_val: int = 8, source_schema: SchemaDefinition | None = None)[source]¶

A Generalizer that generalizes from example CSV/TSV data

column_separator: str = '\t'¶: character that separates columns in the input file

convert(file: str, **kwargs) → SchemaDefinition[source]¶

Converts a single TSV file to a single-class schema

Parameters:

file
kwargs

Returns:

convert_dicts(rr: List[Dict], schema_name: str = 'example', class_name: str = 'Observation', **kwargs) → SchemaDefinition | None[source]¶

Converts a list of row objects to a schema.

Each row is a data item, presumed to be of the same type, that is generalized.

Parameters:

rr
schema_name
class_name
kwargs

Returns:
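
The `rr` argument of convert_dicts is a plain list of row dictionaries. As a minimal sketch of what such a list looks like, the standard library's `csv.DictReader` can produce one from tab-separated text. The variable names here are illustrative, and the rows are a subset of the envo.tsv example above; passing the result to convert_dicts is left out because it requires the library to be installed.

```python
import csv
import io

# Tab-separated text; the first line is the header.
tsv_text = (
    "envo term id\tenvo term label\n"
    "ENVO_01000752\tarea of barren land\n"
    "ENVO_01001570\tterrestrial ecoregion\n"
)

reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
rows = [dict(r) for r in reader]  # one dict per data row, keyed by column name
print(rows[0])
# prints {'envo term id': 'ENVO_01000752', 'envo term label': 'area of barren land'}
```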

convert_from_dataframe(df: DataFrame, **kwargs) → SchemaDefinition[source]¶

Converts a single dataframe to a single-class schema

Parameters:

df
kwargs

Returns:

convert_multiple(files: List[str], **kwargs) → SchemaDefinition[source]¶

Converts multiple TSVs to a schema

Parameters:

files
kwargs

Returns:

convert_to_edge_slots(all_tsv_rows: List, name: str = 'Observation', **kwargs) → Dict | None[source]¶

Assumes that the TSV has three relevant columns:

slot name to add
slot definition to add
examples of values for the slot

Also assumes that these are all edge_properties at the moment. TODO: add a parameter to allow edge or node property disambiguation.

data_dictionary_row_count: int = 0¶: number of rows after header containing data dictionary information

downcase_header: bool = False¶: If true, coerce column names to be lower case

enum_columns: List[str]¶: List of columns that are coerced into enums

enum_mask_columns: List[str]¶: List of columns that are excluded from being enums

enum_strlen_threshold: int = 30¶: Maximum length of a string to be considered a permissible enum value

enum_threshold: float = 0.1¶: If the number of distinct values divided by the total number of values is greater than this, then the column is considered an enum

infer_foreign_keys: bool = False¶: For multi-CSV files, infer linkages between rows

infer_linkages(files: List[str], **kwargs) → List[ForeignKey][source]¶

Heuristic procedure for determining which tables are linked to others via implicit foreign keys

If all values of one column FT.FC are present in column PT.PC, then FT.FC is a potential foreign key and PC is a potential primary key of PT.

This procedure can generate false positives, so additional heuristics are applied. Each potential foreign key relationship gets an ad-hoc score:

- links across tables score more highly than within
- suffixes such as _id are more likely on PK and FK tables
- the foreign key column table is likely to start with the base column name

In addition, if there are competing primary keys for a table, the top scoring one is selected.
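
The containment test at the core of this procedure can be sketched as follows. This is not the library's implementation: the function name `candidate_foreign_keys`, the table layout, and the extra requirement that the primary key column holds unique values are assumptions for this example, and the scoring heuristics are omitted. The table contents are hypothetical.

```python
def candidate_foreign_keys(tables):
    """Return (fk_table, fk_col, pk_table, pk_col) tuples.

    A pair qualifies when every value of the FK column occurs in the PK
    column and the PK column's values are unique (assumed requirement).
    tables maps table name -> {column name -> list of values}.
    """
    candidates = []
    for ft, fcols in tables.items():
        for fc, fvals in fcols.items():
            fset = set(fvals)
            if not fset:
                continue  # an empty column cannot be evidence of a link
            for pt, pcols in tables.items():
                for pc, pvals in pcols.items():
                    if (ft, fc) == (pt, pc):
                        continue  # a column is not a foreign key to itself
                    is_unique = len(set(pvals)) == len(pvals)
                    if is_unique and fset <= set(pvals):
                        candidates.append((ft, fc, pt, pc))
    return candidates


# Hypothetical miniature tables.
tables = {
    "sample": {"id": ["s1", "s2", "s3"], "envo_id": ["E1", "E2", "E1"]},
    "envo": {"id": ["E1", "E2", "E3"], "label": ["a", "b", "c"]},
}
print(candidate_foreign_keys(tables))
# prints [('sample', 'envo_id', 'envo', 'id')]
```

On real data a pure containment test matches many unrelated columns, for example small integer columns, which is why the scoring heuristics described above are needed on top of it.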

max_enum_size: int = 50¶: Max number of permissible values for a column to be considered an enum

max_pk_len: int = 60¶: Maximum length to be considered for a primary key column. Note: URIs can be long

min_distinct_fk_val: int = 8¶: For inferring foreign keys, the minimum number of distinct values a column must have to be considered a foreign key.

robot: bool = False¶: If true, conforms to robot template format. Data dictionary rows start with ‘>’

schema_name: str = 'example'¶: LinkML schema name (no spaces)

snakecase_header: bool = False¶: If true, coerce column names to be snake case

source_schema: SchemaDefinition | None = None¶: Optional base schema to draw from

class schema_automator.generalizers.JsonDataGeneralizer(identifier_slots: List[str] = <factory>, depluralize_class_names: bool = <factory>, inflect_engine: inflect.engine = <factory>, mappings: dict | None = None, omit_null: bool | None = None, inline_as_dict_slot_keys: Mapping[str, str] | None = None)[source]¶

A generalizer that abstracts from JSON instance data

convert(input: str | Dict, format: str = 'json', container_class_name='Container', **kwargs) → SchemaDefinition[source]¶

Generalizes from a JSON file

Parameters:

input
format
container_class_name
kwargs

Returns:

inline_as_dict_slot_keys: Mapping[str, str] = None¶: Mapping between the name of a dict-inlined slot and the unique key for that entity

class schema_automator.generalizers.RdfDataGeneralizer(identifier_slots: List[str] = <factory>, depluralize_class_names: bool = <factory>, inflect_engine: inflect.engine = <factory>, mappings: dict | None = None)[source]¶

A generalizer that generalizes from source RDF turtle data

convert(file: str, dir: str, **kwargs) → SchemaDefinition[source]¶

Generalizes from an RDF file

Parameters:

file
dir
kwargs

Returns:
