Generalizers¶
Generalizers take example data and generalize it to a schema.
Warning
Generalization is inherently a heuristic process. Treat it as a bootstrapping step that semi-automates the creation of a new schema for you.
Generalizing from a single TSV¶
schemauto generalize-tsv tests/resources/NWT_wildfires_biophysical_2016.tsv -o wildfire.yaml
The schema will have a slot for every column, e.g.:
classes:
  Observation:
    slots:
    - site
    - plot
    - plot_size
    - date
    - observer
Ranges will be auto-inferred, e.g.:
slots:
  site:
    examples:
    - value: ZF20-105
    range: string
  plot:
    examples:
    - value: '6'
    range: integer
  plot_size:
    examples:
    - value: 10X10
    range: plot_size_enum
  date:
    examples:
    - value: '2016-07-09'
    range: datetime
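As a rough illustration of how a range might be guessed from example values, the sketch below tries progressively looser parsers over a column's cells. It is not schema-automator's own inference logic: `infer_range` and `_all_parse` are hypothetical helpers, and the order of checks (integer, float, ISO 8601 datetime, then string) is an assumption. Enum detection is deliberately left out.

```python
from datetime import datetime
from typing import Callable, Iterable, List


def _all_parse(values: List[str], parser: Callable[[str], object]) -> bool:
    """Return True if `parser` accepts every value without raising ValueError."""
    try:
        for v in values:
            parser(v)
    except ValueError:
        return False
    return True


def infer_range(values: Iterable[str]) -> str:
    """Guess a range name for a column of raw string cells (hypothetical helper)."""
    vals = [v.strip() for v in values if v is not None and v.strip() != ""]
    if not vals:
        return "string"  # nothing to learn from, so fall back to the most general range
    if _all_parse(vals, int):
        return "integer"
    if _all_parse(vals, float):
        return "float"
    if _all_parse(vals, datetime.fromisoformat):
        return "datetime"
    return "string"


print(infer_range(["6", "12", "3"]))    # integer
print(infer_range(["2016-07-09"]))      # datetime
print(infer_range(["ZF20-105"]))        # string
```

A column only receives a narrow range when every non-empty cell parses, so a single stray value such as `n/a` pushes the whole column back to `string`.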
Enums will be automatically inferred:
enums:
  plot_size_enum:
    permissible_values:
      10X10:
        description: 10X10
      5x5:
        description: 5x5
      2.5X2.5:
        description: 2.5X2.5
      5X5:
        description: 5X5
      3x3:
        description: 3x3
  ecosystem_enum:
    permissible_values:
      Open Fen:
        description: Open Fen
      Treed Fen:
        description: Treed Fen
      Black Spruce:
        description: Black Spruce
      Poor Fen:
        description: Poor Fen
      Fen:
        description: Fen
      Lowland:
        description: Lowland
      Upland:
        description: Upland
      Bog:
        description: Bog
      Lowland Black Spruce:
        description: Lowland Black Spruce
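The sketch below shows one way a column could be judged enum-like and turned into a structure shaped like the output above. The parameter names and defaults (`enum_threshold`, `enum_strlen_threshold`, `max_enum_size`) are taken from the `CsvDataGeneralizer` reference further down; how the library actually combines them is an assumption here, in particular that a column qualifies when its ratio of distinct values to total values is low. `looks_like_enum` and `build_enum` are hypothetical names, and the column data is made up.

```python
from typing import Dict, List


def looks_like_enum(
    values: List[str],
    enum_threshold: float = 0.1,
    enum_strlen_threshold: int = 30,
    max_enum_size: int = 50,
) -> bool:
    """Heuristic: few, short, frequently repeated values suggest an enum."""
    vals = [v for v in values if v]
    if not vals:
        return False
    distinct = set(vals)
    if len(distinct) > max_enum_size:
        return False  # too many permissible values
    if max(len(v) for v in distinct) > enum_strlen_threshold:
        return False  # long strings look like free text
    # Assumption: a low distinct-to-total ratio marks the column as an enum.
    return len(distinct) / len(vals) < enum_threshold


def build_enum(slot_name: str, values: List[str]) -> Dict[str, dict]:
    """Build a dict shaped like the `enums` section shown above."""
    pvs = {v: {"description": v} for v in sorted(set(values)) if v}
    return {f"{slot_name}_enum": {"permissible_values": pvs}}


# Made-up column data for illustration only.
ecosystem = ["Fen"] * 40 + ["Bog"] * 35 + ["Upland"] * 25
site = [f"ZF20-{i}" for i in range(100)]

print(looks_like_enum(ecosystem))  # True: 3 distinct values over 100 rows
print(looks_like_enum(site))       # False: every value is distinct
print(build_enum("ecosystem", ecosystem))
```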
Generalizing from multiple TSVs¶
You can use the generalize-tsvs command to generalize from multiple TSVs, with foreign key linkages auto-inferred.
For example, given a file envo.tsv:
| envo term id | envo term label |
|---|---|
| ENVO_01000752 | area of barren land |
| ENVO_01001570 | terrestrial ecoregion |
| ENVO_01001581 | sea surface layer |
| ENVO_01001582 | forest floor |
And a file samples.tsv:
| BIOSAMPLE_ID | BIOSAMPLE_NAME | ENVO_BIOME_ID | ENVO_FEATURE_ID | ENVO_MATERIAL_ID |
|---|---|---|---|---|
| 156554 | Enriched cells from forest soil in Barre Woods Harvard Forest LTER site, Petersham, Massachusetts, United States - Alteio_BWOrgControl_Nextera2 | ENVO_01000174 | ENVO_01000159 | ENVO_00002261 |
| 156649 | Enriched cells from forest soil in Barre Woods Harvard Forest LTER site, Petersham, Massachusetts, United States - Alteio_BWOrgHeat_Nextera5 | ENVO_01000174 | ENVO_01000159 | ENVO_00005781 |
| 156728 | Enriched cells from forest soil in Barre Woods Harvard Forest LTER site, Petersham, Massachusetts, United States - Alteio_BWOrgHeat_Nextera84 | ENVO_01000174 | ENVO_01000159 | ENVO_00005781 |
| 156738 | Enriched cells from forest soil in Barre Woods Harvard Forest LTER site, Petersham, Massachusetts, United States - Alteio_BWMinControl_Nextera2 | ENVO_01000174 | ENVO_01001275 | ENVO_00002261 |
We can create a multi-class schema, with foreign keys inferred:
schemauto generalize-tsvs --infer-foreign-keys sample.tsv envo.tsv
This will generate a schema with two classes, where the join between the sample table and the term table is inferred:
classes:
  sample:
    slots:
    - BIOSAMPLE_ID
    - BIOSAMPLE_NAME
    - ENVO_BIOME_ID
    - ENVO_FEATURE_ID
    - ENVO_MATERIAL_ID
  envo:
    slots:
    - ENVO_ID
    - ENVO_LABEL
slots:
  BIOSAMPLE_ID:
    range: integer
  BIOSAMPLE_NAME:
    range: string
  ENVO_BIOME_ID:
    examples:
    - value: ENVO_01000022
    range: envo
  ENVO_FEATURE_ID:
    range: envo
  ENVO_MATERIAL_ID:
    range: envo
  ENVO_ID:
    identifier: true
    range: string
  ENVO_LABEL:
    range: string
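The containment idea behind foreign key inference (every value of a candidate foreign key column also appears in a candidate primary key column) can be sketched as follows. This is a simplified illustration, not the library's `infer_linkages` implementation: it omits the ad-hoc scoring step, the uniqueness check on the primary key column is an assumption, and `candidate_foreign_keys` plus the toy tables are hypothetical. Only the `min_distinct_fk_val` name and its default come from the reference below.

```python
from typing import Dict, List, Tuple

Table = Dict[str, List[str]]  # column name -> cell values, one per row


def candidate_foreign_keys(
    tables: Dict[str, Table], min_distinct_fk_val: int = 8
) -> List[Tuple[str, str, str, str]]:
    """Return (FT, FC, PT, PC) tuples where every FT.FC value occurs in PT.PC."""
    candidates = []
    for pt, pt_cols in tables.items():
        for pc, pk_vals in pt_cols.items():
            pk_set = set(pk_vals)
            if len(pk_set) != len(pk_vals):
                continue  # assumption: a primary key column has no duplicates
            for ft, ft_cols in tables.items():
                for fc, fk_vals in ft_cols.items():
                    if (ft, fc) == (pt, pc):
                        continue  # a column is trivially contained in itself
                    fk_set = set(fk_vals)
                    if len(fk_set) < min_distinct_fk_val:
                        continue  # too few distinct values to trust the match
                    if fk_set <= pk_set:
                        candidates.append((ft, fc, pt, pc))
    return candidates


# Made-up miniature tables; the threshold is lowered so the toy data qualifies.
tables = {
    "terms": {"id": ["T1", "T2", "T3"], "label": ["a", "b", "c"]},
    "samples": {
        "sample_id": ["1", "2", "3", "4"],
        "term_id": ["T1", "T2", "T1", "T3"],
    },
}
print(candidate_foreign_keys(tables, min_distinct_fk_val=2))
# [('samples', 'term_id', 'terms', 'id')]
```

With the default threshold of 8 the toy tables yield no candidates, which shows why very small input files may not produce any inferred linkages.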
Generalizing from tables on the web¶
You can use the generalize-htmltable command to generalize from an HTML table on a web page:
schemauto generalize-htmltable https://www.nature.com/articles/s41467-022-31626-4/tables/1
This will generate:
name: example
description: example
id: https://w3id.org/example
imports:
- linkml:types
prefixes:
  linkml: https://w3id.org/linkml/
  example: https://w3id.org/example
default_prefix: example
slots:
  GWAS trait:
    examples:
    - value: "\xC2"
    range: string
  Peak GWAS SNP:
    examples:
    - value: rs2974298
    range: string
  Gene:
    examples:
    - value: SMIM19
    range: string
  NK cell cis eSNP:
    examples:
    - value: rs2974348
    range: string
  TWAS Z score:
    examples:
    - value: '3.809'
    range: string
  TWAS P value:
    examples:
    - value: '0.0001'
    range: string
classes:
  example:
    slots:
    - GWAS trait
    - Peak GWAS SNP
    - Gene
    - NK cell cis eSNP
    - TWAS Z score
    - TWAS P value
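To make the table-to-schema step more concrete, here is a minimal sketch of turning an HTML table into row dictionaries keyed by column header, which is the shape that `convert_dicts` (see the reference below) accepts. How the command itself fetches and parses the page is not described here, so this standard-library approach is an assumption; `TableRows` and `html_table_to_dicts` are hypothetical names, and the snippet works on an inline string rather than a live URL.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr>."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._cell: Optional[List[str]] = None  # text fragments of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None


def html_table_to_dicts(html: str) -> List[Dict[str, str]]:
    """Treat the first row as the header and zip it with each later row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    rows = [r for r in parser.rows if r]
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, r)) for r in body]


html = (
    "<table><tr><th>Peak GWAS SNP</th><th>Gene</th></tr>"
    "<tr><td>rs2974298</td><td>SMIM19</td></tr></table>"
)
print(html_table_to_dicts(html))
# [{'Peak GWAS SNP': 'rs2974298', 'Gene': 'SMIM19'}]
```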
Generalizing from JSON¶
To be written.
Chaining an annotator¶
If you provide an --annotator option, you can auto-annotate enums:
schemauto generalize-tsv \
    --annotator bioportal:envo \
    tests/resources/NWT_wildfires_biophysical_2016.tsv \
    -o wildfire.yaml
ecosystem_enum:
  from_schema: https://w3id.org/MySchema
  permissible_values:
    Open Fen:
      description: Open Fen
      meaning: ENVO:00000232
      exact_mappings:
      - ENVO:00000232
    Treed Fen:
      description: Treed Fen
      meaning: ENVO:00000232
      exact_mappings:
      - ENVO:00000232
    Black Spruce:
      description: Black Spruce
    Poor Fen:
      description: Poor Fen
      meaning: ENVO:00000232
      exact_mappings:
      - ENVO:00000232
    Fen:
      description: Fen
      meaning: ENVO:00000232
    Lowland:
      description: Lowland
    Upland:
      description: Upland
      meaning: ENVO:00000182
    Bog:
      description: Bog
      meaning: ENVO:01000534
      exact_mappings:
      - ENVO:01000535
      - ENVO:00000044
      - ENVO:01001209
      - ENVO:01000527
    Lowland Black Spruce:
      description: Lowland Black Spruce
The annotation can also be run as a separate step; see Annotators.
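Conceptually, annotation attaches an ontology term to each permissible value whose text matches a known label. The sketch below imitates that with an in-memory lookup instead of a real annotation service; it is not how the bioportal annotator works internally. `annotate_enum` and `label_to_curie` are hypothetical, and the two CURIEs are simply copied from the example output above.

```python
from copy import deepcopy
from typing import Dict


def annotate_enum(enum: dict, label_to_curie: Dict[str, str]) -> dict:
    """Return a copy of `enum` with `meaning` set wherever a label matches."""
    out = deepcopy(enum)  # leave the caller's schema fragment untouched
    for text, pv in out["permissible_values"].items():
        curie = label_to_curie.get(text.lower())
        if curie is not None:
            pv["meaning"] = curie
    return out


# Stand-in for an annotation service: lower-cased label -> CURIE.
label_to_curie = {"fen": "ENVO:00000232", "bog": "ENVO:01000534"}

ecosystem_enum = {
    "permissible_values": {
        "Fen": {"description": "Fen"},
        "Bog": {"description": "Bog"},
        "Lowland": {"description": "Lowland"},
    }
}

annotated = annotate_enum(ecosystem_enum, label_to_curie)
print(annotated["permissible_values"]["Fen"])      # gains a 'meaning' key
print(annotated["permissible_values"]["Lowland"])  # unchanged: no match found
```

Values without a match are left as they are, which mirrors entries such as Lowland and Black Spruce in the output above that carry no `meaning`.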
Packages for generalizing¶
- class schema_automator.generalizers.CsvDataGeneralizer(identifier_slots: ~typing.List[str] = <factory>, depluralize_class_names: bool = <factory>, inflect_engine: ~inflect.engine = <factory>, column_separator: str = '\t', schema_name: str = 'example', robot: bool = False, data_dictionary_row_count: int = 0, enum_columns: ~typing.List[str] = <factory>, enum_mask_columns: ~typing.List[str] = <factory>, enum_threshold: float = 0.1, enum_strlen_threshold: int = 30, max_enum_size: int = 50, downcase_header: bool = False, snakecase_header: bool = False, infer_foreign_keys: bool = False, max_pk_len: int = 60, min_distinct_fk_val: int = 8, source_schema: ~linkml_runtime.linkml_model.meta.SchemaDefinition | None = None)[source]¶
A Generalizer that generalizes from example CSV/TSV data
- column_separator: str = '\t'¶
character that separates columns in the input file
- convert(file: str, **kwargs) SchemaDefinition [source]¶
Converts a single TSV file to a single-class schema
- Parameters:
file
kwargs
- Returns:
- convert_dicts(rr: List[Dict], schema_name: str = 'example', class_name: str = 'Observation', **kwargs) SchemaDefinition | None [source]¶
Converts a list of row objects to a schema.
Each row is a data item, presumed to be of the same type, that is generalized.
- Parameters:
rr
schema_name
class_name
kwargs
- Returns:
- convert_from_dataframe(df: DataFrame, **kwargs) SchemaDefinition [source]¶
Converts a single dataframe to a single-class schema
- Parameters:
df
kwargs
- Returns:
- convert_multiple(files: List[str], **kwargs) SchemaDefinition [source]¶
Converts multiple TSVs to a schema
- Parameters:
files
kwargs
- Returns:
- convert_to_edge_slots(all_tsv_rows: List, name: str = 'Observation', **kwargs) Dict | None [source]¶
Assumes that the TSV has three relevant columns:
- slot name to add
- slot definition to add
- examples of values for the slot
Also assumes, for now, that these are all edge properties. TODO: add a parameter to allow edge or node property disambiguation.
- data_dictionary_row_count: int = 0¶
number of rows after header containing data dictionary information
- downcase_header: bool = False¶
If true, coerce column names to be lower case
- enum_columns: List[str]¶
List of columns that are coerced into enums
- enum_mask_columns: List[str]¶
List of columns that are excluded from being enums
- enum_strlen_threshold: int = 30¶
Maximum length of a string to be considered a permissible enum value
- enum_threshold: float = 0.1¶
If the number of distinct values divided by the total number of values is less than this, then the column is considered an enum
- infer_foreign_keys: bool = False¶
For multi-CSV files, infer linkages between rows
- infer_linkages(files: List[str], **kwargs) List[ForeignKey] [source]¶
Heuristic procedure for determining which tables are linked to others via implicit foreign keys
If all values of one column FT.FC are present in column PT.PC, then FT.FC is a potential foreign key and PC is a potential primary key of PT.
This procedure can generate false positives, so additional heuristics are applied. Each potential foreign key relationship gets an ad-hoc score:
- links across tables score more highly than within
- suffixes such as _id are more likely on PK and FK tables
- the foreign key column table is likely to start with the base column name
In addition, if there are competing primary keys for a table, the top-scoring one is selected.
- max_enum_size: int = 50¶
Max number of permissible values for a column to be considered an enum
- max_pk_len: int = 60¶
Maximum length to be considered for a primary key column. Note: URIs can be long
- min_distinct_fk_val: int = 8¶
For inferring foreign keys, a candidate column must have at least this many distinct values.
- robot: bool = False¶
If true, conforms to robot template format. Data dictionary rows start with ‘>’
- schema_name: str = 'example'¶
LinkML schema name (no spaces)
- snakecase_header: bool = False¶
If true, coerce column names to be snake case
- source_schema: SchemaDefinition | None = None¶
Optional base schema to draw from
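The `downcase_header` and `snakecase_header` options coerce column names before slots are created. A minimal sketch of what such normalization could look like, assuming a simple regex-based snake-casing rule (the library may apply a different one); `snakecase` and `normalize_header` are hypothetical names:

```python
import re
from typing import List


def snakecase(name: str) -> str:
    """Lower-case a column name and join its words with underscores."""
    s = re.sub(r"[^0-9A-Za-z]+", "_", name.strip())  # spaces, dots, dashes -> _
    s = re.sub(r"([a-z0-9])([A-Z])", r"\1_\2", s)    # split camelCase boundaries
    return s.strip("_").lower()


def normalize_header(
    columns: List[str],
    downcase_header: bool = False,
    snakecase_header: bool = False,
) -> List[str]:
    """Apply the requested coercion to every column name."""
    out = []
    for col in columns:
        if snakecase_header:
            col = snakecase(col)
        elif downcase_header:
            col = col.lower()
        out.append(col)
    return out


print(normalize_header(["envo term id", "BIOSAMPLE_ID"], snakecase_header=True))
# ['envo_term_id', 'biosample_id']
print(normalize_header(["TWAS Z score"], downcase_header=True))
# ['twas z score']
```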
- class schema_automator.generalizers.JsonDataGeneralizer(identifier_slots: ~typing.List[str] = <factory>, depluralize_class_names: bool = <factory>, inflect_engine: ~inflect.engine = <factory>, mappings: dict | None = None, omit_null: bool | None = None, inline_as_dict_slot_keys: ~typing.Mapping[str, str] | None = None)[source]¶
A generalizer that abstracts from JSON instance data
- convert(input: str | Dict, format: str = 'json', container_class_name='Container', **kwargs) SchemaDefinition [source]¶
Generalizes from a JSON file
- Parameters:
input
format
container_class_name
kwargs
- Returns:
- inline_as_dict_slot_keys: Mapping[str, str] = None¶
Mapping between the name of a dict-inlined slot and the unique key for that entity