Generalizers¶
Generalizers take example data and generalize it to a schema.
Warning
Generalization is inherently a heuristic process. Treat it as a bootstrapping step that semi-automates the creation of a new schema for you.
Generalizing from a single TSV¶
schemauto generalize-csv tests/resources/NWT_wildfires_biophysical_2016.tsv -o wildfire.yaml
The schema will have a slot for every column, e.g.:
classes:
  Observation:
    slots:
      - site
      - plot
      - plot_size
      - date
      - observer
Ranges will be auto-inferred, e.g.:
slots:
  site:
    examples:
      - value: ZF20-105
    range: string
  plot:
    examples:
      - value: '6'
    range: integer
  plot_size:
    examples:
      - value: 10X10
    range: plot_size_enum
  date:
    examples:
      - value: '2016-07-09'
    range: datetime
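The idea behind range inference can be sketched in a few lines of standard-library Python. This is a minimal illustration, not schema-automator's actual implementation: the function name infer_range and the order of the checks are assumptions, and the date format is limited to the YYYY-MM-DD form seen in the example above.

```python
from datetime import datetime


def infer_range(values):
    """Return the narrowest range name that fits every non-empty value.

    Hypothetical helper for illustration only; the real generalizer
    applies its own (more elaborate) heuristics.
    """
    vals = [v for v in values if v not in (None, "")]
    if not vals:
        return "string"

    def all_parse(parser):
        # True only if the parser accepts every value without error
        for v in vals:
            try:
                parser(v)
            except ValueError:
                return False
        return True

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    if all_parse(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "datetime"
    return "string"


# Example values taken from the schema excerpt above
columns = {
    "site": ["ZF20-105"],
    "plot": ["6"],
    "date": ["2016-07-09"],
}
inferred = {name: infer_range(vals) for name, vals in columns.items()}
print(inferred)
```

Trying the most specific parser first and falling back to string keeps the sketch conservative: a single non-conforming value demotes the whole column to string.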
Enums will be automatically inferred:
enums:
  plot_size_enum:
    permissible_values:
      10X10:
        description: 10X10
      5x5:
        description: 5x5
      2.5X2.5:
        description: 2.5X2.5
      5X5:
        description: 5X5
      3x3:
        description: 3x3
  ecosystem_enum:
    permissible_values:
      Open Fen:
        description: Open Fen
      Treed Fen:
        description: Treed Fen
      Black Spruce:
        description: Black Spruce
      Poor Fen:
        description: Poor Fen
      Fen:
        description: Fen
      Lowland:
        description: Lowland
      Upland:
        description: Upland
      Bog:
        description: Bog
      Lowland Black Spruce:
        description: Lowland Black Spruce
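A rough sketch of how a column's distinct values might be turned into an enum of this shape follows. It is an assumption-laden illustration rather than the library's code: infer_enum is a hypothetical helper, it only applies the size and string-length limits (mirroring the max_enum_size and enum_strlen_threshold defaults documented under Packages), and it omits the ratio test governed by enum_threshold.

```python
def infer_enum(column_name, values, max_enum_size=50, enum_strlen_threshold=30):
    """Return an enum definition for the column, or None if it does not qualify.

    Hypothetical helper; the real generalizer applies further checks.
    """
    # Collect distinct non-empty values, preserving first-seen order
    distinct = []
    for v in values:
        if v and v not in distinct:
            distinct.append(v)

    # Too many distinct values, or none at all: not an enum
    if not distinct or len(distinct) > max_enum_size:
        return None
    # Long free-text strings are unlikely to be permissible values
    if any(len(v) > enum_strlen_threshold for v in distinct):
        return None

    return {
        f"{column_name}_enum": {
            "permissible_values": {v: {"description": v} for v in distinct}
        }
    }


plot_sizes = ["10X10", "5x5", "10X10", "2.5X2.5", "5X5", "3x3"]
enum_def = infer_enum("plot_size", plot_sizes)
print(enum_def)
```

Note that values differing only in case (5x5 and 5X5) stay separate, as they do in the generated schema above; cleaning them up is left to the schema author.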
Chaining an annotator¶
If you provide an --annotator option, you can auto-annotate enums:
schemauto generalize-csv \
--annotator bioportal:envo \
tests/resources/NWT_wildfires_biophysical_2016.tsv \
-o wildfire.yaml
ecosystem_enum:
  from_schema: https://w3id.org/MySchema
  permissible_values:
    Open Fen:
      description: Open Fen
      meaning: ENVO:00000232
      exact_mappings:
        - ENVO:00000232
    Treed Fen:
      description: Treed Fen
      meaning: ENVO:00000232
      exact_mappings:
        - ENVO:00000232
    Black Spruce:
      description: Black Spruce
    Poor Fen:
      description: Poor Fen
      meaning: ENVO:00000232
      exact_mappings:
        - ENVO:00000232
    Fen:
      description: Fen
      meaning: ENVO:00000232
    Lowland:
      description: Lowland
    Upland:
      description: Upland
      meaning: ENVO:00000182
    Bog:
      description: Bog
      meaning: ENVO:01000534
      exact_mappings:
        - ENVO:01000535
        - ENVO:00000044
        - ENVO:01001209
        - ENVO:01000527
    Lowland Black Spruce:
      description: Lowland Black Spruce
The annotation can also be run as a separate step; see Annotators.
Generalizing from multiple TSVs¶
You can use the generalize-tsvs command to generalize from multiple TSVs, with foreign key linkages auto-inferred.
For example, given a file envo.tsv:

| envo term id | envo term label |
|---|---|
| ENVO_01000752 | area of barren land |
| ENVO_01001570 | terrestrial ecoregion |
| ENVO_01001581 | sea surface layer |
| ENVO_01001582 | forest floor |
And a file samples.tsv:

| BIOSAMPLE_ID | BIOSAMPLE_NAME | ENVO_BIOME_ID | ENVO_FEATURE_ID | ENVO_MATERIAL_ID |
|---|---|---|---|---|
| 156554 | Enriched cells from forest soil in Barre Woods Harvard Forest LTER site, Petersham, Massachusetts, United States - Alteio_BWOrgControl_Nextera2 | ENVO_01000174 | ENVO_01000159 | ENVO_00002261 |
| 156649 | Enriched cells from forest soil in Barre Woods Harvard Forest LTER site, Petersham, Massachusetts, United States - Alteio_BWOrgHeat_Nextera5 | ENVO_01000174 | ENVO_01000159 | ENVO_00005781 |
| 156728 | Enriched cells from forest soil in Barre Woods Harvard Forest LTER site, Petersham, Massachusetts, United States - Alteio_BWOrgHeat_Nextera84 | ENVO_01000174 | ENVO_01000159 | ENVO_00005781 |
| 156738 | Enriched cells from forest soil in Barre Woods Harvard Forest LTER site, Petersham, Massachusetts, United States - Alteio_BWMinControl_Nextera2 | ENVO_01000174 | ENVO_01001275 | ENVO_00002261 |
We can create a multi-class schema, with foreign keys inferred:
schemauto generalize-tsvs --infer-foreign-keys sample.tsv envo.tsv
This will generate a schema with two classes, where the join between the sample table and the term table is inferred:
classes:
  sample:
    slots:
      - BIOSAMPLE_ID
      - BIOSAMPLE_NAME
      - ENVO_BIOME_ID
      - ENVO_FEATURE_ID
      - ENVO_MATERIAL_ID
  envo:
    slots:
      - ENVO_ID
      - ENVO_LABEL
slots:
  BIOSAMPLE_ID:
    range: integer
  BIOSAMPLE_NAME:
    range: string
  ENVO_BIOME_ID:
    examples:
      - value: ENVO_01000022
    range: envo
  ENVO_FEATURE_ID:
    range: envo
  ENVO_MATERIAL_ID:
    range: envo
  ENVO_ID:
    identifier: true
    range: string
  ENVO_LABEL:
    range: string
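The core of foreign-key inference is a containment test: if every value of one column also appears in a column of another table, the first column is a candidate foreign key. The sketch below illustrates only that test. It is not the library's infer_linkages implementation: candidate_foreign_keys is a hypothetical helper, the uniqueness check on the primary-key side is an assumption, the scoring step is omitted, and the BIOSAMPLE_ID values are made up for the demo.

```python
def candidate_foreign_keys(tables, min_distinct_fk_val=8):
    """Find (fk_table, fk_column, pk_table, pk_column) candidates.

    tables maps table name -> {column name -> list of values}.
    Hypothetical helper for illustration; no scoring is applied.
    """
    results = []
    for ft, fcols in tables.items():
        for fc, fvals in fcols.items():
            fset = {v for v in fvals if v}
            # Columns with too few distinct values produce spurious matches
            if len(fset) < min_distinct_fk_val:
                continue
            for pt, pcols in tables.items():
                for pc, pvals in pcols.items():
                    if (ft, fc) == (pt, pc):
                        continue
                    # Assumption: a primary key has no repeated values
                    if len(set(pvals)) != len(pvals):
                        continue
                    # Containment test: every FK value exists in the PK column
                    if fset <= set(pvals):
                        results.append((ft, fc, pt, pc))
    return results


tables = {
    "envo": {
        "ENVO_ID": ["ENVO_01000752", "ENVO_01001570", "ENVO_01001581", "ENVO_01001582"],
        "ENVO_LABEL": ["area of barren land", "terrestrial ecoregion",
                       "sea surface layer", "forest floor"],
    },
    "sample": {
        "BIOSAMPLE_ID": ["1", "2", "3"],  # made-up identifiers
        "ENVO_BIOME_ID": ["ENVO_01000752", "ENVO_01001570", "ENVO_01000752"],
    },
}
# The threshold is lowered because the toy tables are tiny
links = candidate_foreign_keys(tables, min_distinct_fk_val=2)
print(links)
```

On real data the containment test alone tends to over-match (for example, small integer columns), which is why a minimum number of distinct values and additional scoring are needed.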
Generalizing from tables on the web¶
You can use the generalize-htmltable command:
schemauto generalize-htmltable https://www.nature.com/articles/s41467-022-31626-4/tables/1
This will generate:
name: example
description: example
id: https://w3id.org/example
imports:
  - linkml:types
prefixes:
  linkml: https://w3id.org/linkml/
  example: https://w3id.org/example
default_prefix: example
slots:
  GWAS trait:
    examples:
      - value: "\xC2"
    range: string
  Peak GWAS SNP:
    examples:
      - value: rs2974298
    range: string
  Gene:
    examples:
      - value: SMIM19
    range: string
  NK cell cis eSNP:
    examples:
      - value: rs2974348
    range: string
  TWAS Z score:
    examples:
      - value: '3.809'
    range: string
  TWAS P value:
    examples:
      - value: '0.0001'
    range: string
classes:
  example:
    slots:
      - GWAS trait
      - Peak GWAS SNP
      - Gene
      - NK cell cis eSNP
      - TWAS Z score
      - TWAS P value
Generalizing from JSON¶
Packages¶
- class schema_automator.generalizers.CsvDataGeneralizer(identifier_slots: typing.List[str] = <factory>, depluralize_class_names: bool = <factory>, inflect_engine: inflect.engine = <factory>, column_separator: str = '\t', schema_name: str = 'example', robot: bool = False, data_dictionary_row_count: int = 0, enum_columns: typing.List[str] = <factory>, enum_mask_columns: typing.List[str] = <factory>, enum_threshold: float = 0.1, enum_strlen_threshold: int = 30, max_enum_size: int = 50, downcase_header: bool = False, snakecase_header: bool = False, infer_foreign_keys: bool = False, max_pk_len: int = 60, min_distinct_fk_val: int = 8, source_schema: linkml_runtime.linkml_model.meta.SchemaDefinition | None = None)[source]¶
A Generalizer that generalizes from example CSV/TSV data
- column_separator: str = '\t'¶
character that separates columns in the input file
- convert(file: str, **kwargs) SchemaDefinition [source]¶
Converts a single TSV file to a single-class schema
- Parameters:
file –
kwargs –
- Returns:
- convert_dicts(rr: List[Dict], schema_name: str = 'example', class_name: str = 'Observation', **kwargs) SchemaDefinition | None [source]¶
Converts a list of row objects to a schema.
Each row is a data item, presumed to be of the same type, that is generalized.
- Parameters:
rr –
schema_name –
class_name –
kwargs –
- Returns:
- convert_from_dataframe(df: DataFrame, **kwargs) SchemaDefinition [source]¶
Converts a single dataframe to a single-class schema
- Parameters:
df –
kwargs –
- Returns:
- convert_multiple(files: List[str], **kwargs) SchemaDefinition [source]¶
Converts multiple TSVs to a schema
- Parameters:
files –
kwargs –
- Returns:
- convert_to_edge_slots(all_tsv_rows: List, name: str = 'Observation', **kwargs) Dict | None [source]¶
Assumes that the TSV has three relevant columns: the slot name to add, the slot definition to add, and examples of values for the slot.
Also assumes that these are all edge_properties at the moment. TODO: add a parameter to allow edge or node property disambiguation.
- data_dictionary_row_count: int = 0¶
number of rows after header containing data dictionary information
- downcase_header: bool = False¶
If true, coerce column names to be lower case
- enum_columns: List[str]¶
List of columns that are coerced into enums
- enum_mask_columns: List[str]¶
List of columns that are excluded from being enums
- enum_strlen_threshold: int = 30¶
Maximum length of a string to be considered a permissible enum value
- enum_threshold: float = 0.1¶
If the number of distinct values divided by the total number of values is greater than this, then the column is considered an enum
- infer_foreign_keys: bool = False¶
For multi-CSV files, infer linkages between rows
- infer_linkages(files: List[str], **kwargs) List[ForeignKey] [source]¶
Heuristic procedure for determining which tables are linked to others via implicit foreign keys
If all values of one column FT.FC are present in column PT.PC, then FT.FC is a potential foreign key and PC is a potential primary key of PT.
This procedure can generate false positives, so additional heuristics are applied. Each potential foreign key relationship gets an ad-hoc score:
links across tables score more highly than within
suffixes such as _id are more likely on PK and FK tables
the foreign key column table is likely to start with the base column name
In addition, if there are competing primary keys for a table, the top scoring one is selected
- max_enum_size: int = 50¶
Max number of permissible values for a column to be considered an enum
- max_pk_len: int = 60¶
Maximum length to be considered for a primary key column. Note: URIs can be long
- min_distinct_fk_val: int = 8¶
For inferring foreign keys, a candidate column must have at least this many distinct values.
- robot: bool = False¶
If true, conforms to robot template format. Data dictionary rows start with ‘>’
- schema_name: str = 'example'¶
LinkML schema name (no spaces)
- snakecase_header: bool = False¶
If true, coerce column names to be snake case
- source_schema: SchemaDefinition | None = None¶
Optional base schema to draw from
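The effect of the downcase_header and snakecase_header options can be pictured with a small stand-alone function. This is only a sketch under assumptions: normalize_header is a hypothetical name, and the snake-casing rule used here (collapse runs of non-alphanumeric characters into a single underscore) may differ from what the library actually does.

```python
import re


def normalize_header(name, downcase=False, snakecase=False):
    """Normalize a column header (hypothetical helper, for illustration)."""
    if snakecase:
        # Assumed rule: any run of non-alphanumeric characters becomes "_"
        name = re.sub(r"[^0-9A-Za-z]+", "_", name.strip()).strip("_")
    if downcase:
        name = name.lower()
    return name


print(normalize_header("envo term id", snakecase=True))
print(normalize_header("BIOSAMPLE_ID", downcase=True))
print(normalize_header("TWAS Z score", downcase=True, snakecase=True))
```

Normalizing headers this way avoids slot names containing spaces, such as those in the HTML table example above.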
- class schema_automator.generalizers.JsonDataGeneralizer(identifier_slots: typing.List[str] = <factory>, depluralize_class_names: bool = <factory>, inflect_engine: inflect.engine = <factory>, mappings: dict | None = None, omit_null: bool | None = None, inline_as_dict_slot_keys: typing.Mapping[str, str] | None = None)[source]¶
A generalizer that abstracts from JSON instance data
- convert(input: str | Dict, format: str = 'json', container_class_name='Container', **kwargs) SchemaDefinition [source]¶
Generalizes from a JSON file
- Parameters:
input –
format –
container_class_name –
kwargs –
- Returns:
- inline_as_dict_slot_keys: Mapping[str, str] = None¶
Mapping between the name of a dict-inlined slot and the unique key for that entity
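As a rough picture of what generalizing from JSON instance data involves, the sketch below walks a JSON object, records a range for each key, and creates a new class for each nested object. It is not JsonDataGeneralizer's algorithm: the generalize function, the class-naming rule (capitalizing the key), and the sample document are all assumptions made for illustration; only the default container class name, Container, comes from the convert signature above.

```python
import json


def generalize(obj, class_name, classes=None):
    """Collect slot names and ranges for obj and any nested objects.

    Hypothetical helper; returns {class name -> {slot name -> range}}.
    """
    if classes is None:
        classes = {}
    slots = classes.setdefault(class_name, {})
    for key, value in obj.items():
        if isinstance(value, list) and value:
            value = value[0]  # sketch: only the first list element is inspected
        if isinstance(value, dict):
            nested = key.capitalize()  # assumed naming rule for nested classes
            slots[key] = nested
            generalize(value, nested, classes)
        elif isinstance(value, bool):  # bool must be tested before int
            slots[key] = "boolean"
        elif isinstance(value, int):
            slots[key] = "integer"
        elif isinstance(value, float):
            slots[key] = "float"
        else:
            slots[key] = "string"
    return classes


# Made-up instance document
doc = json.loads('{"id": "S1", "depth": 2.5, "site": {"name": "ZF20-105", "plot": 6}}')
classes = generalize(doc, "Container")
print(classes)
```

A real generalizer would also merge evidence across many instances and decide on multivalued slots and identifiers, which this sketch leaves out.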