Pandera#
Overview#
Pandera is an open-source framework for data validation on dataframe-like objects. Polars is a fast dataframe library.
The Pandera Generator produces Pandera models with the class-based API and the Polars integration.
The implementation of the generator is incomplete. Because Pandera validates dataframes, the first priority is implementing models of literal data types and checks for flat tables, as shown below. tests/test_generators/test_panderagen.py also has an example using supported LinkML features.
Currently supported features are:
- literal slot ranges: string, integer, float, boolean, date, datetime
- enums
- constraints: required, pattern, minimum_value, maximum_value, multivalued
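As a rough illustration of how the supported ranges and constraints might surface in a generated field, the sketch below renders one field line from a slot description. The names `RANGE_TO_ANNOTATION` and `render_field` are hypothetical and not part of the generator. The annotations and the `ge`/`le` arguments are inferred from the generated example on this page; `str_matches` and `nullable` are existing Pandera `Field` arguments, but their use for `pattern` and `required` is an assumption here.

```python
# Illustrative only: hypothetical names, mapping inferred from the generated
# example on this page rather than taken from the generator's source.
RANGE_TO_ANNOTATION = {
    "string": "str",
    "integer": "int",
    "float": "float",
    "boolean": "bool",
    "date": "Date",
    "datetime": "DateTime",
}


def render_field(name, slot_range, *, multivalued=False, minimum_value=None,
                 maximum_value=None, pattern=None, required=True):
    """Render one class attribute line for a slot with a literal range."""
    annotation = RANGE_TO_ANNOTATION[slot_range]
    if multivalued:
        annotation = f"List[{annotation}]"

    kwargs = {}
    if minimum_value is not None:
        kwargs["ge"] = minimum_value        # as in the generated integer_column
    if maximum_value is not None:
        kwargs["le"] = maximum_value        # as in the generated integer_column
    if pattern is not None:
        kwargs["str_matches"] = pattern     # assumption: one possible mapping
    if not required:
        kwargs["nullable"] = True           # assumption: one possible mapping

    args = ", ".join(f"{key}={value!r}" for key, value in kwargs.items())
    return f"{name}: {annotation} = pla.Field({args})"


print(render_field("integer_column", "integer", minimum_value=0, maximum_value=999))
```

The real generator renders fields through its templates, so its output may differ in spacing and in which checks it emits.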
Future priorities that are currently not supported include:
- inheritance
- inline / nested struct columns
- array columns
- modeling unnested class ranges (references to separate dataframes)
Example#
Given a definition of a synthetic flat table:
PanderaSyntheticTable:
  description: A flat table with a reasonably complete assortment of datatypes.
  attributes:
    identifier_column:
      description: identifier
      identifier: True
      range: integer
      required: True
    bool_column:
      description: test boolean column
      range: boolean
      required: True
    integer_column:
      description: test integer column with min/max values
      range: integer
      required: True
      minimum_value: 0
      maximum_value: 999
    float_column:
      description: test float column
      range: float
      required: True
    string_column:
      description: test string column
      range: string
      required: True
    date_column:
      description: test date column
      range: date
      required: True
    datetime_column:
      description: test datetime column
      range: datetime
      required: True
    enum_column:
      description: test enum column
      range: SyntheticEnum
      required: True
    ontology_enum_column:
      description: test enum column with ontology values
      range: SyntheticEnumOnt
      required: True
    multivalued_column:
      description: one-to-many form
      range: integer
      required: True
      multivalued: True
      inlined_as_list: True
(some details omitted for brevity, including header information)
The generated Python looks like this:
class PanderaSyntheticTable(pla.DataFrameModel, _LinkmlPanderaValidator):
    """A flat table with a reasonably complete assortment of datatypes."""

    identifier_column: int= pla.Field()
    """identifier"""

    bool_column: bool= pla.Field()
    """test boolean column"""

    integer_column: int= pla.Field(ge=0, le=999, )
    """test integer column with min/max values"""

    float_column: float= pla.Field()
    """test float column"""

    string_column: str= pla.Field()
    """test string column"""

    date_column: Date= pla.Field()
    """test date column"""

    datetime_column: DateTime= pla.Field()
    """test datetime column"""

    enum_column: Enum= pla.Field(dtype_kwargs={"categories":('ANIMAL','VEGETABLE','MINERAL',)})
    """test enum column"""

    ontology_enum_column: Enum= pla.Field(dtype_kwargs={"categories":('fiction','non fiction',)})
    """test enum column with ontology values"""

    multivalued_column: List[int]= pla.Field()
    """one-to-many form"""
Command Line#
gen-pandera#
gen-pandera [OPTIONS] YAMLFILE
Options
- -V, --version#
Show the version and exit.
- --template-file <template_file>#
Optional jinja2 template to use for class generation
- --template-path <template_path>#
Optional jinja2 template directory within module
- --package <package>#
Package name where relevant for generated class files
Arguments
- YAMLFILE#
Required argument
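The command can also be driven from a script. The following is a minimal sketch, assuming `gen-pandera` is installed and on the `PATH`; the helper names `build_gen_pandera_command` and `generate_to_file` are hypothetical. It uses only the options listed above and saves the generated code, which the command writes to standard output, to a file.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def build_gen_pandera_command(
    yamlfile: str,
    package: Optional[str] = None,
    template_file: Optional[str] = None,
    template_path: Optional[str] = None,
) -> List[str]:
    """Assemble the argument list for gen-pandera (hypothetical helper)."""
    cmd = ["gen-pandera"]
    if package is not None:
        cmd += ["--package", package]
    if template_file is not None:
        cmd += ["--template-file", template_file]
    if template_path is not None:
        cmd += ["--template-path", template_path]
    cmd.append(str(yamlfile))  # YAMLFILE is the required positional argument
    return cmd


def generate_to_file(yamlfile: str, output: str, **options) -> Path:
    """Run gen-pandera and save its standard output to a file."""
    result = subprocess.run(
        build_gen_pandera_command(yamlfile, **options),
        check=True,            # raise CalledProcessError on a non-zero exit
        capture_output=True,
        text=True,
    )
    out = Path(output)
    out.write_text(result.stdout)
    return out


print(build_gen_pandera_command("personinfo.yaml", package="example"))
```

Calling `generate_to_file("personinfo.yaml", "personinfo_pandera.py")` would then be equivalent to redirecting the command's output in a shell.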
Generator#
- class linkml.generators.panderagen.PanderaGenerator(schema: str | typing.TextIO | linkml_runtime.linkml_model.meta.SchemaDefinition | Generator | pathlib.Path, schemaview: linkml_runtime.utils.schemaview.SchemaView = None, format: str | None = None, metadata: bool = True, useuris: bool | None = None, log_level: int | None = 30, mergeimports: bool | None = True, source_file_date: str | None = None, source_file_size: int | None = None, logger: logging.Logger | None = None, verbose: bool | None = None, output: str | None = None, namespaces: linkml_runtime.utils.namespaces.Namespaces | None = None, directory_output: bool = False, base_dir: str = None, metamodel_name_map: dict[str, str] = None, importmap: str | collections.abc.Mapping[str, str] | None = None, emit_prefixes: set[str] = <factory>, metamodel: linkml.utils.schemaloader.SchemaLoader = None, stacktrace: bool = False, include: str | pathlib.Path | linkml_runtime.linkml_model.meta.SchemaDefinition | None = None, template_file: str | None = None, package: str = 'example', template_path: str = 'panderagen_class_based', gen_classvars: bool = True, gen_slots: bool = True, genmeta: bool = False, emit_metadata: bool = True, coerce: bool = False)[source]#
Generates Pandera python classes from a LinkML schema.
Status: incompletely implemented
One style is supported:
- panderagen_class_based
- default_value_for_type(typ: str) -> str [source]#
Allow underlying framework to handle default if not specified.
- generatorname: ClassVar[str] = 'panderagen.py'#
Name of the generator. Override with os.path.basename(__file__)
- generatorversion: ClassVar[str] = '0.0.1'#
Version of the generator. Consider deprecating and instead use overall linkml version
Templates#
The panderagen module uses a templating system that allows generating different target APIs. The only template currently provided is the default panderagen_class_based template.
The PanderaGenerator then serves as a translation layer between the source models from linkml_runtime and the target models under panderagen, making clear what is needed to generate schema code as well as what parts of the linkml metamodel are supported.
Additional Notes#
When possible, the Pandera Generator implements LinkML constraints directly as Pandera checks. Support for additional checks using Pandera custom checks is planned for the future.
The Python code is currently generated as a single file and written to standard output.
Usage Example#
Generate the class from tutorial 01 using the gen-pandera command:
gen-pandera examples/tutorial/tutorial01/personinfo.yaml > personinfo_pandera.py
Run an example program to create a one-row dataframe and validate it. No exceptions are raised because the data matches the model.
from personinfo_pandera import Person
import polars as pl

dataframe = pl.DataFrame(
    [
        {
            "id": "ORCID:1234",
            "full_name": "Clark Kent",
            "age": "32",
            "phone": "555-555-5555"
        }
    ]
)

Person.validate(dataframe)