Pandera¶
Overview¶
Pandera is an open-source framework for data validation on dataframe-like objects. Polars is a fast dataframe library.
The Pandera Generator produces Pandera models with the class-based API through Pandera's Polars integration.
The implementation of the generator is incomplete. Because Pandera validates dataframes, the first priority is implementing models of literal and nested data types and checks for single tables, as shown below. tests/test_generators/test_panderagen.py also has an example using supported LinkML features.
Currently supported LinkML features are:
literal slot ranges: string, integer, float, boolean, date, datetime
enums
constraints: required, pattern, minimum_value, maximum_value, multivalued
inlining: nested single-valued objects, lists of literals, lists of objects
note: inlining of nested dictionary collections (including ‘simple’ dicts) is inefficient and incomplete; use inlined-as-list instead.
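Because inlined-as-list is the recommended form, data that arrives in one of the dictionary forms can be reshaped before it is loaded into a dataframe. The following is a minimal sketch in plain Python and is not part of the generator; the function names `collection_dict_to_list` and `simple_dict_to_list`, and the slot names `id` and `x`, are assumptions chosen for illustration.

```python
from typing import Any


def collection_dict_to_list(
    collection: dict[str, dict[str, Any]], key_slot: str
) -> list[dict[str, Any]]:
    """Turn {key: {slot: value, ...}} into [{key_slot: key, slot: value, ...}]."""
    return [{key_slot: key, **(body or {})} for key, body in collection.items()]


def simple_dict_to_list(
    simple: dict[str, Any], key_slot: str, value_slot: str
) -> list[dict[str, Any]]:
    """Turn {key: value} into [{key_slot: key, value_slot: value}]."""
    return [{key_slot: key, value_slot: value} for key, value in simple.items()]


# The same two objects written in inlined-as-dict and simple-dict form.
as_dict = {"a": {"x": 1}, "b": {"x": 2}}
as_simple = {"a": 1, "b": 2}

rows_from_dict = collection_dict_to_list(as_dict, key_slot="id")
rows_from_simple = simple_dict_to_list(as_simple, key_slot="id", value_slot="x")
```

Both calls produce the same list of objects, which is the shape an inlined-as-list column expects.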
Future priorities that are currently not supported include:
foreign key association to other tables
model and slot inheritance
aliases
different target dataframe libraries
Example¶
Given a definition of a synthetic flat table with some nested/inlined columns:
PanderaSyntheticTable:
  description: A flat table with a reasonably complete assortment of datatypes.
  attributes:
    identifier_column:
      description: identifier
      identifier: true
      range: integer
      required: true
    bool_column:
      description: test boolean column
      range: boolean
      required: true
    integer_column:
      description: test integer column with min/max values
      range: integer
      required: true
      minimum_value: 0
      maximum_value: 999
    float_column:
      description: test float column
      range: float
      required: true
    string_column:
      description: test string column
      range: string
      required: true
      pattern: "^(this)|(that)|(whatever)$"
    date_column:
      description: test date column
      range: date
      required: true
    datetime_column:
      description: test datetime column
      range: datetime
      required: true
    enum_column:
      description: test enum column
      range: SyntheticEnum
      required: true
    ontology_enum_column:
      description: test enum column with ontology values
      range: SyntheticEnumOnt
      required: true
    multivalued_column:
      description: one-to-many form
      range: integer
      required: true
      multivalued: true
      inlined_as_list: true
    any_type_column:
      description: needs to have type object
      range: AnyType
      required: true
    inlined_class_column:
      description: test column with another class inlined as a struct
      range: ColumnType
      required: true
      inlined: true
      inlined_as_list: false
      multivalued: true
    inlined_as_list_column:
      description: test column with another class inlined as a list
      range: ColumnType
      required: true
      inlined: true
      inlined_as_list: true
      multivalued: true
    inlined_simple_dict_column:
      description: test column inlined using simple dict form
      range: SimpleDictType
      multivalued: true
      inlined: true
      inlined_as_list: false
      required: true
(details omitted, including header information, slots, enums and nested class definitions)
The generated Python looks like this:
class PanderaSyntheticTable(pla.DataFrameModel, _LinkmlPanderaValidator):
    """A flat table with a reasonably complete assortment of datatypes."""

    identifier_column: int= pla.Field()
    """identifier"""
    bool_column: bool= pla.Field()
    """test boolean column"""
    integer_column: int= pla.Field(ge=0, le=999, )
    """test integer column with min/max values"""
    float_column: float= pla.Field()
    """test float column"""
    string_column: str= pla.Field()
    """test string column"""
    date_column: Date= pla.Field()
    """test date column"""
    datetime_column: DateTime= pla.Field()
    """test datetime column"""
    enum_column: Enum= pla.Field(dtype_kwargs={"categories":('ANIMAL','VEGETABLE','MINERAL',)})
    """test enum column"""
    ontology_enum_column: Enum= pla.Field(dtype_kwargs={"categories":('fiction','non fiction',)})
    """test enum column with ontology values"""
    multivalued_column: List[int]= pla.Field()
    """one-to-many form"""
    any_type_column: Object = pla.Field()
    """needs to have type object"""
    inlined_class_column: Struct = pla.Field()
    """test column with another class inlined as a struct"""
    inlined_as_list_column: pl.List = pla.Field()
    """test column with another class inlined as a list"""
    inlined_simple_dict_column: Struct = pla.Field()
    """test column inlined using simple dict form"""

    @pla.check("inlined_class_column")
    def check_nested_struct_inlined_class_column(cls, data: PolarsData):
        return cls._check_collection_struct(data)

    @pla.check("inlined_as_list_column")
    def check_nested_struct_inlined_as_list_column(cls, data: PolarsData):
        return cls._check_nested_list_struct(data)

    @pla.check("inlined_simple_dict_column")
    def check_nested_struct_inlined_simple_dict_column(cls, data: PolarsData):
        return cls._check_simple_dict(data)

    _NESTED_RANGES = {
        "inlined_class_column": "ColumnType",
        "inlined_as_list_column": "ColumnType",
        "inlined_simple_dict_column": "SimpleDictType",
    }
    _INLINE_FORM = {
        "inlined_class_column": "inline_collection_dict",
        "inlined_as_list_column": "inlined_list_dict",
        "inlined_simple_dict_column": "simple_dict",
    }
    _INLINE_DETAILS = {
        "inlined_simple_dict_column": {'id': 'id', 'other': 'x'},
    }
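The integer_column field in the generated class shows how the LinkML minimum_value and maximum_value constraints surface as the ge and le arguments of pla.Field. That translation step can be sketched in plain Python, so it runs without Pandera installed. The helper names `field_kwargs` and `render_field` and the lookup table are illustrative assumptions, not the generator's actual code, and only the two constraint pairs visible in the example are covered.

```python
# Hypothetical lookup from LinkML constraint names to pla.Field keyword names.
# Only the two pairs visible in the generated example are included.
LINKML_TO_FIELD_KWARG = {
    "minimum_value": "ge",
    "maximum_value": "le",
}


def field_kwargs(slot: dict) -> dict:
    """Collect the Field keyword arguments implied by a slot definition."""
    return {
        kwarg: slot[constraint]
        for constraint, kwarg in LINKML_TO_FIELD_KWARG.items()
        if slot.get(constraint) is not None
    }


def render_field(slot: dict) -> str:
    """Render the Field call as source text, as a template might emit it."""
    args = ", ".join(f"{name}={value!r}" for name, value in field_kwargs(slot).items())
    return f"pla.Field({args})"


# Slot definition copied from the integer_column attribute of the example schema.
integer_column = {
    "range": "integer",
    "required": True,
    "minimum_value": 0,
    "maximum_value": 999,
}
rendered = render_field(integer_column)
```

A slot with no mapped constraints renders as an empty pla.Field() call, matching the unconstrained columns in the example.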
Command Line¶
gen-pandera¶
gen-pandera [OPTIONS] YAMLFILE
Options
- -V, --version¶
Show the version and exit.
- --template-file <template_file>¶
Optional jinja2 template to use for class generation
- --template-path <template_path>¶
Optional jinja2 template directory within module
- --package <package>¶
Package name where relevant for generated class files
Arguments
- YAMLFILE¶
Required argument
Generator¶
- class linkml.generators.panderagen.PanderaGenerator(schema: str | ~typing.TextIO | ~linkml_runtime.linkml_model.meta.SchemaDefinition | Generator | ~pathlib.Path, schemaview: ~linkml_runtime.utils.schemaview.SchemaView = None, format: str | None = None, metadata: bool = True, useuris: bool | None = None, log_level: int | None = 30, mergeimports: bool | None = True, source_file_date: str | None = None, source_file_size: int | None = None, logger: ~logging.Logger | None = None, verbose: bool | None = None, output: str | None = None, namespaces: ~linkml_runtime.utils.namespaces.Namespaces | None = None, directory_output: bool = False, base_dir: str = None, metamodel_name_map: dict[str, str] = None, importmap: str | ~collections.abc.Mapping[str, str] | None = None, emit_prefixes: set[str] = <factory>, metamodel: ~linkml.utils.schemaloader.SchemaLoader = None, stacktrace: bool = False, include: str | ~pathlib.Path | ~linkml_runtime.linkml_model.meta.SchemaDefinition | None = None, template_file: str | None = None, package: str = 'example', template_path: str = 'panderagen_class_based', gen_classvars: bool = True, gen_slots: bool = True, genmeta: bool = False, emit_metadata: bool = True, coerce: bool = False)[source]¶
Generates Pandera python classes from a LinkML schema.
Status: incompletely implemented
One style is supported:
panderagen_class_based
- compile_pandera() ModuleType [source]¶
Generates and compiles Pandera model
- default_value_for_type(typ: str) str [source]¶
Allow underlying framework to handle default if not specified.
- generatorname: ClassVar[str] = 'panderagen.py'¶
Name of the generator. Override with os.path.basename(__file__)
- generatorversion: ClassVar[str] = '0.0.1'¶
Version of the generator. Consider deprecating and instead use overall linkml version
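The generator can also be driven from Python rather than from the command line. The following is a hedged sketch: it assumes linkml is installed with the Pandera generator available, and that PanderaGenerator follows the usual LinkML generator convention of returning the generated source text from `serialize()`. The wrapper function names are hypothetical; `compile_pandera()` and the `package` parameter are taken from the class documentation above.

```python
from types import ModuleType


def generate_pandera_source(schema_path: str, package: str = "example") -> str:
    """Return the generated Pandera module as Python source text."""
    # Imported lazily so this sketch can be loaded without linkml installed.
    from linkml.generators.panderagen import PanderaGenerator

    # Assumption: serialize() returns the rendered module source as a string.
    return PanderaGenerator(schema_path, package=package).serialize()


def compile_pandera_module(schema_path: str) -> ModuleType:
    """Generate the model and compile it into a module object."""
    from linkml.generators.panderagen import PanderaGenerator

    return PanderaGenerator(schema_path).compile_pandera()
```

Under these assumptions, calling `generate_pandera_source` with a schema path should yield the same text that gen-pandera prints for that schema.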
Templates¶
The panderagen module uses a templating system that allows generating different target APIs. The only template currently provided is the default panderagen_class_based template.
The PanderaGenerator then serves as a translation layer between the source models from linkml_runtime and the target models under panderagen, making clear what is needed to generate schema code as well as what parts of the linkml metamodel are supported.
Additional Notes¶
When possible, the Pandera Generator implements LinkML constraints directly as Pandera checks. Support for nested columns uses Pandera custom checks to first isolate the nested column and then recursively call Pandera validation.
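The isolate-then-recurse pattern for nested columns can be illustrated without any dataframe library. The following plain-Python analogue is a sketch only: the `SCHEMAS` registry, the `validate_record` function, and the dictionary-based schema format are all hypothetical and do not reflect the helper methods that the generated classes actually call.

```python
from typing import Any

# Hypothetical schema registry: each class maps a field to a Python type
# (a literal column) or to the name of another class (a nested column).
SCHEMAS: dict[str, dict[str, Any]] = {
    "Table": {"identifier_column": int, "inlined_as_list_column": "ColumnType"},
    "ColumnType": {"id": str, "x": int},
}


def validate_record(class_name: str, record: dict[str, Any]) -> list[str]:
    """Return a list of error messages; an empty list means the record is valid."""
    errors: list[str] = []
    for field, expected in SCHEMAS[class_name].items():
        value = record.get(field)
        if isinstance(expected, str):
            # Nested column: isolate its values, then validate each one
            # recursively against the nested class.
            nested = value if isinstance(value, list) else [value]
            for index, item in enumerate(nested):
                if not isinstance(item, dict):
                    errors.append(f"{field}[{index}]: expected an object")
                    continue
                errors.extend(
                    f"{field}[{index}].{message}"
                    for message in validate_record(expected, item)
                )
        elif not isinstance(value, expected):
            errors.append(f"{field}: expected {expected.__name__}")
    return errors


good = {"identifier_column": 1, "inlined_as_list_column": [{"id": "a", "x": 1}]}
bad = {"identifier_column": 1, "inlined_as_list_column": [{"id": "a", "x": "one"}]}

good_errors = validate_record("Table", good)
bad_errors = validate_record("Table", bad)
```

Each error message carries the path to the nested value, which mirrors how a nested check has to report failures in terms of the outer column.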
The Python code is currently output to the console. The generated class depends on helper methods in the LinkML library at runtime to perform nested checks.
Usage Example¶
Generate the class from tutorial 01 using the gen-pandera command:
gen-pandera examples/tutorial/tutorial01/personinfo.yaml > personinfo_pandera.py
Run an example program to create a one row dataframe and validate it. No exceptions are raised because the data matches the model.
from personinfo_pandera import Person
import polars as pl

dataframe = pl.DataFrame(
    [
        {
            "id": "ORCID:1234",
            "full_name": "Clark Kent",
            "age": "32",
            "phone": "555-555-5555"
        }
    ]
)
Person.validate(dataframe)
Development Notes¶
This generator primarily supports Pandera, which is a validator. The underlying dataframe libraries are only partially supported. As a consequence, loading data from many of the unit tests (into a dataframe format) can be challenging when the library does not support some of the LinkML conventions. These include ‘simple’ dict inlining and polymorphism from row to row.
Following nested objects (and eventually foreign-key associations) relies on the Polars expressions API for efficiency. Future versions may make greater use of the Narwhals API to support additional dataframe libraries more generally.
Future Roadmap¶
The following major features need to be prioritized:
Foreign key associations
Model and slot inheritance, including abstract models
Make transformer module more general rather than performing operations only at validation time.
Generalize support for additional dataframe libraries:
- Polars independent of Pandera, to help load tables prior to validation
- Parquet/PyArrow storage formats
- PySpark (also supported by Pandera)
- Narwhals (general dataframe API wrapper)
Improve modularity - leverage and align with existing linkml-runtime modules and tables
Conversion mechanism (loaders) for models using inlined-as-dictionary and inlined-as-simple-dict forms to inlined-as-list.
Top-level validator cli tool under linkml/validators
Ability to use the generated Pandera without a runtime LinkML dependency.
Cardinality checks over entire dataframe columns