Pandera#

Overview#

Pandera is an open-source framework for data validation on dataframe-like objects. Polars is a fast dataframe library.

The Pandera Generator produces Pandera models with the class-based API, targeting Pandera's Polars integration.

The implementation of the generator is incomplete. Because Pandera validates dataframes, the first priority is generating models of literal data types and checks for flat tables, as shown below. tests/test_generators/test_panderagen.py contains a further example that uses the supported LinkML features.

Currently supported features are:

literal slot ranges: string, integer, float, boolean, date, datetime
enums
constraints: required, pattern, minimum_value, maximum_value, multivalued

Future priorities that are currently not supported include:

inheritance
inline / nested struct columns
array columns
modeling unnested class ranges (references to separate dataframes)

Example#

Given a definition of a synthetic flat table:

PanderaSyntheticTable:
  description: A flat table with a reasonably complete assortment of datatypes.
  attributes:
    identifier_column:
      description: identifier
      identifier: True
      range: integer
      required: True
    bool_column:
      description: test boolean column
      range: boolean
      required: True
    integer_column:
      description: test integer column with min/max values
      range: integer
      required: True
      minimum_value: 0
      maximum_value: 999
    float_column:
      description: test float column
      range: float
      required: True
    string_column:
      description: test string column
      range: string
      required: True
    date_column:
      description: test date column
      range: date
      required: True
    datetime_column:
      description: test datetime column
      range: datetime
      required: True
    enum_column:
      description: test enum column
      range: SyntheticEnum
      required: True
    ontology_enum_column:
      description: test enum column with ontology values
      range: SyntheticEnumOnt
      required: True
    multivalued_column:
      description: one-to-many form
      range: integer
      required: True
      multivalued: True
      inlined_as_list: True

(some details omitted for brevity, including header information)

The generated Python looks like this:

class PanderaSyntheticTable(pla.DataFrameModel, _LinkmlPanderaValidator):
    """A flat table with a reasonably complete assortment of datatypes."""


    identifier_column: int= pla.Field()
    """identifier"""

    bool_column: bool= pla.Field()
    """test boolean column"""

    integer_column: int= pla.Field(ge=0, le=999, )
    """test integer column with min/max values"""

    float_column: float= pla.Field()
    """test float column"""

    string_column: str= pla.Field()
    """test string column"""

    date_column: Date= pla.Field()
    """test date column"""

    datetime_column: DateTime= pla.Field()
    """test datetime column"""

    enum_column: Enum= pla.Field(dtype_kwargs={"categories":('ANIMAL','VEGETABLE','MINERAL',)})
    """test enum column"""

    ontology_enum_column: Enum= pla.Field(dtype_kwargs={"categories":('fiction','non fiction',)})
    """test enum column with ontology values"""

    multivalued_column: List[int]= pla.Field()
    """one-to-many form"""
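Reading the schema and the generated class side by side suggests how slot ranges correspond to column annotations. The sketch below records that correspondence as a lookup table. It is only an illustration inferred from the example output: the names RANGE_TO_ANNOTATION and annotation_for are hypothetical and are not part of linkml or Pandera, and the generator itself may work differently.

```python
# Illustrative only: a mapping read off the example above, not the
# generator's actual implementation.
RANGE_TO_ANNOTATION = {
    "integer": "int",
    "boolean": "bool",
    "float": "float",
    "string": "str",
    "date": "Date",
    "datetime": "DateTime",
}


def annotation_for(range_name: str, multivalued: bool = False, is_enum: bool = False) -> str:
    """Return the annotation string the example output shows for a slot.

    Enum ranges all appear as ``Enum`` in the example (the permissible
    values travel in ``dtype_kwargs``), and a multivalued slot wraps the
    base annotation in ``List[...]``.
    """
    base = "Enum" if is_enum else RANGE_TO_ANNOTATION[range_name]
    return f"List[{base}]" if multivalued else base


print(annotation_for("integer"))                      # int
print(annotation_for("integer", multivalued=True))    # List[int]
print(annotation_for("SyntheticEnum", is_enum=True))  # Enum
```

The three printed values match the annotations of identifier_column, multivalued_column, and enum_column in the generated class.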

Command Line#

gen-pandera#

gen-pandera [OPTIONS] YAMLFILE

Options

-V, --version#: Show the version and exit.

--template-file <template_file>#: Optional jinja2 template to use for class generation

--template-path <template_path>#: Optional jinja2 template directory within module

--package <package>#: Package name where relevant for generated class files

Arguments

YAMLFILE#: Required argument

Generator#

class linkml.generators.panderagen.PanderaGenerator(schema: str | typing.TextIO | linkml_runtime.linkml_model.meta.SchemaDefinition | Generator | pathlib.Path, schemaview: linkml_runtime.utils.schemaview.SchemaView = None, format: str | None = None, metadata: bool = True, useuris: bool | None = None, log_level: int | None = 30, mergeimports: bool | None = True, source_file_date: str | None = None, source_file_size: int | None = None, logger: logging.Logger | None = None, verbose: bool | None = None, output: str | None = None, namespaces: linkml_runtime.utils.namespaces.Namespaces | None = None, directory_output: bool = False, base_dir: str = None, metamodel_name_map: dict[str, str] = None, importmap: str | collections.abc.Mapping[str, str] | None = None, emit_prefixes: set[str] = <factory>, metamodel: linkml.utils.schemaloader.SchemaLoader = None, stacktrace: bool = False, include: str | pathlib.Path | linkml_runtime.linkml_model.meta.SchemaDefinition | None = None, template_file: str | None = None, package: str = 'example', template_path: str = 'panderagen_class_based', gen_classvars: bool = True, gen_slots: bool = True, genmeta: bool = False, emit_metadata: bool = True, coerce: bool = False)[source]#

Generates Pandera python classes from a LinkML schema.

Status: incompletely implemented

One style is supported:

panderagen_class_based

compile_pandera() → module[source]#: Generates and compiles Pandera model

default_value_for_type(typ: str) → str[source]#: Allow underlying framework to handle default if not specified.

generatorname: ClassVar[str] = 'panderagen.py'#: Name of the generator. Override with os.path.basename(__file__)

generatorversion: ClassVar[str] = '0.0.1'#: Version of the generator. Consider deprecating and instead use overall linkml version

render() → OODocument[source]#: Create a data structure ready to pass to the serialization templates.

serialize(rendered_module: OODocument | None = None) → str[source]#: Serialize the schema to a Pandera module as a string

template_file: str | None = None#: Path to template

valid_formats: ClassVar[list[str]] = ['python']#: Allowed formats - first format is default

Templates#

The panderagen module uses a templating system that allows generating different target APIs. The only template currently provided is the default panderagen_class_based template.

The PanderaGenerator then serves as a translation layer between the source models from linkml_runtime and the target models under panderagen, making clear what is needed to generate schema code as well as which parts of the LinkML metamodel are supported.

Additional Notes#

When possible, the Pandera Generator implements LinkML constraints directly as Pandera checks. Support for additional constraints through Pandera custom checks is planned for the future.
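As a rough illustration of what "directly as Pandera checks" means, the sketch below turns a slot's constraints into the keyword arguments of a Field call. The minimum_value to ge and maximum_value to le pairs are taken from the generated integer_column in the example above. The pattern to str_matches pair is an assumption: str_matches is a Pandera Field keyword, but this page does not show what the generator emits for pattern. The helper names field_kwargs and render_field are hypothetical.

```python
# Hypothetical helpers, not part of linkml or pandera.
CONSTRAINT_TO_FIELD_KWARG = {
    "minimum_value": "ge",     # seen in the example: minimum_value: 0 -> ge=0
    "maximum_value": "le",     # seen in the example: maximum_value: 999 -> le=999
    "pattern": "str_matches",  # assumption, not confirmed by the example
}


def field_kwargs(slot: dict) -> dict:
    """Keep only the constraints that map onto a Field keyword."""
    return {
        CONSTRAINT_TO_FIELD_KWARG[name]: value
        for name, value in slot.items()
        if name in CONSTRAINT_TO_FIELD_KWARG
    }


def render_field(slot: dict) -> str:
    """Render the right-hand side of a generated column definition."""
    args = ", ".join(f"{key}={value!r}" for key, value in field_kwargs(slot).items())
    return f"pla.Field({args})"


integer_column = {"range": "integer", "required": True, "minimum_value": 0, "maximum_value": 999}
print(render_field(integer_column))  # pla.Field(ge=0, le=999)
```

Constraints without an entry in the table are silently skipped here, which mirrors the idea that anything not expressible as a built-in check would need a custom check instead.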

The Python code is currently generated as a single module and written to standard output (the console).
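Since the generator writes to the console, saving the module from Python rather than through a shell redirect could look like the sketch below. It relies only on the PanderaGenerator constructor and the serialize() method documented above, and it assumes the schema path can be passed as the first positional argument. The function name write_pandera_module is hypothetical.

```python
from pathlib import Path


def write_pandera_module(schema_path: str, output_path: str) -> Path:
    """Serialize a LinkML schema to a Pandera module and save it to disk."""
    # Imported lazily so this sketch can be loaded without linkml installed.
    from linkml.generators.panderagen import PanderaGenerator

    # serialize() is documented above as returning the module source as a string.
    source = PanderaGenerator(schema_path).serialize()

    target = Path(output_path)
    target.write_text(source, encoding="utf-8")
    return target
```

Calling write_pandera_module("examples/tutorial/tutorial01/personinfo.yaml", "personinfo_pandera.py") should then leave the generated module next to the calling script, provided linkml is installed and the schema path exists.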

Usage Example#

Generate the class from tutorial 01 using the gen-pandera command:

gen-pandera examples/tutorial/tutorial01/personinfo.yaml > personinfo_pandera.py

Run an example program that creates a one-row dataframe and validates it. No exceptions are raised because the data matches the model.

from personinfo_pandera import Person
import polars as pl

dataframe = pl.DataFrame(
    [
        {
            "id": "ORCID:1234",
            "full_name": "Clark Kent",
            "age": "32",
            "phone": "555-555-5555",
        }
    ]
)
Person.validate(dataframe)