Pandera

Overview

Pandera is an open-source framework for data validation on dataframe-like objects. PolaRS is a fast dataframe library.

The Pandera Generator produces Pandera models using the class-based API using the PolaRS integration. It can also produce PolaRS schemas for use in loading data.

The implementation of the generator is incomplete. Because Pandera is a dataframe library, the first priority is implementing models of literal and nested data types and checks for single tables as shown below. tests/linkml/test_generators/test_panderagen.py also has an example using supported LinkML features.

Currently supported LinkML features are:

  • literal slot ranges: string, integer, float, boolean, date, datetime

  • enums

  • constraints: required, pattern, minimum_value, maximum_value, multivalued

  • inlining: nested single-valued objects, lists of literals, lists of objects

Future priorities that are currently not supported include:

  • foreign key association to other tables

  • model and slot inheritance

  • aliases

  • additional target dataframe libraries

Example

Given a definition of a synthetic flat table with some nested/inlined columns:

PanderaSyntheticTable:
    description: A flat table with a reasonably complete assortment of datatypes.
    attributes:
      identifier_column:
        description: identifier
        identifier: true
        range: integer
        required: true
      bool_column:
        description: test boolean column
        range: boolean
        required: true
      integer_column:
        description: test integer column with min/max values
        range: integer
        required: true
        minimum_value: 0
        maximum_value: 999
      float_column:
        description: test float column
        range: float
        required: true
      string_column:
        description: test string column
        range: string
        required: true
        pattern: "^(this)|(that)|(whatever)$"
      date_column:
        description: test date column
        range: date
        required: true
      datetime_column:
        description: test datetime column
        range: datetime
        required: true
      enum_column:
        description: test enum column
        range: SyntheticEnum
        required: true
      ontology_enum_column:
        description: test enum column with ontology values
        range: SyntheticEnumOnt
        required: true
      multivalued_column:
        description: one-to-many form
        range: integer
        required: true
        multivalued: true
        inlined_as_list: true
      any_type_column:
        description: needs to have type object
        range: AnyType
        required: true
      inlined_class_column:
        description: test column with another class inlined as a struct
        range: ColumnType
        required: true
        inlined: true
        inlined_as_list: false
        multivalued: true
      inlined_as_list_column:
        description: test column with another class inlined as a list
        range: ColumnType
        required: true
        inlined: true
        inlined_as_list: true
        multivalued: true
      inlined_simple_dict_column:
        description: test column inlined using simple dict form
        range: SimpleDictType
        multivalued: true
        inlined: true
        inlined_as_list: false
        required: true

(details omitted, including header information, slots, enums and nested class definitions)

The generate python looks like this:

class PanderaSyntheticTable(pla.DataFrameModel, _LinkmlPanderaValidator):
      """A flat table with a reasonably complete assortment of datatypes."""


      identifier_column: int= pla.Field()
      """identifier"""

      bool_column: bool= pla.Field()
      """test boolean column"""

      integer_column: int= pla.Field(ge=0, le=999, )
      """test integer column with min/max values"""

      float_column: float= pla.Field()
      """test float column"""

      string_column: str= pla.Field()
      """test string column"""

      date_column: Date= pla.Field()
      """test date column"""

      datetime_column: DateTime= pla.Field()
      """test datetime column"""

      enum_column: Enum= pla.Field(dtype_kwargs={"categories":('ANIMAL','VEGETABLE','MINERAL',)})
      """test enum column"""

      ontology_enum_column: Enum= pla.Field(dtype_kwargs={"categories":('fiction','non fiction',)})
      """test enum column with ontology values"""

      multivalued_column: List[int]= pla.Field()
      """one-to-many form"""

      any_type_column: Object = pla.Field()
      """needs to have type object"""

      inlined_class_column: Struct = pla.Field()
      """test column with another class inlined as a struct"""

      inlined_as_list_column: pl.List = pla.Field()
      """test column with another class inlined as a list"""

      inlined_simple_dict_column: Struct = pla.Field()
      """test column inlined using simple dict form"""

      @pla.check("inlined_class_column")
      def check_nested_struct_inlined_class_column(cls, data: PolarsData):
          return cls._check_collection_struct(data)

      @pla.check("inlined_as_list_column")
      def check_nested_struct_inlined_as_list_column(cls, data: PolarsData):
          return cls._check_nested_list_struct(data)

      @pla.check("inlined_simple_dict_column")
      def check_nested_struct_inlined_simple_dict_column(cls, data: PolarsData):
          return cls._check_simple_dict(data)

      _NESTED_RANGES = {
          "inlined_class_column": "ColumnType",
          "inlined_as_list_column": "ColumnType",
          "inlined_simple_dict_column": "SimpleDictType",
      }
      _INLINE_FORM = {
          "inlined_class_column": "inline_collection_dict",
          "inlined_as_list_column": "inlined_list_dict",
          "inlined_simple_dict_column": "simple_dict",
      }
      _INLINE_DETAILS = {
          "inlined_simple_dict_column": {'id': 'id', 'other': 'x'},
      }

Command Line

gen-pandera

Generate Pandera classes to represent a LinkML model

gen-pandera [OPTIONS] YAMLFILE

Options

-V, --version

Show the version and exit.

--generator-class <generator_class>

Generator class to use. Options: [‘PanderaDataframeGenerator’, ‘PolarsSchemaDataframeGenerator’] (not used with –package)

--template-file <template_file>

Optional jinja2 template to use for class generation (not used with –package)

--template-path <template_path>

Optional jinja2 template directory within module (not used with –package)

--package <package>

Package name where relevant for generated class files

Arguments

YAMLFILE

Required argument

The Python code is written to a package directory if the –package command line option is provided. Otherwise the code is written to the console if only generation of a single module is specified.

Usage Example

Generate the package from tutorial 01 using the gen-pandera command Command-line options are under active development and are likely to change.

# recommended is to generate a package with all schema forms
gen-pandera --package personinfo examples/tutorial/tutorial01/personinfo.yaml

# alternatively you can generate the schemas individually using the --template-path and --template-file arguments
# instead of --package
gen-pandera --template-path panderagen_polars_schema --template-file polars_schema.jinja2 examples/tutorial/tutorial01/personinfo.yaml > personinfo_panderagen_polars_schema.py

# the panderagen schema is the default, but note that it depends on the polars schema
gen-pandera examples/tutorial/tutorial01/personinfo.yaml > personinfo_panderagen_class_based.py

Run an example program to create a one row dataframe and validate it. No exceptions are raised because the data matches the model.

import personinfo.panderagen_polars_schema as personinfo_pl
import personinfo.panderagen_class_based as personinfo_pa
import polars as pl

# generated schema is more reliable than PolaRS inferred schema
dataframe = pl.DataFrame(
  [
      {
      "id": "ORCID:1234",
      "full_name": "Clark Kent",
      "age": "32",
      "phone": "555-555-5555"
      }
  ],
  schema = personinfo_pl.Person
)

# Pandera validation supports more LinkML features than PolaRS.
# Would throw an exception if validation failed.
personinfo_pa.Person.validate(dataframe)

Generator

class linkml.generators.panderagen.PanderaDataframeGenerator(schema: str | ~typing.TextIO | ~linkml_runtime.linkml_model.meta.SchemaDefinition | Generator | ~pathlib.Path, schemaview: ~linkml_runtime.utils.schemaview.SchemaView = None, format: str | None = None, metadata: bool = True, useuris: bool | None = None, log_level: int | None = 30, mergeimports: bool | None = True, source_file_date: str | None = None, source_file_size: int | None = None, logger: ~logging.Logger | None = None, verbose: bool | None = None, output: str | None = None, namespaces: ~linkml_runtime.utils.namespaces.Namespaces | None = None, directory_output: bool = False, base_dir: str = None, metamodel_name_map: dict[str, str] = None, importmap: str | ~collections.abc.Mapping[str, str] | None = None, emit_prefixes: set[str] = <factory>, metamodel: ~linkml.utils.schemaloader.SchemaLoader = None, stacktrace: bool = False, include: str | ~pathlib.Path | ~linkml_runtime.linkml_model.meta.SchemaDefinition | None = None, template_file: str = None, true_enums: bool = False, package: str = 'example', TYPE_MAP: dict = None, template_path: str = None, gen_classvars: bool = True, gen_slots: bool = True, genmeta: bool = False, emit_metadata: bool = True, roll_up_slots: bool = False, backing_form: str = 'serialization', inline_validator_mixin: bool = False, coerce: bool = False)[source]

Generates Pandera python classes from a LinkML schema.

Status: incompletely implemented

Two styles are supported:

  • class-based

  • schema-based (not implemented)

static make_multivalued(range: str) str[source]

Convert a type to its multivalued equivalent.

Templates

The panderagen module uses a templating system that allows generating different target APIs. The currently provided templates are the default panderagen_class_based template and panderagen_polars_schema.

Subclasses of DataframeGenerator serve as a translation layer between the source models and schema view from linkml_runtime and the target models under panderagen , making clear what is needed to generate schema code as well as what parts of the linkml metamodel are supported.

Pandera Custom Checks

When possible the Pandera Generator implements LinkML constraints directly as Pandera checks. Support for nested columns uses Pandera custom checks to first isolate the nested column and then recursively call Pandera validation.

The generated Pandera class depends on helper methods in the LinkML library at runtime to perform nested checks.

Validation and Lazyframes

Pandera validation can operate on lazyframes or dataframes. However when a lazyframe is validated, checks that require collection are not run. In general this means only schema-level checks are performed on lazyframes. The LinkmlPanderaValidator checks whether it is validating a dataframe or lazyframe and maintains the same form when making nested validation calls.

Inlined Dictionary Handling

Many dataframe libraries do not handle dictionaries efficiently. Inlining objects as lists is an efficient alternative. The pandera generator supports transforming dictionaries to lists either at load time (preferred) or at validation time.

The implementation of the load-time transform makes use of several generated schemas: - PolaRS serialized form - PolaRS loaded form - Pandera serialized form - Pandera loaded form

The PolaRS serialized form represents any inlined dictionary as an opaque pl.Object. This schema is compatible with loading dataframes using polars.read_json or pl.DataFrame(). The PolaRS loaded form uses lists rather than dictionaries for inlining. To transform between the forms, the pandera generator can also generate a load transform module. The transform currently only implements the load direction. The generator also generates serialized and loaded forms of the Pandera schema.

Not all of these schemas are needed for every application. The example below shows how to use all of them to load a python object into a dataframe. Note the specific model does not actually contain dictionary forms that require a transform.

import personinfo.panderagen_class_based as pcb
import personinfo.panderagen_polars_schema_loaded as ppsl
import personinfo.panderagen_polars_schema_transform as ppst
import personinfo.panderagen_polars_schema as pps
import personinfo.panderagen_schema_loaded as psl
import polars as pl

# some of the schema forms have informative string representations
print(f"Panderagen (serialized): {pcb.Person}")
print(f"Panderagen (loaded): {psl.Person}")
print(f"PolaRS load transform: {ppst.Person}")
print(f"PolaRS schema (serialized): {pps.Person}")
print(f"PolaRS schema (loaded): {ppsl.Person}")

p = pl.DataFrame([
      {
        "full_name": "Old Joe Clark",
        "age": 23
      }
    ],
    schema=pps.Person
)
pcb.Person.validate(p)

print(p)

p_loaded = ppst.Person().load(p)
psl.Person.validate(p_loaded)

Development Notes

This generator supports Pandera, which is a validator. To assist with loading or constructing dataframes that conform to the model, the underlying PolaRS dataframe schema is also generated. Transforming the forms found in the unit tests and existing models are also prioritized in future development.

Following nested objects (and eventually foreign-key associations) relies on using PolaRS expressions API for efficiency. These may make greater use of the Narwhals API for more general support of additional dataframe libraries in the future.

Testing

The panderagen package is tested against the subset of the LinkML compliance tests that it currently implements. There is also a specific test for the generator that emphasizes the dataframe nature of the validation.

To test panderagen compliance use the -m panderagen pytest mark. Use -m dataframe_polars_schema for the PolaRS compliance.

pytest -m panderagen tests/linkml/test_compliance
pytest -m dataframe_polars_schema tests/linkml/test_compliance

In the tests, the optional LinkML dependencies such as NumPy, PolaRS, and Pandera are wrapped in test fixtures and imported using pytest.importorskip. This prevents test collection errors and skips the tests when the optional packages are not installed.

Future Roadmap

The following major features need to be prioritized

  • ability to generate a schema from examples/PersonSchema/personinfo.yaml

  • Foreign key associations

  • Expand the transformer model to support literal types (datetime cases) and boolean constraints.

  • Model and slot inheritance, including abstract models

  • Generalize support for additional dataframe libraries - Parquet/PyArrow storage formats - PySpark (also supported by Pandera) - Narwhals (general dataframe API wrapper) - Ibis (portable dataframe library)

  • Improve modularity - leverage and align with existing linkml-runtime modules and tables

  • Top-level validator cli tool under linkml/validators

  • Ability to use the generated Pandera without a runtime LinkML dependency.

  • Cardinality checks over entire dataframe columns