Example Usage
Given a LinkML schema such as the following: https://github.com/linkml/linkml-arrays/blob/main/tests/input/temperature_dataset.yaml
We can generate Pydantic classes for the schema: https://github.com/linkml/linkml-arrays/blob/main/tests/test_dumpers/array_classes.py
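The checked-in classes above were generated ahead of time. As a sketch, LinkML's Pydantic generator can be invoked programmatically to produce equivalent classes (the generator version needed for array support may vary):

from linkml.generators.pydanticgen import PydanticGenerator

# Render Pydantic model source code from the LinkML schema
pydantic_source = PydanticGenerator("temperature_dataset.yaml").serialize()
with open("array_classes.py", "w") as f:
    f.write(pydantic_source)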
We can then create instances of these classes to represent data:
import numpy as np
from tests.test_dumpers.array_classes import (
    LatitudeSeries, LongitudeSeries, DaySeries,
    TemperatureMatrix, TemperatureDataset
)
latitude_in_deg = LatitudeSeries(values=np.array([1, 2, 3]))
longitude_in_deg = LongitudeSeries(values=np.array([4, 5, 6]))
time_in_d = DaySeries(values=np.array([7, 8, 9]))
temperatures_in_K = TemperatureMatrix(
    values=np.ones((3, 3, 3)),
)
temperature = TemperatureDataset(
    name="my_temperature",
    latitude_in_deg=latitude_in_deg,
    longitude_in_deg=longitude_in_deg,
    time_in_d=time_in_d,
    temperatures_in_K=temperatures_in_K,
)
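The result is an ordinary Pydantic model, so the wrapped arrays are reachable through plain attribute access. A quick sanity check (illustrative only, not part of the library API):

print(temperature.name)                            # my_temperature
print(temperature.latitude_in_deg.values)          # [1 2 3]
print(temperature.temperatures_in_K.values.shape)  # (3, 3, 3)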
Serialization
We currently have four options for serializing (dumping) these arrays to disk:
1. a YAML file for the non-array data and a NumPy file for each of the arrays
2. a YAML file for the non-array data and an HDF5 file with a single dataset for each of the arrays
3. a single HDF5 file with a hierarchical structure that mirrors the structure of the data object and contains non-array data as attributes and array data as datasets
4. a single Zarr (v2) directory store with a hierarchical structure that mirrors the structure of the data object and contains non-array data as attributes and array data as arrays
For all dumpers, first get a SchemaView object for the LinkML schema:
from linkml_runtime import SchemaView
from pathlib import Path
schema_path = Path("temperature_dataset.yaml")
schemaview = SchemaView(schema_path)
Then use a dumper to serialize the TemperatureDataset data object that we created above:
YAML + NumPy dumper:
from linkml_arrays.dumpers import YamlNumpyDumper
YamlNumpyDumper().dumps(temperature, schemaview=schemaview)
Output YAML file with references to the NumPy files for each array:
latitude_in_deg:
  values: file:./my_temperature.LatitudeSeries.values.npy
longitude_in_deg:
  values: file:./my_temperature.LongitudeSeries.values.npy
name: my_temperature
temperatures_in_K:
  values: file:./my_temperature.TemperatureMatrix.values.npy
time_in_d:
  values: file:./my_temperature.DaySeries.values.npy
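Each referenced .npy file is a plain NumPy array on disk, so it can be inspected directly with np.load (file names taken from the YAML above):

import numpy as np

lat = np.load("my_temperature.LatitudeSeries.values.npy")
print(lat)  # [1 2 3]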
YAML + HDF5 dumper:
from linkml_arrays.dumpers import YamlHdf5Dumper
YamlHdf5Dumper().dumps(temperature, schemaview=schemaview)
Output YAML file with references to the HDF5 files for each array:
latitude_in_deg:
  values: file:./my_temperature.LatitudeSeries.values.h5
longitude_in_deg:
  values: file:./my_temperature.LongitudeSeries.values.h5
name: my_temperature
temperatures_in_K:
  values: file:./my_temperature.TemperatureMatrix.values.h5
time_in_d:
  values: file:./my_temperature.DaySeries.values.h5
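Each referenced .h5 file contains a single dataset (option 2 above). To inspect one without assuming its internal path, h5py can walk the file; a minimal sketch (h5py is not required to use the loaders):

import h5py

with h5py.File("my_temperature.LatitudeSeries.values.h5", "r") as f:
    # Print the path of every object in the file to locate the dataset
    f.visititems(lambda name, obj: print(name, obj))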
HDF5 dumper:
from linkml_arrays.dumpers import Hdf5Dumper
Hdf5Dumper().dumps(temperature, schemaview=schemaview)
The h5dump output of the resulting HDF5 file:
HDF5 "my_temperature.h5" {
GROUP "/" {
ATTRIBUTE "name" {
DATATYPE H5T_STRING {
STRSIZE H5T_VARIABLE;
STRPAD H5T_STR_NULLTERM;
CSET H5T_CSET_UTF8;
CTYPE H5T_C_S1;
}
DATASPACE SCALAR
DATA {
(0): "my_temperature"
}
}
GROUP "latitude_in_deg" {
DATASET "values" {
DATATYPE H5T_STD_I64LE
DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
DATA {
(0): 1, 2, 3
}
}
}
GROUP "longitude_in_deg" {
DATASET "values" {
DATATYPE H5T_STD_I64LE
DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
DATA {
(0): 4, 5, 6
}
}
}
GROUP "temperatures_in_K" {
DATASET "values" {
DATATYPE H5T_IEEE_F64LE
DATASPACE SIMPLE { ( 3, 3, 3 ) / ( 3, 3, 3 ) }
DATA {
(0,0,0): 1, 1, 1,
(0,1,0): 1, 1, 1,
(0,2,0): 1, 1, 1,
(1,0,0): 1, 1, 1,
(1,1,0): 1, 1, 1,
(1,2,0): 1, 1, 1,
(2,0,0): 1, 1, 1,
(2,1,0): 1, 1, 1,
(2,2,0): 1, 1, 1
}
}
}
GROUP "time_in_d" {
DATASET "values" {
DATATYPE H5T_STD_I64LE
DATASPACE SIMPLE { ( 3 ) / ( 3 ) }
DATA {
(0): 7, 8, 9
}
}
}
}
}
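The group and dataset paths shown by h5dump can be used directly with h5py to read the data back without going through a loader (paths taken from the dump above):

import h5py

with h5py.File("my_temperature.h5", "r") as f:
    print(f.attrs["name"])                      # my_temperature
    print(f["latitude_in_deg/values"][:])       # [1 2 3]
    print(f["temperatures_in_K/values"].shape)  # (3, 3, 3)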
Zarr dumper:
from linkml_arrays.dumpers import ZarrDumper
ZarrDumper().dumps(temperature, schemaview=schemaview)
The tree output of the resulting Zarr directory store:
my_temperature.zarr
├── .zattrs
├── .zgroup
├── latitude_in_deg
│   ├── .zgroup
│   └── values
│       ├── .zarray
│       └── 0
├── longitude_in_deg
│   ├── .zgroup
│   └── values
│       ├── .zarray
│       └── 0
├── temperatures_in_K
│   ├── .zgroup
│   └── values
│       ├── .zarray
│       └── 0.0.0
└── time_in_d
    ├── .zgroup
    └── values
        ├── .zarray
        └── 0
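Similarly, the store can be opened with the zarr library: the non-array data lives in the root group's attributes (.zattrs) and each values array is addressable by path (paths taken from the tree above):

import zarr

root = zarr.open("my_temperature.zarr", mode="r")
print(root.attrs["name"])                      # my_temperature
print(root["latitude_in_deg/values"][:])       # [1 2 3]
print(root["temperatures_in_K/values"].shape)  # (3, 3, 3)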
Deserialization
For deserializing (loading) the data, we can use the corresponding loader for each dumper:
YAML + NumPy loader:
from hbreader import hbread
from linkml_arrays.loaders import YamlNumpyLoader
read_yaml = hbread("my_temperature_yaml_numpy.yaml")
read_temperature = YamlNumpyLoader().loads(read_yaml, target_class=TemperatureDataset, schemaview=schemaview)
YAML + HDF5 loader:
from hbreader import hbread
from linkml_arrays.loaders import YamlHdf5Loader
read_yaml = hbread("my_temperature_yaml_hdf5.yaml")
read_temperature = YamlHdf5Loader().loads(read_yaml, target_class=TemperatureDataset, schemaview=schemaview)
HDF5 loader:
from linkml_arrays.loaders import Hdf5Loader
read_temperature = Hdf5Loader().loads("my_temperature.h5", target_class=TemperatureDataset, schemaview=schemaview)
Zarr loader:
from linkml_arrays.loaders import ZarrLoader
read_temperature = ZarrLoader().loads("my_temperature.zarr", target_class=TemperatureDataset, schemaview=schemaview)
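Whichever dumper/loader pair is used, a round trip should preserve both the scalar metadata and the array contents. A minimal check, assuming the temperature object from above is still in scope:

import numpy as np

assert read_temperature.name == temperature.name
np.testing.assert_array_equal(
    read_temperature.temperatures_in_K.values,
    temperature.temperatures_in_K.values,
)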