How to predict missing data
linkml-store implements the “CRUDSI” design pattern: in addition to Create, Read, Update, and Delete, it also supports Search and Inference.
The framework is designed to support different kinds of inference, including rule-based inference and LLMs. This notebook shows simple ML-based inference using scikit-learn decision trees.
This how-to walks through the basic operations of the linkml-store command line tool for training and inference with scikit-learn decision trees. The same operations can also be performed programmatically using the Python API, or via the Web API.
We will use a subset of the classic Iris dataset, converted to jsonl (JSON Lines) format:
[2]:
%%bash
linkml-store -i ../../tests/input/iris.jsonl describe
              count unique     top freq   mean       std  min  25%   50%    75%  max
petal_length  100.0    NaN     NaN  NaN  2.861  1.449549  1.0  1.5  2.45  4.325  5.1
petal_width   100.0    NaN     NaN  NaN  0.786  0.565153  0.1  0.2   0.8    1.3  1.8
sepal_length  100.0    NaN     NaN  NaN  5.471  0.641698  4.3  5.0   5.4    5.9  7.0
sepal_width   100.0    NaN     NaN  NaN  3.099  0.478739  2.0  2.8  3.05    3.4  4.4
species         100      2  setosa   50    NaN       NaN  NaN  NaN   NaN    NaN  NaN
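Each line of the JSON Lines file is a standalone JSON object with the attributes summarized above. A record looks something like this (an illustrative example, not quoted verbatim from the file):
{"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4, "petal_width": 0.2, "species": "setosa"}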
The Infer Command
[5]:
%%bash
linkml-store infer --help
Usage: linkml-store infer [OPTIONS]
Predict a complete object from a partial object.
Currently two main prediction methods are provided: RAG and sklearn
## RAG:
The RAG approach will use Retrieval Augmented Generation to inference the
missing attributes of an object.
Example:
linkml-store -i countries.jsonl inference -t rag -q 'name: Uruguay'
Result:
capital: Montevideo, code: UY, continent: South America, languages:
[Spanish]
You can pass in configurations as follows:
linkml-store -i countries.jsonl inference -t
rag:llm_config.model_name=llama-3 -q 'name: Uruguay'
## SKLearn:
This uses scikit-learn (defaulting to simple decision trees) to do the
prediction.
linkml-store -i tests/input/iris.csv inference -t sklearn -q
'{"sepal_length": 5.1, "sepal_width": 3.5, "petal_length": 1.4,
"petal_width": 0.2}'
Options:
-O, --output-type [json|jsonl|yaml|yamll|tsv|csv|python|parquet|formatted|table|duckdb|postgres|mongodb]
Output format
-o, --output PATH Output file path
-T, --target-attribute TEXT Target attributes for inference
-F, --feature-attributes TEXT Feature attributes for inference (comma
separated)
-Y, --inference-config-file PATH
Path to inference configuration file
-E, --export-model PATH Export model to file
-L, --load-model PATH Load model from file
-M, --model-format [pickle|onnx|pmml|pfa|joblib|png|linkml_expression|rulebased|rag_index]
Format for model
-S, --training-test-data-split <FLOAT FLOAT>...
Training/test data split
-t, --predictor-type TEXT Type of predictor [default: sklearn]
-n, --evaluation-count INTEGER Number of examples to evaluate over
--evaluation-match-function TEXT
Name of function to use for matching objects
in eval
-q, --query TEXT query term
--help Show this message and exit.
Training and Inference
We can perform training and inference in a single step.
For feature labels, we use:
petal_length
petal_width
sepal_length
sepal_width
These can be explicitly specified using -F, but in this case we are specifying a query, so the feature labels are inferred from the query.
We specify the target label using -T. In this case, we are predicting the species of the iris.
[4]:
%%bash
linkml-store -i ../../tests/input/iris.jsonl infer -t sklearn -T species -q "{petal_length: 2.5, petal_width: 0.5, sepal_length: 5.0, sepal_width: 3.5}"
/Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages/sklearn/base.py:493: UserWarning: X does not have valid feature names, but DecisionTreeClassifier was fitted with feature names
warnings.warn(
predicted_object:
species: setosa
confidence: 1.0
The data model for the output consists of a predicted_object slot and a confidence. Note that for standard ML operations, the predicted object will typically have one attribute only, but other kinds of inference (OWL reasoning, LLMs) may be able to predict complex objects.
Saving the Model
Performing training and inference in a single step is convenient where training is fast, but more typically we’d want to save the model for later use.
We can do this with the -E option:
[11]:
%%bash
linkml-store -i ../../tests/input/iris.jsonl infer -t sklearn -T species -E "tmp/iris-model.joblib"
We can use a pre-saved model in inference:
[14]:
%%bash
linkml-store -i ../../tests/input/iris.jsonl infer -t sklearn -L "tmp/iris-model.joblib" -q "{petal_length: 2.5, petal_width: 0.5, sepal_length: 5.0, sepal_width: 3.5}"
/Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages/sklearn/base.py:493: UserWarning: X does not have valid feature names, but DecisionTreeClassifier was fitted with feature names
warnings.warn(
predicted_object:
species: setosa
confidence: 1.0
Exporting models to explainable visualizations
We can export the model to a visual representation to make it more explainable:
[9]:
%%bash
linkml-store --stacktrace -i ../../tests/input/iris.jsonl infer -t sklearn -T species -L tmp/iris-model.joblib -E input/iris-model.png
Generating a rule-based model
Although ML is traditionally used for statistical inference, sometimes we might want to use ML (e.g. decision trees) to generate simple, purely deterministic rule-based models.
linkml-store has a different kind of inference engine that works using LinkML schemas, specifically rules at the class and slot level, and expressions that combine slot assignments logically and arithmetically.
We can export (some) ML models to this format:
[10]:
%%bash
linkml-store -i ../../tests/input/iris.jsonl infer -t sklearn -T species -L tmp/iris-model.joblib -E tmp/iris-model.rulebased.yaml
cat tmp/iris-model.rulebased.yaml
class_rules: null
config:
feature_attributes:
- petal_length
- petal_width
- sepal_length
- sepal_width
target_attributes:
- species
slot_expressions:
species: ("setosa" if ({petal_width} <= 0.8000) else "versicolor")
slot_rules: null
We can then apply this model to new data:
[32]:
%%bash
linkml-store --stacktrace -i ../../tests/input/iris.jsonl infer -t rulebased -L tmp/iris-model.rulebased.yaml -q "{petal_length: 2.5, petal_width: 0.5, sepal_length: 5.0, sepal_width: 3.5}"
EVAL {'petal_length': 2.5, 'petal_width': 0.5, 'sepal_length': 5.0, 'sepal_width': 3.5}
predicted_object:
petal_length: 2.5
petal_width: 0.5
sepal_length: 5.0
sepal_width: 3.5
species: setosa
More advanced ML models
Currently only Decision Trees are supported. Additionally, most of the underlying functionality of scikit-learn is hidden.
For more advanced ML, you are encouraged to use linkml-store for data management and then export to standard tabular or dataframe formats in order to do the ML in Python. linkml-store is not intended as an ML platform; instead, a limited set of operations is provided to assist with data exploration and with the construction of deterministic rules.
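For example, once the data has been exported to JSON Lines (as used above), downstream analysis can be done directly with pandas and scikit-learn. The sketch below is illustrative only: the file name iris.jsonl and the choice of a random forest are assumptions, and none of this is part of linkml-store itself.

# Illustrative sketch: load exported JSON Lines data and fit a scikit-learn model.
# The file name "iris.jsonl" is a hypothetical export from linkml-store.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Read the exported JSON Lines file into a dataframe
df = pd.read_json("iris.jsonl", lines=True)

features = ["petal_length", "petal_width", "sepal_length", "sepal_width"]
X = df[features]
y = df["species"]

# Hold out a test set and fit a more flexible model than a single decision tree
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))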
For inference using LLMs and Retrieval Augmented Generation, see the how-to guide on those topics.