{ "cells": [ { "cell_type": "markdown", "source": [ "# How to predict missing data\n", "\n", "LinkML implements the \"CRUDSI\" design pattern. In addition to **Create**, **Read**, **Update**, **Delete**, LinkML also supports Search and *Inference*.\n", "\n", "The framework is designed to support different kinds of inference, including rule-based and LLMs. This notebooks shows simple ML-based inference using scikit-learn DecisionTrees.\n", "\n", "This how-to walks through the basic operations of using the `linkml-store` command line tool to perform training and inference using scikit-learn DecisionTrees. This uses the command line interface, but the same operations can be performed programmatically using the Python API, or via the Web API.\n", "\n", "We will use a subset of the classic [Iris dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html), converted to jsonl (JSON Lines) format:" ], "metadata": { "collapsed": false }, "id": "611f205e9444130c" }, { "cell_type": "code", "source": [ "%%bash\n", "linkml-store -i ../../tests/input/iris.jsonl describe" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-23T22:15:36.754913Z", "start_time": "2024-08-23T22:15:33.366042Z" } }, "id": "d2ef6e85292b5a20", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " count unique top freq mean std min 25% 50% 75% max\n", "petal_length 100.0 NaN NaN NaN 2.861 1.449549 1.0 1.5 2.45 4.325 5.1\n", "petal_width 100.0 NaN NaN NaN 0.786 0.565153 0.1 0.2 0.8 1.3 1.8\n", "sepal_length 100.0 NaN NaN NaN 5.471 0.641698 4.3 5.0 5.4 5.9 7.0\n", "sepal_width 100.0 NaN NaN NaN 3.099 0.478739 2.0 2.8 3.05 3.4 4.4\n", "species 100 2 setosa 50 NaN NaN NaN NaN NaN NaN NaN\n" ] } ], "execution_count": 2 }, { "metadata": {}, "cell_type": "markdown", "source": "## The Infer Command", "id": "335516b2c129363a" }, { "metadata": { "ExecuteTime": { "end_time": "2024-08-23T22:20:41.635957Z", "start_time": "2024-08-23T22:20:38.428284Z" } }, "cell_type": "code", "source": [ "%%bash\n", "linkml-store infer --help" ], "id": "e38efeb1addfe697", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Usage: linkml-store infer [OPTIONS]\n", "\n", " Predict a complete object from a partial object.\n", "\n", " Currently two main prediction methods are provided: RAG and sklearn\n", "\n", " ## RAG:\n", "\n", " The RAG approach will use Retrieval Augmented Generation to inference the\n", " missing attributes of an object.\n", "\n", " Example:\n", "\n", " linkml-store -i countries.jsonl inference -t rag -q 'name: Uruguay'\n", "\n", " Result:\n", "\n", " capital: Montevideo, code: UY, continent: South America, languages:\n", " [Spanish]\n", "\n", " You can pass in configurations as follows:\n", "\n", " linkml-store -i countries.jsonl inference -t\n", " rag:llm_config.model_name=llama-3 -q 'name: Uruguay'\n", "\n", " ## SKLearn:\n", "\n", " This uses scikit-learn (defaulting to simple decision trees) to do the\n", " prediction.\n", "\n", " linkml-store -i tests/input/iris.csv inference -t sklearn -q\n", " '{\"sepal_length\": 5.1, \"sepal_width\": 3.5, \"petal_length\": 1.4,\n", " \"petal_width\": 0.2}'\n", "\n", "Options:\n", " -O, --output-type [json|jsonl|yaml|yamll|tsv|csv|python|parquet|formatted|table|duckdb|postgres|mongodb]\n", " Output format\n", " -o, --output PATH Output file path\n", " -T, --target-attribute TEXT Target attributes for inference\n", " -F, --feature-attributes TEXT Feature attributes for inference (comma\n", " separated)\n", " -Y, --inference-config-file PATH\n", " Path to inference configuration file\n", " -E, --export-model PATH Export model to file\n", " -L, --load-model PATH Load model from file\n", " -M, --model-format [pickle|onnx|pmml|pfa|joblib|png|linkml_expression|rulebased|rag_index]\n", " Format for model\n", " -S, --training-test-data-split ...\n", " Training/test data split\n", " -t, --predictor-type TEXT Type of predictor [default: sklearn]\n", " -n, --evaluation-count INTEGER Number of examples to evaluate over\n", " --evaluation-match-function TEXT\n", " Name of function to use for matching objects\n", " in eval\n", " -q, --query TEXT query term\n", " --help Show this message and exit.\n" ] } ], "execution_count": 5 }, { "cell_type": "markdown", "source": [ "## Training and Inference\n", "\n", "We can perform training and inference in a single step. \n", "\n", "For feature labels, we use:\n", "\n", "- `petal_length`\n", "- `petal_width`\n", "- `sepal_length`\n", "- `sepal_width`\n", "\n", "These can be explicitly specified using `-F`, but in this case we are specifying a query, so\n", "the feature labels are inferred from the query.\n", "\n", "We specify the target label using `-T`. In this case, we are predicting the `species` of the iris.\n" ], "metadata": { "collapsed": false }, "id": "2b7b1b83be1db9de" }, { "cell_type": "code", "source": [ "%%bash\n", "linkml-store -i ../../tests/input/iris.jsonl infer -t sklearn -T species -q \"{petal_length: 2.5, petal_width: 0.5, sepal_length: 5.0, sepal_width: 3.5}\" " ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-23T22:17:38.972690Z", "start_time": "2024-08-23T22:17:35.558907Z" } }, "id": "4984aeb4016df154", "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages/sklearn/base.py:493: UserWarning: X does not have valid feature names, but DecisionTreeClassifier was fitted with feature names\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "predicted_object:\n", " species: setosa\n", "confidence: 1.0\n", "\n" ] } ], "execution_count": 4 }, { "metadata": {}, "cell_type": "markdown", "source": "The data model for the output consists of a `predicted_object` slot and a `confidence`. Note that for standard ML operations, the predicted object will typically have one attribute only, but other kinds of inference (OWL reasoning, LLMs) may be able to predict complex objects.", "id": "dfcbdae846f56ada" }, { "cell_type": "markdown", "source": [ "## Saving the Model\n", "\n", "Performing training and inference in a single step is convenient where training is fast, but more typically we'd want to save the model for later use.\n", "\n", "We can do this with the `-E` option:" ], "metadata": { "collapsed": false }, "id": "d5246f13699101c7" }, { "cell_type": "code", "execution_count": 11, "outputs": [], "source": [ "%%bash\n", "linkml-store -i ../../tests/input/iris.jsonl infer -t sklearn -T species -E \"tmp/iris-model.joblib\"" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-12T19:35:54.421780Z", "start_time": "2024-08-12T19:35:51.522109Z" } }, "id": "82dcfeb0bd355cff" }, { "cell_type": "markdown", "source": [ "We can use a pre-saved model in inference:" ], "metadata": { "collapsed": false }, "id": "1dfc49d59872942a" }, { "cell_type": "code", "execution_count": 14, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "/Users/cjm/Library/Caches/pypoetry/virtualenvs/linkml-store-8ZYO4kTy-py3.10/lib/python3.10/site-packages/sklearn/base.py:493: UserWarning: X does not have valid feature names, but DecisionTreeClassifier was fitted with feature names\n", " warnings.warn(\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "predicted_object:\n", " species: setosa\n", "confidence: 1.0\n" ] } ], "source": [ "%%bash\n", "linkml-store -i ../../tests/input/iris.jsonl infer -t sklearn -L \"tmp/iris-model.joblib\" -q \"{petal_length: 2.5, petal_width: 0.5, sepal_length: 5.0, sepal_width: 3.5}\" " ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-12T19:37:09.132122Z", "start_time": "2024-08-12T19:37:06.560577Z" } }, "id": "e65cd3b7ca131615" }, { "cell_type": "markdown", "source": [ "## Exporting models to explainable visualizations\n", "\n", "We can export the model to a visual representation to make it more explaininable:" ], "metadata": { "collapsed": false }, "id": "20d5059d2efcd1ca" }, { "cell_type": "code", "source": [ "%%bash\n", "linkml-store --stacktrace -i ../../tests/input/iris.jsonl infer -t sklearn -T species -L tmp/iris-model.joblib -E input/iris-model.png" ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-23T22:23:18.451362Z", "start_time": "2024-08-23T22:23:15.571984Z" } }, "id": "d7d14edd77e9e1fe", "outputs": [], "execution_count": 9 }, { "cell_type": "markdown", "source": "![img](input/iris-model.png)", "metadata": { "collapsed": false }, "id": "cca55edf629f8c26" }, { "cell_type": "markdown", "source": [ "## Generating a rule-based model\n", "\n", "Although traditionally ML is used for *statistical inference*, sometimes we might want to use ML (e.g. Decision Trees) to generate\n", "simple purely deterministic rule-based models.\n", "\n", "linkml-store has a different kind of inference engine that works using LinkML schemas, specifically\n", "\n", "- `rules` at the class an slot level\n", "- `expressions` that combine slot assignments logically and artithmetically\n", "\n", "We can export (some) ML models to this format:" ], "metadata": { "collapsed": false }, "id": "3ef8a6bc39b5e667" }, { "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-23T22:24:16.457340Z", "start_time": "2024-08-23T22:24:13.977990Z" } }, "cell_type": "code", "source": [ "%%bash\n", "linkml-store -i ../../tests/input/iris.jsonl infer -t sklearn -T species -L tmp/iris-model.joblib -E tmp/iris-model.rulebased.yaml\n", "cat tmp/iris-model.rulebased.yaml" ], "id": "acb7c57ecb3be9b", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "class_rules: null\n", "config:\n", " feature_attributes:\n", " - petal_length\n", " - petal_width\n", " - sepal_length\n", " - sepal_width\n", " target_attributes:\n", " - species\n", "slot_expressions:\n", " species: (\"setosa\" if ({petal_width} <= 0.8000) else \"versicolor\")\n", "slot_rules: null\n" ] } ], "execution_count": 10 }, { "metadata": {}, "cell_type": "markdown", "source": "We can then apply this model to new data:", "id": "50f9cd9df60b41c9" }, { "cell_type": "code", "execution_count": 32, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "EVAL {'petal_length': 2.5, 'petal_width': 0.5, 'sepal_length': 5.0, 'sepal_width': 3.5}\n", "predicted_object:\n", " petal_length: 2.5\n", " petal_width: 0.5\n", " sepal_length: 5.0\n", " sepal_width: 3.5\n", " species: setosa\n" ] } ], "source": [ "%%bash\n", "linkml-store --stacktrace -i ../../tests/input/iris.jsonl infer -t rulebased -L tmp/iris-model.rulebased.yaml -q \"{petal_length: 2.5, petal_width: 0.5, sepal_length: 5.0, sepal_width: 3.5}\" " ], "metadata": { "collapsed": false, "ExecuteTime": { "end_time": "2024-08-12T22:18:11.759880Z", "start_time": "2024-08-12T22:18:08.912484Z" } }, "id": "4df0d87dff96e667" }, { "metadata": {}, "cell_type": "markdown", "source": [ "## More advanced ML models\n", "\n", "Currently only Decision Trees are supported. Additionally, most of the underlying functionality of scikit-learn is hidden.\n", "\n", "For more advanced ML, you are encouraged to use linkml-store for *data management* and then exporting to standard tabular ot dataframe formats in order to do more advanced ML in Python. linkml-store is *not* intended as an ML platform. Instead a limited set of operations are provided to assist with data exploration and assisting in construction of deterministic rules.\n", "\n", "For inference using LLMs and Retrieval Augmented Generation, see the how-to guide on those topics.\n" ], "id": "d1b583ce2d75c0e0" }, { "metadata": {}, "cell_type": "code", "outputs": [], "execution_count": null, "source": "", "id": "c8d9e36761d3088d" } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.6" } }, "nbformat": 4, "nbformat_minor": 5 }