How to Generate AI prompts#
Generative AI, exemplified by ChatGPT, is a powerful but unreliable technology that can help with many different kinds of tasks, including:

- Extracting structured data from text into databases
- Natural language interfaces to databases
The underlying technology is instruction-tuned Large Language Models (LLMs), powerful tools that have very general question-answering abilities, yet are prone to hallucinations. Additionally, while LLMs can generate text that is understandable by a human, getting them to produce structured data according to a defined schema can be challenging.
In June 2023, OpenAI announced the ability to describe function calls in OpenAI API requests. This is compatible with using LinkML to describe your data, and it is possible to auto-generate prompts using LinkML.
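To make the idea concrete, here is a minimal sketch of how a LinkML-style class description could be turned into an OpenAI function definition. The `Person` class, the `RANGE_TO_JSON` type mapping, and the `class_to_function` helper are all hypothetical illustrations for this how-to, not a real LinkML runtime API; in practice you would load the class from a schema (e.g. with linkml-runtime's SchemaView) rather than hand-writing a dict.

```python
# A hand-written stand-in for a LinkML class definition (assumption:
# normally this would come from a schema file, not an inline dict).
PERSON_CLASS = {
    "name": "Person",
    "description": "A person mentioned in the text",
    "attributes": {
        "name": {"range": "string", "required": True},
        "age": {"range": "integer"},
        "employed_at": {"range": "string"},
    },
}

# Minimal mapping from LinkML scalar ranges to JSON Schema types
# (assumption: only scalar ranges are handled in this sketch).
RANGE_TO_JSON = {
    "string": "string",
    "integer": "integer",
    "float": "number",
    "boolean": "boolean",
}


def class_to_function(cls: dict) -> dict:
    """Turn a LinkML-style class dict into an OpenAI function definition."""
    properties = {
        slot: {"type": RANGE_TO_JSON.get(attrs.get("range", "string"), "string")}
        for slot, attrs in cls["attributes"].items()
    }
    required = [slot for slot, attrs in cls["attributes"].items() if attrs.get("required")]
    return {
        "name": f"extract_{cls['name'].lower()}",
        "description": cls.get("description", ""),
        "parameters": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }


fn = class_to_function(PERSON_CLASS)
print(fn["name"])  # extract_person
```

The key design point is that the schema, not the prompt text, is the source of truth for what fields the model should fill in.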
The idea here is to go from free text narrative or semi-structured text such as:
Izumi is a professor at the University of Tokyo, where she has been employed since 2017.
She is 56 years old.
She has a brother called Toshiro.
For example, the intended target object may be expressed in JSON as:
"description": "Izumi is a professor at the University of Tokyo.",
"employed_at": "University of Tokyo",
This how-to is aimed at a typical LinkML developer. We’ll briefly address why LinkML is a good framework for this kind of task at the end, but if you are coming here as a newbie, you may want to check out Why LinkML? in the FAQ.
Further exploration: Text extraction with OntoGPT and SPIRES#
OntoGPT is a framework for combining LLMs with ontologies. It includes an implementation of the SPIRES algorithm for extracting structured data from text, using LinkML schemas. The approach taken differs from the above in two ways:
- SPIRES allows for grounding of identified entities and concepts using ontology lookup (via OAK), and dynamic value sets
- Rather than trying to extract JSON in a single pass, SPIRES takes a multi-pass approach, recursing down the schema
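A toy sketch of that multi-pass idea: instead of requesting one JSON blob, issue one prompt per slot, and recurse into slots whose range is itself a class. The schema dicts and the `ask_llm` stub below are hypothetical stand-ins for illustration only; the real algorithm lives in OntoGPT and also handles grounding, value sets, and lists.

```python
# Hypothetical flattened schema: slot -> range; a range that names another
# class triggers recursion (assumption: no lists or inlining in this sketch).
SCHEMA = {
    "Person": {"name": "string", "employer": "Organization"},
    "Organization": {"name": "string", "city": "string"},
}


def ask_llm(prompt: str) -> str:
    """Stand-in for a real LLM call; returns canned answers for the demo."""
    canned = {
        "name of the Person": "Izumi",
        "name of the Organization": "University of Tokyo",
        "city of the Organization": "Tokyo",
    }
    for key, answer in canned.items():
        if key in prompt:
            return answer
    return ""


def extract(class_name: str, text: str) -> dict:
    """Recursively extract an instance of class_name from text, slot by slot."""
    result = {}
    for slot, rng in SCHEMA[class_name].items():
        if rng in SCHEMA:
            # The range is a class: recurse with a sub-extraction pass.
            result[slot] = extract(rng, text)
        else:
            # Scalar slot: one focused prompt per value.
            result[slot] = ask_llm(f"What is the {slot} of the {class_name} in: {text}")
    return result


TEXT = "Izumi is a professor at the University of Tokyo."
print(extract("Person", TEXT))
```

Many small, focused prompts tend to be easier for a model to answer reliably than one large request for a deeply nested object, at the cost of more API calls.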
SPIRES predated OpenAI function calls and we are still exploring the relative strengths and benefits of both approaches.
SPIRES also makes use of optional additional annotations in the schema to augment the instructions provided to the LLM. Some of these may be separated out into their own standard in the future.
For more on SPIRES, see:
Structured prompt interrogation and recursive extraction of semantics (SPIRES): A method for populating knowledge bases using zero-shot learning. (2023) Caufield, J.H. et al doi.org/10.48550/arXiv.2304.02711
Other use cases#
Extracting text is not the only use case for using LLMs with LinkML.
Another example is combining GPT with a trusted database to increase its reliability. If the external database is described using LinkML, then we can build systems following patterns like ReAct or using LangChain that combine the LLM with database lookups.
Check back later for examples of this!
Can I use LLMs to help generate schemas?#
Yes! LinkML is self-describing, so the above methodology should make it possible to generate a schema from semi-structured or textual sources, simply by using the metamodel as the schema! We haven't done many experiments yet, but if these prove successful we may incorporate this approach into schema-automator.
And of course, many LinkML developers have already discovered that GitHub Copilot (and similar tools like Tabnine) work surprisingly well as autocomplete assistants when editing schemas in an IDE.
Aren’t LLMs unreliable?#
As a LinkML developer you likely care about precise modeling, and accurate representation of knowledge and data. LLMs may seem anathema to you! They are unreliable! They hallucinate! They give a different answer from one run to the next!
Additionally, there are aspects of current LLMs that go against certain parts of the Open Science ethos of LinkML. The examples in this tutorial require subscription access to use proprietary models, with inscrutable training data. On top of that there are environmental costs in training them and running them.
If this isn’t your reaction, then we encourage reading the Stochastic Parrots paper (https://dl.acm.org/doi/10.1145/3442188.3445922), which goes into these issues in more detail.
If it is your reaction, then rest assured that LLMs will not be a part of the core LinkML framework (although they may be used in ancillary parts like schema-automator). But we do intend to make it easier for people who are interested in combining LLMs with well-defined, reliable, trusted data. Although everyone is still figuring out the strengths and benefits of this technology, there are reasons to believe that LLMs when combined with curator oversight and trusted data can be a useful tool in the data landscape.