Quality is always a question when it comes to AI outputs. Several factors go into assessing the quality of AI-generated outputs, and they align with the stages of the AI generation process.

Training Data Stage

The quality of training data correlates with the quality you can expect from a Large Language Model (LLM). Most commercial LLMs have been trained on the open web, though some have licensed data for training. Commercial LLMs are meant to be general-purpose so they can answer the broadest range of questions and tasks posed to them. The better the training data, the better the AI quality will be, but most commercial LLMs do not disclose which data sources they use for training. Many do state that if a user is interacting with the non-licensed version of the commercial LLM or chatbot, the user's data may be used for training unless the user opts out.

In the case of EBSCO's use of AI, we only work with licensed commercial LLMs to make sure prompts are protected from AI training. Because most commercial LLMs do not reveal their data sources, assessing the quality of a commercial LLM depends on evaluating the AI's output, comparing it against the responses of other AI models, and determining which meets the quality threshold for specific use cases. A human-in-the-loop process is how the quality of AI responses is determined. EBSCO compares AI model quality before selecting a model for each AI feature. The models behind our AI features are documented on EBSCO Connect.
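As a hedged illustration of what that side-by-side comparison can look like in practice, the sketch below collects candidate-model outputs on a shared prompt set so human reviewers can score them. The model callables are hypothetical stand-ins, not EBSCO's actual integration code.

```python
# Sketch only: gather outputs from several candidate models on the same
# prompts so human reviewers can compare them side by side.
from typing import Callable, Dict, List

def collect_comparisons(
    prompts: List[str],
    models: Dict[str, Callable[[str], str]],  # model name -> callable returning a response
) -> List[dict]:
    """Run every prompt through every candidate model and pair up the outputs."""
    rows = []
    for prompt in prompts:
        row = {"prompt": prompt}
        for name, generate in models.items():
            row[name] = generate(prompt)  # placeholder for a real model API call
        rows.append(row)
    return rows  # handed off to human reviewers for scoring
```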

AI Model and Operation Stage

Another quality factor is the AI model itself and how it is operated. The model is controlled by the LLM provider, but the fine-tuning, parameters such as temperature (which controls how much randomness the model uses when generating a response), and the prompt sent to the LLM can be controlled by your organization, or by researchers themselves if they are using an LLM in their research. While model quality still depends on assessing the AI's output, the fine-tuning, parameter settings, and prompting (to name a few of the controls available with an LLM) can be adjusted and tested before an LLM is used by end users. This is usually tested with a sample set of data, questions, and parameter changes to find the best approach. EBSCO has dedicated teams of AI engineers who run continual tests on these AI parameters to maintain the high-quality expectations of our products.
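A minimal sketch of that kind of pre-release testing is shown below: a sweep over temperature values and prompt templates on a sample question set. The call_llm callable, templates, and questions are illustrative assumptions, not EBSCO's actual test harness.

```python
# Sketch only: sweep temperature values and prompt templates over a small
# sample question set so the outputs can be reviewed for quality before
# any configuration reaches end users.
import itertools
from typing import Callable, List

def run_parameter_sweep(
    call_llm: Callable[[str, float], str],  # hypothetical (prompt, temperature) -> response
    questions: List[str],
    templates: List[str],                   # e.g. "Answer concisely: {question}"
    temperatures: List[float],              # e.g. [0.0, 0.3, 0.7]
) -> List[dict]:
    results = []
    for question, template, temp in itertools.product(questions, templates, temperatures):
        prompt = template.format(question=question)
        results.append({
            "question": question,
            "template": template,
            "temperature": temp,
            "output": call_llm(prompt, temp),  # placeholder for a real API call
        })
    return results  # reviewed against a quality rubric before launch
```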

EBSCO has always been dedicated to high-quality, trustworthy data, and AI quality is no different.

Grounding Stage

The next stage that affects quality is the grounding stage, where the LLM is supplemented through Retrieval-Augmented Generation (RAG). Before the query reaches the LLM, verified and authoritative data is retrieved from outside the model and used to improve the predictability, accuracy, context, and timeliness of the AI-generated output. Grounding is completely controlled by the individual or organization using the AI, so this is where quality is most critical. Using quality data sources helps reduce hallucinations and, according to recent studies, increases AI response specificity by at least 46%. EBSCO, for instance, grounds our AI features in the authoritative content of our databases. This is not training the AI in any way. Grounding supplements the AI with the authoritative data in our databases, which in turn reflects the human curation of facts and subjects that we have always provided.
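To make the RAG pattern described above concrete, here is a minimal sketch under stated assumptions: search_index and call_llm are hypothetical placeholders, not EBSCO's retrieval or model APIs, and the prompt wording is illustrative only.

```python
# Sketch only: retrieve authoritative passages first, then ask the LLM to
# answer using only that retrieved context (the essence of RAG/grounding).
from typing import Callable, List

def grounded_answer(
    question: str,
    search_index: Callable[[str, int], List[str]],  # returns top-k passages
    call_llm: Callable[[str], str],                  # returns a model response
    k: int = 3,
) -> str:
    passages = search_index(question, k)  # verified, authoritative sources
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```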

End User Assessment Stage

The next stage for quality is the end user assessment stage. While this stage often includes some passive quality assessment from users (abandoned or refined search queries, for instance), quality assessment is also conducted on AI responses periodically to ensure quality standards are met and do not degrade over time. EBSCO uses a three-step human review process for AI responses: internal Subject Matter Experts (SMEs) review first, followed by beta testers, followed by end users. This is the human-in-the-loop review process. A sample rubric that EBSCO uses for AI response assessment measures the following (a brief scoring sketch follows the list):

  • Timeliness: Is the information presented in the Insight current and not out of date?
  • Tone: Does the information in the Insight match the tone in the article?
  • Terminology: Does the terminology in the Insight match what is in the article?
  • Accuracy: Is the information in the Insight accurate based on the details found in the article?
  • Thematic: Are the main themes from the article covered in the Insight?
  • Usefulness: Was the Insight useful as supplemental material to the abstract and/or research?
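To show how scores against those six criteria could be recorded and aggregated across a review sample, here is a hedged sketch. The field names and 1-to-5 scale are illustrative assumptions, not EBSCO's internal tooling.

```python
# Sketch only: record a human reviewer's rubric scores for one AI-generated
# Insight and average them across a review sample (1-5 scale assumed).
from dataclasses import dataclass, asdict
from statistics import mean
from typing import List

@dataclass
class RubricScore:
    timeliness: int   # is the information current?
    tone: int         # does it match the article's tone?
    terminology: int  # does it use the article's terminology?
    accuracy: int     # is it accurate against the article?
    thematic: int     # are the article's main themes covered?
    usefulness: int   # is it useful supplemental material?

def average_scores(scores: List[RubricScore]) -> dict:
    """Average each rubric criterion across a sample of reviewed responses."""
    return {
        field: mean(asdict(s)[field] for s in scores)
        for field in asdict(scores[0])
    }
```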

In addition, there are system assessments such as latency (how long the AI takes to complete its task), uptime and downtime (how reliable the system is when you need to use it), cost and environmental efficiency (a responsibility to frugality and the planet), prompt engineering peer review (which helps decrease biases), temperature control (the sampling setting that governs how much randomness appears in an AI's responses), and much more. All of this feeds into how well the AI will perform for any given task.
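As one small example of those system checks, the sketch below times an AI call and keeps a running latency log using only the standard library; the wrapped call_llm is again a hypothetical placeholder.

```python
# Sketch only: measure how long an AI call takes and log it so slow
# responses can be flagged during system assessment.
import time
from typing import Callable, List

def timed_call(call_llm: Callable[[str], str], prompt: str, log: List[dict]) -> str:
    start = time.perf_counter()
    response = call_llm(prompt)  # placeholder for a real model call
    elapsed = time.perf_counter() - start
    log.append({"prompt": prompt, "latency_seconds": elapsed})
    return response
```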

For every stage in the AI pipeline, quality can be measured, and steps can be taken to improve it. It is critical to assess quality at every stage, alongside other measures such as bias, cost, environmental impact, equality, and more. We will cover those tenets in upcoming posts.

EBSCO has always been dedicated to high-quality, trustworthy data, and AI quality is no different. We not only measure quality at every stage, but we also have SMEs review a representative sample of AI responses and outputs to make sure the quality remains high.

If you are interested in trying one of our AI features, check out our newly launched AI Insights and Natural Language Search.