Responsibility & Safety
Published 17 December 2024
Authors: FACTS team
Our comprehensive benchmark and online leaderboard offer a much-needed measure of how accurately LLMs ground their responses in provided source material and avoid hallucinations.
Large language models (LLMs) are changing the way we access information, but they still struggle with factual accuracy. They can sometimes fabricate information, especially when given complex inputs, which can erode trust and limit their usefulness in real-world applications.
Today, we’re launching FACTS Grounding, a comprehensive benchmark for assessing how well LLMs can generate responses that are not only factually accurate with respect to the provided inputs, but also detailed enough to give satisfactory answers to user queries.
We hope this benchmark will drive industry-wide improvement in factuality and grounding. To monitor progress, we’re also introducing the FACTS leaderboard on Kaggle. We’ve tested top LLMs with FACTS Grounding and added their scores to the initial leaderboard. We will keep updating it as the field evolves.
FACTS Grounding dataset
The FACTS Grounding dataset contains 1,719 examples designed to require in-depth responses based on the provided context document. Each example includes a document, a system instruction requiring the LLM to reference only the provided document, and an accompanying user request.
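To make that structure concrete, here is a rough sketch of what a single example might look like once assembled into a prompt. The field names and layout below are purely illustrative, not the exact schema of the released dataset.

```python
# A hypothetical sketch of a single FACTS Grounding-style example; the
# field names and prompt layout here are illustrative, not the released schema.
example = {
    "system_instruction": (
        "Answer the user's request using only the information in the "
        "provided document. Do not rely on outside knowledge."
    ),
    "context_document": "Full text of the source document (up to ~32k tokens) ...",
    "user_request": "Summarize the key obligations described in section 4.",
}

def build_prompt(ex: dict) -> str:
    """Assemble the pieces into a single prompt string for a generic LLM API."""
    return (
        f"{ex['system_instruction']}\n\n"
        f"Document:\n{ex['context_document']}\n\n"
        f"Request: {ex['user_request']}"
    )

print(build_prompt(example))
```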
The examples are split into a “public” set (860 examples) and a “private” held-out set (859 examples). We’re releasing the public set today for anyone to use in evaluating LLMs. To guard against benchmark contamination and leaderboard hacking, we are withholding the private evaluation set. The FACTS leaderboard scores are the average performance across both the public and private sets.
The FACTS Grounding examples include documents of varying lengths, up to 32,000 tokens (roughly 20,000 words), covering domains such as finance, technology, retail, medicine, and law. The user requests are similarly wide-ranging, spanning summarization, Q&A generation, and rewriting tasks. We excluded examples that could require creativity, mathematics, or complex reasoning, since those capabilities go beyond grounding in the provided material.
Collective judgement by leading LLMs
To succeed on an example, an LLM must analyze the complex information in the document and create a detailed response that fully answers the user request while being attributable to that document.
FACTS Grounding evaluates model responses using three leading LLM judges: Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet. We use a mix of judges to mitigate any bias from a judge favoring responses produced by its own model family. The judge models were evaluated against a separate test set to identify the best-performing judging prompt templates and to verify agreement with human raters.
Each FACTS Grounding example is judged in two phases. First, responses are evaluated for eligibility and disqualified if they don’t sufficiently address the user’s request. Second, they are evaluated for factual accuracy and judged accurate only if they are fully grounded in the provided document, with no hallucinated information.
The eligibility and grounding accuracy of an LLM’s response are assessed separately by several AI judge models, and the results are then combined to determine whether the LLM handled the example successfully. The final grounding score is the average of all judge models’ scores across all examples. More details on our evaluation methodology can be found in our paper.
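As a rough illustration of how such a two-phase, multi-judge scheme can be aggregated, the sketch below assumes each judge returns a simple eligible/grounded verdict per response. The function names and sample data are hypothetical, not the benchmark’s actual implementation.

```python
from statistics import mean

def judge_example(eligible: bool, grounded: bool) -> float:
    """A response only counts if it both addresses the request (eligibility)
    and contains no unsupported claims (grounding)."""
    return 1.0 if (eligible and grounded) else 0.0

# verdicts[judge][example] -> (eligible, grounded); illustrative data only.
# The real benchmark obtains these verdicts from prompted LLM judges.
verdicts = {
    "judge_a": [(True, True), (True, False), (False, True)],
    "judge_b": [(True, True), (True, True), (False, True)],
    "judge_c": [(True, True), (True, False), (True, True)],
}

# Each judge's score is its mean over all examples; the final grounding
# score averages the per-judge scores, mirroring the description above.
per_judge = {
    name: mean(judge_example(e, g) for e, g in scores)
    for name, scores in verdicts.items()
}
final_score = mean(per_judge.values())
print(per_judge, round(final_score, 3))
```

In this sketch, a response earns credit only if it passes both phases, and the leaderboard-style number is simply the mean over judges and examples.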
FACTS Grounding will continue to evolve
We know that benchmarks can quickly become outdated, so this launch of FACTS Grounding and its leaderboard is just the start. Factuality and grounding are crucial for the future success of LLMs and AI systems, and we plan to expand and refine FACTS Grounding as the field advances.
We invite the AI community to engage with FACTS Grounding, to evaluate their models on the public set of examples, or to submit models for evaluation. We believe that comprehensive benchmarking, together with continued research and development, will keep improving AI systems.
Acknowledgements
FACTS Grounding was led by: Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Dipanjan Das, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, and Nate Keating.
We are also very grateful for contributions from: Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, and Sasha Goldshtein.
We would also like to thank Avinatan Hassidim, D. Sculley, Fernando Pereira, Koray Kavukcuoglu, Slav Petrov, Ya Xu, and Yossi Matias for their continued support.