Eltaller Digital

FACTS Grounding: Establishing a Novel Benchmark for Assessing Factuality in Large Language Models

December 18, 2024


Responsibility & Safety

Published 17 December 2024

Authors: FACTS team

Our comprehensive benchmark and online leaderboard offer a much-needed measure of how accurately LLMs ground their responses in provided source material and avoid hallucinations.

Large language models (LLMs) are changing the way we access information, but they still struggle with factual accuracy. They can sometimes make up false information, especially with complex inputs, which can undermine trust and limit their usefulness in real-world situations.

Today, we’re launching FACTS Grounding, a detailed benchmark to assess how well LLMs can create responses that are not only accurate based on provided inputs but also detailed enough to satisfy user questions.

We hope this benchmark will drive industry-wide improvement in factuality and grounding. To monitor progress, we’re also introducing the FACTS leaderboard on Kaggle. We’ve tested top LLMs with FACTS Grounding and added their scores to the initial leaderboard. We will keep updating it as the field evolves.

Current leaderboard ranking

FACTS Grounding dataset

The FACTS Grounding dataset contains 1,719 examples designed to require in-depth responses based on the provided context document. Each example includes a document, a system instruction for the LLM to refer only to the given document, and a user request.
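The three parts of an example can be pictured as a small record that is flattened into one prompt. The following is a minimal sketch only: the class name, field names, prompt layout, and sample text are assumptions made for illustration, not the dataset's actual schema or the prompt format used by the benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundingExample:
    """One benchmark item. Field names are hypothetical, not the dataset's schema."""
    system_instruction: str  # tells the model to rely only on the document
    context_document: str    # the source material the answer must be grounded in
    user_request: str        # what the user wants (summary, Q&A, rewrite, ...)


def build_prompt(example: GroundingExample) -> str:
    """Join the three parts into a single prompt string (assumed layout)."""
    return "\n\n".join([
        example.system_instruction,
        "Document:\n" + example.context_document,
        "Request:\n" + example.user_request,
    ])


# Invented sample content, used only to show the shape of an example.
example = GroundingExample(
    system_instruction="Answer using only the information in the document below.",
    context_document="The warranty covers parts for 24 months and labor for 12 months.",
    user_request="Summarize the warranty terms.",
)
prompt = build_prompt(example)
print(prompt)
```

Keeping the instruction, document, and request as separate fields makes it easy to swap in a different prompt layout for each model under test.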

An example from the FACTS Grounding dataset

The examples are split into a “public” set (860) and a held-out “private” set (859). We’re releasing the public set today so that anyone can use it to evaluate an LLM. To guard against benchmark contamination and leaderboard hacking, we are withholding the private evaluation set. FACTS leaderboard scores are the average performance across both the public and private sets.
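As a rough illustration of how a single leaderboard number could be formed from the two splits, the sketch below weights each split's score by its number of examples. The text only says the score is an average across both sets, so the weighting scheme is an assumption, as are the function name and the sample scores. With 860 and 859 examples, a weighted average and a simple mean of the two split scores are nearly identical.

```python
# Split sizes as stated in the article.
N_PUBLIC = 860
N_PRIVATE = 859


def combined_score(public_score: float, private_score: float) -> float:
    """Example-weighted average of the two split scores (assumed weighting)."""
    total = N_PUBLIC + N_PRIVATE  # 1,719 examples overall
    return (public_score * N_PUBLIC + private_score * N_PRIVATE) / total


# Hypothetical split scores, chosen only to exercise the function.
score = combined_score(0.80, 0.82)
print(round(score, 4))
```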

The FACTS Grounding examples include documents of various lengths, up to 32,000 tokens (about 20,000 words), covering areas like finance, technology, retail, medicine, and law. User requests vary widely, from summarization to Q&A generation and rewriting tasks. We excluded examples that could require creativity, mathematics, or complex reasoning, since those capabilities go beyond grounding.

Prompt distribution

Collective judgement by leading LLMs

To succeed on an example, an LLM must analyze the complex information in the document and create a detailed response that fully answers the user request while being attributable to that document.

FACTS Grounding evaluates model responses using three top LLM judges: Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet. We use different judges to avoid bias from a judge favoring responses from its own model family. The judge models were tested against a separate test set to find the best judging templates and ensure they align with human raters.

Each FACTS Grounding example is judged in two steps. First, responses are checked for eligibility and disqualified if they don’t sufficiently address the user’s request. Second, responses are judged factually accurate only if they are fully grounded in the provided document, with no hallucinated information.

The eligibility and grounding accuracy of an LLM’s response are assessed separately by several AI judge models, and the results are combined to determine whether the LLM handled the example successfully. The final score for grounding is the average of all judge models’ scores across all examples. More details on our evaluation method can be found in our paper.
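The two-step judgement and the averaging over judges can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure from the paper: it assumes each judge passes a response only if that judge finds it both eligible and grounded, that a judge's score is the fraction of examples it passes, and that the final score is the mean of the per-judge scores. The `Verdict` structure, the judge labels, and the sample verdicts are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Verdict:
    """One judge's verdict on one response (hypothetical structure)."""
    judge: str
    eligible: bool  # step 1: does the response address the user's request?
    grounded: bool  # step 2: is the response fully supported by the document?


def judge_score(verdicts: list[Verdict]) -> float:
    """Fraction of examples that a single judge passes on both checks."""
    return mean(1.0 if v.eligible and v.grounded else 0.0 for v in verdicts)


def final_score(verdicts_by_example: list[list[Verdict]]) -> float:
    """Group verdicts by judge, score each judge, then average across judges."""
    by_judge: dict[str, list[Verdict]] = {}
    for verdicts in verdicts_by_example:
        for v in verdicts:
            by_judge.setdefault(v.judge, []).append(v)
    return mean(judge_score(vs) for vs in by_judge.values())


# Invented verdicts for two examples and three placeholder judges.
verdicts_by_example = [
    [Verdict("judge_a", True, True),
     Verdict("judge_b", True, True),
     Verdict("judge_c", True, True)],
    [Verdict("judge_a", True, False),   # addresses the request but is not grounded
     Verdict("judge_b", False, True),   # grounded but does not address the request
     Verdict("judge_c", True, True)],
]
score = final_score(verdicts_by_example)
print(round(score, 3))
```

In the second example, two of the three judges fail the response for different reasons, which shows that a response has to clear both checks to count for a given judge.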

A factually correct response that fails to properly address the user’s request fails the benchmarking example. Here we see three instances of model responses that the automated LLM judges considered ineligible.

FACTS Grounding will continue to evolve

We know that benchmarks can quickly become outdated, so this launch of FACTS Grounding and its leaderboard is just the start. Factuality and grounding are crucial for the future success of LLMs and AI systems, and we plan to expand and refine FACTS Grounding as the field advances.

We invite the AI community to engage with FACTS Grounding, evaluate their models on the open examples, or submit models for evaluation. We believe that thorough benchmarking, along with ongoing research and development, will continue to enhance AI systems.

Acknowledgements

FACTS Grounding was led by: Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Dipanjan Das, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, and Nate Keating.

We are also very grateful for contributions from: Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, and Sasha Goldshtein.

We would also like to thank Avinatan Hassidim, D. Sculley, Fernando Pereira, Koray Kavukcuoglu, Slav Petrov, Ya Xu, and Yossi Matias for their continued support.



Copyright © 2024 Eltaller Digital.
Eltaller Digital is not responsible for the content of external sites.
