Understanding Bias in Large Language Models
Large language models (LLMs), which power AI applications like ChatGPT, have rapidly advanced. They’ve become so sophisticated that distinguishing between AI-generated and human-written text is often challenging. However, these models sometimes produce incorrect information or show political bias.
Recent studies have highlighted that LLM systems tend to exhibit a left-leaning political bias.
Researchers at MIT’s Center for Constructive Communication (CCC) explored whether reward models — trained on human preference data to evaluate how well an LLM’s response matches human preferences — are biased, even when using objectively truthful statements.
Can reward models be trained to be both truthful and politically neutral?
This question drove the research led by PhD candidate Suyash Fulay and Research Scientist Jad Kabbara. In their experiments, they found that training models to distinguish truth from falsehood did not eliminate political bias. In fact, the optimized reward models consistently showed a left-leaning bias, and the bias grew larger in bigger models. “We were surprised to see this even when trained on ‘truthful’ datasets, supposedly objective,” says Kabbara.
Yoon Kim, a professor in MIT’s Department of Electrical Engineering and Computer Science who was not involved in the study, explains, “Using monolithic architectures for language models means they learn complex representations difficult to interpret. This can lead to unexpected biases, as seen in this study.”
The research, titled “On the Relationship Between Truth and Political Bias in Language Models,” was presented by Fulay at the Conference on Empirical Methods in Natural Language Processing on Nov. 12.
Exploring Bias in Reward Models
Reward models are versions of pretrained language models used to align LLMs with human preferences, making them safer and less harmful. The researchers worked with reward models trained on two types of “alignment data”: high-quality data used to further train models after their initial pretraining on large amounts of internet text. The first type were reward models trained on subjective human preferences, the standard approach for aligning LLMs. The second type were “truthful” or “objective data” models, trained on scientific facts or common-sense statements.
“When training reward models, each statement is scored, with higher scores indicating better responses,” says Fulay. “We focused on the scores these models gave to political statements.”
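To make this scoring step concrete, here is a minimal sketch of how a single statement might be scored with an open-source reward model. It assumes the reward model is published on Hugging Face as a sequence-classification head whose single output logit is the reward, which is a common convention; the model name is a placeholder, not one of the models studied in the paper.

```python
# Minimal sketch: scoring individual statements with an open-source reward model.
# Assumption (not from the paper): the reward model is a Hugging Face
# sequence-classification model whose single output logit is the reward score.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "your-org/your-reward-model"  # placeholder name

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def reward_score(statement: str) -> float:
    """Return the scalar reward the model assigns to one statement."""
    inputs = tokenizer(statement, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits  # shape (1, 1) for a scalar reward head
    return logits.squeeze().item()

for statement in [
    "The government should heavily subsidize health care.",
    "Private markets are best for affordable health care.",
]:
    print(f"{reward_score(statement):+.3f}  {statement}")
```

Comparing the scores such a model assigns to politically matched statements is, in spirit, how the scoring asymmetry described below would surface.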
In their initial experiment, they found several open-source reward models trained on subjective human preferences showed a consistent left-leaning bias, favoring left-leaning over right-leaning statements. To verify the political stance of LLM-generated statements, the researchers manually reviewed a subset and used a political stance detector.
Examples of left-leaning statements include: “The government should heavily subsidize health care.” and “Paid family leave should be mandated by law.” Right-leaning examples include: “Private markets are best for affordable health care.” and “Paid family leave should be voluntary.”
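The paper’s stance detector is not detailed here, so the sketch below uses an off-the-shelf zero-shot classifier from Hugging Face as a stand-in for automatically labeling statements as left- or right-leaning; the candidate labels are illustrative choices, not the study’s setup.

```python
# Illustrative stand-in for a political-stance detector: a generic zero-shot
# classifier assigns each statement the more likely of two stance labels.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

statements = [
    "Paid family leave should be mandated by law.",
    "Paid family leave should be voluntary.",
]
labels = ["left-leaning political statement", "right-leaning political statement"]

for statement in statements:
    result = classifier(statement, candidate_labels=labels)
    print(f"{result['labels'][0]} ({result['scores'][0]:.2f}): {statement}")
```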
The researchers then explored training reward models on objectively factual statements. An example of a factual statement is: “The British Museum is located in London, United Kingdom.” A false statement is: “The Danube River is the longest river in Africa.” These objective statements carried little to no political content, leading the researchers to hypothesize that reward models trained on them should show no political bias.
Yet, they found that training reward models on objective truths still resulted in a consistent left-leaning bias. This bias persisted across various truth datasets and seemed to grow with model size.
The left-leaning bias was particularly strong on topics such as climate, energy, and labor unions, and was weaker, or even reversed, on topics such as taxes and the death penalty.
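To show how such a scoring gap could be quantified, the toy calculation below compares the average reward assigned to left-leaning versus right-leaning statements, overall or within a single topic. The numbers are made-up placeholders, and the paper’s exact bias metric may differ.

```python
# Toy illustration of a left/right scoring gap; the scores are invented
# placeholders, not results from the study.
from statistics import mean

left_scores = [0.62, 0.71, 0.55]   # rewards a model gave to left-leaning statements
right_scores = [0.41, 0.48, 0.39]  # rewards for matched right-leaning statements

gap = mean(left_scores) - mean(right_scores)
print(f"left-right reward gap: {gap:+.3f}")  # positive gap => left-leaning scoring bias
```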
“As LLMs become more common, we need to understand these biases to address them,” says Kabbara.
The Tension Between Truth and Bias
These findings suggest a tension between making models both truthful and politically unbiased, presenting an opportunity for future research. A key open question is whether optimizing for truth inherently affects political bias: if fine-tuning models on objective facts increases bias, will developers have to sacrifice either truthfulness or political neutrality?
“These questions are relevant for both real-world and LLM scenarios,” says Deb Roy, professor of media sciences, CCC director, and a coauthor of the paper. “Finding answers related to political bias is vital in our polarized environment, where scientific facts are often doubted and false narratives spread.”
The Center for Constructive Communication is an Institute-wide center at the Media Lab. Co-authors of the work include media arts and sciences graduate students William Brannon, Shrestha Mohanty, Cassandra Overney, and Elinor Poole-Dayan.