LLMs Are Great… If They Can Handle Your Data
Originally published at https://blog.developer.bazaarvoice.com on October 28, 2024.
Large language models (LLMs) are powerful tools for handling unstructured text. However, they face a challenge when the text exceeds their context window. Bazaarvoice encountered this issue while developing its AI Review Summaries feature. With millions of user reviews, fitting them all into the context window of even the latest LLMs is impractical, and doing so would be prohibitively expensive.
In this post, I’ll explain how Bazaarvoice addressed this problem by compressing input text without losing meaning. We implemented a multi-pass hierarchical clustering approach that allows us to adjust the level of detail for compression, regardless of the chosen embedding model. This technique made our Review Summaries feature financially feasible and prepared us to scale our business in the future.
Bazaarvoice has been collecting user-generated product reviews for nearly 20 years, resulting in a large volume of unstructured data. These reviews vary in length and content. LLMs are excellent for processing unstructured text, as they can identify relevant information among distractions.
However, LLMs have limitations, such as the context window, which determines how many tokens (approximately the number of words) can be processed at once. State-of-the-art models like Anthropic’s Claude version 3 have large context windows of up to 200,000 tokens, enough to fit small novels. Yet, the internet is vast, and our user-generated reviews are no exception.
We faced the context window limit while building our Review Summaries feature, which summarizes all reviews for a specific product on a client’s website. Over time, many products accumulated thousands of reviews, quickly exceeding the LLM context window. Some products even have millions of reviews, requiring significant re-engineering of LLMs to process in one prompt.
Even if technically feasible, the costs would be prohibitive. LLM providers charge based on the number of input and output tokens, and approaching context window limits for millions of products can lead to cloud hosting bills exceeding six figures.
To overcome these technical and financial limitations, we focused on a simple insight: many reviews convey the same message. Review summaries capture recurring insights, themes, and sentiments. By leveraging data duplication, we reduced the amount of text sent to the LLM, preventing context window limits and lowering operating costs.
To achieve this, we needed to identify text segments conveying the same message. This task is challenging because people often use different words or phrases to express the same idea.
Fortunately, identifying semantically similar text has been an active research area in natural language processing. Agirre et al.’s 2013 study provided human-labeled pairs of semantically similar sentences, known as the STS Benchmark, in which annotators rate the semantic similarity of each sentence pair on a scale of 0–5.
The STS Benchmark is used to evaluate how well a text embedding model associates semantically similar sentences in its high-dimensional space. We use Pearson’s correlation to measure how well the embedding model represents human judgments.
Thus, we use an embedding model to identify semantically similar phrases from product reviews, removing repeated phrases before sending them to the LLM.
Our approach is as follows (a short code sketch appears after the list):
- Segment product reviews into sentences.
- Compute an embedding vector for each sentence using a network that performs well on the STS benchmark.
- Use agglomerative clustering on all embedding vectors for each product.
- Retain an example sentence — the one closest to the cluster centroid — from each cluster to send to the LLM, discarding other sentences in the cluster.
- Consider small clusters as outliers and randomly sample them for inclusion in the LLM.
- Include the number of sentences each cluster represents in the LLM prompt to ensure the weight of each sentiment is considered.
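To make this concrete, here is a minimal sketch of the clustering and representative-selection steps using scikit-learn's agglomerative clustering. The `embed_sentences` helper is a stand-in for whatever embedding model you use (for us, AWS Titan Text Embeddings), and the distance threshold is discussed below, so treat this as an illustration rather than our production code:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def compress_reviews(sentences, embed_sentences, distance_threshold, min_cluster_size=10):
    """Cluster semantically similar sentences; keep one representative per cluster."""
    embeddings = np.asarray(embed_sentences(sentences))

    # Let the distance threshold, rather than a fixed cluster count, decide
    # how many clusters a product's review sentences collapse into.
    labels = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,
        metric="cosine",   # `affinity="cosine"` on older scikit-learn versions
        linkage="average",
    ).fit_predict(embeddings)

    representatives, outliers = [], []
    for label in np.unique(labels):
        members = np.where(labels == label)[0]
        if len(members) < min_cluster_size:
            # Small clusters are treated as outliers and sampled separately.
            outliers.extend(sentences[i] for i in members)
            continue
        # Keep the sentence closest to the cluster centroid, along with the
        # cluster size so the prompt can weight each sentiment appropriately.
        centroid = embeddings[members].mean(axis=0)
        closest = members[np.argmin(np.linalg.norm(embeddings[members] - centroid, axis=1))]
        representatives.append((sentences[closest], len(members)))
    return representatives, outliers
```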
This method may seem straightforward, but there were challenges to address before we could trust it.
First, we ensured the model embedded text in a space where semantically similar sentences are close together, and dissimilar ones are far apart. We used the STS benchmark dataset and computed Pearson correlation for the models we evaluated. As AWS is our cloud provider, we assessed their Titan Text Embedding models.
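To give a flavor of the evaluation: given the STS Benchmark sentence pairs and their human scores, we compare the embedding model's cosine similarities against the human judgments using Pearson's correlation. A rough sketch, where `embed` is a placeholder for the model under test:

```python
import numpy as np
from scipy.stats import pearsonr

def sts_pearson(sentence_pairs, human_scores, embed):
    """Compare an embedding model's similarities against STS human judgments."""
    similarities = []
    for a, b in sentence_pairs:
        va, vb = np.asarray(embed(a)), np.asarray(embed(b))
        # Cosine similarity between the two sentence embeddings.
        similarities.append(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
    # Pearson correlation between model similarities and human scores.
    correlation, _ = pearsonr(similarities, human_scores)
    return correlation
```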
AWS’s embedding models performed well in embedding semantically similar sentences, which was beneficial as we could use them off the shelf at a low cost.
The next challenge was enforcing semantic similarity during clustering. Ideally, no cluster would have two sentences with semantic similarity less than what humans accept — a score of 4. However, these scores don’t directly translate to embedding distances needed for clustering thresholds.
To address this, we used the STS benchmark dataset, computed distances for all pairs in the training dataset, and fit a polynomial from scores to distance thresholds.
This polynomial helps compute the distance threshold needed to meet any semantic similarity target. For Review Summaries, we selected a score of 3.5, ensuring clusters contain sentences that are "roughly" to "mostly" equivalent or more.
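Here is a sketch of that fitting step, under the assumption that cosine distance is the clustering metric (the helper names are illustrative):

```python
import numpy as np

def fit_score_to_distance(train_pairs, train_scores, embed, degree=3):
    """Fit a polynomial mapping STS similarity scores to embedding distances."""
    distances = []
    for a, b in train_pairs:
        va, vb = np.asarray(embed(a)), np.asarray(embed(b))
        # Cosine distance, matching the metric used for clustering.
        distances.append(1.0 - va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
    return np.polynomial.Polynomial.fit(train_scores, distances, degree)

# Evaluating the fit at a target score gives the clustering threshold, e.g.:
# score_to_distance = fit_score_to_distance(train_pairs, train_scores, embed)
# threshold = score_to_distance(3.5)
```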
This procedure works with any embedding model, allowing us to experiment with different models as they become available and swap them in quickly without worrying about violating our semantic similarity target.
We knew our semantic compression was reliable, but it was unclear how much compression we could achieve. The compression varied across products, clients, and industries.
Without semantic information loss (a hard score threshold of 4), we achieved a compression ratio of only 1.18; that is, the compressed text is about 1/1.18 ≈ 85% of the original size, a space savings of roughly 15%.
Clearly, lossless compression wasn’t sufficient for financial viability.
Our distance selection method offered an interesting possibility: we could gradually increase information loss by repeatedly running clustering at lower score thresholds (and therefore looser distance thresholds) on the remaining data.
The approach is as follows (again, a sketch follows the list):
- Run clustering with a threshold selected from score = 4 (lossless).
- Select the outlying clusters (those with fewer than 10 vectors) and carry their sentences into the next pass.
- Run clustering again with a threshold selected from score = 3 (not lossless, but acceptable).
- Select clusters with fewer than 10 vectors.
- Repeat as desired, continuously decreasing the score threshold.
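A rough sketch of the multi-pass loop, reusing the `compress_reviews` and score-to-distance helpers from the earlier sketches (again, illustrative names rather than our production code):

```python
def multi_pass_compress(sentences, embed_sentences, score_to_distance,
                        score_schedule=(4.0, 3.0), min_cluster_size=10):
    """Re-cluster leftover small-cluster sentences at progressively looser thresholds."""
    representatives, remaining = [], list(sentences)
    for score in score_schedule:
        if len(remaining) < 2:
            break  # nothing left to cluster
        threshold = score_to_distance(score)
        passed, remaining = compress_reviews(
            remaining, embed_sentences, threshold, min_cluster_size=min_cluster_size
        )
        # Representatives chosen at stricter scores are kept untouched, so the
        # lossless first pass is never affected by later, lossier passes.
        representatives.extend(passed)
    return representatives, remaining  # `remaining` are outliers to sample from
```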
At each clustering pass, we accept more information loss but gain more compression, without affecting the lossless representative phrases selected in the first pass.
This approach is useful both for Review Summaries, where high semantic similarity is desired, and for other use cases where more semantic information loss is acceptable and prompt input costs are a concern.
Despite this, many clusters still had a single vector even after lowering the score threshold. These are considered outliers and randomly sampled for the final prompt, ensuring it contains 25,000 tokens or fewer.
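The sampling step itself is simple; here is an illustrative sketch in which `count_tokens` is a placeholder for a tokenizer matching the target LLM:

```python
import random

def sample_outliers(representatives, outliers, count_tokens, budget=25_000):
    """Randomly add outlier sentences until the prompt reaches the token budget."""
    used = sum(count_tokens(sentence) for sentence, _ in representatives)
    sampled = []
    for sentence in random.sample(outliers, len(outliers)):  # shuffled copy
        cost = count_tokens(sentence)
        if used + cost > budget:
            continue  # skip and try smaller remaining outliers
        sampled.append(sentence)
        used += cost
    return sampled
```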
The multi-pass clustering and random outlier sampling allow for semantic information loss in exchange for a smaller context window to send to the LLM. This raises the question: how accurate are our summaries?
At Bazaarvoice, authenticity is crucial for consumer trust, and our Review Summaries must authentically represent all voices in the reviews. Any lossy compression approach risks misrepresenting or excluding consumers who contributed reviews.
To validate our compression technique, we measured it directly. For each product, we sampled reviews and used LLM Evals to determine if the summary was representative and relevant to each review, providing a metric to evaluate our compression.
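The exact eval harness is beyond the scope of this post, but the shape of the check is an LLM-as-judge prompt roughly like the one below (the `ask_llm` call and the wording are illustrative assumptions, not our production prompts):

```python
def review_is_represented(summary, review, ask_llm):
    """Ask a judge LLM whether a sampled review is represented by the summary."""
    prompt = (
        "You are evaluating a product review summary.\n\n"
        f"Summary:\n{summary}\n\n"
        f"Review:\n{review}\n\n"
        "Does the summary fairly represent the points and sentiment of this "
        "review? Answer YES or NO."
    )
    return ask_llm(prompt).strip().upper().startswith("YES")

# The fraction of sampled reviews answered YES becomes the representativeness metric.
```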
Over 20 years, we’ve collected nearly a billion user-generated reviews and needed to generate summaries for tens of millions of products. Many products have thousands of reviews, some up to millions, which would exhaust LLM context windows and be costly.
Using our approach, we reduced input text size by 97.7% (a compression ratio of 42), allowing us to scale this solution for all products and any review volume. Additionally, the cost of generating summaries for our billion-scale dataset decreased by 82.4%, including the cost of embedding sentence data and storing them in a database.