In this issue, we go deep into the evaluation challenges of natural language and ML-aided search, catch up on the “weird world” of LLMs, hear a bit of a contrary opinion, and more.
As always, if you find anything interesting, please send it along.
Measuring the Impact and Quality of ML-Driven Search
As LLMs and ML in general have gotten more powerful, one of the obvious uses is within search. These language models can make search significantly better by expanding the results that come back for a query while still maintaining relevance. When done very well, the search experience can even change how people search. Users realize they can go beyond one- or two-keyword queries, and when they do, they start to use more natural language formulations, or they rely less on facets and instead filter down with the query itself.
Additionally, searchers start to expect that the results will change over time to get better. Most important to note is that “get better” is highly localized: the results that are shown should be better for that searcher at that time for that query. Users probably can’t vocalize this expectation, but the gap between search that meets it and search that doesn’t makes the latent expectation clear.
I’ve been thinking about this a lot recently, of course, and a new DeepLearning.ai course on search with LLMs, taught by Jay Alammar and Luis Serrano from Cohere, brought up another thought:
Searcher behavior isn’t the only thing that changes with the addition of language models into search. The way the owners of these search engines analyze and tweak their search changes, too.
Why ML Changes How We Analyze the Quality of Search
Search for many decades now has been keyword-driven, and it has generally leaned towards optimizing for precision: making sure that everything that comes back in the results is clearly relevant. Search for red and you will get red dresses, red shoes, or maybe Redding, CA, but you won’t get maroon shoes or articles on why we blush. Language models change that calculus. Matching becomes semantic rather than string-based, and recall expands to include results that are related in meaning, not just in spelling.
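As a refresher on the precision/recall tradeoff in play here, a toy calculation over that made-up red query (the item names are invented):

```python
def precision_recall(returned, relevant):
    """Precision: how much of what came back is relevant.
    Recall: how much of what is relevant came back."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    return len(hits) / len(returned), len(hits) / len(relevant)

# Keyword matching on "red": precise, but it misses the maroon shoes.
relevant = {"red-dress", "red-shoes", "maroon-shoes"}
p, r = precision_recall({"red-dress", "red-shoes"}, relevant)
print(p, r)  # 1.0 precision, ~0.67 recall
```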
How We Match Changes How We Evaluate
Because of this change in matching, evaluating search by sight becomes a lot more difficult. We can no longer look just for matching strings; we need to look for matching contexts. For example, we once had a situation where searching for Cupertino brought up iPhones and MacBooks, but also a toaster. What? Why? After some research, it turned out that the toaster was named after a city in California. Nothing fine-tuning wouldn’t fix, but an issue when you’re trying to evaluate search quality.
It isn’t just the increased recall that makes evaluation difficult when adding ML to search. The addition of popularity, seasonality, personalization, and other continually updating features also makes evaluation difficult. We can look at results and see if they are plainly relevant, but if a user has a strong preference for city-themed toasters over electronics, then the toaster is indeed the right result for that query for that user at that time.
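To make that concrete, here’s a minimal sketch of blending textual relevance with a per-user affinity signal; the field names, weights, and items are purely illustrative, not how any particular engine scores things:

```python
from dataclasses import dataclass

@dataclass
class Result:
    item_id: str
    text_relevance: float  # query/document match score, 0..1
    affinity: float        # this user's preference for the item, 0..1

def rank(results, personalization_weight=0.3):
    """Blend textual relevance with a per-user affinity signal, so the
    same query yields different orderings for different users."""
    def score(r):
        return ((1 - personalization_weight) * r.text_relevance
                + personalization_weight * r.affinity)
    return sorted(results, key=score, reverse=True)

# For a user with a strong preference for city-themed toasters,
# the toaster really is the right result:
results = [
    Result("iphone-14", text_relevance=0.9, affinity=0.10),
    Result("cupertino-toaster", text_relevance=0.7, affinity=0.95),
]
print([r.item_id for r in rank(results)])  # the toaster ranks first
```

Once a ranking depends on who is searching and when, there is no single “correct” result list left to eyeball.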
This all becomes even more difficult when searchers tend more and more towards natural language or conversational queries. Your long tail stretches longer and longer, and the impact of analyzing just the head queries diminishes.
The classical way of evaluating search has been to use a human-annotated dataset and to compare how closely your results match what humans said were the most relevant, using a measure like normalized discounted cumulative gain (nDCG). The problem is that the most popular datasets were built around keyword matching. While that is changing, it still doesn’t solve the problem of adding other features on top.
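If you’ve never computed nDCG, it’s simpler than the acronym suggests. A minimal sketch, using the standard log2 rank discount and a common 0–3 graded scale:

```python
import math

def dcg(gains):
    """Discounted cumulative gain: each result's graded relevance,
    discounted by the log2 of its (1-based) rank."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

def ndcg(gains_in_result_order, k=10):
    """Normalize DCG by the ideal DCG (the best possible ordering)."""
    actual = dcg(gains_in_result_order[:k])
    ideal = dcg(sorted(gains_in_result_order, reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# Human judgments for one query, listed in the order the engine
# returned the results (3 = perfect, 0 = irrelevant):
print(round(ndcg([3, 2, 3, 0, 1]), 3))  # ~0.97: a near-ideal ordering
```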
So, What to Do?
There are a couple of things you can do, and you probably want to do both. One is to do your own human evaluations. The other is to test in production.
When we were building Algolia Answers, we had a robust system of human evaluations that we ran anytime we changed the parameters of the system. We used a four-point scale that rated results as great, okay, poor, or terrible. (In retrospect, this was a bad idea, as I’ve come to believe you want either a boolean good-or-bad judgment or an odd-numbered scale, so that there is a true midpoint for okay.) This is fairly common: both Google and Bing do it. It’s also time-consuming.
There are ways to farm this out, Mechanical Turk being one example, but we never had much success with that. Annotators need to know what makes a good result, and that isn’t easy to teach in a blurb. Plus, much like having everyone do customer support in the early days of a startup, evaluating results yourselves is an ideal way for the team to identify great queries, and also recurring themes of why things aren’t working.
To speed up the evaluations, we used Prodigy, from the company behind spaCy. We had to create our own view for the annotation, but once we did, we could go through the evaluations significantly faster. It even supports keyboard shortcuts!
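For reference, a custom Prodigy recipe for this kind of grading doesn’t take much code. What follows is a rough sketch, not our actual recipe; the recipe name, labels, and data source are illustrative:

```python
import prodigy
from prodigy.components.loaders import JSONL

@prodigy.recipe(
    "search-eval",
    dataset=("Dataset to save annotations to", "positional", None, str),
    source=("Path to a JSONL file of query/result tasks", "positional", None, str),
)
def search_eval(dataset, source):
    """Grade (query, result) pairs on a four-point scale."""

    def add_options(stream):
        # Attach the grading choices to every task in the stream.
        for task in stream:
            task["options"] = [
                {"id": "great", "text": "Great"},
                {"id": "okay", "text": "Okay"},
                {"id": "poor", "text": "Poor"},
                {"id": "terrible", "text": "Terrible"},
            ]
            yield task

    return {
        "dataset": dataset,              # annotations are saved here
        "stream": add_options(JSONL(source)),
        "view_id": "choice",             # built-in multiple-choice interface
        "config": {"choice_auto_accept": True},  # number keys pick and submit
    }
```

You’d point Prodigy at it with something like prodigy search-eval my-dataset tasks.jsonl -F recipe.py.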
The human evaluations will give you an indication of whether you are on the right track with textual and semantic relevance, but to really see how it works, especially when layering on the other features we talked about above, you have to put it in front of end users.
Thankfully, for most of you, this will be much easier than what we have to do at Algolia, as our customers are not the end users. Nonetheless, it is vital. Start an experiment, through an A/B test or another mechanism, and get at it. If you want to learn more about how to do this, I suggest this book, Experimentation for Engineers.
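If you want a feel for the arithmetic behind such an experiment, here’s a minimal sketch of a two-proportion z-test on click-through rate. The metric choice and the numbers are made up, and in practice you’d lean on an experimentation platform rather than hand-rolling the stats:

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Test whether variant B's click-through rate differs from A's."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return z, p_value

# Control ranking vs. the ML-augmented ranking:
z, p = two_proportion_z_test(clicks_a=420, n_a=10_000,
                             clicks_b=480, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.3f}")  # significant at the usual 0.05 level
```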
What about Generated Content?
What we have been discussing so far is static content, where we have something determinative to compare against. (In other words, the order may change between searches, but we still have a stable reference, like an ID.) Generated content is, of course, much trickier, and I’m not sure anyone has solved it yet. The sheer number of options (bettertest, DeepEval, agentops, baserun.ai, PromptTools, and maybe a dozen others) is itself a sign that it isn’t a solved problem. I still think the two most important steps will persist: you don’t want to release until you’ve seen it yourself, and you don’t want to maintain it if it isn’t performing better than what you had before.
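To make the contrast concrete: with static content you can diff result IDs directly, while with generated text about the best you can do is a similarity threshold. A minimal sketch, where the embedding model and the threshold are arbitrary choices, not recommendations:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # arbitrary choice of encoder

def static_results_changed(old_ids, new_ids):
    """Static content: result IDs are a determinative reference,
    so a regression check is a straight comparison."""
    return old_ids != new_ids

def generated_answer_drifted(old_text, new_text, threshold=0.85):
    """Generated content: no stable ID to diff, so fall back to
    embedding similarity and flag anything below a threshold."""
    a, b = model.encode([old_text, new_text])
    cosine = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return cosine < threshold
```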
(By the way, I’ve also been thinking about this topic thanks to a recent conversation my team had with Daniel Tunkelang on testing personalized search. Highly recommend reaching out to him if you have search needs.)
Other News and Recent Resources
Catching Up on the Weird World of LLMs
Another great one from Simon Willison. There’s too much to summarize, and it’s probably nothing new to you if you are reading this newsletter, but it could be a good resource to share with an SWE colleague who wants to get up to date quickly. My favorite part is this description of the past few years:
One way to think about [LLMs] is that about 3 years ago, aliens landed on Earth. They handed over a USB stick and then disappeared. Since then we’ve been poking the thing they gave us with a stick, trying to figure out what it does and how it works.
What If Generative AI Turns Out to be a Dud?
It’s always useful to have contrary voices out there, even if they turn out to be wrong. They force us to think about possible failures and how to avoid them, or to ask where we are getting overly exuberant.
In this article, Gary Marcus asks what could happen if generative AI turns out to be a nice technology that doesn’t change the world. While he extrapolates too far in some areas (if war with China comes, it won’t be generative AI’s fault, I’m confident), it’s still worth reading to get a pessimistic take.
Miscellany
The Story of Ask Amplitude: another behind-the-scenes look at building an app with an LLM
Meet Marqo, an Open Source Vector Search Engine for AI Applications
New York Times Considers Legal Action Against OpenAI as Copyright Tensions Swirl