Handling Typos to Increase Speed to Success
In the last newsletter, I argued that "speed to user success" should be our top factor when we prioritize search work. Now, let's look at how handling typos can contribute to that goal.
For the purpose of this post, let's define a typo as a searcher misspelling a word (e.g., saerch or tiepo) and a misunderstanding as a searcher using a word that differs from the text on an item (e.g., looking rather than searching).
Handling Typos
The good news is that handling typos is pretty easy, and it should be baked into any search technology that you're using. Most engines use an edit distance calculation such as the Levenshtein distance. This is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other. The distance between the two strings beer and beet would be one, while windy and widny would be two. (There's another calculation called Damerau-Levenshtein that counts a transposition of two adjacent characters as just one change, so this latter distance would be just one.)
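To make those numbers concrete, here is a minimal sketch of both calculations in plain Python. The function names are my own, not from any particular search engine, and the second function is the restricted "optimal string alignment" variant of Damerau-Levenshtein, which is enough to show how an adjacent transposition gets counted as one change.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def osa_distance(a: str, b: str) -> int:
    """Like Levenshtein, but an adjacent transposition costs one edit.

    This is the optimal string alignment variant, a restricted form of
    Damerau-Levenshtein.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,
                d[i][j - 1] + 1,
                d[i - 1][j - 1] + cost,
            )
            # Two adjacent characters swapped: count it as a single edit.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(levenshtein("beer", "beet"))     # 1
print(levenshtein("windy", "widny"))   # 2
print(osa_distance("windy", "widny"))  # 1
```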
There are different ways that search engines use that edit distance. Generally, they cap the distance allowed before dropping a document from the result set, and they use the distance to sort the results, usually weighting it heavily over other ranking signals.
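As a rough illustration of that filter-then-sort behavior, here is a small sketch. Everything in it is an assumption made for the example: the search function, the max_distance parameter, the popularity field, and the three sample documents. Real engines implement this very differently under the hood, but the shape of the logic is similar.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def search(query: str, docs: list[dict], max_distance: int = 1) -> list[str]:
    """Keep docs with a title word within max_distance of the query word,
    then sort by distance first and popularity second."""
    scored = []
    for doc in docs:
        words = doc["title"].lower().split()
        if not words:
            continue
        distance = min(levenshtein(query.lower(), word) for word in words)
        if distance <= max_distance:  # the cutoff: too many typos, no match
            scored.append((distance, -doc["popularity"], doc["title"]))
    # Tuples sort by distance, then by descending popularity.
    return [title for _, _, title in sorted(scored)]


# Hypothetical catalog, made up for this example.
docs = [
    {"title": "German Beer Sampler", "popularity": 10},
    {"title": "Pickled Beet Salad", "popularity": 50},
    {"title": "Wine Glass", "popularity": 99},
]

print(search("beer", docs))
# ['German Beer Sampler', 'Pickled Beet Salad']
```

Note that the more popular beet salad still ranks below the exact match, because distance is compared before anything else. That is the "heavy emphasis" in action.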
Handling typos and misspellings helps to improve the speed to user success for one pretty obvious reason: it doesn't require users to be perfectly precise when typing. Searchers can hit the wrong keys or be terrible spellers, and, within reason, they'll find what they're looking for.
Some Issues That Arise
Not all typos are typos, though. Put another way, just because the edit distance falls under the "relevant" threshold doesn't mean that a document is a decent match.
Take the first example: beer and beet. This would be an easy typo to make on a QWERTY keyboard, as the keys are right next to each other, but both strings represent real words. However, you're going to get very different results when searching for german beer versus german beet, and pure typo tolerance won't necessarily help you, as both are "valid" searches from a purely textual point of view.
A good fix for this is the use of contextual typo tolerance. You see this on search engines like Google, which automatically changes the query and alerts the user. Great user experience! You almost never see it on e-commerce sites though. I think there are two reasons. The first is that the product catalogs are usually small enough that you don't have these "collisions." The second is that there's usually not enough data to train a model to recognize when one term should win out over another.
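To show what "contextual" could mean in the simplest possible terms, here is a toy sketch that picks between two valid words based on the word typed next to them. The counts are invented for the example, and the suggest function and min_ratio threshold are my own assumptions. Real systems rely on far more data and far more sophisticated models, which is exactly the second problem above.

```python
from collections import Counter

# Hypothetical counts of (context word, term) pairs, e.g. from query logs.
PAIR_COUNTS = Counter({
    ("german", "beer"): 950,
    ("german", "beet"): 4,
    ("pickled", "beet"): 300,
    ("pickled", "beer"): 1,
})


def suggest(context: str, term: str, candidates: list[str],
            min_ratio: float = 10.0) -> str:
    """Return a replacement for term only when another candidate is
    overwhelmingly more common next to the same context word."""
    typed_count = PAIR_COUNTS[(context, term)]
    best = max(candidates, key=lambda c: PAIR_COUNTS[(context, c)])
    best_count = PAIR_COUNTS[(context, best)]
    # Require strong evidence before overriding what the user typed.
    if best != term and best_count >= min_ratio * max(typed_count, 1):
        return best
    return term


print(suggest("german", "beet", ["beer", "beet"]))   # beer
print(suggest("pickled", "beet", ["beer", "beet"]))  # beet
print(suggest("dutch", "beet", ["beer", "beet"]))    # beet (no data, no change)
```

The last call is the small-catalog case: with no evidence either way, the sketch leaves the query alone rather than guessing.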
Typo Squatting
Another issue that arises when typo handling leans heavily on edit distance is typo squatting on marketplaces or sites that search through user-generated content.
Typo squatting is when a bad actor intentionally peppers a document's text with typos or alternatives. For example, a listing might use Tailor Swift to catch searches meant for Taylor Swift.
This problem can be hard to solve because most solutions introduce new problems, and so the right approach is usually domain-specific. One way to tackle this, especially for domains like entertainment, is to provide a very heavy boost to verified users. This works well on head queries (e.g., Taylor Swift), but can serve to disadvantage up-and-coming users on long tail queries. Something like learning to rank can help here, but there are no easy answers.
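Here is a small sketch of the verified boost idea, assuming a made-up scoring formula (one divided by one plus the edit distance) and a made-up verified_boost multiplier. The listings and field names are hypothetical too. The point is only to show how a boost can outweigh the squatter's exact match on the misspelled query.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def rank(query: str, listings: list[dict], max_distance: int = 2,
         verified_boost: float = 5.0) -> list[str]:
    """Score listings by closeness to the query, multiplied by a boost
    for verified accounts. The formula is illustrative only."""
    scored = []
    for listing in listings:
        distance = levenshtein(query.lower(), listing["name"].lower())
        if distance > max_distance:
            continue
        score = 1.0 / (1 + distance)  # an exact match scores 1.0
        if listing["verified"]:
            score *= verified_boost
        scored.append((score, listing["name"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]


# Hypothetical listings: a squatter and the verified artist.
listings = [
    {"name": "Tailor Swift", "verified": False},
    {"name": "Taylor Swift", "verified": True},
]

# The searcher makes the typo the squatter was hoping for.
print(rank("tailor swift", listings))
# ['Taylor Swift', 'Tailor Swift']

# With the boost switched off, the squatter's exact match wins.
print(rank("tailor swift", listings, verified_boost=1.0))
# ['Tailor Swift', 'Taylor Swift']
```

The same multiplier is what hurts unverified newcomers on long tail queries: an honest exact match with no badge can lose to a verified near match.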
All in all, handling typos smartly within search will help searchers get to success by not forcing them to reformulate a query, to know the correct spelling, or to wade through garbage. Typos are just one type of "misunderstanding," though, and next time we'll look at others.