Book Review: Building Machine Learning Powered Applications
Building Machine Learning Powered Applications is a book written by Emmanuel Ameisen and published in 2020. It examines how to plan, build, and deploy a model. What is there to learn from the book, and is its content still relevant in an LLM world?
The biggest takeaway of the book, and perhaps what differentiates it from other tech-oriented ML books, is its research and planning section. Ask: is this really an ML problem? Do you have the data necessary?
Does it really need to be ML?
Machine learning is a hot field and brands love to tout it, but not everything has to be ML. Indeed, it's often better if you can build a product without ML. ML is difficult to explain; when things go wrong, rule-based systems are a lot easier to understand and, thus, to fix. ML is expensive; plain code is relatively cheap. ML requires specialized expertise; code needs coders, which is a specialized skill too, but not nearly to the extent of ML engineering.
The need to question whether ML is necessary has become even more important with the rise of LLMs. There is a feeling among many that customers need AI in the product to buy it, so let's take the engineer who has watched a few YouTube videos on prompt engineering and add something to our offering. It's all nonsense, of course. Customers need products that solve their problems at a price they're willing to pay, whether that problem is solved by AGI or a dude in the back reading data from a spreadsheet. In a way, we're pretty lucky that building for VR isn't that easy, or else we'd have a bunch of rudderless companies spinning up a spatial experiences team rather than fixing all of their dashboard bugs.
So, how do you determine whether you need and can use ML? Ameisen says there are two steps: first, frame the product goal in an ML paradigm (supervised or unsupervised); second, evaluate feasibility by asking what data is needed and whether current models can handle the task. For the ML paradigm, while supervised is generally easier (because you have the labels), you don't always have the data even for that. Getting labeled data can be tricky! When you're working in search, for example, you often won't have labeled result sets that precisely meet your needs.
Also, product goals don't always fit neatly into "supervised" or "unsupervised." Ameisen mentions fraud detection, which can either identify unusual transactions (unsupervised) or identify known fraudulent transactions (supervised).
The rise of highly capable LLMs changed this calculus quite a bit, though. While purpose-built models will generally be more capable, LLMs require a lot less data. That's not to say no data at all, but you can get your product spun up a lot quicker, whether to start collecting data or, before that, to prototype. Speaking of prototyping...
The first time is always handmade
When you decide that your product needs ML, really you should be deciding that you have a hunch that your product needs ML. Again, products without ML are cheaper and easier to maintain than those that use it. That's where prototyping and exploratory data analysis come in. Hopefully, you already know what prototyping is, but you might not know exploratory data analysis (EDA). EDA is what it sounds like: looking at the data you have and seeing what you can learn from it. You'll "rarely find [the exact data necessary]," and so EDA is all about forming hypotheses about which of your data will work for your needs.
A good way to do this is to start with a handmade solution. Ameisen intersperses the book with interviews, and in his interview with Monica Rogati (who previously led data and ML work at LinkedIn and Jawbone), she touches on this directly:
The first line of defense is looking at the data yourself. Let’s say we want to build a model to recommend groups to LinkedIn users. A naive way would be to recommend the most popular group containing their company’s name in the group title. After looking at a few examples, we found out one of the popular groups for the company Oracle was “Oracle sucks!” which would be a terrible group to recommend to Oracle employees.
It is always valuable to spend the manual effort to look at inputs and outputs of your model. Scroll past a bunch of examples to see if anything looks weird. The head of my department at IBM had this mantra of doing something manually for an hour before putting in any work...
At Jawbone, for example, people entered “phrases” to log the content of their meals. By the time we labeled the top 100 by hand, we had covered 80% of phrases and had strong ideas of what the main problems we would have to handle, such as varieties of text encoding and languages.
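That last check is easy to reproduce yourself. Here's a rough sketch of the kind of coverage check Rogati describes, assuming the meal logs sit in a pandas DataFrame with a phrase column (the sample data and column name are invented):

```python
import pandas as pd

# Hypothetical meal-log data; in practice you'd load your own export.
logs = pd.DataFrame(
    {"phrase": ["coffee", "Coffee ", "2 eggs", "café au lait", "coffee", "salad"]}
)

# Normalize lightly so near-duplicates collapse together.
phrases = logs["phrase"].str.strip().str.lower()

# How much of the data would hand-labeling the top N distinct phrases cover?
counts = phrases.value_counts()
top_n = 100
coverage = counts.head(top_n).sum() / counts.sum()
print(f"Top {top_n} phrases cover {coverage:.0%} of all entries")

# Eyeball the head of the distribution before writing any model code.
print(counts.head(20))
```

An hour spent scrolling through that output surfaces the encoding quirks, languages, and odd entries long before a model would.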
And the second is as simple as possible
After the handmade work, Rogati recommends starting with the simplest possible model, in order to get a baseline:
The goal of our plan should be to derisk our model somehow. The best way to do this is to start with a “strawman baseline” to evaluate worst-case performance. For our earlier example, this could be simply suggesting whichever action the user previously took.
If we did this, how often would our prediction be correct, and how annoying would our model be to the user if we were wrong? Assuming that our model was not much better than this baseline, would our product still be valuable?
This is what Jeremy Howard recommends, as well, in Deep Learning for Coders:
[A baseline is a] simple model that you are confident should perform reasonably well. It should be simple to implement and easy to test, so that you can then test each of your improved ideas and make sure they are always better than your baseline. Without starting with a sensible baseline, it is difficult to know whether your super-fancy models are any good. One good approach to creating a baseline is doing what we have done here: think of a simple, easy-to-implement model. Another good approach is to search around to find other people who have solved problems similar to yours, and download and run their code on your dataset. Ideally, try both of these!
Without these baselines, your team might spend months on a sophisticated solution when the handmade one would have been enough for your customers.
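To make the strawman concrete, here's a minimal sketch of Rogati's "suggest whichever action the user previously took" baseline, assuming an event log ordered by time; the column names and data are invented:

```python
import pandas as pd

# Hypothetical event log: one row per user action, already in time order.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 3, 3],
    "action":  ["a", "b", "b", "c", "c", "a", "a", "b"],
})

# Strawman baseline: predict that each user simply repeats their previous action.
events["predicted"] = events.groupby("user_id")["action"].shift(1)

# Score only the rows where a previous action exists.
scored = events.dropna(subset=["predicted"])
baseline_accuracy = (scored["predicted"] == scored["action"]).mean()
print(f"Baseline accuracy: {baseline_accuracy:.0%}")
```

Any model you build afterward has to clearly beat that number to justify its cost.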
Plan and measure
Ameisen's section on planning should be familiar to anyone with experience in building products. Focus on the business goals and let your metrics flow from there. One potential difference compared to building, say, a new dashboard experience is that when building with ML, you will be working against metrics before you ever put anything in front of customers.
This is because the model needs a metric to optimize against, and that metric needs to correlate closely with your primary business metric. If you want to optimize for overall revenue, that should be the target metric for the model. If you don't have revenue, conversion rate might work. But be careful: a higher conversion rate doesn't necessarily mean higher revenue. The model might instead end up pushing a lot of low-cost items to get a higher CVR. That's why guardrail metrics are important. These are metrics that will stop a model from going out, even if the model is doing well on the target metric. Other examples of guardrail metrics might be latency or compute cost.
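As a rough illustration, a guardrail check can be as simple as a release gate that refuses to ship a candidate model unless it beats the current one on the target metric while staying within bounds everywhere else. The metric names, thresholds, and function below are invented for the sketch:

```python
# Illustrative release gate: the candidate must improve the target metric
# without violating any guardrail. All numbers are made up.
def should_ship(candidate: dict, baseline: dict, guardrails: dict) -> bool:
    if candidate["revenue_per_session"] <= baseline["revenue_per_session"]:
        return False  # target metric did not improve
    for metric, (limit, higher_is_worse) in guardrails.items():
        value = candidate[metric]
        if higher_is_worse and value > limit:
            return False  # e.g. latency regressed past the allowed ceiling
        if not higher_is_worse and value < limit:
            return False  # e.g. conversion rate dropped below the floor
    return True

candidate = {"revenue_per_session": 1.42, "p95_latency_ms": 180, "conversion_rate": 0.031}
baseline = {"revenue_per_session": 1.37}
guardrails = {"p95_latency_ms": (200, True), "conversion_rate": (0.030, False)}

print(should_ship(candidate, baseline, guardrails))  # True only if every check passes
```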
When comparing your model's performance against your target metric, Ameisen cautions you to keep your expectations reasonable: "The more pleasantly surprised you are by the performance of your model on your test data, the more likely you are to have an error in your pipeline."
Plan for failure
Alongside reasonable expectations, think about how you can build your product to handle the cases in which the model fails to make the correct prediction. Jeremy Howard makes a similar point: if a computer can do 90% of the work and leave the most difficult 10% to humans, that's a really good model! People are discovering this anew with LLMs; one oft-recommended approach is to prompt the LLM to return its answer along with a confidence score. The same idea applies here: if your model returns a low score, you might not show the prediction at all, or you might present it in a different way.
Take search results. There's already a bit of a built-in safeguard, because results with a higher similarity score will (generally) rank higher than results with a lower one. But when you introduce sorting on an attribute (like price, high to low), or when there are few results to show at all, ranking by similarity score no longer protects you. Instead, you raise the minimum score a result needs in order to be displayed, because the cost of showing a low-score result is now higher.
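Here's a rough sketch of that kind of score filter, with made-up thresholds that in practice you'd tune against labeled result sets:

```python
# Illustrative filter: raise the minimum similarity score when the ranking no
# longer comes from the score itself (e.g. sorting by price). Thresholds invented.
MIN_SCORE_BY_SORT = {
    "relevance": 0.20,   # the score already drives the ranking, so be lenient
    "price_asc": 0.55,   # a junk result could now land at the very top
    "price_desc": 0.55,
}

def filter_results(results, sort_mode="relevance"):
    threshold = MIN_SCORE_BY_SORT.get(sort_mode, 0.55)
    return [r for r in results if r["score"] >= threshold]

results = [
    {"title": "red running shoes", "score": 0.91, "price": 80},
    {"title": "shoe rack",         "score": 0.32, "price": 25},
]
print(filter_results(results, sort_mode="price_asc"))  # drops the low-score match
```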
Ameisen also covers a number of other important checks, such as making sure that the data going into the model is within a valid range (no one is 125 years old) or adding a filtering model that identifies inputs on which the main model is likely to fail. The most important consideration, though, is still to identify the cost of a failure and design your way around it.
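A sanity check like the age example can be a few lines in front of the model; the field names and ranges below are placeholders:

```python
# Illustrative validation of model inputs before inference; fields and bounds are made up.
def validate_features(features: dict) -> list[str]:
    problems = []
    if not 0 <= features.get("age", -1) <= 120:
        problems.append("age outside plausible range")
    if features.get("order_total", 0) < 0:
        problems.append("negative order total")
    return problems

features = {"age": 125, "order_total": 42.50}
issues = validate_features(features)
if issues:
    # Fall back to a rule-based answer (or no answer) rather than trusting the model.
    print("Skipping prediction:", issues)
```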
In an LLM world, Building Machine Learning Powered Applications is due for a refresh, but it's still a useful book. As always, the fundamentals of product building still hold true: build something people want, don't build more than you have to, and account for what can go wrong.