August 3, 2023

Best practices for building with LLMs, is ChatGPT getting dumber, and more

Aug 03, 2023

News has slowed down during the summer, but it hasn’t stopped, so read on for what’s new and worthwhile. There are some real gems in here, with two long reads (including one 65-minute read) with absolutely no fluff. Plus, a new forum for generative AI assistance, ChatGPT’s declining user base, and more.

As always, if you find anything interest, please send it along.

Guidelines for Building with LLMs

Here we’ve got two long reads on building with LLMs (or building LLMs themselves).

The first, Challenges and Applications of Large Language Models, goes into both the challenges and opportunities that are present. Even if you don’t plan on building an LLM yourself, knowing those challenges as well will also help when building with them, because you’ll know more about where they might fail.

For example, starting with training you’ve got a lot of challenges, such as the fact that datasets are so large that no one can quality check all of it. In the words of the paper, they are “unfathomable.” And a lot of problems are closely related. When training data is duplicated, the model starts to “memorize” it instead of generating new text, it just regurgitates. Actual dupes are easy to find, but it’s the near duplicates that cause the most trouble. Or there might be PII present. Or data in the evaluation set is also in the training set, but it’s undiscovered due to the unfathomability of the training set.

Even “small” things like tokenizers can pose their own challenges. Imagine a novel word that the tokenizer has never seen, whether it’s slang or any other kind of neologism. Because only so many tokens can be fed into the model, this could lead to information loss and difficulty when it comes to the generative tasks that an end user might want.

And, of course, there are challenges when building on top of LLMs, such as “prompt brittleness” where small changes to the prompt can lead to major changes in the response. The paper points to prompt engineering as a way to get around brittleness and specifically points to two buckets of solutions:

Single-turn
- In-context learning
- Instruction following
- Chain-of-thought
- Prompt tuning with embeddings
Multi-turn
- Self consistency (i.e., getting multiple answers and taking the most common)
- Ask-me-anything (let the LLM provide the best question to ask to get an answer)
- Least-to-most (the LLM tells the caller what info is needed to solve the final question, the caller asks the LLM for that info, and then asks the final question)
- Tree of thoughts
- Self-refinement

Next up, Patterns for Building LLM-based Systems & Products, goes even deeper into those things that you need beyond the basics. This isn’t a guide on prompt engineering, it’s a blog post that clocks in at 65 minutes to read. It identifies the patterns that are necessary to put in place for a robust LLM-based product:

Evaluations
Retrieval augmented generation
Fine tuning
Caching
Guardrails
Defensive UX
Collecting user feedback

The whole thing is worth a read and resists neat summarization, but the thing that’s noteworthy to me is how lengthy it is and how each pattern has different, difficult decisions to make.

Building a product on an LLM isn’t a small undertaking, and it’s not one that can be easily farmed out to a back-end engineer to release a product in six weeks. When we built Algolia Answers, it took many months before we were in a state where we were comfortable. This was before even today’s level of maturity was present, but we were doing many of these same tasks back then. (We didn’t see the need for fine tuning, caching, or user feedback, but probably would have gotten to each if we had continued.)

Ultimately, the issue with the six-week, one-person LLM “products” is that they won’t be as useful to users as they should be. This will lead a crashing of the hype cycle of the sort we saw with voice experiences that were built by people who six months prior were in healthcare but took some quick online JavaScript courses and started releasing Alexa skills. Users saw those voice interactions as lacking, they’ll see the products built on LLMs as lacking, and they will blame it on the tech rather than the builder.

That’s why it’s so important to have these patterns in place and to be clear-eyed about the effort it will take to build products on LLMs. If you are building or thinking about building something soon, read the above and use it to create a checklist. You’ll be far ahead.

But, speaking of the hype cycle…

Declining ChatGPT Userbase

ChatGPT Use Declines as Users Complain about ‘Dumber’ Answers

According to SimilarWeb, traffic to ChatGPT on the web decreased 10% MoM. Some have said that the web traffic is declining because OpenAI have released apps on iOS and (very recently) Android, but downloads are in decline as well.

Others point to the “dumbing down” of ChatGPT as the reason. Users are complaining on social media that ChatGPT is no longer as intelligent as it was when they first started using it. Is it true? Peter Welinder, VP of Product at OpenAI says it’s not:

No, we haven't made GPT-4 dumber. Quite the opposite: we make each new version smarter than the previous one.
Current hypothesis: When you use it more heavily, you start noticing issues you didn't see before.

And, you know, I think this is the most likely explanation. When people first started using ChatGPT, it seemed magical. They looked past its flaws. Sure, it gave incorrect answers sometimes, but everything else was amazing!

But now that they are using it more, the magical has become commonplace. And when something becomes commonplace, we are less willing to overlook where it falls short. Those incorrect answers bother us. The writing style starts to grate.

It’s the next phase in LLMs, and that’s okay.

LLMs and Security

A Framework to Securely Use LLMs in Companies

In this post, Sandesh Mysore Anand talks about the different security concerns around using LLMs within a company. The concerns are broken down into six types of risks for three different use cases.

The types of risks:

Prompt injections
Data leakage
Training data poison
Money loss
Insecure supply chain
Overreliance on LLMs

And the use cases:

Online tool usage
Internal applications
Customer-facing applications

He also looks at whether the LLMs are third-party or self-hosted.

Of course, all have risks, but internal usage of LLMs is by far the least risky. It’s the overreliance on LLMs that’s the biggest risk there, due to hallucinations or inaccurate information. Same for online tool usage.

The biggest risks are for external (user-facing) integration of LLMs, with prompt injections, data leakage, denial of service, and overreliance on LLMs all being flagged.

Get Assistance from Other Gen AI Experts

Welcome to Gen AI Stack Exchange Limited Private Beta

Stack Exchange has quietly launched into “limited private beta” a new site geared toward generative AI. So if you have specific questions, this could be a good place to go. Such as if you want to know how to get ChatGPT to stop apologizing.

Talking to Computers: The Email