After launching a new AI product to our most tech-savvy users, we want to analyze the model performance on corner case questions. How can we identify those questions?
Firstly, let's define corner case questions, as this is one of the most important concepts in Gen AI DS. They are questions significantly different from what the model has seen during the internal pre-launch tests, and, therefore, there is uncertainty around the model's performance on them. Essentially, they are unexpected use cases.
That is, before the launch, a model is always tested on an extremely broad range of questions. These questions are usually generated by an LLM itself, and the answers are reviewed by a combination of human reviewers and an LLM (while there is debate on whether an LLM can accurately assess the quality of the answers, there is no doubt that LLMs are incredibly good at generating test questions for validation purposes. And, by the way, the ability to generate accurate test data is a huge advantage of LLMs over traditional ML).
However, no matter what, the combinations of possible questions are virtually infinite, and there is no way that, before launching, a model can be stress-tested against every single possible question. Hence, you need to put it into the hands of your users (typically the most tech-savvy/power users) and see what happens.
Often, the number one goal of early launches is specifically to check how the model did on corner case questions. And since data scientists are in charge of analyzing post-launch data, that becomes the main task of a Gen AI DS. Obviously, you should still check how the model did on all the questions, but you can pretty safely assume that model performance didn't change much as long as the questions were similar to the test set. The real question is how the model performed on the new/corner case/untested questions.
The task is therefore reduced to: I have a set of test questions used for training/testing purposes before the launch, and a set of real user questions after the launch. How do I find the differences?
In general, corner cases are small variations of common questions. It is unlikely that the LLM entirely missed a topic when generating test questions, so a corner case is likely a topic that's present in the test set, but with a twist that makes it significantly harder. Example: how to calculate sample size in an A/B test (simple) -> how to calculate sample size if I am running a non-inferiority test (harder) -> how to calculate sample size if I am running a non-inferiority test where a metric drop is actually good for the business (approaching corner case).
There are a couple of ways to do it. One way would be to build a binary model. Label the pre-launch questions as 0 and the post-launch ones as 1, train a classifier on the question text (or its embeddings), and score both sets. By definition, corner cases are under-represented in the pre-launch set, so the model will give them high scores (a higher likelihood of being classified as post-launch questions). Then just pick the questions whose score is above a high threshold (say 0.9): these are the questions whose answers you want to manually analyze. If nothing scores above the threshold, no post-launch question is significantly different from the pre-launch ones. In the extreme case that the two distributions are exactly the same, the score in a balanced dataset would always be ~0.5 (very unlikely to happen this early, since it would basically mean that our LLM-generated test questions were absolutely perfect).
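A minimal sketch of this approach, assuming the questions are available as plain strings. The find_corner_cases helper, the TF-IDF features, and the 0.9 threshold are illustrative choices, not a fixed recipe; in practice you would more likely use LLM embeddings as features.

```python
# Minimal sketch: classify pre- vs post-launch questions and flag confident "post" scores.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def find_corner_cases(pre_launch_questions, post_launch_questions, threshold=0.9):
    """Return post-launch questions the classifier confidently labels as post-launch."""
    texts = pre_launch_questions + post_launch_questions
    labels = np.array([0] * len(pre_launch_questions) + [1] * len(post_launch_questions))

    # TF-IDF keeps the example self-contained; swap in question embeddings if available.
    X = TfidfVectorizer(ngram_range=(1, 2), min_df=2).fit_transform(texts)

    # Out-of-fold probabilities, so each question is scored by a model that never saw it.
    scores = cross_val_predict(
        LogisticRegression(max_iter=1000), X, labels, cv=5, method="predict_proba"
    )[:, 1]

    post_scores = scores[len(pre_launch_questions):]
    return [q for q, s in zip(post_launch_questions, post_scores) if s >= threshold]
```

If the two question sets are truly indistinguishable, the out-of-fold scores will hover around 0.5 and nothing gets flagged, which matches the extreme case described above.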
Another way to do it is via clustering. Cluster all the questions in the pre-launch dataset. Then assign each post-launch question to its closest cluster. If a question is far from the closest cluster centroid, it is a corner case question. Basically, this is anomaly detection on post-launch questions, where an anomaly is defined as being far from the typical pre-launch questions.
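A rough sketch of the clustering version, assuming you already have numeric embeddings for the questions. The flag_anomalies helper, the number of clusters, and the 95th-percentile cutoff are illustrative assumptions.

```python
# Rough sketch: cluster pre-launch questions, then flag post-launch questions that are
# unusually far from their nearest pre-launch centroid.
import numpy as np
from sklearn.cluster import KMeans

def flag_anomalies(pre_embeddings, post_embeddings, n_clusters=20, quantile=0.95):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(pre_embeddings)

    # Distance of each pre-launch question to its own centroid defines what "typical" means.
    pre_dist = np.min(km.transform(pre_embeddings), axis=1)
    cutoff = np.quantile(pre_dist, quantile)

    # A post-launch question is a corner-case candidate if it is farther from its nearest
    # centroid than almost all pre-launch questions are from theirs.
    post_dist = np.min(km.transform(post_embeddings), axis=1)

    # Return the fitted model too, so the flagged questions can be interpreted later.
    return post_dist > cutoff, km
```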
Clustering is a bit more user-friendly and also allows you to better interpret the results. For instance, you could notice that some corner case questions sit in between two clusters, so what makes them a corner case is the rare intersection of two topics ("how to calculate sample size if I am running a non-inferiority test where a metric drop is actually good for the business" is rare because it sits between statistics and business). Other questions might be corner cases just because they are really hard/deep into a given topic (some obscure question about a statistical test no sane person has ever heard of). These will be far from their single closest cluster centroid rather than stuck between two.
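Continuing the same sketch, the distances to the two nearest centroids give a cheap way to tell these two kinds of corner cases apart. The describe_corner_case helper and the 1.2 ratio are again illustrative assumptions, not a calibrated rule.

```python
# Compare distances to the two nearest pre-launch centroids for a flagged question.
import numpy as np

def describe_corner_case(km, embedding, ratio_cutoff=1.2):
    dists = np.sort(km.transform(embedding.reshape(1, -1))[0])
    nearest, second_nearest = dists[0], dists[1]
    if second_nearest / nearest < ratio_cutoff:
        # Roughly equidistant from two centroids: a rare intersection of two topics.
        return "between two clusters (topic intersection)"
    # Clearly closest to a single centroid, yet far from it: deep/obscure within one topic.
    return "far into a single cluster (deep/obscure question)"
```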
The binary model approach doesn't give you this kind of visibility into the results. However, it is probably more scalable and more efficient over large amounts of data, and it requires less manual work (binary classification is still more automated than clustering).