All the questions

All the case study questions in this section of the course

1) At Meta, we want to start rolling out AI Studio. Should we first test it on a random sample of users or on a specific type of user? Why?

2) Google is testing ads inside AI Overview (see image below, the links on the right would be ads) vs the old Google results page. What impact do you expect this to have on ads CTR, cost per click, and number of searches per user, compared with the old standard Google page with no AI?

3) X (Twitter) wants to build a model to identify AI-generated tweets because they are bad for user engagement. How would you do it using only the tweet text? Which do you think is more costly to X: false positives or false negatives?

4) Unlike Amazon, Airbnb doesn't use AI to summarize reviews yet. However, they do use AI to translate the listing page. Why do you think this is happening from a data science perspective?

5) We want to launch AI Studio within WhatsApp, at first only targeting power users interested in AI. How can we do that?

6) At Character.AI, there is significant friction because the user has to think about what to write and then type it. How would you use data to suggest possible improvements to the user experience?

7) OpenAI launched their latest model only to the top-tier users (the highest-paying ones). In their pre-launch tests, they got a certain answer accuracy score for the model. However, when analyzing the post-launch data, they noticed a lower score. Why might that be, and how would you fix it?

8) After testing AI review summaries at Amazon, we noticed that conversion rate is flat, but refund rate is up. Why do you think that happened? What kind of data would you look at to validate your hypothesis, and what follow-up test would you suggest?

9) How would you define an engagement metric for an AI chatbot that is robust to the fact that great answers are likely to end the interaction (the user learns what they needed), while bad answers might lead to follow-up questions, and so to more usage?

10) Most AI answers are very verbose and your PM at Meta AI is worried that it's hurting engagement with the feature. What kind of data would you look at to find out if this is indeed an issue?

11) How would you estimate the probability of an LLM answering a question correctly? The goal is to send to the AI only the questions with a high probability of being answered correctly.

12) After launching a new AI product to our most tech-savvy users, we want to analyze the model performance on corner case questions. How can we identify those questions?

13) We A/B tested Meta AI inside WhatsApp on a representative subset of the whole population and found that: total messages sent to groups is up, total 1:1 messages (to individual contacts) is flat, and messages (groups + individual) per user session is down. Is this good or bad? How would you explain it?

14) LinkedIn is planning to introduce an AI tool that allows recruiters to send highly personalized AI-written emails. What kind of impact do you expect this to have on response rate? What kind of current data would you look at to guess it?

15) At first, Google wants to launch AI Overview (AI answers for search queries) just for a few specific categories. How would you choose those categories? We are talking about the very first version of AI Overview, without ads inside.

16) Do you expect Meta AI inside WhatsApp to have network effects? Only consider WhatsApp users who have never used it yet: would they be affected by the fact that some other users are using Meta AI? Why? Which metrics do you consider particularly important for understanding what's going on?

17) At Character.AI, the top character for engagement (defined as chats per day) is a Spanish Tutor. However, we noticed that actual response accuracy (scored by human reviewers) is low. How would you figure out what’s happening, and what follow-up steps would you suggest to your PM?

18) At Google, how would you design an A/B test to validate the hypothesis that LLMs are better than the standard Google search algorithm at finding hidden information (e.g. an answer that only very few sites contain)?

19) At Airbnb, we want to decide if it's a good idea to test review summarization, e.g. AI generated review summaries on top of reviews. What kind of data would you look at to see if this is a test worth trying?

20) When do you think it would be appropriate to start running A/B tests on new AI products? When is it too early? Assume your model is super reliable.

21) Meta AI inside WhatsApp is currently implemented as a normal chat (reverse chronological order based on the last interaction). What data would you use to support an A/B test where it's implemented as a pinned chat by default (always at the top, no matter the last interaction timestamp)?

22) How would you use LLMs at Netflix to improve movie/series trailers?

23) When you launch a new AI product, you have a huge novelty effect. Can you think of a way to build a model that predicts whether the usage spike is due to the novelty effect? As model features, use only the first user interactions with the AI product.

24) Can you A/B test two AI models? With basically infinite types of possible questions that can be asked, how can you design an A/B test where test and control have comparable question distributions?

25) At Uber, we want to start replacing our customer service with an LLM. Which CS support queries would be the first ones you would try to replace? Assume your goal is to minimize the risk of negatively affecting revenue.

26) You built a Harry Potter summarizer via RAG + LLM and want to launch it. However, when testing it, you notice that it lacks contextual knowledge. How would you fix that?

27) In AI, there is an obvious trade-off between helpfulness and hallucination rate. As you reduce the hallucination rate, responses tend to become more generic. How would you deal with this trade-off from a DS perspective?

28) We A/B tested two different AI models. The test version got way more harmful replies than control, despite the fact that our internal tests prior to the A/B test showed similar harmful response rates. Why do you think this happened?

29) At Uber, your PM wants you to analyze their iPhone app reviews via an LLM. Specifically, they want you to tell them the most important negative characteristics of their product. How would you do that?

30) At Netflix, how would you use AI to give movie screenwriters ideas for their scripts?

31) Which metric would you choose to optimize a new AI product? Why might an average-based metric be better in this case than in the typical DS scenario?

32) What do you think was the key data-driven insight behind NotebookLM? Walk me through how a data scientist would have discovered it.

33) At Amazon, we want to test a customer service AI chatbot. The metric we care about is the percentage of positive feedback from users. The test wins if the AI chatbot metric is not worse, i.e. the AI bot is as good as humans. How many queries do I need in my test?

34) YouTube has recently seen a spike in AI videos. These have lower metrics on average, so your PM is worried. What would you recommend they do to mitigate the issue?

35) After launching their latest model to top-tier customers only, OpenAI saw a 10% increase in engagement via a before/after test done on top-tier customers (assume no novelty effect). How would you extrapolate from this what the engagement improvement will be once the new model is launched to the whole population?

36) At YouTube, how would you use LLMs to increase click-through rate (CTR)?

37) How would you personalize Perplexity responses? Which data would you use to pitch your idea to your PM?

38) From a metric and data collection standpoint, compare the pros and cons of these two strategies: embedding AI products within current products (e.g. AI Studio or Meta AI within WhatsApp) or running them as standalone products (e.g. all the AI products on Google AI Labs).

39) Google Cloud lets you test their AI voices (text-to-speech) before creating an account. On the other hand, AWS forces you to create an account before testing their AI voices. Explain which data you would look at to decide which strategy is better.

40) An ecommerce company started using LLMs to write Google ads based on positive Amazon reviews. This led to a decrease in ad CTR, an increase in conversions per ad click, and a slight decrease in conversions per ad impression. Bidding price and keywords haven’t changed. What happened? What follow-up steps would you take?
