Novelty Effect




One of the most common issues data scientists face when dealing with A/B testing is the so-called novelty effect. The problem is the following: when you give users the chance to try a new feature, at first they might try it out just out of curiosity, even if the feature is not actually better. So if you run a test for, say, two weeks, those two weeks capture the phase in which customers are using your feature heavily simply because it is new and they are curious, i.e. the novelty effect.

You then check your test results and see, for instance, that engagement or time spent on site is way up for test vs control. So you launch that feature to all users, but after some time you see a steep decline in engagement or time spent. This is an extremely common issue, especially on sites whose main metrics are related to clicks and time spent on site, like social networks or most businesses that make money via ads.

A bit more formally, the problem is how to isolate the effect of the new feature from the effect of novelty, which is unrelated to the feature itself and happens whenever a user sees something new. The novelty effect is a specific (and the most common) example of a much wider topic: how to make sure you are testing only one thing and not multiple things at the same time. For instance, say you run a test giving some users a lower price. How do you isolate the effect of the lower price from the excitement of getting a discount?

Note that, ironically, the opposite also happens: if you give users a new experience, at first they might hate it because it is not what they are used to and they feel they have to re-learn how to use the product. This is called change aversion. In practice, though, it is a much smaller problem from an A/B testing standpoint because it mainly affects major product redesigns, which are far rarer than small UI tweaks and are often not A/B tested at all (it is hard to A/B test a major change, like a new logo or a full site redesign).

The obvious solution to the novelty effect would be to run tests longer, giving test users enough time for the novelty to wear off. However, that's hardly efficient, and the cost of running longer tests would probably outweigh the benefit of more reliable results.

Below we will go through an A/B test that was affected by the novelty effect and see what companies do to make sure the test results are still reliable.




Data


Below we have a standard A/B test dataset. You can also download it from here.

The data comes from a social network, and the new product being tested is the first version of a friend recommendation feature. The metric chosen to evaluate the test is the average number of pages visited per user. Given that most social networks monetize via ads, and assuming the number of ads clicked per page visited stays constant, that metric is essentially a proxy for revenue: the higher that number, the more money the company makes.



#read from google drive
data=read.csv("https://drive.google.com/uc?export=download&id=10LkHByquDZAf7K7krvqHZfYOQanChVjF")
head(data)
  user_id signup_date  test_date browser test pages_visited
1      34  2015-01-01 2015-08-15  Chrome    0             6
2      59  2015-01-01 2015-08-12  Chrome    1             6
3     178  2015-01-01 2015-08-10  Safari    1             3
4     383  2015-01-01 2015-08-05 Firefox    1             9
5     397  2015-01-01 2015-08-27      IE    0             1
6     488  2015-01-01 2015-08-10  Chrome    0             1


Column description:

  • user_id: the id of the user. Unique by user

  • signup_date: when the user joined the social network

  • test_date: the date of the first session of that user since the test started

  • browser: user browser during that session

  • test: users are randomly split into test (1) and control (0). Test users see the new feature and control users don't

  • pages_visited: the metric we care about. # of pages visited in that session. The test is considered successful if it increases this number





import pandas
pandas.set_option('display.max_columns', 20)
pandas.set_option('display.width', 350)
  
#read from google drive
data= pandas.read_csv("https://drive.google.com/uc?export=download&id=10LkHByquDZAf7K7krvqHZfYOQanChVjF")
print(data.head())
   user_id signup_date   test_date  browser  test  pages_visited
0       34  2015-01-01  2015-08-15   Chrome     0              6
1       59  2015-01-01  2015-08-12   Chrome     1              6
2      178  2015-01-01  2015-08-10   Safari     1              3
3      383  2015-01-01  2015-08-05  Firefox     1              9
4      397  2015-01-01  2015-08-27       IE     0              1


Column description:

  • user_id: the id of the user. Unique by user

  • signup_date: when the user joined the social network

  • test_date: the date of the first session of that user since the test started

  • browser: user browser during that session

  • test: users are randomly split into test (1) and control (0). Test users see the new feature and control users don't (a quick sanity check of this split follows the list)

  • pages_visited: the metric we care about. # of pages visited in that session. The test is considered successful if it increases this number
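
Before looking at results, it is also worth a quick sanity check that the dataset matches the description above: roughly equal group sizes, unique user ids, and test dates within the expected window. A minimal sketch in Python, reusing the data frame loaded in the snippet above:

#sanity checks on the dataset loaded above
#group sizes and average metric: with proper randomization the two groups should be roughly the same size
print(data.groupby('test')['pages_visited'].agg(['count', 'mean']))
#user_id should be unique by user
print(data['user_id'].is_unique)
#date range covered by the test
print(data['test_date'].min(), data['test_date'].max())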




Check A/B Test Results




Firstly, let's check the test results as we would normally do, simply comparing the two groups with a t-test:

#t-test of test vs control for our target metric 
ab_test = t.test(data$pages_visited[data$test==1], data$pages_visited[data$test==0])
ab_test

    Welch Two Sample t-test

data:  data$pages_visited[data$test == 1] and data$pages_visited[data$test == 0]
t = 5.4743, df = 92316, p-value = 4.404e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.05468628 0.11568527
sample estimates:
mean of x mean of y 
 4.694989  4.609804 
if  (ab_test$p.value>0.05)   print ("Non-significant results") else
    if (ab_test$statistic>0) print ("Statistically better results") else
                             print ("Statistically worse results")
[1] "Statistically better results"


So, in this case, the test appears to be a clear winner. The increase in page views is close to 2%, which would mean a massive revenue increase for any medium or large company that makes money with ads.

However, this is a classic example of a test that might be affected by the novelty effect. After all, people see these new suggested friends and might at first click on their profiles just out of curiosity, because they want to find out how the new feature works. In cases like this, you need to make sure the gain you see is not coming from the novelty effect.

There is a key catch about the novelty effect: almost by definition, it mainly affects returning users. After all, for new users everything is new, so that particular feature cannot produce a novelty effect by itself. Therefore, the most common way to check for the novelty effect is to segment users into new vs returning. If the feature is winning for returning users but not for new users, that's a really strong sign that novelty-effect dynamics are at play.

By the way, segmenting users into new vs returning is always a useful exercise when running an A/B test; it gives you crucial information about your test. A healthy A/B test should do well on new users. Otherwise, the risk is that you keep optimizing for your existing users and end up in a local optimum, unable to capture opportunities outside of your main user base.



#make them dates
data$signup_date = as.Date(data$signup_date)
data$test_date = as.Date(data$test_date)
#segment users into new vs returning. We define new users as those whose first session during the test happened on their sign-up date
data$is_new_user = ifelse(data$signup_date==data$test_date, 1, 0)

#now let's do the test for old users and new users separately
ab_test_old = t.test(data$pages_visited[data$test==1 & data$is_new_user == 0], data$pages_visited[data$test==0 & data$is_new_user == 0])
ab_test_old

    Welch Two Sample t-test

data:  data$pages_visited[data$test == 1 & data$is_new_user == 0] and data$pages_visited[data$test == 0 & data$is_new_user == 0]
t = 6.9108, df = 71267, p-value = 4.859e-12
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.08376927 0.15009688
sample estimates:
mean of x mean of y 
 4.720415  4.603481 
#we divide the 0.05 significance level by 2 because we have run two tests, i.e. we apply the Bonferroni correction
if  (ab_test_old$p.value>0.05/2)   print ("Returning users results: Non-significant results") else
    if (ab_test_old$statistic>0)   print ("Returning users results: Statistically better results") else
                                   print ("Returning users results: Statistically worse results")
[1] "Returning users results: Statistically better results"
ab_test_new = t.test(data$pages_visited[data$test==1 & data$is_new_user == 1], data$pages_visited[data$test==0 & data$is_new_user == 1])
ab_test_new

    Welch Two Sample t-test

data:  data$pages_visited[data$test == 1 & data$is_new_user == 1] and data$pages_visited[data$test == 0 & data$is_new_user == 1]
t = -1.0809, df = 19563, p-value = 0.2797
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.11657577  0.03370165
sample estimates:
mean of x mean of y 
 4.593712  4.635149 
#we divide the 0.05 significance level by 2 because we have run two tests, i.e. we apply the Bonferroni correction
if  (ab_test_new$p.value>0.05/2)   print ("New users results: Non-significant results") else
    if (ab_test_new$statistic>0)   print ("New users results: Statistically better results") else
                                   print ("New users results: Statistically worse results")
[1] "New users results: Non-significant results"


There you go! Our test was winning because returning users are using the new feature a lot, but the feature has no impact on new-user behavior. In practice, this is a really strong sign of the novelty effect.

By the way, this exercise was just about showing how to detect the novelty effect. Obviously, everything we learned previously should be applied to these two new tests as well: you should define the minimum effect size in advance for both groups, make sure you have enough people per group, check that randomization worked fine, etc. Essentially, the full testing process should be duplicated for the new-user test as well as the returning-user test.





Firstly, let's check the test results as we would normally do, simply comparing the two groups with a t-test:

from scipy import stats
#t-test of test vs control for our target metric 
test = stats.ttest_ind(data.loc[data['test'] == 1]['pages_visited'], data.loc[data['test'] == 0]['pages_visited'], equal_var=False)
  
#t statistics
print(test.statistic)
5.474295518566027
#p-value
print(test.pvalue)
4.403954129457701e-08
#print test results
if (test.pvalue>0.05):
  print ("Non-significant results")
elif (test.statistic>0):
  print ("Statistically better results")
else:
  print ("Statistically worse results")
Statistically better results


So, in this case, the test appears to be a clear winner. The increase in page views is close to 2%, which would mean a massive revenue increase for any medium or large company that makes money with ads.
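
For reference, the size of that lift can be computed directly from the two group means (a quick sketch, reusing the data frame loaded in the snippets above):

#relative lift of test vs control on the target metric
mean_test = data.loc[data['test'] == 1]['pages_visited'].mean()
mean_control = data.loc[data['test'] == 0]['pages_visited'].mean()
#relative increase of test vs control, close to 2% as noted above
print((mean_test - mean_control)/mean_control)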

However, this is a classic example of a test that might be affected by the novelty effect. After all, people see these new suggested friends and might at first click on their profiles just out of curiosity, because they want to find out how the new feature works. In cases like this, you need to make sure the gain you see is not coming from the novelty effect.

There is a key catch about the novelty effect: almost by definition, it mainly affects returning users. After all, for new users everything is new, so that particular feature cannot produce a novelty effect by itself. Therefore, the most common way to check for the novelty effect is to segment users into new vs returning. If the feature is winning for returning users but not for new users, that's a really strong sign that novelty-effect dynamics are at play.

By the way, segmenting users into new vs returning is always a useful exercise when running an A/B test; it gives you crucial information about your test. A healthy A/B test should do well on new users. Otherwise, the risk is that you keep optimizing for your existing users and end up in a local optimum, unable to capture opportunities outside of your main user base.



#segment users into new vs returning. We define new users as those whose first session during the test happened on their sign-up date
#now let's do the test for old users and new users separately
  
#old users
ab_test_old = stats.ttest_ind(data.loc[(data['test'] == 1) & (data['signup_date']!=data['test_date'])]['pages_visited'], 
                              data.loc[(data['test'] == 0) & (data['signup_date']!=data['test_date'])]['pages_visited'], 
                              equal_var=False)
#t statistics
print(ab_test_old.statistic)
6.910803219940347
#p-value
print(ab_test_old.pvalue)
4.859481805141211e-12
#we divide the 0.05 significance level by 2 because we have run two tests, i.e. we apply the Bonferroni correction
#print test results
if (ab_test_old.pvalue>0.05/2):
  print ("Returning users: Non-significant results")
elif (ab_test_old.statistic>0):
  print ("Returning users: Statistically better results")
else:
  print ("Returning users: Statistically worse results")
Returning users: Statistically better results
#new users
ab_test_new = stats.ttest_ind(data.loc[(data['test'] == 1) & (data['signup_date']==data['test_date'])]['pages_visited'], 
                              data.loc[(data['test'] == 0) & (data['signup_date']==data['test_date'])]['pages_visited'], 
                              equal_var=False)
#t statistics
print(ab_test_new.statistic)
-1.0809363577979878
#p-value
print(ab_test_new.pvalue)
0.27973874896130424
#we divide the 0.05 significance level by 2 because we have run two tests, i.e. we apply the Bonferroni correction
#print test results
if (ab_test_new.pvalue>0.05/2):
  print ("New users: Non-significant results")
elif (ab_test_new.statistic>0):
  print ("New users: Statistically better results")
else:
  print ("New users: Statistically worse results")
New users: Non-significant results
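
The division of the 0.05 threshold by 2 in the checks above is the Bonferroni correction applied by hand. If you prefer not to hard-code it, the same adjustment can be obtained from statsmodels (a sketch; it assumes the statsmodels package is available and reuses the two test objects above):

from statsmodels.stats.multitest import multipletests

#Bonferroni-adjust the p-values of the returning-user and new-user tests
reject, adjusted_pvalues, _, _ = multipletests([ab_test_old.pvalue, ab_test_new.pvalue], alpha=0.05, method='bonferroni')
#reject[i] is True when that segment is significant at the family-wise 0.05 level,
#equivalent to comparing the raw p-values against 0.05/2 as done above
print(reject)
print(adjusted_pvalues)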


There you go! Our test was winning because returning users are using the new feature a lot, but the feature has no impact on new-user behavior. In practice, this is a really strong sign of the novelty effect.

By the way, this exercise was just about showing how to detect the novelty effect. Obviously, everything we learned previously should be applied to these two new tests as well: you should define the minimum effect size in advance for both groups, make sure you have enough people per group, check that randomization worked fine, etc. Essentially, the full testing process should be duplicated for the new-user test as well as the returning-user test.

If you want to check statistical significance on segments of your population, you need to define those segments in advance, before running the test, and make sure each test on each subset of the user base is statistically sound.
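
On the point of having enough people per group: when segments are analyzed separately, the sample-size requirement applies to each segment on its own, and the Bonferroni-adjusted significance level should be used when sizing them. A minimal sketch of that power calculation with statsmodels (the 0.05 standardized effect size and 80% power below are illustrative assumptions, not values estimated from this test):

from statsmodels.stats.power import TTestIndPower

#required users per group, per segment, for a two-sample t-test
#effect_size is in standard-deviation units; alpha is the Bonferroni-adjusted level used above
required_n = TTestIndPower().solve_power(effect_size=0.05, alpha=0.05/2, power=0.8)
print(required_n)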



