Solution: Ads Analysis





Data


Let’s read the dataset and see the variables we have


require(dplyr) 
require(ggplot2)

#read the file
data = read.csv("https://drive.google.com/uc?export=download&id=1wB3x3bPIX7C19Rmse1bpZtYBmVcrPDGa")
dim(data)
[1] 2115    7
head(data)
        date shown clicked converted avg_cost_per_click total_revenue         ad
1 2015-10-01 65877    2339        43               0.90        641.62 ad_group_1
2 2015-10-02 65100    2498        38               0.94        756.37 ad_group_1
3 2015-10-03 70658    2313        49               0.86        970.90 ad_group_1
4 2015-10-04 69809    2833        51               1.01        907.39 ad_group_1
5 2015-10-05 68186    2696        41               1.00        879.45 ad_group_1
6 2015-10-06 66864    2617        46               0.98        746.48 ad_group_1
#make it a date
data$date=as.Date(data$date)

summary(data)
      date                shown           clicked        converted      avg_cost_per_click total_revenue               ad      
 Min.   :2015-10-01   Min.   :     0   Min.   :    0   Min.   :   0.0   Min.   :0.000      Min.   : -200.2   ad_group_1 :  53  
 1st Qu.:2015-10-14   1st Qu.: 28030   1st Qu.:  744   1st Qu.:  18.0   1st Qu.:0.760      1st Qu.:  235.5   ad_group_11:  53  
 Median :2015-10-27   Median : 54029   Median : 1392   Median :  41.0   Median :1.400      Median :  553.3   ad_group_12:  53  
 Mean   :2015-10-27   Mean   : 68300   Mean   : 3056   Mean   : 126.5   Mean   :1.374      Mean   : 1966.5   ad_group_13:  53  
 3rd Qu.:2015-11-09   3rd Qu.: 97314   3rd Qu.: 3366   3rd Qu.: 103.0   3rd Qu.:1.920      3rd Qu.: 1611.5   ad_group_15:  53  
 Max.   :2015-11-22   Max.   :192507   Max.   :20848   Max.   :1578.0   Max.   :4.190      Max.   :39623.7   ad_group_16:  53  
                                                                                                             (Other)    :1797  


The data look a bit off. For instance, there is negative revenue, which doesn't make much sense. Let's clean the data a bit by removing the impossible values. In a real-world situation, we would try to get to the bottom of this and figure out where the bad data are coming from.

#Revenue cannot be negative
paste("There are", nrow(subset(data, total_revenue<0)), "events with negative revenue")
[1] "There are 4 events with negative revenue"
#Remove those
data = subset(data, total_revenue>=0)

#Also, shown should be >= clicked and clicked should be >= converted. Let's see:
paste("There are", nrow(subset(data, shown<clicked | clicked<converted)), "events where the funnel doesn't make any sense")
[1] "There are 0 events where the funnel doesn't make any sense"
#Finally, there are a few zeros that seem weird, considering that avg values are very high. Let's plot and see:
ggplot(data, aes(y=shown, x=date, colour=ad, group=ad)) + 
  geom_line(show.legend = FALSE) + 
  ggtitle("Ad impressions") 


Those sudden drops to zero definitely look weird. Let's get rid of them and then check ad clicks.

#remove zero impression data
data = subset(data, shown>0)
#now check clicks
ggplot(data, aes(y=clicked, x=date, colour=ad, group=ad)) + 
  geom_line(show.legend = FALSE) + 
  ggtitle("Ad clicks")


There are still a few weird drops to zero in clicks, even though we removed the rows with zero impressions. Let's remove these events too.

#remove zero click data
data = subset(data, clicked>0)
#now check conversion
ggplot(data, aes(y=converted, x=date, colour=ad, group=ad)) + 
  geom_line(show.legend = FALSE) + 
  ggtitle("Ad conversions")

summary(data)
      date                shown           clicked          converted      avg_cost_per_click total_revenue               ad      
 Min.   :2015-10-01   Min.   :  7494   Min.   :  187.0   Min.   :   0.0   Min.   :0.080      Min.   :    0.0   ad_group_1 :  53  
 1st Qu.:2015-10-14   1st Qu.: 28446   1st Qu.:  756.5   1st Qu.:  18.0   1st Qu.:0.780      1st Qu.:  244.8   ad_group_15:  53  
 Median :2015-10-27   Median : 54569   Median : 1413.0   Median :  42.0   Median :1.410      Median :  561.1   ad_group_17:  53  
 Mean   :2015-10-26   Mean   : 68854   Mean   : 3088.7   Mean   : 127.8   Mean   :1.388      Mean   : 1989.2   ad_group_2 :  53  
 3rd Qu.:2015-11-09   3rd Qu.:100396   3rd Qu.: 3413.5   3rd Qu.: 106.0   3rd Qu.:1.920      3rd Qu.: 1634.9   ad_group_22:  53  
 Max.   :2015-11-22   Max.   :192507   Max.   :20848.0   Max.   :1578.0   Max.   :4.190      Max.   :39623.7   ad_group_26:  53  
                                                                                                               (Other)    :1773  


We could spend more time looking at the data, but overall things seem to make sense now.





import pandas
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.dates as mdates
from matplotlib import rcParams
rcParams.update({'figure.autolayout': True})
pandas.set_option('display.max_columns', 10)
pandas.set_option('display.width', 350)
  
#read the file
data = pandas.read_csv("https://drive.google.com/uc?export=download&id=1wB3x3bPIX7C19Rmse1bpZtYBmVcrPDGa")
print(data.shape)
  
(2115, 7)
print(data.head())
         date  shown  clicked  converted  avg_cost_per_click  total_revenue          ad
0  2015-10-01  65877     2339         43                0.90         641.62  ad_group_1
1  2015-10-02  65100     2498         38                0.94         756.37  ad_group_1
2  2015-10-03  70658     2313         49                0.86         970.90  ad_group_1
3  2015-10-04  69809     2833         51                1.01         907.39  ad_group_1
4  2015-10-05  68186     2696         41                1.00         879.45  ad_group_1
data['date'] = pandas.to_datetime(data['date'])
  
print(data.describe())
               shown       clicked    converted  avg_cost_per_click  total_revenue
count    2115.000000   2115.000000  2115.000000         2115.000000    2115.000000
mean    68299.844444   3056.077069   126.453901            1.373749    1966.517589
std     48884.821409   3783.579969   233.420826            0.754331    3942.018757
min         0.000000      0.000000     0.000000            0.000000    -200.150000
25%     28030.500000    744.000000    18.000000            0.760000     235.470000
50%     54029.000000   1392.000000    41.000000            1.400000     553.300000
75%     97314.500000   3366.500000   103.000000            1.920000    1611.495000
max    192507.000000  20848.000000  1578.000000            4.190000   39623.710000


The data look a bit off. For instance, there is negative revenue, which doesn't make much sense. Let's clean the data a bit by removing the impossible values. In a real-world situation, we would try to get to the bottom of this and figure out where the bad data are coming from.
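As a first step in that direction, a minimal check (using only columns already in the dataset) is to look at which rows are affected:

#inspect the rows with negative revenue: which dates and ad groups do they belong to?
bad_rows = data.query('total_revenue < 0')
print(bad_rows[['date', 'ad', 'shown', 'clicked', 'converted', 'total_revenue']])
#count them per ad group to see whether the problem is concentrated in a few groups
print(bad_rows.groupby('ad').size())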

#Revenue cannot be negative
print("There are", len(data.query('total_revenue<0')), "events with negative revenue")
There are 4 events with negative revenue
#Remove those
data = data.query('total_revenue>=0')

#Also, shown should be >= clicked and clicked should be >= converted. Let's see:
print("There are", len(data.query('shown<clicked | clicked<converted')), "events where the funnel doesn't make any sense")
There are 0 events where the funnel doesn't make any sense
#Finally, there are a few zeros that seem weird, considering that avg values are very high. Let's plot and see:
g=sns.lineplot(x="date", y="shown",hue="ad", data=data, legend="")
g.xaxis.set_major_locator(mdates.MonthLocator())
g.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))
g.set_title("Ad impressions") 
plt.show()


Those sudden drops to zero definitely look weird. Let's get rid of them and then check ad clicks.

#remove zero impression data
data = data.query('shown>0')
#now check clicks
g=sns.lineplot(x="date", y="clicked",hue="ad", data=data, legend="")
g.xaxis.set_major_locator(mdates.MonthLocator())
g.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))
g.set_title("Ad clicks") 
plt.show()


There are still a few weird drops to zero in clicks, even though we removed the rows with zero impressions. Let's remove these events too.

#remove zero click data
data = data.query('clicked>0')
#now check conversion
g=sns.lineplot(x="date", y="converted",hue="ad", data=data, legend="")
g.xaxis.set_major_locator(mdates.MonthLocator())
g.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))
g.set_title("Ad conversions") 
plt.show()

#check final dataset
print(data.describe())
               shown       clicked    converted  avg_cost_per_click  total_revenue
count    2091.000000   2091.000000  2091.000000         2091.000000    2091.000000
mean    68853.852702   3088.744142   127.822573            1.387661    1989.242257
std     48799.535713   3792.468317   234.394500            0.746035    3958.838731
min      7494.000000    187.000000     0.000000            0.080000       0.000000
25%     28446.000000    756.500000    18.000000            0.780000     244.765000
50%     54569.000000   1413.000000    42.000000            1.410000     561.070000
75%    100395.500000   3413.500000   106.000000            1.920000    1634.895000
max    192507.000000  20848.000000  1578.000000            4.190000   39623.710000


We could spend more time looking at the data, but overall things seem to make sense now.
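If we did want to dig further, one more cheap sanity check (a sketch, using only the existing columns) is to verify that every ad group still covers a similar date range and has a similar number of rows after the cleaning:

#per ad group: first date, last date, and number of remaining rows
coverage = data.groupby('ad')['date'].agg(['min', 'max', 'count'])
print(coverage.sort_values('count').head())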




Q: If you had to identify the 5 best ad groups, which ones would they be? Which metric did you choose to identify the best ones? Why? Explain the pros of your metric as well as the possible cons. From a business perspective, choosing that metric implies that you are focusing on what?


There are obviously tons of ad-related metrics, and they are all plausible and useful in different situations, depending on the main business focus of a given company. A few examples are:

  1. Click-through rate (CTR), # of clicks / # of impressions, i.e. of all the people seeing the ad, how many click on it?
  • Pros: as discussed extensively in this course's case studies, it is great for identifying demand from users. Also, since it only takes into account what the user does after seeing the ad, it can be easily estimated and allows us to test and improve ad characteristics to maximize the number of users coming to the site.
  • Cons: it carries no information about what the user does once they come to the site, nor about costs. A very high CTR is not that useful if none of those users converts. It can also easily be gamed via click-bait ads.
  • This metric is purely growth-based: the goal is to bring users to the site.

  2. Conversion rate (CR), # of conversions / # of impressions, i.e. of all the people seeing the ad, how many end up becoming customers?
  • Pros: fixes most of CTR's issues. After all, the goal of an ad is to get customers, and this is what this metric represents.
  • Cons: it carries no information about costs. Also, conversion on the site highly depends on the product: an ad could be great at capturing demand, but if the product is bad, no one converts. Finally, linking a conversion back to an ad click is pretty messy; the conversion might happen after a long time, or after multiple visits through different channels, etc.
  • This metric combines both growth and product.

  3. Profits, revenue from conversions - ad costs.
  • Pros: very concrete and business driven. Am I making money via ads?
  • Cons: same as CR, it is hard to define the revenue attributable to an ad; not only are there the issues described above, but here you would also need to predict user lifetime value to see whether an ad is profitable. Also, it only considers users after they click on the ad; it has no information about the top of the funnel, i.e. how many users see the ad and choose not to click.
  • This metric focuses on the business/product side, but it is hardly useful from a growth standpoint. It is typically used by very large companies whose main goal is being profitable.

A quick sketch of computing these three metrics per ad group is shown right after this list.

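As a sketch only (in Python, using the column names from the dataset; the actual ranking code comes later), the three metrics could be computed per ad group like this:

#average CTR, conversion rate, and profit per ad group
metrics = data.assign(
    ctr = data['clicked'] / data['shown'],
    cr = data['converted'] / data['shown'],
    profit = data['total_revenue'] - data['clicked'] * data['avg_cost_per_click']
).groupby('ad')[['ctr', 'cr', 'profit']].mean()
print(metrics.head())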

For this exercise, let's look into profits and CTR. We will pick the top 5 ads based on CTR, as long as profits are above a certain threshold. In plain English, that means we want the ads with the highest growth potential after removing the ads that would make us go broke.


#Let's get CTR and profits per ad group on an avg
data_rank = data %>%
             group_by(ad) %>%
             summarize (CTR = mean(clicked/shown),
                        profits = mean(total_revenue-clicked*avg_cost_per_click),
                        count = mean(shown)
                       )

#check minimum ad count just to make sure we have enough data for all groups. Looks like we are fine
min(data_rank$count)
[1] 8461.096
#Let's plot them
qplot(profits, CTR, data=data_rank)+ 
  geom_vline(xintercept=-5000)


Based on the plot, we can set the profitability threshold at -5K. We will remove all ads below that and then find the ads with the highest CTR among the remaining ones. Note that here we are only considering short-term revenue from the ad, so we expect that number to be negative. An ad should become profitable once we take into account user lifetime value (as usual, lifetime -> 1 yr).

#Remove ads that are making us lose too much money and find top 5 based on CTR among the other ones
data_rank %>%
  filter (profits > -5000) %>%
  arrange(desc(CTR)) %>%
  head(5) 
# A tibble: 5 x 4
  ad             CTR profits   count
  <fct>        <dbl>   <dbl>   <dbl>
1 ad_group_19 0.0942  -3460.  19507.
2 ad_group_28 0.0892  -2644.  19886.
3 ad_group_13 0.0856  -3573. 161519.
4 ad_group_34 0.0843  -1189.  35372.
5 ad_group_5  0.0639  -3234.  52406.





#Let's get CTR and profits per ad group on an avg
data_rank = data.groupby('ad').apply(
             lambda x: pandas.Series({
                       'CTR': (x['clicked']/x['shown']).mean(),
                       'profits': (x['total_revenue']-x['clicked']*x['avg_cost_per_click']).mean(),
                       'count': x['shown'].mean()
  })
).reset_index()

#check minimum ad count just to make sure we have enough data for all groups. Looks like we are fine
print(data_rank['count'].min())
8461.096153846154
#Let's plot them
plt.scatter(data_rank['profits'], data_rank['CTR'])
plt.axvline(x=-5000)
plt.xlabel('profits')
plt.ylabel('CTR')
plt.show()


Based on the plot, we can set the profitability threshold at -5K. We will remove all ads below that and then find the ads with the highest CTR among the remaining ones. Note that here we are only considering short-term revenue from the ad, so we expect that number to be negative. An ad should become profitable once we take into account user lifetime value (as usual, lifetime -> 1 yr).

#Remove ads that are making us lose too much money and find top 5 based on CTR among the other ones
print(data_rank.query('profits > -5000').sort_values('CTR', ascending=False).head(5))
             ad       CTR          count      profits
10  ad_group_19  0.094188   19506.884615 -3460.419038
20  ad_group_28  0.089191   19886.339623 -2644.280943
4   ad_group_13  0.085635  161519.176471 -3573.196275
27  ad_group_34  0.084257   35371.622642 -1189.270000
35   ad_group_5  0.063862   52406.461538 -3233.796538
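To make the lifetime-value caveat concrete, here is a rough back-of-the-envelope sketch. The LTV number is purely hypothetical (it is not in the dataset) and only illustrates how the profit calculation would change once post-conversion revenue over, say, one year is included:

#assumed_ltv is a made-up average 1-year lifetime value per converted user, for illustration only
assumed_ltv = 30.0
profits_with_ltv = data.groupby('ad').apply(
    lambda x: ((x['total_revenue'] + assumed_ltv * x['converted'])
               - x['clicked'] * x['avg_cost_per_click']).mean()
)
print(profits_with_ltv.sort_values(ascending=False).head())

Depending on the assumed LTV, ad groups that look unprofitable on short-term revenue alone may flip to positive profit.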




Q: For each group, predict how many ads will be shown on Dec 15


The most common ways to do this are either using a GAM or using some time series package. In this case, we will do it via the time series package Prophet.


library(prophet)
#Let's firstly take just one ad_group and do it manually for that one as an example. Then, we will apply it to all others
data_ad1 = subset(data, ad=="ad_group_1")[, c("date", "shown")]
qplot(date, shown, data=data_ad1, geom="line", main="Ad group 1")

#Let's build a time series using prophet. From previous plots, time series appear pretty straightforward and flattish. Default values should do fine
colnames(data_ad1)=c("ds", "y")
#build the model
ts = prophet(data_ad1)

#make predictions until Dec 15. This is the number of days for which we need a prediction to get to Dec 15
days_predictions = as.numeric(as.Date("2015-12-15")-max(data_ad1$ds))
future_data = make_future_dataframe(ts, periods = days_predictions)
predictions = predict(ts, future_data)
#let's plot them
plot(ts, predictions)

#and plot the series component
prophet_plot_components(ts, predictions)


We can see that there is some (very, very small) weekly variation. Most importantly, the time series is pretty flat overall in October and starts going up in November. The prediction for Dec 15 is on the higher side, given that the model assumes the time series keeps following its most recent trend. Obviously, this would be much more useful with a full year of data, so we could also capture yearly seasonality.

Let’s now do it for all the ad groups and get the predictions.

#data set to store results
predictions_all_groups = data.frame(date = as.Date("2015-12-15"),
                                    ad_group = unique(data$ad),
                                    prediction = NA
                                    )

#We will do it via a for loop through all the ad groups
for (i in 1:nrow(predictions_all_groups)){

     #subset data, only keep 1 ad group and columns needed
     tmp = subset(data, ad == predictions_all_groups$ad_group[i])[, c("date", "shown")] 
     colnames(tmp)=c("ds", "y")

     #build the model
     ts_tmp = prophet(tmp)

     #make predictions until Dec 15. This is the number of days for which we need a prediction to get to Dec 15
     days_predictions_tmp = as.numeric(as.Date("2015-12-15")-max(tmp$ds))
     future_data_tmp = make_future_dataframe(ts_tmp, periods = days_predictions_tmp)
     predictions_tmp = predict(ts_tmp, future_data_tmp)

     #store results for the last day, i.e. Dec 15 prediction
     predictions_all_groups$prediction[i] = round(tail(predictions_tmp$yhat,1))
}
predictions_all_groups
         date    ad_group prediction
1  2015-12-15  ad_group_1      77558
2  2015-12-15  ad_group_2      50305
3  2015-12-15  ad_group_3     152888
4  2015-12-15  ad_group_4      94679
5  2015-12-15  ad_group_5      54750
6  2015-12-15  ad_group_6      40587
7  2015-12-15  ad_group_7      56988
8  2015-12-15  ad_group_8      47171
9  2015-12-15  ad_group_9     124642
10 2015-12-15 ad_group_10     127043
11 2015-12-15 ad_group_11      19964
12 2015-12-15 ad_group_12      28000
13 2015-12-15 ad_group_13     162073
14 2015-12-15 ad_group_14       8715
15 2015-12-15 ad_group_15      15773
16 2015-12-15 ad_group_16      30494
17 2015-12-15 ad_group_17     134101
18 2015-12-15 ad_group_18      89749
19 2015-12-15 ad_group_19      19889
20 2015-12-15 ad_group_20     120417
21 2015-12-15 ad_group_21      28320
22 2015-12-15 ad_group_22      26866
23 2015-12-15 ad_group_23      48616
24 2015-12-15 ad_group_24      37394
25 2015-12-15 ad_group_25     174387
26 2015-12-15 ad_group_26      72261
27 2015-12-15 ad_group_27      65385
28 2015-12-15 ad_group_28      20984
29 2015-12-15 ad_group_29      21327
30 2015-12-15 ad_group_30     110710
31 2015-12-15 ad_group_31     123616
32 2015-12-15 ad_group_32      39847
33 2015-12-15 ad_group_33      16240
34 2015-12-15 ad_group_34      38683
35 2015-12-15 ad_group_35      60008
36 2015-12-15 ad_group_36      71430
37 2015-12-15 ad_group_37      71485
38 2015-12-15 ad_group_38     177108
39 2015-12-15 ad_group_39      26122
40 2015-12-15 ad_group_40      80638





import fbprophet
#Let's firstly take just one ad_group and do it manually for that one as an example. Then, we will apply it to all others
data_ad1 = data.query('ad==\"ad_group_1\"')[["date", "shown"]]
#plot
g=sns.lineplot(x="date", y="shown", data=data_ad1)
g.xaxis.set_major_locator(mdates.MonthLocator())
g.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))
g.set_title("Ad group 1") 
plt.show()

#Let's build time series using prophet. From previous plots, time series appear pretty straightforward and flattish. Default values should do fine
  
#change column names and index
data_ad1['ds'] = data_ad1['date']
data_ad1.rename({'shown': 'y'}, axis=1, inplace=True)
data_ad1.set_index('date', inplace=True)

#build the model
ts = fbprophet.Prophet()
ts.fit(data_ad1)
  
#make predictions until Dec 15. This is the number of days for which we need a prediction to get to Dec 15
days_predictions = (pandas.to_datetime("2015-12-15")-data_ad1['ds'].max()).days
future_data = ts.make_future_dataframe(periods = days_predictions)
predictions = ts.predict(future_data)
  
#let's plot them
ts.plot(predictions)
plt.show()

#and plot the series component
ts.plot_components(predictions)
plt.show()


We can see that there is some (very, very small) weekly variation. Most importantly, the time series is pretty flat overall in October and starts going up in November. The prediction for Dec 15 is on the higher side, given that the model assumes the time series keeps following its most recent trend. Obviously, this would be much more useful with a full year of data, so we could also capture yearly seasonality.
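If we want the numbers behind those component plots, the forecast dataframe returned by Prophet already contains the fitted components (columns such as 'trend', 'weekly' and 'yhat'), so we can inspect them directly, e.g.:

#trend and weekly components for the last few forecasted days, up to Dec 15
print(predictions[['ds', 'trend', 'weekly', 'yhat']].tail())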

Let’s now do it for all the ad groups and get the predictions.

#data set to store results
predictions_all_groups = pandas.DataFrame({'ad_group':data['ad'].unique(),
                                           'date': pandas.to_datetime("2015-12-15")})
prediction_dec_15 = []                                           
   
#We will do it via a for loop through all the ad groups
for level in predictions_all_groups.ad_group:
     
     #subset data, only keep 1 ad group and columns needed
     tmp = data.loc[data['ad']==level][["date", "shown"]]
     
     #change column names and index
     tmp['ds'] = tmp['date']
     tmp.rename({'shown': 'y'}, axis=1, inplace=True)
     tmp.set_index('date', inplace=True)
     
     #build the model
     ts_tmp = fbprophet.Prophet()
     ts_tmp.fit(tmp)
     
     #make predictions until Dec 15. This is the number of days for which we need a prediction to get to Dec 15
     days_predictions_tmp = (pandas.to_datetime("2015-12-15")-tmp['ds'].max()).days
     future_data_tmp = ts_tmp.make_future_dataframe(periods = days_predictions_tmp)
     predictions_tmp = ts_tmp.predict(future_data_tmp)
     
     #store results for the last day, i.e. Dec 15 prediction
     prediction_dec_15.append(round(predictions_tmp['yhat'].iat[-1]))

       
predictions_all_groups['predictions'] = prediction_dec_15
print(predictions_all_groups)
       ad_group       date  predictions
0    ad_group_1 2015-12-15      77558.0
1    ad_group_2 2015-12-15      50305.0
2    ad_group_3 2015-12-15     152888.0
3    ad_group_4 2015-12-15      94679.0
4    ad_group_5 2015-12-15      54750.0
5    ad_group_6 2015-12-15      40587.0
6    ad_group_7 2015-12-15      56988.0
7    ad_group_8 2015-12-15      47171.0
8    ad_group_9 2015-12-15     124642.0
9   ad_group_10 2015-12-15     127043.0
10  ad_group_11 2015-12-15      19964.0
11  ad_group_12 2015-12-15      28000.0
12  ad_group_13 2015-12-15     162073.0
13  ad_group_14 2015-12-15       8715.0
14  ad_group_15 2015-12-15      15773.0
15  ad_group_16 2015-12-15      30494.0
16  ad_group_17 2015-12-15     134101.0
17  ad_group_18 2015-12-15      89749.0
18  ad_group_19 2015-12-15      19889.0
19  ad_group_20 2015-12-15     120417.0
20  ad_group_21 2015-12-15      28320.0
21  ad_group_22 2015-12-15      26866.0
22  ad_group_23 2015-12-15      48616.0
23  ad_group_24 2015-12-15      37394.0
24  ad_group_25 2015-12-15     174387.0
25  ad_group_26 2015-12-15      72261.0
26  ad_group_27 2015-12-15      65385.0
27  ad_group_28 2015-12-15      20984.0
28  ad_group_29 2015-12-15      21327.0
29  ad_group_30 2015-12-15     110710.0
30  ad_group_31 2015-12-15     123616.0
31  ad_group_32 2015-12-15      39847.0
32  ad_group_33 2015-12-15      16240.0
33  ad_group_34 2015-12-15      38683.0
34  ad_group_35 2015-12-15      60008.0
35  ad_group_36 2015-12-15      71430.0
36  ad_group_37 2015-12-15      71485.0
37  ad_group_38 2015-12-15     177108.0
38  ad_group_39 2015-12-15      26122.0
39  ad_group_40 2015-12-15      80638.0




Q: Cluster ads into 3 groups: the ones whose avg_cost_per_click is going up, the ones whose avg_cost_per_click is flat and the ones whose avg_cost_per_click is going down


The simplest way to answer this is to build a linear regression for each ad group, with avg_cost_per_click as the dependent variable (y) and the day number (0, 1, 2, etc.) as the independent variable (x). We will then check the slope coefficient of each regression (the classification rule is sketched right after the list):

  1. Positive and significant coefficient -> going up
  2. Negative and significant coefficient -> going down
  3. Non-significant coefficient -> flat
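For concreteness, here is a minimal Python sketch of that classification rule, with the significance threshold Bonferroni-corrected for the 40 ad groups (0.05/40 = 0.00125), the same correction used in the code below:

#classify one regression result (slope coefficient and its p-value) into the three clusters
def classify_trend(coefficient, p_value, n_tests=40, alpha=0.05):
    threshold = alpha / n_tests   #Bonferroni-corrected significance level: 0.00125
    if p_value < threshold and coefficient > 0:
        return "going_up"
    if p_value < threshold and coefficient < 0:
        return "going_down"
    return "flat"

#example: the slope and p-value of ad_group_40 in the output below -> "flat"
print(classify_trend(0.0019, 0.0062))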



#firstly let's plot them to get a sense of whether cost per click is going up or not in general
ggplot(data, aes(y=avg_cost_per_click, x=date, colour=ad, group=ad)) + 
  geom_line(show.legend = FALSE) + 
  ggtitle("Avg Cost per Click") 


It is pretty messy, but overall things seem pretty flat.

#build all linear regressions via a for loop

#empty data frame
lr_output = data.frame(coefficients = NA, p_value = NA)

for (i in 1:length(unique(data$ad))){
  
  #subset dataset and keep variables we care about
  data_reg = subset(data, ad == unique(data$ad)[i])[,c("date", "avg_cost_per_click")]
  #replace dates with numbers. 0 is the minimum date, 1 is minimum date + 1 day and so on
  data_reg$date = as.numeric(data_reg$date - min(data_reg$date))
  
  #build the regression
  lr = lm(avg_cost_per_click ~ date, data=data_reg)
  #save coefficient and p-value
  lr_output[i,]=summary(lr)$coefficients[2,c(1,4)]
}

#add ad_group to the output dataset
lr_output$ad = unique(data$ad)

#classify based on the rule described above
lr_output$cluster = with(lr_output, 
                         #we use Bonferroni rule here for significance
                         ifelse(coefficients>0 & p_value<0.05/40, "going_up", 
                                ifelse(coefficients<0 & p_value<0.05/40, "going_down",
                                       "flat"
                           )
                         )
                        )
lr_output
    coefficients     p_value          ad cluster
1   9.264635e-04 0.145428675  ad_group_1    flat
2  -6.450572e-06 0.991094892  ad_group_2    flat
3   2.444767e-03 0.255564070  ad_group_3    flat
4  -5.768327e-04 0.804444949  ad_group_4    flat
5  -1.017822e-03 0.440260786  ad_group_5    flat
6  -1.733591e-03 0.077128597  ad_group_6    flat
7  -2.568940e-03 0.120068135  ad_group_7    flat
8  -2.689082e-03 0.026230860  ad_group_8    flat
9   4.434769e-05 0.434625747  ad_group_9    flat
10 -1.613688e-04 0.861238479 ad_group_10    flat
11  4.897502e-04 0.655741797 ad_group_11    flat
12  8.611293e-04 0.494786701 ad_group_12    flat
13  1.983768e-04 0.874992317 ad_group_13    flat
14 -7.762287e-04 0.168030042 ad_group_14    flat
15  4.055797e-04 0.533287316 ad_group_15    flat
16 -5.946124e-04 0.280798019 ad_group_16    flat
17 -2.741493e-05 0.894070982 ad_group_17    flat
18  6.251394e-04 0.691426469 ad_group_18    flat
19 -2.979686e-03 0.139845451 ad_group_19    flat
20  6.851306e-04 0.467956859 ad_group_20    flat
21 -1.026952e-03 0.234991383 ad_group_21    flat
22 -2.435091e-04 0.695846332 ad_group_22    flat
23 -5.389308e-04 0.641058957 ad_group_23    flat
24 -6.371408e-03 0.030699555 ad_group_24    flat
25 -9.972679e-05 0.870318346 ad_group_25    flat
26  7.240768e-04 0.699793717 ad_group_26    flat
27 -2.951464e-04 0.156160922 ad_group_27    flat
28 -1.831963e-03 0.188440544 ad_group_28    flat
29 -1.374928e-03 0.314216608 ad_group_29    flat
30 -1.136913e-04 0.878116815 ad_group_30    flat
31  7.982583e-05 0.449890625 ad_group_31    flat
32  4.258184e-03 0.033788052 ad_group_32    flat
33 -9.667796e-04 0.319079315 ad_group_33    flat
34  1.282051e-04 0.932790462 ad_group_34    flat
35 -4.120303e-04 0.163060175 ad_group_35    flat
36 -1.071601e-03 0.056770705 ad_group_36    flat
37  2.610063e-03 0.076241171 ad_group_37    flat
38 -8.087405e-04 0.432336355 ad_group_38    flat
39  1.140945e-03 0.327894385 ad_group_39    flat
40  1.925344e-03 0.006204814 ad_group_40    flat
table(lr_output$cluster)

flat 
  40 


Looks like everything is flat. Without the Bonferroni correction (i.e. had we chosen a less conservative p-value threshold), we would likely have found some significant ad groups.
Also, with a linear regression we are looking at the overall trend: old data points weigh just as much as the most recent ones. Had we focused only on the most recent data, we might well have found different results.



#firstly let's plot them to get a sense of whether cost per click is going up or not in general
g=sns.lineplot(x="date", y="avg_cost_per_click",hue="ad", data=data, legend="")
g.xaxis.set_major_locator(mdates.MonthLocator())
g.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))
g.set_title("Avg Cost per Click") 
plt.show()


It is pretty messy, but overall things seem pretty flat.

import statsmodels.api as sm
#build all linear regressions via a for loop

#empty vectors
coefficient = []
p_value = []

#for loop over all ad groups
for level in data['ad'].unique():
  
  #subset dataset and keep variables we care about
  data_reg = data.loc[data['ad']==level][["date", "avg_cost_per_click"]]
  #replace dates with numbers. 0 is the minimum date, 1 is minimum date + 1 day and so on
  data_reg['date'] = (data_reg['date'] - data_reg['date'].min()).dt.days
  #add intercept
  data_reg['intercept'] = 1
  
  #build the regression
  lr = sm.OLS(data_reg['avg_cost_per_click'],data_reg.drop('avg_cost_per_click', axis=1)).fit()
  #save coefficient and p-value
  coefficient.append(lr.params[0])
  p_value.append(lr.pvalues[0])


#put all together in a datframe
lr_output = pandas.DataFrame({
                  'ad_group': data['ad'].unique(),
                  'coefficients': coefficient,
                  'p_value': p_value})

#classify based on the rule described above. we use Bonferroni rule for significance
lr_output['cluster'] = np.where((lr_output['coefficients']>0) & (lr_output['p_value']<0.05/40), 'going_up',
                                np.where((lr_output['coefficients']<0) & (lr_output['p_value']<0.05/40), 'going_down',
                                         'flat'))
  
print(lr_output)                                         
       ad_group  coefficients   p_value cluster
0    ad_group_1      0.000926  0.145429    flat
1    ad_group_2     -0.000006  0.991095    flat
2    ad_group_3      0.002445  0.255564    flat
3    ad_group_4     -0.000577  0.804445    flat
4    ad_group_5     -0.001018  0.440261    flat
5    ad_group_6     -0.001734  0.077129    flat
6    ad_group_7     -0.002569  0.120068    flat
7    ad_group_8     -0.002689  0.026231    flat
8    ad_group_9      0.000044  0.434626    flat
9   ad_group_10     -0.000161  0.861238    flat
10  ad_group_11      0.000490  0.655742    flat
11  ad_group_12      0.000861  0.494787    flat
12  ad_group_13      0.000198  0.874992    flat
13  ad_group_14     -0.000776  0.168030    flat
14  ad_group_15      0.000406  0.533287    flat
15  ad_group_16     -0.000595  0.280798    flat
16  ad_group_17     -0.000027  0.894071    flat
17  ad_group_18      0.000625  0.691426    flat
18  ad_group_19     -0.002980  0.139845    flat
19  ad_group_20      0.000685  0.467957    flat
20  ad_group_21     -0.001027  0.234991    flat
21  ad_group_22     -0.000244  0.695846    flat
22  ad_group_23     -0.000539  0.641059    flat
23  ad_group_24     -0.006371  0.030700    flat
24  ad_group_25     -0.000100  0.870318    flat
25  ad_group_26      0.000724  0.699794    flat
26  ad_group_27     -0.000295  0.156161    flat
27  ad_group_28     -0.001832  0.188441    flat
28  ad_group_29     -0.001375  0.314217    flat
29  ad_group_30     -0.000114  0.878117    flat
30  ad_group_31      0.000080  0.449891    flat
31  ad_group_32      0.004258  0.033788    flat
32  ad_group_33     -0.000967  0.319079    flat
33  ad_group_34      0.000128  0.932790    flat
34  ad_group_35     -0.000412  0.163060    flat
35  ad_group_36     -0.001072  0.056771    flat
36  ad_group_37      0.002610  0.076241    flat
37  ad_group_38     -0.000809  0.432336    flat
38  ad_group_39      0.001141  0.327894    flat
39  ad_group_40      0.001925  0.006205    flat
print(lr_output['cluster'].value_counts())
flat    40
Name: cluster, dtype: int64


Looks like everything is flat. Without the Bonferroni correction (i.e. had we chosen a less conservative p-value threshold), we would likely have found some significant ad groups.
Also, with a linear regression we are looking at the overall trend: old data points weigh just as much as the most recent ones. Had we focused only on the most recent data, we might well have found different results.
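As a sketch of that last point (our own variation, not part of the analysis above): one could rerun the same regressions restricted to, say, each ad group's last 21 days and compare the resulting slopes. The window size is arbitrary.

#same regression as above, but using only each ad group's most recent 21 days
recent_coefficient = []
for level in data['ad'].unique():

  #subset to one ad group, then keep only its last 21 days
  data_reg = data.loc[data['ad']==level][["date", "avg_cost_per_click"]]
  data_reg = data_reg[data_reg['date'] >= data_reg['date'].max() - pandas.Timedelta(days=21)]

  #same setup as before: day number as x, plus an intercept
  data_reg['date'] = (data_reg['date'] - data_reg['date'].min()).dt.days
  data_reg['intercept'] = 1
  lr = sm.OLS(data_reg['avg_cost_per_click'], data_reg.drop('avg_cost_per_click', axis=1)).fit()
  recent_coefficient.append(lr.params[0])

#distribution of the recent slopes across ad groups
print(pandas.Series(recent_coefficient, index=data['ad'].unique()).describe())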

