Solution: Ads Analysis
Data
Let’s read the dataset and see which variables we have.
require(dplyr)
require(ggplot2)
#read the file
data = read.csv("https://drive.google.com/uc?export=download&id=1wB3x3bPIX7C19Rmse1bpZtYBmVcrPDGa")
dim(data)
[1] 2115 7
head(data)
date shown clicked converted avg_cost_per_click total_revenue ad
1 2015-10-01 65877 2339 43 0.90 641.62 ad_group_1
2 2015-10-02 65100 2498 38 0.94 756.37 ad_group_1
3 2015-10-03 70658 2313 49 0.86 970.90 ad_group_1
4 2015-10-04 69809 2833 51 1.01 907.39 ad_group_1
5 2015-10-05 68186 2696 41 1.00 879.45 ad_group_1
6 2015-10-06 66864 2617 46 0.98 746.48 ad_group_1
#make it a date
data$date=as.Date(data$date)
summary(data)
date shown clicked converted avg_cost_per_click total_revenue ad
Min. :2015-10-01 Min. : 0 Min. : 0 Min. : 0.0 Min. :0.000 Min. : -200.2 ad_group_1 : 53
1st Qu.:2015-10-14 1st Qu.: 28030 1st Qu.: 744 1st Qu.: 18.0 1st Qu.:0.760 1st Qu.: 235.5 ad_group_11: 53
Median :2015-10-27 Median : 54029 Median : 1392 Median : 41.0 Median :1.400 Median : 553.3 ad_group_12: 53
Mean :2015-10-27 Mean : 68300 Mean : 3056 Mean : 126.5 Mean :1.374 Mean : 1966.5 ad_group_13: 53
3rd Qu.:2015-11-09 3rd Qu.: 97314 3rd Qu.: 3366 3rd Qu.: 103.0 3rd Qu.:1.920 3rd Qu.: 1611.5 ad_group_15: 53
Max. :2015-11-22 Max. :192507 Max. :20848 Max. :1578.0 Max. :4.190 Max. :39623.7 ad_group_16: 53
(Other) :1797
The data look weird. For instance, there is negative revenue, which doesn’t make much sense. Let’s clean the data a bit by removing the impossible values. In a real-world situation, we would try to get to the bottom of this and figure out where the bad data are coming from.
#Revenue cannot be negative
paste("There are", nrow(subset(data, total_revenue<0)), "events with negative revenue")
[1] "There are 4 events with negative revenue"
#Remove those
data = subset(data, !total_revenue<0)
#Also, clicked should be <= shown and converted should be <= clicked. Let's see:
paste("There are", nrow(subset(data, shown<clicked | clicked<converted)), "events where the funnel doesn't make any sense")
[1] "There are 0 events where the funnel doesn't make any sense"
#Finally, there are a few zeros that seem weird, considering that avg values are very high. Let's plot and see:
ggplot(data, aes(y=shown, x=date, colour=ad, group=ad)) +
geom_line(show.legend = FALSE) +
ggtitle("Ad impressions")
Those sudden drops to zero definitely look weird. Let’s get rid of them and then check ad clicks.
#remove zero impression data
data = subset(data, shown>0)
#now check clicks
ggplot(data, aes(y=clicked, x=date, colour=ad, group=ad)) +
geom_line(show.legend = FALSE) +
ggtitle("Ad clicks")
There are still a few weird drops to zero, even after removing the rows where impressions were zero. Let’s remove these weird events too.
#remove zero click data
data = subset(data, clicked>0)
#now check conversion
ggplot(data, aes(y=converted, x=date, colour=ad, group=ad)) +
geom_line(show.legend = FALSE) +
ggtitle("Ad conversions")
summary(data)
date shown clicked converted avg_cost_per_click total_revenue ad
Min. :2015-10-01 Min. : 7494 Min. : 187.0 Min. : 0.0 Min. :0.080 Min. : 0.0 ad_group_1 : 53
1st Qu.:2015-10-14 1st Qu.: 28446 1st Qu.: 756.5 1st Qu.: 18.0 1st Qu.:0.780 1st Qu.: 244.8 ad_group_15: 53
Median :2015-10-27 Median : 54569 Median : 1413.0 Median : 42.0 Median :1.410 Median : 561.1 ad_group_17: 53
Mean :2015-10-26 Mean : 68854 Mean : 3088.7 Mean : 127.8 Mean :1.388 Mean : 1989.2 ad_group_2 : 53
3rd Qu.:2015-11-09 3rd Qu.:100396 3rd Qu.: 3413.5 3rd Qu.: 106.0 3rd Qu.:1.920 3rd Qu.: 1634.9 ad_group_22: 53
Max. :2015-11-22 Max. :192507 Max. :20848.0 Max. :1578.0 Max. :4.190 Max. :39623.7 ad_group_26: 53
(Other) :1773
We could spend more time looking at the data, but overall things seem to make sense now.
import pandas
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.dates as mdates
from matplotlib import rcParams
rcParams.update({'figure.autolayout': True})
pandas.set_option('display.max_columns', 10)
pandas.set_option('display.width', 350)
#read the file
data = pandas.read_csv("https://drive.google.com/uc?export=download&id=1wB3x3bPIX7C19Rmse1bpZtYBmVcrPDGa")
print(data.shape)
(2115, 7)
print(data.head())
date shown clicked converted avg_cost_per_click total_revenue ad
0 2015-10-01 65877 2339 43 0.90 641.62 ad_group_1
1 2015-10-02 65100 2498 38 0.94 756.37 ad_group_1
2 2015-10-03 70658 2313 49 0.86 970.90 ad_group_1
3 2015-10-04 69809 2833 51 1.01 907.39 ad_group_1
4 2015-10-05 68186 2696 41 1.00 879.45 ad_group_1
data['date'] = pandas.to_datetime(data['date'])
print(data.describe())
shown clicked converted avg_cost_per_click total_revenue
count 2115.000000 2115.000000 2115.000000 2115.000000 2115.000000
mean 68299.844444 3056.077069 126.453901 1.373749 1966.517589
std 48884.821409 3783.579969 233.420826 0.754331 3942.018757
min 0.000000 0.000000 0.000000 0.000000 -200.150000
25% 28030.500000 744.000000 18.000000 0.760000 235.470000
50% 54029.000000 1392.000000 41.000000 1.400000 553.300000
75% 97314.500000 3366.500000 103.000000 1.920000 1611.495000
max 192507.000000 20848.000000 1578.000000 4.190000 39623.710000
The data look weird. For instance, there is negative revenue, which doesn’t make much sense. Let’s clean the data a bit by removing the impossible values. In a real-world situation, we would try to get to the bottom of this and figure out where the bad data are coming from.
#Revenue cannot be negative
print("There are", len(data.query('total_revenue<0')), "events with negative revenue")
There are 4 events with negative revenue
#Remove those
data = data.query('total_revenue>=0')
#Also, clicked should be <= shown and converted should be <= clicked. Let's see:
print("There are", len(data.query('shown<clicked | clicked<converted')), "events where the funnel doesn't make any sense")
There are 0 events where the funnel doesn't make any sense
#Finally, there are a few zeros that seem weird, considering that avg values are very high. Let's plot and see:
g=sns.lineplot(x="date", y="shown",hue="ad", data=data, legend="")
g.xaxis.set_major_locator(mdates.MonthLocator())
g.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))
g.set_title("Ad impressions")
plt.show()
Those sudden drops to zero definitely look weird. Let’s get rid of them and then check ad clicks.
#remove zero impression data
data = data.query('shown>0')
#now check clicks
g=sns.lineplot(x="date", y="clicked",hue="ad", data=data, legend="")
g.xaxis.set_major_locator(mdates.MonthLocator())
g.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))
g.set_title("Ad clicks")
plt.show()
There are still a few weird drops to zero, even after removing the rows where impressions were zero. Let’s remove these weird events too.
#remove zero click data
data = data.query('clicked>0')
#now check conversion
g=sns.lineplot(x="date", y="converted",hue="ad", data=data, legend="")
g.xaxis.set_major_locator(mdates.MonthLocator())
g.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))
g.set_title("Ad conversions")
plt.show()
#check final dataset
print(data.describe())
shown clicked converted avg_cost_per_click total_revenue
count 2091.000000 2091.000000 2091.000000 2091.000000 2091.000000
mean 68853.852702 3088.744142 127.822573 1.387661 1989.242257
std 48799.535713 3792.468317 234.394500 0.746035 3958.838731
min 7494.000000 187.000000 0.000000 0.080000 0.000000
25% 28446.000000 756.500000 18.000000 0.780000 244.765000
50% 54569.000000 1413.000000 42.000000 1.410000 561.070000
75% 100395.500000 3413.500000 106.000000 1.920000 1634.895000
max 192507.000000 20848.000000 1578.000000 4.190000 39623.710000
We could spend more time looking at the data, but overall things seem to make sense now.
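If we needed to repeat this cleaning elsewhere, the filters above could be wrapped in a small helper. This is just a sketch assuming the same column names as the raw file; the function name is ours, not part of the original code.
#minimal sketch: apply the same sanity filters used above in one step
def clean_ads_data(raw):
    out = raw.copy()
    #revenue cannot be negative
    out = out.query('total_revenue >= 0')
    #the funnel must make sense: shown >= clicked >= converted
    out = out.query('shown >= clicked and clicked >= converted')
    #drop the weird zero-impression / zero-click days
    out = out.query('shown > 0 and clicked > 0')
    return out
#e.g. data = clean_ads_data(pandas.read_csv("..."))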
Q: If you had to identify the 5 best ad groups, which ones would they be? Which metric did you choose to identify the best ones? Why? Explain the pros of your metric as well as the possible cons. From a business perspective, choosing that metric implies that you are focusing on what?
There are obviously tons of ad-related metrics, and they are all plausible and useful in different situations, depending on the main business focus of a given company. A few examples are below, followed by a small sketch of how they could be computed:
- Click-through rate (CTR), # of clicks / # of impressions, i.e. of all the people seeing the ad, how many click on it?
  - Pros: as discussed extensively in this course’s case studies, it is great for identifying demand from users. Also, since it only takes into account what the user does after seeing the ad, it can be easily estimated and allows us to test and improve ad characteristics to maximize the number of users coming to the site.
  - Cons: it carries no information about what the user does once they come to the site, nor about costs. A very high CTR is not that useful if none of those users converts. It can also easily be gamed via click-bait ads.
  - This metric is purely growth-based: the goal is to bring users to the site.
- Conversion rate (CR), # of conversions / # of impressions, i.e. of all the people seeing the ad, how many end up becoming customers?
  - Pros: fixes most of CTR’s issues. After all, the goal of an ad is to get customers, and that is what this metric represents.
  - Cons: it carries no information about costs. Also, conversion on the site depends heavily on the product: an ad could be great at capturing demand, but if the product is bad, no one converts. Finally, linking a conversion to an ad click is pretty messy; the conversion might happen after a long time, or after multiple visits through different channels, etc.
  - This metric combines both growth and product.
- Profits, i.e. revenue from conversions - ad costs
  - Pros: very concrete and business driven. Am I making money via ads?
  - Cons: same as CR, it is hard to attribute revenue to an ad; not only are there the issues described above, but here you would also need to predict user lifetime value to tell whether an ad is profitable. Also, it only considers users after they click on the ad, so it has no information about the top of the funnel, i.e. how many users see the ad and choose not to click.
  - This metric focuses on the business/product side, but it is hardly useful from a growth standpoint. It is typically used by very large companies whose main goal is being profitable.
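As a quick reference, here is a minimal Python sketch of how all three metrics could be computed per ad group from the cleaned data above; the simple daily averages are our choice of aggregation, not prescribed by the question.
#per-group daily averages of CTR, conversion rate and profit
ad_metrics = data.groupby('ad').apply(
    lambda x: pandas.Series({
        'CTR': (x['clicked'] / x['shown']).mean(),
        'CR': (x['converted'] / x['shown']).mean(),
        'profit': (x['total_revenue'] - x['clicked'] * x['avg_cost_per_click']).mean()
    })
).reset_index()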
For this exercise, let’s look into profits and CTR. We will pick the top 5 ads based on CTR, as long as profits are above a certain threshold. In plain English, that means we want the ads with the highest growth potential, after removing the ads that would make us go broke.
#Let's get CTR and profits per ad group on an avg
data_rank = data %>%
group_by(ad) %>%
summarize (CTR = mean(clicked/shown),
profits = mean(total_revenue-clicked*avg_cost_per_click),
count = mean(shown)
)
#check minimum ad count just to make sure we have enough data for all groups. Looks like we are fine
min(data_rank$count)
[1] 8461.096
#Let's plot them
qplot(profits, CTR, data=data_rank)+
geom_vline(xintercept=-5000)
Based on the plot, we can set the profitability threshold at -5K. We will remove all ads below that and then find the ads with the highest CTR among the remaining ones. Note that here we are considering the short-term revenue from the ad, and we expect that number to be negative. An ad should be positive once we take into account a user’s lifetime value (as usual, lifetime -> 1 yr).
#Remove ads that are making us lose too much money and find top 5 based on CTR among the other ones
data_rank %>%
filter (profits > -5000) %>%
arrange(desc(CTR)) %>%
head(5)
# A tibble: 5 x 4
ad CTR profits count
<fct> <dbl> <dbl> <dbl>
1 ad_group_19 0.0942 -3460. 19507.
2 ad_group_28 0.0892 -2644. 19886.
3 ad_group_13 0.0856 -3573. 161519.
4 ad_group_34 0.0843 -1189. 35372.
5 ad_group_5 0.0639 -3234. 52406.
#Let's get CTR and profits per ad group on an avg
data_rank = data.groupby('ad').apply(
lambda x: pandas.Series({
'CTR': (x['clicked']/x['shown']).mean(),
'profits': (x['total_revenue']-x['clicked']*x['avg_cost_per_click']).mean(),
'count': x['shown'].mean()
})
).reset_index()
#check minimum ad count just to make sure we have enough data for all groups. Looks like we are fine
print(data_rank['count'].min())
8461.096153846154
#Let's plot them
plt.scatter(data_rank['profits'], data_rank['CTR'])
plt.axvline(x=-5000)
plt.xlabel('profits')
plt.ylabel('CTR')
plt.show()
Based on the plot, we can set the profitability threshold at -5K. We will remove all ads below that and then find the ads with the highest CTR among the remaining ones. Note that here we are considering the short-term revenue from the ad, and we expect that number to be negative. An ad should be positive once we take into account a user’s lifetime value (as usual, lifetime -> 1 yr).
#Remove ads that are making us lose too much money and find top 5 based on CTR among the other ones
print(data_rank.query('profits > -5000').sort_values('CTR', ascending=False).head(5))
ad CTR count profits
10 ad_group_19 0.094188 19506.884615 -3460.419038
20 ad_group_28 0.089191 19886.339623 -2644.280943
4 ad_group_13 0.085635 161519.176471 -3573.196275
27 ad_group_34 0.084257 35371.622642 -1189.270000
35 ad_group_5 0.063862 52406.461538 -3233.796538
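To make the lifetime-value caveat concrete, here is a hedged sketch: with an assumed average 1-year revenue per converted user (the ltv number below is purely hypothetical, it is not in the data), an ad group would be considered profitable when expected lifetime revenue exceeds ad spend.
#hypothetical check: is an ad group profitable once a 1-year LTV is assumed?
ltv = 100.0  #assumed avg 1-year revenue per converted user, illustrative only
ltv_check = data.groupby('ad').apply(
    lambda x: pandas.Series({
        'ad_cost': (x['clicked'] * x['avg_cost_per_click']).mean(),
        'ltv_revenue': (x['converted'] * ltv).mean()
    })
).reset_index()
ltv_check['profitable_with_ltv'] = ltv_check['ltv_revenue'] > ltv_check['ad_cost']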
Q: For each group, predict how many ads will be shown on Dec 15
The most common ways to do this are either using a GAM or using some time series package. In this case, we will do it via the time series package prophet.
library(prophet)
#Let's firstly take just one ad_group and do it manually for that one as an example. Then, we will apply it to all others
data_ad1 = subset(data, ad=="ad_group_1")[, c("date", "shown")]
qplot(date, shown, data=data_ad1, geom="line", main="Ad group 1")
#Let's build a time series using prophet. From previous plots, time series appear pretty straightforward and flattish. Default values should do fine
colnames(data_ad1)=c("ds", "y")
#build the model
ts = prophet(data_ad1)
#make predictions until Dec 15. This is the number of days for which we need a prediction to get to Dec 15
days_predictions = as.numeric(as.Date("2015-12-15")-max(data_ad1$ds))
future_data = make_future_dataframe(ts, periods = days_predictions)
predictions = predict(ts, future_data)
#let's plot them
plot(ts, predictions)
#and plot the series component
prophet_plot_components(ts, predictions)
We can see that there is some (very, very small) weekly variation. Most importantly, the time series is overall pretty flat in October and starts going up in November. Predictions for Dec 15 tend to be up, given that the model assumes the time series keeps following its most recent trend. Obviously, this would be much more useful with yearly data, so we would be able to capture yearly seasonality.
Let’s now do it for all the ad groups and get the predictions.
#data set to store results
predictions_all_groups = data.frame(date = as.Date("2015-12-15"),
ad_group = unique(data$ad),
prediction = NA
)
#We will do it via a for loop through all the ad groups
for (i in 1:nrow(predictions_all_groups)){
#subset data, only keep 1 ad group and columns needed
tmp = subset(data, ad == predictions_all_groups$ad_group[i])[, c("date", "shown")]
colnames(tmp)=c("ds", "y")
#build the model
ts_tmp = prophet(tmp)
#make predictions until Dec 15. This is the number of days for which we need a prediction to get to Dec 15
days_predictions_tmp = as.numeric(as.Date("2015-12-15")-max(tmp$ds))
future_data_tmp = make_future_dataframe(ts_tmp, periods = days_predictions_tmp)
predictions_tmp = predict(ts_tmp, future_data_tmp)
#store results for the last day, i.e. Dec 15 prediction
predictions_all_groups$prediction[i] = round(tail(predictions_tmp$yhat,1))
}
predictions_all_groups
date ad_group prediction
1 2015-12-15 ad_group_1 77558
2 2015-12-15 ad_group_2 50305
3 2015-12-15 ad_group_3 152888
4 2015-12-15 ad_group_4 94679
5 2015-12-15 ad_group_5 54750
6 2015-12-15 ad_group_6 40587
7 2015-12-15 ad_group_7 56988
8 2015-12-15 ad_group_8 47171
9 2015-12-15 ad_group_9 124642
10 2015-12-15 ad_group_10 127043
11 2015-12-15 ad_group_11 19964
12 2015-12-15 ad_group_12 28000
13 2015-12-15 ad_group_13 162073
14 2015-12-15 ad_group_14 8715
15 2015-12-15 ad_group_15 15773
16 2015-12-15 ad_group_16 30494
17 2015-12-15 ad_group_17 134101
18 2015-12-15 ad_group_18 89749
19 2015-12-15 ad_group_19 19889
20 2015-12-15 ad_group_20 120417
21 2015-12-15 ad_group_21 28320
22 2015-12-15 ad_group_22 26866
23 2015-12-15 ad_group_23 48616
24 2015-12-15 ad_group_24 37394
25 2015-12-15 ad_group_25 174387
26 2015-12-15 ad_group_26 72261
27 2015-12-15 ad_group_27 65385
28 2015-12-15 ad_group_28 20984
29 2015-12-15 ad_group_29 21327
30 2015-12-15 ad_group_30 110710
31 2015-12-15 ad_group_31 123616
32 2015-12-15 ad_group_32 39847
33 2015-12-15 ad_group_33 16240
34 2015-12-15 ad_group_34 38683
35 2015-12-15 ad_group_35 60008
36 2015-12-15 ad_group_36 71430
37 2015-12-15 ad_group_37 71485
38 2015-12-15 ad_group_38 177108
39 2015-12-15 ad_group_39 26122
40 2015-12-15 ad_group_40 80638
import fbprophet
#Let's firstly take just one ad_group and do it manually for that one as an example. Then, we will apply it to all others
data_ad1 = data.query('ad == "ad_group_1"')[["date", "shown"]]
#plot
g=sns.lineplot(x="date", y="shown", data=data_ad1)
g.xaxis.set_major_locator(mdates.MonthLocator())
g.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))
g.set_title("Ad group 1")
plt.show()
#Let's build time series using prophet. From previous plots, time series appear pretty straightforward and flattish. Default values should do fine
#change column names and index
data_ad1['ds'] = data_ad1['date']
data_ad1.rename({'shown': 'y'}, axis=1, inplace=True)
data_ad1.set_index('date', inplace=True)
#build the model
ts = fbprophet.Prophet()
ts.fit(data_ad1)
<fbprophet.forecaster.Prophet object at 0x7eaec2144eb8>
#make predictions until Dec 15. This is the number of days for which we need a prediction to get to Dec 15
days_predictions = (pandas.to_datetime("2015-12-15")-data_ad1['ds'].max()).days
future_data = ts.make_future_dataframe(periods = days_predictions)
predictions = ts.predict(future_data)
#let's plot them
ts.plot(predictions)
plt.show()
#and plot the series component
ts.plot_components(predictions)
plt.show()
We can see that there is some (very, very small) weekly variation. Most importantly, the time series is overall pretty flat in October and starts going up in November. Predictions for Dec 15 tend to be up, given that the model assumes the time series keeps following its most recent trend. Obviously, this would be much more useful with yearly data, so we would be able to capture yearly seasonality.
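As a rough sanity check on the idea that the model extrapolates the recent trend, a crude baseline (our own addition, not part of the original solution) is to fit a straight line to ad_group_1's impressions and extrapolate it to Dec 15; this gives a quick point of comparison for prophet's forecast, while of course missing the trend change in November.
import numpy as np
#crude linear-trend baseline for ad_group_1, reusing data_ad1 (columns 'ds' and 'y') from above
days = (data_ad1['ds'] - data_ad1['ds'].min()).dt.days.values
slope, intercept = np.polyfit(days, data_ad1['y'].values, 1)
day_dec15 = (pandas.to_datetime("2015-12-15") - data_ad1['ds'].min()).days
print("linear-trend baseline for Dec 15:", round(slope * day_dec15 + intercept))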
Let’s now do it for all the ad groups and get the predictions.
#data set to store results
predictions_all_groups = pandas.DataFrame({'ad_group':data['ad'].unique(),
'date': pandas.to_datetime("2015-12-15")})
prediction_dec_15 = []
#We will do it via a for loop through all the ad groups
for level in predictions_all_groups.ad_group:
#subset data, only keep 1 ad group and columns needed
tmp = data.loc[data['ad']==level][["date", "shown"]]
#change column names and index
tmp['ds'] = tmp['date']
tmp.rename({'shown': 'y'}, axis=1, inplace=True)
tmp.set_index('date', inplace=True)
#build the model
ts_tmp = fbprophet.Prophet()
ts_tmp.fit(tmp)
#make predictions until Dec 15. This is the number of days for which we need a prediction to get to Dec 15
days_predictions_tmp = (pandas.to_datetime("2015-12-15")-tmp['ds'].max()).days
future_data_tmp = ts_tmp.make_future_dataframe(periods = days_predictions_tmp)
predictions_tmp = ts_tmp.predict(future_data_tmp)
#store results for the last day, i.e. Dec 15 prediction
prediction_dec_15.append(round(predictions_tmp['yhat'].iat[-1]))
predictions_all_groups['predictions'] = prediction_dec_15
print(predictions_all_groups)
ad_group date predictions
0 ad_group_1 2015-12-15 77558.0
1 ad_group_2 2015-12-15 50305.0
2 ad_group_3 2015-12-15 152888.0
3 ad_group_4 2015-12-15 94679.0
4 ad_group_5 2015-12-15 54750.0
5 ad_group_6 2015-12-15 40587.0
6 ad_group_7 2015-12-15 56988.0
7 ad_group_8 2015-12-15 47171.0
8 ad_group_9 2015-12-15 124642.0
9 ad_group_10 2015-12-15 127043.0
10 ad_group_11 2015-12-15 19964.0
11 ad_group_12 2015-12-15 28000.0
12 ad_group_13 2015-12-15 162073.0
13 ad_group_14 2015-12-15 8715.0
14 ad_group_15 2015-12-15 15773.0
15 ad_group_16 2015-12-15 30494.0
16 ad_group_17 2015-12-15 134101.0
17 ad_group_18 2015-12-15 89749.0
18 ad_group_19 2015-12-15 19889.0
19 ad_group_20 2015-12-15 120417.0
20 ad_group_21 2015-12-15 28320.0
21 ad_group_22 2015-12-15 26866.0
22 ad_group_23 2015-12-15 48616.0
23 ad_group_24 2015-12-15 37394.0
24 ad_group_25 2015-12-15 174387.0
25 ad_group_26 2015-12-15 72261.0
26 ad_group_27 2015-12-15 65385.0
27 ad_group_28 2015-12-15 20984.0
28 ad_group_29 2015-12-15 21327.0
29 ad_group_30 2015-12-15 110710.0
30 ad_group_31 2015-12-15 123616.0
31 ad_group_32 2015-12-15 39847.0
32 ad_group_33 2015-12-15 16240.0
33 ad_group_34 2015-12-15 38683.0
34 ad_group_35 2015-12-15 60008.0
35 ad_group_36 2015-12-15 71430.0
36 ad_group_37 2015-12-15 71485.0
37 ad_group_38 2015-12-15 177108.0
38 ad_group_39 2015-12-15 26122.0
39 ad_group_40 2015-12-15 80638.0
Q: Cluster ads into 3 groups: the ones whose avg_cost_per_click is going up, the ones whose avg_cost_per_click is flat and the ones whose avg_cost_per_click is going down
The simplest way to answer this is to build a linear regression for each ad group, with avg_cost_per_click as the y variable and the day as a number (0, 1, 2, etc.) as the x variable. We will then check the slope coefficient of the regression:
- Positive and significant coefficient -> going up
- Negative and significant coefficient -> going down
- Non-significant coefficient -> flat
#firstly let's plot them to get a sense of whether cost per click is going up or not in general
ggplot(data, aes(y=avg_cost_per_click, x=date, colour=ad, group=ad)) +
geom_line(show.legend = FALSE) +
ggtitle("Avg Cost per Click")
It is pretty messy, but overall things seem pretty flat.
#build all linear regressions via a for loop
#empty data frame
lr_output = data.frame(coefficients = NA, p_value = NA)
for (i in 1:length(unique(data$ad))){
#subset dataset and keep variables we care about
data_reg = subset(data, ad == unique(data$ad)[i])[,c("date", "avg_cost_per_click")]
#replace dates with numbers. 0 is the minimum date, 1 is minimum date + 1 day and so on
data_reg$date = as.numeric(data_reg$date - min(data_reg$date))
#build the regression
lr = lm(avg_cost_per_click ~ date, data=data_reg)
#save coefficient and p-value
lr_output[i,]=summary(lr)$coefficients[2,c(1,4)]
}
#add ad_group to the output dataset
lr_output$ad = unique(data$ad)
#classify based on the rule described above
lr_output$cluster = with(lr_output,
#we use Bonferroni rule here for significance
ifelse(coefficients>0 & p_value<0.05/40, "going_up",
ifelse(coefficients<0 & p_value<0.05/40, "going_down",
"flat"
)
)
)
lr_output
coefficients p_value ad cluster
1 9.264635e-04 0.145428675 ad_group_1 flat
2 -6.450572e-06 0.991094892 ad_group_2 flat
3 2.444767e-03 0.255564070 ad_group_3 flat
4 -5.768327e-04 0.804444949 ad_group_4 flat
5 -1.017822e-03 0.440260786 ad_group_5 flat
6 -1.733591e-03 0.077128597 ad_group_6 flat
7 -2.568940e-03 0.120068135 ad_group_7 flat
8 -2.689082e-03 0.026230860 ad_group_8 flat
9 4.434769e-05 0.434625747 ad_group_9 flat
10 -1.613688e-04 0.861238479 ad_group_10 flat
11 4.897502e-04 0.655741797 ad_group_11 flat
12 8.611293e-04 0.494786701 ad_group_12 flat
13 1.983768e-04 0.874992317 ad_group_13 flat
14 -7.762287e-04 0.168030042 ad_group_14 flat
15 4.055797e-04 0.533287316 ad_group_15 flat
16 -5.946124e-04 0.280798019 ad_group_16 flat
17 -2.741493e-05 0.894070982 ad_group_17 flat
18 6.251394e-04 0.691426469 ad_group_18 flat
19 -2.979686e-03 0.139845451 ad_group_19 flat
20 6.851306e-04 0.467956859 ad_group_20 flat
21 -1.026952e-03 0.234991383 ad_group_21 flat
22 -2.435091e-04 0.695846332 ad_group_22 flat
23 -5.389308e-04 0.641058957 ad_group_23 flat
24 -6.371408e-03 0.030699555 ad_group_24 flat
25 -9.972679e-05 0.870318346 ad_group_25 flat
26 7.240768e-04 0.699793717 ad_group_26 flat
27 -2.951464e-04 0.156160922 ad_group_27 flat
28 -1.831963e-03 0.188440544 ad_group_28 flat
29 -1.374928e-03 0.314216608 ad_group_29 flat
30 -1.136913e-04 0.878116815 ad_group_30 flat
31 7.982583e-05 0.449890625 ad_group_31 flat
32 4.258184e-03 0.033788052 ad_group_32 flat
33 -9.667796e-04 0.319079315 ad_group_33 flat
34 1.282051e-04 0.932790462 ad_group_34 flat
35 -4.120303e-04 0.163060175 ad_group_35 flat
36 -1.071601e-03 0.056770705 ad_group_36 flat
37 2.610063e-03 0.076241171 ad_group_37 flat
38 -8.087405e-04 0.432336355 ad_group_38 flat
39 1.140945e-03 0.327894385 ad_group_39 flat
40 1.925344e-03 0.006204814 ad_group_40 flat
table(lr_output$cluster)
flat
40
Looks like everything is flat. Without the Bonferroni correction (i.e. had we chosen a less conservative p-value threshold), we would likely have found some significant ads.
Also, with a linear regression we are looking at the overall trend: old data points weigh just as much as the most recent ones. Had we focused only on the most recent data, we might well have found different results.
#firstly let's plot them to get a sense of whether cost per click is going up or not in general
g=sns.lineplot(x="date", y="avg_cost_per_click",hue="ad", data=data, legend="")
g.xaxis.set_major_locator(mdates.MonthLocator())
g.xaxis.set_major_formatter(mdates.DateFormatter('%b %d'))
g.set_title("Avg Cost per Click")
plt.show()
It is pretty messy, but overall things seem pretty flat.
import statsmodels.api as sm
#build all linear regressions via a for loop
#empty vectors
coefficient = []
p_value = []
#for loop over all ad groups
for level in data['ad'].unique():
#subset dataset and keep variables we care about
data_reg = data.loc[data['ad']==level][["date", "avg_cost_per_click"]]
#replace dates with numbers. 0 is the minimum date, 1 is minimum date + 1 day and so on
data_reg['date'] = (data_reg['date'] - data_reg['date'].min()).dt.days
#add intercept
data_reg['intercept'] = 1
#build the regression
lr = sm.OLS(data_reg['avg_cost_per_click'],data_reg.drop('avg_cost_per_click', axis=1)).fit()
#save coefficient and p-value
coefficient.append(lr.params[0])
p_value.append(lr.pvalues[0])
#put it all together in a dataframe
lr_output = pandas.DataFrame({
'ad_group': data['ad'].unique(),
'coefficients': coefficient,
'p_value': p_value})
#classify based on the rule described above. we use Bonferroni rule for significance
lr_output['cluster'] = np.where((lr_output['coefficients']>0) & (lr_output['p_value']<0.05/40), 'going_up',
np.where((lr_output['coefficients']<0) & (lr_output['p_value']<0.05/40), 'going_down',
'flat'))
print(lr_output)
ad_group coefficients p_value cluster
0 ad_group_1 0.000926 0.145429 flat
1 ad_group_2 -0.000006 0.991095 flat
2 ad_group_3 0.002445 0.255564 flat
3 ad_group_4 -0.000577 0.804445 flat
4 ad_group_5 -0.001018 0.440261 flat
5 ad_group_6 -0.001734 0.077129 flat
6 ad_group_7 -0.002569 0.120068 flat
7 ad_group_8 -0.002689 0.026231 flat
8 ad_group_9 0.000044 0.434626 flat
9 ad_group_10 -0.000161 0.861238 flat
10 ad_group_11 0.000490 0.655742 flat
11 ad_group_12 0.000861 0.494787 flat
12 ad_group_13 0.000198 0.874992 flat
13 ad_group_14 -0.000776 0.168030 flat
14 ad_group_15 0.000406 0.533287 flat
15 ad_group_16 -0.000595 0.280798 flat
16 ad_group_17 -0.000027 0.894071 flat
17 ad_group_18 0.000625 0.691426 flat
18 ad_group_19 -0.002980 0.139845 flat
19 ad_group_20 0.000685 0.467957 flat
20 ad_group_21 -0.001027 0.234991 flat
21 ad_group_22 -0.000244 0.695846 flat
22 ad_group_23 -0.000539 0.641059 flat
23 ad_group_24 -0.006371 0.030700 flat
24 ad_group_25 -0.000100 0.870318 flat
25 ad_group_26 0.000724 0.699794 flat
26 ad_group_27 -0.000295 0.156161 flat
27 ad_group_28 -0.001832 0.188441 flat
28 ad_group_29 -0.001375 0.314217 flat
29 ad_group_30 -0.000114 0.878117 flat
30 ad_group_31 0.000080 0.449891 flat
31 ad_group_32 0.004258 0.033788 flat
32 ad_group_33 -0.000967 0.319079 flat
33 ad_group_34 0.000128 0.932790 flat
34 ad_group_35 -0.000412 0.163060 flat
35 ad_group_36 -0.001072 0.056771 flat
36 ad_group_37 0.002610 0.076241 flat
37 ad_group_38 -0.000809 0.432336 flat
38 ad_group_39 0.001141 0.327894 flat
39 ad_group_40 0.001925 0.006205 flat
print(lr_output['cluster'].value_counts())
flat 40
Name: cluster, dtype: int64
Looks like everything is flat. Without the Bonferroni correction (i.e. had we chosen a less conservative p-value threshold), we would likely have found some significant ads.
Also, with a linear regression we are looking at the overall trend: old data points weigh just as much as the most recent ones. Had we focused only on the most recent data, we might well have found different results.
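To illustrate that last point, here is a minimal sketch (our own, not part of the original solution) that re-fits each group's regression on only the most recent two weeks of data, reusing the data, pandas and sm objects from above; the 14-day window is an arbitrary choice, and with so few points per group the slopes and p-values can of course change quite a bit.
#re-fit the slope using only the last 14 days per ad group (illustrative window)
recent_coefficient = []
recent_p_value = []
for level in data['ad'].unique():
    data_reg = data.loc[data['ad'] == level][["date", "avg_cost_per_click"]]
    #keep only the most recent two weeks for this group
    data_reg = data_reg[data_reg['date'] >= data_reg['date'].max() - pandas.Timedelta(days=13)]
    data_reg['date'] = (data_reg['date'] - data_reg['date'].min()).dt.days
    data_reg['intercept'] = 1
    lr = sm.OLS(data_reg['avg_cost_per_click'], data_reg[['date', 'intercept']]).fit()
    recent_coefficient.append(lr.params['date'])
    recent_p_value.append(lr.pvalues['date'])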