Regressions
Regressions and Coefficients
We will focus here on logistic regression, given that the label we are trying to predict (“clicked”) is binary. However, the overall approach would be similar if you were dealing with a linear regression. After all, a logistic regression can be seen as a linear method with a particular link function (the logit) that constrains the output between 0 and 1, so that it can be used for binary classification problems.
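For reference, the model being fit has the standard logistic form (nothing specific to this dataset):

$$\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k \quad\Longleftrightarrow\quad p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}$$

where p is the probability that clicked = 1. The coefficients are what we will interpret throughout this section.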
Code
#Read from google drive. This is the same dataset described in the previous section
#stringsAsFactors = TRUE is needed on R >= 4.0, where strings are no longer read as factors by default
data = read.csv("https://drive.google.com/uc?export=download&id=1PXjbqSMu__d_ppEv92i_Gnx3kKgfvhFk", stringsAsFactors = TRUE)
#Before building the regression, we need to know which ones are the reference levels for the categorical variables
#only keep categorical variables
dt = data[,sapply(data, is.factor)]
#find first level. These are the reference levels
sapply(sapply(dt, levels), "[[", 1)
email_text email_version weekday user_country
"long_email" "generic" "Friday" "ES"
#build logistic regression
log.reg = glm(clicked ~ ., data = data, family = binomial)
#print coefficients and their pvalues
summary(log.reg)$coefficients
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.880922e+00 1.560551e-01 -44.0928996 0.000000e+00
email_id -3.848609e-08 7.780270e-08 -0.4946626 6.208383e-01
email_textshort_email 2.793085e-01 4.530413e-02 6.1651878 7.039953e-10
email_versionpersonalized 6.387251e-01 4.691389e-02 13.6148404 3.268591e-42
hour 1.670684e-02 5.005810e-03 3.3374906 8.453859e-04
weekdayMonday 5.410326e-01 9.340848e-02 5.7921141 6.950589e-09
weekdaySaturday 2.828638e-01 9.777452e-02 2.8930214 3.815553e-03
weekdaySunday 1.836278e-01 1.001176e-01 1.8341213 6.663599e-02
weekdayThursday 6.254040e-01 9.233836e-02 6.7729595 1.261743e-11
weekdayTuesday 6.162222e-01 9.237057e-02 6.6711960 2.537271e-11
weekdayWednesday 7.554637e-01 9.084352e-02 8.3160993 9.090564e-17
user_countryFR -7.864558e-02 1.625708e-01 -0.4837621 6.285547e-01
user_countryUK 1.155254e+00 1.220474e-01 9.4656215 2.918203e-21
user_countryUS 1.141360e+00 1.159490e-01 9.8436412 7.301922e-23
user_past_purchases 1.878107e-01 5.725710e-03 32.8012980 5.642459e-236
#clean the output a bit
output = data.frame(summary(log.reg)$coefficients)
#fix column names
colnames(output) = c("Coefficient_Value", "SE", "z_value", "p_value")
#only keep significant variables
output = subset(output, p_value < 0.05)
#get final results ordered by coefficient value
output[order(output$Coefficient_Value, decreasing=T),]
Coefficient_Value SE z_value p_value
user_countryUK 1.15525449 0.12204740 9.465622 2.918203e-21
user_countryUS 1.14136025 0.11594899 9.843641 7.301922e-23
weekdayWednesday 0.75546370 0.09084352 8.316099 9.090564e-17
email_versionpersonalized 0.63872512 0.04691389 13.614840 3.268591e-42
weekdayThursday 0.62540395 0.09233836 6.772960 1.261743e-11
weekdayTuesday 0.61622219 0.09237057 6.671196 2.537271e-11
weekdayMonday 0.54103257 0.09340848 5.792114 6.950589e-09
weekdaySaturday 0.28286377 0.09777452 2.893021 3.815553e-03
email_textshort_email 0.27930848 0.04530413 6.165188 7.039953e-10
user_past_purchases 0.18781071 0.00572571 32.801298 5.642459e-236
hour 0.01670684 0.00500581 3.337491 8.453859e-04
(Intercept) -6.88092187 0.15605510 -44.092900 0.000000e+00
#Same analysis in Python
import pandas
import statsmodels.api as sm
pandas.set_option('display.max_columns', 10)
pandas.set_option('display.width', 350)
#Read from google drive. This is the same dataset described in the previous section
data = pandas.read_csv('https://drive.google.com/uc?export=download&id=1PXjbqSMu__d_ppEv92i_Gnx3kKgfvhFk')
#Before building the regression, we need to know which ones are the reference levels for the categorical variables
#only keep categorical variables
data_categorical = data.select_dtypes(['object']).astype("category")
#find reference level, i.e. the first level
print(data_categorical.apply(lambda x: x.cat.categories[0]))
email_text long_email
email_version generic
weekday Friday
user_country ES
dtype: object
#make dummy variables from categorical ones. Using one-hot encoding and drop_first=True. The latest stable version of sm (0.14) requires float conversion
data = pandas.get_dummies(data, drop_first=True).astype('float')
#add intercept
data['intercept'] = 1
#drop the label
train_cols = data.drop('clicked', axis=1)
#Build Logistic Regression
logit = sm.Logit(data['clicked'], train_cols)
output = logit.fit()
Optimization terminated successfully.
Current function value: 0.092770
Iterations 9
output_table = pandas.DataFrame(dict(coefficients = output.params, SE = output.bse, z = output.tvalues, p_values = output.pvalues))
#get coefficients and pvalues
print(output_table)
coefficients SE z p_values
email_id -3.848609e-08 7.780379e-08 -0.494656 6.208432e-01
hour 1.670684e-02 5.005879e-03 3.337445 8.455247e-04
user_past_purchases 1.878107e-01 5.725787e-03 32.800855 5.725039e-236
email_text_short_email 2.793085e-01 4.530477e-02 6.165101 7.043829e-10
email_version_personalized 6.387251e-01 4.691461e-02 13.614631 3.277989e-42
weekday_Monday 5.410326e-01 9.341014e-02 5.792011 6.954864e-09
weekday_Saturday 2.828638e-01 9.777629e-02 2.892969 3.816190e-03
weekday_Sunday 1.836278e-01 1.001194e-01 1.834088 6.664099e-02
weekday_Thursday 6.254040e-01 9.233999e-02 6.772839 1.262790e-11
weekday_Tuesday 6.162222e-01 9.237223e-02 6.671077 2.539336e-11
weekday_Wednesday 7.554637e-01 9.084515e-02 8.315950 9.102053e-17
user_country_FR -7.864563e-02 1.625969e-01 -0.483685 6.286097e-01
user_country_UK 1.155255e+00 1.220603e-01 9.464618 2.946372e-21
user_country_US 1.141360e+00 1.159626e-01 9.842487 7.386228e-23
intercept -6.880922e+00 1.560666e-01 -44.089646 0.000000e+00
#only keep significant variables and order results by coefficient value
print(output_table.loc[output_table['p_values'] < 0.05].sort_values("coefficients", ascending=False))
coefficients SE z p_values
user_country_UK 1.155255 0.122060 9.464618 2.946372e-21
user_country_US 1.141360 0.115963 9.842487 7.386228e-23
weekday_Wednesday 0.755464 0.090845 8.315950 9.102053e-17
email_version_personalized 0.638725 0.046915 13.614631 3.277989e-42
weekday_Thursday 0.625404 0.092340 6.772839 1.262790e-11
weekday_Tuesday 0.616222 0.092372 6.671077 2.539336e-11
weekday_Monday 0.541033 0.093410 5.792011 6.954864e-09
weekday_Saturday 0.282864 0.097776 2.892969 3.816190e-03
email_text_short_email 0.279308 0.045305 6.165101 7.043829e-10
user_past_purchases 0.187811 0.005726 32.800855 5.725039e-236
hour 0.016707 0.005006 3.337445 8.455247e-04
intercept -6.880922 0.156067 -44.089646 0.000000e+00
Understanding the output
- Categorical Variables
- All categorical variables are encoded via one-hot encoding. If there are n levels within a categorical variable, we are creating n-1 dummy variables. The remaining level is the reference level or baseline
- For instance, weekday appears with 6 levels in the regression output: Monday, Saturday, Sunday, Thursday, Tuesday, and Wednesday. The missing one, Friday, is the baseline (the variable has 7 levels, but the reference level gets no dummy)
- The way to interpret the output for categorical variables is that the coefficient of each level is relative to the missing level. Here, all days are better than Friday, although Sunday is not statistically significantly better
- Seeing all negative (positive) and significant coefficients for a given categorical variable does not mean that its levels are all bad (good) in absolute terms. It simply means that they are all worse (better) than the reference level
- There are quite a few cases in which you want to explicitly set the reference level. For instance, when one level is by far the most common and you want to compare all other levels against it. This is especially useful if you are looking for growth opportunities. Taking country as an example, you might want your most important country as the reference level. Or, if you are evaluating new marketing channels to see which one is the most promising, it would be beneficial to compare them against your current best one. The resulting levels with positive and significant coefficients would be a goldmine of information from a growth standpoint (a sketch of how to set the reference level follows this list)
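A minimal sketch of how you could set the reference level in the Python workflow above. pandas.get_dummies with drop_first=True drops the first category, so reordering the categories controls the baseline; the country names come from this dataset, and in R the base function relevel(data$user_country, ref = "US") achieves the same.
import pandas
data = pandas.read_csv('https://drive.google.com/uc?export=download&id=1PXjbqSMu__d_ppEv92i_Gnx3kKgfvhFk')
#put the desired baseline ("US") first: get_dummies(drop_first=True) drops the first category
data['user_country'] = pandas.Categorical(data['user_country'],
                                          categories=['US', 'ES', 'FR', 'UK'])
data = pandas.get_dummies(data, drop_first=True).astype('float')
#the result has user_country_ES / _FR / _UK columns; US is now the reference level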
- General Insights
- User country seems very important. Especially interesting is that the English-speaking countries (US, UK) are doing significantly better than the non-English-speaking ones (ES, FR). That could mean a bad translation or, in general, a non-localized version of the email. The first thing to do here is probably to get in touch with the international team and ask them to review the French and Spanish email templates
- Not surprisingly, all weekday coefficients are positive. Sunday is (barely) non-significant; all others are significant. This is a consequence of having Friday as the reference level: it is well known that sending marketing emails on Friday is not a great idea. Wednesday seems to be the best day, but in general all weekdays (Monday through Thursday) perform similarly, while Friday through Sunday are much worse. The company should probably start sending emails only Monday through Thursday, with a particular focus on the middle of the week
- Personalized emails are doing better. So the company should stop sending generic emails. But most importantly, this can be a huge insight from a product standpoint. If just adding the name at the top is increasing clicks significantly, imagine what would happen with even more personalization. Definitely worth investing in this
- Sending short emails appears to be better, but personalizing emails should be the priority over finding a single optimal template that on average works best for everyone (note the much lower coefficient compared to the personalization one)
- Hour perfectly emphasizes the problems of logistic regression with numerical variables. The best time is likely during the day, while early mornings and late nights are probably bad. But the model is trying to find a linear relationship between hour and the output. In most cases, this means it will not find a significant relationship. And if it does find significance, the results can be highly misleading: here it is telling us that the larger the value of hour, the better, so the best time would be 24 (midnight)! To solve this, you should manually create segments (i.e. indicator variables) before building the model. One segment could be night time, one morning to noon, etc. (see the bucketing sketch after this list)
- Email_id is not significant here, but it is still something to keep an eye on. Email_id could be interesting because it can be seen as a proxy for time, i.e. the first email sent gets id 1, the second id 2, etc. So a significant and negative coefficient would mean that, as time goes by, fewer and fewer people are clicking on the email. That could be a big red flag, meaning for instance that Google started labeling us as spam. It doesn't look like that's the case here, but still, it is something to keep in mind
- More importantly, note the extremely low coefficient for email_id compared to the other ones. That does not mean the variable is irrelevant. The tiny coefficient simply reflects the fact that the email_id scale is way larger than that of the other variables: the max value of all other variables is 24 (hour), while the max value of email_id is 100K! The low coefficient compensates for the different scale; otherwise email_id would entirely drive the regression output (see the scale check after this list)
- The intercept, highly negative and significant, is the regression outcome when all variables are set to zero, i.e. categorical variables at their reference levels and numerical variables at 0. Intercepts are almost always negative and significant, given that in the majority of cases you are dealing with imbalanced classes, where 1s are <5% of the events, and in a logistic regression a negative output means a higher probability of predicting class zero. Don't read too much into it. After all, the all-values-are-0 scenario is unrealistic at best, and often impossible: here “hour” is coded from 1 to 24, so it cannot even take the value 0! The one useful thing is comparing the scale of the intercept vs the scale of the other coefficients times the possible values of their variables, to get a sense of by how much you can affect the output
- -> If I send emails on Wednesday, that term contributes ~0.76 (the 0.76 coefficient times the dummy value of 1), which is pretty high relative to the -6.88 intercept. So there are opportunities for meaningful improvements. Imagine my intercept were -1000 and the Wednesday coefficient were the same: then optimizing the day would be almost irrelevant from a practical standpoint
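A minimal sketch of the hour bucketing suggested above, using pandas.cut on the same dataset. The bucket boundaries and labels are illustrative assumptions, not something derived from this data:
import pandas
data = pandas.read_csv('https://drive.google.com/uc?export=download&id=1PXjbqSMu__d_ppEv92i_Gnx3kKgfvhFk')
#replace the numerical hour (1-24) with coarse segments; boundaries are illustrative
data['hour_segment'] = pandas.cut(data['hour'],
                                  bins=[0, 6, 12, 18, 24],
                                  labels=['night', 'morning', 'afternoon', 'evening'])
data = data.drop('hour', axis=1)
#from here, hour_segment gets one-hot encoded like the other categorical variables,
#with 'night' as the reference level
data = pandas.get_dummies(data, drop_first=True).astype('float')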
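And a quick back-of-the-envelope check of the scale point about email_id: multiplying each coefficient by its variable's maximum value (both quoted above) gives the largest possible contribution to the log-odds.
#approximate max contribution of each numerical variable to the log-odds
email_id_effect = -3.85e-08 * 100000  #~ -0.004: negligible despite the huge scale
hour_effect = 1.67e-02 * 24           #~ 0.40: in the same ballpark as the weekday coefficients
print(email_id_effect, hour_effect)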
Pros and Cons
Pros of using logistic regression coefficients to extract insights from data
✓ Pretty much anyone in a technical or product management role in a tech company is familiar with logistic regressions (if this is not true at your company, you are probably working in the wrong place). It is so much easier to present data science work if the audience is already familiar with the techniques used
✓ Logistic regression is by far the most used model in production. Despite all the blog posts, conference talks, etc. about deep learning, it is almost guaranteed that a consumer tech company's most important model in production will be a logistic regression. Therefore, it will be easy to collaborate with engineers (i.e. leveraging prior work done by them, helping them improve their model, etc.)
✓ It is simple, fast, and generally reliable. Indeed, building the model is straightforward. The model works well in the majority of cases and all you have to do is look at the coefficient values and their p-values
Cons of using logistic regression coefficients to extract insights from data
✗ Coefficients give an idea of the impact of each variable on the output, but it is actually pretty hard to visualize exactly what that means. I.e., a change in a given variable by one unit changes the log odds by β units, where β is the coefficient. Mmh… (exponentiating the coefficient at least turns it into an odds ratio; see the sketch below)
✗ Coefficients do not allow you to segment a variable. For instance, a positive coefficient for the variable age means that as age increases, the output keeps increasing as well. Always. This is unlikely to be true for most numerical variables. You often need to create segments before building the regression (btw, RuleFit solves exactly this problem)
✗ The meaning of a coefficient for a categorical variable with several levels can be confusing: change the reference level of a given variable and all the other level coefficients change
✗ The absolute value of a coefficient is often used to quickly estimate variable importance. However, that depends on the variable scale more than anything else. You could normalize variables so they are all on the same scale, but that's rarely a good idea if your goal is presenting to product people. It is hard to get a product manager excited by saying: “If we increase variable X by one standard deviation, we could achieve this and that”
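A small sketch of the odds-ratio reading mentioned in the first con, using the personalization coefficient from the model output above (the interpretation pattern is standard; the number comes from this model):
import numpy
#exp(coefficient) is the multiplicative change in the odds of a click
beta_personalized = 0.6387
odds_ratio = numpy.exp(beta_personalized)
print(round(odds_ratio, 2))  #~1.89: personalized emails multiply the click odds by ~1.9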
✓ The absolute value of a coefficient is often used to quickly estimate variable importance. However, that depends on the variable scale more than anything else. You could normalize variables, so they are all on the same scale. But that’s rarely a good idea if your goal is presenting to product people. It is hard to get a product manager excited by saying: “If we increase variable X by one standard deviation, we could achieve this and that”