Subscription upgrades

Subscription Upgrades



Goal


Our product offers various subscription tiers. In increasing order of price, they are: "Free", "Starter", "Standard", and "Plus". That is, free is the cheapest, starter the second cheapest, etc. We have a table that logs, on a daily basis, if a user changes their subscription tier.

Your code should return the total count of users that upgraded at least once in their lifetime as well the count of users that downgraded at least once. Upgraded is defined as any subscription change from a cheaper tier to a more expensive one and downgraded is the opposite (more expensive -> cheaper)


Data


You can download the data set from here

Column Name Value Description
user_id 836 Id of the user
date 2020-01-01 date of the subscription tier change
subscription_tier Plus that user updated their plan and got the Plus subscription

Solution


The main challenge of this problem is to calculate the "at least once" part. This kind of question is really really common, so you need to be familiar with how to do it. A common way to find out if something happened at least once in a group is: group by -> max(if that condition happens -> 1 else -> 0)). And if the resulting max is >0, then that condition did happen at least once within a given group.





As extensively described in the previous lessons, we'll start by writing down the output in the outermost query, which is just the question itself. That is, it is the count of upgraders and downgraders. After that, we'll keep working backwards until the query basically solves itself.

--just the output for now
SELECT COUNT (upgraders),
       COUNT (downgraders)
FROM (
     --subquery goes here
)       



So upgraders/downgraders should have NULL if a user_id never upgraded/downgraded and any value if they did it at least once. Let's define upgraders and downgraders according to the question:

SELECT COUNT (upgraders),
       COUNT (downgraders)
FROM (
      SELECT MAX (CASE WHEN current_tier>previous_tier THEN 1 ELSE NULL END) AS upgraders,
             MAX (CASE WHEN current_tier<previous_tier THEN 1 ELSE NULL END) AS downgraders,
             user_id
      FROM  (
        --new subquery
      )
      GROUP BY user_id
) 



Now we need to create the current_tier and previous_tier variables. Obviously, they have to be numeric and such that a lower tier subscription maps to a lower value. For current_tier, we can just map the subscription offers to increasing numbers. And previous_tier will be the lag of current_tier partitioned by user_id. Let's do it in two steps (we could also just do it in one query, but the SELECT part would become very long). Firstly, we calculate previous_tier as the lag of current_tier:



SELECT COUNT (upgraders),
       COUNT (downgraders)
FROM (
      SELECT MAX (CASE WHEN current_tier>previous_tier THEN 1 ELSE NULL END) AS upgraders,
             MAX (CASE WHEN current_tier<previous_tier THEN 1 ELSE NULL END) AS downgraders,
             user_id
      FROM  (
             --previous_tier is the lag of current_tier
             SELECT LAG(current_tier) OVER (PARTITION BY user_id ORDER BY date) AS previous_tier,
             current_tier,
             user_id
             FROM (
               --subquery goes here
             )
      )
      GROUP BY user_id
) 



And now all we need is to define current_tier as a numeric column and we are done.

SELECT COUNT (upgraders),
       COUNT (downgraders)
FROM (
      SELECT 
             MAX (CASE WHEN current_tier>previous_tier THEN 1 ELSE NULL END) AS upgraders,
             MAX (CASE WHEN current_tier<previous_tier THEN 1 ELSE NULL END) AS downgraders,
             user_id
      FROM  (
             --previous_tier is the lag of current_tier
             SELECT LAG(current_tier) OVER (PARTITION BY user_id ORDER BY date) AS previous_tier,
             current_tier,
             user_id,
             date
             FROM (
                   SELECT CASE WHEN subscription_tier = "Free" THEN 0
                               WHEN subscription_tier = "Starter" THEN 1 
                               WHEN subscription_tier = "Standard" THEN 2
                               ELSE 3 END AS current_tier,
                          user_id,
                          date
                   FROM data
             )
      )
      GROUP BY user_id
) 
COUNT (upgraders) COUNT (downgraders)
724 726



An alternative approach could have been to have the "CASE WHEN current_tier>previous_tier THEN" part directly return the user_id if TRUE and NULL otherwise. And then we could have wrapped that inside a COUNT (distinct). Something like below. Either approach is fine. The first one in this lesson is more general and always works to find out if something has happened at least once within a group. The approach below is more specific and more efficient when you need to return a count. In any case, in a job interview, either way is fine.

SELECT COUNT(DISTINCT CASE WHEN current_tier>previous_tier THEN user_id ELSE NULL END) AS count_upgraders,
       COUNT(DISTINCT CASE WHEN current_tier<previous_tier THEN user_id ELSE NULL END) AS count_downgraders
FROM  (
             SELECT LAG(current_tier) OVER (PARTITION BY user_id ORDER BY date) AS previous_tier,
             current_tier,
             user_id,
             date
             FROM (
                   SELECT CASE WHEN subscription_tier = "Free" THEN 0
                               WHEN subscription_tier = "Starter" THEN 1 
                               WHEN subscription_tier = "Standard" THEN 2
                               ELSE 3 END AS current_tier,
                          user_id,
                          date
                   FROM data
                 )
      )
count_upgraders count_downgraders
724 726







As extensively described in the previous lessons, we'll start by writing down the output:

summarize (upgraders = sum(has_upgraded_ever),
           downgraders = sum(has_downgraded_ever))



That is, we need to calculate those two variables (has_upgraded_ever and has_downgraded_ever). For this to make sense, those two variables should be 1 if a given user has upgraded/downgraded at least once and 0 otherwise. Let's use the standard approach for the "at least once" problems:

#standard way to calculate at least once
group_by(user_id) %>%
#the ifelse() isn't strictly needed, but it is probably clearer this way
# max(current_tier<previous_tier) already returns TRUE/FALSE which would act as 0/1 in the final sum
summarize(has_upgraded_ever =  max(ifelse(current_tier<previous_tier, 1, 0)),
          has_downgraded_ever =  max(ifelse(current_tier>previous_tier, 1, 0))
          )%>%
#code from the prior step
summarize (upgraders = sum(has_upgraded_ever),
           downgraders = sum(has_downgraded_ever))  



Now we need to create the current_tier and previous_tier variables. Obviously, they have to be numeric and such that a lower tier subscription maps to a lower value. For current_tier, we can just map the subscription offers to increasing numbers. And previous_tier will be the lag of current_tier. We can do it in a couple of mutate and we are done.

library(tidyverse)
#read the data
data=read.csv("https://drive.google.com/uc?export=download&id=1TVvfPvGyeUtAz3aWtAxVAowiXfKApTeO")

data %>%
  #map subscription tiers to increasing numbers
  #if you don't remember recode() syntax, you can simply do a series of nested ifelse or case when 
  mutate(current_tier = recode(subscription_tier, "Free" = 0, "Starter"=1, "Standard"=2, "Plus"=3)) %>%
  group_by(user_id) %>%
  #calculate LAG of current_tier
  mutate (previous_tier = lag(current_tier, order_by=as.Date(date))) %>%
  #let's get rid of the first row per user, i.e. previous_tier is NA. The NAs might create problems later in the max(ifelse()) 
  filter(!is.na(previous_tier)) %>%
  #the group by below is not stricly needed. The dataset is already grouped by user_id, I leave it there for clarity
  group_by(user_id) %>%
  #returns 1 if it happens at least once
  summarize(has_upgraded_ever   =  max(ifelse(current_tier>previous_tier, 1, 0)),
            has_downgraded_ever =  max(ifelse(current_tier<previous_tier, 1, 0))
            )%>%
  summarize (upgraders = sum(has_upgraded_ever),
             downgraders = sum(has_downgraded_ever))  
# A tibble: 1 × 2
  upgraders downgraders
      <dbl>       <dbl>
1       724         726



Since in this exercise it is just asking for two counts, an alternative approach could have been to simply subset the dataset by current_tier>previous_tier and then count unique user_id. Something like below. Either approach is fine. The first one in this lesson is more general and always works to find out if something happened at least once within a group. The approach below is more specific and more efficient when you need to just return a count. In any case, in a job interview, either way is fine.

library(tidyverse)
data=read.csv("https://drive.google.com/uc?export=download&id=1TVvfPvGyeUtAz3aWtAxVAowiXfKApTeO")

data = data %>%
  mutate(current_tier = recode(subscription_tier, "Free" = 0, "Starter"=1, "Standard"=2, "Plus"=3)) %>%
  group_by(user_id) %>%
  mutate (previous_tier = lag(current_tier, order_by=as.Date(date)))

#identify upgraders by subsetting the dataset and counting unique user_id
paste("Total upgraders are ", length(unique(subset(data,current_tier>previous_tier)$user_id)))
[1] "Total upgraders are  724"
paste("Total downgraders are ", length(unique(subset(data,current_tier<previous_tier)$user_id)))
[1] "Total downgraders are  726"







As extensively described in the previous lessons, we'll start by writing down the output:

print("Total upgraders are ", data['has_upgraded_ever'].sum())
print("Total downgraders are ", data['has_downgraded_ever'].sum())



That is, we need to calculate those two variables (has_upgraded_ever and has_downgraded_ever). For this to make sense, those two variables should be 1 if a given user has upgraded/downgraded at least once and 0 otherwise. Let's use the standard approach for the "at least once" problems:

#standard way to calculate at least once
data = data.groupby('user_id').apply(lambda x: pandas.Series({
            #the np.where() isn't strictly needed, but it is probably clearer this way
            #(['current_tier']>x['previous_tier']).max() already returns TRUE/FALSE which would act as 0/1 in the final sum
            'has_upgraded_ever': np.where(x['current_tier']>x['previous_tier'],1,0).max(),
            'has_downgraded_ever': np.where(x['current_tier']<x['previous_tier'],1,0).max()})
            ).reset_index()

print("Total upgraders are ", data['has_upgraded_ever'].sum())
print("Total downgraders are ", data['has_downgraded_ever'].sum()) 



Now we need to create the current_tier and previous_tier variables. Obviously, they have to be numeric and such that a lower tier subscription maps to a lower value. For current_tier, we can just map the subscription offers to increasing numbers. And previous_tier will be the shift of current_tier:

import pandas
import numpy as np
#read the data
data=pandas.read_csv("https://drive.google.com/uc?export=download&id=1TVvfPvGyeUtAz3aWtAxVAowiXfKApTeO")

#create current_tier. I.e. from strings to increasing numbers
data['current_tier'] = np.where(data['subscription_tier'] == "Free", 0,
                       np.where(data['subscription_tier'] == "Starter", 1,
                       np.where(data['subscription_tier'] == "Standard", 2, 3)))

#order by date before applying the shift function
data['date'] = pandas.to_datetime(data['date'])
data = data.sort_values('date')

#get previous_tier via the shift function
data['previous_tier'] = data.groupby('user_id')['current_tier'].shift()

data = data.groupby('user_id').apply(lambda x: pandas.Series({
             #returns 1 if it happens at least once
            'has_upgraded_ever': np.where(x['current_tier']>x['previous_tier'],1,0).max(),
            'has_downgraded_ever': np.where(x['current_tier']<x['previous_tier'],1,0).max()})
            ).reset_index()

print("Total upgraders are ", data['has_upgraded_ever'].sum())
Total upgraders are  724
print("Total downgraders are ", data['has_downgraded_ever'].sum()) 
Total downgraders are  726



Since in this exercise it is just asking for two counts, an alternative approach could have been to simply subset the dataset by current_tier>previous_tier and then count unique user_id. Something like below. Either approach is fine. The first one in this lesson is more general and always works to find out if something happened at least once within a group. The approach below is more specific and more efficient when you need to just return a count. In any case, in a job interview, either way is fine.

import pandas
import numpy as np
data=pandas.read_csv("https://drive.google.com/uc?export=download&id=1TVvfPvGyeUtAz3aWtAxVAowiXfKApTeO")

data['current_tier'] = np.where(data['subscription_tier'] == "Free", 0,
                       np.where(data['subscription_tier'] == "Starter", 1,
                       np.where(data['subscription_tier'] == "Standard", 2, 3)))

data['date'] = pandas.to_datetime(data['date'])
data = data.sort_values('date')
data['previous_tier'] = data.groupby('user_id')['current_tier'].shift()

#identify upgraders by subsetting the dataset and counting unique user_id
print("Total upgraders are ", data.query('current_tier>previous_tier')['user_id'].nunique())
Total upgraders are  724
print("Total downgraders are ", data.query('current_tier<previous_tier')['user_id'].nunique()) 
Total downgraders are  726





Complete and Continue