In these exercises, you will be exploring some sample subscription sales data. There is one line per sale. Some sales are first-time purchases, called conversions, and others are renewals. Some sales are at the base price, some are on sale, and some are sold at the base price after a free one-week trial period.

Setup

# Feel free to load any packages you plan on using here
# You can always load more later as you go
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Loading and Cleaning

  1. Read the data file into R.

Here are the columns:

userid: unique userid purchasedate: date of purchase price: purchase price plan: conditions around plan off (ex, on sale) interval: the subscription interval (eg, monthly, yearly, etc) type: conversion (a first time purchase) or renewal source: platform for the purchase

# Reading the file
sales <- read.csv("/Users/pgagneja/Downloads/r_evaluation_data.csv")

# checking if data looks good
head(sales)
str(sales)
## 'data.frame':    327749 obs. of  7 variables:
##  $ userid      : int  77910 84255 219404 56393 261931 164774 175658 225163 131468 244299 ...
##  $ purchasedate: Factor w/ 60 levels "01/01/2019","01/02/2019",..: 60 56 42 25 1 14 50 15 46 24 ...
##  $ price       : num  40 40 30 30 30 ...
##  $ plan        : Factor w/ 3 levels "base","on_sale",..: 1 1 2 2 2 2 3 3 2 1 ...
##  $ interval    : Factor w/ 1 level "Yearly": 1 1 1 1 1 1 1 1 1 1 ...
##  $ type        : Factor w/ 2 levels "Conversion","Renewal": 1 1 1 1 1 1 1 1 1 1 ...
##  $ source      : Factor w/ 3 levels "Apple","Google",..: 1 1 3 1 1 2 1 1 1 2 ...
# Date column is read in as a Factor, changing it to date format
sales$purchasedate<-as.Date(sales$purchasedate,format="%m/%d/%Y")

# Verifying the data types again
str(sales)
## 'data.frame':    327749 obs. of  7 variables:
##  $ userid      : int  77910 84255 219404 56393 261931 164774 175658 225163 131468 244299 ...
##  $ purchasedate: Date, format: "2019-03-01" "2019-02-25" ...
##  $ price       : num  40 40 30 30 30 ...
##  $ plan        : Factor w/ 3 levels "base","on_sale",..: 1 1 2 2 2 2 3 3 2 1 ...
##  $ interval    : Factor w/ 1 level "Yearly": 1 1 1 1 1 1 1 1 1 1 ...
##  $ type        : Factor w/ 2 levels "Conversion","Renewal": 1 1 1 1 1 1 1 1 1 1 ...
##  $ source      : Factor w/ 3 levels "Apple","Google",..: 1 1 3 1 1 2 1 1 1 2 ...
# Exploring the data fields a bit further 
library(inspectdf)
inspect_types(sales)
show_plot(inspect_types(sales))

This output shows the spread among different R data types for each of the columns in the dataset i.e. the imported dataset has 4 columns with factor data type, 1 column as date and so on.

# I can also look at how much memory is being used by each column, that maybe useful information at the time of performance tuning
# show_plot(inspect_mem(sales))

# Find the count/% of missing values
inspect_na(sales)

Great there is really no missing data here!!

Now let me also find out how are the numeric variable(s) (well only price for now) is distributed across sources/plans.

# Not interested in looking at 1st column in this dataframe which represents userid
show_plot(inspect_num(sales[,-1]))

So basically we are only looking at 2 price points here 29.99 (about ~63% of values) and 39.99.

Let me dig in a lit bit more

inspect_imb(sales)

This table mainly highlights that

  • interval column only contains ‘Yearly’ subscriptions - indicating that all the transactions correspond to annual subscriptions ( which is better than monthly I assume).
  • type has ~72% ‘conversion’ - meaning there are so many 1st time purchases made in this duration.
  • plan has 65% ‘on_sale’ value - showing that most conversions happened when the plan was on sale.
  • source has 50% ‘Apple’ - indicating how most of the records/transactions are coming through Apple.

At this point, few questions can be conjured,

  • is the annual/monthly subscriptions priced right ?
  • How much revenue was lost/ reduced when the plan was on sale ?
# check_missing()
# table(is.na(sales[,5:6]))

# sales$purchaseDate<-NULL
# sales$newPurchaseDate<-NULL
  1. Are there any potential issues with this data? If so, address them. If you find issues later in your analysis, feel free to address them then and there.
# 1. Date was read as factor , converted to date type above.

# 2. Duplicate Records 
# identified when aggregating/testing
# sales %>% filter(userid == 77910)

# Here remove duplicates
sales_deduped<-unique(sales)
# sales has 327749 rows
# sales_deduped 297937 rows

inspect_imb(sales_deduped)
# exploration
length(unique(sales$userid))
## [1] 297937
# plot(sales_deduped$price) # to see the spread

min(sales_deduped$purchasedate) # 2019-01-01
## [1] "2019-01-01"
max(sales_deduped$purchasedate) # 2019-03-01
## [1] "2019-03-01"

I do not see any further peculiarity in the data like any outlier or incorrect dates or -ve price etc.

Data Exploration

Answer as many of the following questions as you can. You may use whatever methods you like, including graphical, though you will be asked to make specific graphs in the next section.

  1. Where is the bulk of the revenue coming from?
# Total by Source
salesBySource <- sales_deduped %>% 
                      group_by(source) %>% 
                      summarise(revenue = sum(price),
                                cnt = n(),
                                unique_cnt = n_distinct(userid))

More than half the revenue and no. of transactions are driven by Apple/iPhone users.

Additionally, some more interesting insights come out of the following analysis:

# Total by Plan
salesByPlan <- sales_deduped %>% 
                      group_by(plan) %>% 
                      summarise(revenue = sum(price),
                                cnt = n(),
                                unique_cnt = n_distinct(userid))

library(DataExplorer)
summary(sales_deduped)
##      userid        purchasedate            price               plan       
##  Min.   :     1   Min.   :2019-01-01   Min.   :29.99   base      : 46623  
##  1st Qu.: 74995   1st Qu.:2019-01-10   1st Qu.:29.99   on_sale   :194921  
##  Median :149996   Median :2019-01-23   Median :29.99   with_trial: 56393  
##  Mean   :149992   Mean   :2019-01-25   Mean   :33.45                      
##  3rd Qu.:224993   3rd Qu.:2019-02-10   3rd Qu.:39.99                      
##  Max.   :300000   Max.   :2019-03-01   Max.   :39.99                      
##    interval              type           source      
##  Yearly:297937   Conversion:213531   Apple :147937  
##                  Renewal   : 84406   Google:100000  
##                                      Web   : 50000  
##                                                     
##                                                     
## 
table(sales_deduped$plan,sales_deduped$price)
##             
##               29.99  39.99
##   base            0  46623
##   on_sale    194912      9
##   with_trial      0  56393
table(sales_deduped$plan,sales_deduped$price, sales_deduped$source)
## , ,  = Apple
## 
##             
##              29.99 39.99
##   base           0 22690
##   on_sale    86585     0
##   with_trial     0 38662
## 
## , ,  = Google
## 
##             
##              29.99 39.99
##   base           0 15174
##   on_sale    67095     0
##   with_trial     0 17731
## 
## , ,  = Web
## 
##             
##              29.99 39.99
##   base           0  8759
##   on_sale    41232     9
##   with_trial     0     0

We can see here that across all the sources on sale plan is sold the most (>50%)

table(sales_deduped$plan,sales_deduped$type, sales_deduped$source)
## , ,  = Apple
## 
##             
##              Conversion Renewal
##   base            12244   10446
##   on_sale         62627   23958
##   with_trial      38635      27
## 
## , ,  = Google
## 
##             
##              Conversion Renewal
##   base             9926    5248
##   on_sale         52850   14245
##   with_trial      17731       0
## 
## , ,  = Web
## 
##             
##              Conversion Renewal
##   base             3660    5099
##   on_sale         15858   25383
##   with_trial          0       0

For both Apple & Google phones, there have been twice or more new purchases than renewals when on sale. On web, however more renewals have been seen during sale period than the new purchases.

table(sales_deduped$price,sales_deduped$type, sales_deduped$source)
## , ,  = Apple
## 
##        
##         Conversion Renewal
##   29.99      62627   23958
##   39.99      50879   10473
## 
## , ,  = Google
## 
##        
##         Conversion Renewal
##   29.99      52850   14245
##   39.99      27657    5248
## 
## , ,  = Web
## 
##        
##         Conversion Renewal
##   29.99      15849   25383
##   39.99       3669    5099

Satisfying the basics economics of demand and supply, there has been more conversions as well as renewals at lower price point (for all sources).

  1. Assuming 2019 first-time conversions were comparable in volume to 2018 conversions, what we can be said about renewal rates?
summary(sales_deduped$type)
## Conversion    Renewal 
##     213531      84406
renewalToConvRatio <- 84406/213531

Overall Renewals (for 2 months in 2019) are 40% of the conversions. However, to comment on the renewal rates basis that may need some more assumptions.

  1. Can you determine from this data whether the sale price is a good idea? If not, what additional information would you need?

Not quite clear on the ask of this question (e.g. good idea for what ? ) . Leaving it for now.

Visualization

Make as many of the following plots as you can. Feel free to reuse your code from the previous section.

  1. Make a plot showing the total number of sales for each source over time. This should be a single plot.
library(ggplot2)

# Below graph again answers question 1 (Exploration section)
# ggplot(data=salesBySource, aes(x=source, y=cnt)) +
#   geom_bar(stat="identity")

# Rolling up data by purchase dates and source to plot over time by source
salesByDaySource <- sales_deduped %>% 
                      group_by(source,purchasedate) %>% 
                      summarise(revenue = sum(price),
                                cnt = n(),
                                unique_cnt = n_distinct(userid))

ggplot(data=salesByDaySource, aes(x= purchasedate, y=cnt, group = source)) +
  geom_line(aes(color = source)) 

A quick glance of this chart shows that sales (count of transactions) for each source is very similar following a peak after New Year, steadily peaking and falling in a similar fashion with Apple being the leader, followed by Google (or Android) users and then by the web buyers.

  1. Make a plot showing the breakdown of total revenue by plan, type, and source. This may be a single plot, or a unified collection of plots layed out in the same graphics window.
breakdown <- sales_deduped %>% 
                      group_by(plan,type,source) %>% 
                      summarise(revenue = sum(price),
                                cnt = n())
  
ggplot(breakdown, aes(x = plan, y = revenue)) + 
  geom_point() + 
  facet_grid(type ~ source)