In these exercises, you will be exploring some sample subscription sales data. There is one line per sale. Some sales are first-time purchases, called conversions, and others are renewals. Some sales are at the base price, some are on sale, and some are sold at the base price after a free one-week trial period.
# Feel free to load any packages you plan on using here
# You can always load more later as you go
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Here are the columns:
- userid: unique user id
- purchasedate: date of purchase
- price: purchase price
- plan: conditions around the plan offer (e.g., on sale)
- interval: the subscription interval (e.g., monthly, yearly, etc.)
- type: conversion (a first-time purchase) or renewal
- source: platform for the purchase
# Reading the file
sales <- read.csv("/Users/pgagneja/Downloads/r_evaluation_data.csv")
# checking if data looks good
head(sales)
str(sales)
## 'data.frame': 327749 obs. of 7 variables:
## $ userid : int 77910 84255 219404 56393 261931 164774 175658 225163 131468 244299 ...
## $ purchasedate: Factor w/ 60 levels "01/01/2019","01/02/2019",..: 60 56 42 25 1 14 50 15 46 24 ...
## $ price : num 40 40 30 30 30 ...
## $ plan : Factor w/ 3 levels "base","on_sale",..: 1 1 2 2 2 2 3 3 2 1 ...
## $ interval : Factor w/ 1 level "Yearly": 1 1 1 1 1 1 1 1 1 1 ...
## $ type : Factor w/ 2 levels "Conversion","Renewal": 1 1 1 1 1 1 1 1 1 1 ...
## $ source : Factor w/ 3 levels "Apple","Google",..: 1 1 3 1 1 2 1 1 1 2 ...
# Date column is read in as a Factor, changing it to date format
sales$purchasedate <- as.Date(sales$purchasedate, format = "%m/%d/%Y")
# Verifying the data types again
str(sales)
## 'data.frame': 327749 obs. of 7 variables:
## $ userid : int 77910 84255 219404 56393 261931 164774 175658 225163 131468 244299 ...
## $ purchasedate: Date, format: "2019-03-01" "2019-02-25" ...
## $ price : num 40 40 30 30 30 ...
## $ plan : Factor w/ 3 levels "base","on_sale",..: 1 1 2 2 2 2 3 3 2 1 ...
## $ interval : Factor w/ 1 level "Yearly": 1 1 1 1 1 1 1 1 1 1 ...
## $ type : Factor w/ 2 levels "Conversion","Renewal": 1 1 1 1 1 1 1 1 1 1 ...
## $ source : Factor w/ 3 levels "Apple","Google",..: 1 1 3 1 1 2 1 1 1 2 ...
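As an extra sanity check on the conversion (a quick sketch, not in the original output), I can confirm that no dates failed to parse and that the parsed range looks sensible:
# Rows that as.Date() could not parse would have become NA
sum(is.na(sales$purchasedate))
# Earliest and latest purchase dates
range(sales$purchasedate)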
# Exploring the data fields a bit further
library(inspectdf)
inspect_types(sales)
show_plot(inspect_types(sales))
This output shows how the R data types are spread across the columns of the dataset: the imported data has four factor columns, one date column, and so on.
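For reference, roughly the same breakdown can be reproduced with base R alone (a quick sketch, not part of the inspectdf output):
# Count the columns by their (first) class
table(vapply(sales, function(x) class(x)[1], character(1)))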
# I can also look at how much memory is being used by each column; that may be useful information when performance tuning
# show_plot(inspect_mem(sales))
# Find the count/% of missing values
inspect_na(sales)
Great, there is really no missing data here!
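The same conclusion can be double-checked with base R (a quick sketch):
# NA count per column; all zeros confirms there is no missing data
colSums(is.na(sales))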
Now let me also find out how the numeric variable (only price for now) is distributed across sources/plans.
# Not interested in looking at 1st column in this dataframe which represents userid
show_plot(inspect_num(sales[,-1]))
So we are only looking at two price points here: 29.99 (about 63% of values) and 39.99.
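That split can be confirmed directly with a frequency table (a quick sketch):
# Share of rows at each price point
round(prop.table(table(sales$price)), 3)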
Let me dig in a little bit more.
inspect_imb(sales)
This table mainly highlights that:

- The interval column only contains 'Yearly' subscriptions, indicating that all transactions correspond to annual subscriptions (which I assume is better than monthly).
- type is ~72% 'Conversion', meaning most purchases in this period are first-time purchases.
- plan is 65% 'on_sale', showing that most sales happened while the plan was on sale.
- source is 50% 'Apple', indicating that most transactions come through Apple.

At this point, a few questions can be raised.
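The imbalances above can also be cross-checked without inspectdf; the sketch below computes, for each factor column, the share taken by its most common level (column names as listed earlier):
factor_cols <- c("plan", "interval", "type", "source")
sapply(sales[factor_cols], function(x) {
  tab <- sort(table(x), decreasing = TRUE)
  round(tab[1] / sum(tab), 2)  # share of the most common level
})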
# 1. Date was read as a factor; converted to Date type above.
# 2. Duplicate records, identified when aggregating/testing, e.g.:
#    sales %>% filter(userid == 77910)
# Remove the duplicate rows
sales_deduped <- unique(sales)
# sales has 327749 rows; sales_deduped has 297937 rows
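# Sketch (not in the original): check that the rows dropped by unique() were
# exact duplicate records, and that each remaining row is a distinct user
sum(duplicated(sales))    # 29812 exactly duplicated rows (327749 - 297937)
n_distinct(sales$userid)  # 297937, i.e. one row per user after deduping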
inspect_imb(sales_deduped)
# exploration
length(unique(sales$userid))
## [1] 297937
# plot(sales_deduped$price) # to see the spread
min(sales_deduped$purchasedate) # 2019-01-01
## [1] "2019-01-01"
max(sales_deduped$purchasedate) # 2019-03-01
## [1] "2019-03-01"
I do not see any further peculiarities in the data, such as outliers, incorrect dates, or negative prices.
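These checks can also be codified as simple assertions so they fail loudly if the data ever changes (a minimal sketch):
# Prices should be positive and dates should fall inside the observed window
stopifnot(all(sales_deduped$price > 0))
stopifnot(all(sales_deduped$purchasedate >= as.Date("2019-01-01") &
              sales_deduped$purchasedate <= as.Date("2019-03-01")))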
Answer as many of the following questions as you can. You may use whatever methods you like, including graphical, though you will be asked to make specific graphs in the next section.
# Total by Source
salesBySource <- sales_deduped %>%
  group_by(source) %>%
  summarise(revenue = sum(price),
            cnt = n(),
            unique_cnt = n_distinct(userid))
Apple/iPhone users drive roughly half of the transactions and slightly more than half of the revenue.
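To put numbers on that, the shares can be computed from the same summary (a quick sketch reusing salesBySource):
salesBySource %>%
  mutate(revenue_share = round(revenue / sum(revenue), 3),
         txn_share = round(cnt / sum(cnt), 3))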
Additionally, some more interesting insights come out of the following analysis:
# Total by Plan
salesByPlan <- sales_deduped %>%
  group_by(plan) %>%
  summarise(revenue = sum(price),
            cnt = n(),
            unique_cnt = n_distinct(userid))
library(DataExplorer)
summary(sales_deduped)
## userid purchasedate price plan
## Min. : 1 Min. :2019-01-01 Min. :29.99 base : 46623
## 1st Qu.: 74995 1st Qu.:2019-01-10 1st Qu.:29.99 on_sale :194921
## Median :149996 Median :2019-01-23 Median :29.99 with_trial: 56393
## Mean :149992 Mean :2019-01-25 Mean :33.45
## 3rd Qu.:224993 3rd Qu.:2019-02-10 3rd Qu.:39.99
## Max. :300000 Max. :2019-03-01 Max. :39.99
## interval type source
## Yearly:297937 Conversion:213531 Apple :147937
## Renewal : 84406 Google:100000
## Web : 50000
##
##
##
table(sales_deduped$plan,sales_deduped$price)
##
## 29.99 39.99
## base 0 46623
## on_sale 194912 9
## with_trial 0 56393
table(sales_deduped$plan,sales_deduped$price, sales_deduped$source)
## , , = Apple
##
##
## 29.99 39.99
## base 0 22690
## on_sale 86585 0
## with_trial 0 38662
##
## , , = Google
##
##
## 29.99 39.99
## base 0 15174
## on_sale 67095 0
## with_trial 0 17731
##
## , , = Web
##
##
## 29.99 39.99
## base 0 8759
## on_sale 41232 9
## with_trial 0 0
We can see here that, across all sources, the on_sale plan is sold the most (>50% of transactions).
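The >50% figure can be made explicit with column-wise proportions (a quick sketch; each column sums to 1):
round(prop.table(table(sales_deduped$plan, sales_deduped$source), margin = 2), 2)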
table(sales_deduped$plan,sales_deduped$type, sales_deduped$source)
## , , = Apple
##
##
## Conversion Renewal
## base 12244 10446
## on_sale 62627 23958
## with_trial 38635 27
##
## , , = Google
##
##
## Conversion Renewal
## base 9926 5248
## on_sale 52850 14245
## with_trial 17731 0
##
## , , = Web
##
##
## Conversion Renewal
## base 3660 5099
## on_sale 15858 25383
## with_trial 0 0
For both Apple and Google, there have been at least twice as many new purchases as renewals on the on_sale plan. On the Web, however, there have been more renewals than new purchases on sale.
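The same comparison can be expressed as a conversion-to-renewal ratio per source for the on_sale plan (a quick dplyr sketch, not part of the original tables):
sales_deduped %>%
  filter(plan == "on_sale") %>%
  group_by(source) %>%
  summarise(conversions = sum(type == "Conversion"),
            renewals = sum(type == "Renewal"),
            conv_per_renewal = round(conversions / renewals, 2))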
table(sales_deduped$price,sales_deduped$type, sales_deduped$source)
## , , = Apple
##
##
## Conversion Renewal
## 29.99 62627 23958
## 39.99 50879 10473
##
## , , = Google
##
##
## Conversion Renewal
## 29.99 52850 14245
## 39.99 27657 5248
##
## , , = Web
##
##
## Conversion Renewal
## 29.99 15849 25383
## 39.99 3669 5099
In line with the basic economics of demand, there have been more conversions as well as renewals at the lower price point, across all sources.
summary(sales_deduped$type)
## Conversion Renewal
## 213531 84406
# Renewal-to-conversion ratio, computed from the deduped data (84406 / 213531)
renewalToConvRatio <- sum(sales_deduped$type == "Renewal") / sum(sales_deduped$type == "Conversion")
Overall, renewals (over these two months of 2019) amount to about 40% of conversions. However, commenting on renewal rates on that basis would require further assumptions, for example about how many subscriptions actually came up for renewal in this window.
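With that caveat in mind, a per-source version of the same ratio can still be computed (a quick sketch):
sales_deduped %>%
  group_by(source) %>%
  summarise(renewal_to_conversion = round(sum(type == "Renewal") / sum(type == "Conversion"), 2))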
I am not quite clear on what this question is asking (e.g., a good idea for what?), so I am leaving it for now.
Make as many of the following plots as you can. Feel free to reuse your code from the previous section.
library(ggplot2)
# Below graph again answers question 1 (Exploration section)
# ggplot(data=salesBySource, aes(x=source, y=cnt)) +
# geom_bar(stat="identity")
# Rolling up data by purchase dates and source to plot over time by source
salesByDaySource <- sales_deduped %>%
  group_by(source, purchasedate) %>%
  summarise(revenue = sum(price),
            cnt = n(),
            unique_cnt = n_distinct(userid))
ggplot(data = salesByDaySource, aes(x = purchasedate, y = cnt, group = source)) +
  geom_line(aes(color = source))
A quick glance at this chart shows that the sales pattern (count of transactions) is very similar for each source: a peak shortly after New Year, followed by similar rises and falls over time, with Apple in the lead, then Google (Android) users, and then Web buyers.
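The same daily rollup can be reused to look at revenue rather than transaction counts (a quick sketch using the salesByDaySource frame built above):
ggplot(data = salesByDaySource, aes(x = purchasedate, y = revenue, group = source)) +
  geom_line(aes(color = source))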
breakdown <- sales_deduped %>%
  group_by(plan, type, source) %>%
  summarise(revenue = sum(price),
            cnt = n())
ggplot(breakdown, aes(x = plan, y = revenue)) +
  geom_point() +
  facet_grid(type ~ source)