Stout Case Study #2

Importing the given csv file and exploring/studying it to gain insight about the data.

## New names:
## * `` -> ...1

## Rows: 685927 Columns: 4
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): customer_email
## dbl (3): ...1, net_revenue, year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

A glimpse of the dataset

…1	customer_email	net_revenue	year
0	nhknapwsbx@gmail.com	249.92	2015
1	joiuzbvcpn@gmail.com	87.61	2015
2	ukkjctepxt@gmail.com	168.38	2015
3	gykatilzrt@gmail.com	62.40	2015
4	mmsgsrtxah@gmail.com	43.08	2015
5	mobvusnzfr@gmail.com	39.44	2015
6	kxwkqvacsv@gmail.com	171.46	2015
7	iqmnugdcjs@gmail.com	111.99	2015
8	baiikostmd@gmail.com	170.70	2015
9	lfeafnigbu@gmail.com	54.99	2015

Next, we check if the dataset contains any missing values. This dataset does not contain any missing values.

##           ...1 customer_email    net_revenue           year 
##              0              0              0              0

Now we change the name of the first column from ‘..1’ to ‘index’ to enhance readability and understanding as well.

The first five rows of the dataset look likke this.

index	customer_email	net_revenue	year
0	nhknapwsbx@gmail.com	249.92	2015
1	joiuzbvcpn@gmail.com	87.61	2015
2	ukkjctepxt@gmail.com	168.38	2015
3	gykatilzrt@gmail.com	62.40	2015
4	mmsgsrtxah@gmail.com	43.08	2015

We now find all the unique values in the column ‘year’

## [1] 2015 2016 2017

We now count the number of values corresponding to each year, in the order 2017, 2016 and 2015.

## [1] 231294

## [1] 204646

## [1] 249987

Total Revenue for Current Year

## # A tibble: 3 × 2
##    year   Revenue
##   <dbl>     <dbl>
## 1  2015 29036749.
## 2  2016 25730944.
## 3  2017 31417495.

New Customer Revenue for 2017

## # A tibble: 1 × 1
##     Revenue
##       <dbl>
## 1 21769213.

New Customer Revenue for 2016

## # A tibble: 1 × 1
##     Revenue
##       <dbl>
## 1 17206367.

Existing customer growth for 2017

## Joining, by = "customer_email"
## Joining, by = "customer_email"

## [1] "Existing customer growth for 2017 = 63857.0600000005"

Existing customer growth for 2016

## Joining, by = "customer_email"
## Joining, by = "customer_email"

## [1] "Existing customer growth for 2016 = 39043.6499999985"

Revenue lost due to attrition for 2017

## [1] "Revenue lost due to attrition for 2017 = 5686551.44"

Revenue lost due to attrition for 2016

## [1] "Revenue lost due to attrition for 2016 = -3305805.6"

A negative value indicates that there was rather an increase in the total revenue from the year 2015 to 2016

For existing customers in 2017~

Existing customer revenue current year (2017)

## Joining, by = "customer_email"

## # A tibble: 1 × 1
##   Revenue_2017
##          <dbl>
## 1     9648282.

Existing customer revenue prior year (2016)

## Joining, by = "customer_email"

## # A tibble: 1 × 1
##   Revenue_2016
##          <dbl>
## 1     9584425.

For existing customers in 2016~

Existing customer revenue current year (2016)

## Joining, by = "customer_email"

## # A tibble: 1 × 1
##   Revenue_2016
##          <dbl>
## 1     8524577.

Existing customer revenue prior year (2015)

## Joining, by = "customer_email"

## # A tibble: 1 × 1
##   Revenue_2015
##          <dbl>
## 1     8485533.

Total customers current year and Previous Years

## [1] "Total customers current year (2017) = 249987"

## [1] "Total customers previous year (2016) = 204646"

## [1] "Total customers previous year (2015) = 231294"

New Customers

## [1] "Total number of new customers in 2017 from 2016= 45341"

## [1] "Total number of new customers in 2016 from 2015= -26648"

A negative value indicates a loss of customers in 2016 from 2015

Customer emails of new customers gained in 2017

## Joining, by = "customer_email"

## # A tibble: 173,449 × 1
##    customer_email      
##    <chr>               
##  1 qvjfterwnk@gmail.com
##  2 fdkiioqtli@gmail.com
##  3 vpsfdhgrfo@gmail.com
##  4 pnnenkxlpj@gmail.com
##  5 qkjragcrsd@gmail.com
##  6 sqashwoqai@gmail.com
##  7 thcyofilac@gmail.com
##  8 ylqsflvinx@gmail.com
##  9 actvovcfwt@gmail.com
## 10 wacguvfcpu@gmail.com
## # … with 173,439 more rows

Lost Customers

## [1] "Total number of customers lost in 2016 from 2015 = 26648"

## [1] "Emails of customers lost in 2016"

## Joining, by = "customer_email"

## # A tibble: 163,539 × 1
##    customer_email      
##    <chr>               
##  1 nhknapwsbx@gmail.com
##  2 joiuzbvcpn@gmail.com
##  3 ukkjctepxt@gmail.com
##  4 gykatilzrt@gmail.com
##  5 mmsgsrtxah@gmail.com
##  6 mobvusnzfr@gmail.com
##  7 kxwkqvacsv@gmail.com
##  8 iqmnugdcjs@gmail.com
##  9 utycjynusf@gmail.com
## 10 dptyjlaniw@gmail.com
## # … with 163,529 more rows

Visualizations

Here are two plots:

which highlight the total number of customers for the years

which highlight the total revenue for the years

If given more time, I’d work more on the visualizations and explore deeper relationships between the existing customers and new customers wrt years and calculate yearly profit, loss and would gain deeper insight about the data.

PS: I’m extremely thankful to Stout for giving me this opportunity and I have thoroughly enjoyed working on this. I could not work more on this assessment because I have 2 midterms this week.

Stout CaseStudy2

Ayushe Gangal

4/9/2022