There is 1 dataset(csv) with 3 years worth of customer orders. There are 4 columns in the csv dataset: index, CUSTOMER_EMAIL(unique identifier as hash), Net_Revenue, and Year.
For each year we need the following information:
Dataset:https://www.dropbox.com/sh/xhy2fzjdvg3ykhy/AADAVKH9tgD_dWh6TZtOd34ia?dl=0
We begin by loading the packages that will be required throughout the course of our analysis.
library(tidyr)
library(DT)
library(ggplot2)
library(dplyr)
library(tidyverse)
library(kableExtra)
library(lubridate)
library(readxl)
library(highcharter)
library(lubridate)
library(scales)
library(RColorBrewer)
library(wesanderson)
library(plotly)
library(shiny)
library(readxl)
library(readr)
library(scales)
library(stringr)
library(boot)
library(reshape2)
After loading the required packages, we move on to the data preparation step which would start by loading the data into our R-studio.
summary(data)
## V1 customer_email net_revenue year
## Min. : 0 Length:685927 Min. : 1.00 Min. :2015
## 1st Qu.:171482 Class :character 1st Qu.: 63.51 1st Qu.:2015
## Median :342963 Mode :character Median :125.73 Median :2016
## Mean :342963 Mean :125.65 Mean :2016
## 3rd Qu.:514445 3rd Qu.:187.82 3rd Qu.:2017
## Max. :685926 Max. :250.00 Max. :2017
glimpse(data)
## Rows: 685,927
## Columns: 4
## $ V1 <int> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
## $ customer_email <chr> "nhknapwsbx@gmail.com", "joiuzbvcpn@gmail.com", "ukk...
## $ net_revenue <dbl> 249.92, 87.61, 168.38, 62.40, 43.08, 39.44, 171.46, ...
## $ year <int> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015...
Next we will check for missing values and deal with them accordingly.
summary(is.na(data))
## V1 customer_email net_revenue year
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:685927 FALSE:685927 FALSE:685927 FALSE:685927
The final cleaned dataset can be found below in an interactive table.
datatable(head(data, 20), class = 'cell-border stripe')
data %>%
group_by(year) %>%
summarise(Revenue =sum(net_revenue))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 3 x 2
## year Revenue
## <int> <dbl>
## 1 2015 29036749.
## 2 2016 25730944.
## 3 2017 31417495.
The Current Revenue of 2017 is $31417495
cust_2017 <- data %>%
filter( year == 2017)
cust_2016 <- data %>%
filter( year == 2016)
cust_2015 <- data %>%
filter( year == 2015)
new_cust_2017 <- anti_join(cust_2017, cust_2016, by = c("customer_email" = "customer_email"))
new_cust_2016 <- anti_join(cust_2016, cust_2015, by = c("customer_email" = "customer_email"))
new_cust_2017 %>%
summarise(Revenue_new_cust_2017 = sum(net_revenue))
## Revenue_new_cust_2017
## 1 21769213
new_cust_2016 %>%
summarise(Revenue_new_cust_2016 = sum(net_revenue))
## Revenue_new_cust_2016
## 1 17206367
result = data %>%
group_by(customer_email)
head(result)
## # A tibble: 6 x 4
## # Groups: customer_email [6]
## V1 customer_email net_revenue year
## <int> <chr> <dbl> <int>
## 1 0 nhknapwsbx@gmail.com 250. 2015
## 2 1 joiuzbvcpn@gmail.com 87.6 2015
## 3 2 ukkjctepxt@gmail.com 168. 2015
## 4 3 gykatilzrt@gmail.com 62.4 2015
## 5 4 mmsgsrtxah@gmail.com 43.1 2015
## 6 5 mobvusnzfr@gmail.com 39.4 2015