library(magrittr)
library(tidyr)
library(dplyr)
library(lubridate)
library(Hmisc)
library(outliers)
library(forecast)
In order to give recommendations whether Rossmann should expand stores or not in next years, the first thing I need to do is searching data to understand what they have done during the period of its operation and its competitive level compared to its rivals from 2013 to 2015. I find out 2 datasets related to the problem statement from kaggle website and make the data ready for the further statistical analysis. More detail, after importing and merging 2 datasets to gain knowlege about this brand, I filter important variables and deeply observe to make my analysis easier. Then, I check data structure to understand its attributes in the combined dataset and understand the meaning of each value for the variables. Next, I check whether my data is messy or not; fortunately; my data already tidy up because each variable has its own column, each observation has its own row and each type of observational unit has its own table. Furthermore, to know the level of customer preference which supports for problem statement, I create a new variable, especifically average purchase per customer, to know the level of spending on Rossmann products of every customer in each area and each period of time. In addition, I clean data for obvious errors, identify and handle multivariate outliers by replacing them with 5th and 95th quantile, and deal with missing values by replacing them with “None” and “median value” depending on natures of variables. Lastly, my dataset has right-skewed distribution, so I transform it to change its distribution into normal distribution which will make easier in statistical analysis. To make my description about data preprocessing simplier, steps below will clearly demonstrate what I describe the process above.
The purpose of this report is to determine whether Rossmann should keep openning stores in next years. Therefore, deeply understanding its operation is the most important thing I need to focus. In order to have a better overview about Rossmann, the combined dataset below is merged by two relative detasets and divided into 18 variables representing historical sales for 1,115 Rossmann stores in last 3 years. During operation, some stores were temporarily closed for refurbishment, so I remove dates without Rossmann’s revenue for more accurate analysis.
# Set working directory
setwd("~/Desktop/Week3")
The working directory was changed to /Users/diepdo/Desktop/Week3 inside a notebook chunk. The working directory will be reset when the chunk is finished running. Use the knitr root.dir option in the setup chunk to change the working directory for notebook chunks.
# Import two datasets
train <- read.csv("train.csv", stringsAsFactors = FALSE)
head(train)
store <- read.csv("store.csv", stringsAsFactors = FALSE)
head(store)
# Combine two datasets
combination <- left_join(store,train,key="Store")
Joining, by = "Store"
# Filter data
combination <- combination %>% filter(Sales > 0)
head(combination)
# Summarise data
summary(combination)
Store StoreType Assortment CompetitionDistance CompetitionOpenSinceMonth
Min. : 1.0 Length:844338 Length:844338 Min. : 20 Min. : 1.00
1st Qu.: 280.0 Class :character Class :character 1st Qu.: 710 1st Qu.: 4.00
Median : 558.0 Mode :character Mode :character Median : 2320 Median : 8.00
Mean : 558.4 Mean : 5458 Mean : 7.22
3rd Qu.: 837.0 3rd Qu.: 6890 3rd Qu.:10.00
Max. :1115.0 Max. :75860 Max. :12.00
NA's :2186 NA's :268600
CompetitionOpenSinceYear Promo2 Promo2SinceWeek Promo2SinceYear PromoInterval
Min. :1900 Min. :0.0000 Min. : 1.0 Min. :2009 Length:844338
1st Qu.:2006 1st Qu.:0.0000 1st Qu.:13.0 1st Qu.:2011 Class :character
Median :2010 Median :0.0000 Median :22.0 Median :2012 Mode :character
Mean :2009 Mean :0.4987 Mean :23.3 Mean :2012
3rd Qu.:2013 3rd Qu.:1.0000 3rd Qu.:37.0 3rd Qu.:2013
Max. :2015 Max. :1.0000 Max. :50.0 Max. :2015
NA's :268600 NA's :423292 NA's :423292
DayOfWeek Date Sales Customers Open Promo
Min. :1.00 Length:844338 Min. : 46 Min. : 8.0 Min. :1 Min. :0.0000
1st Qu.:2.00 Class :character 1st Qu.: 4859 1st Qu.: 519.0 1st Qu.:1 1st Qu.:0.0000
Median :3.00 Mode :character Median : 6369 Median : 676.0 Median :1 Median :0.0000
Mean :3.52 Mean : 6956 Mean : 762.8 Mean :1 Mean :0.4464
3rd Qu.:5.00 3rd Qu.: 8360 3rd Qu.: 893.0 3rd Qu.:1 3rd Qu.:1.0000
Max. :7.00 Max. :41551 Max. :7388.0 Max. :1 Max. :1.0000
StateHoliday SchoolHoliday
Length:844338 Min. :0.0000
Class :character 1st Qu.:0.0000
Mode :character Median :0.0000
Mean :0.1936
3rd Qu.:0.0000
Max. :1.0000
# Check data structure
str(combination)
'data.frame': 844338 obs. of 18 variables:
$ Store : int 1 1 1 1 1 1 1 1 1 1 ...
$ StoreType : chr "c" "c" "c" "c" ...
$ Assortment : chr "a" "a" "a" "a" ...
$ CompetitionDistance : int 1270 1270 1270 1270 1270 1270 1270 1270 1270 1270 ...
$ CompetitionOpenSinceMonth: int 9 9 9 9 9 9 9 9 9 9 ...
$ CompetitionOpenSinceYear : int 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
$ Promo2 : int 0 0 0 0 0 0 0 0 0 0 ...
$ Promo2SinceWeek : int NA NA NA NA NA NA NA NA NA NA ...
$ Promo2SinceYear : int NA NA NA NA NA NA NA NA NA NA ...
$ PromoInterval : chr "" "" "" "" ...
$ DayOfWeek : int 5 4 3 2 1 6 5 4 3 2 ...
$ Date : chr "2015-07-31" "2015-07-30" "2015-07-29" "2015-07-28" ...
$ Sales : int 5263 5020 4782 5011 6102 4364 3706 3769 3464 3558 ...
$ Customers : int 555 546 523 560 612 500 459 503 463 469 ...
$ Open : int 1 1 1 1 1 1 1 1 1 1 ...
$ Promo : int 1 1 1 1 1 0 0 0 0 0 ...
$ StateHoliday : chr "0" "0" "0" "0" ...
$ SchoolHoliday : int 1 1 1 1 1 0 0 0 0 0 ...
In the combined dataset, some variables are not presented as its natures (e.g: Assortment is not character variable)
# Adjust data structure
combination$Assortment <- factor(combination$Assortment,levels=c("a","b","c"),
labels = c("basic","extra","extended"))
combination$Promo2 <- factor(combination$Promo2,levels=c(0,1),
labels = c("not continuing promotion","continuing promotion"))
combination$DayOfWeek <- factor(combination$DayOfWeek,levels=c(1,2,3,4,5,6,7),
labels = c("Sunday","Monday","Tuesday","Wednesday","Thursday","Friday","Saturday"))
combination$Open <- factor(combination$Open,levels=c(0,1),
labels = c("closed","open"))
combination$Promo <- factor(combination$Promo,levels=c(0,1),
labels=c("no promotion","promotion"))
combination$StateHoliday <- factor(combination$StateHoliday,levels=c("a","b","c",0),
labels=c("public-holiday","Easter-holiday","Christmas","None"))
combination$SchoolHoliday <- factor(combination$SchoolHoliday, levels=c(0,1),
labels=c("closed","open"))
combination$Date <- as.Date(combination$Date)
str(combination)
'data.frame': 844338 obs. of 18 variables:
$ Store : int 1 1 1 1 1 1 1 1 1 1 ...
$ StoreType : chr "c" "c" "c" "c" ...
$ Assortment : Factor w/ 3 levels "basic","extra",..: 1 1 1 1 1 1 1 1 1 1 ...
$ CompetitionDistance : int 1270 1270 1270 1270 1270 1270 1270 1270 1270 1270 ...
$ CompetitionOpenSinceMonth: int 9 9 9 9 9 9 9 9 9 9 ...
$ CompetitionOpenSinceYear : int 2008 2008 2008 2008 2008 2008 2008 2008 2008 2008 ...
$ Promo2 : Factor w/ 2 levels "not continuing promotion",..: 1 1 1 1 1 1 1 1 1 1 ...
$ Promo2SinceWeek : int NA NA NA NA NA NA NA NA NA NA ...
$ Promo2SinceYear : int NA NA NA NA NA NA NA NA NA NA ...
$ PromoInterval : chr "" "" "" "" ...
$ DayOfWeek : Factor w/ 7 levels "Sunday","Monday",..: 5 4 3 2 1 6 5 4 3 2 ...
$ Date : Date, format: "2015-07-31" "2015-07-30" "2015-07-29" ...
$ Sales : int 5263 5020 4782 5011 6102 4364 3706 3769 3464 3558 ...
$ Customers : int 555 546 523 560 612 500 459 503 463 469 ...
$ Open : Factor w/ 2 levels "closed","open": 2 2 2 2 2 2 2 2 2 2 ...
$ Promo : Factor w/ 2 levels "no promotion",..: 2 2 2 2 2 1 1 1 1 1 ...
$ StateHoliday : Factor w/ 4 levels "public-holiday",..: 4 4 4 4 4 4 4 4 4 4 ...
$ SchoolHoliday : Factor w/ 2 levels "closed","open": 2 2 2 2 2 1 1 1 1 1 ...
For example, with assortment variable, I convert from character into factor and label it as basic, extra and extended stores respectively in relation to its level a,b,c.
This combined dataset is long format as it is arranged in such a way that a single subject’s information is stored in multiple rows. Each variable has its own column. Each observation has its own row. Each value has its own cell. Therefore, this combined dataset represents as a tidy data.
In order to positively forecast Rossmann’s sales in next two years with the purpose of extending stores, I create new column named Average Purchase Per Customer from 2 existing variables (Sales and Customers) by mutate () function to know the average amount of money each customer tend to spend on Rossmann products in each day in each store.
# Create new variable
combination <- mutate(combination, AveragePurchasePerCustomer=Sales/Customers)
head(combination$AveragePurchasePerCustomer)
[1] 9.482883 9.194139 9.143403 8.948214 9.970588 8.728000
In the combined dataset, some variables, including competition distance, competition open since month, competition open since year, promo2 since week, promo2 since year, have some missing values.
# Check missing values
sapply(combination, function(x) sum(is.na(x)))
Store StoreType Assortment CompetitionDistance
0 0 0 2186
CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek
268600 268600 0 423292
Promo2SinceYear PromoInterval DayOfWeek Date
423292 0 0 0
Sales Customers Open Promo
0 0 0 0
StateHoliday SchoolHoliday AveragePurchasePerCustomer
0 0 0
Depending on natures of variables, I handle these missing values with different ways. Specifically,
Some stores will have no 2nd promotion, so two variables- promo2 since week and promo2 since year will have NAs. Then, I replace “NAs” with “None” to indicate that no 2nd promotion results in no promo2 since week and no promo2 since year.
With missing values of variables of competition distance, competition open since month and competition open since year, I replace “NAs”" into its “median value”" because median value will work well with its column; month and year could not be double variable (month and year could be double variable when using mean)
# Deal with missing values
combination$Promo2SinceWeek <- ifelse(is.na(combination$Promo2SinceWeek),
'None', combination$Promo2SinceWeek)
combination$Promo2SinceYear <- ifelse(is.na(combination$Promo2SinceYear),
'None', combination$Promo2SinceYear)
combination$CompetitionDistance[is.na(combination$CompetitionDistance)] <- median(combination$CompetitionDistance, na.rm=TRUE)
combination$CompetitionOpenSinceMonth[is.na(combination$CompetitionOpenSinceMonth)] <- median(combination$CompetitionOpenSinceMonth, na.rm=TRUE)
combination$CompetitionOpenSinceYear[is.na(combination$CompetitionOpenSinceYear)] <- median(combination$CompetitionOpenSinceYear, na.rm=TRUE)
sapply(combination, function(x) sum(is.na(x)))
Store StoreType Assortment CompetitionDistance
0 0 0 0
CompetitionOpenSinceMonth CompetitionOpenSinceYear Promo2 Promo2SinceWeek
0 0 0 0
Promo2SinceYear PromoInterval DayOfWeek Date
0 0 0 0
Sales Customers Open Promo
0 0 0 0
StateHoliday SchoolHoliday AveragePurchasePerCustomer
0 0 0
# Detect outliers Sales & AveragePurchasePerCustomer
par(mfrow=c(2,1))
boxplot(combination$Sales, main="The number of Sales between stores")
boxplot(combination$AveragePurchasePerCustomer, main="The Average Spending of Each Customer between Stores")
# See descriptive statistics of Sales
summary(combination$Sales)
Min. 1st Qu. Median Mean 3rd Qu. Max.
46 4859 6369 6956 8360 41551
IQR(combination$Sales, na.rm=TRUE)
[1] 3501
thelowerfence_Sales <- quantile(combination$Sales,0.25, na.rm=TRUE)-1.5*IQR(combination$Sales, na.rm=TRUE)
head(thelowerfence_Sales)
25%
-392.5
thelowerfence_Sales <- 0
head(thelowerfence_Sales)
[1] 0
theupperfence_Sales <- quantile(combination$Sales,0.75, na.rm=TRUE)+1.5*IQR(combination$Sales, na.rm=TRUE)
head(theupperfence_Sales)
75%
13611.5
sd(combination$Sales, na.rm=TRUE)
[1] 3103.816
# See descriptive statistics of AveragePurchasePerCustomer
summary(combination$AveragePurchasePerCustomer)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.749 7.896 9.250 9.494 10.900 64.958
IQR(combination$AveragePurchasePerCustomer,na.rm=TRUE)
[1] 3.004158
thelowerfence_AveragePurchasePerCustomer <- quantile(combination$AveragePurchasePerCustomer,na.rm=TRUE,0.25)-1.5*IQR(combination$AveragePurchasePerCustomer,na.rm=TRUE)
head(thelowerfence_AveragePurchasePerCustomer)
25%
3.389334
theupperfence_AveragePurchasePerCustomer <- quantile(combination$AveragePurchasePerCustomer,na.rm=TRUE,0.75)+1.5*IQR(combination$AveragePurchasePerCustomer,na.rm=TRUE)
head(theupperfence_AveragePurchasePerCustomer)
75%
15.40597
sd(combination$AveragePurchasePerCustomer, na.rm=TRUE)
[1] 2.197448
# Handle outliers
cap <- function(x){quantiles <- quantile( x, c(.05, 0.25, 0.75, .95),na.rm=TRUE)
x[ x < quantiles[2] - 1.5*IQR(x,na.rm=TRUE) ] <- quantiles[1]
x[ x > quantiles[3] + 1.5*IQR(x,na.rm=TRUE) ] <- quantiles[4]
x
}
Sales_capped <- combination$Sales %>% cap()
combination_sub <- combination %>% dplyr::select(Sales, AveragePurchasePerCustomer)
# Check summary statistics
summary(combination_sub)
Sales AveragePurchasePerCustomer
Min. : 46 Min. : 2.749
1st Qu.: 4859 1st Qu.: 7.896
Median : 6369 Median : 9.250
Mean : 6956 Mean : 9.494
3rd Qu.: 8360 3rd Qu.:10.900
Max. :41551 Max. :64.958
# Check summary statistics again
combination_capped <- sapply(combination_sub, FUN=cap)
summary(combination_capped)
Sales AveragePurchasePerCustomer
Min. : 46 Min. : 3.389
1st Qu.: 4859 1st Qu.: 7.896
Median : 6369 Median : 9.250
Mean : 6807 Mean : 9.472
3rd Qu.: 8360 3rd Qu.:10.900
Max. :13611 Max. :15.406
#Check outliers
boxplot(combination_capped)
The boxplot above shows that variables of Sales and Average Purchase Per Customer visualise multivariate outliers. In order to handle these outliers, I need to find the lower and upper fence and replace the outliers with neighbours inside this fence. In this situation, because the value of Sales cannot be negative, I change from -392.5 into 0 as the lower fence and choose values of 5th and 95th percentile as neighbours. If any value of variables is greater than the upper fence, I replace it with the value of 95th percentile. If any value of variables is less than the lower fence, I replace it with the value of 5th percentile. You can see that multivariate outliers will disappear in the second boxplot and the maximum for Sales and Average Purchase Per Customer are now 13611 and 15.406 instead of 41551 and 64.958.
# Check & Change data distribution Sales
par(mfrow=c(1,2))
hist(combination$Sales)
boxcox_x <- BoxCox(combination$Sales, lambda = "auto")
hist(boxcox_x)
# Check & Change data distribution AveragePurchasePerCustomer
par(mfrow=c(1,2))
hist(combination$AveragePurchasePerCustomer)
boxcox_y <- BoxCox(combination$AveragePurchasePerCustomer, lambda = "auto")
hist(boxcox_y)
As you can see, the value of Sales and Average Purchase Per Customer has right-skewed distribution. In order to make it easier in doing hypothesis testing as well as analysing, I transform data by using BoxCox() to improve assumptions of normality and homogeneity of variance of these variables.