Required packages

library(dplyr)
library(tidyr)

Executive Summary

The data pre-processing is carried out on datasets from the retail domain. There are three datasets which are joined using left join. The final joined data is composed of variables with numeric, character and logical data types.

Conversion of data type for a few variables had to be carried out. Four variables were converted to factory type and one variable which represented date, was in character data type and was converted into Date format. A numeric variable is also categorized into a new variable with three levels which is an ordered factor. The data was checked for tidiness. There were five variables which were present as column names which were worthy of being combined into a single variable using gather function. A new binary variable was created as an indicator of missing values.

The variables were checked for missing values and outlier, where boxplots were used for outlier checking. Only two variables had missing values, one which was of type factor and the other numeric. The missing values for the numeric variable were replaced by the median value of the variable. The numeric variables which had outliers were modified by capping the outlier values to the upper or lower value from the boxplot which is not an outlier itself.

Finally, a variable which had left skew was transformed using square transformation in order to make it approximately normal.

Data

The data used for this assignment is a Retail Analytics Dataset that is sourced form Kaggle.

The datasets are available at the following link: https://www.kaggle.com/manjeetsingh/retaildataset/data

The data description is available at the following link: https://www.kaggle.com/manjeetsingh/retaildataset

The datasets are released under CC0: Public Domain.

There are three datasets Stores, Features and Sales. All three datasets are available in .csv format. The description of the data are as follows:

Stores : This dataset contains 45 stores of a company, with the type and size of the store. The stores data is anonymized. There are 45 observations and the variables of this data set are as follows (as present in original form):

Variable Type Levels Description
Store Numeric - Store Number
Type Character A,B,C Type of Store
Size Numeric - Size of the Store

Sales: This is the Sales data for each store by department from 5/2/2010 to 1/11/2012. There are 421570 observations and the variables in this data set are:

Variable Type Levels Description
Store Numeric - Store Number
Dept Numeric - Department Number
Date Character - Representing Week
Weekly_Sales Numeric - Weekly Sales for given store and dept
IsHoliday Logical - If the week has special Holiday

Features: Data related to the store and other activity for the dates. There are 8190 rows and the variables are as follows:

Variable Type Levels Description
Store Numeric - Store Number
Date Character - Representing Week
Temperature Numeric - Avg Temperature in Region of Store
Fuel Price Numeric - Fuel price in Region of Store
Markdown1-5 Numeric - Promotional Markdowns Type and Amnts
CPI Numeric - Consumer Price Index
Unemployment Numeric - Unemployment Rate
IsHoliday Logical - If the week has special Holiday

We first read in the individual datasets.

# Step 1: Reading in Store
stores = read.csv(file = "E:/RMIT/Course Content/MATH 2349 - Data Preprocessing/Assessments/Assessment 3/Assignment/stores data-set.csv",stringsAsFactors = FALSE)
# Step 2: Check if Store data set has been read in properly
head(stores)
dim(stores)
[1] 45  3
# Step 3: Reading in Sales
sales = read.csv(file = "E:/RMIT/Course Content/MATH 2349 - Data Preprocessing/Assessments/Assessment 3/Assignment/sales data-set.csv",stringsAsFactors = FALSE)
# Step 4: Check head and dim of Sales Data
head(sales)
dim(sales)
[1] 421570      5
# Step 5: Reading in Feature data set
features = read.csv(file = "E:/RMIT/Course Content/MATH 2349 - Data Preprocessing/Assessments/Assessment 3/Assignment/Features data set.csv",stringsAsFactors = FALSE)
# Step 6: Checking head and dim of Features Data 
head(features)
dim(features)
[1] 8190   12
# Step 7: Left Join Stores and Sales Dataset by Store
storeData = stores %>% left_join(y = sales,by = "Store")
# Step 8:  Checking Head and Dim of storeData
head(storeData)
dim(storeData)
[1] 421570      7
# Step 9: Left Join storeData with features by Store and Date 
storeData = storeData %>% left_join(y = features,by = c("Store","Date"))
# Step 10: Checking Head and Dim of storeData
head(storeData)
dim(storeData)
[1] 421570     17

Step 1: Read in Stores Data : In this step we first read in the store data using read.csv function and save as stores.
Step 2: Check Head and Dimension of Store Data : Check the head of the imported stores data to see if the import is proper and check the dimension of the read in data.The stores data has been read in properly and has 45 observations and 3 variables as expected.
Step 3: Read in Sales Data : The Sales data set is read into R using the read.csv function and saved as sales.
Step 4: Check Head and Dimension of Sales Data : The sales data is read into R properly and has 421570 observations and 5 variables as expected.
Step 5: Read in Features Data : The Features data set is read into R using the read.csv function and saved as features.
Step 6: Check Head and Dimension of Features Data : Check the head and dimensions of the Features data set. The features data set is read into R and it has 8190 observations and 12 variables.
Step 7: Left Join Stores and Sales Dataset : We left join the Sales dataset with the Stores dataset on the variable Store and save the dataset as storeData.
Step 8: Check Head and Dimension of storeData Dataset: The head function and dim function are used to check the head and dimensions of the joined dataset. It has 421570 observations and 7 variables as required.
Step 9: Left Join StoreData and Features Dataset : The features dataset is left joined with storeData created above on the variables Store and Date and stored again as storeData.
Step 10: Check Head and Dimension of storeData Dataset : The head and dimension of the storeData after joining the three tables is checked. There are 421570 observations and 17 rows as expected.

Understand

# Step 1: Check class of storeData
class(storeData)
[1] "data.frame"
# Step 2: Check Data Types of storeData
str(storeData)
'data.frame':   421570 obs. of  17 variables:
 $ Store       : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Type        : chr  "A" "A" "A" "A" ...
 $ Size        : int  151315 151315 151315 151315 151315 151315 151315 151315 151315 151315 ...
 $ Dept        : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Date        : chr  "05/02/2010" "12/02/2010" "19/02/2010" "26/02/2010" ...
 $ Weekly_Sales: num  24925 46039 41596 19404 21828 ...
 $ IsHoliday.x : logi  FALSE TRUE FALSE FALSE FALSE FALSE ...
 $ Temperature : num  42.3 38.5 39.9 46.6 46.5 ...
 $ Fuel_Price  : num  2.57 2.55 2.51 2.56 2.62 ...
 $ MarkDown1   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ MarkDown2   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ MarkDown3   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ MarkDown4   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ MarkDown5   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ CPI         : num  211 211 211 211 211 ...
 $ Unemployment: num  8.11 8.11 8.11 8.11 8.11 ...
 $ IsHoliday.y : logi  FALSE TRUE FALSE FALSE FALSE FALSE ...
# Step 3: Check Both IsHoliday Variable for Similarity
summary(storeData$IsHoliday.x == storeData$IsHoliday.y)
   Mode    TRUE 
logical  421570 
# Step 4: Dropping IsHoliday.y Variable 
storeData = storeData %>% select(... = -IsHoliday.y )
# Step 5: Renaming IsHoliday.x to IsHoliday
storeData = storeData %>% rename(IsHoliday = IsHoliday.x)
head(storeData)
# Step 6: Changing Data Type of Store, Type, Dept & IsHoliday Variable to Factor
storeData[,c(1,2,4,7)] = lapply(storeData[,c(1,2,4,7)],as.factor)
# Step 7: Check Change of Data Type 
str(storeData)
'data.frame':   421570 obs. of  16 variables:
 $ Store       : Factor w/ 45 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Type        : Factor w/ 3 levels "A","B","C": 1 1 1 1 1 1 1 1 1 1 ...
 $ Size        : int  151315 151315 151315 151315 151315 151315 151315 151315 151315 151315 ...
 $ Dept        : Factor w/ 81 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Date        : chr  "05/02/2010" "12/02/2010" "19/02/2010" "26/02/2010" ...
 $ Weekly_Sales: num  24925 46039 41596 19404 21828 ...
 $ IsHoliday   : Factor w/ 2 levels "FALSE","TRUE": 1 2 1 1 1 1 1 1 1 1 ...
 $ Temperature : num  42.3 38.5 39.9 46.6 46.5 ...
 $ Fuel_Price  : num  2.57 2.55 2.51 2.56 2.62 ...
 $ MarkDown1   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ MarkDown2   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ MarkDown3   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ MarkDown4   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ MarkDown5   : num  NA NA NA NA NA NA NA NA NA NA ...
 $ CPI         : num  211 211 211 211 211 ...
 $ Unemployment: num  8.11 8.11 8.11 8.11 8.11 ...
# Step 8: Converting Date from Character to Date Format
storeData$Date = as.Date(storeData$Date,format = "%d/%m/%Y")
# Step 9: Check Conversion of All Variables 
head(storeData)
# Step 10: Create New Variable for Categorizing Size
sizeDisc = vector(mode = "character",length = dim(storeData)[1])
storeData = cbind(storeData,sizeDisc)
# Step 11: Find Q1 and Q3 to discretize Size Variable
q1 = quantile(x = storeData$Size,probs = c(0.25))
q3 = quantile(x = storeData$Size,probs = c(0.75))
# Step 12: Discretize Size Variable into A New Variable
storeData$sizeDisc =  ifelse(test = storeData$Size < q1,yes = "Small",no = storeData$sizeDisc)
storeData$sizeDisc =  ifelse(test = storeData$Size > q3,yes = "Big",no = storeData$sizeDisc)
storeData$sizeDisc =  ifelse(test = storeData$Size > q1 & storeData$Size < q3 ,yes = "Medium",no = storeData$sizeDisc)
# Step 13: Creating SizeDisc as Ordered Factor
storeData$sizeDisc = factor(storeData$sizeDisc,levels = c("Small","Medium","Big"),ordered = TRUE)
# Step 14: Checking Type of SizeDisc
str(storeData$sizeDisc)
 Ord.factor w/ 3 levels "Small"<"Medium"<..: 2 2 2 2 2 2 2 2 2 2 ...
# Step 15: Dropping Variable Size
storeData = storeData %>% select(-Size)

Step 1: Check Class of storeData : The first step is to check the data structure of the dataset storeData. The dataset has a data structure Data Frame.

Step 2: Data Types of storeData: We check the different data types of variables using the str function. This is a convenient method to check the data types of all variables together. There variables are of mix type with 2 variables of character type, two of logical type (which could be same variable) and the rest are of numeric data type. We can also immediately see some issues. For e.g. the variable Date represents Date’s but its type is character.

Step 3: Check Both IsHoliday Variable For Similarity: We compare the IsHoliday.x and IsHoliday.y variables to check if they are similar and then use summary function to get the result. We get all 421570 as TRUE which inidicates that they are same variable.

Step 4: Dropping IsHoliday.y Variable : The IsHoliday.y variable is dropped because it is similar to the IsHoliday.x Variable.

Step 5: Renaming IsHoliday.x to IsHoliday : The IsHoliday.x variable is renamed as IsHoliday and stored in storeData and we print the head of the dataset to confirm.

Step 6: Changing Data Type of Store, Type & Dept Variable to Factor : The variables Store and Dept are just identifier variables but were originally in numeric type. Thus, we convert it from numeric data type to factor. The variable Type is of data type character & IsHoliday is logical, hence we change these also to factor type.

Step 7: Check Change of Data Type : We check if the data types have been changed to factor type by using the str function. We see for these variables the data types have been changed to factor.

Step 8: Converting Date from Character to Date Format: We convert the Date variable which is in Character type to Date Format with the as.date function using the format “%d/%m/%Y” .

Step 9: Check Conversion of All Variables : Print the head of the dataset to check conversion of data types of all variables. We see that all the variables are converted to required type.

Step 10: Create New Variable for Categorizing Size : We would like to categorize the variable size into Small, Medium and Big. To do this we create a new variable called sizeDisc. We bind this new variable with the storeData dataset.

Step 11: Find Q1 and Q3 to discretize Size Variable : To categorize the size variable we will use the 1st quartile and 3rd quartile. Any values below 1st quartile are labelled Small, values of Size between 1st and 3rd quartile are labelled Medium and values above 3rd quartile are labelled Big. Here we calculate the 1st and 3rd quartiles.

Step 12: Discretize Size Variable into A New Variable : Based on the 1st and 3rd quartiles we add categorized values into the new variable sizeDisc from size.

Step 13: Creating SizeDisc as Ordered Factor : The new created variable sizeDisc is based on the values of Size which is numeric. Since the values Small, Medium and Big are ordinal we change the type of the sizeDisc variable to ordered factor.

Step 14: Checking Type of SizeDisc : We check the structure of the variable sizeDisc to check if the conversion to ordered factor has taken place. It has been converted to ordered factor with three levels.

Step 15: Dropping Variable Size : We have categorized the variable Size into a new variable SizeDisc. Hence, we drop the variable size from the dataset.

In Summary in this section we checked the data types of the different variables which were a mix of logical, numeric and character. We then converted Store, Dept, Type and IsHoliday variables to data type factor and converted Date to date format. Finally, we categorized Size variable into SizeDisc variable and converted the SizeDisc variable into an ordered factor.

Tidy & Manipulate Data I

For data to be tidy we need the following rules to be adhered:

The tidy rule which is a problem for this dataset is that there are 5 Markdown prices as variables named MarkDown1 to Markdown5. These variables are just types of Markdown and should be combined into one variable which indicates the types of Markdown. Another variable will indicate the value which these Markdowns take for each observation. We can use the gather function to do this.

# Step 1: Combine Markdown into One Variable Using Gather
storeData = storeData %>% gather(key = "MarkdownType",value = "MarkdownAmnt" ,... = c(MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5))
# Step 2: Check Head of Gathered Data
head(storeData)
# Step 3: Convert MarkDownType to Factor Type 
storeData$MarkdownType = as.factor(storeData$MarkdownType)

Step 1: Combine Markdown into One Variable Using Gather: We combine the different Markdown variables into one variable using gather function. The new variable that holds the type of markdown is MarkdownType and MarkdownAmnt holds the values for each of these Markdowns for each observations.

Step 2: Check Head of Gathered Data: We check the head of the gathered data to see if it has worked. The five variables have been gathered into the two variables as required.

Step 3: Convert MarkDownType to Factor Type: The variable MarkDownType which is created is of character type. So we change it to data type of factor.

Tidy & Manipulate Data II

The data has Weekly Sales for each Store and Dept by Week. Also it shows the different types of Markdown Types for each week and its value. From Step 1 below we see that the MarkdownAmnt variable we created has a lot of missing values. One way of dealing with this is to use imputation. For predictive analytics if a variable has a large number of missing values one way of dealing with this is to create a binary variable indicating if each observation is missing or not for that variable. In some cases whether the original variable is missing or not may be correlated with the target variable (Kelleher, et al., 2015)1. When such a variable is created usually the original variable is dropped. Here we create the new variable but do not drop the old variable for the moment since the data is only being pre processed. We create a new binary variable markdownMissing to indicate if the MarkdownAmnt is missing or not.

1: Kelleher, J. D., Namee, B. M. & D’Arcy, A., 2015. Data Exploration. In: Fundamentals Of Machine Learning For Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies. London: MIT Press, p. 121.

# Step 1: Find Percent of Missing Values in MarkdownAmnt
storeData %>% filter(is.na(MarkdownAmnt)) %>% summarise(Percent = (n()/dim(storeData)[1])*100) 
# Step 2: Create Variable markdownMissing using Mutate
storeData = storeData %>% mutate(markdownMissing = if_else(condition = is.na(MarkdownAmnt),true = 1,false = 0))
# Step 3: Check Creation of markdownMissing
str(storeData$markdownMissing)
 num [1:2107850] 1 1 1 1 1 1 1 1 1 1 ...
# Step 4: Convert markDownMissing to Factor
storeData$markdownMissing = factor(storeData$markdownMissing)
# Step 5: Check Structure 
str(storeData$markdownMissing)
 Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...

Step 2: Create Variable markdownMissing using Mutate: Using mutate we create the new variable markdownMissing which takes on 1 if MarkdownAmnt is missing and 0 if MarkdownAmnt is not missing.

Step 3: Check Creation of markdownMissing : We check if the markdownMissing variable has been created by checking the structure of the data.

Step 4: Convert markDownMissing to Factor: The markdownMissing can take on only two values 0 and 1. Hence, it is converted to a factor data type.

Step 5: Check Structure: We check if the conversion of markdownMissing to data type factor is successful. It is a factor with two levels 0 and 1.

Scan I

# Step 1: Checking for Missing Values in All Variables
colSums(is.na(storeData))
          Store            Type            Dept            Date 
              0               0               0               0 
   Weekly_Sales       IsHoliday     Temperature      Fuel_Price 
              0               0               0               0 
            CPI    Unemployment        sizeDisc    MarkdownType 
              0               0           98330               0 
   MarkdownAmnt markdownMissing 
        1422431               0 
# Step 2: Finding Median of MarkdownAmnt 
markmedian = median(storeData$MarkdownAmnt,na.rm = TRUE)
# Step 3: Replacing Missing Values with Median
storeData$MarkdownAmnt =  replace_na(data = storeData$MarkdownAmnt,replace = markmedian)
# Step 4: Check Replacement
sum(is.na(storeData$MarkdownAmnt))
[1] 0
# Step 5: Subset Numeric Variables
numeric = select_if(storeData,is.numeric)
# Step 6: Summary of Numeric Variables
summary(numeric)
  Weekly_Sales     Temperature       Fuel_Price         CPI       
 Min.   : -4989   Min.   : -2.06   Min.   :2.472   Min.   :126.1  
 1st Qu.:  2080   1st Qu.: 46.68   1st Qu.:2.933   1st Qu.:132.0  
 Median :  7612   Median : 62.09   Median :3.452   Median :182.3  
 Mean   : 15981   Mean   : 60.09   Mean   :3.361   Mean   :171.2  
 3rd Qu.: 20206   3rd Qu.: 74.28   3rd Qu.:3.738   3rd Qu.:212.4  
 Max.   :693099   Max.   :100.14   Max.   :4.468   Max.   :227.2  
  Unemployment     MarkdownAmnt     
 Min.   : 3.879   Min.   :  -265.8  
 1st Qu.: 6.891   1st Qu.:  1562.9  
 Median : 7.866   Median :  1562.9  
 Mean   : 7.960   Mean   :  2391.5  
 3rd Qu.: 8.572   3rd Qu.:  1562.9  
 Max.   :14.313   Max.   :141630.6  
# Step 7: Replacing Inconsistent Value in Weekly_Sales
storeData %>%filter(Weekly_Sales < 0) %>% count()
storeData$Weekly_Sales = ifelse(test = storeData$Weekly_Sales < 0,yes = NA,no = storeData$Weekly_Sales)
# Step 8: Replacing Inconsistent Value in MarkdownAmnt
storeData %>%filter(MarkdownAmnt < 0) %>% count()
storeData$MarkdownAmnt = ifelse(test = storeData$MarkdownAmnt < 0,yes = NA,no = storeData$MarkdownAmnt)

Step 1: Checking for Missing Values in All Variables: We use colSums and is.na to check for missing values in the variables. There are only two variables with missing values, the sizeDisc variable we created from size which is a factor variable, and MarkdownAmnt which is numeric. For sizeDisc we leave the unknown values as NA’s.

Step 2: Finding Median of MarkdownAmnt: Do deal replace missing values in MarkdownAmnt we find the median so that we can replace missing values with imputed median.

Step 3: Replacing Missing Values with Median: We replace missing values in MarkdownAmnt with the median values. This is just done for demonstration but for real analysis this would not be done because the variable has too many missing values and replacing with mean or median changes the distribution.

Step 4: Check Replacement: We check replacement of the missing values in the variable markdownAmnt

Step 5 & 6: Subset & Summary of Numeric Variables: We subset only the numeric variables and check their summary to find if any variables have inconsistent values. Both weekly Sales and markdownAmnt has negative values which we replace.

Step 7 & 8: Replacing Inconsistent Value in Weekly_Sales & markDownAmnt : The inconsistent values of less than 0 have to be replaced in both the variables. We see that Weekly Sales has 6425 values below 0 and markDownAmnt has 1568 such values. We could replace with mean or median, but due to high no of such values we just replace it with NA.

Scan II

# Step 1: Finding Variables with Outliers
bx1 = boxplot(storeData$Weekly_Sales,main = "Boxplot of Weekly Sales")

bx2 = boxplot(storeData$Temperature,main = "Boxplot of Temperature")

bx3 = boxplot(storeData$Fuel_Price,storeData$Unemployment,main = "Boxplot of Fuel Price & Unemployment")

bx4 = boxplot(storeData$CPI,main = "Boxplot of CPI")

# Step 2: Capping Outlier in Weekly Sales
storeData$Weekly_Sales = ifelse(test = storeData$Weekly_Sales > bx1$stats[5],yes = bx1$stats[5],no = storeData$Weekly_Sales)
# Step 3: Capping Outlier in Temperature
storeData$Temperature = ifelse(test = storeData$Temperature < bx2$stats[1],yes = bx2$stats[1],no = storeData$Temperature)
#Step 4: Capping Outlier in Unemployment
# For lower outliers
storeData$Unemployment = ifelse(test = storeData$Unemployment < bx3$stats[1,2],yes = bx3$stats[1,2],no = storeData$Unemployment)
# For higher outliers
storeData$Unemployment = ifelse(test = storeData$Unemployment > bx3$stats[5,2],yes = bx3$stats[5,2],no = storeData$Unemployment)
# Step 5: Boxplot to Check Capping 
boxplot(storeData$Weekly_Sales,main = "Boxplot of Weekly Sales after Capping")

boxplot(storeData$Temperature,storeData$Unemployment,main = "Boxplot of Temperature and Unemployment after Capping")


Step 1: Finding Variables with Outliers: We plot the boxplot of numeric variables to see if there are any outliers. Only Weekly_Sales, Temperature and Unemployment have outliers. CPI and Fuel price variable do not have outliers.
Weekly Sales has outliers on the higher side, temperature on the lower side and Unemployment both on the higher side and lower side. We find the minimum and maximum value from the boxplot which are not outliers for the variables Weekly Sales , Temperature and Unemployment. These represent the whiskers at the top and bottom of the boxplot. We find this in stats of the boxplot object. We use these min and max values for capping outliers.

Step 2: Capping Outlier in Weekly Sales : We cap the outliers of the variable Weekly Sales to the max value from boxplot which is not an outlier.

Step 3: Capping Outlier in Temperature : The temperature variable has outlier on the lower side. We cap these outliers to the min value from boxplot which is not an outlier.

Step 4: Capping Outlier in Unemployment: Unemployment has outliers both at the higher and lower end. We cap higher outliers to the max value from boxplot which is not an outlier and similarly lower outliers are capped to the lower value from boxplot which is not an outlier.

Step 5: Boxplot to Check Capping: We plot the boxplot of the three variables again to check capping. There are no outliers present.

Transform

# Step 1: Checking Distribution of Variable Unemployment
hist(storeData$Unemployment,main = "Histogram of Unemployment",xlab = "Unemployment")

# Step 2: Creating Sqr & Cube Data
unemloy.sqr = storeData$Unemployment^2
unemploy.cube = storeData$Unemployment^3
# Step 3: Plotting histogram of Transformed Data
hist(unemloy.sqr,main = "Histogram of Squared Unemployment",xlab = "Unemployment")

hist(unemploy.cube,main = "Histogram of Unemployment Raised to 3",xlab = "Unemployment")

# Step 4: Transforming the Unemployment Variable in Dataset
storeData = storeData %>% mutate(unemploymentTrans = Unemployment^2)

Step 1: Checking Distribution of Variable Unemployment: The variable Unemployment needs to be transformed to reduce skewness and make it approximately normal. In this step we check the distribution of the variable. We see that the variable is slightly left skewed.

Step 2: Creating Sqr & Cube Data: From the previous step we know there is slightly left skew. For left skewed data, power of 2,3 etc work well for reducing skewness. Here we create the square and cube of data.

Step 3: Plotting histogram of Transformed Data: We plot the histogram of the transformed data to see if there is a change in the distribution.

Step 4: Transforming the Unemployment Variable in Dataset: From the plot of transformed data we see that the square transformtion works best. Here we mutate the Unemployment variable in the data to create the new variable unemploymentTrans which contains the squared data.



---
title: "MATH2349 Semester 1, 2018"
author: "Rushil R Sirur - s3557134"
subtitle: Assignment 3
output:
  html_notebook: default
---

## Required packages 


```{r,results='hide'}
library(dplyr)
library(tidyr)
```


## Executive Summary 


The data pre-processing is carried out on datasets from the retail domain. There are three datasets which are joined using left join. The final joined data is composed of variables with numeric, character and logical data types.     

Conversion of data type for a few variables had to be carried out. Four variables were converted to factory type and one variable which represented date, was in character data type and was converted into Date format. A numeric variable is also categorized into a new variable with three levels which is an ordered factor. 
The data was checked for tidiness. There were five variables which were present as column names which were worthy of being combined into a single variable using gather function. A new binary variable was created as an indicator of missing values.     

The variables were checked for missing values and outlier, where boxplots were used for outlier checking. Only two variables had missing values, one which was of type factor and the other numeric. The missing values for the numeric variable were replaced by the median value of the variable. The numeric variables which had outliers were modified by capping the outlier values to the upper or lower value from the boxplot which is not an outlier itself.     

Finally, a variable which had left skew was transformed using square transformation in order to make it approximately normal. 


## Data 

The data used for this assignment is a Retail Analytics Dataset that is sourced form Kaggle. 

The datasets are available at the following link: https://www.kaggle.com/manjeetsingh/retaildataset/data

The data description is available at the following link: 
https://www.kaggle.com/manjeetsingh/retaildataset

The datasets are released under CC0: Public Domain. 


There are three datasets Stores, Features and Sales. All three datasets are available in .csv format. The description of the data are as follows: 


**Stores** : This dataset contains 45 stores of a company, with the type and size of the store. The stores data is anonymized. There are 45 observations and the variables of this data set are as follows (as present in original form): 

Variable|Type|Levels|Description
--------|----|------|-----------
Store|Numeric|-|Store Number
Type|Character|A,B,C|Type of Store
Size|Numeric|-|Size of the Store


**Sales**: This is the Sales data for each store by department from 5/2/2010 to 1/11/2012. There are 421570 observations and the variables in this data set are: 


Variable|Type|Levels|Description
--------|----|------|-----------
Store|Numeric|-|Store Number
Dept|Numeric|-|Department Number
Date|Character|-|Representing Week
Weekly_Sales|Numeric|-|Weekly Sales for given store and dept
IsHoliday|Logical|-|If the week has special Holiday


**Features**: Data related to the store and other activity for the dates. There are 8190 rows and the variables are as follows:

Variable|Type|Levels|Description
--------|----|------|-----------
Store|Numeric|-|Store Number
Date|Character|-|Representing Week
Temperature|Numeric|-|Avg Temperature in Region of Store
Fuel Price|Numeric|-|Fuel price in Region of Store
Markdown1-5|Numeric|-|Promotional Markdowns Type and Amnts
CPI|Numeric|-|Consumer Price Index
Unemployment|Numeric|-|Unemployment Rate
IsHoliday|Logical|-|If the week has special Holiday



We first read in the individual datasets.  


```{r}


# Step 1: Reading in Store
stores = read.csv(file = "E:/RMIT/Course Content/MATH 2349 - Data Preprocessing/Assessments/Assessment 3/Assignment/stores data-set.csv",stringsAsFactors = FALSE)

# Step 2: Check if Store data set has been read in properly
head(stores)
dim(stores)

# Step 3: Reading in Sales
sales = read.csv(file = "E:/RMIT/Course Content/MATH 2349 - Data Preprocessing/Assessments/Assessment 3/Assignment/sales data-set.csv",stringsAsFactors = FALSE)

# Step 4: Check head and dim of Sales Data
head(sales)
dim(sales)


# Step 5: Reading in Feature data set
features = read.csv(file = "E:/RMIT/Course Content/MATH 2349 - Data Preprocessing/Assessments/Assessment 3/Assignment/Features data set.csv",stringsAsFactors = FALSE)


# Step 6: Checking head and dim of Features Data 
head(features)
dim(features)

# Step 7: Left Join Stores and Sales Dataset by Store
storeData = stores %>% left_join(y = sales,by = "Store")

# Step 8:  Checking Head and Dim of storeData
head(storeData)
dim(storeData)

# Step 9: Left Join storeData with features by Store and Date 
storeData = storeData %>% left_join(y = features,by = c("Store","Date"))

# Step 10: Checking Head and Dim of storeData
head(storeData)
dim(storeData)

```

**Step 1: Read in Stores Data** : In this step we first read in the store data using read.csv function and save as stores.   
**Step 2: Check Head and Dimension of Store Data** : Check the head of the imported stores data to see if the import is proper and check the dimension of the read in data.The stores data has been read in properly and has 45 observations and 3 variables as expected.     
**Step 3: Read in Sales Data** : The Sales data set is read into R using the read.csv function and saved as sales.     
**Step 4: Check Head and Dimension of Sales Data** : The sales data is read into R properly and has 421570 observations and 5 variables as expected.     
**Step 5: Read in Features Data** : The Features data set is read into R using the read.csv function and saved as features.    
**Step 6: Check Head and Dimension of Features Data** : Check the head and dimensions of the Features data set. The features data set is read into R and it has 8190 observations and 12 variables.       
**Step 7: Left Join Stores and Sales Dataset** : We left join the Sales dataset with the Stores dataset on the variable Store and save the dataset as storeData.    
**Step 8: Check Head and Dimension of storeData Dataset**: The head function and dim function are used to check the head and dimensions of the joined dataset. It has 421570 observations and 7 variables as required.     
**Step 9: Left Join StoreData and Features Dataset** : The features dataset is left joined with storeData created above on the variables Store and Date and stored again as storeData.     
**Step 10: Check Head and Dimension of storeData Dataset** : The head and dimension of the storeData after joining the three tables is checked. There are 421570 observations and 17 rows as expected. 







## Understand 



```{r}
# Step 1: Check class of storeData
class(storeData)

# Step 2: Check Data Types of storeData
str(storeData)

# Step 3: Check Both IsHoliday Variable for Similarity
summary(storeData$IsHoliday.x == storeData$IsHoliday.y)

# Step 4: Dropping IsHoliday.y Variable 
storeData = storeData %>% select(... = -IsHoliday.y )

# Step 5: Renaming IsHoliday.x to IsHoliday
storeData = storeData %>% rename(IsHoliday = IsHoliday.x)
head(storeData)

# Step 6: Changing Data Type of Store, Type, Dept & IsHoliday Variable to Factor
storeData[,c(1,2,4,7)] = lapply(storeData[,c(1,2,4,7)],as.factor)

# Step 7: Check Change of Data Type 
str(storeData)

# Step 8: Converting Date from Character to Date Format
storeData$Date = as.Date(storeData$Date,format = "%d/%m/%Y")

# Step 9: Check Conversion of All Variables 
head(storeData)

# Step 10: Create New Variable for Categorizing Size
sizeDisc = vector(mode = "character",length = dim(storeData)[1])
storeData = cbind(storeData,sizeDisc)

# Step 11: Find Q1 and Q3 to discretize Size Variable
q1 = quantile(x = storeData$Size,probs = c(0.25))
q3 = quantile(x = storeData$Size,probs = c(0.75))


# Step 12: Discretize Size Variable into A New Variable
storeData$sizeDisc =  ifelse(test = storeData$Size < q1,yes = "Small",no = storeData$sizeDisc)
storeData$sizeDisc =  ifelse(test = storeData$Size > q3,yes = "Big",no = storeData$sizeDisc)
storeData$sizeDisc =  ifelse(test = storeData$Size > q1 & storeData$Size < q3 ,yes = "Medium",no = storeData$sizeDisc)


# Step 13: Creating SizeDisc as Ordered Factor
storeData$sizeDisc = factor(storeData$sizeDisc,levels = c("Small","Medium","Big"),ordered = TRUE)

# Step 14: Checking Type of SizeDisc
str(storeData$sizeDisc)

# Step 15: Dropping Variable Size
storeData = storeData %>% select(-Size)

```


****

**Step 1: Check Class of storeData** : The first step is to check the data structure of the dataset storeData. The dataset has a data structure Data Frame.     


**Step 2: Data Types of storeData**: We check the different data types of variables using the str function. This is a convenient method to check the data types of all variables together. There variables are of mix type with 2 variables of character type, two of logical type (which could be same variable) and the rest are of numeric data type. We can also immediately see some issues. For e.g. the variable Date represents Date's but its type is character.    


**Step 3: Check Both IsHoliday Variable For Similarity**: We compare the IsHoliday.x and IsHoliday.y variables to check if they are similar and then use summary function to get the result. We get all 421570 as TRUE which inidicates that they are same variable.  


**Step 4: Dropping IsHoliday.y Variable** : The IsHoliday.y variable is dropped because it is similar to the IsHoliday.x Variable.  

**Step 5: Renaming IsHoliday.x to IsHoliday** : The IsHoliday.x variable is renamed as IsHoliday and stored in storeData and we print the head of the dataset to confirm. 

**Step 6: Changing Data Type of Store, Type & Dept Variable to Factor** : The variables Store and Dept are just identifier variables but were originally in numeric type. Thus, we convert it from numeric data type to factor. The variable Type is of data type character & IsHoliday is logical, hence we change these also to factor type. 

**Step 7: Check Change of Data Type** : We check if the data types have been changed to factor type by using the str function. We see for these variables the data types have been changed to factor. 

**Step 8: Converting Date from Character to Date Format**: We convert the Date variable which is in Character type to Date Format with the as.date function using the format "%d/%m/%Y" .

**Step 9: Check Conversion of All Variables **: Print the head of the dataset to check conversion of data types of all variables. We see that all the variables are converted to required type.

**Step 10: Create New Variable for Categorizing Size** : We would like to categorize the variable size into Small, Medium and Big. To do this we create a new variable called sizeDisc. We bind this new variable with the storeData dataset. 


**Step 11: Find Q1 and Q3 to discretize Size Variable** : To categorize the size variable we will use the 1st quartile and 3rd quartile. Any values below 1st quartile are labelled Small, values of Size between 1st and 3rd quartile are labelled Medium and values above 3rd quartile are labelled Big. Here we calculate the 1st and 3rd quartiles. 

**Step 12: Discretize Size Variable into A New Variable** : Based on the 1st and 3rd quartiles we add categorized values into the new variable sizeDisc from size. 


**Step 13: Creating SizeDisc as Ordered Factor** : The new created variable sizeDisc is based on the values of Size which is numeric. Since the values Small, Medium and Big are ordinal we change the type of the sizeDisc variable to ordered factor.  


**Step 14: Checking Type of SizeDisc** : We check the structure of the variable sizeDisc to check if the conversion to ordered factor has taken place. It has been converted to ordered factor with three levels. 

**Step 15: Dropping Variable Size** : We have categorized the variable Size into a new variable SizeDisc. Hence, we drop the variable size from the dataset. 


In Summary in this section we checked the data types of the different variables which were a mix of logical, numeric and character. We then converted Store, Dept, Type and IsHoliday variables to data type factor and converted Date to date format. Finally, we categorized Size variable into SizeDisc variable and converted the SizeDisc variable into an ordered factor. 







##	Tidy & Manipulate Data I 


For data to be tidy we need the following rules to be adhered: 

* Each variable must be in a column
* Each observation must be in a row
* Each value must be in a cell


The tidy rule which is a problem for this dataset is that there are 5 Markdown prices as variables named MarkDown1 to Markdown5. These variables are just types of Markdown and should be combined into one variable which indicates the types of Markdown. Another variable will indicate the value which these Markdowns take for each observation. We can use the gather function to do this. 


```{r}
# Step 1: Combine Markdown into One Variable Using Gather
storeData = storeData %>% gather(key = "MarkdownType",value = "MarkdownAmnt" ,... = c(MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5))

# Step 2: Check Head of Gathered Data
head(storeData)

# Step 3: Convert MarkDownType to Factor Type 
storeData$MarkdownType = as.factor(storeData$MarkdownType)

```

****

**Step 1: Combine Markdown into One Variable Using Gather**: We combine the different Markdown variables into one variable using gather function. The new variable that holds the type of markdown is MarkdownType and MarkdownAmnt holds the values for each of these Markdowns for each observations. 

**Step 2: Check Head of Gathered Data**: We check the head of the gathered data to see if it has worked. The five variables have been gathered into the two variables as required. 

**Step 3: Convert MarkDownType to Factor Type**: The variable MarkDownType which is created is of character type. So we change it to data type of factor.







##	Tidy & Manipulate Data II 


The data has Weekly Sales for each Store and Dept by Week. Also it shows the different types of Markdown Types for each week and its value. From Step 1 below we see that the MarkdownAmnt variable we created has a lot of missing values. One way of dealing with this is to use imputation. For predictive analytics if a variable has a large number of missing values one way of dealing with this is to create a binary variable indicating if each observation is missing or not for that variable. In some cases whether the original variable is missing or not may be correlated with the target variable (Kelleher, et al., 2015)^1^. When such a variable is created usually the original variable is dropped. Here we create the new variable but do not drop the old variable for the moment since the data is only being pre processed. We create a new binary variable markdownMissing to indicate if the MarkdownAmnt is missing or not. 

**^1^**: Kelleher, J. D., Namee, B. M. & D'Arcy, A., 2015. Data Exploration. In: Fundamentals Of Machine Learning For Predictive Data Analytics: Algorithms, Worked Examples, and Case Studies. London: MIT Press, p. 121.


```{r}
# Step 1: Find Percent of Missing Values in MarkdownAmnt
storeData %>% filter(is.na(MarkdownAmnt)) %>% summarise(Percent = (n()/dim(storeData)[1])*100) 

# Step 2: Create Variable markdownMissing using Mutate
storeData = storeData %>% mutate(markdownMissing = if_else(condition = is.na(MarkdownAmnt),true = 1,false = 0))

# Step 3: Check Creation of markdownMissing
str(storeData$markdownMissing)

# Step 4: Convert markDownMissing to Factor
storeData$markdownMissing = factor(storeData$markdownMissing)

# Step 5: Check Structure 
str(storeData$markdownMissing)


```

**Step 2: Create Variable markdownMissing using Mutate**: Using mutate we create the new variable markdownMissing which takes on 1 if MarkdownAmnt is missing and 0 if MarkdownAmnt is not missing. 


**Step 3: Check Creation of markdownMissing** : We check if the markdownMissing variable has been created by checking the structure of the data.

**Step 4: Convert markDownMissing to Factor**: The markdownMissing can take on only two values 0 and 1. Hence, it is converted to a factor data type. 

**Step 5: Check Structure**: We check if the conversion of markdownMissing to data type factor is successful. It is a factor with two levels 0 and 1. 




##	Scan I 


```{r}
# Step 1: Checking for Missing Values in All Variables
colSums(is.na(storeData))

# Step 2: Finding Median of MarkdownAmnt 
markmedian = median(storeData$MarkdownAmnt,na.rm = TRUE)

# Step 3: Replacing Missing Values with Median
storeData$MarkdownAmnt =  replace_na(data = storeData$MarkdownAmnt,replace = markmedian)

# Step 4: Check Replacement
sum(is.na(storeData$MarkdownAmnt))

# Step 5: Subset Numeric Variables
numeric = select_if(storeData,is.numeric)

# Step 6: Summary of Numeric Variables
summary(numeric)

# Step 7: Replacing Inconsistent Value in Weekly_Sales
storeData %>%filter(Weekly_Sales < 0) %>% count()
storeData$Weekly_Sales = ifelse(test = storeData$Weekly_Sales < 0,yes = NA,no = storeData$Weekly_Sales)

# Step 8: Replacing Inconsistent Value in MarkdownAmnt
storeData %>%filter(MarkdownAmnt < 0) %>% count()
storeData$MarkdownAmnt = ifelse(test = storeData$MarkdownAmnt < 0,yes = NA,no = storeData$MarkdownAmnt)

```

**Step 1: Checking for Missing Values in All Variables**: We use colSums and is.na to check for missing values in the variables. There are only two variables with missing values, the sizeDisc variable we created from size which is a factor variable, and MarkdownAmnt which is numeric. For sizeDisc we leave the unknown values as NA's.  

**Step 2: Finding Median of MarkdownAmnt**: Do deal replace missing values in MarkdownAmnt we find the median so that we can replace missing values with imputed median. 

**Step 3: Replacing Missing Values with Median**: We replace missing values in MarkdownAmnt with the median values. This is just done for demonstration but for real analysis this would not be done because the variable has too many missing values and replacing with mean or median changes the distribution. 

**Step 4: Check Replacement**: We check replacement of the missing values in the variable markdownAmnt

**Step 5 & 6: Subset & Summary of Numeric Variables**: We subset only the numeric variables and check their summary to find if any variables have inconsistent values. Both weekly Sales and markdownAmnt has negative values which we replace. 

**Step 7 & 8: Replacing Inconsistent Value in Weekly_Sales & markDownAmnt **: The inconsistent values of less than 0 have to be replaced in both the variables. We see that Weekly Sales has 6425 values below 0 and markDownAmnt has 1568 such values. We could replace with mean or median, but due to high no of such values we just replace it with NA. 




##	Scan II



```{r}
# Step 1: Finding Variables with Outliers
bx1 = boxplot(storeData$Weekly_Sales,main = "Boxplot of Weekly Sales")
bx2 = boxplot(storeData$Temperature,main = "Boxplot of Temperature")
bx3 = boxplot(storeData$Fuel_Price,storeData$Unemployment,main = "Boxplot of Fuel Price & Unemployment")
bx4 = boxplot(storeData$CPI,main = "Boxplot of CPI")

# Step 2: Capping Outlier in Weekly Sales
storeData$Weekly_Sales = ifelse(test = storeData$Weekly_Sales > bx1$stats[5],yes = bx1$stats[5],no = storeData$Weekly_Sales)

# Step 3: Capping Outlier in Temperature
storeData$Temperature = ifelse(test = storeData$Temperature < bx2$stats[1],yes = bx2$stats[1],no = storeData$Temperature)

#Step 4: Capping Outlier in Unemployment
# For lower outliers
storeData$Unemployment = ifelse(test = storeData$Unemployment < bx3$stats[1,2],yes = bx3$stats[1,2],no = storeData$Unemployment)
# For higher outliers
storeData$Unemployment = ifelse(test = storeData$Unemployment > bx3$stats[5,2],yes = bx3$stats[5,2],no = storeData$Unemployment)

# Step 5: Boxplot to Check Capping 
boxplot(storeData$Weekly_Sales,main = "Boxplot of Weekly Sales after Capping")
boxplot(storeData$Temperature,storeData$Unemployment,main = "Boxplot of Temperature and Unemployment after Capping")

```
****

**Step 1: Finding Variables with Outliers**: We plot the boxplot of numeric variables to see if there are any outliers. Only Weekly_Sales, Temperature and Unemployment have outliers. CPI and Fuel price variable do not have outliers.   
Weekly Sales has outliers on the higher side, temperature on the lower side and Unemployment both on the higher side and lower side. We find the minimum and maximum value from the boxplot which are not outliers for the variables Weekly Sales , Temperature and Unemployment. These represent the whiskers at the top and bottom of the boxplot. We find this in stats of the boxplot object. We use these min and max values for capping outliers. 

**Step 2: Capping Outlier in Weekly Sales** : We cap the outliers of the variable Weekly Sales to the max value from boxplot which is not an outlier. 


**Step 3: Capping Outlier in Temperature** : The temperature variable has outlier on the lower side. We cap these outliers to the min value from boxplot which is not an outlier. 

**Step 4: Capping Outlier in Unemployment**: Unemployment has outliers both at the higher and lower end. We cap higher outliers to the max value from boxplot which is not an outlier and similarly lower outliers are capped to the lower value from boxplot which is not an outlier. 

**Step 5: Boxplot to Check Capping**: We plot the boxplot of the three variables again to check capping. There are no outliers present. 



##	Transform 


```{r}
# Step 1: Checking Distribution of Variable Unemployment
hist(storeData$Unemployment,main = "Histogram of Unemployment",xlab = "Unemployment")

# Step 2: Creating Sqr & Cube Data
unemloy.sqr = storeData$Unemployment^2
unemploy.cube = storeData$Unemployment^3

# Step 3: Plotting histogram of Transformed Data
hist(unemloy.sqr,main = "Histogram of Squared Unemployment",xlab = "Unemployment")
hist(unemploy.cube,main = "Histogram of Unemployment Raised to 3",xlab = "Unemployment")

# Step 4: Transforming the Unemployment Variable in Dataset
storeData = storeData %>% mutate(unemploymentTrans = Unemployment^2)

```


**Step 1: Checking Distribution of Variable Unemployment**: The variable Unemployment needs to be transformed to reduce skewness and make it approximately normal. In this step we check the distribution of the variable. We see that the variable is slightly left skewed. 

**Step 2: Creating Sqr & Cube Data**: From the previous step we know there is slightly left skew. For left skewed data, power of 2,3 etc work well for reducing skewness. Here we create the square and cube of data. 


**Step 3: Plotting histogram of Transformed Data**: We plot the histogram of the transformed data to see if there is a change in the distribution. 


**Step 4: Transforming the Unemployment Variable in Dataset**: From the plot of transformed data we see that the square transformtion works best. Here we mutate the Unemployment variable in the data to create the new variable unemploymentTrans which contains the squared data. 

<br>
<br>
