These are the packages required for the data preprocessing, to make the data tidy and ready for statistical analysis.
# This is the R chunk for the required packages
library(readr)     # read the CSV files
library(dplyr)     # data manipulation and joins
library(tidyr)     # tidying the data (separate())
library(magrittr)  # pipe operator (%>%)
library(outliers)  # z-scores via scores()
library(swirl)     # interactive R tutorials (not used in the steps below)
library(rvest)     # web scraping (not used in the steps below)
In this data preprocessing, several steps are carried out to make the dataset ready for statistical analysis. First, the data structures and data types of the variables are examined so that the correct preprocessing methods can be applied to the numeric and character variables. The character variables “Gender” and “Branch” are converted to factors, and the levels of “Branch” (A, B and C) are relabelled as Melbourne, Sydney and Perth. The next step removes untidiness from the dataset, which is checked against the three principles of tidy data. One principle is violated here because the “Date” column stores multiple pieces of information in a single column; the “separate()” function is used to split it into separate columns. The tidy data is then checked for missing values, which are recoded by assigning values using a for loop. After recoding, the dataset is checked for possible outliers, which are handled using Tukey’s method and the z-score method. Lastly, because the “Total” column remains right-skewed, a cube-root transformation is applied to make the data approximately normally distributed. The data is then close to normally distributed and ready for further analysis.
The two datasets are the supermarket sales and the total income per product, sourced from https://www.kaggle.com/aungpyaeap/supermarket-sales. The sales dataset records, for each customer (male/female), the quantity of items purchased, unit price, product line, total cost, unique invoice ID, date, and branch of the supermarket. The income dataset gives the mode of payment, gross margin percentage, cost of goods sold (COGS), supermarket rating, and invoice ID.
Supermarket sales dataset variable descriptions:
Invoice id: Computer generated sales slip invoice identification number.
Date: Date of purchase.
Branch: Branch of supercenter (3 branches are available identified by A, B and C).
Gender: Gender type of customer (Male or Female).
Product line: General item categorization groups ( Electronic accessories, Fashion accessories, Food and beverages, Health and beauty, Home and lifestyle, Sports and travel).
Unit price: Price of each product.
Quantity: Number of products purchased by customer.
Total: Total price.
Supermarket income dataset variable descriptions:
Invoice id: Computer generated sales slip invoice identification number.
Payment: Payment method used by customer for purchase (3 methods – Cash, Credit card and Ewallet).
COGS: Cost of goods sold.
Gross margin percentage: Gross margin percentage.
Rating: Customer satisfaction rating of their overall shopping experience (on a scale of 1 to 10).
The two datasets are read with the “read_csv()” function, as both files are in CSV format, and are stored in the “sales” and “income” variables respectively. Since both datasets share “Invoice ID” as a common attribute, the “inner_join()” function is used to join them. The merged dataset is stored in a variable called “merge”, which contains 1000 observations and 12 variables.
# This is the R chunk for the Data Section
sales <- read_csv("datasets_205965_451952_supermarket_sales.csv")
Parsed with column specification:
cols(
  `Invoice ID` = col_character(),
  Date = col_character(),
  Branch = col_character(),
  Gender = col_character(),
  `Product line` = col_character(),
  `Unit price` = col_double(),
  Quantity = col_double(),
  Total = col_character()
)
income <- read_csv("datasets_205965_451952_supermarket_income.csv")
Parsed with column specification:
cols(
  `Invoice ID` = col_character(),
  Payment = col_character(),
  cogs = col_double(),
  `gross margin percentage` = col_double(),
  Rating = col_double()
)
merge <- inner_join(sales,income)
Joining, by = "Invoice ID"
head(merge)
The “str()” function is used to give a brief summary of the variables. From the output it can be inferred that the dataset consists of variables with multiple data types, which determines the preprocessing steps required for each.
The dataset contains “Branch” and “Gender” as character variables, which are converted to factors using the “factor()” function.
The “Branch” variable consists of three branches (A, B and C), which is difficult to interpret. To make interpretation simpler, the “factor()” function is used with the “levels” and “labels” arguments to relabel the branches as Melbourne, Sydney and Perth.
# This is the R chunk for the Understand Section
str(merge)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 1000 obs. of 12 variables:
$ Invoice ID : chr "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
$ Date : chr "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
$ Branch : chr "A" "C" "A" "A" ...
$ Gender : chr "Female" "Female" "Male" "Male" ...
$ Product line : chr "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
$ Unit price : num 74.7 15.3 46.3 58.2 86.3 ...
$ Quantity : num 7 5 7 8 7 7 6 10 2 3 ...
$ Total : chr "522.83" "76.4" "324.31" ".." ...
$ Payment : chr "Ewallet" "Cash" "Credit card" "Ewallet" ...
$ cogs : num 522.8 76.4 324.3 465.8 604.2 ...
$ gross margin percentage: num 4.76 4.76 4.76 4.76 4.76 ...
$ Rating : num 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
merge$Branch <- factor(merge$Branch)
levels(merge$Branch)
[1] "A" "B" "C"
merge$Gender <- factor(merge$Gender)
levels(merge$Gender)
[1] "Female" "Male"
merge$Branch <- factor(merge$Branch,levels = c("A","B","C"),
labels = c("Melbourne","Sydney","Perth"))
head(merge$Branch)
[1] Melbourne Perth Melbourne Melbourne Melbourne Perth
Levels: Melbourne Sydney Perth
A dataset is said to be tidy when it follows the three principles of tidiness: 1] Each variable must have its own column.
2] Each observation must have its own row.
3] Each value must have its own cell.
However, one of these principles is not satisfied in this dataset: multiple pieces of information are stored in a single column (Date), so the data is untidy. To make the data tidy, the “separate()” function is used to split this information into separate columns (Month, Day, Year).
# This is the R chunk for the Tidy & Manipulate Data I
merge <- merge%>%separate(Date,into = c("Month","Day","Year"),sep = "/")
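# Month, Day and Year are kept as character columns here; if numeric values are
# needed later, they could be converted, e.g. (a sketch, not part of the original
# workflow):
# merge <- merge %>% mutate(across(c(Month, Day, Year), as.integer))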
head(merge)
A new column (gross_income) is added to the dataset using the “mutate()” function. It gives the amount earned by the supermarket, calculated as the product of the “cogs” column and the “gross margin percentage” column divided by 100.
# This is the R chunk for the Tidy & Manipulate Data II
merge <- mutate(merge,gross_income = cogs*(`gross margin percentage`/100))
head(merge)
The dataset is checked for NA values using the “is.na()” function together with the “sum()” function to determine the total number of NAs. The sum is zero, which means there are currently no NA values.
Next, the dataset is checked for missing values, special values, or errors. The “Total” variable contains missing values represented by “..”, which are first replaced with NA. Using the “sum()” function it can be seen that eight values are missing.
To recode the missing values, a for loop is used to assign a value to the “Total” variable, computed as the product of the “Unit price” and the “Quantity” purchased.
Once the missing values are handled, the variable is converted to its numeric data type using the “as.numeric()” function.
Using the “str()” function it can be observed that all variables now have proper data types and there are no missing or NA values.
# This is the R chunk for the Scan I
sum(is.na(merge))
[1] 0
merge$Total[merge$Total==".."] <- NA
sum(is.na(merge$Total))
[1] 8
for (i in 1:length(merge$Total)) {
if(is.na(merge[i, 10])){
merge[i, 10] <- merge[i, 8] * merge[i, 9]
}
}
merge$Total<- as.numeric(merge$Total)
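# Note: the loop above, together with the as.numeric() conversion, could also be
# written as a single vectorized step with dplyr (a sketch, equivalent in effect
# to the original workflow):
# merge <- merge %>%
#   mutate(Total = coalesce(as.numeric(Total), `Unit price` * Quantity))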
head(merge)
str(merge)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 1000 obs. of 15 variables:
$ Invoice ID : chr "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
$ Month : chr "1" "3" "3" "1" ...
$ Day : chr "5" "8" "3" "27" ...
$ Year : chr "2019" "2019" "2019" "2019" ...
$ Branch : Factor w/ 3 levels "Melbourne","Sydney",..: 1 3 1 1 1 3 1 3 1 2 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 1 1 1 ...
$ Product line : chr "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
$ Unit price : num 74.7 15.3 46.3 58.2 86.3 ...
$ Quantity : num 7 5 7 8 7 7 6 10 2 3 ...
$ Total : num 522.8 76.4 324.3 465.8 604.2 ...
$ Payment : chr "Ewallet" "Cash" "Credit card" "Ewallet" ...
$ cogs : num 522.8 76.4 324.3 465.8 604.2 ...
$ gross margin percentage: num 4.76 4.76 4.76 4.76 4.76 ...
$ Rating : num 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
$ gross_income : num 24.9 3.64 15.44 22.18 28.77 ...
Outliers must be handled in the dataset because:
- they increase the error variance,
- they reduce the power of statistical tests,
- they can bias or influence the estimates of model parameters that may be of substantive interest.
The outlier detection methods used are:
1] Tukey’s method of outlier detection, used for non-symmetric data distributions.
2] The z-score method, used for normal or symmetric distributions.
Histograms of all the numeric variables are plotted to check their distributions. Only the “Total” and “gross_income” variables appear to have non-symmetric distributions, so Tukey’s method is used to check them for outliers.
The “summarize()” function is used to calculate the first and third quartiles, and the “IQR()” function is used to calculate the interquartile range, which is needed to work out the lower and upper fences beyond which a value is considered an outlier.
After calculating the lower and upper fences, it is found that the values of “Total” and “gross_income” lie well within the fences and should not be treated as outliers.
The z-score method is used to check the other three variables for outliers. None of them shows an absolute z-score greater than 3, which means there are no outliers in the “Unit price”, “Quantity” and “Rating” columns.
# This is the R chunk for the Scan II
hist(merge$`Unit price`)
hist(merge$Quantity)
hist(merge$Rating)
hist(merge$Total)
hist(merge$gross_income)
merge$Total%>%boxplot(main="TOTAL PRICE",col="yellow")
merge$gross_income%>%boxplot(main="INCOME",col="blue")
merge%>%summarize(q1=quantile(Total,probs = 0.25),
q3=quantile(Total,probs = 0.75))
iqr <- IQR(merge$Total)
upper_fence_total <- 1.5*iqr
upper_fence_total
[1] 495.6113
lower_fence_total <- -1.5*iqr
lower_fence_total
[1] -495.6113
merge%>%summarize(q1=quantile(gross_income,probs = 0.25),
q3=quantile(gross_income,probs = 0.75))
iqr <- IQR(merge$gross_income)
upper_fence_income <- 1.5*iqr
upper_fence_income
[1] 23.60054
lower_fence_income <- -1.5*iqr
lower_fence_income
[1] -23.60054
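# Note: the fences above are computed as +/- 1.5 * IQR around zero. Conventional
# Tukey fences are anchored at the quartiles; a sketch of that calculation for
# "Total" would be:
# q1 <- quantile(merge$Total, probs = 0.25)
# q3 <- quantile(merge$Total, probs = 0.75)
# lower_fence <- q1 - 1.5 * IQR(merge$Total)
# upper_fence <- q3 + 1.5 * IQR(merge$Total)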
z.scores <- merge$`Unit price` %>% scores(type = "z")
z.scores %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.72081 -0.86044 -0.01669 0.00000 0.84028 1.67158
length(which( abs(z.scores) >3 ))
[1] 0
z.scores <- merge$Quantity %>% scores(type = "z")
z.scores %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.5427 -0.8586 -0.1745 0.0000 0.8517 1.5359
length(which( abs(z.scores) >3 ))
[1] 0
z.scores <- merge$Rating %>% scores(type = "z")
z.scores %>% summary()
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.72974 -0.85693 0.01589 0.00000 0.88870 1.76151
length(which( abs(z.scores) >3 ))
[1] 0
The “Total” variable is used for the data transformation. Its histogram shows that the distribution is right-skewed, so a transformation is needed to make it approximately normal. Because the data is right-skewed, a cube-root transformation is applied, which brings the distribution close to normal.
Histograms before and after the transformation are shown for comparison.
# This is the R chunk for the Transform Section
hist(merge$Total,main="Before Transformation")
data_transform <- merge$Total^(1/3)
hist(data_transform,main = "After Transformation")
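# Other transformations commonly tried for right-skewed data, shown only as a
# sketch for comparison (not part of the analysis above):
# hist(log(merge$Total), main = "Log transformation")
# hist(sqrt(merge$Total), main = "Square-root transformation")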
References
1] Supermarket sales (2020), Kaggle, https://www.kaggle.com/aungpyaeap/supermarket-sales
2] Data Wrangling - Module Notes 5
3] Data Wrangling - Module Notes 6
4] Data Wrangling - Module Notes 7