library(tidyr)
library(readr)
library(outliers)
## Warning: package 'outliers' was built under R version 4.0.3
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
This assignment requires performing various data pre-processing tasks. I have chosen a dataset from Kaggle containing sales statistics for two top sport brands, Nike and Adidas, in Australia. The data includes cost and sale values, the number of units sold, and the postcode of each transaction, day by day, for the 2016-2018 period. Several packages were used, including readr, outliers, tidyr, lubridate and dplyr.
The source Excel file contained three sheets, which I imported as three separate files and pre-processed as three datasets. One of them, saved as Untidy_Suburb.csv, was converted from wide to long format with the gather() function. All three datasets were then merged into one so that the remaining processing could be done within a single dataset. From the available sales data, I mutated a new variable, 'Profit', for analysing profits on a selected data segment, and subset the data to that particular segment for analysis.
The dataset was then scanned for NA values and irregularities; it turned out to be largely complete, although further study identified a few negative values for units sold. I scanned the Profit variable for possible outliers for each retailer, identifying them with boxplots. The capping method was demonstrated, replacing outliers with the nearest values that are not outliers; however, the capped values were kept in a separate variable and the original outliers were left in the dataset, since their significance plays a vital role in the data. In the last step I transformed the Profit variable, as its distribution was heavily skewed.
The three datasets chosen for this assignment concern the top sport brands Nike and Adidas and were downloaded from Kaggle (https://www.kaggle.com/australiastats/australia-top-sport-chains?). The data describes their sales in Australia. The file (AustraliaChains.xlsx) consisted of three sheets: Fact Table, Managers and Suburbs.
These sheets were converted to separate CSV files, AusChain.csv (Fact Table), AusChainManagers.csv (Managers) and Untidy_Suburb.csv (Suburbs), before being imported into R. AusChain.csv consists of 7 columns describing the sales, while the Managers and Suburbs files consist of 3 columns each, covering the manager and store location.
The CSV files are all read with read_csv() from readr. The Untidy_Suburb dataset is reshaped with the gather() function and reduced to distinct suburbs, and then all three datasets are merged in two steps with dplyr's join function, inner_join(), using the common variable "Postcode".
AusChain <- read_csv("AusChain.csv")
## Parsed with column specification:
## cols(
## Date = col_character(),
## Chain = col_character(),
## Postcode = col_double(),
## Category = col_character(),
## Total.Units = col_double(),
## Sale.Price = col_double(),
## Cost.Price = col_double()
## )
AusChain <- data.frame(AusChain)
str(AusChain)
## 'data.frame': 76466 obs. of 7 variables:
## $ Date : chr "1/01/2016" "1/01/2016" "1/01/2016" "1/01/2016" ...
## $ Chain : chr "Nike" "Nike" "Nike" "Nike" ...
## $ Postcode : num 2650 3550 4053 3076 2031 ...
## $ Category : chr "Home" "Groceries" "Home" "Groceries" ...
## $ Total.Units: num 175 908 1 714 65 87 92 21 127 77 ...
## $ Sale.Price : num 2.78 0.61 500 1.14 0.48 3.66 5.21 4.49 4.19 4.73 ...
## $ Cost.Price : num 6.35 0.93 750 1.46 3.9 ...
head(AusChain)
AusChainManagers <- read_csv("AusChainManagers.csv")
## Parsed with column specification:
## cols(
## Suburb = col_character(),
## Postcode = col_double(),
## Manager = col_character()
## )
AusChainManagers <- data.frame(AusChainManagers)
str(AusChainManagers)
## 'data.frame': 98 obs. of 3 variables:
## $ Suburb : chr "Darwin" "Alice Springs" "Alexandria" "Mascot" ...
## $ Postcode: num 800 870 2015 2020 2031 ...
## $ Manager : chr "Michael Carney" "Michael Carney" "Jeremy Garcia" "Jeremy Garcia" ...
head(AusChainManagers)
UntidySuburbs <- read_csv("Untidy_Suburb.csv")
## Parsed with column specification:
## cols(
## Postcode = col_double(),
## ACT = col_character(),
## NSW = col_character(),
## NT = col_character(),
## QLD = col_character(),
## SA = col_character(),
## TAS = col_character(),
## VIC = col_character(),
## WA = col_character()
## )
UntidySuburbs <- data.frame(UntidySuburbs)
str(UntidySuburbs)
## 'data.frame': 97 obs. of 9 variables:
## $ Postcode: num 870 2015 2020 2031 2064 ...
## $ ACT : chr NA NA NA NA ...
## $ NSW : chr NA "Alexandria" "Mascot" "Randwick" ...
## $ NT : chr "Alice Springs" NA NA NA ...
## $ QLD : chr NA NA NA NA ...
## $ SA : chr NA NA NA NA ...
## $ TAS : chr NA NA NA NA ...
## $ VIC : chr NA NA NA NA ...
## $ WA : chr NA NA NA NA ...
head(UntidySuburbs)
# Reshape the suburbs table from wide (one column per state) to long format
tidySuburbs <- UntidySuburbs %>% gather(ACT, NSW, NT, QLD, SA, TAS, VIC, WA, key = "State", value = "Suburb")
# Keep one row per suburb
SortedSuburbs <- tidySuburbs %>% distinct(Suburb, .keep_all = TRUE)
# Step 1: attach managers to the suburb reference table via Postcode
MgrAusChain <- SortedSuburbs %>% inner_join(AusChainManagers, c("Postcode" = "Postcode"))
refMgrAusChain <- select(MgrAusChain, State, Postcode, Suburb.x, Manager)
# Step 2: attach the reference table to the fact table via Postcode
NewAusChain <- AusChain %>% inner_join(refMgrAusChain, c("Postcode" = "Postcode"))
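An inner join silently drops fact rows whose Postcode has no match in the reference table; the merged data ends up with 75,983 rows against 76,466 in AusChain. As a quick sanity check, a sketch using dplyr's anti_join() would list the unmatched rows (the name unmatched is only illustrative):
unmatched <- AusChain %>% anti_join(refMgrAusChain, by = "Postcode") # rows with no Postcode match
nrow(unmatched)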
After merging the three datasets and the initial tidy-up, the data consisted of 10 variables of numeric and character types. I converted the 'Chain' variable from character to factor using the factor() function with explicit levels and labels. I also converted the 'Date' variable from character to Date format.
distinct(NewAusChain, Chain) # inspect the unique chain values
NewAusChain$Chain <- factor(NewAusChain$Chain, levels = c("Nike", "Adidas"), labels = c("Nike", "Adidas"))
levels(NewAusChain$Chain)
## [1] "Nike" "Adidas"
class(NewAusChain$Chain)
## [1] "factor"
NewAusChain$Date <- as.Date(NewAusChain$Date, format = "%d/%m/%Y")
str(NewAusChain$Date)
## Date[1:75983], format: "2016-01-01" "2016-01-01" "2016-01-01" "2016-01-01" "2016-01-01" ...
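As an aside, lubridate is loaded at the top but never used; its dmy() parser offers an equivalent conversion that infers the day/month/year order without an explicit format string. A minimal sketch:
dmy("1/01/2016") # lubridate alternative; parses to the Date "2016-01-01"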
In this step I subset the data for the assignment, keeping only the 'Kids' category from the 'VIC' state. I then renamed the Total.Units, Sale.Price, Cost.Price and Suburb.x variables for easier reference.
VICKids <- filter(NewAusChain, State == "VIC" & Category == "Kids") # subset to VIC Kids only
VICKids <- rename(VICKids, Units = Total.Units, Sale_Price = Sale.Price, Cost_Price = Cost.Price, Suburb = Suburb.x) # rename variables
To find the profit for each observation, I created three new variables: 'Sale_Value' (Units * Sale_Price), 'Total_Cost' (Units * Cost_Price) and then 'Profit' (Sale_Value - Total_Cost).
After creating those variables, I selected only the following variables, in this order: Date, Chain, Category, Units, Sale_Price, Cost_Price, Sale_Value, Total_Cost, Profit, State, Postcode, Suburb and Manager.
Inspecting the 'Units' variable revealed six negative values (-1). As negative unit counts are implausible, these were updated to 1 on the assumption that they were data-entry errors.
Finally, I re-arranged the dataset by Chain, Postcode and Category. The resulting dataset, 'VICKidsChain', is used as the main dataset for the analysis.
VICKids <- mutate(VICKids, Sale_Value = Units * Sale_Price, Total_Cost = Cost_Price * Units, Profit = Sale_Value-Total_Cost)
VICKids <- select(VICKids, Date,Chain,Category,Units,Sale_Price,Cost_Price,Sale_Value,Total_Cost, Profit,State,Postcode,Suburb,Manager)
filter(VICKids, Units < 0) # inspect the six rows with negative units
VICKids$Units[VICKids$Units == -1] <- 1 # treat -1 as a data-entry error
VICKidsChain <- arrange(VICKids, Chain,Postcode,Category)
str(VICKidsChain)
## 'data.frame': 2479 obs. of 13 variables:
## $ Date : Date, format: "2016-01-01" "2016-01-01" ...
## $ Chain : Factor w/ 2 levels "Nike","Adidas": 1 1 1 1 1 1 1 1 1 1 ...
## $ Category : chr "Kids" "Kids" "Kids" "Kids" ...
## $ Units : num 68 17 4 6 16 230 208 481 1 15 ...
## $ Sale_Price: num 1.95 2.47 3.75 4.33 5.24 4.03 5.12 3.81 2 2.73 ...
## $ Cost_Price: num 2.75 3.28 2.75 1.65 1.71 3.41 3.47 2.51 6.5 2.69 ...
## $ Sale_Value: num 132.6 42 15 26 83.8 ...
## $ Total_Cost: num 187 55.8 11 9.9 27.4 ...
## $ Profit : num -54.4 -13.8 4 16.1 56.5 ...
## $ State : chr "VIC" "VIC" "VIC" "VIC" ...
## $ Postcode : num 3018 3018 3018 3018 3018 ...
## $ Suburb : chr "Altona" "Altona" "Altona" "Altona" ...
## $ Manager : chr "Brian Hubbard" "Brian Hubbard" "Brian Hubbard" "Brian Hubbard" ...
In this dataset I checked for null values with the is.null() function (which only confirms the object itself is not NULL) and counted missing values per column with sapply() and colSums() over is.na(). The dataset is near perfect: there are no null or missing values. Infinite values were also checked for and none were found; the rendered output omits that check, so a sketch of one is shown after the counts below. The only irregularity was the six negative values (-1) in the 'Units' variable, which was addressed in the previous section.
is.null(VICKids)
## [1] FALSE
sapply(VICKids, function(x) sum( is.na(x) ))
## Date Chain Category Units Sale_Price Cost_Price Sale_Value
## 0 0 0 0 0 0 0
## Total_Cost Profit State Postcode Suburb Manager
## 0 0 0 0 0 0
colSums(is.na(VICKids))
## Date Chain Category Units Sale_Price Cost_Price Sale_Value
## 0 0 0 0 0 0 0
## Total_Cost Profit State Postcode Suburb Manager
## 0 0 0 0 0 0
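The infinite-value check mentioned above might, as a minimal sketch over the numeric columns, look like this:
sapply(VICKids[sapply(VICKids, is.numeric)], function(x) sum(is.infinite(x))) # count Inf/-Inf per numeric column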
This dataset has a very high standard deviation, since both brands sell differentiated products whose quantities vary vastly, and there are big price differences across most categories as well. The boxplot method identified 277 outliers in the Profit variable. After obtaining the summarised details, boxplots and a histogram were drawn for visualisation: one boxplot for Nike and Adidas together and one split by chain.
The histogram visualises the data spread and shows an obviously right-skewed distribution.
summary(VICKidsChain)
## Date Chain Category Units
## Min. :2016-01-01 Nike :1524 Length:2479 Min. : 1.0
## 1st Qu.:2016-06-01 Adidas: 955 Class :character 1st Qu.: 9.0
## Median :2016-11-01 Mode :character Median : 36.0
## Mean :2016-10-24 Mean : 131.9
## 3rd Qu.:2017-04-01 3rd Qu.: 140.0
## Max. :2017-08-01 Max. :2625.0
## Sale_Price Cost_Price Sale_Value Total_Cost
## Min. : 0.250 Min. : 0.350 Min. : -14.99 Min. : -8.75
## 1st Qu.: 3.160 1st Qu.: 2.000 1st Qu.: 30.92 1st Qu.: 19.89
## Median : 4.730 Median : 2.840 Median : 180.78 Median : 105.00
## Mean : 4.844 Mean : 2.939 Mean : 685.19 Mean : 390.58
## 3rd Qu.: 6.055 3rd Qu.: 3.600 3rd Qu.: 739.96 3rd Qu.: 422.60
## Max. :30.560 Max. :15.000 Max. :12573.75 Max. :7428.75
## Profit State Postcode Suburb
## Min. :-199.66 Length:2479 Min. :3018 Length:2479
## 1st Qu.: 6.48 Class :character 1st Qu.:3131 Class :character
## Median : 69.12 Mode :character Median :3199 Mode :character
## Mean : 294.61 Mean :3320
## 3rd Qu.: 306.13 3rd Qu.:3550
## Max. :5878.46 Max. :3977
## Manager
## Length:2479
## Class :character
## Mode :character
##
##
##
VICKidsChain %>% group_by(Chain) %>% summarise(Min = min(Profit, na.rm = TRUE),
Q1 = quantile(Profit, probs = 0.25, na.rm = TRUE),
Median = median(Profit, na.rm = TRUE),
Q3 = quantile(Profit, probs = 0.75, na.rm = TRUE),
Max = max(Profit, na.rm = TRUE),
IQR = Q3 - Q1,
Mean = mean(Profit, na.rm = TRUE),
SD = sd(Profit, na.rm = TRUE),
n = n(),
Missing = sum(is.na(Profit)))
## `summarise()` ungrouping output (override with `.groups` argument)
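The ungrouping message above is informational only; in dplyr 1.0 and later it can be silenced by adding .groups = "drop" as the final argument to summarise().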
boxplot(VICKidsChain["Profit"], las=2, main = "Victoria Kids Profit(All)", ylab = "Profit", xlab = "All")
boxplot(VICKidsChain$Profit ~ VICKidsChain$Chain, main="Victoria Kids Profit by Chain", ylab = "Profit", xlab = "Chain")
outliers<- boxplot(VICKidsChain$Profit)$out
outliers
## [1] 1040.00 884.53 1497.30 1045.80 1497.65 2057.92 763.84 1162.35 1313.18
## [10] 1387.36 2986.64 1970.54 2856.00 772.80 1632.80 1317.33 901.48 1067.82
## [19] 1317.90 1970.54 2856.00 772.80 1632.80 1317.33 901.48 1438.80 1875.72
## [28] 2117.70 3614.16 892.08 905.92 1182.81 1281.00 1444.19 775.20 1662.78
## [37] 2131.57 870.25 1024.80 1299.52 2240.35 2528.16 2320.78 2993.76 861.84
## [46] 870.84 1920.45 1098.72 823.60 957.06 1704.30 2509.77 2320.78 2993.76
## [55] 861.84 870.84 1920.45 1098.72 867.14 1128.00 2563.55 3146.96 802.20
## [64] 824.40 902.40 968.89 806.26 1389.96 1825.35 814.00 1663.80 807.07
## [73] 1323.00 817.02 837.80 883.12 999.18 2357.59 814.00 1663.80 807.07
## [82] 1323.00 817.02 837.80 883.12 865.24 1136.02 2991.56 3240.90 1058.94
## [91] 1043.19 1373.09 1341.83 1789.76 797.04 1763.20 2361.15 762.42 765.00
## [100] 927.18 1891.40 2488.35 1817.24 2887.05 3560.04 921.30 936.26 2519.54
## [109] 2586.72 1087.11 1133.60 1761.95 899.76 1444.60 1082.25 1244.40 957.00
## [118] 836.57 2137.52 3461.15 921.30 936.26 2519.54 2586.72 1087.11 1133.60
## [127] 1761.95 899.76 1444.60 1082.25 1244.40 2342.44 3447.74 4437.45 5878.46
## [136] 1017.36 1282.65 1812.52 1966.56 804.54 849.68 1049.04 2012.80 2180.95
## [145] 2560.04 2591.60 1213.70 950.04 1316.80 1309.88 1324.32 1476.45 2560.04
## [154] 2591.60 1213.70 950.04 1316.80 840.51 942.37 3488.20 3596.12 1440.26
## [163] 1754.00 2024.88 986.40 2197.60 3877.25 4163.20 1391.27 2438.20 1307.71
## [172] 1068.65 1325.40 2309.37 1073.65 2336.13 1391.27 2438.20 1307.71 1068.65
## [181] 1325.40 2309.37 1780.17 2064.23 4975.11 5145.00 927.00 1025.00 2351.52
## [190] 1522.80 1915.20 993.60 2150.20 2340.80 771.00 853.83 967.19 1203.60
## [199] 1246.08 2598.48 3317.85 2335.23 3286.00 896.14 1229.80 846.32 1535.76
## [208] 1557.74 2320.67 2434.83 2335.23 3286.00 896.14 1229.80 846.32 1557.74
## [217] 1535.76 1103.36 1188.44 3992.56 4906.20 931.00 830.27 1289.96 1306.10
## [226] 2058.20 877.10 1320.69 1173.84 1312.42 2346.19 2302.88 2465.68 920.40
## [235] 1422.40 1361.70 1146.09 1145.19 1222.64 2302.88 2465.68 920.40 1422.40
## [244] 1361.70 1146.09 1196.98 2519.88 3188.66 800.80 1745.92 888.28 1159.06
## [253] 1073.48 960.40 886.60 900.24 1432.08 1078.44 2222.14 1171.83 785.00
## [262] 1150.72 1078.44 2222.14 1171.83 785.00 1035.20 1789.11 1954.26 1025.00
## [271] 1162.80 1162.80 878.46 761.56 761.56 853.30 853.30
length(outliers)
## [1] 277
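As a cross-check on that count, the 1.5 * IQR rule behind the boxplot can be computed directly. This is only a sketch: boxplot() derives its fences from hinges, which can differ slightly from quantile() quartiles, so the count may not match 277 exactly.
q <- quantile(VICKidsChain$Profit, c(0.25, 0.75)) # quartiles of Profit
fence <- 1.5 * (q[2] - q[1]) # 1.5 * IQR
sum(VICKidsChain$Profit < q[1] - fence | VICKidsChain$Profit > q[2] + fence) # points outside the fences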
hist(VICKidsChain$Profit)
VICKidsChain$Profit %>% hist(col = "blue",
ylim = c(0,0.003),
xlim = c(0,6000),
xlab = "Profit($)",
main = "Profit by Chain",
breaks = 10,
density = 20,
prob = TRUE)
lines(density(VICKidsChain$Profit, adjust = 2), col = "red", lwd = 2)
A fair way of treating outliers is capping, which replaces them with the nearest values that are not outliers. For observations lying outside the outlier fences on a boxplot, we cap by replacing those below the lower limit with the 5th percentile and those above the upper limit with the 95th percentile. Comparing the capped values with the initial summary, the maximum clearly comes down drastically.
In the last step I checked the distribution of the 'Profit' variable. The first histogram shows that the distribution is skewed to the right; applying the log function shifts it towards a normal distribution. Note that log() is undefined for negative values, so the negative profits produce the NaN warning seen below.
summary(VICKidsChain$Profit)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -199.66 6.48 69.12 294.61 306.13 5878.46
cap <- function(x){
  quantiles <- quantile(x, c(0.05, 0.25, 0.75, 0.95))
  iqr <- quantiles[3] - quantiles[2] # IQR taken from the same quantiles, before any capping
  x[x < quantiles[2] - 1.5 * iqr] <- quantiles[1] # floor low outliers at the 5th percentile
  x[x > quantiles[3] + 1.5 * iqr] <- quantiles[4] # cap high outliers at the 95th percentile
  x
}
capp <- VICKidsChain$Profit %>% cap()
summary(capp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -199.66 6.48 69.12 265.75 306.13 1361.70
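Because the outliers were deliberately retained, capp serves only as a side-by-side comparison. If the capped values were wanted for a follow-up analysis, a sketch (an assumption, not part of the original workflow) would store them in a copy rather than overwrite the original:
VICKidsCapped <- VICKidsChain # keep the original data intact
VICKidsCapped$Profit <- capp # hold the capped profits in the copy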
Chain_Profit<- log(VICKidsChain$Profit)
## Warning in log(VICKidsChain$Profit): NaNs produced
hist(Chain_Profit)
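The NaNs come from taking log() of negative Profit values, which are then dropped from the histogram. A common workaround (an assumption here, not part of the original analysis) is to shift the variable so its minimum is positive before transforming:
shiftedProfit <- VICKidsChain$Profit - min(VICKidsChain$Profit) + 1 # shift so all values are >= 1
hist(log(shiftedProfit), main = "Histogram of log(shifted Profit)", xlab = "log(shifted Profit)")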
Kaggle: Australia Top Sport Chains - Nike and Adidas sales stats. https://www.kaggle.com/australiastats/australia-top-sport-chains?