MATH2349 Data Wrangling

Required packages

library(tidyr)
library(readr)
library(outliers)

## Warning: package 'outliers' was built under R version 4.0.3

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

Executive Summary

In this assignment it is required to perform various data pre-processing features and functions. I have choosen a dataset from kaggle. This dataset consists data about top sport brands - Nike and Adidas. It contains information about their sales stats in Australia. Data includes information about its cost and sale values, number of units were sold and postcode of each transaction day by day for 2016-2018 period. Several functions and packages were used including readr, outliers, tidyr, lubridate and dplyr.

In this dataset there were three sheets in one excel file. I have imported them as three different files and pre-process them as three datasets. One dataset renamed as Untidy_Suburb.csv and this dataset was converted from wide to long format via gather function. Then all three datasets merge to a one file, so that data processing can be done within a single dataset. With available sales data, I have mutated a new variable ‘Profit’ for analysis of their profits based on selective data segment. The dataset was subset and selected a particular dataset for analysis.

Later, dataset was scanned for the NA values and irregularities and this particular dataset was mostly consisted of complete data. But further studying the dataset, it was identified that there were few minus values for units sold. I scanned the profit variable for all possible outliers for each retailer. Outliers were identified using the boxplot. The capping method has been utilised where these outliers were properly replaced with its nearest neighbours that were not outliers. These outliers were not modified as its significance played a vital role in the dataset. In the last step I have transformed the skewness of the profit variable as it had a lot of abnormalities.

Data

The Three data sets that I have choosen for this assignment was based on the top sport brands - Nike and Adidas downloaded from Kaggle(https://www.kaggle.com/australiastats/australia-top-sport-chains?) It contains information about their sales stats in Australia. The file (AustraliaChains.xlsx) consisted of 3 sheets. (Fact Table, Managers and Suburbs).

These sheets were converted to separate csv files as AusChain.csv(Fact Table), AusChainManagers.csv(Managers), Untidy_Suburb.csv(Suburbs) before being imported to R. AusChains.csv set consisted of 7 columns and provided information about the sales and the Managers and Suburbs consist of 3 columns each for manager and store location.

The csv files are all read with read_csv() from readr and then merged using dplyr join function, inner_join(). Untidy_Suburs dataset converted using gather() function and sorted with distinct values. Then all 3 datasets were merged using the inner join with one common variable called “Postcode” in two steps.

AusChain <- read_csv("AusChain.csv")

## Parsed with column specification:
## cols(
##   Date = col_character(),
##   Chain = col_character(),
##   Postcode = col_double(),
##   Category = col_character(),
##   Total.Units = col_double(),
##   Sale.Price = col_double(),
##   Cost.Price = col_double()
## )

AusChain <- data.frame(AusChain)
str(AusChain)

## 'data.frame':    76466 obs. of  7 variables:
##  $ Date       : chr  "1/01/2016" "1/01/2016" "1/01/2016" "1/01/2016" ...
##  $ Chain      : chr  "Nike" "Nike" "Nike" "Nike" ...
##  $ Postcode   : num  2650 3550 4053 3076 2031 ...
##  $ Category   : chr  "Home" "Groceries" "Home" "Groceries" ...
##  $ Total.Units: num  175 908 1 714 65 87 92 21 127 77 ...
##  $ Sale.Price : num  2.78 0.61 500 1.14 0.48 3.66 5.21 4.49 4.19 4.73 ...
##  $ Cost.Price : num  6.35 0.93 750 1.46 3.9 ...

head(AusChain)

AusChainManagers <- read_csv("AusChainManagers.csv")

## Parsed with column specification:
## cols(
##   Suburb = col_character(),
##   Postcode = col_double(),
##   Manager = col_character()
## )

AusChainManagers <- data.frame(AusChainManagers)
str(AusChainManagers)

## 'data.frame':    98 obs. of  3 variables:
##  $ Suburb  : chr  "Darwin" "Alice Springs" "Alexandria" "Mascot" ...
##  $ Postcode: num  800 870 2015 2020 2031 ...
##  $ Manager : chr  "Michael Carney" "Michael Carney" "Jeremy Garcia" "Jeremy Garcia" ...

head(AusChainManagers)

UntidySuburbs <- read_csv("Untidy_Suburb.csv")

## Parsed with column specification:
## cols(
##   Postcode = col_double(),
##   ACT = col_character(),
##   NSW = col_character(),
##   NT = col_character(),
##   QLD = col_character(),
##   SA = col_character(),
##   TAS = col_character(),
##   VIC = col_character(),
##   WA = col_character()
## )

UntidySuburbs <- data.frame(UntidySuburbs)
str(UntidySuburbs)

## 'data.frame':    97 obs. of  9 variables:
##  $ Postcode: num  870 2015 2020 2031 2064 ...
##  $ ACT     : chr  NA NA NA NA ...
##  $ NSW     : chr  NA "Alexandria" "Mascot" "Randwick" ...
##  $ NT      : chr  "Alice Springs" NA NA NA ...
##  $ QLD     : chr  NA NA NA NA ...
##  $ SA      : chr  NA NA NA NA ...
##  $ TAS     : chr  NA NA NA NA ...
##  $ VIC     : chr  NA NA NA NA ...
##  $ WA      : chr  NA NA NA NA ...

head(UntidySuburbs)

tidySuburbs <- UntidySuburbs %>% gather(`ACT`, `NSW`, `NT`,`QLD`,`SA`,`TAS`,`VIC`,`WA`, key ="State", value = "Suburb")

SortedSuburbs <- tidySuburbs %>% distinct(Suburb, .keep_all = TRUE) 

MgrAusChain <- SortedSuburbs %>% inner_join( AusChainManagers,c("Postcode" = "Postcode")) 

refMgrAusChain <- select(MgrAusChain, State, Postcode, Suburb.x, Manager)

NewAusChain <- AusChain %>% inner_join( refMgrAusChain,c("Postcode" = "Postcode"))

Understand

After merging 3 datasets and initial tidyup, dataset is consisted of total of 11 variables at first including numeric, character and factors. I converted ‘Chain’ variable from Charactor to a factor using the as.factor function and further gave them levels and labels. Further I have converted ‘Date’ variable from Charactor to date format.

filter(distinct(NewAusChain,Chain))

NewAusChain$Chain <- as.factor(NewAusChain$Chain) 
NewAusChain$Chain<-factor(NewAusChain$Chain,levels = c("Nike","Adidas"))
labels=c("Nike","Adidas")

levels(NewAusChain$Chain)

## [1] "Nike"   "Adidas"

class(NewAusChain$Chain)

## [1] "factor"

NewAusChain$Date <- as.Date(NewAusChain$Date, format = "%d/%m/%Y")
str(NewAusChain$Date)

##  Date[1:75983], format: "2016-01-01" "2016-01-01" "2016-01-01" "2016-01-01" "2016-01-01" ...

Tidy & Manipulate Data I

In this step, I have subset a dataset for assignment purpose and selected ‘Kids’ category from ‘VIC’ State only. After that I have renamed Total.Units, Sale.Price, Cost.Price and Suburb.x variables for easy reference.

VICKids <-filter(NewAusChain, State=="VIC" & Category =="Kids") # Subset

VICKids <- rename(VICKids, Units = Total.Units, Sale_Price = Sale.Price, Cost_Price =Cost.Price, Suburb = Suburb.x) #REname Variables

Tidy & Manipulate Data II

In this step to find out profit for each observation, I have created 3 new variable called ‘Sale_Value’ by (Units * Sale_Price), ‘Total_Cost’ by (Cost_Price * Units) and then ‘Profit’ by (Sale_Value-Total_Cost).

After creating those variable, I have selected only the following variable in the order of Date,Chain,Category,Units,Sale_Price,Cost_Price,Sale_Value,Total_Cost, Profit,State,Postcode,Suburb and Manager.

After inspecting ‘Units’ variable, it is found out there are 6 negative(-1). As it is unusual to have negative stock, it is update with 1, assuming it was a user input error.

Then I re-arranged the dataset by variables, Chain,Postcode and Category.Dataset ‘VICKidsChain’ will be used as main dataset analysis purpose.

VICKids <- mutate(VICKids, Sale_Value = Units * Sale_Price, Total_Cost = Cost_Price * Units, Profit = Sale_Value-Total_Cost) 

VICKids <- select(VICKids, Date,Chain,Category,Units,Sale_Price,Cost_Price,Sale_Value,Total_Cost, Profit,State,Postcode,Suburb,Manager) 

filter(VICKids, Units <0)

VICKids$Units[VICKids$Units == -1] <- 1

VICKidsChain <- arrange(VICKids, Chain,Postcode,Category) 
str(VICKidsChain)

## 'data.frame':    2479 obs. of  13 variables:
##  $ Date      : Date, format: "2016-01-01" "2016-01-01" ...
##  $ Chain     : Factor w/ 2 levels "Nike","Adidas": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Category  : chr  "Kids" "Kids" "Kids" "Kids" ...
##  $ Units     : num  68 17 4 6 16 230 208 481 1 15 ...
##  $ Sale_Price: num  1.95 2.47 3.75 4.33 5.24 4.03 5.12 3.81 2 2.73 ...
##  $ Cost_Price: num  2.75 3.28 2.75 1.65 1.71 3.41 3.47 2.51 6.5 2.69 ...
##  $ Sale_Value: num  132.6 42 15 26 83.8 ...
##  $ Total_Cost: num  187 55.8 11 9.9 27.4 ...
##  $ Profit    : num  -54.4 -13.8 4 16.1 56.5 ...
##  $ State     : chr  "VIC" "VIC" "VIC" "VIC" ...
##  $ Postcode  : num  3018 3018 3018 3018 3018 ...
##  $ Suburb    : chr  "Altona" "Altona" "Altona" "Altona" ...
##  $ Manager   : chr  "Brian Hubbard" "Brian Hubbard" "Brian Hubbard" "Brian Hubbard" ...

Scan I

In this dataset I have checked null values with is.null function. This dataset is near perfect and there were no null values or missing values. Further I have checked for all the infinite values using the sapply function and and it was all perfect. The only mismatch was there were 6 negative values(-1) in ‘Units’ variable. It was addressed in the previous section.

is.null(VICKids)

## [1] FALSE

sapply(VICKids, function(x) sum( is.na(x) ))

##       Date      Chain   Category      Units Sale_Price Cost_Price Sale_Value 
##          0          0          0          0          0          0          0 
## Total_Cost     Profit      State   Postcode     Suburb    Manager 
##          0          0          0          0          0          0

colSums(is.na(VICKids))

##       Date      Chain   Category      Units Sale_Price Cost_Price Sale_Value 
##          0          0          0          0          0          0          0 
## Total_Cost     Profit      State   Postcode     Suburb    Manager 
##          0          0          0          0          0          0

Scan II

This dataset has very high standard deviation. Reason being both brands sells differentiated products with quantity varies vastly. There are big differences in prices for most categories as well. So under this dataset there are 727 outliers, which were identified with boxplot method. After getting summrised details box plot and histogram was drawn to visualise. Two boxplots were done to illustrate for Nike and Adidas separately and all together.

Histogram was plotted to visualise the data spread. We can see there is an obvious right skewed spread of data.

summary(VICKidsChain)

##       Date               Chain        Category             Units       
##  Min.   :2016-01-01   Nike  :1524   Length:2479        Min.   :   1.0  
##  1st Qu.:2016-06-01   Adidas: 955   Class :character   1st Qu.:   9.0  
##  Median :2016-11-01                 Mode  :character   Median :  36.0  
##  Mean   :2016-10-24                                    Mean   : 131.9  
##  3rd Qu.:2017-04-01                                    3rd Qu.: 140.0  
##  Max.   :2017-08-01                                    Max.   :2625.0  
##    Sale_Price       Cost_Price       Sale_Value         Total_Cost     
##  Min.   : 0.250   Min.   : 0.350   Min.   :  -14.99   Min.   :  -8.75  
##  1st Qu.: 3.160   1st Qu.: 2.000   1st Qu.:   30.92   1st Qu.:  19.89  
##  Median : 4.730   Median : 2.840   Median :  180.78   Median : 105.00  
##  Mean   : 4.844   Mean   : 2.939   Mean   :  685.19   Mean   : 390.58  
##  3rd Qu.: 6.055   3rd Qu.: 3.600   3rd Qu.:  739.96   3rd Qu.: 422.60  
##  Max.   :30.560   Max.   :15.000   Max.   :12573.75   Max.   :7428.75  
##      Profit           State              Postcode       Suburb         
##  Min.   :-199.66   Length:2479        Min.   :3018   Length:2479       
##  1st Qu.:   6.48   Class :character   1st Qu.:3131   Class :character  
##  Median :  69.12   Mode  :character   Median :3199   Mode  :character  
##  Mean   : 294.61                      Mean   :3320                     
##  3rd Qu.: 306.13                      3rd Qu.:3550                     
##  Max.   :5878.46                      Max.   :3977                     
##    Manager         
##  Length:2479       
##  Class :character  
##  Mode  :character  
##                    
##                    
##

VICKidsChain %>% group_by(Chain) %>% summarise(Min = min(Profit, na.rm = TRUE),
                                               Q1 = quantile(Profit, probs = 0.25, na.rm = TRUE),
                                               Median = median(Profit, na.rm = TRUE),
                                               Q3 = quantile(Profit, probs = 0.75, na.rm = TRUE),
                                               Max = max(Profit, na.rm = TRUE),
                                               IQR = Q3 - Q1,
                                               Mean = mean(Profit, na.rm = TRUE),
                                               SD = sd(Profit, na.rm = TRUE),
                                               n = n(),
                                               Missing = sum(is.na(Profit)))

## `summarise()` ungrouping output (override with `.groups` argument)

boxplot(VICKidsChain["Profit"], las=2, main = "Victoria Kids Profit(All)", ylab = "Profit", xlab = "All")

boxplot(VICKidsChain$Profit ~ VICKidsChain$Chain, main="Victoria Kids Profit by Chain", ylab = "Profit", xlab = "Chain")

outliers<- boxplot(VICKidsChain$Profit)$out

outliers

##   [1] 1040.00  884.53 1497.30 1045.80 1497.65 2057.92  763.84 1162.35 1313.18
##  [10] 1387.36 2986.64 1970.54 2856.00  772.80 1632.80 1317.33  901.48 1067.82
##  [19] 1317.90 1970.54 2856.00  772.80 1632.80 1317.33  901.48 1438.80 1875.72
##  [28] 2117.70 3614.16  892.08  905.92 1182.81 1281.00 1444.19  775.20 1662.78
##  [37] 2131.57  870.25 1024.80 1299.52 2240.35 2528.16 2320.78 2993.76  861.84
##  [46]  870.84 1920.45 1098.72  823.60  957.06 1704.30 2509.77 2320.78 2993.76
##  [55]  861.84  870.84 1920.45 1098.72  867.14 1128.00 2563.55 3146.96  802.20
##  [64]  824.40  902.40  968.89  806.26 1389.96 1825.35  814.00 1663.80  807.07
##  [73] 1323.00  817.02  837.80  883.12  999.18 2357.59  814.00 1663.80  807.07
##  [82] 1323.00  817.02  837.80  883.12  865.24 1136.02 2991.56 3240.90 1058.94
##  [91] 1043.19 1373.09 1341.83 1789.76  797.04 1763.20 2361.15  762.42  765.00
## [100]  927.18 1891.40 2488.35 1817.24 2887.05 3560.04  921.30  936.26 2519.54
## [109] 2586.72 1087.11 1133.60 1761.95  899.76 1444.60 1082.25 1244.40  957.00
## [118]  836.57 2137.52 3461.15  921.30  936.26 2519.54 2586.72 1087.11 1133.60
## [127] 1761.95  899.76 1444.60 1082.25 1244.40 2342.44 3447.74 4437.45 5878.46
## [136] 1017.36 1282.65 1812.52 1966.56  804.54  849.68 1049.04 2012.80 2180.95
## [145] 2560.04 2591.60 1213.70  950.04 1316.80 1309.88 1324.32 1476.45 2560.04
## [154] 2591.60 1213.70  950.04 1316.80  840.51  942.37 3488.20 3596.12 1440.26
## [163] 1754.00 2024.88  986.40 2197.60 3877.25 4163.20 1391.27 2438.20 1307.71
## [172] 1068.65 1325.40 2309.37 1073.65 2336.13 1391.27 2438.20 1307.71 1068.65
## [181] 1325.40 2309.37 1780.17 2064.23 4975.11 5145.00  927.00 1025.00 2351.52
## [190] 1522.80 1915.20  993.60 2150.20 2340.80  771.00  853.83  967.19 1203.60
## [199] 1246.08 2598.48 3317.85 2335.23 3286.00  896.14 1229.80  846.32 1535.76
## [208] 1557.74 2320.67 2434.83 2335.23 3286.00  896.14 1229.80  846.32 1557.74
## [217] 1535.76 1103.36 1188.44 3992.56 4906.20  931.00  830.27 1289.96 1306.10
## [226] 2058.20  877.10 1320.69 1173.84 1312.42 2346.19 2302.88 2465.68  920.40
## [235] 1422.40 1361.70 1146.09 1145.19 1222.64 2302.88 2465.68  920.40 1422.40
## [244] 1361.70 1146.09 1196.98 2519.88 3188.66  800.80 1745.92  888.28 1159.06
## [253] 1073.48  960.40  886.60  900.24 1432.08 1078.44 2222.14 1171.83  785.00
## [262] 1150.72 1078.44 2222.14 1171.83  785.00 1035.20 1789.11 1954.26 1025.00
## [271] 1162.80 1162.80  878.46  761.56  761.56  853.30  853.30

length(outliers)

## [1] 277

hist(VICKidsChain$Profit)

VICKidsChain$Profit %>% hist(col = "blue",
                             ylim = c(0,0.003),
                             xlim = c(0,6000),
                             xlab = "Profit($)",
                             main = "Profit by Chain",
                             breaks = 10,
                             density = 20,
                             prob = TRUE)

lines(density(VICKidsChain$Profit, adjust = 2), col = "red", lwd = 2)

Transform

Fair way of treating oultlier are Capping which involves replacing the outliers with the nearest neighbours that are not outliers. Here, for outliers that lie outside the outlier fences on a box-plot, we can cap it by replacing those observations outside the lower limit with the value of 5th percentile and those that lie above the upper limit, with the value of 95th percentile. After capping, we can compare with initial values, where we clearly can see Max value drastically came down.

In the last step I have checked the distribution of variable ‘Profit’. According to the first histogram we can see that the distribution is skewed towards the right. By using the log function we shift the right skewed distribution towards a normal distribution.

summary(VICKidsChain$Profit)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -199.66    6.48   69.12  294.61  306.13 5878.46

cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}
capp <- VICKidsChain$Profit %>% cap()
summary(capp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -199.66    6.48   69.12  265.75  306.13 1361.70

Chain_Profit<- log(VICKidsChain$Profit)

## Warning in log(VICKidsChain$Profit): NaNs produced

hist(Chain_Profit)

References:

kaggle: Australia Top Sport Chains Nike and Adidas sales stats https://www.kaggle.com/australiastats/australia-top-sport-chains?