This document aims to provide you step-by-step guide on how to import, massaging and wrangling data by using tidyverse family of packages.
By the end of this hands-on exercise, you should acquire the following competencies:
Before you can start using R, you are required to create a sub-folder called data below this RMarkdown file. You are encouraged to keep the data files according to their respective sub-folder such as aspatial.
Before we get started, it is important for us to ensure that all the R packages we need have been installed.
The code chunk:
packages = c('ggstatsplot', 'tidyverse')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
The code chunk above forms two tasks:
With tidyverse, we do not have to install or/and launch readr, tidyr, dplyr and ggplot2 packages separately.
The code chunk below imports a csv file (i.e. store_info.csv)using read_csv() of readr package.
store_info <- read_csv("data/aspatial/store_location.csv")
store_info
## # A tibble: 220 x 80
## Market Status Milestone `Local Code` `CHAMPS Code` `JDE Code` `Store Name`
## <chr> <chr> <chr> <dbl> <chr> <dbl> <chr>
## 1 Phili~ Relo ~ Closed 51005648 63P51005648 57005648 Delta
## 2 Phili~ New o~ Opened 51005651 63P51005651 57005651 Sm City
## 3 Phili~ New o~ Opened 51005653 63P51005653 57005653 C.M. Recto
## 4 Phili~ New o~ Opened 51005656 63P51005656 57005656 Araneta Col~
## 5 Phili~ New o~ Opened 51005661 63P51005661 57005661 Wilson
## 6 Phili~ New o~ Opened 51005660 63P51005660 57005660 Ermita
## 7 Phili~ New o~ Opened 51005662 63P51005662 57005662 Centrepoint
## 8 Phili~ New o~ Opened 51005665 63P51005665 57005665 Banawe (Del~
## 9 Phili~ New o~ Opened 51005669 63P51005669 57005668 Megamall
## 10 Phili~ New o~ Opened 51005677 63P51005677 57005677 Marcos High~
## # ... with 210 more rows, and 73 more variables: `Latest Asset Type` <chr>,
## # `Facility Type` <chr>, `Open Date` <chr>, `Close Date` <chr>, `Store
## # Address` <chr>, `2015_StoreCount` <dbl>, `2016_StoreCount` <dbl>,
## # `2017_StoreCount` <dbl>, `2018_StoreCount` <dbl>, `2019_StoreCount` <dbl>,
## # Latitude <dbl>, Longitude <dbl>, `15Open_Weeks` <dbl>,
## # `16Open_Weeks` <dbl>, `17Open_Weeks` <dbl>, `18Open_Weeks` <dbl>,
## # `19Open_Weeks` <dbl>, `15Del_Sales` <dbl>, `16Del_Sales` <dbl>,
## # `17Del_Sales` <dbl>, `18Del_Sales` <dbl>, `19Del_Sales` <dbl>,
## # `15Del_Txns` <dbl>, `16Del_Txns` <dbl>, `17Del_Txns` <dbl>,
## # `18Del_Txns` <dbl>, `19Del_Txns` <dbl>, `18Del_WPSA` <chr>,
## # `19Del_WPSA` <chr>, `18Del_WPST` <chr>, `19Del_WPST` <chr>,
## # `15TA_Sales` <dbl>, `16TA_Sales` <dbl>, `17TA_Sales` <dbl>,
## # `18TA_Sales` <dbl>, `19TA_Sales` <dbl>, `15TA_Txns` <dbl>,
## # `16TA_Txns` <dbl>, `17TA_Txns` <dbl>, `18TA_Txns` <dbl>, `19TA_Txns` <dbl>,
## # `18TA_WPSA` <chr>, `19TA_WPSA` <chr>, `18TA_WPST` <chr>, `19TA_WPST` <chr>,
## # `15DI_Sales` <dbl>, `16DI_Sales` <dbl>, `17DI_Sales` <dbl>,
## # `18DI_Sales` <dbl>, `19DI_Sales` <dbl>, `15DI_Txns` <dbl>,
## # `16DI_Txns` <dbl>, `17DI_Txns` <dbl>, `18DI_Txns` <dbl>, `19DI_Txns` <dbl>,
## # `18DI_WPSA` <chr>, `19DI_WPSA` <chr>, `18DI_WPST` <chr>, `19DI_WPST` <chr>,
## # `15All_Sales` <dbl>, `16All_Sales` <dbl>, `17All_Sales` <dbl>,
## # `18All_Sales` <dbl>, `19All_Sales` <dbl>, `15All_Txns` <dbl>,
## # `16All_Txns` <dbl>, `17All_Txns` <dbl>, `18All_Txns` <dbl>,
## # `19All_Txns` <dbl>, `18All_WPSA` <chr>, `19All_WPSA` <chr>,
## # `18All_WPST` <chr>, `19All_WPST` <chr>
If you are new to tidyverse, notice that the data are stored in tibble data.frame and not a common data.frame of R. To gain better understanding of tibble data.frame, please consult Chapter 10 Tibbles of R for Data Science.
According to the listing above, there are two fields that their data type were wrongly parsed by read_csv(). They are Local Code and JDE Code. The data value type of both fields should be in character data type instead of numeric data type.
Actual, there are other fields that are wrongly parsed by read_csv() such as 19Del_WPSA. This field captures weekly performance sales average which should be in numerical data type and not character data type. We are not going to change any of this error because we are going to derive them later.
In the code chunk below, col_types argument and col_character() are used to change the data type of both field into character.
store_info <- read_csv("data/aspatial/store_location.csv", col_types = cols(`Local Code` = col_character(), `JDE Code` = col_character()))
store_info
## # A tibble: 220 x 80
## Market Status Milestone `Local Code` `CHAMPS Code` `JDE Code` `Store Name`
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Phili~ Relo ~ Closed 51005648 63P51005648 57005648 Delta
## 2 Phili~ New o~ Opened 51005651 63P51005651 57005651 Sm City
## 3 Phili~ New o~ Opened 51005653 63P51005653 57005653 C.M. Recto
## 4 Phili~ New o~ Opened 51005656 63P51005656 57005656 Araneta Col~
## 5 Phili~ New o~ Opened 51005661 63P51005661 57005661 Wilson
## 6 Phili~ New o~ Opened 51005660 63P51005660 57005660 Ermita
## 7 Phili~ New o~ Opened 51005662 63P51005662 57005662 Centrepoint
## 8 Phili~ New o~ Opened 51005665 63P51005665 57005665 Banawe (Del~
## 9 Phili~ New o~ Opened 51005669 63P51005669 57005668 Megamall
## 10 Phili~ New o~ Opened 51005677 63P51005677 57005677 Marcos High~
## # ... with 210 more rows, and 73 more variables: `Latest Asset Type` <chr>,
## # `Facility Type` <chr>, `Open Date` <chr>, `Close Date` <chr>, `Store
## # Address` <chr>, `2015_StoreCount` <dbl>, `2016_StoreCount` <dbl>,
## # `2017_StoreCount` <dbl>, `2018_StoreCount` <dbl>, `2019_StoreCount` <dbl>,
## # Latitude <dbl>, Longitude <dbl>, `15Open_Weeks` <dbl>,
## # `16Open_Weeks` <dbl>, `17Open_Weeks` <dbl>, `18Open_Weeks` <dbl>,
## # `19Open_Weeks` <dbl>, `15Del_Sales` <dbl>, `16Del_Sales` <dbl>,
## # `17Del_Sales` <dbl>, `18Del_Sales` <dbl>, `19Del_Sales` <dbl>,
## # `15Del_Txns` <dbl>, `16Del_Txns` <dbl>, `17Del_Txns` <dbl>,
## # `18Del_Txns` <dbl>, `19Del_Txns` <dbl>, `18Del_WPSA` <chr>,
## # `19Del_WPSA` <chr>, `18Del_WPST` <chr>, `19Del_WPST` <chr>,
## # `15TA_Sales` <dbl>, `16TA_Sales` <dbl>, `17TA_Sales` <dbl>,
## # `18TA_Sales` <dbl>, `19TA_Sales` <dbl>, `15TA_Txns` <dbl>,
## # `16TA_Txns` <dbl>, `17TA_Txns` <dbl>, `18TA_Txns` <dbl>, `19TA_Txns` <dbl>,
## # `18TA_WPSA` <chr>, `19TA_WPSA` <chr>, `18TA_WPST` <chr>, `19TA_WPST` <chr>,
## # `15DI_Sales` <dbl>, `16DI_Sales` <dbl>, `17DI_Sales` <dbl>,
## # `18DI_Sales` <dbl>, `19DI_Sales` <dbl>, `15DI_Txns` <dbl>,
## # `16DI_Txns` <dbl>, `17DI_Txns` <dbl>, `18DI_Txns` <dbl>, `19DI_Txns` <dbl>,
## # `18DI_WPSA` <chr>, `19DI_WPSA` <chr>, `18DI_WPST` <chr>, `19DI_WPST` <chr>,
## # `15All_Sales` <dbl>, `16All_Sales` <dbl>, `17All_Sales` <dbl>,
## # `18All_Sales` <dbl>, `19All_Sales` <dbl>, `15All_Txns` <dbl>,
## # `16All_Txns` <dbl>, `17All_Txns` <dbl>, `18All_Txns` <dbl>,
## # `19All_Txns` <dbl>, `18All_WPSA` <chr>, `19All_WPSA` <chr>,
## # `18All_WPST` <chr>, `19All_WPST` <chr>
Notice that both fields are in character data type now.
In this section, you will learn how to derive a new tibble data.frame by using selected fields from the newly imported tibble data.frame similar to the example give below.
The code chunk below is used to derive the tibble data.frame above. It performs the following tasks:
All_WPSA <- store_info %>%
select(4,20:24,67:71)%>%
mutate(`2015` = `15All_Sales`/`15Open_Weeks`) %>%
mutate(`2016` = `16All_Sales`/`16Open_Weeks`) %>%
mutate(`2017` = `17All_Sales`/`17Open_Weeks`) %>%
mutate(`2018` = `18All_Sales`/`18Open_Weeks`) %>%
mutate(`2019` = `19All_Sales`/`19Open_Weeks`) %>%
pivot_longer(cols = 12:16,
names_to = "Year",
values_to = "All_WPSA") %>%
select(1, 12:13)
Now, we will repeat the code chunk to derive a new tibble data.frame for delivery sales.
Del_WPSA <- store_info %>%
select(4,20:24,25:29)%>%
mutate(`2015` = `15Del_Sales`/`15Open_Weeks`) %>%
mutate(`2016` = `16Del_Sales`/`16Open_Weeks`) %>%
mutate(`2017` = `17Del_Sales`/`17Open_Weeks`) %>%
mutate(`2018` = `18Del_Sales`/`18Open_Weeks`) %>%
mutate(`2019` = `19Del_Sales`/`19Open_Weeks`) %>%
pivot_longer(cols = 12:16,
names_to = "Year",
values_to = "Del_WPSA") %>%
select(1, 12:13)
Next, you will learn how to join two or more data.frames by using join() of dplyr.
store_info_selected <- store_info %>%
select(3:4, 8:9)
store_tidy <- All_WPSA %>%
left_join(Del_WPSA,
by = c("Local Code", "Year")) %>%
left_join(store_info_selected,
by = c("Local Code"))
Tukey defined EDA in 1961 as:
“Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.”
EDA is often the first step to getting to know about your data.
It is iterative in nature:
Rinse and repeat until you communicate the findings.
The code chunk below is used to reveal the distribution of All_WPSA in a histogram.
ggplot(data=store_tidy,
aes(x= All_WPSA)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 138 rows containing non-finite values (stat_bin).
In the code chunk below, geom_vline() is used to add the median and mean line.
ggplot(data=store_tidy,
aes(x= All_WPSA)) +
geom_histogram(bins=25,
color="black",
fill="light blue") +
geom_vline(aes(xintercept=mean(All_WPSA,
na.rm=T)), # Ignore NA values for mean
color="red", linetype="dashed", size=1) +
geom_vline(aes(xintercept=median(All_WPSA, na.rm=T)), # Ignore NA values for mean
color="grey30", linetype="dashed", size=1)
## Warning: Removed 138 rows containing non-finite values (stat_bin).
In the code chunk below, geom_facet() is used to plot multiple histograms showing the distribution of All)WPSA over years.
ggplot(data=store_tidy,
aes(x= All_WPSA)) +
geom_histogram(bins=25,
color="black",
fill="light blue") +
geom_vline(aes(xintercept=mean(All_WPSA,
na.rm=T)), # Ignore NA values for mean
color="red", linetype="dashed", size=1) +
geom_vline(aes(xintercept=median(All_WPSA, na.rm=T)), # Ignore NA values for mean
color="grey30", linetype="dashed", size=1)+
facet_wrap(~ Year)
## Warning: Removed 138 rows containing non-finite values (stat_bin).
Beside histogram, other statistical graphical methods commonly used for EDA are boxplot and density plot. They can be prepared using geom_boxplot() and geom_density of ggplot2 package
ggplot(data=store_tidy, aes(x=`Latest Asset Type`)) +
geom_bar()
store_tidy %>%
drop_na(All_WPSA) %>%
ggplot(aes(x=`Year`)) +
geom_bar() +
ylim(0,220) +
geom_text(stat="count",
aes(label=paste0(..count..)),
vjust=-1) +
xlab("Year") +
ylab("No. of\nStores") +
theme(axis.title.y=element_text(angle = 0))
store_tidy %>%
drop_na(All_WPSA) %>%
ggplot(aes(x=`Year`,
fill=`Latest Asset Type`)) +
geom_bar() +
ylim(0,220) +
geom_text(stat="count",
aes(label=paste0(..count..)),
position = position_stack(vjust = .5), size = 2, color = "black") +
xlab("Year") +
ylab("No. of\nStores") +
theme(axis.title.y=element_text(angle = 0))
store_tidy %>%
filter(Year == 2019) %>%
gghistostats(x = All_WPSA,
xlab = "Weekly Performance Sales Average", centrality.parameter = "median"
)
## t is large; approximation invoked.
store_tidy %>%
ggbetweenstats(x = Year,
y = All_WPSA,
nboot = 10,
messages = FALSE
)
store_tidy %>%
filter(Year == "2019") %>%
ggbetweenstats(x = `Latest Asset Type`,
y = All_WPSA,
pairwise.comparisons = TRUE,
p.adjust.method = "fdr",
outlier.tagging = TRUE,
nboot = 10,
messages = FALSE
)