1 Overview

This document aims to provide you step-by-step guide on how to import, massaging and wrangling data by using tidyverse family of packages.

1.1 Learning Outcome

By the end of this hands-on exercise, you should acquire the following competencies:

  • importing csv file into R and performing basic data parsing using readr package,
  • massaging and wrangling tibble data.frame using tidyr and dplyr package,
  • visualising data values using ggplot2 package, and
  • performing Confirmatory Data Analysis using ggstatsplot package.

1.2 Data Acquisition

Before you can start using R, you are required to create a sub-folder called data below this RMarkdown file. You are encouraged to keep the data files according to their respective sub-folder such as aspatial.

2 Getting Started

2.1 Installing and launching the R packages

Before we get started, it is important for us to ensure that all the R packages we need have been installed.

The code chunk:

packages = c('ggstatsplot', 'tidyverse')
for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p,character.only = T)
}

The code chunk above forms two tasks:

  • it check if ggstatsplot and tidyverse have be installed in R. If any one of them is missing, R will install the missing package.
  • launching the package by using library() of base R.

With tidyverse, we do not have to install or/and launch readr, tidyr, dplyr and ggplot2 packages separately.

3 Importing Data

3.1 Working with readr package

The code chunk below imports a csv file (i.e. store_info.csv)using read_csv() of readr package.

store_info <- read_csv("data/aspatial/store_location.csv")
store_info
## # A tibble: 220 x 80
##    Market Status Milestone `Local Code` `CHAMPS Code` `JDE Code` `Store Name`
##    <chr>  <chr>  <chr>            <dbl> <chr>              <dbl> <chr>       
##  1 Phili~ Relo ~ Closed        51005648 63P51005648     57005648 Delta       
##  2 Phili~ New o~ Opened        51005651 63P51005651     57005651 Sm City     
##  3 Phili~ New o~ Opened        51005653 63P51005653     57005653 C.M. Recto  
##  4 Phili~ New o~ Opened        51005656 63P51005656     57005656 Araneta Col~
##  5 Phili~ New o~ Opened        51005661 63P51005661     57005661 Wilson      
##  6 Phili~ New o~ Opened        51005660 63P51005660     57005660 Ermita      
##  7 Phili~ New o~ Opened        51005662 63P51005662     57005662 Centrepoint 
##  8 Phili~ New o~ Opened        51005665 63P51005665     57005665 Banawe (Del~
##  9 Phili~ New o~ Opened        51005669 63P51005669     57005668 Megamall    
## 10 Phili~ New o~ Opened        51005677 63P51005677     57005677 Marcos High~
## # ... with 210 more rows, and 73 more variables: `Latest Asset Type` <chr>,
## #   `Facility Type` <chr>, `Open Date` <chr>, `Close Date` <chr>, `Store
## #   Address` <chr>, `2015_StoreCount` <dbl>, `2016_StoreCount` <dbl>,
## #   `2017_StoreCount` <dbl>, `2018_StoreCount` <dbl>, `2019_StoreCount` <dbl>,
## #   Latitude <dbl>, Longitude <dbl>, `15Open_Weeks` <dbl>,
## #   `16Open_Weeks` <dbl>, `17Open_Weeks` <dbl>, `18Open_Weeks` <dbl>,
## #   `19Open_Weeks` <dbl>, `15Del_Sales` <dbl>, `16Del_Sales` <dbl>,
## #   `17Del_Sales` <dbl>, `18Del_Sales` <dbl>, `19Del_Sales` <dbl>,
## #   `15Del_Txns` <dbl>, `16Del_Txns` <dbl>, `17Del_Txns` <dbl>,
## #   `18Del_Txns` <dbl>, `19Del_Txns` <dbl>, `18Del_WPSA` <chr>,
## #   `19Del_WPSA` <chr>, `18Del_WPST` <chr>, `19Del_WPST` <chr>,
## #   `15TA_Sales` <dbl>, `16TA_Sales` <dbl>, `17TA_Sales` <dbl>,
## #   `18TA_Sales` <dbl>, `19TA_Sales` <dbl>, `15TA_Txns` <dbl>,
## #   `16TA_Txns` <dbl>, `17TA_Txns` <dbl>, `18TA_Txns` <dbl>, `19TA_Txns` <dbl>,
## #   `18TA_WPSA` <chr>, `19TA_WPSA` <chr>, `18TA_WPST` <chr>, `19TA_WPST` <chr>,
## #   `15DI_Sales` <dbl>, `16DI_Sales` <dbl>, `17DI_Sales` <dbl>,
## #   `18DI_Sales` <dbl>, `19DI_Sales` <dbl>, `15DI_Txns` <dbl>,
## #   `16DI_Txns` <dbl>, `17DI_Txns` <dbl>, `18DI_Txns` <dbl>, `19DI_Txns` <dbl>,
## #   `18DI_WPSA` <chr>, `19DI_WPSA` <chr>, `18DI_WPST` <chr>, `19DI_WPST` <chr>,
## #   `15All_Sales` <dbl>, `16All_Sales` <dbl>, `17All_Sales` <dbl>,
## #   `18All_Sales` <dbl>, `19All_Sales` <dbl>, `15All_Txns` <dbl>,
## #   `16All_Txns` <dbl>, `17All_Txns` <dbl>, `18All_Txns` <dbl>,
## #   `19All_Txns` <dbl>, `18All_WPSA` <chr>, `19All_WPSA` <chr>,
## #   `18All_WPST` <chr>, `19All_WPST` <chr>

If you are new to tidyverse, notice that the data are stored in tibble data.frame and not a common data.frame of R. To gain better understanding of tibble data.frame, please consult Chapter 10 Tibbles of R for Data Science.

According to the listing above, there are two fields that their data type were wrongly parsed by read_csv(). They are Local Code and JDE Code. The data value type of both fields should be in character data type instead of numeric data type.

Actual, there are other fields that are wrongly parsed by read_csv() such as 19Del_WPSA. This field captures weekly performance sales average which should be in numerical data type and not character data type. We are not going to change any of this error because we are going to derive them later.

In the code chunk below, col_types argument and col_character() are used to change the data type of both field into character.

store_info <- read_csv("data/aspatial/store_location.csv", col_types = cols(`Local Code` = col_character(), `JDE Code` = col_character()))
store_info
## # A tibble: 220 x 80
##    Market Status Milestone `Local Code` `CHAMPS Code` `JDE Code` `Store Name`
##    <chr>  <chr>  <chr>     <chr>        <chr>         <chr>      <chr>       
##  1 Phili~ Relo ~ Closed    51005648     63P51005648   57005648   Delta       
##  2 Phili~ New o~ Opened    51005651     63P51005651   57005651   Sm City     
##  3 Phili~ New o~ Opened    51005653     63P51005653   57005653   C.M. Recto  
##  4 Phili~ New o~ Opened    51005656     63P51005656   57005656   Araneta Col~
##  5 Phili~ New o~ Opened    51005661     63P51005661   57005661   Wilson      
##  6 Phili~ New o~ Opened    51005660     63P51005660   57005660   Ermita      
##  7 Phili~ New o~ Opened    51005662     63P51005662   57005662   Centrepoint 
##  8 Phili~ New o~ Opened    51005665     63P51005665   57005665   Banawe (Del~
##  9 Phili~ New o~ Opened    51005669     63P51005669   57005668   Megamall    
## 10 Phili~ New o~ Opened    51005677     63P51005677   57005677   Marcos High~
## # ... with 210 more rows, and 73 more variables: `Latest Asset Type` <chr>,
## #   `Facility Type` <chr>, `Open Date` <chr>, `Close Date` <chr>, `Store
## #   Address` <chr>, `2015_StoreCount` <dbl>, `2016_StoreCount` <dbl>,
## #   `2017_StoreCount` <dbl>, `2018_StoreCount` <dbl>, `2019_StoreCount` <dbl>,
## #   Latitude <dbl>, Longitude <dbl>, `15Open_Weeks` <dbl>,
## #   `16Open_Weeks` <dbl>, `17Open_Weeks` <dbl>, `18Open_Weeks` <dbl>,
## #   `19Open_Weeks` <dbl>, `15Del_Sales` <dbl>, `16Del_Sales` <dbl>,
## #   `17Del_Sales` <dbl>, `18Del_Sales` <dbl>, `19Del_Sales` <dbl>,
## #   `15Del_Txns` <dbl>, `16Del_Txns` <dbl>, `17Del_Txns` <dbl>,
## #   `18Del_Txns` <dbl>, `19Del_Txns` <dbl>, `18Del_WPSA` <chr>,
## #   `19Del_WPSA` <chr>, `18Del_WPST` <chr>, `19Del_WPST` <chr>,
## #   `15TA_Sales` <dbl>, `16TA_Sales` <dbl>, `17TA_Sales` <dbl>,
## #   `18TA_Sales` <dbl>, `19TA_Sales` <dbl>, `15TA_Txns` <dbl>,
## #   `16TA_Txns` <dbl>, `17TA_Txns` <dbl>, `18TA_Txns` <dbl>, `19TA_Txns` <dbl>,
## #   `18TA_WPSA` <chr>, `19TA_WPSA` <chr>, `18TA_WPST` <chr>, `19TA_WPST` <chr>,
## #   `15DI_Sales` <dbl>, `16DI_Sales` <dbl>, `17DI_Sales` <dbl>,
## #   `18DI_Sales` <dbl>, `19DI_Sales` <dbl>, `15DI_Txns` <dbl>,
## #   `16DI_Txns` <dbl>, `17DI_Txns` <dbl>, `18DI_Txns` <dbl>, `19DI_Txns` <dbl>,
## #   `18DI_WPSA` <chr>, `19DI_WPSA` <chr>, `18DI_WPST` <chr>, `19DI_WPST` <chr>,
## #   `15All_Sales` <dbl>, `16All_Sales` <dbl>, `17All_Sales` <dbl>,
## #   `18All_Sales` <dbl>, `19All_Sales` <dbl>, `15All_Txns` <dbl>,
## #   `16All_Txns` <dbl>, `17All_Txns` <dbl>, `18All_Txns` <dbl>,
## #   `19All_Txns` <dbl>, `18All_WPSA` <chr>, `19All_WPSA` <chr>,
## #   `18All_WPST` <chr>, `19All_WPST` <chr>

Notice that both fields are in character data type now.

4 Data Preparation

4.1 Deriving new data.frames

In this section, you will learn how to derive a new tibble data.frame by using selected fields from the newly imported tibble data.frame similar to the example give below.

The code chunk below is used to derive the tibble data.frame above. It performs the following tasks:

  • selecting the target fields by using select() of dplyr package. In this case, they are column 4 (i.e. Local Code), columns 20:24 (i.e. 15Open_Weeks - 19Open_Weeks) and columns 67:71 (i.e. 15All_Sales - 19All_Sales).
  • computing the weekly average sales by using mutate() of dplyr package. Name each of the output field as the particular year (i.e. 2015).
  • pivoting the five target fields (i.e. 12:16) into two output fields by using pivot_longer() of tidyr package.
  • select columns 1, 12 and 13 and save them into an output tibble data.frame called All_WPSA.
All_WPSA <- store_info %>%
  select(4,20:24,67:71)%>%
  mutate(`2015` = `15All_Sales`/`15Open_Weeks`) %>%
  mutate(`2016` = `16All_Sales`/`16Open_Weeks`) %>%
  mutate(`2017` = `17All_Sales`/`17Open_Weeks`) %>%
  mutate(`2018` = `18All_Sales`/`18Open_Weeks`) %>%
  mutate(`2019` = `19All_Sales`/`19Open_Weeks`) %>%
  pivot_longer(cols = 12:16,
               names_to = "Year",
               values_to = "All_WPSA") %>%
  select(1, 12:13)

Now, we will repeat the code chunk to derive a new tibble data.frame for delivery sales.

Del_WPSA <- store_info %>%
  select(4,20:24,25:29)%>%
  mutate(`2015` = `15Del_Sales`/`15Open_Weeks`) %>%
  mutate(`2016` = `16Del_Sales`/`16Open_Weeks`) %>%
  mutate(`2017` = `17Del_Sales`/`17Open_Weeks`) %>%
  mutate(`2018` = `18Del_Sales`/`18Open_Weeks`) %>%
  mutate(`2019` = `19Del_Sales`/`19Open_Weeks`) %>%
  pivot_longer(cols = 12:16,
               names_to = "Year",
               values_to = "Del_WPSA") %>%
  select(1, 12:13)

4.2 Performing relational join

Next, you will learn how to join two or more data.frames by using join() of dplyr.

store_info_selected <- store_info %>%
  select(3:4, 8:9)
store_tidy <- All_WPSA %>%
  left_join(Del_WPSA,
            by = c("Local Code", "Year")) %>%
  left_join(store_info_selected,
            by = c("Local Code")) 

5 Exploratory Data Analysis (EDA)

Tukey defined EDA in 1961 as:

“Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data.”

EDA is often the first step to getting to know about your data.

  • It is iterative in nature:

    • Visualise and generate questions about your data.
    • Search for answers by visualising, transforming, and modeling your data.
    • Use what you learn to refine your questions and generate new questions.
  • Rinse and repeat until you communicate the findings.

5.1 Tips on doing EDA

  • EDA is fundamentally a data sensing process - it is not an exact science. It requires knowledge of your data and a lot of patience.
  • At the most basic level, it involves answering two questions:
    • What type of variation occurs within my variables?
    • What type of covariation occurs between my variables?
  • EDA relies heavily on statistical graphics to reveal patterns and anomelies of your data.

5.2 Getting to know ggplot2

  • ggplot2 (http://ggplot2.tidyverse.org/index.html) is a R package for declaratively creating statistical graphics.
  • It is designed and developed by Hadley Wickham in 2005.
  • It is an implementation of Leland Wilkinson’s Grammar of Graphics, a general scheme for data visualization which breaks up graphs into semantic components such as scales and layers.
  • It can serve as a replacement for the base graphics in R and contains a number of defaults for web and print display of common scales.

5.3 A Layered Grammar of Graphics - Design principles of ggplot2

  • Data: The dataset being ploted.
  • Aesthetics: Take attributes of the data and use them to influence visual characteristics, such as position, colours, size, shape, or transparency.
  • Geometrics: The visual elements used for our data, such as point, bar or line.
  • Facets: split the data into subsets to create multiple variations of the same graph (paneling, multiple plots).
  • Statistics: statiscal transformations that summarise data (e.g. mean, confidence intervals).
  • Coordinate systems: define the plane on which data are mapped on the graphic.
  • Themes: modify all non-data components of a plot, such as main title, sub-title, y-aixs title, or legend background.

5.4 Visualising continuous variable

5.4.1 A simple histogram

The code chunk below is used to reveal the distribution of All_WPSA in a histogram.

ggplot(data=store_tidy, 
       aes(x= All_WPSA)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 138 rows containing non-finite values (stat_bin).

5.4.1.1 Adding mean and median lines on the histogram

In the code chunk below, geom_vline() is used to add the median and mean line.

ggplot(data=store_tidy, 
       aes(x= All_WPSA)) +
  geom_histogram(bins=25, 
                 color="black", 
                 fill="light blue") +
  geom_vline(aes(xintercept=mean(All_WPSA,
                                 na.rm=T)), # Ignore NA values for mean
             color="red", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(All_WPSA, na.rm=T)), # Ignore NA values for mean
             color="grey30", linetype="dashed", size=1)
## Warning: Removed 138 rows containing non-finite values (stat_bin).

5.4.1.2 Trellis plot

In the code chunk below, geom_facet() is used to plot multiple histograms showing the distribution of All)WPSA over years.

ggplot(data=store_tidy, 
       aes(x= All_WPSA)) +
  geom_histogram(bins=25, 
                 color="black", 
                 fill="light blue") +
  geom_vline(aes(xintercept=mean(All_WPSA,
                                 na.rm=T)), # Ignore NA values for mean
             color="red", linetype="dashed", size=1) +
  geom_vline(aes(xintercept=median(All_WPSA, na.rm=T)), # Ignore NA values for mean
             color="grey30", linetype="dashed", size=1)+
  facet_wrap(~ Year)
## Warning: Removed 138 rows containing non-finite values (stat_bin).

Beside histogram, other statistical graphical methods commonly used for EDA are boxplot and density plot. They can be prepared using geom_boxplot() and geom_density of ggplot2 package

5.5 Visualising Categorical Variables

5.5.1 A simple bar chart

ggplot(data=store_tidy, aes(x=`Latest Asset Type`)) +
  geom_bar()

5.5.2 Sorted bar chart

5.5.3 Distribution of stores over years

store_tidy %>% 
  drop_na(All_WPSA) %>%
ggplot(aes(x=`Year`)) +
  geom_bar() +
  ylim(0,220) +
  geom_text(stat="count", 
      aes(label=paste0(..count..)),
      vjust=-1) +
  xlab("Year") +
  ylab("No. of\nStores") +
  theme(axis.title.y=element_text(angle = 0))

5.5.4 Distribution of store by asset type over year

store_tidy %>% 
  drop_na(All_WPSA) %>%
ggplot(aes(x=`Year`, 
           fill=`Latest Asset Type`)) +
  geom_bar() +
  ylim(0,220) +
  geom_text(stat="count", 
      aes(label=paste0(..count..)),
      position = position_stack(vjust = .5), size = 2, color = "black") +
  xlab("Year") +
  ylab("No. of\nStores") +
  theme(axis.title.y=element_text(angle = 0))

6 Confirmatory Data Analysis

6.1 Univariate test

store_tidy %>%
  filter(Year == 2019) %>%
  gghistostats(x = All_WPSA,
               xlab = "Weekly Performance Sales Average", centrality.parameter = "median"
)
## t is large; approximation invoked.

6.2 Comparing store performance by year

store_tidy %>%
ggbetweenstats(x = Year,
  y = All_WPSA,
  nboot = 10,
  messages = FALSE
)

6.3 Comparing store performance by asset type

store_tidy %>%
  filter(Year == "2019") %>%
ggbetweenstats(x = `Latest Asset Type`,
  y = All_WPSA,
  pairwise.comparisons = TRUE,
  p.adjust.method = "fdr",
  outlier.tagging = TRUE,
  nboot = 10,
  messages = FALSE
)