MATH2405 TP3, 2021

Setup

The necessary packages are loaded here:

library(readxl) # For importing with no external dependencies, loads dates & times, drops blank columns, reads characters as characters. *Note {readxl} does not support export functions.
library(readr) # For importing with full variable names incl. spaces.
library(magrittr) # For pipe operators.
library(tidyr) # For data wrangling.

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:magrittr':
## 
##     extract

library(dplyr) # For data exploration & manipulation.

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

options(warn=-1) # Turn off warnings at completion for clean export to report

Locate Data

Data description

The fuel dataset is a collection of fuel pricing data collected and collated daily by the NSW Government Department of Fair Trading as part of the FuelCheck comparisons project. FuelCheck provides real-time information about fuel prices at service stations across NSW and enables consumers to find the lowest price for a range of fuel types in their geographical area via a map interface.

The data is also released monthly to Data.NSW for Creative Commons exploration and inclusion in analytical projects. The fuel dataset loaded in this analysis is the August 2021 monthly release, published September 3, 2021 at: https://data.nsw.gov.au/search/dataset/ds-nsw-ckan-a97a46fc-2bdd-4b90-ac7f-0cb1e8d7ac3b/distribution/dist-nsw-ckan-14408fba-1d13-46a2-b4a1-841709fb0a0a/details?q=

Dataset Summary:

Variable Name	Variable Type	Description
Service Station Name	Nominal	The trading name of the service station
Address	Nominal	The address of the service station
Suburb	Categorical	Repeated component of the service station address (Suburb)
Postcode	Categorical	Repeated component of the service station address (Postcode)
Brand	Categorical	Brand of service station / fuel distributor
FuelCode	Categorical	Type of fuel sold
PriceUpdatedDate	Ordinal	Time and date of data input to the dataset
Price	Continuous	Price listing for each observation

Reference:

Department of Fair Trading 2021, FuelCheck Price History August 2021, Data.NSW, NSW Governemnt, Retrieved September 9, 2021, from Data.NSW Government website: https://data.nsw.gov.au/data/dataset/a97a46fc-2bdd-4b90-ac7f-0cb1e8d7ac3b/resource/14408fba-1d13-46a2-b4a1-841709fb0a0a/download/price_history_checks_august2021.xlsx

Read/Import Data

First we import the dataset from the working directory using {readxl} to the dataframe object “fuel”, skipping the first two rows of the original dataset which contained the document title.

The head() function shows the first few rows of the dataframe and the variable (or column) names and data types.

# Import datset and print dataset head
fuel <- read_excel("data/price_history_checks_august2021.xlsx", skip = 2)
head(fuel)

Inspect dataset and variables

Inspection of the dataset reveals:

Dimensions: 32373 observations of 8 variables identified using dim()
Data Types:
- Original data types identified via str() were all character, excepting Price which is numeric (double precision).
- A number of variables loaded as character should be factors (Suburb, Postcode, Brand, FuelCode) and are updated using as.factor() functions.
- The PriceDateUpdated variable includes date, time, and AM/PM. The AM/PM is unecessary to the presentation of individual observations and causes conversion read errors, so we remove it by reducing the number of characters in the string, then split the date and time components using separate(), and convert the outputs using as.Date() and strptime(), specifying the format of the data inputs to ensure accurate representation.
The factor variables show:
- Suburb has 1219 levels, automatically arranged alphabetically (identified using str() as the number of variables is large).
- Postcode has 482 levels, and is not ordinal data (identified using str() as the number of variables is large)
- Brand has 26 levels, and is not ordinal (identified using levels() function)
- FuelCode has 9 levels, and is not ordinal (identified using levels() function)
The column names are appropriate. Spaces are added to mutli-word column names using gsub()

# Establish dimensions
dim(fuel)

## [1] 32373     8

# Inspect structure for data types
str(fuel)

## tibble [32,373 x 8] (S3: tbl_df/tbl/data.frame)
##  $ ServiceStationName: chr [1:32373] "Metro Petroleum Ballina" NA NA NA ...
##  $ Address           : chr [1:32373] "323 River Street, BALLINA NSW 2478" NA NA NA ...
##  $ Suburb            : chr [1:32373] "BALLINA" NA NA NA ...
##  $ Postcode          : chr [1:32373] "2478" NA NA NA ...
##  $ Brand             : chr [1:32373] "Metro Fuel" NA NA NA ...
##  $ FuelCode          : chr [1:32373] "E10" "U91" "P95" "PDL" ...
##  $ PriceUpdatedDate  : chr [1:32373] "1/08/2021 12:19:29 AM" "1/08/2021 12:19:29 AM" "1/08/2021 12:19:29 AM" "1/08/2021 12:19:29 AM" ...
##  $ Price             : num [1:32373] 152 155 165 150 173 ...

# Convert relevant character variables to factors
cols_fact <- c("Suburb", "Postcode", "Brand", "FuelCode")
fuel[cols_fact] <- lapply(fuel[cols_fact], as.factor)

# Remove AM/PM from PriceUpdatedDate variable, separate into two variables for Date and Time, convert to appropriate data type, and remove time variable.
fuel$PriceUpdatedDate <- substr(fuel$PriceUpdatedDate,1,nchar(fuel$PriceUpdatedDate)-3)
fuel <- separate(fuel, PriceUpdatedDate, into = c("Date", "Time"), sep = " ", remove = TRUE)
fuel$Date <- as.Date(fuel$Date, "%d/%m/%Y")
fuel <- fuel[ ,-8]

# Check levels of factor variables
str(fuel)

## tibble [32,373 x 8] (S3: tbl_df/tbl/data.frame)
##  $ ServiceStationName: chr [1:32373] "Metro Petroleum Ballina" NA NA NA ...
##  $ Address           : chr [1:32373] "323 River Street, BALLINA NSW 2478" NA NA NA ...
##  $ Suburb            : Factor w/ 1219 levels "ABERDARE","ABERDEEN",..: 44 NA NA NA NA 776 NA 541 NA NA ...
##  $ Postcode          : Factor w/ 482 levels "2007","2008",..: 287 NA NA NA NA 141 NA 313 NA NA ...
##  $ Brand             : Factor w/ 26 levels "7-Eleven","Ampol",..: 15 NA NA NA NA 15 NA 10 NA NA ...
##  $ FuelCode          : Factor w/ 9 levels "B20","DL","E10",..: 3 9 6 8 7 8 3 7 6 3 ...
##  $ Date              : Date[1:32373], format: "2021-08-01" "2021-08-01" ...
##  $ Price             : num [1:32373] 152 155 165 150 173 ...

levels(fuel$Brand)

##  [1] "7-Eleven"          "Ampol"             "BP"               
##  [4] "Budget"            "Caltex"            "Caltex Woolworths"
##  [7] "Coles Express"     "Costco"            "Enhance"          
## [10] "Independent"       "Inland Petroleum"  "Liberty"          
## [13] "Lowes"             "Matilda"           "Metro Fuel"       
## [16] "Mobil"             "Mobil 1"           "Prime Petroleum"  
## [19] "Puma Energy"       "Shell"             "South West"       
## [22] "Speedway"          "Transwest Fuels"   "United"           
## [25] "Westside"          "Woodham Petroleum"

levels(fuel$FuelCode)

## [1] "B20" "DL"  "E10" "E85" "LPG" "P95" "P98" "PDL" "U91"

# Check the column names and rename to include spaces in multi-word column names
colnames(fuel)

## [1] "ServiceStationName" "Address"            "Suburb"            
## [4] "Postcode"           "Brand"              "FuelCode"          
## [7] "Date"               "Price"

new_cols <- gsub("([a-z])([A-Z])", "\\1 \\2", colnames(fuel))
colnames(fuel) <- new_cols
colnames(fuel)

## [1] "Service Station Name" "Address"              "Suburb"              
## [4] "Postcode"             "Brand"                "Fuel Code"           
## [7] "Date"                 "Price"

Tidy data

Identifying Issues

Reviewing the data according to tidy data principles, we identify some issues:

Each observation must have a row
- The Service Station Name, Suburb, Postcode, and Brand values were stored as merged cells in Excel, so each new observation of these is followed by a number of NA values per the different Fuel Codes sold at each location.
Each variable has its own column
- The Fuel Code variable is multiple variables stored in one column.
Each value has its own cell
- The Address variable contains multiple values, and is redundant as Suburb and Postcode variables provide sufficient detail for this scope. The data also includes a number of non-standard fules that are removed for the purposes of this scope.
Further considerations
- The Suburb and Service Station Name variables include a mix of upper and lower case variables, which may cause read errors if the same value is expressed in different cases, and which affects the legibility of the data.

Resolving Issues

Values are copied down to fill the NAs for each previously merged observation using fill().
The data frame is widened to include individual variables for the Fuel Code categories, using Price as the values reference in a pivot_wider() function.
The Address and non-standard Fuel Code variables are removed using subsetting.
Suburb and Service Station Name values are converted to upper case only using toupper().

The wrangled data set is now 18024 observations of 10 variables identified using dim(). The head of the data is output to review changes implemented.

# 1. Complete each observation by copying down the relevant values
fuel <- fuel %>%  fill("Service Station Name", "Suburb", "Postcode", "Brand", .direction = "down")
head(fuel)

# 2. Widen the dataframe to include individual variables for the Fuel Code categories. Where there are duplicate observations per user error in the data uploaded or the price was uploaded multiple times in one day, the mean value is retained.
fuel <- pivot_wider(data = fuel, names_from = "Fuel Code", values_from = "Price", values_fn = mean)

# 3. Remove Address variable as redundant using subsetting
fuel <- fuel[ ,c(1, 3:11)]

# 4. Align all values in Suburb and Service Station Name variables to upper case
fuel$Suburb <- toupper(fuel$Suburb)
fuel$`Service Station Name`<- toupper(fuel$`Service Station Name`)

# Establish dimensions
dim(fuel)

## [1] 18024    10

# Review the data set head
head(fuel)

Summary statistics

The summary statistics of the now individual Fuel Code numeric variables, grouped by the categorical Brand variable are output below. These figures are attained through a multi-layer pipe function of group_by() and summarise().

Each data frame output is sorted by the Count variable in descending order (Brands with the most observations appear first in the list).

For the purposes of this scope, only the most common fuel types (E10, U91, and P95) are explored.

# Extract E10 summary statistics grouped by Brand
E10_Summary_Stats <- fuel %>%
        group_by(Brand) %>%
        summarise(E10_Min = min(E10, na.rm = TRUE),
                  E10_Median = median(E10, na.rm = TRUE), 
                  E10_Max = max(E10, na.rm = TRUE), 
                  E10_Mean = mean(E10, na.rm = TRUE), 
                  E10_Standard_Deviation = sd(E10, na.rm = TRUE),
                  Count = n())

E10_Summary_Stats <- E10_Summary_Stats[order(-E10_Summary_Stats$Count),] 
E10_Summary_Stats

# Extract U91 summary statistics grouped by Brand
U91_Summary_Stats <- fuel %>%
        group_by(Brand) %>%
        summarise(U91_Min = min(U91, na.rm = TRUE),
                  U91_Median = median(U91, na.rm = TRUE), 
                  U91_Max = max(U91, na.rm = TRUE), 
                  U91_Mean = mean(U91, na.rm = TRUE), 
                  U91_Standard_Deviation = sd(U91, na.rm = TRUE),
                  Count = n())

U91_Summary_Stats <- U91_Summary_Stats[order(-U91_Summary_Stats$Count),] 
U91_Summary_Stats

# Extract P95 summary statistics grouped by Brand
P95_Summary_Stats <- fuel %>%
        group_by(Brand) %>%
        summarise(P95_Min = min(P95, na.rm = TRUE),
                  P95_Median = median(P95, na.rm = TRUE), 
                  P95_Max = max(P95, na.rm = TRUE), 
                  P95_Mean = mean(P95, na.rm = TRUE), 
                  P95_Standard_Deviation = sd(P95, na.rm = TRUE),
                  Count = n())

P95_Summary_Stats <- P95_Summary_Stats[order(-P95_Summary_Stats$Count),]
P95_Summary_Stats

Create a list

This code creates a list containing a numeric value (ID) for each response in the categorical variable Brand. This list includes a total of 26 unique values.

# Create a list with a numeric value response to a categorical variable (Brand)
list_new <- list(1:26, c("7-Eleven","Ampol","BP","Budget","Caltex","Caltex Woolworths","Coles Express",
                         "Costco","Enhance","Independent","Inland Petroleum","Liberty","Lowes","Matilda",
                         "Metro Fuel","Mobil","Mobil 1","Prime Petroleum","Puma Energy","Shell","South West",
                         "Speedway","Transwest Fuels","United","Westside","Woodham Petroleum"))

# Name the elements of the list and check the structure
names(list_new) <- c("Brand ID", "Brand")
str(list_new)

## List of 2
##  $ Brand ID: int [1:26] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Brand   : chr [1:26] "7-Eleven" "Ampol" "BP" "Budget" ...

Join the list

This newly created list is then joined to the larger data frame to create a new expanded data frame using merge(). The new Brand ID variable is re-named to align with the other variables’ naming convention using names(), and the new expanded data frame is re-ordered first to move the new Brand ID variable to the first column using subsetting, then to sort the data by the Date variable using order().

# Join the list to the fuel dataframe, 
fuel_expanded <- merge(fuel, list_new, by = "Brand")

# Rename the new column with a space to match other variables naming convention
names(fuel_expanded)[names(fuel_expanded) == "Brand.ID"] <- "Brand ID"

# Re-order the columns to show the new Brand ID variable
fuel_expanded <- fuel_expanded[,c(11,1:10)]

# Re-order the merged dataframe to sort by Date
fuel_expanded <- fuel_expanded[order(fuel_expanded$Date),]
head(fuel_expanded)

Subsetting I

The first 10 observations (including all variables) are subsetted from the fuel_expanded data frame using square brackets, and converted to a matrix using data.matrix(). The resulting matrix is numeric, as a matrix must be homogeneous (rather than heterogeneous like a data frame), so the mixed variable types convert to the nearest logical format. In this data frame comprised of numeric and factored variables, a numeric matrix result is the most logical homogeneous expression of the data.

# Subset the dataframe to the first 10 observaions including all variables
fuel_sub <- fuel_expanded[1:10,]

# Convert to a matrix and check the structure

fuel_mat <- data.matrix(fuel_sub, rownames.force = NA)
str(fuel_mat)

##  num [1:10, 1:11] 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : chr [1:10] "2" "19" "30" "70" ...
##   ..$ : chr [1:11] "Brand ID" "Brand" "Service Station Name" "Suburb" ...

Subsetting II

This code subsets the data frame into a new object using square brackets to extract the first and last columns / variables of the data set. This object is then saved as an R object file in the working directory output folder.

# Subset the dataframe with only two columns (first and last variables) 
fuel_sub2 <- fuel_expanded[,c(1,11)]

# Save the R object to .RData
save(fuel_sub2, file = "output/fuel_subset2.RData")

Create a new Data Frame

The new data frame “shirts” is created with two variables using data.frame():

Orders: a numeric vector indicating the number of orders received
Size: an ordinal factor of potential shirt sizes, noting the input order is correct (orderered = TRUE)

The structure of the variables and levels of the Size variable are output using str() and levels() functions.

A further numeric vector is created as the object “Fulfilled” and appended to the data frame using cbind(). As the order of the appended variable makes it difficult to compare between Ordered and Fulfilled values, the columns are re-organised using subsetting.

The attributes and dimensions of the shirts data frame are output using str().

# Create a data frame of 4 observations in 2 variables & name the variables. Factor & order the ordinal variable.
shirts <- data.frame (Orders = c(46, 2, 17, 4),
                      Size = factor(c("Small", "Medium", "Large", "XLarge"), ordered = TRUE))

# Show the structure of the variables and levels of the ordinal variable
str(shirts)

## 'data.frame':    4 obs. of  2 variables:
##  $ Orders: num  46 2 17 4
##  $ Size  : Ord.factor w/ 4 levels "Large"<"Medium"<..: 3 2 1 4

levels(shirts$Size)

## [1] "Large"  "Medium" "Small"  "XLarge"

# Create another numeric vector & cbind() to the data frame. Reorder columns into logical structure.
Fulfilled <- c(30, 0, 10, 2)
shirts <- cbind(shirts, Fulfilled)
shirts <- shirts[,c(2,1,3)]

# Check the attributes and dimension of the new data frame
str(shirts)

## 'data.frame':    4 obs. of  3 variables:
##  $ Size     : Ord.factor w/ 4 levels "Large"<"Medium"<..: 3 2 1 4
##  $ Orders   : num  46 2 17 4
##  $ Fulfilled: num  30 0 10 2

Create another Data Frame

The new data frame “shirts_nz” is created with four variables using data.frame():

Code: a new ordinal vector containing abbreviations that will correspond to the Size variable
Size: the common variable to the “shirts” data frame
NZ Orders: a numeric vector indicating the number of orders received from New Zealand
NZ Fulfilled: a numeric vector indicating the number of orders fulfilled to New Zealand

The two data frames are merged into a new object “shirts_anz”, with the new variables correctly aligned with the existing data set.

Some further wrangling of the dataset is undertaken to ensure clarity that the original “Orders” variable is “AUS Orders” received in Australia, and “Fulfilled” is “AUS Fulfilled” to Australia.

In addition two new variables are added using basic arithmetic functions to show the total number of orders received and the total number of unfulfilled orders across both locations.

# Create another data frame with a common variable
shirts_nz <- data.frame (Code = factor(c("S", "M", "L", "XL"), ordered = TRUE), 
                         Size = factor(c("Small", "Medium", "Large", "XLarge"), ordered = TRUE),
                        "NZ Orders" = c(22, 18, 6, 29),
                        "NZ Fulfilled" = c(8, 9, 1, 0))

# Join the two data frames, ensuring the new categorical variable "Code" is carried to the combined dataset
shirts_anz <- merge(shirts, shirts_nz)

# Relabel the original data frame columns to align
shirts_anz <- shirts_anz[,c(4,1:3,5:6)]
colnames(shirts_anz) <- c("Code", "Size", "AUS Orders", "AUS Fulfilled", "NZ Orders", "NZ Fulfilled")

# Add variables for total orders and unfulfilled
shirts_anz$`Total Orders` <- (shirts_anz$`AUS Orders` + shirts_anz$`NZ Orders`)
shirts_anz$Unfulfilled <- (shirts_anz$`Total Orders` - (shirts_anz$`AUS Fulfilled` + shirts_anz$`NZ Fulfilled`))

# Output data frame
shirts_anz