The necessary packages are loaded here:
library(readxl) # For importing with no external dependencies, loads dates & times, drops blank columns, reads characters as characters. *Note {readxl} does not support export functions.
library(readr) # For importing with full variable names incl. spaces.
library(magrittr) # For pipe operators.
library(tidyr) # For data wrangling.
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
library(dplyr) # For data exploration & manipulation.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
options(warn=-1) # Turn off warnings at completion for clean export to report
The fuel dataset is a collection of fuel pricing data collected and collated daily by the NSW Government Department of Fair Trading as part of the FuelCheck comparisons project. FuelCheck provides real-time information about fuel prices at service stations across NSW and enables consumers to find the lowest price for a range of fuel types in their geographical area via a map interface.
The data is also released monthly to Data.NSW for Creative Commons exploration and inclusion in analytical projects. The fuel dataset loaded in this analysis is the August 2021 monthly release, published September 3, 2021 at: https://data.nsw.gov.au/search/dataset/ds-nsw-ckan-a97a46fc-2bdd-4b90-ac7f-0cb1e8d7ac3b/distribution/dist-nsw-ckan-14408fba-1d13-46a2-b4a1-841709fb0a0a/details?q=
| Variable Name | Variable Type | Description |
|---|---|---|
| Service Station Name | Nominal | The trading name of the service station |
| Address | Nominal | The address of the service station |
| Suburb | Categorical | Repeated component of the service station address (Suburb) |
| Postcode | Categorical | Repeated component of the service station address (Postcode) |
| Brand | Categorical | Brand of service station / fuel distributor |
| FuelCode | Categorical | Type of fuel sold |
| PriceUpdatedDate | Ordinal | Time and date of data input to the dataset |
| Price | Continuous | Price listing for each observation |
Department of Fair Trading 2021, FuelCheck Price History August 2021, Data.NSW, NSW Governemnt, Retrieved September 9, 2021, from Data.NSW Government website: https://data.nsw.gov.au/data/dataset/a97a46fc-2bdd-4b90-ac7f-0cb1e8d7ac3b/resource/14408fba-1d13-46a2-b4a1-841709fb0a0a/download/price_history_checks_august2021.xlsx
First we import the dataset from the working directory using {readxl} to the dataframe object “fuel”, skipping the first two rows of the original dataset which contained the document title.
The head() function shows the first few rows of the dataframe and the variable (or column) names and data types.
# Import datset and print dataset head
fuel <- read_excel("data/price_history_checks_august2021.xlsx", skip = 2)
head(fuel)
Inspection of the dataset reveals:
Dimensions: 32373 observations of 8 variables identified using dim()
Data Types:
The factor variables show:
The column names are appropriate. Spaces are added to mutli-word column names using gsub()
# Establish dimensions
dim(fuel)
## [1] 32373 8
# Inspect structure for data types
str(fuel)
## tibble [32,373 x 8] (S3: tbl_df/tbl/data.frame)
## $ ServiceStationName: chr [1:32373] "Metro Petroleum Ballina" NA NA NA ...
## $ Address : chr [1:32373] "323 River Street, BALLINA NSW 2478" NA NA NA ...
## $ Suburb : chr [1:32373] "BALLINA" NA NA NA ...
## $ Postcode : chr [1:32373] "2478" NA NA NA ...
## $ Brand : chr [1:32373] "Metro Fuel" NA NA NA ...
## $ FuelCode : chr [1:32373] "E10" "U91" "P95" "PDL" ...
## $ PriceUpdatedDate : chr [1:32373] "1/08/2021 12:19:29 AM" "1/08/2021 12:19:29 AM" "1/08/2021 12:19:29 AM" "1/08/2021 12:19:29 AM" ...
## $ Price : num [1:32373] 152 155 165 150 173 ...
# Convert relevant character variables to factors
cols_fact <- c("Suburb", "Postcode", "Brand", "FuelCode")
fuel[cols_fact] <- lapply(fuel[cols_fact], as.factor)
# Remove AM/PM from PriceUpdatedDate variable, separate into two variables for Date and Time, convert to appropriate data type, and remove time variable.
fuel$PriceUpdatedDate <- substr(fuel$PriceUpdatedDate,1,nchar(fuel$PriceUpdatedDate)-3)
fuel <- separate(fuel, PriceUpdatedDate, into = c("Date", "Time"), sep = " ", remove = TRUE)
fuel$Date <- as.Date(fuel$Date, "%d/%m/%Y")
fuel <- fuel[ ,-8]
# Check levels of factor variables
str(fuel)
## tibble [32,373 x 8] (S3: tbl_df/tbl/data.frame)
## $ ServiceStationName: chr [1:32373] "Metro Petroleum Ballina" NA NA NA ...
## $ Address : chr [1:32373] "323 River Street, BALLINA NSW 2478" NA NA NA ...
## $ Suburb : Factor w/ 1219 levels "ABERDARE","ABERDEEN",..: 44 NA NA NA NA 776 NA 541 NA NA ...
## $ Postcode : Factor w/ 482 levels "2007","2008",..: 287 NA NA NA NA 141 NA 313 NA NA ...
## $ Brand : Factor w/ 26 levels "7-Eleven","Ampol",..: 15 NA NA NA NA 15 NA 10 NA NA ...
## $ FuelCode : Factor w/ 9 levels "B20","DL","E10",..: 3 9 6 8 7 8 3 7 6 3 ...
## $ Date : Date[1:32373], format: "2021-08-01" "2021-08-01" ...
## $ Price : num [1:32373] 152 155 165 150 173 ...
levels(fuel$Brand)
## [1] "7-Eleven" "Ampol" "BP"
## [4] "Budget" "Caltex" "Caltex Woolworths"
## [7] "Coles Express" "Costco" "Enhance"
## [10] "Independent" "Inland Petroleum" "Liberty"
## [13] "Lowes" "Matilda" "Metro Fuel"
## [16] "Mobil" "Mobil 1" "Prime Petroleum"
## [19] "Puma Energy" "Shell" "South West"
## [22] "Speedway" "Transwest Fuels" "United"
## [25] "Westside" "Woodham Petroleum"
levels(fuel$FuelCode)
## [1] "B20" "DL" "E10" "E85" "LPG" "P95" "P98" "PDL" "U91"
# Check the column names and rename to include spaces in multi-word column names
colnames(fuel)
## [1] "ServiceStationName" "Address" "Suburb"
## [4] "Postcode" "Brand" "FuelCode"
## [7] "Date" "Price"
new_cols <- gsub("([a-z])([A-Z])", "\\1 \\2", colnames(fuel))
colnames(fuel) <- new_cols
colnames(fuel)
## [1] "Service Station Name" "Address" "Suburb"
## [4] "Postcode" "Brand" "Fuel Code"
## [7] "Date" "Price"
Reviewing the data according to tidy data principles, we identify some issues:
The wrangled data set is now 18024 observations of 10 variables identified using dim(). The head of the data is output to review changes implemented.
# 1. Complete each observation by copying down the relevant values
fuel <- fuel %>% fill("Service Station Name", "Suburb", "Postcode", "Brand", .direction = "down")
head(fuel)
# 2. Widen the dataframe to include individual variables for the Fuel Code categories. Where there are duplicate observations per user error in the data uploaded or the price was uploaded multiple times in one day, the mean value is retained.
fuel <- pivot_wider(data = fuel, names_from = "Fuel Code", values_from = "Price", values_fn = mean)
# 3. Remove Address variable as redundant using subsetting
fuel <- fuel[ ,c(1, 3:11)]
# 4. Align all values in Suburb and Service Station Name variables to upper case
fuel$Suburb <- toupper(fuel$Suburb)
fuel$`Service Station Name`<- toupper(fuel$`Service Station Name`)
# Establish dimensions
dim(fuel)
## [1] 18024 10
# Review the data set head
head(fuel)
The summary statistics of the now individual Fuel Code numeric variables, grouped by the categorical Brand variable are output below. These figures are attained through a multi-layer pipe function of group_by() and summarise().
Each data frame output is sorted by the Count variable in descending order (Brands with the most observations appear first in the list).
For the purposes of this scope, only the most common fuel types (E10, U91, and P95) are explored.
# Extract E10 summary statistics grouped by Brand
E10_Summary_Stats <- fuel %>%
group_by(Brand) %>%
summarise(E10_Min = min(E10, na.rm = TRUE),
E10_Median = median(E10, na.rm = TRUE),
E10_Max = max(E10, na.rm = TRUE),
E10_Mean = mean(E10, na.rm = TRUE),
E10_Standard_Deviation = sd(E10, na.rm = TRUE),
Count = n())
E10_Summary_Stats <- E10_Summary_Stats[order(-E10_Summary_Stats$Count),]
E10_Summary_Stats
# Extract U91 summary statistics grouped by Brand
U91_Summary_Stats <- fuel %>%
group_by(Brand) %>%
summarise(U91_Min = min(U91, na.rm = TRUE),
U91_Median = median(U91, na.rm = TRUE),
U91_Max = max(U91, na.rm = TRUE),
U91_Mean = mean(U91, na.rm = TRUE),
U91_Standard_Deviation = sd(U91, na.rm = TRUE),
Count = n())
U91_Summary_Stats <- U91_Summary_Stats[order(-U91_Summary_Stats$Count),]
U91_Summary_Stats
# Extract P95 summary statistics grouped by Brand
P95_Summary_Stats <- fuel %>%
group_by(Brand) %>%
summarise(P95_Min = min(P95, na.rm = TRUE),
P95_Median = median(P95, na.rm = TRUE),
P95_Max = max(P95, na.rm = TRUE),
P95_Mean = mean(P95, na.rm = TRUE),
P95_Standard_Deviation = sd(P95, na.rm = TRUE),
Count = n())
P95_Summary_Stats <- P95_Summary_Stats[order(-P95_Summary_Stats$Count),]
P95_Summary_Stats
This code creates a list containing a numeric value (ID) for each response in the categorical variable Brand. This list includes a total of 26 unique values.
# Create a list with a numeric value response to a categorical variable (Brand)
list_new <- list(1:26, c("7-Eleven","Ampol","BP","Budget","Caltex","Caltex Woolworths","Coles Express",
"Costco","Enhance","Independent","Inland Petroleum","Liberty","Lowes","Matilda",
"Metro Fuel","Mobil","Mobil 1","Prime Petroleum","Puma Energy","Shell","South West",
"Speedway","Transwest Fuels","United","Westside","Woodham Petroleum"))
# Name the elements of the list and check the structure
names(list_new) <- c("Brand ID", "Brand")
str(list_new)
## List of 2
## $ Brand ID: int [1:26] 1 2 3 4 5 6 7 8 9 10 ...
## $ Brand : chr [1:26] "7-Eleven" "Ampol" "BP" "Budget" ...
This newly created list is then joined to the larger data frame to create a new expanded data frame using merge(). The new Brand ID variable is re-named to align with the other variables’ naming convention using names(), and the new expanded data frame is re-ordered first to move the new Brand ID variable to the first column using subsetting, then to sort the data by the Date variable using order().
# Join the list to the fuel dataframe,
fuel_expanded <- merge(fuel, list_new, by = "Brand")
# Rename the new column with a space to match other variables naming convention
names(fuel_expanded)[names(fuel_expanded) == "Brand.ID"] <- "Brand ID"
# Re-order the columns to show the new Brand ID variable
fuel_expanded <- fuel_expanded[,c(11,1:10)]
# Re-order the merged dataframe to sort by Date
fuel_expanded <- fuel_expanded[order(fuel_expanded$Date),]
head(fuel_expanded)
The first 10 observations (including all variables) are subsetted from the fuel_expanded data frame using square brackets, and converted to a matrix using data.matrix(). The resulting matrix is numeric, as a matrix must be homogeneous (rather than heterogeneous like a data frame), so the mixed variable types convert to the nearest logical format. In this data frame comprised of numeric and factored variables, a numeric matrix result is the most logical homogeneous expression of the data.
# Subset the dataframe to the first 10 observaions including all variables
fuel_sub <- fuel_expanded[1:10,]
# Convert to a matrix and check the structure
fuel_mat <- data.matrix(fuel_sub, rownames.force = NA)
str(fuel_mat)
## num [1:10, 1:11] 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "dimnames")=List of 2
## ..$ : chr [1:10] "2" "19" "30" "70" ...
## ..$ : chr [1:11] "Brand ID" "Brand" "Service Station Name" "Suburb" ...
This code subsets the data frame into a new object using square brackets to extract the first and last columns / variables of the data set. This object is then saved as an R object file in the working directory output folder.
# Subset the dataframe with only two columns (first and last variables)
fuel_sub2 <- fuel_expanded[,c(1,11)]
# Save the R object to .RData
save(fuel_sub2, file = "output/fuel_subset2.RData")
The new data frame “shirts” is created with two variables using data.frame():
The structure of the variables and levels of the Size variable are output using str() and levels() functions.
A further numeric vector is created as the object “Fulfilled” and appended to the data frame using cbind(). As the order of the appended variable makes it difficult to compare between Ordered and Fulfilled values, the columns are re-organised using subsetting.
The attributes and dimensions of the shirts data frame are output using str().
# Create a data frame of 4 observations in 2 variables & name the variables. Factor & order the ordinal variable.
shirts <- data.frame (Orders = c(46, 2, 17, 4),
Size = factor(c("Small", "Medium", "Large", "XLarge"), ordered = TRUE))
# Show the structure of the variables and levels of the ordinal variable
str(shirts)
## 'data.frame': 4 obs. of 2 variables:
## $ Orders: num 46 2 17 4
## $ Size : Ord.factor w/ 4 levels "Large"<"Medium"<..: 3 2 1 4
levels(shirts$Size)
## [1] "Large" "Medium" "Small" "XLarge"
# Create another numeric vector & cbind() to the data frame. Reorder columns into logical structure.
Fulfilled <- c(30, 0, 10, 2)
shirts <- cbind(shirts, Fulfilled)
shirts <- shirts[,c(2,1,3)]
# Check the attributes and dimension of the new data frame
str(shirts)
## 'data.frame': 4 obs. of 3 variables:
## $ Size : Ord.factor w/ 4 levels "Large"<"Medium"<..: 3 2 1 4
## $ Orders : num 46 2 17 4
## $ Fulfilled: num 30 0 10 2
The new data frame “shirts_nz” is created with four variables using data.frame():
The two data frames are merged into a new object “shirts_anz”, with the new variables correctly aligned with the existing data set.
Some further wrangling of the dataset is undertaken to ensure clarity that the original “Orders” variable is “AUS Orders” received in Australia, and “Fulfilled” is “AUS Fulfilled” to Australia.
In addition two new variables are added using basic arithmetic functions to show the total number of orders received and the total number of unfulfilled orders across both locations.
# Create another data frame with a common variable
shirts_nz <- data.frame (Code = factor(c("S", "M", "L", "XL"), ordered = TRUE),
Size = factor(c("Small", "Medium", "Large", "XLarge"), ordered = TRUE),
"NZ Orders" = c(22, 18, 6, 29),
"NZ Fulfilled" = c(8, 9, 1, 0))
# Join the two data frames, ensuring the new categorical variable "Code" is carried to the combined dataset
shirts_anz <- merge(shirts, shirts_nz)
# Relabel the original data frame columns to align
shirts_anz <- shirts_anz[,c(4,1:3,5:6)]
colnames(shirts_anz) <- c("Code", "Size", "AUS Orders", "AUS Fulfilled", "NZ Orders", "NZ Fulfilled")
# Add variables for total orders and unfulfilled
shirts_anz$`Total Orders` <- (shirts_anz$`AUS Orders` + shirts_anz$`NZ Orders`)
shirts_anz$Unfulfilled <- (shirts_anz$`Total Orders` - (shirts_anz$`AUS Fulfilled` + shirts_anz$`NZ Fulfilled`))
# Output data frame
shirts_anz