1 Introduction

1.1 About

Goal: Learn R language just like any foreign languages

About the data: FuelCheck provides real-time information about fuel prices at service stations across NSW. The dataset contains fuel price and other attributes for October 2017 only.

FuelCheck enables NSW motorists to:

Find the cheapest fuel being sold anywhere in NSW
Get directions to any service station in NSW
Search for fuel by type (E10, Regular, LPG, etc) or brand
Submit a complaint to NSW Fair Trading if the price at the pump doesn’t match what is shown on FuelCheck.

The data we will be working on is for October 2017 NSW only.

1.2 Library

Load Packages

library(tidyverse)
library(lubridate)
library(DT)
library(broom)
library(knitr)
library(viridis) #scale color
library(forcats) #factors
library(glue)

library(ggmap)
library(leaflet)
library(rgdal)
library(tmap)
library(broom)
library(leaflet.extras)
library(emo)

1.3 Data

Load Data

data1 <- read_csv("D:/r datawarehouse/petrol/October-2017.csv")

1.4 Missing value

Check for any missing values

colSums(is.na(data1))

## ServiceStationName            Address             Suburb 
##                  0                  0                  0 
##           Postcode              Brand           FuelCode 
##                  0                  0                  0 
##   PriceUpdatedDate              Price 
##                  0                  0

There are no missing values in the dataset.

2 Data Wrangling

2.1 Data structure

There are 60045 rows and 8 columns in the dataset.

datatable(head(data1, 2), options = list(pagelength = 5, scrollx = T), caption = "Displaying only first 2 rows of the dataset")

2.2 Data types

Below is the summary of each variable and its class assigned to it. We can clearly see that PriceUpdatedDate is in character class. We will assign correct class to each variables.

summary(data1)

##  ServiceStationName   Address             Suburb             Postcode   
##  Length:60045       Length:60045       Length:60045       Min.   :1579  
##  Class :character   Class :character   Class :character   1st Qu.:2142  
##  Mode  :character   Mode  :character   Mode  :character   Median :2204  
##                                                           Mean   :2319  
##                                                           3rd Qu.:2557  
##                                                           Max.   :4383  
##     Brand             FuelCode         PriceUpdatedDate       Price      
##  Length:60045       Length:60045       Length:60045       Min.   : 53.9  
##  Class :character   Class :character   Class :character   1st Qu.:127.8  
##  Mode  :character   Mode  :character   Mode  :character   Median :136.9  
##                                                           Mean   :136.4  
##                                                           3rd Qu.:145.9  
##                                                           Max.   :177.7

2.2.1 Counting unique value in each variable

counts <- as.data.frame(matrix(NA, nrow = 1, ncol = 8))
names(counts) <- names(data1)


for(i in seq_along(data1)){
  counts[[i]] <- length(unique(data1[[i]]))
}

kable(counts, caption = "Count of unique values")

Count of unique values
ServiceStationName	Address	Suburb	Postcode	Brand	FuelCode	PriceUpdatedDate	Price
1794	1926	1058	473	20	9	11965	613

This should help us in assigning correct class to each variable. Brand,FuelCode and PostCode should be changed to factors. PriceUpdatedDate as Date and Time.

2.2.2 Assign new class to variables

We’ll create a new variable Day_time which will convert the PriceUpdatedDate variable to date and time. This will also preserve the original data.

data1 <- data1 %>% 
  mutate_at(vars(Brand,FuelCode,Postcode), funs(as.factor(.)))

data1$Day_time <- dmy_hm(data1$PriceUpdatedDate)

2.2.3 Create varaible with just date

data1$Date <- lubridate::date(data1$Day_time)

2.2.4 Create variable just for day

data1$Day <- wday(data1$Day_time, label = T, abbr = T)

2.2.5 Create a new variable with standard price format

data1 <- data1 %>% 
  mutate(Price_std = Price/100)

3 Data exploration

3.1 Distribution of price

3.1.1 Density Plot

Density plot shows distribution of data over a continious interval. Since they are not affected by the number of bins as in Histogram, it is better at determining the distribution shape.

data1 %>% 
  ggplot(aes(Price_std))+
  geom_density(fill = "grey", size = 1)+
  geom_vline(aes(xintercept = median(data1$Price_std, na.rm = T)), linetype = "longdash" , size = .80)+
  scale_x_continuous(name = "Price", breaks = seq(.50,1.80, by = .10), labels = scales::dollar)+
  labs(title = "Density plot of all Fuel Price types")

Just by gleening on the density plot we can see the distribution is skewed to the left which tells us that the mean price falls on the left side of the medain price. It is also multimodal distribition which means it has multiple peaks. This is due to different types of fuel have various prices. Above density plot shows when considering all fuel types most of the fuel prices are concentrated around $1.35.

Below are the types of fuel found in the dataset:

Fuel Code	Description
E10	10 per cent ethanol blend
P98	Premium unleaded 98
U91	Standard unleaded petrol
PDL	Premium Diesel
P95	Premium unleaded 95
DL	Diesel
E85	Ethanol fuel blend of 85% denatured ethanol fuel and 15% gasoline or other hydrocarbon
LPG	Liquefied petroleum gas
CNG	Compressed natural gas

Descriptions for fuel code were generated from google search. However, I couldn’t find PDL information. My guess is that it stands for Premium Diesel. So I plotted density plot of PDL and DL to check for similarity. The distribution has medium similarity. I therefore think PDL is Premium Diesel.

data1 %>% 
  filter(FuelCode %in% c("PDL", "DL")) %>% 
  ggplot(aes(Price_std))+
  geom_density(aes(fill = FuelCode), alpha = .5, size = 1)+
  scale_x_continuous(name = "Price", labels = scales::dollar)+
  labs(title = "Desnsity plot: DL & PDL")

Also, fuel type Compressed natural gas (CNG) is the odd one. It only has one row of data. Therefore, I’ll exclude from any plot generated below.

datatable(data1 %>% 
  filter(FuelCode == "CNG"), options = list(pagelength = 5, scrollx = T, scrollCollapse = T, paging = F))

data1 %>% 
  filter(!(FuelCode %in% c("CNG", "DL", "PDL", "LPG"))) %>% 
  ggplot(aes(Price_std))+
  geom_density(aes(fill = FuelCode),size = 1, alpha = 0.5 )+ #position = "stack")+
  scale_x_continuous(name = "Price", labels = scales::dollar)+
  labs(title = "Density plot of Ethanol fuel type")

Above plot claims that Premium unleaded fuel are more expensive then the rest. They also have multimodal distribution. We can also observe that there a pattern - P98,P95 are closely grouped and E10, E85 are closely grouped.

data1 %>% 
  filter(FuelCode %in% c("LPG")) %>% 
  ggplot(aes(Price_std))+
  geom_density(fill = "grey", size = 1)+
  scale_x_continuous(name = "Price", labels = scales::dollar)+
  labs(title = "Density plot of LPG")

It looks like LPG price are the most volatile out of all the fuel types. This also explains why we have small peak at the first density plot.

3.1.2 Boxplot: Fuel type vs price

data1 %>% 
  ggplot(aes(fct_reorder(FuelCode, Price_std, fun = median, na.rm =T), Price_std))+
  geom_boxplot(aes(fill = FuelCode))+
  scale_y_continuous(name = "Price", breaks = seq(0.5, 2.5, by = 0.1), labels = scales::dollar)+
  labs(x = "Fuel Code", title = "Boxplot of fuel type vs price")+
  coord_flip()+
  theme(legend.position = "none")

I’d like to plot all boxplot in one graph so that I can use the same scale to obtain information. We can see that premium unleaded median price are the highest among all other fuel types. However they are also skewed to the left meaning there were few days the price got low but most of the time it stayed around its median price. LPG is the second cheapest fuel however, they have lot of outliers.

The fuel I normally use is E10. Atleast 75% of all the price for Oct were above $1.20

Go: Top menu

3.2 Date and price analysis

# data1 %>% 
#   filter(!(FuelCode %in% c("CNG", "DL", "PDL", "LPG"))) %>% 
#   ggplot(aes(Date, Price_std, group = Date))+
#   geom_boxplot(aes(fill = FuelCode))+
#   facet_wrap(~FuelCode)+
#   theme(legend.position = "none")+
#   scale_x_date(date_minor_breaks = "1 day", date_labels = "%d")+
#   scale_y_continuous(name = "Price", labels = scales::dollar)+
#   labs(title = "Boxplot of Unleaded Fuel type")+
#   stat_summary(fun.y=median, geom="line", aes(group=1), size = 0.8, col = "blue")
#   #geom_smooth(aes(group = 1))

data1 %>% 
  filter(!(FuelCode %in% c("CNG", "DL", "PDL", "LPG"))) %>% 
  ggplot(aes(Date, Price_std, group = Date))+
  geom_boxplot(aes(fill = FuelCode))+
  facet_wrap(~FuelCode)+
  theme(legend.position = "none")+
  scale_x_date(date_minor_breaks = "1 day", date_labels = "%d")+
  scale_y_continuous(name = "Price", labels = scales::dollar)+
  labs(title = "Boxplot of Unleaded Fuel type")+
  #stat_summary(fun.y=median, geom="line", aes(group=1), size = 0.8, col = "blue")
  geom_smooth(aes(group = 1), col = "red")

All unleaded plot is consistent with the pattern that the medain price hit the rock bottom around 16th October and started upward movement.

# data1 %>% 
#   filter(FuelCode %in% c("PDL", "DL")) %>% 
#   ggplot(aes(Date, Price_std, group = Date))+
#   geom_boxplot(aes(fill = FuelCode))+
#   facet_wrap(~FuelCode)+
#   stat_summary(fun.y=median, geom="line", aes(group=1), size = 0.8, col = "blue" )+
#   scale_y_continuous(name = "Price", labels = scales::dollar)+
#   labs(title = "Boxplot of Diesel")+
#   theme(legend.position = "none")+
#   scale_x_date(date_minor_breaks = "1 day", date_labels = "%d")

data1 %>% 
  filter(FuelCode %in% c("PDL", "DL")) %>% 
  ggplot(aes(Date, Price_std, group = Date))+
  geom_boxplot(aes(fill = FuelCode))+
  facet_wrap(~FuelCode)+
  #stat_summary(fun.y=median, geom="line", aes(group=1), size = 0.8, col = "blue" )+
  scale_y_continuous(name = "Price", labels = scales::dollar)+
  labs(title = "Boxplot of Diesel")+
  theme(legend.position = "none")+
  scale_x_date(date_minor_breaks = "1 day", date_labels = "%d")+
  geom_smooth(aes(group = 1), col = "red")

data1 %>% 
  filter(FuelCode == "LPG") %>% 
  ggplot(aes(Date, Price_std, group = Date))+
  geom_boxplot(aes(fill = FuelCode))+
  #stat_summary(fun.y = median, geom = "line", aes(group = 1), col = "blue", size = 0.8)+
  scale_x_date(date_minor_breaks = "1 day", date_labels = "%d")+
  scale_y_continuous(name = "Price", labels = scales::dollar)+
  theme(legend.position = "none")+
  labs(title = "Boxplot of LPG")+
  geom_smooth(aes(group = 1), col = "red")

3.2.1 Day of Week analysis

unleaded <- c("E10", "E85", "P95", "P98", "U91")

data1 %>% 
  filter(FuelCode %in% unleaded) %>% 
  group_by(Day,FuelCode) %>% 
  summarise(med_price = median(Price_std)) %>% 
  ggplot(aes(FuelCode,Day))+
  geom_tile(aes(fill = med_price))+
  scale_fill_viridis(direction = -1, option = "A")+
  labs(fill = "Median Price", x = "Fuel Code", title = "Medain price for each day of the week for Unleaded Fuel types")

# data1 %>% 
#   filter(FuelCode %in% c("DL", "PDL")) %>% 
#   group_by(FuelCode, Day) %>% 
#   summarise(Med_price = median(Price_std)) %>% 
#   ggplot(aes(FuelCode, Day))+
#   geom_tile(aes(fill = Med_price))+
#   scale_fill_viridis(direction = -1, option = "A")+
#   labs(fill = "Median Price", title = "Medain price for each day of the week for Diesel Fuel types", x = "Fuel Code")