1 Introduction

1.1 About

Goal: Learn R language just like any foreign languages

About the data: FuelCheck provides real-time information about fuel prices at service stations across NSW. The dataset contains fuel price and other attributes for October 2017 only.

FuelCheck enables NSW motorists to:

  1. Find the cheapest fuel being sold anywhere in NSW

  2. Get directions to any service station in NSW

  3. Search for fuel by type (E10, Regular, LPG, etc) or brand

  4. Submit a complaint to NSW Fair Trading if the price at the pump doesn’t match what is shown on FuelCheck.

The data we will be working on is for October 2017 NSW only.

1.2 Library

Load Packages

library(tidyverse)
library(lubridate)
library(DT)
library(broom)
library(knitr)
library(viridis) #scale color
library(forcats) #factors
library(glue)

library(ggmap)
library(leaflet)
library(rgdal)
library(tmap)
library(broom)
library(leaflet.extras)
library(emo)

1.3 Data

Load Data

data1 <- read_csv("D:/r datawarehouse/petrol/October-2017.csv")

1.4 Missing value

Check for any missing values

colSums(is.na(data1))
## ServiceStationName            Address             Suburb 
##                  0                  0                  0 
##           Postcode              Brand           FuelCode 
##                  0                  0                  0 
##   PriceUpdatedDate              Price 
##                  0                  0

There are no missing values in the dataset.

2 Data Wrangling

2.1 Data structure

There are 60045 rows and 8 columns in the dataset.

datatable(head(data1, 2), options = list(pagelength = 5, scrollx = T), caption = "Displaying only first 2 rows of the dataset")

2.2 Data types

Below is the summary of each variable and its class assigned to it. We can clearly see that PriceUpdatedDate is in character class. We will assign correct class to each variables.

summary(data1)
##  ServiceStationName   Address             Suburb             Postcode   
##  Length:60045       Length:60045       Length:60045       Min.   :1579  
##  Class :character   Class :character   Class :character   1st Qu.:2142  
##  Mode  :character   Mode  :character   Mode  :character   Median :2204  
##                                                           Mean   :2319  
##                                                           3rd Qu.:2557  
##                                                           Max.   :4383  
##     Brand             FuelCode         PriceUpdatedDate       Price      
##  Length:60045       Length:60045       Length:60045       Min.   : 53.9  
##  Class :character   Class :character   Class :character   1st Qu.:127.8  
##  Mode  :character   Mode  :character   Mode  :character   Median :136.9  
##                                                           Mean   :136.4  
##                                                           3rd Qu.:145.9  
##                                                           Max.   :177.7

2.2.1 Counting unique value in each variable

counts <- as.data.frame(matrix(NA, nrow = 1, ncol = 8))
names(counts) <- names(data1)


for(i in seq_along(data1)){
  counts[[i]] <- length(unique(data1[[i]]))
}

kable(counts, caption = "Count of unique values")
Count of unique values
ServiceStationName Address Suburb Postcode Brand FuelCode PriceUpdatedDate Price
1794 1926 1058 473 20 9 11965 613

This should help us in assigning correct class to each variable. Brand,FuelCode and PostCode should be changed to factors. PriceUpdatedDate as Date and Time.

2.2.2 Assign new class to variables

We’ll create a new variable Day_time which will convert the PriceUpdatedDate variable to date and time. This will also preserve the original data.

data1 <- data1 %>% 
  mutate_at(vars(Brand,FuelCode,Postcode), funs(as.factor(.)))

data1$Day_time <- dmy_hm(data1$PriceUpdatedDate)

2.2.3 Create varaible with just date

data1$Date <- lubridate::date(data1$Day_time)

2.2.4 Create variable just for day

data1$Day <- wday(data1$Day_time, label = T, abbr = T)

2.2.5 Create a new variable with standard price format

data1 <- data1 %>% 
  mutate(Price_std = Price/100)

3 Data exploration

3.1 Distribution of price

3.1.1 Density Plot

Density plot shows distribution of data over a continious interval. Since they are not affected by the number of bins as in Histogram, it is better at determining the distribution shape.

data1 %>% 
  ggplot(aes(Price_std))+
  geom_density(fill = "grey", size = 1)+
  geom_vline(aes(xintercept = median(data1$Price_std, na.rm = T)), linetype = "longdash" , size = .80)+
  scale_x_continuous(name = "Price", breaks = seq(.50,1.80, by = .10), labels = scales::dollar)+
  labs(title = "Density plot of all Fuel Price types")

Just by gleening on the density plot we can see the distribution is skewed to the left which tells us that the mean price falls on the left side of the medain price. It is also multimodal distribition which means it has multiple peaks. This is due to different types of fuel have various prices. Above density plot shows when considering all fuel types most of the fuel prices are concentrated around $1.35.

Below are the types of fuel found in the dataset:

Fuel Code Description
E10 10 per cent ethanol blend
P98 Premium unleaded 98
U91 Standard unleaded petrol
PDL Premium Diesel
P95 Premium unleaded 95
DL Diesel
E85 Ethanol fuel blend of 85% denatured ethanol fuel and 15% gasoline or other hydrocarbon
LPG Liquefied petroleum gas
CNG Compressed natural gas

Descriptions for fuel code were generated from google search. However, I couldn’t find PDL information. My guess is that it stands for Premium Diesel. So I plotted density plot of PDL and DL to check for similarity. The distribution has medium similarity. I therefore think PDL is Premium Diesel.

data1 %>% 
  filter(FuelCode %in% c("PDL", "DL")) %>% 
  ggplot(aes(Price_std))+
  geom_density(aes(fill = FuelCode), alpha = .5, size = 1)+
  scale_x_continuous(name = "Price", labels = scales::dollar)+
  labs(title = "Desnsity plot: DL & PDL")

Also, fuel type Compressed natural gas (CNG) is the odd one. It only has one row of data. Therefore, I’ll exclude from any plot generated below.

datatable(data1 %>% 
  filter(FuelCode == "CNG"), options = list(pagelength = 5, scrollx = T, scrollCollapse = T, paging = F))
data1 %>% 
  filter(!(FuelCode %in% c("CNG", "DL", "PDL", "LPG"))) %>% 
  ggplot(aes(Price_std))+
  geom_density(aes(fill = FuelCode),size = 1, alpha = 0.5 )+ #position = "stack")+
  scale_x_continuous(name = "Price", labels = scales::dollar)+
  labs(title = "Density plot of Ethanol fuel type")

Above plot claims that Premium unleaded fuel are more expensive then the rest. They also have multimodal distribution. We can also observe that there a pattern - P98,P95 are closely grouped and E10, E85 are closely grouped.

data1 %>% 
  filter(FuelCode %in% c("LPG")) %>% 
  ggplot(aes(Price_std))+
  geom_density(fill = "grey", size = 1)+
  scale_x_continuous(name = "Price", labels = scales::dollar)+
  labs(title = "Density plot of LPG")

It looks like LPG price are the most volatile out of all the fuel types. This also explains why we have small peak at the first density plot.

3.1.2 Boxplot: Fuel type vs price

data1 %>% 
  ggplot(aes(fct_reorder(FuelCode, Price_std, fun = median, na.rm =T), Price_std))+
  geom_boxplot(aes(fill = FuelCode))+
  scale_y_continuous(name = "Price", breaks = seq(0.5, 2.5, by = 0.1), labels = scales::dollar)+
  labs(x = "Fuel Code", title = "Boxplot of fuel type vs price")+
  coord_flip()+
  theme(legend.position = "none")

I’d like to plot all boxplot in one graph so that I can use the same scale to obtain information. We can see that premium unleaded median price are the highest among all other fuel types. However they are also skewed to the left meaning there were few days the price got low but most of the time it stayed around its median price. LPG is the second cheapest fuel however, they have lot of outliers.

The fuel I normally use is E10. Atleast 75% of all the price for Oct were above $1.20

Go: Top menu

3.2 Date and price analysis

# data1 %>% 
#   filter(!(FuelCode %in% c("CNG", "DL", "PDL", "LPG"))) %>% 
#   ggplot(aes(Date, Price_std, group = Date))+
#   geom_boxplot(aes(fill = FuelCode))+
#   facet_wrap(~FuelCode)+
#   theme(legend.position = "none")+
#   scale_x_date(date_minor_breaks = "1 day", date_labels = "%d")+
#   scale_y_continuous(name = "Price", labels = scales::dollar)+
#   labs(title = "Boxplot of Unleaded Fuel type")+
#   stat_summary(fun.y=median, geom="line", aes(group=1), size = 0.8, col = "blue")
#   #geom_smooth(aes(group = 1))

data1 %>% 
  filter(!(FuelCode %in% c("CNG", "DL", "PDL", "LPG"))) %>% 
  ggplot(aes(Date, Price_std, group = Date))+
  geom_boxplot(aes(fill = FuelCode))+
  facet_wrap(~FuelCode)+
  theme(legend.position = "none")+
  scale_x_date(date_minor_breaks = "1 day", date_labels = "%d")+
  scale_y_continuous(name = "Price", labels = scales::dollar)+
  labs(title = "Boxplot of Unleaded Fuel type")+
  #stat_summary(fun.y=median, geom="line", aes(group=1), size = 0.8, col = "blue")
  geom_smooth(aes(group = 1), col = "red")

All unleaded plot is consistent with the pattern that the medain price hit the rock bottom around 16th October and started upward movement.

# data1 %>% 
#   filter(FuelCode %in% c("PDL", "DL")) %>% 
#   ggplot(aes(Date, Price_std, group = Date))+
#   geom_boxplot(aes(fill = FuelCode))+
#   facet_wrap(~FuelCode)+
#   stat_summary(fun.y=median, geom="line", aes(group=1), size = 0.8, col = "blue" )+
#   scale_y_continuous(name = "Price", labels = scales::dollar)+
#   labs(title = "Boxplot of Diesel")+
#   theme(legend.position = "none")+
#   scale_x_date(date_minor_breaks = "1 day", date_labels = "%d")

data1 %>% 
  filter(FuelCode %in% c("PDL", "DL")) %>% 
  ggplot(aes(Date, Price_std, group = Date))+
  geom_boxplot(aes(fill = FuelCode))+
  facet_wrap(~FuelCode)+
  #stat_summary(fun.y=median, geom="line", aes(group=1), size = 0.8, col = "blue" )+
  scale_y_continuous(name = "Price", labels = scales::dollar)+
  labs(title = "Boxplot of Diesel")+
  theme(legend.position = "none")+
  scale_x_date(date_minor_breaks = "1 day", date_labels = "%d")+
  geom_smooth(aes(group = 1), col = "red")

data1 %>% 
  filter(FuelCode == "LPG") %>% 
  ggplot(aes(Date, Price_std, group = Date))+
  geom_boxplot(aes(fill = FuelCode))+
  #stat_summary(fun.y = median, geom = "line", aes(group = 1), col = "blue", size = 0.8)+
  scale_x_date(date_minor_breaks = "1 day", date_labels = "%d")+
  scale_y_continuous(name = "Price", labels = scales::dollar)+
  theme(legend.position = "none")+
  labs(title = "Boxplot of LPG")+
  geom_smooth(aes(group = 1), col = "red")

3.2.1 Day of Week analysis

unleaded <- c("E10", "E85", "P95", "P98", "U91")

data1 %>% 
  filter(FuelCode %in% unleaded) %>% 
  group_by(Day,FuelCode) %>% 
  summarise(med_price = median(Price_std)) %>% 
  ggplot(aes(FuelCode,Day))+
  geom_tile(aes(fill = med_price))+
  scale_fill_viridis(direction = -1, option = "A")+
  labs(fill = "Median Price", x = "Fuel Code", title = "Medain price for each day of the week for Unleaded Fuel types")

# data1 %>% 
#   filter(FuelCode %in% c("DL", "PDL")) %>% 
#   group_by(FuelCode, Day) %>% 
#   summarise(Med_price = median(Price_std)) %>% 
#   ggplot(aes(FuelCode, Day))+
#   geom_tile(aes(fill = Med_price))+
#   scale_fill_viridis(direction = -1, option = "A")+
#   labs(fill = "Median Price", title = "Medain price for each day of the week for Diesel Fuel types", x = "Fuel Code")

Go: Top menu

4 Geospatial of fuel stations

We will extract geospatial location of each petrol station from google. We will then plot those location on a map provided by leaflet.

data1$id <- row.names(data1)

Fuel_stations <- data1 %>% 
  group_by(Address) %>% 
  summarise(n = n()) %>% 
  arrange(-n) %>% 
  mutate(id = row.names(.))

# geocodings <- geocode(Fuel_stations$Address)

# write_csv(geocodings, "D:/r datawarehouse/petrol/geocodings.csv")

# geocode_na <- is.na(geocodings$lon)

geocodings <- read_csv("D:/r datawarehouse/petrol/geocodings.csv")

geocodings$id <- row.names(geocodings)

Fuel_stations <-  Fuel_stations %>%
  mutate_at(vars(Address), funs(str_replace_all(., "\\s\\(.*\\)", ""))) %>% 
  mutate_at(vars(Address), funs(str_replace_all(., "Cnr Warnoon & Palm Avenue, Leeton NSW 2705", "Warnoon & Palm Avenue, Leeton NSW 2705"))) %>% 
  mutate_at(vars(Address), funs(str_replace_all(., "Cnr Zouch and Edward Streets, Young NSW 2594", "Zouch and Edward Streets, Young NSW 2594"))) %>% 
  mutate_at(vars(Address), funs(str_replace_all(., "Cnr Fitzroy and Cobra Streets, South Dubbo NSW 2830", "Fitzroy and Cobra Streets, South Dubbo NSW 2830")))


# na_fuel_station <- Fuel_stations$Address[is.na(geocodings$lon)]

# na_geocoding <- geocode(na_fuel_station)

# write_csv(na_geocoding, "D:/r datawarehouse/petrol/na_geocoding.csv")

na_geocoding <- read_csv("D:/r datawarehouse/petrol/na_geocoding.csv")

# names(na_geocoding) <- c("lon.x", "lat.x","id")

geocodings$id <- as.numeric(geocodings$id)

# geocoding_na_row <- geocodings %>% 
#   filter(is.na(lon))

# na_geocoding$id <- geocoding_na_row$id

geocodings <-  geocodings %>% 
  left_join(na_geocoding, by=c("id" = "id"))

geocodings$lon.x[is.na(geocodings$lon.x)] <- geocodings$lon.y[is.na(geocodings$lon.x)]
geocodings$lat.x[is.na(geocodings$lat.x)] <- geocodings$lat.y[is.na(geocodings$lat.x)]



geocodings <-  geocodings %>% 
  select(1:3)

Fuel_stations$id <- as.numeric(Fuel_stations$id)

Fuel_stations<- Fuel_stations %>% 
  left_join(geocodings, by=c("id"="id"))

Fuel_stations <- Fuel_stations %>% 
  select(-c(2:3))

data1 <- data1 %>%
  mutate_at(vars(Address), funs(str_replace_all(., "\\s\\(.*\\)", ""))) %>% 
  mutate_at(vars(Address), funs(str_replace_all(., "Cnr Warnoon & Palm Avenue, Leeton NSW 2705", "Warnoon & Palm Avenue, Leeton NSW 2705"))) %>% 
  mutate_at(vars(Address), funs(str_replace_all(., "Cnr Zouch and Edward Streets, Young NSW 2594", "Zouch and Edward Streets, Young NSW 2594"))) %>% 
  mutate_at(vars(Address), funs(str_replace_all(., "Cnr Fitzroy and Cobra Streets, South Dubbo NSW 2830", "Fitzroy and Cobra Streets, South Dubbo NSW 2830")))


data1 <- data1 %>% 
  left_join(Fuel_stations, by = c("Address", "Address"))




# Fuel_stations <-  Fuel_stations %>% 
#   mutate_at(vars(Address), funs(str_replace_all(., "\\s\\(.*\\)", ""))) %>% 
#   mutate_at(vars(Address), funs(str_replace_all(., "^Cnr |^Cnr. ", "")))

# Fuel_stations$Address[str_detect(Fuel_stations$Address, ("^Cnr |^Cnr. "))]


# x <- as.data.frame(Fuel_stations$Address[geocode_na])
# 
# names(x) <- "var"
# 
# x_2<- x %>% 
#   mutate_at(vars(var), funs(str_replace_all(., "\\s*\\(.*\\)", "")))



# geocodings$lon[Fuel_stations$Address %>% 
#   str_detect("574 Great Western Hwy")]
# 
# 
# geocode("574 Great Western Hwy, Werrington NSW 2747")


# sum(is.na(geocodings))
# 
# x1 <- data1[1:4,]
# # x<- geocode(x1$Address)
# 
# x1$id <- row.names(x1)
# x$id <- row.names(x)
# 
# x1 %>% 
#   left_join(x)
data1 %>% 
  ggplot(aes(lon.x, lat.x))+
  geom_point()

Ok! It didn’t go out smoothly. Straight away we can see that top left cluster are incorrect GPS coordinates. Australia should be at the bottom right.

After tweaking here and there I still couldn’t fix the GPS cordinates. I will not spend further time for this since it is not important.

data1$edited_address <- glue("{data1$Address}, Australia")

wrong_gps <- data1 %>% 
  filter(lon.x < 100) %>% 
  group_by(edited_address) %>% 
  count()

# getcode_wrong_gps <- geocode(wrong_gps$edited_address)

# write_csv(getcode_wrong_gps, "D:/r datawarehouse/petrol/getcode_wrong_gps.csv")

getcode_wrong_gps <- read_csv("D:/r datawarehouse/petrol/getcode_wrong_gps.csv")

wrong_gps$lon.z <- getcode_wrong_gps$lon
wrong_gps$lat.z <- getcode_wrong_gps$lat

wrong_gps <- wrong_gps %>% 
  select(1) %>% 
  mutate(identifier = "A")

data1 <- data1 %>% 
  left_join(wrong_gps, by = c("edited_address", "edited_address"))

data1$lon.x[!is.na(data1$identifier)] <- NA
data1$lat.x[!is.na(data1$identifier)] <- NA

# data1 %>% 
#   filter(identifier == "A")

# wrong_gps %>% 
#   left_join(data1, by = c("edited_address", "edited_address"))
# 
# 
# leaflet() %>% 
#   addTiles() %>% 
#   addMarkers(data = wrong_gps,
#              lng = ~lon.z,
#              lat = ~lat.z,
#              clusterOptions = markerClusterOptions())




# Ok, even after adding Australia at the end of the Service station address google is still giving me GPS coordinates of Non AU. I am not #going to spend any further time correcting it.

All NSW Fuel service station location:

# NSW_geo <- geocode("New South Wales, Australia")

# rm(NSW_geo)

NSW_gps <- data.frame(lon = 146.9211, lat =  -31.25322)

# map_nsw <- get_map(NSW_gps)

# map_data <- readOGR(dsn = "Aus_map/AUS_adm1.shp")

service_stat <- data1 %>% 
  group_by(Address, lon.x, lat.x, ServiceStationName) %>% 
  summarise(n = n())


leaflet() %>% 
  addTiles() %>% 
  addMarkers(data = service_stat,lat= ~lat.x,lng = ~lon.x,clusterOptions = markerClusterOptions(),popup =as.character(service_stat$ServiceStationName)) %>% 
    addProviderTiles(providers$OpenStreetMap) %>% 
  addScaleBar()
# ==================== Google map 

# g <- get_map(NSW_gps, zoom = 6, source = "google")
# 
# ggmap(g)+
#   geom_point(data = data1, aes(lon.x, lat.x))+
#   facet_wrap(~FuelCode)

Go: Top Menu