1 Executive Summary

The purpose of this report is to analyse the toll road dataset published by Roads and Maritime Services (RMS). The dataset records the number of vehicles travelling through the cross-city tunnel in 15-minute intervals during March 2019 and March 2020. Two hypothesis tests, using methods taught in MATH1005, were carried out on this dataset. The first, a 2-sample t-test, found a significant difference between the number of cars and the number of trucks using the tunnel, with far more cars than trucks. The second, a z-test for a proportion, found no evidence against the claim that 80% or more of the vehicles in the tunnel are cars.


2 Full Report

2.1 Initial Data Analysis (IDA)

The dataset being analysed comes from NSW Toll Road Data (RMS). It records the number of vehicles passing through the cross-city tunnel in each 15-minute interval during March 2019 and March 2020. According to the toll data website, the dataset was released so that future toll roads attract competitive bids for tenure, and so that the public can explore the data and learn more about commuting in Sydney. Every vehicle using the cross-city tunnel is treated as an e-toll customer, so the data is generated automatically from the e-tags and number plates of vehicles passing through the tunnel.

The dataset can be considered valid because the statistics come from NSW Toll Road Data (RMS). Because the data is affiliated with Roads and Maritime Services and is used to plan future changes to the roads, it is reasonable to assume that the government verifies it, which increases its reliability. One limitation is that motorcycles are not included: although they make up a small share of the traffic, a fair number of motorcycles do use the cross-city tunnel.

Columns 4 and 5 (IntervalStart and IntervalEnd) give the 15-minute intervals from 12:00am to 11:59pm, and column 3 (Date) gives the date on which each interval falls. Column 7 (VehicleClass) holds the vehicle class, which is either car or truck. Column 14 (TotalVolume) holds the total volume; this is the key variable in this research, as it is used in the t-test and z-test calculations.

## Requires the 'dplyr' package; read.csv() is base R, so 'readxl' is not needed
traffic.csv <- read.csv("C:/Users/Home/Desktop/MATH1005/data/traffic.csv")

as.data.frame(sapply(traffic.csv,class))
##                    sapply(traffic.csv, class)
## AssetID                             character
## FinancialQtrID                        integer
## Date                                character
## IntervalStart                       character
## IntervalEnd                         character
## Version                               integer
## VehicleClass                        character
## TollPointID                         character
## GantryDirection                     character
## GantryLocation                      character
## GantryGPSLatitude                     numeric
## GantryGPSLongitude                    numeric
## GantryType                          character
## TotalVolume                           integer
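
Beyond checking the column classes, a few quick summaries can confirm that the key variable is complete and that both vehicle classes are present. The following is a minimal sketch on a made-up four-row data frame; only the column names are taken from the dataset, while the values and the date format are assumptions for illustration.

```r
# Hypothetical miniature of the toll data; only the column names come from the report.
toy <- data.frame(
  Date = c("2019-03-01", "2019-03-01", "2020-03-01", "2020-03-01"),
  IntervalStart = c("00:00", "00:00", "00:00", "00:00"),
  VehicleClass = c("Car", "Truck", "Car", "Truck"),
  TotalVolume = c(120L, 3L, 95L, 2L),
  stringsAsFactors = FALSE
)

# Number of rows (15-minute intervals) recorded for each vehicle class
class_counts <- table(toy$VehicleClass)

# Total vehicles per class, summed over all intervals
class_totals <- tapply(toy$TotalVolume, toy$VehicleClass, sum)

# Count of missing values in the key variable
n_missing <- sum(is.na(toy$TotalVolume))

print(class_counts)
print(class_totals)
print(n_missing)
```

Replacing toy with traffic.csv would apply the same checks to the full dataset.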


2.2 Research Question 1: Is there any difference between the number of trucks and cars going through the cross-city tunnel in March 2019 and 2020?

We will test the hypothesis that there is no difference between the mean number of cars and the mean number of trucks passing through the cross-city tunnel per 15-minute interval, using the given data. We will use a two-sided 2-sample t-test, assuming that the two groups have the same variance.

H0: there is no difference: μ1 = μ2, or μ1 - μ2 = 0

H1: there is a difference: μ1 ≠ μ2, or μ1 - μ2 ≠ 0

Here μ1 is the mean total volume for cars and μ2 is the mean total volume for trucks. Firstly, let’s separate the cars and trucks from the total volume.

traffic.csv <- read.csv("C:/Users/Home/Desktop/MATH1005/data/traffic.csv")

## Separate the data by vehicle class (car and truck)
x <- subset(traffic.csv, VehicleClass == "Car")
View(x)  ## optional: inspect the car subset interactively
y <- subset(traffic.csv, VehicleClass == "Truck")

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
a <- select(x, TotalVolume)

b <- select(y, TotalVolume) 

Car <- array(c(unlist(a)))
Truck <- array(c(unlist(b)))

boxplot(Car,Truck, names = c("Car", "Truck")) 

### Creating the boxplot for visual representation of car and truck going through the tunnel.

The code above selects the TotalVolume column for cars and for trucks and stores each as a numeric array (Car and Truck) rather than a data frame. With the data in this form, we can now carry out the 2-sample t-test calculations.
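
For reference, the pooled-variance calculation can be written out as below. The symbols are chosen here for illustration: s_1^2 and s_2^2 are the sample variances of Car and Truck (var1 and var2 in the code), s_p^2 is the pooled variance (sdp2), and the hypothesised difference is 0.

```latex
s_p^2 = \frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2},
\qquad
SE = \sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)},
\qquad
t = \frac{\bar{x}_1 - \bar{x}_2 - 0}{SE}
```

Under H0 and the equal-variance assumption, t follows a t-distribution with n_1 + n_2 - 2 degrees of freedom, and the two-sided p-value is 2 P(T ≥ |t|).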

# Sample size, mean and variance of each group
n1 = length(Car)
m1 = mean(Car)
var1 = sd(Car)^2
n2 = length(Truck)
m2 = mean(Truck)
var2 = sd(Truck)^2

# Pooled variance, standard error and test statistic
sdp2 = ((n1 - 1) * var1 + (n2 - 1) * var2)/(n1 + n2 - 2)
se = sqrt(sdp2 * (1/n1 + 1/n2))
TestStat = (m1 - m2 - 0)/se
TestStat
## [1] 155.9693
# Two-sided p-value, matching H1: μ1 ≠ μ2
pvalue = 2 * pt(abs(TestStat), n1 + n2 - 2, lower.tail = FALSE)
pvalue
## [1] 0

The p-value is effectively 0, far below any conventional significance level (such as α = 0.05), so we reject the null hypothesis and conclude that there is a difference between the mean number of cars and the mean number of trucks per interval in the cross-city tunnel. Now we use the t.test() function to check the answers.

t.test(Car,Truck,var.equal = T,alternative = "less")
## 
##  Two Sample t-test
## 
## data:  Car and Truck
## t = 155.97, df = 35710, p-value = 1
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##     -Inf 127.389
## sample estimates:
##  mean of x  mean of y 
## 129.159330   3.099798
t.test(Car,Truck,alternative = "less")
## 
##  Welch Two Sample t-test
## 
## data:  Car and Truck
## t = 155.97, df = 17915, p-value = 1
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##     -Inf 127.389
## sample estimates:
##  mean of x  mean of y 
## 129.159330   3.099798

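Notice that the 2-sample t-test and the Welch 2-sample t-test give very similar results, and both reproduce the manually calculated test statistic of 155.97. The calls above specify alternative = "less", so the reported p-value of 1 belongs to the one-sided alternative that the mean for cars is lower than the mean for trucks. For the two-sided alternative H1: μ1 ≠ μ2, the same test statistic gives a p-value of essentially 0, in agreement with the manual calculation. In practice, it would be safer to use the Welch 2-sample t-test, as it does not assume that the two samples have the same spread. The test shows that the mean numbers of cars and trucks in the cross-city tunnel are not equal, and both the sample means (about 129 cars against 3 trucks per interval) and the box-plot show that there are significantly more cars than trucks in the cross-city tunnel.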
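
The agreement between the manual calculation and t.test() can be checked with a short sketch. The data here are simulated Poisson counts standing in for Car and Truck (the group size of 200 and the rates are assumptions for illustration, not values from the dataset), and t.test() is called with its default two-sided alternative.

```r
set.seed(1)
# Simulated counts standing in for the report's Car and Truck vectors (hypothetical values)
car_sim   <- rpois(200, lambda = 129)
truck_sim <- rpois(200, lambda = 3)

# Manual pooled two-sample t statistic, following the same steps as the report
n1 <- length(car_sim)
n2 <- length(truck_sim)
sp2 <- ((n1 - 1) * var(car_sim) + (n2 - 1) * var(truck_sim)) / (n1 + n2 - 2)
se  <- sqrt(sp2 * (1 / n1 + 1 / n2))
t_manual <- (mean(car_sim) - mean(truck_sim)) / se
p_manual <- 2 * pt(abs(t_manual), df = n1 + n2 - 2, lower.tail = FALSE)

# Built-in tests; when 'alternative' is omitted it defaults to "two.sided"
pooled <- t.test(car_sim, truck_sim, var.equal = TRUE)
welch  <- t.test(car_sim, truck_sim)

print(c(manual = t_manual,
        pooled = unname(pooled$statistic),
        welch  = unname(welch$statistic)))
print(c(manual = p_manual, pooled = pooled$p.value, welch = welch$p.value))
```

When the two groups have the same size, the pooled and Welch statistics are identical and only the degrees of freedom differ, which is consistent with both tests above reporting t = 155.97.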


2.3 Research Question 2: Are 80% or more of the vehicles in the cross-city tunnel in March 2019 and 2020 cars?

Since there are far more cars than trucks in the cross-city tunnel, let’s test whether 80% or more of the vehicles in the tunnel are cars. Hypothesis: 80% or more of the vehicles in the cross-city tunnel in March 2019 and 2020 are cars. We will use α = 0.05 as our significance level. If p is the proportion of vehicles passing through the cross-city tunnel that are cars, we will test H0: p ≥ 0.8 vs H1: p < 0.8. Firstly, let’s separate the cars and trucks into their own categories and find their combined total.

traffic.csv <- read.csv("C:/Users/Home/Desktop/MATH1005/data/traffic.csv")

## Separate the data by vehicle class (car and truck)
x <- subset(traffic.csv, VehicleClass == "Car")
View(x)  ## optional: inspect the car subset interactively
y <- subset(traffic.csv, VehicleClass == "Truck")

library(dplyr)

## Select the TotalVolume column for each class and convert it
## from a data frame to a numeric array
a <- select(x, TotalVolume)
b <- select(y, TotalVolume)
Car <- array(c(unlist(a)))
Truck <- array(c(unlist(b)))

sum(x$TotalVolume,y$TotalVolume)
## [1] 2361619

Each passing vehicle is assumed to be independent of the others, so one vehicle does not affect whether another is a car or a truck. The sample size is 2361619 vehicles, which is large enough for the central limit theorem to apply, so the sample sum and the sample proportion are approximately normally distributed. We will find the p-value using a z-test and compare it to the significance level to decide whether to reject the null hypothesis.
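
Written out, the z-test rests on a 0/1 box model in which each vehicle is a car (1) with probability p_0 = 0.8 under H0. The notation below is chosen here for illustration; OV and EV denote the observed and expected values, matching the variable names used in the code.

```latex
\sigma = \sqrt{p_0 (1 - p_0)} = \sqrt{0.8 \times 0.2} = 0.4,
\qquad
EV_{\text{sum}} = n\,p_0,
\qquad
SE_{\text{sum}} = \sigma \sqrt{n}

z = \frac{OV_{\text{sum}} - EV_{\text{sum}}}{SE_{\text{sum}}}
  = \frac{\hat{p} - p_0}{\sigma / \sqrt{n}}
```

The sum and mean versions of the statistic are algebraically identical, and for H1: p < 0.8 the p-value is the lower-tail probability P(Z ≤ z).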

Now we can do the hypothesis testing:

# Box model under H0: 80% of tickets are 1 (car) and 20% are 0 (truck)
mu = 0.8
sig = sqrt(((1-0.8)^2 * 80 + (0-0.8)^2 * 20)/100)
c(mu,sig)
## [1] 0.8 0.4
# Expected value and standard error of the sample sum
n = 2361619
EV_sum = mu * n
SE_sum = sig * sqrt(n)
c(EV_sum, SE_sum)
## [1] 1889295.2000     614.7024
# Expected value and standard error of the sample mean (proportion)
EV_mean = mu
SE_mean = sig / sqrt(n)
c(EV_mean, SE_mean)
## [1] 0.8000000000 0.0002602886
# Test statistic based on the observed sum
OV_sum = 1889295
test.stat_sum = (OV_sum - EV_sum)/SE_sum
test.stat_sum
## [1] -0.0003253607
# Test statistic based on the observed mean
OV_mean = 1889295/2361619
test.stat_mean = (OV_mean - EV_mean)/SE_mean
test.stat_mean
## [1] -0.0003253607
# p-value for H0: p >= 0.8 vs H1: p < 0.8 (lower tail)
pnorm(test.stat_sum)
## [1] 0.4998702

Since the p-value (about 0.50) is greater than the significance level of α = 0.05, we fail to reject the null hypothesis. The dataset is therefore consistent with 80% or more of the vehicles travelling through the cross-city tunnel being cars.
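
The same one-sided z-test can be wrapped in a small function so that the observed count is computed from the data rather than typed in by hand. This is a minimal sketch: prop_z_test is a hypothetical helper written for illustration, and the counts passed to it are made up, not taken from the dataset.

```r
# One-sided z-test for a proportion: H0: p >= p0 versus H1: p < p0
prop_z_test <- function(successes, n, p0) {
  p_hat <- successes / n
  se <- sqrt(p0 * (1 - p0) / n)   # standard error of the sample proportion under H0
  z <- (p_hat - p0) / se
  # The lower-tail probability matches H1: p < p0
  list(p_hat = p_hat, z = z, p_value = pnorm(z))
}

# Hypothetical counts for illustration only (not the report's data)
cars  <- 850
total <- 1000
res <- prop_z_test(cars, total, p0 = 0.8)
print(res)
```

With the real data, cars would be sum(Car) and total would be sum(Car) + sum(Truck), so the observed proportion comes directly from the TotalVolume column.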


Additional Research

https://informatics.sydney.edu.au/news/sydneytolls/ provides a 3D and more detailed visual representation of the day-to-day patterns of vehicles going through the cross-city tunnel in 2019. It includes many graphs that give a better visual representation than the box plot in Research Question 1.


3 References

E-TAG. (2022, March 29). Wikipedia. https://en.wikipedia.org/wiki/E-TAG

NSW Toll Road Data. (n.d.). Global notes. nswtollroaddata.com. https://nswtollroaddata.com/data-download/

Butterworth, N. (2019, April 23). Exploring Sydney toll road data. Retrieved Nov 1, 2022, from https://informatics.sydney.edu.au/news/sydneytolls/

Ozroads: Cross City Tunnel. (2007, March 31). Retrieved Nov 1, 2022, from https://www.ozroads.com.au/NSW/Freeways/CCT/cct.htm