Previously, we extracted the taxi availability data using an API provided by the Singapore government, which allows us not only to track the geolocation of all taxis, but also the total number of available taxis at a given time. We have previously used Python to extract the data. The data extraction code can be found here.
Following the data extraction, we can visualize the data, and try to locate patterns in the data. For this purpose, we use R as it has many good libraries and plotting devices that allow us to conduct exploratory analysis in a fairly easy manner.
We begin by importing key libraries for data processing and visualization processes. Next we read and preprocess the data (create additional factors or variables as necessary), and finally, we visualize this data.
library(RSQLite)
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(chron)
library(lattice)
library(ggplot2)
First, we read the data on to our working environment, by connecting to the SQLite database.
# Set Working Directory
setwd("~/Desktop")
# Make a connection to SQLite Database and save Data Frame to working environment
con = dbConnect(SQLite(), dbname = "TaxiTimeAnalysis.sqlite")
df <- dbReadTable(con, "TaxiTimeAnalysis")
Before processing the data, we can look at how the data is structured.
head(df)
## id Count DATE_YYYYMMDD TIME_HHMMSS
## 1 1 7423 2016-01-12 15:00:00
## 2 2 7508 2016-01-12 16:00:00
## 3 3 6804 2016-01-12 17:00:00
## 4 4 6436 2016-01-12 18:00:00
## 5 5 7257 2016-01-12 19:00:00
## 6 6 8286 2016-01-12 20:00:00
str(df)
## 'data.frame': 9921 obs. of 4 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Count : int 7423 7508 6804 6436 7257 8286 8430 7677 7834 8241 ...
## $ DATE_YYYYMMDD: chr "2016-01-12" "2016-01-12" "2016-01-12" "2016-01-12" ...
## $ TIME_HHMMSS : chr "15:00:00" "16:00:00" "17:00:00" "18:00:00" ...
As we have a better idea of how the data looks like, we can begin processing the data, and create additional factors as necessary.
# Converting Date column to Date be date variables, and convert Time column to time variables
df$DATE_YYYYMMDD <- as.Date(df$DATE_YYYYMMDD, format = "%Y-%m-%d")
df$TIME_HHMMSS <- chron(times = df$TIME_HHMMSS)
# Create an additional factor comprising of YYYY-MM, which will be used to generate facet plots later
df$Month <- as.factor(substr(df$DATE_YYYYMMDD, 1, 7))
We begin by conducting some exploratory data analysis on the dataset.
# Plot Hourly Average
Hourly_Avg <- tapply(df$Count, df$TIME_HHMMSS, mean, na.rm = T)
plot(sort(unique(df$TIME_HHMMSS)), Hourly_Avg, type = "l",
xlab = "Hour of the Day", ylab = "Average Taxi Counts",
ylim = c(min(Hourly_Avg), max(Hourly_Avg)),
main = "Hourly Average Taxi Counts from Jan 2016 to Feb 2017")
# Plot Daily Average
Daily_Avg <- aggregate(df[, c('Count', 'DATE_YYYYMMDD')], by = list(df$DATE_YYYYMMDD), mean, na.rm = T)
dates_before_announcement <- Daily_Avg[Daily_Avg$DATE_YYYYMMDD < '2016-03-31', ]
dates_before_announcement$Mean <- mean(dates_before_announcement$Count)
dates_after_announcement <- Daily_Avg[Daily_Avg$DATE_YYYYMMDD > '2016-06-30', ]
dates_after_announcement$Mean <- mean(dates_after_announcement$Count)
diff <- dates_after_announcement$Mean[1] - dates_before_announcement$Mean[1]
g <- ggplot(data = Daily_Avg, aes(x = DATE_YYYYMMDD, y = Count)) +
geom_line() +
geom_line(data = dates_before_announcement,
aes(x = DATE_YYYYMMDD, y = Mean), colour = "blue", lwd = 2) +
geom_line(data = dates_after_announcement,
aes(x = DATE_YYYYMMDD, y = Mean), colour = "red", lwd = 2) +
xlab("Date") + ylab("Average Taxi Counts") +
ggtitle("Daily Average Taxi Counts from Jan 2016 to Feb 2017")
g
Looking at the data, it appears that average taxi counts have fallen over the period Apr - Jun 2016. For example, the mean of taxi counts prior to Apr 2016 was approximate 6410.3720113, while the mean of taxi counts after Jun 2016 was 5076.486454, amounting to a total difference of -1333.8855572.
This may have been due to official announcements which reduced the uncertainty surrounding legality issues, along with the Government’s framework/approach in dealing with private-hire car services:
Before proceeding to carry out a t-test to ascertain whether there are any significant differences between the 2 means (Before Apr 2016 and after Jun 2016), we take a quick look at the distribution of taxi counts over the 2 periods.
library(gridExtra)
# From Jan 2016 to Mar 2016
hist_before <- ggplot(data = dates_before_announcement, aes(Count)) +
geom_histogram(aes(y = (..count..) * 100 /sum(..count..))) +
xlab('Mean Taxi Counts') + ylab('Occurrence (in %)') + ggtitle('Before the Announcements') +
xlim(3000, 9000) + ylim(0, 25)
hist_after <- ggplot(data = dates_after_announcement, aes(Count)) +
geom_histogram(aes(y = (..count..) * 100 /sum(..count..))) +
xlab('Mean Taxi Counts') + ylab('Occurrence (in %)') + ggtitle('After the Announcements') +
xlim(3000, 9000) + ylim(0, 25)
grid.arrange(hist_before, hist_after, ncol = 2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 3 rows containing non-finite values (stat_bin).
From the histogram, it does appear that the distribution of taxi counts have shifted before and after the 3-month period. Previously, there were likely to be anywhere between 5000 - 7000 taxis at any given point in time (on average). However, the numbers appeared to have dwindled down to 4000 - 6000 after the announcements were made. However, are the differences statistically significant? We conduct a t-test to verify this claim.
conf_int <- t.test(dates_before_announcement$Count, dates_after_announcement$Count,
paired = F, var.equal = F)$conf
result <- if (conf_int[1] > 0 & conf_int[2] > 0){
c('reject the null hypothesis that the sample means are similar', 'statistically
different')
} else {
c('do not reject the null hypothesis that the sample means are similar', 'not
statistically different')
}
Under an independent t-test for two samples, we reject the null hypothesis that the sample means are similar and conclude that the sample means are statistically different, with the confidence interval being in the range of (1147.6726804, 1520.098434).
# Plot Monthly Average
Monthly_Avg <- tapply(df$Count, df$Month, mean, na.rm = T)
df$Month <- as.yearmon(df$Month, format = "%Y-%m")
plot(unique(df$Month), Monthly_Avg,
type = "l", lwd = 2,
xlab = "Year-Month", ylab = "Counts",
ylim = c(min(Monthly_Avg), max(Monthly_Avg)),
main = "Monthly Average Taxi Counts from Jan 2016 to Feb 2017")
Looking at the data, it appears that average taxi counts have fallen over the period; this might be due to the announcement…
Going forward, it might be interesting to see whether there have been any structural changes in the availability of taxis over the course of a day in a specific month.
Plotting Hourly Average by Month and Year
# Create a Date Time variable
df$DateTime <- as.POSIXlt(paste(as.Date(df$Month, format = "%b %Y"), df$TIME_HHMMSS), format = "%Y-%m-%d %H:%M:%S", tz = "Asia/Singapore")
# Create new columns to ensure
df <- df[order(df$DateTime),]
DateTime <- unique(sort(df$DateTime))
Hourly_Avg_By_DateTime <- tapply(df$Count, as.character(df$DateTime), mean, na.rm = T)
df_hourly_facet <- data.frame(DateTime, Hourly_Avg_By_DateTime)
df_hourly_facet$Date <- as.Date(substr(df_hourly_facet$DateTime, 1, 10), format = "%Y-%m-%d")
df_hourly_facet$Time <- times(substr(df_hourly_facet$DateTime, 12, 19))
facet_plt <- xyplot(Hourly_Avg_By_DateTime ~ Time | factor(Date),
as.table = T,
data = df_hourly_facet,
layout = c(6,3),
xlab = "Hour", ylab = "Hourly Average Taxi Counts")
facet_plt