The following data has been taken from Kaggle that was previously used for a competition. With the following data, I plan to:
The data being used here can be found at Kaggle’s page.
The frist step would be to load the data and look at the libraries I will be using.
List of libraries:
The data is divided into train and test by the providers. The competition was testing for machine learning but I will not be delving into that in this exercise.
train <- read_csv("train.csv")
## Parsed with column specification:
## cols(
## id = col_character(),
## vendor_id = col_double(),
## pickup_datetime = col_datetime(format = ""),
## dropoff_datetime = col_datetime(format = ""),
## passenger_count = col_double(),
## pickup_longitude = col_double(),
## pickup_latitude = col_double(),
## dropoff_longitude = col_double(),
## dropoff_latitude = col_double(),
## store_and_fwd_flag = col_character(),
## trip_duration = col_double()
## )
The data contains a total of 1458644 observations and spread over 11 coloumns.1
Since I will be working with tidyverse I would like to change the data from it’s original form to a tibble. I will then look at the the structure of the tibble i.e. the underlying data.
train <- as_tibble(train)
glimpse(train)
## Observations: 1,458,644
## Variables: 11
## $ id <chr> "id2875421", "id2377394", "id3858529", "id350467...
## $ vendor_id <dbl> 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 1, ...
## $ pickup_datetime <dttm> 2016-03-14 17:24:55, 2016-06-12 00:43:35, 2016-...
## $ dropoff_datetime <dttm> 2016-03-14 17:32:30, 2016-06-12 00:54:38, 2016-...
## $ passenger_count <dbl> 1, 1, 1, 1, 1, 6, 4, 1, 1, 1, 1, 4, 2, 1, 1, 1, ...
## $ pickup_longitude <dbl> -73.98215, -73.98042, -73.97903, -74.01004, -73....
## $ pickup_latitude <dbl> 40.76794, 40.73856, 40.76394, 40.71997, 40.79321...
## $ dropoff_longitude <dbl> -73.96463, -73.99948, -74.00533, -74.01227, -73....
## $ dropoff_latitude <dbl> 40.76560, 40.73115, 40.71009, 40.70672, 40.78252...
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N"...
## $ trip_duration <dbl> 455, 663, 2124, 429, 435, 443, 341, 1551, 255, 1...
Usually, str is the popular choice to get a look at the data. However, using glimpse, is I believe, a much more convinient option as it let’s me peek into the data i.e. what the observations looks like along with the type of the variable that they are.
A description of the variables:
Initial obeservations on the data:
Now to check for any missing values in the dataset:
sum(is.na(train))
## [1] 0
The feature of coercion enables us to find the number of missing values in this dataset. There are a total of 0 missing values.
I will not be making sure that the variables are in the format that they are supposed to be. This will include converting a few variables to factors using forcats and converting another set of variables to ymd_hms i.e. year, month, date, hour, minutes, second format using lubridate.
train <- train %>% mutate(vendor_id = factor(vendor_id)) %>% mutate(passenger_count = factor(passenger_count)) %>%
mutate(store_and_fwd_flag = factor(store_and_fwd_flag)) %>% mutate(pickup_datetime = ymd_hms(pickup_datetime)) %>%
mutate(dropoff_datetime = ymd_hms(dropoff_datetime))
The code replaces the existing coloumn with the new required format coloumns with the same names for the variables. However, the order of the variables changes.
There is a probablity of an existance of inconsistency in the trip_duration variable. Theoretically, the formula for this would be:
\[trip\_duration = dropoff\_datetime - pickup\_datetime\]
However, there is not information regarding how this particular variable was collected. It could’ve been recorded independantly by the application or it could’ve taken dropoff_datetime and pickup_datetime as inputs to calculate that coloumn. A sense check in this section would be beneficial.
result <- (nrow(train) == sum((-1 * (train$pickup_datetime - train$dropoff_datetime)) ==
train$trip_duration))
The evaluation for the above expression is TRUE. If the value is TRUE then this is an indication that the coloumns in question are consistent. If not, then there is a need to check the issue. Here, we do not have any issues.
I would like to use a sample data of 10,000 random observations so as to render graphs faster for intial visualization purposes. Howver, the final code will use the full data set.
sample <- sample_n(train, 10000)
Given that vendors for the dataset were of the type
unique(train$vendor_id)
## [1] 2 1
## Levels: 1 2
Thus, there are only two vendors operating in this area, Vendor 1 and Vendor 2.
###Passenger Count
From a general look at the data, we can see that there are multiple values for the passenger_count variable and the variable is formatted to a factor. Thus there should ideally be a limited number of values to work with.
levels(train$passenger_count)
## [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
According to this data the minimum number of passengers in total trips were 0 passengers. This could be due to the following factors:
Given that the maximum passenger count is 9, the vendors must be having very large cars to service the requests. This is very much line with a service such as Uber XL in India.
A simple barchart would give us a starting point for further exploration with regard to this variable.
I will be using ggplot for visualiations. I will also be using , results = ‘hide’ in code blocks so as to render only relevant graphs in the final output.
ggplot(sample) + geom_histogram(aes(trip_duration))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The graph in its current state does not reveal much information. There are too many values in too few bins. I will increase the number of bins so that more information is reveal from the data.
ggplot(sample) + geom_histogram(aes(trip_duration), bins = 100)
ggplot(sample) + geom_histogram(aes(trip_duration), bins = 150)
Even at 150 bins, the graphs don’t seem to be revealing much. It is also noticiable that the data is very skewed. This can be adjusted using a log transformation (scale_x_log10()). The skewness in the graph is also because of a few trips that fall on the extreme end of the spectrum.
ggplot(sample) + geom_histogram(aes(trip_duration), bins = 150) + scale_x_log10()
The graph is normalised now and regular statistics can be used on it. However, there are a few extreme values that can be better visualised by changing the scale of y-Axis using (scale_y_sqrt()). This transformation is commonly applied to counted data, especially if the values are mostly rather small.
ggplot(sample) + geom_histogram(aes(trip_duration), bins = 150) + scale_x_log10() +
scale_y_sqrt()
This graph is ideal for analysis. Now to apply the full data set:
ggplot(train) + geom_histogram(aes(trip_duration), bins = 150) + scale_x_log10() +
scale_y_sqrt() + labs(title = paste("Distribution of Trip Durations"), subtitle = paste("Data demonstrates a normal distribution after\nrelevant transformations"),
x = "Trip duration in seconds.\nLog sclae.", y = "Frequency of trips (n).\nSquare root scale.")
Most of the trips are of the duration just shy of 17 minutes. An accurate measurement of the length of a mean trip is 15.99 minutes.
There are a few trips that are on the extreme end of the graph. These can be pulled from the dataset by:
dataTable <- train %>% filter(rank(desc(trip_duration)) < 10) %>% select(trip_duration)
kable(dataTable, format = "pandoc", col.names = c("Trip Duration"), caption = "Lengthiest trips in the data set")
| Trip Duration |
|---|
| 86387 |
| 86390 |
| 86391 |
| 1939736 |
| 2049578 |
| 2227612 |
| 3526282 |
| 86392 |
| 86385 |
The top ten trips based on duration spead over 86,000 seconds i.e. 1433.33 minutes. The longest trip is for 5.87713710^{4} minutes i.e. just around 41 days. These trip could possibly be an indication of outstation trips or some form of data-entry error.
First to plot a barchart of how many trips have taken place across the timeframe of the data set:
ggplot(train) + geom_histogram(aes(pickup_datetime), color = "#1212b3", fill = "#1212b3",
bins = 100) + labs(title = paste("Frequency of Trips Across Time"), subtitle = paste("Fairly consistent ride frequency"),
x = "Pick-up date", y = "Number of trips completed") + theme_bw()
This shows that the data set has trips recorded from the beginning of January till the end of June i.e. the beginning of July. The trips seem to be fairly consistent in the way they are distributed albeit with a few outlier. The most significant one is between January and February.
This needs to be looked at more closely to asses the significance of the drop.
train %>% filter(month(pickup_datetime) < 4) %>% ggplot(aes(pickup_datetime)) + geom_histogram(bins = 100,
color = "#1212b3", fill = "#1212b3") + labs(title = paste("Frequency of Trips Between January and April"),
subtitle = paste(str_wrap("There is a alarming fall in the number of trip near the end of February",
width = 90)), x = "Pick-up date", y = "Number of trips completed") + theme_bw()
The drop in the number of trip at the end of February was very significant with the number of trips being completed close to zero. This possibly could not be due to holiday of any nature, as in the period being analysed there are multiple holidays and this anamoly is happening only here. This could be possibly due to weather. However, that data is not provided in the already loaded data set.
The exact figures and the data are as follows: 2 3
count_of_trips <- train %>% group_by(day(pickup_datetime), month(pickup_datetime)) %>%
dplyr::summarise(n = n())
# count_of_trips <- head(arrange(count_of_trips[order(count_of_trips$n),]))
count_of_trips <- head(arrange(count_of_trips, n))
kable(count_of_trips, format = "pandoc", col.names = c("Date", "Month", "Number of trips completed"),
caption = "Lowest Rides Summary of Number of Trips by Day and Month")
| Date | Month | Number of trips completed |
|---|---|---|
| 23 | 1 | 1648 |
| 24 | 1 | 3383 |
| 30 | 5 | 5570 |
| 25 | 1 | 6084 |
| 3 | 1 | 6353 |
| 29 | 5 | 6372 |
A search through Kaggle has rendered a dataset that would be suitable to enhance this analysis. This data set can be found at Mathijs Waegemakers’s page.
weather <- read_csv("weather_data_nyc_centralpark_2016(1).csv")
## Parsed with column specification:
## cols(
## date = col_character(),
## `maximum temperature` = col_double(),
## `minimum temperature` = col_double(),
## `average temperature` = col_double(),
## precipitation = col_character(),
## `snow fall` = col_character(),
## `snow depth` = col_character()
## )
glimpse(weather)
## Observations: 366
## Variables: 7
## $ date <chr> "1-1-2016", "2-1-2016", "3-1-2016", "4-1-2016...
## $ `maximum temperature` <dbl> 42, 40, 45, 36, 29, 41, 46, 46, 47, 59, 40, 4...
## $ `minimum temperature` <dbl> 34, 32, 35, 14, 11, 25, 31, 31, 40, 40, 26, 2...
## $ `average temperature` <dbl> 38.0, 36.0, 40.0, 25.0, 20.0, 33.0, 38.5, 38....
## $ precipitation <chr> "0.00", "0.00", "0.00", "0.00", "0.00", "0.00...
## $ `snow fall` <chr> "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0....
## $ `snow depth` <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", ...
There are a number of issues with the data. precipitation, snowfall, and snowdepth are
maximum temperatu~, minimum temperat~, and average temperat~ which represent the maximum, minimum, and the average temperate for a given day would not affect a ride outcome to that of an extent as previously observed. It would be alright to drop these coloumns so as to have a compact dataset.
date which as the name suggests represents the date of the observation but this variable is recorded as a
weather <- as_tibble(weather)
weather <- weather %>% mutate(date = dmy(date), precipitation = as.double(precipitation),
snowfall = as.double(`snow fall`), snowdepth = as.double(`snow depth`)) %>% select(-c(2,
3, 4), -`snow fall`, -`snow depth`)
Now a simple with regard to the amount of snowfall across time but before the end February could reveal something meaningful (note that this is the period of interest).
weather_subset <- weather %>% group_by(date) %>% filter(month(date) < 3)
ggplot(weather_subset) + geom_line(aes(date, snowfall)) + labs(title = paste("Distribution of Snowfall in NYC Area"),
subtitle = paste("A large spike prevalent in the middle of January"), x = "Date",
y = "Amount of snowfall", caption = "Breaks in the line are either NA values or\nwhere there was no snowfall") +
theme_bw()
As suspected, there was a large external environmental factor that affected the number of rides in the middle of January that caused the number of rides completed to fall to almost zero.
index_of_outlier <- as.numeric(which.max(weather_subset$snowdepth))
data_of_outlier <- weather_subset[as.numeric(index_of_outlier - 2):as.numeric(index_of_outlier +
2), ]
kable(data_of_outlier, format = "pandoc", col.names = c("Date", "Precipitation",
"Snowfall", "Snowdepth"), caption = "Days surrounding the point of interest")
| Date | Precipitation | Snowfall | Snowdepth |
|---|---|---|---|
| 2016-01-22 | 0.01 | 0.2 | 0 |
| 2016-01-23 | 2.31 | 27.3 | 6 |
| 2016-01-24 | NA | NA | 22 |
| 2016-01-25 | 0.00 | 0.0 | 19 |
| 2016-01-26 | 0.00 | 0.0 | 17 |
Even though the highest amount of snowdepth was on 2016-01-24 the lowest total rides (amounting to 1648 rides) took place on 2016-01-23 (a day before the highest snowdepth) which is the same day as when NYC received about 27.3 inches of snowfall - possibly a snowstorm. Although the number of rides went up to on 2016-01-23 (the day with 22 inches of snowfall) to 3383 this is still very low copared to the regular state. The rides would’ve increased due to the strom setlling down and road beaing cleared.
It can be conclusively said that the issue in the drastic fall in the rides during the period in question can be attributed to weather disturbances.
I would like to first look at the distribution of passengers. A bar chart would do the job since this is a count of categorical variable.4
DoP1 <- ggplot(train) + geom_bar(aes(passenger_count)) + scale_y_sqrt() + labs(title = paste("Passenger Distribution"),
subtitle = paste("Passenger count grouped by the number of passengers in each ride"),
x = "Passenger count categories", y = "Total number of rides") + theme_bw()
DoP1
data <- train %>% group_by(passenger_count) %>% dplyr::summarise(n = n())
The distribution shows very clearly that the most number of trips, i.e. 1033540 trips happened with 1 passenger(s) in the car. We know from the list of levels for passenger_count that there are rides with even 9 passengers in the car but this cannot be seen in the chart properly. In order to get a better view, the scale of the y-axis has been changed using scale_y_sqrt(). In the modified graph, there are samll bars for 0, 7, 8, and 9 passengers.
kable(data, format = "pandoc", col.names = c("Passenger Count Group", "Count"), caption = "Summary of Distribution of Passengers")
| Passenger Count Group | Count |
|---|---|
| 0 | 60 |
| 1 | 1033540 |
| 2 | 210318 |
| 3 | 59896 |
| 4 | 28404 |
| 5 | 78088 |
| 6 | 48333 |
| 7 | 3 |
| 8 | 1 |
| 9 | 1 |
The distribution seems sensible given that most of the trips happened with 1, 2, 3, or 4 passengers.
Given that there are two vendors in the data, I would also like to look at how these trips were serviced by them.
ggplot(train) + geom_bar(aes(vendor_id)) + labs(title = paste("Trips Grouped by Vendors"),
x = "Vendor identification", y = "Number of trips completed") + theme_bw()
Most of the trips were conducted by Vendor 2 although the margin doesn’t seem significant through a visual examination. As a proportion, Vendor 2 has 1.1503077 times more trips than Vendor 1. As long as this proportion is under 30-50%, it does not seem justifiable to state that one of the Vendors as conclusively superior market power.
The data can be further divided and plotted based on the vendor_id.5
#data <- train %>%
# group_by(vendor_id, passenger_count) %>%
# count()
data <- train %>%
group_by(vendor_id, passenger_count) %>%
dplyr::summarise(n = n())
ggplot(data) +
geom_bar(aes(x = data$passenger_count, y = data$n, fill = data$vendor_id),
stat = "identity",
position = "dodge") +
scale_y_sqrt() +
labs(
title = paste(
"Passenger Count Distribution"
),
subtitle = paste(
"Data grouped by vendor"
),
x = "Passenger count category",
y = "Number of trips"
) +
scale_fill_discrete(name = "Vendor ID") +
theme_bw()
One of the previous inferences for the data was that the rides are somewhat equally distributed between both the Vendors. However, when the passenger category is grouped by the vendor_id, we can see that this inference holds true only for situations where there are 1, 2, 3, 4 passengers. This is the usual order flow. The larger passenger size category of 5, and 6 passengers is served mostly by Vendor 2 but the rides with 7, 8, and 9 passengers are served exclusively by Vendor 2.
### Issues With The Application
The store_and_fwd_flag as informed before is indicative of if the data was sent back to the server right after it was collected or if there was any delay due to any reason whatsoever and was stored on the deivce and later sent to the server.
data <- train %>% group_by(vendor_id, store_and_fwd_flag) %>% select(vendor_id, store_and_fwd_flag) %>%
count()
kable(x = data, format = "pandoc", col.names = c("Vendor ID", "Error", "Count"),
caption = "Error Distribution Across Vendors")
| Vendor ID | Error | Count |
|---|---|---|
| 1 | N | 670297 |
| 1 | Y | 8045 |
| 2 | N | 780302 |
The table shows that Vendor 2 has not had any issues with regard to data being pushed to the server. On the other hand, Vendor 1 has had approximately issues in 1.1859799% of the total rides. As long as this doesn’t affect customer satisfaction or service provision, I do not think this is a cause for concern although in the long term the application could be bettered for Vendor 1.
Given that we have date and time details of each ride, we can look closer on how the rides were distributed based on different measures of time. I will be looking into:
day_hour <- train %>%
mutate(hour = hour(pickup_datetime)) %>%
mutate(wday = wday(pickup_datetime, label = TRUE)) %>%
group_by(hour, vendor_id) %>%
dplyr::summarise(n = n())
ggplot(day_hour) +
geom_line(aes(x = hour, y = n, color = vendor_id)) +
labs(
title = paste(
"Cumulative Distribution of Rides in a Day "
),
subtitle = paste(
"NYC seems to sleep early in the morning"
),
x = "Hour of the day",
y = "Total number of rides"
) +
scale_color_discrete(name = "Vendor ID") +
theme_bw()
The graph shows that the least number of trips were fulfilled in the early morning from 3 AM to 6 AM. The ride completion before 3 AM has a downward trend indicating a fall in the number of orders as the early morning approaches. However the trend starts rising by 6 AM/6:30 AM till late morning. This might be due to the rides that are booked for daily commute to offices.
There is a slight dip around 2 PM that lasts till a little past 3 PM. This could be due to drivers stepping down for lunch. The trend later continues to rise to hit a peak ride completions at around 5:30 PM. This could be a culmination of people stepping out for the evening, children coming back from schools, and/or workers checking out of their offices. (The list is not exhaustive)
We can check the hypothesis of who is using the rides in the morning hours.
data <- train %>%
mutate(hour = hour(pickup_datetime),
wday = wday(pickup_datetime)) %>%
group_by(hour, passenger_count) %>%
filter(hour >= 7 & hour <= 11) %>%
dplyr::summarise(n = n())
ggplot(data) +
geom_bar(aes(x = hour, y = n),
stat = "identity")+
facet_wrap( ~ passenger_count) +
labs(
title = paste(
"Distribution of Passengers by the Hour of the Day"
),
subtitle = paste(
"The data specifically looks at 7 AM to 11 AM"
),
x = "Hour of the day",
y = "Total number of rides",
caption = "Each subgraph represents a category of passengers"
) +
theme_bw()
The graph shows very clearly that the number of rides completed from 7 AM to 11 AM are almost all that contain a single passenger with a few containing two passengers. This is indicative of the rides increasing in that time segment due to daily work commute.
weekdays <- train %>%
mutate(wday = wday(pickup_datetime, label = TRUE)) %>%
group_by(wday, vendor_id) %>%
dplyr::summarise(n = n())
ggplot(weekdays) +
geom_point(aes(wday, n, color = vendor_id),
size = 5) +
labs(
title = paste(
"Distribution of Rides Across Days of the Week"
),
subtitle = paste(
"Both the vendors follow similar pattern\nalthough Vendor 1 has total lower number of rides"
),
x = "Day of the week",
y = "Total number of trips completed"
) +
scale_color_discrete(name = "Vendor ID") +
theme_bw()
The graph shows that the frequency of completion of rides across the days of the week for each vendor follow a very similar pattern with the lowest total number of trips completed on Mondays and the most of Fridays although Saturday has almost the same level of number of trips completed. From previous inference, we know that Vendor 2 has more number of rides overall and this is reiterated in this graph again.
Logically, trips complted from Mondays to Thursdays should be on a similar level as these are all regular workdays. Friday could possible have a higher number of trip completitions due to it being a weekend and more people stepping out and moving around the city. The same rationale applies for Saturday and Sunday. However, the data does not seem to be supporting this hypothesis as the trips on Sundays are considerably lower than Friday and Saturday, and the total trips completed from Monday to Thursday have a rising pattern.
This section deals with how the rides are spread across the days of the week based on the number of passengers in each trip. However, a step further to divide the data by vendor_id revealed a subtle preference pattern that is consistent across the spectrum.
passenger_day_week <- train %>% mutate(wday = wday(pickup_datetime, label = TRUE)) %>%
group_by(wday, passenger_count, vendor_id) %>% dplyr::summarise(n = n())
ggplot(passenger_day_week) + geom_point(aes(wday, n, color = vendor_id)) + facet_wrap(~passenger_count) +
labs(title = paste("Breakdown of Rides Based on Day of Week\nand Vendor ID"),
subtitle = paste("Subtle preferences are visible across customer segments "),
x = "Day of the Week", y = "Total number of trips completed") + scale_color_discrete(name = "Vendor ID") +
theme_bw()
There seem to be no rides where passengers were zero for Vendor 1, although the number is not very significant for Vendor 2. An interesting pattern emerges in this data when looked at it in this particular format. Vendor 1 is preffered for those trips where the passenger count is one. Whereas, for anything above that it i.e. passenger_count>2, Vendor 2 seemed to be the preferred choice, especially for larger groups that are of a gorup size 5 or more.
There could be multiple reasons for the above observations to occur. From general reasoning and observation in my own market surroundings, it could be possible that:
This section is to take a closer look at the duration of the trips undertaken by each vendor and how they differed based on the day of the week and hour of the day.
data <- train %>%
mutate(wday = wday(pickup_datetime, label = TRUE)) %>%
group_by(wday, vendor_id) %>%
dplyr::summarise(mean = mean(trip_duration))
p1 <- ggplot(data) +
geom_point(aes(x = wday, y = mean, color = vendor_id),
size = 4) +
labs(
title = paste(
"Relation Between Day of the Week and Average Trip Duration"
),
x = "Day of the Week",
y = ""
) +
scale_color_discrete(name = "Vendor ID") +
theme_bw()
data2 <- train %>%
mutate(hour = hour(pickup_datetime)) %>%
group_by(hour, vendor_id) %>%
dplyr::summarise(mean = mean(trip_duration))
p2 <- ggplot(data2) +
geom_point(aes(x = hour, y = mean, color = vendor_id),
size = 4) +
labs(
title = paste(
"Relation Between Hour of the Day and Averege Trip Duration"
),
x = "Hour of the Day",
y = " Average Duration of the Trip (in Seconds) ->"
) +
scale_color_discrete(name = "Vendor ID") +
theme_bw()
Rmisc::multiplot(p1, p2, cols = 1)
On any given day, the average trip duration for Vendor 2 is greater than that for Vendor 1. One conclusion that can be drawn from this is that Vendor 1 is used for short rides, possibly daily commutes. Given the previous inference that Vendor 1 might be a low cost service provider, this output seems to fall in line with the findings before.
On an average, the average trip durations for both the vendors move is a similar manner across the hour of the day as well as across the day of the week. A hypothesis to test both these statements would be as follows:
\[ h_{0}: \beta_{1} = \beta_{2} \cdots \beta_{n} = 0\] \[ h_{0} \ is \ not \ true \] A multiple dummy variable regression for:
\[ mean = \beta_{0} + \beta_{1}wdayMon + \cdots +\beta{n}wdayN \] generated through the following code gives:
data <- data %>% ungroup(wday) %>% mutate(wday = as.character(wday))
model_1 <- lm(mean ~ wday, data = data)
summary(model_1)
##
## Call:
## lm(formula = mean ~ wday, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -133.3 -106.7 0.0 106.7 133.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 982.274 107.601 9.129 3.89e-05 ***
## wdayMon -90.481 152.170 -0.595 0.571
## wdaySat -42.064 152.170 -0.276 0.790
## wdaySun -90.945 152.170 -0.598 0.569
## wdayThu 16.894 152.170 0.111 0.915
## wdayTue -5.203 152.170 -0.034 0.974
## wdayWed -13.100 152.170 -0.086 0.934
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 152.2 on 7 degrees of freedom
## Multiple R-squared: 0.1241, Adjusted R-squared: -0.6266
## F-statistic: 0.1653 on 6 and 7 DF, p-value: 0.9783
On an individual level, none of the indivudal weekdays have a significant effect on the mean of the model excel for the intercept that is significant even at the 1% level of significance. Moreover, the model as a whole also has a very low explanatory power of 12.41% with a p-value of 0.9783 which is far beyond any acceptable level of significance. Thus h~0 cannot be rejected.
On the other hand, a dummy variable regression of the hours of the day on the mean time taken for the trip using the following equation:
\[ mean = \beta_{0} + \beta_{1}hour1 + \cdots +\beta_{n}hourN \] using the following code:
data2 <- data2 %>% ungroup(hour) %>% mutate(hour = as.character(hour))
model_2 <- lm(mean ~ hour, data = data2)
summary(model_2)
##
## Call:
## lm(formula = mean ~ hour, data = data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -204.6 -103.9 0.0 103.9 204.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 930.643 121.040 7.689 6.32e-08 ***
## hour1 -37.198 171.176 -0.217 0.830
## hour10 -4.536 171.176 -0.027 0.979
## hour11 28.388 171.176 0.166 0.870
## hour12 54.712 171.176 0.320 0.752
## hour13 94.190 171.176 0.550 0.587
## hour14 138.211 171.176 0.807 0.427
## hour15 180.878 171.176 1.057 0.301
## hour16 141.710 171.176 0.828 0.416
## hour17 91.494 171.176 0.535 0.598
## hour18 40.616 171.176 0.237 0.814
## hour19 -44.128 171.176 -0.258 0.799
## hour2 -46.275 171.176 -0.270 0.789
## hour20 -59.555 171.176 -0.348 0.731
## hour21 -48.932 171.176 -0.286 0.777
## hour22 87.571 171.176 0.512 0.614
## hour23 -13.197 171.176 -0.077 0.939
## hour3 -41.418 171.176 -0.242 0.811
## hour4 -10.454 171.176 -0.061 0.952
## hour5 -106.911 171.176 -0.625 0.538
## hour6 -133.384 171.176 -0.779 0.443
## hour7 -104.148 171.176 -0.608 0.549
## hour8 -12.826 171.176 -0.075 0.941
## hour9 -3.917 171.176 -0.023 0.982
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 171.2 on 24 degrees of freedom
## Multiple R-squared: 0.3052, Adjusted R-squared: -0.3607
## F-statistic: 0.4583 on 23 and 24 DF, p-value: 0.9671
Helps us understnad that even in this case, apart from the intercept, there is no variable that has a statistically significant effect on the dependant variable mean. Although the model explains around 30% of the variation in mean, the model has a large p-value that fails to reject h0.
Even though the hour of the day and the day of the week are not statistically significant, the trip durations are considerably higher on Fridays and Saturdays (as can be seen from the graph). This might be due to people unwinding on the weekends and taking much longer trips to places they want to visit than just to commute to office. (This is based on the assumption that people would be rational to pick a house that is close to their workpalce so as to reduce daily stress and increase productivity by not wasting time in commute). However, there is no rationale for trips of Thursdays to be of longer duration than Friday or any other day and there is insufficient data to help with this anomaly.
The average trip durations are more or less similar for the hours of the day from 7 AM to 11 AM/12 PM. This might be due to the fact that:
The last statement can be explored further by calculating the distance travelled in each trip and using the trip_duration to calculate speed.
This section involves material that has been taken directly from online resources as this includes a high level of trigonometry. Given latitudes and longitudes, distances can be calculated using the Haversign formula and the Great Circle Distance Approach.
\[a = sin(dlat / 2)^2 + cos(lat1) * cos(lat2) * sin(dlon / 2)^2\] Where dlon and dlat are the differences in radians between the two vectors. Note that our data have latitudes and longitudes in degrees. Statistical softwares prefer working in radians and thus this conversion.
r is the radius in Kilometers and has a constant value of 6371.
Speed is calculated using the standard formula of \[ speed = distance/time\]
#Function for converting degrees to radians.
deg2rad <- function(x) {
return(x * (pi/180))
}
#Input data from the dataset
lon1 <- deg2rad(train$dropoff_longitude)
lon2 <- deg2rad(train$pickup_longitude)
lat1 <- deg2rad(train$dropoff_latitude)
lat2 <- deg2rad(train$pickup_latitude)
#Haversine Formula and Great Circle Distance Approach
dlon <- lon2 - lon1
dlat <- lat2 - lat1
a <- sin(dlat / 2)^2 + cos(lat1) * cos(lat2) * sin(dlon / 2)^2
c <- 2 * asin(sqrt(a))
r <- 6371
#Distance in Kilometers
d <- c * r
#Calculation of Average Speed
modified_train <- train %>%
mutate(t = train$trip_duration/3600,
s = d/t)
#s is speed in Km/h
modified_train_this <- modified_train %>%
mutate(hour = hour(pickup_datetime)) %>%
group_by(hour) %>%
dplyr::summarise(average_speed = mean(s),
standard_deviation = sd(s)
)
set.seed(42)
kable(x = sample_n(modified_train_this, 10),
format = 'pandoc',
col.names = c("Hour of the Day",
"Average Speed (Km/h)",
"Standard Deviation"),
caption = "Sample of Speeds and Standard Deviations")
| Hour of the Day | Average Speed (Km/h) | Standard Deviation |
|---|---|---|
| 16 | 12.60812 | 10.047240 |
| 4 | 21.99173 | 10.659310 |
| 0 | 17.66529 | 9.729157 |
| 9 | 12.63069 | 8.116976 |
| 3 | 19.90431 | 11.498547 |
| 17 | 12.49368 | 7.371774 |
| 23 | 16.85722 | 8.753082 |
| 14 | 12.20083 | 7.277611 |
| 7 | 15.73354 | 7.800496 |
| 6 | 20.52107 | 11.757863 |
p1 <- ggplot(modified_train_this) +
geom_point(aes(x = hour, y = average_speed)) +
labs(
title = paste(
"Average Speed Across Hours of the Day"
),
x = "Hour of the day",
y = "Average Speed (Km/h)"
) +
theme_bw()
p2 <- ggplot(modified_train_this) +
geom_point(aes(x = hour, y = standard_deviation))+
labs(
title = paste(
"Standard Deviation of Speed Across Hours of the Day"
),
subtitle = paste(
"SD is more or less in a tight range except for two outliers"
),
x = "Hour of the day",
y = "Standard Deviation of Speed"
) +
theme_bw()
Rmisc::multiplot(p1, p2, cols = 1)
From the previous section, I was trying to understand more about if “drivers are driving faster so as to complete more trips and bag their performance rewards”. The average speed of the trips between 7 AM to 11 AM/12 AM are very low i.e. less than 5 Km/h. Given that this is a rush hour, it can be considered to be normal. Moreover, the standard deviation in these hours is also under a tight corridor indicating that this performance has sustained over the whole dataset.
The average speed seems to be very high in the late nights (or early mornings if one prefers it that way), however, this is not exceptional as everyone else on the road is also in the same speed range (as indicated by the standard deviation measure from 12 AM to 7 AM).
There are two outliers indicating that in the evenings it is hard to say if the usual trip estimates can be relied upon. However, this might not be a very strong indication as the standard deviation is overall in a range except for two observations that are moreover not even consequtively located.
Thus, from the above two graphs, and reasoning as detailed the previous hypothesis can be discarded as there is no substantial visual evidence.
There could possibily be a correlation between passenger_count and trip_duration. The causation if any could be based on the factors that the larger groups could have more number of stops, or take more time to even start the trip after the cab arrives, and/or travel for longer distances such as from home to restaurants and other extertainment places and even outstation trips that would normally be of lower frequency by individuals.
ggplot(train) +
geom_boxplot(aes(x = reorder(passenger_count, trip_duration, FUN = median)
, y = trip_duration, outline = FALSE)
) +
ylim(0,1e+03) +
labs(
title = paste(
"Relation Between Passenger Count and Trip Duration"
),
subtitle = paste(
"Data ordered based on median values"
),
x = "Passenger count",
y = "Trip duration (in seconds)"
)
Although the section begined with a possible understanding for causation, if any, there seems to be no significant relationship between these two variables as the medians of each are in a tight range while the data for passenger groups of 7, 8, 9 people are not reliable as there are too few observations.
From a statistical perspective I would like to explore the model:
\[y = \beta_{0} +\beta_{1} x_{1} +\beta_{2} x_{2} + \cdots +\beta_{n} x_{n} + e_{i}\] Where:
model_1 <- lm(trip_duration ~ passenger_count, data = modified_train)
x = summary(model_1)$coefficients[, ]
kable(x = x, col.names = c("Estimate", "Std. Error", "t-value", "p-value"))
| Estimate | Std. Error | t-value | p-value | |
|---|---|---|---|---|
| (Intercept) | 1718.4333 | 676.1233 | 2.5415976 | 0.0110348 |
| passenger_count1 | -788.0336 | 676.1429 | -1.1654837 | 0.2438234 |
| passenger_count2 | -712.9750 | 676.2197 | -1.0543540 | 0.2917210 |
| passenger_count3 | -690.1971 | 676.4619 | -1.0203045 | 0.3075843 |
| passenger_count4 | -664.9036 | 676.8370 | -0.9823688 | 0.3259184 |
| passenger_count5 | -648.2012 | 676.3830 | -0.9583345 | 0.3378943 |
| passenger_count6 | -657.0781 | 676.5428 | -0.9712291 | 0.3314344 |
| passenger_count7 | -1698.7667 | 3098.3862 | -0.5482747 | 0.5835034 |
| passenger_count8 | -1614.4333 | 5280.6918 | -0.3057238 | 0.7598150 |
| passenger_count9 | -1158.4333 | 5280.6918 | -0.2193715 | 0.8263607 |
In the dummy variable regression above, it is evident that none of the coefficients are significant at 5% significance level except for the intercept. The standard errors for passenger_count where the group size is either 7,8,9 is abnormally high (including their p-values) and this is attributable to the fact that there were very few trips in that category.
The model as a whole has an R-square of 8.374061410^{-5} - an insignificant amount of the variation in the dependant variable is being explained by all the dummy variables.
[1] New York City Taxi Trip Duration. 2017. <URL: https://www.kaggle.com/c/nyc-taxi-trip-duration/overview>.
[2] T. BajajCheck. Program for distance between two points on earth. Oct. 2019. <URL: https://www.geeksforgeeks.org/program-distance-two-points-earth/>.
[3] R. A. Irizarry. Introduction to Data Science. 2020.
[4] M. Waegemakers. Weather data in New York City - 2016. Sep. 2017. <URL: https://www.kaggle.com/mathijs/weather-data-in-new-york-city-2016?rvi=1>.
[5] H. Wickham and G. Grolemund. R for data science: import, tidy, transform, visualize and model data. OReilly, 2017.
[6] N. Yau. Data Points : Visualization That Means Something. John Wiley & Sons, Inc. , 2013.
The number of variables and observations could change during the time of tidying the data.↩
Conflict between the summarise() function of plyr and dplyr caused the code to work in some states and not function in other. Thus the explicit mention “dplyr::summarise(n = n())”.↩
Subsequent Knit-ing has rendered errors in “head(arrange(count_of_trips[order(count_of_trips$n),]))” thus not providing the desired output ot the lowest number of rides on a given day in those specific months. However, this was overcome using plyr - “head(arrange(count_of_trips, ))”. Specific marked as comment in code block.↩
A similar error that had occored previously occured here again. Conflict between the summarise() function of plyr and dplyr caused the code to work in some states and not function in other. Thus the explicit mention “dplyr::summarise(n = n())”.↩
The following code block encounted the same issue regardign summarise as before. Conflict between the summarise() function of plyr and dplyr caused the code to work in some states and not function in other. Thus the explicit mention “dplyr::summarise(n = n())”. Implementing the same throughout the document.↩