NYC Taxi Trip Duration Analysis

Introduction

The following data has been taken from Kaggle that was previously used for a competition. With the following data, I plan to:

clean the data
conduct exploratory analysis
chek for relations that are significant through linear regressions

The data being used here can be found at Kaggle’s page.

Loading and Looking at the Data

The frist step would be to load the data and look at the libraries I will be using.

List of libraries:

readr
tidyverse
lubridate
forcats
leaflet
knitr
Rmisc

The data is divided into train and test by the providers. The competition was testing for machine learning but I will not be delving into that in this exercise.

train <- read_csv("train.csv")

## Parsed with column specification:
## cols(
##   id = col_character(),
##   vendor_id = col_double(),
##   pickup_datetime = col_datetime(format = ""),
##   dropoff_datetime = col_datetime(format = ""),
##   passenger_count = col_double(),
##   pickup_longitude = col_double(),
##   pickup_latitude = col_double(),
##   dropoff_longitude = col_double(),
##   dropoff_latitude = col_double(),
##   store_and_fwd_flag = col_character(),
##   trip_duration = col_double()
## )

The data contains a total of 1458644 observations and spread over 11 coloumns.¹

Since I will be working with tidyverse I would like to change the data from it’s original form to a tibble. I will then look at the the structure of the tibble i.e. the underlying data.

train <- as_tibble(train)

glimpse(train)

## Observations: 1,458,644
## Variables: 11
## $ id                 <chr> "id2875421", "id2377394", "id3858529", "id350467...
## $ vendor_id          <dbl> 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 1, ...
## $ pickup_datetime    <dttm> 2016-03-14 17:24:55, 2016-06-12 00:43:35, 2016-...
## $ dropoff_datetime   <dttm> 2016-03-14 17:32:30, 2016-06-12 00:54:38, 2016-...
## $ passenger_count    <dbl> 1, 1, 1, 1, 1, 6, 4, 1, 1, 1, 1, 4, 2, 1, 1, 1, ...
## $ pickup_longitude   <dbl> -73.98215, -73.98042, -73.97903, -74.01004, -73....
## $ pickup_latitude    <dbl> 40.76794, 40.73856, 40.76394, 40.71997, 40.79321...
## $ dropoff_longitude  <dbl> -73.96463, -73.99948, -74.00533, -74.01227, -73....
## $ dropoff_latitude   <dbl> 40.76560, 40.73115, 40.71009, 40.70672, 40.78252...
## $ store_and_fwd_flag <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N"...
## $ trip_duration      <dbl> 455, 663, 2124, 429, 435, 443, 341, 1551, 255, 1...

Usually, str is the popular choice to get a look at the data. However, using glimpse, is I believe, a much more convinient option as it let’s me peek into the data i.e. what the observations looks like along with the type of the variable that they are.

A description of the variables:

id: a unique identifier for each trip
vendor_id: there seem to be multiple vendors in NYC (it is a big place) and this identifies which vendor has serviced the trip
pickup_datetime: the date and time of when a particular trip started
dropoff_datetime: the date and time of when a particular trip ended
passenger_count: the number of people who were present in the car during the ride excluding the driver
pickup_longitude: the longitude part of the location details of where the pick up took place
pickup_latitude: the latitude part of the location details of where the pick up took place
dropoff_longitude: the longitude part of the location details of where the drop off took place
dropoff_latitude: the latitude part of the location details of where the drop off took place
store_and_fwd_flag: if the the trip details were stored on the phone and later sent to the server due to any reason
trip_duration: the length of the trip as measured in seconds

Initial obeservations on the data:

The nature of the variables seems coherent. The fact that vendor_id, and store_and_fwd_flag are of the type indicates that there are limited types of values in each.
Prima facie it looks like there are only two vendors, i.e. 1 and 2. However, this can be only confirmed by looking at all the data.
pickup_datetime and dropoff_datetime are in format. However, working with dates and times in lubridate format is more convinient.
As of now id does not play a critical role in the analysis and can be left as .

Now to check for any missing values in the dataset:

sum(is.na(train))

## [1] 0

The feature of coercion enables us to find the number of missing values in this dataset. There are a total of 0 missing values.

I will not be making sure that the variables are in the format that they are supposed to be. This will include converting a few variables to factors using forcats and converting another set of variables to ymd_hms i.e. year, month, date, hour, minutes, second format using lubridate.

train <- train %>% mutate(vendor_id = factor(vendor_id)) %>% mutate(passenger_count = factor(passenger_count)) %>% 
    mutate(store_and_fwd_flag = factor(store_and_fwd_flag)) %>% mutate(pickup_datetime = ymd_hms(pickup_datetime)) %>% 
    mutate(dropoff_datetime = ymd_hms(dropoff_datetime))

The code replaces the existing coloumn with the new required format coloumns with the same names for the variables. However, the order of the variables changes.

There is a probablity of an existance of inconsistency in the trip_duration variable. Theoretically, the formula for this would be:
\[trip\_duration = dropoff\_datetime - pickup\_datetime\]
However, there is not information regarding how this particular variable was collected. It could’ve been recorded independantly by the application or it could’ve taken dropoff_datetime and pickup_datetime as inputs to calculate that coloumn. A sense check in this section would be beneficial.

result <- (nrow(train) == sum((-1 * (train$pickup_datetime - train$dropoff_datetime)) == 
    train$trip_duration))

The evaluation for the above expression is TRUE. If the value is TRUE then this is an indication that the coloumns in question are consistent. If not, then there is a need to check the issue. Here, we do not have any issues.

Exploratory Data Analysis

I would like to use a sample data of 10,000 random observations so as to render graphs faster for intial visualization purposes. Howver, the final code will use the full data set.

sample <- sample_n(train, 10000)

Different Types of Vendors

Given that vendors for the dataset were of the type , pulling out unique values would give information regarding how many vendors are operational.

unique(train$vendor_id)

## [1] 2 1
## Levels: 1 2

Thus, there are only two vendors operating in this area, Vendor 1 and Vendor 2.

###Passenger Count

From a general look at the data, we can see that there are multiple values for the passenger_count variable and the variable is formatted to a factor. Thus there should ideally be a limited number of values to work with.

levels(train$passenger_count)

##  [1] "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"

According to this data the minimum number of passengers in total trips were 0 passengers. This could be due to the following factors:

The service might have an item delivery option which would have no passengers. However, there is insufficient information to test this hypothesis.
There might be few trips that took place as errors. *There is insufficient information to test if the trips recorded also consist of those where rides were booked and the driver arrived at the location after driving for a while and then the trip is cancelled.

Given that the maximum passenger count is 9, the vendors must be having very large cars to service the requests. This is very much line with a service such as Uber XL in India.

Trip Durations and their Distributions

A simple barchart would give us a starting point for further exploration with regard to this variable.

I will be using ggplot for visualiations. I will also be using , results = ‘hide’ in code blocks so as to render only relevant graphs in the final output.

ggplot(sample) + geom_histogram(aes(trip_duration))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The graph in its current state does not reveal much information. There are too many values in too few bins. I will increase the number of bins so that more information is reveal from the data.

ggplot(sample) + geom_histogram(aes(trip_duration), bins = 100)

ggplot(sample) + geom_histogram(aes(trip_duration), bins = 150)

Even at 150 bins, the graphs don’t seem to be revealing much. It is also noticiable that the data is very skewed. This can be adjusted using a log transformation (scale_x_log10()). The skewness in the graph is also because of a few trips that fall on the extreme end of the spectrum.

ggplot(sample) + geom_histogram(aes(trip_duration), bins = 150) + scale_x_log10()

The graph is normalised now and regular statistics can be used on it. However, there are a few extreme values that can be better visualised by changing the scale of y-Axis using (scale_y_sqrt()). This transformation is commonly applied to counted data, especially if the values are mostly rather small.

ggplot(sample) + geom_histogram(aes(trip_duration), bins = 150) + scale_x_log10() + 
    scale_y_sqrt()

This graph is ideal for analysis. Now to apply the full data set:

ggplot(train) + geom_histogram(aes(trip_duration), bins = 150) + scale_x_log10() + 
    scale_y_sqrt() + labs(title = paste("Distribution of Trip Durations"), subtitle = paste("Data demonstrates a normal distribution after\nrelevant transformations"), 
    x = "Trip duration in seconds.\nLog sclae.", y = "Frequency of trips (n).\nSquare root scale.")

Most of the trips are of the duration just shy of 17 minutes. An accurate measurement of the length of a mean trip is 15.99 minutes.

There are a few trips that are on the extreme end of the graph. These can be pulled from the dataset by:

dataTable <- train %>% filter(rank(desc(trip_duration)) < 10) %>% select(trip_duration)

kable(dataTable, format = "pandoc", col.names = c("Trip Duration"), caption = "Lengthiest trips in the data set")

Lengthiest trips in the data set
Trip Duration
86387
86390
86391
1939736
2049578
2227612
3526282
86392
86385

The top ten trips based on duration spead over 86,000 seconds i.e. 1433.33 minutes. The longest trip is for 5.87713710^{4} minutes i.e. just around 41 days. These trip could possibly be an indication of outstation trips or some form of data-entry error.

Distribution of Trips Across Time

First to plot a barchart of how many trips have taken place across the timeframe of the data set:

ggplot(train) + geom_histogram(aes(pickup_datetime), color = "#1212b3", fill = "#1212b3", 
    bins = 100) + labs(title = paste("Frequency of Trips Across Time"), subtitle = paste("Fairly consistent ride frequency"), 
    x = "Pick-up date", y = "Number of trips completed") + theme_bw()

This shows that the data set has trips recorded from the beginning of January till the end of June i.e. the beginning of July. The trips seem to be fairly consistent in the way they are distributed albeit with a few outlier. The most significant one is between January and February.

This needs to be looked at more closely to asses the significance of the drop.

train %>% filter(month(pickup_datetime) < 4) %>% ggplot(aes(pickup_datetime)) + geom_histogram(bins = 100, 
    color = "#1212b3", fill = "#1212b3") + labs(title = paste("Frequency of Trips Between January and April"), 
    subtitle = paste(str_wrap("There is a alarming fall in the number of trip near the end of February", 
        width = 90)), x = "Pick-up date", y = "Number of trips completed") + theme_bw()

The drop in the number of trip at the end of February was very significant with the number of trips being completed close to zero. This possibly could not be due to holiday of any nature, as in the period being analysed there are multiple holidays and this anamoly is happening only here. This could be possibly due to weather. However, that data is not provided in the already loaded data set.

The exact figures and the data are as follows: ² ³

count_of_trips <- train %>% group_by(day(pickup_datetime), month(pickup_datetime)) %>% 
    dplyr::summarise(n = n())

# count_of_trips <- head(arrange(count_of_trips[order(count_of_trips$n),]))
count_of_trips <- head(arrange(count_of_trips, n))

kable(count_of_trips, format = "pandoc", col.names = c("Date", "Month", "Number of trips completed"), 
    caption = "Lowest Rides Summary of Number of Trips by Day and Month")

Lowest Rides Summary of Number of Trips by Day and Month
Date	Month	Number of trips completed
23	1	1648
24	1	3383
30	5	5570
25	1	6084
3	1	6353
29	5	6372

A search through Kaggle has rendered a dataset that would be suitable to enhance this analysis. This data set can be found at Mathijs Waegemakers’s page.

weather <- read_csv("weather_data_nyc_centralpark_2016(1).csv")

## Parsed with column specification:
## cols(
##   date = col_character(),
##   `maximum temperature` = col_double(),
##   `minimum temperature` = col_double(),
##   `average temperature` = col_double(),
##   precipitation = col_character(),
##   `snow fall` = col_character(),
##   `snow depth` = col_character()
## )

glimpse(weather)

## Observations: 366
## Variables: 7
## $ date                  <chr> "1-1-2016", "2-1-2016", "3-1-2016", "4-1-2016...
## $ `maximum temperature` <dbl> 42, 40, 45, 36, 29, 41, 46, 46, 47, 59, 40, 4...
## $ `minimum temperature` <dbl> 34, 32, 35, 14, 11, 25, 31, 31, 40, 40, 26, 2...
## $ `average temperature` <dbl> 38.0, 36.0, 40.0, 25.0, 20.0, 33.0, 38.5, 38....
## $ precipitation         <chr> "0.00", "0.00", "0.00", "0.00", "0.00", "0.00...
## $ `snow fall`           <chr> "0.0", "0.0", "0.0", "0.0", "0.0", "0.0", "0....
## $ `snow depth`          <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", ...

There are a number of issues with the data. precipitation, snowfall, and snowdepth are character type. These need to be converted to double so as to conduct any mathematical operations.

maximum temperatu~, minimum temperat~, and average temperat~ which represent the maximum, minimum, and the average temperate for a given day would not affect a ride outcome to that of an extent as previously observed. It would be alright to drop these coloumns so as to have a compact dataset.

date which as the name suggests represents the date of the observation but this variable is recorded as a character. The appropriate representation would be a date object as formated by lubridate.

weather <- as_tibble(weather)
weather <- weather %>% mutate(date = dmy(date), precipitation = as.double(precipitation), 
    snowfall = as.double(`snow fall`), snowdepth = as.double(`snow depth`)) %>% select(-c(2, 
    3, 4), -`snow fall`, -`snow depth`)

Now a simple with regard to the amount of snowfall across time but before the end February could reveal something meaningful (note that this is the period of interest).

weather_subset <- weather %>% group_by(date) %>% filter(month(date) < 3)

ggplot(weather_subset) + geom_line(aes(date, snowfall)) + labs(title = paste("Distribution of Snowfall in NYC Area"), 
    subtitle = paste("A large spike prevalent in the middle of January"), x = "Date", 
    y = "Amount of snowfall", caption = "Breaks in the line are either NA values or\nwhere there was no snowfall") + 
    theme_bw()

As suspected, there was a large external environmental factor that affected the number of rides in the middle of January that caused the number of rides completed to fall to almost zero.

index_of_outlier <- as.numeric(which.max(weather_subset$snowdepth))
data_of_outlier <- weather_subset[as.numeric(index_of_outlier - 2):as.numeric(index_of_outlier + 
    2), ]
kable(data_of_outlier, format = "pandoc", col.names = c("Date", "Precipitation", 
    "Snowfall", "Snowdepth"), caption = "Days surrounding the point of interest")

Days surrounding the point of interest
Date	Precipitation	Snowfall	Snowdepth
2016-01-22	0.01	0.2	0
2016-01-23	2.31	27.3	6
2016-01-24	NA	NA	22
2016-01-25	0.00	0.0	19
2016-01-26	0.00	0.0	17

Even though the highest amount of snowdepth was on 2016-01-24 the lowest total rides (amounting to 1648 rides) took place on 2016-01-23 (a day before the highest snowdepth) which is the same day as when NYC received about 27.3 inches of snowfall - possibly a snowstorm. Although the number of rides went up to on 2016-01-23 (the day with 22 inches of snowfall) to 3383 this is still very low copared to the regular state. The rides would’ve increased due to the strom setlling down and road beaing cleared.

It can be conclusively said that the issue in the drastic fall in the rides during the period in question can be attributed to weather disturbances.

Distribution of Passengers

I would like to first look at the distribution of passengers. A bar chart would do the job since this is a count of categorical variable.⁴

DoP1 <- ggplot(train) + geom_bar(aes(passenger_count)) + scale_y_sqrt() + labs(title = paste("Passenger Distribution"), 
    subtitle = paste("Passenger count grouped by the number of passengers in each ride"), 
    x = "Passenger count categories", y = "Total number of rides") + theme_bw()

DoP1

data <- train %>% group_by(passenger_count) %>% dplyr::summarise(n = n())

The distribution shows very clearly that the most number of trips, i.e. 1033540 trips happened with 1 passenger(s) in the car. We know from the list of levels for passenger_count that there are rides with even 9 passengers in the car but this cannot be seen in the chart properly. In order to get a better view, the scale of the y-axis has been changed using scale_y_sqrt(). In the modified graph, there are samll bars for 0, 7, 8, and 9 passengers.

kable(data, format = "pandoc", col.names = c("Passenger Count Group", "Count"), caption = "Summary of Distribution of Passengers")

Summary of Distribution of Passengers
Passenger Count Group	Count
0	60
1	1033540
2	210318
3	59896
4	28404
5	78088
6	48333
7	3
8	1
9	1

The distribution seems sensible given that most of the trips happened with 1, 2, 3, or 4 passengers.

Given that there are two vendors in the data, I would also like to look at how these trips were serviced by them.

ggplot(train) + geom_bar(aes(vendor_id)) + labs(title = paste("Trips Grouped by Vendors"), 
    x = "Vendor identification", y = "Number of trips completed") + theme_bw()

Most of the trips were conducted by Vendor 2 although the margin doesn’t seem significant through a visual examination. As a proportion, Vendor 2 has 1.1503077 times more trips than Vendor 1. As long as this proportion is under 30-50%, it does not seem justifiable to state that one of the Vendors as conclusively superior market power.

The data can be further divided and plotted based on the vendor_id.⁵

#data <- train %>%
#  group_by(vendor_id, passenger_count) %>%
#  count()

data <- train %>%
  group_by(vendor_id, passenger_count) %>%
  dplyr::summarise(n = n())

ggplot(data) +
  geom_bar(aes(x = data$passenger_count, y = data$n, fill = data$vendor_id),
           stat = "identity",
           position = "dodge") +
  scale_y_sqrt() + 
  labs(
    title = paste(
      "Passenger Count Distribution"
    ),
    subtitle = paste(
      "Data grouped by vendor"
    ), 
    x = "Passenger count category", 
    y = "Number of trips"
  ) +
  scale_fill_discrete(name = "Vendor ID") +
  theme_bw()

One of the previous inferences for the data was that the rides are somewhat equally distributed between both the Vendors. However, when the passenger category is grouped by the vendor_id, we can see that this inference holds true only for situations where there are 1, 2, 3, 4 passengers. This is the usual order flow. The larger passenger size category of 5, and 6 passengers is served mostly by Vendor 2 but the rides with 7, 8, and 9 passengers are served exclusively by Vendor 2.
### Issues With The Application

The store_and_fwd_flag as informed before is indicative of if the data was sent back to the server right after it was collected or if there was any delay due to any reason whatsoever and was stored on the deivce and later sent to the server.

data <- train %>% group_by(vendor_id, store_and_fwd_flag) %>% select(vendor_id, store_and_fwd_flag) %>% 
    count()

kable(x = data, format = "pandoc", col.names = c("Vendor ID", "Error", "Count"), 
    caption = "Error Distribution Across Vendors")

Error Distribution Across Vendors
Vendor ID	Error	Count
1	N	670297
1	Y	8045
2	N	780302

The table shows that Vendor 2 has not had any issues with regard to data being pushed to the server. On the other hand, Vendor 1 has had approximately issues in 1.1859799% of the total rides. As long as this doesn’t affect customer satisfaction or service provision, I do not think this is a cause for concern although in the long term the application could be bettered for Vendor 1.

Distribution of Rides Based on Time

Given that we have date and time details of each ride, we can look closer on how the rides were distributed based on different measures of time. I will be looking into:

Number of rides (count) throughout a day
Number of rides (count) throughout the week
Number of rides spread throughout the whole week grouped by passengers

Number of Rides (Count) Throughout a Day

day_hour <- train %>%
  mutate(hour = hour(pickup_datetime)) %>%
  mutate(wday = wday(pickup_datetime, label = TRUE)) %>%
  group_by(hour, vendor_id) %>%
  dplyr::summarise(n = n())

ggplot(day_hour) +
  geom_line(aes(x = hour, y = n, color = vendor_id)) +
  labs(
    title = paste(
      "Cumulative Distribution of Rides in a Day "
    ),
    subtitle = paste(
      "NYC seems to sleep early in the morning"
    ), 
    x = "Hour of the day",
    y = "Total number of rides"
  ) +
  scale_color_discrete(name = "Vendor ID") +
  theme_bw()

The graph shows that the least number of trips were fulfilled in the early morning from 3 AM to 6 AM. The ride completion before 3 AM has a downward trend indicating a fall in the number of orders as the early morning approaches. However the trend starts rising by 6 AM/6:30 AM till late morning. This might be due to the rides that are booked for daily commute to offices.

There is a slight dip around 2 PM that lasts till a little past 3 PM. This could be due to drivers stepping down for lunch. The trend later continues to rise to hit a peak ride completions at around 5:30 PM. This could be a culmination of people stepping out for the evening, children coming back from schools, and/or workers checking out of their offices. (The list is not exhaustive)

We can check the hypothesis of who is using the rides in the morning hours.

data <- train %>%
  mutate(hour = hour(pickup_datetime),
         wday = wday(pickup_datetime)) %>%
  group_by(hour, passenger_count) %>%
  filter(hour >= 7 & hour <= 11) %>%
  dplyr::summarise(n = n())

ggplot(data) +
  geom_bar(aes(x = hour, y = n), 
           stat = "identity")+
  facet_wrap( ~ passenger_count) +
  labs(
    title = paste(
      "Distribution of Passengers by the Hour of the Day"
    ), 
    subtitle = paste(
      "The data specifically looks at 7 AM to 11 AM"
    ),
    x = "Hour of the day",
    y = "Total number of rides",
    caption = "Each subgraph represents a category of passengers"
  ) +
  theme_bw()

The graph shows very clearly that the number of rides completed from 7 AM to 11 AM are almost all that contain a single passenger with a few containing two passengers. This is indicative of the rides increasing in that time segment due to daily work commute.

Number of Rides (Count) Throughout the Week

weekdays <- train %>%
  mutate(wday = wday(pickup_datetime, label = TRUE)) %>%
  group_by(wday, vendor_id) %>%
  dplyr::summarise(n = n())

ggplot(weekdays) +
  geom_point(aes(wday, n, color = vendor_id), 
             size = 5) +
  labs(
    title = paste(
      "Distribution of Rides Across Days of the Week"
    ), 
    subtitle = paste(
      "Both the vendors follow similar pattern\nalthough Vendor 1 has total lower number of rides"
    ), 
    x = "Day of the week",
    y = "Total number of trips completed"
  ) +
  scale_color_discrete(name = "Vendor ID") +
  theme_bw()

The graph shows that the frequency of completion of rides across the days of the week for each vendor follow a very similar pattern with the lowest total number of trips completed on Mondays and the most of Fridays although Saturday has almost the same level of number of trips completed. From previous inference, we know that Vendor 2 has more number of rides overall and this is reiterated in this graph again.

Logically, trips complted from Mondays to Thursdays should be on a similar level as these are all regular workdays. Friday could possible have a higher number of trip completitions due to it being a weekend and more people stepping out and moving around the city. The same rationale applies for Saturday and Sunday. However, the data does not seem to be supporting this hypothesis as the trips on Sundays are considerably lower than Friday and Saturday, and the total trips completed from Monday to Thursday have a rising pattern.

Number of Rides Spread Throughout Week Grouped by Passengers and Vendor ID

This section deals with how the rides are spread across the days of the week based on the number of passengers in each trip. However, a step further to divide the data by vendor_id revealed a subtle preference pattern that is consistent across the spectrum.

passenger_day_week <- train %>% mutate(wday = wday(pickup_datetime, label = TRUE)) %>% 
    group_by(wday, passenger_count, vendor_id) %>% dplyr::summarise(n = n())

ggplot(passenger_day_week) + geom_point(aes(wday, n, color = vendor_id)) + facet_wrap(~passenger_count) + 
    labs(title = paste("Breakdown of Rides Based on Day of Week\nand Vendor ID"), 
        subtitle = paste("Subtle preferences are visible across customer segments "), 
        x = "Day of the Week", y = "Total number of trips completed") + scale_color_discrete(name = "Vendor ID") + 
    theme_bw()

There seem to be no rides where passengers were zero for Vendor 1, although the number is not very significant for Vendor 2. An interesting pattern emerges in this data when looked at it in this particular format. Vendor 1 is preffered for those trips where the passenger count is one. Whereas, for anything above that it i.e. passenger_count>2, Vendor 2 seemed to be the preferred choice, especially for larger groups that are of a gorup size 5 or more.

There could be multiple reasons for the above observations to occur. From general reasoning and observation in my own market surroundings, it could be possible that:

Vendor 1 is a low cost service provider which is causing people to use this service for daily commute.
Vendor 1 could be having smaller cars, but this is not supported by the presence of trips completed for groups of 5 and 6 people.
Vendor 2 could be providing cars that are a on the higher end compared to the regular taxis and it’s (possibly only) competitor Vendor 1 thereby capturing the market of groups much more successfuly than Vendor 1.
Vendor 2 could be having a special service for large cars thereby capturing the infrequent but essential market of of 7+ group size. Although it might not prove very economical it would help in building brand value.

Duration of Trips

This section is to take a closer look at the duration of the trips undertaken by each vendor and how they differed based on the day of the week and hour of the day.

data <- train %>%
  mutate(wday = wday(pickup_datetime, label = TRUE)) %>%
  group_by(wday, vendor_id) %>%
  dplyr::summarise(mean = mean(trip_duration))

p1 <- ggplot(data) +
  geom_point(aes(x = wday, y = mean, color = vendor_id),
             size = 4) +
  labs(
    title = paste(
      "Relation Between Day of the Week and Average Trip Duration"
    ), 
    x = "Day of the Week",
    y = ""
  ) +
  scale_color_discrete(name = "Vendor ID") +
  theme_bw()

data2 <- train %>%
  mutate(hour = hour(pickup_datetime)) %>%
  group_by(hour, vendor_id) %>%
  dplyr::summarise(mean = mean(trip_duration))

p2 <- ggplot(data2) +
  geom_point(aes(x = hour, y = mean, color = vendor_id),
             size = 4) +
  labs(
    title = paste(
      "Relation Between Hour of the Day and Averege Trip Duration"
    ), 
    x = "Hour of the Day", 
    y = "                                   Average Duration of the Trip (in Seconds) ->"
  ) +
  scale_color_discrete(name = "Vendor ID") +
  theme_bw()
  
Rmisc::multiplot(p1, p2, cols = 1)

On any given day, the average trip duration for Vendor 2 is greater than that for Vendor 1. One conclusion that can be drawn from this is that Vendor 1 is used for short rides, possibly daily commutes. Given the previous inference that Vendor 1 might be a low cost service provider, this output seems to fall in line with the findings before.

On an average, the average trip durations for both the vendors move is a similar manner across the hour of the day as well as across the day of the week. A hypothesis to test both these statements would be as follows:

\[ h_{0}: \beta_{1} = \beta_{2} \cdots \beta_{n} = 0\] \[ h_{0} \ is \ not \ true \] A multiple dummy variable regression for:

\[ mean = \beta_{0} + \beta_{1}wdayMon + \cdots +\beta{n}wdayN \] generated through the following code gives:

data <- data %>% ungroup(wday) %>% mutate(wday = as.character(wday))
model_1 <- lm(mean ~ wday, data = data)
summary(model_1)

## 
## Call:
## lm(formula = mean ~ wday, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -133.3 -106.7    0.0  106.7  133.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  982.274    107.601   9.129 3.89e-05 ***
## wdayMon      -90.481    152.170  -0.595    0.571    
## wdaySat      -42.064    152.170  -0.276    0.790    
## wdaySun      -90.945    152.170  -0.598    0.569    
## wdayThu       16.894    152.170   0.111    0.915    
## wdayTue       -5.203    152.170  -0.034    0.974    
## wdayWed      -13.100    152.170  -0.086    0.934    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 152.2 on 7 degrees of freedom
## Multiple R-squared:  0.1241, Adjusted R-squared:  -0.6266 
## F-statistic: 0.1653 on 6 and 7 DF,  p-value: 0.9783

On an individual level, none of the indivudal weekdays have a significant effect on the mean of the model excel for the intercept that is significant even at the 1% level of significance. Moreover, the model as a whole also has a very low explanatory power of 12.41% with a p-value of 0.9783 which is far beyond any acceptable level of significance. Thus h~0 cannot be rejected.

On the other hand, a dummy variable regression of the hours of the day on the mean time taken for the trip using the following equation:

\[ mean = \beta_{0} + \beta_{1}hour1 + \cdots +\beta_{n}hourN \] using the following code:

data2 <- data2 %>% ungroup(hour) %>% mutate(hour = as.character(hour))
model_2 <- lm(mean ~ hour, data = data2)
summary(model_2)

## 
## Call:
## lm(formula = mean ~ hour, data = data2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -204.6 -103.9    0.0  103.9  204.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  930.643    121.040   7.689 6.32e-08 ***
## hour1        -37.198    171.176  -0.217    0.830    
## hour10        -4.536    171.176  -0.027    0.979    
## hour11        28.388    171.176   0.166    0.870    
## hour12        54.712    171.176   0.320    0.752    
## hour13        94.190    171.176   0.550    0.587    
## hour14       138.211    171.176   0.807    0.427    
## hour15       180.878    171.176   1.057    0.301    
## hour16       141.710    171.176   0.828    0.416    
## hour17        91.494    171.176   0.535    0.598    
## hour18        40.616    171.176   0.237    0.814    
## hour19       -44.128    171.176  -0.258    0.799    
## hour2        -46.275    171.176  -0.270    0.789    
## hour20       -59.555    171.176  -0.348    0.731    
## hour21       -48.932    171.176  -0.286    0.777    
## hour22        87.571    171.176   0.512    0.614    
## hour23       -13.197    171.176  -0.077    0.939    
## hour3        -41.418    171.176  -0.242    0.811    
## hour4        -10.454    171.176  -0.061    0.952    
## hour5       -106.911    171.176  -0.625    0.538    
## hour6       -133.384    171.176  -0.779    0.443    
## hour7       -104.148    171.176  -0.608    0.549    
## hour8        -12.826    171.176  -0.075    0.941    
## hour9         -3.917    171.176  -0.023    0.982    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 171.2 on 24 degrees of freedom
## Multiple R-squared:  0.3052, Adjusted R-squared:  -0.3607 
## F-statistic: 0.4583 on 23 and 24 DF,  p-value: 0.9671

Helps us understnad that even in this case, apart from the intercept, there is no variable that has a statistically significant effect on the dependant variable mean. Although the model explains around 30% of the variation in mean, the model has a large p-value that fails to reject h₀.

Even though the hour of the day and the day of the week are not statistically significant, the trip durations are considerably higher on Fridays and Saturdays (as can be seen from the graph). This might be due to people unwinding on the weekends and taking much longer trips to places they want to visit than just to commute to office. (This is based on the assumption that people would be rational to pick a house that is close to their workpalce so as to reduce daily stress and increase productivity by not wasting time in commute). However, there is no rationale for trips of Thursdays to be of longer duration than Friday or any other day and there is insufficient data to help with this anomaly.

The average trip durations are more or less similar for the hours of the day from 7 AM to 11 AM/12 PM. This might be due to the fact that:

schools have buses to transport children
passengers are ready to get into their cabs as soon as the cab arrives so as to not waste time or reach their place of work late
drivers are driving faster so as to complete more trips and bag their performance rewards

The last statement can be explored further by calculating the distance travelled in each trip and using the trip_duration to calculate speed.

Average Speed Analysis

This section involves material that has been taken directly from online resources as this includes a high level of trigonometry. Given latitudes and longitudes, distances can be calculated using the Haversign formula and the Great Circle Distance Approach.

\[a = sin(dlat / 2)^2 + cos(lat1) * cos(lat2) * sin(dlon / 2)^2\] Where dlon and dlat are the differences in radians between the two vectors. Note that our data have latitudes and longitudes in degrees. Statistical softwares prefer working in radians and thus this conversion.

r is the radius in Kilometers and has a constant value of 6371.

Speed is calculated using the standard formula of \[ speed = distance/time\]

#Function for converting degrees to radians.

deg2rad <- function(x) {
   return(x * (pi/180))
}

#Input data from the dataset

lon1 <- deg2rad(train$dropoff_longitude)
lon2 <- deg2rad(train$pickup_longitude)
lat1 <- deg2rad(train$dropoff_latitude)
lat2 <- deg2rad(train$pickup_latitude)

#Haversine Formula and Great Circle Distance Approach

dlon <-  lon2 - lon1  
dlat <-  lat2 - lat1 
a <-  sin(dlat / 2)^2 + cos(lat1) * cos(lat2) * sin(dlon / 2)^2
c <-  2 * asin(sqrt(a))  
r <- 6371

#Distance in Kilometers

d <- c * r

#Calculation of Average Speed

modified_train <- train %>%
  mutate(t = train$trip_duration/3600,
         s = d/t)

#s is speed in Km/h

modified_train_this <- modified_train %>%
  mutate(hour = hour(pickup_datetime)) %>%
  group_by(hour) %>%
  dplyr::summarise(average_speed = mean(s),
                   standard_deviation = sd(s)
                   )

set.seed(42)
kable(x = sample_n(modified_train_this, 10),
      format = 'pandoc',
      col.names = c("Hour of the Day",
                    "Average Speed (Km/h)",
                    "Standard Deviation"),
      caption = "Sample of Speeds and Standard Deviations")

Sample of Speeds and Standard Deviations
Hour of the Day	Average Speed (Km/h)	Standard Deviation
16	12.60812	10.047240
4	21.99173	10.659310
0	17.66529	9.729157
9	12.63069	8.116976
3	19.90431	11.498547
17	12.49368	7.371774
23	16.85722	8.753082
14	12.20083	7.277611
7	15.73354	7.800496
6	20.52107	11.757863

p1 <- ggplot(modified_train_this) +
  geom_point(aes(x = hour, y = average_speed)) +
  labs(
    title = paste(
      "Average Speed Across Hours of the Day"
    ),
    x = "Hour of the day",
    y = "Average Speed (Km/h)"
    ) +
  theme_bw()
  

p2 <- ggplot(modified_train_this) +
  geom_point(aes(x = hour, y = standard_deviation))+
  labs(
    title = paste(
      "Standard Deviation of Speed Across Hours of the Day"
    ),
    subtitle = paste(
      "SD is more or less in a tight range except for two outliers"
    ),
    x = "Hour of the day",
    y = "Standard Deviation of Speed"
    ) +
  theme_bw()
  

Rmisc::multiplot(p1, p2, cols = 1)

From the previous section, I was trying to understand more about if “drivers are driving faster so as to complete more trips and bag their performance rewards”. The average speed of the trips between 7 AM to 11 AM/12 AM are very low i.e. less than 5 Km/h. Given that this is a rush hour, it can be considered to be normal. Moreover, the standard deviation in these hours is also under a tight corridor indicating that this performance has sustained over the whole dataset.

The average speed seems to be very high in the late nights (or early mornings if one prefers it that way), however, this is not exceptional as everyone else on the road is also in the same speed range (as indicated by the standard deviation measure from 12 AM to 7 AM).

There are two outliers indicating that in the evenings it is hard to say if the usual trip estimates can be relied upon. However, this might not be a very strong indication as the standard deviation is overall in a range except for two observations that are moreover not even consequtively located.

Thus, from the above two graphs, and reasoning as detailed the previous hypothesis can be discarded as there is no substantial visual evidence.

Relation Between Passenger Count and Trip Duration

There could possibily be a correlation between passenger_count and trip_duration. The causation if any could be based on the factors that the larger groups could have more number of stops, or take more time to even start the trip after the cab arrives, and/or travel for longer distances such as from home to restaurants and other extertainment places and even outstation trips that would normally be of lower frequency by individuals.

ggplot(train) +
  geom_boxplot(aes(x = reorder(passenger_count, trip_duration, FUN = median)
                   , y = trip_duration, outline = FALSE)
               ) +
  ylim(0,1e+03) +
  labs(
    title = paste(
      "Relation Between Passenger Count and Trip Duration"
    ),
    subtitle = paste(
      "Data ordered based on median values"
    ), 
    x = "Passenger count",
    y = "Trip duration (in seconds)"
  )

Although the section begined with a possible understanding for causation, if any, there seems to be no significant relationship between these two variables as the medians of each are in a tight range while the data for passenger groups of 7, 8, 9 people are not reliable as there are too few observations.

From a statistical perspective I would like to explore the model:

\[y = \beta_{0} +\beta_{1} x_{1} +\beta_{2} x_{2} + \cdots +\beta_{n} x_{n} + e_{i}\] Where:

y is trip_duration
x_i is passenger_count and in this model it is recorded as a dummy variable
e_i is the error term

model_1 <- lm(trip_duration ~ passenger_count, data = modified_train)
x = summary(model_1)$coefficients[, ]
kable(x = x, col.names = c("Estimate", "Std. Error", "t-value", "p-value"))

	Estimate	Std. Error	t-value	p-value
(Intercept)	1718.4333	676.1233	2.5415976	0.0110348
passenger_count1	-788.0336	676.1429	-1.1654837	0.2438234
passenger_count2	-712.9750	676.2197	-1.0543540	0.2917210
passenger_count3	-690.1971	676.4619	-1.0203045	0.3075843
passenger_count4	-664.9036	676.8370	-0.9823688	0.3259184
passenger_count5	-648.2012	676.3830	-0.9583345	0.3378943
passenger_count6	-657.0781	676.5428	-0.9712291	0.3314344
passenger_count7	-1698.7667	3098.3862	-0.5482747	0.5835034
passenger_count8	-1614.4333	5280.6918	-0.3057238	0.7598150
passenger_count9	-1158.4333	5280.6918	-0.2193715	0.8263607

In the dummy variable regression above, it is evident that none of the coefficients are significant at 5% significance level except for the intercept. The standard errors for passenger_count where the group size is either 7,8,9 is abnormally high (including their p-values) and this is attributable to the fact that there were very few trips in that category.

The model as a whole has an R-square of 8.374061410^{-5} - an insignificant amount of the variation in the dependant variable is being explained by all the dummy variables.

The number of variables and observations could change during the time of tidying the data.↩
Conflict between the summarise() function of plyr and dplyr caused the code to work in some states and not function in other. Thus the explicit mention “dplyr::summarise(n = n())”.↩
Subsequent Knit-ing has rendered errors in “head(arrange(count_of_trips[order(count_of_trips$n),]))” thus not providing the desired output ot the lowest number of rides on a given day in those specific months. However, this was overcome using plyr - “head(arrange(count_of_trips, ))”. Specific marked as comment in code block.↩
A similar error that had occored previously occured here again. Conflict between the summarise() function of plyr and dplyr caused the code to work in some states and not function in other. Thus the explicit mention “dplyr::summarise(n = n())”.↩
The following code block encounted the same issue regardign summarise as before. Conflict between the summarise() function of plyr and dplyr caused the code to work in some states and not function in other. Thus the explicit mention “dplyr::summarise(n = n())”. Implementing the same throughout the document.↩

Data Anaysis

Venkat Ellendula

29/04/2020