The New York City Taxi and Limousine Commission is the agency responsible for licensing and regulating New York City's medallion (yellow) taxicabs, for-hire vehicles (community-based liveries, black cars, and luxury limousines), commuter vans, and paratransit vehicles. Green taxis, unlike yellow ones, are not allowed to pick up passengers within the densely populated core of Manhattan.
In this report, we perform exploratory data analysis on green taxi trips and try to understand the following:
The following steps were taken to complete the study:
Collect data from the New York City Taxi and Limousine Commission website for September 2015 (http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml).
Understand the accompanying data dictionary (http://www.nyc.gov/html/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf).
Clean the data according to the objectives of the study.
Manipulate the data to arrive at specific results.
Perform hypothesis testing.
Apply a linear regression model to predict the tip as a percentage of the total fare.
This study will help green taxi riders understand the time required to travel a given distance, the average speed of taxis at a particular hour of the day, the expected fare for a given distance, and the tip amounts that green taxi drivers can expect.
To analyze the green taxi data, we will use the following R packages:
library(data.table)
library(ggplot2)
library(tidyverse)
library(chron)
library(dplyr)
Import Green Taxi Data for the month of September 2015
# Import downloaded data
green <- fread("https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2015-09.csv")
# Check dimensions
dim_green <- dim(green)
dim_green
## [1] 1494926 21
There are 1,494,926 observations and 21 variables in the imported data set.
The following code lets us examine the structure and a summary of the data.
str(green)
summary(green)
After summarizing the imported data, the following has been found:
Now, we check for duplicate records in the data set. Using the unique() function, we conclude that there are no duplicate records, since the number of rows is unchanged.
unique_rec <- unique(green)
dim(unique_rec)
## [1] 1494926 21
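An equivalent check, shown here as a minimal sketch, counts rows that are exact duplicates of an earlier row; it should return zero, in line with the unique() result above.
# Sketch: number of rows identical to an earlier row (expected to be 0)
sum(duplicated(green))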
The following data cleaning steps are taken to obtain a clean data set: drop columns that are not needed for the analysis (Ehail_fee, Store_and_fwd_flag, Trip_type, VendorID) and remove records with negative charges or a total amount below the $2.50 initial fare.
green_temp <- green %>%
select(-c( Ehail_fee, Store_and_fwd_flag, Trip_type, VendorID)) %>%
filter((Fare_amount >= 0) &
(Extra >= 0) &
(MTA_tax >= 0) &
(Tip_amount >= 0) &
(Tolls_amount >= 0) &
(improvement_surcharge >= 0) &
(Total_amount >= 2.5))
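As a quick sanity check (a sketch, assuming the tidyverse is loaded as above), the minimum of each monetary column should now be non-negative.
# Sketch: minimum value of each monetary column after cleaning
green_temp %>%
  select(Fare_amount, Extra, MTA_tax, Tip_amount,
         Tolls_amount, improvement_surcharge, Total_amount) %>%
  summarise_all(min)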
Now, we look at the dimensions, structure, and summary of our cleaned data set.
dim(green_temp)
str(green_temp)
summary(green_temp)
green_temp is a temporary table that will be used for our analysis. The structure of this table is given below.
A summary of the cleaned data set green_temp is given below.
Histogram of trip distance (Trip_distance)
ggplot(green_temp, aes(x = Trip_distance))+
geom_histogram(binwidth = 0.3, aes(fill = ..count..)) +
scale_fill_gradient(name = "Frequency",
low = "green",
high = "red", labels = c("0", "40K", "80K", "120K", "160K")) +
coord_cartesian(xlim = c(0,10)) +
ggtitle("Histogram for Trip Distance") +
xlab("Trip Distance") +
ylab("Frequency of Trip Distance")
Findings from the Trip_distance histogram:
Extract the hour value from the lpep_pickup_datetime column of the green_temp table.
green_temp$hour_time <- format(as.POSIXct(green_temp$lpep_pickup_datetime,
format = "%Y-%m-%d %H:%M"),
format = "%H")
We have extracted the hour from the pickup timestamp so that we know the exact hour of the day for each trip. (This yields 24 distinct values; "00" refers to midnight, and the other values correspond to each subsequent hour of the day.)
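As a quick check (a small sketch), the number of trips recorded in each hour can be tabulated from the new column.
# Sketch: number of trips in each hour of the day
table(green_temp$hour_time)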
Calculate the mean and median trip distance grouped by hour of the day.
mean_trip_distance <- green_temp %>% select(Trip_distance, hour_time) %>%
group_by(hour_time) %>%
summarise(mean_trip_distance = mean(Trip_distance))
head(mean_trip_distance,5)
## # A tibble: 5 x 2
## hour_time mean_trip_distance
## <chr> <dbl>
## 1 00 3.121554
## 2 01 3.023154
## 3 02 3.053454
## 4 03 3.218797
## 5 04 3.533137
median_trip_distance <- green_temp %>%
select(Trip_distance, hour_time) %>%
group_by(hour_time) %>%
summarise(median_trip_distance = median(Trip_distance))
head(median_trip_distance,5)
## # A tibble: 5 x 2
## hour_time median_trip_distance
## <chr> <dbl>
## 1 00 2.20
## 2 01 2.13
## 3 02 2.15
## 4 03 2.21
## 5 04 2.37
mean_trip_distance and median_trip_distance give the mean and median trip distance of green taxis for each hour of the day.
For example, the mean trip distance at hour "01" is 3.023 miles, which suggests that customers who travelled between 1 am and 2 am in September covered 3.023 miles per trip on average.
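To compare the two summaries visually, the hourly means and medians can be joined and plotted together; this is a sketch, and hourly_distance is a hypothetical object not used later in the report.
# Sketch: mean vs. median trip distance by hour of the day
hourly_distance <- mean_trip_distance %>%
  inner_join(median_trip_distance, by = "hour_time")
ggplot(hourly_distance, aes(x = hour_time)) +
  geom_point(aes(y = mean_trip_distance, colour = "Mean")) +
  geom_point(aes(y = median_trip_distance, colour = "Median")) +
  labs(title = "Mean and Median Trip Distance by Hour",
       x = "Hour of the day", y = "Trip distance (miles)", colour = "Statistic")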
# Trip duration as a difftime, expressed in hours
green_temp$trip_duration <- as.POSIXct(green_temp$Lpep_dropoff_datetime) -
as.POSIXct(green_temp$lpep_pickup_datetime)
units(green_temp$trip_duration) <- "hours"
# Average speed (miles per hour) for each trip
green_temp$avg_speed <- green_temp$Trip_distance/as.numeric(green_temp$trip_duration)
green_avg_speed <- green_temp %>% filter(Trip_distance > 0 & trip_duration > 0) %>%
mutate(avg_speed = Trip_distance/as.numeric(trip_duration)) %>%
select(lpep_pickup_datetime, hour_time,
Trip_distance, trip_duration, avg_speed)
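Because very short recorded durations can produce implausibly high speeds, it is worth glancing at the distribution of the computed speeds before aggregating (a small sketch).
# Sketch: distribution of computed average speeds (miles per hour)
summary(green_avg_speed$avg_speed)
quantile(green_avg_speed$avg_speed, probs = c(0.01, 0.50, 0.99), na.rm = TRUE)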
# fetch week_number
green_avg_speed$week_number <- sapply(strsplit(
as.character(green_avg_speed$lpep_pickup_datetime), " "),"[", 1) %>%
strftime(format = "%V") %>%
as.numeric()
unique(green_avg_speed$week_number)
## [1] 36 37 38 39 40
The above lines of code find the week number in which each trip was completed. The following points provide more detail on this chunk of code:
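As one such detail, the same ISO week number can be derived more directly from the pickup timestamp; this equivalent sketch should agree with the values computed above.
# Sketch: ISO week number taken straight from the POSIXct timestamp
week_check <- as.numeric(strftime(as.POSIXct(green_avg_speed$lpep_pickup_datetime),
format = "%V"))
identical(week_check, green_avg_speed$week_number)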
hourly_average_speed <- green_avg_speed %>%
group_by(hour_time) %>%
summarise(avg_speed_hourly = mean(avg_speed, na.rm = TRUE))
ggplot(hourly_average_speed, aes(x = hour_time, y = avg_speed_hourly, fill = hour_time)) +
geom_bar(stat = "identity") +
labs(title =" Hourly Average Speed in September 2015",
x = "Hour of the day",
y = "Hourly Average Speed") +
scale_fill_discrete(name = "Hour") +
coord_flip()
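To complement the chart, the hours with the highest and lowest average speeds can be listed directly (a small sketch).
# Sketch: three fastest and three slowest hours by average speed
hourly_average_speed %>% arrange(desc(avg_speed_hourly)) %>% head(3)
hourly_average_speed %>% arrange(avg_speed_hourly) %>% head(3)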
The above bar chart shows the hourly average speed of green taxis in September. From the bar chart we observe the following:
weekly_average_speed <- green_avg_speed %>%
group_by(week_number) %>%
summarise(avg_speed_weekly = mean(avg_speed, na.rm = TRUE))
ggplot(weekly_average_speed, aes(x = week_number , y = avg_speed_weekly)) +
geom_bar( stat = "identity", aes(fill = factor(week_number))) +
labs(title =" Weekly Average Speed in September 2015",
x = "Week Number of the Year 2015",
y = "Weekly Average Speed") +
scale_fill_discrete(name = "Week number")
The above bar chart shows the weekly average speed of green taxis in September. From the bar chart we observe the following:
An ANOVA test allows us to determine whether the differences in the weekly average speed of green taxis are statistically significant.
hypo_weekly_speed <- aov(week_number ~ avg_speed_weekly, weekly_average_speed)
summary(hypo_weekly_speed)
## Df Sum Sq Mean Sq F value Pr(>F)
## avg_speed_weekly 1 4.286 4.286 2.25 0.231
## Residuals 3 5.714 1.905
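For comparison, a more conventional one-way ANOVA (a sketch, not part of the original output) would treat the week number as a grouping factor and use the trip-level speeds rather than the five weekly means.
# Sketch: trip-level one-way ANOVA of average speed across weeks
hypo_trip_speed <- aov(avg_speed ~ factor(week_number), data = green_avg_speed)
summary(hypo_trip_speed)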
Findings from the ANOVA test are given below:
In this section, we study the relationship between the tip amount and the total fare amount. The 1st and 99th percentiles below are used to identify extreme fare and tip values so that they can be excluded from the model.
quantile(green_temp$Fare_amount, probs = 0.99)
## 99%
## 50
quantile(green_temp$Fare_amount, probs = 0.01)
## 1%
## 3
quantile(green_temp$Tip_amount, probs = 0.99)
## 99%
## 9.24
quantile(green_temp$Tip_amount, probs = 0.01)
## 1%
## 0
green_sample <- green_temp[sample(1:nrow(green_temp), 160000,
replace = FALSE),]
tip_relation <- green_sample %>%
filter(Payment_type == 1 & Fare_amount >= 3 &
Fare_amount <= 50) %>%
select(-c(1,2,3,4,5,6,7,20)) %>%
mutate(total_fare = Fare_amount + Extra + Tolls_amount +
improvement_surcharge + MTA_tax)
To build a linear regression model, we take a random sample of 160,000 observations from the green_temp table, keep only trips paid by credit card (Payment_type == 1) with fare amounts between $3 and $50 (the 1st and 99th percentiles), and use this sample to model the relationship between the tip amount and the total fare amount.
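Since the stated objective also mentions the tip as a percentage of the total fare, that quantity can be computed as well; this is a sketch, and tip_pct is a hypothetical object not used elsewhere in this report.
# Sketch: tip expressed as a percentage of the total fare
tip_pct <- 100 * tip_relation$Tip_amount / tip_relation$total_fare
summary(tip_pct)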
Let us look at how the tip amount varies with the total fare.
ggplot(tip_relation, aes(x = total_fare, y = Tip_amount)) +
geom_point(alpha = 0.3) +
geom_jitter() +
geom_smooth() +
coord_cartesian(xlim = c(0,60), ylim = c(0,15)) +
ggtitle("Relation between Tip amount and Total fare")+
xlab("Total Fare") +
ylab("Tip Amount")
The above plot shows how the tip amount varies with the total fare amount. From this plot, the following has been noticed:
linear_model <- lm(Tip_amount ~ total_fare, data = tip_relation)
summary(linear_model)
##
## Call:
## lm(formula = Tip_amount ~ total_fare, data = tip_relation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.982 -0.626 0.245 0.677 196.702
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1796006 0.0151204 11.88 <2e-16 ***
## total_fare 0.1567148 0.0008711 179.91 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.103 on 74237 degrees of freedom
## Multiple R-squared: 0.3036, Adjusted R-squared: 0.3036
## F-statistic: 3.237e+04 on 1 and 74237 DF, p-value: < 2.2e-16
The fitted equation of the simple linear regression model is given below:
\[ \text{Tip Amount} = 0.180 + 0.157 \times \text{Total Fare Amount} \]
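As a usage sketch, the fitted model can produce expected tips for a few illustrative total fares; the fare values below are examples, not taken from the data.
# Sketch: predicted tip for illustrative total fares of $10, $25, and $50
predict(linear_model, newdata = data.frame(total_fare = c(10, 25, 50)))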
We found the following from the simple linear regression model used to predict the tip:
From the linear regression model and the scatter plot, we conclude the following:
Limitations