\(~\)
In this project, we’ll be looking at 5 years worth of taxi trip data from January 1, 2015 to December 31, 2019 provided by the City of Chicago.
We’d like to answer three main questions:
How much revenue do taxi trips generate each year in Chicago?
In which Chicago neighborhoods do most taxi rides originate?
How well can we predict taxi demand at a specific time and place within Chicago?
\(~\)
\(~\)
The City of Chicago supplies a helpful data dictionary that describes the variables of the dataset as follows:
| Variable | Description | Type |
|---|---|---|
| unique_key | A unique identifier for the trip. | text |
| taxi_id | A unique identifier for the taxi. | text |
| trip_start_timestamp | When the trip started, rounded to the nearest 15 minutes. | datetime |
| trip_end_timestamp | When the trip ended, rounded to the nearest 15 minutes. | datetime |
| trip_seconds | Time of the trip in seconds. | numeric |
| trip_miles | Distance of the trip in miles. | numeric |
| pickup_census_tract | The Census Tract where the trip began. | text |
| dropoff_census_tract | The Census Tract where the trip ended. | text |
| pickup_community_area | The Community Area where the trip began. This will be blank for locations outside Chicago. | numeric |
| dropoff_community_area | The Community Tract where the trip ended. This will be blank for locations outside Chicago. | numeric |
| fare | The fare for the trip. | numeric |
| tips | The tip for the trip. Cash tips generally will not be recorded. | numeric |
| tolls | The tolls for the trip. | numeric |
| extras | Extra charges for the trip. | numeric |
| trip_total | Total cost of the trip, the total of the previous columns. | numeric |
| payment_type | Type of payment for the trip. | text |
| company | The taxi company. | text |
| pickup_latitude | The latitude of the center of the pickup Community Area. | numeric |
| pickup_longitude | The longitude of the center of the pickup Community Area. | numeric |
| pickup_location | The latitude/longitude combination of the center of the pickup Community Area. | geospatial |
| dropoff_latitude | The latitude of the center of the dropoff Community Area. | numeric |
| dropoff_longitude | The longitude of the center of the dropoff Community Area. | numeric |
| dropoff_location | The latitude/longitude combination of the center of the dropoff Community Area. | geospatial |
\(~\)
The data for these 5 years worth of data include over 83 million taxi trips among 23 variables, as described at the City of Chicago’s website. Due to the large size of the data, a direct download of all the data proved difficult without setting up an additional database. Luckily, Google already hosts this data publicly on its “Chicago Taxi Trips” BigQuery Database site.
\(~\)
\(~\)
Per the City of Chicago’s own documentation, “the dataset is collected from a variety of hardward and software under real-world conditions”, and therefore may have recorded erroneous values. We made the following judgment calls to aggregate and narrow down the data in our initial query:
pickup_census_tract:
fare:
trip_miles:
trip_seconds:
In addition, we’ve also added the following variables after some data manipulation:
pickup_date:
dow:
hour:
num_rides:
trip_minutes:
\(~\)
The full SQL query can be seen here:
SELECT
COUNT(unique_key) AS num_rides,
EXTRACT(DATE FROM trip_start_timestamp) AS pickup_date,
EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS dow,
EXTRACT(HOUR FROM trip_start_timestamp) AS hour,
SUM(fare) AS fare,
SUM(tips) AS tips,
SUM(tolls) AS tolls,
SUM(extras) AS extras,
SUM(trip_total) AS trip_total,
SUM(trip_seconds)/60 AS trip_minutes,
SUM(trip_miles) AS trip_miles,
pickup_census_tract,
pickup_latitude,
pickup_longitude,
pickup_location
FROM
`bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
trip_start_timestamp >= '2015-01-01' AND
trip_start_timestamp <= '2019-12-31' AND
pickup_census_tract IS NOT NULL AND
trip_miles IS NOT NULL AND
trip_seconds IS NOT NULL AND
fare < 1000 AND
trip_seconds < 28800
GROUP BY
EXTRACT(DATE FROM trip_start_timestamp),
EXTRACT(DAYOFWEEK FROM trip_start_timestamp),
EXTRACT(HOUR FROM trip_start_timestamp),
pickup_census_tract,
pickup_latitude,
pickup_longitude,
pickup_location
\(~\)
This leaves us with 83,828,086 taxi trips and 15 total variables.
\(~\)
\(~\)
Given that we’d eventually like to predict taxi demand in Chicago, we’ve downloaded data from NOAA about temperature, wind speeds, and precipitation for each day in our 5-year dataset. (Documentation about this dataset can be found here.) The data are gathered at the NOAA’s station at the Chicago O’Hare airport. We can join it to our existing dataset by date.
Since our data is aggregated at the date-hour-census_tract level, we can also feature some simple additional variables for simplicity in the EDA and question-answering process.
data <- rides %>%
mutate(fare_trip = fare/num_rides,
minutes_trip = trip_minutes/num_rides,
distance_trip = trip_miles/num_rides,
weekend = if_else(dow == 1 | dow >= 6, 1, 0),
year = year(pickup_date)) %>%
left_join(temp %>%
clean_names() %>%
select(date,
wind_day_avg = awnd,
precipitation_inches = prcp,
temp_day_avg = tavg,
temp_day_min = tmin,
temp_day_max = tmax)
, by = c("pickup_date" = "date"))
\(~\)
\(~\)
First, we’d like to explore some exploratory graphs on fares, ride time and trip distance:
\(~\)
ggplot(data, aes(x = fare_trip)) +
ggtitle("Most Fares Are $5-15") +
theme_classic() +
theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 15),
axis.text = element_text(color = "black", size = 12),
axis.title.x = element_text(color = "black", size = 12),
axis.title.y = element_text(color = "black", size = 12)) +
geom_density(color = "black", fill = "lightblue", alpha = 0.6) +
scale_x_continuous(name = "Ride Fare", breaks = seq(0,50, by = 5), limits = c(0,50),
labels = paste0("$", seq(0,50, by = 5)),
) +
scale_y_continuous(name = "Density") +
geom_vline(xintercept = median(data$fare_trip), linetype = "dashed", color = "red")
\(~\)
ggplot(data, aes(x = minutes_trip)) +
ggtitle("Most Trips Are 5-15 Minutes") +
theme_classic() +
theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 15),
axis.text = element_text(color = "black", size = 12),
axis.title.x = element_text(color = "black", size = 12),
axis.title.y = element_text(color = "black", size = 12)) +
geom_density(color = "black", fill = "pink", alpha = 0.6) +
scale_x_continuous(name = "Ride Time (Minutes)", breaks = seq(0,60, by = 5), limits = c(0,60)) +
scale_y_continuous(name = "Density") +
geom_vline(xintercept = median(data$minutes_trip), linetype = "dashed", color = "red")
\(~\)
ggplot(data, aes(x = distance_trip)) +
ggtitle("Most Trips Are Not Far (<4 Miles)") +
theme_classic() +
theme(plot.title = element_text(face = "bold", hjust = 0.5, size = 15),
axis.text = element_text(color = "black", size = 12),
axis.title.x = element_text(color = "black", size = 12),
axis.title.y = element_text(color = "black", size = 12)) +
geom_density(color = "black", fill = "orange", alpha = 0.6) +
scale_x_continuous(name = "Trip Distance (Miles)", breaks = seq(0,20, by = 2), limits = c(0,20)) +
scale_y_continuous(name = "Density") +
geom_vline(xintercept = median(data$distance_trip), linetype = "dashed", color = "red")
\(~\)
ggplot(data %>% group_by(year) %>%
summarise(year = mean(year),
fare = sum(fare) )
, aes(y = fare, x = year)) +
geom_bar(position = "dodge", stat = "identity", fill = "lightgreen", color = "black") +
theme_classic() +
ggtitle("Chicago Taxi Fare Revenues Are Declining") +
theme(plot.title = element_text(face = "bold" , hjust = 0.5, size = 15),
axis.text = element_text(color = "black", size = 12),
axis.title.x = element_text(color = "black", size = 12),
axis.title.y = element_text(color = "black", size = 12)) +
scale_y_continuous(name = "Fares",
labels = paste0("$", seq(0,300, by = 50), "M"),
breaks = 10^6 * seq(0,300, by = 50)
) +
scale_x_continuous(name = "Year")
\(~\)
We see that revenues are on a steady decline from 2016-2019. It would be interesting to know if this is due to greater competition on price, or if there are just fewer trips being taken. We’ll take a look at average fare per trip:
\(~\)
ggplot(data %>% group_by(year) %>%
summarise(year = mean(year),
fare_trip = mean(fare_trip))
, aes(y = fare_trip, x = year)) +
geom_bar(position = "dodge", stat = "identity", fill = "lightgreen", color = "black") +
theme_classic() +
ggtitle("But Taxi Fares per Trip Have Held Steady") +
theme(plot.title = element_text(face = "bold" , hjust = 0.5, size = 15),
axis.text = element_text(color = "black", size = 12),
axis.title.x = element_text(color = "black", size = 12),
axis.title.y = element_text(color = "black", size = 12)) +
scale_y_continuous(name = "Fare per Trip",
labels = paste0("$", seq(0,16, by = 2)),
breaks = seq(0,16, by = 2)
) +
scale_x_continuous(name = "Year")
\(~\)
So it looks like the total revenue is due to fewer overall taxi trips in each of these years. Let’s take a look:
\(~\)
ggplot(data %>% group_by(year) %>%
summarise(year = mean(year),
num_rides = sum(num_rides))
, aes(y = num_rides, x = year)) +
geom_bar(position = "dodge", stat = "identity", fill = "red", color = "black", alpha = 0.6) +
theme_classic() +
ggtitle("Conventional Taxi Ridership is Steadily Declining in Chicago") +
theme(plot.title = element_text(face = "bold" , hjust = 0.5, size = 15),
axis.text = element_text(color = "black", size = 12),
axis.title.x = element_text(color = "black", size = 12),
axis.title.y = element_text(color = "black", size = 12)) +
scale_y_continuous(name = "Total Rides",
labels = paste0(seq(0,25, by = 5), "M"),
breaks = 10^6 * seq(0,25, by = 5)
) +
scale_x_continuous(name = "Year")
\(~\)
First, we need to import the shape files of census tracts. We found a helpful source here: Big Bytes Census Tract Shapefiles. Since this file includes census tracts for the entire United States, we can restrict these shapes to those included in our data by joining where GEOID matches pickup_census_tract.
# Import census-tract shapefiles, restrict to relevant Chicago-area census tracts, export for Tableau
shapes <- read_csv(header$ford_av('census_tract_shapefiles_all.csv')) %>%
filter(GEOID %in% data$pickup_census_tract,
GEOID != 17031990000 # Removes large polygon in Lake Michigan
) %>%
mutate(pickup_census_tract = as.character(GEOID)) %>%
select(-GEOID) %>%
write_csv(header$ford_av('chicago_shapefiles.csv'))
# Export existing data for Tableau
write_csv(data %>%
mutate(pickup_census_tract = as.character(pickup_census_tract),
day_of_week = case_when(dow == 1 ~ "Sunday",
dow == 2 ~ "Monday",
dow == 3 ~ "Tuesday",
dow == 4 ~ "Wednesday",
dow == 5 ~ "Thursday",
dow == 6 ~ "Friday",
dow == 7 ~ "Saturday"),
hour_of_day = paste0(hour,":00")),
header$ford_av('chicago_filtered.csv'))
\(~\)
Then we can map out the shapes in Tableau, and plot maps showing ride origination data. It’s clear that most taxi pickups occur in three locations:
\(~\)
\(~\)
We’d like to build a predictive model that can predict how many taxis should be dispatched within a given census tract. We initially propose the following model:
\[pickups= \beta_0 + \beta_1day\_of\_week + \beta_2hour\_of\_day + \beta_3wind\_speed + \beta_4temperature + \beta_5precipitation + \varepsilon\]
We’ll proceed by building and comparing the following models:
\(~\)
In order to use multiple linear regression, we’ll need to run the model we proposed above and perform checks on the underlying assumptions of linear regression, primarily:
We can build, run, and test the model as follows, using the most concentrated data from “The Loop” in census tract 17031839100.
# Subset data to 'The Loop'
loop <- data %>% filter(pickup_census_tract == 17031839100)
# Multiple Linear Regression with model as specified above
m1 <- lm(num_rides ~ as.factor(dow) + as.factor(hour) + wind_day_avg + temp_day_avg + precipitation_inches,
data = loop)
# Summarize
summary(m1)
##
## Call:
## lm(formula = num_rides ~ as.factor(dow) + as.factor(hour) + wind_day_avg +
## temp_day_avg + precipitation_inches, data = loop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -843.62 -109.60 -17.86 119.09 791.39
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -160.4882 5.5640 -28.844 < 0.0000000000000002 ***
## as.factor(dow)2 245.5871 3.0685 80.034 < 0.0000000000000002 ***
## as.factor(dow)3 333.3046 3.0520 109.207 < 0.0000000000000002 ***
## as.factor(dow)4 352.8490 3.0444 115.901 < 0.0000000000000002 ***
## as.factor(dow)5 363.6324 3.0368 119.742 < 0.0000000000000002 ***
## as.factor(dow)6 315.0677 3.0344 103.831 < 0.0000000000000002 ***
## as.factor(dow)7 44.4827 3.0362 14.651 < 0.0000000000000002 ***
## as.factor(hour)1 -36.8735 5.6317 -6.547 0.0000000000592 ***
## as.factor(hour)2 -53.2798 5.7237 -9.309 < 0.0000000000000002 ***
## as.factor(hour)3 -52.8490 5.8880 -8.976 < 0.0000000000000002 ***
## as.factor(hour)4 -63.1941 5.7881 -10.918 < 0.0000000000000002 ***
## as.factor(hour)5 -65.3388 5.6611 -11.542 < 0.0000000000000002 ***
## as.factor(hour)6 -30.6971 5.6126 -5.469 0.0000000454356 ***
## as.factor(hour)7 54.9300 5.6041 9.802 < 0.0000000000000002 ***
## as.factor(hour)8 179.1029 5.5979 31.995 < 0.0000000000000002 ***
## as.factor(hour)9 233.7886 5.5979 41.764 < 0.0000000000000002 ***
## as.factor(hour)10 254.5764 5.5971 45.483 < 0.0000000000000002 ***
## as.factor(hour)11 387.3976 5.5964 69.223 < 0.0000000000000002 ***
## as.factor(hour)12 471.6847 5.5964 84.284 < 0.0000000000000002 ***
## as.factor(hour)13 451.8989 5.5964 80.748 < 0.0000000000000002 ***
## as.factor(hour)14 426.9592 5.5964 76.292 < 0.0000000000000002 ***
## as.factor(hour)15 467.1378 5.5964 83.471 < 0.0000000000000002 ***
## as.factor(hour)16 589.3932 5.5964 105.317 < 0.0000000000000002 ***
## as.factor(hour)17 656.8496 5.5964 117.371 < 0.0000000000000002 ***
## as.factor(hour)18 624.3324 5.5964 111.560 < 0.0000000000000002 ***
## as.factor(hour)19 500.6682 5.5964 89.463 < 0.0000000000000002 ***
## as.factor(hour)20 328.5110 5.5964 58.701 < 0.0000000000000002 ***
## as.factor(hour)21 207.8606 5.5964 37.142 < 0.0000000000000002 ***
## as.factor(hour)22 169.2819 5.5964 30.248 < 0.0000000000000002 ***
## as.factor(hour)23 85.0304 5.5964 15.194 < 0.0000000000000002 ***
## wind_day_avg 2.2319 0.2314 9.648 < 0.0000000000000002 ***
## temp_day_avg -0.4266 0.0422 -10.110 < 0.0000000000000002 ***
## precipitation_inches 3.0284 2.6734 1.133 0.257
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 169 on 42905 degrees of freedom
## Multiple R-squared: 0.7282, Adjusted R-squared: 0.728
## F-statistic: 3593 on 32 and 42905 DF, p-value: < 0.00000000000000022
\(~\)
The initial regression model shows that all proposed variables are highly significant except for precipitation_inches. Before proceeding, let’s examine those assumptions.
\(~\)
We can test this by checking the correlation between numeric variables with one another, as well as examining the “variance inflation factor” (VIF) among predictors in the model. Let’s perform both of those tests below:
# Correlation of wind and temperature
cor(select(loop, wind_day_avg, temp_day_avg, precipitation_inches))
## wind_day_avg temp_day_avg precipitation_inches
## wind_day_avg 1.00000000 -0.1991421 0.08735175
## temp_day_avg -0.19914207 1.0000000 0.12544508
## precipitation_inches 0.08735175 0.1254451 1.00000000
The correlation among the three numeric predictors are all less than 0.2 in absolute value. Since values below 0.5 shouldn’t be problematic, we can move on to test the VIF among predictors. If the values are <10, we can confirm there is not major multicollinearity in our linear model.
car::vif(m1)
## GVIF Df GVIF^(1/(2*Df))
## as.factor(dow) 1.009349 6 1.000776
## as.factor(hour) 1.002139 23 1.000046
## wind_day_avg 1.059285 1 1.029216
## temp_day_avg 1.065367 1 1.032166
## precipitation_inches 1.032881 1 1.016308
All of our VIF values are close to 1, and don’t come close to approaching 10. Let’s move on to the next assumption.
\(~\)
This is best evaluated by examining plots of the model itself. Our “Residuals vs. Fitted” plot should have a band of data points equally distributed across both the x- and y-axis. Given that this is count data, however, we do expect to see a sharp downward sloping line before we perform any kind of transformations.
plot(m1, which = 1)
It looks like there are underlying patterns in this data that might do better with a transformation. Given that it’s count data, let’s try taking the log of the response varible to smooth out this plot.
# Multiple Linear Regression with model as specified above
m2 <- lm(log(num_rides) ~ as.factor(dow) + as.factor(hour) + wind_day_avg + temp_day_avg + precipitation_inches,
data = loop)
# Summarize
plot(m2, which = 1)
Although it isn’t perfect, this represents an improvement on our original model. We can also confirm a reasonable assumption of homoscedasticity. With these checks performed, we can drop the precipitation_inches predictor and move this transformed model into the next phase: cross-validation.
# Define training control
set.seed(123)
trc <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
# Train multiple linear regression model
cv_mlr <- train(log(num_rides) ~ as.factor(dow) + as.factor(hour) + wind_day_avg + temp_day_avg,
data = loop,
method = "lm",
trControl = trc)
# Summarize the results
print(cv_mlr)
## Linear Regression
##
## 42938 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 38644, 38644, 38645, 38644, 38644, 38644, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.7352667 0.8007847 0.5160302
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Our results show the following:
In order to see if a non-linear model can outperform our existing model, we will implement a random forest model and compare our output metrics:
# Define training control
set.seed(123)
trc <- trainControl(method = "repeatedcv", number = 10 , repeats = 3)
# Train multiple linear regression model
cv_rf <- train(log(num_rides) ~ as.factor(dow) + as.factor(hour) + wind_day_avg + temp_day_avg,
data = loop,
method = "ranger",
trControl = trc,
# Based on prior tuning, these were the optimal parameters found
tuneGrid = expand.grid(mtry = 16,
splitrule = "variance",
min.node.size = 5)
# In order to find importance of variables, must specify in the model
,importance = "impurity"
)
# Summarize the results
print(cv_rf)
## Random Forest
##
## 42938 samples
## 4 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 38644, 38644, 38645, 38644, 38644, 38644, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 0.546325 0.8900585 0.3659163
##
## Tuning parameter 'mtry' was held constant at a value of 16
## Tuning
## parameter 'splitrule' was held constant at a value of variance
##
## Tuning parameter 'min.node.size' was held constant at a value of 5
We can see that with the optimal tuning parameters, our results show the following:
This is an even more powerful model than our original multiple linear regression model.
Let’s take a look at the variables with the most predictive power in our random forest model.
vi <- varImp(cv_rf)$importance %>%
add_rownames("variable") %>%
rename(Importance = Overall) %>%
arrange(desc(Importance)) %>%
filter(row_number() <= 10) %>%
mutate(variable = str_replace_all(variable, "as.factor\\(|_day_avg", ""),
variable = str_replace_all(variable, "\\)", "_"))
ggplot(vi, aes(x=reorder(variable, -Importance), y = Importance)) +
geom_bar(position = "dodge", stat = "identity", fill = "blue", color = "black", alpha = 0.6) +
theme_classic() +
ggtitle("Hour of the Day is the Strongest Predictor") +
theme(plot.title = element_text(face = "bold" , hjust = 0.5, size = 15),
axis.text = element_text(color = "black", size = 12),
axis.title.x = element_text(color = "black", size = 12),
axis.title.y = element_text(color = "black", size = 12)) +
scale_y_continuous(name = "Relative Importance",
breaks = seq(0,100, by = 20)
) +
scale_x_discrete(name = "Predictor")
Let’s try three different scenarios for how many taxis should be available in “The Loop” neighborhood given the following conditions:
# 1. Monday at 3:00pm, 50 degrees, and 15 mph of wind speed
exp(predict(cv_rf, data.frame(dow = 2, hour = 15, wind_day_avg = 15, temp_day_avg = 50)))
## [1] 562.0273
# 2. Wednesday at 8:00am, 45 degrees, and 5 mph of wind speed
exp(predict(cv_rf, data.frame(dow = 4, hour = 8, wind_day_avg = 5, temp_day_avg = 45)))
## [1] 331.2549
# 3. Friday at 7:00pm, 80 degrees, and 2 mph of wind speed
exp(predict(cv_rf, data.frame(dow = 6, hour = 19, wind_day_avg = 2, temp_day_avg = 80)))
## [1] 545.2977
\(~\)
So we can see that for “The Loop” neighborhood of Chicago, each of these specific set of circumstances can give us a specific number of expected taxi pickups.
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)