\(~\)

Introduction

In this project, we’ll look at five years’ worth of taxi trip data, from January 1, 2015 to December 31, 2019, provided by the City of Chicago.

We’d like to answer three main questions:

  1. How much revenue do taxi trips generate each year in Chicago?

  2. In which Chicago neighborhoods do most taxi rides originate?

  3. How well can we predict taxi demand at a specific time and place within Chicago?

\(~\)

\(~\)

Data Structure

The City of Chicago supplies a helpful data dictionary that describes the variables of the dataset as follows:

Variable | Description | Type
---|---|---
unique_key | A unique identifier for the trip. | text
taxi_id | A unique identifier for the taxi. | text
trip_start_timestamp | When the trip started, rounded to the nearest 15 minutes. | datetime
trip_end_timestamp | When the trip ended, rounded to the nearest 15 minutes. | datetime
trip_seconds | Time of the trip in seconds. | numeric
trip_miles | Distance of the trip in miles. | numeric
pickup_census_tract | The Census Tract where the trip began. | text
dropoff_census_tract | The Census Tract where the trip ended. | text
pickup_community_area | The Community Area where the trip began. This will be blank for locations outside Chicago. | numeric
dropoff_community_area | The Community Area where the trip ended. This will be blank for locations outside Chicago. | numeric
fare | The fare for the trip. | numeric
tips | The tip for the trip. Cash tips generally will not be recorded. | numeric
tolls | The tolls for the trip. | numeric
extras | Extra charges for the trip. | numeric
trip_total | Total cost of the trip, the total of the previous columns. | numeric
payment_type | Type of payment for the trip. | text
company | The taxi company. | text
pickup_latitude | The latitude of the center of the pickup Community Area. | numeric
pickup_longitude | The longitude of the center of the pickup Community Area. | numeric
pickup_location | The latitude/longitude combination of the center of the pickup Community Area. | geospatial
dropoff_latitude | The latitude of the center of the dropoff Community Area. | numeric
dropoff_longitude | The longitude of the center of the dropoff Community Area. | numeric
dropoff_location | The latitude/longitude combination of the center of the dropoff Community Area. | geospatial

\(~\)

Accessing the Data

These five years of data include over 83 million taxi trips across 23 variables, as described on the City of Chicago’s website. Due to the large size of the data, a direct download of all of it proved difficult without setting up an additional database. Luckily, Google already hosts this data publicly in its “Chicago Taxi Trips” BigQuery dataset.
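
For reference, here’s a minimal sketch of how one might query this public table from R with the bigrquery package. The billing project name below is a placeholder; any Google Cloud project you can bill queries against will work:

library(bigrquery)

# Placeholder: replace with your own Google Cloud billing project
billing <- "my-gcp-project"

sql <- "
  SELECT COUNT(*) AS n_trips
  FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
  WHERE trip_start_timestamp >= '2015-01-01'
    AND trip_start_timestamp <  '2020-01-01'
"

# Run the query in BigQuery, then download the (tiny) result set
tb <- bq_project_query(billing, sql)
bq_table_download(tb)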

\(~\)

\(~\)

Querying and Filtering the Data

Per the City of Chicago’s own documentation, “the dataset is collected from a variety of hardware and software under real-world conditions”, and therefore may contain erroneous values. We made the following judgment calls to aggregate and narrow down the data in our initial query:

  • Kept only trips starting between January 1, 2015 and December 31, 2019
  • Dropped records missing pickup_census_tract, trip_miles, or trip_seconds
  • Dropped likely-erroneous extremes: fares of $1,000 or more, and trips longer than 8 hours (28,800 seconds)
  • Aggregated the remaining trips to the date-hour-census_tract level

In addition, we’ve also derived the following variables in the query: num_rides (the count of trips in each group), pickup_date, dow (day of week), hour, and trip_minutes (trip_seconds converted to minutes).

\(~\)

The full SQL query can be seen here:


            SELECT 
                COUNT(unique_key)                            AS num_rides,
                EXTRACT(DATE      FROM trip_start_timestamp) AS pickup_date,
                EXTRACT(DAYOFWEEK FROM trip_start_timestamp) AS dow,
                EXTRACT(HOUR      FROM trip_start_timestamp) AS hour,
                SUM(fare)                                    AS fare,
                SUM(tips)                                    AS tips,
                SUM(tolls)                                   AS tolls,
                SUM(extras)                                  AS extras,
                SUM(trip_total)                              AS trip_total,
                SUM(trip_seconds)/60                         AS trip_minutes,
                SUM(trip_miles)                              AS trip_miles,
                pickup_census_tract,
                pickup_latitude,
                pickup_longitude,
                pickup_location
            FROM 
                `bigquery-public-data.chicago_taxi_trips.taxi_trips`
            WHERE 
                -- Trips starting in 2015-2019, inclusive of all of Dec 31, 2019
                trip_start_timestamp >= '2015-01-01' AND
                trip_start_timestamp <  '2020-01-01' AND
                -- Drop records missing key fields
                pickup_census_tract      IS NOT NULL AND
                trip_miles               IS NOT NULL AND
                trip_seconds             IS NOT NULL AND
                -- Drop likely-erroneous extremes: fares of $1,000+
                -- and trips longer than 8 hours
                fare < 1000                          AND
                trip_seconds < 28800
            GROUP BY 
                EXTRACT(DATE      FROM trip_start_timestamp),
                EXTRACT(DAYOFWEEK FROM trip_start_timestamp),
                EXTRACT(HOUR      FROM trip_start_timestamp),
                pickup_census_tract,
                pickup_latitude,
                pickup_longitude,
                pickup_location

\(~\)

This leaves us with 83,828,086 taxi trips and 15 total variables.

\(~\)

Data Manipulation and Augmentation

\(~\)

Given that we’d eventually like to predict taxi demand in Chicago, we’ve downloaded data from NOAA on temperature, wind speed, and precipitation for each day in our 5-year dataset. (Documentation about this dataset can be found here.) The data are gathered at NOAA’s station at Chicago O’Hare International Airport, and we can join them to our existing dataset by date.

Since our data is aggregated at the date-hour-census_tract level, we can also engineer some simple per-trip variables to streamline the EDA and question-answering process.

data <- rides %>% 
  # Per-trip averages from the aggregated totals
  mutate(fare_trip     = fare/num_rides,
         minutes_trip  = trip_minutes/num_rides,
         distance_trip = trip_miles/num_rides,
         # BigQuery's DAYOFWEEK runs 1 (Sunday) to 7 (Saturday),
         # so this flags Friday, Saturday, and Sunday
         weekend       = if_else(dow == 1 | dow >= 6, 1, 0),
         year          = year(pickup_date)) %>% 
  # Join the daily NOAA weather data by date
  left_join(temp %>% 
              clean_names() %>% 
              select(date,
                     wind_day_avg         = awnd,
                     precipitation_inches = prcp,
                     temp_day_avg         = tavg,
                     temp_day_min         = tmin,
                     temp_day_max         = tmax),
            by = c("pickup_date" = "date"))

\(~\)

Exploratory Data Analysis (EDA)

\(~\)

First, let’s explore the distributions of fares, ride time, and trip distance:

\(~\)

Fares:

ggplot(data, aes(x = fare_trip)) +
  ggtitle("Most Fares Are $5-15") +
  theme_classic() +
  theme(plot.title       = element_text(face = "bold", hjust = 0.5, size = 15),
        axis.text        = element_text(color = "black", size  = 12),
        axis.title.x     = element_text(color = "black", size  = 12),
        axis.title.y     = element_text(color = "black", size  = 12)) +
  geom_density(color = "black", fill = "lightblue", alpha = 0.6) +
  scale_x_continuous(name = "Ride Fare", breaks = seq(0,50, by = 5), limits = c(0,50), 
                     labels = paste0("$", seq(0,50, by = 5)),
                     ) +
  scale_y_continuous(name = "Density") +
  geom_vline(xintercept = median(data$fare_trip), linetype = "dashed", color = "red")

  • Median fare (red line): $9.50
  • Mean fare: $12.09
  • Most ride fares are clustered around $5-15 (likely within downtown)

\(~\)

Ride Time:

ggplot(data, aes(x = minutes_trip)) +
  ggtitle("Most Trips Are 5-15 Minutes") +
  theme_classic() +
  theme(plot.title       = element_text(face = "bold", hjust = 0.5, size = 15),
        axis.text        = element_text(color = "black", size  = 12),
        axis.title.x     = element_text(color = "black", size  = 12),
        axis.title.y     = element_text(color = "black", size  = 12)) +
  geom_density(color = "black", fill = "pink", alpha = 0.6) +
  scale_x_continuous(name = "Ride Time (Minutes)", breaks = seq(0,60, by = 5), limits = c(0,60)) +
  scale_y_continuous(name = "Density") +
  geom_vline(xintercept = median(data$minutes_trip), linetype = "dashed", color = "red")

  • Median ride time (red line): 10.31 minutes
  • Mean ride time: 12.19 minutes
  • Most ride times are clustered around 5-15 minutes
  • A long tail exists, but most rides don’t last more than an hour

\(~\)

Trip Distance:

ggplot(data, aes(x = distance_trip)) +
  ggtitle("Most Trips Are Not Far (<4 Miles)") +
  theme_classic() +
  theme(plot.title       = element_text(face = "bold", hjust = 0.5, size = 15),
        axis.text        = element_text(color = "black", size  = 12),
        axis.title.x     = element_text(color = "black", size  = 12),
        axis.title.y     = element_text(color = "black", size  = 12)) +
  geom_density(color = "black", fill = "orange", alpha = 0.6) +
  scale_x_continuous(name = "Trip Distance (Miles)", breaks = seq(0,20, by = 2), limits = c(0,20)) +
  scale_y_continuous(name = "Density") +
  geom_vline(xintercept = median(data$distance_trip), linetype = "dashed", color = "red")

  • Median trip distance (red line): 2 miles
  • Mean trip distance: 3.23 miles
  • Most trips are <4 miles
  • A long tail of trips exists around 18 miles in distance (these may be trips to the airport)

\(~\)

Question 1: How much revenue do taxi trips generate each year in Chicago?

ggplot(data %>% group_by(year) %>% 
         summarise(fare = sum(fare)),
       aes(y = fare, x = year)) +
  geom_bar(position = "dodge", stat = "identity", fill = "lightgreen", color = "black") + 
  theme_classic() +
  ggtitle("Chicago Taxi Fare Revenues Are Declining") +
  theme(plot.title       = element_text(face  = "bold" , hjust = 0.5, size = 15),
        axis.text        = element_text(color = "black", size  = 12),
        axis.title.x     = element_text(color = "black", size  = 12),
        axis.title.y     = element_text(color = "black", size  = 12))  +
  scale_y_continuous(name = "Fares", 
                     labels = paste0("$", seq(0,300, by = 50), "M"),
                     breaks = 10^6 * seq(0,300, by = 50)
                     ) +
  scale_x_continuous(name = "Year") 

\(~\)

We see that revenues declined steadily from 2016 to 2019. It would be interesting to know whether this is due to greater competition on price, or simply fewer trips being taken. Let’s take a look at the average fare per trip:

\(~\)

ggplot(data %>% group_by(year) %>% 
         summarise(fare_trip = mean(fare_trip)),
       aes(y = fare_trip, x = year)) +
  geom_bar(position = "dodge", stat = "identity", fill = "lightgreen", color = "black") + 
  theme_classic() +
  ggtitle("But Taxi Fares per Trip Have Held Steady") +
  theme(plot.title       = element_text(face  = "bold" , hjust = 0.5, size = 15),
        axis.text        = element_text(color = "black", size  = 12),
        axis.title.x     = element_text(color = "black", size  = 12),
        axis.title.y     = element_text(color = "black", size  = 12))  +
  scale_y_continuous(name = "Fare per Trip",
                     labels = paste0("$", seq(0,16, by = 2)),
                     breaks = seq(0,16, by = 2)
                     ) +
  scale_x_continuous(name = "Year") 

\(~\)

So it looks like the decline in total revenue is due to fewer overall taxi trips in each of these years. Let’s take a look:

\(~\)

ggplot(data %>% group_by(year) %>% 
         summarise(num_rides = sum(num_rides)),
       aes(y = num_rides, x = year)) +
  geom_bar(position = "dodge", stat = "identity", fill = "red", color = "black", alpha = 0.6) + 
  theme_classic() +
  ggtitle("Conventional Taxi Ridership is Steadily Declining in Chicago") +
  theme(plot.title       = element_text(face  = "bold" , hjust = 0.5, size = 15),
        axis.text        = element_text(color = "black", size  = 12),
        axis.title.x     = element_text(color = "black", size  = 12),
        axis.title.y     = element_text(color = "black", size  = 12))  +
  scale_y_continuous(name = "Total Rides",
                     labels = paste0(seq(0,25, by = 5), "M"),
                     breaks = 10^6 * seq(0,25, by = 5)
                     ) +
  scale_x_continuous(name = "Year") 

\(~\)

Question 2: In which Chicago neighborhoods do most taxi rides originate?

First, we need to import the shapefiles of census tracts. We found a helpful source here: Big Bytes Census Tract Shapefiles. Since this file includes census tracts for the entire United States, we can restrict these shapes to those in our data by keeping rows whose GEOID matches a pickup_census_tract.

# Import census-tract shapefiles, restrict to relevant Chicago-area census tracts, export for Tableau
shapes <- read_csv(header$ford_av('census_tract_shapefiles_all.csv')) %>% 
  filter(GEOID %in% data$pickup_census_tract,
         GEOID != 17031990000 # Removes large polygon in Lake Michigan
         ) %>% 
  mutate(pickup_census_tract = as.character(GEOID)) %>% 
  select(-GEOID) %>% 
  write_csv(header$ford_av('chicago_shapefiles.csv'))

# Export existing data for Tableau
write_csv(data %>%
            mutate(pickup_census_tract = as.character(pickup_census_tract),
                   day_of_week         = case_when(dow == 1 ~ "Sunday",
                                                   dow == 2 ~ "Monday",
                                                   dow == 3 ~ "Tuesday",
                                                   dow == 4 ~ "Wednesday",
                                                   dow == 5 ~ "Thursday",
                                                   dow == 6 ~ "Friday",
                                                   dow == 7 ~ "Saturday"),
                   hour_of_day         = paste0(hour,":00")),
          header$ford_av('chicago_filtered.csv'))

\(~\)

Then we can map out the shapes in Tableau and plot maps showing where rides originate. It’s clear that most taxi pickups occur in three locations: downtown Chicago (the Loop) and the city’s two major airports, O’Hare and Midway.
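
For readers without Tableau, here’s a rough ggplot sketch of the same idea. Note that the latitude, longitude, and point_order vertex columns are assumptions about the shapefile CSV’s layout (the row-per-vertex format Tableau polygon maps expect), not confirmed fields:

# Total pickups per census tract; cast the tract ID to character
# to match the exported shapes
pickups_by_tract <- data %>%
  mutate(pickup_census_tract = as.character(pickup_census_tract)) %>%
  group_by(pickup_census_tract) %>%
  summarise(total_rides = sum(num_rides))

# Hypothetical vertex columns: latitude, longitude, point_order
shapes %>%
  arrange(pickup_census_tract, point_order) %>%
  left_join(pickups_by_tract, by = "pickup_census_tract") %>%
  ggplot(aes(x = longitude, y = latitude,
             group = pickup_census_tract, fill = total_rides)) +
  geom_polygon(color = "grey30", linewidth = 0.1) +
  coord_quickmap() +
  scale_fill_viridis_c(name = "Total Pickups") +
  theme_void()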

\(~\)

Question 3: How well can we predict taxi demand at a specific time and place within Chicago?

\(~\)

We’d like to build a model that predicts how many taxis should be dispatched within a given census tract. We initially propose the following model:

\[\text{pickups} = \beta_0 + \beta_1\,\text{day\_of\_week} + \beta_2\,\text{hour\_of\_day} + \beta_3\,\text{wind\_speed} + \beta_4\,\text{temperature} + \beta_5\,\text{precipitation} + \varepsilon\]

We’ll proceed by building and comparing two models: a multiple linear regression and a random forest.

\(~\)

Multiple Linear Regression

In order to use multiple linear regression, we’ll need to run the model we proposed above and perform checks on the underlying assumptions of linear regression, primarily:

  1. The predictors are linearly independent from each other (no multicollinearity)
  2. The predictors are not correlated with the error term (no endogeneity)
  3. The variance of the errors is constant (homoscedasticity)

We can build, run, and test the model as follows, using data from census tract 17031839100 in “The Loop”, which has the most concentrated pickup activity.

# Subset data to 'The Loop'
loop <- data %>% filter(pickup_census_tract == 17031839100)

# Multiple Linear Regression with model as specified above
m1 <- lm(num_rides ~ as.factor(dow) + as.factor(hour) + wind_day_avg + temp_day_avg + precipitation_inches, 
         data = loop)

# Summarize
summary(m1)
## 
## Call:
## lm(formula = num_rides ~ as.factor(dow) + as.factor(hour) + wind_day_avg + 
##     temp_day_avg + precipitation_inches, data = loop)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -843.62 -109.60  -17.86  119.09  791.39 
## 
## Coefficients:
##                       Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)          -160.4882     5.5640 -28.844 < 0.0000000000000002 ***
## as.factor(dow)2       245.5871     3.0685  80.034 < 0.0000000000000002 ***
## as.factor(dow)3       333.3046     3.0520 109.207 < 0.0000000000000002 ***
## as.factor(dow)4       352.8490     3.0444 115.901 < 0.0000000000000002 ***
## as.factor(dow)5       363.6324     3.0368 119.742 < 0.0000000000000002 ***
## as.factor(dow)6       315.0677     3.0344 103.831 < 0.0000000000000002 ***
## as.factor(dow)7        44.4827     3.0362  14.651 < 0.0000000000000002 ***
## as.factor(hour)1      -36.8735     5.6317  -6.547      0.0000000000592 ***
## as.factor(hour)2      -53.2798     5.7237  -9.309 < 0.0000000000000002 ***
## as.factor(hour)3      -52.8490     5.8880  -8.976 < 0.0000000000000002 ***
## as.factor(hour)4      -63.1941     5.7881 -10.918 < 0.0000000000000002 ***
## as.factor(hour)5      -65.3388     5.6611 -11.542 < 0.0000000000000002 ***
## as.factor(hour)6      -30.6971     5.6126  -5.469      0.0000000454356 ***
## as.factor(hour)7       54.9300     5.6041   9.802 < 0.0000000000000002 ***
## as.factor(hour)8      179.1029     5.5979  31.995 < 0.0000000000000002 ***
## as.factor(hour)9      233.7886     5.5979  41.764 < 0.0000000000000002 ***
## as.factor(hour)10     254.5764     5.5971  45.483 < 0.0000000000000002 ***
## as.factor(hour)11     387.3976     5.5964  69.223 < 0.0000000000000002 ***
## as.factor(hour)12     471.6847     5.5964  84.284 < 0.0000000000000002 ***
## as.factor(hour)13     451.8989     5.5964  80.748 < 0.0000000000000002 ***
## as.factor(hour)14     426.9592     5.5964  76.292 < 0.0000000000000002 ***
## as.factor(hour)15     467.1378     5.5964  83.471 < 0.0000000000000002 ***
## as.factor(hour)16     589.3932     5.5964 105.317 < 0.0000000000000002 ***
## as.factor(hour)17     656.8496     5.5964 117.371 < 0.0000000000000002 ***
## as.factor(hour)18     624.3324     5.5964 111.560 < 0.0000000000000002 ***
## as.factor(hour)19     500.6682     5.5964  89.463 < 0.0000000000000002 ***
## as.factor(hour)20     328.5110     5.5964  58.701 < 0.0000000000000002 ***
## as.factor(hour)21     207.8606     5.5964  37.142 < 0.0000000000000002 ***
## as.factor(hour)22     169.2819     5.5964  30.248 < 0.0000000000000002 ***
## as.factor(hour)23      85.0304     5.5964  15.194 < 0.0000000000000002 ***
## wind_day_avg            2.2319     0.2314   9.648 < 0.0000000000000002 ***
## temp_day_avg           -0.4266     0.0422 -10.110 < 0.0000000000000002 ***
## precipitation_inches    3.0284     2.6734   1.133                0.257    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 169 on 42905 degrees of freedom
## Multiple R-squared:  0.7282, Adjusted R-squared:  0.728 
## F-statistic:  3593 on 32 and 42905 DF,  p-value: < 0.00000000000000022

\(~\)

The initial regression model shows that all proposed variables are highly significant except for precipitation_inches. Before proceeding, let’s examine those assumptions.

\(~\)

1. The predictors are linearly independent from each other (no multicollinearity)

We can test this by checking the correlations of the numeric variables with one another, as well as by examining the variance inflation factor (VIF) among predictors in the model. Let’s perform both tests below:

# Correlation among the numeric weather predictors
cor(select(loop, wind_day_avg, temp_day_avg, precipitation_inches))
##                      wind_day_avg temp_day_avg precipitation_inches
## wind_day_avg           1.00000000   -0.1991421           0.08735175
## temp_day_avg          -0.19914207    1.0000000           0.12544508
## precipitation_inches   0.08735175    0.1254451           1.00000000

The correlations among the three numeric predictors are all less than 0.2 in absolute value. Since values below 0.5 shouldn’t be problematic, we can move on to testing the VIF among predictors. If the values are below 10, we can be confident there is no major multicollinearity in our linear model.

car::vif(m1)
##                          GVIF Df GVIF^(1/(2*Df))
## as.factor(dow)       1.009349  6        1.000776
## as.factor(hour)      1.002139 23        1.000046
## wind_day_avg         1.059285  1        1.029216
## temp_day_avg         1.065367  1        1.032166
## precipitation_inches 1.032881  1        1.016308

All of our VIF values are close to 1 and nowhere near 10. Let’s move on to the next assumption.

\(~\)

2. The predictors are not correlated with the error term (no endogeneity)

While there are mathematical approaches to answer this question, there are some basic questions we can ask to decide whether this is a problem. They are as follows:

Could the number of taxi pickups in a given census area cause or change any of the following variables?

  • dow
  • hour
  • wind_day_avg
  • temp_day_avg
  • precipitation_inches

Without much thought, it’s clear that the answer to all of these is “no”: the number of taxi pickups in a tract cannot change the day of the week or the weather. We can move on to the third assumption.

\(~\)

3. The variance of the errors is constant (homoscedasticity)

This is best evaluated by examining diagnostic plots of the model itself. Our “Residuals vs. Fitted” plot should show a band of points distributed evenly along both the x- and y-axes. Given that this is count data, however, we do expect to see a sharp downward-sloping lower boundary before we perform any kind of transformation.

plot(m1, which = 1)

It looks like there are underlying patterns in this data that might do better with a transformation. Given that it’s count data, let’s try taking the log of the response variable to smooth out this plot.

# Multiple linear regression with log-transformed response
m2 <- lm(log(num_rides) ~ as.factor(dow) + as.factor(hour) + wind_day_avg + temp_day_avg + precipitation_inches, 
         data = loop)

# Check the Residuals vs. Fitted plot again
plot(m2, which = 1)

Although it isn’t perfect, this represents an improvement on our original model and supports a reasonable assumption of homoscedasticity.
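
As a more formal check, one could also run a Breusch-Pagan test on the transformed model; a quick sketch using the lmtest package is below. (With roughly 43,000 observations, even tiny deviations from constant variance will register as statistically significant, so the diagnostic plot above remains the more practical guide.)

library(lmtest)

# Breusch-Pagan test: H0 = the error variance is constant
bptest(m2)

With these checks performed, we can drop the insignificant precipitation_inches predictor and move this transformed model into the next phase: cross-validation.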

# Define training control
set.seed(123)
trc <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Train multiple linear regression model
cv_mlr <- train(log(num_rides) ~ as.factor(dow) + as.factor(hour) + wind_day_avg + temp_day_avg,
               data = loop,
               method = "lm",
               trControl = trc)

# Summarize the results
print(cv_mlr)
## Linear Regression 
## 
## 42938 samples
##     4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 38644, 38644, 38645, 38644, 38644, 38644, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.7352667  0.8007847  0.5160302
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

Our results show the following:

  • Root Mean Squared Error (RMSE): 0.7352
  • R-squared: 0.80
  • Mean Absolute Error (MAE): 0.516
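
Note that because the response is log(num_rides), these metrics are on the log scale. As a rough translation back to raw ride counts (a small worked example; the exact interpretation depends on the error distribution):

# An MAE of ~0.516 in log space corresponds to a typical
# multiplicative error on the predicted ride count of roughly:
exp(0.516)   # ~1.68x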

Random Forest

In order to see if a non-linear model can outperform our existing model, we will implement a random forest model and compare our output metrics:

# Define training control
set.seed(123)
trc <- trainControl(method = "repeatedcv", number = 10 , repeats = 3)

# Train random forest model
cv_rf <- train(log(num_rides) ~ as.factor(dow) + as.factor(hour) + wind_day_avg + temp_day_avg,
               data = loop,
               method = "ranger",
               trControl = trc,

               # Based on prior tuning, these were the optimal parameters found
               tuneGrid  = expand.grid(mtry = 16,
                                       splitrule = "variance",
                                       min.node.size = 5)

               # In order to find importance of variables, must specify in the model
               ,importance = "impurity"
               )

# Summarize the results
print(cv_rf)
## Random Forest 
## 
## 42938 samples
##     4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 38644, 38644, 38645, 38644, 38644, 38644, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE      
##   0.546325  0.8900585  0.3659163
## 
## Tuning parameter 'mtry' was held constant at a value of 16
## Tuning
##  parameter 'splitrule' was held constant at a value of variance
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5

We can see that with the optimal tuning parameters, our results show the following:

  • Root Mean Squared Error (RMSE): 0.5463
  • R-squared: 0.89
  • Mean Absolute Error (MAE): 0.366

The random forest outperforms our original multiple linear regression model on every metric.
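
Since both models were trained with the same seed and resampling scheme, we can also compare them fold-by-fold with caret’s resamples() helper (a quick sketch):

# Collect both cross-validated models and summarize their resampled metrics
comp <- resamples(list(MLR = cv_mlr, RF = cv_rf))
summary(comp)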

Let’s take a look at the variables with the most predictive power in our random forest model.

vi <- varImp(cv_rf)$importance %>%
  tibble::rownames_to_column("variable") %>%   # add_rownames() is deprecated
  rename(Importance = Overall) %>%
  arrange(desc(Importance)) %>%
  filter(row_number() <= 10) %>%
  # Tidy up dummy-variable names for plotting
  mutate(variable = str_replace_all(variable, "as.factor\\(|_day_avg", ""),
         variable = str_replace_all(variable, "\\)", "_"))

ggplot(vi, aes(x=reorder(variable, -Importance), y = Importance)) +
  geom_bar(position = "dodge", stat = "identity", fill = "blue", color = "black", alpha = 0.6) +
  theme_classic() +
  ggtitle("Hour of the Day is the Strongest Predictor") +
  theme(plot.title       = element_text(face  = "bold" , hjust = 0.5, size = 15),
        axis.text        = element_text(color = "black", size  = 12),
        axis.title.x     = element_text(color = "black", size  = 12),
        axis.title.y     = element_text(color = "black", size  = 12))  +
  scale_y_continuous(name = "Relative Importance",
                     breaks = seq(0,100, by = 20)
                     ) +
  scale_x_discrete(name = "Predictor")

Test Predictions

Let’s try three different scenarios for how many taxis should be available in “The Loop” neighborhood given the following conditions:

# 1. Monday at 3:00pm, 50 degrees, and 15 mph of wind speed
exp(predict(cv_rf, data.frame(dow = 2, hour = 15, wind_day_avg = 15, temp_day_avg = 50)))
## [1] 562.0273
# 2. Wednesday at 8:00am, 45 degrees, and 5 mph of wind speed
exp(predict(cv_rf, data.frame(dow = 4, hour = 8,  wind_day_avg = 5,  temp_day_avg = 45)))
## [1] 331.2549
# 3. Friday at 7:00pm, 80 degrees, and 2 mph of wind speed
exp(predict(cv_rf, data.frame(dow = 6, hour = 19, wind_day_avg = 2,  temp_day_avg = 80)))
## [1] 545.2977

\(~\)

So we can see that for “The Loop” neighborhood of Chicago, each specific set of circumstances yields an expected number of taxi pickups.

  1. Monday between 3:00-4:00pm, 50 degrees, and 15 mph of wind speed = 562 taxi pickups
  2. Wednesday between 8:00-9:00am, 45 degrees, and 5 mph of wind speed = 331 taxi pickups
  3. Friday between 7:00-8:00pm, 80 degrees, and 2 mph of wind speed = 545 taxi pickups

\(~\)

Conclusion

\(~\)

  1. Before COVID-19, taxi trip revenue was declining
    • 2019 taxi revenue in Chicago was about $125M.
    • That’s nearly a 50% drop from 2016, when revenue exceeded $250M.

\(~\)

  2. Most rides originate in downtown Chicago and its two major airports
    • The visualization above also shows the greatest demand occurs on weekend nights.

\(~\)

  3. This predictive model could be useful to taxi/AV fleet dispatchers optimizing business decisions
    • An expansion of this model to all neighborhoods could be used to make strategic decisions about the size of a taxi/AV fleet in a particular city.
    • It could also be used to place vehicles in the appropriate neighborhoods to pre-empt demand and optimize vehicle utilization.

\(~\)

  4. But it has limitations
    • We only ran these models for one specific census tract (due to constraints on computing time/power).
    • We rounded the timestamps to the nearest hour, but the original data come in 15-minute increments, so the model could be made even more granular.
    • This only represents data from taxis, which appeared to be in decline before COVID-19, and likely underrepresents true demand. Obtaining data from competing sources like Uber or Lyft would help build a truer picture of demand for rides in the city.

\(~\)

  5. Further work could make it more useful to decision-makers
    • Additional data, greater computing power, and more advanced models (e.g. XGBoost and neural networks) could be used to build a more robust platform with much greater precision.
    • Additional business context would also help in understanding how this type of model may already be applied in practice (e.g. taxi drivers in the city already know downtown gets busy on weekend nights). Gaining insight from practitioners could enrich the model with features that have been omitted.