knitr::opts_chunk$set(echo = TRUE)
knitr::opts_chunk$set(fig.width=12, fig.height=8,fig.align = "center")

Motivation

We live in an era of sharing economies; Ride Share is a huge part of that eco-system and is an extremely interesting one that is deeply tied with our lives today.

My Goal for this project is two fold; On an exploration perpspective I am quite interested in seeing what are some interesting insights that we can find on people’s interaction with ride-share services. On an utility perspective, I would want to build a model that can help people better estimate how much commuting expense they’ll use through ride share services.

Specficially in line with the exploration, I aim to challenge and verify some of the typical assumptions we have regarding ride-share pricing. Things such as weekdays, time of the day, weather and even neighborhood. I specifically want to look into these areas:

Peak hours and ride prices: My general hypothesis is that peak hours DO affect ride prices due to the high amount demand present.
Weekends and holidays: Similarly it would seem reasonable to assume an increase in fare during holidays because of the propensity of people to go out and have fun; However I would say that may not be absolute but dependent on whether the volume of daily commuters offsets that propensity.
Weather(percipitation, wind and temperature): I would pose it that severe weather condition might decrease fare as people are more unlikely to go out. However weather conditions such as rain that occured in the middle of the day may increase fare due to an increased demand.
Neighborhood: I believe that neighborhood will have an effect on ride-share prices at least for the selected few. For instance, more central are are likely going to be more expensive

Data Cleaning and Aggregation

I aggregated a couple of data source to enable those findings, the detail of the sources can be accessed through the links below: https://data.cityofchicago.org/Transportation/Transportation-Network-Providers-Trips/m6dm-c72p/data https://www.wunderground.com/history/daily/us/il/chicago/KMDW/date/2019-5-7?cm_ven=localwx_history http://www.robparal.com/ChicagoCommunityAreaData.html https://www2.illinois.gov/cms/personnel/employeeresources/pages/stateholidays.aspx

#Load Library
library(readr)
require(gridExtra)
library(dplyr)
library(chron)
library(stringr)
library(ggplot2)

setwd('C:/Users/fzy20/Downloads')

#Load Datasets
trips = read_csv("Transportation_Network_Providers_-_Trips.csv")
holiday = read_csv("C:/Users/fzy20/Documents/Penn/Chicago Holidays.csv")
areaInfo = read_csv("C:/Users/fzy20/Documents/Penn/CommAreaInfo.csv")
areaCode = read_csv("C:/Users/fzy20/Downloads/CommAreas.csv")
weather = read_csv("C:/Users/fzy20/Documents/Penn/Weather Chicago.csv")

#Clean Data, drop data without proper location reference
trips_cleaned = trips[!is.na(trips$`Pickup Centroid Location`),]
trips_cleaned = trips_cleaned[!is.na(trips_cleaned$`Dropoff Centroid Location`),]


#Sample Data for analysis due to computational constraint
trips_selected = trips_cleaned[sample(1:nrow(trips_cleaned),size = 500000),]

#Pick relevant area information
areaInfo = areaInfo[,c(1:7)]

#Feature Engineer the Date and the Time
trips_selected$Start_Date = sapply(trips_selected$`Trip Start Timestamp`,function(x) return(strsplit(x, " ")[[1]][1]))
trips_selected$Start_Time = sapply(trips_selected$`Trip Start Timestamp`,function(x) return(strsplit(x, " ")[[1]][2]))
trips_selected$Start_AM_PM = sapply(trips_selected$`Trip Start Timestamp`,function(x) return(strsplit(x, " ")[[1]][3]))
trips_selected$End_Date = sapply(trips_selected$`Trip End Timestamp`,function(x) return(strsplit(x, " ")[[1]][1]))
trips_selected$End_Time = sapply(trips_selected$`Trip End Timestamp`,function(x) return(strsplit(x, " ")[[1]][2]))
trips_selected$End_AM_PM = sapply(trips_selected$`Trip End Timestamp`,function(x) return(strsplit(x, " ")[[1]][3]))
trips_selected$Start_Date = as.Date(trips_selected$Start_Date, format = "%m/%d/%Y")
trips_selected$End_Date = as.Date(trips_selected$End_Date, format = "%m/%d/%Y")
head(trips_selected)


#clean holiday datasets
holiday$`2018` = sapply(holiday$`2018`,function(x) return(paste(strsplit(x, ",")[[1]][2]," 2018")))
holiday$`2019` = sapply(holiday$`2019`,function(x) return(paste(strsplit(x, ",")[[1]][2]," 2019")))

#Change the vairable type to a date variable
holiday$`2018` = as.Date(holiday$`2018`,format = " %B %d %Y")
holiday$`2019` = as.Date(holiday$`2019`,format = " %B %d %Y")

#Concatenate the columns
holiday_1 = holiday[,c(1,2)]
holiday_2 = holiday[,c(1,3)]

#Rename Columns
colnames(holiday_1) = c("holiday","Date")
colnames(holiday_2) = c("holiday","Date")

#Put together the columns
holiday = rbind(holiday_1,holiday_2)
holiday$is_holiday = 1
head(holiday)


#Reframed weather data
weather$Date = as.Date(weather$Date, format = "%m/%d/%Y")
#Merge Weather and Holiday
weather_holiday = merge(weather, holiday, by.x = "Date",by.y = "Date", all.x = TRUE)
head(weather_holiday)


#Merge area informations
area_complete = merge(areaCode, areaInfo, by = c("Area Number","Name"))
head(area_complete)


#Merging Relevant datasets
trips_with_weather_holiday = merge(trips_selected, weather_holiday, by.x = 'Start_Date', by.y = 'Date', all.x = TRUE)
trip_final = merge(trips_with_weather_holiday, area_complete, by.x = "Pickup Community Area", by.y = "Area Number")
trip_final[is.na(trip_final$is_holiday),'is_holiday'] = 0
#Display final datasets
head(trip_final)


#Unselect columns that will not aid the pre-ride insights and predictive model
trip_final_tuned = trip_final %>% select(-c("Trip Start Timestamp","Trip End Timestamp","Dropoff Centroid Location",
                         "Pickup Centroid Location","Shared Trip Authorized","Trip ID", "Pickup Census Tract", "Dropoff Census Tract"))
#Add additional month and weekday feature
trip_final_tuned$Month = months(trip_final_tuned$Start_Date)
trip_final_tuned$Weekdays = weekdays(trip_final_tuned$Start_Date)
#reformat the median Household Income Variable
trip_final_tuned$Median_Household_Income = sapply(trip_final_tuned$Median_Household_Income, function(x) gsub("\\$|,","",x))
trip_final_tuned$Median_Household_Income = as.integer(trip_final_tuned$Median_Household_Income)
#convert feature to time
trip_final_tuned$Start_Time <- chron(times=trip_final_tuned$Start_Time)
head(trip_final_tuned)
#column names
colnames(trip_final_tuned)

#Create distance variable
trip_final_tuned$distance = sqrt((trip_final_tuned$`Pickup Centroid Longitude`- trip_final_tuned$`Dropoff Centroid Longitude`)^2 + (trip_final_tuned$`Pickup Centroid Latitude`- trip_final_tuned$`Dropoff Centroid Latitude`)^2)
#remove observations that has no time or fare
trip_final_tuned = trip_final_tuned[!is.na(trip_final_tuned$Fare),]
trip_final_tuned = trip_final_tuned[!is.na(trip_final_tuned$`Trip Seconds`),]

#Reconstruct column name to aid data training
names(trip_final_tuned)<-str_replace_all(names(trip_final_tuned), c(" " = "" , "," = "","-" = "_" ))
trip_final = trip_final_tuned[!is.na(trip_final_tuned$Fare),]

#Construct hour variables
trip_final$hour = paste(format(strptime(trip_final$Start_Time,"%H:%M:%S"),'%H'),trip_final$Start_AM_PM)

Visuals and Analysis

Instead of using the full dataset, I will pick a sample out of the dataset for the ease and speed of the analysis; feel free to implement this at scale if you see fit

#temp analysis
#trip_1 = trip_final
#trip_final = trip_final[sample(1:nrow(trip_final),size = 20000),]
trip_final = read.csv("https://raw.githubusercontent.com/zhangyidavidfan/Ride_Share_Price_Dashboard_Source_Code/master/trip_small_tuned.csv")
trip_final$hour = paste(format(strptime(trip_final$Start_Time,"%H:%M:%S"),'%H'),trip_final$Start_AM_PM)

Prior to moving into the analysis proper that deals with the aforementioned facets, the first general assumption that I would want to test is on the relationship between distance and prices. The positive correlation illustrated below indicates that distance must be controlled in our later experimenents and tests to avoid confounding effect, enabling us to really isolate the area that we would want to test.

#Distance and Prices
ggplot(data = trip_final, aes(x =distance, y = Fare)) + geom_point() + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + geom_smooth(method = "lm") + ggtitle("Distance and Prices") + annotate("text",label = paste("Correlation: ",cor(trip_final$distance, trip_final$Fare)),x=0.2,y=90)

Peak Hours and Ride Prices

It couldn’t seem anymore reasonable for prices to be affected by peak hours due to an increase in demand at those times. But how can we quantify the idea of demand? I believe, while with discrepancy, the best approximation for that will likely be the volume of rides. Following that logic it would then seem plausible to assert that: Price and Ride Volume should be directly related. To verify that, I decided to first survey the landscape and plot out the volume of rides and the average fare at each time bucket separated by 15 mins.

As seen from the figure below, the result is quite surprising. Seemingly the relationship of fare and number of trips are not close to directly related but in fact seems to be inversely related if anything. Not only that, furthermore, the fare across all time seem relatively stable, there were no increase as what I expected during peak hours. The only evident height in terms of fare in fact happened during the time span around 5am.

g1 = ggplot(data = trip_final[trip_final$Start_AM_PM == "AM",], aes(x =factor(Start_Time), y = Fare)) + geom_bar( stat = "summary", fun.y = "mean") + theme(axis.text.x = element_text(angle = 90, hjust = 0,size = 4)) + ggtitle("AM and average fare") + annotate("rect", xmin = 14, xmax = 23, ymin = 0, ymax = 15,
  alpha = 0.2, color = "red") + xlab("")

g2 = ggplot(data = trip_final[trip_final$Start_AM_PM == "PM",], aes(x =factor(Start_Time), y = Fare)) + geom_bar( stat = "summary", fun.y = "mean") + theme(axis.text.x = element_text(angle = 90, hjust = 0, size = 4))+ ggtitle("PM and average fare") + annotate("rect", xmin = 18, xmax = 27, ymin = 0, ymax = 15,
  alpha = 0.2, color = "red") + xlab("")

g3 = ggplot(data = trip_final[trip_final$Start_AM_PM == "AM",], aes(x =factor(Start_Time))) + geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 0,size = 4)) + ggtitle("AM and number of trips") + annotate("rect", xmin = 14, xmax = 23, ymin = 0, ymax = 350,
  alpha = 0.2, color = "red") + xlab("")

g4 = ggplot(data = trip_final[trip_final$Start_AM_PM == "PM",], aes(x =factor(Start_Time))) + geom_bar() + theme(axis.text.x = element_text(angle = 90, hjust = 0, size = 4))+ ggtitle("PM and number of trips")+ annotate("rect", xmin = 18, xmax = 27, ymin = 0, ymax = 350,
  alpha = 0.2, color = "red") + xlab("")

grid.arrange(g1,g2,g3,g4,ncol = 2,nrow = 2)

The boxplot below, further confirmed the observation. The changes in fare averages are most dirven by outliers. The median remained generally the same excluding once again the brief period around 5 am and 10 am. While this may be a sampling error, I am more inclined to believe that the 20000 sample has enough of a reason to make us believe that: Peak hour doesn’t affect prices that much

g1 = ggplot(data = trip_final[trip_final$Start_AM_PM == "AM",], aes(x =factor(Start_Time), y = Fare)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 90, hjust = 0,size = 6)) + ggtitle("AM and average fare") +
  coord_cartesian(ylim=c(0,110)) + xlab("")

g2 = ggplot(data = trip_final[trip_final$Start_AM_PM == "PM",], aes(x =factor(Start_Time), y = Fare)) + geom_boxplot() + theme(axis.text.x = element_text(angle = 90, hjust = 0, size = 6))+ ggtitle("PM and average fare") +
  coord_cartesian(ylim=c(0,110)) + xlab("")

grid.arrange(g1,g2,ncol = 2,nrow = 1)

However after controlling by time and looking at the regression below, while morning peak hours really shows no effect; it seems like evening peak hour has a significant effect trending upward. Other times remain effectless (even the one around 5am)

summary(lm(data = trip_final, Fare ~ hour + distance))

## 
## Call:
## lm(formula = Fare ~ hour + distance, data = trip_final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.152  -1.699  -0.189   1.425 100.519 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.892414   0.141154  27.576  < 2e-16 ***
## hour01 PM    0.122527   0.198452   0.617  0.53697    
## hour02 AM   -0.045616   0.216907  -0.210  0.83343    
## hour02 PM    0.263841   0.192604   1.370  0.17075    
## hour03 AM    0.100983   0.228576   0.442  0.65864    
## hour03 PM    0.063710   0.185726   0.343  0.73158    
## hour04 AM   -0.107749   0.236884  -0.455  0.64921    
## hour04 PM    0.574845   0.188068   3.057  0.00224 ** 
## hour05 AM   -0.410970   0.234452  -1.753  0.07963 .  
## hour05 PM    0.724828   0.179219   4.044 5.27e-05 ***
## hour06 AM   -0.288860   0.222301  -1.299  0.19382    
## hour06 PM    0.354009   0.178341   1.985  0.04716 *  
## hour07 AM   -0.020274   0.203462  -0.100  0.92063    
## hour07 PM    0.004597   0.181047   0.025  0.97974    
## hour08 AM    0.010334   0.195442   0.053  0.95783    
## hour08 PM   -0.129928   0.186867  -0.695  0.48688    
## hour09 AM    0.091616   0.207465   0.442  0.65879    
## hour09 PM    0.013256   0.186016   0.071  0.94319    
## hour10 AM   -0.297218   0.208793  -1.424  0.15461    
## hour10 PM   -0.128318   0.184855  -0.694  0.48759    
## hour11 AM   -0.702721   0.204467  -3.437  0.00059 ***
## hour11 PM    0.392928   0.186528   2.107  0.03517 *  
## hour12 AM    0.236884   0.190571   1.243  0.21387    
## hour12 PM   -0.096488   0.204099  -0.473  0.63640    
## distance    94.250550   0.459767 204.997  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.966 on 19975 degrees of freedom
## Multiple R-squared:  0.6799, Adjusted R-squared:  0.6795 
## F-statistic:  1768 on 24 and 19975 DF,  p-value: < 2.2e-16

This then brings to question: Why are the average fares around 5am so high and around peak hours low? The figure below gives our answer. 5am travels tend to have the longest distance across the board while peak hour distances are rather standard if anything lower than usual. Why? the interpretation that comes to my mind is that 5 am travellers likely decides to travel at 5 am because their work is quite far away from their home, this factor no longer plays into effect during our regression as it controls for distance yielding the results earlier.

g1 = ggplot(data = trip_final[trip_final$Start_AM_PM == "AM",], aes(x =factor(Start_Time), y = distance)) + geom_bar( stat = "summary", fun.y = "mean") + theme(axis.text.x = element_text(angle = 90, hjust = 0,size = 4)) + ggtitle("AM and average distance") + annotate("rect", xmin = 14, xmax = 23, ymin = 0, ymax = 0.125,
  alpha = 0.2, color = "red") + xlab("")

g2 = ggplot(data = trip_final[trip_final$Start_AM_PM == "PM",], aes(x =factor(Start_Time), y = distance)) + geom_bar( stat = "summary", fun.y = "mean") + theme(axis.text.x = element_text(angle = 90, hjust = 0,size = 4)) + ggtitle("PM and average distance") + annotate("rect", xmin = 18, xmax = 27, ymin = 0, ymax = 0.125,
  alpha = 0.2, color = "red") + xlab("")

grid.arrange(g1,g2,ncol = 2,nrow = 1)

Another interesting thing to look at will be the whole negatively correlation between count and fare. Looking at the correlation below, we could deduce that this correlation might be slightly contributed to distance but not wholey. What I believe drives this negatively correlation is the other side of demand: the supply. The reason why large amount trips correlates with lower fares is likely because there is a relatively even number of supply and demand on both side (i.e. uber and lyft drivers are becoming smarter), while the converse drives price up because while there is lesser demand around 5am or so there is even lesser supply of drivers. So what’s the takeway? No pain no gain, if you’re a ride-share service provider, then maybe waking up at 5am might just buy you the new Iphone :D.

x = trip_final %>% group_by(Start_Time) %>% summarise(count = n(), fare = mean(Fare), distance = mean(distance))
print(paste("the correlation between vehicle count and distance is: ", cor(x$count,x$distance)))

## [1] "the correlation between vehicle count and distance is:  -0.174874403843102"

print(paste("the correlation between vehicle count and fare is: ", cor(x$count,x$fare)))

## [1] "the correlation between vehicle count and fare is:  -0.312533441679878"

Weekends and Holidays

When we think about holidays and weekends, we tend to picture cars moving everywhere; people in the mall and restaurants. All seemingly a picture of life and prosperity. But how does this relate to the ride-sharing dynamic system. Will there be more demands and thus higher fares because people have a greater inlcination of going out? We will break this down to two aspects: one focusing on weekdays and one focusing on holidays.

Weekdays

To start off with weekdays; Surveying car requests over weekdays at different time buckets (AM Early: 12 midnight to 6 AM | AM Late: 6 AM to 12 Noon | PM Early: 12 Noon to 6 PM | PM Late 6 PM to 12 Midnight) provides some reasonable observations. 1. PM Late is consistently the one with the most amount of quantity requested, this would make sense as this is the time when most get off work/goes out to have fun 2. PM Lates during Friday and Saturday are particulalry huge likely because it is a “Free Night” 3. Sunday rides are evenly distributed likely since it is a free day however nights are no longer free

All these observations are quite standard and nothing too surprising.

trip_final$Start_AM_PM = as.character(trip_final$Start_AM_PM)
trip_final$Start_Time = chron(times=trip_final$Start_Time) 
trip_final[trip_final$Start_Time < chron(times ="6:00:00"),"Start_AM_PM"] = sapply(trip_final[trip_final$Start_Time < chron(times ="6:00:00"),"Start_AM_PM"], function(x) paste(x,"early")) 
trip_final[trip_final$Start_Time >= chron(times ="6:00:00"),"Start_AM_PM"] = sapply(trip_final[trip_final$Start_Time >= chron(times ="6:00:00"),"Start_AM_PM"], function(x) paste(x,"late")) 

ggplot(data = trip_final, aes(x = ordered(Weekdays, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", 
"Friday", "Saturday", "Sunday")),fill = trip_final$Start_AM_PM)) + geom_bar(position = "dodge") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) +ggtitle("Vehicle Count Across Time") + xlab("")

Cross tabbing the fare and distance across time shows less of a perfect correlation the two variables unlike before. This is quite interesting and deserves to be broken down further

g1 = ggplot(data = trip_final, aes(x = ordered(Weekdays, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", 
"Friday", "Saturday", "Sunday")),fill = trip_final$Start_AM_PM, y = Fare)) + geom_bar( stat = "summary", fun.y = "mean",position = "dodge") + theme(axis.text.x = element_text(angle = 90, hjust = 1),legend.position = "bottom",legend.text = element_text(size = 5),axis.title.x = element_text(size = 5),legend.title  = element_text(size = 5)) +ggtitle("Fare Acorss Time") + xlab("")

g2 = ggplot(data = trip_final, aes(x = ordered(Weekdays, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", 
"Friday", "Saturday", "Sunday")),fill = trip_final$Start_AM_PM, y = distance)) + geom_bar( stat = "summary", fun.y = "mean",position = "dodge") + theme(axis.text.x = element_text(angle = 90, hjust = 1),legend.position = "bottom",legend.text = element_text(size = 5),axis.title.x = element_text(size = 5),legend.title  = element_text(size = 5))+ggtitle("distance Acorss Time") + xlab("")

grid.arrange(g1,g2,ncol = 2)

In this next part, I will be using regression tools to analyze the differences between day time combinations while controlling for distance.

First off, the regression output below shines light on the most basic comparison: AM Early vs AM Late vs PM Early vs PM Late. Most variables are insignificant, however rides during PM Early seems to be the one that has a significantly higher prices. I would again attribute this to the lack of drivers during that time (do notice that the significance level is not at 2e16 even with 20000 rides, suggesting this could be affected by sampling)

summary(lm(data = trip_final, Fare ~ Start_AM_PM + distance))

## 
## Call:
## lm(formula = Fare ~ Start_AM_PM + distance, data = trip_final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.735  -1.692  -0.200   1.410 100.708 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.83372    0.08060  47.566  < 2e-16 ***
## Start_AM_PMAM late  -0.03592    0.09439  -0.381   0.7035    
## Start_AM_PMPM early  0.45217    0.09466   4.777  1.8e-06 ***
## Start_AM_PMPM late   0.14730    0.08912   1.653   0.0984 .  
## distance            93.96219    0.45802 205.150  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.972 on 19995 degrees of freedom
## Multiple R-squared:  0.6787, Adjusted R-squared:  0.6786 
## F-statistic: 1.056e+04 on 4 and 19995 DF,  p-value: < 2.2e-16

To the fun stuff! In the next few blocks I will be running over a couple more regression, each regressing over the 4 time bucket at each different day. I will be compiling all my insights after the codes. For reference the benchmark used for the regression are all time buckets on a Friday

trip_temp = trip_final[trip_final$is_holiday == FALSE,]
trip_temp = trip_temp[trip_temp$Start_AM_PM == "AM early",]
trip_temp$Weekdays_time = paste(trip_temp$Weekdays, trip_temp$Start_AM_PM)
summary(lm(data = trip_temp, Fare ~ Weekdays_time + distance))

## 
## Call:
## lm(formula = Fare ~ Weekdays_time + distance, data = trip_temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.906  -1.659  -0.337   1.188  78.012 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      4.53327    0.21617  20.971  < 2e-16 ***
## Weekdays_timeMonday AM early    -0.98513    0.31704  -3.107  0.00191 ** 
## Weekdays_timeSaturday AM early   0.02682    0.25489   0.105  0.91620    
## Weekdays_timeSunday AM early     0.16688    0.24826   0.672  0.50153    
## Weekdays_timeThursday AM early  -0.89888    0.30703  -2.928  0.00344 ** 
## Weekdays_timeTuesday AM early   -0.48668    0.35687  -1.364  0.17278    
## Weekdays_timeWednesday AM early -0.55502    0.30294  -1.832  0.06705 .  
## distance                        85.95481    1.18462  72.559  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.752 on 2558 degrees of freedom
## Multiple R-squared:  0.6747, Adjusted R-squared:  0.6738 
## F-statistic: 757.8 on 7 and 2558 DF,  p-value: < 2.2e-16

trip_temp = trip_final[trip_final$is_holiday == FALSE,]
trip_temp = trip_temp[trip_temp$Start_AM_PM == "AM late",]
trip_temp$Weekdays_time = paste(trip_temp$Weekdays, trip_temp$Start_AM_PM)
summary(lm(data = trip_temp, Fare ~ Weekdays_time + distance))

## 
## Call:
## lm(formula = Fare ~ Weekdays_time + distance, data = trip_temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.788  -1.725  -0.225   1.446  45.880 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.14567    0.15360  26.991   <2e-16 ***
## Weekdays_timeMonday AM late    -0.23112    0.22416  -1.031    0.303    
## Weekdays_timeSaturday AM late  -0.31320    0.19580  -1.600    0.110    
## Weekdays_timeSunday AM late     0.02647    0.19832   0.133    0.894    
## Weekdays_timeThursday AM late  -0.07783    0.20920  -0.372    0.710    
## Weekdays_timeTuesday AM late   -0.01342    0.23244  -0.058    0.954    
## Weekdays_timeWednesday AM late -0.14198    0.21208  -0.669    0.503    
## distance                       90.38419    0.89249 101.272   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.961 on 4761 degrees of freedom
## Multiple R-squared:  0.6834, Adjusted R-squared:  0.6829 
## F-statistic:  1468 on 7 and 4761 DF,  p-value: < 2.2e-16

trip_temp = trip_final[trip_final$is_holiday == FALSE,]
trip_temp = trip_temp[trip_temp$Start_AM_PM == "PM early",]
trip_temp$Weekdays_time = paste(trip_temp$Weekdays, trip_temp$Start_AM_PM)
summary(lm(data = trip_temp, Fare ~ Weekdays_time + distance))

## 
## Call:
## lm(formula = Fare ~ Weekdays_time + distance, data = trip_temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.972  -1.808  -0.202   1.509 100.535 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       4.2511     0.1638  25.951  < 2e-16 ***
## Weekdays_timeMonday PM early     -0.5916     0.2423  -2.442   0.0147 *  
## Weekdays_timeSaturday PM early   -0.3162     0.2167  -1.460   0.1445    
## Weekdays_timeSunday PM early     -0.8953     0.2283  -3.921 8.94e-05 ***
## Weekdays_timeThursday PM early   -0.1431     0.2283  -0.627   0.5307    
## Weekdays_timeTuesday PM early    -0.6210     0.2521  -2.463   0.0138 *  
## Weekdays_timeWednesday PM early  -0.3564     0.2306  -1.546   0.1223    
## distance                        101.7549     1.0106 100.691  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.379 on 4647 degrees of freedom
## Multiple R-squared:  0.6861, Adjusted R-squared:  0.6856 
## F-statistic:  1451 on 7 and 4647 DF,  p-value: < 2.2e-16

trip_temp = trip_final[trip_final$is_holiday == FALSE,]
trip_temp = trip_temp[trip_temp$Start_AM_PM == "PM late",]
trip_temp$Weekdays_time = paste(trip_temp$Weekdays, trip_temp$Start_AM_PM)
summary(lm(data = trip_temp, Fare ~ Weekdays_time + distance))

## 
## Call:
## lm(formula = Fare ~ Weekdays_time + distance, data = trip_temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -20.162  -1.616  -0.110   1.452  74.129 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      4.4434     0.1100  40.411  < 2e-16 ***
## Weekdays_timeMonday PM late     -1.0619     0.1755  -6.050 1.52e-09 ***
## Weekdays_timeSaturday PM late   -0.3169     0.1417  -2.236  0.02537 *  
## Weekdays_timeSunday PM late     -1.1307     0.1631  -6.931 4.58e-12 ***
## Weekdays_timeThursday PM late   -0.5752     0.1535  -3.747  0.00018 ***
## Weekdays_timeTuesday PM late    -1.0467     0.1812  -5.776 7.97e-09 ***
## Weekdays_timeWednesday PM late  -0.9271     0.1625  -5.706 1.20e-08 ***
## distance                        97.2918     0.7988 121.790  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.722 on 6857 degrees of freedom
## Multiple R-squared:  0.6842, Adjusted R-squared:  0.6838 
## F-statistic:  2122 on 7 and 6857 DF,  p-value: < 2.2e-16

On AM Early: With Friday as the control, most of the weekdays have a lower price compared to Friday, while the weekends have slightly higher prices (Although insignificant). Since AM Early includes all rides before 6AM, I would think this difference is likely one from rides that happened over late nights (midnight to 2am; feel free to explore this more). Breaking it down this way gives an evident explanation for the observable change

On AM Late: Not much changes here, price seems very stable with no factor too significant; I don’t have an extremely sound inference for this, but it could be due to how there is unlikely going to be a surplus of either demand or supply around a regular morning time. Over weekdays, there will be less demand and supply as people are usually working and over weekends more on both side

On PM Early: There also isn’t much information here except for sunday when Fare are particularly low. I believe this is due to the fact that Sunday afternoon is a time where most people who choose to spend resting instead of going out in order to recuperate for the week ahead. I guess it just so happened that this hit the supply more than it did for the demand.

On PM Late: The interesting and I guess the most obvious one. Much significant result here likely because it is benchmarked against Friday. It can be clearly seen that Friday night and Saturday night have the biggest boost in fares, with friday slightly ahead (maybe becuase of church on sunday?). Thursday also has a slight boost compared to the other days possibly because lots of happy hours tend to happen on Thursday

This section generally confirmed our hypothesis: however the weekend effect is only selectively evident at night as opposed to anything else. Furthermore given that the section breaks the data down in quite granular fashion, it may also be affected largely by sampling. do take it with a grain of salt

Holidays

Moving on to holidays: I first decided to run a t-test, specifically looking into how fare and travelling changes during holidays. What is really surprising here is that as shown below. People tend to travel longer during holidays but the fare is actually lower

t.test(trip_final[trip_final$is_holiday == 0,"distance"], trip_final[trip_final$is_holiday == 1,"distance"])

## 
##  Welch Two Sample t-test
## 
## data:  trip_final[trip_final$is_holiday == 0, "distance"] and trip_final[trip_final$is_holiday == 1, "distance"]
## t = -2.1273, df = 1277.4, p-value = 0.03358
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.0078710527 -0.0003185976
## sample estimates:
##  mean of x  mean of y 
## 0.05765768 0.06175251

t.test(trip_final[trip_final$is_holiday == 0,"Fare"], trip_final[trip_final$is_holiday == 1,"Fare"])

## 
##  Welch Two Sample t-test
## 
## data:  trip_final[trip_final$is_holiday == 0, "Fare"] and trip_final[trip_final$is_holiday == 1, "Fare"]
## t = 0.16216, df = 1313.3, p-value = 0.8712
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.3568218  0.4211266
## sample estimates:
## mean of x mean of y 
##  9.431716  9.399563

The regression below echoes the result. But why is that? I would think it is because of the fact that people are more willing to drive in holidays by taking their family out. That brings down demand drastically. Furthemore much like me I do believe driver might have over-estimated the going-out effect of holidays on increasing people’s use of ride share services, thereby causing the mismatch.

summary(lm(data = trip_final, "Fare ~ is_holiday + distance"))

## 
## Call:
## lm(formula = "Fare ~ is_holiday + distance", data = trip_final)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -21.953  -1.661  -0.217   1.404 100.980 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.01365    0.03917 102.461  < 2e-16 ***
## is_holiday  -0.41694    0.12099  -3.446  0.00057 ***
## distance    93.96953    0.45774 205.288  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.975 on 19997 degrees of freedom
## Multiple R-squared:  0.6782, Adjusted R-squared:  0.6782 
## F-statistic: 2.107e+04 on 2 and 19997 DF,  p-value: < 2.2e-16

Following the assertion made earlier, it is then not surprisng to see that the volume as seen in the graph below, decreases during holidays.

holiday_count = trip_final %>% group_by(Start_Date) %>% summarize(count = n(), is_holiday = mean(is_holiday))
ggplot(holiday_count, aes(x = Start_Date, y = count, fill = is_holiday)) + geom_col() + theme(axis.text.x = element_text(angle = 45,size = 7,hjust = 1)) + ggtitle("Ride Volume holidays or not")

The next two chunks, simply showcases the specific date-time combo that had the highest/lowest fare and highest/lowest amount of vehicles. Just for anyone who is interested :D

#highest Fare
holiday_fare = trip_final %>% group_by(Start_Date, hour) %>% summarize(fare = mean(Fare), count = n(),is_holiday = mean(is_holiday))
holiday_fare$Weekday = weekdays(as.Date(holiday_fare$Start_Date))
holiday_fare = holiday_fare[holiday_fare$count > 10,]
head(holiday_fare[order(holiday_fare$fare,decreasing = TRUE),])

## # A tibble: 6 x 6
## # Groups:   Start_Date [6]
##   Start_Date hour   fare count is_holiday Weekday  
##   <fct>      <chr> <dbl> <int>      <dbl> <chr>    
## 1 2018-11-16 10 AM  17.7    12          0 Friday   
## 2 2018-11-27 03 PM  17.5    11          0 Tuesday  
## 3 2018-11-05 05 PM  17.1    20          0 Monday   
## 4 2018-12-03 12 PM  17.1    12          0 Monday   
## 5 2018-12-12 05 PM  16.5    15          0 Wednesday
## 6 2018-12-14 04 PM  15.6    13          0 Friday

#Highest count
head(holiday_fare[order(holiday_fare$count,decreasing = TRUE),])

## # A tibble: 6 x 6
## # Groups:   Start_Date [5]
##   Start_Date hour   fare count is_holiday Weekday 
##   <fct>      <chr> <dbl> <int>      <dbl> <chr>   
## 1 2018-12-01 11 PM  8.57    42          0 Saturday
## 2 2018-11-17 07 PM  9.88    40          0 Saturday
## 3 2018-12-14 06 PM 10       40          0 Friday  
## 4 2018-12-14 11 PM  8.68    38          0 Friday  
## 5 2018-11-10 11 PM  8.31    37          0 Saturday
## 6 2018-11-09 06 PM 10.8     36          0 Friday

#Lowest Count
holiday_fare = trip_final %>% group_by(Start_Date, hour) %>% summarize(fare = mean(Fare), count = n(),is_holiday = mean(is_holiday))
holiday_fare$Weekday = weekdays(as.Date(holiday_fare$Start_Date))
head(holiday_fare[order(holiday_fare$count),])

## # A tibble: 6 x 6
## # Groups:   Start_Date [4]
##   Start_Date hour   fare count is_holiday Weekday 
##   <fct>      <chr> <dbl> <int>      <dbl> <chr>   
## 1 2018-12-03 03 AM  22.5     1          0 Monday  
## 2 2018-12-25 01 PM   7.5     1          1 Tuesday 
## 3 2018-12-25 06 AM  22.5     1          1 Tuesday 
## 4 2018-12-27 05 AM  15       1          0 Thursday
## 5 2018-12-31 07 AM  20       1          0 Monday  
## 6 2018-12-31 11 AM   5       1          0 Monday

head(holiday_fare[order(holiday_fare$fare),])

## # A tibble: 6 x 6
## # Groups:   Start_Date [6]
##   Start_Date hour   fare count is_holiday Weekday  
##   <fct>      <chr> <dbl> <int>      <dbl> <chr>    
## 1 2018-12-31 06 AM  3.33     3          0 Monday   
## 2 2018-11-02 02 AM  3.75     2          0 Friday   
## 3 2018-12-27 03 AM  4        5          0 Thursday 
## 4 2018-11-01 01 AM  4.64     7          0 Thursday 
## 5 2018-11-05 10 AM  4.64     7          0 Monday   
## 6 2018-11-07 05 AM  5        8          0 Wednesday

Weather

Another cirtical part that could have an effect on the ride-share dynamic would be the weather surrounding the time of the booking. Before making any assumption about the effect of these weather. I first wanted to check the correlation of the different weather with each other to avoid problem such as multicollinearity. In general the weather relationship seems fine, however, such attention should be drawn to humidty and precipitation as they are quite strongly related.

trip_weather = trip_final
#extarct averages
trip_weather$AvgHumid = (trip_weather$MaxHumidity + trip_weather$MinHumidity)/2
trip_weather$AvgWind = (trip_weather$MaxWindSpeed + trip_weather$MinWindSpeed)/2
corrplot::corrplot(cor(trip_weather[,c("AvgTemp","AvgHumid","AvgWind","AvgPrecipation")]))

Moving on to our favorite tool to examine the effect, I took the liberty to remove humidity to avoid confounding effect: Temperature and Precipitation both don’t seem significan. However it seems like Precipitation has a wider range in its effect compared to all the other. the wind speed is significant and with more wind, fare actually seemed to decrease.

summary(lm(data = trip_weather, "Fare ~ AvgTemp + AvgWind + AvgPrecipation + distance"))

## 
## Call:
## lm(formula = "Fare ~ AvgTemp + AvgWind + AvgPrecipation + distance", 
##     data = trip_weather)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.035  -1.665  -0.199   1.401 100.950 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     4.338771   0.162024  26.779  < 2e-16 ***
## AvgTemp         0.001364   0.004489   0.304    0.761    
## AvgWind        -0.036121   0.008177  -4.417 1.01e-05 ***
## AvgPrecipation  0.049717   0.174818   0.284    0.776    
## distance       93.967146   0.457692 205.306  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.974 on 19995 degrees of freedom
## Multiple R-squared:  0.6783, Adjusted R-squared:  0.6783 
## F-statistic: 1.054e+04 on 4 and 19995 DF,  p-value: < 2.2e-16

Diving deeper, i wanted to explore the relationship of weather conditions in a more graphical way. furthemore I would want to pose that weather condition is likely going affect riders more than drivers since while a driver is within a car more oblivious to the weather outside a rider will have to get off the car at times and interact with the outside enviroment. Most analysis following this will follow the same assumption

Fare does seem to be rather agnostic to temperaure. However extremely cold weather seemed to pushed up the price, likely due to its effect of turning riders into hybernating mode. A key thing to consider here is that since this datasets only spans during the winter the result may not be extremely conclusive to really say that temperature doesn’t have an effect.

weather_fare = trip_weather %>% group_by(Start_Date) %>% summarize(fare = mean(Fare), count = n(),temp = mean(AvgTemp), perc = mean(AvgPrecipation), humid = mean(AvgHumid), wind = mean(AvgWind))
ggplot(weather_fare, aes(y = fare, x = temp)) + geom_point() + geom_smooth() + geom_abline(intercept = mean(weather_fare$fare), slope = 0,linetype = "dashed") +ggtitle("Temperature and Fare")

ggplot(weather_fare, aes(y = count, x = temp)) + geom_point() + geom_smooth() + geom_abline(intercept = mean(weather_fare$fare), slope = 0,linetype = "dashed") +ggtitle("Temperature and Count")

For precipitation, we see a decrease in fare with minimal rain followed by an increase the another decrease. I believe that this could be explained by our hypothesis earlier. rain in the middle range are unlikely strong storms but ones that occurs in the middle of the day. This constrains pedestrian from taking public transportation thereby increasing demand on Ride-share service. However strong rain are likely one that comes with a storm. In those scenario, ride-share services are unlikely to be active yielding lower fare. Do take this inference with caution, similar to temperature, the reason why a fully conclusive answer can’t be reached is because only a small sample out of our sample actually occured in the rain day.

ggplot(weather_fare, aes(y = fare, x = perc)) + geom_point() + geom_smooth() + geom_abline(intercept = mean(weather_fare$fare), slope = 0,linetype = "dashed")+ggtitle("Precipitation and Fare")

ggplot(weather_fare, aes(y = count, x = perc)) + geom_point() + geom_smooth() + geom_abline(intercept = mean(weather_fare$fare), slope = 0,linetype = "dashed")+ggtitle("Precipitation and Count")

Looking at humidity, what seems to occur is that the fare is much higher when humidity is in the middle. It is hard to really come into an conclusion on this since 70% worth humidity still seems quite high so rather than attributing this into an effect, I would like to hold some skepticism with regards to this for now.

ggplot(weather_fare, aes(y = fare, x = humid)) + geom_point() + geom_smooth() + geom_abline(intercept = mean(weather_fare$fare), slope = 0,linetype = "dashed")+ggtitle("Humidity and Fare")

ggplot(weather_fare, aes(y = count, x = humid)) + geom_point() + geom_smooth() + geom_abline(intercept = mean(weather_fare$fare), slope = 0,linetype = "dashed")+ggtitle("Humidity and count")

The linear trend is most obvious with wind (no wonder why it is the only significant thing). This linear decreasing trend is likely similar to that precipitation where in less wind means better weather and thus more people are willing to go out while the converse drags down the demand.

ggplot(weather_fare, aes(y = fare, x = wind)) + geom_point() + geom_smooth() + geom_abline(intercept = mean(weather_fare$fare), slope = 0,linetype = "dashed")+ggtitle("Wind and Fare")

ggplot(weather_fare, aes(y = count, x = wind)) + geom_point() + geom_smooth() + geom_abline(intercept = mean(weather_fare$fare), slope = 0,linetype = "dashed")+ggtitle("Wind and count")

Overall, this section is less conclusively that I would’ve otherwise hoped, however nonetheless I would say weather does still to some degree affect the ride-sharing fare, more so due to the change in demand.

Neighborhood

A vehicle oriented business is undoubtedly also a location specific business. This considered it will quite interesting to look at where the rides are actually taking place. In the step below, I ran a regression based on the dropoff and pickup area of rides and looked specifically for those that came back with a significant result.

#OD paired based dataframe
area_fare = trip_final %>% group_by(PickupCommunityArea,DropoffCommunityArea) %>% summarize(fare = mean(Fare),population = mean(Population),distance = mean(distance), income = mean(Median_Household_Income), occupied = sum(Owner_Occupied, Renter_Occupied_House), count = n(), pickupy = mean(PickupCentroidLatitude), pickupx = mean(PickupCentroidLongitude), dropoffy = mean(DropoffCentroidLatitude),dropoffx = mean(DropoffCentroidLongitude))

#model community area's affect on fare
model_area = lm(data = area_fare, "fare ~ factor(PickupCommunityArea) + factor(DropoffCommunityArea) + distance")
factors = summary(model_area)$coefficient
#Extarcting only significant factors
factors = as.data.frame(factors[factors[,4] < 0.05,])
factors$variable = gsub(pattern = "factor",replacement = " ",
      rownames(factors))
factors[order(factors$Estimate),]

##                                 Estimate Std. Error   t value     Pr(>|t|)
## factor(DropoffCommunityArea)11 -2.962128  1.0324823 -2.868939 4.162842e-03
## factor(DropoffCommunityArea)15 -2.692537  0.9877585 -2.725907 6.469870e-03
## factor(DropoffCommunityArea)40 -2.568278  1.0458342 -2.455722 1.414687e-02
## factor(DropoffCommunityArea)35 -2.525034  0.9735631 -2.593601 9.568740e-03
## factor(DropoffCommunityArea)39 -2.311606  1.0020474 -2.306883 2.116567e-02
## factor(DropoffCommunityArea)12 -2.299751  1.1282207 -2.038388 4.164582e-02
## factor(DropoffCommunityArea)68 -2.241231  1.0274049 -2.181449 2.926930e-02
## factor(DropoffCommunityArea)58 -2.172595  1.0170316 -2.136212 3.278636e-02
## factor(DropoffCommunityArea)21 -2.007900  1.0076497 -1.992657 4.643822e-02
## factor(PickupCommunityArea)56   1.841148  0.8839986  2.082750 3.740404e-02
## (Intercept)                     4.406953  1.0223515  4.310604 1.709234e-05
## factor(DropoffCommunityArea)74  5.072810  1.8573997  2.731135 6.368496e-03
## distance                       98.430703  1.5523331 63.408237 0.000000e+00
##                                                 variable
## factor(DropoffCommunityArea)11  (DropoffCommunityArea)11
## factor(DropoffCommunityArea)15  (DropoffCommunityArea)15
## factor(DropoffCommunityArea)40  (DropoffCommunityArea)40
## factor(DropoffCommunityArea)35  (DropoffCommunityArea)35
## factor(DropoffCommunityArea)39  (DropoffCommunityArea)39
## factor(DropoffCommunityArea)12  (DropoffCommunityArea)12
## factor(DropoffCommunityArea)68  (DropoffCommunityArea)68
## factor(DropoffCommunityArea)58  (DropoffCommunityArea)58
## factor(DropoffCommunityArea)21  (DropoffCommunityArea)21
## factor(PickupCommunityArea)56    (PickupCommunityArea)56
## (Intercept)                                  (Intercept)
## factor(DropoffCommunityArea)74  (DropoffCommunityArea)74
## distance                                        distance

instead of going through every item of the dataframe, I specifically checked for those that have the biggest effect among all the pickup and the dropoff; the result retruned as seen below, after research has shown to be of the most secluded areas in Chicago, locating at the edge of the area.

#Extract the names of the area with the top positive effect
print(as.character(trip_final[trip_final$PickupCommunityArea == 74, "Name"][1]))

## [1] "MOUNT GREENWOOD"

print(as.character(trip_final[trip_final$PickupCommunityArea == 56, "Name"][1]))

## [1] "GARFIELD RIDGE"

Another thing that will just be absolutely interesting to look at will be examining the Origin and Destination pairs, by graphing them onto the Chicago map and adjusting its color based on certain traits. The graph below views this on the count status of each pair of trips. From this graph we could see that ride activity is most evident still within the more central area of Chicago and less so on trips to the outer area

ggmap(map) + geom_segment(data=area_fare[area_fare$count>15,], aes(x= pickupx, y=pickupy,xend = dropoffx, yend=dropoffy,color = count),size = 1.5,alpha = 0.3) +scale_color_gradient(low="blue", high="red")

Conversely, the graph below views this on the fare status of each pair of trips. Like what we expected, From this graph we could see that fare is much higher on trips to the outer area followed by those in center city and cheapest on short rides along the outskirt like those by edgewater

ggmap(map) + geom_segment(data=area_fare[area_fare$count>15,], aes(x= pickupx, y=pickupy,xend = dropoffx, yend=dropoffy,color = fare),size = 1.5,alpha = 0.3) +scale_color_gradient(low="blue", high="red")

The last graph below views this on a different metric: a combination of both fare and distance - Fare/Distance to examine how the fare is like with respect to the distance, this should more accurately reflect how different OD combination affects price. I removed those with a ratio way too high because they are mostly from areas charged with a base rate despite short distance. From this graph we could see that the ratio is rather similar across the board, but seemingly those in the central area does seem slightly higher

ggmap(map) + geom_segment(data=area_fare[(area_fare$count>15)&(area_fare$fare/area_fare$distance< 1000),], aes(x= pickupx, y=pickupy,xend = dropoffx, yend=dropoffy,color = log(fare/distance)),size = 2,alpha = 0.3) +scale_color_gradient(low="blue", high="red")

Another thing that could be concluded is that geographically associated demographic information such as income, population and houses occupied doesn’t really affect ride prices. Safe to assume that this don’t heavily affect demand either.

summary(lm(data = area_fare, "fare ~ income + population + occupied + distance"))

## 
## Call:
## lm(formula = "fare ~ income + population + occupied + distance", 
##     data = area_fare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.427  -1.704  -0.038   1.638  34.273 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.735e+00  2.364e-01  11.568   <2e-16 ***
## income      3.499e-06  3.581e-06   0.977    0.329    
## population  5.733e-06  3.617e-06   1.585    0.113    
## occupied    3.451e-08  4.666e-08   0.740    0.460    
## distance    9.906e+01  1.302e+00  76.063   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.69 on 2107 degrees of freedom
## Multiple R-squared:  0.7387, Adjusted R-squared:  0.7382 
## F-statistic:  1489 on 4 and 2107 DF,  p-value: < 2.2e-16

Using the same Fare/Distance metric as above, I wanted to show some top OD pairs within the area for anyone interested. After doing some research, I found that most of this area are within the more downtown area known for its characteristic such as the nightlife. Most of this commute also seem to be done in short distance. Busy Area close to each other, i guess that’s secretly formula for a high priced ride.

area_overall = trip_final %>% group_by(PickupCommunityArea,DropoffCommunityArea,hour,Weekdays) %>% summarize(fare = mean(Fare),distance = mean(distance), count = n())
#Downtown areas mainly
area_overall$farePerDist = area_overall$fare/area_overall$distance
area_overall_temp = area_overall[area_overall$count>5,]
area_overall_temp[order(area_overall_temp$farePerDist,decreasing = TRUE),]

## # A tibble: 229 x 8
## # Groups:   PickupCommunityArea, DropoffCommunityArea, hour [129]
##    PickupCommunity… DropoffCommunit… hour  Weekdays  fare distance count
##               <int>            <int> <chr> <fct>    <dbl>    <dbl> <int>
##  1               32                8 04 PM Friday   16.5   0.0197     10
##  2                8                8 09 AM Thursday  8.75  0.0109      8
##  3               28               28 01 AM Sunday    8.33  0.0107      6
##  4               32               32 06 PM Monday    4.17  0.00582     6
##  5                8                8 05 AM Saturday  8.57  0.0123      7
##  6                8                8 08 PM Saturday  5.5   0.00860    10
##  7               24               24 06 AM Sunday    5.71  0.00954     7
##  8               28               28 05 PM Thursday  5.71  0.00999     7
##  9                8                8 08 PM Friday    6.04  0.0110     12
## 10                8                8 07 PM Thursday  5.36  0.00986    14
## # … with 219 more rows, and 1 more variable: farePerDist <dbl>

areaweek = area_overall %>% group_by(PickupCommunityArea,DropoffCommunityArea,Weekdays) %>% summarize(fare = mean(fare),distance = mean(distance), count = sum(count))
areaweek$farePerDist = areaweek$fare/areaweek$distance
areaweek = areaweek[areaweek$count > 10,]
areaweek[order(areaweek$farePerDist,decreasing = TRUE),]

## # A tibble: 334 x 7
## # Groups:   PickupCommunityArea, DropoffCommunityArea [68]
##    PickupCommunity… DropoffCommunit… Weekdays  fare distance count
##               <int>            <int> <fct>    <dbl>    <dbl> <int>
##  1               25               25 Saturday  4.88  0.00275    11
##  2               25               25 Monday    5.83  0.00537    11
##  3               41               41 Sunday    3.91  0.00649    11
##  4               41               41 Tuesday   4.86  0.00865    16
##  5               32               32 Tuesday   4.85  0.00929    30
##  6               32               32 Friday    5.09  0.00984    39
##  7                6                6 Monday    4.90  0.00950    24
##  8               33               33 Saturday  4.31  0.00841    12
##  9               32               32 Wednesd…  4.97  0.00997    34
## 10               32               32 Thursday  4.54  0.00919    47
## # … with 324 more rows, and 1 more variable: farePerDist <dbl>

areahour = area_overall %>% group_by(PickupCommunityArea,DropoffCommunityArea,hour) %>% summarize(fare = mean(fare),distance = mean(distance), count = sum(count))
areahour$farePerDist = areahour$fare/areahour$distance
areahour = areahour[areahour$count > 10,]
areahour[order(areahour$farePerDist,decreasing = TRUE),]

## # A tibble: 306 x 7
## # Groups:   PickupCommunityArea, DropoffCommunityArea [36]
##    PickupCommunity… DropoffCommunit… hour   fare distance count farePerDist
##               <int>            <int> <chr> <dbl>    <dbl> <int>       <dbl>
##  1               32               32 10 PM  6.88  0.0107     11        644.
##  2               32               32 12 AM  5.92  0.00947    12        625.
##  3               28               28 11 AM  3.82  0.00634    13        602.
##  4               32               32 05 PM  5.62  0.00954    19        589.
##  5               32               32 07 PM  5.28  0.00897    12        588.
##  6                8                8 01 AM  7.24  0.0126     38        576.
##  7               32               32 09 PM  6.12  0.0114     11        539.
##  8               32               32 02 PM  5.10  0.00951    18        536.
##  9               32               32 09 AM  4.58  0.00868    12        528.
## 10               32               32 01 PM  5     0.00948    12        527.
## # … with 296 more rows

Conclusion

A lot more things could be analyzed and discovered from this dataset and I would encourage anyone who is interested to try to implement this analysis with all of the data provided by the city of Chicago. This project has been fun and I hope you all enjoyed reading through it!

P.S to view the dashboard source code proceed to: https://github.com/zhangyidavidfan/Ride_Share_Price_Dashboard_Source_Code

GGMAP Citation: D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf

Ride Share Base Code

David Fan

5/15/2019