Motivation for this Project

We want to measure the impact weather has on different taxi service companies and whether that impact is statistically significant. We also want to measure the effect that ‘For Hire Vehicles’ (FHV) have had on taxis in recent years. In addition, we would like to identify the time frames in which demand for taxis is higher, so that a taxi company can allocate its employees accordingly. We use the following analytical techniques to answer these questions:

  • Transformation and Clean Up
  • Hypothesis Testing
  • Time Series Plot
  • Heat Map

Exploratory Analysis

Cleaning & Transformations

To start, we need to remove a few hundred observations that were not recorded at the correct date and time, because they would skew the data. To do this, we filtered out the observations that had the same date in the Date.time variable and the file_name variable.

Formatting Date Column

The date columns must share a single format so that the weather data and the taxi data can be merged correctly.
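
As a minimal sketch of this step, the snippet below parses two differently formatted date columns into R's Date class and merges on the result. The toy data, the column names `DATE`, `PRCP`, and `date`, and both date formats are assumptions made for illustration; the real files may use different names and formats.

```r
# Hypothetical toy data: column names and date formats are assumptions.
taxi <- data.frame(
  Date.time = c("01/15/2016 00:00", "01/16/2016 00:00", "01/17/2016 00:00"),
  rides     = c(250000, 265000, 240000),
  stringsAsFactors = FALSE
)
weather <- data.frame(
  DATE = c("2016-01-15", "2016-01-16", "2016-01-17"),
  PRCP = c(0.00, 0.42, 0.00),
  stringsAsFactors = FALSE
)

# Convert both keys to the Date class so they compare equal
# regardless of how each source file wrote the date.
taxi$date    <- as.Date(taxi$Date.time, format = "%m/%d/%Y %H:%M")
weather$date <- as.Date(weather$DATE, format = "%Y-%m-%d")

# Inner join on the shared Date column.
merged <- merge(taxi, weather, by = "date")
```

Merging on a Date column rather than on the raw strings avoids silent mismatches such as "01/15/2016" versus "2016-01-15", which would otherwise drop rows from the join.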

Match Locations

To reduce redundancy, we analyze only the weather recorded in Manhattan.

Create Binary Precipitation Column

Creating a binary precipitation column (yes or no) allows us to compare taxi demand between days with and without precipitation.
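
One way such a column could be built is sketched below. The data frame `df_toy`, the column `PRCP`, and the rule that any amount above zero counts as "yes" are all assumptions for illustration; the source does not state the threshold that was actually used.

```r
# Hypothetical daily precipitation amounts; PRCP is an assumed column name.
df_toy <- data.frame(
  rides = c(250000, 265000, 240000, 270000),
  PRCP  = c(0.00, 0.42, 0.00, 0.05)
)

# Assumption: any measurable precipitation counts as a "yes" day.
df_toy$Precipitation <- factor(
  ifelse(df_toy$PRCP > 0, "yes", "no"),
  levels = c("no", "yes")
)

# Mean number of rides within each precipitation group.
group_means <- tapply(df_toy$rides, df_toy$Precipitation, mean)
```

Storing the result as a two-level factor makes it usable directly as a grouping variable when comparing group means.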

Hypothesis Testing

Looking at the ‘df’ data frame, we have 5113 observations of 12 variables. The data contains observations on three different taxi companies, as well as the weather recorded for each date. We would like to find out whether precipitation has an effect on the number of taxi rides. Intuitively, we would expect more rides on rainy days because fewer people are willing to walk in the rain.

First, let us look at the distribution of the number of rides made from 2014 to 2018 for each company.

Monthly Ride Distribution

We notice that summer is the least popular time for Yellow Taxi and Green Taxi, whereas FHV rides remain steady throughout the summer. February through May are the most popular months for Yellow and Green Taxis, while FHV rides again remain steady. With the exception of 2015 for the Green Taxi company, rides for the two taxi companies are steadily declining. Conversely, FHV rides are steadily increasing.

Is there an observed difference in the average number of rides made when it is raining?

## df$Precipitation: no
## [1] 265450.3
## ------------------------------------------------------------ 
## df$Precipitation: yes
## [1] 276671.7

As seen above, the average number of rides is higher when it rains (276671.7 versus 265450.3), but is this difference statistically significant?

Inference

## df$Precipitation: no
## [1] 3498
## ------------------------------------------------------------ 
## df$Precipitation: yes
## [1] 1615

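The sample was acquired by random sampling and makes up less than 10% of the total taxi ride data. Both groups shown above (3498 observations without precipitation and 1615 with) are large enough for the difference between the two means to be treated as approximately normal.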

Null Hypothesis: There is no change in the number of rides made when it rains.

Alternative Hypothesis: There is a change in the number of rides made when it rains.

## Response variable: numerical, Explanatory variable: categorical
## Difference between two means
## Summary statistics:
## n_no = 3498, mean_no = 265450.3, sd_no = 208928.8
## n_yes = 1615, mean_yes = 276671.7, sd_yes = 219433.6
## Observed difference between means (no-yes) = -11221.34
## 
## H0: mu_no - mu_yes = 0 
## HA: mu_no - mu_yes != 0 
## Standard error = 6503.37 
## Test statistic: Z =  -1.725 
## p-value =  0.0844

Using the inference function, we analyze our response variable, df$rides. The second argument in the function is our explanatory variable, df$Precipitation, which splits the data into two groups: days with rain and days without. We are interested in the mean and are conducting a hypothesis test on the difference between the two group means.
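
As a check on the reported figures, the standard error, test statistic, and p-value can be recomputed by hand from the summary statistics printed above. This is only a sketch of the usual two-sample Z calculation, not necessarily how the inference function computes its result internally; the variable names are chosen here for illustration.

```r
# Summary statistics copied from the inference output above.
n_no  <- 3498; mean_no  <- 265450.3; sd_no  <- 208928.8
n_yes <- 1615; mean_yes <- 276671.7; sd_yes <- 219433.6

# Standard error of the difference between two independent means.
se <- sqrt(sd_no^2 / n_no + sd_yes^2 / n_yes)

# Test statistic and two-sided p-value from the standard normal distribution.
z <- (mean_no - mean_yes) / se
p_value <- 2 * pnorm(-abs(z))
```

Up to rounding of the inputs, these values agree with the reported standard error of 6503.37, test statistic of -1.725, and p-value of 0.0844.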

Conclusion

The p-value returned by the function is 0.0844, so at the 5% significance level we fail to reject the null hypothesis. The result would, however, be significant at the 10% level, which gives weak evidence of a difference in the number of taxi rides when it rains.
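
The same comparison can be expressed as confidence intervals for the difference between the two means. The sketch below uses a normal-based interval built from the difference and standard error reported in the inference output; the helper `ci` is hypothetical and introduced only for this illustration.

```r
diff_means <- -11221.34  # observed difference (no - yes) from the output above
se         <- 6503.37    # standard error from the output above

# Two-sided interval: estimate +/- critical value * standard error.
ci <- function(level) {
  z_crit <- qnorm(1 - (1 - level) / 2)
  c(lower = diff_means - z_crit * se, upper = diff_means + z_crit * se)
}

ci_95 <- ci(0.95)
ci_90 <- ci(0.90)
```

The 95% interval, roughly -23968 to 1525, contains zero, while the 90% interval, roughly -21918 to -524, does not. This matches a p-value that falls between 0.05 and 0.10.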

Time Series Plot

Conclusion

In conclusion, we notice clear changes in ride demand under varying conditions. The effect of weather on demand for rides was significant at the 10% level but not at the 5% level. A taxi company may still want to adjust how it allocates its resources when rain is in the forecast.

We noticed a substantial increase in demand for ‘For Hire Vehicles’, while demand for green and yellow taxis has steadily decreased. This is bad news for the yellow and green taxi companies because demand for their service keeps falling.

Lastly, we identified the days in each month on which demand for rides is highest. In addition to weather forecasts, knowing which days of the week in each month have the highest demand allows for more accurate resource allocation.