The NYC Taxi data, which is available on Kaggle, includes 8 variables: 4 for pickup and dropoff locations, 1 for fare amount, 1 for pickup date and time, 1 for the key, and 1 for passenger count. I use the pickup and dropoff locations to calculate trip distance, which is an important factor in the fare amount. I am going to do some EDA and see if there is anything I can use to predict the fare amount.

## Observations: 982,023
## Variables: 8
## $ key               <dttm> 2009-06-15 17:26:21, 2010-01-05 16:52:16, 2...
## $ fare_amount       <dbl> 4.5, 16.9, 5.7, 7.7, 5.3, 12.1, 7.5, 16.5, 9...
## $ pickup_datetime   <chr> "2009-06-15 17:26:21 UTC", "2010-01-05 16:52...
## $ pickup_longitude  <dbl> -73.84431, -74.01605, -73.98274, -73.98713, ...
## $ pickup_latitude   <dbl> 40.72132, 40.71130, 40.76127, 40.73314, 40.7...
## $ dropoff_longitude <dbl> -73.84161, -73.97927, -73.99124, -73.99157, ...
## $ dropoff_latitude  <dbl> 40.71228, 40.78200, 40.75056, 40.75809, 40.7...
## $ passenger_count   <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1,...
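
The trip distance mentioned above can be derived from the four coordinate columns. The sketch below shows one common way to do it, the haversine (great-circle) formula. The formula choice, the function name `haversine_km`, and the Earth radius of 6371 km are my assumptions for illustration, not necessarily how the distance was computed for this post.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an assumed constant


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (longitude, latitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Coordinates of the first trip shown in the data preview above
distance = haversine_km(-73.84431, 40.72132, -73.84161, 40.71228)
print(round(distance, 2), "km")
```

The great-circle distance is a straight-line lower bound, so the distance actually driven on city streets will generally be longer.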

I am interested in the following questions and would like to see whether the EDA can answer them.

  1. Which neighborhoods are the busiest during rush hours?
  2. Which neighborhoods’ passengers pay the highest fares?
  3. Are distance, month, passenger count, weekday, and hour related to the fare amount?

I define 6 AM to 9 AM as the morning rush hours, when I assume people take taxis to go to work. The following two maps show which neighborhoods are the busiest for pickups and dropoffs during the morning rush hours. As we can see, the Upper East Side is the busiest pickup location and Midtown is the busiest dropoff location. This is expected, as the Upper East Side has a lot of residential buildings and Midtown has a lot of office buildings.

Likewise, I define 4 PM to 8 PM as the afternoon rush hours. I found that Midtown is the busiest area for pickups and the Upper East Side is the busiest area for dropoffs. This mirrors the morning rush hours, as people take taxis home after work.
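
The two rush-hour windows can be turned into a simple label on each trip. This is a minimal sketch with some assumptions of mine: the function name `rush_period` is hypothetical, each window includes its start hour and excludes its end hour, and the hour is taken directly from the `pickup_datetime` string without any time zone conversion.

```python
from datetime import datetime

PICKUP_FORMAT = "%Y-%m-%d %H:%M:%S UTC"  # matches strings like "2009-06-15 17:26:21 UTC"


def rush_period(pickup_datetime):
    """Return "morning", "afternoon", or None for a pickup_datetime string."""
    hour = datetime.strptime(pickup_datetime, PICKUP_FORMAT).hour
    if 6 <= hour < 9:      # 6 AM up to, but not including, 9 AM
        return "morning"
    if 16 <= hour < 20:    # 4 PM up to, but not including, 8 PM
        return "afternoon"
    return None            # outside both rush-hour windows


# First pickup_datetime value from the data preview
print(rush_period("2009-06-15 17:26:21 UTC"))
```

Since the timestamps are labeled UTC, converting them to New York local time before extracting the hour may be needed, depending on how the data was recorded.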

Then I look at the 20 neighborhoods with the highest trip counts to figure out whether passengers from some of them pay higher taxi fares than others. Passengers picked up in the Financial District and Battery Park City pay higher fares than others, and so do passengers dropped off in the Financial District and Harlem. This makes sense, as these areas are fairly far from the center of the city.
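
The comparison above boils down to two steps: keep the most frequent neighborhoods, then average the fare within each. Here is a rough sketch, assuming each trip has already been assigned a neighborhood name (the data has only coordinates, so that mapping is a separate step not shown here). The function name and the sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def mean_fare_for_top_neighborhoods(trips, n=20):
    """trips: iterable of (neighborhood, fare_amount) pairs.

    Returns (neighborhood, trip_count, mean_fare) tuples for the n
    neighborhoods with the most trips, sorted by mean fare, highest first.
    """
    trips = list(trips)
    counts = Counter(name for name, _ in trips)
    top = [name for name, _ in counts.most_common(n)]

    totals = defaultdict(float)
    for name, fare in trips:
        totals[name] += fare

    result = [(name, counts[name], totals[name] / counts[name]) for name in top]
    return sorted(result, key=lambda row: row[2], reverse=True)


# Made-up rows, for illustration only
sample = [("A", 10.0), ("A", 20.0), ("B", 5.0), ("C", 7.0), ("B", 9.0)]
print(mean_fare_for_top_neighborhoods(sample, n=2))
```

The same function works for pickups or dropoffs, depending on which neighborhood is passed in for each trip.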

To answer the last question, I look at how several factors relate to the taxi fare. As we can see, the fare amount and trip distance are positively related. Fares on Fridays and Saturdays are slightly higher than on other days. Trips around 6 AM have the lowest fares. There is no significant difference in fares among months. Trips with 6 passengers have slightly higher fares than others.
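
These relationships can also be checked numerically. The sketch below shows two simple tools: a Pearson correlation for distance versus fare, and a group average for factors such as weekday or hour. The helper names `pearson` and `mean_fare_by` are hypothetical, and the Pearson coefficient is my choice of measure, since the post does not name one.

```python
import math
from collections import defaultdict
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (neither may be constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def mean_fare_by(trips, key):
    """Average fare per group.

    trips: iterable of (pickup_datetime string, fare_amount) pairs.
    key: function mapping a datetime to a group label, e.g. hour or weekday.
    """
    sums = defaultdict(lambda: [0.0, 0])  # label -> [fare total, trip count]
    for timestamp, fare in trips:
        moment = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S UTC")
        label = key(moment)
        sums[label][0] += fare
        sums[label][1] += 1
    return {label: total / count for label, (total, count) in sums.items()}


# First row from the data preview; a real check would use every row
trips = [("2009-06-15 17:26:21 UTC", 4.5)]
print(mean_fare_by(trips, key=lambda m: WEEKDAYS[m.weekday()]))
print(mean_fare_by(trips, key=lambda m: m.hour))
```

Passing `key=lambda m: m.month` gives the month comparison in the same way. A correlation near +1 from `pearson(distances, fares)` would match the positive relationship described above.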