Welcome to the Uber Data Analysis and Visualization with R

In this project I will walk through a data visualization using the Uber trips in New York City dataset. The main objective is to create some insightful visualization that provides valuable information about the rides that occurred in NYC.

I begin the project by importing the packages that will be used in this report:

Packages

library(ggplot2)
library(plotly)
# [maybe] This is the main package for this data visualization project
# I will explore the plotly package as well

library(ggthemes)
# ggthemes provides more nice looking themes for the ggplot2 library.

library(lubridate)
# Date is one of the data types in this project. Lubridate allows to work with timestamps
# in a more granular level.

library(dplyr)
library(tidyr)
# dplyr and tidyr will help with the dataframe manipulation

library(DT)
# Tables are as important as charts when it comes to understand data and reporting results. The
# DT package brings interactive tables to the game.

library(scales)
# This package helps setting up the scales in geographical visualizations

library(sf)
library(rnaturalearth)

Reading the data

Now let’s read the 6 csv files that holds our data. I will first read one by one and then join them in a single variable.

start_time <- Sys.time()
data1 <- read.csv('Uber-dataset/uber-raw-data-apr14.csv')
data2 <- read.csv('Uber-dataset/uber-raw-data-may14.csv')
data3 <- read.csv('Uber-dataset/uber-raw-data-jun14.csv')
data4 <- read.csv('Uber-dataset/uber-raw-data-jul14.csv')
data5 <- read.csv('Uber-dataset/uber-raw-data-aug14.csv')
data6 <- read.csv('Uber-dataset/uber-raw-data-sep14.csv')
end_time <- Sys.time()
dt <- end_time - start_time

There is a good number of records in this dataset. It took 7.11 seconds to read all of them. Let’s see how it looks like.

head(data1)

##          Date.Time     Lat      Lon   Base
## 1 4/1/2014 0:11:00 40.7690 -73.9549 B02512
## 2 4/1/2014 0:17:00 40.7267 -74.0345 B02512
## 3 4/1/2014 0:21:00 40.7316 -73.9873 B02512
## 4 4/1/2014 0:28:00 40.7588 -73.9776 B02512
## 5 4/1/2014 0:33:00 40.7594 -73.9722 B02512
## 6 4/1/2014 0:33:00 40.7383 -74.0403 B02512

head(data2)

##          Date.Time     Lat      Lon   Base
## 1 5/1/2014 0:02:00 40.7521 -73.9914 B02512
## 2 5/1/2014 0:06:00 40.6965 -73.9715 B02512
## 3 5/1/2014 0:15:00 40.7464 -73.9838 B02512
## 4 5/1/2014 0:17:00 40.7463 -74.0011 B02512
## 5 5/1/2014 0:17:00 40.7594 -73.9734 B02512
## 6 5/1/2014 0:20:00 40.7685 -73.8625 B02512

head(data3)

##          Date.Time     Lat      Lon   Base
## 1 6/1/2014 0:00:00 40.7293 -73.9920 B02512
## 2 6/1/2014 0:01:00 40.7131 -74.0097 B02512
## 3 6/1/2014 0:04:00 40.3461 -74.6610 B02512
## 4 6/1/2014 0:04:00 40.7555 -73.9833 B02512
## 5 6/1/2014 0:07:00 40.6880 -74.1831 B02512
## 6 6/1/2014 0:08:00 40.7152 -73.9917 B02512

head(data4)

##          Date.Time     Lat      Lon   Base
## 1 7/1/2014 0:03:00 40.7586 -73.9706 B02512
## 2 7/1/2014 0:05:00 40.7605 -73.9994 B02512
## 3 7/1/2014 0:06:00 40.7320 -73.9999 B02512
## 4 7/1/2014 0:09:00 40.7635 -73.9793 B02512
## 5 7/1/2014 0:20:00 40.7204 -74.0047 B02512
## 6 7/1/2014 0:35:00 40.7487 -73.9869 B02512

head(data5)

##          Date.Time     Lat      Lon   Base
## 1 8/1/2014 0:03:00 40.7366 -73.9906 B02512
## 2 8/1/2014 0:09:00 40.7260 -73.9918 B02512
## 3 8/1/2014 0:12:00 40.7209 -74.0507 B02512
## 4 8/1/2014 0:12:00 40.7387 -73.9856 B02512
## 5 8/1/2014 0:12:00 40.7323 -74.0077 B02512
## 6 8/1/2014 0:13:00 40.7349 -74.0033 B02512

head(data6)

##          Date.Time     Lat      Lon   Base
## 1 9/1/2014 0:01:00 40.2201 -74.0021 B02512
## 2 9/1/2014 0:01:00 40.7500 -74.0027 B02512
## 3 9/1/2014 0:03:00 40.7559 -73.9864 B02512
## 4 9/1/2014 0:06:00 40.7450 -73.9889 B02512
## 5 9/1/2014 0:11:00 40.8145 -73.9444 B02512
## 6 9/1/2014 0:12:00 40.6735 -73.9918 B02512

All the dataframes have the same structure. Thus we can use the rbind function to unite them all:

data <- rbind(data1, data2, data3, data4, data5, data6)
str(data)

## 'data.frame':    4534327 obs. of  4 variables:
##  $ Date.Time: chr  "4/1/2014 0:11:00" "4/1/2014 0:17:00" "4/1/2014 0:21:00" "4/1/2014 0:28:00" ...
##  $ Lat      : num  40.8 40.7 40.7 40.8 40.8 ...
##  $ Lon      : num  -74 -74 -74 -74 -74 ...
##  $ Base     : chr  "B02512" "B02512" "B02512" "B02512" ...

nrow(data)

## [1] 4534327

So we have a dataframe with 4534327 records! That is a lot of trips. Let’s see what we can do with them.

Cleaning the data

There is some work to be done in order to enhance this dataset. For instance, the “Date.Time” column has the character data type. So, in order to convert the time to its proper data type, we can use the as.POSIXct function:

# Using the US date format
data$Date.Time <- as.POSIXct(data$Date.Time, format = "%m/%d/%Y %H:%M:%S")
#View(data)

And now we can extract some specific information about the date and time, such as hour, day and moth of the trip.

# Extracting the trip's timestamp
data$Time <- format(as.POSIXct(data$Date.Time, format = "%m/%d/%Y %H:%M:%S"), format="%H:%M:%S")

data$Hour <- factor(hour(hms(data$Time)))

# Extracting the the day of the trip
data$Day <- factor(day(data$Date.Time))

# EXtracting the month of the trip
data$Month <- factor(month(data$Date.Time, label=TRUE))

# Extracting the year of the trip
data$Year <- factor(year(data$Date.Time))

# Extracting the name of the day
data$DayOfWeek <- factor(wday(data$Date.Time, label=TRUE))

Now we have a much more granular data and we can start plotting some visualizations.

Data Visualization

Time

As far the data period goes, what looks like the trips distribution in a day? We can make a group by hour and plot the number of trips taken in each hour.

# First let's turn off the scientific notation
options(scipen=999)

# Grouping the data and counting the occurrences by hour
by_hour <- data %>%
  group_by(Hour) %>%
  summarise(AmountOfTrips = n())

# Creating a barplot to visualize
plot_by_hour <- ggplot(data = by_hour,
                       aes(Hour, AmountOfTrips)) +
  geom_bar(stat = "identity", fill='steelblue', color='black') +
  scale_y_continuous(labels=comma, breaks=seq(0, 350000, 50000)) +
  labs(title='Trips by hour in NY', x='Hour', y='Amount of trips')

plot_by_hour

We can also group by month to compare the hours through each month and see if there is any difference.

# Grouping by month and hour

by_month_hour <- data %>%
  group_by(Month, Hour) %>%
  summarise(AmountOfTrips = n())

plot_by_month_hour <- ggplot(data = by_month_hour,
                             aes(Hour, AmountOfTrips, fill = Month)) +
  geom_bar(stat='identity') +
  ggtitle('Amount of trips in NY by month and hour') +
  scale_y_continuous(labels = comma, breaks=seq(0, 350000, 50000))

plot_by_month_hour

As we can see, the amount of trips is well distributed through the months. Besides, we see that the evening is the period with the biggest number of trips with Uber. That is the time that people leaves their jobs to go home or go hanging out. Lets see the there is a correlation with the week day and the number of trips.

# Grouping the data by weekday and hour
by_weekday <- data %>%
  group_by(DayOfWeek) %>%
  summarise(AmountOfTrips = n())

plot_by_weekday <- ggplot(data = by_weekday,
                          aes(DayOfWeek, AmountOfTrips))+
  geom_bar(stat='identity', fill = 'steelblue', color='black') +
  ggtitle('Amount of trips by day of week in NY') +
  scale_y_continuous(labels=comma, breaks=seq(0, 900000, 100000)) +
  labs(x='Day of week', y='Amount of trips')

plot_by_weekday

Seems like there is no major difference in the number of trips by day of week. A last and simple question that we can answer is the number of trips taken by month.

by_month <- data %>%
  group_by(Month) %>%
  summarise(AmountOfTrips = n())

plot_by_month <- plot_by_weekday <- ggplot(data = by_month,
                          aes(Month, AmountOfTrips))+
  geom_bar(stat='identity', fill = 'steelblue', color='black') +
  ggtitle('Amount of trips by month in NY') +
  scale_y_continuous(labels=comma, breaks=seq(0, 1200000, 100000)) +
  labs(x='Month', y='Amount of trips')

plot_by_month

Seems like the number of trips increase as the year pass by. I wonder why it happens…

Bases

Now let’s explore how the trips relates with the Uber bases. To start, we can see the pure amount of trips for each base and each month.

by_base <- data %>%
  group_by(Base, Month) %>%
  summarise(AmountOfTrips = n())

plot_by_base <- ggplot(data=by_base,
                       aes(Base, AmountOfTrips, fill=Month)) +
  geom_bar(stat = 'identity', color='black', position='dodge') +
  labs(title='Number of trips by Uber base in NY', x='Base', y='Amount of trips')
  scale_y_continuous(labels=comma)

## <ScaleContinuousPosition>
##  Range:  
##  Limits:    0 --    1

plot_by_base

Geographical Visualization

Now let’s explore the geographical aspect of the data. Our dataset has the “Lat” and “Lon” columns that stands for Latitude and Longitude, which represents geographical coordinates. We can use that data to plot the trips into a map to have a spatial view of the dataset. We can start by simply making a scatter plot of the points:

ggplot(data=data, aes(x=Lon, y=Lat)) + 
  geom_point(size=1, color='steelblue') + 
  theme_map()+
  ggtitle('2014 Uber trips in NYC')

But we surely can do better. For that, let’s use the the sf package that provides us functions to deal with geographical data.

# Transforming the original dataframe to a "sf dataframe", that is a geografical data type
data_sf <- st_as_sf(data, coords=c('Lon', 'Lat'), crs=4326)

# Downloading the NY map
US_map <- ne_states(country='united states of america', returnclas = 'sf')
NY_map <- filter(US_map, gn_name == 'New York')

# Ploting the NY Map
ggplot() +
  geom_sf(data=NY_map,
          fill=gray(0.8), color=gray(0.7)) +
  theme_void()

Now let’s scatter the trips into the map

ggplot() +
  geom_sf(data=NY_map,
          fill=gray(0.8), color=gray(0.7)) +
  geom_sf(data=data_sf,
          alpha=0.7,
          show.legend='point') +
  labs(title = "Uber trips in NYC (Apr - Sep) 2014") +
  #guides(color = guide_legend(override.aes = list(size = 6))) +
  theme_void()

That is a lot of points we have here! It took almost 20 minutes for my computer render this graph. As we can see, the quantity of rides took in NYC in this period is quite well distributed throughout the city. For simplicity, lets use only 20% of this data to continue exploring.

# Sampling some rows of the original dataframe
new_data <- data_sf[sample(nrow(data_sf), nrow(data_sf)*0.2),]
nrow(new_data)

## [1] 906865

And now we can scatter the ride points again, but now zoomed in to the NYC

scatter <- ggplot() +
  geom_sf(data=NY_map,
          fill=gray(0.8), color=gray(0.7)) +
  geom_sf(data=new_data,
          aes(color=Base),
          alpha=0.7,
          show.legend='point') +
  labs(title = "Uber trips in NYC (Apr - Sep) 2014") +
  coord_sf(xlim=c(-75, -72), ylim=c(39.5, 41.2)) +
  #guides(color = guide_legend(override.aes = list(size = 6))) +
  theme_void()
    
  
scatter

Uber Data Visualization

Michel de Araújo

27/07/2021