USL HMWRK #1 - Uber Dataset

Uber data analysis

This short analysis is dedicated to uber data, which provide information on uber pickups in USA. Data were collected for September 2014.

In order to provide transparency, all codes will be shown within the report.

The aim of the analysis is to apply clustering methods in order to gain some insight into data and then carry out a simple prediction.

A few words about the data

Sys.setenv(LANG = "en")
options(digits=10)
#load dataset
setwd('C:\\usl_hmwrk_uber\\')
uber <- read.csv('uber-data.csv')
#packages needed (and those not that much needed)
library(caTools)
library(tidyverse)
library(factoextra)
library(stats)
library(flexclust)
library(mclust)
library(fpc)
library(clustertend)
library(cluster)
library(ClusterR)
library(hms)
library(ggmap)

Originally, the dataset consists of 4 columns and 1 028 136 observations.

The variables are:

Date.Time - timestamp of pickup (date and time)
Lon - Longitude
Lat - Latitude
Base - The TLC base company code affiliated with the Uber pickup

Let’s take a look at the dataset.

str(uber)

## 'data.frame':    1028136 obs. of  4 variables:
##  $ Date.Time: chr  "9/1/2014 0:01:00" "9/1/2014 0:01:00" "9/1/2014 0:03:00" "9/1/2014 0:06:00" ...
##  $ Lat      : num  40.2 40.8 40.8 40.7 40.8 ...
##  $ Lon      : num  -74 -74 -74 -74 -73.9 ...
##  $ Base     : chr  "B02512" "B02512" "B02512" "B02512" ...

sum(is.na(uber))

## [1] 0

Fortunately, there are no missing values in the data

That’s how looks pickups’ breakdown by TLC base company:

#let's decode the Base column
uber$Base <- ifelse(uber$Base=='B02512', 'UNTER_LLC', uber$Base)
uber$Base <- ifelse(uber$Base=='B02598', 'HINTER_LLC', uber$Base)
uber$Base <- ifelse(uber$Base=='B02617', 'WEITER_LLC', uber$Base)
uber$Base <- ifelse(uber$Base=='B02682', 'SCHMECKEN_LLC', uber$Base)
uber$Base <- ifelse(uber$Base=='B02764', 'DANACH-NY_LLC', uber$Base)

options(scipen=5)
# let's get rid of warning results: 
options(dplyr.summarise.inform = FALSE)

uber %>%
  group_by(Base) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = reorder(Base, -counts), fill = counts)) +
  geom_bar(stat="identity") +
  theme_minimal() +
  theme(axis.text.x = element_text(size=12))  +
  labs(x = "", y = "", title = "Number of Uber pickups in September 2014 by Base") +
  geom_text(aes(label = counts), vjust = 1.5, size = 4, color = "white") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  scale_y_discrete(labels = NULL)

And here is a view at the dataset by date and time:

# let's divide the first column into two new ones: date and time
uber$dt <-uber$Date.Time 
uber <- separate(data = uber, col = dt, into = c("Date", "Time"), sep = "\\s")
uber$Date.Time <- as.POSIXct(uber$Date.Time, format = "%m/%d/%Y %H:%M:%S")
uber$Date <- as.Date(uber$Date,format='%m/%d/%Y')
#install.packages('hms')
uber$Time <- as_hms(uber$Time)
# another look at the dataset's structure:
str(uber)

## 'data.frame':    1028136 obs. of  6 variables:
##  $ Date.Time: POSIXct, format: "2014-09-01 00:01:00" "2014-09-01 00:01:00" ...
##  $ Lat      : num  40.2 40.8 40.8 40.7 40.8 ...
##  $ Lon      : num  -74 -74 -74 -74 -73.9 ...
##  $ Base     : chr  "UNTER_LLC" "UNTER_LLC" "UNTER_LLC" "UNTER_LLC" ...
##  $ Date     : Date, format: "2014-09-01" "2014-09-01" ...
##  $ Time     : 'hms' num  00:01:00 00:01:00 00:03:00 00:06:00 ...
##   ..- attr(*, "units")= chr "secs"

uber %>%
  group_by(Date) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = Date)) +
  geom_line(size = 1.5, colour ='darkblue') +
  theme_minimal() +
  labs(x = "", y = "", title = "Number of Uber pickups in September 2014 by Date") +
  theme(legend.position = "off")

As we can see, there are some peaks in the data -> it is consistent with the hypothesis that weekends = parties = more pickups. We will dig into it later.

uber$Weekday <- weekdays(uber$Date)

uber$Weekday <- factor(uber$Weekday, levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

uber %>%
  group_by(Weekday) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = reorder(Weekday, -counts), fill = counts)) +
  geom_bar(stat="identity") +
  theme_minimal() +
  theme(axis.text.x = element_text(size=12))  +
  labs(x = "", y = "", title = "Number of Uber pickups in September 2014 by Weekday") +
  geom_text(aes(label = counts), vjust = 1.5, size = 4, color = "white") +
  theme(panel.grid.major.y = element_blank(), legend.position = "off") +
  scale_y_discrete(labels = NULL)

Saturdays and Fridays are very popular days when it comes to Uber pickups, but surprisingly, Tuesday is the most popular weekday when it comes to # of pickups. The explanation is pretty straightforward: There were 5 Tuesdays and only 4 weekends in September 2014. ;)

uber %>%
  group_by(Time) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = Time)) +
  geom_line(size = 1, colour ='darkgrey') +
  theme_minimal() +
  labs(x = "", y = "", title = "Number of Uber pickups in September 2014 by Time") +
  theme(legend.position = "off")

In general, that’s how the # of pickups is dependent on time. Actually, there are two peaks: aroud 8-9 am - people are probably going to work/ on a work meeting and then in the evening.

uber %>%
  group_by(Time, Weekday) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = Time, colour =Weekday)) +
  geom_line(size = 1) +
  theme_minimal() +
  labs(x = "", y = "", title = "Number of Uber pickups in September 2014 by Time ~ Weekday") +
  theme(legend.position = "off") +
  facet_wrap(~Weekday)

There are many pickups at Fridays and Saturdays nights (actually Sunday 00:00-03:00 as well). In fact, Tuesdays seem to be very busy for Uber drivers - there are many people using Uber then (morgning hours and evenings).

The last but not least glimpse at our dataset. It would be useful to check what is the most common localization of the pickups - let’s check that by computing the latitutde/longitude mode and check that point on the map

# Create the mode function.
# credits: https://www.tutorialspoint.com/r/r_mean_median_mode.htm
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

getmode(uber$Lon)

## [1] -73.9888

getmode(uber$Lat)

## [1] 40.774

The coordinates of the most popular points are: 40.774;-73.988. Below a presentation of this point and surroundings on the map (via google maps):

What about the clustering?

Purpose of this exercise is to apply some clustering methods on the dataset. This dataset is large in terms of using clustering methods (over 1 mil rows). That’s why we are going to use the CLARA Method. The clusters are going to be created by geographical variables (longitude and latitude).

Let’s take a look at whole dataset by these variables:

ggplot(uber, aes(Lat, Lon, alpha = .001, show.legend = FALSE)) +
  geom_point() +
  theme_minimal() +
  ggtitle('Uber pickups - full sample') +
  xlab('Latitude')+
  ylab('Longitude')

At first glance we can observe some outliers. It’s not good in general when it comes to clustering, but Clara method uses PAM algorithm. Fortunately, it is not sensitive to outliers.

I would say, that there might be 3-4 clusters. Let’s check that hypothesis.

In order to determine the appropriate number of clusters, I am going to run a few iterations of clara (assigning 3-7 clusters). Then I will calculate some measures concerning proposed clusters.

Before the clustering I am standardizing the x and y variables. For Clustering computations I am setting # of samples to 10, sample size to 1000 and distance metric as manhattan.

xy_z <- as.data.frame(lapply(uber[,2:3], scale))
#opt<-Optimal_Clusters_KMeans(xy_z, max_clusters=10, plot_clusters=TRUE, criterion="silhouette")
# another approach - 3 clusters and 6 samples specified - play with samplesize
set.seed(111)
clara_clust_3<-clara(xy_z, 3, metric="manhattan", stand=FALSE, samples=10,
           sampsize=1000, trace=0, medoids.x=TRUE,
           rngR=FALSE, pamLike=FALSE, correct.d=TRUE)
clara_clust_4<-clara(xy_z, 4, metric="manhattan", stand=FALSE, samples=10,
           sampsize=1000, trace=0, medoids.x=TRUE,
           rngR=FALSE, pamLike=FALSE, correct.d=TRUE)
clara_clust_5<-clara(xy_z, 5, metric="manhattan", stand=FALSE, samples=10,
           sampsize=1000, trace=0, medoids.x=TRUE,
           rngR=FALSE, pamLike=FALSE, correct.d=TRUE)
clara_clust_6<-clara(xy_z, 6, metric="manhattan", stand=FALSE, samples=10,
           sampsize=1000, trace=0, medoids.x=TRUE,
           rngR=FALSE, pamLike=FALSE, correct.d=TRUE)
clara_clust_7<-clara(xy_z, 7, metric="manhattan", stand=FALSE, samples=10,
           sampsize=1000, trace=0, medoids.x=TRUE,
           rngR=FALSE, pamLike=FALSE, correct.d=TRUE)

# plot the output
#fviz_cluster(clara_clust_3, geom="point") 
fviz_cluster(clara_clust_3, geom="point", pointsize=1, ggtheme=theme_classic())

fviz_cluster(clara_clust_4, geom="point", pointsize=1, ggtheme=theme_classic())

fviz_cluster(clara_clust_5, geom="point", pointsize=1, ggtheme=theme_classic())

fviz_cluster(clara_clust_6, geom="point", pointsize=1, ggtheme=theme_classic())

fviz_cluster(clara_clust_7, geom="point", pointsize=1, ggtheme=theme_classic())

Let’s take a look at silhouette plots for calculated clustering variations:

fviz_silhouette(clara_clust_3)

##   cluster size ave.sil.width
## 1       1  164         -0.03
## 2       2  381          0.61
## 3       3  455          0.24

fviz_silhouette(clara_clust_4)

##   cluster size ave.sil.width
## 1       1  145          0.10
## 2       2  369          0.60
## 3       3  450          0.26
## 4       4   36          0.57

fviz_silhouette(clara_clust_5)

##   cluster size ave.sil.width
## 1       1  136          0.06
## 2       2  330          0.52
## 3       3  202         -0.05
## 4       4   36          0.60
## 5       5  296          0.45

fviz_silhouette(clara_clust_6)

##   cluster size ave.sil.width
## 1       1  145          0.10
## 2       2  294          0.50
## 3       3  162          0.20
## 4       4   44          0.39
## 5       5  319          0.38
## 6       6   36          0.66

fviz_silhouette(clara_clust_7)

##   cluster size ave.sil.width
## 1       1  115          0.16
## 2       2  232          0.31
## 3       3  224          0.50
## 4       4  131          0.07
## 5       5   51          0.10
## 6       6  221          0.27
## 7       7   26          0.84

As we can see, the highest average score attains iteration with four clusters. Let’s validate it by another metric.

Let’s calculate Calinski-Harabasz measure for each iteration and check if results from silhouette plots would be applicable here:

In fact, 4 clusters seem to be the best possible option. We could ask one more question before evaluating our clusters: Should we carry out the clustering on this data at all?

Let’s examine that by carrying out Duda-Hart test for splitting. It states whether dataset should be split into two clusters.

clara_clust_2<-clara(xy_z, 2, metric="manhattan", stand=FALSE, samples=10,
           sampsize=1000, trace=0, medoids.x=TRUE,
           rngR=FALSE, pamLike=FALSE, correct.d=TRUE)

dudahart2(xy_z, clara_clust_2$clustering) #fpc::

## $p.value
## [1] 1
## 
## $dh
## [1] 0.7561315597
## 
## $compare
## [1] 0.6793398296
## 
## $cluster1
## [1] TRUE
## 
## $alpha
## [1] 0.001
## 
## $z
## [1] 3.090232306

When it comes to the uber data, it seems that clustering is not the best option here, but for now let’s leave it aside and analyse the clusters’ center points - in this case medoids.

clara_clust_4$medoids

##                Lat            Lon
## [1,] -1.2496473128  0.07059826515
## [2,] -0.2087104657 -0.40955980816
## [3,]  0.6313868015 -0.01685909820
## [4,] -2.2807871073  3.28079795528

clara_clust_4$i.med

## [1] 631117 443174 590736 843597

uber <- cbind(uber,xy_z,clara_clust_4$clustering)
colnames(uber) <- c("Date.Time", "Lat", "Lon", "Base", "Date", "Time", "Weekday", "Lat_z", "Lon_z", "cluster")

uber$cluster <- as.factor(uber$cluster)

medoids <- uber[c(clara_clust_4$i.med), -c(1,8,9)]  
medoids

##            Lat      Lon          Base       Date     Time  Weekday cluster
## 631117 40.6882 -73.9677    WEITER_LLC 2014-09-28 23:18:00   Sunday       1
## 443174 40.7307 -73.9957    WEITER_LLC 2014-09-13 18:30:00 Saturday       2
## 590736 40.7650 -73.9728    WEITER_LLC 2014-09-25 18:32:00 Thursday       3
## 843597 40.6461 -73.7805 SCHMECKEN_LLC 2014-09-29 22:40:00   Monday       4

All medoids were registered in the evening, three of them belong to same Base firm (WEITER LLC).

Each of the medoids represent diffferend Weekday and day of September.

We will examine whole clusters by these characteristics later. For now, let’s look at the medoids’ location on the map.

plot on map - medoids with clusters assignment

Medoid 1 is surrounded by Universities in Brooklyn.
Medoid 2 is set at the Washington Square Park.
Medoid 3 is set at the Central Park.
Medoid 4 is set by the John F. Kennedy International Airport. It seems to be fairly reasonable that this place would be popular for Taxis.

As we can see, there are some strategic points from which accumulate taxis’ pickups (in our case parks and an airport)

Now let’s take a look at assigned clusters by date and time characteristics.

Clusters by the datetime

The clusters are going to be analysed in following sequence:

date
weekday
time
time ~ weekday

Firstly though let’s take a look at number of pickups assigned to each cluster:

uber %>%
  group_by(cluster) %>%
  summarize(counts = n())

## # A tibble: 4 x 2
##   cluster counts
##   <fct>    <int>
## 1 1       139689
## 2 2       436825
## 3 3       415447
## 4 4        36175

Clusters 2 and 3 contain similar quantity of pickups (so-called ‘Park Clusters’)
Cluster 1 is not relatively small (Brooklyn-Uni one - 130k vs 410-430k).
Cluster 4 is the smallest one (Airport Cluster - 36k)

uber %>%
  group_by(cluster, Date) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = Date, color=cluster)) +
  geom_line(size = 1.2) +
  theme_minimal() +
  labs(x = "", y = "", title = "Number of Uber pickups in September 2014 for assigned clusters by Date")

Clusters 1-3 seem to be dependent on the weekends (more pickups at the weekends) - we will check that in the moment.
What is more interesting - Airport Cluster #4 is stable over time - demand for Uber services are equal all the time.

uber %>%
  group_by(cluster, Weekday) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = Weekday, fill = cluster)) +
  geom_col(position = position_dodge()) +
  theme_minimal() +
  theme(axis.text.x = element_text(size=12))  +
  labs(x = "", y = "", title = "Number of Uber pickups in September 2014 by Weekday")

Clusters 1-3 are in fact more popular during weekends. Surrounding of Cluster #2 seems to be “Party-sensitive” (we will check that hypothesis soon).
Monday-Thursday quantity of pickups are very similar for clusters 2 and 3, weekends make the difference.

Let’s take a look at the time of pickups for our clusters.

uber %>%
  group_by(cluster, Time) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = Time, color=cluster)) +
  geom_line(size = 0.9, alpha=0.7) +
  theme_minimal() +
  labs(x = "", y = "", title = "Number of Uber pickups in September 2014 for assigned clusters by Time")

Clusters 1 and 4 are quite stable over time.
Cluster 2 and 3 have two peaks - about 8-9 am and in the afternoon (~5-6 pm).
What is interesting, there are many pickups in cluster #2 at the night time (10pm-3am). It is a next proof to call Cluster #2 the “Party Cluster”. We will be able to state that clearly by looking at the next plot.

uber %>%
  group_by(cluster, Time, Weekday) %>%
  summarize(counts = n()) %>%
  ggplot(aes(y=counts, x = Time, color=cluster)) +
  geom_line(size = 0.5, alpha=0.8) +
  theme_minimal() +
  labs(x = "", y = "", title = "Number of Uber pickups in September 2014 for assigned clusters by Time") +
  facet_wrap(.~Weekday)

Friday-Sunday nights are dominated by cluster #2. Then we can call it oficially the “Party Cluster”.

Let’s move on the last step - a simple prediction.

Predicting clusters for 09/02

We are going to use clusters for some forecasting purposes. We need to choose a sub-sample. It will consist of pickups registered on 09/02 (Tuesday).

There are 28831 observation registered this day. The dataset will be split into two parts: training and test set (in proportion 9:1).

We need a smaller sample since kcca class object is necessary.

set.seed(111)
uber_pred <- uber[uber$Date=='2014-09-02',c(2:4,6)]
xy_z_pred <- as.data.frame(lapply(uber_pred[,1:2], scale))


split = sample.split(1:nrow(xy_z_pred), SplitRatio = 0.9)
training_set = subset(xy_z_pred, split == TRUE)
test_set = subset(xy_z_pred, split == FALSE)
uber_pred <- cbind(uber_pred,xy_z_pred,split)
colnames(uber_pred) <- c('Lat','Lon','Base','Time','Lat_z','Lon_z','split')

pred_kcca = kcca(uber_pred[split==TRUE, 5:6], k=4, kccaFamily("kmeans"))

## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'

## Also defined by 'kernlab'

## Found more than one class "kcca" in cache; using the first, from namespace 'flexclust'

## Also defined by 'kernlab'

pred_kcca

## kcca object of family 'kmeans' 
## 
## call:
## kcca(x = uber_pred[split == TRUE, 5:6], k = 4, family = kccaFamily("kmeans"))
## 
## cluster sizes:
## 
##     1     2     3     4 
##  2386  8858 13392  1311

# do the prediction
pred_train <- predict(pred_kcca)
pred_test <- predict(pred_kcca, newdata=uber_pred[split==FALSE, 5:6])

Let’s now visualize the output

image(pred_kcca)
points(uber_pred[split==TRUE, 5:6], col=pred_train, pch=19, cex=0.3)
points(uber_pred[split==FALSE, 5:6], col=pred_test, pch=22, bg="grey")

As we can see, the predicted points seem to be located near the center. I won’t dig deeper into that, this example is just a proof, that some clustering prediction may be done using this dataset.

Summary

This dataset was a challenge to deal with using clustering methods. Clara came to the rescue. In NY, September 2014 Uber pickups were mostly needed at the weekends, but surprisingly, the most pickups occur on Tuesday (remember that there were 5 Tuesdays and only 4 Fridays/Saturdays).

There were many outliers and the clusters separation was not so easy (which resulted in relatively low Silhouette width averages). In fact, results of Duda-Hart Test state that clustering might not be the best option to deal with the data. But We did the clustering anyway!

We managed to distinguish four significant clusters, which are:

#1 - Uni-Brooklyn Cluster
#2 - Washington Square Park Cluster
#3 - Central Park - Party Cluster
#4 - John F. Kennedy International Airport Cluster.

Mornings and evenings are the busiest periods when it comes to Uber pickups (8-9 am and 5-8 pm, except for weekend ofc :)).

It was fun to analyse this dataset I hope that results are also interesting for the Viewer! :)