library(DescTools)
library(ggplot2)
library(dbscan)
## 
## Attaching package: 'dbscan'
## The following object is masked from 'package:stats':
## 
##     as.dendrogram
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(cluster)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(prophet)
## Loading required package: Rcpp
## Loading required package: rlang
library(hopkins)
library(leaflet)

Exploratory Data Analysis

We load the data and check it for NAs and implausible values.

rides <- read.csv("uber-data.csv")

ggplot(rides, aes(x = Lon, y = Lat)) +
  geom_point(alpha = 0.3, size = 0.1) +
  theme_minimal()

DescTools::Desc(rides)
## ────────────────────────────────────────────────────────────────────────────── 
## Describe rides (data.frame):
## 
## data frame:  1028136 obs. of  4 variables
##      1028136 complete cases (100.0%)
## 
##   Nr  Class  ColName    NAs  Levels
##   1   chr    Date.Time  .          
##   2   num    Lat        .          
##   3   num    Lon        .          
##   4   chr    Base       .          
## 
## 
## ────────────────────────────────────────────────────────────────────────────── 
## 1 - Date.Time (character)
## 
##      length         n       NAs    unique    levels     dupes
##   1'028'136 1'028'136         0    42'907    42'907         y
##                100.0%      0.0%                              
## 
##                  level  freq  perc  cumfreq  cumperc
## 1   9/13/2014 18:44:00    82  0.0%       82     0.0%
## 2   9/13/2014 18:41:00    80  0.0%      162     0.0%
## 3   9/13/2014 18:45:00    78  0.0%      240     0.0%
## 4   9/18/2014 19:01:00    78  0.0%      318     0.0%
## 5   9/13/2014 15:40:00    75  0.0%      393     0.0%
## 6   9/13/2014 18:05:00    75  0.0%      468     0.0%
## 7   9/27/2014 23:40:00    75  0.0%      543     0.1%
## 8   9/13/2014 19:17:00    74  0.0%      617     0.1%
## 9   9/18/2014 19:04:00    74  0.0%      691     0.1%
## 10  9/19/2014 18:14:00    74  0.0%      765     0.1%
## 11  9/20/2014 18:47:00    74  0.0%      839     0.1%
## 12   9/4/2014 18:49:00    74  0.0%      913     0.1%
## ... etc.
##  [list output truncated]

## ────────────────────────────────────────────────────────────────────────────── 
## 2 - Lat (numeric)
## 
##      length          n      NAs   unique       0s     mean   meanCI'
##   1'028'136  1'028'136        0    5'135        0  40.7392  40.7391
##                 100.0%     0.0%              0.0%           40.7393
##                                                                    
##         .05        .10      .25   median      .75      .90      .95
##     40.6703    40.6912  40.7204  40.7418  40.7612  40.7760  40.7889
##                                                                    
##       range         sd    vcoef      mad      IQR     skew     kurt
##      1.3579     0.0408   0.0010   0.0302   0.0408   0.4615  10.1800
##                                                                    
## lowest : 39.9897, 40.058, 40.0794, 40.0972, 40.1122
## highest: 41.3043, 41.3141, 41.318, 41.3183 (2), 41.3476
## 
## ' 95%-CI (classic)

## ────────────────────────────────────────────────────────────────────────────── 
## 3 - Lon (numeric)
## 
##      length          n       NAs    unique        0s      mean    meanCI'
##   1'028'136  1'028'136         0     7'724         0  -73.9718  -73.9719
##                 100.0%      0.0%                0.0%            -73.9717
##                                                                         
##         .05        .10       .25    median       .75       .90       .95
##    -74.0104   -74.0066  -73.9962  -73.9831  -73.9628  -73.9296  -73.8653
##                                                                         
##       range         sd     vcoef       mad       IQR      skew      kurt
##      2.0573     0.0583   -0.0008    0.0233    0.0334    2.2393   27.9242
##                                                                         
## lowest : -74.7736, -74.7155, -74.7121, -74.7120, -74.7035
## highest: -72.8013, -72.7558, -72.7480, -72.7339, -72.7163
## 
## ' 95%-CI (classic)

## ────────────────────────────────────────────────────────────────────────────── 
## 4 - Base (character)
## 
##      length         n       NAs    unique    levels     dupes
##   1'028'136 1'028'136         0         5         5         y
##                100.0%      0.0%                              
## 
##     level     freq   perc    cumfreq  cumperc
## 1  B02617  377'695  36.7%    377'695    36.7%
## 2  B02598  240'600  23.4%    618'295    60.1%
## 3  B02682  197'138  19.2%    815'433    79.3%
## 4  B02764  178'333  17.3%    993'766    96.7%
## 5  B02512   34'370   3.3%  1'028'136   100.0%

Data Preparation

Since the dataset is huge and it is not feasible to cluster the whole set, we reduce the number of data points by taking a random subset.

set.seed(123)

# Take a random sample of 10,000 rides to keep clustering tractable
subset_rides <- rides[sample(nrow(rides), 10000), ]

# Parse the timestamps and derive date, hour, and weekday features
subset_rides$Date.Time <- mdy_hms(subset_rides$Date.Time)

subset_rides$Date <- as.Date(subset_rides$Date.Time)
subset_rides$Hour <- hour(subset_rides$Date.Time)
subset_rides$Weekday <- wday(subset_rides$Date.Time, label = TRUE)

ggplot(subset_rides, aes(x = Lon, y = Lat)) +
  geom_point(alpha = 0.3, size = 0.1) +
  theme_minimal()

coords <- subset_rides[, c("Lat", "Lon")]

fviz_nbclust(coords, kmeans, method = "silhouette") +
  theme_minimal()

fviz_nbclust(coords, kmeans, method = "gap_stat", , nboot = 50) +
  theme_minimal()
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 500000)
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Hopkins Statistic

set.seed(123)
hopkins_stat <- hopkins(coords, m = nrow(coords) - 1)
hopkins_stat
## [1] 0.997341
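
For calibration: under the convention used by the hopkins package, a value near 0.5 corresponds to uniformly scattered data and a value near 1 to strongly clustered data. A minimal sketch of this baseline, using uniform noise over the same bounding box (the sample size m = 100 is an arbitrary choice):

# Hopkins statistic of uniform noise over the same bounding box;
# this should come out near 0.5, unlike the ride coordinates
set.seed(123)
unif <- data.frame(
  Lat = runif(nrow(coords), min(coords$Lat), max(coords$Lat)),
  Lon = runif(nrow(coords), min(coords$Lon), max(coords$Lon))
)
hopkins(unif, m = 100)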

K-means Clustering

A Hopkins statistic this close to 1 indicates a strong clustering tendency, so the data is far from randomly scattered. Still, the shape of the point cloud is not ideal for K-means, but we run it anyway to see what we get. The silhouette score suggests that 2 clusters is a good choice, while the gap statistic prefers 5. We go with 5, because we expect more clusters and the silhouette for 5 is not much worse than for 2. We also extracted Date and Hour from the timestamps to analyze later. The cluster shapes are clearly irregular and non-convex rather than compact, so DBSCAN will be our eventual choice.
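
To back this comparison with numbers, here is a minimal sketch that prints the average silhouette width for k = 2 and k = 5 (we subsample to 2,000 points so the full distance matrix stays small; the subsample size is an arbitrary choice):

# Average silhouette width for k = 2 vs. k = 5 on a subsample
set.seed(123)
idx <- sample(nrow(coords), 2000)
d <- dist(coords[idx, ])
for (k in c(2, 5)) {
  km <- kmeans(coords[idx, ], centers = k, nstart = 10)
  sil <- cluster::silhouette(km$cluster, d)
  cat("k =", k, "avg silhouette width:",
      round(mean(sil[, "sil_width"]), 3), "\n")
}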

k <- 5
kmeans_result <- kmeans(coords, centers = k)

subset_rides$cluster <- kmeans_result$cluster

pal <- colorFactor(palette = "Set1", domain = subset_rides$cluster)

leaflet(subset_rides) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    color = ~pal(cluster),
    radius = 2,
    stroke = FALSE,
    fillOpacity = 0.7,
    popup = ~paste("Cluster:", cluster)
  ) %>%
  addLegend("bottomright", pal = pal, values = ~cluster,
            title = "K-Means Clusters")

DBSCAN Clustering

Again, the K-means clustering is unsatisfying: it just cuts the big cluster in half, and there are no clear groupings of points this way. We will not try hierarchical or k-medoids clustering either, because it is clear from the data that only DBSCAN or spectral clustering are better choices. After running DBSCAN, we do not think a better clustering can be found, so we will analyze the DBSCAN results.

The elbow of the kNN distance plot is at around 0.005, which makes it a good eps value. minPts was set to 10: with 4 attributes, the usual rule of thumb requires minPts ≥ 4 + 1 = 5, and checking minPts = 3 and minPts = 5 gave unsatisfactory results anyway.

k <- 4

# kNN distance plot: the elbow is at roughly 0.005
kNNdistplot(coords, k)
abline(h = 0.005, lty = 2)  # dashed reference line at the chosen eps

db <- dbscan(coords, eps = 0.005, minPts = 10)

# Attach DBSCAN labels to the subset (label 0 marks noise)
subset_rides$db_cluster <- db$cluster
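
As a quick sanity check of the minPts choice (a sketch; eps held at 0.005), we can compare how many clusters and noise points the alternatives mentioned above produce:

# Compare alternative minPts values at the chosen eps (label 0 = noise)
for (mp in c(3, 5, 10)) {
  d_mp <- dbscan(coords, eps = 0.005, minPts = mp)
  cat("minPts =", mp, "-> clusters:", max(d_mp$cluster),
      " noise points:", sum(d_mp$cluster == 0), "\n")
}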

Looking at the data, DBSCAN is the better choice for clustering this set. We also checked other eps values; 0.03 gives a moderately nice result as well, but 0.005 matches the elbow of the distance plot best.

pal <- colorFactor(palette = "Set1", domain = subset_rides$db_cluster)

leaflet(subset_rides) %>%
  addTiles() %>%
  addCircleMarkers(
    lng = ~Lon,
    lat = ~Lat,
    color = ~pal(db_cluster),
    radius = 2,
    stroke = FALSE,
    fillOpacity = 0.7,
    popup = ~paste("Cluster:", db_cluster)
  ) %>%
  addLegend("bottomright", pal = pal, values = ~db_cluster,
            title = "DBSCAN Clusters")
## Warning in RColorBrewer::brewer.pal(max(3, n), palette): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors

We can identify three big clusters in the city center, one noise class (drawn in red), and 14 smaller collections of points. This is a reasonable result; however, since we cannot analyze the whole set of about one million points, the structure of the subsample differs slightly from that of the full data.
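
These counts can be read directly off the DBSCAN labels; a one-liner to verify them (label 0 is the noise class):

# Tabulate DBSCAN cluster sizes, largest first
sort(table(db$cluster), decreasing = TRUE)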

Hour Plot of clusters

cluster_time_summary <- subset_rides %>%  
  group_by(db_cluster, Hour) %>%
  summarise(
    count = n(),
    .groups = "drop"
  ) %>%
  group_by(db_cluster) %>%
  mutate(
    total_cluster_pickups = sum(count),
    percent = 100 * count / total_cluster_pickups
  ) %>%
  arrange(db_cluster, Hour)

ggplot(cluster_time_summary, aes(x = Hour, y = percent, color = factor(db_cluster))) +
  geom_line(linewidth = 1) +
  facet_wrap(~ db_cluster, scales = "free_y") +
  labs(
    title = "Cluster activity by hour of day",
    x = "Hour of Day",
    y = "Percentage of Pickups",
    color = "Cluster"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

By hour of day, pickups in the biggest cluster (the city center) peak around the 16-18 slot, which matches reality. From 20:00 to 4:00 the count is at its lowest, reflecting night time. It spikes slightly around 7 in the morning, when people head to work and school, then dips a little from 10 to 15 while people are busy. The other clusters have their own characteristic active hours; in cluster 5, for example, rides are requested even earlier, around 6-7 in the morning.
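
To read the facets more precisely, a small sketch extracts the single busiest hour of each cluster from the summary table built above:

# Peak pickup hour per cluster
cluster_time_summary %>%
  group_by(db_cluster) %>%
  slice_max(percent, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(db_cluster, Hour, percent)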

Date Plot of clusters

cluster_date_summary <- subset_rides %>%
  group_by(db_cluster, Date) %>%
  summarise(
    count = n(),
    .groups = "drop"
  ) %>%
  group_by(db_cluster) %>%
  mutate(
    total_cluster_pickups = sum(count),
    percent = 100 * count / total_cluster_pickups
  ) %>%
  arrange(db_cluster, Date)

ggplot(cluster_date_summary, aes(x = Date, y = percent, color = factor(db_cluster))) +
  geom_line(linewidth = 1) +
  facet_wrap(~ db_cluster, scales = "free_y") +
  labs(
    title = "Cluster activity by date",
    x = "Hour of Day",
    y = "Percentage of Pickups",
    color = "Cluster"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

From the date chart we can tell that no cluster has a stable number of pickups from day to day: demand spikes on specific days and falls off soon after. In the smaller clusters the pickups are concentrated on a small number of dates, while in the bigger clusters the rides are spread more evenly across the whole time span.
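
This impression can be quantified from the same summary table: the number of distinct active days per cluster, and the share of its pickups that fall on its single busiest day (a sketch):

# Temporal concentration per cluster: active days and busiest-day share
cluster_date_summary %>%
  group_by(db_cluster) %>%
  summarise(
    active_days = n(),
    top_day_share = max(percent),
    .groups = "drop"
  )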

Base Plot of clusters

base_cluster_summary <- subset_rides %>%
  group_by(db_cluster, Base) %>%
  summarise(
    count = n(),
    mean_Lat = mean(Lat),
    mean_Lon = mean(Lon),
    .groups = "drop"
  ) %>%
  arrange(db_cluster, desc(count))

ggplot(base_cluster_summary, aes(x = factor(db_cluster), y = count, fill = Base)) +
  geom_bar(stat = "identity", position = "fill") +
  labs(title = "Base distribution across clusters",
       x = "Cluster", y = "Proportion of rides", fill = "Base") +
  theme_minimal()

Looking at the distribution of pickup “Bases”, the most popular one overall is B02617, while B02512 is the least popular. An interesting insight is that cluster 17 is served by only two Bases, which may indicate a special geographic location; an airport, perhaps. Some clusters also contain no rides from the red Base (clusters 9, 11, 14, and 18). Most clusters are dominated by the green Base, but in some of them the purple or yellow Base dominates instead.
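
The cluster-17 observation is easy to check directly (a quick sketch):

# Which Bases serve cluster 17, and how many rides each contributes
subset_rides %>%
  filter(db_cluster == 17) %>%
  count(Base, sort = TRUE)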

Analyze cluster centers

cluster_hour_summary <- subset_rides %>%
  group_by(db_cluster, Hour) %>%
  summarise(
    mean_Lat = mean(Lat),
    mean_Lon = mean(Lon),
    count = n(),
    .groups = "drop"
  )

ggplot(cluster_hour_summary, aes(x = mean_Lon, y = mean_Lat, color = factor(db_cluster))) +
  geom_path(aes(group = db_cluster), arrow = arrow(length = unit(0.15, "cm"))) +
  geom_point(size = 2) +
  labs(
    title = "Movement of Cluster Centers by Hour of Day",
    x = "Longitude", y = "Latitude",
    color = "Cluster"
  ) +
  theme_minimal()

We can see that the hourly centers of the clusters barely move, so there is no strong relation between hour of day and pickup locations within a cluster. The noise points behave differently, however: their center shifts for almost every hour of the day, meaning the hour directly affects which areas outside the dense clusters are active.
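
To make the "barely moves" claim concrete, here is a sketch that measures the total path length of each hourly center track (plain Euclidean distance in degrees, not geodesic):

# Total hourly drift of each cluster center; per the plot above,
# the noise class (0) is expected to rank near the top
cluster_hour_summary %>%
  group_by(db_cluster) %>%
  arrange(Hour, .by_group = TRUE) %>%
  summarise(
    path_length = sum(sqrt(diff(mean_Lon)^2 + diff(mean_Lat)^2)),
    .groups = "drop"
  ) %>%
  arrange(desc(path_length))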

cluster_date_summary <- subset_rides %>%
  group_by(db_cluster, Date) %>%
  summarise(
    mean_Lat = mean(Lat),
    mean_Lon = mean(Lon),
    count = n(),
    .groups = "drop"
  )

ggplot(cluster_date_summary, aes(x = mean_Lon, y = mean_Lat, color = factor(db_cluster))) +
  geom_path(aes(group = db_cluster), arrow = arrow(length = unit(0.15, "cm"))) +
  geom_point(size = 2) +
  labs(
    title = "Movement of Cluster Centers by Date",
    x = "Longitude", y = "Latitude",
    color = "Cluster"
  ) +
  theme_minimal()

We observe the same pattern for the relation between date and pickup locations as we did for the hour of day.

Forecasting number of rides

We first build a time series of ride counts for every cluster, then choose cluster 1 for forecasting.

rides_by_cluster_time <- subset_rides %>%
  group_by(db_cluster, Date.Time) %>%
  summarise(count = n(), .groups = "drop")

cluster1 <- rides_by_cluster_time %>%
  filter(db_cluster == 1) %>%
  arrange(Date.Time) %>%
  select(ds = Date.Time, y = count)

m <- prophet(cluster1)
## Disabling yearly seasonality. Run prophet with yearly.seasonality=TRUE to override this.
future <- make_future_dataframe(m, periods = 240, freq = "hour")
forecast <- predict(m, future)

prophet_plot_components(m, forecast)

The forecast by date suggests reduced demand for rides over the next 10 days, with the highest ride count around September 18. Saturday and Sunday are predicted to have the most pickups. Finally, the busiest hours are 8 in the morning and around 18 in the evening, as expected and as seen in the earlier cluster analysis.

plot(m, forecast)

The blue line is the fitted model and its extrapolation into the future; the light shaded band is the uncertainty interval. The forecast shows that the highest demand each day is expected at 8 and 18. Across the week, demand rises smoothly toward the weekend and falls again when the new week begins.
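
To gauge how trustworthy this forecast is, prophet also supports rolling-origin cross-validation. A minimal sketch, assuming the fitted model m from above and that the series is long enough for the default initial window (the 3-day horizon is an arbitrary choice):

# Rolling-origin cross-validation over a 3-day horizon
df_cv <- cross_validation(m, horizon = 3, units = "days")
head(performance_metrics(df_cv))  # MSE, RMSE, MAE, MAPE, coverage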

Conclusion

We used DBSCAN to obtain clusters, which revealed the densest pickup areas in NYC. Grouping those areas lets us analyze which hour of the day is the most active in each one. The areas can also be read as districts of the city, helping us adapt the service per district. By tracking how cluster centers move, we get an idea of which part of a specific cluster is busiest at a given hour or date. Additionally, we inferred which Bases are in demand in specific clusters and which are not.