1. Introduction

    • Prediction for 2 modes
  2. Dataset and Preprocessing

    • Uber data

    • Brooklyn Spatial Data

    • Exploratory Data Analysis

  3. Preparing data for 2 modes

    • Preparing daily mode data

    • Preparing hourly mode data

    • Methods and data format

  4. Models

    • Models for daily mode

    • Prediction for daily mode

    • Models for hourly mode

    • Prediction for hourly mode

Chapter 1 Introduction

This project delves into the spatial organization of Uber demand in Brooklyn, USA, across different temporal dimensions. This study uses a robust dataset containing historical Uber pickup information to conduct a meticulous exploration to track the evolution of these patterns over time.

This investigation goes beyond mere observation and aims to give the study predictive power. To achieve this, advanced modeling techniques will be employed, in particular Random Forests and Extreme Gradient Boosting. These models will be enhanced to cover the spatial dimension, providing a more nuanced understanding of the intricate relationships and dynamics in urban ride-hailing transportation geography.

One of the key aspects of this research is the application and demonstration of random forest and extreme gradient boosting models in predicting higher demand locations for Uber. These machine learning methods have proven to be powerful tools for capturing complex patterns in data, and their integration with spatial dimensions is expected to provide a more accurate representation of Uber demand dynamics in Brooklyn.

Incorporating spatial dimensions into these predictive models will be a key focus. This project aims to demonstrate how this enhancement can significantly improve forecast accuracy. By enabling the model to grasp the spatial relationships between adjacent observations, the results are expected to provide detailed insights into localized demand patterns for Uber in Brooklyn.

Furthermore, the results of this research project have implications for optimizing Uber’s operating strategies in urban environments. Understanding and predicting localized demand patterns is critical to improving service deployment, resource allocation, and strategic decision-making.

In summary, this study strives to uncover the temporal evolution and nuanced spatial dynamics of Uber demand in Brooklyn, employing advanced modeling techniques to predict the location of higher demand. The emphasis on computational efficiency and the incorporation of spatial dimensions are expected to improve the model’s predictive accuracy, ultimately providing actionable insights for Uber’s strategic and operational efforts in the complex urban environment of Brooklyn, USA.

1.1 Prediction for 2 modes

1.1.1 Prediction for Future Day Demand

In this mode, I focus on the overall demand trend for the entire future day. This may involve more macro factors such as week changes, special events (e.g., weekends and weekdays), etc. The output of this prediction mode is typically an estimate of the overall demand for the next day, assisting Uber in optimizing decisions such as vehicle scheduling and resource allocation.

1.1.2 Prediction for Future Hours Demand

Unlike the daily demand forecast, this mode is more focused on the demand situation within each hour. It provides more precise estimates of demand within specific time periods, helping Uber allocate vehicles and resources more accurately to meet demand fluctuations during peak and off-peak hours.

1.1.3 Differences and Significance of the Two Modes

Time Granularity: Daily demand forecasting focuses on the overall daily trend, providing support for long-term and macro decisions. Hourly demand forecasting is more precise and can be used for short-term and more specific decisions, such as hourly vehicle scheduling and resource allocation.

Decision Support: Daily demand forecasting can help Uber plan long-term strategies, such as holiday promotions, market expansion plans, etc. Hourly demand forecasting is more suitable for real-time decision-making, such as responding to sudden events and optimizing services during peak hours.

User Experience: Optimizing daily demand forecasting can improve the overall user experience, ensuring Uber can meet a wide range of demand. Optimizing hourly demand forecasting ensures users can quickly access services at any moment, enhancing immediacy and availability. By considering both of these prediction modes simultaneously, Uber can comprehensively address demand changes at different time scales, providing a higher-quality service.

Chapter 2 Dataset and Preprocessing

For this project, I will use two datasets:

2.1 Uber data

The raw data consists of Uber pickups in New York City from April 2014. The data has the following columns:

  • Date/Time : The date and time of the Uber pickup

  • Lat : The latitude of the Uber pickup

  • Lon : The longitude of the Uber pickup

  • Base : The TLC base company code affiliated with the Uber pickup

# Uber pickups in April 2014 in New York
uber = read.csv(here("data/uber-raw-data-apr14.csv"))

Once loaded, we can take a glimpse at the Uber table:

str(uber)
## 'data.frame':    564516 obs. of  4 variables:
##  $ Date.Time: chr  "4/1/2014 0:11:00" "4/1/2014 0:17:00" "4/1/2014 0:21:00" "4/1/2014 0:28:00" ...
##  $ Lat      : num  40.8 40.7 40.7 40.8 40.8 ...
##  $ Lon      : num  -74 -74 -74 -74 -74 ...
##  $ Base     : chr  "B02512" "B02512" "B02512" "B02512" ...
head(uber)
##          Date.Time     Lat      Lon   Base
## 1 4/1/2014 0:11:00 40.7690 -73.9549 B02512
## 2 4/1/2014 0:17:00 40.7267 -74.0345 B02512
## 3 4/1/2014 0:21:00 40.7316 -73.9873 B02512
## 4 4/1/2014 0:28:00 40.7588 -73.9776 B02512
## 5 4/1/2014 0:33:00 40.7594 -73.9722 B02512
## 6 4/1/2014 0:33:00 40.7383 -74.0403 B02512

We can see that 564,516 Uber pickups occurred in April 2014 in New York.

2.2 Brooklyn spatial data

This section visualizes spatial data related to Brooklyn, New York. The code loads geographic data for New York State, extracts the Brooklyn area, and plots spatial features such as places, roads, and land use. The x- and y-axis ranges and the Uber pickup points are also restricted so that the plots focus on the Brooklyn area.

# New york state border
newyork.place = sf::read_sf(dsn="data/new-york-latest-free.shp/",
                            layer="gis_osm_places_a_free_1")
# specifically the Brooklyn area
br.polygon = newyork.place %>% filter(name=="Brooklyn") 
br.polygon = st_transform(br.polygon, crs = "+proj=longlat +datum=NAD83")

#save Brooklyn state border
#saveRDS(br.polygon,"data/brooklynBorder.Rdata")

#convert to sp 
br.polygon.sp= as_Spatial(br.polygon)

# new york state roads
newyork.roads = sf::read_sf(dsn="data/new-york-latest-free.shp/",
                            layer="gis_osm_roads_free_1")
newyork.roads = st_transform(newyork.roads, crs = "+proj=longlat +datum=NAD83")

# new york state landuse
newyork.landuse = sf::read_sf(dsn="data/new-york-latest-free.shp/",
                              layer="gis_osm_landuse_a_free_1")
newyork.landuse = st_transform(newyork.landuse, crs = "+proj=longlat +datum=NAD83")

2.2.1 Generate Brooklyn grid cells

I created a spatial grid of 308 cells over the Brooklyn, New York area. Each grid cell covers a spatial extent of 0.01 degrees in both the x and y dimensions. The generated grid serves as the spatial basis for subsequent clustering and predictive modeling tasks, providing a structured framework for analyzing and predicting spatial patterns within the Brooklyn region.

intersection.grid <- st_as_stars(st_bbox(br.polygon),  dx = 0.01, dy = 0.01)
intersection.grid = st_as_sf(intersection.grid)
intersection.grid  = intersection.grid[br.polygon, ]  #308 grids
intersection.grid$values=c(1:308)
head(intersection.grid,1)
## Simple feature collection with 1 feature and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -73.96669 ymin: 40.72943 xmax: -73.95669 ymax: 40.73943
## Geodetic CRS:  +proj=longlat +datum=NAD83
##    values                       geometry
## 10      1 POLYGON ((-73.96669 40.7394...
ggplot() +
  geom_sf(data = intersection.grid, aes(geometry = geometry), fill = "lightblue", color = "white") +
  geom_sf(data = br.polygon, aes(geometry = geometry), fill = "transparent", color = "red") +
  geom_sf_text(data = intersection.grid, aes(label = values), size = 2, color = "black") +
  theme_minimal() +
  labs(title = "Intersection Grid with Brooklyn Polygon")-> pic1
pic1
## Warning in st_point_on_surface.sfc(sf::st_zm(x)): st_point_on_surface may not
## give correct results for longitude/latitude data

# convert to  sp
intersection.grid.sp <-  as_Spatial(intersection.grid)

In the next step, I filtered the Uber pickup dataset to retain only pickups occurring within the Brooklyn area, leaving 61,668 Uber pickups that fall within the Brooklyn boundary.

# Secondly, keep only the pickups within Brooklyn
uber.br=uber
coordinates(uber.br)=c("Lon","Lat")
crs(uber.br)=crs(intersection.grid)
uber.br$gridID=over(uber.br,br.polygon.sp)
uber.br=data.frame(uber.br)
uber.br[!is.na(uber.br$gridID.osm_id),]->uber.br #61,668

Sys.setlocale("LC_TIME", "English")
## [1] "English_United States.1252"
uber.br$Date.Time <- as.POSIXct(uber.br$Date.Time, format = "%m/%d/%Y %H:%M:%S")

uber.br=uber.br %>%
  mutate(day=day(Date.Time),
         hour=hour(Date.Time),
         weekdays=weekdays(Date.Time),
         week=week(Date.Time))

head(uber.br)
##               Date.Time     Lat      Lon   Base gridID.osm_id gridID.code
## 23  2014-04-01 04:15:00 40.6561 -73.9531 B02512       9691750        1010
## 65  2014-04-01 06:15:00 40.7204 -73.9545 B02512       9691750        1010
## 89  2014-04-01 06:44:00 40.7231 -73.9442 B02512       9691750        1010
## 102 2014-04-01 07:00:00 40.6990 -73.9951 B02512       9691750        1010
## 178 2014-04-01 08:19:00 40.7214 -73.9475 B02512       9691750        1010
## 193 2014-04-01 08:32:00 40.6819 -73.9358 B02512       9691750        1010
##     gridID.fclass gridID.population gridID.name optional day hour weekdays week
## 23         suburb           2736074    Brooklyn     TRUE   1    4  Tuesday   14
## 65         suburb           2736074    Brooklyn     TRUE   1    6  Tuesday   14
## 89         suburb           2736074    Brooklyn     TRUE   1    6  Tuesday   14
## 102        suburb           2736074    Brooklyn     TRUE   1    7  Tuesday   14
## 178        suburb           2736074    Brooklyn     TRUE   1    8  Tuesday   14
## 193        suburb           2736074    Brooklyn     TRUE   1    8  Tuesday   14

2.2.2 Visualizing spatial data

#Brooklyn Bounding box:  xmin: -74.05669 ymin: 40.55034 xmax: -73.83294 ymax: 40.73943

#plot 1:Brooklyn border with Uber pickups
ggplot() +
  # Plot Brooklyn polygon
  geom_sf(data = br.polygon, fill = "lightblue", color = "black") +
  # Plot Uber pickups with color-coded weekdays
  geom_point(data = uber.br, aes(x = Lon, y = Lat, color = weekdays), alpha = 0.5, size = 2) +
  # Customize color scale for weekdays
  scale_color_brewer(palette = "Set3") +
  # Add labels and title
  labs(title = "Uber Pickups in Brooklyn",
       x = "Longitude",
       y = "Latitude",
       color = "Weekday") +
  # Adjust the theme for better aesthetics
  theme_minimal() +
  theme(legend.position = "bottom",
        axis.text.x  = element_text(angle = 90)) +
  # Set axis limits
  scale_x_continuous(limits = c(-74.06, -73.83)) +
  scale_y_continuous(limits = c(40.54, 40.74)) -> pic2


#plot 2: Brooklyn roads plot 
ggplot()+
  geom_sf(data=newyork.roads %>% filter(fclass=="motorway"),
          color="navy")+
  geom_sf(data=newyork.roads %>% filter(fclass=="primary"),
          color="steelblue") +
  geom_sf(data=newyork.roads %>% filter(fclass=="residential"),
          color="darkcyan") +
  geom_sf(data = br.polygon,
          fill="transparent",
          color="grey80")+
  scale_x_continuous(limits = c(-74.06,-73.83)) +
  scale_y_continuous(limits = c(40.54,40.74))  ->pic3


plot_grid(pic3,pic2,align = "hv",ncol=2)

#plot 3: Brooklyn landuse
newyork.landuse %>% 
  mutate(fclass=case_when(
    fclass %in% c("grass","forest","residential","scrub",
                  "park","farmland","cemetery","meadow","retail","industrial") ~ fclass,
   TRUE ~"others"
  )) -> newyork.landuse
ggplot() +
  geom_sf(data = newyork.landuse,
          aes(fill = fclass),
          color = "navy") +
  geom_sf(data = br.polygon,
          fill = "transparent",
          color = "grey60") +
  scale_x_continuous(limits = c(-74.06, -73.83)) +
  scale_y_continuous(limits = c(40.54, 40.74)) +
  
  # Use a gradient color palette for land-use categories
  scale_fill_brewer(palette = "Set3") +
  
  # Add labels and title
  labs(title = "Land Use in Brooklyn",
       x = "Longitude",
       y = "Latitude",
       fill = "Land Use Category") +
  
  # Adjust the theme for better aesthetics
  theme_minimal() +
  theme(legend.position = "bottom") -> pic4
pic4

2.3 Exploratory Data Analysis

Different time mode analysis

  • Histogram Analysis
# Customize the appearance of the day histogram
ggplot(uber.br, aes(x = day)) +
  geom_histogram(binwidth = 1, fill = "skyblue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Uber Demand by Day",
       x = "Day of the Month",
       y = "Frequency") +
  theme_minimal()-> pic5
pic5

# Customize the appearance of the hour histogram
ggplot(uber.br, aes(x = hour)) +
  geom_histogram(binwidth = 1, fill = "lightcoral", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Uber Demand by Hour",
       x = "Hour of the Day",
       y = "Frequency") +
  theme_minimal() ->pic6
pic6

# Customize the appearance of the weekdays histogram
uber.br$weekdays=factor(uber.br$weekdays,
                              levels = c("Monday",
                                         "Tuesday",
                                         "Wednesday",
                                         "Thursday",
                                         "Friday",
                                         "Saturday",
                                         "Sunday"),
                              labels = c("Monday",
                                         "Tuesday",
                                         "Wednesday",
                                         "Thursday",
                                         "Friday",
                                         "Saturday",
                                         "Sunday"))
ggplot(uber.br, aes(x = weekdays)) +
  geom_bar(fill = "darkviolet", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Uber Demand by Weekdays",
       x = "Weekdays",
       y = "Frequency") +
  theme_minimal() -> pic7
pic7

Day Histogram: The distribution of Uber demand over the 30 days in April reveals a distinct cyclical pattern. Notably, the frequency experiences an initial ascent, reaches a peak, and subsequently undergoes a decline. This cyclicality signifies fluctuations in demand throughout the month.

Hour Histogram: Analyzing the 24-hour frequency in April demonstrates a concentrated demand for Uber services during the late afternoon and early evening hours, specifically between 17:00 and 22:00. Conversely, demand experiences a noticeable dip during the early morning hours, from midnight to 5:00.

Weekdays Histogram: Examining the 7-day frequency in April unveils a discernible trend. Demand rises at the start of the week to a minor peak on Wednesday, dips afterwards, and then climbs to its largest peak on Saturday before tapering off towards Sunday. This pattern suggests a midweek surge in demand, with Saturday standing out as the day of highest demand.

  • Weekdays-hour line plot
uber.br %>% 
  dplyr::group_by(weekdays,hour) %>%
  dplyr::summarise(n=n())   %>%
  mutate(index=c(1:length(weekdays))) %>%
  ungroup()-> uber.weekdays.hour

# Create a more visually appealing line plot with points and facet by weekdays
ggplot(uber.weekdays.hour, aes(x = index, y = n, group = weekdays, color = weekdays)) +
  geom_line(size = 1, alpha = 0.7) +
  geom_point(size = 2, alpha = 0.8) +
  facet_wrap(~weekdays, scales = "free_x") +
  labs(title = "Uber Demand by Weekdays and Hours",
       x = "Index",
       y = "Frequency") +
  theme_minimal() +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  scale_color_brewer(palette = "Set2") -> pic8
pic8

Upon analyzing the line plot depicting Uber demand across weekdays and hours, distinct patterns emerge, underscoring the variation between weekdays and weekends.

Weekdays:

  • Morning Peak (7:00-9:00): Notably, on weekdays, Uber demand experiences an early peak between 7:00 and 9:00. This surge in demand during the morning hours suggests a pronounced reliance on Uber for commuting to work or other daytime obligations.

  • Late Peak (17:00-22:00): Subsequently, there is a substantial increase in demand during the late afternoon and early evening hours, specifically between 17:00 and 22:00. This late peak aligns with the typical post-work commute and evening activities.

Weekends:

  • Increasing Demand: In contrast to weekdays, Saturday exhibits a steadily increasing pattern that continues until midnight, with consistently strong demand throughout the day, while Sunday is comparatively more stable.

  • Late-Night Demand (20:00-0:00): A noteworthy observation on weekends is the heightened demand late in the evening, from 20:00 to midnight. This trend implies that during weekends people are more inclined to engage in late-night activities or socializing, sustaining demand into the early hours of the morning.

In summary, weekdays exhibit distinct peak times associated with work-related commutes, while weekends show demand that builds steadily into the late-night hours, indicative of social and leisure activities.

Chapter 3 Preparing Data for Models

In the preparation of daily pattern data, the strategy revolves around utilizing spatial data spanning 21 days to predict a single day into the future. The training set encompasses the initial 20 days, while the remaining 1 day serves as the test set. Emphasizing the daily mode, the objective is to craft a one-day forecast. This approach adeptly captures the nuanced daily and weekly shifts in ride-hailing dynamics. By focusing on weekly and daily patterns, insightful revelations into spatiotemporal trends in Uber pickups are derived, fortifying the model’s predictive prowess for subsequent spatial analyses.

Contrastingly, in the preparation of hourly pattern data, the strategy centers on leveraging spatial data across 7 days to forecast demand for the subsequent 8 hours. The dataset, totaling 7 * 24 =168 hours, utilizes the first 160 hours for training and the remaining 8 hours for testing. Employing the hourly mode, the goal is to generate a forecast for the immediate 8 hours. This approach effectively captures the swift hourly fluctuations in ride-hailing demand and rigorously evaluates model performance, ensuring robustness and adaptability. By focusing on daily and hourly patterns, the model gains valuable insights into spatiotemporal trends in Uber pickups, ultimately amplifying its predictive capabilities for future spatial analyses. This holistic approach seeks to refine the accuracy and practicality of Uber’s forecasts, particularly in addressing the intricate dynamics of demand at different temporal scales.
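As a minimal sketch of the two splits (index conventions taken from the descriptions above, not from the modeling code itself), the train/test partitions are:

# Minimal sketch of the two train/test splits described above
train.days  = 1:20       # daily mode: first 20 days for training
test.day    = 21         # daily mode: day 21 held out for testing
train.hours = 1:160      # hourly mode: first 160 hours for training
test.hours  = 161:168    # hourly mode: last 8 hours held out for testing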

3.1 Preparing daily mode data

I plan to use spatial data spanning 21 days, with the first 20 days for training and the remaining day for testing. This approach allows me to capture daily changes in ride-hailing demand and to evaluate performance on a separate set of data to ensure robustness. By focusing on weekly and daily patterns, I can gain insights into spatiotemporal trends in Uber pickups.

3.1.1 Daily pickup points into grid cells

# assign the daily pickup points to grid cells
day.pickup=matrix(data=0,nrow=308,ncol = 22)
colnames(day.pickup)=c(paste0(c(1:21),"day"),"total_sum")
rownames(day.pickup)=c(1:308)
mid = data.frame(gridID=c(1:308))

for (i in 1:21) {
  pickup.day= uber.br %>% filter(day==i)
  coordinates(pickup.day)=c("Lon","Lat")
  crs(pickup.day)=crs(intersection.grid)
  #pickup.day=st_as_sf(pickup.day)
  pickup.day$gridID=over(pickup.day,intersection.grid.sp)
  pickup.day=data.frame(pickup.day)
  pickup.num=pickup.day %>% dplyr::group_by(values) %>% dplyr::summarise(n=n())
  # aggregate by gridID to count pickups per grid cell on day i,
  # then merge the daily counts into the day.pickup matrix
  pickup.merge=merge(mid,pickup.num,by.x="gridID",by.y="values",all.x=T)$n
  pickup.merge[which(is.na(pickup.merge))]=0
  day.pickup[,i]=pickup.merge
}
day.pickup[,22]=rowSums(day.pickup) 
day.pickup[,22] %>% sum() # 41,968 pick up points in grid 
## [1] 41968
head(day.pickup)
##   1day 2day 3day 4day 5day 6day 7day 8day 9day 10day 11day 12day 13day 14day
## 1   11   14   22   28   34   22   16   15   16    10    27    34    24    10
## 2   21   24   37   44   46   30   34   24   19    32    37    31    22    17
## 3    0    1    0    1    0    0    0    0    0     1     1     0     0     1
## 4    0    0    0    0    0    0    0    0    0     0     0     0     0     0
## 5   37   42   58   75  110   86   44   37   47    46    60    85    60    34
## 6   53   54   46   81  110   79   52   47   52    47    60    94    74    50
##   15day 16day 17day 18day 19day 20day 21day total_sum
## 1    14    15    16    19    27    20    14       408
## 2    23    18    23    22    27    30    13       574
## 3     2     1     2     0     0     1     0        11
## 4     0     0     0     0     0     0     0         0
## 5    61    54    68    75   118    64    29      1290
## 6    45    55    62    84    95    58    47      1345

3.1.2 Mapping Daily pickups on an OpenStreetMap


3.1.3 Visualizing the density of pickups overall and day by day

# Visualize the overall density of pickups
autoplot.OpenStreetMap(OpenStreetMap::openproj(range.osm)) +
  stat_density_2d(data=uber.br,
                  aes(x=Lon,
                      y=Lat,
                      fill=stat(level)),
                  bins=30,
                  alpha=.9,
                  geom="polygon"
                  ) +
  scale_fill_viridis(option = "C")  +
  geom_sf(data=br.polygon,aes(geometry=geometry),
          inherit.aes = F,
          fill="transparent") +
  theme(axis.title = element_blank(),
        axis.text.x=element_text(angle = 90)) -> pic8
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
pic8
## Warning: `stat(level)` was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(level)` instead.
## ℹ The deprecated feature was likely used in the OpenStreetMap package.
##   Please report the issue to the authors.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

#Density of pickups day by day
autoplot.OpenStreetMap(OpenStreetMap::openproj(range.osm)) +
  stat_density_2d(data=uber.br %>% dplyr::filter(day %in% c(1:7)),
                  aes(x=Lon,
                      y=Lat,
                      fill=stat(level)),
                  bins=30,
                  alpha=.9,
                  geom="polygon"
                  ) +
  facet_wrap(~day) +
  scale_fill_viridis(option = "C")  +
  geom_sf(data=br.polygon,aes(geometry=geometry),
          inherit.aes = F,
          fill="transparent") +
  theme(axis.title = element_blank(),
        axis.text.x=element_text(angle = 90))+
   labs(
    title = "Density of Pickups Day by Day",
    fill = "Density"  # You can add labels for other aesthetics as well
  )->pic9
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
pic9

The daily density of Uber pickups reveals both common patterns and subtle variations across different days. Notably, a consistent high-density area is observed in the upper part of Brooklyn, suggesting a consistent demand in that region. However, intriguingly, some days exhibit a singular high-density center, while on others, there are two distinct centers. This variance indicates a dynamic and shifting demand landscape within the city. The factors contributing to these fluctuations could be multifaceted, ranging from day-specific events, temporal patterns, or even changes in user behavior.

3.2 Preparing hourly mode data

In the context of hourly spatial data spanning 7 days, totaling 168 hours, the chosen approach involves dedicating the initial 160 hours to training and reserving the remaining 8 hours for testing. This strategy allows for a focused examination of hourly changes in ride-hailing patterns. By concentrating on both daily and hourly patterns, the model is poised to capture the nuanced spatiotemporal trends in Uber pickups. This analysis provides a deep understanding of how demand fluctuates not only throughout the day but also on an hourly basis. The separation of training and testing datasets ensures a rigorous evaluation of the model’s performance, fostering robustness and adaptability in predicting dynamic and location-specific demand patterns.

3.2.1 Hourly pickup points into grid cells

# Hourly data: only the first 7*24 = 168 hours (days 1-7) are used
hour.pickup=matrix(data=0,nrow=308,ncol = 169)
colnames(hour.pickup)=c(paste0("hour",c(1:168)),"total_sum")
rownames(hour.pickup)=c(1:308)
mid = data.frame(gridID=c(1:308))

# Loops over 7*24 hours
for (i in 1:168) {
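  # map hour index i (1..168) onto day = (i-1) %/% 24 + 1 and hour of day = (i-1) %% 24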
  pickup.hour= uber.br %>% filter(day==(i-1)%/%24+1 & hour==(i-1)%%24)
  coordinates(pickup.hour)=c("Lon","Lat")
  crs(pickup.hour)=crs(intersection.grid)
  pickup.hour$gridID=over(pickup.hour,intersection.grid.sp)
  pickup.hour=data.frame(pickup.hour)
  
  # Aggregate pickup counts for each grid for the current hour
  pickup.num=pickup.hour %>% dplyr::group_by(values) %>% dplyr::summarise(n=n())
  pickup.merge=merge(mid,pickup.num,by.x="gridID",by.y="values",all.x=T)$n
  pickup.merge[which(is.na(pickup.merge))]=0
  
  # Store pickup counts in the hour.pickup matrix
  hour.pickup[,i]=pickup.merge
}

#Calculate Total Pickups for Each Grid:
hour.pickup[,169]=rowSums(hour.pickup) 
hour.pickup[,169] %>% sum() #14,708 pick up points in grid 
## [1] 14708
head(hour.pickup,1)
##   hour1 hour2 hour3 hour4 hour5 hour6 hour7 hour8 hour9 hour10 hour11 hour12
## 1     0     0     0     0     1     0     1     3     1      0      0      1
##   hour13 hour14 hour15 hour16 hour17 hour18 hour19 hour20 hour21 hour22 hour23
## 1      1      0      1      0      1      0      0      0      0      0      1
##   hour24 hour25 hour26 hour27 hour28 hour29 hour30 hour31 hour32 hour33 hour34
## 1      0      0      0      0      0      0      0      2      1      3      1
##   hour35 hour36 hour37 hour38 hour39 hour40 hour41 hour42 hour43 hour44 hour45
## 1      1      0      0      1      0      0      0      0      1      0      1
##   hour46 hour47 hour48 hour49 hour50 hour51 hour52 hour53 hour54 hour55 hour56
## 1      3      0      0      1      1      1      0      0      0      1      0
##   hour57 hour58 hour59 hour60 hour61 hour62 hour63 hour64 hour65 hour66 hour67
## 1      2      0      2      0      0      1      1      1      0      1      2
##   hour68 hour69 hour70 hour71 hour72 hour73 hour74 hour75 hour76 hour77 hour78
## 1      1      2      3      1      1      4      0      0      0      0      0
##   hour79 hour80 hour81 hour82 hour83 hour84 hour85 hour86 hour87 hour88 hour89
## 1      4      2      0      3      1      0      0      0      0      1      0
##   hour90 hour91 hour92 hour93 hour94 hour95 hour96 hour97 hour98 hour99 hour100
## 1      1      2      0      1      2      3      4      2      1      0       0
##   hour101 hour102 hour103 hour104 hour105 hour106 hour107 hour108 hour109
## 1       1       0       0       0       2       2       2       2       2
##   hour110 hour111 hour112 hour113 hour114 hour115 hour116 hour117 hour118
## 1       0       0       0       0       4       5       1       2       2
##   hour119 hour120 hour121 hour122 hour123 hour124 hour125 hour126 hour127
## 1       3       3       3       3       0       0       0       0       1
##   hour128 hour129 hour130 hour131 hour132 hour133 hour134 hour135 hour136
## 1       0       0       1       3       0       2       2       0       1
##   hour137 hour138 hour139 hour140 hour141 hour142 hour143 hour144 hour145
## 1       1       1       0       1       2       1       0       0       0
##   hour146 hour147 hour148 hour149 hour150 hour151 hour152 hour153 hour154
## 1       0       0       0       0       0       0       1       1       1
##   hour155 hour156 hour157 hour158 hour159 hour160 hour161 hour162 hour163
## 1       3       1       1       1       0       1       0       2       2
##   hour164 hour165 hour166 hour167 hour168 total_sum
## 1       1       0       0       1       0       147

3.2.2 Visualizing the density of pickups hour by hour

Plot the pickup point density over the 24 hours of day 1:

#Density of pickups hour by hour
autoplot.OpenStreetMap(OpenStreetMap::openproj(range.osm)) +
  stat_density_2d(data=uber.br %>% dplyr::filter(day==1),
                  aes(x=Lon,
                      y=Lat,
                      fill=stat(level)/10),
                  bins=30,
                  alpha=.9,
                  geom="polygon"
                  ) +
  facet_wrap(~hour,ncol=6) +
  scale_fill_viridis(option = "C")  +
  geom_sf(data=br.polygon,aes(geometry=geometry),
          inherit.aes = F,
          fill="transparent") +
  theme(axis.title = element_blank(),
        axis.text.x=element_text(angle = 90))+
   labs(
    title = "Density of Pickups Hour by Hour",
    fill = "Density"  # You can add labels for other aesthetics as well
  )->pic10
## Coordinate system already present. Adding new coordinate system, which will
## replace the existing one.
pic10

The hourly density patterns of Uber pickups exhibit significant variations. Notably, the shape and density of pickups vary throughout the day. A distinctive pattern emerges, with particularly high densities occurring around 21:00 and 22:00. During these hours, there is a noticeable surge in Uber pickups, suggesting increased demand or activity. This temporal heterogeneity in density underscores the dynamic nature of Uber service utilization over the course of a day. The ability to discern these patterns is crucial for understanding and optimizing transportation logistics, potentially informing resource allocation and service planning to accommodate fluctuating demand at different hours.

3.3 Methods and data format

3.3.1 Methods: DBSCAN spatial data clustering

DBSCAN (Density-Based Spatial Clustering of Applications with Noise):

  • Density-Based Clustering: The core idea is based on the density of spatial data points. It categorizes data points into core points, border points, and noise points. Core points have at least MinPts points within their eps-neighborhood, border points lie within a core point’s neighborhood but are not core themselves, and noise points are isolated data points belonging to neither category.

  • Parameter Configuration: DBSCAN takes two key parameters, MinPts and eps. MinPts specifies the minimum number of data points required within a point’s neighborhood for it to be considered a core point, and eps defines the radius of that neighborhood (see the sketch below).
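A minimal, self-contained sketch (hypothetical toy data, not the project dataset) of how eps and MinPts drive the fpc::dbscan result:

# Toy illustration of fpc::dbscan: two dense blobs become clusters, most of
# the scattered points are labeled 0 (noise), and db.toy$isseed flags core points.
library(fpc)
set.seed(111)
toy = rbind(matrix(rnorm(100, mean = 0, sd = 0.05), ncol = 2),  # dense blob 1
            matrix(rnorm(100, mean = 1, sd = 0.05), ncol = 2),  # dense blob 2
            matrix(runif(20, min = -1, max = 2), ncol = 2))     # scattered points
db.toy = fpc::dbscan(toy, eps = 0.2, MinPts = 5)
table(db.toy$cluster)  # 0 = noise; positive labels = discovered clusters
head(db.toy$isseed)    # TRUE for core points, FALSE for border/noise points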

Daily data clusters using DBSCAN

# DBSCAN by daily data 
uber.br %>% filter(day %in%c(1:21))->uber.br
uber.br$day.cluster=0

for (i in 1:21) {
  set.seed(111)
  db= fpc::dbscan(uber.br %>% filter(day==i) %>% select(Lon,Lat), eps=0.0035,
                  MinPts=25)
  uber.br$day.cluster[uber.br$day==i]=db$cluster
}
sum(uber.br$day.cluster[uber.br$day %in% c(1:21)]>0)/
  length(uber.br$day.cluster[uber.br$day %in% c(1:21)]) 
## [1] 0.5994806
# ~59.9% of pickups belong to a cluster

#DBSCAN by whole data
set.seed(111)
db<- fpc::dbscan(uber.br%>% filter(day %in% c(1:21)) %>% select(Lon,Lat), eps=0.001, MinPts=50) #1-74cluster
uber.br$whole.day.cluster[uber.br$day %in% c(1:21)] = db$cluster
uber.br$whole.day.cluster[uber.br$whole.day.cluster==0]=NA
#saveRDS(db,"data/wholeDBSCAN.RData")
#db=readRDS("data/wholeDBSCAN.RData")
#################################################################
# preparing cluster dummies data day by day
grid.clusters=matrix(data=0,nrow=308,ncol = 22)
colnames(grid.clusters)=c(paste0("cluster",c(1:21)),"total_cluster")
mid = data.frame(gridID=c(1:308))

for (i in 1:21) {
  cluster.day= uber.br %>% filter(day==i)
  coordinates(cluster.day)=c("Lon","Lat")
  crs(cluster.day)=crs(intersection.grid)
  #cluster.day=st_as_sf(cluster.day)
  cluster.day$gridID=over(cluster.day,intersection.grid.sp)
  cluster.day=data.frame(cluster.day)
  cluster.num=cluster.day %>% dplyr::group_by(values) %>% dplyr::summarise(cluster=sum(day.cluster))
  # aggregate by gridID to sum the cluster labels per grid cell on day i,
  # then merge; the result is binarized below (1 = grid contains clustered pickups)
  cluster.merge=merge(mid,cluster.num,by.x="gridID",by.y="values",all.x=T)$cluster
  cluster.merge[which(is.na(cluster.merge))]=0
  cluster.merge[which(cluster.merge>0)]=1
  grid.clusters[,i]=cluster.merge
}


grid.clusters[,22]=rowSums(grid.clusters) 
grid.clusters[,22] %>% sum()  #652
## [1] 652
day.pickup.cluster=cbind(day.pickup,data.frame(grid.clusters))
head(day.pickup.cluster)
##   1day 2day 3day 4day 5day 6day 7day 8day 9day 10day 11day 12day 13day 14day
## 1   11   14   22   28   34   22   16   15   16    10    27    34    24    10
## 2   21   24   37   44   46   30   34   24   19    32    37    31    22    17
## 3    0    1    0    1    0    0    0    0    0     1     1     0     0     1
## 4    0    0    0    0    0    0    0    0    0     0     0     0     0     0
## 5   37   42   58   75  110   86   44   37   47    46    60    85    60    34
## 6   53   54   46   81  110   79   52   47   52    47    60    94    74    50
##   15day 16day 17day 18day 19day 20day 21day total_sum cluster1 cluster2
## 1    14    15    16    19    27    20    14       408        0        1
## 2    23    18    23    22    27    30    13       574        0        1
## 3     2     1     2     0     0     1     0        11        0        0
## 4     0     0     0     0     0     0     0         0        0        0
## 5    61    54    68    75   118    64    29      1290        1        1
## 6    45    55    62    84    95    58    47      1345        1        1
##   cluster3 cluster4 cluster5 cluster6 cluster7 cluster8 cluster9 cluster10
## 1        1        1        1        1        1        0        1         1
## 2        1        1        1        1        1        0        1         1
## 3        0        0        0        0        0        0        0         0
## 4        0        0        0        0        0        0        0         0
## 5        1        1        1        1        1        1        1         1
## 6        1        1        1        1        1        1        1         1
##   cluster11 cluster12 cluster13 cluster14 cluster15 cluster16 cluster17
## 1         1         1         1         0         1         0         1
## 2         1         1         1         0         1         0         1
## 3         0         0         0         0         0         0         0
## 4         0         0         0         0         0         0         0
## 5         1         1         1         1         1         1         1
## 6         1         1         1         1         1         1         1
##   cluster18 cluster19 cluster20 cluster21 total_cluster
## 1         1         1         1         0            16
## 2         1         1         1         0            16
## 3         0         0         0         0             0
## 4         0         0         0         0             0
## 5         1         1         1         1            21
## 6         1         1         1         1            21

Visualizing the cluster plot

# to see the whole cluster plot
ggplot() +
  geom_point(data=uber.br[is.na(uber.br$whole.day.cluster),],
             aes(x=Lon,y=Lat),color="gray",shape=20,size=1)+
  geom_point(data=uber.br[!is.na(uber.br$whole.day.cluster),],
             aes(x=Lon,y=Lat,color=whole.day.cluster),
             shape=16,size=1.2)+
  scale_color_viridis(option = "H",trans="reverse") +
  geom_sf(data = br.polygon,aes(geometry=geometry),
          inherit.aes = F,size=2,
          fill="transparent")+
  labs(title ="", color = "Whole 21 days cluster by DBSCAN")+
  theme_minimal() -> pic11
pic11

# to see DBSCAN clusters day by day
ggplot() +
  geom_point(data=uber.br[uber.br$day.cluster==0,]%>%filter(day %in% c(1:7)),aes(x=Lon,y=Lat),color="gray",shape=20,size=1)+
  geom_point(data=uber.br[!uber.br$day.cluster==0,]%>%filter(day %in% c(1:7)),aes(x=Lon,y=Lat,color=day.cluster),shape=16,size=1.2)+
  facet_wrap(~day) +
  scale_color_viridis(option = "H",trans="reverse") +
  geom_sf(data = br.polygon,aes(geometry=geometry),
          inherit.aes = F,size=2,
          fill="transparent")+
  labs(title ="", color = "Day clusters by DBSCAN")+
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 90))->pic12
pic12

In the context of applying DBSCAN (Density-Based Spatial Clustering of Applications with Noise) to analyze day clusters, variations in cluster density and distribution are expected. DBSCAN identifies clusters based on data density, allowing for the identification of areas with higher concentrations of data points. Days with more diverse activities or events may exhibit increased cluster diversity, leading to a higher number of identified clusters. Conversely, days with fewer distinct patterns may result in fewer clusters. The size and spatial distribution of clusters may vary, reflecting the dynamic nature of data on different days. Larger areas of clusters may indicate days with widespread or cohesive patterns, while sparse or dispersed clusters may suggest days with more scattered and diverse data points. Overall, these variations highlight the flexibility of DBSCAN in capturing diverse patterns and structures within temporal datasets.

Hourly data clusters using DBSCAN

uber.br$hour.cluster=0

for (i in 1:168) {
  set.seed(111)
  db= fpc::dbscan(uber.br %>% filter(day==(i-1)%/%24+1 & hour==(i-1)%%24)%>% select(Lon,Lat), eps=0.01,MinPts=5)
  uber.br$hour.cluster[uber.br$day==(i-1)%/%24+1 & uber.br$hour==(i-1)%%24]=db$cluster
}

sum(uber.br$hour.cluster>0,na.rm = T)/length(uber.br$day[uber.br$day %in% c(1:7)]) 
## [1] 0.7794398
# ~77.9% of points belong to a cluster

#DBSCAN by whole data
set.seed(111)
db<- fpc::dbscan(uber.br%>%filter(day %in%c(1:7)) %>% select(Lon,Lat), eps=0.001, MinPts=20) #1-80cluster
uber.br$whole.hour.cluster[uber.br$day %in% c(1:7)] = db$cluster
#saveRDS(db,"data/wholeHourDBSCAN.RData")
#db=readRDS("data/wholeHourDBSCAN.RData")
#################################################################
# preparing cluster hour by hour, totally 7*24=168 hours
grid.clusters.hour=matrix(data=0,nrow=308,ncol = 169)
colnames(grid.clusters.hour)=c(paste0(c(1:168),"cluster"),"total_cluster")
mid = data.frame(gridID=c(1:308))

for (i in 1:168) {
  #Extract Clusters for Each Hour
  cluster.hour= uber.br %>% filter(day==(i-1)%/%24+1 & hour==(i-1)%%24)
  coordinates(cluster.hour)=c("Lon","Lat")
  crs(cluster.hour)=crs(intersection.grid)
  #cluster.hour=st_as_sf(cluster.hour)
  cluster.hour$gridID=over(cluster.hour,intersection.grid.sp)
  cluster.hour=data.frame(cluster.hour)
  cluster.num=cluster.hour %>% dplyr::group_by(values) %>% 
    dplyr::summarise(cluster=sum(hour.cluster))
  cluster.merge=merge(mid,cluster.num,by.x="gridID",by.y="values",all.x=T)$cluster
  cluster.merge[which(is.na(cluster.merge))]=0
  #Convert to Binary Indicator
  cluster.merge[which(cluster.merge>0)]=1
  grid.clusters.hour[,i]=cluster.merge
}

# Summarize total cluster flags per grid across all 168 hours
grid.clusters.hour[,169]=rowSums(grid.clusters.hour) 
grid.clusters.hour[,169]%>%sum() #3,924
## [1] 3924
hour.pickup.cluster.frac=cbind(hour.pickup,data.frame(grid.clusters.hour))
head(hour.pickup.cluster.frac,1)
##   hour1 hour2 hour3 hour4 hour5 hour6 hour7 hour8 hour9 hour10 hour11 hour12
## 1     0     0     0     0     1     0     1     3     1      0      0      1
##   hour13 hour14 hour15 hour16 hour17 hour18 hour19 hour20 hour21 hour22 hour23
## 1      1      0      1      0      1      0      0      0      0      0      1
##   hour24 hour25 hour26 hour27 hour28 hour29 hour30 hour31 hour32 hour33 hour34
## 1      0      0      0      0      0      0      0      2      1      3      1
##   hour35 hour36 hour37 hour38 hour39 hour40 hour41 hour42 hour43 hour44 hour45
## 1      1      0      0      1      0      0      0      0      1      0      1
##   hour46 hour47 hour48 hour49 hour50 hour51 hour52 hour53 hour54 hour55 hour56
## 1      3      0      0      1      1      1      0      0      0      1      0
##   hour57 hour58 hour59 hour60 hour61 hour62 hour63 hour64 hour65 hour66 hour67
## 1      2      0      2      0      0      1      1      1      0      1      2
##   hour68 hour69 hour70 hour71 hour72 hour73 hour74 hour75 hour76 hour77 hour78
## 1      1      2      3      1      1      4      0      0      0      0      0
##   hour79 hour80 hour81 hour82 hour83 hour84 hour85 hour86 hour87 hour88 hour89
## 1      4      2      0      3      1      0      0      0      0      1      0
##   hour90 hour91 hour92 hour93 hour94 hour95 hour96 hour97 hour98 hour99 hour100
## 1      1      2      0      1      2      3      4      2      1      0       0
##   hour101 hour102 hour103 hour104 hour105 hour106 hour107 hour108 hour109
## 1       1       0       0       0       2       2       2       2       2
##   hour110 hour111 hour112 hour113 hour114 hour115 hour116 hour117 hour118
## 1       0       0       0       0       4       5       1       2       2
##   hour119 hour120 hour121 hour122 hour123 hour124 hour125 hour126 hour127
## 1       3       3       3       3       0       0       0       0       1
##   hour128 hour129 hour130 hour131 hour132 hour133 hour134 hour135 hour136
## 1       0       0       1       3       0       2       2       0       1
##   hour137 hour138 hour139 hour140 hour141 hour142 hour143 hour144 hour145
## 1       1       1       0       1       2       1       0       0       0
##   hour146 hour147 hour148 hour149 hour150 hour151 hour152 hour153 hour154
## 1       0       0       0       0       0       0       1       1       1
##   hour155 hour156 hour157 hour158 hour159 hour160 hour161 hour162 hour163
## 1       3       1       1       1       0       1       0       2       2
##   hour164 hour165 hour166 hour167 hour168 total_sum X1cluster X2cluster
## 1       1       0       0       1       0       147         0         0
##   X3cluster X4cluster X5cluster X6cluster X7cluster X8cluster X9cluster
## 1         0         0         0         0         1         1         1
##   X10cluster X11cluster X12cluster X13cluster X14cluster X15cluster X16cluster
## 1          0          0          0          1          0          1          0
##   X17cluster X18cluster X19cluster X20cluster X21cluster X22cluster X23cluster
## 1          1          0          0          0          0          0          1
##   X24cluster X25cluster X26cluster X27cluster X28cluster X29cluster X30cluster
## 1          0          0          0          0          0          0          0
##   X31cluster X32cluster X33cluster X34cluster X35cluster X36cluster X37cluster
## 1          1          1          1          1          1          0          0
##   X38cluster X39cluster X40cluster X41cluster X42cluster X43cluster X44cluster
## 1          1          0          0          0          0          1          0
##   X45cluster X46cluster X47cluster X48cluster X49cluster X50cluster X51cluster
## 1          1          1          0          0          0          0          0
##   X52cluster X53cluster X54cluster X55cluster X56cluster X57cluster X58cluster
## 1          0          0          0          1          0          1          0
##   X59cluster X60cluster X61cluster X62cluster X63cluster X64cluster X65cluster
## 1          1          0          0          0          1          1          0
##   X66cluster X67cluster X68cluster X69cluster X70cluster X71cluster X72cluster
## 1          1          1          1          1          1          1          1
##   X73cluster X74cluster X75cluster X76cluster X77cluster X78cluster X79cluster
## 1          1          0          0          0          0          0          1
##   X80cluster X81cluster X82cluster X83cluster X84cluster X85cluster X86cluster
## 1          1          0          1          1          0          0          0
##   X87cluster X88cluster X89cluster X90cluster X91cluster X92cluster X93cluster
## 1          0          0          0          1          1          0          1
##   X94cluster X95cluster X96cluster X97cluster X98cluster X99cluster X100cluster
## 1          1          1          1          1          0          0           0
##   X101cluster X102cluster X103cluster X104cluster X105cluster X106cluster
## 1           0           0           0           0           0           0
##   X107cluster X108cluster X109cluster X110cluster X111cluster X112cluster
## 1           0           1           1           0           0           0
##   X113cluster X114cluster X115cluster X116cluster X117cluster X118cluster
## 1           0           1           1           1           1           1
##   X119cluster X120cluster X121cluster X122cluster X123cluster X124cluster
## 1           1           1           1           1           0           0
##   X125cluster X126cluster X127cluster X128cluster X129cluster X130cluster
## 1           0           0           1           0           0           0
##   X131cluster X132cluster X133cluster X134cluster X135cluster X136cluster
## 1           1           0           1           1           0           1
##   X137cluster X138cluster X139cluster X140cluster X141cluster X142cluster
## 1           0           1           0           1           1           1
##   X143cluster X144cluster X145cluster X146cluster X147cluster X148cluster
## 1           0           0           0           0           0           0
##   X149cluster X150cluster X151cluster X152cluster X153cluster X154cluster
## 1           0           0           0           1           0           1
##   X155cluster X156cluster X157cluster X158cluster X159cluster X160cluster
## 1           1           1           1           1           0           1
##   X161cluster X162cluster X163cluster X164cluster X165cluster X166cluster
## 1           0           1           1           1           0           0
##   X167cluster X168cluster total_cluster
## 1           1           0            71
#saveRDS(hour.pickup.cluster.frac,"data/HourpickupCluster.RData")
#hour.pickup.cluster.frac=readRDS("data/HourpickupCluster.RData")
# to see the whole cluster plot
ggplot() +
  geom_point(data=uber.br[uber.br$whole.hour.cluster==0,],
             aes(x=Lon,y=Lat),color="gray",shape=20,size=1)+
  geom_point(data=uber.br[!uber.br$whole.hour.cluster==0,],
             aes(x=Lon,y=Lat,color=whole.hour.cluster),
             shape=16,size=1.2)+
  scale_color_viridis(option = "H",trans="reverse") +
  geom_sf(data = br.polygon,aes(geometry=geometry),
          inherit.aes = F,size=2,
          fill="transparent")+
  labs(title ="", color = "Whole 168 hour clusters by DBSCAN")+
  theme_minimal() -> pic13
pic13
## Warning: Removed 27260 rows containing missing values (`geom_point()`).
## Removed 27260 rows containing missing values (`geom_point()`).

# to see DBSCAN clusters hour by hour
ggplot() +
  geom_point(data=uber.br[uber.br$hour.cluster==0,]%>%
               filter(day==1),aes(x=Lon,y=Lat),color="gray",shape=20,size=1)+
  geom_point(data=uber.br[!uber.br$hour.cluster==0,]%>%
               filter(day==1),aes(x=Lon,y=Lat,color=hour.cluster),shape=16,size=1.2)+
  facet_wrap(~hour,ncol = 6) +
  scale_color_viridis(option = "H",trans="reverse") +
  geom_sf(data = br.polygon,aes(geometry=geometry),
          inherit.aes = F,size=2,
          fill="transparent")+
  labs(title ="", color = "Hour clusters by DBSCAN")+
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 90))->pic14
pic14

The distinct variations in hour clusters identified by DBSCAN reveal the temporal dynamics of the data. The absence of clusters between 0:00 and 3:00 suggests a period of lower activity or reduced data density during the early morning hours. The increased number of clusters during the morning peak, from 8:00 to 9:00, signifies a surge in activity during this period, leading to a higher concentration of data points and distinct patterns. Contrary to expectations, the presence of clusters at 13:00 indicates sustained activity during the midday timeframe, challenging the notion of a lull or transition period. The clusters during the late peak, from 17:00 to 21:00, with a larger spatial extent, further suggest heightened activity, possibly corresponding to the evening rush hour or other prominent events.

These findings underscore the nuanced insights provided by DBSCAN, capturing varying patterns throughout hours and aligning with the ebb and flow of daily activities. The temporal clustering patterns offer valuable perspectives into the underlying dynamics of the dataset, aiding in the identification of key timeframes with distinctive characteristics.

3.3.2 Methods: Rolling window

In time series analysis, a rolling window, also known as a moving or sliding window, is a technique used to assess data points within a fixed-size subset that “rolls” or moves across the entire time series. The window moves incrementally, one step at a time, until it covers the entire series. At each step, analysis or computations are performed on the data points within the window.

This approach is particularly useful for observing changes over time and identifying patterns or trends. It allows analysts to continuously update their understanding of the time series dynamics as new data becomes available.
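As an illustrative sketch (toy series, not the project data), a rolling window can be assembled with dplyr::lag() and dplyr::lead(), the same functions used for the daily and hourly windows below:

# Toy rolling-window example: each row carries the current value, the two
# previous values, and the next value (used as the prediction target).
library(dplyr)
library(tidyr)
toy.ts = data.frame(t = 1:8, y = c(5, 7, 6, 9, 12, 11, 14, 13))
toy.ts %>%
  mutate(mins1.y = lag(y, 1),      # value at t - 1
         mins2.y = lag(y, 2),      # value at t - 2
         plus1.y = lead(y, 1)) %>% # value at t + 1 (future target)
  drop_na()                        # keep only rows with a complete window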

Daily rolling window

Step 1: Calculate fractions, i.e., each grid’s pickup count divided by N (the total number of pickup points on day T), to measure every grid’s share of that day’s pickups. For example, a grid with 30 pickups on a day with 2,000 pickups in total has a fraction of 0.015.

day.pickup.cluster %>% length()
## [1] 44
for (i in 45:65){
  day.pickup.cluster [,i]=day.pickup.cluster [,i-44]/sum(day.pickup.cluster [,i-44])
}

paste0("frac",seq(1,21,1))-> frac
paste0("day",seq(1,21,1))-> day
paste0("cluster",seq(1,21,1))-> cluster
colnames(day.pickup.cluster)=c(day,"total.pickup",cluster,"total.cluster",frac)

Step 2: Rescale the fractions to the [0, 1] range with a min-max formula, so the maximum value becomes 1 and the minimum 0. The raw fractions are very small, so normalizing them to [0, 1] puts them on a common, better-scaled range.

normalize=function(x,na.rm=T)(x-min(x,na.rm = T))/(max(x,na.rm = T)-min(x,na.rm = T))
day.pickup.cluster= day.pickup.cluster %>% mutate(across(matches("^frac"),normalize))
day.pickup.cluster$frac1 %>% summary() #range [0,1]
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.04145 0.02586 1.00000
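As a quick worked check of the rescaling (toy values, not project data):

# Toy example of the min-max rescaling defined above
normalize(c(0, 0.02, 0.05))
## [1] 0.0 0.4 1.0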

Step 3: Calculate spatial averages (spatial lags) of the rescaled fractions.

The nb2listw function supplements a neighbours list with spatial weights for the chosen coding scheme. The style argument specifies how the weights are encoded; the style used here is:

“W” (row-standardised): each of a region’s neighbours receives the weight 1/k, where k is that region’s number of neighbours, so the spatial lag of a variable is the average of its neighbours’ values. The neighbour list itself is built with queen contiguity (poly2nb with queen = TRUE), under which two grid cells are neighbours if they share a boundary or a vertex.

# to get the grid spatial weight matrix
# convert the grid polygons to a neighbour list (queen contiguity)
neighbor.grid=poly2nb(intersection.grid,queen = T)
# then convert the nb object to a spatial weights list
nb.matrix.grid=nb2listw(neighbor.grid,style='W')
nb.matrix.grid
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 308 
## Number of nonzero links: 2242 
## Percentage nonzero weights: 2.363383 
## Average number of links: 7.279221 
## 
## Weights style: W 
## Weights constants summary:
##     n    nn  S0       S1       S2
## W 308 94864 308 87.72001 1238.817

nb2listw also supports other weight coding styles; the choice of style should suit the spatial analysis problem, as different encodings are appropriate for different types of data and analysis tasks.

The resulting weights summary:

The nb.matrix.grid object is a spatial weights list characterizing the relationships between the 308 grid cells. The underlying neighbour structure records 2,242 non-zero links, i.e., 2.36% of all possible pairs, and on average each cell has approximately 7.28 neighbours. These weights quantify the spatial relationships between cells and enable the spatial-lag computation below.
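As an illustration using the objects built above, under style “W” the spatial lag of a variable equals the average of its neighbours’ values:

# Illustration: under row-standardised ("W") weights, lag.listw() returns,
# for each grid cell, the mean of its queen neighbours' values.
example.values = day.pickup.cluster$frac1              # rescaled fraction, day 1
example.lag    = lag.listw(nb.matrix.grid, example.values)
# for grid cell 1 this should match the plain neighbour average:
all.equal(example.lag[1], mean(example.values[neighbor.grid[[1]]]))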

for (i in 66:86){
  day.pickup.cluster[,i]=lag.listw(x=nb.matrix.grid,var=day.pickup.cluster[,i-21])
}
paste0("lag.frac",seq(1,21,1))-> lag
colnames(day.pickup.cluster)=c(day,"total.pickup",cluster,"total.cluster",frac,lag)

Step 4: Finally, build the daily rolling window.

#change to long format
day.pickup.cluster$ID=rownames(day.pickup.cluster)
day.pickup.cluster %>% select(ID,matches("^frac")) %>% pivot_longer(cols = !ID) -> frac.df
colnames(frac.df)=c('ID','day','current.frac')
frac.df$day=rep(c(1:21),308)

day.pickup.cluster %>% select(ID,matches("^lag")) %>% pivot_longer(cols = !ID) -> lag.frac.df
colnames(lag.frac.df)=c('ID','day','current.lag.frac')
lag.frac.df$day=rep(c(1:21),308)

day.pickup.cluster %>% select(ID,matches("^cluster")) %>% pivot_longer(cols = !ID) -> cluster.df
colnames(cluster.df)=c('ID','day','current.cluster')
cluster.df$day=rep(c(1:21),308)

frac.df %>% merge(lag.frac.df,by=c("ID","day")) %>% merge(cluster.df,by=c("ID","day"))->frac.lag.cluster.day
frac.lag.cluster.day$ID=as.integer(frac.lag.cluster.day$ID)

# Rolling window: for each grid and day, use the current day's data plus lags of 1-6 days as predictors, and the +1 day (future) cluster indicator as the target.
frac.lag.cluster.day %>% group_by(ID) %>%
  arrange(ID,day) %>%
  mutate(mins1.frac=lag(current.frac,1),
         mins2.frac=lag(current.frac,2),
         mins3.frac=lag(current.frac,3),
         mins4.frac=lag(current.frac,4),
         mins5.frac=lag(current.frac,5),
         mins6.frac=lag(current.frac,6),
         
         mins1.frac.lag=lag(current.lag.frac,1),
         mins2.frac.lag=lag(current.lag.frac,2),
         mins3.frac.lag=lag(current.lag.frac,3),
         mins4.frac.lag=lag(current.lag.frac,4),
         mins5.frac.lag=lag(current.lag.frac,5),
         mins6.frac.lag=lag(current.lag.frac,6),

         mins1.cluster=lag(current.cluster,1),
         mins2.cluster=lag(current.cluster,2),
         mins3.cluster=lag(current.cluster,3),
         mins4.cluster=lag(current.cluster,4),
         mins5.cluster=lag(current.cluster,5),
         mins6.cluster=lag(current.cluster,6),
         plus1.cluster=lead(current.cluster,1))  %>% drop_na() %>% 
  ungroup() %>% arrange(ID)-> rolling.window.day
head(rolling.window.day)
## # A tibble: 6 × 24
##      ID   day current.frac current.lag.frac current.cluster mins1.frac
##   <int> <int>        <dbl>            <dbl>           <dbl>      <dbl>
## 1     1     7       0.119             0.323               1     0.118 
## 2     1     8       0.116             0.279               0     0.119 
## 3     1     9       0.130             0.320               1     0.116 
## 4     1    10       0.0769            0.321               1     0.130 
## 5     1    11       0.148             0.288               1     0.0769
## 6     1    12       0.115             0.237               1     0.148 
## # ℹ 18 more variables: mins2.frac <dbl>, mins3.frac <dbl>, mins4.frac <dbl>,
## #   mins5.frac <dbl>, mins6.frac <dbl>, mins1.frac.lag <dbl>,
## #   mins2.frac.lag <dbl>, mins3.frac.lag <dbl>, mins4.frac.lag <dbl>,
## #   mins5.frac.lag <dbl>, mins6.frac.lag <dbl>, mins1.cluster <dbl>,
## #   mins2.cluster <dbl>, mins3.cluster <dbl>, mins4.cluster <dbl>,
## #   mins5.cluster <dbl>, mins6.cluster <dbl>, plus1.cluster <dbl>
#saveRDS(rolling.window.day, file = 'data/dayRollingWindow.RData')

Hourly rolling window

hour.pickup.cluster.frac %>% length()
## [1] 338
# step 1: calculate hourly fractions (counts divided by the hourly total)
for (i in 339:506){
  hour.pickup.cluster.frac[,i]=hour.pickup.cluster.frac[,i-338]/sum(hour.pickup.cluster.frac[,i-338])
}
paste0("frac",seq(1,168,1))-> frac
paste0("hour",seq(1,168,1))-> hour
paste0("cluster",seq(1,168,1))-> cluster
colnames(hour.pickup.cluster.frac)=c(hour,"total.pickup",cluster,"total.cluster",frac)

#step2
normalize=function(x,na.rm=T)(x-min(x,na.rm = T))/(max(x,na.rm = T)-min(x,na.rm = T))
hour.pickup.cluster.frac= hour.pickup.cluster.frac %>% mutate(across(matches("^frac"),normalize))

# step3: to get the grid spatial weight matrix
neighbor.grid=poly2nb(intersection.grid,queen = T)
nb.matrix.grid=nb2listw(neighbor.grid,style='W')

for (i in 507:674){
  hour.pickup.cluster.frac[,i]=lag.listw(x=nb.matrix.grid,var=hour.pickup.cluster.frac[,i-168])
}
paste0("lag.frac",seq(1,168,1))-> lag
colnames(hour.pickup.cluster.frac)=c(hour,"total.pickup",cluster,"total.cluster",frac,lag)

# step4: finally, get the hourly rolling window
hour.pickup.cluster.frac$ID=rownames(hour.pickup.cluster.frac)

hour.pickup.cluster.frac %>% select(ID,matches("^frac")) %>% pivot_longer(cols = !ID) -> frac.df
colnames(frac.df)=c('ID','hour','current.frac')
frac.df$hour=rep(c(1:168),308)

hour.pickup.cluster.frac %>% select(ID,matches("^lag")) %>% pivot_longer(cols = !ID) -> lag.frac.df
colnames(lag.frac.df)=c('ID','hour','current.lag.frac')
lag.frac.df$hour=rep(c(1:168),308)

hour.pickup.cluster.frac %>% select(ID,matches("^cluster")) %>% pivot_longer(cols = !ID) -> cluster.df
colnames(cluster.df)=c('ID','hour','current.cluster')
cluster.df$hour=rep(c(1:168),308)

frac.df %>% merge(lag.frac.df,by=c("ID","hour")) %>% merge(cluster.df,by=c("ID","hour"))->frac.lag.cluster.hour
frac.lag.cluster.hour$ID=as.integer(frac.lag.cluster.hour$ID)


frac.lag.cluster.hour %>% group_by(ID) %>%
  arrange(ID,hour) %>%
  mutate(mins1.frac=lag(current.frac,1),
         mins2.frac=lag(current.frac,2),
         mins3.frac=lag(current.frac,3),
         mins4.frac=lag(current.frac,4),
         mins5.frac=lag(current.frac,5),
         mins6.frac=lag(current.frac,6),

         mins1.frac.lag=lag(current.lag.frac,1),
         mins2.frac.lag=lag(current.lag.frac,2),
         mins3.frac.lag=lag(current.lag.frac,3),
         mins4.frac.lag=lag(current.lag.frac,4),
         mins5.frac.lag=lag(current.lag.frac,5),
         mins6.frac.lag=lag(current.lag.frac,6),

         mins1.cluster=lag(current.cluster,1),
         mins2.cluster=lag(current.cluster,2),
         mins3.cluster=lag(current.cluster,3),
         mins4.cluster=lag(current.cluster,4),
         mins5.cluster=lag(current.cluster,5),
         mins6.cluster=lag(current.cluster,6),
         plus1.cluster=lead(current.cluster,1))  %>% drop_na() %>% 
  ungroup() %>% arrange(ID)-> rolling.window.hour
head(rolling.window.hour)
## # A tibble: 6 × 24
##      ID  hour current.frac current.lag.frac current.cluster mins1.frac
##   <int> <int>        <dbl>            <dbl>           <dbl>      <dbl>
## 1     1     7        0.111           0.519                1      0    
## 2     1     8        0.214           0.143                1      0.111
## 3     1     9        0.1             0.267                1      0.214
## 4     1    10        0               0.667                0      0.1  
## 5     1    11        0               0.111                0      0    
## 6     1    12        0.25            0.0833               0      0    
## # ℹ 18 more variables: mins2.frac <dbl>, mins3.frac <dbl>, mins4.frac <dbl>,
## #   mins5.frac <dbl>, mins6.frac <dbl>, mins1.frac.lag <dbl>,
## #   mins2.frac.lag <dbl>, mins3.frac.lag <dbl>, mins4.frac.lag <dbl>,
## #   mins5.frac.lag <dbl>, mins6.frac.lag <dbl>, mins1.cluster <dbl>,
## #   mins2.cluster <dbl>, mins3.cluster <dbl>, mins4.cluster <dbl>,
## #   mins5.cluster <dbl>, mins6.cluster <dbl>, plus1.cluster <dbl>
#save rolling.window
#saveRDS(rolling.window.hour, file = 'data/HourollingWindow.RData')
#rolling.window.hour=readRDS("data/HourollingWindow.RData")

Chapter4 Models

4.1 Models for daily mode

4.1.1 Daily train and test data

rolling.window.day$ID=as.factor(rolling.window.day$ID)
rolling.window.day$day=as.factor(rolling.window.day$day)

rolling.window.day$plus1.cluster=factor(x=rolling.window.day$plus1.cluster,
                                      levels=c(0,1),
                                      labels=c("no","yes"))
rolling.window.day$mins1.cluster=factor(x=rolling.window.day$mins1.cluster,
                                      levels=c(0,1),
                                      labels=c("no","yes"))
rolling.window.day$mins2.cluster=factor(x=rolling.window.day$mins2.cluster,
                                      levels=c(0,1),
                                      labels=c("no","yes"))
rolling.window.day$mins3.cluster=factor(x=rolling.window.day$mins3.cluster,
                                      levels=c(0,1),
                                      labels=c("no","yes"))
rolling.window.day$mins4.cluster=factor(x=rolling.window.day$mins4.cluster,
                                      levels=c(0,1),
                                      labels=c("no","yes"))
rolling.window.day$mins5.cluster=factor(x=rolling.window.day$mins5.cluster,
                                      levels=c(0,1),
                                      labels=c("no","yes"))
rolling.window.day$mins6.cluster=factor(x=rolling.window.day$mins6.cluster,
                                      levels=c(0,1),
                                      labels=c("no","yes"))
rolling.window.day$current.cluster=factor(x=rolling.window.day$current.cluster,
                                      levels=c(0,1),
                                      labels=c("no","yes"))

#split into training and testing data
rolling.window.day[which(rolling.window.day$day==20),]->test.data.day
head(test.data.day)
## # A tibble: 6 × 24
##   ID    day   current.frac current.lag.frac current.cluster mins1.frac
##   <fct> <fct>        <dbl>            <dbl> <fct>                <dbl>
## 1 1     20         0.126             0.319  yes                  0.114
## 2 2     20         0.189             0.226  yes                  0.114
## 3 3     20         0.00629           0.158  no                   0    
## 4 4     20         0                 0.0613 no                   0    
## 5 5     20         0.403             0.377  yes                  0.498
## 6 6     20         0.365             0.348  yes                  0.401
## # ℹ 18 more variables: mins2.frac <dbl>, mins3.frac <dbl>, mins4.frac <dbl>,
## #   mins5.frac <dbl>, mins6.frac <dbl>, mins1.frac.lag <dbl>,
## #   mins2.frac.lag <dbl>, mins3.frac.lag <dbl>, mins4.frac.lag <dbl>,
## #   mins5.frac.lag <dbl>, mins6.frac.lag <dbl>, mins1.cluster <fct>,
## #   mins2.cluster <fct>, mins3.cluster <fct>, mins4.cluster <fct>,
## #   mins5.cluster <fct>, mins6.cluster <fct>, plus1.cluster <fct>
rolling.window.day[-which(rolling.window.day$day==20),]-> train.data.day
head(train.data.day)
## # A tibble: 6 × 24
##   ID    day   current.frac current.lag.frac current.cluster mins1.frac
##   <fct> <fct>        <dbl>            <dbl> <fct>                <dbl>
## 1 1     7           0.119             0.323 yes                 0.118 
## 2 1     8           0.116             0.279 no                  0.119 
## 3 1     9           0.130             0.320 yes                 0.116 
## 4 1     10          0.0769            0.321 yes                 0.130 
## 5 1     11          0.148             0.288 yes                 0.0769
## 6 1     12          0.115             0.237 yes                 0.148 
## # ℹ 18 more variables: mins2.frac <dbl>, mins3.frac <dbl>, mins4.frac <dbl>,
## #   mins5.frac <dbl>, mins6.frac <dbl>, mins1.frac.lag <dbl>,
## #   mins2.frac.lag <dbl>, mins3.frac.lag <dbl>, mins4.frac.lag <dbl>,
## #   mins5.frac.lag <dbl>, mins6.frac.lag <dbl>, mins1.cluster <fct>,
## #   mins2.cluster <fct>, mins3.cluster <fct>, mins4.cluster <fct>,
## #   mins5.cluster <fct>, mins6.cluster <fct>, plus1.cluster <fct>
#save the test and train data
#saveRDS(test.data.day,"data/Daytestdata.RData")
#test.data.day=readRDS("data/Daytestdata.RData")
#saveRDS(train.data.day,"data/Daytraindata.RData")
#train.data.day=readRDS("data/Daytraindata.RData")
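As a side note, the eight separate factor() conversions above can be written more compactly with dplyr::across(). This is a minimal sketch of an equivalent recoding, intended as a drop-in replacement for (not a follow-up to) those calls, since it expects the cluster columns to still hold the raw 0/1 codes:

library(dplyr)

#recode every *.cluster column from 0/1 to a no/yes factor in one step
rolling.window.day=rolling.window.day %>%
  mutate(across(matches("cluster$"),
                ~ factor(.x,levels=c(0,1),labels=c("no","yes"))))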

4.1.2 Daily Models - Decision Tree, Random Forest, and XGBoost

In the context of daily models, I plan to employ three distinct machine learning algorithms: Decision Trees, Random Forest, and XGBoost. The objective is to explore various parameter configurations for each algorithm, aiming to identify the optimal models for predicting daily outcomes. This iterative process involves adjusting parameters such as tree depth, number of estimators, and learning rates to discern their impact on model performance. Random Forest introduces ensemble learning for enhanced robustness, and XGBoost brings gradient boosting with regularization to improve predictive accuracy. By systematically testing different parameter combinations and evaluating their performance metrics, I aim to pinpoint the configurations that yield the most accurate and reliable daily prediction models.

Modelling-Decision tree

trcontrol.5cv=trainControl(method="cv",
             number=5,
             classProbs = T,
             summaryFunction = twoClassSummary)

formula.var1=plus1.cluster~current.frac+mins1.frac+mins2.frac+
                    mins3.frac+mins4.frac+mins5.frac+mins6.frac

formula.var2=plus1.cluster~current.lag.frac+mins1.frac.lag+
            mins2.frac.lag+mins3.frac.lag+mins4.frac.lag+
            mins5.frac.lag+mins6.frac.lag+current.frac+
            mins1.frac+mins2.frac+mins3.frac+mins4.frac+
            mins5.frac+mins6.frac

formula.var3=plus1.cluster~current.lag.frac+mins1.frac.lag+
            mins2.frac.lag+mins3.frac.lag+mins4.frac.lag+
            mins5.frac.lag+mins6.frac.lag+current.frac+
            mins1.frac+mins2.frac+mins3.frac+mins4.frac+
            mins5.frac+mins6.frac+current.cluster+mins1.cluster+
            mins2.cluster+mins3.cluster+mins4.cluster+mins5.cluster+
            mins6.cluster

# individual decision tree - variable set 1 (pickup-fraction features only)
set.seed(1234)
day.tree01=train(formula.var1,
                    data =train.data.day,
                    method="rpart",
                    trControl=trcontrol.5cv,
                    tuneGrid=expand.grid(
                      cp=seq(0,0.05,0.001))
                    )
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
#cp = 0.001, ROC=0.953
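The warning above only means that caret fell back from its default tuning metric: train() optimises "Accuracy" by default, while twoClassSummary returns ROC, Sens and Spec, so ROC is used instead. Passing metric = "ROC" makes that choice explicit and silences the warning; a minimal sketch of the same call:

set.seed(1234)
day.tree01=train(formula.var1,
                    data =train.data.day,
                    method="rpart",
                    metric="ROC",            #tune on cross-validated ROC explicitly
                    trControl=trcontrol.5cv,
                    tuneGrid=expand.grid(
                      cp=seq(0,0.05,0.001))
                    )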

#individual decision tree - variable set 2 (adds the spatial-lag fraction features)
set.seed(1234)
day.tree02=train(formula.var2,
                    data =train.data.day,
                    method="rpart",
                    trControl=trcontrol.5cv,
                    tuneGrid=expand.grid(
                      cp=seq(0,0.05,0.001))
                    )
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
#cp = 0.001,ROC=0.956

#individual decision tree - variable set 3 (adds the lagged cluster labels)
set.seed(1234)
day.tree03=train(formula.var3,
                    data =train.data.day,
                    method="rpart",
                    trControl=trcontrol.5cv,
                    tuneGrid=expand.grid(
                      cp=seq(0,0.05,0.001))
                    )
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
#save models
#day.tree01 %>% saveRDS(here("output","day.tree01.rds"))
#day.tree02 %>% saveRDS(here("output","day.tree02.rds"))
#day.tree03 %>% saveRDS(here("output","day.tree03.rds"))

Modelling-Random Forest

#variable set 1, 300 trees, nodesize 5
set.seed(1234)
day.rf01=train(formula.var1,
                    data =train.data.day,
                    method="rf",
                    ntree=300,
                    nodesize=5,
                    trControl=trcontrol.5cv,
                    tuneGrid=expand.grid(
                      mtry=2:5)
                    )
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
#mtry=2, ROC=0.979, better than the decision tree

#variable set 2, 300 trees, nodesize 5
set.seed(1234)
day.rf02=train(formula.var2,
                    data =train.data.day,
                    method="rf",
                    ntree=300,
                    nodesize=5,
                    trControl=trcontrol.5cv,
                    tuneGrid=expand.grid(
                      mtry=2:8)
                    )
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
#mtry=2, ROC= 0.9895
#variable set 3, 300 trees, nodesize 5
set.seed(1234)
day.rf03=train(formula.var3,
                    data =train.data.day,
                    method="rf",
                    ntree=300,
                    nodesize=5,
                    trControl=trcontrol.5cv,
                    tuneGrid=expand.grid(
                      mtry=2:10)
                    )
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
#mtry=5, ROC= 0.9954  --> the best

#save models
#day.rf01 %>% saveRDS(here("output","day.rf01.rds"))
#day.rf02 %>% saveRDS(here("output","day.rf02.rds"))
#day.rf03 %>% saveRDS(here("output","day.rf03.rds"))

Modelling-XGB gradient boosting machine

#convert the yes/no factor target back to a 0/1 numeric label for xgboost
train.data.day.xgb=train.data.day
test.data.day.xgb=test.data.day
train.data.day.xgb$plus1.cluster=as.character(train.data.day.xgb$plus1.cluster)
train.data.day.xgb$plus1.cluster[train.data.day.xgb$plus1.cluster=="yes"]=1
train.data.day.xgb$plus1.cluster[train.data.day.xgb$plus1.cluster=="no"]=0
train.data.day.xgb$plus1.cluster=as.numeric(train.data.day.xgb$plus1.cluster)

test.data.day.xgb$plus1.cluster=as.character(test.data.day.xgb$plus1.cluster)
test.data.day.xgb$plus1.cluster[test.data.day.xgb$plus1.cluster=="yes"]=1
test.data.day.xgb$plus1.cluster[test.data.day.xgb$plus1.cluster=="no"]=0
test.data.day.xgb$plus1.cluster=as.numeric(test.data.day.xgb$plus1.cluster)

#variable set 1
set.seed(1234)
cv=xgb.cv(data=as.matrix(train.data.day.xgb %>% dplyr::select(current.frac,mins1.frac,mins2.frac,mins3.frac,mins4.frac,mins5.frac,mins6.frac)),
          label=as.numeric(train.data.day.xgb$plus1.cluster),
          nrounds=100,
          nfold=5,
          objective= "binary:logistic",
          eta=0.1,
          max_depth=8,
          early_stopping_rounds = 10,
          verbose=F)
#best number of boosting rounds selected by early stopping
cv$best_ntreelimit
## [1] 55
#  ntrees.train
#   55          
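#(sketch) the per-round cross-validation metrics behind best_ntreelimit are stored
#in cv$evaluation_log; uncommenting the line below shows the mean test logloss at
#the selected round:
#cv$evaluation_log[cv$best_iteration, ]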
#Run xgboost
set.seed(1234)
xgb01.var1=xgboost(data=as.matrix(train.data.day.xgb %>% dplyr::select(current.frac,mins1.frac,mins2.frac,mins3.frac,mins4.frac,mins5.frac,mins6.frac)),
                      label=as.numeric(train.data.day.xgb$plus1.cluster),
                      nrounds=55,
                      objective="binary:logistic",
                      eta=0.1,
                      max_depth=8,
                      verbose=F)  

#variable set 2
set.seed(1234)
cv=xgb.cv(data=as.matrix(train.data.day.xgb %>% dplyr::select(current.frac,mins1.frac,mins2.frac,mins3.frac,mins4.frac,mins5.frac,mins6.frac,current.lag.frac,mins1.frac.lag,mins2.frac.lag,mins3.frac.lag,mins4.frac.lag,mins5.frac.lag,mins6.frac.lag)),
          label=as.numeric(train.data.day.xgb$plus1.cluster),
          nrounds=100,
          nfold=5,
          objective="binary:logistic",
          eta=0.1,
          max_depth=8,
          early_stopping_rounds = 10,
          verbose=F)
#best number of boosting rounds selected by early stopping
cv$best_ntreelimit
## [1] 53
#  ntrees.train
#   53          
#Run xgboost
set.seed(1234)
xgb02.vars2=xgboost(data=as.matrix(train.data.day.xgb %>% dplyr::select(current.frac,mins1.frac,mins2.frac,mins3.frac,mins4.frac,mins5.frac,mins6.frac,current.lag.frac,mins1.frac.lag,mins2.frac.lag,mins3.frac.lag,mins4.frac.lag,mins5.frac.lag,mins6.frac.lag)),
                      label=as.numeric(train.data.day.xgb$plus1.cluster),
                      nrounds=53,
                      objective="binary:logistic",
                      eta=0.1,
                      max_depth=8,
                      verbose=F) 
#save models
#xgb01.var1 %>% saveRDS(here("output","day.xgb01.rds"))
#xgb02.vars2 %>% saveRDS(here("output","day.xgb02.rds"))

4.2 Prediction for daily mode

table(rolling.window.day$plus1.cluster)/length(rolling.window.day$plus1.cluster)
## 
##        no       yes 
## 0.8995826 0.1004174

With approximately 90% of the values labeled as “no” and 10% as “yes,” the dataset is imbalanced. In situations like this, where the classes are unevenly represented, using a default cutoff point of 0.5 for binary classification might not be optimal. To enhance prediction accuracy, it is advisable to explore different cutoff points. This involves adjusting the threshold used to classify predicted probabilities into binary outcomes. Experimenting with various cutoff points allows for a more nuanced assessment of the model’s performance, considering the specific context and priorities related to false positives and false negatives in the classification task.
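One straightforward way to do this is to score the held-out probabilities at a grid of candidate cutoffs and compare the resulting confusion-matrix statistics. Below is a minimal, self-contained sketch; scan_cutoffs is a hypothetical helper, and the fuller versions actually used in the comparison are cutoff_binary() and summary_binary02() defined next:

library(caret)

#hypothetical helper: confusion-matrix statistics over a grid of cutoffs,
#given positive-class probabilities and the true labels
#(note: positive="yes", so Sensitivity here refers to the high-demand class)
scan_cutoffs=function(probs,truth,cutoffs=c(0.1,0.3,0.5),
                      positive="yes",negative="no"){
  sapply(cutoffs,function(ct){
    pred=factor(ifelse(probs>ct,positive,negative),levels=levels(truth))
    cm=confusionMatrix(pred,truth,positive=positive)
    round(c(cutoff=ct,cm$overall["Accuracy"],
            cm$byClass[c("Sensitivity","Specificity","Balanced Accuracy")]),4)
  })
}
#example (assuming a fitted model and the daily test set defined below):
#scan_cutoffs(predict(day.rf03,test.data.day,type="prob")$yes,
#             test.data.day$plus1.cluster)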

4.2.1 Comparing different models to find the optimal one

#confusion-matrix statistics at a chosen cutoff for the decision tree and random forest models (caret probability output)
cutoff_binary=function(predict_probs,
                       real,
                       cutoff=0.5,
                       level.positive=1,
                       level.negative=0){
  ctable=confusionMatrix(data=as.factor(ifelse(predict_probs[,level.positive]>cutoff,level.positive,level.negative)),
                         reference=real)
  result=round(c(ctable$overall[1],ctable$byClass[c(1:2,7,11)]),digits = 4)
  return(result)
}
#confusion-matrix statistics at a chosen cutoff for the XGBoost models (numeric probability vector)
summary_binary02 <- function(predicted_probs,
                           real,
                           cutoff = 0.5,
                           level_positive = "1",
                           level_negative = "0") {
  # save the classification table
  ctable <- confusionMatrix(as.factor(ifelse(predicted_probs > cutoff, 
                                             level_positive, 
                                             level_negative)), 
                            real, 
                            level_positive) 
  # extract selected statistics into a vector
  stats <- round(c(ctable$overall[1],
                   ctable$byClass[c(1:2, 7, 11)]),
                 5)
  # and return them as a function result
  return(stats)
}
#compare the result, cutoff=0.5
cutoff=0.5
predictions=data.frame(
  day.tree01.pre=cutoff_binary(predict_probs=
                                 predict(day.tree01,test.data.day,type="prob"),
                               real=test.data.day$plus1.cluster,
                               cutoff = cutoff,
                               level.positive = "yes",
                               level.negative = "no") ,
  day.tree02.pre=cutoff_binary(predict_probs=
                                 predict(day.tree02,test.data.day,type="prob"),
                               real=test.data.day$plus1.cluster,
                               cutoff = cutoff,
                               level.positive = "yes",
                               level.negative = "no") ,
  day.tree03.pre=cutoff_binary(predict_probs=
                                 predict(day.tree03,test.data.day,type="prob"),
                               real=test.data.day$plus1.cluster,
                               cutoff = cutoff,
                               level.positive = "yes",
                               level.negative = "no") ,
  day.rf01.pre=cutoff_binary(predict_probs=
                                 predict(day.rf01,test.data.day,type="prob"),
                               real=test.data.day$plus1.cluster,
                               cutoff = cutoff,
                               level.positive = "yes",
                               level.negative = "no") ,
  day.rf02.pre=cutoff_binary(predict_probs=
                                 predict(day.rf02,test.data.day,type="prob"),
                               real=test.data.day$plus1.cluster,
                               cutoff = cutoff,
                               level.positive = "yes",
                               level.negative = "no") ,
  day.rf03.pre=cutoff_binary(predict_probs=
                                 predict(day.rf03,test.data.day,type="prob"),
                               real=test.data.day$plus1.cluster,
                               cutoff = cutoff,
                               level.positive = "yes",
                               level.negative = "no") ,
  day.xgb01.pre=summary_binary02(predicted_probs=
                                   predict(xgb01.var1,as.matrix(test.data.day.xgb %>% dplyr::select(current.frac,mins1.frac,mins2.frac,mins3.frac,mins4.frac,mins5.frac,mins6.frac))),
                                 real=as.factor(test.data.day.xgb$plus1.cluster),
                                 cutoff = cutoff),
  day.xgb02.pre=summary_binary02(predicted_probs =
                                   predict(xgb02.vars2 ,as.matrix(test.data.day.xgb %>% dplyr::select(current.frac,mins1.frac,mins2.frac,mins3.frac,mins4.frac,mins5.frac,mins6.frac,current.lag.frac,mins1.frac.lag,mins2.frac.lag,mins3.frac.lag,mins4.frac.lag,mins5.frac.lag,mins6.frac.lag))),
                                 real =as.factor(test.data.day.xgb$plus1.cluster),
                                 cutoff=cutoff) 
) 

t(predictions)
##                Accuracy Sensitivity Specificity      F1 Balanced Accuracy
## day.tree01.pre  0.94160     0.95450     0.77270 0.96810           0.86360
## day.tree02.pre  0.97730     0.97550     1.00000 0.98760           0.98780
## day.tree03.pre  0.97080     0.97200     0.95450 0.98410           0.96330
## day.rf01.pre    0.94480     0.96150     0.72730 0.97000           0.84440
## day.rf02.pre    0.97400     0.97200     1.00000 0.98580           0.98600
## day.rf03.pre    0.97400     0.97200     1.00000 0.98580           0.98600
## day.xgb01.pre   0.95130     0.77273     0.96503 0.69388           0.86888
## day.xgb02.pre   0.98052     1.00000     0.97902 0.88000           0.98951

When evaluating the decision tree, random forest, and XGBoost models for predicting future day demand in the Uber dataset, the results at a cutoff threshold of 0.5 show that the XGBoost model trained on variable set 2 (the pickup-fraction features plus their spatial lags) stands out as the best performer. It achieved an accuracy of 98.05% and a balanced accuracy of 98.95%, indicating robust overall predictive capability. Its sensitivity of 100% suggests it correctly identifies all instances of the positive class, and its specificity of 97.9% remains high. These results point to the superior predictive power of the XGBoost model for forecasting Uber demand. It is important to note that the choice of cutoff threshold significantly affects the reported metrics, and the selection should align with the specific priorities and constraints of the use case, particularly the trade-off between false positives and false negatives.

4.2.2 Visualizing predictions for the next day

#best model- day.xgb02,when cutoff=0.5
day.best.predict.probs = predict(xgb02.vars2,
                                 as.matrix(test.data.day.xgb%>%
                                             dplyr::select(current.frac,mins1.frac,mins2.frac,mins3.frac,
                                                           mins4.frac,mins5.frac,mins6.frac,current.lag.frac,
                                                           mins1.frac.lag,mins2.frac.lag,mins3.frac.lag,
                                                           mins4.frac.lag,mins5.frac.lag,mins6.frac.lag)))

day.best.predict.level=ifelse(day.best.predict.probs>=0.5,1,0)


intersection.grid %>%
  mutate(true.day21=test.data.day.xgb$plus1.cluster[test.data.day.xgb$day==20],
         predict.probs.day21=day.best.predict.probs,
         predict.level.day21=day.best.predict.level
         ) -> intersection.grid

ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=factor(true.day21))) +
  labs(title="True clusters in 21 day",fill="cluster")+
  scale_fill_manual(values=c("#440154","#FDE725")) +theme_bw()+
  theme(legend.position = "bottom")->true.day21
ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=predict.probs.day21)) +
  labs(title="Predict probs",fill="prediction")+
  scale_fill_viridis() +theme_bw()+
  theme(legend.position = "bottom") ->predict.probs.day21
ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=factor(predict.level.day21))) +
  labs(title="Prdict level with cutoff=0.5",fill="prediction")+
  scale_fill_manual(values=c("#440154","#FDE725")) +theme_bw()+
  theme(legend.position = "bottom") ->predict.level.day21

plot_grid(true.day21,
          predict.probs.day21,
          predict.level.day21,
          ncol = 3)

In summary, leveraging the XGBoost model with variable set 2 for daily high-demand mode prediction in the Uber dataset yields remarkable accuracy (98.05%) and balanced accuracy (98.95%). This robust model enables Uber to strategically allocate resources, enhancing operational efficiency by proactively addressing fluctuations in demand. The high accuracy ensures precise identification of high-demand and non-high-demand days, minimizing driver idle time and improving service responsiveness.

4.3 Models for hourly mode

For hourly modeling, my strategy involves deploying Random Forest and XGBoost algorithms, coupled with an exploration of diverse parameters to discern optimal models for predicting high-frequency data. By iteratively adjusting parameters such as the number of estimators and rounds, I aim to identify configurations that maximize accuracy and reliability in forecasting hourly high-demand patterns. Understanding the nuanced fluctuations and changes in demand throughout the day and hours is crucial. Uber experiences distinct patterns, including peak hours, which require precise prediction for effective resource allocation. This approach ensures a tailored response to hourly demand, enhancing Uber’s adaptability and responsiveness to the evolving needs of users and drivers throughout different hours, ultimately optimizing service delivery and user satisfaction.

4.3.1 Hourly train and test data

rolling.window.hour$ID=as.factor(rolling.window.hour$ID)

rolling.window.hour$plus1.cluster=factor(x=rolling.window.hour$plus1.cluster,
                                      levels=c(0,1),
                                      labels=c("no","yes"))
rolling.window.hour$mins1.cluster=factor(x=rolling.window.hour$mins1.cluster,
                                      levels=c(0,1),
                                      labels=c("no","yes"))
rolling.window.hour$mins2.cluster=factor(x=rolling.window.hour$mins2.cluster,
                                      levels=c(0,1),
                                      labels=c("no","yes"))
rolling.window.hour$mins3.cluster=factor(x=rolling.window.hour$mins3.cluster,
                                      levels=c(0,1),
                                      labels=c("no","yes"))
rolling.window.hour$mins4.cluster=factor(x=rolling.window.hour$mins4.cluster,
                                      levels=c(0,1),
                                      labels=c("no","yes"))
rolling.window.hour$mins5.cluster=factor(x=rolling.window.hour$mins5.cluster,
                                      levels=c(0,1),
                                      labels=c("no","yes"))
rolling.window.hour$mins6.cluster=factor(x=rolling.window.hour$mins6.cluster,
                                      levels=c(0,1),
                                      labels=c("no","yes"))
rolling.window.hour$current.cluster=factor(x=rolling.window.hour$current.cluster,
                                      levels=c(0,1),
                                      labels=c("no","yes"))


#test: hours 160-167, used to predict the next 8 hours (161-168)
rolling.window.hour[which(rolling.window.hour$hour>=160),]%>% arrange(ID,hour)->test.data.hour
#train: hours 7-159
rolling.window.hour[-which(rolling.window.hour$hour>=160),]-> train.data.hour
head(test.data.hour)
## # A tibble: 6 × 24
##   ID     hour current.frac current.lag.frac current.cluster mins1.frac
##   <fct> <int>        <dbl>            <dbl> <fct>                <dbl>
## 1 1       160        0.2              0.333 yes                  0    
## 2 1       161        0                0.333 no                   0.2  
## 3 1       162        0.167            0.222 yes                  0    
## 4 1       163        0.167            0.222 yes                  0.167
## 5 1       164        0.143            0.190 yes                  0.167
## 6 1       165        0                0.259 no                   0.143
## # ℹ 18 more variables: mins2.frac <dbl>, mins3.frac <dbl>, mins4.frac <dbl>,
## #   mins5.frac <dbl>, mins6.frac <dbl>, mins1.frac.lag <dbl>,
## #   mins2.frac.lag <dbl>, mins3.frac.lag <dbl>, mins4.frac.lag <dbl>,
## #   mins5.frac.lag <dbl>, mins6.frac.lag <dbl>, mins1.cluster <fct>,
## #   mins2.cluster <fct>, mins3.cluster <fct>, mins4.cluster <fct>,
## #   mins5.cluster <fct>, mins6.cluster <fct>, plus1.cluster <fct>
head(train.data.hour)
## # A tibble: 6 × 24
##   ID     hour current.frac current.lag.frac current.cluster mins1.frac
##   <fct> <int>        <dbl>            <dbl> <fct>                <dbl>
## 1 1         7        0.111           0.519  yes                  0    
## 2 1         8        0.214           0.143  yes                  0.111
## 3 1         9        0.1             0.267  yes                  0.214
## 4 1        10        0               0.667  no                   0.1  
## 5 1        11        0               0.111  no                   0    
## 6 1        12        0.25            0.0833 no                   0    
## # ℹ 18 more variables: mins2.frac <dbl>, mins3.frac <dbl>, mins4.frac <dbl>,
## #   mins5.frac <dbl>, mins6.frac <dbl>, mins1.frac.lag <dbl>,
## #   mins2.frac.lag <dbl>, mins3.frac.lag <dbl>, mins4.frac.lag <dbl>,
## #   mins5.frac.lag <dbl>, mins6.frac.lag <dbl>, mins1.cluster <fct>,
## #   mins2.cluster <fct>, mins3.cluster <fct>, mins4.cluster <fct>,
## #   mins5.cluster <fct>, mins6.cluster <fct>, plus1.cluster <fct>
#save the test and train data
#saveRDS(test.data.hour,"data/Hourtestdata.RData")
#test.data.hour=readRDS("data/Hourtestdata.RData")
#saveRDS(train.data.hour,"data/Hourtraindata.RData")
#train.data.hour=readRDS("data/Hourtraindata.RData")

Modelling-Random Forest

#variable set 1, 300 trees, nodesize 10
set.seed(1234)
hour.rf01=train(formula.var1,
                    data =train.data.hour,
                    method="rf",
                    ntree=300,
                    nodesize=10,
                    trControl=trcontrol.5cv,
                    tuneGrid=expand.grid(
                      mtry=2:4)
                    )
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
#mtry=3, ROC=0.944

#variable set 2, 300 trees, nodesize 10
set.seed(1234)
hour.rf02=train(formula.var2,
                    data =train.data.hour,
                    method="rf",
                    ntree=300,
                    nodesize=10,
                    trControl=trcontrol.5cv,
                    tuneGrid=expand.grid(
                      mtry=2:8)
                    )
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
#mtry=3,ROC=0.976

#variable set 3, 300 trees, nodesize 10
set.seed(1234)
hour.rf03=train(formula.var3,
                    data =train.data.hour,
                    method="rf",
                    ntree=300,
                    nodesize=10,
                    trControl=trcontrol.5cv,
                    tuneGrid=expand.grid(
                      mtry=4:10)
                    )
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
#mtry=5, ROC= 0.976


#save models
#hour.rf01 %>% saveRDS(here("output","Hour.rf.var1.rds"))
#hour.rf02 %>% saveRDS(here("output","Hour.rf.var2.rds"))
#hour.rf03 %>% saveRDS(here("output","Hour.rf.var3.rds"))
#hour.rf01 =readRDS(here("output","Hour.rf.var1.rds"))
#hour.rf02 =readRDS(here("output","Hour.rf.var2.rds"))
#hour.rf03 =readRDS(here("output","Hour.rf.var3.rds"))
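The cross-validated ROC values quoted in the comments above come from the fitted caret objects and can be inspected directly; a short sketch:

#per-mtry cross-validated ROC / sensitivity / specificity for hour.rf03
hour.rf03$results[,c("mtry","ROC","Sens","Spec")]
#the mtry value caret selected
hour.rf03$bestTune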

Modelling-XGB gradient boosting machine

#convert the yes/no factor target back to a 0/1 numeric label for xgboost
train.hour.xgb=train.data.hour
test.hour.xgb=test.data.hour
train.hour.xgb$plus1.cluster=as.character(train.hour.xgb$plus1.cluster)
train.hour.xgb$plus1.cluster[train.hour.xgb$plus1.cluster=="yes"]=1
train.hour.xgb$plus1.cluster[train.hour.xgb$plus1.cluster=="no"]=0
train.hour.xgb$plus1.cluster=as.numeric(train.hour.xgb$plus1.cluster)

test.hour.xgb$plus1.cluster=as.character(test.hour.xgb$plus1.cluster)
test.hour.xgb$plus1.cluster[test.hour.xgb$plus1.cluster=="yes"]=1
test.hour.xgb$plus1.cluster[test.hour.xgb$plus1.cluster=="no"]=0
test.hour.xgb$plus1.cluster=as.numeric(test.hour.xgb$plus1.cluster)

#variable set 1
set.seed(1234)
cv=xgb.cv(data=as.matrix(train.hour.xgb %>% dplyr::select(current.frac,mins1.frac,mins2.frac,mins3.frac,mins4.frac,mins5.frac,mins6.frac)),
          label=as.numeric(train.hour.xgb$plus1.cluster),
          nrounds=100,
          nfold=5,
          objective= "binary:logistic",
          eta=0.1,
          max_depth=8,
          early_stopping_rounds = 10,
          verbose=F)
#best number of boosting rounds selected by early stopping
cv$best_ntreelimit
## [1] 58
#  ntrees.train
#   58        
#Run xgboost
set.seed(1234)
hour.xgb01=xgboost(data=as.matrix(train.hour.xgb %>% dplyr::select(current.frac,mins1.frac,mins2.frac,mins3.frac,mins4.frac,mins5.frac,mins6.frac)),
                      label=as.numeric(train.hour.xgb$plus1.cluster),
                      nrounds=58,
                      objective="binary:logistic",
                      eta=0.1,
                      max_depth=8,
                      verbose=F)  

#variable set 2
set.seed(1234)
cv=xgb.cv(data=as.matrix(train.hour.xgb %>% dplyr::select(current.frac,mins1.frac,mins2.frac,mins3.frac,mins4.frac,mins5.frac,mins6.frac,current.lag.frac,mins1.frac.lag,mins2.frac.lag,mins3.frac.lag,mins4.frac.lag,mins5.frac.lag,mins6.frac.lag)),
          label=as.numeric(train.hour.xgb$plus1.cluster),
          nrounds=100,
          nfold=5,
          objective="binary:logistic",
          eta=0.1,
          max_depth=8,
          early_stopping_rounds = 10,
          verbose=F)
#best number of boosting rounds selected by early stopping
cv$best_ntreelimit
## [1] 74
#  ntrees.train
#   74         
#Run xgboost
set.seed(1234)
hour.xgb02=xgboost(data=as.matrix(train.hour.xgb %>% dplyr::select(current.frac,mins1.frac,mins2.frac,mins3.frac,mins4.frac,mins5.frac,mins6.frac,current.lag.frac,mins1.frac.lag,mins2.frac.lag,mins3.frac.lag,mins4.frac.lag,mins5.frac.lag,mins6.frac.lag)),
                      label=as.numeric(train.hour.xgb$plus1.cluster),
                      nrounds=74,
                      objective="binary:logistic",
                      eta=0.1,
                      max_depth=8,
                      verbose=F) 
#save models
#hour.xgb01 %>% saveRDS(here("output","hour.xgb01.rds"))
#hour.xgb02 %>% saveRDS(here("output","hour.xgb02.rds"))

4.4 Prediction for hourly mode

table(rolling.window.hour$plus1.cluster)/length(rolling.window.hour$plus1.cluster)
## 
##         no        yes 
## 0.92209809 0.07790191

With an imbalance in the labels (roughly 92% "no" and 8% "yes"), the default 0.5 cutoff for binary classification may be suboptimal. Exploring different thresholds is advised to improve prediction quality, considering the trade-off between false positives and false negatives.
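The comparison in the next subsection evaluates every model at cutoffs of 0.1, 0.3 and 0.5. For a single model, the same per-cutoff statistics can be collected in one call; a minimal sketch, reusing cutoff_binary() from section 4.2.1 and the fitted hour.rf03:

rf03.probs=predict(hour.rf03,test.data.hour,type="prob")
sapply(c(0.1,0.3,0.5),function(ct)
  cutoff_binary(predict_probs=rf03.probs,
                real=test.data.hour$plus1.cluster,
                cutoff=ct,
                level.positive="yes",
                level.negative="no"))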

4.4.1 Comparing different models to find the optimal one

#compare the result, best cutoff=0.3
cutoff=0.3
prediction.hour=data.frame(
  hour.rf01.pre=cutoff_binary(predict_probs=
                                 predict(hour.rf01,test.data.hour,type="prob"),
                               real=test.data.hour$plus1.cluster,
                               cutoff = cutoff,
                               level.positive = "yes",
                               level.negative = "no") ,
  hour.rf02.pre=cutoff_binary(predict_probs=
                                 predict(hour.rf02,test.data.hour,type="prob"),
                               real=test.data.hour$plus1.cluster,
                               cutoff = cutoff,
                               level.positive = "yes",
                               level.negative = "no") ,
  hour.rf03.pre=cutoff_binary(predict_probs=
                                 predict(hour.rf03,test.data.hour,type="prob"),
                               real=test.data.hour$plus1.cluster,
                               cutoff = cutoff,
                               level.positive = "yes",
                               level.negative = "no") ,
  hour.xgb01.pre=summary_binary02(predicted_probs=
                                   predict(hour.xgb01,as.matrix(test.hour.xgb %>% dplyr::select(current.frac,mins1.frac,mins2.frac,mins3.frac,mins4.frac,mins5.frac,mins6.frac))),
                                 real=as.factor(test.hour.xgb$plus1.cluster),
                                 cutoff = cutoff),
  hour.xgb02.pre=summary_binary02(predicted_probs =
                                    predict(hour.xgb02,as.matrix(test.hour.xgb %>% dplyr::select(current.frac,mins1.frac,mins2.frac,mins3.frac,mins4.frac,mins5.frac,mins6.frac,current.lag.frac,mins1.frac.lag,mins2.frac.lag,mins3.frac.lag,mins4.frac.lag,mins5.frac.lag,mins6.frac.lag))),
                                 real =as.factor(test.hour.xgb$plus1.cluster),
                                 cutoff=cutoff) 
) 

t(prediction.hour)
##                Accuracy Sensitivity Specificity      F1 Balanced Accuracy
## hour.rf01.pre   0.96470     0.96930     0.90710 0.98070           0.93820
## hour.rf02.pre   0.97610     0.97630     0.97270 0.98690           0.97450
## hour.rf03.pre   0.97810     0.97810     0.97810 0.98800           0.97810
## hour.xgb01.pre  0.95373     0.89617     0.95835 0.74208           0.92726
## hour.xgb02.pre  0.94318     0.74863     0.95879 0.66184           0.85371

When cutoff=0.1

               Accuracy Sensitivity Specificity      F1 Balanced Accuracy 
hour.rf01.pre   0.92570     0.92330     0.95630 0.95840           0.93980 
hour.rf02.pre   0.93510     0.93030     0.99450 0.96370           0.96240 
hour.rf03.pre   0.93990     0.93560     0.99450 0.96650           0.96500 ->best
hour.xgb01.pre  0.89894     0.94536     0.89522 0.58151           0.92029 
hour.xgb02.pre  0.91761     0.89071     0.91977 0.61626           0.90524

When cutoff=0.3 -> best

               Accuracy Sensitivity Specificity      F1 Balanced Accuracy 
hour.rf01.pre   0.96470     0.96930     0.90710 0.98070           0.93820 
hour.rf02.pre   0.97610     0.97630     0.97270 0.98690           0.97450 
hour.rf03.pre   0.97810     0.97810     0.97810 0.98800           0.97810 ->best
hour.xgb01.pre  0.95373     0.89617     0.95835 0.74208           0.92726 
hour.xgb02.pre  0.94318     0.74863     0.95879 0.66184           0.85371

When cutoff=0.5

               Accuracy Sensitivity Specificity      F1 Balanced Accuracy 
hour.rf01.pre   0.98050     0.99430     0.80870 0.98950           0.90150 
hour.rf02.pre   0.98660     0.99470     0.88520 0.99280           0.94000 -> best
hour.rf03.pre   0.98340     0.99250     0.86890 0.99100           0.93070 
hour.xgb01.pre  0.96550     0.74863     0.98290 0.76323           0.86577 
hour.xgb02.pre  0.94968     0.62295     0.97589 0.64773           0.79942

The evaluation of the binary classification models at different cutoff points reveals nuanced performance. While a cutoff of 0.5 yields the highest overall accuracy (98.66% for the two-variable-set random forest), its specificity (88.5%) is comparatively low. Lowering the threshold to 0.3, the random forest model with variable set 3 emerges as the optimal choice, achieving both an accuracy and a balanced accuracy of 97.81%. This cutoff raises specificity, indicating a better ability to flag the rarer high-demand cells, while sensitivity and the F1 score remain strong, signifying a well-balanced model. Attention to this metric is crucial with imbalanced data, where correctly identifying the rare high-demand class matters most. Thus, the choice of cutoff, together with the variable-set-3 random forest, yields a more robust and context-aware predictive model for this classification task.

4.4.2 Visualizing predictions for the next 8 hours

#best model- hour.rf03
hour.predict.probs =predict(hour.rf03 ,test.data.hour,type="prob")$yes  #probability of the "yes" (high-demand) class
hour.predict.level=ifelse(predict(hour.rf03 ,
                               test.data.hour,type="prob")$yes>=0.3,"yes","no")
test.data.hour$predict.probs=hour.predict.probs
test.data.hour$predict.level=hour.predict.level

##test: hours 160-167 used to predict the next 8 hours (161-168)
intersection.grid %>%
  mutate(true.hour161=test.data.hour$plus1.cluster[test.data.hour$hour==160],
         predict.probs.hour161=test.data.hour$predict.probs[test.data.hour$hour==160],
         predict.level.hour161=test.data.hour$predict.level[test.data.hour$hour==160],
         
         true.hour162=test.data.hour$plus1.cluster[test.data.hour$hour==161],
         predict.probs.hour162=test.data.hour$predict.probs[test.data.hour$hour==161],
         predict.level.hour162=test.data.hour$predict.level[test.data.hour$hour==161],
         
         true.hour163=test.data.hour$plus1.cluster[test.data.hour$hour==162],
         predict.probs.hour163=test.data.hour$predict.probs[test.data.hour$hour==162],
         predict.level.hour163=test.data.hour$predict.level[test.data.hour$hour==162],
         
         true.hour164=test.data.hour$plus1.cluster[test.data.hour$hour==163],
         predict.probs.hour164=test.data.hour$predict.probs[test.data.hour$hour==163],
         predict.level.hour164=test.data.hour$predict.level[test.data.hour$hour==163],
         
         true.hour165=test.data.hour$plus1.cluster[test.data.hour$hour==164],
         predict.probs.hour165=test.data.hour$predict.probs[test.data.hour$hour==164],
         predict.level.hour165=test.data.hour$predict.level[test.data.hour$hour==164],
         
         true.hour166=test.data.hour$plus1.cluster[test.data.hour$hour==165],
         predict.probs.hour166=test.data.hour$predict.probs[test.data.hour$hour==165],
         predict.level.hour166=test.data.hour$predict.level[test.data.hour$hour==165],
         
         true.hour167=test.data.hour$plus1.cluster[test.data.hour$hour==166],
         predict.probs.hour167=test.data.hour$predict.probs[test.data.hour$hour==166],
         predict.level.hour167=test.data.hour$predict.level[test.data.hour$hour==166],
         
         true.hour168=test.data.hour$plus1.cluster[test.data.hour$hour==167],
         predict.probs.hour168=test.data.hour$predict.probs[test.data.hour$hour==167],
         predict.level.hour168=test.data.hour$predict.level[test.data.hour$hour==167],
         ) -> intersection.grid

ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=factor(true.hour161))) +
  labs(title="True clusters in 16pm",fill="cluster")+
  scale_fill_manual(values=c("#440154","#FDE725")) +theme_bw()+
  theme(legend.position = "bottom")->true.hour161
ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=predict.probs.hour161)) +
  labs(title="Predict probs",fill="prediction")+
  scale_fill_viridis() +theme_bw()+
  theme(legend.position = "bottom") ->predict.probs.hour161
ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=factor(predict.level.hour161))) +
  labs(title="Prdict level with cutoff0.3",fill="prediction")+
  scale_fill_manual(values=c("#440154","#FDE725")) +theme_bw()+
  theme(legend.position = "bottom") ->predict.level.hour161

ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=factor(true.hour162))) +
  labs(title="True clusters in 17pm",fill="cluster")+
  scale_fill_manual(values=c("#440154","#FDE725")) +theme_bw()+
  theme(legend.position = "bottom")->true.hour162
ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=predict.probs.hour162)) +
  labs(title="Predict probs",fill="prediction")+
  scale_fill_viridis() +theme_bw()+
  theme(legend.position = "bottom") ->predict.probs.hour162
ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=factor(predict.level.hour162))) +
  labs(title="Prdict level with cutoff0.3",fill="prediction")+
  scale_fill_manual(values=c("#440154","#FDE725")) +theme_bw()+
  theme(legend.position = "bottom") ->predict.level.hour162

ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=factor(true.hour163))) +
  labs(title="True clusters in 18pm",fill="cluster")+
  scale_fill_manual(values=c("#440154","#FDE725")) +theme_bw()+
  theme(legend.position = "bottom")->true.hour163
ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=predict.probs.hour163)) +
  labs(title="Predict probs",fill="prediction")+
  scale_fill_viridis() +theme_bw()+
  theme(legend.position = "bottom") ->predict.probs.hour163
ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=factor(predict.level.hour163))) +
  labs(title="Prdict level with cutoff0.3",fill="prediction")+
  scale_fill_manual(values=c("#440154","#FDE725")) +theme_bw()+
  theme(legend.position = "bottom") ->predict.level.hour163

ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=factor(true.hour164))) +
  labs(title="True clusters in 19pm",fill="cluster")+
  scale_fill_manual(values=c("#440154","#FDE725")) +theme_bw()+
  theme(legend.position = "bottom")->true.hour164
ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=predict.probs.hour164)) +
  labs(title="Predict probs",fill="prediction")+
  scale_fill_viridis() +theme_bw()+
  theme(legend.position = "bottom") ->predict.probs.hour164
ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=factor(predict.level.hour164))) +
  labs(title="Prdict level with cutoff0.3",fill="prediction")+
  scale_fill_manual(values=c("#440154","#FDE725")) +theme_bw()+
  theme(legend.position = "bottom") ->predict.level.hour164

ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=factor(true.hour165))) +
  labs(title="True clusters in 20pm",fill="cluster")+
  scale_fill_manual(values=c("#440154","#FDE725")) +theme_bw()+
  theme(legend.position = "bottom")->true.hour165
ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=predict.probs.hour165)) +
  labs(title="Predict probs",fill="prediction")+
  scale_fill_viridis() +theme_bw()+
  theme(legend.position = "bottom") ->predict.probs.hour165
ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=factor(predict.level.hour165))) +
  labs(title="Prdiction level cutoff0.3",fill="prediction")+
  scale_fill_manual(values=c("#440154","#FDE725")) +theme_bw()+
  theme(legend.position = "bottom") ->predict.level.hour165

ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=factor(true.hour166))) +
  labs(title="True clusters in 21pm",fill="cluster")+
  scale_fill_manual(values=c("#440154","#FDE725")) +theme_bw()+
  theme(legend.position = "bottom")->true.hour166
ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=predict.probs.hour166)) +
  labs(title="Predict probs",fill="prediction")+
  scale_fill_viridis() +theme_bw()+
  theme(legend.position = "bottom") ->predict.probs.hour166
ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=factor(predict.level.hour166))) +
  labs(title="Prdiction level cutoff0.3",fill="prediction")+
  scale_fill_manual(values=c("#440154","#FDE725")) +theme_bw()+
  theme(legend.position = "bottom") ->predict.level.hour166

ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=factor(true.hour167))) +
  labs(title="True clusters in 22pm",fill="cluster")+
  scale_fill_manual(values=c("#440154","#FDE725")) +theme_bw()+
  theme(legend.position = "bottom")->true.hour167
ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=predict.probs.hour167)) +
  labs(title="Predict probs",fill="prediction")+
  scale_fill_viridis() +theme_bw()+
  theme(legend.position = "bottom") ->predict.probs.hour167
ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=factor(predict.level.hour167))) +
  labs(title="Prdiction level cutoff0.3",fill="prediction")+
  scale_fill_manual(values=c("#440154","#FDE725")) +theme_bw()+
  theme(legend.position = "bottom") ->predict.level.hour167

ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=factor(true.hour168))) +
  labs(title="True clusters in 23pm",fill="cluster")+
  scale_fill_manual(values=c("#440154","#FDE725")) +theme_bw()+
  theme(legend.position = "bottom")->true.hour168
ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=predict.probs.hour168)) +
  labs(title="Predict probs",fill="prediction")+
  scale_fill_viridis() +theme_bw()+
  theme(legend.position = "bottom") ->predict.probs.hour168
ggplot()+
  geom_sf(data=intersection.grid,aes(geometry=geometry,fill=factor(predict.level.hour168))) +
  labs(title="Prdiction level cutoff0.3",fill="prediction")+
  scale_fill_manual(values=c("#440154","#FDE725")) +theme_bw()+
  theme(legend.position = "bottom") ->predict.level.hour168

plot_grid(true.hour161,
          predict.probs.hour161,
          predict.level.hour161,
          ncol = 3)

plot_grid(true.hour162,
          predict.probs.hour162,
          predict.level.hour162,
          ncol = 3)

plot_grid(true.hour163,
          predict.probs.hour163,
          predict.level.hour163,
          ncol = 3)

plot_grid(true.hour164,
          predict.probs.hour164,
          predict.level.hour164,
          ncol = 3)

plot_grid(true.hour165,
          predict.probs.hour165,
          predict.level.hour165,
          ncol = 3)

plot_grid(true.hour166,
          predict.probs.hour166,
          predict.level.hour166,
          ncol = 3)

plot_grid(true.hour167,
          predict.probs.hour167,
          predict.level.hour167,
          ncol = 3)

plot_grid(true.hour168,
          predict.probs.hour168,
          predict.level.hour168,
          ncol = 3)
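The eight nearly identical blocks above could also be generated in a loop over the forecast hours. A minimal sketch, assuming the intersection.grid columns follow the naming pattern used above (true.hourX, predict.probs.hourX, predict.level.hourX):

library(ggplot2)
library(viridis)
library(cowplot)

#one row of maps (true clusters, predicted probability, predicted level) per hour
for (h in 161:168){
  p.true=ggplot(intersection.grid)+
    geom_sf(aes(geometry=geometry,fill=factor(.data[[paste0("true.hour",h)]])))+
    labs(title=paste0("True clusters, hour ",h),fill="cluster")+
    scale_fill_manual(values=c("#440154","#FDE725"))+theme_bw()+
    theme(legend.position = "bottom")
  p.probs=ggplot(intersection.grid)+
    geom_sf(aes(geometry=geometry,fill=.data[[paste0("predict.probs.hour",h)]]))+
    labs(title="Predicted probs",fill="prediction")+
    scale_fill_viridis()+theme_bw()+
    theme(legend.position = "bottom")
  p.level=ggplot(intersection.grid)+
    geom_sf(aes(geometry=geometry,fill=factor(.data[[paste0("predict.level.hour",h)]])))+
    labs(title="Predicted level with cutoff=0.3",fill="prediction")+
    scale_fill_manual(values=c("#440154","#FDE725"))+theme_bw()+
    theme(legend.position = "bottom")
  print(plot_grid(p.true,p.probs,p.level,ncol=3))
}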

In summary, harnessing the power of the Random Forest model with variable set 3 for predicting hourly high-demand modes in the Uber dataset has proven highly effective. The model achieved an accuracy of 97.81% and a balanced accuracy of 97.81%, underscoring its robust predictive capability. What sets this model apart is its adeptness at capturing the intricate hourly changes in demand patterns. Hourly fluctuations in demand are dynamic and often influenced by factors such as rush hours, events, and other temporal nuances. The Random Forest model, with its ensemble learning approach, excels in discerning and adapting to these nuanced patterns. This capability is crucial for Uber to optimize resource allocation, ensuring a responsive and efficient service during peak demand hours. The model's high accuracy and balanced performance make it a valuable tool for anticipating and adapting to the evolving demands of the Uber transportation service on an hourly basis, ultimately enhancing user experience and operational efficiency.
