Objective

This project aims to optimize traffic light timings dynamically, resulting in a more efficient and streamlined traffic flow that significantly reduces congestion and enhances the overall commuting experience for all road users by:

  1. Developing a robust classification system categorizing traffic situations into distinct levels (low, medium, high, and heavy).

  2. Employing predictive modeling to accurately forecast the number of vehicles on the roads at specific times.

1) Data Preparation

a) Import & Quick Explore on Traffic Dataset

library(dplyr)
traffic_df <- read.csv("Dataset/Traffic.csv")
glimpse(traffic_df)
## Rows: 2,976
## Columns: 9
## $ Time              <chr> "12:00:00 AM", "12:15:00 AM", "12:30:00 AM", "12:45:…
## $ Date              <int> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, …
## $ Day.of.the.week   <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday…
## $ CarCount          <int> 31, 49, 46, 51, 57, 44, 37, 42, 51, 34, 45, 45, 50, …
## $ BikeCount         <int> 0, 0, 0, 0, 6, 0, 0, 4, 0, 0, 0, 0, 0, 0, 22, 16, 28…
## $ BusCount          <int> 4, 3, 3, 2, 15, 5, 1, 4, 9, 4, 1, 1, 3, 4, 42, 49, 2…
## $ TruckCount        <int> 4, 3, 6, 5, 16, 4, 4, 5, 7, 7, 1, 3, 0, 4, 1, 0, 3, …
## $ Total             <int> 39, 55, 55, 58, 94, 53, 42, 55, 67, 45, 47, 49, 53, …
## $ Traffic.Situation <chr> "low", "low", "low", "low", "normal", "low", "low", …
head(traffic_df)
##          Time Date Day.of.the.week CarCount BikeCount BusCount TruckCount Total
## 1 12:00:00 AM   10         Tuesday       31         0        4          4    39
## 2 12:15:00 AM   10         Tuesday       49         0        3          3    55
## 3 12:30:00 AM   10         Tuesday       46         0        3          6    55
## 4 12:45:00 AM   10         Tuesday       51         0        2          5    58
## 5  1:00:00 AM   10         Tuesday       57         6       15         16    94
## 6  1:15:00 AM   10         Tuesday       44         0        5          4    53
##   Traffic.Situation
## 1               low
## 2               low
## 3               low
## 4               low
## 5            normal
## 6               low

b) Data Cleaning

To identify the incorrect, incomplete, inaccurate, irrelevant or missing part of the data and modify, replace or delete them when necessary.

Replace any ” ” to NA

cnames<-names(traffic_df)
for (i in cnames){
  print(paste(i,sum(traffic_df[i]=="" )))
  traffic_df[which(traffic_df[i]==""),i]<-NA
}
## [1] "Time 0"
## [1] "Date 0"
## [1] "Day.of.the.week 0"
## [1] "CarCount 0"
## [1] "BikeCount 0"
## [1] "BusCount 0"
## [1] "TruckCount 0"
## [1] "Total 0"
## [1] "Traffic.Situation 0"

Check for missing values and sum them up for each column

colSums(is.na(traffic_df))
##              Time              Date   Day.of.the.week          CarCount 
##                 0                 0                 0                 0 
##         BikeCount          BusCount        TruckCount             Total 
##                 0                 0                 0                 0 
## Traffic.Situation 
##                 0

c) Data Conversion

To convert time from string to time format using lubridate

Create new column for hour , minute, AM/PM

library(lubridate)
traffic_df$Time_hms<- hms(traffic_df$Time)
traffic_df$Time <- as.POSIXct(strptime(traffic_df$Time, format = "%I:%M:%S %p"))

# Create columns for hour, minute, and AM/PM
traffic_df <- traffic_df %>%
  mutate(
    hour = hour(Time),        # Hour in separate column
    minute = minute(Time),    # Minute in separate column
    am_pm = ifelse(hour(Time) < 12, 1, 2)  # 1 for AM, 2 for PM
  )

head(traffic_df)
##                  Time Date Day.of.the.week CarCount BikeCount BusCount
## 1 2024-01-13 00:00:00   10         Tuesday       31         0        4
## 2 2024-01-13 00:15:00   10         Tuesday       49         0        3
## 3 2024-01-13 00:30:00   10         Tuesday       46         0        3
## 4 2024-01-13 00:45:00   10         Tuesday       51         0        2
## 5 2024-01-13 01:00:00   10         Tuesday       57         6       15
## 6 2024-01-13 01:15:00   10         Tuesday       44         0        5
##   TruckCount Total Traffic.Situation   Time_hms hour minute am_pm
## 1          4    39               low  12H 0M 0S    0      0     1
## 2          3    55               low 12H 15M 0S    0     15     1
## 3          6    55               low 12H 30M 0S    0     30     1
## 4          5    58               low 12H 45M 0S    0     45     1
## 5         16    94            normal   1H 0M 0S    1      0     1
## 6          4    53               low  1H 15M 0S    1     15     1

d) Data transformation

To convert or modify raw data into a suitable format or structure for analysis

To create factor on numeric, and sort the ordinal variable

traffic_df %>%
  count(`Day.of.the.week`)
##   Day.of.the.week   n
## 1          Friday 384
## 2          Monday 384
## 3        Saturday 384
## 4          Sunday 384
## 5        Thursday 480
## 6         Tuesday 480
## 7       Wednesday 480
traffic_df$'Day.of.the.week' = factor(traffic_df$'Day.of.the.week', levels = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))

traffic_df$DayofWeek_Numeric = as.numeric(traffic_df$'Day.of.the.week')


traffic_df$'Traffic.Situation' = factor(traffic_df$'Traffic.Situation', levels = c("low", "normal", "high", "heavy"))

traffic_df$Traffic_Situation_Numeric = as.numeric(traffic_df$'Traffic.Situation')


head(traffic_df)
##                  Time Date Day.of.the.week CarCount BikeCount BusCount
## 1 2024-01-13 00:00:00   10         Tuesday       31         0        4
## 2 2024-01-13 00:15:00   10         Tuesday       49         0        3
## 3 2024-01-13 00:30:00   10         Tuesday       46         0        3
## 4 2024-01-13 00:45:00   10         Tuesday       51         0        2
## 5 2024-01-13 01:00:00   10         Tuesday       57         6       15
## 6 2024-01-13 01:15:00   10         Tuesday       44         0        5
##   TruckCount Total Traffic.Situation   Time_hms hour minute am_pm
## 1          4    39               low  12H 0M 0S    0      0     1
## 2          3    55               low 12H 15M 0S    0     15     1
## 3          6    55               low 12H 30M 0S    0     30     1
## 4          5    58               low 12H 45M 0S    0     45     1
## 5         16    94            normal   1H 0M 0S    1      0     1
## 6          4    53               low  1H 15M 0S    1     15     1
##   DayofWeek_Numeric Traffic_Situation_Numeric
## 1                 2                         1
## 2                 2                         1
## 3                 2                         1
## 4                 2                         1
## 5                 2                         2
## 6                 2                         1

2) Exploratory Data Analysis

To check on summary statistic of each column

summary(traffic_df)
##       Time                          Date     Day.of.the.week    CarCount    
##  Min.   :2024-01-13 00:00:00   Min.   : 1   Monday   :384    Min.   :  6.0  
##  1st Qu.:2024-01-13 05:56:15   1st Qu.: 8   Tuesday  :480    1st Qu.: 19.0  
##  Median :2024-01-13 11:52:30   Median :16   Wednesday:480    Median : 64.0  
##  Mean   :2024-01-13 11:52:30   Mean   :16   Thursday :480    Mean   : 68.7  
##  3rd Qu.:2024-01-13 17:48:45   3rd Qu.:24   Friday   :384    3rd Qu.:107.0  
##  Max.   :2024-01-13 23:45:00   Max.   :31   Saturday :384    Max.   :180.0  
##                                             Sunday   :384                   
##    BikeCount        BusCount       TruckCount        Total      
##  Min.   : 0.00   Min.   : 0.00   Min.   : 0.00   Min.   : 21.0  
##  1st Qu.: 5.00   1st Qu.: 1.00   1st Qu.: 6.00   1st Qu.: 55.0  
##  Median :12.00   Median :12.00   Median :14.00   Median :109.0  
##  Mean   :14.92   Mean   :15.28   Mean   :15.32   Mean   :114.2  
##  3rd Qu.:22.00   3rd Qu.:25.00   3rd Qu.:23.00   3rd Qu.:164.0  
##  Max.   :70.00   Max.   :50.00   Max.   :40.00   Max.   :279.0  
##                                                                 
##  Traffic.Situation    Time_hms               hour           minute     
##  low   : 304       Min.   :1H 0M 0S     Min.   : 0.00   Min.   : 0.00  
##  normal:1669       1st Qu.:3H 56M 15S   1st Qu.: 5.75   1st Qu.:11.25  
##  high  : 321       Median :6H 52M 30S   Median :11.50   Median :22.50  
##  heavy : 682       Mean   :6H 52M 30S   Mean   :11.50   Mean   :22.50  
##                    3rd Qu.:9H 48M 45S   3rd Qu.:17.25   3rd Qu.:33.75  
##                    Max.   :12H 45M 0S   Max.   :23.00   Max.   :45.00  
##                                                                        
##      am_pm     DayofWeek_Numeric Traffic_Situation_Numeric
##  Min.   :1.0   Min.   :1.000     Min.   :1.000            
##  1st Qu.:1.0   1st Qu.:2.000     1st Qu.:2.000            
##  Median :1.5   Median :4.000     Median :2.000            
##  Mean   :1.5   Mean   :3.903     Mean   :2.464            
##  3rd Qu.:2.0   3rd Qu.:6.000     3rd Qu.:3.000            
##  Max.   :2.0   Max.   :7.000     Max.   :4.000            
## 
a) The Relationship between Traffic Situation and Total Vehicles

The total vehicles are categorized and ranked as follows, with mean values for each category: Low (65), Normal (83), High(142) and Heavy (199).

library(ggplot2)

mean_values <- traffic_df %>%
  group_by(`Traffic.Situation`) %>%
  summarize(mean_total = mean(Total))

ggplot(traffic_df, aes(x = `Traffic.Situation`, y = Total, fill = `Traffic.Situation`)) +
  geom_boxplot() +
  geom_text(data = mean_values, aes(x = `Traffic.Situation`, y = mean_total, label = round(mean_total, 0)),
            vjust = -0.5, color = "black", size = 3, position = position_dodge(width = 0.75)) +
  labs(title = "Relationship between Total and Traffic Situation",
       x = "Traffic Situation",
       y = "Total") +
  scale_fill_manual(values = c("low" = "blue", "normal" = "green", "high" = "red","heavy"="grey")) +
  theme_minimal()

b) Number of Vehicles on Daily Basis

The overall number of vehicles remains relatively constant throughout the month.

ggplot(traffic_df, aes(x = factor(Date), y = Total)) +
  geom_bar(stat = "identity") +
  labs(title = "Number of Vehicles by Date",
       x = "Date",
       y = "Total") +
  theme_minimal()

c) Proprotion of Vehicles over Date

Cars constitute a larger portion of the total vehicles, indicating their prevalence compared to other vehicle types. BikeCount is high on every few days.

library(tidyr)

# Calculate proportions
EDA_VehiclebyDate <- traffic_df %>%
  group_by(Date) %>%
  summarize(
    percent_car = sum(CarCount) / sum(Total) * 100,
    percent_bike = sum(BikeCount) / sum(Total) * 100,
    percent_bus = sum(BusCount) / sum(Total) * 100,
    percent_truck = sum(TruckCount) / sum(Total) * 100
  )

# Reshape data for plotting using gather
EDA_VehiclebyDate_Gather <- EDA_VehiclebyDate %>%
  gather(key = "Vehicle_Type", value = "Percentage", -Date)

# Plotting
ggplot(EDA_VehiclebyDate_Gather, aes(x = factor(Date), y = Percentage, fill = Vehicle_Type)) +
  geom_bar(stat = "identity") +
  labs(title = "Proportion of Vehicle Types Over Date",
       x = "Date",
       y = "Percentage") +
  scale_fill_manual(values = c("percent_car" = "blue", "percent_bike" = "green", "percent_bus" = "red", "percent_truck" = "purple")) +
  theme_minimal()

d) Proprotion of Vehicles over Day of Week

On Fridays, BikeCount noticeably surges, accompanied by a diminished proportion of larger vehicles such as BusCount and TruckCount.

# Calculate proportions
EDA_VehiclebyDay <- traffic_df %>%
  group_by(Day.of.the.week) %>%
  summarize(
    percent_car = sum(CarCount) / sum(Total) * 100,
    percent_bike = sum(BikeCount) / sum(Total) * 100,
    percent_bus = sum(BusCount) / sum(Total) * 100,
    percent_truck = sum(TruckCount) / sum(Total) * 100
  )

# Reshape data for plotting using gather
EDA_VehiclebyDay_Gather <- EDA_VehiclebyDay %>%
  gather(key = "Vehicle_Type", value = "Percentage", -Day.of.the.week)

# Plotting
ggplot(EDA_VehiclebyDay_Gather, aes(x = factor(Day.of.the.week), y = Percentage, fill = Vehicle_Type)) +
  geom_bar(stat = "identity") +
  labs(title = "Proportion of Vehicle Types Over Day of Week",
       x = "Day of the Week",
       y = "Percentage") +
  scale_fill_manual(values = c("percent_car" = "blue", "percent_bike" = "green", "percent_bus" = "red", "percent_truck" = "purple")) +
  theme_minimal()

e) Traffic Situation

The traffic situation pattern is generally similar on both weekdays and weekends. However, there is a notable difference in the total number of vehicles, with more vehicles observed on weekdays compared to weekends.

ggplot(traffic_df, aes(x = Day.of.the.week, fill = `Traffic.Situation`)) +
  geom_bar(position = "dodge") +
  labs(x = "Day of the week") +
  ggtitle("Histogram of Traffic Situation by Day of the week") +
  theme_minimal()

f) Number of Vehicles on Hour

On an hourly basis, the traffic situation is generally normal during most hours (0000 to 2300). The period from 0600 to 1800 experiences heavier traffic, likely due to working hours.The number of vehicles peaked around 5-8am and 4-6pm

ggplot(traffic_df, aes(x = factor(hour), y = Total,fill = `Traffic.Situation`)) +
  geom_bar(stat = "identity") +
  labs(title = "Number of Vehicles by Hour",
       x = "Time",
       y = "Total") +
  theme_minimal()

g) Identify the relationship between the variables

The correlation coefficients indicate strong, positive relationships between traffic situation and CarCount (0.69), BikeCount (0.56), BusCount (0.69), Total (0.78).

selected_cols <- traffic_df[ c('DayofWeek_Numeric', 'Date','CarCount', 'BikeCount', 'BusCount','TruckCount', 'Total', 'hour', 'minute', 'am_pm','Traffic_Situation_Numeric')]


# Computing the correlation matrix
correlation_matrix <- cor(selected_cols,method = "spearman")
correlation_matrix
##                           DayofWeek_Numeric         Date     CarCount
## DayofWeek_Numeric               1.000000000 -0.024441186 -0.004679699
## Date                           -0.024441186  1.000000000 -0.011382575
## CarCount                       -0.004679699 -0.011382575  1.000000000
## BikeCount                       0.035196332 -0.005270197  0.749585264
## BusCount                       -0.044073940 -0.015112213  0.754586842
## TruckCount                     -0.029521139  0.022323732 -0.645307877
## Total                          -0.011219205 -0.008937869  0.960325256
## hour                            0.000000000  0.000000000  0.161930232
## minute                          0.000000000  0.000000000 -0.002307269
## am_pm                           0.000000000  0.000000000  0.114641356
## Traffic_Situation_Numeric      -0.025435751 -0.004718380  0.691167541
##                              BikeCount      BusCount   TruckCount        Total
## DayofWeek_Numeric          0.035196332 -0.0440739400 -0.029521139 -0.011219205
## Date                      -0.005270197 -0.0151122132  0.022323732 -0.008937869
## CarCount                   0.749585264  0.7545868423 -0.645307877  0.960325256
## BikeCount                  1.000000000  0.7269200607 -0.608864711  0.818856416
## BusCount                   0.726920061  1.0000000000 -0.568699752  0.837282379
## TruckCount                -0.608864711 -0.5686997518  1.000000000 -0.552649980
## Total                      0.818856416  0.8372823788 -0.552649980  1.000000000
## hour                       0.198004911  0.2226425469 -0.004864151  0.193874934
## minute                    -0.001020467  0.0009379625  0.004154811 -0.006117425
## am_pm                      0.191860372  0.1757892994  0.017641702  0.149989606
## Traffic_Situation_Numeric  0.560930908  0.6887718293 -0.296704073  0.776969464
##                                   hour        minute      am_pm
## DayofWeek_Numeric          0.000000000  0.0000000000 0.00000000
## Date                       0.000000000  0.0000000000 0.00000000
## CarCount                   0.161930232 -0.0023072694 0.11464136
## BikeCount                  0.198004911 -0.0010204673 0.19186037
## BusCount                   0.222642547  0.0009379625 0.17578930
## TruckCount                -0.004864151  0.0041548113 0.01764170
## Total                      0.193874934 -0.0061174248 0.14998961
## hour                       1.000000000  0.0000000000 0.86677814
## minute                     0.000000000  1.0000000000 0.00000000
## am_pm                      0.866778142  0.0000000000 1.00000000
## Traffic_Situation_Numeric  0.113419565 -0.0211591243 0.07481169
##                           Traffic_Situation_Numeric
## DayofWeek_Numeric                       -0.02543575
## Date                                    -0.00471838
## CarCount                                 0.69116754
## BikeCount                                0.56093091
## BusCount                                 0.68877183
## TruckCount                              -0.29670407
## Total                                    0.77696946
## hour                                     0.11341957
## minute                                  -0.02115912
## am_pm                                    0.07481169
## Traffic_Situation_Numeric                1.00000000
library(reshape2) # For data manipulation

# Convert correlation matrix to a long format for ggplot
cor_df <- melt(correlation_matrix)

# Plot heatmap using ggplot with rotated x-axis labels
ggplot(cor_df, aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "#D73027", mid = "#FFFFBF", high = "#4575B4", midpoint = 0,
                       limits = c(-1, 1), name = "Correlation") +
  geom_text(aes(label = round(value, 4)), color = "black", size = 2.45) +
  labs(title = "Correlation Heatmap", x = "", y = "") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 75, vjust = 0.5))

3) Modeling

a) Classification - Predict Traffic Situation Using Random Forest
Create Training and Test Set
library(caret)
library(randomForest)

set.seed(50)


# Create train-test split using createDataPartition
trainIndex <- createDataPartition(traffic_df$Traffic.Situation, p = 0.7, list = FALSE)

train_data <- traffic_df[trainIndex, ]
test_data <- traffic_df[-trainIndex, ]

table(traffic_df$Traffic.Situation)
## 
##    low normal   high  heavy 
##    304   1669    321    682
Downsampling due to imbalanced data
train_data_down <- downSample(x = train_data[,-ncol(train_data)],
                              y = train_data$Traffic.Situation)

table(train_data_down$Traffic.Situation)
## 
##    low normal   high  heavy 
##    213    213    213    213
#Build the classfication model using randomForest
train_data_down <- subset(train_data_down, select = -c(Class))

# Check the column names in train_data
names(train_data)
##  [1] "Time"                      "Date"                     
##  [3] "Day.of.the.week"           "CarCount"                 
##  [5] "BikeCount"                 "BusCount"                 
##  [7] "TruckCount"                "Total"                    
##  [9] "Traffic.Situation"         "Time_hms"                 
## [11] "hour"                      "minute"                   
## [13] "am_pm"                     "DayofWeek_Numeric"        
## [15] "Traffic_Situation_Numeric"
# Check the column names in train_data_down
names(train_data_down)
##  [1] "Time"              "Date"              "Day.of.the.week"  
##  [4] "CarCount"          "BikeCount"         "BusCount"         
##  [7] "TruckCount"        "Total"             "Traffic.Situation"
## [10] "Time_hms"          "hour"              "minute"           
## [13] "am_pm"             "DayofWeek_Numeric"
Build the model using Random Forest
#For train data
model.rf = randomForest(Traffic.Situation ~ ., data = train_data,
                        keep.forest = TRUE, ntree = 150)

#For train data down
model_down.rf = randomForest(Traffic.Situation ~ ., data = train_data_down,
                             keep.forest = TRUE, ntree = 150)
Plot the train data model
Plot the error rate to show the class error rate. As the number of tree increases, the error rate approaches zero
plot(model.rf, main = "Error rate of random forest")    
# Add legend with class labels
legend("topright", legend = levels(train_data$Traffic.Situation), col = 1:4, lty = 1)

Plot the downsampling data model
plot(model_down.rf, main = "Error rate of random forest") 
# Add legend with class labels
legend("topright", legend = levels(train_data$Traffic.Situation), col = 1:4, lty = 1)

Display the importance affecting random forest from greatest to least impact, from Top to Bottom
for train data
varImpPlot(model.rf, pch = 20, main = "Importance of Variables")

for downsampling data
varImpPlot(model_down.rf, pch = 20, main = "Importance of Variables")

Model Evaluation

Evaluate the performance of our random forest models by using confusion matrix. First, use the models to make prediction.
#For train data
pred_Test_rd = predict(model.rf, test_data[,setdiff(names(test_data),"Traffic.Situation")],
                       type="response",
                       norm.votes=TRUE
)

#For train data down
pred_Test_rd_down = predict(model_down.rf, test_data[,setdiff(names(test_data),"is_canceled")],
                            type="response",
                            norm.votes=TRUE
)
Confusion metrics for train data
tab_Test = table(actual=test_data$Traffic.Situation, predicted=pred_Test_rd);
confusionMatrix(tab_Test, mode = "everything")
## Confusion Matrix and Statistics
## 
##         predicted
## actual   low normal high heavy
##   low     91      0    0     0
##   normal   0    500    0     0
##   high     0      0   96     0
##   heavy    0      0    0   204
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9959, 1)
##     No Information Rate : 0.5612     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: low Class: normal Class: high Class: heavy
## Sensitivity              1.0000        1.0000      1.0000        1.000
## Specificity              1.0000        1.0000      1.0000        1.000
## Pos Pred Value           1.0000        1.0000      1.0000        1.000
## Neg Pred Value           1.0000        1.0000      1.0000        1.000
## Precision                1.0000        1.0000      1.0000        1.000
## Recall                   1.0000        1.0000      1.0000        1.000
## F1                       1.0000        1.0000      1.0000        1.000
## Prevalence               0.1021        0.5612      0.1077        0.229
## Detection Rate           0.1021        0.5612      0.1077        0.229
## Detection Prevalence     0.1021        0.5612      0.1077        0.229
## Balanced Accuracy        1.0000        1.0000      1.0000        1.000
Confusion metrics for downsampling
tab_Test_down = table(actual=test_data$Traffic.Situation, predicted=pred_Test_rd_down);
confusionMatrix(tab_Test_down,mode="everything")
## Confusion Matrix and Statistics
## 
##         predicted
## actual   low normal high heavy
##   low     91      0    0     0
##   normal   2    486    8     4
##   high     0      0   96     0
##   heavy    0      0    1   203
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9832          
##                  95% CI : (0.9724, 0.9905)
##     No Information Rate : 0.5455          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9727          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: low Class: normal Class: high Class: heavy
## Sensitivity              0.9785        1.0000      0.9143       0.9807
## Specificity              1.0000        0.9654      1.0000       0.9985
## Pos Pred Value           1.0000        0.9720      1.0000       0.9951
## Neg Pred Value           0.9975        1.0000      0.9887       0.9942
## Precision                1.0000        0.9720      1.0000       0.9951
## Recall                   0.9785        1.0000      0.9143       0.9807
## F1                       0.9891        0.9858      0.9552       0.9878
## Prevalence               0.1044        0.5455      0.1178       0.2323
## Detection Rate           0.1021        0.5455      0.1077       0.2278
## Detection Prevalence     0.1021        0.5612      0.1077       0.2290
## Balanced Accuracy        0.9892        0.9827      0.9571       0.9896

RF Model Interpretation

Random Forest Model with Train Data only has higher precision which means it can predict the traffic situation more accurately most of the time while Random Forest Model with Train Data Down has slightly lower precision.

Random Forest Model with Train Data Down has lower recall compare to Random Forest Model with Train Data, which means it has lower percentage of traffic situation that were classified correctly/accurately.

Random Forest Model with Train Data only has higher accuracy and F1 score compared to Random Forest with Train Data Down. Overall,

Random Forest Model with Train Data has better performance compared to Random Forest Model with Train Data Down. In conclusion, Downsampling has not really improved the performance of the model.

3) Modeling

b) Regression - Predicting Number of Vehicles(Total)
Create Training and Test Set
trainIndex <- createDataPartition(traffic_df$Total, p = 0.7, list = FALSE)

train_data <- traffic_df[trainIndex, ]
test_data <- traffic_df[-trainIndex, ]

head(train_data)
##                  Time Date Day.of.the.week CarCount BikeCount BusCount
## 2 2024-01-13 00:15:00   10         Tuesday       49         0        3
## 3 2024-01-13 00:30:00   10         Tuesday       46         0        3
## 4 2024-01-13 00:45:00   10         Tuesday       51         0        2
## 5 2024-01-13 01:00:00   10         Tuesday       57         6       15
## 7 2024-01-13 01:30:00   10         Tuesday       37         0        1
## 8 2024-01-13 01:45:00   10         Tuesday       42         4        4
##   TruckCount Total Traffic.Situation   Time_hms hour minute am_pm
## 2          3    55               low 12H 15M 0S    0     15     1
## 3          6    55               low 12H 30M 0S    0     30     1
## 4          5    58               low 12H 45M 0S    0     45     1
## 5         16    94            normal   1H 0M 0S    1      0     1
## 7          4    42               low  1H 30M 0S    1     30     1
## 8          5    55               low  1H 45M 0S    1     45     1
##   DayofWeek_Numeric Traffic_Situation_Numeric
## 2                 2                         1
## 3                 2                         1
## 4                 2                         1
## 5                 2                         2
## 7                 2                         1
## 8                 2                         1
We first proposed a model with all interaction terms.
modelA <- lm(Total ~Date * DayofWeek_Numeric * hour * minute * 
    am_pm * Traffic_Situation_Numeric,data = selected_cols)
summary(modelA)
## 
## Call:
## lm(formula = Total ~ Date * DayofWeek_Numeric * hour * minute * 
##     am_pm * Traffic_Situation_Numeric, data = selected_cols)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -71.414 -22.457  -4.413  19.729 125.775 
## 
## Coefficients:
##                                                                      Estimate
## (Intercept)                                                        -1.523e+02
## Date                                                                2.450e+00
## DayofWeek_Numeric                                                   1.773e+01
## hour                                                                1.067e+01
## minute                                                              2.490e+00
## am_pm                                                               1.385e+02
## Traffic_Situation_Numeric                                           6.141e+01
## Date:DayofWeek_Numeric                                             -3.221e-01
## Date:hour                                                          -1.860e-01
## DayofWeek_Numeric:hour                                             -1.337e+00
## Date:minute                                                        -1.604e-01
## DayofWeek_Numeric:minute                                           -7.059e-01
## hour:minute                                                        -2.627e-01
## Date:am_pm                                                         -2.955e+00
## DayofWeek_Numeric:am_pm                                            -1.743e+01
## hour:am_pm                                                         -9.316e+00
## minute:am_pm                                                       -2.133e+00
## Date:Traffic_Situation_Numeric                                     -1.249e+00
## DayofWeek_Numeric:Traffic_Situation_Numeric                        -1.157e+01
## hour:Traffic_Situation_Numeric                                      4.214e-01
## minute:Traffic_Situation_Numeric                                   -5.297e-01
## am_pm:Traffic_Situation_Numeric                                    -2.550e+01
## Date:DayofWeek_Numeric:hour                                         1.922e-02
## Date:DayofWeek_Numeric:minute                                       3.782e-02
## Date:hour:minute                                                    1.345e-02
## DayofWeek_Numeric:hour:minute                                       5.314e-02
## Date:DayofWeek_Numeric:am_pm                                        4.929e-01
## Date:hour:am_pm                                                     2.250e-01
## DayofWeek_Numeric:hour:am_pm                                        1.260e+00
## Date:minute:am_pm                                                   1.762e-01
## DayofWeek_Numeric:minute:am_pm                                      7.420e-01
## hour:minute:am_pm                                                   2.089e-01
## Date:DayofWeek_Numeric:Traffic_Situation_Numeric                    1.766e-01
## Date:hour:Traffic_Situation_Numeric                                 1.083e-01
## DayofWeek_Numeric:hour:Traffic_Situation_Numeric                    9.232e-01
## Date:minute:Traffic_Situation_Numeric                               4.936e-02
## DayofWeek_Numeric:minute:Traffic_Situation_Numeric                  1.954e-01
## hour:minute:Traffic_Situation_Numeric                               5.866e-02
## Date:am_pm:Traffic_Situation_Numeric                                1.422e+00
## DayofWeek_Numeric:am_pm:Traffic_Situation_Numeric                   1.104e+01
## hour:am_pm:Traffic_Situation_Numeric                                1.052e+00
## minute:am_pm:Traffic_Situation_Numeric                              5.611e-01
## Date:DayofWeek_Numeric:hour:minute                                 -2.552e-03
## Date:DayofWeek_Numeric:hour:am_pm                                  -3.401e-02
## Date:DayofWeek_Numeric:minute:am_pm                                -4.257e-02
## Date:hour:minute:am_pm                                             -1.351e-02
## DayofWeek_Numeric:hour:minute:am_pm                                -5.291e-02
## Date:DayofWeek_Numeric:hour:Traffic_Situation_Numeric              -8.847e-03
## Date:DayofWeek_Numeric:minute:Traffic_Situation_Numeric            -1.242e-02
## Date:hour:minute:Traffic_Situation_Numeric                         -3.546e-03
## DayofWeek_Numeric:hour:minute:Traffic_Situation_Numeric            -1.014e-02
## Date:DayofWeek_Numeric:am_pm:Traffic_Situation_Numeric             -2.642e-01
## Date:hour:am_pm:Traffic_Situation_Numeric                          -1.134e-01
## DayofWeek_Numeric:hour:am_pm:Traffic_Situation_Numeric             -7.996e-01
## Date:minute:am_pm:Traffic_Situation_Numeric                        -6.162e-02
## DayofWeek_Numeric:minute:am_pm:Traffic_Situation_Numeric           -2.442e-01
## hour:minute:am_pm:Traffic_Situation_Numeric                        -5.762e-02
## Date:DayofWeek_Numeric:hour:minute:am_pm                            2.832e-03
## Date:DayofWeek_Numeric:hour:minute:Traffic_Situation_Numeric        6.047e-04
## Date:DayofWeek_Numeric:hour:am_pm:Traffic_Situation_Numeric         1.657e-02
## Date:DayofWeek_Numeric:minute:am_pm:Traffic_Situation_Numeric       1.600e-02
## Date:hour:minute:am_pm:Traffic_Situation_Numeric                    4.411e-03
## DayofWeek_Numeric:hour:minute:am_pm:Traffic_Situation_Numeric       1.526e-02
## Date:DayofWeek_Numeric:hour:minute:am_pm:Traffic_Situation_Numeric -9.487e-04
##                                                                    Std. Error
## (Intercept)                                                         1.769e+02
## Date                                                                8.621e+00
## DayofWeek_Numeric                                                   3.881e+01
## hour                                                                1.802e+01
## minute                                                              6.189e+00
## am_pm                                                               1.514e+02
## Traffic_Situation_Numeric                                           8.327e+01
## Date:DayofWeek_Numeric                                              1.917e+00
## Date:hour                                                           8.680e-01
## DayofWeek_Numeric:hour                                              3.970e+00
## Date:minute                                                         3.135e-01
## DayofWeek_Numeric:minute                                            1.369e+00
## hour:minute                                                         6.416e-01
## Date:am_pm                                                          7.399e+00
## DayofWeek_Numeric:am_pm                                             3.304e+01
## hour:am_pm                                                          1.138e+01
## minute:am_pm                                                        5.288e+00
## Date:Traffic_Situation_Numeric                                      4.105e+00
## DayofWeek_Numeric:Traffic_Situation_Numeric                         1.808e+01
## hour:Traffic_Situation_Numeric                                      8.308e+00
## minute:Traffic_Situation_Numeric                                    2.896e+00
## am_pm:Traffic_Situation_Numeric                                     6.961e+01
## Date:DayofWeek_Numeric:hour                                         1.925e-01
## Date:DayofWeek_Numeric:minute                                       6.916e-02
## Date:hour:minute                                                    3.242e-02
## DayofWeek_Numeric:hour:minute                                       1.445e-01
## Date:DayofWeek_Numeric:am_pm                                        1.638e+00
## Date:hour:am_pm                                                     5.485e-01
## DayofWeek_Numeric:hour:am_pm                                        2.497e+00
## Date:minute:am_pm                                                   2.672e-01
## DayofWeek_Numeric:minute:am_pm                                      1.148e+00
## hour:minute:am_pm                                                   4.044e-01
## Date:DayofWeek_Numeric:Traffic_Situation_Numeric                    9.000e-01
## Date:hour:Traffic_Situation_Numeric                                 4.068e-01
## DayofWeek_Numeric:hour:Traffic_Situation_Numeric                    1.825e+00
## Date:minute:Traffic_Situation_Numeric                               1.474e-01
## DayofWeek_Numeric:minute:Traffic_Situation_Numeric                  6.366e-01
## hour:minute:Traffic_Situation_Numeric                               3.008e-01
## Date:am_pm:Traffic_Situation_Numeric                                3.432e+00
## DayofWeek_Numeric:am_pm:Traffic_Situation_Numeric                   1.490e+01
## hour:am_pm:Traffic_Situation_Numeric                                5.245e+00
## minute:am_pm:Traffic_Situation_Numeric                              2.395e+00
## Date:DayofWeek_Numeric:hour:minute                                  7.225e-03
## Date:DayofWeek_Numeric:hour:am_pm                                   1.216e-01
## Date:DayofWeek_Numeric:minute:am_pm                                 5.810e-02
## Date:hour:minute:am_pm                                              2.033e-02
## DayofWeek_Numeric:hour:minute:am_pm                                 8.957e-02
## Date:DayofWeek_Numeric:hour:Traffic_Situation_Numeric               8.999e-02
## Date:DayofWeek_Numeric:minute:Traffic_Situation_Numeric             3.223e-02
## Date:hour:minute:Traffic_Situation_Numeric                          1.538e-02
## DayofWeek_Numeric:hour:minute:Traffic_Situation_Numeric             6.744e-02
## Date:DayofWeek_Numeric:am_pm:Traffic_Situation_Numeric              7.432e-01
## Date:hour:am_pm:Traffic_Situation_Numeric                           2.564e-01
## DayofWeek_Numeric:hour:am_pm:Traffic_Situation_Numeric              1.140e+00
## Date:minute:am_pm:Traffic_Situation_Numeric                         1.211e-01
## DayofWeek_Numeric:minute:am_pm:Traffic_Situation_Numeric            5.123e-01
## hour:minute:am_pm:Traffic_Situation_Numeric                         1.872e-01
## Date:DayofWeek_Numeric:hour:minute:am_pm                            4.480e-03
## Date:DayofWeek_Numeric:hour:minute:Traffic_Situation_Numeric        3.404e-03
## Date:DayofWeek_Numeric:hour:am_pm:Traffic_Situation_Numeric         5.626e-02
## Date:DayofWeek_Numeric:minute:am_pm:Traffic_Situation_Numeric       2.592e-02
## Date:hour:minute:am_pm:Traffic_Situation_Numeric                    9.480e-03
## DayofWeek_Numeric:hour:minute:am_pm:Traffic_Situation_Numeric       4.113e-02
## Date:DayofWeek_Numeric:hour:minute:am_pm:Traffic_Situation_Numeric  2.069e-03
##                                                                    t value
## (Intercept)                                                         -0.861
## Date                                                                 0.284
## DayofWeek_Numeric                                                    0.457
## hour                                                                 0.592
## minute                                                               0.402
## am_pm                                                                0.915
## Traffic_Situation_Numeric                                            0.737
## Date:DayofWeek_Numeric                                              -0.168
## Date:hour                                                           -0.214
## DayofWeek_Numeric:hour                                              -0.337
## Date:minute                                                         -0.512
## DayofWeek_Numeric:minute                                            -0.516
## hour:minute                                                         -0.409
## Date:am_pm                                                          -0.399
## DayofWeek_Numeric:am_pm                                             -0.527
## hour:am_pm                                                          -0.818
## minute:am_pm                                                        -0.403
## Date:Traffic_Situation_Numeric                                      -0.304
## DayofWeek_Numeric:Traffic_Situation_Numeric                         -0.640
## hour:Traffic_Situation_Numeric                                       0.051
## minute:Traffic_Situation_Numeric                                    -0.183
## am_pm:Traffic_Situation_Numeric                                     -0.366
## Date:DayofWeek_Numeric:hour                                          0.100
## Date:DayofWeek_Numeric:minute                                        0.547
## Date:hour:minute                                                     0.415
## DayofWeek_Numeric:hour:minute                                        0.368
## Date:DayofWeek_Numeric:am_pm                                         0.301
## Date:hour:am_pm                                                      0.410
## DayofWeek_Numeric:hour:am_pm                                         0.505
## Date:minute:am_pm                                                    0.659
## DayofWeek_Numeric:minute:am_pm                                       0.647
## hour:minute:am_pm                                                    0.516
## Date:DayofWeek_Numeric:Traffic_Situation_Numeric                     0.196
## Date:hour:Traffic_Situation_Numeric                                  0.266
## DayofWeek_Numeric:hour:Traffic_Situation_Numeric                     0.506
## Date:minute:Traffic_Situation_Numeric                                0.335
## DayofWeek_Numeric:minute:Traffic_Situation_Numeric                   0.307
## hour:minute:Traffic_Situation_Numeric                                0.195
## Date:am_pm:Traffic_Situation_Numeric                                 0.414
## DayofWeek_Numeric:am_pm:Traffic_Situation_Numeric                    0.741
## hour:am_pm:Traffic_Situation_Numeric                                 0.201
## minute:am_pm:Traffic_Situation_Numeric                               0.234
## Date:DayofWeek_Numeric:hour:minute                                  -0.353
## Date:DayofWeek_Numeric:hour:am_pm                                   -0.280
## Date:DayofWeek_Numeric:minute:am_pm                                 -0.733
## Date:hour:minute:am_pm                                              -0.664
## DayofWeek_Numeric:hour:minute:am_pm                                 -0.591
## Date:DayofWeek_Numeric:hour:Traffic_Situation_Numeric               -0.098
## Date:DayofWeek_Numeric:minute:Traffic_Situation_Numeric             -0.385
## Date:hour:minute:Traffic_Situation_Numeric                          -0.231
## DayofWeek_Numeric:hour:minute:Traffic_Situation_Numeric             -0.150
## Date:DayofWeek_Numeric:am_pm:Traffic_Situation_Numeric              -0.356
## Date:hour:am_pm:Traffic_Situation_Numeric                           -0.442
## DayofWeek_Numeric:hour:am_pm:Traffic_Situation_Numeric              -0.701
## Date:minute:am_pm:Traffic_Situation_Numeric                         -0.509
## DayofWeek_Numeric:minute:am_pm:Traffic_Situation_Numeric            -0.477
## hour:minute:am_pm:Traffic_Situation_Numeric                         -0.308
## Date:DayofWeek_Numeric:hour:minute:am_pm                             0.632
## Date:DayofWeek_Numeric:hour:minute:Traffic_Situation_Numeric         0.178
## Date:DayofWeek_Numeric:hour:am_pm:Traffic_Situation_Numeric          0.295
## Date:DayofWeek_Numeric:minute:am_pm:Traffic_Situation_Numeric        0.617
## Date:hour:minute:am_pm:Traffic_Situation_Numeric                     0.465
## DayofWeek_Numeric:hour:minute:am_pm:Traffic_Situation_Numeric        0.371
## Date:DayofWeek_Numeric:hour:minute:am_pm:Traffic_Situation_Numeric  -0.459
##                                                                    Pr(>|t|)
## (Intercept)                                                           0.389
## Date                                                                  0.776
## DayofWeek_Numeric                                                     0.648
## hour                                                                  0.554
## minute                                                                0.687
## am_pm                                                                 0.360
## Traffic_Situation_Numeric                                             0.461
## Date:DayofWeek_Numeric                                                0.867
## Date:hour                                                             0.830
## DayofWeek_Numeric:hour                                                0.736
## Date:minute                                                           0.609
## DayofWeek_Numeric:minute                                              0.606
## hour:minute                                                           0.682
## Date:am_pm                                                            0.690
## DayofWeek_Numeric:am_pm                                               0.598
## hour:am_pm                                                            0.413
## minute:am_pm                                                          0.687
## Date:Traffic_Situation_Numeric                                        0.761
## DayofWeek_Numeric:Traffic_Situation_Numeric                           0.522
## hour:Traffic_Situation_Numeric                                        0.960
## minute:Traffic_Situation_Numeric                                      0.855
## am_pm:Traffic_Situation_Numeric                                       0.714
## Date:DayofWeek_Numeric:hour                                           0.920
## Date:DayofWeek_Numeric:minute                                         0.585
## Date:hour:minute                                                      0.678
## DayofWeek_Numeric:hour:minute                                         0.713
## Date:DayofWeek_Numeric:am_pm                                          0.764
## Date:hour:am_pm                                                       0.682
## DayofWeek_Numeric:hour:am_pm                                          0.614
## Date:minute:am_pm                                                     0.510
## DayofWeek_Numeric:minute:am_pm                                        0.518
## hour:minute:am_pm                                                     0.606
## Date:DayofWeek_Numeric:Traffic_Situation_Numeric                      0.844
## Date:hour:Traffic_Situation_Numeric                                   0.790
## DayofWeek_Numeric:hour:Traffic_Situation_Numeric                      0.613
## Date:minute:Traffic_Situation_Numeric                                 0.738
## DayofWeek_Numeric:minute:Traffic_Situation_Numeric                    0.759
## hour:minute:Traffic_Situation_Numeric                                 0.845
## Date:am_pm:Traffic_Situation_Numeric                                  0.679
## DayofWeek_Numeric:am_pm:Traffic_Situation_Numeric                     0.459
## hour:am_pm:Traffic_Situation_Numeric                                  0.841
## minute:am_pm:Traffic_Situation_Numeric                                0.815
## Date:DayofWeek_Numeric:hour:minute                                    0.724
## Date:DayofWeek_Numeric:hour:am_pm                                     0.780
## Date:DayofWeek_Numeric:minute:am_pm                                   0.464
## Date:hour:minute:am_pm                                                0.506
## DayofWeek_Numeric:hour:minute:am_pm                                   0.555
## Date:DayofWeek_Numeric:hour:Traffic_Situation_Numeric                 0.922
## Date:DayofWeek_Numeric:minute:Traffic_Situation_Numeric               0.700
## Date:hour:minute:Traffic_Situation_Numeric                            0.818
## DayofWeek_Numeric:hour:minute:Traffic_Situation_Numeric               0.880
## Date:DayofWeek_Numeric:am_pm:Traffic_Situation_Numeric                0.722
## Date:hour:am_pm:Traffic_Situation_Numeric                             0.658
## DayofWeek_Numeric:hour:am_pm:Traffic_Situation_Numeric                0.483
## Date:minute:am_pm:Traffic_Situation_Numeric                           0.611
## DayofWeek_Numeric:minute:am_pm:Traffic_Situation_Numeric              0.634
## hour:minute:am_pm:Traffic_Situation_Numeric                           0.758
## Date:DayofWeek_Numeric:hour:minute:am_pm                              0.527
## Date:DayofWeek_Numeric:hour:minute:Traffic_Situation_Numeric          0.859
## Date:DayofWeek_Numeric:hour:am_pm:Traffic_Situation_Numeric           0.768
## Date:DayofWeek_Numeric:minute:am_pm:Traffic_Situation_Numeric         0.537
## Date:hour:minute:am_pm:Traffic_Situation_Numeric                      0.642
## DayofWeek_Numeric:hour:minute:am_pm:Traffic_Situation_Numeric         0.711
## Date:DayofWeek_Numeric:hour:minute:am_pm:Traffic_Situation_Numeric    0.647
## 
## Residual standard error: 30.81 on 2912 degrees of freedom
## Multiple R-squared:  0.7436, Adjusted R-squared:  0.738 
## F-statistic:   134 on 63 and 2912 DF,  p-value: < 2.2e-16
The F-test indicates very strong evidence (p<2.2*10^-16) that the model fits better than the null model. However, we noticed that no coefficient is statistically significant in the model. Thus, we will only retain the 5-way interaction terms with lowest p-value and re-fit the model.
modelB <- lm(Total ~Date * DayofWeek_Numeric * hour * minute * 
     am_pm + Traffic_Situation_Numeric,data = selected_cols)
summary(modelB)
## 
## Call:
## lm(formula = Total ~ Date * DayofWeek_Numeric * hour * minute * 
##     am_pm + Traffic_Situation_Numeric, data = selected_cols)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.330 -23.388  -3.542  20.039 125.363 
## 
## Coefficients:
##                                            Estimate Std. Error t value Pr(>|t|)
## (Intercept)                              -1.232e+02  4.206e+01  -2.929  0.00342
## Date                                      1.762e-01  2.147e+00   0.082  0.93462
## DayofWeek_Numeric                        -6.703e+00  9.780e+00  -0.685  0.49316
## hour                                      1.185e+01  4.264e+00   2.779  0.00549
## minute                                    1.841e+00  1.499e+00   1.228  0.21937
## am_pm                                     8.938e+01  3.618e+01   2.471  0.01354
## Traffic_Situation_Numeric                 4.590e+01  6.277e-01  73.129  < 2e-16
## Date:DayofWeek_Numeric                    4.399e-02  4.961e-01   0.089  0.92934
## Date:hour                                 4.904e-02  2.176e-01   0.225  0.82169
## DayofWeek_Numeric:hour                    7.227e-01  9.910e-01   0.729  0.46587
## Date:minute                              -6.862e-02  7.653e-02  -0.897  0.36994
## DayofWeek_Numeric:minute                 -2.919e-01  3.485e-01  -0.838  0.40234
## hour:minute                              -1.734e-01  1.518e-01  -1.142  0.25356
## Date:am_pm                               -3.060e-01  1.847e+00  -0.166  0.86846
## DayofWeek_Numeric:am_pm                   5.962e+00  8.414e+00   0.709  0.47869
## hour:am_pm                               -7.333e+00  2.696e+00  -2.720  0.00657
## minute:am_pm                             -1.354e+00  1.289e+00  -1.050  0.29386
## Date:DayofWeek_Numeric:hour              -6.672e-03  5.027e-02  -0.133  0.89442
## Date:DayofWeek_Numeric:minute             1.068e-02  1.768e-02   0.604  0.54561
## Date:hour:minute                          5.942e-03  7.754e-03   0.766  0.44361
## DayofWeek_Numeric:hour:minute             3.115e-02  3.531e-02   0.882  0.37784
## Date:DayofWeek_Numeric:am_pm             -5.214e-02  4.268e-01  -0.122  0.90276
## Date:hour:am_pm                          -1.178e-02  1.376e-01  -0.086  0.93180
## DayofWeek_Numeric:hour:am_pm             -4.946e-01  6.268e-01  -0.789  0.43009
## Date:minute:am_pm                         5.735e-02  6.584e-02   0.871  0.38378
## DayofWeek_Numeric:minute:am_pm            2.254e-01  2.998e-01   0.752  0.45219
## hour:minute:am_pm                         1.096e-01  9.604e-02   1.142  0.25370
## Date:DayofWeek_Numeric:hour:minute       -1.057e-03  1.791e-03  -0.590  0.55532
## Date:DayofWeek_Numeric:hour:am_pm         4.553e-03  3.179e-02   0.143  0.88613
## Date:DayofWeek_Numeric:minute:am_pm      -8.063e-03  1.521e-02  -0.530  0.59603
## Date:hour:minute:am_pm                   -4.299e-03  4.905e-03  -0.876  0.38084
## DayofWeek_Numeric:hour:minute:am_pm      -1.992e-02  2.233e-02  -0.892  0.37261
## Date:DayofWeek_Numeric:hour:minute:am_pm  6.841e-04  1.133e-03   0.604  0.54599
##                                             
## (Intercept)                              ** 
## Date                                        
## DayofWeek_Numeric                           
## hour                                     ** 
## minute                                      
## am_pm                                    *  
## Traffic_Situation_Numeric                ***
## Date:DayofWeek_Numeric                      
## Date:hour                                   
## DayofWeek_Numeric:hour                      
## Date:minute                                 
## DayofWeek_Numeric:minute                    
## hour:minute                                 
## Date:am_pm                                  
## DayofWeek_Numeric:am_pm                     
## hour:am_pm                               ** 
## minute:am_pm                                
## Date:DayofWeek_Numeric:hour                 
## Date:DayofWeek_Numeric:minute               
## Date:hour:minute                            
## DayofWeek_Numeric:hour:minute               
## Date:DayofWeek_Numeric:am_pm                
## Date:hour:am_pm                             
## DayofWeek_Numeric:hour:am_pm                
## Date:minute:am_pm                           
## DayofWeek_Numeric:minute:am_pm              
## hour:minute:am_pm                           
## Date:DayofWeek_Numeric:hour:minute          
## Date:DayofWeek_Numeric:hour:am_pm           
## Date:DayofWeek_Numeric:minute:am_pm         
## Date:hour:minute:am_pm                      
## DayofWeek_Numeric:hour:minute:am_pm         
## Date:DayofWeek_Numeric:hour:minute:am_pm    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.93 on 2943 degrees of freedom
## Multiple R-squared:  0.7389, Adjusted R-squared:  0.736 
## F-statistic: 260.2 on 32 and 2943 DF,  p-value: < 2.2e-16
ModelB shows some significant coefficients,yet no interaction terms appear to have significant p-value. Hence, we will retain the 4-way interaction term with the lowest p-value and re-fit the model.
modelC <- lm(Total ~DayofWeek_Numeric * hour * minute * am_pm
             + Date + Traffic_Situation_Numeric,data = selected_cols)
summary(modelC)
## 
## Call:
## lm(formula = Total ~ DayofWeek_Numeric * hour * minute * am_pm + 
##     Date + Traffic_Situation_Numeric, data = selected_cols)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.784 -23.576  -4.029  20.053 125.602 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         -1.200e+02  1.926e+01  -6.229 5.36e-10 ***
## DayofWeek_Numeric                   -5.964e+00  4.412e+00  -1.352    0.177    
## hour                                 1.269e+01  1.957e+00   6.487 1.02e-10 ***
## minute                               6.655e-01  6.852e-01   0.971    0.332    
## am_pm                                8.471e+01  1.656e+01   5.117 3.31e-07 ***
## Date                                -3.418e-02  6.328e-02  -0.540    0.589    
## Traffic_Situation_Numeric            4.587e+01  6.262e-01  73.262  < 2e-16 ***
## DayofWeek_Numeric:hour               6.037e-01  4.471e-01   1.350    0.177    
## DayofWeek_Numeric:minute            -1.019e-01  1.572e-01  -0.648    0.517    
## hour:minute                         -7.100e-02  6.945e-02  -1.022    0.307    
## DayofWeek_Numeric:am_pm              5.088e+00  3.796e+00   1.340    0.180    
## hour:am_pm                          -7.554e+00  1.236e+00  -6.111 1.12e-09 ***
## minute:am_pm                        -3.757e-01  5.895e-01  -0.637    0.524    
## DayofWeek_Numeric:hour:minute        1.243e-02  1.593e-02   0.780    0.435    
## DayofWeek_Numeric:hour:am_pm        -4.152e-01  2.828e-01  -1.468    0.142    
## DayofWeek_Numeric:minute:am_pm       8.158e-02  1.352e-01   0.603    0.546    
## hour:minute:am_pm                    3.596e-02  4.392e-02   0.819    0.413    
## DayofWeek_Numeric:hour:minute:am_pm -7.762e-03  1.008e-02  -0.770    0.441    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.87 on 2958 degrees of freedom
## Multiple R-squared:  0.7385, Adjusted R-squared:  0.737 
## F-statistic: 491.3 on 17 and 2958 DF,  p-value: < 2.2e-16
ModelC appears to have no significant interaction terms, we will retain the 3-way interaction term with the lowest p-value and re-fit the model.
modelD <- lm(Total ~DayofWeek_Numeric * hour * am_pm
             + minute + Date + Traffic_Situation_Numeric,data = selected_cols)
summary(modelD)
## 
## Call:
## lm(formula = Total ~ DayofWeek_Numeric * hour * am_pm + minute + 
##     Date + Traffic_Situation_Numeric, data = selected_cols)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -72.405 -23.725  -4.058  20.084 123.557 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -105.87194   11.57309  -9.148  < 2e-16 ***
## DayofWeek_Numeric              -8.25516    2.63826  -3.129 0.001771 ** 
## hour                           11.09512    1.17446   9.447  < 2e-16 ***
## am_pm                          76.24396    9.91906   7.687 2.04e-14 ***
## minute                          0.03801    0.03376   1.126 0.260331    
## Date                           -0.03417    0.06329  -0.540 0.589286    
## Traffic_Situation_Numeric      45.87978    0.62564  73.333  < 2e-16 ***
## DayofWeek_Numeric:hour          0.88331    0.26728   3.305 0.000961 ***
## DayofWeek_Numeric:am_pm         6.92323    2.26978   3.050 0.002307 ** 
## hour:am_pm                     -6.74449    0.74164  -9.094  < 2e-16 ***
## DayofWeek_Numeric:hour:am_pm   -0.58975    0.16907  -3.488 0.000493 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.88 on 2965 degrees of freedom
## Multiple R-squared:  0.7377, Adjusted R-squared:  0.7368 
## F-statistic: 833.9 on 10 and 2965 DF,  p-value: < 2.2e-16
ModelD apppears to have valuable interaction terms. Looking at the coefficients, minute and Date are not significant, hence we will remove them and re-fit the model.
modelE <- lm(Total ~DayofWeek_Numeric * hour * am_pm
            + Traffic_Situation_Numeric,data = selected_cols)
summary(modelE)
## 
## Call:
## lm(formula = Total ~ DayofWeek_Numeric * hour * am_pm + Traffic_Situation_Numeric, 
##     data = selected_cols)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -71.557 -23.517  -4.038  20.169 123.895 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  -105.5631    11.5001  -9.179  < 2e-16 ***
## DayofWeek_Numeric              -8.2548     2.6381  -3.129 0.001770 ** 
## hour                           11.0991     1.1744   9.451  < 2e-16 ***
## am_pm                          76.2645     9.9183   7.689    2e-14 ***
## Traffic_Situation_Numeric      45.8633     0.6253  73.342  < 2e-16 ***
## DayofWeek_Numeric:hour          0.8834     0.2673   3.306 0.000959 ***
## DayofWeek_Numeric:am_pm         6.9252     2.2696   3.051 0.002299 ** 
## hour:am_pm                     -6.7467     0.7416  -9.098  < 2e-16 ***
## DayofWeek_Numeric:hour:am_pm   -0.5899     0.1691  -3.489 0.000492 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.88 on 2967 degrees of freedom
## Multiple R-squared:  0.7376, Adjusted R-squared:  0.7369 
## F-statistic:  1042 on 8 and 2967 DF,  p-value: < 2.2e-16
As a result, all coefficients and interaction terms are significant except for DayofWeek_Numeric in modelE. However, we will keep DayofWeek_Numeric in the model as it has good interaction with other coefficients. The F-test indicates very strong evidence (p<2.2*10^-16) that the model fits better than the null model. R-squared decreased by 0.006 from 0.7436 (ModelA) to 0.7376, indicating that our decisions to remove the terms are reasonable. Thus, Model E is th preferred model.
Creating a dataframe for model comparison
model_names <- c("Model A", "Model B", "Model C", "Model D", "Model E")
multiple_r_squared <- c(0.7436, 0.7389, 0.7385, 0.7377, 0.7376)
adjusted_r_squared <- c(0.7416, 0.736, 0.737, 0.7368, 0.7369)
f_statistic <- c(274.2, 260.2, 491.3, 833.9, 1042)
residual_standard_error <- c(30.91, 30.93, 30.87, 30.88, 30.88)

model_comparison <- data.frame(Model = model_names,
                               Multiple_R_squared = multiple_r_squared,
                               Adjusted_R_squared = adjusted_r_squared,
                               F_statistic = f_statistic,
                               Residual_Standard_Error = residual_standard_error)
# Displaying the model comparison dataframe
print(model_comparison)
##     Model Multiple_R_squared Adjusted_R_squared F_statistic
## 1 Model A             0.7436             0.7416       274.2
## 2 Model B             0.7389             0.7360       260.2
## 3 Model C             0.7385             0.7370       491.3
## 4 Model D             0.7377             0.7368       833.9
## 5 Model E             0.7376             0.7369      1042.0
##   Residual_Standard_Error
## 1                   30.91
## 2                   30.93
## 3                   30.87
## 4                   30.88
## 5                   30.88

Model Evaluation

Evaluate the performance of the models by using MAE & Rsquared
predictions <- predict(modelE, newdata = test_data)
mae <- mean(abs(predictions - test_data$Total))
rsquared <- 1 - sum((predictions - test_data$Total)^2) / sum((mean(test_data$Total) - test_data$Total)^2)


print(paste("Mean Absolute Error (MAE):", mae))
## [1] "Mean Absolute Error (MAE): 25.4947560591296"
print(paste("R-squared:", rsquared))
## [1] "R-squared: 0.727816654081597"
plot(modelE)

LR Model Interpretation

Model E exhibits strong performance, with an R-squared value of 0.7376, indicating that it explains a significant portion of the variance in the target variable. Model E is slightly lower in R-squared but has a similar adjusted R-squared, suggesting it may be a more parsimonious model. Model E has the highest F-statistic, indicating a significant improvement over the null model. Models A, D and E have similar and lower residual standard errors, suggesting better precision. In summary, model E is preferred as it balances good fit (R-squared), simplicity (adjusted R-squared) and predictive power (F-statistic). The small decrease in R-squared compared to Model A is justifiable due to the more parsimonious nature of Model E. However, the differences between models are relatively small and further refinement such as feature engineering, exploring additional variables or trying different algorithms may lead to improved results.

Conclusion

EDA Conclusion:

• The traffic data suggests a consistent monthly pattern, indicating a steady flow of vehicles throughout the month.

• Weekdays exhibit higher vehicle volumes compared to weekends, suggesting potential variations in commuter behavior or activities.

• The heavy traffic during working hours (0600 to 1800) highlights the impact of regular working schedules on traffic patterns.

Modeling Conclusion:

• Random forest is successfully developed, which can be used to predict and classify traffic situations into distinct levels (low, medium, high, heavy).

• Based on the confusion matrix for both train data model and downsampling data model, both models show strong predictive capabilities for classifying traffic situations. The train data model achieving perfect accuracy and the downsampling data model maintaining high accuracy while addressing class imbalances. These models can serve as valuable tools for predicting and managing traffic situations, but further validation and testing on diverse datasets are recommended for comprehensive assessment and deployment.

• Besides, linear regression is developed, which can be used to accurately forecast the number of vehicles on the roads at specific times.

• In overall, Model E stands as a robust tool for forecasting traffic, offering valuable insights into temporal and situational factors influencing vehicular flow. It proves highly effective in predicting road traffic and showcasing strong explanatory power with an R-squared of 0.7376. Key features like day, hour, and traffic situation are critical and their interactions significantly contribute to accurate predictions. Compared to other models, Model E demonstrates competitive performance. While effective on the evaluated data, ongoing validation and adaptation to diverse datasets are crucial for real-world applicability.