HW_04 by Swastika Barua

Models to classify Pre_Crash_Mode without crash narrative column

# Load necessary libraries
library(caret)
library(dplyr)
library(nnet)    # For multinomial logistic regression (multinom)
library(e1071)   # For SVM
library(randomForest)  # For Random Forest

options(repos = c(CRAN = "https://cran.rstudio.com"))

# Install packages only if not installed
if (!requireNamespace("DALEX", quietly = TRUE)) install.packages("DALEX")
if (!requireNamespace("iBreakDown", quietly = TRUE)) install.packages("iBreakDown")
if (!requireNamespace("lime", quietly = TRUE)) install.packages("lime")

# Load data
data <- read.csv("C:/Users/swast/OneDrive - Texas State University/02_TXST/Fall 2024/AI/HW 04/CA_AV_CrashReportsSample_SB.csv")

head(data)

##   Year      Date        Time Veh1_Year Veh1_Register_State
## 1 2022  1/1/2022  2:04:00 AM      2021                  CA
## 2 2022  1/3/2022  9:00:00 AM      2021                  CA
## 3 2022  1/4/2022 11:30:00 AM      2016                  CA
## 4 2022  1/6/2022  9:25:00 AM      2021                  CA
## 5 2022 1/11/2022  8:55:00 AM      2016                  CA
## 6 2022 1/18/2022  5:20:00 PM      2021                  CA
##                                        Location_Street Location_City
## 1                        Marin Street at Kansas Street San Francisco
## 2                           100 Block of Bayshore Blvd San Francisco
## 3 Montgomery Street after the Clay Street Intersection San Francisco
## 4                Divisadero Street and Geary Boulevard San Francisco
## 5                                  13th and Bryant St. San Francisco
## 6                      14 Street and S Van Ness Avenue San Francisco
##   Location_County Location_State Location_Zip Latitude Longitude
## 1   San Francisco             CA        94124 37.74851 -122.4008
## 2   San Francisco             CA        94124 37.73459 -122.4056
## 3   San Francisco             CA        94111 37.79491 -122.4032
## 4   San Francisco             CA        94115 37.78356 -122.4389
## 5              CA             CA        94103 37.76949 -122.4100
## 6   San Francisco             CA        94103 37.76864 -122.4171
##          Vehicle_Was No_Vehicle_Involved Describe_Vehicle_Damage
## 1             Moving                   2                Moderate
## 2             Moving                   2                   Minor
## 3 Stopped in Traffic                   2                    None
## 4             Moving                   2                Moderate
## 5 Stopped in Traffic                   2                Moderate
## 6             Moving                   2                Moderate
##   Otherparty_Vehicle_State Vehicle_was_OtherParty Pre_Crash_Mode Weather
## 1                       CA                 Moving   Conventional   Clear
## 2                       CA                 Moving     Autonomous   Clear
## 3                       CA                 Moving   Conventional  Cloudy
## 4                       CA                 Moving   Conventional   Clear
## 5                       CA                 Moving     Autonomous   Clear
## 6                       CA                 Moving   Conventional   Clear
##               Lighting Roadway_Surface    Roadway.Conditions
## 1 Dark - Street Lights             Dry No Unusual Conditions
## 2             Daylight             Dry No Unusual Conditions
## 3             Daylight             Wet No Unusual Conditions
## 4             Daylight             Dry No Unusual Conditions
## 5             Daylight             Dry No Unusual Conditions
## 6             Daylight             Dry No Unusual Conditions
##     Movement_Bef_Veh1                         Movement_Bef_Veh2 Collision_Type
## 1 Proceeding Straight Making Right Turn/Xing into opposing lane           Side
## 2    Slowing/Stopping                       Proceeding Straight       Rear End
## 3             Stopped                                   Backing        Head-on
## 4 Proceeding Straight                            Changing Lanes     Side Swipe
## 5             Stopped                       Proceeding Straight       Rear End
## 6 Proceeding Straight                       Proceeding Straight      Broadside
##   Collision_Type1
## 1      Side Swipe
## 2        Rear End
## 3         Head-On
## 4      Side Swipe
## 5        Rear End
## 6       Broadside
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     CrashNarrative
## 1 On January 1, 2021 at 2:04 AM PST a Waymo Autonomous Vehicle (?Waymo AV?) operating in San Francisco, California was in a collision involving a passenger vehicle at Marin Street and Kansas Street. While proceeding straight in autonomous mode on westbound Marin Street, a passenger vehicle turned right onto Marin Street from Kansas Street, crossing the center lane line and traveling in the oncoming traffic lane. The driver of the Waymo AV transitioned the system into manual mode and slowed, and moved to the right most edge of the lane. The oncoming passenger vehicle struck the left side of the Waymo AV and immediately proceeded to leave the scene of the collision. At the time of the impact, the Waymo AV?s Level 4 ADS was not engaged and a test driver was operating the Waymo AV in manual mode. The Waymo AV sustained moderate damage to the left front panel and both doors on the left side of the vehicle.
## 2                                                                                                                                                                                                                On January 3, 2021 at 9:00 AM PST a Waymo Autonomous Vehicle (?Waymo AV?) operating in San Francisco, California was in a collision involving a heavy truck at the northbound merge lane on the 100 block of Bayshore Boulevard. At the time of the impact, the Waymo AV?s Level 4 ADS was engaged in autonomous mode, and a test driver was present (in the driver?s seating position). While attempting a merge onto northbound Bayshore Boulevard from southbound Bayshore Boulevard, the Waymo AV came to a stop to yield for approaching traffic on the right. The Waymo AV was rear ended by the truck, causing minor damage to the right rear bumper of the Waymo AV. The truck sustained damage to the left front bumper.
## 3                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                A Zoox vehicle in manual mode came to a stop behind a passenger vehicle that was attempting to parallel park on Montgomery Street. While reversing, the passenger vehicle made contact with the front vehicle sensors of the Zoox vehicle. There were no injuries and the police were not called.
## 4                                                                                                                                                      On January 6, 2022 at 9:25 AM PST a Waymo Autonomous Vehicle (?Waymo AV?) operating in San Francisco, California was in a collision involving a passenger vehicle at Divisadero Street and Geary Boulevard.While the Waymo AV was proceeding straight in manual mode in the right lane on northbound Divisadero Street, a passenger vehicle in the left lane attempted an abrupt lane change at the entrance to the intersection at Geary Boulevard as the Waymo AV was approaching the intersection. The front left fender of the Waymo AV made contact with the right front panel of the passenger vehicle. At the time of the impact, the Waymo AV?s Level 4 ADS was not engaged and a test driver was operating the Waymo AV in manual mode. Both vehicles sustained moderate damage.\n
## 5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 A stopped Zoox vehicle in autonomous mode was struck by a passenger vehicle that approached it at low speed from behind. The Zoox vehicle incurred damage to its sensors and rear bumper. There were no injuries and the police were not called.
## 6          On January 18, 2022 at 5:20 PM PST a Waymo Autonomous Vehicle (?Waymo AV?) operating in San Francisco, CA was in a collision involving a passenger vehicle on 14th Street at S. Van Ness Avenue.The Waymo AV was in autonomous mode on eastbound 14th Street as it approached S. Van Ness Avenue. The test driver (in the driver?s seating position) transitioned to manual mode before reaching the intersection. The Waymo AV then proceeded through a green light into the intersection in manual mode. A passenger vehicle traveling northbound on S. Van Ness Avenue then entered the intersection against its red light, making contact with the front of the Waymo AV. At the time of the impact, the Waymo AV?s Level 4 ADS was not engaged and a test driver was operating the Waymo AV in manual mode. The Waymo AV sustained damage to its front fascia and the passenger vehicle sustained damage to its driver side doors.

# Remove crash narratives
data <- data %>%
  select(-CrashNarrative)

# Convert all categorical features to factors
data <- data %>%
  mutate(across(where(is.character), as.factor))

# Convert target variable to factor
data$Pre_Crash_Mode <- as.factor(data$Pre_Crash_Mode)

# Train-test split
set.seed(42)
trainIndex <- createDataPartition(data$Pre_Crash_Mode, p = 0.8, list = FALSE)
train_data <- data[trainIndex, ]
test_data <- data[-trainIndex, ]
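
createDataPartition() samples within each class of the outcome, so the training and test sets keep roughly the same class proportions. The idea can be sketched in base R as follows. The label vector is hypothetical, and the rounding here is a simplification that may differ from caret's internals.

```r
# Stratified 80/20 split by hand (illustration only; labels are made up)
set.seed(42)
labels <- factor(c(rep("Autonomous", 60), rep("Conventional", 40)))

# Sample 80% of the row indices separately inside each class
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

# Class proportions are preserved in both parts of the split
prop.table(table(labels[train_idx]))
prop.table(table(labels[-train_idx]))
```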

# Load necessary libraries for visualization
library(ggplot2)

# Visualize class distribution in the training set
ggplot(train_data, aes(x = Pre_Crash_Mode)) +
  geom_bar(fill = "#abbda4") +
  labs(title = "Class Distribution in Training Set", x = "Pre_Crash_Mode", y = "Count") +
  theme_minimal()

# Visualize class distribution in the test set
ggplot(test_data, aes(x = Pre_Crash_Mode)) +
  geom_bar(fill = "#b39998") +
  labs(title = "Class Distribution in Test Set", x = "Pre_Crash_Mode", y = "Count") +
  theme_minimal()

# Compute the correlation matrix for numeric variables in the training set
numeric_vars <- train_data %>% select_if(is.numeric)
cor_matrix <- cor(numeric_vars, use = "complete.obs")

# Plot the correlation matrix
library(reshape2)
melted_cor_matrix <- melt(cor_matrix)

ggplot(data = melted_cor_matrix, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "#d4bb96", high = "#827451", mid = "white", midpoint = 0) +
  labs(title = "Feature Correlation Matrix") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Define cross-validation control
control <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation

Decision Tree

# Model 1: Decision tree
# Remove rows with any missing values
train_data <- na.omit(train_data)
test_data <- na.omit(test_data)

# Train a decision tree (rpart) with 5-fold cross-validation on the cleaned data
model_tree <- train(Pre_Crash_Mode ~ ., data = train_data, method = "rpart", trControl = control)
tree_pred <- predict(model_tree, test_data)
tree_report <- confusionMatrix(tree_pred, test_data$Pre_Crash_Mode)

# Print overall accuracy on the held-out test set
cat("Accuracy:", tree_report$overall['Accuracy'], "\n")

## Accuracy: 0.5517241

# Print other performance measures: Sensitivity, Specificity, Precision, Recall, F1 Score
# Confusion matrix metrics are stored in `tree_report$byClass`
cat("Sensitivity:", tree_report$byClass['Sensitivity'], "\n")

## Sensitivity: 1

cat("Specificity:", tree_report$byClass['Specificity'], "\n")

## Specificity: 0

cat("Precision:", tree_report$byClass['Pos Pred Value'], "\n")

## Precision: 0.5517241

cat("Recall (Sensitivity):", tree_report$byClass['Sensitivity'], "\n")  # Recall is the same as Sensitivity

## Recall (Sensitivity): 1

cat("F1 Score:", tree_report$byClass['F1'], "\n")

## F1 Score: 0.7111111
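
A sensitivity of 1 together with a specificity of 0 means the tree assigned every test crash to the positive class, so its accuracy is simply the share of that class in the test set. The sketch below reproduces the reported figures from class counts of 16 and 13 test rows. These counts are inferred from the reported metrics (16/29 = 0.5517), not read from a printed confusion matrix, so treat them as an assumption.

```r
# Majority-class baseline: every test row is predicted as the positive class
n_pos <- 16  # assumed number of positive-class test rows (inferred)
n_neg <- 13  # assumed number of negative-class test rows (inferred)

tp <- n_pos; fn <- 0   # all positives are caught
fp <- n_neg; tn <- 0   # all negatives are mislabeled

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)
specificity <- tn / (tn + fp)
precision   <- tp / (tp + fp)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision, f1 = f1), 4)
```

Under these assumed counts the decision tree adds nothing beyond always predicting the majority class.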

Random Forest

# Model 2: Random Forest
model_rf <- train(Pre_Crash_Mode ~ ., data = train_data, method = "rf", trControl = control)
rf_pred <- predict(model_rf, test_data)
rf_report <- confusionMatrix(rf_pred, test_data$Pre_Crash_Mode)

# Print overall accuracy
cat("Accuracy:", rf_report$overall['Accuracy'], "\n")

## Accuracy: 0.6551724

# Print additional performance metrics
cat("Sensitivity:", rf_report$byClass['Sensitivity'], "\n")

## Sensitivity: 0.875

cat("Specificity:", rf_report$byClass['Specificity'], "\n")

## Specificity: 0.3846154

cat("Precision (Pos Pred Value):", rf_report$byClass['Pos Pred Value'], "\n")

## Precision (Pos Pred Value): 0.6363636

cat("Recall (Sensitivity):", rf_report$byClass['Sensitivity'], "\n") # Recall is the same as Sensitivity

## Recall (Sensitivity): 0.875

cat("F1 Score:", rf_report$byClass['F1'], "\n")

## F1 Score: 0.7368421

SVM

# Model 3: SVM (linear kernel)
model_svm <- train(Pre_Crash_Mode ~ ., data = train_data, method = "svmLinear", trControl = control)
svm_pred <- predict(model_svm, test_data)
svm_report <- confusionMatrix(svm_pred, test_data$Pre_Crash_Mode)

# Print overall accuracy
cat("Accuracy:", svm_report$overall['Accuracy'], "\n")

## Accuracy: 0.5862069

# Print additional performance metrics
cat("Sensitivity:", svm_report$byClass['Sensitivity'], "\n")

## Sensitivity: 1

cat("Specificity:", svm_report$byClass['Specificity'], "\n")

## Specificity: 0.07692308

cat("Precision (Pos Pred Value):", svm_report$byClass['Pos Pred Value'], "\n")

## Precision (Pos Pred Value): 0.5714286

cat("Recall (Sensitivity):", svm_report$byClass['Sensitivity'], "\n") # Recall is the same as Sensitivity

## Recall (Sensitivity): 1

cat("F1 Score:", svm_report$byClass['F1'], "\n")

## F1 Score: 0.7272727

Models to classify Pre_Crash_Mode using only the crash narrative column

# Load necessary libraries
library(dplyr)
library(caret)
library(randomForest)
library(class)
library(e1071)
library(tm)  # For text mining

# Load data
data1 <- read.csv("C:/Users/swast/OneDrive - Texas State University/02_TXST/Fall 2024/AI/HW 04/CA_AV_CrashReportsSample_SB.csv")

# Select only the columns "Pre_Crash_Mode" and "CrashNarrative"
# (no rows are dropped here: an earlier filter/na.omit step was overwritten by this
# selection and had no effect on the results below)
data_clean <- data1 %>% select(Pre_Crash_Mode, CrashNarrative)

# Convert target variable to factor
data_clean$Pre_Crash_Mode <- as.factor(data_clean$Pre_Crash_Mode)

# Text preprocessing
# Create a corpus from the CrashNarrative column
corpus <- Corpus(VectorSource(data_clean$CrashNarrative))
corpus <- tm_map(corpus, content_transformer(tolower))        # Convert to lowercase
corpus <- tm_map(corpus, removePunctuation)                   # Remove punctuation
corpus <- tm_map(corpus, removeNumbers)                       # Remove numbers
corpus <- tm_map(corpus, removeWords, stopwords("english"))   # Remove common stopwords
corpus <- tm_map(corpus, stripWhitespace)                     # Remove extra whitespace

# Create a Document-Term Matrix (DTM)
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.99)  # Drop very rare terms (those absent from about 99% or more of the narratives)

# Convert DTM to a data frame
dtm_data <- as.data.frame(as.matrix(dtm))
colnames(dtm_data) <- make.names(colnames(dtm_data))
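
To make the structure of a document-term matrix concrete, here is a minimal base-R sketch on two made-up sentences. It does not use tm and skips stopword removal; the sentences and the name toy_dtm are illustrative assumptions.

```r
# Two hypothetical narratives
docs <- c(
  "The vehicle was in autonomous mode",
  "A test driver was operating the vehicle in manual mode"
)

# Lowercase, strip anything that is not a letter or space, split on whitespace
tokens <- strsplit(gsub("[^a-z ]", "", tolower(docs)), "\\s+")

# One row per document, one column per vocabulary term, cells are term counts
vocab <- sort(unique(unlist(tokens)))
toy_dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
rownames(toy_dtm) <- paste0("doc", seq_along(docs))

toy_dtm[, c("autonomous", "manual", "mode")]
```

Each column of such a matrix becomes one predictor, which is why the models below see several hundred term-count features.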

# Combine DTM with the target variable
data_model <- cbind(Pre_Crash_Mode = data_clean$Pre_Crash_Mode, dtm_data)

# Split data into training and test sets
set.seed(123)
trainIndex <- createDataPartition(data_model$Pre_Crash_Mode, p = 0.8, list = FALSE)
trainData <- data_model[trainIndex, ]
testData <- data_model[-trainIndex, ]

# Define a function to evaluate and print model performance
evaluate_model <- function(model, test_data, target_col) {
  predictions <- predict(model, test_data)
  cm <- confusionMatrix(predictions, test_data[[target_col]])
  print(cm)
}

Random Forest

# Random Forest
set.seed(123)
rf_model <- randomForest(Pre_Crash_Mode ~ ., data = trainData, na.action = na.omit)
cat("Random Forest Confusion Matrix:\n")

## Random Forest Confusion Matrix:

evaluate_model(rf_model, testData, "Pre_Crash_Mode")

## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     Autonomous Conventional
##   Autonomous           32            4
##   Conventional          0           17
##                                           
##                Accuracy : 0.9245          
##                  95% CI : (0.8179, 0.9791)
##     No Information Rate : 0.6038          
##     P-Value [Acc > NIR] : 1.498e-07       
##                                           
##                   Kappa : 0.8369          
##                                           
##  Mcnemar's Test P-Value : 0.1336          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.8095          
##          Pos Pred Value : 0.8889          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.6038          
##          Detection Rate : 0.6038          
##    Detection Prevalence : 0.6792          
##       Balanced Accuracy : 0.9048          
##                                           
##        'Positive' Class : Autonomous      
##

KNN

# K-Nearest Neighbors (KNN)
set.seed(123)
knn_model <- train(Pre_Crash_Mode ~ ., data = trainData,
                   method = "knn",
                   trControl = trainControl(method = "cv", number = 5),
                   tuneLength = 10)
cat("KNN Confusion Matrix:\n")

## KNN Confusion Matrix:

evaluate_model(knn_model, testData, "Pre_Crash_Mode")

## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     Autonomous Conventional
##   Autonomous           30            9
##   Conventional          2           12
##                                           
##                Accuracy : 0.7925          
##                  95% CI : (0.6589, 0.8916)
##     No Information Rate : 0.6038          
##     P-Value [Acc > NIR] : 0.00285         
##                                           
##                   Kappa : 0.5399          
##                                           
##  Mcnemar's Test P-Value : 0.07044         
##                                           
##             Sensitivity : 0.9375          
##             Specificity : 0.5714          
##          Pos Pred Value : 0.7692          
##          Neg Pred Value : 0.8571          
##              Prevalence : 0.6038          
##          Detection Rate : 0.5660          
##    Detection Prevalence : 0.7358          
##       Balanced Accuracy : 0.7545          
##                                           
##        'Positive' Class : Autonomous      
##

SVM

# Support Vector Machine (SVM)
set.seed(123)
svm_model <- svm(Pre_Crash_Mode ~ ., data = trainData, kernel = "linear", probability = TRUE)
cat("SVM Confusion Matrix:\n")

## SVM Confusion Matrix:

evaluate_model(svm_model, testData, "Pre_Crash_Mode")

## Confusion Matrix and Statistics
## 
##               Reference
## Prediction     Autonomous Conventional
##   Autonomous           28            1
##   Conventional          4           20
##                                           
##                Accuracy : 0.9057          
##                  95% CI : (0.7934, 0.9687)
##     No Information Rate : 0.6038          
##     P-Value [Acc > NIR] : 1e-06           
##                                           
##                   Kappa : 0.8076          
##                                           
##  Mcnemar's Test P-Value : 0.3711          
##                                           
##             Sensitivity : 0.8750          
##             Specificity : 0.9524          
##          Pos Pred Value : 0.9655          
##          Neg Pred Value : 0.8333          
##              Prevalence : 0.6038          
##          Detection Rate : 0.5283          
##    Detection Prevalence : 0.5472          
##       Balanced Accuracy : 0.9137          
##                                           
##        'Positive' Class : Autonomous      
##

Explainable AI

# Load libraries
library(DALEX)
library(iBreakDown)
library(lime)
library(caret)
library(randomForest)
library(ggplot2)

# Assuming rf_model, knn_model, and svm_model are trained and available

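# Convert target to numeric for the DALEX explainers.
# Note: as.numeric() on a factor returns the level codes (1 = Autonomous, 2 = Conventional),
# not 0/1, so the residuals reported by explain() below are not on the same 0-to-1 scale
# as the predicted probabilities.
y_numeric <- as.numeric(as.factor(testData$Pre_Crash_Mode))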

# Create explainers for each model
explainer_rf <- DALEX::explain(
  model = rf_model,
  data = testData[, -1],
  y = y_numeric,
  label = "Random Forest",
  predict_function = function(model, newdata) predict(model, newdata, type = "prob")[, 2]  # Probability of the second class level (Conventional)
)

## Preparation of a new explainer is initiated
##   -> model label       :  Random Forest 
##   -> data              :  53  rows  449  cols 
##   -> target variable   :  53  values 
##   -> predict function  :  function(model, newdata) predict(model, newdata, type = "prob")[,      2] 
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package randomForest , ver. 4.7.1.2 , task classification (  default  ) 
##   -> predicted values  :  numerical, min =  0.014 , mean =  0.3915094 , max =  0.906  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  0.502 , mean =  1.004717 , max =  1.702  
##   A new explainer has been created!

explainer_knn <- DALEX::explain(
  model = knn_model,
  data = testData[, -1],
  y = y_numeric,
  label = "KNN",
  predict_function = function(model, newdata) predict(model, newdata, type = "prob")[, 2]
)

## Preparation of a new explainer is initiated
##   -> model label       :  KNN 
##   -> data              :  53  rows  449  cols 
##   -> target variable   :  53  values 
##   -> predict function  :  function(model, newdata) predict(model, newdata, type = "prob")[,      2] 
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package caret , ver. 6.0.94 , task classification (  default  ) 
##   -> predicted values  :  numerical, min =  0 , mean =  0.3228437 , max =  1  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  0.4 , mean =  1.073383 , max =  1.857143  
##   A new explainer has been created!

explainer_svm <- DALEX::explain(
  model = svm_model,
  data = testData[, -1],
  y = y_numeric,
  label = "SVM",
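  # Caution: e1071 orders the probability columns by the order in which the classes
  # appear in the training data, so column 2 is not guaranteed to be the same class
  # as for the other two models. Selecting the column by name would remove the ambiguity.
  predict_function = function(model, newdata) attr(predict(model, newdata, probability = TRUE), "probabilities")[, 2]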
)

## Preparation of a new explainer is initiated
##   -> model label       :  SVM 
##   -> data              :  53  rows  449  cols 
##   -> target variable   :  53  values 
##   -> predict function  :  function(model, newdata) attr(predict(model, newdata, probability = TRUE),      "probabilities")[, 2] 
##   -> predicted values  :  No value for predict function target column. (  default  )
##   -> model_info        :  package e1071 , ver. 1.7.16 , task classification (  default  ) 
##   -> predicted values  :  numerical, min =  0.0002457937 , mean =  0.5788534 , max =  0.997841  
##   -> residual function :  difference between y and yhat (  default  )
##   -> residuals         :  numerical, min =  0.002158994 , mean =  0.817373 , max =  1.999754  
##   A new explainer has been created!
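
The residuals printed by the explainers are all positive because the target is coded as 1/2 while the predictions are probabilities between 0 and 1. The toy example below, with made-up labels and probabilities, shows the difference between level codes and a 0/1 indicator.

```r
# Hypothetical labels and predicted probabilities of "Conventional"
truth <- factor(c("Autonomous", "Conventional", "Conventional", "Autonomous"))
p_conventional <- c(0.10, 0.90, 0.60, 0.30)

y_codes  <- as.numeric(truth)                     # level codes: 1, 2, 2, 1
y_binary <- as.numeric(truth == "Conventional")   # indicator:   0, 1, 1, 0

res_codes  <- y_codes - p_conventional    # always positive, hard to interpret
res_binary <- y_binary - p_conventional   # centered residuals on the probability scale

data.frame(truth, p_conventional, res_codes, res_binary)
```

The SHAP attributions computed next depend only on the predict function, not on y, so this coding affects the residual summaries but not the SHAP plots.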

# Calculate SHAP values for the first 5 observations for each model
shap_rf <- predict_parts(explainer_rf, new_observation = testData[1:5, -1], type = "shap")
shap_knn <- predict_parts(explainer_knn, new_observation = testData[1:5, -1], type = "shap")
shap_svm <- predict_parts(explainer_svm, new_observation = testData[1:5, -1], type = "shap")

# Plot SHAP values for each model
plot(shap_rf) + ggtitle("SHAP Values for Random Forest")

plot(shap_knn) + ggtitle("SHAP Values for KNN")

plot(shap_svm) + ggtitle("SHAP Values for SVM")

The narrative-based models predict Pre_Crash_Mode far better than the models trained on the tabular features alone, which reached only 55% to 66% test accuracy (on a different, smaller test split). Random Forest achieved the highest accuracy (92.45%), with perfect sensitivity for the Autonomous class (1.00) and a specificity of 0.81. SVM followed closely at 90.57% and was the most balanced of the three models (sensitivity 0.875, specificity 0.952, balanced accuracy 0.914). KNN was the weakest at 79.25%: its sensitivity is high (0.94), but its specificity is only 0.57, meaning it labeled 9 of the 21 Conventional test crashes as Autonomous.
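
The comparison can be checked directly from the three confusion matrices printed above, with Autonomous as the positive class. The data frame layout and column names below are illustrative choices; the counts are copied from the output.

```r
# Confusion-matrix counts from the printed output (positive class: Autonomous)
cm <- data.frame(
  model = c("Random Forest", "KNN", "SVM"),
  TP = c(32, 30, 28),  # predicted Autonomous, actually Autonomous
  FP = c(4, 9, 1),     # predicted Autonomous, actually Conventional
  FN = c(0, 2, 4),     # predicted Conventional, actually Autonomous
  TN = c(17, 12, 20)   # predicted Conventional, actually Conventional
)

cm$accuracy          <- (cm$TP + cm$TN) / (cm$TP + cm$FP + cm$FN + cm$TN)
cm$sensitivity       <- cm$TP / (cm$TP + cm$FN)
cm$specificity       <- cm$TN / (cm$TN + cm$FP)
cm$balanced_accuracy <- (cm$sensitivity + cm$specificity) / 2

print(cbind(model = cm$model, round(cm[, 6:9], 4)))
```

Ranking by balanced accuracy rather than raw accuracy puts SVM slightly ahead of Random Forest, because the test set is about 60% Autonomous.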

The SHAP (SHapley Additive exPlanations) analysis shows how individual narrative terms contribute to each model's prediction. Terms such as manual, mode, and waymo carried large attributions across models, pushing the predicted probability either toward or away from the explained class. This is plausible given the data: the narratives often state the driving mode explicitly (for example, "operating the Waymo AV in manual mode"), which suggests the models are largely picking up the label from the text itself. The SHAP values make this behavior transparent, especially for the Random Forest and SVM models, and show which terms drive the prediction of Pre_Crash_Mode.
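
To illustrate what a SHAP attribution is, the following base-R sketch computes exact Shapley values for a toy scoring function by averaging each feature's marginal contribution over all feature orderings. The scoring function, the background data, and the helper names (toy_predict, value, all_perms) are hypothetical; DALEX estimates the same quantity by sampling orderings rather than enumerating them.

```r
# Toy scoring function standing in for a fitted classifier (hypothetical)
toy_predict <- function(d) {
  0.2 + 0.5 * d$manual - 0.3 * d$autonomous + 0.1 * d$manual * d$waymo
}

# Hypothetical background data: term counts in four narratives
background <- data.frame(
  manual     = c(0, 1, 0, 2),
  autonomous = c(1, 0, 1, 0),
  waymo      = c(1, 1, 0, 0)
)
x <- data.frame(manual = 2, autonomous = 0, waymo = 1)  # observation to explain

# v(S): mean prediction when the features in S are fixed to x's values
value <- function(S) {
  d <- background
  for (f in S) d[[f]] <- x[[f]]
  mean(toy_predict(d))
}

# All orderings of a character vector
all_perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in all_perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

features <- names(background)
perms <- all_perms(features)
phi <- setNames(numeric(length(features)), features)
for (p in perms) {
  for (k in seq_along(p)) {
    before <- p[seq_len(k - 1)]
    phi[p[k]] <- phi[p[k]] + value(c(before, p[k])) - value(before)
  }
}
phi <- phi / length(perms)  # average marginal contribution per feature

phi
```

The attributions sum to the difference between the prediction for x and the average prediction over the background data, which is the property that lets a SHAP plot be read as a decomposition of a single prediction.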