Homework #4
You get to decide which dataset you want to work on. The data set must be different from the ones used in previous homeworks You can work on a problem from your job, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.Gov, Census Bureau, USGS or other open data portals.
Select one of the methodologies studied in weeks 1-10, and one methodology from weeks 11-15 to apply in the new dataset selected. To complete this task:.
Please be sure to address the business impact (it could be of any domain) of your solution.
Your final presentation could be the traditional R file or Python file and essay, or it could be an oral presentation with the execution and explanation of your code, recorded on any platform of your choice (Youtube, Free Cam). If you select the presentation, it should be a 5 to 8 minutes recording.
An analysis of motor vehicle accidents in Nassau county, Long Island, NY could provide useful information to a level 1 adult trauma center in Nassau County, Long Island.
Building a models predicting accident severity with feature variables including location, weather attributes and amenities could aid in injury prevention as well.
Unsupervised methods could provide important attributes in severe accidents.
US-Accidents: A Countrywide Traffic Accident Dataset Published: January 2021
Description:
This is a countrywide traffic accident dataset, which covers 49 states of the United States. The data is continuously being collected from February 2016, using several data providers, including multiple APIs that provide streaming traffic event data. These APIs broadcast traffic events captured by a variety of entities, such as the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road-networks. Currently, there are about 1.5 million accident records in this dataset. Check the below descriptions for more detailed information. (Moosavi, 2021).
References:
Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. “A Countrywide Traffic Accident Dataset.”, arXiv preprint arXiv:1906.05409 (2019).
Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. “Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.” In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.
The dataset contains 2,845,342 records.
This study will be a subset of the dataset US-Accidents, traffic accidents in Nassau County, Long Island, NY 6,219 records.
#file1<-"C:/Users/Lisa/OneDrive/CUNY/622/HW4/US_Accidents_Dec21_updated.csv"
#car1<-read.csv(file1,header=TRUE)
#2,845,342 records
#car2<-car1[car1$State=="NY"& car1$County=="Nassau",]
#write.csv(car2,"C:/Users/Lisa/OneDrive/CUNY/622/HW4/car2.csv", row.names = TRUE)
file3<-"C:/Users/Lisa/OneDrive/CUNY/622/HW4/car2.csv"
car2<-read.csv(file3,header=TRUE)
Imputation:
If windchill was missing, impute with temperature.
If precipitation was missing, impute with 0.
The POI’s below were converted to 1, 0 (True, False):
Sunrise_Sunset is 1 for Day and 0 for Night.
Remaining records with NA are excluded.
car3<-car2
summary(car3)
## X ID Severity Start_Time
## Min. : 31708 Length:6219 Min. :1.000 Length:6219
## 1st Qu.: 608280 Class :character 1st Qu.:2.000 Class :character
## Median :1501829 Mode :character Median :2.000 Mode :character
## Mean :1459704 Mean :2.236
## 3rd Qu.:2244843 3rd Qu.:2.000
## Max. :2844969 Max. :4.000
##
## End_Time Start_Lat Start_Lng End_Lat
## Length:6219 Min. :40.59 Min. :-73.75 Min. :40.59
## Class :character 1st Qu.:40.69 1st Qu.:-73.65 1st Qu.:40.69
## Mode :character Median :40.72 Median :-73.59 Median :40.72
## Mean :40.73 Mean :-73.59 Mean :40.73
## 3rd Qu.:40.78 3rd Qu.:-73.53 3rd Qu.:40.78
## Max. :40.87 Max. :-73.42 Max. :40.87
##
## End_Lng Distance.mi. Description Number
## Min. :-73.89 Min. : 0.0000 Length:6219 Min. : 1
## 1st Qu.:-73.64 1st Qu.: 0.1430 Class :character 1st Qu.: 201
## Median :-73.58 Median : 0.5070 Mode :character Median : 598
## Mean :-73.59 Mean : 0.9515 Mean :1146
## 3rd Qu.:-73.52 3rd Qu.: 1.2280 3rd Qu.:1799
## Max. :-73.40 Max. :17.5910 Max. :8649
## NA's :5609
## Street Side City County
## Length:6219 Length:6219 Length:6219 Length:6219
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## State Zipcode Country Timezone
## Length:6219 Length:6219 Length:6219 Length:6219
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Airport_Code Weather_Timestamp Temperature.F. Wind_Chill.F.
## Length:6219 Length:6219 Min. : 7.0 Min. :-11.10
## Class :character Class :character 1st Qu.:43.0 1st Qu.: 36.00
## Mode :character Mode :character Median :55.0 Median : 52.00
## Mean :55.7 Mean : 51.18
## 3rd Qu.:69.1 3rd Qu.: 67.00
## Max. :98.0 Max. : 98.00
## NA's :25 NA's :1160
## Humidity... Pressure.in. Visibility.mi. Wind_Direction
## Min. : 11.00 Min. :28.91 Min. : 0.000 Length:6219
## 1st Qu.: 48.00 1st Qu.:29.82 1st Qu.:10.000 Class :character
## Median : 66.00 Median :29.97 Median :10.000 Mode :character
## Mean : 64.93 Mean :29.97 Mean : 9.034
## 3rd Qu.: 83.00 3rd Qu.:30.13 3rd Qu.:10.000
## Max. :100.00 Max. :30.85 Max. :10.000
## NA's :25 NA's :18 NA's :32
## Wind_Speed.mph. Precipitation.in. Weather_Condition Amenity
## Min. : 0.00 Min. :0.0000 Length:6219 Length:6219
## 1st Qu.: 6.90 1st Qu.:0.0000 Class :character Class :character
## Median : 9.20 Median :0.0000 Mode :character Mode :character
## Mean :10.29 Mean :0.0086
## 3rd Qu.:13.80 3rd Qu.:0.0000
## Max. :36.00 Max. :1.3200
## NA's :119 NA's :1477
## Bump Crossing Give_Way Junction
## Length:6219 Length:6219 Length:6219 Length:6219
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## No_Exit Railway Roundabout Station
## Length:6219 Length:6219 Length:6219 Length:6219
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Stop Traffic_Calming Traffic_Signal Turning_Loop
## Length:6219 Length:6219 Length:6219 Length:6219
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight
## Length:6219 Length:6219 Length:6219 Length:6219
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
#if wind_chill missing use temp
car3$Wind_Chill.F.=ifelse(is.na(car3$Wind_Chill.F.),car3$Temperature.F.,car3$Wind_Chill.F.)
car3$Precipitation.in.=ifelse(is.na(car3$Precipitation.in.),0,car3$Precipitation.in.)
car3$Amenity=ifelse((car3$Amenity=='True'),1,0)
car3$Bump=ifelse((car3$Bump=='True'),1,0)
car3$Crossing=ifelse((car3$Crossing=='True'),1,0)
car3$Give_Way=ifelse((car3$Give_Way=='True'),1,0)
car3$Junction=ifelse((car3$Junction=='True'),1,0)
car3$No_Exit=ifelse((car3$No_Exit=='True'),1,0)
car3$Railway=ifelse((car3$Railway=='True'),1,0)
car3$Roundabout=ifelse((car3$Roundabout=='True'),1,0)
car3$Station=ifelse((car3$Station=='True'),1,0)
car3$Stop=ifelse((car3$Stop=='True'),1,0)
car3$Traffic_Calming=ifelse((car3$Traffic_Calming=='True'),1,0)
car3$Traffic_Signal=ifelse((car3$Traffic_Signal=='True'),1,0)
car3$Turning_Loop=ifelse((car3$Turning_Loop=='True'),1,0)
car3$Sunrise_Sunset=ifelse((car3$Sunrise_Sunset=="Day"),1,0)
summary(car3)
## X ID Severity Start_Time
## Min. : 31708 Length:6219 Min. :1.000 Length:6219
## 1st Qu.: 608280 Class :character 1st Qu.:2.000 Class :character
## Median :1501829 Mode :character Median :2.000 Mode :character
## Mean :1459704 Mean :2.236
## 3rd Qu.:2244843 3rd Qu.:2.000
## Max. :2844969 Max. :4.000
##
## End_Time Start_Lat Start_Lng End_Lat
## Length:6219 Min. :40.59 Min. :-73.75 Min. :40.59
## Class :character 1st Qu.:40.69 1st Qu.:-73.65 1st Qu.:40.69
## Mode :character Median :40.72 Median :-73.59 Median :40.72
## Mean :40.73 Mean :-73.59 Mean :40.73
## 3rd Qu.:40.78 3rd Qu.:-73.53 3rd Qu.:40.78
## Max. :40.87 Max. :-73.42 Max. :40.87
##
## End_Lng Distance.mi. Description Number
## Min. :-73.89 Min. : 0.0000 Length:6219 Min. : 1
## 1st Qu.:-73.64 1st Qu.: 0.1430 Class :character 1st Qu.: 201
## Median :-73.58 Median : 0.5070 Mode :character Median : 598
## Mean :-73.59 Mean : 0.9515 Mean :1146
## 3rd Qu.:-73.52 3rd Qu.: 1.2280 3rd Qu.:1799
## Max. :-73.40 Max. :17.5910 Max. :8649
## NA's :5609
## Street Side City County
## Length:6219 Length:6219 Length:6219 Length:6219
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## State Zipcode Country Timezone
## Length:6219 Length:6219 Length:6219 Length:6219
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Airport_Code Weather_Timestamp Temperature.F. Wind_Chill.F.
## Length:6219 Length:6219 Min. : 7.0 Min. :-11.10
## Class :character Class :character 1st Qu.:43.0 1st Qu.: 38.00
## Mode :character Mode :character Median :55.0 Median : 55.00
## Mean :55.7 Mean : 53.48
## 3rd Qu.:69.1 3rd Qu.: 69.10
## Max. :98.0 Max. : 98.00
## NA's :25 NA's :25
## Humidity... Pressure.in. Visibility.mi. Wind_Direction
## Min. : 11.00 Min. :28.91 Min. : 0.000 Length:6219
## 1st Qu.: 48.00 1st Qu.:29.82 1st Qu.:10.000 Class :character
## Median : 66.00 Median :29.97 Median :10.000 Mode :character
## Mean : 64.93 Mean :29.97 Mean : 9.034
## 3rd Qu.: 83.00 3rd Qu.:30.13 3rd Qu.:10.000
## Max. :100.00 Max. :30.85 Max. :10.000
## NA's :25 NA's :18 NA's :32
## Wind_Speed.mph. Precipitation.in. Weather_Condition Amenity
## Min. : 0.00 Min. :0.000000 Length:6219 Min. :0.000000
## 1st Qu.: 6.90 1st Qu.:0.000000 Class :character 1st Qu.:0.000000
## Median : 9.20 Median :0.000000 Mode :character Median :0.000000
## Mean :10.29 Mean :0.006573 Mean :0.008201
## 3rd Qu.:13.80 3rd Qu.:0.000000 3rd Qu.:0.000000
## Max. :36.00 Max. :1.320000 Max. :1.000000
## NA's :119
## Bump Crossing Give_Way Junction No_Exit
## Min. :0 Min. :0.0000 Min. :0.0000000 Min. :0.0000 Min. :0
## 1st Qu.:0 1st Qu.:0.0000 1st Qu.:0.0000000 1st Qu.:0.0000 1st Qu.:0
## Median :0 Median :0.0000 Median :0.0000000 Median :0.0000 Median :0
## Mean :0 Mean :0.0283 Mean :0.0006432 Mean :0.2058 Mean :0
## 3rd Qu.:0 3rd Qu.:0.0000 3rd Qu.:0.0000000 3rd Qu.:0.0000 3rd Qu.:0
## Max. :0 Max. :1.0000 Max. :1.0000000 Max. :1.0000 Max. :0
##
## Railway Roundabout Station Stop
## Min. :0.0000000 Min. :0 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.0000000 1st Qu.:0 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.0000000 Median :0 Median :0.00000 Median :0.000000
## Mean :0.0006432 Mean :0 Mean :0.01608 Mean :0.002573
## 3rd Qu.:0.0000000 3rd Qu.:0 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.0000000 Max. :0 Max. :1.00000 Max. :1.000000
##
## Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset
## Min. :0 Min. :0.00000 Min. :0 Min. :0.0000
## 1st Qu.:0 1st Qu.:0.00000 1st Qu.:0 1st Qu.:0.0000
## Median :0 Median :0.00000 Median :0 Median :1.0000
## Mean :0 Mean :0.04904 Mean :0 Mean :0.6692
## 3rd Qu.:0 3rd Qu.:0.00000 3rd Qu.:0 3rd Qu.:1.0000
## Max. :0 Max. :1.00000 Max. :0 Max. :1.0000
##
## Civil_Twilight Nautical_Twilight Astronomical_Twilight
## Length:6219 Length:6219 Length:6219
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
We will be including the following variables:
car4<-car3 %>%
select(Severity,
Start_Lat,
Start_Lng,
Distance.mi.,
Temperature.F.,
Wind_Chill.F.,
Humidity...,
Pressure.in.,
Visibility.mi.,
Wind_Speed.mph.,
Precipitation.in.,
Amenity,
Crossing,
Junction,
Station,
Stop,
Traffic_Signal,
Sunrise_Sunset)
#Remove records is missing values
car4[car4==""] <- NA
car4<-na.omit(car4)
#car4 %>% drop_na(car4)
str(car4)
## 'data.frame': 6093 obs. of 18 variables:
## $ Severity : int 2 2 3 3 4 3 2 2 3 2 ...
## $ Start_Lat : num 40.7 40.8 40.8 40.8 40.8 ...
## $ Start_Lng : num -73.7 -73.7 -73.6 -73.7 -73.5 ...
## $ Distance.mi. : num 1.488 1.019 0.299 2.328 1.292 ...
## $ Temperature.F. : num 55 51.1 46 45 41 39.9 45 48 37.9 42.1 ...
## $ Wind_Chill.F. : num 55 51.1 39.3 37.4 34.3 34.5 38.2 48 33.5 37.1 ...
## $ Humidity... : int 93 44 54 53 53 97 86 61 65 70 ...
## $ Pressure.in. : num 29.6 29.9 30.1 30.1 30.2 ...
## $ Visibility.mi. : num 2 10 10 10 10 3 10 10 10 10 ...
## $ Wind_Speed.mph. : num 11.5 19.6 16.1 18.4 11.5 8.1 15 12.7 5.8 8.1 ...
## $ Precipitation.in.: num 0.01 0 0 0 0 0.01 0 0 0 0 ...
## $ Amenity : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Crossing : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Junction : num 1 0 1 1 0 1 0 0 1 0 ...
## $ Station : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Stop : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Traffic_Signal : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Sunrise_Sunset : num 0 1 0 0 1 1 1 0 1 1 ...
## - attr(*, "na.action")= 'omit' Named int [1:126] 9 59 62 63 64 65 67 68 79 93 ...
## ..- attr(*, "names")= chr [1:126] "9" "59" "62" "63" ...
summary(car4)
## Severity Start_Lat Start_Lng Distance.mi.
## Min. :1.00 Min. :40.59 Min. :-73.75 Min. : 0.000
## 1st Qu.:2.00 1st Qu.:40.69 1st Qu.:-73.65 1st Qu.: 0.140
## Median :2.00 Median :40.72 Median :-73.59 Median : 0.497
## Mean :2.23 Mean :40.73 Mean :-73.59 Mean : 0.950
## 3rd Qu.:2.00 3rd Qu.:40.78 3rd Qu.:-73.53 3rd Qu.: 1.228
## Max. :4.00 Max. :40.87 Max. :-73.42 Max. :17.591
## Temperature.F. Wind_Chill.F. Humidity... Pressure.in.
## Min. : 7.0 Min. :-11.10 Min. : 11.00 Min. :28.91
## 1st Qu.:43.0 1st Qu.: 38.00 1st Qu.: 48.00 1st Qu.:29.82
## Median :55.0 Median : 55.00 Median : 65.00 Median :29.97
## Mean :55.8 Mean : 53.54 Mean : 64.79 Mean :29.97
## 3rd Qu.:69.1 3rd Qu.: 69.10 3rd Qu.: 83.00 3rd Qu.:30.13
## Max. :98.0 Max. : 98.00 Max. :100.00 Max. :30.85
## Visibility.mi. Wind_Speed.mph. Precipitation.in. Amenity
## Min. : 0.000 Min. : 0.0 Min. :0.000000 Min. :0.00000
## 1st Qu.:10.000 1st Qu.: 6.9 1st Qu.:0.000000 1st Qu.:0.00000
## Median :10.000 Median : 9.2 Median :0.000000 Median :0.00000
## Mean : 9.042 Mean :10.3 Mean :0.006698 Mean :0.00837
## 3rd Qu.:10.000 3rd Qu.:13.8 3rd Qu.:0.000000 3rd Qu.:0.00000
## Max. :10.000 Max. :36.0 Max. :1.320000 Max. :1.00000
## Crossing Junction Station Stop
## Min. :0.00000 Min. :0.0000 Min. :0.00000 Min. :0.000000
## 1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:0.000000
## Median :0.00000 Median :0.0000 Median :0.00000 Median :0.000000
## Mean :0.02839 Mean :0.2053 Mean :0.01608 Mean :0.002626
## 3rd Qu.:0.00000 3rd Qu.:0.0000 3rd Qu.:0.00000 3rd Qu.:0.000000
## Max. :1.00000 Max. :1.0000 Max. :1.00000 Max. :1.000000
## Traffic_Signal Sunrise_Sunset
## Min. :0.00000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.0000
## Median :0.00000 Median :1.0000
## Mean :0.04973 Mean :0.6718
## 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :1.00000 Max. :1.0000
var1<-car4 %>%
select(Severity,
Start_Lat,
Start_Lng,
Distance.mi.,
Temperature.F.,
Wind_Chill.F.,
Humidity...,
Pressure.in.,
Visibility.mi.,
Wind_Speed.mph.,
Precipitation.in.
) %>%
gather(key='variables',value='value')
var1%>%
ggplot(aes(x=variables, y=value)) +
geom_boxplot() +
labs(title='Boxplot of Numerical Variables')+
theme(plot.title=element_text(hjust=0.5))+
coord_flip()
DataExplorer::plot_histogram(car4)
From the histograms above, we can classify the variables into the distribution categories:
dataset1<-car4 %>%
group_by(Severity, Sunrise_Sunset) %>%
summarise(S_count=n())
## `summarise()` has grouped output by 'Severity'. You can override using the `.groups` argument.
overallplot<-ggplot(dataset1,aes(x=Severity, y=S_count,fill=Sunrise_Sunset))+geom_bar(stat="identity")+ theme(axis.text.x=element_text(angle=90,size = 5, face = "bold"), axis.title.y = element_text( size = 6, face = "bold"))+labs(title = "Car Accidents Nassau County, NY",
subtitle = " Mar 2016- Dec 2021 ",
x = "Severity", y = "Number of Accidents")
overallplot
The above plot suggests that most accidents are a level 2 occurring during the day.
Mapping all accidents
map1<-car4 %>%
leaflet () %>%
addTiles () %>%
addHeatmap(lng=~Start_Lng, lat=~Start_Lat, intensity=2, blur=4, max=1, radius=4)
map1
Interested in severity 3,4 more likely injuries:
Mapping Severity 3 & 4 accidents:
car5 <- car4 %>% filter(car4$Severity==4 |car4$Severity==3)
map2<-car5 %>%
leaflet () %>%
addTiles () %>%
addHeatmap(lng=~Start_Lng, lat=~Start_Lat, intensity=2, blur=4, max=1, radius=4)
#mapview(car4, xcol="Start_Lng", ycol="Start_Lat", crs =4269, grid=FALSE)
map2
Side by side: All accidents vs severity 4, 5
library(leafsync)
## Warning: package 'leafsync' was built under R version 4.1.3
sync(map1, map2, ncol = 2)
We will use 75% of the data to train and 25% to test our models.
set.seed(123)
split = sample.split(car4$Severity, SplitRatio = 0.8)
train = subset(car4, split == TRUE)
test = subset(car4, split == FALSE)
Y_train <- train[,'Severity']
Y_test <- test[,'Severity']
X_train <- train[, !(colnames(train) == 'Severity')]
X_test <- test[, !(colnames(test) == 'Severity')]
library(randomForest)
set.seed(123)
car_rf = randomForest(x = X_train,
y = Y_train,
mtry=12,
importance=TRUE,
ntree = 500)
## Warning in randomForest.default(x = X_train, y = Y_train, mtry = 12, importance
## = TRUE, : The response has five or fewer unique values. Are you sure you want to
## do regression?
car_rf
##
## Call:
## randomForest(x = X_train, y = Y_train, ntree = 500, mtry = 12, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 12
##
## Mean of squared residuals: 0.2279701
## % Var explained: 25.73
# Predicting the Test set results
y_pred = predict(car_rf, newdata = test)
plot(y_pred, test$Severity)
abline(0,1)
mean((y_pred-test$Severity)^2)
## [1] 0.2388338
The test MSE = .2388
importance(car_rf)
## %IncMSE IncNodePurity
## Start_Lat 71.019228 238.082629
## Start_Lng 42.782750 180.936961
## Distance.mi. 98.893070 202.538584
## Temperature.F. 35.704815 101.275681
## Wind_Chill.F. 35.182275 107.530585
## Humidity... 33.221898 127.353185
## Pressure.in. 44.266395 164.609413
## Visibility.mi. 18.587023 24.459878
## Wind_Speed.mph. 29.375466 110.296317
## Precipitation.in. 12.377185 22.813823
## Amenity 7.235074 3.828830
## Crossing 20.432000 13.269731
## Junction 18.418312 15.533216
## Station 22.809808 13.247730
## Stop 1.130248 1.100663
## Traffic_Signal 28.275807 22.220746
## Sunrise_Sunset 25.117332 23.262563
varImpPlot(car_rf)
The random forest names latitude, distance, ongitude, distance, pressure, humidity and temperature.
Latitude and longitude suggests particular areas are dangerous roadways.
set.seed(12)
nnetGrid <- expand.grid(.decay = c(0,0.01,.1),
.size = c(1:5),
.bag = FALSE)
nnetFit <- train(X_train, Y_train,
method = 'avNNet',
preProc = c('center','scale'),
tuneGrid = nnetGrid,
linout = TRUE,
trace = FALSE,
MaxNWts = 5 * (ncol(X_train) + 1 + 5 + 1),
maxit = 100
)
## Warning: executing %dopar% sequentially: no parallel backend registered
nnetFit
## Model Averaged Neural Network
##
## 4875 samples
## 17 predictor
##
## Pre-processing: centered (17), scaled (17)
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 4875, 4875, 4875, 4875, 4875, 4875, ...
## Resampling results across tuning parameters:
##
## decay size RMSE Rsquared MAE
## 0.00 1 0.5466415 0.03524544 0.3737422
## 0.00 2 0.5426167 0.03733871 0.3710252
## 0.00 3 0.5416536 0.04285998 0.3668341
## 0.00 4 0.5443025 0.04685063 0.3641160
## 0.00 5 0.5404815 0.05088628 0.3622794
## 0.01 1 0.5435461 0.03446969 0.3722347
## 0.01 2 0.5428102 0.03744412 0.3697361
## 0.01 3 0.5413183 0.04340994 0.3658335
## 0.01 4 0.5406490 0.04871990 0.3627246
## 0.01 5 0.5406009 0.05053923 0.3615607
## 0.10 1 0.5445937 0.03214851 0.3716450
## 0.10 2 0.5417336 0.04111046 0.3679597
## 0.10 3 0.5407235 0.04566816 0.3654301
## 0.10 4 0.5403906 0.04928328 0.3621715
## 0.10 5 0.5407640 0.05018304 0.3617645
##
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0.1 and bag = FALSE.
varimp <- varImp(nnetFit)
varimp$importance %>%
arrange(desc(Overall)) %>%
kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")
| Overall | |
|---|---|
| Distance.mi. | 100.0000000 |
| Start_Lat | 79.9046430 |
| Traffic_Signal | 43.2623306 |
| Crossing | 41.9281369 |
| Start_Lng | 37.5710780 |
| Wind_Chill.F. | 15.3257766 |
| Temperature.F. | 13.9341843 |
| Humidity… | 7.6002692 |
| Station | 7.0046778 |
| Sunrise_Sunset | 6.2018433 |
| Amenity | 6.1750330 |
| Precipitation.in. | 2.7610988 |
| Junction | 1.6939199 |
| Wind_Speed.mph. | 1.6466898 |
| Pressure.in. | 0.2518870 |
| Stop | 0.1322378 |
| Visibility.mi. | 0.0000000 |
The Neural network names distance, Latitude, traffic signal, crossing & longitude as the top 5 important variables.
The dataset of Nassau County car accidents of severity 3 or 4 contains 1061 records.
car6<-car5
pca_out=prcomp(car6,scale=TRUE)
pca_out
## Standard deviations (1, .., p=18):
## [1] 1.52605037 1.38596148 1.29676746 1.24457777 1.11276150 1.03444581
## [7] 1.02200611 0.98997735 0.93930684 0.92674839 0.91277307 0.86630255
## [13] 0.84461537 0.82912910 0.76315419 0.68546828 0.63423985 0.08171469
##
## Rotation (n x k) = (18 x 18):
## PC1 PC2 PC3 PC4 PC5
## Severity 0.09835738 -0.09201160 0.048113182 0.289142987 -0.05014822
## Start_Lat 0.03159939 0.01538580 -0.237883395 0.337923578 -0.50416740
## Start_Lng 0.02127685 -0.06018918 -0.008308465 0.421120681 -0.13096475
## Distance.mi. 0.10290338 0.08344665 -0.328467821 0.180022449 -0.29057941
## Temperature.F. -0.56240420 0.28093050 -0.021821004 0.181581952 0.09594789
## Wind_Chill.F. -0.55376529 0.28205198 -0.020100272 0.212725226 0.11901125
## Humidity... 0.27798385 0.33068902 0.313241495 0.245468479 0.24720640
## Pressure.in. 0.11613038 -0.32035872 -0.299331606 -0.044986012 0.05768104
## Visibility.mi. -0.27600366 -0.31312583 -0.394750268 -0.042488629 0.14612564
## Wind_Speed.mph. -0.07349280 0.04930169 0.150801066 -0.488939867 -0.42101904
## Precipitation.in. 0.12551213 0.28627132 0.323182404 0.029245833 -0.26880790
## Amenity -0.08637640 -0.24859072 0.193641560 -0.001873216 -0.01253553
## Crossing -0.15052309 -0.33268563 0.332865892 0.111591573 -0.06179652
## Junction 0.06879730 0.20426707 -0.111701607 -0.303296410 0.36113427
## Station -0.06969212 -0.28952469 0.259786800 0.114741780 0.03251732
## Stop -0.06692099 -0.08567282 0.067524170 0.032047721 -0.06408629
## Traffic_Signal -0.15091976 -0.34338142 0.347878890 0.051717323 0.08581907
## Sunrise_Sunset -0.31666053 0.06740811 0.087627770 -0.298187234 -0.36893601
## PC6 PC7 PC8 PC9 PC10
## Severity 0.50256602 0.08320807 -0.605031985 0.13025051 0.227274027
## Start_Lat -0.16128606 -0.03200206 0.137967084 -0.17383541 -0.142189997
## Start_Lng -0.44351888 0.29890724 -0.254894354 -0.28883273 0.260624055
## Distance.mi. 0.39783952 -0.16879862 -0.046568301 -0.07972604 -0.291902613
## Temperature.F. 0.12488902 -0.09288932 0.004073236 0.01014637 -0.009050120
## Wind_Chill.F. 0.12125597 -0.09701065 0.019123530 0.01157946 -0.004469226
## Humidity... 0.02758672 -0.03192433 0.060094254 0.10213089 -0.028480923
## Pressure.in. 0.22880230 -0.26647927 0.297819033 0.08769216 0.155670100
## Visibility.mi. -0.03501625 0.13373518 -0.207562846 -0.04305958 -0.034012947
## Wind_Speed.mph. 0.13601603 0.07408662 -0.355689788 0.02710457 -0.026725735
## Precipitation.in. 0.14804284 -0.12766276 0.090056471 -0.18196849 0.036539136
## Amenity 0.07689827 -0.55572188 0.007799604 -0.50826204 0.266623555
## Crossing -0.03192197 -0.06326242 0.005670506 -0.21270915 -0.202365398
## Junction 0.08912523 0.14700536 -0.190488141 -0.63812306 -0.116698949
## Station 0.17615582 0.27452467 0.101840696 -0.08329362 -0.658891982
## Stop 0.42180917 0.56713391 0.461909112 -0.17712500 0.410425303
## Traffic_Signal -0.02859160 -0.07830988 -0.093431326 0.23596737 0.095660476
## Sunrise_Sunset -0.12933275 0.08303968 0.092327841 0.06487112 0.096161286
## PC11 PC12 PC13 PC14
## Severity -0.12123828 -0.051886460 0.320824212 0.112119786
## Start_Lat 0.22444499 -0.405561660 0.182199059 -0.042575280
## Start_Lng -0.29754740 0.247099689 -0.157411829 -0.023970096
## Distance.mi. 0.04935802 0.429127097 -0.530646736 0.007808708
## Temperature.F. -0.03262408 -0.054734701 0.008285423 0.020180474
## Wind_Chill.F. -0.04026381 -0.047488999 0.009905273 0.018757292
## Humidity... 0.13316732 0.155319639 0.017203267 0.162661069
## Pressure.in. -0.51835163 -0.101050196 0.045105191 0.208286145
## Visibility.mi. 0.04722560 -0.192130692 -0.048141733 -0.146942458
## Wind_Speed.mph. 0.19817288 -0.090224055 -0.041779238 0.021982392
## Precipitation.in. -0.45383170 -0.419951079 -0.141138975 -0.247251965
## Amenity 0.22866634 0.269618971 0.234553545 -0.247222406
## Crossing 0.06978339 -0.164234177 -0.162204471 0.743724528
## Junction -0.16294524 -0.120979830 -0.134175368 0.073474499
## Station -0.20177828 0.163433449 0.297125223 -0.301858446
## Stop 0.23529134 0.008191666 -0.096390625 -0.011900353
## Traffic_Signal -0.02282650 -0.138933868 -0.561626956 -0.310756822
## Sunrise_Sunset -0.35647571 0.407720134 0.136756270 0.153940727
## PC15 PC16 PC17 PC18
## Severity 0.12491331 -0.21983257 -0.049197403 2.776223e-03
## Start_Lat 0.45216718 -0.05407190 0.141584380 -3.433973e-04
## Start_Lng -0.07340169 0.34351883 -0.025335742 2.359813e-03
## Distance.mi. -0.05970045 -0.07051364 0.021004673 6.006248e-05
## Temperature.F. 0.03210835 0.17208263 -0.057496876 7.065500e-01
## Wind_Chill.F. 0.03258877 0.14971013 -0.045388913 -7.067358e-01
## Humidity... 0.13275093 0.15853197 0.682373830 1.577367e-02
## Pressure.in. 0.17541831 0.41633399 0.128567482 6.915084e-04
## Visibility.mi. -0.34951726 -0.14828988 0.611858021 2.690928e-03
## Wind_Speed.mph. 0.01044358 0.57710793 0.131655976 -3.189210e-02
## Precipitation.in. -0.38278004 -0.14625878 0.112565395 2.067319e-03
## Amenity -0.03448251 0.01580847 0.079397035 2.201671e-03
## Crossing -0.16864516 -0.09008897 -0.022464574 -8.506573e-04
## Junction 0.39937698 -0.08888054 0.001434936 -2.467305e-03
## Station 0.03010288 0.15499029 0.002874577 4.851715e-04
## Stop -0.01845377 0.01105624 0.020390950 1.262040e-03
## Traffic_Signal 0.45454885 -0.08357674 0.048140224 2.595207e-03
## Sunrise_Sunset 0.23391592 -0.38916643 0.273850423 -3.107422e-04
pr_var<-pca_out$sdev^2
pr_var
## [1] 2.328829731 1.920889220 1.681605850 1.548973834 1.238238154 1.070078139
## [7] 1.044496485 0.980055148 0.882297339 0.858862577 0.833154669 0.750480100
## [13] 0.713375118 0.687455068 0.582404316 0.469866769 0.402260193 0.006677291
pve_c<-pr_var/sum(pr_var)
pve_c
## [1] 0.1293794295 0.1067160678 0.0934225472 0.0860541019 0.0687910085
## [6] 0.0594487855 0.0580275825 0.0544475082 0.0490165188 0.0477145876
## [11] 0.0462863705 0.0416933389 0.0396319510 0.0381919482 0.0323557954
## [16] 0.0261037094 0.0223477885 0.0003709606
Principal component analysis describes the variation of multivariate data in terms of a set of new, uncorrelated variables defined to be a linear combination of the original variable. In doing so, principal components provides a lower dimensional representation of the dataset and attempts to summarize what is important.
The analysis yields 18 principal components with the first two components accounting for 13 and 11 percent of the variation. In other words, these two principal components account for 24 % of the variation.
pr_var<-pca_out$sdev^2
pve_c<-pr_var/sum(pr_var)
#pve
qplot(c(1:18), pve_c) +
geom_line() +
xlab('Principal_Components') +
ylab('Variance_Explained') + ggtitle('Scree_Plot') +
ylim(0,1)
Principal Components biplot
biplot(pca_out,scale=0, cex=.6)
The biplot attempts to summarize the loadings of the first two components. The longer the red line, the more important: temperature, windchillF. speed, humidity, visibility notable.
Hierarchical clustering separates the dataset into groups by measuring similarity between observations. By using complete linkage which is the maximal intercluster distance and the average distance.
Sc_car5<-car6 %>%
select_all()%>%
scale()
hc_complete=hclust(dist(Sc_car5), method="complete")
myDend1c<-as.dendrogram(hc_complete)
plot(myDend1c, main="Hierarchical clustering - complete - 3 clusters")
#plot(hc.complete)
k=3
n=nrow(Sc_car5)
MidPoint=(hc_complete$height[n-k]+hc_complete$height[n-k+1])/2
abline(h=MidPoint,lty=2)
#######
hc_average=hclust(dist(Sc_car5), method="average")
myDend2c<-as.dendrogram(hc_average)
plot(myDend2c, main="Hierarchical clustering - average linkage - 3 clusters")
k=3
n=nrow(Sc_car5)
MidPoint2c=(hc_average$height[n-k]+hc_average$height[n-k+1])/2
abline(h=MidPoint2c,lty=2)
We see that both dendrograms result in 3 clusters.
This could provide a starting point for further study.
The purpose of the study is to understand where the car accidents are occurring in Nassau County, NY.
This could aid emergency personnel and Level 1 trauma centers in developing accident prevention programs. In addition, we sought to understand what is causing the more severe accidents. Likewise, understanding which variables are important in predicting accident severity could aid in accident prevention.
Accident prevention programs could be implemented based on studies of this type.
The analysis predicting severity of accident in Nassau county names location, distance, temperature, traffic signal as important.
This tells us the location of the accident suggests particular areas are dangerous roadways. Could these roadways be safer, perhaps by reducing the speed limit in a section of roadway?
PCA the more important: temperature, windchillF. windspeed, humidity, visibility notable.
In addition, weather conditions do affect safe driving. Perhaps defensive driving courses offered in high schools and libraries would be a remedy for accident prevention.