Assignment

Homework #4

You get to decide which dataset you want to work on. The data set must be different from the ones used in previous homeworks You can work on a problem from your job, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.Gov, Census Bureau, USGS or other open data portals.

Select one of the methodologies studied in weeks 1-10, and one methodology from weeks 11-15 to apply in the new dataset selected. To complete this task:.

  • describe the problem you are trying to solve.
  • describe your datases and what you did to prepare the data for analysis.
  • methodologies you used for analyzing the data
  • what’s the purpose of the analysis performed
  • make your conclusions from your analysis.

Please be sure to address the business impact (it could be of any domain) of your solution.

Your final presentation could be the traditional R file or Python file and essay, or it could be an oral presentation with the execution and explanation of your code, recorded on any platform of your choice (Youtube, Free Cam). If you select the presentation, it should be a 5 to 8 minutes recording.

Problem

An analysis of motor vehicle accidents in Nassau county, Long Island, NY could provide useful information to a level 1 adult trauma center in Nassau County, Long Island.

Building a models predicting accident severity with feature variables including location, weather attributes and amenities could aid in injury prevention as well.

Unsupervised methods could provide important attributes in severe accidents.

Dataset:

US-Accidents: A Countrywide Traffic Accident Dataset Published: January 2021

Dataset can be assessed at address below:

https://smoosavi.org/datasets/us_accidents

Description:
This is a countrywide traffic accident dataset, which covers 49 states of the United States. The data is continuously being collected from February 2016, using several data providers, including multiple APIs that provide streaming traffic event data. These APIs broadcast traffic events captured by a variety of entities, such as the US and state departments of transportation, law enforcement agencies, traffic cameras, and traffic sensors within the road-networks. Currently, there are about 1.5 million accident records in this dataset. Check the below descriptions for more detailed information. (Moosavi, 2021).

Attribute Description

  • ID This is a unique identifier of the accident record. No
  • Severity Shows the severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay). No
  • Start_Time Shows start time of the accident in local time zone. No
  • End_Time Shows end time of the accident in local time zone. End time here refers to when the impact of accident on traffic flow was dismissed. No
  • Start_Lat Shows latitude in GPS coordinate of the start point. No
  • Start_Lng Shows longitude in GPS coordinate of the start point. No
  • End_Lat Shows latitude in GPS coordinate of the end point. Yes
  • End_Lng Shows longitude in GPS coordinate of the end point. Yes
  • Distance(mi) The length of the road extent affected by the accident. No
  • Description Shows natural language description of the accident. No
  • Number Shows the street number in address field. Yes
  • Street Shows the street name in address field. Yes
  • Side Shows the relative side of the street (Right/Left) in address field. Yes
  • City Shows the city in address field. Yes
  • County Shows the county in address field. Yes
  • State Shows the state in address field. Yes
  • Zipcode Shows the zipcode in address field. Yes
  • Country Shows the country in address field. Yes
  • Timezone Shows timezone based on the location of the accident (eastern, central, etc.). Yes
  • Airport_Code Denotes an airport-based weather station which is the closest one to location of the accident. Yes
  • Weather_Timestamp Shows the time-stamp of weather observation record (in local time). Yes
  • Temperature(F) Shows the temperature (in Fahrenheit). Yes
  • Wind_Chill(F) Shows the wind chill (in Fahrenheit). Yes
  • Humidity(%) Shows the humidity (in percentage). Yes
  • Pressure(in) Shows the air pressure (in inches). Yes
  • Visibility(mi) Shows visibility (in miles). Yes
  • Wind_Direction Shows wind direction. Yes
  • Wind_Speed(mph) Shows wind speed (in miles per hour). Yes
  • Precipitation(in) Shows precipitation amount in inches, if there is any. Yes
  • Weather_Condition Shows the weather condition (rain, snow, thunderstorm, fog, etc.) Yes
  • Amenity A POI annotation which indicates presence of amenity in a nearby location. No
  • Bump A POI annotation which indicates presence of speed bump or hump in a nearby location. No
  • Crossing A POI annotation which indicates pr- esence of crossing in a nearby location. No
  • Give_Way A POI annotation which indicates presence of give_way in a nearby location. No
  • Junction A POI annotation which indicates presence of junction in a nearby location. No
  • No_Exit A POI annotation which indicates presence of no_exit in a nearby location. No
  • Railway A POI annotation which indicates presence of railway in a nearby location. No
  • Roundabout A POI annotation which indicates presence of roundabout in a nearby location. No
  • Station A POI annotation which indicates presence of station in a nearby location. No
  • Stop A POI annotation which indicates presence of stop in a nearby location. No
  • Traffic_Calming A POI annotation which indicates presence of traffic_calming in a nearby location. No
  • Traffic_Signal A POI annotation which indicates presence of traffic_signal in a nearby loction. No
  • Turning_Loop A POI annotation which indicates presence of turning_loop in a nearby location. No
  • Sunrise_Sunset Shows the period of day (i.e. day or night) based on sunrise/sunset. Yes
  • Civil_Twilight Shows the period of day (i.e. day or night) based on civil twilight. Yes
  • Nautical_Twilight Shows the period of day (i.e. day or night) based on nautical twilight. Yes
  • Astronomical_Twilight Shows the period of day (i.e. day or night) based on astronomical twilight. Yes

References:

Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. “A Countrywide Traffic Accident Dataset.”, arXiv preprint arXiv:1906.05409 (2019).

Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. “Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights.” In proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.

The dataset contains 2,845,342 records.

This study will be a subset of the dataset US-Accidents, traffic accidents in Nassau County, Long Island, NY 6,219 records.

#file1<-"C:/Users/Lisa/OneDrive/CUNY/622/HW4/US_Accidents_Dec21_updated.csv"


#car1<-read.csv(file1,header=TRUE)
#2,845,342 records

#car2<-car1[car1$State=="NY"& car1$County=="Nassau",]
#write.csv(car2,"C:/Users/Lisa/OneDrive/CUNY/622/HW4/car2.csv", row.names = TRUE)

file3<-"C:/Users/Lisa/OneDrive/CUNY/622/HW4/car2.csv"


car2<-read.csv(file3,header=TRUE)

Clean the dataset

Imputation:

If windchill was missing, impute with temperature.

If precipitation was missing, impute with 0.

The POI’s below were converted to 1, 0 (True, False):

  • Amenity
  • Crossing
  • Junction
  • Station
  • Stop
  • Traffic_Signal

Sunrise_Sunset is 1 for Day and 0 for Night.

Remaining records with NA are excluded.

car3<-car2
summary(car3)
##        X                ID               Severity      Start_Time       
##  Min.   :  31708   Length:6219        Min.   :1.000   Length:6219       
##  1st Qu.: 608280   Class :character   1st Qu.:2.000   Class :character  
##  Median :1501829   Mode  :character   Median :2.000   Mode  :character  
##  Mean   :1459704                      Mean   :2.236                     
##  3rd Qu.:2244843                      3rd Qu.:2.000                     
##  Max.   :2844969                      Max.   :4.000                     
##                                                                         
##    End_Time           Start_Lat       Start_Lng         End_Lat     
##  Length:6219        Min.   :40.59   Min.   :-73.75   Min.   :40.59  
##  Class :character   1st Qu.:40.69   1st Qu.:-73.65   1st Qu.:40.69  
##  Mode  :character   Median :40.72   Median :-73.59   Median :40.72  
##                     Mean   :40.73   Mean   :-73.59   Mean   :40.73  
##                     3rd Qu.:40.78   3rd Qu.:-73.53   3rd Qu.:40.78  
##                     Max.   :40.87   Max.   :-73.42   Max.   :40.87  
##                                                                     
##     End_Lng        Distance.mi.     Description            Number    
##  Min.   :-73.89   Min.   : 0.0000   Length:6219        Min.   :   1  
##  1st Qu.:-73.64   1st Qu.: 0.1430   Class :character   1st Qu.: 201  
##  Median :-73.58   Median : 0.5070   Mode  :character   Median : 598  
##  Mean   :-73.59   Mean   : 0.9515                      Mean   :1146  
##  3rd Qu.:-73.52   3rd Qu.: 1.2280                      3rd Qu.:1799  
##  Max.   :-73.40   Max.   :17.5910                      Max.   :8649  
##                                                        NA's   :5609  
##     Street              Side               City              County         
##  Length:6219        Length:6219        Length:6219        Length:6219       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     State             Zipcode            Country            Timezone        
##  Length:6219        Length:6219        Length:6219        Length:6219       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Airport_Code       Weather_Timestamp  Temperature.F. Wind_Chill.F.   
##  Length:6219        Length:6219        Min.   : 7.0   Min.   :-11.10  
##  Class :character   Class :character   1st Qu.:43.0   1st Qu.: 36.00  
##  Mode  :character   Mode  :character   Median :55.0   Median : 52.00  
##                                        Mean   :55.7   Mean   : 51.18  
##                                        3rd Qu.:69.1   3rd Qu.: 67.00  
##                                        Max.   :98.0   Max.   : 98.00  
##                                        NA's   :25     NA's   :1160    
##   Humidity...      Pressure.in.   Visibility.mi.   Wind_Direction    
##  Min.   : 11.00   Min.   :28.91   Min.   : 0.000   Length:6219       
##  1st Qu.: 48.00   1st Qu.:29.82   1st Qu.:10.000   Class :character  
##  Median : 66.00   Median :29.97   Median :10.000   Mode  :character  
##  Mean   : 64.93   Mean   :29.97   Mean   : 9.034                     
##  3rd Qu.: 83.00   3rd Qu.:30.13   3rd Qu.:10.000                     
##  Max.   :100.00   Max.   :30.85   Max.   :10.000                     
##  NA's   :25       NA's   :18      NA's   :32                         
##  Wind_Speed.mph. Precipitation.in. Weather_Condition    Amenity         
##  Min.   : 0.00   Min.   :0.0000    Length:6219        Length:6219       
##  1st Qu.: 6.90   1st Qu.:0.0000    Class :character   Class :character  
##  Median : 9.20   Median :0.0000    Mode  :character   Mode  :character  
##  Mean   :10.29   Mean   :0.0086                                         
##  3rd Qu.:13.80   3rd Qu.:0.0000                                         
##  Max.   :36.00   Max.   :1.3200                                         
##  NA's   :119     NA's   :1477                                           
##      Bump             Crossing           Give_Way           Junction        
##  Length:6219        Length:6219        Length:6219        Length:6219       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##    No_Exit            Railway           Roundabout          Station         
##  Length:6219        Length:6219        Length:6219        Length:6219       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##      Stop           Traffic_Calming    Traffic_Signal     Turning_Loop      
##  Length:6219        Length:6219        Length:6219        Length:6219       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Sunrise_Sunset     Civil_Twilight     Nautical_Twilight  Astronomical_Twilight
##  Length:6219        Length:6219        Length:6219        Length:6219          
##  Class :character   Class :character   Class :character   Class :character     
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character     
##                                                                                
##                                                                                
##                                                                                
## 
#if wind_chill missing use temp
car3$Wind_Chill.F.=ifelse(is.na(car3$Wind_Chill.F.),car3$Temperature.F.,car3$Wind_Chill.F.)

car3$Precipitation.in.=ifelse(is.na(car3$Precipitation.in.),0,car3$Precipitation.in.)

car3$Amenity=ifelse((car3$Amenity=='True'),1,0)
car3$Bump=ifelse((car3$Bump=='True'),1,0)
car3$Crossing=ifelse((car3$Crossing=='True'),1,0)
car3$Give_Way=ifelse((car3$Give_Way=='True'),1,0)
car3$Junction=ifelse((car3$Junction=='True'),1,0)
car3$No_Exit=ifelse((car3$No_Exit=='True'),1,0)
car3$Railway=ifelse((car3$Railway=='True'),1,0)
car3$Roundabout=ifelse((car3$Roundabout=='True'),1,0)
car3$Station=ifelse((car3$Station=='True'),1,0)
car3$Stop=ifelse((car3$Stop=='True'),1,0)
car3$Traffic_Calming=ifelse((car3$Traffic_Calming=='True'),1,0)
car3$Traffic_Signal=ifelse((car3$Traffic_Signal=='True'),1,0)
car3$Turning_Loop=ifelse((car3$Turning_Loop=='True'),1,0)

car3$Sunrise_Sunset=ifelse((car3$Sunrise_Sunset=="Day"),1,0)


summary(car3)
##        X                ID               Severity      Start_Time       
##  Min.   :  31708   Length:6219        Min.   :1.000   Length:6219       
##  1st Qu.: 608280   Class :character   1st Qu.:2.000   Class :character  
##  Median :1501829   Mode  :character   Median :2.000   Mode  :character  
##  Mean   :1459704                      Mean   :2.236                     
##  3rd Qu.:2244843                      3rd Qu.:2.000                     
##  Max.   :2844969                      Max.   :4.000                     
##                                                                         
##    End_Time           Start_Lat       Start_Lng         End_Lat     
##  Length:6219        Min.   :40.59   Min.   :-73.75   Min.   :40.59  
##  Class :character   1st Qu.:40.69   1st Qu.:-73.65   1st Qu.:40.69  
##  Mode  :character   Median :40.72   Median :-73.59   Median :40.72  
##                     Mean   :40.73   Mean   :-73.59   Mean   :40.73  
##                     3rd Qu.:40.78   3rd Qu.:-73.53   3rd Qu.:40.78  
##                     Max.   :40.87   Max.   :-73.42   Max.   :40.87  
##                                                                     
##     End_Lng        Distance.mi.     Description            Number    
##  Min.   :-73.89   Min.   : 0.0000   Length:6219        Min.   :   1  
##  1st Qu.:-73.64   1st Qu.: 0.1430   Class :character   1st Qu.: 201  
##  Median :-73.58   Median : 0.5070   Mode  :character   Median : 598  
##  Mean   :-73.59   Mean   : 0.9515                      Mean   :1146  
##  3rd Qu.:-73.52   3rd Qu.: 1.2280                      3rd Qu.:1799  
##  Max.   :-73.40   Max.   :17.5910                      Max.   :8649  
##                                                        NA's   :5609  
##     Street              Side               City              County         
##  Length:6219        Length:6219        Length:6219        Length:6219       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##     State             Zipcode            Country            Timezone        
##  Length:6219        Length:6219        Length:6219        Length:6219       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Airport_Code       Weather_Timestamp  Temperature.F. Wind_Chill.F.   
##  Length:6219        Length:6219        Min.   : 7.0   Min.   :-11.10  
##  Class :character   Class :character   1st Qu.:43.0   1st Qu.: 38.00  
##  Mode  :character   Mode  :character   Median :55.0   Median : 55.00  
##                                        Mean   :55.7   Mean   : 53.48  
##                                        3rd Qu.:69.1   3rd Qu.: 69.10  
##                                        Max.   :98.0   Max.   : 98.00  
##                                        NA's   :25     NA's   :25      
##   Humidity...      Pressure.in.   Visibility.mi.   Wind_Direction    
##  Min.   : 11.00   Min.   :28.91   Min.   : 0.000   Length:6219       
##  1st Qu.: 48.00   1st Qu.:29.82   1st Qu.:10.000   Class :character  
##  Median : 66.00   Median :29.97   Median :10.000   Mode  :character  
##  Mean   : 64.93   Mean   :29.97   Mean   : 9.034                     
##  3rd Qu.: 83.00   3rd Qu.:30.13   3rd Qu.:10.000                     
##  Max.   :100.00   Max.   :30.85   Max.   :10.000                     
##  NA's   :25       NA's   :18      NA's   :32                         
##  Wind_Speed.mph. Precipitation.in.  Weather_Condition     Amenity        
##  Min.   : 0.00   Min.   :0.000000   Length:6219        Min.   :0.000000  
##  1st Qu.: 6.90   1st Qu.:0.000000   Class :character   1st Qu.:0.000000  
##  Median : 9.20   Median :0.000000   Mode  :character   Median :0.000000  
##  Mean   :10.29   Mean   :0.006573                      Mean   :0.008201  
##  3rd Qu.:13.80   3rd Qu.:0.000000                      3rd Qu.:0.000000  
##  Max.   :36.00   Max.   :1.320000                      Max.   :1.000000  
##  NA's   :119                                                             
##       Bump      Crossing         Give_Way            Junction         No_Exit 
##  Min.   :0   Min.   :0.0000   Min.   :0.0000000   Min.   :0.0000   Min.   :0  
##  1st Qu.:0   1st Qu.:0.0000   1st Qu.:0.0000000   1st Qu.:0.0000   1st Qu.:0  
##  Median :0   Median :0.0000   Median :0.0000000   Median :0.0000   Median :0  
##  Mean   :0   Mean   :0.0283   Mean   :0.0006432   Mean   :0.2058   Mean   :0  
##  3rd Qu.:0   3rd Qu.:0.0000   3rd Qu.:0.0000000   3rd Qu.:0.0000   3rd Qu.:0  
##  Max.   :0   Max.   :1.0000   Max.   :1.0000000   Max.   :1.0000   Max.   :0  
##                                                                               
##     Railway            Roundabout    Station             Stop         
##  Min.   :0.0000000   Min.   :0    Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.0000000   1st Qu.:0    1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.0000000   Median :0    Median :0.00000   Median :0.000000  
##  Mean   :0.0006432   Mean   :0    Mean   :0.01608   Mean   :0.002573  
##  3rd Qu.:0.0000000   3rd Qu.:0    3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.0000000   Max.   :0    Max.   :1.00000   Max.   :1.000000  
##                                                                       
##  Traffic_Calming Traffic_Signal     Turning_Loop Sunrise_Sunset  
##  Min.   :0       Min.   :0.00000   Min.   :0     Min.   :0.0000  
##  1st Qu.:0       1st Qu.:0.00000   1st Qu.:0     1st Qu.:0.0000  
##  Median :0       Median :0.00000   Median :0     Median :1.0000  
##  Mean   :0       Mean   :0.04904   Mean   :0     Mean   :0.6692  
##  3rd Qu.:0       3rd Qu.:0.00000   3rd Qu.:0     3rd Qu.:1.0000  
##  Max.   :0       Max.   :1.00000   Max.   :0     Max.   :1.0000  
##                                                                  
##  Civil_Twilight     Nautical_Twilight  Astronomical_Twilight
##  Length:6219        Length:6219        Length:6219          
##  Class :character   Class :character   Class :character     
##  Mode  :character   Mode  :character   Mode  :character     
##                                                             
##                                                             
##                                                             
## 

We will be including the following variables:

  • Severity
  • Start_Lat
  • Start_Lng
  • Distance.mi.
  • Temperature.F.
  • Wind_Chill.F.
  • Humidity…
  • Pressure.in.
  • Visibility.mi.
  • Wind_Speed.mph.
  • Precipitation.in.
  • Amenity
  • Crossing
  • Junction
  • Station
  • Stop
  • Traffic_Signal
  • Sunrise_Sunset
car4<-car3 %>%
  select(Severity,
Start_Lat,
Start_Lng,
Distance.mi.,
Temperature.F.,
Wind_Chill.F.,
Humidity...,
Pressure.in.,
Visibility.mi.,
Wind_Speed.mph.,
Precipitation.in.,
Amenity,
Crossing,
Junction,
Station,
Stop,
Traffic_Signal,
Sunrise_Sunset)

#Remove records is missing values

car4[car4==""] <- NA
car4<-na.omit(car4)
#car4 %>% drop_na(car4)
str(car4)
## 'data.frame':    6093 obs. of  18 variables:
##  $ Severity         : int  2 2 3 3 4 3 2 2 3 2 ...
##  $ Start_Lat        : num  40.7 40.8 40.8 40.8 40.8 ...
##  $ Start_Lng        : num  -73.7 -73.7 -73.6 -73.7 -73.5 ...
##  $ Distance.mi.     : num  1.488 1.019 0.299 2.328 1.292 ...
##  $ Temperature.F.   : num  55 51.1 46 45 41 39.9 45 48 37.9 42.1 ...
##  $ Wind_Chill.F.    : num  55 51.1 39.3 37.4 34.3 34.5 38.2 48 33.5 37.1 ...
##  $ Humidity...      : int  93 44 54 53 53 97 86 61 65 70 ...
##  $ Pressure.in.     : num  29.6 29.9 30.1 30.1 30.2 ...
##  $ Visibility.mi.   : num  2 10 10 10 10 3 10 10 10 10 ...
##  $ Wind_Speed.mph.  : num  11.5 19.6 16.1 18.4 11.5 8.1 15 12.7 5.8 8.1 ...
##  $ Precipitation.in.: num  0.01 0 0 0 0 0.01 0 0 0 0 ...
##  $ Amenity          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Crossing         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Junction         : num  1 0 1 1 0 1 0 0 1 0 ...
##  $ Station          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Stop             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Traffic_Signal   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Sunrise_Sunset   : num  0 1 0 0 1 1 1 0 1 1 ...
##  - attr(*, "na.action")= 'omit' Named int [1:126] 9 59 62 63 64 65 67 68 79 93 ...
##   ..- attr(*, "names")= chr [1:126] "9" "59" "62" "63" ...
summary(car4)
##     Severity      Start_Lat       Start_Lng       Distance.mi.   
##  Min.   :1.00   Min.   :40.59   Min.   :-73.75   Min.   : 0.000  
##  1st Qu.:2.00   1st Qu.:40.69   1st Qu.:-73.65   1st Qu.: 0.140  
##  Median :2.00   Median :40.72   Median :-73.59   Median : 0.497  
##  Mean   :2.23   Mean   :40.73   Mean   :-73.59   Mean   : 0.950  
##  3rd Qu.:2.00   3rd Qu.:40.78   3rd Qu.:-73.53   3rd Qu.: 1.228  
##  Max.   :4.00   Max.   :40.87   Max.   :-73.42   Max.   :17.591  
##  Temperature.F. Wind_Chill.F.     Humidity...      Pressure.in.  
##  Min.   : 7.0   Min.   :-11.10   Min.   : 11.00   Min.   :28.91  
##  1st Qu.:43.0   1st Qu.: 38.00   1st Qu.: 48.00   1st Qu.:29.82  
##  Median :55.0   Median : 55.00   Median : 65.00   Median :29.97  
##  Mean   :55.8   Mean   : 53.54   Mean   : 64.79   Mean   :29.97  
##  3rd Qu.:69.1   3rd Qu.: 69.10   3rd Qu.: 83.00   3rd Qu.:30.13  
##  Max.   :98.0   Max.   : 98.00   Max.   :100.00   Max.   :30.85  
##  Visibility.mi.   Wind_Speed.mph. Precipitation.in.     Amenity       
##  Min.   : 0.000   Min.   : 0.0    Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:10.000   1st Qu.: 6.9    1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :10.000   Median : 9.2    Median :0.000000   Median :0.00000  
##  Mean   : 9.042   Mean   :10.3    Mean   :0.006698   Mean   :0.00837  
##  3rd Qu.:10.000   3rd Qu.:13.8    3rd Qu.:0.000000   3rd Qu.:0.00000  
##  Max.   :10.000   Max.   :36.0    Max.   :1.320000   Max.   :1.00000  
##     Crossing          Junction         Station             Stop         
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.00000   Min.   :0.000000  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.000000  
##  Median :0.00000   Median :0.0000   Median :0.00000   Median :0.000000  
##  Mean   :0.02839   Mean   :0.2053   Mean   :0.01608   Mean   :0.002626  
##  3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.000000  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.00000   Max.   :1.000000  
##  Traffic_Signal    Sunrise_Sunset  
##  Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.00000   Median :1.0000  
##  Mean   :0.04973   Mean   :0.6718  
##  3rd Qu.:0.00000   3rd Qu.:1.0000  
##  Max.   :1.00000   Max.   :1.0000

Data Exploration

var1<-car4 %>% 
  select(Severity,
Start_Lat,
Start_Lng,
Distance.mi.,
Temperature.F.,
Wind_Chill.F.,
Humidity...,
Pressure.in.,
Visibility.mi.,
Wind_Speed.mph.,
Precipitation.in.
 ) %>%
  gather(key='variables',value='value')

var1%>%
  ggplot(aes(x=variables, y=value)) +
  geom_boxplot() +
  labs(title='Boxplot of Numerical Variables')+
  theme(plot.title=element_text(hjust=0.5))+
  coord_flip()

DataExplorer::plot_histogram(car4)

From the histograms above, we can classify the variables into the distribution categories:

  • Symmetric: Pressure, temperature, windchill, longitude
  • Left-skewed: humidity, visibility
  • Right-skewed: windspeed,distance, precipitation, severity
  • Multi-modal: latitude, windspeed
dataset1<-car4 %>%
  group_by(Severity, Sunrise_Sunset) %>%
  summarise(S_count=n())
## `summarise()` has grouped output by 'Severity'. You can override using the `.groups` argument.
overallplot<-ggplot(dataset1,aes(x=Severity, y=S_count,fill=Sunrise_Sunset))+geom_bar(stat="identity")+ theme(axis.text.x=element_text(angle=90,size = 5, face = "bold"), axis.title.y = element_text( size = 6, face = "bold"))+labs(title = "Car Accidents Nassau County, NY",
       subtitle = " Mar 2016- Dec 2021  ",
       x = "Severity", y = "Number of Accidents")

overallplot

The above plot suggests that most accidents are a level 2 occurring during the day.

Mapping

Mapping all accidents

map1<-car4 %>%
  leaflet () %>%
  addTiles () %>%
  addHeatmap(lng=~Start_Lng, lat=~Start_Lat, intensity=2, blur=4, max=1, radius=4)
map1

Interested in severity 3,4 more likely injuries:

Mapping Severity 3 & 4 accidents:

car5 <-  car4 %>% filter(car4$Severity==4 |car4$Severity==3)

map2<-car5 %>%
  leaflet () %>%
  addTiles () %>%
  addHeatmap(lng=~Start_Lng, lat=~Start_Lat, intensity=2, blur=4, max=1, radius=4)
#mapview(car4, xcol="Start_Lng", ycol="Start_Lat", crs =4269, grid=FALSE)
map2

Side by side: All accidents vs severity 4, 5

library(leafsync)
## Warning: package 'leafsync' was built under R version 4.1.3
sync(map1, map2, ncol = 2)

Supervised Learning

Splitting the dataset

We will use 75% of the data to train and 25% to test our models.

set.seed(123)
split = sample.split(car4$Severity, SplitRatio = 0.8)
train = subset(car4, split == TRUE)
test = subset(car4, split == FALSE)
Y_train <- train[,'Severity']
Y_test <- test[,'Severity']
X_train <- train[, !(colnames(train) == 'Severity')]
X_test <- test[, !(colnames(test) == 'Severity')]

Random Forest: Predicting Severity of Accident

library(randomForest)
set.seed(123)
car_rf = randomForest(x = X_train,
                          y = Y_train,
                      mtry=12,
                      importance=TRUE,
                          ntree = 500)
## Warning in randomForest.default(x = X_train, y = Y_train, mtry = 12, importance
## = TRUE, : The response has five or fewer unique values. Are you sure you want to
## do regression?
car_rf
## 
## Call:
##  randomForest(x = X_train, y = Y_train, ntree = 500, mtry = 12,      importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 12
## 
##           Mean of squared residuals: 0.2279701
##                     % Var explained: 25.73
# Predicting the Test set results
y_pred = predict(car_rf, newdata = test)
plot(y_pred, test$Severity)
abline(0,1)

mean((y_pred-test$Severity)^2)
## [1] 0.2388338

The test MSE = .2388

importance(car_rf)
##                     %IncMSE IncNodePurity
## Start_Lat         71.019228    238.082629
## Start_Lng         42.782750    180.936961
## Distance.mi.      98.893070    202.538584
## Temperature.F.    35.704815    101.275681
## Wind_Chill.F.     35.182275    107.530585
## Humidity...       33.221898    127.353185
## Pressure.in.      44.266395    164.609413
## Visibility.mi.    18.587023     24.459878
## Wind_Speed.mph.   29.375466    110.296317
## Precipitation.in. 12.377185     22.813823
## Amenity            7.235074      3.828830
## Crossing          20.432000     13.269731
## Junction          18.418312     15.533216
## Station           22.809808     13.247730
## Stop               1.130248      1.100663
## Traffic_Signal    28.275807     22.220746
## Sunrise_Sunset    25.117332     23.262563
varImpPlot(car_rf)

The random forest names latitude, distance, ongitude, distance, pressure, humidity and temperature.

Latitude and longitude suggests particular areas are dangerous roadways.

Neural Network: Predicting Severity of Accident

set.seed(12)
nnetGrid <- expand.grid(.decay = c(0,0.01,.1),
                        .size = c(1:5),
                        .bag = FALSE)
nnetFit <- train(X_train, Y_train,
                  method = 'avNNet',
                  preProc = c('center','scale'),
                  tuneGrid = nnetGrid,
                  linout = TRUE,
                  trace = FALSE,
                  MaxNWts = 5 * (ncol(X_train) + 1 + 5 + 1),
                  maxit = 100
  
)
## Warning: executing %dopar% sequentially: no parallel backend registered
nnetFit
## Model Averaged Neural Network 
## 
## 4875 samples
##   17 predictor
## 
## Pre-processing: centered (17), scaled (17) 
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 4875, 4875, 4875, 4875, 4875, 4875, ... 
## Resampling results across tuning parameters:
## 
##   decay  size  RMSE       Rsquared    MAE      
##   0.00   1     0.5466415  0.03524544  0.3737422
##   0.00   2     0.5426167  0.03733871  0.3710252
##   0.00   3     0.5416536  0.04285998  0.3668341
##   0.00   4     0.5443025  0.04685063  0.3641160
##   0.00   5     0.5404815  0.05088628  0.3622794
##   0.01   1     0.5435461  0.03446969  0.3722347
##   0.01   2     0.5428102  0.03744412  0.3697361
##   0.01   3     0.5413183  0.04340994  0.3658335
##   0.01   4     0.5406490  0.04871990  0.3627246
##   0.01   5     0.5406009  0.05053923  0.3615607
##   0.10   1     0.5445937  0.03214851  0.3716450
##   0.10   2     0.5417336  0.04111046  0.3679597
##   0.10   3     0.5407235  0.04566816  0.3654301
##   0.10   4     0.5403906  0.04928328  0.3621715
##   0.10   5     0.5407640  0.05018304  0.3617645
## 
## Tuning parameter 'bag' was held constant at a value of FALSE
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 4, decay = 0.1 and bag = FALSE.
varimp <- varImp(nnetFit)
varimp$importance %>%
  arrange(desc(Overall)) %>%
  kable() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>% scroll_box(width="100%",height="400px")
Overall
Distance.mi. 100.0000000
Start_Lat 79.9046430
Traffic_Signal 43.2623306
Crossing 41.9281369
Start_Lng 37.5710780
Wind_Chill.F. 15.3257766
Temperature.F. 13.9341843
Humidity… 7.6002692
Station 7.0046778
Sunrise_Sunset 6.2018433
Amenity 6.1750330
Precipitation.in. 2.7610988
Junction 1.6939199
Wind_Speed.mph. 1.6466898
Pressure.in. 0.2518870
Stop 0.1322378
Visibility.mi. 0.0000000

The Neural network names distance, Latitude, traffic signal, crossing & longitude as the top 5 important variables.

Unsupervised Learning

Principal Component Analysis: Accidents of Severity 4,5

The dataset of Nassau County car accidents of severity 3 or 4 contains 1061 records.

car6<-car5
pca_out=prcomp(car6,scale=TRUE)
pca_out
## Standard deviations (1, .., p=18):
##  [1] 1.52605037 1.38596148 1.29676746 1.24457777 1.11276150 1.03444581
##  [7] 1.02200611 0.98997735 0.93930684 0.92674839 0.91277307 0.86630255
## [13] 0.84461537 0.82912910 0.76315419 0.68546828 0.63423985 0.08171469
## 
## Rotation (n x k) = (18 x 18):
##                           PC1         PC2          PC3          PC4         PC5
## Severity           0.09835738 -0.09201160  0.048113182  0.289142987 -0.05014822
## Start_Lat          0.03159939  0.01538580 -0.237883395  0.337923578 -0.50416740
## Start_Lng          0.02127685 -0.06018918 -0.008308465  0.421120681 -0.13096475
## Distance.mi.       0.10290338  0.08344665 -0.328467821  0.180022449 -0.29057941
## Temperature.F.    -0.56240420  0.28093050 -0.021821004  0.181581952  0.09594789
## Wind_Chill.F.     -0.55376529  0.28205198 -0.020100272  0.212725226  0.11901125
## Humidity...        0.27798385  0.33068902  0.313241495  0.245468479  0.24720640
## Pressure.in.       0.11613038 -0.32035872 -0.299331606 -0.044986012  0.05768104
## Visibility.mi.    -0.27600366 -0.31312583 -0.394750268 -0.042488629  0.14612564
## Wind_Speed.mph.   -0.07349280  0.04930169  0.150801066 -0.488939867 -0.42101904
## Precipitation.in.  0.12551213  0.28627132  0.323182404  0.029245833 -0.26880790
## Amenity           -0.08637640 -0.24859072  0.193641560 -0.001873216 -0.01253553
## Crossing          -0.15052309 -0.33268563  0.332865892  0.111591573 -0.06179652
## Junction           0.06879730  0.20426707 -0.111701607 -0.303296410  0.36113427
## Station           -0.06969212 -0.28952469  0.259786800  0.114741780  0.03251732
## Stop              -0.06692099 -0.08567282  0.067524170  0.032047721 -0.06408629
## Traffic_Signal    -0.15091976 -0.34338142  0.347878890  0.051717323  0.08581907
## Sunrise_Sunset    -0.31666053  0.06740811  0.087627770 -0.298187234 -0.36893601
##                           PC6         PC7          PC8         PC9         PC10
## Severity           0.50256602  0.08320807 -0.605031985  0.13025051  0.227274027
## Start_Lat         -0.16128606 -0.03200206  0.137967084 -0.17383541 -0.142189997
## Start_Lng         -0.44351888  0.29890724 -0.254894354 -0.28883273  0.260624055
## Distance.mi.       0.39783952 -0.16879862 -0.046568301 -0.07972604 -0.291902613
## Temperature.F.     0.12488902 -0.09288932  0.004073236  0.01014637 -0.009050120
## Wind_Chill.F.      0.12125597 -0.09701065  0.019123530  0.01157946 -0.004469226
## Humidity...        0.02758672 -0.03192433  0.060094254  0.10213089 -0.028480923
## Pressure.in.       0.22880230 -0.26647927  0.297819033  0.08769216  0.155670100
## Visibility.mi.    -0.03501625  0.13373518 -0.207562846 -0.04305958 -0.034012947
## Wind_Speed.mph.    0.13601603  0.07408662 -0.355689788  0.02710457 -0.026725735
## Precipitation.in.  0.14804284 -0.12766276  0.090056471 -0.18196849  0.036539136
## Amenity            0.07689827 -0.55572188  0.007799604 -0.50826204  0.266623555
## Crossing          -0.03192197 -0.06326242  0.005670506 -0.21270915 -0.202365398
## Junction           0.08912523  0.14700536 -0.190488141 -0.63812306 -0.116698949
## Station            0.17615582  0.27452467  0.101840696 -0.08329362 -0.658891982
## Stop               0.42180917  0.56713391  0.461909112 -0.17712500  0.410425303
## Traffic_Signal    -0.02859160 -0.07830988 -0.093431326  0.23596737  0.095660476
## Sunrise_Sunset    -0.12933275  0.08303968  0.092327841  0.06487112  0.096161286
##                          PC11         PC12         PC13         PC14
## Severity          -0.12123828 -0.051886460  0.320824212  0.112119786
## Start_Lat          0.22444499 -0.405561660  0.182199059 -0.042575280
## Start_Lng         -0.29754740  0.247099689 -0.157411829 -0.023970096
## Distance.mi.       0.04935802  0.429127097 -0.530646736  0.007808708
## Temperature.F.    -0.03262408 -0.054734701  0.008285423  0.020180474
## Wind_Chill.F.     -0.04026381 -0.047488999  0.009905273  0.018757292
## Humidity...        0.13316732  0.155319639  0.017203267  0.162661069
## Pressure.in.      -0.51835163 -0.101050196  0.045105191  0.208286145
## Visibility.mi.     0.04722560 -0.192130692 -0.048141733 -0.146942458
## Wind_Speed.mph.    0.19817288 -0.090224055 -0.041779238  0.021982392
## Precipitation.in. -0.45383170 -0.419951079 -0.141138975 -0.247251965
## Amenity            0.22866634  0.269618971  0.234553545 -0.247222406
## Crossing           0.06978339 -0.164234177 -0.162204471  0.743724528
## Junction          -0.16294524 -0.120979830 -0.134175368  0.073474499
## Station           -0.20177828  0.163433449  0.297125223 -0.301858446
## Stop               0.23529134  0.008191666 -0.096390625 -0.011900353
## Traffic_Signal    -0.02282650 -0.138933868 -0.561626956 -0.310756822
## Sunrise_Sunset    -0.35647571  0.407720134  0.136756270  0.153940727
##                          PC15        PC16         PC17          PC18
## Severity           0.12491331 -0.21983257 -0.049197403  2.776223e-03
## Start_Lat          0.45216718 -0.05407190  0.141584380 -3.433973e-04
## Start_Lng         -0.07340169  0.34351883 -0.025335742  2.359813e-03
## Distance.mi.      -0.05970045 -0.07051364  0.021004673  6.006248e-05
## Temperature.F.     0.03210835  0.17208263 -0.057496876  7.065500e-01
## Wind_Chill.F.      0.03258877  0.14971013 -0.045388913 -7.067358e-01
## Humidity...        0.13275093  0.15853197  0.682373830  1.577367e-02
## Pressure.in.       0.17541831  0.41633399  0.128567482  6.915084e-04
## Visibility.mi.    -0.34951726 -0.14828988  0.611858021  2.690928e-03
## Wind_Speed.mph.    0.01044358  0.57710793  0.131655976 -3.189210e-02
## Precipitation.in. -0.38278004 -0.14625878  0.112565395  2.067319e-03
## Amenity           -0.03448251  0.01580847  0.079397035  2.201671e-03
## Crossing          -0.16864516 -0.09008897 -0.022464574 -8.506573e-04
## Junction           0.39937698 -0.08888054  0.001434936 -2.467305e-03
## Station            0.03010288  0.15499029  0.002874577  4.851715e-04
## Stop              -0.01845377  0.01105624  0.020390950  1.262040e-03
## Traffic_Signal     0.45454885 -0.08357674  0.048140224  2.595207e-03
## Sunrise_Sunset     0.23391592 -0.38916643  0.273850423 -3.107422e-04
pr_var<-pca_out$sdev^2
pr_var
##  [1] 2.328829731 1.920889220 1.681605850 1.548973834 1.238238154 1.070078139
##  [7] 1.044496485 0.980055148 0.882297339 0.858862577 0.833154669 0.750480100
## [13] 0.713375118 0.687455068 0.582404316 0.469866769 0.402260193 0.006677291
pve_c<-pr_var/sum(pr_var)
pve_c
##  [1] 0.1293794295 0.1067160678 0.0934225472 0.0860541019 0.0687910085
##  [6] 0.0594487855 0.0580275825 0.0544475082 0.0490165188 0.0477145876
## [11] 0.0462863705 0.0416933389 0.0396319510 0.0381919482 0.0323557954
## [16] 0.0261037094 0.0223477885 0.0003709606

Principal component analysis describes the variation of multivariate data in terms of a set of new, uncorrelated variables defined to be a linear combination of the original variable. In doing so, principal components provides a lower dimensional representation of the dataset and attempts to summarize what is important.

The analysis yields 18 principal components with the first two components accounting for 13 and 11 percent of the variation. In other words, these two principal components account for 24 % of the variation.

pr_var<-pca_out$sdev^2
pve_c<-pr_var/sum(pr_var)
#pve

qplot(c(1:18), pve_c) + 
  geom_line() +
  xlab('Principal_Components') +
  ylab('Variance_Explained') + ggtitle('Scree_Plot') + 
  ylim(0,1)

Principal Components biplot

biplot(pca_out,scale=0, cex=.6)

The biplot attempts to summarize the loadings of the first two components. The longer the red line, the more important: temperature, windchillF. speed, humidity, visibility notable.

Hierarchical Clustering: Accidents of severity 3,4

Hierarchical clustering separates the dataset into groups by measuring similarity between observations. By using complete linkage which is the maximal intercluster distance and the average distance.

Sc_car5<-car6 %>%
  select_all()%>%
  scale()

hc_complete=hclust(dist(Sc_car5), method="complete")

myDend1c<-as.dendrogram(hc_complete)
plot(myDend1c, main="Hierarchical clustering - complete - 3 clusters")
#plot(hc.complete)

k=3
n=nrow(Sc_car5)
MidPoint=(hc_complete$height[n-k]+hc_complete$height[n-k+1])/2
abline(h=MidPoint,lty=2)

#######
hc_average=hclust(dist(Sc_car5), method="average")

myDend2c<-as.dendrogram(hc_average)
plot(myDend2c, main="Hierarchical clustering - average linkage - 3 clusters")

k=3
n=nrow(Sc_car5)
MidPoint2c=(hc_average$height[n-k]+hc_average$height[n-k+1])/2
abline(h=MidPoint2c,lty=2)

We see that both dendrograms result in 3 clusters.

This could provide a starting point for further study.

Purpose

The purpose of the study is to understand where the car accidents are occurring in Nassau County, NY, and what variables drive severity of accident.

This could aid emergency personnel and Level 1 trauma centers in developing accident prevention programs. In addition, we sought to understand what is causing the more severe accidents. Likewise, understanding which variables are important in predicting accident severity could aid in accident prevention.

Conclusion

Accident prevention programs could be implemented based on studies of this type.

The analysis predicting severity of accident in Nassau county names location, distance, temperature, traffic signal as important.

This tells us the location of the accident suggests particular areas are dangerous roadways. Could these roadways be safer, perhaps by reducing the speed limit in a section of roadway?

PCA the more important: temperature, windchillF. windspeed, humidity, visibility notable.

In addition, weather conditions do affect safe driving. Perhaps defensive driving courses offered in high schools and libraries would be a remedy for accident prevention.