The aim of this project is to predict car accident severity. The data has been taken from the Traffic Records Group, Traffic Management Division, Seattle Department of Transportation. It comprises of all collisions and crashes that have occurred in the state from 2004 to 2019.The data has the key information that include Severity of the accident, Incident Date, number of vehicles and persons involved in accidents, Road type and conditions, Weather and light conditions.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
import pandas as pd
data = pd.read_csv("C:\\Users\\sswarupa\\Documents\\GitHub\\CarAccidentSeverity\\Data-Collisions.csv", low_memory=False)
data.head()
## SEVERITYCODE X Y ... SEGLANEKEY CROSSWALKKEY HITPARKEDCAR
## 0 2 -122.323148 47.703140 ... 0 0 N
## 1 1 -122.347294 47.647172 ... 0 0 N
## 2 1 -122.334540 47.607871 ... 0 0 N
## 3 1 -122.334803 47.604803 ... 0 0 N
## 4 2 -122.306426 47.545739 ... 0 0 N
##
## [5 rows x 38 columns]
data.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 194673 entries, 0 to 194672
## Data columns (total 38 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 SEVERITYCODE 194673 non-null int64
## 1 X 189339 non-null float64
## 2 Y 189339 non-null float64
## 3 OBJECTID 194673 non-null int64
## 4 INCKEY 194673 non-null int64
## 5 COLDETKEY 194673 non-null int64
## 6 REPORTNO 194673 non-null object
## 7 STATUS 194673 non-null object
## 8 ADDRTYPE 192747 non-null object
## 9 INTKEY 65070 non-null float64
## 10 LOCATION 191996 non-null object
## 11 EXCEPTRSNCODE 84811 non-null object
## 12 EXCEPTRSNDESC 5638 non-null object
## 13 SEVERITYCODE.1 194673 non-null int64
## 14 SEVERITYDESC 194673 non-null object
## 15 COLLISIONTYPE 189769 non-null object
## 16 PERSONCOUNT 194673 non-null int64
## 17 PEDCOUNT 194673 non-null int64
## 18 PEDCYLCOUNT 194673 non-null int64
## 19 VEHCOUNT 194673 non-null int64
## 20 INCDATE 194673 non-null object
## 21 INCDTTM 194673 non-null object
## 22 JUNCTIONTYPE 188344 non-null object
## 23 SDOT_COLCODE 194673 non-null int64
## 24 SDOT_COLDESC 194673 non-null object
## 25 INATTENTIONIND 29805 non-null object
## 26 UNDERINFL 189789 non-null object
## 27 WEATHER 189592 non-null object
## 28 ROADCOND 189661 non-null object
## 29 LIGHTCOND 189503 non-null object
## 30 PEDROWNOTGRNT 4667 non-null object
## 31 SDOTCOLNUM 114936 non-null float64
## 32 SPEEDING 9333 non-null object
## 33 ST_COLCODE 194655 non-null object
## 34 ST_COLDESC 189769 non-null object
## 35 SEGLANEKEY 194673 non-null int64
## 36 CROSSWALKKEY 194673 non-null int64
## 37 HITPARKEDCAR 194673 non-null object
## dtypes: float64(4), int64(12), object(22)
## memory usage: 56.4+ MB
There are 194673 rows and 38 columns. There are several variables in the dataset which looks like are ids and not useful for our model.
counts = data.nunique()
counts
## SEVERITYCODE 2
## X 23563
## Y 23839
## OBJECTID 194673
## INCKEY 194673
## COLDETKEY 194673
## REPORTNO 194670
## STATUS 2
## ADDRTYPE 3
## INTKEY 7614
## LOCATION 24102
## EXCEPTRSNCODE 2
## EXCEPTRSNDESC 1
## SEVERITYCODE.1 2
## SEVERITYDESC 2
## COLLISIONTYPE 10
## PERSONCOUNT 47
## PEDCOUNT 7
## PEDCYLCOUNT 3
## VEHCOUNT 13
## INCDATE 5985
## INCDTTM 162058
## JUNCTIONTYPE 7
## SDOT_COLCODE 39
## SDOT_COLDESC 39
## INATTENTIONIND 1
## UNDERINFL 4
## WEATHER 11
## ROADCOND 9
## LIGHTCOND 9
## PEDROWNOTGRNT 1
## SDOTCOLNUM 114932
## SPEEDING 1
## ST_COLCODE 63
## ST_COLDESC 62
## SEGLANEKEY 1955
## CROSSWALKKEY 2198
## HITPARKEDCAR 2
## dtype: int64
colli = data.iloc[:,[0,8,15,20,22,27,28,29,37]]
colli.head()
## SEVERITYCODE ADDRTYPE ... LIGHTCOND HITPARKEDCAR
## 0 2 Intersection ... Daylight N
## 1 1 Block ... Dark - Street Lights On N
## 2 1 Block ... Daylight N
## 3 1 Block ... Daylight N
## 4 2 Intersection ... Daylight N
##
## [5 rows x 9 columns]
I chose the severity of the accidents as the dependent variable. SEVERITYCODE is a categorical variable and follows a code that corresponds to the severity of the collision: 2 (injury) and 1 (property damage). I chose 8 attributes from the 37 available in the dataset: address type, collision type, incident date, junction type, weather, road condition, light condition and hit parked car.
#Checking for the null values
colli.isnull().sum()
## SEVERITYCODE 0
## ADDRTYPE 1926
## COLLISIONTYPE 4904
## INCDATE 0
## JUNCTIONTYPE 6329
## WEATHER 5081
## ROADCOND 5012
## LIGHTCOND 5170
## HITPARKEDCAR 0
## dtype: int64
# using dropna() function
mod_colli = colli.dropna(axis=0, how='any')
mod_colli.shape
## (182895, 9)
mod_colli.isnull().sum()
## SEVERITYCODE 0
## ADDRTYPE 0
## COLLISIONTYPE 0
## INCDATE 0
## JUNCTIONTYPE 0
## WEATHER 0
## ROADCOND 0
## LIGHTCOND 0
## HITPARKEDCAR 0
## dtype: int64
Removed rows with any missing values. 11,778 rows were deleted. The final dataset has 182895 rows, 9 columns and there are no missing values.
import seaborn as sns
sns.set()
sns.catplot(x='SEVERITYCODE', kind='count', data=mod_colli, height=3, aspect=3)
mod_colli['SEVERITYCODE'].value_counts()
## 1 126270
## 2 56625
## Name: SEVERITYCODE, dtype: int64
The target data looks imbalanced, however there are sufficient rows with SEVERITYCODE 2 to train the model. There are 12K rows with SEVERITYCODE 1 and only 5K rows with SEVERITYCODE 2. I tried balancing the target by simple random sampling however it did not help in improving the accuracy so excluding that step.
#Plotting Weather by Severity
import seaborn as sns
import matplotlib.pyplot as plt
chart = sns.catplot(x='WEATHER', kind='count', hue='SEVERITYCODE', data=mod_colli, height=3, aspect=3)
chart.set_xticklabels(rotation=45)
Most of the accidents happened with weather type clear, raining and overcast.
#Plotting Address type by Severity
import seaborn as sns
chart = sns.catplot(x='ADDRTYPE', kind='count', hue='SEVERITYCODE', data=mod_colli, height=3, aspect=3)
chart.set_xticklabels(rotation=45)
Most of the accidents happened on the block however the proportion of severity code 2 is significantly high on intersection.
#Plotting Collision type by Severity
import seaborn as sns
chart = sns.catplot(x='COLLISIONTYPE', kind='count', hue='SEVERITYCODE', data=mod_colli, height=3, aspect=3)
chart.set_xticklabels(rotation=45)
Most of the accidents happened with collision type parked car however they are mostly severity code 1. Most of the severity code 2 accidents happened with collision type rear ended and Angles.
#Plotting Junction type by Severity
import seaborn as sns
chart = sns.catplot(x='JUNCTIONTYPE', kind='count', hue='SEVERITYCODE', data=mod_colli, height=3, aspect=3)
chart.set_xticklabels(rotation=45)
#Plotting Road Condition by Severity
import seaborn as sns
chart = sns.catplot(x='ROADCOND', kind='count', hue='SEVERITYCODE', data=mod_colli, height=3, aspect=3)
chart.set_xticklabels(rotation=45)
Most of the accidents happened on wet and dry road conditions suggesting that traffic must be low on the riskier conditions.
#Plotting Light Condition by Severity
import seaborn as sns
chart = sns.catplot(x='LIGHTCOND', kind='count', hue='SEVERITYCODE', data=mod_colli, height=3, aspect=3)
chart.set_xticklabels(rotation=45)
#Plotting Hit parked car by Severity
import seaborn as sns
chart = sns.catplot(x='HITPARKEDCAR', kind='count', hue='SEVERITYCODE', data=mod_colli, height=3, aspect=3)
Most of the accidents happened when the HITPARKEDCAR type ‘N’.
#Exacting Month from Incident Date and replacing incident date with incident month
mod_colli['INCMONTH'] = pd.DatetimeIndex(mod_colli['INCDATE']).month
## <string>:1: SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
##
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
del mod_colli['INCDATE']
mod_colli.head()
## SEVERITYCODE ADDRTYPE ... HITPARKEDCAR INCMONTH
## 0 2 Intersection ... N 3
## 1 1 Block ... N 12
## 2 1 Block ... N 11
## 3 1 Block ... N 3
## 4 2 Intersection ... N 1
##
## [5 rows x 9 columns]
#Plotting Incident Month by Severity
import seaborn as sns
chart = sns.catplot(x='INCMONTH', kind='count', hue='SEVERITYCODE', data=mod_colli, height=3, aspect=3)
Plotted month of the incident to see if there is any seasonal pattern however doesn’t look like there is any seasonal pattern.
from sklearn import preprocessing
severity = preprocessing.LabelEncoder()
severity.fit(mod_colli['SEVERITYCODE'])
## LabelEncoder()
mod_colli['SEVERITYCODE'] = severity.transform(mod_colli['SEVERITYCODE'])
## <string>:1: SettingWithCopyWarning:
## A value is trying to be set on a copy of a slice from a DataFrame.
## Try using .loc[row_indexer,col_indexer] = value instead
##
## See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
addrtype = preprocessing.LabelEncoder()
severity.fit(mod_colli['ADDRTYPE'])
## LabelEncoder()
mod_colli['ADDRTYPE'] = severity.transform(mod_colli['ADDRTYPE'])
collisiontype = preprocessing.LabelEncoder()
severity.fit(mod_colli['COLLISIONTYPE'])
## LabelEncoder()
mod_colli['COLLISIONTYPE'] = severity.transform(mod_colli['COLLISIONTYPE'])
junctiontype = preprocessing.LabelEncoder()
severity.fit(mod_colli['JUNCTIONTYPE'])
## LabelEncoder()
mod_colli['JUNCTIONTYPE'] = severity.transform(mod_colli['JUNCTIONTYPE'])
weather = preprocessing.LabelEncoder()
severity.fit(mod_colli['WEATHER'])
## LabelEncoder()
mod_colli['WEATHER'] = severity.transform(mod_colli['WEATHER'])
roadcond = preprocessing.LabelEncoder()
severity.fit(mod_colli['ROADCOND'])
## LabelEncoder()
mod_colli['ROADCOND'] = severity.transform(mod_colli['ROADCOND'])
lightcond = preprocessing.LabelEncoder()
severity.fit(mod_colli['LIGHTCOND'])
## LabelEncoder()
mod_colli['LIGHTCOND'] = severity.transform(mod_colli['LIGHTCOND'])
hitparkedcar = preprocessing.LabelEncoder()
severity.fit(mod_colli['HITPARKEDCAR'])
## LabelEncoder()
mod_colli['HITPARKEDCAR'] = severity.transform(mod_colli['HITPARKEDCAR'])
incmonth = preprocessing.LabelEncoder()
severity.fit(mod_colli['INCMONTH'])
## LabelEncoder()
mod_colli['INCMONTH'] = severity.transform(mod_colli['INCMONTH'])
mod_colli.head()
## SEVERITYCODE ADDRTYPE COLLISIONTYPE ... LIGHTCOND HITPARKEDCAR INCMONTH
## 0 1 2 0 ... 5 0 2
## 1 0 1 9 ... 2 0 11
## 2 0 1 5 ... 5 0 10
## 3 0 1 4 ... 5 0 2
## 4 1 2 0 ... 5 0 0
##
## [5 rows x 9 columns]
As we can see above that all categorical variables are now converted into numerical values.
mod_colli.corr()
## SEVERITYCODE ADDRTYPE ... HITPARKEDCAR INCMONTH
## SEVERITYCODE 1.000000 0.192744 ... -0.092478 0.005963
## ADDRTYPE 0.192744 1.000000 ... -0.118982 -0.000107
## COLLISIONTYPE -0.127117 -0.479831 ... 0.033586 0.008729
## JUNCTIONTYPE -0.200887 -0.919606 ... 0.143587 0.007152
## WEATHER -0.090768 -0.071629 ... 0.033383 0.015249
## ROADCOND -0.037238 -0.019876 ... 0.005173 0.019777
## LIGHTCOND -0.042661 -0.035628 ... 0.012456 -0.026348
## HITPARKEDCAR -0.092478 -0.118982 ... 1.000000 0.005310
## INCMONTH 0.005963 -0.000107 ... 0.005310 1.000000
##
## [9 rows x 9 columns]
import seaborn as sns
plt.figure(figsize=(9,9))
sns.heatmap(mod_colli.corr(), annot = True, linewidth=.2, cbar_kws={"shrink":1})
plt.show()
from sklearn.model_selection import train_test_split
X = mod_colli.iloc[:,[1,2,3,4,5,6,7,8]]
y = mod_colli.iloc[:,0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=40)
print('Train set:', X_train.shape, y_train.shape)
## Train set: (128026, 8) (128026,)
print('Test set:', X_test.shape, y_test.shape)
## Test set: (54869, 8) (54869,)
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
import numpy as np
#Training
k = 4
neigh = KNeighborsClassifier(n_neighbors = k).fit(X_train,y_train)
neigh
# Predicting
## KNeighborsClassifier(n_neighbors=4)
yhat = neigh.predict(X_test)
yhat[0:5]
## array([0, 1, 0, 0, 0], dtype=int64)
from sklearn import metrics
print("Train set Accuracy: ", metrics.accuracy_score(y_train, neigh.predict(X_train)))
## Train set Accuracy: 0.7317576117351163
print("Test set Accuracy: ", metrics.accuracy_score(y_test, yhat))
## Test set Accuracy: 0.7161056334177769
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
con_mat=confusion_matrix(y_test,yhat)
sns.heatmap(con_mat,annot=True,annot_kws=
{"size":20},cmap="viridis")
plt.show()
print(classification_report(y_test, yhat))
## precision recall f1-score support
##
## 0 0.74 0.91 0.82 37859
## 1 0.59 0.28 0.38 17010
##
## accuracy 0.72 54869
## macro avg 0.66 0.60 0.60 54869
## weighted avg 0.69 0.72 0.68 54869
k2 = 6
n2 = KNeighborsClassifier(n_neighbors=k2).fit(X_train,y_train)
n2
## KNeighborsClassifier(n_neighbors=6)
yhat2 = n2.predict(X_test)
yhat2[0:5]
## array([0, 1, 0, 0, 0], dtype=int64)
print("Train Accuracy:",metrics.accuracy_score(y_train,n2.predict(X_train)))
## Train Accuracy: 0.7346085951291144
print("Test Accuracy:",metrics.accuracy_score(y_test,yhat2))
## Test Accuracy: 0.7207348411671436
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
con_mat=confusion_matrix(y_test,yhat2)
sns.heatmap(con_mat,annot=True,annot_kws=
{"size":20},cmap="viridis")
plt.show()
print(classification_report(y_test, yhat2))
## precision recall f1-score support
##
## 0 0.74 0.92 0.82 37859
## 1 0.61 0.27 0.38 17010
##
## accuracy 0.72 54869
## macro avg 0.67 0.60 0.60 54869
## weighted avg 0.70 0.72 0.68 54869
#Finding Optimal Value of K in KNN
neighbors = np.arange(1, 10)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))
# Loop over K values
for i, k in enumerate(neighbors):
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)
# Compute training and test data accuracy
train_accuracy[i] = knn.score(X_train, y_train)
test_accuracy[i] = knn.score(X_test, y_test)
# Generate plot
## KNeighborsClassifier(n_neighbors=1)
## KNeighborsClassifier(n_neighbors=2)
## KNeighborsClassifier(n_neighbors=3)
## KNeighborsClassifier(n_neighbors=4)
## KNeighborsClassifier()
## KNeighborsClassifier(n_neighbors=6)
## KNeighborsClassifier(n_neighbors=7)
## KNeighborsClassifier(n_neighbors=8)
## KNeighborsClassifier(n_neighbors=9)
plt.plot(neighbors, test_accuracy, label = 'Testing dataset Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training dataset Accuracy')
plt.legend()
plt.xlabel('n_neighbors')
plt.ylabel('Accuracy')
plt.show()
k4 = 8
n4 = KNeighborsClassifier(n_neighbors=k4).fit(X_train,y_train)
n4
## KNeighborsClassifier(n_neighbors=8)
yhat4 = n4.predict(X_test)
yhat4[0:5]
## array([0, 1, 0, 0, 0], dtype=int64)
print("Train Accuracy:",metrics.accuracy_score(y_train,n4.predict(X_train)))
## Train Accuracy: 0.7336166091262712
print("Test Accuracy:",metrics.accuracy_score(y_test,yhat4))
## Test Accuracy: 0.7223204359474384
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
con_mat=confusion_matrix(y_test,yhat4)
sns.heatmap(con_mat,annot=True,annot_kws=
{"size":20},cmap="viridis")
plt.show()
print(classification_report(y_test, yhat4))
## precision recall f1-score support
##
## 0 0.74 0.92 0.82 37859
## 1 0.61 0.29 0.39 17010
##
## accuracy 0.72 54869
## macro avg 0.68 0.60 0.61 54869
## weighted avg 0.70 0.72 0.69 54869
KNN model with K value of 8 is the best KNN model with accuracy of .7223 on the test dataset.