The two merged files contain data about bike accidents in Great Britain from 1979 to 2018. The data covers the accident index, the number of vehicles involved, the number of casualties, the biker's gender, the severity of the accident, the victim's age group, the date, time, and day of the accident, the speed limit, road conditions, weather conditions, and road type. In this project, I am going to import and merge the datasets, clean the data, and perform exploratory data analysis using five kinds of data visualizations in order to gain a better understanding of the data and question the outcomes.
import os
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'D:/Anaconda3/Library/plugins/platforms'
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
from matplotlib.ticker import FuncFormatter
accidents = pd.read_csv("U:\\Accidents.csv")
bikers = pd.read_csv("U:\\Bikers.csv")
# merging the two datasets
df = pd.merge(bikers, accidents,
on='Accident_Index',
how='inner')
df.info()
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 827861 entries, 0 to 827860
## Data columns (total 14 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 Accident_Index 827861 non-null object
## 1 Gender 827861 non-null object
## 2 Severity 827861 non-null object
## 3 Age_Grp 827861 non-null object
## 4 Number_of_Vehicles 827861 non-null int64
## 5 Number_of_Casualties 827861 non-null int64
## 6 Date 827861 non-null object
## 7 Time 827861 non-null object
## 8 Speed_limit 827861 non-null float64
## 9 Road_conditions 827861 non-null object
## 10 Weather_conditions 827861 non-null object
## 11 Day 827861 non-null object
## 12 Road_type 827861 non-null object
## 13 Light_conditions 827861 non-null object
## dtypes: float64(1), int64(2), object(11)
## memory usage: 94.7+ MB
df.describe(include='all')
## Accident_Index Gender ... Road_type Light_conditions
## count 827861 827861 ... 827861 827861
## unique 827861 3 ... 6 3
## top 197901A1SEE71 Male ... Single carriageway Daylight
## freq 1 660025 ... 656703 660657
## mean NaN NaN ... NaN NaN
## std NaN NaN ... NaN NaN
## min NaN NaN ... NaN NaN
## 25% NaN NaN ... NaN NaN
## 50% NaN NaN ... NaN NaN
## 75% NaN NaN ... NaN NaN
## max NaN NaN ... NaN NaN
##
## [11 rows x 14 columns]
We can see that the dataset does not contain any missing ("NA") values in any column or row, so nothing needs to be removed from the data frame yet. A quick check for duplicate rows is sketched after the output below.
df.isna().sum()
## Accident_Index 0
## Gender 0
## Severity 0
## Age_Grp 0
## Number_of_Vehicles 0
## Number_of_Casualties 0
## Date 0
## Time 0
## Speed_limit 0
## Road_conditions 0
## Weather_conditions 0
## Day 0
## Road_type 0
## Light_conditions 0
## dtype: int64
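As an additional sanity check on the merge (not part of the original cleaning steps), we can also confirm that no fully duplicated rows were introduced; below is a minimal sketch using the merged df.
# hypothetical extra check: count rows that are exact duplicates after the merge
print('Duplicated rows:', df.duplicated().sum())
# if any appeared, they could be dropped with df.drop_duplicates()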
I created a scatterplot to look at the number of biking accidents by month from 2008 to 2018. The colors on the scale go from dark blue to green to yellow to show an increasing number of bike accidents.
# extract the year and month from the accident date
df['Year'] = pd.to_datetime(df['Date']).dt.year
df['Month'] = pd.to_datetime(df['Date']).dt.month
# count accidents for each year/month combination
x0 = df.groupby(['Year','Month'])['Year'].count().reset_index(name='Count')
x0 = pd.DataFrame(x0)
# keep only the years from 2008 onwards
x = x0.loc[x0['Year'] >= 2008]
plt.figure(figsize=(18,12))
plt.scatter(x['Month'], x['Year'], marker='o', cmap='viridis',
c=x['Count'], s=x['Count'], edgecolors='black')
plt.title('Bike Accidents', fontsize = 35)
plt.xlabel('Months of the Year', fontsize = 25)
plt.ylabel('Year', fontsize = 25)
cbar = plt.colorbar()
cbar.set_label('Number of Biking Accidents', rotation = 270,
fontsize = 25, labelpad = 30)
cbarticks = [*range(800,int(x['Count'].max()), 200)]
cbar.set_ticks(cbarticks)
xticks = [*range( x['Month'].min(), x['Month'].max()+1, 1)]
plt.xticks(xticks, fontsize = 20)
yticks = [*range( x['Year'].min(), x['Year'].max()+1, 1)]
plt.yticks(yticks, fontsize = 20)
plt.show()
In the scatterplot above, we can see that the largest number of accidents recorded was in July 2014. Most of the dots in the middle of the scatterplot are green or yellow, which tells us there were more accidents in months like June, July, August, and September. This makes sense because these months are warm in England, and more people ride bikes when it is warm and during the summer. December looks to be the only month with all dark blue dots, which tells us that, each year, it was the month with the fewest bike accidents. January also has mostly dark blue dots, except in 2014, when it is bright green; this could potentially be an outlier in our data. Overall, December looks to have the fewest accidents and July the most. From this scatterplot I cannot say for certain which year had the fewest accidents, but 2014 clearly looks to have the most.
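To back up that visual impression, here is a minimal sketch (not part of the original analysis) that sums the monthly counts in x by year and reports the years with the fewest and the most accidents.
# total accidents per year for 2008-2018, from the monthly counts in x
yearly = x.groupby('Year')['Count'].sum()
print(yearly.sort_values())
print('Fewest accidents:', yearly.idxmin(), '| Most accidents:', yearly.idxmax())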
I created a vertical bar chart to look at the frequency of bike accident casualties by the age groups of the bicyclists.
# keep only the columns needed for this chart; .copy() avoids a SettingWithCopyWarning
df1 = df[['Age_Grp','Number_of_Casualties','Date','Time','Day']].copy()
df1['Year'] = pd.to_datetime(df1['Date']).dt.year
df1['Month'] = pd.to_datetime(df1['Date']).dt.month
df1['MonthName'] = pd.to_datetime(df1['Date']).dt.strftime('%b')
df1['Hour'] = pd.to_datetime(df1['Time']).dt.hour
x1 = df1.groupby(['Age_Grp']).agg({'Age_Grp':['count'], 'Number_of_Casualties':['sum','mean']}).reset_index()
x1.columns=['Age Groups','Count','Total Casualties','Average Casualties']
x1 = x1.sort_values('Count', ascending=False)
def mean_color(data):
    # color each bar by whether its count is above, below, or within 1% of the mean
    colors = []
    avg = data.Count.mean()
    for each in data.Count:
        if each > avg*1.01:
            colors.append('mediumpurple')
        elif each < avg*0.99:
            colors.append('skyblue')
        else:
            colors.append('black')
    return colors
# take the first nine rows of the table sorted by count (positional slice)
bottom1 = 0
top1 = 8
d1 = x1.iloc[bottom1:top1+1]
my_colors1 = mean_color(d1)
above = mpatches.Patch(color='mediumpurple', label='Above Average')
#at = mpatches.Patch(color='black', label='Within 1% of Mean')
below = mpatches.Patch(color='skyblue', label='Below Average')
plt.figure(figsize=(18,12))
plt.bar(d1['Age Groups'], d1.Count, label='Count', color=my_colors1)
plt.legend(handles=[above,below], fontsize=14)
plt.axhline(d1.Count.mean(), color='black', linestyle='dashed')
plt.title('Bike Casualty Frequency by Age Group', fontsize = 20)
plt.xlabel('Age Groups', fontsize = 15)
plt.ylabel('Casualty Count', fontsize = 15)
plt.text(top1-0.5,d1.Count.mean()+1100, 'Mean = ' + str(round(d1.Count.mean(),2)), rotation=0, fontsize=14)
plt.show()
From the vertical bar chart above, we can see that the most bike accident casualties occurred among bicyclists aged 11 to 15. The fewest casualties occurred among those aged 66 to 75. These findings make sense because not many people over the age of 65 ride bikes, and the teen years are a prime age for riding bikes.
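The bar chart only shows raw counts, but the x1 table built above also holds the average casualties per accident for each age group. This minimal sketch (an extra step, not plotted in the original) prints that column so we can see whether any age group's accidents tend to involve more casualties each.
# count and average casualties per accident for each age group
print(x1.sort_values('Average Casualties', ascending=False).to_string(index=False))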
I created a multiple line plot to show how bike casualties rise and fall by hour for each day of the week.
x2 = df1.groupby(['Hour','Day'])['Number_of_Casualties'].sum().reset_index(name='Total Casualties')
fig = plt.figure(figsize = (18,10))
ax = fig.add_subplot(1,1,1)
my_colors = {'Monday':'dodgerblue',
'Tuesday':'saddlebrown',
'Wednesday':'mediumseagreen',
'Thursday':'darkorange',
'Friday':'yellow',
'Saturday':'mediumpurple',
'Sunday':'deeppink'}
# plot one line per day of the week
for key, grp in x2.groupby('Day'):
    grp.plot(ax=ax, kind='line', x='Hour', y='Total Casualties', color=my_colors[key], label=key, marker='8')
plt.title('Total Casualties by Hour', fontsize = 18)
ax.set_xlabel('Hour (24 Hour Interval)', fontsize = 18)
ax.set_ylabel('Total Casualties', fontsize = 18)
ax.set_xticks(np.arange(24))
handles, labels = ax.get_legend_handles_labels()
handles = [handles[1],handles[5],handles[6],handles[4],handles[0],handles[2],handles[3]]
labels = [labels[1],labels[5],labels[6],labels[4],labels[0],labels[2],labels[3]]
plt.legend(handles,labels,loc='best', fontsize=14, ncol=1)
plt.show()
From the multiple line plot above, we can see that the most casualties occur at 5pm on most days (except Saturdays and Sundays). The hour with the second-most casualties is 7am. The hours with the fewest casualties are 1am to 4am for every day of the week. Overall, there are sharp increases and decreases in casualties around 7am and 5pm. Thus, from this data we can conclude that it is best not to ride a bike at 5pm or 7am, and that it is safest to ride on a Saturday or Sunday during the daytime. This makes me wonder whether most of these bike casualties involve cars, because most of the casualties occur at the times and on the days of the week when people commonly travel to or from work.
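To probe the commuting idea, here is a minimal sketch (an assumption layered on the existing x2 table, not part of the original analysis) that compares the average hourly casualty totals on weekdays against weekends.
# hypothetical comparison: mean hourly casualties, weekdays vs weekends
weekend = ['Saturday', 'Sunday']
x2['DayType'] = np.where(x2['Day'].isin(weekend), 'Weekend', 'Weekday')
profile = x2.groupby(['DayType', 'Hour'])['Total Casualties'].mean().unstack('DayType')
print(profile.round(1))
If the weekday column shows pronounced peaks at 7am and 5pm while the weekend column does not, that would support the idea that many casualties happen on commuting trips.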
I created a nested pie chart to show the percentage of bike casualties by quarter and month of the year.
df1['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df1['Quarter'] = 'Quarter ' + df1['Date'].dt.quarter.astype('string')
piedf = df1.groupby(['Quarter', 'MonthName', 'Month'])['Number_of_Casualties'].sum().reset_index(name='Total Casualties')
piedf.sort_values(by=['Month'], inplace=True)
piedf.reset_index(inplace=True,drop=True)
del piedf['Month']
# tab20c groups its colors in blocks of four shades: use the first (darkest)
# shade of each block for the quarters and the remaining shades for the months
out_colors = len(piedf.Quarter.unique())
out_colors_ref_num = np.arange(out_colors)*4
in_colors = len(piedf.MonthName.unique())
all_colors_ref_num = np.arange(out_colors + in_colors)
in_colors_ref_num = []
for each in all_colors_ref_num:
    if each not in out_colors_ref_num:
        in_colors_ref_num.append(each)
fig = plt.figure(figsize=(10,10))
ax=fig.add_subplot(1,1,1)
colormap = plt.get_cmap("tab20c")
outer_colors = colormap(out_colors_ref_num)
all_cas = piedf["Total Casualties"].sum()
piedf.groupby(['Quarter'])['Total Casualties'].sum().plot(
kind='pie', radius=1, colors=outer_colors, pctdistance=0.85, labeldistance=1.1,
wedgeprops= {'edgecolor':'white'}, textprops={'fontsize':13},
autopct = lambda p: '{:.2f}%\n({:.0f})'.format(p,(p/100)*all_cas),
startangle=90)
inner_colors = colormap(in_colors_ref_num)
piedf['Total Casualties'].plot(
kind='pie', radius=0.7, colors=inner_colors, pctdistance=0.55, labeldistance=0.8,
wedgeprops= {'edgecolor':'white'}, textprops={'fontsize':12},
labels = piedf.MonthName,
autopct = '%1.2f%%',
startangle=90)
hole = plt.Circle((0,0),0.3,fc='white')
fig1= plt.gcf()
fig1.gca().add_artist(hole)
ax.yaxis.set_visible(False)
plt.title('Total Casualties by Quarter and Month', fontsize=18)
ax.text(0,0, 'Total Casualties\n' + str('{:,}'.format(all_cas)), size=18, ha='center',va='center')
ax.axis('equal')
plt.tight_layout()
plt.show()
From the nested pie chart above, we can see that each quarter had a similar percentage of bike casualties. Quarter 1 had 21.47% of the total casualties and Quarter 3 had 28.56%. These percentages make sense because July, August, and September are months with good bike-riding weather, whereas January, February, and March are colder and snowy. Overall, there were 866,946 casualties in total; of the months, February had the lowest percentage and July the highest.
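As a numeric cross-check on the month-level claims, this minimal sketch reuses the piedf table and all_cas total from above to print each month's share of total casualties.
# share of total casualties per month, from lowest to highest
shares = (piedf.set_index('MonthName')['Total Casualties'] / all_cas * 100).round(2)
print(shares.sort_values())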
I created a bump chart to show how the weather conditions during the accidents rank against one another, month by month, in terms of total casualties.
df['MonthName']=pd.to_datetime(df['Date']).dt.strftime('%b')
bumpdf = df.groupby(['Weather_conditions','MonthName'])['Number_of_Casualties'].sum().reset_index(name='Total Casualties')
bumpdf = bumpdf.pivot(index='Weather_conditions',columns='MonthName', values='Total Casualties')
month_order = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
bumpdf = bumpdf.reindex(columns=month_order)
# drop the row where the weather condition was recorded as missing
bumpdf = bumpdf.drop(['Missing data'])
# rank the weather conditions within each month (1 = most casualties)
bumpdf_ranked = bumpdf.rank(axis=0, ascending=False, method='min')
bumpdf_ranked = bumpdf_ranked.T
fig = plt.figure(figsize=(18,10))
ax = fig.add_subplot(1,1,1)
bumpdf_ranked.plot(kind='line', ax=ax, marker='o',markeredgewidth=1, linewidth=6,
markersize=44, markerfacecolor='white')
ax.invert_yaxis()
numrows = bumpdf_ranked.shape[0]
numcols = bumpdf_ranked.shape[1]
plt.ylabel('Monthly Ranking', fontsize=18, labelpad=10)
plt.title('Ranking of Total Casualties by Month and by Weather Condition \n Bump Chart', fontsize=18, pad=15)
plt.xticks(np.arange(numrows), month_order, fontsize=14)
plt.yticks(range(1, numcols+1, 1), fontsize=14)
ax.set_xlabel('Month', fontsize=18)
handles, labels = ax.get_legend_handles_labels()
handles = [handles[8],handles[7],handles[6],handles[5],handles[4],handles[3],handles[2],handles[1],handles[0]]
labels = [labels[8],labels[7],labels[6],labels[5],labels[4],labels[3],labels[2],labels[1],labels[0]]
ax.legend(handles,labels, bbox_to_anchor=(1.01,1.01), fontsize=14,
labelspacing=1, markerscale=.4, borderpad=1,handletextpad=0.8)
# annotate each marker with the underlying casualty total
i = 0
j = 0
for eachcol in bumpdf_ranked.columns:
    for eachrow in bumpdf_ranked.index:
        this_rank = bumpdf_ranked.iloc[i,j]
        ax.text(i, this_rank, str(bumpdf.iloc[j,i]), ha='center', va='center', fontsize=11)
        i += 1
    j += 1
    i = 0
plt.show()
From the bump chart above, we can see that dry weather conditions rank number 1 across all months. This makes sense because most people ride bikes when it is dry outside, and the more bikers on the road, the more bikers can get into accidents. The lowest-ranked weather condition is "snow and windy," and it holds the lowest rank in every month. The weather conditions that ranked consistently across every month were dry, rain, unknown, and snow and windy. Snow and fog fluctuated between ranks 7 and 8 throughout the year, while rain and windy, clear and windy, and other weather conditions all fluctuated between ranks 4, 5, and 6.
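To put a number on how stable each ranking is, this minimal sketch (not in the original analysis) counts how many distinct monthly ranks each weather condition takes across the year; a value of 1 means the condition held the same rank in every month.
# number of distinct monthly ranks held by each weather condition
print(bumpdf_ranked.nunique().sort_values())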
Overall, most of the bike accidents and casualties in Great Britain occurred during the summer months. Most casualties occurred in July, when road conditions were dry and weather conditions were clear. Most casualties involved the 11-15 and 26-35 age groups, on weekdays at 7am or 5pm. These biking accidents could be due to many external factors such as cars, people, other bikers, or natural elements. I would love to explore more data about the locations of the accidents and other possible external factors to gain a better understanding of the data and the story it tells. I could then ask questions like: which cities and/or streets in Great Britain have the most biking accidents? Do cars, people, or other bikers cause the most accidents?