Executive Summary

The two merged files contain data about bike accidents in Great Britain from 1979 to 2018. The data covers the accident index, the number of vehicles involved, the number of casualties, the biker's gender, the severity of the accident, the victim's age group, the date, time, and day of the accident, the speed limit, road conditions, weather conditions, light conditions, and road type. In this project, I am going to import and merge the two data sets, clean the merged data, and perform exploratory data analysis using five kinds of data visualizations in order to gain a better understanding of the data and question the outcomes.

Loading in the Data and Python Libraries:

import os
# point matplotlib's Qt backend at the local Anaconda plugin folder (machine-specific path)
os.environ['QT_QPA_PLATFORM_PLUGIN_PATH'] = 'D:/Anaconda3/Library/plugins/platforms'

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
from matplotlib.ticker import FuncFormatter

accidents = pd.read_csv("U:\\Accidents.csv")
bikers = pd.read_csv("U:\\Bikers.csv")
# merging the two datasets
df = pd.merge(bikers, accidents,
              on='Accident_Index',
              how='inner')
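
A quick sanity check on the join (a minimal sketch, assuming Accident_Index identifies a single row in each file) confirms that the inner merge neither dropped nor duplicated records:

# compare row counts before and after the merge and confirm the key stayed unique
print(len(bikers), len(accidents), len(df))
print(df['Accident_Index'].is_unique)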

Descriptive Statistics

df.info()
## <class 'pandas.core.frame.DataFrame'>
## Int64Index: 827861 entries, 0 to 827860
## Data columns (total 14 columns):
##  #   Column                Non-Null Count   Dtype  
## ---  ------                --------------   -----  
##  0   Accident_Index        827861 non-null  object 
##  1   Gender                827861 non-null  object 
##  2   Severity              827861 non-null  object 
##  3   Age_Grp               827861 non-null  object 
##  4   Number_of_Vehicles    827861 non-null  int64  
##  5   Number_of_Casualties  827861 non-null  int64  
##  6   Date                  827861 non-null  object 
##  7   Time                  827861 non-null  object 
##  8   Speed_limit           827861 non-null  float64
##  9   Road_conditions       827861 non-null  object 
##  10  Weather_conditions    827861 non-null  object 
##  11  Day                   827861 non-null  object 
##  12  Road_type             827861 non-null  object 
##  13  Light_conditions      827861 non-null  object 
## dtypes: float64(1), int64(2), object(11)
## memory usage: 94.7+ MB
df.describe(include='all')
##        Accident_Index  Gender  ...           Road_type Light_conditions
## count          827861  827861  ...              827861           827861
## unique         827861       3  ...                   6                3
## top     197901A1SEE71    Male  ...  Single carriageway         Daylight
## freq                1  660025  ...              656703           660657
## mean              NaN     NaN  ...                 NaN              NaN
## std               NaN     NaN  ...                 NaN              NaN
## min               NaN     NaN  ...                 NaN              NaN
## 25%               NaN     NaN  ...                 NaN              NaN
## 50%               NaN     NaN  ...                 NaN              NaN
## 75%               NaN     NaN  ...                 NaN              NaN
## max               NaN     NaN  ...                 NaN              NaN
## 
## [11 rows x 14 columns]

Cleaning Data

We can see that the merged dataset does not contain any missing (NA) values in any column, so nothing needs to be removed from the data frame yet.

df.isna().sum()
## Accident_Index          0
## Gender                  0
## Severity                0
## Age_Grp                 0
## Number_of_Vehicles      0
## Number_of_Casualties    0
## Date                    0
## Time                    0
## Speed_limit             0
## Road_conditions         0
## Weather_conditions      0
## Day                     0
## Road_type               0
## Light_conditions        0
## dtype: int64
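
If missing values did appear after later transformations, a minimal (hypothetical) cleaning step along these lines would handle them; on the current data it is a no-op, since every column is fully populated:

df = df.dropna(subset=['Date', 'Time'])               # drop rows missing key fields (none here)
df['Road_type'] = df['Road_type'].fillna('Unknown')   # fill missing categoricals with a placeholder (none here)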

Scatterplot

I created a scatterplot to look at the number of biking accidents by month for the years 2008 to 2018. The color scale runs from dark blue through green to yellow to show an increasing number of bike accidents in each month of each year.

df['Year']= pd.to_datetime(df['Date']).dt.year
df['Month']= pd.to_datetime(df['Date']).dt.month

x0 = df.groupby(['Year','Month'])['Year'].count().reset_index(name='Count')

# keep only the years from 2008 onwards
x = x0.loc[x0['Year'] >= 2008]

plt.figure(figsize=(18,12))

plt.scatter(x['Month'], x['Year'], marker='o', cmap='viridis', 
            c=x['Count'], s=x['Count'], edgecolors='black')

plt.title('Bike Accidents', fontsize = 35)
plt.xlabel('Months of the Year', fontsize = 25)
plt.ylabel('Year', fontsize = 25)

cbar = plt.colorbar()
cbar.set_label('Number of Biking Accidents', rotation = 270, 
               fontsize = 25, labelpad = 30)

cbarticks = [*range(800,int(x['Count'].max()), 200)]
cbar.set_ticks(cbarticks)

xticks = [*range( x['Month'].min(), x['Month'].max()+1, 1)]
plt.xticks(xticks, fontsize = 20)
yticks = [*range( x['Year'].min(), x['Year'].max()+1, 1)]
plt.yticks(yticks, fontsize = 20)
plt.show()

In the scatterplot above, we can see that the largest number of accidents recorded was in July 2014. Most of the dots in the middle of the scatterplot are green or yellow, which tells us there were more accidents in months like June, July, August, and September. This makes sense because these months are warm in England, and more people ride bikes when it is warm or during the summer. December looks to be the only month with all dark blue dots, which tells us that it was the month with the fewest bike accidents every year. January also has many dark blue dots, except in 2014, where it is bright green; this could be an outlier in our data. Overall, December looks to have the fewest accidents and July the most. From this scatterplot alone, I cannot conclude with certainty which year had the fewest accidents, but 2014 clearly looks to have the most.
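
To pin down which year had the fewest and which had the most accidents, a quick aggregation over the same filtered monthly counts (the x frame built above) gives the answer directly:

# total accidents per year for 2008-2018, from the monthly counts above
yearly = x.groupby('Year')['Count'].sum().sort_values()
print(yearly.head(1))   # year with the fewest recorded accidents
print(yearly.tail(1))   # year with the most recorded accidents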

Vertical Bar Chart

I created a vertical bar chart to look at the frequency of bike accident casualties by the age groups of the bicyclists.

df1 = df[['Age_Grp','Number_of_Casualties','Date','Time','Day']].copy()  # copy to avoid SettingWithCopyWarning
df1['Year']= pd.to_datetime(df1['Date']).dt.year
df1['Month']= pd.to_datetime(df1['Date']).dt.month
df1['MonthName']=pd.to_datetime(df1['Date']).dt.strftime('%b')
df1['Hour']=pd.to_datetime(df1['Time']).dt.hour
x1 = df1.groupby(['Age_Grp']).agg({'Age_Grp':['count'], 'Number_of_Casualties':['sum','mean']}).reset_index()
x1.columns=['Age Groups','Count','Total Casualties','Average Casualties']
x1 = x1.sort_values('Count', ascending=False)

# color each bar by whether its count sits above the mean, below it, or within 1% of it
def mean_color(data):
    colors = []
    avg = data.Count.mean()
    for each in data.Count:
        if each > avg*1.01:
            colors.append('mediumpurple')
        elif each < avg*0.99:
            colors.append('skyblue')
        else:
            colors.append('black')
    return colors

bottom1 = 0
top1 = 8
d1 = x1.iloc[bottom1:top1 + 1]   # take the top rows by casualty count (positional slice, since x1 was re-sorted)
my_colors1 = mean_color(d1)

above = mpatches.Patch(color='mediumpurple', label='Above Average')
#at = mpatches.Patch(color='black', label='Within 1% of Mean')
below = mpatches.Patch(color='skyblue', label='Below Average')

plt.figure(figsize=(18,12))

plt.bar(d1['Age Groups'], d1.Count, label='Count', color=my_colors1)
plt.legend(handles=[above,below], fontsize=14)
plt.axhline(d1.Count.mean(), color='black', linestyle='dashed')

plt.title('Bike Casualty Frequency by Age Group', fontsize = 20)
plt.xlabel('Age Groups', fontsize = 15)
plt.ylabel('Casualty Count', fontsize = 15)
plt.text(top1-0.5,d1.Count.mean()+1100, 'Mean = ' + str(round(d1.Count.mean(),2)), rotation=0, fontsize=14)

plt.show()

From the vertical bar chart above, we can see that the most bike accident casualties occurred among bicyclists aged 11 to 15. The fewest casualties occurred among those aged 66 to 75. These findings make sense because not many people over the age of 65 ride bikes, and the teen years are a prime age for riding bikes.
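
The exact counts behind the bars can be read straight off the aggregated table built above (a quick check on the x1 frame):

# age groups ordered from most to fewest recorded casualties
print(x1[['Age Groups', 'Count', 'Total Casualties']].to_string(index=False))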

Multiple Line Plot

I created a multiple line plot to show how bike casualties rise and fall over the hours of the day for each day of the week.

x2 = df1.groupby(['Hour','Day'])['Number_of_Casualties'].sum().reset_index(name='Total Casualties')

fig = plt.figure(figsize = (18,10))
ax = fig.add_subplot(1,1,1)

my_colors = {'Monday':'dodgerblue',
             'Tuesday':'saddlebrown',
             'Wednesday':'mediumseagreen',
             'Thursday':'darkorange',
             'Friday':'yellow',
             'Saturday':'mediumpurple',
             'Sunday':'deeppink'}

for key, grp in x2.groupby('Day'):   # group by day name so each day gets its own line
    grp.plot(ax=ax, kind='line', x='Hour', y='Total Casualties', color=my_colors[key], label=key, marker='8')
    
plt.title('Total Casualties by Hour', fontsize = 18)
ax.set_xlabel('Hour (24 Hour Interval)', fontsize = 18)
ax.set_ylabel('Total Casualties', fontsize = 18)

ax.set_xticks(np.arange(24))

# reorder the legend entries into Monday-to-Sunday order
handles, labels = ax.get_legend_handles_labels()
handles = [handles[1],handles[5],handles[6],handles[4],handles[0],handles[2],handles[3]]
labels = [labels[1],labels[5],labels[6],labels[4],labels[0],labels[2],labels[3]]
plt.legend(handles,labels,loc='best', fontsize=14, ncol=1)

plt.show()

From the multiple line plot above, we can see that the most casualties occur at 5pm on most days (Saturdays and Sundays excepted). The time with the second-most casualties is 7am. The hours with the fewest casualties are 1am to 4am on every day of the week. Overall, there are sharp rises and falls in casualties around 7am and 5pm. From this data, then, it appears best not to ride a bike at 5pm or 7am, and safest to ride on a Saturday or Sunday during the daytime. This makes me wonder whether most of these bike casualties involve cars, because they cluster around the times and days of the week when people are traveling to or from work.
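
To confirm the peak hours numerically, the hour with the highest casualty total on each day can be pulled from the x2 frame built above (a minimal sketch):

# hour with the highest casualty total for each day of the week
peaks = x2.loc[x2.groupby('Day')['Total Casualties'].idxmax()]
print(peaks[['Day', 'Hour', 'Total Casualties']])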

Nested Pie Chart

I created a nested pie chart to show the percentage of bike casualties in each quarter and month of the year.

df1['Date'] = pd.to_datetime(df1['Date'], errors='coerce')
df1['Quarter'] = 'Quarter ' + df1['Date'].dt.quarter.astype('string')
piedf = df1.groupby(['Quarter', 'MonthName', 'Month'])['Number_of_Casualties'].sum().reset_index(name='Total Casualties')
piedf.sort_values(by=['Month'], inplace=True)
piedf.reset_index(inplace=True,drop=True)
del piedf['Month']

# tab20c groups its colors in blocks of four shades: use the first shade of each block
# for the quarters (outer ring) and the remaining shades for the months (inner ring)
out_colors = len(piedf.Quarter.unique())
out_colors_ref_num = np.arange(out_colors)*4

in_colors = len(piedf.MonthName.unique())
all_colors_ref_num = np.arange(out_colors + in_colors)

in_colors_ref_num = [each for each in all_colors_ref_num if each not in out_colors_ref_num]
        
fig = plt.figure(figsize=(10,10))
ax=fig.add_subplot(1,1,1)

colormap = plt.get_cmap("tab20c")
outer_colors = colormap(out_colors_ref_num)

all_cas = piedf["Total Casualties"].sum()

piedf.groupby(['Quarter'])['Total Casualties'].sum().plot(
                kind='pie', radius=1, colors=outer_colors, pctdistance=0.85, labeldistance=1.1,
                wedgeprops= {'edgecolor':'white'}, textprops={'fontsize':13},
                autopct = lambda p: '{:.2f}%\n({:.0f})'.format(p,(p/100)*all_cas), 
                startangle=90)

inner_colors = colormap(in_colors_ref_num)
piedf['Total Casualties'].plot(
                kind='pie', radius=0.7, colors=inner_colors, pctdistance=0.55, labeldistance=0.8,
                wedgeprops= {'edgecolor':'white'}, textprops={'fontsize':12},
                labels = piedf.MonthName,
                autopct = '%1.2f%%', 
                startangle=90)

hole = plt.Circle((0,0),0.3,fc='white')
fig1= plt.gcf()
fig1.gca().add_artist(hole)

ax.yaxis.set_visible(False)
plt.title('Total Casualties by Quarter and Month', fontsize=18)

ax.text(0,0, 'Total Casualties\n' + str('{:,}'.format(all_cas)), size=18, ha='center',va='center')

ax.axis('equal')
plt.tight_layout()


plt.show()

From the nested pie chart above, we can see that each quarter had a broadly similar share of bike casualties. Quarter 1 had 21.47% of the total casualties and Quarter 3 had 28.56%. These percentages make sense because July, August, and September have good bike-riding weather, whereas January, February, and March are colder and snowier. Overall, there were 866,946 casualties in total; February had the lowest percentage of any month and July the highest.
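
The quarter-level shares quoted above can be recomputed directly from the piedf table and the all_cas total defined earlier (a quick check):

# percentage share of total casualties in each quarter
quarter_share = piedf.groupby('Quarter')['Total Casualties'].sum() / all_cas * 100
print(quarter_share.round(2))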

Bump Chart

I created a bump chart to show how the monthly ranking of total casualties changes across the weather conditions recorded at the time of the accident.

df['MonthName']=pd.to_datetime(df['Date']).dt.strftime('%b')
bumpdf = df.groupby(['Weather_conditions','MonthName'])['Number_of_Casualties'].sum().reset_index(name='Total Casualties')
bumpdf = bumpdf.pivot(index='Weather_conditions',columns='MonthName', values='Total Casualties')
month_order = ['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']
bumpdf = bumpdf.reindex(columns=month_order)
bumpdf = bumpdf.drop(['Missing data'])

bumpdf_ranked = bumpdf.rank(axis=0, ascending=False, method='min')   # rank conditions within each month (1 = most casualties)
bumpdf_ranked = bumpdf_ranked.T

fig = plt.figure(figsize=(18,10))
ax = fig.add_subplot(1,1,1)
bumpdf_ranked.plot(kind='line', ax=ax, marker='o',markeredgewidth=1, linewidth=6,
                  markersize=44, markerfacecolor='white')

ax.invert_yaxis()

numrows = bumpdf_ranked.shape[0]
numcols = bumpdf_ranked.shape[1]

plt.ylabel('Monthly Ranking', fontsize=18, labelpad=10)
plt.title('Ranking of Total Casualties by Month and by Weather Condition \n Bump Chart', fontsize=18, pad=15)
plt.xticks(np.arange(numrows), month_order, fontsize=14)
plt.yticks(range(1, numcols+1, 1), fontsize=14)
ax.set_xlabel('Month', fontsize=18)

# reverse the default legend order
handles, labels = ax.get_legend_handles_labels()
handles = handles[::-1]
labels = labels[::-1]
ax.legend(handles,labels, bbox_to_anchor=(1.01,1.01), fontsize=14,
         labelspacing=1, markerscale=.4, borderpad=1,handletextpad=0.8)

# annotate each marker with the underlying casualty count
i = 0   # month position along the x axis
j = 0   # weather-condition index
for eachcol in bumpdf_ranked.columns:
    for eachrow in bumpdf_ranked.index:
        this_rank = bumpdf_ranked.iloc[i, j]
        ax.text(i, this_rank, str(bumpdf.iloc[j, i]), ha='center', va='center', fontsize=11)
        i += 1
    j += 1
    i = 0

plt.show()

From the bump chart above, we can see that the dry condition is ranked number 1 in every month. This makes sense because most people ride bikes when it is dry outside, and the more bikers on the road, the more bikers can get into accidents. The lowest-ranked weather condition is snow and windy, which holds the lowest rank in every month. The conditions that ranked consistently across every month were dry, rain, unknown, and snow and windy. The snow and fog conditions fluctuated between ranks 7 and 8 throughout the year, while rain and windy, clear and windy, and other fluctuated between ranks 4, 5, and 6.
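
How stable each condition's rank is over the year can be summarized from the ranked table built above (a minimal sketch on bumpdf_ranked):

# best (lowest number) and worst (highest number) monthly rank reached by each weather condition
rank_range = bumpdf_ranked.agg(['min', 'max']).T
rank_range.columns = ['Best rank', 'Worst rank']
print(rank_range.sort_values('Best rank'))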

Conclusion

Overall, most of the bike accidents and casualties in Great Britain occurred during the summer months. Most casualties occurred in July, when road conditions were dry and weather conditions were clear. Most casualties fell within the 11-15 and 26-35 age groups, on weekdays at 7am or 5pm. These biking accidents could be due to many external factors such as cars, pedestrians, other bikers, or natural elements. I would love to explore additional data about the locations of the accidents and other possible external factors to gain a better understanding of the data and the story it tells. That would let me ask questions like: which cities and streets in Great Britain have the most biking accidents, and do cars, pedestrians, or other bikers cause the most accidents?