import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
import matplotlib.patches as mpatches
warnings.filterwarnings("ignore")
filepath = "C:/Users/txrus/Sem 6/Data Visual/Python/Deliverable/global_shark_attacks.csv"
df=pd.read_csv(filepath)
Shark attacks, while statistically rare, have long captivated public attention and stirred both fear and fascination among beachgoers. Beyond the news headlines, these incidents offer a unique lens through which one can explore human interaction with marine environments. By analyzing available historical data on shark attacks spanning the last couple of centuries, this report aims to uncover patterns in when, where, and how these attacks happened.
The analysis reveals compelling seasonal and hourly trends, suggesting that environmental and behavioral factors play a significant role in the frequency of attacks. Certain coastal regions, especially in the United States, have emerged as consistent hot spots, while, unsurprisingly, activities like surfing and swimming account for a majority of the attacks. Additionally, the data highlights stark differences in fatality rates across countries.
Through data analysis, this report places shark attacks in their broader context in order to better understand what drives them.
Note: This analysis is based on available data, which may not be complete.
The dataset consists primarily of categorical variables, meaning traditional summary statistics (such as mean or standard deviation) offer limited insight. Even numerical variables like age contain inconsistent or messy data that would require cleaning to be useful for analysis. However, since age is not used in any visualizations or graphs in this report, it will not be cleaned and will be treated as a categorical variable.
Below is a description of each variable in the dataset:
for col in df.columns:
    print(f"\n--- Description for column: {col.title()}---\n")
    # Print explicitly: a bare expression inside a loop produces no output in a plain script
    print(df[col].describe())
##
## --- Description for column: Date---
##
## count 6587
## unique 5558
## top 1957-01-01
## freq 11
## Name: date, dtype: object
##
## --- Description for column: Year---
##
## count 6758.000000
## mean 1970.935928
## std 56.227881
## min 1.000000
## 25% 1950.000000
## 50% 1986.000000
## 75% 2009.000000
## max 2023.000000
## Name: year, dtype: float64
##
## --- Description for column: Type---
##
## count 6871
## unique 11
## top Unprovoked
## freq 5065
## Name: type, dtype: object
##
## --- Description for column: Country---
##
## count 6839
## unique 215
## top USA
## freq 2522
## Name: country, dtype: object
##
## --- Description for column: Area---
##
## count 6409
## unique 862
## top Florida
## freq 1174
## Name: area, dtype: object
##
## --- Description for column: Location---
##
## count 6325
## unique 4427
## top New Smyrna Beach, Volusia County
## freq 192
## Name: location, dtype: object
##
## --- Description for column: Activity---
##
## count 6304
## unique 1553
## top Surfing
## freq 1112
## Name: activity, dtype: object
##
## --- Description for column: Name---
##
## count 6670
## unique 5638
## top male
## freq 669
## Name: name, dtype: object
##
## --- Description for column: Sex---
##
## count 6318
## unique 6
## top M
## freq 5545
## Name: sex, dtype: object
##
## --- Description for column: Age---
##
## count 3903
## unique 232
## top 19.0
## freq 89
## Name: age, dtype: object
##
## --- Description for column: Fatal_Y_N---
##
## count 6890
## unique 9
## top N
## freq 4804
## Name: fatal_y_n, dtype: object
##
## --- Description for column: Time---
##
## count 3372
## unique 397
## top Afternoon
## freq 215
## Name: time, dtype: object
##
## --- Description for column: Species---
##
## count 3772
## unique 1560
## top White shark
## freq 192
## Name: species, dtype: object
The date column contains 6,587 non-null entries, 5,558 of them unique. The most frequently occurring date is 1957-01-01, which appears 11 times, a notably high count for a single day. This repetition may indicate placeholder values (for example, records where only the year was known) or data entry inconsistencies, raising questions about the reliability or completeness of some records.
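One rough way to probe the placeholder hypothesis is to check how often dates fall on January 1. The sketch below runs on a small made-up series rather than the real `date` column, and it assumes dates are stored as `YYYY-MM-DD` strings, as the `top` value in the summary suggests.

```python
import pandas as pd

# Toy stand-in for the `date` column; these values are invented for illustration.
dates = pd.Series(["1957-01-01", "1957-01-01", "1998-07-14", "2003-01-01", None])

# Flag entries that fall on January 1; missing dates count as False.
is_jan_first = dates.str.endswith("-01-01", na=False)

# Share of all rows (including missing ones) dated January 1.
jan_first_share = is_jan_first.mean()

print(is_jan_first.tolist())
print(f"share dated January 1: {jan_first_share:.2f}")
```

A share far above roughly 1/365 would be consistent with January 1 being used as a default date, although it would not prove it.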
The year column is the only fully numerical variable in the data set. It includes 6,758 entries for the year of each shark attack. The values range from year 1 to 2023, with a median of 1986. A quarter of the records fall in 2009 or later (75th percentile), and a quarter fall in 1950 or earlier (25th percentile). The presence of extremely low values like 1 suggests possible errors or missing data encoded incorrectly, which could affect analyses involving time trends unless filtered or corrected.
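A minimal sketch of how implausible years could be screened out before any time-trend analysis. The plausibility window (1500 to 2023) and the toy values are assumptions for illustration, not choices made in this report.

```python
import pandas as pd

# Toy stand-in for the `year` column; these values are invented for illustration.
years = pd.Series([1.0, 1845.0, 1986.0, 2009.0, 2023.0, None], name="year")

# Assumed plausibility window; the lower bound is an illustrative choice.
MIN_YEAR, MAX_YEAR = 1500, 2023

# `between` is inclusive on both ends and evaluates to False for missing values.
plausible = years.between(MIN_YEAR, MAX_YEAR)
clean_years = years[plausible]

print(clean_years.tolist())
print(f"dropped {len(years) - len(clean_years)} of {len(years)} rows")
```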
The type variable describes the nature of each shark encounter, with 6,871 recorded entries and 11 unique categories. The most common type is Unprovoked, accounting for 5,065 incidents. This overwhelming majority suggests that most shark attacks occur without deliberate human interaction or provocation. Because only a handful of encounter types are expected, the 11 unique categories may reflect inconsistent data entry.
The country column contains 6,839 entries and 215 unique values, representing the global distribution of shark attacks. The USA appears most frequently, with 2,522 incidents, which is a significant portion of the data set. This could reflect either a higher incidence of shark-human interactions in the region, more thorough reporting practices, or both. The large number of unique countries indicates wide geographic coverage across the globe.
The area column provides more specific regional detail within countries, with 6,409 entries and 862 unique values. The most commonly reported area is Florida, appearing 1,174 times, further emphasizing the United States as a hotspot for reported shark activity.
The location column contains 6,325 entries with 4,427 unique values, providing highly specific information about where each shark attack occurred. The most frequently reported location is New Smyrna Beach, Volusia County, which appears 192 times.
The activity column describes what the individual was doing at the time of the shark attack, with 6,304 entries and 1,553 unique activities recorded. The most frequent activity is Surfing, which accounts for 1,112 cases. The large number of unique values suggests that this column captures a wide variety of behaviors. There is some overlap between similar entries written in different formats (“swimming” vs “Swimming near shore”).
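Overlapping entries like these could be collapsed into a few broad buckets with simple keyword matching. The sketch below is only an illustration: the `BUCKETS` list, the `bucket_activity` helper, and the sample strings are hypothetical and are not part of this report's pipeline.

```python
# Hypothetical buckets, checked in order, so more specific labels come first
# ("spearfishing" must be tested before the more general "fishing").
BUCKETS = [
    ("Spearfishing", ("spearfish", "spear fishing")),
    ("Fishing", ("fishing",)),
    ("Surfing", ("surf",)),
    ("Swimming", ("swim",)),
]

def bucket_activity(raw):
    """Map a free-text activity entry to a broad bucket, or 'Other'."""
    text = str(raw).lower()
    for label, keywords in BUCKETS:
        if any(keyword in text for keyword in keywords):
            return label
    return "Other"

samples = ["swimming", "Swimming near shore", "Spearfishing", "Body surfing", "Wading"]
for sample in samples:
    print(f"{sample!r} -> {bucket_activity(sample)}")
```

Substring matching is deliberately crude; an entry such as "Surf fishing" lands in Fishing only because that bucket is checked before Surfing.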
The name column contains 6,670 entries and 5,638 unique values, presumably identifying the individuals involved in each shark attack. Interestingly, the most common entry is simply “male”, appearing 669 times, which suggests that in many cases, the individual’s name was unknown and replaced with a descriptor.
The sex column records the gender of individuals involved in shark attacks, with 6,318 entries and 6 unique values. The most frequent entry is “M” (Male), accounting for 5,545 cases. While the majority of values are consistent, the presence of six unique entries indicates likely inconsistencies.
The age column has 3,903 non-null entries and 232 unique values. The most common age recorded is 19.0, appearing 89 times. Although this is a numerical variable, it is stored as an object type and includes inconsistencies such as ranges (e.g., “13 or 14”), approximate values, and text-based entries.
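If age were needed later, a cleaning step might look like the sketch below. The `parse_age` helper is hypothetical, and reducing a range such as “13 or 14” to its midpoint is an assumption made for illustration; this report does not clean the column.

```python
import re

def parse_age(raw):
    """Return a numeric age from a messy entry, or None if no number is present."""
    if raw is None:
        return None
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", str(raw))]
    if not numbers:
        return None
    # Assumption: an entry listing several numbers is summarised by their mean,
    # so a two-value range becomes its midpoint.
    return sum(numbers) / len(numbers)

for entry in ["19.0", "13 or 14", "mid-30s", "Teen", None]:
    print(f"{entry!r} -> {parse_age(entry)}")
```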
The fatal_y_n column indicates whether a shark attack was fatal, with 6,890 entries and 9 unique values. The most common value is “N” (non-fatal), accounting for 4,804 records. While this variable is meant to represent a binary outcome, the presence of nine unique values suggests inconsistencies or unknown data.
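One way to reduce the nine raw values to a usable outcome is sketched below. The `normalize_fatal` helper and the sample entries are hypothetical; treating case and whitespace variants of “Y” and “N” as valid is an assumption, and everything else is grouped as “Unknown”.

```python
def normalize_fatal(raw):
    """Collapse a raw fatal_y_n entry to 'Y', 'N', or 'Unknown'."""
    if raw is None:
        return "Unknown"
    value = str(raw).strip().upper()
    # Assumption: only Y and N (ignoring case and surrounding spaces) are valid outcomes.
    return value if value in {"Y", "N"} else "Unknown"

# Invented examples of the kinds of variants a messy column might contain.
samples = ["N", " n", "Y", "y ", "UNKNOWN", "M", None]
normalized = [normalize_fatal(sample) for sample in samples]
print(normalized)
```

Note that an exact comparison such as `df['fatal_y_n'] == 'Y'` counts only the literal value “Y”, so any variants would be left out of a fatal-attack tally unless they are normalized first.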
The time column contains 3,372 non-null entries with 397 unique values, capturing the reported time of each shark attack. The most common entry is “Afternoon”, occurring 215 times. However, the wide variety of unique entries suggests inconsistent formatting, including vague descriptors (“Just before noon”) as well as specific timestamps (“14h00”). These inconsistencies would require significant parsing and standardization to be used reliably for time-based analysis. In this report, the time data was cleaned and transformed into an hour column for visualization.
The species column identifies the type of shark involved in each attack, with 3,772 entries and 1,560 unique values. The most frequently reported species is the White shark, appearing 192 times.
The following section presents visualizations designed to explore key patterns in the shark attack dataset. These graphs highlight trends across time, geography, activity type, and other relevant factors. By transforming the data into visual formats, we can more effectively identify insights, anomalies, and areas that warrant further investigation.
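# Keep only rows where both the date and the time were recorded
# (.copy() avoids modifying a view of df in the assignments below)
date_df = df[df['date'].notna() & df['time'].notna()].copy()
date_df['date'] = date_df['date'].astype(str)
date_df['month'] = date_df['date'].str.split('-').str[1]
date_df['month'] = pd.to_numeric(date_df['month'])
def month_to_quarter(month):
    # Map months 1-12 to quarters 1-4
    return (month - 1) // 3 + 1
date_df['quarter'] = date_df['month'].apply(month_to_quarter)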
date_df['time'] = date_df['time'].astype(str)
date_df['hour'] = date_df['time'].str.split('h').str[0] # gets the hour
date_df['hour'] = date_df['hour'].astype(str).str.lower().str.strip()
# Source: Stack Overflow and AI to create the keywords dictionary
# I put the list of unique values for hour into a LLM and had it create the keywords dictionary below
import re
def extract_hour(val):
    # For numbers
    if re.match(r'^\d{1,2}$', val):
        return int(val)
    # Match HHMM or HHjMM or weird formats
    match = re.search(r'(\d{1,2})[hj:.\s]?', val)
    if match:
        h = int(match.group(1))
        if 0 <= h <= 23:
            return h
    # Keyword based
    keywords = {
        'morning': 9,
        'mid-morning': 10,
        'late morning': 11,
        'afternoon': 14,
        'early afternoon': 13,
        'late afternoon': 16,
        'evening': 18,
        'early evening': 17,
        'late evening': 20,
        'night': 21,
        'late night': 23,
        'dusk': 19,
        'dawn': 6,
        'midday': 12,
        'noon': 12,
        'sunset': 19,
        'daybreak': 6,
        'a.m.': 9,
        'p.m.': 15,
        'midnight': 0,
        'dark': 21,
        'after dark': 22,
        'just before dawn': 5,
        'just before noon': 11,
        'just after 12': 13
    }
    # Check longer phrases first so that e.g. 'late afternoon' is not
    # matched by the shorter 'afternoon'
    for key, hour in sorted(keywords.items(), key=lambda item: len(item[0]), reverse=True):
        if key in val:
            return hour
    return np.nan
date_df['hour'] = date_df['hour'].apply(extract_hour)
# Drop rows where no hour could be extracted
date_df = date_df.dropna(subset=['hour'])
quarter_hour_df = date_df.groupby(['quarter', 'hour']).size().reset_index(name='num_attacks')
quarter_hour_df['hour'] = quarter_hour_df['hour'].astype(float)
quarter_hour_df['quarter'] = quarter_hour_df['quarter'].astype(int)
quarter_hour_df = quarter_hour_df.sort_values(['quarter', 'hour']).reset_index(drop=True)
fig = plt.figure(figsize=(18,10))
ax = fig.add_subplot(1,1,1)
colors = {1:'red',2:'blue',3:'green',4:'gray'}
for quarter, grp in quarter_hour_df.groupby('quarter'):
    grp.plot(ax=ax, kind='line', x='hour', y='num_attacks', label=quarter, marker='8', color=colors[quarter])
plt.title("Number of Attacks by Hour\n and Quarter",fontsize=20,pad=12)
ax.set_xlabel('Hours (24 Hour Interval)',fontsize=18,labelpad=12)
ax.set_ylabel('Total Number of Attacks',fontsize=18)
ax.set_xticks(np.arange(24))
ax.tick_params(axis='x',labelsize=14)
ax.tick_params(axis='y',labelsize=14)
handles, labels = ax.get_legend_handles_labels()
labels = ['Quarter 1','Quarter 2','Quarter 3','Quarter 4']
plt.legend(handles, labels,fontsize=16)
plt.show()
This line graph displays the total number of shark attacks by hour of the day, segmented by quarter. Each line represents a different quarter of the year, allowing for comparisons between seasonal trends in attack timing. The x-axis spans a 24-hour day, while the y-axis shows the total number of attacks that occurred during each hour across all available years. Building on the seasonal trends identified in the previous graph—where most shark attacks occur during summer months—this chart further reveals that attacks also follow a daily rhythm, peaking in the early afternoon hours. Across all quarters, there is a noticeable spike around 2 PM, with Quarter 3 showing the highest overall activity throughout the afternoon. This corresponds to typical beach and water activity hours, when sunlight, temperature, and human presence in the ocean are all at their peak.
The combined evidence from this graph and the previous month-by-decade analysis suggests that shark attacks are not randomly distributed—they’re strongly influenced by human ocean use patterns, especially during warmer months and active daylight hours. Understanding these temporal patterns can help inform beach safety measures and guide public awareness around when the risk of shark encounters is statistically more likely.
pie_df = df.groupby(['activity','type'])['activity'].count().reset_index(name='numatt')
pie_df.sort_values(['numatt'],inplace=True,ascending=False)
pie_df.reset_index(inplace=True)
top_activities_df = (
pie_df.groupby(['activity'])['numatt']
.sum()
.sort_values(ascending=False)
.head(10)
.reset_index()
)
pie_df['type_grouped'] = pie_df['type'].apply(
lambda x: x if x in ['Provoked', 'Unprovoked'] else 'Other')
type_df = (pie_df.groupby('type_grouped')['numatt'].sum().reset_index())
from matplotlib.patches import Patch
fig = plt.figure(figsize=(9,9))
ax = fig.add_subplot(1,1,1)
colormap = plt.get_cmap("tab20c")
outer_colors = colormap(np.arange(len(top_activities_df))*2)
number_inside_colors = len(type_df)  # one color per inner-ring wedge
all_color_ref_number = np.arange(len(top_activities_df) + number_inside_colors)
outer_color_ref_number = np.arange(len(top_activities_df)) * 2
# The inner ring uses colormap indices not already taken by the outer ring
inside_color_ref_number = [each for each in all_color_ref_number
                           if each not in outer_color_ref_number]
inner_colors = colormap(inside_color_ref_number)
all_labels = top_activities_df['activity'].tolist()
pie_labels = ['Surfing', 'Swimming', 'Fishing', 'Spearfishing'] + [''] * 6
legend_labels = all_labels[4:]
top_activities_df['numatt'].plot( kind='pie',
labels= pie_labels,
radius = 1, colors = outer_colors,
pctdistance = 0.9, labeldistance=1.05,
wedgeprops = dict(edgecolor='w'),
autopct='%1.1f%%',
startangle=90)
type_df.numatt.plot(kind='pie',
radius = 0.6, colors=inner_colors,pctdistance = 0.55, labels = type_df.type_grouped,
labeldistance = 0.7, startangle = 250,autopct = '%1.2f%%', wedgeprops = dict(edgecolor='w'))
legend_patches = [Patch(facecolor=outer_colors[i + 4], edgecolor='w', label=legend_labels[i]) for i in range(6)] # stack overflow
plt.legend(handles=legend_patches,
title='Other Activities (< 5%)',
loc = 1, bbox_to_anchor=(1.15, 0.9))
hole = plt.Circle( (0,0), 0.25, fc='white')
fig1 = plt.gcf()
fig1.gca().add_artist(hole)
ax.axis('equal')
ax.yaxis.set_visible(False)
plt.title('Top 10 Activities by Number of Attacks\n and Type')
ax.text(0,0,'Number of Attacks:\n'+ str(top_activities_df.numatt.sum()), ha='center',va='center',size=11)
plt.tight_layout()
plt.show()
This donut chart breaks down shark attacks by type (inner ring) and associated human activity (outer ring). The outer ring covers the 3,803 attacks recorded for the ten most common activities. The inner ring is computed before the top-ten filter is applied, so its percentages cover every record with both an activity and a type. The majority of attacks (about 75%) are classified as unprovoked, meaning they occurred without human interference or direct provocation of the shark. The remaining incidents are categorized as either provoked (9.43%) or other/unknown (15.74%). Among the activities in the outer ring, surfing (29.2%) and swimming (26.5%) account for over half of the attacks shown. These are followed by fishing (13.1%) and spearfishing (10.1%), while other activities like wading, snorkeling, diving, and standing in shallow water make up smaller shares. This activity-based analysis ties directly into the previous graphs by reinforcing the idea that human behavior is a key driver of shark attack patterns. The peak months and hours of the day for attacks (June through September and early to mid-afternoon) coincide with when people are most likely to be engaged in high-risk water activities like surfing and swimming. The data emphasizes that shark attacks are not evenly distributed across all ocean users but are concentrated among those most exposed in the surf zone or open water.
loc_df = df[df.location.notna()]
# Count attacks per location, keeping the first country listed for each location
location_counts = loc_df.groupby('location').agg({'country': 'first', 'location': 'count'}).rename(columns={'location': 'count'}).reset_index()
location_counts.sort_values('count', ascending =False, inplace=True)
location_counts.reset_index(inplace=True, drop=True)
loc_25 = location_counts.head(25)
colors = []
for each in loc_25.country:
    if each == 'USA':
        colors.append('royalblue')
    elif each == 'BRAZIL':
        colors.append('forestgreen')
    elif each == 'SOUTH AFRICA':
        colors.append('pink')
    elif each == 'IRAN':
        colors.append('burlywood')
    elif each == 'MOZAMBIQUE':
        colors.append('yellow')
    elif each == 'AUSTRALIA':
        colors.append('teal')
    else:
        # Country without an assigned color: report it and fall back to gray
        # so that colors stays the same length as loc_25
        print(each)
        colors.append('lightgray')
usa = mpatches.Patch(color='royalblue', label = 'USA')
bra = mpatches.Patch(color='forestgreen', label = 'BRAZIL')
safr = mpatches.Patch(color='pink', label = 'SOUTH AFRICA')
iran = mpatches.Patch(color='burlywood', label = 'IRAN')
moz = mpatches.Patch(color='yellow', label = 'MOZAMBIQUE')
aus = mpatches.Patch(color='teal', label = 'AUSTRALIA')
fig = plt.figure(figsize = (26,20))
ax1 = fig.add_subplot(1,1,1)
ax1.barh(loc_25.location,loc_25['count'], color = colors, edgecolor = 'black')
ax1.set_title('Top 25 Locations for Shark Attacks',size = 30)
ax1.set_xlabel('Number of Shark Attacks',fontsize = 26)
ax1.set_ylabel('Location', fontsize=26)
plt.xticks(fontsize=24)
plt.yticks(fontsize=14, rotation = 30)
ax1.legend(loc='upper right', fontsize=24, handles = [ usa, bra, safr, iran, moz, aus])
for row_counter, value_at_row_counter in enumerate(loc_25['count']):
    ax1.text(
        value_at_row_counter + 3, row_counter, str(value_at_row_counter),
        color='black', size=22, fontweight='bold', ha='left', va='center')
plt.show()
This horizontal bar chart ranks the top 25 global locations with the highest number of recorded shark attacks. Each bar represents a specific beach or coastal area, with color-coded countries indicating geographic distribution. The United States, mostly Florida, dominates the chart, accounting for the majority of the top-ranked locations. New Smyrna Beach, Florida alone accounts for 192 attacks, making it the most common location by far. This spatial concentration supports earlier observations about when and how attacks occur: Florida’s warm climate, extensive coastline, and high tourism-driven beach traffic create ideal conditions for frequent water activity. It also ties in with the seasonal and hourly patterns seen earlier, as high-risk behaviors peak during summer months and daylight hours. Countries like Australia, South Africa, and Brazil also appear in the top 25, highlighting that while shark attacks are a global phenomenon, specific coastal ecosystems and usage patterns make certain areas far more prone to encounters. These locations are often home to both high human presence and active shark populations, reinforcing the idea that geography, behavior, and environment all play an interconnected role in shark attack risk.
total_attacks = df.groupby('country').size().reset_index(name='total_attacks')
total_attacks = total_attacks[total_attacks.total_attacks >= 25]
total_attacks.reset_index(inplace=True, drop=True)
top_countries = total_attacks.sort_values(by='total_attacks', ascending=False)
fatal_attacks = df[df['fatal_y_n'] == 'Y']
fatal_counts = (
fatal_attacks[fatal_attacks['country'].isin(top_countries['country'])]
.groupby('country')
.size()
.reset_index(name='fatal_attacks')
)
fatal_ratio_df = pd.merge(top_countries, fatal_counts, on='country', how='left')
fatal_ratio_df['fatal_attacks'] = fatal_ratio_df['fatal_attacks'].fillna(0)
fatal_ratio_df['fatal_ratio'] = fatal_ratio_df['fatal_attacks'] / fatal_ratio_df['total_attacks']
# Rank by fatality ratio and keep the 15 highest
fatal_ratio_df = (
    fatal_ratio_df.sort_values(by='fatal_ratio', ascending=False)
    .head(15)
    .reset_index(drop=True)
)
fig = plt.figure(figsize=(18,10))
ax1 = fig.add_subplot(1,1,1)
ax2 = ax1.twinx()
bar_width = 0.4
def autolabel(these_bars, this_axis, symbol):
    # Write each bar's height just above the bar
    for each_bar in these_bars:
        height = each_bar.get_height()
        this_axis.text(each_bar.get_x() + each_bar.get_width()/2 , height *1.01,
                       symbol+format(height), fontsize=11, color='black', ha = 'center', va='bottom')
x_pos = np.arange(len(fatal_ratio_df))
total = ax1.bar(
x_pos-(0.5 * bar_width), fatal_ratio_df.total_attacks,
bar_width, color = 'darkblue', edgecolor='white', label = 'Total Attacks'
)
kills = ax1.bar(
x_pos+(0.5 * bar_width), fatal_ratio_df.fatal_attacks,
bar_width, color = 'red', edgecolor='white', label = 'Fatal Attacks'
)
ax1.set_xlabel('Country', fontsize=18)
ax1.set_ylabel('Total Number of Attacks', fontsize=18, labelpad = 15)
ax2.set_ylabel('Number of Fatal Attacks', fontsize=18, labelpad=20, rotation=270)
ax1.tick_params(axis='y', labelsize=14)
ax2.tick_params(axis='y', labelsize=14)
ax1.set_xticks(x_pos)
ax1.set_xticklabels(fatal_ratio_df.country,rotation = 40,ha = 'right',fontsize=10)
legend = ax1.legend(fontsize=18, loc = 'upper left',frameon=False)
autolabel(total,ax1,'')
autolabel(kills,ax1,'')
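ax1.set_ylim([0,160])
# Both bar series are drawn on ax1, so give the right-hand axis the same scale
ax2.set_ylim(ax1.get_ylim())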
for i, ratio in enumerate(fatal_ratio_df['fatal_ratio']):
    max_height = max(fatal_ratio_df.total_attacks[i], fatal_ratio_df.fatal_attacks[i])
    ax1.text(x_pos[i], max_height + 15,
             f'{ratio:.2f}', ha='center', va='bottom',
             fontsize=11, fontweight='bold', color='green')
    ax1.vlines(x=x_pos[i], ymin=0, ymax=max_height + 15, color='black', linestyle='-', linewidth=1)
plt.title("Top 15 Countries by Fatal Attack Ratio",fontsize = 22,pad=25)
plt.suptitle('Text in green is the ratio of fatal shark attacks', y=0.9)
plt.show()
This grouped bar chart ranks the top 15 countries by fatal shark attack ratio, offering a deeper look into not just where attacks happen, but how deadly they are. Only countries with at least 25 recorded attacks are included. Each country is represented with two bars: total attacks (blue) and fatal attacks (red). The green labels above each pair represent the fatality ratio, which is the proportion of shark attacks in that country that resulted in death. From this visualization, we see that countries like the Philippines (0.57), Panama (0.56), and Jamaica (0.53) have some of the highest fatality ratios, despite not having the highest total number of attacks. In contrast, nations with a higher volume of attacks, like Mexico and Papua New Guinea, have a slightly lower but still significant fatality ratio of around 0.45–0.48. This suggests that while places like Florida (as seen in the previous chart) may have high numbers of shark encounters, the likelihood of a fatal outcome there may be lower, plausibly because of better rescue response times, proximity to medical care, and awareness protocols.
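The ratio itself is simply fatal attacks divided by total attacks per country, computed only for countries above a minimum number of attacks. The sketch below reproduces that calculation on invented toy data; the country labels and the threshold of 2 are placeholders (the report uses 25).

```python
import pandas as pd

# Invented toy records; "A", "B", "C" are placeholder country names.
toy = pd.DataFrame({
    "country":   ["A", "A", "A", "A", "B", "B", "C"],
    "fatal_y_n": ["Y", "N", "N", "N", "Y", "Y", "N"],
})

MIN_ATTACKS = 2  # illustrative threshold for this toy data

# Count total and fatal attacks per country in one pass.
summary = (
    toy.assign(is_fatal=toy["fatal_y_n"].eq("Y"))
       .groupby("country")["is_fatal"]
       .agg(total_attacks="size", fatal_attacks="sum")
)

# Drop countries with too few attacks for the ratio to be meaningful.
summary = summary[summary["total_attacks"] >= MIN_ATTACKS]
summary["fatal_ratio"] = summary["fatal_attacks"] / summary["total_attacks"]
summary = summary.sort_values("fatal_ratio", ascending=False)

print(summary)
```

The minimum-count filter matters because a country with a single fatal attack would otherwise show a ratio of 1.0 and top the ranking.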
This analysis of shark attack data reveals clear patterns in how, when, and where these incidents are most likely to occur. The time-stamp analysis shows a sharp rise in attacks over the last century, largely concentrated in the summer months and afternoon hours, periods when human presence in the water is at its peak. Activity-based data highlights that most attacks happen during unprovoked encounters with surfers and swimmers, reinforcing the link between recreational ocean use and shark encounters. Geographic patterns further demonstrate that certain locations, especially along the U.S. coastline in Florida, account for a disproportionate share of global shark attacks. However, while these areas see the highest frequency, they do not necessarily have the highest fatality rates. The fatal attack ratio varies considerably by country, suggesting that medical infrastructure, emergency response, and public awareness play key roles in determining outcomes when attacks occur. Together, these findings indicate that shark attacks are not randomly distributed. They are closely tied to human behavior, environmental conditions, and regional preparedness. Recognizing these patterns can inform safety strategies, guide public education efforts, and contribute to a more rational and data-informed understanding of our relationship with sharks in shared marine environments.