We are going to perform an exploratory data analysis of a dataset containing all known reported shark attacks on humans up until the end of 2017. The data comes from the Global Shark Attack File (https://www.sharkattackfile.net/index.htm).
We first take a glimpse at the dataset and at its structure, using R's str() function and a function written specifically for that purpose, called BASIC SUMMARY.
## # A tibble: 10 x 20
## `Case Number` Date Year Type Country Area Location Activity Name Sex
## <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2017.06.11 11/06~ 2017 Unpro~ AUSTRA~ West~ Point C~ Body bo~ Paul~ M
## 2 2017.06.10.b 10/06~ 2017 Unpro~ AUSTRA~ Vict~ Flinder~ Surfing fema~ F
## 3 2017.06.10.a 10/06~ 2017 Unpro~ USA Flor~ Ponce I~ Surfing Brya~ M
## 4 2017.06.07.R Repor~ 2017 Unpro~ UNITED~ Sout~ Bantham~ Surfing Rich~ M
## 5 2017.06.04 4/06/~ 2017 Unpro~ USA Flor~ Middle ~ Spearfi~ Park~ M
## 6 2017.06.02 2/06/~ 2017 Unpro~ BAHAMAS New ~ Athol I~ Snorkel~ Tiff~ F
## 7 2017.05.30 30/05~ 2017 Provo~ USA Sout~ Awendaw~ Touchin~ Mack~ F
## 8 2017.05.28 28/05~ 2017 Unpro~ USA Flor~ Off Jup~ Feeding~ Rand~ M
## 9 2017.05.27 27/05~ 2017 <NA> AUSTRA~ New ~ Evans H~ Fishing Terr~ M
## 10 2017.05.12 12/05~ 2017 Unpro~ UNITED~ Shar~ Khor Fa~ Spearfi~ Al B~ M
## # ... with 10 more variables: Age <chr>, Injury <chr>, Fatal (Y/N) <chr>,
## # Time <chr>, Species <chr>, Investigator or Source <chr>, pdf <chr>,
## # href <chr>, Case Number_1 <chr>, original order <dbl>
## tibble[,20] [6,094 x 20] (S3: tbl_df/tbl/data.frame)
## $ Case Number : chr [1:6094] "2017.06.11" "2017.06.10.b" "2017.06.10.a" "2017.06.07.R" ...
## $ Date : chr [1:6094] "11/06/2017" "10/06/2017" "10/06/2017" "Reported 07-Jun-2017" ...
## $ Year : num [1:6094] 2017 2017 2017 2017 2017 ...
## $ Type : chr [1:6094] "Unprovoked" "Unprovoked" "Unprovoked" "Unprovoked" ...
## $ Country : chr [1:6094] "AUSTRALIA" "AUSTRALIA" "USA" "UNITED KINGDOM" ...
## $ Area : chr [1:6094] "Western Australia" "Victoria" "Florida" "South Devon" ...
## $ Location : chr [1:6094] "Point Casuarina, Bunbury" "Flinders, Mornington Penisula" "Ponce Inlet, Volusia County" "Bantham Beach" ...
## $ Activity : chr [1:6094] "Body boarding" "Surfing" "Surfing" "Surfing" ...
## $ Name : chr [1:6094] "Paul Goff" "female" "Bryan Brock" "Rich Thomson" ...
## $ Sex : chr [1:6094] "M" "F" "M" "M" ...
## $ Age : chr [1:6094] "48" NA "19" "30" ...
## $ Injury : chr [1:6094] "No injury, board bitten" "No injury, knocke off board" "Laceration to left foot" "Bruise to leg, cuts to hand sustained when he hit the shark" ...
## $ Fatal (Y/N) : chr [1:6094] "N" "N" "N" "N" ...
## $ Time : chr [1:6094] "08h30" "15h45" "10h00" NA ...
## $ Species : chr [1:6094] "White shark, 4 m" "7 gill shark" NA "3m shark, probably a smooth hound" ...
## $ Investigator or Source: chr [1:6094] "WA Today, 6/11/2017" NA "Daytona Beach News-Journal, 6/10/2017" "C. Moore, GSAF" ...
## $ pdf : chr [1:6094] "2017.06.11-Goff.pdf" "2017.06.10.b-Flinders.pdf" "2017.06.10.a-Brock.pdf" "2017.06.07.R-Thomson.pdf" ...
## $ href : chr [1:6094] "http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/201"| __truncated__ "http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.10.b-Flinders.pdf" "http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.10.a-Brock.pdf" "http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.07.R-Thomson.pdf" ...
## $ Case Number_1 : chr [1:6094] "2017.06.11" "2017.06.10.b" "2017.06.10.a" "2017.06.07.R" ...
## $ original order : num [1:6094] 6095 6094 6093 6092 6091 ...
## variable type levels
## 1 Case Number tbl_df 6078
## 2 Date tbl_df 5197
## 3 Year tbl_df 241
## 4 Type tbl_df 7
## 5 Country tbl_df 197
## 6 Area tbl_df 778
## 7 Location tbl_df 3942
## 8 Activity tbl_df 1475
## 9 Name tbl_df 5081
## 10 Sex tbl_df 6
## 11 Age tbl_df 151
## 12 Injury tbl_df 3580
## 13 Fatal (Y/N) tbl_df 8
## 14 Time tbl_df 354
## 15 Species tbl_df 1473
## 16 Investigator or Source tbl_df 4793
## 17 pdf tbl_df 6083
## 18 href tbl_df 6077
## 19 Case Number_1 tbl_df 6077
## 20 original order tbl_df 6093
## topLevel
## 1 1907.10.16.R
## 2 10/05/1905
## 3 2015
## 4 Unprovoked
## 5 USA
## 6 Florida
## 7 <NA>
## 8 Surfing
## 9 male
## 10 M
## 11 <NA>
## 12 FATAL
## 13 N
## 14 <NA>
## 15 <NA>
## 16 C. Moore, GSAF
## 17 1898.00.00.R-Syria.pdf
## 18 http://sharkattackfile.net/spreadsheets/pdf_directory/w014.01.25-Grant.pdf
## 19 1907.10.16.R
## 20 569
## topCount topFrac missFreq missFrac
## 1 2 0.000 0 0.000
## 2 11 0.002 0 0.000
## 3 141 0.023 2 0.000
## 4 4466 0.733 4 0.001
## 5 2160 0.354 46 0.008
## 6 1016 0.167 412 0.068
## 7 512 0.084 512 0.084
## 8 935 0.153 537 0.088
## 9 509 0.084 206 0.034
## 10 4908 0.805 577 0.095
## 11 2718 0.446 2718 0.446
## 12 755 0.124 28 0.005
## 13 4400 0.722 30 0.005
## 14 3250 0.533 3250 0.533
## 15 3001 0.492 3001 0.492
## 16 98 0.016 17 0.003
## 17 2 0.000 0 0.000
## 18 4 0.001 1 0.000
## 19 2 0.000 0 0.000
## 20 2 0.000 0 0.000
First, we see that:
most of the variables in the dataset are of type character;
several variables have a large share of missing values (AGE 44.6%, TIME 53.3%, SPECIES 49.2%);
since the ACTIVITY, LOCATION and SPECIES variables are free-text character fields, we have 1475 different activities, 3942 different locations and 1473 different species. We need to clean the data to merge categories that are the same but have been spelt differently.
We will convert as many variables as we can to their “true” type, although there is not much we can do at the moment.
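The per-column summary shown earlier can be approximated in a few lines of base R. The sketch below is not the function that produced that output; `column_summary` and the toy data frame are hypothetical names introduced here for illustration. It counts NA as a level of its own, which appears to be what the output above does (for example, Sex is reported with 6 levels).

```r
# Hypothetical helper: a rough per-column summary (not the function used above).
column_summary <- function(df) {
  rows <- lapply(names(df), function(col) {
    x <- df[[col]]
    n_missing <- sum(is.na(x))
    # Frequency of each value, with NA counted as its own level.
    counts <- sort(table(x, useNA = "ifany"), decreasing = TRUE)
    data.frame(
      variable = col,
      type     = class(x)[1],
      levels   = length(counts),
      topLevel = names(counts)[1],
      topCount = as.integer(counts[1]),
      topFrac  = round(as.integer(counts[1]) / length(x), 3),
      missFreq = n_missing,
      missFrac = round(n_missing / length(x), 3),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

# Toy data standing in for the real dataset.
toy <- data.frame(
  Sex = c("M", "M", "F", NA, "M"),
  Age = c("48", NA, "19", NA, NA),
  stringsAsFactors = FALSE
)
toy_summary <- column_summary(toy)
print(toy_summary)
```

On the real data the call would be `column_summary(sharks)`. Reporting the type of each column (`class(x)`) rather than of the whole table avoids every row showing the same type.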
Let’s look at the incidence of attacks by sex:
table(sharks$Sex)
##
## . F lli M N
## 1 606 1 4908 1
Most attacks are on males. There are 3 invalid categories (".", "lli" and "N"), each with a single case, so we will leave them as they are.
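If the stray categories ever need to be removed, one option is to keep only the two expected codes and turn everything else into NA. This is a minimal sketch on a made-up vector that mimics the values seen in the table above; `sex_raw` and `sex_clean` are illustrative names, not objects from the analysis.

```r
# Toy vector mimicking the values seen in table(sharks$Sex).
sex_raw <- c("M", "F", "lli", ".", "N", NA, "M")

# Keep only the two expected codes; every other value becomes NA.
sex_clean <- ifelse(sex_raw %in% c("M", "F"), sex_raw, NA_character_)

table(sex_clean, useNA = "ifany")
```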
We will now examine the number of shark attacks over time:
There appears to be a meaningful number of cases only from 1800 onwards.
So let’s dig deeper:
Now let’s analyse where the attacks have taken place:
sort(table(sharks$Country), decreasing = TRUE)[1:35]
##
## USA AUSTRALIA SOUTH AFRICA PAPUA NEW GUINEA
## 2160 1303 571 133
## NEW ZEALAND BRAZIL BAHAMAS MEXICO
## 126 103 101 86
## ITALY FIJI PHILIPPINES REUNION
## 71 62 60 59
## NEW CALEDONIA MOZAMBIQUE CUBA SPAIN
## 51 45 42 40
## EGYPT INDIA CROATIA JAPAN
## 39 38 34 33
## PANAMA IRAN SOLOMON ISLANDS GREECE
## 32 29 29 25
## JAMAICA HONG KONG FRENCH POLYNESIA INDONESIA
## 25 24 22 21
## ENGLAND PACIFIC OCEAN TONGA ATLANTIC OCEAN
## 20 19 18 16
## BERMUDA SRI LANKA VANUATU
## 16 14 14
sum(sort(table(sharks$Country), decreasing = TRUE)[1:35])
## [1] 5481
There is a huge gap between the number of cases in the USA, Australia and South Africa and the number in every other country.
Then there is a tier of 5 countries around the hundred-case mark (Papua New Guinea, New Zealand, Brazil, Bahamas and Mexico), with all the others below that.
The first 35 countries account for 89.9% of the documented attacks (5481 of 6094).
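The percentage can be checked directly from the numbers printed above. The sketch below only redoes that arithmetic; the variable names are illustrative, and the denominator is the total number of rows reported by str(), which includes the records with no country.

```r
top35_cases <- 5481               # sum of the 35 largest country counts (output above)
total_cases <- 6094               # number of rows reported by str()
top3_cases  <- 2160 + 1303 + 571  # USA, Australia, South Africa

# Shares of all documented cases, in percent.
share_top35 <- round(100 * top35_cases / total_cases, 1)
share_top3  <- round(100 * top3_cases / total_cases, 1)

share_top35
share_top3
```

The second figure shows how concentrated the data is: the three leading countries alone account for roughly two thirds of all cases.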
Now let’s look at any trends in the number of shark attacks over time.
Since most of the data is from 1800 onwards, we will only plot cases from 1800 onwards.
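One way to build such a plot is to drop missing and pre-1800 years, count the cases per year and draw the counts. The following is a base R sketch on a made-up vector standing in for `sharks$Year`; it is not necessarily how the original figure was produced.

```r
# Toy years standing in for sharks$Year (the real column has missing values
# and entries from before 1800).
year <- c(1543, 1799, 1800, 1950, 1950, 2015, 2015, 2015, NA)

# Keep only valid years from 1800 onwards.
recent <- year[!is.na(year) & year >= 1800]

# Count cases per year; names are the years, values are the case counts.
cases_per_year <- table(recent)

# One vertical line per year, as tall as the number of cases.
plot(as.integer(names(cases_per_year)), as.integer(cases_per_year),
     type = "h", xlab = "Year", ylab = "Reported attacks")
```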
We now review the data to see if there is any relationship between the time of day of the attacks and the number of cases.
## TIME OF ATTACKS ##
## time: 354 unique values
## parse time with the 'hm()' function from lubridate
sharks_activity <- sharks
sharks_time <- sharks
sharks_time$Time <- hm(sharks_time$Time)
times <- sharks_time$Time
hist(times$hour, breaks = 24)
In principle this might indicate that most attacks occur during daylight…
BUT we have to bear in mind that only 37% of the shark attacks have a usable time of attack (53.3% of the TIME values are missing to begin with)!
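To see how much of the Time column is actually usable, the share of values in clock format can be computed directly. This is a minimal base R sketch, assuming times are written like "08h30" as in the str() output; the toy vector, including its free-text entry, is made up for illustration.

```r
# Toy values in the style of sharks$Time; the free-text entry is an assumption.
time_raw <- c("08h30", "15h45", "10h00", NA, "Afternoon", "23h05")

# TRUE only for values shaped like "HHhMM".
is_clock <- !is.na(time_raw) & grepl("^[0-9]{2}h[0-9]{2}$", time_raw)

# Extract the hour for the clock-formatted values; leave the rest as NA.
hour <- rep(NA_integer_, length(time_raw))
hour[is_clock] <- as.integer(substr(time_raw[is_clock], 1, 2))

# Share of records with a usable time of day.
usable_share <- mean(is_clock)
usable_share
```

Reporting `usable_share` next to the histogram makes it clear how many records the daylight pattern is actually based on.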
Now, let’s have a look at the species of shark that most commonly attack:
sharks$Species <- str_replace_all(sharks$Species, "[:digit:]", "")
sharks$Species <- str_replace_all(sharks$Species, "shark", "")
species <- tibble(Text = sharks$Species)
species_words <- species %>% unnest_tokens(output = word, input = Text)
species_words <- species_words %>% anti_join(stop_words)
species_wordcounts <- species_words %>% count(word, sort = TRUE)
species_wordcounts
## # A tibble: 461 x 2
## word n
## <chr> <int>
## 1 <NA> 3001
## 2 white 632
## 3 tiger 271
## 4 bull 182
## 5 involvement 142
## 6 shark 123
## 7 confirmed 111
## 8 blacktip 102
## 9 nurse 94
## 10 lb 86
## # ... with 451 more rows
White, tiger and bull sharks appear to be the species most often involved in the attacks, followed by blacktip and nurse sharks.
Now let’s try to quantify which shark species are most often responsible for fatal attacks:
sharks_species <- sharks
sharks_species$Species <- tolower(sharks_species$Species)
wanted_species <- c("white|tiger|bull|blacktip|whaler|nurse|reef|grey|blue|mako|hammerhead")
sharks_species <- sharks_species %>% filter(grepl(wanted_species, Species))
unicos <- unique(sharks_species$Species) ### 626 distinct values after filtering for the most common species listed
### now we need to reclassify all synonyms
### sharks_play <- mutate_if (str_detect(sharks_play$Species, 'tiger'),
### ~str_replace_all(sharks_play$Species, c(".*tiger.*" = "tiger")) )
### (the attempt above did not work)
sharks_species$Species <- str_replace_all(sharks_species$Species, ".*tiger.*", "tiger")
sharks_species$Species <- str_replace_all(sharks_species$Species, ".*bull.*", "bull")
sharks_species$Species <- str_replace_all(sharks_species$Species, ".*white.*", "white")
sharks_species$Species <- str_replace_all(sharks_species$Species, ".*mako.*", "mako")
sharks_species$Species <- str_replace_all(sharks_species$Species, ".*whaler.*", "whaler")
sharks_species$Species <- str_replace_all(sharks_species$Species, ".*nurse.*", "nurse")
sharks_species$Species <- str_replace_all(sharks_species$Species, ".*grey.*", "grey")
sharks_species$Species <- str_replace_all(sharks_species$Species, ".*hammerhead.*", "hammerhead")
sharks_species$Species <- str_replace_all(sharks_species$Species, ".*blacktip.*", "blacktip")
sharks_species$Species <- str_replace_all(sharks_species$Species, ".*reef.*", "reef")
sharks_species$Species <- str_replace_all(sharks_species$Species, ".*blue.*", "blue")
### now only 11 species left :)
colnames(sharks_species)[13] <- "Fatal"
sharks_species <- sharks_species %>% filter(Fatal == "Y" | Fatal == "N")
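The eleven near-identical replacement lines can also be written as a loop over the keywords. The sketch below is a base R alternative, not the code used for the results in this post; `collapse_species` is a hypothetical helper and the labels are toy values. It assigns each label the first keyword it contains, in the same order as the replacements above, and returns NA when nothing matches.

```r
# Keywords in the same order as the replacements above. Order matters:
# "blacktip reef shark" ends up as "blacktip" because it is tested before "reef".
keywords <- c("tiger", "bull", "white", "mako", "whaler", "nurse",
              "grey", "hammerhead", "blacktip", "reef", "blue")

# Hypothetical helper: map each free-text label to the first matching keyword.
collapse_species <- function(species, keywords) {
  out <- rep(NA_character_, length(species))
  lowered <- tolower(species)
  for (key in keywords) {
    # Only fill labels that have not been assigned by an earlier keyword.
    hit <- is.na(out) & !is.na(lowered) & grepl(key, lowered, fixed = TRUE)
    out[hit] <- key
  }
  out
}

labels <- c("White shark, 4 m", "Tiger shark, 3m", "blacktip reef shark",
            "7 gill shark", NA)
collapse_species(labels, keywords)
```

Plain substring matching of this kind is crude: any label containing "white", for instance, is counted as a white shark, so the resulting groups are better read as keyword groups than as true species.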
This plot doesn’t take into account the different magnitude of case counts:
This one does:
Out of 1583 cases and 11 “species” listed, 4 stand out as more deadly: TIGER first, then WHITE, BLUE and BULL (but this was not tested for statistical significance).
We can also build a contingency table for this:
## # A tibble: 19 x 4
## # Groups: Species [11]
## Species Fatal n prop
## <chr> <chr> <int> <dbl>
## 1 blacktip N 102 1
## 2 blue N 43 0.782
## 3 blue Y 12 0.218
## 4 bull N 134 0.784
## 5 bull Y 37 0.216
## 6 grey N 25 1
## 7 hammerhead N 45 0.938
## 8 hammerhead Y 3 0.0625
## 9 mako N 54 0.964
## 10 mako Y 2 0.0357
## 11 nurse N 90 0.989
## 12 nurse Y 1 0.0110
## 13 reef N 33 1
## 14 tiger N 201 0.700
## 15 tiger Y 86 0.300
## 16 whaler N 65 0.915
## 17 whaler Y 6 0.0845
## 18 white N 491 0.762
## 19 white Y 153 0.238
But we can see that the bar chart provides information much more readily than the table!
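The proportions in the table are simply each count divided by the species total. As a check, the sketch below recomputes them in base R for a subset of species, using counts copied from the table above; the object names are illustrative.

```r
# Counts copied from the contingency table above (a subset of species).
fatal_counts <- rbind(
  tiger = c(N = 201, Y = 86),
  white = c(N = 491, Y = 153),
  blue  = c(N = 43,  Y = 12),
  bull  = c(N = 134, Y = 37),
  nurse = c(N = 90,  Y = 1)
)

# Row-wise proportions: share of non-fatal and fatal cases within each species.
fatal_props <- round(prop.table(fatal_counts, margin = 1), 3)

# Sort by fatality rate, highest first.
fatal_props[order(-fatal_props[, "Y"]), ]
```

On the full data the same idea would be `prop.table(table(sharks_species$Species, sharks_species$Fatal), margin = 1)`.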
During WHICH human ACTIVITIES did the ATTACKS occur?
activities_text <- Corpus(VectorSource(sharks$Activity))
activities_text_clean <- tm_map(activities_text, removePunctuation)
activities_text_clean <- tm_map(activities_text_clean, content_transformer(tolower))
activities_text_clean <- tm_map(activities_text_clean, removeNumbers)
activities_text_clean <- tm_map(activities_text_clean, stripWhitespace)
activities_text_clean <- tm_map(activities_text_clean, removeWords, stopwords('english'))
activities_text_clean <- tm_map(activities_text_clean, removeWords, "shark")
activities_text_clean <- tm_map(activities_text_clean, removeWords, "sharks")
wordcloud(activities_text_clean, scale = c(2, 1), min.freq = 40, colors = rainbow(30))
Now let’s see what types of injuries shark attacks cause:
##
## Bitten
## 1519
## Lacerations
## 1051
## FATAL
## 755
## No injury
## 722
## Severed body part
## 223
## Minor injury
## 126
## Survived
## 97
## No details
## 43
## Abrasions
## 33
## FATAL, body not recovered
## 25
## Leg injured
## 11
## Probable drowning & scavenging
## 9
## Puncture wounds to foot
## 8
## FATAL x 2
## 5
## Foot injured
## 5
## Hand injured
## 5
## Major injury
## 5
## PROVOKED INCIDENT
## 5
## Puncture wounds to right foot
## 5
## Recovered
## 5
## Shark involvement prior to death unconfirmed
## 5
## Death may have been due to drowning
## 4
## FATAL, body was not recovered
## 4
## Injury to hand
## 4
## No Injury
## 4
## Probable drowning
## 4
## Probable drowning / scavenging
## 4
## Puncture wounds to right thigh
## 4
## Shark involvement prior to death was not confirmed
## 4
## FATAL, but shark involvement prior to death unconfirmed
## 3
## FATAL, but shark involvement prior to death was not determined
## 3
## Human remains found in shark
## 3
## Minor lacerations to foot
## 3
## Missing, believed taken by a shark
## 3
## No Injury to occupants
## 3
But FATAL is sometimes present when the type of injury is also listed!
So, let’s clean the injury data a bit:
activit_text <- Corpus(VectorSource(sharks$Injury))
activit_text_clean <- tm_map(activit_text, removePunctuation)
activit_text_clean <- tm_map(activit_text_clean, content_transformer(tolower))
activit_text_clean <- tm_map(activit_text_clean, removeNumbers)
activit_text_clean <- tm_map(activit_text_clean, stripWhitespace)
activit_text_clean <- tm_map(activit_text_clean, removeWords, stopwords('english'))
activit_text_clean <- tm_map(activit_text_clean, removeWords, "injury")
activit_text_clean <- tm_map(activit_text_clean, removeWords, "severed")
activit_text_clean <- tm_map(activit_text_clean, removeWords, "lacerations")
activit_text_clean <- tm_map(activit_text_clean, removeWords, "fatal")
activit_text_clean <- tm_map(activit_text_clean, removeWords, "bitten")
wordcloud(activit_text_clean, scale = c(2, 1), min.freq = 30, colors = rainbow(30))
The ‘wordcloud’ that we get after cleaning the data still doesn’t tell us much…
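A more direct alternative to the word cloud might be to sort the Injury text into a few coarse buckets by keyword, which also handles the cases where FATAL appears alongside a description. The sketch below uses base R on made-up labels; the bucket names and keywords are assumptions chosen for illustration, not categories from the dataset.

```r
# Toy labels in the style of sharks$Injury.
injury <- c("FATAL", "FATAL, body not recovered", "Lacerations to left foot",
            "No injury, board bitten", "Bitten on leg, FATAL", NA)

# Flag any mention of "fatal", regardless of case or position in the text.
is_fatal <- grepl("fatal", injury, ignore.case = TRUE)

# Hypothetical coarse buckets; later assignments take priority over earlier ones.
bucket <- rep("Other", length(injury))
bucket[grepl("lacerat", injury, ignore.case = TRUE)] <- "Lacerations"
bucket[grepl("no injury", injury, ignore.case = TRUE)] <- "No injury"
bucket[is_fatal] <- "Fatal"
bucket[is.na(injury)] <- NA

table(bucket, useNA = "ifany")
```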
Age versus attacks
Not surprisingly, most attacks are on younger people in their 20s.
Human activity being undertaken vs. its fatality rate
Finally, let’s look at the relationship between the ACTIVITY and the FATALITY: are there human activities in which attacks are fatal more often?
This plot doesn’t take into account the different magnitude of case counts:
But this plot does :)
Out of 1094 cases, SWIMMING & SNORKELING appear to be the activities with the highest FATALITY RATE.
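A fatality rate per activity only means something when the activity has enough cases behind it. The sketch below computes the rate in base R on toy data and drops activities below a minimum case count; the data frame, the names and the threshold `min_cases` are all hypothetical.

```r
# Toy data standing in for the cleaned Activity and Fatal columns.
toy <- data.frame(
  Activity = c("Swimming", "Swimming", "Swimming", "Surfing", "Surfing",
               "Surfing", "Surfing", "Fishing"),
  Fatal    = c("Y", "Y", "N", "N", "N", "Y", "N", "Y"),
  stringsAsFactors = FALSE
)

# Number of cases and number of fatal cases per activity.
cases  <- tapply(toy$Fatal, toy$Activity, length)
fatals <- tapply(toy$Fatal == "Y", toy$Activity, sum)

rates <- data.frame(
  Activity      = names(cases),
  cases         = as.integer(cases),
  fatality_rate = round(as.numeric(fatals) / as.integer(cases), 2),
  stringsAsFactors = FALSE
)

# Hypothetical threshold: ignore activities with too few cases to be informative.
min_cases <- 3
kept <- rates[rates$cases >= min_cases, ]
kept
```

In the toy data "Fishing" has a 100% fatality rate from a single case, which is exactly the kind of artefact the threshold is meant to filter out.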
table(sharks$Type)
##
## Boat Boating Invalid Provoked Sea Disaster Unprovoked
## 202 110 529 563 220 4466