We are going to perform an exploratory data analysis of a dataset containing all reported shark attacks on humans that we know of, up to mid-2017 (the most recent case shown below is dated 11 June 2017). The data comes from the Global Shark Attack File (https://www.sharkattackfile.net/index.htm).

We first take a glimpse at the dataset and its structure, using R's str() function and a helper function written for this purpose, called BASIC SUMMARY.

## # A tibble: 10 x 20
##    `Case Number` Date    Year Type   Country Area  Location Activity Name  Sex  
##    <chr>         <chr>  <dbl> <chr>  <chr>   <chr> <chr>    <chr>    <chr> <chr>
##  1 2017.06.11    11/06~  2017 Unpro~ AUSTRA~ West~ Point C~ Body bo~ Paul~ M    
##  2 2017.06.10.b  10/06~  2017 Unpro~ AUSTRA~ Vict~ Flinder~ Surfing  fema~ F    
##  3 2017.06.10.a  10/06~  2017 Unpro~ USA     Flor~ Ponce I~ Surfing  Brya~ M    
##  4 2017.06.07.R  Repor~  2017 Unpro~ UNITED~ Sout~ Bantham~ Surfing  Rich~ M    
##  5 2017.06.04    4/06/~  2017 Unpro~ USA     Flor~ Middle ~ Spearfi~ Park~ M    
##  6 2017.06.02    2/06/~  2017 Unpro~ BAHAMAS New ~ Athol I~ Snorkel~ Tiff~ F    
##  7 2017.05.30    30/05~  2017 Provo~ USA     Sout~ Awendaw~ Touchin~ Mack~ F    
##  8 2017.05.28    28/05~  2017 Unpro~ USA     Flor~ Off Jup~ Feeding~ Rand~ M    
##  9 2017.05.27    27/05~  2017 <NA>   AUSTRA~ New ~ Evans H~ Fishing  Terr~ M    
## 10 2017.05.12    12/05~  2017 Unpro~ UNITED~ Shar~ Khor Fa~ Spearfi~ Al B~ M    
## # ... with 10 more variables: Age <chr>, Injury <chr>, Fatal (Y/N) <chr>,
## #   Time <chr>, Species <chr>, Investigator or Source <chr>, pdf <chr>,
## #   href <chr>, Case Number_1 <chr>, original order <dbl>
## tibble[,20] [6,094 x 20] (S3: tbl_df/tbl/data.frame)
##  $ Case Number           : chr [1:6094] "2017.06.11" "2017.06.10.b" "2017.06.10.a" "2017.06.07.R" ...
##  $ Date                  : chr [1:6094] "11/06/2017" "10/06/2017" "10/06/2017" "Reported 07-Jun-2017" ...
##  $ Year                  : num [1:6094] 2017 2017 2017 2017 2017 ...
##  $ Type                  : chr [1:6094] "Unprovoked" "Unprovoked" "Unprovoked" "Unprovoked" ...
##  $ Country               : chr [1:6094] "AUSTRALIA" "AUSTRALIA" "USA" "UNITED KINGDOM" ...
##  $ Area                  : chr [1:6094] "Western Australia" "Victoria" "Florida" "South Devon" ...
##  $ Location              : chr [1:6094] "Point Casuarina, Bunbury" "Flinders, Mornington Penisula" "Ponce Inlet, Volusia County" "Bantham Beach" ...
##  $ Activity              : chr [1:6094] "Body boarding" "Surfing" "Surfing" "Surfing" ...
##  $ Name                  : chr [1:6094] "Paul Goff" "female" "Bryan Brock" "Rich Thomson" ...
##  $ Sex                   : chr [1:6094] "M" "F" "M" "M" ...
##  $ Age                   : chr [1:6094] "48" NA "19" "30" ...
##  $ Injury                : chr [1:6094] "No injury, board bitten" "No injury, knocke off board" "Laceration to left foot" "Bruise to leg, cuts to hand sustained when he hit the shark" ...
##  $ Fatal (Y/N)           : chr [1:6094] "N" "N" "N" "N" ...
##  $ Time                  : chr [1:6094] "08h30" "15h45" "10h00" NA ...
##  $ Species               : chr [1:6094] "White shark, 4 m" "7 gill shark" NA "3m shark, probably a smooth hound" ...
##  $ Investigator or Source: chr [1:6094] "WA Today, 6/11/2017" NA "Daytona Beach News-Journal, 6/10/2017" "C. Moore, GSAF" ...
##  $ pdf                   : chr [1:6094] "2017.06.11-Goff.pdf" "2017.06.10.b-Flinders.pdf" "2017.06.10.a-Brock.pdf" "2017.06.07.R-Thomson.pdf" ...
##  $ href                  : chr [1:6094] "http://sharkattackfile.net/spreadsheets/pdf_directory/http://sharkattackfile.net/spreadsheets/pdf_directory/201"| __truncated__ "http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.10.b-Flinders.pdf" "http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.10.a-Brock.pdf" "http://sharkattackfile.net/spreadsheets/pdf_directory/2017.06.07.R-Thomson.pdf" ...
##  $ Case Number_1         : chr [1:6094] "2017.06.11" "2017.06.10.b" "2017.06.10.a" "2017.06.07.R" ...
##  $ original order        : num [1:6094] 6095 6094 6093 6092 6091 ...
##                  variable   type levels
## 1             Case Number tbl_df   6078
## 2                    Date tbl_df   5197
## 3                    Year tbl_df    241
## 4                    Type tbl_df      7
## 5                 Country tbl_df    197
## 6                    Area tbl_df    778
## 7                Location tbl_df   3942
## 8                Activity tbl_df   1475
## 9                    Name tbl_df   5081
## 10                    Sex tbl_df      6
## 11                    Age tbl_df    151
## 12                 Injury tbl_df   3580
## 13            Fatal (Y/N) tbl_df      8
## 14                   Time tbl_df    354
## 15                Species tbl_df   1473
## 16 Investigator or Source tbl_df   4793
## 17                    pdf tbl_df   6083
## 18                   href tbl_df   6077
## 19          Case Number_1 tbl_df   6077
## 20         original order tbl_df   6093
##                                                                      topLevel
## 1                                                                1907.10.16.R
## 2                                                                  10/05/1905
## 3                                                                        2015
## 4                                                                  Unprovoked
## 5                                                                         USA
## 6                                                                     Florida
## 7                                                                        <NA>
## 8                                                                     Surfing
## 9                                                                        male
## 10                                                                          M
## 11                                                                       <NA>
## 12                                                                      FATAL
## 13                                                                          N
## 14                                                                       <NA>
## 15                                                                       <NA>
## 16                                                             C. Moore, GSAF
## 17                                                     1898.00.00.R-Syria.pdf
## 18 http://sharkattackfile.net/spreadsheets/pdf_directory/w014.01.25-Grant.pdf
## 19                                                               1907.10.16.R
## 20                                                                        569
##    topCount topFrac missFreq missFrac
## 1         2   0.000        0    0.000
## 2        11   0.002        0    0.000
## 3       141   0.023        2    0.000
## 4      4466   0.733        4    0.001
## 5      2160   0.354       46    0.008
## 6      1016   0.167      412    0.068
## 7       512   0.084      512    0.084
## 8       935   0.153      537    0.088
## 9       509   0.084      206    0.034
## 10     4908   0.805      577    0.095
## 11     2718   0.446     2718    0.446
## 12      755   0.124       28    0.005
## 13     4400   0.722       30    0.005
## 14     3250   0.533     3250    0.533
## 15     3001   0.492     3001    0.492
## 16       98   0.016       17    0.003
## 17        2   0.000        0    0.000
## 18        4   0.001        1    0.000
## 19        2   0.000        0    0.000
## 20        2   0.000        0    0.000

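And so firstly we see that the dataset has 6,094 records and 20 variables, that almost all variables are stored as character, and that Age, Time and Species are missing for roughly half of the records (45% to 53%).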

We will convert as many variables as we can to their “true” type, although there is not much we can do at this stage.

Let’s see how the attacks are distributed by sex:

table(sharks$Sex)
## 
##    .    F  lli    M    N 
##    1  606    1 4908    1

Most attacks are on males (4,908 cases versus 606 on females). There are 3 invalid categories (".", "lli" and "N") with a single record each, so we will leave them as they are.
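As a rough sketch of how such shares can be computed, prop.table() turns counts into proportions. The vector sex below is made-up illustrative data and only stands in for the real sharks$Sex column.

```r
# Illustrative stand-in for sharks$Sex (hypothetical values, not the real data)
sex <- c("M", "M", "F", "M", NA, "M", "F", ".", "M", "M")

# Counts per category; useNA = "ifany" keeps missing values visible
sex_counts <- table(sex, useNA = "ifany")

# Share of each category among all records, rounded to 2 decimals
sex_share <- round(prop.table(sex_counts), 2)
print(sex_share)

# Keep only the two valid categories for a cleaner comparison
valid <- sex[sex %in% c("M", "F")]
valid_share <- prop.table(table(valid))
print(valid_share)
```

Restricting to the valid categories changes the denominator, so the two sets of shares answer slightly different questions.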

We will now examine the number of shark attacks over time:

Cases appear to become frequent only from 1800 onwards.

So let’s dig deeper:

Now let’s analyse where the attacks have taken place:

sort(table(sharks$Country), decreasing = TRUE)[1:35]
## 
##              USA        AUSTRALIA     SOUTH AFRICA PAPUA NEW GUINEA 
##             2160             1303              571              133 
##      NEW ZEALAND           BRAZIL          BAHAMAS           MEXICO 
##              126              103              101               86 
##            ITALY             FIJI      PHILIPPINES          REUNION 
##               71               62               60               59 
##    NEW CALEDONIA       MOZAMBIQUE             CUBA            SPAIN 
##               51               45               42               40 
##            EGYPT            INDIA          CROATIA            JAPAN 
##               39               38               34               33 
##           PANAMA             IRAN  SOLOMON ISLANDS           GREECE 
##               32               29               29               25 
##          JAMAICA        HONG KONG FRENCH POLYNESIA        INDONESIA 
##               25               24               22               21 
##          ENGLAND    PACIFIC OCEAN            TONGA   ATLANTIC OCEAN 
##               20               19               18               16 
##          BERMUDA        SRI LANKA          VANUATU 
##               16               14               14
sum(sort(table(sharks$Country), decreasing = TRUE)[1:35])
## [1] 5481

There is a huge gap between the number of cases in the USA, Australia and South Africa and the number in every other country.

Next comes a tier of 5 countries around the hundred-case mark (Papua New Guinea, New Zealand, Brazil, Bahamas and Mexico), with all other countries below that.

The top 35 countries account for 89.9% of the documented attacks (5,481 of 6,094).
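The share covered by the most frequent countries can be checked with a few lines of R. This is only a sketch: country_counts is a small made-up vector and top_n_share() is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical counts per country (illustrative, not the real table)
country_counts <- c(USA = 50, AUSTRALIA = 30, BRAZIL = 10, FIJI = 6, TONGA = 4)

# Share of all records covered by the n most frequent categories
top_n_share <- function(counts, n, total = sum(counts)) {
  top <- sort(counts, decreasing = TRUE)[seq_len(min(n, length(counts)))]
  sum(top) / total
}

# Share of the top 2 countries in the toy data
top2 <- top_n_share(country_counts, 2)
print(top2)

# The figure quoted in the text: 5481 cases out of 6094 rows
quoted_share <- round(100 * 5481 / 6094, 1)
print(quoted_share)
```

The denominator is a choice: 6,094 counts every row, including the 46 rows with no country recorded. Dividing by the 6,048 rows that do have a country gives about 90.6% instead.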

Time-series of attacks

Now let’s look at any trends in the number of shark attacks over time.

Since most of the data is from 1800 onwards, we will only plot cases from 1800 onwards.
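A minimal sketch of that filtering step, in base R. The vector year is made-up data standing in for sharks$Year, and treating 0 as an unknown year is an assumption for illustration only.

```r
# Hypothetical stand-in for sharks$Year (0 marks an unknown year, an assumption)
year <- c(2017, 2017, 2016, 1995, 1995, 1995, 1850, 1703, 0, NA)

# Keep known years from 1800 onwards
recent <- year[!is.na(year) & year >= 1800]

# Number of cases per year, ordered chronologically
cases_per_year <- table(recent)
print(cases_per_year)

# Coarser view: number of cases per decade
decade <- floor(recent / 10) * 10
cases_per_decade <- table(decade)
print(cases_per_decade)
```

Either table can then be passed to a plotting function to draw the time series.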

Time of day versus attacks

We now check whether the number of cases varies with the time of day at which the attacks occurred.

## TIME OF ATTACKS ##
## time: 354 unique values
## parse time with the 'hm()' function from lubridate
library(lubridate)
sharks_activity <- sharks
sharks_time <- sharks

sharks_time$Time <- hm(sharks_time$Time)

times <- sharks_time$Time
hist(times$hour, breaks = 24)

In principle this might indicate that most attacks occur during daylight…

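BUT we have to bear in mind that only 37% of the shark attacks have the time of the attack recorded in a usable form (the summary above shows that the Time column is missing outright for 53.3% of the records)!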
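The difference between "recorded" and "usable" can be made explicit with a small check. This is a sketch in base R: time_raw is made-up data, the "08h30" format follows the values visible in the str() output, and the free-text entries are assumptions about what unparseable values might look like.

```r
# Hypothetical stand-in for sharks$Time
time_raw <- c("08h30", "15h45", "10h00", NA, "Afternoon", "11h15", NA, "Night")

# A value counts as usable when it matches HHhMM exactly
pattern <- "^([0-9]{1,2})h([0-9]{2})$"
usable <- !is.na(time_raw) & grepl(pattern, time_raw)

# Hour of day for the usable values only
hours <- as.integer(sub(pattern, "\\1", time_raw[usable]))
print(hours)

# Fractions: recorded at all vs. recorded in a parseable form
frac_recorded <- mean(!is.na(time_raw))
frac_usable <- mean(usable)
print(c(recorded = frac_recorded, usable = frac_usable))
```

Reporting both fractions next to the histogram makes it clear how much of the data the plot actually represents.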

SPECIES of SHARKS INVOLVED in ATTACKS

Now, let’s have a look at the species of shark that attack most commonly:

sharks$Species <- str_replace_all(sharks$Species, "[:digit:]", "")
sharks$Species <- str_replace_all(sharks$Species, "shark", "")

species <- tibble(Text = sharks$Species)
species_words <- species %>% unnest_tokens(output = word, input = Text) 
species_words <- species_words %>% anti_join(stop_words) 

species_wordcounts <- species_words %>% count(word, sort = TRUE)
species_wordcounts
## # A tibble: 461 x 2
##    word            n
##    <chr>       <int>
##  1 <NA>         3001
##  2 white         632
##  3 tiger         271
##  4 bull          182
##  5 involvement   142
##  6 shark         123
##  7 confirmed     111
##  8 blacktip      102
##  9 nurse          94
## 10 lb             86
## # ... with 451 more rows

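White, tiger and bull sharks appear to be the species most often involved in the attacks, followed by blacktip and nurse sharks. Note that the most frequent entry is NA: the species is not recorded for 3,001 cases.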

Now let’s try to quantify which shark species are most often responsible for fatal attacks:

sharks_species <- sharks

sharks_species$Species <- tolower(sharks_species$Species)

wanted_species <- c("white|tiger|bull|blacktip|whaler|nurse|reef|grey|blue|mako|hammerhead")

sharks_species <- sharks_species %>% filter(grepl(wanted_species, Species))

unicos <- unique(sharks_species$Species)    ### 626 distinct values after filtering for the most common species listed
### now we need to reclassify all synonyms

### sharks_play <- mutate_if (str_detect(sharks_play$Species, 'tiger'),
###   ~str_replace_all(sharks_play$Species, c(".*tiger.*" = "tiger")) )
### this one didn't work


sharks_species$Species <- str_replace_all(sharks_species$Species, ".*tiger.*", "tiger")
sharks_species$Species <- str_replace_all(sharks_species$Species, ".*bull.*", "bull")
sharks_species$Species <- str_replace_all(sharks_species$Species, ".*white.*", "white")
sharks_species$Species <- str_replace_all(sharks_species$Species, ".*mako.*", "mako")
sharks_species$Species <- str_replace_all(sharks_species$Species, ".*whaler.*", "whaler")
sharks_species$Species <- str_replace_all(sharks_species$Species, ".*nurse.*", "nurse")
sharks_species$Species <- str_replace_all(sharks_species$Species, ".*grey.*", "grey")
sharks_species$Species <- str_replace_all(sharks_species$Species, ".*hammerhead.*", "hammerhead")
sharks_species$Species <- str_replace_all(sharks_species$Species, ".*blacktip.*", "blacktip")
sharks_species$Species <- str_replace_all(sharks_species$Species, ".*reef.*", "reef")
sharks_species$Species <- str_replace_all(sharks_species$Species, ".*blue.*", "blue")

### now only 11 species left   :)

colnames(sharks_species)[13] <- "Fatal"
sharks_species <- sharks_species %>% filter(Fatal == "Y"  |  Fatal == "N")
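The chain of replacements above depends on the order of the keywords: a description such as "grey nurse shark" ends up as "nurse" because that pattern is applied before "grey". The same idea can be sketched as a loop in base R. Both species_raw and the helper recode_species() are hypothetical and only intended to mirror the logic, not to replace the code above.

```r
# Hypothetical free-text species descriptions (lower case, as after tolower())
species_raw <- c("white shark, 4 m", "tiger shark", "bull shark, 2.5 m",
                 "grey nurse shark", "blacktip reef shark", "blue shark")

# Keywords in the same order as the str_replace_all() calls;
# the first keyword that matches wins
keywords <- c("tiger", "bull", "white", "mako", "whaler", "nurse",
              "grey", "hammerhead", "blacktip", "reef", "blue")

recode_species <- function(x, keywords) {
  out <- rep(NA_character_, length(x))
  for (k in keywords) {
    # Only fill entries that no earlier keyword has claimed
    hit <- is.na(out) & grepl(k, x, fixed = TRUE)
    out[hit] <- k
  }
  out
}

recoded <- recode_species(species_raw, keywords)
print(recoded)
```

Unlike the filtered data above, entries that match no keyword would come back as NA here, so they can be counted or dropped explicitly.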

This plot doesn’t take into account the different magnitude of case counts:

This one does:

Out of 1,583 cases and 11 “species” listed, 4 stand out as more deadly: TIGER first, then WHITE, BULL and BLUE (but this was not tested for statistical significance).

We can also build a contingency table for this:

## # A tibble: 19 x 4
## # Groups:   Species [11]
##    Species    Fatal     n   prop
##    <chr>      <chr> <int>  <dbl>
##  1 blacktip   N       102 1     
##  2 blue       N        43 0.782 
##  3 blue       Y        12 0.218 
##  4 bull       N       134 0.784 
##  5 bull       Y        37 0.216 
##  6 grey       N        25 1     
##  7 hammerhead N        45 0.938 
##  8 hammerhead Y         3 0.0625
##  9 mako       N        54 0.964 
## 10 mako       Y         2 0.0357
## 11 nurse      N        90 0.989 
## 12 nurse      Y         1 0.0110
## 13 reef       N        33 1     
## 14 tiger      N       201 0.700 
## 15 tiger      Y        86 0.300 
## 16 whaler     N        65 0.915 
## 17 whaler     Y         6 0.0845
## 18 white      N       491 0.762 
## 19 white      Y       153 0.238

But we can see that the bar chart provides information much more readily than the table!
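The fatality rates in the contingency table above can be recomputed directly from its counts, and a two-sample test of proportions is one possible way to start checking significance. The sketch below uses base R only; choosing prop.test() and comparing tiger with white sharks are illustrative choices, not part of the original analysis.

```r
# Counts copied from the contingency table above
fatal_counts <- data.frame(
  species = c("tiger", "white", "bull", "blue"),
  fatal   = c(86, 153, 37, 12),
  total   = c(201 + 86, 491 + 153, 134 + 37, 43 + 12)
)

# Fatality rate per species, matching the prop column for Fatal == "Y"
fatal_counts$rate <- round(fatal_counts$fatal / fatal_counts$total, 3)
print(fatal_counts)

# Two-sample test of equal proportions: tiger vs. white
tiger_vs_white <- prop.test(x = c(86, 153), n = c(287, 644))
print(tiger_vs_white$p.value)
```

Testing several pairs of species this way would call for a multiple-comparison correction, and the species labels themselves come from a rough keyword match, so any result should be read with caution.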

HUMAN ACTIVITIES WHILST the ATTACKS TOOK PLACE

WHICH human ACTIVITIES were under way when the ATTACKS occurred?

activities_text <- Corpus(VectorSource(sharks$Activity))
activities_text_clean <- tm_map(activities_text, removePunctuation)

activities_text_clean <- tm_map(activities_text_clean, content_transformer(tolower))
activities_text_clean <- tm_map(activities_text_clean, removeNumbers)
activities_text_clean <- tm_map(activities_text_clean, stripWhitespace)
activities_text_clean <- tm_map(activities_text_clean, removeWords, stopwords('english'))
activities_text_clean <- tm_map(activities_text_clean, removeWords, "shark")
activities_text_clean <- tm_map(activities_text_clean, removeWords, "sharks")

wordcloud(activities_text_clean, scale = c(2, 1), min.freq = 40, colors = rainbow(30))
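A word cloud is a picture of a word-frequency table, so it can help to look at the table itself. The following is a simplified sketch in base R: activity is made-up data standing in for sharks$Activity, and stop words are not removed here, unlike in the tm pipeline above.

```r
# Hypothetical activity descriptions standing in for sharks$Activity
activity <- c("Surfing", "Body boarding", "Surfing", "Spearfishing",
              "Swimming, fell off boat", NA, "surfing")

# Lower-case, strip punctuation and digits, then split on whitespace
clean <- tolower(activity[!is.na(activity)])
clean <- gsub("[[:punct:][:digit:]]", "", clean)
words <- unlist(strsplit(clean, "[[:space:]]+"))
words <- words[nzchar(words)]

# Word frequencies, most common first
word_freq <- sort(table(words), decreasing = TRUE)
print(word_freq)

# Words that would pass a minimum-frequency threshold (2 here, 40 in the text)
frequent <- word_freq[word_freq >= 2]
print(frequent)
```

The min.freq argument of wordcloud() plays the role of the threshold in the last step: only words that occur at least that often are drawn.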

TYPE OF INJURIES CAUSED

Now let’s see what types of injuries shark attacks cause:

## 
##                                                         Bitten 
##                                                           1519 
##                                                    Lacerations 
##                                                           1051 
##                                                          FATAL 
##                                                            755 
##                                                      No injury 
##                                                            722 
##                                              Severed body part 
##                                                            223 
##                                                   Minor injury 
##                                                            126 
##                                                       Survived 
##                                                             97 
##                                                     No details 
##                                                             43 
##                                                      Abrasions 
##                                                             33 
##                                      FATAL, body not recovered 
##                                                             25 
##                                                    Leg injured 
##                                                             11 
##                                 Probable drowning & scavenging 
##                                                              9 
##                                        Puncture wounds to foot 
##                                                              8 
##                                                      FATAL x 2 
##                                                              5 
##                                                   Foot injured 
##                                                              5 
##                                                   Hand injured 
##                                                              5 
##                                                   Major injury 
##                                                              5 
##                                              PROVOKED INCIDENT 
##                                                              5 
##                                  Puncture wounds to right foot 
##                                                              5 
##                                                      Recovered 
##                                                              5 
##                   Shark involvement prior to death unconfirmed 
##                                                              5 
##                            Death may have been due to drowning 
##                                                              4 
##                                  FATAL, body was not recovered 
##                                                              4 
##                                                 Injury to hand 
##                                                              4 
##                                                      No Injury 
##                                                              4 
##                                              Probable drowning 
##                                                              4 
##                                 Probable drowning / scavenging 
##                                                              4 
##                                 Puncture wounds to right thigh 
##                                                              4 
##             Shark involvement prior to death was not confirmed 
##                                                              4 
##        FATAL, but shark involvement prior to death unconfirmed 
##                                                              3 
## FATAL, but shark involvement prior to death was not determined 
##                                                              3 
##                                   Human remains found in shark 
##                                                              3 
##                                      Minor lacerations to foot 
##                                                              3 
##                             Missing, believed taken by a shark 
##                                                              3 
##                                         No Injury to occupants 
##                                                              3

But FATAL sometimes appears together with a description of the type of injury!

So, let’s clean the injury data a bit:

activit_text <- Corpus(VectorSource(sharks$Injury))
activit_text_clean <- tm_map(activit_text, removePunctuation)

activit_text_clean <- tm_map(activit_text_clean, content_transformer(tolower))
activit_text_clean <- tm_map(activit_text_clean, removeNumbers)
activit_text_clean <- tm_map(activit_text_clean, stripWhitespace)
activit_text_clean <- tm_map(activit_text_clean, removeWords, stopwords('english'))
activit_text_clean <- tm_map(activit_text_clean, removeWords, "injury")
activit_text_clean <- tm_map(activit_text_clean, removeWords, "severed")
activit_text_clean <- tm_map(activit_text_clean, removeWords, "lacerations")
activit_text_clean <- tm_map(activit_text_clean, removeWords, "fatal")
activit_text_clean <- tm_map(activit_text_clean, removeWords, "bitten")

wordcloud(activit_text_clean, scale = c(2, 1), min.freq = 30, colors = rainbow(30))

The ‘wordcloud’ that we get after cleaning the data still doesn’t tell us much…

Age versus attacks

Not surprisingly, most attacks are on younger people, in their 20s.
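Because Age is stored as character, it has to be converted before it can be grouped. A minimal sketch in base R, where age_raw is made-up data standing in for sharks$Age and the non-numeric entries are assumptions about what the free text might contain:

```r
# Hypothetical stand-in for sharks$Age, which is stored as character
age_raw <- c("48", NA, "19", "30", "teen", "25", "22", "60s", "27")

# Non-numeric entries such as "teen" become NA; the coercion warning is expected
age_num <- suppressWarnings(as.numeric(age_raw))

# Bin the valid ages into 10-year groups: [0,10), [10,20), ...
age_group <- cut(age_num, breaks = seq(0, 100, by = 10), right = FALSE)
age_table <- table(age_group)
print(age_table)

# Most frequent age group
top_group <- names(age_table)[which.max(age_table)]
print(top_group)
```

Entries that cannot be converted are silently dropped from the table, so it is worth reporting how many there are alongside the result.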

RELATION between human activity being undertaken vs. its fatality rate

Finally, let’s see the relation between the ACTIVITY and the FATALITY: are there human activities for which attacks are fatal more often?

This plot doesn’t take into account the different magnitude of case counts:

But this plot does :)

Out of 1,094 cases, SWIMMING & SNORKELING appear to be the activities with the highest FATALITY RATE.

Most importantly:

table(sharks$Type)
## 
##         Boat      Boating      Invalid     Provoked Sea Disaster   Unprovoked 
##          202          110          529          563          220         4466