Summary

In this document a simple analysis of the “Storm Data”-dataset of the national weather service is presented to determine the most harmful events in terms of human cost and money. The results and conclusions are presented in terms of max and average risk of human cost and money. From the information regarding this dataset (“National Weather Service Storm Data Documentation”) one can makeup that not all events are registered prior to 1996. All conclusions are therefore based on the data after 01-01-1996. Multiple data cleaning actions are performed on the EVTYPE-column to reduce the number of possible events.

The analysis presented in this document contains all steps from raw data to generating results.

Program setup

Setup of session and loading required packages

library(dplyr)
library(ggplot2)
library(lubridate)
library(reshape2)
library(stringr)
library(ggrepel)

options(scipen=999)

Data upload

Uploading data to Rstudio.

setwd("C:\\_Data\\Mijn Documenten\\R\\Scripts Coursera\\Course 5 - Week 4")

downloadlink<-("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2")

  if(!file.exists("dataset_c5w4q1.csv.bz2")) {
    download.file(downloadlink,"dataset_c5w4q1.csv.bz2")
  }

  if(!exists("data_source")) {
    data_source <- read.csv("dataset_c5w4q1.csv.bz2", header = TRUE,stringsAsFactors=FALSE)
  }

dataset<-data_source

Data cleanup

The column with registered events is pretty dirty with a lot of typos, resulting in +/-1000 possible events. These will be reduced to +/- 600 possible events

Dates

  dataset$BGN_DATE<-as.Date(dataset$BGN_DATE,format="%m/%d/%Y %H:%M:%S")

Reduce EVTYPE-entries by correcting typos in EVTYPE

Number of events prior to cleaning

length(table(dataset$EVTY))
## [1] 985

Remove entries labeled “summary”

  dataset<-dataset[str_detect(dataset$EVTYPE, "summ")==FALSE,]

Remove typos from EVTYPE-column

# Capital to lower letters EVTYPE
  dataset$EVTYPE<-tolower(dataset$EVTYPE)

# Trim whitespaces in EVTYPE
  dataset$EVTYPE<-str_trim(dataset$EVTYPE)
  
# Change tstm in thunderstorm in EVTYPE 
  dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"tstm","thunderstorm")  

# Replace forest in EVTYPE 
  dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"forest","-") 
    
# Change hvy in heavy from description in EVTYPE 
  dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"hvy","heavy") 
  
# Change lightn in lightning from description in EVTYPE 
  dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"lightn(?=\\b)","lightning") 
  
# Change fld in flood from description in EVTYPE 
  dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"fld","flood") 
  
# Change unseasonably-dry in drought from description in EVTYPE 
  dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"unseasonably-dry","drought")  

# Change hurricane and typhoon into hurricane (typhoon) in EVTYPE   
  dataset$EVTYPE[str_detect(dataset$EVTYPE, "hurric|typh")==TRUE]<-"hurricane (typhoon)"

# Remove all nubmers in EVTYPE
  dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"[:digit:]","-")

# Change all punctuation to same form in EVTYPE   
  dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"[:punct:]","-")

# Remove multple whitespaces in EVTYPE   
  dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"[:blank:]{1,}","-")
  
# Remove the word "and" in EVTYPE   
  dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,("(?<=[:punct:])and"),"-")
   
# Remove multiple punctuation in EVTYPE 
  dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"[:punct:]{2,}","-")

# Remove punctuation from the end of the description in EVTYPE 
  dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"[:punct:]$","")
    
# Remove plural form -s from description in EVTYPE   
  dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"s(?=\\b)","")
    
# Remove form -ing from description in EVTYPE   
  dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"s(?=\\b)","")
     
# Remove -mix from description in EVTYPE 
  dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"-mix(?=\\b)","")   

# Change wild-fire in wildfire from description in EVTYPE 
  dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"wild-fire","wildfire")

# Change thunderstorm-wind in thunderstorm from description in EVTYPE 
  dataset$EVTYPE[str_detect(dataset$EVTYPE, "thunderstorm")==TRUE]<-"thunderstorm"

  # Remove double words from description in EVTYPE
   
  a<-list()
  
  for(i in 1:nrow(dataset))  {
      
    a[i]<-str_split(dataset$EVTYPE[i],"-")
  }
  
  split_events<-lapply(a,unlist)  
    
  split_events<-lapply(split_events,unique)
  
  split_events<-lapply(split_events,str_c,collapse="-")
    
  dataset$EVTYPE<-as.vector(unlist(split_events))

  rm(a, split_events)   

Number of EVTYPE after processing

length(table(dataset$EVTY))
## [1] 571

Data processing

The data prior to 1996 is not complete, therefore all results are based on registrations after 01-01-1996. In the following section a summary is made for the costs associated with the possible events, the maximum per EVTYPE is calculated (fatalities, injuries, money), the total per EVTYPE is calculated (fatalities, injuries, money), the average per EVTYPE is calculated (fatalities, injuries, money). To calculate monetary value, only entries with “k”, “m” and “b” are considered, all others will be ignored (+/-325 entries, 0,03%)

To include both the risk and costs per EVTYPE in the conclusions, the risks (risk = probability * effect) are also calculated. The calculations include the maximum risk per EVTYPE (fatalities, injuries, money) and the average risk per EVTYPE (fatalities, injuries, money).

Lastly, all EVTYPES without fatalities are excluded from the results

# Calculate property damage in dollars
  dataset$Property_damage_cost<-0

  dataset$Property_damage_cost[str_detect(tolower(dataset$PROPDMGEXP), "k")==TRUE]<-1000
  dataset$Property_damage_cost[str_detect(tolower(dataset$PROPDMGEXP), "m")==TRUE]<-1000000
  dataset$Property_damage_cost[str_detect(tolower(dataset$PROPDMGEXP), "b")==TRUE]<-1000000000
  
  dataset$Property_damage_cost<-dataset$Property_damage_cost*dataset$PROPDMG
  
# Calculate crop damage in dollars
  
  dataset$Crop_damage_cost<-0
  
  dataset$Crop_damage_cost[str_detect(tolower(dataset$CROPDMGEXP), "k")==TRUE]<-1000
  dataset$Crop_damage_cost[str_detect(tolower(dataset$CROPDMGEXP), "m")==TRUE]<-1000000
  dataset$Crop_damage_cost[str_detect(tolower(dataset$CROPDMGEXP), "b")==TRUE]<-1000000000
  
  dataset$Crop_damage_cost<-dataset$Crop_damage_cost*dataset$CROPDMG 

# Select data after 01-01-1996

dataset_1996<-subset(dataset,dataset$BGN_DATE>"1996-01-01")

# Calculate summaries 

dataset_1996_risk<-dataset_1996 %>% group_by(EVTYPE)%>%summarise(
  tot_killed=sum(FATALITIES),
  tot_injured=sum(INJURIES),
  tot_cost_prop=sum(Property_damage_cost),
  tot_cost_crop=sum(Crop_damage_cost),
  no_events=n(),
  event_rate=n()/(year(Sys.Date())-1996),
  avg_killed=sum(FATALITIES)/n(),
  avg_injured=sum(INJURIES)/n(),
  avg_cost_prop=sum(Property_damage_cost)/n(),
  avg_cost_crop=sum(Crop_damage_cost)/n(),
  max_killed=max(FATALITIES),
  max_injured=max(INJURIES),
  max_cost_prop=max(Property_damage_cost),
  max_cost_crop=max(Crop_damage_cost),
  avg_risk_h=event_rate*avg_killed, 
  max_risk_h=event_rate*max_killed,
  avg_risk_cp=event_rate*avg_cost_prop, 
  max_risk_cp=event_rate*max_cost_prop,
  avg_risk_cc=event_rate*avg_cost_crop, 
  max_risk_cc=event_rate*max_cost_crop)

dataset_1996_risk_hc<-dataset_1996_risk%>%filter(tot_killed>0)  # dataset with killed >0
dataset_1996_risk_cp<-dataset_1996_risk%>%filter(tot_cost_prop>0)  # dataset with property damage >0
dataset_1996_risk_cc<-dataset_1996_risk%>%filter(tot_cost_crop>0)  # dataset with crop damage >0


head(dataset_1996_risk_hc)
## # A tibble: 6 x 21
##   EVTYPE tot_killed tot_injured tot_cost_prop tot_cost_crop no_events
##   <chr>       <dbl>       <dbl>         <dbl>         <dbl>     <int>
## 1 avala~        223         156       3711800             0       378
## 2 black~          1          24             0             0        17
## 3 blizz~         70         385     525658950       7060000      2633
## 4 blowi~          1           1         15000             0         5
## 5 coast~          3           2     251400560             0       596
## 6 coast~          3           0     103809000             0       147
## # ... with 15 more variables: event_rate <dbl>, avg_killed <dbl>,
## #   avg_injured <dbl>, avg_cost_prop <dbl>, avg_cost_crop <dbl>,
## #   max_killed <dbl>, max_injured <dbl>, max_cost_prop <dbl>,
## #   max_cost_crop <dbl>, avg_risk_h <dbl>, max_risk_h <dbl>,
## #   avg_risk_cp <dbl>, max_risk_cp <dbl>, avg_risk_cc <dbl>,
## #   max_risk_cc <dbl>
head(dataset_1996_risk_cp)
## # A tibble: 6 x 21
##   EVTYPE tot_killed tot_injured tot_cost_prop tot_cost_crop no_events
##   <chr>       <dbl>       <dbl>         <dbl>         <dbl>     <int>
## 1 astro~          0           0       9425000             0       103
## 2 astro~          0           0        320000             0       174
## 3 avala~        223         156       3711800             0       378
## 4 beach~          0           0        100000             0         4
## 5 blizz~         70         385     525658950       7060000      2633
## 6 blowi~          0           0         20000             0         2
## # ... with 15 more variables: event_rate <dbl>, avg_killed <dbl>,
## #   avg_injured <dbl>, avg_cost_prop <dbl>, avg_cost_crop <dbl>,
## #   max_killed <dbl>, max_injured <dbl>, max_cost_prop <dbl>,
## #   max_cost_crop <dbl>, avg_risk_h <dbl>, max_risk_h <dbl>,
## #   avg_risk_cp <dbl>, max_risk_cp <dbl>, avg_risk_cc <dbl>,
## #   max_risk_cc <dbl>
head(dataset_1996_risk_cc)
## # A tibble: 6 x 21
##   EVTYPE tot_killed tot_injured tot_cost_prop tot_cost_crop no_events
##   <chr>       <dbl>       <dbl>         <dbl>         <dbl>     <int>
## 1 agric~          0           0             0      28820000         3
## 2 blizz~         70         385     525658950       7060000      2633
## 3 cold-~         95          12       1990000        600000       539
## 4 damag~          0           0       8000000      34130000         3
## 5 droug~          0           4    1046101000   13367566000      2433
## 6 dry-m~          3          25       1732600         15000       173
## # ... with 15 more variables: event_rate <dbl>, avg_killed <dbl>,
## #   avg_injured <dbl>, avg_cost_prop <dbl>, avg_cost_crop <dbl>,
## #   max_killed <dbl>, max_injured <dbl>, max_cost_prop <dbl>,
## #   max_cost_crop <dbl>, avg_risk_h <dbl>, max_risk_h <dbl>,
## #   avg_risk_cp <dbl>, max_risk_cp <dbl>, avg_risk_cc <dbl>,
## #   max_risk_cc <dbl>

Results

The results after processing are presented the graphs below. Based on the graphs, one can see that there are no extremities on one single axis. This means that any reduction in possible EVTYPE-event will not change the top greatest risks. Further EVTYPE aggregation is therefore not necessary.

Greatest risks based on human costs

Greatest risks based on property damage

Greatest risks based on crop damage

Conclusions

Based on both human costs and damage to property and crops, the biggest threats are (flash) floods, hail, tornado’s, thunderstorms-winds, hurricane-typhoons, storm-surge, excessive heats and droughts