In this document a simple analysis of the “Storm Data”-dataset of the national weather service is presented to determine the most harmful events in terms of human cost and money. The results and conclusions are presented in terms of max and average risk of human cost and money. From the information regarding this dataset (“National Weather Service Storm Data Documentation”) one can makeup that not all events are registered prior to 1996. All conclusions are therefore based on the data after 01-01-1996. Multiple data cleaning actions are performed on the EVTYPE-column to reduce the number of possible events.
The analysis presented in this document contains all steps from raw data to generating results.
Setup of session and loading required packages
library(dplyr)
library(ggplot2)
library(lubridate)
library(reshape2)
library(stringr)
library(ggrepel)
options(scipen=999)
Uploading data to Rstudio.
setwd("C:\\_Data\\Mijn Documenten\\R\\Scripts Coursera\\Course 5 - Week 4")
downloadlink<-("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2")
if(!file.exists("dataset_c5w4q1.csv.bz2")) {
download.file(downloadlink,"dataset_c5w4q1.csv.bz2")
}
if(!exists("data_source")) {
data_source <- read.csv("dataset_c5w4q1.csv.bz2", header = TRUE,stringsAsFactors=FALSE)
}
dataset<-data_source
The column with registered events is pretty dirty with a lot of typos, resulting in +/-1000 possible events. These will be reduced to +/- 600 possible events
dataset$BGN_DATE<-as.Date(dataset$BGN_DATE,format="%m/%d/%Y %H:%M:%S")
length(table(dataset$EVTY))
## [1] 985
dataset<-dataset[str_detect(dataset$EVTYPE, "summ")==FALSE,]
# Capital to lower letters EVTYPE
dataset$EVTYPE<-tolower(dataset$EVTYPE)
# Trim whitespaces in EVTYPE
dataset$EVTYPE<-str_trim(dataset$EVTYPE)
# Change tstm in thunderstorm in EVTYPE
dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"tstm","thunderstorm")
# Replace forest in EVTYPE
dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"forest","-")
# Change hvy in heavy from description in EVTYPE
dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"hvy","heavy")
# Change lightn in lightning from description in EVTYPE
dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"lightn(?=\\b)","lightning")
# Change fld in flood from description in EVTYPE
dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"fld","flood")
# Change unseasonably-dry in drought from description in EVTYPE
dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"unseasonably-dry","drought")
# Change hurricane and typhoon into hurricane (typhoon) in EVTYPE
dataset$EVTYPE[str_detect(dataset$EVTYPE, "hurric|typh")==TRUE]<-"hurricane (typhoon)"
# Remove all nubmers in EVTYPE
dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"[:digit:]","-")
# Change all punctuation to same form in EVTYPE
dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"[:punct:]","-")
# Remove multple whitespaces in EVTYPE
dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"[:blank:]{1,}","-")
# Remove the word "and" in EVTYPE
dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,("(?<=[:punct:])and"),"-")
# Remove multiple punctuation in EVTYPE
dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"[:punct:]{2,}","-")
# Remove punctuation from the end of the description in EVTYPE
dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"[:punct:]$","")
# Remove plural form -s from description in EVTYPE
dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"s(?=\\b)","")
# Remove form -ing from description in EVTYPE
dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"s(?=\\b)","")
# Remove -mix from description in EVTYPE
dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"-mix(?=\\b)","")
# Change wild-fire in wildfire from description in EVTYPE
dataset$EVTYPE<-str_replace_all(dataset$EVTYPE,"wild-fire","wildfire")
# Change thunderstorm-wind in thunderstorm from description in EVTYPE
dataset$EVTYPE[str_detect(dataset$EVTYPE, "thunderstorm")==TRUE]<-"thunderstorm"
# Remove double words from description in EVTYPE
a<-list()
for(i in 1:nrow(dataset)) {
a[i]<-str_split(dataset$EVTYPE[i],"-")
}
split_events<-lapply(a,unlist)
split_events<-lapply(split_events,unique)
split_events<-lapply(split_events,str_c,collapse="-")
dataset$EVTYPE<-as.vector(unlist(split_events))
rm(a, split_events)
Number of EVTYPE after processing
length(table(dataset$EVTY))
## [1] 571
The data prior to 1996 is not complete, therefore all results are based on registrations after 01-01-1996. In the following section a summary is made for the costs associated with the possible events, the maximum per EVTYPE is calculated (fatalities, injuries, money), the total per EVTYPE is calculated (fatalities, injuries, money), the average per EVTYPE is calculated (fatalities, injuries, money). To calculate monetary value, only entries with “k”, “m” and “b” are considered, all others will be ignored (+/-325 entries, 0,03%)
To include both the risk and costs per EVTYPE in the conclusions, the risks (risk = probability * effect) are also calculated. The calculations include the maximum risk per EVTYPE (fatalities, injuries, money) and the average risk per EVTYPE (fatalities, injuries, money).
Lastly, all EVTYPES without fatalities are excluded from the results
# Calculate property damage in dollars
dataset$Property_damage_cost<-0
dataset$Property_damage_cost[str_detect(tolower(dataset$PROPDMGEXP), "k")==TRUE]<-1000
dataset$Property_damage_cost[str_detect(tolower(dataset$PROPDMGEXP), "m")==TRUE]<-1000000
dataset$Property_damage_cost[str_detect(tolower(dataset$PROPDMGEXP), "b")==TRUE]<-1000000000
dataset$Property_damage_cost<-dataset$Property_damage_cost*dataset$PROPDMG
# Calculate crop damage in dollars
dataset$Crop_damage_cost<-0
dataset$Crop_damage_cost[str_detect(tolower(dataset$CROPDMGEXP), "k")==TRUE]<-1000
dataset$Crop_damage_cost[str_detect(tolower(dataset$CROPDMGEXP), "m")==TRUE]<-1000000
dataset$Crop_damage_cost[str_detect(tolower(dataset$CROPDMGEXP), "b")==TRUE]<-1000000000
dataset$Crop_damage_cost<-dataset$Crop_damage_cost*dataset$CROPDMG
# Select data after 01-01-1996
dataset_1996<-subset(dataset,dataset$BGN_DATE>"1996-01-01")
# Calculate summaries
dataset_1996_risk<-dataset_1996 %>% group_by(EVTYPE)%>%summarise(
tot_killed=sum(FATALITIES),
tot_injured=sum(INJURIES),
tot_cost_prop=sum(Property_damage_cost),
tot_cost_crop=sum(Crop_damage_cost),
no_events=n(),
event_rate=n()/(year(Sys.Date())-1996),
avg_killed=sum(FATALITIES)/n(),
avg_injured=sum(INJURIES)/n(),
avg_cost_prop=sum(Property_damage_cost)/n(),
avg_cost_crop=sum(Crop_damage_cost)/n(),
max_killed=max(FATALITIES),
max_injured=max(INJURIES),
max_cost_prop=max(Property_damage_cost),
max_cost_crop=max(Crop_damage_cost),
avg_risk_h=event_rate*avg_killed,
max_risk_h=event_rate*max_killed,
avg_risk_cp=event_rate*avg_cost_prop,
max_risk_cp=event_rate*max_cost_prop,
avg_risk_cc=event_rate*avg_cost_crop,
max_risk_cc=event_rate*max_cost_crop)
dataset_1996_risk_hc<-dataset_1996_risk%>%filter(tot_killed>0) # dataset with killed >0
dataset_1996_risk_cp<-dataset_1996_risk%>%filter(tot_cost_prop>0) # dataset with property damage >0
dataset_1996_risk_cc<-dataset_1996_risk%>%filter(tot_cost_crop>0) # dataset with crop damage >0
head(dataset_1996_risk_hc)
## # A tibble: 6 x 21
## EVTYPE tot_killed tot_injured tot_cost_prop tot_cost_crop no_events
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 avala~ 223 156 3711800 0 378
## 2 black~ 1 24 0 0 17
## 3 blizz~ 70 385 525658950 7060000 2633
## 4 blowi~ 1 1 15000 0 5
## 5 coast~ 3 2 251400560 0 596
## 6 coast~ 3 0 103809000 0 147
## # ... with 15 more variables: event_rate <dbl>, avg_killed <dbl>,
## # avg_injured <dbl>, avg_cost_prop <dbl>, avg_cost_crop <dbl>,
## # max_killed <dbl>, max_injured <dbl>, max_cost_prop <dbl>,
## # max_cost_crop <dbl>, avg_risk_h <dbl>, max_risk_h <dbl>,
## # avg_risk_cp <dbl>, max_risk_cp <dbl>, avg_risk_cc <dbl>,
## # max_risk_cc <dbl>
head(dataset_1996_risk_cp)
## # A tibble: 6 x 21
## EVTYPE tot_killed tot_injured tot_cost_prop tot_cost_crop no_events
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 astro~ 0 0 9425000 0 103
## 2 astro~ 0 0 320000 0 174
## 3 avala~ 223 156 3711800 0 378
## 4 beach~ 0 0 100000 0 4
## 5 blizz~ 70 385 525658950 7060000 2633
## 6 blowi~ 0 0 20000 0 2
## # ... with 15 more variables: event_rate <dbl>, avg_killed <dbl>,
## # avg_injured <dbl>, avg_cost_prop <dbl>, avg_cost_crop <dbl>,
## # max_killed <dbl>, max_injured <dbl>, max_cost_prop <dbl>,
## # max_cost_crop <dbl>, avg_risk_h <dbl>, max_risk_h <dbl>,
## # avg_risk_cp <dbl>, max_risk_cp <dbl>, avg_risk_cc <dbl>,
## # max_risk_cc <dbl>
head(dataset_1996_risk_cc)
## # A tibble: 6 x 21
## EVTYPE tot_killed tot_injured tot_cost_prop tot_cost_crop no_events
## <chr> <dbl> <dbl> <dbl> <dbl> <int>
## 1 agric~ 0 0 0 28820000 3
## 2 blizz~ 70 385 525658950 7060000 2633
## 3 cold-~ 95 12 1990000 600000 539
## 4 damag~ 0 0 8000000 34130000 3
## 5 droug~ 0 4 1046101000 13367566000 2433
## 6 dry-m~ 3 25 1732600 15000 173
## # ... with 15 more variables: event_rate <dbl>, avg_killed <dbl>,
## # avg_injured <dbl>, avg_cost_prop <dbl>, avg_cost_crop <dbl>,
## # max_killed <dbl>, max_injured <dbl>, max_cost_prop <dbl>,
## # max_cost_crop <dbl>, avg_risk_h <dbl>, max_risk_h <dbl>,
## # avg_risk_cp <dbl>, max_risk_cp <dbl>, avg_risk_cc <dbl>,
## # max_risk_cc <dbl>
The results after processing are presented the graphs below. Based on the graphs, one can see that there are no extremities on one single axis. This means that any reduction in possible EVTYPE-event will not change the top greatest risks. Further EVTYPE aggregation is therefore not necessary.
Based on both human costs and damage to property and crops, the biggest threats are (flash) floods, hail, tornado’s, thunderstorms-winds, hurricane-typhoons, storm-surge, excessive heats and droughts