Storm events occur in the USA every year and affect a wide range of people to different extents. The damage caused by these events includes buildings, bridges, roads, hospitals, and other infrastructure, along with crops and livestock. This analysis estimates the impact of these storm events on public health, measured by fatalities and injuries, and their economic consequences, measured by property damage and crop damage.

Data

The data for this analysis is sourced from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The data come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:

Data: Storm Data 47Mb

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

The events in the database start in the year 1950 and end in November 2011.

library(dplyr)
library(stringr)
library(reshape2)
library(ggplot2)
library(patchwork)

Data Processing

data_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
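# data.table = FALSE returns a plain data frame, matching the row-numbered output shown below
storm_data <- data.table::fread(data_url, data.table = FALSE)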

The variables relevant to our question are:

storm_data2 <- storm_data[,c("BGN_DATE","EVTYPE","FATALITIES","INJURIES","PROPDMG"
                             ,"PROPDMGEXP","CROPDMG","CROPDMGEXP")]
names(storm_data2) <- sub("_",".",tolower(names(storm_data2)))

The storm_data2$bgn.date column should be a date variable.

class(storm_data2$bgn.date)
## [1] "character"

Unfortunately, it is a character variable. It needs to be converted to a date variable before we can use it.

storm_data2$bgn.date <- as.Date(storm_data2$bgn.date, "%m/%d/%Y")
class(storm_data2$bgn.date)
## [1] "Date"
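
As a small check of this conversion, the sketch below parses a few strings in month/day/year layout. The sample values, including the trailing time of day on two of them, are assumptions made up for illustration and are not taken from the dataset. as.Date() ignores any characters that follow the part matched by the format.

```r
# Hypothetical sample values in month/day/year layout (not from the dataset).
raw_dates <- c("4/18/1950 0:00:00", "11/30/2011 0:00:00", "1/5/1996")

# Characters after the part matched by the format are ignored by as.Date().
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

class(parsed)
format(parsed, "%Y-%m-%d")

# The month number can then be extracted, as is done later for seasonality.
month_num <- as.numeric(format(parsed, "%m"))
month_num
```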

To represent economic damage properly, the exponent variables storm_data$PROPDMGEXP and storm_data$CROPDMGEXP should be numeric, each giving the power of ten by which the damage value is multiplied.

These two variables are first coerced to factors and assigned to prop_exp and crop_exp.

prop_exp <- factor(storm_data2[[6]])
crop_exp <- factor(storm_data2[[8]])
levels(prop_exp)
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K" "m" "M"
levels(crop_exp)
## [1] ""  "?" "0" "2" "B" "k" "K" "m" "M"

Looking at the levels, we can immediately see a mix of numbers, letters, and sign characters. The letters stand for specific powers of ten: h or H is 2 (hundreds), k or K is 3 (thousands), m or M is 6 (millions), and b or B is 9 (billions).

levels(prop_exp) <- str_replace_all(levels(prop_exp),
                                    c("h|H"="2","k|K"="3","m|M"="6","b|B"="9"))
levels(prop_exp)
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"
levels(crop_exp) <- str_replace_all(levels(crop_exp),
                                    c("h|H"="2","k|K"="3","m|M"="6","b|B"="9"))
levels(crop_exp)
## [1] ""  "?" "0" "2" "9" "3" "6"
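
The same letter-to-exponent mapping can also be written as a named lookup vector instead of regular expressions on factor levels. This is only an alternative sketch in base R; the names letter_exp and codes and the sample values are hypothetical.

```r
# Letter codes and the power of ten each one stands for (as described above).
letter_exp <- c(h = 2, k = 3, m = 6, b = 9)

# Hypothetical sample of raw exponent codes.
codes <- c("K", "m", "B", "h", "5", "M")

# Look letters up case-insensitively; leave every other code unchanged.
lower <- tolower(codes)
is_letter <- lower %in% names(letter_exp)
mapped <- codes
mapped[is_letter] <- as.character(letter_exp[lower[is_letter]])

mapped
```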

The sign characters (“-”, “?”, “+”), however, do not correspond to any specific exponent.

sum(prop_exp %in% c("+","?","-"))/nrow(storm_data2) * 100
## [1] 0.001551596
sum(crop_exp %in% c("+","?","-"))/nrow(storm_data2) * 100
## [1] 0.0007757978

About 0.0015% of the values in prop_exp (storm_data2$propdmgexp) and 0.00078% of the values in crop_exp (storm_data2$cropdmgexp) are sign characters. These percentages are very low, so we can safely ignore them. However, removing observations based on one variable would also discard the data in the other variables. To avoid that, we set the total damage for these rows to 0 by assigning the exponent a value of negative infinity.

levels(prop_exp) <- gsub("(\\+|\\?|\\-)","-Inf", levels(prop_exp))
levels(prop_exp)
##  [1] ""     "-Inf" "0"    "1"    "2"    "3"    "4"    "5"    "6"    "7"   
## [11] "8"    "9"
levels(crop_exp) <- gsub("(\\+|\\?|\\-)","-Inf", levels(crop_exp))
levels(crop_exp)
## [1] ""     "-Inf" "0"    "2"    "9"    "3"    "6"
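
The reason negative infinity works here is that 10 raised to -Inf evaluates to 0 in R, so any damage value multiplied by it becomes 0. A quick arithmetic check, using made-up damage values and exponents:

```r
# Hypothetical damage values with their (already numeric) exponents.
damage   <- c(25, 10, 3)
exponent <- c(3, -Inf, 6)

# 10^(-Inf) evaluates to 0, so the second record contributes nothing.
multiplier <- 10^exponent
multiplier
in_dollars <- damage * multiplier
in_dollars

# Subtracting 6 from the exponent (to express millions) keeps -Inf as -Inf.
in_millions <- damage * 10^(exponent - 6)
in_millions
```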

Let's take a look at the empty strings.

head(storm_data2[storm_data2$propdmgexp=="",c(5,6)])
##    propdmg propdmgexp
## 54       0           
## 55       0           
## 56       0           
## 57       0           
## 58       0           
## 59       0
head(storm_data2[storm_data2$cropdmgexp=="",c(7,8)])
##   cropdmg cropdmgexp
## 1       0           
## 2       0           
## 3       0           
## 4       0           
## 5       0           
## 6       0

It looks like the empty strings mostly correspond to a damage value of 0.

sum(storm_data2$propdmgexp=="" & storm_data2$propdmg != 0)/sum(storm_data2$propdmgexp=="") * 100
## [1] 0.01631132

Only about 0.02% of the rows with an empty exponent have a nonzero damage value, so the choice of exponent for empty strings has very little effect on the totals. This analysis assigns them an exponent of 1.

An exponent of 0 means a multiplier of 10^0 = 1, so the damage is simply the recorded value in dollars. Let's count the rows with exponent 0 whose recorded damage is not 1.

sum(storm_data2$propdmgexp == "0" & storm_data2$propdmg!=1)
## [1] 209
sum(storm_data2$cropdmgexp == "0" & storm_data2$cropdmg!=1)
## [1] 19

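Although they are few, we handle them the same way as the empty strings and replace the exponent 0 with 1. Note that this scales those values by 10 rather than by 1; with so few rows affected, the totals should change very little.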

levels(prop_exp) <- gsub("^$|^0$","1", levels(prop_exp))
levels(prop_exp)
##  [1] "1"    "-Inf" "2"    "3"    "4"    "5"    "6"    "7"    "8"    "9"
levels(crop_exp) <- gsub("^$|^0$","1", levels(crop_exp))
levels(crop_exp)
## [1] "1"    "-Inf" "2"    "9"    "3"    "6"

Both the empty string and 0 are now mapped to an exponent of 1.

Lastly, we can safely convert prop_exp and crop_exp from factors to numeric values and assign them back to their respective columns.

storm_data2$propdmgexp <- as.numeric(as.character(prop_exp))
storm_data2$cropdmgexp <- as.numeric(as.character(crop_exp))
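
The detour through as.character() matters: calling as.numeric() directly on a factor returns its internal level codes, not the numbers its labels spell out. A minimal illustration on a made-up factor:

```r
# Hypothetical factor whose levels are numbers stored as text.
f <- factor(c("3", "6", "9", "3"))

# Direct conversion returns the internal level codes (positions of the levels).
codes <- as.numeric(f)
codes

# Converting to character first recovers the values the labels represent.
values <- as.numeric(as.character(f))
values
```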

The storm_data2$evtype values should be in lower case so that event types differing only in capitalization are grouped together.

storm_data2$evtype <- tolower(storm_data2$evtype)
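
To see what lower-casing buys, the sketch below counts distinct labels before and after on a handful of made-up spellings (raw_types is hypothetical, not a sample from the dataset).

```r
# Hypothetical raw labels that differ only in capitalization.
raw_types <- c("Flood", "FLOOD", "flood", "Hail", "HAIL", "Tornado")

# Before normalizing, each spelling counts as its own event type.
n_before <- length(unique(raw_types))
n_before

# After lower-casing, spellings of the same event collapse together.
clean_types <- tolower(raw_types)
n_after <- length(unique(clean_types))
n_after
table(clean_types)
```

Lower-casing only merges labels that differ in case. Labels that are spelled differently, such as tstm wind and thunderstorm wind, remain separate rows in the tables of this analysis.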

Impact of Storm Events on Public Health

health_data <- storm_data2[,2:4]

top_15_events <- group_by(health_data,evtype) %>% 
  summarise(fatality.sum = sum(fatalities),injury.sum=sum(injuries), total=sum(fatalities)+sum(injuries)) %>%
  (function(x){x[order(x[[2]]+x[[3]],decreasing=T),][1:15,]})

top_15_events
## # A tibble: 15 × 4
##    evtype            fatality.sum injury.sum total
##    <chr>                    <dbl>      <dbl> <dbl>
##  1 tornado                   5633      91346 96979
##  2 excessive heat            1903       6525  8428
##  3 tstm wind                  504       6957  7461
##  4 flood                      470       6789  7259
##  5 lightning                  816       5230  6046
##  6 heat                       937       2100  3037
##  7 flash flood                978       1777  2755
##  8 ice storm                   89       1975  2064
##  9 thunderstorm wind          133       1488  1621
## 10 winter storm               206       1321  1527
## 11 high wind                  248       1137  1385
## 12 hail                        15       1361  1376
## 13 hurricane/typhoon           64       1275  1339
## 14 heavy snow                 127       1021  1148
## 15 wildfire                    75        911   986

These are the events with the greatest impact on public health.
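
The group, sum, and rank steps can be checked on a small scale. The following is a base R sketch of the same idea on made-up records; toy, totals, and n_keep are hypothetical names, and aggregate() stands in for the dplyr pipeline used above.

```r
# Hypothetical toy records; the numbers are made up for illustration.
toy <- data.frame(
  evtype     = c("tornado", "flood", "tornado", "hail", "flood", "heat"),
  fatalities = c(3, 1, 2, 0, 0, 4),
  injuries   = c(20, 5, 10, 2, 1, 0)
)

# Sum both columns per event type.
totals <- aggregate(cbind(fatalities, injuries) ~ evtype, data = toy, FUN = sum)
totals$total <- totals$fatalities + totals$injuries

# Sort by combined total, largest first, and keep the top rows.
n_keep <- 2
ranked <- totals[order(totals$total, decreasing = TRUE), ]
top_rows <- head(ranked, n_keep)
top_rows
```

Using head() rather than indexing with 1:n avoids rows of NA when there are fewer groups than requested.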

top_15_events$evtype <- with(top_15_events,
                              reorder(evtype,-fatality.sum-injury.sum))

top_15_melt <- melt(top_15_events[,-4],id.vars=1)

g_ph <- ggplot(top_15_melt,aes(x=evtype,y=value,fill=variable))
g_ph + 
  geom_col() + labs(title ="Top 15 Impactful Events on Public Health",
                    x = "",
                    y = "Affected People")+
  theme(axis.text.x = element_text(angle=45, vjust=1, hjust=1),
        plot.title = element_text(hjust = .5))+
  scale_fill_discrete(name="",labels=c("Fatalities","Injuries"))

From the plot we can see that the tornado is the most impactful storm event, with more than 96,000 fatalities and injuries combined from 1950 to 2011.

Economic Impact of Storm Events

eco_data <- storm_data2[,c(2,5:8)]
eco_data <- mutate(eco_data,propdmg=propdmg*(10**(propdmgexp-6)),
             cropdmg=cropdmg*(10**(cropdmgexp-6)),
             .keep='unused')

top_eco_evnets <- group_by(eco_data,evtype) %>% 
  summarise(propdmg.sum=sum(propdmg),
            cropdmg.sum=sum(cropdmg),
            total=sum(propdmg)+sum(cropdmg)) %>%
  (function(x){x[order(x[[2]]+x[[3]],decreasing = T),][1:15,]})

top_eco_evnets
## # A tibble: 15 × 4
##    evtype            propdmg.sum cropdmg.sum   total
##    <chr>                   <dbl>       <dbl>   <dbl>
##  1 flood                 144658.    5662.    150320.
##  2 hurricane/typhoon      69306.    2608.     71914.
##  3 tornado                56947.     415.     57362.
##  4 storm surge            43324.       0.005  43324.
##  5 hail                   15735.    3026.     18761.
##  6 flash flood            16823.    1421.     18244.
##  7 drought                 1046.   13973.     15019.
##  8 hurricane              11868.    2742.     14610.
##  9 river flood             5119.    5029.     10148.
## 10 ice storm               3945.    5022.      8967.
## 11 tropical storm          7704.     678.      8382.
## 12 winter storm            6688.      26.9     6715.
## 13 high wind               5270.     639.      5909.
## 14 wildfire                4765.     295.      5061.
## 15 tstm wind               4485.     554.      5039.

These are the events with the greatest economic impact. Damage values are given in millions of dollars.
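
The conversion to millions multiplies the recorded damage by 10 raised to the exponent minus 6. A worked example with made-up records (the values in recorded and exponent are hypothetical):

```r
# Hypothetical records: recorded damage and its power-of-ten exponent.
recorded <- c(25, 2.5, 1.2)
exponent <- c(3, 9, 6)

# Damage in dollars, then the same figures expressed in millions of dollars.
dollars  <- recorded * 10^exponent
millions <- recorded * 10^(exponent - 6)

data.frame(recorded, exponent, dollars, millions)
```

For instance, a recorded value of 25 with exponent 3 is 25,000 dollars, or 0.025 million.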

top_eco_evnets$evtype <- with(top_eco_evnets,
                              reorder(evtype,-propdmg.sum-cropdmg.sum))

top_eco_melt <- melt(top_eco_evnets[-4],id.vars=1)
top_eco_melt$variable <- with(top_eco_melt,reorder(variable,value))

g_eco <- ggplot(top_eco_melt,aes(x=evtype,y=value,fill=variable))

g_eco + 
  geom_col() + labs(title ="Top 15 Impactful Events on Economics",
                    x = "",
                    y = "Damage (Million Dollars)")+
  theme(axis.text.x = element_text(angle=45, vjust=1, hjust=1),
        plot.title = element_text(hjust = .5))+
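  # Name the breaks explicitly: reorder() above puts cropdmg.sum first,
  # so labels matched by position alone would be swapped
  scale_fill_discrete(name="",
                      breaks=c("propdmg.sum","cropdmg.sum"),
                      labels=c("Property Damage","Crop Damage"))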

From the graph we can see that the most impactful event economically is flood, which caused about $150 billion in property and crop damage from 1950 to 2011.

Most Frequent Storm Events

event_time <- storm_data2[,1:2] 
event_time$bgn.date <- as.numeric(format(event_time$bgn.date,"%m"))
event_time$evtype <- with(event_time,reorder(evtype,bgn.date,
                                          length,decreasing=TRUE))

top_freq_events <- group_by(event_time,evtype) %>% 
                    summarise(frequency=length(bgn.date),
                              most.occur=paste(
                                        month.abb[quantile(bgn.date,0.25)],"-",
                                        month.abb[quantile(bgn.date,0.75)]),
                                               ) %>%
  (function(x){x[order(x$frequency,decreasing = T),]}) %>%
  (function(x){x[,1] <- reorder(x$evtype,-x$frequency);x[1:15,]})

top_freq_events
## # A tibble: 15 × 3
##    evtype             frequency most.occur
##    <fct>                  <int> <chr>     
##  1 hail                  288661 Apr - Jul 
##  2 tstm wind             219942 May - Jul 
##  3 thunderstorm wind      82564 May - Aug 
##  4 tornado                60652 Apr - Jul 
##  5 flash flood            54277 May - Aug 
##  6 flood                  25327 Mar - Aug 
##  7 thunderstorm winds     20843 May - Jul 
##  8 high wind              20214 Feb - Nov 
##  9 lightning              15754 Jun - Aug 
## 10 heavy snow             15708 Feb - Nov 
## 11 heavy rain             11742 May - Sep 
## 12 winter storm           11433 Feb - Nov 
## 13 winter weather          7045 Jan - Nov 
## 14 funnel cloud            6844 May - Jul 
## 15 marine tstm wind        6175 May - Aug
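
The most.occur column is built from the first and third quartiles of the month numbers, which bound the middle half of an event's occurrences. A small sketch on made-up month numbers; months is hypothetical, and floor() is used here only to make explicit the truncation that happens when a fractional quartile is used as an index.

```r
# Hypothetical month numbers (1-12) for the occurrences of one event type.
months <- c(4, 5, 5, 6, 7, 7, 8)

# First and third quartiles bound the middle half of the occurrences.
q <- quantile(months, c(0.25, 0.75), names = FALSE)
q

# Quartiles can be fractional; floor() makes the truncation explicit before
# indexing into the built-in month.abb vector.
month_range <- paste(month.abb[floor(q)], collapse = " - ")
month_range
```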

# Fill colours, one per event type in the top 15
fills <- c("#cf4a34", "#48bc5e", "#be55ba", "#80b543", "#7366cb",
           "#b8ae49", "#6a8fcd", "#dc923a", "#52c1b0", "#c6456e",
           "#458e5c", "#c277ad", "#687c32", "#d88170", "#99662f")

g_freq_hist <- ggplot(top_freq_events,aes(x=evtype,
                                          y=frequency/1000)) + 
  geom_col(fill=fills)+
  scale_x_discrete(limits=top_freq_events[[1]]) + 
  labs(title="Top Frequent Events and Their Occurrence Throughout the Year",
       x=NULL,
       y="Frequency (k)") +
  theme(plot.title = element_text(hjust=0.5),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.line.x = element_blank(),
        plot.margin = unit(c(0,0,0,0),"pt"),
        strip.text = element_text(),
        legend.position = "none") 

g_time_box <- ggplot(event_time,aes(evtype,bgn.date)) + 
  geom_boxplot(fill=fills) +
  theme(axis.text.x=element_text(angle=45,vjust=1,hjust=1),
        plot.margin = unit(c(0,0,0,0),"pt"),
        legend.position = "none") +
  scale_x_discrete(limits=top_freq_events[[1]]) +
  labs(title=NULL,x=NULL,y="Months") +
  scale_y_continuous(breaks=1:12,labels = month.abb)

g_freq_hist / plot_spacer()/ g_time_box +plot_layout(heights=c(1,0,1))

From 1950 to 2011, the most frequent storm event is hail, which occurred more than 288k times, mostly between April and July. The most impactful event in terms of public health is the tornado, which is the 4th most frequent event and occurs mostly between April and July. In terms of economic consequences, the most impactful event is flood, which is the 6th most frequent event and occurs mostly between March and August.

Results

Top Events Affecting Public Health (Number of People)

Event Type          Fatalities  Injuries  Total
tornado                   5633     91346  96979
excessive heat            1903      6525   8428
tstm wind                  504      6957   7461
flood                      470      6789   7259
lightning                  816      5230   6046
heat                       937      2100   3037
flash flood                978      1777   2755
ice storm                   89      1975   2064
thunderstorm wind          133      1488   1621
winter storm               206      1321   1527
high wind                  248      1137   1385
hail                        15      1361   1376
hurricane/typhoon           64      1275   1339
heavy snow                 127      1021   1148
wildfire                    75       911    986

Top Events Affecting Economics (Million Dollars)

Event Type            Property Damage  Crop Damage       Total
flood                      144657.710    5661.9685  150319.678
hurricane/typhoon           69305.840    2607.8728   71913.713
tornado                     56947.382     414.9547   57362.337
storm surge                 43323.536       0.0050   43323.541
hail                        15735.270    3025.9547   18761.225
flash flood                 16822.678    1421.3171   18243.995
drought                      1046.106   13972.5660   15018.672
hurricane                   11868.319    2741.9100   14610.229
river flood                  5118.945    5029.4590   10148.405
ice storm                    3944.928    5022.1135    8967.042
tropical storm               7703.891     678.3460    8382.237
winter storm                 6688.497      26.9440    6715.441
high wind                    5270.046     638.5713    5908.618
wildfire                     4765.114     295.4728    5060.587
tstm wind                    4484.959     554.0073    5038.966

Most Frequent Events and Their Most Common Month Range

Event Type          Frequency  Most Common Months
hail                   288661  Apr - Jul
tstm wind              219942  May - Jul
thunderstorm wind       82564  May - Aug
tornado                 60652  Apr - Jul
flash flood             54277  May - Aug
flood                   25327  Mar - Aug
thunderstorm winds      20843  May - Jul
high wind               20214  Feb - Nov
lightning               15754  Jun - Aug
heavy snow              15708  Feb - Nov
heavy rain              11742  May - Sep
winter storm            11433  Feb - Nov
winter weather           7045  Jan - Nov
funnel cloud             6844  May - Jul
marine tstm wind         6175  May - Aug