Analysis of the most damaging storm events suggests tornadoes are by far the worst

Synopsis:

Using storm event data from the National Weather Service, this analysis highlights the top 10 storm event types that are most harmful to population health, as well as those that have the greatest economic impact. For population health, we look at total fatalities and total injuries (separately) for each event type. For economic impact, we look at total combined property and crop damage for each event type. All figures are total reported counts, grouped by storm event type. We provide graphs that show the top 10 event types for fatalities, injuries, and combined property and crop damage. In summary, the event type most harmful to population health is the tornado (196% more fatalities than the next most harmful event type), and the event types with the greatest economic impact are extreme storm occurrences, especially when a tornado is involved.

Data Processing:

To process the data, we download and read in the compressed csv file from the National Weather Service. You can see the link and code to do so below. We format the date column as a date variable and load the R libraries needed later in the analysis. We also remove any inappropriate event type entries, and convert the property and crop damage fields so they are both expressed in dollars for use in the analysis.

#---------------------------------------
## Loading and preprocessing the data
#---------------------------------------
setwd("C:/Users/Sarah Lynn/Desktop/Self Study/Coursera DS JH - reproducible research/Week 4 project")
# The download is bzip2-compressed; read.csv() decompresses it transparently
dstfile <- file.path(getwd(),"storm_data.csv.bz2")
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",destfile=dstfile,mode="wb")
data0 <- read.csv(dstfile)
data0$BGN_DATE <- gsub(x=data0$BGN_DATE,pattern=" 0:00:00",replacement="",fixed=TRUE)
data0$BGN_DATE <- as.Date(data0$BGN_DATE,"%m/%d/%Y")
library(dplyr)
library(ggplot2)
library(knitr)
#Remove inappropriate event type entries (summary rows and "?")
data1 <- data0[!grepl("(^Summary.*)|(\\?)",data0$EVTYPE),]
#Keep rows where at least one damage exponent is K, M or B
data11 <- data1[grepl("K|M|B",data1$PROPDMGEXP)|grepl("K|M|B",data1$CROPDMGEXP),]

data11$PROPDMGEXP <- sub("K",1000,data11$PROPDMGEXP)
data11$PROPDMGEXP <- sub("M",1000000,data11$PROPDMGEXP)
data11$PROPDMGEXP <- sub("B",1000000000,data11$PROPDMGEXP)
data11$PROPDMGEXP <- as.numeric(data11$PROPDMGEXP)

## Warning: NAs introduced by coercion

data11$CROPDMGEXP <- sub("K",1000,data11$CROPDMGEXP)
data11$CROPDMGEXP <- sub("M",1000000,data11$CROPDMGEXP)
data11$CROPDMGEXP <- sub("B",1000000000,data11$CROPDMGEXP)
data11$CROPDMGEXP <- as.numeric(data11$CROPDMGEXP)

## Warning: NAs introduced by coercion

data22 <- mutate(data11,PROPDMG2=PROPDMG*PROPDMGEXP,CROPDMG2=CROPDMG*CROPDMGEXP)
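
The exponent conversion above can also be written as a single lookup. The following is a minimal sketch on made-up rows, not the code used for the results in this report. The helper name exp_to_multiplier and the toy data frame are hypothetical, and treating a blank exponent as a multiplier of 1 is our assumption rather than something the source data documents.

```r
# Hypothetical helper: map exponent codes to numeric multipliers.
# Assumption: a blank code means the damage figure is already in dollars.
exp_to_multiplier <- function(code) {
  code <- as.character(code)
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(code)])  # unrecognised codes become NA
  mult[code %in% ""] <- 1                # assumption, see above
  mult
}

# Made-up rows for illustration only
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1.6, 10),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)
toy$PROPDMG2 <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)
toy
```

Any code outside K, M, B, or blank comes back as NA rather than being guessed, so a later sum() over such a column would need na.rm=TRUE or an explicit decision about those rows.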

Analysis Part 1: Storm events most harmful to population health

Here we will assess which storm events are most harmful to population health based on fatalities and injuries from each storm event recorded in our dataset. We will rank the top 10 most harmful events based on fatalities and injuries, and make a final list based on the intersection of these.

The reason we use this approach is that there is no good way to equate an injury to a fatality. Hence, instead of combining and ranking total counts, we will rank them separately and use the intersection. We will also report the top-ranked events for each measure separately, so that both measures are represented.

data2 <-  data1 %>%
                      group_by(EVTYPE) %>%
                        summarise_at(c("FATALITIES","INJURIES"),list(x_sum=sum,x_cnt=length))

data3 <- data2[!(data2$FATALITIES_x_sum==0&data2$INJURIES_x_sum==0),]
data4 <- mutate(data3,tot_hurt=FATALITIES_x_sum+INJURIES_x_sum
                            ,FATALITIES_per_event=FATALITIES_x_sum/FATALITIES_x_cnt
                            ,INJURIES_per_event=INJURIES_x_sum/INJURIES_x_cnt)
ordered_data_f <- arrange(data4,desc(FATALITIES_x_sum))
ordered_data_i <- arrange(data4,desc(INJURIES_x_sum))
top10_fatalities <- data.frame(total_fatalities=ordered_data_f$FATALITIES_x_sum[1:10],event_type=ordered_data_f$EVTYPE[1:10])
top10_injuries <- data.frame(total_injuries=ordered_data_i$INJURIES_x_sum[1:10],event_type=ordered_data_i$EVTYPE[1:10])
top10_fatalities

##    total_fatalities     event_type
## 1              5633        TORNADO
## 2              1903 EXCESSIVE HEAT
## 3               978    FLASH FLOOD
## 4               937           HEAT
## 5               816      LIGHTNING
## 6               504      TSTM WIND
## 7               470          FLOOD
## 8               368    RIP CURRENT
## 9               248      HIGH WIND
## 10              224      AVALANCHE

top10_injuries

##    total_injuries        event_type
## 1           91346           TORNADO
## 2            6957         TSTM WIND
## 3            6789             FLOOD
## 4            6525    EXCESSIVE HEAT
## 5            5230         LIGHTNING
## 6            2100              HEAT
## 7            1975         ICE STORM
## 8            1777       FLASH FLOOD
## 9            1488 THUNDERSTORM WIND
## 10           1361              HAIL

From the output, you can see that tornadoes are by far the most harmful storm event for both fatalities and injuries. To ensure this isn’t a mistake with the data, we do the following check:

tornado_inj_data <- data1[data1$EVTYPE=="TORNADO",]$INJURIES
tornado_fat_data <- data1[data1$EVTYPE=="TORNADO",]$FATALITIES
# Use a list so the two vectors can differ in length without recycling
boxplot(list(fatalities=log(tornado_fat_data[tornado_fat_data!=0])
              ,injuries=log(tornado_inj_data[tornado_inj_data!=0]))
               ,main="Boxplot to evaluate data for outliers or bad data")

By removing the zeros (which is fine because these are the tornado events that had no fatalities or injuries) and taking the log because of the extreme skew of the data, we can tell that no single outlier is inflating the totals. Instead, tornado events must simply be more common and/or more harmful when they occur.

To see just how much more extreme the harm from tornadoes is compared with the next storm event, we’ll do the following:

tot_tornado_fatalities <- as.numeric(top10_fatalities[1,1])
tot_tornado_injuries <- as.numeric(top10_injuries[1,1])
tot_next_fatalities <- as.numeric(top10_fatalities[2,1])
tot_next_injuries <- as.numeric(top10_injuries[2,1])
percent_lift_f <- tot_tornado_fatalities/tot_next_fatalities - 1
percent_lift_i <- tot_tornado_injuries/tot_next_injuries - 1
percent_lift_f

## [1] 1.960063

percent_lift_i

## [1] 12.13008

The output above shows that tornadoes account for 196% more fatalities than the next highest ranked storm type, and 1,213% more injuries than the next highest ranked storm type. That’s big!
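
As a quick check of that arithmetic, the lift can be recomputed directly from the totals printed in the two top 10 tables. This is a minimal sketch; the helper name percent_lift is hypothetical and is not part of the analysis code.

```r
# Hypothetical helper: relative excess of the top value over the runner-up
percent_lift <- function(top, runner_up) top / runner_up - 1

# Totals copied from the top 10 tables above
lift_f <- percent_lift(5633, 1903)   # fatalities: TORNADO vs EXCESSIVE HEAT
lift_i <- percent_lift(91346, 6957)  # injuries: TORNADO vs TSTM WIND

# Expressed as whole percentages
round(100 * c(fatalities = lift_f, injuries = lift_i))
```

Note that a lift of 196% means the tornado total is roughly 2.96 times the runner-up, not 1.96 times.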

Next we will find the intersection of the top 10 events by injuries and fatalities. From the following code, we see that there are 7 shared events. They are ranked by the sum of their ranks for fatalities and injuries, and output below:

top_injuries <- as.data.frame(cbind(ordered_data_i$INJURIES_x_sum[1:10],ordered_data_i$EVTYPE[1:10]))
top_fatalities <- as.data.frame(cbind(ordered_data_f$FATALITIES_x_sum[1:10],ordered_data_f$EVTYPE[1:10]))
top_injuries_rnk <- mutate(top_injuries,rnk1 = row_number())
top_fatalities_rnk <- mutate(top_fatalities,rnk2 = row_number())
combined_top_health0 <- merge(top_injuries_rnk,top_fatalities_rnk,by=c("V2"),all=TRUE) %>%
     mutate(rnk=rnk1+rnk2) 
combined_top_health <- combined_top_health0[!is.na(combined_top_health0$rnk),]  %>%
           arrange(rnk) %>%
              mutate(rank = row_number(),event_type=V2) %>%
                 select(rank,event_type)
combined_top_health

##   rank     event_type
## 1    1        TORNADO
## 2    2 EXCESSIVE HEAT
## 3    3      TSTM WIND
## 4    4          FLOOD
## 5    5           HEAT
## 6    6      LIGHTNING
## 7    7    FLASH FLOOD
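
The rank-sum intersection can also be reproduced from the event names in the two top 10 tables alone. The sketch below uses base R; the object names (inj, fat, shared) are ours, and breaking ties alphabetically is an assumption that happens to match the order shown above.

```r
# Event names copied from the top 10 tables above, in rank order
injuries_top10 <- c("TORNADO", "TSTM WIND", "FLOOD", "EXCESSIVE HEAT", "LIGHTNING",
                    "HEAT", "ICE STORM", "FLASH FLOOD", "THUNDERSTORM WIND", "HAIL")
fatalities_top10 <- c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD", "HEAT", "LIGHTNING",
                      "TSTM WIND", "FLOOD", "RIP CURRENT", "HIGH WIND", "AVALANCHE")

inj <- data.frame(event_type = injuries_top10, rnk_inj = seq_along(injuries_top10),
                  stringsAsFactors = FALSE)
fat <- data.frame(event_type = fatalities_top10, rnk_fat = seq_along(fatalities_top10),
                  stringsAsFactors = FALSE)

# merge() defaults to an inner join, so only events in both lists survive
shared <- merge(inj, fat, by = "event_type")
shared$rank_sum <- shared$rnk_inj + shared$rnk_fat

# Order by combined rank; ties are broken alphabetically (assumption)
shared <- shared[order(shared$rank_sum, shared$event_type), ]
shared$event_type
```

FLOOD, HEAT, and LIGHTNING all have a combined rank of 10, so their relative order depends entirely on how ties are broken.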

Analysis Part 2: Storm events that have the greatest economic impact

Here we will assess which storm events have the greatest economic impact based on property damage and crop damage from each storm event recorded in our dataset. We will rank the top 10 most harmful events based on property and crop damage combined. Note that we can combine these metrics from the start to rank them because, unlike fatalities and injuries, these have a common base of being measured in dollars.

Using the converted property and crop damage fields so everything is in terms of dollars, we will view the top 10 events.

data33 <-  data22 %>%
  group_by(EVTYPE) %>%
  summarise_at(c("PROPDMG2","CROPDMG2"),list(x_sum=sum,x_cnt=length))
data44 <- data33[!(data33$PROPDMG2_x_sum==0&data33$CROPDMG2_x_sum==0),]
data55 <- mutate(data44,tot_DMG=PROPDMG2_x_sum+CROPDMG2_x_sum
                ,PROPDMG_per_event=PROPDMG2_x_sum/PROPDMG2_x_cnt
                ,CROPDMG_per_event=CROPDMG2_x_sum/CROPDMG2_x_cnt)
ordered_data_pc <- arrange(data55,desc(tot_DMG))
top_dmg <- as.data.frame(cbind(total_damage=ordered_data_pc$tot_DMG[1:10],event_type=ordered_data_pc$EVTYPE[1:10]))
top_dmg_rnk <- mutate(top_dmg,rank = row_number()) %>% 
             arrange(rank) %>%
                 select(rank,event_type)
top_dmg_rnk

##    rank                 event_type
## 1     1 TORNADOES, TSTM WIND, HAIL
## 2     2                    TSUNAMI
## 3     3            HIGH WINDS/COLD
## 4     4  HURRICANE OPAL/HIGH WINDS
## 5     5    WINTER STORM HIGH WINDS
## 6     6       TROPICAL STORM JERRY
## 7     7            LAKESHORE FLOOD
## 8     8     HIGH WINDS HEAVY RAINS
## 9     9               FOREST FIRES
## 10   10       FLASH FLOODING/FLOOD

Notice that the 4th highest event type listed is HURRICANE OPAL/HIGH WINDS, which is presumably a specific storm (hence the name “Opal” in the event type). As a result, we will now view the top 10 individual storm event occurrences to see whether our list is driven by specific storms rather than storm types.

data_severe_storms <- data22[data22$EVTYPE%in%ordered_data_pc$EVTYPE[1:10],]
ordered_data_severe_storms <- select(data_severe_storms,BGN_DATE,COUNTYNAME,STATE,EVTYPE,PROPDMG2,CROPDMG2) %>%
  arrange(desc(PROPDMG2+CROPDMG2))
names(ordered_data_severe_storms)  <- c("Date","County_Name","State","Event_Type","Property_Damage", "Crop_Damage")
ordered_data_severe_storms[1:10,]

##          Date                                          County_Name State
## 1  1993-03-12                                           FLZ001>023    FL
## 2  1995-10-04                                           ALZ001>050    AL
## 3  2009-09-29                                               PSZ002    AS
## 4  1995-12-09                                      CAZ01>03 06>010    CA
## 5  1993-03-13                                               SCZ008    SC
## 6  1993-03-13                                               SCZ007    SC
## 7  2011-03-11                                               CAZ529    CA
## 8  1995-08-23 FLZ039 - 042>043 - 048>052 - 055>057 - 060>062 - 065    FL
## 9  2011-03-11                                               HIZ023    HI
## 10 2006-11-15                                               CAZ001    CA
##                    Event_Type Property_Damage Crop_Damage
## 1  TORNADOES, TSTM WIND, HAIL        1.60e+09     2.5e+06
## 2   HURRICANE OPAL/HIGH WINDS        1.00e+08     1.0e+07
## 3                     TSUNAMI        8.10e+07     2.0e+04
## 4     WINTER STORM HIGH WINDS        6.00e+07     5.0e+06
## 5             HIGH WINDS/COLD        5.00e+07     5.0e+06
## 6             HIGH WINDS/COLD        5.00e+07     5.0e+05
## 7                     TSUNAMI        2.66e+07     0.0e+00
## 8        TROPICAL STORM JERRY        4.00e+06     1.5e+07
## 9                     TSUNAMI        1.42e+07     0.0e+00
## 10                    TSUNAMI        9.20e+06     0.0e+00

From the list output above, we can see that the top storm event type, both per occurrence and in total, is TORNADOES, TSTM WIND, HAIL. This entry corresponds to a single storm in Florida in 1993. The second entry on the list is likewise a specific storm, Hurricane Opal.

As a result, when we report the storm events that have the greatest economic impact, we should keep in mind that these are often specific severe storm event occurrences.
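
One way to quantify how much a total depends on a single occurrence is to compute, for each event type, the number of records and the share of the total that comes from the largest record. The following is a minimal base R sketch on made-up numbers. The event names, the damage figures, and the concentration table are all hypothetical and are not drawn from the storm data.

```r
# Made-up records for illustration only (not taken from the storm data)
toy <- data.frame(
  EVTYPE = c("NAMED STORM X", "FLOOD", "FLOOD", "FLOOD", "HAIL", "HAIL"),
  damage = c(100e6, 30e6, 20e6, 10e6, 5e6, 5e6),
  stringsAsFactors = FALSE
)

# Per event type: record count, total damage, and largest single record
n_records <- tapply(toy$damage, toy$EVTYPE, length)
total     <- tapply(toy$damage, toy$EVTYPE, sum)
largest   <- tapply(toy$damage, toy$EVTYPE, max)

concentration <- data.frame(
  event_type    = names(total),
  n_records     = as.vector(n_records),
  total_damage  = as.vector(total),
  largest_share = as.vector(largest / total),  # 1 means a single record is the whole total
  stringsAsFactors = FALSE
)
concentration[order(-concentration$total_damage), ]
```

An event type with one record and a largest_share of 1 is effectively a single storm rather than a recurring storm type.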

Results

In summary, the following figure shows the top storm events for fatalities, injuries, and total damage dollars.

par(mfrow=c(1,3),mar=c(8,10,4,5),mgp=c(4.5,1,0))

barplot(ordered_data_f$FATALITIES_x_sum[10:1],names.arg=ordered_data_f$EVTYPE[10:1]
        ,main="Top 10 events - fatalities"
        ,xlab="Fatalities total"
        ,las=2
        ,horiz=TRUE
        )


barplot(ordered_data_i$INJURIES_x_sum[10:1],names.arg=ordered_data_i$EVTYPE[10:1]
        ,main="Top 10 events - injuries"
        ,xlab="Injuries total"
        ,las=2
        ,horiz=TRUE
        )

barplot(ordered_data_pc$tot_DMG[10:1],names.arg=ordered_data_pc$EVTYPE[10:1]
        ,main="Top 10 events - damage $s"
        ,xlab="property & crop damage total"
        ,las=2
        ,horiz=TRUE
)

The following table reports the top storm events for harm to population health and for economic impact.

final_data <- cbind(rank=combined_top_health[1:5,1],population_health=combined_top_health[1:5,2],economic_impact=top_dmg_rnk[1:5,2])
kable(final_data, caption = "Most Harmful Storm Events", align=c("c","l","l"))

Most Harmful Storm Events
rank	population_health	economic_impact
1	TORNADO	TORNADOES, TSTM WIND, HAIL
2	EXCESSIVE HEAT	TSUNAMI
3	TSTM WIND	HIGH WINDS/COLD
4	FLOOD	HURRICANE OPAL/HIGH WINDS
5	HEAT	WINTER STORM HIGH WINDS

In conclusion, tornadoes are by far the most damaging storm type for population health, and when part of a larger storm, they also have the greatest economic impact. The most damaging storm types for population health are tornado, excessive heat, thunderstorm wind (TSTM WIND), flood, and heat (in order of severity). The storm types with the greatest economic impact are extreme storm occurrences such as tornadoes/thunderstorm wind/hail, tsunami, high winds/cold, Hurricane Opal/high winds, and winter storm high winds (also in order of severity).