Severe Weather Impact in USA

Prepared by Andrew Leonard
Reproducible Research Project #2
Johns Hopkins Data Science on Coursera

Synopsis

The following report looks at the impact severe weather has on the United States to answer two questions:

  1. Across the United States, which types of events are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

The analysis uses data from the NOAA Storm Database, which covers 1950 to November 2011. Because of how the data were collected, my analysis only uses data from 1993 to November 2011, for reasons described in the Data Processing section. The report concludes that the top 5 events most harmful to population health are tornadoes, heat, floods, thunderstorm winds, and winter storms, with heat causing the most fatalities and tornadoes causing the most injuries. The top 5 most costly events are floods, hurricanes, storm surges, tornadoes, and hail. Floods had the highest cost over the period assessed, with roughly 168 billion dollars in damages.

A report of how I processed and analyzed the data to arrive at these conclusions follows.

Data Processing

The first code block will load all needed libraries and load in the dataset used for the analysis.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
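#download the compressed data once, then read it (read.csv decompresses .bz2 files directly)
URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists("stormdata.csv.bz2")) download.file(URL, "stormdata.csv.bz2", mode = "wb")
stormdata <- read.csv("stormdata.csv.bz2")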

The next code block will select only the needed variables from the dataset.

#remove unneeded variables
stormdatanew <- select(stormdata, STATE__, BGN_DATE, COUNTY, COUNTYNAME, STATE, EVTYPE, MAG,
                       FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

Next, I clean the event type variable. The following code fixes typos and combines related event types. The event types chosen for cleaning account for more than 90% of injuries, fatalities, and economic damage. While there are additional event types that could be cleaned, the analysis showed that cleaning them would not change the answers to the questions being asked. This will be shown in more depth below.

#edit event variables - all upper case letters
#chose these specific events after exploring top 90% of injuries and fatalities
stormdatanew$EVTYPE <- toupper(stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("RIP CURRENTS", "RIP CURRENT", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("EXCESSIVE HEAT", "HEAT", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HEAT WAVE", "HEAT", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("STRONG WIND", "HIGH WIND", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("FLASH FLOOD", "FLOOD", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HEAVY SNOW", "WINTER STORM", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("BLIZZARD", "WINTER STORM", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("THUNDERSTORM WIND", "TSTM WIND", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("TSTM WINDS", "TSTM WIND", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("WILD/FOREST FIRE", "WILDFIRE", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HIGHWIND", "TSTM WIND", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("ICE STORM", "WINTER STORM", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("WINTER WEATHER", "WINTER STORM", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("EXTREME COLD/WIND CHILL", "EXTREME COLD", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HIGH WINDS", "HIGH WIND", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HURRICANE/TYPHOON", "HURRICANE", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HURRICANE OPAL/HIGH WINDS", "HURRICANE", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HURRICANE GORDON", "HURRICANE", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HURRICANE EMILY", "HURRICANE", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HURRICANE ERIN", "HURRICANE", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HURRICANE FELIX", "HURRICANE", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HURRICANE OPAL","HURRICANE", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HURRICANE-GENERATED SWELLS","HURRICANE", stormdatanew$EVTYPE)
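
The same kind of cleaning can also be written as a table of rules applied in a loop, which keeps the substitutions in one place and makes their order explicit. The following is only a minimal sketch in base R, not the code used for this report: normalize_evtype and rules are hypothetical names, the rules are a small subset of the substitutions above, and fixed = TRUE (literal matching) is an assumption of the sketch.

```r
# Hypothetical helper: apply an ordered set of pattern -> replacement rules.
normalize_evtype <- function(x, rules) {
  x <- toupper(x)
  for (i in seq_along(rules)) {
    # fixed = TRUE treats each pattern as a literal string, not a regex
    x <- gsub(names(rules)[i], rules[[i]], x, fixed = TRUE)
  }
  x
}

# Subset of the substitutions from the report; order matters, because
# "THUNDERSTORM WINDS" first becomes "TSTM WINDS" and then "TSTM WIND".
rules <- c(
  "EXCESSIVE HEAT"    = "HEAT",
  "HEAT WAVE"         = "HEAT",
  "FLASH FLOOD"       = "FLOOD",
  "THUNDERSTORM WIND" = "TSTM WIND",
  "TSTM WINDS"        = "TSTM WIND",
  "HURRICANE/TYPHOON" = "HURRICANE"
)

# Toy input (made up for illustration)
raw <- c("Excessive Heat", "FLASH FLOOD", "THUNDERSTORM WINDS",
         "Hurricane/Typhoon", "HAIL")
cleaned <- normalize_evtype(raw, rules)
print(cleaned)
```

Because each rule runs on the output of the previous one, a rule that rewrites part of a longer label can prevent a later, more specific rule from ever matching, so more specific patterns are safer placed first.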

After exploration, it is clear that before 1993 only thunderstorm wind, hail, and tornado events were recorded. Using data from 1950 to 1992 would overemphasize these events in the analysis. Thus, I only use data from 1993 onward, as this data includes all 48 event types. The following code shows that only 3 event types were recorded prior to 1993, and pulls out data from 1993 to 2011 for analysis.

#convert BGN_DATE to date class
stormdatanew$BGN_DATE <- as.character(stormdatanew$BGN_DATE)
stormdatanew$BGN_DATE <- as.Date(stormdatanew$BGN_DATE, format = "%m/%d/%Y")

#filter data for before 1993 and after 1993. View number of event types for each.
before1993 <- filter(stormdatanew, stormdatanew$BGN_DATE < "1993-01-01")
length(unique(before1993$EVTYPE))
## [1] 3
after1993 <-filter(stormdatanew, stormdatanew$BGN_DATE >= "1993-01-01")
length(unique(after1993$EVTYPE))
## [1] 840
#group the data from 1993 onward by event type for the summaries below
after1993 <- group_by(after1993, EVTYPE)
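
One way to see where the recording practice changes is to count distinct event types per year instead of only before and after a single cutoff. The sketch below uses base R and a made-up toy data frame (toy, types_per_year, and cutoff are hypothetical names, and the rows are not NOAA records). It also compares dates against a Date object rather than a character string.

```r
# Toy data standing in for the storm data (made-up rows, not NOAA records)
toy <- data.frame(
  BGN_DATE = as.Date(c("1990-05-01", "1991-06-12", "1992-07-04",
                       "1993-01-15", "1993-08-20", "1994-02-02")),
  EVTYPE   = c("TORNADO", "HAIL", "TORNADO", "FLOOD", "HEAT", "WINTER STORM"),
  stringsAsFactors = FALSE
)

# Extract the year, then count distinct event types within each year
toy$YEAR <- as.integer(format(toy$BGN_DATE, "%Y"))
types_per_year <- tapply(toy$EVTYPE, toy$YEAR, function(x) length(unique(x)))
print(types_per_year)

# Split at the cutoff, comparing Date to Date
cutoff   <- as.Date("1993-01-01")
n_before <- length(unique(toy$EVTYPE[toy$BGN_DATE <  cutoff]))
n_after  <- length(unique(toy$EVTYPE[toy$BGN_DATE >= cutoff]))
```

On the real data, a jump in the per-year count would mark the first year in which more than the three early event types appear.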

The remaining processing prepares the damage variables needed to answer question #2.

The PROPDMGEXP and CROPDMGEXP variables use K to represent damage in the thousands, M to represent damage in the millions, and B to represent damage in the billions. I will replace these letters with the appropriate numerical values to make analysis easier, then multiply the PROPDMG variable by this multiplier to get a numerical value of the total property damage. I’ll store this in a new variable, TOTPROPDMG. The same steps applied to CROPDMG give TOTCROPDMG.

#replace property damage code with appropriate numbers for analysis. 
after1993$PROPDMGEXP <- sub("K", 1000, after1993$PROPDMGEXP)
after1993$PROPDMGEXP <- sub("M", 1000000, after1993$PROPDMGEXP)
after1993$PROPDMGEXP <- sub("m", 1000000, after1993$PROPDMGEXP)
after1993$PROPDMGEXP <- sub("B", 1000000000, after1993$PROPDMGEXP)
after1993$PROPDMGEXP <- as.numeric(after1993$PROPDMGEXP)
## Warning: NAs introduced by coercion
after1993$TOTPROPDMG <- after1993$PROPDMG*after1993$PROPDMGEXP

after1993$CROPDMGEXP <- sub("K", 1000, after1993$CROPDMGEXP)
after1993$CROPDMGEXP <- sub("M", 1000000, after1993$CROPDMGEXP)
after1993$CROPDMGEXP <- sub("m", 1000000, after1993$CROPDMGEXP)
after1993$CROPDMGEXP <- sub("B", 1000000000, after1993$CROPDMGEXP)
after1993$CROPDMGEXP <- as.numeric(after1993$CROPDMGEXP)
## Warning: NAs introduced by coercion
after1993$TOTCROPDMG <- after1993$CROPDMG*after1993$CROPDMGEXP
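
The letter-to-multiplier step can also be expressed as a lookup in a named vector. This is a minimal base R sketch, not the code used for the results in this report. exp_lookup and to_multiplier are hypothetical names, the input values are made up, and mapping every code other than K, M, and B to NA is an assumption of the sketch.

```r
# Multipliers for the letter codes described above; anything else maps to NA
exp_lookup <- c(K = 1e3, M = 1e6, B = 1e9)

to_multiplier <- function(code) {
  # toupper() lets "m" and "M" share one entry; unname() drops the code labels
  unname(exp_lookup[toupper(as.character(code))])
}

# Toy values (made up for illustration)
dmg   <- c(25, 2.5, 1, 10)
code  <- c("K", "m", "B", "")
total <- dmg * to_multiplier(code)
print(total)
```

A lookup makes the handling of unexpected codes explicit: here they become NA, and a later sum(..., na.rm = TRUE) would drop those rows from the totals.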

Analysis

Question #1 Analysis

Analyze the data to find which events cause the most harm to population health. I will look at both injuries and fatalities to answer this question. The following code blocks address question #1.

#create new dataframe that groups fatalities and injuries by event type
harm <- summarise(after1993, FATALITIES = sum(FATALITIES), 
                    INJURIES = sum(INJURIES))

#table to help understand the types of events to subset. 
table(harm$FATALITIES > 100)
## 
## FALSE  TRUE 
##   828    12
table(harm$INJURIES > 200)
## 
## FALSE  TRUE 
##   822    18
#subset data to most harmful types of storms 
mostharm <- subset(harm, harm$FATALITIES > 100 | harm$INJURIES > 200)

While I cleaned the majority of the event types, some events will not be included in the analysis. The retained event types, however, account for 96% of all injuries and fatalities. The remaining 4% will not change the results.

#Summaries to check % of event types included in analysis. 
sum(mostharm$FATALITIES)
## [1] 9967
sum(after1993$FATALITIES)
## [1] 10865
sum(mostharm$INJURIES)
## [1] 66572
sum(after1993$INJURIES)
## [1] 68765
#share of all fatalities and injuries covered by the retained event types (about 96%)
(sum(mostharm$FATALITIES)+sum(mostharm$INJURIES))/(sum(after1993$FATALITIES)+sum(after1993$INJURIES))
## [1] 0.961183
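
The combined figure can be broken down into separate coverage shares for fatalities and injuries using the four sums printed above. The sketch below only redoes that arithmetic in base R, and the variable names are hypothetical.

```r
# Sums copied from the output above
kept_fatalities <- 9967
all_fatalities  <- 10865
kept_injuries   <- 66572
all_injuries    <- 68765

# Coverage of each measure on its own, and of both combined
share_fatalities <- kept_fatalities / all_fatalities
share_injuries   <- kept_injuries / all_injuries
share_combined   <- (kept_fatalities + kept_injuries) /
                    (all_fatalities + all_injuries)

print(round(c(fatalities = share_fatalities,
              injuries   = share_injuries,
              combined   = share_combined), 3))
```

Because injuries are far more numerous than fatalities, the combined share sits close to the injury share. The fatality share on its own is lower than the combined figure.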

For the final results, I will look at the sum of injuries, the sum of fatalities, and the sum of both injuries and fatalities to best understand the harm the events have on population health. The following code block will create the new variable and look at most harmful events to population health.

#new variable shows sum of fatalities and injuries
mostharm$TOTALHARM <- mostharm$FATALITIES+mostharm$INJURIES

#arrange in order of most harm variable
mostharm <- arrange(mostharm, desc(TOTALHARM))

#view dataset to see most harmful events
mostharm
## # A tibble: 20 x 4
##    EVTYPE         FATALITIES INJURIES TOTALHARM
##    <chr>               <dbl>    <dbl>     <dbl>
##  1 TORNADO              1621    23310     24931
##  2 HEAT                 3012     9004     12016
##  3 FLOOD                1448     8566     10014
##  4 TSTM WIND             438     6027      6465
##  5 WINTER STORM          556     5520      6076
##  6 LIGHTNING             816     5230      6046
##  7 HIGH WIND             394     1740      2134
##  8 WILDFIRE               87     1456      1543
##  9 HURRICANE             133     1326      1459
## 10 RIP CURRENT           572      529      1101
## 11 HAIL                   10      960       970
## 12 FOG                    62      734       796
## 13 EXTREME COLD          287      255       542
## 14 DUST STORM             22      440       462
## 15 TROPICAL STORM         58      340       398
## 16 AVALANCHE             224      170       394
## 17 DENSE FOG              18      342       360
## 18 HEAVY RAIN             98      251       349
## 19 HIGH SURF             104      156       260
## 20 GLAZE                   7      216       223

Take the top 5 most harmful events and graph their information in the results section. The top5harm dataset will be the final dataset used to graph.

#from the above table, we can pull top 5 harmful events to graph in results section
top5harm <- subset(mostharm[1:5,])

Question #2 Analysis

I’ll explore economic damage by event type next to answer question #2: which types of events have the greatest economic consequences?

#create new dataframe that groups economic damage by event type
economicdmg <- summarise(after1993, TOTPROPDMG = sum(TOTPROPDMG, na.rm = TRUE), 
                         TOTCROPDMG = sum(TOTCROPDMG, na.rm = TRUE))

#replace NAs with 0
economicdmg$TOTPROPDMG <- economicdmg$TOTPROPDMG %>%  replace_na(0)
economicdmg$TOTCROPDMG <- economicdmg$TOTCROPDMG %>%  replace_na(0)

Add together property damage and crop damage for each event type to find the top 5 most costly events.

#add together property and crop damage
economicdmg<- mutate(economicdmg, TOTALDMG = TOTPROPDMG+TOTCROPDMG)

#filter events with economic cost.
mostdmg <- filter(economicdmg, economicdmg$TOTALDMG>0)

#arrange events to have most expensive on top.
mostdmg <- arrange(mostdmg, desc(mostdmg$TOTALDMG))

View most costly events.

#view dataset to see most costly events
mostdmg
## # A tibble: 354 x 4
##    EVTYPE            TOTPROPDMG  TOTCROPDMG      TOTALDMG
##    <chr>                  <dbl>       <dbl>         <dbl>
##  1 FLOOD          160798521887.  7083285550 167881807437.
##  2 HURRICANE       84656180010   5505292800  90161472810 
##  3 STORM SURGE     43323536000         5000  43323541000 
##  4 TORNADO         26338962421    414953110  26753915531 
##  5 HAIL            15732266870   3025537450  18757804320 
##  6 WINTER STORM    12246094158.  5310770600  17556864758.
##  7 DROUGHT          1046106000  13972566000  15018672000 
##  8 TSTM WIND        9704092218.  1159501100  10863593318.
##  9 RIVER FLOOD      5118945500   5029459000  10148404500 
## 10 TROPICAL STORM   7703890550    678346000   8382236550 
## # … with 344 more rows
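
The totals in this table are in dollars, which makes the magnitudes hard to read at a glance. As a small sketch in base R, the TOTALDMG values of the five costliest rows can be converted to billions of dollars. The values are copied from the table above (the table prints them without any fractional part), and total_dmg and billions are hypothetical names.

```r
# TOTALDMG for the five costliest event types, copied from the table above
total_dmg <- c("FLOOD"       = 167881807437,
               "HURRICANE"   = 90161472810,
               "STORM SURGE" = 43323541000,
               "TORNADO"     = 26753915531,
               "HAIL"        = 18757804320)

# Convert dollars to billions of dollars, rounded to one decimal place
billions <- round(total_dmg / 1e9, 1)
print(billions)
```

On this scale floods come to about 167.9 billion dollars, a little under twice the hurricane total.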

Pull out the 5 most costly events. The dataset top5dmg will be the final dataset used to graph results.

#pull out top 5 events
top5dmg <- subset(mostdmg[1:5,])

Results

Question #1 Results: Across the United States, which types of events are most harmful with respect to population health?

Graph the top 5 harmful events in 3 ways. The first graph shows the total fatalities of each event. The second shows total injuries. The final will show the sum of injuries and fatalities.

#graph fatalities (g1), injuries (g2), and their sum (g3)
g1 <- ggplot(top5harm, aes(EVTYPE, FATALITIES)) + geom_bar(stat="identity")+
    ggtitle("Top 5 Harmful Events 1993 to 2011")
g2 <- ggplot(top5harm, aes(EVTYPE, INJURIES)) + geom_bar(stat="identity")
g3 <- ggplot(top5harm, aes(EVTYPE, TOTALHARM)) + geom_bar(stat="identity")

grid.arrange(g1, g2, g3, nrow = 3)

It is clear from the analysis above that the top 5 events most harmful to population health are tornadoes, heat, floods, thunderstorm winds, and winter storms, with heat causing the most fatalities and tornadoes causing the most injuries.

Question #2 Results: Across the United States, which types of events have the greatest economic consequences?

Next, graph the total economic damage done by the 5 most costly events.

#graph economic damage
ggplot(top5dmg, aes(EVTYPE, TOTALDMG)) + geom_bar(stat="identity")+
    ggtitle("Total Economic Damage 1993 to 2011")+ylab("Cost of Damage in $")

The top 5 most costly events in terms of economic damage are floods, hurricanes, storm surges, tornadoes, and hail. Floods had the highest cost over the time period, with roughly 168 billion dollars in damages.