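The following report looks at the impact severe weather has on the United States to answer two questions: (1) which types of events are most harmful to population health, and (2) which types of events have the greatest economic consequences?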
The analysis uses data from the NOAA Storm Database, which covers 1950 to November 2011. Because of how the data were collected, my analysis uses only the records from 1993 to November 2011, for reasons described in the Data Processing section. The report concludes that the top 5 events most harmful to population health are tornadoes, heat, floods, thunderstorm winds, and winter storms, with heat causing the most fatalities and tornadoes causing the most injuries. The top 5 most costly events are floods, hurricanes, storm surges, tornadoes, and hail. Floods had the highest cost over the period assessed, with roughly 168 billion dollars in combined property and crop damage.
A description of how I processed and analyzed the data to arrive at these conclusions follows.
The first code block will load all needed libraries and load in the dataset used for the analysis.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
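#read in data (the file is bzip2-compressed; read.csv decompresses it automatically)
URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
#download in binary mode, and only if the file is not already present
if (!file.exists("stormdata.csv.bz2")) download.file(URL, "stormdata.csv.bz2", mode = "wb")
stormdata <- read.csv("stormdata.csv.bz2")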
The next code block will select only the needed variables from the dataset.
#remove unneeded variables
stormdatanew <- select(stormdata, STATE__, BGN_DATE, COUNTY, COUNTYNAME, STATE, EVTYPE, MAG, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
Next, I clean the event type variable. The following code fixes typos and combines related event types. The event types chosen for cleaning account for more than 90% of injuries, fatalities, and economic damage. While there are additional event types that could be cleaned, the analysis showed that cleaning them would not change the answers to the questions being asked. This is shown in more depth below.
#edit event variables - all upper case letters
#chose these specific events after exploring top 90% of injuries and fatalities
stormdatanew$EVTYPE <- toupper(stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("RIP CURRENTS", "RIP CURRENT", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("EXCESSIVE HEAT", "HEAT", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HEAT WAVE", "HEAT", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("STRONG WIND", "HIGH WIND", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("FLASH FLOOD", "FLOOD", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HEAVY SNOW", "WINTER STORM", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("BLIZZARD", "WINTER STORM", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("THUNDERSTORM WIND", "TSTM WIND", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("TSTM WINDS", "TSTM WIND", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("WILD/FOREST FIRE", "WILDFIRE", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HIGHWIND", "TSTM WIND", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("ICE STORM", "WINTER STORM", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("WINTER WEATHER", "WINTER STORM", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("EXTREME COLD/WIND CHILL", "EXTREME COLD", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HIGH WINDS", "HIGH WIND", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HURRICANE/TYPHOON", "HURRICANE", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HURRICANE OPAL/HIGH WINDS", "HURRICANE", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HURRICANE GORDON", "HURRICANE", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HURRICANE EMILY", "HURRICANE", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HURRICANE ERIN", "HURRICANE", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HURRICANE FELIX", "HURRICANE", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HURRICANE OPAL","HURRICANE", stormdatanew$EVTYPE)
stormdatanew$EVTYPE <- gsub("HURRICANE-GENERATED SWELLS","HURRICANE", stormdatanew$EVTYPE)
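The long run of gsub() calls above can also be written as an ordered set of rules applied in a loop. The following is only a minimal sketch under stated assumptions: the helper `clean_evtype` and the vector `rules` are hypothetical names, not part of the original analysis, and only a handful of the rules are reproduced. It uses `fixed = TRUE`, which matches each pattern as a literal string. As in the code above, the order of the rules matters, because later rules see the output of earlier ones.

```r
# Hypothetical helper (not part of the original analysis): apply an ordered
# set of literal pattern -> replacement rules to a vector of event types.
clean_evtype <- function(x, rules) {
  x <- toupper(x)
  for (pattern in names(rules)) {
    # fixed = TRUE treats the pattern as a literal string, not a regex
    x <- gsub(pattern, rules[[pattern]], x, fixed = TRUE)
  }
  x
}

# A few of the rules used above, kept in the same relative order
rules <- c(
  "EXCESSIVE HEAT"    = "HEAT",
  "HEAT WAVE"         = "HEAT",
  "FLASH FLOOD"       = "FLOOD",
  "THUNDERSTORM WIND" = "TSTM WIND",
  "TSTM WINDS"        = "TSTM WIND"
)

# Invented example labels, only to show the rules in action
example <- c("Excessive Heat", "FLASH FLOOD", "THUNDERSTORM WINDS", "Tornado")
cleaned <- clean_evtype(example, rules)
print(cleaned)
```

Keeping the rules in one named vector makes it easier to spot duplicates and to see which rule runs before which.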
After exploration, it is clear that before 1993 only thunderstorm wind, hail, and tornado events were recorded. Using data from 1950 to 1993 would overemphasize these events in the analysis. Thus, I only use data from 1993 onward, as this data includes all 48 event types. The following code shows that only 3 event types were recorded prior to 1993, and pulls out the data from 1993 to 2011 for analysis.
#convert BGN_DATE to date class
stormdatanew$BGN_DATE <- as.character(stormdatanew$BGN_DATE)
stormdatanew$BGN_DATE <- as.Date(stormdatanew$BGN_DATE, format = "%m/%d/%Y")
#filter data for before 1993 and after 1993. View number of event types for each.
before1993 <- filter(stormdatanew, stormdatanew$BGN_DATE < "1993-01-01")
length(unique(before1993$EVTYPE))
## [1] 3
after1993 <-filter(stormdatanew, stormdatanew$BGN_DATE >= "1993-01-01")
length(unique(after1993$EVTYPE))
## [1] 840
#group the 1993-onward data by event type for the summaries below
after1993 <- group_by(after1993, EVTYPE)
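The before/after comparison above can be illustrated on a small scale. This is a sketch in base R on invented toy records (the data frame `toy`, the `PERIOD` column, and the dates are all assumptions made for illustration, not values from the storm data). It parses dates with the same "%m/%d/%Y" format and counts distinct event types on each side of the 1993-01-01 cutoff.

```r
# Toy records with invented dates, in the same month/day/year style
toy <- data.frame(
  BGN_DATE = c("4/18/1950", "6/1/1985", "1/5/1993", "8/29/2005", "7/14/1995"),
  EVTYPE   = c("TORNADO", "HAIL", "WINTER STORM", "HURRICANE", "HEAT"),
  stringsAsFactors = FALSE
)

# Convert the character dates to Date class
toy$BGN_DATE <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y")

# Label each record by which side of the cutoff it falls on
cutoff <- as.Date("1993-01-01")
toy$PERIOD <- ifelse(toy$BGN_DATE < cutoff, "before 1993", "1993 onward")

# Count distinct event types within each period
types_per_period <- tapply(toy$EVTYPE, toy$PERIOD, function(x) length(unique(x)))
print(types_per_period)
```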
Next, I continue processing the data to answer question #2, which concerns economic damage.
The PROPDMGEXP and CROPDMGEXP variables use K to represent damage in the thousands, M to represent damage in the millions, and B to represent damage in the billions. I will replace these letters with the appropriate numerical values to make analysis easier, then multiply that value by the PROPDMG variable to get the total property damage in dollars. I’ll store this in a new variable, TOTPROPDMG, and do the same for crop damage in TOTCROPDMG.
#replace property damage code with appropriate numbers for analysis.
after1993$PROPDMGEXP <- sub("K", 1000, after1993$PROPDMGEXP)
after1993$PROPDMGEXP <- sub("M", 1000000, after1993$PROPDMGEXP)
after1993$PROPDMGEXP <- sub("m", 1000000, after1993$PROPDMGEXP)
after1993$PROPDMGEXP <- sub("B", 1000000000, after1993$PROPDMGEXP)
after1993$PROPDMGEXP <- as.numeric(after1993$PROPDMGEXP)
## Warning: NAs introduced by coercion
after1993$TOTPROPDMG <- after1993$PROPDMG*after1993$PROPDMGEXP
after1993$CROPDMGEXP <- sub("K", 1000, after1993$CROPDMGEXP)
after1993$CROPDMGEXP <- sub("M", 1000000, after1993$CROPDMGEXP)
after1993$CROPDMGEXP <- sub("m", 1000000, after1993$CROPDMGEXP)
after1993$CROPDMGEXP <- sub("B", 1000000000, after1993$CROPDMGEXP)
after1993$CROPDMGEXP <- as.numeric(after1993$CROPDMGEXP)
## Warning: NAs introduced by coercion
after1993$TOTCROPDMG <- after1993$CROPDMG*after1993$CROPDMGEXP
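The same conversion can be expressed as a lookup table instead of repeated sub() calls. This is a minimal sketch, not the method used above: the names `multiplier`, `dmg`, and `dmgexp` are hypothetical, and the example values are invented. It maps only K, M, and B (in either case) and returns NA for every other code, so it would treat any digit codes differently from the code above, which passes them through as.numeric().

```r
# Hypothetical lookup of exponent codes (K, M, B only; other codes become NA)
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Invented example values, not taken from the storm data
dmg    <- c(25, 2.5, 1.2, 10, 5)
dmgexp <- c("K", "M", "B", "m", "")

# toupper() handles the lower-case "m"; unmatched codes such as "" give NA
total <- dmg * unname(multiplier[toupper(dmgexp)])
print(total)
```

Rows with an NA total are dropped later by `sum(..., na.rm = TRUE)`, just as in the analysis above.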
Analyze the data to find which events cause the most harm to population health. I will look at both injuries and fatalities to answer this question. The following code blocks address question #1.
#create new dataframe that groups fatalities and injuries by event type
harm <- summarise(after1993, FATALITIES = sum(FATALITIES),
INJURIES = sum(INJURIES))
#table to help understand the types of events to subset.
table(harm$FATALITIES > 100)
##
## FALSE TRUE
## 828 12
table(harm$INJURIES > 200)
##
## FALSE TRUE
## 822 18
#subset data to most harmful types of storms
mostharm <- subset(harm, harm$FATALITIES > 100 | harm$INJURIES > 200)
While I cleaned the majority of the event types, some events will not be included in the analysis. The subset does, however, account for 96% of the recorded harm (injuries and fatalities combined). The additional 4% will not change the results.
#Summaries to check % of event types included in analysis.
sum(mostharm$FATALITIES)
## [1] 9967
sum(after1993$FATALITIES)
## [1] 10865
sum(mostharm$INJURIES)
## [1] 66572
sum(after1993$INJURIES)
## [1] 68765
#96% of injuries and fatalities combined are accounted for.
(sum(mostharm$FATALITIES)+sum(mostharm$INJURIES))/(sum(after1993$FATALITIES)+sum(after1993$INJURIES))
## [1] 0.961183
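The coverage figure can be re-derived from the four totals printed above, and it can also be split by outcome. The sketch below simply repeats that arithmetic with the printed numbers typed in by hand; the variable names are hypothetical. Because injuries far outnumber fatalities, the combined share mostly reflects injury coverage, so it may be worth looking at the two shares separately.

```r
# Totals copied from the output printed above
fatal_subset  <- 9967;  fatal_all  <- 10865
injury_subset <- 66572; injury_all <- 68765

# Share of each outcome captured by the subset of event types
share_fatal    <- fatal_subset / fatal_all
share_injury   <- injury_subset / injury_all
share_combined <- (fatal_subset + injury_subset) / (fatal_all + injury_all)

print(round(c(fatalities = share_fatal,
              injuries   = share_injury,
              combined   = share_combined), 3))
```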
For the final results, I will look at the sum of injuries, the sum of fatalities, and the sum of both injuries and fatalities to best understand the harm the events have on population health. The following code block will create the new variable and look at most harmful events to population health.
#new variable shows sum of fatalities and injuries
mostharm$TOTALHARM <- mostharm$FATALITIES+mostharm$INJURIES
#arrange in order of most harm variable
mostharm <- arrange(mostharm, desc(TOTALHARM))
#view dataset to see most harmful events
mostharm
## # A tibble: 20 x 4
## EVTYPE FATALITIES INJURIES TOTALHARM
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 1621 23310 24931
## 2 HEAT 3012 9004 12016
## 3 FLOOD 1448 8566 10014
## 4 TSTM WIND 438 6027 6465
## 5 WINTER STORM 556 5520 6076
## 6 LIGHTNING 816 5230 6046
## 7 HIGH WIND 394 1740 2134
## 8 WILDFIRE 87 1456 1543
## 9 HURRICANE 133 1326 1459
## 10 RIP CURRENT 572 529 1101
## 11 HAIL 10 960 970
## 12 FOG 62 734 796
## 13 EXTREME COLD 287 255 542
## 14 DUST STORM 22 440 462
## 15 TROPICAL STORM 58 340 398
## 16 AVALANCHE 224 170 394
## 17 DENSE FOG 18 342 360
## 18 HEAVY RAIN 98 251 349
## 19 HIGH SURF 104 156 260
## 20 GLAZE 7 216 223
Take the top 5 most harmful events and graph their information in the results section. The top5harm dataset will be the final dataset used to graph.
#from the above table, we can pull top 5 harmful events to graph in results section
top5harm <- subset(mostharm[1:5,])
I’ll explore economic damage by event type next to answer question #2: which types of events have the greatest economic consequences?
#create new dataframe that groups economic damage by event type
economicdmg <- summarise(after1993, TOTPROPDMG = sum(TOTPROPDMG, na.rm = TRUE),
TOTCROPDMG = sum(TOTCROPDMG, na.rm = TRUE))
#replace NAs with 0
economicdmg$TOTPROPDMG <- economicdmg$TOTPROPDMG %>% replace_na(0)
economicdmg$TOTCROPDMG <- economicdmg$TOTCROPDMG %>% replace_na(0)
Add together property damage and crop damage for each event type to find the top 5 most costly events.
#add together property and crop damage
economicdmg<- mutate(economicdmg, TOTALDMG = TOTPROPDMG+TOTCROPDMG)
#filter events with economic cost.
mostdmg <- filter(economicdmg, economicdmg$TOTALDMG>0)
#arrange events to have most expensive on top.
mostdmg <- arrange(mostdmg, desc(mostdmg$TOTALDMG))
View most costly events.
#view dataset to see most costly events
mostdmg
## # A tibble: 354 x 4
## EVTYPE TOTPROPDMG TOTCROPDMG TOTALDMG
## <chr> <dbl> <dbl> <dbl>
## 1 FLOOD 160798521887. 7083285550 167881807437.
## 2 HURRICANE 84656180010 5505292800 90161472810
## 3 STORM SURGE 43323536000 5000 43323541000
## 4 TORNADO 26338962421 414953110 26753915531
## 5 HAIL 15732266870 3025537450 18757804320
## 6 WINTER STORM 12246094158. 5310770600 17556864758.
## 7 DROUGHT 1046106000 13972566000 15018672000
## 8 TSTM WIND 9704092218. 1159501100 10863593318.
## 9 RIVER FLOOD 5118945500 5029459000 10148404500
## 10 TROPICAL STORM 7703890550 678346000 8382236550
## # … with 344 more rows
Pull out the 5 most costly events. The dataset top5dmg will be the final dataset used to graph results.
#pull out top 5 events
top5dmg <- subset(mostdmg[1:5,])
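The totals in the table above are printed in raw dollars, which makes their magnitude hard to read at a glance. As a small sketch, the five largest TOTALDMG values are copied here by hand from that printed table (as displayed) and converted to billions of dollars. The data frame name `top5_printed` and the `BILLIONS` column are hypothetical.

```r
# Top 5 rows copied by hand from the printed table above
top5_printed <- data.frame(
  EVTYPE   = c("FLOOD", "HURRICANE", "STORM SURGE", "TORNADO", "HAIL"),
  TOTALDMG = c(167881807437, 90161472810, 43323541000, 26753915531, 18757804320)
)

# Express the totals in billions of dollars, rounded to one decimal place
top5_printed$BILLIONS <- round(top5_printed$TOTALDMG / 1e9, 1)
print(top5_printed[, c("EVTYPE", "BILLIONS")])
```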
Graph the top 5 harmful events in 3 ways. The first graph shows the total fatalities of each event. The second shows total injuries. The final will show the sum of injuries and fatalities.
#graph fatalities
g1 <- ggplot(top5harm, aes(EVTYPE, FATALITIES)) + geom_bar(stat="identity")+
ggtitle("Top 5 Harmful Events 1993 to 2011")
g2 <- ggplot(top5harm, aes(EVTYPE, INJURIES)) + geom_bar(stat="identity")
g3 <- ggplot(top5harm, aes(EVTYPE, TOTALHARM)) + geom_bar(stat="identity")
grid.arrange(g1, g2, g3, nrow = 3)
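By default, ggplot2 places a character x variable in alphabetical order, so the bars in the plots above do not appear in order of severity. One possible adjustment, sketched here under assumptions, is to turn EVTYPE into a factor ordered by the plotted value using reorder(). The data frame `top5harm_printed` is a hypothetical stand-in with values copied by hand from the table printed earlier, and the plot is only built if ggplot2 is installed.

```r
# Values copied by hand from the table printed earlier in this report
top5harm_printed <- data.frame(
  EVTYPE    = c("TORNADO", "HEAT", "FLOOD", "TSTM WIND", "WINTER STORM"),
  TOTALHARM = c(24931, 12016, 10014, 6465, 6076)
)

# reorder() returns a factor whose levels follow increasing -TOTALHARM,
# that is, decreasing total harm
top5harm_printed$EVTYPE <- reorder(top5harm_printed$EVTYPE,
                                   -top5harm_printed$TOTALHARM)
bar_order <- levels(top5harm_printed$EVTYPE)
print(bar_order)

# Build the chart only if ggplot2 is available; print(p) would display it
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(top5harm_printed, ggplot2::aes(EVTYPE, TOTALHARM)) +
    ggplot2::geom_bar(stat = "identity")
}
```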
The analysis in the data processing section shows that the top 5 events most harmful to population health are tornadoes, heat, floods, thunderstorm winds, and winter storms, with heat causing the most fatalities and tornadoes causing the most injuries.
Next, graph the total economic damage done by the 5 most costly events.
#graph economic damage
ggplot(top5dmg, aes(EVTYPE, TOTALDMG)) + geom_bar(stat="identity")+
ggtitle("Total Economic Damage 1993 to 2011")+ylab("Cost of Damage in $")
The top 5 most costly events from an economic damage standpoint are floods, hurricanes, storm surges, tornadoes, and hail. Floods had the highest cost over the time period, with roughly 168 billion dollars in combined property and crop damage.