The following anlysis aims to determine which extreme weather events are the most impactful in terms of both public health and economics. Using data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, wich contains storm data for the years 1950-2011. We measure public health impact as the total number of injuries and fatalities resulting from a weather event, and the economic impact as the total estimated cost of crop and property damage. From these data, we find that tornadoes have the largest impact on public health, while flooding has the largest economic impact.
We first download the data from the link provided to the /data folder in the working directory, and then read the data in to R. We then want to determine how many records contain missing data.
if(!dir.exists("./data")){
dir.create("./data")
}
download.file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2', "./data/NOAA.bz2")
data <- read.csv("./data/NOAA.bz2")
pct_na <- mean(is.na(data))
5.23% of observations contain missing values. This is a relatively low percentage, so we will ignore missing values for this analysis.
Our first step in cleaning the data is to address the PROPDMGEXP and CROPDMGEXP variables. These fields are multiplied by the cost estimate for the respective damage to arrive at the total cost estimate. This field is currently a character representing the multipliler (M for million, B for billion, etc). We will standardize this field by setting all entries to upper case, then create a new field that represents the numeric value of the character field.
## convert all 'EXP' variables to upper case
data$PROPDMGEXP <- toupper(data$PROPDMGEXP)
data$CROPDMGEXP <- toupper(data$CROPDMGEXP)
data$PROPDMGMULT <- with(data, ifelse(PROPDMGEXP == 'K', 1000, ifelse(PROPDMGEXP == 'M', 1000000, ifelse(PROPDMGEXP == 'B', 1000000000,ifelse(PROPDMGEXP == 'H', 100, NA)))))
data$CROPDMGMULT <- with(data, ifelse(CROPDMGEXP == 'K', 1000, ifelse(CROPDMGEXP == 'M', 1000000, ifelse(CROPDMGEXP == 'B', 1000000000,ifelse(CROPDMGEXP == 'H', 100, NA)))))
Next we will create a summary table that calculates the total injuries, fatalities, property damage cost, and crop damage cost for each weather event type. This table will be used as the basis for all analysis.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data_sum <- summarize(group_by(data, EVTYPE), injuries = sum(INJURIES), fatalities = sum(FATALITIES), property_damage = sum(PROPDMGMULT * PROPDMG, na.rm = TRUE), crop_damage = sum(CROPDMGMULT * CROPDMG, na.rm = TRUE))
head(data_sum[order(-data_sum$injuries), ], 20)
## # A tibble: 20 x 5
## EVTYPE injuries fatalities property_damage crop_damage
## <fctr> <dbl> <dbl> <dbl> <dbl>
## 1 TORNADO 91346 5633 56937160480 414953110
## 2 TSTM WIND 6957 504 4484928440 554007350
## 3 FLOOD 6789 470 144657709800 5661968450
## 4 EXCESSIVE HEAT 6525 1903 7753700 492402000
## 5 LIGHTNING 5230 816 928659280 12092090
## 6 HEAT 2100 937 1797000 401461500
## 7 ICE STORM 1975 89 3944927810 5022113500
## 8 FLASH FLOOD 1777 978 16140811510 1421317100
## 9 THUNDERSTORM WIND 1488 133 3483121140 414843050
## 10 HAIL 1361 15 15732267220 3025954450
## 11 WINTER STORM 1321 206 6688497250 26944000
## 12 HURRICANE/TYPHOON 1275 64 69305840000 2607872800
## 13 HIGH WIND 1137 248 5270046260 638571300
## 14 HEAVY SNOW 1021 127 932589140 134653100
## 15 WILDFIRE 911 75 4765114000 295472800
## 16 THUNDERSTORM WINDS 908 64 1735954850 190654700
## 17 BLIZZARD 805 101 659213950 112060000
## 18 FOG 734 62 13155500 0
## 19 WILD/FOREST FIRE 545 12 3001829500 106796830
## 20 DUST STORM 440 22 5549000 3100000
We see several similar weather events when we view the top 20 events by injury count. For example, TSTM WIND and Thunderstorm Wind refer to the same event. There are also several heat related events. For this analysis, the following events will be combined:
* TSTM WIND, THUNDERSTORM WIND
* HEAT, EXCESSIVE HEAT, HEAT WAVE, EXTREME HEAT
* BLIZZARD, HEAVY SNOW
* FLOOD, FLASH FLOOD
* HIGH WIND, HIGH WINDS
* FOG, DENSE FOG
## combine EVTYPE as described
data$EVTYPE[data$EVTYPE == "THUNDERSTORM WIND"] <- "TSTM WIND"
data$EVTYPE[data$EVTYPE == "HEAT"] <- "EXTREME HEAT"
data$EVTYPE[data$EVTYPE == "EXCESSIVE HEAT"] <- "EXTREME HEAT"
data$EVTYPE[data$EVTYPE == "HEAT WAVE"] <- "EXTREME HEAT"
data$EVTYPE[data$EVTYPE == "BLIZZARD"] <- "HEAVY SNOW"
data$EVTYPE[data$EVTYPE == "FLASH FLOOD"] <- "FLOOD"
data$EVTYPE[data$EVTYPE == "DENSE FOG"] <- "FOG"
## re-create data_sum table with revised EVTYPE
data_sum <- summarize(group_by(data, EVTYPE), injuries = sum(INJURIES), fatalities = sum(FATALITIES), property_damage = sum(PROPDMGMULT * PROPDMG, na.rm = TRUE), crop_damage = sum(CROPDMGMULT * CROPDMG, na.rm = TRUE))
Viewing the top 20 events by property damage reveals additional event types that can be combined
head(data_sum[order(-data_sum$property_damage), ], 20)
## # A tibble: 20 x 5
## EVTYPE injuries fatalities property_damage
## <fctr> <dbl> <dbl> <dbl>
## 1 FLOOD 8566 1448 160798521310
## 2 HURRICANE/TYPHOON 1275 64 69305840000
## 3 TORNADO 91346 5633 56937160480
## 4 STORM SURGE 38 13 43323536000
## 5 HAIL 1361 15 15732267220
## 6 HURRICANE 46 61 11868319010
## 7 TSTM WIND 8445 637 7968049580
## 8 TROPICAL STORM 340 58 7703890550
## 9 WINTER STORM 1321 206 6688497250
## 10 HIGH WIND 1137 248 5270046260
## 11 RIVER FLOOD 2 2 5118945500
## 12 WILDFIRE 911 75 4765114000
## 13 STORM SURGE/TIDE 5 11 4641188000
## 14 ICE STORM 1975 89 3944927810
## 15 HURRICANE OPAL 1 1 3172846000
## 16 WILD/FOREST FIRE 545 12 3001829500
## 17 HEAVY RAIN/SEVERE WEATHER 0 0 2500000000
## 18 THUNDERSTORM WINDS 908 64 1735954850
## 19 TORNADOES, TSTM WIND, HAIL 0 25 1600000000
## 20 HEAVY SNOW 1826 228 1591803090
## # ... with 1 more variables: crop_damage <dbl>
The following events will be combined using the same process as before:
* RIVER FLOOD, FLOOD, FLASH FLOOD/FLOOD, FLASH FLOODING
* HURRICANE OPAL, HURRICANE
* THUNDERSTORM WINDS, TSTM WIND, SEVERE THUNDERSTORM
## combine EVTYPE as described
data$EVTYPE[data$EVTYPE == "RIVER FLOOD"] <- "FLOOD"
data$EVTYPE[data$EVTYPE == "FLASH FLOOD/FLOOD"] <- "FLOOD"
data$EVTYPE[data$EVTYPE == "FLASH FLOODING"] <- "FLOOD"
data$EVTYPE[data$EVTYPE == "HURRICANE OPAL"] <- "HURRICANE/TYPHOON"
data$EVTYPE[data$EVTYPE == "HURRICANE"] <- "HURRICANE/TYPHOON"
data$EVTYPE[data$EVTYPE == "TYPHOON"] <- "HURRICANE/TYPHOON"
data$EVTYPE[data$EVTYPE == "THUNDERSTORM WINDS"] <- "TSTM WIND"
data$EVTYPE[data$EVTYPE == "SEVERE THUNDERSTORM"] <- "TSTM WIND"
## re-create data_sum table with revised EVTYPE
data_sum <- summarize(group_by(data, EVTYPE), injuries = sum(INJURIES), fatalities = sum(FATALITIES), property_damage = sum(PROPDMGMULT * PROPDMG, na.rm = TRUE), crop_damage = sum(CROPDMGMULT * CROPDMG, na.rm = TRUE))
The data is now sufficiently cleaned for the purposes of this analysis.
In assessing the public health impacts of extreme weather events, we consider the total injuries and fatalities for each event type. A new variable is created that combines the total injury and fatality count for each weather event to represent the total health impact of the event. We then plot the top five weather events based on this new variable
## load required packages
library(reshape2)
library(ggplot2)
## filter and sort data by health impact
health_total <- with(data_sum, data.frame(EVTYPE, injuries, fatalities, total_impact = injuries + fatalities))
health_total <- head(health_total[order(-health_total$total_impact), ], 5)
## format data for plotting
health_melted <- melt(health_total, id = c("EVTYPE", "total_impact"))
## plot by total health impact
ggplot(health_melted, aes(EVTYPE, value, fill = variable)) + geom_col(position = position_stack(reverse = TRUE)) + aes(reorder(EVTYPE, -total_impact)) + xlab("Weather Event Type") + ggtitle("Public Health Impact of Extreme Weather Events", subtitle = '1950 - 2011') + theme(legend.title = element_blank()) + ylab("Total Injuries and Fatalities")
The above chart shows that tornadoes cause significantly more injuries than any other extreme weather event. Tornadoes have caused 96979 injuries and fatalities since 1950. Extreme Heat has caused the second most, at only 12197. Floods, Thunderstorms, and Lightning round out the top five. It should be noted that while Extreme heat has a much lower total incident count that tornadoes, it appears to have the highest percentage of incidents result in a fatality compared to the other events in the top five.
Next we will analyze the economic impact of extreme weather. The overall economic impact of a weather event will be considered the sum of the estimated cost of crop damages and property damages resulting from the event. We will create a column chart following the same process used to analyze public health.
## filter and sort data by economic impact
econ_total <- with(data_sum, data.frame(EVTYPE, property_damage, crop_damage, total_damage = property_damage + crop_damage))
econ_total <- head(econ_total[order(-econ_total$total_damage), ], 5)
## format data for plotting
melted <- melt(econ_total, id = c("EVTYPE", "total_damage"))
## plot total economic impact
ggplot(melted, aes(EVTYPE, value/1000000000, fill = variable)) + geom_col(position = position_stack(reverse = TRUE)) + aes(reorder(EVTYPE, -total_damage)) + xlab("Weather Event Type") + ggtitle("Economic Impact of Extreme Weather Events", subtitle = '1950 - 2011') + theme(legend.title = element_blank(), axis.text.x = element_text(size = 7)) + ylab("Estimated Cost (in billions)")
The distribution of economic impacts are more evenly spread than what was seen with public health impacts. Flooding is responsible for the highest estimated cost at $178.6 billion. Of the remaining event types in the top five, neither Tornado or Storm Surge reported any economic losses from crop damages. Overall, a majority of the economic impact of extreme weather is due to property damage
Based on the above analysis, it appears that tornadoes and flooding are the most impactful extreme weather events in terms of both public health and economic impacts. These events not only hold the top ranking in health and economic impact, repsectively, they are also the only two events that rank in the top five in both health and economic impacts.