This analysis uses data from the National Oceanic and Atmospheric Administration’s Storm Database. The purpose of this analysis is to identify the storm events that are most harmful to population health and have the greatest economic consequences. The data indicate that tornadoes are the most harmful to population health, and the storms with high-winds cause the most economic damage.
# create directory to store data ----
directory <- "/Users/mmilunski/Dropbox/Analytics/Coursera/RepResA2"
if(!file.exists(directory)) {
dir.create(directory)
## download data ----
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", paste0(directory, "/", "noaa.csv"), method = "curl",
quiet = TRUE)
}
# load data ----
noaa <- read.csv(paste0(directory, "/", "noaa.csv"), header = TRUE,
stringsAsFactors = FALSE, na.strings = "")
Some of these names have single or double underscores after the last character. Going to run this line of code twice in order to mop-up the trailing underscores.
# names before... ----
names(noaa)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
# loop to remove ----
reps <- 1:2
for(i in seq_along(reps)) {
names(noaa) <- tolower(gsub("_$", "", names(noaa)))
}
# names after... ----
names(noaa)
## [1] "state" "bgn_date" "bgn_time" "time_zone" "county"
## [6] "countyname" "state" "evtype" "bgn_range" "bgn_azi"
## [11] "bgn_locati" "end_date" "end_time" "county_end" "countyendn"
## [16] "end_range" "end_azi" "end_locati" "length" "width"
## [21] "f" "mag" "fatalities" "injuries" "propdmg"
## [26] "propdmgexp" "cropdmg" "cropdmgexp" "wfo" "stateoffic"
## [31] "zonenames" "latitude" "longitude" "latitude_e" "longitude"
## [36] "remarks" "refnum"
It seems that the loop to remove excess underscores has resulted in two state and two longitude variables. To differentiate them, I’ll change the first state to state_num and the 35th vector to longitude_e. After changing the two names, I check to see if there are duplicated variable names still in the NOAA dataframe.
colnames(noaa)[1] <- "state_num"
colnames(noaa)[35] <- "longitude_e"
# check for duplicated variable names ----
duplicated(names(noaa))
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE
Looks like there are no more duplicated names.
evtype.Look at total fatalities and injuries by event type. It may help to look at the impact on population health through an additive lens, that is, adding fatalities and injuries together to make a variable called casualties.
library(sqldf)
health <- sqldf("select evtype, sum(fatalities) as total_fatalities,
sum(injuries) as total_injuries,
sum(fatalities + injuries) as total_casualties
from noaa
group by evtype
order by total_casualties DESC")
# for kicks: dplyr equivalent to the sql statement above... way faster ----
library(dplyr)
health_dplyr <- noaa %>%
select(evtype, fatalities, injuries) %>%
mutate(casualties = fatalities + injuries) %>%
group_by(evtype) %>%
summarise(fatal_total = sum(fatalities), injur_total = sum(injuries),
casual_total = sum(casualties)) %>%
arrange(desc(casual_total))
# take a look at the top 15 by casualties ----
head(health, 15)
## evtype total_fatalities total_injuries total_casualties
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
## 6 HEAT 937 2100 3037
## 7 FLASH FLOOD 978 1777 2755
## 8 ICE STORM 89 1975 2064
## 9 THUNDERSTORM WIND 133 1488 1621
## 10 WINTER STORM 206 1321 1527
## 11 HIGH WIND 248 1137 1385
## 12 HAIL 15 1361 1376
## 13 HURRICANE/TYPHOON 64 1275 1339
## 14 HEAVY SNOW 127 1021 1148
## 15 WILDFIRE 75 911 986
It looks like tornadoes have some of the highest casualty rates.
To represent economic consequences, I’ll look at crop damage (cropdmg) and property damage (propdmg). Similar to the health issues above, it would be informative to have a variable representing the total damage created by these storms. There’s a multiplier variable for both propdmg and cropdmg that has three values: K (thousands), M (millions), and B (billions).
# fix `propdmg` and `cropdmg` by using the multipliers ----
noaa <- noaa %>%
mutate(propdmg_mult = ifelse(propdmgexp == "K",
1000, ifelse(propdmgexp == "M", 1000000,
ifelse(propdmgexp == "B",
1000000000, NA))))
noaa <- noaa %>%
mutate(cropdmg_mult = ifelse(cropdmgexp == "K",
1000, ifelse(cropdmgexp == "M", 1000000,
ifelse(cropdmgexp == "B",
1000000000, NA))))
noaa <- noaa %>%
mutate(propdmg_full = propdmg * propdmg_mult,
cropdmg_full = cropdmg * cropdmg_mult)
# make table ----
econ <- noaa %>%
select(evtype, propdmg_full, cropdmg_full) %>%
mutate(totaldmg = cropdmg_full + propdmg_full) %>%
group_by(evtype) %>%
summarise(tot_propdmg = sum(propdmg_full),
tot_cropdmg = sum(cropdmg_full),
tot_totaldmg = sum(totaldmg)) %>%
arrange(desc(tot_totaldmg))
head(econ, 15)
## # A tibble: 15 × 4
## evtype tot_propdmg tot_cropdmg tot_totaldmg
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADOES, TSTM WIND, HAIL 1.600e+09 2500000 1602500000
## 2 HIGH WINDS/COLD 1.105e+08 7000000 117500000
## 3 HURRICANE OPAL/HIGH WINDS 1.000e+08 10000000 110000000
## 4 WINTER STORM HIGH WINDS 6.000e+07 5000000 65000000
## 5 Heavy Rain/High Surf 1.350e+07 1500000 15000000
## 6 LAKESHORE FLOOD 7.540e+06 0 7540000
## 7 HIGH WINDS HEAVY RAINS 7.500e+06 10000 7510000
## 8 FOREST FIRES 5.000e+06 500000 5500000
## 9 FLASH FLOODING/FLOOD 1.750e+06 175000 1925000
## 10 HEAVY SNOW/HIGH WINDS & FLOOD 1.500e+06 20000 1520000
## 11 Frost/Freeze 1.000e+06 100000 1100000
## 12 TROPICAL STORM GORDON 5.000e+05 500000 1000000
## 13 DUST STORM/HIGH WINDS 5.000e+04 500000 550000
## 14 MARINE THUNDERSTORM WIND 4.364e+05 50000 486400
## 15 ASTRONOMICAL LOW TIDE 3.200e+05 0 320000
evtype variable), are most harmful with respect to population health?The top five events that are most harmful to population health (most casualties produced) are the following:
library(ggplot2)
library(reshape2)
# create health_data for plotting ----
health_data <- head(health, 5)
# melt data ----
health_melt <- melt(health_data, id.vars = "evtype")
# create plot ----
base <- ggplot(health_melt, aes(x = evtype, y = value, fill = variable))
bar <- geom_bar(stat = "identity", position = "dodge")
plotLabels <- labs(title = "Storm Events Most Harmful to Population Health",
subtitle = "Top Five Largest Casualty-Producing Events",
x = "Storm Event Type",
y = "Number of People")
base + bar + plotLabels
Interpretation:
Excessive heat, floods, lightning, tornadoes, and thunderstorm winds are the top five storms most harmful to population health. It’s clear from the plot that tornadoes produce the most casualties and can be viewed as the greatest threat to population health.
# create econ_data for plotting ----
econ_data <- head(econ, 5)
# melt econ_data ----
econ_melt <- melt(econ_data, id.vars = "evtype")
# create plot ----
base1 <- ggplot(econ_melt, aes(x = evtype, y = (value / 1000000),
fill = variable))
bar1 <- geom_bar(stat = "identity", position = "dodge")
plotLabels1 <- labs(title = "Property, Crop, and Total Damage from Storms",
subtitle = "Top Five Economically-Harmful Events",
x = "Storm Event Type",
y = "Assessed Damage in Millions (USD)")
rotate1 <- theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
barText1 <- geom_text(aes(label = (value / 1000000)), vjust = -0.2, size = 2,
position = position_dodge(.9))
base1 + bar1 + plotLabels1 + rotate1 + barText1
Interpretation:
Generally speaking, storm events with high winds tend to cause the greatest economic damage. The combination of tornadoes, thunderstorm winds, and hail caused the greatest damage with property and crop damage around 1.6 billion USD.