Synopsis

This analysis uses data from the National Oceanic and Atmospheric Administration’s Storm Database. The purpose of this analysis is to identify the storm events that are most harmful to population health and have the greatest economic consequences. The data indicate that tornadoes are the most harmful to population health, and the storms with high-winds cause the most economic damage.

Data Processing

Downloading and loading data into R

# create directory to store data ----
directory <- "/Users/mmilunski/Dropbox/Analytics/Coursera/RepResA2"

if(!file.exists(directory)) {
        dir.create(directory)

        ## download data ----
        download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", paste0(directory, "/", "noaa.csv"), method = "curl", 
                      quiet = TRUE)
}

# load data ----
noaa <- read.csv(paste0(directory, "/", "noaa.csv"), header = TRUE,
                        stringsAsFactors = FALSE, na.strings = "")

Fix the variable names for easier typing

Some of these names have single or double underscores after the last character. Going to run this line of code twice in order to mop-up the trailing underscores.

# names before... ----
names(noaa)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
# loop to remove ----
reps <- 1:2
for(i in seq_along(reps)) {
        names(noaa) <- tolower(gsub("_$", "", names(noaa)))
}

# names after... ----
names(noaa)
##  [1] "state"      "bgn_date"   "bgn_time"   "time_zone"  "county"    
##  [6] "countyname" "state"      "evtype"     "bgn_range"  "bgn_azi"   
## [11] "bgn_locati" "end_date"   "end_time"   "county_end" "countyendn"
## [16] "end_range"  "end_azi"    "end_locati" "length"     "width"     
## [21] "f"          "mag"        "fatalities" "injuries"   "propdmg"   
## [26] "propdmgexp" "cropdmg"    "cropdmgexp" "wfo"        "stateoffic"
## [31] "zonenames"  "latitude"   "longitude"  "latitude_e" "longitude" 
## [36] "remarks"    "refnum"

It seems that the loop to remove excess underscores has resulted in two state and two longitude variables. To differentiate them, I’ll change the first state to state_num and the 35th vector to longitude_e. After changing the two names, I check to see if there are duplicated variable names still in the NOAA dataframe.

colnames(noaa)[1] <- "state_num"
colnames(noaa)[35] <- "longitude_e"

# check for duplicated variable names ----
duplicated(names(noaa))
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE

Looks like there are no more duplicated names.

First pass at health and economic consequences by evtype.

By health

Look at total fatalities and injuries by event type. It may help to look at the impact on population health through an additive lens, that is, adding fatalities and injuries together to make a variable called casualties.

library(sqldf)
health <- sqldf("select evtype, sum(fatalities) as total_fatalities,
                sum(injuries) as total_injuries, 
                sum(fatalities + injuries) as total_casualties
                from noaa 
                group by evtype
                order by total_casualties DESC")

# for kicks: dplyr equivalent to the sql statement above... way faster ----
library(dplyr)
health_dplyr <- noaa %>% 
        select(evtype, fatalities, injuries) %>% 
        mutate(casualties = fatalities + injuries) %>% 
        group_by(evtype) %>% 
        summarise(fatal_total = sum(fatalities), injur_total = sum(injuries),
                  casual_total = sum(casualties)) %>%
        arrange(desc(casual_total))

# take a look at the top 15 by casualties ----
head(health, 15)
##               evtype total_fatalities total_injuries total_casualties
## 1            TORNADO             5633          91346            96979
## 2     EXCESSIVE HEAT             1903           6525             8428
## 3          TSTM WIND              504           6957             7461
## 4              FLOOD              470           6789             7259
## 5          LIGHTNING              816           5230             6046
## 6               HEAT              937           2100             3037
## 7        FLASH FLOOD              978           1777             2755
## 8          ICE STORM               89           1975             2064
## 9  THUNDERSTORM WIND              133           1488             1621
## 10      WINTER STORM              206           1321             1527
## 11         HIGH WIND              248           1137             1385
## 12              HAIL               15           1361             1376
## 13 HURRICANE/TYPHOON               64           1275             1339
## 14        HEAVY SNOW              127           1021             1148
## 15          WILDFIRE               75            911              986

It looks like tornadoes have some of the highest casualty rates.

By economic consequences

To represent economic consequences, I’ll look at crop damage (cropdmg) and property damage (propdmg). Similar to the health issues above, it would be informative to have a variable representing the total damage created by these storms. There’s a multiplier variable for both propdmg and cropdmg that has three values: K (thousands), M (millions), and B (billions).

# fix `propdmg` and `cropdmg` by using the multipliers ----
noaa <- noaa %>% 
        mutate(propdmg_mult = ifelse(propdmgexp == "K", 
                                     1000, ifelse(propdmgexp == "M", 1000000,
                                                  ifelse(propdmgexp == "B",
                                                         1000000000, NA))))
noaa <- noaa %>% 
        mutate(cropdmg_mult = ifelse(cropdmgexp == "K", 
                                     1000, ifelse(cropdmgexp == "M", 1000000,
                                                  ifelse(cropdmgexp == "B",
                                                         1000000000, NA))))
noaa <- noaa %>% 
        mutate(propdmg_full = propdmg * propdmg_mult, 
               cropdmg_full = cropdmg * cropdmg_mult)
# make table ----
econ <- noaa %>% 
        select(evtype, propdmg_full, cropdmg_full) %>%  
        mutate(totaldmg = cropdmg_full + propdmg_full) %>% 
        group_by(evtype) %>% 
        summarise(tot_propdmg = sum(propdmg_full), 
                  tot_cropdmg = sum(cropdmg_full),
                  tot_totaldmg = sum(totaldmg)) %>% 
        arrange(desc(tot_totaldmg))
head(econ, 15)
## # A tibble: 15 × 4
##                           evtype tot_propdmg tot_cropdmg tot_totaldmg
##                            <chr>       <dbl>       <dbl>        <dbl>
## 1     TORNADOES, TSTM WIND, HAIL   1.600e+09     2500000   1602500000
## 2                HIGH WINDS/COLD   1.105e+08     7000000    117500000
## 3      HURRICANE OPAL/HIGH WINDS   1.000e+08    10000000    110000000
## 4        WINTER STORM HIGH WINDS   6.000e+07     5000000     65000000
## 5           Heavy Rain/High Surf   1.350e+07     1500000     15000000
## 6                LAKESHORE FLOOD   7.540e+06           0      7540000
## 7         HIGH WINDS HEAVY RAINS   7.500e+06       10000      7510000
## 8                   FOREST FIRES   5.000e+06      500000      5500000
## 9           FLASH FLOODING/FLOOD   1.750e+06      175000      1925000
## 10 HEAVY SNOW/HIGH WINDS & FLOOD   1.500e+06       20000      1520000
## 11                  Frost/Freeze   1.000e+06      100000      1100000
## 12         TROPICAL STORM GORDON   5.000e+05      500000      1000000
## 13         DUST STORM/HIGH WINDS   5.000e+04      500000       550000
## 14      MARINE THUNDERSTORM WIND   4.364e+05       50000       486400
## 15         ASTRONOMICAL LOW TIDE   3.200e+05           0       320000

Results

1. Across the United States, which types of events (as indicated by the evtype variable), are most harmful with respect to population health?

The top five events that are most harmful to population health (most casualties produced) are the following:

library(ggplot2)
library(reshape2)

# create health_data for plotting ----
health_data <- head(health, 5)

# melt data ----
health_melt <- melt(health_data, id.vars = "evtype")

# create plot ----
base <- ggplot(health_melt, aes(x = evtype, y = value, fill = variable))
bar <- geom_bar(stat = "identity", position = "dodge")
plotLabels <- labs(title = "Storm Events Most Harmful to Population Health",
                   subtitle = "Top Five Largest Casualty-Producing Events",
                   x = "Storm Event Type",
                   y = "Number of People")
base + bar + plotLabels

Interpretation:
Excessive heat, floods, lightning, tornadoes, and thunderstorm winds are the top five storms most harmful to population health. It’s clear from the plot that tornadoes produce the most casualties and can be viewed as the greatest threat to population health.

2. Across the United States, which types of events have the greatest economic consequences?

# create econ_data for plotting ----
econ_data <- head(econ, 5)

# melt econ_data ----
econ_melt <- melt(econ_data, id.vars = "evtype")

# create plot ----
base1 <- ggplot(econ_melt, aes(x = evtype, y = (value / 1000000), 
                               fill = variable))
bar1 <- geom_bar(stat = "identity", position = "dodge")
plotLabels1 <- labs(title = "Property, Crop, and Total Damage from Storms",
                   subtitle = "Top Five Economically-Harmful Events",
                   x = "Storm Event Type",
                   y = "Assessed Damage in Millions (USD)")
rotate1 <- theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1))
barText1 <- geom_text(aes(label = (value / 1000000)), vjust = -0.2, size = 2,
                      position = position_dodge(.9))
base1 + bar1 + plotLabels1 + rotate1 + barText1

Interpretation:
Generally speaking, storm events with high winds tend to cause the greatest economic damage. The combination of tornadoes, thunderstorm winds, and hail caused the greatest damage with property and crop damage around 1.6 billion USD.