Synopsis

This document analyses which types of weather events in the United States are a) most harmful to the health of the general population, and b) most economically damaging. The analysis is based on publicly available data from the NOAA Storm Database.

Data Processing

We start by downloading the publicly available storm database of the U.S. National Oceanic and Atmospheric Administration (NOAA). The data set covers events from 1950 through November 2011 and forms the basis of our analysis.

We then read in the data:

library(data.table)

stormdata <- fread("repdata-data-StormData.csv")
## Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:09
## Warning in fread("repdata-data-StormData.csv"): Read less rows (902297)
## than were allocated (967216). Run again with verbose=TRUE and please
## report.

We now check that all 902,297 rows are present, and look at a few columns of the first few rows:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
dim(stormdata)
## [1] 902297     37
select(head(stormdata), BGN_DATE, EVTYPE, INJURIES, FATALITIES)
##              BGN_DATE  EVTYPE INJURIES FATALITIES
## 1:  4/18/1950 0:00:00 TORNADO       15          0
## 2:  4/18/1950 0:00:00 TORNADO        0          0
## 3:  2/20/1951 0:00:00 TORNADO        2          0
## 4:   6/8/1951 0:00:00 TORNADO        2          0
## 5: 11/15/1951 0:00:00 TORNADO        2          0
## 6: 11/15/1951 0:00:00 TORNADO        6          0

This is a very wide data set, so we keep only the columns relevant to our study.

impacts <- select(stormdata, BGN_DATE, EVTYPE, INJURIES, FATALITIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

head(impacts)
##              BGN_DATE  EVTYPE INJURIES FATALITIES PROPDMG PROPDMGEXP
## 1:  4/18/1950 0:00:00 TORNADO       15          0    25.0          K
## 2:  4/18/1950 0:00:00 TORNADO        0          0     2.5          K
## 3:  2/20/1951 0:00:00 TORNADO        2          0    25.0          K
## 4:   6/8/1951 0:00:00 TORNADO        2          0     2.5          K
## 5: 11/15/1951 0:00:00 TORNADO        2          0     2.5          K
## 6: 11/15/1951 0:00:00 TORNADO        6          0     2.5          K
##    CROPDMG CROPDMGEXP
## 1:       0           
## 2:       0           
## 3:       0           
## 4:       0           
## 5:       0           
## 6:       0
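Since only eight of the 37 columns are used, the selection could also be made at read time. The following is a minimal sketch, assuming the `select` argument of `fread()`; it uses a tiny temporary CSV as a stand-in for the real file, and `EXTRA_COLUMN` is a made-up name representing the unused columns.

```r
library(data.table)

# Hypothetical stand-in for the storm data file: a tiny CSV with one
# unused column (EXTRA_COLUMN is made up for this illustration).
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "BGN_DATE,EVTYPE,EXTRA_COLUMN,INJURIES,FATALITIES",
  "4/18/1950 0:00:00,TORNADO,x,15,0",
  "4/18/1950 0:00:00,TORNADO,y,0,0"
), tmp)

# select= tells fread() to read only the named columns.
wanted <- c("BGN_DATE", "EVTYPE", "INJURIES", "FATALITIES")
small <- fread(tmp, select = wanted)

dim(small)
unlink(tmp)
```

On the full file this would avoid loading the columns that are dropped immediately afterwards, at the cost of having to re-read the file if another column is needed later.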

We confirm that the variables we will use in our study do not have a high proportion of missing values, which could skew our analysis.

mean(is.na(impacts$PROPDMG))
## [1] 0
mean(is.na(impacts$CROPDMG))
## [1] 0
mean(is.na(impacts$INJURIES))
## [1] 0
mean(is.na(impacts$FATALITIES))
## [1] 0
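The four checks above can also be written as a single call. This is a small sketch on a toy data frame (the values are illustrative, not taken from the storm data) showing how `colMeans(is.na(...))` reports the share of missing values per column.

```r
# Toy stand-in for `impacts`; the values are illustrative, not real data.
toy <- data.frame(
  INJURIES   = c(15, 0, NA, 2),
  FATALITIES = c(0, 0, 0, 1),
  PROPDMG    = c(25, 2.5, 25, NA),
  CROPDMG    = c(0, 0, 0, 0)
)

# is.na() on a data frame gives a logical matrix; colMeans() then yields
# the share of missing values per column.
na_share <- colMeans(is.na(toy))
na_share
```

Applied to the real data, the same expression on the four numeric columns of `impacts` should return zero for every column, matching the individual results above.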

Results

Population health impacts

To determine the events that do the most harm to population health, we look into the subset of events that resulted in injuries or fatalities.

health <- filter(impacts, INJURIES > 0 | FATALITIES > 0)
dim(health)
## [1] 21929     8

To find the most harmful events, we total the injuries and fatalities for each event type and sort the event types by total casualties in descending order.

health_by_event <- 
        group_by(health, EVTYPE) %>%
        summarize(injuries = sum(INJURIES), 
                  fatalities = sum(FATALITIES)) %>%
        mutate(casualties = injuries + fatalities)

head(arrange(health_by_event, desc(casualties)))
##           EVTYPE injuries fatalities casualties
## 1        TORNADO    91346       5633      96979
## 2 EXCESSIVE HEAT     6525       1903       8428
## 3      TSTM WIND     6957        504       7461
## 4          FLOOD     6789        470       7259
## 5      LIGHTNING     5230        816       6046
## 6           HEAT     2100        937       3037

We now plot the four event types with the greatest impact.

library(ggplot2)
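# Keep the four event types with the most casualties (head() selects rows for any table class)
h <- head(arrange(health_by_event, desc(casualties)), 4)
# geom_col() takes the bar heights directly from the casualties column
ggplot(h, aes(x = EVTYPE, y = casualties)) + geom_col()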

We can see that tornadoes have caused by far the greatest harm to population health: more than eleven times the casualties of the next most harmful event type, excessive heat.
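Because `EVTYPE` is a character column, ggplot2 draws the bars in alphabetical order rather than by size. One possible way to order them by casualties is sketched below, using `reorder()` and the top four rows copied from the casualties table above; the axis labels are my own wording.

```r
library(ggplot2)

# Top four rows copied from the casualties table above.
top4 <- data.frame(
  EVTYPE     = c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND", "FLOOD"),
  casualties = c(96979, 8428, 7461, 7259)
)

# reorder() turns EVTYPE into a factor whose levels are sorted by the
# second argument; negating casualties puts the largest bar first.
top4$EVTYPE <- reorder(top4$EVTYPE, -top4$casualties)

p <- ggplot(top4, aes(x = EVTYPE, y = casualties)) +
  geom_col() +
  labs(x = "Event type", y = "Injuries + fatalities")
p
```

With the bars sorted, the gap between tornadoes and the remaining event types is visible at a glance.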

Economic impacts

For the economic impact we pick the events that caused monetary damage of some kind:

economy <- filter(impacts, PROPDMG > 0 | CROPDMG > 0)
dim(economy)
## [1] 245031      8

To calculate the total economic impact of an event correctly, we need to convert the exponent columns (PROPDMGEXP and CROPDMGEXP) into numeric multipliers, which we then apply to the recorded damage figures.

conv.magnitude <- function(mag) {
        sapply(mag, function(magnitude) {
                r <- 1  # default: codes other than K, M or B leave the value unscaled
                
                if (grepl("k", magnitude, ignore.case = TRUE)) {
                        r <- 1000
                } else if (grepl("m", magnitude, ignore.case = TRUE)) {
                        r <- 1000000
                } else if (grepl("b", magnitude, ignore.case = TRUE)) {
                        r <- 1000000000
                }
                
                r
        })
}

economy <- mutate(economy, 
                  prop_mag = conv.magnitude(PROPDMGEXP),
                  crop_mag = conv.magnitude(CROPDMGEXP))

economy <- mutate(economy, 
                  damage = (PROPDMG * prop_mag) + (CROPDMG * crop_mag))
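The conversion above calls `grepl()` once per row, which can be slow on a few hundred thousand rows. As an alternative, here is a vectorized sketch based on a named lookup vector. The names `magnitude_lookup` and `conv_magnitude_vec` are hypothetical, and the sketch assumes the codes are exactly "K", "M" or "B" in either case, whereas `conv.magnitude()` matches those letters anywhere in the string.

```r
# Multipliers for the three codes handled above; every other code
# (including the empty string) falls back to 1, as in conv.magnitude().
magnitude_lookup <- c(K = 1e3, M = 1e6, B = 1e9)

conv_magnitude_vec <- function(mag) {
  # Index the named vector by the upper-cased code; unknown codes give NA.
  multiplier <- unname(magnitude_lookup[toupper(mag)])
  multiplier[is.na(multiplier)] <- 1
  multiplier
}

conv_magnitude_vec(c("K", "m", "B", "", "5"))
```

The function takes a whole column at once, so it could be used inside `mutate()` in the same way as `conv.magnitude()`.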

We can then total up the economic damage caused by each event type.

economy_by_event <-
        group_by(economy, EVTYPE) %>%
        summarize(total_damage = sum(damage)) %>% 
        arrange(desc(total_damage))

head(economy_by_event)
##              EVTYPE total_damage
## 1             FLOOD 150319678257
## 2 HURRICANE/TYPHOON  71913712800
## 3           TORNADO  57352114049
## 4       STORM SURGE  43323541000
## 5              HAIL  18758221521
## 6       FLASH FLOOD  17562129167

We can now plot the four most economically damaging event types.

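# Keep the four costliest event types (head() selects rows for any table class)
e <- head(economy_by_event, 4)
# geom_col() takes the bar heights directly from total_damage
ggplot(e, aes(x = EVTYPE, y = total_damage)) + geom_col()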

Here we see that floods cause the most damage, roughly twice as much as the next-worst event type, hurricanes/typhoons.
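The "roughly twice" comparison can be checked directly from the totals in the table above. This short sketch copies the two largest totals, expresses them in billions of US dollars, and computes their ratio.

```r
# Totals copied from the economy_by_event table above (US dollars).
flood     <- 150319678257
hurricane <- 71913712800

# Express the totals in billions of dollars for readability.
round(c(FLOOD = flood, `HURRICANE/TYPHOON` = hurricane) / 1e9, 1)

# Ratio behind the "roughly twice" statement.
ratio <- flood / hurricane
round(ratio, 2)
```

The ratio comes out at about 2.09, that is, floods account for a little over twice the recorded damage of hurricanes/typhoons.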