Synopsys

The purpose of this analysis is to determine the population and economic impact of different types of storms in the United States between 1950 and 2011. Data is from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database.

Analysis is performed on the United States as a whole and is summarized across the entire period of interest. For each event type in the storm database, the total population impacts (fatalities plus injuries) and total economic impacts (property damage plus crop damage) are determined. The top 25 contributors to population and economic impact are summarized in the Results section.

In terms of population impact, tornadoes are far and away the leading contributor, at 96,979 total casualties. The next most impactful contributors include excessive heat (8428), thunderstorm wind (7461), flood (7259), and lightning (6046), although all are more than an order of magnitude less than tornadoes.

Floods are the most devastating event in terms of ecomonic impact, causing more than $150B in damages between 1950 and 2011. Other top economic impact contributors include hurricanes/typhoons ($72B), tornadoes ($57B), storm surges ($43B), and hail ($19B).

Data Processing

Load Required Libraries

The first step is to load required libraries.

  1. ggplot2 is used to generate plots of the data
  2. dplyr is used to group and summarize the data
  3. reshape2 is used to stack the data from different categories
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(reshape2)

Download Data and Read into R

The data is downloaded from the course website if necessary.

fileUrl = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

if (!file.exists("./repdata_Fdata_FStormData.csv.bz2")) {
  download.file(fileUrl,"./repdata_Fdata_FStormData.csv.bz2",method="curl")
}

The data is read into R in .csv format. Note that read.csv can read the compressed file directly, so there is no need to unzip the file. Missing data is encoded as NA.

A brief summary of the raw data is provided below.

storm_raw <- read.csv("./repdata_Fdata_FStormData.csv.bz2", na.strings = "NA")
str(storm_raw)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Subset the Data

The raw data contains many columns of data that are not necessary for the purposes of this analysis. The dplyr() function select() is used to subset the data to only the columns of interest.

storm <- select(storm_raw, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
str(storm)
## 'data.frame':    902297 obs. of  7 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

For a complete description of the event types see the documentation here: National Weather Service Storm Data Documentation

Note that there are 985 levels to the EVTYPE (event type) variable. In the interest of keeping the summaries to a manageable size for presentation, we will summarize only the top 25 event types for both casualties (fatalities+injuries) and total damage (property+crop damage).

Clean the Data

Now that the data of interest has been extracted, data cleaning must be performed. Specifically, the economic impact data is not always coded in a consistent manner. The values in the PROPDMGEXP and CROPDMGEXP columns should indicate the magnitude of the impact (e.g. B for $Billions, M for $Millions, etc.)

table(storm$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330
table(storm$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

Some of the entries are as expected, but the “”, “-”, “?”, “+”, and integer entries are of concern. Create a subset of the data with the suspect items for further analysis.

#Filter the suspect values
suspects <- c("", "-", "?", "+", "0", "1", "2", "3", "4", "5", "6", "7", "8")
prop_temp <- filter(storm_raw, (PROPDMGEXP %in% suspects) & PROPDMG != 0)
crop_temp <- filter(storm_raw, (CROPDMGEXP %in% suspects) & CROPDMG != 0)

First, examine the suspect CROPDMGEXP values.

# See how many rows are of concern for each suspect value
table(crop_temp$CROPDMGEXP)
## 
##     ?  0  2  B  k  K  m  M 
##  3  0 12  0  0  0  0  0  0
# List the corresponding values of PROPDMGEXP to see if they can be used to impute
select(crop_temp, PROPDMGEXP, CROPDMGEXP)
##    PROPDMGEXP CROPDMGEXP
## 1                      0
## 2           K          0
## 3           K          0
## 4           K          0
## 5           K          0
## 6           K          0
## 7           K          0
## 8           K          0
## 9           K          0
## 10          K          0
## 11          K          0
## 12          K          0
## 13          K           
## 14          M           
## 15          K

There are only 15 total suspect crop damage entries, all either “” or “0”. Of these, all but one entry includes a corresponding entry for PROPDMGEXP, so this will be used to impute the missing values for CROPDMGEXP where available. The assumption for the final suspect entry will be to interpret as $K (thousands) as this is the predominant entry in the CROPDMGEXP column.

The following code imputes the missing CROPDMGEXP values.

# Create a temporary copy of the data
tmp<-storm

# Impute the entries for which there is a PROPDMGEXP
#table(tmp$CROPDMGEXP)
tmp$CROPDMGEXP <- as.character(tmp$CROPDMGEXP)
tmp$PROPDMGEXP <- as.character(tmp$PROPDMGEXP)
tmp$CROPDMGEXP <- ifelse(tmp$CROPDMG != 0 & (tmp$CROPDMGEXP == "" | tmp$CROPDMGEXP == "0"), tmp$PROPDMGEXP, tmp$CROPDMGEXP)

# Impute the remaining entry as "K"
#table(tmp$CROPDMGEXP)
tmp$CROPDMGEXP <- ifelse(tmp$CROPDMG != 0 & tmp$PROPDMGEXP == "" & (tmp$CROPDMGEXP == "" | tmp$CROPDMGEXP == "0"), "K", tmp$CROPDMGEXP)
#table(tmp$CROPDMGEXP)

# Replace the remaining "" and "?" with "0". Note these are all CROPDMG == 0.
tmp$CROPDMGEXP <- ifelse(tmp$CROPDMGEXP == "" | tmp$CROPDMGEXP == "?", "0", tmp$CROPDMGEXP)

# confirm there are no suspect values for CROPDMGEXP
tmp2 <- filter(tmp, (CROPDMGEXP %in% suspects) & CROPDMG != 0)
tmp2 # should be empty
## [1] EVTYPE     FATALITIES INJURIES   PROPDMG    PROPDMGEXP CROPDMG   
## [7] CROPDMGEXP
## <0 rows> (or 0-length row.names)
# Copy temp data back to data frame
storm<-tmp

Next, examine the suspect PROPDMGEXP values.

# See how many rows are of concern for each suspect value
table(prop_temp$PROPDMGEXP)
## 
##       -   ?   +   0   1   2   3   4   5   6   7   8   B   h   H   K   m 
##  76   1   0   5 209   0   1   1   4  18   3   2   0   0   0   0   0   0 
##   M 
##   0
tmp4 <- select(prop_temp, PROPDMGEXP, CROPDMGEXP)

# Find possible imputable values from CROPDMGEXP")
tmp4[tmp4$CROPDMGEXP == "K" | tmp4$CROPDMGEXP == "M",]
##     PROPDMGEXP CROPDMGEXP
## 26           5          M
## 192          0          K
## 197                     K
## 212          0          K
## 215          3          K
## 235                     K
## 272                     K
## 276          0          M
## 288                     K
## 293                     K
## 315          5          K
## 316          0          K

There are over 300 total suspect columns. The non-zero integers will be interpreted as 10^n. Of the remaining, several contain a valid entry (“M” or “K”) for CROPDMGEXP. These will be used to impute the PROPDMGEXP value if the suspect value was not an integer greater than 0. The remaining suspect values will be imputed as K, the predominant entry in this category.

The following code imputes the missing PROPDMGEXP values.

# Make a temporary copy of the data frame
tmp<-storm
#table(tmp$PROPDMGEXP)

# redefine the suspects list to allow integers > 0 to be interpreted as 10^n
suspects2 <- c("", "-", "?", "+", "0")
ok <- c("k", "K", "m", "M")

# Impute with CROPDMGEXP if available
tmp$PROPDMGEXP <- ifelse(tmp$PROPDMG != 0 & tmp$PROPDMGEXP %in% suspects2 & tmp$CROPDMGEXP %in% ok, tmp$CROPDMGEXP, tmp$PROPDMGEXP)

# Replace rest with "K"
tmp$PROPDMGEXP <- ifelse(tmp$PROPDMG != 0 & tmp$PROPDMGEXP %in% suspects2, "K", tmp$PROPDMGEXP)

# Replace the remaining "" and "?" with "0". Note these are all PROPDMG == 0.
tmp$PROPDMGEXP <- ifelse(tmp$PROPDMGEXP == "" | tmp$PROPDMGEXP == "?", "0", tmp$PROPDMGEXP)

# confirm there are no suspect values for PROPDMGEXP
tmp2 <- filter(tmp, (PROPDMGEXP %in% suspects2) & PROPDMG != 0)
tmp2 # should be empty
## [1] EVTYPE     FATALITIES INJURIES   PROPDMG    PROPDMGEXP CROPDMG   
## [7] CROPDMGEXP
## <0 rows> (or 0-length row.names)
# Copy the temp data back to the data frame
storm<-tmp

Next convert the “H”, “K”, “M”, “B” type values to an equivalent exponent value (e.g. K = 3, M = 6). Then compute the property damage and crop damage from the corresponding base and exponent values and store in the data frame. These are the final cleaned damage values that will be used during the data analysis.

tmp<-storm

hundreds <- c("h", "H")
thousands <- c("k", "K")
millions <- c("m", "M")
billions <- c("b", "B")

# replace the chars with exponent values
tmp$PROPDMGEXP <- ifelse(tmp$PROPDMGEXP %in% hundreds, "2",tmp$PROPDMGEXP)
tmp$PROPDMGEXP <- ifelse(tmp$PROPDMGEXP %in% thousands, "3",tmp$PROPDMGEXP)
tmp$PROPDMGEXP <- ifelse(tmp$PROPDMGEXP %in% millions, "6",tmp$PROPDMGEXP)
tmp$PROPDMGEXP <- ifelse(tmp$PROPDMGEXP %in% billions, "9",tmp$PROPDMGEXP)
tmp$CROPDMGEXP <- ifelse(tmp$CROPDMGEXP %in% hundreds, "2",tmp$CROPDMGEXP)
tmp$CROPDMGEXP <- ifelse(tmp$CROPDMGEXP %in% thousands, "3",tmp$CROPDMGEXP)
tmp$CROPDMGEXP <- ifelse(tmp$CROPDMGEXP %in% millions, "6",tmp$CROPDMGEXP)
tmp$CROPDMGEXP <- ifelse(tmp$CROPDMGEXP %in% billions, "9",tmp$CROPDMGEXP)

# convert to numeric
tmp$PROPDMGEXP <- as.numeric(tmp$PROPDMGEXP)
tmp$CROPDMGEXP <- as.numeric(tmp$CROPDMGEXP)

# Compute the property damage and crop damage from the base and exponent values
tmp <- mutate(tmp, prop_dmg = PROPDMG * 10^PROPDMGEXP, crop_dmg = CROPDMG * 10^CROPDMGEXP)

# Copy the temporary data back to the data frame
storm<-tmp

Group, Summarize, and Sort the Data

Now that the data is cleaned, the data processing occurs in several steps.

  1. Group the data by EVTYPE using the dplyr group_by() function.
  2. Summarize the data by calculating the total fatalities, injuries, property damage and crop damage by EVTYPE using the dplyr summarize() function.
  3. Add total casualty and damage columns to the data frame using the dplyr mutate() function.
  4. Sort the data by (1) total casualties and (2) total damages using the dplyr arrange() function. Retain the top 25 results from each sort.
  5. Rearrange the data using the reshape2 melt() function in order to generate stacked bar graphs for population impacts and economic impacts. Rename columns for clarity.
  6. List the top 5 contributors for total casualties and total damage.
# 1. Group data
storm_by_type <- group_by(storm, EVTYPE)

# 2. Summarize data
by_type <- summarize(storm_by_type, fatality=sum(FATALITIES), injury=sum(INJURIES), propertydmg=sum(prop_dmg), cropdmg=sum(crop_dmg))

# 3. Add total casualty and damage columns
by_type <- mutate(by_type, totcas = fatality + injury, totdmg = propertydmg + cropdmg)

# 4. Sort by total casualties and damage, retaining the top 25
totalcas <- arrange(by_type, desc(totcas))[1:25,]
totaldmg = arrange(by_type, desc(totdmg))[1:25,]

# 5. Rearrange the data
# Stack the casualty types in order to display in stacked bar graph
totalcas <- select(totalcas, EVTYPE, totcas, fatality, injury)
totalcas <- melt(totalcas, id.var = c('EVTYPE','totcas'), variable.name = 'casualty.type')
names(totalcas)[names(totalcas) == 'value'] <- 'casualties'

# Stack the damage types in order to display in stacked bar graph
totaldmg <- select(totaldmg, EVTYPE, totdmg, propertydmg, cropdmg)
totaldmg <- melt(totaldmg, id.var = c('EVTYPE','totdmg'), variable.name = 'damage.type')
names(totaldmg)[names(totaldmg) == 'value'] <- 'damages'

# 6. List the top 5 contributors to casualties
select(totalcas, EVTYPE, totcas)[1:5,]
##           EVTYPE totcas
## 1        TORNADO  96979
## 2 EXCESSIVE HEAT   8428
## 3      TSTM WIND   7461
## 4          FLOOD   7259
## 5      LIGHTNING   6046
# List the top 5 contributors to damage
select(totaldmg, EVTYPE, totdmg)[1:5,]
##              EVTYPE       totdmg
## 1             FLOOD 150319685250
## 2 HURRICANE/TYPHOON  71913712800
## 3           TORNADO  57365687090
## 4       STORM SURGE  43323541000
## 5              HAIL  18761537370

Results

Population Impacts

A summary of the population impact of US storms is shown in the following plot.

# Generate the Population impact plot
g <- ggplot(totalcas)
g +
  aes(x=reorder(EVTYPE,casualties,function(x) -sum(x)), y=casualties, fill = casualty.type) +
  geom_bar(stat="identity") + 
  xlab("Event Type") +
  ylab("Total Casualties") +
  ggtitle("Population Impact by Event Type \nUnited States Totals 1950-2011") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

The major takeaways include:

  1. Tornadoes dominate the population impacts, resulting in close to 97000 total casualties.
  2. Tornadoes also result in the highest number of fatalities.
  3. No other event type exceeds 9000 total casualties.

Economic Impacts

A summary of the economic impact of US storms is shown in the following plot.

# Generate the Economic impact plot
h <- ggplot(totaldmg)
h +
  aes(x=reorder(EVTYPE,damages,function(x) -sum(x)), y=damages/1e9, fill = damage.type) +
  geom_bar(stat="identity") + 
  xlab("Event Type") +
  ylab("Total Damages ($B)") +
  ggtitle("Economic Impact by Event Type \nUnited States Totals 1950-2011") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

The major takeaways include:

  1. At $150B, floods are the #1 cause of economic damage by roughly a factor of 2 over the next highest impact event (hurricane/typhoon).
  2. Drought is the primary contributor to crop damage.
  3. Property damages far exceed crop damages for virtually all storm types.