NOAA Weather Impact Analysis Project

Synopsis

This is an assignment from the Coursera course on Reproducible Research. The goal of this assignment is to create and publish an RMarkdown document that communicates a simple analysis. The final document is self-contained: the code provided here downloads and processes the raw data.

This analysis answers the following questions regarding weather events in the United States:
1. Health Hazards: Which types of events are most harmful with respect to population health?
2. Economic Damage: Which types of events have the greatest economic consequences?

Conclusions and Recommendations

The health hazard analysis shows that tornadoes have caused more deaths than any other type of weather event. Excessive heat is the second leading cause of weather-related deaths.

The economic damage analysis shows that extreme water events such as floods, hurricanes/typhoons, and storm surges cause the most economic damage. Tornadoes also cause significant economic damage. These weather events cause significant damage to both property and crops.

The government should spend resources on identifying methods to minimize the impact of tornadoes, floods, hurricanes, and heat waves. These extreme weather events cannot be prevented, so we must detect the conditions early and learn to protect lives and property.

Data Processing

This data comes from the United States Department of Commerce National Oceanic & Atmospheric Administration National Weather Service. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are fewer events recorded.

Links to NOAA Database Documentation
* National Weather Service Storm Data Documentation
* National Climatic Data Center Storm Events FAQ

Let’s start by loading the data provided by the instructor. First we check whether the compressed file has already been extracted. If not, we download it and decompress it. We use the R.utils package to decompress the bz2 file. Once it is decompressed, we read the data into a data.frame.

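if (!file.exists("./repdata-data-StormData.csv")) {
    fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    # mode = "wb" keeps the binary archive intact on Windows
    download.file(fileURL, "./repdata-data-StormData.csv.bz2", mode = "wb")
    library(R.utils)
    # destname must be the path of the output file, not a directory
    bunzip2("./repdata-data-StormData.csv.bz2", destname = "./repdata-data-StormData.csv")
}

# stringsAsFactors = TRUE reproduces the factor summaries shown below on R >= 4.0
storm_data <- read.csv("./repdata-data-StormData.csv", stringsAsFactors = TRUE)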
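As an aside, base R can usually read bzip2-compressed text files directly, because `read.csv()` opens its input through `file()`, which detects compression when reading. The sketch below demonstrates this on a small temporary file instead of the storm data. The toy data frame and the temporary file are made up for illustration.

```r
# Toy data standing in for the storm file (values are illustrative only)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(3, 1))

# Write a bzip2-compressed CSV to a temporary file
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() decompresses transparently, so no separate bunzip2 step is needed
toy_back <- read.csv(tmp)
print(toy_back)
```

If this holds for the storm file as well, calling `read.csv()` on the downloaded `.csv.bz2` file would make the extraction step optional, at the cost of decompressing on every read.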

For this analysis we are concerned with several key statistics related to each weather event:
* EVTYPE - Type of weather event
* FATALITIES - Number of people killed
* INJURIES - Number of people injured
* PROPDMG - Property damage amount in US Dollars, before scaling by PROPDMGEXP
* PROPDMGEXP - Magnitude of property damage in mixed units
* CROPDMG - Crop damage amount in US Dollars, before scaling by CROPDMGEXP
* CROPDMGEXP - Magnitude of crop damage in mixed units

Let’s check the summary statistics for these variables.

storm_vars <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
summary(storm_data[storm_vars])

##                EVTYPE         FATALITIES     INJURIES         PROPDMG    
##  HAIL             :288661   Min.   :  0   Min.   :   0.0   Min.   :   0  
##  TSTM WIND        :219940   1st Qu.:  0   1st Qu.:   0.0   1st Qu.:   0  
##  THUNDERSTORM WIND: 82563   Median :  0   Median :   0.0   Median :   0  
##  TORNADO          : 60652   Mean   :  0   Mean   :   0.2   Mean   :  12  
##  FLASH FLOOD      : 54277   3rd Qu.:  0   3rd Qu.:   0.0   3rd Qu.:   0  
##  FLOOD            : 25326   Max.   :583   Max.   :1700.0   Max.   :5000  
##  (Other)          :170878                                                
##    PROPDMGEXP        CROPDMG        CROPDMGEXP    
##         :465934   Min.   :  0.0          :618413  
##  K      :424665   1st Qu.:  0.0   K      :281832  
##  M      : 11330   Median :  0.0   M      :  1994  
##  0      :   216   Mean   :  1.5   k      :    21  
##  B      :    40   3rd Qu.:  0.0   0      :    19  
##  5      :    28   Max.   :990.0   B      :     9  
##  (Other):    84                   (Other):     9

The exponent variables PROPDMGEXP and CROPDMGEXP have mixed units. Some exponents are integers, but most are letter prefixes like “K” for thousand and “M” for million. We need a function to standardize the units. With this function we convert all the values in these columns to integer exponents of ten. Then we can calculate the estimated economic damage.

# Function to convert mixed units to an integer exponent
dmg_exponent <- function(exp_vector) {
  # This function returns a numeric vector of base 10 exponents
  # The exp_vector is a vector of exponent notations including integers and prefixes like "K"
  exp_out <- as.character(exp_vector)   # Initialize response
  exp_out[exp_vector == "" | exp_vector == "?"] <- 0
  exp_out[exp_vector == "+" | exp_vector == "-"] <- 0
  exp_out[exp_vector == "h" | exp_vector == "H"] <- 2
  exp_out[exp_vector == "k" | exp_vector == "K"] <- 3
  exp_out[exp_vector == "m" | exp_vector == "M"] <- 6
  exp_out[exp_vector == "b" | exp_vector == "B"] <- 9
  return(as.numeric(exp_out))
}

# Create tidy exponent vectors
storm_data$PROPDMG_exponent <- dmg_exponent(storm_data$PROPDMGEXP)
storm_data$CROPDMG_exponent <- dmg_exponent(storm_data$CROPDMGEXP)

# Calculate property and crop damage in US Dollars
storm_data$PROP_DMG <- storm_data$PROPDMG * 10^storm_data$PROPDMG_exponent
storm_data$CROP_DMG <- storm_data$CROPDMG * 10^storm_data$CROPDMG_exponent
storm_data$TOTAL_DMG <- storm_data$PROP_DMG + storm_data$CROP_DMG
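To make the conversion concrete, here is a minimal sketch that applies the same rules to a handful of made-up rows. It uses a lookup table in place of the repeated assignments. The helper name `exponent_from_code` and the toy values are assumptions for illustration, not part of the analysis.

```r
# Hypothetical helper: map exponent codes to powers of ten via a lookup table
exponent_from_code <- function(code) {
  code <- toupper(as.character(code))
  lookup <- c("?" = 0, "+" = 0, "-" = 0, "H" = 2, "K" = 3, "M" = 6, "B" = 9)
  out <- unname(lookup[code])                  # NA where the code is not in the table
  is_digit <- grepl("^[0-9]$", code)
  out[is_digit] <- as.numeric(code[is_digit])  # single digits are used as exponents directly
  out[code == ""] <- 0                         # a blank code means no scaling
  out
}

# Made-up damage amounts and exponent codes
toy <- data.frame(DMG = c(25, 2.5, 10, 3, 1.5),
                  EXP = c("K", "M", "", "2", "b"))
toy$DOLLARS <- toy$DMG * 10^exponent_from_code(toy$EXP)
print(toy)
```

Here 25 with code “K” becomes 25,000 dollars and 1.5 with code “b” becomes 1.5 billion dollars. Unlike `dmg_exponent`, this sketch returns NA silently for unrecognized codes instead of raising a coercion warning.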

# Convert beginning date string to a calendar date/time vector
# (POSIXct stores cleanly as a data frame column; the format is passed by name)
storm_data$storm_dates <- as.POSIXct(storm_data$BGN_DATE, format = "%m/%d/%Y %H:%M:%S", tz = "GMT")

# Inspect the tidy storm data
storm_vars <- c("EVTYPE", "FATALITIES", "INJURIES", "PROP_DMG", "CROP_DMG", "TOTAL_DMG")
summary(storm_data[storm_vars])

##                EVTYPE         FATALITIES     INJURIES     
##  HAIL             :288661   Min.   :  0   Min.   :   0.0  
##  TSTM WIND        :219940   1st Qu.:  0   1st Qu.:   0.0  
##  THUNDERSTORM WIND: 82563   Median :  0   Median :   0.0  
##  TORNADO          : 60652   Mean   :  0   Mean   :   0.2  
##  FLASH FLOOD      : 54277   3rd Qu.:  0   3rd Qu.:   0.0  
##  FLOOD            : 25326   Max.   :583   Max.   :1700.0  
##  (Other)          :170878                                 
##     PROP_DMG           CROP_DMG          TOTAL_DMG       
##  Min.   :0.00e+00   Min.   :0.00e+00   Min.   :0.00e+00  
##  1st Qu.:0.00e+00   1st Qu.:0.00e+00   1st Qu.:0.00e+00  
##  Median :0.00e+00   Median :0.00e+00   Median :0.00e+00  
##  Mean   :4.75e+05   Mean   :5.44e+04   Mean   :5.29e+05  
##  3rd Qu.:5.00e+02   3rd Qu.:0.00e+00   3rd Qu.:1.00e+03  
##  Max.   :1.15e+11   Max.   :5.00e+09   Max.   :1.15e+11  
##

Now we can see that the damage and fatality statistics are highly skewed! We can tell because the median is zero for each of these variables, the mean is much larger, and the maximum values are extremely large. If we were creating predictive models, we might want to use the log of total damage. However, the raw damage numbers are easier for humans to interpret and are therefore more useful when looking at differences.
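The effect of skew, and of a log transform, can be illustrated on a small made-up vector. The numbers below are not taken from the storm data. The `+ 1` offset inside the logarithm is one common way to keep zero-damage events defined, and is an assumption of this sketch.

```r
# Made-up damage totals in US Dollars: mostly zero, with one huge event
damage <- c(0, 0, 0, 0, 0, 0, 500, 1000, 25000, 2.5e6, 1.15e11)

median(damage)   # the typical event causes no recorded damage
mean(damage)     # pulled far upward by the single extreme value

# log10(x + 1) keeps zeros defined and compresses the range
log_damage <- log10(damage + 1)
range(log_damage)
```

On the log scale the values span roughly eleven orders of magnitude in a compact range, which is why a transform like this is often preferred for modeling but not for reporting dollar totals.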

Results

With this tidy data, we can now look at the relative severity of these weather events. We want to identify events that cause the highest health hazard, and those that create the most economic damage.

Health Hazards

Let’s start this analysis by looking at both fatalities and injuries. We calculate the total number of fatalities and injuries for each weather event type. We then merge the data frames and compute the correlation coefficient between the two totals. It turns out that injuries are highly correlated with fatalities, so we will just work with fatalities for the rest of this analysis.

storm_fatalities <- aggregate(FATALITIES ~ EVTYPE, data=storm_data, FUN = sum)
storm_injuries <- aggregate(INJURIES ~ EVTYPE, data=storm_data, FUN = sum)

storm_fatalities <- merge(storm_fatalities, storm_injuries, by = "EVTYPE")
storm_fatalities <- storm_fatalities[order(storm_fatalities$FATALITIES, decreasing=TRUE), ]

cor(storm_fatalities$FATALITIES, storm_fatalities$INJURIES)

## [1] 0.9438
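The same totals can also be computed in a single `aggregate` call by binding both columns on the left-hand side of the formula. The sketch below shows this on a made-up data frame. The toy values are assumptions for illustration only.

```r
# Made-up event records
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HEAT", "HEAT", "FLOOD"),
  FATALITIES = c(5, 3, 4, 0, 1),
  INJURIES   = c(50, 20, 10, 5, 2)
)

# Sum both columns per event type in one step, so no merge is needed
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)
totals <- totals[order(totals$FATALITIES, decreasing = TRUE), ]
print(totals)

# Pearson correlation (the default for cor) between the per-type totals
cor(totals$FATALITIES, totals$INJURIES)
```

Note that the formula interface drops rows with missing values by default, so with `cbind` a row is excluded if either column is NA.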

We want to view an ordered bar plot to identify the most dangerous conditions. This will also show us the relative differences between each of these conditions. We will use the package ggplot2 to create the bar plot.

# Order Event Types by Fatalities and plot
storm_fatalities$EVTYPE <- with(storm_fatalities, reorder(EVTYPE, FATALITIES))

library(ggplot2)
ggplot(data = storm_fatalities[1:10, ], aes(x = EVTYPE, y = FATALITIES)) +
  geom_bar(stat = "identity") + coord_flip() +
  labs(x = "Event Type", y = "Fatalities") +
  labs(title = "Total Fatalities for Top 10 Weather Event Types 1950-2011")

[Figure: bar plot of total fatalities for the top 10 weather event types, 1950-2011]

It is clear that tornadoes are by far the deadliest weather events. Excessive heat is a distant, but significant, second.

Economic Damage

Now we investigate economic losses. We calculate total damage, property damage, and crop damage for each weather event type. We then merge the data frames and order by total economic damage.

storm_damage <- aggregate(TOTAL_DMG ~ EVTYPE, data=storm_data, FUN = sum)
storm_prop_dmg <- aggregate(PROP_DMG ~ EVTYPE, data=storm_data, FUN = sum)
storm_crop_dmg <- aggregate(CROP_DMG ~ EVTYPE, data=storm_data, FUN = sum)

storm_damage <- merge(storm_damage, storm_prop_dmg, by = "EVTYPE")
storm_damage <- merge(storm_damage, storm_crop_dmg, by = "EVTYPE")
storm_damage <- storm_damage[order(storm_damage$TOTAL_DMG, decreasing=TRUE), ]

We want to view an ordered bar plot to identify the most damaging conditions. This will also show us the relative differences between each of these conditions.

# Order Event Types by Total Damage and plot
storm_damage$EVTYPE <- with(storm_damage, reorder(EVTYPE, TOTAL_DMG))

ggplot(data = storm_damage[1:10, ], aes(x = EVTYPE, y = TOTAL_DMG)) +
  geom_bar(stat = "identity") + coord_flip() +
  labs(x = "Event Type", y = "Total Damage (US Dollars)") +
  labs(title = "Total Damage for Top 10 Weather Event Types 1950-2011")

[Figure: bar plot of total damage for the top 10 weather event types, 1950-2011]

Extreme water events such as floods, hurricanes/typhoons, and storm surges cause the most economic damage. Tornadoes also cause significant economic damage. These weather events cause significant damage to both property and crops.

head(storm_damage)

##                EVTYPE TOTAL_DMG  PROP_DMG  CROP_DMG
## 170             FLOOD 1.503e+11 1.447e+11 5.662e+09
## 411 HURRICANE/TYPHOON 7.191e+10 6.931e+10 2.608e+09
## 834           TORNADO 5.736e+10 5.695e+10 4.150e+08
## 670       STORM SURGE 4.332e+10 4.332e+10 5.000e+03
## 244              HAIL 1.876e+10 1.574e+10 3.026e+09
## 153       FLASH FLOOD 1.824e+10 1.682e+10 1.421e+09
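To see how the totals split between property and crops, one option is to compute the crop share for each of the six rows above. The sketch below copies the rounded values from the printed table, so the percentages are approximate. The column name `CROP_PCT` is an assumption for illustration.

```r
# Values copied from the head(storm_damage) output above (rounded as printed)
top_damage <- data.frame(
  EVTYPE   = c("FLOOD", "HURRICANE/TYPHOON", "TORNADO",
               "STORM SURGE", "HAIL", "FLASH FLOOD"),
  PROP_DMG = c(1.447e11, 6.931e10, 5.695e10, 4.332e10, 1.574e10, 1.682e10),
  CROP_DMG = c(5.662e9, 2.608e9, 4.150e8, 5.000e3, 3.026e9, 1.421e9)
)

# Crop damage as a percentage of each event type's combined damage
top_damage$CROP_PCT <- round(
  100 * top_damage$CROP_DMG / (top_damage$PROP_DMG + top_damage$CROP_DMG), 1)
print(top_damage[, c("EVTYPE", "CROP_PCT")])
```

Among these six event types, hail has the largest crop share, while storm surge damage is almost entirely to property.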

Acknowledgements

R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.

H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009.

Henrik Bengtsson (2014). R.utils: Various programming utilities. R package version 1.33.0. http://CRAN.R-project.org/package=R.utils

Coursera Reproducible Research Course repdata-006. http://www.coursera.org/course/repdata Roger D. Peng, PhD, Jeff Leek, PhD, Brian Caffo, PhD