This is an assignment for the Coursera course Reproducible Research. The goal of the assignment is to create and publish an R Markdown document that communicates a simple analysis. The final document is self-contained: the code provided here downloads and processes the raw data.
This analysis answers the following questions regarding weather events in the United States:
1. Health Hazards: Which types of events are most harmful with respect to population health?
2. Economic Damage: Which types of events have the greatest economic consequences?
The health hazard analysis shows that tornadoes are the weather events that have killed the most people. Excessive heat is the second leading cause of death due to weather events.
The economic damage analysis shows that extreme water events such as floods, hurricanes/typhoons, and storm surges cause the most economic damage. Tornadoes also cause significant economic damage. These weather events cause significant damage to both property and crops.
Government should spend resources on identifying methods to minimize the impact of tornadoes, floods, hurricanes, and heat waves. These extreme weather events cannot be prevented, so we must detect conditions early and learn to protect lives and property.
This data comes from the United States Department of Commerce, National Oceanic & Atmospheric Administration (NOAA), National Weather Service. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are fewer events recorded.
Links to NOAA Database Documentation
* National Weather Service Storm Data Documentation
* National Climatic Data Center Storm Events FAQ
Let's start by loading the data provided by the instructor. First we check whether the CSV file has already been extracted. If not, we download the bz2 archive and decompress it with bunzip2 from the R.utils package. Once the file is extracted, we read the data into a data.frame.
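if (!file.exists("./repdata-data-StormData.csv")) {
  fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  # mode = "wb" keeps the binary archive intact on Windows
  download.file(fileURL, "./repdata-data-StormData.csv.bz2", mode = "wb")
  library(R.utils)
  # destname must be the path of the extracted file, not a directory
  bunzip2("./repdata-data-StormData.csv.bz2", destname = "./repdata-data-StormData.csv")
}
# stringsAsFactors = TRUE reproduces the factor-style summaries shown in this report
storm_data <- read.csv("./repdata-data-StormData.csv", stringsAsFactors = TRUE)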
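As a side note, base R can usually read a bzip2-compressed CSV directly, because the connection that `read.csv` opens detects the compression when reading. The sketch below shows this on a tiny temporary file; the file and its two rows are made up for illustration and are not part of the storm data.

```r
# Write a tiny CSV through a bzip2 connection, then read it back directly.
tmp <- tempfile(fileext = ".csv.bz2")
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(3, 1))
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv opens the path as a file connection, which detects bzip2 compression
toy_back <- read.csv(tmp, stringsAsFactors = FALSE)
print(toy_back)
unlink(tmp)
```

If this holds in your R version, the explicit decompression step is a convenience rather than a strict requirement.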
For this analysis we are concerned with several key variables recorded for each weather event:
* EVTYPE - Type of weather event
* FATALITIES - Number of people killed
* INJURIES - Number of people injured
* PROPDMG - Property damage in US Dollars, before scaling by PROPDMGEXP
* PROPDMGEXP - Magnitude of property damage, coded in mixed units
* CROPDMG - Crop damage in US Dollars, before scaling by CROPDMGEXP
* CROPDMGEXP - Magnitude of crop damage, coded in mixed units
Let’s check the summary statistics for these variables.
storm_vars <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
summary(storm_data[storm_vars])
##                EVTYPE         FATALITIES     INJURIES         PROPDMG
##  HAIL             :288661   Min.   :  0   Min.   :   0.0   Min.   :   0
##  TSTM WIND        :219940   1st Qu.:  0   1st Qu.:   0.0   1st Qu.:   0
##  THUNDERSTORM WIND: 82563   Median :  0   Median :   0.0   Median :   0
##  TORNADO          : 60652   Mean   :  0   Mean   :   0.2   Mean   :  12
##  FLASH FLOOD      : 54277   3rd Qu.:  0   3rd Qu.:   0.0   3rd Qu.:   0
##  FLOOD            : 25326   Max.   :583   Max.   :1700.0   Max.   :5000
##  (Other)          :170878
##    PROPDMGEXP        CROPDMG        CROPDMGEXP
##         :465934   Min.   :  0.0          :618413
##  K      :424665   1st Qu.:  0.0   K      :281832
##  M      : 11330   Median :  0.0   M      :  1994
##  0      :   216   Mean   :  1.5   k      :    21
##  B      :    40   3rd Qu.:  0.0   0      :    19
##  5      :    28   Max.   :990.0   B      :     9
##  (Other):    84                   (Other):     9
The exponent variables PROPDMGEXP and CROPDMGEXP have mixed units. Some exponents are integers, but most are letter prefixes like “K” for thousand and “M” for million. We need a function to standardize the units. With this function we convert all the values in these columns to integer exponents. Then we can calculate the estimated economic damage.
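Before the full function below, the arithmetic can be illustrated on a few hypothetical records. This is a minimal sketch, not the report's method: the sample values, the `exp_lookup` table, and the choice to treat a blank code as 10^0 are assumptions made for illustration, and the lookup covers only the three most common letter codes.

```r
# Hypothetical sample records: a reported value and its unit code
dmg    <- c(25, 2.5, 1.2, 10)
dmgexp <- c("K", "M", "B", "")

# Lookup table from unit code to base-10 exponent (illustrative subset)
exp_lookup <- c("K" = 3, "M" = 6, "B" = 9)
exponent <- unname(exp_lookup[toupper(dmgexp)])
exponent[is.na(exponent)] <- 0  # assumption: blank or unknown codes mean 10^0

# Damage in US Dollars = reported value * 10^exponent
dollars <- dmg * 10^exponent
print(dollars)
```

So a record of 25 with code "K" stands for 25 * 10^3 dollars, and a record with no code is taken at face value.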
# Function to convert mixed units to an integer exponent
dmg_exponent <- function(exp_vector) {
  # Returns a numeric vector of base-10 exponents.
  # exp_vector holds exponent notations: digits and prefixes like "K"
  exp_out <- as.character(exp_vector)  # digit codes pass through unchanged
  exp_out[exp_out %in% c("", "?", "+", "-")] <- "0"  # blank or unknown: 10^0
  exp_out[exp_out %in% c("h", "H")] <- "2"
  exp_out[exp_out %in% c("k", "K")] <- "3"
  exp_out[exp_out %in% c("m", "M")] <- "6"
  exp_out[exp_out %in% c("b", "B")] <- "9"
  return(as.numeric(exp_out))
}
# Create tidy exponent vectors
storm_data$PROPDMG_exponent <- dmg_exponent(storm_data$PROPDMGEXP)
storm_data$CROPDMG_exponent <- dmg_exponent(storm_data$CROPDMGEXP)
# Calculate actual property and crop damage in US Dollars
storm_data$PROP_DMG <- storm_data$PROPDMG * 10^storm_data$PROPDMG_exponent
storm_data$CROP_DMG <- storm_data$CROPDMG * 10^storm_data$CROPDMG_exponent
storm_data$TOTAL_DMG <- storm_data$PROP_DMG + storm_data$CROP_DMG
# Convert beginning date string to a calendar date/time vector
# (POSIXct stores cleanly as a data frame column)
storm_data$storm_dates <- as.POSIXct(as.character(storm_data$BGN_DATE),
                                     format = "%m/%d/%Y %H:%M:%S", tz = "GMT")
# Inspect the tidy storm data
storm_vars <- c("EVTYPE", "FATALITIES", "INJURIES", "PROP_DMG", "CROP_DMG", "TOTAL_DMG")
summary(storm_data[storm_vars])
##                EVTYPE         FATALITIES     INJURIES
##  HAIL             :288661   Min.   :  0   Min.   :   0.0
##  TSTM WIND        :219940   1st Qu.:  0   1st Qu.:   0.0
##  THUNDERSTORM WIND: 82563   Median :  0   Median :   0.0
##  TORNADO          : 60652   Mean   :  0   Mean   :   0.2
##  FLASH FLOOD      : 54277   3rd Qu.:  0   3rd Qu.:   0.0
##  FLOOD            : 25326   Max.   :583   Max.   :1700.0
##  (Other)          :170878
##     PROP_DMG           CROP_DMG          TOTAL_DMG
##  Min.   :0.00e+00   Min.   :0.00e+00   Min.   :0.00e+00
##  1st Qu.:0.00e+00   1st Qu.:0.00e+00   1st Qu.:0.00e+00
##  Median :0.00e+00   Median :0.00e+00   Median :0.00e+00
##  Mean   :4.75e+05   Mean   :5.44e+04   Mean   :5.29e+05
##  3rd Qu.:5.00e+02   3rd Qu.:0.00e+00   3rd Qu.:1.00e+03
##  Max.   :1.15e+11   Max.   :5.00e+09   Max.   :1.15e+11
##
Now we can see that the damage and fatality statistics are highly skewed. We can tell because the median is zero for each of these variables, the mean is much larger, and the maximum values are extremely large. If we were creating predictive models, we might want to use the log of total damage. However, the raw damage numbers are easier for humans to interpret and are therefore more useful when looking at differences.
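The log transformation mentioned above can be sketched on a handful of made-up values. Adding 1 before taking the logarithm is an assumption made here so that zero-damage events stay finite. The transform is not used anywhere else in this analysis.

```r
# Toy damage totals in US Dollars (made-up values), heavily right-skewed
toy_damage <- c(0, 0, 0, 500, 1000, 2.5e6, 1.15e11)

# log10(x + 1) maps zero-damage events to 0 instead of -Inf
log_damage <- log10(toy_damage + 1)

# Compare the spread before and after the transformation
print(summary(toy_damage))
print(summary(log_damage))
```

On the raw scale the mean sits far above the median, and on the log scale the values are compressed into a range that is easier to model.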
With this tidy data, we can now look at the relative severity of these weather events. We want to identify the events that pose the greatest health hazard and those that cause the most economic damage.
Let’s start this analysis by looking at both fatalities and injuries. We calculate the total number of fatalities and injuries for each weather event type. We then merge the data frames and compute the correlation coefficient between the two totals. It turns out that injuries are highly correlated with fatalities, so we will work only with fatalities for the rest of this analysis.
storm_fatalities <- aggregate(FATALITIES ~ EVTYPE, data=storm_data, FUN = sum)
storm_injuries <- aggregate(INJURIES ~ EVTYPE, data=storm_data, FUN = sum)
storm_fatalities <- merge(storm_fatalities, storm_injuries, by = "EVTYPE")
storm_fatalities <- storm_fatalities[order(storm_fatalities$FATALITIES, decreasing=TRUE), ]
cor(storm_fatalities$FATALITIES, storm_fatalities$INJURIES)
## [1] 0.9438
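The coefficient above is computed on totals per event type, so a single dominant category can drive it. A rank-based (Spearman) coefficient is one possible robustness check. The following is a minimal sketch on made-up toy records; none of these numbers come from the NOAA data, and the single `aggregate` call with `cbind` is an alternative to the two-step merge used above, not the original method.

```r
# Made-up per-event records, for illustration only
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HEAT", "FLOOD", "FLOOD", "HAIL"),
  FATALITIES = c(50, 30, 20, 5, 3, 0),
  INJURIES   = c(900, 400, 60, 40, 10, 2)
)

# One aggregate call sums both outcomes per event type
toy_totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Pearson uses the raw totals; Spearman uses only their ranks
pearson  <- cor(toy_totals$FATALITIES, toy_totals$INJURIES)
spearman <- cor(toy_totals$FATALITIES, toy_totals$INJURIES, method = "spearman")
print(c(pearson = pearson, spearman = spearman))
```

If both coefficients are high, the relationship does not depend only on the largest category.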
We want to view an ordered bar plot to identify the most dangerous conditions. This will also show us the relative differences between each of these conditions. We will use the package ggplot2 to create the bar plot.
# Order Event Types by Fatalities and plot
storm_fatalities$EVTYPE <- with(storm_fatalities, reorder(EVTYPE, FATALITIES))
library(ggplot2)
ggplot(data = storm_fatalities[1:10, ], aes(x = EVTYPE, y = FATALITIES)) +
  geom_bar(stat = "identity") + coord_flip() +
  labs(x = "Event Type", y = "Fatalities") +
  labs(title = "Total Fatalities for Top 10 Weather Event Types 1950-2011")
It is clear that tornadoes are by far the deadliest weather events. Excessive heat is a distant, but significant, second.
Now we investigate economic losses. We calculate total damage, property damage, and crop damage for each weather event type. We then merge the data frames and order by total economic damage.
storm_damage <- aggregate(TOTAL_DMG ~ EVTYPE, data=storm_data, FUN = sum)
storm_prop_dmg <- aggregate(PROP_DMG ~ EVTYPE, data=storm_data, FUN = sum)
storm_crop_dmg <- aggregate(CROP_DMG ~ EVTYPE, data=storm_data, FUN = sum)
storm_damage <- merge(storm_damage, storm_prop_dmg, by = "EVTYPE")
storm_damage <- merge(storm_damage, storm_crop_dmg, by = "EVTYPE")
storm_damage <- storm_damage[order(storm_damage$TOTAL_DMG, decreasing=TRUE), ]
We want to view an ordered bar plot to identify the most damaging conditions. This will also show us the relative differences between each of these conditions.
# Order Event Types by Total Damage and plot
storm_damage$EVTYPE <- with(storm_damage, reorder(EVTYPE, TOTAL_DMG))
ggplot(data = storm_damage[1:10, ], aes(x = EVTYPE, y = TOTAL_DMG)) +
  geom_bar(stat = "identity") + coord_flip() +
  labs(x = "Event Type", y = "Total Damage (US Dollars)") +
  labs(title = "Total Damage for Top 10 Weather Event Types 1950-2011")
Extreme water events such as floods, hurricanes/typhoons, and storm surges cause the most economic damage. Tornadoes also cause significant economic damage. These weather events damage both property and crops, although the table below shows that property accounts for most of the loss.
head(storm_damage)
##                EVTYPE TOTAL_DMG  PROP_DMG  CROP_DMG
## 170             FLOOD 1.503e+11 1.447e+11 5.662e+09
## 411 HURRICANE/TYPHOON 7.191e+10 6.931e+10 2.608e+09
## 834           TORNADO 5.736e+10 5.695e+10 4.150e+08
## 670       STORM SURGE 4.332e+10 4.332e+10 5.000e+03
## 244              HAIL 1.876e+10 1.574e+10 3.026e+09
## 153       FLASH FLOOD 1.824e+10 1.682e+10 1.421e+09
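To put the split between property and crop damage in perspective, the figures printed above can be turned into a crop share per event type. This is a small sketch that retypes the six rows by hand; the `top_damage` data frame and the `CROP_SHARE_PCT` column are names introduced here for illustration.

```r
# Property and crop damage (US Dollars) copied from the table above
top_damage <- data.frame(
  EVTYPE   = c("FLOOD", "HURRICANE/TYPHOON", "TORNADO",
               "STORM SURGE", "HAIL", "FLASH FLOOD"),
  PROP_DMG = c(1.447e11, 6.931e10, 5.695e10, 4.332e10, 1.574e10, 1.682e10),
  CROP_DMG = c(5.662e09, 2.608e09, 4.150e08, 5.000e03, 3.026e09, 1.421e09)
)

# Crop damage as a percentage of property plus crop damage
top_damage$CROP_SHARE_PCT <- round(
  100 * top_damage$CROP_DMG / (top_damage$PROP_DMG + top_damage$CROP_DMG), 1
)
print(top_damage[order(top_damage$CROP_SHARE_PCT, decreasing = TRUE), ])
```

Among these six event types, hail has the largest crop share and storm surge the smallest, while property damage makes up the bulk of the total in every row.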
R Core Team (2014). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
RStudio: Integrated development environment for R. RStudio Version 0.98.994 © 2009-2013 RStudio, Inc. http://www.rstudio.com/
H. Wickham. ggplot2: elegant graphics for data analysis. Springer New York, 2009.
Henrik Bengtsson (2014). R.utils: Various programming utilities. R package version 1.33.0. http://CRAN.R-project.org/package=R.utils
Roger D. Peng, PhD, Jeff Leek, PhD, Brian Caffo, PhD. Coursera Reproducible Research Course repdata-006. http://www.coursera.org/course/repdata