Reproducible Research Peer Assessment #2 - National Storm Events

Background

In this course project, we will explore the U.S. National Oceanic and Atmospheric Administration (NOAA) storm database. This database contains information on major storms and weather events in the U.S., including the time and location of each event, as well as estimates of any fatalities, injuries, and property damage.

Data Processing

The data for this assignment come as a CSV file compressed with the bzip2 algorithm to reduce its size. Let’s download the file and save it in our destination folder. We will take a quick look at the data before we begin our exploration and analysis.

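url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# mode = "wb" keeps the compressed file intact on Windows; the relative path works on any OS
download.file (url, destfile = "events_data.csv.bz2", mode = "wb")
# read.csv () decompresses the bzip2 file on the fly
file <- read.csv ("events_data.csv.bz2", colClasses = "character")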
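
As a small aside, read.csv () can be pointed straight at the compressed file because R detects bzip2 compression when it opens a file for reading. Here is a minimal sketch of that behaviour, assuming a throwaway temporary file with made-up rows (the path and the values are illustrative only, not taken from the storm data).

```r
# Write a tiny CSV through a bzip2 connection (made-up rows, temporary file)
tmp <- tempfile (fileext = ".csv.bz2")
con <- bzfile (tmp, open = "w")
writeLines (c ("EVTYPE,FATALITIES", "TORNADO,3", "HAIL,0"), con)
close (con)

# Read it back exactly as in the report: no explicit decompression step
demo <- read.csv (tmp, colClasses = "character")
str (demo)

# Remove the temporary file again
unlink (tmp)
```

Every column comes back as character because of colClasses = "character", which is why the report converts the columns it needs with as.numeric () later on.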

Dataset Transformation

The events in the database start in 1950 and end in November 2011. In the earlier years there are fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. With that in mind, let’s get an idea of the number of events recorded in each year from 1950 to 2011.

file_date <- file$BGN_DATE
file_date_convert <- strptime (file_date, format = "%m/%d/%Y %H:%M:%S")
class (file_date_convert)

## [1] "POSIXlt" "POSIXt"

file_date_format <- as.numeric (format (file_date_convert, "%Y"))
file_date_count <- data.frame (table (file_date_format))
names (file_date_count) <- c ("Year", "Event_Freq")
head (file_date_count)

##   Year Event_Freq
## 1 1950        223
## 2 1951        269
## 3 1952        272
## 4 1953        492
## 5 1954        609
## 6 1955       1413
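
Before trusting the counts, it may be worth checking that every date string actually parsed. The sketch below repeats the same strptime () and format () steps on a few made-up date strings (one deliberately malformed) and counts the failures. The sample values and the tz = "UTC" argument are my own assumptions for illustration.

```r
# Made-up sample in the same "month/day/year hour:minute:second" layout
sample_dates <- c ("4/18/1950 0:00:00", "6/8/1951 0:00:00",
                   "11/15/1951 0:00:00", "not a date")

# Same conversion as above; tz = "UTC" is an assumption to avoid local-time quirks
parsed <- strptime (sample_dates, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
years <- as.numeric (format (parsed, "%Y"))

# Strings that failed to parse become NA and are silently dropped by table ()
n_failed <- sum (is.na (years))
year_counts <- table (years)
n_failed
year_counts
```

If n_failed were non-zero on the real data, the yearly counts would be missing those rows without any warning.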

barplot (height = file_date_count$Event_Freq, names.arg = file_date_count$Year,
         xlab = "Year", ylab = "Event Frequency",
         main = "Event Frequency for Years (1950-2011)",
         col = "lightblue", las = 3, cex.names = 0.60)

Looking at the barplot, very few events were recorded from 1950 to 1991 relative to the most recent decade (2002-2011). The later years have more complete records, most likely thanks to improvements in technology and record-keeping. The goal of this project is to determine which events are most harmful to population health and which cause the most economic damage, so that future public health decisions and recommendations can be based on this data. Hence, it makes sense to keep only our most recent data. For our data analysis, we will keep all data from the last 20 years (1992-2011) and disregard the rest.

file_add <- cbind (file, file_date_format)
file_update <- subset (file_add, file_date_format > 1991)
dim (file_update)

## [1] 728272     38

Across the US, Which Type of Events Are Most Harmful to Population Health?

In order to determine which events are most harmful to population health, the number of fatalities was summed for each type of event. Injuries were disregarded because they range from minor incidents like a cut finger or bruised elbow to something more serious like a broken leg. Coming up with a system to rank different types of injuries relative to a fatality would have been an extremely complicated task. Hence, I thought it best to simplify things and only include fatalities. The top 5 event types are shown in the barplot below.

file_update$FATALITIES <- as.numeric (file_update$FATALITIES)
event_fatal <- aggregate (FATALITIES ~ EVTYPE, file_update, sum)
event_fatal_sort <- event_fatal [order (event_fatal$FATALITIES, decreasing = TRUE), ]
event_fatal_final <- event_fatal_sort [1:5, ]
event_fatal_final

##             EVTYPE FATALITIES
## 130 EXCESSIVE HEAT       1903
## 834        TORNADO       1660
## 153    FLASH FLOOD        978
## 275           HEAT        937
## 464      LIGHTNING        816

barplot (height = event_fatal_final$FATALITIES, names.arg = event_fatal_final$EVTYPE,
         xlab = "Event Type", ylab = "Number of Fatalities",
         main = "Fatalities for Diff. Event Types (1992-2011)",
         col = "red", ylim = c (0, 2000), cex.names = 0.75)
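
To make the sum, sort, and keep-the-top pattern used above easier to follow, here is a minimal sketch on a made-up data frame. The toy values are invented for illustration and have nothing to do with the real totals.

```r
# Toy data: five invented events across three event types
toy <- data.frame (EVTYPE = c ("TORNADO", "HEAT", "TORNADO", "HAIL", "HEAT"),
                   FATALITIES = c (2, 5, 1, 0, 4))

# Step 1: add up fatalities within each event type
toy_sum <- aggregate (FATALITIES ~ EVTYPE, toy, sum)

# Step 2: sort from the largest total to the smallest
toy_sort <- toy_sum [order (toy_sum$FATALITIES, decreasing = TRUE), ]

# Step 3: keep the top rows (2 here, 5 in the report)
toy_top <- head (toy_sort, 2)
toy_top
```

Note that aggregate () treats every distinct EVTYPE label as its own group, so labels such as "HEAT" and "EXCESSIVE HEAT" are summed separately, as in the table above.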

Fatalities Analysis & Results

Excessive heat caused the most deaths in the United States from 1992 to 2011, followed by tornadoes. Flash floods came in third, followed by heat and lightning. As a side note, the difference between excessive heat and heat is explained in greater detail in the National Weather Service documentation linked in the project instructions. Having excessive heat and heat in the top 5 came as no surprise, given the heat and high humidity of the American summer, especially in the southern U.S.

Across the US, Which Type of Events Have the Greatest Economic Impact?

Before performing data analysis on this topic, some data had to be cleaned and transformed, specifically the values of the “PROPDMGEXP” and “CROPDMGEXP” variables, which give the magnitude of each damage figure. Refer to the list below for more detail.
1. “B” and “b” denote billions and were replaced with “1E9”.
2. “M” and “m” denote millions and were replaced with “1E6”.
3. “K” and “k” denote thousands and were replaced with “1E3”.
4. “H” and “h” denote hundreds and were replaced with “1E2”.

Other types of values (e.g. 0, 1, 2, 3, etc.) were disregarded due to a lack of documentation.
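
The mapping in the list above can also be written as a single lookup instead of one substitution per letter. The following is only a sketch of that alternative, not the code used in this report. The names exp_lookup and exp_codes and the sample codes are made up for illustration.

```r
# Named vector holding the four documented multipliers (hypothetical helper)
exp_lookup <- c (H = 1E2, K = 1E3, M = 1E6, B = 1E9)

# Made-up sample of exponent codes, including undocumented ones
exp_codes <- c ("K", "m", "B", "", "5", "h", "?")

# toupper () folds "k"/"m"/"h"/"b" onto the upper-case names;
# anything not in the lookup (blank, digits, symbols) comes back as NA
multiplier <- unname (exp_lookup [toupper (exp_codes)])
multiplier
```

Undocumented codes end up as NA, which has the same effect as disregarding them once rows with missing multipliers are filtered out.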

library (dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

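# Note: this starts from `file` (all years, 1950-2011), not from the 1992-2011 subset `file_update`
file_1 <- mutate (file, PropDmgExp = PROPDMGEXP)
file_1 <- mutate (file_1, CropDmgExp = CROPDMGEXP)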

file_1$PropDmgExp <- gsub ("B", 1E9, file_1$PropDmgExp)
file_1$PropDmgExp <- gsub ("b", 1E9, file_1$PropDmgExp)
file_1$PropDmgExp <- gsub ("M", 1E6, file_1$PropDmgExp)
file_1$PropDmgExp <- gsub ("m", 1E6, file_1$PropDmgExp)
file_1$PropDmgExp <- gsub ("K", 1E3, file_1$PropDmgExp)
file_1$PropDmgExp <- gsub ("k", 1E3, file_1$PropDmgExp)
file_1$PropDmgExp <- gsub ("H", 1E2, file_1$PropDmgExp)
file_1$PropDmgExp <- gsub ("h", 1E2, file_1$PropDmgExp)

file_1$CropDmgExp <- gsub ("B", 1E9, file_1$CropDmgExp)
file_1$CropDmgExp <- gsub ("b", 1E9, file_1$CropDmgExp)
file_1$CropDmgExp <- gsub ("M", 1E6, file_1$CropDmgExp)
file_1$CropDmgExp <- gsub ("m", 1E6, file_1$CropDmgExp)
file_1$CropDmgExp <- gsub ("K", 1E3, file_1$CropDmgExp)
file_1$CropDmgExp <- gsub ("k", 1E3, file_1$CropDmgExp)
file_1$CropDmgExp <- gsub ("H", 1E2, file_1$CropDmgExp)
file_1$CropDmgExp <- gsub ("h", 1E2, file_1$CropDmgExp)

file_1$PROPDMG <- as.numeric (file_1$PROPDMG)
file_1$CROPDMG <- as.numeric (file_1$CROPDMG)
file_1$PropDmgExp <- as.numeric (file_1$PropDmgExp)

## Warning: NAs introduced by coercion

file_1$CropDmgExp <- as.numeric (file_1$CropDmgExp)

## Warning: NAs introduced by coercion

# Keep only rows whose multiplier is one of the four documented values
file_1_Prop_Filter <- subset (file_1, PropDmgExp %in% c (1E2, 1E3, 1E6, 1E9))
file_1_Crop_Filter <- subset (file_1, CropDmgExp %in% c (1E2, 1E3, 1E6, 1E9))

The numeric multipliers under “PropDmgExp” and “CropDmgExp” were then multiplied by the numeric values under “PROPDMG” and “CROPDMG”, respectively, to give the damage in dollars.

file_1_Prop_final <- mutate (file_1_Prop_Filter, Total_Damage = PROPDMG * PropDmgExp)
file_1_Crop_final <- mutate (file_1_Crop_Filter, Total_Damage = CROPDMG * CropDmgExp)
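
As a quick worked example of this multiplication, here is a sketch on three made-up rows, written in base R so it runs without dplyr. The figures are invented for illustration.

```r
# Three invented records: 25 thousand, 2.5 million and 1.5 billion dollars
toy <- data.frame (PROPDMG = c (25, 2.5, 1.5),
                   PropDmgExp = c (1E3, 1E6, 1E9))

# Damage in dollars = reported figure times its multiplier
toy$Total_Damage <- toy$PROPDMG * toy$PropDmgExp
toy
```

So a record of 25 with exponent “K” stands for 25,000 dollars, and 1.5 with exponent “B” stands for 1.5 billion dollars.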

Now, let’s determine which events have the greatest economic impact. The dollar amounts were summed for each type of event in both categories, property damage and crop damage.

prop_damage <- aggregate (Total_Damage ~ EVTYPE, file_1_Prop_final, sum)
prop_damage_sort <- prop_damage [order (prop_damage$Total_Damage, decreasing = TRUE), ]
prop_damage_final <- prop_damage_sort [1:5, ]
prop_damage_final

##                EVTYPE Total_Damage
## 62              FLOOD 144657709800
## 178 HURRICANE/TYPHOON  69305840000
## 332           TORNADO  56937160480
## 280       STORM SURGE  43323536000
## 50        FLASH FLOOD  16140811510

crop_damage <- aggregate (Total_Damage ~ EVTYPE, file_1_Crop_final, sum)
crop_damage_sort <- crop_damage [order (crop_damage$Total_Damage, decreasing = TRUE), ]
crop_damage_final <- crop_damage_sort [1:5, ]
crop_damage_final

##         EVTYPE Total_Damage
## 16     DROUGHT  13972566000
## 34       FLOOD   5661968450
## 97 RIVER FLOOD   5029459000
## 84   ICE STORM   5022113500
## 52        HAIL   3025954450

Lastly, a panel plot of two barplots was created for the top 5 event types in both categories.

par (mfcol = c (2, 1))
# Titles give 1950-2011 because file_1 was built from the unfiltered data
barplot (height = prop_damage_final$Total_Damage, names.arg = prop_damage_final$EVTYPE,
         xlab = "Event Type", ylab = "Total Damage", col = "darkorange",
         main = "Total Property Damage (USD) for Diff. Event Types (1950-2011)",
         ylim = c (0, 145E9), cex.names = 0.75, cex.axis = 0.75)

barplot (height = crop_damage_final$Total_Damage, names.arg = crop_damage_final$EVTYPE,
         xlab = "Event Type", ylab = "Total Damage", col = "darkgreen",
         main = "Total Crop Damage (USD) for Diff. Event Types (1950-2011)",
         ylim = c (0, 14E9), cex.names = 0.75, cex.axis = 0.75)
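
One thing to keep in mind with par (mfcol = ...) is that the layout stays in effect for later plots on the same device. A minimal sketch of saving and restoring the previous settings is shown below. The bar heights are made up, and the pdf (NULL) device is only there so the sketch runs without opening a window.

```r
# Throwaway graphics device, so nothing is written to disk or screen
pdf (NULL)

# par () returns the old settings, which we keep for later
old_par <- par (mfcol = c (2, 1))
barplot (c (a = 3, b = 1), main = "Top panel")
barplot (c (c = 2, d = 5), main = "Bottom panel")

# Put the layout back to what it was before
par (old_par)
restored <- par ("mfcol")

dev.off ()
```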

Economic Damage Analysis & Results

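Flood resulted in the most property damage across the United States, at a total of almost $145 billion. That is more than double the total of the runner-up, hurricane/typhoon. Tornado came in third, followed by storm surge and flash flood. Because the economic damage code starts from the unfiltered data, these totals cover every year in the database (1950-2011) rather than only 1992-2011.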

For crop damage, drought came in first at almost $14 billion. Flood and river flood came next, followed by ice storm. Hail rounded out the top 5.

Stephen Lee

January 30, 2016