In this brief study, we undertake an analysis of the NOAA Storm Data in an attempt to answer the following questions:
1. Across the United States, which types of events are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
Our overall hypothesis is that certain weather events harm the health of the population of the United States and also have an adverse impact on its economy. We obtained the weather data used in this study from the NOAA Storm Database.
1. The health impact analysis indicates that tornadoes are the weather events that have had the greatest impact on the population of the United States, with excessive heat as the second leading cause of harm. One possible anomaly needs explaining. While Excessive Heat seems to have a greater fatality rate than Tornado, when injuries and fatalities are taken in the aggregate Tornado has a far greater health impact on the population. This may well be a source of confusion and contention between researchers as to the relative magnitude of the “Health Impact” of Tornado and Excessive Heat. We choose to look at the Health Impact in the aggregate.
2. The economic consequences analysis indicates that floods cause the greatest economic damage, and that extreme water events in general (floods, hurricanes/typhoons, and storm surges) cause significant economic damage.
It is worth noting that these conclusions are derived from totals summed over all recorded events of each type and may therefore be distorted by outlier events.
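The caveat about outliers can be illustrated with a small sketch. The data below are invented for illustration and are not taken from the NOAA file; the column name `DAMAGE` is an assumption. It compares the total per event type with the median and with the share of the total contributed by the single largest event.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  EVTYPE = c(rep("Flood", 4), rep("Hail", 4)),
  DAMAGE = c(1e4, 2e4, 1e4, 5e9, 3e5, 4e5, 2e5, 3e5),
  stringsAsFactors = FALSE
)

# For each event type: total, median, and share of the total due to the largest event.
stats <- sapply(split(toy$DAMAGE, toy$EVTYPE), function(x) {
  c(total = sum(x), median = median(x), max_share = max(x) / sum(x))
})

# One row per event type.
print(t(stats))
```

In this toy case Flood leads on total damage only because of one extreme event, while Hail has the larger median. A similar check on the real data would show how much a ranking depends on a handful of records.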
In this step we load the libraries and then download the data from the identified URL. We then read in the CSV file and extract the pertinent variables.
## Housekeeping
proc_date <- date()
library(markdown)
library(Hmisc)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: splines
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
##
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
library(Rcmdr)
## Loading required package: RcmdrMisc
## Loading required package: car
## Loading required package: sandwich
## The Commander GUI is launched only in interactive sessions
library(ggplot2)
library(reshape2)
# Create the 'strmdata' directory the first time the data is downloaded.
if (!file.exists("strmdata")) {
    dir.create("strmdata")
}
# Download the compressed file into 'strmdata' unless it is already there.
if (!file.exists("strmdata/repdata-data-StormData.csv.bz2")) {
    fileURL <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
    download.file(fileURL, destfile = "./strmdata/repdata-data-StormData.csv.bz2")
}
## Decompress and read the dataset into the storm_dat table
storm_dat <- read.csv(bzfile("./strmdata/repdata-data-StormData.csv.bz2"), stringsAsFactors = FALSE)
## Normalize the case of EVTYPE so that variants such as "TORNADO" and "Tornado" are grouped together
storm_dat$EVTYPE <- capitalize(tolower(storm_dat$EVTYPE))
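The normalization step above only changes letter case. As a minimal sketch, assuming one also wants to remove stray whitespace, a base-R helper along the following lines could be used. The name `normalize_evtype` is hypothetical and is not part of the original analysis.

```r
# Hypothetical helper: lower-case, trim, collapse repeated spaces,
# then capitalize the first letter of each label.
normalize_evtype <- function(x) {
  x <- tolower(trimws(x))
  x <- gsub("[[:space:]]+", " ", x)
  paste0(toupper(substr(x, 1, 1)), substr(x, 2, nchar(x)))
}

# Invented examples of inconsistent labels.
raw <- c("TORNADO", "  tornado", "Excessive  Heat", "FLOOD ")
clean <- normalize_evtype(raw)
print(clean)
```

This does not merge labels that differ in spelling (for example abbreviations), which would need a separate mapping.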
After reading in the data set we review a few of its attributes. We see that the data set has 902,297 rows and 37 columns. We then review the structure of the data.
dim(storm_dat)
## [1] 902297 37
head(storm_dat[,1:8])
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE
## 1 Tornado
## 2 Tornado
## 3 Tornado
## 4 Tornado
## 5 Tornado
## 6 Tornado
str(storm_dat)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "Tornado" "Tornado" "Tornado" "Tornado" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
In this section we present our results with supporting graphics.
To assess the population health impact we examine FATALITIES and INJURIES, which together we treat as casualties. First, we total both variables by event type. Next, we sort the totals (by fatalities, then by injuries), keep the top ten event types, and plot them as a stacked bar chart.
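## Total fatalities and injuries per event type
damages <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, storm_dat, sum)
## Keep the ten event types with the most fatalities (ties broken by injuries), in long format for plotting
hum_dam <- melt(head(damages[order(-damages$FATALITIES, -damages$INJURIES), ], 10), id.vars = "EVTYPE")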
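Because the code above sorts on fatalities first, the top-ten selection can differ from a ranking on combined casualties. The sketch below uses invented toy data, not NOAA values, and a hypothetical `CASUALTIES` column to show how the two orderings can disagree.

```r
# Toy data, invented for illustration only.
toy <- data.frame(
  EVTYPE     = c("Tornado", "Tornado", "Excessive heat", "Flood", "Excessive heat"),
  FATALITIES = c(1, 0, 5, 2, 4),
  INJURIES   = c(40, 25, 3, 1, 2),
  stringsAsFactors = FALSE
)

# Total both measures per event type, then add a combined column (assumed name).
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, toy, sum)
totals$CASUALTIES <- totals$FATALITIES + totals$INJURIES

# Two possible rankings of the same totals.
by_fatalities <- totals[order(-totals$FATALITIES), "EVTYPE"]
by_casualties <- totals[order(-totals$CASUALTIES), "EVTYPE"]

print(by_fatalities)
print(by_casualties)
```

Which ordering is appropriate depends on how "health impact" is defined. Ranking on the combined column matches the aggregate view described in the text more directly.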
The bar chart below supports the inference that the event type Tornado poses the greatest threat to human health (as an aggregation of fatalities and injuries) in the population of the United States.
## Graph of human impacts
ggplot(hum_dam, aes(x = EVTYPE, y = value, fill = variable)) +
    geom_bar(stat = "identity") +
    coord_flip() + ggtitle("Harmful events") + labs(x = "", y = "number of people impacted") +
    scale_fill_manual(values = c("red", "orange"), labels = c("Deaths", "Injuries"))
To estimate the ten event types with the greatest economic impact we use the same approach as above, applied to the property damage (PROPDMG) and crop damage (CROPDMG) variables. Each amount is first multiplied by the magnitude coded in its exponent column (PROPDMGEXP or CROPDMGEXP), so in the supporting graphic these variables are expressed in dollars.
## Economic impact analysis
storm_dat$PROPDMG <- storm_dat$PROPDMG * as.numeric(Recode(storm_dat$PROPDMGEXP, "'0'=1;'1'=10;'2'=100;'3'=1000;'4'=10000;'5'=100000;'6'=1000000;'7'=10000000;'8'=100000000;'B'=1000000000;'h'=100;'H'=100;'K'=1000;'m'=1000000;'M'=1000000;'-'=0;'?'=0;'+'=0", as.factor.result = FALSE))
storm_dat$CROPDMG <- storm_dat$CROPDMG * as.numeric(Recode(storm_dat$CROPDMGEXP, "'0'=1;'2'=100;'B'=1000000000;'k'=1000;'K'=1000;'m'=1000000;'M'=1000000;''=0;'?'=0", as.factor.result = FALSE))
ecofact <- aggregate(cbind(PROPDMG, CROPDMG) ~ EVTYPE, storm_dat, sum)
eco_dam <- melt(head(ecofact[order(-ecofact$PROPDMG, -ecofact$CROPDMG), ], 10))
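The exponent conversion can also be sketched with a small lookup function in base R. This is an illustration under stated assumptions rather than the method used above: the function name `exp_multiplier` is hypothetical, a digit code `d` is taken to mean `10^d`, and every other code, including the empty string, is treated as a multiplier of 0. Mapping the empty code explicitly matters because a code left unconverted would turn into `NA` under `as.numeric()`.

```r
# Hypothetical helper: convert exponent codes to numeric multipliers.
# Assumptions: H/K/M/B are hundreds, thousands, millions, billions
# (case-insensitive); a single digit d means 10^d; anything else,
# including "", "?", "+" and "-", is treated as 0.
exp_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(code)])
  digit <- grepl("^[0-9]$", code)
  mult[digit] <- 10^as.numeric(code[digit])
  mult[is.na(mult)] <- 0
  mult
}

# Invented amounts and codes for illustration.
amount <- c(25, 2.5, 1, 3, 7)
code   <- c("K", "m", "B", "", "2")
dollars <- amount * exp_multiplier(code)
print(dollars)
```

Treating unknown codes as 0 drops those amounts from the totals; treating them as 1 would keep the raw amounts instead. Either choice is a judgment call that should be stated alongside the results.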
The bar chart below supports the inference that the event type Flood has had the greatest economic impact on the United States.
ggplot(eco_dam, aes(x = EVTYPE, y = value, fill = variable)) +
    geom_bar(stat = "identity") +
    coord_flip() + ggtitle("Economic consequences") + labs(x = "", y = "cost of damages in dollars") +
    scale_fill_manual(values = c("orange", "red"), labels = c("Property Damage", "Crop Damage"))