The NOAA storm database is analyzed to identify the weather events that cause the most harm to human health and to the economy. The events most harmful to human health are identified as the five event types with the greatest number of fatalities and the five with the greatest number of injuries. The events that cause the greatest damage to the economy are identified by ranking the events that cause more than $10 billion in combined crop and property damage. This document contains all code required to read in the data, process and analyze the data, and produce the tables and figures.
The Storm Data FAQ page was consulted to understand the variables and to guide both the processing and the analysis.
The data are downloaded and saved under a shorter file name to make them easier to read in.
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","StormData.csv.bz2", mode="wb") # binary mode keeps the bz2 archive intact on Windows
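Because the archive is large, it can help to skip the download when the file is already on disk. The following is a minimal sketch, not part of the original analysis; the helper name fetch_once is hypothetical, while file.exists() and the mode argument of download.file() are base R.

```r
# Hypothetical helper (not in the original report): download a file only
# when it is not already present locally.
fetch_once <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" writes the compressed archive in binary mode
    download.file(url, destfile, mode = "wb")
  }
  invisible(destfile)
}
```

Calling fetch_once() with the URL and file name used above would leave an existing StormData.csv.bz2 untouched and download it otherwise.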
The file is read into stormdata.
stormdata <- read.csv(file.path("StormData.csv.bz2"))
The names and structure of the variables are below.
names(stormdata)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
str(stormdata)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Many of the variables will not be used. EVTYPE, FATALITIES and INJURIES can be used as they are. Unfortunately, damages are divided across four variables: PROPDMG, CROPDMG, PROPDMGEXP and CROPDMGEXP. The last two, ending in “EXP”, are factors that indicate ones, tens, hundreds, thousands, millions or billions, or something else, as a multiplier for the damages. It will be necessary to reduce the levels of the variables ending in “EXP” to a unique list of interpretable values.
unique(stormdata$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
unique(stormdata$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
rlist <- list(ONE=c(""," "), H=c("H","h"), K=c("k","K"), M=c("m","M"), B=c("b","B")) #Other values will be changed to NA
levels(stormdata$PROPDMGEXP) <- rlist
levels(stormdata$CROPDMGEXP) <- rlist
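The effect of assigning a named list to levels() is easiest to see on a small factor. The sketch below uses a made-up factor exp_codes (not the storm data) and a shortened mapping; each list name becomes a new level, the listed old levels are merged into it, and any level not listed becomes NA.

```r
# Toy factor standing in for PROPDMGEXP / CROPDMGEXP (values are illustrative).
exp_codes <- factor(c("K", "k", "M", "", "?", "5", "B"))

# Map several raw levels onto one new level; unlisted levels become NA.
recode <- list(ONE = "", K = c("k", "K"), M = c("m", "M"), B = c("b", "B"))
levels(exp_codes) <- recode

print(exp_codes)
print(table(exp_codes, useNA = "ifany"))
```

Here "?" and "5" are not in the mapping, so those two entries end up as NA.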
After reducing the levels, we can create two new variables, TOTALCROP and TOTALPROP, for the total crop damage and total property damage. Each row of PROPDMG and CROPDMG is multiplied by the value in column 2 of conv whose column 1 entry matches that row's PROPDMGEXP or CROPDMGEXP. A total is left at zero if there is an NA in the corresponding variable ending in “EXP”.
conv <- data.frame(c("ONE","H","K","M","B"),
c(1,100,1000,1000000,1000000000),
stringsAsFactors = FALSE) # keep the codes as plain character strings
stormdata$TOTALCROP <- 0
stormdata$TOTALPROP <- 0
for (n in 1:nrow(conv)) {
locs<-with(stormdata,which(CROPDMGEXP==conv[n,1]))
if (length(locs)!=0) {
stormdata$TOTALCROP[locs] <- conv[n,2]*stormdata$CROPDMG[locs]
}
locs<-with(stormdata,which(PROPDMGEXP==conv[n,1]))
if (length(locs)!=0) {
stormdata$TOTALPROP[locs] <- conv[n,2]*stormdata$PROPDMG[locs]
}
}
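The same multiplication can also be written without a loop by indexing a named vector of multipliers. This is a sketch of an alternative, not the method used in this report; the data frame toy and the vector multiplier are made up for illustration and assume the “EXP” codes have already been reduced to ONE, H, K, M and B.

```r
# Toy data standing in for stormdata (values are illustrative only).
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 3, 7),
  PROPDMGEXP = c("K", "M", "B", "ONE", NA),
  stringsAsFactors = FALSE
)

# Named multipliers: the names mirror the recoded "EXP" levels.
multiplier <- c(ONE = 1, H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Look up each row's multiplier by name; as.character() makes sure a factor
# is indexed by its labels rather than its integer codes.
toy$TOTALPROP <- toy$PROPDMG * unname(multiplier[as.character(toy$PROPDMGEXP)])

# Rows with an uninterpretable (NA) exponent get zero damage, as in the loop.
toy$TOTALPROP[is.na(toy$TOTALPROP)] <- 0

print(toy)
```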
Next, we check what fraction of rows have an NA in either variable ending in “EXP”. Since only a few rows are affected, their totals are set to zero.
mean(is.na(stormdata$CROPDMGEXP) | is.na(stormdata$PROPDMGEXP))
stormdata$TOTALCROP[which(is.na(stormdata$CROPDMGEXP))] <- 0
stormdata$TOTALPROP[which(is.na(stormdata$PROPDMGEXP))] <- 0
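Missing values need is.na() rather than a comparison with NA, because in R any == NA comparison itself evaluates to NA. A minimal sketch on a made-up vector x (not part of the storm data):

```r
x <- c(10, NA, 0, 5)

# Comparing with == NA never yields TRUE or FALSE, only NA.
eq_na <- x == NA
print(eq_na)
print(mean(eq_na))

# is.na() returns a proper logical vector, so its mean is the share of NAs.
na_share <- mean(is.na(x))
print(na_share)
```

One of the four values is missing, so na_share is 0.25, while the mean of the == NA comparison is NA.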
The results will address the following questions:
1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequence?
Question 1 will be answered first by aggregating the total number of injuries and fatalities by event type.
fatal <- aggregate(FATALITIES~EVTYPE,stormdata,sum)
injury <- aggregate(INJURIES~EVTYPE,stormdata,sum)
Order fatal by the total fatalities and retain the five most fatal event types.
fatal <-fatal[order(-fatal[,2],fatal[,1]),]
fatal <- fatal[1:5,]
print(fatal)
## EVTYPE FATALITIES
## 834 TORNADO 5633
## 130 EXCESSIVE HEAT 1903
## 153 FLASH FLOOD 978
## 275 HEAT 937
## 464 LIGHTNING 816
Table 1: Total fatalities for the five most fatal event types (EVTYPE)
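The aggregate, order and subset pattern used for Table 1 can be tried on a few made-up records. The sketch below assumes a toy data frame toy with invented event counts; it uses head() instead of row indexing, which avoids NA rows when fewer event types exist than the number requested.

```r
# Toy records standing in for stormdata (event names and counts are made up).
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT", "HAIL"),
  FATALITIES = c(3, 2, 4, 1, 2, 0)
)

# Sum fatalities per event type.
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Sort by descending total, breaking ties alphabetically, and keep the top 2.
top <- head(totals[order(-totals$FATALITIES, totals$EVTYPE), ], 2)
print(top)
```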
Order injury by the total injuries and retain the five most injurious event types.
injury <- injury[order(-injury[,2],injury[,1]),]
injury <- injury[1:5,]
print(injury)
## EVTYPE INJURIES
## 834 TORNADO 91346
## 856 TSTM WIND 6957
## 170 FLOOD 6789
## 130 EXCESSIVE HEAT 6525
## 464 LIGHTNING 5230
Table 2: Total injuries for the five most injurious event types (EVTYPE)
Now we can see which events cause the most harm to human health. Tornadoes cause the most harm to health in terms of both injuries and fatalities. The next most harmful event type depends on which criterion is selected.
To determine which events cause the greatest economic harm, we will consider only the total damages. We will aggregate total damages by event type and then keep only the totals that exceed $10 billion.
library(ggplot2)
dmgtot <- aggregate(TOTALPROP+TOTALCROP~EVTYPE,stormdata,
sum,na.action = na.omit)
dmgtot <- dmgtot[which(dmgtot[,2]>1E10),]
names(dmgtot) <- c("Event_Type","Total_Damage_USD")
ggplot(dmgtot,aes(x=reorder(Event_Type,-Total_Damage_USD),
y=Total_Damage_USD))+
geom_bar(stat="identity", fill="steelblue")+
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
xlab(NULL)
detach(package:ggplot2)
Figure 1: Bar plot showing total damages in US dollars (USD) for weather events with more than $10 billion in damages
Figure 1 shows that floods cause the greatest damage, totaling approximately $150 billion. Tornadoes, which cause the greatest harm to human health, are the third most expensive.
Below is the session info.
sessionInfo()
## R version 3.4.1 (2017-06-30)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8.1 x64 (build 9600)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.12 digest_0.6.12 rprojroot_1.2 plyr_1.8.4
## [5] grid_3.4.1 gtable_0.2.0 backports_1.1.0 magrittr_1.5
## [9] evaluate_0.10.1 scales_0.5.0 ggplot2_2.2.1 rlang_0.1.2
## [13] stringi_1.1.5 lazyeval_0.2.0 rmarkdown_1.6 labeling_0.3
## [17] tools_3.4.1 stringr_1.2.0 munsell_0.4.3 yaml_2.1.14
## [21] compiler_3.4.1 colorspace_1.3-2 htmltools_0.3.6 knitr_1.17
## [25] tibble_1.3.4