In this report we identify the weather events that are most harmful to population health and those that have the greatest economic consequences. To investigate these issues, we obtained data from the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database for the years 1950 through 2011. From these data, we found that tornados and heat waves cause the most injuries and fatalities, respectively, while thunderstorms and drought cause the most property and crop damages, respectively. We therefore focused the analysis on four events: tornados, heat waves, thunderstorms and drought.
From the NOAA storm database we obtained data on weather events that are monitored across the U.S. We obtained the files for the years 1950 through 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
We used the R.utils package to unzip the data file, then the read.csv() function to read in the data.
# setwd('ReproducibleAnalysis/Projects/Project2')
# Unzip the data set. First load the R.utils library
library(R.utils)
# bunzip2('repdata-data-StormData.csv.bz2', 'repdata-data-StormData.csv', remove = FALSE)
# Read the data
storms <- read.csv("repdata-data-StormData.csv")
# stms <- subset(storms, FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 |
# CROPDMG > 0)
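As a side note, base R can also read a bzip2-compressed CSV through a bzfile() connection, so the separate unzip step is optional. The sketch below is a minimal illustration on a small toy file written to a temporary location. The toy values and the names toy, tmp and toy_read are hypothetical and are not part of the NOAA data or the analysis above.

```r
# Toy stand-in for the storm file (hypothetical values, not NOAA data)
toy <- data.frame(EVTYPE = c("TORNADO", "HEAT"), INJURIES = c(15, 0))

# Write the toy data to a bzip2-compressed CSV in a temporary location
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read it back through a bzfile() connection, without unzipping to disk
toy_read <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
print(toy_read)
```

The same pattern should apply to the real file name used above, although reading roughly 900,000 rows through a compressed connection can take a while.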
After reading in the data, we checked its dimensions and the first few rows. There are 902297 observations and 37 variables in this data set.
dim(storms)
## [1] 902297 37
head(storms[, 1:8])
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE
## 1 TORNADO
## 2 TORNADO
## 3 TORNADO
## 4 TORNADO
## 5 TORNADO
## 6 TORNADO
str(storms)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "10/10/1954 0:00:00",..: 6523 6523 4213 11116 1426 1426 1462 2873 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "000","0000","00:00:00 AM",..: 212 257 2645 1563 2524 3126 122 1563 3126 3126 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels "","E","Eas","EE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","?","(01R)AFB GNRY RNG AL",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","10/10/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels "","?","0000",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","(0E4)PAYSON ARPT",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels "","2","43","9V9",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels ""," "," "," ",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The events that caused the maximum number of fatalities and injuries in a single record are heat waves and tornados, respectively.
levels(factor(storms$EVTYPE[which.max(storms$FATALITIES)]))
## [1] "HEAT"
levels(factor(storms$EVTYPE[which.max(storms$INJURIES)]))
## [1] "TORNADO"
The events that have the greatest economic consequences are thunderstorms and drought, based on the largest single recorded PROPDMG and CROPDMG values. Thunderstorms cause the greatest property damages, and drought causes the greatest crop damages.
levels(factor(storms$EVTYPE[which.max(storms$PROPDMG)]))
## [1] "THUNDERSTORM WIND"
levels(factor(storms$EVTYPE[which.max(storms$CROPDMG)]))
## [1] "DROUGHT"
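The which.max() calls above pick out the single record with the largest value in a column. A complementary view is to sum each column over all records of an event type before ranking. The following is a minimal sketch on toy data. The values and the names toy, single_worst, totals and total_worst are hypothetical, and the result says nothing about the real storm data.

```r
# Toy records (hypothetical values, not NOAA data)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HEAT", "HEAT", "FLOOD"),
  FATALITIES = c(1, 2, 10, 0, 4),
  INJURIES   = c(50, 60, 20, 5, 90),
  stringsAsFactors = FALSE
)

# Event type of the single record with the most injuries
single_worst <- toy$EVTYPE[which.max(toy$INJURIES)]

# Total injuries per event type, then the event type with the largest total
totals <- tapply(toy$INJURIES, toy$EVTYPE, sum)
total_worst <- names(which.max(totals))

print(single_worst)
print(sort(totals, decreasing = TRUE))
print(total_worst)
```

In this toy example the two rankings disagree, which is why it can be worth checking both when deciding which event types to study.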
In the data set, the event type (EVTYPE) is a factor with 985 levels. A tornado record could carry any level that includes the string “TORN”. For this reason, we converted the event types to uppercase and used the grep() function to extract the matching strings. We used the lubridate package to manipulate dates in the data set. We plotted the number of tornados per year from 1950 to 2011. The number of recorded tornados increases over the years, from about 250 in 1950 to about 2200 in 2011. Although there is a clear increasing trend in the number of tornados, there is no clear trend in the number of injuries from tornados. Wichita County in Texas experienced the maximum number of injuries caused by a tornado, on April 10, 1979. An estimated 1700 people were injured.
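# Convert event types to uppercase, then collect the labels containing "TORN"
tUpper <- toupper(storms$EVTYPE)
t <- grep("TORN+", tUpper, value = TRUE, perl = TRUE)
# Keep the records whose event type matches one of those labels
TornadoStms <- subset(storms, storms$EVTYPE %in% t)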
date <- TornadoStms$BGN_DATE
library(lubridate)
date <- mdy_hms(date)
TornadoStms$Year <- year(date)
# Count the number of tornados each year
TornadosCnt <- table(TornadoStms$Year)
TornadoStms$STATE[which.max(TornadoStms$INJURIES)]
## [1] TX
## 72 Levels: AK AL AM AN AR AS AZ CA CO CT DC DE FL GA GM GU HI IA ID ... XX
TornadoStms$COUNTYNAME[which.max(TornadoStms$INJURIES)]
## [1] WICHITA
## 29601 Levels: 5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI ... ZIEBACH AND HAAKON
TornadoStms$BGN_DATE[which.max(TornadoStms$INJURIES)]
## [1] 4/10/1979 0:00:00
## 16335 Levels: 10/10/1954 0:00:00 10/10/1958 0:00:00 ... 9/9/2011 0:00:00
TornadoStms$INJURIES[which.max(TornadoStms$INJURIES)]
## [1] 1700
# Or
max(TornadoStms$INJURIES)
## [1] 1700
# Using ggplot, first convert the table into a data frame
# df <- as.data.frame(TornadosCnt)
# Rename the first column
# names(df)[1] <- 'Year'
# library(ggplot2)
# p <- ggplot(df, aes(as.numeric(as.character(Year)), Freq))
# p + geom_line(color = 'blue') + ylab('Number of Tornados') + xlab('') +
#   ggtitle('Number of Tornados over the Year')
par(mfrow = c(2, 1))
# Plot Number of tornados each year
plot(TornadosCnt, type = "l", col = "blue", ylab = "Number of Tornados", xlab = "",
main = "Number of Tornados over the Years", tck = 1)
# Number of Injuries over the years caused by tornados
injuries = tapply(TornadoStms$INJURIES, TornadoStms$Year, sum)
plot(as.numeric(levels(as.factor(TornadoStms$Year))), injuries, type = "l",
tck = 1, xlab = "", ylab = "Number of Injuries", main = "Number of Injuries caused by Tornados each year",
col = "blue")
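The tornado steps above (match the event type, extract the year, then count and sum by year) can be sketched on a few toy rows. This sketch uses grepl() with ignore.case = TRUE to build a logical row filter, which also keeps mixed-case labels, and it uses base R date parsing in place of lubridate so that it runs without extra packages. Both choices are assumptions made for illustration, and all values and names in the block are hypothetical.

```r
# Toy records (hypothetical values, not NOAA data)
toy <- data.frame(
  EVTYPE   = c("TORNADO", "Tornado F1", "HAIL", "WATERSPOUT/TORNADO", "HEAT"),
  BGN_DATE = c("4/18/1950 0:00:00", "6/8/1951 0:00:00", "6/8/1951 0:00:00",
               "11/15/1951 0:00:00", "7/12/1995 0:00:00"),
  INJURIES = c(15, 2, 0, 6, 3),
  stringsAsFactors = FALSE
)

# Logical filter: TRUE where the event type contains "TORN" in any case
is_tornado <- grepl("TORN", toy$EVTYPE, ignore.case = TRUE)
tornado <- toy[is_tornado, ]

# Parse the month/day/year part of the date and extract the year
parsed <- as.Date(tornado$BGN_DATE, format = "%m/%d/%Y")
tornado$Year <- as.integer(format(parsed, "%Y"))

# Number of tornado records and total injuries per year
count_by_year <- table(tornado$Year)
injuries_by_year <- tapply(tornado$INJURIES, tornado$Year, sum)

print(count_by_year)
print(injuries_by_year)
```

Filtering rows directly with the logical vector avoids comparing uppercased labels against the original labels, so a record labelled "Tornado F1" is retained.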
A heat wave record could carry any event type that includes the string “HEAT”. For this reason, we used the grep() function on the uppercased event types to extract these strings. We plotted the number of heat waves per year from 1993 to 2011. The number of recorded heat waves increases over the years, from about 10 in 1993 to about 410 in 2011. On the other hand, the yearly number of fatalities decreases on average. The maximum number of fatalities was caused by a heat wave in Illinois in the second week of July 1995. An estimated 583 people died during that week.
h <- grep("HEAT+", tUpper, value = TRUE, perl = TRUE)
heatSevere <- subset(storms, storms$EVTYPE %in% h)
HDate <- heatSevere$BGN_DATE
HDate <- mdy_hms(HDate)
heatSevere$Year <- year(HDate)
HeatWaveCnt <- table(heatSevere$Year)
heatSevere$STATE[which.max(heatSevere$FATALITIES)]
## [1] IL
## 72 Levels: AK AL AM AN AR AS AZ CA CO CT DC DE FL GA GM GU HI IA ID ... XX
heatSevere$COUNTYNAME[which.max(heatSevere$FATALITIES)]
## [1] ILZ003>006 - 008 - 010>014 - 019>023 - 032 - 033 - 039
## 29601 Levels: 5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI ... ZIEBACH AND HAAKON
heatSevere$BGN_DATE[which.max(heatSevere$FATALITIES)]
## [1] 7/12/1995 0:00:00
## 16335 Levels: 10/10/1954 0:00:00 10/10/1958 0:00:00 ... 9/9/2011 0:00:00
heatSevere$FATALITIES[which.max(heatSevere$FATALITIES)]
## [1] 583
# Or
max(heatSevere$FATALITIES)
## [1] 583
par(mfrow = c(2, 1))
plot(HeatWaveCnt, type = "l", col = "red", ylab = "Number of Heat Waves", xlab = "",
main = "Number of Heat Waves Over the Years", tck = 1)
# Number of fatalities caused by Heat
fatalities <- tapply(heatSevere$FATALITIES, heatSevere$Year, sum)
plot(as.numeric(levels(as.factor(heatSevere$Year))), fatalities, type = "l",
tck = 1, xlab = "", ylab = "Number of Fatalities", main = "Number of Fatalities caused by Heat each year",
col = "red")
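The repeated which.max() lookups above can also be collapsed into a single row extraction that returns the state, date and count together. The sketch below is a minimal illustration. The first row reuses the Illinois figures reported above, while the other rows and the names toy and worst are hypothetical.

```r
# Toy heat records: row 1 mirrors the Illinois event reported above,
# the remaining rows are hypothetical
toy <- data.frame(
  STATE      = c("IL", "PA", "TX"),
  BGN_DATE   = c("7/12/1995 0:00:00", "7/1/1999 0:00:00", "8/5/2000 0:00:00"),
  FATALITIES = c(583, 42, 10),
  stringsAsFactors = FALSE
)

# One lookup: the row with the most fatalities, restricted to the fields of interest
worst <- toy[which.max(toy$FATALITIES), c("STATE", "BGN_DATE", "FATALITIES")]
print(worst)
```

Note that which.max() returns only the first maximum, so ties would need separate handling.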
A thunderstorm record could carry any event type that includes the string “THUN”. For this reason, we used the grep() function on the uppercased event types to extract these strings. We plotted the yearly cost of property damages caused by thunderstorms from 1993 to 2005. There is significant variability in the cost between years. Thunderstorms caused the greatest property damages in Franklin County, North Carolina on July 26, 2009. The property damages were estimated to be about $500 million.
tH <- grep("THUN+", tUpper, value = TRUE, perl = TRUE)
thunder <- subset(storms, storms$EVTYPE %in% tH)
tHDate <- thunder$BGN_DATE
tHDate <- mdy_hms(tHDate)
thunder$Year <- year(tHDate)
propDmgCost <- tapply(thunder$PROPDMG, thunder$Year, sum)
thunder$STATE[which.max(thunder$PROPDMG)]
## [1] NC
## 72 Levels: AK AL AM AN AR AS AZ CA CO CT DC DE FL GA GM GU HI IA ID ... XX
thunder$COUNTYNAME[which.max(thunder$PROPDMG)]
## [1] FRANKLIN
## 29601 Levels: 5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI ... ZIEBACH AND HAAKON
thunder$BGN_DATE[which.max(thunder$PROPDMG)]
## [1] 7/26/2009 0:00:00
## 16335 Levels: 10/10/1954 0:00:00 10/10/1958 0:00:00 ... 9/9/2011 0:00:00
thunder$PROPDMG[which.max(thunder$PROPDMG)]
## [1] 5000
A drought record could carry any event type that includes the string “DROUGHT”. For this reason, we used the grep() function on the uppercased event types to extract these strings. We plotted the yearly cost of crop damages caused by drought from 1993 to 2005. There is a somewhat increasing trend in the cost of crop damages over the years. Drought caused the greatest crop damages in Montana on May 1, 2004. The crop damages were estimated to be about $77.5 million.
# Drought
d <- grep("DROUGHT+", tUpper, value = TRUE, perl = TRUE)
drought <- subset(storms, storms$EVTYPE %in% d)
dDate <- drought$BGN_DATE
dDate <- mdy_hms(dDate)
drought$Year <- year(dDate)
cropDmgCost <- tapply(drought$CROPDMG, drought$Year, sum)
drought$STATE[which.max(drought$CROPDMG)]
## [1] MT
## 72 Levels: AK AL AM AN AR AS AZ CA CO CT DC DE FL GA GM GU HI IA ID ... XX
drought$COUNTYNAME[which.max(drought$CROPDMG)]
## [1] MTZ024>025 - 062
## 29601 Levels: 5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI ... ZIEBACH AND HAAKON
drought$BGN_DATE[which.max(drought$CROPDMG)]
## [1] 5/1/2004 0:00:00
## 16335 Levels: 10/10/1954 0:00:00 10/10/1958 0:00:00 ... 9/9/2011 0:00:00
# Note: this line reports the largest PROPDMG value among drought records, not CROPDMG
drought$PROPDMG[which.max(drought$PROPDMG)]
## [1] 775
par(mfrow = c(2, 1))
plot(as.numeric(levels(factor(thunder$Year))), propDmgCost, type = "l", col = "orange",
ylab = "Cost of Property Damage, $millions", xlab = "", main = "Cost of Property Damages Over the Years",
tck = 1)
plot(as.numeric(levels(factor(drought$Year))), cropDmgCost, type = "l", col = "brown",
ylab = "Cost of Crop Damages, $millions", xlab = "", main = "Cost of Crop Damages Over the Years",
tck = 1)
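The yearly costs above are sums of the raw PROPDMG and CROPDMG columns. In the storm data, the magnitude of each amount is carried by the companion columns PROPDMGEXP and CROPDMGEXP, where K, M and B stand for thousands, millions and billions. The sketch below shows one way to convert to dollars before summing by year. Treating every other code (blank, digits, symbols) as a multiplier of 1 is our assumption, and the toy values and the names toy, multiplier, mult, PropDollars and yearly_millions are hypothetical.

```r
# Toy records (hypothetical values, not NOAA data)
toy <- data.frame(
  Year       = c(2009, 2009, 2010, 2010),
  PROPDMG    = c(5000, 2.5, 1.2, 3),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)

# Multipliers for the documented letter codes
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up each record's multiplier; unmatched codes come back as NA
mult <- unname(multiplier[toupper(toy$PROPDMGEXP)])
mult[is.na(mult)] <- 1  # assumption: unknown or blank codes mean "no scaling"

# Damage in dollars, then yearly totals in millions of dollars
toy$PropDollars <- toy$PROPDMG * mult
yearly_millions <- tapply(toy$PropDollars, toy$Year, sum) / 1e6
print(yearly_millions)
```

The same lookup could be applied to CROPDMG and CROPDMGEXP. How the undocumented codes are handled can change the totals, so that choice should be stated alongside any result.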
The two tables below summarize, for each state, the maximum number of fatalities and injuries caused by heat and tornados, and the average cost per recorded event of property and crop damages caused by thunderstorms and drought. The state.abb data set in R was used to match and keep only the states. We used the melt() function in the reshape2 package to convert the arrays returned by tapply() into data frames.
# Maximum number of fatalities and injuries caused by heat and tornados in each state
fH <- tapply(heatSevere$FATALITIES, heatSevere$STATE, max)
fT <- tapply(TornadoStms$FATALITIES, TornadoStms$STATE, max)
iH <- tapply(heatSevere$INJURIES, heatSevere$STATE, max)
iT <- tapply(TornadoStms$INJURIES, TornadoStms$STATE, max)
# Reshape
library(reshape2)
fH1 <- melt(fH)
fT1 <- melt(fT)
iH1 <- melt(iH)
iT1 <- melt(iT)
# Data frame
df <- data.frame(fH1, fT1[2], iH1[2], iT1[2])
names(df) <- c("State", "Fatalities By Heat", "Fatalities by Tornados", "Injuries By Heat",
"Injuries By Tornados")
df1 <- subset(df, df$State %in% state.abb)
# Mean property damage (thunderstorms) and crop damage (drought) in each state
cP <- tapply(thunder$PROPDMG, thunder$STATE, mean)
cC <- tapply(drought$CROPDMG, drought$STATE, mean)
# Reshape
library(reshape2)
cP1 <- melt(cP)
cC1 <- melt(cC)
names(cP1) = c("State", "Property Damages")
names(cC1) = c("State", "Crop Damages")
cP1 <- subset(cP1, cP1$State %in% state.abb)
cC1 <- subset(cC1, cC1$State %in% state.abb)
# Data frame
dataFrame <- data.frame(cP1, cC1[2])
names(dataFrame) = c("State", "Property Damage, $millions", "Crop Damages, $millions")
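The per-state summaries above place the tapply() results side by side, which relies on every result listing the states in the same order. A minimal base R sketch of the same idea, using merge() to align the columns by state explicitly, is given below. It uses base R in place of reshape2 so that it runs without extra packages, and the toy values and the names heat, tornado, fH1, fT1 and summary_tbl are hypothetical.

```r
# Toy records (hypothetical values, not NOAA data); "PR" is not in state.abb
heat <- data.frame(STATE = c("IL", "IL", "TX", "PR"),
                   FATALITIES = c(583, 2, 10, 1),
                   stringsAsFactors = FALSE)
tornado <- data.frame(STATE = c("TX", "IL", "AL", "PR"),
                      FATALITIES = c(42, 5, 20, 0),
                      stringsAsFactors = FALSE)

# Maximum fatalities per state for each event type
fH <- tapply(heat$FATALITIES, heat$STATE, max)
fT <- tapply(tornado$FATALITIES, tornado$STATE, max)

# Convert the named arrays into data frames
fH1 <- data.frame(State = names(fH), HeatFatalities = as.vector(fH),
                  stringsAsFactors = FALSE)
fT1 <- data.frame(State = names(fT), TornadoFatalities = as.vector(fT),
                  stringsAsFactors = FALSE)

# Align by state (keeping states present in either table), then keep the 50 states
summary_tbl <- merge(fH1, fT1, by = "State", all = TRUE)
summary_tbl <- summary_tbl[summary_tbl$State %in% state.abb, ]
print(summary_tbl)
```

States with no record for one event type appear as NA in that column.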