Storms and other severe weather events are a major contributor to both public health and economic problems in the US. Many of these events cause fatalities and injuries, and often millions of dollars in property damage. This is why preventing such outcomes, to the extent possible, is a key concern for many nations, including the United States.
This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major weather events in the United States, including when and where they occur, estimates of any fatalities, injuries, and property damage, as well as brief descriptions of each event. The database dates back to about 1950, and it should be noted that less information is available for the earlier years.
This report studies the effect of severe weather events on the number of fatalities, the number of injuries, the amount of property damage, and the amount of crop damage. Three plots were generated; they show the top 10 event types ranked by mean fatalities, mean property damage, and mean crop damage, respectively.
This project uses a single dataset, a comma-separated-value (CSV) file compressed with the bzip2 algorithm. The file can be downloaded from the course website. A very useful FAQ and some additional documentation are also available for reference if anything about the data is unclear.
The basic goal of this assignment is to explore the NOAA Storm Database and answer the following basic questions about severe weather events.
1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
The data was downloaded from the link listed above, and was then loaded into R using the following code.
setwd("~/Documents/GitHub/Repro_Research/RepData_PeerAssessment2/")
storm0 <- read.csv("repdata_data_StormData.csv", header = TRUE)
head(storm0)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
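Since the dataset is distributed as a bzip2-compressed CSV, it may help to know that read.csv() can read such a file directly, without decompressing it by hand first. The sketch below demonstrates this on a throwaway file; the temporary path and the two toy rows are illustrative assumptions and are not the NOAA data.

```r
# Write a tiny bzip2-compressed CSV to a temporary file (toy data, not the NOAA file)
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(0, 1))
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() detects the compression and decompresses while reading
toy_back <- read.csv(path, stringsAsFactors = FALSE)
```

The same call pattern should apply to the compressed storm file, with the path replaced by wherever that file was saved.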
Most of the variables in this dataset are not required for the analysis being run. Furthermore, the dataset is fairly large and takes up a lot of memory, so removing unnecessary variables will speed up any further analysis. This is done using the following code.
event <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
storm <- storm0[event]
head(storm)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
Running length(unique(storm$EVTYPE)) returns a value of 985, meaning that the dataset contains 985 distinct event type labels. Clearly, many of these are duplicates. Reporting was also not uniform when data was entered into this database. For instance, Unseasonably Warm and Dry, EXTREME HEAT, and EXCESSIVE HEAT are all treated as different event types. This is not an isolated case; such synonymous labels occur throughout the dataset. To remedy this, the grepl() function is used to find common words in the event types and group the matching labels into a single event type. The code is shown below.
for (i in 1:length(storm$EVTYPE)){
if (grepl("HEAT", storm$EVTYPE[i])){
storm$EVTYPE[i] <- "HEAT"
} else if (grepl("TORNADO", storm$EVTYPE[i])){
storm$EVTYPE[i] <- "TORNADO"
} else if (grepl("WARM", storm$EVTYPE[i])){
storm$EVTYPE[i] <- "HEAT"
} else {
next
}
}
for (i in 1:length(storm$EVTYPE)){
if (grepl("COLD", storm$EVTYPE[i])){
storm$EVTYPE[i] <- "COLD"
} else {
next
}
}
for (i in 1:length(storm$EVTYPE)){
if (grepl("FLOOD", storm$EVTYPE[i])){
storm$EVTYPE[i] <- "FLOOD"
} else {
next
}
}
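The code is split into several for-loops because it takes a very long time to run (about 15 minutes). It regroups the upper-case labels as intended. Note, however, that grepl() is case-sensitive by default, so mixed-case labels such as Unseasonably Warm and Dry or Extreme Cold are not matched by these patterns.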
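The loops above test each pattern one row at a time, which is the main reason for the long run time. A vectorized version of the same grouping rules might look like the sketch below. Here group_event_types is a hypothetical helper name, not part of the original analysis, and the matching is kept case-sensitive so that it mirrors the loops.

```r
# Hypothetical helper: applies the same grouping rules as the loops, on the whole vector
group_event_types <- function(evtype) {
  evtype <- as.character(evtype)

  # First pass: HEAT takes priority, then TORNADO, then WARM (relabelled as HEAT)
  heat    <- grepl("HEAT", evtype, fixed = TRUE)
  tornado <- !heat & grepl("TORNADO", evtype, fixed = TRUE)
  warm    <- !heat & !tornado & grepl("WARM", evtype, fixed = TRUE)
  evtype[heat | warm] <- "HEAT"
  evtype[tornado]     <- "TORNADO"

  # Second and third passes, applied to the already-updated labels
  evtype[grepl("COLD", evtype, fixed = TRUE)]  <- "COLD"
  evtype[grepl("FLOOD", evtype, fixed = TRUE)] <- "FLOOD"
  evtype
}

# Toy labels for illustration only
labels  <- c("EXCESSIVE HEAT", "TORNADO F0", "UNSEASONABLY WARM",
             "EXTREME COLD", "FLASH FLOOD", "HAIL", "Extreme Cold")
grouped <- group_event_types(labels)
```

Applied to the real data, this would be used as storm$EVTYPE <- group_event_types(storm$EVTYPE). Passing ignore.case = TRUE to grepl() (without fixed = TRUE) would also catch mixed-case labels, but that would change the results reported further down.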
The next problem with analyzing this data is that many of those who entered data into this database used the full name of certain events. For instance, what should have been classified simply as HURRICANE is instead classified as HURRICANE KATRINA. While it is important to note which hurricane occurred at which time, one very large hurricane is still just a hurricane, and not a completely different kind of severe weather event. This was addressed using the code below, which removes all rows whose EVTYPE occurs 8 times or fewer in the dataset. The threshold is arbitrary and was chosen because the highest-frequency anomalous event occurred 8 times. It can be changed by editing the value of thresh.
type.lengths <- tapply(storm$EVTYPE, storm$EVTYPE, length)
type <- names(as.data.frame(t(type.lengths)))
type.0 <- cbind(type,type.lengths)
thresh <- 8 #edit this if threshold must be changed
valu <- as.numeric(type.lengths>thresh)
valu[is.na(valu)] <- 0
test.0 <- cbind(type.0,valu)
type.1 <- test.0[valu==1,]
storm.d <- cbind(storm, as.numeric(storm$EVTYPE %in% type.1[,1]))
storm.1 <- storm.d[storm.d[,8]==1,]
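The same frequency filter can be written more compactly with table(). The following is a minimal sketch on toy data, not the code used to produce the results in this report; the toy data frame and the names counts, keep, and toy_common are illustrative assumptions.

```r
# Toy data for illustration only: two common labels and one rare, named storm
toy <- data.frame(
  EVTYPE = c(rep("FLOOD", 10), rep("HEAT", 9), rep("HURRICANE KATRINA", 2)),
  FATALITIES = 1,
  stringsAsFactors = FALSE
)

thresh <- 8                                # same threshold as in the report
counts <- table(toy$EVTYPE)                # number of rows per event label
keep   <- names(counts)[counts > thresh]   # labels occurring more than thresh times
toy_common <- toy[toy$EVTYPE %in% keep, ]  # drop rows with rare labels
```

On the real data, toy would be replaced by storm, and toy_common would play the role of storm.1.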
The last aspect of the data that needed cleaning was the way property and crop damage are represented. The damage amount is stored as a value in one variable and the exponent (magnitude code) for that value in another variable. To remedy this, the exponent codes were replaced with their numeric multipliers, which were then multiplied by the corresponding damage values. The code is shown below.
# cleaning damage data
storm$PROPEXP[storm$PROPDMGEXP == "K"] <- 1000
storm$PROPEXP[storm$PROPDMGEXP == "M"] <- 1e+06
storm$PROPEXP[storm$PROPDMGEXP == ""] <- 1
storm$PROPEXP[storm$PROPDMGEXP == "B"] <- 1e+09
storm$PROPEXP[storm$PROPDMGEXP == "m"] <- 1e+06
storm$PROPEXP[storm$PROPDMGEXP == "0"] <- 1
storm$PROPEXP[storm$PROPDMGEXP == "5"] <- 1e+05
storm$PROPEXP[storm$PROPDMGEXP == "6"] <- 1e+06
storm$PROPEXP[storm$PROPDMGEXP == "4"] <- 10000
storm$PROPEXP[storm$PROPDMGEXP == "2"] <- 100
storm$PROPEXP[storm$PROPDMGEXP == "3"] <- 1000
storm$PROPEXP[storm$PROPDMGEXP == "h"] <- 100
storm$PROPEXP[storm$PROPDMGEXP == "7"] <- 1e+07
storm$PROPEXP[storm$PROPDMGEXP == "H"] <- 100
storm$PROPEXP[storm$PROPDMGEXP == "1"] <- 10
storm$PROPEXP[storm$PROPDMGEXP == "8"] <- 1e+08
# Assigning '0' to invalid exponent data
storm$PROPEXP[storm$PROPDMGEXP == "+"] <- 0
storm$PROPEXP[storm$PROPDMGEXP == "-"] <- 0
storm$PROPEXP[storm$PROPDMGEXP == "?"] <- 0
# Calculating the property damage value
storm$PROPDMGVAL <- storm$PROPDMG * storm$PROPEXP
# Assigning values for the crop exponent data
storm$CROPEXP[storm$CROPDMGEXP == "M"] <- 1e+06
storm$CROPEXP[storm$CROPDMGEXP == "K"] <- 1000
storm$CROPEXP[storm$CROPDMGEXP == "m"] <- 1e+06
storm$CROPEXP[storm$CROPDMGEXP == "B"] <- 1e+09
storm$CROPEXP[storm$CROPDMGEXP == "0"] <- 1
storm$CROPEXP[storm$CROPDMGEXP == "k"] <- 1000
storm$CROPEXP[storm$CROPDMGEXP == "2"] <- 100
storm$CROPEXP[storm$CROPDMGEXP == ""] <- 1
# Assigning '0' to invalid exponent data
storm$CROPEXP[storm$CROPDMGEXP == "?"] <- 0
# calculating the crop damage value
storm$CROPDMGVAL <- storm$CROPDMG * storm$CROPEXP
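The mapping from exponent codes to multipliers can also be expressed as a lookup, which avoids one assignment per code. The sketch below reproduces the same mapping as the code above under that assumption; exp_lookup and damage_multiplier are hypothetical names introduced for illustration.

```r
# Hypothetical lookup for the letter codes used in PROPDMGEXP and CROPDMGEXP
exp_lookup <- c(H = 1e2, h = 1e2, K = 1e3, k = 1e3, M = 1e6, m = 1e6, B = 1e9)

# Hypothetical helper: converts a vector of exponent codes to numeric multipliers
damage_multiplier <- function(code) {
  code <- as.character(code)
  mult <- rep(NA_real_, length(code))

  # Letter codes come from the lookup table
  is_letter <- code %in% names(exp_lookup)
  mult[is_letter] <- exp_lookup[code[is_letter]]

  # Single digits are treated as powers of ten ("0" gives 1, "5" gives 1e5)
  is_digit <- grepl("^[0-9]$", code)
  mult[is_digit] <- 10^as.numeric(code[is_digit])

  # An empty code means the value is used as-is; "+", "-", "?" are treated as invalid
  mult[code == ""] <- 1
  mult[code %in% c("+", "-", "?")] <- 0
  mult
}

multipliers <- damage_multiplier(c("K", "m", "B", "", "5", "?", "h"))
```

With this helper, the property damage value would be computed as storm$PROPDMG * damage_multiplier(storm$PROPDMGEXP), and likewise for crops.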
Harm to population health was measured by fatalities. Therefore, only the event types with the highest fatalities were selected.
Likewise, the “greatest economic consequences” were measured by property and crop damage, so only the event types with the highest property and crop damage were selected.
Then, for each outcome (fatalities, property damage, and crop damage), the mean value per event type was computed. The code is shown below.
means <- aggregate(FATALITIES ~ EVTYPE, storm.1, FUN = mean)
means <- means[order(-means$FATALITIES),]
top.means <- head(means, n=11L)
propdmg <- aggregate(PROPDMGVAL ~ EVTYPE, storm, FUN = mean)
cropdmg <- aggregate(CROPDMGVAL ~ EVTYPE, storm, FUN = mean)
propdmg <- propdmg[order(-propdmg$PROPDMGVAL),]
cropdmg <- cropdmg[order(-cropdmg$CROPDMGVAL),]
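Ranking by the mean measures the typical severity of a single event, which can differ from ranking by the total across all events. The toy example below, with made-up numbers, shows how a rare but severe event type ranks first by mean while a frequent but mild one ranks first by sum.

```r
# Toy data for illustration only: five mild floods and one severe tsunami
toy <- data.frame(
  EVTYPE = c(rep("FLOOD", 5), "TSUNAMI"),
  FATALITIES = c(1, 1, 1, 1, 1, 4),
  stringsAsFactors = FALSE
)

# Mean fatalities per event type, highest first (same pattern as in the report)
by_mean <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = mean)
by_mean <- by_mean[order(-by_mean$FATALITIES), ]

# Total fatalities per event type, highest first
by_sum <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)
by_sum <- by_sum[order(-by_sum$FATALITIES), ]
```

Here by_mean puts TSUNAMI first (mean of 4 against 1), while by_sum puts FLOOD first (total of 5 against 4).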
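In the cropdmg and propdmg variables, some anomalous events are still present, because these means were computed from storm rather than from the filtered storm.1. To remedy this, the rows for individually named storms were removed by position using the following code. Because the row positions shift after each removal, please run this code in order and exactly as written.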
head(propdmg, n = 10L)
## EVTYPE PROPDMGVAL
## 226 HEAVY RAIN/SEVERE WEATHER 1250000000
## 330 HURRICANE/TYPHOON 787566364
## 327 HURRICANE OPAL 352538444
## 548 STORM SURGE 165990559
## 785 WILD FIRES 156025000
## 328 HURRICANE OPAL/HIGH WINDS 100000000
## 493 SEVERE THUNDERSTORM 92720000
## 207 HAILSTORM 80333333
## 321 HURRICANE 68208730
## 804 WINTER STORM HIGH WINDS 60000000
propdmg <- propdmg[-3,]
propdmg <- propdmg[-5,]
propdmg <- propdmg[-10,]
propdmg <- propdmg[-10,]
head(cropdmg, n = 10L)
## EVTYPE CROPDMGVAL
## 108 EXCESSIVE WETNESS 142000000
## 62 DAMAGING FREEZE 43683333
## 95 Early Frost 42000000
## 330 HURRICANE/TYPHOON 29634918
## 324 HURRICANE ERIN 19430000
## 61 Damaging Freeze 17065000
## 321 HURRICANE 15758103
## 111 Extreme Cold 10000000
## 328 HURRICANE OPAL/HIGH WINDS 10000000
## 129 FREEZE 6030068
cropdmg <- cropdmg[-5,]
cropdmg <- cropdmg[-8,]
# Keep the ten highest-ranked event types in each table
propdmg <- head(propdmg, n = 10L)
cropdmg <- head(cropdmg, n = 10L)
propdmg
## EVTYPE PROPDMGVAL
## 226 HEAVY RAIN/SEVERE WEATHER 1250000000
## 330 HURRICANE/TYPHOON 787566364
## 548 STORM SURGE 165990559
## 785 WILD FIRES 156025000
## 493 SEVERE THUNDERSTORM 92720000
## 207 HAILSTORM 80333333
## 321 HURRICANE 68208730
## 804 WINTER STORM HIGH WINDS 60000000
## 739 TYPHOON 54566364
## 549 STORM SURGE/TIDE 31359378
cropdmg
## EVTYPE CROPDMGVAL
## 108 EXCESSIVE WETNESS 142000000
## 62 DAMAGING FREEZE 43683333
## 95 Early Frost 42000000
## 330 HURRICANE/TYPHOON 29634918
## 61 Damaging Freeze 17065000
## 321 HURRICANE 15758103
## 111 Extreme Cold 10000000
## 129 FREEZE 6030068
## 494 SEVERE THUNDERSTORM WINDS 5800000
## 70 DROUGHT 5615983
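Removing rows by position depends on the exact order of the table at each step. An alternative that does not depend on order is to drop rows by event name. The sketch below uses four rows copied from the property damage output above; toy_rank and named_storms are illustrative names, and this is not the code used to produce the tables in this report.

```r
# Four rows taken from the property damage ranking shown earlier
toy_rank <- data.frame(
  EVTYPE = c("HEAVY RAIN/SEVERE WEATHER", "HURRICANE/TYPHOON",
             "HURRICANE OPAL", "STORM SURGE"),
  PROPDMGVAL = c(1250000000, 787566364, 352538444, 165990559),
  stringsAsFactors = FALSE
)

# Labels that refer to one specific named storm
named_storms <- c("HURRICANE OPAL", "HURRICANE OPAL/HIGH WINDS")

# Keep every row whose label is not in the list, regardless of its position
toy_clean <- toy_rank[!toy_rank$EVTYPE %in% named_storms, ]
```

Because the filter is by name, it can be run more than once or in any order without removing the wrong row.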
Many of the event type names are exceedingly long. This greatly affects the appearance of the plots, so the names are shortened to make the plots fit on the page while retaining the necessary information.
propdmg[1,1] <- "HEAVY RAIN"
propdmg[8,1] <- "WINTER STORM"
propdmg[5,1] <- "THUNDERSTORM"
cropdmg[9,1] <- "THUNDERSTORM"
propdmg
## EVTYPE PROPDMGVAL
## 226 HEAVY RAIN 1250000000
## 330 HURRICANE/TYPHOON 787566364
## 548 STORM SURGE 165990559
## 785 WILD FIRES 156025000
## 493 THUNDERSTORM 92720000
## 207 HAILSTORM 80333333
## 321 HURRICANE 68208730
## 804 WINTER STORM 60000000
## 739 TYPHOON 54566364
## 549 STORM SURGE/TIDE 31359378
cropdmg
## EVTYPE CROPDMGVAL
## 108 EXCESSIVE WETNESS 142000000
## 62 DAMAGING FREEZE 43683333
## 95 Early Frost 42000000
## 330 HURRICANE/TYPHOON 29634918
## 61 Damaging Freeze 17065000
## 321 HURRICANE 15758103
## 111 Extreme Cold 10000000
## 129 FREEZE 6030068
## 494 THUNDERSTORM 5800000
## 70 DROUGHT 5615983
The event types with the highest mean fatalities, property damage, and crop damage are graphed using the code below.
# Fatality plot
require(ggplot2)
## Loading required package: ggplot2
gplot <- ggplot(data = top.means, aes(x=reorder(EVTYPE, -FATALITIES), y=FATALITIES))
gplot <- gplot + geom_bar(stat = "identity", color="skyblue", fill = "steelblue")
gplot <- gplot + theme(axis.text.x=element_text(angle=45, hjust=1))
gplot <- gplot + labs(title="Mean Number of Fatalities by Event Type",
x ="EVENT", y = "AVERAGE FATALITY RATE")
gplot
# Property damage and crop damage plots
par(mfrow = c(1, 2), mar = c(12, 4, 3, 2), mgp = c(3, 1, 0), cex = 0.8)
barplot(propdmg$PROPDMGVAL/(10^9), las = 3, names.arg = propdmg$EVTYPE,
main = "Events with Highest Average Property Damage", ylab = "Damage Cost ($ billions)",
col = "steelblue")
barplot(cropdmg$CROPDMGVAL/(10^9), las = 3, names.arg = cropdmg$EVTYPE,
main = "Events with Highest Average Crop Damage", ylab = "Damage Cost ($ billions)",
col = "steelblue")
Based on this data, tsunamis had the highest average number of fatalities per event, followed closely by heat, rip currents, and avalanches. The term “CURRENT” should have been included in the word-grouping step, because rip currents appear twice, which is an unintended result. Heavy rain had the highest average property damage, and “excessive wetness” had the highest average crop damage. The data do not confirm it, but it seems reasonable to assume that excessive wetness is a result of heavy rain and could therefore be classified as heavy rain. Hurricanes/typhoons ranked second for property damage, at about $788 million per event on average compared with $1.25 billion for heavy rain. Similarly, freezing temperatures ranked second for crop damage, at about $44 million per event compared with $142 million for excessive wetness.