The goal of this analysis is to use NOAA’s data to shed light on the health-related effects and economic effects of various environmental phenomena.
I find the following, with event types listed in descending order of damage. For fatalities, the three most damaging types of environmental phenomena are heat, tornados, and floods. For injuries, they are tornados, heat, and thunderstorms. For crop damage, they are droughts, floods, and hurricanes/typhoons. For property damage, they are floods, hurricanes/typhoons, and storm surges. Overall, the most economically damaging environmental phenomena are floods, hurricanes/typhoons, and storm surges.
The following code chunk downloads the data from the Internet. Specifically, if the data file (“repdata%2Fdata%2FStormData.csv.bz2”) is not in the working directory, the code downloads it from the Web. The file is then read into R; read.csv decompresses the bz2 archive on the fly, so no separate unzipping step is needed.
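dataFile <- "repdata%2Fdata%2FStormData.csv.bz2"
# download only if the file is not already in the working directory
if(!file.exists(dataFile)) {
download.file(url="https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile=dataFile, mode="wb")
}
# read.csv reads the bz2-compressed file directly
StormData <- read.csv(dataFile)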
I begin the analysis by loading the stringr and dplyr packages, which will be used throughout.
if(!"stringr" %in% rownames(installed.packages())) {install.packages("stringr")}
library("stringr")
## Warning: package 'stringr' was built under R version 3.3.1
if(!"dplyr" %in% rownames(installed.packages())) {install.packages("dplyr")}
library("dplyr")
The next code chunk selects the variables of interest, converts the BGN_DATE variable to the Date class, and extracts the YEAR of each observation’s date.
data <- select(StormData, BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG,PROPDMGEXP, CROPDMG, CROPDMGEXP)
data$DATE <- as.Date(strptime(data$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"))
data$YEAR <- as.factor(format(data$DATE,format="%Y"))
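As a quick check of the date handling, the sketch below parses two made-up strings written in the same `%m/%d/%Y %H:%M:%S` layout and extracts the year. The sample values are illustrative assumptions, not records from the data set.

```r
# Hypothetical sample values in the BGN_DATE layout (not real records)
bgn <- c("4/18/1950 0:00:00", "11/28/2011 0:00:00")

# Parse the strings to Date, then pull out the four-digit year
dates <- as.Date(strptime(bgn, format = "%m/%d/%Y %H:%M:%S"))
years <- format(dates, format = "%Y")

print(dates)
print(years)
```

Strings that do not fit the format come back as NA, so counting NA values in DATE is a simple way to confirm that every record parsed.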
The next code chunk plots the data availability over the years. Given the paucity of records in earlier years, it made sense to me to continue my analysis using only the data for years 2000 – 2011. Moreover, given my interest, it made sense to focus only on the variables that measure physical health effects and economic effects.
barplot(table(data$YEAR), ylab="Total Number of records")
data2 <- filter(data, DATE >= as.Date("2000-01-01"))
data3 <- select(data2, BGN_DATE, DATE, YEAR, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG,PROPDMGEXP, CROPDMG, CROPDMGEXP)
Note that I am considering only records from 2000 onward.
The data set records property damage by means of 2 variables: PROPDMG gives the numeric part of the damage estimate, and PROPDMGEXP gives a code for its order of magnitude (for example, K for thousands, M for millions, and B for billions). I convert the codes in PROPDMGEXP into numerical multipliers, stored in PROPEXP. Finally, the product of PROPDMG and PROPEXP gives me the dollar value of property damage for each record.
unique(data3$PROPDMGEXP)
## [1] K M B 0
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
# Map each property exponent code to its numerical multiplier
data3$PROPEXP[data3$PROPDMGEXP == "K"] <- 10^3
data3$PROPEXP[data3$PROPDMGEXP == "M"] <- 10^6
data3$PROPEXP[data3$PROPDMGEXP == "" ] <- 10^0
data3$PROPEXP[data3$PROPDMGEXP == "B"] <- 10^9
data3$PROPEXP[data3$PROPDMGEXP == "m"] <- 10^6
data3$PROPEXP[data3$PROPDMGEXP == "0"] <- 10^0
data3$PROPEXP[data3$PROPDMGEXP == "5"] <- 10^5
data3$PROPEXP[data3$PROPDMGEXP == "6"] <- 10^6
data3$PROPEXP[data3$PROPDMGEXP == "4"] <- 10^4
data3$PROPEXP[data3$PROPDMGEXP == "2"] <- 10^2
data3$PROPEXP[data3$PROPDMGEXP == "3"] <- 10^3
data3$PROPEXP[data3$PROPDMGEXP == "h"] <- 10^2
data3$PROPEXP[data3$PROPDMGEXP == "7"] <- 10^7
data3$PROPEXP[data3$PROPDMGEXP == "H"] <- 10^2
data3$PROPEXP[data3$PROPDMGEXP == "1"] <- 10^1
data3$PROPEXP[data3$PROPDMGEXP == "8"] <- 10^8
# assign 0 to invalid exponent codes so that those records do not count toward the totals
data3$PROPEXP[data3$PROPDMGEXP == "+"] <- 0
data3$PROPEXP[data3$PROPDMGEXP == "-"] <- 0
data3$PROPEXP[data3$PROPDMGEXP == "?"] <- 0
# compute the property damage value
data3$PROPDMGVAL <- data3$PROPDMG * data3$PROPEXP
The same calculations are repeated for the crop damage data in the next code chunk:
unique(data3$CROPDMGEXP)
## [1] K M B
## Levels: ? 0 2 B k K m M
data3$CROPEXP[data3$CROPDMGEXP == "M"] <- 10^6
data3$CROPEXP[data3$CROPDMGEXP == "K"] <- 10^3
data3$CROPEXP[data3$CROPDMGEXP == "m"] <- 10^6
data3$CROPEXP[data3$CROPDMGEXP == "B"] <- 10^9
data3$CROPEXP[data3$CROPDMGEXP == "0"] <- 10^0
data3$CROPEXP[data3$CROPDMGEXP == "k"] <- 10^3
data3$CROPEXP[data3$CROPDMGEXP == "2"] <- 10^2
data3$CROPEXP[data3$CROPDMGEXP == "" ] <- 10^0
# assign 0 to invalid exponent codes so that those records do not count toward the totals
data3$CROPEXP[data3$CROPDMGEXP == "?"] <- 0
# compute the crop damage value
data3$CROPDMGVAL <- data3$CROPDMG * data3$CROPEXP
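The code-to-multiplier conversion above can also be written as a single lookup table. The following is a minimal sketch of that alternative, not the code used to produce the results in this report; `exp_lookup` and `exp_to_multiplier` are hypothetical names, and the toy values are made up.

```r
# Named vector mapping each exponent code to its multiplier.
# Invalid codes ("+", "-", "?") map to 0 so those rows contribute nothing.
exp_lookup <- c("h" = 1e2, "H" = 1e2, "k" = 1e3, "K" = 1e3,
                "m" = 1e6, "M" = 1e6, "B" = 1e9,
                "+" = 0, "-" = 0, "?" = 0)
# Digit codes "0" to "8" stand for powers of ten
exp_lookup[as.character(0:8)] <- 10^(0:8)

# Hypothetical helper: convert a vector of codes to multipliers
exp_to_multiplier <- function(code) {
  code <- as.character(code)        # also works when the column is a factor
  out <- unname(exp_lookup[code])   # codes missing from the table become NA
  out[code == ""] <- 1              # a blank code means no scaling
  out
}

# Toy data, not taken from the data set
dmg  <- c(25, 2.5, 10, 3, 7)
code <- c("K", "M", "", "?", "B")
values <- dmg * exp_to_multiplier(code)
print(values)
```

One advantage of the table form is that the same helper serves both PROPDMGEXP and CROPDMGEXP, and any code not covered by the table shows up as NA instead of being silently skipped.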
Event types in the NOAA data set are not always recorded consistently. For example, in some cases the event type is Thunderstorm and in others it is TSTM. In the next code chunk I have tried painstakingly to merge the numerous duplicate labels.
data4 <- select(data3, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMGVAL, CROPDMGVAL)
data4$EVTYPE[grepl("HEAVY SURF|HIGH SURF ADVISORIES|HIGH SURF ADVISORY|HAZARDOUS SURF", data4$EVTYPE)] <- "HIGH SURF"
data4$EVTYPE[grepl("THUNDERSTORM|SEVERE THUNDERSTORM|TSTM WIND|GUSTY THUNDERSTORM WIND|THUNDERSTORM WIND (G40)|GUSTY THUNDERSTORM WINDS", data4$EVTYPE)] <- "THUNDERSTORM WIND"
data4$EVTYPE[grepl("COASTAL FLOODING|CSTL FLOODING/EROSION", data4$EVTYPE)] <- "COASTAL FLOOD"
data4$EVTYPE[grepl("FLASH FLOOD|TIDAL FLOODING|LAKESHORE FLOOD|URBAN/SML STREAM FLD", data4$EVTYPE)] <- "FLOOD"
data4$EVTYPE[grepl("FREEZING RAIN|FREEZING DRIZZLE", data4$EVTYPE)] <- "FREEZING RAIN"
data4$EVTYPE[grepl("HIGH WIND|STRONG WIND|STRONG WINDS|WIND DAMAGE|GRADIENT WIND|GUSTY WIND|WIND GUSTS|GUSTY LAKE WIND|WND", data4$EVTYPE)] <- "WIND"
data4$EVTYPE[grepl("FREEZE", data4$EVTYPE)] <- "FREEZE"
data4$EVTYPE[grepl("WIND CHILL|EXTREME WINDCHILL", data4$EVTYPE)] <- "WIND CHILL"
data4$EVTYPE[grepl("HEAVY SNOW|MODERATE SNOWFALL|LAKE EFFECT SNOW|LAKE-EFFECT SNOW|RECORD SNOWFALL|LIGHT SNOW|LATE SEASON SNOW|RECORD SNOW|SNOW/BLOWING SNOW|EARLY SNOWFALL|FIRST SNOW", data4$EVTYPE)] <- "SNOW"
data4$EVTYPE[grepl("EXCESSIVE HEAT|RECORD WARMTH|RECORD HEAT|WARM WEATHER|VERY WARM|UNSEASONABLY WARM|UNSEASONABLY HOT|UNUSUALLY WARM", data4$EVTYPE)] <- "HEAT"
data4$EVTYPE[grepl("EXTREME COLD|UNUSUALLY COLD|RECORD COLD|PROLONG COLD|UNSEASONABLY COOL|COLD WEATHER|UNUSEASONAL LOW TEMP|UNSEASONABLY COLD", data4$EVTYPE)] <- "COLD"
data4$EVTYPE[grepl("HEAVY RAIN|RECORD RAINFALL|UNSEASONABLY WET|LOCALLY HEAVY RAIN", data4$EVTYPE)] <- "RAIN"
data4$EVTYPE[grepl("HAIL", data4$EVTYPE)] <- "HAIL"
data4$EVTYPE[grepl("FOG", data4$EVTYPE)] <- "FOG"
data4$EVTYPE[grepl("DRY", data4$EVTYPE)] <- "DRY WEATHER"
data4$EVTYPE[grepl("ICE", data4$EVTYPE)] <- "ICE"
data4$EVTYPE[grepl("TORNADO", data4$EVTYPE)] <- "TORNADO"
data4$EVTYPE[grepl("DUST DEVEL", data4$EVTYPE)] <- "DUST DEVIL"
data4$EVTYPE <- str_trim(data4$EVTYPE)
# unique(data4$EVTYPE)
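To illustrate how this grepl-based recoding behaves, here is a small sketch on a made-up character vector. The labels are illustrative only. It also shows one detail worth keeping in mind: grepl treats its pattern as a regular expression, so parentheses act as grouping operators unless `fixed = TRUE` is set.

```r
# Toy event labels (illustrative only)
ev <- c("TSTM WIND", "THUNDERSTORM WIND (G40)", "FLASH FLOOD", "HAIL 175", "DROUGHT")

# "|" means "or", and a match anywhere in the string counts
ev[grepl("THUNDERSTORM|TSTM WIND", ev)] <- "THUNDERSTORM WIND"
ev[grepl("FLASH FLOOD", ev)] <- "FLOOD"
ev[grepl("HAIL", ev)] <- "HAIL"
print(ev)

# As a regex, "(G40)" is a group that matches the text G40 without parentheses,
# so the literal label is only matched when fixed = TRUE
regex_match   <- grepl("WIND (G40)", "THUNDERSTORM WIND (G40)")
literal_match <- grepl("WIND (G40)", "THUNDERSTORM WIND (G40)", fixed = TRUE)
print(c(regex_match, literal_match))
```

Because matching is by substring, the order of the replacement statements matters: a label rewritten by an earlier statement is no longer available to later patterns.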
Now the data is ready for my analysis.
data.evtype <- group_by(data4, EVTYPE)
data.evtype.damage <- summarize(data.evtype, TOTAL.FATALITIES = sum(FATALITIES), TOTAL.INJURIES = sum(INJURIES), PROPERTY.DAMAGE = sum(PROPDMGVAL), CROP.DAMAGE = sum(CROPDMGVAL))
data.evtype.damage <- mutate(data.evtype.damage, ECON.DAMAGE = PROPERTY.DAMAGE + CROP.DAMAGE)
econ.damage.evtype <- select(data.evtype.damage, EVTYPE, PROPERTY.DAMAGE, CROP.DAMAGE, ECON.DAMAGE)
# rank crop and property damage on the full table, before it is truncated to the top 10 by total damage
crop.damage.evtype <- head(arrange(econ.damage.evtype, desc(CROP.DAMAGE)), 10)
property.damage.evtype <- head(arrange(econ.damage.evtype, desc(PROPERTY.DAMAGE)), 10)
econ.damage.evtype <- head(arrange(econ.damage.evtype, desc(ECON.DAMAGE)), 10)
life.damage.evtype <- select(data.evtype.damage, EVTYPE, TOTAL.FATALITIES, TOTAL.INJURIES)
death.damage.evtype <- head(arrange(life.damage.evtype, desc(TOTAL.FATALITIES)), 10)
injury.damage.evtype <- head(arrange(life.damage.evtype, desc(TOTAL.INJURIES)), 10)
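The group, sum, and rank pattern used above can be checked on a toy table. The sketch below uses base R (`aggregate` and `order`) in place of dplyr so that it runs without extra packages; the numbers are made up and do not come from the data set.

```r
# Toy records (made-up numbers, not from the data set)
toy <- data.frame(
  EVTYPE     = c("FLOOD", "HEAT", "FLOOD", "TORNADO", "HEAT", "HAIL"),
  FATALITIES = c(2, 5, 1, 4, 3, 0),
  stringsAsFactors = FALSE
)

# Sum fatalities within each event type
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Sort in descending order and keep the top 3
top3 <- head(totals[order(-totals$FATALITIES), ], 3)
print(top3)
```

The ranking must always be taken from the full table of totals; ranking a table that has already been cut down to the top rows of another measure can drop event types that rank high on one measure but low on the other.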
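The relevant data frames are death.damage.evtype, injury.damage.evtype, crop.damage.evtype, property.damage.evtype, and econ.damage.evtype.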
The next code chunk creates a 2-panel plot showing the top-10 event types for injuries and the top-10 event types for fatalities.
par(mfrow=c(1,2))
barplot(injury.damage.evtype$TOTAL.INJURIES, names.arg = injury.damage.evtype$EVTYPE, ylab = "Total Injuries, 2000-2011", main = "Total Injuries, 2000-2011", las = 2)
barplot(death.damage.evtype$TOTAL.FATALITIES, names.arg = death.damage.evtype$EVTYPE, ylab = "Total Fatalities, 2000-2011", main = "Total Fatalities, 2000-2011", las = 2)
For fatalities, the three most damaging types of environmental phenomena are heat, tornados, and floods, in that order. For injuries, they are tornados, heat, and thunderstorms, in that order.
The next code chunk creates a 3-panel plot showing the top-10 event types for total economic damage (which is the sum of property damage and crop damage), property damage, and crop damage.
par(mfrow=c(1,3))
barplot(econ.damage.evtype$ECON.DAMAGE, names.arg = econ.damage.evtype$EVTYPE, ylab = "Total Economic Damage, 2000-2011", main = "Total Economic Damage, 2000-2011", las = 2)
barplot(property.damage.evtype$PROPERTY.DAMAGE, names.arg = property.damage.evtype$EVTYPE, ylab = "Total Property Damage, 2000-2011", main = "Total Property Damage, 2000-2011", las = 2)
barplot(crop.damage.evtype$CROP.DAMAGE, names.arg = crop.damage.evtype$EVTYPE, ylab = "Total Crop Damage, 2000-2011", main = "Total Crop Damage, 2000-2011", las = 2)
For crop damage, the three most damaging types of environmental phenomena are droughts, floods, and hurricanes/typhoons, in that order. For property damage, they are floods, hurricanes/typhoons, and storm surges, in that order. Overall, the most economically damaging environmental phenomena are floods, hurricanes/typhoons, and storm surges.
This essay uses NOAA’s storm data set to provide information to public officials that may help them decide which types of environmental events cause the most damage (physical or economic). This information may help public officials to make the best use of scarce budgetary resources in building defences against natural disasters and emergencies.