Weather disasters often result in considerable adverse effects on both public health and economic state of communities. The goal of this research is to assess which types of events are most harmful in terms of the amount of injuries and deaths caused, and property damage. For purposes of this assignment we will use the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. Our paper presents the Top 10 weather-related events, which resulted in the most damaging outcomes for USA population during 1996-2011. Based on our analysis we concluded that tornadoes are the most harmful with respect to population health, and floods have the greatest economic consequences.
The data for this assignment was extracted from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The database records start in 1950 and end in November 2011. In the earlier years of the database there are generally fewer types of events recorded (Storm Events Recording Details), that’s why decision to include in your analysis all the presented time frame might cause your results to be misleading. So, for our project we decided to use the time span with the most complete data - period between 1996 and 2011. The database is presented in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size and can be downloaded via this link:
There is also some documentation on the database available:
# Install package for handling .bz2 files
# The standard unzip() function doesn't work with .bz2 archives, so we need to install "R.Utils" package to handle it
if (!require("R.utils")) {
install.packages("R.utils", repos="http://cran.rstudio.com/")
}
library("R.utils")
# To prevent R from downloading and extracting the same dataset all over again we should add the check if the file is already exist
# 1. If file exist - do nothing
# 2. If archive has been already downloaded - bunzip it
# 3. If neither file, nor archive is in working directory - download archive and extract file
# While downloading file via download.file() function we should use mode (a character string denoting the mode for opening and writing the file) 'wb' , which assumes binary data (default mode is 'w' - text data)
# If method='curl' option doesn't work and you still get an error message ("Unsupported URL scheme') try to install the 'downloader' package
# It should be mentioned that R can work .bz2 files without decompressing it (but we will not use this option)
if (!file.exists("repdata-data-StormData.csv")) {
if (!file.exists("StormData.csv.bz2")) {
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destfile <- "repdata-data-StormData.csv.bz2"
download.file(url, destfile, mode = "wb", method="curl")
dateDownload <- date()
bunzip2("repdata-data-StormData.csv.bz2")
} else {
bunzip2("repdata-data-StormData.csv.bz2")
}
}
rawData<-read.csv("repdata-data-StormData.csv")
rawData$BGN_DATE <- as.Date(rawData$BGN_DATE, format = "%m/%d/%Y")
rawData$EVTYPE <- as.factor(rawData$EVTYPE)
rawData$PROPDMGEXP <- as.factor(rawData$PROPDMGEXP)
rawData$CROPDMGEXP <- as.factor(rawData$CROPDMGEXP)
Original dataset includes 902297 observations of 37 variables. Amongst them we are only interested in observations that were recorded after 1995-12-31 and 8 variables:
# Choosing only observations made after "1995-12-31" and 8 variables of interest
procData_1<-subset(rawData,BGN_DATE>as.Date("1995-12-31"), select = c(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
# Deleting observations where all of the variables of interest are less than 0
procData_2<-subset(procData_1, FATALITIES>0 | INJURIES>0 | PROPDMG>0 | CROPDMG>0, select =c(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
# Removing unused levels from the dataset (after subsetting the dataset still keeps all the levels of the original dataset)
procData_2<-droplevels(procData_2)
After processing our dataset includes only 201318 observations of 8 variables.
You might notice that Storm Data Documentation contains 48 event types, whereas our dataset include 222. And if you check the list of those names you’ll see how messy it is. So we need to clean it up. In this case preparing correct classification requires some degree of awareness in the field, and my final result by no means can be treated as a point of reference. But for the purposes of this assignment I suppose it can be accepted.
After some data rearrangement we get dataset with 47 conventional event types (excluding №6 - “Debris Flow”“, №36 -”Sleet“, which were not found, and adding”Other“).
EVTYPE_recode<-list(
"Astronomical Low Tide" = c("ASTRONOMICAL HIGH TIDE","ASTRONOMICAL LOW TIDE"),
"Avalanche" = c("AVALANCHE","LANDSLIDE","LANDSLIDES","Landslump","MUD SLIDE","Mudslide","MUDSLIDE","Mudslides","ROCK SLIDE"),
"Blizzard" = c("BLIZZARD"),
"Coastal Flood" = c("Beach Erosion","COASTAL FLOODING/EROSION","COASTAL EROSION","Coastal Flood","COASTAL FLOOD","Coastal Flooding","COASTAL FLOODING","COASTAL FLOODING/EROSION"),
"Cold/Wind Chill" = c("Cold","COLD","COLD AND SNOW","Cold Temperature","COLD WEATHER","COLD/WIND CHILL","Erosion/Cstl Flood"),
"Dense Fog" = c("DENSE FOG"),
"Dense Smoke" = c("DENSE SMOKE"),
"Drought" = c("DROUGHT"),
"Dust Devil" = c("Dust Devil","DUST DEVIL"),
"Dust Storm" = c("DUST STORM","BLOWING DUST"),
"Excessive Heat" = c("EXCESSIVE HEAT","Heat Wave","HYPERTHERMIA/EXPOSURE","Hypothermia/Exposure","HYPOTHERMIA/EXPOSURE","RECORD HEAT","UNSEASONABLY WARM"),
"Extreme Cold/Wind Chill" = c("Extended Cold","Extreme Cold","EXTREME COLD","EXTREME COLD/WIND CHILL","EXTREME WINDCHILL","Unseasonable Cold","UNSEASONABLY COLD"),
"Flash Flood" = c(" FLASH FLOOD","FLASH FLOOD","FLASH FLOOD/FLOOD"),
"Flood" = c("FLOOD","FLOOD/FLASH/FLOOD","RIVER FLOOD","River Flooding","RIVER FLOODING","DAM BREAK","Tidal Flooding","DROWNING","URBAN/SML STREAM FLD","TIDAL FLOODING"),
"Freezing Fog" = c("FREEZING FOG","FOG"),
"Frost/Freeze " = c("Freeze","FREEZE","AGRICULTURAL FREEZE","Damaging Freeze","DAMAGING FREEZE","Early Frost","FROST","Frost/Freeze","FROST/FREEZE","HARD FREEZE"),
"Funnel Cloud" = c("FUNNEL CLOUD"),
"Hail" = c("HAIL","SMALL HAIL"),
"Heat" = c("HEAT","WARM WEATHER"),
"Heavy Rain" = c("HEAVY RAIN","Heavy Rain/High Surf","RAIN","RAIN/SNOW","UNSEASONAL RAIN","MIXED PRECIP","Mixed Precipitation","MIXED PRECIPITATION","Torrential Rainfall"),
"Heavy Snow" = c("HEAVY SNOW","Heavy snow shower","LATE SEASON SNOW","Light snow","Light Snow","LIGHT SNOW","Light Snowfall","Snow","SNOW","SNOW AND ICE","SNOW SQUALL","Snow Squalls","SNOW SQUALLS","EXCESSIVE SNOW","FALLING SNOW/ICE","blowing snow"),
"High Surf" = c(" HIGH SURF ADVISORY","HEAVY SEAS","Heavy Surf","HEAVY SURF","Heavy surf and wind","HEAVY SURF/HIGH SURF","HIGH SEAS","High Surf","HIGH SURF","HIGH SWELLS","HIGH WATER","ROGUE WAVE","ROUGH SEAS","ROUGH SURF","WIND AND WAVE","HAZARDOUS SURF","Coastal Storm","COASTAL STORM","COASTALSTORM"),
"High Wind" = c("gradient wind","Gradient wind","GRADIENT WIND","GUSTY WIND","GUSTY WIND/HAIL","GUSTY WIND/HVY RAIN","Gusty wind/rain","Gusty winds","Gusty Winds","GUSTY WINDS","HIGH WIND","HIGH WIND (G40)","HIGH WINDS","NON-SEVERE WIND DAMAGE","NON-TSTM WIND","NON TSTM WIND"),
"Hurricane/Typhoon" = c("HURRICANE","Hurricane Edouard","HURRICANE/TYPHOON","TYPHOON"),
"Ice Storm" = c("ICE STORM"),
"Lakeshore Flood" = c("LAKESHORE FLOOD"),
"Lake-Effect Snow" = c("LAKE-EFFECT SNOW","Lake Effect Snow","LAKE EFFECT SNOW"),
"Lightning" = c("LIGHTNING"),
"Marine Hail" = c("MARINE HAIL"),
"Marine High Wind" = c("MARINE HIGH WIND","Marine Accident"),
"Marine Strong Wind" = c("MARINE STRONG WIND"),
"Marine Thunderstorm Wind" = c("MARINE THUNDERSTORM WIND","MARINE TSTM WIND"),
"Rip Current" = c("RIP CURRENT","RIP CURRENTS"),
"Seiche" = c("SEICHE"),
"Storm Tide" = c("STORM SURGE","STORM SURGE/TIDE"),
"Strong Wind" = c("Strong Wind","STRONG WIND","Strong Winds","STRONG WINDS","Wind","WIND","Wind Damage","WINDS"),
"Thunderstorm Wind" = c(" TSTM WIND"," TSTM WIND (G45)","THUNDERSTORM","THUNDERSTORM WIND","THUNDERSTORM WIND (G40)","Tstm Wind","TSTM WIND","TSTM WIND (G45)","TSTM WIND (41)","TSTM WIND (G35)","LANDSPOUT","Whirlwind","WHIRLWIND","DOWNBURST","TSTM WIND (G40)","TSTM WIND (G45)","TSTM WIND 40","TSTM WIND 45","TSTM WIND AND LIGHTNING","TSTM WIND G45","TSTM WIND/HAIL","DRY MICROBURST","WET MICROBURST","Microburst"),
"Tornado" = c("TORNADO"),
"Tropical Depression" = c("TROPICAL DEPRESSION"),
"Tropical Storm" = c("TROPICAL STORM"),
"Tsunami" = c("TSUNAMI"),
"Volcanic Ash" = c("VOLCANIC ASH"),
"Waterspout" = c("WATERSPOUT"),
"Wildfire" = c("WILD/FOREST FIRE","WILDFIRE","BRUSH FIRE"),
"Winter Storm" = c("WINTER STORM"),
"Winter Weather" = c("WINTER WEATHER","WINTER WEATHER MIX","WINTER WEATHER/MIX","Wintry Mix","WINTRY MIX","Ice jam flood (minor","ICE ON ROAD","ICE ROADS","ICY ROADS","Freezing drizzle","Freezing Drizzle","FREEZING DRIZZLE","Freezing Rain","FREEZING RAIN","Freezing Spray","BLACK ICE","LIGHT FREEZING RAIN","Glaze","GLAZE"),
"Other" = c("Other","OTHER"))
levels(procData_2$EVTYPE)<-EVTYPE_recode
PRPDMGEXP and CROPDMGEXP variables indicate the degree of property or crop damage. Total estimated damage is therefore PROPDMG * 10PROPDMGEXP. For example, if PROPDMG is 3.5 and PROPDMGEXP is 5, then total estimated property damage is 3.5 * 105 (USD 350,000). In our case these variables can take 4 values:
Lets recalculate the amount of damage taking into account the degree of damage, and write it into 2 new variables - PROPDMG_new and CROPDMG_new.
# Recode the levels of PRPDMGEXP and CROPDMGEXP variables
EXP_recode<-list("0"="", "3"="K", "6"="M", "9"="B")
levels(procData_2$PROPDMGEXP)<-EXP_recode
levels(procData_2$CROPDMGEXP)<-EXP_recode
# Transform factor levels to numerics
# If we try to transform PROPDMGEXP|CROPDMGEXP variables directly into numerics instead of getting actual values (i.e. "0", "3", "6", "9") we will see the factor's numbers sequence (i.e. 1, 2, 3, 4)
procData_2$PROPDMGEXP<-as.numeric(as.character(procData_2$PROPDMGEXP))
procData_2$CROPDMGEXP<-as.numeric(as.character(procData_2$CROPDMGEXP))
# Calculcate damage variables
procData_2$PROPDMG_new<-procData_2$PROPDMG*10^(procData_2$PROPDMGEXP)/1000000
procData_2$CROPDMG_new<-procData_2$CROPDMG*10^(procData_2$CROPDMGEXP)/1000000
Next step is aggregating data by the EVTYPE variable to calculate the total number of deaths and injuries, and amount of economic damage caused by each type of weather events.
# Install 'reshape2' package to melt and cast data
if (!require("reshape2")) {
install.packages("reshape2", repos="http://cran.rstudio.com/")
}
library("reshape2")
# Create datasets suitable for making plots
# To understand what exactly is happening here visit: http://seananderson.ca/2013/10/19/reshape.html
# To use variables' names with spaces in 'aggregate' function you need to enclose it with backticks (`)
# To arrange data on graph we need to get ordered list of events (check the last lines of code for each group of data)
# "Health Effects"
healthEffects<-procData_2[, c("EVTYPE","FATALITIES","INJURIES")]
healthEffects$TOTAL<-healthEffects$FATALITIES+healthEffects$INJURIES
healthEffects<-healthEffects[healthEffects$TOTAL!=0,]
healthEffects<-melt(healthEffects, id.vars="EVTYPE", variable.name= "Health outcome", value.name="Number of cases")
healthEffects<-aggregate(`Number of cases` ~ EVTYPE + `Health outcome`, data=healthEffects, sum)
healthEffects<-dcast(healthEffects, EVTYPE ~ `Health outcome`, value.var="Number of cases")
FATALITIES<-healthEffects[, c("EVTYPE","FATALITIES")]
FATALITIES<-FATALITIES[order(-FATALITIES$FATALITIES), ]
FATALITIES<-FATALITIES[1:10,]
INJURIES<-healthEffects[, c("EVTYPE","INJURIES")]
INJURIES<-INJURIES[order(-INJURIES$INJURIES), ]
INJURIES<-INJURIES[1:10,]
TOTAL_HEALTH<-healthEffects[, c("EVTYPE","TOTAL")]
TOTAL_HEALTH<-TOTAL_HEALTH[order(-TOTAL_HEALTH$TOTAL), ]
TOTAL_HEALTH<-TOTAL_HEALTH[1:15,]
fatalLevels<-FATALITIES$EVTYPE
injurLevels<-INJURIES$EVTYPE
totalHealthLevels<-TOTAL_HEALTH$EVTYPE
# "Economic Consequences"
econCons<-procData_2[, c("EVTYPE", "PROPDMG_new", "CROPDMG_new")]
econCons$TOTAL<-econCons$PROPDMG_new+econCons$CROPDMG_new
econCons<-econCons[econCons$TOTAL!=0,]
econCons<-melt(econCons, id.vars="EVTYPE", variable.name= "Economic outcome", value.name="Amount of damage")
econCons<-aggregate(`Amount of damage` ~ EVTYPE + `Economic outcome`, data=econCons, sum)
econCons<-dcast(econCons, EVTYPE ~ `Economic outcome`, value.var="Amount of damage")
PROPERTY<-econCons[, c("EVTYPE","PROPDMG_new")]
PROPERTY<-PROPERTY[order(-PROPERTY$PROPDMG_new), ]
PROPERTY<-PROPERTY[1:10,]
CROP<-econCons[, c("EVTYPE","CROPDMG_new")]
CROP<-CROP[order(-CROP$CROPDMG_new), ]
CROP<-CROP[1:10,]
TOTAL_ECON<-econCons[, c("EVTYPE","TOTAL")]
TOTAL_ECON<-TOTAL_ECON[order(-TOTAL_ECON$TOTAL), ]
TOTAL_ECON<-TOTAL_ECON[1:15,]
propLevels<-PROPERTY$EVTYPE
cropLevels<-CROP$EVTYPE
totalEconLevels<-TOTAL_ECON$EVTYPE
The last step in our analysis is plotting the data we get on a graph and making conclusions. We will make two figures here - “Health effects of severe weather events” and “Economic consequences of severe weather events”. Each figure includes 3 plots: separate graphs for Fatalities, Injuries and Total health effect, and separate for Property, Crop damage and Total economic outcomes.
# Install packages for making plots
if (!require("ggplot2")) {
install.packages("ggplot2", repos="http://cran.rstudio.com/")
}
if (!require("grid")) {
install.packages("grid", repos="http://cran.rstudio.com/")
}
if (!require("gridExtra")) {
install.packages("gridExtra", repos="http://cran.rstudio.com/")
}
library("ggplot2")
library("grid")
library("gridExtra")
# Preparing formatting for plots
# Parameters 'vjust' and 'hjust' are set to prevent the graph being overlapped by event types' names
# If we don't set the 'x' argument ('x' axes's title) of labs() function to NULL, it's name would be taken as 'x' argument of aes() function (i.e. 'factor(EVTYPE, levels=fatalLevels)')
# To combine all the plots into one figure we use 'arrangeGrob' function
theme<-theme(plot.title = element_text(size=10)) + theme(axis.title.y = element_text(size=8)) + theme(axis.text.x= element_text(size= 5, angle=45, vjust = 1, hjust= 1)) + theme(axis.text.y = element_text(size=5))
# "Health Effects"
h1<-ggplot(data=FATALITIES, aes(x=factor(EVTYPE, levels=fatalLevels), y=FATALITIES))+geom_bar(stat="identity", fill="red2")+labs(title="Fatalities by Weather Events", x=NULL, y="Number of Fatalities")+theme
h2<-ggplot(data=INJURIES, aes(x=factor(EVTYPE, levels=injurLevels), y=INJURIES))+geom_bar(stat="identity", fill="yellow")+labs(title="Injuries by Weather Events", x=NULL, y="Number of Injuries")+theme
h3<-ggplot(data=TOTAL_HEALTH, aes(x=factor(EVTYPE, levels=totalHealthLevels), y=TOTAL))+geom_bar(stat="identity", fill="purple3")+labs(title="Total Health Effects of Weather Events", x=NULL, y="Total Health Effects")+theme
H1<-arrangeGrob(h1, h2, ncol=2)
H2<-grid.arrange(H1, h3, nrow=2)
As shown in the plot above, tornadoes caused most of the injuries and nearly most of the fatalities in the USA during the observation period, and so can be considered as having the worst effect on the USA public health.
# "Economic Consequences"
e1<-ggplot(data=PROPERTY, aes(x=factor(EVTYPE, levels=propLevels), y=PROPDMG_new))+geom_bar(stat="identity", fill="blue1")+labs(title="Property Damage of Weather Events", x=NULL, y="Property Damage, $1M")+theme
e2<-ggplot(data=CROP, aes(x=factor(EVTYPE, levels=cropLevels), y=CROPDMG_new))+geom_bar(stat="identity", fill="cyan1")+labs(title="Crop Damage of Weather Events", x=NULL, y="Crop Damage, $1M")+theme
e3<-ggplot(data=TOTAL_ECON, aes(x=factor(EVTYPE, levels=totalEconLevels), y=TOTAL))+geom_bar(stat="identity", fill="chartreuse")+labs(title="Total Economic Consequences of Weather Events", x=NULL, y="Total Economic Damage, $1M")+theme
E1<-arrangeGrob(e1, e2, ncol=2)
E2<-grid.arrange(E1, e3, nrow=2)
As shown in the plot below, with regard to material aspect, flood was the most damaging weather event harming the USA economy in the period between 1996 and 2011.