Author: Arash Amoozegar
In this study, I explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database which tracks the characteristics of major weather events in the U.S. After an extensive data cleaning, my analysis shows that throughout the U.S. history, in terms of aggregate damages, tornado and flood were the most harmful weather events for public health and economy, respectively. In terms of per-event average, in recent times, Tsunami and hurricane caused the most damages to public health and U.S. economy while in 1950 to 1993, flood and wildfire was causing such damages to public health and cities’ economy.
In this section, I explain each of the steps for pre-processing the data including downloading, importing, and cleaning the data and transforming the variables.
The NOAA storm database is in .csv format compressed via bzip2. I first set the working directory and then download and save the database.
setwd("D:/R/Reproducible Research/Assignment_2/")
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "StormData.csv.bz2")
Since thhe read.csv() package in R has the capability to load the compressed files automatically, I use it to load the database directly into R. I use the code inside an if statement which checks whether the database is already loaded in the workspace. If data is not loaded already, this step will take some time due to the size of the uncompressed file.
if (!"NOAA_Storm_Data" %in% ls()) {
NOAA_Storm_Data <- read.csv(bzfile("./StormData.csv.bz2"))
}
The following code shows that there are 902,297 observations and 37 variables in the database.
dim(NOAA_Storm_Data)
## [1] 902297 37
The following code takes care of converting the BGN_DATE variable into a R date format varible BGN_DATE_F and shows that the data spans from January 3, 1950 to November 30, 2011.
NOAA_Storm_Data$BGN_DATE_F <- format(as.Date(NOAA_Storm_Data$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"))
cat("Data spans from",min(NOAA_Storm_Data$BGN_DATE_F),"to",max(NOAA_Storm_Data$BGN_DATE_F))
## Data spans from 1950-01-03 to 2011-11-30
The EVTYPE variable which indicates the type of the weather event is very messy. The raw data shows 985 different categories for this variable. There are many typos in category names and same event is written in different names for different observations (e.g. Strong Winds, Strong winds, STRONG WINDS, Strong Wind, STRONG WIND). In addition, according to the table on page 6 of the National Weather Service Storm Data Documentation, NOAA only recognises 48 event categories which is far from 985 categories we see in the database. Therefore, we need to map these diverse categories to the official ones in order to make sense of the data. The following code takes care of mapping by string search and creates a new variable EVTYPECLEAN.
NOAA_Storm_Data$EVTYPECLEAN <- "Other"
NOAA_Storm_Data[grepl("Astronomical Low Tide", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Astronomical Low Tide"
NOAA_Storm_Data[grepl("Avalanche|Avalance", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Avalanche"
NOAA_Storm_Data[grepl("Blizzard", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Blizzard"
NOAA_Storm_Data[grepl("Debris Flow|slide|Landslump|Mud|Erosion|Erosin", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Debris Flow"
NOAA_Storm_Data[grepl("Fog|Vog", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Dense Fog"
NOAA_Storm_Data[grepl("Freezing Fog", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Freezing Fog"
NOAA_Storm_Data[grepl("Dense Smoke", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Dense Smoke"
NOAA_Storm_Data[grepl("Drought", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Drought"
NOAA_Storm_Data[grepl("Dust", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Dust Storm"
NOAA_Storm_Data[grepl("Dust Devil", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Dust Devil"
NOAA_Storm_Data[grepl("Heat|Warm|Dry|Hot|Driest", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Heat"
NOAA_Storm_Data[grepl("Excessive Heat|Record Warm|Record High|Hyper|Temperature Record|Monthly Temperature|Record Temperature", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Excessive Heat"
NOAA_Storm_Data[grepl("Flood|FLD|Floyd|Rising Water|Dam Break|Dam Failure|Small Stream|Swell", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Flood"
NOAA_Storm_Data[grepl("Lakeshore Flood", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Lakeshore Flood"
NOAA_Storm_Data[grepl("Flash Flood|Flash Floood", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Flash Flood"
NOAA_Storm_Data[grepl("Coastal Flood", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Coastal Flood"
NOAA_Storm_Data[grepl("Frost|Freeze|Freezing|Glaze", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Frost/Freeze"
NOAA_Storm_Data[grepl("Funnel", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Funnel Cloud"
NOAA_Storm_Data[grepl("Hail", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Hail"
NOAA_Storm_Data[grepl("Marine Hail", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Marine Hail"
NOAA_Storm_Data[grepl("Rain|Shower|Wet|Drowning|Precip|Burst|Heavy Mix|Cloud", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Heavy Rain"
NOAA_Storm_Data[grepl("Snow", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Heavy Snow"
NOAA_Storm_Data[grepl("Lake-Effect Snow", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Lake-Effect Snow"
NOAA_Storm_Data[grepl("Surf|Tide|Wave|High Water|High Sea|Heavy Sea|Rough Sea|Surge|Turbulence", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "High Surf"
NOAA_Storm_Data[grepl("Wind|Wnd", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Strong Wind"
NOAA_Storm_Data[grepl("High Wind", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "High Wind"
NOAA_Storm_Data[grepl("Marine High Wind", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Marine High Wind"
NOAA_Storm_Data[grepl("Marine Strong Wind", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Marine Strong Wind"
NOAA_Storm_Data[grepl("Hurricane", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Hurricane"
NOAA_Storm_Data[grepl("Ice|Icy", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Ice Storm"
NOAA_Storm_Data[grepl("Lightning|Ligntning|Lighting|Lights", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Lightning"
NOAA_Storm_Data[grepl("Thunderstorm|TSTM", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Thunderstorm Wind"
NOAA_Storm_Data[grepl("Marine Thunderstorm Wind", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Marine Thunderstorm Wind"
NOAA_Storm_Data[grepl("Rip Current", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Rip Current"
NOAA_Storm_Data[grepl("Seiche", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Seiche"
NOAA_Storm_Data[grepl("Sleet", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Sleet"
NOAA_Storm_Data[grepl("Storm", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Storm Surge/Tide"
NOAA_Storm_Data[grepl("Tornado|Torndao|Typhoon", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Tornado"
NOAA_Storm_Data[grepl("Tropical Depression", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Tropical Depression"
NOAA_Storm_Data[grepl("Tropical Storm", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Tropical Storm"
NOAA_Storm_Data[grepl("Tsunami", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Tsunami"
NOAA_Storm_Data[grepl("Volcanic Ash|Volcanic", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Volcanic Ash"
NOAA_Storm_Data[grepl("Spout", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Waterspout"
NOAA_Storm_Data[grepl("Fire|Smoke", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Wildfire"
NOAA_Storm_Data[grepl("Winter|Cool|Low Temp|Wintry|Cold", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Winter Weather"
NOAA_Storm_Data[grepl("Cold Chill|Wind Chill", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Cold/Wind Chill"
NOAA_Storm_Data[grepl("Extreme Cold|Record Low|Hypo", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Extreme Cold/Wind Chill"
NOAA_Storm_Data[grepl("Winter Storm", NOAA_Storm_Data$EVTYPE, ignore.case = TRUE), "EVTYPECLEAN"] <- "Winter Storm"
After this categorisation, there are only 153 (0.0017%) observations left uncategorised which are included in the “Other” category.
In chapter “Economic Effects”, in order to analyse the damages done by weather events, I use two variables PROPDMG and CROPDMG and their dollar value unites stored in variables PROPDMGEXP and CROPDMGEXP. The units for the property damage are hundred, thousand, million, and billion dollars while the units for crop damages are thousand, million, and billion dollars. Following code calculates the dollar amount of the damages and stores them in two numerical variables PROPDMGE_Num and CROPDMGE_Num. This conversion will ignore cases such as “?”, “+”, and 0-8.
NOAA_Storm_Data$PROPDMGEXP <- tolower(NOAA_Storm_Data$PROPDMGEXP)
NOAA_Storm_Data$CROPDMGEXP <- tolower(NOAA_Storm_Data$CROPDMGEXP)
NOAA_Storm_Data$PROPDMGE_Num[NOAA_Storm_Data$PROPDMGEXP == "h"] <- NOAA_Storm_Data$PROPDMG[NOAA_Storm_Data$PROPDMGEXP == "h"] * 10^2
NOAA_Storm_Data$PROPDMGE_Num[NOAA_Storm_Data$PROPDMGEXP == "k"] <- NOAA_Storm_Data$PROPDMG[NOAA_Storm_Data$PROPDMGEXP == "k"] * 10^3
NOAA_Storm_Data$PROPDMGE_Num[NOAA_Storm_Data$PROPDMGEXP == "m"] <- NOAA_Storm_Data$PROPDMG[NOAA_Storm_Data$PROPDMGEXP == "m"] * 10^6
NOAA_Storm_Data$PROPDMGE_Num[NOAA_Storm_Data$PROPDMGEXP == "b"] <- NOAA_Storm_Data$PROPDMG[NOAA_Storm_Data$PROPDMGEXP == "b"] * 10^9
NOAA_Storm_Data$CROPDMGE_Num[NOAA_Storm_Data$CROPDMGEXP == "k"] <- NOAA_Storm_Data$CROPDMG[NOAA_Storm_Data$CROPDMGEXP == "k"] * 10^3
NOAA_Storm_Data$CROPDMGE_Num[NOAA_Storm_Data$CROPDMGEXP == "m"] <- NOAA_Storm_Data$CROPDMG[NOAA_Storm_Data$CROPDMGEXP == "m"] * 10^6
NOAA_Storm_Data$CROPDMGE_Num[NOAA_Storm_Data$CROPDMGEXP == "b"] <- NOAA_Storm_Data$CROPDMG[NOAA_Storm_Data$CROPDMGEXP == "b"] * 10^9
Analysing the number of events per year (which is the table output of the following code) reveals a significant increase in the number of observations since 1994 (the average number of events in each year is 14,553).
NOAA_Storm_Data$BGN_DATE_FY <- as.numeric(format(as.Date(NOAA_Storm_Data$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
Assuming the causes of this increase is a combination of some exogeneous variables, I divide the NOAA database into two databases: one from 1950 to the end of 1993 (database 1) and the other one from 1994 to the end of 2011 (database 2). This subsectioning also reduces the significant difference between mean and median in the overall database. From here on, I will perform analysis on each of these two subsets seperately.
NOAA_Storm_Data_1 <- NOAA_Storm_Data[NOAA_Storm_Data$BGN_DATE_FY <= 1993,]
NOAA_Storm_Data_2 <- NOAA_Storm_Data[NOAA_Storm_Data$BGN_DATE_FY >= 1994,]
To analyse the public health effects of weather events, I look at two variables: FATALITIES and INJURIES. In the two NOAA databases I only keep events that had at least one fatality/injury. I use two measures to find the most harmful weather events. First, I look at the total of fatalities and injuries in each sample. Second, I use the average number of fatalities and injuries (calculated as total fatalities and injuries divided by the number of events) as a ranking measure. Lastly, I sort the databases descending to find the top ranked harmful weather events.
suppressMessages(library(dplyr))
# Calculations for dataset 1: 1950-1993
NOAA_Storm_Data_1_FI <- subset(NOAA_Storm_Data_1, FATALITIES + INJURIES > 0)
Fatal_Inj_1_Sum <- as.data.frame(aggregate(FATALITIES + INJURIES ~ EVTYPECLEAN, data = NOAA_Storm_Data_1_FI, FUN = sum))
Fatal_Inj_1_Sum <- as.data.frame(arrange(Fatal_Inj_1_Sum, desc(`FATALITIES + INJURIES`), EVTYPECLEAN))
Fatal_Inj_1_Mean <- as.data.frame(aggregate(FATALITIES + INJURIES ~ EVTYPECLEAN, data = NOAA_Storm_Data_1_FI, FUN = mean))
Fatal_Inj_1_Mean <- as.data.frame(arrange(Fatal_Inj_1_Mean, desc(`FATALITIES + INJURIES`), EVTYPECLEAN))
# Calculations for dataset 2: 1994-2011
NOAA_Storm_Data_2_FI <- subset(NOAA_Storm_Data_2, FATALITIES + INJURIES > 0)
Fatal_Inj_2_Sum <- as.data.frame(aggregate(FATALITIES + INJURIES ~ EVTYPECLEAN, data = NOAA_Storm_Data_2_FI, FUN = sum))
Fatal_Inj_2_Sum <- as.data.frame(arrange(Fatal_Inj_2_Sum, desc(`FATALITIES + INJURIES`), EVTYPECLEAN))
Fatal_Inj_2_Mean <- as.data.frame(aggregate(FATALITIES + INJURIES ~ EVTYPECLEAN, data = NOAA_Storm_Data_2_FI, FUN = mean))
Fatal_Inj_2_Mean <- as.data.frame(arrange(Fatal_Inj_2_Mean, desc(`FATALITIES + INJURIES`), EVTYPECLEAN))
Figure 1 presents a panel plot illustrating both the aggregate and average fatalities and injuries for the top-10 most harmful weather events for 1950-1993 and 1994-2011.
In order to assess the extent of economic effects of weather events, I look at two variables: PROPDMGE_Num and CROPDMGE_Num. I follow a similar strategy as I used before for analysing the public health effects, i.e. subsetting both databases to only events with at least $1 in damages and using total and per-event average damages to find the top ranked harmful weather events.
# Calculations for dataset 1: 1950-1993
NOAA_Storm_Data_1_PC <- subset(NOAA_Storm_Data_1, PROPDMGE_Num + CROPDMGE_Num > 0)
Prop_Crop_1_Sum <- aggregate(PROPDMGE_Num + CROPDMGE_Num ~ EVTYPECLEAN, data = NOAA_Storm_Data_1_PC, FUN = sum)
Prop_Crop_1_Sum <- arrange(Prop_Crop_1_Sum, desc(`PROPDMGE_Num + CROPDMGE_Num`), EVTYPECLEAN)
Prop_Crop_1_Mean <- aggregate(PROPDMGE_Num + CROPDMGE_Num ~ EVTYPECLEAN, data = NOAA_Storm_Data_1_PC, FUN = mean)
Prop_Crop_1_Mean <- arrange(Prop_Crop_1_Mean, desc(`PROPDMGE_Num + CROPDMGE_Num`), EVTYPECLEAN)
# Calculations for dataset 2: 1994-2011
NOAA_Storm_Data_2_PC <- subset(NOAA_Storm_Data_2, PROPDMGE_Num + CROPDMGE_Num > 0)
Prop_Crop_2_Sum <- aggregate(PROPDMGE_Num + CROPDMGE_Num ~ EVTYPECLEAN, data = NOAA_Storm_Data_2_PC, FUN = sum)
Prop_Crop_2_Sum <- arrange(Prop_Crop_2_Sum, desc(`PROPDMGE_Num + CROPDMGE_Num`), EVTYPECLEAN)
Prop_Crop_2_Mean <- aggregate(PROPDMGE_Num + CROPDMGE_Num ~ EVTYPECLEAN, data = NOAA_Storm_Data_2_PC, FUN = mean)
Prop_Crop_2_Mean <- arrange(Prop_Crop_2_Mean, desc(`PROPDMGE_Num + CROPDMGE_Num`), EVTYPECLEAN)
Figure 2 presents a panel plot illustrating both the aggregate and average property and crop damage for the top-10 most harmful weather events for 1950-1993 and 1994-2011.
Figure 1 illustrates the health effects for the top-10 most harmful weather events. Throughout the whole sample, on aggregate tornado was the most harmful weather event to the public health. Recent data shows that on average tsunami had the most fatalities and injuries while from 1950 to 1993, wildfire had such effects.
par(mfrow = c(2, 2))
# Top Left
bp <- barplot(Fatal_Inj_1_Sum[1:10,2], xaxt = "n", main = "Health Effects of Events - 1950-1993", ylab = "Aggregate Fatality and Injury", col = rgb(0,0,1,1/4))
text(cex = 1, x = bp + 0.25, y = -1.25, Fatal_Inj_1_Sum[1:10,1], xpd = TRUE, srt = 45, pos = 2)
# Top right plot
bp <- barplot(Fatal_Inj_1_Mean[1:10,2], xaxt = "n", main = "Health Effects of Events - 1950-1993", ylab = "Average Fatality and Injury", col = rgb(0,0,1,1/4))
text(cex = 1, x = bp + 0.25, y = -1.25, Fatal_Inj_1_Mean[1:10,1], xpd = TRUE, srt = 45, pos = 2)
# Bottom left plot
bp <- barplot(Fatal_Inj_2_Sum[1:10,2], xaxt = "n", main = "Health Effects of Events - 1994-2011", ylab = "Aggregate Fatality and Injury", col = rgb(0,1,0,1/4))
text(cex = 1, x = bp + 0.25, y = -1.25, Fatal_Inj_2_Sum[1:10,1], xpd = TRUE, srt = 45, pos = 2)
# Bottom right plot
bp <- barplot(Fatal_Inj_2_Mean[1:10,2], xaxt = "n", main = "Health Effects of Events - 1994-2011", ylab = "Average Fatality and Injury", col = rgb(0,1,0,1/4))
text(cex = 1, x = bp + 0.25, y = -1.25, Fatal_Inj_2_Mean[1:10,1], xpd = TRUE, srt = 45, pos = 2)
Figure 2 illustrates the economic effects for the top-10 most harmful weather events. In terms of economic effects, throughout the whole sample, flood was the most harmful weather event. In recent memory, on per-event average, hurricane caused the most damages to the properties and crops while in the period of 1950-1993, flood was causing such damages aside from being the top most harmful event to the public health.
par(mfrow = c(2, 2))
# Top Left
bp <- barplot(Prop_Crop_1_Sum[1:10,2]/10^6, xaxt = "n", main = "Economic Effects of Events - 1950-1993", ylab = "Aggregate Damage (x10^6)", col = rgb(0,0,1,1/2))
text(cex = 1, x = bp + 0.25, y = -1.25, Prop_Crop_1_Sum[1:10,1], xpd = TRUE, srt = 45, pos = 2)
# Top right plot
bp <- barplot(Prop_Crop_1_Mean[1:10,2]/10^6, xaxt = "n", main = "Economic Effects of Events - 1950-1993", ylab = "Average Damage (x10^6)", col = rgb(0,0,1,1/2))
text(cex = 1, x = bp + 0.25, y = -1.25, Prop_Crop_1_Mean[1:10,1], xpd = TRUE, srt = 45, pos = 2)
# Bottom left plot
bp <- barplot(Prop_Crop_2_Sum[1:10,2]/10^6, xaxt = "n", main = "Economic Effects of Events - 1994-2011", ylab = "Aggregate Damage (x10^6)", col = rgb(0,1,0,1/2))
text(cex = 1, x = bp + 0.25, y = -1.25, Prop_Crop_2_Sum[1:10,1], xpd = TRUE, srt = 45, pos = 2)
# Bottom right plot
bp <- barplot(Prop_Crop_2_Mean[1:10,2]/10^6, xaxt = "n", main = "Economic Effects of Events - 1994-2011", ylab = "Average Damage (x10^6)", col = rgb(0,1,0,1/2))
text(cex = 1, x = bp + 0.25, y = -1.25, Prop_Crop_2_Mean[1:10,1], xpd = TRUE, srt = 45, pos = 2)