The purpose of this data analysis is to address the Effect of severe weather events in population health & economic impact across the United States.
This document describes the steps taken to address the (2) main questions, as well as the code chunks (written in R) for each step performed upon the original data set. The study was performed using the U.S. National Oceanic and Athmospheric Administration´s storm database. The analysis recodes certain variables and creates an Economic Indicator as well as a Health Indicator to assist the study.
The analysis answers:
To begin the analysis, load the required packages into R:
library(dplyr)
library(stringdist)
library(stargazer)
library(dplyr)
library(ggplot2)
The data is loaded from the official download link provided and stored in the StormData object.
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists("./repdata_data_StormData.csv")) {
download.file(fileURL, destfile = "repdata_data_StormData.csv.bz2")
unzip("./repdata_data_StormData.csv.bz2")
}
StormData <- read.csv("repdata_data_StormData.csv", stringsAsFactors = FALSE, na.strings = c("NA", ""))
The beggining date (BGN_DATE) is changed into Date format. Notice that for every other date/time variable (such as BGN_TIME, END_DATE, etc.), the format is kept the same for sake of simplicity.
StormData$BGN_DATE <- as.Date(StormData$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")
According to the official NOAA Storm Database documentation, there are only 48 official event types, a quick glance of our StormData data set shows 985 unique “categories” for the EVTYPE variable.
length(unique(StormData$EVTYPE))
## [1] 985
This means that there are certain event types in the original dataset not included in the official event type list.
The weather events will be reclassified so they match the official event type list:
The official list is built according to the documentation:
OEVT <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke", "Drought", "Dust Devil", "Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze", "Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Hurricane (Typhoon)", "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Winter Weather")
Then, each observation of the EVTYPE column of the StormData dataset is renamed to its closest match using Approximate String Matching via “amatch” from the “stringdist” package.
StormData$EVTYPE <- OEVT[amatch(StormData$EVTYPE, table = OEVT, nomatch = "Other", maxDist = 15)]
Notice that this method will probably not be perfect for every observation and because the maxDist parameter is set to 15 (to avoid illogical cases), some events might be classified as NA(coerced).
Events coerced to NA´s:
sum(is.na(StormData$EVTYPE))
## [1] 11683
Events classified as NA will be renamed to “Other” (Event types that couldn´t be matched to an official event type.)
StormData$EVTYPE[is.na(StormData$EVTYPE)] <- "Other"
The value for both CROPDMG and PROPDMG variables are in different scales specifeied by both CROPDMGEXP & PROPDMGEXP. A work (Provided by Usama A.F. Khalil under the discussion forums for this course) on How To Handle Exponent Value of PROPDMGEXP and CROPDMGEXP was done to determine the correct multiplier values for each “legend”.
The following chunk of code creates an auxiliary dataframe that will be used to scale both crop damage and property damage variables:
expValues <- data.frame(legend = c(unique(StormData$PROPDMGEXP), "k"), multiplier = c(1000, 1000000, 0, 1000000000, 1000000, 1, 10, 10, 10, 0, 10, 10, 10, 100, 10, 100, 0, 10, 10, 1000))
The multiplier for each legend:
stargazer(expValues, summary = FALSE, type = "html", title = "Exponent Values and Multipliers")
legend | multiplier | |
1 | K | 1,000 |
2 | M | 1,000,000 |
3 | 0 | |
4 | B | 1,000,000,000 |
5 | m | 1,000,000 |
6 | + | 1 |
7 | 0 | 10 |
8 | 5 | 10 |
9 | 6 | 10 |
10 | ? | 0 |
11 | 4 | 10 |
12 | 2 | 10 |
13 | 3 | 10 |
14 | h | 100 |
15 | 7 | 10 |
16 | H | 100 |
17 | - | 0 |
18 | 1 | 10 |
19 | 8 | 10 |
20 | k | 1,000 |
Property damage and crop damage are scaled to millions of USD (by first multipliying the value times the multiplier according to its legend) into new variables (PROPDMG.USD & CROPDMG.USD).
## PROPERTY DAMAGE:
StormData <- mutate(StormData, PROPDMG.USD = (StormData$PROPDMG * (expValues$multiplier[match(StormData$PROPDMGEXP, expValues$legend)]))/1000000)
## CROP DAMAGE:
StormData <- mutate(StormData, CROPDMG.USD = (StormData$CROPDMG * (expValues$multiplier[match(StormData$CROPDMGEXP, expValues$legend)]))/1000000)
Another variable is created: “Total Victims Indicator” ( TVI ) which is the sum of fatalities and injuries \[TVI = \text{Fatalities} + \text{Injuries}\]
StormData <- mutate(StormData, TVI = StormData$FATALITIES + StormData$INJURIES )
Another variable is created: Total Damage Indicator (TDI) which is the sum of property damage (in millions USD) and crop damage (in millions USD) \[TDI = \text{Property Damage }_{M.USD} + \text{Crop Damage }_{M.USD}\]
StormData <- mutate(StormData, TDI = StormData$PROPDMG.USD + StormData$CROPDMG.USD )
The data will be summarized calculating the mean TVI across each event type to obtain the average number of victims for each type. This method was chosen because it ignores the frequency of occurrence for each event type (the mean could be interpreted as an “expected number of victims”). In contrast; a summary by total number of victims (sum TVI), would accumulate fatalities and injuries from 1950 to 2011 for each event type.
The study will focus on the top 5 event types (with higher TVI values)
HealthSummary <- StormData %>% group_by(EVTYPE) %>% summarise(mean.TVI = mean(TVI))
HealthSummary <- arrange(HealthSummary, desc(mean.TVI))
HealthSummary <- HealthSummary[1:5, ]
The data will be summarized calculating the mean TDI across each event type to obtain the average economic damage for each type. The reason this method was chosen is explained in the previous section.
NOTICE: The average TDI could be interpreted as the expected economic damage (in million USD) for each event type.
The study will focus on the top 5 event types (with higher TDI values)
EconSummary <- StormData %>% group_by(EVTYPE) %>% summarise(mean.TDI = mean(TDI))
EconSummary <- arrange(EconSummary, desc(mean.TDI))
EconSummary <- EconSummary[1:5, ]
From the analysis performed upon the original data, the event types that present (on average) higher values for TVI (Total Victims Indicator), or fatalities & injuries are: Excessive Heat, Heat, Seiches, Tornadoes and Rip Currents. Notice that the frequency of occurrence for this events is not taken into consideration.
g <- ggplot(HealthSummary, aes(x = EVTYPE, y = mean.TVI))
g + geom_col(aes(fill = EVTYPE)) + labs(title = "Weather Event Impact on Population Health", subtitle = "Top 5 - US: 1950-2011") + labs(x = "Event Type", y = "Mean TVI")
Figure 2: Health Impact
From the analysis performed upon the original data, the event types that present (on average) higher values for TDI (Total Damage Indicator), or Crop Damages & Property Damages are: Dense Smoke, Storm surge/tide, Other (events not matched to official event categories), Droughts and Floods. Notice that the frequency of occurrence for this events is not taken into consideration.
g <- ggplot(EconSummary, aes(x = EVTYPE, y = mean.TDI))
g + geom_col(aes(fill = EVTYPE)) + labs(title = "Weather Event Impact on Economic Damage", subtitle = "Top 5 - US: 1950-2011") + labs(x = "Event Type", y = "Mean TDI (millions of USD)") + scale_fill_brewer(palette = "Dark2")
Figure 3: Economic Damage