The U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database was used to answer the question “which types of events are the most harmful with respect to population health?” and “which types of events have the greatest economic consequences?”.
The event types were cleaned and grouped together before comparing their sample means of the grouped event types. Average injuries and fatalities per event were used to guage impact on population health and average estimated property and crop damage were used to guage the economic consequences.
A one way analysis of the sample means of various event types as well as an ANOVA/linear regression was performed to guide my assessment as to which event types had the highest consequences and were significantly different from the others.
The data showed that hot/dry weather and hurricanes were associated with the most harm with respect to population health. Hurricanes and storms were associated with the highest economic consequences.
Note that other factors were not fitted or examined. This is not the “pure effect” and nor are we establishing causation; merely association. This is simply a one way view of event types and their health/economic impacts.
The U.S. NOAA database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
library(ggplot2)
library(gridExtra)
library(data.table)
# custom plotting function
andrewsPlot <- function(response, ylab, title){
data$eventCleaned2 <- as.factor(data$eventCleaned2)
lmFit <- lm(get(response) ~ eventCleaned2, data=data)
# creating a data frame with estimates and standard errors for plotting
lmPred <- data.table("factor" = substring(names(lmFit$coefficients), 14), "coeff" = lmFit$coefficients)
# adding intercept term to estimates
lmPred[-1,2] <- lmPred$coeff[-1] + lmFit$coefficients[1]
lmPred[1,1] = "blizzard"
# standard error of betas
lmPred <- lmPred[, coeffSElow := coeff - summary(lmFit)$coefficients[,2]]
lmPred <- lmPred[, coeffSEhigh := coeff + summary(lmFit)$coefficients[,2]]
#plotting everything
g1 <- ggplot(data, aes(eventCleaned2, get(response), group = 1)) +
stat_summary(fun.y=mean, geom="bar") +
geom_point(data=lmPred, aes(factor, coeff, colour="mean")) +
geom_point(data=lmPred, aes(factor, coeffSElow, colour="standard error"), shape=2) +
geom_point(data=lmPred, aes(factor, coeffSEhigh, colour="standard error"), shape=6) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
theme(legend.position = "bottom") +
scale_colour_manual(values=c("red","blue")) +
labs(x="Event type", y=ylab) +
ggtitle(title)
return(g1)
}
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
,"C:/Users/andre/Google Drive/Programming/Coursera/JHU Data Specialisation/Course 5/Assignment2/data.bz2" )
# date downloaded 10th December 2018
data <- read.csv("C:/Users/andre/Google Drive/Programming/Coursera/JHU Data Specialisation/Course 5/Assignment2/data.bz2")
data <- as.data.table(data)
The variable of interest here is event type, “EVTYPE”, with the issue being that it has 985 levels, many of which are redundant/duplicates. I begin by grouping the related event types by searching the strings for keywords. my goal is to:
Both of the above are important for us to be able to draw insights from the data.
#str(data)
#levels(data$EVTYPE)
# setting default eventCleaned value
data$eventCleaned <- "Other"
# defining a function to search for keywords
regxfunc <- function (var){
data[grep(var, data$EVTYPE, ignore.case = TRUE), eventCleaned := var]
}
# look for keywords in the "EVTYPE" field.
# phrases that appear later in the loop will take precedence. Therefore, I place more generic keywords at the
# beginning of the loop.
for (i in c("wind", "storm", "cold", "winter", "wintry", "cool", "wet", "hot", "heat", "warm", "dry", "freeze", "flood", "rain", "snow", "light", "thunder", "blizzard", "fire", "ice", "hail", "landslide", "fog", "tsunami", "icy", "other", "frost", "wave", "surf", "tornado", "hurricane", "hail")){
regxfunc(i)
}
# check how many uncleaned event types left (this is an iterative process, adding more keywords to the group # to reduce the number of cleaned event types and checking the distribution of cleaned event types afterwards)
#table(data[eventCleaned == "Other"]$EVTYPE)
#distribution of cleaned event types, look for similar categories that can be grouped together
table(data$eventCleaned)
##
## blizzard cold cool dry fire flood fog
## 2744 2439 18 307 4240 82673 1883
## freeze frost hail heat hot hurricane ice
## 99 1413 290401 2569 14 288 2193
## icy landslide light other Other rain snow
## 32 613 15975 59 19015 12169 17432
## storm surf thunder tornado tsunami warm wave
## 1570 1066 109493 60699 20 313 87
## wet wind winter wintry
## 42 252738 19599 94
barplot(prop.table(table(data$eventCleaned)))
#cleaning stage 2 - grouping keywords together
data$eventCleaned2 <- data$eventCleaned
data$eventCleaned2[data$eventCleaned == "cold"| data$eventCleaned == "winter" | data$eventCleaned == "wintry" | data$eventCleaned == "cool"] = "wintry"
data$eventCleaned2[data$eventCleaned == "hot" | data$eventCleaned == "heat" | data$eventCleaned == "dry" | data$eventCleaned == "warm"] = "hot/dry"
data$eventCleaned2[data$eventCleaned == "snow" | data$eventCleaned == "ice" | data$eventCleaned == "icy" | data$eventCleaned == "frost" | data$eventCleaned == "freeze"] = "snow/ice/frost"
data$eventCleaned2[data$eventCleaned == "rain" | data$eventCleaned == "wet"] = "rain"
data$eventCleaned2[data$eventCleaned == "other" | data$eventCleaned == "Other"] = "other"
data$eventCleaned2[data$eventCleaned == "tsunami" | data$eventCleaned == "surf" | data$eventCleaned == "wave"] = "tsunami/surf/wave"
# final distribution of cleaned event types
table(data$eventCleaned2)
##
## blizzard fire flood fog
## 2744 4240 82673 1883
## hail hot/dry hurricane landslide
## 290401 3203 288 613
## light other rain snow/ice/frost
## 15975 19074 12211 21169
## storm thunder tornado tsunami/surf/wave
## 1570 109493 60699 1173
## wind wintry
## 252738 22150
Property and crop damage is split up into two fields each: one representing the significant figures and the other representing the exponent. Referring to the metadata (https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf) gives us a mapping for the exponent. A quick look at the distribution of the exponent shows there are a few nonsensical values which I ignore here due to immateriality.
# distribution of the units of damage
table(data$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
table(data$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
# it looks like there are a small number of nonsensical inputs. Given how few of them there are, it's probably okay to either ignore or exclude them.
data <- data[, propDMGcleaned := PROPDMG]
data <- data[PROPDMGEXP == "B" | PROPDMGEXP == "b", propDMGcleaned := PROPDMG * 10^9]
data <- data[PROPDMGEXP == "K" | PROPDMGEXP == "k", propDMGcleaned := PROPDMG * 10^3]
data <- data[PROPDMGEXP == "M" | PROPDMGEXP == "m", propDMGcleaned := PROPDMG * 10^6]
data <- data[, cropDMGcleaned := CROPDMG]
data <- data[CROPDMGEXP == "B" | CROPDMGEXP == "b", cropDMGcleaned := CROPDMG * 10^9]
data <- data[CROPDMGEXP == "K" | CROPDMGEXP == "k", cropDMGcleaned := CROPDMG * 10^3]
data <- data[CROPDMGEXP == "M" | CROPDMGEXP == "m", cropDMGcleaned := CROPDMG * 10^6]
To see if there were any statistically significant differneces between the health consequences (fatailities and injuries) and economic consequences (estimate property and crop damage), I did the following:
I examined which event types were the most linked to injuries and fatalities by looking at the average number of injuries and fatalities per event. The event types with the highest levels of associated (correlation not causation) injuries are: hurricane, hot/dry, tornado, tsunami/surf/wave. The event types with the highest levels of associated (correlation not causation) injuries are: hot/dry, hurricane, tsunami/surf/wave, tornado. It is interesting to note that the top fmy event types are the same for both injuries and fatalities. The top fmy event types all generally have p values of less than 5% as well indicating that their effect is likley not zero.
# grid.arrange(g1, g2, nrow = 1)
# fit linear model
data$eventCleaned2 <- as.factor(data$eventCleaned2)
lmFit1 <- lm(FATALITIES ~ eventCleaned2, data=data)
summary(lmFit1)
##
## Call:
## lm(formula = FATALITIES ~ eventCleaned2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94 -0.01 0.00 0.00 582.06
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.036808 0.014562 2.528 0.011483 *
## eventCleaned2fire -0.015581 0.018689 -0.834 0.404448
## eventCleaned2flood -0.018374 0.014802 -1.241 0.214489
## eventCleaned2fog 0.006209 0.022827 0.272 0.785622
## eventCleaned2hail -0.036653 0.014631 -2.505 0.012238 *
## eventCleaned2hot/dry 0.899814 0.019842 45.349 < 2e-16 ***
## eventCleaned2hurricane 0.431942 0.047248 9.142 < 2e-16 ***
## eventCleaned2landslide 0.026814 0.034077 0.787 0.431365
## eventCleaned2light 0.014397 0.015763 0.913 0.361050
## eventCleaned2other 0.010115 0.015574 0.649 0.516037
## eventCleaned2rain -0.027881 0.016115 -1.730 0.083610 .
## eventCleaned2snow/ice/frost -0.024006 0.015477 -1.551 0.120886
## eventCleaned2storm 0.037715 0.024138 1.562 0.118186
## eventCleaned2thunder -0.034871 0.014743 -2.365 0.018019 *
## eventCleaned2tornado 0.056044 0.014887 3.765 0.000167 ***
## eventCleaned2tsunami/surf/wave 0.290558 0.026610 10.919 < 2e-16 ***
## eventCleaned2wind -0.032942 0.014641 -2.250 0.024448 *
## eventCleaned2wintry -0.004708 0.015438 -0.305 0.760377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7628 on 902279 degrees of freedom
## Multiple R-squared: 0.006494, Adjusted R-squared: 0.006475
## F-statistic: 346.9 on 17 and 902279 DF, p-value: < 2.2e-16
lmFit2 <- lm(INJURIES ~ eventCleaned2, data=data)
summary(lmFit2)
##
## Call:
## lm(formula = INJURIES ~ eventCleaned2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.61 -0.06 -0.04 -0.01 1698.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.29373 0.10340 2.841 0.00450 **
## eventCleaned2fire 0.08551 0.13270 0.644 0.51932
## eventCleaned2flood -0.18968 0.10510 -1.805 0.07111 .
## eventCleaned2fog 0.27823 0.16208 1.717 0.08606 .
## eventCleaned2hail -0.28868 0.10389 -2.779 0.00546 **
## eventCleaned2hot/dry 2.47804 0.14089 17.588 < 2e-16 ***
## eventCleaned2hurricane 4.31738 0.33549 12.869 < 2e-16 ***
## eventCleaned2landslide -0.20727 0.24197 -0.857 0.39166
## eventCleaned2light 0.03384 0.11193 0.302 0.76238
## eventCleaned2other -0.23365 0.11059 -2.113 0.03461 *
## eventCleaned2rain -0.26892 0.11443 -2.350 0.01877 *
## eventCleaned2snow/ice/frost -0.13548 0.10990 -1.233 0.21764
## eventCleaned2storm 0.25913 0.17140 1.512 0.13056
## eventCleaned2thunder -0.27109 0.10469 -2.590 0.00961 **
## eventCleaned2tornado 1.21217 0.10571 11.467 < 2e-16 ***
## eventCleaned2tsunami/surf/wave 0.36356 0.18895 1.924 0.05434 .
## eventCleaned2wind -0.25866 0.10396 -2.488 0.01284 *
## eventCleaned2wintry -0.19048 0.10962 -1.738 0.08226 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.416 on 902279 degrees of freedom
## Multiple R-squared: 0.005739, Adjusted R-squared: 0.00572
## F-statistic: 306.3 on 17 and 902279 DF, p-value: < 2.2e-16
g1 <- andrewsPlot("FATALITIES", "Avg. no. of fatalities", "Fatalities per event")
g2 <- andrewsPlot("INJURIES", "Ave. no. of injuries", "Injuries per event")
grid.arrange(g1, g2, nrow = 1)
I examined which event types were the most linked to crop and property damage by looking at the property and crop damage per event. Judging from the p-values of the linear models as well as looking at the figures below of the average economic damage from events, the bulk of the event types do relatively little damage compared to hurricanes (which is both drastically different in terms of mean cost as well as incredibly statistically significant for both property damage and crop damage models) and storms which cause the second highest average levels of property damage.
#
# fit linear model
data$eventCleaned2 <- as.factor(data$eventCleaned2)
lmFit2 <- lm(propDMGcleaned ~ eventCleaned2, data=data)
summary(lmFit2)
##
## Call:
## lm(formula = propDMGcleaned ~ eventCleaned2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.943e+08 -6.068e+04 -6.068e+04 -4.183e+04 1.150e+11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 242316 2466291 0.098 0.922
## eventCleaned2fire 1762786 3165290 0.557 0.578
## eventCleaned2flood 1783755 2506886 0.712 0.477
## eventCleaned2fog -229033 3866063 -0.059 0.953
## eventCleaned2hail -181641 2477916 -0.073 0.942
## eventCleaned2hot/dry -237196 3360584 -0.071 0.944
## eventCleaned2hurricane 294049976 8002260 36.746 <2e-16 ***
## eventCleaned2landslide 287459 5771514 0.050 0.960
## eventCleaned2light -183670 2669717 -0.069 0.945
## eventCleaned2other -150995 2637733 -0.057 0.954
## eventCleaned2rain 23273 2729367 0.009 0.993
## eventCleaned2snow/ice/frost -5532 2621267 -0.002 0.998
## eventCleaned2storm 35226089 4088223 8.616 <2e-16 ***
## eventCleaned2thunder -183571 2497004 -0.074 0.941
## eventCleaned2tornado 696631 2521421 0.276 0.782
## eventCleaned2tsunami/surf/wave -11500 4506840 -0.003 0.998
## eventCleaned2wind -200490 2479643 -0.081 0.936
## eventCleaned2wintry 74697 2614597 0.029 0.977
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 129200000 on 902279 degrees of freedom
## Multiple R-squared: 0.001798, Adjusted R-squared: 0.00178
## F-statistic: 95.62 on 17 and 902279 DF, p-value: < 2.2e-16
lmFit3 <- lm(cropDMGcleaned ~ eventCleaned2, data=data)
summary(lmFit3)
##
## Call:
## lm(formula = cropDMGcleaned ~ eventCleaned2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19150322 -10724 -6836 -5228 4999851682
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40838 155626 0.262 0.7930
## eventCleaned2fire 54275 199734 0.272 0.7858
## eventCleaned2flood 107480 158188 0.679 0.4969
## eventCleaned2fog -40838 243954 -0.167 0.8671
## eventCleaned2hail -30114 156360 -0.193 0.8473
## eventCleaned2hot/dry 239803 212058 1.131 0.2581
## eventCleaned2hurricane 19109484 504954 37.844 < 2e-16 ***
## eventCleaned2landslide -8184 364191 -0.022 0.9821
## eventCleaned2light -40081 168463 -0.238 0.8119
## eventCleaned2other 692250 166445 4.159 3.2e-05 ***
## eventCleaned2rain 51768 172227 0.301 0.7637
## eventCleaned2snow/ice/frost 297338 165406 1.798 0.0722 .
## eventCleaned2storm 404612 257973 1.568 0.1168
## eventCleaned2thunder -34876 157564 -0.221 0.8248
## eventCleaned2tornado -34002 159105 -0.214 0.8308
## eventCleaned2tsunami/surf/wave -34768 284388 -0.122 0.9027
## eventCleaned2wind -35610 156469 -0.228 0.8200
## eventCleaned2wintry 22286 164985 0.135 0.8925
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8152000 on 902279 degrees of freedom
## Multiple R-squared: 0.001969, Adjusted R-squared: 0.00195
## F-statistic: 104.7 on 17 and 902279 DF, p-value: < 2.2e-16
g3 <- andrewsPlot("propDMGcleaned", "Avg. property damage ($)", "Property damage")
g4 <- andrewsPlot("cropDMGcleaned", "Avg. crop damag ($)", "Crop damage")
grid.arrange(g3, g4, nrow = 1)