Synopsis

The U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database was used to answer the question “which types of events are the most harmful with respect to population health?” and “which types of events have the greatest economic consequences?”.

The event types were cleaned and grouped together before comparing their sample means of the grouped event types. Average injuries and fatalities per event were used to guage impact on population health and average estimated property and crop damage were used to guage the economic consequences.

A one way analysis of the sample means of various event types as well as an ANOVA/linear regression was performed to guide my assessment as to which event types had the highest consequences and were significantly different from the others.

The data showed that hot/dry weather and hurricanes were associated with the most harm with respect to population health. Hurricanes and storms were associated with the highest economic consequences.

Note that other factors were not fitted or examined. This is not the “pure effect” and nor are we establishing causation; merely association. This is simply a one way view of event types and their health/economic impacts.

Data processing

The U.S. NOAA database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

library(ggplot2)
library(gridExtra)
library(data.table)

# custom plotting function
andrewsPlot <- function(response, ylab, title){
        data$eventCleaned2 <- as.factor(data$eventCleaned2)
        lmFit <- lm(get(response) ~ eventCleaned2, data=data)
        
        # creating a data frame with estimates and standard errors for plotting
        lmPred <- data.table("factor" = substring(names(lmFit$coefficients), 14), "coeff" =  lmFit$coefficients)
        # adding intercept term to estimates
        lmPred[-1,2] <- lmPred$coeff[-1] + lmFit$coefficients[1]
        lmPred[1,1] = "blizzard"
        # standard error of betas
        lmPred <- lmPred[, coeffSElow := coeff - summary(lmFit)$coefficients[,2]]
        lmPred <- lmPred[, coeffSEhigh := coeff + summary(lmFit)$coefficients[,2]]
        
        #plotting everything
        g1 <- ggplot(data, aes(eventCleaned2, get(response), group = 1)) + 
                stat_summary(fun.y=mean, geom="bar") +
                geom_point(data=lmPred, aes(factor, coeff, colour="mean")) +
                geom_point(data=lmPred, aes(factor, coeffSElow, colour="standard error"), shape=2) +
                geom_point(data=lmPred, aes(factor, coeffSEhigh, colour="standard error"), shape=6) +
                theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
                theme(legend.position = "bottom") +
                scale_colour_manual(values=c("red","blue")) + 
                labs(x="Event type", y=ylab) +
                ggtitle(title)
        return(g1)
}

download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
              ,"C:/Users/andre/Google Drive/Programming/Coursera/JHU Data Specialisation/Course 5/Assignment2/data.bz2" )
# date downloaded 10th December 2018
data <- read.csv("C:/Users/andre/Google Drive/Programming/Coursera/JHU Data Specialisation/Course 5/Assignment2/data.bz2")
data <- as.data.table(data)

The variable of interest here is event type, “EVTYPE”, with the issue being that it has 985 levels, many of which are redundant/duplicates. I begin by grouping the related event types by searching the strings for keywords. my goal is to:

reduce the number of groups, in particular, the number of groups with very few events, whilst;
keeping groupings relatively homogenous

Both of the above are important for us to be able to draw insights from the data.

#str(data)
#levels(data$EVTYPE)

# setting default eventCleaned value
data$eventCleaned <- "Other"

# defining a function to search for keywords
regxfunc <- function (var){
  data[grep(var, data$EVTYPE, ignore.case = TRUE), eventCleaned := var]
}

# look for keywords in the "EVTYPE" field.
# phrases that appear later in the loop will take precedence. Therefore, I place more generic keywords at the
# beginning of the loop.
for (i in c("wind", "storm", "cold", "winter", "wintry", "cool", "wet", "hot", "heat", "warm", "dry", "freeze", "flood", "rain", "snow", "light", "thunder", "blizzard", "fire", "ice", "hail", "landslide", "fog", "tsunami", "icy", "other", "frost", "wave", "surf", "tornado", "hurricane", "hail")){
  regxfunc(i)
}

# check how many uncleaned event types left (this is an iterative process, adding more keywords to the group  # to reduce the number of cleaned event types and checking the distribution of cleaned event types afterwards)
#table(data[eventCleaned  == "Other"]$EVTYPE)

#distribution of cleaned event types, look for similar categories that can be grouped together
table(data$eventCleaned)

## 
##  blizzard      cold      cool       dry      fire     flood       fog 
##      2744      2439        18       307      4240     82673      1883 
##    freeze     frost      hail      heat       hot hurricane       ice 
##        99      1413    290401      2569        14       288      2193 
##       icy landslide     light     other     Other      rain      snow 
##        32       613     15975        59     19015     12169     17432 
##     storm      surf   thunder   tornado   tsunami      warm      wave 
##      1570      1066    109493     60699        20       313        87 
##       wet      wind    winter    wintry 
##        42    252738     19599        94

barplot(prop.table(table(data$eventCleaned)))

#cleaning stage 2 - grouping keywords together
data$eventCleaned2 <- data$eventCleaned
data$eventCleaned2[data$eventCleaned == "cold"| data$eventCleaned == "winter" | data$eventCleaned == "wintry" | data$eventCleaned == "cool"] = "wintry"
data$eventCleaned2[data$eventCleaned == "hot" | data$eventCleaned == "heat" | data$eventCleaned == "dry" | data$eventCleaned == "warm"] = "hot/dry"
data$eventCleaned2[data$eventCleaned == "snow" | data$eventCleaned == "ice" | data$eventCleaned == "icy" | data$eventCleaned == "frost" | data$eventCleaned == "freeze"] = "snow/ice/frost"
data$eventCleaned2[data$eventCleaned == "rain" | data$eventCleaned == "wet"] = "rain"
data$eventCleaned2[data$eventCleaned == "other" | data$eventCleaned == "Other"] = "other"
data$eventCleaned2[data$eventCleaned == "tsunami" | data$eventCleaned == "surf" | data$eventCleaned == "wave"] = "tsunami/surf/wave"

# final distribution of cleaned event types
table(data$eventCleaned2)

## 
##          blizzard              fire             flood               fog 
##              2744              4240             82673              1883 
##              hail           hot/dry         hurricane         landslide 
##            290401              3203               288               613 
##             light             other              rain    snow/ice/frost 
##             15975             19074             12211             21169 
##             storm           thunder           tornado tsunami/surf/wave 
##              1570            109493             60699              1173 
##              wind            wintry 
##            252738             22150

Property and crop damage is split up into two fields each: one representing the significant figures and the other representing the exponent. Referring to the metadata (https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf) gives us a mapping for the exponent. A quick look at the distribution of the exponent shows there are a few nonsensical values which I ignore here due to immateriality.

# distribution of the units of damage
table(data$PROPDMGEXP)

## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330

table(data$CROPDMGEXP)

## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

# it looks like there are a small number of nonsensical inputs. Given how few of them there are, it's probably okay to either ignore or exclude them.

data <- data[, propDMGcleaned := PROPDMG]
data <- data[PROPDMGEXP == "B" | PROPDMGEXP == "b", propDMGcleaned := PROPDMG * 10^9]
data <- data[PROPDMGEXP == "K" | PROPDMGEXP == "k", propDMGcleaned := PROPDMG * 10^3]
data <- data[PROPDMGEXP == "M" | PROPDMGEXP == "m", propDMGcleaned := PROPDMG * 10^6]

data <- data[, cropDMGcleaned := CROPDMG]
data <- data[CROPDMGEXP == "B" | CROPDMGEXP == "b", cropDMGcleaned := CROPDMG * 10^9]
data <- data[CROPDMGEXP == "K" | CROPDMGEXP == "k", cropDMGcleaned := CROPDMG * 10^3]
data <- data[CROPDMGEXP == "M" | CROPDMGEXP == "m", cropDMGcleaned := CROPDMG * 10^6]

Method

To see if there were any statistically significant differneces between the health consequences (fatailities and injuries) and economic consequences (estimate property and crop damage), I did the following:

Looked at a one way cut of the average number of fatalities, injuries and economic damage. This will give us an idea of the event types with higher and lower levels of health/economic consequences.
Fitting a linear model (in this case this is essentially just an ANOVA of the differing event types) will give us the standard errors of the coefficients representing the different event types as well as whether they are significantly different from zero (e.g. do they have any effect). I note that the fit may not be the best (if looking at diagnostics, measures of goodness of fit etc), but the point of fitting the linear model was to obtain an idea as to the level of volatility of the means of different event types, and not neccesarily for prediction.
It is important to not only look at the differences in the means of event types, but the standard errors as an event type could look high if there are few observations that by chance are quite high. The standard error will allow for this.

Results

Which events are most harmful to the population health?

I examined which event types were the most linked to injuries and fatalities by looking at the average number of injuries and fatalities per event. The event types with the highest levels of associated (correlation not causation) injuries are: hurricane, hot/dry, tornado, tsunami/surf/wave. The event types with the highest levels of associated (correlation not causation) injuries are: hot/dry, hurricane, tsunami/surf/wave, tornado. It is interesting to note that the top fmy event types are the same for both injuries and fatalities. The top fmy event types all generally have p values of less than 5% as well indicating that their effect is likley not zero.

# grid.arrange(g1, g2, nrow = 1)

# fit linear model
data$eventCleaned2 <- as.factor(data$eventCleaned2)
lmFit1 <- lm(FATALITIES ~ eventCleaned2, data=data)
summary(lmFit1)

## 
## Call:
## lm(formula = FATALITIES ~ eventCleaned2, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -0.94  -0.01   0.00   0.00 582.06 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     0.036808   0.014562   2.528 0.011483 *  
## eventCleaned2fire              -0.015581   0.018689  -0.834 0.404448    
## eventCleaned2flood             -0.018374   0.014802  -1.241 0.214489    
## eventCleaned2fog                0.006209   0.022827   0.272 0.785622    
## eventCleaned2hail              -0.036653   0.014631  -2.505 0.012238 *  
## eventCleaned2hot/dry            0.899814   0.019842  45.349  < 2e-16 ***
## eventCleaned2hurricane          0.431942   0.047248   9.142  < 2e-16 ***
## eventCleaned2landslide          0.026814   0.034077   0.787 0.431365    
## eventCleaned2light              0.014397   0.015763   0.913 0.361050    
## eventCleaned2other              0.010115   0.015574   0.649 0.516037    
## eventCleaned2rain              -0.027881   0.016115  -1.730 0.083610 .  
## eventCleaned2snow/ice/frost    -0.024006   0.015477  -1.551 0.120886    
## eventCleaned2storm              0.037715   0.024138   1.562 0.118186    
## eventCleaned2thunder           -0.034871   0.014743  -2.365 0.018019 *  
## eventCleaned2tornado            0.056044   0.014887   3.765 0.000167 ***
## eventCleaned2tsunami/surf/wave  0.290558   0.026610  10.919  < 2e-16 ***
## eventCleaned2wind              -0.032942   0.014641  -2.250 0.024448 *  
## eventCleaned2wintry            -0.004708   0.015438  -0.305 0.760377    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7628 on 902279 degrees of freedom
## Multiple R-squared:  0.006494,   Adjusted R-squared:  0.006475 
## F-statistic: 346.9 on 17 and 902279 DF,  p-value: < 2.2e-16

lmFit2 <- lm(INJURIES ~ eventCleaned2, data=data)
summary(lmFit2)

## 
## Call:
## lm(formula = INJURIES ~ eventCleaned2, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##   -4.61   -0.06   -0.04   -0.01 1698.49 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     0.29373    0.10340   2.841  0.00450 ** 
## eventCleaned2fire               0.08551    0.13270   0.644  0.51932    
## eventCleaned2flood             -0.18968    0.10510  -1.805  0.07111 .  
## eventCleaned2fog                0.27823    0.16208   1.717  0.08606 .  
## eventCleaned2hail              -0.28868    0.10389  -2.779  0.00546 ** 
## eventCleaned2hot/dry            2.47804    0.14089  17.588  < 2e-16 ***
## eventCleaned2hurricane          4.31738    0.33549  12.869  < 2e-16 ***
## eventCleaned2landslide         -0.20727    0.24197  -0.857  0.39166    
## eventCleaned2light              0.03384    0.11193   0.302  0.76238    
## eventCleaned2other             -0.23365    0.11059  -2.113  0.03461 *  
## eventCleaned2rain              -0.26892    0.11443  -2.350  0.01877 *  
## eventCleaned2snow/ice/frost    -0.13548    0.10990  -1.233  0.21764    
## eventCleaned2storm              0.25913    0.17140   1.512  0.13056    
## eventCleaned2thunder           -0.27109    0.10469  -2.590  0.00961 ** 
## eventCleaned2tornado            1.21217    0.10571  11.467  < 2e-16 ***
## eventCleaned2tsunami/surf/wave  0.36356    0.18895   1.924  0.05434 .  
## eventCleaned2wind              -0.25866    0.10396  -2.488  0.01284 *  
## eventCleaned2wintry            -0.19048    0.10962  -1.738  0.08226 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.416 on 902279 degrees of freedom
## Multiple R-squared:  0.005739,   Adjusted R-squared:  0.00572 
## F-statistic: 306.3 on 17 and 902279 DF,  p-value: < 2.2e-16

g1 <- andrewsPlot("FATALITIES", "Avg. no. of fatalities", "Fatalities per event")
g2 <- andrewsPlot("INJURIES", "Ave. no. of injuries", "Injuries per event")
grid.arrange(g1, g2, nrow = 1)

Which events have the greatest economic consequences?

I examined which event types were the most linked to crop and property damage by looking at the property and crop damage per event. Judging from the p-values of the linear models as well as looking at the figures below of the average economic damage from events, the bulk of the event types do relatively little damage compared to hurricanes (which is both drastically different in terms of mean cost as well as incredibly statistically significant for both property damage and crop damage models) and storms which cause the second highest average levels of property damage.

# 
# fit linear model
data$eventCleaned2 <- as.factor(data$eventCleaned2)
lmFit2 <- lm(propDMGcleaned ~ eventCleaned2, data=data)
summary(lmFit2)

## 
## Call:
## lm(formula = propDMGcleaned ~ eventCleaned2, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.943e+08 -6.068e+04 -6.068e+04 -4.183e+04  1.150e+11 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       242316    2466291   0.098    0.922    
## eventCleaned2fire                1762786    3165290   0.557    0.578    
## eventCleaned2flood               1783755    2506886   0.712    0.477    
## eventCleaned2fog                 -229033    3866063  -0.059    0.953    
## eventCleaned2hail                -181641    2477916  -0.073    0.942    
## eventCleaned2hot/dry             -237196    3360584  -0.071    0.944    
## eventCleaned2hurricane         294049976    8002260  36.746   <2e-16 ***
## eventCleaned2landslide            287459    5771514   0.050    0.960    
## eventCleaned2light               -183670    2669717  -0.069    0.945    
## eventCleaned2other               -150995    2637733  -0.057    0.954    
## eventCleaned2rain                  23273    2729367   0.009    0.993    
## eventCleaned2snow/ice/frost        -5532    2621267  -0.002    0.998    
## eventCleaned2storm              35226089    4088223   8.616   <2e-16 ***
## eventCleaned2thunder             -183571    2497004  -0.074    0.941    
## eventCleaned2tornado              696631    2521421   0.276    0.782    
## eventCleaned2tsunami/surf/wave    -11500    4506840  -0.003    0.998    
## eventCleaned2wind                -200490    2479643  -0.081    0.936    
## eventCleaned2wintry                74697    2614597   0.029    0.977    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 129200000 on 902279 degrees of freedom
## Multiple R-squared:  0.001798,   Adjusted R-squared:  0.00178 
## F-statistic: 95.62 on 17 and 902279 DF,  p-value: < 2.2e-16

lmFit3 <- lm(cropDMGcleaned ~ eventCleaned2, data=data)
summary(lmFit3)

## 
## Call:
## lm(formula = cropDMGcleaned ~ eventCleaned2, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
##  -19150322     -10724      -6836      -5228 4999851682 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       40838     155626   0.262   0.7930    
## eventCleaned2fire                 54275     199734   0.272   0.7858    
## eventCleaned2flood               107480     158188   0.679   0.4969    
## eventCleaned2fog                 -40838     243954  -0.167   0.8671    
## eventCleaned2hail                -30114     156360  -0.193   0.8473    
## eventCleaned2hot/dry             239803     212058   1.131   0.2581    
## eventCleaned2hurricane         19109484     504954  37.844  < 2e-16 ***
## eventCleaned2landslide            -8184     364191  -0.022   0.9821    
## eventCleaned2light               -40081     168463  -0.238   0.8119    
## eventCleaned2other               692250     166445   4.159  3.2e-05 ***
## eventCleaned2rain                 51768     172227   0.301   0.7637    
## eventCleaned2snow/ice/frost      297338     165406   1.798   0.0722 .  
## eventCleaned2storm               404612     257973   1.568   0.1168    
## eventCleaned2thunder             -34876     157564  -0.221   0.8248    
## eventCleaned2tornado             -34002     159105  -0.214   0.8308    
## eventCleaned2tsunami/surf/wave   -34768     284388  -0.122   0.9027    
## eventCleaned2wind                -35610     156469  -0.228   0.8200    
## eventCleaned2wintry               22286     164985   0.135   0.8925    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8152000 on 902279 degrees of freedom
## Multiple R-squared:  0.001969,   Adjusted R-squared:  0.00195 
## F-statistic: 104.7 on 17 and 902279 DF,  p-value: < 2.2e-16

g3 <- andrewsPlot("propDMGcleaned", "Avg. property damage ($)", "Property damage")
g4 <- andrewsPlot("cropDMGcleaned", "Avg. crop damag ($)", "Crop damage")

grid.arrange(g3, g4, nrow = 1)

Weather event types and the impact on health and the economy in the US

Andrew Lau

10/12/2018