Synopsis

In this assignment we aim to explore the NOAA Storm Database and answer two questions about severe weather in the United States.

  1. Which types of events are most harmful to population health?

  2. Which types of events have the greatest economic consequences?

We found that tornados are the most dangerous extreme weather event with regard to both human health and property damage. Economic damage also includes damage to crops, and in that regard hail is the worst weather event.

Data Processing

From the National Weather Service we obtain the Storm Data from the year 1950 to November 2011.

First we load the storm data. We can load the data directly from the compressed .bz2 file. The data is comma separated and the first line contains the header which we also load.

stormData <- read.table(bzfile("repdata-data-StormData.csv.bz2"), sep=",", header=TRUE)
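
As a minimal sketch of the same loading step, read.csv() can be given a bzfile() connection directly. The tiny file written here is a made-up stand-in for the real data set, and stringsAsFactors is set explicitly because its default differs between R versions (the "985 Levels" output further down suggests this report read strings as factors).

```r
# Toy stand-in for the real file: write a tiny bzip2-compressed CSV to a temp path.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,0", "HEAT,2"), con)
close(con)

# read.csv() is read.table() with sep = "," and header = TRUE preset;
# it decompresses transparently when given a bzfile() connection.
toy <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
str(toy)
```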

Let’s take a quick look at the data.

dim(stormData)
## [1] 902297     37

We see there are 902297 observations and 37 variables. Let’s take a closer look at the data.

head(stormData)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

Since we will want to plot some time series in the end, we will extract the years from the variable BGN_DATE and append them at the end of stormData.

extractYear <- function(d)
{
    substr(strsplit(as.character(d), "/")[[1]][3], 1, 4)
}
stormData$YEAR <- sapply(stormData$BGN_DATE, extractYear)
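
Calling a function once per row with sapply can be slow on roughly 900,000 rows. One possible alternative, sketched here on three sample values copied from the head(stormData) output, is to parse the whole column at once. On the real column, which may be a factor, wrapping it in as.character() first is a reasonable precaution.

```r
# Sample values copied from the BGN_DATE column shown in head(stormData).
bgn <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00", "11/15/1951 0:00:00")

# as.Date() parses the leading month/day/year and ignores the trailing time;
# format(..., "%Y") then returns the four-digit year as a character string.
years <- format(as.Date(bgn, format = "%m/%d/%Y"), "%Y")
years
```

Like extractYear, this yields the year as a character string, so later comparisons such as YEAR <= "1992" behave the same way.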

Results

Most fatal types of weather

To answer the first question about the impact of severe weather on human health, we will pay attention to two variables: FATALITIES and INJURIES. We want to see which types of weather events are responsible for the most fatalities and injuries. First let’s see how many different weather event types there are in our data.

length(unique(stormData$EVTYPE))
## [1] 985

Since there are almost a thousand different weather event types, we will have a very tough time trying to make a meaningful plot of the data. So we will look at just the few event types which have caused the most injuries and fatalities.

max(stormData$FATALITIES); max(stormData$INJURIES)
## [1] 583
## [1] 1700

We see that the most fatalities caused by a single weather event is 583 and the most injuries caused by a single event is 1700. Let’s see which event types are the culprits.

stormData[which.max(stormData$FATALITIES), "EVTYPE"]; stormData[which.max(stormData$INJURIES), "EVTYPE"]
## [1] HEAT
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... WND
## [1] TORNADO
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... WND

Looks like heat and tornados could be the biggest health threat, but we have to make sure they are not just outliers, but consistently threaten human life. In order to do this, we will have to devise a good criterion to determine which event is the most dangerous.

First of all, let’s discount all the weather event types which haven’t caused a single fatality or injury. Or, to put it another way, we want to know how many event types have caused at least one fatality or injury.

dangerousInd <- (stormData$FATALITIES > 0 | stormData$INJURIES > 0) & !is.na(stormData$FATALITIES);
length(unique(stormData[dangerousInd, "EVTYPE"]))
## [1] 220

This is still too high a number, so we need to tighten the restrictions further. From now on we will discount all the events which haven’t caused a single fatality or injury and focus on those that have.

dangerous <- stormData[dangerousInd, ];
dim(dangerous)
## [1] 21929    38

We have reduced our considerations to 21929 events, a mere 2.4 percent of the starting number. Now let’s try to reduce the number of possible culprits even more. We will look at the cumulative damage our dangerous events have caused.
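
The 2.4 percent figure follows directly from the two row counts printed by dim() above. As a quick check (the variable names are introduced here only for illustration):

```r
n_total     <- 902297  # rows in stormData, from dim(stormData)
n_dangerous <- 21929   # rows in dangerous, from dim(dangerous)

# Share of events with at least one fatality or injury, in percent.
share <- round(100 * n_dangerous / n_total, 1)
share
```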

sumFatalities <- aggregate(dangerous$FATALITIES, by=list(EVTYPE=dangerous$EVTYPE), FUN=sum)
sumFatalities <- sumFatalities[order(sumFatalities[,2], decreasing=TRUE), ]
head(sumFatalities, n=15)
##                EVTYPE    x
## 184           TORNADO 5633
## 32     EXCESSIVE HEAT 1903
## 42        FLASH FLOOD  978
## 69               HEAT  937
## 123         LIGHTNING  816
## 191         TSTM WIND  504
## 47              FLOOD  470
## 147       RIP CURRENT  368
## 93          HIGH WIND  248
## 2           AVALANCHE  224
## 214      WINTER STORM  206
## 148      RIP CURRENTS  204
## 71          HEAT WAVE  172
## 37       EXTREME COLD  160
## 173 THUNDERSTORM WIND  133

Even if we take into consideration that heat is broken up into EXCESSIVE HEAT, HEAT, HEAT WAVE, etc., we can still see that the clear “winner” here is tornados.
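
To make the comparison concrete, the three heat labels in the table above can be added up and set against the tornado total. This sketch uses only the figures printed in that table. The grouping rule (any label containing "HEAT") is an assumption made for illustration, and heat-related labels outside the top 15 are not included.

```r
# Fatality totals copied from the sumFatalities table above.
fatal <- c("TORNADO" = 5633, "EXCESSIVE HEAT" = 1903, "HEAT" = 937, "HEAT WAVE" = 172)

# Hypothetical grouping rule: any label containing "HEAT" counts as heat.
group <- ifelse(grepl("HEAT", names(fatal)), "HEAT (combined)", names(fatal))

# Sum the totals within each group.
combined <- tapply(fatal, group, sum)
combined
```

Under this grouping the three heat labels together account for 3012 fatalities, still well below the 5633 attributed to tornados.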

Now let’s look at injuries.

sumInjuries <- aggregate(dangerous$INJURIES, by=list(EVTYPE=dangerous$EVTYPE), FUN=sum)
sumInjuries <- sumInjuries[order(sumInjuries[,2], decreasing=TRUE), ]
head(sumInjuries, n=15)
##                EVTYPE     x
## 184           TORNADO 91346
## 191         TSTM WIND  6957
## 47              FLOOD  6789
## 32     EXCESSIVE HEAT  6525
## 123         LIGHTNING  5230
## 69               HEAT  2100
## 117         ICE STORM  1975
## 42        FLASH FLOOD  1777
## 173 THUNDERSTORM WIND  1488
## 67               HAIL  1361
## 214      WINTER STORM  1321
## 109 HURRICANE/TYPHOON  1275
## 93          HIGH WIND  1137
## 77         HEAVY SNOW  1021
## 210          WILDFIRE   911

Again tornados are far ahead of the other weather events. On cumulative totals, tornados are therefore the biggest health threat among extreme weather events in the USA.

Let’s plot the time series of fatalities caused by the 5 most dangerous extreme weather event types.

library(ggplot2)
mostFatal <- stormData[stormData$EVTYPE %in% sumFatalities[1:5, "EVTYPE"], ]
mostFatalPlot <- aggregate(mostFatal$FATALITIES, by = list(YEAR = mostFatal$YEAR, EVTYPE = mostFatal$EVTYPE), FUN=sum)
qplot(mostFatalPlot$YEAR, mostFatalPlot$x, geom="jitter", colour=mostFatalPlot$EVTYPE, xlab="Year", ylab="Fatalities")

[Figure: fatalities per year for the five event types with the most fatalities, colored by event type]

We see that, other than a single outlier of HEAT, tornados are the most deadly. This is probably helped by the fact that only tornado fatalities were recorded prior to 1993, but there is nothing we can do about missing data. However, if any weather type could take first place with regard to fatalities, it would probably be (extreme) heat.
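
For readers on a current ggplot2, where qplot() is deprecated (as of version 3.4.0), a comparable plot can be built with ggplot() directly. This is only a sketch on made-up numbers: the data frame toy stands in for mostFatalPlot, and a line per event type is one possible design choice, not the plot used in this report. With the real data, YEAR would first need as.numeric() because it is stored as a character string.

```r
library(ggplot2)

# Toy yearly totals standing in for mostFatalPlot (values are made up).
toy <- data.frame(
  YEAR   = rep(c(1993, 1994, 1995), times = 2),
  EVTYPE = rep(c("TORNADO", "HEAT"), each = 3),
  x      = c(30, 40, 35, 10, 60, 20)
)

# One line per event type; YEAR is numeric so the x axis is continuous.
p <- ggplot(toy, aes(x = YEAR, y = x, colour = EVTYPE)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Fatalities", colour = "Event type")
p
```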

Types of weather that cause the most economic damage

Now we will do a similar analysis as we did for the health risk, but we turn our attention to economic damage that weather does. For that purpose the most useful variables will be PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP. The PROPDMGEXP and CROPDMGEXP variables tell us the scale of the damage in dollars - K for thousands, M for millions, B for billions. Since this is a very cumbersome way of working with the given dollar amounts, we will first turn all the values into proper dollar amounts without the suffixes which denote the scale.

scale <- function(x, y)
{
    if (y == 'k' || y == 'K')
        return(x*1000)
    if (y == 'm' || y == 'M')
        return(x*1000000)
    if (y == 'b' || y == 'B')
        return(x*1000000000)
    return(x)
}
stormData$PROPDMG <- scale(stormData$PROPDMG, stormData$PROPDMGEXP)
stormData$CROPDMG <- scale(stormData$CROPDMG, stormData$CROPDMGEXP)
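
One caveat with the function above: each if condition evaluates a single logical value, so when whole columns are passed it cannot choose a multiplier row by row, and recent versions of R raise an error when `||` is given a vector longer than one. A possible row-by-row variant is sketched below. The name toDollars and the lookup-table approach are assumptions made for illustration, and the outputs printed in this report come from the code as written above, not from this sketch.

```r
# Vectorized conversion: look up a multiplier for every row at once.
# Unrecognized or empty exponent codes fall back to a multiplier of 1,
# mirroring the final return(x) of the function above.
toDollars <- function(amount, exp) {
    multipliers <- c(K = 1e3, M = 1e6, B = 1e9)
    m <- multipliers[toupper(as.character(exp))]
    m[is.na(m)] <- 1
    amount * unname(m)
}

# Small check on made-up values covering each code and the empty case.
toDollars(c(25, 2.5, 5, 10), c("K", "M", "b", ""))
```

It could then be applied as toDollars(stormData$PROPDMG, stormData$PROPDMGEXP). Using a distinct name also avoids masking base R's own scale() function.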

Let’s first look at the largest damage values recorded for a single event.

max(stormData$PROPDMG); max(stormData$CROPDMG)
## [1] 5e+06
## [1] 990

And let’s see who the culprits were.

stormData[which.max(stormData$PROPDMG), "EVTYPE"]; stormData[which.max(stormData$CROPDMG), "EVTYPE"]
## [1] THUNDERSTORM WIND
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... WND
## [1] DROUGHT
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... WND

It is not surprising that drought has caused the highest recorded crop damage, nor that thunderstorm wind has caused the highest property damage. Again, we have to check whether these are just outliers or whether they are consistently responsible for crop and property damage, respectively.

sumPropDmg <- aggregate(stormData$PROPDMG, by=list(EVTYPE=stormData$EVTYPE), FUN=sum)
sumPropDmg <- sumPropDmg[order(sumPropDmg[,2], decreasing=TRUE), ]
head(sumPropDmg, n=15)
##                 EVTYPE         x
## 834            TORNADO 3.212e+09
## 153        FLASH FLOOD 1.420e+09
## 856          TSTM WIND 1.336e+09
## 170              FLOOD 8.999e+08
## 760  THUNDERSTORM WIND 8.768e+08
## 244               HAIL 6.887e+08
## 464          LIGHTNING 6.034e+08
## 786 THUNDERSTORM WINDS 4.463e+08
## 359          HIGH WIND 3.247e+08
## 972       WINTER STORM 1.327e+08
## 310         HEAVY SNOW 1.223e+08
## 957           WILDFIRE 8.446e+07
## 427          ICE STORM 6.600e+07
## 676        STRONG WIND 6.299e+07
## 376         HIGH WINDS 5.562e+07

We see that tornados cause the most property damage. Now let’s look at crops.

sumCropDmg <- aggregate(stormData$CROPDMG, by=list(EVTYPE=stormData$EVTYPE), FUN=sum)
sumCropDmg <- sumCropDmg[order(sumCropDmg[,2], decreasing=TRUE), ]
head(sumCropDmg, n=15)
##                 EVTYPE      x
## 244               HAIL 579596
## 153        FLASH FLOOD 179200
## 170              FLOOD 168038
## 856          TSTM WIND 109203
## 834            TORNADO 100019
## 760  THUNDERSTORM WIND  66791
## 95             DROUGHT  33899
## 786 THUNDERSTORM WINDS  18685
## 359          HIGH WIND  17283
## 290         HEAVY RAIN  11123
## 212       FROST/FREEZE   7034
## 140       EXTREME COLD   6121
## 848     TROPICAL STORM   5899
## 402          HURRICANE   5339
## 164     FLASH FLOODING   5126

Now we see that drought was just an outlier and that cumulatively the biggest menace to crops is hail.

Let’s plot the time series of property damage caused by the five most damaging extreme weather event types.

mostPropDmg <- stormData[stormData$EVTYPE %in% sumPropDmg[1:5, "EVTYPE"], ]
mostPropDmgPlot <- aggregate(mostPropDmg$PROPDMG, by = list(YEAR = mostPropDmg$YEAR, EVTYPE = mostPropDmg$EVTYPE), FUN=sum)
qplot(mostPropDmgPlot$YEAR, mostPropDmgPlot$x, geom="jitter", colour=mostPropDmgPlot$EVTYPE, xlab="Year", ylab="Property Damage")

[Figure: property damage per year for the five event types with the most property damage, colored by event type]

We see that in the last couple of years thunderstorm winds have caused the most damage, but tornados have been historically and consistently at the top. It appears that property damage caused by weather other than tornados and thunderstorm winds was not recorded before 1993. The missing data probably account for tornados causing the most damage cumulatively. In light of the missing data, thunderstorm wind is probably the most devastating with regard to property damage.

Let’s take a look at the crop damage also.

mostCropDmg <- stormData[stormData$EVTYPE %in% sumCropDmg[1:5, "EVTYPE"], ]
mostCropDmgPlot <- aggregate(mostCropDmg$CROPDMG, by = list(YEAR = mostCropDmg$YEAR, EVTYPE = mostCropDmg$EVTYPE), FUN=sum)
qplot(mostCropDmgPlot$YEAR, mostCropDmgPlot$x, geom="jitter", colour=mostCropDmgPlot$EVTYPE, xlab="Year", ylab="Crop Damage")

[Figure: crop damage per year for the five event types with the most crop damage, colored by event type]

The situation is far from crystal clear, but hail appears to cause more damage to crops than other severe weather types. It appears that all crop damage data before 1993 is missing, as the following check shows:

sum(stormData$CROPDMG[stormData$YEAR <= "1992"])
## [1] 0
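
A related check is to find the first year with any recorded crop damage by summing per year. The sketch below runs on a made-up data frame named toy that stands in for stormData, so its values are purely illustrative.

```r
# Toy data standing in for stormData (values are made up).
toy <- data.frame(
  YEAR    = c("1991", "1992", "1993", "1993", "1994"),
  CROPDMG = c(0, 0, 5, 10, 0)
)

# Sum crop damage per year, then keep the years with a nonzero total.
perYear <- tapply(toy$CROPDMG, toy$YEAR, sum)
firstYear <- min(names(perYear)[perYear > 0])
firstYear
```

Because YEAR is stored as a four-digit character string, min() and comparisons such as <= "1992" order the years correctly.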