In this assignment we explore the NOAA Storm Database to answer two questions about severe weather in the United States:
Which types of events are most harmful to population health?
Which types of events have the greatest economic consequences?
We have discovered that tornados are the most dangerous extreme weather event with regard to both human health and damage to property. When we look at economic damage, we also have to take into account the damage that weather does to crops. In that regard hail is the worst weather event.
From the National Weather Service we obtain the Storm Data from the year 1950 to November 2011.
First we load the storm data. We can load the data directly from the compressed .bz2 file. The data is comma separated and the first line contains the header which we also load.
stormData <- read.table(bzfile("repdata-data-StormData.csv.bz2"), sep=",", header=TRUE)
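As a minimal sketch of why the compressed file can be read directly: bzfile() opens a decompressing connection, so the reader only ever sees plain CSV text. The round trip below uses a temporary file and a made-up three-row table (not the NOAA data). It also uses read.csv, which is read.table with sep="," and header=TRUE already set.

```r
# Toy data standing in for the storm file (values are made up).
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL", "FLOOD"),
                  FATALITIES = c(3, 0, 1))

# Write it as a bzip2-compressed CSV in a temporary location.
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read it back straight from the compressed file.
roundTrip <- read.csv(bzfile(path), stringsAsFactors = FALSE)
dim(roundTrip)
```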
Let’s take a quick look at the data.
dim(stormData)
## [1] 902297 37
We see there are 902297 observations and 37 variables. Let’s take a closer look at the data.
head(stormData)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
Since we will want to plot some time series in the end, we will extract the years from the variable BGN_DATE and append them at the end of stormData.
extractYear <- function(d)
{
substr(strsplit(as.character(d), "/")[[1]][3], 1, 4)
}
stormData$YEAR <- sapply(stormData$BGN_DATE, extractYear)
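Because sapply calls extractYear once per row (more than 900,000 times here), a vectorised version may be noticeably faster. The sketch below is one possible alternative, assuming every BGN_DATE value follows the month/day/year layout seen in the head() output. The names bgnDate, parsed and year are only illustrative.

```r
# Sample values copied from the head(stormData) output above.
bgnDate <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00", "11/15/1951 0:00:00")

# as.Date parses the leading date; trailing characters (the time) are ignored.
parsed <- as.Date(bgnDate, format = "%m/%d/%Y")

# Keep the year as an integer so it sorts and plots as a number.
year <- as.integer(format(parsed, "%Y"))
year
```

Note that this yields integer years, whereas extractYear returns character strings, so later comparisons against "1992" would need to use the number 1992 instead.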
To answer the first question about the impact of severe weather on human health, we will pay attention to two variables - FATALITIES and INJURIES. We want to see which types of weather events are responsible for the most fatalities and injuries. First let’s see how many different weather events there are in our data.
length(unique(stormData$EVTYPE))
## [1] 985
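Part of this count comes from spelling variants of the same event: the summary tables further down list both RIP CURRENT and RIP CURRENTS, and both TSTM WIND and THUNDERSTORM WIND. A rough sketch of how such variants could be collapsed is given here. The sample labels and the synonyms lookup are hypothetical, and which labels really mean the same event is an assumption that would need checking against the database documentation.

```r
# Hypothetical labels; the real EVTYPE column has 985 distinct values.
evtype <- c("TORNADO", " tornado", "RIP CURRENT", "RIP CURRENTS",
            "TSTM WIND", "THUNDERSTORM WIND")

# Step 1: trim whitespace and force upper case.
clean <- toupper(trimws(evtype))

# Step 2: map known variants onto one label (illustrative lookup only).
synonyms <- c("RIP CURRENTS" = "RIP CURRENT",
              "TSTM WIND"    = "THUNDERSTORM WIND")
hit <- clean %in% names(synonyms)
clean[hit] <- synonyms[clean[hit]]

length(unique(evtype))  # 6 labels before cleaning
length(unique(clean))   # 3 labels after cleaning
```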
Since there are almost a thousand different weather events, we would have a very tough time trying to make a meaningful plot of the data. So we will look at just the few events which have caused the most injuries and fatalities.
max(stormData$FATALITIES); max(stormData$INJURIES)
## [1] 583
## [1] 1700
We see that the largest number of fatalities caused by a single weather event was 583 and the largest number of injuries was 1700. Let’s see which events are the culprits.
stormData[which.max(stormData$FATALITIES), "EVTYPE"]; stormData[which.max(stormData$INJURIES), "EVTYPE"]
## [1] HEAT
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
## [1] TORNADO
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
Looks like heat and tornados could be the biggest health threat, but we have to make sure they are not just outliers, but consistently threaten human life. In order to do this, we will have to devise a good criterion to determine which event is the most dangerous.
First of all, let’s discount all the weather event types which haven’t caused a single fatality or injury. To put it another way, we want to know how many event types have caused at least one fatality or injury.
dangerousInd <- (stormData$FATALITIES > 0 | stormData$INJURIES > 0) & !is.na(stormData$FATALITIES);
length(unique(stormData[dangerousInd, "EVTYPE"]))
## [1] 220
This number is still too high, so we need to narrow things down further. From now on we will discount all the events which haven’t caused a single fatality or injury and focus on those that have.
dangerous <- stormData[dangerousInd, ];
dim(dangerous)
## [1] 21929 38
We have reduced our considerations to 21929 events, a mere 2.4 percent of the starting number. Now let’s try to reduce the number of possible culprits even more. We will look at the cumulative damage our dangerous events have caused.
sumFatalities <- aggregate(dangerous$FATALITIES, by=list(EVTYPE=dangerous$EVTYPE), FUN=sum)
sumFatalities <- sumFatalities[order(sumFatalities[,2], decreasing=TRUE), ]
head(sumFatalities, n=15)
## EVTYPE x
## 184 TORNADO 5633
## 32 EXCESSIVE HEAT 1903
## 42 FLASH FLOOD 978
## 69 HEAT 937
## 123 LIGHTNING 816
## 191 TSTM WIND 504
## 47 FLOOD 470
## 147 RIP CURRENT 368
## 93 HIGH WIND 248
## 2 AVALANCHE 224
## 214 WINTER STORM 206
## 148 RIP CURRENTS 204
## 71 HEAT WAVE 172
## 37 EXTREME COLD 160
## 173 THUNDERSTORM WIND 133
Even if we take into consideration that heat is broken up into EXCESSIVE HEAT, HEAT, HEAT WAVE, etc., we can still see that the clear “winner” here is tornados.
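The comparison can be checked with a quick sum of the three heat-related rows from the table above. Treating exactly these three labels as "heat" is an assumption, and the variable names are only illustrative.

```r
# Fatality totals copied from the table above.
heat <- c("EXCESSIVE HEAT" = 1903, "HEAT" = 937, "HEAT WAVE" = 172)
tornado <- 5633

sum(heat)            # combined heat-related fatalities: 3012
sum(heat) / tornado  # a little over half of the tornado total
```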
Now let’s look at injuries.
sumInjuries <- aggregate(dangerous$INJURIES, by=list(EVTYPE=dangerous$EVTYPE), FUN=sum)
sumInjuries <- sumInjuries[order(sumInjuries[,2], decreasing=TRUE), ]
head(sumInjuries, n=15)
## EVTYPE x
## 184 TORNADO 91346
## 191 TSTM WIND 6957
## 47 FLOOD 6789
## 32 EXCESSIVE HEAT 6525
## 123 LIGHTNING 5230
## 69 HEAT 2100
## 117 ICE STORM 1975
## 42 FLASH FLOOD 1777
## 173 THUNDERSTORM WIND 1488
## 67 HAIL 1361
## 214 WINTER STORM 1321
## 109 HURRICANE/TYPHOON 1275
## 93 HIGH WIND 1137
## 77 HEAVY SNOW 1021
## 210 WILDFIRE 911
Again tornados are significantly ahead of the other weather events. Therefore it is safe to conclude that tornados are the biggest health threat in the USA of all the extreme weather.
Let’s plot the time series of fatalities caused by the 5 most dangerous extreme weather event types.
library(ggplot2)
mostFatal <- stormData[stormData$EVTYPE %in% sumFatalities[1:5, "EVTYPE"], ]
mostFatalPlot <- aggregate(mostFatal$FATALITIES, by = list(YEAR = mostFatal$YEAR, EVTYPE = mostFatal$EVTYPE), FUN=sum)
qplot(mostFatalPlot$YEAR, mostFatalPlot$x, geom="jitter", colour=mostFatalPlot$EVTYPE, xlab="Year", ylab="Fatalities")
We see that other than a single outlier of HEAT, tornados are the most deadly. That is probably also aided by the fact that only tornado fatalities have been recorded prior to 1993, but there is nothing we can do about missing data. However, if there is any weather type that can take first place with regards to fatalities, that would probably be (extreme) heat.
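Recent ggplot2 releases mark qplot() as deprecated, so the same kind of plot may be easier to maintain with ggplot(). The sketch below uses made-up yearly totals in the same shape as mostFatalPlot (columns YEAR, EVTYPE and x). It draws lines and points rather than jittered points, which is a deliberate substitution and not the plot produced above.

```r
library(ggplot2)

# Made-up yearly totals in the same shape as mostFatalPlot (YEAR, EVTYPE, x).
toyPlot <- data.frame(
  YEAR   = c("1993", "1994", "1995", "1993", "1994", "1995"),
  EVTYPE = rep(c("TORNADO", "HEAT"), each = 3),
  x      = c(30, 40, 35, 10, 12, 80)
)

# Convert the character year to a number so the x axis is continuous.
toyPlot$YEAR <- as.integer(toyPlot$YEAR)

# One line per event type, with a point for every yearly total.
p <- ggplot(toyPlot, aes(x = YEAR, y = x, colour = EVTYPE)) +
  geom_line() +
  geom_point() +
  labs(x = "Year", y = "Fatalities", colour = "Event type")
p
```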
Now we will do a similar analysis as we did for the health risk, but we turn our attention to economic damage that weather does. For that purpose the most useful variables will be PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP. The PROPDMGEXP and CROPDMGEXP variables tell us the scale of the damage in dollars - K for thousands, M for millions, B for billions. Since this is a very cumbersome way of working with the given dollar amounts, we will first turn all the values into proper dollar amounts without the suffixes which denote the scale.
scale <- function(x, y)
{
if (y == 'k' || y == 'K')
return(x*1000)
if (y == 'm' || y == 'M')
return(x*1000000)
if (y == 'b' || y == 'B')
return(x*1000000000)
return(x)
}
stormData$PROPDMG <- scale(stormData$PROPDMG, stormData$PROPDMGEXP)
stormData$CROPDMG <- scale(stormData$CROPDMG, stormData$CROPDMGEXP)
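One caveat is worth illustrating. An `if` test inspects a single value, so a function built from `if` tests does not choose a multiplier row by row when it is handed whole columns: older versions of R warn and use only the first element of the condition, and R 4.2 and later raise an error. The sketch below shows one vectorised way to apply the suffixes through a lookup table. The helper name toDollars and the sample vectors are hypothetical, and treating any suffix other than K, M or B as a multiplier of 1 is an assumption carried over from the function above.

```r
# Hypothetical helper: convert amount + suffix pairs to dollars, row by row.
toDollars <- function(amount, suffix) {
  multiplier <- c(K = 1e3, M = 1e6, B = 1e9)
  # Look up each suffix (case-insensitive); unknown or empty ones give NA.
  m <- unname(multiplier[toupper(as.character(suffix))])
  # Assumption: anything other than K, M or B leaves the amount unchanged.
  m[is.na(m)] <- 1
  amount * m
}

# Made-up values for illustration.
dmg <- c(25, 2.5, 1.2, 7)
dmgExp <- c("K", "m", "B", "")
toDollars(dmg, dmgExp)
```

Giving the helper its own name also avoids masking the base R function scale().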
Let’s first look at the largest damage ever recorded for a single event.
max(stormData$PROPDMG); max(stormData$CROPDMG)
## [1] 5e+06
## [1] 990
And let’s see who the culprits were.
stormData[which.max(stormData$PROPDMG), "EVTYPE"]; stormData[which.max(stormData$CROPDMG), "EVTYPE"]
## [1] THUNDERSTORM WIND
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
## [1] DROUGHT
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
It is not surprising that drought has caused the highest recorded crop damage, nor that thunderstorm wind has caused the highest property damage. Again, we have to check whether these are just outliers or whether they consistently cause crop and property damage, respectively.
sumPropDmg <- aggregate(stormData$PROPDMG, by=list(EVTYPE=stormData$EVTYPE), FUN=sum)
sumPropDmg <- sumPropDmg[order(sumPropDmg[,2], decreasing=TRUE), ]
head(sumPropDmg, n=15)
## EVTYPE x
## 834 TORNADO 3.212e+09
## 153 FLASH FLOOD 1.420e+09
## 856 TSTM WIND 1.336e+09
## 170 FLOOD 8.999e+08
## 760 THUNDERSTORM WIND 8.768e+08
## 244 HAIL 6.887e+08
## 464 LIGHTNING 6.034e+08
## 786 THUNDERSTORM WINDS 4.463e+08
## 359 HIGH WIND 3.247e+08
## 972 WINTER STORM 1.327e+08
## 310 HEAVY SNOW 1.223e+08
## 957 WILDFIRE 8.446e+07
## 427 ICE STORM 6.600e+07
## 676 STRONG WIND 6.299e+07
## 376 HIGH WINDS 5.562e+07
We see that tornados cause the most property damage. Now let’s look at crops.
sumCropDmg <- aggregate(stormData$CROPDMG, by=list(EVTYPE=stormData$EVTYPE), FUN=sum)
sumCropDmg <- sumCropDmg[order(sumCropDmg[,2], decreasing=TRUE), ]
head(sumCropDmg, n=15)
## EVTYPE x
## 244 HAIL 579596
## 153 FLASH FLOOD 179200
## 170 FLOOD 168038
## 856 TSTM WIND 109203
## 834 TORNADO 100019
## 760 THUNDERSTORM WIND 66791
## 95 DROUGHT 33899
## 786 THUNDERSTORM WINDS 18685
## 359 HIGH WIND 17283
## 290 HEAVY RAIN 11123
## 212 FROST/FREEZE 7034
## 140 EXTREME COLD 6121
## 848 TROPICAL STORM 5899
## 402 HURRICANE 5339
## 164 FLASH FLOODING 5126
Now we see that drought was just an outlier and that cumulatively the biggest menace to crops is hail.
Let’s plot the time series of property damage caused by the 5 most dangerous extreme weather event types.
mostPropDmg <- stormData[stormData$EVTYPE %in% sumPropDmg[1:5, "EVTYPE"], ]
mostPropDmgPlot <- aggregate(mostPropDmg$PROPDMG, by = list(YEAR = mostPropDmg$YEAR, EVTYPE = mostPropDmg$EVTYPE), FUN=sum)
qplot(mostPropDmgPlot$YEAR, mostPropDmgPlot$x, geom="jitter", colour=mostPropDmgPlot$EVTYPE, xlab="Year", ylab="Property Damage")
We see that lately (in the last couple of years) thunderstorms have caused the most damage, but tornados have been historically and consistently at the top. It appears that property damage caused by weather other than tornados and thunderstorm winds was not recorded before the year 1993. The missing data probably account for tornados causing the most damage cumulatively. In light of the missing data, thunderstorm wind is probably the most devastating with regard to property damage.
Let’s take a look at the crop damage also.
mostCropDmg <- stormData[stormData$EVTYPE %in% sumCropDmg[1:5, "EVTYPE"], ]
mostCropDmgPlot <- aggregate(mostCropDmg$CROPDMG, by = list(YEAR = mostCropDmg$YEAR, EVTYPE = mostCropDmg$EVTYPE), FUN=sum)
qplot(mostCropDmgPlot$YEAR, mostCropDmgPlot$x, geom="jitter", colour=mostCropDmgPlot$EVTYPE, xlab="Year", ylab="Crop Damage")
The situation is far from crystal clear, but hail obviously causes more damage to crops than other severe weather types. It appears that all the data before the year 1993 is missing as can be seen in the following way:
sum(stormData$CROPDMG[stormData$YEAR <= "1992"])
## [1] 0
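The comparison above works on character years because all of them have four digits, so they sort alphabetically in the same order as numerically. A per-year summary shows more directly where recording starts. The sketch below uses made-up data with the same two columns, and the names toy, byYear and firstYear are only illustrative.

```r
# Made-up data: crop damage is zero until 1993 in this toy example.
toy <- data.frame(
  YEAR    = c("1991", "1992", "1992", "1993", "1993", "1994"),
  CROPDMG = c(0, 0, 0, 5, 10, 2.5)
)

# Sum the damage within each year.
byYear <- tapply(toy$CROPDMG, toy$YEAR, sum)
byYear

# First year with any recorded damage.
firstYear <- min(as.integer(names(byYear)[byYear > 0]))
firstYear
```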