This analysis examines the damage caused by storm events in the U.S. recorded by the National Oceanic and Atmospheric Administration between 1950 and 2011. The primary focuses of this analysis are damage to health and life, measured by injuries and fatalities, and damage to property and crops, measured in U.S. dollars. Because events types were recorded in a non-uniform way by NOAA staff over the years, I take steps to clean the event type variable and make it ready for use in the analysis. I also scale the property damage and crop damage variables to make them ready for comparison.
I find that the greatest damage to people, as measured by aggregate fatalities and injuries, comes from tornadoes, though significant damage is also caused by floods. By contrast, floods are the main source of property damage and crop damage.
Before starting, we load the necessary packages.
library(dplyr)
library(ggplot2)
library(R.utils)
## Warning: package 'R.utils' was built under R version 4.0.4
## Warning: package 'R.oo' was built under R version 4.0.3
## Warning: package 'R.methodsS3' was built under R version 4.0.3
To begin, I download the data file to my computer from the NOAA website. I then use bunzip2 to convert the file to csv.
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destfile <- "C:/Users/19732/Downloads/repdata_data_StormData.csv.bz2"
download.file(url, destfile)
bunzip2(filename="repdata_data_StormData.csv.bz2", destname="repdata_data_StormData.csv")
I then load the file into R.
setwd("C:/Users/19732/Downloads")
stormData <- read.csv("repdata_data_StormData.csv")
Next I use str to understand the structure of the data.
str(stormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
First we notice that many of these variables have empty values, such as BGN_AZI. Similarly, COUNTYENDN begins with several NA values.
We are primarily interested here in how event type is related to varying levels of damage to population health and economic consequences. The variables that will aid us in this analysis are EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP. FATALITIES and INJURIES are integer variables, and EVTYPE is a character variable.
nrow(table(stormData$EVTYPE))
## [1] 985
There are 985 different values that the variable EVTYPE takes. For this analysis, I ignored any values with less than 200 values. This leaves the following event types:
table1 <- data.frame(table(stormData$EVTYPE))
table1 <- table1 %>% filter(Freq >= 200)
table1
## Var1 Freq
## 1 AVALANCHE 386
## 2 BLIZZARD 2719
## 3 COASTAL FLOOD 650
## 4 COLD/WIND CHILL 539
## 5 DENSE FOG 1293
## 6 DROUGHT 2488
## 7 DUST STORM 427
## 8 EXCESSIVE HEAT 1678
## 9 EXTREME COLD 655
## 10 EXTREME COLD/WIND CHILL 1002
## 11 EXTREME WINDCHILL 204
## 12 FLASH FLOOD 54277
## 13 FLASH FLOODING 682
## 14 FLOOD 25326
## 15 FLOOD/FLASH FLOOD 624
## 16 FOG 538
## 17 FREEZING RAIN 250
## 18 FROST/FREEZE 1342
## 19 FUNNEL CLOUD 6839
## 20 HAIL 288661
## 21 HEAT 767
## 22 HEAVY RAIN 11723
## 23 HEAVY SNOW 15708
## 24 HEAVY SURF/HIGH SURF 228
## 25 HIGH SURF 725
## 26 HIGH WIND 20212
## 27 HIGH WINDS 1533
## 28 ICE STORM 2006
## 29 LAKE-EFFECT SNOW 636
## 30 LANDSLIDE 600
## 31 LIGHTNING 15754
## 32 MARINE HAIL 442
## 33 MARINE THUNDERSTORM WIND 5812
## 34 MARINE TSTM WIND 6175
## 35 RIP CURRENT 470
## 36 RIP CURRENTS 304
## 37 SNOW 587
## 38 STORM SURGE 261
## 39 STRONG WIND 3566
## 40 THUNDERSTORM WIND 82563
## 41 THUNDERSTORM WINDS 20843
## 42 TORNADO 60652
## 43 TROPICAL STORM 690
## 44 TSTM WIND 219940
## 45 TSTM WIND/HAIL 1028
## 46 URBAN FLOOD 249
## 47 URBAN/SML STREAM FLD 3392
## 48 WATERSPOUT 3796
## 49 WILD/FOREST FIRE 1457
## 50 WILDFIRE 2761
## 51 WIND 340
## 52 WINTER STORM 11433
## 53 WINTER WEATHER 7026
## 54 WINTER WEATHER/MIX 1104
Let’s now note in the dataset which rows fall into our top event types listed above.
stormData$topEVTYPE <- stormData$EVTYPE %in% table1$Var1
We can see here that the subset of EVTYPE values included captures 14,284 of a total of 15,145 fatalities:
aggregate(FATALITIES~ topEVTYPE, data=stormData, sum)
## topEVTYPE FATALITIES
## 1 FALSE 861
## 2 TRUE 14284
I also included any value containing “HURRICANE”, as hurricanes may contribute a large share of economic and human damage despite being rare. Here I code all hurricanes in the group “HURRICANE”.
stormData$EVTYPE <- ifelse(grepl("HURRICANE", stormData$EVTYPE, ignore.case=TRUE), "HURRICANE", stormData$EVTYPE)
At this step, I combine EVTYPE values that are included in our list above but refer to the same type of weather event:
stormData$EVTYPE <- ifelse(stormData$EVTYPE=="EXCESSIVE HEAT","HEAT", stormData$EVTYPE)
stormData$EVTYPE <- ifelse(stormData$EVTYPE=="FLOOD/FLASH FLOOD","FLOOD", stormData$EVTYPE)
stormData$EVTYPE <- ifelse(stormData$EVTYPE=="FLASH FLOODING","FLASH FLOOD", stormData$EVTYPE)
stormData$EVTYPE <- ifelse(stormData$EVTYPE=="HEAVY SURF/HIGH SURF","HIGH SURF", stormData$EVTYPE)
stormData$EVTYPE <- ifelse(stormData$EVTYPE=="HIGH WINDS","HIGH WIND", stormData$EVTYPE)
stormData$EVTYPE <- ifelse(stormData$EVTYPE=="MARINE TSTM WIND","MARINE THUNDERSTORM WIND", stormData$EVTYPE)
stormData$EVTYPE <- ifelse(stormData$EVTYPE=="RIP CURRENT","RIP CURRENTS", stormData$EVTYPE)
stormData$EVTYPE <- ifelse(stormData$EVTYPE=="THUNDERSTORM WIND","THUNDERSTORM WINDS", stormData$EVTYPE)
stormData$EVTYPE <- ifelse(stormData$EVTYPE=="TSTM WIND/HAIL","TSTM WIND", stormData$EVTYPE)
stormData$EVTYPE <- ifelse(stormData$EVTYPE=="URBAN/SML STREAM FLD","URBAN FLOOD", stormData$EVTYPE)
stormData$EVTYPE <- ifelse(stormData$EVTYPE=="WINTER WEATHER/MIX","WINTER WEATHER", stormData$EVTYPE)
stormData$EVTYPE <- ifelse(stormData$EVTYPE=="WILD/FOREST FIRE","WILDFIRE", stormData$EVTYPE)
Next, let’s examine PROPDMG, CROPDMG, PROPDMGEXP, and CROPDMGEXP. Here we can see the values that PROPDMGEXP and CROPDMGEXP can take.
table(stormData$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5 6
## 465934 1 8 5 216 25 13 4 4 28 4
## 7 8 B h H K m M
## 5 1 40 1 6 424665 7 11330
table(stormData$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
The vast majority of rows are either blank or a capital letter representing an exponent of ten: H, K, M, and B. Therefore, I ignore rows with any other values. This variable converts the originals into numeric scalars.
stormData$PROPDMGEXP2 <- ifelse(stormData$PROPDMGEXP=="H", 100,
ifelse(stormData$PROPDMGEXP=="K", 1000,
ifelse(stormData$PROPDMGEXP=="M", 1000000,
ifelse(stormData$PROPDMGEXP=="B", 1000000000, NA))))
stormData$CROPDMGEXP2 <- ifelse(stormData$CROPDMGEXP=="H", 100,
ifelse(stormData$CROPDMGEXP=="K", 1000,
ifelse(stormData$CROPDMGEXP=="M", 1000000,
ifelse(stormData$CROPDMGEXP=="B", 1000000000, NA))))
We can now create two new variables that scale PROPDMG and CROPDMG uniformly.
stormData$PROPDMG2 <- stormData$PROPDMG * stormData$PROPDMGEXP2
stormData$CROPDMG2 <- stormData$CROPDMG * stormData$CROPDMGEXP2
Let’s now filter out the rows that we don’t want included in our analysis. First, we only want rows with EVTYPES that are in our topEVTYPE list.
stormData2 <- filter(stormData, topEVTYPE==TRUE)
Next, we will filter out the rows that don’t have full data for FATALITIES, INJURIES, CROPDMG2, and PROPDMG2.
stormData2$fullData <- ifelse((is.na(stormData2$FATALITIES)==FALSE &
is.na(stormData2$INJURIES)==FALSE &
is.na(stormData2$PROPDMG2)==FALSE &
is.na(stormData2$CROPDMG2)==FALSE &
is.numeric(stormData2$FATALITIES)==TRUE &
is.numeric(stormData2$INJURIES)==TRUE &
is.numeric(stormData2$PROPDMG2)==TRUE &
is.numeric(stormData2$CROPDMG2)==TRUE
), 1, 0)
stormData3 <- filter(stormData2, fullData==1)
nrow(stormData3)
## [1] 278492
This leaves about 278,000 storm events in our data set. With our data now prepared, we can proceed to analyze damage from these storm events.
Our first question is, “Across the United States, which types of events are most harmful with respect to population health?” First we will aggregate fatalities and injuries by event type.
fatalitiesByEventType <- aggregate(FATALITIES ~ EVTYPE, data=stormData3, sum)
injuriesByEventType <- aggregate(INJURIES ~ EVTYPE, data=stormData3, sum)
Now we will print out these tables in descending order.
fatalitiesByEventType[order(fatalitiesByEventType$FATALITIES, decreasing = TRUE), ]
## EVTYPE FATALITIES
## 32 TORNADO 1064
## 10 FLASH FLOOD 390
## 16 HEAT 375
## 11 FLOOD 264
## 27 RIP CURRENTS 211
## 24 LIGHTNING 173
## 31 THUNDERSTORM WINDS 141
## 4 COLD/WIND CHILL 94
## 1 AVALANCHE 83
## 19 HIGH SURF 70
## 20 HIGH WIND 63
## 37 WILDFIRE 60
## 9 EXTREME COLD/WIND CHILL 58
## 30 STRONG WIND 55
## 39 WINTER WEATHER 30
## 2 BLIZZARD 23
## 38 WINTER STORM 20
## 33 TROPICAL STORM 19
## 17 HEAVY RAIN 16
## 21 ICE STORM 13
## 18 HEAVY SNOW 12
## 34 TSTM WIND 11
## 26 MARINE THUNDERSTORM WIND 10
## 7 DUST STORM 5
## 23 LANDSLIDE 5
## 3 COASTAL FLOOD 3
## 8 EXTREME COLD 3
## 15 HAIL 2
## 5 DENSE FOG 0
## 6 DROUGHT 0
## 12 FOG 0
## 13 FROST/FREEZE 0
## 14 FUNNEL CLOUD 0
## 22 LAKE-EFFECT SNOW 0
## 25 MARINE HAIL 0
## 28 SNOW 0
## 29 STORM SURGE 0
## 35 URBAN FLOOD 0
## 36 WATERSPOUT 0
injuriesByEventType[order(injuriesByEventType$INJURIES, decreasing = TRUE), ]
## EVTYPE INJURIES
## 32 TORNADO 11960
## 11 FLOOD 6495
## 16 HEAT 2071
## 21 ICE STORM 1616
## 31 THUNDERSTORM WINDS 1491
## 24 LIGHTNING 1010
## 10 FLASH FLOOD 669
## 37 WILDFIRE 582
## 2 BLIZZARD 407
## 20 HIGH WIND 398
## 34 TSTM WIND 375
## 39 WINTER WEATHER 324
## 33 TROPICAL STORM 311
## 15 HAIL 303
## 27 RIP CURRENTS 140
## 30 STRONG WIND 135
## 19 HIGH SURF 125
## 38 WINTER STORM 114
## 18 HEAVY SNOW 92
## 7 DUST STORM 69
## 1 AVALANCHE 58
## 17 HEAVY RAIN 47
## 26 MARINE THUNDERSTORM WIND 26
## 4 COLD/WIND CHILL 12
## 23 LANDSLIDE 12
## 9 EXTREME COLD/WIND CHILL 7
## 12 FOG 5
## 6 DROUGHT 4
## 5 DENSE FOG 3
## 3 COASTAL FLOOD 1
## 8 EXTREME COLD 0
## 13 FROST/FREEZE 0
## 14 FUNNEL CLOUD 0
## 22 LAKE-EFFECT SNOW 0
## 25 MARINE HAIL 0
## 28 SNOW 0
## 29 STORM SURGE 0
## 35 URBAN FLOOD 0
## 36 WATERSPOUT 0
Let’s now see how these groups compare visually using fatalities and injuries. First we merge our fatalities dataset and our injuries dataset.
fatalitiesByEventType$MEASURE <- rep("Fatalities", 39)
injuriesByEventType$MEASURE <- rep("Injuries", 39)
fatalitiesByEventType <- rename(fatalitiesByEventType, TOTAL=FATALITIES)
injuriesByEventType <- rename(injuriesByEventType, TOTAL=INJURIES)
fataliesAndInjuriesByEventType <- rbind(fatalitiesByEventType, injuriesByEventType)
Now we can plot the results.
ggplot(data=fataliesAndInjuriesByEventType, aes(x=EVTYPE, y=TOTAL, fill=MEASURE))+geom_bar(stat="identity")+labs(title="Fatalities and injuries by event type")+scale_fill_brewer(palette="Dark2")+theme(plot.margin = margin(t=0.5, r=0.5, b=0.5, l=0.5, "cm"))+coord_flip()+labs(x="Event type", y="Total")
As we saw in our printed lists, the most deadly and injuring events are tornadoes, though floods also contribute many deaths and injuries. Heat can also be quite deadly.
Our second question is, “Across the United States, which types of events have the greatest economic consequences?” First we will aggregate property damage and crop damage by event type.
propDmgByEventType <- aggregate(PROPDMG2 ~ EVTYPE, data=stormData3, sum)
cropDmgByEventType <- aggregate(CROPDMG2 ~ EVTYPE, data=stormData3, sum)
Now we will print out these tables in descending order.
propDmgByEventType[order(propDmgByEventType$PROPDMG2, decreasing = TRUE), ]
## EVTYPE PROPDMG2
## 11 FLOOD 132904889050
## 32 TORNADO 16166771690
## 15 HAIL 7991587690
## 10 FLASH FLOOD 7388886080
## 31 THUNDERSTORM WINDS 3678929940
## 37 WILDFIRE 3547207470
## 20 HIGH WIND 2517547340
## 33 TROPICAL STORM 1056591350
## 38 WINTER STORM 1017844200
## 21 ICE STORM 903037300
## 34 TSTM WIND 662600660
## 24 LIGHTNING 315273980
## 17 HEAVY RAIN 310701930
## 6 DROUGHT 233721000
## 18 HEAVY SNOW 178292000
## 3 COASTAL FLOOD 167580560
## 23 LANDSLIDE 150303500
## 30 STRONG WIND 119252060
## 2 BLIZZARD 94981000
## 19 HIGH SURF 83017500
## 22 LAKE-EFFECT SNOW 40035000
## 39 WINTER WEATHER 19897500
## 35 URBAN FLOOD 12778600
## 13 FROST/FREEZE 9480000
## 9 EXTREME COLD/WIND CHILL 7038000
## 36 WATERSPOUT 5206200
## 7 DUST STORM 3399000
## 16 HEAT 3058200
## 29 STORM SURGE 2915000
## 5 DENSE FOG 2842000
## 1 AVALANCHE 2385800
## 8 EXTREME COLD 2355000
## 4 COLD/WIND CHILL 1990000
## 26 MARINE THUNDERSTORM WIND 436400
## 14 FUNNEL CLOUD 65100
## 12 FOG 32000
## 25 MARINE HAIL 4000
## 27 RIP CURRENTS 1000
## 28 SNOW 1000
cropDmgByEventType[order(cropDmgByEventType$CROPDMG2, decreasing = TRUE), ]
## EVTYPE CROPDMG2
## 11 FLOOD 5265989450
## 21 ICE STORM 5022110000
## 15 HAIL 2028390900
## 6 DROUGHT 1652696000
## 10 FLASH FLOOD 1397535100
## 13 FROST/FREEZE 931801000
## 20 HIGH WIND 651624850
## 31 THUNDERSTORM WINDS 600812250
## 34 TSTM WIND 521631450
## 16 HEAT 493135000
## 33 TROPICAL STORM 451711000
## 32 TORNADO 353376460
## 37 WILDFIRE 284810100
## 18 HEAVY SNOW 131643100
## 2 BLIZZARD 112060000
## 30 STRONG WIND 64948500
## 17 HEAVY RAIN 64585800
## 38 WINTER STORM 23724000
## 23 LANDSLIDE 20017000
## 39 WINTER WEATHER 15000000
## 24 LIGHTNING 5512150
## 35 URBAN FLOOD 3581600
## 7 DUST STORM 2350000
## 8 EXTREME COLD 2255000
## 4 COLD/WIND CHILL 600000
## 26 MARINE THUNDERSTORM WIND 50000
## 28 SNOW 10000
## 29 STORM SURGE 5000
## 1 AVALANCHE 0
## 3 COASTAL FLOOD 0
## 5 DENSE FOG 0
## 9 EXTREME COLD/WIND CHILL 0
## 12 FOG 0
## 14 FUNNEL CLOUD 0
## 19 HIGH SURF 0
## 22 LAKE-EFFECT SNOW 0
## 25 MARINE HAIL 0
## 27 RIP CURRENTS 0
## 36 WATERSPOUT 0
At the top of the lists above, we see that floods rank as the most damaging events in the aggregate. Ice storms are also highly damaging to crops, and tornadoes can be highly damaging to property.
Let’s now see how these event types compare visually. First we merge our property damage dataset and our crop damage dataset.
propDmgByEventType$MEASURE <- rep("Property Damage", 39)
cropDmgByEventType$MEASURE <- rep("Crop Damage", 39)
propDmgByEventType <- rename(propDmgByEventType, TOTAL=PROPDMG2)
cropDmgByEventType <- rename(cropDmgByEventType, TOTAL=CROPDMG2)
allEconomicDmgByEventType <- rbind(propDmgByEventType, cropDmgByEventType)
Now we can plot the results.
ggplot(data=allEconomicDmgByEventType, aes(x=EVTYPE, y=TOTAL, fill=MEASURE))+geom_bar(stat="identity")+labs(title="Economic damage by event type")+scale_fill_brewer(palette="Dark2")+theme(plot.margin = margin(t=0.5, r=0.5, b=0.5, l=0.5, "cm"))+coord_flip()+labs(x="Event type", y="Damage (in dollars)")
Matching our printed results, we see that floods by far contribute the most damage to property and crops.