In this article, using National Oceanic and Atmospheric Administration (NOAA) storm database, we try to determine event types that impact economy and human lives the most in the USA. Fist part of the article will describe how the NOAA database file is transformed for the analysis. Then we conduct exploratory data analysis to reply to stated questions. Unfortunately, due to data quality issues, first we have to deal with inconsistent event type labeling in the data set. Next, we discover that tornadoes, thunderstorm winds and floods/flash floods are the most deadly and economically impactful events. Even though there is great disparity in how many total casualties each event generates, amount of fatalities for events with lower casualties count, can be close to those ranking higher due to greater mortality rates. In the last part of this article, we learn that in last 5.5 years of data out of 11, more economic impact is generated by flood-like events, signaling progressing climate changes and need to address them.
First we load all necessary packages (if you wish to reproduce the results, you first need to install them in you R with install.packages() function):
library(dplyr)
library(ggplot2)
Secondly we download the data, unzip it and load it into R as data frame:
fileName <- "stormDatabase.bz2"
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", fileName)
stormDF <- read.csv(fileName, header = TRUE, sep =",",
stringsAsFactors = FALSE)
We take a look at structure of the Storm data set, to make sure it was loaded correctly:
str(stormDF)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
range(stormDF$REFNUM)
## [1] 1 902297
We can see that number of rows matches perfectly with maximum reference line number (REFNUM column) from the source file, meaning that all data was loaded correctly. As we will be using END_DATE column to filter on the data range, we will create a new column End.Date which will hold the same date information, but is converted to R date format for ease of use:
stormDF$End.Date<- as.Date(stormDF$END_DATE, format = "%m/%d/%Y")
For the sake of validity of analysis, we limit our source data set to events that ended after 1st of January 2000 inclusively (last date available is 2011-11-30). It doesn’t make sense to take into account events from before 2000, due to how different infrastructure and technology were, making the decision based on legacy data inaccurate in today’s setting:
stormDF <- subset(stormDF, End.Date >= (as.Date("2000-01-01")))
In order to reply to the first question, we introduce a new variable to the loaded data frame, stormDF. It will represent sum of injured and killed during an atmospheric event. We’ll call it Casualties and it will sum FATALITIES and INJURIES columns from the source data.
stormDF$Casualties <- stormDF$FATALITIES + stormDF$INJURIES
summary(stormDF$Casualties)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0786 0.0000 1308.0000
We can see that most of the events in the dataset (at least 75%) produce 0 casualties. As these events posed no threat to human health, we deem not needed for further analysis:
stormDF.withCasualties <- stormDF[stormDF$Casualties > 0,]
summary(stormDF.withCasualties$Casualties)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 4.633 3.000 1308.000
We sum the new column by EVTYPE variable and analyze the resulting values for plotting considerations:
stormDF.byEVTYPE <- group_by(stormDF.withCasualties, EVTYPE)
stormDF.sumByEVTYPE <- summarise(stormDF.byEVTYPE, sum(Casualties))
#In order to faciliate further usage, we change the name of the variable to `Casualties`, otherwise `sum` in the name makes R think it's a function
names(stormDF.sumByEVTYPE)[2] <- "Casualties"
summary(stormDF.sumByEVTYPE)
## EVTYPE Casualties
## Length:78 Min. : 1.0
## Class :character 1st Qu.: 4.0
## Mode :character Median : 36.0
## Mean : 527.3
## 3rd Qu.: 303.0
## Max. :16406.0
arrange(stormDF.sumByEVTYPE, desc(Casualties))
## # A tibble: 78 x 2
## EVTYPE Casualties
## <chr> <dbl>
## 1 TORNADO 16406
## 2 EXCESSIVE HEAT 4721
## 3 LIGHTNING 3459
## 4 TSTM WIND 1869
## 5 THUNDERSTORM WIND 1530
## 6 HEAT 1453
## 7 FLASH FLOOD 1412
## 8 HURRICANE/TYPHOON 1339
## 9 WILDFIRE 986
## 10 HIGH WIND 808
## # ... with 68 more rows
By looking at this output we can see a problem: after aggregating number of casualties by EVTYPE we get 78 rows while official documentation for the NOAA storm database (page 6) indicates only 48 event types. There are 2 main reasons for this:
Unfortunately, due to unpredictable nature of errors, we cannot easily automate the process with methods such as adist() method calculating similarity between strings. Therefore, first we manually introduce 48 event types referenced on page 6 of official documentation for the NOAA storm database (page 6). Then we compare how many of those 78 event types are left after we compare them with official 48 types and decide on case by case basis to which official group they should belong (however events of type “Other” remain as such due to their undescribed nature):
official.event.types <- toupper(c("Astronomical Low Tide","Avalanche",
"Blizzard","Coastal Flood",
"Cold/Wind Chill","Debris Flow",
"Dense Fog","Dense Smoke",
"Drought","Dust Devil",
"Dust Storm","Excessive Heat",
"Extreme Cold/Wind Chill","Flash Flood",
"Flood","Frost/Freeze",
"Funnel Cloud","Freezing Fog",
"Hail","Heat",
"Heavy Rain","Heavy Snow",
"High Surf","High Wind",
"Hurricane (Typhoon)","Ice Storm",
"Lake-Effect Snow","Lakeshore Flood",
"Lightning","Marine Hail",
"Marine High Wind","Marine Strong Wind",
"Marine Thunderstorm Wind","Rip Current",
"Seiche","Sleet",
"Storm Surge/Tide","Strong Wind",
"Thunderstorm Wind","Tornado",
"Tropical Depression","Tropical Storm",
"Tsunami","Volcanic Ash",
"Waterspout","Wildfire",
"Winter Storm","Winter Weather"))
nonofficial.event.names <- stormDF.sumByEVTYPE$EVTYPE[!(stormDF.sumByEVTYPE$EVTYPE %in% official.event.types)]
nonofficial.event.names
## [1] "BRUSH FIRE" "COASTAL FLOODING"
## [3] "COLD" "COLD WEATHER"
## [5] "DROWNING" "DRY MICROBURST"
## [7] "EXTREME COLD" "EXTREME WINDCHILL"
## [9] "FALLING SNOW/ICE" "FOG"
## [11] "GUSTY WINDS" "HAZARDOUS SURF"
## [13] "HEAVY SURF" "HEAVY SURF/HIGH SURF"
## [15] "HIGH SEAS" "HIGH WATER"
## [17] "HURRICANE" "HURRICANE/TYPHOON"
## [19] "ICE ON ROAD" "LANDSLIDE"
## [21] "LIGHT SNOW" "MARINE TSTM WIND"
## [23] "NON-SEVERE WIND DAMAGE" "NON TSTM WIND"
## [25] "OTHER" "RIP CURRENTS"
## [27] "ROGUE WAVE" "ROUGH SEAS"
## [29] "SMALL HAIL" "SNOW"
## [31] "STORM SURGE" "STRONG WINDS"
## [33] "THUNDERSTORM" "THUNDERSTORM WIND (G40)"
## [35] "TSTM WIND" "TSTM WIND (G45)"
## [37] "TSTM WIND/HAIL" "UNSEASONABLY WARM"
## [39] "URBAN/SML STREAM FLD" "WARM WEATHER"
## [41] "WILD/FOREST FIRE" "WIND"
## [43] "WINTER WEATHER MIX" "WINTER WEATHER/MIX"
#plyr mapvalues() function offers easy syntax for change of values from one set to another within a vector or factor
#::operator was used instead of loading plyr with library() to avoid conflicting names of functions with dplyr package
stormDF.withCasualties$EVTYPE <- plyr::mapvalues(stormDF.withCasualties$EVTYPE, nonofficial.event.names,
toupper(c("WILDFIRE","COASTAL FLOOD",
"COLD/WIND CHILL","COLD/WIND CHILL",
"FLOOD", "Thunderstorm Wind",
"Extreme Cold/Wind Chill", "Extreme Cold/Wind Chill",
"Heavy Snow", "Dense Fog",
"Strong Wind", "High Surf",
"High Surf", "High Surf",
"High Surf", "High Surf",
"Hurricane (Typhoon)", "Hurricane (Typhoon)",
"Winter Weather", "Debris Flow",
"Winter Storm","Marine Thunderstorm Wind",
"High Wind", "High Wind",
"Other", "Rip Current",
"Storm Surge/Tide", "Marine Strong Wind",
"Hail", "Winter Weather",
"Coastal Flood", "Strong Wind",
"Thunderstorm Wind", "Thunderstorm Wind",
"Thunderstorm Wind", "Thunderstorm Wind",
"Thunderstorm Wind", "Heat",
"Heavy Rain", "Heat",
"WILDFIRE", "Strong Wind",
"Winter Weather", "Winter Weather"
)))
We now use our updated event classification to prepare dataset for plotting and plot the figure:
stormDF.byEVTYPE <- group_by(stormDF.withCasualties, EVTYPE)
stormDF.sumByEVTYPE <- summarise(stormDF.byEVTYPE, sum(Casualties), sum(FATALITIES))
#In order to faciliate further usage, we change the name of the variable to `Casualties`, otherwise `sum` in the name makes R think it's a function
names(stormDF.sumByEVTYPE)[2:3] <- c("Casualties", "Fatalities")
stormDF.sumByEVTYPE$EVTYPE <- as.factor(stormDF.sumByEVTYPE$EVTYPE)
ggplot(stormDF.sumByEVTYPE, aes(x = reorder(EVTYPE,Casualties), y = Casualties))+
geom_point(size=2.7, colour="#4d4d4d")+
geom_point(aes(x = reorder(EVTYPE,Casualties), y = Fatalities),colour = "#f15854")+
ggtitle("Casualties and Fatalities by Event Type in USA, 2000-2011", subtitle = "The closer the red circle is to black one, the highier the death rate of an event type")+
xlab("Event Type")+
ylab("Total Casualties (black) and Fatalities (red)")+
labs(caption="Source: NOAA Storm Database")+
coord_flip()+
theme_bw()+
theme(panel.border = element_blank())
We can see that Tornados, Excessive Heat, Thunderstorm Wind and Lightning pose the biggest threat to human life and they account for 68% of total casualties and 49% of total fatalities. We can also see that death rate of an event type is somewhat strongly positively correlated with number of total casualties. What we can see also from the dot plot above that Tornados, despite their huge casualties amount, has very low mortality rate, and certain events like excessive heat and flash flood produce almost as many fatalities. This means that for no event type preventive measures should be decided based on total casualties numbers only.
Similarly to how we analyzed impact on population’s health, we will now try to answer the question which event types are most impactful to the economy. In order to do so, we again take the data for period 2000-2011, crate a new variable Economic.Damage which sums CROPDMG and PROPDMG from the source file.
stormDF$Economic.Damage <- stormDF$CROPDMG + stormDF$PROPDMG
summary(stormDF$Economic.Damage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 13.12 1.00 5000.00
We again see a lot of events producing no economic damage like with casualties. As we do not need to analyze them, we remove them in the next step:
stormDF.withEconomicDmg <- stormDF[stormDF$Economic.Damage > 0,]
summary(stormDF.withEconomicDmg$Economic.Damage)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01 3.00 10.00 45.29 29.00 5000.00
We aggregate column Economic.Damage by EVTYPE and verify whether we need to adjust event type names to match the official, 48-type, naming convention:
stormDF.byEVTYPE <- group_by(stormDF.withEconomicDmg, EVTYPE)
stormDF.sumByEVTYPE <- summarise(stormDF.byEVTYPE, sum(Economic.Damage))
#In order to faciliate further usage, we change the name of the variable to `Economic.Damage`, otherwise `sum` in the name makes R think it's a function
names(stormDF.sumByEVTYPE)[2] <- "Economic.Damage"
arrange(stormDF.sumByEVTYPE, desc(Economic.Damage))
## # A tibble: 93 x 2
## EVTYPE Economic.Damage
## <chr> <dbl>
## 1 FLASH FLOOD 1131715.05
## 2 TORNADO 980746.61
## 3 THUNDERSTORM WIND 928920.36
## 4 TSTM WIND 865286.92
## 5 HAIL 815812.65
## 6 FLOOD 792567.18
## 7 LIGHTNING 397297.29
## 8 HIGH WIND 259038.15
## 9 WINTER STORM 97746.93
## 10 WILDFIRE 87371.54
## # ... with 83 more rows
We can see that we get a lot of event types that do not fall into the official classification. We repeat new label assignment as we did in the first step in order to minimize the amount of relabeling we need to do this time, then we again use the mapvalues() function to fix this data quality issue:
#plyr mapvalues() function offers easy syntax for change of values from one set to another within a vector or factor
#::operator was used instead of loading plyr with library() to avoid conflicting names of functions with dplyr package
stormDF.withEconomicDmg$EVTYPE <- plyr::mapvalues(stormDF.withEconomicDmg$EVTYPE, nonofficial.event.names,
toupper(c("WILDFIRE","COASTAL FLOOD",
"COLD/WIND CHILL","COLD/WIND CHILL",
"FLOOD", "Thunderstorm Wind",
"Extreme Cold/Wind Chill", "Extreme Cold/Wind Chill",
"Heavy Snow", "Dense Fog",
"Strong Wind", "High Surf",
"High Surf", "High Surf",
"High Surf", "High Surf",
"Hurricane (Typhoon)", "Hurricane (Typhoon)",
"Winter Weather", "Debris Flow",
"Winter Storm","Marine Thunderstorm Wind",
"High Wind", "High Wind",
"Other", "Rip Current",
"Storm Surge/Tide", "Marine Strong Wind",
"Hail", "Winter Weather",
"Coastal Flood", "Strong Wind",
"Thunderstorm Wind", "Thunderstorm Wind",
"Thunderstorm Wind", "Thunderstorm Wind",
"Thunderstorm Wind", "Heat",
"Heavy Rain", "Heat",
"WILDFIRE", "Strong Wind",
"Winter Weather", "Winter Weather"
)), warn_missing = FALSE)
nonofficial.event.names2 <- unique(stormDF.withEconomicDmg$EVTYPE[!(stormDF.withEconomicDmg$EVTYPE %in% official.event.types)])
nonofficial.event.names2
## [1] "RAIN" "TSTM WIND (G40)"
## [3] "TSTM WIND G45" "FREEZE"
## [5] " TSTM WIND" "DAM BREAK"
## [7] "OTHER" "LIGHT FREEZING RAIN"
## [9] "MIXED PRECIPITATION" "LAKE EFFECT SNOW"
## [11] "GUSTY WIND" " FLASH FLOOD"
## [13] "LATE SEASON SNOW" "NON-TSTM WIND"
## [15] "MUDSLIDE" "MUD SLIDE"
## [17] "BLOWING DUST" " HIGH SURF ADVISORY"
## [19] "WHIRLWIND" "SNOW SQUALLS"
## [21] "ASTRONOMICAL HIGH TIDE"
stormDF.withEconomicDmg$EVTYPE <- plyr::mapvalues(stormDF.withEconomicDmg$EVTYPE, nonofficial.event.names2, toupper(c("Heavy Rain","Thunderstorm Wind",
"Thunderstorm Wind", "Frost/Freeze",
"Thunderstorm Wind","Flash Flood",
"Other", "Winter Weather",
"Winter Weather", "Lake-Effect Snow",
"Strong Wind","Flash Flood",
"Winter Weather","Strong Wind",
"Flash Flood","Flash Flood",
"Dust Storm", "High Surf",
"Thunderstorm Wind", "Winter Weather",
"High Surf"
)), warn_missing = FALSE)
We now prepare the data for plotting and plot the values. In order to understand which event types have been causing more damage in recent years, we will divide our dataset into two periods: events that ended between 1st of January 2000 and 30th of June 2006 and those that ended between 1st of July 2006 and the end of 2011.
stormDF.byEVTYPEandEnd.Date <- group_by(stormDF.withEconomicDmg, EVTYPE, End.Date)
stormDF.sumByEVTYPEandEnd.Date <- summarise(stormDF.byEVTYPEandEnd.Date, sum(Economic.Damage))
names(stormDF.sumByEVTYPEandEnd.Date)[3] <- "Economic.Damage"
stormDF.sumByEVTYPEandEnd.Date$Economic.Damage.AfterJun2006 <- ifelse(stormDF.sumByEVTYPEandEnd.Date$End.Date >= as.Date("2006-07-01"), stormDF.sumByEVTYPEandEnd.Date$Economic.Damage, 0)
sormDF.forPlot2 <- group_by(stormDF.sumByEVTYPEandEnd.Date, EVTYPE)
sormDF.forPlot2 <- summarise(sormDF.forPlot2, Economic.Damage = sum(Economic.Damage), Economic.Damage.AfterJun2006 = sum(Economic.Damage.AfterJun2006))
ggplot(sormDF.forPlot2, aes(x = reorder(EVTYPE,Economic.Damage), y = Economic.Damage))+
geom_col(fill = "#88bde6")+
geom_col(aes(x = reorder(EVTYPE,Economic.Damage), y = Economic.Damage.AfterJun2006), fill = "#eddd46", width = 0.35, position = "stack")+
ggtitle("Economic Damage in USA by Event Type in USD in 2000-2011 (blue) and in July 2006-2011 (yellow)", subtitle = "The closer the yellow bar end to the end of the blue one, the more damage was created after July 2006")+
xlab("Event Type")+
ylab("Economic Damage in USD")+
labs(caption="Source: NOAA Storm Database")+
coord_flip()+
theme_bw()+
theme(panel.border = element_blank())
Thunderstorm Wind, Flash Flood, Tornados, hails and floods have by far the most economic impact. They contribute to a little over 80% of total damage in years 2000-2011. Observation of yellow bars allows us to see that in the second half of the analyzed period, Floods and similar events are more harmful to the economy, than in previous years, signaling climate related changes that need to be taken into account when devising plans for new infrastructures.