In this study we sought to explore the effects of severe weather on health and the economy in the US. To do this, we analyzed storm data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database from 1950 to November 2011. The hypothesis was that certain types of events would cause more damage to property while others would be more damaging to health. We found some interesting relationships between the event types and the damage they caused over time. On a per-event basis, “moving” events such as tornadoes and floods cause more damage to property, while “still” events such as heat and cold cause more injuries. However, over time, seasonal events like floods and tornadoes have higher injury and death tolls, just as they cause more damage to property and crops than the other events.
library(dplyr)
library(data.table)
library(R.utils)
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
fmt <- function(x) {
format(x, decimal.mark=".", big.mark=",", small.interval=3, nsmall=2, scientific = FALSE)
}
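As a quick check of what the two helpers do, the sketch below applies them to made-up inputs. The definitions are repeated so the snippet runs on its own, and the sample values are illustrative only.

```r
# Helpers repeated here so the snippet is self-contained
trim <- function(x) gsub("^\\s+|\\s+$", "", x)
fmt <- function(x) {
  format(x, decimal.mark = ".", big.mark = ",", nsmall = 2, scientific = FALSE)
}

# Made-up inputs, for illustration only
trimmed <- trim("  hvy rain  ")   # leading and trailing spaces removed
formatted <- fmt(1700)            # thousands separator and two decimals

print(trimmed)
print(formatted)
```

The fmt helper is what produces figures such as 1,700.00 in the text further down.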
The purpose of the study is to identify which types of events, among events like avalanches, fog, and extreme cold, have the highest impact on health and the economy in the US. The hypothesis is that events with motion, e.g. storms, have a higher impact on property than those without, e.g. extreme fog, while the latter might have a significant impact on health. To investigate this, we’ll be using data from NOAA’s storm database to analyse possible correlations between economic damage, injuries and fatalities caused by each type of event. Our two main questions are which types of events are most harmful to health, and which have the greatest economic impact.
The data was downloaded in bz2 format, decompressed into a CSV, and loaded into a data variable.
dataFile <- "repdata-data-StormData.csv.bz2"
decomFile <-"repdata-data-StormData.csv"
if (!dataFile %in% dir("./")) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = dataFile)
}
if (!decomFile %in% dir("./")) {
bunzip2(dataFile, decomFile, remove = FALSE, skip = TRUE)
}
data <- dplyr::tbl_df(data.table::fread(decomFile))
Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:09
Warning in data.table::fread(decomFile): Read less rows (902297) than were
allocated (967216). Run again with verbose=TRUE and please report.
Looking at the dimensions of the data set, there are 902297 entries and 37 columns.
dim(data)
[1] 902297 37
We observe the columns to identify the most relevant ones to answer our questions:
names(data)
[1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
[6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
[11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
[16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
[21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
[26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
[31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
[36] "REMARKS" "REFNUM"
To measure health hazard, we will evaluate two fields, FATALITIES and INJURIES, which hold the number of fatalities and injuries caused by the event, respectively. To measure economic impact, we will look at the fields PROPDMG and CROPDMG, which hold the estimated damage to property (private structures, objects and vegetation, as well as public infrastructure) and to crops, respectively. These two variables have two complementary variables, PROPDMGEXP and CROPDMGEXP, which describe the magnitude of the value as follows: “K” for thousands of dollars, “M” for millions, “B” for billions, and an empty string for none.
It’s worth noting that at this point it’s unknown whether the damage estimates are adjusted for inflation, and the data spans 61 years.
Peeking at the event types, we find many inconsistencies. The values ought to represent specific natural phenomena that are locally uncommon, such as snow in near-tropical regions. However, there are entries with labels such as hvy rain and wnd, which are alternative spellings (misspellings) of existing event types, and others such as summary july 23-24, which are presumably summary damage reports up to that point, though the year is not specified. There are also entries with leading and trailing spaces, which can likewise split one event type into several categories.
data.sub <- data %>% dplyr::select(STATE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, EVTYPE)
We check the number of unique events before trying to clean it.
length(unique(data.sub$EVTYPE))
[1] 985
To clean the data, we first trim the values of the event type and convert them to lower case. Then, we remove the entries that represent summaries, identified by the word “summary” at the beginning.
data.sub$EVTYPE <- tolower(trim(data.sub$EVTYPE))
# keep only the rows whose event type does not start with "summary"
data.sub <- data.sub[!grepl("^summary", data.sub$EVTYPE), ]
# number of remaining rows, used as the denominator for the ratios below
m <- nrow(data.sub)
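The cleaning steps described above (trim, lower-case, drop summary rows) can be sketched on a small made-up vector. The labels below are invented for illustration and are not taken from the data set.

```r
# Made-up event labels, for illustration only
raw <- c("  Flood", "FLOOD ", "Summary July 23-24", "hvy rain", " WND")

# Trim surrounding whitespace and convert to lower case
clean <- tolower(gsub("^\\s+|\\s+$", "", raw))

# Drop entries that start with the word "summary"
clean <- clean[!grepl("^summary", clean)]

print(clean)
print(length(unique(clean)))
```

In this toy case the two spellings of flood collapse into one category once whitespace and case are normalized.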
To decide whether it would be worth correcting misspelled event labels, we check the share of entries carrying some of these labels:
sum(data.sub$EVTYPE=="wnd")/m
[1] 1.108283e-06
sum(data.sub$EVTYPE=="hvy rain")/m
[1] 2.216565e-06
The ratios are negligible. Still, these misspelled events may be outliers or may have caused considerable damage to health or property, so we correct the ones we identified.
data.sub$EVTYPE[data.sub$EVTYPE=="wnd"] = "wind"
data.sub$EVTYPE[data.sub$EVTYPE=="hvy rain"] = "heavy rain"
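If more misspellings were identified, the corrections could be kept in a single lookup table instead of one assignment per label. The sketch below shows this on a made-up vector; the fixes table is a hypothetical name and only contains the two corrections mentioned in the text.

```r
# Hypothetical lookup table: misspelled label -> corrected label
fixes <- c("wnd" = "wind", "hvy rain" = "heavy rain")

# Made-up labels, for illustration only
labels <- c("wnd", "flood", "hvy rain", "wind")

# Replace only the labels that appear in the lookup table
hit <- labels %in% names(fixes)
labels[hit] <- unname(fixes[labels[hit]])

print(labels)
```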
After doing that, the number of unique event types is reduced to length(unique(data.sub$EVTYPE)). Before we proceed with the analysis, we convert the property and crop damage values to numeric dollar amounts by applying their magnitude, to make comparison easier.
# map a magnitude code to its multiplier; unrecognized codes contribute 0
pow <- function(x){
if (x == ""){
return(1)
}else if (x == "K"){
return (10^3)
}else if (x == "M"){
return (10^6)
}else if (x == "B"){
return (10^9)
}else{
return (0)
}
}
data.summ <- data.sub %>% dplyr::mutate(PROPERTY.DAMAGE = PROPDMG*sapply(PROPDMGEXP, pow), CROP.DAMAGE = CROPDMG*sapply(CROPDMGEXP, pow))
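Applying pow through sapply calls the function once per row. A vectorized alternative is sketched below as a base R lookup. Here exp_multiplier is a hypothetical helper name, not part of the original analysis, and it keeps the same convention as pow: an empty string maps to 1 and any unrecognized code maps to 0.

```r
# Hypothetical vectorized counterpart of pow()
exp_multiplier <- function(exp) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  out <- unname(lookup[exp])    # codes not in the table come back as NA
  out[is.na(out)] <- 0          # unrecognized codes contribute 0
  out[exp %in% ""] <- 1         # empty code means the value is used as is
  out
}

# Made-up values and codes, for illustration only
dmg <- c(25, 2.5, 1, 10, 3)
code <- c("K", "M", "B", "", "k")
dollars <- dmg * exp_multiplier(code)
print(dollars)
```

As with pow, a lower-case code such as "k" is treated as unrecognized in this sketch.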
We first group our data points by event type.
data.gp <- data.summ %>% dplyr::group_by(EVTYPE) %>%
summarise(PROPERTY.DAMAGE=sum(PROPERTY.DAMAGE), CROP.DAMAGE=sum(CROP.DAMAGE), FATALITIES=sum(FATALITIES), INJURIES=sum(INJURIES), N=n())
With the data summary, we can observe basic statistics on injuries, fatalities, and economic damage. The deadliest event caused 583 fatalities, and the event with the highest injury count had 1,700 registered or estimated injuries. As for property damage, the mean is about USD 473,546.67 and the highest is USD 115,000,000,000.00, while crop damage averaged USD 54,409.75, with a maximum of USD 5,000,000,000.00.
With the data grouped, we can accumulate the economic damage. The first interactive visualization lets us explore the effects the different event types have on property and crop damage, and their level of fatalities.
library(plotly)
plot_ly(data.gp, x = CROP.DAMAGE, y = PROPERTY.DAMAGE, text = paste("EventType: ", EVTYPE),
mode = "markers", color = FATALITIES) %>% layout( title = "Total Crop vs Property Damage", hovermode="closest" )
At first glance, we can see that most events are clustered at the bottom left of our chart, which indicates similar, comparatively low levels of damage to crops and property. Some events such as storm surges and heavy rain have no registered impact on crops, while severe thunderstorms and windchills don’t seem to have an impact on property.
Events by Crop Damage
head(arrange(data.gp, desc(CROP.DAMAGE)))
Source: local data frame [6 x 6]
EVTYPE PROPERTY.DAMAGE CROP.DAMAGE FATALITIES INJURIES N
(chr) (dbl) (dbl) (dbl) (dbl) (int)
1 drought 1046106000 13972566000 0 4 2488
2 flood 144657709807 5661968450 470 6789 25327
3 river flood 5118945500 5029459000 2 2 173
4 ice storm 3944927810 5022113500 89 1975 2006
5 hail 15727366777 3025537453 15 1361 288661
6 hurricane 11868319010 2741910000 61 46 174
Floods are more damaging to property than any other event and stand out as an outlier in that respect; they also rank second in damage to crops, where droughts lead. The two leading estimates differ by a factor of about 10: floods caused about USD 145B in damage to property, while droughts caused about USD 14B in damage to crops over a period of 61 years.
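The factor of about 10 can be checked directly from the totals printed in the tables. The sketch below only redoes that arithmetic with the flood property damage and the drought crop damage copied from the output.

```r
# Totals copied from the tables in this report
flood_property <- 144657709807   # PROPERTY.DAMAGE for flood
drought_crop   <- 13972566000    # CROP.DAMAGE for drought

# Ratio between the largest property total and the largest crop total
ratio <- flood_property / drought_crop
print(round(ratio, 2))
```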
Events by Property Damage
head(arrange(data.gp, desc(PROPERTY.DAMAGE)))
Source: local data frame [6 x 6]
EVTYPE PROPERTY.DAMAGE CROP.DAMAGE FATALITIES INJURIES N
(chr) (dbl) (dbl) (dbl) (dbl) (int)
1 flood 144657709807 5661968450 470 6789 25327
2 hurricane/typhoon 69305840000 2607872800 64 1275 88
3 tornado 56925660483 414953110 5633 91346 60652
4 storm surge 43323536000 5000 13 38 261
5 flash flood 16140861717 1421317100 978 1777 54278
6 hail 15727366777 3025537453 15 1361 288661
Looking at fatalities, though, tornadoes have been responsible for more loss of life than any other event, with over 5,000 deaths, followed by excessive heat, with a tally of almost 2,000 deaths.
Events by Fatalities
head(arrange(data.gp, desc(FATALITIES)))
Source: local data frame [6 x 6]
EVTYPE PROPERTY.DAMAGE CROP.DAMAGE FATALITIES INJURIES N
(chr) (dbl) (dbl) (dbl) (dbl) (int)
1 tornado 56925660483 414953110 5633 91346 60652
2 excessive heat 7753700 492402000 1903 6525 1678
3 flash flood 16140861717 1421317100 978 1777 54278
4 heat 1797000 401461500 937 2100 767
5 lightning 928659283 12092090 816 5230 15755
6 tstm wind 4493058440 554007350 504 6957 219946
Events by Injuries
head(arrange(data.gp, desc(INJURIES)))
Source: local data frame [6 x 6]
EVTYPE PROPERTY.DAMAGE CROP.DAMAGE FATALITIES INJURIES N
(chr) (dbl) (dbl) (dbl) (dbl) (int)
1 tornado 56925660483 414953110 5633 91346 60652
2 tstm wind 4493058440 554007350 504 6957 219946
3 flood 144657709807 5661968450 470 6789 25327
4 excessive heat 7753700 492402000 1903 6525 1678
5 lightning 928659283 12092090 816 5230 15755
6 heat 1797000 401461500 937 2100 767
These measures are merely summaries, totalling the impact of each event type over our measured period. We now turn to analyzing the impact on a per-event basis, to see whether there are events that occur less often than others but are more damaging when they do occur.
plot_ly(data.gp, x = CROP.DAMAGE/N, y = PROPERTY.DAMAGE/N, text = paste("EventType: ", EVTYPE),
mode = "markers", color=FATALITIES/N) %>% layout( title="Crop vs Property Damage Per Event", hovermode="closest" )
Taking an average of the damage caused by each event type, the picture changes. Excessive wetness, cold and wet conditions, and damaging freeze are identified as the most damaging events to crops, racking up about USD 142M, USD 66M, and USD 37M in damage each time they occur. For property damage, the combined event type “tornadoes, tstm wind, hail” ranks as the most damaging, at USD 1.6B per occurrence on average.
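The per-event figures are simply the totals divided by the number of occurrences N. The sketch below shows the computation in base R on a made-up table; the event names and values are invented for illustration, and the column name CROP.DAMAGE.TURN follows the one used in the tables of this report.

```r
# Made-up totals per event type, for illustration only
toy <- data.frame(
  EVTYPE = c("a", "b", "c"),
  CROP.DAMAGE = c(1000, 500, 300),
  N = c(100, 1, 3),
  stringsAsFactors = FALSE
)

# Average damage per occurrence
toy$CROP.DAMAGE.TURN <- toy$CROP.DAMAGE / toy$N

# Rank event types by their per-event average, highest first
ranked <- toy[order(-toy$CROP.DAMAGE.TURN), c("EVTYPE", "CROP.DAMAGE.TURN", "N")]
print(ranked)
```

In this toy table the event type with the largest total ranks last per event, while one recorded only once ranks first, which is the kind of shift the per-event view is meant to reveal.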
Events by Crop Damage
head(data.gp %>% arrange(desc(CROP.DAMAGE/N)) %>% mutate(CROP.DAMAGE.TURN=CROP.DAMAGE/N) %>% select(EVTYPE, CROP.DAMAGE.TURN))
Source: local data frame [6 x 2]
EVTYPE CROP.DAMAGE.TURN
(chr) (dbl)
1 excessive wetness 142000000
2 cold and wet conditions 66000000
3 damaging freeze 37028750
4 hurricane/typhoon 29634918
5 river flood 29072017
6 early frost 21000000
Events by Property Damage
head(data.gp %>% arrange(desc(PROPERTY.DAMAGE/N)) %>% mutate(PROPERTY.DAMAGE.TURN=PROPERTY.DAMAGE/N) %>% select(EVTYPE, PROPERTY.DAMAGE.TURN))
Source: local data frame [6 x 2]
EVTYPE PROPERTY.DAMAGE.TURN
(chr) (dbl)
1 tornadoes, tstm wind, hail 1600000000
2 heavy rain/severe weather 1250000000
3 hurricane/typhoon 787566364
4 hurricane opal 350316222
5 storm surge 165990559
6 wild fires 156025000
Observing fatalities on a per-event basis, we get a different picture as well. The combined tornadoes/tstm wind/hail event type has the highest death toll per occurrence, with an average of 25, followed by cold and snow with 14 and tropical storm Gordon with 8. Record/excessive heat ranks 4th with an average of about 6 deaths.
Events by Fatalities
head(data.gp %>% arrange(desc(FATALITIES/N)) %>% mutate(FATALITIES.TURN=FATALITIES/N) %>% select(EVTYPE, FATALITIES.TURN))
Source: local data frame [6 x 2]
EVTYPE FATALITIES.TURN
(chr) (dbl)
1 tornadoes, tstm wind, hail 25.000000
2 cold and snow 14.000000
3 tropical storm gordon 8.000000
4 record/excessive heat 5.666667
5 extreme heat 4.363636
6 heat wave drought 4.000000
Finally, we can plot the number of injuries against fatalities to see which events have the highest impact on health.
plot_ly(data.gp, x = FATALITIES/N, y = INJURIES/N, text = paste("EventType: ", EVTYPE),
mode = "markers") %>% layout( title = "Injuries vs Fatalities per Event", hovermode="closest" )
Tropical storm Gordon caused on average 43 injuries per occurrence, followed by wild fires with roughly 38, and the misspelled label thunderstormw with 27, as the table confirms.
Events by Injuries
head(data.gp %>% arrange(desc(INJURIES/N)) %>% mutate(INJURIES.TURN=INJURIES/N) %>% select(EVTYPE, INJURIES.TURN))
Source: local data frame [6 x 2]
EVTYPE INJURIES.TURN
(chr) (dbl)
1 tropical storm gordon 43.0
2 wild fires 37.5
3 thunderstormw 27.0
4 high wind and seas 20.0
5 snow/high winds 18.0
6 glaze/ice storm 15.0
Looking at the whole period of 61 years, from 1950 to November 2011, the events with the highest damage to crops are droughts, with a total of about USD 14B, followed by floods with about USD 5.7B, and river floods and ice storms with about USD 5B each. The events with the highest damage to property are floods, with a total of about USD 145B, i.e. roughly 10 times the damage droughts caused to crops; in ranking, floods are followed by hurricanes/typhoons and tornadoes, with damage totaling about USD 69B and USD 57B, and the latter lead in fatalities.
However, individually, the events with the highest economic impact on crops are excessive wetness, cold and wet conditions, and damaging freeze, causing between USD 37M and USD 142M in damage each time they occur, while the combined tornadoes/tstm wind/hail event type causes the greatest damage to property, averaging USD 1.6B per occurrence. It is worth noting that tornadoes do not have the highest per-event impact on health, despite their high damage to property; instead it is tropical storm Gordon and wild fires, causing on average 43 and 37.5 injuries per occurrence. Nonetheless, cyclic events such as floods seem to occur far more often than other events, and thus need careful attention as they amass greater total damage over time.
Interestingly, we found a distinction between the events that cause more harm to health and those that cause more damage to property and crops when looking at individual occurrences. While moving events such as tornadoes and hurricanes have a higher damage toll, it is events such as wild fires and heat that cause more injuries. However, over time, seasonal events like floods and tornadoes have higher injury and death tolls, just as they cause more damage to property and crops than any other events.