This report asks which types of severe weather events in the United States, from 1950 to 2011, were most harmful to population health, and which had the greatest economic consequences.
Analysis was done using the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
I classified the severe weather events in the database into 17 broad categories, and calculated their impact on population health, as well as the total damage caused. The impact on population health was obtained by adding up the fatalities and injuries, and the total damage (as million-dollar amounts) was the summation of property damage and crop damage.
The analysis shows that TORNADO is the most harmful event type with respect to population health: it accounted for 62.4% of total fatalities and injuries across the US from 1950 to 2011. FLOOD has the greatest economic consequences, followed by HURRICANE, TORNADO and STORM_SURGE. These four contributed 37.8%, 18.9%, 12.4% and 9.1% of total damage across the country respectively (78.2% in total).
First, let’s download the data from the course website and read it into R using read.csv().
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "StormData.csv.bz2")
data<-read.csv("StormData.csv.bz2")
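As a small aside, read.csv() can take the .bz2 file directly because R file connections detect bzip2 compression when reading. The sketch below shows this on a throwaway file. The temporary path and the toy values are assumptions made for illustration and are not part of the storm data.

```r
# Write a tiny CSV through a bzip2 connection, then read it back with read.csv().
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0))
tmp <- tempfile(fileext = ".csv.bz2")   # hypothetical temporary file

con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# No explicit decompression step is needed when reading.
toy_back <- read.csv(tmp)
print(toy_back)

unlink(tmp)  # clean up the temporary file
```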
This data set contains 902,297 rows and 37 columns. The output below lists the name and class of each variable.
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
I change the format of the variables “BGN_DATE” and “END_DATE” to Date, and create a “year” variable to indicate the event beginning year. This variable will be used later for producing the time series plots.
library(dplyr)
data<-data %>%
mutate(BGN_DATE=as.Date(BGN_DATE, "%m/%d/%Y %H:%M:%S")) %>%
mutate(END_DATE=as.Date(END_DATE, "%m/%d/%Y %H:%M:%S")) %>%
mutate(year=as.numeric(format(BGN_DATE,'%Y')))
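To make the date conversion concrete, here is a minimal base R sketch of the same step on a few values copied from the str() output above. The vector names are illustrative only. Note that the empty END_DATE strings become NA rather than raising an error.

```r
# Sample values copied from the BGN_DATE and END_DATE columns shown in str(data).
bgn <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00", "6/8/1951 0:00:00")
end <- c("", "", "")

# Same format string as in the analysis code.
bgn_date <- as.Date(bgn, format = "%m/%d/%Y %H:%M:%S")
end_date <- as.Date(end, format = "%m/%d/%Y %H:%M:%S")  # empty strings become NA

# Extract the beginning year as a number.
year <- as.numeric(format(bgn_date, "%Y"))

print(bgn_date)
print(year)
print(end_date)
```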
In this analysis, I use the sum of fatalities and injuries to reflect the impact on population health. The R code below creates the variable “pophealth” from the two variables “FATALITIES” and “INJURIES” in the data.
I have also checked that there are no missing values in these two variables.
summary(data$FATALITIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0168 0.0000 583.0000
summary(data$INJURIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1557 0.0000 1700.0000
data <- mutate(data, pophealth=FATALITIES+INJURIES)
summary(data$pophealth)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1725 0.0000 1742.0000
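summary() only reports an NA count when a variable has missing values, so the absence of that column above is the evidence here. A more explicit check is sketched below on a toy data frame, which is an assumption for illustration; with the real data the same calls would be applied to data$FATALITIES and data$INJURIES.

```r
# Toy rows standing in for the FATALITIES and INJURIES columns.
toy <- data.frame(FATALITIES = c(0, 1, 0), INJURIES = c(15, 14, 0))

# Count missing values per column; zeros mean the sum below is safe.
na_counts <- colSums(is.na(toy[, c("FATALITIES", "INJURIES")]))
print(na_counts)

# Combined health impact, as in the analysis.
toy$pophealth <- toy$FATALITIES + toy$INJURIES
print(toy)

# Why the check matters: a single NA would propagate through the sum.
print(1 + NA)
```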
The values for property damage and crop damage are stored in the variables “PROPDMG” and “CROPDMG”. However, to obtain the correct dollar amounts for these two types of damage, we also need the alphabetical characters in the variables “PROPDMGEXP” and “CROPDMGEXP”, which give the magnitude of each amount: “K” for thousands, “M” for millions and “B” for billions (in either case). Any other value is treated as a multiplier of 1.
The total damage (in million USD) is calculated from these four variables with the R code below:
proptemp<-rep(1,dim(data)[1])
proptemp[grepl("[Bb]", data$PROPDMGEXP)]<-1000000000
proptemp[grepl("[Mm]", data$PROPDMGEXP)]<-1000000
proptemp[grepl("[Kk]", data$PROPDMGEXP)]<-1000
croptemp<-rep(1,dim(data)[1])
croptemp[grepl("[Bb]", data$CROPDMGEXP)]<-1000000000
croptemp[grepl("[Mm]", data$CROPDMGEXP)]<-1000000
croptemp[grepl("[Kk]", data$CROPDMGEXP)]<-1000
data <- data %>% mutate(totaldmg=(PROPDMG*proptemp+CROPDMG*croptemp)/1000000)
summary(data$totaldmg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00e+00 0.00e+00 0.00e+00 5.30e-01 0.00e+00 1.15e+05
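The multiplier logic can also be written as a lookup table, which may be easier to audit than repeated grepl() calls. This is only a sketch: exp_multiplier is a hypothetical helper, not part of the original analysis, and the sample amounts are made up apart from the "25 K" pattern seen in the str() output.

```r
# Hypothetical helper: map an exponent code to a dollar multiplier.
# Codes other than K, M, B (in either case) fall back to 1, as in the analysis.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(code)])
  mult[is.na(mult)] <- 1
  mult
}

# Illustrative amounts and exponent codes.
propdmg <- c(25, 2.5, 1.5, 3)
propexp <- c("K", "k", "M", "")

# Convert to million USD, the unit used for totaldmg.
damage_musd <- propdmg * exp_multiplier(propexp) / 1e6
print(damage_musd)
```

For example, a PROPDMG of 25 with exponent "K" is 25,000 USD, or 0.025 million USD.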
The types of severe weather events are stored in the variable “EVTYPE”, which has 985 unique values in total. However, the coding of these event types is somewhat messy, and many of them can be grouped together.
length(unique(data$EVTYPE))
## [1] 985
For example, the database contains 15 unique event types that include the word “TORNADO”, all of which I think can be classified under a single event type.
unique(data[grepl("TORNADO", data$EVTYPE),]$EVTYPE)
## [1] "TORNADO" "TORNADO F0"
## [3] "TORNADOS" "WATERSPOUT/TORNADO"
## [5] "WATERSPOUT TORNADO" "WATERSPOUT-TORNADO"
## [7] "TORNADOES, TSTM WIND, HAIL" "COLD AIR TORNADO"
## [9] "WATERSPOUT/ TORNADO" "TORNADO F3"
## [11] "TORNADO F1" "TORNADO/WATERSPOUT"
## [13] "TORNADO F2" "TORNADOES"
## [15] "TORNADO DEBRIS"
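Therefore, I regroup the original event types into 17 broad categories based on the logic below. The rules are applied in order, so when an event type matches more than one pattern, the last matching rule determines its category.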
event<-rep(0,dim(data)[1])
event[grepl("DROUGHT|DRY", data$EVTYPE)]<-"DROUGHT"
event[grepl("LIGHTNING", data$EVTYPE)]<-"LIGHTNING"
event[grepl("DUST( )?STORM|DUST DEVIL", data$EVTYPE)]<-"DUST_STORM"
event[grepl("THUNDERSTORM|TSTM", data$EVTYPE)]<-"THUNDERSTORM"
event[grepl("STORM SURGE", data$EVTYPE)]<-"STORM_SURGE"
event[grepl("TROPICAL STORM", data$EVTYPE)]<-"TROPICAL_STORM"
event[grepl("TIDE|SURF|Surf|RIP CURRENT|TSUNAMI", data$EVTYPE)]<-"TIDE/SURF"
event[grepl("HEAT|[Hh]eat", data$EVTYPE)]<-"HEAT_WAVE"
event[grepl("WIND|[Ww]ind", data$EVTYPE)]<-"WIND"
event[grepl("RAIN|[Rr]ain", data$EVTYPE)]<-"RAIN"
event[grepl("FIRE|[Ff]ire", data$EVTYPE)]<-"FIRE"
event[grepl("HAIL|[Hh]ail", data$EVTYPE)]<-"HAIL"
event[grepl("HURRICANE|[Hh]urricane", data$EVTYPE)]<-"HURRICANE"
event[grepl("FLOOD|[Ff]lood|STREAM FLD", data$EVTYPE)]<-"FLOOD"
event[grepl("TORNADO", data$EVTYPE)]<-"TORNADO"
event[grepl("WINTER|[Ww]inter|Wintry|WINTRY", data$EVTYPE)]<-"WINTER_WEATHER"
event[grepl("E[Xx](.*)? C[Oo][Ll][Dd]", data$EVTYPE)]<-"WINTER_WEATHER"
event[grepl("SNOW|[Ss]now|FREEZ|Freez|FROST|Frost",data$EVTYPE)]<-"WINTER_WEATHER"
event[grepl("ICE STORM|BLIZZARD|[Bb]lizzard", data$EVTYPE)]<-"WINTER_WEATHER"
event[event==0]<-"OTHERS"
data<-cbind(data,event)
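Because each assignment overwrites earlier ones, the order of the rules decides where mixed labels end up. The sketch below replays a subset of the rules, in the same relative order, on a few labels; "TORNADOES, TSTM WIND, HAIL" is taken from the list above, while the other three labels are illustrative assumptions.

```r
# A few event labels to classify.
evtype <- c("THUNDERSTORM WIND", "TORNADOES, TSTM WIND, HAIL", "FLASH FLOOD", "FOG")

# Start from the fallback category, then apply rules in order.
event <- rep("OTHERS", length(evtype))
event[grepl("THUNDERSTORM|TSTM", evtype)] <- "THUNDERSTORM"
event[grepl("WIND", evtype)] <- "WIND"        # overwrites THUNDERSTORM where both match
event[grepl("HAIL", evtype)] <- "HAIL"
event[grepl("FLOOD", evtype)] <- "FLOOD"
event[grepl("TORNADO", evtype)] <- "TORNADO"  # overwrites WIND and HAIL where all match

print(data.frame(evtype, event))
```

Under this ordering a label containing both "THUNDERSTORM" and "WIND" lands in WIND, which may be part of why the THUNDERSTORM category is so small in the result tables.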
The 17 broad event types are listed below:
unique(data$event)
## [1] "TORNADO" "WIND" "HAIL" "WINTER_WEATHER"
## [5] "HURRICANE" "OTHERS" "RAIN" "LIGHTNING"
## [9] "TIDE/SURF" "THUNDERSTORM" "FLOOD" "HEAT_WAVE"
## [13] "DUST_STORM" "FIRE" "DROUGHT" "STORM_SURGE"
## [17] "TROPICAL_STORM"
The two tables below show the total impact on population health and the total damage (in million USD) for each event type, summed across all dates and locations in the United States from 1950 to 2011.
byEvent<-data %>%
group_by(event) %>%
summarize(PopHealth=sum(pophealth), TotalDamage=sum(totaldmg)) %>%
mutate(PopHealth_percent=round(PopHealth/sum(PopHealth),6)) %>%
mutate(TotDamage_percent=round(TotalDamage/sum(TotalDamage),6))
## `summarise()` ungrouping output (override with `.groups` argument)
TORNADO has the greatest impact on population health: it accounts for 62.4% of total fatalities and injuries across the whole country, while no other event type contributes more than 10%.
byEvent %>%
select(c(1,2,4)) %>%
arrange(desc(PopHealth))
## # A tibble: 17 x 3
## event PopHealth PopHealth_percent
## <chr> <dbl> <dbl>
## 1 TORNADO 97068 0.624
## 2 WIND 12607 0.0810
## 3 HEAT_WAVE 12362 0.0794
## 4 FLOOD 10234 0.0657
## 5 WINTER_WEATHER 7165 0.0460
## 6 LIGHTNING 6048 0.0389
## 7 OTHERS 2363 0.0152
## 8 FIRE 1698 0.0109
## 9 TIDE/SURF 1688 0.0108
## 10 HAIL 1487 0.00955
## 11 HURRICANE 1461 0.00938
## 12 DUST_STORM 506 0.00325
## 13 TROPICAL_STORM 449 0.00288
## 14 RAIN 381 0.00245
## 15 DROUGHT 64 0.000411
## 16 STORM_SURGE 51 0.000328
## 17 THUNDERSTORM 41 0.000263
FLOOD causes the highest damage at about 180 billion USD, followed by HURRICANE, TORNADO and STORM_SURGE, which cause about 90, 59 and 43 billion USD of damage respectively. Together, these four event types account for about 78.2% of total damage from all events. Each of the remaining event types accounts for less than 5%.
totaldamage<-byEvent %>%
select(c(1,3,5)) %>%
arrange(desc(TotalDamage))
totaldamage
## # A tibble: 17 x 3
## event TotalDamage TotDamage_percent
## <chr> <dbl> <dbl>
## 1 FLOOD 179970. 0.378
## 2 HURRICANE 90271. 0.189
## 3 TORNADO 59011. 0.124
## 4 STORM_SURGE 43324. 0.0909
## 5 WINTER_WEATHER 21147. 0.0444
## 6 HAIL 19132. 0.0402
## 7 WIND 17876. 0.0375
## 8 DROUGHT 15025. 0.0315
## 9 FIRE 8905. 0.0187
## 10 TROPICAL_STORM 8409. 0.0177
## 11 TIDE/SURF 4898. 0.0103
## 12 RAIN 4044. 0.00849
## 13 OTHERS 1309. 0.00275
## 14 THUNDERSTORM 1227. 0.00258
## 15 LIGHTNING 941. 0.00198
## 16 HEAT_WAVE 925. 0.00194
## 17 DUST_STORM 9.35 0.00002
sum(totaldamage[1:4,3])
## [1] 0.782028
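The 78.2% figure can be reproduced by hand from the table. The sketch below re-enters the TotalDamage column as printed (so the values are rounded) and computes cumulative shares in base R; the vector names are illustrative.

```r
# TotalDamage values (million USD) copied from the table above, rounded as printed.
total_damage <- c(FLOOD = 179970, HURRICANE = 90271, TORNADO = 59011,
                  STORM_SURGE = 43324, WINTER_WEATHER = 21147, HAIL = 19132,
                  WIND = 17876, DROUGHT = 15025, FIRE = 8905,
                  TROPICAL_STORM = 8409, "TIDE/SURF" = 4898, RAIN = 4044,
                  OTHERS = 1309, THUNDERSTORM = 1227, LIGHTNING = 941,
                  HEAT_WAVE = 925, DUST_STORM = 9.35)

# Share of each event type, then the running total from largest to smallest.
share <- total_damage / sum(total_damage)
cum_share <- cumsum(sort(share, decreasing = TRUE))

# Cumulative share of the four most damaging event types.
print(round(cum_share[1:4], 3))
```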
The horizontal bar plots below visualize these results.
The left plot shows the impact on population health for each of the 17 event types. TORNADO has a much higher value than any other event type.
The right plot shows the total damage in million USD by event type. FLOOD has the greatest damage, while HURRICANE, TORNADO and STORM_SURGE are also higher than the remaining event types.
library(ggplot2)
byEvent$event1 <- factor(byEvent$event,
levels=(arrange(byEvent, PopHealth)$event))
g1<-ggplot(byEvent, aes(x=event1, y=PopHealth))+
geom_bar(stat="identity", fill="steelblue") + coord_flip()+
labs(title="Impact on Population Health", x="",y="")
byEvent$event2 <- factor(byEvent$event,
levels=(arrange(byEvent, TotalDamage)$event))
g2<-ggplot(byEvent, aes(x=event2, y=TotalDamage))+
geom_bar(stat="identity", fill="steelblue") + coord_flip()+
labs(title="Total Damage (in Million USD)",x="", y="")
cowplot::plot_grid(g1, g2, labels = "AUTO")
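The bars come out sorted because the event names are turned into factors whose levels follow the plotted value. A minimal base R sketch of that step, using three PopHealth values from the table above, is shown here; the toy data frame is an assumption for illustration.

```r
# Three rows taken from the population health table.
toy <- data.frame(event = c("FLOOD", "TORNADO", "HAIL"),
                  PopHealth = c(10234, 97068, 1487),
                  stringsAsFactors = FALSE)

# Order the factor levels by ascending PopHealth. With coord_flip(), the first
# level is drawn at the bottom, so the largest bar ends up at the top.
toy$event1 <- factor(toy$event, levels = toy$event[order(toy$PopHealth)])

print(levels(toy$event1))
```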
Besides the total population health impact and damage across all dates for each event type, let’s also look at the time trend.
byEvent_byYear<-data %>%
group_by(year, event) %>%
summarize(PopHealth=sum(pophealth), TotalDamage=sum(totaldmg))
## `summarise()` regrouping output by 'year' (override with `.groups` argument)
TORNADO events are better recorded in the early years, when most other event types have no records at all, which is part of why TORNADO dominates the overall population health impact. Nevertheless, it is still the most harmful event type within the most recent two decades, with a peak in 2011.
ggplot(byEvent_byYear, aes(year, PopHealth))+geom_line(aes(color=event))+
labs(x="",y="Impact on Population Health")
The time series plot below shows that FLOOD has the highest total damage largely because of a single event in 2006, a year in which flood damage came to 118.8 billion dollars (about 24.9% of overall damage from all events). The plot also shows that HURRICANE and STORM_SURGE caused great damage in 2005.
ggplot(byEvent_byYear, aes(year, TotalDamage))+geom_line(aes(color=event))+
labs(x="",y="Total Damage (in Million USD)")
damage_byYr<-data.frame(byEvent_byYear) %>%
mutate(damage_percent=TotalDamage/sum(TotalDamage)) %>%
select(c(1,2,4,5)) %>%
arrange(desc(TotalDamage))
head(damage_byYr)
## year event TotalDamage damage_percent
## 1 2006 FLOOD 118824.713 0.24941019
## 2 2005 HURRICANE 51799.317 0.10872551
## 3 2005 STORM_SURGE 43058.565 0.09037888
## 4 2004 HURRICANE 18922.256 0.03971736
## 5 1993 FLOOD 11278.333 0.02367295
## 6 2011 TORNADO 9850.962 0.02067693
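To gauge how much the FLOOD ranking depends on 2006, the figures from the tables above can be combined in a short calculation. This is a rough sketch using the values as printed, so the results are approximate; the variable names are illustrative.

```r
# Values in million USD, copied from the tables above.
flood_total <- 179970        # FLOOD, all years
flood_2006 <- 118824.713     # FLOOD, 2006 only
hurricane_total <- 90271     # HURRICANE, all years

# Share of all FLOOD damage that comes from 2006.
share_2006 <- flood_2006 / flood_total
print(round(share_2006, 3))

# FLOOD damage with 2006 left out, compared with the HURRICANE total.
flood_without_2006 <- flood_total - flood_2006
print(flood_without_2006)
print(flood_without_2006 < hurricane_total)
```

On these numbers, roughly two thirds of all FLOOD damage falls in 2006, and without that year FLOOD would rank below HURRICANE.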