Synopsis

This report identifies which types of severe weather events in the United States, from 1950 to 2011, were most harmful to population health and which had the greatest economic consequences.
Analysis was done using the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
I classified the severe weather events in the database into 17 broad categories and calculated, for each category, the impact on population health and the total damage caused. The impact on population health is the sum of fatalities and injuries, and the total damage (in million USD) is the sum of property damage and crop damage.
The analysis shows that TORNADO is the most harmful category with respect to population health, accounting for 62.4% of total fatalities and injuries across the US from 1950 to 2011. FLOOD has the greatest economic consequences, followed by HURRICANE, TORNADO and STORM_SURGE; these contributed 37.8%, 18.9%, 12.4% and 9.1% of total damage across the country respectively (78.2% in total).

Data Processing

Download the data and read it into R

First, download the data from the course website and read it into R using read.csv().

download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "StormData.csv.bz2")

data<-read.csv("StormData.csv.bz2")

This data set contains 902,297 rows and 37 columns. The table below lists the names of all the variables and their classes.

str(data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Format the dates and create a “year” variable

I change the format of the variables “BGN_DATE” and “END_DATE” to Date, and create a “year” variable to indicate the event beginning year. This variable will be used later for producing the time series plots.

library(dplyr)

data<-data %>% 
    mutate(BGN_DATE=as.Date(BGN_DATE, "%m/%d/%Y %H:%M:%S")) %>%
    mutate(END_DATE=as.Date(END_DATE, "%m/%d/%Y %H:%M:%S")) %>%
    mutate(year=as.numeric(format(BGN_DATE,'%Y')))
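
As a quick check of the format string, the sketch below parses two toy values laid out like BGN_DATE and extracts the year. The names `sample_dates`, `parsed` and `sample_years` are made up for illustration; only base R is used.

```r
# Toy values in the same "month/day/year hour:minute:second" layout as BGN_DATE
sample_dates <- c("4/18/1950 0:00:00", "6/8/1951 0:00:00")

# as.Date() parses the full string but keeps only the date part
parsed <- as.Date(sample_dates, format = "%m/%d/%Y %H:%M:%S")

# Extract the four-digit year as a number
sample_years <- as.numeric(format(parsed, "%Y"))

print(parsed)
print(sample_years)
```

Empty strings, which appear in END_DATE, cannot be parsed with this format and come back as NA.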

Create a variable for the impact on population health

In this analysis, I use the sum of fatalities and injuries to reflect the impact on population health. The R code below creates the variable “pophealth” from the two variables “FATALITIES” and “INJURIES” in the data.
I have also checked that there are no missing values in these two variables.

summary(data$FATALITIES)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0000   0.0000   0.0168   0.0000 583.0000
summary(data$INJURIES)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0000    0.0000    0.0000    0.1557    0.0000 1700.0000
data <- mutate(data, pophealth=FATALITIES+INJURIES)
summary(data$pophealth)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0000    0.0000    0.0000    0.1725    0.0000 1742.0000
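
The missing-value check mentioned above is not shown in the code. One way it could be done is sketched below on a toy data frame; the data frame `toy` and its values are made up, and one NA is planted on purpose to show what the check would report.

```r
# Toy data frame standing in for the storm data (values are made up)
toy <- data.frame(FATALITIES = c(0, 1, 0), INJURIES = c(15, 14, NA))

# Count missing values per column; a non-zero count means a plain sum would give NA
na_counts <- colSums(is.na(toy[, c("FATALITIES", "INJURIES")]))
print(na_counts)

# rowSums(..., na.rm = TRUE) treats a missing count as zero
toy$pophealth <- rowSums(toy[, c("FATALITIES", "INJURIES")], na.rm = TRUE)
print(toy)
```

On the real data both counts would be expected to be zero, in which case FATALITIES+INJURIES is safe as written.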

Create a variable for the total damage caused

The values for property damage and crop damage are stored in the variables “PROPDMG” and “CROPDMG”. However, to obtain the correct dollar amounts for these two types of damage, we need to take into account the alphabetical characters in the variables “PROPDMGEXP” and “CROPDMGEXP”, which are:

  • “K” for thousands
  • “M” for millions
  • “B” for billions

The total damage (in million USD) is calculated using information from the above mentioned four variables, with the R code below:

proptemp<-rep(1,dim(data)[1])
proptemp[grepl("[Bb]", data$PROPDMGEXP)]<-1000000000
proptemp[grepl("[Mm]", data$PROPDMGEXP)]<-1000000
proptemp[grepl("[Kk]", data$PROPDMGEXP)]<-1000

croptemp<-rep(1,dim(data)[1])
croptemp[grepl("[Bb]", data$CROPDMGEXP)]<-1000000000
croptemp[grepl("[Mm]", data$CROPDMGEXP)]<-1000000
croptemp[grepl("[Kk]", data$CROPDMGEXP)]<-1000


data <- data %>% mutate(totaldmg=(PROPDMG*proptemp+CROPDMG*croptemp)/1000000)
summary(data$totaldmg)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.00e+00 0.00e+00 0.00e+00 5.30e-01 0.00e+00 1.15e+05
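
The same conversion can also be written with a named lookup vector. The sketch below is an alternative formulation on toy data, not the code used in this report; `multipliers` and `exp_to_multiplier` are hypothetical names, and unlike the grepl() version it matches the whole code rather than a substring.

```r
# Named lookup: exponent letter -> dollar multiplier
multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: map exponent codes to multipliers, defaulting to 1
exp_to_multiplier <- function(code) {
  m <- multipliers[toupper(code)]
  m[is.na(m)] <- 1   # blank or unrecognized codes leave the amount unscaled
  unname(m)
}

# Toy records (made-up values)
toy <- data.frame(PROPDMG = c(25, 2.5, 1.2, 10),
                  PROPDMGEXP = c("K", "m", "B", ""),
                  stringsAsFactors = FALSE)

# Damage in million USD
toy$prop_million <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP) / 1e6
print(toy)
```

As in the code above, any code other than K, M or B (in either case) is treated as a multiplier of 1.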

Create broad categories for severe weather events

The types of severe weather events are stored in the variable “EVTYPE” in this database, which contains 985 unique event types in total. However, the coding of these event types is inconsistent, and many of them can be grouped together.

length(unique(data$EVTYPE))
## [1] 985

For example, there are 15 unique event types in the database containing the word “TORNADO”, all of which can reasonably be classified under a single event type.

unique(data[grepl("TORNADO", data$EVTYPE),]$EVTYPE)
##  [1] "TORNADO"                    "TORNADO F0"                
##  [3] "TORNADOS"                   "WATERSPOUT/TORNADO"        
##  [5] "WATERSPOUT TORNADO"         "WATERSPOUT-TORNADO"        
##  [7] "TORNADOES, TSTM WIND, HAIL" "COLD AIR TORNADO"          
##  [9] "WATERSPOUT/ TORNADO"        "TORNADO F3"                
## [11] "TORNADO F1"                 "TORNADO/WATERSPOUT"        
## [13] "TORNADO F2"                 "TORNADOES"                 
## [15] "TORNADO DEBRIS"

Therefore, I regroup the original event types into 17 broad categories using the rules below:

  1. An event is assigned to a category whenever that category's keyword appears in the original variable “EVTYPE”, regardless of letter case and of any accompanying words.
    • e.g. “TORNADO F3” is classified under event type “TORNADO”
    • e.g. “FLOOD FLASH” is classified under event type “FLOOD”
  2. Tide, surf, rip current and tsunami are classified under “TIDE/SURF”
  3. Extreme cold, frost/freeze, snow, ice storm and blizzard are all classified under “WINTER_WEATHER”
  4. “OTHERS” includes all remaining event types not classified elsewhere (e.g. fog, avalanche, volcanic, mudslide, landslide, waterspout, etc.)

event<-rep(0,dim(data)[1])
event[grepl("DROUGHT|DRY", data$EVTYPE)]<-"DROUGHT" 
event[grepl("LIGHTNING", data$EVTYPE)]<-"LIGHTNING"
event[grepl("DUST( )?STORM|DUST DEVIL", data$EVTYPE)]<-"DUST_STORM"
event[grepl("THUNDERSTORM|TSTM", data$EVTYPE)]<-"THUNDERSTORM"
event[grepl("STORM SURGE", data$EVTYPE)]<-"STORM_SURGE"
event[grepl("TROPICAL STORM", data$EVTYPE)]<-"TROPICAL_STORM" 
event[grepl("TIDE|SURF|Surf|RIP CURRENT|TSUNAMI", data$EVTYPE)]<-"TIDE/SURF"
event[grepl("HEAT|[Hh]eat", data$EVTYPE)]<-"HEAT_WAVE" 
event[grepl("WIND|[Ww]ind", data$EVTYPE)]<-"WIND"
event[grepl("RAIN|[Rr]ain", data$EVTYPE)]<-"RAIN"
event[grepl("FIRE|[Ff]ire", data$EVTYPE)]<-"FIRE"
event[grepl("HAIL|[Hh]ail", data$EVTYPE)]<-"HAIL"
event[grepl("HURRICANE|[Hh]urricane", data$EVTYPE)]<-"HURRICANE"
event[grepl("FLOOD|[Ff]lood|STREAM FLD", data$EVTYPE)]<-"FLOOD"
event[grepl("TORNADO", data$EVTYPE)]<-"TORNADO"
event[grepl("WINTER|[Ww]inter|Wintry|WINTRY", data$EVTYPE)]<-"WINTER_WEATHER"
event[grepl("E[Xx](.*)? C[Oo][Ll][Dd]", data$EVTYPE)]<-"WINTER_WEATHER" 
event[grepl("SNOW|[Ss]now|FREEZ|Freez|FROST|Frost",data$EVTYPE)]<-"WINTER_WEATHER" 
event[grepl("ICE STORM|BLIZZARD|[Bb]lizzard", data$EVTYPE)]<-"WINTER_WEATHER"
event[event==0]<-"OTHERS"
data<-cbind(data,event)
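
Because each assignment overwrites earlier ones, the order of the grepl() lines decides the category of event types that match more than one pattern. The sketch below illustrates this on a handful of toy strings with a reduced set of patterns; it is not the full classification.

```r
# Toy event types chosen to match zero, one or two patterns
evtype <- c("TSTM WIND", "THUNDERSTORM WINDS", "TORNADO F3", "FLASH FLOOD", "FOG")

event <- rep("OTHERS", length(evtype))
event[grepl("THUNDERSTORM|TSTM", evtype)] <- "THUNDERSTORM"
event[grepl("WIND", evtype)] <- "WIND"        # runs later, so it overrides THUNDERSTORM
event[grepl("FLOOD", evtype)] <- "FLOOD"
event[grepl("TORNADO", evtype)] <- "TORNADO"

print(data.frame(evtype, event))
```

Here both thunderstorm wind entries end up in WIND, so with this ordering THUNDERSTORM only keeps event types that do not also mention wind.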

The 17 broad event types are listed below:

unique(data$event)
##  [1] "TORNADO"        "WIND"           "HAIL"           "WINTER_WEATHER"
##  [5] "HURRICANE"      "OTHERS"         "RAIN"           "LIGHTNING"     
##  [9] "TIDE/SURF"      "THUNDERSTORM"   "FLOOD"          "HEAT_WAVE"     
## [13] "DUST_STORM"     "FIRE"           "DROUGHT"        "STORM_SURGE"   
## [17] "TROPICAL_STORM"

Results

Calculate the impact on population health and total damage by event types

The two tables below show the total impact on population health and the total damage (in million USD) for each event type, summed across all dates and locations in the United States from 1950 to 2011.

byEvent<-data %>% 
    group_by(event) %>%
    summarize(PopHealth=sum(pophealth), TotalDamage=sum(totaldmg)) %>%
    mutate(PopHealth_percent=round(PopHealth/sum(PopHealth),6)) %>% 
    mutate(TotDamage_percent=round(TotalDamage/sum(TotalDamage),6)) 
## `summarise()` ungrouping output (override with `.groups` argument)
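
To make the aggregation concrete, the sketch below computes per-category totals and shares on a toy data frame using base R only. The values are made up, and `totals` and `shares` are illustrative names rather than objects from this report.

```r
# Toy records (made-up values), one row per event report
toy <- data.frame(event = c("TORNADO", "FLOOD", "TORNADO", "HAIL"),
                  pophealth = c(15, 3, 2, 0),
                  stringsAsFactors = FALSE)

# Total per category, the base R counterpart of group_by(event) + summarize(sum(pophealth))
totals <- sapply(split(toy$pophealth, toy$event), sum)

# Each category's share of the overall total, rounded as in the report
shares <- round(totals / sum(totals), 6)

print(totals)
print(shares)
```

The shares sum to 1 by construction, which is a useful sanity check on the percent columns.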

The event TORNADO has the greatest impact on population health, accounting for 62.4% of total fatalities and injuries across the whole country, while none of the other events contributes more than 10%.

byEvent %>% 
    select(c(1,2,4)) %>%
    arrange(desc(PopHealth))
## # A tibble: 17 x 3
##    event          PopHealth PopHealth_percent
##    <chr>              <dbl>             <dbl>
##  1 TORNADO            97068          0.624   
##  2 WIND               12607          0.0810  
##  3 HEAT_WAVE          12362          0.0794  
##  4 FLOOD              10234          0.0657  
##  5 WINTER_WEATHER      7165          0.0460  
##  6 LIGHTNING           6048          0.0389  
##  7 OTHERS              2363          0.0152  
##  8 FIRE                1698          0.0109  
##  9 TIDE/SURF           1688          0.0108  
## 10 HAIL                1487          0.00955 
## 11 HURRICANE           1461          0.00938 
## 12 DUST_STORM           506          0.00325 
## 13 TROPICAL_STORM       449          0.00288 
## 14 RAIN                 381          0.00245 
## 15 DROUGHT               64          0.000411
## 16 STORM_SURGE           51          0.000328
## 17 THUNDERSTORM          41          0.000263

The event FLOOD causes the highest damage at about 180 billion USD, followed by HURRICANE, TORNADO and STORM_SURGE, which cause about 90, 59 and 43 billion USD of damage respectively. Together, these four types of events account for about 78.2% of total damage from all events. Each of the remaining events accounts for less than 5% of total damage.

totaldamage<-byEvent %>%
    select(c(1,3,5)) %>%
    arrange(desc(TotalDamage))
totaldamage
## # A tibble: 17 x 3
##    event          TotalDamage TotDamage_percent
##    <chr>                <dbl>             <dbl>
##  1 FLOOD            179970.             0.378  
##  2 HURRICANE         90271.             0.189  
##  3 TORNADO           59011.             0.124  
##  4 STORM_SURGE       43324.             0.0909 
##  5 WINTER_WEATHER    21147.             0.0444 
##  6 HAIL              19132.             0.0402 
##  7 WIND              17876.             0.0375 
##  8 DROUGHT           15025.             0.0315 
##  9 FIRE               8905.             0.0187 
## 10 TROPICAL_STORM     8409.             0.0177 
## 11 TIDE/SURF          4898.             0.0103 
## 12 RAIN               4044.             0.00849
## 13 OTHERS             1309.             0.00275
## 14 THUNDERSTORM       1227.             0.00258
## 15 LIGHTNING           941.             0.00198
## 16 HEAT_WAVE           925.             0.00194
## 17 DUST_STORM            9.35           0.00002
sum(totaldamage[1:4,3])
## [1] 0.782028
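
The figures quoted in the text can be rechecked by hand from the printed table. The short sketch below copies the top four rows (rounded as printed) into toy vectors, converts million USD to billion USD and adds up the shares.

```r
# Values copied from the table above (total damage in million USD, and shares)
top4_million <- c(FLOOD = 179970, HURRICANE = 90271, TORNADO = 59011, STORM_SURGE = 43324)
top4_share   <- c(FLOOD = 0.378, HURRICANE = 0.189, TORNADO = 0.124, STORM_SURGE = 0.0909)

# Million USD -> billion USD, rounded to whole billions
top4_billion <- round(top4_million / 1000)
print(top4_billion)

# Combined share of the four categories, as a percentage
top4_percent <- round(100 * sum(top4_share), 1)
print(top4_percent)
```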

The horizontal bar plots below visualize these results.
The left plot shows the impact on population health for each of the 17 event types; TORNADO is far higher than any other event type.
The right plot shows the total damage in million USD by event type; FLOOD causes the greatest damage, while HURRICANE, TORNADO and STORM_SURGE are also high compared with the remaining event types.

library(ggplot2)
byEvent$event1 <- factor(byEvent$event, 
                            levels=(arrange(byEvent, PopHealth)$event))

g1<-ggplot(byEvent, aes(x=event1, y=PopHealth))+
    geom_bar(stat="identity", fill="steelblue") + coord_flip()+
    labs(title="Impact on Population Health", x="",y="")

byEvent$event2 <- factor(byEvent$event, 
                        levels=(arrange(byEvent, TotalDamage)$event))

g2<-ggplot(byEvent, aes(x=event2, y=TotalDamage))+
    geom_bar(stat="identity", fill="steelblue") + coord_flip()+
    labs(title="Total Damage (in Million USD)",x="", y="")

cowplot::plot_grid(g1, g2, labels = "AUTO")
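
The bars come out sorted because the factor levels are set in ascending order of the plotted value before plotting. The sketch below shows that step in isolation on a toy table (made-up values), together with base R's reorder() as an alternative way to get the same level order.

```r
# Toy summary table (values are made up)
toy <- data.frame(event = c("FLOOD", "TORNADO", "HAIL"),
                  PopHealth = c(10, 97, 1),
                  stringsAsFactors = FALSE)

# Levels sorted by ascending PopHealth; after coord_flip() the largest bar is on top
toy$event1 <- factor(toy$event, levels = toy$event[order(toy$PopHealth)])
print(levels(toy$event1))

# reorder() gives the same level order in one call
reordered <- reorder(toy$event, toy$PopHealth)
print(levels(reordered))
```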

Time trend of population health impact and total damage

Besides the total population health impact and damage across all dates for each event type, let’s also look at the time trend.

byEvent_byYear<-data %>% 
    group_by(year, event) %>%
    summarize(PopHealth=sum(pophealth), TotalDamage=sum(totaldmg))
## `summarise()` regrouping output by 'year' (override with `.groups` argument)

Tornado records go back further than those of most other event types, many of which have no records in the early years, which is why TORNADO dominates the cumulative population health impact. Nevertheless, it is still the most harmful event type over the last two decades of the data, with a peak in 2011.

ggplot(byEvent_byYear, aes(year, PopHealth))+geom_line(aes(color=event))+
    labs(x="",y="Impact on Population Health")

The time series plot below shows that FLOOD has the highest total damage largely because of a single spike in 2006, with economic consequences of 118.8 billion dollars (about 24.9% of the overall damage from all events). The plot also shows two other spikes in 2005, caused by HURRICANE and STORM_SURGE.

ggplot(byEvent_byYear, aes(year, TotalDamage))+geom_line(aes(color=event))+
    labs(x="",y="Total Damage (in Million USD)")

damage_byYr<-data.frame(byEvent_byYear) %>% 
    mutate(damage_percent=TotalDamage/sum(TotalDamage)) %>%
    select(c(1,2,4,5)) %>%
    arrange(desc(TotalDamage))
head(damage_byYr)
##   year       event TotalDamage damage_percent
## 1 2006       FLOOD  118824.713     0.24941019
## 2 2005   HURRICANE   51799.317     0.10872551
## 3 2005 STORM_SURGE   43058.565     0.09037888
## 4 2004   HURRICANE   18922.256     0.03971736
## 5 1993       FLOOD   11278.333     0.02367295
## 6 2011     TORNADO    9850.962     0.02067693
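
As a rough cross-check of the 24.9% figure, the sketch below divides the 2006 FLOOD value by the overall total. The overall total here is reconstructed by adding up the rounded per-category totals printed in the damage table earlier in this report, so it is an approximation of the exact sum.

```r
# Per-category totals (million USD) copied from the damage table in this report
totals <- c(179970, 90271, 59011, 43324, 21147, 19132, 17876, 15025,
            8905, 8409, 4898, 4044, 1309, 1227, 941, 925, 9.35)

# FLOOD in 2006 (million USD), from the output above
flood_2006 <- 118824.713

# Share of all recorded damage attributable to that single year/category cell
share <- flood_2006 / sum(totals)
print(round(share, 3))
```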