The Most Severe Weather Events In The United States

Synopsis

Data from Department of Commerce • National Oceanic & Atmospheric Administration • National Weather Service were be used to analyse which kinds of weather event cause the most harm within the USA. About 900,000 observations were consolidated and analyzed regarding the damage they cause. With respect to health the most harmful events are tornados, thunderstorm winds and exessive heat. With respect to economics the most harmful events are flood, hurricane and tornado. Since tornado is one of the three most harmful weather events with respect to health and economics, respectively, precautionary measures against tornados should be adopted.

Data Processing

From the coursera course project we obtained data on characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Reading in the data

We read in the data from the csv-file included in the zip archive. The data is a comma-separated-value file compressed via the bzip2 algorithm.

destfile = "repdata_data_StormData.csv.bz2"
if(!file.exists(destfile)){
    download.file("https://d396qusza40orc.cloudfront.net/repdata/data/StormData.csv.bz2", destfile)
}
if(!exists("storm0")){
  storm0 <- read.csv("repdata_data_StormData.csv.bz2")
}

After reading in the data we got an overview of the data frame and its size (there are 902,297).

dim(storm0)
## [1] 902297     37
str(storm0)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Missing values are a common problem with environmental data and so we check to se what proportion of the observations are missing (i.e. coded as NA).

mean(is.na(storm0))  ## Are missing values important here?
## [1] 0.05229737

Because the proportion of missing values is relatively low (0.0522974), we choose to ignore missing values for now.

Processing and editing the data

We changed the column names to lower case for convenience.

storm1 <- storm0 ## copy data frame
names(storm1) <- tolower(names(storm1))

We looked at the column evtype. There are 0 different event types recorded. For further analysis we copied the data frame and uncapitalised the values.

storm1$evtype <- tolower(storm1$evtype) ## convert into small letters
length(unique(storm1$evtype)) ## How many different event types - recorded?
## [1] 898

We created a vector with the event types described within the documentation of the dataset.

events <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke", "Drought", "Dust Devil", "Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze", "Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Hurricane (Typhoon)", "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Winter Weather")
events <- tolower(events)
length(unique(events)) ## How many different event types - defined?
## [1] 48

Since there are far too many event types recorded within the dataset, we looked for recorded events which are not described in the Storm Data Event Table and edited the text variables in order to consolidate different spellings.

eventnoise <- storm1$evtype %in% events
eventnoisedata <- unique(storm1$evtype[!eventnoise])
length(unique(eventnoisedata)) ## How many different event types - not defined?
## [1] 852

There are noticeable patterns for a wrong documentation or misspelling of some events. We corrected these items. We analysed the frequency of the observed events in order to decide how to deal with wrong documented events.

countevtype1 <- count(storm1, evtype)
countevtype1[with(countevtype1, order(-n, evtype)),] ## occurrence of eventtypes
## # A tibble: 898 x 2
##    evtype                  n
##    <chr>               <int>
##  1 hail               288661
##  2 tstm wind          219942
##  3 thunderstorm wind   82564
##  4 tornado             60652
##  5 flash flood         54277
##  6 flood               25327
##  7 thunderstorm winds  20843
##  8 high wind           20214
##  9 lightning           15754
## 10 heavy snow          15708
## # ... with 888 more rows

According to their frequency we corrected the observed events.

storm2 <- storm1
storm2$evtype <- gsub("tstm", "thunderstorm", storm2$evtype) #abbreviation for thunderstorm
storm2$evtype <- gsub("^(thunder.*)(.*)", "thunderstorm wind", storm2$evtype) #misspelling thunderstorm
storm2$evtype <- gsub("^((?!marine).*)((?<!marine).)thunde.* win.*", "thunderstorm wind", storm2$evtype, perl = TRUE) #misspelling thunderstorm
storm2$evtype <- gsub("wind.+", "wind", storm2$evtype) #misspelling wind
storm2$evtype <- gsub("^(wind)$", "high wind", storm2$evtype) #recategorise wind
storm2$evtype <- gsub("\\bfld\\b", "flood", storm2$evtype) #misspelling flood
storm2$evtype <- gsub("\\bflo*ding\\b", "flood", storm2$evtype) #misspelling flood
storm2$evtype <- gsub("(.*)flash flood(.*)", "flash flood", storm2$evtype) #misspelling flash flood
storm2$evtype <- gsub("(.*)coastal flood(.*)", "coastal flood", storm2$evtype) #misspelling coastal flood
storm2$evtype <- gsub("(.*)lakeshore flood(.*)", "lakeshore flood", storm2$evtype) #misspelling lakeshore flood
storm2$evtype <- gsub("^(flood.*)(.*)", "flood", storm2$evtype) #misspelling flood
storm2$evtype <- gsub("^((?!flash|coastal|lakeshore).*)((?<!flash|coastal|lakeshore).)flood(.*)", "flood", storm2$evtype, perl = TRUE) #misspelling flood
storm2$evtype <- gsub("(.*)wild(.*)fire(.*)", "wildfire", storm2$evtype) #misspelling wildfire
storm2$evtype <- gsub("(.*)winter weather(.*)", "winter weather", storm2$evtype) #misspelling winter weather
storm2$evtype <- gsub("(.*)extreme cold(.*)", "extreme cold/wind chill", storm2$evtype) #misspelling extreme cold
storm2$evtype <- gsub("^(snow)$", "heavy snow", storm2$evtype) #misspelling snow
storm2$evtype <- gsub("\\blandslide\\b", "avalanche", storm2$evtype) #misspelling avalanche
storm2$evtype <- gsub("^(fog)$", "dense fog", storm2$evtype) #misspelling dense fog
storm2$evtype <- gsub("^(cold/wind)$", "extreme cold/wind chill", storm2$evtype) #misspelling extreme cold
storm2$evtype <- gsub("current.+", "current", storm2$evtype) #misspelling current
storm2$evtype <- gsub("^(storm surge)$", "storm surge/tide", storm2$evtype) #misspelling current
storm2$evtype <- gsub("^(freezing rain)$", "frost/freeze", storm2$evtype) #misspelling current
storm2$evtype <- gsub("^(extreme wind)$", "extreme cold/wind chill", storm2$evtype) #misspelling extreme wind
storm2$evtype <- gsub("(.*)(high|heavy surf)(.*)", "high surf", storm2$evtype) #misspelling high surf
storm2$evtype <- gsub("^((hurricane).*)", "hurricane (typhoon)", storm2$evtype) #misspelling hurricane
storm2$evtype <- gsub("^(record warmth)$", "heat", storm2$evtype) #misspelling heat
storm2$evtype <- gsub("^(record heat)$", "excessive heat", storm2$evtype) #misspelling heat
storm2$evtype <- gsub("^(heat wave)$", "excessive heat", storm2$evtype) #misspelling heat
storm2$evtype <- gsub("^(cold)$", "cold/wind chill", storm2$evtype) #misspelling cold
storm2$evtype <- gsub("^(record cold)$", "extreme cold/wind chill", storm2$evtype) #misspelling extreme cold
storm2$evtype <- gsub("cloud.+", "cloud", storm2$evtype) #misspelling cloud
storm2$evtype <- gsub("^(frost)$", "frost/freeze", storm2$evtype) #misspelling frost
storm2$evtype <- gsub("^(ice)$", "frost/freeze", storm2$evtype) #misspelling freeze
storm2$evtype <- gsub("waterspout.+", "waterspout", storm2$evtype) #misspelling waterspout
storm2$evtype <- gsub("rain.+", "rain", storm2$evtype) #misspelling rain
storm2$evtype <- gsub("lake effect", "lake-effect", storm2$evtype) #misspelling lake-effect
storm2$evtype <- gsub("(.*)tornado(.*)", "tornado", storm2$evtype) #misspelling tornado
storm2$evtype <- gsub("^(freez.*)(.*)", "frost/freeze", storm2$evtype) #misspelling freeze
storm2$evtype <- gsub("^(hail.*)(.*)", "hail", storm2$evtype) #misspelling hail
storm2$evtype <- gsub("^(blizzard.*)(.*)", "thunderstorm wind", storm2$evtype) #misspelling blizzard

Some recorded events could not be classified according to the storm data event table. These data were classified as “others”.

storm2$evtype[which(!storm2$evtype %in% events)] <- c("others")

Due to the replacements the frequency of the observed events changed significantly

countevtype2 <- count(storm2, evtype)
countevtype2[with(countevtype2, order(-n, evtype)),] ## occurrence of eventtypes
## # A tibble: 44 x 2
##    evtype                        n
##    <chr>                     <int>
##  1 thunderstorm wind        327522
##  2 hail                     288775
##  3 tornado                   60686
##  4 flash flood               55668
##  5 flood                     29576
##  6 high surf                 23604
##  7 heavy snow                16325
##  8 lightning                 15754
##  9 marine thunderstorm wind  11987
## 10 heavy rain                11785
## # ... with 34 more rows

Processing data with respect to population health

We extracted and rearranged the columns ‘evtype’, ‘fatalities’ and ‘injuries’ in order to get a tidy dataset.

varsHealth <- c("evtype", "fatalities", "injuries") #define relevant columns
stormHealth <- storm2[varsHealth] #extract relevant columns
stormHealth1 <- stormHealth %>% group_by(evtype) %>% 
                  summarise(damages = sum(fatalities, injuries, na.rm = TRUE), 
                              fatalities = sum(fatalities, na.rm = TRUE), 
                              injuries = sum(injuries, na.rm = TRUE)) 
                              #calculate fatalities and injuries per eventtype
stormHealth1 <- stormHealth1[with(stormHealth1, order(-damages, evtype)),] 
                  #order eventtypes according to their damages
head(stormHealth1)
## # A tibble: 6 x 4
##   evtype            damages fatalities injuries
##   <chr>               <dbl>      <dbl>    <dbl>
## 1 tornado             96997       5633    91364
## 2 thunderstorm wind   11154        838    10316
## 3 excessive heat       9031       2077     6954
## 4 flood                7386        512     6874
## 5 lightning            6046        816     5230
## 6 heat                 3037        937     2100

We rearranged the calculated dataset in order to get a tidy dataset.

stormHealth2 <- stormHealth1[,c(1,3,4)] #extract relevant columns
stormHealth2 <- stormHealth2 %>% pivot_longer(-evtype, names_to = "harmtype", 
                                              values_to = "QUANTITY")
                  #change columns 'fatalities' and 'injuries' to rows
stormHealth2 <- as.data.frame(stormHealth2)
                  #define dataset as data frame
head(stormHealth2)
##              evtype   harmtype QUANTITY
## 1           tornado fatalities     5633
## 2           tornado   injuries    91364
## 3 thunderstorm wind fatalities      838
## 4 thunderstorm wind   injuries    10316
## 5    excessive heat fatalities     2077
## 6    excessive heat   injuries     6954

Processing data with respect to economic consequences

For analysing economic consequences we prepared a separate data frame.

varsEconomics <- c("evtype", "propdmg", "propdmgexp", "cropdmg", "cropdmgexp") #define relevant columns
stormEconomics <- storm2[varsEconomics] #extract relevant columns
stormEconomics1 <- stormEconomics
stormEconomics1$propdmgexp <- tolower(stormEconomics1$propdmgexp) ## convert into small letters
stormEconomics1$cropdmgexp <- tolower(stormEconomics1$cropdmgexp) ## convert into small letters

Since the amount of damage is coded in two columns, we calculated the correct amount and the sum of property and crop damages.

stormEconomics1$propdmgfactor[stormEconomics1$propdmgexp=="k"]<- 1000
stormEconomics1$propdmgfactor[stormEconomics1$propdmgexp=="m"]<- 1000000
stormEconomics1$propdmgfactor[stormEconomics1$propdmgexp=="b"]<- 1000000000
stormEconomics1$propdmgfactor[stormEconomics1$propdmgexp==""]<- 1
stormEconomics1$cropdmgfactor[stormEconomics1$cropdmgexp=="k"]<- 1000
stormEconomics1$cropdmgfactor[stormEconomics1$cropdmgexp=="m"]<- 1000000
stormEconomics1$cropdmgfactor[stormEconomics1$cropdmgexp=="b"]<- 1000000000
stormEconomics1$cropdmgfactor[stormEconomics1$cropdmgexp==""]<- 1
stormEconomics1$propdmgcalc <- with(stormEconomics1, (propdmg*propdmgfactor))
stormEconomics1$cropdmgcalc <- with(stormEconomics1, (cropdmg*cropdmgfactor))

For furhter analysis we extracted and rearranged the columns ‘evtype’ and the calculated economic damages,respectively.

stormEconomics2 <- stormEconomics1 %>% group_by(evtype) %>% 
                    summarise(damages = sum(propdmgcalc, cropdmgcalc, na.rm = TRUE), 
                            propdmgcalc = sum(propdmgcalc, na.rm = TRUE), 
                            cropdmgcalc = sum(cropdmgcalc, na.rm = TRUE))
stormEconomics2 <- stormEconomics2[with(stormEconomics2, order(-damages, evtype)),] 
                    #order eventtypes according to their damages

We rearranged the calculated dataset in order to get a tidy dataset.

stormEconomics3 <- stormEconomics2[,c(1,3,4)] #extract relevant columns
stormEconomics3 <- stormEconomics3 %>% pivot_longer(-evtype, names_to = "harmtype", 
                                                    values_to = "QUANTITY")
                        #change columns 'propdmgcalc' and 'cropdmgcalc' to rows
stormEconomics3 <- as.data.frame(stormEconomics3) #define dataset as data frame
head(stormEconomics3)
##                evtype    harmtype     QUANTITY
## 1               flood propdmgcalc 150239461307
## 2               flood cropdmgcalc  10856294050
## 3 hurricane (typhoon) propdmgcalc  84656180010
## 4 hurricane (typhoon) cropdmgcalc   5505292800
## 5             tornado propdmgcalc  56941931733
## 6             tornado cropdmgcalc    414961360

Results

Most harmful types of events with respect to population health

We plot stacked bar char and ordered the displayed bars according to their overall damage in oder to visualize a ranking of the most harmful types of events with respect to population health.

ggplot(stormHealth2, aes(fill=harmtype, y=QUANTITY, x=evtype)) +
  geom_bar(postition="stack", stat="identity") +
  scale_x_discrete(limits=stormHealth2$evtype)+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(title="Most Harmful Types of Events", subtitle="with Respect to Population Health", 
       x="Event Type", y="Harm with Respect \n to Population Health \n [in people]") +
  scale_fill_discrete(name="Type of Damage",
                         breaks=c("fatalities", "injuries"),
                         labels=c("Fatalities", "Injuries"))

Most harmful types of events with respect to economics

We plot stacked bar char and ordered the displayed bars according to their overall damage in oder to visualize a ranking of the most harmful types of events with respect to economics.

ggplot(stormEconomics3, aes(fill=harmtype, y=QUANTITY, x=evtype)) +
  geom_bar(postition="stack", stat="identity") +
  scale_x_discrete(limits=stormEconomics3$evtype) +
  scale_y_continuous(labels = function(x) format(x, scientific = FALSE, big.mark = ",")) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(title="Most Harmful Types of Events", subtitle="with Respect to Economics", 
       x="Event Type", y="Harm with Respect \n to Economics \n [in US$]") +
  scale_fill_discrete(name="Type of Damage",
                         breaks=c("cropdmgcalc", "propdmgcalc"),
                         labels=c("Crop Damage", "Property Damage"))