National Weather Service Storm Data Analysis - Consequential Impact on Population Health and Economic Damages by Weather Event Types

Synopsis

This data analysis involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The data records used are from 1989 to 2011. The weather events will be analyzed by population health related damages (i.e. fatailities and injuries) and by economic damage costs (i.e. property and crop). The results will illustrate the 10 most consequential weather events for the population health and economic categories.

The conclusions state that tornado category contributes to the maximum number of injuries and combined total fatalities and injuries. For the property and crop damage costs, the flood category contributes to the maximum number of property damage costs and combined total property and crop damage costs.

Data Processing

library(tidyverse)
theme_set(theme_light())

Download the dataset to ./data directory and read the CSV to the stormData dataframe

if(!file.exists("./data")){
  dir.create("./data")}
if(!file.exists("./data/stormData.csv.bz2")){
     file_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
     download.file(file_url, destfile="./data/stormData.csv.bz2")
     }

stormData <- read.csv("./data/stormData.csv.bz2", stringsAsFactors = FALSE)

Examine stormData variables types and values

str(stormData)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Subset stormData for the variables necessary for the data analysis

stormSel <- stormData %>% 
     select(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, 
            CROPDMG, CROPDMGEXP)

Examine stormData variables for missing values

sum(is.na(stormSel))
## [1] 0

No missing values!

Determine the start year for the weather events

In the assignments instructions, it states that “in the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.” Considering this statement, it is necessary to determine the start year and the criteria for selection. The following steps will determine the range where 85% or less of the total events are gathered.

Format BGN_DATE column as.Date, extract year from BGN_DATE and create new variable Year

stormSel$BGN_DATE <- as.Date(stormSel$BGN_DATE, format = "%m/%d/%Y")

stormSel <- stormSel %>%
     mutate(Year = as.numeric(format(BGN_DATE, '%Y')))

Determine the starting year where the cumulative number of events is 85% or less

yearCnt <- stormSel %>% 
     group_by(Year) %>% 
     tally() %>% 
     arrange(desc(n))

colnames(yearCnt) <- c("Year", "EventFreq")

yearCnt <- yearCnt %>% 
     mutate(
          cumsum = cumsum(EventFreq), 
          freq = round(EventFreq / sum(EventFreq), 3), 
          cum_freq = cumsum(freq)
     )

tail(yearCnt[which(yearCnt$cum_freq <= 0.85), ])
## # A tibble: 6 x 5
##    Year EventFreq cumsum  freq cum_freq
##   <dbl>     <int>  <int> <dbl>    <dbl>
## 1  1994     20631 702131 0.023    0.779
## 2  1992     13534 715665 0.015    0.794
## 3  1993     12607 728272 0.014    0.808
## 4  1991     12522 740794 0.014    0.822
## 5  1990     10946 751740 0.012    0.834
## 6  1989     10410 762150 0.012    0.846

Starting year is 1989.

Determine the event types (EVTYPE) with the most frequencies

Subset stormSel starting at 1989 and FATALITIES (OR) INJURIES (OR) PROPDMG (OR) CROPDMG is greater than 0

stormSel2 <- stormSel %>% 
     filter(Year >= 1989 & ( 
                 FATALITIES > 0 | 
                 INJURIES > 0 | 
                 PROPDMG > 0 | 
                 CROPDMG > 0))

Subset for EVTYPE order by frequencies (n)

eventCnt <- stormSel2 %>% 
     group_by(EVTYPE) %>% 
     tally() %>% 
     arrange(desc(n))

Print the top 40 events with most frequencies

eventCnt %>%
     print(n = 40)
## # A tibble: 488 x 2
##    EVTYPE                   n
##    <chr>                <int>
##  1 TSTM WIND            62607
##  2 THUNDERSTORM WIND    43655
##  3 HAIL                 26092
##  4 FLASH FLOOD          20967
##  5 TORNADO              16879
##  6 LIGHTNING            13293
##  7 THUNDERSTORM WINDS   12086
##  8 FLOOD                10175
##  9 HIGH WIND             5522
## 10 STRONG WIND           3370
## 11 WINTER STORM          1508
## 12 HEAVY SNOW            1342
## 13 HEAVY RAIN            1105
## 14 WILDFIRE               857
## 15 ICE STORM              708
## 16 URBAN/SML STREAM FLD   702
## 17 EXCESSIVE HEAT         698
## 18 HIGH WINDS             657
## 19 TSTM WIND/HAIL         441
## 20 TROPICAL STORM         416
## 21 WINTER WEATHER         407
## 22 RIP CURRENT            400
## 23 WILD/FOREST FIRE       388
## 24 FLASH FLOODING         302
## 25 FLOOD/FLASH FLOOD      279
## 26 AVALANCHE              268
## 27 DROUGHT                266
## 28 BLIZZARD               253
## 29 RIP CURRENTS           241
## 30 HEAT                   215
## 31 EXTREME COLD           197
## 32 LAKE-EFFECT SNOW       194
## 33 LANDSLIDE              193
## 34 STORM SURGE            177
## 35 COASTAL FLOOD          164
## 36 URBAN FLOOD            139
## 37 WINTER WEATHER/MIX     139
## 38 HIGH SURF              130
## 39 HURRICANE              129
## 40 LIGHT SNOW             119
## # … with 448 more rows

From the table, the count for unique EVTYPE variable is 488 (too many).

Using the National Weather Instruction 10-1065 (https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf) Section 7 - Event Types (pages 2-4) as a guide to combine most frequent event types (ex. TSTM and THUNDER* to THUNDERSTORM, and so on).

stormSel2$EVTYPE <- toupper(stormSel2$EVTYPE)
stormSel2$EVTYPE[grep("TSTM|THUNDER", stormSel2$EVTYPE)] <- "THUNDERSTORM"
stormSel2$EVTYPE[grep("HAIL", stormSel2$EVTYPE)] <- "HAIL"
stormSel2$EVTYPE[grep("FLOOD|FLD", stormSel2$EVTYPE)] <- "FLOOD"
stormSel2$EVTYPE[grep("TORNA|FUNNEL|SPOUT", stormSel2$EVTYPE)] <- "TORNADO"
stormSel2$EVTYPE[grep("LIGHTN|LIGHTI|LIGNT", stormSel2$EVTYPE)] <- "LIGHTNING"
stormSel2$EVTYPE[grep("COLD", stormSel2$EVTYPE)] <- "COLD"
stormSel2$EVTYPE[grep("WIND|GUST", stormSel2$EVTYPE)] <- "WIND"
stormSel2$EVTYPE[grep("WINTER|BLIZZARD", stormSel2$EVTYPE)] <- "WINTERSTORM"
stormSel2$EVTYPE[grep("SNOW", stormSel2$EVTYPE)] <- "SNOW"
stormSel2$EVTYPE[grep("RAIN", stormSel2$EVTYPE)] <- "RAIN"
stormSel2$EVTYPE[grep("FIRE", stormSel2$EVTYPE)] <- "WILDFIRE"
stormSel2$EVTYPE[grep("ICE|ICY", stormSel2$EVTYPE)] <- "ICE"
stormSel2$EVTYPE[grep("HEAT|DROUGHT", stormSel2$EVTYPE)] <- "HEAT"
stormSel2$EVTYPE[grep("TROPICAL", stormSel2$EVTYPE)] <- "TROPICAL STORM"
stormSel2$EVTYPE[grep("RIP", stormSel2$EVTYPE)] <- "RIP CURRENT"
stormSel2$EVTYPE[grep("AVALANC", stormSel2$EVTYPE)] <- "AVALANCHE"
stormSel2$EVTYPE[grep("SLIDE", stormSel2$EVTYPE)] <- "LANDSLIDE"
stormSel2$EVTYPE[grep("SURGE", stormSel2$EVTYPE)] <- "STORM SURGE"
stormSel2$EVTYPE[grep("SURF", stormSel2$EVTYPE)] <- "SURF"
stormSel2$EVTYPE[grep("HURRICAN|TYPHOO", stormSel2$EVTYPE)] <- "HURRICANE"
stormSel2$EVTYPE[grep("FREEZ|FROST", stormSel2$EVTYPE)] <- "FREEZE"
stormSel2$EVTYPE[grep("FOG", stormSel2$EVTYPE)] <- "FOG"
stormSel2$EVTYPE[grep("DUST", stormSel2$EVTYPE)] <- "DUST STORM"
stormSel2$EVTYPE[grep("MICRO | MIRCO", stormSel2$EVTYPE)] <- "DRY MICROBURST"
stormSel2$EVTYPE[grep("GLAZE", stormSel2$EVTYPE)] <- "GLAZE"
stormSel2$EVTYPE[grep("TSUNAMI", stormSel2$EVTYPE)] <- "TSUNAMI"

Subset again for EVTYPE with greater frequencies and print the top 40 most frequent events

eventCnt2 <- stormSel2 %>% 
     group_by(EVTYPE) %>% 
     tally() %>% 
     arrange(desc(n))

eventCnt2 %>%
     print(n = 40)
## # A tibble: 78 x 2
##    EVTYPE                      n
##    <chr>                   <int>
##  1 THUNDERSTORM           119192
##  2 FLOOD                   33183
##  3 HAIL                    26128
##  4 TORNADO                 16974
##  5 LIGHTNING               13302
##  6 WIND                     9897
##  7 WINTERSTORM              2316
##  8 SNOW                     1881
##  9 WILDFIRE                 1258
## 10 HEAT                     1246
## 11 RAIN                     1200
## 12 ICE                       764
## 13 RIP CURRENT               643
## 14 COLD                      468
## 15 TROPICAL STORM            456
## 16 AVALANCHE                 269
## 17 HURRICANE                 232
## 18 STORM SURGE               225
## 19 SURF                      222
## 20 LANDSLIDE                 209
## 21 DUST STORM                199
## 22 FOG                       181
## 23 FREEZE                    170
## 24 DRY MICROBURST             78
## 25 OTHER                      34
## 26 GLAZE                      21
## 27 MIXED PRECIPITATION        18
## 28 TSUNAMI                    14
## 29 SEICHE                      9
## 30 ASTRONOMICAL HIGH TIDE      8
## 31 HEAVY MIX                   8
## 32 HIGH SEAS                   8
## 33 LOW TEMPERATURE             7
## 34 UNSEASONABLY WARM           7
## 35 HYPOTHERMIA/EXPOSURE        6
## 36 MIXED PRECIP                6
## 37 COASTAL STORM               4
## 38 HIGH WATER                  4
## 39 MICROBURST                  4
## 40 WINTRY MIX                  4
## # … with 38 more rows

Economic Damage Cost Exponents (PROPDMGEXP, CROPDMGEXP)

Using the National Weather Serivice Instruction 10-1065 page 12, the estimated damage costs for property category is calculated by PROPDMG * PROPDMGEXP (exponent). The same logic will apply to CROPDMG. Since the PROPDMGEXP and CROPDMGEXP are categorical variables, they need a conversion to the corresponding numeric exponent.

Verify the unique values for PROPDMGEXP

unique(stormSel2$PROPDMGEXP)
##  [1] "M" "K" ""  "B" "m" "+" "0" "5" "6" "4" "h" "2" "7" "3" "H" "-"

Change PROPDMG key to corresponding numeric values

stormSel2 <- stormSel2 %>% 
     mutate(PROPDMGEXP1 = case_when(
          PROPDMGEXP == "-" ~ 1, # 10^0 = 1
          PROPDMGEXP == "+" ~ 1, 
          PROPDMGEXP == "0" ~ 1, 
          PROPDMGEXP == "2" ~ 10^2, 
          PROPDMGEXP == "3" ~ 10^3, 
          PROPDMGEXP == "4" ~ 10^4, 
          PROPDMGEXP == "5" ~ 10^5, 
          PROPDMGEXP == "6" ~ 10^6, 
          PROPDMGEXP == "7" ~ 10^7, 
          PROPDMGEXP == "H" ~ 10^2, 
          PROPDMGEXP == "h" ~ 10^2, 
          PROPDMGEXP == "K" ~ 10^3, 
          PROPDMGEXP == "M" ~ 10^6, 
          PROPDMGEXP == "m" ~ 10^6, 
          PROPDMGEXP == "B" ~ 10^9, 
          PROPDMGEXP == "" ~ 1))

Verify the unique values for CROPDMGEXP

unique(stormSel2$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k"

Change CROPDMG key to corresponding numeric values

stormSel2 <- stormSel2 %>% 
     mutate(CROPDMGEXP1 = case_when(
          CROPDMGEXP == "" ~ 1, # 10^0 = 1
          CROPDMGEXP == "?" ~ 1, 
          CROPDMGEXP == "0" ~ 1, 
          CROPDMGEXP == "K" ~ 10^3, 
          CROPDMGEXP == "k" ~ 10^3, 
          CROPDMGEXP == "M" ~ 10^6, 
          CROPDMGEXP == "m" ~ 10^6, 
          CROPDMGEXP == "B" ~ 10^9))

Results

Population Health - Fatalities and Injuries

Calculate fatalities and injuries by event type

fatalities_injuries <- stormSel2 %>% 
     group_by(EVTYPE) %>% 
     summarise(Fatalities = sum(FATALITIES),
               Injuries = sum(INJURIES), 
               Totals = Fatalities + Injuries) %>% 
     arrange(desc(Totals))

Print the top 10 events order by fatalities and injuries (combined)

fatalities_injuries_10 <- fatalities_injuries[1:10, ]
fatalities_injuries_10
## # A tibble: 10 x 4
##    EVTYPE       Fatalities Injuries Totals
##    <chr>             <dbl>    <dbl>  <dbl>
##  1 TORNADO            1808    28037  29845
##  2 HEAT               3138     9228  12366
##  3 FLOOD              1553     8683  10236
##  4 THUNDERSTORM        608     7992   8600
##  5 LIGHTNING           817     5231   6048
##  6 WINTERSTORM         378     2682   3060
##  7 WIND                476     1954   2430
##  8 ICE                 102     2183   2285
##  9 WILDFIRE             90     1608   1698
## 10 HURRICANE           133     1333   1466

Plot barchart for total fatalities, injuries and combined totals by event type.

First, convert dataframe to long format

storm_long <- gather(fatalities_injuries_10, Fatalities_Injuries, Totals, -EVTYPE, convert = FALSE)

fi_chart <- storm_long %>% 
     ggplot(aes(x = reorder(EVTYPE, Totals), y = Totals)) + 
     geom_bar(stat = "identity", aes(fill = Fatalities_Injuries), position = "dodge") + 
     coord_flip() + 
     labs(title = "United States Total Fatalities and Injuries by Event (1989-2011)", 
          x = "Event Type", y = "Frequency") + 
     theme(legend.title = element_blank())
fi_chart

The plot illustrates the tornado category contributes to the maximum number of injuries and combined total fatalities and injuries, followed by heat and flood weather events. However, the heat associated events contribute to the maximum number of fatalities, followed by tornado and flood events.

Calculate property damage costs (PropertyCost) and crop damage costs (CropCost) by event type (in million $)

stormSel2 <- stormSel2 %>% 
     mutate(PropertyCost = (PROPDMG * PROPDMGEXP1) / 10^6) %>% 
     mutate(CropCost = (CROPDMG * CROPDMGEXP1) / 10^6)

property_crop <- stormSel2 %>% 
     group_by(EVTYPE) %>% 
     summarise(PropertyTotalCost = sum(PropertyCost),
               CropTotalCost = sum(CropCost), 
               Totals = PropertyTotalCost + CropTotalCost) %>% 
     arrange(desc(Totals))

Print the top 10 events order by property and crop damage costs (combined)

property_crop_10 <- property_crop[1:10, ]
property_crop_10
## # A tibble: 10 x 4
##    EVTYPE         PropertyTotalCost CropTotalCost  Totals
##    <chr>                      <dbl>         <dbl>   <dbl>
##  1 FLOOD                    168270.     12389.    180659.
##  2 HURRICANE                 85256.      5506.     90763.
##  3 STORM SURGE               47965.         0.855  47966.
##  4 TORNADO                   32294.       415.     32709.
##  5 HAIL                      15978.      3047.     19024.
##  6 HEAT                       1066.     14877.     15943.
##  7 THUNDERSTORM              12785.      1274.     14060.
##  8 ICE                        3959.      5022.      8981.
##  9 WILDFIRE                   8497.       403.      8900.
## 10 TROPICAL STORM             7716.       695.      8411.

Plot barchart for total property and crop damage costs by event type

First, convert dataframe to long format

prop_crop_long <- gather(property_crop_10, Property_Crop, Totals, -EVTYPE, convert = FALSE)

pc_chart <- prop_crop_long %>% 
     ggplot(aes(x = reorder(EVTYPE, Totals), y = Totals)) + 
     geom_bar(stat = "identity", aes(fill = Property_Crop), position = "dodge") + 
     coord_flip() + 
     labs(title = "United States Total Property and Crop Damage Costs by Event (1989-2011)", 
          x = "Event Type", y = "Damage Costs (millions of dollars)") + 
     theme(legend.title = element_blank())
pc_chart

The plot illustrates the flood category contributes to the maximum number of property damage costs and combined total property and crop damage costs, followed by hurricane and storm surge weather events. However, the heat associated events contribute to the maximum number of crop damage costs, followed by flood, hurricane, and ice/freeze categories.