Overview

This analysis uses the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to explore how different event types affected health and economic conditions in the United States between 1950 and 2011. The analysis indicates that tornadoes are by far the largest source of injuries and fatalities, flooding dominates property damage, and drought is the primary source of crop damage.

Data Processing

To begin, we load the relevant R libraries and read in the storm data.

library(dplyr)
library(tidyverse)
library(ggplot2)
stormDF <- read.csv("repdata_data_StormData.csv.bz2")

Before starting any analysis, we review the data to determine its size and basic structure and to check the format of the column names. The data set contains 37 columns (named in upper case, with underscores separating words) and 902297 observations.

dim(stormDF)
## [1] 902297     37
str(stormDF)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
colnames(stormDF)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
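The structure review also shows that BGN_DATE is stored as text, so the 1950 to 2011 span has to be confirmed by parsing it. The following is a minimal sketch of such a check, run on the first few BGN_DATE values shown in the `str()` output rather than on the full data set; the variable names `bgnDate`, `parsed`, and `years` are illustrative only.

```r
# Sample of BGN_DATE values copied from the str() output above
bgnDate <- c("4/18/1950 0:00:00", "4/18/1950 0:00:00",
             "2/20/1951 0:00:00", "6/8/1951 0:00:00")

# Parse month/day/year plus the time component, then keep only the date
parsed <- as.Date(strptime(bgnDate, format = "%m/%d/%Y %H:%M:%S", tz = "UTC"))

# Extract the year and report the earliest and latest values in the sample
years <- as.integer(format(parsed, "%Y"))
range(years)
```

Applied to the full data set, `bgnDate` would be replaced by `stormDF$BGN_DATE`.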

The Human Cost

We’ll begin the analysis by looking at the direct physical harm that each storm type inflicts on individuals, focusing on fatalities, injuries, and combined casualties. Given the large number of event types, this analysis will focus on the ten most damaging event types.

healthRisk <- stormDF %>% select(EVTYPE, FATALITIES, INJURIES) %>% 
  mutate(CASUALITIES = FATALITIES + INJURIES)

healthRisk <- healthRisk[healthRisk$EVTYPE != "?"& (healthRisk$INJURIES > 0 | healthRisk$FATALITIES > 0) ,]

hList <-  healthRisk %>%
  group_by(EVTYPE) %>%
  summarise(totFatal = sum(FATALITIES, na.rm = TRUE),
            totInjury = sum(INJURIES, na.rm = TRUE),
            totCasual = sum(CASUALITIES))

hList <- hList %>% arrange(desc(totCasual))
hL10 <- hList[1:10,]
print(hL10)
## # A tibble: 10 × 4
##    EVTYPE            totFatal totInjury totCasual
##    <chr>                <dbl>     <dbl>     <dbl>
##  1 TORNADO               5633     91346     96979
##  2 EXCESSIVE HEAT        1903      6525      8428
##  3 TSTM WIND              504      6957      7461
##  4 FLOOD                  470      6789      7259
##  5 LIGHTNING              816      5230      6046
##  6 HEAT                   937      2100      3037
##  7 FLASH FLOOD            978      1777      2755
##  8 ICE STORM               89      1975      2064
##  9 THUNDERSTORM WIND      133      1488      1621
## 10 WINTER STORM           206      1321      1527
hL10_long <- hL10 %>%
  pivot_longer(cols = c(totFatal, totInjury,totCasual), names_to = "Variable", values_to = "Value")

ggplot(hL10_long, aes(x = reorder(EVTYPE, -Value), y = Value, fill = Variable)) +
  geom_col(position = "dodge") +
  labs(title = "Top Ten Casualty Sources for Injury and Death",
       x = "Event Type", y = "Casualties", fill = "Columns") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


To summarize the effect of adverse weather on individuals, the plot shows the casualties (along with the separate fatality and injury counts) for the 10 highest sources. Tornadoes are overwhelmingly the primary source of personal injuries and fatalities in the US, with every other source being small in comparison. Note that some closely related labels, such as TSTM WIND and THUNDERSTORM WIND, are recorded as separate event types and are therefore counted separately here.
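To put a number on how dominant tornadoes are, the share of combined casualties can be computed from the totals in the table above. This is a small sketch that uses only the ten printed values, so the share is relative to the top ten event types and not to every event in the database; `totCasual` and `tornadoShare` are illustrative names.

```r
# Combined casualties for the ten event types, copied from the printed table
totCasual <- c("TORNADO" = 96979, "EXCESSIVE HEAT" = 8428, "TSTM WIND" = 7461,
               "FLOOD" = 7259, "LIGHTNING" = 6046, "HEAT" = 3037,
               "FLASH FLOOD" = 2755, "ICE STORM" = 2064,
               "THUNDERSTORM WIND" = 1621, "WINTER STORM" = 1527)

# Tornado share of the top-ten total (not of all event types)
tornadoShare <- totCasual[["TORNADO"]] / sum(totCasual)
round(100 * tornadoShare, 1)  # 70.7
```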

The Economic Cost

The next component of the analysis evaluates the financial cost of the storms, focusing on property and crop damage. The structure of the data shows that the cost for each category is recorded in two columns: one giving the coefficient (significand) and one giving its exponent (a power of 10). The exponents need to be converted to numeric multipliers and combined with the coefficients to obtain the total cost for each event.
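As a worked example of the coefficient and exponent scheme, consider the first record in the `str()` output, which has PROPDMG = 25 and PROPDMGEXP = "K". The sketch below handles only the letter codes and reads them as hundred, thousand, million, and billion; the names `coefficient`, `exponentCode`, `multiplier`, and `damageUSD` are illustrative.

```r
# First record shown by str(stormDF): PROPDMG = 25, PROPDMGEXP = "K"
coefficient  <- 25
exponentCode <- "K"

# Letter codes read as H = hundred, K = thousand, M = million, B = billion
multiplier <- switch(toupper(exponentCode),
                     H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Damage in dollars is the coefficient scaled by the multiplier
damageUSD <- coefficient * multiplier
damageUSD  # 25000
```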

propertyRisk <- stormDF %>% select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
propertyRisk <- propertyRisk[propertyRisk$EVTYPE != "?"& (propertyRisk$PROPDMG > 0 | propertyRisk$CROPDMG > 0) ,]

unique(propertyRisk$PROPDMGEXP)
##  [1] "K" "M" "B" "m" ""  "+" "0" "5" "6" "4" "h" "2" "7" "3" "H" "-"
unique(propertyRisk$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k"

Review of the exponent values indicates a non-standard system that mixes numeric powers of 10 with letters (for hundreds, thousands, millions, etc.). Given this, each symbol is substituted directly with its numeric multiplier for both property and crop values.

# Translate an exponent code into its numeric multiplier in a single pass.
# Each code is matched against the original symbols, so a value that has
# already been converted can never be picked up again by a later rule
# (e.g. "" -> 1 must not then be treated as the exponent code "1").
expToMultiplier <- function(code) {
  codes <- c("NA", "", "-", "?", "+", as.character(0:8), "H", "K", "M", "B")
  mult  <- c(rep(10^0, 5), 10^(0:8), 10^2, 10^3, 10^6, 10^9)
  mult[match(toupper(code), codes)]  # toupper() folds "h", "k", "m" into "H", "K", "M"
}

propertyRisk$PROPDMGEXP <- expToMultiplier(propertyRisk$PROPDMGEXP)
propertyRisk$CROPDMGEXP <- expToMultiplier(propertyRisk$CROPDMGEXP)
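One thing to watch for in symbol-to-number substitutions of this kind is order dependence: when replacements are written back into the same column one rule at a time, a value produced by an early rule can be matched again by a later one. The sketch below illustrates this on a toy vector of codes (a hypothetical sample, not the storm data) and compares it with a one-pass lookup using `match()`.

```r
# Toy exponent codes (hypothetical sample, not the full data set)
codes <- c("", "K", "1", "M")

# Rule-by-rule replacement in place: "" becomes "1",
# which the following rule for the code "1" then turns into 10
seqCodes <- codes
seqCodes[seqCodes == ""]  <- 10^0
seqCodes[seqCodes == "1"] <- 10^1
seqCodes[seqCodes == "K"] <- 10^3
seqCodes[seqCodes == "M"] <- 10^6
seqMult <- as.numeric(seqCodes)

# One-pass lookup: every element is matched against the original codes only
keys <- c("", "1", "K", "M")
vals <- c(1, 10, 1e3, 1e6)
lookupMult <- vals[match(codes, keys)]

# The two approaches disagree on the blank code
data.frame(code = codes, sequential = seqMult, lookup = lookupMult)
```

In the sequential version the blank code ends up with a multiplier of 10 and not 1, so a one-pass lookup, or ordering the rules so that no output can collide with a later input, is the safer design.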

The final step before analyzing the data is to combine the coefficient and its exponent into a single value for both property and crop damage.

propertyRisk$PROPDMGEXP <- as.numeric(propertyRisk$PROPDMGEXP)
propertyRisk$CROPDMGEXP <- as.numeric(propertyRisk$CROPDMGEXP)

# Combine coefficient and multiplier; TOTAL is property plus crop damage in dollars
propertyRisk <- propertyRisk %>%
  mutate(PROPERTY = PROPDMG * PROPDMGEXP,
         CROP = CROPDMG * CROPDMGEXP,
         TOTAL = PROPERTY + CROP)

As before, given the large number of event types, we will identify the 10 most significant sources of damage and focus on their analysis.

pList <- propertyRisk %>%
  group_by(EVTYPE) %>%
  summarise(totProp = sum(PROPERTY, na.rm = TRUE),
            totCrop = sum(CROP, na.rm = TRUE),
            totalCasual = sum(TOTAL))

pList <- pList %>% arrange(desc(totalCasual))
pL10 <- pList[1:10,]
print(pL10)
## # A tibble: 10 × 4
##    EVTYPE                 totProp     totCrop  totalCasual
##    <chr>                    <dbl>       <dbl>        <dbl>
##  1 FLOOD             144657709870  5661968450 150319678320
##  2 HURRICANE/TYPHOON  69305840000  2607872800  71913712800
##  3 TORNADO            56947382445   414953270  57362335715
##  4 STORM SURGE        43323536000        5000  43323541000
##  5 HAIL               15735270147  3025954473  18761224620
##  6 FLASH FLOOD        16822678195  1421317100  18243995295
##  7 DROUGHT             1046106000 13972566000  15018672000
##  8 HURRICANE          11868319010  2741910000  14610229010
##  9 RIVER FLOOD         5118945500  5029459000  10148404500
## 10 ICE STORM           3944928310  5022113500   8967041810
pL10_long <- pL10 %>%
  pivot_longer(cols = c(totProp, totCrop,totalCasual), names_to = "Variable", values_to = "Value")

ggplot(pL10_long, aes(x = reorder(EVTYPE, -Value), y = Value, fill = Variable)) +
  geom_col(position = "dodge") +
  labs(title = "Top Ten Sources of Crop and Property Damage",
       x = "Event Type", y = "Damage (USD)", fill = "Columns") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))


A review of the plot indicates that the largest source of financial damage is not tornadoes, as it was for physical injury, but flooding. Although flooding is the dominant source of financial loss, several other sources also contribute substantially: hurricanes, tornadoes, and storm surge. While flooding is a substantial source of crop damage, it contributes less to crop losses than drought, which is the primary source of crop damage.
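Because the totals span several orders of magnitude, the comparison is easier to read in billions of dollars. The sketch below uses four rows copied from the table above and adds the crop share of each event type's total; the data frame `damage` and its derived columns are illustrative names.

```r
# Totals (USD) copied from the printed table for four of the event types
damage <- data.frame(
  EVTYPE  = c("FLOOD", "HURRICANE/TYPHOON", "TORNADO", "DROUGHT"),
  totProp = c(144657709870, 69305840000, 56947382445, 1046106000),
  totCrop = c(5661968450, 2607872800, 414953270, 13972566000)
)

# Express damage in billions of dollars for readability
damage$propBillions <- round(damage$totProp / 1e9, 1)
damage$cropBillions <- round(damage$totCrop / 1e9, 1)

# Fraction of each event type's total damage that is crop damage
damage$cropShare <- round(damage$totCrop / (damage$totProp + damage$totCrop), 2)

print(damage[, c("EVTYPE", "propBillions", "cropBillions", "cropShare")])
```

For these rows, drought is the only event type whose losses are mostly crop damage, while flood losses are almost entirely property damage.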

Results

The analysis indicates that tornadoes are overwhelmingly the primary source of harm to the health of the population. While tornadoes do contribute to property damage, they are not nearly as large a source as flooding, which exceeds the other substantial sources: hurricanes, tornadoes, and storm surge. While all of these sources contribute to crop damage, drought is the primary source of crop damage in the US.