Storms and severe weather have a major impact on both public health and the economy of a country. Because the US experiences a wide variety of extreme weather events, it is vital that government knows which of them pose the largest threat. In this analysis I look at the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. After processing the data, I use tables and plots to show which types of storms and weather events are most harmful to public health and to the economy in the US.

Data Processing

First, I load the packages required for the analysis. Next, I download and store the data, read it in, and look at its structure.

library(tidyverse)
library(RColorBrewer)

URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(URL, "./repdata%2Fdata%2FStormData.csv.bz2")
datedownloaded <- date()
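
Because the compressed file is large, it can be convenient to skip the download when a local copy already exists. The following is a minimal sketch of that idea, not part of the original analysis; `download_if_missing` is a hypothetical helper name, and the demonstration uses a temporary file so that it runs without network access.

```r
# Hypothetical helper: download a file only when no local copy exists yet.
# Returns TRUE if a download happened, FALSE otherwise.
download_if_missing <- function(url, dest) {
  if (file.exists(dest)) {
    return(invisible(FALSE))             # local copy found, nothing to do
  }
  download.file(url, dest, mode = "wb")  # binary mode keeps the .bz2 intact
  invisible(TRUE)
}

URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

# Demonstration without network access: the destination already exists,
# so the function returns FALSE and never calls download.file().
tmp <- tempfile(fileext = ".csv.bz2")
file.create(tmp)
downloaded <- download_if_missing(URL, tmp)
downloaded
```

In the real workflow the destination would be the path used above, and `datedownloaded` would only be updated when the function returns TRUE.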

data <- read.csv("./repdata%2Fdata%2FStormData.csv.bz2")
str(data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

As can be seen in the structure, the EXP variables (CROPDMGEXP and PROPDMGEXP) are factors, so I convert them to character and examine their values.

data <- data %>% mutate(CROPDMGEXP = as.character(CROPDMGEXP), PROPDMGEXP = as.character(PROPDMGEXP))
table(data$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994
table(data$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5      6 
## 465934      1      8      5    216     25     13      4      4     28      4 
##      7      8      B      h      H      K      m      M 
##      5      1     40      1      6 424665      7  11330

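These indicate the units of CROPDMG and PROPDMG as powers of 10. According to the source, the National Weather Service Storm Data Documentation, “Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions. If additional precision is available, it may be provided in the narrative part of the entry.” The documentation does not define the remaining symbols, so I deduce the mapping below on my own assumptions: lower-case letters are treated like upper-case ones, “h” and “H” as hundreds, the digits 2 to 8 in PROPDMGEXP as powers of 10, and everything else (blanks, punctuation, “0”, “1”, and the single “2” in CROPDMGEXP) as a multiplier of 1.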

data$CROPDMGEXP[data$CROPDMGEXP=="?"] <- 10^0
data$CROPDMGEXP[data$CROPDMGEXP=="0"] <- 10^0
data$CROPDMGEXP[data$CROPDMGEXP=="2"] <- 10^0
data$CROPDMGEXP[data$CROPDMGEXP=="B"] <- 10^9
data$CROPDMGEXP[data$CROPDMGEXP=="k"] <- 10^3
data$CROPDMGEXP[data$CROPDMGEXP=="K"] <- 10^3
data$CROPDMGEXP[data$CROPDMGEXP=="m"] <- 10^6
data$CROPDMGEXP[data$CROPDMGEXP=="M"] <- 10^6

data$PROPDMGEXP[data$PROPDMGEXP=="-"] <- 10^0
data$PROPDMGEXP[data$PROPDMGEXP=="?"] <- 10^0
data$PROPDMGEXP[data$PROPDMGEXP=="+"] <- 10^0
data$PROPDMGEXP[data$PROPDMGEXP=="0"] <- 10^0
data$PROPDMGEXP[data$PROPDMGEXP=="1"] <- 10^0
data$PROPDMGEXP[data$PROPDMGEXP=="2"] <- 10^2
data$PROPDMGEXP[data$PROPDMGEXP=="3"] <- 10^3
data$PROPDMGEXP[data$PROPDMGEXP=="4"] <- 10^4
data$PROPDMGEXP[data$PROPDMGEXP=="5"] <- 10^5
data$PROPDMGEXP[data$PROPDMGEXP=="6"] <- 10^6
data$PROPDMGEXP[data$PROPDMGEXP=="7"] <- 10^7
data$PROPDMGEXP[data$PROPDMGEXP=="8"] <- 10^8
data$PROPDMGEXP[data$PROPDMGEXP=="B"] <- 10^9
data$PROPDMGEXP[data$PROPDMGEXP=="h"] <- 10^2
data$PROPDMGEXP[data$PROPDMGEXP=="H"] <- 10^2
data$PROPDMGEXP[data$PROPDMGEXP=="K"] <- 10^3
data$PROPDMGEXP[data$PROPDMGEXP=="m"] <- 10^6
data$PROPDMGEXP[data$PROPDMGEXP=="M"] <- 10^6

data <- data %>% mutate(CROPDMGEXP = as.numeric(CROPDMGEXP), PROPDMGEXP = as.numeric(PROPDMGEXP))
data$CROPDMGEXP[is.na(data$CROPDMGEXP)] <- 10^0
data$PROPDMGEXP[is.na(data$PROPDMGEXP)] <- 10^0
table(data$CROPDMGEXP)
## 
##      1   1000  1e+06  1e+09 
## 618440 281853   1995      9
table(data$PROPDMGEXP)
## 
##      1    100   1000  10000  1e+05  1e+06  1e+07  1e+08  1e+09 
## 466189     20 424669      4     28  11341      5      1     40

The damage units are now numeric multipliers, so I create new columns holding the actual damage estimates (the recorded damage value times its multiplier).

data <- data %>% mutate(CROPDMGCOST = CROPDMG*CROPDMGEXP, PROPDMGCOST = PROPDMG*PROPDMGEXP)
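
The symbol-by-symbol assignments above can also be written as a single lookup table. The following is a minimal sketch, not the code used for the results in this report: `exp_lookup` and `exp_to_multiplier` are hypothetical names, the input is a small made-up data frame, and the table mirrors the PROPDMGEXP assignments (so it would read a "2" as 100, whereas the CROPDMGEXP code above maps its single "2" to 1).

```r
# Hypothetical lookup table mirroring the PROPDMGEXP assignments above.
exp_lookup <- c(
  "-" = 1,   "?" = 1,   "+" = 1,   "0" = 1,   "1" = 1,
  "2" = 1e2, "3" = 1e3, "4" = 1e4, "5" = 1e5,
  "6" = 1e6, "7" = 1e7, "8" = 1e8,
  "H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9
)

# Hypothetical helper: convert exponent symbols to numeric multipliers.
exp_to_multiplier <- function(x) {
  key <- toupper(as.character(x))  # treat "k"/"K", "m"/"M", "h"/"H" alike
  out <- unname(exp_lookup[key])   # NA for blanks and unlisted symbols
  out[is.na(out)] <- 1             # blanks and unknown symbols: no scaling
  out
}

# Made-up rows for illustration only
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 10),
  PROPDMGEXP = c("K", "M", ""),
  stringsAsFactors = FALSE
)
toy$PROPDMGCOST <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)
toy
```

In this toy example, a recorded value of 25 with exponent "K" becomes 25,000, a value of 2.5 with "M" becomes 2,500,000, and a value with a blank exponent is left unchanged.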

Results

First, I will look at how storms and severe weather affect public health. I will create two tables that show the total number of fatalities and the total number of injuries for each event type, respectively.

fatalities <- data %>% group_by(EVTYPE) %>% 
  summarise(TOTAL_FATALITIES = sum(FATALITIES)) %>%
  arrange(desc(TOTAL_FATALITIES))
fatalities
## # A tibble: 985 x 2
##    EVTYPE         TOTAL_FATALITIES
##    <fct>                     <dbl>
##  1 TORNADO                    5633
##  2 EXCESSIVE HEAT             1903
##  3 FLASH FLOOD                 978
##  4 HEAT                        937
##  5 LIGHTNING                   816
##  6 TSTM WIND                   504
##  7 FLOOD                       470
##  8 RIP CURRENT                 368
##  9 HIGH WIND                   248
## 10 AVALANCHE                   224
## # ... with 975 more rows
injuries <- data %>% group_by(EVTYPE) %>% 
  summarise(TOTAL_INJURIES = sum(INJURIES)) %>%
  arrange(desc(TOTAL_INJURIES))
injuries
## # A tibble: 985 x 2
##    EVTYPE            TOTAL_INJURIES
##    <fct>                      <dbl>
##  1 TORNADO                    91346
##  2 TSTM WIND                   6957
##  3 FLOOD                       6789
##  4 EXCESSIVE HEAT              6525
##  5 LIGHTNING                   5230
##  6 HEAT                        2100
##  7 ICE STORM                   1975
##  8 FLASH FLOOD                 1777
##  9 THUNDERSTORM WIND           1488
## 10 HAIL                        1361
## # ... with 975 more rows

Next, I will plot the data for fatalities.

fatalities %>% top_n(10, TOTAL_FATALITIES) %>%
  ggplot(aes(reorder(EVTYPE, -TOTAL_FATALITIES), TOTAL_FATALITIES, fill = EVTYPE)) +
    geom_bar(stat = "identity", show.legend = FALSE) +
    scale_fill_brewer(palette = "Spectral") +
    theme(axis.text.x = element_text(angle = 90)) +
    ggtitle("Top 10 Storms to Cause Fatalities") +
    xlab("Storm Type") + 
    ylab("Total Fatalities")


Tornados are clearly the biggest concern in terms of fatalities. Excessive heat is second, causing almost twice as many deaths as the third most deadly event type (flash floods), yet only about a third as many as tornados.

Next, I will plot the data for injuries.

injuries %>% top_n(10, TOTAL_INJURIES) %>%
  ggplot(aes(reorder(EVTYPE, -TOTAL_INJURIES), TOTAL_INJURIES, fill = EVTYPE)) +
    geom_bar(stat = "identity", show.legend = FALSE) +
    scale_fill_brewer(palette = "Spectral") +
    theme(axis.text.x = element_text(angle = 90)) +
    ggtitle("Top 10 Storms to Cause Injuries") +
    xlab("Storm Type") + 
    ylab("Total Injuries")


Again, tornados are the most dangerous, and by an even wider margin: they caused roughly 13 times as many injuries as the second-ranked event type (TSTM WIND). Excessive heat drops to fourth place and is proportionally further behind tornados than it was for fatalities. This perhaps suggests that excessive heat is a particular cause for concern because, when it causes harm, that harm more often results in death.
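
To make that comparison concrete, the totals from the two tables above can be turned into the share of casualties that were fatal. This is a small illustrative calculation rather than part of the original analysis; the column name FATAL_SHARE is my own, and casualties are taken to be fatalities plus injuries.

```r
# Totals copied from the fatalities and injuries tables above
totals <- data.frame(
  EVTYPE           = c("TORNADO", "EXCESSIVE HEAT"),
  TOTAL_FATALITIES = c(5633, 1903),
  TOTAL_INJURIES   = c(91346, 6525),
  stringsAsFactors = FALSE
)

# Share of all recorded casualties (fatalities + injuries) that were fatal
totals$FATAL_SHARE <- totals$TOTAL_FATALITIES /
  (totals$TOTAL_FATALITIES + totals$TOTAL_INJURIES)

totals
```

On these figures, about 6% of tornado casualties were fatal, compared with about 23% for excessive heat.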

Next, I will look at how storms and severe weather affect the economy, using the estimated cost of property damage for each event type. The crop damage estimates (CROPDMGCOST) were computed above but are not tabulated here. First, I will display the table.

propdmg <- data %>% group_by(EVTYPE) %>%
  summarise(TOTAL_PROPDMGCOST = sum(PROPDMGCOST)) %>%
  arrange(desc(TOTAL_PROPDMGCOST)) 
propdmg
## # A tibble: 985 x 2
##    EVTYPE            TOTAL_PROPDMGCOST
##    <fct>                         <dbl>
##  1 FLOOD                 144657709807 
##  2 HURRICANE/TYPHOON      69305840000 
##  3 TORNADO                56947380676.
##  4 STORM SURGE            43323536000 
##  5 FLASH FLOOD            16822673978.
##  6 HAIL                   15735267513.
##  7 HURRICANE              11868319010 
##  8 TROPICAL STORM          7703890550 
##  9 WINTER STORM            6688497251 
## 10 HIGH WIND               5270046295 
## # ... with 975 more rows
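
A corresponding table for crop damage could be built with the same pattern. The following is only a sketch on made-up rows, so its numbers say nothing about the real data; `toy` and `cropdmg` are hypothetical names, and in the actual analysis the input would be `data` with its CROPDMGCOST column.

```r
library(dplyr)

# Made-up rows for illustration only
toy <- data.frame(
  EVTYPE      = c("DROUGHT", "FLOOD", "DROUGHT", "HAIL"),
  CROPDMGCOST = c(5e6, 2e6, 1e6, 5e5),
  stringsAsFactors = FALSE
)

# Sum the crop damage cost per event type and sort from largest to smallest
cropdmg <- toy %>%
  group_by(EVTYPE) %>%
  summarise(TOTAL_CROPDMGCOST = sum(CROPDMGCOST)) %>%
  arrange(desc(TOTAL_CROPDMGCOST))
cropdmg
```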

Finally, I will plot the estimated cost of damages caused to properties.

propdmg %>% top_n(10, TOTAL_PROPDMGCOST) %>%
  ggplot(aes(reorder(EVTYPE, -TOTAL_PROPDMGCOST), TOTAL_PROPDMGCOST, fill = EVTYPE)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  scale_fill_brewer(palette = "Spectral") + 
  theme(axis.text.x = element_text(angle = 90)) +
  ggtitle("Top 10 Storms that Damage Properties") +
  xlab("Storm Type") + 
  ylab("Estimate of Damages to Properties")


Floods are the most costly in terms of property damage, with roughly twice the total of hurricanes and typhoons in second place; tornados are a close third.
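
The three plots above share one pattern: keep the largest rows of a summary table, order the bars by value, and apply the same styling. As a sketch of how that pattern could be factored out, the hypothetical helper `plot_top` below (not part of the original analysis) takes the summary table and the name of the value column; it is shown on a made-up table.

```r
library(ggplot2)

# Hypothetical helper: bar chart of the n largest rows of a summary table,
# with bars ordered from largest to smallest value.
plot_top <- function(df, value_col, title, y_label, n = 10) {
  # Keep the n rows with the largest values in value_col
  top <- head(df[order(-df[[value_col]]), ], n)
  # Fix the bar order by making EVTYPE a factor whose levels follow the sort
  labels <- as.character(top$EVTYPE)
  top$EVTYPE <- factor(labels, levels = labels)
  ggplot(top, aes(x = EVTYPE, y = .data[[value_col]], fill = EVTYPE)) +
    geom_col(show.legend = FALSE) +
    scale_fill_brewer(palette = "Spectral") +
    theme(axis.text.x = element_text(angle = 90)) +
    ggtitle(title) +
    xlab("Storm Type") +
    ylab(y_label)
}

# Made-up summary table for illustration only
toy <- data.frame(
  EVTYPE = c("HAIL", "FLOOD", "TORNADO", "HEAT"),
  TOTAL  = c(10, 40, 30, 20),
  stringsAsFactors = FALSE
)
p <- plot_top(toy, "TOTAL", "Toy example", "Total", n = 3)
p
```

With such a helper, each of the plots above would reduce to a single call that passes the relevant table, value column, title, and axis label.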

From the analysis, I deduce that tornados are the biggest concern when it comes to public health, with excessive heat also a problem (though not nearly as much so as tornados). I also deduce that floods pose the biggest threat to the US economy in terms of the property damage they cause, with hurricanes and typhoons also a problem.