Synopsis

The purpose of this analysis is to briefly summarize the data concerning weather events which may affect the health of the US Citizens, and the economic impact of such events. Ultimately, this introduction may be used to dig deeper into the impact of natural phenomena so as to guide public policy. We find that the most common weather event reported has to do with strong winds. However, when it comes to casualtes related to those events, the most dangerous ones to the health of the population are tornados. However, when it comes to measuring the economic impact, the most damaging weather event are floods.

This RPub page was created as as an assignment for the John Hopkins University Reproducible Resarch course available in coursera, and has no other intention than serve for the evaluation of said course."

Introduction

As stated in the abstract, this document was prepared for the peer-graded assignment, course project 2 in the John Hopkins University Reproducible Research course available in Coursera. Documentation for the data set can be found here. Further information from the National Environmental Satellite, Data, and Information service can be found in here. The objective of this project is to revise some weather data made available through the Coursera web site, and assess the impact of weather events (tornadoes, hurricanes,…) in the economy as well as in the health of the population.

Two questions are adressed in this analysis:

  1. Across the United States, which types of events are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

The data analysis was conducted in R, and the html document was elaborated using the rmarkdown and the knitr packages. The session information is presented below.

sessionInfo()
## R version 3.6.2 (2019-12-12)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=Spanish_Mexico.1252  LC_CTYPE=Spanish_Mexico.1252   
## [3] LC_MONETARY=Spanish_Mexico.1252 LC_NUMERIC=C                   
## [5] LC_TIME=Spanish_Mexico.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.2  magrittr_1.5    tools_3.6.2     htmltools_0.4.0
##  [5] yaml_2.2.1      Rcpp_1.0.3      stringi_1.4.5   rmarkdown_2.1  
##  [9] knitr_1.27      stringr_1.4.0   xfun_0.12       digest_0.6.23  
## [13] rlang_0.4.4     evaluate_0.14

Data Processing

As noted before, the data is available in the Coursera page for the course, so first we’ll create a new directory inside our working directory, and download the data into this new folder. The data downloaded is a comma-separated- value compressed via the bzip2 algorithm, so we’ll need to use the bzfile() function to read our data.

# Let's create a directory for our data
if(!file.exists(".\\Data")){dir.create(".\\Data")}

d_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

# Downloading the file.
# To avoid knitr downloading and unziping the file each time the rmarkdown file is compiled, we'll create an
# if condition so as to only download the file if it's not already been downloaded.

if(!file.exists(".\\Data\\Storm_Data.csv.bz2")){
     download.file(url = d_url, destfile = ".\\Data\\Storm_Data.csv.bz2")
}

stormData <- read.csv(".\\Data\\Storm_Data.csv.bz2")

The data set loaded consists of 37 variables and 902,297 observations. Let’s quickly take a look at the structure.

str(stormData)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

We’ll just use a some of these variables, mainly, EVTYPE meaning the event type. For our health indicators, we’ll keep the variables FATALITIES (number of fatalities) and INJURIES(number of people injured). For our economic indicators, we’ll keep the variables PROPDMG (propoerty damage), PROPDMGEXP(property damage magnitude - K for thousands, M for millinos and B for billions), CROPDMG (crop damage) and CROPDMGEXP (crop damage magnitude). I’ll also keep the REMARKS variable in case something unusual comes up, then we can know what happened.

data <- stormData[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP", "REMARKS")]

The variable that describes th type of event in our data set is called EVTYPE, and is a factor containing 985 levels. The upside of having so many levels is that they are very self-explanatory. The downside is that what works for us is to have the events grouped in more readable categories. Let’s take a look at the first 20 types of events.

head(levels(data$EVTYPE), 20)
##  [1] "   HIGH SURF ADVISORY"  " COASTAL FLOOD"         " FLASH FLOOD"          
##  [4] " LIGHTNING"             " TSTM WIND"             " TSTM WIND (G45)"      
##  [7] " WATERSPOUT"            " WIND"                  "?"                     
## [10] "ABNORMAL WARMTH"        "ABNORMALLY DRY"         "ABNORMALLY WET"        
## [13] "ACCUMULATED SNOWFALL"   "AGRICULTURAL FREEZE"    "APACHE COUNTY"         
## [16] "ASTRONOMICAL HIGH TIDE" "ASTRONOMICAL LOW TIDE"  "AVALANCE"              
## [19] "AVALANCHE"              "BEACH EROSIN"

Just from this we can see that sometimes there are many events just because of a typo, such as reporting Avalance instead of Avalanche. If we dig deeper into the data, we’ll find other examples such as “BLACK ICE” and “Black Ice” representing different types (capital letters), or “BITTER WIND CHILL TEMPERATURES” and “BITTER WIND CHILL”. Let’s work into cleaning this data.

library(dplyr)
# Start by setting all factors to lower case and removing needless spacing
data$EVTYPE <- tolower(data$EVTYPE)
data$EVTYPE <- trimws(data$EVTYPE)

# Several EVTYPE that start with "summary of [date]". We don't need that.
data <- filter(data, grepl("summary", EVTYPE) == FALSE)

# We'll then continue to group some data

# Avalanche
data$EVTYPE <- gsub(".*avala.*", "avalanche", data$EVTYPE)
# Snow
data$EVTYPE <- gsub(".*snow.*", "snow", data$EVTYPE)
# Tornado
data$EVTYPE <- gsub(".*tornado.*", "tornado", data$EVTYPE)
data$EVTYPE <- gsub(".*landspout.*", "tornado", data$EVTYPE)
# Fire
data$EVTYPE <- gsub(".*fire.*", "fire", data$EVTYPE)
# Hurricane
data$EVTYPE <- gsub(".*hurricane.*", "hurricane", data$EVTYPE)
# Tsunami
data$EVTYPE <- gsub(".*tsunami.*", "tsunami", data$EVTYPE)
# Volcanic
data$EVTYPE <- gsub(".*volc.*", "volcanic activity", data$EVTYPE)
# Fog
data$EVTYPE <- gsub(".*fog.*", "fog", data$EVTYPE)
data$EVTYPE <- gsub(".*vog.*", "fog", data$EVTYPE)
# Lightning
data$EVTYPE <- gsub(".*light.*", "lightning", data$EVTYPE)
# Dryness
data$EVTYPE <- gsub(".*dry.*", "dryness", data$EVTYPE)
data$EVTYPE <- gsub(".*drie.*", "dryness", data$EVTYPE)
# Other arine related
data$EVTYPE <- gsub(".*marin.*", "other marine related", data$EVTYPE)
data$EVTYPE <- gsub(".*tide.*", "other marine related", data$EVTYPE)
data$EVTYPE <- gsub(".*surf.*", "other marine related", data$EVTYPE)
data$EVTYPE <- gsub(".*seas.*", "other marine related", data$EVTYPE)
data$EVTYPE <- gsub(".*waves.*", "other marine related", data$EVTYPE)
# Rip current
data$EVTYPE <- gsub(".*rip.*", "landslide, rock slide or mudslide", data$EVTYPE)
# Landslides (landslide, rockslide, mudslide)
data$EVTYPE <- gsub(".*slide.*", "landslide, rock slide or mudslide", data$EVTYPE)
data$EVTYPE <- gsub(".*slump.*", "landslide, rock slide or mudslide", data$EVTYPE)
# Flood
data$EVTYPE <- gsub(".*flood.*", "flood", data$EVTYPE)
data$EVTYPE <- gsub(".*flooo.*", "flood", data$EVTYPE)
data$EVTYPE <- gsub(".*floy.*", "flood", data$EVTYPE)
# Hail
data$EVTYPE <- gsub(".*hail.*", "hail", data$EVTYPE)
# Waterspout
data$EVTYPE <- gsub(".*spout.*", "waterspout", data$EVTYPE)
# rain
data$EVTYPE <- gsub(".*rain.*", "rain", data$EVTYPE)
data$EVTYPE <- gsub(".*precip.*", "rain", data$EVTYPE)
data$EVTYPE <- gsub(".*shower*", "rain", data$EVTYPE)
data$EVTYPE <- gsub(".*sleet.*", "rain", data$EVTYPE)
# Heat
data$EVTYPE <- gsub(".*heat.*", "heat", data$EVTYPE)
data$EVTYPE <- gsub(".*hot.*", "heat", data$EVTYPE)
data$EVTYPE <- gsub(".*warm.*", "heat", data$EVTYPE)
data$EVTYPE <- gsub(".*high temperature.*", "heat", data$EVTYPE)
# cold
data$EVTYPE <- gsub(".*cold.*", "cold", data$EVTYPE)
data$EVTYPE <- gsub(".*frost.*", "cold", data$EVTYPE)
data$EVTYPE <- gsub(".*cool.*", "cold", data$EVTYPE)
data$EVTYPE <- gsub(".*wint.*", "cold", data$EVTYPE)
data$EVTYPE <- gsub(".*low temperature.*", "cold", data$EVTYPE)
data$EVTYPE <- gsub(".*freez.*", "cold", data$EVTYPE)

# We leave storm and winds at the end
# Winds
data$EVTYPE <- gsub(".*wind.*", "wind", data$EVTYPE)
data$EVTYPE <- gsub(".*wnd.*", "wind", data$EVTYPE)
data$EVTYPE <- gsub(".*w ind.*", "wind", data$EVTYPE)
# Storm
data$EVTYPE <- gsub(".*storm.*", "storm", data$EVTYPE)
data$EVTYPE <- gsub("^storm", "storm", data$EVTYPE)
data$EVTYPE <- gsub("storm$", "storm", data$EVTYPE)

Sure, there are other atypical observations event types, but for now that will have to do. There’s just one step left to make the data set we want. As you may remember, we have four economic impact indicators: PROPDMGand CROPDMG, both showing an amount, and PROPDMGEXPand CROPDMGEXP, both showing the magnitude (K for thousands, M for millions, and B for billions). We want to have comparable data in the same variable, so we must multiply the amounts.

However, we note that although the documentation only specifies what K, M, and B stand for, variables PROPDMGEXP and CROPDMGEXP contain additional levels, such as -, ?, + and numbers 0-8. The documentation is not clear as to what this means and in the REMARKS variables, we don’t find some consistent information as to what these values stand for. Further down you can see examples of these remarks.

data[data$PROPDMGEXP == "0", ]$REMARKS[1]
## [1] Highest tides of the year combined with 35 mph south winds brought tide levels of 13.9 feet to the area.  Damage was $1500.00 to the Lowell Point Road. 
## 436781 Levels:  -2 at Deer Park\n ... Zones 22 and 23 were added to the high wind warning of  January 26. Peak winds Sitka 55MPH, Cape Decision 58MPH, and Cape Spencer 64MPH.\n
data[data$PROPDMGEXP == "?", ]$REMARKS[1]
## [1] Hail was again the main feature from these late night thunderstorms.  One report of tree damage was received from Galva, Illinois, where 18-inch-diameter tree limbs were down.   
## 436781 Levels:  -2 at Deer Park\n ... Zones 22 and 23 were added to the high wind warning of  January 26. Peak winds Sitka 55MPH, Cape Decision 58MPH, and Cape Spencer 64MPH.\n
data[data$PROPDMGEXP == "+", ]$REMARKS[1]
## [1] Typical of this time of year is the seasonal breakup process across mainland Alaska.  Major rivers, which serve as transportation corridors around the state, undergo the change from ice covered roadways (even plowed by the state) to summer boat/barge traffic.  May 6th and 7th brought flooding to Bethel, on the Kuskokwim River.  Damage occurred to the seawall and caused Brown Slough to back up.  Areas along the waterfront and low areas bordering the slough were flooded, worst of which occurred along Alder Street, East Avenue, Fourth through Seventh Avenues, Main Street, and Settler Drive.  Several vehicles were nearly submerged and several people were rescued from homes that had flood water coming in.  Additionally, there were several oil spills from local tanks used to heat homes.  Other villages affected by breakup on the Kuskokwim River were (in order) Aniak (damaged roads and damaged dike), Akiak, Akiachak, Kwethluk (airport road damage, extensive boardwalk damage, and insulation damage to 13 homes), Oscarville (flooded sewage lagoon, flooded streets and boardwalk, and a few flooded structures), and then Napaskiak (flooded roads and the new runway was isolated, insulation damage to six homes).  Upper and Lower Kalskag were also affected by "minor" breakup flooding. Minor flooding was reported along the Yukon River at Pilot Station, Marshall, Emmonak (six houses), and Alakanuk (roads and other low lying areas). 
## 436781 Levels:  -2 at Deer Park\n ... Zones 22 and 23 were added to the high wind warning of  January 26. Peak winds Sitka 55MPH, Cape Decision 58MPH, and Cape Spencer 64MPH.\n

For full disclosure, and since there are few cases, I’m going to assume that anyghing not having K, M, or Bis just the amount in US dollars, instead of thousands, millions or billions.

data$PROPDMGEXP <- toupper(data$PROPDMGEXP)
data$CROPDMGEXP <- toupper(data$CROPDMGEXP)
data$PROPDMGEXP <- trimws(data$PROPDMGEXP)
data$CROPDMGEXP <- trimws(data$CROPDMGEXP)

data$PDMG <- data$PROPDMG
data[data$PROPDMGEXP == "K", ]$PDMG <- data[data$PROPDMGEXP == "K", ]$PROPDMG * 1e+03
data[data$PROPDMGEXP == "M", ]$PDMG <- data[data$PROPDMGEXP == "M", ]$PROPDMG * 1e+06
data[data$PROPDMGEXP == "B", ]$PDMG <- data[data$PROPDMGEXP == "B", ]$PROPDMG * 1e+09

data$CDMG <- data$CROPDMG
data[data$CROPDMGEXP == "K", ]$CDMG <- data[data$CROPDMGEXP == "K", ]$CROPDMG * 1e+03
data[data$CROPDMGEXP == "M", ]$CDMG <- data[data$CROPDMGEXP == "M", ]$CROPDMG * 1e+06
data[data$CROPDMGEXP == "B", ]$CDMG <- data[data$CROPDMGEXP == "B", ]$CROPDMG * 1e+09

Once having this data, we can collapse the information into a new data set for our results.

results_data <- data %>%
     group_by(EVTYPE) %>%
     summarize(events = length(EVTYPE), 
               fatalities = sum(FATALITIES), 
               f_mean = mean(FATALITIES), 
               injuries = sum(INJURIES), 
               i_mean = mean(INJURIES), 
               pdmg = sum(PDMG), 
               pdmg_mean = mean(PDMG), 
               cdmg = sum(CDMG), 
               cdmg_mean = mean(CDMG)) %>%
     arrange(desc(events))

head(results_data, 15)
## # A tibble: 15 x 10
##    EVTYPE events fatalities  f_mean injuries  i_mean    pdmg pdmg_mean    cdmg
##    <chr>   <int>      <dbl>   <dbl>    <dbl>   <dbl>   <dbl>     <dbl>   <dbl>
##  1 wind   349972       1147 3.28e-3    11226 3.21e-2 1.58e10    4.51e4 1.96e 9
##  2 hail   289956         20 6.90e-5     1467 5.06e-3 1.60e10    5.52e4 3.11e 9
##  3 flood   82721       1525 1.84e-2     8604 1.04e-1 1.68e11    2.03e6 1.24e10
##  4 torna~  60702       5661 9.33e-2    91407 1.51e+0 5.86e10    9.65e5 4.17e 8
##  5 cold    23667        724 3.06e-2     2305 9.74e-2 7.04e 9    2.98e5 3.44e 9
##  6 snow    17704        169 9.55e-3     1165 6.58e-2 1.03e 9    5.81e4 1.35e 8
##  7 light~  15803        817 5.17e-2     5232 3.31e-1 9.34e 8    5.91e4 1.21e 7
##  8 other~  14335        261 1.82e-2      365 2.55e-2 4.77e 9    3.33e5 4.26e 7
##  9 rain    12310        112 9.10e-3      329 2.67e-2 3.23e 9    2.62e5 7.95e 8
## 10 funne~   6844          0 0.             3 4.38e-4 1.95e 5    2.84e1 0.     
## 11 fire     4240         90 2.12e-2     1608 3.79e-1 8.50e 9    2.01e6 4.03e 8
## 12 water~   3849          3 7.79e-4       29 7.53e-3 9.56e 6    2.48e3 0.     
## 13 storm    3497        195 5.58e-2     2892 8.27e-1 5.62e10    1.61e7 5.74e 9
## 14 urban~   3392         28 8.25e-3       79 2.33e-2 5.83e 7    1.72e4 8.49e 6
## 15 heat     2839       3133 1.10e+0     9226 3.25e+0 2.03e 7    7.16e3 9.04e 8
## # ... with 1 more variable: cdmg_mean <dbl>

This final table is what we’ll use for our results

Results

Let’s look at our results. First, we’ll look at the fatalities and injuries caused by the weather events. We’ll show two plots, the first one includes the first

library(ggplot2)
library(reshape2)
results_data$f_i <- results_data$fatalities + results_data$injuries
results_data <- arrange(results_data, desc(f_i))
top10 <- unlist(as.list(results_data[1:10, "EVTYPE"]))
     
p1 <- results_data %>%
     select(EVTYPE, fatalities, injuries) %>% 
     filter(EVTYPE %in% top10)
p1 <- melt(p1, id = c("EVTYPE"))
names(p1) <- c("EVTYPE", "Casualty", "value")

ggplot(p1, aes(fill = Casualty, y = value, x = reorder(EVTYPE, -value))) + 
     geom_bar(position="stack", stat="identity") + 
     ggtitle("Health impact of weather events") +
     xlab("Weather event") + 
     ylab("Number of casualties") + 
     theme(plot.title = element_text(hjust = 0.5), 
           axis.text.x = element_text(angle = 45), 
           text = element_text(size = 15)) + 
     theme_bw() + 
     scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE))

As we can see from the table and the plot, although tornadoes rank 4th on the most common events, it’s impact in the population’s health far exceeds the impact of other weather events. With almost 100,000 injuries plus fatalities, it’s far riskier than the most common event, fast winds, wich sums a bit over 12,000 injuries plus fatalities.

As for economic impact.

results_data$Prop_mill <- results_data$pdmg / 1000000
results_data$Crop_mill <- results_data$cdmg / 1000000
results_data$Econ_mill <- results_data$Prop_mill + results_data$Crop_mill

results_data <- arrange(results_data, desc(Econ_mill))
top10 <- unlist(as.list(results_data[1:10, "EVTYPE"]))
     
p2 <- results_data %>%
     select(EVTYPE, Prop_mill, Crop_mill) %>% 
     filter(EVTYPE %in% top10)
p2 <- melt(p2, id = c("EVTYPE"))
names(p2) <- c("EVTYPE", "Indicator", "value")

ggplot(p2, aes(fill = Indicator, y = value, x = reorder(EVTYPE, -value))) + 
     geom_bar(stat="identity") + 
     ggtitle("Economic Damage of weather events") +
     xlab("Weather event") + 
     ylab("Million USD") + 
     theme(plot.title = element_text(hjust = 0.5), 
           axis.text.x = element_text(angle = 45), 
           text = element_text(size = 15)) + 
     scale_y_continuous(labels = function(x) format(x, big.mark = ",", scientific = FALSE)) + 
     scale_fill_manual(values=c("#009B07", "#CCB300"), 
                       name = "Indicator", labels = c("Property damage", "Crop damage"))

Regarding economic damage, we can see that the most damage is caused neither by the most frequent event (strong winds), nor by the event with the most casualties (tornados), but by flood.