Analysis of severe weather events and their economic consequences across the United States

Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Data Processing

Preparation

# Load packages
library(magrittr) # Pipe operator
library(dplyr) # Data manipulation
library(ggplot2) # plotting system using the grammar of graphics
library(tidyr)

# URL to download
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

# Local data filename
dataFileZIP <- "./StormDataset.csv.bz2"

# Directory for the dataset
dirFile <- "./StormDataset"

# If not exist, download the dataset.zip,
if (file.exists(dataFileZIP) == FALSE) {
      download.file(fileURL, destfile = dataFileZIP)
}

# Note: .csv.bz2 can be read directly with read.csv function
StormDataset <- read.csv("StormDataset.csv.bz2")

# Structure of the dataset imported
glimpse(StormDataset)

## Observations: 902,297
## Variables: 37
## $ STATE__    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ BGN_DATE   <fctr> 4/18/1950 0:00:00, 4/18/1950 0:00:00, 2/20/1951 0:...
## $ BGN_TIME   <fctr> 0130, 0145, 1600, 0900, 1500, 2000, 0100, 0900, 20...
## $ TIME_ZONE  <fctr> CST, CST, CST, CST, CST, CST, CST, CST, CST, CST, ...
## $ COUNTY     <dbl> 97, 3, 57, 89, 43, 77, 9, 123, 125, 57, 43, 9, 73, ...
## $ COUNTYNAME <fctr> MOBILE, BALDWIN, FAYETTE, MADISON, CULLMAN, LAUDER...
## $ STATE      <fctr> AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL...
## $ EVTYPE     <fctr> TORNADO, TORNADO, TORNADO, TORNADO, TORNADO, TORNA...
## $ BGN_RANGE  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ BGN_AZI    <fctr> , , , , , , , , , , , , , , , , , , , , , , , , 
## $ BGN_LOCATI <fctr> , , , , , , , , , , , , , , , , , , , , , , , , 
## $ END_DATE   <fctr> , , , , , , , , , , , , , , , , , , , , , , , , 
## $ END_TIME   <fctr> , , , , , , , , , , , , , , , , , , , , , , , , 
## $ COUNTY_END <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ COUNTYENDN <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ END_RANGE  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ END_AZI    <fctr> , , , , , , , , , , , , , , , , , , , , , , , , 
## $ END_LOCATI <fctr> , , , , , , , , , , , , , , , , , , , , , , , , 
## $ LENGTH     <dbl> 14.0, 2.0, 0.1, 0.0, 0.0, 1.5, 1.5, 0.0, 3.3, 2.3, ...
## $ WIDTH      <dbl> 100, 150, 123, 100, 150, 177, 33, 33, 100, 100, 400...
## $ F          <int> 3, 2, 2, 2, 2, 2, 2, 1, 3, 3, 1, 1, 3, 3, 3, 4, 1, ...
## $ MAG        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ FATALITIES <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 4, 0, ...
## $ INJURIES   <dbl> 15, 0, 2, 2, 2, 6, 1, 0, 14, 0, 3, 3, 26, 12, 6, 50...
## $ PROPDMG    <dbl> 25.0, 2.5, 25.0, 2.5, 2.5, 2.5, 2.5, 2.5, 25.0, 25....
## $ PROPDMGEXP <fctr> K, K, K, K, K, K, K, K, K, K, M, M, K, K, K, K, K,...
## $ CROPDMG    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ CROPDMGEXP <fctr> , , , , , , , , , , , , , , , , , , , , , , , , 
## $ WFO        <fctr> , , , , , , , , , , , , , , , , , , , , , , , , 
## $ STATEOFFIC <fctr> , , , , , , , , , , , , , , , , , , , , , , , , 
## $ ZONENAMES  <fctr> , , , , , , , , , , , , , , , , , , , , , , , , 
## $ LATITUDE   <dbl> 3040, 3042, 3340, 3458, 3412, 3450, 3405, 3255, 333...
## $ LONGITUDE  <dbl> 8812, 8755, 8742, 8626, 8642, 8748, 8631, 8558, 874...
## $ LATITUDE_E <dbl> 3051, 0, 0, 0, 0, 0, 0, 0, 3336, 3337, 3402, 3404, ...
## $ LONGITUDE_ <dbl> 8806, 0, 0, 0, 0, 0, 0, 0, 8738, 8737, 8644, 8640, ...
## $ REMARKS    <fctr> , , , , , , , , , , , , , , , , , , , , , , , , 
## $ REFNUM     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...

Manipulation

The data set import contains 902297 observations and 37 attributes. TO answer the question for this project, we limit our focus to the following variables:

EVTYPE: The type of storm event (i.e. Blizzard, Dense Fog, Flood, Heat, ..)
FATALITIES: The number of fatalities as a result of the storm event (direct or indirect)
INJURIES: The number of injuries as a result of the storm event (direct or indirect)
PROPDMG: The total property damage rounded to three significant digits used in conjunction with PROPDMGEXP to determine the appropriate size multiplier.
PROPDMGEXP: A letter code indicating the magnitude of the PROPDMG dollar amount {“K”,“M”,“B”} for “thousands”, “millions” and “billions” respectively.
CROPDMG: The total crop damage rounded to three significant digits used in conjunction with CROPDMGEXP to determine the the appropriate size multiplier.
CROPDMGEXP: A letter code indicating the magnitude of the CROPDMG dollar amount {“K”,“M”,“B”} for “thousands”, “millions” and “billions” respectively.

# Selection of the relevant variables

DataSelection <- StormDataset %>%
      select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

# Structure of the dataset
glimpse(DataSelection)

## Observations: 902,297
## Variables: 7
## $ EVTYPE     <fctr> TORNADO, TORNADO, TORNADO, TORNADO, TORNADO, TORNA...
## $ FATALITIES <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 4, 0, ...
## $ INJURIES   <dbl> 15, 0, 2, 2, 2, 6, 1, 0, 14, 0, 3, 3, 26, 12, 6, 50...
## $ PROPDMG    <dbl> 25.0, 2.5, 25.0, 2.5, 2.5, 2.5, 2.5, 2.5, 25.0, 25....
## $ PROPDMGEXP <fctr> K, K, K, K, K, K, K, K, K, K, M, M, K, K, K, K, K,...
## $ CROPDMG    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ CROPDMGEXP <fctr> , , , , , , , , , , , , , , , , , , , , , , , ,

Event Types

We can rapidly see that there are a few issues that needs to be addressed to have a clean dataset for the analysis. It looks like observations are full of typo issues and inconsistent entry. For this project, we are going to assign event types to a main event categories. This enables the reduction of the number of event types.

# First cleaning on the event type
#####################################

# Upper case all event observations
DataSelection$EVTYPE <- as.factor(toupper(DataSelection$EVTYPE))

# Trim function to remove white spaces at the start and end of the event type
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

# Applying trim function
DataSelection$EVTYPE <- as.factor(trim(DataSelection$EVTYPE))

# Classification of the main event categories
DataSelection[grepl("HAIL", DataSelection$EVTYPE), "EVTYPE"] <- "HAIL"
DataSelection[grepl("HEAVY RAIN", DataSelection$EVTYPE), "EVTYPE"] <- "HEAVY RAIN"
DataSelection[grepl("HEAVY SNOW", DataSelection$EVTYPE), "EVTYPE"] <- "HEAVY SNOW"
DataSelection[grepl("HIGH WIND", DataSelection$EVTYPE), "EVTYPE"] <- "HIGH WIND"
DataSelection[grepl("HURRICANE", DataSelection$EVTYPE), "EVTYPE"] <- "HURRICANE"
DataSelection[grepl("THUNDERSTORM", DataSelection$EVTYPE), "EVTYPE"] <- "THUNDERSTORM"
DataSelection[grepl("TSTM", DataSelection$EVTYPE), "EVTYPE"] <- "THUNDERSTORM"
DataSelection[grepl("TROPICAL STORM", DataSelection$EVTYPE), "EVTYPE"] <- "TROPICAL STORM"
DataSelection[grepl("URBAN", DataSelection$EVTYPE), "EVTYPE"] <- "URBAN FLOOD"
DataSelection[grepl("WATERSPOUT", DataSelection$EVTYPE), "EVTYPE"] <- "WATERSPOUT"
DataSelection[grepl("WINTER STORM", DataSelection$EVTYPE), "EVTYPE"] <- "WINTER STORM"
DataSelection[grepl("SUMMARY", DataSelection$EVTYPE), "EVTYPE"] <- NA
DataSelection[grepl("LIGHTNING", DataSelection$EVTYPE), "EVTYPE"] <- "LIGHTNING"
DataSelection[grepl("LIGHTING", DataSelection$EVTYPE), "EVTYPE"] <- "LIGHTNING"
DataSelection[grepl("FLOOD", DataSelection$EVTYPE), "EVTYPE"] <- "FLOOD"
DataSelection[grepl("HEAT", DataSelection$EVTYPE), "EVTYPE"] <- "HEAT"

# Unused levels are removed from the dataset
DataSelection$EVTYPE <- droplevels(DataSelection$EVTYPE)

Value manipulation

The following code reassign coding expression to a numeric value for the computation of the total damage value for each events. Values have been deducted from the dataset documentation. The manipulation affects the following variables: determine the appropriate size multiplier. - PROPDMGEXP: A letter code indicating the magnitude of the PROPDMG dollar amount {“K”,“M”,“B”} for “thousands”, “millions” and “billions” respectively. - CROPDMGEXP: A letter code indicating the magnitude of the CROPDMG dollar amount {“K”,“M”,“B”} for “thousands”, “millions” and “billions” respectively.

# Second clearing of the EXP value
####################################
# Assigning values for the property exponent data
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "K"] <- 1000
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "M"] <- 1e+06
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == ""] <- 1
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "B"] <- 1e+09
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "m"] <- 1e+06
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "0"] <- 1
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "5"] <- 1e+05
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "6"] <- 1e+06
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "4"] <- 10000
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "2"] <- 100
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "3"] <- 1000
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "h"] <- 100
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "7"] <- 1e+07
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "H"] <- 100
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "1"] <- 10
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "8"] <- 1e+08
# Assigning '0' to invalid exponent data
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "+"] <- 0
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "-"] <- 0
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "?"] <- 0

DataSelection$PROPEXP <- as.numeric(DataSelection$PROPEXP)

# Assigning values for the crop exponent data 
DataSelection$CROPEXP[DataSelection$CROPDMGEXP == "M"] <- 1e+06
DataSelection$CROPEXP[DataSelection$CROPDMGEXP == "K"] <- 1000
DataSelection$CROPEXP[DataSelection$CROPDMGEXP == "m"] <- 1e+06
DataSelection$CROPEXP[DataSelection$CROPDMGEXP == "B"] <- 1e+09
DataSelection$CROPEXP[DataSelection$CROPDMGEXP == "0"] <- 1
DataSelection$CROPEXP[DataSelection$CROPDMGEXP == "k"] <- 1000
DataSelection$CROPEXP[DataSelection$CROPDMGEXP == "2"] <- 100
DataSelection$CROPEXP[DataSelection$CROPDMGEXP == ""] <- 1
# Assigning '0' to invalid exponent data
DataSelection$CROPEXP[DataSelection$CROPDMGEXP == "?"] <- 0

DataSelection$CROPEXP <- as.numeric(DataSelection$CROPEXP)

Final selection

The final steps removes the NA values in the dataset and compute the following operation on the dataset: - calculate the total damage value for crop and properties damages - group by event types - remove zero values events

DataSelection <- na.omit(DataSelection)

# Final computation to get:
###################################

DataSelection <- DataSelection %>%
      mutate(PROPVAL = PROPDMG * PROPEXP) %>%
      mutate(CROPVAL = CROPDMG * CROPEXP) %>%
      select(EVTYPE, FATALITIES, INJURIES, PROPVAL, CROPVAL) %>%
      group_by(EVTYPE) %>%
      summarise_all(sum) %>%
      filter(FATALITIES > 0 & INJURIES > 0  & PROPVAL > 0 & CROPVAL > 0)

summary(DataSelection)

##                      EVTYPE     FATALITIES        INJURIES    
##  BLIZZARD               : 1   Min.   :   1.0   Min.   :    1  
##  COLD/WIND CHILL        : 1   1st Qu.:  12.5   1st Qu.:   26  
##  DRY MICROBURST         : 1   Median :  66.0   Median :  280  
##  DUST STORM             : 1   Mean   : 395.0   Mean   : 3923  
##  EXTREME COLD           : 1   3rd Qu.: 131.0   3rd Qu.: 1333  
##  EXTREME COLD/WIND CHILL: 1   Max.   :5633.0   Max.   :91346  
##  (Other)                :29                                   
##     PROPVAL             CROPVAL         
##  Min.   :1.500e+04   Min.   :5.000e+03  
##  1st Qu.:7.690e+06   1st Qu.:5.525e+05  
##  Median :3.246e+08   Median :2.744e+07  
##  Mean   :1.216e+10   Mean   :9.412e+08  
##  3rd Qu.:5.461e+09   3rd Qu.:6.983e+08  
##  Max.   :1.682e+11   Max.   :1.239e+10  
##

Results

Events and harm to population

Across the United States, which types of events are most harmful with respect to population health?

To answer this question, we will utilize the top 10 events with the most fatalies and injuries.

# Preparing dataset for plotting
# Note: for the ordering of the data, we place fatalities first
(USeventPOP <- DataSelection %>%
      select(EVTYPE, FATALITIES, INJURIES) %>%
      gather(key = POPHEALTH, value = NUMBERPEOPLE, FATALITIES, INJURIES) %>%
      arrange(desc(NUMBERPEOPLE)) %>%
      head(n = 10))

## # A tibble: 10 × 3
##          EVTYPE  POPHEALTH NUMBERPEOPLE
##          <fctr>      <chr>        <dbl>
## 1       TORNADO   INJURIES        91346
## 2  THUNDERSTORM   INJURIES         9448
## 3          HEAT   INJURIES         9224
## 4         FLOOD   INJURIES         8683
## 5       TORNADO FATALITIES         5633
## 6     LIGHTNING   INJURIES         5231
## 7          HEAT FATALITIES         3138
## 8     ICE STORM   INJURIES         1975
## 9         FLOOD FATALITIES         1552
## 10    HIGH WIND   INJURIES         1522

# Plot the dataset
ggplot(USeventPOP, aes(x = reorder(EVTYPE, -NUMBERPEOPLE), y = NUMBERPEOPLE, fill = POPHEALTH)) +
      geom_bar(stat = "identity", position = "stack") +
      labs(x = "Event Type", y = "Number of People") +
      ggtitle("Most Harmful Events across the U.S. with respect to Population Health") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))

The graph clearly shows that tornados are by far the most harmful event across the United States, both in number of fatalities and injuries. Following tornado events, heat-related events, thunderstorms and flood are the most harmful events.

Events and Economic consequences

Across the United States, which types of events have the greatest economic consequences?

# Preparing dataset for plotting
# Note: for the ordering of the data, we place fatalities first
(USeventDAMG <- DataSelection %>%
      select(EVTYPE, PROPVAL, CROPVAL) %>%
      gather(key = VALUETYPE, value = TOTALVALUE, PROPVAL, CROPVAL) %>%
      arrange(desc(TOTALVALUE)) %>%
      head(n = 10))

## # A tibble: 10 × 3
##            EVTYPE VALUETYPE   TOTALVALUE
##            <fctr>     <chr>        <dbl>
## 1           FLOOD   PROPVAL 168248038513
## 2       HURRICANE   PROPVAL  84656180010
## 3         TORNADO   PROPVAL  56947380617
## 4     STORM SURGE   PROPVAL  43323536000
## 5            HAIL   PROPVAL  17622991537
## 6           FLOOD   CROPVAL  12388194200
## 7    THUNDERSTORM   PROPVAL  11139326676
## 8  TROPICAL STORM   PROPVAL   7714390550
## 9    WINTER STORM   PROPVAL   6689497251
## 10      HIGH WIND   PROPVAL   6157245003

# Plotting the graph
ggplot(USeventDAMG, aes(x = reorder(EVTYPE, -TOTALVALUE), y = TOTALVALUE, fill = VALUETYPE)) +
      geom_bar(stat = "identity", position = "stack") +
      labs(x = "Event Type", y = "Damaged Economic Value") +
      ggtitle("Most Harmful Events across the U.S. with respect to Economic Value") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))

Flood represents the main harmful economic events across the United States. Water damages can be severe to properties in particular.

We can notice that the majority of the economic damages are related to properties for the top 10 type of events with the maximum damage value.

Session Information

sessionInfo()

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 14393)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] tidyr_0.6.0          ggplot2_2.2.1        dplyr_0.5.0         
## [4] magrittr_1.5         RDocumentation_0.7.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.8         knitr_1.15.1        devtools_1.12.0    
##  [4] munsell_0.4.3       githubinstall_0.2.1 colorspace_1.3-2   
##  [7] rjson_0.2.15        R6_2.2.0            plyr_1.8.4         
## [10] stringr_1.1.0       httr_1.2.1          tools_3.3.2        
## [13] grid_3.3.2          gtable_0.2.0        data.table_1.10.0  
## [16] DBI_0.5-1           withr_1.0.2         htmltools_0.3.5    
## [19] lazyeval_0.2.0      yaml_2.1.14         rprojroot_1.1      
## [22] digest_0.6.10       assertthat_0.1      tibble_1.2         
## [25] curl_2.3            memoise_1.0.0       evaluate_0.10      
## [28] rmarkdown_1.3       labeling_0.3        stringi_1.1.2      
## [31] scales_0.4.1        backports_1.0.4     jsonlite_1.1       
## [34] proto_1.0.0

ReproducibleResearch_Project2

Pierre Baudin

January 27, 2017