Reproducible Research - Course Project 2

Title

Tornadoes and Floods Are the Major Weather Events in the United States
with Respect to Their Health and Economic Impacts

Synopsis

The goal of this work is to explore the NOAA Storm Database and answer two basic questions about severe weather events. The analysis shows that, across the United States, tornadoes (EVTYPE “TORNADO”) are the most harmful to population health, while floods (EVTYPE “FLOOD”) have the greatest economic consequences.

Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Data Processing

Data source
The data for this work come as a comma-separated-value (CSV) file, compressed with the bzip2 algorithm to reduce its size. The file can be downloaded from the following web site:

Storm Data [47Mb]

Documentation of the database is also available; it describes how some of the variables are constructed and defined.

Data Analysis
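Set up

#load the packages used in the analysis
library(dplyr)    # select(), filter(), mutate(), group_by(), summarise(), slice_max(), %>%
library(ggplot2)  # ggplot(), geom_col(), coord_flip(), labs()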

Importing data

#download the file only if it is not already present
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
path <- getwd()
destfile <- file.path(path, "StormData.csv.bz2")
if (!file.exists(destfile)) {
        message("Downloading file...")
        download.file(url, destfile, mode = "wb")
} else {
        message("Using cached file.")
}
## Using cached file.

Reading data

#read data
data <- read.csv(bzfile("StormData.csv.bz2"), header = TRUE, sep=",")
#inspecting data
str(data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
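
As a quick illustration of why no manual decompression step is needed, the following sketch round-trips a tiny data frame through a bzip2-compressed CSV using only base R. The toy data frame and the temporary file are illustrative assumptions and are not part of the storm data.

```r
# Toy data frame (hypothetical values, for illustration only)
toy <- data.frame(EVTYPE  = c("TORNADO", "FLOOD"),
                  PROPDMG = c(25, 2.5),
                  stringsAsFactors = FALSE)

# Write it to a temporary bzip2-compressed CSV file
tmp <- tempfile(fileext = ".csv.bz2")
write.csv(toy, file = bzfile(tmp), row.names = FALSE)

# read.csv() decompresses transparently through the bzfile() connection
toy_back <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
unlink(tmp)

# Compare the original and the round-tripped data
roundtrip_ok <- isTRUE(all.equal(toy, toy_back))
roundtrip_ok
```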

Processing data

Since only the health and economic impacts of extreme weather events are of interest, the following columns are kept using the dplyr “select” function:

  • EVTYPE
  • FATALITIES
  • INJURIES
  • PROPDMG
  • PROPDMGEXP
  • CROPDMG
  • CROPDMGEXP

The data are then filtered to keep only events with at least one fatality, one injury, or a nonzero property or crop damage figure; events with neither health nor economic impact are excluded.

#subset data

tdata <- tibble::as_tibble(data)
subset_data <- tdata %>%
                select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, 
                       CROPDMG, CROPDMGEXP) %>%
                filter(FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0)
str(subset_data)                
## tibble [254,633 × 7] (S3: tbl_df/tbl/data.frame)
##  $ EVTYPE    : chr [1:254633] "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ FATALITIES: num [1:254633] 0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num [1:254633] 15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num [1:254633] 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr [1:254633] "K" "K" "K" "K" ...
##  $ CROPDMG   : num [1:254633] 0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr [1:254633] "" "" "" "" ...
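
The filtering rule can be checked on a small example. The sketch below uses base R rather than dplyr, and the toy values are hypothetical; only the two row counts in the last line come from the str() outputs above.

```r
# Toy events (hypothetical values, for illustration only)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD", "HIGH WIND"),
  FATALITIES = c(0, 0, 1, 0),
  INJURIES   = c(15, 0, 0, 0),
  PROPDMG    = c(25, 0, 0, 0),
  CROPDMG    = c(0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Keep a row if at least one of the four measures is positive
keep <- toy$FATALITIES > 0 | toy$INJURIES > 0 |
        toy$PROPDMG > 0 | toy$CROPDMG > 0
toy_kept <- toy[keep, ]
toy_kept$EVTYPE   # "TORNADO" "FLOOD"

# Percentage of rows retained in the real data (254633 of 902297)
pct_kept <- round(100 * 254633 / 902297, 1)
pct_kept          # 28.2
```

Roughly 28% of the recorded events therefore carry any reported health or economic impact.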

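The PROPDMGEXP and CROPDMGEXP columns hold a code for the magnitude of the
damage figures: “K” for thousands, “M” for millions, and “B” for billions.
The code below normalizes these codes, maps each one to a numeric multiplier,
and multiplies it by PROPDMG or CROPDMG to obtain the damage in dollars. In
this analysis, “H” is treated as hundreds, a digit n as 10^n, and blank or
unrecognized codes (“?”, “+”, “-”) as a multiplier of 1.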

# Finding the property/crop damage exponents and levels
unique(subset_data$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "4" "h" "2" "7" "3" "H" "-"
unique(subset_data$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k"
# Remove the white spaces, if any.
subset_data$PROPDMGEXP <- trimws(subset_data$PROPDMGEXP) 
subset_data$CROPDMGEXP <- trimws(subset_data$CROPDMGEXP)

# Convert the small letter to capital letter
subset_data$PROPDMGEXP <- toupper(subset_data$PROPDMGEXP)
subset_data$CROPDMGEXP <- toupper(subset_data$CROPDMGEXP)

# Build a lookup table that converts exponent characters ("keynames")
# to magnitudes ("keyvalues")
keynames <- c("H", "K", "M", "B",
              "0", "1", "2", "3", "4", "5", "6", "7", "8",
              "", "?", "+", "-")
keyvalues <- c(100, 1000, 1e+06, 1e+09,
               1, 10, 100, 1000, 10000, 100000, 1000000, 1e+07, 1e+08,
               1, 1, 1, 1)
map_df <- data.frame(exp = keynames, 
                     val = keyvalues, 
                     stringsAsFactors = FALSE)
subset_data$PROPDMGEKEY <- map_df$val[match(subset_data$PROPDMGEXP, map_df$exp)]
subset_data$CROPDMGEKEY <- map_df$val[match(subset_data$CROPDMGEXP, map_df$exp)]
# Fill any remaining NAs with 1 (as a safety measure)
subset_data$PROPDMGEKEY[is.na(subset_data$PROPDMGEKEY)] <- 1
subset_data$CROPDMGEKEY[is.na(subset_data$CROPDMGEKEY)] <- 1

# Calculate the dollar value of the property and crop damage
subset_data$PROPDMGEVAL <- subset_data$PROPDMGEKEY * subset_data$PROPDMG
subset_data$CROPDMGEVAL <- subset_data$CROPDMGEKEY * subset_data$CROPDMG
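
To make the lookup step concrete, here is a minimal sketch of the same match()-based conversion on a handful of made-up records. The vectors `dmg` and `dmg_exp`, the shortened lookup table, and the code "Z" are hypothetical and chosen only to exercise the case normalization and the fallback to 1.

```r
# Lookup table: a subset of the codes used above
exp_codes  <- c("H", "K", "M", "B", "")
exp_values <- c(100, 1e3, 1e6, 1e9, 1)

# Toy damage records (hypothetical values); "Z" is a deliberately unknown code
dmg     <- c(25, 2.5, 1.2, 3, 10)
dmg_exp <- c("K", "m", "B", "", "Z")

# Normalize case, look up the multiplier, and fall back to 1 for unknown codes
mult <- exp_values[match(toupper(dmg_exp), exp_codes)]
mult[is.na(mult)] <- 1

# Damage in dollars
dmg_value <- dmg * mult
dmg_value
```

Here 25 with code "K" becomes 25,000 dollars, 2.5 with code "m" becomes 2.5 million dollars, and the record with the unknown code "Z" keeps its raw value of 10.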

Results

Estimation of impacts
Next, the health and economic impacts are estimated. The health impact of an
event is the sum of its fatalities and injuries, while the economic impact is
the sum of the dollar values of its property damage and crop damage.

# adding two new columns, HEALTHIMP for health impact and TOTALDMG for total damages
subset_data<- subset_data %>%
        mutate(HEALTHIMP= FATALITIES + INJURIES) %>%
        mutate(TOTALDMG = PROPDMGEVAL + CROPDMGEVAL) 
# check the values of new columns
summary(subset_data$HEALTHIMP)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0000    0.0000    0.0000    0.6114    0.0000 1742.0000
summary(subset_data$TOTALDMG)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 2.500e+03 1.000e+04 1.875e+06 5.000e+04 1.150e+11

Plotting
The impact of each event type is then calculated by summing the impact values
of all individual events of that type. Finally, a bar chart is plotted to show
the 10 event types with the greatest impact.

#sum the health impact of individual events within each event type
summary_table_hi <-subset_data %>%
        group_by(EVTYPE) %>%
        summarise(GRAND_TOTAL = sum(HEALTHIMP, na.rm = TRUE)) %>%
        slice_max(GRAND_TOTAL, n = 10) %>%  # Get top 10
        ungroup()
head(summary_table_hi)
## # A tibble: 6 × 2
##   EVTYPE         GRAND_TOTAL
##   <chr>                <dbl>
## 1 TORNADO              96979
## 2 EXCESSIVE HEAT        8428
## 3 TSTM WIND             7461
## 4 FLOOD                 7259
## 5 LIGHTNING             6046
## 6 HEAT                  3037
#plot the top 10 event types by health impact
summary_table_hi %>% 
        ggplot(aes(x = reorder(EVTYPE, GRAND_TOTAL), y = GRAND_TOTAL)) +
        geom_col() +
        coord_flip() +
        labs(title = "Top 10 Weather Events Most Harmful to Population Health",
             x = "Event Type", 
             y = "Health Impact (Fatalities + Injuries)")

#sum the economic impact of individual events within each event type
summary_table_dmg<- subset_data %>%
        group_by(EVTYPE) %>%
        summarise(GRAND_TOTAL = sum(TOTALDMG, na.rm = TRUE)) %>%
        slice_max(GRAND_TOTAL, n = 10) %>% # Get top 10
        ungroup()
head(summary_table_dmg)
## # A tibble: 6 × 2
##   EVTYPE              GRAND_TOTAL
##   <chr>                     <dbl>
## 1 FLOOD             150319678257 
## 2 HURRICANE/TYPHOON  71913712800 
## 3 TORNADO            57362333946.
## 4 STORM SURGE        43323541000 
## 5 HAIL               18761221986.
## 6 FLASH FLOOD        18243991078.
#plot the top 10 event types by total economic damage
summary_table_dmg %>%
        ggplot(aes(x = reorder(EVTYPE, GRAND_TOTAL), y = GRAND_TOTAL)) +
        geom_col() +
        coord_flip() +
        labs(title = "Top 10 Most Costly Weather Events", 
             x = "Event Type", 
             y = "Total Damage ($)")
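
The group-and-rank step used for both tables can be sketched in base R on a toy data set. The event values below are hypothetical, and split()/sapply() stand in for the dplyr group_by()/summarise() calls used above.

```r
# Toy per-event impacts (hypothetical values, for illustration only)
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD", "HEAT"),
  IMPACT = c(15, 3, 2, 1, 4, 6),
  stringsAsFactors = FALSE
)

# Sum the impact within each event type (named numeric vector)
totals <- sapply(split(toy$IMPACT, toy$EVTYPE), sum)

# Sort in decreasing order and keep the top 3 event types
top3 <- head(sort(totals, decreasing = TRUE), 3)
names(top3)   # "TORNADO" "FLOOD" "HEAT"
```

One difference worth noting: slice_max() keeps tied rows by default, so it can return more than n rows, whereas head() always truncates to exactly n.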

Conclusion

Based on the graphs above, tornadoes are the most harmful weather events for
population health across the United States, accounting for 96,979 combined
fatalities and injuries, while floods have the greatest economic consequences,
with roughly USD 150 billion in combined property and crop damage.