Synopsis

Storms and other severe weather events can cause both public health and economic problems, resulting in loss of life, injuries, significant property damage, and disruption to commerce. In this report we explore which types of weather events cause the greatest harm to public health and to the economy across the United States. We obtained the data from the Coursera course site; it originates from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. Our analysis found that tornadoes pose the biggest threat to public health, whereas flash floods cause the greatest damage to the economy.

Data Processing

We first set up RStudio and load the packages required for the data analysis.

1. Preparation

    1.1 Set working directory
# setwd("C:/Users/Angashley/Desktop/CourseraR learning/Reproducible Research/Week4/StormData")
# personal details edited out
    1.2 Load packages to be used 
library(dplyr)
library(reshape2)
library(ggplot2)
library(knitr)
    1.3 Set global options
opts_chunk$set(echo = TRUE, fig.width=12, fig.height=8, fig.path='Figs/', cache=TRUE)

2. Download and read in the data

We download the StormData.csv.bz2 data file, a comma-separated-value file compressed with the bzip2 algorithm. The read.csv() command reads the compressed file directly.

bzip2url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

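if (!file.exists("StormData.csv.bz2")) {
        download.file(bzip2url,
                      destfile = "StormData.csv.bz2")
}

# Keep text columns as factors (not the default in R >= 4.0);
# the levels() calls below rely on it
StormData <- read.csv("StormData.csv.bz2", stringsAsFactors = TRUE)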

3. Create a useful data set for analysis

In the early years of the Storm Events Database, only a few event types (Tornado, Thunderstorm Wind and Hail) were recorded. Since January 1996, all 48 event types have been recorded. To compare weather events fairly, we restrict the analysis to those recent years, when every event type is recorded. The rows and columns we use are:

  • [2]BGN_DATE (> 1995/12/31)
  • [8]EVTYPE
  • [23]FATALITIES
  • [24]INJURIES
  • [25]PROPDMG and [26]PROPDMGEXP
  • [27]CROPDMG and [28]CROPDMGEXP

The following code does three things: 1) it selects the columns of interest; 2) it keeps records dated after Dec. 31, 1995; and 3) it keeps only records with at least one fatality or injury, or with nonzero property or crop damage, dropping records where all four values are zero.

stormData <- as_tibble(StormData) %>%   # tbl_df() is deprecated in current dplyr
        select(BGN_DATE, EVTYPE, FATALITIES,
               INJURIES, PROPDMG, PROPDMGEXP,
               CROPDMG, CROPDMGEXP) %>%
        rename(DATE = BGN_DATE)

stormData$DATE <- format(as.POSIXct(stormData$DATE, 
                format="%m/%d/%Y"),
                format="%Y/%m/%d")

stormdata <- stormData %>% filter(as.Date(DATE) >
                                  as.Date("1995/12/31"))

stormdat <- stormdata %>% filter(FATALITIES > 0 |
                                 INJURIES > 0 |
                                 PROPDMG > 0 |
                                 CROPDMG > 0)
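
To make the date handling and filtering concrete, here is a minimal base-R sketch on a made-up three-row data frame. The data frame `toy` and its values are hypothetical, and the sketch assumes that BGN_DATE strings look like "4/18/1950 0:00:00"; with the format "%m/%d/%Y", the trailing time is ignored.

```r
# Hypothetical toy rows; the values are illustrative only
toy <- data.frame(
  BGN_DATE   = c("4/18/1950 0:00:00", "1/1/1996 0:00:00", "7/4/2005 0:00:00"),
  FATALITIES = c(0, 0, 2),
  INJURIES   = c(15, 0, 0),
  stringsAsFactors = FALSE
)

# Parse the month/day/year part; trailing characters are ignored
toy$DATE <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y")

# Keep rows dated after 1995-12-31
recent <- toy[toy$DATE > as.Date("1995-12-31"), ]

# Keep rows with at least one fatality or injury
harmful <- recent[recent$FATALITIES > 0 | recent$INJURIES > 0, ]
print(harmful)
```

Only the 2005 row survives both filters: the 1950 row is too early, and the 1996 row records no harm.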

4. Tidy up EVTYPE

We examine the event types in the variable EVTYPE and can see that they are not clean: there are extra spaces, inconsistent casing, and inconsistent event naming (with or without spaces, singular vs. plural).

events <- levels(stormdat$EVTYPE)
head(events,20)
##  [1] "   HIGH SURF ADVISORY"  " COASTAL FLOOD"        
##  [3] " FLASH FLOOD"           " LIGHTNING"            
##  [5] " TSTM WIND"             " TSTM WIND (G45)"      
##  [7] " WATERSPOUT"            " WIND"                 
##  [9] "?"                      "ABNORMAL WARMTH"       
## [11] "ABNORMALLY DRY"         "ABNORMALLY WET"        
## [13] "ACCUMULATED SNOWFALL"   "AGRICULTURAL FREEZE"   
## [15] "APACHE COUNTY"          "ASTRONOMICAL HIGH TIDE"
## [17] "ASTRONOMICAL LOW TIDE"  "AVALANCE"              
## [19] "AVALANCHE"              "BEACH EROSIN"

The following code converts EVTYPE to uppercase; removes spaces, gust indicators (e.g. G45, G40), digits and punctuation (including brackets); and strips a trailing ‘S’ to merge plurals with singulars.

levels(stormdat$EVTYPE) <- toupper(levels(stormdat$EVTYPE))

levels(stormdat$EVTYPE) <- gsub(pattern = " |G[0-9]+|\\d+|[[:punct:]]", 
                                replacement = "",
                                levels(stormdat$EVTYPE))

levels(stormdat$EVTYPE) <- gsub(pattern = "(.*)S$", 
                                replacement = "\\1",
                                levels(stormdat$EVTYPE))
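
As a quick check of what these substitutions do, the same three steps can be applied to a few sample labels. This is a sketch only: the vector `raw` is a hypothetical input chosen to exercise each rule, not a sample of the actual levels.

```r
# Illustrative event labels (hypothetical input)
raw <- c(" TSTM WIND (G45)", "Rip Currents", "FLASH FLOOD", "Hurricane/Typhoon")

clean <- toupper(raw)
# Drop spaces, gust indicators such as G45, digits, and punctuation
clean <- gsub(" |G[0-9]+|\\d+|[[:punct:]]", "", clean)
# Drop a single trailing S
clean <- gsub("(.*)S$", "\\1", clean)

print(clean)
# [1] "TSTMWIND"         "RIPCURRENT"       "FLASHFLOOD"       "HURRICANETYPHOON"
```

These rules normalise spelling only. Synonyms such as TSTMWIND and THUNDERSTORMWIND remain separate labels, as the tables in the Results section show.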

5. Tidy up PROPDMGEXP and CROPDMGEXP

The following code converts the exponent codes in PROPDMGEXP and CROPDMGEXP to the corresponding powers of ten, e.g. B to 9, K to 3 and M to 6. The empty string becomes 0.

levels(stormdat$PROPDMGEXP) 
## [1] ""  "B" "K" "M"
levels(stormdat$CROPDMGEXP)
## [1] ""  "B" "K" "M"
levels(stormdat$PROPDMGEXP) <- c(0,9,3,6)
levels(stormdat$CROPDMGEXP) <- c(0,9,3,6)
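
One point worth checking when the recoded columns are used in arithmetic: they are still factors, and as.numeric() applied to a factor returns the internal level codes rather than the labels. A minimal sketch on a toy factor shows the difference, along with the as.character() round-trip that recovers the labelled exponents. Here `expo` is a hypothetical stand-in for PROPDMGEXP, and the damage figures are made up.

```r
# Toy factor with the same four levels as PROPDMGEXP (values are hypothetical)
expo <- factor(c("K", "M", "", "B"))
levels(expo) <- c(0, 9, 3, 6)   # levels sort as "", "B", "K", "M"

codes  <- as.numeric(expo)                # internal level codes
powers <- as.numeric(as.character(expo))  # labelled exponents

# Hypothetical damage figures scaled by the labelled exponents
amount  <- c(25, 1.5, 10, 2)
dollars <- amount * 10^powers

print(codes)    # 3 4 1 2
print(powers)   # 3 6 0 9
print(dollars)
```

If the labelled exponents are what is wanted, one option is to wrap the column as as.numeric(as.character(PROPDMGEXP)) before raising 10 to that power.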

Results

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

1.1 List the top 10 events that cause the greatest loss of life. We can see that the deadliest event is Excessive Heat. Tornado and Flash Flood rank second and third in number of fatalities.

fatalities <- stormdat %>% group_by(EVTYPE) %>%
        summarise(Fatalities = sum(FATALITIES))
head(arrange(fatalities, desc(Fatalities)), n = 10)
## # A tibble: 10 x 2
##           EVTYPE Fatalities
##           <fctr>      <int>
## 1  EXCESSIVEHEAT       1797
## 2        TORNADO       1511
## 3     FLASHFLOOD        887
## 4      LIGHTNING        651
## 5     RIPCURRENT        542
## 6          FLOOD        414
## 7       TSTMWIND        242
## 8           HEAT        237
## 9       HIGHWIND        235
## 10     AVALANCHE        223
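
For readers without dplyr, the same group, sum and rank pattern can be sketched in base R. The data frame `toy` and its counts are hypothetical and serve only to show the mechanics.

```r
# Hypothetical toy records; the counts are illustrative only
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(3, 1, 2, 4, 0)
)

# Sum fatalities per event type
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Sort from most to least deadly and show the top two
ranked <- totals[order(-totals$FATALITIES), ]
print(head(ranked, n = 2))
```

In this toy data TORNADO ranks first with 5 fatalities, followed by HEAT with 4.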

1.2 List the top 10 events that cause the most injuries. We can see that the leading cause of injuries is Tornado. Flood and Excessive Heat rank second and third.

injuries <- stormdat %>% group_by(EVTYPE) %>%
        summarise(Injuries = sum(INJURIES))
head(arrange(injuries, desc(Injuries)), n = 10)
## # A tibble: 10 x 2
##              EVTYPE Injuries
##              <fctr>    <int>
## 1           TORNADO    20667
## 2             FLOOD     6758
## 3     EXCESSIVEHEAT     6391
## 4         LIGHTNING     4141
## 5          TSTMWIND     3633
## 6        FLASHFLOOD     1674
## 7  THUNDERSTORMWIND     1400
## 8       WINTERSTORM     1292
## 9  HURRICANETYPHOON     1275
## 10             HEAT     1222

1.3 Plot a bar chart to show visually the top 10 events that cause greatest damage to population health overall.

harmHealth <- stormdat %>% group_by(EVTYPE) %>% 
                summarise(Fatalities = sum(FATALITIES), 
                          Injuries = sum(INJURIES), 
                          damageTotal = sum(FATALITIES,INJURIES)) %>% 
                arrange(desc(damageTotal))

harmHealth0 <- harmHealth[1:10,] %>% melt(id.vars = "EVTYPE", 
                                  measure.vars = c("Fatalities", "Injuries"), 
                                  variable.name = "damageType",
                                  value.name = "damageCount",
                                  factorsAsStrings = TRUE)

ggplot(data = harmHealth0, aes(x = reorder(EVTYPE, damageCount), y = damageCount, fill = damageType)) + geom_bar(stat="identity") + 
        coord_flip() + labs(fill = "", x = "Storm Events", y = "Population",
                        title = "Top 10 Storm Events that Cause Damage to Health") +
                  theme(text = element_text(size = 15), 
                        plot.title = element_text(colour = "blue")) 

From the plot above, we can see that Tornado poses the biggest threat to overall population health; Excessive Heat and Flood rank second and third.

2. Across the United States, which types of events have the greatest economic consequences?

2.1 List the top 10 events that cause the greatest damage to property. We can see that the most destructive event is Flash Flood. Thunderstorm Wind and Tornado rank second and third in property damage.

harmProperty <- stormdat %>% group_by(EVTYPE) %>% 
        summarise(harmTotal = sum(PROPDMG*10^as.numeric(PROPDMGEXP)))
head(arrange(harmProperty,desc(harmTotal)), n = 10)
## # A tibble: 10 x 2
##              EVTYPE  harmTotal
##              <fctr>      <dbl>
## 1        FLASHFLOOD 1364500310
## 2          TSTMWIND 1359600640
## 3           TORNADO 1351198440
## 4             FLOOD 1010592400
## 5  THUNDERSTORMWIND  884963640
## 6              HAIL  685404200
## 7         LIGHTNING  490854780
## 8          HIGHWIND  348342490
## 9       WINTERSTORM  139575650
## 10         WILDFIRE  115760104

2.2 List the top 10 events that cause the greatest damage to crops. We can see that the most destructive event is Hail. Flood and Flash Flood rank second and third in crop damage.

harmCrop <- stormdat %>% group_by(EVTYPE) %>% 
        summarise(harmTotal = sum(CROPDMG*10^as.numeric(CROPDMGEXP)))
head(arrange(harmCrop,desc(harmTotal)), n = 10)
## # A tibble: 10 x 2
##              EVTYPE harmTotal
##              <fctr>     <dbl>
## 1              HAIL 516156150
## 2             FLOOD 195276200
## 3        FLASHFLOOD 171641800
## 4           DROUGHT 144412300
## 5          TSTMWIND 113117850
## 6           TORNADO  91869910
## 7  THUNDERSTORMWIND  69651000
## 8         HURRICANE  29493100
## 9          HIGHWIND  22820400
## 10        HEAVYRAIN  17438900

2.3 Plot a bar chart to show visually the top 10 events that cause greatest economic damage overall.

harmEconomy <- stormdat %>% group_by(EVTYPE) %>%
                summarise(Properties = sum(PROPDMG*10^as.numeric(PROPDMGEXP)),
                         Crops = sum(CROPDMG*10^as.numeric(CROPDMGEXP)),
                         damageTotal = sum(PROPDMG*10^as.numeric(PROPDMGEXP),
                                          CROPDMG*10^as.numeric(CROPDMGEXP))) %>%
                arrange(desc(damageTotal))

harmEconomy0 <- harmEconomy[1:10,] %>% melt(id.vars = "EVTYPE", 
                                  measure.vars = c("Properties", "Crops"), 
                                  variable.name = "damageType",
                                  value.name = "damageCost",
                                  factorsAsStrings = TRUE)

ggplot(data = harmEconomy0, aes(x = reorder(EVTYPE, damageCost), y = damageCost, fill = damageType)) + geom_bar(stat="identity") + 
        coord_flip() + labs(fill = "", x = "Storm Events", y = "Cost (US dollars)",
                        title = "Top 10 Storm Events that Cause Damage to Economy") +
                  theme(text = element_text(size = 15), 
                        plot.title = element_text(colour = "blue")) 

From the plot above, we can see that Flash Flood causes the greatest overall economic cost; Thunderstorm Wind and Tornado rank second and third.