Reproducible Research: Course Project 2

Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events result in fatalities, injuries, and crop and property damage, and preventing such outcomes to the extent possible is a key concern. In this report we analyse which of these events are the most harmful.

The analysis shows that tornadoes are the most harmful weather events with respect to population health, whether we look at injuries, fatalities, or both combined. Floods have the greatest economic consequences overall because they cause the most property damage, while drought is the most harmful event for crops alone.

Data Processing

1. Loading the data

Loading the packages that we are going to use for this analysis:

knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(data.table)
library(plyr)
library(dplyr)
library(lattice)
library(knitr)

We download the source data file from the URL given in the code below. The data come in the form of a comma-separated-value file compressed via the bzip2 algorithm and cover storm and weather events in the United States between 1950 and 2011. Documentation of the database is available here. We download and read the data, and record the session info:

if (!file.exists("StormData.csv.bz2")) {
  fileUrl<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  download.file(fileUrl, destfile="StormData.csv.bz2")
}
storm <- read.csv("StormData.csv.bz2")
sInfo <- sessionInfo()

2. Summary of the raw data

We get the structure of the dataset.

str(storm)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Let’s convert the data to a data.table.

stormdt <- as.data.table(storm)

We now get a list of column names to create a subset of data that we are going to use for the analysis.

names(stormdt)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

3. Creating a subset of the data

The questions we are trying to answer in this analysis are which types of weather events are most harmful with respect to population health and which ones have the greatest economic consequences. We therefore only need the event type and the data related to health and economic impacts, so the following columns are selected:

EVTYPE

Event types that might have different impacts on population health or the economy.

FATALITIES and INJURIES

Fatalities and injuries estimated for the event. These values are used to estimate the weather events' impact on population health.

PROPDMG and CROPDMG, PROPDMGEXP and CROPDMGEXP

Property and crop damage estimated for the event, together with their units (magnitude codes such as K, M and B). These values are used to estimate the weather events' impact on the economy.

We create a subset of the data and get a summary of those variables:

stormSubset <- select(stormdt, c(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
summary(stormSubset)
##     EVTYPE            FATALITIES          INJURIES            PROPDMG       
##  Length:902297      Min.   :  0.0000   Min.   :   0.0000   Min.   :   0.00  
##  Class :character   1st Qu.:  0.0000   1st Qu.:   0.0000   1st Qu.:   0.00  
##  Mode  :character   Median :  0.0000   Median :   0.0000   Median :   0.00  
##                     Mean   :  0.0168   Mean   :   0.1557   Mean   :  12.06  
##                     3rd Qu.:  0.0000   3rd Qu.:   0.0000   3rd Qu.:   0.50  
##                     Max.   :583.0000   Max.   :1700.0000   Max.   :5000.00  
##   PROPDMGEXP           CROPDMG         CROPDMGEXP       
##  Length:902297      Min.   :  0.000   Length:902297     
##  Class :character   1st Qu.:  0.000   Class :character  
##  Mode  :character   Median :  0.000   Mode  :character  
##                     Mean   :  1.527                     
##                     3rd Qu.:  0.000                     
##                     Max.   :990.000

We can see that the median is zero for all numeric variables and that even the 3rd quartile is 0 or close to 0, so we take a further subset of this dataset that keeps only events that caused either economic damage or harm to population health.

FinalStorm <- subset(stormSubset, INJURIES > 0 | FATALITIES > 0 | PROPDMG > 0 | CROPDMG > 0)
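
To make the filter condition concrete, here is a minimal sketch in base R on a made-up four-row data frame. The names `toy`, `impact_cols`, `has_impact` and `toy_nonzero` are hypothetical and are not part of the analysis; only the column names are borrowed from the subset above.

```r
# Toy data with the same column names as the subset above (values are made up)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD", "HEAT"),
  FATALITIES = c(1, 0, 0, 0),
  INJURIES   = c(0, 0, 0, 0),
  PROPDMG    = c(0, 0, 2.5, 0),
  CROPDMG    = c(0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# A row is kept if at least one of the four impact columns is positive,
# which is the same condition as the chain of `|` comparisons above
impact_cols <- c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")
has_impact  <- rowSums(toy[impact_cols] > 0) > 0
toy_nonzero <- toy[has_impact, ]

print(toy_nonzero$EVTYPE)
# Share of rows that the filter would drop
print(mean(!has_impact))
```

Running the same `mean(!has_impact)` check on the real data would show how many events carry no recorded impact at all.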

4. Calculating total damage

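The PROPDMGEXP and CROPDMGEXP columns hold magnitude codes (-, +, H, K, etc.) rather than numbers, so we need to convert them to numeric multipliers before we can compute damage amounts.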

unique(FinalStorm$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "4" "h" "2" "7" "3" "H" "-"
unique(FinalStorm$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k"
# Map each property-damage code to a numeric multiplier (a digit n is read as 10^n)
FinalStorm$PROPDMGEXP <- mapvalues(FinalStorm$PROPDMGEXP, from = c("K", "M","", "B", "m", "+", "0", "5", "6", "4", "2", "3", "h", "7", "H", "-"), to = c(10^3, 10^6, 1, 10^9, 10^6, 0,1,10^5, 10^6, 10^4, 10^2, 10^3, 10^2, 10^7, 10^2, 0))
# mapvalues() returns character values here, so convert them to numbers
FinalStorm$PROPDMGEXP <- as.numeric(as.character(FinalStorm$PROPDMGEXP))
# Same conversion for the crop-damage codes
FinalStorm$CROPDMGEXP <- mapvalues(FinalStorm$CROPDMGEXP, from = c("","M", "K", "m", "B", "?", "0", "k"), to = c(1,10^6, 10^3, 10^6, 10^9, 0, 1, 10^3))
FinalStorm$CROPDMGEXP <- as.numeric(as.character(FinalStorm$CROPDMGEXP))

FinalStorm$PROPDMGTOT <- (FinalStorm$PROPDMG * FinalStorm$PROPDMGEXP)/10^9
FinalStorm$CROPDMGTOT <- (FinalStorm$CROPDMG * FinalStorm$CROPDMGEXP)/10^9
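
The same conversion can also be written with a named lookup vector, which keeps the code-to-multiplier table in one place. The following is only a sketch in base R: `exp_lookup` and `to_multiplier` are hypothetical names, the multipliers are copied from the two mapvalues() calls above, and the damage values are made up for illustration.

```r
# Multipliers copied from the two mapvalues() calls above
exp_lookup <- c(
  "K" = 1e3, "k" = 1e3, "M" = 1e6, "m" = 1e6, "B" = 1e9,
  "H" = 1e2, "h" = 1e2,
  "0" = 1, "2" = 1e2, "3" = 1e3, "4" = 1e4, "5" = 1e5, "6" = 1e6, "7" = 1e7,
  "+" = 0, "-" = 0, "?" = 0
)

to_multiplier <- function(code) {
  # Codes that are not in the table (including the empty string) give NA
  mult <- unname(exp_lookup[code])
  # An empty code means the amount is used as recorded, as in the mapping above
  mult[code %in% ""] <- 1
  mult
}

# Made-up damage amounts with their magnitude codes
dmg  <- c(25, 2.5, 1.2, 10)
code <- c("K", "M", "B", "")

# Damage in billions of US dollars, as in PROPDMGTOT and CROPDMGTOT
total_billion <- dmg * to_multiplier(code) / 1e9
print(total_billion)
```

For example, an amount of 25 with code "K" becomes 25 * 10^3 / 10^9 = 0.000025 billion dollars. An unexpected code would surface as NA here instead of being left in the data as text.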

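We calculate the total number of fatalities and injuries per event type, as well as the total damage to property and crops (in billions of US dollars, because the amounts were divided by 10^9 above). We then melt each data.table into long format for easier plotting.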

TotalHealth <- FinalStorm[, .(fatalities = sum(FATALITIES), injuries = sum(INJURIES), total = sum(FATALITIES) + sum(INJURIES)), by = .(EVTYPE)][order(-total)]

TotHealth <- as.data.frame(melt(TotalHealth, id.vars="EVTYPE", variable.name = "damage"))

TotalEconomy <- FinalStorm[, .(Total_Property_Damage = sum(PROPDMGTOT), Total_Crop_Damage = sum(CROPDMGTOT), total = sum(PROPDMGTOT) + sum(CROPDMGTOT)), by = .(EVTYPE)][order(-total)]

TotEconomy <- as.data.frame(melt(TotalEconomy, id.vars="EVTYPE", variable.name = "damage"))
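
To show what the aggregation and the wide-to-long reshaping produce, here is a small base R sketch on made-up numbers. It uses aggregate() and a hand-built long table instead of the data.table syntax and melt() used above, and the names `toy`, `wide`, `measures` and `long` are hypothetical.

```r
# Made-up events: two tornado records, one flood record, one heat record
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(2, 1, 0, 4),
  INJURIES   = c(10, 5, 3, 0),
  stringsAsFactors = FALSE
)

# Wide format: one row per event type, one column per measure
wide <- aggregate(cbind(fatalities = FATALITIES, injuries = INJURIES) ~ EVTYPE,
                  data = toy, FUN = sum)
wide$total <- wide$fatalities + wide$injuries
wide <- wide[order(-wide$total), ]

# Long format: one row per (event type, measure) pair
measures <- c("fatalities", "injuries", "total")
long <- data.frame(
  EVTYPE = rep(wide$EVTYPE, times = length(measures)),
  damage = rep(measures, each = nrow(wide)),
  value  = unlist(wide[measures], use.names = FALSE),
  stringsAsFactors = FALSE
)

print(wide)
print(long)
```

The long table has a single `value` column, so one plotting call can draw all three measures and split them into panels by `damage`.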

5. Finding most harmful events

Now we are going to get only the top 5 most harmful events:

TH<-TotHealth %>% group_by(damage) %>% top_n(5,value) %>% arrange(damage, -value)
TE<-TotEconomy %>% group_by(damage) %>% top_n(5,value) %>% arrange(damage, -value)
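
The grouped top-n selection can be illustrated without dplyr. This is a sketch in base R on made-up values, with hypothetical names `long`, `n` and `top_per_group`; unlike top_n(), head() does not keep extra rows when values are tied.

```r
# Made-up long-format data: three event types, two measures
long <- data.frame(
  EVTYPE = c("A", "B", "C", "A", "B", "C"),
  damage = rep(c("fatalities", "injuries"), each = 3),
  value  = c(5, 9, 1, 20, 3, 7),
  stringsAsFactors = FALSE
)

n <- 2  # number of rows to keep per measure (the analysis above keeps 5)

# Split by measure, sort each piece by decreasing value, keep the first n rows
top_per_group <- do.call(rbind, lapply(split(long, long$damage), function(g) {
  g <- g[order(-g$value), ]
  head(g, n)
}))
rownames(top_per_group) <- NULL

print(top_per_group)
```

Because the selection is done separately for each measure, the event types in one panel of the plots need not match those in another.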

Results

We plot the number of fatalities, the number of injuries, and their sum to find the top 5 weather events that are most harmful to the US population.

ggplot(data = TH, aes(x=EVTYPE,value)) +
  geom_bar(stat="identity", width = 0.5) +
  labs(title = "Most harmful events with respect to population health",
       y = "Number of fatalities and injuries (log2 scale)", x = "Event") +
  theme(legend.position="none") +
  facet_wrap(~ damage) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  scale_y_continuous(trans='log2')

Tornadoes are the most harmful event overall, and they rank first for both injuries and fatalities. Excessive heat is the second most common cause of death, while thunderstorm wind is the second most common cause of injuries.

We plot the damage to crops, the damage to property, and the total damage to find the top 5 weather events that are most harmful to the US economy.

ggplot(data = TE, aes(x=EVTYPE,value)) +
  geom_bar(stat="identity", width = 0.5) +
  labs(title = "Weather events with greatest economic consequences",
       y = "Damage to crop and property (billions of USD, log2 scale)", x = "Event") +
  theme(legend.position="none") +
  facet_wrap(~ damage) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  scale_y_continuous(trans='log2')

Floods are the most harmful event overall because they cause the most property damage. The most harmful weather event for crops, on the other hand, is drought.