Synopsis

This report seeks to answer 2 questions:

  1. Across the United States, which types of weather events are most harmful with respect to population health?
  2. Across the United States, which types of weather events have the greatest economic consequences?

Data used in this analysis was sourced from the U.S. National Weather Services from the period 1950 to Nov 2011.
From the analysis of this data, the following answers were concluded:
1. The most harmful weather type in terms of population health are Tornados, contributing 5,633 fatalities (37.2% of total) and 91,346 injured (65% of total).
Other events of note on health were Excessive Heat, Flash Floods, Heat, Lightning, TSTM Wind, and Floods.

2, The most damaging weather type in terms of economic consequences are Floods, resulting in Damages amounting to US$144.7B (33.9% of total) of property and US$5,7B (65% of total) of crops.
Other events of note on economic consequences were Hurricanes/Typhoons, Tornados, Storm Surges, Flash Floods, Droughts, River Floods, Ice Storms and Hail.

R Libraries

various libraries required for analysis using R were loaded.

library(data.table)
library(R.utils)
library(dplyr)
library(ggplot2)
library(grid)
library(gridExtra)

Data Source & Download

The data used in this analysis was based on data from the link provided : Storm Data (47Mb)

The data is a copy of the StormData.CSV file recorded between 1950 and Nov 2011 originally sourced from the US National Weather Service.

We first create a download sub directory and download the file from the URL into it on the first run only.

Documentation on the data is available from these links:
* National Weather Service Storm Data Documentation
* National Climatic Data Center Storm Events FAQ

# Download source files from URL into and create download sub-directory on 1st run only.
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destfile <- "./download/StormData.csv.bz2"

if(!file.exists("./download/StormData.csv.bz2")) {
  dir.create("./download")  
  download.file(url,destfile=destfile)
}

Data Processing

The data was then read from the file into a data table.

data <- fread("./download/StormData.csv.bz2")

Examining the data we find the following:

str(data)
## Classes 'data.table' and 'data.frame':   902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
##  - attr(*, ".internal.selfref")=<externalptr>

The data set to be examined to answer the above questions only required selecting the following variables:
* EVTYPE (CHR)
* FATALITIES (NUM)
* INJURIES (NUM)
* PROPDMG (NUM)
* PROPDMGEXP (CHR)
* CROPDMG (NUM)
* CROPDMGEXP (CHR)

# select required variables
data <- data %>% select(EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)

The data was found to have no clear NA missing values.

mean(is.na(data))
## [1] 0

Further examination of the selected data was performed.

Examining Variable EVTYPE

This variable had an extremely large variance in values as compared to the 48 types documented.

# Check unique variables in EVTYPE
length(unique(data$EVTYPE))
## [1] 985

Only minor cleaning was performed on this variable which were in order:

  1. All values were capitalized.
  2. Leading, Trailing and double spaces were removed.
# Cleaning EVTYPE
data$EVTYPE <- toupper(data$EVTYPE)
data$EVTYPE <- trimws(data$EVTYPE)
searchString <- '  '
replacementString <- ' '
data$EVTYPE = sub(searchString,replacementString,data$EVTYPE)

# Check unique variables in EVTYPE
length(unique(data$EVTYPE))
## [1] 883

This somewhat reduced the unique values.

Notes:
1. The Weather event types in the data was highly contextual and any matching / correction to the documented types required weather services expertise.
2. It was targeted that the top 5 event types should represent at least 60% of the total from the data set or else further cleaning would be required.

Examining Damage Order of Magnitude Variables

The order of magnitude variables for property damage, PROPDMGEXP and crop damage, CROPDMGEXP had more unique values than stated in the documentation. i.e. (H= hundreds, K= thousands, M= millions, B= billions).

# Checking prop dmg. exp
rd <- 4
PDE <- data %>%
  count(PROPDMGEXP, name='PDEcount') %>%
  mutate (PDE_pct = round(100*PDEcount/sum(PDEcount),rd))

# Checking crop dmg. exp
CDE <- data %>%
  count(CROPDMGEXP, name='CDEcount') %>%
  mutate (CDE_pct = round(100*CDEcount/sum(CDEcount),rd))

print(PDE, row.names= NULL)
##  PROPDMGEXP PDEcount PDE_pct
##               465934 51.6387
##           -        1  0.0001
##           ?        8  0.0009
##           +        5  0.0006
##           0      216  0.0239
##           1       25  0.0028
##           2       13  0.0014
##           3        4  0.0004
##           4        4  0.0004
##           5       28  0.0031
##           6        4  0.0004
##           7        5  0.0006
##           8        1  0.0001
##           B       40  0.0044
##           h        1  0.0001
##           H        6  0.0007
##           K   424665 47.0649
##           m        7  0.0008
##           M    11330  1.2557
print(CDE, row.names= NULL)
##  CROPDMGEXP CDEcount CDE_pct
##               618413 68.5376
##           ?        7  0.0008
##           0       19  0.0021
##           2        1  0.0001
##           B        9  0.0010
##           k       21  0.0023
##           K   281832 31.2349
##           m        1  0.0001
##           M     1994  0.2210

Although the erroneous EXP values for both EXP components were hardly significant it still had to be cleaned up to calculate the amount of damage.

The erroneous EXP values were converted according to the following assumptions in order:

  1. All values were capitalized.
  2. All special characters were converted to blanks.
  3. All blanks were then converted to 0.
  4. Digits were considered as 1e+(digit) multipliers.

Property & crop damage values were then calculated and combined into variables “PropDmg” & “CropDmg” respectively in the clean data set.

# Replace erroneous values of Property & Crop Damages EXP.

## Cleaning prop & crop dmg. exp
data$PROPDMGEXP <- gsub("[^[:alnum:]]","",data$PROPDMGEXP)
data$CROPDMGEXP <- gsub("[^[:alnum:]]","",data$CROPDMGEXP)
data <-data %>%
  mutate(PROPDMGEXP = toupper(PROPDMGEXP),CROPDMGEXP = toupper(CROPDMGEXP)) %>%
  mutate(PROPDMGEXP = replace(PROPDMGEXP,PROPDMGEXP=="",0),
         CROPDMGEXP = replace(CROPDMGEXP,CROPDMGEXP=="",0)) %>%
  mutate(PROPDMGEXP = replace(PROPDMGEXP,PROPDMGEXP=='H', 100),
         CROPDMGEXP = replace(CROPDMGEXP,CROPDMGEXP=='H', 100)) %>%
  mutate(PROPDMGEXP = replace(PROPDMGEXP,PROPDMGEXP=='K', 1000),
         CROPDMGEXP = replace(CROPDMGEXP,CROPDMGEXP=='K', 1000)) %>%
  mutate(PROPDMGEXP = replace(PROPDMGEXP,PROPDMGEXP=='M', 1000000),
         CROPDMGEXP = replace(CROPDMGEXP,CROPDMGEXP=='M', 1000000)) %>%
  mutate(PROPDMGEXP = replace(PROPDMGEXP,PROPDMGEXP=='B', 1000000000),
         CROPDMGEXP = replace(CROPDMGEXP,CROPDMGEXP=='B', 1000000000)) %>%
  mutate(PROPDMGEXP = as.double(PROPDMGEXP),CROPDMGEXP = as.double(CROPDMGEXP))

## Calculating Damage Values and combining into new variables.
cleandata <-  data %>%
  mutate(PropDmg = PROPDMG*PROPDMGEXP, CropDmg = CROPDMG*CROPDMGEXP) %>%
  select("EVTYPE", "FATALITIES", "INJURIES", "PropDmg", "CropDmg" )
head(cleandata)
##     EVTYPE FATALITIES INJURIES PropDmg CropDmg
## 1: TORNADO          0       15   25000       0
## 2: TORNADO          0        0    2500       0
## 3: TORNADO          0        2   25000       0
## 4: TORNADO          0        2    2500       0
## 5: TORNADO          0        2    2500       0
## 6: TORNADO          0        6    2500       0

Showing top of the clean data set.

Results

Analysis Logic

The clean data set was filtered and grouped to show the required categorical summaries. Only the top 5 events in each category were plotted accordingly.

Initial options for the plots:

# Options for Figures
mytheme <- ttheme_default(base_size=10,
                          core= list(fg_params= list(hjust=0,x=0.1)),
                          colhead= list(fg_params=list(hjust=0, x=0.1)))
top <-5

Top 5 Weather Events Causing Fatalities & Injuries.

## Fatalities
fatal <- cleandata %>%
  group_by(EVTYPE) %>%
  summarise(totfatal = sum(FATALITIES)) %>%
  arrange(desc(totfatal)) %>%
  filter(totfatal !=0) %>%
  mutate(pct=round(100*totfatal/sum(totfatal),1)) %>%
  slice_max(totfatal, n=top)

p1f <- ggplot(fatal,aes(x=totfatal, y=reorder(EVTYPE, totfatal)))
p1f <- p1f + geom_col(fill ='red')
p1f <- p1f + labs(title= paste('Fatal (Top',top, '=', sum(fatal$pct),'%)'),
                  y=NULL, x= 'Count')
fatal$totfatal <- format(fatal$totfatal, big.mark=",",core.just= "right")
tb1 <- tableGrob(fatal, theme= mytheme, rows = NULL, cols= c("Event Type","Fatalities","%"))

## Injured
injured <- cleandata %>%
  group_by(EVTYPE) %>%
  summarise(totinjured = sum(INJURIES)) %>%
  arrange(desc(totinjured)) %>%
  filter(totinjured !=0) %>%
  mutate(pct=round(100*totinjured/sum(totinjured),1)) %>%
  slice_max(totinjured,n=top)

p2i <- ggplot(injured, aes(y=reorder(EVTYPE, totinjured), x=totinjured))
p2i <- p2i + geom_col(fill ='orange')
p2i <- p2i + labs(title = paste('Injured (Top',top, '=', sum(fatal$pct),'%)'),
                  y=NULL, x= 'Count')
injured$totinjured <- format(injured$totinjured, big.mark=",",core.just= "right")
tb2 <- tableGrob(injured, theme= mytheme, rows = NULL, cols= c("Event Type","Injured","%"))

grid.arrange(p1f, p2i, tb1, tb2, heights= unit(c(120,80), c("mm", "mm")),
             widths= unit(c(120,120), c("mm", "mm")),
             top= "Weather Events Harmful to Population Health")

Top 5 Weather Events Causing Damage.

PDmg <- cleandata %>%
  group_by(EVTYPE) %>%
  summarise(totPDmg = round(sum(PropDmg)/1e9,3)) %>%
  arrange(desc(totPDmg)) %>%
  filter(totPDmg !=0) %>%
  mutate(pct=round(100*totPDmg/sum(totPDmg),1)) %>%
  slice_max(totPDmg,n=top)

p3p <- ggplot(PDmg, aes(y=reorder(EVTYPE, totPDmg), x= totPDmg))
p3p <- p3p + geom_col(fill ='Purple')
p3p <- p3p + labs(title= paste('Property Damage (Top', top, '=', sum(PDmg$pct),'%)'),
                  y= NULL, x= 'Value ($B)')
PDmg$totPDmg <- format(PDmg$totPDmg, big.mark=",",core.just= "right")
tb3 <- tableGrob(PDmg, theme= mytheme, rows = NULL, cols= c("Event Type","Value ($B)","%"))

## Crop Damage
CDmg <- cleandata %>%
  group_by(EVTYPE) %>%
  summarise(totCDmg = round(sum(CropDmg)/1e9,3)) %>%
  arrange(desc(totCDmg)) %>%
  filter(totCDmg !=0) %>%
  mutate(pct=round(100*totCDmg/sum(totCDmg),1)) %>%
  slice_max(totCDmg,n=top)

p4c <- ggplot(CDmg,aes(y=reorder(EVTYPE, totCDmg), x=totCDmg))
p4c <- p4c + geom_col(fill ='Steel Blue')
p4c <- p4c + labs(title= paste('Crop Damage (Top',top, '=', sum(CDmg$pct),'%)'),
                  y= NULL, x= 'Value ($B)')
CDmg$totCDmg <- format(CDmg$totCDmg, big.mark=",",core.just= "right")
tb4 <- tableGrob(CDmg, theme= mytheme, rows = NULL, cols= c("Event Type","Value ($B)","%"))

grid.arrange(p3p,p4c, tb3, tb4, heights= unit(c(120,80), c("mm", "mm")),
             widths= unit(c(120,120), c("mm", "mm")),
             top= "Weather Events with Damaging Economic Consequences")