Reproducible Research Course Assignment2 - Health and Economic Consequences of Storms in the United States

Adrian Lim June 2015

Synopsis

This Project analyses the storm data from NOAA and answers 2 main questions:
- Across the United States, which Event Types causes the most harm to population health
- Across the Unites States, which Event Types causes the most economic damage.

Data Processing

1)Prerequisite Libraries

Load the required R libraries.

library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.7.0 (2015-02-19) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.19.0 (2015-02-27) successfully loaded. See ?R.oo for help.
## 
## Attaching package: 'R.oo'
## 
## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods
## 
## The following objects are masked from 'package:base':
## 
##     attach, detach, gc, load, save
## 
## R.utils v2.1.0 (2015-05-27) successfully loaded. See ?R.utils for help.
## 
## Attaching package: 'R.utils'
## 
## The following object is masked from 'package:utils':
## 
##     timestamp
## 
## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, inherits, isOpen, parse, warnings
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

2)Unzip and read the storm data file.

The storm data is downloaded from Storm Data.
The zipped file is assummed downloaded into the current working directory for further processing.
Documentation describing the data is downloaded from Storm Data Documentation.
There is a useful FAQ.

bunzip2("./repdata-data-StormData.csv.bz2", "./storm3.csv", remove = FALSE)
stormdf <- read.csv("./storm3.csv",header=TRUE,sep=",")

Dimensions of the data are:

dim(stormdf)
## [1] 902297     37

Basic information on the data:

str(stormdf)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

3) Subset the data file

Lets subset the data file to include only the pertinent columns for our analysis. We require only information about the event type, number of fatalities and injuries, amount of property and crop damage.

df <- select(stormdf,STATE,EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)
head(df)
##   STATE  EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1    AL TORNADO          0       15    25.0          K       0           
## 2    AL TORNADO          0        0     2.5          K       0           
## 3    AL TORNADO          0        2    25.0          K       0           
## 4    AL TORNADO          0        2     2.5          K       0           
## 5    AL TORNADO          0        2     2.5          K       0           
## 6    AL TORNADO          0        6     2.5          K       0

Check for any missing values:

sum(is.na(df))
## [1] 0

There are no missing values.

4) EVTYPE field analysis

Table 2.1.1 (Storm Event Table) describing the event types sourced from the below URL https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf lists 48 event types.

eventypenames <- unique(df$EVTYPE)
numeventtypes <- length(eventypenames)
head(eventypenames)
## [1] TORNADO               TSTM WIND             HAIL                 
## [4] FREEZING RAIN         SNOW                  ICE STORM/FLASH FLOOD
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... WND

However the data currently has 985 event types so clearly there is a lot of EVTYPE data which requires cleaning. Due to the lack of time though I am leaving this to follow up work.

Results

1)Analysis of Fatalities

First lets group the events by type, sum the total number of fatalities for each event type then order by highest to the lowest

fatalitiesByEventType <- df %>% group_by(EVTYPE) %>% summarise(TotalFatalities = sum(FATALITIES)) %>% arrange(desc(TotalFatalities))
head(fatalitiesByEventType)
## Source: local data frame [6 x 2]
## 
##           EVTYPE TotalFatalities
## 1        TORNADO            5633
## 2 EXCESSIVE HEAT            1903
## 3    FLASH FLOOD             978
## 4           HEAT             937
## 5      LIGHTNING             816
## 6      TSTM WIND             504

Now lets plot a graph of the top eight highest fatality count by event

barplot(height=fatalitiesByEventType$TotalFatalities[1:8],names.arg = fatalitiesByEventType$EVTYPE[1:8],col=rainbow(10,start=0,end=1),main="Top 8 Fatality Events",ylab="Total Number of Fatalities",las=2,cex.axis=0.75)

Clearly Tornadoes are the leading cause of fatalities by far.

2)Analysis of Injuries

Similiarly for Injuries, lets first lets group the events by type, sum the total number of injuries for each event type then order by highest to the lowest

InjuriesByEventType <- df %>% group_by(EVTYPE) %>% summarise(TotalInjuries = sum(INJURIES)) %>% arrange(desc(TotalInjuries))
head(InjuriesByEventType)
## Source: local data frame [6 x 2]
## 
##           EVTYPE TotalInjuries
## 1        TORNADO         91346
## 2      TSTM WIND          6957
## 3          FLOOD          6789
## 4 EXCESSIVE HEAT          6525
## 5      LIGHTNING          5230
## 6           HEAT          2100

Now lets plot a graph of the top eight highest injuries count by event

barplot(height=InjuriesByEventType$TotalInjuries[1:8],names.arg = InjuriesByEventType$EVTYPE[1:8],col=rainbow(10,start=0,end=1),main="Top 8 Injury Events",ylab="Total Number of Injuries",las=2,cex.axis=0.75)

Clearly Tornadoes again are the leading cause of injuries by far compared to other event types.

3)Analysis of Economic Impact

3a) Process the property and crop multipliers

The PROPDMGEXP and CROPDMGEXP fiels contain the multipliers for the damages for property damage and crop damage respectively. The unique values of those fields are:

unique(df$PROPDMGEXP)
##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(df$CROPDMGEXP)
## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M

We need to replace the non numeric values like K,M etc with their numeric equivalents eg 1000, 1,000,000. “”,“+”,“-”,“?” will be replaced with numeric
0,1,-1,0 respectively. Firstly lets convert to all upper case.

df$PROPDMGEXP <- toupper(df$PROPDMGEXP)
df$CROPDMGEXP <- toupper(df$CROPDMGEXP)
symbol <- c("","+","-","?",0:9,"H","K","M","B")
replacement <- c(0,1,-1,0,0:9,100,1000,1000000,1000000000)
multiplier <-data.frame(symbol,replacement)
head(multiplier)
##   symbol replacement
## 1                  0
## 2      +           1
## 3      -          -1
## 4      ?           0
## 5      0           0
## 6      1           1

3b)Calculate actual economic damage

Now we calculate the actual economic damage value per event using the multipliers and covert to the scale to Billions of Dollars

df$PROPDMG <- df$PROPDMG*multiplier[match(df$PROPDMGEXP,multiplier$symbol),2]
df$CROPDMG <- df$CROPDMG*multiplier[match(df$CROPDMGEXP,multiplier$symbol),2]
df$PROPDMG <- df$PROPDMG/1000000000
df$CROPDMG <- df$CROPDMG/1000000000

For economic impact sum the total property damage and total crop damage for each event type then calculate the total Damage value which is the sum of the proporty damage plus crop damage and then arrange from highest to lowest total Damage value

EconomicImpactByEventType <- df %>% group_by(EVTYPE) %>% summarise(TotalProp = sum(PROPDMG), TotalCrop = sum(CROPDMG)) %>% mutate(TotalDamageValue = TotalProp + TotalCrop) %>% arrange(desc(TotalDamageValue))
head(EconomicImpactByEventType)
## Source: local data frame [6 x 4]
## 
##              EVTYPE TotalProp TotalCrop TotalDamageValue
## 1             FLOOD 144.65771 5.6619684        150.31968
## 2 HURRICANE/TYPHOON  69.30584 2.6078728         71.91371
## 3           TORNADO  56.93716 0.4149531         57.35211
## 4       STORM SURGE  43.32354 0.0000050         43.32354
## 5              HAIL  15.73227 3.0259544         18.75822
## 6       FLASH FLOOD  16.14081 1.4213171         17.56213

3c) Plot the graph of the top eight highest economic impact event types

options(scipen=999)
barplot(height=EconomicImpactByEventType$TotalDamageValue[1:8],names.arg = EconomicImpactByEventType$EVTYPE[1:8],col=rainbow(8,start=0,end=1),main="Top 8 Economic Damage Event Types",ylab="Total Economic Damage (Billion$)",las=2)

Conclusion

The most dangerous event type is clearly tornadoes as it causes by far the most number of fatalities and injuries.

From the last graph, it is quite clear that the the top 3 event types causing the most economic damage is:

1.Floods
2.Typhoon
3.Tornados

More work can be done to clean up the EVTYPE fields and reanalyse the result to see whether there is a significant impact. This shall be left to a follow up project.