Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to identify:

  1. which type(s) of severe weather event creates most harm to public health.
  2. which type(s) of severe weather event leads to most harm to the economy.

The results suggests that excessive heat and tornados has greatest impact to public health, while floods and droughts have the greatest economic impact.

1. Setup Workspace

1.1 Set working directory and load R packages that we will be using for this project.

setwd("~/Data Science/Module 5 Reproducible Research/Assignment 2/")

library(R.utils) # use R.utils to unzip .bz2 files
## Warning: package 'R.utils' was built under R version 3.2.5
## Loading required package: R.oo
## Warning: package 'R.oo' was built under R version 3.2.3
## Loading required package: R.methodsS3
## Warning: package 'R.methodsS3' was built under R version 3.2.3
## R.methodsS3 v1.7.1 (2016-02-15) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.20.0 (2016-02-17) successfully loaded. See ?R.oo for help.
## 
## Attaching package: 'R.oo'
## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods
## The following objects are masked from 'package:base':
## 
##     attach, detach, gc, load, save
## R.utils v2.3.0 (2016-04-13) successfully loaded. See ?R.utils for help.
## 
## Attaching package: 'R.utils'
## The following object is masked from 'package:utils':
## 
##     timestamp
## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, inherits, isOpen, parse, warnings
library(ggplot2) # use ggplot for our data plots
## Warning: package 'ggplot2' was built under R version 3.2.4
library(gridExtra) # enable us to plot 2 charts side by side
## Warning: package 'gridExtra' was built under R version 3.2.5

1.2. Download and extract the data file.

Check of data file already exists in working directory. Download and unzip if file does not exist.

if(!file.exists("StormData.csv.bz2")) {
    temp <- tempfile()
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",temp)
        datafile <- bunzip2(temp,"StormData.csv", remove = FALSE, skip = TRUE)
        unlink(temp)   
}

StormData <- read.csv(datafile, header=T, sep=",")

2. Data Processing

2.1. Examining the data

2.1.1 This purpose of this exercise is to gain an overview of the variables captured in the dataset, the variable types, as well as, the number of records captured (ie. size of dataset).

head(StormData, n=1)
##   STATE__          BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1 4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                        14   100 3   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15      25          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
str(StormData)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

The dataset consists of 902297 records of 37 variables, with the first recorded event in 1950.

Given the scope of our project, we will only require the following variables to determine which weather event has greatest impact to public health and the economy.

  1. EVTYPE
  2. FATALITIES - to determine health impact
  3. INJURIES - to determine health impact
  4. PROPDMG - to determine economic impact
  5. PROPDMGEXP - to determine economic impact
  6. CROPDMG - to determine economic impact
  7. CROPDMGEXP - to determine economic impact

2.1.2 Next we examine the dataset to determine if there are any problems in the dataset that may potentially impact our analysis. We will use the total number of severe weather events recorded per year as an indication of the quality of the dataset.

We can derive the year of the event from the “BGN_DATE” field in the NOAA dataset.

# Extract year from BGN_DATE field.

if (dim(StormData)[2] == 37) {
    StormData$year <- as.numeric(format(as.Date(StormData$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
}

# Summarise number of events recorded by year

ggplot(data=StormData, aes(StormData$year)) + 
        geom_histogram(breaks=seq(1950, 2012, by = 1), 
                 col="orange", 
                 fill="orange", 
                 alpha = .2) + 
        labs(title="Total Number of Severe Weather Events Per Year") +
        labs(x="Year", y="No. Events")

From the histogram, the number of recorded events increased significantly from 1995. This would suggest either (a) poor data quality collected prior to 1995 or (b) fewer occurance of severe weather events prior to 1995.

We will assume scenario (b) is true and consider the entire dataset for our analysis.

2.2 Preparing the data for analysis

2.2.1 Create subset of NOAA data containing only the variables identified in section 2.1.1.

As we will not require all the variables in the original dataset for our analysis, we will create a subset of the NOAA data with the relevant variables identified eariler. A smaller dataset is always easier to run and manage.

Projvariables<-c("EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG", "CROPDMGEXP")

StormData2<- StormData[Projvariables]

str(StormData2) # Check data subset
## 'data.frame':    902297 obs. of  7 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
unique(StormData2$PROPDMGEXP)
##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(StormData2$CROPDMGEXP)
## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M

2.2.2 Format economic impact variable

Per NOAA code book, the variables “PROPDMGEXP” and “CROPDMGEXP” represent the use of characters to signify the magnitude of the estimated damage captured in the variables “PROPDMG” and “CROPDMG”.

unique(StormData2$PROPDMGEXP)
##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(StormData2$CROPDMGEXP)
## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M

We will need to apply these factors to “PROPDMG” and “CROPDMG” in order to arrive at the the full dollar economic impact of the corresponding weather event.

2.2.2.1 Compute the full dollar impact of property damage

# Assign the multiplier to be used for each character

StormData2$PROPVAL[StormData2$PROPDMGEXP ==  "K"   ]  <-    1000
StormData2$PROPVAL[StormData2$PROPDMGEXP == "M"     ]   <-  10^6
StormData2$PROPVAL[StormData2$PROPDMGEXP == ""      ]   <-  1
StormData2$PROPVAL[StormData2$PROPDMGEXP == "B"     ]   <-  10^9
StormData2$PROPVAL[StormData2$PROPDMGEXP == "m"     ]   <-  10^6
StormData2$PROPVAL[StormData2$PROPDMGEXP == "+"     ]   <-  0
StormData2$PROPVAL[StormData2$PROPDMGEXP == "0"     ]   <-  1
StormData2$PROPVAL[StormData2$PROPDMGEXP == "5"     ]   <-  10^5
StormData2$PROPVAL[StormData2$PROPDMGEXP == "6"     ]   <-  10^6
StormData2$PROPVAL[StormData2$PROPDMGEXP == "?"     ]   <-  0
StormData2$PROPVAL[StormData2$PROPDMGEXP == "4"     ]   <-  10000
StormData2$PROPVAL[StormData2$PROPDMGEXP == "2"     ]   <-  100
StormData2$PROPVAL[StormData2$PROPDMGEXP == "3"     ]   <-  1000
StormData2$PROPVAL[StormData2$PROPDMGEXP == "h"     ]   <-  100
StormData2$PROPVAL[StormData2$PROPDMGEXP == "7"     ]   <-  10^7
StormData2$PROPVAL[StormData2$PROPDMGEXP == "H"     ]   <-  100
StormData2$PROPVAL[StormData2$PROPDMGEXP == "-"     ]   <-  0
StormData2$PROPVAL[StormData2$PROPDMGEXP == "1"     ]   <-  10
StormData2$PROPVAL[StormData2$PROPDMGEXP == "8" ]   <-  10^8

# Compute full dollar impact

StormData2$PROPDMGVAL <- StormData2$PROPDMG * StormData2$PROPVAL

2.2.2.2 Compute the full dollar impact of crop damage

# Assign the multiplier to be used for each character

StormData2$CROPVAL[StormData2$CROPDMGEXP ==  ""     ]   <-  1
StormData2$CROPVAL[StormData2$CROPDMGEXP == "M"     ]   <-  10^6
StormData2$CROPVAL[StormData2$CROPDMGEXP == "K"     ]   <-  1000
StormData2$CROPVAL[StormData2$CROPDMGEXP == "m"     ]   <-  10^9
StormData2$CROPVAL[StormData2$CROPDMGEXP == "B"     ]   <-  10^6
StormData2$CROPVAL[StormData2$CROPDMGEXP == "?"     ]   <-  0
StormData2$CROPVAL[StormData2$CROPDMGEXP == "0"     ]   <-  1
StormData2$CROPVAL[StormData2$CROPDMGEXP == "k"     ]   <-  1000
StormData2$CROPVAL[StormData2$CROPDMGEXP == "2" ]   <-  100

# Compute full dollar impact

StormData2$CROPDMGVAL <- StormData2$CROPDMG * StormData2$CROPVAL

We now have a properly formatted subset of the NOAA dataset for our project.

3. Determine the top 5 severe weather event type with greatest impact to public health

3.1 calculate the total number of fatalities and injuries for each type severe weather event. Identify the top 5 severe weather events for number of deaths and for number of injuries.

Death <- aggregate(FATALITIES ~ EVTYPE, data = StormData2, FUN = sum)
Injury <- aggregate(INJURIES ~ EVTYPE, data = StormData2, FUN = sum)

Deathtop5 <- Death[order(-Death$FATALITIES),][1:5, ]
Injurytop5 <- Injury[order(-Injury$INJURIES),][1:5, ]

Deathtop5
##             EVTYPE FATALITIES
## 834        TORNADO       5633
## 130 EXCESSIVE HEAT       1903
## 153    FLASH FLOOD        978
## 275           HEAT        937
## 464      LIGHTNING        816
Injurytop5
##             EVTYPE INJURIES
## 834        TORNADO    91346
## 856      TSTM WIND     6957
## 170          FLOOD     6789
## 130 EXCESSIVE HEAT     6525
## 464      LIGHTNING     5230

4. Determine the top 5 severe weather event type with greatest economic impact

4.1 Calculate the total dollar damage for each type severe weather event. Identift the top 5 severe weather events for property and crop damage.

Property <- aggregate(PROPDMGVAL~ EVTYPE, data = StormData2, FUN = sum)
Crops <- aggregate(CROPDMGVAL ~ EVTYPE, data = StormData2, FUN = sum)

Proptop5 <- Property[order(-Property$PROPDMGVAL),][1:5, ]
Croptop5 <- Crops[order(-Crops$CROPDMGVAL),][1:5, ]


Proptop5
##                EVTYPE   PROPDMGVAL
## 170             FLOOD 144657709807
## 411 HURRICANE/TYPHOON  69305840000
## 834           TORNADO  56947380617
## 670       STORM SURGE  43323536000
## 153       FLASH FLOOD  16822673979
Croptop5
##             EVTYPE  CROPDMGVAL
## 95         DROUGHT 12474066000
## 409 HURRICANE OPAL 10009000000
## 170          FLOOD  5661968450
## 244           HAIL  3025954473
## 402      HURRICANE  2741910000

5. Results

5.1 Severe weather events with greatest impact to public health

# Summarise our findings from section 3.1 into a chart

plot1<- ggplot(data=Deathtop5, aes(reorder(Deathtop5$EVTYPE, Deathtop5$FATALITIES),Deathtop5$FATALITIES)) + 
        geom_bar(stat="identity", 
                 col="red", 
                 fill="red", 
                 alpha = .2,
                 width = .5) + 
        coord_flip() +
        theme(title=element_text(size=8))+
        labs(title="Total Deaths by Severe Weather Event - 1950 to 2011") +
        labs(x="Event Type", y="Total Deaths")              
              
  
plot2<- ggplot(data=Injurytop5, aes(reorder(Injurytop5$EVTYPE, Injurytop5$INJURIES),Injurytop5$INJURIES)) + 
        geom_bar(stat="identity", 
                 col="yellow", 
                 fill="yellow", 
                 alpha = .2,
                 width = .5) + 
        coord_flip() +
        theme(title=element_text(size=8))+
        labs(title="Total Injuries by Severe Weather Event - 1950 to 2011") +
        labs(x="Event Type", y="Total Injured")          
  
 
grid.arrange(plot1,plot2, ncol = 2)

The above plot shows the top 5 severe weather events, ranked accordingly to the total number of deaths and injuries caused, over the period of our study.

Tornados, which resulted in the highest number of deaths and also injuries, clearly have the greatest impact to public health.

5.2 Severe weather events with greatest economic impact

# Summarise our findings from section 4.1 in a chart

plot3<- ggplot(data=Proptop5, aes(reorder(Proptop5$EVTYPE, Proptop5$PROPDMGVAL),Proptop5$PROPDMGVAL)) + 
        geom_bar(stat="identity", 
                 col="blue", 
                 fill="blue", 
                 alpha = .2,
                 width = .5) + 
        coord_flip() +
        theme(title=element_text(size=7))+
        labs(title="Total Value of Property Damaged by Severe Weather Event - 1950 to 2011") +
        labs(x="Event Type", y="Total $ Property Damaged")              
              
  
plot4<- ggplot(data=Croptop5, aes(reorder(Croptop5$EVTYPE, Croptop5$CROPDMGVAL),Croptop5$CROPDMGVAL)) + 
        geom_bar(stat="identity", 
                 col="green", 
                 fill="green", 
                 alpha = .2,
                 width = .5) + 
        coord_flip() +
        theme(title=element_text(size=7))+
        labs(title="Total Value of Crop Damaged by Severe Weather Event - 1950 to 2011") +
        labs(x="Event Type", y="Total $ Crop Damaged")          
  
 
grid.arrange(plot3,plot4, ncol = 2)

The above plot shows the top 5 severe weather events, ranked accordingly to the total value of property damaged and total value of crop damaged, over the period of our study.

We can see that floods have the greatest damage to property, while droughts have the most adverse impact to crops.