Impact Analysis of Storms on Public Health & Economy

Synopsis

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to identify the type of storms that have a major impact on public health and economy. In this report, we have quantified the impact on public health as a function of public fatalities and injuries and economic impact as a function of property and crop damage. From our analysis, we concluded that Tornados and Floods cause the most damage to public health and economy respectively.

Data Processing

The data for this project came from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. Storm Data

Documentation for this data is also available at Storm Data Documentation

Downloading & Reading the data

library(dplyr)
library(ggplot2)

We will first download the data from the download link. Then we will read the data from the bz2 compressed comma separated raw file into an R object.

## Download the data file
fl_nm = "storm_data.csv.bz2"
if(!file.exists(fl_nm)) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",fl_nm)
}

## Read the data
strm_dt <- read.csv("storm_data.csv.bz2",stringsAsFactors = FALSE)

After reading the data, we will summarize the R object to familiarize ourselves with the data.

str(strm_dt)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Lastly, we will check few rows to make sure everything looks good.

head(strm_dt[,4:10])

##   TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE BGN_RANGE BGN_AZI
## 1       CST     97     MOBILE    AL TORNADO         0        
## 2       CST      3    BALDWIN    AL TORNADO         0        
## 3       CST     57    FAYETTE    AL TORNADO         0        
## 4       CST     89    MADISON    AL TORNADO         0        
## 5       CST     43    CULLMAN    AL TORNADO         0        
## 6       CST     77 LAUDERDALE    AL TORNADO         0

Cleansing the data

In this section, we will transform the raw data into a tidy dataset that we will use for our analysis.

## Making a copy of the raw data into a new object
tdy_strm_dt <- strm_dt

First, Property damage and crop damage is expressed as a combination of values and exponents. In this step, we will convert them into absolute numbers.

Then, we will remove the records with erroneous exponents from our dataset.

## Standardize the Property Damage and Crop Damage Exponents
tdy_strm_dt[tdy_strm_dt$PROPDMGEXP=="","PROPDMGEXP"] <- "0"
tdy_strm_dt[grep("[Bb]",tdy_strm_dt$PROPDMGEXP),"PROPDMGEXP"] <- "9"
tdy_strm_dt[grep("[Mm]",tdy_strm_dt$PROPDMGEXP),"PROPDMGEXP"] <- "6"
tdy_strm_dt[grep("[Kk]",tdy_strm_dt$PROPDMGEXP),"PROPDMGEXP"] <- "3"
tdy_strm_dt[tdy_strm_dt$CROPDMGEXP=="","CROPDMGEXP"] <- "0"
tdy_strm_dt[grep("[Kk]",tdy_strm_dt$CROPDMGEXP),"CROPDMGEXP"] <- "3"
tdy_strm_dt[grep("[Mm]",tdy_strm_dt$CROPDMGEXP),"CROPDMGEXP"] <- "6"
tdy_strm_dt[grep("[Bb]",tdy_strm_dt$CROPDMGEXP),"CROPDMGEXP"] <- "9"

## Segregate the error records
err_strm_dt <- rbind(tdy_strm_dt[-grep("[0-9]",tdy_strm_dt$PROPDMGEXP),],
                    tdy_strm_dt[-grep("[0-9]",tdy_strm_dt$CROPDMGEXP),])

## Remove the error records
tdy_strm_dt <- tdy_strm_dt[grep("[0-9]",tdy_strm_dt$PROPDMGEXP),]
tdy_strm_dt <- tdy_strm_dt[grep("[0-9]",tdy_strm_dt$CROPDMGEXP),]

## Calculate Total Damage
tdy_strm_dt$PROPDMGEXP <- as.numeric(tdy_strm_dt$PROPDMGEXP)
tdy_strm_dt$CROPDMGEXP <- as.numeric(tdy_strm_dt$CROPDMGEXP)
tdy_strm_dt$PROPDMGABS <- tdy_strm_dt$PROPDMG*10^tdy_strm_dt$PROPDMGEXP
tdy_strm_dt$CROPDMGABS <- tdy_strm_dt$CROPDMG*10^tdy_strm_dt$CROPDMGEXP

Lastly, we will retain only the relevant fields for our analysis.

## Retain only the required information
tdy_strm_dt$BGN_YR <- format(as.Date(tdy_strm_dt$BGN_DATE,"%m/%d/%Y"),"%Y")
sbst_tdy_strm_dt <- tdy_strm_dt[,c("REFNUM","BGN_YR","EVTYPE","FATALITIES","INJURIES","PROPDMGABS","CROPDMGABS")]

Results

Across the United States, which types of events are most harmful with respect to population health?

We will define the most harmful events as a function of fatalities and injuries caused to the general population.

First, we will calculate key aggregated metrics for each Event over the entire duration of the data.

## Calculate key aggregated metrics
sbst_tdy_strm_dt <- group_by(sbst_tdy_strm_dt, EVTYPE)
agg_strm_dt <- summarize(sbst_tdy_strm_dt, ttl_FATALITIES=sum(FATALITIES), ttl_INJURIES=sum(INJURIES),
                                 ttl_DMG=sum(PROPDMGABS+CROPDMGABS), mean_FATALITIES=mean(FATALITIES),
                                 mean_INJURIES=mean(INJURIES), mean_DMG=mean(PROPDMGABS+CROPDMGABS),
                                 ttl_INSTANCES=n())

Then, we will rank the events based on Public fatalities and Injuries.

For the purpose of ranking, we are equating 20 injuries to a fatality.

Followed by this, we will identify the top 5 events with highest public health impact based on fatalities and injuries.

## Calculate the rank and top 5 events
agg_strm_dt$rank <- dense_rank(desc(agg_strm_dt$ttl_FATALITIES+agg_strm_dt$ttl_INJURIES*.05))
agg_strm_dt$legend <- agg_strm_dt$EVTYPE
agg_strm_dt[agg_strm_dt$rank>5,"legend"] <- "All Others"

## Plot the Fatalities and Injuries and label the top 5 events
plot(agg_strm_dt$ttl_FATALITIES, agg_strm_dt$ttl_INJURIES, col=as.factor(agg_strm_dt$legend), lwd=3, pch=8,
     ylab = "Total Injuries", xlab = "Total Fatalities")
legend("topleft",unique(agg_strm_dt$legend), col = as.factor(unique(agg_strm_dt$legend)), lwd=3)

From the above scatter plot, we conclude that Tornados have the highest impact on the public health by far.

To ensure data capture is not a factor in our conclusion, we will re-run the same analysis on data since 2000.

## Re-run the same steps on data since 2000
sbst_tdy_strm_dt_2k <- subset(sbst_tdy_strm_dt, BGN_YR>=2000)
agg_strm_dt_2k <- summarize(sbst_tdy_strm_dt_2k, ttl_FATALITIES=sum(FATALITIES), ttl_INJURIES=sum(INJURIES),
                                 ttl_DMG=sum(PROPDMGABS+CROPDMGABS), mean_FATALITIES=mean(FATALITIES),
                                 mean_INJURIES=mean(INJURIES), mean_DMG=mean(PROPDMGABS+CROPDMGABS),
                                 ttl_INSTANCES=n())

agg_strm_dt_2k$rank <- dense_rank(desc(agg_strm_dt_2k$ttl_FATALITIES+agg_strm_dt_2k$ttl_INJURIES*.05))
agg_strm_dt_2k$legend <- agg_strm_dt_2k$EVTYPE
agg_strm_dt_2k[agg_strm_dt_2k$rank>5,"legend"] <- "All Others"

## Plot the Fatalities and Injuries and label the top 5 events since 2000
plot(agg_strm_dt_2k$ttl_FATALITIES, agg_strm_dt_2k$ttl_INJURIES, col=as.factor(agg_strm_dt_2k$legend), lwd=3, pch=8,
     ylab = "Total Injuries", xlab = "Total Fatalities")
legend("topleft",unique(agg_strm_dt_2k$legend), col = as.factor(unique(agg_strm_dt_2k$legend)), lwd=3)

The above scatter plot confirms that Tornados continues to have a major impact on public health. In fact, that is true for Excessive Heat, Lighting and Flash floods as well.

Across the United States, which types of events have the greatest economic consequences?

In this section, we will analyze the economic impact of these events.

Similar to the above analysis, we will first identify the Top 5 events with highest property and crop damage (in USD).

For our scatter plot, we will focus on two key aspects. 1) Total damage caused by an event 2) Mean damage caused by an event

## Calculate the rank and top 5 events 
agg_strm_dt$rank <- dense_rank(desc(agg_strm_dt$ttl_DMG))
agg_strm_dt$legend <- agg_strm_dt$EVTYPE
agg_strm_dt[agg_strm_dt$rank>5,"legend"] <- "All Others"

## Plot the total and mean damage and label the top 5 events
plot(agg_strm_dt$ttl_DMG, agg_strm_dt$mean_DMG, col=as.factor(agg_strm_dt$legend), lwd=3, pch=8,
     ylab = "Average Damage", xlab = "Total Damage")
legend("topleft",unique(agg_strm_dt$legend), col = as.factor(unique(agg_strm_dt$legend)), lwd=3)

From this scatter plot, we can draw following conclusions:

While a Flood doesn’t cause a lot of damage on an average, it occurs so frequently that it has the highest economic impact.
On the other hand, Hurricanes and Typhoons occur at a lower rate, but each time they happen there is a major economic loss.