Reproducible Research on the Economic and Health Impacts of Severe Weather Events across the United States, 1995 to 2011.

Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Synopsis

In this project I will use the data mentioned in the introduction, which was provided by the NOAA via the Coursera Data Science Specialization course. I will use an R Markdown file to explain and share the method used, the code chunks adopted, the results, and the conclusions of my findings. Visual as well as numerical presentations will be provided. Disclaimer: the conclusions and results are my own and should not be taken to represent the views of Coursera, Johns Hopkins, or their professors.

Reading in the Data

At the time of this assignment I was using RStudio Version 0.99.902 (© 2009-2016 RStudio, Inc.) on Windows, under R x64 version 3.3.1.

First things first, we have to download the data and process it. But before we do that, set your working directory to wherever you choose, then:

url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists("./Reproducible Research Assignment II")) {
    dir.create("./Reproducible Research Assignment II")
}
download.file(url, destfile = "./Reproducible Research Assignment II/StormData.csv.bz2") ## I am using a Windows computer, hence no curl method
StormData <- read.csv("./Reproducible Research Assignment II/StormData.csv.bz2",
                      stringsAsFactors = FALSE) ## read.csv decompresses .bz2 files directly
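
If you knit this document more than once, you may want to skip the download when the file is already on disk, since it is large. A minimal sketch, reusing the url and destination path above:

## optional: only download when the file is not already present
dest <- "./Reproducible Research Assignment II/StormData.csv.bz2"
if (!file.exists(dest)) download.file(url, destfile = dest)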

Processing the Data

Per many lessons and classes, and common sense really, it’s always good to get a feel for the data even if you have no clue what you’re looking at. Sometimes you might find a pattern and it starts to make sense. Or, you could just read the 97-page PDF file that came as part of the assignment. Below are a few commands I usually run before getting more specific.

str(StormData)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
head(StormData[, 1:10])
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI
## 1 TORNADO         0        
## 2 TORNADO         0        
## 3 TORNADO         0        
## 4 TORNADO         0        
## 5 TORNADO         0        
## 6 TORNADO         0
tail(StormData[, 1:10])
##        STATE__           BGN_DATE    BGN_TIME TIME_ZONE COUNTY
## 902292      47 11/28/2011 0:00:00 03:00:00 PM       CST     21
## 902293      56 11/30/2011 0:00:00 10:30:00 PM       MST      7
## 902294      30 11/10/2011 0:00:00 02:48:00 PM       MST      9
## 902295       2  11/8/2011 0:00:00 02:58:00 PM       AKS    213
## 902296       2  11/9/2011 0:00:00 10:21:00 AM       AKS    202
## 902297       1 11/28/2011 0:00:00 08:00:00 PM       CST      6
##                                  COUNTYNAME STATE         EVTYPE BGN_RANGE
## 902292 TNZ001>004 - 019>021 - 048>055 - 088    TN WINTER WEATHER         0
## 902293                         WYZ007 - 017    WY      HIGH WIND         0
## 902294                         MTZ009 - 010    MT      HIGH WIND         0
## 902295                               AKZ213    AK      HIGH WIND         0
## 902296                               AKZ202    AK       BLIZZARD         0
## 902297                               ALZ006    AL     HEAVY SNOW         0
##        BGN_AZI
## 902292        
## 902293        
## 902294        
## 902295        
## 902296        
## 902297

It’s clear that this is a huge data set: over 900k observations (rows) of 37 variables (columns), partly because some variables appear to be duplicate headers for, I assume, the same data. Luckily, the assignment specifies that we should be interested in event types (EVTYPE) and in recent events, so we will base our research on those. I have chosen to focus on data from 1995 to the most recent entries.

Transforming and Cleaning some of the Data

In order to select what data we need to focus on, let’s consider the questions:

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?

Based on these, and reading through the PDF file provided, we note that the answers to the above questions lie in a few variables alongside EVTYPE, namely (an optional column subset is sketched after this list):

  • FATALITIES
  • INJURIES
  • PROPDMG
  • CROPDMG
  • PROPDMGEXP
  • CROPDMGEXP
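
If you prefer working with a narrower data frame, you could subset to just these columns up front. This is optional, and StormSub is a name I’ve made up; the rest of the analysis keeps using the full data set:

## optional: keep only the columns this analysis uses (StormSub is hypothetical)
vars <- c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES",
          "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
StormSub <- StormData[, vars]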

Let’s start with question 1:

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Out of this list, two variables relate directly to health: INJURIES and FATALITIES.

Before we begin, let’s cull the data to start from 1995. In the str() output above we noted a “BGN_DATE” variable, so we first create a new data set containing only events from 1995 onward.

StormData$BGN_DATE <- as.Date(StormData$BGN_DATE, format = "%m/%d/%Y") ## convert to Date class
New <- as.Date("1995-01-01") ## cut-off date
NewStormData <- subset(StormData, BGN_DATE >= New) ## keep events from 1995 onward

This data set is now down to 681k observations (vs 902k previously).
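
If you want to verify the cut yourself:

dim(NewStormData)            ## rows and columns after the cut
range(NewStormData$BGN_DATE) ## should start no earlier than 1995-01-01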

Let’s aggregate the new data set by EVTYPE, starting with fatalities:

Fatalities <- with(NewStormData, aggregate(FATALITIES, by = list(EVTYPE), sum)) ## to help simplify our modeling function
colnames(Fatalities) <- c("EVTYPE", "TOTAL") ## set the column name
Fatal_sub <- subset(Fatalities, Fatalities$TOTAL > 0) ## only consider values above 0
Fatal_sub_ord <- Fatal_sub[order(Fatal_sub$TOTAL, decreasing = TRUE),] ## set the order

Printed on their own, those totals don’t tell us much at a glance - e.g. which number belongs to which event type - but at least we can see they are all numbers :) and not characters, for example.
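
A quick peek confirms this (output omitted here):

head(Fatal_sub_ord)      ## top event types by total fatalities
str(Fatal_sub_ord$TOTAL) ## TOTAL is numeric, not character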

Results

Now let’s plot. (Note: I did this many times and decided to limit the outcome to the top 20, as the plots ended up really messy and unclear otherwise.) I followed advice from here for the reordering, as my order command wasn’t giving me what I was looking for.

library(ggplot2)
g <- ggplot(Fatal_sub_ord[1:20, ], aes(x = reorder(EVTYPE, -TOTAL), y = TOTAL))
## note: fill is set outside aes() so the bars are actually coloured purple
FatalPlot <- g + geom_bar(stat = "identity", fill = "purple", show.legend = FALSE) +
    coord_flip() + ylab("Number of Fatalities") + xlab("Event Type") +
    ggtitle("Top 20 Fatal Natural Events Across the US")
FatalPlot

Now we do the same for injuries.

Injuries <- with(NewStormData, aggregate(INJURIES, by = list(EVTYPE), sum))
colnames(Injuries) <- c("EVTYPE", "TOTAL")
Injur_sub <- subset(Injuries, TOTAL > 0)
Injur_sub_ord <- Injur_sub[order(Injur_sub$TOTAL, decreasing = TRUE), ]
g2 <- ggplot(Injur_sub_ord[1:20, ], aes(x = reorder(EVTYPE, -TOTAL), y = TOTAL))
InjuryPlot <- g2 + geom_bar(stat = "identity", fill = "brown", show.legend = FALSE) +
    coord_flip() + ylab("Number of Injuries") + xlab("Event Type") +
    ggtitle("Top 20 Injuries by Natural Events Across the US")
InjuryPlot

So far we have considered only the direct impact of weather events on human health; now we must consider the indirect impacts, or rather the direct impacts on the economy. This leads us to the next question:

2. Across the United States, which types of events have the greatest economic consequences?

Now we focus on the following variables:

  • PROPDMG
  • CROPDMG
  • PROPDMGEXP
  • CROPDMGEXP

These refer to PROPerty DaMaGe and CROP DaMaGe, plus their EXPonents, which I take to specify the order of magnitude. From the documentation (p. 12) we can determine that H = hundreds, K = thousands, M = millions, and B = billions.
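
As an aside, this mapping can also be encoded as a named lookup vector instead of the repeated subscript-assignments used below. This is just a sketch under my reading of the documentation; exp_to_millions and to_millions are names I’ve invented:

## hypothetical helper: rescale damage values to millions of USD given the
## exponent letters; anything other than h/k/m/b is treated as raw dollars
exp_to_millions <- c(h = 1e-4, k = 1e-3, m = 1, b = 1e3)
to_millions <- function(value, exp) {
    mult <- exp_to_millions[tolower(exp)]
    mult[is.na(mult)] <- 1e-06
    unname(value * mult)
}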

This is more complicated, and after much referencing, I’ll share how I’ve tackled it.

To make life simple and create order, we’ll first convert everything from letters to numeric multipliers, using millions (of USD) as our base reference value.

## select the data we are interested in from within the set
Propdmg   <- NewStormData$PROPDMG
Propdmgex <- NewStormData$PROPDMGEXP
Cropdmg   <- NewStormData$CROPDMG
Cropdmgex <- NewStormData$CROPDMGEXP
## rescale the values to millions of USD using the exponent letters
exps <- c("H", "h", "K", "k", "M", "m", "B", "b")
Propdmg[Propdmgex %in% c("M", "m")] <- Propdmg[Propdmgex %in% c("M", "m")] * 1
Propdmg[Propdmgex %in% c("B", "b")] <- Propdmg[Propdmgex %in% c("B", "b")] * 1e+03
Propdmg[Propdmgex %in% c("K", "k")] <- Propdmg[Propdmgex %in% c("K", "k")] * 1e-03
Propdmg[Propdmgex %in% c("H", "h")] <- Propdmg[Propdmgex %in% c("H", "h")] * 1e-04
Propdmg[!(Propdmgex %in% exps)] <- Propdmg[!(Propdmgex %in% exps)] * 1e-06 ## anything else is raw dollars
Cropdmg[Cropdmgex %in% c("M", "m")] <- Cropdmg[Cropdmgex %in% c("M", "m")] * 1
Cropdmg[Cropdmgex %in% c("B", "b")] <- Cropdmg[Cropdmgex %in% c("B", "b")] * 1e+03
Cropdmg[Cropdmgex %in% c("K", "k")] <- Cropdmg[Cropdmgex %in% c("K", "k")] * 1e-03
Cropdmg[Cropdmgex %in% c("H", "h")] <- Cropdmg[Cropdmgex %in% c("H", "h")] * 1e-04
Cropdmg[!(Cropdmgex %in% exps)] <- Cropdmg[!(Cropdmgex %in% exps)] * 1e-06
CropProp <- Cropdmg + Propdmg ## combined damage per event, in millions of USD
EcoDmg <- aggregate(CropProp ~ NewStormData$EVTYPE, FUN = sum)
colnames(EcoDmg) <- c("EVTYPE", "TOTAL")
EcoDmgOrd <- EcoDmg[order(EcoDmg$TOTAL, decreasing = TRUE), ]
g3 <- ggplot(EcoDmgOrd[1:20, ], aes(x = reorder(EVTYPE, -TOTAL), y = TOTAL))
EcoDmgPlot <- g3 + geom_bar(stat = "identity", fill = "blue", show.legend = FALSE) +
    coord_flip() + ylab("Total Cost (millions of USD)") + xlab("Event Type") +
    ggtitle("Top 20 Costliest Natural Events Across the US")
EcoDmgPlot
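
Before stating the conclusions, it’s worth glancing at the top rows of each ordered table (output omitted here) to confirm what the plots show:

head(Fatal_sub_ord, 3) ## leading causes of fatalities
head(Injur_sub_ord, 3) ## leading causes of injuries
head(EcoDmgOrd, 3)     ## costliest event types, in millions of USD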

In conclusion, since 1995:

  • Highest fatality event: Excessive Heat (heat stroke)
  • Highest injury event: Tornado
  • Highest economic impact: Flood