Coursera Data Science Reproducible Research Project 2

Saugata Ghosh September 3, 2016

Impact of Severe Weather Events on Public Health and Economy in the United States

Synopsis

In this report, the goal is to analyze the impact of different weather events on public health and economy based on the storm database collected from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) from 1950 - 2011. The data used will be estimates of fatalities, injuries, property and crop damage to decide which types of events are most harmful to the population health and economy. From these data, we found that tornados and heat are most harmful with respect to population health, while flood and hurricanes have the greatest economic impacts.

Load packages and set default options

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(ggplot2)
library(knitr)
opts_chunk$set(echo = TRUE)

Data Processing

In this section we first load the downloaded csv file on storm data and examine its structure. We also verify that the data has no missing values

stormdata <- read.csv("stormdata.csv", header =TRUE, sep = ",", stringsAsFactors = FALSE)
str(stormdata)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
colSums(is.na(stormdata))
##    STATE__   BGN_DATE   BGN_TIME  TIME_ZONE     COUNTY COUNTYNAME 
##          0          0          0          0          0          0 
##      STATE     EVTYPE  BGN_RANGE    BGN_AZI BGN_LOCATI   END_DATE 
##          0          0          0          0          0          0 
##   END_TIME COUNTY_END COUNTYENDN  END_RANGE    END_AZI END_LOCATI 
##          0          0     902297          0          0          0 
##     LENGTH      WIDTH          F        MAG FATALITIES   INJURIES 
##          0          0     843563          0          0          0 
##    PROPDMG PROPDMGEXP    CROPDMG CROPDMGEXP        WFO STATEOFFIC 
##          0          0          0          0          0          0 
##  ZONENAMES   LATITUDE  LONGITUDE LATITUDE_E LONGITUDE_    REMARKS 
##          0         47          0         40          0          0 
##     REFNUM 
##          0

Next we need to view the distribution of the data over years. For this we transform the variable BGN_DATE into a suitable format, extract the year of the data and examine the distribution

stormdata$BGN_DATE <- mdy_hms(stormdata$BGN_DATE)
stormdata$YEAR <- year(stormdata$BGN_DATE)
hist(stormdata$YEAR, breaks = 30, xlab = "Year", main = "Distribution of Available data by Year")

From the distribution we can see that for years prior to the 1990s the data available is quiet sparse. Hence we decide to only consider the data from 1989 onwards(the first year for which we have more than 10000 records) and subset only the variables of interest to the analysis

storm <- subset(stormdata, YEAR >= 1989)
storm <- storm[,c("EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]

We examine the type of events and remove those with the heading ‘Summary’ because that header is unclassifiable. Next we derive a measure of impact on population health which is a sum of the number of injuries and fatalities and name the new variable ‘casualties’. Then we aggregate ‘casualties’ data by event type. For the purpose of analysis we first filter out those events with total ‘casualties’ figure less than 10 between 1989 and 2011.

storm <- storm[!grepl("Summary|SUMMARY", storm$EVTYPE),]
storm$CASUALTIES <- storm$FATALITIES+storm$INJURIES
storm2 <- aggregate(CASUALTIES ~ EVTYPE, data = storm,sum)
storm2 <- storm2[storm2$CASUALTIES >=10,]
storm2 <- storm2[order(-storm2$CASUALTIES),]

We observe that many of the minor event types appear to be variants of the names of some of the major event types. Based on our understanding of the data dictionaries provided we club the different event type names into broader categories as follows:

storm2$EVTYPE[grep("TORNADO|WATERSPOUT",storm2$EVTYPE)]<- "TORNADO"
storm2$EVTYPE[grep("Heat|HEAT|WARM",storm2$EVTYPE)]<- "HEAT"
storm2$EVTYPE[grep("FLOOD|FLASH FLOOD|URBAN/SML STREAM FLD",storm2$EVTYPE)]<- "FLOOD"
storm2$EVTYPE[grep("TSTM|THUNDER",storm2$EVTYPE)]<- "THUNDER STORM"
storm2$EVTYPE[grep("ICE STORM|BLIZZARD|WINTER STORM",storm2$EVTYPE)]<- "BLIZZARD"
storm2$EVTYPE[grep("COLD|WINTER WEATHER|WINTRY",storm2$EVTYPE)]<- "COLD"
storm2$EVTYPE[grep("WILD",storm2$EVTYPE)]<- "WILDFIRE"
storm2$EVTYPE[grep("HURRICANE",storm2$EVTYPE)]<- "HURRICANE"
storm2$EVTYPE[grep("DENSE FOG",storm2$EVTYPE)]<- "FOG"
storm2$EVTYPE[grep("WIND|WINDS|STORM SURGE",storm2$EVTYPE)]<- "WIND"
storm2$EVTYPE[grep("SURF",storm2$EVTYPE)]<- "SURF"
storm2$EVTYPE[grep("TROPICAL",storm2$EVTYPE)]<- "TROPICAL STORM"
storm2$EVTYPE[grep("RIP",storm2$EVTYPE)]<- "CURRENT"
storm2$EVTYPE[grep("GLAZE|ICE|ICY",storm2$EVTYPE)]<- "ICE"
storm2$EVTYPE[grep("SNOW",storm2$EVTYPE)]<- "SNOW"
storm2$EVTYPE[grep("HAIL",storm2$EVTYPE)]<- "HAIL"
storm2$EVTYPE[grep("RAIN|DRIZZLE|PRECIP",storm2$EVTYPE)]<- "RAIN"
storm2$EVTYPE[grep("SEAS|MARINE",storm2$EVTYPE)]<- "SEAS"
storm2$EVTYPE[grep("DUST",storm2$EVTYPE)]<- "DUST STORM"

We aggregate the ‘casualties’ figure based on the new groupings and extract the top ten events with the highest casualties

storm3 <- aggregate(CASUALTIES ~ EVTYPE, data = storm2, sum)
storm4 <- storm3[order(-storm3$CASUALTIES),][1:10,]

Similarly we create a variable ‘PDMG’ in the storm dataset which multiplies the property damage (PROPDMG) variable by the relevant unit found in the ‘PROPDMGEXP’ variable (‘H’ or ‘h’ = 100, ‘K’ or ‘k’ = 1000, ‘M’ or ‘m’ = 100000, ‘B’ or ‘b’ = 1000000000). We perform the same transformation for crop damage and create a new variable CDMG. Finally we arrive at a measure of total economic damage TDMG (in ’000 $) by adding PDMG and CDMG

storm$PDMG[storm$PROPDMG==0] <- 0  
storm$PDMG[storm$PROPDMGEXP=="H" | storm$PROPDMGEXP=="h"] <-
  100*storm$PROPDMG[storm$PROPDMGEXP=="H" | storm$PROPDMGEXP=="h"]

storm$PDMG[storm$PROPDMGEXP=="K" | storm$PROPDMGEXP=="k"] <-
  1000*storm$PROPDMG[storm$PROPDMGEXP=="K" | storm$PROPDMGEXP=="k"]

storm$PDMG[storm$PROPDMGEXP=="M" | storm$PROPDMGEXP=="m"] <-
  1000000*storm$PROPDMG[storm$PROPDMGEXP=="M" | storm$PROPDMGEXP=="m"]

storm$PDMG[storm$PROPDMGEXP=="B" | storm$PROPDMGEXP=="b"] <-
  1000000000*storm$PROPDMG[storm$PROPDMGEXP=="B" | storm$PROPDMGEXP=="b"]  

storm$CDMG[storm$CROPDMG==0] <- 0  
storm$CDMG[storm$CROPDMGEXP=="H" | storm$CROPDMGEXP=="h"]  <-
  100*storm$CROPDMG[storm$CROPDMGEXP=="H" | storm$CROPDMGEXP=="h"]

storm$CDMG[storm$CROPDMGEXP=="K" | storm$CROPDMGEXP=="k"] <-
  1000*storm$CROPDMG[storm$CROPDMGEXP=="K" | storm$CROPDMGEXP=="k"]

storm$CDMG[storm$CROPDMGEXP=="M" | storm$CROPDMGEXP=="m"] <-
  1000000*storm$CROPDMG[storm$CROPDMGEXP=="M" | storm$CROPDMGEXP=="m"]

storm$CDMG[storm$CROPDMGEXP=="B" | storm$CROPDMGEXP=="b"] <-
  1000000000*storm$CROPDMG[storm$CROPDMGEXP=="B" | storm$CROPDMGEXP=="b"]  

storm$TDMG <- (storm$PDMG + storm$CDMG)/1000

Next we aggregate the total economic damage data by event type and filter out events which have caused less than $ 1 million in damage over the years of study

storm5 <- aggregate(TDMG ~ EVTYPE, data = storm,sum)
storm5 <- storm5[storm5$TDMG >=1000,]

We perform a grouping of event types into broader categories as done earlier

storm5$EVTYPE[grep("TORNADO|WATERSPOUT",storm5$EVTYPE)]<- "TORNADO"
storm5$EVTYPE[grep("Heat|HEAT|WARM",storm5$EVTYPE)]<- "HEAT"
storm5$EVTYPE[grep("FLOOD|FLASH FLOOD|URBAN/SML STREAM FLD|Flood|Flooding",storm5$EVTYPE)]<- "FLOOD"
storm5$EVTYPE[grep("TSTM|THUNDER",storm5$EVTYPE)]<- "THUNDER STORM"
storm5$EVTYPE[grep("ICE STORM|BLIZZARD|WINTER STORM",storm5$EVTYPE)]<- "BLIZZARD"
storm5$EVTYPE[grep("COLD|WINTER WEATHER|WINTRY|Cold",storm5$EVTYPE)]<- "COLD"
storm5$EVTYPE[grep("WILD|FOREST",storm5$EVTYPE)]<- "WILDFIRE"
storm5$EVTYPE[grep("HURRICANE|TYPHOON",storm5$EVTYPE)]<- "HURRICANE"
storm5$EVTYPE[grep("DENSE FOG",storm5$EVTYPE)]<- "FOG"
storm5$EVTYPE[grep("WIND|WINDS|STORM SURGE",storm5$EVTYPE)]<- "WIND"
storm5$EVTYPE[grep("SURF|Surf",storm5$EVTYPE)]<- "SURF"
storm5$EVTYPE[grep("TROPICAL",storm5$EVTYPE)]<- "TROPICAL STORM"
storm5$EVTYPE[grep("RIP",storm5$EVTYPE)]<- "CURRENT"
storm5$EVTYPE[grep("GLAZE|ICE|ICY",storm5$EVTYPE)]<- "ICE"
storm5$EVTYPE[grep("SNOW",storm5$EVTYPE)]<- "SNOW"
storm5$EVTYPE[grep("HAIL",storm5$EVTYPE)]<- "HAIL"
storm5$EVTYPE[grep("RAIN|DRIZZLE|PRECIP",storm5$EVTYPE)]<- "RAIN"
storm5$EVTYPE[grep("SEAS|MARINE",storm5$EVTYPE)]<- "SEAS"
storm5$EVTYPE[grep("DUST",storm5$EVTYPE)]<- "DUST STORM"
storm5$EVTYPE[grep("FROST|FREEZE|Frost|Freeze",storm5$EVTYPE)]<- "FROST"
storm5$EVTYPE[grep("LIGHTNING",storm5$EVTYPE)]<- "LIGHTNING"

We aggregate the total economic data by event type and extract the top ten sources of economic damage

storm6 <- aggregate(TDMG ~ EVTYPE, data = storm5, sum)
storm7 <- storm6[order(-storm6$TDMG),][1:10,]

Results

We create below a barplot showing the impact of the most harmful events on population health. From the plot we can see that tornadoes followed by heat are the two biggest harmful events

plot1<-ggplot(storm4, aes(x = reorder(factor(EVTYPE),-CASUALTIES), y = CASUALTIES))+geom_bar(stat = "identity", fill = "red", width = 0.5)+
xlab("Event Type") + ylab("Total Casualties") + ggtitle("Impact of harmful events on population health")

plot1 <- plot1+theme_bw()+theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+theme(axis.text.x = element_text(angle = 90, hjust = 1,size = 10))

plot1

We create below a barplot showing the impact of the most harmful events on the economy.From the plot we can see that floods followed by hurricanes are the most harmful events to the economy.

plot2<-ggplot(storm7, aes(x = reorder(factor(EVTYPE),-TDMG), y = TDMG))+geom_bar(stat = "identity", fill = "blue", width = 0.5)+
  xlab("Event Type") + ylab("Total Damage ('000 $") + ggtitle("Impact of harmful events on economy")

plot2 <- plot2+theme_bw()+theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+theme(axis.text.x = element_text(angle = 90, hjust = 1,size = 10))

plot2