Synopsis

Weather events across the US can have significant effects on population health and can cause costly damage.

Data from the National Oceanic and Atmospheric Administration’s (NOAA) storm database, collected between 1950 and 2011, was analyzed to identify which weather events had the greatest impact on health and which caused the highest property and crop damage costs.

Of the weather events, tornadoes cause by far the largest number of injuries and fatalities. Other weather events that cause large numbers of injuries include excessive heat, flooding, and lightning and wind from thunderstorms. Other weather events that cause large numbers of fatalities include heat, flash floods, and lightning.

Weather events that result in the highest damage costs to properties include floods, hurricanes, storm surge, and tornadoes. Events that result in the highest damage costs to crops include droughts, floods, hurricanes, and ice storms.

The analysis and results are described in the following report.

Questions

Two questions were asked:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

Data Processing

The original NOAA data was downloaded and stored locally on the machine where the analysis was done. It was first explored to understand what it contained and to identify missing data. Two subsets of the data were then created: one covering effects on population health, and one covering costs associated with property and crop damage. Each subset was cleaned to remove missing data. For the first analysis this included weather events with no recorded effects on population health, and for the second analysis it included events with no recorded property or crop damage costs.

Get the data

The Storm Data, originally from the National Oceanic and Atmospheric Administration’s (NOAA) storm database, was downloaded from the course site and saved locally as StormData.bz2.

    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "StormData.bz2", method="curl")
    stormdata <- read.csv(bzfile("StormData.bz2"))
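Because the compressed file is large, it can help to skip the download when a local copy already exists. The following is a minimal sketch of such a guard, not part of the original analysis; `fetch_once` and its `fetch` argument are hypothetical names introduced here for illustration.

```r
# Hypothetical helper: download a file only if it is not already on disk.
# `fetch` defaults to utils::download.file and can be swapped out for testing.
fetch_once <- function(url, dest, fetch = utils::download.file) {
  if (!file.exists(dest)) {
    fetch(url, dest, mode = "wb")  # binary mode keeps the .bz2 archive intact
  }
  invisible(dest)
}

# Usage, with the URL and file name from the report:
# fetch_once("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#            "StormData.bz2")
# stormdata <- read.csv(bzfile("StormData.bz2"))
```

Rerunning the report then reuses the saved file instead of downloading it again.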

Explore the data

The data was first explored to see what we were working with. It contains 902297 records and 37 variables.

dim(stormdata)
## [1] 902297     37
names(stormdata)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Check for missing data

Some variables contain missing values (NA), but none of the variables we are interested in do. However, see the Subset sections below.

apply(is.na(stormdata),2,sum)
##    STATE__   BGN_DATE   BGN_TIME  TIME_ZONE     COUNTY COUNTYNAME 
##          0          0          0          0          0          0 
##      STATE     EVTYPE  BGN_RANGE    BGN_AZI BGN_LOCATI   END_DATE 
##          0          0          0          0          0          0 
##   END_TIME COUNTY_END COUNTYENDN  END_RANGE    END_AZI END_LOCATI 
##          0          0     902297          0          0          0 
##     LENGTH      WIDTH          F        MAG FATALITIES   INJURIES 
##          0          0     843563          0          0          0 
##    PROPDMG PROPDMGEXP    CROPDMG CROPDMGEXP        WFO STATEOFFIC 
##          0          0          0          0          0          0 
##  ZONENAMES   LATITUDE  LONGITUDE LATITUDE_E LONGITUDE_    REMARKS 
##          0         47          0         40          0          0 
##     REFNUM 
##          0
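To make that check explicit, the NA count can be restricted to the columns the analysis actually uses. This is a minimal sketch on a made-up stand-in for `stormdata`; the toy values and the name `vars_of_interest` are assumptions for illustration only.

```r
# Toy stand-in for stormdata; the values are made up for illustration.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD"),
  F          = c(3, NA, NA),   # F has missing values, as in the real data
  FATALITIES = c(1, 0, 2),
  INJURIES   = c(4, 0, 0),
  PROPDMG    = c(25, 0, 1.5),
  PROPDMGEXP = c("K", "", "M"),
  CROPDMG    = c(0, 0, 10),
  CROPDMGEXP = c("", "", "K")
)

# Columns used later in the analysis
vars_of_interest <- c("EVTYPE", "FATALITIES", "INJURIES",
                      "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")

# Count NAs per column, looking only at the columns of interest
na_counts <- colSums(is.na(toy[, vars_of_interest]))
print(na_counts)
```

The same two lines applied to `stormdata` would give one count per analysis column.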

Subset the data needed to answer the first question

The relevant columns were gathered from the original data for effects related to population health (FATALITIES, INJURIES), in addition to the weather event (EVTYPE), to create a subset of the data for our first analysis.

# Get the subset of the data we need
healthtype <- stormdata[,c("EVTYPE","FATALITIES","INJURIES")]

Clean the data subset

Though there were no relevant NAs in the population health data subset, there were many records that recorded no fatalities or injuries. These records were converted to NA and removed from the analysis.

healthtype[healthtype==0] <- NA # treat 0 values as NA

Fatalities and injuries were then summed separately by event type, giving one data set for each so they could be analyzed independently. The NAs were dropped from each.

fatalities <- aggregate(FATALITIES ~ EVTYPE, healthtype, sum, na.rm=TRUE)
injuries <- aggregate(INJURIES ~ EVTYPE, healthtype, sum, na.rm=TRUE)
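The effect of converting zeros to NA before aggregating can be seen on a small example. This sketch uses made-up records, not the NOAA data, and relies on the formula interface of `aggregate`, which omits rows with NA in the variables named in the formula by default.

```r
# Made-up records for illustration only
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HAIL", "FLOOD"),
  FATALITIES = c(2, 0, 0, 1),
  INJURIES   = c(10, 5, 0, 0)
)

toy[toy == 0] <- NA  # treat 0 values as NA, as in the report

# Rows with NA in the summed column are dropped before summing,
# so event types with no recorded harm disappear from the result.
toy_fatalities <- aggregate(FATALITIES ~ EVTYPE, toy, sum)
toy_injuries   <- aggregate(INJURIES ~ EVTYPE, toy, sum)

print(toy_fatalities)
print(toy_injuries)
```

Here HAIL drops out of both results, and FLOOD drops out of the injuries result only.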

Extract the most harmful weather events on population health

Since many weather events produce few fatalities or injuries, and we are interested in the most harmful, a lower limit was set on fatalities and injuries so that only events with higher totals were included in the analysis. The limits were set arbitrarily: 100 or more fatalities, and 500 or more injuries.

# get the events with 100 or more fatalities
high_fatalities <- fatalities[which(fatalities$FATALITIES>=100), ]
# get the event types with high fatalities, in the same row order
# (as.character works whether EVTYPE is a factor or a character column)
high_fatality_events <- as.character(high_fatalities$EVTYPE)
# get the events with 500 or more injuries
high_injuries <- injuries[which(injuries$INJURIES>=500), ]
# get the event types with high injuries
high_injury_events <- as.character(high_injuries$EVTYPE)
head(high_injuries)
##            EVTYPE INJURIES
## 3        BLIZZARD      805
## 20 EXCESSIVE HEAT     6525
## 28    FLASH FLOOD     1777
## 30          FLOOD     6789
## 33            FOG      734
## 45           HAIL     1361
head(high_fatalities)
##                     EVTYPE FATALITIES
## 2                AVALANCHE        224
## 4                 BLIZZARD        101
## 26          EXCESSIVE HEAT       1903
## 30            EXTREME COLD        160
## 31 EXTREME COLD/WIND CHILL        125
## 35             FLASH FLOOD        978
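The `head()` output lists events alphabetically, so the largest totals are not necessarily visible. A minimal sketch of applying the threshold and then ranking by count is shown here; the first four rows reuse totals from the output above, and the last row is made up to show the filter at work.

```r
# Fatality totals copied from the output above, plus one made-up low-count row
toy_fatalities <- data.frame(
  EVTYPE     = c("AVALANCHE", "BLIZZARD", "EXCESSIVE HEAT",
                 "FLASH FLOOD", "MINOR EVENT (made up)"),
  FATALITIES = c(224, 101, 1903, 978, 12)
)

threshold <- 100  # same lower limit as in the report

# Keep events at or above the threshold, then sort from most to fewest
high   <- toy_fatalities[toy_fatalities$FATALITIES >= threshold, ]
ranked <- high[order(high$FATALITIES, decreasing = TRUE), ]
print(ranked)
```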

Examine the outliers

Initial exploration of the data showed that tornadoes are obvious outliers, with roughly 13 times the injuries and 3 times the fatalities (91346 injuries, 5633 fatalities) of the next closest weather events (6957 injuries, 1903 fatalities). The outliers were left in the data, as they represent the most significant threat to population health.

# get the obvious outlier values (Tornadoes)
max(high_injuries$INJURIES, na.rm=TRUE)
## [1] 91346
max(high_fatalities$FATALITIES, na.rm=TRUE)
## [1] 5633
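How far the largest value stands from the rest can be expressed as the ratio of the largest total to the second largest. This is a small sketch using totals quoted in this report; `top_two_ratio` is a hypothetical helper name.

```r
# Hypothetical helper: ratio of the largest value to the second largest.
# sort() drops NA values by default.
top_two_ratio <- function(x) {
  s <- sort(x, decreasing = TRUE)
  s[1] / s[2]
}

# Injury and fatality totals quoted in the report
injury_totals   <- c(91346, 6957, 6789, 6525)
fatality_totals <- c(5633, 1903, 978, 224)

injury_ratio   <- top_two_ratio(injury_totals)
fatality_ratio <- top_two_ratio(fatality_totals)
print(round(c(injuries = injury_ratio, fatalities = fatality_ratio), 2))
```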

Subset the data needed to answer the second question

The relevant columns needed to answer the second question were extracted from the original data set to analyze the cost of damage resulting from weather events. These columns were EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP. The two EXP columns hold a multiplier code for the corresponding damage amount (K = 1,000, M = 1,000,000, B = 1,000,000,000).

Extract the data for property and crop damage

From the original data, the relevant columns were extracted into a subset with which to analyze damage costs due to weather events.

exptype <- stormdata[,c("EVTYPE","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]

Clean the damage expense data

The damage cost data was then cleaned by replacing invalid values in PROPDMGEXP and CROPDMGEXP (the damage exponent codes) with NA so they could be removed during the analysis.

library(car)
exptype$PROPDMGEXP<-recode(exptype$PROPDMGEXP,"c('1','2','3','4','5','6','7','8','9','0','-','+','?','', 'NA')=NA")
exptype$CROPDMGEXP<-recode(exptype$CROPDMGEXP,"c('0','1','2','?','', 'NA')=NA")

Separate out the property damage data for its own analysis

The property damage data was separated out from the total damage cost data, so property damage could be analyzed on its own.

#options(scipen=999) #disable scientific notation
propertydmg <- aggregate(PROPDMG ~ EVTYPE+PROPDMGEXP, exptype, sum, na.rm=TRUE)
# replace K,M,B,H with equivalent numeric values to 
# be used in calculating damage expense
library(plyr)
propertydmg$PROPDMGEXP<-revalue(propertydmg$PROPDMGEXP, c("K"=1000,"m"=1000000, "M"=1000000, "B"=1000000000, "h"=100, "H"=100))
propertydmg$PROPDMGEXP<-as.numeric(levels(propertydmg$PROPDMGEXP))[propertydmg$PROPDMGEXP]
propertydmg$total <- propertydmg$PROPDMGEXP * propertydmg$PROPDMG
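The same exponent conversion can also be sketched in base R with a named lookup vector, which avoids the factor-to-numeric step. This is an alternative illustration on made-up rows, not the method used in this report; `multipliers` and the toy data are assumptions.

```r
# Multiplier for each exponent code (H = hundreds, K = thousands,
# M = millions, B = billions)
multipliers <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Made-up rows for illustration; "m" is lower case and "?" is an invalid code
toy <- data.frame(
  EVTYPE     = c("FLOOD", "HAIL", "TORNADO", "FLOOD", "HAIL"),
  PROPDMG    = c(122.5, 1.8, 250, 3, 7),
  PROPDMGEXP = c("B", "B", "K", "m", "?"),
  stringsAsFactors = FALSE
)

# Look up each code by name; codes not in the table come back as NA
toy$multiplier <- unname(multipliers[toupper(toy$PROPDMGEXP)])
toy$total      <- toy$PROPDMG * toy$multiplier
print(toy)
```

Invalid codes end up with an NA total, so they can be dropped in the same way as the recoded values above.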

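Because we are only interested in the weather events that cause the most property damage, a lower limit of 500,000,000 was set on the damage expense. Since the data was aggregated by both event type and exponent code, the limit applies to each combination of the two.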

options(scipen=999) #disable scientific notation
high_propertydmg <-propertydmg[propertydmg$total>=500000000, ]
head(high_propertydmg)
##                      EVTYPE PROPDMGEXP PROPDMG        total
## 1               FLASH FLOOD 1000000000     1.0   1000000000
## 2                     FLOOD 1000000000   122.5 122500000000
## 3                      HAIL 1000000000     1.8   1800000000
## 4 HEAVY RAIN/SEVERE WEATHER 1000000000     2.5   2500000000
## 5                 HIGH WIND 1000000000     1.3   1300000000
## 6                 HURRICANE 1000000000     5.7   5700000000

Separate out the crop damage data for its own analysis

The crop damage data was separated out from the total damage cost data, so crop damage could be analyzed on its own.

cropdmg <- aggregate(CROPDMG ~ EVTYPE + CROPDMGEXP, exptype, sum, na.rm=TRUE) 
# replace K,M,B with equivalent numeric values to 
# be used in calculating damage expense
library(plyr)
cropdmg$CROPDMGEXP<-revalue(cropdmg$CROPDMGEXP, c("K"=1000,"k"=1000, "m"=1000000, "M"=1000000, "B"=1000000000))
cropdmg$CROPDMGEXP<-as.numeric(levels(cropdmg$CROPDMGEXP))[cropdmg$CROPDMGEXP]
cropdmg$total <- cropdmg$CROPDMGEXP * as.numeric(cropdmg$CROPDMG)

Because we are only interested in the weather events that cause the most crop damage, a lower limit of 500,000,000 was set on the damage expense. As with property damage, the limit applies to each combination of event type and exponent code.

high_cropdmg <-cropdmg[cropdmg$total>=500000000, ]
head(high_cropdmg)
##                EVTYPE CROPDMGEXP   CROPDMG       total
## 1             DROUGHT 1000000000      1.50  1500000000
## 4   HURRICANE/TYPHOON 1000000000      1.51  1510000000
## 5           ICE STORM 1000000000      5.00  5000000000
## 6         RIVER FLOOD 1000000000      5.00  5000000000
## 52               HAIL       1000 576707.45   576707450
## 153           DROUGHT    1000000  12451.12 12451120000
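In the output above, DROUGHT appears twice because the totals are kept per event type and exponent code. If a single total per event type is wanted, the rows can be summed again by EVTYPE. This is a sketch of that extra step, which the report itself does not perform; it uses totals copied from the output above.

```r
# Totals copied from the output above (DROUGHT appears under two exponent codes)
toy_cropdmg <- data.frame(
  EVTYPE = c("DROUGHT", "ICE STORM", "HAIL", "DROUGHT"),
  total  = c(1500000000, 5000000000, 576707450, 12451120000)
)

# Sum across exponent groups to get one total per event type
crop_by_event <- aggregate(total ~ EVTYPE, toy_cropdmg, sum)

# Sort from most to least costly
crop_by_event <- crop_by_event[order(crop_by_event$total, decreasing = TRUE), ]
print(format(crop_by_event, big.mark = ","))
```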

Results

Effects of Weather Events on Population Health

The number of injuries and fatalities for each weather event type is presented in the following figure. The most notable result is for tornadoes, which cause many more injuries and fatalities than other types of weather events.

par(mfrow=c(1,2), mar=c(8,4,4,0))

options("scipen"=100, "digits"=4) # prevent exponent labels for y axis
inj<-barplot(high_injuries$INJURIES, 
             width=0.7, 
             ylim=c(0,max(high_injuries$INJURIES+10000)), 
             col="steelblue4", 
             main="Injuries", 
             xlab="", 
             ylab="Number of Injuries"
             )
lab<-c(high_injury_events)
text(x = inj, 
     y = high_injuries$INJURIES, 
     label = high_injuries$INJURIES, 
     pos = 3, 
     cex = 0.5, 
     col = "red", 
     srt=90)
axis(1, 
     at=inj[c(1:length(high_injury_events))],
     labels=lab, 
     las=2, 
     cex.axis=0.5)

fat<-barplot(high_fatalities$FATALITIES, 
             width=0.7, 
             ylim=c(0,max(high_fatalities$FATALITIES+1000)),  
             col="steelblue4", 
             main="Fatalities", 
             xlab="", 
             ylab="Number of Fatalities")
lab<-c(high_fatality_events)
text(x = fat, 
     y = high_fatalities$FATALITIES, 
     label = high_fatalities$FATALITIES, 
     pos = 3, 
     cex = 0.5, 
     col = "red", 
     las=2, 
     srt=90)
axis(1, 
     at=fat[c(1:length(high_fatality_events))],
     labels=lab, 
     las=2, 
     cex.axis=0.5)
title("Effects of Weather Events on Population Health", outer=TRUE, line = -1)

Figure 1: These two bar plots show a much larger number of fatalities and injuries for tornado events.
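Bars drawn in alphabetical order can make the ranking harder to read. One option, sketched here as an alternative presentation rather than the plot used in this report, is to sort the counts before plotting. The injury totals are taken from figures quoted in this report.

```r
# Injury totals quoted in the report, as a named vector
counts <- c("TORNADO" = 91346, "EXCESSIVE HEAT" = 6525,
            "FLOOD" = 6789, "BLIZZARD" = 805)

# Sort from largest to smallest so the bars appear in rank order
ordered_counts <- sort(counts, decreasing = TRUE)

# barplot() uses the vector names as bar labels and returns the bar midpoints
mids <- barplot(ordered_counts,
                las = 2,            # rotate the labels
                cex.names = 0.6,    # shrink the labels to fit
                col = "steelblue4",
                ylab = "Number of Injuries")
```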

Property Damage Costs Due to Weather Events

Property damage expense is presented in the following figure. The most significant damage to property is caused by floods, hurricanes, storm surge, and tornadoes.

library(ggplot2)
prop_cost<-ggplot(data=high_propertydmg, 
                    aes(x=EVTYPE, y=total)) +
    geom_bar(stat="identity") + 
        xlab("Weather Event Type") + 
        ggtitle("The Property Damage Cost of Severe Weather in the US") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
        scale_y_continuous(name = "Damage (billions of US dollars)", 
                           labels = function(x) x / 1e9) # label each axis break in billions
prop_cost

Figure 2: This bar plot shows floods, hurricanes, storm surge, and tornadoes as resulting in the most property damage.

Crop Damage Costs Due to Weather Events

Crop damage expense is presented in the following figure. The most significant damage to crops is caused by floods, drought, and ice storms.

crop_cost <- ggplot(data=high_cropdmg, 
                    aes(x=EVTYPE, y=total)) +
    geom_bar(stat="identity") + 
        xlab("Weather Event Type") + 
        ggtitle("The Crop Damage Cost of Severe Weather in the US") + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
        scale_y_continuous(name = "Damage (billions of US dollars)", 
                           labels = function(x) x / 1e9) # label each axis break in billions

crop_cost

Figure 3: This bar plot shows droughts, floods, and ice storms as resulting in the most crop damage.