Weather events across the US can have significant effects on population health and cause costly damage.
Data from the National Oceanic and Atmospheric Administration’s (NOAA) storm database, collected between 1950 and 2011, was analyzed to identify the weather events with the greatest impact on health and the highest costs from property and crop damage.
Of the weather events, tornadoes result in the greatest number of injuries and fatalities, far exceeding other types of weather events. Other weather events that cause large numbers of injuries include excessive heat, flooding, and lightning and wind from thunderstorms. Other weather events that cause large numbers of fatalities include heat, flash floods, and lightning.
Weather events that result in the highest damage costs to properties include floods, hurricanes, storm surge, and tornadoes. Events that result in the highest damage costs to crops include droughts, floods, hurricanes, and ice storms.
The analysis and results are described in the following report.
Two questions were asked:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The original NOAA data was downloaded and stored locally on the machine where the analysis was done. It was first explored to understand its contents, and missing data was identified. Two subsets were created: one covering effects on population health, and one covering costs associated with property and crop damage. Each subset was cleaned to remove missing data. For the first analysis this included records for weather events with no effect on population health, and for the second, records with no property or crop damage costs.
The Storm Data, originally from the National Oceanic and Atmospheric Administration’s (NOAA) storm database, was downloaded from the course site and saved locally in StormData.bz2.
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "StormData.bz2", method="curl")
stormdata <- read.csv(bzfile("StormData.bz2"))
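Because the compressed file is large, repeating the download on every run is wasteful. The following is a minimal sketch of guarding the download with `file.exists`. The helper name `fetch_if_missing` and its `fetch` argument are assumptions introduced here for illustration (the argument exists only so the helper can be tried without network access), and the URL in the demonstration is a placeholder.

```r
# Hypothetical helper: download `url` to `dest` only when `dest` is absent.
# `fetch` defaults to download.file; it is a parameter purely for illustration.
fetch_if_missing <- function(url, dest, fetch = download.file) {
  if (!file.exists(dest)) {
    fetch(url, dest)
    return(TRUE)   # a download was attempted
  }
  FALSE            # file already present, nothing done
}

# Demonstration with a stand-in fetcher that just writes a small file
fake_fetch <- function(url, dest) writeLines("placeholder", dest)
tmp <- tempfile(fileext = ".bz2")
first  <- fetch_if_missing("https://example.invalid/data.bz2", tmp, fetch = fake_fetch)
second <- fetch_if_missing("https://example.invalid/data.bz2", tmp, fetch = fake_fetch)
```

In this sketch the first call writes the file and the second call does nothing, so re-running the report would reuse the local copy.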
The data was first explored to see what it contained. There are 902297 records and 37 variables.
dim(stormdata)
## [1] 902297 37
names(stormdata)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
Some data is missing (NA), but not in the variables we’re interested in. However, each subset described below needed further cleaning.
apply(is.na(stormdata),2,sum)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME
## 0 0 0 0 0 0
## STATE EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE
## 0 0 0 0 0 0
## END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI END_LOCATI
## 0 0 902297 0 0 0
## LENGTH WIDTH F MAG FATALITIES INJURIES
## 0 0 843563 0 0 0
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC
## 0 0 0 0 0 0
## ZONENAMES LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## 0 47 0 40 0 0
## REFNUM
## 0
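The same per-column count can be obtained with `colSums`, and filtering the result makes the affected columns easier to spot. This is a small sketch on a made-up data frame, not the storm data itself; the values are assumptions chosen only to show the pattern.

```r
# Toy data frame standing in for the storm data (values are made up)
toy <- data.frame(
  EVTYPE   = c("FLOOD", "HAIL", "TORNADO"),
  F        = c(NA, NA, 2),
  LATITUDE = c(3040, NA, 3512)
)

# Count NAs per column; equivalent to apply(is.na(toy), 2, sum)
na_counts <- colSums(is.na(toy))

# Keep only the columns that actually contain missing values
na_counts[na_counts > 0]
```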
The relevant columns were gathered from the original data for effects related to population health (FATALITIES, INJURIES), in addition to the weather event (EVTYPE), to create a subset of the data for our first analysis.
# Get the subset of the data we need
healthtype <- stormdata[,c("EVTYPE","FATALITIES","INJURIES")]
Though there were no relevant NAs in the population health data subset, there were many records that recorded no fatalities or injuries. These records were converted to NA and removed from the analysis.
healthtype[healthtype==0] <- NA # treat 0 values as NA
The population health data was split into fatalities and injuries subsets, creating their own data sets for independent analyses, removing the NAs from each.
fatalities <- aggregate(FATALITIES ~ EVTYPE, healthtype, sum, na.rm=TRUE)
injuries <- aggregate(INJURIES ~ EVTYPE, healthtype, sum, na.rm=TRUE)
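To make the effect of the zero-to-NA conversion concrete, here is a small sketch on made-up records. With the formula interface, `aggregate` omits rows containing NA by default, so an event type whose records are all zero disappears from the result.

```r
# Toy records (made-up values); HAIL never causes a fatality here
toy <- data.frame(
  EVTYPE     = c("FLOOD", "FLOOD", "HAIL", "TORNADO"),
  FATALITIES = c(2, 0, 0, 5),
  stringsAsFactors = TRUE
)

toy$FATALITIES[toy$FATALITIES == 0] <- NA   # treat 0 as missing

# The formula interface drops NA rows (na.action = na.omit) before summing
toy_fatalities <- aggregate(FATALITIES ~ EVTYPE, toy, sum)
toy_fatalities
```

Only FLOOD and TORNADO remain in `toy_fatalities`, which mirrors how event types with no recorded harm drop out of the health subsets.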
Since many weather events produce few fatalities or injuries, and we are interested in the most harmful, lower limits were set so that only event types with high totals were included in the analysis. The limits are arbitrary: at least 100 fatalities, and at least 500 injuries.
# get the events with at least 100 fatalities
high_fatalities <- fatalities[which(fatalities$FATALITIES >= 100), ]
# get the event types with high fatalities
high_fatality_events <- levels(droplevels(high_fatalities$EVTYPE))
# get the events with at least 500 injuries
high_injuries <- injuries[which(injuries$INJURIES >= 500), ]
# get the event types with high injuries
high_injury_events <- levels(droplevels(high_injuries$EVTYPE))
head(high_injuries)
## EVTYPE INJURIES
## 3 BLIZZARD 805
## 20 EXCESSIVE HEAT 6525
## 28 FLASH FLOOD 1777
## 30 FLOOD 6789
## 33 FOG 734
## 45 HAIL 1361
head(high_fatalities)
## EVTYPE FATALITIES
## 2 AVALANCHE 224
## 4 BLIZZARD 101
## 26 EXCESSIVE HEAT 1903
## 30 EXTREME COLD 160
## 31 EXTREME COLD/WIND CHILL 125
## 35 FLASH FLOOD 978
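Since `head` returns the first rows in alphabetical order of event type, sorting by the count is a quick way to see the most harmful events first. The sketch below uses only the six `high_injuries` rows printed above, so it ranks that sample rather than the full data set.

```r
# The six rows shown in the high_injuries output above
sample_injuries <- data.frame(
  EVTYPE   = c("BLIZZARD", "EXCESSIVE HEAT", "FLASH FLOOD", "FLOOD", "FOG", "HAIL"),
  INJURIES = c(805, 6525, 1777, 6789, 734, 1361)
)

# Order rows from most to fewest injuries
ranked <- sample_injuries[order(sample_injuries$INJURIES, decreasing = TRUE), ]
head(ranked, 3)
```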
Initial exploration of the data showed obvious outliers for tornadoes, with roughly 13 times the injuries and 3 times the fatalities (91346 injuries, 5633 fatalities) of the next closest weather events (6957 injuries, 1903 fatalities). The outliers have been left in the data, as they represent the most significant threat to population health.
# get the obvious outlier values (Tornadoes)
max(high_injuries$INJURIES, na.rm=TRUE)
## [1] 91346
max(high_fatalities$FATALITIES, na.rm=TRUE)
## [1] 5633
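The gap between tornadoes and the next closest events can be quantified directly from the totals quoted above. A short sketch of that arithmetic (the variable names are illustrative):

```r
# Totals quoted in the text
tornado    <- c(injuries = 91346, fatalities = 5633)
next_worst <- c(injuries = 6957,  fatalities = 1903)

# How many times larger the tornado totals are
ratio <- tornado / next_worst
round(ratio, 1)
```

The injury total is more than an order of magnitude larger, while the fatality total is closer to three times larger.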
The columns needed to answer the second question were extracted from the original data set into a subset for analyzing the cost of damage resulting from weather events. These columns were EVTYPE, PROPDMG, PROPDMGEXP (K = 1,000; M = 1,000,000; B = 1,000,000,000), CROPDMG, and CROPDMGEXP.
exptype <- stormdata[,c("EVTYPE","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
The damage cost data was then cleaned by replacing invalid values in PROPDMGEXP and CROPDMGEXP (the magnitude codes for property and crop damage) with NA so they could be removed during the analysis.
library(car)
exptype$PROPDMGEXP<-recode(exptype$PROPDMGEXP,"c('1','2','3','4','5','6','7','8','9','0','-','+','?','', 'NA')=NA")
exptype$CROPDMGEXP<-recode(exptype$CROPDMGEXP,"c('0','1','2','?','', 'NA')=NA")
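An alternative to listing every invalid code is to keep an allow-list of meaningful codes and set everything else to NA. The sketch below uses base R on a made-up vector of raw codes. It is not the approach used in this report, and the set of valid codes is an assumption based on the K, M, B, and H values handled here.

```r
# Made-up sample of raw exponent codes, including junk values
raw_exp <- c("K", "k", "M", "B", "", "?", "5", "h", "+")

valid_codes <- c("H", "K", "M", "B")          # assumed set of meaningful codes
clean_exp <- toupper(raw_exp)                 # fold "k" into "K", "h" into "H"
clean_exp[!clean_exp %in% valid_codes] <- NA  # everything else becomes NA

clean_exp
```

An allow-list has the advantage that an unexpected code not seen during exploration is still treated as invalid.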
The property damage data was separated out from the total damage cost data, so property damage could be analyzed on its own.
#options(scipen=999) #disable scientific notation
propertydmg <- aggregate(PROPDMG ~ EVTYPE+PROPDMGEXP, exptype, sum, na.rm=TRUE)
# replace K,M,B,H with equivalent numeric values to
# be used in calculating damage expense
library(plyr)
propertydmg$PROPDMGEXP<-revalue(propertydmg$PROPDMGEXP, c("K"=1000,"m"=1000000, "M"=1000000, "B"=1000000000, "h"=100, "H"=100))
propertydmg$PROPDMGEXP<-as.numeric(levels(propertydmg$PROPDMGEXP))[propertydmg$PROPDMGEXP]
propertydmg$total <- propertydmg$PROPDMGEXP * propertydmg$PROPDMG
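The conversion from magnitude code to dollar amount can also be expressed as a lookup in a named numeric vector. This is a minimal sketch on made-up rows, assuming the same multipliers as above; the names `multiplier` and `toy` are illustrative.

```r
# Assumed multipliers, matching the K/M/B/H values used above
multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Made-up rows for illustration
toy <- data.frame(
  PROPDMG    = c(2.5, 10, 1.3, 4),
  PROPDMGEXP = c("K", "m", "B", "h"),
  stringsAsFactors = FALSE
)

# Index the lookup by upper-cased code, then scale the damage figure
toy$total <- toy$PROPDMG * unname(multiplier[toupper(toy$PROPDMGEXP)])
toy$total
```

Indexing by name avoids the factor-level round trip, and upper-casing first means lower-case variants need no separate entries.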
Because we are only interested in the weather events that cause the most property damage, a lower limit (500,000,000) was set on the damage expense.
options(scipen=999) #disable scientific notation
high_propertydmg <-propertydmg[propertydmg$total>=500000000, ]
head(high_propertydmg)
## EVTYPE PROPDMGEXP PROPDMG total
## 1 FLASH FLOOD 1000000000 1.0 1000000000
## 2 FLOOD 1000000000 122.5 122500000000
## 3 HAIL 1000000000 1.8 1800000000
## 4 HEAVY RAIN/SEVERE WEATHER 1000000000 2.5 2500000000
## 5 HIGH WIND 1000000000 1.3 1300000000
## 6 HURRICANE 1000000000 5.7 5700000000
The crop damage data was separated out from the total damage cost data, so crop damage could be analyzed on its own.
cropdmg <- aggregate(CROPDMG ~ EVTYPE + CROPDMGEXP, exptype, sum, na.rm=TRUE)
# replace K,M,B with equivalent numeric values to
# be used in calculating damage expense
library(plyr)
cropdmg$CROPDMGEXP<-revalue(cropdmg$CROPDMGEXP, c("K"=1000,"k"=1000, "m"=1000000, "M"=1000000, "B"=1000000000))
cropdmg$CROPDMGEXP<-as.numeric(levels(cropdmg$CROPDMGEXP))[cropdmg$CROPDMGEXP]
cropdmg$total <- cropdmg$CROPDMGEXP * as.numeric(cropdmg$CROPDMG)
Because we are only interested in the weather events that cause the most crop damage, a lower limit (500,000,000) was set on the damage expense.
high_cropdmg <-cropdmg[cropdmg$total>=500000000, ]
head(high_cropdmg)
## EVTYPE CROPDMGEXP CROPDMG total
## 1 DROUGHT 1000000000 1.50 1500000000
## 4 HURRICANE/TYPHOON 1000000000 1.51 1510000000
## 5 ICE STORM 1000000000 5.00 5000000000
## 6 RIVER FLOOD 1000000000 5.00 5000000000
## 52 HAIL 1000 576707.45 576707450
## 153 DROUGHT 1000000 12451.12 12451120000
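Because the aggregation groups by both event type and magnitude code, one event type can appear in several rows, as DROUGHT does in the output above. A single total per event type can be obtained with a second aggregation. The sketch below uses the two DROUGHT rows and the HAIL row from that output.

```r
# The two DROUGHT rows from the output above, plus HAIL for contrast
rows <- data.frame(
  EVTYPE = c("DROUGHT", "DROUGHT", "HAIL"),
  total  = c(1500000000, 12451120000, 576707450)
)

# One total per event type, regardless of magnitude code
per_event <- aggregate(total ~ EVTYPE, rows, sum)
per_event
```

Note that the lower limit is applied per row here, so an event type whose individual rows each fall below the limit would be excluded even if its combined total exceeded it.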
The numbers of injuries and fatalities for each weather event type are presented in the following figure. The most notable result is for tornadoes, which cause many more injuries and fatalities than other types of weather events.
par(mfrow=c(1,2), mar=c(8,4,4,0))
options("scipen"=100, "digits"=4) # prevent exponent labels for y axis
inj<-barplot(high_injuries$INJURIES,
width=0.7,
ylim=c(0,max(high_injuries$INJURIES+10000)),
col="steelblue4",
main="Injuries",
xlab="",
ylab="Number of Injuries"
)
lab<-c(high_injury_events)
text(x = inj,
y = high_injuries$INJURIES,
label = high_injuries$INJURIES,
pos = 3,
cex = 0.5,
col = "red",
srt=90)
axis(1,
at=inj[c(1:length(high_injury_events))],
labels=lab,
las=2,
cex.axis=0.5)
fat<-barplot(high_fatalities$FATALITIES,
width=0.7,
ylim=c(0,max(high_fatalities$FATALITIES+1000)),
col="steelblue4",
main="Fatalities",
xlab="",
ylab="Number of Fatalities")
lab<-c(high_fatality_events)
text(x = fat,
y = high_fatalities$FATALITIES,
label = high_fatalities$FATALITIES,
pos = 3,
cex = 0.5,
col = "red",
las=2,
srt=90)
axis(1,
at=fat[c(1:length(high_fatality_events))],
labels=lab,
las=2,
cex.axis=0.5)
title("Effects of Weather Events on Population Health", outer=TRUE, line = -1)
Figure 1: These two bar plots show far higher fatality and injury counts for tornado events than for any other event type.
Property damage expense is presented in the following figure. The most significant damage to property is caused by floods, hurricanes, storm surge, and tornadoes.
library(ggplot2)
prop_cost<-ggplot(data=high_propertydmg,
aes(x=EVTYPE, y=total)) +
geom_bar(stat="identity") +
xlab("Weather Event Type") +
ggtitle("The Property Damage Cost of Severe Weather in the US") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_y_continuous(name = "Damage (in 1,000,000,000)",
labels = function(x) x / 1e9) # derive labels from the actual breaks
prop_cost
Figure 2: This bar plot shows floods, hurricanes, storm surge, and tornadoes as resulting in the most property damage.
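Axis labels expressed in billions can be derived from the break values rather than typed by hand. This is a minimal sketch, where `billions` is a hypothetical helper name and the break values are made up for illustration rather than taken from the plot.

```r
# Hypothetical helper: express dollar amounts in billions
billions <- function(x) x / 1e9

# Example break values (assumed for illustration, not taken from the plot)
breaks <- c(0, 5e10, 1e11, 1.5e11)
billions(breaks)
```

A function like this can be passed as the `labels` argument of `scale_y_continuous`, which accepts a function of the breaks, so the labels stay correct whatever breaks ggplot2 chooses.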
Crop damage expense is presented in the following figure. The most significant damage to crops is caused by floods, drought, and ice storms.
crop_cost <- ggplot(data=high_cropdmg,
aes(x=EVTYPE, y=total)) +
geom_bar(stat="identity") +
xlab("Weather Event Type") +
ggtitle("The Crop Damage Cost of Severe Weather in the US") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_y_continuous(name = "Damage (in 1,000,000,000)",
labels = function(x) x / 1e9) # derive labels from the actual breaks
crop_cost
Figure 3: This bar plot shows droughts, floods, and ice storms as resulting in the most crop damage.