In this report we aim to investigate the impacts of different types of severe weather events across the United States with a focus on public health and national economy.
We obtained data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States from year 1950 to November 2011. This includes when and where the events occurred, as well as estimates of any fatalities, injuries and property damage.
From these data, we found that, tornado, heat and wind are most harmful with respect to population health, while flood, hurricane and storm have the greatest economic consequences.
library(knitr)
opts_chunk$set(fig.path = "Figs/", warning=FALSE, message = FALSE, echo=TRUE)
options(scipen = 999)
From the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, we obtained information of major storms and weather events across the U.S. We obtained the files from year 1950 to November 2011.
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
Download, unzip and read in the data
if (!file.exists("./data")){dir.create("./data")}
if (!file.exists("./data/NOAA.csv")){
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile = "./data/NOAA.csv", method = "curl")
}
if (!exists("NOAA")){
NOAA <- read.csv(bzfile("./data/NOAA.csv"),header = TRUE)
}
Look into some details of the data set
dim(NOAA)
## [1] 902297 37
head(NOAA,3)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
We can see that the data set contains 902297 observations of 37 variables. The events in the database start in the year 1950 and end in November 2011.
In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
Here, we reinforce the two questions we want to address:
For the data set, we choose to select the fatalities and injuries data to analyze the influence on public health, and for national economy, we would concentrate on the estimation of property and crop damages. Based on that, we subset the following data out of the original:
library(dplyr)
interestData <- NOAA %>% select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
head(interestData, 6)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
NOTES on the variables:
When we look into the event types, we can immediately find some duplicates or some types that can be generated into one type. For example:
sort(unique(interestData$EVTYPE))[1:10]
## [1] " HIGH SURF ADVISORY" " COASTAL FLOOD" " FLASH FLOOD"
## [4] " LIGHTNING" " TSTM WIND" " TSTM WIND (G45)"
## [7] " WATERSPOUT" " WIND" "?"
## [10] "ABNORMAL WARMTH"
There are several index all containing “Wind” or “Flood.” Therefore, we can group some index together.
#insert a new column for grouping
interestData$New.Event <- interestData$EVTYPE
#reorganize the event types
interestData$New.Event[grep("DROUGHT", interestData$EVTYPE, ignore.case = TRUE)] <- "DROUGHT"
interestData$New.Event[grep("COLD", interestData$EVTYPE, ignore.case = TRUE)] <- "COLD"
interestData$New.Event[grep("HAIL", interestData$EVTYPE, ignore.case = TRUE)] <- "HAIL"
interestData$New.Event[grep("HURRICANE", interestData$EVTYPE, ignore.case = TRUE)] <- "HURRICANE"
interestData$New.Event[grep("HEAT", interestData$EVTYPE, ignore.case = TRUE)] <- "HEAT"
interestData$New.Event[grep("FLOOD", interestData$EVTYPE, ignore.case = TRUE)] <- "FLOOD"
interestData$New.Event[grep("LIGHTNING", interestData$EVTYPE, ignore.case = TRUE)] <- "LIGHTNING"
interestData$New.Event[grep("RAIN", interestData$EVTYPE, ignore.case = TRUE)] <- "RAIN"
interestData$New.Event[grep("TORNADO", interestData$EVTYPE, ignore.case = TRUE)] <- "TORNADO"
interestData$New.Event[grep("WIND", interestData$EVTYPE, ignore.case = TRUE)] <- "WIND"
interestData$New.Event[grep("WINTER", interestData$EVTYPE, ignore.case = TRUE)] <- "WINTER"
interestData$New.Event[grep("WILDFIRE", interestData$EVTYPE, ignore.case = TRUE)] <- "WILDFIRE"
interestData$New.Event[grep("STORM", interestData$EVTYPE, ignore.case = TRUE)] <- "STORM"
interestData$New.Event[grep("SNOW", interestData$EVTYPE, ignore.case = TRUE)] <- "SNOW"
#To show the new data set
sort(table(interestData$New.Event), decreasing = TRUE)[1:10]
##
## HAIL WIND STORM FLOOD TORNADO SNOW
## 289270 255385 124599 82686 60699 17705
## LIGHTNING RAIN WINTER FUNNEL CLOUD
## 15760 12175 8160 6839
Then we can see much clearer of different event types.
Before we do any further processing and analyzing of the data, we first check for any missing values there.
sum(is.na(interestData$FATALITIES))
## [1] 0
sum(is.na(interestData$INJURIES))
## [1] 0
sum(is.na(interestData$PROPDMG))
## [1] 0
sum(is.na(interestData$PROPDMGEXP))
## [1] 0
sum(is.na(interestData$CROPDMG))
## [1] 0
sum(is.na(interestData$CROPDMGEXP))
## [1] 0
From above, we see there are no missing values in our data set. We can then begin our processing.
To evaluate the public health impact:
fatalData <- interestData %>% select(New.Event, FATALITIES) %>% group_by(New.Event) %>% summarise(sum.fatalities = sum(FATALITIES)) %>% arrange(-sum.fatalities)
head(fatalData, 8)
## # A tibble: 8 x 2
## New.Event sum.fatalities
## <chr> <dbl>
## 1 TORNADO 5636
## 2 HEAT 3138
## 3 FLOOD 1524
## 4 WIND 1235
## 5 LIGHTNING 817
## 6 STORM 633
## 7 RIP CURRENT 368
## 8 AVALANCHE 224
injuriesData <- interestData %>% select(New.Event, INJURIES) %>% group_by(New.Event) %>% summarise(sum.injuries = sum(INJURIES)) %>% arrange(-sum.injuries)
head(injuriesData, 8)
## # A tibble: 8 x 2
## New.Event sum.injuries
## <chr> <dbl>
## 1 TORNADO 91407
## 2 HEAT 9224
## 3 WIND 9001
## 4 FLOOD 8602
## 5 STORM 6692
## 6 LIGHTNING 5231
## 7 HAIL 1371
## 8 HURRICANE 1328
To analyze the economic impacts brought by weather events, we first need to understand the property and crop damages data. Here, we see special index we need to figure out.
paste("The characters in PROPDMGEXP include")
## [1] "The characters in PROPDMGEXP include"
sort(table(interestData$PROPDMGEXP), decreasing = TRUE)
##
## K M 0 B 5 1 2 ? m H
## 465934 424665 11330 216 40 28 25 13 8 7 6
## + 7 3 4 6 - 8 h
## 5 5 4 4 4 1 1 1
paste("The characters in CROPDMGEXP include")
## [1] "The characters in CROPDMGEXP include"
sort(table(interestData$CROPDMGEXP), decreasing = TRUE)
##
## K M k 0 B ? 2 m
## 618413 281832 1994 21 19 9 7 1 1
In the National Weather Service Storm Data Documentation, there’s one sentence explaining the index in the PROPDMGEXP and CROPDMGEXP:
Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.
Combined with the trial online, we find the index can be interpreted as follows:
To have a data frame with interpretations of these units:
economyData <- interestData %>% select(New.Event, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
#transform the PROPDMGEXP data
economyData$PROPDMGEXP[grep("[Kk]", economyData$PROPDMGEXP, ignore.case = TRUE)] <- 10^3
economyData$PROPDMGEXP[grep("[Mm]", economyData$PROPDMGEXP, ignore.case = TRUE)] <- 10^6
economyData$PROPDMGEXP[grep("[Bb]", economyData$PROPDMGEXP, ignore.case = TRUE)] <- 10^9
economyData$PROPDMGEXP[grep("[Hh]", economyData$PROPDMGEXP, ignore.case = TRUE)] <- 10^2
economyData$PROPDMGEXP[(economyData$PROPDMGEXP == "")] <- 0
economyData$PROPDMGEXP[(economyData$PROPDMGEXP == "?")] <- 0
economyData$PROPDMGEXP[(economyData$PROPDMGEXP == "-")] <- 0
economyData$PROPDMGEXP[(economyData$PROPDMGEXP == "+")] <- 1
economyData$PROPDMGEXP[(economyData$PROPDMGEXP == "0")] <- 10
economyData$PROPDMGEXP[(economyData$PROPDMGEXP == "1")] <- 10
economyData$PROPDMGEXP[(economyData$PROPDMGEXP == "2")] <- 10
economyData$PROPDMGEXP[(economyData$PROPDMGEXP == "3")] <- 10
economyData$PROPDMGEXP[(economyData$PROPDMGEXP == "4")] <- 10
economyData$PROPDMGEXP[(economyData$PROPDMGEXP == "5")] <- 10
economyData$PROPDMGEXP[(economyData$PROPDMGEXP == "6")] <- 10
economyData$PROPDMGEXP[(economyData$PROPDMGEXP == "7")] <- 10
economyData$PROPDMGEXP[(economyData$PROPDMGEXP == "8")] <- 10
#transform the CROPDMGEXP data
economyData$CROPDMGEXP[grep("[Kk]", economyData$CROPDMGEXP, ignore.case = TRUE)] <- 10^3
economyData$CROPDMGEXP[grep("[Mm]", economyData$CROPDMGEXP, ignore.case = TRUE)] <- 10^6
economyData$CROPDMGEXP[grep("[Bb]", economyData$CROPDMGEXP, ignore.case = TRUE)] <- 10^9
economyData$CROPDMGEXP[grep("[Hh]", economyData$CROPDMGEXP, ignore.case = TRUE)] <- 10^2
economyData$CROPDMGEXP[(economyData$CROPDMGEXP == "?")] <- 0
economyData$CROPDMGEXP[(economyData$CROPDMGEXP == "0")] <- 10
economyData$CROPDMGEXP[(economyData$CROPDMGEXP == "2")] <- 10
economyData$CROPDMGEXP[(economyData$CROPDMGEXP == "")] <- 0
#To show the new unit variables
paste("The units in PROPDMGEXP include")
## [1] "The units in PROPDMGEXP include"
sort(table(economyData$PROPDMGEXP), decreasing = TRUE)
##
## 10 1000 1000000 1000000000 100
## 466248 424665 11337 40 7
paste("The units in CROPDMGEXP include")
## [1] "The units in CROPDMGEXP include"
sort(table(economyData$CROPDMGEXP), decreasing = TRUE)
##
## 0 1000 1000000 10 1000000000
## 618413 281853 1995 27 9
economyData <- mutate(economyData, propCost = PROPDMG * as.numeric(PROPDMGEXP))
propDMG <- economyData %>% select(New.Event, propCost) %>% group_by(New.Event) %>% summarise(sum.PropCost = sum(propCost)) %>% arrange(-sum.PropCost)
#To show the property damages data
head(propDMG,8)
## # A tibble: 8 x 2
## New.Event sum.PropCost
## <chr> <dbl>
## 1 FLOOD 167502199413
## 2 HURRICANE 84656180010
## 3 STORM 73054022622
## 4 TORNADO 56993100717
## 5 HAIL 15733046447
## 6 WIND 12454677314.
## 7 WILDFIRE 4865614000
## 8 RAIN 3254491210
economyData <- mutate(economyData, cropCost = CROPDMG * as.numeric(CROPDMGEXP))
cropDMG<- economyData %>% select(New.Event, cropCost) %>% group_by(New.Event) %>% summarise(sum.CropCost = sum(cropCost)) %>% arrange(-sum.CropCost)
#To show the crop damages data
head(cropDMG,8)
## # A tibble: 8 x 2
## New.Event sum.CropCost
## <chr> <dbl>
## 1 DROUGHT 13972566000
## 2 FLOOD 12266906100
## 3 STORM 6406919600
## 4 HURRICANE 5505292800
## 5 HAIL 3046837650
## 6 WIND 1519029150
## 7 COLD 1409115500
## 8 FROST/FREEZE 1094086000
Now we have our processed data ready to make plots, analyze and achieve the results.
In order to show impacts on public health by different event types, we can make a barplot of sum number of fatalities and injuries from year 1950 to November 2011.
#re-format the health related data
healthResult <- inner_join(fatalData, injuriesData, by = "New.Event")
healthResult <- healthResult %>% mutate(healthResult, Total=sum.fatalities+sum.injuries) %>% arrange(-Total)
library(ggplot2)
#create the plot
ggplot(healthResult[1:8,], aes(x=New.Event, y=Total))+geom_bar(stat = "identity")+xlab("Event Type")+ylab("Number of Injuries and Fatalities")+ggtitle("Impact on Public Health by Weather Event Types Across U.S.")
head(healthResult,3)
## # A tibble: 3 x 4
## New.Event sum.fatalities sum.injuries Total
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 5636 91407 97043
## 2 HEAT 3138 9224 12362
## 3 WIND 1235 9001 10236
The plot combined with the table shows us that the top three weather events with greatest impacts on public health are tornado, heat and wind. And tornado has the obvious greatest influence.
In order to show impacts on economy by different event types, we can make a barplot of number of sum damages from year 1950 to November 2011.
economyResult <- inner_join(propDMG, cropDMG, by = "New.Event")
economyResult <- economyResult %>% mutate(economyResult, Total=sum.PropCost+sum.CropCost) %>% arrange(-Total)
library(ggplot2)
#create the plot
ggplot(economyResult[1:8,], aes(x=New.Event, y=Total))+geom_bar(stat = "identity")+xlab("Event Type")+ylab("Total Damages Cost")+ggtitle("Impact on National Economy by Weather Event Types Across U.S.")
head(economyResult,3)
## # A tibble: 3 x 4
## New.Event sum.PropCost sum.CropCost Total
## <chr> <dbl> <dbl> <dbl>
## 1 FLOOD 167502199413 12266906100 179769105513
## 2 HURRICANE 84656180010 5505292800 90161472810
## 3 STORM 73054022622 6406919600 79460942222
The plot combined with the table shows us that the top three weather events with greatest impacts on national economy are flood, hurricane and storm. And flood has the obvious greatest influence.