Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project aims to explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States. These include when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
For a detailed description of the NOAA storm database, please refer to https://www.ncdc.noaa.gov/stormevents/. Our analysis of events recorded from 1996 to 2011 suggests that tornadoes are the most harmful to population health, while floods cause the greatest economic damage in the United States.
Loading the required libraries
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(ggplot2)
Loading the data
if(!file.exists("StormData.csv.bz2")){
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile="StormData.csv.bz2", method="curl")
}
if(!file.exists("StormData.csv.bz2")){
stop("Can't locate file 'StormData.csv.bz2'!")
}
stormDataRaw <- read.csv("StormData.csv.bz2", header = TRUE, stringsAsFactors = FALSE)
Show structure of the dataset
str(stormDataRaw)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
There are 902,297 observations of 37 variables in the dataset. Only a subset is required for our analysis.
Relevant variables include the date (BGN_DATE), event type (EVTYPE), health impact (FATALITIES and INJURIES), economic damages (PROPDMG and CROPDMG), as well as their corresponding exponents (PROPDMGEXP and CROPDMGEXP).
stormData <- select(stormDataRaw, BGN_DATE, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, FATALITIES, INJURIES)
stormData$BGN_DATE <- as.Date(stormData$BGN_DATE, "%m/%d/%Y")
stormData$YEAR <- year(stormData$BGN_DATE)
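The date conversion above relies on as.Date() reading only the month/day/year part of each string and ignoring the trailing time. A minimal sketch of that behaviour follows; the first two strings come from the str() output, the third is a made-up value, and the year is extracted with base R's format() as a stand-in for lubridate's year().

```r
# Strings in the same layout as BGN_DATE; the third value is hypothetical
bgn <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00", "1/6/1996 0:00:00")

# as.Date() consumes the part matched by the format and ignores the rest
dates <- as.Date(bgn, format = "%m/%d/%Y")

# Base R stand-in for lubridate::year()
years <- as.integer(format(dates, "%Y"))
print(years)
```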
According to the NOAA storm database, the full dataset of weather events (48 event types) is only available from the year 1996. From the years 1950 to 1995, only a subset of event types is available. For a fair comparison, we will limit our dataset to observations from the years 1996 to 2011.
stormData <- filter(stormData, YEAR >= 1996)
Observations with no recorded impact on population health or the economy (no property damage, crop damage, fatalities, or injuries) are excluded from our analysis.
stormData <- filter(stormData, PROPDMG > 0 | CROPDMG > 0 | FATALITIES > 0 | INJURIES > 0)
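The two filtering steps can be illustrated on a small made-up data frame. This sketch uses base R's subset() rather than dplyr's filter(), and all values are hypothetical. A row is kept only if it falls in 1996 or later and reports at least one non-zero impact.

```r
# Hypothetical observations, one per row
toy <- data.frame(
  YEAR       = c(1995, 1996, 2005, 2011),
  PROPDMG    = c(10,   0,    2.5,  0),
  CROPDMG    = c(0,    0,    0,    0),
  FATALITIES = c(0,    0,    0,    1),
  INJURIES   = c(0,    0,    0,    0)
)

# Keep rows from 1996 onwards that have any recorded damage or casualties
kept <- subset(
  toy,
  YEAR >= 1996 &
    (PROPDMG > 0 | CROPDMG > 0 | FATALITIES > 0 | INJURIES > 0)
)
print(kept)
```

Here the 1995 row is dropped by the year filter and the 1996 row is dropped because every impact column is zero.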
The variables for economic damages, PROPDMG and CROPDMG, require adjustments. Each has a separate exponent variable, PROPDMGEXP and CROPDMGEXP respectively, whose codes need to be converted into numeric multipliers.
table(stormData$PROPDMGEXP)
##
##             B      K      M
##   8448     32 185474   7364
table(stormData$CROPDMGEXP)
##
##             B      K      M
## 102767      2  96787   1762
Both exponent variables, PROPDMGEXP and CROPDMGEXP, are converted to uppercase and then translated to their corresponding multipliers:
"", "?", "+", "-" = 1
"0" = 1
"1" = 10
"2" = 100
"3" = 1,000
"4" = 10,000
"5" = 100,000
"6" = 1,000,000
"7" = 10,000,000
"8" = 100,000,000
"9" = 1,000,000,000
"H" = 100
"K" = 1,000
"M" = 1,000,000
"B" = 1,000,000,000
stormData$PROPDMGEXP <- toupper(stormData$PROPDMGEXP)
stormData$CROPDMGEXP <- toupper(stormData$CROPDMGEXP)
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "")] <- 10^0
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "-")] <- 10^0
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "?")] <- 10^0
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "+")] <- 10^0
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "0")] <- 10^0
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "1")] <- 10^1
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "2")] <- 10^2
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "3")] <- 10^3
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "4")] <- 10^4
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "5")] <- 10^5
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "6")] <- 10^6
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "7")] <- 10^7
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "8")] <- 10^8
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "H")] <- 10^2
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "K")] <- 10^3
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "M")] <- 10^6
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "B")] <- 10^9
stormData$CROPDMGFACTOR[(stormData$CROPDMGEXP == "")] <- 10^0
stormData$CROPDMGFACTOR[(stormData$CROPDMGEXP == "?")] <- 10^0
stormData$CROPDMGFACTOR[(stormData$CROPDMGEXP == "0")] <- 10^0
stormData$CROPDMGFACTOR[(stormData$CROPDMGEXP == "2")] <- 10^2
stormData$CROPDMGFACTOR[(stormData$CROPDMGEXP == "K")] <- 10^3
stormData$CROPDMGFACTOR[(stormData$CROPDMGEXP == "M")] <- 10^6
stormData$CROPDMGFACTOR[(stormData$CROPDMGEXP == "B")] <- 10^9
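The same translation can be written more compactly as a lookup table. The sketch below is an alternative to the assignments above, not the code used in this analysis; exp_factor and to_factor are hypothetical names. The empty code is handled separately because an empty string cannot be used as a key in a named vector, and unrecognised codes come back as NA so that they are easy to spot.

```r
# Lookup table from exponent code to multiplier (names are the codes)
exp_factor <- c(
  "-" = 1, "?" = 1, "+" = 1,
  setNames(10^(0:9), as.character(0:9)),
  "H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9
)

# Hypothetical helper: translate a character vector of codes to multipliers
to_factor <- function(exp_code) {
  code <- toupper(exp_code)
  out <- unname(exp_factor[code])  # unknown codes become NA
  out[code == ""] <- 1             # "" never matches a name, so set it here
  out
}

factors <- to_factor(c("K", "m", "", "B"))
print(factors)
```

With such a helper, each factor column could be filled in a single call, for example to_factor(stormData$PROPDMGEXP).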
The distinction between FATALITIES and INJURIES is not important for our analysis. Therefore, both variables are combined to form a new variable, HEALTHIMPACT.
Likewise for economic damages, both PROPDMG and CROPDMG are multiplied by their corresponding factors and combined to form a new variable, ECONOMICIMPACT.
stormData <- mutate(stormData, HEALTHIMPACT = FATALITIES + INJURIES)
stormData <- mutate(stormData, ECONOMICIMPACT = PROPDMG * PROPDMGFACTOR + CROPDMG * CROPDMGFACTOR)
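As a worked example of the two derived variables, consider two observations. The first uses values visible in the str() output (15 injuries, PROPDMG of 25 with exponent "K"); the second is made up. The sketch uses base R's transform() in place of mutate().

```r
# Row 1 mirrors values shown in str(); row 2 is hypothetical
toy <- data.frame(
  FATALITIES    = c(0,   1),
  INJURIES      = c(15,  14),
  PROPDMG       = c(25,  2.5),
  PROPDMGFACTOR = c(1e3, 1e6),
  CROPDMG       = c(0,   10),
  CROPDMGFACTOR = c(1,   1e3)
)

toy <- transform(
  toy,
  HEALTHIMPACT   = FATALITIES + INJURIES,
  ECONOMICIMPACT = PROPDMG * PROPDMGFACTOR + CROPDMG * CROPDMGFACTOR
)
print(toy[, c("HEALTHIMPACT", "ECONOMICIMPACT")])
```

For the first row, 25 with exponent "K" becomes 25 * 1,000 = 25,000 USD of property damage and no crop damage, so ECONOMICIMPACT is 25,000.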
The event type variable (EVTYPE) also requires cleaning up. Since our analysis focuses on the most impactful events, only the event types with the largest totals need to be cleaned up. We therefore restrict the cleanup to event types whose total impact lies above the 95th percentile.
First, we look at event types above the 95th percentile for health impact (HEALTHIMPACT).
healthImpact <- with(stormData, aggregate(HEALTHIMPACT ~ EVTYPE, FUN = sum))
subset(healthImpact, HEALTHIMPACT > quantile(HEALTHIMPACT, prob = 0.95))
## EVTYPE HEALTHIMPACT
## 45 EXCESSIVE HEAT 8188
## 53 FLASH FLOOD 2561
## 55 FLOOD 7172
## 86 HEAT 1459
## 102 HIGH WIND 1318
## 107 HURRICANE/TYPHOON 1339
## 130 LIGHTNING 4792
## 177 THUNDERSTORM WIND 1530
## 181 TORNADO 22178
## 186 TSTM WIND 3870
## 211 WILDFIRE 986
## 217 WINTER STORM 1483
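The percentile cut can be illustrated on made-up totals. In this sketch, 20 hypothetical event types have totals of 1 to 20; with R's default quantile method the 95th percentile is 19.05, so only the largest total passes the strict comparison.

```r
# Hypothetical per-event totals, already aggregated
totals <- data.frame(
  EVTYPE       = paste("EVENT", 1:20),
  HEALTHIMPACT = 1:20
)

# 95th percentile of the totals (default quantile type 7)
cutoff <- quantile(totals$HEALTHIMPACT, probs = 0.95)

# Event types strictly above the cutoff
top <- subset(totals, HEALTHIMPACT > cutoff)
print(top)
```

Because the cutoff is computed over event types rather than over individual observations, the number of event types returned is roughly 5% of the distinct EVTYPE values.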
Two of the event types above the 95th percentile are not official event names as defined in https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf. They are "TSTM WIND" (officially "THUNDERSTORM WIND") and "HURRICANE/TYPHOON" (officially "HURRICANE (TYPHOON)"), and are renamed accordingly.
stormData$EVTYPE[(stormData$EVTYPE == "TSTM WIND")] <- "THUNDERSTORM WIND"
stormData$EVTYPE[(stormData$EVTYPE == "HURRICANE/TYPHOON")] <- "HURRICANE (TYPHOON)"
Next, we look at event types above the 95th percentile for economic impact (ECONOMICIMPACT).
economicImpact <- with(stormData, aggregate(ECONOMICIMPACT ~ EVTYPE, FUN = sum))
subset(economicImpact, ECONOMICIMPACT > quantile(ECONOMICIMPACT, prob = 0.95))
## EVTYPE ECONOMICIMPACT
## 37 DROUGHT 14413667000
## 53 FLASH FLOOD 16557105610
## 55 FLOOD 148919611950
## 83 HAIL 17071172870
## 102 HIGH WIND 5881421660
## 105 HURRICANE 14554229010
## 106 HURRICANE (TYPHOON) 71913712800
## 170 STORM SURGE 43193541000
## 177 THUNDERSTORM WIND 8812927230
## 181 TORNADO 24900370720
## 184 TROPICAL STORM 8320186550
Two of the event types above the 95th percentile are not official event names as defined in https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf. They are "HURRICANE" (officially "HURRICANE (TYPHOON)") and "STORM SURGE" (officially "STORM SURGE/TIDE"), and are renamed accordingly.
stormData$EVTYPE[(stormData$EVTYPE == "HURRICANE")] <- "HURRICANE (TYPHOON)"
stormData$EVTYPE[(stormData$EVTYPE == "STORM SURGE")] <- "STORM SURGE/TIDE"
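The four renames can also be collected in a single lookup. The following is a sketch of that alternative, not the code used above; recode_map is a hypothetical name and the evtype vector is a made-up sample. Values without an entry in the lookup are left unchanged.

```r
# Unofficial name -> official name, as identified in the two steps above
recode_map <- c(
  "TSTM WIND"         = "THUNDERSTORM WIND",
  "HURRICANE/TYPHOON" = "HURRICANE (TYPHOON)",
  "HURRICANE"         = "HURRICANE (TYPHOON)",
  "STORM SURGE"       = "STORM SURGE/TIDE"
)

# Made-up sample of event types
evtype <- c("TSTM WIND", "TORNADO", "HURRICANE", "STORM SURGE", "FLOOD")

# Replace only the entries that appear in the lookup
hit <- evtype %in% names(recode_map)
evtype[hit] <- unname(recode_map[evtype[hit]])
print(evtype)
```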
First, we look at the Top 10 Severe Weather Events (EVTYPE) that have the greatest impact on Population Health (healthImpact) in the United States.
healthImpact <- stormData %>%
group_by(EVTYPE) %>%
summarise(HEALTHIMPACT = sum(HEALTHIMPACT)) %>%
arrange(desc(HEALTHIMPACT))
g1 <- ggplot(healthImpact[1:10,], aes(x=reorder(EVTYPE, -HEALTHIMPACT), y=HEALTHIMPACT, color=EVTYPE)) +
geom_bar(stat="identity", fill="white") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("Event Type") + ylab("Number of Fatalities and Injuries") +
theme(legend.position="none") +
ggtitle("Impact of Severe Weather Events on Population Health in the United States")
print(g1)
The figure above shows that tornadoes have by far the greatest impact on population health in the United States, with 22,178 fatalities and injuries recorded from 1996 to 2011.
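The ranking behind such a figure (sum per event type, sort in descending order, keep the top entries) can be sketched in base R on made-up data. The values below are hypothetical, and aggregate() with order() stands in for the group_by(), summarise(), and arrange() pipeline.

```r
# Hypothetical observations
toy <- data.frame(
  EVTYPE       = c("TORNADO", "FLOOD", "TORNADO", "LIGHTNING", "FLOOD"),
  HEALTHIMPACT = c(10, 3, 5, 4, 2)
)

# Total per event type, sorted from largest to smallest
totals <- aggregate(HEALTHIMPACT ~ EVTYPE, data = toy, FUN = sum)
ranked <- totals[order(-totals$HEALTHIMPACT), ]

# Keep the two largest totals
top2 <- head(ranked, 2)
print(top2)
```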
Next, we look at the Top 10 Severe Weather Events (EVTYPE) that have the greatest impact on the Economy (economicImpact) in the United States.
economicImpact <- stormData %>%
group_by(EVTYPE) %>%
summarise(ECONOMICIMPACT = sum(ECONOMICIMPACT)) %>%
arrange(desc(ECONOMICIMPACT))
g2 <- ggplot(economicImpact[1:10,], aes(x=reorder(EVTYPE, -ECONOMICIMPACT), y=ECONOMICIMPACT, color=EVTYPE)) +
geom_bar(stat="identity", fill="white") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("Event Type") + ylab("Economic Damages in USD") +
theme(legend.position="none") +
ggtitle("Impact of Severe Weather Events on the Economy in the United States")
print(g2)
The figure above shows that floods have the greatest impact on the economy in the United States, with recorded damages of roughly 149 billion USD from 1996 to 2011.
In summary, our analysis suggests that tornadoes are the most harmful to population health, while floods cause the greatest economic damage in the United States.