This study investigates the severe weather events in the US based on time series data from the National Oceanic and Atmospheric Administration (NOAA). The NOAA storm database not only tracks a lot of events across all states in the US, but also provides information on injuries and property damages. For a detailed description of the NOAA storm database see: (https://www.ncdc.noaa.gov/stormevents/) The present analysis makes use of that information and clearly shows that tornadoes have the most harmful impact on public health, whereas floods are most expensive from an economic view.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
The source data file is downloaded from (https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2). It covers weather events between 1950 and 2011. Comprehensive documentation for the dataset is available: (https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf) (https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf)
if (!file.exists("StormData.csv.bz2")) {
fileUrl<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile="StormData.csv.bz2", method="curl")
# Exit if the file is not available
if (!file.exists("StormData.csv.bz2")) {
stop("Can't locate file 'StormData.csv.bz2'!")
}
}
# Load the dataset
stormDataRaw <- read.csv("StormData.csv.bz2")
# Show the structure of the dataset
str(stormDataRaw)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
There are 902.297 observations with 37 variables in the file. Only a subset is required for the analysis. 1. Relevant for the analysis are the date (BGN_DATE), event type (EVTYPE), counter for the health impact (FATALITIES and INJURIES), monetary impact on crop and property (PROPDMG and CROPDMG) as well as their corresponding exponents (PROPDMGEXP and CROPDMGEXP). 2. According to the NOAA ((https://www.ncdc.noaa.gov/stormevents/details.jsp)) the full set of weather events (48 event types) is available since 1996. Between 1950 and 1995 only a subset (Tornado, Thunderstorm Wind and Hail) of these events is available in the storm database. In order to have o comparable basis for the analysis the dataset is limited to the observations between 1996 and 2011. 3. The dataset contains a lot of observations without any information about health and/or economic damages. These observations are excluded from the analysis.
stormData <- select(stormDataRaw, BGN_DATE, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, FATALITIES, INJURIES)
# Format the BGN_DATE variable as a date
stormData$BGN_DATE <- as.Date(stormData$BGN_DATE, "%m/%d/%Y")
stormData$YEAR <- year(stormData$BGN_DATE)
# Tornado 1950 - 1954
# Tornado, Thunderstorm Wind, Hail 1955 - 1995
# 48 Events since 1996
# Only use events since 1996
stormData <- filter(stormData, YEAR >= 1996)
# Only use events with either health impact or economic damage
stormData <- filter(stormData, PROPDMG > 0 | CROPDMG > 0 | FATALITIES > 0 | INJURIES > 0)
The economical damages provided in the storm dataset require some adjustments. Each variable - CROPDMG and PROPDMG - comes with a separate exponent - CROPDMGEXP and PROPDMGEXP. In the first step the content of the exponent variables need to be converted into a proper factor.
table(stormData$PROPDMGEXP)
##
## B K M
## 8448 32 185474 7364
table(stormData$CROPDMGEXP)
##
## B K M
## 102767 2 96787 1762
Both exponents are converted to uppercase to adapt all the exponents with the same meaning (eg. h and H). The next steps convert the exponents into corresponding factors: “”, “?”, “+”, “-”: 1 “0”: 1 “1”: 10 “2”: 100 “3”: 1.000 “4”: 10.000 “5”: 100.000 “6”: 1.000.000 “7”: 10.000.000 “8”: 100.000.000 “9”: 1.000.000.000 “H”: 100 “K”: 1.000 “M”: 1.000.000 “B”: 1.000.000.000
According to the previous tables, the CROPDMGEXP only contains a subset of these values. Most of the numerical exponents are missing. The factor is only calculated for the exponents provided in that variable.
stormData$PROPDMGEXP <- toupper(stormData$PROPDMGEXP)
stormData$CROPDMGEXP <- toupper(stormData$CROPDMGEXP)
stormData$CROPDMGFACTOR[(stormData$CROPDMGEXP == "")] <- 10^0
stormData$CROPDMGFACTOR[(stormData$CROPDMGEXP == "?")] <- 10^0
stormData$CROPDMGFACTOR[(stormData$CROPDMGEXP == "0")] <- 10^0
stormData$CROPDMGFACTOR[(stormData$CROPDMGEXP == "2")] <- 10^2
stormData$CROPDMGFACTOR[(stormData$CROPDMGEXP == "K")] <- 10^3
stormData$CROPDMGFACTOR[(stormData$CROPDMGEXP == "M")] <- 10^6
stormData$CROPDMGFACTOR[(stormData$CROPDMGEXP == "B")] <- 10^9
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "")] <- 10^0
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "-")] <- 10^0
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "?")] <- 10^0
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "+")] <- 10^0
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "0")] <- 10^0
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "1")] <- 10^1
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "2")] <- 10^2
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "3")] <- 10^3
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "4")] <- 10^4
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "5")] <- 10^5
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "6")] <- 10^6
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "7")] <- 10^7
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "8")] <- 10^8
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "H")] <- 10^2
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "K")] <- 10^3
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "M")] <- 10^6
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "B")] <- 10^9
The distinction between fatalities and injuries is not important for the existing analysis. Therefore both variables are added to form a new variable HEALTHIMP. A similar approach is used for the economic impact. Both crop and property damages (in USD) are multiplied by their corresponding factor and added to form a new variable ECONOMICCOST.
stormData <- mutate(stormData, HEALTHIMP = FATALITIES + INJURIES)
stormData <- mutate(stormData, ECONOMICCOST = PROPDMG * PROPDMGFACTOR + CROPDMG * CROPDMGFACTOR)
The event types also require a more detailed examination.
stormData$EVTYPE <- toupper(stormData$EVTYPE)
dim(data.frame(table(stormData$EVTYPE)))
## [1] 186 2
After converting the variable EVTYPE to uppercase, there are still 186 different event types listed. According to the NOAA there should be only 48. So there are a lot of duplicates. Here are some examples containing the string “THUND”:
evtypeUnique <- unique(stormData$EVTYPE)
evtypeUnique[grep("THUND", evtypeUnique)]
## [1] "THUNDERSTORM" "THUNDERSTORM WIND (G40)"
## [3] "THUNDERSTORM WIND" "MARINE THUNDERSTORM WIND"
Cleaning all event types is quite a big effort. Since this analysis is looking at the most harmful events, only part of the event types are cleaned. Therefore the health impact (HEALTHIMP) is summed up per event type. Only event types in the 95% quantile are to be cleaned.
healthImpact <- with(stormData, aggregate(HEALTHIMP ~ EVTYPE, FUN = sum))
subset(healthImpact, HEALTHIMP > quantile(HEALTHIMP, prob = 0.95))
## EVTYPE HEALTHIMP
## 39 EXCESSIVE HEAT 8188
## 46 FLASH FLOOD 2561
## 48 FLOOD 7172
## 69 HEAT 1459
## 88 HURRICANE/TYPHOON 1339
## 107 LIGHTNING 4792
## 146 THUNDERSTORM WIND 1530
## 149 TORNADO 22178
## 153 TSTM WIND 3870
## 182 WINTER STORM 1483
There are only two event types in the 95% quantile, which are not compliant to the official types defined in (https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf). These are TSTM WIND and HURRICANE/TYPHOON.
stormData$EVTYPE[(stormData$EVTYPE == "TSTM WIND")] <- "THUNDERSTORM WIND"
stormData$EVTYPE[(stormData$EVTYPE == "HURRICANE/TYPHOON")] <- "HURRICANE (TYPHOON)"
The same procedure is been used to clean the event types for the most important economic impacts. After summing up the economic cost ECONOMICCOST, events in the 95% quantile are to be cleaned.
economicCost <- with(stormData, aggregate(ECONOMICCOST ~ EVTYPE, FUN = sum))
subset(economicCost, ECONOMICCOST > quantile(ECONOMICCOST, prob = 0.95))
## EVTYPE ECONOMICCOST
## 32 DROUGHT 14413667000
## 46 FLASH FLOOD 16557105610
## 48 FLOOD 148919611950
## 66 HAIL 17071172870
## 86 HURRICANE 14554229010
## 87 HURRICANE (TYPHOON) 71913712800
## 141 STORM SURGE 43193541000
## 146 THUNDERSTORM WIND 8812957230
## 149 TORNADO 24900370720
## 152 TROPICAL STORM 8320186550
Again there are only two event types in the 95% quantile, which are not compliant to the official types: HURRICANE and STORM SURGE.
stormData$EVTYPE[(stormData$EVTYPE == "HURRICANE")] <- "HURRICANE (TYPHOON)"
stormData$EVTYPE[(stormData$EVTYPE == "STORM SURGE")] <- "STORM SURGE/TIDE"
The cleaned data frame stormData is been aggregated per EVTYPE and provided in a descending order in the new data frame healthImpact.
healthImpact <- stormData %>%
group_by(EVTYPE) %>%
summarise(HEALTHIMP = sum(HEALTHIMP)) %>%
arrange(desc(HEALTHIMP))
#healthImpact[1:10,]
g1 <- ggplot(healthImpact[1:10,], aes(x=reorder(EVTYPE, -HEALTHIMP),y=HEALTHIMP,color=EVTYPE)) +
geom_bar(stat="identity", fill="white") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("Event") + ylab("Number of fatalities and injuries") +
theme(legend.position="none") +
ggtitle("Fatalities and injuries in the US caused by severe weather events")
g1
The bar chart shows that Tornadoes are the most harmful weather events for people’s health.
The cleaned data frame stormData is been aggregated per EVTYPE and provided in a descending order in the new data frame economicCost.
economicCost <- stormData %>%
group_by(EVTYPE) %>%
summarise(ECONOMICCOST = sum(ECONOMICCOST)) %>%
arrange(desc(ECONOMICCOST))
#economicCost[1:10,]
g1 <- ggplot(economicCost[1:10,], aes(x=reorder(EVTYPE, -ECONOMICCOST),y=ECONOMICCOST,color=EVTYPE)) +
geom_bar(stat="identity", fill="white") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("Event") + ylab("Economic cost in USD") +
theme(legend.position="none") +
ggtitle("Economic cost in the US caused by severe weather events")
g1
The bar chart shows that Floods cause the biggest economical damages.