Severe weather events can have a substantial impact on both the health of the public and the economy. Events such as these can cause property and crop damage as well as injuries and even fatalities. Minimising the impact of such outcomes is of great concern the the USA as a whole.
In this project we use the U.S. National Oceanic and Atmospheric Administrations (NOAA) Storm Database the assess the impact of these severe weather events.
library(ggplot2)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr)
## Warning: package 'knitr' was built under R version 4.0.5
library(markdown)
library(rmarkdown)
## Warning: package 'rmarkdown' was built under R version 4.0.5
library(RColorBrewer)
fileurl = 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
data_dir <- getwd()
if (!file.exists('./storm_data.bzip2')){
download.file(fileurl,'./storm_data.csv.bz2', mode = 'wb')
}
if(!exists("storm_data")) {
storm_data <- read.csv(bzfile("storm_data.csv.bz2"),header = TRUE)
}
We first run str() to get an overview of the data.
str(storm_data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
We interested in the most harmful events (indicated by the EVTYPE variable) with respect to the population health and economic consequences across the entirety of the United States; therefore, we will keep the following variables:
EVTYPE: Event Type e.g. FloodFATALITIES: Number of fatalities caused by EventINJURIES: Number of injuries caused by EventPROPDMG: Property damage caused by eventPROPDMGEXP: The units for the property damage i.e. K = Thousands, M = Millions, B = BillionsCROPDMG: Crop damage caused by eventCROPDMGEXP: The units for the crop damage i.e. K = Thousands, M = Millions, B = BillionsWe subset our data to only keep these variables.
variables <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
data <- storm_data[variables]
dim(data)
## [1] 902297 7
Let’s take a look at the first few observations.
head(data)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
We check for missing values in any of the seven variables. We see there are no missing values for any of the variables - this is an exceptionally nice dataset!
for (var in variables){
print(sum(is.na(data$var)))
}
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
## [1] 0
As noted above, the damage variables have associated exponent variables for thousands, millions and billions; we will have to combine these later for true damage amounts. Let’s take a look at the different possible Units as this will be important.
sort(table(data$PROPDMGEXP), decreasing = TRUE)
##
## K M 0 B 5 1 2 ? m H
## 465934 424665 11330 216 40 28 25 13 8 7 6
## + 7 3 4 6 - 8 h
## 5 5 4 4 4 1 1 1
sort(table(data$CROPDMGEXP), decreasing = TRUE)
##
## K M k 0 B ? 2 m
## 618413 281832 1994 21 19 9 7 1 1
We can see the expected K, M and B exponents. However, there are other unexpected results. Based on these tables and the documentation we will take the following:
We remove all entries that are not in one of the four categories listed above as we cannot be sure about what the costs should be. We also check to confirm we are left with only these categories and the new dimensions.
exp <- c("K", "k", "M", "m", "B", "")
data <- subset(data, PROPDMGEXP %in% exp & CROPDMGEXP %in% exp)
sort(table(data$PROPDMGEXP), decreasing = TRUE)
##
## K M B m
## 465928 424645 11329 40 7
sort(table(data$CROPDMGEXP), decreasing = TRUE)
##
## K M k B m
## 618100 281826 1992 21 9 1
dim(data)
## [1] 901949 7
We’ve only lost about 300 entries - very well. Now we turn our attention to the EVTYPE variable. We see that there are many different types of events so let’s just look at the first 20. We will need to do some categorisation.
length(unique(data$EVTYPE))
## [1] 981
sort(table(data$EVTYPE), decreasing = TRUE)[1:20]
##
## HAIL TSTM WIND THUNDERSTORM WIND
## 288625 219939 82559
## TORNADO FLASH FLOOD FLOOD
## 60628 54261 25325
## THUNDERSTORM WINDS HIGH WIND LIGHTNING
## 20628 20210 15732
## HEAVY SNOW HEAVY RAIN WINTER STORM
## 15705 11723 11432
## WINTER WEATHER FUNNEL CLOUD MARINE TSTM WIND
## 7026 6839 6175
## MARINE THUNDERSTORM WIND WATERSPOUT STRONG WIND
## 5812 3796 3566
## URBAN/SML STREAM FLD WILDFIRE
## 3392 2761
Using this, and the documentation, we can summarise most of these events into seven categories:
We will create a new variable, EVENT, which categorises as such.
# Create the new variable with the default value of OTHER
data$EVENT <- "OTHER"
# Categorise by keywords listed above
data$EVENT[grep("HAIL", data$EVTYPE, ignore.case = TRUE)] <- "HAIL"
data$EVENT[grep("WIND|THUNDERSTORM|LIGHTNING", data$EVTYPE, ignore.case = TRUE)] <- "STORM"
data$EVENT[grep("TORNADO", data$EVTYPE, ignore.case = TRUE)] <- "TORNADO"
data$EVENT[grep("FLOOD|RAIN", data$EVTYPE, ignore.case = TRUE)] <- "FLOOD"
data$EVENT[grep("WINTER|SNOW|ICE|ICY", data$EVTYPE, ignore.case = TRUE)] <- "FREEZE"
data$EVENT[grep("HEAT|WARM|FIRE", data$EVTYPE, ignore.case = TRUE)] <- "HEAT"
sort(table(data$EVENT), decreasing = TRUE)
##
## STORM HAIL FLOOD TORNADO FREEZE OTHER HEAT
## 380443 289238 94854 60676 39447 30076 7215
We also need to transform the damage variables to their full numbers as mentioned before. We do this by converting the exponents to numeric values and then raising the damage variables to these powers.
# Property
# Set all none K, M, B to be 0 exponent
data$PROPDMGEXPNUM[!grepl("K|k|M|m|B", data$PROPDMGEXP, ignore.case = TRUE)] <- 0
data$PROPDMGEXPNUM[grep("K|k", data$PROPDMGEXP, ignore.case = TRUE)] <- 3
data$PROPDMGEXPNUM[grep("M|m", data$PROPDMGEXP, ignore.case = TRUE)] <- 6
data$PROPDMGEXPNUM[grep("B", data$PROPDMGEXP, ignore.case = TRUE)] <- 9
data$PROPERTY_DAMAGE <- data$PROPDMG * 10^(as.numeric(data$PROPDMGEXPNUM))
# Crops
# Set all none K, M, B to be 0 exponent
data$CROPDMGEXPNUM[!grepl("K|k|M|m|B", data$CROPDMGEXP, ignore.case = TRUE)] <- 0
data$CROPDMGEXPNUM[grep("K|k", data$CROPDMGEXP, ignore.case = TRUE)] <- 3
data$CROPDMGEXPNUM[grep("M|m", data$CROPDMGEXP, ignore.case = TRUE)] <- 6
data$CROPDMGEXPNUM[grep("B", data$CROPDMGEXP, ignore.case = TRUE)] <- 9
data$CROP_DAMAGE <- data$CROPDMG * 10^(as.numeric(data$CROPDMGEXPNUM))
# Class check
class(data$PROPERTY_DAMAGE)
## [1] "numeric"
class(data$CROP_DAMAGE)
## [1] "numeric"
Our data is now suitable to start summarising to answer the questions posed. For economic impact we simply care about the total damage cost caused by the event. For Public Health impact we care about both the number of injuries and the number of fatalities. So we aim to make three summary datasets:
We take the sum of the number of injuries by event.
data_injuries <- data %>%
select(EVENT, INJURIES) %>%
group_by(EVENT) %>%
summarise(TOTAL_INJURIES = sum(INJURIES))
head(data_injuries)
## # A tibble: 6 x 2
## EVENT TOTAL_INJURIES
## <chr> <dbl>
## 1 FLOOD 8905
## 2 FREEZE 5237
## 3 HAIL 1368
## 4 HEAT 10851
## 5 OTHER 6041
## 6 STORM 16688
We take the sum of the number of fatalities by event.
data_fatalities <- data %>%
select(EVENT, FATALITIES) %>%
group_by(EVENT) %>%
summarise(TOTAL_FATALITIES = sum(FATALITIES))
head(data_fatalities)
## # A tibble: 6 x 2
## EVENT TOTAL_FATALITIES
## <chr> <dbl>
## 1 FLOOD 1633
## 2 FREEZE 547
## 3 HAIL 15
## 4 HEAT 3268
## 5 OTHER 1782
## 6 STORM 2232
We take the sum of both the property and crop damage by event.
data_damage <- data %>%
select(EVENT, PROPERTY_DAMAGE, CROP_DAMAGE) %>%
group_by(EVENT) %>%
summarise(TOTAL_DAMAGE = sum(PROPERTY_DAMAGE, CROP_DAMAGE))
head(data_damage)
## # A tibble: 6 x 2
## EVENT TOTAL_DAMAGE
## <chr> <dbl>
## 1 FLOOD 183943763634
## 2 FREEZE 16986995410
## 3 HAIL 18995875230
## 4 HEAT 9829715160
## 5 OTHER 167360037440
## 6 STORM 20269204081
We are now ready to present the results. We create three plots using ggplot2
injuries <- ggplot(data_injuries, aes(x=reorder(EVENT, -TOTAL_INJURIES), y=TOTAL_INJURIES/1000, fill = EVENT)) +
geom_col() +
xlab("Event") +
ylab("Total Injuries (Thousands)") +
ggtitle("Top 7 Severe Storm Events - Injuries") +
scale_fill_brewer(palette = "Set1")
fatalities <- ggplot(data_fatalities, aes(x=reorder(EVENT, -TOTAL_FATALITIES), y=TOTAL_FATALITIES/1000, fill = EVENT)) +
geom_col() +
xlab("Event") +
ylab("Total Fatalities (Thousands)") +
ggtitle("Top 7 Severe Storm Events - Fatalities") +
scale_fill_brewer(palette = "Set2")
damages <- ggplot(data_damage, aes(x=reorder(EVENT, -TOTAL_DAMAGE), y=TOTAL_DAMAGE/10^9, fill = EVENT)) +
geom_col() +
xlab("Event") +
ylab("Total Cost of Damages ($ in Billions)") +
ggtitle("Top 7 Severe Storm Events - Damages") +
scale_fill_brewer(palette = "Set3")
Based on these two graphs we can see that Tornados are by far the biggest threat to public health of all major weather events.
print(injuries)
print(fatalities)
In terms of Economic Impact, it is clear that Floods have the largest impact.
print(damages)