Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
In this analysis, we seek to address the following research questions,
We will need to perform data pre-processing, including data cleansing, transformation, create derived variables and etc, before we can perform analysis using the the ‘tidy’ data set.
This section covers each step and component of the data processing activities. The objective is to finalise a ‘tidy’ data set for analysis.
We first load the libraries required for this analysis. While it is subjective, my programming preference is to load libraries I would be using upfront in a single code chunk.
# Load libraries
library(data.table) # for fread function
library(dplyr) # for data wrangling
library(ggplot2) # for plotting
library(gridExtra) # for arranging plots into grid
library(lubridate) # for date time functions
library(R.utils) # for uncompressing bz2 function
## Warning: package 'R.utils' was built under R version 4.3.2
library(quantmod) # for USA CPI (inflation) data
library(psych) # for descriptive stats
## Warning: package 'psych' was built under R version 4.3.2
library(knitr) # for producing nicer table output using kable function
We check the working directory and existence of source files. If not, download and/or uncompress accordingly.
# Ensure all files are in the intended working dir else use setwd() to set
#setwd("your working directory") # <--Uncomment to use if required.
#getwd() # <--Uncomment to use if required.
# Check, download & unzip raw data file if required
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists("./StormData.csv.bz2")){
download.file(fileUrl,"./StormData.csv.bz2", mode="wb")
}
if (!file.exists("./StormData.csv")) {
bunzip2("./StormData.csv.bz2","./StormData.csv", remove = FALSE)
}
The raw data file is imported in R and a temporary data set is created with required variables.
# Import raw file into R
if(!exists("dfTemp1")) {
dfTemp1 <- fread("StormData.csv", sep=",", header=TRUE, data.table=FALSE, stringsAsFactors=FALSE)
}
# Create a df with required vars as we do not need all the imported variables.
dfTemp2 <- dfTemp1 %>%
select(REFNUM, EVTYPE , STATE, COUNTY, COUNTYNAME, BGN_DATE, END_DATE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
We convert date variables into proper date variables.
# Convert date variables to date format
dfTemp2$BGN_DATE <- mdy_hms(dfTemp2$BGN_DATE)
dfTemp2$END_DATE <- mdy_hms(dfTemp2$END_DATE)
The economic impact to property and crop damages were not recorded as ‘as is’ values. We need to translate them into actual dollar values. In addition, we need to adjust these translated values with yearly inflation rates from 1950 to 2011 so that we can view the economic impacts in real terms.
# Do CPI data part
# Get the min year from dfTemp2$BGN_DATE for CPI base year
YearMin <- as.character(year(min(dfTemp2$BGN_DATE)))
# Get CPI (Inflation Rate) data using quantmod package
getSymbols("CPIAUCSL", src='FRED')
## [1] "CPIAUCSL"
# Compute the yearly averages and adjustment rates from base year
cpi.avg <- apply.yearly(CPIAUCSL, mean)
cpi.adj.rate <- cpi.avg/as.numeric(cpi.avg[YearMin])
# Create a DF from the xts object
dfCPI <- data.frame(date=index(cpi.adj.rate), coredata(cpi.adj.rate), row.names=NULL)
dfCPI <- dfCPI %>% mutate(year.cpi = year(date))
# Do weather data part
# Function for translating value exponent
Exp.Translate <- function(e) {
# h -> hundred, k -> thousand, m -> million, b -> billion
if (e %in% c('h', 'H'))
return(2)
else if (e %in% c('k', 'K'))
return(3)
else if (e %in% c('m', 'M'))
return(6)
else if (e %in% c('b', 'B'))
return(9)
else if (e %in% c('', '-', '?', '+'))
return(0)
else if (!is.na(as.numeric(e))) # if a digit
return(as.numeric(e))
else {stop("Invalid exponent value.")}
}
# Translate the exponents
PROPDMG.EXP <- data.frame(sapply(dfTemp2$PROPDMGEXP, FUN=Exp.Translate))
colnames(PROPDMG.EXP) <- c("PROPDMG.EXP")
CROPDMG.EXP <- data.frame(sapply(dfTemp2$CROPDMGEXP, FUN=Exp.Translate))
colnames(CROPDMG.EXP) <- c("CROPDMG.EXP")
# Compute the CPI adjusted damage values into a new DF
dfTemp3 <- dfTemp2 %>%
bind_cols(PROPDMG.EXP) %>%
bind_cols(CROPDMG.EXP) %>%
mutate(year.cpi = year(BGN_DATE)) %>%
left_join(dfCPI, by="year.cpi") %>%
mutate(PROPDMG.VAL = PROPDMG*(10**PROPDMG.EXP)*CPIAUCSL) %>%
mutate(CROPDMG.VAL = CROPDMG*(10**CROPDMG.EXP)*CPIAUCSL) %>%
mutate(TTL.ECONOMIC.IMPACT = PROPDMG.VAL+CROPDMG.VAL) %>%
mutate(TTL.FATALITY.INJURY = FATALITIES+INJURIES) %>%
select(1:9,19:22)
There were 890 unique event types in the imported data while there are 48 modern NOAA event types according to the NWSI 10-1605 (AUGUST 17, 2007) Storm Event Table. A mapping exercise is required to map these historical event types to the modern event types.
Initial mapping attempts with fuzzy text matching techniques from the
stringdist package were not successful due to the quality
of the free-text values recorded over the decades. As there are only 890
historical event type values to be mapped to 48 modern event type
values, a look-up table was created for mapping via a manual
‘eyeballing’ exercise. This lookup file is available on the GitHub
repository mentioned in the GitHub Repository section
below for forking.
The output from this step is the ‘tidy’ data set dfTidy, which will be used for analysis and concludes the activities required for data pre-processing.
# Remove leading, trailing and duplicated white spaces
dfTemp3$EVTYPE <- trimws(toupper(dfTemp3$EVTYPE))
# Import lookup table
dfNOAA.Lookup <- read.csv("STORM_NOAA_LOOKUP.csv", header=TRUE, stringsAsFactors=FALSE)
# Match event types to the modern event types
dfTidy <- dfTemp3 %>%
left_join(dfNOAA.Lookup, by=c("EVTYPE"="STORM.EVTYPE"))
# Remove unused objects to free up memory
rm(dfTemp1, dfTemp2, dfTemp3)
Before we tackle the 2 research questions, we performed data analysis
to understand the shapes and distributions of the data using the 4
quantiative variables
i.e. FATALITIES, INJURIES, PROPDMG.VAL and CROPDMG.VAL.
Not every weather event results in adverse health and economic impact, hence we would want to know the proportions of 0 impacts relative to the all the events that had occured.
# Find proportions of 0 values
AllCount <- length(dfTidy$REFNUM)
ObsName <- c("Proportions Percent")
FATALITIES <- (length(subset(dfTidy, dfTidy$FATALITIES <= 0)$FATALITIES)/AllCount)*100
INJURIES <- (length(subset(dfTidy, dfTidy$INJURIES <= 0)$INJURIES)/AllCount)*100
PROPDMG <- (length(subset(dfTidy, dfTidy$PROPDMG.VAL <= 0)$PROPDMG.VAL)/AllCount)*100
CROPDMG <- (length(subset(dfTidy, dfTidy$CROPDMG.VAL <= 0)$CROPDMG.VAL)/AllCount)*100
dfZeros <- data.frame(FATALITIES,INJURIES,PROPDMG,CROPDMG)
dfZeros <- cbind(ObsName, dfZeros)
kable(dfZeros, digits=2)
| ObsName | FATALITIES | INJURIES | PROPDMG | CROPDMG |
|---|---|---|---|---|
| Proportions Percent | 99.23 | 98.05 | 73.49 | 97.55 |
Results showed there were high proportions of 0s in the 4 quantitative variables.
In lieu of the high proportions of 0s mentioned earlier, we performed summary statistics and distribution plots with exclusions of 0s to understand the data distributions of the 4 quantitative variables.
# Subset data where values > 0
sub1 <- subset(dfTidy, dfTidy$FATALITIES>0)$FATALITIES
sub2 <- subset(dfTidy, dfTidy$INJURIES>0)$INJURIES
sub3 <- (subset(dfTidy, dfTidy$PROPDMG.VAL>0)$PROPDMG.VAL)/1000000
sub4 <- (subset(dfTidy, dfTidy$CROPDMG.VAL>0)$CROPDMG.VAL)/1000000
# Run quick summary stats for the 4 required variables
dfSummStats <- rbind(describe(sub1),
describe(sub2),
describe(sub3),
describe(sub4))
varName <- c("FATALITIES","INJURIES","PROPDMG.VAL ($M)","CROPDMG.VAL ($M)")
dfSummStats <- cbind(varName, dfSummStats) %>%
select(1,3:6,9:14)
kable(dfSummStats, row.names=FALSE, digits=2)
| varName | n | mean | sd | median | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|
| FATALITIES | 6974 | 2.17 | 8.43 | 1.00 | 1 | 583.0 | 582.0 | 49.53 | 3259.05 | 0.10 |
| INJURIES | 17604 | 7.98 | 38.08 | 2.00 | 1 | 1700.0 | 1699.0 | 20.47 | 627.28 | 0.29 |
| PROPDMG.VAL (\(M) | 239174| 13.71| 2093.26| 0.07| 0| 963291.8| 963291.8| 417.04| 188513.14| 4.28| |CROPDMG.VAL (\)M) | 22099 | 16.07 | 331.81 | 0.11 | 0 | 30800.0 | 30800.0 | 73.80 | 6488.57 | 2.23 |
# Plot histogram to view data distributions
q1 <- qplot(sub1,
geom="histogram",
xlab = "Fatalities #",
fill=I("blue"),
col=I("red"),
alpha=I(.2))
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
q2 <- qplot(sub2,
geom="histogram",
xlab = "Injuries #",
fill=I("blue"),
col=I("red"),
alpha=I(.2))
q3 <- qplot(sub3,
geom="histogram",
xlab = "Property Damage $M",
fill=I("blue"),
col=I("red"),
alpha=I(.2))
q4 <- qplot(sub4,
geom="histogram",
xlab = "Crop Damage $M",
fill=I("blue"),
col=I("red"),
alpha=I(.2))
grid.arrange(q1, q2, q3, q4, ncol=2, nrow =2, top="Data Distributions of the 4 Variables (excl. 0s)")
The summary statistics and distribution plots indicated that the data distributions of the 4 variables are highly right-skewed and there were many observations of extreme high values.
The decision is to aggregate the quantitive measures by Event Type and filter for Top 10 event types that have the most impact.
We create Top 10 data sets for Fatalities, Injuries, Property Damage and Crop Damage after aggregations by Event Types.
# Create the top 10 event types for Fatalities
dfTop10.FATALITIES <- dfTidy %>%
group_by(NOAA.EVTYPE) %>%
summarise(TOTAL.FATALITIES = sum(FATALITIES)) %>%
arrange(desc(TOTAL.FATALITIES)) %>%
top_n(10, TOTAL.FATALITIES) %>%
ungroup()
# Create the top 10 event types for Injuries
dfTop10.INJURIES <- dfTidy %>%
group_by(NOAA.EVTYPE) %>%
summarise(TOTAL.INJURIES = sum(INJURIES)) %>%
arrange(desc(TOTAL.INJURIES)) %>%
top_n(10, TOTAL.INJURIES) %>%
ungroup()
# Create the top 10 event types for PROPDMG.VAL
dfTop10.PROPDMG.VAL <- dfTidy %>%
group_by(NOAA.EVTYPE) %>%
summarise(TOTAL.PROPDMG.VAL = sum(PROPDMG.VAL)/1000000) %>%
arrange(desc(TOTAL.PROPDMG.VAL)) %>%
top_n(10, TOTAL.PROPDMG.VAL) %>%
ungroup()
# Create the top 10 event types for CROPDMG.VAL
dfTop10.CROPDMG.VAL <- dfTidy %>%
group_by(NOAA.EVTYPE) %>%
summarise(TOTAL.CROPDMG.VAL = sum(CROPDMG.VAL)/1000000) %>%
arrange(desc(TOTAL.CROPDMG.VAL)) %>%
top_n(10, TOTAL.CROPDMG.VAL) %>%
ungroup()
g1 <- ggplot(data=dfTop10.FATALITIES,
aes(x=reorder(NOAA.EVTYPE, TOTAL.FATALITIES), y=TOTAL.FATALITIES, fill=NOAA.EVTYPE)) +
geom_bar(stat="identity") +
coord_flip() +
ylab("Total Number of Fatalities") +
xlab("NOAA Event Type") +
theme_bw() +
theme(legend.position="none")
g2 <- ggplot(data=dfTop10.INJURIES,
aes(x=reorder(NOAA.EVTYPE, TOTAL.INJURIES), y=TOTAL.INJURIES, fill=NOAA.EVTYPE)) +
geom_bar(stat="identity") +
coord_flip() +
ylab("Total Number of Injuries") +
xlab("NOAA Event Type") +
theme_bw() +
theme(legend.position="none")
grid.arrange(g1, g2, top="Top 10 Harmful Weather Events in the US (1950-2011)")
The plots indicated the Top 10 harmful event types with respect to the population health. Tornadoes topped the charts as the most harmful with the most fatalites and injuries. In both cases, the number of fatalities and injuries due to Tornadoes are more than 2 times than the rest of the 9 event types in each chart. The plots also indicated that injuries tend to be much higher than fatalities.
g3 <- ggplot(data=dfTop10.PROPDMG.VAL,
aes(x=reorder(NOAA.EVTYPE, TOTAL.PROPDMG.VAL), y=TOTAL.PROPDMG.VAL, fill=NOAA.EVTYPE)) +
geom_bar(stat="identity") +
coord_flip() +
ylab("Property Damage $M (CPI adjusted)") +
xlab("NOAA Event Type") +
theme_bw() +
theme(legend.position="none")
g4 <- ggplot(data=dfTop10.CROPDMG.VAL,
aes(x=reorder(NOAA.EVTYPE, TOTAL.CROPDMG.VAL), y=TOTAL.CROPDMG.VAL, fill=NOAA.EVTYPE)) +
geom_bar(stat="identity") +
coord_flip() +
ylab("Crop Damage $M (CPI adjusted)") +
xlab("NOAA Event Type") +
theme_bw() +
theme(legend.position="none")
grid.arrange(g3, g4, top="Top 10 Economic Impacts by Weather Events in the US (1950-2011)")
The plots indicated the Top 10 event types with high econonic consequences. Floods caused the highest economic damage to properties while Droughts caused the highest crop damage in millions of real dollars. The plots also indicated that the value of property damages is much higher than that of crop damages.
This analysis is completed with R. For reproducibility, you can fork and download the R code from this GitHub Repository
Details from sessionInfo below,
## R version 4.3.1 (2023-06-16 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22000)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## time zone: Africa/Casablanca
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.43 psych_2.3.12 quantmod_0.4.25 TTR_0.24.3
## [5] xts_0.13.1 zoo_1.8-12 R.utils_2.12.3 R.oo_1.25.0
## [9] R.methodsS3_1.8.2 lubridate_1.9.2 gridExtra_2.3 ggplot2_3.4.3
## [13] dplyr_1.1.2 data.table_1.14.8
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.7 utf8_1.2.3 generics_0.1.3 lattice_0.21-8
## [5] digest_0.6.33 magrittr_2.0.3 evaluate_0.21 grid_4.3.1
## [9] timechange_0.2.0 fastmap_1.1.1 jsonlite_1.8.7 fansi_1.0.4
## [13] scales_1.2.1 jquerylib_0.1.4 mnormt_2.1.1 cli_3.6.1
## [17] crayon_1.5.2 rlang_1.1.1 munsell_0.5.0 withr_2.5.0
## [21] cachem_1.0.8 yaml_2.3.7 tools_4.3.1 parallel_4.3.1
## [25] colorspace_2.1-0 curl_5.0.2 vctrs_0.6.3 R6_2.5.1
## [29] lifecycle_1.0.3 pkgconfig_2.0.3 pillar_1.9.0 bslib_0.5.1
## [33] gtable_0.3.4 glue_1.6.2 highr_0.10 xfun_0.40
## [37] tibble_3.2.1 tidyselect_1.2.0 rstudioapi_0.15.0 farver_2.1.1
## [41] htmltools_0.5.6 nlme_3.1-162 labeling_0.4.3 rmarkdown_2.25
## [45] compiler_4.3.1