Reproducible Research Course Project 2
Peer-graded Assignment
This course project is available on GitHub
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis. Your analysis can consist of tables, figures, or other summaries. You may use any R package you want to support your analysis.
Requirements
if (!require(ggplot2)) {
install.packages("ggplot2")
library(ggplot2)
}
## Loading required package: ggplot2
if (!require(dplyr)) {
install.packages("dplyr")
library(dplyr, warn.conflicts = FALSE)
}
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
if (!require(xtable)) {
install.packages("xtable")
library(xtable, warn.conflicts = FALSE)
}
## Loading required package: xtable
Display session information.
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.6.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Europe/Berlin
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] xtable_1.8-4 dplyr_1.1.4 ggplot2_3.5.1
##
## loaded via a namespace (and not attached):
## [1] vctrs_0.6.5 cli_3.6.3 knitr_1.49 rlang_1.1.4
## [5] xfun_0.49 generics_0.1.3 jsonlite_1.8.9 glue_1.8.0
## [9] colorspace_2.1-1 htmltools_0.5.8.1 sass_0.4.9 scales_1.3.0
## [13] fansi_1.0.6 rmarkdown_2.29 grid_4.4.2 evaluate_1.0.1
## [17] munsell_0.5.1 jquerylib_0.1.4 tibble_3.2.1 fastmap_1.2.0
## [21] yaml_2.3.10 lifecycle_1.0.4 compiler_4.4.2 pkgconfig_2.0.3
## [25] digest_0.6.37 R6_2.5.1 tidyselect_1.2.1 utf8_1.2.4
## [29] pillar_1.9.0 magrittr_2.0.3 bslib_0.8.0 withr_3.0.2
## [33] tools_4.4.2 gtable_0.3.6 cachem_1.1.0
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
setwd("~/Projects/Coursera/ReproducibleResearch-Project2")
noaaDataFileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
noaaDataFile <- "Data/repdata-data-StormData.csv.bz2"
if (!file.exists('Data')) {
dir.create('Data')
}
if (!file.exists(noaaDataFile)) {
download.file(url = noaaDataFileURL, destfile = noaaDataFile)
}
noaaData <- read.csv(noaaDataFile, sep = ",", header = TRUE)
Dataset summary
names(noaaData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
str(noaaData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
head(noaaData)
For this analysis, the dataset is reduced to only the variables it requires.
| Variable | Description |
|---|---|
| EVTYPE | Event type (Flood, Heat, Hurricane, Tornado, …) |
| FATALITIES | Number of fatalities resulting from event |
| INJURIES | Number of injuries resulting from event |
| PROPDMG | Property damage amount (mantissa, scaled by PROPDMGEXP) |
| PROPDMGEXP | Unit multiplier for property damage (K, M, or B) |
| CROPDMG | Crop damage amount (mantissa, scaled by CROPDMGEXP) |
| CROPDMGEXP | Unit multiplier for crop damage (K, M, or B) |
| BGN_DATE | Begin date of the event |
| END_DATE | End date of the event |
| STATE | State where the event occurred |
noaaDataTidy <- subset(noaaData, EVTYPE != "?" &
(FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0),
select = c("EVTYPE",
"FATALITIES",
"INJURIES",
"PROPDMG",
"PROPDMGEXP",
"CROPDMG",
"CROPDMGEXP",
"BGN_DATE",
"END_DATE",
"STATE"))
dim(noaaDataTidy)
## [1] 254632 10
sum(is.na(noaaDataTidy))
## [1] 0
The filtered dataset has 254632 observations of 10 variables and contains no NA values.
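Note that `sum(is.na(...))` counts only `NA` entries. Several character columns in the `str()` output above hold empty strings, which are not `NA`. The following is a minimal sketch, on a small hypothetical data frame (`toy` is not part of the original analysis), of how blank strings could be counted alongside `NA` values:

```r
# Hypothetical toy data mimicking a few columns of the storm data
toy <- data.frame(
  END_DATE   = c("", "4/19/1950 0:00:00", ""),
  PROPDMGEXP = c("K", "", "M"),
  FATALITIES = c(0, 1, NA)
)

# NA values per column
naCount <- colSums(is.na(toy))

# Empty strings per column (NA entries are excluded from the comparison)
blankCount <- sapply(toy, function(col) sum(!is.na(col) & col == ""))

print(naCount)
print(blankCount)
```

Applied to `noaaDataTidy`, the same two calls would show which columns rely on empty strings rather than `NA` to mark missing information.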
The filtered dataset contains 487 unique event types.
length(unique(noaaDataTidy$EVTYPE))
## [1] 487
Some entries contain inconsistent pluralization, mixed case, and even misspellings. For example: Strong Wind, STRONG WIND, Strong Winds, and STRONG WINDS. To resolve this, all entries are converted to uppercase and then combined into categories.
noaaDataTidy$EVTYPE <- toupper(noaaDataTidy$EVTYPE)
# AVALANCHE
noaaDataTidy$EVTYPE <- gsub('.*AVALANCE.*', 'AVALANCHE', noaaDataTidy$EVTYPE)
# BLIZZARD
noaaDataTidy$EVTYPE <- gsub('.*BLIZZARD.*', 'BLIZZARD', noaaDataTidy$EVTYPE)
# CLOUD
noaaDataTidy$EVTYPE <- gsub('.*CLOUD.*', 'CLOUD', noaaDataTidy$EVTYPE)
# COLD
noaaDataTidy$EVTYPE <- gsub('.*COLD.*', 'COLD', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*FREEZ.*', 'COLD', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*FROST.*', 'COLD', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*ICE.*', 'COLD', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*LOW TEMPERATURE RECORD.*', 'COLD', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*LO.*TEMP.*', 'COLD', noaaDataTidy$EVTYPE)
# DRY
noaaDataTidy$EVTYPE <- gsub('.*DRY.*', 'DRY', noaaDataTidy$EVTYPE)
# DUST
noaaDataTidy$EVTYPE <- gsub('.*DUST.*', 'DUST', noaaDataTidy$EVTYPE)
# FIRE
noaaDataTidy$EVTYPE <- gsub('.*FIRE.*', 'FIRE', noaaDataTidy$EVTYPE)
# FLOOD
noaaDataTidy$EVTYPE <- gsub('.*FLOOD.*', 'FLOOD', noaaDataTidy$EVTYPE)
# FOG
noaaDataTidy$EVTYPE <- gsub('.*FOG.*', 'FOG', noaaDataTidy$EVTYPE)
# HAIL
noaaDataTidy$EVTYPE <- gsub('.*HAIL.*', 'HAIL', noaaDataTidy$EVTYPE)
# HEAT
noaaDataTidy$EVTYPE <- gsub('.*HEAT.*', 'HEAT', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*WARM.*', 'HEAT', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*HIGH.*TEMP.*', 'HEAT', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*RECORD HIGH TEMPERATURES.*', 'HEAT', noaaDataTidy$EVTYPE)
# HYPOTHERMIA/EXPOSURE
noaaDataTidy$EVTYPE <- gsub('.*HYPOTHERMIA.*', 'HYPOTHERMIA/EXPOSURE', noaaDataTidy$EVTYPE)
# LANDSLIDE
noaaDataTidy$EVTYPE <- gsub('.*LANDSLIDE.*', 'LANDSLIDE', noaaDataTidy$EVTYPE)
# LIGHTNING
noaaDataTidy$EVTYPE <- gsub('^LIGHTNING.*', 'LIGHTNING', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('^LIGNTNING.*', 'LIGHTNING', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('^LIGHTING.*', 'LIGHTNING', noaaDataTidy$EVTYPE)
# MICROBURST
noaaDataTidy$EVTYPE <- gsub('.*MICROBURST.*', 'MICROBURST', noaaDataTidy$EVTYPE)
# MUDSLIDE
noaaDataTidy$EVTYPE <- gsub('.*MUDSLIDE.*', 'MUDSLIDE', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*MUD SLIDE.*', 'MUDSLIDE', noaaDataTidy$EVTYPE)
# RAIN
noaaDataTidy$EVTYPE <- gsub('.*RAIN.*', 'RAIN', noaaDataTidy$EVTYPE)
# RIP CURRENT
noaaDataTidy$EVTYPE <- gsub('.*RIP CURRENT.*', 'RIP CURRENT', noaaDataTidy$EVTYPE)
# STORM
noaaDataTidy$EVTYPE <- gsub('.*STORM.*', 'STORM', noaaDataTidy$EVTYPE)
# SUMMARY
noaaDataTidy$EVTYPE <- gsub('.*SUMMARY.*', 'SUMMARY', noaaDataTidy$EVTYPE)
# TORNADO
noaaDataTidy$EVTYPE <- gsub('.*TORNADO.*', 'TORNADO', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*TORNDAO.*', 'TORNADO', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*LANDSPOUT.*', 'TORNADO', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*WATERSPOUT.*', 'TORNADO', noaaDataTidy$EVTYPE)
# SURF
noaaDataTidy$EVTYPE <- gsub('.*SURF.*', 'SURF', noaaDataTidy$EVTYPE)
# VOLCANIC
noaaDataTidy$EVTYPE <- gsub('.*VOLCANIC.*', 'VOLCANIC', noaaDataTidy$EVTYPE)
# WET
noaaDataTidy$EVTYPE <- gsub('.*WET.*', 'WET', noaaDataTidy$EVTYPE)
# WIND
noaaDataTidy$EVTYPE <- gsub('.*WIND.*', 'WIND', noaaDataTidy$EVTYPE)
# WINTER
noaaDataTidy$EVTYPE <- gsub('.*WINTER.*', 'WINTER', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*WINTRY.*', 'WINTER', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*SNOW.*', 'WINTER', noaaDataTidy$EVTYPE)
The number of unique event type values was reduced to 81.
length(unique(noaaDataTidy$EVTYPE))
## [1] 81
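The sequence of `gsub()` calls above can also be expressed as an ordered table of rules applied in a loop. The following is a minimal sketch on a handful of hypothetical labels; the `rules` vector and the `normalizeEvents` helper are illustrative names, not part of the original analysis, and only a few of the patterns are reproduced:

```r
# Hypothetical raw labels chosen for illustration
raw <- c("Strong Wind", "STRONG WINDS", "Heavy Snow",
         "TORNDAO", "Flash Flood", "LIGNTNING")

# Pattern -> category rules. They are applied in order, so rule order
# matters whenever a label matches more than one pattern.
rules <- c(
  ".*FLOOD.*"         = "FLOOD",
  "^LIGNTNING.*"      = "LIGHTNING",
  ".*TORN(ADO|DAO).*" = "TORNADO",
  ".*WIND.*"          = "WIND",
  ".*SNOW.*"          = "WINTER"
)

normalizeEvents <- function(x, rules) {
  x <- toupper(x)  # remove case differences first
  for (pattern in names(rules)) {
    x <- gsub(pattern, rules[[pattern]], x)
  }
  x
}

clean <- normalizeEvents(raw, rules)
print(table(clean))
```

Keeping the rules in one vector makes the order of application visible, which matters for labels such as WINTER STORM that match more than one pattern.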
The BGN_DATE and END_DATE variables are stored as character strings and should be converted to date types that can be worked with.
Four new variables based on date variables in the tidy dataset will be created:
| Variable | Description |
|---|---|
| DATE_START | Begin date of the event (date type) |
| DATE_END | End date of the event (date type). |
| YEAR | Year the event started |
| DURATION | Duration (in hours) |
noaaDataTidy$DATE_START <- as.Date(noaaDataTidy$BGN_DATE, format = "%m/%d/%Y")
noaaDataTidy$DATE_END <- as.Date(noaaDataTidy$END_DATE, format = "%m/%d/%Y")
noaaDataTidy$YEAR <- as.integer(format(noaaDataTidy$DATE_START, "%Y"))
# Date subtraction yields days, so request hours explicitly via difftime()
noaaDataTidy$DURATION <- as.numeric(difftime(noaaDataTidy$DATE_END, noaaDataTidy$DATE_START, units = "hours"))
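Subtracting one `Date` from another returns a difference in days, so the unit of a duration is worth making explicit. The following is a minimal sketch on two hypothetical date strings in the same text layout as BGN_DATE and END_DATE; the variable names are illustrative only:

```r
# Hypothetical sample values; the second event has no recorded end date
bgn <- c("4/18/1950 0:00:00", "6/8/1951 0:00:00")
end <- c("4/19/1950 0:00:00", "")

# as.Date() parses the leading date and ignores the trailing time text;
# an empty string cannot be parsed and becomes NA
dateStart <- as.Date(bgn, format = "%m/%d/%Y")
dateEnd   <- as.Date(end, format = "%m/%d/%Y")

# difftime() with explicit units states the unit instead of assuming it
durationHours <- as.numeric(difftime(dateEnd, dateStart, units = "hours"))
print(durationHours)
```

Because only the date part is parsed, durations computed this way are whole multiples of 24 hours, and events with a blank END_DATE get an `NA` duration.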
Information about property damage is logged using two variables, PROPDMG and PROPDMGEXP:
- PROPDMG is the mantissa (the significand), rounded to three significant digits.
- PROPDMGEXP is the exponent (the multiplier), with magnitudes K (thousands), M (millions), and B (billions).

The same approach is used for crop damage, where the CROPDMG variable is scaled by the CROPDMGEXP variable.
A quick review of the data for the PROPDMGEXP and CROPDMGEXP variables shows that there are several other characters being logged.
table(toupper(noaaDataTidy$PROPDMGEXP))
##
## - + 0 2 3 4 5 6 7 B
## 11585 1 5 210 1 1 4 18 3 3 40
## H K M
## 7 231427 11327
table(toupper(noaaDataTidy$CROPDMGEXP))
##
## ? 0 B K M
## 152663 6 17 7 99953 1986
To calculate costs, the PROPDMGEXP and CROPDMGEXP codes are mapped to a numeric multiplier, which is then used to calculate the costs for both property and crop damage.
Two new variables, PROP_COST and CROP_COST, are created to store the damage costs:
# function to get factor
getMultiplier <- function(exp) {
exp <- toupper(exp);
if (exp == "") return (10^0);
if (exp == "-") return (10^0);
if (exp == "?") return (10^0);
if (exp == "+") return (10^0);
if (exp == "0") return (10^0);
if (exp == "1") return (10^1);
if (exp == "2") return (10^2);
if (exp == "3") return (10^3);
if (exp == "4") return (10^4);
if (exp == "5") return (10^5);
if (exp == "6") return (10^6);
if (exp == "7") return (10^7);
if (exp == "8") return (10^8);
if (exp == "9") return (10^9);
if (exp == "H") return (10^2);
if (exp == "K") return (10^3);
if (exp == "M") return (10^6);
if (exp == "B") return (10^9);
return (NA);
}
# calculate property damage and crop damage costs (in billions)
noaaDataTidy$PROP_COST <- with(noaaDataTidy, as.numeric(PROPDMG) * sapply(PROPDMGEXP, getMultiplier))/10^9
noaaDataTidy$CROP_COST <- with(noaaDataTidy, as.numeric(CROPDMG) * sapply(CROPDMGEXP, getMultiplier))/10^9
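Calling `getMultiplier` through `sapply` evaluates the function once per row. As an alternative, the same mapping can be written as a named lookup vector and applied to a whole column at once. The following is a minimal sketch; `multiplierLookup`, `getMultiplierVec`, and the sample values are illustrative names and data, not part of the original analysis:

```r
# Exponent code -> multiplier, mirroring the mapping in getMultiplier.
# The empty code is handled separately because R does not allow "" as a name.
multiplierLookup <- c(
  "-" = 1, "?" = 1, "+" = 1,
  "0" = 1, "1" = 1e1, "2" = 1e2, "3" = 1e3, "4" = 1e4,
  "5" = 1e5, "6" = 1e6, "7" = 1e7, "8" = 1e8, "9" = 1e9,
  "H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9
)

getMultiplierVec <- function(exp) {
  exp <- toupper(exp)
  out <- unname(multiplierLookup[exp])  # unknown codes become NA
  out[which(exp == "")] <- 1            # empty code means no scaling
  out
}

# Hypothetical sample values for illustration
codes  <- c("K", "m", "", "B", "Z")
damage <- c(25, 2.5, 10, 1, 5)

# Damage cost in billions, as in PROP_COST and CROP_COST
costBillions <- damage * getMultiplierVec(codes) / 1e9
print(costBillions)
```

Unrecognized codes such as the hypothetical "Z" yield `NA`, matching the fallback of `getMultiplier`.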
Create a summarized dataset of health impact data (fatalities + injuries). Sort the results in descending order by health impact.
healthImpactData <- aggregate(x = list(HEALTH_IMPACT = noaaDataTidy$FATALITIES + noaaDataTidy$INJURIES),
by = list(EVENT_TYPE = noaaDataTidy$EVTYPE),
FUN = sum,
na.rm = TRUE)
healthImpactData <- healthImpactData[order(healthImpactData$HEALTH_IMPACT, decreasing = TRUE),]
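The summary above combines fatalities and injuries into a single number per event type. If the two components are also of interest, they can be kept as separate columns. The following is a minimal sketch using the formula interface of `aggregate` on a small hypothetical data frame (`toy` and `byEvent` are illustrative names):

```r
# Hypothetical toy data for illustration
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(1, 3, 0, 2, 1),
  INJURIES   = c(15, 0, 2, 1, 4)
)

# Sum fatalities and injuries separately for each event type
byEvent <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Combined health impact, sorted in descending order
byEvent$HEALTH_IMPACT <- byEvent$FATALITIES + byEvent$INJURIES
byEvent <- byEvent[order(byEvent$HEALTH_IMPACT, decreasing = TRUE), ]
print(byEvent)
```

This keeps the ranking by total impact while showing whether an event type is driven mainly by injuries or by fatalities.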
Create a summarized dataset of damage impact costs (property damage + crop damage). Sort the results in descending order by damage cost.
damageCostImpactData <- aggregate(x = list(DAMAGE_IMPACT = noaaDataTidy$PROP_COST + noaaDataTidy$CROP_COST),
by = list(EVENT_TYPE = noaaDataTidy$EVTYPE),
FUN = sum,
na.rm = TRUE)
damageCostImpactData <- damageCostImpactData[order(damageCostImpactData$DAMAGE_IMPACT, decreasing = TRUE),]
Harm to population health is measured here as the sum of fatalities and injuries. The results below display the 10 most harmful weather events in terms of population health in the U.S.
print(xtable(head(healthImpactData, 10),
caption = "Top 10 Most Harmful Weather Events to Population Health"),
caption.placement = 'top',
type = "html",
include.rownames = FALSE,
html.table.attributes='class="table-bordered" width="100%"')
| EVENT_TYPE | HEALTH_IMPACT |
|---|---|
| TORNADO | 97075.00 |
| HEAT | 12392.00 |
| FLOOD | 10127.00 |
| WIND | 9893.00 |
| LIGHTNING | 6049.00 |
| STORM | 4780.00 |
| COLD | 3100.00 |
| WINTER | 1924.00 |
| FIRE | 1698.00 |
| HAIL | 1512.00 |
healthImpactChart <- ggplot(head(healthImpactData, 10),
aes(x = reorder(EVENT_TYPE, HEALTH_IMPACT), y = HEALTH_IMPACT, fill = EVENT_TYPE)) +
coord_flip() +
geom_bar(stat = "identity") +
xlab("Event Type") +
ylab("Total Fatalities and Injuries") +
theme(plot.title = element_text(size = 14, hjust = 0.5)) +
ggtitle("Top 10 Most Harmful Weather Events to Population Health")
print(healthImpactChart)
Economic consequences are measured here as the sum of property and crop damage costs, in billions of USD. The results below display the 10 most harmful weather events in terms of economic consequences in the U.S.
print(xtable(head(damageCostImpactData, 10),
caption = "Top 10 Events with Greatest Economic Consequences"),
caption.placement = 'top',
type = "html",
include.rownames = FALSE,
html.table.attributes='class="table-bordered" width="100%"')
| EVENT_TYPE | DAMAGE_IMPACT |
|---|---|
| FLOOD | 180.58 |
| HURRICANE/TYPHOON | 71.91 |
| STORM | 70.45 |
| TORNADO | 57.43 |
| HAIL | 20.74 |
| DROUGHT | 15.02 |
| HURRICANE | 14.61 |
| COLD | 12.70 |
| WIND | 12.01 |
| FIRE | 8.90 |
damageCostImpactChart <- ggplot(head(damageCostImpactData, 10),
aes(x = reorder(EVENT_TYPE, DAMAGE_IMPACT), y = DAMAGE_IMPACT, fill = EVENT_TYPE)) +
coord_flip() +
geom_bar(stat = "identity") +
xlab("Event Type") +
ylab("Total Property / Crop Damage Cost (in Billions)") +
theme(plot.title = element_text(size = 14, hjust = 0.5)) +
ggtitle("Top 10 Events with Greatest Economic Consequences")
print(damageCostImpactChart)
The analysis above leads to the following conclusions:
Which types of weather events are most harmful to population health?
Tornadoes are responsible for the greatest number of fatalities and injuries.
Which types of weather events have the greatest economic consequences?
Floods are responsible for the greatest combined property and crop damage costs.