Course Project

Reproducible Research Course Project 2

Peer-graded Assignment

Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Assignment

The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis. Your analysis can consist of tables, figures, or other summaries. You may use any R package you want to support your analysis.

Environment Setup

Requirements

if (!require(ggplot2)) {
    install.packages("ggplot2")
    library(ggplot2)
}
## Loading required package: ggplot2
if (!require(dplyr)) {
    install.packages("dplyr")
    library(dplyr, warn.conflicts = FALSE)
}
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
if (!require(xtable)) {
    install.packages("xtable")
    library(xtable, warn.conflicts = FALSE)
}
## Loading required package: xtable
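
The three blocks above repeat one pattern. As a minimal sketch, the same idea can be wrapped in a helper; `ensurePackage` is a hypothetical name that is not part of the original analysis, and it assumes a CRAN mirror is configured whenever an installation is actually needed.

```r
# Hypothetical helper (not in the original report): attach a package,
# installing it first only if it is not already available.
ensurePackage <- function(pkg) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
        install.packages(pkg)
    }
    # character.only = TRUE lets library() accept the name as a string
    library(pkg, character.only = TRUE)
    invisible(TRUE)
}

# Demonstrated with a package that ships with R, so nothing is installed
ensurePackage("stats")
```

For this report it would be called as `ensurePackage("ggplot2")`, and likewise for dplyr and xtable.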

Display session information.

sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.6.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Berlin
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] xtable_1.8-4  dplyr_1.1.4   ggplot2_3.5.1
## 
## loaded via a namespace (and not attached):
##  [1] vctrs_0.6.5       cli_3.6.3         knitr_1.49        rlang_1.1.4      
##  [5] xfun_0.49         generics_0.1.3    jsonlite_1.8.9    glue_1.8.0       
##  [9] colorspace_2.1-1  htmltools_0.5.8.1 sass_0.4.9        scales_1.3.0     
## [13] fansi_1.0.6       rmarkdown_2.29    grid_4.4.2        evaluate_1.0.1   
## [17] munsell_0.5.1     jquerylib_0.1.4   tibble_3.2.1      fastmap_1.2.0    
## [21] yaml_2.3.10       lifecycle_1.0.4   compiler_4.4.2    pkgconfig_2.0.3  
## [25] digest_0.6.37     R6_2.5.1          tidyselect_1.2.1  utf8_1.2.4       
## [29] pillar_1.9.0      magrittr_2.0.3    bslib_0.8.0       withr_3.0.2      
## [33] tools_4.4.2       gtable_0.3.6      cachem_1.1.0

Load Data

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

# NOTE: this path is specific to the author's machine; adjust or remove it before re-running
setwd("~/Projects/Coursera/ReproducibleResearch-Project2")
noaaDataFileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
noaaDataFile <- "Data/repdata-data-StormData.csv.bz2"
if (!file.exists('Data')) {
    dir.create('Data')
}
if (!file.exists(noaaDataFile)) {
    download.file(url = noaaDataFileURL, destfile = noaaDataFile)
}
noaaData <- read.csv(noaaDataFile, sep = ",", header = TRUE)
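
The full file is large (902,297 rows of 37 variables), and only some of its columns are used later. The sketch below illustrates, on a small inline CSV with made-up rows, how `colClasses` can skip unwanted columns while reading. The names `csvText`, `keepCols` and `smallData` are illustrative, and whether this noticeably speeds up loading the real file is an assumption that has not been measured here.

```r
# Toy CSV standing in for the storm file (made-up rows)
csvText <- paste(
    "STATE,EVTYPE,FATALITIES,REMARKS",
    "AL,TORNADO,0,none",
    "AL,HAIL,1,none",
    sep = "\n"
)

# Columns to keep; every other column is skipped while reading
keepCols <- c("EVTYPE", "FATALITIES")

# Read just the header row to learn the column names
header <- names(read.csv(text = csvText, nrows = 1))

# NA lets read.csv guess the type; "NULL" drops the column
classes <- ifelse(header %in% keepCols, NA, "NULL")

smallData <- read.csv(text = csvText, colClasses = classes)
str(smallData)
```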

Dataset summary

names(noaaData)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
str(noaaData)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
head(noaaData)

Data Processing

Data Subset

For this exercise, the dataset will be filtered to include only the required variables for this use-case.

Variable    Description
EVTYPE      Event type (Flood, Heat, Hurricane, Tornado, …)
FATALITIES  Number of fatalities resulting from the event
INJURIES    Number of injuries resulting from the event
PROPDMG     Property damage amount in USD, before scaling by PROPDMGEXP
PROPDMGEXP  Unit multiplier for property damage (K, M, or B)
CROPDMG     Crop damage amount in USD, before scaling by CROPDMGEXP
CROPDMGEXP  Unit multiplier for crop damage (K, M, or B)
BGN_DATE    Begin date of the event
END_DATE    End date of the event
STATE       State where the event occurred

noaaDataTidy <- subset(noaaData, EVTYPE != "?" &
                 (FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0),
                 select = c("EVTYPE",
                            "FATALITIES",
                            "INJURIES", 
                            "PROPDMG",
                            "PROPDMGEXP",
                            "CROPDMG",
                            "CROPDMGEXP",
                            "BGN_DATE",
                            "END_DATE",
                            "STATE"))
dim(noaaDataTidy)
## [1] 254632     10
sum(is.na(noaaDataTidy))
## [1] 0

The dataset has 254,632 observations of 10 variables and no NA values.
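
A zero count from `is.na` does not mean every field is populated: character columns such as END_DATE hold empty strings rather than NA. A minimal sketch, using a made-up three-row data frame (`toy` and `emptyCounts` are illustrative names), of counting empty strings per character column:

```r
# Made-up rows that mimic the structure of the tidy subset
toy <- data.frame(
    EVTYPE     = c("TORNADO", "HAIL", "FLOOD"),
    END_DATE   = c("", "4/18/1950 0:00:00", ""),
    PROPDMGEXP = c("K", "", "M"),
    FATALITIES = c(0, 1, 0),
    stringsAsFactors = FALSE
)

# Count empty strings in each character column
isChar <- vapply(toy, is.character, logical(1))
emptyCounts <- colSums(toy[, isChar, drop = FALSE] == "")
print(emptyCounts)
```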

Clean Event Type Data

The subset contains 487 unique event types.

length(unique(noaaDataTidy$EVTYPE))
## [1] 487

Some entries contain inconsistent pluralization, mixed case, and even misspellings. For example, Strong Wind, STRONG WIND, Strong Winds, and STRONG WINDS all describe the same event type. To resolve this, all entries are converted to uppercase and then combined into categories.

noaaDataTidy$EVTYPE <- toupper(noaaDataTidy$EVTYPE)
# AVALANCHE
noaaDataTidy$EVTYPE <- gsub('.*AVALANCE.*', 'AVALANCHE', noaaDataTidy$EVTYPE)

# BLIZZARD
noaaDataTidy$EVTYPE <- gsub('.*BLIZZARD.*', 'BLIZZARD', noaaDataTidy$EVTYPE)

# CLOUD
noaaDataTidy$EVTYPE <- gsub('.*CLOUD.*', 'CLOUD', noaaDataTidy$EVTYPE)

# COLD
noaaDataTidy$EVTYPE <- gsub('.*COLD.*', 'COLD', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*FREEZ.*', 'COLD', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*FROST.*', 'COLD', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*ICE.*', 'COLD', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*LOW TEMPERATURE RECORD.*', 'COLD', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*LO.*TEMP.*', 'COLD', noaaDataTidy$EVTYPE)

# DRY
noaaDataTidy$EVTYPE <- gsub('.*DRY.*', 'DRY', noaaDataTidy$EVTYPE)

# DUST
noaaDataTidy$EVTYPE <- gsub('.*DUST.*', 'DUST', noaaDataTidy$EVTYPE)

# FIRE
noaaDataTidy$EVTYPE <- gsub('.*FIRE.*', 'FIRE', noaaDataTidy$EVTYPE)

# FLOOD
noaaDataTidy$EVTYPE <- gsub('.*FLOOD.*', 'FLOOD', noaaDataTidy$EVTYPE)

# FOG
noaaDataTidy$EVTYPE <- gsub('.*FOG.*', 'FOG', noaaDataTidy$EVTYPE)

# HAIL
noaaDataTidy$EVTYPE <- gsub('.*HAIL.*', 'HAIL', noaaDataTidy$EVTYPE)

# HEAT
noaaDataTidy$EVTYPE <- gsub('.*HEAT.*', 'HEAT', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*WARM.*', 'HEAT', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*HIGH.*TEMP.*', 'HEAT', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*RECORD HIGH TEMPERATURES.*', 'HEAT', noaaDataTidy$EVTYPE)

# HYPOTHERMIA/EXPOSURE
noaaDataTidy$EVTYPE <- gsub('.*HYPOTHERMIA.*', 'HYPOTHERMIA/EXPOSURE', noaaDataTidy$EVTYPE)

# LANDSLIDE
noaaDataTidy$EVTYPE <- gsub('.*LANDSLIDE.*', 'LANDSLIDE', noaaDataTidy$EVTYPE)

# LIGHTNING
noaaDataTidy$EVTYPE <- gsub('^LIGHTNING.*', 'LIGHTNING', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('^LIGNTNING.*', 'LIGHTNING', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('^LIGHTING.*', 'LIGHTNING', noaaDataTidy$EVTYPE)

# MICROBURST
noaaDataTidy$EVTYPE <- gsub('.*MICROBURST.*', 'MICROBURST', noaaDataTidy$EVTYPE)

# MUDSLIDE
noaaDataTidy$EVTYPE <- gsub('.*MUDSLIDE.*', 'MUDSLIDE', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*MUD SLIDE.*', 'MUDSLIDE', noaaDataTidy$EVTYPE)

# RAIN
noaaDataTidy$EVTYPE <- gsub('.*RAIN.*', 'RAIN', noaaDataTidy$EVTYPE)

# RIP CURRENT
noaaDataTidy$EVTYPE <- gsub('.*RIP CURRENT.*', 'RIP CURRENT', noaaDataTidy$EVTYPE)

# STORM
noaaDataTidy$EVTYPE <- gsub('.*STORM.*', 'STORM', noaaDataTidy$EVTYPE)

# SUMMARY
noaaDataTidy$EVTYPE <- gsub('.*SUMMARY.*', 'SUMMARY', noaaDataTidy$EVTYPE)

# TORNADO
noaaDataTidy$EVTYPE <- gsub('.*TORNADO.*', 'TORNADO', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*TORNDAO.*', 'TORNADO', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*LANDSPOUT.*', 'TORNADO', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*WATERSPOUT.*', 'TORNADO', noaaDataTidy$EVTYPE)

# SURF
noaaDataTidy$EVTYPE <- gsub('.*SURF.*', 'SURF', noaaDataTidy$EVTYPE)

# VOLCANIC
noaaDataTidy$EVTYPE <- gsub('.*VOLCANIC.*', 'VOLCANIC', noaaDataTidy$EVTYPE)

# WET
noaaDataTidy$EVTYPE <- gsub('.*WET.*', 'WET', noaaDataTidy$EVTYPE)

# WIND
noaaDataTidy$EVTYPE <- gsub('.*WIND.*', 'WIND', noaaDataTidy$EVTYPE)

# WINTER
noaaDataTidy$EVTYPE <- gsub('.*WINTER.*', 'WINTER', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*WINTRY.*', 'WINTER', noaaDataTidy$EVTYPE)
noaaDataTidy$EVTYPE <- gsub('.*SNOW.*', 'WINTER', noaaDataTidy$EVTYPE)

The number of unique event type values was reduced to 81.

length(unique(noaaDataTidy$EVTYPE))
## [1] 81
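
The same cleaning logic can also be written as an ordered table of patterns applied in a loop, which keeps the rules in one place. This is a sketch of an alternative, not the code used above: `eventPatterns` holds only a few of the rules, and `normalizeEventType` is a hypothetical helper name. As with the `gsub` sequence, order matters because earlier rules rewrite values before later ones see them.

```r
# A small subset of the rules: regular expression -> category.
# Illustrative only; the full analysis uses many more patterns.
eventPatterns <- c(
    "TORNADO|TORNDAO"    = "TORNADO",
    "WIND"               = "WIND",
    "SNOW|WINTER|WINTRY" = "WINTER"
)

# Hypothetical helper: uppercase the input, then apply each rule in order
normalizeEventType <- function(x, patterns) {
    x <- toupper(x)
    for (i in seq_along(patterns)) {
        # Replace the whole value when the pattern occurs anywhere in it
        x[grepl(names(patterns)[i], x)] <- patterns[[i]]
    }
    x
}

cleaned <- normalizeEventType(
    c("Strong Winds", "STRONG WIND", "TORNDAO", "Heavy Snow"),
    eventPatterns
)
print(cleaned)
```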

Clean Date Data

The BGN_DATE and END_DATE variables are stored as character strings and should be converted to date types that can be worked with.

Four new variables based on date variables in the tidy dataset will be created:

Variable    Description
DATE_START  Begin date of the event (date type)
DATE_END    End date of the event (date type)
YEAR        Year the event started
DURATION    Duration of the event (in hours)

noaaDataTidy$DATE_START <- as.Date(noaaDataTidy$BGN_DATE, format = "%m/%d/%Y")
noaaDataTidy$DATE_END <- as.Date(noaaDataTidy$END_DATE, format = "%m/%d/%Y")
noaaDataTidy$YEAR <- as.integer(format(noaaDataTidy$DATE_START, "%Y"))
# Date differences are measured in days, so request hours explicitly
noaaDataTidy$DURATION <- as.numeric(difftime(noaaDataTidy$DATE_END, noaaDataTidy$DATE_START, units = "hours"))
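
Subtracting one Date from another yields a difference measured in days, so a duration in hours needs an explicit unit. A small sketch with made-up begin and end values in the dataset's date format (`bgnRaw`, `endRaw` and the other names are illustrative):

```r
# Made-up values in the same format as BGN_DATE and END_DATE
bgnRaw <- c("4/18/1950 0:00:00", "6/8/1951 0:00:00")
endRaw <- c("4/20/1950 0:00:00", "")

# The time part is ignored because the format only covers the date;
# an empty string becomes NA
dateStart <- as.Date(bgnRaw, format = "%m/%d/%Y")
dateEnd   <- as.Date(endRaw, format = "%m/%d/%Y")

# Ask difftime for hours explicitly instead of relying on the default unit
durationHours <- as.numeric(difftime(dateEnd, dateStart, units = "hours"))
print(durationHours)
```

Because the dates carry no time of day, durations computed this way are always whole multiples of 24 hours.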

Clean Economic Data

Property damage is recorded in two variables:

  • PROPDMG is the mantissa (the significand), rounded to three significant digits.
  • PROPDMGEXP is the exponent (the multiplier): K for thousands, M for millions, B for billions.

The same approach is used for crop damage, where CROPDMGEXP gives the multiplier for CROPDMG.

A quick review of the data for the PROPDMGEXP and CROPDMGEXP variables shows that there are several other characters being logged.

table(toupper(noaaDataTidy$PROPDMGEXP))
## 
##             -      +      0      2      3      4      5      6      7      B 
##  11585      1      5    210      1      1      4     18      3      3     40 
##      H      K      M 
##      7 231427  11327
table(toupper(noaaDataTidy$CROPDMGEXP))
## 
##             ?      0      B      K      M 
## 152663      6     17      7  99953   1986

To calculate costs, each PROPDMGEXP and CROPDMGEXP code is mapped to a numeric multiplier, which is then applied to PROPDMG and CROPDMG. Two new variables store the resulting damage costs, in billions of USD:

  • PROP_COST
  • CROP_COST

# function to get the numeric multiplier for an exponent code
getMultiplier <- function(exp) {
    exp <- toupper(exp);
    if (exp == "")  return (10^0);
    if (exp == "-") return (10^0);
    if (exp == "?") return (10^0);
    if (exp == "+") return (10^0);
    if (exp == "0") return (10^0);
    if (exp == "1") return (10^1);
    if (exp == "2") return (10^2);
    if (exp == "3") return (10^3);
    if (exp == "4") return (10^4);
    if (exp == "5") return (10^5);
    if (exp == "6") return (10^6);
    if (exp == "7") return (10^7);
    if (exp == "8") return (10^8);
    if (exp == "9") return (10^9);
    if (exp == "H") return (10^2);
    if (exp == "K") return (10^3);
    if (exp == "M") return (10^6);
    if (exp == "B") return (10^9);
    return (NA);
}

# calculate property damage and crop damage costs (in billions)
noaaDataTidy$PROP_COST <- with(noaaDataTidy, as.numeric(PROPDMG) * sapply(PROPDMGEXP, getMultiplier))/10^9
noaaDataTidy$CROP_COST <- with(noaaDataTidy, as.numeric(CROPDMG) * sapply(CROPDMGEXP, getMultiplier))/10^9
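
Because `getMultiplier` handles one value at a time, it has to be wrapped in `sapply`. A named lookup vector gives the same mapping in a single vectorized step. This is a sketch of an alternative under the same assumptions as the function above (symbols and blanks count as a multiplier of 1, unknown codes become NA); `multiplierTable` and `lookupMultiplier` are hypothetical names.

```r
# Exponent code -> numeric multiplier, mirroring the rules in getMultiplier
multiplierTable <- c(
    "-" = 1, "?" = 1, "+" = 1,
    setNames(10^(0:9), as.character(0:9)),
    "H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9
)

# Hypothetical vectorized counterpart of getMultiplier
lookupMultiplier <- function(exp) {
    exp <- toupper(exp)
    # Codes missing from the table come back as NA
    out <- unname(multiplierTable[exp])
    # An empty string never matches a name, so handle it separately
    out[!is.na(exp) & exp == ""] <- 1
    out
}

multipliers <- lookupMultiplier(c("K", "m", "", "B", "X"))
print(multipliers)
```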

Summarize Data

Create a summarized dataset of health impact data (fatalities + injuries). Sort the results in descending order by health impact.

healthImpactData <- aggregate(x = list(HEALTH_IMPACT = noaaDataTidy$FATALITIES + noaaDataTidy$INJURIES), 
                                  by = list(EVENT_TYPE = noaaDataTidy$EVTYPE), 
                                  FUN = sum,
                                  na.rm = TRUE)
healthImpactData <- healthImpactData[order(healthImpactData$HEALTH_IMPACT, decreasing = TRUE),]

Create a summarized dataset of damage impact costs (property damage + crop damage). Sort the results in descending order by damage cost.

damageCostImpactData <- aggregate(x = list(DAMAGE_IMPACT = noaaDataTidy$PROP_COST + noaaDataTidy$CROP_COST), 
                                  by = list(EVENT_TYPE = noaaDataTidy$EVTYPE), 
                                  FUN = sum,
                                  na.rm = TRUE)
damageCostImpactData <- damageCostImpactData[order(damageCostImpactData$DAMAGE_IMPACT, decreasing = TRUE),]
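
Adding the two components together before aggregating hides how much each contributes to the total. A hedged sketch, on a made-up data frame (`toy` and `healthByEvent` are illustrative names), of summing fatalities and injuries separately with the formula interface of `aggregate` and then ranking by their combined total:

```r
# Made-up rows; the real analysis would use noaaDataTidy
toy <- data.frame(
    EVTYPE     = c("TORNADO", "HEAT", "TORNADO"),
    FATALITIES = c(1, 2, 3),
    INJURIES   = c(10, 0, 5)
)

# Sum each measure per event type, keeping them in separate columns
healthByEvent <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                           data = toy, FUN = sum)

# Rank by the combined total, largest first
healthByEvent$TOTAL <- healthByEvent$FATALITIES + healthByEvent$INJURIES
healthByEvent <- healthByEvent[order(healthByEvent$TOTAL, decreasing = TRUE), ]
print(healthByEvent)
```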

Results

Event Types Most Harmful to Population Health

Population health impact is measured as the combined number of fatalities and injuries. The results below show the 10 weather event types most harmful to population health in the U.S.

print(xtable(head(healthImpactData, 10),
             caption = "Top 10 Most Harmful Weather Events to Population Health"),
             caption.placement = 'top',
             type = "html",
             include.rownames = FALSE,
             html.table.attributes = 'class="table-bordered" width="100%"')
Top 10 Most Harmful Weather Events to Population Health

EVENT_TYPE  HEALTH_IMPACT
TORNADO          97075.00
HEAT             12392.00
FLOOD            10127.00
WIND              9893.00
LIGHTNING         6049.00
STORM             4780.00
COLD              3100.00
WINTER            1924.00
FIRE              1698.00
HAIL              1512.00


healthImpactChart <- ggplot(head(healthImpactData, 10),
                            aes(x = reorder(EVENT_TYPE, HEALTH_IMPACT), y = HEALTH_IMPACT, fill = EVENT_TYPE)) +
                            coord_flip() +
                            geom_bar(stat = "identity") + 
                            xlab("Event Type") +
                            ylab("Total Fatalities and Injuries") +
                            theme(plot.title = element_text(size = 14, hjust = 0.5)) +
                            ggtitle("Top 10 Most Harmful Weather Events to Population Health")
print(healthImpactChart)

Event Types with Greatest Economic Consequences

Economic impact is measured as combined property and crop damage, in billions of USD. The results below show the 10 weather event types with the greatest economic consequences in the U.S.

print(xtable(head(damageCostImpactData, 10),
             caption = "Top 10 Events with Greatest Economic Consequences"),
             caption.placement = 'top',
             type = "html",
             include.rownames = FALSE,
             html.table.attributes = 'class="table-bordered" width="100%"')
Top 10 Events with Greatest Economic Consequences

EVENT_TYPE         DAMAGE_IMPACT
FLOOD                     180.58
HURRICANE/TYPHOON          71.91
STORM                      70.45
TORNADO                    57.43
HAIL                       20.74
DROUGHT                    15.02
HURRICANE                  14.61
COLD                       12.70
WIND                       12.01
FIRE                        8.90


damageCostImpactChart <- ggplot(head(damageCostImpactData, 10),
                            aes(x = reorder(EVENT_TYPE, DAMAGE_IMPACT), y = DAMAGE_IMPACT, fill = EVENT_TYPE)) +
                            coord_flip() +
                            geom_bar(stat = "identity") + 
                            xlab("Event Type") +
                            ylab("Total Property / Crop Damage Cost (in Billions)") +
                            theme(plot.title = element_text(size = 14, hjust = 0.5)) +
                            ggtitle("Top 10 Events with Greatest Economic Consequences")
print(damageCostImpactChart)

Conclusion

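The analysis above answers the two assignment questions as follows:

  • Which types of weather events are most harmful to population health?

    Tornadoes cause by far the most fatalities and injuries combined (97,075), nearly eight times the total for heat (12,392), the second-ranked event type.

  • Which types of weather events have the greatest economic consequences?

    Floods cause the greatest combined property and crop damage (about 180.58 billion USD), followed by the HURRICANE/TYPHOON (71.91) and STORM (70.45) categories.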