This project examines NOAA’s Storm Events Database, which documents significant weather occurrences across the United States, detailing their timing, locations, and associated damages. The analysis aims to identify the event types that pose the greatest threats to public health and those that lead to substantial economic losses.
Data Format: The dataset is provided as a comma-separated values (CSV) file, compressed with bzip2 to reduce its size.
Download Link: You can download the file from the course website:
Storm Data [47Mb] https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
Documentation: Additional information on the database, including how the variables are defined and constructed, is available through:
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
Data Coverage: The database records events starting from 1950 up to November 2011. Earlier years may have fewer recorded events, likely due to less comprehensive record-keeping, whereas more recent years are considered more complete.
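Regarding the data format noted above: base R can typically read a bzip2-compressed CSV directly, because file connections detect the compression when opened for reading. The sketch below is only an illustration on a small made-up table written to a temporary file; the names toy, tmp, and toy_back are hypothetical and it does not touch the storm data.

```r
# Hypothetical toy table standing in for the real storm file
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD"),
  FATALITIES = c(3, 1),
  stringsAsFactors = FALSE
)

# Write the table as a bzip2-compressed CSV in a temporary location
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path through a file connection, which
# detects the bzip2 compression when reading
toy_back <- read.csv(tmp, stringsAsFactors = FALSE)
print(toy_back)
```

Under this assumption, a separate decompression step is a convenience rather than a requirement.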
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
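# Download the compressed file once (binary mode keeps the archive intact)
if (!file.exists("stormdata.csv.bz2")) {
  url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  download.file(url, "stormdata.csv.bz2", mode = "wb")
}
# Decompress it if needed; bunzip2() comes from the R.utils package
if (!file.exists("stormdata.csv")) {
  R.utils::bunzip2("stormdata.csv.bz2", "stormdata.csv", remove = FALSE)
}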
StormInfo <- data.table::fread("stormdata.csv", fill=TRUE, header=TRUE)
head(StormInfo)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## <char> <char> <char> <char> <char> <char> <char>
## 1: 1.00 4/18/1950 0:00:00 0130 CST 97.00 MOBILE AL
## 2: 1.00 4/18/1950 0:00:00 0145 CST 3.00 BALDWIN AL
## 3: 1.00 2/20/1951 0:00:00 1600 CST 57.00 FAYETTE AL
## 4: 1.00 6/8/1951 0:00:00 0900 CST 89.00 MADISON AL
## 5: 1.00 11/15/1951 0:00:00 1500 CST 43.00 CULLMAN AL
## 6: 1.00 11/15/1951 0:00:00 2000 CST 77.00 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## <char> <char> <char> <char> <char> <char> <num> <lgcl>
## 1: TORNADO 0.00 0 NA
## 2: TORNADO 0.00 0 NA
## 3: TORNADO 0.00 0 NA
## 4: TORNADO 0.00 0 NA
## 5: TORNADO 0.00 0 NA
## 6: TORNADO 0.00 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES
## <num> <char> <char> <num> <num> <int> <num> <num> <num>
## 1: 0 14.0 100 3 0 0 15
## 2: 0 2.0 150 2 0 0 0
## 3: 0 0.1 123 2 0 0 2
## 4: 0 0.0 100 2 0 0 2
## 5: 0 0.0 150 2 0 0 2
## 6: 0 1.5 177 2 0 0 6
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE
## <num> <char> <num> <char> <char> <char> <char> <num>
## 1: 25.0 K 0 3040
## 2: 2.5 K 0 3042
## 3: 25.0 K 0 3340
## 4: 2.5 K 0 3458
## 5: 2.5 K 0 3412
## 6: 2.5 K 0 3450
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## <num> <num> <num> <char> <num>
## 1: 8812 3051 8806 1
## 2: 8755 0 0 2
## 3: 8742 0 0 3
## 4: 8626 0 0 4
## 5: 8642 0 0 5
## 6: 8748 0 0 6
Next, we check the variable names:
names(StormInfo)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
Now, we will select only the variables relevant to our analysis and convert their names to lowercase by creating a subset using dplyr.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.4.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Stormsubset <- StormInfo %>%
  select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
  rename_with(tolower)
str(Stormsubset)
## Classes 'data.table' and 'data.frame': 902297 obs. of 7 variables:
## $ evtype : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ fatalities: num 0 0 0 0 0 0 0 0 1 0 ...
## $ injuries : num 15 0 2 2 2 6 1 0 14 0 ...
## $ propdmg : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ propdmgexp: chr "K" "K" "K" "K" ...
## $ cropdmg : num 0 0 0 0 0 0 0 0 0 0 ...
## $ cropdmgexp: chr "" "" "" "" ...
## - attr(*, ".internal.selfref")=<externalptr>
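To answer the first question, we proceed in three steps: select the columns related to population health, total fatalities and injuries by event type, and keep the 10 event types with the most fatalities (ties broken by injuries). The result is then reshaped to long format for a bar plot.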
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.4.3
step1 <- Stormsubset %>%
  select(evtype, fatalities, injuries)
print(head(step1))
## evtype fatalities injuries
## <char> <num> <num>
## 1: TORNADO 0 15
## 2: TORNADO 0 0
## 3: TORNADO 0 2
## 4: TORNADO 0 2
## 5: TORNADO 0 2
## 6: TORNADO 0 6
step2 <- step1 %>%
  group_by(evtype) %>%
  summarize(fatalities = sum(fatalities), injuries = sum(injuries), .groups = 'drop')
print(head(step2))
## # A tibble: 6 × 3
## evtype fatalities injuries
## <chr> <dbl> <dbl>
## 1 " HIGH SURF ADVISORY" 0 0
## 2 " COASTAL FLOOD" 0 0
## 3 " FLASH FLOOD" 0 0
## 4 " LIGHTNING" 0 0
## 5 " TSTM WIND" 0 0
## 6 " TSTM WIND (G45)" 0 0
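The leading spaces in the event types printed above suggest that some labels differ only in whitespace (and possibly letter case), which would split one event type across several groups. The sketch below shows one possible normalization on made-up rows; it is not part of the original analysis, and the column evtype_clean and the toy values are hypothetical.

```r
# Hypothetical toy rows; the labels mimic whitespace and case variants
toy <- data.frame(
  evtype = c("TSTM WIND", "   TSTM WIND", "tstm wind",
             "FLASH FLOOD", " FLASH FLOOD"),
  fatalities = c(2, 0, 1, 4, 1),
  stringsAsFactors = FALSE
)

# Trim surrounding whitespace and force upper case before grouping
toy$evtype_clean <- toupper(trimws(toy$evtype))

# Total fatalities per cleaned event type
totals <- tapply(toy$fatalities, toy$evtype_clean, sum)
print(totals)
```

Whether such variants should be merged in the real data is a judgment call; the totals reported in this analysis use the labels exactly as recorded.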
step3 <- step2 %>%
  arrange(desc(fatalities), desc(injuries)) %>%
  slice(1:10)
print(step3)
## # A tibble: 10 × 3
## evtype fatalities injuries
## <chr> <dbl> <dbl>
## 1 TORNADO 5633 91346
## 2 EXCESSIVE HEAT 1903 6525
## 3 FLASH FLOOD 978 1777
## 4 HEAT 937 2100
## 5 LIGHTNING 816 5230
## 6 TSTM WIND 504 6957
## 7 FLOOD 470 6789
## 8 RIP CURRENT 368 232
## 9 HIGH WIND 248 1137
## 10 AVALANCHE 224 170
Effect_health <- step3 %>%
  pivot_longer(cols = c(fatalities, injuries), names_to = "type", values_to = "value")
print(head(Effect_health))
## # A tibble: 6 × 3
## evtype type value
## <chr> <chr> <dbl>
## 1 TORNADO fatalities 5633
## 2 TORNADO injuries 91346
## 3 EXCESSIVE HEAT fatalities 1903
## 4 EXCESSIVE HEAT injuries 6525
## 5 FLASH FLOOD fatalities 978
## 6 FLASH FLOOD injuries 1777
# Install ggplot2 only if it is not already available
if (!requireNamespace("ggplot2", quietly = TRUE)) {
  install.packages("ggplot2")
}
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.4.3
# Check that the data exist before plotting
if (exists("Effect_health")) {
  str(Effect_health)
  head(Effect_health)
  # Plot
  ggplot(data = Effect_health, aes(x = reorder(evtype, -value), y = value, fill = type)) +
    geom_bar(position = "dodge", stat = "identity") +
    labs(x = "Event Type", y = "Count") +
    theme_bw() +
    theme(axis.text.x = element_text(angle = 20, vjust = 0.7)) +
    ggtitle("Total Number of Fatalities and Injuries of Top 10 Storm Event Types") +
    scale_fill_manual(values = c("blue", "gray"))
} else {
  stop("The object 'Effect_health' is not defined. Please make sure the data were processed correctly.")
}
## tibble [20 × 3] (S3: tbl_df/tbl/data.frame)
## $ evtype: chr [1:20] "TORNADO" "TORNADO" "EXCESSIVE HEAT" "EXCESSIVE HEAT" ...
## $ type : chr [1:20] "fatalities" "injuries" "fatalities" "injuries" ...
## $ value : num [1:20] 5633 91346 1903 6525 978 ...
Tornadoes clearly have the greatest impact on public health: they account for 5,633 fatalities and 91,346 injuries, the highest totals of any event type. Excessive heat ranks second in fatalities, with 1,903.
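Property damage is recorded in two columns: PROPDMG holds the amount and PROPDMGEXP holds a letter code for its magnitude (for example, K for thousands, M for millions, B for billions). CROPDMG and CROPDMGEXP follow the same pattern for crop damage. Together, these variables can be used to identify the events with the most significant economic impact.
However, the exponent codes are inconsistent, so I created a function that maps each code to a multiplier expressed in millions of dollars. Multiplying each amount by its multiplier gives the cost in millions.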
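# Map a damage exponent code to a multiplier expressed in millions of dollars
cost_economy <- function(x) {
  if (x == "H") {
    1E-4   # hundreds
  } else if (x == "K") {
    1E-3   # thousands
  } else if (x == "M") {
    1      # millions
  } else if (x == "B") {
    1E3    # billions
  } else {
    1E-6   # any other code is treated as plain dollars
  }
}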
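The same mapping can also be written as a vectorized lookup, which avoids calling the function once per row. The sketch below is an alternative under stated assumptions, not the method used in this analysis: the names multipliers and to_millions are hypothetical, and it uppercases the codes first, which assumes that lower-case codes such as "m" mean the same as their upper-case forms.

```r
# Multipliers that convert each exponent code to millions of dollars
multipliers <- c(H = 1e-4, K = 1e-3, M = 1, B = 1e3)

# Hypothetical helper: vectorized over both arguments
to_millions <- function(amount, exp_code) {
  # Look up each code by name; unknown or empty codes give NA
  m <- unname(multipliers[toupper(exp_code)])
  # Treat any unrecognized code as plain dollars
  m[is.na(m)] <- 1e-6
  amount * m
}

# Example: 25 thousand, 2.5 million, 1 billion, and 10 dollars
costs <- to_millions(c(25, 2.5, 1, 10), c("K", "M", "B", ""))
print(costs)
```

Because the lookup handles whole columns at once, it could be applied directly inside mutate() without sapply().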
With the exponents standardized, we build Effect_economy, which totals property and crop damage (in millions of dollars) by event type and keeps the 10 event types with the highest property damage, with ties broken by crop damage.
Effect_economy <- Stormsubset %>%
  select("evtype", "propdmg", "propdmgexp", "cropdmg", "cropdmgexp") %>%
  mutate(
    prop_dmg = propdmg * sapply(propdmgexp, FUN = cost_economy),
    crop_dmg = cropdmg * sapply(cropdmgexp, FUN = cost_economy),
    .keep = "unused"
  ) %>%
  group_by(evtype) %>%
  summarize(property = sum(prop_dmg), crop = sum(crop_dmg), .groups = 'drop') %>%
  arrange(desc(property), desc(crop)) %>%
  slice(1:10) %>%
  pivot_longer(cols = c(property, crop), names_to = "type", values_to = "value")
ggplot(data = Effect_economy, aes(reorder(evtype, -value), value, fill = type)) +
  geom_bar(position = "dodge", stat = "identity") +
  labs(x = "Event Type", y = "Cost (millions of USD)") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 25, vjust = 0.5)) +
  ggtitle("Total Cost of Property and Crop Damage by Top 10 Storm Event Types") +
  scale_fill_manual(values = c("blue", "grey"))
The bar plot shows that floods and hurricanes/typhoons incur the highest property and crop damage costs, making them the most economically impactful events.
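Note that the ranking above orders event types by property damage first, so an event type whose losses are mostly crop damage can fall outside the top 10. The sketch below uses entirely made-up values to show how ranking by combined damage can give a different order; the numbers and the total column are hypothetical and say nothing about the real data.

```r
# Made-up damage totals in millions of dollars (illustration only)
toy <- data.frame(
  evtype   = c("FLOOD", "DROUGHT", "HAIL", "TORNADO"),
  property = c(120, 5, 40, 90),
  crop     = c(10, 140, 30, 1),
  stringsAsFactors = FALSE
)

# Combined economic damage per event type
toy$total <- toy$property + toy$crop

# Rank by property damage alone, then by combined damage
by_property <- toy$evtype[order(-toy$property)]
by_total    <- toy$evtype[order(-toy$total)]

print(by_property)
print(by_total)
```

In this toy case the crop-heavy event type moves from last place to first once crop damage is included in the ranking.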
The analysis of NOAA’s Storm Events Database highlights significant findings regarding the health and economic impacts of severe weather events in the United States:
Impact on Public Health: Tornadoes are the most detrimental to public health, causing the highest number of fatalities and injuries across the country. Their unpredictable nature and intensity make them a critical focus area for disaster preparedness and response efforts.
Economic Consequences: Floods and hurricanes/typhoons lead to the most substantial property and crop damage, resulting in billions of dollars in losses. These events underscore the need for enhanced infrastructure, insurance mechanisms, and climate adaptation strategies to mitigate future economic impacts.
This analysis serves as a foundation for policymakers and stakeholders to prioritize resources, improve resilience, and protect communities from the adverse effects of natural disasters. By understanding the patterns of harm and cost, targeted interventions can be designed to minimize both human and financial losses effectively.