In this analysis, we calculate the total fatalities and economic
losses caused by various weather events using data from the U.S.
National Oceanic and Atmospheric Administration’s (NOAA) storm database,
which can be accessed via this
link. Additional documentation on the data variables is available
here:
- Data
documentation 1
- Data
documentation 2
The analysis begins by downloading and loading the data into R, followed by inspecting the raw data. We calculate the sum of fatalities, injuries, total health impacts (fatalities + injuries), property damage estimates, crop damage estimates, and total economic loss (property and crop damage) for each weather event across the United States. The events with the severest health impacts and highest economic losses are identified. We then extract the top 10 events based on fatalities, injuries, total health impacts, property damage, crop damage, and total economic loss. Visualizations are created to highlight these top 10 events. This analysis provides insights into the most impactful weather events in terms of human life and economic costs, shedding light on their consequences for public health and the economy.
Firstly, we need to download data and the related documents from the website.
# Check the R environment
sessionInfo()
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.utf8
## [2] LC_CTYPE=Chinese (Simplified)_China.utf8
## [3] LC_MONETARY=Chinese (Simplified)_China.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=Chinese (Simplified)_China.utf8
##
## time zone: Asia/Shanghai
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.38 R6_2.6.1 fastmap_1.2.0 xfun_0.56
## [5] cachem_1.1.0 knitr_1.51 htmltools_0.5.8.1 rmarkdown_2.30
## [9] lifecycle_1.0.5 cli_3.6.5 sass_0.4.10 jquerylib_0.1.4
## [13] compiler_4.5.2 rstudioapi_0.18.0 tools_4.5.2 evaluate_1.0.5
## [17] bslib_0.10.0 yaml_2.3.10 otel_0.2.0 jsonlite_2.0.0
## [21] rlang_1.1.7
# Set working directory
setwd("E:/新技能/Epidemiology/John_Hopkins_Reproducible_research/Course_Project_2")
# Download data
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2" # raw data
download.file(fileUrl,destfile="E:/新技能/Epidemiology/John_Hopkins_Reproducible_research/Course_Project_2/storm_data.csv",method="curl")
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf" # Variable definition document
download.file(fileUrl,destfile="E:/新技能/Epidemiology/John_Hopkins_Reproducible_research/Course_Project_2/storm_data_documentation.pdf",method="curl")
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf" # Some FAQs related to variable definitions
download.file(fileUrl,destfile="E:/新技能/Epidemiology/John_Hopkins_Reproducible_research/Course_Project_2/storm_data_FAQ.pdf",method="curl")
And we should record the date of download at the same time.
# Record the date of download
dateDownloaded <- date()
dateDownloaded
## [1] "Sat Mar 21 14:27:37 2026"
Here is the timestamp I got: Sat Mar 21 14:27:37 2026.
Then we need to load the data into R and take a look at the raw data
# Load data
storm_data <- read.csv("storm_data.csv")
# inspect the raw data in R
# Take a look at the column names in the dataset
names(storm_data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
#Convert column names and values in the EVTYPE column to lowercase
colnames(storm_data) <- tolower(colnames(storm_data))
storm_data$evtype <- tolower(storm_data$evtype)
# View the first few rows of the dataset
head(storm_data)
## state__ bgn_date bgn_time time_zone county countyname state evtype
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL tornado
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL tornado
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL tornado
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL tornado
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL tornado
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL tornado
## bgn_range bgn_azi bgn_locati end_date end_time county_end countyendn
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## end_range end_azi end_locati length width f mag fatalities injuries propdmg
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## propdmgexp cropdmg cropdmgexp wfo stateoffic zonenames latitude longitude
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## latitude_e longitude_ remarks refnum
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
Next, we calculate total fatalities, injuries, and health impacts (fatalities + injuries) for each event types.
# Extract specific columns using column indexing
subset_storm_data <- storm_data[, c("fatalities", "injuries", "evtype")]
# Load the dplyr package
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Calculate the sum of fatalities, injuries, and total_health_impacts by evtype
health_impact <- subset_storm_data %>%
mutate(health_impacts = fatalities + injuries) %>% # Create total_health_impacts column
group_by(evtype) %>% # Group by evtype
summarise(
total_fatalities = sum(fatalities, na.rm = TRUE),
total_injuries = sum(injuries, na.rm = TRUE),
total_health_impacts = sum(health_impacts, na.rm = TRUE)
)
In addition, we calculate property damage, crop damage and total economic consequences for each event.
# Extract specific columns using column indexing
subset_storm_data2 <- storm_data[, c("propdmg", "propdmgexp", "cropdmg", "cropdmgexp", "evtype")]
# Calculate types of events with the greatest economic consequences
# Firstly, we should take a look at the unit in PROPDMGEXP and CROPDMGEXP
unique(subset_storm_data2$propdmgexp)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(subset_storm_data2$cropdmgexp)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
# Then, we need to converts the unit in propdmgexp and cropdmgexp into a numerical multiplier
convert_exponent <- function(exp) {
if (exp == "K"|| exp == "k") {
return(1000) # K represents thousand (1000)
} else if (exp == "M" || exp == "m") {
return(1000000) # M or m represents million (1,000,000)
} else if (exp == "B") {
return(1000000000) # B represents billion (1,000,000,000)
} else if (exp == "h" || exp == "H") {
return(100) # h or H represents hundred (100)
} else if (exp %in% c("1", "2", "3", "4", "5", "6", "7", "8", "0")) {
return(10^as.numeric(exp)) # Numbers represent powers of 10 (e.g., 10^5 for "5")
} else if (exp == "") {
return(1) # If there is no unit, assume multiplier is 1
} else if (exp == "+" || exp == "-" || exp == "?") {
return(NA) # Other symbols are treated as missing or unclear data
} else {
return(NA) # Handle undefined units
}
}
# Load the dplyr package
library(dplyr)
# Apply the conversion function and calculate the actual property damage value
subset_storm_data2$propdmgvalue <- subset_storm_data2$propdmg * sapply(subset_storm_data2$propdmgexp, convert_exponent)
subset_storm_data2$cropdmgvalue <- subset_storm_data2$cropdmg * sapply(subset_storm_data2$cropdmgexp, convert_exponent)
# Calculate the sum of property and crop damage by event type
economic_consequences <- subset_storm_data2 %>%
mutate(economic_loss = propdmgvalue + cropdmgvalue) %>% # Create economic_loss column
group_by(evtype) %>% # Group by evtype
summarise(
total_property_damage = sum(propdmgvalue, na.rm = TRUE),
total_crop_damage = sum(cropdmgvalue, na.rm = TRUE),
total_economic_impacts = sum(economic_loss, na.rm = TRUE)
) %>%
mutate(
total_property_damage = total_property_damage / 1e9, # Convert to billions
total_crop_damage = total_crop_damage / 1e9, # Convert to billions
total_economic_impacts = total_economic_impacts / 1e9 # Convert to billions
)
And we find out the top 10 events with largest fatalities, injuries or total health impacts and visualize them.
# Sets the width and height of the plot (in inches) using 'fig.width=9, fig.height=8'.
# Arrange the data in descending order
# For total_fatalities
top_10_fatalities <- health_impact %>%
arrange(desc(total_fatalities)) %>% # Arrange in descending order
head(10) # Get the top 10 rows
# For total_injuries
top_10_injuries <- health_impact %>%
arrange(desc(total_injuries)) %>% # Arrange in descending order
head(10) # Get the top 10 rows
# For total_health_impacts
top_10_health_impacts <- health_impact %>%
arrange(desc(total_health_impacts)) %>% # Arrange in descending order
head(10) # Get the top 10 rows
# Load ggplot2
library(ggplot2)
# Create bar plot with rotated x-axis labels
p1 <- ggplot(top_10_injuries,
aes(x = evtype, y = total_injuries)) +
geom_bar(stat = "identity", fill = "#FFE2CE") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
plot.title = element_text(hjust = 0.5)) +
labs(x = "Event Type",
y = "Total Injuries",
title = "Top 10 Total Injury Events")
p2 <- ggplot(top_10_fatalities,
aes(x = evtype, y = total_fatalities)) +
geom_bar(stat = "identity", fill = "#FCB2AF") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10), # Rotate x-axis labels by 45 degrees, Right-align labels to prevent overlap, # Set label size
plot.title = element_text(hjust = 0.5)) + # Center the plot title
labs(x = "Event Type",
y = "Total Fatalities",
title = "Top 10 Total Fatality Events")
p3 <- ggplot(top_10_health_impacts,
aes(x = evtype, y = total_health_impacts)) +
geom_bar(stat = "identity", fill = "#F1766D") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
plot.title = element_text(hjust = 0.5)) +
labs(x = "Event Type",
y = "Total Health Impacts",
title = "Top 10 Total Health Impact Events")
# Use the patchwork function to combine the above three plots
# install.packages("patchwork") # Install the package
library(patchwork) # Load the package
p1 + p2 + p3 # This will arrange the plots in 3 rows and 1 column
The top 10 events with the largest total health impacts are: c(“tornado”, “excessive heat”, “tstm wind”, “flood”, “lightning”, “heat”, “flash flood”, “ice storm”, “thunderstorm wind”, “winter storm”), with tornado at the top. tornado had the largest fatalities, causing 5633 deaths. tornado also had the largest injuries, causing 9.1346^{4} injuries.
Besides, And we find out the top 10 events with largest property damage, crop damage or total economic loss and visualize them.
# Sets the width and height of the plot (in inches) using 'fig.width=11, fig.height=8'.
# Arrange the data in descending order
# For total_property_damage
top_10_property_damage <- economic_consequences %>%
arrange(desc(total_property_damage)) %>% # Arrange in descending order
head(10) # Get the top 10 rows
# For total_crop_damage
top_10_crop_damage <- economic_consequences %>%
arrange(desc(total_crop_damage)) %>% # Arrange in descending order
head(10) # Get the top 10 rows
# For total_economic_impacts
top_10_economic_impacts <- economic_consequences %>%
arrange(desc(total_economic_impacts)) %>% # Arrange in descending order
head(10) # Get the top 10 rows
# Load ggplot2
library(ggplot2)
# Create bar plot with rotated x-axis labels
p4 <- ggplot(top_10_property_damage,
aes(x = evtype, y = total_property_damage)) +
geom_bar(stat = "identity", fill = "#E1F3FB") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
plot.title = element_text(hjust = 0.5)) +
labs(x = "Event Type",
y = "Total Property Damage",
title = "Top 10 Total Property Damage Events")
p5 <- ggplot(top_10_crop_damage,
aes(x = evtype, y = total_crop_damage)) +
geom_bar(stat = "identity", fill = "#9BDFDF") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
plot.title = element_text(hjust = 0.5)) +
labs(x = "Event Type",
y = "Total Crop Damage",
title = "Top 10 Total Crop Damage Events")
p6 <- ggplot(top_10_economic_impacts,
aes(x = evtype, y = total_economic_impacts)) +
geom_bar(stat = "identity", fill = "#66C1A4") +
theme_bw() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
plot.title = element_text(hjust = 0.5)) +
labs(x = "Event Type",
y = "Total Economic Impacts",
title = "Top 10 Total Economic Impact Events")
# Use the patchwork function to combine the above three plots
install.packages("patchwork") # Install the package
## Warning: package 'patchwork' is in use and will not be installed
library(patchwork) # Load the package
p4 + p5 + p6 # This will arrange the plots in 3 rows and 1 column
The event types with the top 10 total economic losses are: c(“flood”, “hurricane/typhoon”, “tornado”, “storm surge”, “hail”, “flash flood”, “drought”, “hurricane”, “river flood”, “ice storm”). flood had the highest economic loss, with 150.3196783 billion dollars. flood had the severest impact on property, while drought caused the severest damage on crops.
In order to protect people from weather events, governments should take more protective measures on the following weather events: c(“tornado”, “excessive heat”, “tstm wind”, “flood”, “lightning”, “heat”, “flash flood”, “ice storm”, “thunderstorm wind”, “winter storm”). Additionally, tornado should be prioritized to prevent fatalities and save lives. More economic concerns should be placed on these events: c(“flood”, “hurricane/typhoon”, “tornado”, “storm surge”, “hail”, “flash flood”, “drought”, “hurricane”, “river flood”, “ice storm”).