Synopsis

In this analysis, we calculate the total fatalities and economic losses caused by various weather events using data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which can be accessed via this link. Additional documentation on the data variables is available here:
- Data documentation 1
- Data documentation 2

The analysis begins by downloading and loading the data into R, followed by inspecting the raw data. We calculate the sum of fatalities, injuries, total health impacts (fatalities + injuries), property damage estimates, crop damage estimates, and total economic loss (property and crop damage) for each weather event across the United States. The events with the severest health impacts and highest economic losses are identified. We then extract the top 10 events based on fatalities, injuries, total health impacts, property damage, crop damage, and total economic loss. Visualizations are created to highlight these top 10 events. This analysis provides insights into the most impactful weather events in terms of human life and economic costs, shedding light on their consequences for public health and the economy.

Data Processing

Firstly, we need to download data and the related documents from the website.

# Check the R environment
sessionInfo()
## R version 4.5.2 (2025-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
##   LAPACK version 3.12.1
## 
## locale:
## [1] LC_COLLATE=Chinese (Simplified)_China.utf8 
## [2] LC_CTYPE=Chinese (Simplified)_China.utf8   
## [3] LC_MONETARY=Chinese (Simplified)_China.utf8
## [4] LC_NUMERIC=C                               
## [5] LC_TIME=Chinese (Simplified)_China.utf8    
## 
## time zone: Asia/Shanghai
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.38     R6_2.6.1          fastmap_1.2.0     xfun_0.56        
##  [5] cachem_1.1.0      knitr_1.51        htmltools_0.5.8.1 rmarkdown_2.30   
##  [9] lifecycle_1.0.5   cli_3.6.5         sass_0.4.10       jquerylib_0.1.4  
## [13] compiler_4.5.2    rstudioapi_0.18.0 tools_4.5.2       evaluate_1.0.5   
## [17] bslib_0.10.0      yaml_2.3.10       otel_0.2.0        jsonlite_2.0.0   
## [21] rlang_1.1.7
# Set working directory
setwd("E:/新技能/Epidemiology/John_Hopkins_Reproducible_research/Course_Project_2")

# Download data
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2" # raw data
download.file(fileUrl,destfile="E:/新技能/Epidemiology/John_Hopkins_Reproducible_research/Course_Project_2/storm_data.csv",method="curl")

fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf" # Variable definition document
download.file(fileUrl,destfile="E:/新技能/Epidemiology/John_Hopkins_Reproducible_research/Course_Project_2/storm_data_documentation.pdf",method="curl")

fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf" # Some FAQs related to variable definitions
download.file(fileUrl,destfile="E:/新技能/Epidemiology/John_Hopkins_Reproducible_research/Course_Project_2/storm_data_FAQ.pdf",method="curl")

And we should record the date of download at the same time.

# Record the date of download
dateDownloaded <- date()  
dateDownloaded 
## [1] "Sat Mar 21 14:27:37 2026"

Here is the timestamp I got: Sat Mar 21 14:27:37 2026.

Then we need to load the data into R and take a look at the raw data

# Load data
storm_data <- read.csv("storm_data.csv")

# inspect the raw data in R
# Take a look at the column names in the dataset
names(storm_data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
#Convert column names and values in the EVTYPE column to lowercase
colnames(storm_data) <- tolower(colnames(storm_data))
storm_data$evtype <- tolower(storm_data$evtype)

# View the first few rows of the dataset
head(storm_data)
##   state__           bgn_date bgn_time time_zone county countyname state  evtype
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL tornado
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL tornado
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL tornado
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL tornado
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL tornado
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL tornado
##   bgn_range bgn_azi bgn_locati end_date end_time county_end countyendn
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
## 4         0                                               0         NA
## 5         0                                               0         NA
## 6         0                                               0         NA
##   end_range end_azi end_locati length width f mag fatalities injuries propdmg
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
## 4         0                       0.0   100 2   0          0        2     2.5
## 5         0                       0.0   150 2   0          0        2     2.5
## 6         0                       1.5   177 2   0          0        6     2.5
##   propdmgexp cropdmg cropdmgexp wfo stateoffic zonenames latitude longitude
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
## 4          K       0                                         3458      8626
## 5          K       0                                         3412      8642
## 6          K       0                                         3450      8748
##   latitude_e longitude_ remarks refnum
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3
## 4          0          0              4
## 5          0          0              5
## 6          0          0              6

Next, we calculate total fatalities, injuries, and health impacts (fatalities + injuries) for each event types.

# Extract specific columns using column indexing
subset_storm_data <- storm_data[, c("fatalities", "injuries", "evtype")]

# Load the dplyr package
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Calculate the sum of fatalities, injuries, and total_health_impacts by evtype
health_impact <- subset_storm_data %>%
  mutate(health_impacts = fatalities + injuries) %>%  # Create total_health_impacts column
  group_by(evtype) %>%  # Group by evtype
  summarise(
    total_fatalities = sum(fatalities, na.rm = TRUE),
    total_injuries = sum(injuries, na.rm = TRUE),
    total_health_impacts = sum(health_impacts, na.rm = TRUE)
  )

In addition, we calculate property damage, crop damage and total economic consequences for each event.

# Extract specific columns using column indexing
subset_storm_data2 <- storm_data[, c("propdmg", "propdmgexp", "cropdmg", "cropdmgexp", "evtype")]

# Calculate types of events with the greatest economic consequences
# Firstly, we should take a look at the unit in PROPDMGEXP and CROPDMGEXP
unique(subset_storm_data2$propdmgexp)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(subset_storm_data2$cropdmgexp)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"
# Then, we need to converts the unit in propdmgexp and cropdmgexp into a numerical multiplier 
convert_exponent <- function(exp) {
  if (exp == "K"|| exp == "k") {
    return(1000)  # K represents thousand (1000)
  } else if (exp == "M" || exp == "m") {
    return(1000000)  # M or m represents million (1,000,000)
  } else if (exp == "B") {
    return(1000000000)  # B represents billion (1,000,000,000)
  } else if (exp == "h" || exp == "H") {
    return(100)  # h or H represents hundred (100)
  } else if (exp %in% c("1", "2", "3", "4", "5", "6", "7", "8", "0")) {
    return(10^as.numeric(exp))  # Numbers represent powers of 10 (e.g., 10^5 for "5")
  } else if (exp == "") {
    return(1)  # If there is no unit, assume multiplier is 1
  } else if (exp == "+" || exp == "-" || exp == "?") {
    return(NA)  # Other symbols are treated as missing or unclear data
  } else {
    return(NA)  # Handle undefined units
  }
}

# Load the dplyr package
library(dplyr)

# Apply the conversion function and calculate the actual property damage value
subset_storm_data2$propdmgvalue <- subset_storm_data2$propdmg * sapply(subset_storm_data2$propdmgexp, convert_exponent)
subset_storm_data2$cropdmgvalue <- subset_storm_data2$cropdmg * sapply(subset_storm_data2$cropdmgexp, convert_exponent)

# Calculate the sum of property and crop damage by event type
economic_consequences <- subset_storm_data2 %>%
  mutate(economic_loss = propdmgvalue + cropdmgvalue) %>%  # Create economic_loss column
  group_by(evtype) %>%  # Group by evtype
  summarise( 
    total_property_damage = sum(propdmgvalue, na.rm = TRUE),
    total_crop_damage = sum(cropdmgvalue, na.rm = TRUE),
    total_economic_impacts = sum(economic_loss, na.rm = TRUE)
  ) %>%
  mutate(
    total_property_damage = total_property_damage / 1e9,  # Convert to billions
    total_crop_damage = total_crop_damage / 1e9,  # Convert to billions
    total_economic_impacts = total_economic_impacts / 1e9  # Convert to billions
  )

Result

Top 10 events with the largest health impacts

And we find out the top 10 events with largest fatalities, injuries or total health impacts and visualize them.

# Sets the width and height of the plot (in inches) using 'fig.width=9, fig.height=8'.

# Arrange the data in descending order

# For total_fatalities
top_10_fatalities <- health_impact %>%
  arrange(desc(total_fatalities)) %>%  # Arrange in descending order
  head(10)  # Get the top 10 rows

# For total_injuries
top_10_injuries <- health_impact %>%
  arrange(desc(total_injuries)) %>%  # Arrange in descending order
  head(10)  # Get the top 10 rows

# For total_health_impacts
top_10_health_impacts <- health_impact %>%
  arrange(desc(total_health_impacts)) %>%  # Arrange in descending order
  head(10)  # Get the top 10 rows

# Load ggplot2
library(ggplot2)

# Create bar plot with rotated x-axis labels
p1 <- ggplot(top_10_injuries, 
       aes(x = evtype, y = total_injuries)) +
  geom_bar(stat = "identity", fill = "#FFE2CE") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
        plot.title = element_text(hjust = 0.5)) +
  labs(x = "Event Type", 
       y = "Total Injuries",
       title = "Top 10 Total Injury Events")

p2 <- ggplot(top_10_fatalities, 
       aes(x = evtype, y = total_fatalities)) +
  geom_bar(stat = "identity", fill = "#FCB2AF") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10), # Rotate x-axis labels by 45 degrees, Right-align labels to prevent overlap, # Set label size
        plot.title = element_text(hjust = 0.5)) + # Center the plot title
  labs(x = "Event Type", 
       y = "Total Fatalities",
       title = "Top 10 Total Fatality Events")

p3 <- ggplot(top_10_health_impacts, 
       aes(x = evtype, y = total_health_impacts)) +
  geom_bar(stat = "identity", fill = "#F1766D") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
        plot.title = element_text(hjust = 0.5)) +
  labs(x = "Event Type", 
       y = "Total Health Impacts",
       title = "Top 10 Total Health Impact Events")

# Use the patchwork function to combine the above three plots
# install.packages("patchwork") # Install the package
library(patchwork) # Load the package

p1 + p2 + p3 # This will arrange the plots in 3 rows and 1 column

The top 10 events with the largest total health impacts are: c(“tornado”, “excessive heat”, “tstm wind”, “flood”, “lightning”, “heat”, “flash flood”, “ice storm”, “thunderstorm wind”, “winter storm”), with tornado at the top. tornado had the largest fatalities, causing 5633 deaths. tornado also had the largest injuries, causing 9.1346^{4} injuries.

Top 10 events with the largest economic impacts

Besides, And we find out the top 10 events with largest property damage, crop damage or total economic loss and visualize them.

# Sets the width and height of the plot (in inches) using 'fig.width=11, fig.height=8'.

# Arrange the data in descending order

# For total_property_damage
top_10_property_damage <- economic_consequences %>%
  arrange(desc(total_property_damage)) %>%  # Arrange in descending order
  head(10)  # Get the top 10 rows

# For total_crop_damage
top_10_crop_damage <- economic_consequences %>%
  arrange(desc(total_crop_damage)) %>%  # Arrange in descending order
  head(10)  # Get the top 10 rows

# For total_economic_impacts
top_10_economic_impacts <- economic_consequences %>%
  arrange(desc(total_economic_impacts)) %>%  # Arrange in descending order
  head(10)  # Get the top 10 rows

# Load ggplot2
library(ggplot2)

# Create bar plot with rotated x-axis labels
p4 <- ggplot(top_10_property_damage, 
       aes(x = evtype, y = total_property_damage)) +
  geom_bar(stat = "identity", fill = "#E1F3FB") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
        plot.title = element_text(hjust = 0.5)) +
  labs(x = "Event Type", 
       y = "Total Property Damage",
       title = "Top 10 Total Property Damage Events")

p5 <- ggplot(top_10_crop_damage, 
       aes(x = evtype, y = total_crop_damage)) +
  geom_bar(stat = "identity", fill = "#9BDFDF") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
        plot.title = element_text(hjust = 0.5)) +
  labs(x = "Event Type", 
       y = "Total Crop Damage",
       title = "Top 10 Total Crop Damage Events")

p6 <- ggplot(top_10_economic_impacts, 
       aes(x = evtype, y = total_economic_impacts)) +
  geom_bar(stat = "identity", fill = "#66C1A4") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
        plot.title = element_text(hjust = 0.5)) +
  labs(x = "Event Type", 
       y = "Total Economic Impacts",
       title = "Top 10 Total Economic Impact Events")

# Use the patchwork function to combine the above three plots
install.packages("patchwork") # Install the package
## Warning: package 'patchwork' is in use and will not be installed
library(patchwork) # Load the package

p4 + p5 + p6 # This will arrange the plots in 3 rows and 1 column

The event types with the top 10 total economic losses are: c(“flood”, “hurricane/typhoon”, “tornado”, “storm surge”, “hail”, “flash flood”, “drought”, “hurricane”, “river flood”, “ice storm”). flood had the highest economic loss, with 150.3196783 billion dollars. flood had the severest impact on property, while drought caused the severest damage on crops.

Conclusion

In order to protect people from weather events, governments should take more protective measures on the following weather events: c(“tornado”, “excessive heat”, “tstm wind”, “flood”, “lightning”, “heat”, “flash flood”, “ice storm”, “thunderstorm wind”, “winter storm”). Additionally, tornado should be prioritized to prevent fatalities and save lives. More economic concerns should be placed on these events: c(“flood”, “hurricane/typhoon”, “tornado”, “storm surge”, “hail”, “flash flood”, “drought”, “hurricane”, “river flood”, “ice storm”).