In this analysis, we calculate the total fatalities and economic
losses caused by various weather events using data from the U.S.
National Oceanic and Atmospheric Administration’s (NOAA) storm database,
which can be accessed via this
link. Additional documentation on the data variables is available
here:
- Data documentation 1
- Data documentation 2
The analysis starts by downloading the data, loading it into R, and inspecting the raw records. We then sum fatalities and economic losses (measured here as property damage) for each weather event type across the United States, sort the totals in descending order, and extract the top 10 event types for each measure. Bar plots highlight these top 10 events. The results show which weather events have had the greatest impact on human life and on the economy.
First, we download the data and the related documentation from the website.
# Set working directory
setwd("E:/新技能/Epidemiology/John_Hopkins_Reproducible_research/Course_Project_2")
# Download data
# Note: the file is bzip2-compressed even though it is saved with a .csv name;
# read.csv() detects the compression and reads it directly
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2" # raw data
download.file(fileUrl, destfile = "E:/新技能/Epidemiology/John_Hopkins_Reproducible_research/Course_Project_2/storm_data.csv", method = "curl")
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf" # Variable definition document
download.file(fileUrl, destfile = "E:/新技能/Epidemiology/John_Hopkins_Reproducible_research/Course_Project_2/storm_data_documentation.pdf", method = "curl")
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf" # Some FAQs related to variable definitions
download.file(fileUrl, destfile = "E:/新技能/Epidemiology/John_Hopkins_Reproducible_research/Course_Project_2/storm_data_FAQ.pdf", method = "curl")
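To make the download step cheaper to re-run, one option is to skip files that are already on disk. The following is a minimal sketch, not part of the original analysis: `download_if_missing` is a hypothetical helper name, and `mode = "wb"` is an added assumption so that binary files (the compressed data and the PDFs) are written in binary mode, which matters on Windows.

```r
# Hypothetical helper (not in the original analysis): download a file only
# when it is not already present, so re-running the report does not fetch it again.
download_if_missing <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(invisible(FALSE))  # file already present, nothing downloaded
  }
  # mode = "wb" writes the file in binary mode
  download.file(url, destfile = destfile, mode = "wb")
  invisible(TRUE)             # file was downloaded
}
```

With the working directory set as above, a call such as `download_if_missing(fileUrl, "storm_data.csv")` would fetch the raw data on the first run only.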
We also record the date of the download.
# Record the date of download
dateDownloaded <- date()
dateDownloaded
## [1] "Fri Mar 20 18:01:13 2026"
Here is the timestamp I got: Fri Mar 20 18:01:13 2026.
Next, we load the data into R and inspect the raw data.
# Load data
storm_data <- read.csv("storm_data.csv")
# inspect the raw data in R
# Take a look at the column names in the dataset
names(storm_data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
# View the first few rows of the dataset
head(storm_data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
Next, we calculate the total fatalities for each event type and find the event type with the most fatalities.
# Calculate the total fatalities by event type
fatalities_sum <- aggregate(FATALITIES ~ EVTYPE, data = storm_data, FUN = sum, na.rm = TRUE)
# Identify the event with the maximum fatalities
max_fatalities <- fatalities_sum[which.max(fatalities_sum$FATALITIES),]
max_fatalities
## EVTYPE FATALITIES
## 834 TORNADO 5633
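To illustrate how the formula interface of `aggregate` groups and sums, here is a small sketch on made-up data. The `toy` data frame and its values are invented for illustration and are not taken from the storm database.

```r
# Invented toy data: three event records across two event types
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO"),
  FATALITIES = c(2, 1, 3)
)

# Sum FATALITIES within each EVTYPE group
toy_sum <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Row with the largest total
toy_sum[which.max(toy_sum$FATALITIES), ]
```

Here TORNADO totals 5 and FLOOD totals 1, so the TORNADO row is returned, mirroring the pattern used on the full dataset.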
In addition, we calculate the economic consequences of each event type, measured as property damage, and find the event type with the largest total economic loss.
# Find the types of events with the greatest economic consequences
# First, we need to convert the unit code in PROPDMGEXP into a numerical multiplier
# Create a function to convert PROPDMGEXP to a numerical multiplier
convert_exponent <- function(exp) {
  if (exp == "K") {
    return(1000) # K represents thousand (1000)
  } else if (exp == "M" || exp == "m") {
    return(1000000) # M or m represents million (1,000,000)
  } else if (exp == "B") {
    return(1000000000) # B represents billion (1,000,000,000)
  } else if (exp == "h" || exp == "H") {
    return(100) # h or H represents hundred (100)
  } else if (exp %in% c("1", "2", "3", "4", "5", "6", "7", "8", "0")) {
    return(10^as.numeric(exp)) # Numbers represent powers of 10 (e.g., 10^5 for "5")
  } else if (exp == "") {
    return(1) # If there is no unit, assume multiplier is 1
  } else if (exp == "+" || exp == "-" || exp == "?") {
    return(NA) # Other symbols are treated as missing or unclear data
  } else {
    return(NA) # Handle undefined units
  }
}
# Apply the conversion function and calculate the actual property damage value
storm_data$PROPDMGVALUE <- storm_data$PROPDMG * sapply(storm_data$PROPDMGEXP, convert_exponent)
# Calculate the total economic loss by event type
economic_consequences <- aggregate(PROPDMGVALUE ~ EVTYPE, data = storm_data, FUN = sum, na.rm = TRUE)
# Identify the event with the greatest economic consequences
max_economic_consequences <- economic_consequences[which.max(economic_consequences$PROPDMGVALUE),]
max_economic_consequences
## EVTYPE PROPDMGVALUE
## 169 FLOOD 144657709807
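The exponent conversion above calls `convert_exponent` once per row through `sapply`. As a possible alternative, the same mapping can be sketched as a named lookup vector that converts the whole column in one step. The names `exp_multiplier` and `convert_exponent_vec` are hypothetical and not part of the original analysis. The codes and multipliers are copied from `convert_exponent`.

```r
# Named lookup table: exponent code -> multiplier
# (same mapping as convert_exponent in the analysis)
exp_multiplier <- c(
  "K" = 1e3, "M" = 1e6, "m" = 1e6, "B" = 1e9,
  "H" = 1e2, "h" = 1e2,
  setNames(10^(0:8), as.character(0:8))  # "0".."8" are powers of 10
)

# Hypothetical vectorized helper (not in the original analysis)
convert_exponent_vec <- function(exp) {
  exp <- as.character(exp)
  out <- unname(exp_multiplier[exp])  # unknown codes such as "+" become NA
  out[which(exp == "")] <- 1          # no unit: multiplier of 1
  out
}

convert_exponent_vec(c("K", "m", "5", "", "?"))
```

For the five example codes this gives 1000, 1e6, 1e5, 1 and NA, the same values `convert_exponent` returns for each code individually.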
The event type with the most fatalities is TORNADO. Next, we sort the totals to identify the 10 event types with the most fatalities and visualize them.
# Arrange the fatalities_sum in descending order of total fatalities
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
fatalities_sum <- arrange(fatalities_sum, desc(FATALITIES))
# Filter the top 10 events with the highest fatalities
top10_fatalities_sum <- fatalities_sum[1:10,]
# Load ggplot2
library(ggplot2)
# Create bar plot with rotated x-axis labels
ggplot(top10_fatalities_sum,
       aes(x = reorder(EVTYPE, -FATALITIES), y = FATALITIES)) +
  geom_bar(stat = "identity", fill = "lightcoral") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
        plot.title = element_text(hjust = 0.5)) +
  labs(x = "Event Type",
       y = "Total Fatalities",
       title = "Event Type with the Top 10 Total Fatalities")
The bar plot shows that the 10 event types with the most total fatalities are: TORNADO, EXCESSIVE HEAT, FLASH FLOOD, HEAT, LIGHTNING, TSTM WIND, FLOOD, RIP CURRENT, HIGH WIND, AVALANCHE. TORNADO caused the most fatalities, with 5633 deaths.
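The sort-then-slice step used to pick the top events can also be sketched in base R with `order` and `head`. The example below uses invented totals (the `toy_totals` data frame is hypothetical, not from the storm database) and keeps the top 2 rows instead of the top 10.

```r
# Invented toy totals (not from the storm database)
toy_totals <- data.frame(
  EVTYPE     = c("A", "B", "C"),
  FATALITIES = c(5, 12, 7)
)

# Sort by descending FATALITIES, then keep at most the first 2 rows
top2 <- head(toy_totals[order(-toy_totals$FATALITIES), ], 2)
top2$EVTYPE
```

One design difference worth noting: `head(x, n)` returns fewer rows when the table has fewer than `n`, whereas indexing with `x[1:n, ]` pads the result with `NA` rows.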
In addition, FLOOD had the most severe economic consequences, with a total loss of about 1.4465771 × 10^11 dollars (about 144.66 billion). Next, we arrange the economic consequences in descending order of total economic loss to identify the top 10 events with the largest total economic losses.
# Load required libraries
library(ggplot2)
library(dplyr)
# Arrange the economic consequences in descending order of total economic loss
economic_consequences <- arrange(economic_consequences, desc(PROPDMGVALUE))
# Filter the top 10 events with the highest economic loss
top10_economic_consequences <- economic_consequences[1:10,]
# Convert the economic loss to billions for the top 10 events
top10_economic_consequences$billion <- top10_economic_consequences$PROPDMGVALUE / 1e9
# Create the bar plot using ggplot2
ggplot(top10_economic_consequences,
       aes(x = reorder(EVTYPE, -billion), y = billion)) + # Reorder bars by economic loss descending
  geom_bar(stat = "identity", fill = "orange") + # Create bars with orange fill
  theme_bw() + # Use black-and-white theme (clean background)
  theme(
    axis.text.x = element_text(angle = 45, # Rotate x-axis labels by 45 degrees
                               hjust = 1, # Right-align labels to prevent overlap
                               size = 10), # Set label size
    plot.title = element_text(hjust = 0.5), # Center the plot title
    axis.title.x = element_text(vjust = 1.5) # Adjust x-axis title position slightly
  ) +
  labs(
    x = "Event Type", # X-axis label
    y = "Total Economic Loss (billion dollars)", # Y-axis label
    title = "Event Type with Top 10 Total Economic Loss" # Plot title
  )
The bar plot shows that the 10 event types with the largest total economic losses are: FLOOD, HURRICANE/TYPHOON, TORNADO, STORM SURGE, FLASH FLOOD, HAIL, HURRICANE, TROPICAL STORM, WINTER STORM, HIGH WIND. FLOOD had the highest economic loss, at about 144.66 billion dollars.
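As a small worked check of the unit conversion, the FLOOD total reported in the output above can be expressed in dollars, in billions of dollars, and in scientific notation. This is only an illustrative sketch. The variable name `flood_loss` is introduced here and its value is copied from that output.

```r
# FLOOD total copied from the aggregate output above, in dollars
flood_loss <- 144657709807

# The same value in billions of dollars (the unit used on the plot's y-axis)
flood_loss_billion <- flood_loss / 1e9
flood_loss_billion

# The same value in scientific notation with 7 decimal places
flood_loss_sci <- formatC(flood_loss, format = "e", digits = 7)
flood_loss_sci
```

The last line prints `1.4465771e+11`, the same figure as the 1.4465771 × 10^11 dollars quoted in the text.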