Severe weather events such as storms can have serious effects on both public health and the economy. These events may cause injuries, deaths, and damage to property, which makes their analysis important for planning and risk management.
This project analyzes storm data collected by the U.S. National Oceanic and Atmospheric Administration (NOAA). The dataset includes information about major weather events in the United States, such as the type of event, when and where it occurred, and the estimated number of injuries, fatalities, and economic losses.
The goal of this analysis is to explore the NOAA Storm Database and identify which types of weather events have the greatest impact on population health and which cause the most economic damage across the United States. The analysis is based on summary statistics and visualizations created using R.
This report is written for an academic audience and is intended to demonstrate data analysis skills rather than to provide policy recommendations.
The data used in this study come from the NOAA Storm Database and are provided as a compressed CSV file. The database covers weather events recorded between 1950 and November 2011. Earlier years contain fewer observations, likely due to incomplete reporting, while more recent years provide more reliable and complete information.
Additional documentation explains how the variables were defined and recorded.
We load the data from the raw text file and examine which variables (columns) are available in the dataset.
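# Note: read.csv2() keeps its default dec = "," even though sep is overridden,
# so columns written with "." as the decimal mark are read as character strings.
data <- read.csv2("C:/Users/ryb00/Documents/All_R_files/Storm/repdata_data_StormData/repdata_data_StormData.csv", sep = ",", header = TRUE)
head(data, 2)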
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1.00 4/18/1950 0:00:00 0130 CST 97.00 MOBILE AL TORNADO
## 2 1.00 4/18/1950 0:00:00 0145 CST 3.00 BALDWIN AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0.00 0.00 NA
## 2 0.00 0.00 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0.00 14.00 100.00 3 0.00 0.00 15.00 25.00
## 2 0.00 2.00 150.00 2 0.00 0.00 0.00 2.50
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0.00 3040.00 8812.00
## 2 K 0.00 3042.00 8755.00
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051.00 8806.00 1.00
## 2 0.00 0.00 2.00
dim(data)
## [1] 902297 37
names(data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
Because this analysis focuses on the variables “EVTYPE”, “FATALITIES”, “INJURIES”, “PROPDMG”, and “CROPDMG”, we first check whether any of them contain missing values.
colSums(is.na(data[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")]))
## EVTYPE FATALITIES INJURIES PROPDMG CROPDMG
## 0 0 0 0 0
These variables do not contain any missing values.
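The variables “FATALITIES”, “INJURIES”, “PROPDMG”, and “CROPDMG” should be numeric. We check their types and convert them where necessary. They were read as character strings because read.csv2() expects a comma as the decimal mark, while the file uses a period.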
str(data[, c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")])
## 'data.frame': 902297 obs. of 4 variables:
## $ FATALITIES: chr "0.00" "0.00" "0.00" "0.00" ...
## $ INJURIES : chr "15.00" "0.00" "2.00" "2.00" ...
## $ PROPDMG : chr "25.00" "2.50" "25.00" "2.50" ...
## $ CROPDMG : chr "0.00" "0.00" "0.00" "0.00" ...
data$FATALITIES <- as.numeric(data$FATALITIES)
data$INJURIES <- as.numeric(data$INJURIES)
data$PROPDMG <- as.numeric(data$PROPDMG)
data$CROPDMG <- as.numeric(data$CROPDMG)
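The reason these columns arrive as character strings can be reproduced on a small file. The following is a minimal sketch, assuming a made-up three-column CSV written to a temporary file (the rows are illustrative, not taken from the Storm Database). It compares read.csv2() with sep = "," against read.csv(), and confirms that as.numeric() recovers the values without introducing NAs.

```r
# Write a tiny CSV that mimics the storm file layout:
# comma-separated fields, "." as the decimal mark. The rows are made up.
tmp <- tempfile(fileext = ".csv")
writeLines(c("EVTYPE,FATALITIES,PROPDMG",
             "TORNADO,0.00,25.00",
             "TORNADO,1.00,2.50"), tmp)

# read.csv2() defaults to dec = ","; overriding sep alone does not change that,
# so "25.00" is not recognised as a number.
as_csv2 <- read.csv2(tmp, sep = ",", header = TRUE, stringsAsFactors = FALSE)

# read.csv() defaults to sep = "," and dec = ".".
as_csv <- read.csv(tmp, header = TRUE, stringsAsFactors = FALSE)

class(as_csv2$PROPDMG)
class(as_csv$PROPDMG)

# Converting after the fact recovers the same numbers; checking for NAs
# guards against values that could not be parsed.
converted <- as.numeric(as_csv2$PROPDMG)
stopifnot(!anyNA(converted))

unlink(tmp)
```

Reading the file with read.csv() would avoid the conversion step altogether; the explicit as.numeric() calls used in this report give the same values as long as no NAs are introduced.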
To standardize the EVTYPE variable, all values were converted to uppercase and trimmed of any extra spaces at the beginning and end of the strings. After this, we counted the number of unique event types.
data$EVTYPE <- toupper(data$EVTYPE)
data$EVTYPE <- trimws(data$EVTYPE)
length(unique(data$EVTYPE))
## [1] 890
Because EVTYPE still contains 890 distinct values, we first identify the event types that cause the most harm to public health and the economy, and then standardize only those labels in more detail. We start with the 20 event types that have the highest combined number of fatalities and injuries.
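# Packages used from here on: dplyr (grouping and summaries),
# tidyr (reshaping), and ggplot2 (charts).
library(dplyr)
library(tidyr)
library(ggplot2)
health_raw <- data %>%
  group_by(EVTYPE) %>%
  summarise(
    fatalities = sum(FATALITIES),
    injuries = sum(INJURIES),
    health = fatalities + injuries
  ) %>%
  arrange(desc(health))
head(health_raw$EVTYPE, 20)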
## [1] "TORNADO" "EXCESSIVE HEAT" "TSTM WIND"
## [4] "FLOOD" "LIGHTNING" "HEAT"
## [7] "FLASH FLOOD" "ICE STORM" "THUNDERSTORM WIND"
## [10] "WINTER STORM" "HIGH WIND" "HAIL"
## [13] "HURRICANE/TYPHOON" "HEAVY SNOW" "WILDFIRE"
## [16] "THUNDERSTORM WINDS" "BLIZZARD" "FOG"
## [19] "RIP CURRENT" "WILD/FOREST FIRE"
Similarly, we find the 20 event types with the largest summed property and crop damage. Note that PROPDMG and CROPDMG are summed here as recorded, without applying the magnitude codes stored in PROPDMGEXP and CROPDMGEXP.
economic_raw <- data %>%
  group_by(EVTYPE) %>%
  summarise(
    property_damage = sum(PROPDMG),
    crop_damage = sum(CROPDMG),
    total_damage = property_damage + crop_damage
  ) %>%
  arrange(desc(total_damage))
head(economic_raw$EVTYPE, 20)
## [1] "TORNADO" "FLASH FLOOD" "TSTM WIND"
## [4] "HAIL" "FLOOD" "THUNDERSTORM WIND"
## [7] "LIGHTNING" "THUNDERSTORM WINDS" "HIGH WIND"
## [10] "WINTER STORM" "HEAVY SNOW" "WILDFIRE"
## [13] "ICE STORM" "STRONG WIND" "HEAVY RAIN"
## [16] "HIGH WINDS" "TROPICAL STORM" "WILD/FOREST FIRE"
## [19] "DROUGHT" "FLASH FLOODING"
Similar and duplicate event types in the EVTYPE variable were combined into five groups: thunderstorm wind (every label starting with TSTM WIND or THUNDERSTORM WIND), heat (HEAT and EXCESSIVE HEAT), wildfire (WILDFIRE and WILD/FOREST FIRE), flash flood (every label starting with FLASH FLOOD), and high wind (HIGH WIND, HIGH WINDS, and STRONG WIND). All other values remain unchanged.
# EVTYPE was already upper-cased and trimmed above, so only the grouping is applied here.
data <- data %>%
  mutate(EVTYPE = case_when(
    # Prefix match also covers "THUNDERSTORM WINDS" and compound labels
    grepl("^(TSTM|THUNDERSTORM) WIND", EVTYPE) ~ "THUNDERSTORM WIND",
    EVTYPE %in% c("EXCESSIVE HEAT", "HEAT") ~ "HEAT",
    EVTYPE == "WILD/FOREST FIRE" ~ "WILDFIRE",
    grepl("^FLASH FLOOD", EVTYPE) ~ "FLASH FLOOD",
    EVTYPE %in% c("HIGH WIND", "HIGH WINDS", "STRONG WIND") ~ "HIGH WIND",
    TRUE ~ EVTYPE
  ))
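To make the effect of these rules concrete, the sketch below applies the same kind of mapping to a handful of labels. The input vector is invented for illustration, and normalize_evtype() is a hypothetical helper name that does not appear in the analysis itself.

```r
library(dplyr)

# Hypothetical helper that mirrors the grouping rules described above.
normalize_evtype <- function(x) {
  x <- trimws(toupper(x))
  case_when(
    grepl("^(TSTM|THUNDERSTORM) WIND", x) ~ "THUNDERSTORM WIND",
    x %in% c("EXCESSIVE HEAT", "HEAT") ~ "HEAT",
    x == "WILD/FOREST FIRE" ~ "WILDFIRE",
    grepl("^FLASH FLOOD", x) ~ "FLASH FLOOD",
    x %in% c("HIGH WIND", "HIGH WINDS", "STRONG WIND") ~ "HIGH WIND",
    TRUE ~ x
  )
}

# Made-up labels covering each rule plus one label that is left unchanged.
raw_labels <- c("tstm wind", " THUNDERSTORM WINDS", "TSTM WIND/HAIL",
                "Excessive Heat", "WILD/FOREST FIRE", "FLASH FLOODING",
                "HIGH WINDS", "HAIL")

clean_labels <- normalize_evtype(raw_labels)
data.frame(raw = raw_labels, clean = clean_labels)
```

One consequence of prefix matching is visible here: a compound label such as "TSTM WIND/HAIL" is counted entirely as thunderstorm wind. That is a deliberate simplification in this kind of rule, and a stricter analysis might handle compound labels separately.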
We then find the 10 event types with the highest combined number of fatalities and injuries.
top10_health <- data %>%
  group_by(EVTYPE) %>%
  summarise(
    fatalities = sum(FATALITIES),
    injuries = sum(INJURIES),
    .groups = "drop"
  ) %>%
  slice_max(fatalities + injuries, n = 10)
top10_health
## # A tibble: 10 × 3
## EVTYPE fatalities injuries
## <chr> <dbl> <dbl>
## 1 TORNADO 5633 91346
## 2 HEAT 2840 8625
## 3 THUNDERSTORM WIND 709 9458
## 4 FLOOD 470 6789
## 5 LIGHTNING 816 5230
## 6 FLASH FLOOD 1018 1785
## 7 HIGH WIND 386 1719
## 8 ICE STORM 89 1975
## 9 WILDFIRE 87 1456
## 10 WINTER STORM 206 1321
The top 10 events by number of fatalities and injuries were transformed into a long format for plotting, and a horizontal bar chart was created showing the number of fatalities and injuries separately for each event.
health_long <- top10_health %>%
  pivot_longer(
    cols = c(fatalities, injuries),
    names_to = "type",
    values_to = "count"
  )
# Order the bars by combined fatalities and injuries
health_long$EVTYPE <- factor(
  health_long$EVTYPE,
  levels = top10_health$EVTYPE[order(top10_health$fatalities + top10_health$injuries)]
)
ggplot(health_long, aes(x = EVTYPE, y = count, fill = type)) +
  geom_col(position = "dodge") +
  coord_flip() +
  scale_fill_manual(
    values = c("fatalities" = "firebrick", "injuries" = "steelblue"),
    labels = c("Fatalities", "Injuries")
  ) +
  labs(
    title = "Weather Events Most Harmful to Population Health",
    subtitle = "Fatalities vs Injuries by Event Type",
    x = "Event Type",
    y = "Number of people affected",
    fill = ""
  ) +
  theme_minimal()
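The reshaping step used for the chart can be seen more easily on a small table. This is a minimal sketch with two invented event types ("A" and "B") and made-up counts. It shows how pivot_longer() turns one row per event into one row per event and measure, and how the factor levels set the bar order.

```r
library(tidyr)

# Toy wide table; the event names and counts are made up for illustration.
wide <- data.frame(
  EVTYPE     = c("A", "B"),
  fatalities = c(10, 3),
  injuries   = c(50, 7)
)

# One row per (event, measure) pair, which is the shape a grouped bar chart needs.
long <- pivot_longer(
  wide,
  cols = c(fatalities, injuries),
  names_to = "type",
  values_to = "count"
)

# Levels sorted by ascending combined total, so that after coord_flip()
# the event with the largest total is drawn at the top.
long$EVTYPE <- factor(
  long$EVTYPE,
  levels = wide$EVTYPE[order(wide$fatalities + wide$injuries)]
)

long
levels(long$EVTYPE)
```

Setting the levels explicitly matters because a plain character column would be plotted in alphabetical order rather than by total.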
Next, we find the 10 event types with the highest combined property and crop damage, again summing PROPDMG and CROPDMG as recorded.
top10_economic <- data %>%
  group_by(EVTYPE) %>%
  summarise(
    property_damage = sum(PROPDMG),
    crop_damage = sum(CROPDMG),
    .groups = "drop"
  ) %>%
  slice_max(property_damage + crop_damage, n = 10)
top10_economic
## # A tibble: 10 × 3
## EVTYPE property_damage crop_damage
## <chr> <dbl> <dbl>
## 1 TORNADO 3212258. 100019.
## 2 THUNDERSTORM WIND 2671630. 199232.
## 3 FLASH FLOOD 1455200. 185057.
## 4 HAIL 688693. 579596.
## 5 FLOOD 899938. 168038.
## 6 LIGHTNING 603352. 3581.
## 7 HIGH WIND 443368. 20660.
## 8 WINTER STORM 132721. 1979.
## 9 WILDFIRE 123804. 8554.
## 10 HEAVY SNOW 122252. 2166.
The resulting chart displays property damage and crop damage separately for each event.
economic_long <- top10_economic %>%
  pivot_longer(
    cols = c(property_damage, crop_damage),
    names_to = "type",
    values_to = "count"
  )
# Order the bars by combined property and crop damage
economic_long$EVTYPE <- factor(
  economic_long$EVTYPE,
  levels = top10_economic$EVTYPE[order(top10_economic$property_damage + top10_economic$crop_damage)]
)
ggplot(economic_long, aes(x = EVTYPE, y = count, fill = type)) +
  geom_col(position = "dodge") +
  coord_flip() +
  # Breaks are set explicitly so that each label is matched to the right series;
  # unnamed labels would otherwise follow the alphabetical order of the values
  # (crop_damage first) and swap the two legend entries.
  scale_fill_manual(
    values = c("property_damage" = "darkgreen", "crop_damage" = "goldenrod"),
    breaks = c("property_damage", "crop_damage"),
    labels = c("Property Damage", "Crop Damage")
  ) +
  labs(
    title = "Weather Events Causing the Greatest Economic Damage",
    subtitle = "Property Damage vs Crop Damage by Event Type",
    x = "Event Type",
    y = "Damage (summed PROPDMG/CROPDMG, not scaled by EXP codes)",
    fill = ""
  ) +
  theme_minimal()
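The damage totals above add up PROPDMG and CROPDMG exactly as recorded. In the Storm Database these amounts are paired with the magnitude codes in PROPDMGEXP and CROPDMGEXP, where K, M, and B denote thousands, millions, and billions of dollars. The sketch below shows one way the scaling could be applied. It is not part of the analysis reported here: exp_multiplier() is a hypothetical helper, the toy rows are made up, and treating blank or unrecognised codes as a multiplier of 1 is an assumption that should be checked against the NOAA documentation.

```r
# Hypothetical helper: map a magnitude code to a dollar multiplier.
# Assumption: codes other than K, M, B (including blanks) are treated as 1.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(trimws(code))]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Toy rows, invented for illustration only.
toy <- data.frame(
  EVTYPE     = c("FLOOD", "TORNADO", "FLOOD", "HAIL"),
  PROPDMG    = c(25, 2.5, 1, 500),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)

# Scale each recorded amount to dollars before summing by event type.
toy$prop_usd <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
totals <- aggregate(prop_usd ~ EVTYPE, data = toy, FUN = sum)
totals[order(-totals$prop_usd), ]
```

In the toy data, a single flood recorded as 1 with code B outweighs a hail event recorded as 500 with no code, which illustrates why rankings based on unscaled sums can differ from rankings based on dollar amounts.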
The analysis identified the types of weather events in the United States that cause the greatest harm to public health and the economy. Tornadoes are by far the most harmful to population health, with 5,633 fatalities and 91,346 injuries, followed by heat and thunderstorm wind. Tornadoes also rank first by summed property and crop damage, followed by thunderstorm wind, flash floods, hail, and floods. Because the damage values were summed without applying the PROPDMGEXP and CROPDMGEXP magnitude codes, the economic ranking should be read as provisional. These results suggest that a small number of event types account for most of the recorded harm and deserve particular attention when planning protective measures and allocating resources.