Synopsis

Severe weather events such as storms can have serious effects on both public health and the economy. These events may cause injuries, deaths, and damage to property, which makes their analysis important for planning and risk management.

This project analyzes storm data collected by the U.S. National Oceanic and Atmospheric Administration (NOAA). The dataset includes information about major weather events in the United States, such as the type of event, when and where it occurred, and the estimated number of injuries, fatalities, and economic losses.

Purpose of the Analysis

The goal of this analysis is to explore the NOAA Storm Database and identify which types of weather events have the greatest impact on population health and which cause the most economic damage across the United States. The analysis is based on summary statistics and visualizations created using R.

This report is written for an academic audience and is intended to demonstrate data analysis skills rather than to provide policy recommendations.

Data Processing

The data used in this study come from the NOAA Storm Database and are provided as a compressed CSV file. The database covers weather events recorded between 1950 and November 2011. Earlier years contain fewer observations, likely due to incomplete reporting, while more recent years provide more reliable and complete information.

Additional documentation explains how the variables were defined and recorded.

We load the data from the raw text file and examine which variables (columns) are available in the dataset.

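# Packages used in the grouping, reshaping, and plotting steps below
library(dplyr)
library(tidyr)
library(ggplot2)
data <- read.csv2("C:/Users/ryb00/Documents/All_R_files/Storm/repdata_data_StormData/repdata_data_StormData.csv", sep = ",", header = TRUE)
head(data, 2)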
##   STATE__          BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1    1.00 4/18/1950 0:00:00     0130       CST  97.00     MOBILE    AL TORNADO
## 2    1.00 4/18/1950 0:00:00     0145       CST   3.00    BALDWIN    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1      0.00                                            0.00         NA
## 2      0.00                                            0.00         NA
##   END_RANGE END_AZI END_LOCATI LENGTH  WIDTH F  MAG FATALITIES INJURIES PROPDMG
## 1      0.00                     14.00 100.00 3 0.00       0.00    15.00   25.00
## 2      0.00                      2.00 150.00 2 0.00       0.00     0.00    2.50
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K    0.00                                      3040.00   8812.00
## 2          K    0.00                                      3042.00   8755.00
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1    3051.00    8806.00           1.00
## 2       0.00       0.00           2.00
dim(data)
## [1] 902297     37
names(data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Because this analysis focuses on the variables “EVTYPE”, “FATALITIES”, “INJURIES”, “PROPDMG”, and “CROPDMG”, we first check whether any of them contain missing values.

colSums(is.na(data[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")]))
##     EVTYPE FATALITIES   INJURIES    PROPDMG    CROPDMG 
##          0          0          0          0          0

These variables do not contain any missing values.
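Because is.na() only flags true NA values, a column that was read as text can still hold empty strings that behave like missing data. The following is a minimal sketch of a stricter check on a small made-up data frame; the `toy` object and the `count_missing_or_blank` helper are illustrative names, not part of the original analysis.

```r
# Toy stand-in for the storm data; the values are made up for illustration.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "", "HAIL", NA),
  FATALITIES = c("0.00", "1.00", " ", "2.00"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count entries that are NA or contain only whitespace.
count_missing_or_blank <- function(col) {
  sum(is.na(col) | trimws(as.character(col)) == "")
}

na_only  <- colSums(is.na(toy))                # what the check in the report counts
na_blank <- sapply(toy, count_missing_or_blank) # NA plus blank strings

print(na_only)
print(na_blank)
```

On the toy data the second count is larger than the first for both columns, which is the situation this kind of check is meant to reveal.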

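The variables “FATALITIES”, “INJURIES”, “PROPDMG”, and “CROPDMG” should be numeric. We check their types with str(). Because read.csv2() expects a comma as the decimal mark while the file uses a period, these columns were read as character strings, so we convert them with as.numeric().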

str(data[, c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")])
## 'data.frame':    902297 obs. of  4 variables:
##  $ FATALITIES: chr  "0.00" "0.00" "0.00" "0.00" ...
##  $ INJURIES  : chr  "15.00" "0.00" "2.00" "2.00" ...
##  $ PROPDMG   : chr  "25.00" "2.50" "25.00" "2.50" ...
##  $ CROPDMG   : chr  "0.00" "0.00" "0.00" "0.00" ...
data$FATALITIES <- as.numeric(data$FATALITIES)
data$INJURIES <- as.numeric(data$INJURIES)
data$PROPDMG <- as.numeric(data$PROPDMG)
data$CROPDMG <- as.numeric(data$CROPDMG)
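The character columns above are a side effect of the reader's default decimal mark. As a small sketch of the difference, the snippet below parses the same inline text with read.csv2() and with read.csv(); the two-row `csv_text` is a made-up stand-in for the storm file, not real data.

```r
# Two-row stand-in for the storm file (made-up values in the same layout).
csv_text <- "EVTYPE,FATALITIES,PROPDMG\nTORNADO,0.00,25.00\nHAIL,1.00,2.50"

# read.csv2() defaults to dec = ",", so "25.00" is not recognised as a number.
via_csv2 <- read.csv2(text = csv_text, sep = ",", header = TRUE)

# read.csv() defaults to dec = ".", so the same text is parsed as numeric.
via_csv <- read.csv(text = csv_text, header = TRUE)

print(class(via_csv2$PROPDMG))
print(class(via_csv$PROPDMG))

# Converting the text column afterwards recovers the same numbers.
converted <- as.numeric(as.character(via_csv2$PROPDMG))
print(converted)
```

Either route gives the same numeric values; reading with read.csv() simply makes the later as.numeric() step unnecessary.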

To standardize the EVTYPE variable, we convert all values to uppercase and trim leading and trailing whitespace. We then count the number of unique event types.

data$EVTYPE <- toupper(data$EVTYPE)
data$EVTYPE <- trimws(data$EVTYPE)
length(unique(data$EVTYPE))
## [1] 890
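To illustrate why this step reduces the number of distinct labels, here is a small sketch on a made-up vector of event names (the `ev` vector is illustrative, not taken from the dataset):

```r
# Made-up labels that differ only in case or surrounding whitespace.
ev <- c("Tornado", " TORNADO", "tornado ", "Hail")

n_before <- length(unique(ev))

# Same two steps as in the report: uppercase, then trim whitespace.
ev_clean <- trimws(toupper(ev))
n_after  <- length(unique(ev_clean))

print(n_before)
print(n_after)
print(unique(ev_clean))
```

Case and whitespace cleanup only merges labels that are otherwise identical; spelling variants such as "TSTM WIND" and "THUNDERSTORM WIND" still need explicit rules.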

Because EVTYPE still contains 890 distinct values, we first identify the event types that account for the most harm to public health and the economy, and then standardize only those labels in more detail. We start with the 20 event types (EVTYPE) that have the highest combined number of fatalities and injuries in the United States.

health_raw <- data %>%
    group_by(EVTYPE) %>%
    summarise(
        fatalities = sum(FATALITIES),
        injuries   = sum(INJURIES),
        health     = fatalities + injuries
    ) %>%
    arrange(desc(health))
head(health_raw$EVTYPE, 20)
##  [1] "TORNADO"            "EXCESSIVE HEAT"     "TSTM WIND"         
##  [4] "FLOOD"              "LIGHTNING"          "HEAT"              
##  [7] "FLASH FLOOD"        "ICE STORM"          "THUNDERSTORM WIND" 
## [10] "WINTER STORM"       "HIGH WIND"          "HAIL"              
## [13] "HURRICANE/TYPHOON"  "HEAVY SNOW"         "WILDFIRE"          
## [16] "THUNDERSTORM WINDS" "BLIZZARD"           "FOG"               
## [19] "RIP CURRENT"        "WILD/FOREST FIRE"

Similarly, we find the 20 event types (EVTYPE) with the largest summed property and crop damage values (PROPDMG and CROPDMG, as recorded) in the United States.

economic_raw <- data %>%
    group_by(EVTYPE) %>%
    summarise(
        property_damage = sum(PROPDMG),
        crop_damage     = sum(CROPDMG),
        total_damage    = property_damage + crop_damage
    ) %>%
    arrange(desc(total_damage))
head(economic_raw$EVTYPE, 20)
##  [1] "TORNADO"            "FLASH FLOOD"        "TSTM WIND"         
##  [4] "HAIL"               "FLOOD"              "THUNDERSTORM WIND" 
##  [7] "LIGHTNING"          "THUNDERSTORM WINDS" "HIGH WIND"         
## [10] "WINTER STORM"       "HEAVY SNOW"         "WILDFIRE"          
## [13] "ICE STORM"          "STRONG WIND"        "HEAVY RAIN"        
## [16] "HIGH WINDS"         "TROPICAL STORM"     "WILD/FOREST FIRE"  
## [19] "DROUGHT"            "FLASH FLOODING"
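The ranking above sums PROPDMG and CROPDMG as recorded. The dataset also carries PROPDMGEXP and CROPDMGEXP columns (listed in names(data)), whose codes K, M, and B scale the recorded figure by thousands, millions, and billions. The following is a minimal sketch of applying them to a made-up data frame. It assumes that blank or unrecognised codes leave the value unscaled, which is an assumption of this sketch rather than something established in the report; `exp_to_multiplier` and `toy` are illustrative names.

```r
# Hypothetical helper: translate a magnitude code into a numeric multiplier.
exp_to_multiplier <- function(code) {
  code <- toupper(trimws(code))
  mult <- c(K = 1e3, M = 1e6, B = 1e9)[code]
  # Assumption: blank or unrecognised codes mean "no scaling".
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Made-up rows in the same layout as the storm data.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HAIL"),
  PROPDMG    = c(25, 2.5, 1),
  PROPDMGEXP = c("K", "M", ""),
  stringsAsFactors = FALSE
)

# Scaled property damage per row, then summed per event type.
toy$prop_usd <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)
by_type <- aggregate(prop_usd ~ EVTYPE, data = toy, FUN = sum)
print(by_type)
```

In the toy data the row recorded as 2.5 with code M outweighs the row recorded as 25 with code K, which shows how a ranking based on unscaled sums can differ from one based on scaled amounts.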

Similar and duplicate event types in the EVTYPE variable were combined into five categories: thunderstorm wind, heat, wildfire, flash flood, and high wind. All other values remain unchanged.

data <- data %>%
  mutate(EVTYPE = case_when(
           # Labels starting with "TSTM WIND" or "THUNDERSTORM WIND", including the plural form
           grepl("^(TSTM|THUNDERSTORM) WIND", EVTYPE) ~ "THUNDERSTORM WIND",
           EVTYPE %in% c("EXCESSIVE HEAT", "HEAT") ~ "HEAT",
           EVTYPE == "WILD/FOREST FIRE" ~ "WILDFIRE",
           # Also folds in "FLASH FLOODING" and similar variants
           grepl("^FLASH FLOOD", EVTYPE) ~ "FLASH FLOOD",
           EVTYPE %in% c("HIGH WIND", "HIGH WINDS", "STRONG WIND") ~ "HIGH WIND",
           # Every other label is kept as it is
           TRUE ~ EVTYPE
         ))
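To see which raw labels the rules above fold together, it can help to apply them to a short list of example labels and compare the result side by side. The sketch below re-implements the same patterns in base R on a hand-picked vector; the `raw` and `clean` names are illustrative, and the base R form is a stand-in for the case_when() call rather than the code used in the report.

```r
# Hand-picked example labels (already uppercase and trimmed).
raw <- c("TSTM WIND", "THUNDERSTORM WINDS", "TSTM WIND/HAIL",
         "EXCESSIVE HEAT", "WILD/FOREST FIRE", "FLASH FLOODING",
         "HIGH WINDS", "STRONG WIND", "HAIL")

# Same patterns as the case_when() rules; the rules do not overlap,
# so applying them one after another gives the same mapping.
clean <- raw
clean[grepl("^(TSTM|THUNDERSTORM) WIND", raw)] <- "THUNDERSTORM WIND"
clean[raw %in% c("EXCESSIVE HEAT", "HEAT")] <- "HEAT"
clean[raw == "WILD/FOREST FIRE"] <- "WILDFIRE"
clean[grepl("^FLASH FLOOD", raw)] <- "FLASH FLOOD"
clean[raw %in% c("HIGH WIND", "HIGH WINDS", "STRONG WIND")] <- "HIGH WIND"

print(data.frame(raw, clean))
```

Note that the prefix patterns also capture compound labels such as "TSTM WIND/HAIL", so events recorded under a combined label are counted as thunderstorm wind.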

Results

We find the 10 event types with the highest combined number of fatalities and injuries.

top10_health <- data %>%
    group_by(EVTYPE) %>%
    summarise(
        fatalities = sum(FATALITIES),
        injuries   = sum(INJURIES),
        .groups = "drop"
    ) %>%
    slice_max(fatalities + injuries, n = 10)
top10_health
## # A tibble: 10 × 3
##    EVTYPE            fatalities injuries
##    <chr>                  <dbl>    <dbl>
##  1 TORNADO                 5633    91346
##  2 HEAT                    2840     8625
##  3 THUNDERSTORM WIND        709     9458
##  4 FLOOD                    470     6789
##  5 LIGHTNING                816     5230
##  6 FLASH FLOOD             1018     1785
##  7 HIGH WIND                386     1719
##  8 ICE STORM                 89     1975
##  9 WILDFIRE                  87     1456
## 10 WINTER STORM             206     1321

The top 10 events by number of fatalities and injuries were transformed into a long format for plotting, and a horizontal bar chart was created showing the number of fatalities and injuries separately for each event.

health_long <- top10_health %>%
    pivot_longer(
        cols = c(fatalities, injuries),
        names_to = "type",
        values_to = "count"
    )

health_long$EVTYPE <- factor(
    health_long$EVTYPE,
    levels = top10_health$EVTYPE[order(top10_health$fatalities + top10_health$injuries)]
)

ggplot(health_long, aes(x = EVTYPE, y = count, fill = type)) +
    geom_col(position = "dodge") +
    coord_flip() +
    scale_fill_manual(
        values = c("fatalities" = "firebrick", "injuries" = "steelblue"),
        labels = c("Fatalities", "Injuries")
    ) +
    labs(
        title = "Weather Events Most Harmful to Population Health",
        subtitle = "Fatalities vs Injuries by Event Type",
        x = "Event Type",
        y = "Number of people affected",
        fill = ""
    ) +
    theme_minimal()
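The factor re-levelling before the plot controls the bar order: with coord_flip(), the first factor level is drawn at the bottom and the last at the top, so levels sorted by ascending total put the most harmful event type at the top. A small base R sketch of that ordering step, using three combined totals computed from the table above (the `totals` and `ev` names are illustrative):

```r
# Fatalities + injuries for three event types, taken from the top-10 table.
totals <- c(HEAT = 2840 + 8625, LIGHTNING = 816 + 5230, TORNADO = 5633 + 91346)

# Order the levels by ascending total, as in the report's factor() call.
ev <- factor(names(totals), levels = names(totals)[order(totals)])

print(levels(ev))
```

The last level is the event type with the largest total, which ends up as the top bar after the axes are flipped.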

Next, we find the 10 event types with the highest combined property and crop damage values.

top10_economic <- data %>%
    group_by(EVTYPE) %>%
    summarise(
        property_damage = sum(PROPDMG),
        crop_damage     = sum(CROPDMG),
        .groups = "drop"
    ) %>%
    slice_max(property_damage + crop_damage, n = 10)
top10_economic
## # A tibble: 10 × 3
##    EVTYPE            property_damage crop_damage
##    <chr>                       <dbl>       <dbl>
##  1 TORNADO                  3212258.     100019.
##  2 THUNDERSTORM WIND        2671630.     199232.
##  3 FLASH FLOOD              1455200.     185057.
##  4 HAIL                      688693.     579596.
##  5 FLOOD                     899938.     168038.
##  6 LIGHTNING                 603352.       3581.
##  7 HIGH WIND                 443368.      20660.
##  8 WINTER STORM              132721.       1979.
##  9 WILDFIRE                  123804.       8554.
## 10 HEAVY SNOW                122252.       2166.

The resulting chart displays property damage and crop damage separately for each event.

economic_long <- top10_economic %>%
    pivot_longer(
        cols = c(property_damage, crop_damage),
        names_to = "type",
        values_to = "count"
    )

economic_long$EVTYPE <- factor(
    economic_long$EVTYPE,
    levels = top10_economic$EVTYPE[order(top10_economic$property_damage + top10_economic$crop_damage)]
)

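ggplot(economic_long, aes(x = EVTYPE, y = count, fill = type)) +
    geom_col(position = "dodge") +
    coord_flip() +
    scale_fill_manual(
        values = c("property_damage" = "darkgreen", "crop_damage" = "goldenrod"),
        # Explicit breaks keep each label paired with the right series;
        # without them the labels follow alphabetical order and are swapped
        breaks = c("property_damage", "crop_damage"),
        labels = c("Property Damage", "Crop Damage")
    ) +
    labs(
        title = "Weather Events Causing the Greatest Economic Damage",
        subtitle = "Property Damage vs Crop Damage by Event Type",
        x = "Event Type",
        # PROPDMGEXP and CROPDMGEXP are not applied, so the values are not dollar amounts
        y = "Summed PROPDMG / CROPDMG values (unscaled)",
        fill = ""
    ) +
    theme_minimal()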

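The analysis identified the types of weather events associated with the greatest harm to public health and the largest recorded damage in the United States. By combined fatalities and injuries, tornadoes rank first by a wide margin, followed by heat and thunderstorm wind. By summed property and crop damage values, tornadoes, thunderstorm wind, and flash floods rank highest, followed by hail and floods. These damage sums use PROPDMG and CROPDMG as recorded, without the magnitude codes in PROPDMGEXP and CROPDMGEXP, so they are not dollar amounts and the ranking may change once those codes are applied. Overall, the results suggest that a small number of event types account for most of the recorded harm and deserve particular attention when planning protective measures and allocating resources.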