Note for peer reviewers:
This is my original work, first submitted to Coursera on 18 December.
My previous submission was affected by plagiarism, where other learners copied my public Rpubs link.
Coursera advised me to resubmit with my name included for clarity.
Thank you for reviewing my work fairly.

Synopsis

This analysis explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The database contains records from 1950 to 2011 that track characteristics of major storms and weather events in the United States, including when and where they occur, along with estimates of any fatalities, injuries, and property damage.

This analysis specifically addresses the following two questions:

  1. Which types of events are most harmful to population health?
  2. Which types have the greatest economic consequences?

The findings might help in making decisions about how to allocate and prioritize resources to respond to various severe weather events.

Software Environment information and load of libraries

Set locale to English and get session info:

Sys.setlocale("LC_ALL", "English_United States.UTF-8")
## [1] "LC_COLLATE=English_United States.utf8;LC_CTYPE=English_United States.utf8;LC_MONETARY=English_United States.utf8;LC_NUMERIC=C;LC_TIME=English_United States.utf8"
sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.utf8 
## [2] LC_CTYPE=English_United States.utf8   
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.utf8    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.37     R6_2.6.1          lifecycle_1.0.4   jsonlite_2.0.0   
##  [5] evaluate_1.0.5    cachem_1.1.0      rlang_1.1.6       cli_3.6.5        
##  [9] rstudioapi_0.17.1 jquerylib_0.1.4   bslib_0.9.0       rmarkdown_2.30   
## [13] tools_4.2.2       xfun_0.53         yaml_2.3.10       fastmap_1.2.0    
## [17] compiler_4.2.2    htmltools_0.5.8.1 knitr_1.50        sass_0.4.10

Load of libraries (shall be installed first, if not available yet):

library(dplyr) 
## Warning: package 'dplyr' was built under R version 4.2.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)

Data Processing

Methodological Considerations

After reading the provided Storm Data Documentation and some preliminary exploratory analysis of the dataset, I identified the following questions that need to be addressed further in this analysis:

  • What part of the dataset to be used for the analysis?
  • How to deal with the “messy” description of the event types in the data?
  • How to calculate the total health damage?
  • How to calculate the total economic damage?

Data load

The dataset from NOAA is downloaded from the source and stored in subfolder “data”:

# Create a "data" subfolder, if not available:
if(!file.exists("data")) {
    dir.create("data")
}

# Download the dataset and in "data" subfolder, if not available, and read the data:
if(!file.exists("./data/repdata_data_StormData.csv.bz2")){
    download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                         destfile = "./data/repdata_data_StormData.csv.bz2")
}

stormdata <- read.csv("./data/repdata_data_StormData.csv.bz2",header=TRUE)

Explore the dataset structure:

dim(stormdata)
## [1] 902297     37
str(stormdata)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

List of all variables in the row dataset:

names(stormdata)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Dimensionality reduction

In order to reduce the dimensions of the dataset before further analysis, some of the variables could be excluded as not relevant to the objective of this analysis. Based on the description of the variables in the National Weather Service Storm Data Documentation and the structure of the loaded dataset, the following variables are kept:

For Event identification:

  • EVTYPE — event type

Population Health Variables:

  • FATALITIES — number of deaths directly caused by the event
  • INJURIES — number of injuries directly caused by the event

Economic Damage Variables:

  • PROPDMG & PROPDMGEXP — numeric value of property damage & multiplier
  • CROPDMG % CROPDMGEXP — numeric value of crop damage & multiplier

Optional, for additional analysis:

  • BGN_DATE — start date (needed to extract year)
  • STATE & COUNTY — geographic identifiers

Creating the clean dataset:

storm_clean <- stormdata %>%
  select(BGN_DATE, EVTYPE, 
         STATE, COUNTY, 
         FATALITIES, INJURIES,
         PROPDMG, PROPDMGEXP,
         CROPDMG, CROPDMGEXP)

Check for NAs:

colSums(is.na(storm_clean))
##   BGN_DATE     EVTYPE      STATE     COUNTY FATALITIES   INJURIES    PROPDMG 
##          0          0          0          0          0          0          0 
## PROPDMGEXP    CROPDMG CROPDMGEXP 
##          0          0          0

No NAs in the subset.

Observations by year

Year and location are not strictly needed for answering the questions in general, but might be helpful for further analysis of a harmful events - if their impact is changing across the years or if they are location specific. So, new column YEAR is derived from the BGN_DATE and BGN_DATE is then dropped:

storm_clean$BGN_DATE <- parse_date_time( storm_clean$BGN_DATE, orders = c("mdy HMS", "mdy HM", "mdy") )

storm_clean$YEAR <- year(storm_clean$BGN_DATE)
storm_clean <- storm_clean %>% select(-BGN_DATE)

Since the dataset description notes that older records may be inconsistent or incomplete, while more recent years are considered more reliable, lets first examine the number of observations by year:

year_summary <- storm_clean %>% 
    group_by(YEAR) %>% 
    summarise(
        observations = n(), unique_events = n_distinct(EVTYPE)
        ) 

ggplot(year_summary, aes(YEAR, observations)) + 
    geom_line()

print(paste(
  "Percentage of observations after 1996 from total:",
  round(mean(storm_clean$YEAR >= 1996) * 100, 2)
))
## [1] "Percentage of observations after 1996 from total: 72.43"

The plot shows that the observations per year are significantly more from 1996 onwards. Also, more recent events might be more relevant now, because of the global climate changes and evolving environment. So we can further subset the data to use only 1996-2011 period (which will keep 72% from the original data):

Subsetting only observation from 1996 - 2011

storm_clean <- storm_clean %>% 
  filter(YEAR >= 1996)

Event type analysis

Check how many different type of events are observed and sort descending by number of observation per type:

storm_clean %>%
  count(EVTYPE, sort = TRUE) %>%
  nrow()
## [1] 516

The unique event types are 516, so lets try matching some by converting all letters to uppercase and remove leading and trailing spaces, and then count again:

storm_clean$EVTYPE_clean <- storm_clean$EVTYPE %>%
    toupper() %>% 
    trimws()

storm_clean %>%
  count(EVTYPE_clean, sort = TRUE) %>%
  {
    list(
      total_event_types = nrow(.),
      top_10 = head(., 10)
    )
  }
## $total_event_types
## [1] 430
## 
## $top_10
##         EVTYPE_clean      n
## 1               HAIL 207715
## 2          TSTM WIND 128668
## 3  THUNDERSTORM WIND  81403
## 4        FLASH FLOOD  51000
## 5              FLOOD  24248
## 6            TORNADO  23154
## 7          HIGH WIND  19909
## 8         HEAVY SNOW  14000
## 9          LIGHTNING  13204
## 10        HEAVY RAIN  11528

The number of unique event types has decreased to 430 as identical types are grouped together, which will enhance the accuracy of the analysis. It is visible that the EVTYPE data is “messy”- for example, we see both TSTM WIND and THUNDERSTORM WIND in the Top 10 list.

Although one way to do the analysis is to group together some of the events by the sense of the words in their names, it is better to keep them as they are at this point, to avoid confusion and misinterpretation. For example, FLOOD and FLASH FLOOD are two different categories in the official categorization in the National Weather Service Storm Data Documentation and describe different events. Also, manual mapping of all event types might be unjustified. The chosen approach is to calculate the events with highest impact and then explore if there are types in the top 50 results that don’t match the official categorization, and deal with them. This will be done both for health and for economic impact and the final calculations of the impact will be done afterwards, based on the mapped event types, for higher accuracy.

Calculate total health impact per event type

The total_harm is calculated as sum of the fatalities and injuries:

health_impact <- storm_clean %>%
  group_by(EVTYPE_clean) %>%
  summarise(
    fatalities = sum(FATALITIES, na.rm = TRUE),
    injuries   = sum(INJURIES, na.rm = TRUE),
    total_harm = fatalities + injuries
  ) %>%
  arrange(desc(total_harm))

head(health_impact, 10)
## # A tibble: 10 × 4
##    EVTYPE_clean      fatalities injuries total_harm
##    <chr>                  <dbl>    <dbl>      <dbl>
##  1 TORNADO                 1511    20667      22178
##  2 EXCESSIVE HEAT          1797     6391       8188
##  3 FLOOD                    414     6758       7172
##  4 LIGHTNING                651     4141       4792
##  5 TSTM WIND                241     3629       3870
##  6 FLASH FLOOD              887     1674       2561
##  7 THUNDERSTORM WIND        130     1400       1530
##  8 WINTER STORM             191     1292       1483
##  9 HEAT                     237     1222       1459
## 10 HURRICANE/TYPHOON         64     1275       1339

Calculate the impact of the top 50 results as a threshold and as a percentage of the total health damage:

print(
  paste(
    "The top 50 results include all events which caused total injuries >=",
    format(health_impact$total_harm[50], big.mark = ","),
    "per event, and cover",
    round(
      (sum(health_impact$total_harm[1:50]) /
         sum(health_impact$total_harm)) * 100,
      2
    ),
    "% of the total health harm."
  )
)
## [1] "The top 50 results include all events which caused total injuries >= 30 per event, and cover 99.34 % of the total health harm."

Therefore, it is reasonable to consolidate event types within these top 50 without losing any meaningful information from the dataset.

The National Weather Service Storm Data Documentation states that the only events permitted in Storm Data are listed in Table 1 of Section 2.1.1., namely they are:

official_events <- c(
  "ASTRONOMICAL LOW TIDE",
  "AVALANCHE",
  "BLIZZARD",
  "COASTAL FLOOD",
  "COLD/WIND CHILL",
  "DEBRIS FLOW",
  "DENSE FOG",
  "DENSE SMOKE",
  "DROUGHT",
  "DUST DEVIL",
  "DUST STORM",
  "EXCESSIVE HEAT",
  "EXTREME COLD/WIND CHILL",
  "FLASH FLOOD",
  "FLOOD",
  "FROST/FREEZE",
  "FUNNEL CLOUD",
  "FREEZING FOG",
  "HAIL",
  "HEAT",
  "HEAVY RAIN",
  "HEAVY SNOW",
  "HIGH SURF",
  "HIGH WIND",
  "HURRICANE (TYPHOON)",
  "ICE STORM",
  "LAKE-EFFECT SNOW",
  "LAKESHORE FLOOD",
  "LIGHTNING",
  "MARINE HAIL",
  "MARINE HIGH WIND",
  "MARINE STRONG WIND",
  "MARINE THUNDERSTORM WIND",
  "RIP CURRENT",
  "SEICHE",
  "SLEET",
  "STORM SURGE/TIDE",
  "STRONG WIND",
  "THUNDERSTORM WIND",
  "TORNADO",
  "TROPICAL DEPRESSION",
  "TROPICAL STORM",
  "TSUNAMI",
  "VOLCANIC ASH",
  "WATERSPOUT",
  "WILDFIRE",
  "WINTER STORM",
  "WINTER WEATHER"
)

let’s examine which of the top 50 event names do not match the official definitions and consolidate them appropriately:

top50_non_official <- health_impact %>%
  slice_max(total_harm, n = 50) %>%
  filter(!(EVTYPE_clean %in% official_events)) %>%
  pull(EVTYPE_clean)

top50_non_official
##  [1] "TSTM WIND"            "HURRICANE/TYPHOON"    "FOG"                 
##  [4] "WILD/FOREST FIRE"     "RIP CURRENTS"         "GLAZE"               
##  [7] "EXTREME COLD"         "HURRICANE"            "URBAN/SML STREAM FLD"
## [10] "WIND"                 "TSTM WIND/HAIL"       "WINTER WEATHER/MIX"  
## [13] "HEAVY SURF/HIGH SURF" "LANDSLIDE"            "WINTRY MIX"          
## [16] "HEAT WAVE"            "WINTER WEATHER MIX"   "HEAVY SURF"          
## [19] "STORM SURGE"          "SNOW SQUALL"          "COLD"

These events can now be matched with the official category:

storm_clean <- storm_clean %>%
  mutate(
    EVTYPE_mapped = case_when(
      EVTYPE_clean == "TSTM WIND" ~ "THUNDERSTORM WIND",
      EVTYPE_clean == "HURRICANE/TYPHOON" ~ "HURRICANE (TYPHOON)",
      EVTYPE_clean == "FOG" ~ "DENSE FOG",
      EVTYPE_clean == "WILD/FOREST FIRE" ~ "WILDFIRE",
      EVTYPE_clean == "RIP CURRENTS" ~ "RIP CURRENT",
      EVTYPE_clean == "GLAZE" ~ "ICE STORM",
      EVTYPE_clean == "EXTREME COLD" ~ "EXTREME COLD/WIND CHILL",
      EVTYPE_clean == "HURRICANE" ~ "HURRICANE (TYPHOON)",
      EVTYPE_clean == "URBAN/SML STREAM FLD" ~ "FLOOD",
      EVTYPE_clean == "WIND" ~ "HIGH WIND",
      EVTYPE_clean == "TSTM WIND/HAIL" ~ "THUNDERSTORM WIND",
      EVTYPE_clean == "WINTER WEATHER/MIX" ~ "WINTER WEATHER",
      EVTYPE_clean == "HEAVY SURF/HIGH SURF" ~ "HIGH SURF",
      EVTYPE_clean == "LANDSLIDE" ~ "DEBRIS FLOW",
      EVTYPE_clean == "WINTRY MIX" ~ "WINTER WEATHER",
      EVTYPE_clean == "HEAT WAVE" ~ "EXCESSIVE HEAT",
      EVTYPE_clean == "WINTER WEATHER MIX" ~ "WINTER WEATHER",
      EVTYPE_clean == "HEAVY SURF" ~ "HIGH SURF",
      EVTYPE_clean == "STORM SURGE" ~ "STORM SURGE/TIDE",
      EVTYPE_clean == "SNOW SQUALL" ~ "HEAVY SNOW",
      EVTYPE_clean == "COLD" ~ "COLD/WIND CHILL",
      
      TRUE ~ EVTYPE_clean
    )
  )

Calculate economic damage per event type

The total economic damage will be calculated as sum of property damage and crop damage.

Check the unique exponents:

unique(storm_clean$PROPDMGEXP)
## [1] "K" ""  "M" "B" "0"

Include conversion of the exponent fields into numeric as a first step:

exp_to_num <- function(exp) {
  ifelse(exp == "K", 1e3,
  ifelse(exp == "M", 1e6,
  ifelse(exp == "B", 1e9,
  ifelse(exp == "0", 1,
  1))))   # empty string - multiplier 1
}

storm_clean$PROPDMGEXP_num <- exp_to_num(storm_clean$PROPDMGEXP)
storm_clean$CROPDMGEXP_num <- exp_to_num(storm_clean$CROPDMGEXP)

Compute of the total economic damage:

storm_clean <- storm_clean %>%
  mutate(
    prop_damage = PROPDMG * PROPDMGEXP_num,
    crop_damage = CROPDMG * CROPDMGEXP_num,
    total_damage = prop_damage + crop_damage
  )

Summarization by event type:

economic_impact <- storm_clean %>%
  group_by(EVTYPE_mapped) %>%
  summarise(
    property = sum(prop_damage, na.rm = TRUE),
    crop     = sum(crop_damage, na.rm = TRUE),
    total_economic = property + crop
  ) %>%
  arrange(desc(total_economic))

head(economic_impact, 10)
## # A tibble: 10 × 4
##    EVTYPE_mapped           property        crop total_economic
##    <chr>                      <dbl>       <dbl>          <dbl>
##  1 FLOOD               144003143200  4983266500   148986409700
##  2 HURRICANE (TYPHOON)  81118659010  5349282800    86467941810
##  3 STORM SURGE/TIDE     47834724000      855000    47835579000
##  4 TORNADO              24616945710   283425010    24900370720
##  5 HAIL                 14595143420  2476029450    17071172870
##  6 FLASH FLOOD          15222253910  1334901700    16557155610
##  7 DROUGHT               1046101000 13367566000    14413667000
##  8 THUNDERSTORM WIND     7913146380  1016942600     8930088980
##  9 TROPICAL STORM        7642475550   677711000     8320186550
## 10 WILDFIRE              7760449500   402255130     8162704630

Top 50 events characteristics:

print(
  paste(
    "The top 50 results include all events which caused total economic damage >=",
    format(economic_impact$total_economic[50], big.mark = ","),
    "USD per event, and cover",
    round(
      (sum(economic_impact$total_economic[1:50]) /
         sum(economic_impact$total_economic)) * 100,
      2
    ),
    "% of the total economic damage."
  )
)
## [1] "The top 50 results include all events which caused total economic damage >= 5,421,000 USD per event, and cover 99.99 % of the total economic damage."

Again, to make the analysis more reliable, lets check which of the top 50 results don’t much the official definitions and try to consolidate them correctly:

Check which of the top 50 don’t match the official list:

top50_non_official_econ <- economic_impact %>%
  slice_max(total_economic, n = 50) %>%
  filter(!(EVTYPE_mapped %in% official_events)) %>%
  pull(EVTYPE_mapped)

top50_non_official_econ
##  [1] "TYPHOON"                   "FREEZE"                   
##  [3] "RIVER FLOODING"            "COASTAL FLOODING"         
##  [5] "DAMAGING FREEZE"           "EARLY FROST"              
##  [7] "AGRICULTURAL FREEZE"       "UNSEASONABLY COLD"        
##  [9] "RIVER FLOOD"               "SMALL HAIL"               
## [11] "COASTAL FLOODING/EROSION"  "EXTREME WINDCHILL"        
## [13] "EROSION/CSTL FLOOD"        "COASTAL  FLOODING/EROSION"
## [15] "HEAVY RAIN/HIGH SURF"      "HARD FREEZE"              
## [17] "UNSEASONAL RAIN"           "ASTRONOMICAL HIGH TIDE"   
## [19] "MARINE TSTM WIND"

Mapping for those events to match the official list:

storm_clean <- storm_clean %>%
  mutate(
    EVTYPE_mapped = case_when(
      EVTYPE_clean == "TYPHOON" ~ "HURRICANE (TYPHOON)",
      EVTYPE_clean == "FREEZE" ~ "FROST/FREEZE",
      EVTYPE_clean == "RIVER FLOODING" ~ "FLOOD",
      EVTYPE_clean == "COASTAL FLOODING" ~ "COASTAL FLOOD",
      EVTYPE_clean == "DAMAGING FREEZE" ~ "FROST/FREEZE",
      EVTYPE_clean == "EARLY FROST" ~ "FROST/FREEZE",
      EVTYPE_clean == "AGRICULTURAL FREEZE" ~ "FROST/FREEZE",
      EVTYPE_clean == "UNSEASONABLY COLD" ~ "COLD/WIND CHILL",
      EVTYPE_clean == "RIVER FLOOD" ~ "FLOOD",
      EVTYPE_clean == "SMALL HAIL" ~ "HAIL",
      EVTYPE_clean == "COASTAL FLOODING/EROSION" ~ "COASTAL FLOOD",
      EVTYPE_clean == "EXTREME WINDCHILL" ~ "EXTREME COLD/WIND CHILL",
      EVTYPE_clean == "EROSION/CSTL FLOOD" ~ "COASTAL FLOOD",
      EVTYPE_clean == "COASTAL  FLOODING/EROSION" ~ "COASTAL FLOOD",
      EVTYPE_clean == "HEAVY RAIN/HIGH SURF" ~ "HIGH SURF",
      EVTYPE_clean == "HARD FREEZE" ~ "FROST/FREEZE",
      EVTYPE_clean == "UNSEASONAL RAIN" ~ "HEAVY RAIN",
      EVTYPE_clean == "ASTRONOMICAL HIGH TIDE" ~ "STORM SURGE/TIDE",
      EVTYPE_clean == "MARINE TSTM WIND" ~ "MARINE THUNDERSTORM WIND",
     

      TRUE ~ EVTYPE_mapped
    )
  )

Aligning the event types with the official categories will now result in more precise aggregation. Furthermore, analysis based on the official categories has the advantage of being easier to extend — for example, by incorporating new data or comparing results with other studies.

Results

Recalculating Health Impact with improved Event Classification:

health_impact <- storm_clean %>%
  group_by(EVTYPE_mapped) %>%
  summarise(
    fatalities = sum(FATALITIES, na.rm = TRUE),
    injuries   = sum(INJURIES, na.rm = TRUE),
    total_harm = fatalities + injuries
  ) %>%
  arrange(desc(total_harm))

head(health_impact, 20)
## # A tibble: 20 × 4
##    EVTYPE_mapped       fatalities injuries total_harm
##    <chr>                    <dbl>    <dbl>      <dbl>
##  1 TORNADO                   1511    20667      22178
##  2 EXCESSIVE HEAT            1797     6461       8258
##  3 FLOOD                      444     6838       7282
##  4 THUNDERSTORM WIND          376     5124       5500
##  5 LIGHTNING                  651     4141       4792
##  6 FLASH FLOOD                887     1674       2561
##  7 WILDFIRE                    87     1456       1543
##  8 WINTER STORM               191     1292       1483
##  9 HEAT                       237     1222       1459
## 10 HURRICANE (TYPHOON)        125     1326       1451
## 11 HIGH WIND                  253     1167       1420
## 12 RIP CURRENT                542      503       1045
## 13 DENSE FOG                   69      855        924
## 14 HEAVY SNOW                 109      733        842
## 15 HAIL                         7      723        730
## 16 WINTER WEATHER              62      560        622
## 17 ICE STORM                   83      530        613
## 18 BLIZZARD                    70      385        455
## 19 TROPICAL STORM              57      338        395
## 20 DUST STORM                  11      376        387

Now the ranking is more correct and slightly changed. For example, now THUNDERSTORM WIND (#4) is more harmful than LIGHTNING, WILDFIRE (#7) appeared in top 10, etc.

Plot for visualization:

df <- health_impact[1:10, ]

# Order by total_harm
df <- df[order(df$total_harm, decreasing = TRUE), ]

par(mar = c(10, 4, 4, 2))

# Create a matrix for barplot
mat <- rbind(
  Total_Harm = df$total_harm,
  Injuries   = df$injuries,
  Fatalities = df$fatalities
)

barplot(
  mat,
  beside = TRUE,
  names.arg = df$EVTYPE_mapped,
  las = 2,
  col = c("steelblue3", "turquoise3", "tomato"),
  main = "Top 10 Weather Events by Health Impact",
  ylab = "Count",
  cex.names = 0.7,   # x‑axis label size
  cex.axis = 0.7,    # y‑axis tick size
  cex.lab = 0.8      # y‑axis label size 
)

legend(
  "topright",
  legend = c("Total Harm", "Injuries", "Fatalities"),
  fill = c("steelblue3", "turquoise3", "tomato"),
  cex = 0.8
)

Final Results - Health Impact

Tornadoes stand out as the most harmful weather event, causing a combined total of fatalities and injuries that is more than double the impact of the second‑ranked event, Excessive Heat. In fact, Tornadoes inflict more harm than the next three event types combined (Excessive Heat, Flood, and Thunderstorm Wind). If the focus is on fatalities, the most harmful event is Excessive Heat, followed by Tornadoes and the Flash Floods. Further analysis could be done on how the effect from the top ranked events changes across the years or what is the ranking for different states.

Recalculating Economic Impact with improved Event Classification:

economic_impact <- storm_clean %>%
  group_by(EVTYPE_mapped) %>%
  summarise(
    property = sum(prop_damage, na.rm = TRUE),
    crop     = sum(crop_damage, na.rm = TRUE),
    total_economic = property + crop
  ) %>%
  arrange(desc(total_economic))

head(economic_impact, 20)
## # A tibble: 20 × 4
##    EVTYPE_mapped               property        crop total_economic
##    <chr>                          <dbl>       <dbl>          <dbl>
##  1 FLOOD                   144129580200  5013161500   149142741700
##  2 HURRICANE (TYPHOON)      81718889010  5350107800    87068996810
##  3 STORM SURGE/TIDE         47844149000      855000    47845004000
##  4 TORNADO                  24616945710   283425010    24900370720
##  5 HAIL                     14595213420  2496822450    17092035870
##  6 FLASH FLOOD              15222253910  1334901700    16557155610
##  7 DROUGHT                   1046101000 13367566000    14413667000
##  8 THUNDERSTORM WIND         7913146380  1016942600     8930088980
##  9 TROPICAL STORM            7642475550   677711000     8320186550
## 10 WILDFIRE                  7760449500   402255130     8162704630
## 11 HIGH WIND                 5250149860   633861300     5884011160
## 12 ICE STORM                 3642398810    15660000     3658058810
## 13 WINTER STORM              1532743250    11944000     1544687250
## 14 FROST/FREEZE                18680000  1368761000     1387441000
## 15 EXTREME COLD/WIND CHILL     29163400  1326023000     1355186400
## 16 HEAVY RAIN                 584864440   738169800     1323034240
## 17 LIGHTNING                  743077080     6898440      749975520
## 18 HEAVY SNOW                 634447540    71122100      705569640
## 19 BLIZZARD                   525658950     7060000      532718950
## 20 EXCESSIVE HEAT               7723700   492402000      500125700

The ranking doesn’t differ much from the previous calculation (which was done after the first round of mapping for the health‑harm analysis and was already cleaned to some extent), but the impact calculated now is more precise. Some changes appear in the 11–20 range.

Plot for visualization:

df2 <- economic_impact[1:10, ]
df2 <- df2[order(df2$total_economic, decreasing = TRUE), ]

par(mar = c(10, 4, 4, 2))

# Scale to millions
scale_factor <- 1e6

mat2 <- rbind(
  Total_Economic = df2$total_economic / scale_factor,
  Property       = df2$property / scale_factor,
  Crop           = df2$crop / scale_factor
)

barplot(
  mat2,
  beside = TRUE,
  names.arg = df2$EVTYPE_mapped,
  las = 2,
  col = c("steelblue3", "turquoise3", "tomato"),
  main = "Top 10 Weather Events by Economic Impact",
  ylab = "Damage (Millions USD)",
  cex.names = 0.7,
  cex.axis = 0.7,
  cex.lab = 0.8
)

legend(
  "topright",
  legend = c("Total Economic", "Property", "Crop"),
  fill = c("steelblue3", "turquoise3", "tomato"),
  cex = 0.8
)

Final Results - Economic Impact

The most harmful weather event in terms of economic impact is Flood, followed by Hurricane (Typhoon) and Storm Surge/Tide. Overall, most of the economic damage comes from property losses, while crop damage represents a smaller share. However, crop damage understandably dominates during Drought events. Depending on the focus, a deeper analysis could examine the two types of damage separately or investigate trends over time or across geographic regions.

Conclusions

This analysis emphasizes the critical role that certain unfavorable weather events play in affecting both people and the economy, while offering insight into which events are the most harmful. Tornadoes cause the greatest health impact, whereas Floods, Hurricanes (Typhoons), and Storm Surge/Tide result in the most severe economic damage. Allocating resources to address and mitigate these events would be most beneficial.