Note for peer reviewers:
This is my original work, first submitted to Coursera on 18
December.
My previous submission was affected by plagiarism, where other learners
copied my public Rpubs link.
Coursera advised me to resubmit with my name included for clarity.
Thank you for reviewing my work fairly.
This analysis explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The database contains records from 1950 to 2011 that track characteristics of major storms and weather events in the United States, including when and where they occur, along with estimates of any fatalities, injuries, and property damage.
This analysis specifically addresses the following two questions:
The findings might help in making decisions about how to allocate and prioritize resources to respond to various severe weather events.
Set locale to English and get session info:
Sys.setlocale("LC_ALL", "English_United States.UTF-8")
## [1] "LC_COLLATE=English_United States.utf8;LC_CTYPE=English_United States.utf8;LC_MONETARY=English_United States.utf8;LC_NUMERIC=C;LC_TIME=English_United States.utf8"
sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 R6_2.6.1 lifecycle_1.0.4 jsonlite_2.0.0
## [5] evaluate_1.0.5 cachem_1.1.0 rlang_1.1.6 cli_3.6.5
## [9] rstudioapi_0.17.1 jquerylib_0.1.4 bslib_0.9.0 rmarkdown_2.30
## [13] tools_4.2.2 xfun_0.53 yaml_2.3.10 fastmap_1.2.0
## [17] compiler_4.2.2 htmltools_0.5.8.1 knitr_1.50 sass_0.4.10
Load of libraries (shall be installed first, if not available yet):
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(ggplot2)
After reading the provided Storm Data Documentation and some preliminary exploratory analysis of the dataset, I identified the following questions that need to be addressed further in this analysis:
The dataset from NOAA is downloaded from the source and stored in subfolder “data”:
# Create a "data" subfolder, if not available:
if(!file.exists("data")) {
dir.create("data")
}
# Download the dataset and in "data" subfolder, if not available, and read the data:
if(!file.exists("./data/repdata_data_StormData.csv.bz2")){
download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile = "./data/repdata_data_StormData.csv.bz2")
}
stormdata <- read.csv("./data/repdata_data_StormData.csv.bz2",header=TRUE)
dim(stormdata)
## [1] 902297 37
str(stormdata)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
List of all variables in the row dataset:
names(stormdata)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
In order to reduce the dimensions of the dataset before further analysis, some of the variables could be excluded as not relevant to the objective of this analysis. Based on the description of the variables in the National Weather Service Storm Data Documentation and the structure of the loaded dataset, the following variables are kept:
For Event identification:
Population Health Variables:
Economic Damage Variables:
Optional, for additional analysis:
storm_clean <- stormdata %>%
select(BGN_DATE, EVTYPE,
STATE, COUNTY,
FATALITIES, INJURIES,
PROPDMG, PROPDMGEXP,
CROPDMG, CROPDMGEXP)
colSums(is.na(storm_clean))
## BGN_DATE EVTYPE STATE COUNTY FATALITIES INJURIES PROPDMG
## 0 0 0 0 0 0 0
## PROPDMGEXP CROPDMG CROPDMGEXP
## 0 0 0
No NAs in the subset.
Year and location are not strictly needed for answering the questions in general, but might be helpful for further analysis of a harmful events - if their impact is changing across the years or if they are location specific. So, new column YEAR is derived from the BGN_DATE and BGN_DATE is then dropped:
storm_clean$BGN_DATE <- parse_date_time( storm_clean$BGN_DATE, orders = c("mdy HMS", "mdy HM", "mdy") )
storm_clean$YEAR <- year(storm_clean$BGN_DATE)
storm_clean <- storm_clean %>% select(-BGN_DATE)
Since the dataset description notes that older records may be inconsistent or incomplete, while more recent years are considered more reliable, lets first examine the number of observations by year:
year_summary <- storm_clean %>%
group_by(YEAR) %>%
summarise(
observations = n(), unique_events = n_distinct(EVTYPE)
)
ggplot(year_summary, aes(YEAR, observations)) +
geom_line()
print(paste(
"Percentage of observations after 1996 from total:",
round(mean(storm_clean$YEAR >= 1996) * 100, 2)
))
## [1] "Percentage of observations after 1996 from total: 72.43"
The plot shows that the observations per year are significantly more from 1996 onwards. Also, more recent events might be more relevant now, because of the global climate changes and evolving environment. So we can further subset the data to use only 1996-2011 period (which will keep 72% from the original data):
storm_clean <- storm_clean %>%
filter(YEAR >= 1996)
Check how many different type of events are observed and sort descending by number of observation per type:
storm_clean %>%
count(EVTYPE, sort = TRUE) %>%
nrow()
## [1] 516
The unique event types are 516, so lets try matching some by converting all letters to uppercase and remove leading and trailing spaces, and then count again:
storm_clean$EVTYPE_clean <- storm_clean$EVTYPE %>%
toupper() %>%
trimws()
storm_clean %>%
count(EVTYPE_clean, sort = TRUE) %>%
{
list(
total_event_types = nrow(.),
top_10 = head(., 10)
)
}
## $total_event_types
## [1] 430
##
## $top_10
## EVTYPE_clean n
## 1 HAIL 207715
## 2 TSTM WIND 128668
## 3 THUNDERSTORM WIND 81403
## 4 FLASH FLOOD 51000
## 5 FLOOD 24248
## 6 TORNADO 23154
## 7 HIGH WIND 19909
## 8 HEAVY SNOW 14000
## 9 LIGHTNING 13204
## 10 HEAVY RAIN 11528
The number of unique event types has decreased to 430 as identical types are grouped together, which will enhance the accuracy of the analysis. It is visible that the EVTYPE data is “messy”- for example, we see both TSTM WIND and THUNDERSTORM WIND in the Top 10 list.
Although one way to do the analysis is to group together some of the events by the sense of the words in their names, it is better to keep them as they are at this point, to avoid confusion and misinterpretation. For example, FLOOD and FLASH FLOOD are two different categories in the official categorization in the National Weather Service Storm Data Documentation and describe different events. Also, manual mapping of all event types might be unjustified. The chosen approach is to calculate the events with highest impact and then explore if there are types in the top 50 results that don’t match the official categorization, and deal with them. This will be done both for health and for economic impact and the final calculations of the impact will be done afterwards, based on the mapped event types, for higher accuracy.
The total_harm is calculated as sum of the fatalities and injuries:
health_impact <- storm_clean %>%
group_by(EVTYPE_clean) %>%
summarise(
fatalities = sum(FATALITIES, na.rm = TRUE),
injuries = sum(INJURIES, na.rm = TRUE),
total_harm = fatalities + injuries
) %>%
arrange(desc(total_harm))
head(health_impact, 10)
## # A tibble: 10 × 4
## EVTYPE_clean fatalities injuries total_harm
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 1511 20667 22178
## 2 EXCESSIVE HEAT 1797 6391 8188
## 3 FLOOD 414 6758 7172
## 4 LIGHTNING 651 4141 4792
## 5 TSTM WIND 241 3629 3870
## 6 FLASH FLOOD 887 1674 2561
## 7 THUNDERSTORM WIND 130 1400 1530
## 8 WINTER STORM 191 1292 1483
## 9 HEAT 237 1222 1459
## 10 HURRICANE/TYPHOON 64 1275 1339
Calculate the impact of the top 50 results as a threshold and as a percentage of the total health damage:
print(
paste(
"The top 50 results include all events which caused total injuries >=",
format(health_impact$total_harm[50], big.mark = ","),
"per event, and cover",
round(
(sum(health_impact$total_harm[1:50]) /
sum(health_impact$total_harm)) * 100,
2
),
"% of the total health harm."
)
)
## [1] "The top 50 results include all events which caused total injuries >= 30 per event, and cover 99.34 % of the total health harm."
Therefore, it is reasonable to consolidate event types within these top 50 without losing any meaningful information from the dataset.
The National Weather Service Storm Data Documentation states that the only events permitted in Storm Data are listed in Table 1 of Section 2.1.1., namely they are:
official_events <- c(
"ASTRONOMICAL LOW TIDE",
"AVALANCHE",
"BLIZZARD",
"COASTAL FLOOD",
"COLD/WIND CHILL",
"DEBRIS FLOW",
"DENSE FOG",
"DENSE SMOKE",
"DROUGHT",
"DUST DEVIL",
"DUST STORM",
"EXCESSIVE HEAT",
"EXTREME COLD/WIND CHILL",
"FLASH FLOOD",
"FLOOD",
"FROST/FREEZE",
"FUNNEL CLOUD",
"FREEZING FOG",
"HAIL",
"HEAT",
"HEAVY RAIN",
"HEAVY SNOW",
"HIGH SURF",
"HIGH WIND",
"HURRICANE (TYPHOON)",
"ICE STORM",
"LAKE-EFFECT SNOW",
"LAKESHORE FLOOD",
"LIGHTNING",
"MARINE HAIL",
"MARINE HIGH WIND",
"MARINE STRONG WIND",
"MARINE THUNDERSTORM WIND",
"RIP CURRENT",
"SEICHE",
"SLEET",
"STORM SURGE/TIDE",
"STRONG WIND",
"THUNDERSTORM WIND",
"TORNADO",
"TROPICAL DEPRESSION",
"TROPICAL STORM",
"TSUNAMI",
"VOLCANIC ASH",
"WATERSPOUT",
"WILDFIRE",
"WINTER STORM",
"WINTER WEATHER"
)
let’s examine which of the top 50 event names do not match the official definitions and consolidate them appropriately:
top50_non_official <- health_impact %>%
slice_max(total_harm, n = 50) %>%
filter(!(EVTYPE_clean %in% official_events)) %>%
pull(EVTYPE_clean)
top50_non_official
## [1] "TSTM WIND" "HURRICANE/TYPHOON" "FOG"
## [4] "WILD/FOREST FIRE" "RIP CURRENTS" "GLAZE"
## [7] "EXTREME COLD" "HURRICANE" "URBAN/SML STREAM FLD"
## [10] "WIND" "TSTM WIND/HAIL" "WINTER WEATHER/MIX"
## [13] "HEAVY SURF/HIGH SURF" "LANDSLIDE" "WINTRY MIX"
## [16] "HEAT WAVE" "WINTER WEATHER MIX" "HEAVY SURF"
## [19] "STORM SURGE" "SNOW SQUALL" "COLD"
These events can now be matched with the official category:
storm_clean <- storm_clean %>%
mutate(
EVTYPE_mapped = case_when(
EVTYPE_clean == "TSTM WIND" ~ "THUNDERSTORM WIND",
EVTYPE_clean == "HURRICANE/TYPHOON" ~ "HURRICANE (TYPHOON)",
EVTYPE_clean == "FOG" ~ "DENSE FOG",
EVTYPE_clean == "WILD/FOREST FIRE" ~ "WILDFIRE",
EVTYPE_clean == "RIP CURRENTS" ~ "RIP CURRENT",
EVTYPE_clean == "GLAZE" ~ "ICE STORM",
EVTYPE_clean == "EXTREME COLD" ~ "EXTREME COLD/WIND CHILL",
EVTYPE_clean == "HURRICANE" ~ "HURRICANE (TYPHOON)",
EVTYPE_clean == "URBAN/SML STREAM FLD" ~ "FLOOD",
EVTYPE_clean == "WIND" ~ "HIGH WIND",
EVTYPE_clean == "TSTM WIND/HAIL" ~ "THUNDERSTORM WIND",
EVTYPE_clean == "WINTER WEATHER/MIX" ~ "WINTER WEATHER",
EVTYPE_clean == "HEAVY SURF/HIGH SURF" ~ "HIGH SURF",
EVTYPE_clean == "LANDSLIDE" ~ "DEBRIS FLOW",
EVTYPE_clean == "WINTRY MIX" ~ "WINTER WEATHER",
EVTYPE_clean == "HEAT WAVE" ~ "EXCESSIVE HEAT",
EVTYPE_clean == "WINTER WEATHER MIX" ~ "WINTER WEATHER",
EVTYPE_clean == "HEAVY SURF" ~ "HIGH SURF",
EVTYPE_clean == "STORM SURGE" ~ "STORM SURGE/TIDE",
EVTYPE_clean == "SNOW SQUALL" ~ "HEAVY SNOW",
EVTYPE_clean == "COLD" ~ "COLD/WIND CHILL",
TRUE ~ EVTYPE_clean
)
)
The total economic damage will be calculated as sum of property damage and crop damage.
unique(storm_clean$PROPDMGEXP)
## [1] "K" "" "M" "B" "0"
exp_to_num <- function(exp) {
ifelse(exp == "K", 1e3,
ifelse(exp == "M", 1e6,
ifelse(exp == "B", 1e9,
ifelse(exp == "0", 1,
1)))) # empty string - multiplier 1
}
storm_clean$PROPDMGEXP_num <- exp_to_num(storm_clean$PROPDMGEXP)
storm_clean$CROPDMGEXP_num <- exp_to_num(storm_clean$CROPDMGEXP)
storm_clean <- storm_clean %>%
mutate(
prop_damage = PROPDMG * PROPDMGEXP_num,
crop_damage = CROPDMG * CROPDMGEXP_num,
total_damage = prop_damage + crop_damage
)
economic_impact <- storm_clean %>%
group_by(EVTYPE_mapped) %>%
summarise(
property = sum(prop_damage, na.rm = TRUE),
crop = sum(crop_damage, na.rm = TRUE),
total_economic = property + crop
) %>%
arrange(desc(total_economic))
head(economic_impact, 10)
## # A tibble: 10 × 4
## EVTYPE_mapped property crop total_economic
## <chr> <dbl> <dbl> <dbl>
## 1 FLOOD 144003143200 4983266500 148986409700
## 2 HURRICANE (TYPHOON) 81118659010 5349282800 86467941810
## 3 STORM SURGE/TIDE 47834724000 855000 47835579000
## 4 TORNADO 24616945710 283425010 24900370720
## 5 HAIL 14595143420 2476029450 17071172870
## 6 FLASH FLOOD 15222253910 1334901700 16557155610
## 7 DROUGHT 1046101000 13367566000 14413667000
## 8 THUNDERSTORM WIND 7913146380 1016942600 8930088980
## 9 TROPICAL STORM 7642475550 677711000 8320186550
## 10 WILDFIRE 7760449500 402255130 8162704630
Top 50 events characteristics:
print(
paste(
"The top 50 results include all events which caused total economic damage >=",
format(economic_impact$total_economic[50], big.mark = ","),
"USD per event, and cover",
round(
(sum(economic_impact$total_economic[1:50]) /
sum(economic_impact$total_economic)) * 100,
2
),
"% of the total economic damage."
)
)
## [1] "The top 50 results include all events which caused total economic damage >= 5,421,000 USD per event, and cover 99.99 % of the total economic damage."
Again, to make the analysis more reliable, lets check which of the top 50 results don’t much the official definitions and try to consolidate them correctly:
top50_non_official_econ <- economic_impact %>%
slice_max(total_economic, n = 50) %>%
filter(!(EVTYPE_mapped %in% official_events)) %>%
pull(EVTYPE_mapped)
top50_non_official_econ
## [1] "TYPHOON" "FREEZE"
## [3] "RIVER FLOODING" "COASTAL FLOODING"
## [5] "DAMAGING FREEZE" "EARLY FROST"
## [7] "AGRICULTURAL FREEZE" "UNSEASONABLY COLD"
## [9] "RIVER FLOOD" "SMALL HAIL"
## [11] "COASTAL FLOODING/EROSION" "EXTREME WINDCHILL"
## [13] "EROSION/CSTL FLOOD" "COASTAL FLOODING/EROSION"
## [15] "HEAVY RAIN/HIGH SURF" "HARD FREEZE"
## [17] "UNSEASONAL RAIN" "ASTRONOMICAL HIGH TIDE"
## [19] "MARINE TSTM WIND"
Mapping for those events to match the official list:
storm_clean <- storm_clean %>%
mutate(
EVTYPE_mapped = case_when(
EVTYPE_clean == "TYPHOON" ~ "HURRICANE (TYPHOON)",
EVTYPE_clean == "FREEZE" ~ "FROST/FREEZE",
EVTYPE_clean == "RIVER FLOODING" ~ "FLOOD",
EVTYPE_clean == "COASTAL FLOODING" ~ "COASTAL FLOOD",
EVTYPE_clean == "DAMAGING FREEZE" ~ "FROST/FREEZE",
EVTYPE_clean == "EARLY FROST" ~ "FROST/FREEZE",
EVTYPE_clean == "AGRICULTURAL FREEZE" ~ "FROST/FREEZE",
EVTYPE_clean == "UNSEASONABLY COLD" ~ "COLD/WIND CHILL",
EVTYPE_clean == "RIVER FLOOD" ~ "FLOOD",
EVTYPE_clean == "SMALL HAIL" ~ "HAIL",
EVTYPE_clean == "COASTAL FLOODING/EROSION" ~ "COASTAL FLOOD",
EVTYPE_clean == "EXTREME WINDCHILL" ~ "EXTREME COLD/WIND CHILL",
EVTYPE_clean == "EROSION/CSTL FLOOD" ~ "COASTAL FLOOD",
EVTYPE_clean == "COASTAL FLOODING/EROSION" ~ "COASTAL FLOOD",
EVTYPE_clean == "HEAVY RAIN/HIGH SURF" ~ "HIGH SURF",
EVTYPE_clean == "HARD FREEZE" ~ "FROST/FREEZE",
EVTYPE_clean == "UNSEASONAL RAIN" ~ "HEAVY RAIN",
EVTYPE_clean == "ASTRONOMICAL HIGH TIDE" ~ "STORM SURGE/TIDE",
EVTYPE_clean == "MARINE TSTM WIND" ~ "MARINE THUNDERSTORM WIND",
TRUE ~ EVTYPE_mapped
)
)
Aligning the event types with the official categories will now result in more precise aggregation. Furthermore, analysis based on the official categories has the advantage of being easier to extend — for example, by incorporating new data or comparing results with other studies.
health_impact <- storm_clean %>%
group_by(EVTYPE_mapped) %>%
summarise(
fatalities = sum(FATALITIES, na.rm = TRUE),
injuries = sum(INJURIES, na.rm = TRUE),
total_harm = fatalities + injuries
) %>%
arrange(desc(total_harm))
head(health_impact, 20)
## # A tibble: 20 × 4
## EVTYPE_mapped fatalities injuries total_harm
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 1511 20667 22178
## 2 EXCESSIVE HEAT 1797 6461 8258
## 3 FLOOD 444 6838 7282
## 4 THUNDERSTORM WIND 376 5124 5500
## 5 LIGHTNING 651 4141 4792
## 6 FLASH FLOOD 887 1674 2561
## 7 WILDFIRE 87 1456 1543
## 8 WINTER STORM 191 1292 1483
## 9 HEAT 237 1222 1459
## 10 HURRICANE (TYPHOON) 125 1326 1451
## 11 HIGH WIND 253 1167 1420
## 12 RIP CURRENT 542 503 1045
## 13 DENSE FOG 69 855 924
## 14 HEAVY SNOW 109 733 842
## 15 HAIL 7 723 730
## 16 WINTER WEATHER 62 560 622
## 17 ICE STORM 83 530 613
## 18 BLIZZARD 70 385 455
## 19 TROPICAL STORM 57 338 395
## 20 DUST STORM 11 376 387
Now the ranking is more correct and slightly changed. For example, now THUNDERSTORM WIND (#4) is more harmful than LIGHTNING, WILDFIRE (#7) appeared in top 10, etc.
Plot for visualization:
df <- health_impact[1:10, ]
# Order by total_harm
df <- df[order(df$total_harm, decreasing = TRUE), ]
par(mar = c(10, 4, 4, 2))
# Create a matrix for barplot
mat <- rbind(
Total_Harm = df$total_harm,
Injuries = df$injuries,
Fatalities = df$fatalities
)
barplot(
mat,
beside = TRUE,
names.arg = df$EVTYPE_mapped,
las = 2,
col = c("steelblue3", "turquoise3", "tomato"),
main = "Top 10 Weather Events by Health Impact",
ylab = "Count",
cex.names = 0.7, # x‑axis label size
cex.axis = 0.7, # y‑axis tick size
cex.lab = 0.8 # y‑axis label size
)
legend(
"topright",
legend = c("Total Harm", "Injuries", "Fatalities"),
fill = c("steelblue3", "turquoise3", "tomato"),
cex = 0.8
)
Tornadoes stand out as the most harmful weather event, causing a combined total of fatalities and injuries that is more than double the impact of the second‑ranked event, Excessive Heat. In fact, Tornadoes inflict more harm than the next three event types combined (Excessive Heat, Flood, and Thunderstorm Wind). If the focus is on fatalities, the most harmful event is Excessive Heat, followed by Tornadoes and the Flash Floods. Further analysis could be done on how the effect from the top ranked events changes across the years or what is the ranking for different states.
economic_impact <- storm_clean %>%
group_by(EVTYPE_mapped) %>%
summarise(
property = sum(prop_damage, na.rm = TRUE),
crop = sum(crop_damage, na.rm = TRUE),
total_economic = property + crop
) %>%
arrange(desc(total_economic))
head(economic_impact, 20)
## # A tibble: 20 × 4
## EVTYPE_mapped property crop total_economic
## <chr> <dbl> <dbl> <dbl>
## 1 FLOOD 144129580200 5013161500 149142741700
## 2 HURRICANE (TYPHOON) 81718889010 5350107800 87068996810
## 3 STORM SURGE/TIDE 47844149000 855000 47845004000
## 4 TORNADO 24616945710 283425010 24900370720
## 5 HAIL 14595213420 2496822450 17092035870
## 6 FLASH FLOOD 15222253910 1334901700 16557155610
## 7 DROUGHT 1046101000 13367566000 14413667000
## 8 THUNDERSTORM WIND 7913146380 1016942600 8930088980
## 9 TROPICAL STORM 7642475550 677711000 8320186550
## 10 WILDFIRE 7760449500 402255130 8162704630
## 11 HIGH WIND 5250149860 633861300 5884011160
## 12 ICE STORM 3642398810 15660000 3658058810
## 13 WINTER STORM 1532743250 11944000 1544687250
## 14 FROST/FREEZE 18680000 1368761000 1387441000
## 15 EXTREME COLD/WIND CHILL 29163400 1326023000 1355186400
## 16 HEAVY RAIN 584864440 738169800 1323034240
## 17 LIGHTNING 743077080 6898440 749975520
## 18 HEAVY SNOW 634447540 71122100 705569640
## 19 BLIZZARD 525658950 7060000 532718950
## 20 EXCESSIVE HEAT 7723700 492402000 500125700
The ranking doesn’t differ much from the previous calculation (which was done after the first round of mapping for the health‑harm analysis and was already cleaned to some extent), but the impact calculated now is more precise. Some changes appear in the 11–20 range.
Plot for visualization:
df2 <- economic_impact[1:10, ]
df2 <- df2[order(df2$total_economic, decreasing = TRUE), ]
par(mar = c(10, 4, 4, 2))
# Scale to millions
scale_factor <- 1e6
mat2 <- rbind(
Total_Economic = df2$total_economic / scale_factor,
Property = df2$property / scale_factor,
Crop = df2$crop / scale_factor
)
barplot(
mat2,
beside = TRUE,
names.arg = df2$EVTYPE_mapped,
las = 2,
col = c("steelblue3", "turquoise3", "tomato"),
main = "Top 10 Weather Events by Economic Impact",
ylab = "Damage (Millions USD)",
cex.names = 0.7,
cex.axis = 0.7,
cex.lab = 0.8
)
legend(
"topright",
legend = c("Total Economic", "Property", "Crop"),
fill = c("steelblue3", "turquoise3", "tomato"),
cex = 0.8
)
The most harmful weather event in terms of economic impact is Flood, followed by Hurricane (Typhoon) and Storm Surge/Tide. Overall, most of the economic damage comes from property losses, while crop damage represents a smaller share. However, crop damage understandably dominates during Drought events. Depending on the focus, a deeper analysis could examine the two types of damage separately or investigate trends over time or across geographic regions.
This analysis emphasizes the critical role that certain unfavorable weather events play in affecting both people and the economy, while offering insight into which events are the most harmful. Tornadoes cause the greatest health impact, whereas Floods, Hurricanes (Typhoons), and Storm Surge/Tide result in the most severe economic damage. Allocating resources to address and mitigate these events would be most beneficial.