Our analysis show that Tornados have the greatest health impact, whereas Hurricane/Typhoon have the greatest economical impact.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
Your data analysis must address the following questions:
Consider writing your report as if it were to be read by a government or municipal manager who might be responsible for preparing for severe weather events and will need to prioritize resources for different types of events. However, there is no need to make any specific recommendations in your report.
knitr::opts_chunk$set(echo = TRUE,
warning = FALSE,
fig.width = 10,
fig.height = 5,
fig.keep = "all",
dev = "png")
library(ggplot2)
library(knitr)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ lubridate 1.9.3 ✔ stringr 1.5.1
## ✔ purrr 1.0.2 ✔ tibble 3.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(data.table)
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:reshape2':
##
## dcast, melt
##
## The following objects are masked from 'package:lubridate':
##
## hour, isoweek, mday, minute, month, quarter, second, wday, week,
## yday, year
##
## The following object is masked from 'package:purrr':
##
## transpose
##
## The following objects are masked from 'package:dplyr':
##
## between, first, last
sessionInfo()
## R version 4.3.2 (2023-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 11 x64 (build 22631)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.utf8
## [2] LC_CTYPE=English_United Kingdom.utf8
## [3] LC_MONETARY=English_United Kingdom.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.utf8
##
## time zone: Europe/London
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.15.0 reshape2_1.4.4 lubridate_1.9.3 forcats_1.0.0
## [5] stringr_1.5.1 purrr_1.0.2 readr_2.1.5 tibble_3.2.1
## [9] tidyverse_2.0.0 dplyr_1.1.4 tidyr_1.3.1 knitr_1.45
## [13] ggplot2_3.4.4
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.4 jsonlite_1.8.8 compiler_4.3.2 Rcpp_1.0.12
## [5] tidyselect_1.2.0 jquerylib_0.1.4 scales_1.3.0 yaml_2.3.8
## [9] fastmap_1.1.1 plyr_1.8.9 R6_2.5.1 generics_0.1.3
## [13] munsell_0.5.0 bslib_0.6.1 pillar_1.9.0 tzdb_0.4.0
## [17] rlang_1.1.3 utf8_1.2.4 stringi_1.8.3 cachem_1.0.8
## [21] xfun_0.41 sass_0.4.8 timechange_0.3.0 cli_3.6.2
## [25] withr_3.0.0 magrittr_2.0.3 digest_0.6.34 grid_4.3.2
## [29] rstudioapi_0.15.0 hms_1.1.3 lifecycle_1.0.4 vctrs_0.6.5
## [33] evaluate_0.23 glue_1.7.0 fansi_1.0.6 colorspace_2.1-0
## [37] rmarkdown_2.25 tools_4.3.2 pkgconfig_2.0.3 htmltools_0.5.7
StormData <- read.csv("repdata_data_StormData.csv")
head(StormData)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
colnames(StormData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
As stated on the introduction, we are only looking at data relating to health and economical impact. The only columns used in analysis will be:
Refined <- c("EVTYPE",
"FATALITIES",
"INJURIES",
"PROPDMG",
"PROPDMGEXP",
"CROPDMG",
"CROPDMGEXP")
Refined_Model <- StormData[, Refined]
Refined1 <- StormData[, Refined]
summary(Refined1)
## EVTYPE FATALITIES INJURIES PROPDMG
## Length:902297 Min. : 0.0000 Min. : 0.0000 Min. : 0.00
## Class :character 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.00
## Mode :character Median : 0.0000 Median : 0.0000 Median : 0.00
## Mean : 0.0168 Mean : 0.1557 Mean : 12.06
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.50
## Max. :583.0000 Max. :1700.0000 Max. :5000.00
## PROPDMGEXP CROPDMG CROPDMGEXP
## Length:902297 Min. : 0.000 Length:902297
## Class :character 1st Qu.: 0.000 Class :character
## Mode :character Median : 0.000 Mode :character
## Mean : 1.527
## 3rd Qu.: 0.000
## Max. :990.000
# Create a subtable where only events that caused damaged are stored
Refined2 <- as_tibble(Refined1)
Refined2 <- Refined1 %>%
filter(INJURIES > 0,
FATALITIES > 0,
PROPDMG > 0,
CROPDMG > 0)
Refined2df <- as.data.frame(Refined2)
Refined2dt <- as.data.table(Refined2)
The code goes as follows:
Refined2df$PROPDMGEXP[Refined2df$PROPDMGEXP == "K"] <- 1000
Refined2df$PROPDMGEXP[Refined2df$PROPDMGEXP == "M"] <- 1000000
Refined2df$PROPDMGEXP[Refined2df$PROPDMGEXP == "B"] <- 1000000000
Refined2df$PROPDMGEXP[is.na(Refined2df$PROPDMGEXP) ] <- 1
Refined2df$CROPDMGEXP[Refined2df$CROPDMGEXP == "K"] <- 1000
Refined2df$CROPDMGEXP[Refined2df$CROPDMGEXP == "M"] <- 1000000
Refined2df$CROPDMGEXP[Refined2df$CROPDMGEXP == "B"] <- 1000000000
Refined2df$CROPDMGEXP[is.na(Refined2df$CROPDMGEXP) ] <- 1
# Creating columns with Property and Crop Cost
Refined2df$CropCost <- (Refined2df$CROPDMG * as.numeric(Refined2df$CROPDMGEXP))
Refined2df$PropertyCost <- (Refined2df$PROPDMG * as.numeric(Refined2df$PROPDMGEXP))
With this analysis we will estimate the health impact of the different events.
HealthImpact <- c("EVTYPE",
"FATALITIES",
"INJURIES")
HealthImpact <- StormData[, HealthImpact]
HI <- HealthImpact %>%
group_by(EVTYPE) %>%
summarise(Fatalities = sum(FATALITIES),
Injuries = sum(INJURIES),
Total = (sum(FATALITIES) + sum(INJURIES))) %>%
arrange(desc(by_group = Total))
Top10_Health <- HI[1:10,]
head(Top10_Health)
## # A tibble: 6 × 4
## EVTYPE Fatalities Injuries Total
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
## 6 HEAT 937 2100 3037
With this analysis we will look at the economical impact.
EI <- Refined2df %>%
group_by(EVTYPE) %>%
summarise(CropCost = sum(CropCost),
PropCost = sum(PropertyCost),
Total = (sum(CropCost) + sum(PropertyCost))) %>%
arrange(desc(by_group = Total))
Top10_Eco <- EI[1:10,]
head(Top10_Eco)
## # A tibble: 6 × 4
## EVTYPE CropCost PropCost Total
## <chr> <dbl> <dbl> <dbl>
## 1 HURRICANE/TYPHOON 1795000000 11300000000 13095000000
## 2 WILDFIRE 75150000 1165120000 1240270000
## 3 HIGH WIND 222935000 948690000 1171625000
## 4 TORNADO 93525000 1051902000 1145427000
## 5 TROPICAL STORM 121695000 628520000 750215000
## 6 EXCESSIVE HEAT 492400000 170000 492570000
To present this data we will generate a bar chart
HII <- Top10_Health %>%
select(-Total)
HII_melt <- melt(HII,
id.vars = "EVTYPE",
variable.name = "Damage_Type")
ggplot(HII_melt,
aes(x = reorder(EVTYPE, -value),
y = value)) +
geom_bar(stat = "identity",
aes(fill = Damage_Type),
position = "stack") +
scale_fill_manual(values = c("red",
"blue")) +
labs(y = "Total",
x = "Event Type",
title = "Health Impact",
subtitle = "Top 10 US Weather Events") +
scale_y_continuous(expand = c(0, 0)) +
#scale_x_continuous(expand = c(0, 0)) +
theme_classic() +
theme(axis.text.x = element_text(angle=45, hjust=1))
ggsave("question1.png")
## Saving 10 x 5 in image
As Tornado overpower the data, we are going to remove them from the graph in order to see the other 9 events into more detail.
HIII <- Top10_Health %>%
select(-Total) %>%
filter(EVTYPE != "TORNADO")
HIII_melt <- melt(HIII,
id.vars = "EVTYPE",
variable.name = "Damage_Type")
ggplot(HIII_melt,
aes(x = reorder(EVTYPE, -value),
y = value)) +
geom_bar(stat = "identity",
aes(fill = Damage_Type),
position = "stack") +
scale_fill_manual(values = c("red",
"blue")) +
labs(y = "Total",
x = "Event Type",
title = "Health Impact",
subtitle = "Top 10 US Weather Events (without TORNADO") +
scale_y_continuous(expand = c(0, 0)) +
#scale_x_continuous(expand = c(0, 0)) +
theme_classic() +
theme(axis.text.x = element_text(angle=45, hjust=1))
ggsave("question1-1.png")
## Saving 10 x 5 in image
EII <- Top10_Eco %>%
select(-Total)
EII_melt <- melt(EII,
id.vars = "EVTYPE",
variable.name = "Type")
ggplot(EII_melt,
aes(x = reorder(EVTYPE, -value),
y = value)) +
geom_bar(stat = "identity",
aes(fill = Type),
position = "stack") +
scale_fill_manual(values = c("grey",
"green")) +
labs(y = "Total",
x = "Event Type",
title = "Economic Impact",
subtitle = "Top 10 US Weather Events") +
scale_y_continuous(expand = c(0, 0)) +
#scale_x_continuous(expand = c(0, 0)) +
theme_classic() +
theme(axis.text.x = element_text(angle=45, hjust=1))
ggsave("question2.png")
## Saving 10 x 5 in image