In this report, we assess the destruction caused by weather events between 1950 and 2011. Our goal is to determine which type of event causes the most damage, as measured by both health consequences (injuries and deaths) and financial consequences (property and crop damage). From these data, we determined that, across all of the documented event types, tornadoes are the most harmful with respect to both health and economic consequences.
We loaded the necessary libraries and read in the data from the included bz2-compressed file. The data are in .csv format, and we use the specific variables that we need as recorded, without any tidying.
library(data.table)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(ggpubr)
data_file <- "repdata_data_StormData.csv.bz2"
data <- read.csv(data_file)
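As a minimal sketch of why no separate decompression step is needed, the round trip below writes a tiny bz2-compressed CSV and reads it back with `read.csv()`. The `toy` data frame and the temporary path are hypothetical stand-ins, not part of the storm data.

```r
# Hypothetical toy rows standing in for the storm file.
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(3, 0))

# Write a bz2-compressed CSV to a temporary path.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path with file(), which detects the bzip2
# compression from the file contents and decompresses on the fly.
toy_back <- read.csv(tmp)
```

The same call pattern is what lets `read.csv(data_file)` work directly on the `.csv.bz2` file.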
After reading in the data, we check its dimensions and the first few rows.
dim(data)
## [1] 902297 37
head(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
We first want to answer the question: which types of events are most harmful with respect to population health? To assess this, we sum the deaths and the injuries for each type of weather event and keep the 10 largest totals of each.
data_deaths <- data %>%
  group_by(event = EVTYPE) %>%
  summarize(deaths = sum(FATALITIES)) %>%
  arrange(desc(deaths)) %>%
  top_n(10, wt = deaths)
## `summarise()` ungrouping output (override with `.groups` argument)
data_injuries <- data %>%
  group_by(event = EVTYPE) %>%
  summarize(injuries = sum(INJURIES)) %>%
  arrange(desc(injuries)) %>%
  top_n(10, wt = injuries)
## `summarise()` ungrouping output (override with `.groups` argument)
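The group, sum, and rank pattern used for deaths and injuries can be sketched on a few made-up rows. This is only an illustration under stated assumptions: the `toy` values are hypothetical, both totals are computed in one `summarize()` call, and `slice_max()` (available in dplyr 1.0.0 and later) is swapped in for `arrange()` plus `top_n()`.

```r
library(dplyr)

# Hypothetical toy rows; column names mirror the storm data.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(2, 3, 4, 1),
  INJURIES   = c(10, 5, 0, 2)
)

# One row per event type with both health totals.
# .groups = "drop" returns an ungrouped tibble and silences the
# "ungrouping output" message.
health <- toy %>%
  group_by(event = EVTYPE) %>%
  summarize(deaths = sum(FATALITIES),
            injuries = sum(INJURIES),
            .groups = "drop")

# Keep the 2 event types with the most deaths, largest first.
top_deaths <- health %>% slice_max(deaths, n = 2)
```

On the full dataset, `n = 10` would play the role of `top_n(10, ...)`.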
Next, we want to answer the question: which types of events have the greatest economic consequences? To assess this, we sum the crop damage and the property damage for each type of weather event, dividing each total by 1,000 for readability.
data_damage <- data %>%
  group_by(event = EVTYPE) %>%
  summarize(property_damage = sum(PROPDMG)/1000,
            crop_damage = sum(CROPDMG)/1000,
            total_damage = (property_damage + crop_damage)) %>%
  arrange(desc(total_damage)) %>%
  top_n(10, wt = total_damage)
## `summarise()` ungrouping output (override with `.groups` argument)
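The `head()` output shows a `PROPDMGEXP` column next to `PROPDMG`, holding values such as `K`. If those codes are magnitude suffixes (K for thousands, M for millions, B for billions), a dollar figure could be derived before summing. The sketch below is written under that assumption. The multiplier table, the `toy` rows, and the choice to treat any other code as a multiplier of 1 are illustrative and not part of this analysis.

```r
# Hypothetical rows; the first two mirror PROPDMG values seen in head(data).
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HAIL"),
  PROPDMG    = c(25, 2.5, 1.2, 10),
  PROPDMGEXP = c("K", "K", "M", "")
)

# Assumed meaning of the exponent codes.
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up each code; unknown or empty codes come back NA and fall back to 1.
mult <- unname(multiplier[toupper(toy$PROPDMGEXP)])
mult[is.na(mult)] <- 1

toy$prop_dollars <- toy$PROPDMG * mult

# Total scaled property damage per event type, as a named numeric vector.
damage_by_event <- vapply(split(toy$prop_dollars, toy$EVTYPE), sum, numeric(1))
```

The same lookup could be applied to `CROPDMG` and `CROPDMGEXP` before the two totals are added.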
In order to show the events with the greatest health risks, we will list the top 10 in terms of both deaths and injuries.
data_deaths
## # A tibble: 10 x 2
## event deaths
## <chr> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
data_injuries
## # A tibble: 10 x 2
## event injuries
## <chr> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
We will also produce bar charts showing that tornadoes clearly pose the greatest risk to the population’s health.
death_plot <- ggplot(data_deaths, aes(event, deaths)) +
  geom_bar(stat = "identity", aes(fill = event)) +
  ylab("Total Deaths") +
  xlab("Event Type") +
  theme(axis.text.x = element_text(angle = 90, size = 6),
        legend.position = "none",
        axis.title.x = element_text(color = "blue",
                                    size = 12),
        axis.title.y = element_text(color = "blue",
                                    size = 12))
injuries_plot <- ggplot(data_injuries, aes(event, injuries)) +
  geom_bar(stat = "identity", aes(fill = event)) +
  ylab("Total Injuries") +
  xlab("Event Type") +
  theme(axis.text.x = element_text(angle = 90, size = 6),
        legend.position = "none",
        axis.title.x = element_text(color = "blue",
                                    size = 12),
        axis.title.y = element_text(color = "blue",
                                    size = 12))
ggarrange(death_plot, injuries_plot, ncol = 2,
          labels = c("Total Deaths per Event Type",
                     "Total Injuries per Event Type"),
          font.label = list(color = "blue"))
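Because `event` is a character column, ggplot2 places the bars in alphabetical order rather than by size. One possible way to make the ranking visible at a glance is to reorder the x variable by the plotted value. The sketch below uses three rows copied from the deaths table; `toy` and `ordered_plot` are illustrative names, and `geom_col()` is used as the equivalent of `geom_bar(stat = "identity")`.

```r
library(ggplot2)

# Three rows copied from the top-10 deaths table in this report.
toy <- data.frame(
  event  = c("EXCESSIVE HEAT", "FLASH FLOOD", "TORNADO"),
  deaths = c(1903, 978, 5633)
)

# reorder() turns event into a factor whose levels are sorted by -deaths,
# so the event with the largest total is drawn first (leftmost).
ordered_plot <- ggplot(toy, aes(reorder(event, -deaths), deaths)) +
  geom_col(aes(fill = event)) +
  xlab("Event Type") +
  ylab("Total Deaths") +
  theme(axis.text.x = element_text(angle = 90, size = 6),
        legend.position = "none")
```

Printing `ordered_plot` draws the chart; the same `reorder()` call could be dropped into the existing `aes()` mappings.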
We will do the same for economic risk. Here is the top 10:
data_damage
## # A tibble: 10 x 4
## event property_damage crop_damage total_damage
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 3212. 100. 3312.
## 2 FLASH FLOOD 1420. 179. 1599.
## 3 TSTM WIND 1336. 109. 1445.
## 4 HAIL 689. 580. 1268.
## 5 FLOOD 900. 168. 1068.
## 6 THUNDERSTORM WIND 877. 66.8 944.
## 7 LIGHTNING 603. 3.58 607.
## 8 THUNDERSTORM WINDS 446. 18.7 465.
## 9 HIGH WIND 325. 17.3 342.
## 10 WINTER STORM 133. 1.98 135.
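The table lists TSTM WIND, THUNDERSTORM WIND, and THUNDERSTORM WINDS as separate rows. If these labels refer to the same kind of event (an assumption, since the report treats `EVTYPE` as recorded), they could be merged before grouping. The two substitution rules below are a hypothetical sketch covering only these three spellings.

```r
# Event labels as they appear in the damage table, plus one unrelated label.
events <- c("TSTM WIND", "THUNDERSTORM WIND", "THUNDERSTORM WINDS", "TORNADO")

# Normalize case and surrounding whitespace first.
clean <- toupper(trimws(events))

# Assumed equivalences: expand the TSTM abbreviation, drop the plural.
clean <- sub("^TSTM", "THUNDERSTORM", clean)
clean <- sub("WINDS$", "WIND", clean)

# Count how many original labels map to each cleaned label.
label_counts <- table(clean)
```

Grouping by the cleaned label instead of `EVTYPE` would then add the three rows into one total.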
And here is the bar chart showing, again, that tornadoes also pose the greatest financial risk.
ggplot(data_damage, aes(event, total_damage)) +
  geom_bar(stat = "identity", aes(fill = event)) +
  ylab("Total Damage (thousands)") +
  xlab("Event Type") +
  ggtitle("Total Damage per Event Type") +
  theme(axis.text.x = element_text(angle = 90, size = 6),
        legend.position = "none",
        plot.title = element_text(hjust = 0.5, color = "blue",
                                  size = 16, face = "bold"),
        axis.title.x = element_text(color = "blue",
                                    size = 12, face = "bold"),
        axis.title.y = element_text(color = "blue",
                                    size = 12, face = "bold"))