Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project explores the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. The analysis addresses two questions: which types of events most affect population health, and which types of events most affect the United States economy.
The data for this assignment come as a comma-separated-value (CSV) file compressed with the bzip2 algorithm to reduce its size. The file can be downloaded from the course web site: Data
Information about the data can be found in the Storm Data Documentation.
Retrieving Data
library(readr)
if(!file.exists("StormData.csv.bz2")) {
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile = "StormData.csv.bz2")
}
stormData <- read.csv("StormData.csv.bz2")
str(stormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
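The retrieval step passes the `.bz2` file straight to `read.csv()`, which works because R detects the compression when it opens a file for reading. As a minimal sketch of that behavior, the snippet below writes a tiny bzip2-compressed CSV and reads it back; the temporary file and its two rows are made up for illustration and are not part of the storm data.

```r
# Write a tiny CSV through a bzip2 connection (toy rows, not the NOAA file)
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,1", "HAIL,0"), con)
close(con)

# read.csv() opens the path in read mode, where R decompresses bzip2 on the fly
toy <- read.csv(tmp)
print(toy)
```

No explicit decompression step is needed, which is why the report can call `read.csv("StormData.csv.bz2")` directly.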
The main questions are as follows:
1. Which types of events are most harmful to U.S. population health?
2. Which types of events are most harmful to the U.S. economy?
Therefore, the columns I am going to use are as follows:
1. Event Type (EVTYPE)
2. Date (BGN_DATE)
3. Fatalities (FATALITIES): population health
4. Injuries (INJURIES): population health
5. Property Damage (PROPDMG, with its magnitude code PROPDMGEXP): economy
6. Crop Damage (CROPDMG, with its magnitude code CROPDMGEXP): economy
Selecting columns to create new dataset
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(magrittr)
table(stormData$PROPDMGEXP)
##
##             -      ?      +      0      1      2      3      4      5      6
## 465934      1      8      5    216     25     13      4      4     28      4
##      7      8      B      h      H      K      m      M
##      5      1     40      1      6 424665      7  11330
names <- c("EVTYPE", "BGN_DATE", "FATALITIES", "INJURIES", "PROPDMG",
"PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
data1 <- select(stormData, all_of(names))
After checking the structure of the data, I will change the event type column into a factor variable and the date column into a date variable.
Changing into appropriate variable type
data1$EVTYPE <- as.factor(data1$EVTYPE)
data1$BGN_DATE <- as.Date(data1$BGN_DATE, format = "%m/%d/%Y")
str(data1)
## 'data.frame': 902297 obs. of 8 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_DATE : Date, format: "1950-04-18" "1950-04-18" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
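The raw `BGN_DATE` strings carry a trailing `0:00:00` time component, yet the format string used above only describes the date part. A small sketch of why this still works: `as.Date()` parses the portion matched by the format and ignores the rest. The two strings are copied from the `str()` output above; the variable names are illustrative.

```r
# Two raw values as they appear in BGN_DATE
raw_dates <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00")

# Only month/day/year is described; the trailing time text is ignored
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")
print(parsed)

# Once parsed, date parts such as the year can be extracted
years <- format(parsed, "%Y")
print(years)
```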
First, the analysis will attempt to understand which types of events are most harmful with respect to population health.
Identifying top 10 events causing fatalities
data2 <- data1
data2 %<>%
group_by(EVTYPE) %>%
summarise(FatalSum = sum(FATALITIES)) %>%
arrange(desc(FatalSum))
## `summarise()` ungrouping output (override with `.groups` argument)
top10_harmingHealth <- data2[1:10,]
top10_harmingHealth
## # A tibble: 10 x 2
## EVTYPE FatalSum
## <fct> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
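The ranking above comes from a group, sum, and sort pipeline. As a minimal sketch on made-up numbers (the `toy` data frame is hypothetical), the same steps can be written with a plain assignment; `x %<>% f()` from magrittr is shorthand for `x <- x %>% f()`.

```r
library(dplyr)

# Toy event records (illustrative values, not from the storm database)
toy <- data.frame(
  EVTYPE = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(3, 1, 2, 0, 2)
)

# Sum fatalities per event type, then sort from largest to smallest
toy_totals <- toy %>%
  group_by(EVTYPE) %>%
  summarise(FatalSum = sum(FATALITIES)) %>%
  arrange(desc(FatalSum))

# Keep the first rows of the sorted table, as data2[1:10, ] does above
top2 <- head(toy_totals, 2)
print(top2)
```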
Based on the analysis, the top 10 events that cause the most fatalities are as follows:
Graphing top 10 events causing fatalities
library(ggplot2)
g <- ggplot(data = top10_harmingHealth, aes(x = reorder(EVTYPE, -FatalSum),
y = FatalSum))
g + geom_bar( stat = "identity", fill = "#003f5c") +
labs(x = "Event Type", y = "Fatalities", title = "Total Fatalities by Event Type (1950-2011)") +
theme(axis.text.x = element_text(size = 8)) +
geom_text(aes(label = FatalSum), vjust = -.75, size = 3.5)
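The bars are sorted from tallest to shortest because of `reorder(EVTYPE, -FatalSum)`, which rearranges the factor levels by increasing value of its second argument; negating the totals therefore puts the largest first. A small base R sketch, using three totals from the table above entered in scrambled order:

```r
# Three event types and their fatality totals, deliberately out of order
evtype <- factor(c("FLASH FLOOD", "TORNADO", "EXCESSIVE HEAT"))
fatal_sum <- c(978, 5633, 1903)

# Default factor levels are alphabetical
print(levels(evtype))

# reorder() sorts levels by ascending score, so negating gives descending totals
ordered_levels <- levels(reorder(evtype, -fatal_sum))
print(ordered_levels)
```

ggplot2 draws a discrete x axis in level order, so the reordered factor controls the left-to-right order of the bars.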
Identifying top 10 events causing injuries
data3 <- data1
data3 %<>%
group_by(EVTYPE) %>%
summarise(InjurySum = sum(INJURIES)) %>%
arrange(desc(InjurySum))
## `summarise()` ungrouping output (override with `.groups` argument)
top10_injury_events <- data3[1:10,]
top10_injury_events
## # A tibble: 10 x 2
## EVTYPE InjurySum
## <fct> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
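The table lists TSTM WIND and THUNDERSTORM WIND as separate rows, and the factor levels shown earlier include a label with a leading space, so some event types are probably split across several spellings. The analysis here does not merge them. As a hedged sketch of one possible cleanup, the snippet below trims whitespace, upper-cases the labels, and expands a leading `TSTM`; treating `TSTM` as an abbreviation of `THUNDERSTORM` is an assumption, and the input vector is illustrative.

```r
# Illustrative labels showing the kinds of variation seen in EVTYPE
evtype <- c(" HIGH SURF ADVISORY", "TSTM WIND", "THUNDERSTORM WIND", "tstm wind")

# Normalize whitespace and case
clean <- toupper(trimws(evtype))

# Assumption: a leading "TSTM" abbreviates "THUNDERSTORM"
clean <- sub("^TSTM", "THUNDERSTORM", clean)

print(table(clean))
```

Applying a step like this before `group_by(EVTYPE)` would change the totals and possibly the ranking, so any merged categories should be reported alongside the results.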
Based on the analysis, the top 10 events that cause the most injuries are as follows:
Graphing top 10 events causing injuries
g2 <- ggplot(data = top10_injury_events, aes(x = reorder(EVTYPE, -InjurySum),
y = InjurySum))
g2 + geom_bar( stat = "identity", fill = "#003f5c") +
labs(x = "Event Type", y = "Injuries", title = "Total Injuries by Event Type (1950-2011)") +
theme(axis.text.x = element_text(size = 8)) +
geom_text(aes(label = InjurySum), vjust = -.75, size = 3.5)
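Because crop and property damage are both measured in the same monetary unit (dollars), I decided to create a new variable that combines the monetary value of crop damage and property damage. The PROPDMGEXP and CROPDMGEXP columns hold magnitude codes (for example, K for thousands, M for millions, B for billions), so each code is first converted to a power of 10 and applied to the corresponding damage figure.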
Creating total cost variable
data4 <- data1
data4 %<>%
mutate(PROPDMGEXP = case_when(
PROPDMGEXP == "K" ~ 3,
PROPDMGEXP == "M" ~ 6,
PROPDMGEXP == "B" ~ 9,
PROPDMGEXP == "m" ~ 6,
PROPDMGEXP == "5" ~ 5,
PROPDMGEXP == "6" ~ 6,
PROPDMGEXP == "4" ~ 4,
PROPDMGEXP == "2" ~ 2,
PROPDMGEXP == "3" ~ 3,
PROPDMGEXP == "h" ~ 2,
PROPDMGEXP == "H" ~ 2,
PROPDMGEXP == "7" ~ 7,
PROPDMGEXP == "1" ~ 1,
PROPDMGEXP == "8" ~ 8)) %>%
mutate(CROPDMGEXP = case_when(
CROPDMGEXP == "M" ~ 6,
CROPDMGEXP == "K" ~ 3,
CROPDMGEXP == "m" ~ 6,
CROPDMGEXP == "B" ~ 9,
CROPDMGEXP == "k" ~ 3,
CROPDMGEXP == "2" ~ 2))
# Codes not matched above (blank, "-", "?", "+", "0") become NA; treat them as exponent 0
data4$PROPDMGEXP[is.na(data4$PROPDMGEXP)] <- 0
data4$CROPDMGEXP[is.na(data4$CROPDMGEXP)] <- 0
data4 %<>%
mutate(total_cost = (PROPDMG * 10^PROPDMGEXP) + (CROPDMG * 10^CROPDMGEXP))
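To make the arithmetic concrete: a PROPDMG of 25 with code K means 25 × 10^3 = 25,000 dollars. The sketch below expresses the same conversion with a named lookup vector instead of `case_when()`. It is an alternative formulation, not the code used in this report, and `exp_lookup` and `to_dollars` are hypothetical names; like the code above, it treats unrecognized codes as exponent 0.

```r
# Exponent for each magnitude code: digits map to themselves, letters to powers of 10
exp_lookup <- c(setNames(0:9, 0:9), H = 2, K = 3, M = 6, B = 9)

to_dollars <- function(value, code) {
  # Upper-case so that "k", "m", and "h" match their capital forms
  expo <- unname(exp_lookup[toupper(code)])
  # Blank or unrecognized codes ("", "-", "?", "+") give NA; use exponent 0
  expo[is.na(expo)] <- 0
  value * 10^expo
}

costs <- to_dollars(c(25, 2.5, 1, 10), c("K", "m", "", "?"))
print(costs)
```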
Identifying top 10 events causing most economic cost
top10_costs_events <- data4
top10_costs_events %<>%
group_by(EVTYPE) %>%
summarize(totalcost = sum(total_cost)) %>%
arrange(desc(totalcost)) %>%
mutate(totalcost = totalcost/1000000000)
## `summarise()` ungrouping output (override with `.groups` argument)
top10_costs_events <- top10_costs_events[1:10,]
top10_costs_events
## # A tibble: 10 x 2
## EVTYPE totalcost
## <fct> <dbl>
## 1 FLOOD 150.
## 2 HURRICANE/TYPHOON 71.9
## 3 TORNADO 57.4
## 4 STORM SURGE 43.3
## 5 HAIL 18.8
## 6 FLASH FLOOD 18.2
## 7 DROUGHT 15.0
## 8 HURRICANE 14.6
## 9 RIVER FLOOD 10.1
## 10 ICE STORM 8.97
The top 10 events that cause the most economic cost are as follows:
Graphing top 10 events causing most economic cost
g3 <- ggplot(top10_costs_events, aes(x = reorder(EVTYPE, -totalcost), y = totalcost))
g3 +
geom_bar(stat = "identity", fill = "#003f5c") +
labs(x = "Event Type", y = "Total Cost (billions of dollars)", title = "Total Cost by Event Type (1950-2011)") +
# vjust and size are fixed settings, so they belong outside aes()
geom_text(aes(label = round(totalcost, digits = 1)), vjust = -.75, size = 3.5) +
theme(legend.position = "none", axis.text.x = element_text(size = 8))
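Based on the results of the analysis, we found the following:
The top 10 events that cause the most fatalities are led by TORNADO (5,633), EXCESSIVE HEAT (1,903), and FLASH FLOOD (978).
The top 10 events that cause the most injuries are led by TORNADO (91,346), TSTM WIND (6,957), and FLOOD (6,789).
The top 10 events that cause the most economic cost are led by FLOOD (about 150 billion dollars), HURRICANE/TYPHOON (71.9 billion), and TORNADO (57.4 billion).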