In this report we aim to identify the natural events that have the greatest consequences for human health and the economy. Our goal is to highlight the key event types on which governmental action should focus to obtain the greatest effect at minimal cost. To do so, we use data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
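# Create the data directory if it does not exist yet
if (!file.exists("data")) {
  dir.create("data")
}
UrlFile <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# The download is bzip2-compressed despite the .csv name; read.csv() decompresses it on the fly
filename <- "./data/data_for_peer_assignment.csv"
if (!file.exists(filename)) {
  download.file(UrlFile, filename, method = "curl")
}
stormdata <- read.csv(filename)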
For this assignment, we will use the following packages:
library(dplyr)
library(data.table)
library(lubridate)
library(ggplot2)
After loading the data, we can check its dimensions and its first few rows:
dim(stormdata)
## [1] 902297 37
head(stormdata)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
The most harmful events are determined from the number of fatalities and injuries. We start by filtering the dataset to keep only the observations with at least one fatality or injury, then add both variables together into a single CASUALTIES variable:
casu_stormdata <- filter(stormdata, (FATALITIES > 0) | (INJURIES > 0))
casu_stormdata <- data.table(casu_stormdata)
casu_stormdata[, CASUALTIES := FATALITIES + INJURIES]
We can now calculate the total casualties caused by each event type:
casu_stormdata <- casu_stormdata %>% mutate(EVTYPE = as.factor(EVTYPE)) %>%
  group_by(EVTYPE)
sum_casu <- summarise(casu_stormdata, sum = sum(CASUALTIES, na.rm = TRUE))
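To make the filter, sum, and aggregate steps concrete, here is a minimal sketch on a toy table. The values are invented for illustration and are not taken from the NOAA database; base R's aggregate() stands in for the dplyr pipeline so the example runs without extra packages.

```r
# Toy data (hypothetical values, not from the NOAA database)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(2, 0, 1, 3, 0),
  INJURIES   = c(10, 4, 0, 0, 0)
)

# Keep only the events that harmed at least one person
toy <- toy[toy$FATALITIES > 0 | toy$INJURIES > 0, ]

# Casualties = fatalities + injuries
toy$CASUALTIES <- toy$FATALITIES + toy$INJURIES

# Total casualties per event type, largest first
toy_sum <- aggregate(CASUALTIES ~ EVTYPE, data = toy, FUN = sum)
toy_sum <- toy_sum[order(-toy_sum$CASUALTIES), ]
print(toy_sum)
```

The last FLOOD row has no fatalities or injuries, so it is dropped before the totals are computed.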
We want to see the most harmful events. Therefore, we will only keep the top decile, i.e. the 10% of event types with the highest casualty totals:
quant_sum_casu <- quantile(sum_casu$sum, probs = seq(0,1,0.1))
quant_sum_casu
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90%
## 1.0 1.0 1.0 2.0 3.0 5.0 14.0 29.3 92.0 463.9
## 100%
## 96979.0
most_casu <- filter(sum_casu, sum >= as.numeric(quant_sum_casu[10]))
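Note that element 10 of the quantile vector is the 90% quantile, because the vector starts at 0%. The sketch below shows the same cutoff on a small set of invented totals, asking quantile() for the 90th percentile directly. The names totals, cutoff, and top are illustrative only.

```r
# Hypothetical per-event totals (invented for illustration)
totals <- c(A = 1, B = 2, C = 3, D = 5, E = 8,
            F = 13, G = 21, H = 34, I = 55, J = 1000)

# 90th percentile, requested directly instead of indexing the decile vector
cutoff <- quantile(totals, probs = 0.9, names = FALSE)

# The same value obtained the way the report does it: 10th element of the deciles
cutoff_by_index <- as.numeric(quantile(totals, probs = seq(0, 1, 0.1))[10])

# Event types at or above the cutoff form the top decile
top <- totals[totals >= cutoff]
print(cutoff)
print(top)
```

Requesting probs = 0.9 by name avoids having to remember which position in the decile vector corresponds to 90%.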
We can see that some event types appear under several names, such as Thunderstorm Winds, Thunderstorm Wind, and TSTM Wind.
We combine these rows by adding their totals together and dropping the redundant rows:
most_casu[15,2] <- most_casu[14,2] + most_casu[15,2]
most_casu[17,2] <- most_casu[16,2] + most_casu[17,2] + most_casu[19,2]
most_casu <- most_casu[-c(14, 16, 19), ]
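Merging by row position depends on the exact order of the table, so it breaks silently if the data changes. As an alternative, the labels can be normalised by name and the totals re-aggregated. The sketch below is one possible approach on invented totals; normalise_evtype() and its two rewrite rules are hypothetical and only cover the thunderstorm variants mentioned above.

```r
# Hypothetical per-event totals with inconsistent labels
sum_toy <- data.frame(
  EVTYPE = c("THUNDERSTORM WIND", "THUNDERSTORM WINDS", "TSTM WIND",
             "TORNADO", "FLOOD"),
  sum    = c(100, 50, 25, 900, 70),
  stringsAsFactors = FALSE
)

# Illustrative helper: map known spelling variants onto one label
normalise_evtype <- function(x) {
  x <- toupper(trimws(x))                   # uniform case, no stray spaces
  x <- gsub("^TSTM", "THUNDERSTORM", x)     # expand the abbreviation
  x <- gsub("WINDS$", "WIND", x)            # plural to singular
  x
}

sum_toy$EVTYPE <- normalise_evtype(sum_toy$EVTYPE)

# Re-aggregate so that rows sharing a label are added together
merged <- aggregate(sum ~ EVTYPE, data = sum_toy, FUN = sum)
print(merged)
```

Applying such a normalisation before the per-event aggregation, rather than after the top-decile filter, would also catch variants whose individual totals fall below the cutoff.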
We can now properly see the most harmful events:
g_casu <- ggplot(most_casu, aes(EVTYPE, sum))
g_casu + geom_bar(stat = "identity", aes(fill = EVTYPE)) +
  ggtitle("Casualties per event type (top decile)") +
  labs(y = "Casualties", fill = "Event type") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        plot.title = element_text(hjust = 0.5))
Even among the top 19 event types, the overwhelming majority of casualties were caused by tornadoes.
This time, we focus on the economic cost rather than the human consequences. For that, we keep only the most significant observations of the dataset: events whose damage is recorded in millions or billions of dollars.
That means keeping only rows with non-zero damage that have either PROPDMGEXP or CROPDMGEXP equal to M or B (in upper or lower case):
dmg_stormdata <- filter(stormdata, ((PROPDMG > 0) | (CROPDMG > 0)) &
                          ((PROPDMGEXP %in% c("m", "M", "b", "B")) | (CROPDMGEXP %in% c("m", "M", "b", "B"))))
dmg_stormdata <- data.table(dmg_stormdata)
To properly calculate damage costs, we have to combine PROPDMG with PROPDMGEXP and CROPDMG with CROPDMGEXP into single numeric variables.
We first create a function that converts a value and its exponent code into a complete numeric value:
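create_num <- function(num, expo){
  # Compare exponent codes case-insensitively
  tempexpo <- tolower(expo)
  if (tempexpo == "k") {
    return(num * 1000)            # thousands
  } else if (tempexpo == "m") {
    return(num * 1000000)         # millions
  } else if (tempexpo == "b") {
    return(num * 1000000000)      # billions
  } else if (tempexpo %in% c("0","1","2","3","4","5","6","7","8","9")){
    return(num * 10)              # digit codes are treated as a factor of 10
  } else if (tempexpo == "+"){
    return(num)                   # value kept as is
  } else {
    return(0)                     # blank or unrecognised codes count as no damage
  }
}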
Then, we apply this function to each observation and store the results in two variables, PROPDMGNUM and CROPDMGNUM, for numeric property damage and numeric crop damage respectively:
dmg_stormdata[, PROPDMGNUM := 0]
dmg_stormdata[, CROPDMGNUM := 0]
for (i in seq_len(nrow(dmg_stormdata))){
  dmg_stormdata$PROPDMGNUM[i] <- create_num(dmg_stormdata$PROPDMG[i], dmg_stormdata$PROPDMGEXP[i])
  dmg_stormdata$CROPDMGNUM[i] <- create_num(dmg_stormdata$CROPDMG[i], dmg_stormdata$CROPDMGEXP[i])
}
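The row-by-row loop can be slow on a table of this size. A vectorised alternative is sketched below: it computes one multiplier per exponent code and multiplies whole columns at once. The helper exp_multiplier() is hypothetical (it is not part of the original analysis) and is written to mirror the rules of create_num(); the sample values are invented.

```r
# Illustrative helper: multiplier for each exponent code, same rules as create_num()
exp_multiplier <- function(expo) {
  expo <- tolower(as.character(expo))
  mult <- rep(0, length(expo))               # blank or unrecognised codes -> 0
  mult[expo == "k"] <- 1e3                   # thousands
  mult[expo == "m"] <- 1e6                   # millions
  mult[expo == "b"] <- 1e9                   # billions
  mult[expo %in% as.character(0:9)] <- 10    # digit codes -> factor of 10
  mult[expo == "+"] <- 1                     # value kept as is
  mult
}

# Invented sample values covering every branch
dmg  <- c(25, 2.5, 1.2, 3, 7, 4)
expo <- c("K", "m", "B", "5", "+", "")

res <- dmg * exp_multiplier(expo)
print(res)
```

Assuming dmg_stormdata is a data.table as above, the same helper could then fill each column in a single step, for example `dmg_stormdata[, PROPDMGNUM := PROPDMG * exp_multiplier(PROPDMGEXP)]`.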
We calculate the total economic cost by adding PROPDMGNUM and CROPDMGNUM into a new variable, TOTALDMG:
dmg_stormdata[, TOTALDMG := PROPDMGNUM + CROPDMGNUM]
We want to focus on the events that cause the greatest economic damage. Therefore, we once again keep only the top decile:
dmg_stormdata <- dmg_stormdata %>% mutate(EVTYPE = as.factor(EVTYPE)) %>%
  group_by(EVTYPE)
sum_dmg <- summarise(dmg_stormdata, sum = sum(TOTALDMG, na.rm = TRUE))
quant_sum_dmg <- quantile(sum_dmg$sum, probs = seq(0, 1, 0.1))
most_dmg <- filter(sum_dmg, sum >= as.numeric(quant_sum_dmg[10]))
We can now properly see the events with the highest economic consequences:
g_dmg <- ggplot(most_dmg, aes(EVTYPE, sum / 1e6))
g_dmg + geom_bar(stat = "identity", aes(fill = EVTYPE)) +
  ggtitle("Economic damage per event type (top decile)") +
  labs(y = "Cost (in millions of USD)", fill = "Event type") +
  theme(axis.title.x = element_blank(),
        axis.text.x = element_blank(),
        axis.ticks.x = element_blank(),
        plot.title = element_text(hjust = 0.5))
We can see that floods have, by far, the highest economic consequences, followed by hurricanes/typhoons, storm surges and, once again, tornadoes.
These events also had dire human consequences.
They are therefore the events on which the government should focus most, with regard to both economic and human safety.