Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events result in fatalities, injuries, and crop and property damage, and preventing such outcomes to the extent possible is a key concern. In this report we analyse which of these events are the most harmful.
The analysis shows that tornadoes are the most harmful weather events with respect to population health, whether measured by injuries or by fatalities. Floods cause the greatest economic damage overall because they cause the most property damage, while drought is the most harmful event for crops.
Loading the packages that we are going to use for this analysis:
knitr::opts_chunk$set(echo = TRUE)
library(ggplot2)
library(data.table)
library(plyr)
library(dplyr)
library(lattice)
library(knitr)
We download the source data file from the URL given in the code below. The data come as a comma-separated-value file compressed with the bzip2 algorithm and cover storm and weather events in the United States between 1950 and 2011. Documentation of the database is available here. We download the file if it is not already present, read the data, and record the session info:
if (!file.exists("StormData.csv.bz2")) {
  fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  download.file(fileUrl, destfile = "StormData.csv.bz2")
}
storm <- read.csv("StormData.csv.bz2")
sInfo <- sessionInfo()
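As a side note on why no explicit decompression step appears above: `read.csv()` opens the file through a connection that recognises bzip2 compression. The following is a minimal sketch on a toy file (the data frame `toy` and the temporary path are made up for illustration, not part of the storm data):

```r
# Write a tiny CSV through a bzip2 connection, then read it back directly.
tmp <- tempfile(fileext = ".csv.bz2")
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0))

con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() is given the compressed file name as-is.
back <- read.csv(tmp)
print(back)
```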
We get the structure of the dataset.
str(storm)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Let’s convert the data to a data.table.
stormdt <- as.data.table(storm)
We now get a list of column names to create a subset of data that we are going to use for the analysis.
names(stormdt)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
This analysis asks two questions: which types of weather events are most harmful with respect to population health, and which have the greatest economic consequences. We therefore only need the event type and the variables related to health and economic impact, so the following columns are selected:
EVTYPE
The type of weather event. Different event types may have different impacts on population health or the economy.
FATALITIES and INJURIES
Fatalities and injuries estimated for the event. These values are used to estimate the weather events' impact on population health.
PROPDMG and CROPDMG, PROPDMGEXP and CROPDMGEXP
Property and crop damage estimated for the event, together with their magnitude codes (for example K, M, B). These values are used to estimate the weather events' impact on the economy.
We create a subset of the data and get a summary of those variables.
stormSubset <- select(stormdt, c(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
summary(stormSubset)
## EVTYPE FATALITIES INJURIES PROPDMG
## Length:902297 Min. : 0.0000 Min. : 0.0000 Min. : 0.00
## Class :character 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.00
## Mode :character Median : 0.0000 Median : 0.0000 Median : 0.00
## Mean : 0.0168 Mean : 0.1557 Mean : 12.06
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.50
## Max. :583.0000 Max. :1700.0000 Max. :5000.00
## PROPDMGEXP CROPDMG CROPDMGEXP
## Length:902297 Min. : 0.000 Length:902297
## Class :character 1st Qu.: 0.000 Class :character
## Mode :character Median : 0.000 Mode :character
## Mean : 1.527
## 3rd Qu.: 0.000
## Max. :990.000
We can see that the median is zero for all numeric variables, and even the 3rd quartile is 0 or close to 0, so most recorded events caused no harm at all. We therefore keep only the events that caused harm to either the economy or population health.
FinalStorm <- subset(stormSubset, INJURIES > 0 | FATALITIES > 0 | PROPDMG > 0 | CROPDMG > 0)
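To make the effect of this filter concrete, here is a minimal sketch on a made-up four-row table (`toy` and `harmful` are hypothetical names, and the values are invented). A row is kept if any one of the four measures is positive:

```r
# Toy data: two events with no recorded harm, two with some harm.
toy <- data.frame(
  EVTYPE     = c("HAIL", "TORNADO", "FLOOD", "HIGH WIND"),
  FATALITIES = c(0, 1, 0, 0),
  INJURIES   = c(0, 14, 0, 0),
  PROPDMG    = c(0, 25, 2.5, 0),
  CROPDMG    = c(0, 0, 0, 0)
)

# Same condition as in the report: keep rows where any measure is positive.
harmful <- subset(toy, INJURIES > 0 | FATALITIES > 0 | PROPDMG > 0 | CROPDMG > 0)
print(harmful)
```

Only the TORNADO and FLOOD rows survive; the FLOOD row is kept on the strength of property damage alone.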
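The PROPDMGEXP and CROPDMGEXP columns hold magnitude codes such as "-", "+", "H" and "K" rather than numbers, so we need to convert them to numeric multipliers. We then multiply each damage value by its multiplier and divide by 10^9 to express the totals in billions.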
unique(FinalStorm$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "4" "h" "2" "7" "3" "H" "-"
unique(FinalStorm$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k"
FinalStorm$PROPDMGEXP <- mapvalues(FinalStorm$PROPDMGEXP, from = c("K", "M","", "B", "m", "+", "0", "5", "6", "4", "2", "3", "h", "7", "H", "-"), to = c(10^3, 10^6, 1, 10^9, 10^6, 0,1,10^5, 10^6, 10^4, 10^2, 10^3, 10^2, 10^7, 10^2, 0))
FinalStorm$PROPDMGEXP <- as.numeric(as.character(FinalStorm$PROPDMGEXP))
FinalStorm$CROPDMGEXP <- mapvalues(FinalStorm$CROPDMGEXP, from = c("","M", "K", "m", "B", "?", "0", "k"), to = c(1,10^6, 10^3, 10^6, 10^9, 0, 1, 10^3))
FinalStorm$CROPDMGEXP <- as.numeric(as.character(FinalStorm$CROPDMGEXP))
FinalStorm$PROPDMGTOT <- (FinalStorm$PROPDMG * FinalStorm$PROPDMGEXP)/10^9
FinalStorm$CROPDMGTOT <- (FinalStorm$CROPDMG * FinalStorm$CROPDMGEXP)/10^9
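The same conversion can be sketched with a named lookup vector in base R. This is an illustration on invented values, not a replacement for the code above: `exp_lookup` is a hypothetical name, and it covers only the letter codes (the empty string cannot be used as a lookup name, so blank codes would need separate handling).

```r
# Hypothetical lookup table for the letter codes only.
exp_lookup <- c(K = 1e3, k = 1e3, M = 1e6, m = 1e6, B = 1e9, H = 1e2, h = 1e2)

toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1.2, 3),
  PROPDMGEXP = c("K", "M", "B", "h"),
  stringsAsFactors = FALSE
)

# Look up each code, then express the damage in billions, as in the report.
multiplier <- unname(exp_lookup[toy$PROPDMGEXP])
toy$PROPDMGTOT <- toy$PROPDMG * multiplier / 1e9
print(toy)
```

For instance, 25 with code "K" becomes 25 * 10^3 / 10^9 = 0.000025 billion, and 1.2 with code "B" stays 1.2 billion.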
We calculate the total number of fatalities and injuries per event type, as well as the total damage to property and crops. We then melt each data.table into long format for easier plotting.
TotalHealth <- FinalStorm[, .(fatalities = sum(FATALITIES), injuries = sum(INJURIES), total = sum(FATALITIES) + sum(INJURIES)), by = .(EVTYPE)][order(-total)]
TotHealth <- as.data.frame(melt(TotalHealth, id.vars="EVTYPE", variable.name = "damage"))
TotalEconomy <- FinalStorm[, .(Total_Property_Damage = sum(PROPDMGTOT), Total_Crop_Damage = sum(CROPDMGTOT), total = sum(PROPDMGTOT) + sum(CROPDMGTOT)), by = .(EVTYPE)][order(-total)]
TotEconomy <- as.data.frame(melt(TotalEconomy, id.vars="EVTYPE", variable.name = "damage"))
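The aggregate-then-melt step may be easier to follow on a small example. The sketch below uses invented numbers and hypothetical object names (`toy`, `totals`, `long`); it assumes the data.table package is installed.

```r
library(data.table)

# Toy events: two tornado records, one flood, one heat event.
toy <- data.table(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(1, 2, 0, 4),
  INJURIES   = c(14, 6, 3, 0)
)

# One row per event type, sorted by combined harm.
totals <- toy[, .(fatalities = sum(FATALITIES), injuries = sum(INJURIES)), by = .(EVTYPE)]
totals[, total := fatalities + injuries]
totals <- totals[order(-total)]

# Long format: one row per (event type, measure) pair.
long <- melt(totals, id.vars = "EVTYPE", variable.name = "damage")
print(long)
```

The wide table has three rows (one per event type); the long table has nine, which is the shape `facet_wrap(~ damage)` expects.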
Next we keep only the top 5 most harmful event types for each measure:
TH<-TotHealth %>% group_by(damage) %>% top_n(5,value) %>% arrange(damage, -value)
TE<-TotEconomy %>% group_by(damage) %>% top_n(5,value) %>% arrange(damage, -value)
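For readers on a recent dplyr, the same per-group selection can be written with `slice_max()`, which the dplyr documentation recommends over `top_n()`. A minimal sketch on invented values, assuming dplyr 1.0.0 or later (`toy` and `top2` are hypothetical names):

```r
library(dplyr)

toy <- data.frame(
  damage = rep(c("fatalities", "injuries"), each = 3),
  EVTYPE = rep(c("TORNADO", "HEAT", "FLOOD"), times = 2),
  value  = c(3, 4, 0, 20, 0, 3)
)

# Keep the 2 largest values within each measure.
top2 <- toy %>%
  group_by(damage) %>%
  slice_max(value, n = 2) %>%
  ungroup()
print(top2)
```

Like `top_n()`, `slice_max()` keeps ties by default, so a group can return more than `n` rows when values are equal.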
We plot the number of fatalities, the number of injuries, and their sum to find the top 5 weather events that are most harmful to the US population.
ggplot(data = TH, aes(x=EVTYPE,value)) +
geom_bar(stat="identity", width = 0.5) +
labs(title = "Most harmful events with respect to population health",
y = "Number of fatalities and injuries (log2 scale)", x = "Event") +
theme(legend.position="none") +
facet_wrap(~ damage) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
scale_y_continuous(trans='log2')
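In the plot above the bars appear in alphabetical order of event type. If a ranked display is preferred, one option is to reorder the x variable by value. This is only a sketch for a single panel with invented counts (`toy` and `p` are hypothetical names), and it assumes ggplot2 is installed; with facets the ordering would be shared across panels.

```r
library(ggplot2)

toy <- data.frame(
  EVTYPE = c("FLOOD", "TORNADO", "EXCESSIVE HEAT"),
  value  = c(5, 40, 12)
)

# reorder() turns EVTYPE into a factor whose levels are sorted by -value,
# so the tallest bar is drawn first.
p <- ggplot(toy, aes(x = reorder(EVTYPE, -value), y = value)) +
  geom_col(width = 0.5) +
  labs(x = "Event", y = "Count (toy values)")

ranked_levels <- levels(reorder(toy$EVTYPE, -toy$value))
print(ranked_levels)
```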
The most harmful event overall is the tornado, which ranks first for both injuries and fatalities. Excessive heat is the second most common cause of death, while thunderstorm wind is the second most common cause of injuries.
We plot the damage to crops, the damage to property, and the total damage to find the top 5 weather events that are most harmful to the US economy.
ggplot(data = TE, aes(x=EVTYPE,value)) +
geom_bar(stat="identity", width = 0.5) +
labs(title = "Weather events with greatest economic consequences",
y = "Damage to crops and property (billions of USD, log2 scale)", x = "Event") +
theme(legend.position="none") +
facet_wrap(~ damage) +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
scale_y_continuous(trans='log2')
The most harmful event overall is the flood, because it causes the most property damage. The most harmful weather event for crops, on the other hand, is drought.