For this assignment we will look at extreme weather events (EVTYPE variable) in the NOAA Storm Database. The data are available for download here, while a description of the data can be found here.
We will focus on events that caused the most health damage (injuries and fatalities) as well as the most economic consequences (crop damage).
knitr::opts_chunk$set(echo=TRUE, cache=TRUE)
library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.26.0 (2024-01-24 05:12:50 UTC) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
## The following object is masked from 'package:R.methodsS3':
##
## throw
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
## The following objects are masked from 'package:base':
##
## attach, detach, load, save
## R.utils v2.12.3 (2023-11-18 01:00:02 UTC) successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
## The following object is masked from 'package:utils':
##
## timestamp
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, isOpen, nullfile, parse, use, warnings
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(ggplot2)
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if(! file.exists("download.bz2")){
download.file(url, "download.bz2")
}
if(! file.exists("download")){
bunzip2("download.bz2")
}
data <- fread("download")
names(data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
“FATALITIES” and “INJURIES” are the column names related to population health
# calculate fatalities and injuries by event type
fatalities_inj_evt <- data[, .(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES), total_adverse_event = sum(FATALITIES + INJURIES)), by = .(EVTYPE)]
# order by fatalities in decreasing order
fatalities_inj_evt <- fatalities_inj_evt[order(-FATALITIES),]
#take top ten to display
fatalities_inj_evt <- fatalities_inj_evt[1:10,]
#melt data.table in order to display counts by event type
fatalities_inj_evt_melt <- melt(fatalities_inj_evt, id.vars="EVTYPE", variable.name="adverseEvent")
healthPlot <- ggplot(fatalities_inj_evt_melt, aes(x=reorder(EVTYPE, -value), y=value)) + geom_col(aes(fill=adverseEvent), position="dodge") + labs(title = "Top 10 Adverse Weather Events in USA", x="Event Type", y="Frequency") + theme(axis.text.x = element_text(angle=90, hjust=1))
In paragraph 2.7 “Damage”, of the “National Weather Service Instruction” document, linked to above, we see that amounts are “rounded to 3 significant digits, followed by an alphabetical character signifying the magnitude of the number, e.g. 1.55B for $1,550,000,000”. Where B is billion, M million, and K thousand.
The amounts appear to be in the “CROPDMG” and “PROPDMG” columns, while the letters appear to be in the “CROPDMGEXP” and “PROPDMGEXP” columns. However, data in the “EXP” columns is not uniform, as shown by:
unique(data$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
unique(data$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
There are lower case as well as upper case letters as well as numbers (the latter are not mentioned in the documentation). We will use the following translation of numbers and letters to powers of ten (as the EXP part of the column name suggests):
propDmgTranslation <- c("\"\"" = 10^0,
"-" = 10^0,
"+" = 10^0,
"?" = 0^0,
"0" = 10^0,
"1" = 10^1,
"2" = 10^2,
"3" = 10^3,
"4" = 10^4,
"5" = 10^5,
"6" = 10^6,
"7" = 10^7,
"8" = 10^8,
"9" = 10^9,
"H" = 10^2,
"K" = 10^3,
"M" = 10^6,
"B" = 10^9)
cropDmgTranslation <- c("\"\"" = 10^0,
"?" = 10^0,
"0" = 10^0,
"2" = 10^2,
"K" = 10^3,
"M" = 10^6,
"B" = 10^9)
#make sure we only have upper case
toUppercols <- c("PROPDMGEXP", "CROPDMGEXP")
data[, (toUppercols) := c(lapply(.SD, toupper)), .SDcols = toUppercols]
#do actual translation
data[, PROPDMGEXP := propDmgTranslation[as.character(data[,PROPDMGEXP])]]
data[is.na(PROPDMGEXP), PROPDMGEXP := 10^0 ]
data[, CROPDMGEXP := cropDmgTranslation[as.character(data[,CROPDMGEXP])] ]
data[is.na(CROPDMGEXP), CROPDMGEXP := 10^0 ]
Then we can calculate the individual costs for property and crops:
data <- data[, .(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, propCost = PROPDMG * PROPDMGEXP, CROPDMG, CROPDMGEXP, cropCost = CROPDMG * CROPDMGEXP)]
Calculate combined properties and crop damage costs per event type (“EVTYPE”):
totalCostCombined <- data[, .(propCost = sum(propCost), cropCost = sum(cropCost), totalCostCombined = sum(propCost) + sum(cropCost)), by = .(EVTYPE)]
# order by totalCostCombined in descending order
totalCostCombined <- totalCostCombined[order(-totalCostCombined),]
# take first ten
totalCostCombined <- totalCostCombined[1:10,]
firstTen <- head(totalCostCombined)
We see that flooding events have to highest combined cost for both property and crop damage. In graph form:
totalCostCombinedMelt <- melt(firstTen, id.vars="EVTYPE", variable.name="costType")
economicalDamagePlot <- ggplot(totalCostCombinedMelt, aes(x=reorder(EVTYPE, -value), y=value)) + geom_col(aes(fill=costType), position="dodge") + labs(title = "Top 10 Weather Events in USA in Terms of Economical Damage", x="Event Type", y="Cost (USD)") + theme(axis.text.x = element_text(angle=90, hjust=1))
For events that were most harmful to population health we mostly looked at the “FATALITIES” and “INJURIES” columns in the data set, as detailed above. After summing up the fatalities and injuries by event type we come to the following top ten most harmful events:
## EVTYPE FATALITIES INJURIES total_adverse_event
## <char> <num> <num> <num>
## 1: TORNADO 5633 91346 96979
## 2: EXCESSIVE HEAT 1903 6525 8428
## 3: FLASH FLOOD 978 1777 2755
## 4: HEAT 937 2100 3037
## 5: LIGHTNING 816 5230 6046
## 6: TSTM WIND 504 6957 7461
## 7: FLOOD 470 6789 7259
## 8: RIP CURRENT 368 232 600
## 9: HIGH WIND 248 1137 1385
## 10: AVALANCHE 224 170 394
Which we can visualise in graph form:
In this plot “total_adverse_event” means the sum of FATALITIES and INJURIES.We conclude that TORNADOs are the most harmful events in terms of fatalities and injuries.
For events that cause the most economic damage we considered crop damage and property damage (CROPDMG and PROPDMG) as detailed above. After calculating total costs by event type we come to the following top ten of events that cause most economic damage:
## EVTYPE propCost cropCost totalCostCombined
## <char> <num> <num> <num>
## 1: FLOOD 144657709807 5661968450 150319678257
## 2: HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 3: TORNADO 56947380676 414953270 57362333946
## 4: STORM SURGE 43323536000 5000 43323541000
## 5: HAIL 15735267513 3025954473 18761221986
## 6: FLASH FLOOD 16822673978 1421317100 18243991078
This can be visulalised in a bar chart as follows:
In this graph, totalCostCombined means the cost of property damage and crop damage combined. We conclude that in terms of economic damage FLOODs are the most harmful event.