The basic goal is to explore the NOAA Storm Database and answer the following questions:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The main conclusions of this analysis are:
Tornados are by far the most harmful storm event in the US on population health over the periode 1996 - 2011. For more details see the analysis below.
Floods are by far the storm event in the US that have the greatest economic consequences. For more details see the analysis below.
The source data is originated from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather ServiceStorm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
In this section the processing of the source data is described, so it is reproducible:
The following libraries are used:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
First the data needs to be downloaded and processed for analysis. Set the working directory to /StormData and download the file to the subdirectory ‘/data’.
## First load the data
# checking and creating directories
if(!file.exists("data")){
dir.create("data")
}
# download files from internet
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
filename <- "./data/rep-data-StormData.csv.bz2"
download.file(fileUrl, destfile = filename, method="curl") # curl: for https
list.files("./data")
## [1] "rep-data-StormData.csv.bz2"
# read data
data <- read.csv(filename, header=TRUE, sep = ",") # comma-seperated file with header
First a global look at the various variables and data set:
# check and look at the data
names(data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
head(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
Let’s see what types of events (EVTYPE) exist?
events <- as.factor(unique(data$EVTYPE))
str(events)
## Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 856 244 201 629 429 657 972 409 786 ...
What variables are used to determine harmful for population health and economic disaster?
Harmful for population is determined by the number of direct fatalities and injuries. This is listed in the following columns:
FATALITIES
INJURIES
# FATALITIES
summary(data$FATALITIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0168 0.0000 583.0000
# INJURIES
summary(data$INJURIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1557 0.0000 1700.0000
What variables are used to determine the greatest economic consequences in the dataset?
According to the documentation of the dataset “Damage” is Flood related damage, crop related damage, other related damage and delayed damage. In the dataset only property damage (PROPDMG) and crop damage (CROPDMG) are listed.
With some assistance of the Discussion Forum, there is some important additional information about these columns: The ‘CROPDMGEXP’ is the exponent values for ‘CROPDMG’ (crop damage). In the same way, ‘PROPDMGEXP’ is the exponent values for ‘PROPDMG’ (property damage). You should use both to get the total values for crops and property damage. (B or b = Billion, M or m = Million, K or k = Thousand, H or h = Hundred). The number from one to ten represent the power of ten (10^The number). The symbols “-”, “+” and “?” refers to less than, greater than and low certainty. You have the option to ignore these three symbols altogether.
Therefore the following variables determine the economic consequences of the dataset:
PROPDMG
PROPDMGEXP
CROPDMG
CROPDMGEXP
Now let’s have a look at the various exponents in the dataset:
exponentPROP <- as.factor(unique(data$PROPDMGEXP))
exponentPROP
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
exponentCROP <- as.factor(unique(data$CROPDMGEXP))
exponentCROP
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
These values relate to the following exponents and need to be converted. The following named list can be used to change the exponents:
exponent <- c(" " = 0,
"-" = 0,
"?" = 0,
"+" = 1,
"0" = 10,
"1" = 10,
"2" = 10,
"3" = 10,
"4" = 10,
"5" = 10,
"6" = 10,
"7" = 10,
"8" = 10,
"h" = 100,
"H" = 100,
"K" = 1000,
"m" = 1000000,
"M" = 1000000,
"B" = 1000000000)
Now, the economic consequences can be calculated as:
Total damage = Total damange property + Total damage crop
Total damage = PROPDMG * PROPDMGEXP + CROPDMG * CROPDMGEXP
Now present the results:
Below are the 15 most harmfull storm events listed order by the total casualties over the period 1996 - 2011. The total casualties consists of the total number of fatalities and total number of injuries.
# which events have most impact on populaties Health (fatalities and injuries)?
PopHealth <- data %>%
select(EVTYPE, FATALITIES, INJURIES) %>%
group_by(EVTYPE) %>%
summarise(Fatalities = sum(FATALITIES), Injuries = sum(INJURIES),
Total = sum(FATALITIES) + sum(INJURIES)) %>%
arrange(desc(Total))
PopHealth <- PopHealth[1:15, ]
PopHealth
## # A tibble: 15 × 4
## EVTYPE Fatalities Injuries Total
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
## 6 HEAT 937 2100 3037
## 7 FLASH FLOOD 978 1777 2755
## 8 ICE STORM 89 1975 2064
## 9 THUNDERSTORM WIND 133 1488 1621
## 10 WINTER STORM 206 1321 1527
## 11 HIGH WIND 248 1137 1385
## 12 HAIL 15 1361 1376
## 13 HURRICANE/TYPHOON 64 1275 1339
## 14 HEAVY SNOW 127 1021 1148
## 15 WILDFIRE 75 911 986
Now product a barchart that summarizes the casualties per event type. In order to present the barchart for the fatalities, injuries and total casualties the data first need to be wrangled:
lPopHealth <- PopHealth %>%
pivot_longer(cols = c("Fatalities", "Injuries", "Total"),
names_to = "Impact", values_to = "Number")
lPopHealth$Impact <- as.factor(lPopHealth$Impact)
p <- ggplot(data = lPopHealth, aes(x= reorder(EVTYPE, -Number), y=Number , fill=Impact)) +
geom_bar(stat="identity", position="dodge") +
theme(axis.text.x = element_text(angle=60, hjust=1)) +
ggtitle("Most Harmful Storm Event Types in US on Population Health ('96-'11)") +
xlab("Storm Event Types") +
ylab("Number of casualties")
p
As stated above, the economic consequences can be calculated as:
Total damage = Total damange property + Total damage crop
Total damage = PROPDMG * PROPDMGEXP + CROPDMG * CROPDMGEXP
The top 15 storm events with the greatest economic consequences are:
# select Economic consequences
EcCons <- data %>%
select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
# change exponents
EcCons$PROPDMGEXP2 <- exponent[EcCons$PROPDMGEXP]
EcCons$CROPDMGEXP2 <- exponent[EcCons$CROPDMGEXP]
# replace NA values with 0
EcCons[is.na(EcCons)] <- 0
# calculate Economic Costs
EcCons$costs <- EcCons$PROPDMG* EcCons$PROPDMGEXP2 +
EcCons$CROPDMG* EcCons$CROPDMGEXP2
EcCons <- EcCons %>%
select(EVTYPE, costs) %>%
group_by(EVTYPE) %>%
summarise(totalcosts = sum(costs)) %>%
arrange(desc(totalcosts))
EcCons <- EcCons[1:15, ]
EcCons
## # A tibble: 15 × 2
## EVTYPE totalcosts
## <chr> <dbl>
## 1 FLOOD 150319678250
## 2 HURRICANE/TYPHOON 71913712800
## 3 TORNADO 57352117607
## 4 STORM SURGE 43323541000
## 5 HAIL 18757807527
## 6 FLASH FLOOD 17562132111
## 7 DROUGHT 15018672000
## 8 HURRICANE 14610229010
## 9 RIVER FLOOD 10148404500
## 10 ICE STORM 8967041810
## 11 TROPICAL STORM 8382236550
## 12 WINTER STORM 6715441260
## 13 HIGH WIND 5908617580
## 14 WILDFIRE 5060586800
## 15 TSTM WIND 5038936340
Finally, plot these results:
g <- ggplot(data = EcCons, aes(x= reorder(EVTYPE, -totalcosts), y =totalcosts/1000000 )) +
geom_bar(stat="identity", position="dodge", fill="green") +
theme(axis.text.x = element_text(angle=60, hjust=1)) +
ggtitle("15 Storm Event Types in US with highest Economic Costs ('96-'11)") +
xlab("Storm Event Types") +
ylab("Economic costs in Billion USD")
g