In this report, I address two questions about severe weather data obtained from the NOAA Storm Database. The goals were to identify the event type that caused the largest numbers of fatal and non-fatal injuries and to determine the event types that caused the greatest damage, in US dollars, to property and crops. Overall, high winds caused the largest reported damage in US dollars, whereas injuries due to weather events, both fatal and non-fatal, were mostly caused by tornadoes.
The table below gives an initial overview of the first seven columns of the storm dataset.
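## Load the packages used in this report
library(dplyr)
library(ggplot2)
library(huxtable)
library(cowplot)
## Download data
dir.create("./data", showWarnings = FALSE)
download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "./data/stormdata.csv.bz2", mode = "wb")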
## Import data in R
stormdata <- read.csv("./data/stormdata.csv.bz2")
## See first few lines of data
hux(head(stormdata[, c(1:7)])) %>% theme_article()
| STATE__ | BGN_DATE | BGN_TIME | TIME_ZONE | COUNTY | COUNTYNAME | STATE |
|---|---|---|---|---|---|---|
| 1 | 4/18/1950 0:00:00 | 0130 | CST | 97 | MOBILE | AL |
| 1 | 4/18/1950 0:00:00 | 0145 | CST | 3 | BALDWIN | AL |
| 1 | 2/20/1951 0:00:00 | 1600 | CST | 57 | FAYETTE | AL |
| 1 | 6/8/1951 0:00:00 | 0900 | CST | 89 | MADISON | AL |
| 1 | 11/15/1951 0:00:00 | 1500 | CST | 43 | CULLMAN | AL |
| 1 | 11/15/1951 0:00:00 | 2000 | CST | 77 | LAUDERDALE | AL |
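The import step above passes the `.bz2` file straight to `read.csv()`. The following is a minimal sketch of why that works, using a small made-up data frame (`toy` and the temporary file are illustrative assumptions, not part of the storm data): base R detects the bzip2 compression when it opens the file for reading.

```r
# Toy example (hypothetical data): write a small CSV through a bzip2 connection
tmp <- tempfile(fileext = ".csv.bz2")
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(3, 0))
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the compressed file directly; no manual unzip step is needed
toy_back <- read.csv(tmp)
print(toy_back)
```

For the full storm dataset the same call is considerably slower, because the whole file has to be decompressed and parsed.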
Now, we need to clean up the EVTYPE variable, as it
contains upper and lower case data, but also a lot of duplicates caused
by typos or differences in spelling.
## Clean up EVTYPE
stormdata$EVTYPE <- toupper(stormdata$EVTYPE) #uppercase
stormdata$EVTYPE <- gsub("[^A-Z ]", "", stormdata$EVTYPE) # remove everything except the letters A-Z and spaces
## Stormdata is large and not all columns are needed for the analysis.
stormdata <- stormdata[, -c(2:7, 9:22, 30:37)]
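To illustrate what the EVTYPE clean-up does, here is a small sketch on made-up labels (the vector `raw_types` is hypothetical). The first two steps mirror the code above. The last step, which trims and collapses spaces, is an optional extra that is not part of the original analysis.

```r
# Hypothetical raw labels with mixed case, symbols and stray spaces
raw_types <- c("Tstm Wind", "TSTM WIND", "High Wind (G40)", "  Coastal Flood")

# Same two steps as in the report: uppercase, then drop all but A-Z and spaces
clean_types <- toupper(raw_types)
clean_types <- gsub("[^A-Z ]", "", clean_types)

# Optional extra step (an assumption, not in the original code):
# collapse repeated spaces and trim leading/trailing whitespace
tidy_types <- trimws(gsub(" +", " ", clean_types))

print(clean_types)
print(unique(tidy_types))
```

Note that the regular expression keeps spaces, so labels that differ only in leading or repeated spaces stay distinct unless they are trimmed as well.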
The first research question concerns which weather event types are most
harmful to population health. EVTYPE provides information
on the weather event type, and population health is captured in the
variables FATALITIES and INJURIES. The
second research question concerns which event types cause the greatest
damage in US dollars. Damage by weather event type is registered as a number under
PROPDMG and CROPDMG, for damage to
property and crops respectively. The magnitude of each amount is given in
PROPDMGEXP and CROPDMGEXP.
# View unique values for PROPDMGEXP
unique(stormdata$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
# View unique values for CROPDMGEXP
unique(stormdata$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
In these cases, most likely h = 100, k = 1,000, m = 1,000,000 and b
= 1,000,000,000. All other values are treated as unknown or
missing. Therefore, I created a function
change_exp() that keeps only these letter codes and sets everything else to NA.
# Create a function to clean up the exponent codes of the damage columns.
change_exp <- function(x){
  v <- toupper(x)                 # change all to upper case
  v <- gsub("[^a-zA-Z]", "", v)   # keep only the letters
  v <- trimws(v, which = "both")  # trim all whitespace
  v[v == ""] <- NA                # change empty strings to NA
  return(v)
}
# apply function to propdmgexp and cropdmgexp
stormdata$PROPDMGEXP <- change_exp(stormdata$PROPDMGEXP)
stormdata$CROPDMGEXP <- change_exp(stormdata$CROPDMGEXP)
Now, we check that only the values that indicate the number of zeros are left:
# Check PROPDMGEXP
unique(stormdata$PROPDMGEXP)
## [1] "K" "M" NA "B" "H"
# Check CROPDMGEXP
unique(stormdata$CROPDMGEXP)
## [1] NA "M" "K" "B"
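As a quick sanity check of the behaviour described above, the sketch below applies a copy of `change_exp()` to a small made-up vector (`codes` is hypothetical and only mimics the kinds of values seen in PROPDMGEXP).

```r
# Copy of the report's helper so that this example is self-contained
change_exp <- function(x){
  v <- toupper(x)                 # change all to upper case
  v <- gsub("[^a-zA-Z]", "", v)   # keep only the letters
  v <- trimws(v, which = "both")  # trim all whitespace
  v[v == ""] <- NA                # change empty strings to NA
  return(v)
}

# Hypothetical exponent codes: letters, digits, symbols and an empty string
codes <- c("k", "M", "", "+", "5", "h", "?")
cleaned <- change_exp(codes)
print(cleaned)
```

Digits and symbols end up as NA because nothing is left of them once all non-letters are removed.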
Now, the damage needs to be converted to actual dollar amounts. As we do not know the extent of the damage for records whose exponent is NA, we assign NA to the damage amount as well.
# Property damage first:
stormdata$PROPDMG[is.na(stormdata$PROPDMGEXP)] <- NA
# Crop damage second:
stormdata$CROPDMG[is.na(stormdata$CROPDMGEXP)] <- NA
We can now calculate the amount of damage in US-dollars:
dollar_damage <- function(damage, exponent){
  if (is.na(exponent)) {v <- 0}   # damage is already NA here, so the product is NA
  else if (exponent == "H") {v <- 100}
  else if (exponent == "K") {v <- 1000}
  else if (exponent == "M") {v <- 1000000}
  else if (exponent == "B") {v <- 1000000000}
  return(v*damage)
}
# Apply the function to create new columns with dollars of damage to property
stormdata$PROPDMGDOLLARS <- mapply(dollar_damage, stormdata$PROPDMG, stormdata$PROPDMGEXP)
# Apply the function to create new columns with dollars of damage to crops
stormdata$CROPDMGDOLLARS <- mapply(dollar_damage, stormdata$CROPDMG, stormdata$CROPDMGEXP)
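The conversion to dollars can also be expressed as a vectorized lookup, which avoids calling a function once per row. The sketch below is an alternative under the mapping stated earlier (H, K, M, B), not the method used in this report. The names `multiplier`, `damage` and `exponent` and the toy values are assumptions for illustration.

```r
# Multipliers for the cleaned exponent codes
multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical damage figures and their exponent codes
damage   <- c(2.5, 10, 1, 3)
exponent <- c("K", "M", NA, "B")

# Indexing the named vector by code returns one multiplier per row;
# an NA code yields an NA multiplier, and therefore an NA dollar amount
dollars <- damage * unname(multiplier[exponent])
print(dollars)
```

On a dataset with several hundred thousand rows, a lookup like this is usually much faster than `mapply()`.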
Now we create an event-type damage data frame:
## Create a table that summarizes fatalities, injuries, property and crop damage
## per event type: 'ev_fi_dmg'
ev_fi_dmg <- stormdata %>% group_by(EVTYPE) %>%
  summarise(sum_fatalities = sum(FATALITIES),
            sum_injuries = sum(INJURIES),
            sum_dmg_property = sum(PROPDMGDOLLARS),
            sum_dmg_crops = sum(CROPDMGDOLLARS))
hux(head(ev_fi_dmg)) %>% theme_article()
| EVTYPE | sum_fatalities | sum_injuries | sum_dmg_property | sum_dmg_crops |
|---|---|---|---|---|
| | 0 | 0 | 500 | |
| HIGH SURF ADVISORY | 0 | 0 | 2e+04 | |
| COASTAL FLOOD | 0 | 0 | | |
| FLASH FLOOD | 0 | 0 | 5e+03 | |
| LIGHTNING | 0 | 0 | | |
| TSTM WIND | 0 | 0 | | |
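A likely reason for the empty cells in the table above is that `sum()` returns NA as soon as a group contains a single NA. The sketch below shows this on a made-up data frame (`toy` and its values are hypothetical) and contrasts it with `na.rm = TRUE`. Which of the two is appropriate depends on how missing damage figures should be interpreted.

```r
# Hypothetical dollar amounts, one of them missing
toy <- data.frame(
  EVTYPE = c("FLOOD", "FLOOD", "HAIL", "HAIL"),
  PROPDMGDOLLARS = c(5000, NA, 200, 300)
)

# Default: one NA makes the whole group total NA
strict <- tapply(toy$PROPDMGDOLLARS, toy$EVTYPE, sum)

# With na.rm = TRUE the missing value is skipped
lenient <- tapply(toy$PROPDMGDOLLARS, toy$EVTYPE, sum, na.rm = TRUE)

print(strict)
print(lenient)
```

The same argument can be passed inside `summarise()`, for example `sum(PROPDMGDOLLARS, na.rm = TRUE)`.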
To obtain an overview of all damages and injuries/fatalities, we take a look at the plots below:
## Plot fatalities and injuries
plot_1a <- ev_fi_dmg %>% arrange(desc(sum_fatalities)) %>% slice_head(n = 10) %>%
  ggplot(aes(x = reorder(EVTYPE, -sum_fatalities), y = sum_fatalities)) +
  geom_bar(stat = "identity") +
  xlab("Weather event type") +
  ylab("Total sum of fatal injuries") +
  theme(axis.text.x = element_text(angle = 90))
plot_1b <- ev_fi_dmg %>% arrange(desc(sum_injuries)) %>% slice_head(n = 10) %>%
  ggplot(aes(x = reorder(EVTYPE, -sum_injuries), y = sum_injuries)) +
  geom_bar(stat = "identity") +
  xlab("Weather event type") +
  ylab("Total sum of injuries") +
  theme(axis.text.x = element_text(angle = 90))
plot_1 <- plot_grid(plot_1a, plot_1b, labels = c("A", "B"))
plot_1
Here, we can see that tornadoes cause the largest number of both fatal and non-fatal injuries. Excessive heat is the second-largest cause of fatalities but only the fourth-largest cause of injuries.
Now, let us take a look at damage described in dollars to properties and crops.
## Plot property damage, crop damage and total damage
plot_2a <- ev_fi_dmg %>% arrange(desc(sum_dmg_property)) %>% slice_head(n = 10) %>%
  ggplot(aes(x = reorder(EVTYPE, -sum_dmg_property), y = sum_dmg_property)) +
  geom_bar(stat = "identity") +
  xlab("Weather event type") +
  ylab("Total of property damage") +
  theme(axis.text.x = element_text(angle = 90))
plot_2b <- ev_fi_dmg %>% arrange(desc(sum_dmg_crops)) %>% slice_head(n = 10) %>%
  ggplot(aes(x = reorder(EVTYPE, -sum_dmg_crops), y = sum_dmg_crops)) +
  geom_bar(stat = "identity") +
  xlab("Weather event type") +
  ylab("Total of crop damage") +
  theme(axis.text.x = element_text(angle = 90))
## Total damage per event type: property plus crops, ignoring NA
ev_fi_dmg$sum_total_dmg <- rowSums(ev_fi_dmg[, c(4:5)], na.rm = TRUE)
plot_2c <- ev_fi_dmg %>% arrange(desc(sum_total_dmg)) %>% slice_head(n = 10) %>%
  ggplot(aes(x = reorder(EVTYPE, -sum_total_dmg), y = sum_total_dmg)) +
  geom_bar(stat = "identity") +
  xlab("Weather event type") +
  ylab("Total of damage") +
  theme(axis.text.x = element_text(angle = 90))
plot_2 <- plot_grid(plot_2a, plot_2b, plot_2c, labels = c("A", "B", "C"), ncol = 3)
plot_2
In the second plot we can see that excessive snow is the largest cause of
property damage and that high but cold winds are the largest cause of crop damage.
Plot 2C shows that high winds lead to the largest overall damage in
US dollars, followed by excessive snow and flash flooding/floods.
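The total in plot 2C comes from `rowSums()` with `na.rm = TRUE`. The sketch below shows on made-up numbers (the data frame `toy` is hypothetical) how that call treats missing values, including the case where both damage columns are NA. It selects the columns by name rather than by position, which is a design choice of this example and not the original code.

```r
# Hypothetical per-event totals with missing values
toy <- data.frame(
  EVTYPE = c("A", "B", "C"),
  sum_dmg_property = c(100, NA, 50),
  sum_dmg_crops = c(NA, NA, 25)
)

# NA is skipped within each row; a row that is NA in both columns becomes 0
toy$sum_total_dmg <- rowSums(toy[, c("sum_dmg_property", "sum_dmg_crops")],
                             na.rm = TRUE)

# Rank the event types by total damage, largest first
ranked <- toy[order(toy$sum_total_dmg, decreasing = TRUE), ]
print(ranked)
```

A total of 0 produced this way means "no known damage", which is not the same as a recorded damage of zero dollars.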