Exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database we want to know which types of event are most harmful to population health and which have the greatest economic consecuences. Studying which types of events cause a greater number of fatalities and injuries, we see that Tornado is the most harmful. And studying which cause more crop and property damage (in amount of dollars), we see that Flood have the greatest economic consecuences.
We load the NOAA storm database (instructions can be found here).
stormDataRaw <- read.csv("repdata_data_StormData.csv.bz2", header = TRUE)
We see how the dataset is, and the colum names.
dim(stormDataRaw)
## [1] 902297 37
head(stormDataRaw)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
colnames(stormDataRaw)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
Our goal is to answer two questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The stormDataRaw database has a lot of variables, so we are going to create a separate data frame with the variables that are important to answer these questions: type of event, number of fatalities, number of injuries, property damage (amount of dolars) and crop damage (amount of dollars).
library(dplyr)
stormData <- stormDataRaw %>%
select(REFNUM, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
head(stormData)
## REFNUM EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 1 TORNADO 0 15 25.0 K 0
## 2 2 TORNADO 0 0 2.5 K 0
## 3 3 TORNADO 0 2 25.0 K 0
## 4 4 TORNADO 0 2 2.5 K 0
## 5 5 TORNADO 0 2 2.5 K 0
## 6 6 TORNADO 0 6 2.5 K 0
According to the instructions, there are only 48 possible types of events. However, the EVTYPE variable has 985 levels (some in capital letters and some in lower case).
str(levels(stormData$EVTYPE), vec.len = 4)
## chr [1:985] "?" "ABNORMALLY DRY" "ABNORMALLY WET" "ABNORMAL WARMTH" ...
tail(levels(stormData$EVTYPE))
## [1] "WINTER WEATHER/MIX" "WINTERY MIX" "Wintry mix"
## [4] "Wintry Mix" "WINTRY MIX" "WND"
The first thing we can do to clean a bit the EVTYPE is transform all the entries to capital letters.
# all EVTYPE to uppercase and refactor
stormData$EVTYPE <- factor(toupper(stormData$EVTYPE))
str(levels(stormData$EVTYPE), vec.len = 4)
## chr [1:898] "?" "ABNORMALLY DRY" "ABNORMALLY WET" "ABNORMAL WARMTH" ...
To see which events are most harmful to population health we are interested in the number of fatalities (FATALITIES variable) and injuries (INJURIES variable). To reduce the number of types of events, we only keep those with non-zero values in both fatalities and injuries, then refactor and save into a new data frame.
# select just the EVTYPES with FATALITIES>0 or INJURIES>0
stormHealth <- stormData %>%
select(REFNUM, EVTYPE, FATALITIES, INJURIES) %>%
filter(FATALITIES > 0 | INJURIES > 0)
stormHealth$EVTYPE <- factor(stormHealth$EVTYPE)
str(levels(stormHealth$EVTYPE))
## chr [1:205] "AVALANCE" "AVALANCHE" "BLACK ICE" "BLIZZARD" ...
There are still some problems with the EVTYPE variable, but now there are just 205 levels. We know that the EVTYPE variable should be cleaned more (it has too many levels, some repeated), but maybe just cleaning a bit is enough to answer our questions.
So, we can try to rename the values of EVTYPE with regular expressions and see if the results are good enough.
# we understand that TSTM means THUNDERSTORM
stormData$EVTYPE[grep("^TSTM", stormData$EVTYPE)] <- "THUNDERSTORM WIND"
# levels that starts with THUNDER we treated as THUNDERSTORM WIND
stormData$EVTYPE[grep("THUNDER", stormData$EVTYPE)] <- "THUNDERSTORM WIND"
# levels that starts with HURRI are HURRICANE
stormData$EVTYPE[grep("HURRI", stormData$EVTYPE)] <- "HURRICANE"
# levels that starts with RIP are RIP CURRENT
stormData$EVTYPE[grep("^RIP", stormData$EVTYPE)] <- "RIP CURRENT"
# levels that starts with FLOOD we treated as FLOOD (not FLASH FOOD)
stormData$EVTYPE[grep("^FLOOD", stormData$EVTYPE)] <- "FLOOD"
# refactor EVTYPE
stormData$EVTYPE <- factor(stormData$EVTYPE)
With the EVTYPE a bit cleaned, we are going to see which type of events causes more fatalities and injuries (sum).
# grouping by EVTYPE and suming FATALITIES and INJURIES
stormHealth <- stormData %>%
group_by(EVTYPE) %>%
summarise_each(funs(sum), FATALITIES, INJURIES) %>%
arrange(desc(FATALITIES + INJURIES))
stormHealth <- data.frame(stormHealth)
head(stormHealth, 20)
## EVTYPE FATALITIES INJURIES
## 1 TORNADO 5633 91346
## 2 THUNDERSTORM WIND 722 9536
## 3 EXCESSIVE HEAT 1903 6525
## 4 FLOOD 495 6806
## 5 LIGHTNING 816 5230
## 6 HEAT 937 2100
## 7 FLASH FLOOD 978 1777
## 8 ICE STORM 89 1975
## 9 WINTER STORM 206 1321
## 10 HURRICANE 135 1328
## 11 HIGH WIND 248 1137
## 12 HAIL 15 1361
## 13 HEAVY SNOW 127 1021
## 14 RIP CURRENT 577 529
## 15 WILDFIRE 75 911
## 16 BLIZZARD 101 805
## 17 FOG 62 734
## 18 WILD/FOREST FIRE 12 545
## 19 HEAT WAVE 172 379
## 20 DUST STORM 22 440
We see that the most harmful weather-event is the tornado, followed by: thunderstorm wind, excessive heat, flood and lightning.
If we sum the fatalities and injuries of the rest of the table it can be changed by cleaning more the data, but it is difficult that it changes the top of the table (maybe in the order, but not all of them).
# fatalities and injuries of the tail of the table (rows 21:760)
stormHealth[21:dim(stormHealth)[1], ] %>%
summarise_each(funs(sum), FATALITIES, INJURIES) %>%
arrange(desc(FATALITIES, INJURIES))
## FATALITIES INJURIES
## 1 1820 4722
According to the instructions, the PROPDMGEXP and CROPDMGEXP is a character that indicates the magnitude of the damage: “K” for thousands, “M” for millions and “B” for billions. However, we see that there are more levels (some not in capital letters). So we transform all the levels to capital letters.
levels(stormData$CROPDMGEXP)
## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
levels(stormData$PROPDMGEXP)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
# all PROPDMGEXP to capital letters
stormData[stormData$PROPDMGEXP=="h",]$PROPDMGEXP <-"H"
stormData[stormData$PROPDMGEXP=="m",]$PROPDMGEXP <-"M"
# all CROPDMGEXP to capital letters
stormData[stormData$CROPDMGEXP=="m",]$CROPDMGEXP <-"M"
stormData[stormData$CROPDMGEXP=="k",]$CROPDMGEXP <-"K"
We do not know what these other character means (“-”, “?”, “0”, “1”, etc.) so we are going to omit them. Also, to reduce the data frame we only keep those files with some damage (in property or crop).
stormDamage <- stormData %>%
select(REFNUM, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
filter(PROPDMG + CROPDMG > 0, PROPDMGEXP %in% c("B", "H", "K", "M"),
CROPDMGEXP %in% c("B", "H", "K", "M"))
head(stormDamage)
## REFNUM EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 187566 HURRICANE 0.1 B 10 M
## 2 188271 THUNDERSTORM WIND 5.0 M 500 K
## 3 187568 HURRICANE 25.0 M 1 M
## 4 187570 HURRICANE 48.0 M 4 M
## 5 187571 HURRICANE 20.0 M 10 M
## 6 187640 THUNDERSTORM WIND 50.0 K 50 K
Now we want to sum all the amount of money of damage: crop and property by event type. So we first make a function to be able to multiply variables ...DMG with ...DMGEXP.
# function to convert character (billions, millions...) to numbers
mult = function(x){
ifelse(x=="B", 10^9, ifelse(x=="M", 10^6, ifelse(x=="K", 10^3, 100)))
}
stormTotalDamage <- stormDamage %>%
select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
mutate(totalDamage = PROPDMG*mult(PROPDMGEXP) + CROPDMG*mult(CROPDMGEXP)) %>%
group_by(EVTYPE) %>%
summarise(sumTotalDamage = sum(totalDamage)) %>%
arrange(desc(sumTotalDamage))
head(stormTotalDamage)
## Source: local data frame [6 x 2]
##
## EVTYPE sumTotalDamage
## 1 FLOOD 138235749000
## 2 HURRICANE 44313445800
## 3 TORNADO 16520148150
## 4 RIVER FLOOD 10108369000
## 5 HAIL 10020591590
## 6 FLASH FLOOD 8715295130
If we plot the data to see the comparison.
library(ggplot2)
qplot(EVTYPE, sumTotalDamage/10^9,
data=data.frame(stormTotalDamage)[1:10,], stat = "identity",
geom = "bar",
main = "Total damage (crop + property) by event type",
ylab = "Total amount of dollars \n of damage (in billions)",
xlab = "Type of event"
) + theme(axis.text.x = element_text(angle = 90))
Fig.cap: Plot that show the sum of crop and property damage (in billions) by type of weather event.
So we can say that omitting the strange character of the ...DMGEXP variables, Flood is the weather-event with greates economic consequences, followed by Hurricane and Tornado. To be sure in this conclussion we should study what that strange character means and clean better the data.
The most harmful type of event with respect to population health is Tornado, followed maybe by Thunderstorm Wind, Excessive Heat, Flood and Lightning (not in this order necessarily). However, to be secure one should clean better the dataset of NOAA.
The type of event that have the greatest economic consequences is Flood, followed by Hurricane and Tornado. We have omitted some kind of strange data, so the data should be study and clean also in this case to be sure.