This exploration of the NOAA Storm Database addresses two questions: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health? And across the United States, which types of events have the greatest economic consequences?
First, I load the packages needed for the analysis.
library(dplyr)
library(ggplot2)
library(tidyr)
Secondly, I load the data.
StormData <- read.csv("repdata_data_StormData.csv")
To get an idea of the variables in this data set, I print the variable names.
names(StormData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
I will address the questions by looking at a subset of the variables above. First of all, I will include the “EVTYPE” variable which describes the type of storm event. The variables that I will use to measure the effect of storm events on population health are “FATALITIES” and “INJURIES”. Furthermore, the variables that I will use to measure the economic consequences of storm events are “PROPDMG”, “PROPDMGEXP”, “CROPDMG” and “CROPDMGEXP”. This results in the following data set.
StormData2 <- StormData %>% select("EVTYPE", "FATALITIES",
                                   "INJURIES", "PROPDMG",
                                   "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
The variables “PROPDMGEXP” and “CROPDMGEXP” contain capital letters that indicate the magnitude of “PROPDMG” and “CROPDMG” respectively: ‘K’ stands for thousands, ‘M’ for millions and ‘B’ for billions. For the purpose of this exploratory analysis, I treat every entry other than ‘K’, ‘M’ and ‘B’ as a multiplier of 1. I merge the damage and magnitude columns in three steps to simplify the data set for the analysis.
## 1. Transforming PROPDMGEXP and CROPDMGEXP
StormData2 <- StormData2 %>%
  mutate(PROPDMGEXP = ifelse(PROPDMGEXP == "K", 1000,
                      ifelse(PROPDMGEXP == "M", 10^6,
                      ifelse(PROPDMGEXP == "B", 10^9, 1))),
         CROPDMGEXP = ifelse(CROPDMGEXP == "K", 1000,
                      ifelse(CROPDMGEXP == "M", 10^6,
                      ifelse(CROPDMGEXP == "B", 10^9, 1))))
## 2. Merging the damage and magnitude columns
StormData2 <- StormData2 %>%
  mutate(PROPDMG = PROPDMG * PROPDMGEXP,
         CROPDMG = CROPDMG * CROPDMGEXP)
## 3. Removing the magnitude columns
StormData2 <- StormData2 %>% select(-c(PROPDMGEXP, CROPDMGEXP))
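As a minimal sketch of an alternative to the nested ifelse() calls, the same mapping can be written as a named lookup vector. The helper name exp_to_multiplier and the toy data frame are hypothetical and not part of the original analysis. Converting the codes to upper case first is an assumption that departs from the code above: it would also count lower-case codes such as ‘k’ or ‘m’, should they occur in the data, whereas the code above gives them a multiplier of 1.

```r
# Hypothetical helper: map magnitude codes to numeric multipliers.
exp_to_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  # Assumption: treat lower-case codes like their upper-case counterparts.
  multiplier <- lookup[toupper(as.character(code))]
  # Any code that is not K, M or B (including "") gets a multiplier of 1.
  multiplier[is.na(multiplier)] <- 1
  unname(multiplier)
}

# Toy data for illustration only (not taken from the Storm Database).
toy <- data.frame(PROPDMG = c(25, 2.5, 1, 10, 3),
                  PROPDMGEXP = c("K", "M", "B", "", "k"),
                  stringsAsFactors = FALSE)

# Merge the damage and magnitude columns, then drop the magnitude column.
toy$PROPDMG <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)
toy$PROPDMGEXP <- NULL
print(toy)
```

Whether lower-case codes should be counted this way is a judgment call; checking table(StormData$PROPDMGEXP) first would show how many records are affected.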
As a first step in the exploratory analysis of the simplified data
set StormData2, I explore which types of storm events occurred
most often.
head(sort(table(StormData2$EVTYPE), decreasing = TRUE),5)
##
##              HAIL         TSTM WIND THUNDERSTORM WIND           TORNADO
##            288661            219940             82563             60652
##       FLASH FLOOD
##             54277
As the sorted table above shows, HAIL is the storm event type that
occurred most often across the United States between April 1950 and
November 2011, followed by TSTM WIND (wind associated with
thunderstorms).
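One caveat about this ranking: TSTM WIND and THUNDERSTORM WIND look like two labels for the same phenomenon. The sketch below assumes the two labels can simply be merged and recomputes the ranking from the counts printed above. Under that assumption, thunderstorm wind (219,940 + 82,563 = 302,503 records) would outrank HAIL.

```r
# Counts copied from the sorted table above.
counts <- c("HAIL" = 288661, "TSTM WIND" = 219940,
            "THUNDERSTORM WIND" = 82563, "TORNADO" = 60652,
            "FLASH FLOOD" = 54277)

# Assumption: "TSTM WIND" is an abbreviation of "THUNDERSTORM WIND".
labels <- sub("^TSTM WIND$", "THUNDERSTORM WIND", names(counts))

# Sum the counts per merged label and sort in decreasing order.
merged <- sort(vapply(split(counts, labels), sum, numeric(1)),
               decreasing = TRUE)
print(merged)
```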
How often a storm event occurs gives an indication of its impact, but it does not fully answer the questions stated in the synopsis above. I therefore investigate which storm events are most harmful to population health and which had the greatest economic consequences. For population health, I sum the fatalities and injuries for each storm event type and visualize the 5 most harmful event types in a barplot. For economic consequences, I repeat this process with the total property damage and total crop damage instead.
The code below groups the StormData2 data set by EVTYPE and, for
each event type, sums the number of fatalities and the number of
injuries over all occurrences. The resulting data frame is sorted by
fatality count, because I believe the fatality count is the strongest
indicator of the severity of a storm event. I then take the 5 event
types with the highest fatality counts and, before making the barplot,
reshape the data into long format with the gather() function from the
tidyr package.
population_health_impact <- StormData2 %>%
  group_by(EVTYPE) %>%
  summarize(fatality_count = sum(FATALITIES),
            injury_count = sum(INJURIES)) %>%
  arrange(desc(fatality_count))
population_health_most_impact <- population_health_impact[1:5, ]
population_health_most_impact2 <- gather(population_health_most_impact,
                                         fatalities_or_injuries, count, -EVTYPE)
population_health_most_impact2 %>%
  ggplot(aes(x = reorder(EVTYPE, -count), y = count,
             fill = fatalities_or_injuries)) +
  geom_bar(stat = "identity") +
  labs(x = "Event Type", y = "Count", fill = "") +
  scale_fill_discrete(labels = c("Fatality Count", "Injury Count"))
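In current tidyr releases, gather() is superseded by pivot_longer(). As a hedged sketch, the reshaping step could look as follows. The data frame toy_impact and its counts are invented for illustration and are not results from the Storm Database.

```r
library(tidyr)

# Toy summary table; the numbers are made up for illustration.
toy_impact <- data.frame(EVTYPE = c("EVENT A", "EVENT B"),
                         fatality_count = c(10, 4),
                         injury_count = c(120, 35))

# Reshape from wide to long: one row per event type and measure.
toy_long <- pivot_longer(toy_impact,
                         cols = -EVTYPE,
                         names_to = "fatalities_or_injuries",
                         values_to = "count")
print(toy_long)
```

The rows come out in a different order than with gather(), which does not affect the barplot.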
For the economic consequences I use a similar approach. After
calculating the total property damage and the total crop damage, I
create another variable called TOTDMG, which is the sum of the two. I
then create another barplot of the 5 storm event types with the largest
total damage.
economic_impact <- StormData2 %>%
  group_by(EVTYPE) %>%
  summarize(TOTPROPDMG = sum(PROPDMG),
            TOTCROPDMG = sum(CROPDMG))
economic_impact <- economic_impact %>%
  mutate(TOTDMG = TOTPROPDMG + TOTCROPDMG) %>%
  arrange(desc(TOTDMG))
economic_most_impact <- economic_impact[1:5, ]
economic_most_impact %>%
  ggplot(aes(x = reorder(EVTYPE, -TOTDMG),
             y = TOTDMG / 10^9)) +
  geom_bar(stat = "identity") +
  labs(x = "Event Type", y = "Total Damage (in billions)")
The storm events that are most harmful to population health are
tornadoes, excessive heat, lightning, heat and flash floods. The storm
events with the greatest economic consequences are floods,
hurricanes/typhoons, tornadoes, storm surges and hail. Interestingly,
tornadoes appear in both lists, and so do flood events (flash floods in
the first list, floods in the second), indicating that these storm
events are very harmful to population health as well as to the economy.
Furthermore, as noted before, hail is the most frequently recorded
event type in the NOAA Storm Database, and it is the fifth-largest
contributor to economic damage in the United States. For further
research I would suggest looking at the EVTYPE column and finding a way
to group events with similar characteristics. For example, consider the
following grep search.
grep("WARM", unique(StormData$EVTYPE), value = TRUE)
## [1] "RECORD WARMTH" "UNSEASONABLY WARM"
## [3] "WARM DRY CONDITIONS" "UNSEASONABLY WARM AND DRY"
## [5] "RECORD WARM TEMPS." "ABNORMAL WARMTH"
## [7] "UNUSUAL WARMTH" "UNUSUAL/RECORD WARMTH"
## [9] "UNSEASONABLY WARM YEAR" "RECORD WARM"
## [11] "UNSEASONABLY WARM/WET" "UNUSUALLY WARM"
## [13] "WARM WEATHER" "UNSEASONABLY WARM & WET"
## [15] "PROLONG WARMTH" "VERY WARM"
Now consider this second grep search.
grep("HEAT", unique(StormData$EVTYPE), value = TRUE)
## [1] "HEAT" "EXTREME HEAT" "EXCESSIVE HEAT"
## [4] "RECORD HEAT" "HEAT WAVE" "DROUGHT/EXCESSIVE HEAT"
## [7] "RECORD HEAT WAVE" "RECORD/EXCESSIVE HEAT" "HEAT WAVES"
## [10] "HEAT WAVE DROUGHT" "HEAT/DROUGHT" "HEAT DROUGHT"
## [13] "EXCESSIVE HEAT/DROUGHT"
In conclusion, the two search terms “WARM” and “HEAT” alone already reveal many storm event types with similar characteristics. Finding a way to group such events could improve further analysis of the NOAA Storm Database.
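As a rough sketch of what such grouping could look like, the snippet below assigns a handful of the labels found above to a single category with a regular expression. The category name "HEAT (GROUPED)" and the choice of pattern are assumptions for illustration. Mixed labels such as "DROUGHT/EXCESSIVE HEAT" or "UNSEASONABLY WARM/WET" would need a more careful rule.

```r
# A few labels taken from the two grep searches above, plus one unrelated label.
event_types <- c("HEAT", "EXCESSIVE HEAT", "HEAT WAVE", "RECORD WARMTH",
                 "UNSEASONABLY WARM", "HAIL")

# Hypothetical rule: any label containing HEAT or WARM goes into one group;
# all other labels are kept unchanged.
event_group <- ifelse(grepl("HEAT|WARM", event_types),
                      "HEAT (GROUPED)", event_types)

print(table(event_group))
```

Applied to the full data set, a new column built this way could replace EVTYPE in the group_by() calls, so that fatalities, injuries and damage are summed per group rather than per raw label.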