In the following document we will explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. We will do this with the objective of answering the following 2 questions:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The document consists on 3 sections:
The dataset provided for the assignment has been downloaded an extract it. We will load it in it’s original, raw version now, and store it in a variable called ‘StormData’. To do the following, make sure to set your directory to the location of the file.
StormData <- read.csv("repdata-data-StormData.csv")
The first step for this analysis will be to load the
ggplot2 and dplyr packages, which will be
necessary for our analysis.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
We will now run head() on StormData to get a clue of
what we’re working with.
head(StormData)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
Now, let’s call colnames() on our dataset in order to
see what variables we’re working with.
colnames(StormData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
For this section, we will work on determining which events are the
most harmful. We will first focus on fatalities, since this is the worst
case of damage towards population health. Based on our use of
colnames(), this measurement corresponds to the variable
named ‘FATALITIES’.
In order to determine which events are the most harmful towards population health, we will analyze this part in two ways. We will identify: a) Which events have the highest total number of fatalities overall; and b) Which events have the highest total number of injuries overall.
For this task, we will extract the 2 columns we’re most interested in. That is “EVTYPE” for event type, and “FATALITIES”.
ev_fat <- select(StormData, EVTYPE, FATALITIES)
Now, we will group our data by EVTYPE and summarise the data by calculating the sum of fatalities by event.
total_fat <- group_by(ev_fat, EVTYPE) %>% summarise(total_fatalities = sum(FATALITIES))
Next, lets reorder the dataset in descending order of
‘total_fatalities’ and store it in the variable ‘most_fat’. We will use
the dplyr function arrange for this task, specifying the
argument ‘desc’.
most_fat <- arrange(total_fat, desc(total_fatalities))
We will analyze the results of this variable in the next section called ‘results’. For now, we will continue with the next criteria.
This section will essentially be the same as the previous section, but applied to the ‘INJURIES’ column, as follows:
ev_inj <- select(StormData, EVTYPE, INJURIES)
total_inj <- group_by(ev_inj, EVTYPE) %>% summarise(total_injuries = sum(INJURIES))
most_inj <- arrange(total_inj, desc(total_injuries))
To make this analysis possible, we will begin by extracting the columns of interest. We will need “EVTYPE” once more, which indicates the event type. We will also need “PROPDMG” and “CROPDMG” for property damage and crop damage respectively. We will also need “PROPDMGEXP” and “CROPDMGEXP” for property damage exponent and crop damage exponent. This columns will indicate the exponent at which the damage columns are elevated to. Let’s do that now.
dmg <- select(StormData, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
Given the damage values have different exponents, we will first make
them all into the same exponent. We will choose ‘M’, or millions of
dollars. We now run into a problem. The dataset documentation states the
meaning of K, M, B as thousands, millions and billions respectively.
Let’s now run unique() on dmg$PROPDMGEXP to understand the
problem.
unique(dmg$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
We can see the familiar K, M & B. However, we also observe a bunch of different entry types, most of them being integers. Since no documentation is provided that specifies the meaning of these, we will filter them out. We will only keep the entries from which ‘PROPDMGEXP’ and ‘CROPDMGEXP’ are equal to either K, M or B. We will also keep the observations where this value is empty, assuming this means the numeric value is to be taken as it is, with no additional exponent. We will also split them into two datasets, one for property damage and another for crop damage.
dmg_prop <- select(dmg, EVTYPE, PROPDMG, PROPDMGEXP)
dmg_crop <- select(dmg, EVTYPE, CROPDMG, CROPDMGEXP)
dmg_propf <- filter(dmg_prop, PROPDMGEXP %in% c("K","M","B", ""))
dmg_cropf <- filter(dmg_crop, CROPDMGEXP %in% c("K","M","B", ""))
After this, we ensure all of the exponents in the data are in either thousands, millions or billions of dollars, or simply no exponent at all. Now, we wish to express the damage quantities in millions of dollars. For this, we will multiply each entry corresponding to ‘B’ by a thousand, each “K” entry by 1e-3, and each empty entry by 1e-6. The ‘M’ entries will be left as they are since they are already expressed in millions of dollars.
dmg_propf <- mutate(dmg_propf, PROPDMGF = PROPDMG * case_when(
PROPDMGEXP == "" ~ 1e-6,
PROPDMGEXP == "K" ~ 1e-3,
PROPDMGEXP == "M" ~ 1,
PROPDMGEXP == "B" ~ 1e3
))
dmg_cropf <- mutate(dmg_cropf, CROPDMGF = CROPDMG * case_when(
CROPDMGEXP == "" ~ 1e-6,
CROPDMGEXP == "K" ~ 1e-3,
CROPDMGEXP == "M" ~ 1,
CROPDMGEXP == "B" ~ 1e3
))
Now, we will group by EVTYPE and sum the total damage in both cases.
total_dmg_prop <- group_by(dmg_propf, EVTYPE) %>% summarise(TOT_DMG = sum(PROPDMGF))
total_dmg_crop <- group_by(dmg_cropf, EVTYPE) %>% summarise(TOT_DMG = sum(CROPDMGF))
With the total damage sum by event type, we will merge both datasets into one, containing the total damage by event type, across both property and crop damage.
total_dmg <- merge(total_dmg_crop, total_dmg_prop, by = "EVTYPE")
total_dmg$TOT_DMG <- total_dmg$TOT_DMG.x + total_dmg$TOT_DMG.y
total_dmg <- select(total_dmg, EVTYPE, TOT_DMG)
We are now ready to analyze the results.
In the following section we present the results of our analysis of the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database.
After analizing the data, we found the following:
most_fat_sliced <- slice(most_fat, 1:10)
most_fat_sliced$EVTYPE <- factor(most_fat_sliced$EVTYPE, levels = unique(most_fat_sliced$EVTYPE))
ggplot(most_fat_sliced, aes(x = EVTYPE, y = total_fatalities)) +
geom_bar(stat = 'identity', fill = 'royalblue') +
labs(title = "Deadliest event types", x = "Event Type", y = "Total Fatalities") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
print(most_fat)
## # A tibble: 985 × 2
## EVTYPE total_fatalities
## <chr> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
## # ℹ 975 more rows
As we can tell from the chart, the 5 most deadly event types are, in descending order, tornadoes, excessive heat, flash flood, heat and lightning. Tornadoes alone account for over 5 thousand deaths.
most_inj_sliced <- slice(most_inj, 1:10)
most_inj_sliced$EVTYPE <- factor(most_inj_sliced$EVTYPE, levels = unique(most_inj_sliced$EVTYPE))
ggplot(most_inj_sliced, aes(x = EVTYPE, y = total_injuries)) +
geom_bar(stat = 'identity', fill = 'darkred') +
labs(title = "Event types by injuries", x = "Event Type", y = "Total Cost (in millions of dollars)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
print(most_inj)
## # A tibble: 985 × 2
## EVTYPE total_injuries
## <chr> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
## # ℹ 975 more rows
Once more, tornadoes take the first place, this time by an even larger margin, causing almost 100 thousand injuries. TSTM wind, floods, excessive heat and lighting occupied the second to fifth place, respectively. As a conclussion, while there are several event types that cause harm to the human population, such as flood, excessive heat and lightning, non cause as much human damage as tornadoes, which ocuppy a clear first place in both cases.
The following are the results as to the costliest event types:
tot_damage_arr <- arrange(total_dmg, desc(TOT_DMG))
total_dmg_sliced <- slice(tot_damage_arr, 1:10)
total_dmg_sliced$EVTYPE <- factor(total_dmg_sliced$EVTYPE, levels = unique(total_dmg_sliced$EVTYPE))
ggplot(total_dmg_sliced, aes(x = EVTYPE, y = TOT_DMG)) +
geom_bar(stat = 'identity', fill = 'steelblue') +
labs(title = "Costliest event types", x = "Event Type", y = "Total Cost (millions of dollars)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))
print(total_dmg_sliced)
## EVTYPE TOT_DMG
## 1 FLOOD 150319.678
## 2 HURRICANE/TYPHOON 71913.713
## 3 TORNADO 57340.614
## 4 STORM SURGE 43323.541
## 5 HAIL 18752.904
## 6 FLASH FLOOD 17562.129
## 7 DROUGHT 15018.672
## 8 HURRICANE 14610.229
## 9 RIVER FLOOD 10148.405
## 10 ICE STORM 8967.041
As we can observe, in monetary terms, the 5 costliest event types are flood, hurricane/typhoon, tornado, storm surge and hail. Floods alone accounted for over 150 billion dollars in damages. Hurricanes, tornadoes and storm surges where not too far behind, with total costs of ~72, ~57 and ~43 billion dollars respectively.