This analysis uses the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to explore how different event types affected health and economic conditions in the United States between 1950 and 2011. The analysis indicates that tornadoes are by far the largest source of injuries and fatalities, while flooding dominates property damage and drought is the primary source of crop damage.
To begin, we load the relevant R libraries and read in the storm data.
library(dplyr)
library(tidyverse)
library(ggplot2)
stormDF <- read.csv("repdata_data_StormData.csv.bz2")
Before starting any analysis, a general review of the data is conducted to determine the size of the data set and its basic structure. Additionally, we’ll verify that the column names are in a consistent format. The data set contains 37 columns (named in upper case, with underscores as separators) and 902297 observations.
dim(stormDF)
## [1] 902297 37
str(stormDF)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
colnames(stormDF)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
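The naming check described above can also be done programmatically rather than by eye. The following is a minimal sketch, assuming the expected convention is capital letters and underscores only; it uses a small subset of the printed names for illustration, and the variable names `cols` and `is_upper_underscore` are hypothetical.

```r
# A few of the column names printed above (subset, for illustration only)
cols <- c("STATE__", "BGN_DATE", "EVTYPE", "FATALITIES", "PROPDMGEXP", "LONGITUDE_")

# TRUE for each name made up solely of capital letters and underscores
is_upper_underscore <- grepl("^[A-Z_]+$", cols)

# TRUE only if every name follows the convention
all(is_upper_underscore)
```

With the full data set loaded, the same test could be applied to `colnames(stormDF)` in place of `cols`.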
We’ll begin the actual analysis by looking at the direct physical harm inflicted upon individuals by each storm type, focusing on fatalities, injuries, and the combined casualties (fatalities plus injuries). Given the large number of event types, this analysis will focus on the ten most damaging event types.
healthRisk <- stormDF %>% select(EVTYPE, FATALITIES, INJURIES) %>%
mutate(CASUALITIES = FATALITIES + INJURIES)
# Keep only events with a known type and at least one injury or fatality
healthRisk <- healthRisk[healthRisk$EVTYPE != "?" & (healthRisk$INJURIES > 0 | healthRisk$FATALITIES > 0), ]
hList <- healthRisk %>%
group_by(EVTYPE) %>%
summarise(totFatal = sum(FATALITIES, na.rm = TRUE),
totInjury = sum(INJURIES, na.rm = TRUE),
totCasual = sum(CASUALITIES))
hList <- hList %>% arrange(desc(totCasual))
hL10 <- hList[1:10,]
print(hL10)
## # A tibble: 10 × 4
## EVTYPE totFatal totInjury totCasual
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
## 6 HEAT 937 2100 3037
## 7 FLASH FLOOD 978 1777 2755
## 8 ICE STORM 89 1975 2064
## 9 THUNDERSTORM WIND 133 1488 1621
## 10 WINTER STORM 206 1321 1527
hL10_long <- hL10 %>%
pivot_longer(cols = c(totFatal, totInjury,totCasual), names_to = "Variable", values_to = "Value")
ggplot(hL10_long, aes(x = reorder(EVTYPE, -Value), y = Value, fill = Variable)) +
geom_col(position = "dodge") +
theme_minimal() +
labs(title = "Top Ten Casualty Sources for Injury and Death", x = "Event Type", y = "Casualties", fill = "Columns") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
To summarize the effect of adverse weather on individuals, we plot
the casualties (along with the separate fatality and injury counts) for
the 10 highest sources. From the plot, we can see that tornadoes are
overwhelmingly the primary source of personal injuries and fatalities in
the US: they account for 96,979 combined casualties, more than eleven
times the 8,428 attributed to the next source, excessive heat.
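The size of the gap between tornadoes and the other sources can be checked directly from the printed table. This is a minimal sketch that re-enters the `totCasual` column by hand; the data frame `health` and the two derived variables are hypothetical names, and the share is computed only among the ten listed event types, not across the whole data set.

```r
# Combined casualties copied from the top-ten table printed above
health <- data.frame(
  EVTYPE = c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND", "FLOOD", "LIGHTNING",
             "HEAT", "FLASH FLOOD", "ICE STORM", "THUNDERSTORM WIND", "WINTER STORM"),
  totCasual = c(96979, 8428, 7461, 7259, 6046, 3037, 2755, 2064, 1621, 1527),
  stringsAsFactors = FALSE
)

# Tornado casualties as a fraction of the top-ten total
tornado_share <- health$totCasual[1] / sum(health$totCasual)

# Tornado casualties relative to the second-ranked source (excessive heat)
ratio_to_second <- health$totCasual[1] / health$totCasual[2]

round(c(share = tornado_share, ratio = ratio_to_second), 2)
```

Within these ten event types, tornadoes account for roughly 70% of the combined casualties.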
The next component of the analysis evaluates the financial cost of the storms, focusing on property and crop damage. Reviewing the original structure of the data indicates that the cost for each of these two categories is recorded in two columns: one giving the coefficient (significand) and the other its exponent (power of 10). The exponents need to be reviewed, converted to numeric multipliers, and combined with the coefficients to obtain the total cost for each event.
propertyRisk <- stormDF %>% select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
# Keep only events with a known type and some recorded property or crop damage
propertyRisk <- propertyRisk[propertyRisk$EVTYPE != "?" & (propertyRisk$PROPDMG > 0 | propertyRisk$CROPDMG > 0), ]
unique(propertyRisk$PROPDMGEXP)
## [1] "K" "M" "B" "m" "" "+" "0" "5" "6" "4" "h" "2" "7" "3" "H" "-"
unique(propertyRisk$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k"
Review of the exponent values indicates a non-standard system that mixes numeric powers of 10 with letters (for hundreds, thousands, millions, etc.). Given this, a direct substitution is executed to convert all of these codes to numeric multipliers for both property and crop values. Blank entries and symbols such as "+", "-", and "?" are treated as a multiplier of 1.
# PROPDMGEXP is a character column, so each numeric value assigned below is stored as text (10^0 becomes "1").
# The literal "1" code is therefore converted first, so that entries set to "1" afterwards are not rescaled to 10.
propertyRisk$PROPDMGEXP[(propertyRisk$PROPDMGEXP == "1")] <- 10^1
propertyRisk$PROPDMGEXP[(propertyRisk$PROPDMGEXP == "NA")] <- 10^0
propertyRisk$PROPDMGEXP[(propertyRisk$PROPDMGEXP == "")] <- 10^0
propertyRisk$PROPDMGEXP[(propertyRisk$PROPDMGEXP == "-")] <- 10^0
propertyRisk$PROPDMGEXP[(propertyRisk$PROPDMGEXP == "?")] <- 10^0
propertyRisk$PROPDMGEXP[(propertyRisk$PROPDMGEXP == "+")] <- 10^0
propertyRisk$PROPDMGEXP[(propertyRisk$PROPDMGEXP == "0")] <- 10^0
propertyRisk$PROPDMGEXP[(propertyRisk$PROPDMGEXP == "2")] <- 10^2
propertyRisk$PROPDMGEXP[(propertyRisk$PROPDMGEXP == "3")] <- 10^3
propertyRisk$PROPDMGEXP[(propertyRisk$PROPDMGEXP == "4")] <- 10^4
propertyRisk$PROPDMGEXP[(propertyRisk$PROPDMGEXP == "5")] <- 10^5
propertyRisk$PROPDMGEXP[(propertyRisk$PROPDMGEXP == "6")] <- 10^6
propertyRisk$PROPDMGEXP[(propertyRisk$PROPDMGEXP == "7")] <- 10^7
propertyRisk$PROPDMGEXP[(propertyRisk$PROPDMGEXP == "8")] <- 10^8
propertyRisk$PROPDMGEXP[(propertyRisk$PROPDMGEXP == "h")] <- 10^2 #hundred
propertyRisk$PROPDMGEXP[(propertyRisk$PROPDMGEXP == "H")] <- 10^2 #hundred
propertyRisk$PROPDMGEXP[(propertyRisk$PROPDMGEXP == "K")] <- 10^3 #thousand
propertyRisk$PROPDMGEXP[(propertyRisk$PROPDMGEXP == "m")] <- 10^6 #million
propertyRisk$PROPDMGEXP[(propertyRisk$PROPDMGEXP == "M")] <- 10^6 #million
propertyRisk$PROPDMGEXP[(propertyRisk$PROPDMGEXP == "B")] <- 10^9 #billion
propertyRisk$CROPDMGEXP[(propertyRisk$CROPDMGEXP == "")] <- 10^0
propertyRisk$CROPDMGEXP[(propertyRisk$CROPDMGEXP == "?")] <- 10^0
propertyRisk$CROPDMGEXP[(propertyRisk$CROPDMGEXP == "0")] <- 10^0
propertyRisk$CROPDMGEXP[(propertyRisk$CROPDMGEXP == "k")] <- 10^3 #thousand
propertyRisk$CROPDMGEXP[(propertyRisk$CROPDMGEXP == "K")] <- 10^3 #thousand
propertyRisk$CROPDMGEXP[(propertyRisk$CROPDMGEXP == "m")] <- 10^6 #million
propertyRisk$CROPDMGEXP[(propertyRisk$CROPDMGEXP == "M")] <- 10^6 #million
propertyRisk$CROPDMGEXP[(propertyRisk$CROPDMGEXP == "B")] <- 10^9 #billion
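The same mapping can also be expressed as a single lookup, which does not depend on the order in which the codes are replaced and returns numeric values directly. The following is a minimal sketch, assuming that any unrecognized code (blank, "+", "-", "?") should leave the value unscaled, as in the substitutions above; `exp_to_multiplier` is a hypothetical helper and is not part of the original analysis.

```r
# Hypothetical helper: convert exponent codes to numeric multipliers
exp_to_multiplier <- function(code) {
  code <- toupper(code)                        # treat "k"/"K", "m"/"M", "h"/"H" alike
  keys <- c("H", "K", "M", "B", as.character(0:8))
  vals <- c(1e2, 1e3, 1e6, 1e9, 10^(0:8))      # digit n maps to 10^n
  mult <- vals[match(code, keys)]              # NA for codes that are not in keys
  mult[is.na(mult)] <- 1                       # unrecognized codes: multiplier of 1
  mult
}

# Example with a mix of letter, digit, blank, and symbol codes
exp_to_multiplier(c("K", "m", "", "+", "5", "B"))
```

Because the result is already numeric, a later `as.numeric()` conversion would not be needed with this approach.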
The final step before analyzing the data is to combine the coefficient and its exponent into a single value for both property and crop damage.
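As a worked example of this step, the first two records in the `str()` output shown earlier have `PROPDMG` values of 25 and 2.5, both with a `PROPDMGEXP` of "K". The sketch below combines them by hand, using the thousands multiplier from the substitution above; the variable names are illustrative only.

```r
# Coefficients of the first two records (PROPDMG), both recorded with exponent "K"
coefficient <- c(25, 2.5)
multiplier  <- 10^3                      # "K" denotes thousands

# Damage in dollars is the coefficient scaled by the multiplier
damage_usd <- coefficient * multiplier
damage_usd
```

The two records therefore represent $25,000 and $2,500 of property damage.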
propertyRisk$PROPDMGEXP <- as.numeric(propertyRisk$PROPDMGEXP)
propertyRisk$CROPDMGEXP <- as.numeric(propertyRisk$CROPDMGEXP)
propertyRisk <- propertyRisk %>%
mutate(PROPERTY = PROPDMG * PROPDMGEXP,
CROP = CROPDMG * CROPDMGEXP,
CASUALITIES = PROPERTY + CROP) # total damage in dollars, despite the column name
As before, given the large number of event types, we will identify the 10 most significant sources of damage and focus the analysis on them.
pList <- propertyRisk %>%
group_by(EVTYPE) %>%
summarise(totProp = sum(PROPERTY, na.rm = TRUE),
totCrop = sum(CROP, na.rm = TRUE),
totalCasual = sum(CASUALITIES))
pList <- pList %>% arrange(desc(totalCasual))
pL10 <- pList[1:10,]
print(pL10)
## # A tibble: 10 × 4
## EVTYPE totProp totCrop totalCasual
## <chr> <dbl> <dbl> <dbl>
## 1 FLOOD 144657709870 5661968450 150319678320
## 2 HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 3 TORNADO 56947382445 414953270 57362335715
## 4 STORM SURGE 43323536000 5000 43323541000
## 5 HAIL 15735270147 3025954473 18761224620
## 6 FLASH FLOOD 16822678195 1421317100 18243995295
## 7 DROUGHT 1046106000 13972566000 15018672000
## 8 HURRICANE 11868319010 2741910000 14610229010
## 9 RIVER FLOOD 5118945500 5029459000 10148404500
## 10 ICE STORM 3944928310 5022113500 8967041810
pL10_long <- pL10 %>%
pivot_longer(cols = c(totProp, totCrop,totalCasual), names_to = "Variable", values_to = "Value")
ggplot(pL10_long, aes(x = reorder(EVTYPE, -Value), y = Value, fill = Variable)) +
geom_col(position = "dodge") +
theme_minimal() +
labs(title = "Top Ten Sources of Property and Crop Damage", x = "Event Type", y = "Damage (USD)", fill = "Columns") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
A review of the plot indicates that the largest source of financial
damage is not tornadoes, as it was for physical injury, but rather
flooding. Although flooding is the dominant source of financial loss,
several other sources also contribute substantially: hurricanes,
tornadoes, and storm surge. While flooding is a substantial source of
crop damage, it contributes less to crop losses than drought, which is
the primary source of crop damage.
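The leading source in each category can be read off the printed table programmatically. This is a minimal sketch that re-enters the `totProp` and `totCrop` columns by hand; the data frame `damage` and the two result variables are hypothetical names, and the comparison covers only the ten listed event types.

```r
# Property and crop damage totals copied from the top-ten table printed above
damage <- data.frame(
  EVTYPE = c("FLOOD", "HURRICANE/TYPHOON", "TORNADO", "STORM SURGE", "HAIL",
             "FLASH FLOOD", "DROUGHT", "HURRICANE", "RIVER FLOOD", "ICE STORM"),
  totProp = c(144657709870, 69305840000, 56947382445, 43323536000, 15735270147,
              16822678195, 1046106000, 11868319010, 5118945500, 3944928310),
  totCrop = c(5661968450, 2607872800, 414953270, 5000, 3025954473,
              1421317100, 13972566000, 2741910000, 5029459000, 5022113500),
  stringsAsFactors = FALSE
)

# Event type with the largest total in each column
top_property <- damage$EVTYPE[which.max(damage$totProp)]
top_crop     <- damage$EVTYPE[which.max(damage$totCrop)]

c(property = top_property, crop = top_crop)
```

Among the listed event types, this returns FLOOD for property damage and DROUGHT for crop damage.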
The analysis indicates that tornadoes are overwhelmingly the primary source of harm to the health of the population. While tornadoes do contribute to property damage, they are not nearly as influential a source as flooding, which outweighs the other substantial sources (hurricanes and storm surge). While all of these sources contribute to crop damage, drought is the primary source of crop damage in the US.