In this report we aim to describe most impactful population health consequences and economic damages casualties made by storms and other weather events in the United States between 1950 and 2011. From the U.S. National Oceanic and Atmospheric Administration’s (NOAA) database we can access the storm data set in order to answer questions about severe weather events. Impactful casualty is defined here as the biggest amount of people touched or economic damages by one weather event. From this setup, for population health consequences it was one tornado with 1742 casualties which was the most harmful event. For economic damages it was one flood with 115 032 500 000$ of damage in total which was the most impactful event. However, two other definitions are explored in this report and results are greatly distinct from previous. One is the proportion of types of weather events impacting population health or economy. The other is the total amount of casualties through all types of events occurred.
## Overview of session
sessionInfo()
## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252
## [3] LC_MONETARY=French_France.1252 LC_NUMERIC=C
## [5] LC_TIME=French_France.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_4.0.2 magrittr_1.5 tools_4.0.2 htmltools_0.5.0
## [5] yaml_2.2.1 stringi_1.4.6 rmarkdown_2.3 knitr_1.29
## [9] stringr_1.4.0 xfun_0.16 digest_0.6.25 rlang_0.4.7
## [13] evaluate_0.14
## Rstudio version
rstudioapi::versionInfo()$version
## [1] '1.3.1073'
## Files in working directory
list.files(getwd())
## [1] "data" "FinalProject.Rproj" "README.html"
## [4] "README.md" "README.Rmd" "README_cache"
## [7] "README_files"
Download and store the data file :
## Create a folder to store data set
if (!file.exists("data")){
dir.create("data")
}
## Download the data set from url source
if (!file.exists("data/Storm Data.bz2")){
download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile = "data/Storm Data.bz2")
}
## Check downloading date and file
dateDownloaded <- date()
dateDownloaded
## [1] "Sun Sep 27 22:18:43 2020"
list.files("data")
## [1] "Storm Data.bz2"
Load the data file, “Storm Data.bz2”, previously downloaded :
storm <- read.csv("data/Storm Data.bz2")
Overview of the data set :
# Dimensions
dim(storm)
## [1] 902297 37
# Variables names
colnames(storm)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
# First rows
head(storm, n = 3)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
The present analysis address the following questions :
1. Across the United States, which types of events are most harmful with respect to population health ?
2. Across the United States, which types of events have the greatest economic consequences ?
Subset to get only the working variables :
library(dplyr)
health <- storm[, c("EVTYPE", "FATALITIES", "INJURIES")]
Add fatalities and injuries variables count to get total amount of population health consequences :
health <- health %>% mutate(Casualties = FATALITIES + INJURIES)
Overview of the casualties variable :
summary(health$Casualties)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1725 0.0000 1742.0000
# Proportion of harmless events without types distinction :
mean(health$Casualties == 0)
## [1] 0.9756965
# Summary of harmful events without types distinction :
summary(health$Casualties[health$Casualties != 0])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 7.099 4.000 1742.000
Now, the data is grouped by types of events and calculations made to get questioning results :
# Group by types of events
byEvtype <- health %>% group_by(EVTYPE)
# Aggregate by types of events with their related maximum number of casualties and proportion of events with at least one casualty
summaryzero <- summarize(byEvtype, maxCasu = max(Casualties),
totalCasu = sum(Casualties),
propHarmEvent = sum(Casualties != 0)/n(),
numEvent = n())
## `summarise()` ungrouping output (override with `.groups` argument)
Show top 10 of impactful events with the above processed data :
# Show the above result by descending order :
## 1. maximum number of casualties made by one event for each types of events
ansmax <- arrange(summaryzero[summaryzero$maxCasu !=0,],
desc(maxCasu), desc(totalCasu))
ansmax
## # A tibble: 220 x 5
## EVTYPE maxCasu totalCasu propHarmEvent numEvent
## <chr> <dbl> <dbl> <dbl> <int>
## 1 TORNADO 1742 96979 0.131 60652
## 2 ICE STORM 1569 2064 0.0474 2006
## 3 FLOOD 802 7259 0.0162 25326
## 4 HURRICANE/TYPHOON 787 1339 0.295 88
## 5 HEAT 583 3037 0.272 767
## 6 EXCESSIVE HEAT 521 8428 0.404 1678
## 7 BLIZZARD 390 906 0.0316 2719
## 8 HEAT WAVE 202 481 0.405 74
## 9 TROPICAL STORM 201 398 0.0609 690
## 10 HEAVY SNOW 185 1148 0.0120 15708
## # ... with 210 more rows
## 2. proportion of types of events with at least one casualty
ansprop <- arrange(summaryzero[summaryzero$propHarmEvent !=0,],
desc(propHarmEvent), desc(maxCasu))
ansprop
## # A tibble: 220 x 5
## EVTYPE maxCasu totalCasu propHarmEvent numEvent
## <chr> <dbl> <dbl> <dbl> <int>
## 1 Heat Wave 70 70 1 1
## 2 TROPICAL STORM GORDON 51 51 1 1
## 3 SNOW/HIGH WINDS 34 36 1 2
## 4 THUNDERSTORMW 27 27 1 1
## 5 TORNADOES, TSTM WIND, HAIL 25 25 1 1
## 6 HIGH WIND AND SEAS 23 23 1 1
## 7 HEAT WAVE DROUGHT 19 19 1 1
## 8 WINTER STORM HIGH WINDS 16 16 1 1
## 9 GLAZE/ICE STORM 15 15 1 1
## 10 COLD AND SNOW 14 14 1 1
## # ... with 210 more rows
## 3. total amount casualties for each types of events
anstot <- arrange(summaryzero[summaryzero$totalCasu !=0,],
desc(totalCasu), desc(maxCasu))
anstot
## # A tibble: 220 x 5
## EVTYPE maxCasu totalCasu propHarmEvent numEvent
## <chr> <dbl> <dbl> <dbl> <int>
## 1 TORNADO 1742 96979 0.131 60652
## 2 EXCESSIVE HEAT 521 8428 0.404 1678
## 3 TSTM WIND 60 7461 0.0133 219940
## 4 FLOOD 802 7259 0.0162 25326
## 5 LIGHTNING 51 6046 0.210 15754
## 6 HEAT 583 3037 0.272 767
## 7 FLASH FLOOD 159 2755 0.0172 54277
## 8 ICE STORM 1569 2064 0.0474 2006
## 9 THUNDERSTORM WIND 70 1621 0.00826 82563
## 10 WINTER STORM 170 1527 0.0199 11433
## # ... with 210 more rows
Since most of 10th impactful types of events are present both in summary tables of maximum and total number of casualties, let’s compare the two variables with the intersecting data.
For better comparison, a plot is made :
library(tidyr)
library(ggplot2)
library(RColorBrewer)
# Make tidy data set to make easier plotting :
interdata <- ansmax[1:10, c("EVTYPE", "maxCasu", "totalCasu")][ansmax$EVTYPE[1:10] %in% anstot$EVTYPE[1:10],]
colnames(interdata) <- c("TypesOfEvents", "Maximum number of casualties for one event", "Total number of casualties through all observed types of events")
tidyInterdata <- gather(interdata, key = countingMethod, value = count, -TypesOfEvents)
# Plot :
pal <- colorRampPalette(brewer.pal(3,"Set2"))
g <- ggplot(tidyInterdata, aes(TypesOfEvents, count))
g + geom_col(fill = c(pal(5),pal(5))) + facet_wrap(.~countingMethod, scales = "free") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title = "Crossing biggest impactful events types between maximum and total amount of casualties") +
labs(y = "Number of casualties")
The above results show different types of answer:
1. One is dealing with the maximum number of casualties, "maxCasu" variable. In that sense, one **tornado** event is the most harmful with the biggest event doing 1742 casualties.
2. The second is dealing with the propotion of types of events with at least one casualty, "popHarmEvent" variable. This result is skewed since most of the bigger proportion variable values are related to few events types occurences. By consequence, "popHarmEvent" top 10 have just one of each type of event observed. In that sense, **heat wave** is the most harmful event with 70 casualties in only one apparition in the data base. Also, the data labels could be miss leading since **heat wave drought** is another type of event and can be found in the same "popHarmEvent" top 10 results, on position 7 with 19 casualties. If needed, clustering labels could avoid to many distinction between similar labels.
3. Last one is dealing with the total amount of casualties through all occurred events in the database, "totalCasu" variable. In that sense, **tornado** is again the most harmful with a total of 96 979 casualties through 60 652 occurrences.
Since the analysis is focused on the addition of injuries and fatalities, a deeper analysis could find more precision by doing the same process with distinction between casualties. Other calculations could be made to define harmful casualties, like the total amount of casualties by events types or the related proportion.
This process analysis is mostly identical to the previous one.
Subset to get only the working variables :
econ <- storm[, c("EVTYPE", "PROPDMG", "CROPDMG", "PROPDMGEXP", "CROPDMGEXP")]
With related unit prefix variables get the explicit number of economic amount casualties :
# Types of prefix
table(econ$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5 6
## 465934 1 8 5 216 25 13 4 4 28 4
## 7 8 B h H K m M
## 5 1 40 1 6 424665 7 11330
table(econ$CROPDMGEXP)
##
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
# Replace prefix nomination as a related decimal unit multiplication variable
multipre <- function(prefix){
if (prefix == "" | prefix == 0){
1
} else if (prefix == "K" | prefix == "k") {
10^3
} else if (prefix == "M" | prefix == "m"){
10^6
} else if (prefix == "B" | prefix == "b"){
10^9
} else if (!is.na(as.numeric(prefix))){
as.numeric(prefix)
} else {
0
}
}
econ <- econ %>% mutate(PropMulti = sapply(PROPDMGEXP, multipre), CropMulti = sapply(CROPDMGEXP, multipre))
The above code is subjective to interpretation since some prefix in the data set are outside of the range specified by the Storm Data Documentation, which is quoted as :
> Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.
Add fatalities and injuries variables count to get total amount of population health consequences :
econ <- econ %>% mutate(Casualties = PROPDMG*PropMulti + CROPDMG*CropMulti)
Overview of the casualties variable :
summary(econ$Casualties)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00e+00 0.00e+00 0.00e+00 5.28e+05 1.00e+03 1.15e+11
# Proportion of harmless events :
mean(econ$Casualties == 0)
## [1] 0.7284508
# Without harmless events :
summary(econ$Casualties[econ$Casualties != 0])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 3.000e+03 1.000e+04 1.944e+06 5.000e+04 1.150e+11
Now, the data is grouped by types of events and
# Group by types of events
byEvtype <- econ %>% group_by(EVTYPE)
# Aggregate by types of events with their related maximum number of casualties and proportion of events with at least one casualty
summaryzero <- summarize(byEvtype, maxCasu = max(Casualties),
totalCasu = sum(Casualties),
propHarmEvent = sum(Casualties != 0)/n(),
numEvent = n())
## `summarise()` ungrouping output (override with `.groups` argument)
Show top 10 of impactful events with the above processed data :
# Show the above result by descending order :
## 1. maximum number of casualties made by one event for each types of events
ansmax <- arrange(summaryzero[summaryzero$maxCasu !=0,],
desc(maxCasu), desc(totalCasu))
ansmax
## # A tibble: 428 x 5
## EVTYPE maxCasu totalCasu propHarmEvent numEvent
## <chr> <dbl> <dbl> <dbl> <int>
## 1 FLOOD 115032500000 150319678257 0.397 25326
## 2 STORM SURGE 31300000000 43323541000 0.663 261
## 3 HURRICANE/TYPHOON 16930000000 71913712800 0.795 88
## 4 RIVER FLOOD 10000000000 10148404500 0.613 173
## 5 TROPICAL STORM 5150000000 8382236550 0.590 690
## 6 ICE STORM 5000500000 8967041360 0.333 2006
## 7 WINTER STORM 5000000000 6715441251 0.121 11433
## 8 STORM SURGE/TIDE 4000000000 4642038000 0.318 148
## 9 HURRICANE 3500000000 14610229010 0.707 174
## 10 TORNADO 2800000000 57352114398. 0.649 60652
## # ... with 418 more rows
## 2. proportion of types of events with at least one casualty
ansprop <- arrange(summaryzero[summaryzero$propHarmEvent !=0,],
desc(propHarmEvent), desc(maxCasu))
ansprop
## # A tibble: 428 x 5
## EVTYPE maxCasu totalCasu propHarmEvent numEvent
## <chr> <dbl> <dbl> <dbl> <int>
## 1 TORNADOES, TSTM WIND, HAIL 1602500000 1602500000 1 1
## 2 WILD FIRES 619000000 624100000 1 4
## 3 HAILSTORM 220000000 241000000 1 3
## 4 EXCESSIVE WETNESS 142000000 142000000 1 1
## 5 HURRICANE OPAL/HIGH WINDS 110000000 110000000 1 1
## 6 River Flooding 105580000 134175000 1 5
## 7 COLD AND WET CONDITIONS 66000000 66000000 1 1
## 8 WINTER STORM HIGH WINDS 65000000 65000000 1 1
## 9 HIGH WINDS/COLD 55000000 117500000 1 5
## 10 HURRICANE EMILY 50000000 50000000 1 1
## # ... with 418 more rows
## 3. total amount casualties for each types of events
anstot <- arrange(summaryzero[summaryzero$totalCasu !=0,],
desc(totalCasu), desc(maxCasu))
anstot
## # A tibble: 428 x 5
## EVTYPE maxCasu totalCasu propHarmEvent numEvent
## <chr> <dbl> <dbl> <dbl> <int>
## 1 FLOOD 115032500000 150319678257 0.397 25326
## 2 HURRICANE/TYPHOON 16930000000 71913712800 0.795 88
## 3 TORNADO 2800000000 57352114398. 0.649 60652
## 4 STORM SURGE 31300000000 43323541000 0.663 261
## 5 HAIL 1800000000 18758221636. 0.0900 288661
## 6 FLASH FLOOD 1000000000 17562129655. 0.381 54277
## 7 DROUGHT 1000000000 15018672000 0.107 2488
## 8 HURRICANE 3500000000 14610229010 0.707 174
## 9 RIVER FLOOD 10000000000 10148404500 0.613 173
## 10 ICE STORM 5000500000 8967041360 0.333 2006
## # ... with 418 more rows
Since most of 10th impactful types of events are present both in summary tables of maximum and total number of casualties, let’s compare the two variables with the intersecting data.
For better comparison, a plot is made :
library(tidyr)
library(ggplot2)
library(RColorBrewer)
# Make tidy data set to make easier plotting :
interdata <- ansmax[1:10, c("EVTYPE", "maxCasu", "totalCasu")][ansmax$EVTYPE[1:10] %in% anstot$EVTYPE[1:10],]
colnames(interdata) <- c("TypesOfEvents", "Maximum number of casualties for one event", "Total number of casualties through all observed types of events")
tidyInterdata <- gather(interdata, key = countingMethod, value = count, -TypesOfEvents)
# Plot :
pal <- colorRampPalette(brewer.pal(3,"Set2"))
g <- ggplot(tidyInterdata, aes(TypesOfEvents, count))
g + geom_col(fill = c(pal(7),pal(7))) + facet_wrap(.~countingMethod, scales = "free") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
labs(title = "Crossing biggest impactful events types between maximum and total amount of casualties") +
labs(y = "Number of casualties")
Same as the previous section, the above results show two different types of answer:
1. One is dealing with the maximum number of dollars of damage on property and crops, "maxCasu" variable. In that sense, **flood** events are the most impactful with the biggest event doing 115 032 500 000$ of damage.
2. The other is dealing with the propotion of types of events with at least one casualty, "popHarmEvent" variable. This result is skewed since most of the bigger proportion variable values are related to few events types occurences. By consequence, "popHarmEvent" top 10 have few type of event observed. In that sense, the aggreagte **tornadoes, tstm wind, hail** is the most harmful event(s) with 1 602 500 000$ of damage.
3. Last one is dealing with the total amount of economic damages through all occurred events in the database, "totalCasu" variable. In that sense, **flood** is again the most harmful with a total of 150 319 678 257$ of damage through 15 326 occurrences. It is interesting to note that 76% of this amount is due only to a single event, the one mentioned in point 1 above.
Since the analysis is focused on the addition of property and crops damages, a deeper analysis could find more precision by doing the same process with distinction between the two variables. Other calculations could be made to define harmful casualties, like the total amount of casualties by events types or the related proportion.