Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
# Load packages
library(magrittr) # Pipe operator
library(dplyr) # Data manipulation
library(ggplot2) # plotting system using the grammar of graphics
library(tidyr)
# URL to download
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# Local data filename
dataFileZIP <- "./StormDataset.csv.bz2"
# Directory for the dataset
dirFile <- "./StormDataset"
# If not exist, download the dataset.zip,
if (file.exists(dataFileZIP) == FALSE) {
download.file(fileURL, destfile = dataFileZIP)
}
# Note: .csv.bz2 can be read directly with read.csv function
StormDataset <- read.csv("StormDataset.csv.bz2")
# Structure of the dataset imported
glimpse(StormDataset)
## Observations: 902,297
## Variables: 37
## $ STATE__ <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ BGN_DATE <fctr> 4/18/1950 0:00:00, 4/18/1950 0:00:00, 2/20/1951 0:...
## $ BGN_TIME <fctr> 0130, 0145, 1600, 0900, 1500, 2000, 0100, 0900, 20...
## $ TIME_ZONE <fctr> CST, CST, CST, CST, CST, CST, CST, CST, CST, CST, ...
## $ COUNTY <dbl> 97, 3, 57, 89, 43, 77, 9, 123, 125, 57, 43, 9, 73, ...
## $ COUNTYNAME <fctr> MOBILE, BALDWIN, FAYETTE, MADISON, CULLMAN, LAUDER...
## $ STATE <fctr> AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL...
## $ EVTYPE <fctr> TORNADO, TORNADO, TORNADO, TORNADO, TORNADO, TORNA...
## $ BGN_RANGE <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ BGN_AZI <fctr> , , , , , , , , , , , , , , , , , , , , , , , ,
## $ BGN_LOCATI <fctr> , , , , , , , , , , , , , , , , , , , , , , , ,
## $ END_DATE <fctr> , , , , , , , , , , , , , , , , , , , , , , , ,
## $ END_TIME <fctr> , , , , , , , , , , , , , , , , , , , , , , , ,
## $ COUNTY_END <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ COUNTYENDN <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ END_RANGE <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ END_AZI <fctr> , , , , , , , , , , , , , , , , , , , , , , , ,
## $ END_LOCATI <fctr> , , , , , , , , , , , , , , , , , , , , , , , ,
## $ LENGTH <dbl> 14.0, 2.0, 0.1, 0.0, 0.0, 1.5, 1.5, 0.0, 3.3, 2.3, ...
## $ WIDTH <dbl> 100, 150, 123, 100, 150, 177, 33, 33, 100, 100, 400...
## $ F <int> 3, 2, 2, 2, 2, 2, 2, 1, 3, 3, 1, 1, 3, 3, 3, 4, 1, ...
## $ MAG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ FATALITIES <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 4, 0, ...
## $ INJURIES <dbl> 15, 0, 2, 2, 2, 6, 1, 0, 14, 0, 3, 3, 26, 12, 6, 50...
## $ PROPDMG <dbl> 25.0, 2.5, 25.0, 2.5, 2.5, 2.5, 2.5, 2.5, 25.0, 25....
## $ PROPDMGEXP <fctr> K, K, K, K, K, K, K, K, K, K, M, M, K, K, K, K, K,...
## $ CROPDMG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ CROPDMGEXP <fctr> , , , , , , , , , , , , , , , , , , , , , , , ,
## $ WFO <fctr> , , , , , , , , , , , , , , , , , , , , , , , ,
## $ STATEOFFIC <fctr> , , , , , , , , , , , , , , , , , , , , , , , ,
## $ ZONENAMES <fctr> , , , , , , , , , , , , , , , , , , , , , , , ,
## $ LATITUDE <dbl> 3040, 3042, 3340, 3458, 3412, 3450, 3405, 3255, 333...
## $ LONGITUDE <dbl> 8812, 8755, 8742, 8626, 8642, 8748, 8631, 8558, 874...
## $ LATITUDE_E <dbl> 3051, 0, 0, 0, 0, 0, 0, 0, 3336, 3337, 3402, 3404, ...
## $ LONGITUDE_ <dbl> 8806, 0, 0, 0, 0, 0, 0, 0, 8738, 8737, 8644, 8640, ...
## $ REMARKS <fctr> , , , , , , , , , , , , , , , , , , , , , , , ,
## $ REFNUM <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
The data set import contains 902297 observations and 37 attributes. TO answer the question for this project, we limit our focus to the following variables:
# Selection of the relevant variables
DataSelection <- StormDataset %>%
select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
# Structure of the dataset
glimpse(DataSelection)
## Observations: 902,297
## Variables: 7
## $ EVTYPE <fctr> TORNADO, TORNADO, TORNADO, TORNADO, TORNADO, TORNA...
## $ FATALITIES <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 4, 0, ...
## $ INJURIES <dbl> 15, 0, 2, 2, 2, 6, 1, 0, 14, 0, 3, 3, 26, 12, 6, 50...
## $ PROPDMG <dbl> 25.0, 2.5, 25.0, 2.5, 2.5, 2.5, 2.5, 2.5, 25.0, 25....
## $ PROPDMGEXP <fctr> K, K, K, K, K, K, K, K, K, K, M, M, K, K, K, K, K,...
## $ CROPDMG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ CROPDMGEXP <fctr> , , , , , , , , , , , , , , , , , , , , , , , ,
We can rapidly see that there are a few issues that needs to be addressed to have a clean dataset for the analysis. It looks like observations are full of typo issues and inconsistent entry. For this project, we are going to assign event types to a main event categories. This enables the reduction of the number of event types.
# First cleaning on the event type
#####################################
# Upper case all event observations
DataSelection$EVTYPE <- as.factor(toupper(DataSelection$EVTYPE))
# Trim function to remove white spaces at the start and end of the event type
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
# Applying trim function
DataSelection$EVTYPE <- as.factor(trim(DataSelection$EVTYPE))
# Classification of the main event categories
DataSelection[grepl("HAIL", DataSelection$EVTYPE), "EVTYPE"] <- "HAIL"
DataSelection[grepl("HEAVY RAIN", DataSelection$EVTYPE), "EVTYPE"] <- "HEAVY RAIN"
DataSelection[grepl("HEAVY SNOW", DataSelection$EVTYPE), "EVTYPE"] <- "HEAVY SNOW"
DataSelection[grepl("HIGH WIND", DataSelection$EVTYPE), "EVTYPE"] <- "HIGH WIND"
DataSelection[grepl("HURRICANE", DataSelection$EVTYPE), "EVTYPE"] <- "HURRICANE"
DataSelection[grepl("THUNDERSTORM", DataSelection$EVTYPE), "EVTYPE"] <- "THUNDERSTORM"
DataSelection[grepl("TSTM", DataSelection$EVTYPE), "EVTYPE"] <- "THUNDERSTORM"
DataSelection[grepl("TROPICAL STORM", DataSelection$EVTYPE), "EVTYPE"] <- "TROPICAL STORM"
DataSelection[grepl("URBAN", DataSelection$EVTYPE), "EVTYPE"] <- "URBAN FLOOD"
DataSelection[grepl("WATERSPOUT", DataSelection$EVTYPE), "EVTYPE"] <- "WATERSPOUT"
DataSelection[grepl("WINTER STORM", DataSelection$EVTYPE), "EVTYPE"] <- "WINTER STORM"
DataSelection[grepl("SUMMARY", DataSelection$EVTYPE), "EVTYPE"] <- NA
DataSelection[grepl("LIGHTNING", DataSelection$EVTYPE), "EVTYPE"] <- "LIGHTNING"
DataSelection[grepl("LIGHTING", DataSelection$EVTYPE), "EVTYPE"] <- "LIGHTNING"
DataSelection[grepl("FLOOD", DataSelection$EVTYPE), "EVTYPE"] <- "FLOOD"
DataSelection[grepl("HEAT", DataSelection$EVTYPE), "EVTYPE"] <- "HEAT"
# Unused levels are removed from the dataset
DataSelection$EVTYPE <- droplevels(DataSelection$EVTYPE)
The following code reassign coding expression to a numeric value for the computation of the total damage value for each events. Values have been deducted from the dataset documentation. The manipulation affects the following variables: determine the appropriate size multiplier. - PROPDMGEXP: A letter code indicating the magnitude of the PROPDMG dollar amount {“K”,“M”,“B”} for “thousands”, “millions” and “billions” respectively. - CROPDMGEXP: A letter code indicating the magnitude of the CROPDMG dollar amount {“K”,“M”,“B”} for “thousands”, “millions” and “billions” respectively.
# Second clearing of the EXP value
####################################
# Assigning values for the property exponent data
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "K"] <- 1000
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "M"] <- 1e+06
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == ""] <- 1
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "B"] <- 1e+09
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "m"] <- 1e+06
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "0"] <- 1
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "5"] <- 1e+05
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "6"] <- 1e+06
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "4"] <- 10000
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "2"] <- 100
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "3"] <- 1000
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "h"] <- 100
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "7"] <- 1e+07
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "H"] <- 100
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "1"] <- 10
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "8"] <- 1e+08
# Assigning '0' to invalid exponent data
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "+"] <- 0
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "-"] <- 0
DataSelection$PROPEXP[DataSelection$PROPDMGEXP == "?"] <- 0
DataSelection$PROPEXP <- as.numeric(DataSelection$PROPEXP)
# Assigning values for the crop exponent data
DataSelection$CROPEXP[DataSelection$CROPDMGEXP == "M"] <- 1e+06
DataSelection$CROPEXP[DataSelection$CROPDMGEXP == "K"] <- 1000
DataSelection$CROPEXP[DataSelection$CROPDMGEXP == "m"] <- 1e+06
DataSelection$CROPEXP[DataSelection$CROPDMGEXP == "B"] <- 1e+09
DataSelection$CROPEXP[DataSelection$CROPDMGEXP == "0"] <- 1
DataSelection$CROPEXP[DataSelection$CROPDMGEXP == "k"] <- 1000
DataSelection$CROPEXP[DataSelection$CROPDMGEXP == "2"] <- 100
DataSelection$CROPEXP[DataSelection$CROPDMGEXP == ""] <- 1
# Assigning '0' to invalid exponent data
DataSelection$CROPEXP[DataSelection$CROPDMGEXP == "?"] <- 0
DataSelection$CROPEXP <- as.numeric(DataSelection$CROPEXP)
The final steps removes the NA values in the dataset and compute the following operation on the dataset: - calculate the total damage value for crop and properties damages - group by event types - remove zero values events
DataSelection <- na.omit(DataSelection)
# Final computation to get:
###################################
DataSelection <- DataSelection %>%
mutate(PROPVAL = PROPDMG * PROPEXP) %>%
mutate(CROPVAL = CROPDMG * CROPEXP) %>%
select(EVTYPE, FATALITIES, INJURIES, PROPVAL, CROPVAL) %>%
group_by(EVTYPE) %>%
summarise_all(sum) %>%
filter(FATALITIES > 0 & INJURIES > 0 & PROPVAL > 0 & CROPVAL > 0)
summary(DataSelection)
## EVTYPE FATALITIES INJURIES
## BLIZZARD : 1 Min. : 1.0 Min. : 1
## COLD/WIND CHILL : 1 1st Qu.: 12.5 1st Qu.: 26
## DRY MICROBURST : 1 Median : 66.0 Median : 280
## DUST STORM : 1 Mean : 395.0 Mean : 3923
## EXTREME COLD : 1 3rd Qu.: 131.0 3rd Qu.: 1333
## EXTREME COLD/WIND CHILL: 1 Max. :5633.0 Max. :91346
## (Other) :29
## PROPVAL CROPVAL
## Min. :1.500e+04 Min. :5.000e+03
## 1st Qu.:7.690e+06 1st Qu.:5.525e+05
## Median :3.246e+08 Median :2.744e+07
## Mean :1.216e+10 Mean :9.412e+08
## 3rd Qu.:5.461e+09 3rd Qu.:6.983e+08
## Max. :1.682e+11 Max. :1.239e+10
##
Across the United States, which types of events are most harmful with respect to population health?
To answer this question, we will utilize the top 10 events with the most fatalies and injuries.
# Preparing dataset for plotting
# Note: for the ordering of the data, we place fatalities first
(USeventPOP <- DataSelection %>%
select(EVTYPE, FATALITIES, INJURIES) %>%
gather(key = POPHEALTH, value = NUMBERPEOPLE, FATALITIES, INJURIES) %>%
arrange(desc(NUMBERPEOPLE)) %>%
head(n = 10))
## # A tibble: 10 × 3
## EVTYPE POPHEALTH NUMBERPEOPLE
## <fctr> <chr> <dbl>
## 1 TORNADO INJURIES 91346
## 2 THUNDERSTORM INJURIES 9448
## 3 HEAT INJURIES 9224
## 4 FLOOD INJURIES 8683
## 5 TORNADO FATALITIES 5633
## 6 LIGHTNING INJURIES 5231
## 7 HEAT FATALITIES 3138
## 8 ICE STORM INJURIES 1975
## 9 FLOOD FATALITIES 1552
## 10 HIGH WIND INJURIES 1522
# Plot the dataset
ggplot(USeventPOP, aes(x = reorder(EVTYPE, -NUMBERPEOPLE), y = NUMBERPEOPLE, fill = POPHEALTH)) +
geom_bar(stat = "identity", position = "stack") +
labs(x = "Event Type", y = "Number of People") +
ggtitle("Most Harmful Events across the U.S. with respect to Population Health") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The graph clearly shows that tornados are by far the most harmful event across the United States, both in number of fatalities and injuries. Following tornado events, heat-related events, thunderstorms and flood are the most harmful events.
Across the United States, which types of events have the greatest economic consequences?
# Preparing dataset for plotting
# Note: for the ordering of the data, we place fatalities first
(USeventDAMG <- DataSelection %>%
select(EVTYPE, PROPVAL, CROPVAL) %>%
gather(key = VALUETYPE, value = TOTALVALUE, PROPVAL, CROPVAL) %>%
arrange(desc(TOTALVALUE)) %>%
head(n = 10))
## # A tibble: 10 × 3
## EVTYPE VALUETYPE TOTALVALUE
## <fctr> <chr> <dbl>
## 1 FLOOD PROPVAL 168248038513
## 2 HURRICANE PROPVAL 84656180010
## 3 TORNADO PROPVAL 56947380617
## 4 STORM SURGE PROPVAL 43323536000
## 5 HAIL PROPVAL 17622991537
## 6 FLOOD CROPVAL 12388194200
## 7 THUNDERSTORM PROPVAL 11139326676
## 8 TROPICAL STORM PROPVAL 7714390550
## 9 WINTER STORM PROPVAL 6689497251
## 10 HIGH WIND PROPVAL 6157245003
# Plotting the graph
ggplot(USeventDAMG, aes(x = reorder(EVTYPE, -TOTALVALUE), y = TOTALVALUE, fill = VALUETYPE)) +
geom_bar(stat = "identity", position = "stack") +
labs(x = "Event Type", y = "Damaged Economic Value") +
ggtitle("Most Harmful Events across the U.S. with respect to Economic Value") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Flood represents the main harmful economic events across the United States. Water damages can be severe to properties in particular.
We can notice that the majority of the economic damages are related to properties for the top 10 type of events with the maximum damage value.
sessionInfo()
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 14393)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] tidyr_0.6.0 ggplot2_2.2.1 dplyr_0.5.0
## [4] magrittr_1.5 RDocumentation_0.7.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.8 knitr_1.15.1 devtools_1.12.0
## [4] munsell_0.4.3 githubinstall_0.2.1 colorspace_1.3-2
## [7] rjson_0.2.15 R6_2.2.0 plyr_1.8.4
## [10] stringr_1.1.0 httr_1.2.1 tools_3.3.2
## [13] grid_3.3.2 gtable_0.2.0 data.table_1.10.0
## [16] DBI_0.5-1 withr_1.0.2 htmltools_0.3.5
## [19] lazyeval_0.2.0 yaml_2.1.14 rprojroot_1.1
## [22] digest_0.6.10 assertthat_0.1 tibble_1.2
## [25] curl_2.3 memoise_1.0.0 evaluate_0.10
## [28] rmarkdown_1.3 labeling_0.3 stringi_1.1.2
## [31] scales_0.4.1 backports_1.0.4 jsonlite_1.1
## [34] proto_1.0.0