##Synopsis
This analysis looks at the economic and personal damage caused by different weather events. The goal is to find the weather events which cause the most injuries and fatalities, and find the weather events which cause the most damage to property and crops. Tornados end up causing the most personal damage from 1994-2011, and floods end up causing the most economic damage over the same time frame. Data was provided by NOAA through the National Weather Service and National Climactic Data Center.
##Data Processing
The first step in our anaylsis is to clean the data. First, we’ll download the data file (if it hasn’t already been downloaded). Next, we’ll read the datafile into R using the fread function as this is a large set of data. Once the data is in R, we’ll remove all but 8 variables since that is all the variables we’ll need for the analysis.
library(data.table)
library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.7.1 (2016-02-15) successfully loaded. See ?R.methodsS3 for help.
## Registered S3 method overwritten by 'R.oo':
## method from
## throw.default R.methodsS3
## R.oo v1.22.0 (2018-04-21) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
## The following objects are masked from 'package:base':
##
## attach, detach, gc, load, save
## R.utils v2.9.0 successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
## The following object is masked from 'package:utils':
##
## timestamp
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, inherits, isOpen, nullfile,
## parse, warnings
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday,
## week, yday, year
## The following object is masked from 'package:base':
##
## date
#download the datafile
if(!file.exists("stormData.csv.bz2")) {
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, "stormData.csv.bz2")
}
#read the datafile and remove unneeded columns
data <- fread("stormData.csv.bz2")
data2 <- data %>% select(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG,
PROPDMGEXP, CROPDMG, CROPDMGEXP)
Now we’ll change the BGN_DATE variable into date format, filter out the occurence of monthly summaries in the EVTYPE variable, and create a new variable called people_harmed which is the total of the INJURIES and FATALITIES variables. People_harmed will be the base of our analysis for weather event harm to people.
#convert BGN_DATE to date
data2$BGN_DATE <- ymd(mdy_hms(data2$BGN_DATE))
#filter out summaries for EVTYPE
data2 <- data2 %>% filter(!grepl("^[Ss][Uu][Mm]", EVTYPE))
#creating a total people harmed variable
data2 <- data2 %>% mutate(people_harmed = FATALITIES + INJURIES)
Next we’ll tackle the tricky step of figuring out what the exponential variables for property damage and crop damage are. This wasn’t explicitly addressed in the NOAA documentation, but in the discussion forum for this class (Reproducible Research), there was a link to analysis done to figure out what the numbers and letters mean (link here: https://rstudio-pubs-static.s3.amazonaws.com/58957_37b6723ee52b455990e149edde45e5b6.html). Based on this analysis, we know h means hundreds, k means thousands, m means millions, b means billions, and any number means tens plus that number (in the ones place). We’ll use that key to convert PROPDMG and CROPDMG into two new variables with the correct numeric value.
#for property damage and crop damage variables, the exponential variable gives
#the multiplier (h = hundreds, k = thousands, m = millions, b = billions,
#any number = 10 + number). All other symbols will be disregarded and damage
#value will be used
data2 <- data2 %>% mutate(PROPDMGEXP =
ifelse(PROPDMGEXP == "h", "H",
ifelse(PROPDMGEXP == "k", "K",
ifelse(PROPDMGEXP == "m", "M",
ifelse(PROPDMGEXP == "b", "B", PROPDMGEXP)))))
data2 <- data2 %>% mutate(property =
ifelse(PROPDMGEXP == "H", PROPDMG * 100,
ifelse(PROPDMGEXP == "K", PROPDMG * 1000,
ifelse(PROPDMGEXP == "M", PROPDMG * 1000000,
ifelse(PROPDMGEXP == "B", PROPDMG * 1000000000,
ifelse(PROPDMGEXP == "0", PROPDMG * 10,
ifelse(PROPDMGEXP == "1", PROPDMG * 10 + 1,
ifelse(PROPDMGEXP == "2", PROPDMG * 10 + 2,
ifelse(PROPDMGEXP == "3", PROPDMG * 10 + 3,
ifelse(PROPDMGEXP == "4", PROPDMG * 10 + 4,
ifelse(PROPDMGEXP == "5", PROPDMG * 10 + 5,
ifelse(PROPDMGEXP == "6", PROPDMG * 10 + 6,
ifelse(PROPDMGEXP == "7", PROPDMG * 10 + 7,
ifelse(PROPDMGEXP == "8", PROPDMG * 10 + 8,
ifelse(PROPDMGEXP == "9", PROPDMG * 10 + 9,
PROPDMG)))))))))))))))
#now doing crop damage
data2 <- data2 %>% mutate(CROPDMGEXP =
ifelse(CROPDMGEXP == "h", "H",
ifelse(CROPDMGEXP == "k", "K",
ifelse(CROPDMGEXP == "m", "M",
ifelse(CROPDMGEXP == "b", "B", CROPDMGEXP)))))
data2 <- data2 %>% mutate(crop =
ifelse(CROPDMGEXP == "H", CROPDMG * 100,
ifelse(CROPDMGEXP == "K", CROPDMG * 1000,
ifelse(CROPDMGEXP == "M", CROPDMG * 1000000,
ifelse(CROPDMGEXP == "B", CROPDMG * 1000000000,
ifelse(CROPDMGEXP == "0", CROPDMG * 10,
ifelse(CROPDMGEXP == "1", CROPDMG * 10 + 1,
ifelse(CROPDMGEXP == "2", CROPDMG * 10 + 2,
ifelse(CROPDMGEXP == "3", CROPDMG * 10 + 3,
ifelse(CROPDMGEXP == "4", CROPDMG * 10 + 4,
ifelse(CROPDMGEXP == "5", CROPDMG * 10 + 5,
ifelse(CROPDMGEXP == "6", CROPDMG * 10 + 6,
ifelse(CROPDMGEXP == "7", CROPDMG * 10 + 7,
ifelse(CROPDMGEXP == "8", CROPDMG * 10 + 8,
ifelse(CROPDMGEXP == "9", CROPDMG * 10 + 9,
CROPDMG)))))))))))))))
Our last step before tackling EVTYPES is to combine property and crop damage into one variable called damage for our calculations for economic damage and to filter out any instances where there is a zero value for both economic damage and the number of people harmed to cut down on the amount of data.
#make a damage variable to account for total property and crop damage
#filter out any 0 value for people_harmed + damage to cut down on data
data2 <- data2 %>% mutate(damage = property + crop)
data2 <- data2 %>% mutate(filt = damage + people_harmed) %>%
filter(filt != 0)
The EVTYPE variable gets a little complicated based on it’s uncleanliness. I decided to go through any unique event type with more than ten occurences and try to figure out what actual event it should be classified under if it wasn’t on the original list of 48 variables. A number of results ended up being combinations which I decided to leave unchanged for the explicit reason of being unable to know how much of the affect from the event should be classified to which combined event. This was done with the goal of cleaning the data as much as I could without dirtying it in another format. I also decided not to remove any data which fell out of the 48 event types because misclassified events would be small enough to not affect the analysis of looking at the top weather events in terms of total damage.
#subbing misspelled names and misclassifications for event types
#ignored events which were combined since it's impossible to know how they
#should have been classified. Also didn't make any adjustments to events with
#less than ten occurences because it wouldn't make a massive difference for the #purposes of this assignment. Kept in unclassified data for the same reason
data2$EVTYPE <- gsub("TSTM", "THUNDERSTORM", data2$EVTYPE)
data2$EVTYPE <- gsub("WINDS", "WIND", data2$EVTYPE)
data2$EVTYPE <- gsub("URBAN/SML STREAM FLD", "FLOOD", data2$EVTYPE)
data2$EVTYPE <- gsub("WILD/FOREST FIRE", "wILDFIRE", data2$EVTYPE)
data2$EVTYPE <- gsub("URBAN FLOOD", "FLOOD", data2$EVTYPE)
data2$EVTYPE <- gsub("WEATHER/MIX", "WEATHER", data2$EVTYPE)
data2$EVTYPE <- gsub("RIVER FLOOD", "FLOOD", data2$EVTYPE)
data2$EVTYPE <- gsub("EXTREME COLD", "EXTREME COLD/WIND CHILL", data2$EVTYPE)
data2$EVTYPE <- gsub("FLOODING", "FLOOD", data2$EVTYPE)
data2$EVTYPE <- gsub("HURRICANE/TYPHOON", "HURRICANE", data2$EVTYPE)
data2$EVTYPE <- gsub("HURRICANE", "HURRICANE/TYPHOON", data2$EVTYPE)
data2$EVTYPE <- gsub("THUNDERSTORM WINDS", "THUNDERSTORM WIND", data2$EVTYPE)
data2$EVTYPE <- gsub("EXCESSIVE SNOW", "HEAVY SNOW", data2$EVTYPE)
data2$EVTYPE <- gsub("EXTREME COLD/WIND CHILL/WIND CHILL", "EXTREME COLD/WIND CHILL", data2$EVTYPE)
data2$EVTYPE <- gsub("HEAVY SURF", "HIGH SURF", data2$EVTYPE)
data2$EVTYPE <- gsub("HEAVY SURF/HIGH SURF", "HIGH SURF", data2$EVTYPE)
data2$EVTYPE <- gsub("COLD/WIND CHILL", "EXTREME COLD/WIND CHILL", data2$EVTYPE)
data2$EVTYPE <- gsub("FROST/FREEZE", "FREEZE", data2$EVTYPE)
data2$EVTYPE <- gsub("FREEZE", "FROST/FREEZE", data2$EVTYPE)
data2$EVTYPE <- gsub("FLASH FLOODS", "FLASH FLOOD", data2$EVTYPE)
data2$EVTYPE <- gsub("HEAT WAVE", "HEAT", data2$EVTYPE)
data2$EVTYPE <- gsub("SMALL HAIL", "HAIL", data2$EVTYPE)
data2$EVTYPE <- gsub("URBAN/SMALL STREAM FLOOD", "FLOOD", data2$EVTYPE)
data2$EVTYPE <- gsub("\\(","", data2$EVTYPE)
data2$EVTYPE <- gsub("\\)","", data2$EVTYPE)
data2$EVTYPE <- gsub("THUNDERSTORM WIND G45", "THUNDERSTORM WIND", data2$EVTYPE)
data2$EVTYPE <- gsub("THUNDERSTORM WIND G40", "THUNDERSTORM WIND", data2$EVTYPE)
data2$EVTYPE <- gsub("THUNDERSTORMS WIND", "THUNDERSTORM WIND", data2$EVTYPE)
#make everything uppercase
data2$EVTYPE <- toupper(data2$EVTYPE)
Given the large number of events with less than 10 observations, I decided to forgo trying to clean them. The analysis below shows that the amount of events with less than 10 occurences only affects 2.7% of the data which was a low enough number in my opinion to make cleaning it not worth the time.
#calculate data with below 10 observations for number of unclean data
events <- data2 %>% select(EVTYPE) %>% mutate(occurence = 1) %>%
group_by(EVTYPE) %>% summarize(total = sum(occurence))
totalOccurence <- events %>% pull(total)
totalOccurence <- sum(totalOccurence)
below10 <- events %>% filter(total <= 10) %>% pull(total)
below10 <- sum(below10)
percent_unclean <- below10/totalOccurence
print(percent_unclean)
## [1] 0.002713124
##Timeframe
For the purposes of showing the total damage inflicted by the weather events, we must first look at the number of observations by year to make sure we are comparing apples to apples by not including years with a low number of observations. This plot shows the number of observations by year.
#make a plot showing observations by year
plot1data <- data2 %>% mutate(year = year(BGN_DATE), occurence = 1) %>%
select(year, occurence) %>% group_by(year) %>%
summarize(total = sum(occurence))
xplot1 <- plot1data %>% pull(year)
yplot1 <- plot1data %>% pull(total)
plot(xplot1, yplot1, xlab = "Year", ylab = "Observations",
main = "Number of Weather Event Observations per Year", pch = 19,
col = "blue", cex = 1.5)
abline(v = 1994, col = "orange", lty = 2, lwd = 3)
As a result of this plot, we will only look at weather events occuring between 1994 and 2011 as the number of observations was a lot higher in those years than the years prior to 1994. I decided to not include 1993 since it looks like a transitional year in observation collecting.
##Results
The first question we want answered is which weather events cause the most harm to people. The plot below shows the top eight weather events which have caused the most injuries and fatalities combined to people in the time period from 1994 to 2011. By looking at the totals over that time period, we are finding a good balance between number of occurences and size of damage done to people.
#make a plot showing most harmful events to people
plot2data <- data2 %>% mutate(year = year(BGN_DATE)) %>%
filter(year >= 1994) %>%
select(year, EVTYPE, people_harmed)
plot2data <- plot2data %>% group_by(EVTYPE) %>%
summarize(harm = sum(people_harmed)) %>% arrange(desc(harm))
plot2data <- plot2data[1:8, ]
xplot2 <- plot2data %>% pull(EVTYPE)
yplot2 <- plot2data %>% pull(harm)
barplot(yplot2, names.arg = xplot2, ylim = c(0, 25000),
main = "Total Injuries and Fatalities from Weather Events (1994-2011)",
col = "red", xlab = "Weather Event", ylab = "Number of People")
This plot shows that tornados are far and away the biggest cause of harm to people from weather events observed.
The second question we want answered is which weather events cause the most harm to the eonomy. The plot below shows the top eight weather events which have caused the most property and crop damage in the time period from 1994 to 2011.
#make a plot showing most harmful events for economy (billion dollars)
plot3data <- data2 %>% mutate(year = year(BGN_DATE)) %>%
filter(year >= 1994) %>%
select(year, EVTYPE, damage)
plot3data <- plot3data %>% group_by(EVTYPE) %>%
summarize(damage = sum(damage)/1000000000) %>% arrange(desc(damage))
plot3data <- plot3data[1:8, ]
xplot3 <- plot3data %>% pull(EVTYPE)
yplot3 <- plot3data %>% pull(damage)
barplot(yplot3, names.arg = xplot3, ylim = c(0, 150),
ylab = "Damage (billion dollars)", yaxt = "n",
main = "Total Property and Crop Damage from Weather Events (1994-2011)",
col = "green", xlab = "Weather Event")
axis(2, at = c(0, 25, 50, 75, 100, 125, 150))
When looking at economic damage, tornados drop down to fourth on the list despite their high personal damage, and are replaced by floods as the most damaging weather event when looking at things on an economic perspective.
##R Session Info
print(sessionInfo())
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18362)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lubridate_1.7.4 dplyr_0.8.3 R.utils_2.9.0 R.oo_1.22.0
## [5] R.methodsS3_1.7.1 data.table_1.12.2
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.2 knitr_1.24 magrittr_1.5 tidyselect_0.2.5
## [5] R6_2.4.0 rlang_0.4.0 stringr_1.4.0 tools_3.6.1
## [9] xfun_0.9 htmltools_0.3.6 yaml_2.2.0 digest_0.6.20
## [13] assertthat_0.2.1 tibble_2.1.3 crayon_1.3.4 purrr_0.3.2
## [17] glue_1.3.1 evaluate_0.14 rmarkdown_1.15 stringi_1.4.3
## [21] compiler_3.6.1 pillar_1.4.2 pkgconfig_2.0.2