This analysis analyzes weather storm data from the National Oceanic and Atmospheric Administration (NOAA) and finds that tornadoes cause the most fatalities and injuries while floods cause the most property and crop damage combined. Notably, droughts cause more crop damage than do floods.
We begin first by loading in necessary packages to R for the analysis.
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
library(ggplot2)
library(scales)
In the code chunk below, we use download.file and read.csv to get the data from the repository and read it into R. Of note, we set up an if loop to see if this file already exists in the working directory to eliminate redundant downloads.
rr_data <- "./repdata-data-StormData.csv.bz2"
if (!file.exists(rr_data))
{
data.url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url = data.url, destfile = rr_data)
}
data <- read.csv("repdata-data-StormData.csv.bz2")
We can use National Weather Service Instruction 10-1605 and the FAQ page to get more information on the data set. We are interested in weather event types and measures of economic consequences and public health. Let’s take a look at our data.
str(data)
'data.frame': 902297 obs. of 37 variables:
$ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
$ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
$ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
$ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
$ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
$ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
$ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
$ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
$ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
$ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
$ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
$ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
$ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
$ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
$ COUNTYENDN: logi NA NA NA NA NA NA ...
$ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
$ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
$ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
$ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
$ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
$ F : int 3 2 2 2 2 2 2 1 3 3 ...
$ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
$ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
$ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
$ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
$ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
$ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
$ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
$ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
$ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
$ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
$ LATITUDE : num 3040 3042 3340 3458 3412 ...
$ LONGITUDE : num 8812 8755 8742 8626 8642 ...
$ LATITUDE_E: num 3051 0 0 0 0 ...
$ LONGITUDE_: num 8806 0 0 0 0 ...
$ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
$ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
From this admittedly not-so-pretty run of the column headers and description of data types, we can see what we are interested in the following columns:
We will begin by reducing the number of columns to make our dataset more manageable. We also identify the BGN_DATE column as a date data type.
data2 <- select(data, BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
data2$BGN_DATE <- as.Date(data2$BGN_DATE, format = "%m/%d/%Y")
From here, we have to make use of the levels for PROPDMGEXP and CROPDMGEXP. The FAQ and the National Weather Service Instruction do not indicate what each character means. After Googling around a bit, I found a related analysis that does a fantastic job of comparing the values in PROPDMGEXP and CROPDMGEXP to the online NOAA website. What we can learn from this analysis is that the levels for these columns should be interpreted as below:
| Value | Multiplier |
|---|---|
| blank, -, or ? | 0 |
| + | 1 |
| numeric | 10 |
| h or H | 100 |
| k or K | 1,000 |
| m or M | 1,000,000 |
| b or B | 1,000,000,000 |
Our next task here will be to replace these values with the appropriate numeric value so that we can get an economic estimate for both property and crop damage. To make this easier, we will first bring all letter characters to uppercase. Once we replace the EXP values with the corresponding multiplier, we will create new variables to multiply DMG and EXP to get the economic damage done to property and crops.
data2$PROPDMGEXP <- toupper(data2$PROPDMGEXP)
data2$PROPDMGEXP[data2$PROPDMGEXP %in% c("","?","-","0","1","2","3","4","5","6","7","8","9")] <- 0
data2$PROPDMGEXP[data2$PROPDMGEXP %in% c("+")] <- 1
data2$PROPDMGEXP[data2$PROPDMGEXP == "H"] <- 100
data2$PROPDMGEXP[data2$PROPDMGEXP == "K"] <- 1000
data2$PROPDMGEXP[data2$PROPDMGEXP == "M"] <- 1000000
data2$PROPDMGEXP[data2$PROPDMGEXP == "B"] <- 1000000000
data2$PROPDMGEXP <- as.integer(data2$PROPDMGEXP)
data2 <- data2 %>% mutate(Property.Damage = PROPDMG * PROPDMGEXP)
data2$CROPDMGEXP <- toupper(data2$CROPDMGEXP)
data2$CROPDMGEXP[data2$CROPDMGEXP %in% c("","?","-","0","1","2","3","4","5","6","7","8","9")] <- 0
data2$CROPDMGEXP[data2$CROPDMGEXP %in% c("+")] <- 1
data2$CROPDMGEXP[data2$CROPDMGEXP == "H"] <- 100
data2$CROPDMGEXP[data2$CROPDMGEXP == "K"] <- 1000
data2$CROPDMGEXP[data2$CROPDMGEXP == "M"] <- 1000000
data2$CROPDMGEXP[data2$CROPDMGEXP == "B"] <- 1000000000
data2$CROPDMGEXP <- as.integer(data2$CROPDMGEXP)
data2 <- data2 %>% mutate(Crop.Damage = CROPDMG * CROPDMGEXP)
Now that the data is (or are, whatever floats your boat) processed, we are ready to answer the research questions. We start with the effect on the population’s health, as measured by the top ten weather events by total fatalities and total injuries.
data3 <- data2 %>%
group_by(EVTYPE) %>%
summarize("Total Fatalities" = sum(FATALITIES, na.rm = TRUE),
"Total Injuries" = sum(INJURIES, na.rm = TRUE)) %>%
arrange(desc(`Total Fatalities`)) %>%
top_n(10, wt = `Total Fatalities`)
data3 <- melt(data3, id.vars = c("EVTYPE"))
data3$EVTYPE <- factor(data3$EVTYPE, levels = unique(data3$EVTYPE))
ggplot(data3, aes(x = EVTYPE, y = value, fill = variable)) +
geom_bar(stat = "identity") +
facet_grid(variable ~ ., scales = "free_y") +
labs(x = "", y = "Population Affected") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
This bar chart shows that Tornadoes are the single biggest cause of fatalities and injuries.
Now we move onto economic damage. We will add up all the Property Damage and Crop Damage by event type, then add them together to arrive at Total Damage.
data4 <- data2 %>%
group_by(EVTYPE) %>%
summarize("Property Damage" = sum(Property.Damage, na.rm = TRUE)/1000000,
"Crop Damage" = sum(Crop.Damage, na.rm = TRUE)/1000000,
"Total Damage" = (sum(Property.Damage, na.rm = TRUE) + sum(Crop.Damage, na.rm = TRUE))/1000000) %>%
arrange(desc(`Total Damage`)) %>%
top_n(10, wt = `Total Damage`)
data4 <- melt(data4, id.vars = c("EVTYPE"))
data4$EVTYPE <- factor(data4$EVTYPE, levels = unique(data4$EVTYPE))
ggplot(data4, aes(x = EVTYPE, y = value, fill = variable)) +
geom_bar(stat = "identity") +
facet_grid(variable ~ ., scales = "free_y") +
labs(x = "", y = "Economic Cost of Damage ($millions)") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_y_continuous(labels = comma) +
scale_fill_brewer(palette = "Dark2") +
guides(fill = FALSE)
From this graph, we can see that Floods cause the greatest amount of total damage, driven primarily by Property Damage. While keeping in mind that the scales for Crop Damage and Property Damage are not the same, we note that Droughts are the single biggest cause of Crop Damage.
Here is information about the R Session used to conduct this analysis.
sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.1
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] scales_0.4.1 ggplot2_2.2.0 lubridate_1.6.0 dplyr_0.5.0 data.table_1.10.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.8 knitr_1.15.1 magrittr_1.5 munsell_0.4.3 colorspace_1.3-1
[6] R6_2.2.0 stringr_1.1.0 plyr_1.8.4 tools_3.3.2 grid_3.3.2
[11] gtable_0.2.0 DBI_0.5-1 htmltools_0.3.5 yaml_2.1.14 lazyeval_0.2.0
[16] rprojroot_1.1 digest_0.6.10 assertthat_0.1 tibble_1.2 RColorBrewer_1.1-2
[21] reshape2_1.4.2 base64enc_0.1-3 rsconnect_0.7 evaluate_0.10 rmarkdown_1.3
[26] labeling_0.3 stringi_1.1.2 backports_1.0.4 jsonlite_1.1