This document outlines the analyses to answer the following questions:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The NOAA Storm Database was used to answer these questions. The data was first imported into R and cleaned to create tidy datasets. Plots were then generated answer the questions.
library(dplyr)
library(tidyverse)
library(ggplot2)
library(gridExtra)
The data was first downloaded into the working directory from the link provided as a .bz2 file.
The file was then opened and examined.
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "stormdata.bz2", header=TRUE)
data <- read.csv("stormdata.bz2")
The dataset has 37 variables and 902297 observations.
dim(data)
## [1] 902297 37
head(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
By examining the properties of the dataset we can determine the variables of interest.
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The dataset contains 37 variables but we are only interested in the event types (EVTYPE), health variables (FATALITIES, INJURIES), and economic variables (PROPGMD, PROPDMGEXP, CROPDMG, DROPDMGEXP).
data2 <- data[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
head(data2)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
Visual inspection of EVTYPE shows us that there are 985 categories of events. But the documentation shows 48 different events. Changing the letters in all the words to uppercase reduced the number of categories to 898. Some values were not properly coded. The following code renames the items to match the event type category.
length(unique(data2$EVTYPE))
## [1] 985
data2$EVTYPER <- data2$EVTYPE
data2$EVTYPER <- toupper(data2$EVTYPER)
data2$EVTYPER <- gsub("^(SMALL )?HAIL.*", "HAIL", data2$EVTYPER)
data2$EVTYPER <- gsub("TSTM|THUNDERSTORMS?", "THUNDERSTORM", data2$EVTYPER)
data2$EVTYPER <- gsub("STORMS?", "STORM", data2$EVTYPER)
data2$EVTYPER <- gsub("WINDS?|WINDS?/HAIL", "WIND", data2$EVTYPER)
data2$EVTYPER <- gsub("RAINS?", "RAIN", data2$EVTYPER)
data2$EVTYPER <- gsub("^TH?UN?DEE?RS?TO?RO?M ?WIND.*|^(SEVERE )?THUNDERSTORM$|^WIND STORM$|^(DRY )?MI[CR][CR]OBURST.*|^THUNDERSTORMW$", "THUNDERSTORM WIND", data2$EVTYPER)
data2$EVTYPER <- gsub("^COASTAL ?STORM$|^MARINE ACCIDENT$", "MARINE THUNDERSTORM WIND", data2$EVTYPER)
data2$EVTYPER <- gsub("^FLOODS?.*|^URBAN/SML STREAM FLD$|^(RIVER|TIDAL|MAJOR|URBAN|MINOR|ICE JAM|RIVER AND STREAM|URBAN/SMALL STREAM)? FLOOD(ING)?S?$|^HIGH WATER$|^URBAN AND SMALL STREAM FLOODIN$|^DROWNING$|^DAM BREAK$", "FLOOD", data2$EVTYPER)
data2$EVTYPER <- gsub("^FLASH FLOOD.*|^RAPIDLY RISING WATER$", "FLASH FLOOD", data2$EVTYPER)
data2$EVTYPER <- gsub("WATERSPOUTS?", "WATERSPOUT", data2$EVTYPER)
data2$EVTYPER <- gsub("WEATHER/MIX", "WEATHER", data2$EVTYPER)
data2$EVTYPER <- gsub("CURRENTS?", "CURRENT", data2$EVTYPER)
data2$EVTYPER <- gsub("^WINDCHILL$|^COLD.*|^LOW TEMPERATURE$|^UNSEASONABLY COLD$", "COLD/WIND CHILL", data2$EVTYPER)
data2$EVTYPER <- gsub("^EXTREME WIND ?CHILL$|^(EXTENDED|EXTREME|RECORD)? COLDS?$", "EXTREME COLD/WIND CHILL", data2$EVTYPER)
data2$EVTYPER <- gsub("^WILD/FOREST FIRE$|^(WILD|BRUSH|FOREST)? ?FIRES?$", "WILDFIRE", data2$EVTYPER)
data2$EVTYPER <- gsub("^RAIN/SNOW$|^(BLOWING|HEAVY|EXCESSIVE|BLOWING|ICE AND|RECORD)? ?SNOWS?.*", "HEAVY SNOW", data2$EVTYPER)
data2$EVTYPER <- gsub("^FOG$", "DENSE FOG", data2$EVTYPER)
data2$EVTYPER <- gsub("^(GUSTY|NON-SEVERE|NON ?-?THUNDERSTORM)? ?WIND.*|^ICE/STRONG WIND$", "STRONG WIND", data2$EVTYPER)
data2$EVTYPER <- gsub("SURGE$", "SURGE/TIDE", data2$EVTYPER)
data2$EVTYPER <- gsub("CLOUDS?", "CLOUD", data2$EVTYPER)
data2$EVTYPER <- gsub("^FROST[/\\]FREEZE$|^FROST$|^(DAMAGING)? ?FREEZE$|^HYP[OE]R?THERMIA.*|^ICE$|^(ICY|ICE) ROADS$|^BLACK ICE$|^ICE ON ROAD$", "FROST/FREEZE", data2$EVTYPER)
data2$EVTYPER <- gsub("^GLAZE.*|^FREEZING (RAIN|DRIZZLE|RAIN/SNOW|SPRAY$)$|^WINTRY MIX$|^MIXED PRECIP(ITATION)?$|^WINTER WEATHER MIX$|^LIGHT SNOW$|^FALLING SNOW/ICE$|^SLEET.*", "SLEET", data2$EVTYPER)
data2$EVTYPER <- gsub("^HURRICANE.*", "HURRICANE/TYPHOON", data2$EVTYPER)
data2$EVTYPER <- gsub("^HEAT WAVES?$|^UNSEASONABLY WARM$|^WARM WEATHER$", "HEAT", data2$EVTYPER)
data2$EVTYPER <- gsub("^(EXTREME|RECORD/EXCESSIVE|RECORD) HEAT$", "EXCESSIVE HEAT", data2$EVTYPER)
data2$EVTYPER <- gsub("^HEAVY SURF(/HIGH SURF)?.*$|^(ROUGH|HEAVY) SEAS?.*|^(ROUGH|ROGUE|HAZARDOUS) SURF.*|^HIGH WIND AND SEAS$|^HIGH SURF.*|^HIGH SURF ADVISORY", "HIGH SURF", data2$EVTYPER)
data2$EVTYPER <- gsub("^LAND(SLUMP|SLIDE)?S?$|^MUD ?SLIDES?$|^AVALANCH?E$", "AVALANCHE", data2$EVTYPER)
data2$EVTYPER <- gsub("^UNSEASONABLY WARM AND DRY$|^DROUGHT.*|^HEAT WAVE DROUGHT$", "DROUGHT", data2$EVTYPER)
data2$EVTYPER <- gsub("^TORNADO.*", "TORNADO", data2$EVTYPER)
data2$EVTYPER <- gsub("^TROPICAL STORM.*", "TROPICAL STORM", data2$EVTYPER)
data2$EVTYPER <- gsub("^MARINE MISHAP$|^HIGH WIND/SEAS$", "MARINE HIGH WIND", data2$EVTYPER)
data2$EVTYPER <- gsub("^HIGH WIND.*", "HIGH WIND", data2$EVTYPER)
data2$EVTYPER <- gsub("^HIGH SEAS$", "MARINE STRONG WIND", data2$EVTYPER)
data2$EVTYPER <- gsub("^RIP CURRENT.*", "RIP CURRENT", data2$EVTYPER)
data2$EVTYPER <- gsub("^WATERSPOUT.*", "WATERSPOUT", data2$EVTYPER)
data2$EVTYPER <- gsub("^EXCESSIVE RAINFALL$|^RAIN.*|^TORRENTIAL RAINFALL$|^(HEAVY|HVY)? (RAIN|MIX|PRECIPITATION).*", "HEAVY RAIN", data2$EVTYPER)
data2$EVTYPER <- gsub("^FOG.*", "FREEZING FOG", data2$EVTYPER)
data2$EVTYPER <- gsub("^WINTER STORM.*", "WINTER STORM", data2$EVTYPER)
data2$EVTYPER <- gsub("^THUNDERSNOW$|^ICE STORM.*", "ICE STORM", data2$EVTYPER)
data2$EVTYPER <- gsub("WAVES?|SWELLS?", "SURF", data2$EVTYPER)
data2$EVTYPER <- gsub("^LIGHTNING.*", "LIGHTNING", data2$EVTYPER)
data2$EVTYPER <- gsub("^WHIRLWIND$|^GUSTNADO$|^TORNDAO$", "TORNADO", data2$EVTYPER)
data2$EVTYPER <- gsub("^COASTAL FLOOD.*", "COASTAL FLOOD", data2$EVTYPER)
data2$EVTYPER <- gsub("^TYPHOON", "HURRICANE/TYPHOON", data2$EVTYPER)
data2$EVTYPER <- gsub("^EROSION/CSTL FLOOD$|^COASTAL FLOOD/EROSION$|^COASTAL SURGE/TIDE$", "COASTAL FLOOD", data2$EVTYPER)
data2$EVTYPER <- gsub("^ASTRONOMICAL HIGH TIDE$", "STORM SURGE/TIDE", data2$EVTYPER)
data2$EVTYPER <- gsub("^(GROUND)? ?BLIZZARD.*$", "BLIZZARD", data2$EVTYPER)
data2$EVTYPER <- gsub("^DUST STORM.*$", "DUST STORM", data2$EVTYPER)
data2$EVTYPER <- gsub("^SUMMARY|NONE|OTHER", "NA", data2$EVTYPER)
length(unique(data2$EVTYPER))
## [1] 392
The code above reduced the number of categories to 392, which is still more than the 48 categories. Many of the items also do not fit in any of the categories and will not be considered during analyses.
Both CROPDMG and PROPDMG need to be converted into the same units so we can sum all the items. This is done by modifying the CROPDMGEXP and PROPDMGEXP variables (i.e., making them numeric).
data2$PROPDMGEXPR <- data2$PROPDMGEXP
data2$PROPDMGEXPR <- as.factor(toupper(data2$PROPDMGEXPR))
data2$PROPDMGEXPR <- gsub("H", 2, data2$PROPDMGEXPR)
data2$PROPDMGEXPR <- gsub("K", 3, data2$PROPDMGEXPR)
data2$PROPDMGEXPR <- gsub("M", 6, data2$PROPDMGEXPR)
data2$PROPDMGEXPR <- gsub("B", 9, data2$PROPDMGEXPR)
data2$PROPDMGEXPR <- gsub("^+", 0, data2$PROPDMGEXPR)
data2$PROPDMGEXPR <- gsub("^-", 0, data2$PROPDMGEXPR)
data2$PROPDMGEXPR <- gsub("^?", 0, data2$PROPDMGEXPR)
data2$PROPDMGEXPR <- as.numeric(data2$PROPDMGEXPR)
data2$CROPDMGEXPR <- data2$CROPDMGEXP
data2$CROPDMGEXPR <- as.factor(toupper(data2$CROPDMGEXPR))
data2$CROPDMGEXPR <- gsub("H", 2, data2$CROPDMGEXPR)
data2$CROPDMGEXPR <- gsub("K", 3, data2$CROPDMGEXPR)
data2$CROPDMGEXPR <- gsub("M", 6, data2$CROPDMGEXPR)
data2$CROPDMGEXPR <- gsub("B", 9, data2$CROPDMGEXPR)
data2$CROPDMGEXPR <- gsub("^+", 0, data2$CROPDMGEXPR)
data2$CROPDMGEXPR <- gsub("^-", 0, data2$CROPDMGEXPR)
data2$CROPDMGEXPR <- gsub("^?", 0, data2$CROPDMGEXPR)
data2$CROPDMGEXPR <- as.numeric(data2$CROPDMGEXPR)
#creating new CROP and PROP damage variables
data2$PROPDMGNEW <- with(data2, PROPDMG * 10^PROPDMGEXPR)
data2$CROPDMGNEW <- with(data2, CROPDMG * 10^CROPDMGEXPR)
The following code creates subsets of the clean dataset by summing the outcomes of interest and extracting the top 10 values.
data2$EVTYPER <- as.factor(data2$EVTYPER)
#Question 1 dataset
inj <- (data2 %>% select(EVTYPER, INJURIES) %>% group_by(EVTYPER) %>% summarize(TOTINJ = sum(INJURIES)) %>% arrange(desc(TOTINJ)))[1:10,]
fatal <- (data2 %>% select(EVTYPER, FATALITIES) %>% group_by(EVTYPER) %>% summarize(TOTFATAL = sum(FATALITIES)) %>% arrange(desc(TOTFATAL)))[1:10,]
#Question 2 dataset
prop <- (data2 %>% select(EVTYPER, PROPDMGNEW) %>% group_by(EVTYPER) %>% summarize(TOTPROP = sum(PROPDMGNEW)) %>% arrange(desc(TOTPROP)))[1:10,]
crop <- (data2 %>% select(EVTYPER, CROPDMGNEW) %>% group_by(EVTYPER) %>% summarize(TOTCROP = sum(CROPDMGNEW)) %>% arrange(desc(TOTCROP)))[1:10,]
inj$EVTYPER <- factor(inj$EVTYPER, levels = inj$EVTYPER[order(-inj$TOTINJ, decreasing=TRUE)])
injplot <- ggplot(inj[1:5, ], aes(x=EVTYPER, y=TOTINJ, fill="identity")) + geom_bar(stat="identity") +
theme(legend.position = "none") +
labs(y="total Number of Injuries", x="Event Type") +
coord_flip() +
ggtitle("5 Most Common Weather Events\nCausing Injuries Across the US")
fatal$EVTYPER <- factor(fatal$EVTYPER, levels = fatal$EVTYPER[order(-fatal$TOTFATAL, decreasing=TRUE)])
fatalplot <- ggplot(fatal[1:5,], aes(x=EVTYPER, y=TOTFATAL, fill="identity")) + geom_bar(stat="identity") +
theme(legend.position = "none") +
labs(y="total Number of Deaths", x="Event Type") +
coord_flip() +
ggtitle("5 Most Common Weather Events\nCausing Deaths Across the US")
grid.arrange(fatalplot, injplot, nrow=2)
Based on these graphs, the weather events that are the most harmful with respect to population health across the US are tornadoes, with over 5000 fatalities and 90000 injuries.
prop$EVTYPER <- factor(prop$EVTYPER, levels = prop$EVTYPER[order(-prop$TOTPROP, decreasing = TRUE)])
propplot <- ggplot(prop[1:5, ], aes(x=EVTYPER, y=TOTPROP, fill="identity")) + geom_bar(stat="identity") +
theme(legend.position = "none") +
labs(y="total Property Damage ($)", x="Event Type") +
coord_flip() +
ggtitle("5 Most Common Weather Events\nCausing Property Damage Across the US")
crop$EVTYPER <- factor(crop$EVTYPER, levels = crop$EVTYPER[order(-crop$TOTCROP, decreasing=TRUE)])
cropplot <- ggplot(crop[1:5, ], aes(x=EVTYPER, y=TOTCROP, fill="identity")) + geom_bar(stat="identity") +
theme(legend.position = "none") +
labs(y="total Crop Damage ($)", x="Event Type") +
coord_flip() +
ggtitle("5 Most Common Weather Events\nCausing Crop Damage Across the US")
grid.arrange(propplot, cropplot, nrow=2)
Based on these graphs, the weather events that cause the most property damage are hurricanes and typhoons, with over 85 billion dollars worth of damage. The weather events that cause the most crop damage are droughts, with over 13 billion dollars worth of damage.