Synopsis
Using data from the U.S. National Oceanic and Atmospheric Administration (NOAA), the following data analysis determines which weather events have been the most harmful to the US population from an economic and health perspective.
Data Processing
First, we need to load the data into R.
## Load data
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
dir <- "/Users/kevinroche22/RData/SevereWeatherAnalysis"
setwd(dir)
download.file(url, "StormData.csv.bz2")
stormData <- read.csv(bzfile("StormData.csv.bz2"), header = TRUE, sep = ",")
Let’s take a look at what we’re working with.
## [1] 902297 37
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
So, the data contains just over 900k observations of 37 different variables. It looks like some of the variables have missing data. Let’s see how many missing values each variable has.
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
## 1 0 0 0 0 0 0 0 0 0
## BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI
## 1 0 0 0 0 0 100 0 0
## END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG PROPDMGEXP
## 1 0 0 0 93.49061 0 0 0 0 0
## CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE LATITUDE_E
## 1 0 0 0 0 0 0.005208928 0 0.00443313
## LONGITUDE_ REMARKS REFNUM
## 1 0 0 0
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
## 1 0 0 0 0 0 0 0 0 0
## BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN END_RANGE END_AZI
## 1 0 0 0 0 0 902297 0 0
## END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG PROPDMGEXP
## 1 0 0 0 843563 0 0 0 0 0
## CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE LATITUDE_E
## 1 0 0 0 0 0 47 0 40
## LONGITUDE_ REMARKS REFNUM
## 1 0 0 0
Interesting. The “COUNTYENDN” variable is missing entirely, while the “F” variable is ~93% missing. A very small proportion of the latitude variables are also missing. Given that I’m only interested in the variables that have economic and health consequences, these missing values won’t be problematic.
Now, let’s take a subset of the data that only includes the variables that are of interest to me.
## Select Variables
stormData <- stormData %>%
select("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
Some of these variables are related. The value for “PROPDMG” doesn’t represent the amount of property damage alone, it represents the coefficient - so it needs to be multiplied by the corresponding value from “PROPDMGEXP”. The same logic follows for “CROPDMG” and “CROPDMGEXP”.
The code below creates two new variables, “PROPERTYDMGS” and “CROPDMGS” that multiply the values for PROPDMG and CROPDMG by their corresponding exponent. Observations that have missing exponent values or erroneous exponent values (a very small amount of the exponents are labeled “?”, for example) are assigned a value of NA.
The original “PROPDMG,”PROPDMGEXP“,”CROPDMG“, and”CROPDMPEXP" columns are then dropped.
## Property Damage
stormData$PROPDMGEXP <- as.character(stormData$PROPDMGEXP)
stormData <- stormData %>%
mutate(PROPERTYDMGS = case_when(PROPDMGEXP == "K" ~ PROPDMG * 1000,
PROPDMGEXP == "M" ~ PROPDMG * 1000000,
PROPDMGEXP == "B" ~ PROPDMG * 1000000000))
## Crop Damage
stormData$CROPDMGEXP <- as.character(stormData$CROPDMGEXP)
stormData <- stormData %>%
mutate(CROPDMGS = case_when(CROPDMGEXP == "K" ~ CROPDMG * 1000,
CROPDMGEXP == "M" ~ CROPDMG * 1000000,
CROPDMGEXP == "B" ~ CROPDMG * 1000000000))
## Drop old property damage and crop damage columns
stormData <- stormData %>%
select(-PROPDMG, -PROPDMGEXP, -CROPDMG, -CROPDMGEXP)
## Format new columns as integers
stormData$CROPDMGS <- as.integer(stormData$CROPDMGS)
stormData$PROPERTYDMGS <- as.integer(stormData$PROPERTYDMGS)
The date variable, BGN_DATE, isn’t formatted as a date. Let’s fix that, and then extract the year from it.
## Format as date
stormData$BGN_DATE <- as.Date(stormData$BGN_DATE, "%m/%d/%Y")
## Create new variable, year
stormData <- stormData %>%
mutate(YEAR = str_extract(BGN_DATE, "^\\d{4}")) %>%
select(-BGN_DATE)
## Format year variable
stormData$YEAR <- lubridate::ymd(stormData$YEAR, truncated = 2L)
The NOAA notes that prior to 1996, only four types of events were recorded - Hail, Wind, Thunderstorms, and Tornadoes. From 1996 onward, 48 event types are recorded. The data will be skewed towards the original four events if I use data from all time periods, so I’ll need to subset the data from 1996 onward to allow for proper comparison between event types.
We only need events that resulted in either health or economic impacts, so we can filter out observations that don’t fall into either of those categories.
## Only keep events with health or economic impacts
stormData <- stormData %>%
filter(PROPERTYDMGS > 0 | CROPDMGS > 0 | FATALITIES > 0 | INJURIES > 0)
Let’s check the cleanliness of the event type data.
## Change case of all event types to upper
stormData$EVTYPE <- toupper(stormData$EVTYPE)
stormData %>% distinct(EVTYPE)
## EVTYPE
## 1 WINTER STORM
## 2 TORNADO
## 3 TSTM WIND
## 4 HIGH WIND
## 5 FLASH FLOOD
## 6 FREEZING RAIN
## 7 EXTREME COLD
## 8 LIGHTNING
## 9 HAIL
## 10 FLOOD
## 11 TSTM WIND/HAIL
## 12 EXCESSIVE HEAT
## 13 RIP CURRENTS
## 14 OTHER
## 15 HEAVY SNOW
## 16 WILD/FOREST FIRE
## 17 ICE STORM
## 18 BLIZZARD
## 19 STORM SURGE
## 20 ICE JAM FLOOD (MINOR
## 21 DUST STORM
## 22 STRONG WIND
## 23 DUST DEVIL
## 24 URBAN/SML STREAM FLD
## 25 FOG
## 26 ROUGH SURF
## 27 HEAVY SURF
## 28 HEAVY RAIN
## 29 MARINE ACCIDENT
## 30 AVALANCHE
## 31 FREEZE
## 32 DRY MICROBURST
## 33 WINDS
## 34 COASTAL STORM
## 35 EROSION/CSTL FLOOD
## 36 RIVER FLOODING
## 37 WATERSPOUT
## 38 DAMAGING FREEZE
## 39 HURRICANE
## 40 TROPICAL STORM
## 41 BEACH EROSION
## 42 HIGH SURF
## 43 HEAVY RAIN/HIGH SURF
## 44 UNSEASONABLE COLD
## 45 EARLY FROST
## 46 WINTRY MIX
## 47 DROUGHT
## 48 COASTAL FLOODING
## 49 TORRENTIAL RAINFALL
## 50 LANDSLUMP
## 51 HURRICANE EDOUARD
## 52 TIDAL FLOODING
## 53 STRONG WINDS
## 54 EXTREME WINDCHILL
## 55 GLAZE
## 56 EXTENDED COLD
## 57 WHIRLWIND
## 58 HEAVY SNOW SHOWER
## 59 LIGHT SNOW
## 60 COASTAL FLOOD
## 61 MIXED PRECIP
## 62 COLD
## 63 FREEZING SPRAY
## 64 DOWNBURST
## 65 MUDSLIDES
## 66 MICROBURST
## 67 MUDSLIDE
## 68 SNOW
## 69 SNOW SQUALLS
## 70 WIND DAMAGE
## 71 LIGHT SNOWFALL
## 72 FREEZING DRIZZLE
## 73 GUSTY WIND/RAIN
## 74 GUSTY WIND/HVY RAIN
## 75 WIND
## 76 COLD TEMPERATURE
## 77 HEAT WAVE
## 78 COLD AND SNOW
## 79 RAIN/SNOW
## 80 TSTM WIND (G45)
## 81 GUSTY WINDS
## 82 GUSTY WIND
## 83 TSTM WIND 40
## 84 TSTM WIND 45
## 85 HARD FREEZE
## 86 TSTM WIND (41)
## 87 HEAT
## 88 RIVER FLOOD
## 89 TSTM WIND (G40)
## 90 RIP CURRENT
## 91 MUD SLIDE
## 92 FROST/FREEZE
## 93 SNOW AND ICE
## 94 AGRICULTURAL FREEZE
## 95 WINTER WEATHER
## 96 SNOW SQUALL
## 97 ICY ROADS
## 98 THUNDERSTORM
## 99 HYPOTHERMIA/EXPOSURE
## 100 LAKE EFFECT SNOW
## 101 MIXED PRECIPITATION
## 102 BLACK ICE
## 103 COASTALSTORM
## 104 DAM BREAK
## 105 BLOWING SNOW
## 106 FROST
## 107 GRADIENT WIND
## 108 UNSEASONABLY COLD
## 109 TSTM WIND AND LIGHTNING
## 110 WET MICROBURST
## 111 HEAVY SURF AND WIND
## 112 FUNNEL CLOUD
## 113 TYPHOON
## 114 LANDSLIDES
## 115 HIGH SWELLS
## 116 HIGH WINDS
## 117 SMALL HAIL
## 118 UNSEASONAL RAIN
## 119 COASTAL FLOODING/EROSION
## 120 TSTM WIND (G45)
## 121 TSTM WIND (G45)
## 122 HIGH WIND (G40)
## 123 TSTM WIND (G35)
## 124 COASTAL EROSION
## 125 UNSEASONABLY WARM
## 126 SEICHE
## 127 COASTAL FLOODING/EROSION
## 128 HYPERTHERMIA/EXPOSURE
## 129 ROCK SLIDE
## 130 GUSTY WIND/HAIL
## 131 HEAVY SEAS
## 132 TSTM WIND
## 133 LANDSPOUT
## 134 RECORD HEAT
## 135 EXCESSIVE SNOW
## 136 FLOOD/FLASH/FLOOD
## 137 WIND AND WAVE
## 138 FLASH FLOOD/FLOOD
## 139 LIGHT FREEZING RAIN
## 140 ICE ROADS
## 141 HIGH SEAS
## 142 RAIN
## 143 ROUGH SEAS
## 144 TSTM WIND G45
## 145 NON-SEVERE WIND DAMAGE
## 146 WARM WEATHER
## 147 THUNDERSTORM WIND (G40)
## 148 LANDSLIDE
## 149 HIGH WATER
## 150 FLASH FLOOD
## 151 LATE SEASON SNOW
## 152 WINTER WEATHER MIX
## 153 ROGUE WAVE
## 154 FALLING SNOW/ICE
## 155 NON-TSTM WIND
## 156 NON TSTM WIND
## 157 BRUSH FIRE
## 158 BLOWING DUST
## 159 VOLCANIC ASH
## 160 HIGH SURF ADVISORY
## 161 HAZARDOUS SURF
## 162 WILDFIRE
## 163 COLD WEATHER
## 164 ICE ON ROAD
## 165 DROWNING
## 166 EXTREME COLD/WIND CHILL
## 167 MARINE TSTM WIND
## 168 HURRICANE/TYPHOON
## 169 DENSE FOG
## 170 WINTER WEATHER/MIX
## 171 ASTRONOMICAL HIGH TIDE
## 172 HEAVY SURF/HIGH SURF
## 173 TROPICAL DEPRESSION
## 174 LAKE-EFFECT SNOW
## 175 MARINE HIGH WIND
## 176 THUNDERSTORM WIND
## 177 TSUNAMI
## 178 STORM SURGE/TIDE
## 179 COLD/WIND CHILL
## 180 LAKESHORE FLOOD
## 181 MARINE THUNDERSTORM WIND
## 182 MARINE STRONG WIND
## 183 ASTRONOMICAL LOW TIDE
## 184 DENSE SMOKE
## 185 MARINE HAIL
## 186 FREEZING FOG
Oof. There are 186 observations, when the NOAA says there should only be 48. This means there are a lot of duplicates in there.
Because I’m looking for the most harmful events - and, I suppose, to avoid large amounts of tedious work - I’m going to only clean observations that are in the 90th percentile as far as health and economic impacts go.
## Change NA's to 0's
stormData <- stormData %>%
mutate(CROPDMGS = replace_na(CROPDMGS, 0))
stormData <- stormData %>%
mutate(PROPERTYDMGS = replace_na(PROPERTYDMGS, 0))
## Create variables representing health and economic impacts
stormData <- stormData %>%
mutate(HEALTHIMPACT = INJURIES + FATALITIES, ECONIMPACT = PROPERTYDMGS + CROPDMGS)
## View observations in 90th percentile of health impacts
stormData %>%
group_by(EVTYPE) %>%
summarise(HEALTHIMPACT = sum(HEALTHIMPACT)) %>%
arrange(desc(HEALTHIMPACT)) %>%
filter(HEALTHIMPACT > quantile(HEALTHIMPACT, 0.9))
## # A tibble: 19 x 2
## EVTYPE HEALTHIMPACT
## <chr> <dbl>
## 1 TORNADO 22178
## 2 EXCESSIVE HEAT 8188
## 3 FLOOD 7172
## 4 LIGHTNING 4792
## 5 TSTM WIND 3870
## 6 FLASH FLOOD 2561
## 7 THUNDERSTORM WIND 1530
## 8 WINTER STORM 1483
## 9 HEAT 1459
## 10 HURRICANE/TYPHOON 1339
## 11 HIGH WIND 1318
## 12 WILDFIRE 986
## 13 HEAVY SNOW 805
## 14 FOG 772
## 15 HAIL 720
## 16 WILD/FOREST FIRE 557
## 17 RIP CURRENT 549
## 18 RIP CURRENTS 496
## 19 BLIZZARD 455
## View observations in 90th percentile of economic impacts
stormData %>%
group_by(EVTYPE) %>%
summarise(ECONIMPACT = sum(ECONIMPACT)) %>%
arrange(desc(ECONIMPACT)) %>%
filter(ECONIMPACT > quantile(ECONIMPACT, 0.9))
## # A tibble: 19 x 2
## EVTYPE ECONIMPACT
## <chr> <dbl>
## 1 FLOOD 30919611943
## 2 TORNADO 22100370719
## 3 HAIL 17071172869
## 4 FLASH FLOOD 16557105603
## 5 DROUGHT 14413667000
## 6 HURRICANE 11554229010
## 7 HURRICANE/TYPHOON 11003712799
## 8 HIGH WIND 5881421657
## 9 WILDFIRE 5054139799
## 10 TSTM WIND 5031971788
## 11 THUNDERSTORM WIND 3780985440
## 12 ICE STORM 3657908808
## 13 TROPICAL STORM 3170186549
## 14 WILD/FOREST FIRE 3108564829
## 15 WINTER STORM 1544687250
## 16 EXTREME COLD 1328733399
## 17 HEAVY RAIN 1313034240
## 18 FROST/FREEZE 1104666000
## 19 LIGHTNING 749975520
## Clean 90th percentile observations
stormData <- stormData %>%
mutate_at("EVTYPE", str_replace, "TSTM WIND", "THUNDERSTORM WIND") %>%
mutate_at("EVTYPE", str_replace, "RIP CURRENTS", "RIP CURRENT") %>%
mutate_at("EVTYPE", str_replace, "FOG", "FREEZING FOG") %>%
mutate_at("EVTYPE", str_replace, "WILD/FOREST FIRE", "WILDFIRE") %>%
mutate_at("EVTYPE", str_replace, "HURRICANE/TYPHOON", "HURRICANE (TYPHOON)") %>%
mutate_at("EVTYPE", str_replace, "EXTREME COLD/WIND CHILL", "EXTREME COLD") %>%
mutate_at("EVTYPE", str_replace, "HURRICANE$", "HURRICANE (TYPHOON)")
At this point, the data set is clean enough to conduct some analysis on.
Results
Let’s visualize which weather events had the biggest health impacts.
## Events with the largest total health costs
healthCosts <- stormData %>%
group_by(EVTYPE) %>%
summarise(HEALTHIMPACT = sum(HEALTHIMPACT)) %>%
arrange(desc(HEALTHIMPACT)) %>%
top_n(10) %>%
ggplot(aes(x = reorder(EVTYPE, HEALTHIMPACT), y = HEALTHIMPACT, color = EVTYPE)) +
geom_bar(stat="identity") +
theme_bw() +
theme(legend.position="none") +
ggtitle("Health Costs (Injuries + Fatalities) by Event Type") +
labs(x = "Event Type", y = "Total Health Impacts") +
coord_flip()
## Events with the largest number of injuries
injuries <- stormData %>%
group_by(EVTYPE) %>%
summarise(INJURIES = sum(INJURIES)) %>%
arrange(desc(INJURIES)) %>%
top_n(10) %>%
ggplot(aes(x = reorder(EVTYPE, INJURIES), y = INJURIES, color = EVTYPE)) +
geom_bar(stat="identity") +
theme_bw() +
theme(legend.position="none") +
ggtitle("Injuries by Event Type") +
labs(x = "Event Type", y = "Total Injuries") +
coord_flip()
## Events with the largest number of deaths
deaths <- stormData %>%
group_by(EVTYPE) %>%
summarise(FATALITIES = sum(FATALITIES)) %>%
arrange(desc(FATALITIES)) %>%
top_n(10) %>%
ggplot(aes(x = reorder(EVTYPE, FATALITIES), y = FATALITIES, color = EVTYPE)) +
geom_bar(stat="identity") +
theme_bw() +
theme(legend.position="none") +
ggtitle("Fatalities by Event Type") +
labs(x = "Event Type", y = "Total Deaths") +
coord_flip()
## Plot
gridExtra::grid.arrange(healthCosts, injuries, deaths, heights = unit(c(2.5,2.5,2.5), c("in", "in", "in")), nrow = 3)
We observe that tornadoes lead to the most injuries, while excessive heat causes the most fatalities.
Now, let’s visualize which events had the biggest economic consequences.
## Events with the largest total economic costs
econData <- stormData %>%
group_by(EVTYPE) %>%
summarise(ECONIMPACT = sum(ECONIMPACT)) %>%
arrange(desc(ECONIMPACT)) %>%
top_n(10) %>%
ggplot(aes(x = reorder(EVTYPE, ECONIMPACT), y = ECONIMPACT, color = EVTYPE)) +
geom_bar(stat="identity") +
theme_bw() +
theme(legend.position="none") +
ggtitle("Economic Costs (Property + Crop Damage) by Event Type") +
labs(x = "Event Type", y = "Total Economic Costs") +
coord_flip() +
scale_y_continuous(labels=scales::dollar_format())
## Events with the largest total property damage
propertyDamage <- stormData %>%
group_by(EVTYPE) %>%
summarise(PROPERTYDMGS = sum(PROPERTYDMGS)) %>%
arrange(desc(PROPERTYDMGS)) %>%
top_n(10) %>%
ggplot(aes(x = reorder(EVTYPE, PROPERTYDMGS), y = PROPERTYDMGS, color = EVTYPE)) +
geom_bar(stat="identity") +
theme_bw() +
theme(legend.position="none") +
ggtitle("Property Damage by Event Type") +
labs(x = "Event Type", y = "Total Property Damage") +
coord_flip() +
scale_y_continuous(labels=scales::dollar_format())
cropDamage <- stormData %>%
group_by(EVTYPE) %>%
summarise(CROPDMGS = sum(CROPDMGS)) %>%
arrange(desc(CROPDMGS)) %>%
top_n(10) %>%
ggplot(aes(x = reorder(EVTYPE, CROPDMGS), y = CROPDMGS, color = EVTYPE)) +
geom_bar(stat="identity") +
theme_bw() +
theme(legend.position="none") +
ggtitle("Crop Damage by Event Type") +
labs(x = "Event Type", y = "Total Crop Damage") +
coord_flip() +
scale_y_continuous(labels=scales::dollar_format())
## Plot
gridExtra::grid.arrange(econData, propertyDamage, cropDamage, heights = unit(c(2.5,2.5,2.5), c("in", "in", "in")), nrow = 3)
We observe that floods cause the most property damage, while droughts cause the most crop damage. This makes intuitive sense - floods are somewhat common in homes, and crops often die in a drought.