This program looks at storm data from the National Weather Service and examines the impacts of different event types on public health and economic outcomes. The program will uplaod the correct dataset, clean and process, and produce a few exploratory graphs that attempt to provide ideas for future analysis.
This section sets up the directories and imports the libaries necessary for my results to be reproducible. It then imports the data and reads it in to R. The data is then processed for the analysis, mostly by cleaning the EVTYPE variable and formatting dates correctly. ## Setting Up This code sets up R for the analysis. The program uses tidyverse packages, as well as RColorBrewer to look pretty.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:dplyr':
##
## intersect, setdiff, union
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(RColorBrewer)
We will also create the directories to be used from a given working directory, and download the data to that directory.
if(!file.exists('./plots')){dir.create('./plots')}
if(!file.exists('./data')){dir.create('./data')}
if(!file.exists('./data/weatherdata.csv.bz2')) {
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url,'./data/weatherdata.csv.bz2', method = "curl")
}
And finally, we will load the data
weatherdata <- read.csv('./data/weatherdata.csv.bz2')
First, we take a high-level look at the data, just to be sure what we’re dealing with. This step isn’t needed for reproducibility, but is a best practice.
dim(weatherdata)
## [1] 902297 37
str(weatherdata)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Next, the analysis is only interested in a few of these columns, so we want to narrow down our data to just the columns for the analysis. These columns contain data on our dates, geographic information, the type of event, and the health and economic variables of interest to the analysis. I also prefer lower-case names, so I edit the names here.
relevant_cols <- c("BGN_DATE","STATE","EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP","REFNUM")
working_data <- weatherdata[,relevant_cols]
names(working_data) <- tolower(names(working_data))
Because the analysis is interested in how some of these variables have changed over time, we also need to correctly format the dates and then, since this is just a simple analysis, isolate the year for each observation.
working_data <- working_data %>% mutate(dates = mdy_hms(bgn_date)) %>% mutate(years = year(dates))
Now with all of the relevant columns added and formatted, the program checks for the percentage of NA’s in each column
num_of_na <- c()
for(i in seq_along(names(working_data))){
na <- round(mean(is.na(working_data[,i])),2)*100
num_of_na <- c(num_of_na,na)
}
num_of_na
## [1] 0 0 0 0 0 0 0 0 0 0 0 0
No Na’s! that’s good.
The property and crop damage exponents need some work in order for the data to be usable. First, let’s check for the different symbols being used.
sort(table(working_data$propdmgexp), decreasing = T)
##
## K M 0 B 5 1 2 ? m H
## 465934 424665 11330 216 40 28 25 13 8 7 6
## + 7 3 4 6 - 8 h
## 5 5 4 4 4 1 1 1
Looks confusing, but luckily someone at this link already double checked what each symbol was standing in for. Based on their analysis, the following code wiill create a new column for each exponent, and then another to multiply the damage number by said exponent.
working_data$propdmgexponent <- working_data$propdmgexp
working_data$propdmgexponent[grep("\\?", working_data$propdmgexp)] <- "0"
working_data$propdmgexponent[grep("\\-", working_data$propdmgexp)] <- "0"
working_data$propdmgexponent[grep("\\+", working_data$propdmgexp)] <- "1"
working_data$propdmgexponent[grep("[0-8]", working_data$propdmgexp)] <- "1"
working_data$propdmgexponent[grep("[Hh]", working_data$propdmgexp)] <- "2"
working_data$propdmgexponent[grep("[Kk]", working_data$propdmgexp)] <- "3"
working_data$propdmgexponent[grep("[Mm]", working_data$propdmgexp)] <- "6"
working_data$propdmgexponent[grep("[Bb]", working_data$propdmgexp)] <- "9"
working_data$cropdmgexponent <- working_data$cropdmgexp
working_data$cropdmgexponent[grep("\\?", working_data$cropdmgexp)] <- "0"
working_data$cropdmgexponent[grep("\\-", working_data$cropdmgexp)] <- "0"
working_data$cropdmgexponent[grep("\\+", working_data$cropdmgexp)] <- "1"
working_data$cropdmgexponent[grep("[0-8]", working_data$cropdmgexp)] <- "1"
working_data$cropdmgexponent[grep("[Hh]", working_data$cropdmgexp)] <- "2"
working_data$cropdmgexponent[grep("[Kk]", working_data$cropdmgexp)] <- "3"
working_data$cropdmgexponent[grep("[Mm]", working_data$cropdmgexp)] <- "6"
working_data$cropdmgexponent[grep("[Bb]", working_data$cropdmgexp)] <- "9"
working_data$total_prop_damage <- (working_data$propdmg * 10^as.numeric(working_data$propdmgexponent))/1000000
working_data$total_crop_damage <- (working_data$cropdmg * 10^as.numeric(working_data$cropdmgexponent))/1000000
First, we get a rough of idea of the number of events and those most commonly seen.
no.events <- length(unique(working_data$evtype))
sort(table(working_data$evtype), decreasing = TRUE)[1:40]
##
## HAIL TSTM WIND THUNDERSTORM WIND
## 288661 219940 82563
## TORNADO FLASH FLOOD FLOOD
## 60652 54277 25326
## THUNDERSTORM WINDS HIGH WIND LIGHTNING
## 20843 20212 15754
## HEAVY SNOW HEAVY RAIN WINTER STORM
## 15708 11723 11433
## WINTER WEATHER FUNNEL CLOUD MARINE TSTM WIND
## 7026 6839 6175
## MARINE THUNDERSTORM WIND WATERSPOUT STRONG WIND
## 5812 3796 3566
## URBAN/SML STREAM FLD WILDFIRE BLIZZARD
## 3392 2761 2719
## DROUGHT ICE STORM EXCESSIVE HEAT
## 2488 2006 1678
## HIGH WINDS WILD/FOREST FIRE FROST/FREEZE
## 1533 1457 1342
## DENSE FOG WINTER WEATHER/MIX TSTM WIND/HAIL
## 1293 1104 1028
## EXTREME COLD/WIND CHILL HEAT HIGH SURF
## 1002 767 725
## TROPICAL STORM FLASH FLOODING EXTREME COLD
## 690 682 655
## COASTAL FLOOD LAKE-EFFECT SNOW FLOOD/FLASH FLOOD
## 650 636 624
## LANDSLIDE
## 600
I narrowed these 985 events in to 15 different rough classifications. Those are winter-related (snow, ice), wind, visibility(dust or fog), tornadoes, thunderstorms, rain, ocean-related, landslides, hurricanes, heat-related, hail, floods, fire, drought, and other. These classifications obviously are not perfect, and one figure below in particular will give us pause on our classifications, but in order to start generating first-pass exploratory graphs, I thought it sufficient.
The below code classifies these events in to these 15 categories. Note that the other was done first, then rain and wind categories because these were the more general categories, and so if for example a classification was “Hurricane Wind”, I wanted this to be attributed to hurricanes rather than wind.
working_data$event_class <- "Other"
working_data$event_class[grep("WINDS?|GUSTY|BURST", working_data$evtype, ignore.case = T)] <- "Wind"
working_data$event_class[grep("RAIN|PRECIP", working_data$evtype, ignore.case = T)] <- "Rain"
working_data$event_class[grep("SNOW|BLIZZARD|IC(E|Y)|WINT|COLD|CHILL|FREEZ|RECORD LOW|SLEET|FROST|MIX",
working_data$evtype, ignore.case = T)] <- "Winter"
working_data$event_class[grep("HAIL", working_data$evtype, ignore.case = T)] <- "Hail"
working_data$event_class[grep("TSTM|THUNDERSTORM|LIGHTN?ING", working_data$evtype, ignore.case = T)] <- "Thunderstorm"
working_data$event_class[grep("TORN|NADO|FUNNEL|WATERSPOUT", working_data$evtype, ignore.case = T)] <- "Tornado"
working_data$event_class[grep("FLOO?OD|FLD|SURF", working_data$evtype, ignore.case = T)] <- "Flood"
working_data$event_class[grep("FIRE", working_data$evtype, ignore.case = T)] <- "Fire"
working_data$event_class[grep("DROUGHT", working_data$evtype, ignore.case = T)] <- "Drought"
working_data$event_class[grep("HEAT|HIGH TEMPERATURE|WARMTH|RECORD HIGH", working_data$evtype, ignore.case = T)] <- "Heat"
working_data$event_class[grep("SLIDES?\\>|AVALAN", working_data$evtype, ignore.case = T)] <- "Landslides"
working_data$event_class[grep("RIP CURRENT|TIDE|MARINE|SEA|SURGE|WAVE", working_data$evtype, ignore.case = T)] <- "Ocean"
working_data$event_class[grep("HURRICANE|TROPICAL STORM|TYPHOON", working_data$evtype, ignore.case = T)] <- "Hurricane"
working_data$event_class[grep("FOG|DUST", working_data$evtype, ignore.case = T)] <- "Visibility"
Now lets take a look at our final classifications, as well as the percentage of the data that got classified as “Other”. If this were too high, we would want to perhaps take another look at some of our missing classifications.
sort(table(working_data$event_class), decreasing = TRUE)[1:length(unique(working_data$event_class))]
##
## Thunderstorm Hail Flood Tornado Winter Wind
## 340577 288838 87184 71544 46889 26327
## Ocean Rain Fire Heat Drought Visibility
## 14465 11902 4240 2757 2495 2472
## Landslides Hurricane Other
## 1040 996 571
other_percentage <- round(mean(working_data$event_class == "Other")*100,2)
I am happy with 0.06 percent slippage in the other category!
Finally, we need to subset based on years. As figure 1 below will show, tornadoes far outpace all other events in casualities. This is because they were the first events tracked by the data, and so had a long start time. We will look at the total affects below, but except for the first graph and andother tracking tornadoes since 1950, we will be using the following data subset.
modern_data <- filter(working_data, years >= 1996)
And that concludes our data processing.
First, lets take a look at the total outcomes of both fatalities and injuries across all years.
event_fatalsums <- working_data %>% group_by(event_class) %>% summarize(sums = sum(fatalities, na.rm = TRUE)) %>% mutate(tag = 'Fatalities')
event_injurysums <- working_data %>% group_by(event_class) %>% summarize(sums = sum(injuries, na.rm = TRUE)) %>% mutate(tag = 'Injuries')
total_health_sums <- rbind(event_fatalsums,event_injurysums)
casualitysums_bar <- ggplot(total_health_sums, aes(event_class, sums, fill = tag)) +
geom_bar(stat = "identity", color = "black") +
coord_flip() +
scale_fill_brewer(palette = "Spectral") +
labs(x ="Event Type", y = "Casualities", title = "Total Casualities by Event Type")
print(casualitysums_bar)
Clearly, we can’t look at all years for accurate representations of our data. Let’s try the modern data looking since 1996.
all_nados <- round(mean(working_data$event_class == 'Tornado')*100,2)
new_nados <- round(mean(modern_data$event_class == 'Tornado')*100,2)
modern_event_fatalsums <- modern_data %>% group_by(event_class) %>% summarize(sums = sum(fatalities, na.rm = TRUE)) %>% mutate(tag = 'Fatalities')
modern_event_injurysums <- modern_data %>% group_by(event_class) %>% summarize(sums = sum(injuries, na.rm = TRUE)) %>% mutate(tag = 'Injuries')
modern_total_health_sums <- rbind(modern_event_fatalsums,modern_event_injurysums)
modern_casualitysums_bar <- ggplot(modern_total_health_sums, aes(event_class, sums, fill = tag)) +
geom_bar(stat = "identity", color = "black") +
coord_flip() +
scale_fill_brewer(palette = "Spectral") +
labs(x ="Event Type", y = "Casualities", title = "Total Casualities by Event Type", subtitle = "Since 1996")
print(modern_casualitysums_bar)
So that clearly makes a difference, but tornadoes still dominate. Although, it does seem odd that the proportion of the dataset taken up by tornadoes only fell from 7.93 percent to 4.99 percent. That might be something to check in our data later.
It’s also no surprise that the most fatal events also cause the most injuries, though ocean and landslide events seem to be much more lethal in relative terms than the more destructive events like tornadoes and storms.
As the below graph shows, when ladnslide and ocean events do occur, they are significantly more lethal than other more broadly destructive events.
modern_lethality <- modern_event_fatalsums %>% bind_cols(modern_event_injurysums)
lethal_bar <- ggplot(modern_lethality,aes(event_class, sums/(sums1 + sums), fill = event_class)) +
geom_bar(stat = "identity", color = "black") +
ylim(0, 0.75) +
coord_flip() +
theme(legend.position="none") +
labs(x ="Event Type", y = "Lethality", title = "Lethality by Event Type", subtitle = "Since 1996")
print(lethal_bar)
One other interesting point. Our data may be more accurately reflecting the occurrence rate of different events. As shown here, the event with the highest number of casualties is actually the heat-related category. It is important to note, however, that the means may be low because there are a vast number of zero-casualty events for a given class.
event_fatalmeans <- modern_data %>% group_by(event_class) %>%
summarize(means = mean(fatalities, na.rm = TRUE)) %>% mutate(tag = 'Fatalities')
event_injurymeans <- modern_data %>% group_by(event_class) %>%
summarize(means = mean(injuries, na.rm = TRUE)) %>% mutate(tag = 'Injuries')
total_health_means <- rbind(event_fatalmeans,event_injurymeans)
casualitymeans_bar <- ggplot(total_health_means, aes(event_class, means, fill = tag)) +
geom_bar(stat = "identity", color = "black") +
coord_flip() +
scale_fill_brewer(palette = "Spectral") +
labs(x ="Event Type", y = "Casualities", title = "Mean Casualities by Event Type")
print(casualitymeans_bar)
It would also be interest to see how some of these events vary over time (or at least since 1996). The below graph plots the five most destructive event classes over the time period.
big_events <- c('Winter','Tornado','Thunderstorm','Heat','Flood')
big_event_data <- modern_data %>% filter(event_class %in% big_events) %>% mutate(casualities = fatalities + injuries)
big_event_sums <- big_event_data %>% group_by(event_class, years) %>% summarize(sums = sum(casualities, na.rm = TRUE))
big_event_line <- ggplot(big_event_sums, aes(x = years, y = sums, group = event_class, color = event_class)) +
geom_line() +
geom_point() +
labs(x = "Year", y = "Sum of Casualities", title = "Casualities by Event from 1996 to 2011") +
theme(legend.title=element_blank())
print(big_event_line)
and just for fun, here is another graph tracking tornadoes from 1950. It is interesting that every 30 years or so there is a particularly bad tornado season in the US causing a spike in casualties.
tornado_event_data <- working_data %>% filter(event_class == 'Tornado') %>% mutate(casualities = fatalities + injuries)
tornado_event_sums <- tornado_event_data %>% group_by(years) %>% summarize(sums = sum(casualities, na.rm = TRUE))
tornado_event_line <- ggplot(tornado_event_sums, aes(x = years, y = sums)) +
geom_line() +
geom_point()+
labs(x = 'Year', y = 'Sum of Casualities', title = 'Tornados from 1950 to 2011', subtitle = 'Just for fun')
print(tornado_event_line)
We can also try to examine some geographic data. Namely, I want to see how outcomes have changed for the 10 states with the most casualties causes by weather events have changed from 1996-2004 to 2004-2011.
rel_states <- big_event_data %>% group_by(state) %>% summarize(sums = sum(casualities, na.rm = TRUE)) %>%
filter(sums %in% sort(sums, decreasing = TRUE)[1:10])
state_sums_early <- big_event_data %>% filter(years <= 2004 & state %in% rel_states[[1]]) %>%
group_by(state,event_class) %>% summarize(sums = sum(casualities, na.rm = TRUE)) %>% mutate(tag = 'Pre-2004')
state_sums_late <- big_event_data %>% filter(years > 2004 & state %in% rel_states[[1]]) %>%
group_by(state,event_class) %>% summarize(sums = sum(casualities, na.rm = TRUE)) %>% mutate(tag = 'Post-2004')
state_sums_total <- rbind(state_sums_early,state_sums_late)
state_event_line <- ggplot(state_sums_total, aes(x = tag, y = sums, group = event_class, color = event_class)) +
geom_line() +
geom_point() +
facet_wrap( ~ state, ncol = 5) +
coord_cartesian(ylim=c(0, 2500)) +
labs(x = "Year",
y = "Sum of Casualities",
title = "Casualities by Event in Most Affected States",
subtitle = "Before and After 2004") +
theme(legend.title=element_blank())
print(state_event_line)
First, lets look at the aggregates of all damage for each of the major event types
event_propsums <- modern_data %>% group_by(event_class) %>%
summarize(sums = sum(total_prop_damage, na.rm = TRUE)) %>% mutate(tag = 'Property Damage')
event_cropsums <- modern_data %>% group_by(event_class) %>% summarize(sums = sum(total_crop_damage, na.rm = TRUE)) %>%
mutate(tag = 'Crop Damage')
total_damage_sums <- rbind(event_propsums,event_cropsums)
total_damage_sums_bar <- ggplot(total_damage_sums, aes(event_class, sums, fill = tag)) +
geom_bar(stat = "identity", color = "black") +
coord_flip() +
scale_fill_brewer(palette = "Spectral") +
labs(x ="Event Type", y = "Damage in $MM", title = "Total Damage by Event Type", subtitle = "Since 1996")
print(total_damage_sums_bar)
There seems to be a big difference between each type of damage, so lets look at each seperately
propsums_bar <- ggplot(event_propsums,aes(event_class, sums, fill = event_class)) +
geom_bar(stat = "identity", color = "black") +
coord_flip() +
theme(legend.position="none") +
ylim(0,200000) +
labs(x ="Event Type", y = "Damage in $MM", title = "Total Property Damage by Event Type")
print(propsums_bar)
cropsums_bar <- ggplot(event_cropsums,aes(event_class, sums, fill = event_class)) +
geom_bar(stat = "identity", color = "black") +
coord_flip() +
theme(legend.position="none") +
ylim(0,15000) +
labs(x ="Event Type", y = "Damage in $MM", title = "Total Crop Damage by Event Type")
print(cropsums_bar)
And look at these over time
## By Time
big_prop_events <- c('Flood','Hurricane','Ocean','Tornado','Hail')
big_prop_data <- modern_data %>% filter(event_class %in% big_prop_events)
big_crop_events <- c('Drought','Flood','Hurricane','Winter','Hail')
big_crop_data <- modern_data %>% filter(event_class %in% big_crop_events)
big_prop_sums <- big_prop_data %>% group_by(event_class, years) %>% summarize(sums = sum(total_prop_damage, na.rm = TRUE))
big_prop_line <- ggplot(big_prop_sums, aes(x = years, y = sums, group = event_class, color = event_class)) +
geom_line() +
geom_point() +
labs(x = "Year", y = "Property Damage", title = "Property Damage by Event from 1996 to 2011") +
theme(legend.title=element_blank())
print(big_prop_line)
## Guessing on these, "ocean" is probably mislabeled (2004 was hurricane Katrina)
## The big spike in the Flood in 2006 is likely the Mid-Atlantic Flood
big_crop_sums <- big_crop_data %>% group_by(event_class, years) %>% summarize(sums = sum(total_crop_damage, na.rm = TRUE))
big_crop_line <- ggplot(big_crop_sums, aes(x = years, y = sums, group = event_class, color = event_class)) +
geom_line() +
geom_point() +
labs(x = "Year", y = "Crop Damage", title = "Crop Damage by Event from 1996 to 2011") +
theme(legend.title=element_blank())
print(big_crop_line)
The property damage time graph is particularly interesting because it may point to a mislabeling of the data. it’s odd that hurricane and ocean events spike with one another during Hurricane Katrina, although these could of course just be correlted events
In conclusion, tornadoes seem to be the greatest weather-related health risk to Americans given this data, followed in aggregate terms by flooding, thunderstorms, and heat-related events. This information comes with the important caveat that some events, when they occur, are particularly dangerous, such as landslides and ocean-related events.
On the economic side, property damage vastly outweighs crop damage in these weather events. The events causing the most property damage were floods, hurricanes, ocean-related events, and tornadoes.
Crop damage was a result of mostly droughts, followed by hail and flooding.
There are some causes for concern in my processing of the data. There do seem to be some correlations between similar events, such as hurricanes and ocean swells. If I had more time to completely the assignment, one route I would really like to use k-means clustering based on the latitude and longitude data to attempt to more accurately group the weather events together. For example, in some of the first data in Alabama, there were simultaneous thunderstorms and tornadoes. This was likely a single weather event, and not accounting for these correlations is a big weakness in the current analysis.