The aim of this analysis is to provide tools for government officials who are responsible for preparing for severe weather events, so they can prioritize where resources should go. To that end, we went through data covering all severe weather events in the US from 1950 to 2011, as collected in the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. The goal is to identify which events have the biggest impact on public health (defined as the number of fatalities and/or injuries reported as directly attributable to those events), and which have the biggest economic consequences (defined as total property damage in dollars). While exploring the data we found that the property damage variable used several different units of measure, ranging from thousands, millions, and billions of dollars to the number of properties damaged. With this in mind, we kept only the events whose damage was expressed in dollars, and normalized the units so they could be added together. In the end, we identified tornadoes as the events with the biggest impact on public health (50% of total fatalities and/or injuries), and floods as the events with the biggest economic consequences, with total property damage of approximately 144.7B dollars (27% of the total).
Let's load the libraries that we will be using during our analysis
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
#Creates a helper function (.simpleCap) that capitalizes the first letter of
#each word (taken from the R help file)
.simpleCap <- function(x) {
  s <- strsplit(x, " ")[[1]]
  paste(toupper(substring(s, 1, 1)), substring(s, 2),
        sep = "", collapse = " ")
}
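As a quick illustration of the helper (the sample string below is made up for the example):
#Capitalizes the first letter of each word in a sample string
.simpleCap("high surf advisory")
## [1] "High Surf Advisory"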
First, we load the storm data by downloading it from the NOAA web site. Since this data set comes as a bzip2-compressed CSV file with comma-separated columns, we can use the read.csv function (which decompresses the file automatically) to load it into a data frame
#Download the raw data file from the NOAA web site
if (!dir.exists(file.path(getwd(), "raw_data"))) dir.create(file.path(getwd(), "raw_data"))
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
              file.path(getwd(), "raw_data", "StormData.csv.bz2"), mode = "wb")
stormDataDF <- read.csv(file.path(getwd(), "raw_data", "StormData.csv.bz2"), stringsAsFactors = FALSE)
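Since the compressed file is fairly large, it can be convenient to skip the download when a local copy already exists. A small convenience sketch (not part of the original workflow; the destfile name is just a local variable for the example):
#Only download the file when we don't already have a local copy
destfile <- file.path(getwd(), "raw_data", "StormData.csv.bz2")
if (!file.exists(destfile)) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                destfile, mode = "wb")
}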
Let’s check the first few rows of the data set (the set is composed of 902,297 observations of 37 variables)
#How many rows and columns does the data set have?
dim(stormDataDF)
## [1] 902297 37
#Let's check the first few rows and just a few of the columns so it fits neatly in the display
head(stormDataDF[,1:9])
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE
## 1 TORNADO 0
## 2 TORNADO 0
## 3 TORNADO 0
## 4 TORNADO 0
## 5 TORNADO 0
## 6 TORNADO 0
Let's check whether we have missing values in the columns EVTYPE, FATALITIES, INJURIES, and PROPDMG, which are key to our analysis
#Are there missing values in EVTYPE?
sum(is.na(stormDataDF$EVTYPE))
## [1] 0
#After looking at the unique values in the EVTYPE column we can see that the
#values are not normalized, so we'll apply some basic data cleansing to fix a
#few obvious issues.
#Strategy:
# 1. Convert all events to lowercase
# 2. Capitalize the first letter of each word
# 3. Trim leading and trailing blank spaces
#Values BEFORE change
head(sort(unique(stormDataDF$EVTYPE)))
## [1] " HIGH SURF ADVISORY" " COASTAL FLOOD" " FLASH FLOOD"
## [4] " LIGHTNING" " TSTM WIND" " TSTM WIND (G45)"
#Apply changes (tolower and trimws are vectorized, so only .simpleCap needs sapply)
stormDataDF$EVTYPE <- trimws(sapply(tolower(stormDataDF$EVTYPE), .simpleCap, USE.NAMES = FALSE))
#Values AFTER changes
head(sort(unique(stormDataDF$EVTYPE)))
## [1] "?" "Abnormal Warmth" "Abnormally Dry"
## [4] "Abnormally Wet" "Accumulated Snowfall" "Agricultural Freeze"
#Are there missing values in FATALITIES?
sum(is.na(stormDataDF$FATALITIES))
## [1] 0
#Are there missing values in INJURIES?
sum(is.na(stormDataDF$INJURIES))
## [1] 0
#Are there missing values in PROPDMG?
sum(is.na(stormDataDF$PROPDMG))
## [1] 0
We have to analyze the impact of these events on public health (represented in this data set by fatalities and injuries) and economic consequences (represented in this data set by property damage), so let’s take a look at the different values of these variables.
Let's get an overview of the fatality values
#Creates a new dataframe with the events that had fatalities and / or injuries
stormFatalitiesInjuries <- subset(stormDataDF, FATALITIES > 0 | INJURIES > 0)
#Runs a five-number summary on the FATALITIES column
summary(subset(stormFatalitiesInjuries, FATALITIES > 0)$FATALITIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 2.172 2.000 583.000
#Let's see how many events are reporting the most fatalities
stormEventsbyFatalities <- summarise(group_by(stormFatalitiesInjuries, FATALITIES), Total.Obs = n())
arrange(stormEventsbyFatalities, desc(FATALITIES))
## # A tibble: 52 × 2
## FATALITIES Total.Obs
## <dbl> <int>
## 1 583 1
## 2 158 1
## 3 116 1
## 4 114 1
## 5 99 1
## 6 90 1
## 7 75 1
## 8 74 1
## 9 67 1
## 10 57 2
## # ... with 42 more rows
We can see that most events with fatalities report just one (the median is 1), and that a handful of extreme events drive the long right tail, with a single event reporting 583 fatalities.
Now, let’s take a look at the Injuries data
#Runs a five-number summary on the INJURIES column
summary(subset(stormFatalitiesInjuries, INJURIES > 0)$INJURIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 7.983 4.000 1700.000
#Let's see how many events are reporting the most injuries
stormEventsbyInjuries <- summarise(group_by(stormFatalitiesInjuries, INJURIES), Total.Obs = n())
arrange(stormEventsbyInjuries, desc(INJURIES))
## # A tibble: 200 × 2
## INJURIES Total.Obs
## <dbl> <int>
## 1 1700 1
## 2 1568 1
## 3 1228 1
## 4 1150 2
## 5 800 2
## 6 785 1
## 7 780 1
## 8 750 1
## 9 700 1
## 10 600 1
## # ... with 190 more rows
#Gets the number of events that reported at least 500 injuries
sum(subset(stormEventsbyInjuries, INJURIES >= 500)$Total.Obs)
## [1] 24
We can see that most events with injuries report only a few (the median is 2), while a handful of extreme events reach into the hundreds or thousands (the maximum is 1,700); only 24 events reported at least 500 injuries.
Finally, let’s check the Property Damage data
#Gets the different units used to measure property damage
unique(stormDataDF$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
#Creates a new dataframe that only contains the records that have property damage greater than 0 and
#for which the units of measure are K, M, m, B
stormPropertyDamage <- subset(stormDataDF, PROPDMG > 0 & PROPDMGEXP %in% c("K", "M", "m", "B"))
#Adds a new column that expresses all property damage values in dollars
stormPropertyDamage <- mutate(stormPropertyDamage,
                              Norm.PropDmg = ifelse(PROPDMGEXP %in% c("M", "m"), PROPDMG * 10^6,
                                             ifelse(PROPDMGEXP == "K", PROPDMG * 10^3,
                                                    PROPDMG * 10^9)))
#Runs a five-number summary on the new Norm.PropDmg column
summary(stormPropertyDamage$Norm.PropDmg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000e+01 3.000e+03 1.000e+04 1.789e+06 5.000e+04 1.150e+11
#Counts how many observations report each property damage amount
stormEventsbyPropDmg <- summarise(group_by(stormPropertyDamage, Norm.PropDmg), Total.Obs = n())
arrange(stormEventsbyPropDmg, desc(Norm.PropDmg))
## # A tibble: 1,653 × 2
## Norm.PropDmg Total.Obs
## <dbl> <int>
## 1 1.150e+11 1
## 2 3.130e+10 1
## 3 1.693e+10 1
## 4 1.126e+10 1
## 5 1.000e+10 1
## 6 7.350e+09 1
## 7 5.880e+09 1
## 8 5.420e+09 1
## 9 5.150e+09 1
## 10 5.000e+09 2
## # ... with 1,643 more rows
#Gets the number of events that reported at least 100,000 in damage
sum(subset(stormEventsbyPropDmg, Norm.PropDmg >= 10^5)$Total.Obs)
## [1] 40593
#Gets the percentage of the total events that generated at least 100,000 in damage
sum(subset(stormEventsbyPropDmg, Norm.PropDmg >= 10^5)$Total.Obs) / sum(stormEventsbyPropDmg$Total.Obs)
## [1] 0.169954
#Gets the number of events that reported at least 1,000,000 in damage
sum(subset(stormEventsbyPropDmg, Norm.PropDmg >= 10^6)$Total.Obs)
## [1] 10926
#Gets the percentage of the total events that generated at least 1,000,000 in damage
sum(subset(stormEventsbyPropDmg, Norm.PropDmg >= 10^6)$Total.Obs) / sum(stormEventsbyPropDmg$Total.Obs)
## [1] 0.04574477
We can see that there are many different units of measurement for property damage. We are going to focus only on those that we can safely assume represent money explicitly: K (thousands), M/m (millions), and B (billions of dollars).
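To make the unit normalization concrete, here is a minimal sketch of the same mapping applied to a few hypothetical PROPDMG / PROPDMGEXP pairs (the values are made up for illustration, not taken from the data set):
#Hypothetical damage figures and their exponent codes (illustrative only)
example <- data.frame(PROPDMG = c(25, 2.5, 1.5), PROPDMGEXP = c("K", "M", "B"))
#Same nested ifelse used above: K = thousands, M/m = millions, B = billions
mutate(example, Norm.PropDmg = ifelse(PROPDMGEXP %in% c("M", "m"), PROPDMG * 10^6,
                               ifelse(PROPDMGEXP == "K", PROPDMG * 10^3,
                                      PROPDMG * 10^9)))
#Expected: 25,000; 2,500,000; 1,500,000,000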
Let's take a look at the events that have the biggest impact on public health, by adding up all fatalities and injuries by event type.
There are 193 event types in total that reported fatalities and/or injuries. That is too many to plot meaningfully, so we'll focus on the event types whose total impact lies above the 80th percentile (39 event types).
We can clearly see that tornadoes are, by far, the severe weather events with the biggest impact on public health, with 96,979 fatalities/injuries (62% of the plotted subset), followed by excessive heat with 8,428 fatalities/injuries (5.41%).
#Converts event types to lowercase so that grouping is case-insensitive
stormFatalitiesInjuries$EVTYPE <- tolower(stormFatalitiesInjuries$EVTYPE)
#Creates a new dataframe that adds up all the injuries and fatalities by event
stormFatalitiesInjuriesByEvent <- summarise(group_by(stormFatalitiesInjuries, EVTYPE = EVTYPE), Total.Impact = sum(INJURIES + FATALITIES))
#Counts the event types whose total impact is above the 80th percentile
nrow(subset(stormFatalitiesInjuriesByEvent,
            Total.Impact > quantile(stormFatalitiesInjuriesByEvent$Total.Impact, 0.80)))
## [1] 39
#Keeps only the event types above the 80th percentile (there are 193 different
#event types; plotting all of them would create a messy and unreadable plot)
stormFatalitiesInjuriesByEvent <- arrange(stormFatalitiesInjuriesByEvent, desc(Total.Impact))[
    1:nrow(subset(stormFatalitiesInjuriesByEvent,
                  Total.Impact > quantile(stormFatalitiesInjuriesByEvent$Total.Impact, 0.80))), ]
#Reorders the factor levels so the events plot in descending order of impact
stormFatalitiesInjuriesByEvent$EVTYPE <- factor(
    stormFatalitiesInjuriesByEvent$EVTYPE,
    levels = stormFatalitiesInjuriesByEvent$EVTYPE[order(stormFatalitiesInjuriesByEvent$Total.Impact,
                                                         decreasing = TRUE)])
#Creates a bar chart with the total public health impact of each of the
#event types above the 80th percentile
g <- ggplot(stormFatalitiesInjuriesByEvent, aes(y = Total.Impact, x = EVTYPE, fill = as.factor(EVTYPE)))
g + geom_bar(stat = "identity") +
    theme(axis.text.x = element_blank(),
          axis.title.x = element_blank(),
          legend.key.size = unit(0.20, "cm"),
          legend.title = element_text(size = 10),
          legend.position = "bottom",
          legend.direction = "horizontal") +
    ggtitle("Severe Weather Events Impacting 80% of Public Health") +
    ylab("Total Impact in Public Health \n (Fatalities / Injuries)") +
    scale_fill_discrete(name = "Event Type")
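As a sanity check on the percentages quoted above, the share of the plotted subset attributable to the top event type can be computed directly (a minimal sketch; it assumes stormFatalitiesInjuriesByEvent is still sorted in descending order of Total.Impact, as above):
#Share of the plotted subset attributable to the top event type (tornadoes)
top.share <- stormFatalitiesInjuriesByEvent$Total.Impact[1] /
    sum(stormFatalitiesInjuriesByEvent$Total.Impact)
round(top.share * 100, 2)  #Should be close to the 62% quoted above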
Let's take a look at the events that have the biggest economic consequences, by adding up all property damage by event type.
There are 370 event types in total that reported property damage. That is too many to plot meaningfully, so we'll focus on the event types whose total property damage lies above the 80th percentile (74 event types).
We can clearly see that floods are, by far, the severe weather events with the biggest economic consequences, with total property damage of approximately 144.7B dollars (33.9% of the plotted subset), followed by hurricanes/typhoons with approximately 69.3B dollars in property damage (16.23%) and tornadoes with 56.9B dollars (13.33%).
#Creates a new dataframe that adds up all the property damage by event type
stormPropDmgByEvent <- summarise(group_by(stormPropertyDamage, EVTYPE = EVTYPE), Total.Impact = sum(Norm.PropDmg))
#Counts the event types whose total property damage is above the 80th percentile
nrow(subset(stormPropDmgByEvent,
            Total.Impact > quantile(stormPropDmgByEvent$Total.Impact, 0.80)))
## [1] 74
#Keeps only the event types above the 80th percentile (there are 370 different
#event types; plotting all of them would create a messy and unreadable plot)
stormPropDmgByEvent <- arrange(stormPropDmgByEvent, desc(Total.Impact))[
    1:nrow(subset(stormPropDmgByEvent,
                  Total.Impact > quantile(stormPropDmgByEvent$Total.Impact, 0.80))), ]
#Reorders the factor levels so the events plot in descending order of impact
stormPropDmgByEvent$EVTYPE <- factor(
    stormPropDmgByEvent$EVTYPE,
    levels = stormPropDmgByEvent$EVTYPE[order(stormPropDmgByEvent$Total.Impact,
                                              decreasing = TRUE)])
#Creates a bar chart with the total economic consequences of each of the
#event types above the 80th percentile
g <- ggplot(stormPropDmgByEvent, aes(y = Total.Impact, x = EVTYPE, fill = as.factor(EVTYPE)))
g + geom_bar(stat = "identity") +
    theme(axis.text.x = element_blank(),
          axis.title.x = element_blank(),
          legend.key.size = unit(0.20, "cm"),
          legend.title = element_text(size = 10),
          legend.position = "bottom",
          legend.direction = "horizontal") +
    ggtitle("Severe Weather Events with 80% of Economic Consequences") +
    ylab("Total Economic Consequences \n (Property Damage in dollars)") +
    scale_fill_discrete(name = "Event Type")
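The same sanity check works for the economic side (again assuming stormPropDmgByEvent is still sorted in descending order of Total.Impact):
#Share of the plotted subset attributable to the top event type (floods)
top.share <- stormPropDmgByEvent$Total.Impact[1] / sum(stormPropDmgByEvent$Total.Impact)
round(top.share * 100, 2)  #Should be close to the 33.9% quoted above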