Synopsis

The aim of this analysis is to provide tools for government officials responsible for preparing for severe weather events, so they can prioritize where resources should go. To do so, we went through data on all severe weather events in the US from 1950 to 2011, as collected in the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. The goal is to identify which events have the biggest impact on public health (defined as the number of fatalities and/or injuries reported as being directly attributed to said events), and which have the biggest economic consequences (defined as the total property damage in dollars). While exploring the data we found that the property damage variable used several different units of measure, ranging from thousands, millions and billions of dollars to the number of properties damaged. With this in mind, we kept only those events for which the damage was expressed in dollars, and normalized the units so they could be added together. In the end, we were able to identify that Tornadoes are the events with the biggest impact on public health (50% of total fatalities and/or injuries) and Floods are the events with the biggest economic consequences, with total property damage of approximately 144.7B dollars (27% of the total).

Data Processing

Let's load the libraries that we will be using during our analysis

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

#Creates a function (.simpleCap) that will allow us to Capitalize the first letter of each word
#(taken from R help file)
.simpleCap <- function(x) {
    s <- strsplit(x, " ")[[1]]
    paste(toupper(substring(s, 1, 1)), substring(s, 2),
          sep = "", collapse = " ")
}
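
As a quick illustration, .simpleCap capitalizes each word of a lower-cased label but does not remove leading spaces, which is why we also trim whitespace later (the sample label below is one of the raw EVTYPE values shown further down in this report):

#Capitalize a raw event label after converting it to lower case
.simpleCap(tolower("   HIGH SURF ADVISORY"))
## [1] "   High Surf Advisory"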

First, we load the Storm Data by downloading it from the NOAA web site. Since this data set comes as a compressed CSV file, with columns separated by commas (","), we can use the read.csv function to load it directly into a data frame

#Download the raw data file from the NOAA web site
#(assumes a raw_data directory already exists under the working directory)
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
              file.path(getwd(), "raw_data", "StormData.csv.bz2"))
stormDataDF <- read.csv(file.path(getwd(), "raw_data", "StormData.csv.bz2"), stringsAsFactors = FALSE)
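
To avoid repeating the download on every run, a simple guard can be added (a sketch, not part of the original analysis; it assumes the same raw_data directory under the working directory):

#Sketch: download the compressed file only if it is not already on disk
rawFile <- file.path(getwd(), "raw_data", "StormData.csv.bz2")
if (!file.exists(rawFile)) {
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", rawFile)
}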

Let’s check the first few rows of the data set (the set is composed of 902,297 observations of 37 variables)

#How many rows and columns does the data set have?
dim(stormDataDF)
## [1] 902297     37
#Let's check the first few rows and just a few of the columns so it fits neatly in the display
head(stormDataDF[,1:9])
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE
## 1 TORNADO         0
## 2 TORNADO         0
## 3 TORNADO         0
## 4 TORNADO         0
## 5 TORNADO         0
## 6 TORNADO         0

Let's check whether there are missing values in the columns EVTYPE, FATALITIES, INJURIES, and PROPDMG, which are key to our analysis

#Are there missing values in EVTYPE?
sum(is.na(stormDataDF$EVTYPE))
## [1] 0
#After looking at the unique values in the EVTYPE column we can see that the
#values are not normalized, so we'll apply some basic data cleansing to fix
#some obvious issues.
#Strategy: 

#  1. Convert all events to lower case
#  2. Convert each word to an initial capital
#  3. Trim leading and trailing blank spaces

#Values BEFORE change
head(sort(unique(stormDataDF$EVTYPE)))
## [1] "   HIGH SURF ADVISORY" " COASTAL FLOOD"        " FLASH FLOOD"         
## [4] " LIGHTNING"            " TSTM WIND"            " TSTM WIND (G45)"
#Apply changes (tolower and trimws are vectorized; .simpleCap is applied element by element)
stormDataDF$EVTYPE <- trimws(sapply(tolower(stormDataDF$EVTYPE), .simpleCap))

#Values AFTER changes
head(sort(unique(stormDataDF$EVTYPE)))
## [1] "?"                    "Abnormal Warmth"      "Abnormally Dry"      
## [4] "Abnormally Wet"       "Accumulated Snowfall" "Agricultural Freeze"
#Are there missing values in FATALITIES?
sum(is.na(stormDataDF$FATALITIES))
## [1] 0
#Are there missing values in INJURIES?
sum(is.na(stormDataDF$INJURIES))
## [1] 0
#Are there missing values in PROPDMG?
sum(is.na(stormDataDF$PROPDMG))
## [1] 0
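
An equivalent, more compact way to run the same missing-value check across all four key columns at once (a sketch, not part of the original output):

#Sketch: count missing values in all four key columns in one call
colSums(is.na(stormDataDF[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG")]))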

Results

We have to analyze the impact of these events on public health (represented in this data set by fatalities and injuries) and their economic consequences (represented by property damage), so let's take a look at the different values of these variables.

Fatalities

Let's start with an overview of the Fatalities values

#Creates a new dataframe with the events that had fatalities and / or injuries
stormFatalitiesInjuries <- subset(stormDataDF, FATALITIES > 0 | INJURIES > 0)

#Runs a summary (five-number summary plus the mean) of the FATALITIES column
summary(subset(stormFatalitiesInjuries, FATALITIES > 0)$FATALITIES)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   1.000   1.000   2.172   2.000 583.000
#Let's see how many observations report each fatality count, from largest to smallest
stormEventsbyFatalities <- summarise(group_by(stormFatalitiesInjuries, FATALITIES), Total.Obs = n())
arrange(stormEventsbyFatalities, desc(FATALITIES))
## # A tibble: 52 × 2
##    FATALITIES Total.Obs
##         <dbl>     <int>
## 1         583         1
## 2         158         1
## 3         116         1
## 4         114         1
## 5          99         1
## 6          90         1
## 7          75         1
## 8          74         1
## 9          67         1
## 10         57         2
## # ... with 42 more rows

We can see that:

  • 75% of the events for which fatalities were reported had at most 2 fatalities
  • There is 1 event for which 583 fatalities were reported (see the quick check below)
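
As a quick check (not part of the original summary), the single record behind the 583 figure can be pulled out directly:

#Quick check: the record with the maximum reported fatalities
subset(stormFatalitiesInjuries, FATALITIES == max(FATALITIES))[, c("BGN_DATE", "STATE", "EVTYPE", "FATALITIES")]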

Injuries

Now, let’s take a look at the Injuries data

#Runs a summary (five-number summary plus the mean) of the INJURIES column
summary(subset(stormFatalitiesInjuries, INJURIES > 0)$INJURIES)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    1.000    1.000    2.000    7.983    4.000 1700.000
#Let's see how many observations report each injury count, from largest to smallest
stormEventsbyInjuries <- summarise(group_by(stormFatalitiesInjuries, INJURIES), Total.Obs = n())
arrange(stormEventsbyInjuries, desc(INJURIES))
## # A tibble: 200 × 2
##    INJURIES Total.Obs
##       <dbl>     <int>
## 1      1700         1
## 2      1568         1
## 3      1228         1
## 4      1150         2
## 5       800         2
## 6       785         1
## 7       780         1
## 8       750         1
## 9       700         1
## 10      600         1
## # ... with 190 more rows
#Gets the number of events that reported at least 500 injuries
sum(subset(stormEventsbyInjuries, INJURIES >= 500)$Total.Obs)
## [1] 24

We can see that:

  • 75% of the events for which injuries were reported had at most 4 injuries
  • There are 24 events that reported at least 500 injuries

Property Damage

Finally, let’s check the Property Damage data

#Gets the different codes used to express the units of property damage (PROPDMGEXP)
unique(stormDataDF$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
#Creates a new dataframe that only contains the records that have property damage greater than 0 and
#for which the units of measure are K, M, m, B
stormPropertyDamage <- subset(stormDataDF, PROPDMG > 0 & PROPDMGEXP %in% c("K", "M", "m", "B"))

#Adds a new column to the dataframe that expresses all property damage values in dollars
stormPropertyDamage <- mutate(stormPropertyDamage,
                              Norm.PropDmg = ifelse(PROPDMGEXP %in% c("M", "m"), PROPDMG * (10^6),
                                              ifelse(PROPDMGEXP == "K", PROPDMG * (10^3), PROPDMG * (10^9))))

#Runs a summary (five-number summary plus the mean) of the new Norm.PropDmg column
summary(stormPropertyDamage$Norm.PropDmg)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 1.000e+01 3.000e+03 1.000e+04 1.789e+06 5.000e+04 1.150e+11
#Creates a new dataframe with the events that had property damage
stormEventsbyPropDmg <- summarise(group_by(stormPropertyDamage, Norm.PropDmg), Total.Obs = n())
arrange(stormEventsbyPropDmg, desc(Norm.PropDmg))
## # A tibble: 1,653 × 2
##    Norm.PropDmg Total.Obs
##           <dbl>     <int>
## 1     1.150e+11         1
## 2     3.130e+10         1
## 3     1.693e+10         1
## 4     1.126e+10         1
## 5     1.000e+10         1
## 6     7.350e+09         1
## 7     5.880e+09         1
## 8     5.420e+09         1
## 9     5.150e+09         1
## 10    5.000e+09         2
## # ... with 1,643 more rows
#Gets the number of events that reported at least 100,000 in damage
sum(subset(stormEventsbyPropDmg, Norm.PropDmg >= 10^5)$Total.Obs)
## [1] 40593
#Gets the proportion of events that generated at least $100,000 in damage
sum(subset(stormEventsbyPropDmg, Norm.PropDmg >= 10^5)$Total.Obs) / sum(stormEventsbyPropDmg$Total.Obs)
## [1] 0.169954
#Gets the number of events that reported at least 1,000,000 in damage
sum(subset(stormEventsbyPropDmg, Norm.PropDmg >= 10^6)$Total.Obs)
## [1] 10926
#Gets the proportion of events that generated at least $1,000,000 in damage
sum(subset(stormEventsbyPropDmg, Norm.PropDmg >= 10^6)$Total.Obs) / sum(stormEventsbyPropDmg$Total.Obs)
## [1] 0.04574477

We can see that there are many different codes in PROPDMGEXP for expressing property damage. We are going to focus only on those that we can safely assume explicitly represent dollar amounts: K, M, m, and B.

  • 75% of the events for which property damage was reported had at most $50,000 in damage
  • 40,593 events (approximately 17% of the total) had at least $100,000 in damage
  • 10,926 events (approximately 4.6% of the total) had at least $1,000,000 in damage
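
As a side note, the unit conversion applied above can also be written with a named lookup vector instead of nested ifelse calls (a sketch that produces the same Norm.PropDmg values):

#Sketch: map each exponent code to a dollar multiplier and recompute the normalized damage
expMultiplier <- c(K = 10^3, M = 10^6, m = 10^6, B = 10^9)
stormPropertyDamage$Norm.PropDmg <- stormPropertyDamage$PROPDMG *
    unname(expMultiplier[stormPropertyDamage$PROPDMGEXP])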

What events impact Public Health the most?

Let's take a look at the events that have the biggest impact on public health, by adding up all fatalities and injuries by event type.

There are in total 193 event types that reported fatalities and/or injuries. These are too many to plot meaningfully, so we'll look only at the 39 event types whose total impact falls above the 80th percentile.

We can clearly see that Tornadoes are, by far, the severe weather events with the biggest impact on public health, with 96,979 fatalities and injuries (62% of the plotted subset), followed by Excessive Heat with 8,428 fatalities and injuries (5.41%)

#Converts the event names to lower case before grouping them by event type
stormFatalitiesInjuries$EVTYPE <- tolower(stormFatalitiesInjuries$EVTYPE)

#Creates a new dataframe that adds up all the injuries and fatalities by event
stormFatalitiesInjuriesByEvent <- summarise(group_by(stormFatalitiesInjuries, EVTYPE = EVTYPE), Total.Impact = sum(INJURIES + FATALITIES))

#Counts the event types whose total impact is above the 80th percentile
nrow(subset(stormFatalitiesInjuriesByEvent, Total.Impact > quantile(stormFatalitiesInjuriesByEvent$Total.Impact, c(0.80))))
## [1] 39
#Keeps only the event types whose total impact is above the 80th percentile
#(there are 193 different event types; plotting all of them would create a
#messy and unreadable plot)
stormFatalitiesInjuriesByEvent <- arrange(stormFatalitiesInjuriesByEvent, desc(Total.Impact))[
    1:nrow(subset(stormFatalitiesInjuriesByEvent,
                  Total.Impact > quantile(stormFatalitiesInjuriesByEvent$Total.Impact, c(0.80)))), ]

#Reorders the dataframe as factors to plot the events in a descending order
stormFatalitiesInjuriesByEvent$EVTYPE <- factor(stormFatalitiesInjuriesByEvent$EVTYPE, levels = stormFatalitiesInjuriesByEvent$EVTYPE[order(stormFatalitiesInjuriesByEvent$Total.Impact, decreasing = TRUE)])

#Creates a bar chart with the public health impact of each of the event types
#above the 80th percentile
g <- ggplot(stormFatalitiesInjuriesByEvent, aes(y = Total.Impact, x = EVTYPE, fill = as.factor(EVTYPE)))
g + geom_bar(stat = "identity") + 
    theme(axis.text.x = element_blank(),
          axis.title.x = element_blank(), 
          legend.key.size = unit(0.20, "cm"), 
          legend.title = element_text(size = 10),
          legend.position = "bottom",
          legend.direction = "horizontal"
          ) +
    ggtitle("Severe Weather Events Impacting 80% of Public Health") +
    ylab("Total Impact in Public Health \n (Fatalities / Injuries)") +
    scale_fill_discrete(name="Event Type")
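
As a quick check of the figures quoted above (not part of the original output), the top rows of the aggregated table can be printed directly; at this point the dataframe is already sorted in descending order of Total.Impact:

#Quick check: the five event types with the largest combined fatalities and injuries
head(stormFatalitiesInjuriesByEvent, 5)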

What events have the biggest Economic Consequences?

Let's take a look at the events that have the biggest economic consequences, by adding up all property damage by event type.

There are in total 370 event types that reported property damage. These are too many to plot meaningfully, so we'll look only at the 74 event types whose total property damage falls above the 80th percentile.

We can clearly see that Floods are, by far, the severe weather events with the biggest economic consequences, with total property damage of approximately 144.7B dollars (33.9% of the plotted subset), followed by Hurricanes / Typhoons with approximately 69.3B dollars (16.23%) and Tornadoes with 56.9B dollars (13.33%)

#Creates a new dataframe that adds up all property damage by event type
stormPropDmgByEvent <- summarise(group_by(stormPropertyDamage, EVTYPE = EVTYPE), Total.Impact = sum(Norm.PropDmg))

#Counts the event types whose total property damage is above the 80th percentile
nrow(subset(stormPropDmgByEvent, Total.Impact > quantile(stormPropDmgByEvent$Total.Impact, c(0.80))))
## [1] 74
#Keeps only the event types whose total property damage is above the 80th
#percentile (there are 370 different event types; plotting all of them would
#create a messy and unreadable plot)
stormPropDmgByEvent <- arrange(stormPropDmgByEvent, desc(Total.Impact))[
    1:nrow(subset(stormPropDmgByEvent,
                  Total.Impact > quantile(stormPropDmgByEvent$Total.Impact, c(0.80)))), ]

#Reorders the dataframe as factors to plot the events in a descending order
stormPropDmgByEvent$EVTYPE <- factor(stormPropDmgByEvent$EVTYPE, levels = stormPropDmgByEvent$EVTYPE[order(stormPropDmgByEvent$Total.Impact, decreasing = TRUE)])

#Creates a bar chart with the total property damage of each of the event types
#above the 80th percentile
g <- ggplot(stormPropDmgByEvent, aes(y = Total.Impact, x = EVTYPE, fill = as.factor(EVTYPE)))
g + geom_bar(stat = "identity") + 
    theme(axis.text.x = element_blank(),
          axis.title.x = element_blank(), 
          legend.key.size = unit(0.20, "cm"), 
          legend.title = element_text(size = 10),
          legend.position = "bottom",
          legend.direction = "horizontal"
          ) +
    ggtitle("Severe Weather Events with 80% of Economic Consequences") +
    ylab("Total Economic Consequences \n (Property Damage in dollars)") +
    scale_fill_discrete(name="Event Type")
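
As a quick check of the percentages quoted above (not part of the original output), the share of the plotted total attributable to the top event type can be computed directly; at this point the dataframe is already sorted in descending order of Total.Impact:

#Quick check: share of the plotted total property damage attributable to the top event type (Floods)
stormPropDmgByEvent$Total.Impact[1] / sum(stormPropDmgByEvent$Total.Impact)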