Synopsis

This analysis aims to identify the weather events that are most harmful to population health and the economy of the United States. The data used in the analysis comes from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The events in the database start in the year 1950 and end in November 2011.

The number of injuries and number of deaths were chosen as indicators for population health, while property damage and crop damage were chosen for economic impact. The total casualties and economic damage from 1950 to 2011 were calculated for each event and the results are presented in two graphs that show, in decreasing order, the top 10 most harmful weather events.

The analysis shows that tornadoes were responsible for the most deaths and injuries, followed by excessive heat and thunderstorm winds. Floods, however, were responsible for the most property and crop damage, followed by hurricanes (typhoons) and tornadoes.

Data Processing

The first step is to download the data from the course web site and load it into R.

library(dplyr)
library(ggplot2)

options(scipen = 999)

file_URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

if (!file.exists("repdata_data_StormData.csv.bz2")) {
  download.file(file_URL, destfile = "repdata_data_StormData.csv.bz2", method = "curl")
}

storm_data <- read.csv("repdata_data_StormData.csv.bz2", header = TRUE)

Next, we will take a look at the raw data and check for missing data that could affect our results.

colSums(is.na(storm_data))
##    STATE__   BGN_DATE   BGN_TIME  TIME_ZONE     COUNTY COUNTYNAME      STATE 
##          0          0          0          0          0          0          0 
##     EVTYPE  BGN_RANGE    BGN_AZI BGN_LOCATI   END_DATE   END_TIME COUNTY_END 
##          0          0          0          0          0          0          0 
## COUNTYENDN  END_RANGE    END_AZI END_LOCATI     LENGTH      WIDTH          F 
##     902297          0          0          0          0          0     843563 
##        MAG FATALITIES   INJURIES    PROPDMG PROPDMGEXP    CROPDMG CROPDMGEXP 
##          0          0          0          0          0          0          0 
##        WFO STATEOFFIC  ZONENAMES   LATITUDE  LONGITUDE LATITUDE_E LONGITUDE_ 
##          0          0          0         47          0         40          0 
##    REMARKS     REFNUM 
##          0          0

We are only interested in seven variables: EVTYPE (the event name), FATALITIES (number of deaths), INJURIES (number of persons injured), PROPDMG (property damage), PROPDMGEXP (order of magnitude for property damage), CROPDMG (crop damage) and CROPDMGEXP (order of magnitude for crop damage). There are no missing (NA) values in any of these variables, so no additional processing is needed to handle them.
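
One caveat about this check: is.na() does not flag empty strings, and read.csv() keeps blank fields in character columns as "" rather than NA. The sketch below uses a small made-up data frame (`toy` is hypothetical and not part of the storm data) to show how both kinds of gaps could be counted separately.

```r
# Hypothetical toy data, for illustration only (values are made up)
toy <- data.frame(
  PROPDMG    = c(25, 2.5, NA),
  PROPDMGEXP = c("K", "", "M"),
  stringsAsFactors = FALSE
)

# Count NA values per column (this is what colSums(is.na(...)) reports)
na_counts <- colSums(is.na(toy))

# Count empty strings per column; is.na() does not catch these
blank_counts <- sapply(toy, function(col) sum(col == "", na.rm = TRUE))

print(na_counts)
print(blank_counts)
```

In the storm data, the blank entries show up in the PROPDMGEXP and CROPDMGEXP columns, so they have to be dealt with when the order of magnitude is decoded rather than as missing values.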

We need a way to handle the order of magnitude for crop and property damage. The values contained in the 2 variables are shown below.

unique(storm_data$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(storm_data$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

Unfortunately, the documentation for the database does not explain every character. We only know that H is hundreds, K is thousands, M is millions and B is billions. All other characters, including blanks, are treated as an order of magnitude of 0, that is, a multiplier of 1.

storm_data$PROPDMGEXP[which(!(storm_data$PROPDMGEXP %in% c("K", "k", "m", "M", "h", "H", "B")))] = 1
storm_data$PROPDMGEXP[which(storm_data$PROPDMGEXP == "h"| storm_data$PROPDMGEXP == "H")] = 100
storm_data$PROPDMGEXP[which(storm_data$PROPDMGEXP == "k"| storm_data$PROPDMGEXP == "K")] = 1000
storm_data$PROPDMGEXP[which(storm_data$PROPDMGEXP == "m"| storm_data$PROPDMGEXP == "M")] = 1000000
storm_data$PROPDMGEXP[which(storm_data$PROPDMGEXP == "B")] = 1000000000

storm_data$CROPDMGEXP[which(!(storm_data$CROPDMGEXP %in% c("K", "k", "m", "M", "h", "H", "B")))] = 1
storm_data$CROPDMGEXP[which(storm_data$CROPDMGEXP == "h"| storm_data$CROPDMGEXP == "H")] = 100
storm_data$CROPDMGEXP[which(storm_data$CROPDMGEXP == "k"| storm_data$CROPDMGEXP == "K")] = 1000
storm_data$CROPDMGEXP[which(storm_data$CROPDMGEXP == "m"| storm_data$CROPDMGEXP == "M")] = 1000000
storm_data$CROPDMGEXP[which(storm_data$CROPDMGEXP == "B")] = 1000000000
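
The same mapping could also be written as a single lookup table. The following is a minimal sketch of that idea, not the code used in this analysis; the helper name `exp_to_multiplier` and the sample vector `codes` are made up for illustration.

```r
# Hypothetical helper: map exponent codes to numeric multipliers.
# Codes not listed (e.g. "", "+", "?", digits) fall back to 1,
# mirroring the assumption made in the text.
exp_to_multiplier <- function(codes) {
  lookup <- c(h = 1e2, H = 1e2,
              k = 1e3, K = 1e3,
              m = 1e6, M = 1e6,
              B = 1e9)
  mult <- unname(lookup[codes])  # unknown codes (and "") give NA
  mult[is.na(mult)] <- 1         # treat everything else as multiplier 1
  mult
}

# Made-up sample of codes like those found in PROPDMGEXP
codes <- c("K", "m", "", "B", "5", "h", "?")
multipliers <- exp_to_multiplier(codes)
print(multipliers)
```

A lookup table keeps the whole mapping in one place and returns numeric values directly, so the same function could be applied to both PROPDMGEXP and CROPDMGEXP.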

Now we can compute the actual values of property and crop damage. First, we convert the PROPDMGEXP and CROPDMGEXP variables to numeric, then we multiply them by PROPDMG and CROPDMG respectively.

storm_data$PROPDMGEXP <- as.numeric(storm_data$PROPDMGEXP)
storm_data$CROPDMGEXP <- as.numeric(storm_data$CROPDMGEXP)

storm_data$PROPDMG <- storm_data$PROPDMG * storm_data$PROPDMGEXP
storm_data$CROPDMG <- storm_data$CROPDMG * storm_data$CROPDMGEXP

Finally, we can summarize the data. For each event type, we calculate the damage to population health as total casualties = injuries + deaths, and the economic damage as total damage = property damage + crop damage, expressed in billions of dollars.

storm_data_sum <- storm_data %>%
  group_by(EVTYPE) %>%
  mutate(totalcasualties = FATALITIES + INJURIES, totaldamage = PROPDMG + CROPDMG) %>%
  summarise(totalcasualties = sum(totalcasualties),totaldamage = sum(totaldamage)/1000000000)
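
To make the aggregation concrete, here is a small sketch on made-up numbers. The data frame `toy` and the result `toy_sum` are hypothetical stand-ins for storm_data and storm_data_sum, and the arrange() step is an addition that sorts the result for inspection as a table.

```r
library(dplyr)

# Hypothetical toy data standing in for storm_data (values are made up)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
  FATALITIES = c(2, 0, 1, 3),
  INJURIES   = c(10, 1, 4, 0),
  PROPDMG    = c(2.5e6, 1.0e9, 5.0e5, 0),  # already in dollars
  CROPDMG    = c(0, 2.0e8, 0, 1.0e4)       # already in dollars
)

toy_sum <- toy %>%
  group_by(EVTYPE) %>%
  summarise(
    totalcasualties = sum(FATALITIES + INJURIES),
    totaldamage     = sum(PROPDMG + CROPDMG) / 1e9  # billions of dollars
  ) %>%
  arrange(desc(totalcasualties))  # most harmful event type first

print(toy_sum)
```

Here the two TORNADO rows collapse into a single row whose casualties are the sum over both events, which is the same per-event-type total that the plots in the Results section rank.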

Results

Across the United States, which types of events are most harmful with respect to population health?

Let’s take a look at the top 10 most harmful events.

ggplot(data=top_n(storm_data_sum,10, totalcasualties), aes(x=reorder(EVTYPE, -totalcasualties), y=totalcasualties)) +
  geom_bar(stat="identity") +
  xlab("Event") +
  ylab("Total Casualties") +
  ggtitle("Highest total deaths and injuries in the US by weather event (1950-2011)")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5), plot.title = element_text(hjust = 0.5))

It is clear from this graph that tornadoes are by far the most harmful event for population health. One weakness of the data can be seen here: the names of some events are not clear and don’t match the documentation. For example, TSTM WIND might be the same as THUNDERSTORM WIND, and both appear in the top 10. I have decided not to make any changes to the categories due to incomplete information, although merging such categories would have affected the results.
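
For readers who do want to merge such labels, the sketch below shows one possible approach. It is not applied anywhere in this analysis; the helper name `normalize_evtype` and the two substitutions are assumptions chosen for illustration and are not taken from the NOAA documentation.

```r
# Hypothetical helper: collapse a few apparently equivalent labels.
# The substitutions are illustrative assumptions, not an official mapping.
normalize_evtype <- function(x) {
  x <- toupper(trimws(x))                          # unify case, trim ends
  x <- gsub("\\s+", " ", x)                        # collapse repeated spaces
  x <- sub("^TSTM WIND$", "THUNDERSTORM WIND", x)  # expand abbreviation
  x <- sub("^THUNDERSTORM WINDS$", "THUNDERSTORM WIND", x)  # drop plural
  x
}

# Made-up sample of labels
labels <- c("TSTM WIND", " Thunderstorm  Winds", "TORNADO")
cleaned <- normalize_evtype(labels)
print(cleaned)
```

Whether two labels really describe the same kind of event is a judgment call, so any such mapping should be checked against the database documentation before it is used to recompute the totals.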

Across the United States, which types of events have the greatest economic consequences?

Again, let’s take a look at the top 10 most harmful events.

ggplot(data=top_n(storm_data_sum,10, totaldamage), aes(x=reorder(EVTYPE, -totaldamage), y=totaldamage)) +
  geom_bar(stat="identity") +
  xlab("Event") +
  ylab("Total Economic Damage (In Billions Of Dollars)") +
  ggtitle("Highest total property and crop damage in the US by weather event (1950-2011)")+
  theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5), plot.title = element_text(hjust = 0.5))

The total damage is in billions of dollars. This time, floods are the most harmful event when it comes to economic damage, followed by hurricanes (typhoons) and tornadoes. Again, there might be an issue with the categories, as HURRICANE/TYPHOON and HURRICANE may describe the same kind of event.