Analysis on U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database (1950 - 2011)

Final Project, Reproducible Research, Coursera, JHU

Synopsis

This article did an analysis on NOAA storm database from 1950 to 2011, trying to find out the most harmful storm type and focusing on the population health influences and economic consequences. Plots are used in the analysis, showing the top 10 harmful events on average, in total and in maximum for each aspect respectively. And finally two questions are answered:

Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

Data Processing

In this part, the methods and steps of the data processing will be explained. And the data used can be downloaded from Storm Data [47MB].

Loading Data

As the original data is too big, a .bz2 type file will be downloaded. Therefore, before opening the file, unziping is required. The code below requies the data stored in your working directory.

library(R.utils) ## The R package used to unzip .bz2
if(!"StormData.csv" %in% dir()){
  bunzip2("repdata-data-StormData.csv.bz2", "StormData.csv", 
        remove = FALSE, skip = TRUE)
}
#### load data
if(!exists("storm_data")){
  storm_data <- read.csv("StormData.csv", 
                         stringsAsFactors = FALSE)
}

After then, you will get a very big dataframe with dimentions as follows:

dim(storm_data)

## [1] 902297     37

And among these 37 variables, we only care about:

EVTYPE: Storm event type
FATALITIES: The number of fatalities in a specific event
INJURIES: The number of people injured in a specific event
PROPDMG: Property damaged in a event (under a specific exponent)
PROPDMGEXP: The exponent used to calculate PROPDMG
CROPDMG: Crop damaged in a event (under a specific exponent)
CROPDMGEXP: The exponent used to calculate CROPDMG

Processing Data

With the R command unique, we can have a quick look at the PROPDMGEXP and CROPDMGEXP:

unique(storm_data$PROPDMGEXP)

##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"

And after searching in the internet, we conclude the symbols above have the following meaning:

B/b = billions
M/m = millions
K/k = kilos
H/h = hundreds
0 ~ 8 = 10
-/? = 0
+ = 1

Knowing this, we will then replace them by the right format.

storm_data <- within(storm_data, {
  ## PROPDMGEXP
  PROPDMGEXP <- gsub("[0-8]", "10", PROPDMGEXP)
  PROPDMGEXP <- gsub("[Bb]", "1000000000", PROPDMGEXP)
  PROPDMGEXP <- gsub("[Mm]", "1000000", PROPDMGEXP)
  PROPDMGEXP <- gsub("[Kk]", "1000", PROPDMGEXP)
  PROPDMGEXP <- gsub("[Hh]", "100", PROPDMGEXP)
  PROPDMGEXP <- gsub("[-\\?]", "0", PROPDMGEXP)
  PROPDMGEXP <- gsub("\\+", "1", PROPDMGEXP)
  PROPDMGEXP[PROPDMGEXP == ""] <- "0"
  ## CROPDMGEXP
  CROPDMGEXP <- gsub("[0-8]", "10", CROPDMGEXP)
  CROPDMGEXP <- gsub("[Bb]", "1000000000", CROPDMGEXP)
  CROPDMGEXP <- gsub("[Mm]", "1000000", CROPDMGEXP)
  CROPDMGEXP <- gsub("[Kk]", "1000", CROPDMGEXP)
  CROPDMGEXP <- gsub("[Hh]", "100", CROPDMGEXP)
  CROPDMGEXP <- gsub("[-\\?]", "0", CROPDMGEXP)
  CROPDMGEXP <- gsub("\\+", "1", CROPDMGEXP)
  CROPDMGEXP[CROPDMGEXP == ""] <- "0"
})

After then, we can calculate the influence on population health and economy, and extract what we care about.

library(dplyr)
storm_data$EVTYPE <- factor(storm_data$EVTYPE)
storm_damage <- storm_data %>% 
  mutate(economicLoss = PROPDMG*as.numeric(PROPDMGEXP)+CROPDMG*as.numeric(CROPDMGEXP)) %>% 
  mutate(populationHealth = FATALITIES + INJURIES) %>% 
  select(EVTYPE, economicLoss, populationHealth) %>% 
  group_by(EVTYPE)

calculate measurements for factors to compare

In order to find out the most harmful one, we should select some measurements for two factors to compare. In this section, we will use means, summations and maximums as measurements. And we only uses top 10 datas because we only care about those with the most harmful effects.

Firstly, we will calculate the average of two factors.

library(dplyr)
storm_damage_mean <- summarize_all(storm_damage, mean)
ecoLossTop10_mean <- head(arrange(storm_damage_mean, desc(economicLoss)), 10)
popHealthTop10_mean <- head(arrange(storm_damage_mean, desc(populationHealth)), 10)

Secondly, we calculate the summation of two factors.

library(dplyr)
storm_damage_sum <- summarize_all(storm_damage, sum)
ecoLossTop10_sum <- head(arrange(storm_damage_sum, desc(economicLoss)), 10)
popHealthTop10_sum <- head(arrange(storm_damage_sum, desc(populationHealth)), 10)

Last, we calculate the maxmum of two factors.

library(dplyr)
storm_damage_max <- summarize_all(storm_damage, max)
ecoLossTop10_max <- head(arrange(storm_damage_max, desc(economicLoss)), 10)
popHealthTop10_max <- head(arrange(storm_damage_max, desc(populationHealth)), 10)

Result

Based on the works above, we can derive our results now. And in order to do it, we need the following R packages, and we will make a color function pal to polish our plots.

library(ggplot2)
library(grDevices)
library(gridExtra)
pal <- colorRampPalette(c("darkorange", "skyblue"))

Across the United States, which types of events are most harmful with respect to population health?

We can get the answer from the figure below.

xpop_mean <- reorder(popHealthTop10_mean$EVTYPE, -(popHealthTop10_mean$populationHealth))
popplot_mean <- ggplot(popHealthTop10_mean, aes(x = xpop_mean, y = populationHealth)) + 
  geom_bar(stat = "identity", position = "stack", fill = pal(10)) + 
  theme(axis.text.x = element_text(angle = 70, hjust = 1)) + 
  labs(x = "Type of Events", y = "Population Health Involved on Average")
xpop_sum <- reorder(popHealthTop10_sum$EVTYPE, -(popHealthTop10_sum$populationHealth))
popplot_sum <- ggplot(popHealthTop10_sum, aes(x = xpop_sum, y = populationHealth)) + 
  geom_bar(stat = "identity", position = "stack", fill = pal(10)) + 
  theme(axis.text.x = element_text(angle = 70, hjust = 1)) + 
  labs(x = "Type of Events", y = "Population Health Involved in Total")
xpop_max <- reorder(popHealthTop10_max$EVTYPE, -(popHealthTop10_max$populationHealth))
popplot_max <- ggplot(popHealthTop10_max, aes(x = xpop_max, y = populationHealth)) + 
  geom_bar(stat = "identity", position = "stack", fill = pal(10)) + 
  theme(axis.text.x = element_text(angle = 70, hjust = 1)) + 
  labs(x = "Type of Events", y = "Population Health Involved in Maximum")
grid.arrange(popplot_mean, popplot_sum, popplot_max, ncol = 3, 
             top = "Top 10 Harmful Events on Average, in Total and Maximum \n (w.r.t. Population Health)")

From the figures above, we can find out that Heat Wave is the most harmful on average w.r.t population health. However, there remains possibility of disastrous Tornado, which will have a extremely bad influence on population health.

Across the United States, which types of events have the greatest economic consequences?