Analysis on U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database (1950 - 2011)
Final Project, Reproducible Research, Coursera, JHU
This article did an analysis on NOAA storm database from 1950 to 2011, trying to find out the most harmful storm type and focusing on the population health influences and economic consequences. Plots are used in the analysis, showing the top 10 harmful events on average, in total and in maximum for each aspect respectively. And finally two questions are answered:
In this part, the methods and steps of the data processing will be explained. And the data used can be downloaded from Storm Data [47MB].
As the original data is too big, a .bz2 type file will be downloaded. Therefore, before opening the file, unziping is required. The code below requies the data stored in your working directory.
library(R.utils) ## The R package used to unzip .bz2
if(!"StormData.csv" %in% dir()){
bunzip2("repdata-data-StormData.csv.bz2", "StormData.csv",
remove = FALSE, skip = TRUE)
}
#### load data
if(!exists("storm_data")){
storm_data <- read.csv("StormData.csv",
stringsAsFactors = FALSE)
}
After then, you will get a very big dataframe with dimentions as follows:
dim(storm_data)
## [1] 902297 37
And among these 37 variables, we only care about:
With the R command unique, we can have a quick look at the PROPDMGEXP and CROPDMGEXP:
unique(storm_data$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
And after searching in the internet, we conclude the symbols above have the following meaning:
Knowing this, we will then replace them by the right format.
storm_data <- within(storm_data, {
## PROPDMGEXP
PROPDMGEXP <- gsub("[0-8]", "10", PROPDMGEXP)
PROPDMGEXP <- gsub("[Bb]", "1000000000", PROPDMGEXP)
PROPDMGEXP <- gsub("[Mm]", "1000000", PROPDMGEXP)
PROPDMGEXP <- gsub("[Kk]", "1000", PROPDMGEXP)
PROPDMGEXP <- gsub("[Hh]", "100", PROPDMGEXP)
PROPDMGEXP <- gsub("[-\\?]", "0", PROPDMGEXP)
PROPDMGEXP <- gsub("\\+", "1", PROPDMGEXP)
PROPDMGEXP[PROPDMGEXP == ""] <- "0"
## CROPDMGEXP
CROPDMGEXP <- gsub("[0-8]", "10", CROPDMGEXP)
CROPDMGEXP <- gsub("[Bb]", "1000000000", CROPDMGEXP)
CROPDMGEXP <- gsub("[Mm]", "1000000", CROPDMGEXP)
CROPDMGEXP <- gsub("[Kk]", "1000", CROPDMGEXP)
CROPDMGEXP <- gsub("[Hh]", "100", CROPDMGEXP)
CROPDMGEXP <- gsub("[-\\?]", "0", CROPDMGEXP)
CROPDMGEXP <- gsub("\\+", "1", CROPDMGEXP)
CROPDMGEXP[CROPDMGEXP == ""] <- "0"
})
After then, we can calculate the influence on population health and economy, and extract what we care about.
library(dplyr)
storm_data$EVTYPE <- factor(storm_data$EVTYPE)
storm_damage <- storm_data %>%
mutate(economicLoss = PROPDMG*as.numeric(PROPDMGEXP)+CROPDMG*as.numeric(CROPDMGEXP)) %>%
mutate(populationHealth = FATALITIES + INJURIES) %>%
select(EVTYPE, economicLoss, populationHealth) %>%
group_by(EVTYPE)
In order to find out the most harmful one, we should select some measurements for two factors to compare. In this section, we will use means, summations and maximums as measurements. And we only uses top 10 datas because we only care about those with the most harmful effects.
Firstly, we will calculate the average of two factors.
library(dplyr)
storm_damage_mean <- summarize_all(storm_damage, mean)
ecoLossTop10_mean <- head(arrange(storm_damage_mean, desc(economicLoss)), 10)
popHealthTop10_mean <- head(arrange(storm_damage_mean, desc(populationHealth)), 10)
Secondly, we calculate the summation of two factors.
library(dplyr)
storm_damage_sum <- summarize_all(storm_damage, sum)
ecoLossTop10_sum <- head(arrange(storm_damage_sum, desc(economicLoss)), 10)
popHealthTop10_sum <- head(arrange(storm_damage_sum, desc(populationHealth)), 10)
Last, we calculate the maxmum of two factors.
library(dplyr)
storm_damage_max <- summarize_all(storm_damage, max)
ecoLossTop10_max <- head(arrange(storm_damage_max, desc(economicLoss)), 10)
popHealthTop10_max <- head(arrange(storm_damage_max, desc(populationHealth)), 10)
Based on the works above, we can derive our results now. And in order to do it, we need the following R packages, and we will make a color function pal to polish our plots.
library(ggplot2)
library(grDevices)
library(gridExtra)
pal <- colorRampPalette(c("darkorange", "skyblue"))
We can get the answer from the figure below.
xpop_mean <- reorder(popHealthTop10_mean$EVTYPE, -(popHealthTop10_mean$populationHealth))
popplot_mean <- ggplot(popHealthTop10_mean, aes(x = xpop_mean, y = populationHealth)) +
geom_bar(stat = "identity", position = "stack", fill = pal(10)) +
theme(axis.text.x = element_text(angle = 70, hjust = 1)) +
labs(x = "Type of Events", y = "Population Health Involved on Average")
xpop_sum <- reorder(popHealthTop10_sum$EVTYPE, -(popHealthTop10_sum$populationHealth))
popplot_sum <- ggplot(popHealthTop10_sum, aes(x = xpop_sum, y = populationHealth)) +
geom_bar(stat = "identity", position = "stack", fill = pal(10)) +
theme(axis.text.x = element_text(angle = 70, hjust = 1)) +
labs(x = "Type of Events", y = "Population Health Involved in Total")
xpop_max <- reorder(popHealthTop10_max$EVTYPE, -(popHealthTop10_max$populationHealth))
popplot_max <- ggplot(popHealthTop10_max, aes(x = xpop_max, y = populationHealth)) +
geom_bar(stat = "identity", position = "stack", fill = pal(10)) +
theme(axis.text.x = element_text(angle = 70, hjust = 1)) +
labs(x = "Type of Events", y = "Population Health Involved in Maximum")
grid.arrange(popplot_mean, popplot_sum, popplot_max, ncol = 3,
top = "Top 10 Harmful Events on Average, in Total and Maximum \n (w.r.t. Population Health)")
From the figures above, we can find out that Heat Wave is the most harmful on average w.r.t population health. However, there remains possibility of disastrous Tornado, which will have a extremely bad influence on population health.
We can get the answer from the figure below.
xeco_mean <- reorder(ecoLossTop10_mean$EVTYPE, -(ecoLossTop10_mean$economicLoss))
ecoplot_mean <- ggplot(ecoLossTop10_mean, aes(x = xeco_mean, y = economicLoss/1e9)) +
geom_bar(stat = "identity", position = "stack", fill = pal(10)) +
theme(axis.text.x = element_text(angle = 70, hjust = 1)) +
labs(x = "Type of Events", y = "Economic Conseq. on Average ($ billion)")
xeco_sum <- reorder(ecoLossTop10_sum$EVTYPE, -(ecoLossTop10_sum$economicLoss))
ecoplot_sum <- ggplot(ecoLossTop10_sum, aes(x = xeco_sum, y = economicLoss/1e9)) +
geom_bar(stat = "identity", position = "stack", fill = pal(10)) +
theme(axis.text.x = element_text(angle = 70, hjust = 1)) +
labs(x = "Type of Events", y = "Economic Conseq. in Total ($ billion)")
xeco_max <- reorder(ecoLossTop10_max$EVTYPE, -(ecoLossTop10_max$economicLoss))
ecoplot_max <- ggplot(ecoLossTop10_max, aes(x = xeco_max, y = economicLoss/1e9)) +
geom_bar(stat = "identity", position = "stack", fill = pal(10)) +
theme(axis.text.x = element_text(angle = 70, hjust = 1)) +
labs(x = "Type of Events", y = "Economic Conseq. in Maximum ($ billion)")
grid.arrange(ecoplot_mean, ecoplot_sum, ecoplot_max, ncol = 3,
top = "Top 10 Harmful Events on Average, in Total and Maximum \n (w.r.t. Economic Consequences)")
From the figures above, we can find out that Tornadoes, thunderstorm wind, hail is the most harmful on average w.r.t economic consequences. However, there remains possibility of disastrous Flood, which will have a extremely bad influence on economic.
Written by ScintiGimcki, Aug 11th 2018