This analysis is the second assignment of the course Reproducible Research in the Coursera Data Science Specialization. Its objective is to investigate the effect of severe weather events on the US population and economy using the Storm Database of the National Oceanic and Atmospheric Administration (NOAA). The impact on the population is measured through fatalities and injuries, whereas economic harm is measured through financial damage to crops and property. The database contains data from 1950 to 2011, with more records available for the more recent years of the observation period.
The data for the analysis is available as a flat file in CSV (comma-separated values) format, compressed with the bzip2 algorithm. At the time of the analysis, it can be downloaded here:
https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
The data was downloaded from the URL above, renamed to Stormdata.csv, and moved to the local R working directory.
The CSV-file is read into R from the local working directory.
## read CSV-file:
basedata <- read.csv("Stormdata.csv", header=TRUE, na.strings = "")
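The manual download and rename step can also be scripted so that the report reproduces from scratch. The following is only a minimal sketch: the helper name read_storm_data and the default local file name are illustrative assumptions, not part of the original analysis. It relies on bzfile() from base R to read the compressed file without unpacking it first.

```r
## Minimal sketch: download the archive once and read it without unpacking.
## read_storm_data and the default destfile are hypothetical names.
read_storm_data <- function(url, destfile = "StormData.csv.bz2") {
  ## skip the download when a local copy already exists
  if (!file.exists(destfile)) {
    download.file(url, destfile, mode = "wb")
  }
  ## bzfile() opens the bzip2-compressed file; read.csv closes the connection again
  read.csv(bzfile(destfile), header = TRUE, na.strings = "")
}

## usage (not run here, requires network access):
## basedata <- read_storm_data("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2")
```

With na.strings = "", empty fields arrive as NA, which matters for the later handling of missing damage values.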
Seven of the available variables were identified as relevant for the subsequent analysis steps: EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP.
The original dataset is therefore subset to an analysis dataset that contains only the needed variables.
## subset relevant variables:
stormdata <- basedata[,c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
The first question of the assignment asks for the most harmful events with regard to the health of the population. Event types are reflected by the variable EVTYPE which is contained in the analysis dataset.
In the next step, fatalities and injuries are summed by event type using functions from the plyr package, which is loaded in the code chunk. In order to evaluate the impact of each event type, the aforementioned aggregates are then sorted in decreasing order.
## load library plyr:
library(plyr)
## aggregate fatalities and injuries:
Harm_to_population <- ddply(stormdata, .(EVTYPE), summarize,fatalities = sum(FATALITIES),injuries = sum(INJURIES))
## order aggregated data decreasingly by fatalities and assign to new data frame
FatalIncidents <- Harm_to_population[order(Harm_to_population$fatalities, decreasing = TRUE), ]
## order aggregated data decreasingly by injuries and assign to new data frame
InjuryIncidents <- Harm_to_population[order(Harm_to_population$injuries, decreasing = TRUE), ]
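To make the aggregate-then-sort step concrete, here is a small sketch on made-up data. It uses aggregate() from base R rather than ddply, purely so that the example runs without extra packages; the data frame toy and its values are invented for illustration.

```r
## Toy data with the same column names as the analysis dataset (values are made up)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
  FATALITIES = c(2, 1, 3, 4),
  INJURIES   = c(10, 0, 5, 1)
)

## sum both columns per event type, analogous to the ddply call
toy_harm <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

## sort by fatalities, largest first
toy_harm <- toy_harm[order(toy_harm$FATALITIES, decreasing = TRUE), ]
toy_harm
```

The two TORNADO rows collapse into a single row with 5 fatalities and 15 injuries, which then sorts to the top.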
The aggregation forms the basis for identifying the top 10 weather events that led to fatalities and injuries. The shortlisted data is then visualized using the package ggplot2, which is loaded in the subsequent code chunk.
## use head-function to select top 10 events in terms of fatalities (data is already sorted)
FatalIncidentsTop10 <- head(FatalIncidents, 10)
## use head-function to select top 10 events in terms of injuries (data is already sorted)
InjuryIncidentsTop10 <- head(InjuryIncidents, 10)
## load library ggplot2
library(ggplot2)
## plot top 10 events - fatalities (columns are referenced by name inside aes)
ggplot(data = FatalIncidentsTop10, aes(x = EVTYPE, y = fatalities)) + geom_bar(fill="steelblue", stat = "identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("Event Type") + ylab("No. of Fatalities") + ggtitle("NOAA Top 10: Highest Fatality Counts, 1950-2011")
## plot top 10 events - injuries (columns are referenced by name inside aes)
ggplot(data = InjuryIncidentsTop10, aes(x = EVTYPE, y = injuries)) + geom_bar(fill="orange", stat = "identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("Event Type") + ylab("No. of Injuries") + ggtitle("NOAA Top 10: Highest Injury Counts, 1950-2011")
Tornadoes clearly stand out as the most harmful type of weather event, both in terms of fatalities and injuries caused.
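In plots like the ones above, ggplot2 arranges a discrete x-axis in alphabetical (factor-level) order, so the bars do not appear ranked by height. One possible way to rank them, sketched here on made-up values, is to turn the event type into a factor whose levels follow the count, using reorder() from base R. The data frame top is invented for illustration, and the plotting part only runs if ggplot2 is installed.

```r
## Toy top-10-style table (values are made up for illustration)
top <- data.frame(
  EVTYPE     = c("FLOOD", "HEAT", "TORNADO"),
  fatalities = c(1, 4, 5)
)

## reorder() returns a factor whose levels are sorted by -fatalities,
## so the event with the most fatalities becomes the first level
top$EVTYPE <- reorder(top$EVTYPE, -top$fatalities)
levels(top$EVTYPE)

## build the plot only if ggplot2 is available; call print(p) to draw it
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(top, aes(x = EVTYPE, y = fatalities)) +
    geom_bar(stat = "identity", fill = "steelblue") +
    xlab("Event Type") + ylab("No. of Fatalities")
}
```

The same reorder() call could be placed directly inside aes(), which avoids modifying the data frame.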
The top 10 weather events causing economic damage are evaluated as well. Economic damage is measured using numeric variables for damage to property and crops. The magnitude of each damage estimate is stored in a separate column as an exponent code, represented by letters such as h for hundred and k for thousand. This has to be taken into account when calculating the total economic damage. The following code chunk therefore contains the steps to convert the data accordingly.
## replace missing damage values with 0 (empty fields were read as NA because of na.strings = "")
stormdata$PROPDMG[is.na(stormdata$PROPDMG)] <- 0
stormdata$CROPDMG[is.na(stormdata$CROPDMG)] <- 0
## convert into character
stormdata$PROPDMGEXP <- as.character(stormdata$PROPDMGEXP)
stormdata$CROPDMGEXP <- as.character(stormdata$CROPDMGEXP)
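## conduct conversion for property damage: each letter is translated to its exponent, e.g. h = 2 for later 10^2 = 100
## missing codes were read as NA (na.strings = ""), so both NA and "" are set to exponent 0
stormdata$PROPDMGEXP[is.na(stormdata$PROPDMGEXP) | stormdata$PROPDMGEXP == ""] <- 0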
stormdata$PROPDMGEXP[(stormdata$PROPDMGEXP == "+") | (stormdata$PROPDMGEXP == "-") | (stormdata$PROPDMGEXP == "?")] <- 1
stormdata$PROPDMGEXP[(stormdata$PROPDMGEXP == "h") | (stormdata$PROPDMGEXP == "H")] <- 2
stormdata$PROPDMGEXP[(stormdata$PROPDMGEXP == "k") | (stormdata$PROPDMGEXP == "K")] <- 3
stormdata$PROPDMGEXP[(stormdata$PROPDMGEXP == "m") | (stormdata$PROPDMGEXP == "M")] <- 6
stormdata$PROPDMGEXP[(stormdata$PROPDMGEXP == "B")] <- 9
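## conduct conversion for crop damage: each letter is translated to its exponent, e.g. h = 2 for later 10^2 = 100
## missing codes were read as NA (na.strings = ""), so both NA and "" are set to exponent 0
stormdata$CROPDMGEXP[is.na(stormdata$CROPDMGEXP) | stormdata$CROPDMGEXP == ""] <- 0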
stormdata$CROPDMGEXP[(stormdata$CROPDMGEXP == "+") | (stormdata$CROPDMGEXP == "-") | (stormdata$CROPDMGEXP == "?")] <- 1
stormdata$CROPDMGEXP[(stormdata$CROPDMGEXP == "h") | (stormdata$CROPDMGEXP == "H")] <- 2
stormdata$CROPDMGEXP[(stormdata$CROPDMGEXP == "k") | (stormdata$CROPDMGEXP == "K")] <- 3
stormdata$CROPDMGEXP[(stormdata$CROPDMGEXP == "m") | (stormdata$CROPDMGEXP == "M")] <- 6
stormdata$CROPDMGEXP[(stormdata$CROPDMGEXP == "B")] <- 9
## re-convert to integer for computation of next step
stormdata$PROPDMGEXP <- as.integer(stormdata$PROPDMGEXP)
stormdata$CROPDMGEXP <- as.integer(stormdata$CROPDMGEXP)
## calculate the total damage for each event: damage value times 10^exponent, summed over property and crop damage
economic_damage <- stormdata$PROPDMG * 10^stormdata$PROPDMGEXP + stormdata$CROPDMG * 10^stormdata$CROPDMGEXP
stormdata_econ_damage <- cbind(stormdata, economic_damage)
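The letter-to-exponent translation can also be written as a lookup table, which keeps the mapping in one place. This is a sketch under stated assumptions rather than the method used in this analysis: the names exp_lookup and to_exponent are hypothetical, lower-case "b" is treated like "B" (a case the steps above do not cover), and any code not in the table yields NA.

```r
## Lookup table following the mapping used in this analysis
## (assumption: lower-case "b" is treated like "B")
exp_lookup <- c("+" = 1, "-" = 1, "?" = 1,
                h = 2, H = 2, k = 3, K = 3, m = 6, M = 6, b = 9, B = 9)

## hypothetical helper: translate exponent codes into numeric powers of ten
to_exponent <- function(code) {
  code <- as.character(code)
  out <- unname(exp_lookup[code])       # letters and symbols via the lookup table
  digit <- grepl("^[0-9]$", code)       # codes that are already single digits
  out[digit] <- as.numeric(code[digit])
  out[is.na(code) | code == ""] <- 0    # missing code: no scaling
  out
}

## worked example: 25 with "K" is 25 * 10^3, 1.5 with "M" is 1.5 * 10^6
value  <- c(25, 1.5, 10, 7)
code   <- c("K", "M", NA, "2")
damage <- value * 10^to_exponent(code)
damage
```

The damage value is multiplied by ten raised to the exponent, so the missing code leaves 10 unchanged and the digit code "2" scales 7 to 700.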
## subset relevant variables
stormdata_econ_damage <- stormdata_econ_damage[, c("EVTYPE", "FATALITIES", "INJURIES", "economic_damage")]
## aggregate
EconomicDamagesAggregate <-aggregate(. ~ EVTYPE,data = stormdata_econ_damage ,FUN=sum)
## order dataset decreasingly by aggregated variable
EconomicDamagesAggregateSorted <- EconomicDamagesAggregate[order(EconomicDamagesAggregate$economic_damage, decreasing = T), ]
## use head-function to come up with top 10 events for economic damage
EconomicDamageIncidentsTop10 <- head(EconomicDamagesAggregateSorted, 10)
## visualize with ggplot2 which was loaded before; damage is divided by 1e6 so the axis is in million USD
ggplot(data = EconomicDamageIncidentsTop10, aes(x = EVTYPE, y = economic_damage / 1e6)) + geom_bar(fill="lightgreen", stat = "identity") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("Event Type") + ylab("Economic Damage in million USD") + ggtitle("NOAA Top 10: Highest Economic Costs, 1950-2011")
Conclusion: With regard to economic damage caused, hurricanes/typhoons represent the most harmful type of weather event. Floods are the runner-up, and tornadoes rank third.