Coursera Peer Assessment 2 / Reproducible Research
This analysis uses the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which contains data from 1950 until November 2011. The database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The purpose of this analysis is to visualise the relationship between event types and their health and economic impact.
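if (!file.exists("repdata-data-StormData.csv")) {
temp <- tempfile(fileext = ".bz2")
# Remote location of file to be downloaded
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", temp, mode = "wb")
# Decompress the archive into the CSV file read below (requires the R.utils package)
R.utils::bunzip2(temp, "repdata-data-StormData.csv", remove = FALSE)
unlink(temp)
}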
# Packages used throughout; fread reads the CSV into `data` faster than read.csv
library(data.table)
library(dplyr)
library(ggplot2)
library(knitr)
data <- fread('repdata-data-StormData.csv', header = TRUE, sep = ',')
## Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:08
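As an aside, base R can read a bzip2-compressed CSV without a separate decompression step, because `read.csv` opens the file through a connection that detects the compression. The sketch below demonstrates this on a tiny made-up file (the file and its two rows are illustrative, not taken from the storm database). For the full data set this route is typically slower than `fread`.

```r
# Write a tiny illustrative CSV through a bzip2 connection (made-up rows).
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,2", "FLOOD,0"), con)
close(con)

# read.csv() reads the compressed file directly; no manual decompression needed.
toy <- read.csv(tmp, stringsAsFactors = FALSE)
print(toy)

# Clean up the temporary file.
unlink(tmp)
```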
The data set has the following dimensions:
dim(data)
## [1] 902297 37
This analysis looks at how the recorded storm-related event types relate to health and economic consequences.
To simplify the data set, we keep only the following columns:
dataset <- data %>% select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
This section presents the analysis of event types and their relation to health and economic impact. Bar charts are used to make the main contributors to health and economic harm easy to identify.
The following table shows the top 10 event types by total fatalities:
top_10_fatalities_by_event <- dataset %>% group_by(EVTYPE) %>% summarise(total=sum(FATALITIES)) %>% arrange(desc(total)) %>% top_n(10) %>% transform(EVTYPE = reorder(EVTYPE, total))
## Selecting by total
kable(top_10_fatalities_by_event)
| EVTYPE | total |
|---|---|
| TORNADO | 5633 |
| EXCESSIVE HEAT | 1903 |
| FLASH FLOOD | 978 |
| HEAT | 937 |
| LIGHTNING | 816 |
| TSTM WIND | 504 |
| FLOOD | 470 |
| RIP CURRENT | 368 |
| HIGH WIND | 248 |
| AVALANCHE | 224 |
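The ranking step can be checked on a small example. The sketch below uses base R only and made-up counts (the `toy` data frame and its values are hypothetical): it sums fatalities per event type, sorts in descending order, and keeps the top two rows. Unlike `top_n`, `head` does not keep tied rows beyond the requested number.

```r
# Hypothetical toy data: one row per recorded event.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT", "HAIL"),
  FATALITIES = c(3, 1, 2, 1, 3, 0)
)

# Sum fatalities per event type (same idea as group_by + summarise).
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Sort by descending total and keep the first two rows.
top2 <- head(totals[order(-totals$FATALITIES), ], 2)
print(top2)
```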
Bar chart of the ten deadliest event types by total fatalities:
g <- ggplot(top_10_fatalities_by_event, aes(x=factor(EVTYPE), y=total))
g + geom_bar(stat = "identity") + theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=rel(0.7)))
The following table shows the top 10 event types by total injuries:
top_10_injuries_by_event <- dataset %>% group_by(EVTYPE) %>% summarise(total=sum(INJURIES)) %>% arrange(desc(total)) %>% top_n(10) %>% transform(EVTYPE = reorder(EVTYPE, total))
## Selecting by total
kable(top_10_injuries_by_event)
| EVTYPE | total |
|---|---|
| TORNADO | 91346 |
| TSTM WIND | 6957 |
| FLOOD | 6789 |
| EXCESSIVE HEAT | 6525 |
| LIGHTNING | 5230 |
| HEAT | 2100 |
| ICE STORM | 1975 |
| FLASH FLOOD | 1777 |
| THUNDERSTORM WIND | 1488 |
| HAIL | 1361 |
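The table lists both TSTM WIND and THUNDERSTORM WIND, which suggests that some event types are recorded under more than one label and therefore have their totals split. The sketch below shows one possible clean-up before grouping, using made-up rows. The `normalise_evtype` helper and its single abbreviation rule are assumptions for illustration, not part of the original analysis, and they presume the two labels describe the same kind of event.

```r
# Hypothetical helper: upper-case, trim whitespace, expand the TSTM abbreviation.
normalise_evtype <- function(x) {
  x <- toupper(trimws(x))
  gsub("TSTM", "THUNDERSTORM", x, fixed = TRUE)
}

# Made-up rows with three spellings of the same event type.
toy <- data.frame(
  EVTYPE   = c("TSTM WIND", "Thunderstorm Wind", " THUNDERSTORM WIND", "HAIL"),
  INJURIES = c(5, 2, 1, 3),
  stringsAsFactors = FALSE
)
toy$EVTYPE <- normalise_evtype(toy$EVTYPE)

# After normalisation the three wind rows fall into a single group.
totals <- aggregate(INJURIES ~ EVTYPE, data = toy, FUN = sum)
print(totals)
```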
Bar chart of the ten event types that caused the most injuries:
g <- ggplot(top_10_injuries_by_event, aes(x=factor(EVTYPE), y=total))
g + geom_bar(stat = "identity") + theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5, size=rel(0.7)))
The economic impact is measured by property and crop damage. PROPDMGEXP and CROPDMGEXP hold the magnitude codes for the damage amounts in PROPDMG and CROPDMG.
Some of these codes are letters rather than numbers (H, K, M and B, standing for exponents 2, 3, 6 and 9), so they must be converted to numeric exponents before the actual damage amounts can be computed.
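# Convert the property damage exponent codes to numeric powers of ten
dataset$PROPDMGEXP <- as.character(dataset$PROPDMGEXP)
# Symbols carry no magnitude information: treat them as exponent 0
dataset$PROPDMGEXP <- gsub("\\-|\\+|\\?", "0", dataset$PROPDMGEXP)
# B = billions, M = millions, K = thousands, H = hundreds
dataset$PROPDMGEXP <- gsub("B|b", "9", dataset$PROPDMGEXP)
dataset$PROPDMGEXP <- gsub("M|m", "6", dataset$PROPDMGEXP)
dataset$PROPDMGEXP <- gsub("K|k", "3", dataset$PROPDMGEXP)
dataset$PROPDMGEXP <- gsub("H|h", "2", dataset$PROPDMGEXP)
dataset$PROPDMGEXP <- as.numeric(dataset$PROPDMGEXP)
# Empty codes become NA above; treat them as exponent 0 as well
dataset$PROPDMGEXP[is.na(dataset$PROPDMGEXP)] <- 0
# Actual property damage in dollars
dataset$ActPropDam <- dataset$PROPDMG * 10^dataset$PROPDMGEXP
# Total property damage per event type, keeping the 10 largest
propDam <- aggregate(ActPropDam ~ EVTYPE, data = dataset, sum)
propDam_reorder <- propDam[order(-propDam$ActPropDam), ]
PropDam10 <- propDam_reorder[1:10, ]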
# Same conversion for the crop damage exponent codes
dataset$CROPDMGEXP <- as.character(dataset$CROPDMGEXP)
dataset$CROPDMGEXP <- gsub("\\-|\\+|\\?", "0", dataset$CROPDMGEXP)
dataset$CROPDMGEXP <- gsub("B|b", "9", dataset$CROPDMGEXP)
dataset$CROPDMGEXP <- gsub("M|m", "6", dataset$CROPDMGEXP)
dataset$CROPDMGEXP <- gsub("K|k", "3", dataset$CROPDMGEXP)
dataset$CROPDMGEXP <- gsub("H|h", "2", dataset$CROPDMGEXP)
dataset$CROPDMGEXP <- as.numeric(dataset$CROPDMGEXP)
dataset$CROPDMGEXP[is.na(dataset$CROPDMGEXP)] <- 0
# Actual crop damage in dollars
dataset$ActCropDam <- dataset$CROPDMG * 10^dataset$CROPDMGEXP
# Total crop damage per event type, keeping the 10 largest
cropDam <- aggregate(ActCropDam ~ EVTYPE, data = dataset, sum)
cropDam_reorder <- cropDam[order(-cropDam$ActCropDam), ]
CropDam10 <- cropDam_reorder[1:10, ]
TotalDam <- aggregate(ActPropDam + ActCropDam~EVTYPE, data=dataset, sum)
names(TotalDam)[2] <- "total"
TotalDam10 <- arrange(TotalDam, desc(total)) %>% top_n(10)
## Selecting by total
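The exponent conversion above can be checked on a few hand-made rows. The sketch below uses a named lookup vector instead of repeated `gsub` calls. The `exp_power` helper and the sample values are hypothetical. As in the code above, digit codes are used as the exponent directly and any other unrecognised or empty code is treated as exponent 0 (a multiplier of 1).

```r
# Hypothetical helper: map an exponent code to a power of ten.
exp_power <- function(code) {
  lookup <- c(H = 2, K = 3, M = 6, B = 9)
  power <- unname(lookup[toupper(code)])
  # Digit codes are used as the exponent directly.
  digits <- grepl("^[0-9]+$", code)
  power[digits] <- as.numeric(code[digits])
  # Anything else ("", "-", "+", "?") counts as exponent 0.
  power[is.na(power)] <- 0
  power
}

# Made-up damage figures with one code of each kind.
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 10, 7),
  PROPDMGEXP = c("K", "M", "B", "", "2"),
  stringsAsFactors = FALSE
)
toy$ActPropDam <- toy$PROPDMG * 10^exp_power(toy$PROPDMGEXP)
print(toy)
```

For instance, a PROPDMG of 25 with code K becomes 25 * 10^3 = 25000 dollars.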
Plots of the overall economic impact of storms, by property damage, crop damage, and total damage:
par(mfrow=c(1,3))
barplot(PropDam10$ActPropDam,
names = PropDam10$EVTYPE,
cex.names = 0.7,
cex.axis = 0.7,
xlab = "Event Type",
ylab = "Total Property Damage ($)",
main = "Top 10 Events Causing \n Most Property Damage")
barplot(CropDam10$ActCropDam,
names = CropDam10$EVTYPE,
cex.names = 0.7,
cex.axis = 0.7,
xlab = "Event Type",
ylab = "Total Crop Damage ($)",
main = "Top 10 Events Causing \n Most Crop Damage")
barplot(TotalDam10$total,
names = TotalDam10$EVTYPE,
cex.names = 0.7,
cex.axis = 0.7,
xlab = "Event Type",
ylab = "Total Damage ($)",
main = "Top 10 Events Causing \n Most Total Damage")
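With ten long event names on the x-axis of each panel, some labels may overlap or be omitted. One possible alternative, sketched below with made-up numbers (the `toy` values are hypothetical, not results from the storm database), is to draw the bars horizontally with `horiz = TRUE` and `las = 1` so that each name is written out in full beside its bar.

```r
# Hypothetical totals in dollars, sorted so the largest bar ends up on top.
toy <- data.frame(
  EVTYPE = c("FLOOD", "DROUGHT", "TORNADO"),
  total  = c(3e9, 1e9, 2e9),
  stringsAsFactors = FALSE
)
toy <- toy[order(toy$total), ]

# Widen the left margin so the horizontal labels fit.
old_par <- par(mar = c(5, 10, 4, 2))
mids <- barplot(toy$total / 1e9,
                names.arg = toy$EVTYPE,
                horiz = TRUE,
                las = 1,
                cex.names = 0.7,
                xlab = "Total Damage (billion $)",
                main = "Events Causing Most Total Damage (toy data)")
# Restore the previous graphical parameters.
par(old_par)
```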
As the analysis shows, TORNADO caused the most fatalities and the most injuries. FLOOD caused the most property damage and DROUGHT the most crop damage, while FLOOD caused the most overall economic damage.