Disclaimer: This document is part of Peer Assessment 2 of the Reproducible Research course in the Coursera Data Science Specialization. The information contained here is for general information purposes only. Any reliance you place on it is strictly at your own risk. I cannot accept any liability for the consequences of any actions taken on the basis of the information provided here.
This analysis was conducted using data from the NOAA Storm Database in order to answer some basic questions about severe weather events. The overall goal of the analysis is to determine, across the US, which types of events are most harmful with respect to population health and which types of events have the greatest economic consequences.
## Data Processing
The data is read from a file called repdata-data-StormData.csv.bz2, which is presumed to be in your working directory.
# Load the necessary libraries
library(plyr)
library(ggplot2)
data <- read.csv("repdata-data-StormData.csv.bz2")
# All column names are converted to lower case to simplify the analysis.
# Columns that are not needed are removed.
names(data) <- tolower(names(data))
data<-subset(data, select = -c(state__, bgn_date, bgn_time, time_zone, county, countyname,
state, bgn_range, bgn_azi, bgn_locati, end_date, end_time,
county_end, countyendn, end_range, end_azi, end_locati,
length, width, f, mag, wfo, stateoffic, zonenames, latitude,
longitude, latitude_e, longitude_,remarks, refnum))
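The column removal above can also be expressed as a "keep list" of the columns that survive it. The following is only a sketch on a toy data frame: `raw`, `keep` and `slim` are illustrative names that do not appear in the original analysis, and the toy values are made up.

```r
# Toy stand-in for the raw table; values are made up, only the column names matter
raw <- data.frame(EVTYPE = "FLOOD", FATALITIES = 1, INJURIES = 2,
                  PROPDMG = 5, PROPDMGEXP = "K",
                  CROPDMG = 0, CROPDMGEXP = "",
                  REMARKS = "example", REFNUM = 1)

# Same first step as the analysis: lower-case column names
names(raw) <- tolower(names(raw))

# Columns that remain after the subset() call in the analysis
keep <- c("evtype", "fatalities", "injuries",
          "propdmg", "propdmgexp", "cropdmg", "cropdmgexp")

# Select by name instead of dropping everything else
slim <- raw[, keep]
```

Listing the columns to keep is shorter than listing the ones to drop, and it would still work if the input contained additional columns.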
Once the data is loaded, it is processed in 3 parts. A fatal score is computed for each record as ten times the number of fatalities plus the number of injuries:
data$fatalscore <- (data$fatalities * 10) + data$injuries
data<-subset(data, select = -c(fatalities, injuries)) # no longer needed
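As a small worked illustration of the score, the sketch below applies the same weighting to a toy data frame. The event names and counts are made up and are not taken from the NOAA data.

```r
# Toy data: hypothetical counts, not taken from the NOAA database
toy <- data.frame(
  evtype     = c("A", "B", "C"),
  fatalities = c(2, 0, 1),
  injuries   = c(5, 30, 0)
)

# Same weighting as in the analysis: one fatality counts as ten injuries
toy$fatalscore <- (toy$fatalities * 10) + toy$injuries

print(toy)
```

With these numbers, event B (no fatalities, 30 injuries) scores higher than event A (2 fatalities, 5 injuries), which shows how the choice of weight drives the ranking.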
Property and crop damage are then scaled by their exponent columns (propdmgexp and cropdmgexp). Digits were treated as exponents of 10, so, for example, 3 means 10^3 = 1000.
Letters were treated as follows:
H or h = 2 (hundred = 10^2)
K or k = 3 (thousand = 10^3)
M or m = 6 (million = 10^6)
B or b = 9 (billion = 10^9)
All other characters (including blanks) were given a multiplier of 1, so they do not affect the calculations.
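The rules above can also be written as a single lookup function. This is a minimal sketch, not the code used in the analysis (which relies on `revalue` from plyr); `exp_to_multiplier` is a hypothetical helper name.

```r
# Hypothetical helper (not part of the original analysis): convert a
# damage-exponent code into a numeric multiplier following the rules above.
exp_to_multiplier <- function(code) {
  code <- toupper(as.character(code))

  # Letter codes and the power of 10 they stand for
  letters_map <- c(H = 2, K = 3, M = 6, B = 9)

  # Letters use the lookup table; anything else is tried as a digit
  power <- ifelse(code %in% names(letters_map),
                  letters_map[code],
                  suppressWarnings(as.numeric(code)))

  # Blanks and any other character ("-", "?", "+") become 10^0 = 1
  power[is.na(power)] <- 0

  10^power
}

multipliers <- exp_to_multiplier(c("h", "K", "m", "B", "3", "", "?", "0"))
```

Here `"h"` maps to 100, `"K"` and `"3"` both map to 1000, and the blank, `"?"` and `"0"` all map to 1, matching the treatment described above.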
#Property Damage
data$propmult <- revalue(data$propdmgexp, c(" "="1",NULL="1","0"="1","1"="10","2"="100",
"3"="1000","4"="10000","5"="100000","6"="1000000",
"7"="10000000","8"="100000000","-"="1","?"="1",
"+"="1","B"="1000000000","h"="100","H"="100",
"K"="1000", "M"="1000000","m"="1000000")
, warn_missing = FALSE)
data$propmult <- as.character(data$propmult)
data$propmult <- as.integer(data$propmult)
data$propmult[is.na(data$propmult)] <- 1
data$propdmg <- data$propdmg * data$propmult
data<-subset(data, select = -c(propdmgexp, propmult)) # no longer needed
#Crop damage
data$cropmult <- revalue(data$cropdmgexp, c(" "="1",NULL="1","0"="1","1"="10","2"="100",
"3"="1000","4"="10000","5"="100000",
"6"="1000000","7"="10000000","8"="100000000"
,"-"="1","?"="1","+"="1","B"="1000000000",
"h"="100","H"="100","K"="1000", "M"="1000000",
"m"="1000000", "k"="1000"),
warn_missing = FALSE)
data$cropmult <- as.character(data$cropmult)
data$cropmult <- as.integer(data$cropmult)
data$cropmult[is.na(data$cropmult)] <- 1
data$cropdmg <- data$cropdmg * data$cropmult
# Property and crop damage are added together and fields that are no longer necessary are removed
data$totaldamage <- data$propdmg + data$cropdmg
data<-subset(data, select = -c(cropdmgexp, cropmult, propdmg, cropdmg))
The event type names are cleaned, and a few of them with similar names are grouped together and summarised. Temporary variables are used to group the events into buckets ordered by total damage and by total fatal score.
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
data$evtype <- trim(data$evtype)
data[grep("TORNADO",data$evtype),c("evtype")] <- 'TORNADO'
data[grep("THUNDERSTORM",data$evtype),c("evtype")] <- 'THUNDERSTORM'
summarydata <- ddply(data, 'evtype', summarise, fatalscore_sum = sum(fatalscore),
dmg_sum = sum(totaldamage))
summary_bydmg <- arrange(summarydata, desc(dmg_sum))
summary_byfatal <- arrange(summarydata, desc(fatalscore_sum))
summary_bydmg<-subset(summary_bydmg, select = -c(fatalscore_sum))
summary_byfatal<-subset(summary_byfatal, select = -c(dmg_sum))
# Summary by damage: keep the top 19 event types, collapse the rest into 'Other'
a <- summary_bydmg[1:19,]
b <- summary_bydmg[20:nrow(summary_bydmg),]
b$evtype <- 'Other'
b <- ddply(b, 'evtype', summarise, dmg_sum = sum(dmg_sum))
summary_bydmg <- rbind(a,b)
# Convert to billions of dollars
summary_bydmg$dmg_sum <- summary_bydmg$dmg_sum/1000000000
# Summary by fatal score: keep the top 19 event types, collapse the rest into 'Other'
a <- summary_byfatal[1:19,]
b <- summary_byfatal[20:nrow(summary_byfatal),]
b$evtype <- 'Other'
b <- ddply(b, 'evtype', summarise, fatalscore_sum = sum(fatalscore_sum))
summary_byfatal <- rbind(a,b)
#Divide the score by 1000 to simplify reporting
summary_byfatal$fatalscore_sum <- summary_byfatal$fatalscore_sum/1000
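The "top 19 plus Other" bucketing is applied twice above, once per measure. As a sketch only, the same idea can be captured in one base R function; `top_n_with_other` is a hypothetical helper that is not part of the original analysis, and the toy numbers are made up.

```r
# Hypothetical helper (not from the original analysis): keep the top `n`
# rows by `value_col` and collapse everything else into one "Other" row.
top_n_with_other <- function(df, value_col, n = 19) {
  # Sort from largest to smallest value
  df <- df[order(df[[value_col]], decreasing = TRUE), ]
  if (nrow(df) <= n) return(df)

  top  <- df[seq_len(n), c("evtype", value_col)]
  rest <- df[(n + 1):nrow(df), ]

  # Single row holding the sum of all remaining event types
  other <- data.frame(evtype = "Other", value = sum(rest[[value_col]]))
  names(other)[2] <- value_col

  rbind(top, other)
}

# Toy input with made-up numbers
toy <- data.frame(evtype = c("A", "B", "C", "D"),
                  dmg_sum = c(5, 40, 1, 10))
result <- top_n_with_other(toy, "dmg_sum", n = 2)
```

In this toy case the result keeps B and D and adds an "Other" row whose value is the sum of A and C. Because the function takes the number of rows from the data, it does not depend on knowing the count of distinct event types in advance.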
Looking at the chart below, we can conclude that FLOOD had the biggest economic impact among all of the events, with over $150 billion of damage caused. HURRICANE/TYPHOON is in second place with $71 billion, followed closely by TORNADO with $60 billion in damage.
According to this report, even adding together “HURRICANE/TYPHOON” and “TORNADO”, which are similar event types (hurricanes and typhoons form over water and are huge, while tornados form over land and are much smaller in size), gives about $131 billion, which is still less than the economic consequences caused by floods.
ggplot(summary_bydmg, aes(x=evtype, y=dmg_sum, fill=evtype)) +
ggtitle("Events with the greatest economic consequences ($Billion)") +
geom_bar(stat="identity", color='black') +
guides(fill=guide_legend(override.aes=list(colour=NA))) +
coord_polar(theta='x') + xlab("") + ylab("") + theme(legend.position="none") +
geom_text(aes(label= round(dmg_sum)))
When it comes to the most harmful event with respect to population health, TORNADO is in first place. Tornados are so harmful in comparison to the others that their score is 6 times bigger than that of the second most harmful event.
Remember that the score is calculated from the number of injuries and fatalities caused by a particular event (for more information about the calculation, refer to the Data Processing - Item 1 section).
EXCESSIVE HEAT is the second most harmful event, followed by LIGHTNING and TSTM WIND in third and fourth places.
FLOOD, which is in first place when it comes to economic damage, is in 6th place when it comes to harm to population health. Curiously, HURRICANE/TYPHOON, the second most economically damaging event, only appears in 17th place when it comes to harm to people.
ggplot(summary_byfatal, aes(x=evtype, y=fatalscore_sum, fill=evtype)) +
geom_bar(stat="identity", color='black') +
xlab("Event type") + ylab("Fatality Score") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))+
theme(legend.position="none") + geom_text(aes(label= round(fatalscore_sum)))