Synopsis

This article was written as a course project for the Coursera Reproducible Research course. It aims to identify which natural events have the greatest impact on US population health and which carry the highest economic cost. The first aim was operationalized as the top 10 event types in the dataset by injuries and by fatalities; the second as the top 10 event types by combined property and crop damage.

Data Processing

First, we fetch and load the data.

#I unzipped the file manually before loading it. Reading the bz2 archive directly took too long: decompressing and parsing at the same time was too computationally expensive for my laptop

setwd("C:/Users/ttt/Desktop/Reproducible Research - Project 2")
data <- read.csv("repdata_data_StormData.csv/repdata_data_StormData.csv")
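One way to lighten the load step is to read only the columns the analysis needs. The following is a minimal sketch, not the method used in this report: it builds a tiny stand-in bz2 file with made-up values (the STATE column and all numbers are hypothetical) and relies on read.csv's colClasses argument, where "NULL" skips a column.

```r
# Build a tiny stand-in file so the sketch runs on its own (values are made up).
toy <- data.frame(
  STATE = c("AL", "TX", "OK"),
  EVTYPE = c("TORNADO", "HAIL", "TORNADO"),
  FATALITIES = c(1, 0, 2),
  INJURIES = c(5, 0, 3),
  stringsAsFactors = FALSE
)
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# Read one row only, to learn the column names cheaply.
header <- names(read.csv(path, nrows = 1))

# "NULL" tells read.csv to skip a column; NA lets it guess the type.
keep <- c("EVTYPE", "FATALITIES", "INJURIES")
classes <- ifelse(header %in% keep, NA_character_, "NULL")

# read.csv detects the bz2 compression on its own when given a file path.
small <- read.csv(path, colClasses = classes, stringsAsFactors = FALSE)
names(small)
```

Skipping unused columns may reduce both memory use and parsing time, though the gain on the real file has not been measured here.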

Next, we build a simpler dataset containing only the variables relevant to both questions. For the second question, some classmates took PROPDMGEXP and CROPDMGEXP into consideration. Although I was able to write the code for this, the calculation was too computationally expensive for my laptop to finish, so I ignore PROPDMGEXP and CROPDMGEXP and use only PROPDMG and CROPDMG to represent economic losses.
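For reference, the exponent columns can be applied with a vectorized lookup instead of a row-by-row loop, which is usually cheap even on a large data frame. This is only a sketch and is not used in the rest of this report. It assumes that the codes H, K, M, and B stand for hundred, thousand, million, and billion, and it treats every other code as a multiplier of 1, which is a simplifying assumption. The helper name exp_multiplier and the toy values are hypothetical.

```r
# Hypothetical helper: map exponent codes to numeric multipliers.
# Assumption: H/K/M/B = hundred/thousand/million/billion; anything else -> 1.
exp_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(as.character(code))]
  mult[is.na(mult)] <- 1   # unknown, empty, or missing codes
  unname(mult)
}

# Toy values, made up for illustration.
toy <- data.frame(
  PROPDMG = c(25, 2.5, 1, 10),
  PROPDMGEXP = c("K", "m", "B", ""),
  stringsAsFactors = FALSE
)
dollars <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
dollars
```

Because the lookup is a single indexing operation on a named vector, its cost grows linearly with the number of rows.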

#make a new dataset with one column that holds the unique event types
smallerdata <- data.frame(EVS=unique(data$EVTYPE))

#add a column that is the sum of all fatalities caused by the event
smallerdata$FATALITIES <- tapply(data$FATALITIES, data$EVTYPE , sum)[smallerdata$EVS]

#add a column that is the sum of all injuries caused by the event
smallerdata$INJ <- tapply(data$INJURIES, data$EVTYPE , sum)[smallerdata$EVS]

#add a column that is the sum of injuries and fatalities caused by the event
smallerdata$totalhealthissues <- smallerdata$INJ + smallerdata$FATALITIES

#add a column that is the sum of all property damage costs (in thousands of dollars) caused by the event
smallerdata$PROPDMG <- (tapply(data$PROPDMG, data$EVTYPE , sum)[smallerdata$EVS])/1000

#add a column that is the sum of all crop damage costs (in thousands of dollars) caused by the event
smallerdata$CROPDMG <- (tapply(data$CROPDMG, data$EVTYPE , sum)[smallerdata$EVS])/1000

#add a column that is the sum of property damage costs and crop damage costs caused by the event (in thousands of dollars)
smallerdata$totalcosts <- smallerdata$PROPDMG + smallerdata$CROPDMG
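The per-event sums above can also be computed in a single pass with aggregate(). The sketch below uses a made-up toy data frame with the same column names as the storm data; it is an alternative illustration, not the code this report relies on.

```r
# Toy stand-in for the storm data (values are made up).
toy <- data.frame(
  EVTYPE = c("TORNADO", "HAIL", "TORNADO", "FLOOD"),
  FATALITIES = c(1, 0, 2, 3),
  INJURIES = c(5, 0, 3, 1),
  PROPDMG = c(25, 10, 5, 50),
  CROPDMG = c(0, 5, 0, 20),
  stringsAsFactors = FALSE
)

# One row per event type, one summed column per variable.
# Note: the formula interface drops rows containing NA by default.
totals <- aggregate(
  cbind(FATALITIES, INJURIES, PROPDMG, CROPDMG) ~ EVTYPE,
  data = toy, FUN = sum
)

# Derived columns, mirroring totalhealthissues and totalcosts
# (without the division by 1000 used above).
totals$totalhealthissues <- totals$FATALITIES + totals$INJURIES
totals$totalcosts <- totals$PROPDMG + totals$CROPDMG
totals
```

The result is sorted by event type, so no separate lookup by name is needed to line the sums up with the events.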

Now that we have a small, workable dataset, we start our analysis.

We explore the data a bit. A boxplot of the number of fatalities and injuries per event type shows some very influential outliers; without them, the dataset would have relatively low dispersion. The analyst needs to decide what to do with such outliers. In my case, I decided that unless the outliers are errors, they should be kept. Note: boxplots of the property and crop damage costs lead to similar conclusions.

boxplot (smallerdata$FATALITIES, smallerdata$INJ)
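To see which event types the boxplot flags as outliers, boxplot.stats() returns the points lying beyond the whiskers. A minimal sketch on made-up numbers (the toy data frame is hypothetical and only mirrors the column names used above):

```r
# Toy per-event totals (values are made up).
toy <- data.frame(
  EVS = c("A", "B", "C", "D", "E", "F"),
  FATALITIES = c(1, 2, 3, 4, 5, 100),
  stringsAsFactors = FALSE
)

# Values beyond 1.5 times the hinge spread from the box.
out_values <- boxplot.stats(toy$FATALITIES)$out

# Event types those values belong to.
out_events <- toy$EVS[toy$FATALITIES %in% out_values]
out_events
```

Listing the flagged events by name makes it easier to judge whether they are plausible extremes or data errors.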

Results

We start by answering the first question. We sort the events by fatalities in descending order and report the top 10, then do the same for injuries.

#load some libraries needed for clarifying plots
library(ggplot2)
library(reshape)

#returns the top 10 most fatal events
mostfatal <- head(smallerdata[order(smallerdata$FATALITIES, decreasing= TRUE), ], 10)
#returns the top 10 most injurious events
mostinj <- head(smallerdata[order(smallerdata$INJ, decreasing= TRUE), ], 10)
#returns the top 10 most impactful events on health (i.e. fatalities + injuries)
mosttotalhealthissue <- head(smallerdata[order(smallerdata$totalhealthissues, decreasing= TRUE), ], 10)

#make sure the events keep their order in the ggplot; ordering them by total health impact (i.e. injuries + fatalities) is the most intuitive
mosttotalhealthissue$EVS <- factor(mosttotalhealthissue$EVS, levels = mosttotalhealthissue$EVS[order(mosttotalhealthissue$totalhealthissues, decreasing= TRUE)])
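The factor trick above works because ggplot2 draws a discrete axis in the order of the factor levels, not in the order of the rows. A small sketch with made-up values shows what the releveling does:

```r
# Toy data (values are made up); rows are deliberately not sorted.
toy <- data.frame(
  EVS = c("HAIL", "TORNADO", "FLOOD"),
  totalhealthissues = c(5, 100, 20),
  stringsAsFactors = FALSE
)

# Set the levels in descending order of the total.
toy$EVS <- factor(
  toy$EVS,
  levels = toy$EVS[order(toy$totalhealthissues, decreasing = TRUE)]
)
levels(toy$EVS)
```

Without this step the levels would default to alphabetical order, and the bars would be drawn in that order instead.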

Now we consider the most injurious and the most fatal events together. This can be operationalized in two different ways:

  1. either by taking the union of the top 10 most injurious events and the top 10 most fatal events
unimostinjrsorfatal <- union(mostfatal[,1], mostinj[,1])

unimostinjrsorfatal
##  [1] "TORNADO"           "EXCESSIVE HEAT"    "FLASH FLOOD"      
##  [4] "HEAT"              "LIGHTNING"         "TSTM WIND"        
##  [7] "FLOOD"             "RIP CURRENT"       "HIGH WIND"        
## [10] "AVALANCHE"         "ICE STORM"         "THUNDERSTORM WIND"
## [13] "HAIL"
  2. or by intersecting the top 10 most injurious events with the top 10 most fatal events. This approach is more conservative and better reflects the deadliest events. I shall use this one to operationalize the wording "most harmful [events] with respect to population health" in the question's text.
intrsctmostinjandfatal <- intersect(mostfatal[,1], mostinj[,1])
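The difference between the two approaches can be seen on a small example. The vectors below are made up; the point is that union() keeps every name that appears in either list, while intersect() keeps only names present in both, in the order of its first argument.

```r
# Toy top lists (made up for illustration).
top_fatal <- c("A", "B", "C")
top_injurious <- c("C", "B", "D")

either <- union(top_fatal, top_injurious)      # in either list
both <- intersect(top_fatal, top_injurious)    # in both lists
either
both
```

Since the most fatal events are passed first, the intersection in this report is ordered by fatalities.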

The most harmful weather events for population health in the USA are, in descending order of fatalities, TORNADO, EXCESSIVE HEAT, FLASH FLOOD, HEAT, LIGHTNING, TSTM WIND, and FLOOD.

Let us represent this in a nice visual.

#a melted dataset for plotting a gg barplot, using only the intersection of the most fatal and most injurious events
meltedmostfatalinj <- melt(mosttotalhealthissue[mosttotalhealthissue$EVS %in% intrsctmostinjandfatal,], id.vars =c("EVS", "PROPDMG", "CROPDMG", "totalcosts"))

#gg barplot
ggplot(meltedmostfatalinj, aes(EVS, value)) + geom_bar(aes(fill = variable), position = "dodge", stat="identity")+
      ggtitle(label = "Top Most Weather Events Impacting Human Health in the USA") +
      xlab("Event") + ylab ("Affected # of People") + 
      scale_fill_discrete(name = "Variable", labels = c("Fatalities", "Injuries", "Total(Fatalities + Injuries)"))

The graph shows that tornadoes have the highest impact, and that injuries are much more common than fatalities.


To answer question 2, we look for the top 10 most economically costly events (i.e. the events with the highest crop_damage + property_damage costs) and plot them in a clear barplot.

#returns top 10 most costly events
mosteconomicalcost <- head(smallerdata[order(smallerdata$totalcosts, decreasing= TRUE), ], 10)

The most economically costly events are, in descending order, TORNADO, FLASH FLOOD, TSTM WIND, HAIL, FLOOD, THUNDERSTORM WIND, LIGHTNING, THUNDERSTORM WINDS, HIGH WIND, and WINTER STORM.
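The list above contains TSTM WIND, THUNDERSTORM WIND, and THUNDERSTORM WINDS as separate entries. If these labels are taken to describe the same kind of event (an assumption this report does not make or verify), they could be merged before summing. The sketch below is only an illustration; the helper normalize_evtype is hypothetical and covers just these spelling variants.

```r
# Hypothetical helper: collapse a few spelling variants of event labels.
normalize_evtype <- function(x) {
  x <- toupper(trimws(as.character(x)))   # unify case and outer spaces
  x <- sub("^TSTM", "THUNDERSTORM", x)    # expand the abbreviation
  x <- sub("WINDS$", "WIND", x)           # drop the plural
  x
}

labels <- c("TSTM WIND", "THUNDERSTORM WIND", " Thunderstorm Winds", "HAIL")
cleaned <- normalize_evtype(labels)
cleaned
```

Applying such a step to EVTYPE before the sums are computed would change the rankings, so it should be a deliberate analytic choice.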

Now, let us plot this.

#make sure of proper ordering
mosteconomicalcost$EVS <- factor(mosteconomicalcost$EVS, levels = mosteconomicalcost$EVS[order(mosteconomicalcost$totalcosts, decreasing= TRUE)])

#prepares the latter dataset for ggplotting
meltedmostecocost <- melt(mosteconomicalcost, id.vars =c("EVS", "FATALITIES", "INJ", "totalhealthissues"))

#again, make sure of tidy ordering
meltedmostecocost$variable <- factor(meltedmostecocost$variable, levels = c("CROPDMG", "PROPDMG", "totalcosts"))

#gg barplotting

ggplot(meltedmostecocost, aes(EVS, value)) +
      geom_bar(aes(fill = variable), position = "dodge", stat="identity") +
      ggtitle(label = "Top 10 Weather Events Impacting Economy in the USA") +
      xlab("Event") + ylab ("Thousand US $ (Ignoring the EXPonential variables)") + 
      scale_fill_discrete(name = "Variable", labels = c("Crop Losses", "Property Losses", "Total Losses(Crop + Property)"))

We see again that tornadoes are the most costly weather events in the US, and that weather events clearly cause more property damage than crop damage.