First, let’s choose a working directory, clear the environment’s variables, and download the data. choose.dir() will prompt you to pick the working directory you like. Since it is very likely that you have already downloaded and unzipped the data, you can go ahead and choose the working directory that contains the zipped or unzipped data. In any case, the following code will download the data if it is not present in the current working directory. Bear in mind that if you have renamed the downloaded files, the following code will not recognize them and will download the data again.
setwd(choose.dir())
rm(list = ls()) # clear any variables preloaded in your R session
if (file.exists("repdata%2Fdata%2FStormData.csv") ||
    file.exists("repdata%2Fdata%2FStormData.csv.bz2")) {
  print("you are ready to roll")
} else {
  # download the compressed data only when neither file is present
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "./repdata%2Fdata%2FStormData.csv.bz2", mode = "wb")
}
## [1] "you are ready to roll"
Since read.csv() can read a compressed file directly, we don’t need to unzip it, but it is not a problem if you already did.
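datafile <- if (file.exists("repdata%2Fdata%2FStormData.csv")) "repdata%2Fdata%2FStormData.csv" else "repdata%2Fdata%2FStormData.csv.bz2"
disaster <- read.csv(datafile, stringsAsFactors = FALSE)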
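As a quick illustration of why no manual unzipping is needed, here is a minimal sketch that writes a small bzip2-compressed CSV to a temporary file and reads it back with read.csv(). The toy data frame and its values are made up for the example and are not part of the storm data.

```r
# Toy data frame standing in for the storm data (values are made up).
toy_csv <- data.frame(EVTYPE = c("TORNADO", "HEAT"), FATALITIES = c(3, 1))

# Write it to a bzip2-compressed CSV in a temporary location.
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy_csv, con, row.names = FALSE)
close(con)

# read.csv() detects the compression and reads the file directly.
toy_back <- read.csv(path, stringsAsFactors = FALSE)
toy_back
```

The same call works on an uncompressed .csv, so the reading step does not need to know which form of the file is on disk.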
Since the word “harmful” is ambiguous, we are going to take the variables “FATALITIES” and “INJURIES” as a proxy for the level of harm.
Since the database is rather large, we are going to drop all the other variables from the data set. For the sake of simplicity and readability we are going to use the dplyr library. At this point we will need the following variables: “EVTYPE”, “FATALITIES” and “INJURIES”.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
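# keep only the columns needed for the harm analysis
harm <- select(disaster, EVTYPE, FATALITIES, INJURIES)
# total fatalities and injuries per event type
harmsum <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = harm, FUN = sum)
# keep events with at least one fatality and one injury, then take the 10 deadliest
harmsumfil <- subset(harmsum, FATALITIES > 0 & INJURIES > 0)
harmorder <- head(harmsumfil[order(harmsumfil$FATALITIES, decreasing = TRUE), ], 10)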
library(ggplot2)
plot1 <- ggplot(harmorder,aes(x= reorder(EVTYPE, -FATALITIES),y = FATALITIES))+
geom_bar(stat="identity")+
xlab("type of event")+ylab("Count") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0))
plot1
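The aggregate, filter and rank steps that feed this plot can be tried on a toy table first. The sketch below uses made-up records with the same three column names; only base R is needed.

```r
# Toy records (made-up values) with the same three columns as harm.
toy_harm <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "HAIL", "HEAT"),
  FATALITIES = c(2, 1, 3, 0, 3),
  INJURIES   = c(10, 0, 5, 1, 2),
  stringsAsFactors = FALSE
)

# Sum both outcome columns per event type.
toy_sum <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy_harm, FUN = sum)

# Keep events with at least one fatality and one injury, then rank by fatalities.
toy_fil <- subset(toy_sum, FATALITIES > 0 & INJURIES > 0)
toy_top <- head(toy_fil[order(toy_fil$FATALITIES, decreasing = TRUE), ], 10)
toy_top
```

Note that the filter requires both totals to be positive, so an event type with fatalities but no recorded injuries (or the reverse, like HAIL here) is dropped before ranking.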
Tornadoes are by far the event that has caused the most fatalities over the analyzed period. It is surprising to see that heat accounts for the second and fourth events. Even though deaths are more important than injuries, it is useful to see how injuries accrue across the different types of events.
plot2 <- ggplot(harmorder,aes(x= reorder(EVTYPE, -INJURIES),y = INJURIES))+
geom_bar(stat="identity")+
xlab("type of event")+ylab("Count") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0))
plot2
Tornadoes also cause more injuries than other types of events. Because of this, tornadoes seem to be the most dangerous of all types of events.
## 2. Event with the greatest economic consequences
Since I wasn’t able to find a codebook with the description of each variable, we are going to assume that the economic damages are assessed in the variables “PROPDMG”, “PROPDMGEXP”, “CROPDMG” and “CROPDMGEXP”. The following code selects only the variables that we are going to need for this part of the analysis.
eco <- select(disaster, EVTYPE, PROPDMG:CROPDMGEXP)
The variables that end with “…EXP” appear to indicate the unit of the estimate in the preceding variable. For our purposes we are going to define the economic consequences as the sum of PROPDMG and CROPDMG. The two “…EXP” variables also contain special characters and digits whose meaning we don’t know; since they are not the majority of the values, we are going to disregard those values and only use the letter values, which appear to indicate the monetary unit (k for thousands, h for hundreds and so on). With the following code you will notice that the number of rows affected is not that big.
table(eco$PROPDMGEXP)
##
##             -      ?      +      0      1      2      3      4      5
## 465934      1      8      5    216     25     13      4      4     28
##      6      7      8      B      h      H      K      m      M
##      4      5      1     40      1      6 424665      7  11330
table(eco$CROPDMGEXP)
##
##             ?      0      2      B      k      K      m      M
## 618413      7     19      1      9     21 281832      1   1994
We assume that for the previous two variables: B = billions, h/H = hundreds, k/K = thousands, m/M = millions. Now we are going to assign the monetary value for each unit.
eco[eco$PROPDMGEXP == "B", 3] <- 1e9
eco[eco$PROPDMGEXP %in% c("m", "M"), 3] <- 1e6
eco[eco$PROPDMGEXP %in% c("H", "h"), 3] <- 100
eco[eco$PROPDMGEXP == "K", 3] <- 1000
eco[eco$PROPDMGEXP %in% c("-", "?", "+"), 3] <- 1
eco$PROPDMGEXP <- as.numeric(eco$PROPDMGEXP)
eco[eco$CROPDMGEXP == "B" , 5] <- 1e9
eco[eco$CROPDMGEXP %in% c("m", "M"), 5] <- 1e6
eco[eco$CROPDMGEXP %in% c("K", "k"), 5] <- 1000
eco[eco$CROPDMGEXP == "?" , 5] <- 1
eco$CROPDMGEXP <- as.numeric(eco$CROPDMGEXP)
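The letter-to-multiplier assumption can also be written as a named lookup vector, which keeps the mapping in one place. This is only a sketch of an alternative, not the code used in this analysis: the `multiplier` and `codes` names and the sample codes are hypothetical, and unlike the assignments above it turns symbols and digits into NA instead of giving them a numeric value.

```r
# Multiplier for each letter code (same assumption as in the text:
# h = hundreds, k = thousands, m = millions, b = billions).
multiplier <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Hypothetical sample of exponent codes, including a blank, a symbol and a digit.
codes <- c("K", "m", "B", "", "?", "h", "5")

# Look up by upper-cased code; anything not in the table becomes NA.
mult <- unname(multiplier[toupper(codes)])
mult
```

Upper-casing the codes first means "k" and "K" (or "m" and "M") share one entry, so each unit only has to be listed once.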
Now we are going to create a variable that sums the values of the crop and property damages. This will be the variable we use to estimate the event with the greatest economic consequences. First we will set the NA values to 0 so we can make the actual calculation.
eco[is.na(eco$CROPDMGEXP) , 5] <- 0
eco[is.na(eco$PROPDMGEXP) , 3] <- 0
eco$DMTOTAL <- (eco$PROPDMG * eco$PROPDMGEXP + eco$CROPDMG * eco$CROPDMGEXP)/1e9
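To make the units of DMTOTAL concrete, here is a small worked check of the same formula on a single hypothetical record. The figures are made up: 25 with a "K" unit for property damage and 2 with an "M" unit for crop damage.

```r
# Hypothetical record: 25 (thousands) of property damage and
# 2 (millions) of crop damage; the figures are made up.
propdmg <- 25; propexp <- 1e3
cropdmg <- 2;  cropexp <- 1e6

# Same formula as DMTOTAL: total dollars divided by 1e9 gives billions.
total_billions <- (propdmg * propexp + cropdmg * cropexp) / 1e9
total_billions
```

The record adds up to 2,025,000 dollars, so it contributes 0.002025 to a total expressed in billions of dollars.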
Now we can aggregate the values for each event to see which one has the largest economic value.
ecosum <- aggregate(DMTOTAL ~ EVTYPE, data = eco, sum)
ecorder <- head(ecosum[order(ecosum$DMTOTAL, decreasing = TRUE), ], 10)
Now we can plot a bar chart with the top 10 events by economic losses.
plot3 <- ggplot(ecorder,aes(x= reorder(EVTYPE, -DMTOTAL),y = DMTOTAL))+
geom_bar(stat="identity")+
xlab("type of event")+ylab("Billions of dollars") +
theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0))
plot3
Flood is by far the event with the greatest economic consequences, as defined by the variables created.