Course Project 2

In this assignment I look at national natural disaster data to determine which natural disasters are the worst for human health and which are the worst for the economy. This document is a full walkthrough, from loading the raw data through processing it to analyzing it.

Synopsis

The analysis is relatively simple. It consists of subsetting the original dataset down to the data we need, and then aggregating that data. The aggregation gives us easy-to-work-with data frames which can be graphed and understood by almost anyone.

Data Processing

Loading the raw data and any packages required.

library(ggplot2)
library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
stormraw <- read.csv("repdata_data_StormData.csv")

View the data to determine how we should subset it.

str(stormraw)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

The raw data has a column of exponents for the property damage values (PROPDMGEXP). These are currently marked by letter suffixes, along with a few digits and symbols. We need to change these codes into their corresponding multipliers.

unique(stormraw$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
# Map each exponent code to its numeric multiplier ("+", "?" and "-" are treated as 0)
stormraw$PROPDMGEXP <- mapvalues(stormraw$PROPDMGEXP, from = c("K", "M", "", "B", "m", "+", "0", "5", "6", "?", "4", "2", "3", "h", "7", "H", "-", "1", "8"), to = c(10^3, 10^6, 1, 10^9, 10^6, 0, 1, 10^5, 10^6, 0, 10^4, 10^2, 10^3, 10^2, 10^7, 10^2, 0, 10, 10^8))
stormraw$PROPDMGEXP <- as.numeric(as.character(stormraw$PROPDMGEXP))
# Total property damage, expressed in billions of dollars
stormraw$PROPDMGTOTAL <- (stormraw$PROPDMG * stormraw$PROPDMGEXP) / 1e9
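The conversion above can also be sketched on a small toy example using a named lookup vector in base R instead of mapvalues. The helper name exp_to_multiplier and the toy data frame are hypothetical and not part of the original analysis. The multipliers mirror the mapping used above, including the choice to treat "+", "-" and "?" as 0, which is this report's assumption rather than an official definition.

```r
# Hypothetical helper: translate exponent codes into numeric multipliers.
# The values mirror the mapping used in this report.
exp_to_multiplier <- function(codes) {
  lookup <- c(
    "K" = 1e3, "M" = 1e6, "m" = 1e6, "B" = 1e9,
    "h" = 1e2, "H" = 1e2,
    "0" = 1, "1" = 1e1, "2" = 1e2, "3" = 1e3, "4" = 1e4,
    "5" = 1e5, "6" = 1e6, "7" = 1e7, "8" = 1e8,
    "+" = 0, "-" = 0, "?" = 0
  )
  mult <- unname(lookup[codes])   # codes not in the lookup become NA
  mult[codes %in% ""] <- 1        # an empty code means "no scaling"
  mult
}

# Toy data (made up for illustration only)
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1.2, 3),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)

# Damage in billions of dollars, as in the report
toy$PROPDMGTOTAL <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP) / 1e9
print(toy)
```

The first toy row reproduces the arithmetic for the first storm record: 25 with code "K" is 25 * 1000 = 25,000 dollars, or 2.5e-05 billion. Because unknown codes come back as NA rather than being silently kept, a helper like this could in principle be reused for CROPDMGEXP, although this report does not convert crop damage.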

With the property damage converted into a numeric total, we can subset the data to the columns we need and check for any missing values.

subnames <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP", "PROPDMGTOTAL")
stormsub <- stormraw[, subnames]
str(stormsub)
## 'data.frame':    902297 obs. of  8 variables:
##  $ EVTYPE      : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ FATALITIES  : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES    : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG     : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP  : num  1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
##  $ CROPDMG     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP  : chr  "" "" "" "" ...
##  $ PROPDMGTOTAL: num  2.5e-05 2.5e-06 2.5e-05 2.5e-06 2.5e-06 2.5e-06 2.5e-06 2.5e-06 2.5e-05 2.5e-05 ...
sum(is.na(stormsub))
## [1] 0
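Note that sum(is.na(...)) only counts NA values. The str() output shows that some character columns, such as CROPDMGEXP, contain empty strings, and those are not counted as missing. A minimal sketch of a per-column check on made-up toy data (the data frame and variable names here are hypothetical):

```r
# Toy data (made up for illustration only)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL"),
  FATALITIES = c(0, 1, NA),
  CROPDMGEXP = c("", "K", ""),
  stringsAsFactors = FALSE
)

# Number of NA values in each column
na_per_col <- colSums(is.na(toy))

# Number of empty strings in each character column
is_chr <- vapply(toy, is.character, logical(1))
empty_per_col <- colSums(toy[, is_chr, drop = FALSE] == "", na.rm = TRUE)

print(na_per_col)
print(empty_per_col)
```

Counting per column rather than over the whole data frame makes it easier to see which variable a problem comes from.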

Analysis

Here we sum both the fatalities and the total property damage by event type, and keep the ten event types with the highest totals in each case. After this, we will be able to plot them and get our results.

aggFatalities <- aggregate(FATALITIES ~ EVTYPE, stormsub, sum)
topFatalities <- aggFatalities[order(-aggFatalities$FATALITIES), ]
topFatalities <- topFatalities[1:10, ]

aggdam <- aggregate(PROPDMGTOTAL ~ EVTYPE, stormsub, sum)
topdam <- aggdam[order(-aggdam$PROPDMGTOTAL), ]
topdam <- topdam[1:10, ]
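The aggregate-then-rank pattern used above can be sketched on a small made-up data frame. As an illustration only, this version sums FATALITIES and INJURIES together with cbind(); summing injuries is an assumption added here and is not part of the original analysis, which ranks by fatalities alone.

```r
# Toy data (made up for illustration only)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HEAT"),
  FATALITIES = c(5, 1, 3, 4, 0, 2),
  INJURIES   = c(20, 2, 10, 1, 3, 0),
  stringsAsFactors = FALSE
)

# Sum both health measures per event type
agg <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Sort by fatalities (descending) and keep the top 2 event types
top <- head(agg[order(-agg$FATALITIES), ], 2)
print(top)
```

Using head() instead of indexing with 1:10 avoids rows of NA when there are fewer event types than the requested number.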

Results

We can now plot our results for both fatalities and property damage.

fplot <- ggplot(topFatalities, aes(x=reorder(EVTYPE, FATALITIES), y=FATALITIES, fill = EVTYPE))+
      geom_bar(stat="identity")+
      xlab("Event Type")+
      ylab("Total Number of Fatalities")+
      ggtitle("10 Events with highest fatalities")+
      coord_flip()
print(fplot)

[Figure: Fatalities Plot, showing the 10 event types with the highest total fatalities]

dplot <- ggplot(topdam, aes(x=reorder(EVTYPE, PROPDMGTOTAL), y=PROPDMGTOTAL, fill = EVTYPE))+
      geom_bar(stat="identity")+
      xlab("Event Type")+
      ylab("Total Property Damage (billions of USD)")+
      ggtitle("10 Events with highest Property Damage")+
      coord_flip()
print(dplot)

[Figure: Damages Plot, showing the 10 event types with the highest total property damage]

As the first plot shows, the most lethal natural disaster is the tornado. The second plot shows that, measured by total property damage, the most economically devastating natural disaster is the flood.