In this assignment I examine national natural disaster data to determine which types of natural disaster are most harmful to human health and which cause the most economic damage. This document is a full walkthrough, from loading the raw data through processing to analysis.
The analysis is relatively simple. It consists of subsetting the original dataset down to the columns we need and then aggregating that data. The aggregation gives us small, easy-to-work-with data frames that can be plotted and understood by almost anyone.
First, load the required packages and the raw data.
library(ggplot2)
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
stormraw <- read.csv("repdata_data_StormData.csv")
View data to determine how we should subset it
str(stormraw)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
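The raw data has a column of exponents (PROPDMGEXP) for the property damage values. These are currently marked by letter or digit suffixes, so we need to convert each suffix into its corresponding multiplier. The total is then divided by 10^9 so that damage is expressed in billions.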
unique(stormraw$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
stormraw$PROPDMGEXP <- mapvalues(stormraw$PROPDMGEXP,
from = c("K", "M", "", "B", "m", "+", "0", "5", "6", "?", "4", "2", "3", "h", "7", "H", "-", "1", "8"),
to = c(10^3, 10^6, 1, 10^9, 10^6, 0, 1, 10^5, 10^6, 0, 10^4, 10^2, 10^3, 10^2, 10^7, 10^2, 0, 10, 10^8))
stormraw$PROPDMGEXP <- as.numeric(as.character(stormraw$PROPDMGEXP))
stormraw$PROPDMGTOTAL <- (stormraw$PROPDMG * stormraw$PROPDMGEXP)/1000000000
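The same conversion can also be written as a named lookup vector, which keeps each suffix next to its multiplier. The following is a minimal sketch on a toy data frame, not the full dataset. The names `toy`, `exp_lookup`, and `mult` are hypothetical, and the sketch covers only the letter suffixes and the empty string; the digit and symbol suffixes would need their own entries.

```r
# Hypothetical toy data mimicking the PROPDMG / PROPDMGEXP columns
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 3),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)

# Named lookup: suffix -> multiplier (same values as in the mapvalues() call)
exp_lookup <- c(K = 1e3, M = 1e6, m = 1e6, B = 1e9, H = 1e2, h = 1e2)

# Index the lookup by suffix; a suffix with no entry (here only "") gives NA,
# which is replaced by 1 so the raw value is kept unchanged
mult <- unname(exp_lookup[toy$PROPDMGEXP])
mult[is.na(mult)] <- 1

# Damage in billions, computed the same way as PROPDMGTOTAL
toy$PROPDMGTOTAL <- toy$PROPDMG * mult / 1e9
print(toy)
```

Compared with two parallel vectors of 19 elements each, a named vector makes it harder to misalign a suffix with its multiplier.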
With the damage values converted, we can subset the data to the columns we need and check for any missing values.
subnames <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP", "PROPDMGTOTAL")
stormsub <- stormraw[, subnames]
str(stormsub)
## 'data.frame': 902297 obs. of 8 variables:
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES : num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP : num 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP : chr "" "" "" "" ...
## $ PROPDMGTOTAL: num 2.5e-05 2.5e-06 2.5e-05 2.5e-06 2.5e-06 2.5e-06 2.5e-06 2.5e-06 2.5e-05 2.5e-05 ...
sum(is.na(stormsub))
## [1] 0
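A total of zero NAs does not mean every field is filled in: `is.na()` does not count empty strings, and the `str()` output shows `""` values in CROPDMGEXP. A minimal sketch of a per-column check follows, using a hypothetical toy data frame (`toy_na` and its values are invented for illustration and are not part of the analysis).

```r
# Hypothetical toy data with one NA and two empty strings
toy_na <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL"),
  FATALITIES = c(0, NA, 2),
  CROPDMGEXP = c("", "K", ""),
  stringsAsFactors = FALSE
)

# NA count per column rather than one grand total
na_per_col <- colSums(is.na(toy_na))

# Empty-string count per column; which() ignores NA comparison results
empty_per_col <- sapply(toy_na, function(col) length(which(col == "")))

print(na_per_col)
print(empty_per_col)
```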
Here we aggregate fatalities by event type and total property damage by event type, then sort each result in descending order and keep the ten largest. After this, we can plot each one and read off our results.
aggFatalities <- aggregate(FATALITIES ~ EVTYPE, stormsub, sum)
topFatalities <- aggFatalities[order(-aggFatalities$FATALITIES), ]
topFatalities <- topFatalities[1:10, ]
aggdam <- aggregate(PROPDMGTOTAL ~ EVTYPE, stormsub, sum)
topdam <- aggdam[order(-aggdam$PROPDMGTOTAL), ]
topdam <- topdam[1:10, ]
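The aggregate, sort, and truncate pattern used above is easiest to see on a small example. This sketch uses a made-up data frame (`toy_events` and its values are invented for illustration) and keeps the top two event types instead of ten.

```r
# Hypothetical toy data: several records per event type
toy_events <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HAIL"),
  FATALITIES = c(3, 1, 4, 5, 2, 0),
  stringsAsFactors = FALSE
)

# Sum fatalities within each event type
agg <- aggregate(FATALITIES ~ EVTYPE, data = toy_events, FUN = sum)

# Sort descending by total, then keep the first two rows
top <- agg[order(-agg$FATALITIES), ]
top <- head(top, 2)
print(top)
```

Using `head(top, 2)` rather than `top[1:2, ]` is a small design choice: if there are fewer groups than requested, `head()` returns only the rows that exist, whereas indexing past the end pads the result with NA rows.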
We can now plot our results for both fatalities and property damage.
fplot <- ggplot(topFatalities, aes(x=reorder(EVTYPE, FATALITIES), y=FATALITIES, fill = EVTYPE))+
geom_bar(stat="identity")+
xlab("Event Type")+
ylab("Total Number of Fatalities")+
ggtitle("10 Events with highest fatalities")+
coord_flip()
print(fplot)
Fatalities Plot
dplot <- ggplot(topdam, aes(x=reorder(EVTYPE, PROPDMGTOTAL), y=PROPDMGTOTAL, fill = EVTYPE))+
geom_bar(stat="identity")+
xlab("Event Type")+
ylab("Total Property Damage (billions of dollars)")+
ggtitle("10 Events with highest Property Damage")+
coord_flip()
print(dplot)
Damages Plot
As the first plot shows, the most lethal type of natural disaster is the tornado. The second plot shows that the most economically devastating type, measured by total property damage, is the flood.