The basic goal of this assignment is to explore the NOAA Storm Database and answer two basic questions about severe weather events. Which types of weather event are most harmful with respect to population health and which types of event have the greatest economic consequences? To answer these two question, we first selected eight related columns from the original database. Then, we arbritrarely set the cutoff point at 80% to avoid possible bias. From these data, we found that TORNADO is the most harmful event to public health, and FLOOD has the greatest economic consequence.
We first set the working directory and load the required packages.
setwd("C:/Users/Bell/Desktop/coursera/reproducible research/PA2")
library(knitr)
library(lubridate)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
##
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
##
## The following object is masked from 'package:lubridate':
##
## here
library(ggplot2)
Then, we download the database, uncompress it, and load the CSV file by using read.csv.
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile = 'weatherdata.csv.bz2')
data <- read.csv(bzfile('weatherdata.csv.bz2'), sep = ",")
We check the dimensions of our loaded dataset.
dim(data)
## [1] 902297 37
First, we select the related eight columns from the original dataset.
simdata <- select(data, BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
simdata$BGN_DATE <- as.Date(simdata$BGN_DATE, "%m/%d/%Y %H:%M:%S")
Since we know there are generally fewer events recorded in the earlier years, we look at the years of beginning date.
years <- year(simdata$BGN_DATE)
cutyear<- quantile(years, probs = c(0.2, 0.8))
print(cutyear)
## 20% 80%
## 1992 2008
We arbritrarely set the cutoff point at 80% to use the majority of the observations. In this way we may avoid possible bias introduced by the observations in the earlier years.
hist(years, breaks = 30, xlab = "Beginning Date")
abline(v=cutyear[1],col="red")
Then, we subset the original dataset by the cutoff year.
yearindex <- which(years >= cutyear[1])
simdata2 <- simdata[yearindex,]
head(simdata2)
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 5097 1992-01-13 TSTM WIND 0 0 0 0
## 5098 1992-01-13 TSTM WIND 0 0 0 0
## 5099 1992-01-13 TSTM WIND 0 0 0 0
## 5100 1992-02-10 TSTM WIND 0 0 0 0
## 5101 1992-02-15 TSTM WIND 0 0 0 0
## 5102 1992-02-15 TSTM WIND 0 0 0 0
## CROPDMGEXP
## 5097
## 5098
## 5099
## 5100
## 5101
## 5102
str(simdata2)
## 'data.frame': 728272 obs. of 8 variables:
## $ BGN_DATE : Date, format: "1992-01-13" "1992-01-13" ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 856 856 856 856 856 856 856 856 856 856 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 0 0 ...
## $ INJURIES : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
Our data is now ready for further analysis.
First, we group our data by event types and make sum of the caused fatalities.
sumdata <- ddply(simdata2, .(EVTYPE), summarize, sum = sum(FATALITIES))
names(sumdata) <- c("evtype", "sum_fatalities")
Then, we rearrage the data by the sum of fatalities in a descending order.
sumdata <- arrange(sumdata, desc(sum_fatalities))
head(sumdata)
## evtype sum_fatalities
## 1 EXCESSIVE HEAT 1903
## 2 TORNADO 1660
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 FLOOD 470
We are only interested in the highest ten events.
ftop10 <- sumdata[1:10,]
print(ftop10)
## evtype sum_fatalities
## 1 EXCESSIVE HEAT 1903
## 2 TORNADO 1660
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 FLOOD 470
## 7 RIP CURRENT 368
## 8 TSTM WIND 255
## 9 HIGH WIND 248
## 10 AVALANCHE 224
We can do the same analysis on the injuries.
sumdata2 <- ddply(simdata2, .(EVTYPE), summarize, sum = sum(INJURIES))
names(sumdata2) <- c("evtype", "sum_injuries")
sumdata2 <- arrange(sumdata2, desc(sum_injuries))
head(sumdata2)
## evtype sum_injuries
## 1 TORNADO 24633
## 2 FLOOD 6789
## 3 EXCESSIVE HEAT 6525
## 4 LIGHTNING 5230
## 5 TSTM WIND 3954
## 6 HEAT 2100
itop10 <- sumdata2[1:10,]
print(itop10)
## evtype sum_injuries
## 1 TORNADO 24633
## 2 FLOOD 6789
## 3 EXCESSIVE HEAT 6525
## 4 LIGHTNING 5230
## 5 TSTM WIND 3954
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 WINTER STORM 1321
We finally add the two group of caused fatalities and injuries together to see the overall effect on public health.
sumdatatotal <- merge(sumdata, sumdata2, by = "evtype")
sumdatatotal <- mutate(sumdatatotal, sum = sum_fatalities + sum_injuries)
sumdatatotal <- arrange(sumdatatotal, desc(sum))
top10 <- sumdatatotal[1:10,]
print(top10)
## evtype sum_fatalities sum_injuries sum
## 1 TORNADO 1660 24633 26293
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 FLOOD 470 6789 7259
## 4 LIGHTNING 816 5230 6046
## 5 TSTM WIND 255 3954 4209
## 6 HEAT 937 2100 3037
## 7 FLASH FLOOD 978 1777 2755
## 8 ICE STORM 89 1975 2064
## 9 THUNDERSTORM WIND 133 1488 1621
## 10 WINTER STORM 206 1321 1527
The results can be shown clearly in a series of bar plots
yrange <- range(top10$sum)
par(mfrow= c(1,3))
barplot(ftop10$sum_fatalities, ylim = c(0, max(yrange)), ylab = "Total Fatalities",
xlab = "Event Type", names.arg = ftop10$evtype,
main = "Top ten fatal weather events", col = "red")
barplot(itop10$sum_injuries, ylim = c(0, max(yrange)), ylab = "Total Injuries",
xlab = "Event Type", names.arg = itop10$evtype,
main = "Top ten dangous weather events", col = "blue")
barplot(top10$sum, ylim = c(0, max(yrange)), ylab = "Total harmness to public health",
xlab = "Event Type", names.arg = top10$evtype,
main = "Overal top ten severe events", col = "black")
From the plots, we know that EXCESSIVE HEAT is the most fatal weather event, TORNADO causes the highest injuries, and TORNADO has the most harmful overall effect on public health.
The absolute values of property and crop damage are related to the columns of PROPDMGEXP and CROPDMGEXP respectively.
simdata2$PROPDMGEXP <- as.character(simdata2$PROPDMGEXP)
simdata2$CROPDMGEXP <- as.character(simdata2$CROPDMGEXP)
unique(simdata2$PROPDMGEXP)
## [1] "" "M" "K" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
First, we need to substitue those letters with numbers, and replace those symbols with zeros.
simdata2$PROPDMGEXP <- gsub("[Hh]", "2", simdata2$PROPDMGEXP)
simdata2$PROPDMGEXP <- gsub("[Kk]", "3", simdata2$PROPDMGEXP)
simdata2$PROPDMGEXP <- gsub("[Mm]", "6", simdata2$PROPDMGEXP)
simdata2$PROPDMGEXP <- gsub("[Bb]", "9", simdata2$PROPDMGEXP)
simdata2$PROPDMGEXP <- gsub("\\+|\\-|\\?", "0", simdata2$PROPDMGEXP)
simdata2$PROPDMGEXP <- as.numeric(simdata2$PROPDMGEXP)
unique(simdata2$PROPDMGEXP)
## [1] NA 6 3 9 0 5 4 2 7 1 8
We also need to substitue those NAs with zeros too.
simdata2$PROPDMGEXP[is.na(simdata2$PROPDMGEXP)] <- 0
unique(simdata2$PROPDMGEXP)
## [1] 0 6 3 9 5 4 2 7 1 8
Now, we are ready to calculate the real values of property damage.
simdata2 <- mutate(simdata2, propdmg = PROPDMG * 10^ PROPDMGEXP)
head(simdata2)
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 5097 1992-01-13 TSTM WIND 0 0 0 0 0
## 5098 1992-01-13 TSTM WIND 0 0 0 0 0
## 5099 1992-01-13 TSTM WIND 0 0 0 0 0
## 5100 1992-02-10 TSTM WIND 0 0 0 0 0
## 5101 1992-02-15 TSTM WIND 0 0 0 0 0
## 5102 1992-02-15 TSTM WIND 0 0 0 0 0
## CROPDMGEXP propdmg
## 5097 0
## 5098 0
## 5099 0
## 5100 0
## 5101 0
## 5102 0
We can do the same data processing to the crop damage.
unique(simdata2$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
simdata2$CROPDMGEXP <- gsub("[Kk]", "3", simdata2$CROPDMGEXP)
simdata2$CROPDMGEXP <- gsub("[Mm]", "6", simdata2$CROPDMGEXP)
simdata2$CROPDMGEXP <- gsub("[Bb]", "9", simdata2$CROPDMGEXP)
simdata2$CROPDMGEXP <- gsub("\\?", "0", simdata2$CROPDMGEXP)
simdata2$CROPDMGEXP <- as.numeric(simdata2$CROPDMGEXP)
unique(simdata2$CROPDMGEXP)
## [1] NA 6 3 9 0 2
simdata2$CROPDMGEXP[is.na(simdata2$CROPDMGEXP)] <- 0
unique(simdata2$CROPDMGEXP)
## [1] 0 6 3 9 2
simdata2 <- mutate(simdata2, cropdmg = CROPDMG * 10^ CROPDMGEXP)
head(simdata2)
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 5097 1992-01-13 TSTM WIND 0 0 0 0 0
## 5098 1992-01-13 TSTM WIND 0 0 0 0 0
## 5099 1992-01-13 TSTM WIND 0 0 0 0 0
## 5100 1992-02-10 TSTM WIND 0 0 0 0 0
## 5101 1992-02-15 TSTM WIND 0 0 0 0 0
## 5102 1992-02-15 TSTM WIND 0 0 0 0 0
## CROPDMGEXP propdmg cropdmg
## 5097 0 0 0
## 5098 0 0 0
## 5099 0 0 0
## 5100 0 0 0
## 5101 0 0 0
## 5102 0 0 0
Then, we can make sum of the property damage and crop damage separately with respect to event types.
propsumdata <- ddply(simdata2, .(EVTYPE), summarize, sum = sum(propdmg))
cropsumdata <- ddply(simdata2, .(EVTYPE), summarize, sum = sum(cropdmg))
names(propsumdata) <- c("evtype", "sum_prop")
names(cropsumdata) <- c("evtype", "sum_crop")
propsumdata <- arrange(propsumdata, desc(sum_prop))
propdmg10 <- propsumdata [1:10,]
print(propdmg10)
## evtype sum_prop
## 1 FLOOD 144657709807
## 2 HURRICANE/TYPHOON 69305840000
## 3 STORM SURGE 43323536000
## 4 TORNADO 27755871007
## 5 FLASH FLOOD 16822673979
## 6 HAIL 15735267513
## 7 HURRICANE 11868319010
## 8 TROPICAL STORM 7703890550
## 9 WINTER STORM 6688497251
## 10 HIGH WIND 5270046295
cropsumdata <- arrange(cropsumdata, desc(sum_crop))
cropdmg10 <- cropsumdata [1:10,]
print(cropdmg10)
## evtype sum_crop
## 1 DROUGHT 13972566000
## 2 FLOOD 5661968450
## 3 RIVER FLOOD 5029459000
## 4 ICE STORM 5022113500
## 5 HAIL 3025954473
## 6 HURRICANE 2741910000
## 7 HURRICANE/TYPHOON 2607872800
## 8 FLASH FLOOD 1421317100
## 9 EXTREME COLD 1292973000
## 10 FROST/FREEZE 1094086000
Finally, we add both property damage and crop damage together with respect to event types.
totaldmg <- merge(propsumdata, cropsumdata, by = "evtype")
head(totaldmg)
## evtype sum_prop sum_crop
## 1 HIGH SURF ADVISORY 200000 0
## 2 COASTAL FLOOD 0 0
## 3 FLASH FLOOD 50000 0
## 4 LIGHTNING 0 0
## 5 TSTM WIND 8100000 0
## 6 TSTM WIND (G45) 8000 0
totaldmg <- mutate(totaldmg, sumdmg = sum_prop + sum_crop)
totaldmg <- arrange(totaldmg, desc(sumdmg))
head(totaldmg)
## evtype sum_prop sum_crop sumdmg
## 1 FLOOD 144657709807 5661968450 150319678257
## 2 HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 3 STORM SURGE 43323536000 5000 43323541000
## 4 TORNADO 27755871007 414953270 28170824277
## 5 HAIL 15735267513 3025954473 18761221986
## 6 FLASH FLOOD 16822673979 1421317100 18243991079
total10 <- totaldmg[1:10,]
print(total10)
## evtype sum_prop sum_crop sumdmg
## 1 FLOOD 144657709807 5661968450 150319678257
## 2 HURRICANE/TYPHOON 69305840000 2607872800 71913712800
## 3 STORM SURGE 43323536000 5000 43323541000
## 4 TORNADO 27755871007 414953270 28170824277
## 5 HAIL 15735267513 3025954473 18761221986
## 6 FLASH FLOOD 16822673979 1421317100 18243991079
## 7 DROUGHT 1046106000 13972566000 15018672000
## 8 HURRICANE 11868319010 2741910000 14610229010
## 9 RIVER FLOOD 5118945500 5029459000 10148404500
## 10 ICE STORM 3944927860 5022113500 8967041360
We can show our analysis results in a series of bar plots.
dmgrange <- range(total10$sumdmg)
par(mfrow= c(1,3))
barplot(propdmg10$sum_prop, ylim = c(0, max(dmgrange)), ylab = "Total property damage ($)",
xlab = "Event Type", names.arg = propdmg10$evtype,
main = "Top ten events to property damage", col = "red")
barplot(cropdmg10$sum_crop, ylim = c(0, max(dmgrange)), ylab = "Total crop damage ($)",
xlab = "Event Type", names.arg = cropdmg10$evtype,
main = "Top ten events to crop damage", col = "blue")
barplot(total10$sumdmg, ylim = c(0, max(dmgrange)), ylab = "Overall economic damage ($)",
xlab = "Event Type", names.arg = total10$evtype,
main = "Top ten events to overall damage", col = "black")
From the plots, we find that FLOOD cause the most property damage, DROUGHT has the greatest damage impact on crops, and FLOOD cause the highest overall economic loss from 1992 to 2011.