Analysis of the Impact of Severe Weather Events on Public Health and Economic Loss

Yan Yu

Dec 18, 2015

Synopsis

The basic goal of this assignment is to explore the NOAA Storm Database and answer two basic questions about severe weather events. Which types of weather event are most harmful with respect to population health and which types of event have the greatest economic consequences? To answer these two question, we first selected eight related columns from the original database. Then, we arbritrarely set the cutoff point at 80% to avoid possible bias. From these data, we found that TORNADO is the most harmful event to public health, and FLOOD has the greatest economic consequence.

Data Processing

Loading the data

We first set the working directory and load the required packages.

setwd("C:/Users/Bell/Desktop/coursera/reproducible research/PA2")
library(knitr)
library(lubridate)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## 
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## The following object is masked from 'package:lubridate':
## 
##     here
library(ggplot2)

Then, we download the database, uncompress it, and load the CSV file by using read.csv.

download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", 
              destfile = 'weatherdata.csv.bz2')
data <- read.csv(bzfile('weatherdata.csv.bz2'), sep =  ",")

We check the dimensions of our loaded dataset.

dim(data)
## [1] 902297     37

Processing the data

First, we select the related eight columns from the original dataset.

simdata <- select(data, BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
simdata$BGN_DATE <- as.Date(simdata$BGN_DATE, "%m/%d/%Y %H:%M:%S")

Since we know there are generally fewer events recorded in the earlier years, we look at the years of beginning date.

years <- year(simdata$BGN_DATE)
cutyear<- quantile(years, probs = c(0.2, 0.8))
print(cutyear)
##  20%  80% 
## 1992 2008

We arbritrarely set the cutoff point at 80% to use the majority of the observations. In this way we may avoid possible bias introduced by the observations in the earlier years.

hist(years, breaks = 30, xlab = "Beginning Date")
abline(v=cutyear[1],col="red")

Then, we subset the original dataset by the cutoff year.

yearindex <- which(years >= cutyear[1])
simdata2 <- simdata[yearindex,]
head(simdata2)
##        BGN_DATE    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 5097 1992-01-13 TSTM WIND          0        0       0                  0
## 5098 1992-01-13 TSTM WIND          0        0       0                  0
## 5099 1992-01-13 TSTM WIND          0        0       0                  0
## 5100 1992-02-10 TSTM WIND          0        0       0                  0
## 5101 1992-02-15 TSTM WIND          0        0       0                  0
## 5102 1992-02-15 TSTM WIND          0        0       0                  0
##      CROPDMGEXP
## 5097           
## 5098           
## 5099           
## 5100           
## 5101           
## 5102
str(simdata2)
## 'data.frame':    728272 obs. of  8 variables:
##  $ BGN_DATE  : Date, format: "1992-01-13" "1992-01-13" ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 856 856 856 856 856 856 856 856 856 856 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ INJURIES  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

Our data is now ready for further analysis.

Results

Which types of weather event are most harmful with respect to population health?

First, we group our data by event types and make sum of the caused fatalities.

sumdata <- ddply(simdata2, .(EVTYPE), summarize, sum = sum(FATALITIES))
names(sumdata) <- c("evtype", "sum_fatalities")

Then, we rearrage the data by the sum of fatalities in a descending order.

sumdata <- arrange(sumdata, desc(sum_fatalities))
head(sumdata)
##           evtype sum_fatalities
## 1 EXCESSIVE HEAT           1903
## 2        TORNADO           1660
## 3    FLASH FLOOD            978
## 4           HEAT            937
## 5      LIGHTNING            816
## 6          FLOOD            470

We are only interested in the highest ten events.

ftop10 <- sumdata[1:10,]
print(ftop10)
##            evtype sum_fatalities
## 1  EXCESSIVE HEAT           1903
## 2         TORNADO           1660
## 3     FLASH FLOOD            978
## 4            HEAT            937
## 5       LIGHTNING            816
## 6           FLOOD            470
## 7     RIP CURRENT            368
## 8       TSTM WIND            255
## 9       HIGH WIND            248
## 10      AVALANCHE            224

We can do the same analysis on the injuries.

sumdata2 <- ddply(simdata2, .(EVTYPE), summarize, sum = sum(INJURIES))
names(sumdata2) <- c("evtype", "sum_injuries")
sumdata2 <- arrange(sumdata2, desc(sum_injuries))
head(sumdata2)
##           evtype sum_injuries
## 1        TORNADO        24633
## 2          FLOOD         6789
## 3 EXCESSIVE HEAT         6525
## 4      LIGHTNING         5230
## 5      TSTM WIND         3954
## 6           HEAT         2100
itop10 <- sumdata2[1:10,]
print(itop10)
##               evtype sum_injuries
## 1            TORNADO        24633
## 2              FLOOD         6789
## 3     EXCESSIVE HEAT         6525
## 4          LIGHTNING         5230
## 5          TSTM WIND         3954
## 6               HEAT         2100
## 7          ICE STORM         1975
## 8        FLASH FLOOD         1777
## 9  THUNDERSTORM WIND         1488
## 10      WINTER STORM         1321

We finally add the two group of caused fatalities and injuries together to see the overall effect on public health.

sumdatatotal <- merge(sumdata, sumdata2, by = "evtype")
sumdatatotal <- mutate(sumdatatotal, sum = sum_fatalities + sum_injuries)
sumdatatotal <- arrange(sumdatatotal, desc(sum))
top10 <- sumdatatotal[1:10,]
print(top10)
##               evtype sum_fatalities sum_injuries   sum
## 1            TORNADO           1660        24633 26293
## 2     EXCESSIVE HEAT           1903         6525  8428
## 3              FLOOD            470         6789  7259
## 4          LIGHTNING            816         5230  6046
## 5          TSTM WIND            255         3954  4209
## 6               HEAT            937         2100  3037
## 7        FLASH FLOOD            978         1777  2755
## 8          ICE STORM             89         1975  2064
## 9  THUNDERSTORM WIND            133         1488  1621
## 10      WINTER STORM            206         1321  1527

The results can be shown clearly in a series of bar plots

yrange <- range(top10$sum)
par(mfrow= c(1,3))
barplot(ftop10$sum_fatalities, ylim = c(0, max(yrange)), ylab = "Total Fatalities", 
        xlab = "Event Type", names.arg = ftop10$evtype,
        main = "Top ten fatal weather events", col = "red")
barplot(itop10$sum_injuries, ylim = c(0, max(yrange)), ylab = "Total Injuries", 
        xlab = "Event Type", names.arg = itop10$evtype,
        main = "Top ten dangous weather events", col = "blue")
barplot(top10$sum, ylim = c(0, max(yrange)), ylab = "Total harmness to public health", 
        xlab = "Event Type", names.arg =  top10$evtype,
        main = "Overal top ten severe events", col = "black")

From the plots, we know that EXCESSIVE HEAT is the most fatal weather event, TORNADO causes the highest injuries, and TORNADO has the most harmful overall effect on public health.

Which types of event have the greatest economic consequences?

The absolute values of property and crop damage are related to the columns of PROPDMGEXP and CROPDMGEXP respectively.

simdata2$PROPDMGEXP <- as.character(simdata2$PROPDMGEXP)
simdata2$CROPDMGEXP <- as.character(simdata2$CROPDMGEXP)
unique(simdata2$PROPDMGEXP)
##  [1] ""  "M" "K" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"

First, we need to substitue those letters with numbers, and replace those symbols with zeros.

simdata2$PROPDMGEXP <- gsub("[Hh]", "2", simdata2$PROPDMGEXP)
simdata2$PROPDMGEXP <- gsub("[Kk]", "3", simdata2$PROPDMGEXP)
simdata2$PROPDMGEXP <- gsub("[Mm]", "6", simdata2$PROPDMGEXP)
simdata2$PROPDMGEXP <- gsub("[Bb]", "9", simdata2$PROPDMGEXP)
simdata2$PROPDMGEXP <- gsub("\\+|\\-|\\?", "0", simdata2$PROPDMGEXP)
simdata2$PROPDMGEXP <- as.numeric(simdata2$PROPDMGEXP)
unique(simdata2$PROPDMGEXP)
##  [1] NA  6  3  9  0  5  4  2  7  1  8

We also need to substitue those NAs with zeros too.

simdata2$PROPDMGEXP[is.na(simdata2$PROPDMGEXP)] <- 0
unique(simdata2$PROPDMGEXP)
##  [1] 0 6 3 9 5 4 2 7 1 8

Now, we are ready to calculate the real values of property damage.

simdata2 <- mutate(simdata2, propdmg = PROPDMG * 10^ PROPDMGEXP)
head(simdata2)
##        BGN_DATE    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 5097 1992-01-13 TSTM WIND          0        0       0          0       0
## 5098 1992-01-13 TSTM WIND          0        0       0          0       0
## 5099 1992-01-13 TSTM WIND          0        0       0          0       0
## 5100 1992-02-10 TSTM WIND          0        0       0          0       0
## 5101 1992-02-15 TSTM WIND          0        0       0          0       0
## 5102 1992-02-15 TSTM WIND          0        0       0          0       0
##      CROPDMGEXP propdmg
## 5097                  0
## 5098                  0
## 5099                  0
## 5100                  0
## 5101                  0
## 5102                  0

We can do the same data processing to the crop damage.

unique(simdata2$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"
simdata2$CROPDMGEXP <- gsub("[Kk]", "3", simdata2$CROPDMGEXP)
simdata2$CROPDMGEXP <- gsub("[Mm]", "6", simdata2$CROPDMGEXP)
simdata2$CROPDMGEXP <- gsub("[Bb]", "9", simdata2$CROPDMGEXP)
simdata2$CROPDMGEXP <- gsub("\\?", "0", simdata2$CROPDMGEXP)
simdata2$CROPDMGEXP <- as.numeric(simdata2$CROPDMGEXP)
unique(simdata2$CROPDMGEXP)
## [1] NA  6  3  9  0  2
simdata2$CROPDMGEXP[is.na(simdata2$CROPDMGEXP)] <- 0
unique(simdata2$CROPDMGEXP)
## [1] 0 6 3 9 2
simdata2 <- mutate(simdata2, cropdmg = CROPDMG * 10^ CROPDMGEXP)
head(simdata2)
##        BGN_DATE    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 5097 1992-01-13 TSTM WIND          0        0       0          0       0
## 5098 1992-01-13 TSTM WIND          0        0       0          0       0
## 5099 1992-01-13 TSTM WIND          0        0       0          0       0
## 5100 1992-02-10 TSTM WIND          0        0       0          0       0
## 5101 1992-02-15 TSTM WIND          0        0       0          0       0
## 5102 1992-02-15 TSTM WIND          0        0       0          0       0
##      CROPDMGEXP propdmg cropdmg
## 5097          0       0       0
## 5098          0       0       0
## 5099          0       0       0
## 5100          0       0       0
## 5101          0       0       0
## 5102          0       0       0

Then, we can make sum of the property damage and crop damage separately with respect to event types.

propsumdata <- ddply(simdata2, .(EVTYPE), summarize, sum = sum(propdmg))
cropsumdata <- ddply(simdata2, .(EVTYPE), summarize, sum = sum(cropdmg))
names(propsumdata) <- c("evtype", "sum_prop")
names(cropsumdata) <- c("evtype", "sum_crop")
propsumdata <- arrange(propsumdata, desc(sum_prop))
propdmg10 <- propsumdata [1:10,]
print(propdmg10)
##               evtype     sum_prop
## 1              FLOOD 144657709807
## 2  HURRICANE/TYPHOON  69305840000
## 3        STORM SURGE  43323536000
## 4            TORNADO  27755871007
## 5        FLASH FLOOD  16822673979
## 6               HAIL  15735267513
## 7          HURRICANE  11868319010
## 8     TROPICAL STORM   7703890550
## 9       WINTER STORM   6688497251
## 10         HIGH WIND   5270046295
cropsumdata <- arrange(cropsumdata, desc(sum_crop))
cropdmg10 <- cropsumdata [1:10,]
print(cropdmg10)
##               evtype    sum_crop
## 1            DROUGHT 13972566000
## 2              FLOOD  5661968450
## 3        RIVER FLOOD  5029459000
## 4          ICE STORM  5022113500
## 5               HAIL  3025954473
## 6          HURRICANE  2741910000
## 7  HURRICANE/TYPHOON  2607872800
## 8        FLASH FLOOD  1421317100
## 9       EXTREME COLD  1292973000
## 10      FROST/FREEZE  1094086000

Finally, we add both property damage and crop damage together with respect to event types.

totaldmg <- merge(propsumdata, cropsumdata, by = "evtype")
head(totaldmg)
##                  evtype sum_prop sum_crop
## 1    HIGH SURF ADVISORY   200000        0
## 2         COASTAL FLOOD        0        0
## 3           FLASH FLOOD    50000        0
## 4             LIGHTNING        0        0
## 5             TSTM WIND  8100000        0
## 6       TSTM WIND (G45)     8000        0
totaldmg <- mutate(totaldmg, sumdmg = sum_prop + sum_crop)
totaldmg <- arrange(totaldmg, desc(sumdmg))
head(totaldmg)
##              evtype     sum_prop   sum_crop       sumdmg
## 1             FLOOD 144657709807 5661968450 150319678257
## 2 HURRICANE/TYPHOON  69305840000 2607872800  71913712800
## 3       STORM SURGE  43323536000       5000  43323541000
## 4           TORNADO  27755871007  414953270  28170824277
## 5              HAIL  15735267513 3025954473  18761221986
## 6       FLASH FLOOD  16822673979 1421317100  18243991079
total10 <- totaldmg[1:10,]
print(total10)
##               evtype     sum_prop    sum_crop       sumdmg
## 1              FLOOD 144657709807  5661968450 150319678257
## 2  HURRICANE/TYPHOON  69305840000  2607872800  71913712800
## 3        STORM SURGE  43323536000        5000  43323541000
## 4            TORNADO  27755871007   414953270  28170824277
## 5               HAIL  15735267513  3025954473  18761221986
## 6        FLASH FLOOD  16822673979  1421317100  18243991079
## 7            DROUGHT   1046106000 13972566000  15018672000
## 8          HURRICANE  11868319010  2741910000  14610229010
## 9        RIVER FLOOD   5118945500  5029459000  10148404500
## 10         ICE STORM   3944927860  5022113500   8967041360

We can show our analysis results in a series of bar plots.

dmgrange <- range(total10$sumdmg)
par(mfrow= c(1,3))
barplot(propdmg10$sum_prop, ylim = c(0, max(dmgrange)), ylab = "Total property damage ($)", 
        xlab = "Event Type", names.arg = propdmg10$evtype,
        main = "Top ten events to property damage", col = "red")
barplot(cropdmg10$sum_crop, ylim = c(0, max(dmgrange)), ylab = "Total crop damage ($)", 
        xlab = "Event Type", names.arg = cropdmg10$evtype,
        main = "Top ten events to crop damage", col = "blue")
barplot(total10$sumdmg, ylim = c(0, max(dmgrange)), ylab = "Overall economic damage ($)", 
        xlab = "Event Type", names.arg =  total10$evtype,
        main = "Top ten events to overall damage", col = "black")

From the plots, we find that FLOOD cause the most property damage, DROUGHT has the greatest damage impact on crops, and FLOOD cause the highest overall economic loss from 1992 to 2011.