1) Synopsis

This report is the second Peer Assessment of the Reproducible Research course in the Data Science Specialization. It analyzes the characteristics of major storms and weather events in the United States and answers two basic questions:

  1. Which types of events are most harmful with respect to population health?
  2. Which types of events have the greatest economic consequences?

The raw database and its documentation are available on the course website.

2) Data Processing

The first step is to load the required library and download the database:

library(plyr)

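url <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# The download is a bzip2-compressed CSV, so keep the .bz2 extension
# and write the file in binary mode
destfile <- "./StormData.csv.bz2"
download.file(url, destfile, mode = "wb")

And read it (read.csv() decompresses bzip2 files transparently):

stormData <- read.csv(destfile)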
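As a minimal sketch of why no separate decompression step is needed, the following writes a tiny bzip2-compressed CSV to a temporary file and reads it back with read.csv(). The toy data frame and the temporary file are illustrative assumptions and are not part of the storm database.

```r
# Toy data standing in for the storm database (hypothetical values)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                  FATALITIES = c(1, 2),
                  stringsAsFactors = FALSE)

# Write a bzip2-compressed CSV through a bzfile() connection
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() detects the compression and decompresses while reading
toyBack <- read.csv(tmp, stringsAsFactors = FALSE)
print(toyBack)
```

The same call pattern should apply to the full download, which is simply much larger and takes longer to read.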

Next, we must select the columns that will be used and discard the ones that will not.

stormData <- stormData[,c(8,23,24,25,26,27,28)]
head(stormData)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          K       0           
## 2 TORNADO          0        0     2.5          K       0           
## 3 TORNADO          0        2    25.0          K       0           
## 4 TORNADO          0        2     2.5          K       0           
## 5 TORNADO          0        2     2.5          K       0           
## 6 TORNADO          0        6     2.5          K       0
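Selecting columns by position depends on the column order of the raw file. As a hedged alternative, the sketch below selects the same variables by name on a one-row toy data frame; the toy values and the extra STATE and REMARKS columns are placeholders chosen for illustration.

```r
# One-row toy data frame with a few extra columns (hypothetical values)
toy <- data.frame(STATE = "AL", EVTYPE = "TORNADO", FATALITIES = 0,
                  INJURIES = 15, PROPDMG = 25, PROPDMGEXP = "K",
                  CROPDMG = 0, CROPDMGEXP = "", REMARKS = "",
                  stringsAsFactors = FALSE)

# Names of the variables used in the analysis
keep <- c("EVTYPE", "FATALITIES", "INJURIES",
          "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")

# Selecting by name does not depend on the column positions
toySel <- toy[, keep]
print(names(toySel))
```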

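Then, we must convert the PROPDMG and CROPDMG variables to dollar amounts using the multipliers stored in PROPDMGEXP and CROPDMGEXP (K for thousands, M for millions and B for billions). Values with any other exponent code are left unscaled.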

stormData$PROPDMGEXP <- toupper(stormData$PROPDMGEXP)
stormData$CROPDMGEXP <- toupper(stormData$CROPDMGEXP)

stormData$PROPDMG <- ifelse(stormData$PROPDMGEXP=="K",
                            stormData$PROPDMG * 1000,stormData$PROPDMG)
stormData$PROPDMG <- ifelse(stormData$PROPDMGEXP=="M",
                            stormData$PROPDMG * 1000000,stormData$PROPDMG)
stormData$PROPDMG <- ifelse(stormData$PROPDMGEXP=="B",
                            stormData$PROPDMG * 1000000000,stormData$PROPDMG)
stormData$CROPDMG <- ifelse(stormData$CROPDMGEXP=="K",
                            stormData$CROPDMG * 1000,stormData$CROPDMG)
stormData$CROPDMG <- ifelse(stormData$CROPDMGEXP=="M",
                            stormData$CROPDMG * 1000000,stormData$CROPDMG)
stormData$CROPDMG <- ifelse(stormData$CROPDMGEXP=="B",
                            stormData$CROPDMG * 1000000000,stormData$CROPDMG)
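The same scaling can also be written as a lookup table. The following is only a sketch on made-up values, not the code used for the results in this report; the names multiplier, mult and PROPDMG_USD are assumptions introduced for the example.

```r
# Toy damage values with their exponent codes (hypothetical values)
toy <- data.frame(PROPDMG = c(25, 2.5, 1, 3),
                  PROPDMGEXP = c("K", "m", "B", ""),
                  stringsAsFactors = FALSE)

# Named vector mapping each exponent code to its multiplier
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up the multiplier for each row; unknown or empty codes give NA
mult <- unname(multiplier[toupper(toy$PROPDMGEXP)])
mult[is.na(mult)] <- 1   # leave those values unscaled

toy$PROPDMG_USD <- toy$PROPDMG * mult
print(toy)
```

With a lookup table, supporting an additional exponent code only requires one more entry in multiplier.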

Discarding unnecessary columns:

stormData <- stormData[,c(1,2,3,4,6)]

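We must also clean the EVTYPE variable by converting it to upper case and merging spelling variants and related event types into common categories: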

stormData$EVTYPE <- toupper(as.character(stormData$EVTYPE))
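# Join the alternatives with paste() so that no line breaks or
# indentation end up inside the regular expression
filter <- grepl(paste("THUNDERSTORM.", "TSTM", "LIGHTNING", "TUNDERSTORM WIND",
                      "THUNERSTORM WINDS", "THUNDERTORM WINDS", "THUNDERSTROM WIND",
                      "THUNDERSNOW", "THUNDERESTORM WINDS", "THUDERSTORM WINDS",
                      "THUNDEERSTORM WINDS", sep = "|"),
                stormData$EVTYPE)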
stormData$EVTYPE[filter] <- "THUNDERSTORM"
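When a long pattern is assembled, it matters how the alternatives are joined. The sketch below, on a made-up vector of event names, shows that a string literal broken across lines keeps the line break and the indentation as part of the pattern, while joining the alternatives with paste() does not. The names events, broken and fixed are assumptions introduced for the example.

```r
# Made-up event names for the demonstration
events <- c("THUNDERSNOW", "TSTM WIND", "HAIL")

# A literal broken across lines keeps the newline and the indentation,
# so the second alternative cannot match an ordinary event name
broken <- "TSTM|
           THUNDERSNOW"

# Joining the alternatives with paste() keeps the pattern clean
fixed <- paste(c("TSTM", "THUNDERSNOW"), collapse = "|")

print(grepl(broken, events))
print(grepl(fixed, events))
```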

# paste() keeps line breaks out of the pattern
filter <- grepl(paste("COLD", "COLD AND SNOW", "COLD AND WET CONDITIONS",
                      "COLD TEMPERATURE", "COLD WAVE", "COLD WEATHER",
                      "COLD/WIND CHILL", "COLD/WINDS", "COOL AND WET",
                      "EXTENDED COLD", sep = "|"),
                stormData$EVTYPE)
stormData$EVTYPE[filter] <- "COLD"

# paste() keeps line breaks out of the pattern
filter <- grepl(paste("DAMAGING FREEZE", "EARLY FROST", "FREEZE", "FREEZING.",
                      "FROST", "FROST/FREEZE", "HARD FREEZE", "ICE.",
                      "ICY ROADS", sep = "|"),
                stormData$EVTYPE)
stormData$EVTYPE[filter] <- "FREEZE"

filter <- grepl("RAIN|RAIN/SNOW|RAINSTORM",stormData$EVTYPE)
stormData$EVTYPE[filter] <- "RAIN"

filter <- grepl("FLOOD|FLOODING",stormData$EVTYPE)
stormData$EVTYPE[filter] <- "FLOOD"

filter <- grepl("ICE STORM|WINTER STORM|HAIL|HEAVY SNOW|BLIZZARD",
                 stormData$EVTYPE)
stormData$EVTYPE[filter] <- "WINTER STORM"

filter <- grepl("RIP CURRENT|RIP CURRENTS",stormData$EVTYPE)
stormData$EVTYPE[filter] <-"RIP CURRENT"

filter <- grepl("EXCESSIVE HEAT|HEAT|HEAT WAVE",stormData$EVTYPE)
stormData$EVTYPE[filter] <-"EXCESSIVE HEAT"

filter <- grepl("TORNADO|HIGH WIND|STRONG WIND|TORNDAO",stormData$EVTYPE)
stormData$EVTYPE[filter] <- "TORNADO"

filter <- grepl("WILD.|WILDFIRE",stormData$EVTYPE)
stormData$EVTYPE[filter] <- "WILDFIRE"

filter <- grepl("HURRICANE.|HURRICANE/TYPHOON",stormData$EVTYPE)
stormData$EVTYPE[filter] <- "HURRICANE"
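Because the recoding steps above are applied one after another, an event name is assigned by the first rule that matches it. The following sketch reproduces that behaviour on a few made-up event names with a shortened, hypothetical rule list (rules, cleaned and label are names introduced for the example).

```r
# Made-up event names for the demonstration
events <- c("ICE STORM", "HEAVY RAIN", "FLASH FLOODING", "HAIL")

# Shortened rule list: category label -> pattern, applied in this order
rules <- list(FREEZE = "FREEZE|FROST|ICE.",
              RAIN = "RAIN",
              FLOOD = "FLOOD",
              "WINTER STORM" = "ICE STORM|HAIL")

cleaned <- events
for (label in names(rules)) {
  # Overwrite every name that matches the current pattern
  cleaned[grepl(rules[[label]], cleaned)] <- label
}
print(cleaned)
```

In this sketch "ICE STORM" ends up in the FREEZE category, because the "ICE." alternative matches it before the WINTER STORM rule is reached. The order of the rules is therefore part of the cleaning logic.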

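And finally, we aggregate the data by event type, summing the fatalities, injuries and damage values and counting the number of records for each type.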

stormData2 <- ddply(stormData,.(EVTYPE),summarize,TotalFatal=sum(FATALITIES),
                   TotalInj=sum(INJURIES),TotalPdmg=sum(PROPDMG),
                   TotalCdmg=sum(CROPDMG), Freq=length(FATALITIES))

head(stormData2)
##                  EVTYPE TotalFatal TotalInj TotalPdmg TotalCdmg Freq
## 1    HIGH SURF ADVISORY          0        0     2e+05         0    1
## 2            WATERSPOUT          0        0     0e+00         0    1
## 3                  WIND          0        0     0e+00         0    1
## 4                     ?          0        0     5e+03         0    1
## 5       ABNORMAL WARMTH          0        0     0e+00         0    4
## 6        ABNORMALLY DRY          0        0     0e+00         0    2
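For readers without plyr, a comparable aggregation can be sketched in base R. The example below uses made-up numbers and only two of the summed variables; it is an illustration of the grouping step, not the code that produced the table above.

```r
# Toy records (hypothetical values)
toy <- data.frame(EVTYPE = c("FLOOD", "TORNADO", "FLOOD"),
                  FATALITIES = c(1, 2, 3),
                  INJURIES = c(0, 5, 1),
                  stringsAsFactors = FALSE)

# Sum both variables within each event type
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE,
                    data = toy, FUN = sum)

# Number of records per event type, aligned with the rows of totals
counts <- table(toy$EVTYPE)
totals$Freq <- as.vector(counts[as.character(totals$EVTYPE)])
print(totals)
```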

3) Analyzing data

Now we can address the following questions:

1. Which types of events are most harmful with respect to population health?

First, we create a subset with the health-related columns:

stormHealth <- stormData2[,c(1,2,3,6)]

Next, we add a column with the total number of casualties (fatalities plus injuries):

stormHealth$TotalCasual <- stormHealth$TotalFatal+stormHealth$TotalInj

Then, we can reorder the weather events from highest to lowest number of casualties (for the purposes of this assignment, only the top 10 will be considered).

index <- with(stormHealth, order(TotalCasual,decreasing = TRUE))
stormHealth <- stormHealth[index, ]
stormHealth[1:10,]
##             EVTYPE TotalFatal TotalInj   Freq TotalCasual
## 325        TORNADO       6057    93233  86442       99290
## 321   THUNDERSTORM       1572    14775 352569       16347
## 69  EXCESSIVE HEAT       3138     9224   2648       12362
## 83           FLOOD       1524     8602  82685       10126
## 393   WINTER STORM        462     4564 319249        5026
## 86          FREEZE        108     2065   4007        2173
## 383       WILDFIRE         90     1606   4231        1696
## 132      HURRICANE        133     1328    287        1461
## 212    RIP CURRENT        577      529    777        1106
## 84             FOG         62      734    538         796
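The add-a-total-and-reorder steps can be wrapped in a small helper so that the same logic is reusable for other columns. The function topN below is a hypothetical helper written for this sketch, and the numbers are made up.

```r
# Toy summary table (hypothetical values)
toy <- data.frame(EVTYPE = c("FOG", "TORNADO", "FLOOD"),
                  TotalFatal = c(1, 10, 5),
                  TotalInj = c(3, 50, 2),
                  stringsAsFactors = FALSE)
toy$TotalCasual <- toy$TotalFatal + toy$TotalInj

# Hypothetical helper: the n rows with the largest value in `column`
topN <- function(df, column, n) {
  head(df[order(df[[column]], decreasing = TRUE), ], n)
}

top2 <- topN(toy, "TotalCasual", 2)
print(top2)
```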

2. Which types of events have the greatest economic consequences?

To answer this question, we repeat what was done in the first question:

Subsetting:

stormEcon <- stormData2[,c(1,4,5,6)]

Adding a column with total damage:

stormEcon$TotalDmg <- stormEcon$TotalPdmg + stormEcon$TotalCdmg

Reordering:

index <- with(stormEcon, order(TotalDmg,decreasing = TRUE))
stormEcon <- stormEcon[index, ]
stormEcon[1:10,]
##             EVTYPE    TotalPdmg   TotalCdmg   Freq     TotalDmg
## 83           FLOOD 167507976930 12261926100  82685 179769903030
## 132      HURRICANE  84656180010  5505292800    287  90161472810
## 325        TORNADO  63152886102  1174706870  86442  64327592972
## 249    STORM SURGE  43323536000        5000    261  43323541000
## 393   WINTER STORM  24330502400  3326044723 319249  27656547123
## 44         DROUGHT   1046106000 13972566000   2488  15018672000
## 321   THUNDERSTORM  12310219925  1286106078 352569  13596326003
## 86          FREEZE   3998607560  7024174500   4007  11022782060
## 383       WILDFIRE   8491563500   402781630   4231   8894345130
## 327 TROPICAL STORM   7703890550   678346000    690   8382236550

4) Results

The answer to the first question can be observed in the following graphic:

barplot(stormHealth$TotalCasual[1:10], names.arg = stormHealth$EVTYPE[1:10],
        cex.names = 0.6, las = 2, col = "lightblue",
        main = "Top 10 events with highest number of\ncasualties")

So we conclude that tornadoes are the event type with the highest number of casualties (that is, fatalities plus injuries). Note that, after the cleaning step, the TORNADO category also includes the high-wind and strong-wind events.

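However, the events with the highest economic costs show a very different picture. Both panels cover the ten event types with the highest total damage. Note that the panels use different units (US$100,000 for property damage and US$10,000 for crop damage), so bars of equal height represent a property cost ten times greater than the crop cost.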

par(mfrow = c(1, 2))
# Property damage, in units of US$100,000
barplot(stormEcon$TotalPdmg[1:10] / 100000, names.arg = stormEcon$EVTYPE[1:10],
        cex.names = 0.6, las = 2, ylim = c(0, 1700000), col = "lightgreen",
        main = "Events with highest \n economic cost \n (by US$100,000)")
# Crop damage, in units of US$10,000
barplot(stormEcon$TotalCdmg[1:10] / 10000, names.arg = stormEcon$EVTYPE[1:10],
        cex.names = 0.6, las = 2, ylim = c(0, 1700000), col = "salmon",
        main = "Events with highest \n damage to crops \n (by US$10,000)")
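Both panels take their rows from stormEcon, which is ordered by total damage. If each panel should instead rank the events by its own damage type, one possible approach is to order the table separately for each column. The sketch below shows this on made-up numbers; byProperty and byCrops are names introduced for the example.

```r
# Toy damage totals (hypothetical values)
toy <- data.frame(EVTYPE = c("FLOOD", "DROUGHT", "HURRICANE"),
                  TotalPdmg = c(100, 1, 80),
                  TotalCdmg = c(10, 14, 5),
                  stringsAsFactors = FALSE)

# One ordering per damage type
byProperty <- toy[order(toy$TotalPdmg, decreasing = TRUE), ]
byCrops <- toy[order(toy$TotalCdmg, decreasing = TRUE), ]

print(byProperty$EVTYPE)
print(byCrops$EVTYPE)
```

Each ordered table could then be passed to barplot() in the same way as stormEcon.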

Based on this analysis, we conclude that tornadoes are the events most harmful to population health, that floods and hurricanes are the most costly events for property, and that droughts and floods are the events most harmful to crops.