This is the second Peer Assessment of the Reproducible Research course in the Data Science Specialization. It analyzes the characteristics of major storms and weather events in the United States and answers two basic questions: which types of events are most harmful with respect to population health, and which types of events have the greatest economic consequences?
The raw database and its documentation can be found on the course website.
The first step is to load libraries and download the database into R:
library(plyr)
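url <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# The download is bzip2-compressed, so keep the .bz2 extension and use binary mode
destfile <- "./StormData.csv.bz2"
download.file(url, destfile, mode = "wb")
And read it (read.csv() decompresses the .bz2 file transparently):
stormData <- read.csv(destfile)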
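As a small illustration of why the compressed download can be passed straight to read.csv(), the sketch below writes a tiny bzip2-compressed CSV to a temporary file and reads it back. The file contents and the names `tmp`, `con` and `toy` are made up for this example and are not part of the analysis.

```r
# Write a tiny, made-up CSV through a bzip2-compressed connection.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,0", "FLOOD,2"), con)
close(con)

# read.csv() opens the path as a file connection, which detects the
# compression from the file contents and decompresses while reading.
toy <- read.csv(tmp, stringsAsFactors = FALSE)
unlink(tmp)

toy
```

The same mechanism is what lets the full storm database be read without an explicit decompression step.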
Next, we must select the columns that will be used and discard the ones that will not.
stormData <- stormData[,c(8,23,24,25,26,27,28)]
head(stormData)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
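Selecting columns by position works, but it silently picks the wrong columns if the layout of the file ever changes. A minimal sketch of selecting by name instead is shown below, on a one-row toy data frame. The `STATE` and `REMARKS` columns and all values are placeholders chosen for the example.

```r
# One-row toy data frame; values and the extra columns are placeholders.
toy <- data.frame(STATE = "AL", EVTYPE = "TORNADO", FATALITIES = 0,
                  INJURIES = 15, PROPDMG = 25, PROPDMGEXP = "K",
                  REMARKS = "", stringsAsFactors = FALSE)

# Keep only the columns needed for the analysis, identified by name.
keep <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP")
toy <- toy[, keep]

names(toy)
```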
Then, we must convert the PROPDMG and CROPDMG values to dollars using their exponent columns (K for thousands, M for millions and B for billions). Values with any other exponent code are left unchanged.
stormData$PROPDMGEXP <- toupper(stormData$PROPDMGEXP)
stormData$CROPDMGEXP <- toupper(stormData$CROPDMGEXP)
stormData$PROPDMG <- ifelse(stormData$PROPDMGEXP=="K",
stormData$PROPDMG * 1000,stormData$PROPDMG)
stormData$PROPDMG <- ifelse(stormData$PROPDMGEXP=="M",
stormData$PROPDMG * 1000000,stormData$PROPDMG)
stormData$PROPDMG <- ifelse(stormData$PROPDMGEXP=="B",
stormData$PROPDMG * 1000000000,stormData$PROPDMG)
stormData$CROPDMG <- ifelse(stormData$CROPDMGEXP=="K",
stormData$CROPDMG * 1000,stormData$CROPDMG)
stormData$CROPDMG <- ifelse(stormData$CROPDMGEXP=="M",
stormData$CROPDMG * 1000000,stormData$CROPDMG)
stormData$CROPDMG <- ifelse(stormData$CROPDMGEXP=="B",
stormData$CROPDMG * 1000000000,stormData$CROPDMG)
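The same scaling can be written more compactly with a lookup vector. The following is only a sketch on made-up toy values, not the code used to produce the results in this report; `toy`, `multiplier` and `mult` are names introduced for the example.

```r
# Toy stand-in for the PROPDMG / PROPDMGEXP columns (values are made up).
toy <- data.frame(PROPDMG = c(25, 2.5, 1, 7),
                  PROPDMGEXP = c("K", "m", "B", ""),
                  stringsAsFactors = FALSE)

# Named lookup vector: exponent code -> multiplier.
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up each (upper-cased) code; codes that are not in the table
# come back as NA and are treated as a multiplier of 1.
mult <- unname(multiplier[toupper(toy$PROPDMGEXP)])
mult[is.na(mult)] <- 1

toy$PROPDMG <- toy$PROPDMG * mult
toy$PROPDMG  # 25000, 2500000, 1e9, 7
```

A lookup vector keeps the code-to-multiplier mapping in one place, so the same lines can be reused for CROPDMG.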
Discarding unnecessary columns:
stormData <- stormData[,c(1,2,3,4,6)]
We must also clean the EVTYPE variable, folding misspellings and closely related labels into a single category each:
stormData$EVTYPE <- toupper(as.character(stormData$EVTYPE))
# paste0() joins the pattern pieces so that no line breaks end up inside the regex
filter <- grepl(paste0("THUNDERSTORM.|TSTM|LIGHTNING|TUNDERSTORM WIND|",
"THUNERSTORM WINDS|THUNDERTORM WINDS|THUNDERSTROM WIND|",
"THUNDERSNOW|THUNDERESTORM WINDS|THUDERSTORM WINDS|",
"THUNDEERSTORM WINDS|TUNDERSTORM WIND"),stormData$EVTYPE)
stormData$EVTYPE[filter] <- "THUNDERSTORM"
filter <- grepl(paste0("COLD|COLD AND SNOW|COLD AND WET CONDITIONS|COLD TEMPERATURE|",
"COLD WAVE|COLD WEATHER|COLD/WIND CHILL|COLD/WINDS|",
"COOL AND WET|EXTENDED COLD"),stormData$EVTYPE)
stormData$EVTYPE[filter] <- "COLD"
filter <- grepl(paste0("DAMAGING FREEZE|EARLY FROST|FREEZE|FREEZING.|FROST|",
"FROST/FREEZE|HARD FREEZE|ICE.|ICY ROADS"),
stormData$EVTYPE)
stormData$EVTYPE[filter] <- "FREEZE"
filter <- grepl("RAIN|RAIN/SNOW|RAINSTORM",stormData$EVTYPE)
stormData$EVTYPE[filter] <- "RAIN"
filter <- grepl("FLOOD|FLOODING",stormData$EVTYPE)
stormData$EVTYPE[filter] <- "FLOOD"
filter <- grepl("ICE STORM|WINTER STORM|HAIL|HEAVY SNOW|BLIZZARD",
stormData$EVTYPE)
stormData$EVTYPE[filter] <- "WINTER STORM"
filter <- grepl("RIP CURRENT|RIP CURRENTS",stormData$EVTYPE)
stormData$EVTYPE[filter] <-"RIP CURRENT"
filter <- grepl("EXCESSIVE HEAT|HEAT|HEAT WAVE",stormData$EVTYPE)
stormData$EVTYPE[filter] <-"EXCESSIVE HEAT"
filter <- grepl("TORNADO|HIGH WIND|STRONG WIND|TORNDAO",stormData$EVTYPE)
stormData$EVTYPE[filter] <- "TORNADO"
filter <- grepl("WILD.|WILDFIRE",stormData$EVTYPE)
stormData$EVTYPE[filter] <- "WILDFIRE"
filter <- grepl("HURRICANE.|HURRICANE/TYPHOON",stormData$EVTYPE)
stormData$EVTYPE[filter] <- "HURRICANE"
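The recoding steps above all follow the same pattern: match a regular expression, then overwrite the matching labels. A minimal sketch of that idea driven by a small rule table is shown below. The event labels, the three rules and the names `events` and `rules` are illustrative only and cover far fewer cases than the code above.

```r
# Made-up raw labels, upper-cased as in the cleaning step.
events <- toupper(c("TSTM WIND", "THUNDERSTORM WINDS", "Flash Flood",
                    "HEAT WAVE", "DROUGHT"))

# Rule table: regular expression (name) -> replacement label (value).
rules <- c("THUNDERSTORM|TSTM" = "THUNDERSTORM",
           "FLOOD"             = "FLOOD",
           "HEAT"              = "EXCESSIVE HEAT")

# Apply the rules in order; a label rewritten by an earlier rule is
# matched against later rules under its new name.
for (pattern in names(rules)) {
  events[grepl(pattern, events)] <- rules[[pattern]]
}

events
```

Because the rules run in sequence, their order matters: a label that has already been renamed can no longer be claimed by a later pattern that would have matched its original text.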
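And finally, we aggregate the database by event type, summing fatalities, injuries, property damage and crop damage, and counting the number of events in each category.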
stormData2 <- ddply(stormData,.(EVTYPE),summarize,TotalFatal=sum(FATALITIES),
TotalInj=sum(INJURIES),TotalPdmg=sum(PROPDMG),
TotalCdmg=sum(CROPDMG), Freq=length(FATALITIES))
head(stormData2)
## EVTYPE TotalFatal TotalInj TotalPdmg TotalCdmg Freq
## 1 HIGH SURF ADVISORY 0 0 2e+05 0 1
## 2 WATERSPOUT 0 0 0e+00 0 1
## 3 WIND 0 0 0e+00 0 1
## 4 ? 0 0 5e+03 0 1
## 5 ABNORMAL WARMTH 0 0 0e+00 0 4
## 6 ABNORMALLY DRY 0 0 0e+00 0 2
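For readers without plyr, the same kind of per-event totals can be sketched in base R with aggregate(). This is an alternative shown on made-up toy data, not the code that produced the table above; `toy` and `totals` are names introduced for the example.

```r
# Made-up toy records: five events of two types.
toy <- data.frame(EVTYPE = c("FLOOD", "TORNADO", "FLOOD", "TORNADO", "FLOOD"),
                  FATALITIES = c(1, 0, 2, 3, 0),
                  INJURIES = c(0, 5, 1, 2, 0),
                  stringsAsFactors = FALSE)

# Sum both outcome columns within each event type.
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Add the number of records per event type, matched by name.
counts <- table(toy$EVTYPE)
totals$Freq <- as.vector(counts[as.character(totals$EVTYPE)])

totals
```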
Now we can address the following questions:
1. Which types of events are most harmful with respect to population health?
First, we create a subset with the event type, fatalities, injuries and frequency columns:
stormHealth <- stormData2[,c(1,2,3,6)]
Next, we add a column with the total number of casualties (fatalities plus injuries):
stormHealth$TotalCasual <- stormHealth$TotalFatal+stormHealth$TotalInj
Then, we can reorder the weather events from highest to lowest number of casualties (for the purposes of this assignment, only the top 10 will be considered).
index <- with(stormHealth, order(TotalCasual,decreasing = TRUE))
stormHealth <- stormHealth[index, ]
stormHealth[1:10,]
## EVTYPE TotalFatal TotalInj Freq TotalCasual
## 325 TORNADO 6057 93233 86442 99290
## 321 THUNDERSTORM 1572 14775 352569 16347
## 69 EXCESSIVE HEAT 3138 9224 2648 12362
## 83 FLOOD 1524 8602 82685 10126
## 393 WINTER STORM 462 4564 319249 5026
## 86 FREEZE 108 2065 4007 2173
## 383 WILDFIRE 90 1606 4231 1696
## 132 HURRICANE 133 1328 287 1461
## 212 RIP CURRENT 577 529 777 1106
## 84 FOG 62 734 538 796
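The ranking step (add a total, sort in decreasing order, keep the first rows) can be tried on a small example. The sketch below uses made-up event names and counts; `toy` and `top2` are hypothetical names.

```r
# Made-up per-event totals.
toy <- data.frame(EVTYPE = c("A", "B", "C", "D"),
                  TotalFatal = c(1, 10, 0, 4),
                  TotalInj = c(5, 2, 0, 40),
                  stringsAsFactors = FALSE)

# Casualties are fatalities plus injuries.
toy$TotalCasual <- toy$TotalFatal + toy$TotalInj

# order() returns the row positions sorted by casualties, largest first;
# head() then keeps the top two rows.
top2 <- head(toy[order(toy$TotalCasual, decreasing = TRUE), ], 2)

top2$EVTYPE  # "D" "B"
```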
2. Which types of events have the greatest economic consequences?
To answer this question, we repeat what was done in the first question:
Subsetting:
stormEcon <- stormData2[,c(1,4,5,6)]
Adding a column with total damage:
stormEcon$TotalDmg <- stormEcon$TotalPdmg + stormEcon$TotalCdmg
Reordering:
index <- with(stormEcon, order(TotalDmg,decreasing = TRUE))
stormEcon <- stormEcon[index, ]
stormEcon[1:10,]
## EVTYPE TotalPdmg TotalCdmg Freq TotalDmg
## 83 FLOOD 167507976930 12261926100 82685 179769903030
## 132 HURRICANE 84656180010 5505292800 287 90161472810
## 325 TORNADO 63152886102 1174706870 86442 64327592972
## 249 STORM SURGE 43323536000 5000 261 43323541000
## 393 WINTER STORM 24330502400 3326044723 319249 27656547123
## 44 DROUGHT 1046106000 13972566000 2488 15018672000
## 321 THUNDERSTORM 12310219925 1286106078 352569 13596326003
## 86 FREEZE 3998607560 7024174500 4007 11022782060
## 383 WILDFIRE 8491563500 402781630 4231 8894345130
## 327 TROPICAL STORM 7703890550 678346000 690 8382236550
The answer to the first question can be observed in the following graphic:
barplot(stormHealth$TotalCasual[1:10],names=stormHealth$EVTYPE[1:10],
cex.names=0.6,las=2,col="lightblue",
main="Top 10 events with highest number of\ncasualties")
So we conclude that the event type with the highest number of casualties (fatalities plus injuries) is the tornado. Note that the TORNADO category used here also absorbs high-wind and strong-wind events.
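However, when we look at the events with the highest economic costs we observe very different results. Note that the two panels use different units: property damage is plotted in units of US$100,000 and crop damage in units of US$10,000, so a bar of a given height represents ten times more damage in the property panel than in the crop panel.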
par(mfrow=c(1,2))
barplot(stormEcon$TotalPdmg[1:10]/100000,names=stormEcon$EVTYPE[1:10],
cex.names=0.6,las=2,ylim=c(0,1700000),col="lightgreen",
main="Events with highest \n economic cost \n (by US$100,000)")
barplot(stormEcon$TotalCdmg[1:10]/10000,names=stormEcon$EVTYPE[1:10],
cex.names=0.6,las=2,ylim=c(0,1700000),col="salmon",
main="Events with highest \n damage to crops \n (by US$10,000)")
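Because the two panels use different units, it can help to see both kinds of damage on one common scale. The sketch below is one possible way to do that, not part of the original analysis: it takes the property and crop totals for the first three rows of the economic table above and plots them side by side in US$ billions. The names `dmg` and `dmg_billions` are introduced for the example.

```r
# Property and crop damage (US$) for the top three event types,
# copied from the economic table above.
dmg <- rbind(
  Property = c(FLOOD = 167507976930, HURRICANE = 84656180010, TORNADO = 63152886102),
  Crops    = c(FLOOD = 12261926100,  HURRICANE = 5505292800,  TORNADO = 1174706870)
)

# Express both rows in the same unit so bar heights are directly comparable.
dmg_billions <- dmg / 1e9

# One group of two bars (property, crops) per event type.
barplot(dmg_billions, beside = TRUE, las = 2, cex.names = 0.6,
        col = c("lightgreen", "salmon"),
        legend.text = rownames(dmg_billions),
        main = "Damage by event type \n (US$ billions)")
```

On a shared scale the crop bars are much shorter than the property bars for these three event types, which is the point the unit difference in the two-panel figure obscures.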
Based on the data analysis, we can conclude that the events most harmful to population health are tornadoes, the most costly events for property are floods and hurricanes, and the events most harmful to crops are droughts and floods.