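Using Storm Data from the National Weather Service, this analysis seeks to answer two important questions about weather events: which types of events are most harmful to population health, and which have the greatest economic consequences?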
To this end, the data is aggregated (by weather event) and analyzed by four key indicators: Fatalities, Injuries, Property Damage, and Crop Damage.
We’ll begin by loading the required packages and the .csv file. Because the file is large, it has been pre-downloaded to our R working directory and is loaded from there. No pre-processing has been done to this bzip2-compressed .csv file:
library(ggplot2)
library(lubridate)
stormdata <- read.csv("repdata%2Fdata%2FStormData.csv.bz2")
str(stormdata)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
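Only a handful of the 37 columns are used in the rest of this analysis, so one option is to drop the others at read time. The sketch below is a minimal illustration on a tiny made-up file, not the real archive: the `toy` data frame, its values, and the temporary path are assumptions for demonstration. It relies on `read.csv` reading bzip2-compressed files directly and on `"NULL"` entries in `colClasses` skipping a column.

```r
# Toy file standing in for the real archive (values are made up)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD"),
  FATALITIES = c(1, 0),
  INJURIES   = c(14, 2),
  REMARKS    = c("not needed", "not needed")
)
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# A named colClasses vector is matched by column name;
# "NULL" skips the column entirely while reading
keep <- c(EVTYPE = "character", FATALITIES = "numeric",
          INJURIES = "numeric", REMARKS = "NULL")
small <- read.csv(path, colClasses = keep)
str(small)
```

Reading EVTYPE as character rather than factor is a choice made for this sketch; the analysis itself works with the factor columns shown in the `str()` output.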
We will start by aggregating the data with respect to the EVTYPE (Weather Event Type) category. There are two indicators in this data set which tell us the impact to population health: FATALITIES and INJURIES. Let’s look at both.
fatal <- aggregate(FATALITIES ~ EVTYPE, data=stormdata, FUN=sum)
dim(fatal)
## [1] 985 2
injure <- aggregate(INJURIES ~ EVTYPE, data=stormdata, FUN=sum)
dim(injure)
## [1] 985 2
healthimpact <- cbind(fatal, injure$INJURIES)
colnames(healthimpact) <- c("Event", "Fatalities", "Injuries")
head(healthimpact)
## Event Fatalities Injuries
## 1 HIGH SURF ADVISORY 0 0
## 2 COASTAL FLOOD 0 0
## 3 FLASH FLOOD 0 0
## 4 LIGHTNING 0 0
## 5 TSTM WIND 0 0
## 6 TSTM WIND (G45) 0 0
Great - there are 985 different types of weather events recorded in this data set, and both summaries have 985 rows, so they look good. Just a note: these are the total numbers of injuries and fatalities across all occurrences of each weather event, not the average number per occurrence. We have chosen to analyze the data this way in order to see the total impact.
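To make the total-versus-average distinction concrete, here is a small sketch on made-up data (the `toy` data frame and its values are assumptions, not taken from Storm Data). It also shows that `aggregate` can summarize both indicators in a single call by putting `cbind()` on the left of the formula, which keeps the two columns aligned by event type without a separate `cbind` of two results.

```r
# Toy data (made-up values) standing in for stormdata
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "HEAT", "HEAT", "HEAT"),
  FATALITIES = c(3, 1, 2, 0, 4),
  INJURIES   = c(10, 0, 1, 0, 5)
)

# Totals per event type, both indicators in one call
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Same grouping, averaged per recorded occurrence instead
means <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = mean)

totals
means
```

An event type that is recorded often can rank high on totals while ranking low on the per-occurrence mean, which is why the choice matters.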
Because it is difficult to compare a fatality to an injury in this context, we will look for the top 5 events based on fatalities, and compare to the top 5 events based on injuries:
library(plyr)
byfatal <- arrange(healthimpact, desc(Fatalities))
head(byfatal, 5)
## Event Fatalities Injuries
## 1 TORNADO 5633 91346
## 2 EXCESSIVE HEAT 1903 6525
## 3 FLASH FLOOD 978 1777
## 4 HEAT 937 2100
## 5 LIGHTNING 816 5230
byinjure <- arrange(healthimpact, desc(Injuries))
head(byinjure, 5)
## Event Fatalities Injuries
## 1 TORNADO 5633 91346
## 2 TSTM WIND 504 6957
## 3 FLOOD 470 6789
## 4 EXCESSIVE HEAT 1903 6525
## 5 LIGHTNING 816 5230
Tornado, Excessive Heat, and Lightning appear on both lists, giving us 7 unique weather events. Let’s look more closely:
topevents <- rbind(byfatal[1:5,], byinjure[2:3,])
topevents
## Event Fatalities Injuries
## 1 TORNADO 5633 91346
## 2 EXCESSIVE HEAT 1903 6525
## 3 FLASH FLOOD 978 1777
## 4 HEAT 937 2100
## 5 LIGHTNING 816 5230
## 21 TSTM WIND 504 6957
## 31 FLOOD 470 6789
qplot(Fatalities, Injuries, data=topevents, colour=Event, main="Injuries and Fatalities by Weather Event, Top 5")
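Far and away, Tornados are the weather event with the most severe impact on population health in the United States: 5,633 fatalities and 91,346 injuries, about three times the fatalities and thirteen times the injuries of the next-highest event.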
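The combined table of seven events was built above with hard-coded row indices. As a sketch of an alternative, the union of the two top-5 lists can be computed directly, so the selection does not depend on knowing which rows overlap. The helper name `topn` is hypothetical, the first seven rows are copied from the tables above, and the last row is a made-up placeholder added only so that the filter has something to exclude.

```r
healthimpact <- data.frame(
  Event      = c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD", "HEAT",
                 "LIGHTNING", "TSTM WIND", "FLOOD", "PLACEHOLDER EVENT"),
  Fatalities = c(5633, 1903, 978, 937, 816, 504, 470, 10),
  Injuries   = c(91346, 6525, 1777, 2100, 5230, 6957, 6789, 100)
)

# Hypothetical helper: names of the n events with the largest value in `col`
topn <- function(df, col, n = 5) {
  df$Event[order(df[[col]], decreasing = TRUE)][seq_len(n)]
}

# Events that appear in either top-5 list
both <- union(topn(healthimpact, "Fatalities"), topn(healthimpact, "Injuries"))
topevents <- healthimpact[healthimpact$Event %in% both, ]
topevents
```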
Economic impact is recorded in this data set as both Property Damage and Crop Damage. Since both are dollar amounts, it should be simple to sum them and determine the total financial impact of each weather event. However, each amount is paired with a magnitude code: “K” for thousands of dollars, “M” for millions, or “B” for billions, so we’ll need to convert before we continue. Here’s an example of how damages are recorded in this data set:
head(stormdata[,25:28],10)
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 25.0 K 0
## 2 2.5 K 0
## 3 25.0 K 0
## 4 2.5 K 0
## 5 2.5 K 0
## 6 2.5 K 0
## 7 2.5 K 0
## 8 2.5 K 0
## 9 25.0 K 0
## 10 25.0 K 0
While this is simple in theory, aggregating the damage amounts by their magnitude codes reveals that it may not be as easy as we thought:
propdamage <- aggregate(PROPDMG ~ PROPDMGEXP, data=stormdata, FUN=sum)
propdamage
## PROPDMGEXP PROPDMG
## 1 527.41
## 2 - 15.00
## 3 ? 0.00
## 4 + 117.00
## 5 0 7108.30
## 6 1 0.00
## 7 2 12.00
## 8 3 20.00
## 9 4 14.50
## 10 5 210.50
## 11 6 65.00
## 12 7 82.00
## 13 8 0.00
## 14 B 275.85
## 15 h 2.00
## 16 H 25.00
## 17 K 10735292.10
## 18 m 38.90
## 19 M 140694.45
cropdamage <- aggregate(CROPDMG ~ CROPDMGEXP, data=stormdata, FUN=sum)
cropdamage
## CROPDMGEXP CROPDMG
## 1 11.00
## 2 ? 0.00
## 3 0 260.00
## 4 2 0.00
## 5 B 13.61
## 6 k 436.00
## 7 K 1342955.91
## 8 m 10.00
## 9 M 34140.80
Clearly, we have some unclean data here. Fortunately, the totals recorded under the unexpected codes are small compared to those under the codes we care about: B, m, M, k and K. Let’s convert and sum on each row of the original data set. We’ll express everything in thousands of dollars, so no transformation to K is required; note that rows with any other code are also left unscaled and are therefore counted as thousands:
stormdata = within(stormdata, {
  # In thousands of dollars: B -> x1e6, m/M -> x1e3, any other code left as-is
  prop1 = ifelse(PROPDMGEXP == "B", PROPDMG*1000000,
                 ifelse(PROPDMGEXP %in% c("m", "M"), PROPDMG*1000, PROPDMG))
  crop1 = ifelse(CROPDMGEXP == "B", CROPDMG*1000000,
                 ifelse(CROPDMGEXP %in% c("m", "M"), CROPDMG*1000, CROPDMG))
  economic = prop1 + crop1
})
summary(stormdata$economic)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00e+00 0.00e+00 0.00e+00 5.28e+02 1.00e+00 1.15e+08
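As a point of comparison, the same conversion can be written with a named lookup vector. This is a sketch of a stricter variant, not the method used in this analysis: unrecognized codes become `NA` instead of being counted as thousands. The `toy` rows and their amounts are made up, and the names `mult` and `prop_thousands` are introduced here only for illustration.

```r
# Toy rows (made-up amounts) covering several of the codes seen above
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1.2, 3, 10, 5),
  PROPDMGEXP = c("K", "M", "B", "m", "", "?"),
  stringsAsFactors = FALSE
)

# Multipliers expressed in thousands of dollars
mult <- c(K = 1, k = 1, M = 1e3, m = 1e3, B = 1e6)

# as.character() matters if the column is a factor: indexing by a factor
# would use its integer codes rather than its labels.
# Codes with no entry in `mult` ("", "?", digits, ...) yield NA.
toy$prop_thousands <- toy$PROPDMG * unname(mult[as.character(toy$PROPDMGEXP)])
toy
```

Summing such a column needs `na.rm = TRUE`, and comparing the result with the totals from the nested `ifelse` version would show how much the unrecognized codes contribute.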
Great - the summary of our computed “economic” variable (in thousands of dollars) shows a median of 0 and a mean of 528 against a maximum of 1.15e+08, which suggests that a small number of events are causing the vast majority of economic impact. Let’s take a closer look to see if that is correct:
econ <- aggregate(economic ~ EVTYPE, data=stormdata, FUN=sum)
econimpact <- arrange(econ, desc(economic))
head(econimpact, 10)
## EVTYPE economic
## 1 FLOOD 150319685
## 2 HURRICANE/TYPHOON 71913713
## 3 TORNADO 57352572
## 4 STORM SURGE 43323541
## 5 HAIL 18758572
## 6 FLASH FLOOD 17562686
## 7 DROUGHT 15018672
## 8 HURRICANE 14610229
## 9 RIVER FLOOD 10148404
## 10 ICE STORM 8967091
tot <- sum(econimpact$economic)
fiveworst <- sum(econimpact[1:5, 2])
flood <- econimpact[1,2]
flood/tot
## [1] 0.3155118
fiveworst/tot
## [1] 0.7171403
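The two fractions above can be generalized to a running cumulative share with `cumsum`. The sketch below uses only the ten rows printed in the table above, so its shares are relative to those ten events rather than to all 985 event types, and they come out larger than the fractions just computed. The column name `cumshare` is introduced here for illustration.

```r
# The ten rows printed above (economic is in thousands of dollars)
econimpact <- data.frame(
  EVTYPE   = c("FLOOD", "HURRICANE/TYPHOON", "TORNADO", "STORM SURGE", "HAIL",
               "FLASH FLOOD", "DROUGHT", "HURRICANE", "RIVER FLOOD", "ICE STORM"),
  economic = c(150319685, 71913713, 57352572, 43323541, 18758572,
               17562686, 15018672, 14610229, 10148404, 8967091)
)

# Rows are already sorted by descending damage, so the running total
# divided by the grand total gives the share covered by the top k events
econimpact$cumshare <- cumsum(econimpact$economic) / sum(econimpact$economic)
econimpact
```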
31.5% of all weather-related damage in the United States as recorded by this data set was caused by Floods, and 71.7% by the top 5 most damaging events. Interesting! Let’s graph the 5 worst offenders:
top5 <- econimpact[1:5,]
qplot(EVTYPE, data=top5, weight=economic, geom="bar", ylab="Cost, in thousands", main="Economic Impact of Weather Events, Top 5")
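To our knowledge, `qplot()` is deprecated in recent ggplot2 releases (from 3.4.0), so here is a sketch of a roughly equivalent chart using `ggplot()` and `geom_col()`. The values are copied from the table above; ordering the bars by cost and rescaling the axis to billions of dollars are choices made for this sketch, not part of the original plot.

```r
library(ggplot2)

# Top 5 rows from the table above (economic is in thousands of dollars)
top5 <- data.frame(
  EVTYPE   = c("FLOOD", "HURRICANE/TYPHOON", "TORNADO", "STORM SURGE", "HAIL"),
  economic = c(150319685, 71913713, 57352572, 43323541, 18758572)
)

# reorder() sorts the bars from most to least costly;
# thousands of dollars / 1e6 = billions of dollars
p <- ggplot(top5, aes(x = reorder(EVTYPE, -economic), y = economic / 1e6)) +
  geom_col() +
  labs(x = "Event type", y = "Cost, in billions of dollars",
       title = "Economic Impact of Weather Events, Top 5")
print(p)
```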
Based on the data analysis performed here with the Storm Data weather set, we conclude:
1. Tornados are the #1 cause of weather-related fatalities and injuries in the US. They are the #3 most expensive weather event.
2. Floods are the #1 cause of weather-related damage and expenses in the US. They are the #3 cause of injuries, and the separately recorded Flash Flood category is the #3 cause of fatalities.