The five most cumulatively costly severe weather events in terms of property and crop damage are Hurricanes, Typhoons, Floods, Tornadoes, and Storm Surges. The five most cumulatively harmful severe weather events in terms of human health outcomes–injuries and fatalities–are Tornadoes, Excessive Heat, Thunderstorm Winds, Floods, and Lightning. Tornadoes are by far the most dangerous for health outcomes, and Hurricanes/Typhoons also are significantly more economically costly than other severe weather events.
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, destfile = "stormdata.csv.bz2", method = "curl")
stormData<-read.csv(bzfile("stormdata.csv.bz2"), header=TRUE)
Let’s take a look at the basic structure of the data frame
str(stormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
We can see that the data set is rather large: 902297x37, and the variables of interest to us are EVTYPE (for storm type), CROPDMG for crop damage, PROPDMG for property damage, and INJURIES, and FATALITIES. There then appear to be exponents PROPDMGEXP and CROPDMGEXP for the economic damage. We’ll go through these variables and see what the proportion of missing values is for each one, as well as the distribution of values using “summary()”:
summary(stormData$EVTYPE)
## HAIL TSTM WIND THUNDERSTORM WIND
## 288661 219940 82563
## TORNADO FLASH FLOOD FLOOD
## 60652 54277 25326
## THUNDERSTORM WINDS HIGH WIND LIGHTNING
## 20843 20212 15754
## HEAVY SNOW HEAVY RAIN WINTER STORM
## 15708 11723 11433
## WINTER WEATHER FUNNEL CLOUD MARINE TSTM WIND
## 7026 6839 6175
## MARINE THUNDERSTORM WIND WATERSPOUT STRONG WIND
## 5812 3796 3566
## URBAN/SML STREAM FLD WILDFIRE BLIZZARD
## 3392 2761 2719
## DROUGHT ICE STORM EXCESSIVE HEAT
## 2488 2006 1678
## HIGH WINDS WILD/FOREST FIRE FROST/FREEZE
## 1533 1457 1342
## DENSE FOG WINTER WEATHER/MIX TSTM WIND/HAIL
## 1293 1104 1028
## EXTREME COLD/WIND CHILL HEAT HIGH SURF
## 1002 767 725
## TROPICAL STORM FLASH FLOODING EXTREME COLD
## 690 682 655
## COASTAL FLOOD LAKE-EFFECT SNOW FLOOD/FLASH FLOOD
## 650 636 624
## LANDSLIDE SNOW COLD/WIND CHILL
## 600 587 539
## FOG RIP CURRENT MARINE HAIL
## 538 470 442
## DUST STORM AVALANCHE WIND
## 427 386 340
## RIP CURRENTS STORM SURGE FREEZING RAIN
## 304 261 250
## URBAN FLOOD HEAVY SURF/HIGH SURF EXTREME WINDCHILL
## 249 228 204
## STRONG WINDS DRY MICROBURST ASTRONOMICAL LOW TIDE
## 196 186 174
## HURRICANE RIVER FLOOD LIGHT SNOW
## 174 173 154
## STORM SURGE/TIDE RECORD WARMTH COASTAL FLOODING
## 148 146 143
## DUST DEVIL MARINE HIGH WIND UNSEASONABLY WARM
## 141 135 126
## FLOODING ASTRONOMICAL HIGH TIDE MODERATE SNOWFALL
## 120 103 101
## URBAN FLOODING WINTRY MIX HURRICANE/TYPHOON
## 98 90 88
## FUNNEL CLOUDS HEAVY SURF RECORD HEAT
## 87 84 81
## FREEZE HEAT WAVE COLD
## 74 74 72
## RECORD COLD ICE THUNDERSTORM WINDS HAIL
## 64 61 61
## TROPICAL DEPRESSION SLEET UNSEASONABLY DRY
## 60 59 56
## FROST GUSTY WINDS THUNDERSTORM WINDSS
## 53 53 51
## MARINE STRONG WIND OTHER SMALL HAIL
## 48 48 47
## FUNNEL FREEZING FOG THUNDERSTORM
## 46 45 45
## Temperature record TSTM WIND (G45) Coastal Flooding
## 43 39 38
## WATERSPOUTS MONTHLY PRECIPITATION WINDS
## 37 36 36
## (Other)
## 2940
summary(stormData$CROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.527 0.000 990.000
summary(stormData$PROPDMG)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 12.06 0.50 5000.00
summary(stormData$INJURIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1557 0.0000 1700.0000
summary(stormData$FATALITIES)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.0168 0.0000 583.0000
summary(stormData$PROPDMGEXP)
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
summary(stormData$CROPDMGEXP)
## ? 0 2 B k K m M
## 618413 7 19 1 9 21 281832 1 1994
Good news is there are no NA values to deal with here! We can see, though, that there are a lot of “0” values in all the economic and health outcome variables. This indicates that the majority of the storms in this data set are relatively non-damaging.
There are some confusing symbols in the “exp” sections however. For the purposes of this analysis, I will exclude infrequent and ambiguous signs, namely “-”, “?”, and “+” (i.e. count them as 0s). I will count “h” or “H” as a 2s exponent, “B” “k” or “K” as a 3s exponent, “m” or “M” as a 6s exponent, and “b” or “B” as a 9s exponent (this is for hundreds, thousands, etc.). I’ll do this by defining a simple function and using apply on the the EXP columns
getExp<-function(char){
if (char=="h" | char=="H"){
char = 2
}
else if (char=="k" | char=="K"){
char = 3
}
else if (char=="m" | char=="M"){
char = 6
}
else if (char=="b" | char=="B"){
char = 9
}
else if (char=="-" | char == "?" | char == "+" | char == ""){
char = 0
}
else if (class(char)=="numeric" | class(char)=="integer"){
char
}
else{
#just in case there's odd/empty strings that didn't show up in summary
char = 0
}
char
}
stormData$CROPDMGEXP<-sapply(stormData$CROPDMGEXP, getExp)
stormData$PROPDMGEXP<-sapply(stormData$PROPDMGEXP, getExp)
Now I’m going to make a new column for the true property/crop damages. To do this, I’ll use dplyr to “mutate” on a couple new columns that are the products of taking CROP/PROPDMG to the EXP number
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
stormData_tbl<-tbl_df(stormData)
stormData_tbl<-stormData_tbl%>%
mutate(PROPDMG_ACTUAL = (PROPDMG)*(10^PROPDMGEXP))%>%
mutate(CROPDMG_ACTUAL = (CROPDMG)*(10^PROPDMGEXP))
The goal of this section is to answer which storm types cause the most economic and human health damage, respectively. I will pay attention only to storms that cause a damage that is quantified>0. I’ll start out by looking at property and crop damage individually, then sum them to get a summary of total economic damage.
propDmg<-stormData_tbl%>%
#select the data we care about
select(EVTYPE, PROPDMG_ACTUAL)%>%
filter(PROPDMG_ACTUAL>0)%>%
#group by event and sum the total damage value over events
group_by(EVTYPE)%>%
summarize(totalPropDmg = sum(PROPDMG_ACTUAL))%>%
arrange(desc(totalPropDmg))
head(propDmg, n=10)
## Source: local data frame [10 x 2]
##
## EVTYPE totalPropDmg
## 1 FLOOD 144657709807
## 2 HURRICANE/TYPHOON 69305840000
## 3 TORNADO 56937160779
## 4 STORM SURGE 43323536000
## 5 FLASH FLOOD 16140812067
## 6 HAIL 15732267543
## 7 HURRICANE 11868319010
## 8 TROPICAL STORM 7703890550
## 9 WINTER STORM 6688497251
## 10 HIGH WIND 5270046295
cropDmg<-stormData_tbl%>%
#do the same as above
select(EVTYPE, CROPDMG_ACTUAL)%>%
filter(CROPDMG_ACTUAL>0)%>%
#group by event and sum the total damage value over events
group_by(EVTYPE)%>%
summarize(totalCropDmg = sum(CROPDMG_ACTUAL))%>%
arrange(desc(totalCropDmg))
head(cropDmg, n=10)
## Source: local data frame [10 x 2]
##
## EVTYPE totalCropDmg
## 1 HURRICANE 802881916053
## 2 HURRICANE/TYPHOON 732768451330
## 3 FLOOD 87251978368
## 4 FLASH FLOOD 38822140319
## 5 TORNADO 28269878512
## 6 HAIL 15314323923
## 7 HURRICANE OPAL/HIGH WINDS 10000000000
## 8 TSTM WIND 7684657862
## 9 HIGH WIND 7174066263
## 10 WILDFIRE 7173809478
econDmg<-stormData_tbl%>%
#do the same as above
select(EVTYPE, CROPDMG_ACTUAL, PROPDMG_ACTUAL)%>%
filter(CROPDMG_ACTUAL>0 | PROPDMG_ACTUAL>0)%>%
#group by event and sum the total damage value over events
group_by(EVTYPE)%>%
summarize(totalEconDmg = sum(CROPDMG_ACTUAL, PROPDMG_ACTUAL))%>%
arrange(desc(totalEconDmg))
head(econDmg, n=10)
## Source: local data frame [10 x 2]
##
## EVTYPE totalEconDmg
## 1 HURRICANE 814750235063
## 2 HURRICANE/TYPHOON 802074291330
## 3 FLOOD 231909688175
## 4 TORNADO 85207039291
## 5 FLASH FLOOD 54962952386
## 6 STORM SURGE 43328536000
## 7 HAIL 31046591466
## 8 HIGH WIND 12444112558
## 9 TSTM WIND 12169586357
## 10 WILDFIRE 11938923478
library(ggplot2)
#we'll compare the 10 most damaging event types
econDmg10<-head(econDmg, n=10)
ggplot(econDmg10, aes(x=EVTYPE, y = totalEconDmg))+geom_bar(stat="identity")
Above we can see the top 10 most costly weather for crop and property damage respectively, then the top 10 with the most cumulative damage. Finally, the barplot plots all the significantly economically damaging events.
I refer the reader to the executive summary for a synopsis of this data. It is also important to note that we have not explored the distribution of the damage by event (i.e. which events cause the most severe damage in one go). This would be useful but a bit beyond the scope of this analysis. This data just gives us a good idea of the worst events.
Now let’s use dplyr again to get a summary of the human health outcomes in injuries and fatalities of these severe weather events
injuryDmg<-stormData_tbl%>%
select(EVTYPE, INJURIES)%>%
filter(INJURIES>0)%>%
#group by event and sum the total damage value over events
group_by(EVTYPE)%>%
summarize(totalInjuryDmg = sum(INJURIES))%>%
arrange(desc(totalInjuryDmg))
head(injuryDmg, n=10)
## Source: local data frame [10 x 2]
##
## EVTYPE totalInjuryDmg
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
fatalDmg<-stormData_tbl%>%
select(EVTYPE, FATALITIES)%>%
filter(FATALITIES>0)%>%
#group by event and sum the total damage value over events
group_by(EVTYPE)%>%
summarize(totalFatalDmg = sum(FATALITIES))%>%
arrange(desc(totalFatalDmg))
head(fatalDmg, n=10)
## Source: local data frame [10 x 2]
##
## EVTYPE totalFatalDmg
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
healthDmg<-stormData_tbl%>%
#do the same as above
select(EVTYPE, INJURIES, FATALITIES)%>%
filter(INJURIES>0 | FATALITIES>0)%>%
#group by event and sum the total damage value over events
group_by(EVTYPE)%>%
summarize(totalhealthDmg = sum(INJURIES, FATALITIES))%>%
arrange(desc(totalhealthDmg))
head(healthDmg, n=10)
## Source: local data frame [10 x 2]
##
## EVTYPE totalhealthDmg
## 1 TORNADO 96979
## 2 EXCESSIVE HEAT 8428
## 3 TSTM WIND 7461
## 4 FLOOD 7259
## 5 LIGHTNING 6046
## 6 HEAT 3037
## 7 FLASH FLOOD 2755
## 8 ICE STORM 2064
## 9 THUNDERSTORM WIND 1621
## 10 WINTER STORM 1527
library(ggplot2)
#we'll compare the 10 most damaging event types
healthDmg10<-head(healthDmg, n=10)
ggplot(healthDmg10, aes(x=EVTYPE, y = totalhealthDmg))+geom_bar(stat="identity")
The data here is produced identically to the economic data. Again, I refer the reader to the synopsis for an executive summary of this data.