NOAA’s collected storm data from 1950 to 2011 is a valuable source for information regarding the future preventive efforts in avoiding damages to life and property. Here we want to find out which type of events cause the most damages, physically and economically. As it turns out, we can see that we have one single type of event that clearly ranks on the top of the impact scale, Tornados. Tornados cause more than 13 times the amount of injuries, almost 3 times the amount of fatalities and more than double the amount of property damages always compared to the second rank. Only crop damages are led by a different event type, Hail, with Tornados on rank 5.
Reading in the raw data
raw_storm <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))
First look at the data
dim(raw_storm)
## [1] 902297 37
head(raw_storm)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
It appears that most of the variables are not interesting for the question on hand. The important variables for us to observe are ‘EVTYPE’, ‘FATALITIES’ & ‘INJURIES’. So I will subset the raw data and reduce it to those three columns.
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
sub_storm <- select(raw_storm, EVTYPE, FATALITIES, INJURIES)
A look at the ‘head’ and ‘summary’ of the new data shows some of the relevant information
head(sub_storm)
## EVTYPE FATALITIES INJURIES
## 1 TORNADO 0 15
## 2 TORNADO 0 0
## 3 TORNADO 0 2
## 4 TORNADO 0 2
## 5 TORNADO 0 2
## 6 TORNADO 0 6
summary(sub_storm)
## EVTYPE FATALITIES INJURIES
## HAIL :288661 Min. : 0.0000 Min. : 0.0000
## TSTM WIND :219940 1st Qu.: 0.0000 1st Qu.: 0.0000
## THUNDERSTORM WIND: 82563 Median : 0.0000 Median : 0.0000
## TORNADO : 60652 Mean : 0.0168 Mean : 0.1557
## FLASH FLOOD : 54277 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## FLOOD : 25326 Max. :583.0000 Max. :1700.0000
## (Other) :170878
Now I would like to know how many different EVTYPE’s there are.
length(unique(sub_storm$EVTYPE))
## [1] 985
985!
The summary of the sub_storm data showed us the two maximum impact observations. I want to show which events they are related to.
sub_storm[sub_storm$FATALITIES == 583,]
## EVTYPE FATALITIES INJURIES
## 198704 HEAT 583 0
sub_storm[sub_storm$INJURIES == 1700,]
## EVTYPE FATALITIES INJURIES
## 157885 TORNADO 42 1700
I also would like to know how many observations have 0 Fatalities.
x <- sub_storm[sub_storm$FATALITIES == 0,]
length(x[,2])
## [1] 895323
895323 out of 902297 total observations, this leaves 6974 events with fatalities.
Lets also see how many events were with 0 Injuries.
y <- sub_storm[sub_storm$INJURIES == 0,]
length(y[,2])
## [1] 884693
884693 events were without injuries, which leaves 17604 cases with injuries.
These insights helps us to narrow down our dataset for further observations. Lets eliminate all events without fatalities or injuries.
inj_storm <- sub_storm[sub_storm$INJURIES != 0,]
fatal_storm <- sub_storm[sub_storm$FATALITIES != 0,]
Now lets check on the number of different events again
length(unique(inj_storm$EVTYPE))
## [1] 158
length(unique(fatal_storm$EVTYPE))
## [1] 168
Now we have 158 unique events in the injury subset and 168 in the fatality subset. With this information it should be easier to show a clearer relation between the type of event and the number of injuries/fatalities. Let’s bring this in order. First for injuries:
sub_inj <- aggregate(INJURIES ~ EVTYPE, data = inj_storm, FUN = "sum")
sub_inj <- arrange(sub_inj, desc(INJURIES))
sub_inj[1:20,]
## EVTYPE INJURIES
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
## 7 ICE STORM 1975
## 8 FLASH FLOOD 1777
## 9 THUNDERSTORM WIND 1488
## 10 HAIL 1361
## 11 WINTER STORM 1321
## 12 HURRICANE/TYPHOON 1275
## 13 HIGH WIND 1137
## 14 HEAVY SNOW 1021
## 15 WILDFIRE 911
## 16 THUNDERSTORM WINDS 908
## 17 BLIZZARD 805
## 18 FOG 734
## 19 WILD/FOREST FIRE 545
## 20 DUST STORM 440
Now for fatalities:
sub_fatal <- aggregate(FATALITIES ~ EVTYPE, data = fatal_storm, FUN = "sum")
sub_fatal <- arrange(sub_fatal, desc(FATALITIES))
sub_fatal[1:20,]
## EVTYPE FATALITIES
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
## 7 FLOOD 470
## 8 RIP CURRENT 368
## 9 HIGH WIND 248
## 10 AVALANCHE 224
## 11 WINTER STORM 206
## 12 RIP CURRENTS 204
## 13 HEAT WAVE 172
## 14 EXTREME COLD 160
## 15 THUNDERSTORM WIND 133
## 16 HEAVY SNOW 127
## 17 EXTREME COLD/WIND CHILL 125
## 18 STRONG WIND 103
## 19 BLIZZARD 101
## 20 HIGH SURF 101
I show here the first twenty rows of the resulting data frame, which clearly shows that Tornados are at the top of both lists, for fatalities and injuries, also including the maximum value for injuries at a single event.
i <- sub_inj[1:20,]
with(i, barplot(INJURIES, names.arg = i$EVTYPE, col = rainbow(20), main = "Number of Injuries by Event type", cex.axis = 0.8, las = 3))
To answer this question we have to re-examine our raw data to see which columns we can extract the necessary information from.
I filter out the columns related to economic consequences first. After looking at the summary of our raw data it shows that the columns ‘PROPDMG’ and ‘CROPDMG’ are the ones we need.
summary(raw_storm)
## STATE__ BGN_DATE BGN_TIME
## Min. : 1.0 5/25/2011 0:00:00: 1202 12:00:00 AM: 10163
## 1st Qu.:19.0 4/27/2011 0:00:00: 1193 06:00:00 PM: 7350
## Median :30.0 6/9/2011 0:00:00 : 1030 04:00:00 PM: 7261
## Mean :31.2 5/30/2004 0:00:00: 1016 05:00:00 PM: 6891
## 3rd Qu.:45.0 4/4/2011 0:00:00 : 1009 12:00:00 PM: 6703
## Max. :95.0 4/2/2006 0:00:00 : 981 03:00:00 PM: 6700
## (Other) :895866 (Other) :857229
## TIME_ZONE COUNTY COUNTYNAME STATE
## CST :547493 Min. : 0.0 JEFFERSON : 7840 TX : 83728
## EST :245558 1st Qu.: 31.0 WASHINGTON: 7603 KS : 53440
## MST : 68390 Median : 75.0 JACKSON : 6660 OK : 46802
## PST : 28302 Mean :100.6 FRANKLIN : 6256 MO : 35648
## AST : 6360 3rd Qu.:131.0 LINCOLN : 5937 IA : 31069
## HST : 2563 Max. :873.0 MADISON : 5632 NE : 30271
## (Other): 3631 (Other) :862369 (Other):621339
## EVTYPE BGN_RANGE BGN_AZI
## HAIL :288661 Min. : 0.000 :547332
## TSTM WIND :219940 1st Qu.: 0.000 N : 86752
## THUNDERSTORM WIND: 82563 Median : 0.000 W : 38446
## TORNADO : 60652 Mean : 1.484 S : 37558
## FLASH FLOOD : 54277 3rd Qu.: 1.000 E : 33178
## FLOOD : 25326 Max. :3749.000 NW : 24041
## (Other) :170878 (Other):134990
## BGN_LOCATI END_DATE END_TIME
## :287743 :243411 :238978
## COUNTYWIDE : 19680 4/27/2011 0:00:00: 1214 06:00:00 PM: 9802
## Countywide : 993 5/25/2011 0:00:00: 1196 05:00:00 PM: 8314
## SPRINGFIELD : 843 6/9/2011 0:00:00 : 1021 04:00:00 PM: 8104
## SOUTH PORTION: 810 4/4/2011 0:00:00 : 1007 12:00:00 PM: 7483
## NORTH PORTION: 784 5/30/2004 0:00:00: 998 11:59:00 PM: 7184
## (Other) :591444 (Other) :653450 (Other) :622432
## COUNTY_END COUNTYENDN END_RANGE END_AZI
## Min. :0 Mode:logical Min. : 0.0000 :724837
## 1st Qu.:0 NA's:902297 1st Qu.: 0.0000 N : 28082
## Median :0 Median : 0.0000 S : 22510
## Mean :0 Mean : 0.9862 W : 20119
## 3rd Qu.:0 3rd Qu.: 0.0000 E : 20047
## Max. :0 Max. :925.0000 NE : 14606
## (Other): 72096
## END_LOCATI LENGTH WIDTH
## :499225 Min. : 0.0000 Min. : 0.000
## COUNTYWIDE : 19731 1st Qu.: 0.0000 1st Qu.: 0.000
## SOUTH PORTION : 833 Median : 0.0000 Median : 0.000
## NORTH PORTION : 780 Mean : 0.2301 Mean : 7.503
## CENTRAL PORTION: 617 3rd Qu.: 0.0000 3rd Qu.: 0.000
## SPRINGFIELD : 575 Max. :2315.0000 Max. :4400.000
## (Other) :380536
## F MAG FATALITIES INJURIES
## Min. :0.0 Min. : 0.0 Min. : 0.0000 Min. : 0.0000
## 1st Qu.:0.0 1st Qu.: 0.0 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median :1.0 Median : 50.0 Median : 0.0000 Median : 0.0000
## Mean :0.9 Mean : 46.9 Mean : 0.0168 Mean : 0.1557
## 3rd Qu.:1.0 3rd Qu.: 75.0 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## Max. :5.0 Max. :22000.0 Max. :583.0000 Max. :1700.0000
## NA's :843563
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 :465934 Min. : 0.000 :618413
## 1st Qu.: 0.00 K :424665 1st Qu.: 0.000 K :281832
## Median : 0.00 M : 11330 Median : 0.000 M : 1994
## Mean : 12.06 0 : 216 Mean : 1.527 k : 21
## 3rd Qu.: 0.50 B : 40 3rd Qu.: 0.000 0 : 19
## Max. :5000.00 5 : 28 Max. :990.000 B : 9
## (Other): 84 (Other): 9
## WFO STATEOFFIC
## :142069 :248769
## OUN : 17393 TEXAS, North : 12193
## JAN : 13889 ARKANSAS, Central and North Central: 11738
## LWX : 13174 IOWA, Central : 11345
## PHI : 12551 KANSAS, Southwest : 11212
## TSA : 12483 GEORGIA, North and Central : 11120
## (Other):690738 (Other) :595920
## ZONENAMES
## :594029
## :205988
## GREATER RENO / CARSON CITY / M - GREATER RENO / CARSON CITY / M : 639
## GREATER LAKE TAHOE AREA - GREATER LAKE TAHOE AREA : 592
## JEFFERSON - JEFFERSON : 303
## MADISON - MADISON : 302
## (Other) :100444
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_
## Min. : 0 Min. :-14451 Min. : 0 Min. :-14455
## 1st Qu.:2802 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0
## Median :3540 Median : 8707 Median : 0 Median : 0
## Mean :2875 Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.:4019 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. :9706 Max. : 17124 Max. :9706 Max. :106220
## NA's :47 NA's :40
## REMARKS REFNUM
## :287433 Min. : 1
## : 24013 1st Qu.:225575
## Trees down.\n : 1110 Median :451149
## Several trees were blown down.\n : 568 Mean :451149
## Trees were downed.\n : 446 3rd Qu.:676723
## Large trees and power lines were blown down.\n: 432 Max. :902297
## (Other) :588295
ec_storm <- select(raw_storm, EVTYPE, PROPDMG, CROPDMG)
We can examine the maximum values for both types of damages at this stage.
ec_storm[ec_storm$PROPDMG == 5000,]
## EVTYPE PROPDMG CROPDMG
## 778568 THUNDERSTORM WIND 5000 0
## 808182 FLASH FLOOD 5000 0
## 808183 FLASH FLOOD 5000 0
## 900685 WATERSPOUT 5000 0
ec_storm[ec_storm$CROPDMG == 990,]
## EVTYPE PROPDMG CROPDMG
## 544253 DROUGHT 0 990
From here lets aggregate the data for both types of damages to get a better idea of the amount of damage per event type. I also eliminate the ‘0’ values and order the data.
prop_storm <- aggregate(PROPDMG ~ EVTYPE, data = ec_storm, FUN = "sum")
prop_storm <- prop_storm[prop_storm$PROPDMG > 0,]
prop_storm <- arrange(prop_storm, desc(PROPDMG))
crop_storm <- aggregate(CROPDMG ~ EVTYPE, data = ec_storm, FUN = "sum")
crop_storm <- crop_storm[crop_storm$CROPDMG > 0,]
crop_storm <- arrange(crop_storm, desc(CROPDMG))
Looking at the first twenty rows for each subset:
prop_storm[1:20,]
## EVTYPE PROPDMG
## 1 TORNADO 3212258.16
## 2 FLASH FLOOD 1420124.59
## 3 TSTM WIND 1335965.61
## 4 FLOOD 899938.48
## 5 THUNDERSTORM WIND 876844.17
## 6 HAIL 688693.38
## 7 LIGHTNING 603351.78
## 8 THUNDERSTORM WINDS 446293.18
## 9 HIGH WIND 324731.56
## 10 WINTER STORM 132720.59
## 11 HEAVY SNOW 122251.99
## 12 WILDFIRE 84459.34
## 13 ICE STORM 66000.67
## 14 STRONG WIND 62993.81
## 15 HIGH WINDS 55625.00
## 16 HEAVY RAIN 50842.14
## 17 TROPICAL STORM 48423.68
## 18 WILD/FOREST FIRE 39344.95
## 19 FLASH FLOODING 28497.15
## 20 URBAN/SML STREAM FLD 26051.94
crop_storm[1:20,]
## EVTYPE CROPDMG
## 1 HAIL 579596.28
## 2 FLASH FLOOD 179200.46
## 3 FLOOD 168037.88
## 4 TSTM WIND 109202.60
## 5 TORNADO 100018.52
## 6 THUNDERSTORM WIND 66791.45
## 7 DROUGHT 33898.62
## 8 THUNDERSTORM WINDS 18684.93
## 9 HIGH WIND 17283.21
## 10 HEAVY RAIN 11122.80
## 11 FROST/FREEZE 7034.14
## 12 EXTREME COLD 6121.14
## 13 TROPICAL STORM 5899.12
## 14 HURRICANE 5339.31
## 15 FLASH FLOODING 5126.05
## 16 HURRICANE/TYPHOON 4798.48
## 17 WILDFIRE 4364.20
## 18 TSTM WIND/HAIL 4356.65
## 19 WILD/FOREST FIRE 4189.54
## 20 LIGHTNING 3580.61
As you can see, the main cause for property damage is again Tornados with more than double the amount of damage compared to the second in the list, Flash Floods.
The result for crop damages is led by Hail as the main cause, more than three time the amount compared to Flash Floods.
x <- prop_storm[1:20,]
with(x, barplot(PROPDMG, names.arg = x$EVTYPE, col = rainbow(20), main = "Property Damage by Event type", ylab = "Property Damages (in Dollars)", cex.axis = 0.8, las = 3))