Analysis of NOAA’s Storm data of types of events with strongest impacts on population health (1) and economic damages (2)

Synopsis

NOAA’s collected storm data from 1950 to 2011 is a valuable source for information regarding the future preventive efforts in avoiding damages to life and property. Here we want to find out which type of events cause the most damages, physically and economically. As it turns out, we can see that we have one single type of event that clearly ranks on the top of the impact scale, Tornados. Tornados cause more than 13 times the amount of injuries, almost 3 times the amount of fatalities and more than double the amount of property damages always compared to the second rank. Only crop damages are led by a different event type, Hail, with Tornados on rank 5.

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Data Processing 1

Reading in the raw data

raw_storm <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))

First look at the data

dim(raw_storm)
## [1] 902297     37
head(raw_storm)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

It appears that most of the variables are not interesting for the question on hand. The important variables for us to observe are ‘EVTYPE’, ‘FATALITIES’ & ‘INJURIES’. So I will subset the raw data and reduce it to those three columns.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
sub_storm <- select(raw_storm, EVTYPE, FATALITIES, INJURIES)

A look at the ‘head’ and ‘summary’ of the new data shows some of the relevant information

head(sub_storm)
##    EVTYPE FATALITIES INJURIES
## 1 TORNADO          0       15
## 2 TORNADO          0        0
## 3 TORNADO          0        2
## 4 TORNADO          0        2
## 5 TORNADO          0        2
## 6 TORNADO          0        6
summary(sub_storm)
##                EVTYPE         FATALITIES          INJURIES        
##  HAIL             :288661   Min.   :  0.0000   Min.   :   0.0000  
##  TSTM WIND        :219940   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  THUNDERSTORM WIND: 82563   Median :  0.0000   Median :   0.0000  
##  TORNADO          : 60652   Mean   :  0.0168   Mean   :   0.1557  
##  FLASH FLOOD      : 54277   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  FLOOD            : 25326   Max.   :583.0000   Max.   :1700.0000  
##  (Other)          :170878

Now I would like to know how many different EVTYPE’s there are.

length(unique(sub_storm$EVTYPE))
## [1] 985

985!

The summary of the sub_storm data showed us the two maximum impact observations. I want to show which events they are related to.

sub_storm[sub_storm$FATALITIES == 583,]
##        EVTYPE FATALITIES INJURIES
## 198704   HEAT        583        0
sub_storm[sub_storm$INJURIES == 1700,]
##         EVTYPE FATALITIES INJURIES
## 157885 TORNADO         42     1700

I also would like to know how many observations have 0 Fatalities.

x <- sub_storm[sub_storm$FATALITIES == 0,]
length(x[,2])
## [1] 895323

895323 out of 902297 total observations, this leaves 6974 events with fatalities.

Lets also see how many events were with 0 Injuries.

y <- sub_storm[sub_storm$INJURIES == 0,]
length(y[,2])
## [1] 884693

884693 events were without injuries, which leaves 17604 cases with injuries.

These insights helps us to narrow down our dataset for further observations. Lets eliminate all events without fatalities or injuries.

inj_storm <- sub_storm[sub_storm$INJURIES != 0,]
fatal_storm <- sub_storm[sub_storm$FATALITIES != 0,]

Now lets check on the number of different events again

length(unique(inj_storm$EVTYPE))
## [1] 158
length(unique(fatal_storm$EVTYPE))
## [1] 168

Now we have 158 unique events in the injury subset and 168 in the fatality subset. With this information it should be easier to show a clearer relation between the type of event and the number of injuries/fatalities. Let’s bring this in order. First for injuries:

sub_inj <- aggregate(INJURIES ~ EVTYPE, data = inj_storm, FUN = "sum")
sub_inj <- arrange(sub_inj, desc(INJURIES))
sub_inj[1:20,]
##                EVTYPE INJURIES
## 1             TORNADO    91346
## 2           TSTM WIND     6957
## 3               FLOOD     6789
## 4      EXCESSIVE HEAT     6525
## 5           LIGHTNING     5230
## 6                HEAT     2100
## 7           ICE STORM     1975
## 8         FLASH FLOOD     1777
## 9   THUNDERSTORM WIND     1488
## 10               HAIL     1361
## 11       WINTER STORM     1321
## 12  HURRICANE/TYPHOON     1275
## 13          HIGH WIND     1137
## 14         HEAVY SNOW     1021
## 15           WILDFIRE      911
## 16 THUNDERSTORM WINDS      908
## 17           BLIZZARD      805
## 18                FOG      734
## 19   WILD/FOREST FIRE      545
## 20         DUST STORM      440

Now for fatalities:

sub_fatal <- aggregate(FATALITIES ~ EVTYPE, data = fatal_storm, FUN = "sum")
sub_fatal <- arrange(sub_fatal, desc(FATALITIES))
sub_fatal[1:20,]
##                     EVTYPE FATALITIES
## 1                  TORNADO       5633
## 2           EXCESSIVE HEAT       1903
## 3              FLASH FLOOD        978
## 4                     HEAT        937
## 5                LIGHTNING        816
## 6                TSTM WIND        504
## 7                    FLOOD        470
## 8              RIP CURRENT        368
## 9                HIGH WIND        248
## 10               AVALANCHE        224
## 11            WINTER STORM        206
## 12            RIP CURRENTS        204
## 13               HEAT WAVE        172
## 14            EXTREME COLD        160
## 15       THUNDERSTORM WIND        133
## 16              HEAVY SNOW        127
## 17 EXTREME COLD/WIND CHILL        125
## 18             STRONG WIND        103
## 19                BLIZZARD        101
## 20               HIGH SURF        101

Results 1

I show here the first twenty rows of the resulting data frame, which clearly shows that Tornados are at the top of both lists, for fatalities and injuries, also including the maximum value for injuries at a single event.

i <- sub_inj[1:20,]
with(i, barplot(INJURIES, names.arg = i$EVTYPE, col = rainbow(20), main = "Number of Injuries by Event type", cex.axis = 0.8, las = 3))

2. Across the United States, which types of events have the greatest economic consequences?

To answer this question we have to re-examine our raw data to see which columns we can extract the necessary information from.

Data Processing 2

I filter out the columns related to economic consequences first. After looking at the summary of our raw data it shows that the columns ‘PROPDMG’ and ‘CROPDMG’ are the ones we need.

summary(raw_storm)
##     STATE__                  BGN_DATE             BGN_TIME     
##  Min.   : 1.0   5/25/2011 0:00:00:  1202   12:00:00 AM: 10163  
##  1st Qu.:19.0   4/27/2011 0:00:00:  1193   06:00:00 PM:  7350  
##  Median :30.0   6/9/2011 0:00:00 :  1030   04:00:00 PM:  7261  
##  Mean   :31.2   5/30/2004 0:00:00:  1016   05:00:00 PM:  6891  
##  3rd Qu.:45.0   4/4/2011 0:00:00 :  1009   12:00:00 PM:  6703  
##  Max.   :95.0   4/2/2006 0:00:00 :   981   03:00:00 PM:  6700  
##                 (Other)          :895866   (Other)    :857229  
##    TIME_ZONE          COUNTY           COUNTYNAME         STATE       
##  CST    :547493   Min.   :  0.0   JEFFERSON :  7840   TX     : 83728  
##  EST    :245558   1st Qu.: 31.0   WASHINGTON:  7603   KS     : 53440  
##  MST    : 68390   Median : 75.0   JACKSON   :  6660   OK     : 46802  
##  PST    : 28302   Mean   :100.6   FRANKLIN  :  6256   MO     : 35648  
##  AST    :  6360   3rd Qu.:131.0   LINCOLN   :  5937   IA     : 31069  
##  HST    :  2563   Max.   :873.0   MADISON   :  5632   NE     : 30271  
##  (Other):  3631                   (Other)   :862369   (Other):621339  
##                EVTYPE         BGN_RANGE           BGN_AZI      
##  HAIL             :288661   Min.   :   0.000          :547332  
##  TSTM WIND        :219940   1st Qu.:   0.000   N      : 86752  
##  THUNDERSTORM WIND: 82563   Median :   0.000   W      : 38446  
##  TORNADO          : 60652   Mean   :   1.484   S      : 37558  
##  FLASH FLOOD      : 54277   3rd Qu.:   1.000   E      : 33178  
##  FLOOD            : 25326   Max.   :3749.000   NW     : 24041  
##  (Other)          :170878                      (Other):134990  
##          BGN_LOCATI                  END_DATE             END_TIME     
##               :287743                    :243411              :238978  
##  COUNTYWIDE   : 19680   4/27/2011 0:00:00:  1214   06:00:00 PM:  9802  
##  Countywide   :   993   5/25/2011 0:00:00:  1196   05:00:00 PM:  8314  
##  SPRINGFIELD  :   843   6/9/2011 0:00:00 :  1021   04:00:00 PM:  8104  
##  SOUTH PORTION:   810   4/4/2011 0:00:00 :  1007   12:00:00 PM:  7483  
##  NORTH PORTION:   784   5/30/2004 0:00:00:   998   11:59:00 PM:  7184  
##  (Other)      :591444   (Other)          :653450   (Other)    :622432  
##    COUNTY_END COUNTYENDN       END_RANGE           END_AZI      
##  Min.   :0    Mode:logical   Min.   :  0.0000          :724837  
##  1st Qu.:0    NA's:902297    1st Qu.:  0.0000   N      : 28082  
##  Median :0                   Median :  0.0000   S      : 22510  
##  Mean   :0                   Mean   :  0.9862   W      : 20119  
##  3rd Qu.:0                   3rd Qu.:  0.0000   E      : 20047  
##  Max.   :0                   Max.   :925.0000   NE     : 14606  
##                                                 (Other): 72096  
##            END_LOCATI         LENGTH              WIDTH         
##                 :499225   Min.   :   0.0000   Min.   :   0.000  
##  COUNTYWIDE     : 19731   1st Qu.:   0.0000   1st Qu.:   0.000  
##  SOUTH PORTION  :   833   Median :   0.0000   Median :   0.000  
##  NORTH PORTION  :   780   Mean   :   0.2301   Mean   :   7.503  
##  CENTRAL PORTION:   617   3rd Qu.:   0.0000   3rd Qu.:   0.000  
##  SPRINGFIELD    :   575   Max.   :2315.0000   Max.   :4400.000  
##  (Other)        :380536                                         
##        F               MAG            FATALITIES          INJURIES        
##  Min.   :0.0      Min.   :    0.0   Min.   :  0.0000   Min.   :   0.0000  
##  1st Qu.:0.0      1st Qu.:    0.0   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  Median :1.0      Median :   50.0   Median :  0.0000   Median :   0.0000  
##  Mean   :0.9      Mean   :   46.9   Mean   :  0.0168   Mean   :   0.1557  
##  3rd Qu.:1.0      3rd Qu.:   75.0   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  Max.   :5.0      Max.   :22000.0   Max.   :583.0000   Max.   :1700.0000  
##  NA's   :843563                                                           
##     PROPDMG          PROPDMGEXP        CROPDMG          CROPDMGEXP    
##  Min.   :   0.00          :465934   Min.   :  0.000          :618413  
##  1st Qu.:   0.00   K      :424665   1st Qu.:  0.000   K      :281832  
##  Median :   0.00   M      : 11330   Median :  0.000   M      :  1994  
##  Mean   :  12.06   0      :   216   Mean   :  1.527   k      :    21  
##  3rd Qu.:   0.50   B      :    40   3rd Qu.:  0.000   0      :    19  
##  Max.   :5000.00   5      :    28   Max.   :990.000   B      :     9  
##                    (Other):    84                     (Other):     9  
##       WFO                                       STATEOFFIC    
##         :142069                                      :248769  
##  OUN    : 17393   TEXAS, North                       : 12193  
##  JAN    : 13889   ARKANSAS, Central and North Central: 11738  
##  LWX    : 13174   IOWA, Central                      : 11345  
##  PHI    : 12551   KANSAS, Southwest                  : 11212  
##  TSA    : 12483   GEORGIA, North and Central         : 11120  
##  (Other):690738   (Other)                            :595920  
##                                                                                                                                                                                                     ZONENAMES     
##                                                                                                                                                                                                          :594029  
##                                                                                                                                                                                                          :205988  
##  GREATER RENO / CARSON CITY / M - GREATER RENO / CARSON CITY / M                                                                                                                                         :   639  
##  GREATER LAKE TAHOE AREA - GREATER LAKE TAHOE AREA                                                                                                                                                       :   592  
##  JEFFERSON - JEFFERSON                                                                                                                                                                                   :   303  
##  MADISON - MADISON                                                                                                                                                                                       :   302  
##  (Other)                                                                                                                                                                                                 :100444  
##     LATITUDE      LONGITUDE        LATITUDE_E     LONGITUDE_    
##  Min.   :   0   Min.   :-14451   Min.   :   0   Min.   :-14455  
##  1st Qu.:2802   1st Qu.:  7247   1st Qu.:   0   1st Qu.:     0  
##  Median :3540   Median :  8707   Median :   0   Median :     0  
##  Mean   :2875   Mean   :  6940   Mean   :1452   Mean   :  3509  
##  3rd Qu.:4019   3rd Qu.:  9605   3rd Qu.:3549   3rd Qu.:  8735  
##  Max.   :9706   Max.   : 17124   Max.   :9706   Max.   :106220  
##  NA's   :47                      NA's   :40                     
##                                            REMARKS           REFNUM      
##                                                :287433   Min.   :     1  
##                                                : 24013   1st Qu.:225575  
##  Trees down.\n                                 :  1110   Median :451149  
##  Several trees were blown down.\n              :   568   Mean   :451149  
##  Trees were downed.\n                          :   446   3rd Qu.:676723  
##  Large trees and power lines were blown down.\n:   432   Max.   :902297  
##  (Other)                                       :588295
ec_storm <- select(raw_storm, EVTYPE, PROPDMG, CROPDMG)

We can examine the maximum values for both types of damages at this stage.

ec_storm[ec_storm$PROPDMG == 5000,]
##                   EVTYPE PROPDMG CROPDMG
## 778568 THUNDERSTORM WIND    5000       0
## 808182       FLASH FLOOD    5000       0
## 808183       FLASH FLOOD    5000       0
## 900685        WATERSPOUT    5000       0
ec_storm[ec_storm$CROPDMG == 990,]
##         EVTYPE PROPDMG CROPDMG
## 544253 DROUGHT       0     990

From here lets aggregate the data for both types of damages to get a better idea of the amount of damage per event type. I also eliminate the ‘0’ values and order the data.

prop_storm <- aggregate(PROPDMG ~ EVTYPE, data = ec_storm, FUN = "sum")
prop_storm <- prop_storm[prop_storm$PROPDMG > 0,]
prop_storm <- arrange(prop_storm, desc(PROPDMG))
crop_storm <- aggregate(CROPDMG ~ EVTYPE, data = ec_storm, FUN = "sum")
crop_storm <- crop_storm[crop_storm$CROPDMG > 0,]
crop_storm <- arrange(crop_storm, desc(CROPDMG))

Looking at the first twenty rows for each subset:

prop_storm[1:20,]
##                  EVTYPE    PROPDMG
## 1               TORNADO 3212258.16
## 2           FLASH FLOOD 1420124.59
## 3             TSTM WIND 1335965.61
## 4                 FLOOD  899938.48
## 5     THUNDERSTORM WIND  876844.17
## 6                  HAIL  688693.38
## 7             LIGHTNING  603351.78
## 8    THUNDERSTORM WINDS  446293.18
## 9             HIGH WIND  324731.56
## 10         WINTER STORM  132720.59
## 11           HEAVY SNOW  122251.99
## 12             WILDFIRE   84459.34
## 13            ICE STORM   66000.67
## 14          STRONG WIND   62993.81
## 15           HIGH WINDS   55625.00
## 16           HEAVY RAIN   50842.14
## 17       TROPICAL STORM   48423.68
## 18     WILD/FOREST FIRE   39344.95
## 19       FLASH FLOODING   28497.15
## 20 URBAN/SML STREAM FLD   26051.94
crop_storm[1:20,]
##                EVTYPE   CROPDMG
## 1                HAIL 579596.28
## 2         FLASH FLOOD 179200.46
## 3               FLOOD 168037.88
## 4           TSTM WIND 109202.60
## 5             TORNADO 100018.52
## 6   THUNDERSTORM WIND  66791.45
## 7             DROUGHT  33898.62
## 8  THUNDERSTORM WINDS  18684.93
## 9           HIGH WIND  17283.21
## 10         HEAVY RAIN  11122.80
## 11       FROST/FREEZE   7034.14
## 12       EXTREME COLD   6121.14
## 13     TROPICAL STORM   5899.12
## 14          HURRICANE   5339.31
## 15     FLASH FLOODING   5126.05
## 16  HURRICANE/TYPHOON   4798.48
## 17           WILDFIRE   4364.20
## 18     TSTM WIND/HAIL   4356.65
## 19   WILD/FOREST FIRE   4189.54
## 20          LIGHTNING   3580.61

Results 2

As you can see, the main cause for property damage is again Tornados with more than double the amount of damage compared to the second in the list, Flash Floods.

The result for crop damages is led by Hail as the main cause, more than three time the amount compared to Flash Floods.

x <- prop_storm[1:20,]
with(x, barplot(PROPDMG, names.arg = x$EVTYPE, col = rainbow(20), main = "Property Damage by Event type", ylab = "Property Damages (in Dollars)", cex.axis = 0.8, las = 3))