Synopsis

This data analysis looks at storm data collected by the US Weather Service between 1950 and 2011, which contains over 900,000 records of extreme weather events, their location, fatalities and injuries and economic damage estimations in the US over that period. This data is available at the following url: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

The goal of the analysis is to see which types of events cause the most damage. The data was downloaded and events were categorized into a reasonable number (10-15) of distinct categories in order to look at the types of events that caused the most death and injuries, and the types of events that had the most economic impact.

Our results show that in terms of human health, tornadoes, hurricanes and wind related events have by far the largest amount of casualties, followed by heatwaves, floods, and cold related events. In terms of economic impact, floods, hurricanes, tornadoes, and other marine events such as coastal surges represent a huge share of the economic impact of extreme weather events. Most of the economic impact comes from a few extreme events which have a disproportional cost, the most expensive event representing more than 20% of total costs registered by itself.

Data Processing

Loading the data

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
fname<-"repdata_data_StormData.csv.bz2"

if(!file.exists(fname))
        download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                      "repdata_data_StormData.csv.bz2")

stormData <- read.csv(fname)

dim(stormData)
## [1] 902297     37
names(stormData)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Reducing the data to variables of interest

Our analysis deals with types of events, not with individual events per se. Therefore, all of the information regarding the geographical location of the events or the time at which the event occurs is not really relevant to this analysis and can be removed.

stormtbl <- select(stormData, EVTYPE, FATALITIES, INJURIES, PROPDMG:CROPDMGEXP)
stormtbl <- as_tibble(stormtbl)
stormtbl
## # A tibble: 902,297 x 7
##    EVTYPE  FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
##    <fct>        <dbl>    <dbl>   <dbl> <fct>        <dbl> <fct>     
##  1 TORNADO          0       15    25   K                0 ""        
##  2 TORNADO          0        0     2.5 K                0 ""        
##  3 TORNADO          0        2    25   K                0 ""        
##  4 TORNADO          0        2     2.5 K                0 ""        
##  5 TORNADO          0        2     2.5 K                0 ""        
##  6 TORNADO          0        6     2.5 K                0 ""        
##  7 TORNADO          0        1     2.5 K                0 ""        
##  8 TORNADO          0        0     2.5 K                0 ""        
##  9 TORNADO          1       14    25   K                0 ""        
## 10 TORNADO          0        0    25   K                0 ""        
## # ... with 902,287 more rows

Processing the economic data for calculations

The codebook tells us that economic damage should be expressed as a number, followed by a letter in the next column, with K, M and B representing a thousand, million and billion. However, we can see that this is not always the case:

table(stormtbl$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5      6 
## 465934      1      8      5    216     25     13      4      4     28      4 
##      7      8      B      h      H      K      m      M 
##      5      1     40      1      6 424665      7  11330
table(stormtbl$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

We’ll assume that lower case k and m can safely be converted to upper case. Since other values represent at most a few hundred cases out of hundreds of thousands, we’ll remove them (counting the multiplier as 1 so that it is neutral). Blanks will be counted as 1, so that if there is a number next to it, it won’t be lost in the calculation. Then we calculate the total economic impact of each event.

stormtbl$cropmult <- rep(1,902297)
stormtbl$propmult <- rep(1,902297)
stormtbl[stormtbl$PROPDMGEXP %in% c("K", "k"),]$propmult <-1000
stormtbl[stormtbl$PROPDMGEXP %in% c("m", "M"),]$propmult<- 1000000
stormtbl[stormtbl$PROPDMGEXP=="B",]$propmult <- 1000000000
stormtbl[stormtbl$CROPDMGEXP %in% c("K", "k"),]$cropmult <-1000
stormtbl[stormtbl$CROPDMGEXP %in% c("m", "M"),]$cropmult<- 1000000
stormtbl[stormtbl$CROPDMGEXP=="B",]$cropmult <- 1000000000
stormtbl <-mutate(stormtbl, property_damage = PROPDMG * propmult, crop_damage = CROPDMG*cropmult, total_economic_impact = property_damage + crop_damage)
stormclean <- select(stormtbl, -(PROPDMG:propmult))
stormclean
## # A tibble: 902,297 x 6
##    EVTYPE  FATALITIES INJURIES property_damage crop_damage total_economic_impact
##    <fct>        <dbl>    <dbl>           <dbl>       <dbl>                 <dbl>
##  1 TORNADO          0       15           25000           0                 25000
##  2 TORNADO          0        0            2500           0                  2500
##  3 TORNADO          0        2           25000           0                 25000
##  4 TORNADO          0        2            2500           0                  2500
##  5 TORNADO          0        2            2500           0                  2500
##  6 TORNADO          0        6            2500           0                  2500
##  7 TORNADO          0        1            2500           0                  2500
##  8 TORNADO          0        0            2500           0                  2500
##  9 TORNADO          1       14           25000           0                 25000
## 10 TORNADO          0        0           25000           0                 25000
## # ... with 902,287 more rows

Classifying types of events into categories

Now that we have a much smaller table with only variables we’re interested in, we’re going to need to look at the contents of EVTYPE.

length(unique(stormtbl$EVTYPE))
## [1] 985

As we can see EVTYPE contains too many unique names to sort through them right now, so we’ll start by looking at the events which caused the most deaths and categorize these, then apply our categories in general.

mostfatal <- stormtbl %>%  filter(FATALITIES>0) %>% arrange(desc(FATALITIES))
length(unique(mostfatal$EVTYPE))
## [1] 168

Now that we have just 168, we look at the individual categories to find patterns.

unique(mostfatal$EVTYPE)
##   [1] HEAT                           TORNADO                       
##   [3] EXCESSIVE HEAT                 EXTREME HEAT                  
##   [5] HEAT WAVE                      TSUNAMI                       
##   [7] UNSEASONABLY WARM AND DRY      TORNADOES, TSTM WIND, HAIL    
##   [9] TROPICAL STORM                 FLASH FLOOD                   
##  [11] HEAVY RAIN                     RECORD/EXCESSIVE HEAT         
##  [13] FLOOD                          HURRICANE/TYPHOON             
##  [15] COLD AND SNOW                  WILDFIRE                      
##  [17] LANDSLIDE                      HURRICANE                     
##  [19] TSTM WIND                      FOG                           
##  [21] STORM SURGE/TIDE               DUST STORM                    
##  [23] WINTER STORMS                  STORM SURGE                   
##  [25] HIGH SURF                      UNSEASONABLY WARM             
##  [27] EXTREME COLD/WIND CHILL        TROPICAL STORM GORDON         
##  [29] BLIZZARD                       THUNDERSTORM WINDS            
##  [31] HEAVY SNOW                     HIGH WIND                     
##  [33] EXTREME COLD                   THUNDERSTORM WIND             
##  [35] MARINE MISHAP                  FLASH FLOOD/FLOOD             
##  [37] ICE STORM                      AVALANCHE                     
##  [39] RIP CURRENT                    WINTER WEATHER/MIX            
##  [41] FLOOD/FLASH FLOOD              HEAT WAVES                    
##  [43] GLAZE                          LIGHTNING                     
##  [45] WINTER STORM                   STRONG WIND                   
##  [47] MARINE TSTM WIND               HEAVY SURF/HIGH SURF          
##  [49] COLD/WIND CHILL                HAIL                          
##  [51] HIGH WIND/SEAS                 RIP CURRENTS/HEAVY SURF       
##  [53] FLASH FLOODING                 HEAT WAVE DROUGHT             
##  [55] Mudslide                       WILD/FOREST FIRE              
##  [57] ROUGH SEAS                     WILD FIRES                    
##  [59] HURRICANE ERIN                 WIND                          
##  [61] SNOW AND ICE                   WATERSPOUT/TORNADO            
##  [63] COLD WAVE                      HIGH WINDS/SNOW               
##  [65] FLASH FLOODING/FLOOD           HIGH WIND AND SEAS            
##  [67] RAIN/SNOW                      Heavy surf and wind           
##  [69] RIP CURRENTS                   EXTREME WINDCHILL             
##  [71] URBAN/SML STREAM FLD           HURRICANE OPAL/HIGH WINDS     
##  [73] FLOODING                       SLEET                         
##  [75] HIGH WINDS                     DENSE FOG                     
##  [77] ICE                            COLD                          
##  [79] HEAVY SNOW AND HIGH WINDS      SNOW                          
##  [81] FREEZING RAIN                  UNSEASONABLY COLD             
##  [83] EXCESSIVE RAINFALL             DROUGHT/EXCESSIVE HEAT        
##  [85] HEAVY SEAS                     ROUGH SURF                    
##  [87] Cold                           Coastal Flooding              
##  [89] ICY ROADS                      Hypothermia/Exposure          
##  [91] SNOW SQUALL                    RECORD HEAT                   
##  [93] WATERSPOUT                     HIGH WATER                    
##  [95] COLD WEATHER                   WINTER WEATHER                
##  [97] MARINE STRONG WIND             COASTAL FLOOD                 
##  [99] MARINE THUNDERSTORM WIND       HIGH SEAS                     
## [101] GUSTY WINDS                    WINTER STORM HIGH WINDS       
## [103] STRONG WINDS                   HURRICANE OPAL                
## [105] FREEZING RAIN/SNOW             THUNDERSNOW                   
## [107] WIND STORM                     URBAN AND SMALL STREAM FLOODIN
## [109] FREEZE                         LIGHTNING.                    
## [111] THUNDERTORM WINDS              FLOOD/RIVER FLOOD             
## [113] RIVER FLOOD                    AVALANCE                      
## [115] FOG AND COLD TEMPERATURES      HEAVY SURF                    
## [117] HURRICANE FELIX                BLOWING SNOW                  
## [119] RAIN/WIND                      THUNDERSTORM WIND G52         
## [121] HIGH WAVES                     LOW TEMPERATURE               
## [123] HYPOTHERMIA                    COLD/WINDS                    
## [125] RECORD COLD                    SNOW/ BITTER COLD             
## [127] RAPIDLY RISING WATER           RIVER FLOODING                
## [129] MINOR FLOODING                 FLOOD & HEAVY RAIN            
## [131] FLASH FLOODS                   Marine Accident               
## [133] Heavy Surf                     DRY MICROBURST                
## [135] WINDS                          COASTAL STORM                 
## [137] High Surf                      Extreme Cold                  
## [139] Strong Winds                   Extended Cold                 
## [141] Whirlwind                      MIXED PRECIP                  
## [143] Freezing Spray                 Mudslides                     
## [145] Cold Temperature               HYPOTHERMIA/EXPOSURE          
## [147] BLACK ICE                      COASTALSTORM                  
## [149] blowing snow                   FREEZING DRIZZLE              
## [151] FROST                          Snow Squalls                  
## [153] LANDSLIDES                     HIGH SWELLS                   
## [155] TSTM WIND/HAIL                 TSTM WIND (G35)               
## [157] HYPERTHERMIA/EXPOSURE          GUSTY WIND                    
## [159] WINTRY MIX                     THUNDERSTORM WIND (G40)       
## [161] LIGHT SNOW                     THUNDERSTORM                  
## [163] FALLING SNOW/ICE               COASTAL FLOODING              
## [165] ICE ON ROAD                    DROWNING                      
## [167] DUST DEVIL                     MARINE HIGH WIND              
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD  FLASH FLOOD ... WND

We see that there are a lot of repetitions of terms like heat, tornado, hurricane, flood, etc… From our most fatal events, we can create the following categories:
- Tornado / Wind
- Hurricane / Tropical storm
- Heat events
- Floods
- Cold/ Snow / Ice
- Marine events
- Fire
- Lightning
- Land/Mudslides
- Fog
- Rain

tornadocat <- grep("tornad|wind", stormtbl$EVTYPE, ignore.case = TRUE)
heatcat <- grep("heat|hot|warm|hyperthermia", stormtbl$EVTYPE, ignore.case = TRUE)
floodcat<- grep("flood|fld|high water|rising water", stormtbl$EVTYPE, ignore.case = TRUE)
hurricanecat <- grep("hurricane|typhoon|tropical storm", stormtbl$EVTYPE, ignore.case = TRUE)
coldcat<- grep("snow|ice|icy|sleet|freez|cold|frost|glaze|low temp|avalan|winter|wintry|hypothermia|blizzard", stormtbl$EVTYPE, ignore.case = TRUE)
seacat <- grep("current|surf|sea|marine|coastal|wave|swell|surge|tsunami", stormtbl$EVTYPE, ignore.case = TRUE)
firecat <- grep("fire", stormtbl$EVTYPE, ignore.case = TRUE)
lightningcat<- grep("lightning|thunder", stormtbl$EVTYPE, ignore.case = TRUE)
slidecat <- grep("landslide|mudslide", stormtbl$EVTYPE, ignore.case = TRUE)
raincat <- grep("rain|precip|waterspout", stormtbl$EVTYPE, ignore.case = TRUE)
fogcat <- grep("fog", stormtbl$EVTYPE, ignore.case = TRUE)

Now let’s check that our categories gather most of the data.

stormclean$category <- rep(NA, 902297)
stormclean$category[tornadocat] <- "Tornado or Wind"
stormclean$category[heatcat] <- "Heat related"
stormclean$category[floodcat] <- "Flood related"
stormclean$category[hurricanecat] <- "Hurricane"
stormclean$category[coldcat] <- "Cold related"
stormclean$category[seacat] <- "Sea related"
stormclean$category[firecat] <- "Fire related"
stormclean$category[lightningcat] <- "Lightning related"
stormclean$category[slidecat] <- "Land and mud slides"
stormclean$category[raincat] <- "Rain related"
stormclean$category[fogcat] <- "Fog related"
uncategorized <-which(is.na(stormclean$category))
totaldead <- sum(stormclean$FATALITIES)
totalcost<- sum(stormclean$total_economic_impact)
sum(stormclean[uncategorized,]$FATALITIES) / totaldead
## [1] 0.002839221
sum(stormclean[uncategorized,]$total_economic_impact) / totalcost
## [1] 0.0718294

We’ve captured over 99% of fatalities with our categories, however, 7% of the economic impact remains uncategorized. Let’s see if we need to add categories by sorting the uncategorized data by economic impact.

othercats <- stormclean[uncategorized,] %>% arrange(desc(total_economic_impact))
head(othercats)
## # A tibble: 6 x 7
##   EVTYPE FATALITIES INJURIES property_damage crop_damage total_economic_~
##   <fct>       <dbl>    <dbl>           <dbl>       <dbl>            <dbl>
## 1 HAIL            0        1      1800000000           0       1800000000
## 2 DROUG~          0        0               0  1000000000       1000000000
## 3 HAIL            0        0       900000000           0        900000000
## 4 DROUG~          0        0       645150000           0        645150000
## 5 DROUG~          0        0               0   578850000        578850000
## 6 DROUG~          0        0               0   515000000        515000000
## # ... with 1 more variable: category <chr>

We can see drought and hail events which have no fatalities but major economic impact, in the billions or hundreds of millions. We add these two categories to our table.

droughtcat <- grep("drought", stormtbl$EVTYPE, ignore.case = TRUE)
hailcat <- grep("hail", stormtbl$EVTYPE, ignore.case=TRUE)
stormclean$category[droughtcat] <- "Drought"
stormclean$category[hailcat] <- "Hail"
uncategorized <-which(is.na(stormclean$category))
sum(stormclean[uncategorized,]$total_economic_impact) / totalcost
## [1] 0.000381211

We now have over 99% of our economic damage categorized, so our categories cover virtually all of the relevant data.

Summary tables

To present the results in graphs, we’ll need to summarize the data by category.
Here we create one table for the most deadly events, and one for those with the most economic impact. We also include a column for rank, which will be useful for graphing, and a column for cumulative share, in order to avoid putting too many unrelevant categories in our graphs.

mostdeadly <- stormclean %>% group_by(category) %>% 
        summarize(totaldeaths=sum(FATALITIES), totalinjury = sum(INJURIES)) %>%
        arrange(desc(totaldeaths), desc(totalinjury)) 
r <- nrow(mostdeadly)
mostdeadly<- mutate(mostdeadly, rank=1:r, cumulative_share_deaths = cumsum(totaldeaths)/totaldead)
mostdeadly
## # A tibble: 14 x 5
##    category            totaldeaths totalinjury  rank cumulative_share_deaths
##    <chr>                     <dbl>       <dbl> <int>                   <dbl>
##  1 Tornado or Wind            6580      100179     1                   0.434
##  2 Heat related               2956        8832     2                   0.630
##  3 Flood related              1550        8674     3                   0.732
##  4 Cold related               1329        6844     4                   0.820
##  5 Sea related                1084        1427     5                   0.891
##  6 Lightning related          1029        7710     6                   0.959
##  7 Hurricane                   201        1714     7                   0.973
##  8 Rain related                122         403     8                   0.981
##  9 Fire related                 90        1608     9                   0.987
## 10 Fog related                  81        1077    10                   0.992
## 11 Hail                         45        1467    11                   0.995
## 12 Land and mud slides          44          55    12                   0.998
## 13 <NA>                         28         519    13                   1.00 
## 14 Drought                       6          19    14                   1
mostcostly <- stormclean %>% group_by(category) %>% 
        summarize(total.property.damage = sum(property_damage), total.crop.damage= sum(crop_damage), total.economic.cost= sum(total_economic_impact)) %>%
        arrange(desc(total.economic.cost))
s<- nrow(mostcostly)
mostcostly<- mutate(mostcostly, rank= 1:s, cumulative_share = cumsum(total.economic.cost)/totalcost)
mostcostly
## # A tibble: 14 x 6
##    category total.property.~ total.crop.dama~ total.economic.~  rank
##    <chr>               <dbl>            <dbl>            <dbl> <int>
##  1 Flood r~    167127574520.      12270338200    179397912720.     1
##  2 Hurrica~     93070725560        6211013800     99281739360      2
##  3 Tornado~     67505857757.       1736789220     69242646977.     3
##  4 Sea rel~     48663842950          36633500     48700476450      4
##  5 Cold re~     12698320263.       8704985450     21403305713.     5
##  6 Hail         17619991072.       3114212873     20734203945.     6
##  7 Drought       1046306000       13972621780     15018927780      7
##  8 Fire re~      8496628500         403281630      8899910130      8
##  9 Lightni~      7365931906.        664968478      8030900384.     9
## 10 Rain re~      3332250892         919315800      4251566692     10
## 11 Heat re~         9665700         898863500       908529200     11
## 12 Land an~       326077000          20017000       346094000     12
## 13 <NA>            30466680         151150950       181617630     13
## 14 Fog rel~        25011500                 0        25011500     14
## # ... with 1 more variable: cumulative_share <dbl>

Results

Events with the most impact in terms of fatalities and injuries

par(mar=c(8,4,4,1))
with(mostdeadly[1:10,], plot(totaldeaths, type="h", lwd =5,
                      main="Events with the most impact on human health", xaxt="n", xlab="",
                      ylim=c(0,10000),ylab ="total number of casualties"))
lines(mostdeadly$totalinjury, lwd=3, lty=2, col="red")
legend("topright", lwd=c(5,3), lty=c(1,2), col=c("black", "red"), legend=c("deaths", "injuries"))
                      
text(x=1:10, y=par()$usr[3]-0.1*(par()$usr[4]-par()$usr[3]),
     labels=mostdeadly[1:10,]$category, srt=45, adj=1, xpd=TRUE)

We can see that tornadoes and wind related events have by far the highest impact on human health, followed by heat, floods and cold related events. While they cause the most deaths, the number of injuries caused by tornadoes is also off the charts, more than ten times the number caused by any other type of event. Lightning is a secondary cause of death but causes many injuries.

Events with the most impact in terms of economic damages

par(mar=c(8,4,4,1))
with(mostcostly[1:10,], plot(total.economic.cost/1000000000, type="h", lwd =5,
                      main="Events with the most harmful economic impact ", xaxt="n", xlab="",
                      ylab ="total economic cost (in B$)"))
lines(mostcostly$total.property.damage/1000000000, col="red")
lines(mostcostly$total.crop.damage/1000000000, col="green")
text(x=1:10, y=par()$usr[3]-0.1*(par()$usr[4]-par()$usr[3]),
     labels=mostcostly[1:10,]$category, srt=45, adj=1, xpd=TRUE)
legend("topright", lwd=1, col=c("red", "green"), legend=c("property damage", "crop damage"))

We can see that floods cause a big share of the economic damage, followed by hurricanes and tornadoes or wind. Other marine events, such as tides and surges, also cause significant damage, and cold related events come in fifth place. Hail and drought, both of which are negligible in terms of their impact on human health, have also caused several billions of dollars of damage. Most damage is property damage, only droughts have more economic impact due to crop losses.

Bonus: Impact of extreme INDIVIDUAL events

While exploring the data, I noticed that particularly for economic damages, a few individual events seemed to have a huge share of the costs. Does this also apply to human health?

max(stormclean$FATALITIES) /totaldead
## [1] 0.03849455
max(stormclean$total_economic_impact) / totalcost
## [1] 0.2414504
fatal_individual <- arrange(stormclean, desc(FATALITIES))
costly_individual <- arrange(stormclean, desc(total_economic_impact))
sum(fatal_individual[1:100,]$FATALITIES) / totaldead
## [1] 0.2408055
sum(costly_individual[1:100,]$total_economic_impact) / totalcost
## [1] 0.67708

This is an interesting result: a few extreme individual events represent the majority (67%) of the economic damages, but not the majority (24%) of deaths. The worst event in terms of economic consequences caused 24% of all economic impacts recorded, but the worst event in terms of death caused just 3% of total deaths.
This can probably be explained by the fact that people can be evacuated when an event is predicted, but property is more difficult to move.