Determination of weather event types in the US that have caused the greatest number of fatalities and injuries, and had the greatest economic impact on property and crops

Synopsis

This study identifies the 10 weather event types that are most harmful to human health, in terms of number of fatalities and injuries caused in the US since 1950, and the 10 weather event types that have had the greatest economic impact on property and crops, in terms of value of damage caused. The weather event types are those defined by the NOAA here

National Weather Service Storm Data Documentation

The data were provided by the NOAA and were sourced from here:

Storm data [47Mb]

The data contained many errors, in that many events had not been categorised or had not been recorded in the required manner, either in respect of recorded weather type, or by value multipler.

A procedure was carried out to map actual recorded data to one of the correct categories, and to omit data that could not be mapped. The final data set captured 94% of total fatalities and injuries, and 97% of total damage value to property and crops.

The top three most damaging weather types in terms of impact on human health have been (in descending order) tornados, heat, and thuderstorms, whiile the top three in terms of economic impact on property and crops have been floods, hurricanes and tornados

Set up

Set working directory

rm(list=ls())
setwd("C:/Users/Mike/Rspace/JHU_RR/PA2") # amend pathway as required
#setwd("H:/Rspace/JHU_Data_Science/JHU_RR/PA2")

Load libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(data.table)
## 
## Attaching package: 'data.table'
## 
## The following objects are masked from 'package:dplyr':
## 
##     between, last
library(lubridate)
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:data.table':
## 
##     hour, mday, month, quarter, wday, week, yday, year
library(pander)
panderOptions('round', 2)

Data Source

The NOAA data for this analysis is taken from the link provided on the JHU Reproducible Research Peer Assignment 2 Coursera site:

Storm data [47Mb]

Information on this data is available here

National Weather Service Storm Data Documentation

National Climatic Data Center Storm Events FAQ

We save it into a csv.bz2 file on our local machine

Data Processing

Download data and load into R

Create data directory if it doesn’t already exist

if(!file.exists("data")){
        dir.create("data")
}

Download data if not yet already done so

if(!file.exists("./data/stormData.csv.bz2")){
        fileURL<-"http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
        download.file(fileURL,destfile="./data/stormData.csv.bz2")
        #include date of download
        datedownloaded<-date()
        datedownloaded
        
}

Load data into R

# using cache here causes memory overload in my computer!
# To do: find a faster way to read the data into R.
file<- bzfile(description = "./data/stormData.csv.bz2", open = 'r', encoding =
getOption('encoding'), compression = 9)
stormdata <- read.csv(file, header = TRUE, stringsAsFactors = FALSE)
#closeAllConnections()

Initial Inspection of the data

First we look at the structure of the data:

str(stormdata)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
#check for NAs
mean(is.na(stormdata$FATALITIES))
## [1] 0
mean(is.na(stormdata$INJURIES))
## [1] 0
mean(is.na(stormdata$PROPDMG))
## [1] 0
mean(is.na(stormdata$CROPDMG))
## [1] 0
# There are no NAs in these relevant data columns. Yay!

This suggests that some data removal can be carried out:

Reduce the size of the data set

All rows of the data set for which fatalities, injuries, crop damage and property damage values were all zero are removed from the data set, as are columns that will not be used in the subsequent analysis..

# select columns required
library(dplyr)
df1 <-select(stormdata,STATE__:EVTYPE,FATALITIES:CROPDMGEXP)
rm(stormdata)
str(df1)
## 'data.frame':    902297 obs. of  14 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
# filter out null rows
df2<-filter(df1,FATALITIES+INJURIES+PROPDMG+CROPDMG>0 )
rm(df1)
nrow(df2)
## [1] 254633

Hence about 70% of the events reported and recorded caused neither fatalities nor injuries, nor was any economic damage to property or crops recorded.

We inspect the data by year to see the time periods from when the data mainly arises:

# extract year from the date column
library(lubridate)
year<-as.POSIXlt(mdy_hms(df2$BGN_DATE))
year<-year$year+1900
summary(year) # check this looks OK.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1950    1997    2002    2000    2008    2011
# include this column in the reduced data set df2
sd.red<-data.frame(year,df2)
rm(df2,year)
fit.i<-aggregate(sd.red$INJURIES,by=list(sd.red$year),FUN="sum")
fit.f<-aggregate(sd.red$FATALITIES,by=list(sd.red$year),FUN="sum")
fit<-data.frame(fit.i$Group.1,fit.i$x+fit.f$x)
summary(fit)
##  fit.i.Group.1  fit.i.x...fit.f.x
##  Min.   :1950   Min.   :  558    
##  1st Qu.:1965   1st Qu.: 1098    
##  Median :1980   Median : 1864    
##  Mean   :1980   Mean   : 2511    
##  3rd Qu.:1996   3rd Qu.: 3167    
##  Max.   :2011   Max.   :11864
names(fit)[names(fit)=="fit.i.Group.1"] <- "Year"
names(fit)[names(fit)=="fit.i.x...fit.f.x"] <- "Fatalities.Injuries"
pt<-aggregate(sd.red$PROPDMG,by=list(sd.red$year),FUN="sum")
ct<-aggregate(sd.red$CROPDMG,by=list(sd.red$year),FUN="sum")
pct<-data.frame(pt$Group.1,pt$x+ct$x)
summary(pct)
##    pt.Group.1    pt.x...ct.x    
##  Min.   :1950   Min.   : 10561  
##  1st Qu.:1965   1st Qu.: 33248  
##  Median :1980   Median : 53566  
##  Mean   :1980   Mean   :197780  
##  3rd Qu.:1996   3rd Qu.:423508  
##  Max.   :2011   Max.   :918602
names(pct)[names(pct)=="pt.Group.1"] <- "Year"
names(pct)[names(pct)=="pt.x...ct.x"] <- "Totals"

Merge these two totals together

annual.totals<-fit
annual.totals$Damage<-pct$Totals
rm(fit.i,fit.f,fit,pt,ct,pct) # clean up
#names(annual.totals)[names(annual.totals)=="pct$Total.Damage"] <- "Total.Damage"
str(annual.totals)
## 'data.frame':    62 obs. of  3 variables:
##  $ Year               : num  1950 1951 1952 1953 1954 ...
##  $ Fatalities.Injuries: num  729 558 2145 5650 751 ...
##  $ Damage             : num  16999 10561 16680 19182 23368 ...

From this analysis, we find that the annual number of reported events and the value of the damage caused is shown below in Figure 1.

library(ggplot2)

par(mfrow = c(1, 2))
par(mar = c(4, 4,3,1))
par(bg="white")

with(annual.totals,plot(Year,Fatalities.Injuries,
                        xlim=c(1950,2020),
                        ylim=c(0,max(annual.totals$Fatalities.Injuries)),
                        xlab="Year",
                        ylab="Annual fatalities and Injuries",
                        pch=19,
                        col="red",
                        #main="Fatalities / Injuries"
                        )
        )
 with(annual.totals,plot(Year,Damage,
                        xlim=c(1950,2020),
                        ylim=c(0,max(annual.totals$Damage)),
                        xlab="Year",
                        ylab="Annual property and crop damage value ($) ",
                        pch=19,
                        col="blue",
                        #main="Property and crop damage ($)"
                        )
        )

Figure 1 : Variation by year of the number of weather events recorded and the economic value of their damage to property and crops.

We note a huge rise in fatalities and injuries from about 1990 onwards. This probably reflects a rise in the available volume of data, rather than an actual rise’ in the number of events, but does suggest that the results of our subsequent analysis are more reflective of the last 25 years than of the first years since 1950. Further analysis could verify this.

We now to turn to the main goal of our analysis, which is to fin the most damaging type of weather events in the US since 1950, as measured by impact on human health and by damage caused to property and crops. To achieve this, the data set must be cleaned. The meaning of this is explained below.

Cleaning the event type data

We wish to identify those weather types that have been most damaging to people, in terms of numbers of fatalities and injuries caused, and to crops and property in economic terms. The categories used will be those recognised by NOAA:

evtype<-c("Astronomical Low Tide","Avalanche","Blizzard","Coastal Flood","Cold/Wind Chill","Debris Flow","Dense Fog","Dense Smoke","Drought","Dust Devil","Dust Storm","Excessive Heat","Extreme Cold/Wind Chill","Flash Flood","Flood","Frost/Freeze","Funnel Cloud","Freezing Fog","Hail","Heat","Heavy Rain","Heavy Snow","High Surf","High Wind","Hurricane","Ice Storm","Lake-Effect Snow","Lakeshore Flood","Lightning","Marine Hail","Marine High Wind","Marine Strong Wind","Marine Thunderstorm Wind","Rip Current","Seiche","Sleet","Storm Surge/Tide","Strong Wind","Thunderstorm Wind","Tornado","Tropical Depression","Tropical Storm","Tsunami","Volcanic Ash","Waterspout","Wildfire","Winter Storm","Winter Weather")
evtype
##  [1] "Astronomical Low Tide"    "Avalanche"               
##  [3] "Blizzard"                 "Coastal Flood"           
##  [5] "Cold/Wind Chill"          "Debris Flow"             
##  [7] "Dense Fog"                "Dense Smoke"             
##  [9] "Drought"                  "Dust Devil"              
## [11] "Dust Storm"               "Excessive Heat"          
## [13] "Extreme Cold/Wind Chill"  "Flash Flood"             
## [15] "Flood"                    "Frost/Freeze"            
## [17] "Funnel Cloud"             "Freezing Fog"            
## [19] "Hail"                     "Heat"                    
## [21] "Heavy Rain"               "Heavy Snow"              
## [23] "High Surf"                "High Wind"               
## [25] "Hurricane"                "Ice Storm"               
## [27] "Lake-Effect Snow"         "Lakeshore Flood"         
## [29] "Lightning"                "Marine Hail"             
## [31] "Marine High Wind"         "Marine Strong Wind"      
## [33] "Marine Thunderstorm Wind" "Rip Current"             
## [35] "Seiche"                   "Sleet"                   
## [37] "Storm Surge/Tide"         "Strong Wind"             
## [39] "Thunderstorm Wind"        "Tornado"                 
## [41] "Tropical Depression"      "Tropical Storm"          
## [43] "Tsunami"                  "Volcanic Ash"            
## [45] "Waterspout"               "Wildfire"                
## [47] "Winter Storm"             "Winter Weather"
levtypes<-length(evtype)

which number 48 in all

evrep<-sd.red$EVTYPE
nevrec<-length(unique(evrep))

Howeever, 488 event types wereactually recorded which means that many events must have been recorded with non-standard terms. Hence before final analysis, the recorded events have had to be mapped to one or other of the recognised categories. This has been achieved by a process of matching terms, identifying common alternative terms (eg TSTM for thunderstorm, precipitation for rain etc), and correction of obvious typographical errors eg TORNDAo etc.

The matching process is described below and may be skipped if not of interest. The criterion for this process was that at least 95% of the reported events should be mapped to one NOAA category. Those that could not be mapped were not included in the final analysis.

Mapping of reported weather events to NOAA categories.

This is an iterative process.

First, we make some substitutions to remove obvious typographical errors (eg THUNERSTORM) and to substitute NOAA category terms for other equivalent terms used in the data set (eg “RAIN” for “precipitation”)

  • TSTM replaced by Thunderstorm
  • Precipitation replaced by Rain
evrep<-sd.red$EVTYPE
evrep<-gsub("Precipitation|drizzle|shower","RAIN",evrep,ignore.case=TRUE)
evrep<-gsub("TSTM|THUNDERESTORM|THUNERSTORM|THUNDEERSTORM|THUNDERTORM",
            "THUNDERSTORM",evrep,ignore.case=TRUE)
evrep<-gsub("STROM","STORM",evrep,ignore.case=TRUE)
evrep<-gsub("TUN","THUN",evrep,ignore.case=TRUE)
evrep<-gsub("THUNDERSTORM\\S","THUNDERSTORM\\s",evrep,ignore.case=TRUE)
evrep<-gsub("STORMS","STORM",evrep,ignore.case=TRUE)
evrep<-gsub("WINDS|WINS","WIND",evrep,ignore.case=TRUE)
evrep<-gsub("Typhoon","Hurricane",evrep,ignore.case=TRUE)
evrep<-gsub("Hurricane\\s[A-z]|Hurricane[A-z]","Hurricane",evrep,ignore.case=TRUE)
evrep<-gsub("ASTRONOMICAL HIGH TIDE","Storm Surge/Tide",evrep,ignore.case=TRUE)
evrep<-gsub("Whirlwind|Torndao","TORNADO",evrep,ignore.case=TRUE)
evrep<-gsub("Lighting|LIGNTNING","Lightning",evrep,ignore.case=TRUE)
evrep<-gsub("Excessive|Hvy|Severe","HEAVY",evrep,ignore.case=TRUE)
evred<-sapply(1:levtypes,function(x){
        grep(evtype[x],evrep,ignore.case=TRUE)
})
event<-c(rep("Other",nrow(sd.red)))
for(i in 1:47){
        event[evred[[i]]]<-evtype[i]
}

The number of terms usewd that were not captured and mapped by this process is:

missed<-evrep[grep("Other",event)]
unique(missed)
##   [1] "WIND"                      "COLD"                     
##   [3] "EXTREME COLD"              "FREEZE"                   
##   [5] "MARINE MISHAP"             "HIGH TIDES"               
##   [7] "HIGH SEAS"                 "HEAVY TURBULENCE"         
##   [9] "RECORD RAINFALL"           "APACHE COUNTY"            
##  [11] "GUSTY WIND"                "WILD FIRES"               
##  [13] "HIGH"                      "MUDSLIDES"                
##  [15] "RAINSTORM"                 "HEAVY THUNDERSTORM"       
##  [17] "THUNDERSTORM"              "HEAVY SURF"               
##  [19] "DRY MIRCOBURST WIND"       "DRY MICROBURST"           
##  [21] "MICROBURST WIND"           "UNSEASONABLY WARM"        
##  [23] "STORM SURGE"               "DAMAGING FREEZE"          
##  [25] "SNOW"                      "FROST"                    
##  [27] "FREEZING RAIN/SNOW"        "THUNDERSNOW"              
##  [29] "COOL AND WET"              "GLAZE ICE"                
##  [31] "MUD SLIDE"                 "HIGH  WIND"               
##  [33] "MUD SLIDES"                "COLD AND WET CONDITIONS"  
##  [35] "HEAVY WETNESS"             "GUSTNADO"                 
##  [37] "FREEZING RAIN"             "EXTREME WIND CHILL"       
##  [39] "WIND DAMAGE"               "FOG"                      
##  [41] "SNOW AND ICE"              "WIND STORM"               
##  [43] "GRASS FIRES"               "ICE"                      
##  [45] "THUNDERSTORM  WIND"        "WINTER WEATHER"           
##  [47] "RAIN"                      "ICE FLOES"                
##  [49] "HEAVY LAKE SNOW"           "RECORD COLD"              
##  [51] "COLD WAVE"                 "GLAZE"                    
##  [53] "MICROBURST"                "AVALANCE"                 
##  [55] "ICE JAM"                   "FOREST FIRES"             
##  [57] "FROST\\FREEZE"             "HARD FREEZE"              
##  [59] "URBAN AND SMALL"           "FOG AND COLD TEMPERATURES"
##  [61] "SNOW/COLD"                 "MUDSLIDE"                 
##  [63] "SNOW SQUALL"               "ICY ROADS"                
##  [65] "HEAVY MIX"                 "SNOW FREEZING RAIN"       
##  [67] "SNOW/FREEZING RAIN"        "SNOW SQUALLS"             
##  [69] "RECORD SNOW"               "COASTAL SURGE"            
##  [71] "THUNDERSTORM DAMAGE TO"    "BLOWING SNOW"             
##  [73] "THUDERSTORM WIND"          "ICE AND SNOW"             
##  [75] "STORM FORCE WIND"          "RAIN/WIND"                
##  [77] "SNOW/ICE"                  "UNSEASONABLY WARM AND DRY"
##  [79] "UNSEASONABLY COLD"         "HIGH WAVES"               
##  [81] "LOW TEMPERATURE"           "HYPOTHERMIA"              
##  [83] "COLD/WIND"                 "SNOW/ BITTER COLD"        
##  [85] "COLD WEATHER"              "RAPIDLY RISING WATER"     
##  [87] "WILD/FOREST FIRE"          "SNOW ACCUMULATION"        
##  [89] "SNOW/ ICE"                 "SNOW/BLOWING SNOW"        
##  [91] "LANDSLIDE"                 "THUNDERSTORMINDS"         
##  [93] "WILD/FOREST FIRES"         "HEAVY SEAS"               
##  [95] "?"                         "HIGH WATER"               
##  [97] "LANDSLIDES"                "URBAN/SMALL STREAM"       
##  [99] "BRUSH FIRE"                "HEAVY SWELLS"             
## [101] "URBAN SMALL"               "Other"                    
## [103] "URBAN/SML STREAM FLD"      "ROUGH SURF"               
## [105] "Heavy Surf"                "Marine Accident"          
## [107] "Freeze"                    "COASTAL STORM"            
## [109] "Damaging Freeze"           "Beach Erosion"            
## [111] "Unseasonable Cold"         "Early Frost"              
## [113] "Wintry Mix"                "Extreme Cold"             
## [115] "Torrential Rainfall"       "Landslump"                
## [117] "Coastal Storm"             "EXTREME WINDCHILL"        
## [119] "Glaze"                     "Extended Cold"            
## [121] "Light snow"                "Light Snow"               
## [123] "MIXED PRECIP"              "Freezing Spray"           
## [125] "DOWNBURST"                 "Mudslides"                
## [127] "Microburst"                "Mudslide"                 
## [129] "Cold"                      "Snow Squalls"             
## [131] "Wind Damage"               "Light Snowfall"           
## [133] "Freezing RAIN"             "Gusty wind/rain"          
## [135] "Wind"                      "Cold Temperature"         
## [137] "Snow"                      "COLD AND SNOW"            
## [139] "RAIN/SNOW"                 "Gusty WIND"               
## [141] "AGRICULTURAL FREEZE"       "OTHER"                    
## [143] "Hypothermia/Exposure"      "HYPOTHERMIA/EXPOSURE"     
## [145] "Lake Effect Snow"          "Freezing Rain"            
## [147] "Mixed RAIN"                "BLACK ICE"                
## [149] "COASTALSTORM"              "LIGHT SNOW"               
## [151] "DAM BREAK"                 "blowing snow"             
## [153] "GRADIENT WIND"             "gradient wind"            
## [155] "Gradient wind"             "WET MICROBURST"           
## [157] "Heavy surf and wind"       "HIGH SWELLS"              
## [159] "UNSEASONAL RAIN"           "COASTAL EROSION"          
## [161] "HYPERTHERMIA/EXPOSURE"     "WINTRY MIX"               
## [163] "ROCK SLIDE"                "LANDSPOUT"                
## [165] "LAKE EFFECT SNOW"          "MIXED RAIN"               
## [167] "WIND AND WAVE"             "LIGHT FREEZING RAIN"      
## [169] "ICE ROADS"                 "ROUGH SEAS"               
## [171] "NON-HEAVY WIND DAMAGE"     "WARM WEATHER"             
## [173] "LATE SEASON SNOW"          "WINTER WEATHER MIX"       
## [175] "ROGUE WAVE"                "FALLING SNOW/ICE"         
## [177] "BLOWING DUST"              "HAZARDOUS SURF"           
## [179] "ICE ON ROAD"               "DROWNING"                 
## [181] "WINTER WEATHER/MIX"
length(missed)
## [1] 3357
length(missed)/length(event)
## [1] 0.01318368

The capture rate is 99 %

This exceeds the threshold of 95% of recorded events set for completeness of capture, and so we include the mapped event set in the original set..

# Add the mapped weather type vector to the reduced data set
sd.red<-data.frame(select(sd.red,year:EVTYPE),
                    event,
                    select(sd.red,FATALITIES:CROPDMGEXP),
                    stringsAsFactors=FALSE)

Cleaning the damage value data

Damage values to property or crops should be recorded as a number of dollars together with a multipler, that is either “” (blank), denoting “as is”, or “K” for a thousand, “M” for a million or “B” for a billion 1000. Since these are human estimates, there are likely to be errors, even if they had all been entered according to the scheme just described In fact, they have not.

We include in our analysis only those values that have either no signifier or one of the recognised signifiers in either lower of upper case, then multiply the damage values by the appropriate factor, to give final values in millions of dollars.

# select for valid value multipliers
# select for valid value multipliers
sd.redd<-sd.red[grepl("[kK]|[mM]|[bB]|[^3]|[^5]",sd.red$PROPDMGEXP),] 
sd.redd$PROPDMGEXP<-gsub("[^kK][^mM][^bB]",'0',sd.redd$PROPDMGEXP)
sd.redd<-sd.redd[grepl("[kK]|[mM]|[bB]",sd.redd$CROPDMGEXP),]
sd.redd$PROPDMGEXP<-gsub("3|5","0.000001",sd.redd$PROPDMGEXP) #remove pesky 3 and 5

# express damage mulipliers in billions of dollars
sd.redd$PROPDMGEXP<-gsub("[kK]","0.000001",sd.redd$PROPDMGEXP)
sd.redd$PROPDMGEXP<-gsub("[mM]","0.001",sd.redd$PROPDMGEXP)
sd.redd$PROPDMGEXP<-gsub("[bB]","1",sd.redd$PROPDMGEXP)

sd.redd$PROPDMGEXP<-as.numeric(sd.redd$PROPDMGEXP)

sd.redd$CROPDMGEXP<-gsub("[kK]","0.000001",sd.redd$CROPDMGEXP)
sd.redd$CROPDMGEXP<-gsub("[mM]","0.001",sd.redd$CROPDMGEXP)
sd.redd$CROPDMGEXP<-gsub("[bB]","1",sd.redd$CROPDMGEXP)

sd.redd$CROPDMGEXP<-as.numeric(sd.redd$CROPDMGEXP)

# check replacements are complete
# should all be 1e-6, 1e-3 or 1
table(sd.redd$PROPDMGEXP)
## 
##     0 1e-06 0.001     1 
##     4 93707  3925    24
table(sd.redd$CROPDMGEXP)
## 
## 1e-06 0.001     1 
## 96113  1544     3
sd.redd$PROPDMG<-sd.redd$PROPDMG*sd.redd$PROPDMGEXP
sd.redd$CROPDMG<-sd.redd$CROPDMG*sd.redd$CROPDMGEXP

Overall Metrics of the data

tfatalities <-sum(sd.red$FATALITIES)
tinjuries<-sum(sd.red$INJURIES)
ttotal<-tfatalities+tinjuries

tpropdmg <-sum(sd.redd$PROPDMG)
tcropdmg <-sum(sd.redd$CROPDMG)
dtotal<-tpropdmg+tcropdmg

Ranking of Events by Human Impact

#find fatality and injury number for each event type
fatalities<-sd.red %>% group_by(event) %>% summarise(fatalities=sum(FATALITIES))
injuries<-sd.red %>% group_by(event) %>% summarise(injuries=sum(INJURIES))
human<-left_join(fatalities,injuries,by="event")
human<-mutate(human,total=fatalities+injuries)
# add a totals row at the top
human<-arrange(human,desc(total))
totals<-data.frame("Total",tfatalities,tinjuries,ttotal)
names(totals)<-names(human)
human<-rbind(totals,human)

# calculate each row as % of total
fatfrac<-100*human$fatalities/ttotal
injfrac<-100*human$injuries/ttotal
totfrac<-100*human$total/ttotal

# calculate sum contribution of top 5.
top10frac<-sum(fatfrac[2:11])
top10inj<-sum(injfrac[2:11])
top10tot<-sum(totfrac[2:11])

# combine for reults table
human<-cbind(human,fatfrac,injfrac,totfrac)
human$event<-as.character(human$event)
humantop10<-human[1:11,]
names(humantop10)[names(humantop10)=="fatfrac"] <- "Fatalities"
names(humantop10)[names(humantop10)=="injfrac"] <- "Injuries"
# melt data for plot
library(reshape2)
hmelt<-melt(humantop10,id="event",measure.vars=c("Fatalities","Injuries"))
hmelt$event<-as.factor(hmelt$event)
hmelt$variable<-as.factor(hmelt$variable)
hmelt<-arrange(hmelt,desc(variable),desc(value))

## set the levels in order we want

hmelt <- transform(hmelt, 
                event.ord  = factor(
                     event ,
                     levels=c( 'Hurricane','Winter Storm','High Wind','Ice Storm','Other','Lightning','Flood','Thunderstorm Wind', 'Heat','Tornado','Total'),
                     ordered =TRUE))

Ranking of Events by Economic Impact

#find property and crop damage value for each event type
propdmg<-sd.redd %>% group_by(event) %>% summarise(propdmg=sum(PROPDMG))
cropdmg<-sd.redd %>% group_by(event) %>% summarise(cropdmg=sum(CROPDMG))
damage<-left_join(propdmg,cropdmg,by="event")
damage<-mutate(damage,totald=propdmg+cropdmg)
# add a totals row at the top
damage<-arrange(damage,desc(totald))
totalsd<-data.frame("Total",tpropdmg,tcropdmg,dtotal)
names(totalsd)<-names(damage)
damage<-rbind(totalsd,damage)

# calculate each row as % of total
propfrac<-100*damage$propdmg/dtotal
cropfrac<-100*damage$cropdmg/dtotal
dtotfrac<-100*damage$totald/dtotal

# calculate sum contribution of top 5.
dtop10prop<-sum(propfrac[2:11])
dtop10crop<-sum(cropfrac[2:11])
dtop10tot<-sum(dtotfrac[2:11])

# combine for reults table
damage<-cbind(damage,propfrac,cropfrac,dtotfrac)
damage$event<-as.character(damage$event)
damagetop10<-damage[1:11,]
names(damagetop10)[names(damagetop10)=="propfrac"] <- "Property"
names(damagetop10)[names(damagetop10)=="cropfrac"] <- "Crops"
# melt data for plot
library(reshape2)
dmelt<-melt(damagetop10,id="event",measure.vars=c("Property","Crops"))
dmelt$event<-as.factor(dmelt$event)
dmelt$variable<-as.factor(dmelt$variable)
dmelt<-arrange(dmelt,desc(variable),desc(value))

## set the levels in order we want

dmelt <- transform(dmelt, 
                eventd.ord  = factor(
                     event ,
                     levels=c( 'Drought','High Wind','Wildfire','Storm Surge/Tide', 'Thunderstorm Wind','Ice Storm','Hail','Tornado','Hurricane','Flood','Total'),
                     ordered =TRUE))

Results

Injuries and Fatalities

The top 10 weather event types in respect of total fatalities and injuries caused in the US since 1950 are shown in Figure 2 below.

#g<-ggplot(data=hmelt, aes(x=event, y=value, fill=variable)) +
#   geom_bar(stat="identity", position=position_dodge())

g<-ggplot(data=hmelt,
          aes(x=event.ord,y=value,fill=variable,order=event.ord))+
        geom_bar(stat="identity")+
        #facet_wrap(~variable)+
        coord_flip()+
        scale_y_continuous(breaks = seq(0, 100, 10))+
        theme(axis.text.x = element_text(size=14),
        axis.text.y=element_text(size=14))+
        labs(y = "% of total fatalities and injuries")+
        theme(axis.title.x = element_text(size=14,vjust=-.5),
        axis.title.y=element_blank())+
        theme(legend.text=element_text(size=12),
                legend.title = element_blank())+
        theme(legend.position=c(.8, .2))
g

Figure 2: The top 10 categories for relative impact of weather events on human health by weather type, as measured by fatalities and injuries caused.

Of the 48 weather event types recognised by NOAA, 94 % of all fatalities and injuries have been caused by these 10 categories.

Economic damage

The top 10 weather event types in respect of total economic impact caused in the US since 1950 are shown in Figure 3 below, subdivided by value of damage to property and by damage to crops.

d<-ggplot(data=dmelt,
        aes(x=eventd.ord, y=value,fill=variable,order=eventd.ord))+
        geom_bar(stat="identity")+
        #facet_wrap(~variable)+
        coord_flip()+
        scale_y_continuous(breaks = seq(0, 100, 10))+
        theme(axis.text.x = element_text(size=14),
        axis.text.y=element_text(size=14))+
        labs(y = "% of total value of damage caused")+
        theme(axis.title.x = element_text(size=14,vjust=-.5),
        axis.title.y=element_blank())+
        theme(legend.text=element_text(size=12),
                legend.title = element_blank())+
        theme(legend.position=c(.8, .2))
d

Figure 3: The top 10 categories for relative conomic damage to property and crops caused by weather events, by weather type.

Of the 48 weather event types recognised by NOAA, 97 % of all damage value to property and crops have been caused by these 10 categories.

# clean workspace
rm(list=ls())