Synopsis

This report contains an analysis of the damage caused by severe weather events based on the data in the National Oceanic and Atmospheric Administration’s storm database. Specifically we determine the kind of weather events that have the greatest impact in terms of fatalities & injuries and economic cost (property and crop damage).

The analysis consists of four main steps:

  1. Loading the raw data and analyzing the general characteristics of the information relevant to the current analysis
  2. Add new columns that contain transformed versions of the raw data to help with the analysis
  3. Rank the various weather events in terms of their impact on life and economic cost
  4. Investigate the changes in the cost and frequency of weather events as a function of time \(~\)

Data Processing

Data Loading and Exploratory Analysis

The NOAA storm database is available in the form of a compressed CSV (.csv.bz2) file. This file is downloaded and the data is read into a data frame directly from the compressed file using the read.csv function.

download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "stormdata.bz2")
stormdata <- read.csv("stormdata.bz2", header = TRUE)

The columns in the CSV file are listed below:

colnames(stormdata)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

The columns that are relevant to the current analysis are BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP. The str and summary functions can be used to get an overview of the type of data in these columns:

cols <- which(colnames(stormdata) %in% c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"))
str(stormdata[,cols])
## 'data.frame':    902297 obs. of  8 variables:
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(stormdata[,cols])
##               BGN_DATE                    EVTYPE         FATALITIES      
##  5/25/2011 0:00:00:  1202   HAIL             :288661   Min.   :  0.0000  
##  4/27/2011 0:00:00:  1193   TSTM WIND        :219940   1st Qu.:  0.0000  
##  6/9/2011 0:00:00 :  1030   THUNDERSTORM WIND: 82563   Median :  0.0000  
##  5/30/2004 0:00:00:  1016   TORNADO          : 60652   Mean   :  0.0168  
##  4/4/2011 0:00:00 :  1009   FLASH FLOOD      : 54277   3rd Qu.:  0.0000  
##  4/2/2006 0:00:00 :   981   FLOOD            : 25326   Max.   :583.0000  
##  (Other)          :895866   (Other)          :170878                     
##     INJURIES            PROPDMG          PROPDMGEXP        CROPDMG       
##  Min.   :   0.0000   Min.   :   0.00          :465934   Min.   :  0.000  
##  1st Qu.:   0.0000   1st Qu.:   0.00   K      :424665   1st Qu.:  0.000  
##  Median :   0.0000   Median :   0.00   M      : 11330   Median :  0.000  
##  Mean   :   0.1557   Mean   :  12.06   0      :   216   Mean   :  1.527  
##  3rd Qu.:   0.0000   3rd Qu.:   0.50   B      :    40   3rd Qu.:  0.000  
##  Max.   :1700.0000   Max.   :5000.00   5      :    28   Max.   :990.000  
##                                        (Other):    84                    
##    CROPDMGEXP    
##         :618413  
##  K      :281832  
##  M      :  1994  
##  k      :    21  
##  0      :    19  
##  B      :     9  
##  (Other):     9

This information reveals the following issues with the data:

  1. The event type classification needs to be reviewed - several EVTYPE values refer to the same or similar weather events - e.g. “THUNDERSTORM WIND” and “TSTM WIND”. Hurricane events are represented by 10 different EVTYPE values:

    hurricane_tags <- stormdata[which(str_to_lower(substr(stormdata$EVTYPE,1,9)) == "hurricane"), "EVTYPE"]
    summary(hurricane_tags)[summary(hurricane_tags) > 0]
    ##                  HURRICANE          HURRICANE/TYPHOON 
    ##                        174                         88 
    ##             HURRICANE OPAL             HURRICANE ERIN 
    ##                          9                          7 
    ## HURRICANE-GENERATED SWELLS          Hurricane Edouard 
    ##                          3                          2 
    ##            HURRICANE FELIX            HURRICANE EMILY 
    ##                          2                          1 
    ##           HURRICANE GORDON  HURRICANE OPAL/HIGH WINDS 
    ##                          1                          1

    Similarly there are 99 different EVTYPE values for flood events:

    flood_tags <- stormdata[which(str_detect(str_to_lower(stormdata$EVTYPE), "flood")), "EVTYPE"]
    head(summary(flood_tags)[summary(flood_tags) > 0])
    ##       FLASH FLOOD             FLOOD    FLASH FLOODING     COASTAL FLOOD 
    ##             54277             25326               682               650 
    ## FLOOD/FLASH FLOOD       URBAN FLOOD 
    ##               624               249
    tail(summary(flood_tags)[summary(flood_tags) > 0])
    ## LANDSLIDE/URBAN FLOOD     LOCAL FLASH FLOOD           LOCAL FLOOD 
    ##                     1                     1                     1 
    ##           MINOR FLOOD        Minor Flooding               (Other) 
    ##                     1                     1                    13

    In order to get an accurate picture of the impact of events we need to gather different varieties of the same event type into broader categories.

  2. The property and crop damage data needs to be converted to dollar amounts based on the information in the PROPDMG/PROPDMGEXP and CROPDMG/CROPDMGEXP column pairs.

  3. The data in the BGN_DATE needs to be converted to a date value so that we can check the changes in frequency and impact of severe weather events as a function of time.

Bucketing event types

There are 985 distinct EVTYPE string values in the storm database. As mentioned above there are many duplicates among these values and it is necessary to classify these in broader categories. For the current analysis we adopted the following 12 categories:

  1. Floods
  2. Heavy Rains
  3. High Seas/Extreme Tides/Surf
  4. High Winds
  5. Hurricanes
  6. Lightning
  7. Severe Cold
  8. Severe Heat
  9. Snow/Winter Storm/Blizzard
  10. Thunderstorm
  11. Tornadoes/Waterspouts
  12. Wildfires

In cases where event types could not be assigned to one of these categories they were assigned a category of “Miscellaneous”. The bucketing of different EVTYPE values into categories was done manually and the mapping is available on GitHub in a CSV file EventTypeMapping.csv.

NOTE: The category mapping can be easily checked and modified - the only thing that is needed to replace the “mapping.csv” file in the working directory.

download.file("https://raw.githubusercontent.com/HarshBaxi/ReproducibleResearch/master/EventTypeMapping.csv", destfile = "mapping.csv")
mapping <- read.csv("mapping.csv")
stormdata$EventCategory <- mapping[match(stormdata$EVTYPE, mapping$EVTYPE), 2]
summary(stormdata$EventCategory)
##                       Floods                  Heavy Rains 
##                        86106                        11983 
## High Seas/Extreme Tides/Surf                   High Winds 
##                         2562                        26306 
##                   Hurricanes                    Lightning 
##                          301                        15770 
##                Miscellaneous                  Severe Cold 
##                         4546                       295712 
##                  Severe Heat   Snow/Winter Storm/Blizzard 
##                         5579                        40085 
##                 Thunderstorm        Tornadoes/Waterspouts 
##                       337507                        71528 
##                    Wildfires                         NA's 
##                         4237                           75

There are 75 rows in the storm database that have no value in the EventCategory column. These rows have EVTYPE values that indicate that the records represent summary reports of weather events rather than actual events and were deliberatly left out of the category mapping:

head(stormdata[which(is.na(stormdata$EventCategory)), "EVTYPE"])
## [1] Summary Jan 17       Summary of March 14  Summary of March 23 
## [4] Summary of March 24  Summary of April 3rd Summary of April 12 
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... WND

We will exclude these from our analysis:

stormdata <- stormdata[!is.na(stormdata$EventCategory), ]

Dollar value for crop and property damage

The dollar value for crop and property damage is derived by combining the data in the PROPDMG/PROPDMGEXP and CROPDMG/CROPDMGEXP columns. The “DMGEXP” columns contain a letter code that gives the multiplier for the number in the “DMG” column. Valid letter codes are:

  • “K”: Multiplier of 103
  • “M”: Multiplier of 106
  • “B”: Multiplier of 109

Checking the values in the “DMGEXP” columns it is clear that there are some invalid codes for rows where there is a non-zero number in the “DMG” columns:

summary(stormdata[which(stormdata$CROPDMG > 0),]$CROPDMGEXP)
##           ?     0     2     B     k     K     m     M 
##     3     0    12     0     7    21 20137     1  1918
# Percentage of rows with non-zero crop damage values and valid codes
100 * nrow(stormdata[which(stormdata$CROPDMG > 0 & stormdata$CROPDMGEXP %in% c("B", "M", "K")),])/nrow(stormdata[which(stormdata$CROPDMG > 0), ])
## [1] 99.83257
summary(stormdata[which(stormdata$PROPDMG > 0),]$PROPDMGEXP)
##             -      ?      +      0      1      2      3      4      5 
##     76      1      0      5    209      0      1      1      4     18 
##      6      7      8      B      h      H      K      m      M 
##      3      2      0     40      1      6 227481      7  11319
# Percentage of rows with non-zero crop damage values and valid codes
100 * nrow(stormdata[which(stormdata$PROPDMG > 0 & stormdata$PROPDMGEXP %in% c("B", "M", "K")),])/nrow(stormdata[which(stormdata$PROPDMG > 0), ])
## [1] 99.86035

The percentage of rows with invalid exponent codes is very small (< 0.2%) and the dollar value for these will be set to zero.

multipliers <- data.frame(code = c("K", "M", "B"), multiplier = c(1e+03, 1e+06, 1e+09))
stormdata$PropDamageMultiplier <- multipliers[match(stormdata$PROPDMGEXP, multipliers$code), 2]
stormdata$CropDamageMultiplier <- multipliers[match(stormdata$CROPDMGEXP, multipliers$code), 2]
stormdata[which(is.na(stormdata$CropDamageMultiplier)), ]$CropDamageMultiplier <- 0
stormdata[which(is.na(stormdata$PropDamageMultiplier)), ]$PropDamageMultiplier <- 0

stormdata$PropDamageDollars <- stormdata$PropDamageMultiplier * stormdata$PROPDMG
stormdata$CropDamageDollars <- stormdata$CropDamageMultiplier * stormdata$CROPDMG

Extract year from BGN_DATE value

The “Year” is extracted from the BGN_DATE column in two steps:

  1. Convert the string to a date format and store it in a new columns BeginDate:
stormdata$BeginDate <- as.Date(stormdata$BGN_DATE, "%m/%d/%Y %H:%M:%S")
  1. Convert the date value (formatted as the year) back to a character string and convert that to a number.
stormdata$Year <- as.numeric(format(stormdata$BeginDate, "%Y"))

Results

Fatalities and Injuries

First we summarize the data by event categories:

stormdataByCategory <- group_by(stormdata, EventCategory)
fatalitiesByCategory <- summarize(stormdataByCategory, Fatalities = sum(FATALITIES))
injuriesByCategory <- summarize(stormdataByCategory, Injuries = sum(INJURIES))

Next we plot the number of fatalities and injuries by event category:

p1 <- ggplot(data = fatalitiesByCategory, aes(x=EventCategory, y=Fatalities))
p1 <- p1 + geom_bar(stat="identity", color = "dark red", fill = "pink")  + 
    theme(axis.text.x = element_text(angle = 60, hjust = 1),
          axis.title.x = element_blank()) + ylab("Fatalities")
p2 <- ggplot(data = injuriesByCategory, aes(x=EventCategory, y=Injuries))
p2 <- p2 + geom_bar(stat="identity", color = "dark red", fill = "pink")  + 
    theme(axis.text.x = element_text(angle = 60, hjust = 1),
          axis.title.x = element_blank()) + ylab("Injuries")
grid.arrange(p1, p2, ncol=2, bottom = "Fatalities and Injuries by Event Category")

Economic Cost

First we summarize the data by event categories:

stormdataByCategory <- group_by(stormdata, EventCategory)
propDamageByCategory <- summarize(stormdataByCategory, PropertyDamage = sum(PropDamageDollars)/1000000)
cropDamageByCategory <- summarize(stormdataByCategory, CropDamage = sum(CropDamageDollars)/1000000)

Next we plot the crop and property damage by event category:

p3 <- ggplot(data = cropDamageByCategory, aes(x=EventCategory, y=CropDamage))
p3 <- p3 + geom_bar(stat="identity", color = "dark green", fill = "light green")  + 
    theme(axis.text.x = element_text(angle = 60, hjust = 1),
          axis.title.x = element_blank()) + ylab("Crop Damage ($ millions)")
p4 <- ggplot(data = propDamageByCategory, aes(x=EventCategory, y=PropertyDamage))
p4 <- p4 + geom_bar(stat="identity", color = "dark green", fill = "light green")  + 
    theme(axis.text.x = element_text(angle = 60, hjust = 1),
          axis.title.x = element_blank()) + ylab("Property Damage ($ millions)")
grid.arrange(p3, p4, ncol=2, bottom = "Economic Cost by Event Category")

Conclusions

The plots above lead to the following conclusions: