This report contains an analysis of the damage caused by severe weather events based on the data in the National Oceanic and Atmospheric Administration’s storm database. Specifically we determine the kind of weather events that have the greatest impact in terms of fatalities & injuries and economic cost (property and crop damage).
The analysis consists of four main steps:
The NOAA storm database is available in the form of a compressed CSV (.csv.bz2) file. This file is downloaded and the data is read into a data frame directly from the compressed file using the read.csv function.
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "stormdata.bz2")
stormdata <- read.csv("stormdata.bz2", header = TRUE)
The columns in the CSV file are listed below:
colnames(stormdata)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
The columns that are relevant to the current analysis are BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP. The str and summary functions can be used to get an overview of the type of data in these columns:
cols <- which(colnames(stormdata) %in% c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"))
str(stormdata[,cols])
## 'data.frame': 902297 obs. of 8 variables:
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(stormdata[,cols])
## BGN_DATE EVTYPE FATALITIES
## 5/25/2011 0:00:00: 1202 HAIL :288661 Min. : 0.0000
## 4/27/2011 0:00:00: 1193 TSTM WIND :219940 1st Qu.: 0.0000
## 6/9/2011 0:00:00 : 1030 THUNDERSTORM WIND: 82563 Median : 0.0000
## 5/30/2004 0:00:00: 1016 TORNADO : 60652 Mean : 0.0168
## 4/4/2011 0:00:00 : 1009 FLASH FLOOD : 54277 3rd Qu.: 0.0000
## 4/2/2006 0:00:00 : 981 FLOOD : 25326 Max. :583.0000
## (Other) :895866 (Other) :170878
## INJURIES PROPDMG PROPDMGEXP CROPDMG
## Min. : 0.0000 Min. : 0.00 :465934 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 0.00 K :424665 1st Qu.: 0.000
## Median : 0.0000 Median : 0.00 M : 11330 Median : 0.000
## Mean : 0.1557 Mean : 12.06 0 : 216 Mean : 1.527
## 3rd Qu.: 0.0000 3rd Qu.: 0.50 B : 40 3rd Qu.: 0.000
## Max. :1700.0000 Max. :5000.00 5 : 28 Max. :990.000
## (Other): 84
## CROPDMGEXP
## :618413
## K :281832
## M : 1994
## k : 21
## 0 : 19
## B : 9
## (Other): 9
This information reveals the following issues with the data:
The event type classification needs to be reviewed - several EVTYPE values refer to the same or similar weather events - e.g. “THUNDERSTORM WIND” and “TSTM WIND”. Hurricane events are represented by 10 different EVTYPE values:
hurricane_tags <- stormdata[which(str_to_lower(substr(stormdata$EVTYPE,1,9)) == "hurricane"), "EVTYPE"]
summary(hurricane_tags)[summary(hurricane_tags) > 0]
## HURRICANE HURRICANE/TYPHOON
## 174 88
## HURRICANE OPAL HURRICANE ERIN
## 9 7
## HURRICANE-GENERATED SWELLS Hurricane Edouard
## 3 2
## HURRICANE FELIX HURRICANE EMILY
## 2 1
## HURRICANE GORDON HURRICANE OPAL/HIGH WINDS
## 1 1
Similarly there are 99 different EVTYPE values for flood events:
flood_tags <- stormdata[which(str_detect(str_to_lower(stormdata$EVTYPE), "flood")), "EVTYPE"]
head(summary(flood_tags)[summary(flood_tags) > 0])
## FLASH FLOOD FLOOD FLASH FLOODING COASTAL FLOOD
## 54277 25326 682 650
## FLOOD/FLASH FLOOD URBAN FLOOD
## 624 249
tail(summary(flood_tags)[summary(flood_tags) > 0])
## LANDSLIDE/URBAN FLOOD LOCAL FLASH FLOOD LOCAL FLOOD
## 1 1 1
## MINOR FLOOD Minor Flooding (Other)
## 1 1 13
In order to get an accurate picture of the impact of events we need to gather different varieties of the same event type into broader categories.
The property and crop damage data needs to be converted to dollar amounts based on the information in the PROPDMG/PROPDMGEXP and CROPDMG/CROPDMGEXP column pairs.
The data in the BGN_DATE needs to be converted to a date value so that we can check the changes in frequency and impact of severe weather events as a function of time.
There are 985 distinct EVTYPE string values in the storm database. As mentioned above there are many duplicates among these values and it is necessary to classify these in broader categories. For the current analysis we adopted the following 12 categories:
In cases where event types could not be assigned to one of these categories they were assigned a category of “Miscellaneous”. The bucketing of different EVTYPE values into categories was done manually and the mapping is available on GitHub in a CSV file EventTypeMapping.csv.
NOTE: The category mapping can be easily checked and modified - the only thing that is needed to replace the “mapping.csv” file in the working directory.
download.file("https://raw.githubusercontent.com/HarshBaxi/ReproducibleResearch/master/EventTypeMapping.csv", destfile = "mapping.csv")
mapping <- read.csv("mapping.csv")
stormdata$EventCategory <- mapping[match(stormdata$EVTYPE, mapping$EVTYPE), 2]
summary(stormdata$EventCategory)
## Floods Heavy Rains
## 86106 11983
## High Seas/Extreme Tides/Surf High Winds
## 2562 26306
## Hurricanes Lightning
## 301 15770
## Miscellaneous Severe Cold
## 4546 295712
## Severe Heat Snow/Winter Storm/Blizzard
## 5579 40085
## Thunderstorm Tornadoes/Waterspouts
## 337507 71528
## Wildfires NA's
## 4237 75
There are 75 rows in the storm database that have no value in the EventCategory column. These rows have EVTYPE values that indicate that the records represent summary reports of weather events rather than actual events and were deliberatly left out of the category mapping:
head(stormdata[which(is.na(stormdata$EventCategory)), "EVTYPE"])
## [1] Summary Jan 17 Summary of March 14 Summary of March 23
## [4] Summary of March 24 Summary of April 3rd Summary of April 12
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
We will exclude these from our analysis:
stormdata <- stormdata[!is.na(stormdata$EventCategory), ]
The dollar value for crop and property damage is derived by combining the data in the PROPDMG/PROPDMGEXP and CROPDMG/CROPDMGEXP columns. The “DMGEXP” columns contain a letter code that gives the multiplier for the number in the “DMG” column. Valid letter codes are:
Checking the values in the “DMGEXP” columns it is clear that there are some invalid codes for rows where there is a non-zero number in the “DMG” columns:
summary(stormdata[which(stormdata$CROPDMG > 0),]$CROPDMGEXP)
## ? 0 2 B k K m M
## 3 0 12 0 7 21 20137 1 1918
# Percentage of rows with non-zero crop damage values and valid codes
100 * nrow(stormdata[which(stormdata$CROPDMG > 0 & stormdata$CROPDMGEXP %in% c("B", "M", "K")),])/nrow(stormdata[which(stormdata$CROPDMG > 0), ])
## [1] 99.83257
summary(stormdata[which(stormdata$PROPDMG > 0),]$PROPDMGEXP)
## - ? + 0 1 2 3 4 5
## 76 1 0 5 209 0 1 1 4 18
## 6 7 8 B h H K m M
## 3 2 0 40 1 6 227481 7 11319
# Percentage of rows with non-zero crop damage values and valid codes
100 * nrow(stormdata[which(stormdata$PROPDMG > 0 & stormdata$PROPDMGEXP %in% c("B", "M", "K")),])/nrow(stormdata[which(stormdata$PROPDMG > 0), ])
## [1] 99.86035
The percentage of rows with invalid exponent codes is very small (< 0.2%) and the dollar value for these will be set to zero.
multipliers <- data.frame(code = c("K", "M", "B"), multiplier = c(1e+03, 1e+06, 1e+09))
stormdata$PropDamageMultiplier <- multipliers[match(stormdata$PROPDMGEXP, multipliers$code), 2]
stormdata$CropDamageMultiplier <- multipliers[match(stormdata$CROPDMGEXP, multipliers$code), 2]
stormdata[which(is.na(stormdata$CropDamageMultiplier)), ]$CropDamageMultiplier <- 0
stormdata[which(is.na(stormdata$PropDamageMultiplier)), ]$PropDamageMultiplier <- 0
stormdata$PropDamageDollars <- stormdata$PropDamageMultiplier * stormdata$PROPDMG
stormdata$CropDamageDollars <- stormdata$CropDamageMultiplier * stormdata$CROPDMG
BGN_DATE valueThe “Year” is extracted from the BGN_DATE column in two steps:
BeginDate:stormdata$BeginDate <- as.Date(stormdata$BGN_DATE, "%m/%d/%Y %H:%M:%S")
stormdata$Year <- as.numeric(format(stormdata$BeginDate, "%Y"))
First we summarize the data by event categories:
stormdataByCategory <- group_by(stormdata, EventCategory)
fatalitiesByCategory <- summarize(stormdataByCategory, Fatalities = sum(FATALITIES))
injuriesByCategory <- summarize(stormdataByCategory, Injuries = sum(INJURIES))
Next we plot the number of fatalities and injuries by event category:
p1 <- ggplot(data = fatalitiesByCategory, aes(x=EventCategory, y=Fatalities))
p1 <- p1 + geom_bar(stat="identity", color = "dark red", fill = "pink") +
theme(axis.text.x = element_text(angle = 60, hjust = 1),
axis.title.x = element_blank()) + ylab("Fatalities")
p2 <- ggplot(data = injuriesByCategory, aes(x=EventCategory, y=Injuries))
p2 <- p2 + geom_bar(stat="identity", color = "dark red", fill = "pink") +
theme(axis.text.x = element_text(angle = 60, hjust = 1),
axis.title.x = element_blank()) + ylab("Injuries")
grid.arrange(p1, p2, ncol=2, bottom = "Fatalities and Injuries by Event Category")
First we summarize the data by event categories:
stormdataByCategory <- group_by(stormdata, EventCategory)
propDamageByCategory <- summarize(stormdataByCategory, PropertyDamage = sum(PropDamageDollars)/1000000)
cropDamageByCategory <- summarize(stormdataByCategory, CropDamage = sum(CropDamageDollars)/1000000)
Next we plot the crop and property damage by event category:
p3 <- ggplot(data = cropDamageByCategory, aes(x=EventCategory, y=CropDamage))
p3 <- p3 + geom_bar(stat="identity", color = "dark green", fill = "light green") +
theme(axis.text.x = element_text(angle = 60, hjust = 1),
axis.title.x = element_blank()) + ylab("Crop Damage ($ millions)")
p4 <- ggplot(data = propDamageByCategory, aes(x=EventCategory, y=PropertyDamage))
p4 <- p4 + geom_bar(stat="identity", color = "dark green", fill = "light green") +
theme(axis.text.x = element_text(angle = 60, hjust = 1),
axis.title.x = element_blank()) + ylab("Property Damage ($ millions)")
grid.arrange(p3, p4, ncol=2, bottom = "Economic Cost by Event Category")
In order to check the change in the impact of severe weather events we consider the data for 1992 thru’ 2011. Given that the techniques for reporting and assessing the impact of weather events have improved substantially since the 1950s it seems advisable to take a more recent subset from the NOAA database.
recentStormdata <- stormdata[stormdata$Year > 1991,]
We group this data subset by Year and take the total impact on life and property:
recentStormdataByYear <- group_by(recentStormdata, Year)
injuriesByYear <- summarize(recentStormdataByYear, Injuries = sum(INJURIES))
fatalitiesByYear <- summarize(recentStormdataByYear, Fatalaties = sum(FATALITIES))
cropDamageByYear <- summarize(recentStormdataByYear, CropDamage = sum(CropDamageDollars)/1000000)
propertyDamageByYear <- summarize(recentStormdataByYear, PropertyDamage = sum(PropDamageDollars)/1000000)
And now we plot the trend for each of the four measures of interest:
p1 <- ggplot(data = injuriesByYear, aes(x=Year, y=Injuries))
p2 <- ggplot(data = fatalitiesByYear, aes(x=Year, y=Fatalaties))
p3 <- ggplot(data = cropDamageByYear, aes(x=Year, y=CropDamage))
p4 <- ggplot(data = propertyDamageByYear, aes(x=Year, y=PropertyDamage))
p1 <- p1 + geom_line(color = "pink") +
geom_point(color = "red") +
scale_x_discrete(limit = c(seq(from=1992, to=2011, by=4)))
p2 <- p2 + geom_line(color = "pink") +
geom_point(color = "red") +
scale_x_discrete(limit = c(seq(from=1992, to=2011, by=4)))
p3 <- p3 + geom_line(color = "light green") +
geom_point(color = "dark green") + ylab("Crop Damage ($ millions)") +
scale_x_discrete(limit = c(seq(from=1992, to=2011, by=4)))
p4 <- p4 + geom_line(color = "light green") +
geom_point(color = "dark green") + ylab("Property Damage ($ millions)") +
scale_x_discrete(limit = c(seq(from=1992, to=2011, by=4)))
grid.arrange(p1, p2, p3, p4, nrow = 2, ncol=2,
bottom = "Impact of Severe Weather Events By Year")
The plots above lead to the following conclusions:
The worst impact in terms of fatalities AND injuries is due to tornadoes and tornado related events.
Floods inflict the worst economic damage. If one looks at crop damage alone then floods are a close second to Severe Heat/Drought events. However the cost of damaged property is an order of magnitude greater than crop damage.
There is no discernible trend in the impact of weather events for the 20 years in the data subset. There are spikes in the economic cost of weather events associated with large hurricane events (Hurricane Katrina, Hurricane Ike).