Synopsis

This analysis uses the StormData database from the U.S. National Oceanic and Atmospheric Administration (NOAA). The goal is to analyse the impact of different weather events on life and property. The analysis shows that events which are hard to predict or prevent, such as Tornado, Heat wave, Flash flood and Lightning, have the most impact on life. However, the average fatalities caused by these events appear to decline over the years from 1950 to 2011. As far as property damage is concerned, the events with significant impact are high-frequency, high-impact events such as Hail, Tornado, Flood and Heavy wind. Unlike fatalities, there is no observable trend in the property damage caused by these events over the years.

Preprocessing

Loading the data

The ‘Storm Data’ file from the U.S. National Oceanic and Atmospheric Administration (NOAA) database is downloaded from ‘https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2’ and read into R as follows:

# Load required Libraries
library(dplyr)
library(readr)
library(ggplot2)

if(!file.exists("StormData.csv.bz2")){
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",destfile = "StormData.csv.bz2")
}

data <- read.csv("StormData.csv.bz2")

Preprocessing the Data

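A new column BGN_YEAR is created as follows to tag each event with its year, so that any trend in the impact of weather events over the years can be analysed.

# Parse the begin date; despite its name, BGN_POSIXLT holds a Date (the time of day is dropped)
data <- mutate(data, BGN_POSIXLT = as.Date(BGN_DATE, "%m/%d/%Y %H:%M:%S"))
# POSIXlt counts years from 1900, so add 1900 to get the calendar year
data <- mutate(data, BGN_YEAR = as.POSIXlt(BGN_POSIXLT)$year + 1900)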
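
The year extraction above can be checked on a single value. The sketch below is a minimal illustration in base R; the sample timestamp is an assumption chosen to match the format string, not a row quoted from the dataset.

```r
# Hypothetical sample value in the "%m/%d/%Y %H:%M:%S" layout assumed above
sample_date <- "4/18/1950 0:00:00"

# as.Date() keeps only the calendar date; the time of day is dropped
parsed <- as.Date(sample_date, format = "%m/%d/%Y %H:%M:%S")

# POSIXlt stores the year as an offset from 1900, hence the + 1900
year_from_lt <- as.POSIXlt(parsed)$year + 1900

# An equivalent route that avoids the offset arithmetic
year_from_format <- as.integer(format(parsed, "%Y"))

print(year_from_lt)
print(year_from_format)
```

Both routes should give the same year; the second may be easier to read when the 1900 offset is not obvious.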

Filter Noise

The columns that are of significance for this analysis are

  • EVTYPE - Event Type (Type of Event)
  • FATALITIES - Number of Fatalities
  • INJURIES - Number of Injuries
  • PROPDMG - Property Damages by the event
  • CROPDMG - Crop Damages by the event
  • BGN_YEAR - Year of the event

Table 1 summarises FATALITIES, INJURIES, PROPDMG and CROPDMG. The third quartile of each column is zero or close to it (0.50 for PROPDMG), so about three fourths of the events have zero or negligible impact on life and property and are not important for this analysis. To filter out this noise, four datasets are created, one per column, each keeping only the events above that column's third quartile.

knitr::kable(summary(select(data, FATALITIES, INJURIES, PROPDMG, CROPDMG)), caption = "Table 1. Summary of Significant Columns" )
Table 1. Summary of Significant Columns

Statistic  FATALITIES   INJURIES  PROPDMG  CROPDMG
Min.           0.0000     0.0000     0.00    0.000
1st Qu.        0.0000     0.0000     0.00    0.000
Median         0.0000     0.0000     0.00    0.000
Mean           0.0168     0.1557    12.06    1.527
3rd Qu.        0.0000     0.0000     0.50    0.000
Max.         583.0000  1700.0000  5000.00  990.000
# Significant Fatal Events
SigFatal <- arrange(data, desc(FATALITIES)) %>% filter(FATALITIES > 0) %>% select(EVTYPE, STATE, BGN_YEAR, FATALITIES, INJURIES)

# Significant Injury Events
SigInjury <- arrange(data, desc(INJURIES)) %>% filter(INJURIES > 0) %>% select(EVTYPE, STATE, BGN_YEAR, FATALITIES, INJURIES)

# significant Property Damage Events
SigPropDmg <- arrange(data, desc(PROPDMG)) %>% filter(PROPDMG > 0.50) %>% select(EVTYPE, STATE, BGN_YEAR, PROPDMG, CROPDMG)

# Significant Crop Damage Events
SigCropDmg <- arrange(data, desc(CROPDMG)) %>% filter(CROPDMG > 0) %>% select(EVTYPE, STATE, BGN_YEAR, PROPDMG, CROPDMG)
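
The cut-offs used in the filters above (0 and 0.50) are the third-quartile values from Table 1, typed in by hand. As a hedged alternative, the threshold could be computed with quantile() so that it follows the data. The sketch below uses a made-up vector, not values from the dataset.

```r
# Toy damage values, made up for illustration only
propdmg <- c(0, 0, 0, 0, 0.5, 1, 2.5, 25, 50, 500)

# 75th percentile (third quartile) of the column
q3 <- quantile(propdmg, probs = 0.75, names = FALSE)

# Keep only observations strictly above the third quartile
top_quartile <- propdmg[propdmg > q3]

print(q3)
print(top_quartile)
```

With dplyr the same idea would read filter(data, PROPDMG > quantile(PROPDMG, 0.75)), assuming the column has no missing values.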

Grouping of Data

These four ‘significant events’ datasets are grouped by Event Type and by Year and summarised to get average values of FATALITIES, INJURIES, PROPDMG and CROPDMG, along with EventCount. EventCount matters because the averages skew towards rare but high-impact events rather than frequent low-impact ones.

#Grouping By Event Type
# Significant fatal events grouped by Event Type
SigFatByEvType <- group_by(SigFatal, EVTYPE) %>% summarize(avgfat = mean(FATALITIES), avginj = mean(INJURIES), EventCount = n())

#Significant Property Damage events grouped by Event Type
SigPropDmgByEvType <- group_by(SigPropDmg, EVTYPE) %>% summarize(avgpropdmg = mean(PROPDMG), avgcropdmg = mean(CROPDMG), EventCount = n())

#Grouping By Year
SigFatalByYear <- group_by(SigFatal, BGN_YEAR) %>% summarize(avgfat = mean(FATALITIES), avginj = mean(INJURIES), EventCount = n())

SigInjByYear <- group_by(SigInjury, BGN_YEAR) %>% summarize(avgfat = mean(FATALITIES), avginj = mean(INJURIES), EventCount = n())
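
To illustrate why EventCount is kept next to the averages, the sketch below compares a rare, severe event type with a frequent, mild one. The data frame is made up for illustration, and base R is used so that it runs without extra packages.

```r
# Made-up events: one rare, deadly event type and one frequent, mild one
events <- data.frame(
  EVTYPE     = c("RARE EVENT", rep("COMMON EVENT", 30)),
  FATALITIES = c(20, rep(1, 30))
)

# Per-type mean, count and total
avgfat     <- tapply(events$FATALITIES, events$EVTYPE, mean)
EventCount <- tapply(events$FATALITIES, events$EVTYPE, length)
totalfat   <- tapply(events$FATALITIES, events$EVTYPE, sum)

summary_tbl <- data.frame(
  EVTYPE     = names(avgfat),
  avgfat     = as.vector(avgfat),
  EventCount = as.vector(EventCount),
  totalfat   = as.vector(totalfat)
)

print(summary_tbl)
```

Ranked by average, the rare event type comes first; ranked by total, the common one does. This is why the event count (or the total) is needed to read the averages correctly.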

Results

  1. The average fatalities and injuries per significant event decline over the years from 1950 to 2011 (Figure 1). This is possibly due to better prediction and follow-up action, aided by better technology and data modeling. More such efforts should be made to reduce the cost to life and limb, especially for weather events like Tornado, which still caused significant loss of life as late as 2011.
par(mfrow = c(1,2))
with(SigFatalByYear, plot(BGN_YEAR, avgfat, type = "l", xlab = "YEAR", ylab = "Average Fatalities", main = "Fatalities by year"))
with(SigInjByYear, plot(BGN_YEAR, avginj, type = "l", xlab = "YEAR", ylab = "Average Injuries", main = "Injuries By Year" ))

Figure 1

  2. Based on the plot below (Figure 2), Tornado, Heat wave, Flash Flood and Lightning are the top fatality-causing events across the US over the years. These are also the most frequent events.

# add a new column to calculate the total fatalities by events
SigFatByEvType = mutate(SigFatByEvType, totalfat = round(avgfat * EventCount,0))

#get the Top Quartile of the total fatalities
SigFatByEvTypeTopQ <- arrange(SigFatByEvType, desc(totalfat)) %>% filter(totalfat > 19.75)

#plot the Significant Fatal Events
ggplot(SigFatByEvTypeTopQ, aes(x = totalfat, y = EVTYPE)) +
  geom_point(size = 3) +
  scale_y_discrete(limits = SigFatByEvTypeTopQ$EVTYPE) +
  labs(x = "Total Fatalities", y = "Event Type") +
  ggtitle("Significant Fatal Weather Events")

Figure 2

  3. Based on the plot below (Figure 3), Hail, Flash Flood, Tornado and wind storms are the top property-damage-causing events. It is surprising to see Hail at the top, but in retrospect it makes sense: hail is a common event that can cause a lot of property damage without being very life threatening.
# add a new column to calculate the total property damages by events
SigPropDmgByEvType <- mutate(SigPropDmgByEvType, totalpropdmg = round(avgpropdmg * EventCount, 0))

#get the Top Quartile of the total property damages
SigPropDmgByEvTypeTopQ <- arrange(SigPropDmgByEvType, desc(totalpropdmg)) %>% filter(totalpropdmg > 764.5)

#plot the Significant property damage Events
ggplot(SigPropDmgByEvTypeTopQ, aes(x = totalpropdmg, y = EVTYPE)) +
  geom_point(size = 3) +
  scale_y_discrete(limits = SigPropDmgByEvTypeTopQ$EVTYPE) +
  labs(x = "Total Property Damage (Millions)", y = "Event Type") +
  ggtitle("Significant Property Damage Events")

Figure 3
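
The trend statements in the first result are read off the plots by eye. One possible numeric check, sketched here under stated assumptions, is to fit a straight line to the yearly averages and look at the sign of the slope. The yearly values below are made up for illustration and are not results from the dataset; in the analysis, the same call would be applied to a table such as SigFatalByYear.

```r
# Made-up yearly averages, for illustration only
yearly <- data.frame(
  BGN_YEAR = 2000:2009,
  avgfat   = c(3.0, 2.8, 2.9, 2.5, 2.6, 2.2, 2.3, 2.0, 1.9, 1.8)
)

# Ordinary least-squares fit of the average against the year
fit <- lm(avgfat ~ BGN_YEAR, data = yearly)

# A negative slope indicates a downward trend over the period
slope <- unname(coef(fit)["BGN_YEAR"])

print(slope)
```

A linear fit is only a rough summary. Yearly averages driven by a few extreme events can give a misleading slope, so the plot should still be inspected.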