This analysis uses the Storm Data database from the U.S. National Oceanic and Atmospheric Administration (NOAA). The goal is to analyse the impact of different weather events on life and property. The analysis shows that events that are hard to predict or prevent, such as Tornado, Heat wave, Flash flood and Lightning, have the greatest impact on life. However, the average fatalities caused by these events appear to have declined over the years from 1950 to 2011. As far as property damage is concerned, the events with significant impact are high frequency or high impact events such as Hail, Tornado, Flood and Heavy wind. Unlike fatalities, the property damage caused by these events shows no observable trend over the years.
The ‘Storm Data’ file from the U.S. National Oceanic and Atmospheric Administration (NOAA) database is downloaded from the URL ‘https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2’ and read into R as follows:
# Load required Libraries
library(dplyr)
library(readr)
library(ggplot2)
if(!file.exists("StormData.csv.bz2")){
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",destfile = "StormData.csv.bz2")
}
data <- read.csv("StormData.csv.bz2")
A new column BGN_YEAR is created as follows to tag the event with year so that any trend in the impact of the weather events over the years can be analysed.
data <- mutate(data, BGN_POSIXLT = as.Date(BGN_DATE, "%m/%d/%Y %H:%M:%S"))
data <- mutate(data, BGN_YEAR = as.POSIXlt(BGN_POSIXLT)$year + 1900)
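As a quick check of the year extraction, the following minimal sketch applies the same two steps to a few hand-written date strings. The sample strings are assumptions that mimic the month/day/year plus time layout of BGN_DATE; they are not read from the file. It also shows an equivalent route through format(), which some readers may find easier to follow.

```r
# Hypothetical sample values in the same layout as BGN_DATE (not taken from the file)
sample_dates <- c("4/18/1950 0:00:00", "11/15/1995 0:00:00", "6/1/2011 0:00:00")

# as.Date() parses the date part; the trailing time text is ignored
parsed <- as.Date(sample_dates, format = "%m/%d/%Y")

# POSIXlt stores the year as an offset from 1900, hence the + 1900
years <- as.POSIXlt(parsed)$year + 1900

# Equivalent route: format the date as a four-digit year and convert to integer
years_alt <- as.integer(format(parsed, "%Y"))

print(years)
print(years_alt)
```

Both routes should give the same year for every row, so either can be used to build BGN_YEAR.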
The columns that are of significance for this analysis are EVTYPE, STATE, BGN_YEAR, FATALITIES, INJURIES, PROPDMG and CROPDMG.
A summary of FATALITIES, INJURIES, PROPDMG and CROPDMG is shown here. The third quartile of each of these columns is zero or close to it, which means that three fourths of the events have zero or negligible impact on life and property. These events are therefore not important for our analysis. To filter out this noise, four filtered data sets are created, one per column, by keeping only the records above the third quartile of that column.
knitr::kable(summary(select(data, FATALITIES, INJURIES, PROPDMG, CROPDMG)), caption = "Table 1. Summary of Significant Columns" )
| FATALITIES | INJURIES | PROPDMG | CROPDMG |
|---|---|---|---|
| Min. : 0.0000 | Min. : 0.0000 | Min. : 0.00 | Min. : 0.000 |
| 1st Qu.: 0.0000 | 1st Qu.: 0.0000 | 1st Qu.: 0.00 | 1st Qu.: 0.000 |
| Median : 0.0000 | Median : 0.0000 | Median : 0.00 | Median : 0.000 |
| Mean : 0.0168 | Mean : 0.1557 | Mean : 12.06 | Mean : 1.527 |
| 3rd Qu.: 0.0000 | 3rd Qu.: 0.0000 | 3rd Qu.: 0.50 | 3rd Qu.: 0.000 |
| Max. :583.0000 | Max. :1700.0000 | Max. :5000.00 | Max. :990.000 |
# Significant Fatal Events
SigFatal <- arrange(data, desc(FATALITIES)) %>% filter(FATALITIES > 0) %>% select(EVTYPE, STATE, BGN_YEAR, FATALITIES, INJURIES)
# Significant Injury Events
SigInjury <- arrange(data, desc(INJURIES)) %>% filter(INJURIES > 0) %>% select(EVTYPE, STATE, BGN_YEAR, FATALITIES, INJURIES)
# significant Property Damage Events
SigPropDmg <- arrange(data, desc(PROPDMG)) %>% filter(PROPDMG > 0.50) %>% select(EVTYPE, STATE, BGN_YEAR, PROPDMG, CROPDMG)
# Significant Crop Damage Events
SigCropDmg <- arrange(data, desc(CROPDMG)) %>% filter(CROPDMG > 0) %>% select(EVTYPE, STATE, BGN_YEAR, PROPDMG, CROPDMG)
These 4 ‘significant events’ datasets are grouped by Event Type and by Year, and summarised to get average values for FATALITIES, INJURIES, PROPDMG and CROPDMG along with EventCount. The EventCount matters because the average values skew towards rare but high impact events compared to frequent low impact events.
#Grouping By Event Type
# Significant fatal events grouped by Event Type
SigFatByEvType <- group_by(SigFatal, EVTYPE) %>% summarize(avgfat = mean(FATALITIES), avginj = mean(INJURIES), EventCount = n())
#Significant Property Damage events grouped by Event Type
SigPropDmgByEvType <- group_by(SigPropDmg, EVTYPE) %>% summarize(avgpropdmg = mean(PROPDMG), avgcropdmg = mean(CROPDMG), EventCount = n())
#Grouping By Year
SigFatalByYear <- group_by(SigFatal, BGN_YEAR) %>% summarize(avgfat = mean(FATALITIES), avginj = mean(INJURIES), EventCount = n())
SigInjByYear <- group_by(SigInjury, BGN_YEAR) %>% summarize(avgfat = mean(FATALITIES), avginj = mean(INJURIES), EventCount = n())
par(mfrow = c(1,2))
with(SigFatalByYear, plot(BGN_YEAR, avgfat, type = "l", xlab = "YEAR", ylab = "Average Fatalities", main = "Fatalities by year"))
with(SigInjByYear, plot(BGN_YEAR, avginj, type = "l", xlab = "YEAR", ylab = "Average Injuries", main = "Injuries By Year" ))
Figure 1
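To illustrate why EventCount is kept next to the averages, here is a minimal sketch on made-up data. The event names and numbers are hypothetical, and base R is used instead of dplyr only to keep the example self-contained. One rare, severe event type and one frequent, mild event type end up with very different averages but the same total.

```r
# Hypothetical toy data: one rare, severe event type and one frequent, mild one
toy <- data.frame(
  EVTYPE     = c("RARE EVENT", rep("COMMON EVENT", 50)),
  FATALITIES = c(100, rep(2, 50))
)

# Average fatalities and number of events per event type
avgfat     <- tapply(toy$FATALITIES, toy$EVTYPE, mean)
EventCount <- tapply(toy$FATALITIES, toy$EVTYPE, length)

# Total fatalities recovered as average times count
summary_tbl <- data.frame(
  EVTYPE     = names(avgfat),
  avgfat     = as.numeric(avgfat),
  EventCount = as.integer(EventCount),
  totalfat   = as.numeric(avgfat * EventCount)
)

print(summary_tbl)
```

Ranking by avgfat alone would put the rare event far ahead, while the totals are equal; multiplying the average by EventCount brings frequency back into the comparison.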
2. Based on the plot below (Figure 2), it is apparent that Tornado, Heat wave, Flash Flood and Lightning are the top fatality-causing events across the US over the years. These are also the most frequent events.
# add a new column to calculate the total fatalities by events
SigFatByEvType = mutate(SigFatByEvType, totalfat = round(avgfat * EventCount,0))
#get the Top Quartile of the total fatalities
SigFatByEvTypeTopQ <- arrange(SigFatByEvType, desc(totalfat)) %>% filter(totalfat > 19.75)
#plot the Significant Fatal Events
with(SigFatByEvTypeTopQ, ggplot(data = SigFatByEvTypeTopQ, aes(x = totalfat, y=EVTYPE)) + geom_point(size = 3) + scale_y_discrete(limits = EVTYPE)) + labs(x = "Total Fatalities", y = "Event Type") + ggtitle("Significant Fatal Weather Events")
Figure 2
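The cut-off of 19.75 used for the top quartile is hard-coded. Assuming it was read from a summary of totalfat, the same kind of threshold could be computed with quantile() so that it follows the data if the input changes. The sketch below uses a made-up vector of totals as a stand-in for the totalfat column; the numbers are hypothetical.

```r
# Hypothetical totals per event type, standing in for the totalfat column
totalfat <- c(1, 2, 3, 5, 8, 20, 40, 120)

# 75th percentile (third quartile); values above it form the top quartile
threshold <- quantile(totalfat, probs = 0.75, names = FALSE)

# Keep only the totals that exceed the threshold
top_quartile <- totalfat[totalfat > threshold]

print(threshold)
print(top_quartile)
```

In the analysis itself, the hard-coded number in filter() could be replaced by such a computed threshold, for both the fatality and the property damage totals.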
# add a new column to calculate the total property damages by events
SigPropDmgByEvType = mutate(SigPropDmgByEvType, totalpropdmg = round(avgpropdmg * EventCount,0))
#get the Top Quartile of the total property damages
SigPropDmgByEvTypeTopQ <- arrange(SigPropDmgByEvType, desc(totalpropdmg)) %>% filter(totalpropdmg > 764.5)
#plot the Significant property damage Events
with(SigPropDmgByEvTypeTopQ, ggplot(data = SigPropDmgByEvTypeTopQ, aes(x = totalpropdmg, y=EVTYPE)) + geom_point(size = 3) + scale_y_discrete(limits = EVTYPE)) + labs(x = "Total Property Damage (Millions)", y = "Event Type") + ggtitle("Significant Property Damage Events")
Figure 3