The National Oceanic and Atmospheric Administration (NOAA) tracks information on the occurrence of storms and significant weather phenomena that cause loss of life, injuries, significant property damage and disruption to commerce. While rare, unusual weather phenomena generate media attention, less widely publicized phenomena, including meteorological events and changes in precipitation, also have consequences.
This data analysis attempts to answer the following two questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
This analysis shows that fatalities from adverse weather were highest where there was either excessive heat or a tornado. Similarly, the injury data suggest that the events most strongly associated with injuries are (i) excessive heat, (ii) flooding, (iii) ice storms, and (iv) tornadoes. The event types with the highest property damage were (i) flash flood, (ii) flood, (iii) high wind, (iv) landslide, (v) thunderstorm wind, and (vi) waterspout.
To process the data, I first read it into R as a data frame. I set the random seed to 1000 so that the analysis would give the same result each time it was run. Then, because the dataset is large, I randomly chose 500,000 records to analyze. To check the shape of the data, I ran a Shapiro-Wilk test for normality on INJURIES, FATALITIES and PROPDMG, using a separate random sample of 5,000 records, the largest sample that shapiro.test accepts. All three tests reject normality (p-value < 2.2e-16), which is consistent with heavily skewed distributions. Simple random sampling does not depend on normality, so the subset is still a fair representation of the full data and is unlikely to be deleterious to the analysis; the main risk is that some rare, extreme events fall outside the sample.
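The seeding and sampling step described above can be sketched on a small made-up data frame. The `toy` data and the names `first_draw` and `second_draw` are assumptions for illustration only; the point is that `set.seed()` has to be called as a function for the sample to be repeatable.

```r
# Toy stand-in for the storm data. The column names mirror the real file,
# but the values are made up for illustration.
toy <- data.frame(
  EVTYPE   = rep(c("TORNADO", "FLOOD", "HAIL", "EXCESSIVE HEAT"), each = 25),
  INJURIES = rep(c(3, 1, 0, 2), each = 25)
)

# set.seed() must be called as a function. Writing `set.seed = 1000` would
# only create a variable named set.seed and leave the generator unseeded.
set.seed(1000)
first_draw <- toy[sample(nrow(toy), 10), ]

# Re-seeding with the same value reproduces exactly the same rows.
set.seed(1000)
second_draw <- toy[sample(nrow(toy), 10), ]

identical(first_draw, second_draw)
```

If the two draws are identical, the seed is doing its job and every later subset and plot is built from the same records on each run.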
setwd("~/Documents/Reproducible Research")
repdata.data.StormData <- read.csv("~/Documents/Reproducible Research/repdata-data-StormData.csv", stringsAsFactors=FALSE)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
set.seed(1000) #seed the random number generator so the random sample is reproducible
data.random.sample <- sample_n(repdata.data.StormData, 500000, replace=FALSE)
#takes a random sample of 500,000 records. Simple random sampling does not require normality.
#To see how each variable is distributed, run a Shapiro-Wilk test for normality
#on a smaller sample (shapiro.test accepts at most 5000 values)
data.random.check.normal <- sample_n(repdata.data.StormData, 5000, replace=FALSE)
shapiro.test(data.random.check.normal$INJURIES)
##
## Shapiro-Wilk normality test
##
## data: data.random.check.normal$INJURIES
## W = 0.0384, p-value < 2.2e-16
shapiro.test(data.random.check.normal$FATALITIES)
##
## Shapiro-Wilk normality test
##
## data: data.random.check.normal$FATALITIES
## W = 0.0402, p-value < 2.2e-16
shapiro.test(data.random.check.normal$PROPDMG)
##
## Shapiro-Wilk normality test
##
## data: data.random.check.normal$PROPDMG
## W = 0.216, p-value < 2.2e-16
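As a rough guide to reading this output: a small p-value from `shapiro.test` means the hypothesis of normality is rejected, and a W statistic far below 1 indicates a large departure from a normal shape. The sketch below uses simulated data, not the storm data; the zero-heavy vector `skewed` is only an assumption about what a variable such as INJURIES might look like.

```r
set.seed(1)

# Hypothetical zero-heavy counts, loosely imitating a variable like INJURIES
skewed <- c(rep(0, 4900), rpois(100, lambda = 30))

# Genuinely normal data of the same size, for comparison
normalish <- rnorm(5000)

res_skewed <- shapiro.test(skewed)
res_normal <- shapiro.test(normalish)

# W is far below 1 for the skewed data and very close to 1 for the normal data
res_skewed$statistic
res_normal$statistic

# TRUE here means normality is rejected at the 5% level for the skewed data
res_skewed$p.value < 0.05
```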
#next we want to reduce the dimensionality of the data, so we keep only the necessary columns
#these are:
EVTYPE <- data.random.sample$EVTYPE
FATALITIES <- data.random.sample$FATALITIES
INJURIES <- data.random.sample$INJURIES
PROPDMG <- data.random.sample$PROPDMG
df <- data.frame(EVTYPE, FATALITIES, INJURIES, PROPDMG)
#look for events that cause more than 50 fatalities or more than 50 injuries
fatalityData <- (subset(df, FATALITIES > 50))[c(1,2)]
View(fatalityData) #now we have just the event type and fatalities for events with more than 50 fatalities
injuryData <- (subset(df, INJURIES > 50))[c(1,3)]
View(injuryData)
#A look at the data suggests that we can subset further.
#A threshold of more than 500 injuries is reasonable because it returns only the few event types with the highest counts.
injuryData.high <- (subset(df, INJURIES > 500))[c(1,3)]
p0 <- qplot(EVTYPE, INJURIES, data=injuryData.high)
p0 + labs(title="Top Events Causing > 500 Injuries", x="Event Type", y="Number of Injuries")
summary(injuryData.high)
## EVTYPE INJURIES
## TORNADO :8 Min. : 519.0
## FLOOD :2 1st Qu.: 587.8
## EXCESSIVE HEAT :1 Median : 750.0
## ICE STORM :1 Mean : 926.8
## HIGH SURF ADVISORY:0 3rd Qu.:1169.5
## COASTAL FLOOD :0 Max. :1700.0
## (Other) :0
#four event types exceed 500 injuries in this sample: Tornado, Flood, Excessive Heat
#and Ice Storm. The summary shows that Tornado accounts for 8 of the 12 events
fatalityData.high <- (subset(df, FATALITIES>50))[c(1,2)]
fatalityData.high
## EVTYPE FATALITIES
## 126328 TORNADO 158
## 139994 TORNADO 90
## 163636 EXCESSIVE HEAT 99
## 194571 TORNADO 114
## 306913 EXCESSIVE HEAT 74
## 313688 TORNADO 75
## 393551 EXTREME HEAT 57
summary(fatalityData.high)
## EVTYPE FATALITIES
## TORNADO :4 Min. : 57.00
## EXCESSIVE HEAT :2 1st Qu.: 74.50
## EXTREME HEAT :1 Median : 90.00
## HIGH SURF ADVISORY:0 Mean : 95.29
## COASTAL FLOOD :0 3rd Qu.:106.50
## FLASH FLOOD :0 Max. :158.00
## (Other) :0
p <- qplot(EVTYPE, FATALITIES, data=fatalityData.high)
p + labs(title="Top 3 Event Types Causing Fatalities", x="Event Type", y="Fatalities")
The types of events that are most harmful include Tornado, Excessive Heat, Flash Flood, Heat Wave, Extreme Heat, Fog and Ice Storm. Based on injuries and fatalities, these are by far the most harmful to human health.
The top 3 weather events causing fatalities are Excessive Heat, Extreme Heat and Tornado. The top 3 weather events causing injuries are Tornado, Flood and Ice Storm.
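A ranking like this can also be checked by summing fatalities and injuries per event type instead of filtering on a threshold. The sketch below is one possible way to do that in base R on made-up numbers; the `toy` data frame and the names `totals` and `top_fatal` are assumptions, and this is not the method used above.

```r
# Made-up records with the same column names as the analysis data frame
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "EXCESSIVE HEAT",
                 "FLOOD", "FLOOD", "ICE STORM"),
  FATALITIES = c(5, 3, 6, 1, 0, 2),
  INJURIES   = c(40, 25, 10, 8, 4, 12),
  stringsAsFactors = FALSE
)

# Sum both outcome columns within each event type
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Order by total fatalities, largest first, and keep the top three
top_fatal <- totals[order(-totals$FATALITIES), ]
head(top_fatal, 3)
```

Summing per event type counts many small events as well as a few large ones, so it can rank event types differently from a threshold such as FATALITIES > 50.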
The analysis below takes a look at the economic consequences of severe weather.
propertyData.high <- (subset(df, PROPDMG>1000))[c(1,4)]
propertyData.high
## EVTYPE PROPDMG
## 156260 FLASH FLOOD 3000
## 192228 LANDSLIDE 4800
## 196843 THUNDERSTORM WIND 5000
## 286999 THUNDERSTORM WIND 3500
## 293768 FLASH FLOOD 5000
summary(propertyData.high)
## EVTYPE PROPDMG
## FLASH FLOOD :2 Min. :3000
## THUNDERSTORM WIND :2 1st Qu.:3500
## LANDSLIDE :1 Median :4800
## HIGH SURF ADVISORY:0 Mean :4260
## COASTAL FLOOD :0 3rd Qu.:5000
## FLASH FLOOD :0 Max. :5000
## (Other) :0
p1 <- qplot(EVTYPE, PROPDMG, data=propertyData.high)
p1 + labs(title = "Property Damage by Event Type", x="Event Type", y="Property Damage")
The top causes of property damage are Thunderstorm Wind, Flash Flood, Flood and Landslide. Among the events with a recorded PROPDMG above 1000, the top 3 are Thunderstorm Wind, Flash Flood and Landslide.
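The PROPDMG values used above are raw magnitudes. In the NOAA file they are accompanied by a PROPDMGEXP column that gives the unit, which this analysis does not use. A minimal sketch of converting to dollars is shown below, assuming only the codes K (thousands), M (millions) and B (billions); the real column may contain other codes that would need separate handling, and the `toy` data and the names `multiplier` and `damage_usd` are hypothetical.

```r
# Made-up records; only the column names PROPDMG and PROPDMGEXP come from the data
toy <- data.frame(
  EVTYPE     = c("FLASH FLOOD", "TORNADO", "LANDSLIDE", "FLOOD"),
  PROPDMG    = c(5000, 2.5, 4800, 1.2),
  PROPDMGEXP = c("K", "M", "K", "B"),
  stringsAsFactors = FALSE
)

# Assumed mapping from exponent code to dollar multiplier
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up each row's multiplier by its code and scale the raw magnitude
toy$damage_usd <- toy$PROPDMG * unname(multiplier[toupper(toy$PROPDMGEXP)])

# Rank events by dollar damage, largest first
ranked <- toy[order(-toy$damage_usd), c("EVTYPE", "damage_usd")]
ranked
```

In this toy example the event with the smallest raw PROPDMG ranks first once the unit is applied, which is why a ranking on PROPDMG alone should be read with care.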