Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to identify:
The results suggests that excessive heat and tornados has greatest impact to public health, while floods and droughts have the greatest economic impact.
setwd("~/Data Science/Module 5 Reproducible Research/Assignment 2/")
library(R.utils) # use R.utils to unzip .bz2 files
## Warning: package 'R.utils' was built under R version 3.2.5
## Loading required package: R.oo
## Warning: package 'R.oo' was built under R version 3.2.3
## Loading required package: R.methodsS3
## Warning: package 'R.methodsS3' was built under R version 3.2.3
## R.methodsS3 v1.7.1 (2016-02-15) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.20.0 (2016-02-17) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
## The following objects are masked from 'package:base':
##
## attach, detach, gc, load, save
## R.utils v2.3.0 (2016-04-13) successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
## The following object is masked from 'package:utils':
##
## timestamp
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, inherits, isOpen, parse, warnings
library(ggplot2) # use ggplot for our data plots
## Warning: package 'ggplot2' was built under R version 3.2.4
library(gridExtra) # enable us to plot 2 charts side by side
## Warning: package 'gridExtra' was built under R version 3.2.5
Check of data file already exists in working directory. Download and unzip if file does not exist.
if(!file.exists("StormData.csv.bz2")) {
temp <- tempfile()
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",temp)
datafile <- bunzip2(temp,"StormData.csv", remove = FALSE, skip = TRUE)
unlink(temp)
}
StormData <- read.csv(datafile, header=T, sep=",")
2.1.1 This purpose of this exercise is to gain an overview of the variables captured in the dataset, the variable types, as well as, the number of records captured (ie. size of dataset).
head(StormData, n=1)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14 100 3 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
str(StormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The dataset consists of 902297 records of 37 variables, with the first recorded event in 1950.
Given the scope of our project, we will only require the following variables to determine which weather event has greatest impact to public health and the economy.
2.1.2 Next we examine the dataset to determine if there are any problems in the dataset that may potentially impact our analysis. We will use the total number of severe weather events recorded per year as an indication of the quality of the dataset.
We can derive the year of the event from the “BGN_DATE” field in the NOAA dataset.
# Extract year from BGN_DATE field.
if (dim(StormData)[2] == 37) {
StormData$year <- as.numeric(format(as.Date(StormData$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
}
# Summarise number of events recorded by year
ggplot(data=StormData, aes(StormData$year)) +
geom_histogram(breaks=seq(1950, 2012, by = 1),
col="orange",
fill="orange",
alpha = .2) +
labs(title="Total Number of Severe Weather Events Per Year") +
labs(x="Year", y="No. Events")
From the histogram, the number of recorded events increased significantly from 1995. This would suggest either (a) poor data quality collected prior to 1995 or (b) fewer occurance of severe weather events prior to 1995.
We will assume scenario (b) is true and consider the entire dataset for our analysis.
2.2.1 Create subset of NOAA data containing only the variables identified in section 2.1.1.
As we will not require all the variables in the original dataset for our analysis, we will create a subset of the NOAA data with the relevant variables identified eariler. A smaller dataset is always easier to run and manage.
Projvariables<-c("EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG", "CROPDMGEXP")
StormData2<- StormData[Projvariables]
str(StormData2) # Check data subset
## 'data.frame': 902297 obs. of 7 variables:
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
unique(StormData2$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(StormData2$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
2.2.2 Format economic impact variable
Per NOAA code book, the variables “PROPDMGEXP” and “CROPDMGEXP” represent the use of characters to signify the magnitude of the estimated damage captured in the variables “PROPDMG” and “CROPDMG”.
unique(StormData2$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(StormData2$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
We will need to apply these factors to “PROPDMG” and “CROPDMG” in order to arrive at the the full dollar economic impact of the corresponding weather event.
2.2.2.1 Compute the full dollar impact of property damage
# Assign the multiplier to be used for each character
StormData2$PROPVAL[StormData2$PROPDMGEXP == "K" ] <- 1000
StormData2$PROPVAL[StormData2$PROPDMGEXP == "M" ] <- 10^6
StormData2$PROPVAL[StormData2$PROPDMGEXP == "" ] <- 1
StormData2$PROPVAL[StormData2$PROPDMGEXP == "B" ] <- 10^9
StormData2$PROPVAL[StormData2$PROPDMGEXP == "m" ] <- 10^6
StormData2$PROPVAL[StormData2$PROPDMGEXP == "+" ] <- 0
StormData2$PROPVAL[StormData2$PROPDMGEXP == "0" ] <- 1
StormData2$PROPVAL[StormData2$PROPDMGEXP == "5" ] <- 10^5
StormData2$PROPVAL[StormData2$PROPDMGEXP == "6" ] <- 10^6
StormData2$PROPVAL[StormData2$PROPDMGEXP == "?" ] <- 0
StormData2$PROPVAL[StormData2$PROPDMGEXP == "4" ] <- 10000
StormData2$PROPVAL[StormData2$PROPDMGEXP == "2" ] <- 100
StormData2$PROPVAL[StormData2$PROPDMGEXP == "3" ] <- 1000
StormData2$PROPVAL[StormData2$PROPDMGEXP == "h" ] <- 100
StormData2$PROPVAL[StormData2$PROPDMGEXP == "7" ] <- 10^7
StormData2$PROPVAL[StormData2$PROPDMGEXP == "H" ] <- 100
StormData2$PROPVAL[StormData2$PROPDMGEXP == "-" ] <- 0
StormData2$PROPVAL[StormData2$PROPDMGEXP == "1" ] <- 10
StormData2$PROPVAL[StormData2$PROPDMGEXP == "8" ] <- 10^8
# Compute full dollar impact
StormData2$PROPDMGVAL <- StormData2$PROPDMG * StormData2$PROPVAL
2.2.2.2 Compute the full dollar impact of crop damage
# Assign the multiplier to be used for each character
StormData2$CROPVAL[StormData2$CROPDMGEXP == "" ] <- 1
StormData2$CROPVAL[StormData2$CROPDMGEXP == "M" ] <- 10^6
StormData2$CROPVAL[StormData2$CROPDMGEXP == "K" ] <- 1000
StormData2$CROPVAL[StormData2$CROPDMGEXP == "m" ] <- 10^9
StormData2$CROPVAL[StormData2$CROPDMGEXP == "B" ] <- 10^6
StormData2$CROPVAL[StormData2$CROPDMGEXP == "?" ] <- 0
StormData2$CROPVAL[StormData2$CROPDMGEXP == "0" ] <- 1
StormData2$CROPVAL[StormData2$CROPDMGEXP == "k" ] <- 1000
StormData2$CROPVAL[StormData2$CROPDMGEXP == "2" ] <- 100
# Compute full dollar impact
StormData2$CROPDMGVAL <- StormData2$CROPDMG * StormData2$CROPVAL
We now have a properly formatted subset of the NOAA dataset for our project.
3.1 calculate the total number of fatalities and injuries for each type severe weather event. Identify the top 5 severe weather events for number of deaths and for number of injuries.
Death <- aggregate(FATALITIES ~ EVTYPE, data = StormData2, FUN = sum)
Injury <- aggregate(INJURIES ~ EVTYPE, data = StormData2, FUN = sum)
Deathtop5 <- Death[order(-Death$FATALITIES),][1:5, ]
Injurytop5 <- Injury[order(-Injury$INJURIES),][1:5, ]
Deathtop5
## EVTYPE FATALITIES
## 834 TORNADO 5633
## 130 EXCESSIVE HEAT 1903
## 153 FLASH FLOOD 978
## 275 HEAT 937
## 464 LIGHTNING 816
Injurytop5
## EVTYPE INJURIES
## 834 TORNADO 91346
## 856 TSTM WIND 6957
## 170 FLOOD 6789
## 130 EXCESSIVE HEAT 6525
## 464 LIGHTNING 5230
4.1 Calculate the total dollar damage for each type severe weather event. Identift the top 5 severe weather events for property and crop damage.
Property <- aggregate(PROPDMGVAL~ EVTYPE, data = StormData2, FUN = sum)
Crops <- aggregate(CROPDMGVAL ~ EVTYPE, data = StormData2, FUN = sum)
Proptop5 <- Property[order(-Property$PROPDMGVAL),][1:5, ]
Croptop5 <- Crops[order(-Crops$CROPDMGVAL),][1:5, ]
Proptop5
## EVTYPE PROPDMGVAL
## 170 FLOOD 144657709807
## 411 HURRICANE/TYPHOON 69305840000
## 834 TORNADO 56947380617
## 670 STORM SURGE 43323536000
## 153 FLASH FLOOD 16822673979
Croptop5
## EVTYPE CROPDMGVAL
## 95 DROUGHT 12474066000
## 409 HURRICANE OPAL 10009000000
## 170 FLOOD 5661968450
## 244 HAIL 3025954473
## 402 HURRICANE 2741910000
# Summarise our findings from section 3.1 into a chart
plot1<- ggplot(data=Deathtop5, aes(reorder(Deathtop5$EVTYPE, Deathtop5$FATALITIES),Deathtop5$FATALITIES)) +
geom_bar(stat="identity",
col="red",
fill="red",
alpha = .2,
width = .5) +
coord_flip() +
theme(title=element_text(size=8))+
labs(title="Total Deaths by Severe Weather Event - 1950 to 2011") +
labs(x="Event Type", y="Total Deaths")
plot2<- ggplot(data=Injurytop5, aes(reorder(Injurytop5$EVTYPE, Injurytop5$INJURIES),Injurytop5$INJURIES)) +
geom_bar(stat="identity",
col="yellow",
fill="yellow",
alpha = .2,
width = .5) +
coord_flip() +
theme(title=element_text(size=8))+
labs(title="Total Injuries by Severe Weather Event - 1950 to 2011") +
labs(x="Event Type", y="Total Injured")
grid.arrange(plot1,plot2, ncol = 2)
The above plot shows the top 5 severe weather events, ranked accordingly to the total number of deaths and injuries caused, over the period of our study.
Tornados, which resulted in the highest number of deaths and also injuries, clearly have the greatest impact to public health.
# Summarise our findings from section 4.1 in a chart
plot3<- ggplot(data=Proptop5, aes(reorder(Proptop5$EVTYPE, Proptop5$PROPDMGVAL),Proptop5$PROPDMGVAL)) +
geom_bar(stat="identity",
col="blue",
fill="blue",
alpha = .2,
width = .5) +
coord_flip() +
theme(title=element_text(size=7))+
labs(title="Total Value of Property Damaged by Severe Weather Event - 1950 to 2011") +
labs(x="Event Type", y="Total $ Property Damaged")
plot4<- ggplot(data=Croptop5, aes(reorder(Croptop5$EVTYPE, Croptop5$CROPDMGVAL),Croptop5$CROPDMGVAL)) +
geom_bar(stat="identity",
col="green",
fill="green",
alpha = .2,
width = .5) +
coord_flip() +
theme(title=element_text(size=7))+
labs(title="Total Value of Crop Damaged by Severe Weather Event - 1950 to 2011") +
labs(x="Event Type", y="Total $ Crop Damaged")
grid.arrange(plot3,plot4, ncol = 2)
The above plot shows the top 5 severe weather events, ranked accordingly to the total value of property damaged and total value of crop damaged, over the period of our study.
We can see that floods have the greatest damage to property, while droughts have the most adverse impact to crops.