This is a project for the Reproducible Research class in the Johns Hopkins Data Science Specialization by Coursera.
The purpose of this data analysis is explore the NOAA Storm Database to address the following questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
By answering these questions, government and municipal officials responsible for preparation for severe weather events will be able to prioritize their resources to respond to different types of weather events.
The data for this exploration come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size and can be downloaded from the course web site:
##Dowload Data
setwd("~/Documents/Class/Reproducable Research")
if(!file.exists("./data")){dir.create("./data")}
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileURL,destfile="./data/StormData.csv.bz2",method="curl")
Some documentation of the database available listing how some of the variables are constructed/defined:
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
In order to analyze the data, it will be loaded into a data frame with the following structure:
StormData<-read.csv("./data/StormData.csv.bz2")
str(StormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
The data elements of interest to assessing the impact of events are:
FATALITIES and INJURIES to examine harm with respect to population healthPROPDMG and CROPDMG to examine economic consequencesThe data is in a time series; however, the dates have been formatted as factors and must be converted to dates to allow for trending:
StormData$BGN_DATE<-as.Date(strptime(as.character(StormData$BGN_DATE),"%m/%d/%Y %H:%M:%S"))
Also of use in the analysis are the EVTYPE that lists the type of weather event as well as geographical featrues of the event: STATE and COUNTYNAME. These 8 storm features will be extracted into a smaller data frame using the sqldf package and 3 additional data elements will be added:
HARM as the sum of FATALITIES and INJURIESCOST as the sum of PROPDMG and CROPDMGBGN_DATEMY as the first of the month of BGN_DATE for aggregation purposeslibrary(sqldf)
StormDataRedux<-sqldf('select
BGN_DATE,
EVTYPE,
STATE,
COUNTYNAME,
FATALITIES,
INJURIES,
PROPDMG,
CROPDMG
from StormData')
# remove original file to save space
rm(StormData)
# add HARM & COST variables
StormDataRedux$HARM<-StormDataRedux$FATALITIES+StormDataRedux$INJURIES
StormDataRedux$COST<-StormDataRedux$PROPDMG+StormDataRedux$CROPDMG
# add "first of month" variable
monthStart <- function(x) {
x <- as.POSIXlt(x)
x$mday <- 1
as.Date(x)
}
StormDataRedux$BGN_DATEMY <- monthStart(StormDataRedux$BGN_DATE)
In order to understand how to analyze the data, it is advisable to summarize the data by ENTYPE for each of the measures:
require(sqldf)
## Loading required package: sqldf
## Loading required package: gsubfn
## Loading required package: proto
## Could not load tcltk. Will use slower R code instead.
## Loading required package: RSQLite
## Loading required package: DBI
TotalEVTYPE<-sqldf('select
EVTYPE,
count(8) as COUNT,
sum(FATALITIES) AS FATALITIES,
sum(INJURIES) AS INJURIES,
sum(HARM) AS HARM,
sum(PROPDMG) AS PROPDMG,
sum(CROPDMG) AS CROPDMG,
sum(COST) AS COST
from StormDataRedux
group by EVTYPE')
for(x in 2:8){
strat<-head(TotalEVTYPE[order(TotalEVTYPE[,x], decreasing= T),], n = 5)
strat<-strat[,c(1,x)]
cat(paste("\n","Ordered by",names(TotalEVTYPE)[x]),"\n \n")
print(strat,row.names=FALSE)
}
##
## Ordered by COUNT
##
## EVTYPE COUNT
## HAIL 288661
## TSTM WIND 219940
## THUNDERSTORM WIND 82563
## TORNADO 60652
## FLASH FLOOD 54277
##
## Ordered by FATALITIES
##
## EVTYPE FATALITIES
## TORNADO 5633
## EXCESSIVE HEAT 1903
## FLASH FLOOD 978
## HEAT 937
## LIGHTNING 816
##
## Ordered by INJURIES
##
## EVTYPE INJURIES
## TORNADO 91346
## TSTM WIND 6957
## FLOOD 6789
## EXCESSIVE HEAT 6525
## LIGHTNING 5230
##
## Ordered by HARM
##
## EVTYPE HARM
## TORNADO 96979
## EXCESSIVE HEAT 8428
## TSTM WIND 7461
## FLOOD 7259
## LIGHTNING 6046
##
## Ordered by PROPDMG
##
## EVTYPE PROPDMG
## TORNADO 3212258.2
## FLASH FLOOD 1420124.6
## TSTM WIND 1335965.6
## FLOOD 899938.5
## THUNDERSTORM WIND 876844.2
##
## Ordered by CROPDMG
##
## EVTYPE CROPDMG
## HAIL 579596.3
## FLASH FLOOD 179200.5
## FLOOD 168037.9
## TSTM WIND 109202.6
## TORNADO 100018.5
##
## Ordered by COST
##
## EVTYPE COST
## TORNADO 3312277
## FLASH FLOOD 1599325
## TSTM WIND 1445168
## HAIL 1268290
## FLOOD 1067976
The remaining analysis will focus only on the top 5 EVTYPE storms overall by reducing the scope to the first or second ranked EVTYPE by each measure to understand their impact on health & economics.
# create vector to stor top 5 EVTYPE
EVTYPES<-NULL
# loop through each measure to identify the top 2 EVTYPE and add unquie values to vector
for(x in 2:8){
strat<-head(TotalEVTYPE[order(TotalEVTYPE[,x], decreasing= T),], n = 2)
EVTYPES <- unique(c(EVTYPES,as.vector(strat$EVTYPE)))
}
# list unique EVTYPE
cat(EVTYPES)
## HAIL TSTM WIND TORNADO EXCESSIVE HEAT FLASH FLOOD
# create new subset with onpy top 5
StormDataReduxTop <- subset(StormDataRedux,StormDataRedux$EVTYPE %in% EVTYPES)
# create new summary by EVTYPE & BGN_DATEMY
require(sqldf)
TotalEVTYPERedux<-sqldf('select
EVTYPE,
BGN_DATEMY,
count(8) as COUNT,
sum(FATALITIES) AS FATALITIES,
sum(INJURIES) AS INJURIES,
sum(HARM) AS HARM,
sum(PROPDMG) AS PROPDMG,
sum(CROPDMG) AS CROPDMG,
sum(COST) AS COST
from StormDataReduxTop
group by EVTYPE,BGN_DATEMY')
Since the data is in a time series, it is advisable to examine if there is a trend in any of the measures:
require(reshape2)
## Loading required package: reshape2
StormImpacts <- melt(TotalEVTYPERedux, id.vars = c(1:2), measure.vars = c("COUNT","FATALITIES", "INJURIES","HARM","PROPDMG","CROPDMG","COST"))
library(ggplot2)
ggplot(subset(StormImpacts,variable =="COUNT"), aes(x=BGN_DATEMY,y=value)) +
geom_smooth(method = "lm",colour="black") +
geom_line(aes(colour=variable),linetype ="dashed") +
facet_grid(. ~ EVTYPE) +
xlab("") + ylab("Numer of Events") +
theme(axis.text.x = element_text(angle = 90, hjust = 1),legend.position = "bottom",legend.title=element_blank())
ggplot(subset(StormImpacts,variable %in% c("FATALITIES", "INJURIES","HARM")), aes(x=BGN_DATEMY,y=value)) +
geom_smooth(method = "lm",colour="black") +
geom_line(aes(colour=variable),linetype ="dashed") +
facet_grid(. ~ EVTYPE) +
xlab("") + ylab("Injuries and Fatalities") +
theme(axis.text.x = element_text(angle = 90, hjust = 1),legend.position = "bottom",legend.title=element_blank())
ggplot(subset(StormImpacts,variable %in% c("PROPDMG","CROPDMG","COST")), aes(x=BGN_DATEMY,y=value)) +
geom_smooth(method = "lm",colour="black") +
geom_line(aes(colour=variable),linetype ="dashed") +
facet_grid(. ~ EVTYPE) +
xlab("") + ylab("Property & Crop Damage") +
theme(axis.text.x = element_text(angle = 90, hjust = 1),legend.position = "bottom",legend.title=element_blank())
While Hail events are the most prevalent type of weather event, the most dangerous events are Tornados. The most costly event is difficult to determine because only Tornados seem to have been well tracked prior to the early 1990’s. The maximum Cost for any type of event was Tornados in 2011, but TSTM Wind has the most significant increase in severity. The only indisputable fact is that Excesive Heat is not a costly weather event.