This analysis uses the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, covering 1950 to November 2011, to determine which weather events are most harmful to population health and to the economy. The data are lightly processed to drop unneeded columns, group similar weather event types, and express property and crop damage in USD. The population health indicator is the sum of all injuries and fatalities attributed to a weather event type; the economic damage indicator is the sum of all property and crop damage attributed to it. Only the top 10 weather event types for each indicator are plotted. The event types that appear in both top 10 lists are also listed.
“This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.” - Reproducible Research project explanation
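The questions answered in this analysis are: which types of weather events are most harmful to population health, and which types cause the greatest economic damage?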
The NOAA storm dataset, as downloaded from the Coursera course site, spans the period from 1950 to November 2011. There are fewer recorded events in the earlier years, most likely due to a lack of good records.
Information about the software environment being used for reproducibility
sessionInfo()
## R version 3.0.2 (2013-09-25)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.4 evaluate_0.5.5 formatR_0.10 htmltools_0.2.4
## [5] knitr_1.6 rmarkdown_0.2.48 stringr_0.6.2 tools_3.0.2
## [9] yaml_2.1.11
The commented-out code below can be used on most systems to download the required NOAA data and create the CSV file needed for the analysis. It is included for complete reproducibility, with only a URL needed. Skip to the next code chunk for data processing starting from the unzipped CSV file downloaded from the Coursera website.
# wd <- readline(prompt = "Enter directory to download NOAA data to: ")
# setwd(wd)
# download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","StormData.csv.bz2",method="curl")
# # read.csv() opens and closes the bzfile connection itself, so no close() is needed
# data <- read.csv(bzfile("StormData.csv.bz2"))
# # row.names=FALSE avoids writing an extra index column to the CSV
# write.csv(data,file="StormData.csv",row.names=FALSE)
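As a hedged alternative to the commented-out lines above, the download step can be wrapped so that it only fetches the archive when it is not already on disk. This is a minimal sketch, not the method used for this report: `fetch_storm_data` and its `dest` argument are hypothetical names introduced here, and it relies on `read.csv()` being able to read a bzip2-compressed file directly from its path.

```r
# Hypothetical helper (not part of the original analysis): download the
# archive only if it is missing, then read it into a data frame.
fetch_storm_data <- function(url, dest = "StormData.csv.bz2") {
  if (!file.exists(dest)) {
    # mode = "wb" keeps the compressed file intact on all platforms
    download.file(url, destfile = dest, mode = "wb")
  }
  # read.csv() detects the bzip2 compression when given the file path
  read.csv(dest, stringsAsFactors = FALSE)
}

# Example call (commented out because it downloads a large file):
# data <- fetch_storm_data("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2")
```

Re-running the report then reuses the local copy instead of downloading it again.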
The CSV data file is loaded into R for processing
setwd("~/Desktop/test_repo/RepData_PeerAssessement2")
data <- read.csv("StormData.csv",header=TRUE,stringsAsFactors = FALSE)
head(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
str(data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Unnecessary columns are removed from the data table to streamline data exploration. The columns retained are BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, REMARKS and REFNUM:
strip_data <- data[,c("BGN_DATE","EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP","REMARKS","REFNUM")]
rm(data)
str(strip_data)
## 'data.frame': 902297 obs. of 10 variables:
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
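The earlier remark that fewer events were recorded in the early years can be checked by tabulating events per year from `BGN_DATE`. The sketch below is illustrative only: `events_per_year` is a hypothetical helper, and the toy dates are the first five `BGN_DATE` values shown in the output above rather than the full dataset.

```r
# Hypothetical helper: count events per calendar year from BGN_DATE strings.
events_per_year <- function(bgn_date) {
  # BGN_DATE looks like "4/18/1950 0:00:00"; only the date part is parsed,
  # the trailing time text is ignored by the format.
  year <- format(as.Date(bgn_date, format = "%m/%d/%Y"), "%Y")
  table(year)
}

toy_dates <- c("4/18/1950 0:00:00", "4/18/1950 0:00:00",
               "2/20/1951 0:00:00", "6/8/1951 0:00:00",
               "11/15/1951 0:00:00")
events_per_year(toy_dates)  # 2 events in 1950, 3 in 1951
```

On the full data the same call would be `events_per_year(strip_data$BGN_DATE)`.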
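Steps: standardise the event type names, split the data into population health and economic subsets, convert the damage exponent codes into multipliers to obtain damage in USD, then aggregate by event type and keep the top 10 for each indicator.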
# Processing data
proc_data <- strip_data
# Some summarising of event types
proc_data$EVTYPE <- toupper(proc_data$EVTYPE)
original <- c("TSTM","AVALANCE","BLOWING SNOW","BRUSH FIRE","WILD FIRE")
replacement <- c("THUNDERSTORM","AVALANCHE","BLIZZARD","WILDFIRE","WILDFIRE")
for (x in seq_along(original)) {
  proc_data$EVTYPE <- gsub(original[x],replacement[x],proc_data$EVTYPE)
}
proc_data$EVTYPE[grep("TORNADO",proc_data$EVTYPE)] <- "TORNADO"
proc_data$EVTYPE[grep("HURRICANE",proc_data$EVTYPE)] <- "HURRICANE/TYPHOON"
proc_data$EVTYPE[grep("TYPHOON",proc_data$EVTYPE)] <- "HURRICANE/TYPHOON"
proc_data$EVTYPE[grep("FLASH FLOOD",proc_data$EVTYPE)] <- "FLASH FLOOD"
proc_data$EVTYPE[grep("^COASTAL",proc_data$EVTYPE)] <- "COASTAL FLOOD"
proc_data$EVTYPE[grep("^COLD",proc_data$EVTYPE)] <- "COLD/WIND CHILL"
proc_data$EVTYPE[grep("^DROUGHT",proc_data$EVTYPE)] <- "DROUGHT"
proc_data$EVTYPE[grep("^EXTREME COLD",proc_data$EVTYPE)] <- "EXTREME COLD/WIND CHILL"
proc_data$EVTYPE[grep("EXCESSIVE RAINFALL",proc_data$EVTYPE)] <- "HEAVY RAIN"
proc_data$EVTYPE[grep("EXCESSIVE SNOW",proc_data$EVTYPE)] <- "HEAVY SNOW"
proc_data$EVTYPE[grep("EXTREME HEAT",proc_data$EVTYPE)] <- "EXCESSIVE HEAT"
proc_data$EVTYPE[grep("EXTREME WINDCHILL",proc_data$EVTYPE)] <- "EXTREME COLD/WIND CHILL"
proc_data$EVTYPE[grep("FALLING SNOW",proc_data$EVTYPE)] <- "HEAVY SNOW"
proc_data$EVTYPE[grep("^FLOOD",proc_data$EVTYPE)] <- "FLOOD"
proc_data$EVTYPE[grep("^HIGH WIND",proc_data$EVTYPE)] <- "HIGH WIND"
proc_data$EVTYPE[grep("^HEAVY SNOW",proc_data$EVTYPE)] <- "HEAVY SNOW"
proc_data$EVTYPE[grep("^HEAVY SURF",proc_data$EVTYPE)] <- "HIGH SURF"
proc_data$EVTYPE[grep("^LIGHTNING",proc_data$EVTYPE)] <- "LIGHTNING"
proc_data$EVTYPE[grep("RIVER FLOOD",proc_data$EVTYPE)] <- "FLOOD"
proc_data$EVTYPE[grep("^SNOW",proc_data$EVTYPE)] <- "HEAVY SNOW"
proc_data$EVTYPE[grep("^WINTER STORM",proc_data$EVTYPE)] <- "WINTER STORM"
proc_data$EVTYPE[grep("^WINTER WEATHER",proc_data$EVTYPE)] <- "WINTER WEATHER"
proc_data$EVTYPE[grep("^WINTRY",proc_data$EVTYPE)] <- "WINTER WEATHER"
proc_data$EVTYPE[grep("^WIND",proc_data$EVTYPE)] <- "STRONG WIND"
proc_data$EVTYPE[grep("^WILD",proc_data$EVTYPE)] <- "WILDFIRE"
proc_data$EVTYPE[grep("^THUNDERSTORM",proc_data$EVTYPE)] <- "THUNDERSTORM WIND"
proc_data$EVTYPE[grep("^URBAN",proc_data$EVTYPE)] <- "FLOOD"
proc_data$EVTYPE[grep("^WARM WEATHER",proc_data$EVTYPE)] <- "EXCESSIVE HEAT"
proc_data$EVTYPE[grep("^UNSEASONABLY WARM",proc_data$EVTYPE)] <- "EXCESSIVE HEAT"
proc_data$EVTYPE[grep("^UNSEASONABLY COLD",proc_data$EVTYPE)] <- "COLD/WIND CHILL"
proc_data$EVTYPE[grep("^TROPICAL STORM",proc_data$EVTYPE)] <- "TROPICAL STORM"
proc_data$EVTYPE[grep("^TORRENTIAL RAIN",proc_data$EVTYPE)] <- "HEAVY RAIN"
proc_data$EVTYPE[grep("^TIDAL FLOOD",proc_data$EVTYPE)] <- "COASTAL FLOOD"
proc_data$EVTYPE[grep("^THUNDERTORM",proc_data$EVTYPE)] <- "THUNDERSTORM WIND"
proc_data$EVTYPE[grep("^STRONG WIND",proc_data$EVTYPE)] <- "STRONG WIND"
proc_data$EVTYPE[grep("^STORM SURGE",proc_data$EVTYPE)] <- "STORM SURGE/TIDE"
proc_data$EVTYPE[grep("^SMALL HAIL",proc_data$EVTYPE)] <- "HAIL"
proc_data$EVTYPE[grep("^RIP CURRENT",proc_data$EVTYPE)] <- "RIP CURRENT"
proc_data$EVTYPE[grep("^RECORD HEAT",proc_data$EVTYPE)] <- "EXCESSIVE HEAT"
proc_data$EVTYPE[grep("EXCESSIVE HEAT",proc_data$EVTYPE)] <- "EXCESSIVE HEAT"
proc_data$EVTYPE[grep("^LOW TEMPERATURE",proc_data$EVTYPE)] <- "COLD/WIND CHILL"
proc_data$EVTYPE[grep("^RECORD COLD",proc_data$EVTYPE)] <- "EXTREME COLD/WIND CHILL"
proc_data$EVTYPE[grep("^FREEZ",proc_data$EVTYPE)] <- "FROST/FREEZE"
proc_data$EVTYPE[grep("^FROST",proc_data$EVTYPE)] <- "FROST/FREEZE"
proc_data$EVTYPE[grep("^HAIL",proc_data$EVTYPE)] <- "HAIL"
proc_data$EVTYPE[grep("THUNDERSTORM",proc_data$EVTYPE)] <- "THUNDERSTORM WIND"
proc_data$EVTYPE[grep("RAIN",proc_data$EVTYPE)] <- "HEAVY RAIN"
proc_data$EVTYPE[grep("WARM",proc_data$EVTYPE)] <- "EXCESSIVE HEAT"
proc_data$EVTYPE[grep("WIND CHILL",proc_data$EVTYPE)] <- "COLD/WIND CHILL"
# Separate into population health data and economic data
pop_data <- proc_data[proc_data$INJURIES>0|proc_data$FATALITIES>0,2:4]
econ_data <- proc_data[proc_data$CROPDMG>0|proc_data$PROPDMG>0,c(1,2,5,6,7,8)]
# Processing exponents in economic damages
econ_data$PROPDMGEXP <- toupper(econ_data$PROPDMGEXP)
econ_data$CROPDMGEXP <- toupper(econ_data$CROPDMGEXP)
econ_data$PROPDMGEXP[econ_data$PROPDMGEXP=="H"] <- paste(10^2,sep="")
econ_data$PROPDMGEXP[econ_data$PROPDMGEXP=="K"] <- paste(10^3,sep="")
econ_data$PROPDMGEXP[econ_data$PROPDMGEXP=="M"] <- paste(10^6,sep="")
econ_data$PROPDMGEXP[econ_data$PROPDMGEXP=="B"] <- paste(10^9,sep="")
econ_data$PROPDMGEXP[econ_data$PROPDMGEXP=="2"] <- paste(10^2,sep="")
econ_data$PROPDMGEXP[econ_data$PROPDMGEXP=="3"] <- paste(10^3,sep="")
econ_data$PROPDMGEXP[econ_data$PROPDMGEXP=="4"] <- paste(10^4,sep="")
econ_data$PROPDMGEXP[econ_data$PROPDMGEXP=="5"] <- paste(10^5,sep="")
econ_data$PROPDMGEXP[econ_data$PROPDMGEXP=="6"] <- paste(10^6,sep="")
econ_data$PROPDMGEXP[econ_data$PROPDMGEXP=="7"] <- paste(10^7,sep="")
econ_data$PROPDMGEXP[econ_data$PROPDMGEXP=="8"] <- paste(10^8,sep="")
econ_data$CROPDMGEXP[econ_data$CROPDMGEXP=="H"] <- paste(10^2,sep="")
econ_data$CROPDMGEXP[econ_data$CROPDMGEXP=="K"] <- paste(10^3,sep="")
econ_data$CROPDMGEXP[econ_data$CROPDMGEXP=="M"] <- paste(10^6,sep="")
econ_data$CROPDMGEXP[econ_data$CROPDMGEXP=="B"] <- paste(10^9,sep="")
econ_data$CROPDMGEXP[econ_data$CROPDMGEXP=="2"] <- paste(10^2,sep="")
econ_data$CROPDMGEXP[econ_data$CROPDMGEXP=="3"] <- paste(10^3,sep="")
econ_data$CROPDMGEXP[econ_data$CROPDMGEXP=="4"] <- paste(10^4,sep="")
econ_data$CROPDMGEXP[econ_data$CROPDMGEXP=="5"] <- paste(10^5,sep="")
econ_data$CROPDMGEXP[econ_data$CROPDMGEXP=="6"] <- paste(10^6,sep="")
econ_data$CROPDMGEXP[econ_data$CROPDMGEXP=="7"] <- paste(10^7,sep="")
econ_data$CROPDMGEXP[econ_data$CROPDMGEXP=="8"] <- paste(10^8,sep="")
econ_data$PROPDMGEXP <- as.numeric(econ_data$PROPDMGEXP)
## Warning: NAs introduced by coercion
econ_data$PROPDMGEXP[is.na(econ_data$PROPDMGEXP)] <- 0
econ_data$PROPDMGEXP[is.infinite(econ_data$PROPDMGEXP)] <- 0
econ_data$CROPDMGEXP <- as.numeric(econ_data$CROPDMGEXP)
## Warning: NAs introduced by coercion
econ_data$CROPDMGEXP[is.na(econ_data$CROPDMGEXP)] <- 0
unique(econ_data$PROPDMGEXP)
## [1] 1e+03 1e+06 1e+09 0e+00 1e+05 1e+04 1e+02 1e+07
unique(econ_data$CROPDMGEXP)
## [1] 0e+00 1e+06 1e+03 1e+09
econ_data$PROPDMG <- econ_data$PROPDMG*econ_data$PROPDMGEXP
econ_data$CROPDMG <- econ_data$CROPDMG*econ_data$CROPDMGEXP
econ_data <- econ_data[,c(2,3,5)]
# Population health ranking
pop_health <- aggregate(. ~ EVTYPE, data = pop_data, sum)
pop_health$Total_Health_Damage <- pop_health$INJURIES + pop_health$FATALITIES
pop_health <- pop_health[order(pop_health$Total_Health_Damage,decreasing=TRUE),]
pop_health <- pop_health[1:10,]
# Economic damage ranking
econ_health <- aggregate(. ~ EVTYPE, data = econ_data, sum)
econ_health$Total_Economic_Damage <- econ_health$PROPDMG + econ_health$CROPDMG
econ_health <- econ_health[order(econ_health$Total_Economic_Damage,decreasing=TRUE),]
econ_health <- econ_health[1:10,]
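The exponent handling above maps each `PROPDMGEXP`/`CROPDMGEXP` code to a multiplier one assignment at a time. A named lookup vector expresses the same idea more compactly. This is a sketch under stated assumptions, not the code used to produce the results in this report: `exp_lookup` and `exp_multiplier` are hypothetical names, the table covers only the codes H, K, M, B and 2 to 8 handled above, and every other code (including a blank) falls back to `default`, set to 0 here to mirror the treatment of unrecognised codes above. It is not guaranteed to reproduce every edge case of the chunk above.

```r
# Hypothetical lookup table: exponent code -> multiplier.
exp_lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9,
                setNames(10^(2:8), 2:8))

# Hypothetical helper: translate a vector of exponent codes to multipliers.
exp_multiplier <- function(code, default = 0) {
  mult <- unname(exp_lookup[toupper(code)])
  # Codes missing from the table (blank, "?", "+", ...) get the default
  mult[is.na(mult)] <- default
  mult
}

exp_multiplier(c("K", "m", "", "5", "?"))
# 1e+03 1e+06 0e+00 1e+05 0e+00

# First record shown earlier: PROPDMG = 25, PROPDMGEXP = "K"
25 * exp_multiplier("K")  # 25000 USD
```

Keeping the mapping in one table makes it easier to see, and to change, how unrecognised codes are treated, since that choice directly affects the damage totals.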
Plotting the top 10 worst weather events across the USA for population health and economic damage
library(ggplot2)
# stat="identity" plots the values in Total_Health_Damage rather than row counts
ggplot(data=pop_health, aes(x = EVTYPE,y = Total_Health_Damage))+
  geom_bar(stat="identity")+
  xlab("Weather Event")+
  ylab("Sum of all Injuries and Fatalities for all events from 1950 - Nov 2011")+
  ggtitle("Top 10 Weather Events (Population Health)")+
  coord_flip()
# stat="identity" plots the values in Total_Economic_Damage rather than row counts
ggplot(data=econ_health, aes(x = EVTYPE,y = Total_Economic_Damage))+
  geom_bar(stat="identity")+
  xlab("Weather Event")+
  ylab("Sum of Crop and Property Damage for all events from 1950 - Nov 2011 (USD)")+
  ggtitle("Top 10 Weather Events (Economic Damage)")+
  coord_flip()
# collapse (not sep) joins the elements of one vector into a single string
worst_weather_events <- paste(intersect(pop_health$EVTYPE,econ_health$EVTYPE),collapse=", ")
top_pop <- paste(pop_health$EVTYPE,collapse=", ")
top_econ <- paste(econ_health$EVTYPE,collapse=", ")
The top 10 worst weather events for population health (in decreasing order) are TORNADO, THUNDERSTORM WIND, EXCESSIVE HEAT, FLOOD, LIGHTNING, HEAT, FLASH FLOOD, ICE STORM, HIGH WIND, WILDFIRE. The top 10 worst weather events for economic damage (in decreasing order) are FLOOD, HURRICANE/TYPHOON, TORNADO, STORM SURGE/TIDE, FLASH FLOOD, HAIL, DROUGHT, THUNDERSTORM WIND, ICE STORM, WILDFIRE. The weather events in both top 10 lists (in no particular order) are TORNADO, THUNDERSTORM WIND, FLOOD, FLASH FLOOD, ICE STORM, WILDFIRE.
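The overlap between the two lists can be reproduced directly from the rankings reported above. The sketch below copies the two top 10 lists from the text into hypothetical vectors `top_pop_events` and `top_econ_events` (names introduced here for illustration) and uses `intersect()`, which keeps the order of its first argument, followed by `paste()` with `collapse` to build a single comma-separated string.

```r
# Top 10 lists copied from the results stated above (illustrative vectors)
top_pop_events <- c("TORNADO", "THUNDERSTORM WIND", "EXCESSIVE HEAT", "FLOOD",
                    "LIGHTNING", "HEAT", "FLASH FLOOD", "ICE STORM",
                    "HIGH WIND", "WILDFIRE")
top_econ_events <- c("FLOOD", "HURRICANE/TYPHOON", "TORNADO",
                     "STORM SURGE/TIDE", "FLASH FLOOD", "HAIL", "DROUGHT",
                     "THUNDERSTORM WIND", "ICE STORM", "WILDFIRE")

# Event types present in both lists, in the order of the first list
both <- intersect(top_pop_events, top_econ_events)

# collapse joins the vector into one string; sep alone would leave it as-is
paste(both, collapse = ", ")
# "TORNADO, THUNDERSTORM WIND, FLOOD, FLASH FLOOD, ICE STORM, WILDFIRE"
```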