Reproducible Research Peer Assessment 2: Exploring U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database
This analysis explores the NOAA Storm Database and demonstartes the inpact of severe weather events on people and property. The database covers the years 1950 to November 2011. The earlier years of the database contain fewer events, most likely due to a lack of good records. Recent years appear more complete.
Data analysis must address the following questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
Consider writing your report as if it were to be read by a government or municipal manager who might be responsible for preparing for severe weather events and will need to prioritize resources for different types of events. However, there is no need to make any specific recommendations in your report.
Data
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
The following packages are required to manipulate data and generate charts:
library(ggplot2)
library(reshape2)
library(knitr)
library(grid)
library(gridExtra)
library(data.table)
fileurl<-"http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
filebz<-"repdata-data-StormData.csv.bz2"
filecsv<-"repdata-data-StormData.csv"
if ( !file.exists(filecsv) ) {
if ( !file.exists(filebz) )
download.file(fileurl,destfile=filebz)
con<-bzfile(filebz,open="r")
} else
con<-file(filecsv,open="r")
rawdata<-read.csv(con,stringsAsFactors=FALSE)
close(con)
dim(rawdata)
## [1] 902297 37
## rawdata <- read.csv("repdata_data_StormData.csv" ,header= TRUE, sep = ",", stringsAsFactors = F)
## rawdata = read.table(bzfile("repdata_data_StormData.csv.bz2", open = "rt"))
Transforming Property Damage to a Dollar value and cleaning punctuations where numric is expected.
## trim the columns to the necessary only
data <- data.table(rawdata[, c(2, 8, 23, 24, 25, 26, 27, 28)])
str(data)
## Classes 'data.table' and 'data.frame': 902297 obs. of 8 variables:
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## - attr(*, ".internal.selfref")=<externalptr>
# Covert to lower so to avoid inaccuracy in the data and simplify reclassification
data$EVTYPE <- tolower(data$EVTYPE)
# There are a lot of the terminology contains mixed types, that cross boudaries, and may be difficult to attribute to one or the other types,
# Here we try to reclassify a lot of these mixed bags along higher definition
length(unique(data$EVTYPE))
## [1] 898
# This is very large number
data[grepl("(.*)surf(.*)", data$EVTYPE)==T,2] <- "coastal"
data[grepl("(.*)swell(.*)", data$EVTYPE)==T,2] <- "coastal"
data[grepl("(.*)sea(.*)", data$EVTYPE)==T,2] <- "coastal"
data[grepl("(.*)tide(.*)", data$EVTYPE)==T,2] <- "coastal"
data[grepl("(.*)wave(.*)", data$EVTYPE)==T,2] <- "coastal"
data[grepl("marine(.*)", data$EVTYPE)==T,2] <- "coastal"
data[grepl("seiche(.*)", data$EVTYPE)==T,2] <- "coastal"
data[grepl("beach(.*)", data$EVTYPE)==T,2] <- "coastal"
data[grepl("tide(.*)", data$EVTYPE)==T,2] <- "coastal"
data[grepl("coastal(.*)", data$EVTYPE)==T,2] <- "coastal"
data[grepl("c(.*)st(.*)l(.*)flood(.*)", data$EVTYPE)==T,2] <- "coastal"
data[grepl("thunders(.*)", data$EVTYPE)==T,2] <- "thunderstorm"
data[grepl("tstm(.*)", data$EVTYPE)==T,2] <- "thunderstorm"
data[grepl("hail(.*)", data$EVTYPE)==T,2] <- "hail"
data[grepl("(.*)wind(.*)", data$EVTYPE)==T,2] <- "wind"
data[grepl("(.*)rain(.*)", data$EVTYPE)==T,2] <- "rain"
data[grepl("(.*)snow(.*)", data$EVTYPE)==T,2] <- "ice snow"
data[grepl("(.*)ice(.*)", data$EVTYPE)==T,2] <- "ice snow"
data[grepl("(.*)freez(.*)", data$EVTYPE)==T,2] <- "ice snow"
data[grepl("wind(.*)", data$EVTYPE)==T,2] <- "wind"
data[grepl("rain(.*)", data$EVTYPE)==T,2] <- "rain"
data[grepl("snow(.*)", data$EVTYPE)==T,2] <- "ice snow"
data[grepl("ice(.*)", data$EVTYPE)==T,2] <- "ice snow"
data[grepl("freez(.*)", data$EVTYPE)==T,2] <- "ice snow"
data[grepl("(.*)floo(.*)", data$EVTYPE)==T,2] <- "flood"
data[grepl("floo(.*)", data$EVTYPE)==T,2] <- "flood"
data[grepl("(.*)snow(.*)", data$EVTYPE)==T,2] <- "flood"
data[grepl("(.*)hur(.*)", data$EVTYPE)==T,2] <- "hurricane"
data[grepl("hur(.*)", data$EVTYPE)==T,2] <- "hurricane"
data[grepl("(.*)heat(.*)", data$EVTYPE)==T,2] <- "heat"
data[grepl("heat(.*)", data$EVTYPE)==T,2] <- "heat"
data[grepl("(.*)cold(.*)", data$EVTYPE)==T,2] <- "cold"
data[grepl("cold(.*)", data$EVTYPE)==T,2] <- "cold"
data[grepl("(.*)dry(.*)", data$EVTYPE)==T,2] <- "drought"
data[grepl("dry(.*)", data$EVTYPE)==T,2] <- "drought"
data[grepl("(.*)tropical(.*)", data$EVTYPE)==T,2] <- "tropical"
data[grepl("tropical(.*)", data$EVTYPE)==T,2] <- "tropical"
data[grepl("(.*)tornad(.*)", data$EVTYPE)==T,2] <- "tornado"
data[grepl("tornad(.*)", data$EVTYPE)==T,2] <- "tornado"
data[grepl("(.*)wint(.*)", data$EVTYPE)==T,2] <- "wintery"
data[grepl("wint(.*)", data$EVTYPE)==T,2] <- "wintery"
data[grepl("(.*)lide(.*)", data$EVTYPE)==T,2] <- "landslides"
data[grepl("(.*)igh temp(.*)", data$EVTYPE)==T,2] <- "heat"
data[grepl("(.*)ightn(.*)", data$EVTYPE)==T,2] <- "lightning"
data[grepl("(.*)ust(.*)", data$EVTYPE)==T,2] <- "dust"
data[grepl("(.*)wat(.*)", data$EVTYPE)==T,2] <- "flood"
data[grepl("wet(.*)", data$EVTYPE)==T,2] <- "wintery"
data[grepl("(.*)ire(.*)", data$EVTYPE)==T,2] <- "fires"
data[grepl("funnel(.*)", data$EVTYPE)==T,2] <- "wind"
data <- data[grepl("summary(.*)", data$EVTYPE)==F,]
data <- data[grepl("apache county", data$EVTYPE)==F,]
data <- data[grepl("southeast", data$EVTYPE)==F,]
data <- data[grepl("monthly(.*)", data$EVTYPE)==F,]
length(unique(data$EVTYPE))
## [1] 119
# Althought we could further reduce the ```EVTYPE```, but it appears the higer values have already been classified correctly
data$EVTYPE <- toupper(data$EVTYPE)
data$FATALITIES <- as.numeric(data$FATALITIES)
Transforming Property Damage to a Dollar value and cleaning punctuations where numric is expected.
data$PROPDMGEXP <- as.character(data$PROPDMGEXP)
data$PROPDMGEXP <- toupper(data$PROPDMGEXP)
data$PROPDMGEXP[data$PROPDMGEXP == "" | data$PROPDMGEXP == "+" | data$PROPDMGEXP == "?" | data$PROPDMGEXP == "-"] <- "0"
data$PROPDMGEXP[data$PROPDMGEXP == "H" ] <- "100"
data$PROPDMGEXP[data$PROPDMGEXP == "K" ] <- "1000"
data$PROPDMGEXP[data$PROPDMGEXP == "M" ] <- "1000000"
data$PROPDMGEXP[data$PROPDMGEXP == "B" ] <- "1000000000"
data$PROPDMGEXP <- as.numeric(data$PROPDMGEXP)
data$PROPDMG_USD <- data$PROPDMG * data$PROPDMGEXP
Transforming Crop Damage to a Dollar value and cleaning punctuations where numric is expected.
data$CROPDMGEXP <- as.character(data$CROPDMGEXP)
data$CROPDMGEXP <- toupper(data$CROPDMGEXP)
data$CROPDMGEXP[data$CROPDMGEXP == "" | data$CROPDMGEXP == "?"] <- "0"
data$CROPDMGEXP[data$CROPDMGEXP == "B" ] <- "1000000000"
data$CROPDMGEXP[data$CROPDMGEXP == "M" ] <- "1000000"
data$CROPDMGEXP[data$CROPDMGEXP == "K" ] <- "1000"
# Take out all zeros
data$CROPDMGEXP <- as.numeric(data$CROPDMGEXP)
data$CROPDMG_USD <- data$CROPDMG * data$CROPDMGEXP
Data Analysis
Aggregating the data by Event Type (EVTYPE) by summarizing Fatalities, Injuries, Cost of Property Damage & Cost of Crop Damage
## Aggregating the 4 key variables for summary and plotting purposes
ImpactSummary<-aggregate(cbind(FATALITIES,INJURIES,PROPDMG_USD,CROPDMG_USD)~EVTYPE,data=data,FUN=sum,na.action=na.omit)
summary(ImpactSummary)
## EVTYPE FATALITIES INJURIES PROPDMG_USD
## Length:119 Min. : 0 Min. : 0 Min. :0.00e+00
## Class :character 1st Qu.: 0 1st Qu.: 0 1st Qu.:0.00e+00
## Mode :character Median : 0 Median : 0 Median :0.00e+00
## Mean : 127 Mean : 1181 Mean :3.59e+09
## 3rd Qu.: 2 3rd Qu.: 4 3rd Qu.:5.35e+05
## Max. :5636 Max. :91407 Max. :1.72e+11
## CROPDMG_USD
## Min. :0.00e+00
## 1st Qu.:0.00e+00
## Median :0.00e+00
## Mean :4.13e+08
## 3rd Qu.:0.00e+00
## Max. :1.93e+10
str(ImpactSummary)
## 'data.frame': 119 obs. of 5 variables:
## $ EVTYPE : chr "?" "ABNORMAL WARMTH" "AVALANCE" "AVALANCHE" ...
## $ FATALITIES : num 0 0 1 224 0 101 0 480 210 0 ...
## $ INJURIES : num 0 0 0 170 0 805 0 774 280 0 ...
## $ PROPDMG_USD: num 5000 0 0 3721800 0 ...
## $ CROPDMG_USD: num 0 0 0 0 0 ...
## Population Health is represented by Fatalities and Injuries
Fatal_Data<-head(ImpactSummary[order(ImpactSummary[,2],decreasing=TRUE),],10)[,c(1,2)]
Fatal_Data
## EVTYPE FATALITIES
## 92 TORNADO 5636
## 31 HEAT 2957
## 25 FLOOD 1784
## 51 LIGHTNING 817
## 91 THUNDERSTORM 736
## 117 WIND 673
## 8 COASTAL 480
## 79 RIP CURRENT 368
## 118 WINTERY 278
## 4 AVALANCHE 224
Injury_Data<-head(ImpactSummary[order(ImpactSummary[,3],decreasing=TRUE),],10)[,c(1,3)]
Injury_Data
## EVTYPE INJURIES
## 92 TORNADO 91407
## 25 FLOOD 11911
## 91 THUNDERSTORM 9511
## 31 HEAT 8830
## 51 LIGHTNING 5231
## 117 WIND 1954
## 118 WINTERY 1953
## 23 FIRES 1608
## 30 HAIL 1371
## 41 HURRICANE 1326
Fatalities <- ggplot(data=Fatal_Data,aes(y=FATALITIES,x=EVTYPE)) +geom_bar(size=1,colour="black",fill="dark red",stat="identity") +theme(axis.text.x = element_text(angle = 65, hjust = 1)) +labs(list(x="Event Types",y="Number Of Fatalities"))
Injuries <- ggplot(data=Injury_Data,aes(y=INJURIES,x=EVTYPE)) +geom_bar(size=1,colour="black",fill="red",stat="identity") +theme(axis.text.x = element_text(angle = 60, hjust = 1)) +labs(list(x="Event Types",y="Number Of Injuries"))
grid.arrange(Fatalities, Injuries, main = textGrob("THE MOST HARMFUL WEATHER EVENTS TO THE POPULATION \n HEALTH IN THE U.S. FROM 1950 TO 2011", gp = gpar(fontsize = 14)), ncol = 2)
The top 3 events that cause the most fatalities are Tornado, Heat and Flood.
The top 3 events that cause the most injuries are Tornado, Flood and Thunderstorm.
PropertyImpact<-head(ImpactSummary[order(ImpactSummary[,4],decreasing=TRUE),],10)[,c(1,4)]
PropertyImpact
## EVTYPE PROPDMG_USD
## 25 FLOOD 1.721e+11
## 41 HURRICANE 8.466e+10
## 92 TORNADO 5.699e+10
## 89 STORM SURGE 4.332e+10
## 30 HAIL 1.597e+10
## 91 THUNDERSTORM 1.257e+10
## 23 FIRES 8.497e+09
## 94 TROPICAL 7.716e+09
## 118 WINTERY 6.717e+09
## 117 WIND 6.367e+09
CropImpact<-head(ImpactSummary[order(ImpactSummary[,5],decreasing=TRUE),],10)[,c(1,5)]
CropImpact
## EVTYPE CROPDMG_USD
## 25 FLOOD 1.931e+10
## 17 DROUGHT 1.397e+10
## 41 HURRICANE 5.505e+09
## 30 HAIL 3.047e+09
## 9 COLD 1.379e+09
## 91 THUNDERSTORM 1.274e+09
## 117 WIND 9.033e+08
## 31 HEAT 8.989e+08
## 67 RAIN 7.950e+08
## 94 TROPICAL 6.949e+08
Property <-ggplot(data=PropertyImpact,aes(y=PROPDMG_USD/(10^9),x=EVTYPE))+geom_bar(stat="identity", color = "black", fill = " dark green")+theme(axis.text.x = element_text(angle = 65, hjust = 1))+labs(list(x="Event Types",y="Property Impact in Billions"))
Crops<- ggplot(data=CropImpact,aes(y=CROPDMG_USD/(10^9),x=EVTYPE))+geom_bar(stat="identity", color= "black", fill = "green")+theme(axis.text.x = element_text(angle = 65, hjust = 1))+labs(list(x="Event Types",y="Crops Impact in Billions"))
grid.arrange(Property, Crops, main = textGrob("THE MOST HARMFUL WEATHER EVENTS TO THE PROPERTY AND\n CROPS IN THE U.S. FROM 1950 TO 2011", gp = gpar(fontsize = 14)), ncol = 2)
The top 3 events that casue the most damage to Property are Flood, Huricane and Tornado.
The top 3 events that casue the most damage to Crops are Flood, Drought and Huricane.
WeatherSummary <- data.frame(FATALITIES = Fatal_Data$EVTYPE,INJURIES = Injury_Data$EVTYPE,PROPERTY = PropertyImpact$EVTYPE,CROPS = CropImpact$EVTYPE)
WeatherSummary
## FATALITIES INJURIES PROPERTY CROPS
## 1 TORNADO TORNADO FLOOD FLOOD
## 2 HEAT FLOOD HURRICANE DROUGHT
## 3 FLOOD THUNDERSTORM TORNADO HURRICANE
## 4 LIGHTNING HEAT STORM SURGE HAIL
## 5 THUNDERSTORM LIGHTNING HAIL COLD
## 6 WIND WIND THUNDERSTORM THUNDERSTORM
## 7 COASTAL WINTERY FIRES WIND
## 8 RIP CURRENT FIRES TROPICAL HEAT
## 9 WINTERY HAIL WINTERY RAIN
## 10 AVALANCHE HURRICANE WIND TROPICAL
The above table can be used like a cheatsheet to quickly determine the top ten events on each of the segments, in their order of significanse and impact