Storms and other severe weather events can affect both public health and the economy. Many such events result in fatalities, injuries, and property damage. The purpose of this analysis is to identify which types of events had the greatest impact on public health and on economic conditions. The analysis is based on the U.S. National Oceanic and Atmospheric Administration's (NOAA) Storm Database of severe weather events. The data covers the period from 1950 to November 2011.
The storm data is available as a bzip2-compressed CSV file: Storm Data. Documentation of the database is available in the Storm Data Documentation and the FAQ.
## set working directory
setwd("C:/Swapnil/Docs/Data Science/reproducible-research/PeerAssignment2/")
## load required packages
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
library("plyr")
## Warning: package 'plyr' was built under R version 3.1.3
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.1.3
library(reshape2)
## Warning: package 'reshape2' was built under R version 3.1.3
## download the data and load into system
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if(!file.exists("stormdata.bz2")){
  ## mode = "wb" keeps the compressed file intact on every platform
  download.file(url, destfile = "stormdata.bz2", mode = "wb", quiet = TRUE)
}
storm <- read.csv(bzfile("stormdata.bz2"))
dim(storm)
## [1] 902297 37
The source contains a total of 902,297 observations of 37 variables.
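As a quick check that `read.csv()` can read directly from a bzip2 connection, the sketch below writes a tiny CSV to a temporary `.bz2` file and reads it back. The toy data frame and the temporary file are illustrative assumptions and are not part of the storm data.

```r
# Toy data (made-up values) standing in for the storm file
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(3, 0))

# Write it through a bzip2 connection to a temporary file
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the connection, decompresses, and closes it again
toy_back <- read.csv(bzfile(tmp))
dim(toy_back)
```

No separate decompression step is needed, which is why the report can pass `bzfile("stormdata.bz2")` straight to `read.csv()`.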
colnames(storm)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
head(storm, 3)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
The dataset contains a lot of information, and many of its fields are not required for this analysis, so we extract only the fields we need.
We are interested in fatalities, injuries, property damage, and crop damage, so we keep only the records in which at least one of these is greater than zero.
fields <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG",
"CROPDMGEXP")
storm <- storm[,fields]
req_records <- (storm$FATALITIES>0 | storm$INJURIES>0 | storm$PROPDMG>0 | storm$CROPDMG>0)
storm <- storm[req_records,]
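The row filter above can be tried on a small example. The sketch below uses a made-up three-row data frame and an equivalent, more compact form of the same condition based on `rowSums()`; the names `toy`, `measures`, and `keep` are hypothetical.

```r
# Made-up records: only the middle row has no recorded impact at all
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "FLOOD"),
  FATALITIES = c(0, 0, 1),
  INJURIES   = c(15, 0, 0),
  PROPDMG    = c(25, 0, 0),
  CROPDMG    = c(0, 0, 0),
  stringsAsFactors = FALSE
)

measures <- c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")

# TRUE where at least one measure is greater than zero
keep <- rowSums(toy[, measures] > 0) > 0
toy_kept <- toy[keep, ]
toy_kept$EVTYPE
```

The `rowSums()` form gives the same result as chaining the four comparisons with `|` as long as none of the measures is missing.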
The EVTYPE variable holds the event names. These were entered manually, which makes the events hard to categorize: the reduced subset contains around 488 unique event names. We categorize them by matching common words and abbreviations. The rules below are applied in order, so a later match overrides an earlier one.
storm$SourceType <- NA
storm$EVTYPE <- tolower(storm$EVTYPE)
storm[grepl("precipitation|rain|hail|drizzle|wet|percip|burst|depression|fog|wall cloud|mixed precip",
storm$EVTYPE), "SourceType"] <- "Precipitation & Fog"
storm[grepl("wind|storm|wnd|hurricane|typhoon",
storm$EVTYPE), "SourceType"] <- "Wind & Storm"
storm[grepl("slide|erosion|slump",
storm$EVTYPE), "SourceType"] <- "Landslide & Erosion"
storm[grepl("warmth|warm|heat|dry|hot|drought|thermia|temperature record|record temperature|record high",storm$EVTYPE), "SourceType"] <- "Heat & Drought"
storm[grepl("cold|cool|ice|icy|frost|freeze|snow|winter|wintry|wintery|blizzard|chill|freezing|avalanche|glaze|sleet|avalance",storm$EVTYPE), "SourceType"] <- "Snow & Ice"
storm[grepl("flood|surf|blow-out|swells|fld|dam break|heavy shower",
storm$EVTYPE), "SourceType"] <- "Flooding & High Surf"
storm[grepl("seas|high water|tide|tsunami|wave|current|marine|drowning|rapidly rising water|coastal surge|high",
storm$EVTYPE), "SourceType"] <- "High seas"
storm[grepl("dust|saharan",
storm$EVTYPE), "SourceType"] <- "Dust & Saharan winds"
storm[grepl("tstm|thunderstorm|lightning",
storm$EVTYPE), "SourceType"] <- "Thunderstorm & Lightning"
storm[grepl("tornado|spout|funnel|whirlwind",
storm$EVTYPE), "SourceType"] <- "Tornado"
storm[grepl("fire|smoke|volcanic",
storm$EVTYPE), "SourceType"] <- "Fire & Volcanic activity"
storm[grepl("torndao", storm$EVTYPE), "SourceType"] <- "Tornado"
storm[grepl("ligntning|lighting", storm$EVTYPE), "SourceType"] <- "Thunderstorm & Lightning"
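Because each assignment overwrites earlier ones, the order of the rules decides the final category. The sketch below replays a trimmed subset of the patterns above on four example event names (the event strings and the shortened patterns are assumptions chosen for illustration).

```r
events <- c("high wind", "thunderstorm wind", "heavy rain", "rip current")
category <- rep(NA_character_, length(events))

# Rules run top to bottom, in the same relative order as in the report
category[grepl("rain|hail", events)]    <- "Precipitation & Fog"
category[grepl("wind|storm", events)]   <- "Wind & Storm"
category[grepl("current|high", events)] <- "High seas"
category[grepl("tstm|thunderstorm|lightning", events)] <- "Thunderstorm & Lightning"

data.frame(event = events, category = category)
```

Here "thunderstorm wind" first matches the wind rule and is then reassigned to "Thunderstorm & Lightning". In the same way, "high wind" ends up under "High seas" because the broad `high` pattern runs after the wind rule. Whether that is the desired outcome depends on the analysis, so it is worth checking with `table(storm$EVTYPE, storm$SourceType)` on the real data.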
Dollar values of property and crop damage are needed for the rest of the analysis. The symbols in the DMGEXP columns are treated as powers of 10 applied to the DMG columns (h = 2, k = 3, m = 6, b = 9; digits are used as they are), while "-", "+", "?" and blank entries are set to NA. Cleaning the DMGEXP columns this way gives the final damage values.
Based on the summary values, the flood observation with 115 billion dollars of property damage looks like an outlier, so we remove it from the analysis.
## find property damage exponents and assign the proper values
unique(storm$PROPDMGEXP)
## [1] K M B m + 0 5 6 4 h 2 7 3 H -
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
storm$PROPDMGEXP <- revalue(tolower(storm$PROPDMGEXP), c("-"=NA, "+"=NA, "b"=9, "k"=3, "m"=6, "h"=2))
storm[which(storm$PROPDMGEXP==""),]$PROPDMGEXP <- NA
## find crop damage exponents and assign the proper values
unique(storm$CROPDMGEXP)
## [1] M K m B ? 0 k
## Levels: ? 0 2 B k K m M
storm$CROPDMGEXP <- revalue(tolower(storm$CROPDMGEXP), c("b"=9, "k"=3, "m"=6, "?"=NA))
storm[which(storm$CROPDMGEXP==""),]$CROPDMGEXP <- NA
storm$PROPDMG_Clean <- storm$PROPDMG * (10^as.numeric(storm$PROPDMGEXP))
storm$CROPDMG_Clean <- storm$CROPDMG * (10^as.numeric(storm$CROPDMGEXP))
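The exponent conversion can also be written with a named lookup vector in base R. This is a minimal sketch on made-up values, not the report's method: the names `exp_lookup`, `dmg`, and `dmg_exp` are hypothetical, and unlike the `revalue()` approach above it maps only the letter codes, so digit exponents would come out as NA here.

```r
# Letter codes and the power of 10 each one stands for
exp_lookup <- c(h = 2, k = 3, m = 6, b = 9)

# Made-up damage figures with mixed-case and blank exponents
dmg     <- c(25, 2.5, 1.2, 10)
dmg_exp <- c("K", "m", "B", "")

# Indexing by name returns NA for codes that are not in the lookup
power   <- unname(exp_lookup[tolower(dmg_exp)])
dollars <- dmg * 10^power
dollars
```

The blank exponent yields NA, which matches how the report treats records without a usable exponent.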
## summary of damages
summary(storm$PROPDMG_Clean)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000e+00 2.500e+03 1.000e+04 1.762e+06 4.200e+04 1.150e+11 11591
summary(storm$CROPDMG_Clean)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000e+00 0.000e+00 0.000e+00 4.816e+05 0.000e+00 5.000e+09 152670
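## remove the outlier record (flood with 115 billion property damage)
## note: rows whose PROPDMG_Clean is NA compare as NA and come back as all-NA rows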
storm <- storm[(storm$PROPDMG_Clean!=115000000000),]
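Logical subsetting treats NA in the index in a way that is easy to miss. The sketch below shows the behavior on a made-up three-row data frame, together with one possible NA-safe variant; `toy`, `plain`, `is_outlier`, and `safe` are hypothetical names.

```r
toy <- data.frame(
  FATALITIES    = c(1, 2, 3),
  PROPDMG_Clean = c(1e3, NA, 1.15e11)
)

# Plain comparison: the index is TRUE, NA, FALSE,
# and the NA entry comes back as a row filled with NA
plain <- toy[toy$PROPDMG_Clean != 1.15e11, ]

# NA-safe variant: flag only the outlier, keep rows with missing damage
is_outlier <- !is.na(toy$PROPDMG_Clean) & toy$PROPDMG_Clean == 1.15e11
safe <- toy[!is_outlier, ]

plain$FATALITIES
safe$FATALITIES
```

With the plain comparison, the second record loses its fatality count along with its missing damage value, while the NA-safe variant keeps it. Which behavior is wanted depends on whether records without a damage estimate should still count toward the health totals.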
Population health is affected in two ways: fatalities and injuries. The plots below show the impact of harmful events on population health for each of these measures.
fatalityplot <- ggplot(storm[!is.na(storm$FATALITIES),], aes(x = SourceType,y = FATALITIES,fill=SourceType))+geom_bar(stat = "identity", show.legend = F)
fatalityplot <- fatalityplot +labs(x="Events Type", y="Total Fatalities")
fatalityplot <- fatalityplot + ggtitle("Most Fatal Weather Events")+ theme(axis.text.x = element_text(angle = 90, hjust = 1))
injuriesplot <- ggplot(storm[!is.na(storm$INJURIES),], aes(x = SourceType,y = INJURIES,fill=SourceType))+geom_bar(stat = "identity", show.legend = F)
injuriesplot <- injuriesplot +labs(x="Event Type", y="Total Injuries")
injuriesplot <- injuriesplot + ggtitle("Most Injurious Weather Events")+ theme(axis.text.x = element_text(angle = 90, hjust = 1))
grid.arrange(fatalityplot, injuriesplot, ncol=2)
## Warning: Removed 31 rows containing missing values (position_stack).
## Warning: Removed 31 rows containing missing values (position_stack).
Based on the plots, tornadoes clearly cause the most deaths and injuries of all event types. Tornadoes caused more than 5,000 deaths and more than 10,000 injuries in the US over the roughly 60 years covered by the data.
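The bar heights in such plots are per-category sums, so the same ranking can be read from a small table. The sketch below shows the idea on made-up numbers (the values in `toy` are illustrative only and are not taken from the storm data).

```r
toy <- data.frame(
  SourceType = c("Tornado", "Tornado", "Heat & Drought", "Snow & Ice"),
  FATALITIES = c(3, 2, 4, 1),
  INJURIES   = c(15, 0, 2, 0)
)

# Sum both health measures per category in one call
health <- aggregate(cbind(FATALITIES, INJURIES) ~ SourceType, toy, sum)

# Largest fatality totals first
health <- health[order(-health$FATALITIES), ]
health
```

Running the same two lines on `storm` instead of `toy` would give the exact totals behind each bar.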
The economic impact of weather events is measured by property and crop damage. The graph below shows the total damage in US dollars caused by each type of harmful event.
prop_damage <- aggregate(PROPDMG_Clean~SourceType, storm, sum)
crop_damage <- aggregate(CROPDMG_Clean~SourceType, storm, sum)
total_damage <- merge(prop_damage, crop_damage)
colnames(total_damage) <- c("EventType", "PropertyDamage", "CropDamage")
total_damage <- melt(total_damage, id.vars = c("EventType"), measure.vars = c("PropertyDamage", "CropDamage"))
colnames(total_damage) <- c("EventType", "DamageType", "Value")
econo_plt <- ggplot(total_damage, aes(x=EventType, y=Value, fill=DamageType)) + geom_bar(stat="identity")
econo_plt <- econo_plt + labs(x="Event Type", y="Total Damage (US dollars)")
econo_plt <- econo_plt + ggtitle("Most Expensive Weather Events")+ theme(axis.text.x = element_text(angle = 90, hjust = 1))
econo_plt
Based on the plot, we can conclude that “Wind & Storm” events cause the worst economic consequences.
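The totals behind this plot come from two separate `aggregate()` calls joined with `merge()`. The sketch below uses made-up values to show how that combination handles missing data: the formula interface of `aggregate()` drops rows with NA before summing, and `merge()` keeps only categories present in both tables unless `all = TRUE` is given. The names `prop`, `crop`, `both`, and `both_all` are hypothetical.

```r
toy <- data.frame(
  SourceType    = c("Wind & Storm", "Wind & Storm", "Tornado"),
  PROPDMG_Clean = c(1e6, 2e6, 5e5),
  CROPDMG_Clean = c(NA, 1e5, NA)
)

# Each call silently drops the rows that are NA in its own column
prop <- aggregate(PROPDMG_Clean ~ SourceType, toy, sum)
crop <- aggregate(CROPDMG_Clean ~ SourceType, toy, sum)

# Default merge is an inner join: Tornado has no crop total and disappears
both <- merge(prop, crop)

# all = TRUE keeps Tornado, with NA for the missing crop total
both_all <- merge(prop, crop, all = TRUE)
both_all
```

On the real data every category may well appear in both tables, in which case the two forms give the same result. Comparing `nrow(prop_damage)` with `nrow(total_damage)` before the `melt()` step is a cheap way to confirm it.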