Data from the US National Weather Service over the past 60 years indicates that hurricanes, floods, and tornados had the greatest economic impact, in the hundreds of billions of dollars, while tornados and excessive heat had the greatest health impact, killing or injuring tens of thousands over this time frame.
Certain R libraries will be needed during the analysis, and are loaded here.
library(dplyr)
library(tidyverse)
library(stringdist)
The data is retrieved from an online source, and stored in the current directory.
# setwd("C:/Users/david/Documents/R/datasciencecoursera/datascience-course5-week4-project2")
url<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destfile<-"StormData.csv.bz2"
### download.file(url,destfile)
The retrieval time is recorded below. The data, in compressed CSV format, was then loaded into a tidyverse tibble, stormData, and its fields and size were examined. (On repeated runs, the tibble was saved to an RDA file and loaded directly from there rather than from the internet or the downloaded file.)
print("date downloaded:"); print (Sys.time())
## [1] "date downloaded:"
## [1] "2019-06-01 18:38:26 MDT"
# stormData <- read_csv("StormData.csv.bz2")
# save(stormData,file="stormData.rda")
load("stormData.rda")
names(stormData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
object.size(stormData)
## 645406264 bytes
dim(stormData)
## [1] 902297 37
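The save-once, load-afterwards pattern described above can be sketched as a small helper. This is only an illustration: the function name cached is hypothetical, and it uses saveRDS/readRDS rather than the save/load calls used in this report, because those return the object directly.

```r
# Hypothetical helper (not part of the original analysis): build an object
# once, store it on disk, and read the stored copy on later calls.
cached <- function(cache_path, build) {
  if (file.exists(cache_path)) {
    readRDS(cache_path)          # cache hit: skip the expensive step
  } else {
    value <- build()             # cache miss: run the expensive step once
    saveRDS(value, cache_path)
    value
  }
}

# Demonstration with a temporary file and a cheap stand-in for the real read.
cache_file <- tempfile(fileext = ".rds")
toy <- cached(cache_file, function() data.frame(a = 1:3))
toy_again <- cached(cache_file, function() stop("not called on a cache hit"))
```

For the storm data, the build function would be something like function() read_csv("StormData.csv.bz2"), assuming the compressed file has already been downloaded.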
We restrict the fields we retain to those relevant to the current project, leaving us with the table stormDataHealthDamage.
stormDataHealthDamage<-select(stormData,c("EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP","REFNUM"))
A first pass at automatically cleaning the event type variable, EVTYPE, is made. First we note that every row has a value for EVTYPE. We count the number of distinct event types after changing them all to lower case. The original table has reference numbers, REFNUM, but these are slightly out of order with the row numbers, so we re-order the rows by REFNUM. We add a column, best, holding the reference number of the closest earlier event type, and populate it using the fuzzy matching function amatch with maxDist set to 2. This reduces the number of event types, but does not eliminate all variant names for events. We use left_join to bring this reduced list of names back into the full table as stormDataHealthDamageEv.
sum(is.na(stormDataHealthDamage$EVTYPE))
## [1] 0
eventTypes<-sort(unique(tolower(stormDataHealthDamage$EVTYPE)))
length(eventTypes)
## [1] 890
stormDataHealthDamage=arrange(stormDataHealthDamage,REFNUM)
stormDataEvents<-stormDataHealthDamage
stormDataEvents$EVTYPE<-tolower(stormDataEvents$EVTYPE)
stormDataEventsDistinct<-distinct(stormDataEvents,EVTYPE,.keep_all=TRUE)
stormDataEventsDistinctBest<-mutate(stormDataEventsDistinct,best=NA)
stormDataEventsDistinctBestTemp<-stormDataEventsDistinctBest
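# For distinct event type k, find the closest earlier type (default OSA distance, at most 2)
# and record that type's REFNUM; if none is close enough, keep the type's own REFNUM.
findBestAgrep <- function (k) {
  earlier <- stormDataEventsDistinctBestTemp$EVTYPE[seq_len(k-1)]  # rows 1..k-1, empty when k is 1
  x <- stormDataEventsDistinctBestTemp$REFNUM[amatch(stormDataEventsDistinctBestTemp$EVTYPE[k],earlier,maxDist=2)]
  if (is.na(x)) { x <- stormDataEventsDistinctBestTemp$REFNUM[k] }
  stormDataEventsDistinctBest$best[k] <<- x
}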
for (k in 1:nrow(stormDataEventsDistinctBest)) { findBestAgrep(k) }
length(unique(stormDataEventsDistinctBest$best))
## [1] 655
stormDataEventsDistinctOnlyBest<-transmute(stormDataEventsDistinctBest,EVTYPE,best)
stormDataEventsAgrep<-arrange(left_join(stormDataEvents,stormDataEventsDistinctOnlyBest,by="EVTYPE"),REFNUM)
stormDataEventsAgrepTrans<-mutate(stormDataEventsAgrep,bestEVTYPE=stormDataEvents$EVTYPE[best])
stormDataHealthDamageEv<-stormDataEventsAgrepTrans
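To make the matching step concrete, here is a minimal sketch on a handful of made-up event names (they are illustrative and not drawn from the data set). Each name is compared only with the names that precede it, as in the loop above, using amatch with its default OSA method.

```r
library(stringdist)

# Toy event names, chosen for illustration only.
types <- c("tstm wind", "tstm winds", "hail", "hurricane", "huricane", "flood")

# For each name: index of the closest earlier name within OSA distance 2,
# or the name's own index when no earlier name is close enough.
canonical <- vapply(seq_along(types), function(k) {
  hit <- amatch(types[k], types[seq_len(k - 1)], maxDist = 2)
  if (is.na(hit)) k else hit
}, integer(1))

data.frame(type = types, merged_into = types[canonical])
```

In this toy vector "tstm winds" folds into "tstm wind" and "huricane" into "hurricane", while the other names stay distinct. Short names are a risk at this threshold: "rain" and "hail" differ by only two substitutions, so a maximum distance of 2 would merge them as well.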
The dollar amounts of damage, stored in PROPDMG and PROPDMGEXP for property damage, and CROPDMG and CROPDMGEXP for crop damage, also need cleaning. We take the letters K, M, B, H to represent exponents associated with thousands, millions, billions, and hundreds, while single digits directly represent base ten exponents. Any other values are treated as an exponent of 0 (a factor of 1).
Finally, we combine the mantissa and exponent to give a single numeric value for each variable, stored in PROPDMGValue and CROPDMGValue in the table stormDataHealthDamageVars.
x<-stormDataHealthDamageEv$PROPDMGEXP
x[is.na(x)] <- 0
x<-sub("[Kk]",3,x)
x<-sub("[Mm]",6,x)
x<-sub("[Bb]",9,x)
x<-sub("[Hh]",2,x)
x<-sub("[^0-9]",0,x)
x<-as.numeric(x)
stormDataHealthDamageEv$PROPDMGEXP<-x
stormDataHealthDamageEvValue<-mutate(stormDataHealthDamageEv,PROPDMGValue=PROPDMG*(10^PROPDMGEXP))
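# Start again from the crop exponent codes; without this, x still holds the property exponents
x<-stormDataHealthDamageEvValue$CROPDMGEXP
x[is.na(x)] <- 0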
x<-sub("[Kk]",3,x)
x<-sub("[Mm]",6,x)
x<-sub("[Bb]",9,x)
x<-sub("[Hh]",2,x)
x<-sub("[^0-9]",0,x)
x<-as.numeric(x)
stormDataHealthDamageEvValue$CROPDMGEXP<-x
stormDataHealthDamageEvValue<-mutate(stormDataHealthDamageEvValue,CROPDMGValue=CROPDMG*(10^CROPDMGEXP))
stormDataHealthDamageVars<-select(stormDataHealthDamageEvValue,REFNUM,bestEVTYPE,FATALITIES,INJURIES,PROPDMGValue,CROPDMGValue)
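The exponent rules above can also be written as one vectorized lookup. The sketch below is an alternative formulation for illustration, not the code used in this report; the helper name decode_exponent and the sample values are hypothetical.

```r
# Hypothetical helper: translate an exponent code into a power of ten.
# H/K/M/B (either case) map to 2/3/6/9, a single digit maps to itself,
# and anything else (including NA) maps to 0, i.e. a factor of 1.
decode_exponent <- function(code) {
  code <- toupper(ifelse(is.na(code), "", as.character(code)))
  letter_map <- c(H = 2, K = 3, M = 6, B = 9)
  expo <- unname(letter_map[code])            # NA for anything not a known letter
  is_digit <- grepl("^[0-9]$", code)
  expo[is_digit] <- as.numeric(code[is_digit])
  expo[is.na(expo)] <- 0
  expo
}

# Made-up mantissas and codes, covering each rule once.
codes   <- c("K", "m", "B", "h", "5", NA, "+", "?")
amounts <- c(2.5, 1, 3, 7, 1, 10, 4, 2)
amounts * 10^decode_exponent(codes)
```

A lookup of this kind handles property and crop codes with the same function call, which avoids repeating the substitution steps for each column.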
To compute a single number for each of the health and cost impacts, we sum the PROPDMGValue and CROPDMGValue variables into the new variable COSTIMPACT, and take a weighted sum of the counts of FATALITIES and INJURIES, with ten injuries counted as one fatality, as the new variable HEALTHIMPACT.
stormDataHealthDamageVars<-group_by(mutate(stormDataHealthDamageVars,HEALTHIMPACT=FATALITIES+INJURIES/10,COSTIMPACT=PROPDMGValue+CROPDMGValue),bestEVTYPE)
dataCost=head(arrange(summarize(stormDataHealthDamageVars,CostImpact=sum(COSTIMPACT),HealthImpact=sum(HEALTHIMPACT)),desc(CostImpact)),20)
dataHealth=head(arrange(summarize(stormDataHealthDamageVars,CostImpact=sum(COSTIMPACT),HealthImpact=sum(HEALTHIMPACT)),desc(HealthImpact)),20)
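As a worked illustration of the two impact measures, the following sketch applies the same formulas to four made-up rows (the numbers are invented, not taken from the storm data) and totals them by event type.

```r
library(dplyr)

# Invented rows, for illustration only.
events <- data.frame(
  bestEVTYPE   = c("tornado", "tornado", "heat", "flood"),
  FATALITIES   = c(2, 0, 5, 1),
  INJURIES     = c(30, 10, 0, 0),
  PROPDMGValue = c(1e6, 5e5, 0, 2e6),
  CROPDMGValue = c(0, 0, 1e4, 5e5),
  stringsAsFactors = FALSE
)

impact <- events %>%
  mutate(HEALTHIMPACT = FATALITIES + INJURIES / 10,     # ten injuries count as one fatality
         COSTIMPACT   = PROPDMGValue + CROPDMGValue) %>% # property plus crop damage
  group_by(bestEVTYPE) %>%
  summarise(HealthImpact = sum(HEALTHIMPACT),
            CostImpact   = sum(COSTIMPACT)) %>%
  arrange(desc(HealthImpact))

impact
```

Here the two tornado rows contribute 2 + 30/10 and 0 + 10/10, for a health impact of 6, which ranks ahead of heat (5) and flood (1); sorting by CostImpact instead would put flood first.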
Finally, we pre-compute graphs of the top 20 impacts of each type, economic cost and health, for use in the results section.
dataCost$bestEVTYPE <- factor(dataCost$bestEVTYPE, levels = dataCost$bestEVTYPE[order(dataCost$CostImpact)])
ggCost<-ggplot(dataCost,aes(x=bestEVTYPE,y=CostImpact))+geom_bar(stat="identity")+coord_flip()+xlab("Event Type")+ylab("Economic Cost Impact in Dollars")+ggtitle("Economic Cost of Weather Events from 1950--2011 in the US")
dataHealth$bestEVTYPE <- factor(dataHealth$bestEVTYPE, levels = dataHealth$bestEVTYPE[order(dataHealth$HealthImpact)])
ggHealth<-ggplot(dataHealth,aes(x=bestEVTYPE,y=HealthImpact))+geom_bar(stat="identity")+coord_flip()+xlab("Event Type")+ylab("Health Impact in Fatalities plus a tenth of Injuries")+ggtitle("Health Cost of Weather Events from 1950--2011 in the US")
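The factor() calls above control the order of the bars: ggplot2 draws a discrete axis in the order of the factor levels, so re-levelling by impact sorts the chart. A minimal base R sketch of that re-levelling step, on invented values:

```r
# Invented summary table, for illustration only.
toy <- data.frame(type = c("flood", "tornado", "heat"),
                  impact = c(10, 60, 25),
                  stringsAsFactors = FALSE)

# Levels are set in increasing order of impact, smallest first.
toy$type <- factor(toy$type, levels = toy$type[order(toy$impact)])
levels(toy$type)
```

With coord_flip() the first level is drawn at the bottom, so the event type with the largest impact should end up at the top of the chart.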
The economic impact of storm events in the United States over the period from 1950 to 2011 was enormous, totalling 2.2 trillion dollars. The impact on health, measured by a weighted sum of fatalities and injuries, was likewise large, at approximately 30,000. The data is taken from National Weather Service Storm Data.
The list of event types in this database was large and disorganized, initially comprising 890 types. A first pass at combining types that were measurably close to one another (fuzzy matching using the OSA method with a maximum distance of 2) collapsed this to 655 types, but additional hand work would clearly collapse it considerably more; this study was not funded at the level required to do so. For example, the top two events in terms of economic impact were hurricane and hurricane/typhoon, which remain separate types.
The top 20 events in terms of economic impact are given in this graphic. Hurricanes, Floods and Tornados lead the list.
print(ggCost)
The top 20 events in terms of health impact are given in this graphic. Tornados, Heat and Lightning top this list.
print(ggHealth)
The health impact was measured by a weighted sum of the number of fatalities and injuries, with 10 injuries being roughly equated with one fatality.
The leaders in each category lead by a decisive margin, so mitigation efforts might best be directed at the effects of tornados and hurricanes.
___END___