The key goal of the analysis is to identify most harmful weather event(s) across USA for both Public and Economic perspectives. The data (Storm Data [47Mb]) for this analysis was obtained from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The key variables included year of the observation, levels of injuries and fatalities for public health assessment, as well as property and crop damage for assessment of the economic problems. As the most weather events has been recorded after 1993, the data was analyzed based on the year range 1993 ~ 2011. The explanation for the crop damage and property damage exp columns were obtained from https://rstudio-pubs-static.s3.amazonaws.com/58957_37b6723ee52b455990e149edde45e5b6.html
Required libraries dplyr; ggplot2; tidyverse; plyr; reshape2
Part 1: Loading Required Dataset
#Download required file from the URL
myurl<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(myurl, destfile='StormData.csv', method = 'curl')
StormData<-read.csv("StormData.csv")
#Subset only essential columns: type, date, fatalities and injuries, property and crop damages.
#As some events are entered with both low and upper case, make all events with the upper case.
Events<-subset(StormData, select=c(EVTYPE, BGN_DATE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
Events$EVTYPE<-toupper(Events$EVTYPE)
head(Events)
## EVTYPE BGN_DATE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 1 TORNADO 4/18/1950 0:00:00 0 15 25.0 K 0
## 2 TORNADO 4/18/1950 0:00:00 0 0 2.5 K 0
## 3 TORNADO 2/20/1951 0:00:00 0 2 25.0 K 0
## 4 TORNADO 6/8/1951 0:00:00 0 2 2.5 K 0
## 5 TORNADO 11/15/1951 0:00:00 0 2 2.5 K 0
## 6 TORNADO 11/15/1951 0:00:00 0 6 2.5 K 0
## CROPDMGEXP
## 1
## 2
## 3
## 4
## 5
## 6
Part 2: Processing Data
#Convert BGN_DATE column into years only to see the year range for the observed events.
Events$BGN_DATE<-as.Date(Events$BGN_DATE, format='%m/%d/%Y %H:%M:%S')
Events$BGN_DATE<-format(Events$BGN_DATE, format="%Y")
#Convert Property Damage and Crop Damage values into numeric with the applicable index:
unique(Events$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(Events$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
There are: K, M, B, digits, and empty or ?, or 0, meaning 1000,1000000,1000000000,10,0 respectfully Assumption: the meaningful data will contain large values, this the subset for K,M,B values only. But first for matching purposes all EXP values should be converted to the upper case, then subset and converted to numeric values. As a result the final value for each property and crop damage level can be obtained by multiplying the value by the EXP code, that was converted to the respective numeric value.
Events$PROPDMGEXP<-toupper(Events$PROPDMGEXP)
Events$CROPDMGEXP<-toupper(Events$CROPDMGEXP)
Events<-Events[which(Events$PROPDMGEXP %in% c('K','M','B') | Events$CROPDMGEXP %in% c('K','M','B')),]
Events$PROPDMGEXP<-mapvalues(as.vector(Events$PROPDMGEXP), c('K','M','B','5','0','3',''), c(1000,1000000,1000000000,10,10,10,0))
Events$CROPDMGEXP<-mapvalues(as.vector(Events$CROPDMGEXP), c('K','M','B','0','?',''), c(1000,1000000,1000000000,10,0,0))
Events<-mutate(Events, PROPDMG=PROPDMG*(as.numeric(PROPDMGEXP)), CROPDMG=CROPDMG*(as.numeric(CROPDMGEXP)))
Events<-Events %>% select(1,2,3,4,5,7)
head(Events)
## EVTYPE BGN_DATE FATALITIES INJURIES PROPDMG CROPDMG
## 1 TORNADO 1950 0 15 25000 0
## 2 TORNADO 1950 0 0 2500 0
## 3 TORNADO 1951 0 2 25000 0
## 4 TORNADO 1951 0 2 2500 0
## 5 TORNADO 1951 0 2 2500 0
## 6 TORNADO 1951 0 6 2500 0
For overall estimation it is meaningful to combine fatalities and injuries to assess population health damage and property and crop damage to assess economic damage
Events<-
Events %>%
mutate(Population.Harm=FATALITIES+INJURIES,
Economic.Damage=PROPDMG+CROPDMG)
Events<-Events %>% select(1,2,7,8)
head(Events)
## EVTYPE BGN_DATE Population.Harm Economic.Damage
## 1 TORNADO 1950 15 25000
## 2 TORNADO 1950 0 2500
## 3 TORNADO 1951 2 25000
## 4 TORNADO 1951 2 2500
## 5 TORNADO 1951 2 2500
## 6 TORNADO 1951 6 2500
Final step is to agregate by event type with total of Population Harm and Economic Damage, as well as track the first year the event has been observed.
Events<-ddply(Events, .(EVTYPE), summarize,
BGN_DATE=min(BGN_DATE),
Population.Harm = sum(Population.Harm),
Economic.Damage= sum(Economic.Damage))
head(arrange(Events, Events$BGN_DATE))
## EVTYPE BGN_DATE Population.Harm Economic.Damage
## 1 TORNADO 1950 96017 57352115720
## 2 AVALANCHE 1993 174 3721800
## 3 BLIZZARD 1993 848 771273950
## 4 BLIZZARD/WINTER STORM 1993 0 500000
## 5 COASTAL FLOOD 1993 4 259570560
## 6 COASTAL FLOODING 1993 3 133021500
Note! There is an interesting observation: Only Tornado has been tracked from 1950, data should be accounted from only 1993, in addition there is no need to carry zero values for damage or health harm as well. Therefore, the final data processing step is to select only data from 1993 year that is not equate to zero level of damage or harm
Events<-Events[Events$BGN_DATE>=1993&Events$Population.Harm>0&Events$Economic.Damage>0,]
head(Events)
## EVTYPE BGN_DATE Population.Harm Economic.Damage
## 10 AVALANCHE 1993 174 3721800
## 12 BLIZZARD 1993 848 771273950
## 15 BLOWING SNOW 1997 2 15000
## 19 COASTAL FLOOD 1993 4 259570560
## 20 COASTAL FLOODING 1993 3 133021500
## 21 COASTAL FLOODING/EROSION 1998 5 20030000
1. Across the United States, which types of events EVTYPE variable are most harmful with respect to population health? Here are the top 10 events
Events<-arrange(Events, desc(Events$Population.Harm))
topharm<-head(Events, 10)
topharm
## EVTYPE BGN_DATE Population.Harm Economic.Damage
## 1 FLOOD 1993 7152 150319678250
## 2 TSTM WIND 1993 2911 5038965790
## 3 FLASH FLOOD 1993 2272 17562129144
## 4 ICE STORM 1993 1891 8967041810
## 5 LIGHTNING 1993 1821 940751370
## 6 HEAT 1994 1781 403258500
## 7 THUNDERSTORM WIND 1993 1599 3897964190
## 8 HURRICANE/TYPHOON 2002 1337 71913712800
## 9 EXCESSIVE HEAT 1995 1137 500155700
## 10 WINTER STORM 1993 1116 6715441250
ggplot(data=topharm, aes(EVTYPE, Population.Harm))+
geom_point(color='steelblue')+
ggtitle('Top 10% Harmful Weather Events across USA')+
ylab('Number of casualties')+
xlab('')+
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
2. Across the United States, which types of events have the greatest economic consequences? Here are the top 10 events
Events<-arrange(Events, desc(Events$Economic.Damage))
topdamage<-head(Events, 10)
topdamage
## EVTYPE BGN_DATE Population.Harm Economic.Damage
## 1 FLOOD 1993 7152 150319678250
## 2 HURRICANE/TYPHOON 2002 1337 71913712800
## 3 STORM SURGE 1993 20 43323541000
## 4 HAIL 1993 658 18758221470
## 5 FLASH FLOOD 1993 2272 17562129144
## 6 DROUGHT 1993 4 15018672000
## 7 HURRICANE 1993 82 14610229010
## 8 RIVER FLOOD 1993 3 10148404500
## 9 ICE STORM 1993 1891 8967041810
## 10 TROPICAL STORM 1993 385 8382236550
ggplot(data=topdamage, aes(EVTYPE, Economic.Damage))+
geom_point(color='blue')+
ggtitle('Top 10% Economically Harmful Weather Events across USA', )+
ylab('Number of casualties')+
xlab('')+
theme(plot.title = element_text(size = 10),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))
As it is seen from the graphs and the tables the top weather event that is the most harmful for the economy and population across USA between 1993 and 2011 is FLOOD.