Synopsis:

The key goal of the analysis is to identify most harmful weather event(s) across USA for both Public and Economic perspectives. The data (Storm Data [47Mb]) for this analysis was obtained from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The key variables included year of the observation, levels of injuries and fatalities for public health assessment, as well as property and crop damage for assessment of the economic problems. As the most weather events has been recorded after 1993, the data was analyzed based on the year range 1993 ~ 2011. The explanation for the crop damage and property damage exp columns were obtained from https://rstudio-pubs-static.s3.amazonaws.com/58957_37b6723ee52b455990e149edde45e5b6.html

Data Processing

Required libraries dplyr; ggplot2; tidyverse; plyr; reshape2

Part 1: Loading Required Dataset

#Download required file from the URL
myurl<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(myurl, destfile='StormData.csv', method = 'curl')
StormData<-read.csv("StormData.csv")


#Subset only essential columns: type, date, fatalities and injuries, property and crop damages.
#As some events are entered with both low and upper case, make all events with the upper case.
Events<-subset(StormData, select=c(EVTYPE, BGN_DATE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
Events$EVTYPE<-toupper(Events$EVTYPE)
head(Events)
##    EVTYPE           BGN_DATE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 1 TORNADO  4/18/1950 0:00:00          0       15    25.0          K       0
## 2 TORNADO  4/18/1950 0:00:00          0        0     2.5          K       0
## 3 TORNADO  2/20/1951 0:00:00          0        2    25.0          K       0
## 4 TORNADO   6/8/1951 0:00:00          0        2     2.5          K       0
## 5 TORNADO 11/15/1951 0:00:00          0        2     2.5          K       0
## 6 TORNADO 11/15/1951 0:00:00          0        6     2.5          K       0
##   CROPDMGEXP
## 1           
## 2           
## 3           
## 4           
## 5           
## 6

Part 2: Processing Data

#Convert BGN_DATE column into years only to see the year range for the observed events.
Events$BGN_DATE<-as.Date(Events$BGN_DATE, format='%m/%d/%Y %H:%M:%S')
Events$BGN_DATE<-format(Events$BGN_DATE, format="%Y")

#Convert Property Damage and Crop Damage values into numeric with the applicable index:
unique(Events$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(Events$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

There are: K, M, B, digits, and empty or ?, or 0, meaning 1000,1000000,1000000000,10,0 respectfully Assumption: the meaningful data will contain large values, this the subset for K,M,B values only. But first for matching purposes all EXP values should be converted to the upper case, then subset and converted to numeric values. As a result the final value for each property and crop damage level can be obtained by multiplying the value by the EXP code, that was converted to the respective numeric value.

Events$PROPDMGEXP<-toupper(Events$PROPDMGEXP)
Events$CROPDMGEXP<-toupper(Events$CROPDMGEXP)
Events<-Events[which(Events$PROPDMGEXP %in% c('K','M','B') | Events$CROPDMGEXP %in% c('K','M','B')),]
Events$PROPDMGEXP<-mapvalues(as.vector(Events$PROPDMGEXP), c('K','M','B','5','0','3',''), c(1000,1000000,1000000000,10,10,10,0))
Events$CROPDMGEXP<-mapvalues(as.vector(Events$CROPDMGEXP), c('K','M','B','0','?',''), c(1000,1000000,1000000000,10,0,0))
Events<-mutate(Events, PROPDMG=PROPDMG*(as.numeric(PROPDMGEXP)), CROPDMG=CROPDMG*(as.numeric(CROPDMGEXP)))
Events<-Events %>% select(1,2,3,4,5,7)
head(Events)
##    EVTYPE BGN_DATE FATALITIES INJURIES PROPDMG CROPDMG
## 1 TORNADO     1950          0       15   25000       0
## 2 TORNADO     1950          0        0    2500       0
## 3 TORNADO     1951          0        2   25000       0
## 4 TORNADO     1951          0        2    2500       0
## 5 TORNADO     1951          0        2    2500       0
## 6 TORNADO     1951          0        6    2500       0

For overall estimation it is meaningful to combine fatalities and injuries to assess population health damage and property and crop damage to assess economic damage

Events<- 
  Events %>% 
  mutate(Population.Harm=FATALITIES+INJURIES,
         Economic.Damage=PROPDMG+CROPDMG)

Events<-Events %>% select(1,2,7,8)

head(Events)
##    EVTYPE BGN_DATE Population.Harm Economic.Damage
## 1 TORNADO     1950              15           25000
## 2 TORNADO     1950               0            2500
## 3 TORNADO     1951               2           25000
## 4 TORNADO     1951               2            2500
## 5 TORNADO     1951               2            2500
## 6 TORNADO     1951               6            2500

Final step is to agregate by event type with total of Population Harm and Economic Damage, as well as track the first year the event has been observed.

Events<-ddply(Events, .(EVTYPE), summarize, 
      BGN_DATE=min(BGN_DATE),
      Population.Harm = sum(Population.Harm),  
      Economic.Damage= sum(Economic.Damage))
head(arrange(Events, Events$BGN_DATE))
##                  EVTYPE BGN_DATE Population.Harm Economic.Damage
## 1               TORNADO     1950           96017     57352115720
## 2             AVALANCHE     1993             174         3721800
## 3              BLIZZARD     1993             848       771273950
## 4 BLIZZARD/WINTER STORM     1993               0          500000
## 5         COASTAL FLOOD     1993               4       259570560
## 6      COASTAL FLOODING     1993               3       133021500

Note! There is an interesting observation: Only Tornado has been tracked from 1950, data should be accounted from only 1993, in addition there is no need to carry zero values for damage or health harm as well. Therefore, the final data processing step is to select only data from 1993 year that is not equate to zero level of damage or harm

Events<-Events[Events$BGN_DATE>=1993&Events$Population.Harm>0&Events$Economic.Damage>0,]
head(Events)
##                      EVTYPE BGN_DATE Population.Harm Economic.Damage
## 10                AVALANCHE     1993             174         3721800
## 12                 BLIZZARD     1993             848       771273950
## 15             BLOWING SNOW     1997               2           15000
## 19            COASTAL FLOOD     1993               4       259570560
## 20         COASTAL FLOODING     1993               3       133021500
## 21 COASTAL FLOODING/EROSION     1998               5        20030000

Results

1. Across the United States, which types of events EVTYPE variable are most harmful with respect to population health? Here are the top 10 events

Events<-arrange(Events, desc(Events$Population.Harm))
topharm<-head(Events, 10)
topharm
##               EVTYPE BGN_DATE Population.Harm Economic.Damage
## 1              FLOOD     1993            7152    150319678250
## 2          TSTM WIND     1993            2911      5038965790
## 3        FLASH FLOOD     1993            2272     17562129144
## 4          ICE STORM     1993            1891      8967041810
## 5          LIGHTNING     1993            1821       940751370
## 6               HEAT     1994            1781       403258500
## 7  THUNDERSTORM WIND     1993            1599      3897964190
## 8  HURRICANE/TYPHOON     2002            1337     71913712800
## 9     EXCESSIVE HEAT     1995            1137       500155700
## 10      WINTER STORM     1993            1116      6715441250
ggplot(data=topharm, aes(EVTYPE, Population.Harm))+
      geom_point(color='steelblue')+
      ggtitle('Top 10% Harmful Weather Events across USA')+
      ylab('Number of casualties')+
      xlab('')+
      theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

2. Across the United States, which types of events have the greatest economic consequences? Here are the top 10 events

Events<-arrange(Events, desc(Events$Economic.Damage))
topdamage<-head(Events, 10)
topdamage
##               EVTYPE BGN_DATE Population.Harm Economic.Damage
## 1              FLOOD     1993            7152    150319678250
## 2  HURRICANE/TYPHOON     2002            1337     71913712800
## 3        STORM SURGE     1993              20     43323541000
## 4               HAIL     1993             658     18758221470
## 5        FLASH FLOOD     1993            2272     17562129144
## 6            DROUGHT     1993               4     15018672000
## 7          HURRICANE     1993              82     14610229010
## 8        RIVER FLOOD     1993               3     10148404500
## 9          ICE STORM     1993            1891      8967041810
## 10    TROPICAL STORM     1993             385      8382236550
ggplot(data=topdamage, aes(EVTYPE, Economic.Damage))+
  geom_point(color='blue')+
  ggtitle('Top 10% Economically Harmful Weather Events across USA', )+
  ylab('Number of casualties')+
  xlab('')+
  theme(plot.title = element_text(size = 10),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

As it is seen from the graphs and the tables the top weather event that is the most harmful for the economy and population across USA between 1993 and 2011 is FLOOD.