TITLE: Assessment of Health and Economic Impacts from Weather Events 1996-2011

SYNOPSIS:

In this analysis I explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to determine which weather events have the greatest economic and health impacts. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. Data is from 1950 to November 2011 but only the impact of tornadoes was tracked prior to January 1, 1996. The data file has 902297 observations or storm evens and measures37 variables. There are 48 unique storm events described in NOAA web site that has this database, but due to poor configuration control there are 985 unique event s in the file. The initial analysis will be to remove data before 1996 so we have different weather events to compart and to cleanse the data down to a level where conclusive results of the standard unique events can be withdrawn. Filtering by date requires changing the date class from characteristic to date format. To determine health impact I add fatalities and injuries and eventually show tornado events have the greatest impact. Regarding economic impact, I add crop and property damage to determine overall impact, and eventually show flooding events have the greatest impact.

Load libraries Requred for Analysis

library(data.table)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

Read Data

DT<-read.csv(file='C:/Users/Kevin/Desktop/Data Analytics/JHU R/Course 5/project 2/repdata_data_StormData.csv.bz2')
DT<-data.table(DT)

Data Processing:

Calculate unique EVTYPE (note: should be 48)

DT_event<-DT %>% select(EVTYPE)
DT_event<-DT_event %>% group_by(EVTYPE)  %>% summarise(n=n())
dim(DT_event)
## [1] 985   2

Note: There are 985 unique values, way too many

Now, since observations before 01 JAN 1996 only lists tornadoes I will eliminate observations before this date.

DT$BGN_DATE<-as.character(DT$BGN_DATE)

DT$BGN_DATE<-as.Date(DT$BGN_DATE, "%m/%d/%Y")

DT_JAN96<-DT %>% filter(BGN_DATE>='1996-01-01')
#Note:  DT_JAN96 has 653530 observations so I have eliminated approximately 25K rows

RE- Calculate unique EVTYPE (note: should be 48)

DT_event_JAN96<-DT_JAN96 %>% select(EVTYPE)
DT_event_JAN96<-DT_event_JAN96 %>% group_by(EVTYPE)  %>% summarise(n=n())
dim(DT_event_JAN96)
## [1] 516   2

Now down to 516 unique events, still much greater than 48

I will select the variables needed to answer the questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  1. To answer this I will choose EVTYPE (type of event) and FATALITIES and INJURIES to indicate impact on health.
  1. Across the United States, which types of events have the greatest economic consequences?
  1. To answer this I will choosed EVTYPE and PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP to measure economic impact
DT_working<-DT_JAN96 %>% select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

Determine Most harmful Events

Select Events, Fatalities, and Injuries to narrow down the table

DT_injury<-DT_JAN96 %>% select(EVTYPE, FATALITIES, INJURIES)

‘Harmful’ will be measured by the sum of injuries and fatalities for a given event. Additionally, since the harmful effects are being assessed, the events with no fatalities or injuries will be filered out.

DT_injury_tot<-mutate(DT_injury, Tot_Harm=FATALITIES+INJURIES)
#eliminate the storm events with no health impact
DT_injury_tot<-DT_injury_tot %>% filter(Tot_Harm>0)
dim(DT_injury_tot)
## [1] 12764     4

There were 12,764 events that had a health impact

Determine unique events from these storms

DT_event_injury<-DT_injury_tot %>% select(EVTYPE)
DT_event_injury<-DT_event_injury %>% group_by(EVTYPE)  %>% summarise(n=n())
dim(DT_event_injury)
## [1] 138   2

There are now 138 unique types of events, down from 985 in the todal data base and 516 after all observatioins before 1996 were elminated

Now I will cleanse the EVTYPE that were entered with poor configuration control. For example entering TSTM vice THUNDERSTORM

DT_injury_tot$EVTYPE<-gsub('TSTM','THUNDERSTORM',DT_injury_tot$EVTYPE)

DT_injury_tot$EVTYPE<-gsub('WINTER WEATHER MIX','WINTER WEATHER',DT_injury_tot$EVTYPE)

DT_injury_tot$EVTYPE<-gsub('WINTER WEATHER/MIX','WINTER WEATHER',DT_injury_tot$EVTYPE)

DT_injury_tot$EVTYPE<-gsub('WINTRY MIX','WINTER WEATHER',DT_injury_tot$EVTYPE)

DT_injury_tot$EVTYPE[grepl('THUNDERSTORM WIND', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'THUNDERSTORM WIND'

DT_injury_tot$EVTYPE[grepl('GUSTY', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'HIGH WIND'

DT_injury_tot$EVTYPE[grepl('COASTAL', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'COASTAL FLOOD'

DT_injury_tot$EVTYPE[grepl('FREEZE', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'FROST/FREEZE'

DT_injury_tot$EVTYPE[grepl('WIND', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'HIGH WIND'

DT_injury_tot$EVTYPE[grepl('SURF', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'HIGH SURF'

DT_injury_tot$EVTYPE[grepl('ICE', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'ICE STORM'

DT_injury_tot$EVTYPE[grepl('COLD', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'COLD/WIND CHILL'

DT_injury_tot$EVTYPE[grepl('RAIN', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'HEAVY RAIN'

DT_injury_tot$EVTYPE[grepl('SNOW', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'HEAVY SNOW'

DT_injury_tot$EVTYPE[grepl('STORM SURGE', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'STORM SURGE/TIDE'

DT_injury_tot$EVTYPE[grepl('HURRICANE', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'HURRICANE (TYPHOON)'

DT_injury_tot$EVTYPE[grepl('TYPHOON', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'HURRICANE (TYPHOON)'

DT_injury_tot$EVTYPE[grepl('WILD/FOREST FIRE', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'WILDFIRE'

DT_injury_tot$EVTYPE<-toupper(DT_injury_tot$EVTYPE)

Now how many unique events are there?

DT_event_injury<-DT_injury_tot %>% select(EVTYPE)
DT_event_injury<-DT_event_injury %>% group_by(EVTYPE)  %>% summarise(n=n())
dim(DT_event_injury)
## [1] 82  2

Now down to 82 unique events, getting closer to 48. 82 may be close enough, let’s evaluate the tope ten

DT_event_injury_tot<-DT_injury_tot %>% group_by(EVTYPE) %>% summarise(sum=sum(Tot_Harm))
#sort damage amount from highest to lowest dollar amount
DT_event_injury_tot<-setorder(DT_event_injury_tot, -sum)

The top 10 events are standard events listed in the Storm Data Event Table. What percent of total harm is captured by the top 10 events?

totharm<-select(DT_event_injury_tot, sum)
percent<-(100*(apply(totharm[1:10,],2,sum))/(apply(totharm,2,sum)))
          percent
##     sum 
## 87.7749

The top 10 events capture 87.8% of all harm. I assess these results are sufficient to determine the most impactful events.

Graph the results

I<-ggplot(DT_event_injury_tot[1:10,], aes(x = reorder(EVTYPE, -sum), y = sum)) + 
  geom_col(fill='red',col='blue')+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
        labs(x="Event Type", y="Total Fatalities & Injuries", title="Top 10 Health Impac (Measured in Sum of Fatalities & Injuries)")

I

Tornadoes have the greatest health impact

Determine Econominc Impact

I will combine the EXP to the PROP or CROP damages to determine a damage amount

How many uniques values of PROPDMGEXP are there?

DT_workinga<-DT_working %>% select(PROPDMGEXP)
DT_workinga<-DT_workinga %>% group_by(PROPDMGEXP)  %>% summarise(n=n())
DT_workinga
## # A tibble: 5 x 2
##   PROPDMGEXP      n
##   <fct>       <int>
## 1 ""         276185
## 2 0               1
## 3 B              32
## 4 K          369938
## 5 M            7374

Answer: 4 plus blanks (0, B, K, M)

How many uniques values of CROPDMGEXP are there?

DT_workingb<-DT_working %>% select(CROPDMGEXP)
DT_workingb<-DT_workingb %>% group_by(CROPDMGEXP)  %>% summarise(n=n())
DT_workingb
## # A tibble: 4 x 2
##   CROPDMGEXP      n
##   <fct>       <int>
## 1 ""         373069
## 2 B               4
## 3 K          278686
## 4 M            1771

Answer: 3 (B, K, M) and blanks Note: from B is 1,000,000,000 multiplier, M is 1,000,000 multiplier, K is 1000 multiplier and blanks or zeros is a 0 multiplier

I will combine the exp and dollar amounts to determine total cost.

for(i in 1:653530) {
     
        if (DT_working$PROPDMGEXP[i]=='B'){
             DT_working$PROPDMG[i]<-DT_working$PROPDMG[i]*1000000000}
        else if (DT_working$PROPDMGEXP[i]=='K'){
             DT_working$PROPDMG[i]<-DT_working$PROPDMG[i]*1000}
        else if (DT_working$PROPDMGEXP[i]=='M'){
             DT_working$PROPDMG[i]<-DT_working$PROPDMG[i]*1000000}
        else {DT_working$PROPDMG[i]<-0}
     }   
for(i in 1:653530) {
     
        if (DT_working$CROPDMGEXP[i]=='B'){
             DT_working$CROPDMG[i]<-DT_working$CROPDMG[i]*1000000000}
        else if (DT_working$CROPDMGEXP[i]=='K'){
             DT_working$CROPDMG[i]<-DT_working$CROPDMG[i]*1000}
        else if (DT_working$CROPDMGEXP[i]=='M'){
             DT_working$CROPDMG[i]<-DT_working$CROPDMG[i]*1000000}
        else {DT_working$CROPDMG[i]<-0}
     }

Sum total damage

DT_working<-mutate(DT_working, Tot_damage=PROPDMG+CROPDMG)

Select Non-Zero Damage

DT_damage<-DT_working %>% filter(Tot_damage>0)

#determine unique EVTYPE values
DT_damage_a<-DT_damage %>% select(EVTYPE)
DT_damage_a<-DT_damage_a %>% group_by(EVTYPE)  %>% summarise(n=n())
dim(DT_damage_a)
## [1] 186   2

There are now 151 unique types of events, down from 985 in the todal data base and 516 after all observatioins before 1996 were elminated

Now I will cleanse the EVTYPE that were entered with poor configuration control. For example entering TSTM vice THUNDERSTORM

DT_damage$EVTYPE<-gsub('TSTM','THUNDERSTORM',DT_damage$EVTYPE)

DT_damage$EVTYPE<-gsub('WINTER WEATHER MIX','WINTER WEATHER',DT_damage$EVTYPE)

DT_damage$EVTYPE<-gsub('WINTER WEATHER/MIX','WINTER WEATHER',DT_damage$EVTYPE)

DT_damage$EVTYPE<-gsub('WINTRY MIX','WINTER WEATHER',DT_damage$EVTYPE)

DT_damage$EVTYPE[grepl('THUNDERSTORM WIND', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'THUNDERSTORM WIND'

DT_damage$EVTYPE[grepl('GUSTY', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'HIGH WIND'

DT_damage$EVTYPE[grepl('COASTAL', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'COASTAL FLOOD'

DT_damage$EVTYPE[grepl('FREEZE', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'FROST/FREEZE'

DT_damage$EVTYPE[grepl('WIND', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'HIGH WIND'

DT_damage$EVTYPE[grepl('SURF', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'HIGH SURF'

DT_damage$EVTYPE[grepl('ICE', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'ICE STORM'

DT_damage$EVTYPE[grepl('COLD', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'COLD/WIND CHILL'

DT_damage$EVTYPE[grepl('RAIN', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'HEAVY RAIN'

DT_damage$EVTYPE[grepl('SNOW', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'HEAVY SNOW'

DT_damage$EVTYPE[grepl('STORM SURGE', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'STORM SURGE/TIDE'

DT_damage$EVTYPE[grepl('HURRICANE', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'HURRICANE (TYPHOON)'

DT_damage$EVTYPE[grepl('TYPHOON', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'HURRICANE (TYPHOON)'

DT_damage$EVTYPE[grepl('WILD/FOREST FIRE', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'WILDFIRE'


DT_damage$EVTYPE<-toupper(DT_damage$EVTYPE)

Select only columns needed to assess economic damage

DT_econ<-DT_damage %>% select(EVTYPE, Tot_damage)

Now how many unique events are there?

#determine unique EVTYPE values
DT_econ_a<-DT_econ%>% select(EVTYPE)
DT_econ_a<-DT_econ_a %>% group_by(EVTYPE)  %>% summarise(n=n())
dim(DT_econ_a)
## [1] 105   2

Now down to 75 unique events, getting closer to 48. 75 may be close enough, let’s evaluate the tope ten

Determine the total damage per event

DT_econ_tot<-DT_econ %>% group_by(EVTYPE) %>% summarise(sum=sum(Tot_damage))
#sort damage amount from highest to lowest dollar amount
DT_econ_tot<-setorder(DT_econ_tot, -sum)

Determine percent of all damage caused by the top ten events. This is a check to ensure the top ten really represents to most significant damage.

sumtot<-select(DT_econ_tot, sum)
percent<-(100*(apply(sumtot[1:10,],2,sum))/(apply(sumtot,2,sum)))
          percent
##     sum 
## 96.7186

The top ten events cause 96.7% of the damage. I assess these results are sufficient to determine the most impactful events.

Graph the results

f<-ggplot(DT_econ_tot[1:10,], aes(x = reorder(EVTYPE, -sum), y = sum)) + 
  geom_col(fill='red',col='blue')+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
        labs(x="Event Type", y="Damage ($)", title="Top 10 Economic Impact (Measured in Cost of Damage)")

f

Floods have the greatest economic consequences.

RESULTS:

In conclusion, the analysis shows flooding events have the largest economic impact as measured by total crop and property damage, and tornadoes have the largest health impact as measured by total injuries and fatalities. The most challenging part of this analysis was cleansing the data to transform 985 unique weather events to the 48 events on the NOAA site. I did not reach 48 events, but I did reach a small number (82 for health impact, 75 for economic impact). The cleansing shrinks the events to the point where the top 10 most impactful events for health and economic were standard NOAA events and capture the large part of the impact. For health impact, the top ten most impactful events capture 87% of the total fatalities and injuries. For economic impact, the top ten events covert 97% of the total damage.