Using storm event data from the National Weather Service, this analysis identifies the 10 storm event types that are most harmful to population health and the 10 that have the greatest economic impact. For population health, we look at total fatalities and total injuries (separately) for each event type. For economic impact, we look at total combined property and crop damage for each event type. All figures are total reported counts, grouped by storm event type. We provide graphs that show the top 10 event types for fatalities, injuries, and combined property and crop damage. In summary, the event type most harmful to population health is the tornado (almost 200% more fatalities than the next most harmful event type), and the events with the greatest economic impact are extreme storm events, especially when a tornado is involved.
To process the data, we download and read in the csv file from the National Weather Service. You can see the link and code to do so below. We format the date column as a date variable and we load R libraries needed later in the analysis. We also remove any inappropriate event type entries, and convert the property and crop damage fields so they are expressed in a parallel format for consumption in the analysis.
#---------------------------------------
## Loading and preprocessing the data
#---------------------------------------
setwd("C:/Users/Sarah Lynn/Desktop/Self Study/Coursera DS JH - reproducible research/Week 4 project")
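# The file is bzip2-compressed; read.csv() decompresses it transparently
dstfile <- file.path(getwd(),"storm_data.csv.bz2")
if (!file.exists(dstfile)) download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",destfile=dstfile,mode="wb")
data0 <- read.csv(dstfile)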
data0$BGN_DATE <- gsub(x=data0$BGN_DATE,pattern=" 0:00:00",replacement="",fixed=T)
data0$BGN_DATE <- as.Date(data0$BGN_DATE,"%m/%d/%Y")
library(dplyr)
library(ggplot2)
library(knitr)
#Remove inappropriate event type entries ("Summary ..." rows and "?")
data1 <- data0[!grepl("(^Summary.*)|(\\?)",data0$EVTYPE),]
#Build the filter from data1 so it lines up row-for-row with the data being subset
data11 <- data1[grepl("K|M|B",data1$PROPDMGEXP)|grepl("K|M|B",data1$CROPDMGEXP),]
data11$PROPDMGEXP <- sub("K",1000,data11$PROPDMGEXP)
data11$PROPDMGEXP <- sub("M",1000000,data11$PROPDMGEXP)
data11$PROPDMGEXP <- sub("B",1000000000,data11$PROPDMGEXP)
data11$PROPDMGEXP <- as.numeric(data11$PROPDMGEXP)
## Warning: NAs introduced by coercion
data11$CROPDMGEXP <- sub("K",1000,data11$CROPDMGEXP)
data11$CROPDMGEXP <- sub("M",1000000,data11$CROPDMGEXP)
data11$CROPDMGEXP <- sub("B",1000000000,data11$CROPDMGEXP)
data11$CROPDMGEXP <- as.numeric(data11$CROPDMGEXP)
## Warning: NAs introduced by coercion
data22 <- mutate(data11,PROPDMG2=PROPDMG*PROPDMGEXP,CROPDMG2=CROPDMG*CROPDMGEXP)
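The exponent conversion above can also be written as a lookup table, which makes the handling of unrecognised codes explicit instead of leaving it to `as.numeric()` coercion. The sketch below is illustrative only: the helper name `exp_to_multiplier` and the toy data frame are assumptions, not part of the original analysis, and mapping every code other than K, M, or B to `NA` is one possible choice among several.

```r
# Hypothetical helper: map exponent codes to numeric multipliers.
# Any code other than K, M, or B (blank, digits, "+", ...) becomes NA here;
# that choice is an assumption, not the original analysis.
exp_to_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  # toupper() means lower-case codes, if any occur, are treated the same way
  unname(lookup[toupper(as.character(code))])
}

# Toy rows standing in for the PROPDMG / PROPDMGEXP columns
toy <- data.frame(PROPDMG = c(25, 2.5, 1, 10),
                  PROPDMGEXP = c("K", "M", "B", ""),
                  stringsAsFactors = FALSE)
toy$PROPDMG2 <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)
toy
```

With this approach the last toy row keeps an `NA` damage value, so any later total has to decide explicitly whether to drop it (for example with `sum(toy$PROPDMG2, na.rm = TRUE)`).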
Here we will assess which storm events are most harmful to population health based on fatalities and injuries from each storm event recorded in our dataset. We will rank the top 10 most harmful events based on fatalities and injuries, and make a final list based on the intersection of these.
The reason we use this approach is that there is no good way to equate an injury to a fatality. Hence, instead of combining and ranking total counts, we rank the two measures separately and use the intersection. We also report the top-ranked events for each measure separately, so that both measures are represented.
data2 <- data1 %>%
group_by(EVTYPE) %>%
#list() replaces funs(), which is deprecated as of dplyr 0.8.0; length gives the row count per group
summarise_at(c("FATALITIES","INJURIES"),list(x_sum=sum,x_cnt=length))
data3 <- data2[!(data2$FATALITIES_x_sum==0&data2$INJURIES_x_sum==0),]
data4 <- mutate(data3,tot_hurt=FATALITIES_x_sum+INJURIES_x_sum
,FATALITIES_per_event=FATALITIES_x_sum/FATALITIES_x_cnt
,INJURIES_per_event=INJURIES_x_sum/INJURIES_x_cnt)
ordered_data_f <- arrange(data4,desc(FATALITIES_x_sum))
ordered_data_i <- arrange(data4,desc(INJURIES_x_sum))
#data.frame() keeps the totals numeric; cbind() would coerce both columns to character
top10_fatalities <- data.frame(total_fatalities=ordered_data_f$FATALITIES_x_sum[1:10],event_type=ordered_data_f$EVTYPE[1:10])
top10_injuries <- data.frame(total_injuries=ordered_data_i$INJURIES_x_sum[1:10],event_type=ordered_data_i$EVTYPE[1:10])
top10_fatalities
## total_fatalities event_type
## 1 5633 TORNADO
## 2 1903 EXCESSIVE HEAT
## 3 978 FLASH FLOOD
## 4 937 HEAT
## 5 816 LIGHTNING
## 6 504 TSTM WIND
## 7 470 FLOOD
## 8 368 RIP CURRENT
## 9 248 HIGH WIND
## 10 224 AVALANCHE
top10_injuries
## total_injuries event_type
## 1 91346 TORNADO
## 2 6957 TSTM WIND
## 3 6789 FLOOD
## 4 6525 EXCESSIVE HEAT
## 5 5230 LIGHTNING
## 6 2100 HEAT
## 7 1975 ICE STORM
## 8 1777 FLASH FLOOD
## 9 1488 THUNDERSTORM WIND
## 10 1361 HAIL
From the output, you can see that TORNADO is by far the most harmful storm event type for both fatalities and injuries. To make sure this is not an artifact of bad data, we do the following check:
tornado_inj_data <- data1[data1$EVTYPE=="TORNADO",]$INJURIES
tornado_fat_data <- data1[data1$EVTYPE=="TORNADO",]$FATALITIES
#Pass a list rather than cbind(): the two vectors differ in length, and cbind() would recycle the shorter one
boxplot(list(fatalities=log(tornado_fat_data[tornado_fat_data!=0])
,injuries=log(tornado_inj_data[tornado_inj_data!=0]))
,main="Boxplot to evaluate data for outliers or bad data")
By removing the zeros (which is fine because these are the tornado events that had no fatalities or injuries) and taking the log because of the extreme skew of the data, we can tell that no single outlier is inflating the totals. Instead, tornado events must simply be more common and/or more harmful when they occur.
To see just how much more extreme the harm from tornadoes is compared with the next-ranked storm event type, we do the following:
tot_tornado_fatalities <- as.numeric(top10_fatalities[1,1])
tot_tornado_injuries <-as.numeric(top10_injuries[1,1])
tot_next_fatalities <-as.numeric(top10_fatalities[2,1])
tot_next_injuries <-as.numeric(top10_injuries[2,1])
percent_lift_f <- tot_tornado_fatalities/tot_next_fatalities -1
percent_lift_i <- tot_tornado_injuries/tot_next_injuries -1
percent_lift_f
## [1] 1.960063
percent_lift_i
## [1] 12.13008
The output above shows that tornadoes account for 196% more fatalities than the next-highest-ranked storm type (EXCESSIVE HEAT) and 1,213% more injuries than the next-highest-ranked storm type (TSTM WIND). That is a very large gap.
Next we will find the intersection of the top 10 events by injuries and fatalities. From the following code, we see that there are 7 shared events. They are ranked according to their combined ranks for fatalities and injuries, and output below:
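As a quick check of the arithmetic, the lift can be recomputed directly from the totals printed in the two top-10 tables. The function name `percent_lift` is introduced here for illustration only and is not part of the original code.

```r
# Relative lift of the top-ranked total over the runner-up, as a fraction
percent_lift <- function(top, runner_up) top / runner_up - 1

lift_f <- percent_lift(5633, 1903)   # fatalities: TORNADO vs EXCESSIVE HEAT
lift_i <- percent_lift(91346, 6957)  # injuries: TORNADO vs TSTM WIND

# Express as whole percentages
round(100 * c(fatalities = lift_f, injuries = lift_i))
```

A lift of 1.96 means the tornado total is 2.96 times the runner-up total, which is the same statement as "196% more".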
top_injuries <- as.data.frame(cbind(ordered_data_i$INJURIES_x_sum[1:10],ordered_data_i$EVTYPE[1:10]))
top_fatalities <- as.data.frame(cbind(ordered_data_f$FATALITIES_x_sum[1:10],ordered_data_f$EVTYPE[1:10]))
top_injuries_rnk <- mutate(top_injuries,rnk1 = row_number())
top_fatalities_rnk <- mutate(top_fatalities,rnk2 = row_number())
combined_top_health0 <- merge(top_injuries_rnk,top_fatalities_rnk,by=c("V2"),all=TRUE) %>%
mutate(rnk=rnk1+rnk2)
combined_top_health <- combined_top_health0[!is.na(combined_top_health0$rnk),] %>%
arrange(rnk) %>%
mutate(rank = row_number(),event_type=V2) %>%
select(rank,event_type)
combined_top_health
## rank event_type
## 1 1 TORNADO
## 2 2 EXCESSIVE HEAT
## 3 3 TSTM WIND
## 4 4 FLOOD
## 5 5 HEAT
## 6 6 LIGHTNING
## 7 7 FLASH FLOOD
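The intersection-and-rank-sum idea can be reproduced in a few lines of base R from the two top-10 lists printed earlier. This is a minimal sketch rather than the code used to produce the table: the vector names are assumptions, and ties in the rank sum are broken alphabetically here, which happens to give the same order as the table above.

```r
# Event types in rank order, copied from the printed top-10 tables
fatal_top10 <- c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD", "HEAT",
                 "LIGHTNING", "TSTM WIND", "FLOOD", "RIP CURRENT",
                 "HIGH WIND", "AVALANCHE")
injury_top10 <- c("TORNADO", "TSTM WIND", "FLOOD", "EXCESSIVE HEAT",
                  "LIGHTNING", "HEAT", "ICE STORM", "FLASH FLOOD",
                  "THUNDERSTORM WIND", "HAIL")

# Keep only event types that appear in both lists
shared <- intersect(injury_top10, fatal_top10)

# match() returns the position (rank) of each shared event in each list
combined <- data.frame(
  event_type = shared,
  rank_sum = match(shared, injury_top10) + match(shared, fatal_top10),
  stringsAsFactors = FALSE
)

# Order by combined rank; ties are broken alphabetically (an assumption)
combined <- combined[order(combined$rank_sum, combined$event_type), ]
combined$rank <- seq_len(nrow(combined))
rownames(combined) <- NULL
combined
```

FLOOD, HEAT, and LIGHTNING all have a rank sum of 10, so their relative order depends entirely on the tie-breaking rule.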
Here we will assess which storm events have the greatest economic impact based on property damage and crop damage from each storm event recorded in our dataset. We will rank the top 10 most harmful events based on property and crop damage combined. Note that we can combine these metrics from the start to rank them because, unlike fatalities and injuries, these have a common base of being measured in dollars.
Using the converted property and crop damage fields so everything is in terms of dollars, we will view the top 10 events.
data33 <- data22 %>%
group_by(EVTYPE) %>%
#list() replaces the deprecated funs(); length gives the row count per group
summarise_at(c("PROPDMG2","CROPDMG2"),list(x_sum=sum,x_cnt=length))
data44 <- data33[!(data33$PROPDMG2_x_sum==0&data33$CROPDMG2_x_sum==0),]
data55 <- mutate(data44,tot_DMG=PROPDMG2_x_sum+CROPDMG2_x_sum
,PROPDMG_per_event=PROPDMG2_x_sum/PROPDMG2_x_cnt
,CROPDMG_per_event=CROPDMG2_x_sum/CROPDMG2_x_cnt)
ordered_data_pc <- arrange(data55,desc(tot_DMG))
top_dmg <- as.data.frame(cbind(total_damage=ordered_data_pc$tot_DMG[1:10],event_type=ordered_data_pc$EVTYPE[1:10]))
top_dmg_rnk <- mutate(top_dmg,rank = row_number()) %>%
arrange(rank) %>%
select(rank,event_type)
top_dmg_rnk
## rank event_type
## 1 1 TORNADOES, TSTM WIND, HAIL
## 2 2 TSUNAMI
## 3 3 HIGH WINDS/COLD
## 4 4 HURRICANE OPAL/HIGH WINDS
## 5 5 WINTER STORM HIGH WINDS
## 6 6 TROPICAL STORM JERRY
## 7 7 LAKESHORE FLOOD
## 8 8 HIGH WINDS HEAVY RAINS
## 9 9 FOREST FIRES
## 10 10 FLASH FLOODING/FLOOD
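One thing worth checking in a grouped total like the one above: `sum()` returns `NA` for a group as soon as any value in that group is `NA`, and `NA` totals sort to the end under `arrange(desc(...))`. Whether this affects the ranking shown cannot be determined from the output alone. The toy example below only illustrates the behaviour; its data and column names are made up.

```r
# Toy data: one FLOOD row has a missing converted damage value
toy <- data.frame(EVTYPE = c("FLOOD", "FLOOD", "HAIL"),
                  PROPDMG2 = c(1e6, NA, 5e3),
                  stringsAsFactors = FALSE)

# Strict total: any NA in a group makes the whole group total NA
strict <- tapply(toy$PROPDMG2, toy$EVTYPE, sum)

# Lenient total: missing values are dropped before summing
lenient <- tapply(toy$PROPDMG2, toy$EVTYPE, sum, na.rm = TRUE)

data.frame(event_type = names(strict),
           strict = as.vector(strict),
           lenient = as.vector(lenient))
```

In the strict version FLOOD has no total at all, even though it has the largest known damage in the toy data.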
Notice that the 4th highest event type listed is HURRICANE OPAL/HIGH WINDS, which is presumably a specific storm (hence the name “Opal” in the event type). As a result, we now view the top 10 individual storm event occurrences to see whether our list is driven by specific storms rather than storm types.
data_severe_storms <- data22[data22$EVTYPE%in%ordered_data_pc$EVTYPE[1:10],]
ordered_data_severe_storms <- select(data_severe_storms,BGN_DATE,COUNTYNAME,STATE,EVTYPE,PROPDMG2,CROPDMG2) %>% arrange(desc(PROPDMG2+CROPDMG2))
names(ordered_data_severe_storms) <- c("Date","County_Name","State","Event_Type","Property_Damage", "Crop_Damage")
ordered_data_severe_storms[1:10,]
## Date County_Name State
## 1 1993-03-12 FLZ001>023 FL
## 2 1995-10-04 ALZ001>050 AL
## 3 2009-09-29 PSZ002 AS
## 4 1995-12-09 CAZ01>03 06>010 CA
## 5 1993-03-13 SCZ008 SC
## 6 1993-03-13 SCZ007 SC
## 7 2011-03-11 CAZ529 CA
## 8 1995-08-23 FLZ039 - 042>043 - 048>052 - 055>057 - 060>062 - 065 FL
## 9 2011-03-11 HIZ023 HI
## 10 2006-11-15 CAZ001 CA
## Event_Type Property_Damage Crop_Damage
## 1 TORNADOES, TSTM WIND, HAIL 1.60e+09 2.5e+06
## 2 HURRICANE OPAL/HIGH WINDS 1.00e+08 1.0e+07
## 3 TSUNAMI 8.10e+07 2.0e+04
## 4 WINTER STORM HIGH WINDS 6.00e+07 5.0e+06
## 5 HIGH WINDS/COLD 5.00e+07 5.0e+06
## 6 HIGH WINDS/COLD 5.00e+07 5.0e+05
## 7 TSUNAMI 2.66e+07 0.0e+00
## 8 TROPICAL STORM JERRY 4.00e+06 1.5e+07
## 9 TSUNAMI 1.42e+07 0.0e+00
## 10 TSUNAMI 9.20e+06 0.0e+00
From the output, we can see that the top storm event type, both per occurrence and in total, is TORNADOES, TSTM WIND, HAIL. This corresponds to a single storm in Florida in 1993. The second entry on this list is likewise a specific storm, Hurricane Opal.
As a result, when we report the storm events that have the greatest economic impact, we should keep in mind that these are often specific severe storm occurrences rather than general storm types.
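One way to make this concrete is to compute, for each event type, how many occurrences it has and what share of its total damage comes from its single largest occurrence. The sketch below uses only the ten rows printed above (property plus crop damage, as displayed), so the shares are illustrative and are not the shares in the full dataset; the object names are assumptions.

```r
# The ten occurrences printed above: property + crop damage, as displayed
events <- data.frame(
  event_type = c("TORNADOES, TSTM WIND, HAIL", "HURRICANE OPAL/HIGH WINDS",
                 "TSUNAMI", "WINTER STORM HIGH WINDS", "HIGH WINDS/COLD",
                 "HIGH WINDS/COLD", "TSUNAMI", "TROPICAL STORM JERRY",
                 "TSUNAMI", "TSUNAMI"),
  damage = c(1.60e9 + 2.5e6, 1.00e8 + 1.0e7, 8.10e7 + 2.0e4, 6.00e7 + 5.0e6,
             5.00e7 + 5.0e6, 5.00e7 + 5.0e5, 2.66e7, 4.00e6 + 1.5e7,
             1.42e7, 9.20e6),
  stringsAsFactors = FALSE
)

# Number of occurrences per event type
n_events <- tapply(events$damage, events$event_type, length)

# Share of each type's total damage that comes from its largest occurrence
top_share <- tapply(events$damage, events$event_type,
                    function(x) max(x) / sum(x))

summary_tbl <- data.frame(event_type = names(n_events),
                          n_events = as.vector(n_events),
                          top_share = round(as.vector(top_share), 2),
                          stringsAsFactors = FALSE)
summary_tbl
```

An event type with a single occurrence has a share of 1 by construction, which is a simple flag that the "type" is really one named storm.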
In summary, the following figure shows the top storm events for fatalities, injuries, and total damage dollars.
par(mfrow=c(1,3),mar=c(8,10,4,5),mgp=c(4.5,1,0))
barplot(ordered_data_f$FATALITIES_x_sum[10:1],names.arg=ordered_data_f$EVTYPE[10:1]
,main="Top 10 events - fatalities"
,xlab="Fatalities total"
,las=2
,horiz=TRUE
)
barplot(ordered_data_i$INJURIES_x_sum[10:1],names.arg=ordered_data_i$EVTYPE[10:1]
,main="Top 10 events - injuries"
,xlab="Injuries total"
,las=2
,horiz=TRUE
)
barplot(ordered_data_pc$tot_DMG[10:1],names.arg=ordered_data_pc$EVTYPE[10:1]
,main="Top 10 events - damage $s"
,xlab="property & crop damage total"
,las=2
,horiz=TRUE
)
The following table reports the top storm events by harm to population health and by economic impact.
final_data <- cbind(rank=combined_top_health[1:5,1],population_health=combined_top_health[1:5,2],economic_impact=top_dmg_rnk[1:5,2])
kable(final_data, caption = "Most Harmful Storm Events", align=c("c","l","l"))
| rank | population_health | economic_impact |
|---|---|---|
| 1 | TORNADO | TORNADOES, TSTM WIND, HAIL |
| 2 | EXCESSIVE HEAT | TSUNAMI |
| 3 | TSTM WIND | HIGH WINDS/COLD |
| 4 | FLOOD | HURRICANE OPAL/HIGH WINDS |
| 5 | HEAT | WINTER STORM HIGH WINDS |
In conclusion, tornadoes are by far the most damaging storm type for population health, and when they are part of a storm, that storm also has the greatest economic impact. The most damaging storm types for population health are tornado, excessive heat, thunderstorm (TSTM) wind, flood, and heat (in order of severity). The storm types with the greatest economic impact are extreme storm occurrences, such as tornadoes/thunderstorm (TSTM) wind/hail, tsunami, high winds/cold, hurricane/high winds, and winter storm high winds (also in order of severity).