In this analysis I explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to determine which weather events have the greatest economic and health impacts. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. Data is from 1950 to November 2011 but only the impact of tornadoes was tracked prior to January 1, 1996. The data file has 902297 observations or storm evens and measures37 variables. There are 48 unique storm events described in NOAA web site that has this database, but due to poor configuration control there are 985 unique event s in the file. The initial analysis will be to remove data before 1996 so we have different weather events to compart and to cleanse the data down to a level where conclusive results of the standard unique events can be withdrawn. Filtering by date requires changing the date class from characteristic to date format. To determine health impact I add fatalities and injuries and eventually show tornado events have the greatest impact. Regarding economic impact, I add crop and property damage to determine overall impact, and eventually show flooding events have the greatest impact.
library(data.table)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
DT<-read.csv(file='C:/Users/Kevin/Desktop/Data Analytics/JHU R/Course 5/project 2/repdata_data_StormData.csv.bz2')
DT<-data.table(DT)
Calculate unique EVTYPE (note: should be 48)
DT_event<-DT %>% select(EVTYPE)
DT_event<-DT_event %>% group_by(EVTYPE) %>% summarise(n=n())
dim(DT_event)
## [1] 985 2
Note: There are 985 unique values, way too many
Now, since observations before 01 JAN 1996 only lists tornadoes I will eliminate observations before this date.
DT$BGN_DATE<-as.character(DT$BGN_DATE)
DT$BGN_DATE<-as.Date(DT$BGN_DATE, "%m/%d/%Y")
DT_JAN96<-DT %>% filter(BGN_DATE>='1996-01-01')
#Note: DT_JAN96 has 653530 observations so I have eliminated approximately 25K rows
RE- Calculate unique EVTYPE (note: should be 48)
DT_event_JAN96<-DT_JAN96 %>% select(EVTYPE)
DT_event_JAN96<-DT_event_JAN96 %>% group_by(EVTYPE) %>% summarise(n=n())
dim(DT_event_JAN96)
## [1] 516 2
Now down to 516 unique events, still much greater than 48
I will select the variables needed to answer the questions:
DT_working<-DT_JAN96 %>% select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
Select Events, Fatalities, and Injuries to narrow down the table
DT_injury<-DT_JAN96 %>% select(EVTYPE, FATALITIES, INJURIES)
‘Harmful’ will be measured by the sum of injuries and fatalities for a given event. Additionally, since the harmful effects are being assessed, the events with no fatalities or injuries will be filered out.
DT_injury_tot<-mutate(DT_injury, Tot_Harm=FATALITIES+INJURIES)
#eliminate the storm events with no health impact
DT_injury_tot<-DT_injury_tot %>% filter(Tot_Harm>0)
dim(DT_injury_tot)
## [1] 12764 4
There were 12,764 events that had a health impact
Determine unique events from these storms
DT_event_injury<-DT_injury_tot %>% select(EVTYPE)
DT_event_injury<-DT_event_injury %>% group_by(EVTYPE) %>% summarise(n=n())
dim(DT_event_injury)
## [1] 138 2
There are now 138 unique types of events, down from 985 in the todal data base and 516 after all observatioins before 1996 were elminated
Now I will cleanse the EVTYPE that were entered with poor configuration control. For example entering TSTM vice THUNDERSTORM
DT_injury_tot$EVTYPE<-gsub('TSTM','THUNDERSTORM',DT_injury_tot$EVTYPE)
DT_injury_tot$EVTYPE<-gsub('WINTER WEATHER MIX','WINTER WEATHER',DT_injury_tot$EVTYPE)
DT_injury_tot$EVTYPE<-gsub('WINTER WEATHER/MIX','WINTER WEATHER',DT_injury_tot$EVTYPE)
DT_injury_tot$EVTYPE<-gsub('WINTRY MIX','WINTER WEATHER',DT_injury_tot$EVTYPE)
DT_injury_tot$EVTYPE[grepl('THUNDERSTORM WIND', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'THUNDERSTORM WIND'
DT_injury_tot$EVTYPE[grepl('GUSTY', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'HIGH WIND'
DT_injury_tot$EVTYPE[grepl('COASTAL', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'COASTAL FLOOD'
DT_injury_tot$EVTYPE[grepl('FREEZE', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'FROST/FREEZE'
DT_injury_tot$EVTYPE[grepl('WIND', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'HIGH WIND'
DT_injury_tot$EVTYPE[grepl('SURF', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'HIGH SURF'
DT_injury_tot$EVTYPE[grepl('ICE', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'ICE STORM'
DT_injury_tot$EVTYPE[grepl('COLD', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'COLD/WIND CHILL'
DT_injury_tot$EVTYPE[grepl('RAIN', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'HEAVY RAIN'
DT_injury_tot$EVTYPE[grepl('SNOW', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'HEAVY SNOW'
DT_injury_tot$EVTYPE[grepl('STORM SURGE', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'STORM SURGE/TIDE'
DT_injury_tot$EVTYPE[grepl('HURRICANE', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'HURRICANE (TYPHOON)'
DT_injury_tot$EVTYPE[grepl('TYPHOON', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'HURRICANE (TYPHOON)'
DT_injury_tot$EVTYPE[grepl('WILD/FOREST FIRE', DT_injury_tot$EVTYPE, ignore.case=FALSE)] <- 'WILDFIRE'
DT_injury_tot$EVTYPE<-toupper(DT_injury_tot$EVTYPE)
Now how many unique events are there?
DT_event_injury<-DT_injury_tot %>% select(EVTYPE)
DT_event_injury<-DT_event_injury %>% group_by(EVTYPE) %>% summarise(n=n())
dim(DT_event_injury)
## [1] 82 2
Now down to 82 unique events, getting closer to 48. 82 may be close enough, let’s evaluate the tope ten
DT_event_injury_tot<-DT_injury_tot %>% group_by(EVTYPE) %>% summarise(sum=sum(Tot_Harm))
#sort damage amount from highest to lowest dollar amount
DT_event_injury_tot<-setorder(DT_event_injury_tot, -sum)
The top 10 events are standard events listed in the Storm Data Event Table. What percent of total harm is captured by the top 10 events?
totharm<-select(DT_event_injury_tot, sum)
percent<-(100*(apply(totharm[1:10,],2,sum))/(apply(totharm,2,sum)))
percent
## sum
## 87.7749
The top 10 events capture 87.8% of all harm. I assess these results are sufficient to determine the most impactful events.
Graph the results
I<-ggplot(DT_event_injury_tot[1:10,], aes(x = reorder(EVTYPE, -sum), y = sum)) +
geom_col(fill='red',col='blue')+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
labs(x="Event Type", y="Total Fatalities & Injuries", title="Top 10 Health Impac (Measured in Sum of Fatalities & Injuries)")
I
Tornadoes have the greatest health impact
I will combine the EXP to the PROP or CROP damages to determine a damage amount
How many uniques values of PROPDMGEXP are there?
DT_workinga<-DT_working %>% select(PROPDMGEXP)
DT_workinga<-DT_workinga %>% group_by(PROPDMGEXP) %>% summarise(n=n())
DT_workinga
## # A tibble: 5 x 2
## PROPDMGEXP n
## <fct> <int>
## 1 "" 276185
## 2 0 1
## 3 B 32
## 4 K 369938
## 5 M 7374
Answer: 4 plus blanks (0, B, K, M)
How many uniques values of CROPDMGEXP are there?
DT_workingb<-DT_working %>% select(CROPDMGEXP)
DT_workingb<-DT_workingb %>% group_by(CROPDMGEXP) %>% summarise(n=n())
DT_workingb
## # A tibble: 4 x 2
## CROPDMGEXP n
## <fct> <int>
## 1 "" 373069
## 2 B 4
## 3 K 278686
## 4 M 1771
Answer: 3 (B, K, M) and blanks Note: from B is 1,000,000,000 multiplier, M is 1,000,000 multiplier, K is 1000 multiplier and blanks or zeros is a 0 multiplier
I will combine the exp and dollar amounts to determine total cost.
for(i in 1:653530) {
if (DT_working$PROPDMGEXP[i]=='B'){
DT_working$PROPDMG[i]<-DT_working$PROPDMG[i]*1000000000}
else if (DT_working$PROPDMGEXP[i]=='K'){
DT_working$PROPDMG[i]<-DT_working$PROPDMG[i]*1000}
else if (DT_working$PROPDMGEXP[i]=='M'){
DT_working$PROPDMG[i]<-DT_working$PROPDMG[i]*1000000}
else {DT_working$PROPDMG[i]<-0}
}
for(i in 1:653530) {
if (DT_working$CROPDMGEXP[i]=='B'){
DT_working$CROPDMG[i]<-DT_working$CROPDMG[i]*1000000000}
else if (DT_working$CROPDMGEXP[i]=='K'){
DT_working$CROPDMG[i]<-DT_working$CROPDMG[i]*1000}
else if (DT_working$CROPDMGEXP[i]=='M'){
DT_working$CROPDMG[i]<-DT_working$CROPDMG[i]*1000000}
else {DT_working$CROPDMG[i]<-0}
}
Sum total damage
DT_working<-mutate(DT_working, Tot_damage=PROPDMG+CROPDMG)
Select Non-Zero Damage
DT_damage<-DT_working %>% filter(Tot_damage>0)
#determine unique EVTYPE values
DT_damage_a<-DT_damage %>% select(EVTYPE)
DT_damage_a<-DT_damage_a %>% group_by(EVTYPE) %>% summarise(n=n())
dim(DT_damage_a)
## [1] 186 2
There are now 151 unique types of events, down from 985 in the todal data base and 516 after all observatioins before 1996 were elminated
Now I will cleanse the EVTYPE that were entered with poor configuration control. For example entering TSTM vice THUNDERSTORM
DT_damage$EVTYPE<-gsub('TSTM','THUNDERSTORM',DT_damage$EVTYPE)
DT_damage$EVTYPE<-gsub('WINTER WEATHER MIX','WINTER WEATHER',DT_damage$EVTYPE)
DT_damage$EVTYPE<-gsub('WINTER WEATHER/MIX','WINTER WEATHER',DT_damage$EVTYPE)
DT_damage$EVTYPE<-gsub('WINTRY MIX','WINTER WEATHER',DT_damage$EVTYPE)
DT_damage$EVTYPE[grepl('THUNDERSTORM WIND', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'THUNDERSTORM WIND'
DT_damage$EVTYPE[grepl('GUSTY', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'HIGH WIND'
DT_damage$EVTYPE[grepl('COASTAL', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'COASTAL FLOOD'
DT_damage$EVTYPE[grepl('FREEZE', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'FROST/FREEZE'
DT_damage$EVTYPE[grepl('WIND', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'HIGH WIND'
DT_damage$EVTYPE[grepl('SURF', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'HIGH SURF'
DT_damage$EVTYPE[grepl('ICE', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'ICE STORM'
DT_damage$EVTYPE[grepl('COLD', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'COLD/WIND CHILL'
DT_damage$EVTYPE[grepl('RAIN', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'HEAVY RAIN'
DT_damage$EVTYPE[grepl('SNOW', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'HEAVY SNOW'
DT_damage$EVTYPE[grepl('STORM SURGE', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'STORM SURGE/TIDE'
DT_damage$EVTYPE[grepl('HURRICANE', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'HURRICANE (TYPHOON)'
DT_damage$EVTYPE[grepl('TYPHOON', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'HURRICANE (TYPHOON)'
DT_damage$EVTYPE[grepl('WILD/FOREST FIRE', DT_damage$EVTYPE, ignore.case=FALSE)] <- 'WILDFIRE'
DT_damage$EVTYPE<-toupper(DT_damage$EVTYPE)
Select only columns needed to assess economic damage
DT_econ<-DT_damage %>% select(EVTYPE, Tot_damage)
Now how many unique events are there?
#determine unique EVTYPE values
DT_econ_a<-DT_econ%>% select(EVTYPE)
DT_econ_a<-DT_econ_a %>% group_by(EVTYPE) %>% summarise(n=n())
dim(DT_econ_a)
## [1] 105 2
Now down to 75 unique events, getting closer to 48. 75 may be close enough, let’s evaluate the tope ten
Determine the total damage per event
DT_econ_tot<-DT_econ %>% group_by(EVTYPE) %>% summarise(sum=sum(Tot_damage))
#sort damage amount from highest to lowest dollar amount
DT_econ_tot<-setorder(DT_econ_tot, -sum)
Determine percent of all damage caused by the top ten events. This is a check to ensure the top ten really represents to most significant damage.
sumtot<-select(DT_econ_tot, sum)
percent<-(100*(apply(sumtot[1:10,],2,sum))/(apply(sumtot,2,sum)))
percent
## sum
## 96.7186
The top ten events cause 96.7% of the damage. I assess these results are sufficient to determine the most impactful events.
Graph the results
f<-ggplot(DT_econ_tot[1:10,], aes(x = reorder(EVTYPE, -sum), y = sum)) +
geom_col(fill='red',col='blue')+theme(axis.text.x=element_text(angle=90,hjust=1,vjust=0.5))+
labs(x="Event Type", y="Damage ($)", title="Top 10 Economic Impact (Measured in Cost of Damage)")
f
Floods have the greatest economic consequences.
In conclusion, the analysis shows flooding events have the largest economic impact as measured by total crop and property damage, and tornadoes have the largest health impact as measured by total injuries and fatalities. The most challenging part of this analysis was cleansing the data to transform 985 unique weather events to the 48 events on the NOAA site. I did not reach 48 events, but I did reach a small number (82 for health impact, 75 for economic impact). The cleansing shrinks the events to the point where the top 10 most impactful events for health and economic were standard NOAA events and capture the large part of the impact. For health impact, the top ten most impactful events capture 87% of the total fatalities and injuries. For economic impact, the top ten events covert 97% of the total damage.