In this analysis we will be looking at weather event data from NOAA storm data base for the Coursera Reproducible Research course. We start by downloading the data into R, cleaning it and summarizing it using several R packages such as dplyr, and reshape2. We will then look at the summary tables and visualize it.
This will help us find which weather events are the most harmful to population health (deaths and injuries), as well as which event cause the most monetary damage.
The analysis shows that tornados were most harmful to population health over this time period, and floods caused accounted for the most monetary damage.
I started by loading the data set from the CSV file I downloaded from the NOAA website and and saved locally on my computer. I will also load relevant packages that I will use for the data cleaning and analysis.
Link: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
require(reshape2)
## Loading required package: reshape2
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(ggplot2)
## Loading required package: ggplot2
setwd("C://Users//Luke.Munyan//Desktop//Reproducible Reseach")
stormdata<-read.csv("repdata_data_StormData.csv", stringsAsFactors = F)
I will then look at how many event types there are in the data set, and create a new variable to with the relevant columns for the analysis. I then wrote a loop in convert relevant columns to numeric values. They were intially loaded in as characters.
length(unique(stormdata$EVTYPE))
## [1] 1013
#Look at only the necessary columns
stormdata2<-stormdata[,c(8,23:28)]
#Convert necessary columns to class numeric
for(i in c(2:4,6)){
stormdata2[,i]<-as.numeric(stormdata2[,i])
}
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
## Warning: NAs introduced by coercion
I then took out any incomplete observations and converted the EVTYPE (event type) variable to all lower case, which reduced the number of event types by 115 unique values. I then looked at the EVTYPE variable again to see how I could clean the classifications to condense them more. There were some typos and causing there to be many more event types then there should have been.
#Remove observations with missing
stormdata3<-stormdata2[complete.cases(stormdata2),]
#all event types to lower case
stormdata3$EVTYPE<-tolower(stormdata3$EVTYPE)
length(unique(stormdata3$EVTYPE)) #115 fewer Event types now
## [1] 898
unique(stormdata$EVTYPE)[1:100] #Look to see how this can be condensed
## [1] "TORNADO" "TSTM WIND"
## [3] "HAIL" "FREEZING RAIN"
## [5] "SNOW" "ICE STORM/FLASH FLOOD"
## [7] "SNOW/ICE" "WINTER STORM"
## [9] "HURRICANE OPAL/HIGH WINDS" "THUNDERSTORM WINDS"
## [11] "RECORD COLD" "HURRICANE ERIN"
## [13] "HURRICANE OPAL" "HEAVY RAIN"
## [15] "LIGHTNING" "THUNDERSTORM WIND"
## [17] "DENSE FOG" "RIP CURRENT"
## [19] "THUNDERSTORM WINS" "FLASH FLOOD"
## [21] "FLASH FLOODING" "HIGH WINDS"
## [23] "FUNNEL CLOUD" "TORNADO F0"
## [25] "THUNDERSTORM WINDS LIGHTNING" "THUNDERSTORM WINDS/HAIL"
## [27] "HEAT" "WIND"
## [29] "LIGHTING" "HEAVY RAINS"
## [31] "LIGHTNING AND HEAVY RAIN" "FUNNEL"
## [33] "WALL CLOUD" "FLOODING"
## [35] "THUNDERSTORM WINDS HAIL" "FLOOD"
## [37] "COLD" "HEAVY RAIN/LIGHTNING"
## [39] "FLASH FLOODING/THUNDERSTORM WI" "WALL CLOUD/FUNNEL CLOUD"
## [41] "THUNDERSTORM" "WATERSPOUT"
## [43] "EXTREME COLD" "HAIL 1.75)"
## [45] "LIGHTNING/HEAVY RAIN" "HIGH WIND"
## [47] "BLIZZARD" "BLIZZARD WEATHER"
## [49] "WIND CHILL" "BREAKUP FLOODING"
## [51] "HIGH WIND/BLIZZARD" "RIVER FLOOD"
## [53] "HEAVY SNOW" "FREEZE"
## [55] "COASTAL FLOOD" "HIGH WIND AND HIGH TIDES"
## [57] "HIGH WIND/BLIZZARD/FREEZING RA" "HIGH TIDES"
## [59] "HIGH WIND AND HEAVY SNOW" "RECORD COLD AND HIGH WIND"
## [61] "RECORD HIGH TEMPERATURE" "RECORD HIGH"
## [63] "HIGH WINDS HEAVY RAINS" "HIGH WIND/ BLIZZARD"
## [65] "ICE STORM" "BLIZZARD/HIGH WIND"
## [67] "HIGH WIND/LOW WIND CHILL" "HEAVY SNOW/HIGH"
## [69] "RECORD LOW" "HIGH WINDS AND WIND CHILL"
## [71] "HEAVY SNOW/HIGH WINDS/FREEZING" "LOW TEMPERATURE RECORD"
## [73] "AVALANCHE" "MARINE MISHAP"
## [75] "WIND CHILL/HIGH WIND" "HIGH WIND/WIND CHILL/BLIZZARD"
## [77] "HIGH WIND/WIND CHILL" "HIGH WIND/HEAVY SNOW"
## [79] "HIGH TEMPERATURE RECORD" "FLOOD WATCH/"
## [81] "RECORD HIGH TEMPERATURES" "HIGH WIND/SEAS"
## [83] "HIGH WINDS/HEAVY RAIN" "HIGH SEAS"
## [85] "SEVERE TURBULENCE" "RECORD RAINFALL"
## [87] "RECORD SNOWFALL" "RECORD WARMTH"
## [89] "HEAVY SNOW/WIND" "EXTREME HEAT"
## [91] "WIND DAMAGE" "DUST STORM"
## [93] "APACHE COUNTY" "SLEET"
## [95] "HAIL STORM" "FUNNEL CLOUDS"
## [97] "FLASH FLOODS" "DUST DEVIL"
## [99] "EXCESSIVE HEAT" "THUNDERSTORM WINDS/FUNNEL CLOU"
I then created a new variable to sort by later, which adds the number of fatalities and injuries for each event.
I then converted the exponent column for property and crop damage, which was listed as h, k, m, or b. which stands for hundreds, thousands, millions, or billions of dollars in damage to the number it represents. I then created a new column by multiplying the PROPDMGE and CROPDMGE columns by the correct exponents to get the actual numeric monetary cost. I then added these columns to create the MonetaryCost variable, which represents the total monetary cost for each event.
#New variable summing fatalities and injuries
stormdata3$TotalDanger<-stormdata3$FATALITIES+stormdata3$INJURIES
#Convert the PROPDMGXP variable to a number rather than letter then multiply PROPDMGEXP by PROPDMGE to get raw monetary damage
stormdata3$PROPDMGEXP<-toupper(stormdata3$PROPDMGEXP)
stormdata3[stormdata3$PROPDMGEXP=="",]$PROPDMGEXP<-0
stormdata3[stormdata3$PROPDMGEXP=="H",]$PROPDMGEXP<-10^2
stormdata3[stormdata3$PROPDMGEXP=="K",]$PROPDMGEXP<-10^3
stormdata3[stormdata3$PROPDMGEXP=="M",]$PROPDMGEXP<-10^6
stormdata3[stormdata3$PROPDMGEXP=="B",]$PROPDMGEXP<-10^9
stormdata3$PROPDMGEXP<-as.numeric(stormdata3$PROPDMGEXP)
## Warning: NAs introduced by coercion
stormdata3$propdamage<-stormdata3$PROPDMG*stormdata3$PROPDMGEXP
#Same for crop damage. No "H" used here
stormdata3$CROPDMGEXP<-toupper(stormdata3$CROPDMGEXP)
stormdata3[stormdata3$CROPDMGEXP=="",]$CROPDMGEXP<-0
stormdata3[stormdata3$CROPDMGEXP=="K",]$CROPDMGEXP<-10^3
stormdata3[stormdata3$CROPDMGEXP=="M",]$CROPDMGEXP<-10^6
stormdata3[stormdata3$CROPDMGEXP=="B",]$CROPDMGEXP<-10^9
stormdata3$CROPDMGEXP<-as.numeric(stormdata3$CROPDMGEXP)
## Warning: NAs introduced by coercion
stormdata3$cropdamage<-stormdata3$CROPDMG*stormdata3$CROPDMGEXP
stormdata3$MonetaryCost<-stormdata3$propdamage+stormdata3$cropdamage
#remove observations that have missing values meaning they were coded improperly
stormdata3<-stormdata3[complete.cases(stormdata3),]
Next I created a summary table, which summarized the data by event type and summed the total injuries, fatalities, monetary costs, and number of observations. Other variable I included were fatalities and injuries per event, as well as cost per event.
data<-stormdata3%>%
group_by(EVTYPE)%>%
summarise(Injuries=sum(INJURIES), Deaths=sum(FATALITIES), Total.Death.Inj=sum(TotalDanger), Economic.Cost=sum(MonetaryCost), count=n(), Deaths.per= Deaths/count, Injuries.per=Injuries/count)%>%
arrange(desc(Total.Death.Inj))%>%
as.data.frame()
data[1:10,]
## EVTYPE Injuries Deaths Total.Death.Inj Economic.Cost count
## 1 tornado 90671 5593 96264 56260908001 60576
## 2 excessive heat 6525 1903 8428 500155700 1678
## 3 tstm wind 6957 504 7461 5038850780 219923
## 4 flood 6789 470 7259 150319678250 25327
## 5 lightning 5230 816 6046 940751456 15754
## 6 heat 2100 937 3037 403258500 767
## 7 flash flood 1777 978 2755 17562129187 54275
## 8 ice storm 1975 89 2064 8967041310 2006
## 9 thunderstorm wind 1488 133 1621 3897964214 82563
## 10 winter storm 1321 206 1527 6715441250 11433
## Deaths.per Injuries.per
## 1 0.092330296 1.49681392
## 2 1.134088200 3.88855781
## 3 0.002291711 0.03163380
## 4 0.018557271 0.26805386
## 5 0.051796369 0.33197918
## 6 1.221642764 2.73794003
## 7 0.018019346 0.03274067
## 8 0.044366899 0.98454636
## 9 0.001610891 0.01802260
## 10 0.018018018 0.11554273
Seeing that there were still too many event types, I decided to write a loop to condense some of these event types.There were plenty of event types that were not consistent with they use of singular and plural (thunderstorm and thunderstorms) causing too many event types that needed condensing. I created a new variable called Class, which is a new classification based on my loop. If the string matching was not relevant, the original event type was used.
I then created a new summary table based on my classifications from the loop and sorted it by event types most fatalities and injuries.
#Fix event types that are coded strangely. This will condense some of the Event types in a new variable called Class
for(i in 1:nrow(data)){
if(grepl("hurricane|typhoon", data$EVTYPE[i])=="TRUE"){
data$Class[i]<-"hurricane/typhoon"
} else if(grepl("flood|fld", data$EVTYPE[i])=="TRUE"){
data$Class[i]<-"flood"
} else if(grepl("storm surge", data$EVTYPE[i])=="TRUE"){
data$Class[i]<-"stormsurge"
} else if(grepl("wind", data$EVTYPE[i])=="TRUE"){
data$Class[i]<-"wind"
} else if(grepl("snow|ice|winter|blizzard|cold|sleet", data$EVTYPE[i])=="TRUE"){
data$Class[i]<-"winter storm"
} else if(grepl("thunderstorm", data$EVTYPE[i], ignore.case = TRUE)==TRUE){
data$Class[i]<-"thunderstorm"
} else if(grepl("heat", data$EVTYPE[i], ignore.case = TRUE)==TRUE){
data$Class[i]<-"heat"
} else if(grepl("fire|wildfire", data$EVTYPE[i], ignore.case = TRUE)==TRUE){
data$Class[i]<-"wildfire"
} else if(grepl("hail", data$EVTYPE[i], ignore.case = TRUE)==TRUE){
data$Class[i]<-"hail"
} else if(grepl("tornado", data$EVTYPE[i], ignore.case = TRUE)==TRUE){
data$Class[i]<-"tornado"
} else{data$Class[i]<-data$EVTYPE[i]}
}
#Summary table by new classification
data2<-data%>%
group_by(Class)%>%
summarise(Injuries=sum(Injuries), Deaths=sum(Deaths), Total.Death.Inj=sum(Total.Death.Inj), Economic.Cost=sum(Economic.Cost), Num.Obs= sum(count), cost.per.event=Economic.Cost/Num.Obs)%>%
arrange(desc(Total.Death.Inj))%>%
as.data.frame()
To get to the results we will look at the summary tables and two visuals which were created using the ggplot2 package. Some minor data processing will be done to create the visual using the reshape2 package.
Plots only include the top 10 values for the variable being analyzed. Total injuries and fatalities/deaths were chosen because the number of fatalities/deaths and injuries was highly correlated.
cor(data2$Deaths, data2$Injuries)
## [1] 0.9040348
#Summary Table
data2[1:10,1:4]
## Class Injuries Deaths Total.Death.Inj
## 1 tornado 90732 5596 96328
## 2 wind 11498 1447 12945
## 3 heat 9224 3138 12362
## 4 flood 8683 1553 10236
## 5 winter storm 6236 856 7092
## 6 lightning 5230 816 6046
## 7 wildfire 1608 90 1698
## 8 hurricane/typhoon 1333 135 1468
## 9 hail 1371 15 1386
## 10 fog 734 62 796
#Reshape the data to be skinny
data3<-melt(data2[1:10,],id.vars="Class", measure.vars=c("Deaths", "Injuries"))
#Barchart plot
ggplot(data=data3, aes(x= reorder(Class,value) ,y= value,fill=variable))+geom_bar(stat="identity")+coord_flip()+labs(title="Injuries and Death by Weather Event Type")+xlab("Weather Event Type")+ylab("Count of Injuries and Deaths")
We can see that tornados were by far the most detrimental to population health. Wind, extreme heat, and floods were the next three most detrimental. While wind caused more combined injuries and deaths, extreme heat caused more fatalities, but fewer injuries.
Next we will look at the Economic/Monetary costs of weather events.
data4<-arrange(data2,desc(Economic.Cost))
data4[1:10,c(1,5:7)]
## Class Economic.Cost Num.Obs cost.per.event
## 1 flood 179975747786 86120 2089825.22
## 2 hurricane/typhoon 90872527810 299 303921497.69
## 3 tornado 56316853401 60622 928983.76
## 4 stormsurge 47965579000 409 117275254.28
## 5 wind 19672310898 364853 53918.46
## 6 winter storm 19203701659 43090 445664.93
## 7 hail 19020050820 289223 65762.58
## 8 drought 15018672000 2488 6036443.73
## 9 wildfire 8904910130 4240 2100214.65
## 10 tropical storm 8382236550 690 12148168.91
#Total costs
ggplot(data=data4[1:10,], aes(x = reorder(Class, Economic.Cost), y = Economic.Cost, fill= Class))+geom_bar(stat="identity", show.legend=F)+coord_flip()+labs(title="Economic Costs of Weather Events")+xlab("Weather Event Type")+ylab("Economic Cost($)")
Floods caused the most damage, followed by hurricanes/typhoons, tornados, and storm surges.
Lastly, I chose to look at costs per event. This will give us an idea of how severe certain event types are.
data5<-data4%>%
filter(Num.Obs>10)%>%
arrange(desc(cost.per.event))
data5[1:10,c(1,5:7)]
## Class Economic.Cost Num.Obs cost.per.event
## 1 hurricane/typhoon 90872527810 299 303921498
## 2 stormsurge 47965579000 409 117275254
## 3 thunderstorm 1226673550 94 13049719
## 4 tropical storm 8382236550 690 12148169
## 5 tsunami 144082000 20 7204100
## 6 drought 15018672000 2488 6036444
## 7 freeze 456930000 76 6012237
## 8 heavy rains 72785050 26 2799425
## 9 wildfire 8904910130 4240 2100215
## 10 flood 179975747786 86120 2089825
ggplot(data=data5[1:10,], aes(x = reorder(Class, cost.per.event), y = cost.per.event, fill= Class))+geom_bar(stat="identity", show.legend=F)+coord_flip()+labs(title="Economic Costs of Weather Events")+xlab("Weather Event Type")+ylab("Cost Per Event ($)")
Here we see that hurricanes are most costly on a per event basis. This makes logical sense. Hurricanes are rarer than other types of events, but when they do occur they are costly. Storm surges were also costly. Events such as floods, which accounted for the most total costs during the time period, seem to occur more often, so they are lest costly on a per event basis.
In summary, tornados had the largest effect on population health, followed by wind and heat. Further analysis could be conducted if more granular data were available such as the severity of the injuries.
Floods had the highest gross economic consequences, followed by hurricanes and tornados. Hurricanes and storm surges were more costly on a per event basis, meaning that they occur less frequently, but have rather devestating economic consequences.