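For this analysis we looked at NOAA Storm Events data with the aim of answering two main questions: which types of events are most harmful to population health, and which have the greatest economic consequences?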
After analyzing the data we found that tornadoes and floods were particularly damaging to both population health and property. Crops were most affected by drought and temperature extremes (extreme heat and cold), along with floods and hurricanes.
The NOAA Storm Data was provided as a bzip2-compressed file for download.
This data also came with the following documentation:
After downloading and unzipping the data, I loaded it into RStudio:
# read.csv() sets sep = "," and header = TRUE, and does not treat "#" or "'" specially in free-text fields
stormdf <- read.csv("repdata-data-StormData.csv")
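As an aside, unzipping by hand is optional, since base R can read a bzip2-compressed file directly. The sketch below shows this on a tiny made-up file written to a temporary path; the column values are invented for illustration and are not taken from the NOAA data.

```r
# Write a small, made-up CSV through a bzip2 connection (toy data, not NOAA's).
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0))
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() detects the compression when it opens the file, so no manual
# decompression step is needed.
toy_back <- read.csv(path, stringsAsFactors = FALSE)
print(toy_back)
```

The same call should work on the downloaded archive itself, which saves keeping a large uncompressed copy on disk.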
After loading the data, I decided that the fields of interest were the date, event type, fatalities, injuries, and the property and crop damage columns. After some investigation, and not wanting to rely too heavily on older, incomplete data, I chose to examine only storm data from 1995 onward.
# Keep only the columns needed for the analysis
keep_cols <- c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
sdf <- stormdf[, keep_cols]
sdf$BGN_DATE <- as.Date(sdf$BGN_DATE, format = "%m/%d/%Y")
# Use an unambiguous ISO date for the cutoff (the original mixed %d/%m/%Y with %m/%d/%Y)
sdf2 <- sdf[sdf$BGN_DATE >= as.Date("1995-01-01"), ]
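To illustrate the date handling, here is a minimal sketch on toy rows. It assumes the raw date strings carry a trailing time component, which `as.Date()` ignores once the `%m/%d/%Y` part has been matched; the rows themselves are made up.

```r
# Toy rows; the date strings are illustrative and assumed to resemble the raw field.
toy <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "12/31/1994 0:00:00",
               "1/1/1995 0:00:00", "6/5/2011 0:00:00"),
  EVTYPE = c("TORNADO", "HAIL", "FLOOD", "TORNADO"),
  stringsAsFactors = FALSE
)

# Parse month/day/year; anything after the matched date is dropped.
toy$BGN_DATE <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y")

# Keep events on or after the cutoff date.
cutoff <- as.Date("1995-01-01")
recent <- toy[toy$BGN_DATE >= cutoff, ]
print(recent)
```

Only the two rows dated 1995 or later survive the filter, including the one that falls exactly on the cutoff.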
Next I had to convert the exponent fields into something measurable. For both property damage and crop damage, I mapped H to 100, K to 1,000, M to 1,000,000, and B to 1,000,000,000 (in either upper or lower case). I treated the blank and special-character codes as 1 after looking into the data and finding that there were few instances of these.
sdf2$PROPDMGEXP <- as.character(sdf2$PROPDMGEXP)
sdf2$CROPDMGEXP <- as.character(sdf2$CROPDMGEXP)
sdf2$PROPDMGEXP[sdf2$PROPDMGEXP == "" | sdf2$PROPDMGEXP == "+" | sdf2$PROPDMGEXP == "?" | sdf2$PROPDMGEXP == "-"] <- "1"
sdf2$PROPDMGEXP[sdf2$PROPDMGEXP == "H" | sdf2$PROPDMGEXP == "h"] <- "100"
sdf2$PROPDMGEXP[sdf2$PROPDMGEXP == "K" | sdf2$PROPDMGEXP == "k"] <- "1000"
sdf2$PROPDMGEXP[sdf2$PROPDMGEXP == "M" | sdf2$PROPDMGEXP == "m"] <- "1000000"
sdf2$PROPDMGEXP[sdf2$PROPDMGEXP == "B" | sdf2$PROPDMGEXP == "b"] <- "1000000000"
sdf2$CROPDMGEXP[sdf2$CROPDMGEXP == "" | sdf2$CROPDMGEXP == "?"] <- "1"
sdf2$CROPDMGEXP[sdf2$CROPDMGEXP == "H" | sdf2$CROPDMGEXP == "h"] <- "100"
sdf2$CROPDMGEXP[sdf2$CROPDMGEXP == "K" | sdf2$CROPDMGEXP == "k"] <- "1000"
sdf2$CROPDMGEXP[sdf2$CROPDMGEXP == "M" | sdf2$CROPDMGEXP == "m"] <- "1000000"
sdf2$CROPDMGEXP[sdf2$CROPDMGEXP == "B" | sdf2$CROPDMGEXP == "b"] <- "1000000000"
sdf2$PROPDMGEXP <- as.integer(sdf2$PROPDMGEXP)
sdf2$CROPDMGEXP <- as.integer(sdf2$CROPDMGEXP)
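The recoding above can also be written as a single lookup, which avoids repeating the same lines for each column. This is only a sketch: `exp_to_multiplier` is a hypothetical helper name, and it assumes that every unrecognized code (blank, punctuation, and also any digit codes) should map to 1, which is not exactly what `as.integer()` does with digit codes in the step above.

```r
# Hypothetical helper: translate exponent codes into numeric multipliers.
exp_to_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  # Match case-insensitively; codes missing from the lookup come back as NA.
  mult <- unname(lookup[toupper(as.character(code))])
  # Assumption: anything unrecognized counts as a multiplier of 1.
  mult[is.na(mult)] <- 1
  mult
}

codes <- c("k", "M", "", "?", "B", "h")
multipliers <- exp_to_multiplier(codes)
print(multipliers)
```

With this helper, each damage total would be the damage column times `exp_to_multiplier()` of its exponent column.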
With these new fields, I could multiply each damage figure by its multiplier to get a dollar value that is measurable and comparable across records.
sdf2$propdmgtotal <- sdf2$PROPDMG * sdf2$PROPDMGEXP
sdf2$cropdmgtotal <- sdf2$CROPDMG * sdf2$CROPDMGEXP
Now the data is ready for analysis.
To examine the effect on population health, I combined fatalities and injuries to get a high-level idea of how many people were affected. I aggregated this total by event type and sorted the result from most to least affected.
sdf2$pophealtheffect <- sdf2$FATALITIES + sdf2$INJURIES
pophealth <- aggregate(sdf2$pophealtheffect, by = list(sdf2$EVTYPE), FUN = sum)
colnames(pophealth) <- c("evtype", "pop_effect")
pophealth <- pophealth[with(pophealth, order(-pop_effect)),]
pophealth[1:15,]
##                evtype pop_effect
## 666           TORNADO      23310
## 112    EXCESSIVE HEAT       8428
## 144             FLOOD       7192
## 358         LIGHTNING       5360
## 683         TSTM WIND       3871
## 231              HEAT       2954
## 134       FLASH FLOOD       2668
## 607 THUNDERSTORM WIND       1557
## 787      WINTER STORM       1493
## 313 HURRICANE/TYPHOON       1339
## 288         HIGH WIND       1334
## 773          WILDFIRE        986
## 206              HAIL        926
## 254        HEAVY SNOW        866
## 157               FOG        779
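Since the same aggregate, sort, and take-the-top pattern recurs throughout this analysis, it could be wrapped in a small function. The sketch below uses a hypothetical helper, `top_events`, and made-up toy rows; it is one possible way to express the pattern, not the code used to produce the tables here.

```r
# Hypothetical helper: sum one numeric column by event type and return the
# n largest totals. Function and argument names are illustrative.
top_events <- function(df, value_col, n = 15) {
  totals <- aggregate(df[[value_col]], by = list(evtype = df$EVTYPE), FUN = sum)
  names(totals)[2] <- value_col
  totals <- totals[order(-totals[[value_col]]), ]
  head(totals, n)
}

# Toy data, invented for illustration.
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(2, 0, 1, 4, 1),
  INJURIES = c(10, 3, 5, 0, 2)
)
toy$casualties <- toy$FATALITIES + toy$INJURIES

top2 <- top_events(toy, "casualties", n = 2)
print(top2)
```

Calling it with a different column name would give the injury-only or fatality-only ranking without repeating the aggregation code.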
I also wanted to see whether certain events were more likely to kill than to injure, as I thought the combined total might be obscuring the full picture. Here are the most impactful event types for injuries and for fatalities, taken separately.
injuryfx <- aggregate(sdf2$INJURIES, by = list(sdf2$EVTYPE), FUN = sum)
colnames(injuryfx) <- c("evtype", "injuries")
injuryfx <- injuryfx[with(injuryfx, order(-injuries)), ]
injuryfx[1:15,]
##                evtype injuries
## 666           TORNADO    21765
## 144             FLOOD     6769
## 112    EXCESSIVE HEAT     6525
## 358         LIGHTNING     4631
## 683         TSTM WIND     3630
## 231              HEAT     2030
## 134       FLASH FLOOD     1734
## 607 THUNDERSTORM WIND     1426
## 787      WINTER STORM     1298
## 313 HURRICANE/TYPHOON     1275
## 288         HIGH WIND     1093
## 206              HAIL      916
## 773          WILDFIRE      911
## 254        HEAVY SNOW      751
## 157               FOG      718
fatalityfx <- aggregate(sdf2$FATALITIES, by = list(sdf2$EVTYPE), FUN = sum)
colnames(fatalityfx) <- c("evtype", "fatalities")
fatalityfx <- fatalityfx[with(fatalityfx, order(-fatalities)),]
fatalityfx[1:15,]
##                evtype fatalities
## 112    EXCESSIVE HEAT       1903
## 666           TORNADO       1545
## 134       FLASH FLOOD        934
## 231              HEAT        924
## 358         LIGHTNING        729
## 144             FLOOD        423
## 461       RIP CURRENT        360
## 288         HIGH WIND        241
## 683         TSTM WIND        241
## 16          AVALANCHE        223
## 462      RIP CURRENTS        204
## 787      WINTER STORM        195
## 233         HEAT WAVE        161
## 607 THUNDERSTORM WIND        131
## 121      EXTREME COLD        126
Many event types appeared on both lists, but events like avalanches and rip currents made only the fatality list, because their fatality counts are high relative to their injury counts. Overall I thought it was fine to look at injuries and fatalities in concert when determining the most impactful events, but I also wanted to look into the split between fatalities and injuries for each of these.
I therefore took the 15 event types with the highest combined totals and broke them down by injuries vs. fatalities:
pophealth2 <- pophealth[1:15,]
uniqueNames <- pophealth2$evtype
popdf1 <- sdf2[sdf2$EVTYPE %in% uniqueNames, ]
f_df <- aggregate(popdf1$FATALITIES, by = list(popdf1$EVTYPE), FUN = sum)
i_df <- aggregate(popdf1$INJURIES, by = list(popdf1$EVTYPE), FUN = sum)
colnames(f_df) <- c("evtype", "fatalities")
colnames(i_df) <- c("evtype", "injuries")
f_df <- f_df[with(f_df, order(evtype)),]
i_df <- i_df[with(i_df, order(evtype)),]
pop_df <- data.frame(f_df, i_df$injuries)
colnames(pop_df)[3] <- "injuries"
Now I have a data frame that shows 15 event types and the number of injuries and fatalities they've caused since 1995. Plotting this resulted in the following chart.
library(ggplot2)
library(reshape2)
df2 <- melt(pop_df, id.var = "evtype")
ggplot(df2, aes(x = evtype, y = value, fill = variable)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("Event Type") +
  ylab("# of Fatalities & Injuries") +
  ggtitle("NOAA Top 15: Most Harmful Events for Population Health, 1995-2011")
Clearly tornadoes, excessive heat, and flooding were primary events that caused harm to the population whether by death or injury.
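To put a number on the fatality vs. injury split, one option is to compute the share of casualties that were deaths. The sketch below does this for five event types whose totals appear in both tables above; the `fatality_share` column name is my own, and the calculation is an illustration rather than part of the original analysis.

```r
# Totals copied from the fatality and injury tables above (1995 onward).
split_df <- data.frame(
  evtype = c("TORNADO", "EXCESSIVE HEAT", "FLOOD", "LIGHTNING", "FLASH FLOOD"),
  fatalities = c(1545, 1903, 423, 729, 934),
  injuries = c(21765, 6525, 6769, 4631, 1734),
  stringsAsFactors = FALSE
)

# Share of all casualties (deaths plus injuries) that were deaths.
split_df$fatality_share <- split_df$fatalities /
  (split_df$fatalities + split_df$injuries)

# Sort so the deadliest mix comes first.
split_df <- split_df[order(-split_df$fatality_share), ]
print(split_df)
```

On these figures, flash floods and excessive heat have a much larger share of deaths among their casualties than tornadoes or floods do, even though tornadoes lead on the combined count.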
I decided to measure economic consequences through property damage and crop damage. I did not want to combine the two, because I thought there was much more insight to be gleaned from looking at them separately.
As such, I looked at the top 15 most harmful events for property damage first.
propfx <- aggregate(sdf2$propdmgtotal, by = list(sdf2$EVTYPE), FUN = sum)
colnames(propfx) <- c("evtype", "propdmg")
propfx <- propfx[with(propfx, order(-propdmg)),]
propfx2 <- propfx[1:15,]
ggplot(propfx2, aes(x = evtype, y = propdmg)) +
  geom_bar(stat = "identity", color = "black", fill = "blue") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("Event Type") +
  ylab("Property Damage ($)") +
  ggtitle("NOAA Top 15 Most Harmful Events for Property Damage, 1995-2011")
Most of the events that cause heavy property damage involve water, such as floods, hurricanes, and storms. Interestingly, tornadoes, though very impactful on population health, are less consequential for property damage.
Now I'll look at the effects on crop damage.
cropfx <- aggregate(sdf2$cropdmgtotal, by = list(sdf2$EVTYPE), FUN = sum)
colnames(cropfx) <- c("evtype", "cropdmg")
cropfx <- cropfx[with(cropfx, order(-cropdmg)),]
cropfx2 <- cropfx[1:15,]
ggplot(cropfx2, aes(x = evtype, y = cropdmg)) +
  geom_bar(stat = "identity", color = "black", fill = "green") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("Event Type") +
  ylab("Crop Damage ($)") +
  ggtitle("NOAA Top 15 Most Harmful Events for Crop Damage, 1995-2011")
Crops are most affected by extreme temperatures, much more so than property is. It is somewhat intuitive that drought and excessive heat and cold rank fairly highly. Flood also appears here, which further cements it as a very impactful event type. Tornadoes, however, are not among the top 15 most harmful events with regard to crop damage.
As seen above, we can conclude that tornadoes and floods are particularly impactful on both population health and the economy. Given more time, I'd like to investigate how often each event type occurs, to get a better idea of whether tornadoes and floods rank so highly simply because their recorded instances dwarf those of events like avalanches or mudslides.
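One way that follow-up could start is by dividing each event type's total harm by its number of recorded events. The sketch below uses invented toy rows and hypothetical column names (`casualties`, `n_events`, `casualties_per_event`); it shows the shape of the calculation only and says nothing about the real data.

```r
# Toy data, invented for illustration: one row per recorded event.
toy <- data.frame(
  EVTYPE = c("TORNADO", "TORNADO", "TORNADO", "TORNADO", "AVALANCHE"),
  casualties = c(3, 0, 1, 4, 3)
)

# Total harm and number of recorded events per event type.
totals <- aggregate(casualties ~ EVTYPE, data = toy, FUN = sum)
counts <- aggregate(casualties ~ EVTYPE, data = toy, FUN = length)
names(counts)[2] <- "n_events"

# Harm per recorded event puts frequent and rare event types on one scale.
per_event <- merge(totals, counts, by = "EVTYPE")
per_event$casualties_per_event <- per_event$casualties / per_event$n_events
print(per_event)
```

In this toy example the rarer event type has the lower total but the higher harm per event, which is the kind of effect the raw totals would hide.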