Extreme weather events such as tornado, thunderstorms, blizzards, etc. have the potential to cause catastrophic destruction. Costs of such destruction is not only counted in economic terms, but in human lives.
The following analysis attempts to identify which types of weather events have caused the most destruction. Data will be analyzed from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database contains records of weather events and related economic and human impact from 1950 to 2011. The database can be found at:
https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
Additional information about the database can be found here:
National Weather Service Storm Data Documentation https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf
National Climatic Data Center Storm Events FAQ https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf
The analysis examined four areas of destruction related to specific weather events:
The findings suggest that Tornados have caused the most destruction among all event categories–both in property/crop damage as well as human lives. Other highly destructive weather events include: Storm Winds, Hail, Heat and Winter Storms.
To begin the analysis, import the data from the NOAA site:
filename <- "repdata_data_StormData.csv.bz2"
if (!file.exists(filename)){
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileURL, filename)
}
if (!file.exists("repdata_data_StormData")) {
unzip(filename)
}
data<-read.csv("repdata_data_StormData", stringsAsFactors = FALSE)
Next, load R packages required for processing, analysis and graphics:
library(dplyr)
library(lubridate)
library(tidyr)
library(chron)
library(summarytools)
library(ggplot2)
library(questionr)
library(stats)
library(reshape2)
library(scales)
library(stringr)
The analysis will use the following 5 variables in the database:
Examining these variable reveals three that require cleaning/re-coding procedures in preparation for the analysis. The procedures are as follows:
Property damage expense is the dollar amount of damage from a particular event. This dollar amount is listed in two separate fields, where PROPDMG includes an abbreviated dollar amount (either in 000’s or 000,000’s) next to a variable PROPDMGEXP which indicates the related “K” (thousand) or an “M” (million) extension. To use these dollar amounts for the analysis, we create a new field “propexp” that contains the correct amount abbreviated in billions of dollars (due to the very large amounts when aggregated).
## Recode property damage to new field shown in billions of dollars
data$propexp<-as.numeric(0)
data <- within(data, propexp[PROPDMGEXP=="K"] <- (PROPDMG[PROPDMGEXP=="K"]/1000000))
data <- within(data, propexp[PROPDMGEXP=="M"] <- (PROPDMG[PROPDMGEXP=="M"]/1000))
As with the above property damage recoding, we employ the same procedure to crop damage expense, which is also in two separate fields; the new field “cropexp” is created with the same dollar formating.
## Recode crop damage to new field shown in billions of dollars
data$cropexp<-as.numeric(0)
data <- within(data, cropexp[CROPDMGEXP=="K"] <- (CROPDMG[CROPDMGEXP=="K"]/1000000))
data <- within(data, cropexp[CROPDMGEXP=="M"] <- (PROPDMG[CROPDMGEXP=="M"]/1000))
The field containing weather event categories is EVTYPE. To clean this field, we first make a duplicate called “event2” in order to keep the original variable available throughout the analysis. We then run a frequency count on event2 examining which events have the most destructive effects. We also identify how many unique category names are listed in the field.
## Create new event field "event2"
data$event2<-data$EVTYPE
## Create preliminary frequency counts representing 90%+ of event counts in 1950-2011
head(questionr::freq(data$event2, cum = TRUE, sort = "dec", total = TRUE), 15)
## n % val% %cum val%cum
## HAIL 288661 32.0 32.0 32.0 32.0
## TSTM WIND 219940 24.4 24.4 56.4 56.4
## THUNDERSTORM WIND 82563 9.2 9.2 65.5 65.5
## TORNADO 60652 6.7 6.7 72.2 72.2
## FLASH FLOOD 54277 6.0 6.0 78.3 78.3
## FLOOD 25326 2.8 2.8 81.1 81.1
## THUNDERSTORM WINDS 20843 2.3 2.3 83.4 83.4
## HIGH WIND 20212 2.2 2.2 85.6 85.6
## LIGHTNING 15754 1.7 1.7 87.4 87.4
## HEAVY SNOW 15708 1.7 1.7 89.1 89.1
## HEAVY RAIN 11723 1.3 1.3 90.4 90.4
## WINTER STORM 11433 1.3 1.3 91.7 91.7
## WINTER WEATHER 7026 0.8 0.8 92.4 92.4
## FUNNEL CLOUD 6839 0.8 0.8 93.2 93.2
## MARINE TSTM WIND 6175 0.7 0.7 93.9 93.9
Number of unique event types:
length(summary(as.factor(data$event2),maxsum=1000))
## [1] 985
From above results, we see that:
To re-classify and catogorize all 985 names would be very time consuming as well as unproductive given the goals of the research (identify the most destructive events). Therefore, the following cleaning procedures will seek to identify a select “top-tier”" list that provides an adequate representation.
Through a few iterations of re-classifying, the following cleaing procedure combines similar weather events into a list of eleven final categoreis which represent approximately 98% of total events.
## Create cateogories of events that represent 98% of all events
data$event2<-str_replace_all(data$event2, "DENSE FOG|FOG", "FOG")
data$event2<-str_replace_all(data$event2, "EXCESSIVE HEAT|HEAT", "HEAT")
data$event2<-str_replace_all(data$event2, "STORM WIND/HAIL|HAIL", "HAIL")
data$event2<-str_replace_all(data$event2, "WILD/FOREST FIRE|WILDFIRE", "WILDFIRE")
data$event2<-str_replace_all(data$event2, "FUNNEL CLOUD|TORNADO", "TORNADO")
data$event2<-str_replace_all(data$event2, "FLOODING|FLASH FLOOD|URBAN/SML STREAM FLD|COASTAL FLOOD|HIGH SURF|FLOOD/FLOOD|FLOOD|HEAVY RAIN", "FLOOD")
data$event2<-str_replace_all(data$event2, "SNOW|WINTER WEATHER/WIND CHILL|COLD/WIND CHILL|WINTER WEATHER/MIX|EXTREME COLD|LAKE-EFFECT SNOW|WINTER STORM|EXTREME COLD/WIND CHILL|ICE STORM|HEAVY SNOW|FROST/FREEZE|WINTER WEATHER|BLIZZARD", "WINTER WEATHER")
data$event2<-str_replace_all(data$event2, "STORM WIND/HAIL|MARINE STORM WIND|THUNDERSTORMS|MARINETHUNDERSTORM|TSTM WIND|THUNDERSTORM WIND|THUNDERSTORM WINDS|TSTM WIND/HAIL|STORM WINDS|MARINE TSTM WIND|MARINE THUNDERSTORM WIND|STRONG WIND|HIGH WIND|HIGH WINDS|MARINE THUNDERSTORM", "STORM WIND")
## Create frequency counts to verify 98% of events included in analysis
head(questionr::freq(data$event2, cum = TRUE, sort = "dec", total = TRUE), 11)
## n % val% %cum val%cum
## STORM WIND 338272 37.5 37.5 37.5 37.5
## HAIL 288661 32.0 32.0 69.5 69.5
## FLOOD 96213 10.7 10.7 80.1 80.1
## TORNADO 67491 7.5 7.5 87.6 87.6
## WINTER WEATHER 43755 4.8 4.8 92.5 92.5
## STORM WINDS 22578 2.5 2.5 95.0 95.0
## LIGHTNING 15754 1.7 1.7 96.7 96.7
## WILDFIRE 4218 0.5 0.5 97.2 97.2
## WATERSPOUT 3796 0.4 0.4 97.6 97.6
## DROUGHT 2488 0.3 0.3 97.9 97.9
## HEAT 2445 0.3 0.3 98.2 98.2
Next, we create a new field “eventnew” that only includes these 11 events for the analysis:
## Create new event field with top 11 event categories (98.6% OF EVENTS)
data$eventnew<-NA
data <- within(data, eventnew[event2=="STORM WIND"] <- "Storm Wind")
data <- within(data, eventnew[event2=="HAIL"] <- "Hail")
data <- within(data, eventnew[event2=="FLOOD"] <- "Flood")
data <- within(data, eventnew[event2=="TORNADO"] <- "Tornado")
data <- within(data, eventnew[event2=="WINTER WEATHER"] <- "Winter Storms")
data <- within(data, eventnew[event2=="LIGHTNING"] <- "Ligthning")
data <- within(data, eventnew[event2=="WILDFIRE"] <- "Wildfire")
data <- within(data, eventnew[event2=="WATERSPOUT"] <- "Waterspout")
data <- within(data, eventnew[event2=="DROUGHT"] <- "Drought")
data <- within(data, eventnew[event2=="HEAT"] <- "Heat")
data <- within(data, eventnew[event2=="FOG"] <- "Fog")
After the cleaning procedures are completed, we can then run summary counts on the four analysis variables (Fatalities, Injuries, Property Damage and Crop Damage) to validate the high degree of causal effects from these events.
datafat <- data %>% group_by(eventnew) %>%
summarise(
EventCount = n(),
Fatalities = sum(FATALITIES),
Injuries = sum(INJURIES),
Property = sum(propexp),
Crops = sum(cropexp)
)
na.omit(arrange(datafat, desc(EventCount)))
## # A tibble: 11 x 6
## eventnew EventCount Fatalities Injuries Property Crops
## <chr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Storm Wind 338272 1007 9896 12.1 26.7
## 2 Hail 288661 15 1361 13.9 43.7
## 3 Flood 96213 1684 9052 38.5 47.9
## 4 Tornado 67491 5633 91349 51.6 9.63
## 5 Winter Storms 43755 844 5864 7.39 1.74
## 6 Ligthning 15754 816 5230 0.929 0.104
## 7 Wildfire 4218 87 1456 5.23 2.30
## 8 Waterspout 3796 3 29 0.00935 0
## 9 Drought 2488 0 4 1.05 1.37
## 10 Heat 2445 2840 8625 0.00955 0.296
## 11 Fog 1831 80 1076 0.0228 0
Once the data is processed, three analyses are conducted to exam which weather events have had the most detrimental effects.
From the bar chart below, we see that Storm Winds and Hail emerge as the most frequent types of weather events. Floods, Tornadoes and Winter Storms also have high frequency relative to others.
# Bar Chart 1 - Most Frequent Events
# Make data table for plot
tbleve<-table(data$eventnew)
tbleve<-data.frame(tbleve)
tbleve<-arrange(tbleve, desc(tbleve[,2]))
# Make plot
ggplot(tbleve, aes(x=reorder(Var1, -Freq), y=Freq)) +
geom_bar(stat="identity", width=.5, fill="tomato3") +
labs(title="Number of Events",
subtitle="Years 1950 - 2011",
caption="source: NOAA",
y = "Frequency",
x = "Event")+
theme(axis.text.x = element_text(angle=65, vjust=0.6))+
theme(plot.title=element_text(size=15), plot.subtitle = element_text(size = 10), axis.title.x=element_text(size=8),
axis.title.y=element_text(size=7), axis.text.x=element_text(size=8),
axis.text.y=element_text(size=7))+
scale_y_continuous(labels = comma, limits = c(0, 390000), breaks = c(100000, 200000, 300000))
The following chart shows that Tornadoes are the most damaging type of weather event, resulting in the most injuries and fatalities among all categories. Heat, Flooding and Winter Storms are also highly impact.
## Bar chart 2 - Fatilites and Injuries by Event
# Make data table for plot
tblfat<-data %>% filter(!is.na(eventnew)) %>% group_by(eventnew) %>%
summarise(Injuries = sum(INJURIES), Fatalities = sum(FATALITIES))
tblfat <- tblfat[order(desc(tblfat$Fatalities)), ]
tblfat$eventnew <- factor(tblfat$eventnew, levels = tblfat$eventnew)
# Make plot
tblfat2 <- melt(tblfat,id="eventnew")
ggplot(tblfat2, aes(x=eventnew, y=value, fill=variable)) +
stat_summary(fun.y=sum, geom="bar", position=position_dodge(1)) +
labs(title="Number of Fatilities & Injuries",
subtitle="Years 1950 - 2011",
caption="source: NOAA",
y = "No. of Fatalities/Injuries",
x = "Event")+
scale_fill_manual("legend", values = c("Injuries" = "burlywood2", "Fatalities" = "black"))+
theme(legend.title = element_blank()) +
theme(axis.text.x = element_text(angle=65, vjust=0.6))+
theme(plot.title=element_text(size=15), plot.subtitle = element_text(size = 10), axis.title.x=element_text(size=8),
axis.title.y=element_text(size=7), axis.text.x=element_text(size=8),
axis.text.y=element_text(size=7))+
scale_y_continuous(labels = comma, limits = c(0, 100000), breaks = c(20000, 40000, 60000, 80000))
Regarding property and crop damage, Tornadoes, Floods, Hail and Storm Winds have caused the most most economic damage, resulting in the loss of billions of dollars of time.
## Bar chart 3 - Crop and Property Damage by Event
# Make data table for plot
tbldam<-data %>% filter(!is.na(eventnew)) %>% group_by(eventnew) %>%
summarise(property = sum(propexp), crops = sum(cropexp))
tbldam <- tbldam[order(desc(tbldam$property)), ]
tbldam$eventnew <- factor(tbldam$eventnew, levels = tbldam$eventnew)
# Make plot
tbldam2 <- melt(tbldam,id="eventnew")
ggplot(tbldam2, aes(x=eventnew, y=value, fill=variable)) +
stat_summary(fun.y=sum, geom="bar", position=position_dodge(1)) +
labs(title="Property and Crop Damage Expense",
subtitle="Years 1950 - 2011 (US $Billions)",
caption="source: NOAA",
y = "US Dollars (000,000,000's)",
x = "Event")+
scale_fill_manual("legend", values = c("property" = "lightsalmon4", "crops" = "springgreen4"))+
theme(legend.title = element_blank()) +
theme(axis.text.x = element_text(angle=65, vjust=0.6))+
theme(plot.title=element_text(size=15), plot.subtitle = element_text(size = 10), axis.title.x=element_text(size=8),
axis.title.y=element_text(size=7), axis.text.x=element_text(size=8),
axis.text.y=element_text(size=7))+
scale_y_continuous(labels = dollar, limits = c(0, 55), breaks = c(10, 20, 30, 40, 50))
Based on the above findings, Tornadoes, Floods, Hail, Heat and Winter Storms have caused the greatest destructive impact in the US from 1950 to 2011. This is especially true for Tornadoes, which can have the greatest impact on human lives.