In this report, I explore the NOAA Storm Dataset to answer two basic questions about severe weather events: which types of events are most harmful to human health, and which types cause the greatest economic damage?
The analysis uses only the relevant columns and tracks two datasets: the overall dataset (1950-2011) and a more complete “sound” subset (1993-2011), restricted to the years whose observation counts reach at least half of the maximum. The results seem to indicate that while tornadoes cause the most harm to human health, floods cause the most economic damage.
First, load required R packages:
library(ggplot2)
library(dplyr)
library(gridExtra)
if (!require(car)) {
install.packages("car")
require(car)
}
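Next, from the same NOAA dataset I read in only the variables that are relevant to this analysis: BGN_DATE (event start date), EVTYPE (event type), FATALITIES, INJURIES, PROPDMG and PROPDMGEXP (property damage and its exponent code), and CROPDMG and CROPDMGEXP (crop damage and its exponent code).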
# Load relevant data columns only
if(!exists("storm.data")){
storm.data <- read.csv("repdata-data-StormData.csv.bz2",
header=TRUE,
na.strings="",
colClasses=c('NULL', 'character', rep('NULL',5), 'factor', rep('NULL',14),
'numeric','numeric','numeric','factor','numeric','factor',
rep('NULL',9)))
# Manipulate date to get only the year (to see observation distribution by year, later on)
storm.data$BGN_DATE <- as.numeric(format(as.POSIXct(storm.data$BGN_DATE, format="%m/%d/%Y", tz=""), "%Y"))
}
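To illustrate how the colClasses argument drops unwanted columns at read time, here is a minimal sketch on a made-up five-column CSV. The toy data and the names csv_text, toy and YEAR are assumptions for illustration only; the real file has many more columns.

```r
# Toy CSV standing in for the real file (hypothetical data)
csv_text <- "ID,BGN_DATE,STATE,EVTYPE,FATALITIES
1,4/18/1950 0:00:00,AL,TORNADO,0
2,6/1/2011 0:00:00,TX,HAIL,1"

# A colClasses entry of "NULL" skips that column entirely
toy <- read.csv(text = csv_text,
                header = TRUE,
                na.strings = "",
                colClasses = c("NULL", "character", "NULL", "factor", "numeric"))

names(toy)  # only BGN_DATE, EVTYPE and FATALITIES remain

# Reduce the date string to its year, as done for storm.data$BGN_DATE
toy$YEAR <- as.numeric(format(as.POSIXct(toy$BGN_DATE, format = "%m/%d/%Y", tz = "UTC"), "%Y"))
toy$YEAR
```

Skipping columns this way keeps memory use down, which matters for a file of this size.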
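Once loaded, the data needs to be cleaned. Specifically, I relabel the “?” event type as “NOT DEFINED”, trim leading and trailing whitespace from EVTYPE, and recode the PROPDMGEXP and CROPDMGEXP exponent codes as numeric powers of ten.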
Note: The Storm Data Documentation specifies the exponent codes as k = thousands, m = millions, b = billions. However, the handbook does not say how other characters found in this column, such as “?”, “-” and “+”, are to be treated. In light of this uncertainty I assign an exponent of 0 to these, i.e. a multiplier of 10^0 = 1.
# Change "?" EVTYPE to "NOT DEFINED"
storm.data$EVTYPE <- gsub("^[?]$", "NOT DEFINED", storm.data$EVTYPE)
# Trim leading whitespaces from EVTYPE
storm.data$EVTYPE <- gsub("^\\s+", "", storm.data$EVTYPE)
# Trim trailing whitespace
storm.data$EVTYPE <- gsub("\\s+$", "", storm.data$EVTYPE)
# Recode exponent columns PROPDMGEXP and CROPDMGEXP to numeric exponents (using recode() from "car" package)
storm.data$PROPDMGEXP <- recode(tolower(storm.data$PROPDMGEXP),
"'h'=2; 'k'=3;'m'=6;'b'=9;
c(NA,'?','-','+')=0",
as.numeric.result=TRUE)
storm.data$CROPDMGEXP <- recode(tolower(storm.data$CROPDMGEXP),
"'h'=2; 'k'=3;'m'=6;'b'=9;
c(NA,'?','-','+')=0",
as.numeric.result=TRUE)
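The same mapping can be sketched in base R with a named lookup vector, which makes the treatment of each code explicit. The helper exp_to_power is hypothetical and not part of the original analysis; it assumes that single digits are kept as literal exponents and that anything else unrecognised becomes 0.

```r
# Hypothetical helper (not part of the original analysis): map an exponent
# code to a power of ten using a named lookup vector.
exp_to_power <- function(code) {
  lookup <- c(h = 2, k = 3, m = 6, b = 9)
  code <- tolower(as.character(code))
  power <- unname(lookup[code])        # NA for anything that is not h/k/m/b
  digit <- grepl("^[0-9]$", code)      # assumption: digits are literal exponents
  power[digit] <- as.numeric(code[digit])
  power[is.na(power)] <- 0             # NA, "?", "-", "+" get exponent 0
  power
}

codes <- c("K", "m", "B", "h", "?", "-", "+", NA, "5")
exp_to_power(codes)

# A PROPDMG of 2.5 with code "K" is 2.5 * 10^3 dollars
2.5 * 10^exp_to_power("K")
```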
Here is a sample snapshot of the cleaned dataset:
storm.data[sample(nrow(storm.data), 3), ]
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 342760 1998 HAIL 0 0 0 0
## 801619 2010 HIGH WIND 0 0 0 3
## 868695 2011 THUNDERSTORM WIND 0 0 6 3
## CROPDMG CROPDMGEXP
## 342760 0 0
## 801619 0 3
## 868695 0 3
It has been noted that data from the earlier decades might not have been captured as consistently. It is therefore important to see the distribution of observations over the years:
# histogram of the number of observations per year in the NOAA dataset
obs.plot <- ggplot(storm.data, aes(x=BGN_DATE))
obs.plot +
geom_histogram(aes(fill=..count..)) +
labs(title="NOAA Storm Data Observations Distribution by Year",
x="Year", y="Observations",
fill=" Observation Count \n (Color Scale)") +
theme_bw() +
theme(axis.text.x=element_text(angle=45, hjust=1))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect
Judging by the histogram, the tallest bar holds around 100,000 observations (around 2009), and every bar from 1993 to 2011 holds at least half of that. Note that with the default of 30 bins across 1950-2011, each bar spans roughly two years, so these are per-bin rather than exact per-year counts. For the sake of accuracy, it seems sound to focus on this date range as the analysis proceeds, so I track it in a separate dataset, “sound.data”, alongside the overall data.
In line with this, I create the sound dataset by keeping only observations from 1993 onward:
#Create sound datasubset of storm.data
sound.data <- subset(storm.data, BGN_DATE>=1993)
rownames(sound.data) <- NULL
sound.data[sample(nrow(sound.data), 3), ]
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 26610 1995 HAIL 0 0 1.5 3
## 605067 2010 FLASH FLOOD 0 0 0.0 3
## 646044 2010 THUNDERSTORM WIND 0 0 5.0 3
## CROPDMG CROPDMGEXP
## 26610 0.3 3
## 605067 0.0 3
## 646044 0.0 3
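Because the histogram bins years together, the cutoff year can also be checked against exact per-year counts. The sketch below does this with table() on a made-up vector of years. The years vector and the threshold value are illustrative assumptions; in the report the input would be storm.data$BGN_DATE with a cutoff chosen to match.

```r
# Hypothetical toy years; in the report this would be storm.data$BGN_DATE
years <- c(rep(1990, 2), rep(1993, 5), rep(1994, 6), rep(2011, 8))
threshold <- 5   # stands in for the real observation cutoff

per_year <- table(years)   # exact number of observations in each year
per_year

# Years whose observation count meets the threshold
sound_years <- as.numeric(names(per_year)[per_year >= threshold])
sound_years
```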
Moving on, although there seem to be some duplicates in the “EVTYPE” column, it is unclear from the Storm Data Documentation whether “TSTM WIND” is the same as “THUNDERSTORM WINDS”, or whether “FLOOD” is equivalent to “FLASH FLOOD”. As a result, I have made no assumptions about such labels and have treated each weather event type as is.
For the purposes of this analysis, the measure of damage to human health may be given by the sum of the fatalities and injuries caused by a type of weather event. What follows is the aggregation of the total fatalities and injuries (health casualties) by the various event types:
# For analysis of #1, create a new column that sums up the effects on health (defined
# as FATALITIES+INJURIES) per observation
storm.data$HEALTH.CASUALTIES <- storm.data$FATALITIES + storm.data$INJURIES
sound.data$HEALTH.CASUALTIES <- sound.data$FATALITIES + sound.data$INJURIES
# Now, aggregate sum of HEALTH.CASUALTIES by EVTYPE
harmful.events <- aggregate(HEALTH.CASUALTIES~EVTYPE, data=storm.data, FUN=sum)
harmful.events.sound <- aggregate(HEALTH.CASUALTIES~EVTYPE, data=sound.data, FUN=sum)
# arrange and get top 10 harmful events
harmful.events <- head(arrange(harmful.events, desc(HEALTH.CASUALTIES)), 10)
harmful.events.sound <- head(arrange(harmful.events.sound, desc(HEALTH.CASUALTIES)), 10)
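The aggregate-then-rank pattern used above can be seen end to end on a small made-up data frame. This is a base R sketch with hypothetical values; order() stands in for dplyr's arrange(desc(...)).

```r
# Hypothetical toy observations
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HAIL", "TORNADO", "FLOOD", "HAIL"),
  FATALITIES = c(2, 0, 1, 3, 0),
  INJURIES   = c(10, 1, 4, 0, 0)
)

# Per-observation health casualties, as defined in this report
toy$HEALTH.CASUALTIES <- toy$FATALITIES + toy$INJURIES

# Sum casualties within each event type
by_event <- aggregate(HEALTH.CASUALTIES ~ EVTYPE, data = toy, FUN = sum)

# Sort in decreasing order and keep the top two
by_event <- by_event[order(-by_event$HEALTH.CASUALTIES), ]
head(by_event, 2)
```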
We see that the top ten harmful (to human health) events across the USA for 1950 to 2011 are:
harmful.events
## EVTYPE HEALTH.CASUALTIES
## 1 TORNADO 96979
## 2 EXCESSIVE HEAT 8428
## 3 TSTM WIND 7461
## 4 FLOOD 7259
## 5 LIGHTNING 6046
## 6 HEAT 3037
## 7 FLASH FLOOD 2755
## 8 ICE STORM 2064
## 9 THUNDERSTORM WIND 1621
## 10 WINTER STORM 1527
Whereas the top ten harmful events in the more complete dataset, from 1993 onward, are:
harmful.events.sound
## EVTYPE HEALTH.CASUALTIES
## 1 TORNADO 24931
## 2 EXCESSIVE HEAT 8428
## 3 FLOOD 7259
## 4 LIGHTNING 6046
## 5 TSTM WIND 3872
## 6 HEAT 3037
## 7 FLASH FLOOD 2755
## 8 ICE STORM 2064
## 9 THUNDERSTORM WIND 1621
## 10 WINTER STORM 1527
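As noted earlier, near-duplicate labels such as “TSTM WIND” and “THUNDERSTORM WIND” are kept separate in this analysis. A quick sensitivity check on the 1950-2011 totals above shows how the ranking would shift if those two labels were merged. Treating them as synonyms is an assumption made here for illustration only, and other similar labels outside the top ten are ignored.

```r
# Totals copied from the 1950-2011 table in this report
totals <- c("TORNADO" = 96979, "EXCESSIVE HEAT" = 8428, "TSTM WIND" = 7461,
            "FLOOD" = 7259, "LIGHTNING" = 6046, "THUNDERSTORM WIND" = 1621)

# Assumption for illustration only: treat the two wind labels as one event type
merged <- totals
merged["THUNDERSTORM WIND"] <- merged["THUNDERSTORM WIND"] + merged["TSTM WIND"]
merged <- merged[names(merged) != "TSTM WIND"]

# Re-rank after merging
sort(merged, decreasing = TRUE)
```

Under that assumption the combined wind total (7461 + 1621 = 9082) would move ahead of excessive heat, so the lower ranks are sensitive to how event labels are grouped, while tornadoes remain first by a wide margin.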
Similar in principle to the above, a measure of economic damage may be defined (for the purposes of this analysis) by the sum of property and crop damage caused by a type of weather event:
# For analysis of #2, aggregate economic damage by Event type
# economic damage = property damage + crop damage
storm.data$econ.damage <- with(storm.data, PROPDMG*(10^PROPDMGEXP) + CROPDMG*(10^CROPDMGEXP))
sound.data$econ.damage <- with(sound.data, PROPDMG*(10^PROPDMGEXP) + CROPDMG*(10^CROPDMGEXP))
# Now, get top 10 events to cause most damage
Econ.Damage <- aggregate(econ.damage~EVTYPE, data=storm.data, FUN=sum)
Econ.Damage <- head(arrange(Econ.Damage, desc(econ.damage)), 10)
Econ.Damage.sound <- aggregate(econ.damage~EVTYPE, data=sound.data, FUN=sum)
Econ.Damage.sound <- head(arrange(Econ.Damage.sound, desc(econ.damage)), 10)
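As a worked example of the damage formula, take the 1995 hail record from the sound.data sample shown earlier (PROPDMG 1.5, CROPDMG 0.3, both with exponent 3). The variable names below simply mirror the column names.

```r
# Values from the 1995 HAIL row in the sound.data sample
PROPDMG <- 1.5; PROPDMGEXP <- 3
CROPDMG <- 0.3; CROPDMGEXP <- 3

# economic damage = property damage + crop damage, each scaled by its power of ten
econ.damage <- PROPDMG * 10^PROPDMGEXP + CROPDMG * 10^CROPDMGEXP
econ.damage   # 1500 + 300 = 1800 dollars
```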
We see that the top ten most economically damaging events across the USA for 1950 to 2011 are:
Econ.Damage
## EVTYPE econ.damage
## 1 FLOOD 150319678257
## 2 HURRICANE/TYPHOON 71913712800
## 3 TORNADO 57362333947
## 4 STORM SURGE 43323541000
## 5 HAIL 18761221986
## 6 FLASH FLOOD 18244041079
## 7 DROUGHT 15018672000
## 8 HURRICANE 14610229010
## 9 RIVER FLOOD 10148404500
## 10 ICE STORM 8967041360
Whereas the top ten most economically damaging events in the more complete dataset, from 1993 onward, are:
Econ.Damage.sound
## EVTYPE econ.damage
## 1 FLOOD 150319678257
## 2 HURRICANE/TYPHOON 71913712800
## 3 STORM SURGE 43323541000
## 4 TORNADO 26764135377
## 5 HAIL 18761221986
## 6 FLASH FLOOD 18244041079
## 7 DROUGHT 15018672000
## 8 HURRICANE 14610229010
## 9 RIVER FLOOD 10148404500
## 10 ICE STORM 8967041360
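Subtracting the two tables shows how much of each total comes from the years before 1993. The sketch below uses three of the totals printed above; the vector names overall, sound and pre1993 are illustrative.

```r
# Totals copied from the two tables above
overall <- c(FLOOD = 150319678257, TORNADO = 57362333947, HAIL = 18761221986)
sound   <- c(FLOOD = 150319678257, TORNADO = 26764135377, HAIL = 18761221986)

# Damage recorded before 1993 = overall total minus 1993-2011 total
pre1993 <- overall - sound
pre1993

# Share of tornado damage that comes from the pre-1993 records
round(pre1993[["TORNADO"]] / overall[["TORNADO"]], 2)
```

Of these three, only the tornado total changes: roughly half of its recorded damage falls before 1993, which is why tornadoes drop from third to fourth place in the sound subset.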
The final datasets are now set up. On to the results.
The top ten weather-related events that have caused the most harm to human health, defined as the sum of the fatalities and injuries caused, may be seen here:
#ggplot harmful events
harmful.events.plot <- ggplot(data=harmful.events,
aes(x=reorder(EVTYPE, -HEALTH.CASUALTIES),
y=HEALTH.CASUALTIES)) +
geom_bar(stat="identity", fill="red", col="black") +
theme_bw() +
labs(title="Top Ten Harmful Weather Events \n Across the US (1950-2011)",
x="Type of Event",
y="Total Health Damage (Fatalities + Injuries)") +
theme(legend.position="none")+
theme(axis.text.x=element_text(angle=35, hjust=1))
harmful.events.sound.plot <- ggplot(data=harmful.events.sound,
aes(x=reorder(EVTYPE, -HEALTH.CASUALTIES),
y=HEALTH.CASUALTIES)) +
geom_bar(stat="identity", fill="orange", col="black") +
theme_bw() +
labs(title="Top Ten Harmful Weather Events \n Across the US (1993-2011)",
x="Type of Event",
y="") +
theme(legend.position="none")+
theme(axis.text.x=element_text(angle=35, hjust=1))
grid.arrange(harmful.events.plot, harmful.events.sound.plot, ncol=2)
We consider the top 5 results in terms of the overall dataset and sound data subset, in decreasing order of threat to human health:
| Rank | Overall Data (1950-2011) | Sound Data (1993-2011) |
|---|---|---|
| 1 | Tornado | Tornado |
| 2 | Excessive Heat | Excessive Heat |
| 3 | Tstm Wind | Flood |
| 4 | Flood | Lightning |
| 5 | Lightning | Tstm Wind |
By considering more complete observations in the second dataset, we seem to have more accurately identified the top 5 greatest threats to human health.
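The rank changes between the two columns of the table can be computed with match(). The vectors below are copied from the table; the names overall, sound and rank_shift are illustrative.

```r
# Top five event types from each dataset, in rank order (from the table above)
overall <- c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND", "FLOOD", "LIGHTNING")
sound   <- c("TORNADO", "EXCESSIVE HEAT", "FLOOD", "LIGHTNING", "TSTM WIND")

rank_shift <- data.frame(
  EVTYPE       = overall,
  overall_rank = seq_along(overall),
  sound_rank   = match(overall, sound)   # position of each event in the sound ranking
)

# Positive shift = the event ranks higher (closer to 1) in the sound subset
rank_shift$shift <- rank_shift$overall_rank - rank_shift$sound_rank
rank_shift
```

Only thunderstorm wind moves down (by two places), while flood and lightning each move up by one.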
The top ten weather-related events that have caused the most economic damage (the sum of property damage and crop damage) across the USA may be seen here, color coded by cost of damage:
#ggplot economic damage
econ.damage.plot <- ggplot(data=Econ.Damage,
aes(x=reorder(EVTYPE, -econ.damage),
y=econ.damage, # keep y numeric so bar heights are drawn to scale
fill=-econ.damage))+
geom_bar(stat="identity") +
scale_y_continuous(labels=scales::comma) + # comma-formatted axis labels
theme_bw() +
labs(title="Damaging Events \n Across US (1950-2011)",
x="Type of Event",
y="Total Economic Damage ($) \n (Sum of Property & Crop Damage)") +
theme(legend.position="none")+
theme(axis.text.x=element_text(angle=45, hjust=1))
#sound data
econ.damage.sound.plot <- ggplot(data=Econ.Damage.sound,
aes(x=reorder(EVTYPE, -econ.damage),
y=econ.damage, # keep y numeric so bar heights are drawn to scale
fill=-econ.damage))+
geom_bar(stat="identity") +
scale_y_continuous(labels=scales::comma) + # comma-formatted axis labels
theme_bw() +
labs(title="Damaging Events \n Across US (1993-2011)",
x="Type of Event",
y="") +
theme(legend.position="none")+
theme(axis.text.x=element_text(angle=45, hjust=1))
grid.arrange(econ.damage.plot, econ.damage.sound.plot, ncol=2)
Here, we consider the top 5 results in terms of the overall dataset and sound data subset, in decreasing order of economic damage:
| Rank | Overall Data (1950-2011) | Sound Data (1993-2011) |
|---|---|---|
| 1 | Flood | Flood |
| 2 | Hurricane/Typhoon | Hurricane/Typhoon |
| 3 | Tornado | Storm Surge |
| 4 | Storm Surge | Tornado |
| 5 | Hail | Hail |
By considering more complete observations in the second dataset, we again seem to have more accurately identified the top 5 most devastating weather events in terms of property and crop damage across the USA.
End of Report