In this report, we identify the severe weather events that are most harmful to population health, as well as the events that have the greatest economic consequences. To identify these events, we draw on data from the NOAA Storm Database, which tracks characteristics of major storms and weather events in the United States. From our analysis, we find that tornadoes are the most harmful to population health, having killed more than 5,000 people and injured more than 90,000, while floods have the greatest economic impact, causing more than $100 billion in combined property and crop damage in the US.
From the National Oceanic & Atmospheric Administration, we obtained the data on severe weather events.
We start by setting our working directory and loading the libraries that make the analysis more efficient.
library(lattice)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
We first read the first 500 rows of the dataset to determine the class of each column. Supplying these classes up front speeds up reading the full file. We store the column classes in a vector and pass it to read.table through the colClasses argument when reading the full NOAA dataset.
noaa500 <- read.table("repdata-data-StormData.csv", sep = ",", nrows = 500)
colclass <- c()
for (i in 1:dim(noaa500)[2]){
colclass <- c(colclass, class(noaa500[2, i]))
}
noaa <- read.table("repdata-data-StormData.csv", sep = ",", colClasses = colclass, header = T,
comment.char = "#")
dim(noaa)
## [1] 902297 37
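The two-pass read described above can be sketched on a toy file. This is a minimal illustration under stated assumptions, not the report's exact code: the file contents, the column names and the variable names `tmp`, `sample_rows`, `col_classes` and `full` are all hypothetical, and the sample is read with its header so that numeric columns are inferred as numeric.

```r
# Toy CSV standing in for the storm data (hypothetical columns and values)
tmp <- tempfile(fileext = ".csv")
writeLines(c("STATE,EVTYPE,INJURIES",
             "AL,TORNADO,15",
             "AL,HAIL,0",
             "TX,FLOOD,2"), tmp)

# Pass 1: read a small sample, including the header, to infer column classes
sample_rows <- read.csv(tmp, nrows = 2, stringsAsFactors = FALSE)
col_classes <- unname(vapply(sample_rows, function(col) class(col)[1], character(1)))

# Pass 2: read the whole file, supplying the inferred classes
full <- read.csv(tmp, colClasses = col_classes)
str(full)
```

One caveat of this pattern: if the sample suggests a column is integer but a later row holds a decimal or a text value, the second read stops with an error, so a larger sample or a manual override may be needed.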
After reading the dataset, we check the first few rows, noting that it contains 902,297 rows and 37 columns.
head(noaa[, 1:10])
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1.00 4/18/1950 0:00:00 0130 CST 97.00 MOBILE AL
## 2 1.00 4/18/1950 0:00:00 0145 CST 3.00 BALDWIN AL
## 3 1.00 2/20/1951 0:00:00 1600 CST 57.00 FAYETTE AL
## 4 1.00 6/8/1951 0:00:00 0900 CST 89.00 MADISON AL
## 5 1.00 11/15/1951 0:00:00 1500 CST 43.00 CULLMAN AL
## 6 1.00 11/15/1951 0:00:00 2000 CST 77.00 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI
## 1 TORNADO 0.00
## 2 TORNADO 0.00
## 3 TORNADO 0.00
## 4 TORNADO 0.00
## 5 TORNADO 0.00
## 6 TORNADO 0.00
We extract the column of interest (in this case, the Event Type specified by EVTYPE), and print a brief summary.
evtype <- noaa$EVTYPE
summary(evtype)[1:10]; length(unique(evtype))
## HAIL TSTM WIND THUNDERSTORM WIND
## 288661 219940 82563
## TORNADO FLASH FLOOD FLOOD
## 60652 54277 25326
## THUNDERSTORM WINDS HIGH WIND LIGHTNING
## 20843 20212 15754
## HEAVY SNOW
## 15708
## [1] 985
We check whether the EVTYPE column contains any missing values.
sum(is.na(evtype))
## [1] 0
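The same check can be run on several columns at once. The sketch below uses a small hypothetical data frame (`toy`) rather than the NOAA data, and counts the missing values per column with `colSums(is.na(...))`.

```r
# Hypothetical data frame with a few missing values
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", NA),
  INJURIES = c(15, NA, 2),
  FATALITIES = c(1, 0, 0),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na_counts <- colSums(is.na(toy))
na_counts
```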
As we are interested in finding out which types of events are most harmful to population health and which have the greatest economic consequences, we first reduce the data frame to the columns we are interested in and convert those columns to the appropriate formats. We start by creating two separate data frames for Injuries and Fatalities:
Injuries
df_inj <- select(noaa, EVTYPE, INJURIES)
df_inj$INJURIES <- as.numeric(as.character(df_inj$INJURIES))
df_inj <- filter(df_inj, INJURIES > 0)
head(df_inj)
## EVTYPE INJURIES
## 1 TORNADO 15
## 2 TORNADO 2
## 3 TORNADO 2
## 4 TORNADO 2
## 5 TORNADO 6
## 6 TORNADO 1
Fatalities
df_fat <- select(noaa, EVTYPE, FATALITIES)
df_fat$FATALITIES <- as.numeric(as.character(df_fat$FATALITIES))
df_fat <- filter(df_fat, FATALITIES > 0)
head(df_fat)
## EVTYPE FATALITIES
## 1 TORNADO 1
## 2 TORNADO 1
## 3 TORNADO 4
## 4 TORNADO 1
## 5 TORNADO 6
## 6 TORNADO 7
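The `as.numeric(as.character(...))` idiom used above matters when a column has been read as a factor. A short sketch with hypothetical values shows the difference: calling `as.numeric` directly on a factor returns its internal level codes, not the numbers printed on screen.

```r
# Hypothetical factor column holding numbers stored as text
f <- factor(c("15.00", "2.00", "0.00", "2.00"))

codes  <- as.numeric(f)                # internal level codes, not the values
values <- as.numeric(as.character(f))  # the numbers the labels represent

codes
values
```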
Creating a separate dataframe for Property and Crop Damage
df_dmg <- select(noaa, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
df_dmg$PROPDMG <- as.numeric(as.character(df_dmg$PROPDMG))
df_dmg$CROPDMG <- as.numeric(as.character(df_dmg$CROPDMG))
# Converting to lowercase
df_dmg$PROPDMGEXP <- tolower(df_dmg$PROPDMGEXP)
df_dmg$CROPDMGEXP <- tolower(df_dmg$CROPDMGEXP)
# We look at "b" and "m" as they stand for billions and millions - allowing us to take a more focused approach to identifying high-profile (in terms of economic impact) event types
df_dmg <- filter(df_dmg, CROPDMGEXP %in% c("b", "m") | PROPDMGEXP %in% c("b", "m"))
head(df_dmg)
## EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 2.5 m 0
## 2 TORNADO 2.5 m 0
## 3 TORNADO 2.5 m 0
## 4 TORNADO 2.5 m 0
## 5 TORNADO 2.5 m 0
## 6 TORNADO 2.5 m 0
dim(df_dmg)
## [1] 12693 5
Noting that there should be only 48 unique event types, but the data contains 985, we attempt to fix some of the inconsistencies using functions such as tolower and trimws.
# tolower() and trimws() are vectorised, so they can be applied to the whole column
# directly; this also keeps EVTYPE a character vector rather than a list
df_inj$EVTYPE <- trimws(tolower(df_inj$EVTYPE))
df_fat$EVTYPE <- trimws(tolower(df_fat$EVTYPE))
df_dmg$EVTYPE <- trimws(tolower(df_dmg$EVTYPE))
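To see what this normalisation achieves, the sketch below applies it to a handful of hypothetical labels. The whitespace-collapsing step with `gsub` is an extra assumption that is not part of the report's own cleaning.

```r
# Hypothetical raw labels that differ only in case and spacing
raw <- c("TSTM WIND", " tstm wind", "Tstm  Wind", "HAIL", "hail ")

# Lowercase and strip leading/trailing whitespace, as in the report
clean <- trimws(tolower(raw))

# Extra step (not in the report): collapse runs of internal whitespace
clean <- gsub("\\s+", " ", clean)

n_raw   <- length(unique(raw))
n_clean <- length(unique(clean))
c(before = n_raw, after = n_clean)
```

Misspellings and synonyms (for example "tstm wind" versus "thunderstorm wind") are not merged by this kind of cleaning and would need an explicit mapping.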
As the dataset used to analyse the economic consequences of the various event types contains the exponent columns PROPDMGEXP and CROPDMGEXP, we need to carry out additional processing on the data frame to obtain the total impact of each event. See here for more information on processing the data. We then add the Total Property Damage column to the Total Crop Damage column to obtain the total damage done.
# To see the potential unique values which can arise
unique(union(unique(df_dmg$PROPDMGEXP), unique(df_dmg$CROPDMGEXP)))
## [1] "m" "b" "k" "" "5" "0" "?"
# We identify the values to be "m", "b", "k", "", "5", "0", "?"; convert values in accordance
for (i in 1:dim(df_dmg)[1]){
df_dmg$TOT_PROPDMG[i] <-
if (df_dmg$PROPDMGEXP[i] == "b"){
df_dmg$PROPDMG[i] * 1000000000
} else if (df_dmg$PROPDMGEXP[i] == "m"){
df_dmg$PROPDMG[i] * 1000000
} else if (df_dmg$PROPDMGEXP[i] == "k"){
df_dmg$PROPDMG[i] * 1000
} else if (df_dmg$PROPDMGEXP[i] %in% c("", "?")){
0
} else {
df_dmg$PROPDMG[i] * 10
}
}
for (i in 1:dim(df_dmg)[1]){
df_dmg$TOT_CROPDMG[i] <-
if (df_dmg$CROPDMGEXP[i] == "b"){
df_dmg$CROPDMG[i] * 1000000000
} else if (df_dmg$CROPDMGEXP[i] == "m"){
df_dmg$CROPDMG[i] * 1000000
} else if (df_dmg$CROPDMGEXP[i] == "k"){
df_dmg$CROPDMG[i] * 1000
} else if (df_dmg$CROPDMGEXP[i] %in% c("", "?")){
0
} else {
df_dmg$CROPDMG[i] * 10
}
}
df_dmg$TOT_DMG <- df_dmg$TOT_PROPDMG + df_dmg$TOT_CROPDMG
df_dmg$TOT_DMG <- df_dmg$TOT_DMG/1000000000
df_dmg <- df_dmg[, c("EVTYPE", "TOT_DMG")]
head(df_dmg)
## EVTYPE TOT_DMG
## 1 tornado 0.0025
## 2 tornado 0.0025
## 3 tornado 0.0025
## 4 tornado 0.0025
## 5 tornado 0.0025
## 6 tornado 0.0025
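The exponent conversion can also be written without a row-by-row loop, using a named lookup vector. The following is a minimal sketch on hypothetical data, assuming the same rules as the loops above ("b", "m" and "k" scale by billions, millions and thousands, "" and "?" give 0, and anything else multiplies by 10). The names `toy`, `multiplier` and `mult` are illustrative.

```r
# Hypothetical damage records
toy <- data.frame(
  PROPDMG = c(2.5, 1, 25, 3, 4),
  PROPDMGEXP = c("m", "b", "k", "", "5"),
  stringsAsFactors = FALSE
)

# Named lookup: exponent code -> multiplier
multiplier <- c(b = 1e9, m = 1e6, k = 1e3)

# Indexing by name returns NA for codes that are not in the lookup
mult <- unname(multiplier[toy$PROPDMGEXP])
unknown <- is.na(mult)

# Same fallback rules as the loops: "" and "?" -> 0, everything else -> 10
mult[unknown] <- ifelse(toy$PROPDMGEXP[unknown] %in% c("", "?"), 0, 10)

toy$TOT_PROPDMG <- toy$PROPDMG * mult
toy
```

On a data frame with hundreds of thousands of rows, a vectorised lookup like this typically runs much faster than assigning one element at a time.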
We apply the aggregate function to the three datasets to group them by event type. Since we are interested in the total effect each event type has on population health and on the economy, we pass in the sum function.
Injuries
df_inj <- aggregate(INJURIES ~ EVTYPE, data = df_inj, sum, na.rm = T)
head(df_inj)
## EVTYPE INJURIES
## 1 avalanche 170
## 2 black ice 24
## 3 blizzard 805
## 4 blowing snow 14
## 5 brush fire 2
## 6 coastal flood 2
Fatalities
df_fat <- aggregate(FATALITIES ~ EVTYPE, data = df_fat, sum, na.rm = T)
head(df_fat)
## EVTYPE FATALITIES
## 1 avalance 1
## 2 avalanche 224
## 3 black ice 1
## 4 blizzard 101
## 5 blowing snow 2
## 6 coastal flood 3
Economic Consequence
df_dmg <- aggregate(TOT_DMG ~ EVTYPE, data = df_dmg, sum, na.rm = T)
head(df_dmg)
## EVTYPE TOT_DMG
## 1 agricultural freeze 0.02882
## 2 astronomical high tide 0.00850
## 3 avalanche 0.00210
## 4 blizzard 0.74703
## 5 coastal flooding/erosion 0.01500
## 6 coastal flood 0.24628
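The grouping step can be illustrated on a few hypothetical rows. With the formula interface, `aggregate` splits the left-hand column by the right-hand column and applies the supplied function to each group.

```r
# Hypothetical per-event records
toy <- data.frame(
  EVTYPE = c("tornado", "flood", "tornado", "hail", "flood"),
  INJURIES = c(15, 2, 6, 1, 3),
  stringsAsFactors = FALSE
)

# One row per event type, with injuries summed within each group
totals <- aggregate(INJURIES ~ EVTYPE, data = toy, FUN = sum)
totals
```

The result is ordered by the grouping variable rather than by the totals, which is why a separate ranking step is needed afterwards.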
Next, we apply a threshold value to reduce each dataset to the top 5 event types, ranked by total impact:
Injuries
thr_inj <- sort(df_inj$INJURIES, decreasing = T)[5]
thr_fat <- sort(df_fat$FATALITIES, decreasing = T)[5]
thr_dmg <- sort(df_dmg$TOT_DMG, decreasing = T)[5]
The threshold values for injuries, fatalities and economic consequences are 5230, 816 and 17.5620265 (in billions of dollars) respectively.
We filter the datasets based on these threshold values:
df_inj <- filter(df_inj, INJURIES >= thr_inj)
df_fat <- filter(df_fat, FATALITIES >= thr_fat)
df_dmg <- filter(df_dmg, TOT_DMG >= thr_dmg)
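An alternative to the threshold filter is to sort the rows and keep the first n. The sketch below, on hypothetical totals, shows both approaches and how they differ when values tie at the cut-off: the threshold filter keeps every tied row, while `head` keeps exactly n.

```r
# Hypothetical aggregated totals
totals <- data.frame(
  EVTYPE = c("tornado", "flood", "hail", "heat", "lightning", "ice storm", "wind"),
  INJURIES = c(91, 68, 14, 21, 52, 20, 14),
  stringsAsFactors = FALSE
)

# Sort descending and keep the first n rows
n <- 5
top <- head(totals[order(totals$INJURIES, decreasing = TRUE), ], n)
top

# Threshold approach with a tie at the cut-off (the 6th largest value is 14, twice)
thr6 <- sort(totals$INJURIES, decreasing = TRUE)[6]
by_threshold <- totals[totals$INJURIES >= thr6, ]
nrow(by_threshold)  # more than 6 rows because of the tie
```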
Finally, we use plots to identify the top 5 event types:
Injuries
g_inj <- ggplot(df_inj, aes(x = EVTYPE, y = INJURIES)) + theme_bw() + geom_bar(stat = 'identity') +
xlab('Event Type') + ylab('Total Injuries') +
ggtitle('Total Injuries by Event Type')
g_inj
Fatalities
g_fat <- ggplot(df_fat, aes(x = EVTYPE, y = FATALITIES)) + theme_bw() + geom_bar(stat = 'identity') +
xlab('Event Type') + ylab('Total Fatalities') +
ggtitle('Total Fatalities by Event Type')
g_fat
Economic Consequence
g_dmg <- ggplot(df_dmg, aes(x = EVTYPE, y = TOT_DMG)) + theme_bw() + geom_bar(stat = 'identity') +
xlab('Event Type') + ylab('Total Economic Damage (in $ Billions)') +
ggtitle('Total Economic Impact by Event Type')
g_dmg
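By default the bars are drawn in alphabetical order of the event type. One possible refinement, sketched here on hypothetical totals, is to reorder the factor levels by value with `reorder` so that the largest bar comes first. The plotting part only runs if ggplot2 is installed.

```r
# Hypothetical top-5 totals
totals <- data.frame(
  EVTYPE = c("flood", "heat", "lightning", "tornado", "tstm wind"),
  INJURIES = c(68, 21, 52, 91, 69),
  stringsAsFactors = FALSE
)

# Order the levels by descending total (reorder sorts ascending, hence the minus sign)
totals$EVTYPE <- reorder(totals$EVTYPE, -totals$INJURIES)
levels(totals$EVTYPE)

# Same plot style as in the report, now with bars sorted from largest to smallest
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  g_sorted <- ggplot(totals, aes(x = EVTYPE, y = INJURIES)) + theme_bw() +
    geom_bar(stat = "identity") +
    xlab("Event Type") + ylab("Total Injuries")
}
```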
It turns out that tornadoes have had the worst impact on population health, killing more than 5,000 people and injuring more than 90,000, while floods have had the greatest economic consequences, costing the US more than $100 billion.