Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This analysis involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States from 1950 to 2011, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
This report presents an analysis of this database to determine the events which have the more impact in respect to health and economic consequences. It will answer the following questions :
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
The following packages are required to perform the analysis:
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
library(stringr)
library(scales)
library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.7.0 (2015-02-19) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.19.0 (2015-02-27) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
##
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
##
## The following objects are masked from 'package:base':
##
## attach, detach, gc, load, save
##
## R.utils v2.1.0 (2015-05-27) successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
##
## The following object is masked from 'package:tidyr':
##
## extract
##
## The following object is masked from 'package:utils':
##
## timestamp
##
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, inherits, isOpen, parse, warnings
The dataset will be automatically downloaded from the Coursera website and unziped in the data folder.
if (!file.exists("./data/repdata-data-StormData.csv.bz2")) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
"./data/repdata-data-StormData.csv.bz2", method='curl')
}
if (!file.exists("./data/repdata-data-StormData.csv")) {
bunzip2("./data/repdata-data-StormData.csv.bz2", "./data/repdata-data-StormData.csv", remove = FALSE)
}
d <- read.csv('./data/repdata-data-StormData.csv')
Here is a quick look at the data:
glimpse(d)
## Observations: 902297
## Variables:
## $ STATE__ (dbl) 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ BGN_DATE (fctr) 4/18/1950 0:00:00, 4/18/1950 0:00:00, 2/20/1951 0:...
## $ BGN_TIME (fctr) 0130, 0145, 1600, 0900, 1500, 2000, 0100, 0900, 20...
## $ TIME_ZONE (fctr) CST, CST, CST, CST, CST, CST, CST, CST, CST, CST, ...
## $ COUNTY (dbl) 97, 3, 57, 89, 43, 77, 9, 123, 125, 57, 43, 9, 73, ...
## $ COUNTYNAME (fctr) MOBILE, BALDWIN, FAYETTE, MADISON, CULLMAN, LAUDER...
## $ STATE (fctr) AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL, AL...
## $ EVTYPE (fctr) TORNADO, TORNADO, TORNADO, TORNADO, TORNADO, TORNA...
## $ BGN_RANGE (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ BGN_AZI (fctr) , , , , , , , , , , , , , , , , , , , , , , , ,
## $ BGN_LOCATI (fctr) , , , , , , , , , , , , , , , , , , , , , , , ,
## $ END_DATE (fctr) , , , , , , , , , , , , , , , , , , , , , , , ,
## $ END_TIME (fctr) , , , , , , , , , , , , , , , , , , , , , , , ,
## $ COUNTY_END (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ COUNTYENDN (lgl) NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ END_RANGE (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ END_AZI (fctr) , , , , , , , , , , , , , , , , , , , , , , , ,
## $ END_LOCATI (fctr) , , , , , , , , , , , , , , , , , , , , , , , ,
## $ LENGTH (dbl) 14.0, 2.0, 0.1, 0.0, 0.0, 1.5, 1.5, 0.0, 3.3, 2.3, ...
## $ WIDTH (dbl) 100, 150, 123, 100, 150, 177, 33, 33, 100, 100, 400...
## $ F (int) 3, 2, 2, 2, 2, 2, 2, 1, 3, 3, 1, 1, 3, 3, 3, 4, 1, ...
## $ MAG (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ FATALITIES (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 4, 0, ...
## $ INJURIES (dbl) 15, 0, 2, 2, 2, 6, 1, 0, 14, 0, 3, 3, 26, 12, 6, 50...
## $ PROPDMG (dbl) 25.0, 2.5, 25.0, 2.5, 2.5, 2.5, 2.5, 2.5, 25.0, 25....
## $ PROPDMGEXP (fctr) K, K, K, K, K, K, K, K, K, K, M, M, K, K, K, K, K,...
## $ CROPDMG (dbl) 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ CROPDMGEXP (fctr) , , , , , , , , , , , , , , , , , , , , , , , ,
## $ WFO (fctr) , , , , , , , , , , , , , , , , , , , , , , , ,
## $ STATEOFFIC (fctr) , , , , , , , , , , , , , , , , , , , , , , , ,
## $ ZONENAMES (fctr) , , , , , , , , , , , , , , , , , , , , , , , ,
## $ LATITUDE (dbl) 3040, 3042, 3340, 3458, 3412, 3450, 3405, 3255, 333...
## $ LONGITUDE (dbl) 8812, 8755, 8742, 8626, 8642, 8748, 8631, 8558, 874...
## $ LATITUDE_E (dbl) 3051, 0, 0, 0, 0, 0, 0, 0, 3336, 3337, 3402, 3404, ...
## $ LONGITUDE_ (dbl) 8806, 0, 0, 0, 0, 0, 0, 0, 8738, 8737, 8644, 8640, ...
## $ REMARKS (fctr) , , , , , , , , , , , , , , , , , , , , , , , ,
## $ REFNUM (dbl) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
The first step is to select the columns that relates the event types (EVTYPE) with their consequences to the population health (FATALITIES and INJURIES) and economy (PROPDMG and CROPDMG and their relative exponents). I also exclude events without consequences.
dd <- d %>%
select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
filter(FATALITIES != 0 | INJURIES != 0 | PROPDMG != 0 | CROPDMG != 0)
For the economic damages, I cleanup the exponents and include the whole figure directly in the main damage columns.
First the property damages. The exponents look as follow:
levels(dd$PROPDMGEXP)
## [1] "" "+" "-" "0" "1" "2" "3" "4" "5" "6" "7" "8" "?" "B" "H" "K" "M"
## [18] "h" "m"
So we need to clean that up. k, m, h and b mean x1000, x1000000, etc.. I assume numeric values to be exponent to 10 (i.e. 2 is equivalent to multiplying by 100). Others values are ignored (PROPDMG multiplied by 1).
dd <- dd %>% mutate(PROPDMG =
ifelse(PROPDMGEXP == 'K' | PROPDMGEXP == 'k' | PROPDMGEXP == '3', PROPDMG * 1000,
ifelse(PROPDMGEXP == 'M' | PROPDMGEXP == 'm' | PROPDMGEXP == '6', PROPDMG * 1000000,
ifelse(PROPDMGEXP == 'h' | PROPDMGEXP == 'H' | PROPDMGEXP == '2', PROPDMG * 100,
ifelse(PROPDMGEXP == 'B' | PROPDMGEXP == 'b' | PROPDMGEXP == '9', PROPDMG * 1000000000,
ifelse(PROPDMGEXP == '4', PROPDMG * 10000,
ifelse(PROPDMGEXP == '5', PROPDMG * 100000,
ifelse(PROPDMGEXP == '7', PROPDMG * 10000000,
ifelse(PROPDMGEXP == '8', PROPDMG * 100000000,
PROPDMG)))))))))
I use the same logic for the crop damages:
dd <- dd %>% mutate(CROPDMG =
ifelse(CROPDMGEXP == 'K' | CROPDMGEXP == 'k' | CROPDMGEXP == '3', CROPDMG * 1000,
ifelse(CROPDMGEXP == 'M' | CROPDMGEXP == 'm' | CROPDMGEXP == '6', CROPDMG * 1000000,
ifelse(CROPDMGEXP == 'h' | CROPDMGEXP == 'H' | CROPDMGEXP == '2', CROPDMG * 100,
ifelse(CROPDMGEXP == 'B' | CROPDMGEXP == 'b' | CROPDMGEXP == '9', CROPDMG * 1000000000,
ifelse(CROPDMGEXP == '4', CROPDMG * 10000,
ifelse(CROPDMGEXP == '5', CROPDMG * 100000,
ifelse(CROPDMGEXP == '7', CROPDMG * 10000000,
ifelse(CROPDMGEXP == '8', CROPDMG * 100000000,
CROPDMG)))))))))
The exponent columns are no longer required for this analysis.
dd <- dd %>%
select(EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG)
Now I clean up the event types.
length(levels(dd$EVTYPE))
## [1] 985
There are 985 types of events in the original dataset. As we already reduced the dataset, let’s update the levels:
dd$EVTYPE <- factor(dd$EVTYPE)
length(levels(dd$EVTYPE))
## [1] 488
We’re still left with 488 types of events, that’s almost 10 times more than in the documentation. Let’s start by setting all the event types uppercase and removing the trailing white spaces:
dd <- dd %>%
mutate(EVTYPE = as.factor(toupper(str_trim(EVTYPE))))
length(levels(dd$EVTYPE))
## [1] 444
We’re now at 444 types of events.
Here I make some assumption as per how the events should have been classified and make the necessary changes:
dd[grepl('THUNDERSTORM WIND', dd$EVTYPE),]$EVTYPE <- as.factor('THUNDERSTORM WIND')
dd[grepl('TSTM WIND', dd$EVTYPE),]$EVTYPE <- as.factor('THUNDERSTORM WIND')
dd[grepl('^FLOOD', dd$EVTYPE),]$EVTYPE <- as.factor('FLOOD')
dd[grepl('^FLASH FLOOD', dd$EVTYPE),]$EVTYPE <- as.factor('FLASH FLOOD')
dd[grepl('^HAIL', dd$EVTYPE),]$EVTYPE <- as.factor('HAIL')
dd[grepl('^FUNNEL', dd$EVTYPE),]$EVTYPE <- as.factor('FUNNEL CLOUD')
dd[grepl('^HIGH WIND', dd$EVTYPE),]$EVTYPE <- as.factor('HIGH WIND')
dd[grepl('^HURRICANE', dd$EVTYPE),]$EVTYPE <- as.factor('HURRICANE/TYPHOON')
dd[grepl('^LIGHTNING', dd$EVTYPE),]$EVTYPE <- as.factor('LIGHTNING')
dd$EVTYPE <- factor(dd$EVTYPE)
length(levels(dd$EVTYPE))
## [1] 323
That’s enough for the analysis.
Let’s sum all the columns to have for each event type their impact:
dd <- dd %>%
group_by(EVTYPE) %>%
summarise_each(funs(sum))
Let’s tidy the dataset to show the event type, the impact type and the impact value:
dd <- dd %>%
gather(IMPACTTYPE, VALUE, FATALITIES, INJURIES, PROPDMG, CROPDMG)
I will create a separate dataset with the 10 most harmful events with respect to the population health. This includes the fatalities and injuries. I use a temporary TOTAL column to properly order the events types by their total impact.
d_health <- dd %>%
group_by(EVTYPE) %>%
filter(IMPACTTYPE == 'FATALITIES' | IMPACTTYPE == 'INJURIES' ) %>%
mutate(TOTAL = sum(VALUE)) %>%
ungroup %>%
arrange(desc(TOTAL)) %>%
top_n(20, TOTAL) %>%
select(-TOTAL)
d_health$EVTYPE <- factor(d_health$EVTYPE, levels = rev(unique(d_health$EVTYPE)))
ggplot(data=d_health, aes(x = EVTYPE, y = VALUE, fill=IMPACTTYPE)) +
geom_bar(stat="identity") +
ggtitle('Events which are most harmful with repect to population health, 1950-2011') +
ylab('Population impacted') +
xlab('Type of event') +
scale_fill_discrete(name='Type of impact', labels=c('Fatalities', 'Injuries')) +
coord_flip()
I will create a separate dataset with the 10 type of events having the greatest economic consequences. This includes property damages and crop damages. I use the same logic as for the health impacts.
d_eco <- dd %>%
group_by(EVTYPE) %>%
filter(IMPACTTYPE == 'PROPDMG' | IMPACTTYPE == 'CROPDMG' ) %>%
mutate(TOTAL = sum(VALUE)) %>%
ungroup %>%
arrange(desc(TOTAL)) %>%
top_n(20, TOTAL) %>%
select(-TOTAL)
d_eco$EVTYPE <- factor(d_eco$EVTYPE, levels = rev(unique(d_eco$EVTYPE)))
ggplot(data=d_eco, aes(x = EVTYPE, y = VALUE, fill=IMPACTTYPE)) +
geom_bar(stat="identity") +
ggtitle('Types of events which have the greatest economic consequences, 1950-2011') +
ylab('Costs of damages') +
xlab('Type of event') +
scale_fill_discrete(name='Type of damage', labels=c('Properties', 'Crops')) +
scale_y_continuous(labels = comma) +
coord_flip()
From the plots above, we can draw the following conclusions :