This report explores the impact of storms in United States from 1950 to 2011. The data is taken from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) database. The two key analyses done are on which types of events that are the most harmful with respect to population health and which types of events that have the greatest economic consequences. For health impacts we will look at the number of fatalities and injuries from the events. As for economic consequences, we will look at damage done to props(housing, cars, etc) and crops.
This section loads the necessary R packages for analysis. NOTE: Please ensure you have the packages installed.
library(knitr)
opts_chunk$set(echo = TRUE)
library(lubridate)
library(dplyr)
library(ggplot2)
library(stringr)
library(reshape2)
Output session info
sessionInfo()
## R version 3.2.5 (2016-04-14)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8.1 x64 (build 9600)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] reshape2_1.4.1 stringr_1.0.0 ggplot2_2.1.0 dplyr_0.5.0
## [5] lubridate_1.5.6 knitr_1.13
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.6 digest_0.6.9 assertthat_0.1 plyr_1.8.4
## [5] grid_3.2.5 R6_2.1.2 gtable_0.2.0 DBI_0.4-1
## [9] formatR_1.4 magrittr_1.5 scales_0.4.0 evaluate_0.9
## [13] stringi_1.1.1 rmarkdown_1.0 tools_3.2.5 munsell_0.4.3
## [17] yaml_2.1.13 colorspace_1.2-6 htmltools_0.3.5 tibble_1.1
This section processes the storm data provided by the NOAA storm database. The file is available here. Documentation on the data can be found here as well. Any data corruption, missing data and duplicates will be dealt here.
First, we unzip and load the data into R. Note: Make sure the file is in your working directory!
raw_data <- read.csv(bzfile("repdata_data_StormData.csv.bz2"))
Then we take a look at the data itself and determine which are the relevant information to keep
dim(raw_data)
## [1] 902297 37
names(raw_data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
To reduce the data size we are working with. We filter out the relevant columns and store in another variable. Then we take a closer look at the remaining variables.
storm <- select(raw_data, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP,BGN_DATE)
str(storm)
## 'data.frame': 902297 obs. of 9 variables:
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
Here we look at the event types in the data and change the observations to lower case for easier analysis. Check how many unique values are there in event type
#Check some unique event types
head(unique(storm$EVTYPE),20)
## [1] TORNADO TSTM WIND
## [3] HAIL FREEZING RAIN
## [5] SNOW ICE STORM/FLASH FLOOD
## [7] SNOW/ICE WINTER STORM
## [9] HURRICANE OPAL/HIGH WINDS THUNDERSTORM WINDS
## [11] RECORD COLD HURRICANE ERIN
## [13] HURRICANE OPAL HEAVY RAIN
## [15] LIGHTNING THUNDERSTORM WIND
## [17] DENSE FOG RIP CURRENT
## [19] THUNDERSTORM WINS FLASH FLOOD
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
#Count the number of unique event types in the data
length(unique(storm$EVTYPE))
## [1] 985
The event type column data contains too many misspellings, different variations of the official 48 event types. Filtering to just 48 of the official events will involve large amount of time and effort. Also, data before 1996 did not capture all 48 event type and were known to be inconsistent, as the data wasn’t digital then. Henceforth in order to make better judgements, data before 1996 will not be used in this analysis.
#Use lubridate to change the BGN_DATE column to datetype
storm$BGN_DATE <- mdy_hms(storm$BGN_DATE)
#Then mutate a new column with the year into the storm data
storm <- mutate(storm, YEAR = year(storm$BGN_DATE))
storm_data <- filter(storm, YEAR >= 1996)
Next we will clean up the event types column.
#change all variables to lowercase
event_types <- tolower(storm_data$EVTYPE[])
#replace the punctuation characters with a "."
event_types <- make.names(event_types)
event_types <- gsub("\\.", " ", event_types)
length(unique(event_types))
## [1] 433
Cleaning the data further by replacing abbreviations and removing unrelated events types.
#replace all the tstm with thunderstorm
event_types <- gsub("tstm", "thunderstorm", event_types)
storm_data$EVTYPE <- event_types
#filter out the rows of data that contain invalid eventtype
storm_data <- storm_data[-grep('summary', storm_data$EVTYPE),]
#check the number of unique EVTYPE again.
length(unique(storm_data$EVTYPE))
## [1] 362
The resulting number of unique events type is 362 as compared to the 900+ before.
The reference file indicated the relevant exponents for the prop and crop damage. We will use those reference to recalculate the prop and crop damage
First we create a function that takes in the exponent and determine what multiplier to return. There are symbols and other numbers present as well. The documentation made no mention of numbers/symbols as exponents so they will be taken as 0.
#Modify the exponents so they can be multiplied into the crop/prop damage
storm_data <- storm_data %>%
mutate(PROPDMGEXP = ifelse(PROPDMGEXP %in% c('B','b'),1E9,
ifelse(PROPDMGEXP %in% c('M','m'),1E6,
ifelse(PROPDMGEXP %in% c('K','k'),1E3,
ifelse(PROPDMGEXP %in% c('H','h'),1E2,0))))) %>%
mutate(CROPDMGEXP = ifelse(CROPDMGEXP %in% c('B','b'),1E9,
ifelse(CROPDMGEXP %in% c('M','m'),1E6,
ifelse(CROPDMGEXP %in% c('K','k'),1E3,
ifelse(CROPDMGEXP %in% c('H','h'),1E2,0)))))
#Next multiply the new exponents into the prop/crop damage
storm_data <- storm_data %>%
mutate(new_propdmg = PROPDMG * PROPDMGEXP) %>%
mutate(new_cropdmg = CROPDMG * CROPDMGEXP)
With that we are ready to do the analysis.
In this question, we have two determinants of population health impact, and that is the number of fatalities and injuries. First we extract the relevant columns out
#Select the relevant columns
storm_data_health <- select(storm_data, EVTYPE, FATALITIES, INJURIES)
#Rearrage the columns so it sorted by highest fatalities
summaryhealth <- storm_data_health %>%
group_by(EVTYPE) %>%
summarize(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES)) %>%
mutate(TOTAL=FATALITIES+INJURIES) %>%
arrange(desc(TOTAL))
Looking at the printed top 10 entries, we can tell tornado is the main driver behind the number fatalities and injuries in the states. To be better illustrate this, we will plot the results into a bar chart below.
#printing out the summary
head(summaryhealth,10)
## # A tibble: 10 x 4
## EVTYPE FATALITIES INJURIES TOTAL
## <chr> <dbl> <dbl> <dbl>
## 1 tornado 1511 20667 22178
## 2 excessive heat 1797 6391 8188
## 3 flood 414 6758 7172
## 4 thunderstorm wind 371 5029 5400
## 5 lightning 651 4141 4792
## 6 flash flood 887 1674 2561
## 7 winter storm 191 1292 1483
## 8 heat 237 1222 1459
## 9 hurricane typhoon 64 1275 1339
## 10 high wind 235 1083 1318
In order to plot our chart here we breakdown the data so it is suitable.
#Due to some error with dplyr working with reshape.
#The table above has to be manually reassigned as data frame
summaryhealth <- as.data.frame(summaryhealth)
#Then filter out the top 10 entries & total column
summaryhealth_10 <- summaryhealth[1:10,]
summaryhealth_10 <- select(summaryhealth_10, -TOTAL)
#Melt the data so there is a variable to indicate injury or fatality
summ_health <- melt(summaryhealth_10, id_vars = "EVTYPE", measure.vars = c("FATALITIES","INJURIES"))
#Make the event type columns a factor and unique
summ_health$EVTYPE <- factor(summ_health$EVTYPE, levels = unique(summ_health$EVTYPE))
Now we proceed to draw the chart using ggplot
g <- ggplot(summ_health, aes(x = EVTYPE, y = value, fill = factor(variable)))
g + geom_bar(stat = "identity", position = "dodge") +
scale_fill_discrete(name = "Impact",
labels=c("FATALITIES", "INJURIES")) +
xlab("Event Type") +
ylab("Number of Fatalities & Injuries") +
ggtitle("Top 10 Event Types with Highest Impact on U.S. Population Health") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.title=element_blank())
From the chart itself, we can see that tornado events causes the most number of injuries since 1996. But for fatalities, it is excessive heat that is driving the numbers - perhaps due to global warming? It is interesting to see that heatstroke might kill more than tornadoes. But with tornado detection and shelter technology more advanced after 1996, fatalities will reduce. The rest of the events that are deadly to the population are thunderstorm wind, flood, lightning and etc.
For this question, we will use the two newly calculated cost of prop/crop damage. First we extract the relevant columns out
#Select the relevant columns
storm_data_econ <- select(storm_data, EVTYPE, new_propdmg, new_cropdmg)
#Rearrage the columns so it sorted by highest fatalities
summaryecon <- storm_data_econ %>%
group_by(EVTYPE) %>%
summarize(PROPDMG = sum(new_propdmg), CROPDMG = sum(new_cropdmg)) %>%
mutate(TOTAL=PROPDMG+CROPDMG) %>%
arrange(desc(TOTAL))
Looking at the printed top 10 entries, we can tell flood is the main driver behind crop/prop damage in the states. To be better illustrate this, we will plot the results into a bar chart below.
#printing out the summary
head(summaryecon,10)
## # A tibble: 10 x 4
## EVTYPE PROPDMG CROPDMG TOTAL
## <chr> <dbl> <dbl> <dbl>
## 1 flood 143944833550 4974778400 148919611950
## 2 hurricane typhoon 69305840000 2607872800 71913712800
## 3 storm surge 43193536000 5000 43193541000
## 4 tornado 24616945710 283425010 24900370720
## 5 hail 14595143420 2476029450 17071172870
## 6 flash flood 15222203910 1334901700 16557105610
## 7 hurricane 11812819010 2741410000 14554229010
## 8 drought 1046101000 13367566000 14413667000
## 9 thunderstorm wind 7860710880 952246350 8812957230
## 10 tropical storm 7642475550 677711000 8320186550
In order to plot our chart here we breakdown the data so it is suitable.
#Due to some error with dplyr working with reshape.
#The table above has to be manually reassigned as data frame
summaryecon <- as.data.frame(summaryecon)
#Then filter out the top 10 entries & total column
summaryecon_10 <- summaryecon[1:10,]
summaryecon_10 <- select(summaryecon_10, -TOTAL)
#Melt the data so there is a variable to indicate injury or fatality
summ_econ <- melt(summaryecon_10, id_vars = "EVTYPE", measure.vars = c("PROPDMG","CROPDMG"))
#Make the event type columns a factor and unique
summ_econ$EVTYPE <- factor(summ_econ$EVTYPE, levels = unique(summ_econ$EVTYPE))
Now we proceed to draw the chart using ggplot
g <- ggplot(summ_econ, aes(x = EVTYPE, y = value/1E9, fill = factor(variable)))
g + geom_bar(stat = "identity", position = "dodge") +
scale_fill_discrete(name = "Impact",
labels=c("PROPDMG", "CROPDMG")) +
xlab("Event Type") +
ylab("Crop & Prop damage in billions") +
ggtitle("Top 10 Event Types with Highest Impact on U.S. Economy") +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.title=element_blank())
In terms of overall damage, flood is the main culprit. Cumulating up to almost 150bil USD worth of damage since 1996. Although it’s impact is the highest on prop damage, it is drought that causes the most crop damage. This might not be surprising considering the fact that flood may occur during seasons where crops have been harvested already. Drought would pose a greater risk instead.