Synopsis

This report explores the impact of storms in United States from 1950 to 2011. The data is taken from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) database. The two key analyses done are on which types of events that are the most harmful with respect to population health and which types of events that have the greatest economic consequences. For health impacts we will look at the number of fatalities and injuries from the events. As for economic consequences, we will look at damage done to props(housing, cars, etc) and crops.

Pre Data Processing

This section loads the necessary R packages for analysis. NOTE: Please ensure you have the packages installed.

library(knitr)
opts_chunk$set(echo = TRUE)
library(lubridate)
library(dplyr)
library(ggplot2)
library(stringr)
library(reshape2)

Output session info

sessionInfo()
## R version 3.2.5 (2016-04-14)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8.1 x64 (build 9600)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] reshape2_1.4.1  stringr_1.0.0   ggplot2_2.1.0   dplyr_0.5.0    
## [5] lubridate_1.5.6 knitr_1.13     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.6      digest_0.6.9     assertthat_0.1   plyr_1.8.4      
##  [5] grid_3.2.5       R6_2.1.2         gtable_0.2.0     DBI_0.4-1       
##  [9] formatR_1.4      magrittr_1.5     scales_0.4.0     evaluate_0.9    
## [13] stringi_1.1.1    rmarkdown_1.0    tools_3.2.5      munsell_0.4.3   
## [17] yaml_2.1.13      colorspace_1.2-6 htmltools_0.3.5  tibble_1.1

Data Processing

This section processes the storm data provided by the NOAA storm database. The file is available here. Documentation on the data can be found here as well. Any data corruption, missing data and duplicates will be dealt here.

Initial Data processing

First, we unzip and load the data into R. Note: Make sure the file is in your working directory!

raw_data <- read.csv(bzfile("repdata_data_StormData.csv.bz2"))

Then we take a look at the data itself and determine which are the relevant information to keep

dim(raw_data)
## [1] 902297     37
names(raw_data)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

To reduce the data size we are working with. We filter out the relevant columns and store in another variable. Then we take a closer look at the remaining variables.

storm <- select(raw_data, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP,BGN_DATE)
str(storm)
## 'data.frame':    902297 obs. of  9 variables:
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...

Cleaning Up Data: Event Types

Here we look at the event types in the data and change the observations to lower case for easier analysis. Check how many unique values are there in event type

#Check some unique event types
head(unique(storm$EVTYPE),20)
##  [1] TORNADO                   TSTM WIND                
##  [3] HAIL                      FREEZING RAIN            
##  [5] SNOW                      ICE STORM/FLASH FLOOD    
##  [7] SNOW/ICE                  WINTER STORM             
##  [9] HURRICANE OPAL/HIGH WINDS THUNDERSTORM WINDS       
## [11] RECORD COLD               HURRICANE ERIN           
## [13] HURRICANE OPAL            HEAVY RAIN               
## [15] LIGHTNING                 THUNDERSTORM WIND        
## [17] DENSE FOG                 RIP CURRENT              
## [19] THUNDERSTORM WINS         FLASH FLOOD              
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... WND
#Count the number of unique event types in the data
length(unique(storm$EVTYPE))
## [1] 985

The event type column data contains too many misspellings, different variations of the official 48 event types. Filtering to just 48 of the official events will involve large amount of time and effort. Also, data before 1996 did not capture all 48 event type and were known to be inconsistent, as the data wasn’t digital then. Henceforth in order to make better judgements, data before 1996 will not be used in this analysis.

#Use lubridate to change the BGN_DATE column to datetype
storm$BGN_DATE <- mdy_hms(storm$BGN_DATE)
#Then mutate a new column with the year into the storm data
storm <- mutate(storm, YEAR = year(storm$BGN_DATE))
storm_data <- filter(storm, YEAR >= 1996)

Next we will clean up the event types column.

#change all variables to lowercase
event_types <- tolower(storm_data$EVTYPE[])
#replace the punctuation characters with a "."
event_types <- make.names(event_types)
event_types <- gsub("\\.", " ", event_types)
length(unique(event_types))
## [1] 433

Cleaning the data further by replacing abbreviations and removing unrelated events types.

#replace all the tstm with thunderstorm
event_types <- gsub("tstm", "thunderstorm", event_types)
storm_data$EVTYPE <- event_types
#filter out the rows of data that contain invalid eventtype
storm_data <- storm_data[-grep('summary', storm_data$EVTYPE),]
#check the number of unique EVTYPE again.
length(unique(storm_data$EVTYPE))
## [1] 362

The resulting number of unique events type is 362 as compared to the 900+ before.

Cleaning Up Data: Reformmating the PROP & CROP damage exponent

The reference file indicated the relevant exponents for the prop and crop damage. We will use those reference to recalculate the prop and crop damage

First we create a function that takes in the exponent and determine what multiplier to return. There are symbols and other numbers present as well. The documentation made no mention of numbers/symbols as exponents so they will be taken as 0.

#Modify the exponents so they can be multiplied into the crop/prop damage
storm_data <- storm_data %>% 
  mutate(PROPDMGEXP = ifelse(PROPDMGEXP %in% c('B','b'),1E9,
                             ifelse(PROPDMGEXP %in% c('M','m'),1E6,
                                    ifelse(PROPDMGEXP %in% c('K','k'),1E3,
                                           ifelse(PROPDMGEXP %in% c('H','h'),1E2,0))))) %>%
  mutate(CROPDMGEXP = ifelse(CROPDMGEXP %in% c('B','b'),1E9,
                             ifelse(CROPDMGEXP %in% c('M','m'),1E6,
                                    ifelse(CROPDMGEXP %in% c('K','k'),1E3,
                                           ifelse(CROPDMGEXP %in% c('H','h'),1E2,0)))))
#Next multiply the new exponents into the prop/crop damage
storm_data <- storm_data %>%
  mutate(new_propdmg = PROPDMG * PROPDMGEXP) %>%
  mutate(new_cropdmg = CROPDMG * CROPDMGEXP)

With that we are ready to do the analysis.

Results

Question: Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

In this question, we have two determinants of population health impact, and that is the number of fatalities and injuries. First we extract the relevant columns out

#Select the relevant columns
storm_data_health <- select(storm_data, EVTYPE, FATALITIES, INJURIES)
#Rearrage the columns so it sorted by highest fatalities
summaryhealth <- storm_data_health %>%
  group_by(EVTYPE) %>%
  summarize(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES)) %>%
  mutate(TOTAL=FATALITIES+INJURIES) %>%
  arrange(desc(TOTAL)) 

Looking at the printed top 10 entries, we can tell tornado is the main driver behind the number fatalities and injuries in the states. To be better illustrate this, we will plot the results into a bar chart below.

#printing out the summary
head(summaryhealth,10)
## # A tibble: 10 x 4
##               EVTYPE FATALITIES INJURIES TOTAL
##                <chr>      <dbl>    <dbl> <dbl>
## 1            tornado       1511    20667 22178
## 2     excessive heat       1797     6391  8188
## 3              flood        414     6758  7172
## 4  thunderstorm wind        371     5029  5400
## 5          lightning        651     4141  4792
## 6        flash flood        887     1674  2561
## 7       winter storm        191     1292  1483
## 8               heat        237     1222  1459
## 9  hurricane typhoon         64     1275  1339
## 10         high wind        235     1083  1318

In order to plot our chart here we breakdown the data so it is suitable.

#Due to some error with dplyr working with reshape. 
#The table above has to be manually reassigned as data frame
summaryhealth <- as.data.frame(summaryhealth)
#Then filter out the top 10 entries & total column 
summaryhealth_10 <- summaryhealth[1:10,]
summaryhealth_10 <- select(summaryhealth_10, -TOTAL)
#Melt the data so there is a variable to indicate injury or fatality
summ_health <- melt(summaryhealth_10, id_vars = "EVTYPE", measure.vars = c("FATALITIES","INJURIES"))
#Make the event type columns a factor and unique
summ_health$EVTYPE <- factor(summ_health$EVTYPE, levels = unique(summ_health$EVTYPE))

Now we proceed to draw the chart using ggplot

g <- ggplot(summ_health, aes(x = EVTYPE, y = value, fill = factor(variable)))
g + geom_bar(stat = "identity", position = "dodge") +
  scale_fill_discrete(name = "Impact", 
                      labels=c("FATALITIES", "INJURIES")) +
  xlab("Event Type") +
  ylab("Number of Fatalities & Injuries") +
  ggtitle("Top 10 Event Types with Highest Impact on U.S. Population Health") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.title=element_blank())

From the chart itself, we can see that tornado events causes the most number of injuries since 1996. But for fatalities, it is excessive heat that is driving the numbers - perhaps due to global warming? It is interesting to see that heatstroke might kill more than tornadoes. But with tornado detection and shelter technology more advanced after 1996, fatalities will reduce. The rest of the events that are deadly to the population are thunderstorm wind, flood, lightning and etc.

Across the United States, which types of events have the greatest economic consequences?

For this question, we will use the two newly calculated cost of prop/crop damage. First we extract the relevant columns out

#Select the relevant columns
storm_data_econ <- select(storm_data, EVTYPE, new_propdmg, new_cropdmg)
#Rearrage the columns so it sorted by highest fatalities
summaryecon <- storm_data_econ %>%
  group_by(EVTYPE) %>%
  summarize(PROPDMG = sum(new_propdmg), CROPDMG = sum(new_cropdmg)) %>%
  mutate(TOTAL=PROPDMG+CROPDMG) %>%
  arrange(desc(TOTAL)) 

Looking at the printed top 10 entries, we can tell flood is the main driver behind crop/prop damage in the states. To be better illustrate this, we will plot the results into a bar chart below.

#printing out the summary
head(summaryecon,10)
## # A tibble: 10 x 4
##               EVTYPE      PROPDMG     CROPDMG        TOTAL
##                <chr>        <dbl>       <dbl>        <dbl>
## 1              flood 143944833550  4974778400 148919611950
## 2  hurricane typhoon  69305840000  2607872800  71913712800
## 3        storm surge  43193536000        5000  43193541000
## 4            tornado  24616945710   283425010  24900370720
## 5               hail  14595143420  2476029450  17071172870
## 6        flash flood  15222203910  1334901700  16557105610
## 7          hurricane  11812819010  2741410000  14554229010
## 8            drought   1046101000 13367566000  14413667000
## 9  thunderstorm wind   7860710880   952246350   8812957230
## 10    tropical storm   7642475550   677711000   8320186550

In order to plot our chart here we breakdown the data so it is suitable.

#Due to some error with dplyr working with reshape. 
#The table above has to be manually reassigned as data frame
summaryecon <- as.data.frame(summaryecon)
#Then filter out the top 10 entries & total column 
summaryecon_10 <- summaryecon[1:10,]
summaryecon_10 <- select(summaryecon_10, -TOTAL)
#Melt the data so there is a variable to indicate injury or fatality
summ_econ <- melt(summaryecon_10, id_vars = "EVTYPE", measure.vars = c("PROPDMG","CROPDMG"))
#Make the event type columns a factor and unique
summ_econ$EVTYPE <- factor(summ_econ$EVTYPE, levels = unique(summ_econ$EVTYPE))

Now we proceed to draw the chart using ggplot

g <- ggplot(summ_econ, aes(x = EVTYPE, y = value/1E9, fill = factor(variable)))
g + geom_bar(stat = "identity", position = "dodge") +
  scale_fill_discrete(name = "Impact", 
                      labels=c("PROPDMG", "CROPDMG")) +
  xlab("Event Type") +
  ylab("Crop & Prop damage in billions") +
  ggtitle("Top 10 Event Types with Highest Impact on U.S. Economy") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.title=element_blank())

In terms of overall damage, flood is the main culprit. Cumulating up to almost 150bil USD worth of damage since 1996. Although it’s impact is the highest on prop damage, it is drought that causes the most crop damage. This might not be surprising considering the fact that flood may occur during seasons where crops have been harvested already. Drought would pose a greater risk instead.