Data analysis report: Impact of severe weather events on public health and economy in the USA, based on the NOAA Storm Database

Synonpsis

The idea behind this report is to analyze to analyze the impact of different weather events on public health and economy. Analyzis is based on the public database for storm events, maintained by the U.S. National Oceanic and Atmospheric Administration’s (NOAA). Data used in this study was gathered during 1950 - 2011. This report contains the analysis on the health and economic consenquences according to the type of the storm event, based on correspoding estimates of fatalities, injuries, property and crop damage. The results show that floods, droughts, and typhoons leave the greatest financial losses, whereas tornados and excessive heat are most harmful with the respect to population health.

Data Processing

1. Setting the R environment for analysis (loading packages)

library(R.utils) # for bunzip2
## Warning: package 'R.utils' was built under R version 3.2.2
## Loading required package: R.oo
## Warning: package 'R.oo' was built under R version 3.2.2
## Loading required package: R.methodsS3
## Warning: package 'R.methodsS3' was built under R version 3.2.2
## R.methodsS3 v1.7.0 (2015-02-19) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.19.0 (2015-02-27) successfully loaded. See ?R.oo for help.
## 
## Attaching package: 'R.oo'
## 
## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods
## 
## The following objects are masked from 'package:base':
## 
##     attach, detach, gc, load, save
## 
## R.utils v2.1.0 (2015-05-27) successfully loaded. See ?R.utils for help.
## 
## Attaching package: 'R.utils'
## 
## The following object is masked from 'package:utils':
## 
##     timestamp
## 
## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, inherits, isOpen, parse, warnings
library(ggplot2) # for plotting
library(ggvis) # for plotting
## Warning: package 'ggvis' was built under R version 3.2.2
## 
## Attaching package: 'ggvis'
## 
## The following object is masked from 'package:ggplot2':
## 
##     resolution
library(plyr) # for data manipulation
library(dplyr) # for data manipulation
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

2. Loading the data

If the data file is not downloaded yet, download the data file and unzip it.

# create a data dir if it doesn't exist
if(!file.exists("./data")){
      dir.create("./data")
}

if (!"stormData.csv.bz2" %in% dir("./data/")) {
        download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "data/stormData.csv.bz2")
        
        bunzip2("data/stormData.csv.bz2", overwrite=T, remove=F)
}

If the data doesn’t already exists in the working environment, load the data from the generated csv file.

if (!"stormData" %in% ls()) {
    stormData <- read.csv("data/stormData.csv", sep = ",")
}


stormData <- tbl_df(stormData)
str(stormData)
## Classes 'tbl_df', 'tbl' and 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

There are 902297 observations of 37 variables, in total.

3. Preparing the data for analysis

# Adding a year column based on BGN_DATE

if (dim(stormData)[2] == 37) {
    stormData$year <- as.numeric(format(as.Date(stormData$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
}
# Distribution of events observations over the period 1950 - 2011.
stormData %>% ggvis(~year, fill:="steelblue") %>% layer_histograms()
## Guessing width = 2 # range / 31

As can be seen from the histogram above, the number of events tracked starts to significantly increase around 1995. In the earlier years there are generally fewer events recorded, probably due to a lack of good records. Accordingly, we will use the subset of the data from 1995 to 2011, as it is more complete.

stormData <- filter(stormData, year>=1995)

Now let’s subset the data according to health and economic impacts against weather events:

stormData <- select(stormData,EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

dim(stormData)
## [1] 681500      7

So now we have 681500 observations of 7 variables.

Let us prepare the property and the crop damage data, and agreagate the data by event:

# Values the property exponent
unique(stormData$PROPDMGEXP)
##  [1]   B M K m + 0 5 6 ? 4 2 3 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
# Sorting the property exponent data
stormData$PROPEXP[stormData$PROPDMGEXP == "K"] <- 1000
stormData$PROPEXP[stormData$PROPDMGEXP == "M"] <- 1e+06
stormData$PROPEXP[stormData$PROPDMGEXP == ""] <- 1
stormData$PROPEXP[stormData$PROPDMGEXP == "B"] <- 1e+09
stormData$PROPEXP[stormData$PROPDMGEXP == "m"] <- 1e+06
stormData$PROPEXP[stormData$PROPDMGEXP == "0"] <- 1
stormData$PROPEXP[stormData$PROPDMGEXP == "5"] <- 1e+05
stormData$PROPEXP[stormData$PROPDMGEXP == "6"] <- 1e+06
stormData$PROPEXP[stormData$PROPDMGEXP == "4"] <- 10000
stormData$PROPEXP[stormData$PROPDMGEXP == "2"] <- 100
stormData$PROPEXP[stormData$PROPDMGEXP == "3"] <- 1000
stormData$PROPEXP[stormData$PROPDMGEXP == "h"] <- 100
stormData$PROPEXP[stormData$PROPDMGEXP == "7"] <- 1e+07
stormData$PROPEXP[stormData$PROPDMGEXP == "H"] <- 100
stormData$PROPEXP[stormData$PROPDMGEXP == "1"] <- 10
stormData$PROPEXP[stormData$PROPDMGEXP == "8"] <- 1e+08
# Set the invalid exponent data to 0, so they not count in
stormData$PROPEXP[stormData$PROPDMGEXP == "+"] <- 0
stormData$PROPEXP[stormData$PROPDMGEXP == "-"] <- 0
stormData$PROPEXP[stormData$PROPDMGEXP == "?"] <- 0
# Compute the property damage value
stormData$PROPDMGVAL <- stormData$PROPDMG * stormData$PROPEXP

# Values of the crop exponent data
unique(stormData$CROPDMGEXP)
## [1]   M m K B ? 0 k 2
## Levels:  ? 0 2 B k K m M
# Sorting the property exponent data
stormData$CROPEXP[stormData$CROPDMGEXP == "M"] <- 1e+06
stormData$CROPEXP[stormData$CROPDMGEXP == "K"] <- 1000
stormData$CROPEXP[stormData$CROPDMGEXP == "m"] <- 1e+06
stormData$CROPEXP[stormData$CROPDMGEXP == "B"] <- 1e+09
stormData$CROPEXP[stormData$CROPDMGEXP == "0"] <- 1
stormData$CROPEXP[stormData$CROPDMGEXP == "k"] <- 1000
stormData$CROPEXP[stormData$CROPDMGEXP == "2"] <- 100
stormData$CROPEXP[stormData$CROPDMGEXP == ""] <- 1

# Set the invalid exponent data to 0, so they not count in
stormData$CROPEXP[stormData$CROPDMGEXP == "?"] <- 0

# Compute the crop damage value
stormData$CROPDMGVAL <- stormData$CROPDMG * stormData$CROPEXP

# Aggregate the data by event
fatal <- aggregate(FATALITIES ~ EVTYPE, data = stormData, FUN = sum)
injury <- aggregate(INJURIES ~ EVTYPE, data = stormData, FUN = sum)
propdmg <- aggregate(PROPDMGVAL ~ EVTYPE, data = stormData, FUN = sum)
cropdmg <- aggregate(CROPDMGVAL ~ EVTYPE, data = stormData, FUN = sum)

Results

1.Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

# Top 15 events with highest fatalities
fatal15 <-fatal[order(-fatal$FATALITIES), ][1:15, ]
# Top 15 event with highest injuries
injury15 <- injury[order(-injury$INJURIES), ][1:15, ]

par(mfrow = c(1, 2), mar = c(12, 4, 3, 2), mgp = c(3, 1, 0), cex = 0.6)

barplot(fatal15$FATALITIES, las = 3, names.arg = fatal15$EVTYPE, main = "Weather Events Causing the Top 15 Fatalities", 
    ylab = "Number of fatalities", col = "steelblue")

barplot(injury15$INJURIES, las = 3, names.arg = injury15$EVTYPE, main = "Weather Events Causing the Top 15 Injuries", 
    ylab = "Number of injuries", col = "orange")

As can be observed the most harmful weather events for USA population health are tornados and excessive heat. They cause both the highest number of fatalities and injuries.

2.Across the United States, which types of events have the greatest economic consequences?

# Top 15 events with highest property damage
propdmg15 <- propdmg[order(-propdmg$PROPDMGVAL), ][1:15, ]
# Top 15 events with highest crop damage
cropdmg15 <- cropdmg[order(-cropdmg$CROPDMGVAL), ][1:15, ]
par(mfrow = c(1, 2), mar = c(12, 4, 3, 2), mgp = c(3, 1, 0), cex = 0.6)
barplot(propdmg15$PROPDMGVAL/(10^9), las = 3, names.arg = propdmg15$EVTYPE, 
    main = "Top 15 Events causing the Greatest Property Damage", ylab = "Cost of damage ($ billions)", 
    col = "steelblue")
barplot(cropdmg15$CROPDMGVAL/(10^9), las = 3, names.arg = cropdmg15$EVTYPE, 
    main = "Top 15 Events Causing the Greatest Crop Damage", ylab = "Cost of damage ($ billions)", 
    col = "orange")

As can be observed weather events that bring the greatest economic consequences are: floods, droughts, tornados and typhoons.