Economic and population health impacts of severe weather events in the United States, 1950-2011

Synopsis

In this report we aim to identify two things: (1) which types of weather events are most harmful to population health, and (2) which types of events have the greatest economic consequences. Our hypothesis is that weather events differ in the severity of their health and economic impacts, and the intention of this report is to identify which ones are the most severe. To investigate the hypothesis we obtained data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. Events from before 1995 were excluded from the analysis due to a lack of sufficient records. From the analysis we found that, measured by total casualties (fatalities plus injuries), the weather events with the greatest health impacts are tornadoes, followed by heat events, and then flood events. The weather events with the greatest economic impact, measured by combined property and crop damage, are floods, followed by hurricanes, and then storm surges.

Data processing

We used data from the U.S. NOAA storm database. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete. A codebook for the data set is available.

Downloading the data

We first manually downloaded and decompressed the data. The file is compressed with the bzip2 algorithm, so appropriate decompression software was used.
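
As an aside, R can usually read a bzip2-compressed CSV without a separate decompression step. The sketch below is illustrative only and is not the procedure used in this report: it writes a tiny made-up file to a temporary location (the file contents and the names tmp and demo are hypothetical) and reads it back through bzfile().

```r
# Write a tiny, made-up CSV to a temporary bzip2-compressed file.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
writeLines(c('"EVTYPE","FATALITIES"', '"TORNADO",3', '"HEAT",1'), con)
close(con)

# read.csv() accepts a connection, so the file is decompressed on the fly.
demo <- read.csv(bzfile(tmp))
print(demo)

# Remove the temporary file again.
unlink(tmp)
```

Reading through a connection like this avoids keeping a second, uncompressed copy of the data on disk.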

Reading the data

We first read in the data from the raw text file. It is text data, with fields separated by a “,” (comma character). We skip the header row at this stage and attach the column names in a later step.

sdat <- read.csv("./data/dataset.csv", skip = 1, header = FALSE,
                  na.strings = "")

After reading in the data, we check its dimensions and the first few rows (there are 902,297 rows in the database).

dim(sdat)

## [1] 902297     37

head(sdat[,1:10])

##   V1                 V2   V3  V4 V5         V6 V7      V8 V9  V10
## 1  1  4/18/1950 0:00:00 0130 CST 97     MOBILE AL TORNADO  0 <NA>
## 2  1  4/18/1950 0:00:00 0145 CST  3    BALDWIN AL TORNADO  0 <NA>
## 3  1  2/20/1951 0:00:00 1600 CST 57    FAYETTE AL TORNADO  0 <NA>
## 4  1   6/8/1951 0:00:00 0900 CST 89    MADISON AL TORNADO  0 <NA>
## 5  1 11/15/1951 0:00:00 1500 CST 43    CULLMAN AL TORNADO  0 <NA>
## 6  1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO  0 <NA>

We then attach column headers to the dataset and make sure they are properly formatted for R data frames.

cnames <- readLines("./data/dataset.csv",1)
cnames <- strsplit(cnames,",",fixed=TRUE)
names(sdat) <- make.names(cnames[[1]])
head(sdat[,1:10])

##   X.STATE__.        X.BGN_DATE. X.BGN_TIME. X.TIME_ZONE. X.COUNTY.
## 1          1  4/18/1950 0:00:00        0130          CST        97
## 2          1  4/18/1950 0:00:00        0145          CST         3
## 3          1  2/20/1951 0:00:00        1600          CST        57
## 4          1   6/8/1951 0:00:00        0900          CST        89
## 5          1 11/15/1951 0:00:00        1500          CST        43
## 6          1 11/15/1951 0:00:00        2000          CST        77
##   X.COUNTYNAME. X.STATE. X.EVTYPE. X.BGN_RANGE. X.BGN_AZI.
## 1        MOBILE       AL   TORNADO            0       <NA>
## 2       BALDWIN       AL   TORNADO            0       <NA>
## 3       FAYETTE       AL   TORNADO            0       <NA>
## 4       MADISON       AL   TORNADO            0       <NA>
## 5       CULLMAN       AL   TORNADO            0       <NA>
## 6    LAUDERDALE       AL   TORNADO            0       <NA>
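
The X. prefix and trailing dot in these names come from the double quotes around each field in the raw header line: make.names() prepends X to a name that starts with an invalid character and then replaces every invalid character with a dot. The sketch below reproduces this on a made-up header line (raw_header is hypothetical) and shows that stripping the quotes first would give plain names. The rest of this report keeps the X.-style names.

```r
# A made-up header line in the same quoted style as the raw file.
raw_header <- '"STATE__","BGN_DATE","EVTYPE"'
fields <- strsplit(raw_header, ",", fixed = TRUE)[[1]]

# With the quotes left in, make.names() yields the X.-style names.
quoted_names <- make.names(fields)
print(quoted_names)

# Stripping the quotes first yields plain names instead.
plain_names <- make.names(gsub('"', "", fields, fixed = TRUE))
print(plain_names)
```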

We are interested in the effects of different event types. To assess public health impacts we look at fatalities and injuries. To assess economic impacts we look at property damage and crop damage. Here, we extract each of these variables and compute the proportion of missing values in each.

Event type:

etyp <- sdat$X.EVTYPE.
mean(is.na(etyp))

## [1] 0

Fatalities:

fats <- sdat$X.FATALITIES.
mean(is.na(fats))

## [1] 0

Injuries:

injs <- sdat$X.INJURIES.
mean(is.na(injs))

## [1] 0

Property damage:

pdmg <- sdat$X.PROPDMG.
mean(is.na(pdmg))

## [1] 0

pdmgex <- sdat$X.PROPDMGEXP.
mean(is.na(pdmgex))

## [1] 0.5163865

Crop damage:

cdmg <- sdat$X.CROPDMG.
mean(is.na(cdmg))

## [1] 0

cdmgex <- sdat$X.CROPDMGEXP.
mean(is.na(cdmgex))

## [1] 0.6853763

There are missing values for the damage exponents: about 52% for property damage and 69% for crop damage. We need to account for this in the later analysis.
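
The checks above rely on the fact that the mean of a logical vector is the proportion of TRUE values. A minimal sketch on a made-up vector of exponent codes (exps is hypothetical) shows this, together with table(), which can count the codes with the missing ones included.

```r
# Made-up exponent codes; NA marks a missing exponent.
exps <- c("K", NA, "M", NA, "K")

# is.na() gives TRUE/FALSE, so the mean is the share of missing values.
prop_missing <- mean(is.na(exps))
print(prop_missing)

# Count each code, with NA shown as its own category.
code_counts <- table(exps, useNA = "ifany")
print(code_counts)
```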

Some event types are really the same kind of event but are recorded under different names. We therefore combine all “heat” events into one type, and do the same for flood, hurricane, tropical storm, rip current, and thunderstorm events.

sdat$X.EVTYPE. <- sub(".*[Hh][Ee][Aa][Tt].*","HEAT",sdat$X.EVTYPE.)
sdat$X.EVTYPE. <- sub(".*[Ff][Ll][Oo][Oo][Dd].*","FLOOD",sdat$X.EVTYPE.)
sdat$X.EVTYPE. <- sub(".*HURRICANE.*","HURRICANE",sdat$X.EVTYPE.)
sdat$X.EVTYPE. <- sub(".*TROPICAL STORM.*","TROPICAL STORM",sdat$X.EVTYPE.)
sdat$X.EVTYPE. <- sub(".*RIP CURRENT.*","RIP CURRENTS",sdat$X.EVTYPE.)
sdat$X.EVTYPE. <- sub(".*TSTM.*","THUNDER STORM",sdat$X.EVTYPE.)
sdat$X.EVTYPE. <- sub(".*THUNDERSTORM.*","THUNDER STORM",sdat$X.EVTYPE.)
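
To see what these substitutions do, the sketch below applies the same idea to a handful of made-up labels (evs is a hypothetical vector, not taken from the data). Because each pattern is wrapped in .* on both sides, a match replaces the whole label rather than just the matching word. Here ignore.case = TRUE stands in for the bracketed [Hh][Ee][Aa][Tt] style.

```r
# Made-up event labels in mixed styles.
evs <- c("EXCESSIVE HEAT", "Heat Wave", "FLASH FLOOD", "TSTM WIND", "TORNADO")

# Any label containing the keyword collapses to a single category.
evs <- sub(".*heat.*", "HEAT", evs, ignore.case = TRUE)
evs <- sub(".*flood.*", "FLOOD", evs, ignore.case = TRUE)
evs <- sub(".*TSTM.*", "THUNDER STORM", evs)
print(evs)
```

Labels that match none of the patterns, such as TORNADO, pass through unchanged.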

The data was collected starting in 1950. The older data may not be comprehensive. To check this we take a look at the number of events recorded per year and plot the data.

library(dplyr)
sdat <- mutate(sdat, X.YEAR = as.numeric(format(as.Date(as.character(X.BGN_DATE.),
                                                        format = "%m/%d/%Y %H:%M:%S"), "%Y")))
by_yr <- group_by(sdat,X.YEAR) %>%
         summarise(No.Events = n())
library(ggplot2)
g <- ggplot(by_yr, aes(x = X.YEAR,y = No.Events))
p <- g + geom_line() + ggtitle("Number of recorded events per year") + xlab("Year") + ylab("Number of events")
print(p)
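
The year extraction can be checked on a few date strings in the same format as the BGN_DATE column. This is a base R sketch with hypothetical names (bgn, years, per_year), using table() in place of the dplyr summary.

```r
# Date strings in the same month/day/year format as the raw data.
bgn <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00", "11/15/1951 0:00:00")

# Parse to Date, then keep only the four-digit year as a number.
years <- as.numeric(format(as.Date(bgn, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
print(years)

# Count how many events fall in each year.
per_year <- table(years)
print(per_year)
```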

The number of recorded events rises sharply at around 1995. To avoid biasing the results with the sparser early records, we keep only the data from 1995 onwards.

sdat <- subset(sdat,X.YEAR > 1994)

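Finally, we convert the property and crop damage figures to actual dollar values by multiplying each by the scale given in its exponent column (K = thousands, M = millions, B = billions). Records whose exponent is missing, or is not one of K, M, or B, are given a multiplier of 0 and so contribute nothing to the damage totals: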

pex <- data.frame("exp" = c("K","M","B"),"PROP.MULT" = c(1000,1000000,1000000000))
dex <- data.frame("exp" = c("K","M","B"),"CROP.MULT" = c(1000,1000000,1000000000))
sdat <- merge(sdat,pex,by.x = "X.PROPDMGEXP.", by.y = "exp",sort=FALSE, all.x = TRUE)
sdat <- merge(sdat,dex,by.x = "X.CROPDMGEXP.", by.y = "exp",sort=FALSE, all.x = TRUE)
sdat$PROP.MULT[is.na(sdat$PROP.MULT)] <- 0
sdat$CROP.MULT[is.na(sdat$CROP.MULT)] <- 0
sdat <- mutate(sdat, X.TOTAL.PDAMAGE = X.PROPDMG. * PROP.MULT) %>%
        mutate(X.TOTAL.CDAMAGE = X.CROPDMG. * CROP.MULT)
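
The same scaling rule can be illustrated on a few made-up records with a named-vector lookup, which is an alternative to the merge used above and not the method of this report. All names here (dmg, exps, multipliers, mult, total) are hypothetical.

```r
# Made-up damage amounts and exponent codes.
dmg <- c(25, 10, 2, 5)
exps <- c("K", NA, "M", "k")

# Lookup table for the three recognised codes.
multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

# Missing or unrecognised codes (including lowercase "k") look up as NA.
mult <- unname(multipliers[exps])
mult[is.na(mult)] <- 0

total <- dmg * mult
print(total)
```

Note that under this rule a lowercase code such as k is treated the same as a missing exponent, so its damage counts as zero.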

Results

Health Impacts

We take a look at the health impacts of each type of event, using the total fatalities and total injuries metrics:

health <- group_by(sdat,X.EVTYPE.) %>%
          summarise(Total.Fatalities = sum(X.FATALITIES.), 
                    Total.Injuries = sum(X.INJURIES.)) %>%
          mutate(Total.Casualties = Total.Fatalities + Total.Injuries) %>%
          filter(Total.Casualties > 0)
health <- health[order(-health$Total.Casualties),]
head(health)

## # A tibble: 6 x 4
##   X.EVTYPE.     Total.Fatalities Total.Injuries Total.Casualties
##   <chr>                    <dbl>          <dbl>            <dbl>
## 1 TORNADO                   1545          21765            23310
## 2 HEAT                      3087           9103            12190
## 3 FLOOD                     1386           8519             9905
## 4 THUNDER STORM              438           5688             6126
## 5 LIGHTNING                  729           4631             5360
## 6 WINTER STORM               195           1298             1493
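
The aggregation above can be sketched in base R on a small made-up data set. The toy values and the names toy and totals are hypothetical, and aggregate() is used here only as a stand-in for the dplyr pipeline.

```r
# Made-up records: one row per event.
toy <- data.frame(
  EVTYPE = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(2, 5, 1, 0, 3),
  INJURIES = c(10, 4, 7, 0, 1)
)

# Sum fatalities and injuries within each event type.
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Casualties are fatalities plus injuries; drop types with none.
totals$CASUALTIES <- totals$FATALITIES + totals$INJURIES
totals <- totals[totals$CASUALTIES > 0, ]

# Sort from most to fewest casualties.
totals <- totals[order(-totals$CASUALTIES), ]
print(totals)
```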

We plot the casualties for the five event types with the most casualties.

health <- health[1:5,]
g <- ggplot(health, aes(X.EVTYPE.,Total.Casualties))
p <- g + geom_bar(stat="identity") + ggtitle("Total casualties by weather event type") +
     xlab("Event type") + ylab("Total casualties") + 
     geom_text(aes(label = format(Total.Casualties,big.mark=","),vjust = -1)) +
     ylim(c(0,25000))
print(p)
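
ggplot2 places a character variable on the x-axis in alphabetical order, so the bars are not sorted by size. One possible refinement, shown as a sketch rather than as part of the original analysis, is to turn the event type into a factor ordered by casualties using reorder() from the stats package. The vectors below copy three totals from the table above.

```r
# Three event types and their total casualties, in alphabetical order.
types <- c("FLOOD", "HEAT", "TORNADO")
casualties <- c(9905, 12190, 23310)

# Negating the score puts the largest total first in the factor levels.
ordered_types <- reorder(types, -casualties)
print(levels(ordered_types))

# In the plot this would correspond to
# aes(reorder(X.EVTYPE., -Total.Casualties), Total.Casualties).
```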

So, since 1995, the weather events that have caused the greatest number of casualties are tornadoes, followed by heat, and then flooding.

Economic Impacts

We now take a look at economic impacts, using the property and crop damage metrics:

econ <- group_by(sdat,X.EVTYPE.) %>%
          summarise(Total.PROP.DAMAGE = sum(X.TOTAL.PDAMAGE), 
                    Total.CROP.DAMAGE = sum(X.TOTAL.CDAMAGE)) %>%
          mutate(Total.DAMAGE.Bil = (Total.PROP.DAMAGE + Total.CROP.DAMAGE)/1000000000) %>%
          filter(Total.DAMAGE.Bil > 0)
econ <- econ[order(-econ$Total.DAMAGE.Bil),]
head(econ)

## # A tibble: 6 x 4
##   X.EVTYPE.   Total.PROP.DAMAGE Total.CROP.DAMAGE Total.DAMAGE.Bil
##   <chr>                   <dbl>             <dbl>            <dbl>
## 1 FLOOD            160151551770        7032166400            167. 
## 2 HURRICANE         84630180010        5504792800             90.1
## 3 STORM SURGE       43193536000              5000             43.2
## 4 TORNADO           24915219460         296595610             25.2
## 5 HAIL              15040821320        2613777050             17.7
## 6 DROUGHT            1046106000       13922066000             15.0
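
As a quick check of the conversion to billions, the FLOOD row can be recomputed by hand from the property and crop totals in the table above. The variable names are hypothetical.

```r
# Property and crop damage totals for FLOOD, copied from the table.
flood_prop <- 160151551770
flood_crop <- 7032166400

# Combined damage expressed in billions of dollars.
flood_total_bil <- (flood_prop + flood_crop) / 1e9
print(round(flood_total_bil, 2))
```

The result is about 167.18, which the tibble prints as 167. because it shows three significant digits.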

We plot the damage for the five event types with the greatest total damage:

econ <- econ[1:5,]
g <- ggplot(econ, aes(X.EVTYPE.,Total.DAMAGE.Bil))
p <- g + geom_bar(stat="identity") + 
     ggtitle("Total economic damage ($Billions) by weather event type") +
     xlab("Event type") + ylab("Total damage ($Billions)") + 
     geom_text(aes(label = format(round(Total.DAMAGE.Bil,2)),vjust = -1)) +
     ylim(c(0,180))
print(p)

So the weather events with the largest economic impacts are floods, followed by hurricanes, and then storm surges.