Tornados have significantly impaired the population and economy over the past decades

Synopsis

In this data analysis we identified the atmospheric event that was primarily responsible for negative effects on the US population and economy between 1950 and 2011. We sought answers to the following two questions: 1. Across the United States, which types of events are most harmful with respect to population health? 2. Across the United States, which types of events have the greatest economic consequences? To investigate the problem, we obtained the Storm Data file (47 MB) from the U.S. National Oceanic and Atmospheric Administration (NOAA). From this data we found that tornados are the atmospheric events that most significantly affect the population and economy of the U.S.

Data processing

Acquiring data

The NOAA storm database tracks the characteristics of major storms and weather events in the United States, including when and where they occurred, as well as estimates of any fatalities, injuries, and property damage. The National Weather Service provides detailed documentation about the collection method and the structure of the data.

First, we downloaded the compressed (bz2) database, which is stored as comma-separated values:

## U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database (47 MB)
NOAA.data.URL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
subfolder <- "./RepData_PA2"            ## Subfolder within the default R working directory
NOAA.data.local <- file.path(subfolder, "StormData.csv.bz2")  ## Local, compressed destination filename

if(!file.exists(subfolder))
        {
        dir.create(subfolder)
        }
if(!file.exists(NOAA.data.local))
        {
        download.file(NOAA.data.URL,NOAA.data.local,mode="wb")  ## mode="wb" for compressed (binary) files
        }

Then we loaded the complete dataset to identify the columns of interest and to assess the data quality. The file is read through a bzfile() connection, which decompresses the bz2 archive while reading, so no separate extraction step is needed.

Reading in the data

if (!exists("StormData"))
        {
        StormData <- read.csv(bzfile(NOAA.data.local), stringsAsFactors = FALSE)
        }

After reading in the NOAA Storm Data, we checked the dimensions of the dataset (902,297 rows and 37 columns) and its first few rows.

dim(StormData)
## [1] 902297     37
head(StormData)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6

Pre-processing for analysis

Our data analysis requires the following R packages to be installed and loaded:

## dplyr
if("dplyr" %in% rownames(installed.packages()) == FALSE) {install.packages("dplyr")}
suppressMessages(library(dplyr))

## data.table
if("data.table" %in% rownames(installed.packages()) == FALSE) {install.packages("data.table")}
suppressMessages(library(data.table))

## lubridate
if("lubridate" %in% rownames(installed.packages()) == FALSE) {install.packages("lubridate")}
suppressMessages(library(lubridate))

## date
if("date" %in% rownames(installed.packages()) == FALSE) {install.packages("date")}
suppressMessages(library(date))

## ggplot2
if("ggplot2" %in% rownames(installed.packages()) == FALSE) {install.packages("ggplot2")}
suppressMessages(library(ggplot2))

The data of interest are stored in the following columns: EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG, REFNUM, BGN_DATE. Therefore, the original dataset was reduced to these relevant columns. In addition, the size of the dataset was decreased by excluding records in which FATALITIES, INJURIES, PROPDMG and CROPDMG were all 0, since we are looking for events that caused fatalities, injuries, or damage to property or crops.

StormData.subset <- subset(StormData, FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0,
                           select = c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "CROPDMG", "REFNUM", "BGN_DATE"))

The summary of StormData.subset indicated that the numeric data of interest were ready for analysis, since none of the relevant columns contained NA values:

summary(StormData.subset)
##     EVTYPE            FATALITIES          INJURIES        
##  Length:254633      Min.   :  0.0000   Min.   :   0.0000  
##  Class :character   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  Mode  :character   Median :  0.0000   Median :   0.0000  
##                     Mean   :  0.0595   Mean   :   0.5519  
##                     3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##                     Max.   :583.0000   Max.   :1700.0000  
##     PROPDMG           CROPDMG            REFNUM         BGN_DATE        
##  Min.   :   0.00   Min.   :  0.000   Min.   :     1   Length:254633     
##  1st Qu.:   2.00   1st Qu.:  0.000   1st Qu.:281406   Class :character  
##  Median :   5.00   Median :  0.000   Median :473485   Mode  :character  
##  Mean   :  42.75   Mean   :  5.411   Mean   :484335                     
##  3rd Qu.:  25.00   3rd Qu.:  0.000   3rd Qu.:703590                     
##  Max.   :5000.00   Max.   :990.000   Max.   :902260
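
The absence of NA values can also be checked directly rather than read off the summary. The following is a minimal sketch on a small made-up data frame (the `toy` table and its values are hypothetical stand-ins for StormData.subset, not NOAA data):

```r
# Toy stand-in for StormData.subset (hypothetical values, not NOAA data)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL"),
  FATALITIES = c(0, 2, 0),
  INJURIES   = c(15, 0, 1),
  PROPDMG    = c(25, 2.5, 0),
  CROPDMG    = c(0, 10, 5),
  stringsAsFactors = FALSE
)

# Count missing values per column; all zeros means no NA handling is needed
na.counts <- colSums(is.na(toy))
print(na.counts)

# Single TRUE/FALSE answer for the whole table
no.missing <- !anyNA(toy)
print(no.missing)
```

Applied to the real data, the same two calls would take StormData.subset in place of `toy`.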

Fatalities and injuries (counts of persons) were summed into a new measure called health, and property and crop damages were summed into a second measure called economic. The damage sum uses the PROPDMG and CROPDMG values as recorded, without applying the magnitude codes stored in PROPDMGEXP and CROPDMGEXP. To simplify the time series, the observation dates were grouped by decade.

StormData.subset <- mutate(StormData.subset,
                           health = FATALITIES + INJURIES,
                           economic = PROPDMG + CROPDMG)
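
The damage columns are recorded together with magnitude codes in PROPDMGEXP and CROPDMGEXP (for example the K visible in the head() output above). A possible refinement, sketched here on made-up values, is to scale each amount by its code before summing. The helper `exp.multiplier` and the `toy` table are hypothetical, and the mapping is an assumption: only K, M and B are translated, and any other code is treated as a multiplier of 1. Using this on the real data would also require keeping PROPDMGEXP and CROPDMGEXP in the column selection.

```r
# Hypothetical helper: translate a magnitude code into a numeric multiplier.
# Assumption: K = 1e3, M = 1e6, B = 1e9 (case-insensitive); anything else = 1.
exp.multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(code)])
  mult[is.na(mult)] <- 1   # empty or unrecognized codes leave the value unscaled
  mult
}

# Toy records (hypothetical values, not NOAA data)
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1),
  PROPDMGEXP = c("K", "M", ""),
  CROPDMG    = c(0, 10, 5),
  CROPDMGEXP = c("", "k", "B"),
  stringsAsFactors = FALSE
)

# Scale each amount by its own code, then add property and crop damage
toy$economic.usd <- toy$PROPDMG * exp.multiplier(toy$PROPDMGEXP) +
                    toy$CROPDMG * exp.multiplier(toy$CROPDMGEXP)
print(toy)
```

How unrecognized codes should be treated is a judgment call; treating them as 1 keeps the recorded number but may understate or overstate the true amount.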

StormData.subset <- mutate(StormData.subset,
                           decade = (trunc(year(as.date(gsub( " .*$", "", StormData.subset$BGN_DATE), order = "mdy"))/10,-1)*10))
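
The decade computation above can also be expressed with base R only. This is an alternative sketch, not the code used for the analysis; the three sample strings follow the BGN_DATE format shown in the head() output, and the last one is made up for illustration:

```r
# Sample BGN_DATE strings in the "m/d/Y H:M:S" layout (last value is hypothetical)
bgn.date <- c("4/18/1950 0:00:00", "11/15/1951 0:00:00", "6/8/2011 0:00:00")

# Parse only the date part; the trailing time is ignored by the format
parsed <- as.Date(bgn.date, format = "%m/%d/%Y")

# Extract the year and round it down to the start of its decade
yr <- as.integer(format(parsed, "%Y"))
decade <- (yr %/% 10L) * 10L
print(decade)
```

Integer division by 10 followed by multiplication by 10 maps every year to the first year of its decade, which is the same grouping the trunc() expression produces.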

Finally, the subset of Storm Data was aggregated by decades and atmospheric event types (EVTYPE) to provide the summary of health and economic related effects.

StormData.analysis <- summarise(group_by(StormData.subset, decade, EVTYPE), sum.health=sum(health), sum.economic=sum(economic))

Results

The data prepared for analysis were ranked to identify the top 3 atmospheric events affecting the population and economy in each decade. The result set contained 2 different labels for the same event, TSTM WIND and THUNDERSTORM WIND. To analyse consistent event types, the TSTM WIND label was replaced with THUNDERSTORM WIND. The correction follows section 2.1.1, Storm Data Event Table, of the Storm Data documentation referred to in the section Acquiring data of this document.

top3.evtype.health <- na.omit(setDT(StormData.analysis)[order(decade,-sum.health), .SD[1:3], by=decade])
top3.evtype.health$EVTYPE[top3.evtype.health$EVTYPE == "TSTM WIND"] <- "THUNDERSTORM WIND"
top3.evtype.health$EVTYPE <- factor(top3.evtype.health$EVTYPE)
top3.evtype.health$EVTYPE <- factor(top3.evtype.health$EVTYPE, levels = rev(levels(top3.evtype.health$EVTYPE)))

top3.evtype.economic <- na.omit(setDT(StormData.analysis)[order(decade,-sum.economic), .SD[1:3], by=decade])
top3.evtype.economic$EVTYPE[top3.evtype.economic$EVTYPE == "TSTM WIND"] <- "THUNDERSTORM WIND"
top3.evtype.economic$EVTYPE <- factor(top3.evtype.economic$EVTYPE)
top3.evtype.economic$EVTYPE <- factor(top3.evtype.economic$EVTYPE, levels = rev(levels(top3.evtype.economic$EVTYPE)))
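
In the code above the labels are corrected after the top 3 have been selected, so the totals of TSTM WIND and THUNDERSTORM WIND are not combined. A variant that unifies the label before aggregating is sketched below in base R on made-up numbers (the `toy` table and its values are hypothetical); it is an illustration of the idea, not the procedure used for the figures in this report.

```r
# Toy per-record data for a single decade (hypothetical values, not NOAA data)
toy <- data.frame(
  decade = rep(2000, 5),
  EVTYPE = c("TSTM WIND", "THUNDERSTORM WIND", "TORNADO", "FLASH FLOOD", "HAIL"),
  health = c(40, 30, 60, 20, 5),
  stringsAsFactors = FALSE
)

# 1. Unify the two labels first, so their totals are added together
toy$EVTYPE[toy$EVTYPE == "TSTM WIND"] <- "THUNDERSTORM WIND"

# 2. Aggregate by decade and event type
agg <- aggregate(health ~ decade + EVTYPE, data = toy, FUN = sum)

# 3. Sort within each decade by descending total and keep the first 3 rows
agg <- agg[order(agg$decade, -agg$health), ]
top3 <- do.call(rbind, lapply(split(agg, agg$decade), head, n = 3))
print(top3)
```

With these toy numbers the combined thunderstorm wind total ranks first, whereas either label on its own would rank below TORNADO, which shows how the order of the two steps can change the ranking.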

The following figures visualize the top 3 events contributing to health and economic damages over the decades.

health.plot <- ggplot(top3.evtype.health, aes(x = reorder(EVTYPE, -sum.health), y = sum.health, fill=EVTYPE))
health.plot <- health.plot + geom_bar(stat="identity", fill="#3333CC")
health.plot <- health.plot + guides(fill = "none")  ## hide the fill legend
health.plot <- health.plot + facet_grid(~ decade, scales = "free", space = "free")
health.plot <- health.plot + scale_fill_brewer()
health.plot <- health.plot + theme(axis.text.x = element_text(angle = 90, hjust = 1, size=8))
health.plot <- health.plot + theme(panel.background = element_rect(fill = "#333333"))
health.plot <- health.plot + ggtitle("Top 3 atmospheric events\nharmful to population health (by decades)\n")
health.plot <- health.plot + theme(plot.title = element_text(lineheight=.8, face="bold"))
health.plot <- health.plot + xlab("Atmospheric events by decades")
health.plot <- health.plot + ylab("Number of people affected (fatalities + injuries)")
health.plot

economic.plot <- ggplot(top3.evtype.economic, aes(x = reorder(EVTYPE, -sum.economic), y = sum.economic, fill=EVTYPE))
economic.plot <- economic.plot + geom_bar(stat="identity", fill="#006666")
economic.plot <- economic.plot + guides(fill = "none")  ## hide the fill legend
economic.plot <- economic.plot + facet_grid(~ decade, scales = "free", space = "free")
economic.plot <- economic.plot + scale_fill_brewer()
economic.plot <- economic.plot + theme(axis.text.x = element_text(angle = 90, hjust = 1, size=8))
economic.plot <- economic.plot + theme(panel.background = element_rect(fill = "#333333"))
economic.plot <- economic.plot + ggtitle("Top 3 atmospheric events\nwith great economic consequence (by decades)\n")
economic.plot <- economic.plot + theme(plot.title = element_text(lineheight=.8, face="bold"))
economic.plot <- economic.plot + xlab("Atmospheric events by decades")
economic.plot <- economic.plot + ylab("Total amount of damage caused [$]")
economic.plot

According to the figures above, tornados are the atmospheric events that have most significantly affected the population of the U.S. over the past decades. Although tornados caused the highest number of fatalities and injuries in the 2000s, most of the economic damage of that decade was caused primarily by flash floods and secondarily by thunderstorm wind.