Analysis of harmful events to population health and economy in United States

Synopsis

A number of catastrophes over the years have had an adverse effect on the population health and economy of the United States of America. These include storms, hurricanes, floods etc. and have been a major deterrent in the progress of the country. These “severe events” can result in fatalities and damage to property and hence prevention or mainenance following such outcomes remains to be the key focus.

This project aims at understanding the effect of these severe outcomes on the population health and economy with an aim to better focus or prioritize the resources for research/maintenance within the general population or municipalities.

Based on the cursory data analysis, the fatalities incurred over the years from 1950 to 2011 were primarily caused by “Tornadoes” which also prove to be the major cause for injuries. However “Floods” turn out be the major deterrent of the US economy.

Data Processing

The data used for this analysis was obtained from the public domain provided by the National Weather Service (nws). The raw data is a zipped csv file and documentation pertaining to the data can be obtained from the NWS or the National Climatic Data Center.

1. Reading the data

After downloading the data it. We can read it with the read.csv command and look at the summary. (Note: read.csv can be used to read .csv.bz2 files.)

stormdata <- read.csv('./../repdata_data_StormData.csv.bz2', stringsAsFactors = F, header = T)
str(stormdata)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

There are a total of 902297 observations with 37 variables.

2. Cleaning the data

We will be using only the data from the columns, evtype, propdmg, cropdmg, propdmgexp, cropdmgexp, state, fatalities and injuries. Therefore we will subset these variables into another dataframe reducing the number of variables to 8.

data <- stormdata[,c("STATE","EVTYPE","FATALITIES","INJURIES","PROPDMG","CROPDMG","PROPDMGEXP","CROPDMGEXP")]
str(data)
## 'data.frame':    902297 obs. of  8 variables:
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
summary(data)
##     STATE              EVTYPE            FATALITIES      
##  Length:902297      Length:902297      Min.   :  0.0000  
##  Class :character   Class :character   1st Qu.:  0.0000  
##  Mode  :character   Mode  :character   Median :  0.0000  
##                                        Mean   :  0.0168  
##                                        3rd Qu.:  0.0000  
##                                        Max.   :583.0000  
##     INJURIES            PROPDMG           CROPDMG       
##  Min.   :   0.0000   Min.   :   0.00   Min.   :  0.000  
##  1st Qu.:   0.0000   1st Qu.:   0.00   1st Qu.:  0.000  
##  Median :   0.0000   Median :   0.00   Median :  0.000  
##  Mean   :   0.1557   Mean   :  12.06   Mean   :  1.527  
##  3rd Qu.:   0.0000   3rd Qu.:   0.50   3rd Qu.:  0.000  
##  Max.   :1700.0000   Max.   :5000.00   Max.   :990.000  
##   PROPDMGEXP         CROPDMGEXP       
##  Length:902297      Length:902297     
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 

To ensure the data is complete, we will first check whether the data contains any NA values.

sum(is.na(data))
## [1] 0

Since there are no missing values, We can proceed with attempting a cursory analysis of the data.

Results and Data Analysis

1. Injuries and Fatalities

We will first convert the EVTYPE variable to a factor, aggregate the fatalities and using dplyr order in descending order. We will then plot the data.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library (ggplot2)
data$EVTYPE <- as.factor(data$EVTYPE)
fatal_data <- aggregate(FATALITIES ~ EVTYPE,data,sum)
ordered_fatal_data <- arrange(fatal_data,desc(FATALITIES))
# png(filename = 'fatalities_plot.png')
ggplot(ordered_fatal_data[1:10,], aes(EVTYPE, FATALITIES, fill=FATALITIES)) + geom_bar(stat="identity", colour="red") + ylab("Fatalities") + xlab("Event type") + ggtitle('Fatalities caused by Event types(top 10)') + theme(axis.text.x = element_text(angle=90,hjust = 1,vjust = 0.5))

# dev.off()

The graph above indicates that HAIL is the deadliest of the event types causing the loss of life of millions of people. To obtain an idea about the event type causing most injuries,

library(dplyr)
library(ggplot2)
injury_data <- aggregate(INJURIES ~ EVTYPE, data, sum)
ordered_injury_data <- arrange(injury_data, desc(INJURIES))
ggplot(ordered_injury_data[1:10,], aes(EVTYPE, INJURIES, fill=INJURIES)) + geom_bar(stat="identity", colour = "red") + ylab("Injuries") + xlab("Event type") + ggtitle('Injuries caused by Event types(top 10)') + theme(axis.text.x = element_text(angle=90,hjust = 1,vjust = 0.5))

Here we see that it is TORNADOES that cause the most injuries.

2. Economic Impact

The PROPDMGEXP and the CROPDMGEXP are scale factors to be used as exponents for the PROPDMG and CROPDMG variables. Let us first see what are the various values that PROPDMGEXP and CROPDMGEXP take:

sort(unique(data$PROPDMGEXP))
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
sort(unique(data$CROPDMGEXP))
## [1] ""  "?" "0" "2" "B" "k" "K" "m" "M"

To cover the exponent values, we will use the following substitutions: - “” unchanged - + unchanged - ? unchanged - - -1 - H hundred (x 100) - K/k thousand (x 1000) - M/m million (x 1000,000) - B billion (x 1000,000,000) - Any other number will be 10 raised to the power of that number.

We wil write two functions to calculate the economic impact based on the scale factor and the PROPDMG variable. We will call that economic_impact.

change_exp <- function(x, y){switch(y, 
               '+' = x*1,
               '?' = x*1,
               ' ' = x*1,
               '1' = x*1,
               '2' = x*100,
               '3' = x*1000,
               '4' = x*10000,
               '5' = x*100000,
               '6' = x*1000000,
               '7' = x*10000000,
               '8' = x*100000000,
               'H' = x*100,
               'K' = x*1000,
               'k' = x*1000,
               'M' = x*1000000,
               'm' = x*1000000,
               'B' = x*1000000000,
               'b' = x*1000000000,
               x)
}

compute_economic_imp <- function(x,exp){
        len <- length(x)
        res <- rep(0,len)
        for (i in 1:len){
                res[i] <- change_exp(x[i],exp[i])
        }
        res
}

data$economic_impact <- compute_economic_imp(data$PROPDMG, data$PROPDMGEXP)

Finally we will compute the economic impact based on the event type and plot it.

eco_data <- aggregate(economic_impact ~ EVTYPE, data, sum)
ordered_eco_data <- arrange(eco_data, desc(economic_impact))
ggplot(ordered_eco_data[1:10,], aes(EVTYPE, economic_impact, fill=economic_impact)) + geom_bar(stat="identity", colour = "red") + ylab("Injuries") + xlab("Event type") + ggtitle('Economic impact of different Event types(top 10)') + theme(axis.text.x = element_text(angle=90,hjust = 1,vjust = 0.5))