Tornadoes may be the most harmful weather events for population health and the economy in the US

opts_chunk$set(echo = TRUE, cache = TRUE)
knit_hooks$set(plot = function(x, options) {
    paste("<figure><img src=\"", opts_knit$get("base.url"), paste(x, collapse = "."), 
        "\"><figcaption>", options$fig.cap, "</figcaption></figure>", sep = "")
})

Synopsis

In this report I identify the weather events most harmful to population health and the economy in the US from 1950 to November 2011. Many severe events cause human and economic losses, and preventing such outcomes may help improve our lives. To answer the question, I summed the population health and economic losses by type of severe weather event. The data were obtained from the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database. I found that tornadoes may be the most harmful events for both population health and the economy in the US over this period. However, when the losses are divided into subgroups, the answer may change.

Data Processing

From the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database, I obtained data on the characteristics of major storms and weather events in the United States from 1950 to November 2011. Detailed documentation for the data can be found in the National Weather Service Storm Data Documentation and the National Climatic Data Center Storm Events FAQ.

Reading in the storms data


# set the work directory

setwd("/home/yufree/storm/")

# download the data and save it into the data subfolder

if (!file.exists("data")) {
    dir.create("data")
}
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile = "./data/DATA.csv.bz2", method = "curl")

# decompress the file with 'bunzip2' from the R.utils package; use
# 'install.packages('R.utils')' to install the package

library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.6.1 (2014-01-04) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.18.0 (2014-02-22) successfully loaded. See ?R.oo for help.
## 
## Attaching package: 'R.oo'
## 
## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods
## 
## The following objects are masked from 'package:base':
## 
##     attach, detach, gc, load, save
## 
## R.utils v1.29.8 (2014-01-27) successfully loaded. See ?R.utils for help.
## 
## Attaching package: 'R.utils'
## 
## The following object is masked from 'package:utils':
## 
##     timestamp
## 
## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, inherits, isOpen, parse, warnings
# decompress only if the CSV is not already present; keep the archive
if (!file.exists("./data/DATA.csv")) {
    bunzip2("./data/DATA.csv.bz2", remove = FALSE)
}

# use read.csv to read the data

data <- read.csv("./data/DATA.csv")
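
As an aside, base R can usually read a bzip2-compressed CSV without decompressing it first, because read.csv() opens the path through file(), which detects the compression. A minimal sketch on a small made-up file (the temporary path and the toy columns are illustrative assumptions, not part of the storm data):

```r
# write a tiny CSV through a bzip2 connection (toy data, hypothetical values)
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(1, 0))
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() decompresses transparently when given the compressed path
toy_back <- read.csv(tmp, stringsAsFactors = FALSE)
print(toy_back)
```

Reading the archive directly would avoid keeping a second, uncompressed copy on disk, at the cost of decompressing on every read.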

After reading in the data I check the structure of the dataset (there are 902297 rows and 37 columns in this dataset).

dim(data)
## [1] 902297     37
str(data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "10/10/1954 0:00:00",..: 6523 6523 4213 11116 1426 1426 1462 2873 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "000","0000","00:00:00 AM",..: 212 257 2645 1563 2524 3126 122 1563 3126 3126 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","E","Eas","EE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","?","(01R)AFB GNRY RNG AL",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","10/10/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels "","?","0000",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","(0E4)PAYSON ARPT",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels "","2","43","9V9",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels ""," ","  ","   ",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

From the documentation, the relevant columns are EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP, which contain the type of storm event and its population health and economic consequences. The data still need processing to give a clean dataset for further analysis.

# check for the missing data

sum(is.na(c(data$EVTYPE, data$FATALITIES, data$INJURIES, data$PROPDMG, data$PROPDMGEXP, 
    data$CROPDMG, data$CROPDMGEXP)))
## [1] 0

# calculate new variables for the property damage from 'PROPDMG' and
# 'PROPDMGEXP' and for the crop damage from 'CROPDMG' and 'CROPDMGEXP', using
# 'recode' from the car package (install.packages('car'))

library(car)

levels(data$PROPDMGEXP)
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
data$PROPDMGEXP <- as.numeric(recode(data$PROPDMGEXP, "'0'=1;'1'=10;'2'=100;'3'=1000;'4'=10000;'5'=100000;'6'=1000000;'7'=10000000;'8'=100000000;'B'=1000000000;'h'=100;'H'=100;'K'=1000;'m'=1000000;'M'=1000000;'-'=0;'?'=0;'+'=0"))
data$Pvalue <- data$PROPDMG * data$PROPDMGEXP

levels(data$CROPDMGEXP)
## [1] ""  "?" "0" "2" "B" "k" "K" "m" "M"
data$CROPDMGEXP <- as.numeric(recode(data$CROPDMGEXP, "'0'=1;'2'=100;'B'=1000000000;'k'=1000;'K'=1000;'m'=1000000;'M'=1000000;''=0;'?'=0"))
data$CPvalue <- data$CROPDMG * data$CROPDMGEXP

# subset the data needed

subdata <- data[, c("EVTYPE", "FATALITIES", "INJURIES", "Pvalue", "CPvalue")]
rm(data)  # my computer has a shortage of memory
str(subdata)
## 'data.frame':    902297 obs. of  5 variables:
##  $ EVTYPE    : Factor w/ 985 levels "?","ABNORMALLY DRY",..: 830 830 830 830 830 830 830 830 830 830 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ Pvalue    : num  150 15 150 15 15 15 15 15 150 150 ...
##  $ CPvalue   : num  0 0 0 0 0 0 0 0 0 0 ...
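
One point worth checking in this step: when recode() is given a factor it returns a factor by default, and as.numeric() on a factor returns the internal level codes rather than the numbers shown in the labels. The Pvalue of 150 for a PROPDMG of 25 with a 'K' exponent in the output above suggests this may be affecting the multipliers. A minimal sketch of the behaviour and of one possible alternative, a named lookup vector (the toy vectors and the name `mult` are illustrative assumptions, not the original analysis):

```r
# toy exponent codes and damage figures (hypothetical values)
exp_code <- factor(c("K", "M", "K", ""))
dmg      <- c(25, 2.5, 1, 3)

# as.numeric() on a factor gives level indices, not multipliers
codes <- as.numeric(exp_code)
print(codes)

# a named lookup vector maps each code to its multiplier explicitly
mult <- c("K" = 1e3, "M" = 1e6, "B" = 1e9)
multiplier <- unname(mult[as.character(exp_code)])
multiplier[is.na(multiplier)] <- 0   # assumption: blank or unknown codes count as 0
value <- dmg * multiplier
print(value)
```

If the multipliers are indeed level codes, the dollar totals and the rankings derived from them would need to be recomputed before drawing conclusions.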

Results

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

I took the 10 event types with the highest total number of people directly killed, injured, and both combined across the United States, and show them in a bar plot.

indexF <- tapply(subdata$FATALITIES, subdata$EVTYPE, sum)
indexF <- indexF[order(indexF, decreasing = TRUE)][1:10]
indexI <- tapply(subdata$INJURIES, subdata$EVTYPE, sum)
indexI <- indexI[order(indexI, decreasing = TRUE)][1:10]
subdata$pop <- subdata$FATALITIES + subdata$INJURIES
indexpa <- tapply(subdata$pop, subdata$EVTYPE, sum)
indexpa <- indexpa[order(indexpa, decreasing = TRUE)][1:10]
# make the barplot
par(mfrow = c(3, 1))
barplot(indexF, xlab = "type of events", ylab = "number of people")
title("A: Directly killed", adj = 0)
barplot(indexI, xlab = "type of events", ylab = "number of people")
title("B: Injured", adj = 0)
barplot(indexpa, xlab = "type of events", ylab = "number of people")
title("C: Population health", adj = 0)
The 10 event types with the highest total number of people directly killed (A), injured (B), and both combined (C) across the United States.

From the figure, tornadoes are the most harmful type of event for population health across the United States from 1950 to November 2011. However, when the casualties are split into people directly killed and people injured, hail is the event type that injures the most people, while tornadoes kill the most.
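
The ranking step used here (sum a measure by event type, then keep the largest totals) can be sketched as a small helper. This is only an illustration on made-up numbers; the function name `top_events` and the toy data are assumptions, not part of the original analysis:

```r
# sum `value` within each level of `type` and return the n largest totals
top_events <- function(value, type, n = 10) {
    totals <- tapply(value, type, sum, na.rm = TRUE)
    totals <- totals[!is.na(totals)]   # drop unused factor levels
    head(totals[order(totals, decreasing = TRUE)], n)
}

# toy data (hypothetical values)
toy <- data.frame(
    EVTYPE     = c("TORNADO", "HAIL", "TORNADO", "FLOOD", "HAIL"),
    FATALITIES = c(3, 0, 2, 1, 2)
)
top2 <- top_events(toy$FATALITIES, toy$EVTYPE, n = 2)
print(top2)
```

A helper like this would let the same ranking be reused for fatalities, injuries, and the damage columns without repeating the tapply() and order() calls.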

Across the United States, which types of events have the greatest economic consequences?

I took the 10 event types with the highest total property damage, crop damage, and both combined across the United States, and show them in a bar plot.

indexP <- tapply(subdata$Pvalue, subdata$EVTYPE, sum)
indexP <- indexP[order(indexP, decreasing = TRUE)][1:10]
indexCP <- tapply(subdata$CPvalue, subdata$EVTYPE, sum)
indexCP <- indexCP[order(indexCP, decreasing = TRUE)][1:10]
subdata$eco <- subdata$Pvalue + subdata$CPvalue
indexea <- tapply(subdata$eco, subdata$EVTYPE, sum)
indexea <- indexea[order(indexea, decreasing = TRUE)][1:10]

par(mfrow = c(3, 1))
barplot(indexP, xlab = "type of events", ylab = "dollar amounts")
title("A: Property damage", adj = 0)
barplot(indexCP, xlab = "type of events", ylab = "dollar amounts")
title("B: Crop damage", adj = 0)
barplot(indexea, xlab = "type of events", ylab = "dollar amounts")
title("C: Economic consequences", adj = 0)
The 10 event types with the highest total property damage (A), crop damage (B), and both combined (C) across the United States.

From the figure, tornadoes have the greatest economic consequences across the United States from 1950 to November 2011. However, when the economic consequences are split into property damage and crop damage, hail causes the most crop damage, while tornadoes cause the most property damage.