Summary of Damages to Human Health and the Economy Caused by Storm and Weather Events in the USA, 1950-2011

Synopsis

In this report I aim to provide a quick overview of the damage caused by major weather events in the USA from 1950 to 2011. To give a more detailed view of the subject, the report focuses on two main kinds of damage: damage to human health, broken down into fatalities and injuries, and damage to the economy. The dataset used for this report is the one offered by NOAA (https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2), gathered from observation sites all over the USA between 1950 and 2011. A codebook helpful in understanding the dataset is available at this link: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf. Given the long time span covered and the heterogeneity of the observations, the dataset is quite far from tidy and needs some processing in order to provide reliable results. In the end it was possible to obtain the results I was looking for while keeping most of the information offered by the dataset. The main findings are that Tornadoes are the events that cause the most damage to human health, and that Floods cause the most damage to the economy.

Data Processing

Data Loading

With the following code I downloaded the data from the link given in the Synopsis and read it using read.csv, which can read this kind of compressed (bz2) file directly. I also loaded the required libraries.

library(plyr)
library(ggplot2)
library(Hmisc)
library(knitr)
library(R.utils)
library(lattice)
Sys.setlocale("LC_ALL","C")
fileUrl<-"http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl,destfile="repdata_data_StormData.csv.bz2")
data <- read.csv("repdata_data_StormData.csv.bz2", header = TRUE)
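
As an optional refinement, not part of the original code, the download can be skipped when the compressed file is already present locally; a minimal sketch using only base R:

if (!file.exists("repdata_data_StormData.csv.bz2")) {
        download.file(fileUrl, destfile = "repdata_data_StormData.csv.bz2")
}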

To get a first view of the data, I examined the header and the first rows of the dataset, and counted how many rows contain at least one missing value (coded as NA); this count is stored in the 'missing' variable below.

head(data)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0                                    
## 2        0     2.5          K       0                                    
## 3        2    25.0          K       0                                    
## 4        2     2.5          K       0                                    
## 5        2     2.5          K       0                                    
## 6        6     2.5          K       0                                    
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6
narows <- unique (unlist (lapply (data, function (x) which (is.na (x)))))
missing <- length(narows)
missing
## [1] 902297
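
The count of rows containing at least one NA is very large, so a quick per-column check can show which columns account for most of the missing values. This is only a sketch using base R; it was not part of the original analysis.

# Number of NA values per column, largest first
sort(colSums(is.na(data)), decreasing = TRUE)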

Data Processing - Part 1: Damages to Health

Then I focused on the EVTYPE column, which contains the key variable of my analysis. With the following R code I created a subset of the data containing only the columns needed for analyzing health damages (EVTYPE, FATALITIES, INJURIES) and only the rows with valid values (i.e. FATALITIES + INJURIES > 0). Inspecting the EVTYPE column, I found that this variable has 985 levels in the dataset, far more than the 48 event types defined by the codebook mentioned in the Synopsis. The observations on this variable are therefore far from tidy, and most of the processing had to be done on EVTYPE to tidy it up. I omitted the results of this R code chunk because the output of the unique function is too long to show.

health <- subset(data, (FATALITIES + INJURIES >0), select=c("EVTYPE","FATALITIES","INJURIES"))
unique(health$EVTYPE)
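
For reference, the counts themselves can be obtained without printing the full list; this short check is a sketch and was not part of the original output.

# Distinct EVTYPE values in the full dataset and in the health subset
length(unique(data$EVTYPE))
length(unique(health$EVTYPE))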

For the purpose of this analysis, and to preserve the reliability and comparability of the results, I decided to reshape the EVTYPE column by keeping only the rows in which the beginning of the event type string matches one of the 48 event types defined by NOAA in its codebook. I encoded the 48 correct event types into a vector, copying them from the NOAA codebook. This is done with the following R code.

evtidy <- toupper(c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", 
               "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke", 
               "Drought", "Dust Devil", "Dust Storm", "Excessive Heat", 
               "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze", 
               "Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain", 
               "Heavy Snow", "High Surf", "High Wind", "Hurricane (Typhoon)", 
               "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", "Lightning", 
               "Marine Hail", "Marine High Wind", "Marine Strong Wind", 
               "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", 
               "Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado", 
               "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", 
               "Waterspout", "Wildfire", "Winter Storm", "Winter Weather"))
health[,1] <- toupper(health[,1])   # uppercase EVTYPE so matching is case-insensitive
health[,1] <- trim(health[,1])      # strip leading/trailing whitespace (R.utils::trim)
healthsum <- ddply(health, .(EVTYPE), numcolwise(sum), drop=FALSE)  # totals by raw EVTYPE
healthtidy <- data.frame()
chunk <- data.frame()
# Keep only the rows whose EVTYPE begins with one of the 48 official event types,
# relabelling them with the official name
for( i in (1:48) ) { 
        event <- evtidy[i]
        chunk <- subset(healthsum, substr(EVTYPE, 1, nchar(event))==event)
        if (nrow(chunk) > 0) {
                chunk[,1] <- event
                healthtidy <- rbind(healthtidy, chunk)
        }
}

The strategy adopted to reshape the EVTYPE column keeps most of the original data, as shown by the following R code results, where healthsum is the dataset before the reshaping and healthtidy is the reshaped dataset. The reshaping of the EVTYPE column retains more than 90% of the total number of FATALITIES and INJURIES, which gives the analysis a reasonably good reliability.

sum(healthsum$FATALITIES)
## [1] 15145
sum(healthsum$INJURIES)
## [1] 140528
sum(healthtidy$FATALITIES)
## [1] 13715
sum(healthtidy$INJURIES)
## [1] 128982
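
The retained fractions can also be computed directly; this quick check is a sketch and was not part of the original output.

# Share of fatalities and injuries kept after reshaping EVTYPE
sum(healthtidy$FATALITIES) / sum(healthsum$FATALITIES)
sum(healthtidy$INJURIES) / sum(healthsum$INJURIES)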

The following R code completes the data processing needed to reach the final results. It groups the dataset by EVTYPE and adds a new column, named HARMINDEX, that synthesizes in one number the overall damage caused to human health. For the purpose of this analysis, I decided to weight FATALITIES ten times as heavily as INJURIES, so the HARMINDEX that measures the overall damage to health equals 10 times the number of FATALITIES plus the number of INJURIES.

harmsum <- ddply(healthtidy, .(EVTYPE), numcolwise(sum), drop=FALSE)
HARMINDEX <- ((harmsum$FATALITIES*10) + harmsum$INJURIES)
harmsum <- cbind(harmsum,HARMINDEX)
harmsum$EVTYPE <- as.factor(harmsum$EVTYPE)
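
The factor of 10 is a judgment call made for this report. As a minimal sketch, assuming the harmsum data frame built above (the harm_index helper and the weight of 5 are illustrative, not part of the original analysis), the index can be wrapped in a small function to see how the ranking changes with a different weight:

# Hypothetical helper: harm index with an arbitrary fatality weight w
harm_index <- function(fatalities, injuries, w = 10) w * fatalities + injuries
# Example: rank the event types using a weight of 5 instead of 10
alt <- transform(harmsum, ALTINDEX = harm_index(FATALITIES, INJURIES, w = 5))
head(alt[order(-alt$ALTINDEX), c("EVTYPE", "ALTINDEX")], 5)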

Data Processing - Part 2: Damages to Economy

I performed the same kind of EVTYPE processing on the data subset used to compute the damages to the economy. I decided to carry out the data processing separately for the two analyses in order to preserve as much data as possible in both cases: using two separate subsets avoids losing rows whose relevant columns are complete just because a value is missing in a column that is not relevant to that analysis. The following R code reshapes the EVTYPE column in the data subset used to compute the total damage to the economy, keeping the following relevant columns: EVTYPE, PROPDMG (damage to properties), CROPDMG (damage to crops), PROPDMGEXP (PROPDMG exponent), CROPDMGEXP (CROPDMG exponent).

data2 <- subset(data, (PROPDMG + CROPDMG >0), select=c("EVTYPE","PROPDMG","CROPDMG","PROPDMGEXP","CROPDMGEXP"))
econ <- data2[complete.cases(data2),]
econ[,1] <- toupper(econ[,1])
econ[,1] <- trim(econ[,1])
econtidy <- data.frame()
chunk <- data.frame()
for( i in (1:48) ) { 
        event <- evtidy[i]
        chunk <- subset(econ, substr(EVTYPE, 1, nchar(event))==event)
        if (nrow(chunk) > 0) {
                chunk[,1] <- event
                econtidy <- rbind(econtidy, chunk)
        }
}

In this case, the reshaping of the EVTYPE column retains more than 85% of the total amount of property and crop damage, which again gives the analysis a reasonably good reliability (econ is the dataset before reshaping, econtidy the one after reshaping).

sum(econ$PROPDMG)
## [1] 10884500
sum(econ$CROPDMG)
## [1] 1377827
sum(econtidy$PROPDMG)
## [1] 9306010
sum(econtidy$CROPDMG)
## [1] 1225886

The difficulty with the second analysis lies in the PROPDMGEXP and CROPDMGEXP columns. These two columns should contain the powers of 10 by which the PROPDMG and CROPDMG values, respectively, have to be multiplied. It turns out that various encodings have been used over time to record the exponents, as you can see from the resulting levels below.

unique(econtidy$PROPDMGEXP)
##  [1] K M   0 5 4 7 B + m H - 6 h 2 3
## Levels:  + - 0 1 2 3 4 5 6 7 8 ? B H K M h m
unique(econtidy$CROPDMGEXP)
## [1] K   M B ? k 0
## Levels:  0 2 ? B K M k m

So it is necessary to reshape these columns in order to get tidy exponent data. The NOAA codebook only explains the letters in the two columns: 'K' stands for 10^3, 'M' for 10^6 and 'B' for 10^9. For the purpose of this analysis, and to avoid losing too much data, I made two assumptions: first, that 'H' stands for 'hundreds', i.e. 10^2; second, that the numbers in the exponent columns give the number of zeros to be appended to the damage values. With the following R code I reshaped the exponent columns according to these assumptions, replacing letters with the corresponding number of zeros. As expected, converting the two exponent columns to numeric produced two warnings about missing (NA) values, caused by values that originally had no numeric meaning. I dealt with this by keeping only complete rows in the final dataset, dropping the rows with missing (NA) values. This way the damage is evaluated on complete and reliable data, consistent with the NOAA codebook.

econtidy$PROPDMGEXP <- as.character(econtidy$PROPDMGEXP)
econtidy$PROPDMGEXP <- toupper(econtidy$PROPDMGEXP)
econtidy$PROPDMGEXP <- trim(econtidy$PROPDMGEXP)
econtidy$CROPDMGEXP <- as.character(econtidy$CROPDMGEXP)
econtidy$CROPDMGEXP <- toupper(econtidy$CROPDMGEXP)
econtidy$CROPDMGEXP <- trim(econtidy$CROPDMGEXP)
# Replace letter codes with the corresponding power of ten
# (column 4 = PROPDMGEXP, column 5 = CROPDMGEXP)
econtidy[,4] <- sub("B","9", econtidy[,4])
econtidy[,4] <- sub("H","2", econtidy[,4])
econtidy[,4] <- sub("K","3", econtidy[,4])
econtidy[,4] <- sub("M","6", econtidy[,4])
econtidy[,5] <- sub("B","9", econtidy[,5])
econtidy[,5] <- sub("K","3", econtidy[,5])
econtidy[,5] <- sub("M","6", econtidy[,5])
econtidy$PROPDMGEXP <- as.numeric(econtidy$PROPDMGEXP)
## Warning: NAs introduced by coercion
econtidy$CROPDMGEXP <- as.numeric(econtidy$CROPDMGEXP)
## Warning: NAs introduced by coercion
unique(econtidy$PROPDMGEXP)
unique(econtidy$CROPDMGEXP)
econcompl <- econtidy[complete.cases(econtidy),]
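
An equivalent and somewhat more compact way to decode the exponent symbols is a named lookup vector; symbols not in the map become NA and are then dropped by the complete-cases step. This is only a sketch under the same assumptions as above, and the expmap vector and decode_exp helper are illustrative, not part of the original code.

# Sketch of a lookup-based alternative for decoding the exponent symbols
expmap <- c("H" = 2, "K" = 3, "M" = 6, "B" = 9,
            "0" = 0, "1" = 1, "2" = 2, "3" = 3, "4" = 4,
            "5" = 5, "6" = 6, "7" = 7, "8" = 8)
decode_exp <- function(x) unname(expmap[toupper(trim(as.character(x)))])
# e.g. decode_exp("K") returns 3, while unknown symbols such as "+" or "?" return NA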

I then concluded the data processing by adding a new column, named TOTDAMAGES, that sums the property and crop damages, each multiplied by the corresponding power of ten, and by finally grouping the data by EVTYPE (event type).

TOTDAMAGES <- ((econcompl$PROPDMG * (10 ^ econcompl$PROPDMGEXP)) +
                   (econcompl$CROPDMG * (10 ^ econcompl$CROPDMGEXP)))
econcompl <- cbind(econcompl, TOTDAMAGES)
econsum <- ddply(econcompl, .(EVTYPE), numcolwise(sum), drop=FALSE)
econsum$EVTYPE <- as.factor(econsum$EVTYPE)
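
For a quick look at the full ranking behind the Results section, the event types can be ordered by total damage; this sketch shows the code only and was not part of the original output.

# Top event types by total economic damage, largest first
head(econsum[order(-econsum$TOTDAMAGES), c("EVTYPE", "TOTDAMAGES")], 5)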

Results

Part 1: Damages to Health

The storm events that cause the most damage to human health are TORNADOES, as reported by the R code below.

harmmax <- which.max(harmsum[,4])
harmsum[harmmax,c(1:4)]
##     EVTYPE FATALITIES INJURIES HARMINDEX
## 30 TORNADO       5658    91364    147944

Graph 1: Total Damages to Health

harmmean <- mean(harmsum$HARMINDEX)
g <-ggplot(harmsum, aes(x=EVTYPE, y=HARMINDEX))
g <- g + geom_bar(color="black", fill="light blue", stat="identity")
g <- g + geom_hline(aes(yintercept=harmmean), data=harmsum, colour="red", linetype="dashed")
g <- g + theme(axis.text.x = element_text(angle = -270, hjust = 1, size=10))
g <- g + theme(axis.text.y = element_text(size=10))
g <- g + theme(axis.title=element_text(size=12,face="bold"))
g <- g + theme(title=element_text(size=14,face="bold"))
g <- g + labs(y = "Index of Damages to Health: Fatalities*10 + Injuries")
g <- g + labs(x = "Storm Event type")
g <- g + labs(title = "Health Damages Index by Event type 1950->2011 (red dashed line = Mean)")
print(g)

GRAPH CAPTION: As the graph above shows, TORNADOES are by far the worst cause of health damage, and no other event comes close to them in causing FATALITIES and INJURIES. The number of event types shown is less than the 48 defined by NOAA, because the final processed dataset keeps only the events with positive damage values.

Graph 2: Total number of Fatalities

fatalmean <- mean(harmsum$FATALITIES)
g <-ggplot(harmsum, aes(x=EVTYPE, y=FATALITIES))
g <- g + geom_bar(color="black", fill="dark grey", stat="identity")
g <- g + geom_hline(aes(yintercept=fatalmean), data=harmsum, colour="red", linetype="dashed")
g <- g + theme(axis.text.x = element_text(angle = -270, hjust = 1, size=10))
g <- g + theme(axis.text.y = element_text(size=10))
g <- g + theme(axis.title=element_text(size=12,face="bold"))
g <- g + theme(title=element_text(size=14,face="bold"))
g <- g + labs(y = "Total number of Fatalities")
g <- g + labs(x = "Storm Event type")
g <- g + labs(title = "Total number of Fatalities by Event type 1950->2011 (red dashed line = Mean)")
print(g)

GRAPH CAPTION: To add some detail, the graph above shows the distribution of FATALITIES by storm event, since death is the heaviest of all possible damages. TORNADOES are confirmed as by far the worst threat for this consequence as well, while EXCESSIVE HEAT gains weight as the second cause of damage when it comes to counting deaths. The number of event types shown is less than the 48 defined by NOAA, because the final processed dataset keeps only the events with positive damage values.

Part 2: Damages to Economy

When it comes to damages to the economy, the worst threat is represented by FLOODS, as reported by the R code below.

econmax <- which.max(econsum[,6])
econsum[econmax,c(1,6)]
##    EVTYPE   TOTDAMAGES
## 14  FLOOD 138235749000

Graph 3: Total Damages to Economy

econmean <- mean(econsum$TOTDAMAGES)
g <-ggplot(econsum, aes(x=EVTYPE, y=TOTDAMAGES))
g <- g + geom_bar(color="black", fill="light green", stat="identity")
g <- g + geom_hline(aes(yintercept=econmean), data=econsum, colour="red", linetype="dashed")
g <- g + theme(axis.text.x = element_text(angle = -270, hjust = 1, size=10))
g <- g + theme(axis.text.y = element_text(size=10))
g <- g + theme(axis.title=element_text(size=12,face="bold"))
g <- g + theme(title=element_text(size=14,face="bold"))
g <- g + labs(y = "Total economic Damages in USD")
g <- g + labs(x = "Storm Event type")
g <- g + labs(title = "Total economic Damages by Storm type 1950->2011 (red line = Mean)")
print(g)

GRAPH CAPTION: As the graph above shows, FLOODS are by far the worst cause of economic damage, and no other event comes close to them in causing damage to properties and crops, not even TORNADOES, which nevertheless rank second in total economic damage. The number of event types shown is less than the 48 defined by NOAA, because the final processed dataset keeps only the events with positive damage values.