In this report I aim to provide a quick insight about the damages caused by major weather events in the USA from 1950 to 2011. To get a more detailed view about the subject, the report will focus on two main kind of damages: damages to human health, detailed in fatalities and injuries, and damages to economy. The dataset used to get the report is the one offered by NOAA (https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2), that was gathered from various observation sites all over the USA for the years between 1950 and 2011. A sort of codebook helpful in understanding the dataset is available at this link: https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf Given the quite long time-span covered, and the heterogeneity of the observations, it turns out that the dataset is quite far from tidiness, and it needs some processing in order to provide reliable results. In the end it had been possible to get the results I was looking for, while keeping most of the information offered by the dataset. The final outcomes are that Tornadoes are the events that cause more damages in terms of human health, and that Floods cause the most damages to economy.
Through the following code I downloaded data from the link indicated in the Synopsis, and I read them using read.csv, that is able to read directly this kind of compressed files. I opened also the required libraries as well.
library(plyr)
library(ggplot2)
library(Hmisc)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: splines
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
##
## The following objects are masked from 'package:plyr':
##
## is.discrete, summarize
##
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
library(knitr)
library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.6.1 (2014-01-04) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.18.0 (2014-02-22) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
##
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
##
## The following objects are masked from 'package:base':
##
## attach, detach, gc, load, save
##
## R.utils v1.34.0 (2014-10-07) successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
##
## The following object is masked from 'package:Hmisc':
##
## capitalize
##
## The following object is masked from 'package:utils':
##
## timestamp
##
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, inherits, isOpen, parse, warnings
library(lattice)
Sys.setlocale("LC_ALL","C")
fileUrl<-"http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl,destfile="repdata_data_StormData.csv.bz2")
data <- read.csv("repdata-data-StormData.csv.bz2", header = TRUE)
To get a first view of the data, I analyzed the header and first rows of the dataset, and how many rows contain missing values (coded as NA), reported by the below ‘missing’ variable.
head(data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 0
## 2 TORNADO 0 0
## 3 TORNADO 0 0
## 4 TORNADO 0 0
## 5 TORNADO 0 0
## 6 TORNADO 0 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 14.0 100 3 0 0
## 2 NA 0 2.0 150 2 0 0
## 3 NA 0 0.1 123 2 0 0
## 4 NA 0 0.0 100 2 0 0
## 5 NA 0 0.0 150 2 0 0
## 6 NA 0 1.5 177 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0
## 2 0 2.5 K 0
## 3 2 25.0 K 0
## 4 2 2.5 K 0
## 5 2 2.5 K 0
## 6 6 2.5 K 0
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 1
## 2 3042 8755 0 0 2
## 3 3340 8742 0 0 3
## 4 3458 8626 0 0 4
## 5 3412 8642 0 0 5
## 6 3450 8748 0 0 6
narows <- unique (unlist (lapply (data, function (x) which (is.na (x)))))
missing <- length(narows)
missing
## [1] 902297
Then I had to focus on the EVTYPE column, that contains the key variable of my analysis. By performing the following R code, I created a subset of data containing just the columns needed for analyzing the health damages (EVTYPE, FATALITIES, INJURIES), and rows that contain valid values (in which Fatalities+Injuries>0). Analyzing the EVTYPE column, I realized that there are 985 levels for this variable in the dataset, far more than the 48 levels stated by the codebook I mentioned in the Synopsis. So, the observations on this variable are for from tidy, and most of the processing had to be made on EVTYPE to get it tidy. I omitted the results of this R code chunk given that the outcome of the unique function is too long to be shown (985 rows).
health <- subset(data, (FATALITIES + INJURIES >0), select=c("EVTYPE","FATALITIES","INJURIES"))
unique(health$EVTYPE)
For the purpose of this analysis, and to preserve the reliability and comparability of the results, I decided to reshape the EVTYPE column by keeping only the rows in which the beginning of the event type string matches with one of the 48 event types stated by the NOAA in its codebook; I coded the correct 48 event types into a vector copying them from the mentioned NOAA codebook. This is obtained through the following R code.
evtidy <- toupper(c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood",
"Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke",
"Drought", "Dust Devil", "Dust Storm", "Excessive Heat",
"Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze",
"Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain",
"Heavy Snow", "High Surf", "High Wind", "Hurricane (Typhoon)",
"Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", "Lightning",
"Marine Hail", "Marine High Wind", "Marine Strong Wind",
"Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet",
"Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado",
"Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash",
"Waterspout", "Wildfire", "Winter Storm", "Winter Weather"))
health[,1] <- toupper(health[,1])
health[,1] <- trim(health[,1])
healthsum <- ddply(health, .(EVTYPE), numcolwise(sum), drop=FALSE)
healthtidy <- data.frame()
chunk <- data.frame()
for( i in (1:48) ) {
event <- evtidy[i]
chunk <- subset(healthsum, substr(EVTYPE, 1, nchar(event))==event)
if (nrow(chunk) > 0) {
chunk[,1] <- event
healthtidy <- rbind(healthtidy, chunk)
}
}
The strategy adopted to reshape the EVTYPE column allows to keep the most part of the original data, as it’s shown by the following R code results: healthsum is the dataset before the reshaping while healthtidy is the reshaped dataset. The reshaping of the EVTYPE column consents to keep more than 90% of total number of FATALITIES and INJURIES, granting a pretty good reliability to the analysis.
sum(healthsum$FATALITIES)
## [1] 15145
sum(healthsum$INJURIES)
## [1] 140528
sum(healthtidy$FATALITIES)
## [1] 13715
sum(healthtidy$INJURIES)
## [1] 128982
The following R code completes the data processing to get to the final results. It groups the dataset by EVTYPE, and adds a new column, named HARMINDEX, that aims to synthesize in one number the whole damage caused to human health. Just for the purpose of this analysis, I decided to weigh the FATALITIES 10 times the INJURIES, so the HARMINDEX that measures the whole damage to health is equal to 10 times the FATALITIES number plus the simple INJURIES number.
harmsum <- ddply(healthtidy, .(EVTYPE), numcolwise(sum), drop=FALSE)
HARMINDEX <- ((harmsum$FATALITIES*10) + harmsum$INJURIES)
harmsum <- cbind(harmsum,HARMINDEX)
harmsum$EVTYPE <- as.factor(harmsum$EVTYPE)
I performed the same kind of EVTYPE processing for the data subset I used to compute the damages to economy. I decided to perform the data processing separately for the two required analysis to preserve the most of data in both cases. Using two separate data subset permits to avoid that a missing value in a non relevant column caused the loss of rows with complete data in relevant columns. The following R code reshapes the EVTYPE column into the data subset aimed to compute the whole damages to economy, keeping the following relevant columns: EVTYPE, PROPDMG (damages to properties), CROPDMG (damages to crops), PROPDMGEXP (PROPDMG exponent), CROPDMGEXP (CROPDMG exponent) .
data2 <- subset(data, (PROPDMG + CROPDMG >0), select=c("EVTYPE","PROPDMG","CROPDMG","PROPDMGEXP","CROPDMGEXP"))
econ <- data2[complete.cases(data2),]
econ[,1] <- toupper(econ[,1])
econ[,1] <- trim(econ[,1])
econtidy <- data.frame()
chunk <- data.frame()
for( i in (1:48) ) {
event <- evtidy[i]
chunk <- subset(econ, substr(EVTYPE, 1, nchar(event))==event)
if (nrow(chunk) > 0) {
chunk[,1] <- event
econtidy <- rbind(econtidy, chunk)
}
}
In this case, the reshaping of the EVTYPE column consents to keep about more than 85% of total number of property and crops damages and INJURIES, granting a pretty good reliability to the analysis (econ is the dataset before reshaping, econtidy is the one after reshaping).
sum(econ$PROPDMG)
## [1] 10884500
sum(econ$CROPDMG)
## [1] 1377827
sum(econtidy$PROPDMG)
## [1] 9306010
sum(econtidy$CROPDMG)
## [1] 1225886
The problem with the second analysis comes with PROPDMGEXP and CROPDMGEXP columns. These two columns should contain the exponentiations of 10 by which the PROPDMG and CROPDMG values should be respectively multiplied. It turns out that various ways of encoding have been used to impute the exponents, as you can see from the below resulting levels.
unique(econtidy$PROPDMGEXP)
## [1] K M 0 5 4 7 B + m H - 6 h 2 3
## Levels: + - 0 1 2 3 4 5 6 7 8 ? B H K M h m
unique(econtidy$CROPDMGEXP)
## [1] K M B ? k 0
## Levels: 0 2 ? B K M k m
So, it’s necessary to reshape these columns in order to get tidy data for exponents. The NOAA codebook provides an explanation just about the letters in the two columns: ‘K’ stands for 10^3, ‘M’ stands for 10^6, ‘B’ stands for 10^9. For the purpose of this analysis, and to avoid the loss of too much data, I had to make two assumptions: first, ‘H’ stands for ‘hundreds’ and then 10^2; second, the numbers in the exponent columns stand for the number of zeros to be added to the damages values. Through the following R code I performed the reshaping of exponents columns according to the above assumptions, by replacing letters with the corresponding number of zeros. As I was expecting, when I converted the two exp-columns to numeric I got two warnings about some missing (NA) values for the values that had no numeric format. I sorted the problem out by keeping just complete rows, and so dropping the rows with missing (NA) values. Then I’m pretty sure to evaluate damages through reliable and complete data, according to the NOAA requirements.
econtidy$PROPDMGEXP <- as.character(econtidy$PROPDMGEXP)
econtidy$PROPDMGEXP <- toupper(econtidy$PROPDMGEXP)
econtidy$PROPDMGEXP <- trim(econtidy$PROPDMGEXP)
econtidy$CROPDMGEXP <- as.character(econtidy$CROPDMGEXP)
econtidy$CROPDMGEXP <- toupper(econtidy$CROPDMGEXP)
econtidy$CROPDMGEXP <- trim(econtidy$CROPDMGEXP)
econtidy[,4] <- sub("B","9", econtidy[,4])
econtidy[,4] <- sub("H","2", econtidy[,4])
econtidy[,4] <- sub("K","3", econtidy[,4])
econtidy[,4] <- sub("M","6", econtidy[,4])
econtidy[,5] <- sub("B","9", econtidy[,5])
econtidy[,5] <- sub("K","3", econtidy[,5])
econtidy[,5] <- sub("M","6", econtidy[,5])
econtidy$PROPDMGEXP <- as.numeric(econtidy$PROPDMGEXP)
## Warning: NAs introduced by coercion
econtidy$CROPDMGEXP <- as.numeric(econtidy$CROPDMGEXP)
## Warning: NAs introduced by coercion
unique(econtidy$PROPDMGEXP)
unique(econtidy$CROPDMGEXP)
econcompl <- econtidy[complete.cases(econtidy),]
Then I concluded the data processing by adding a new column, named TOTDAMAGES, that computes the sum of property and crops damages, each multiplied for the respective exponents, and by grouping data by EVTYPE (event type).
TOTDAMAGES <- ((econcompl$PROPDMG * (10 ^ econcompl$PROPDMGEXP)) +
(econcompl$CROPDMG * (10 ^ econcompl$CROPDMGEXP)))
econcompl <- cbind(econcompl, TOTDAMAGES)
econsum <- ddply(econcompl, .(EVTYPE), numcolwise(sum), drop=FALSE)
econsum$EVTYPE <- as.factor(econsum$EVTYPE)
The storm events that cause more damages to human health are Tornadoes, as reported by the R code below.
harmmax <- which.max(harmsum[,4])
harmsum[harmmax,c(1:4)]
## EVTYPE FATALITIES INJURIES HARMINDEX
## 30 TORNADO 5658 91364 147944
harmmean <- mean(harmsum$HARMINDEX)
g <-ggplot(harmsum, aes(x=EVTYPE, y=HARMINDEX))
g <- g + geom_bar(color="black", fill="light blue", stat="identity")
g <- g + geom_hline(aes(yintercept=harmmean), data=harmsum, colour="red", linetype="dashed")
g <- g + theme(axis.text.x = element_text(angle = -270, hjust = 1, size=10))
g <- g + theme(axis.text.y = element_text(size=10))
g <- g + theme(axis.title=element_text(size=12,face="bold"))
g <- g + theme(title=element_text(size=14,face="bold"))
g <- g + labs(y = "Index of Damages to Health: Fatalities*10 + Injuries")
g <- g + labs(x = "Storm Event type")
g <- g + labs(title = "Health Damages Index by Event type 1950->2011 (red dashed line = Mean)")
print(g)
GRAPH CAPTION: As you can see from the graph shown above, TORNADOES are by far the worst cause of health damages, and no other event is even close to them in causing FATALITIES and INJURIES to human beings. The number of events shown is less than the 48 stated by the NOAA Institute, since that the final processed dataset kept only the events with positive damage values.
fatalmean <- mean(harmsum$FATALITIES)
g <-ggplot(harmsum, aes(x=EVTYPE, y=FATALITIES))
g <- g + geom_bar(color="black", fill="dark grey", stat="identity")
g <- g + geom_hline(aes(yintercept=fatalmean), data=harmsum, colour="red", linetype="dashed")
g <- g + theme(axis.text.x = element_text(angle = -270, hjust = 1, size=10))
g <- g + theme(axis.text.y = element_text(size=10))
g <- g + theme(axis.title=element_text(size=12,face="bold"))
g <- g + theme(title=element_text(size=14,face="bold"))
g <- g + labs(y = "Total number of Fatalities")
g <- g + labs(x = "Storm Event type")
g <- g + labs(title = "Total number of Fatalities by Event type 1950->2011 (red dashed line = Mean)")
print(g)
GRAPH CAPTION: Just to get some more details, above is a graph that shows the FATALITIES distribution by storm event, since death is the heaviest damage out of all possible damages. TORNADOES confirm to be by far the worst threat even with respect to this kind of consequence, while EXCESSIVE HEAT shows to weigh more as second cause of damages when it comes to counting deaths. The number of events shown is less than the 48 stated by the NOAA Institute, since that the final processed dataset kept only the events with positive damage values.
When it comes to damages to the economy, the worst threat is represented by FLOODS, according to the below report.
econmax <- which.max(econsum[,6])
econsum[econmax,c(1,6)]
## EVTYPE TOTDAMAGES
## 14 FLOOD 138235749000
econmean <- mean(econsum$TOTDAMAGES)
g <-ggplot(econsum, aes(x=EVTYPE, y=TOTDAMAGES))
g <- g + geom_bar(color="black", fill="light green", stat="identity")
g <- g + geom_hline(aes(yintercept=econmean), data=econsum, colour="red", linetype="dashed")
g <- g + theme(axis.text.x = element_text(angle = -270, hjust = 1, size=10))
g <- g + theme(axis.text.y = element_text(size=10))
g <- g + theme(axis.title=element_text(size=12,face="bold"))
g <- g + theme(title=element_text(size=14,face="bold"))
g <- g + labs(y = "Total economic Damages in USD")
g <- g + labs(x = "Storm Event type")
g <- g + labs(title = "Total economic Damages by Storm type 1950->2011 (red line = Mean)")
print(g)
GRAPH CAPTION: As you can see from the graph shown above, FLOODS are by far the worst cause of economic damages, and no other event is even close to them in causing damages to properties and crops, neither TORNADOES, that however are the second threat with respect to total economic damages. The number of events shown is less than the 48 stated by the NOAA Institute, since that the final processed dataset kept only the events with positive damage values.