The data for this study come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size.
Our first priority at this stage of data processing is to read and load the basic data.
library(knitr)
opts_chunk$set(echo=TRUE, results='hide',fig.align='center')
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.3
fileUrl<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile='./stormData.csv.bz2',method='curl')
stormData <- read.csv("./stormData.csv.bz2", header = TRUE)
Let us verify whether the magnitude and thoroughness of data collection justify using the whole time series in the present study.
dataFrequency<-as.numeric(format(as.Date(stormData[,"BGN_DATE"],format = "%m/%d/%Y %H:%M:%S"),"%Y"))
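As a minimal sketch of the year extraction above, assuming that BGN_DATE strings have the form month/day/year followed by a time (the three sample dates below are made up for illustration):

```r
# Hypothetical sample of BGN_DATE strings (month/day/year hour:minute:second)
sample_dates <- c("4/18/1950 0:00:00", "11/15/1989 0:00:00", "1/4/2011 0:00:00")

# Parse the text into Date objects, then format each date as a four-digit year
parsed <- as.Date(sample_dates, format = "%m/%d/%Y %H:%M:%S")
years  <- as.numeric(format(parsed, "%Y"))
print(years)

# Tabulating the years gives the number of observations per year
print(table(years))
```

Applied to the full BGN_DATE column, the same tabulation yields the per-year frequencies used in the rest of this section.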
The following table, which samples roughly one year per decade, shows that the number of recorded observations per year has increased dramatically since 1950.
library(xtable)
a<-as.matrix((table(dataFrequency)[c(1,12,22,32,42,52,62)]))
print(xtable(a),type="html")
| Year | Observations |
|---|---|
| 1950 | 223 |
| 1961 | 2246 |
| 1971 | 3471 |
| 1981 | 4517 |
| 1991 | 12522 |
| 2001 | 34962 |
| 2011 | 62174 |
Since 2011 was, in the available data set, the year with the highest number of observations, the following arbitrary threshold was established: the data of interest to this study are those from years in which the number of observations exceeded 1/6 of the 2011 count.
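# Number of observations recorded in each year
yearCounts<-table(dataFrequency)
# Keep the years whose count exceeds 1/6 of the count for the last year (2011)
b<-(yearCounts[(yearCounts/yearCounts[length(yearCounts)])>1/6])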
The code above thus identifies the first year of our adopted time series as 1989.
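The threshold rule can be sketched on the seven yearly counts shown in the frequency table. Because this sketch uses only those sampled years rather than the full table, it returns 1991 as the first retained year, not the 1989 obtained from the complete data.

```r
# Observation counts for the sampled years, copied from the frequency table
counts <- c("1950" = 223, "1961" = 2246, "1971" = 3471, "1981" = 4517,
            "1991" = 12522, "2001" = 34962, "2011" = 62174)

# Reference level: the count for the last year in the series (2011)
reference <- counts[length(counts)]

# Keep the years whose count exceeds one sixth of the reference
kept <- counts[counts / reference > 1/6]

# The first retained year marks the start of the time series
first_year <- as.numeric(names(kept)[1])
print(first_year)
```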
On this basis, let us create a data table suited to the analysis of the health-related and economic impacts of weather events across the United States between 1989 and 2011.
rows<-filter(stormData,as.numeric(format(as.Date(stormData[,"BGN_DATE"],format = "%m/%d/%Y %H:%M:%S"),"%Y"))==c(1989:2011))
## Warning in as.numeric(format(as.Date(structure(list(STATE__ = c(1, 1, 1, :
## longer object length is not a multiple of shorter object length
myStormData<-select(rows,c(EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP))
myStormData<-tbl_df(myStormData)
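The warning printed by the `filter` call above ("longer object length is not a multiple of shorter object length") is characteristic of `==` recycling the 23-element vector `1989:2011` along the date column, so that a row is kept only when its year happens to coincide with the recycled value at that position. A membership test with `%in%` keeps every row from the listed years instead. The toy vectors below are made up to illustrate the difference. Swapping `%in%` into the pipeline would change which rows reach the later steps, so the figures reported further on would need to be regenerated.

```r
# Toy vector of event years (made up for illustration)
years  <- c(1990, 1989, 1989, 2011, 1950, 1990)
wanted <- 1989:1990

# `==` compares position by position, recycling `wanted` as 1989, 1990, 1989, ...
by_equality <- years == wanted

# `%in%` asks, for each year, whether it appears anywhere in `wanted`
by_membership <- years %in% wanted

print(sum(by_equality))    # number of rows kept by ==
print(sum(by_membership))  # number of rows kept by %in%
```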
We must now reshape our data table so that it permits the calculation of damages to crops and properties. To achieve this, we must operate on two numeric variables, PROPDMG and CROPDMG, which report damages in units of varying magnitude. To be expressed in dollar amounts, their values must be multiplied by the magnitudes encoded in PROPDMGEXP and CROPDMGEXP, respectively. However, the values of the latter two variables are not numeric but symbolic, and must therefore be translated into numbers first.
unique_prop2_measure<-unique(myStormData$PROPDMGEXP)
unique_crop2_measure<-unique(myStormData$CROPDMGEXP)
lunghezza_prop<-length(unique_prop2_measure)
lunghezza_crop<-length(unique_crop2_measure)
PROPDMGEXP includes 8 values, namely:
(blank), M, K, 0, m, 5, 3, B. Note that the first value in this list is blank. In dollar amounts, these values correspond, respectively, to the following vector of values:
[0, 1000000, 1000, 0, 1000000, 100000, 1000, 1000000000].
The value zero was attributed to PROPDMGEXP’s indeterminate value, namely, the blank one in the list.
In turn, CROPDMGEXP includes 5 values, namely: (blank), K, M, k, 0. Here too we find a blank value. In dollar amounts, these values correspond, respectively, to the following vector of values: [0, 1000, 1000000, 1000, 0].
In this case as well, the value zero was attributed to the indeterminate value in the list.
The following code manipulates the above data in order to add two columns to our data table. These two columns will be instrumental in quantifying the weather-inflicted damages to properties and crops, as we will multiply their respective values by the pertinent values of PROPDMG and CROPDMG.
x<-c(0,1000000,1000,0, 1000000,100000,1000,1000000000)
vettore1<-as.vector(x)
y<-c(0,1000,1000000,1000,0)
vettore2<-as.vector(y)
lungo<-dim(myStormData)[1]
vettore<-rep(0,lungo)
myStormData2<-mutate(myStormData, multiplier_prop=vettore, multiplier_crop=vettore)
for (i in 1:lunghezza_prop) {myStormData2[myStormData2$PROPDMGEXP==unique_prop2_measure[i],"multiplier_prop"]<-vettore1[i]}
for (i in 1:lunghezza_crop) {myStormData2[myStormData2$CROPDMGEXP==unique_crop2_measure[i],"multiplier_crop"]<-vettore2[i]}
myStormData3<-mutate(myStormData2, Property_Damage=PROPDMG * multiplier_prop, Crop_Damage= CROPDMG * multiplier_crop)
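The loops above pair each symbol with a multiplier by position, which depends on the order in which `unique` happens to return the symbols. As an alternative sketch (not the method used in this report), the same correspondence can be written as a named lookup, so that each symbol carries its own multiplier. The names `multipliers`, `to_multiplier`, and `toy` are hypothetical, and the sample rows are made up.

```r
# Hypothetical lookup from exponent symbol to dollar multiplier, using the
# same values as the vectors above; the blank symbol is handled separately
multipliers <- c(K = 1e3, k = 1e3, M = 1e6, m = 1e6, B = 1e9,
                 "0" = 0, "3" = 1e3, "5" = 1e5)

to_multiplier <- function(symbols) {
  # Index the lookup by name; blanks and unlisted symbols come back as NA
  out <- unname(multipliers[as.character(symbols)])
  # Treat blanks and unlisted symbols as indeterminate, i.e. zero
  out[is.na(out)] <- 0
  out
}

# Made-up sample rows
toy <- data.frame(PROPDMG = c(25, 2.5, 10, 4),
                  PROPDMGEXP = c("K", "M", "", "B"))
toy$Property_Damage <- toy$PROPDMG * to_multiplier(toy$PROPDMGEXP)
print(toy)
```

With this form, a symbol that appears in a different order, or a symbol not seen before, cannot silently pick up another symbol's multiplier.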
Let us now make sure that missing data won’t be an obstacle to our forthcoming analyses.
a<-is.na(myStormData3)
b<-sum(a)
There are 0 NA values in our final data table, so missing values will be of no concern.
Let us create our four relevant variables, namely, “Fatalities”, “Injuries”, “propDmgs” (i.e., Damages to Properties), and “cropDmgs” (i.e., Damages to Crops), each one grouped by type of weather event, or “EVTYPE”.
myStormData_grouped<-group_by(myStormData3,EVTYPE)
finalData<-summarize(myStormData_grouped,Fatalities=sum(FATALITIES), Injuries=sum(INJURIES),propDmgs=sum(Property_Damage), cropDmgs=sum(Crop_Damage))
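For readers who want to see what the grouping and summing amount to, here is a minimal base R sketch on made-up rows. It uses `aggregate` in place of the dplyr calls above, and the data frame `toy` is hypothetical.

```r
# Made-up rows with the same column names as the table above
toy <- data.frame(
  EVTYPE     = c("FLOOD", "HEAT", "FLOOD", "TORNADO", "HEAT"),
  FATALITIES = c(1, 4, 0, 2, 3),
  INJURIES   = c(5, 0, 2, 7, 1)
)

# Sum each numeric column within each event type (a base R counterpart of
# group_by followed by summarize)
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)
print(totals)

# Event type with the most fatalities
deadliest <- as.character(totals$EVTYPE[which.max(totals$FATALITIES)])
print(deadliest)
```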
At this stage of data processing, our priority is to extract the data required to create a data table suited to the analysis of the impacts of weather events on California in 2010.
The methodology will be the same as the one adopted thus far for the manipulation of aggregate data.
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.2.3
CaliforniaData<-select(stormData,c(STATE,BGN_DATE, EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP))
CaliforniaData<-CaliforniaData[CaliforniaData$STATE=="CA",]
CaliforniaData<-tbl_df(CaliforniaData)
a<-CaliforniaData$BGN_DATE
b<-as.character(a)
date_column<-mdy_hms(b)
date_column_by_year<-year(date_column)
true_date_column_2010<-date_column_by_year==2010
CaliforniaData_2010<-CaliforniaData[true_date_column_2010,]
CaliforniaData_2010<-select(CaliforniaData_2010,c(EVTYPE,FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP))
unique_prop2_measure_2010<-unique(CaliforniaData_2010$PROPDMGEXP)
unique_crop2_measure_2010<-unique(CaliforniaData_2010$CROPDMGEXP)
lunghezza_prop_2010<-length(unique_prop2_measure_2010)
lunghezza_crop_2010<-length(unique_crop2_measure_2010)
j<-c(1000,1000000)
vettoreJ<-as.vector(j)
lungo_2010<-dim(CaliforniaData_2010)[1]
vettore<-rep(0,lungo_2010)
CaliforniaData2_2010<-mutate(CaliforniaData_2010, multiplier_prop=vettore, multiplier_crop=vettore)
for (i in 1:lunghezza_prop_2010) {CaliforniaData2_2010[CaliforniaData2_2010$PROPDMGEXP==unique_prop2_measure_2010[i],"multiplier_prop"]<-vettoreJ[i]}
for (i in 1:lunghezza_crop_2010) {CaliforniaData2_2010[CaliforniaData2_2010$CROPDMGEXP==unique_crop2_measure_2010[i],"multiplier_crop"]<-vettoreJ[i]}
grouped_CaliforniaData2_2010<-group_by(CaliforniaData2_2010,EVTYPE)
CaliforniaData3_2010<-mutate(grouped_CaliforniaData2_2010, propDmgs=PROPDMG * multiplier_prop, cropDmgs= CROPDMG * multiplier_crop)
finalCalifornia_2010<-summarize(CaliforniaData3_2010,Fatalities=sum(FATALITIES), Injuries=sum(INJURIES),propDmgs=sum(propDmgs), cropDmgs=sum(cropDmgs))
As usual, let us check for missing values.
a<-is.na(finalCalifornia_2010)
b<-sum(a)
There are 0 NA values in our table for 2010, so missing values will be of no concern in our California Intermission.
max_disaster_fatalities<-finalData[which.max(finalData$Fatalities),]
max_disaster_injuries<-finalData[which.max(finalData$Injuries),]
From running the above code, we learn that across the United States, the weather events classified as HEAT were responsible for the highest number of fatalities from 1989 to 2011, for a total of 609 victims. The events classified as FLOOD were responsible for the highest number of injuries, for a total of 1319 victims.
The following code will enable us to draw two helpful figures.
library(ggplot2)
finalDataFat<-arrange(finalData,desc(Fatalities))
finalDataInj<-arrange(finalData,desc(Injuries))
finalDataPropDmgs<-arrange(finalData,desc(propDmgs))
finalDataCropDmgs<-arrange(finalData,desc(cropDmgs))
plot1<-ggplot(finalDataFat[1:10,c("EVTYPE","Fatalities")], aes(x = EVTYPE, y = Fatalities)) + geom_bar(stat = "identity",col="red", aes(fill=Fatalities)) + ylab("Fatalities") + xlab("Weather Event")+ ggtitle("Ten Highest Fatality Causes: \n 1989-2011") + scale_fill_continuous("Fatalities", low = "red", high = "yellow") + theme(axis.text.x=element_text(angle=55,hjust=1.2))
plot2<-ggplot(finalDataInj[1:10,c("EVTYPE","Injuries")], aes(x = EVTYPE, y = Injuries)) + geom_bar(stat = "identity",col="red", aes(fill=Injuries)) + ylab("Injuries") + xlab("Weather Event")+ ggtitle("Ten Highest Injury Causes: \n 1989-2011") + scale_fill_continuous("Injuries", low = "green", high = "blue") + theme(axis.text.x=element_text(angle=55,hjust=1.2))
plot3<-ggplot(finalDataPropDmgs[1:10,c("EVTYPE","propDmgs")], aes(x = EVTYPE, y = propDmgs)) + geom_bar(stat = "identity",col="red", aes(fill=propDmgs)) + ylab("Property Damage") + xlab("Weather Event")+ ggtitle("Largest Property-Damage Causes: \n 1989-2011") + scale_fill_continuous("propDmgs", low = "red", high = "yellow") + theme(axis.text.x=element_text(angle=55,hjust=1.2))
plot4<-ggplot(finalDataCropDmgs[1:10,c("EVTYPE","cropDmgs")], aes(x = EVTYPE, y = cropDmgs)) + geom_bar(stat = "identity",col="red", aes(fill=cropDmgs)) + ylab("Crop Damage") + xlab("Weather Event")+ ggtitle(" Largest Crop-Damage Causes: \n 1989-2011") + scale_fill_continuous("cropDmgs", low = "green", high = "blue")+ theme(axis.text.x=element_text(angle=55,hjust=1.2))
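One point worth keeping in mind when reading these plots: ggplot2 places the categories of a discrete x axis in factor-level order (alphabetical for character columns), so sorting the rows with `arrange` does not by itself rank the bars. A minimal sketch of ranking by value with base R's `reorder`, on made-up totals (the column `EVTYPE_ranked` is a hypothetical name):

```r
# Made-up totals for three event types
toy <- data.frame(EVTYPE = c("FLOOD", "HEAT", "TORNADO"),
                  Fatalities = c(20, 90, 50))

# reorder() returns a factor whose levels are sorted by the second argument;
# negating the totals puts the largest first
toy$EVTYPE_ranked <- reorder(toy$EVTYPE, -toy$Fatalities)
print(levels(toy$EVTYPE_ranked))

# With ggplot2 loaded, the ranked factor can be mapped to the x axis, e.g.
# ggplot(toy, aes(x = EVTYPE_ranked, y = Fatalities)) + geom_bar(stat = "identity")
```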
In the light of the following figure, containing two panels, it is easy to confirm the above remark that the events classified as HEAT were by far the leading cause of fatalities in the time period under examination, followed by the events classified as EXCESSIVE HEAT and as TORNADO. As to injuries, the events classified as FLOOD hold the first place by a small margin, followed, at a virtually equal order of magnitude, by those classified as TORNADO.
library(gridExtra)
ratio<-finalDataPropDmgs[1,"propDmgs"]/finalDataCropDmgs[1,"cropDmgs"]
grid.arrange(plot1, plot2, ncol = 2)
In the light of the following figure, containing two panels, it is easy to see that the events classified as FLOOD were by far the worst cause of damages to properties, followed, at a far smaller order of magnitude, by the events classified as TORNADO and as FLASH FLOOD.
As to damages to crops, the events classified as DROUGHT were the leading cause, followed by the events classified as FLOOD and as EXCESSIVE WETNESS. But the difference in the order of magnitude between the largest damages to properties and to crops is impressive: the largest damages to properties total a figure which is about 152 times (152.043416) the largest damages to crops. Here are the graphs that tell this story:
grid.arrange(plot3, plot4, ncol = 2)
As explained in the synopsis, we devoted special attention to 2010 in the light of its peculiarities. Given the El-Niño weather conditions forecast for the coming fall and winter seasons, the analysis of the 2010 El-Niño’s impact on health-related and economic variables in a state such as California, which is vulnerable to floods and is moreover going through a long-lasting drought, is of clear interest. The following code completes our data manipulation in order to draw graphs of, respectively, the ten weather events that caused the highest numbers of injuries and the largest damages to properties in California in 2010. It also supplies the data to illustrate some of the El-Niño’s impact on fatalities. Damages to crops cannot be discussed in reliable detail, since it turns out that the pertinent California data for 2010 were poorly collected. (As explained in the synopsis, supplementary code is provided in case the reader wants to investigate further aspects of the matter under discussion.)
finalCalifornia_2010Fat<-arrange(finalCalifornia_2010,desc(Fatalities))
finalCalifornia_2010Inj<-arrange(finalCalifornia_2010,desc(Injuries))
finalCalifornia_2010Prop<-arrange(finalCalifornia_2010,desc(propDmgs))
finalCalifornia_2010Crop<-arrange(finalCalifornia_2010,desc(cropDmgs))
ten_top_fat<-finalCalifornia_2010Fat[1:10,c(1,2)]
ten_top_inj<-finalCalifornia_2010Inj[1:10,c(1,3)]
ten_top_propDmgs<-finalCalifornia_2010Prop[1:10,c(1,4)]
ten_top_cropDmgs<-finalCalifornia_2010Crop[1:10,c(1,5)]
library(ggplot2)
plot5<-ggplot(ten_top_fat, aes(x = EVTYPE, y = Fatalities)) + geom_bar(stat = "identity",col="red", aes(fill=Fatalities)) + ylab("Fatalities")+ xlab("Weather Event")+ ggtitle("Top Fatalities in California: 2010") + scale_fill_continuous("Fatalities", low = "red", high = "yellow") + theme(axis.text.x=element_text(angle=55,hjust=1.2))
plot6<-ggplot(ten_top_inj, aes(x = EVTYPE, y = Injuries)) + geom_bar(stat = "identity",col="red", aes(fill=Injuries)) + ylab("Injuries") + xlab("Weather Event")+ ggtitle("Top Injuries in California: 2010") + scale_fill_continuous("Injuries", low = "green", high = "blue") + theme(axis.text.x=element_text(angle=55,hjust=1.2))
plot7<-ggplot(ten_top_propDmgs, aes(x = EVTYPE, y = propDmgs)) + geom_bar(stat = "identity",col="red", aes(fill=propDmgs)) + ylab("Property Damage") + xlab("Weather Event")+ ggtitle("Top Property Damages in California: 2010") + scale_fill_continuous("propDmgs", low = "red", high = "yellow") + theme(axis.text.x=element_text(angle=55,hjust=1.2))
plot8<-ggplot(ten_top_cropDmgs, aes(x = EVTYPE, y = cropDmgs)) + geom_bar(stat = "identity",col="red", aes(fill=cropDmgs)) + ylab("Crop Damage") + xlab("Weather Event")+ ggtitle("Top Crop Damages in California: 2010") + scale_fill_continuous("cropDmgs", low = "green", high = "blue") + theme(axis.text.x=element_text(angle=55,hjust=1.2))
massimoPropDmgs<-ten_top_propDmgs$propDmgs[1]
secondPropDmgs<-ten_top_propDmgs$propDmgs[2]
thirdPropDmgs<-ten_top_propDmgs$propDmgs[3]
fourthPropDmgs<-ten_top_propDmgs$propDmgs[4]
massimocropDmgs<-ten_top_cropDmgs$cropDmgs[1]
secondCropDmgs<-ten_top_cropDmgs$cropDmgs[2]
thirdCropDmgs<-ten_top_cropDmgs$cropDmgs[3]
massimoFat<-ten_top_fat$Fatalities[1]
secondFat<-ten_top_fat$Fatalities[2]
thirdFat<-ten_top_fat$Fatalities[3]
fourthFat<-ten_top_fat$Fatalities[4]
massimoInj<-ten_top_inj$Injuries[1]
secondInj<-ten_top_inj$Injuries[2]
thirdInj<-ten_top_inj$Injuries[3]
As seen in the following figure, containing two plots, the highest number of injuries in California in 2010 was caused by the events classified as HIGH SURF, followed by the events classified as WILDFIRE and as STRONG WIND.
In turn, the worst damages to properties were caused by the events classified as FLOOD, followed by those classified as FLASH FLOOD and as HIGH SURF.
The differences with respect to our previous results across the US over the 1989-2011 time series are not insignificant, and should not be neglected by policy makers.
s<-CaliforniaData_2010$CROPDMG>0
availableCropDmgs<-sum(s)
grid.arrange(plot6, plot7, ncol = 2)
Let us take a brief look at the differences between the overall data across the US and the 2010 data for California.
Across the US from 1989 to 2011, floods were the leading weather-related cause of injuries and property damage, while their opposites, heat and drought, were the leading causes of fatalities and crop damage, respectively.
In 2010, California saw high surf as the major source of fatalities and injuries, while floods and flash floods were the major sources of economic damage.
This seems to support our initial hypothesis that the 2010 El-Niño had a marked impact on health-related and economic variables in a state such as California. Furthermore, it suggests that the current, long-lasting drought in California could be reversed by the new, imminent El-Niño, provided that proper water-conservation measures are rapidly put into operation.
With more time and space available, one would naturally want to make detailed comparisons between the 2010 data and the whole 1989-2011 time series for California, as well as analyze the El-Niño’s effects on aggregate data for groups of states particularly exposed to El-Niño: for instance, the whole of the US West Coast.
A final caveat: the 2010 California data for damages to crops appear scarcely reliable, as the entire data set for California in 2010 contains only 1 datum. According to this datum, as we said, flash floods caused $3,000,000 (3 × 10^{6}) in damages to crops. Yet the fact that no crop-damage data are available for any other weather event in California in 2010 suggests that we are dealing here with an unreliable source.