This report identifies the meteorological phenomena that were most destructive to public health and to the national economy in the period 1993-2011. The author's initial hypothesis was that hurricanes pose the greatest danger in both health and economic terms. The exploratory analysis changed that belief: tornadoes kill and injure a greater proportion of people than hurricanes, and flood-related weather events cause greater economic damage in the United States than hurricanes.
Read the Storm Data into the bd1 data set
bd1 <- read.table("repdata%2Fdata%2FStormData.csv.bz2",
                  header = TRUE, sep = ",",
                  na.strings = "") #Empty fields are read as NA
Keep only the variables needed to answer the two questions: FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP, together with BGN_DATE and EVTYPE
bd1 <- bd1[,c("BGN_DATE","EVTYPE","FATALITIES","INJURIES",
"PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
After reading the raw data, check the dimensions and the first few rows of the data set
dim(bd1)
## [1] 902297 8
head(bd1)
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 1 4/18/1950 0:00:00 TORNADO 0 15 25.0 K
## 2 4/18/1950 0:00:00 TORNADO 0 0 2.5 K
## 3 2/20/1951 0:00:00 TORNADO 0 2 25.0 K
## 4 6/8/1951 0:00:00 TORNADO 0 2 2.5 K
## 5 11/15/1951 0:00:00 TORNADO 0 2 2.5 K
## 6 11/15/1951 0:00:00 TORNADO 0 6 2.5 K
## CROPDMG CROPDMGEXP
## 1 0 <NA>
## 2 0 <NA>
## 3 0 <NA>
## 4 0 <NA>
## 5 0 <NA>
## 6 0 <NA>
Explore the meteorological phenomena recorded from 1950 to 2011
#Transform the date column into an appropriate format
library(lubridate) #Package to manipulate dates appropriately
bd1$BGN_DATE <- mdy_hms(as.character(bd1$BGN_DATE))
bd1$BGN_DATE <- as.POSIXlt(bd1$BGN_DATE)
#Create a new variable called year
bd1$year <- year(bd1$BGN_DATE)
#Get the number of distinct meteorological phenomena recorded in each year
y1 <- tapply(bd1$EVTYPE,bd1$year, function(x){length(unique(x))})
#Plot the number of distinct meteorological phenomena recorded in each year
library(ggplot2) #My favorite graphics system
bdg1 <- data.frame(Year=as.numeric(names(y1)), NumberEvents=y1)
ggplot(bdg1, aes(x=Year, y = NumberEvents)) + geom_point() + geom_line() +
ylab("Count") +
xlab("Year") +
ggtitle("Unique number of storm events from 1950 to 2011")+
theme(plot.title=element_text(size=15))
The plot shows that from 1993 to 2011 a larger number of distinct storm event types was recorded than in earlier years. This is a product, among other things, of improved techniques for recording meteorological phenomena. A new data set (bd2) is therefore created with only the years from 1993 to 2011
bd2 <- bd1[bd1$year >= 1993,]
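As a quick check of the two steps above (counting distinct event types per year, then keeping only 1993 onwards), here is a minimal sketch on a toy data frame. The rows in `toy` are hypothetical and are not taken from the Storm Data file; only the `tapply` and subsetting idioms mirror the report.

```r
# Toy data: hypothetical rows, not taken from the Storm Data file
toy <- data.frame(
  year   = c(1950, 1950, 1993, 1993, 1993),
  EVTYPE = c("TORNADO", "TORNADO", "TORNADO", "FLOOD", "HAIL"),
  stringsAsFactors = FALSE
)

# Number of distinct event types recorded in each year
types_per_year <- tapply(toy$EVTYPE, toy$year, function(x) length(unique(x)))

# Keep only the rows from 1993 onwards, as done for bd2
toy_recent <- toy[toy$year >= 1993, ]
```

In this toy example 1950 has a single distinct event type and 1993 has three, and the subset keeps the three rows from 1993.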
The first question is:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
The strategy to answer it is to calculate each storm event type's share of the total negative impact on population health in the period 1993-2011.
To do that, one new variable is created: health impact (hi), which is equal to the number of fatalities plus the number of injuries. Then the proportion of total health impact attributable to each storm event type is calculated. Next, a top 10 of the most dangerous storm events with respect to population health is built. Finally, the top 10 is shown in a plot.
#health impact (hi) variable
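The share calculation described above can be illustrated on a small example. This is only a sketch: the values in `toy` are made up for illustration, and `share` and `share_sorted` are names chosen here, not objects from the report.

```r
# Toy data: made-up counts, used only to illustrate the calculation
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(1, 0, 2, 1),
  INJURIES   = c(9, 5, 0, 2),
  stringsAsFactors = FALSE
)

# Health impact per record: fatalities plus injuries
toy$hi <- toy$FATALITIES + toy$INJURIES

# Share of the total health impact attributable to each event type
share <- tapply(toy$hi, toy$EVTYPE, sum) / sum(toy$hi)

# Rank event types from largest to smallest share
share_sorted <- share[order(share, decreasing = TRUE)]
```

Here the total health impact is 20, of which TORNADO accounts for 15, so its share is 0.75; the shares sum to 1 by construction.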
bd2$hi <- bd2$FATALITIES + bd2$INJURIES
#Proportion of hi by storm event
z <- tapply(bd2$hi, bd2$EVTYPE, function(x){round(sum(x)/sum(bd2$hi),10)})
#Top ten
z1 <- z[order(-z)]
top <- head(z1,10)
#Graphic
bd3 <- data.frame(me=names(top),phi=top)
bd4 <- data.frame(me=c("The rest"), phi=1-sum(bd3$phi))
bd5 <- rbind.data.frame(bd3,bd4)
bd5$me <- factor(bd5$me,
levels = c("TORNADO","EXCESSIVE HEAT","FLOOD","LIGHTNING",
"TSTM WIND","HEAT","FLASH FLOOD","ICE STORM",
"THUNDERSTORM WIND", "WINTER STORM",
"The rest")) #Order the factor levels so the bars follow the ranking
library(ggplot2)
ggplot(bd5, aes(x=me,y=phi)) + geom_bar(stat = "identity") +
geom_text(aes(label=paste(format(round(phi,4)*100,nsmall = 2),"%")),
colour="white", size=3, vjust= 1.5) +
theme_bw() +
xlab("Storm Event") + ylab("") +
ggtitle("Proportion of population health impact\nby storm event in time period 1993-2011") +
theme(axis.text.y=element_blank(), axis.ticks=element_blank(),axis.title.y=element_blank(),
plot.title=element_text(size=25),
axis.text.x = element_text(size = 8,angle=30, hjust=1, vjust=1))
The plot shows that tornadoes are the most dangerous storm events with respect to population health: they account for 31.31% of all injuries and fatalities recorded across the United States between 1993 and 2011. They are followed by heat (excessive heat = 10.58% plus heat = 3.81%), flood (flood = 9.12% plus flash flood = 3.46%), lightning (7.59%), thunderstorm wind (TSTM wind = 4.86% plus thunderstorm wind = 2.04%), ice storm (2.59%) and winter storm (1.92%). All remaining events together represent only 22.72%.
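The paragraph above adds related labels together by hand (for example EXCESSIVE HEAT and HEAT). A minimal sketch of doing the same grouping in code follows. The percentages are copied from the paragraph; the grouping patterns and the "(all)" labels are assumptions made for illustration and are not part of the original analysis.

```r
# Shares in percent, copied from the paragraph above
share <- c("TORNADO" = 31.31, "EXCESSIVE HEAT" = 10.58, "FLOOD" = 9.12,
           "LIGHTNING" = 7.59, "TSTM WIND" = 4.86, "HEAT" = 3.81,
           "FLASH FLOOD" = 3.46, "ICE STORM" = 2.59,
           "THUNDERSTORM WIND" = 2.04, "WINTER STORM" = 1.92)

# Hypothetical grouping rule: collapse labels that match a pattern
group <- names(share)
group[grepl("HEAT", group)] <- "HEAT (all)"
group[grepl("FLOOD", group)] <- "FLOOD (all)"
group[grepl("TSTM|THUNDERSTORM", group)] <- "THUNDERSTORM WIND (all)"

# Sum the shares within each group and rank the groups
grouped <- tapply(share, group, sum)
grouped <- grouped[order(grouped, decreasing = TRUE)]
```

With these patterns the heat group adds up to 14.39%, the flood group to 12.58% and the thunderstorm wind group to 6.90%, which matches the ordering used in the text.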
The second question is:
Across the United States, which types of events have the greatest economic consequences?
The strategy to answer it is to calculate each storm event type's share of the total negative monetary consequences in the period 1993-2011.
To do that, one new variable is created: estimated economic impact (eie), which is equal to the quantified damage to property (houses, buildings, streets, etc.) plus the quantified damage to crops.
To create the eie variable it is important to know that the economic impact considers only three of the values taken by CROPDMGEXP and PROPDMGEXP. These three values are K, M and B, which mean thousands, millions and billions respectively. It is also assumed that a lower case k is equivalent to K and a lower case m is equivalent to M.
Creating the eie variable is a challenge. The following steps allow it to be created:
1. Transform CROPDMGEXP and PROPDMGEXP to character class, which makes them easier to manipulate.
2. Create a function that determines the monetary amount using the PROPDMG and PROPDMGEXP variables. Note that PROPDMG alone does not give a realistic estimate, because the magnitude is stored in PROPDMGEXP. This function must return the estimated economic impact for property.
3. Create a function, similar to the previous one, that does the same work for crops.
4. Create the Total Estimated Economic Impact (eie), which is equal to the economic impact for property plus the economic impact for crops.
5. Calculate the proportion of estimated economic impact by storm event.
6. Create a top 10 of the most damaging storm events with respect to economic consequences.
7. Represent the top 10 in a plot.
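The multiplier rule from the steps above can also be sketched in vectorized form with a named lookup vector. This is an illustration on hypothetical rows, not the method used in this report, which uses explicit loops. One difference to keep in mind: `toupper` makes this sketch treat a lower case b as B, whereas the report only recognizes the upper case B. The names `toy`, `mult` and `m` are chosen here for the example.

```r
# Toy data: hypothetical rows, including an NA and an unrecognized code
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 3, 7),
  PROPDMGEXP = c("K", "m", "B", NA, "5"),
  stringsAsFactors = FALSE
)

# Lookup table: thousands, millions, billions
mult <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up the multiplier for each row; unknown codes and NA give NA
m <- unname(mult[toupper(toy$PROPDMGEXP)])

# Unrecognized or missing codes contribute zero, as in the report's rule
m[is.na(m)] <- 0

# Estimated property damage in dollars
toy$pde <- toy$PROPDMG * m
```

On the toy rows this gives 25000, 2500000, 1000000000, 0 and 0. The same pattern applies to CROPDMG and CROPDMGEXP.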
#Work with bd2 data set
#Transform CROPDMGEXP and PROPDMGEXP to character class
bd2$PROPDMGEXP <- as.character(bd2$PROPDMGEXP)
bd2$CROPDMGEXP <- as.character(bd2$CROPDMGEXP)
#Function to estimate economic damage using PROPDMG (column 5) and PROPDMGEXP (column 6)
estiPD <- function(dataset){
  x <- rep(0, nrow(dataset))
  for (i in seq_len(nrow(dataset))) {
    e <- dataset[,6][i] #Exponent code for row i
    if (!is.na(e) && (e == "k" || e == "K")) {
      x[i] <- dataset[,5][i]*1000
    } else if (!is.na(e) && (e == "M" || e == "m")) {
      x[i] <- dataset[,5][i]*1000000
    } else if (!is.na(e) && e == "B") {
      x[i] <- dataset[,5][i]*1000000000
    } else {
      x[i] <- dataset[,5][i]*0 #Missing or unrecognized code counts as zero
    }
  }
  return(x)
}
#Function to estimate economic damage using CROPDMG (column 7) and CROPDMGEXP (column 8)
estiCD <- function(dataset){
  x <- rep(0, nrow(dataset))
  for (i in seq_len(nrow(dataset))) {
    e <- dataset[,8][i] #Exponent code for row i
    if (!is.na(e) && (e == "k" || e == "K")) {
      x[i] <- dataset[,7][i]*1000
    } else if (!is.na(e) && (e == "M" || e == "m")) {
      x[i] <- dataset[,7][i]*1000000
    } else if (!is.na(e) && e == "B") {
      x[i] <- dataset[,7][i]*1000000000
    } else {
      x[i] <- dataset[,7][i]*0 #Missing or unrecognized code counts as zero
    }
  }
  return(x)
}
#New variable (pde) with estimated economic impact for property
bd2$pde <- estiPD(bd2)
#New variable (cde) with estimated economic impact for crops
bd2$cde <- estiCD(bd2)
#New variable with Total Estimated Economic Impact (eie)
bd2$eie <- bd2$pde + bd2$cde
#Calculate proportion of eie by storm event between range 1993-2011
z <- tapply(bd2$eie, bd2$EVTYPE, function(x){round(sum(x)/sum(bd2$eie),10)})
#Top ten
z1 <- z[order(-z)]
top <- head(z1,10)
#Graphic
bd7 <- data.frame(me=names(top),pei=top)
bd8 <- data.frame(me=c("The rest"), pei=1-sum(bd7$pei))
bd9 <- rbind.data.frame(bd7,bd8)
bd9$me <- factor(bd9$me,
levels = c("FLOOD","HURRICANE/TYPHOON","STORM SURGE","TORNADO","HAIL",
"FLASH FLOOD","DROUGHT","HURRICANE","RIVER FLOOD","ICE STORM",
"The rest")) #Order the factor levels so the bars follow the ranking
library(ggplot2)
ggplot(bd9, aes(x=me,y=pei)) + geom_bar(stat = "identity") +
geom_text(aes(label=paste(format(round(pei,4)*100,nsmall = 2),"%")),
colour="white", size=3, vjust= 1.5) +
theme_bw()+
xlab("Storm Event") + ylab("")+
ggtitle("Proportion of Economic Impact Estimated\nby storm event in time period 1993-2011") +
theme(axis.text.y=element_blank(),axis.ticks=element_blank(),axis.title.y=element_blank(),
plot.title=element_text(size = 25),
axis.text.x=element_text(size = 8,angle = 30, hjust = 1, vjust = 1))
The last plot shows that floods are the most damaging storm events in economic terms. Flood-related events account for 49.66% (flood = 33.72%, storm surge = 9.72%, flash flood = 3.94%, river flood = 2.28%) of the total estimated economic impact (damage to property and crops) recorded across the United States between 1993 and 2011. They are followed by hurricanes (hurricane/typhoon = 16.13% plus hurricane = 3.28%), tornado (6%), hail (4.21%), drought (3.37%) and ice storm (2.01%). All remaining events together represent only 15.35%.
Remember that this analysis is exploratory and its results may change. For example, further research should examine the missing values: the NAs in PROPDMGEXP and CROPDMGEXP amount to 34.7% and 47.8%, respectively, of the row count of the full data set (bd1). Another aspect to keep in mind is that the economic impact was estimated without controlling for inflation, so the amounts would need to be deflated.
#Share of NAs in each variable of bd2 (the denominator is the row count of bd1)
sapply(bd2, function(x){sum(is.na(x))/dim(bd1)[1]})
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.3470465
## CROPDMG CROPDMGEXP year hi pde cde
## 0.0000000 0.4775080 0.0000000 0.0000000 0.0000000 0.0000000
## eie
## 0.0000000
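The check above divides the NA counts of bd2 by the number of rows of bd1. If the share of NAs within a single data frame is wanted instead, one possible sketch uses `colMeans(is.na(...))`, which takes the data frame's own row count as the denominator. The rows in `toy` are hypothetical and are used only to show the idiom.

```r
# Toy data: hypothetical rows with some missing exponent codes
toy <- data.frame(
  PROPDMG    = c(25, 0, 0, 3),
  PROPDMGEXP = c("K", NA, NA, "M"),
  CROPDMGEXP = c(NA, NA, NA, "K"),
  stringsAsFactors = FALSE
)

# is.na() gives a logical matrix; the column means are the NA shares
na_share <- colMeans(is.na(toy))
```

On the toy rows the shares are 0 for PROPDMG, 0.5 for PROPDMGEXP and 0.75 for CROPDMGEXP.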