Using the Storm Data provided by the National Weather Service, we’re going to determine which types of storm events are most harmful to human HEALTH and to the ECONOMY. Out of the 37 variables that this dataset provides, we’re only going to focus on 7. For the HEALTH issue, the raw data are event type, number of fatalities and number of injuries. For the ECONOMY issue, they are event type, property damage coefficient and exponent, and crop damage coefficient and exponent. Relevant figures for the top 20 most harmful events will be computed and plotted, and conclusions will be drawn from them.
After reading the dataset, we’ll only need 7 of its columns for the analysis (the begin date is also kept, to plot how event reports are distributed over time). We’ll split those columns into two groups, one for the HEALTH issue and one for the ECONOMY issue.
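#Reading the compressed dataset (read.csv handles .bz2 files directly):
data <- read.csv("repdata-data-StormData.csv.bz2")
#Keeping BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG & CROPDMGEXP:
data <- data[,c(2,8,23,24,25,26,27,28)]
#Normalizing event types to upper case and parsing the begin date:
data$EVTYPE <- toupper(data$EVTYPE)
data$BGN_DATE <- as.Date(data$BGN_DATE,format="%m/%d/%Y")
#Splitting the columns by issue:
RECORDdist <- data[,1]
HEALTHissue <- data[,c(2,3:4)]
ECONOMYissue <- data[,c(2,5:8)]
#Plotting the distribution of event reports over time:
hist(RECORDdist,breaks=20,freq = TRUE,col="coral1",main="Distribution of logged events over the years",xlab=NULL)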
The histogram shows that event reports have become more frequent over the years. Nevertheless, in this analysis I have decided not to impute missing data or to compute weighted averages of values. Instead, values are simply summed per event type, so the results reflect the dataset as it was recorded.
Due to misspellings, several event types (“EVTYPE”) could be clustered into common groups, but naive implementations of hierarchical string clustering based on Levenshtein edit distance haven’t yielded successful results. Before testing k-means, I decided to simply aggregate the sum of fatalities & injuries by event type and order the results by decreasing sum. The resulting dataset shows a clear dominance of one event type (see the Results section).
#Aggregating FATALITIES & INJURIES by EVENT TYPE:
x <- aggregate(cbind(FATALITIES,INJURIES)~EVTYPE,HEALTHissue,sum)
#Eliminating the zero rows:
x <- x[!(x$FATALITIES==0 & x$INJURIES==0),]
#Ordering by decreasing number of FATALITIES + INJURIES:
x <- x[order(rowSums(x[,2:3]),decreasing = T),]
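For reference, the Levenshtein-based grouping mentioned above can be sketched on a toy vector. This is only an illustration of the general technique, not the implementation that was tried in this analysis: the labels in `ev`, the average linkage and the cut height of 3 edits are all assumptions chosen for the example.

```r
# Toy event labels (hypothetical sample, not extracted from the dataset).
ev <- c("TORNADO", "TORNDAO", "TORNADOS", "FLOOD", "FLOODS", "FLODD", "HAIL")

# Pairwise Levenshtein edit distances; adist() ships with base R (utils).
d <- adist(ev)
dimnames(d) <- list(ev, ev)

# Hierarchical clustering on the distance matrix.
# Average linkage and the cut height are arbitrary choices for this sketch.
hc <- hclust(as.dist(d), method = "average")
groups <- cutree(hc, h = 3)

# Labels that share a cluster would be candidates for a common event type.
clusters <- split(ev, groups)
print(clusters)
```

On real data the cut height needs tuning: a threshold loose enough to catch misspellings of long labels can also merge short labels that are genuinely different events.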
The allowed exponents are the digits 0 to 9 and the characters H, K, M, B, which correspond to exponents 2, 3, 6, 9 respectively, so that a damage value equals coefficient × 10^exponent. Clustering methods could have been applied here as well, but I decided to proceed as with the HEALTH issue. Final considerations about the results are given in the Results section.
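The decoding rule can be checked on a few toy records before applying it to the full dataset. The data frame `toy` and its column names are hypothetical; the letter-to-digit mapping is the one described above.

```r
# Hypothetical toy records: a damage coefficient and its exponent code.
toy <- data.frame(DMG = c(25, 2.5, 1, 3),
                  EXP = c("K", "M", "B", "2"),
                  stringsAsFactors = FALSE)

# Translate letter codes into powers of ten; digits pass through unchanged.
toy$EXP10 <- chartr("BhHKmM", "922366", toy$EXP)

# Build strings such as "25e3" and let as.numeric() parse them (25e3 = 25000).
toy$VALUE <- as.numeric(paste0(toy$DMG, "e", toy$EXP10))

# Equivalent arithmetic form: coefficient * 10^exponent.
toy$CHECK <- toy$DMG * 10^as.numeric(toy$EXP10)

print(toy)
```

Both columns should agree; the string form is compact, while the arithmetic form makes the rule explicit.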
#Aggregating damage coefficients by event type & exponent:
y1 <- aggregate(PROPDMG~EVTYPE+PROPDMGEXP,ECONOMYissue,sum)
y2 <- aggregate(CROPDMG~EVTYPE+CROPDMGEXP,ECONOMYissue,sum)
#Eliminating 0-valued damages and filtering exponents:
y1 <- y1[!y1$PROPDMG==0,]
y2 <- y2[!y2$CROPDMG==0,]
y1 <- y1[(y1$PROPDMGEXP%in%c("B","h","H","K","m","M",c(0:9))),]
y2 <- y2[(y2$CROPDMGEXP%in%c("B","h","H","K","m","M",c(0:9))),]
#Preparing the rows to be pasted in a single string each:
y1$PROPDMG <- paste(y1$PROPDMG,"e",sep = "")
y2$CROPDMG <- paste(y2$CROPDMG,"e",sep = "")
y1$PROPDMGEXP <- chartr("BhHKmM","922366",y1$PROPDMGEXP)
y2$CROPDMGEXP <- chartr("BhHKmM","922366",y2$CROPDMGEXP)
#Taking advantage of coercion to generate numeric values:
y1$PROP <- as.numeric(paste(y1$PROPDMG,y1$PROPDMGEXP,sep=""))
y2$CROP <- as.numeric(paste(y2$CROPDMG,y2$CROPDMGEXP,sep=""))
#Re-aggregating results:
y1 <- aggregate(PROP~EVTYPE,y1,sum)
y2 <- aggregate(CROP~EVTYPE,y2,sum)
#Ordering by decreasing value of cost damage:
y1 <- y1[order(y1$PROP,decreasing = T),]
y2 <- y2[order(y2$CROP,decreasing = T),]
#Labels of the 20 most harmful event types:
barNames <- x$EVTYPE[1:20]
#Two side-by-side panels with room for the rotated labels:
par(mfrow=c(1,2),mar=c(7,5,2,0))
with(x,{
  foo <- barplot(FATALITIES[1:20], names.arg = barNames, xaxt="n",xlab="",
                 main = "FATALITIES (upscaled)",cex.main=0.95,col = "aquamarine3",
                 ylab="Total People")
  #Rotated event labels under each bar:
  text(foo, par("usr")[3], labels = barNames, srt = 45, adj = 1, xpd = TRUE,cex=0.85)
  bar <- barplot(INJURIES[1:20] , names.arg = barNames, xaxt="n",xlab="",
                 main = "INJURIES (greater weight)",cex.main=0.95,col = "aquamarine3",
                 ylab="Total People")
  text(bar, par("usr")[3], labels = barNames, srt = 45, adj = 1, xpd = TRUE,cex=0.85)
})
We can see that TORNADO dominates both plots. Even though multiple events are misspelled and could be recombined under a common label, their recombination won’t exceed TORNADO’s impact over the years.
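One way to check a claim of this kind is to merge every label matching a pattern and compare the merged total with TORNADO. The sketch below uses made-up numbers in a hypothetical data frame `toy`; the helper `harm_for` and the "HEAT" pattern are assumptions for illustration and are not part of the analysis above.

```r
# Hypothetical totals per event label (made-up numbers, for illustration only).
toy <- data.frame(
  EVTYPE     = c("TORNADO", "EXCESSIVE HEAT", "HEAT", "HEAT WAVE", "FLOOD"),
  FATALITIES = c(100, 30, 20, 5, 10),
  INJURIES   = c(1000, 100, 40, 10, 90),
  stringsAsFactors = FALSE
)

# Combined harm (fatalities + injuries) per label.
toy$HARM <- toy$FATALITIES + toy$INJURIES

# Hypothetical helper: total harm over every label matching a regular expression.
harm_for <- function(pattern, df) sum(df$HARM[grepl(pattern, df$EVTYPE)])

heat_total    <- harm_for("HEAT", toy)        # all heat-related variants merged
tornado_total <- harm_for("^TORNADO$", toy)   # the single TORNADO label

# TRUE when the merged group still stays below TORNADO.
print(heat_total < tornado_total)
```

With the real aggregate, the same helper could be applied to the data frame of per-event totals, provided it carries a column with the combined harm.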
barNames1 <- y1$EVTYPE[1:20]
barNames2 <- y2$EVTYPE[1:20]
par(mfrow=c(1,2),las=2, mar=c(11,5,2,0))
foo <- barplot(y1$PROP[1:20], names.arg = barNames1, xaxt="n",xlab=""
,main = "PROPERTY DAMAGE ($)",cex.main=0.95,col = "chartreuse3")
text(foo, par("usr")[3], labels = barNames1, srt = 45, adj = 1, xpd = TRUE,cex=0.85)
bar <- barplot(y2$CROP[1:20], names.arg = barNames2, xaxt="n",xlab=""
,main = "CROP DAMAGE ($)",cex.main=0.95,col = "chartreuse3")
text(bar, par("usr")[3], labels = barNames2, srt = 45, adj = 1, xpd = TRUE,cex=0.85)
Property damage is concentrated mostly on FLOOD, and the next major events could well be clustered into the same group of wet events. Wet events are clearly the most costly in terms of property.
Crop damage is concentrated heavily on DROUGHT. Though the next major events could be grouped as wet events, it is safe to say that DROUGHT has had the greatest economic consequences for crops.
## [1] "Javier Prado"