Man against nature has been a part of the human condition since the dawn of time, particularly with regards to weather events. This report is a basic data analysis of the NOAA storm data, 1950-2011, in an effort to determine what weather events are the most devastating to public health and the nation's economy. To this end, the storm data set is loaded into memory, cleaned, and processed to yield data frames that contain the total amounts of fatalities, injuries, damages to property, and damages to crops per weather event type. Since fatalities and injuries are two parts to the overall picture of public health, a new data frame is created that contains the total number of casualties. Likewise, the same is done for total economic damage utilizing the sum of damages to property and damages to crops. All data frames are then reorder from highest totals to lowest, and the subset of each frame's top 10 events are used to delineate exactly which events rank the highest. Essentially, an analysis of proportion is what follows, and to this end bar graphs and pie charts are utilized to illustrate how tornados have been the most harmful to the human population, as well as how hurricanes have caused the most economic damage overall.
First, a check for the R.utils and plyr packages is performed; both are installed and loaded into memory if they are not available.
if(!require("R.utils")){
install.packages("R.utils")
library("R.utils")
}
if(!require("plyr")){
install.packages("plyr")
library("plyr")
}
Following the package check, the repdata-data-StormData.csv data set is looked for in the current working directory. If not present, the compressed repdata-data-StormData.csv.bz2 is searched for and then decompressed. However, if it too is not present, then the .bz2 file is downloaded from the link, provided by the JHU course in reproducible research, and decompressed.
file <- "repdata-data-StormData.csv"
if (!file.exists(file)) {
myzip = "repdata-data-StormData.csv.bz2"
#if data file has not yet been downloaded, fetch it
if (!file.exists(myzip)) {
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile=myzip,method="curl")
}
bunzip2(myzip, remove = FALSE)
}
Now that we can be certain that the repdata-data-StormData.csv data set is in the working directory, we check to see if it has already been read into memory under the object name stormData. If that is not the case, read.csv() is employed. Once loaded into memory, the raw data set is examined via str(stormData).
if(!(exists("stormData"))){
stormData <- read.csv(file)
}
str(stormData)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
As can be seen via the str(stormData) output, the data set is quite large, weighing in at 902297 observations of 37 variables. However, after reviewing the available documentation, it should be clear that the variables of interest are as follows:
Since these are the only columns that pertain to the questions at hand, we will take a subset of these seven variables for a more reduced data set.
stormDataReduced <- stormData[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
After examining the unique values, using unique(), associated with EVTYPE, PROPDMGEXP,and CROPDMGEXP it is clear that the data set needs further cleaning before any analysis can occur. First of all, EVTYPE contains many values beyond the 48 severe weather events listed in the documentation, most of which are typos or subsets of the 48 proper events. However, significant values for casualties and economic damage appear to be limited to the 48 proper events, so the events need relatively minimal processing;event names will be normalized and only the most important aliases and subsets, with respect to impact on the economy and population health, will be combined.
## Clean Event type names and a few multiple alias issues
tempEVTYPE <- stormDataReduced$EVTYPE
##1) remove apostrophes (so "don't" -> "dont", "Jane's" -> "Janes", etc.)
tempEVTYPE <- gsub("'","",tempEVTYPE)
##2) convert to lowercase
tempEVTYPE <- tolower(tempEVTYPE)
##3) change other non-alphanumeric characters to spaces
tempEVTYPE <- gsub("[^a-z0-9]"," ",tempEVTYPE)
##4) change digits to spaces
tempEVTYPE <- gsub("[0-9]+"," ",tempEVTYPE)
##5)trim leading and trailing spaces
tempEVTYPE <- trim(tempEVTYPE)
##6)replace obvious multiple name aliases
tempEVTYPE[grep("hurricane*", tempEVTYPE ,ignore.case=TRUE)] <- "hurricane"
tempEVTYPE[grep("tornado*", tempEVTYPE ,ignore.case=TRUE)] <- "tornado"
tempEVTYPE[grep("drought*", tempEVTYPE ,ignore.case=TRUE)] <- "drought"
tempEVTYPE[grep("flood*", tempEVTYPE ,ignore.case=TRUE)] <- "flood"
tempEVTYPE[tempEVTYPE == "tstm wind"] <- "thunderstorm wind"
stormDataReduced$EVTYPE <- tempEVTYPE
Next, since the exponential values in PROPDMGEXP and CROPDMGEXP are coded, we need to change the character values back into their respective numeric equivalents.
decode <- list("0" = c("", " ", "0", "-", "+", "?"), "1" = "1", "2" = c("2", "H", "h"), "3" = c("3", "k", "K"), "4" = "4", "5" = "5", "6" = c("6", "m", "M"), "7" = "7", "8" = "8", "9" = c("9", "b", "B"))
levels(stormDataReduced$PROPDMGEXP) <- decode
levels(stormDataReduced$CROPDMGEXP) <- decode
## Transform the recoded exponents from factors to numerics
stormDataReduced$PROPDMGEXP <- as.numeric(as.character(stormDataReduced$PROPDMGEXP))
stormDataReduced$CROPDMGEXP <- as.numeric(as.character(stormDataReduced$CROPDMGEXP))
Now we can find the actual value, in USD, for the damages to crops and property, for each event, by multiplying CROPDMG * 10CROPDMGEXP and PROPDMG * 10PROPDMGEXP; we will store these in two new columns cropdamage and propertydamage
stormDataReduced$propertydamage <- stormDataReduced$PROPDMGEXP*10^(stormDataReduced$PROPDMGEXP)
stormDataReduced$cropdamage<- stormDataReduced$CROPDMGEXP*10^(stormDataReduced$CROPDMGEXP)
In terms of generalities, total values for each event, with respect to fatalities, injuries, damages to propery, and damages to crops, are the most denotative of human harm and economic consequence; the higher the total, the more severe the event must be, comparatively.
Since we are only concerned with finding out which types of events are most harmful to population health and have the greatest economic consequence we now will take the sum of each variable's values, and reorder the events from highest to lowest totals. After that, we will take the subset of the top 10 events, for each measure, having the 10th value be the sum of all events below the 9th proper event; we will label the 10th value as all “other” events.
##FATALITIES
totalfatalities <- aggregate(FATALITIES ~ EVTYPE, data = stormDataReduced, FUN = sum)
totalfatalitiesTop <- totalfatalities[order(-totalfatalities$FATALITIES), ]
totalfatalitiesTop[10, "FATALITIES"] <- sum(totalfatalitiesTop[-(1:9), "FATALITIES" ])
totalfatalitiesTop[10, "EVTYPE"] <- "other"
totalfatalitiesTop <- totalfatalitiesTop[1:10, ]
##INJURIES
totalinjuries <- aggregate(INJURIES ~ EVTYPE, data = stormDataReduced, FUN = sum)
totalinjuriesTop <- totalinjuries[order(-totalinjuries$INJURIES), ]
totalinjuriesTop[10, "INJURIES"] <- sum(totalinjuriesTop[-(1:9), "INJURIES" ])
totalinjuriesTop[10, "EVTYPE"] <- "other"
totalinjuriesTop <- totalinjuriesTop[1:10, ]
##propertydamage
totalpropertydamage<- aggregate(propertydamage ~ EVTYPE, data = stormDataReduced, FUN = sum)
totalpropertydamageTop <- totalpropertydamage[order(-totalpropertydamage$propertydamage), ]
totalpropertydamageTop[10, "propertydamage"] <- sum(totalpropertydamageTop[-(1:9), "propertydamage" ])
totalpropertydamageTop[10, "EVTYPE"] <- "other"
totalpropertydamageTop <- totalpropertydamageTop[1:10, ]
##cropdamage
totalcropdamage<- aggregate(cropdamage ~ EVTYPE, data = stormDataReduced, FUN = sum)
totalcropdamageTop <- totalcropdamage[order(-totalcropdamage$cropdamage), ]
totalcropdamageTop[10, "cropdamage"] <- sum(totalcropdamageTop[-(1:9), "cropdamage" ])
totalcropdamageTop[10, "EVTYPE"] <- "other"
totalcropdamageTop <- totalcropdamageTop[1:10, ]
Fatalities and injuries are really two parts of a whole that relates to an event's impact on population health;the event's number of casualties. Likewise, property and crop damage can also be thought of as constituents of the overall economic damage caused by an event. Thus, the total number of casualties and the amount of economic damage done by each event will be found through merging each total, by EVTYPE, and using mutate to find their sum. The resulting data frames will be reordered, and a top 10 will be subsetted. For better comparative analysis, by way of proportions, each top 10 will have its event's percentage added in a separate column named “PERCENT”.
## totalCASUALTIES = totalFATALITIES + totalINJURIES
totalcasualties <- merge(totalfatalities, totalinjuries, by="EVTYPE")
totalcasualties <- mutate(totalcasualties, CASUALTIES = FATALITIES + INJURIES)
totalcasualties <- totalcasualties [order(-totalcasualties$CASUALTIES), !(names(totalcasualties) %in% c("FATALITIES", "INJURIES"))]
##Top 10 events associated with total Casualties
totalcasualties[10, "CASUALTIES"] <- sum(totalcasualties[-(1:9), "CASUALTIES" ])
totalcasualties[10, "EVTYPE"] <- "other"
totalcasualties <- totalcasualties[1:10, ]
totalcasualties$PERCENT <- (totalcasualties$CASUALTIES/sum(totalcasualties$CASUALTIES))*100
##totalECONOMICDAMAGE = totalPROPERTYDAMAGE + totalCROPDAMAGE
totaleconomicdamage <- merge(totalpropertydamage, totalcropdamage, by="EVTYPE")
totaleconomicdamage <- mutate(totaleconomicdamage, economicdamage = propertydamage + cropdamage)
totaleconomicdamage <- totaleconomicdamage [order(-totaleconomicdamage$economicdamage), !(names(totaleconomicdamage) %in% c("propertydamage", "cropdamage"))]
##Top 10 events associated with economic damage
totaleconomicdamage[10, "economicdamage"] <- sum(totaleconomicdamage[-(1:9), "economicdamage" ])
totaleconomicdamage[10, "EVTYPE"] <- "other"
totaleconomicdamage <- totaleconomicdamage[1:10, ]
totaleconomicdamage$PERCENT <- (totaleconomicdamage$economicdamage/sum(totaleconomicdamage$economicdamage))*100
First we will examine the top 10 events for fatalities, injuries, property damage, and crop damage via bar graphs.
par(mfrow = c(2,2), mar = c(5,8,4,2))
barplot(rev(totalfatalitiesTop$FATALITIES), main="Event Fatality Totals", horiz=TRUE, names.arg=rev(totalfatalitiesTop$EVTYPE), las = 2, col = heat.colors(length(totalfatalitiesTop$EVTYPE)))
barplot(rev(totalinjuriesTop$INJURIES), main="Event Injury Totals", horiz=TRUE, names.arg=rev(totalinjuriesTop$EVTYPE), las = 2, col = terrain.colors(length(totalinjuriesTop$EVTYPE)))
barplot(rev(totalpropertydamageTop$propertydamage), main="Event Property Damage Totals in USD", horiz=TRUE, names.arg=rev(totalpropertydamageTop$EVTYPE), las = 2, col = topo.colors(length(totalpropertydamageTop$EVTYPE)))
barplot(rev(totalcropdamageTop$cropdamage), main="Event Crop Damage Totals in USD", horiz=TRUE, names.arg=rev(totalcropdamageTop$EVTYPE), las = 2, col = cm.colors(length(totalcropdamageTop$EVTYPE)))
options(width=140)
table <- totalfatalitiesTop
table <- cbind(table, totalinjuriesTop)
table <- cbind(table, totalpropertydamageTop)
table <- cbind(table, totalcropdamageTop)
names(table) <- c("Event", "Total Fatalities", "Event", "Total Injuries", "Event", "Property Damage", "Event", "Crop Damage")
row.names(table) <- NULL
table
Event Total Fatalities Event Total Injuries Event Property Damage Event Crop Damage
1 tornado 5661 tornado 91407 hurricane 1.628e+11 drought 3.686e+10
2 excessive heat 1903 flood 8604 flood 8.225e+10 flood 1.243e+10
3 flood 1525 thunderstorm wind 8445 tornado 6.309e+10 hurricane 9.396e+09
4 heat 937 excessive heat 6525 storm surge 1.825e+10 freeze 9.084e+09
5 lightning 817 lightning 5230 hail 1.537e+10 ice storm 9.027e+09
6 thunderstorm wind 637 heat 2100 high wind 1.085e+10 heat 9.020e+09
7 rip current 368 ice storm 1975 wildfire 9.881e+09 hail 3.635e+09
8 high wind 248 hail 1361 winter storm 9.856e+09 thunderstorm wind 1.541e+09
9 avalanche 224 hurricane 1328 tropical storm 9.643e+09 tornado 5.505e+08
10 other 2825 other 13553 other 4.851e+10 other 2.275e+09
Based on the above multipanel graph, it is clear that the most harmful weather event, with respect to population health, is a tornado; these amount to the highest death toll, as well as the highest number of injuries. Concerning which event has had the greatest economic consquence, things are not so clear. Obviously, hurricanes have done the most in property damage. Furthermore, hurricanes have costed the most overall in US dollars. Although droughts have been, without a doubt, the most costly type of event for crops to go through, their economic consquence, in USD, is still less than that of the property damage done by hurricanes
Now, let us examine the data concerning casualties and economic damage with respect to the percentage of impact that these events have.
slices1 <- totalcasualties$PERCENT
lbls1 <- totalcasualties$EVTYPE
pct1 <- round(totalcasualties$PERCENT, digits = 1)
lbls1 <- paste(lbls1, pct1)
lbls1 <- paste(lbls1,"%",sep="")
pct1 <- paste(pct1,"%",sep="")
pie(slices1,labels = pct1, col=rainbow(length(lbls1)), main="Percentage of Casualties", cex = .7)
legend(-1.5,1.3,xpd=NA, legend = lbls1, fill=rainbow(length(lbls1)), bty = "n", cex = 1)
color <-c("red","pink", "orange", "yellow", "green", "purple", "blue", "black", "gray", "white")
slices2 <- totaleconomicdamage$PERCENT
lbls2 <- totaleconomicdamage$EVTYPE
pct2 <- round(totaleconomicdamage$PERCENT, digits = 1)
lbls2 <- paste(lbls2, pct2)
lbls2 <- paste(lbls2,"%",sep="")
pct2 <- paste(pct2,"%",sep="")
pie(slices2,labels = pct2, col=color, main="Percentage of Economic Damage", cex = .7)
legend(-1.5,1.3,xpd=NA, legend = lbls2, fill=color, bty = "n", cex = 1)
Alas, the pie chart showing the “Percentage of Casualties” reaffirms what the preceding graphs of fatality and injury totals showed us: Tornados have been the most harmful weather event with respect to population health. More importantly though, the chart for the “Percentage of Economic Damage” illustrates how hurricanes have been the most costly weather phenomenon, which was not so clear in the previous graphs of property and crop damage totals. Based on these finding, further analysis of the storm data is warranted with regards to tornados and hurricanes;given that these events have caused the most damage, either in terms of money or human resource, questions of prediction are paramount.