An Analysis of Weather with Respect to Population Health and Economic Consequence.

Synopsis

Man against nature has been a part of the human condition since the dawn of time, particularly with regards to weather events. This report is a basic data analysis of the NOAA storm data, 1950-2011, in an effort to determine what weather events are the most devastating to public health and the nation's economy. To this end, the storm data set is loaded into memory, cleaned, and processed to yield data frames that contain the total amounts of fatalities, injuries, damages to property, and damages to crops per weather event type. Since fatalities and injuries are two parts to the overall picture of public health, a new data frame is created that contains the total number of casualties. Likewise, the same is done for total economic damage utilizing the sum of damages to property and damages to crops. All data frames are then reorder from highest totals to lowest, and the subset of each frame's top 10 events are used to delineate exactly which events rank the highest. Essentially, an analysis of proportion is what follows, and to this end bar graphs and pie charts are utilized to illustrate how tornados have been the most harmful to the human population, as well as how hurricanes have caused the most economic damage overall.

Data Processing

First, a check for the R.utils and plyr packages is performed; both are installed and loaded into memory if they are not available.

if(!require("R.utils")){
    install.packages("R.utils")
    library("R.utils")
}

if(!require("plyr")){
    install.packages("plyr")
    library("plyr")
}

Following the package check, the repdata-data-StormData.csv data set is looked for in the current working directory. If not present, the compressed repdata-data-StormData.csv.bz2 is searched for and then decompressed. However, if it too is not present, then the .bz2 file is downloaded from the link, provided by the JHU course in reproducible research, and decompressed.

file <- "repdata-data-StormData.csv"
if (!file.exists(file)) {
    myzip = "repdata-data-StormData.csv.bz2"
    #if data file has not yet been downloaded, fetch it
    if (!file.exists(myzip)) {
        download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",                                       destfile=myzip,method="curl")

        }

    bunzip2(myzip, remove = FALSE)
    }

Now that we can be certain that the repdata-data-StormData.csv data set is in the working directory, we check to see if it has already been read into memory under the object name stormData. If that is not the case, read.csv() is employed. Once loaded into memory, the raw data set is examined via str(stormData).

if(!(exists("stormData"))){
stormData <- read.csv(file)
}

str(stormData)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels ""," Christiansburg",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels ""," CANTON"," TULIA",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","%SD",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

As can be seen via the str(stormData) output, the data set is quite large, weighing in at 902297 observations of 37 variables. However, after reviewing the available documentation, it should be clear that the variables of interest are as follows:

EVTYPE: The type of weather event.
FATALITIES: The number of deaths after the event.
INJURIES: The number of injured people after the event.
PROPDMG: The coefficient to be multiplied by 10^PROPDMGEXP for an estimate on damage to property in USD.
PROPDMGEXP: The coded exponential power for property damage done by an event.
CROPDMG: The coefficient to be multiplied by 10^CROPDMGEXP for an estimate on damage to crops in USD.
CROPDMGEXP: The coded exponential power for damage done to crops by an event.

Since these are the only columns that pertain to the questions at hand, we will take a subset of these seven variables for a more reduced data set.

stormDataReduced <- stormData[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]

After examining the unique values, using unique(), associated with EVTYPE, PROPDMGEXP,and CROPDMGEXP it is clear that the data set needs further cleaning before any analysis can occur. First of all, EVTYPE contains many values beyond the 48 severe weather events listed in the documentation, most of which are typos or subsets of the 48 proper events. However, significant values for casualties and economic damage appear to be limited to the 48 proper events, so the events need relatively minimal processing;event names will be normalized and only the most important aliases and subsets, with respect to impact on the economy and population health, will be combined.

## Clean Event type names and a few multiple alias issues 

tempEVTYPE <- stormDataReduced$EVTYPE


##1) remove apostrophes (so "don't" -> "dont", "Jane's" -> "Janes", etc.)
tempEVTYPE <- gsub("'","",tempEVTYPE)
##2) convert to lowercase
tempEVTYPE <- tolower(tempEVTYPE)
##3) change other non-alphanumeric characters to spaces
tempEVTYPE <- gsub("[^a-z0-9]"," ",tempEVTYPE)
##4) change digits to spaces
tempEVTYPE <- gsub("[0-9]+"," ",tempEVTYPE) 
##5)trim leading and trailing spaces
tempEVTYPE <- trim(tempEVTYPE)
##6)replace obvious multiple name aliases 
tempEVTYPE[grep("hurricane*", tempEVTYPE ,ignore.case=TRUE)] <- "hurricane"
tempEVTYPE[grep("tornado*", tempEVTYPE ,ignore.case=TRUE)] <- "tornado"
tempEVTYPE[grep("drought*", tempEVTYPE ,ignore.case=TRUE)] <- "drought"
tempEVTYPE[grep("flood*", tempEVTYPE ,ignore.case=TRUE)] <- "flood"
tempEVTYPE[tempEVTYPE == "tstm wind"] <- "thunderstorm wind"

stormDataReduced$EVTYPE <- tempEVTYPE

Next, since the exponential values in PROPDMGEXP and CROPDMGEXP are coded, we need to change the character values back into their respective numeric equivalents.

decode <- list("0" = c("", " ", "0", "-", "+", "?"), "1" = "1", "2" = c("2", "H", "h"), "3" = c("3", "k", "K"), "4" = "4", "5" = "5", "6" = c("6", "m", "M"), "7" = "7", "8" = "8", "9" = c("9", "b", "B"))
levels(stormDataReduced$PROPDMGEXP) <- decode
levels(stormDataReduced$CROPDMGEXP) <- decode

## Transform the recoded exponents from factors to numerics

stormDataReduced$PROPDMGEXP <- as.numeric(as.character(stormDataReduced$PROPDMGEXP))
stormDataReduced$CROPDMGEXP <- as.numeric(as.character(stormDataReduced$CROPDMGEXP))

Now we can find the actual value, in USD, for the damages to crops and property, for each event, by multiplying CROPDMG * 10^CROPDMGEXP and PROPDMG * 10^PROPDMGEXP; we will store these in two new columns cropdamage and propertydamage

stormDataReduced$propertydamage <- stormDataReduced$PROPDMGEXP*10^(stormDataReduced$PROPDMGEXP)
stormDataReduced$cropdamage<- stormDataReduced$CROPDMGEXP*10^(stormDataReduced$CROPDMGEXP)

In terms of generalities, total values for each event, with respect to fatalities, injuries, damages to propery, and damages to crops, are the most denotative of human harm and economic consequence; the higher the total, the more severe the event must be, comparatively.

Since we are only concerned with finding out which types of events are most harmful to population health and have the greatest economic consequence we now will take the sum of each variable's values, and reorder the events from highest to lowest totals. After that, we will take the subset of the top 10 events, for each measure, having the 10th value be the sum of all events below the 9th proper event; we will label the 10th value as all “other” events.

##FATALITIES
totalfatalities <- aggregate(FATALITIES ~ EVTYPE, data = stormDataReduced,  FUN = sum)
totalfatalitiesTop <- totalfatalities[order(-totalfatalities$FATALITIES), ]
totalfatalitiesTop[10, "FATALITIES"] <- sum(totalfatalitiesTop[-(1:9), "FATALITIES" ]) 
totalfatalitiesTop[10, "EVTYPE"] <- "other" 
totalfatalitiesTop <- totalfatalitiesTop[1:10, ] 

##INJURIES
totalinjuries <- aggregate(INJURIES ~ EVTYPE, data = stormDataReduced,  FUN = sum)
totalinjuriesTop <- totalinjuries[order(-totalinjuries$INJURIES), ]
totalinjuriesTop[10, "INJURIES"] <- sum(totalinjuriesTop[-(1:9), "INJURIES" ]) 
totalinjuriesTop[10, "EVTYPE"] <- "other" 
totalinjuriesTop <- totalinjuriesTop[1:10, ] 

##propertydamage
totalpropertydamage<- aggregate(propertydamage ~ EVTYPE, data = stormDataReduced,  FUN = sum)
totalpropertydamageTop <- totalpropertydamage[order(-totalpropertydamage$propertydamage), ]
totalpropertydamageTop[10, "propertydamage"] <- sum(totalpropertydamageTop[-(1:9), "propertydamage" ]) 
totalpropertydamageTop[10, "EVTYPE"] <- "other" 
totalpropertydamageTop <- totalpropertydamageTop[1:10, ] 

##cropdamage
totalcropdamage<- aggregate(cropdamage ~ EVTYPE, data = stormDataReduced,  FUN = sum)
totalcropdamageTop <- totalcropdamage[order(-totalcropdamage$cropdamage), ]
totalcropdamageTop[10, "cropdamage"] <- sum(totalcropdamageTop[-(1:9), "cropdamage" ]) 
totalcropdamageTop[10, "EVTYPE"] <- "other" 
totalcropdamageTop <- totalcropdamageTop[1:10, ]

Fatalities and injuries are really two parts of a whole that relates to an event's impact on population health;the event's number of casualties. Likewise, property and crop damage can also be thought of as constituents of the overall economic damage caused by an event. Thus, the total number of casualties and the amount of economic damage done by each event will be found through merging each total, by EVTYPE, and using mutate to find their sum. The resulting data frames will be reordered, and a top 10 will be subsetted. For better comparative analysis, by way of proportions, each top 10 will have its event's percentage added in a separate column named “PERCENT”.

## totalCASUALTIES = totalFATALITIES + totalINJURIES
totalcasualties <- merge(totalfatalities, totalinjuries, by="EVTYPE")

totalcasualties <- mutate(totalcasualties, CASUALTIES = FATALITIES + INJURIES)

totalcasualties <- totalcasualties [order(-totalcasualties$CASUALTIES), !(names(totalcasualties) %in% c("FATALITIES", "INJURIES"))]

##Top 10 events associated with total Casualties
totalcasualties[10, "CASUALTIES"] <- sum(totalcasualties[-(1:9), "CASUALTIES" ]) 
totalcasualties[10, "EVTYPE"] <- "other" 
totalcasualties <- totalcasualties[1:10, ] 
totalcasualties$PERCENT <-  (totalcasualties$CASUALTIES/sum(totalcasualties$CASUALTIES))*100


##totalECONOMICDAMAGE = totalPROPERTYDAMAGE + totalCROPDAMAGE
totaleconomicdamage <- merge(totalpropertydamage, totalcropdamage, by="EVTYPE")

totaleconomicdamage <- mutate(totaleconomicdamage, economicdamage = propertydamage + cropdamage)

totaleconomicdamage <- totaleconomicdamage [order(-totaleconomicdamage$economicdamage), !(names(totaleconomicdamage) %in% c("propertydamage", "cropdamage"))]

##Top 10 events associated with economic damage
totaleconomicdamage[10, "economicdamage"] <- sum(totaleconomicdamage[-(1:9), "economicdamage" ]) 
totaleconomicdamage[10, "EVTYPE"] <- "other" 
totaleconomicdamage <- totaleconomicdamage[1:10, ] 
totaleconomicdamage$PERCENT <-  (totaleconomicdamage$economicdamage/sum(totaleconomicdamage$economicdamage))*100

Results

First we will examine the top 10 events for fatalities, injuries, property damage, and crop damage via bar graphs.

par(mfrow = c(2,2), mar = c(5,8,4,2))

barplot(rev(totalfatalitiesTop$FATALITIES), main="Event Fatality Totals", horiz=TRUE,  names.arg=rev(totalfatalitiesTop$EVTYPE), las = 2, col = heat.colors(length(totalfatalitiesTop$EVTYPE)))

barplot(rev(totalinjuriesTop$INJURIES), main="Event Injury Totals", horiz=TRUE, names.arg=rev(totalinjuriesTop$EVTYPE), las = 2, col = terrain.colors(length(totalinjuriesTop$EVTYPE)))

barplot(rev(totalpropertydamageTop$propertydamage), main="Event Property Damage Totals in USD", horiz=TRUE, names.arg=rev(totalpropertydamageTop$EVTYPE), las = 2, col = topo.colors(length(totalpropertydamageTop$EVTYPE)))

barplot(rev(totalcropdamageTop$cropdamage), main="Event Crop Damage Totals in USD", horiz=TRUE, names.arg=rev(totalcropdamageTop$EVTYPE), las = 2, col = cm.colors(length(totalcropdamageTop$EVTYPE)))

plot of chunk unnamed-chunk-10

options(width=140)

table <- totalfatalitiesTop
table <- cbind(table, totalinjuriesTop)
table <- cbind(table, totalpropertydamageTop)
table <- cbind(table, totalcropdamageTop)

names(table) <-  c("Event", "Total Fatalities", "Event", "Total Injuries", "Event", "Property Damage", "Event", "Crop Damage")

row.names(table) <- NULL 

table

               Event Total Fatalities             Event Total Injuries          Event Property Damage             Event Crop Damage
1            tornado             5661           tornado          91407      hurricane       1.628e+11           drought   3.686e+10
2     excessive heat             1903             flood           8604          flood       8.225e+10             flood   1.243e+10
3              flood             1525 thunderstorm wind           8445        tornado       6.309e+10         hurricane   9.396e+09
4               heat              937    excessive heat           6525    storm surge       1.825e+10            freeze   9.084e+09
5          lightning              817         lightning           5230           hail       1.537e+10         ice storm   9.027e+09
6  thunderstorm wind              637              heat           2100      high wind       1.085e+10              heat   9.020e+09
7        rip current              368         ice storm           1975       wildfire       9.881e+09              hail   3.635e+09
8          high wind              248              hail           1361   winter storm       9.856e+09 thunderstorm wind   1.541e+09
9          avalanche              224         hurricane           1328 tropical storm       9.643e+09           tornado   5.505e+08
10             other             2825             other          13553          other       4.851e+10             other   2.275e+09

Based on the above multipanel graph, it is clear that the most harmful weather event, with respect to population health, is a tornado; these amount to the highest death toll, as well as the highest number of injuries. Concerning which event has had the greatest economic consquence, things are not so clear. Obviously, hurricanes have done the most in property damage. Furthermore, hurricanes have costed the most overall in US dollars. Although droughts have been, without a doubt, the most costly type of event for crops to go through, their economic consquence, in USD, is still less than that of the property damage done by hurricanes

Now, let us examine the data concerning casualties and economic damage with respect to the percentage of impact that these events have.

slices1 <- totalcasualties$PERCENT
lbls1 <- totalcasualties$EVTYPE
pct1 <- round(totalcasualties$PERCENT, digits = 1)
lbls1 <- paste(lbls1, pct1)
lbls1 <- paste(lbls1,"%",sep="")
pct1 <- paste(pct1,"%",sep="")

pie(slices1,labels = pct1, col=rainbow(length(lbls1)), main="Percentage of Casualties", cex = .7)
legend(-1.5,1.3,xpd=NA, legend = lbls1, fill=rainbow(length(lbls1)),  bty = "n", cex = 1)

plot of chunk unnamed-chunk-12

color <-c("red","pink", "orange", "yellow", "green", "purple", "blue", "black", "gray", "white")

slices2 <- totaleconomicdamage$PERCENT
lbls2 <- totaleconomicdamage$EVTYPE
pct2 <- round(totaleconomicdamage$PERCENT, digits = 1)
lbls2 <- paste(lbls2, pct2)
lbls2 <- paste(lbls2,"%",sep="")
pct2 <- paste(pct2,"%",sep="")

pie(slices2,labels = pct2, col=color, main="Percentage of Economic Damage", cex = .7)
legend(-1.5,1.3,xpd=NA, legend = lbls2, fill=color,  bty = "n", cex = 1)

plot of chunk unnamed-chunk-13

Alas, the pie chart showing the “Percentage of Casualties” reaffirms what the preceding graphs of fatality and injury totals showed us: Tornados have been the most harmful weather event with respect to population health. More importantly though, the chart for the “Percentage of Economic Damage” illustrates how hurricanes have been the most costly weather phenomenon, which was not so clear in the previous graphs of property and crop damage totals. Based on these finding, further analysis of the storm data is warranted with regards to tornados and hurricanes;given that these events have caused the most damage, either in terms of money or human resource, questions of prediction are paramount.