Synopsis

This document has been prepared as a submission in response to the Reproducible Research Peer Assignment 2 - Severe Weather. It describes the process and results of an analysis of the NOAA NWS storm data set containing information on injuries, fatalities, property and crop damage arising from weather related events in the US between 1996 and 2011. This report has been prepared as an example of literate programming. It first describes, in text and code, the initial data processing and the steps required to locate and load the data. This involved some re-formatting and filtering of the data but the main task was to map the event descriptions contained in the file (in excess of 900 in total) onto the NWS standard set of event labels (48 in total). Once this was accomplished, the in-depth analysis of the information in the data set was carried out in order to determine which types of events are responsible for the greatest public health and economic impacts. Summary figures presenting the results of this analysis were prepared and are included in this document. Finally, a section giving the conclusions drawn from the evidence presented is included.

Data Processing

As the intial step in the processing and analysis, the NOAA NWS StormData.csv data set was located, downloaded and read in. The following code chunk was used to accomplish this.

## Set the working directory and then download and unzip the data file (if not already present)
Url                 <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2" 
dirPath             <- "~/Documents/Data Science Specialisation/5 - Reproducible Research/Project 2"
setwd(dirPath)
if(!file.exists("StormData.csv.bz2")){download.file(Url, destfile="StormData.csv.bz2")}

## Read in the file (but only the required variables)
variablesNeeded                     <- as.character(c(rep("NULL",37)))
variablesNeeded[c(2,7,8,26,28)]     <- "character"
variablesNeeded[c(23,24,25,27)]     <- "numeric"
dataStorm                           <- read.csv("StormData.csv.bz2", stringsAsFactors = FALSE,  
                                                    strip.white=TRUE, na.strings = c("NA", ""),
                                                                    colClasses = variablesNeeded)

After the data were loaded, initial conversions and reformatting of the date variables was carried out, together with deletion of redundant summary observations. It was considered whether to carry out further filtering at this stage, notably in terms of date, using the $BG_DATE variable, or geographic location using the $STATE variable. It was decided to filter to remove all data prior to 1st January 1996 since it is known that data prior to this cutoff were only partially recorded for some event types and tend to skew the results, particularly toward injuries and fatalities arising from tornadoes (for further details see http://www.ncdc.noaa.gov/stormevents/details.jsp?type=eventtype ).

Consideration was also given to filtering on location because of the presence of the following “States” in the raw data: “PR” - Puerto Rico, “AS” - American Samoa, “GU” - Guam, “MH” - Marshall Islands, “VI” - Virgin Islands, “AM” - W. North Atlantic Ocean, “LC” - Lake St. Clair, “PH” - Central Pacific Ocean, “GM” - Gulf of Mexico, “PZ” - Eastern North Pacific Ocean, “AN” - NW North Atlantic Ocean, “LH” - Lake Huron, “LM” - Lake Michigan, “LE” - Lake Erie, “LS” - Lake Superior, “SL” - Saint Lawrence, “LO” - Lake Ontario, “PM” - Western Pacific Ocean, and “PK” - N. Pacific Ocean nr. Alaska. However, it was decided to leave these locations unfiltered as it was judged that any health or economic impacts to US personnel or infrastructure at these locations should be included as falling within the scope of “across the United States”.

In addition to filtering, on the first run through a serious data error was detected - a property damage exponent value which was entered as a B[illion] should have been entered as a M[illion]. This was therefore explicitly corrected, at this stage of processing. The following code chunk was used to carry out the initial filtering and to correct the data exponent error.

## Convert the $BGN_DATE variable to date format
dataStorm$BGN_DATE                      <- as.Date(dataStorm$BGN_DATE, "%m/%d/%Y")

## Correct a serious error in property damage (exponent entered as "B" should be "M")
dataStorm[605953,7]                     <- "M"
## Correct a minor problem with a state identifier
dataStorm[dataStorm[,2]=="SL",2]        <- "KS"

# Delete summary rows, data before 1996 and 2 instances of $STATE = "XX"
dataStorm                               <- dataStorm[-grep("Summary|SUMMARY", dataStorm$EVTYPE), ]
dataStorm                               <- dataStorm[dataStorm$BGN_DATE > "1995-12-31", ]
dataStorm                               <- dataStorm[-grep("xx|XX", dataStorm$STATE), ]

The next task was the mapping of the supplied event descriptors (with over 900 separate examples) onto a new variable in the loaded data frame containing the NWS standard event type descriptors (48 examples). This was a complex and time consuming task requiring considerable manual fine-tuning and validation. The full code used is supplied and may be customised but it must be noted that any further modifications to this mapping will require additional effort be spent to ensure that the changes do not result in a mapping which is incomplete (or redundant). This mapping was carried out using the following code chunk.

## Set up a new data frame for mapping used event names to NWS allowed set of 48 + Unknown
eventNames          <- data.frame(used=character(49), barred=character(49), allowed=character(49))

## Define used event names (in the database) including RegEx possibilities - this is
## where we map the names used onto the allowed names - if you want to use a different mapping then this is 
## what you change (or the barred words in the following expression). Caution, manual tuning of these phrases 
## to prevent mis-matches between the used event names and the NWS allowed event identifiers required 
## considerable effort for manual verification and testing - alter at your own risk.
eventNames$used     <- c("Astronomical Low Tide|Low Tide", "Avalanche|Avalance", 
                        "Blizzard", "Coastal Flood|Coastal|Cstl|Erosion|Erosin", 
"Cold/Wind Chill|Wind Chill|Cool|Cold|Unseasonal Low Temp|Unseasonably Cool|Cool Spell", 
                        "Debris Flow|Slide|Slump", "Dense Fog", "Dense Smoke|Smoke", 
                        "Drought|Dry|Driest|Record Low Rainfall", "Dust Devil|Devel", 
                        "Dust Storm|Blowing Dust|Saharan Dust", 
                        "Excessive Heat|Temperature Record|Record Temp|Record High|Record Heat|Warm|Excessive", 
"Extreme Cold/Wind Chill|Extreme Cold|Hypo|Hyper|Record Low|Record Cool|Low Temp|Extreme Windchill|Cold",
                        "Flash Flood|Flash|Dam Break|Dam Failure|Rapidly", 
                        "Flood|Flood|Fld|Floyd|Urban|Small Stream|Drown|Southeast|Apache", "Fog|Vog", 
                        "Frost/Freeze|Frost|Freeze|Freezing|Icy|Glaze|Ice", 
                        "Funnel Cloud|Funnel", "Hail", "Heat|Hot|Warm Weather|Unseasonably Warm", 
                        "Heavy Rain|Heavy Shower|Microburst|Wet|Rain|Precip|Downburst|Heavy Mix", 
                        "Heavy Snow|Snow", "High Surf|Surf", "High Wind|Wind|Wnd|High", 
                        "Hurricane/Typhoon|Hurricane|Typhoon|Wall Cloud", "Ice Storm", 
                        "Lakeshore Flood", "Lake-Effect Snow|Lake effect Snow", 
                        "Lightning|Lighting|Ligntning", "Marine Hail", 
                        "Marine High Wind", "Marine Strong Wind|Mishap|Accident", 
                        "Marine Thunderstorm Wind|Marine tstm Wind|Coastal Storm|Coastalstorm", 
                        "Rip Current", "Seiche", "Sleet", 
                        "Storm Tide|Storm Surge|Wave|Seas|Tide|Swell|High Water|Coastal Surge|Tidal", 
                        "Strong Wind|Severe Turbulence", "Thunderstorm Wind|tstm|Thunderstorm|Metro Storm", 
                        "Tornado|Torndao|Landspout|Gustnado", "Tropical Depression", "Tropical Storm|Storm", 
                        "Tsunami", "Volcanic Ash|Volcanic", "Waterspout|Water Spout|Wayterspout", 
                        "Wildfire|Wild Fires|Wild/Forest Fire|Forest Fire|Red Flag|Brush Fire|Grass Fires", 
                        "Winter Storm", "Winter Weather|Wintry|Winter",
"Unknown|Other|None|Northern Lights|No Severe Weather|Monthly Temperature|Mild Pattern|\\?")

## Define barred event names (in the database) including RegEx possibilities - this is used because RegEx 
## does not easily support negative matching for combinations, ie, "Cold/Wind Chill"  must not contain 
## "Extreme" - so a compound expression grepl(match) & !grepl(not_match) is used below instead (the "QQ" value 
## is used here to indicate that there are no barred words for this match in the NWS event types)
eventNames$barred <- c("QQ", "QQ", "QQ", "Storm", "Extreme|Record|Excessive|Record|Unusually", 
                        "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "Unseasonably Warm|Warm Weather", "QQ", "QQ", "QQ", 
                        "QQ", "QQ", "QQ", "Marine", "QQ", "QQ", "Lake Effect Snow|Lake-Effect Snow", "QQ", 
                        "Windchill|Marine|Thunderstorm|Water", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", 
                        "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "Winter Storm", "QQ", "QQ", "QQ", 
                        "QQ", "QQ", "QQ", "QQ")

## Now define the allowed event names that will actually be used (48 + "Unknown")
eventNames$allowed  <- c("Astronomical Low Tide", "Avalanche", "Blizzard", 
                        "Coastal Flood", "Cold/Wind Chill", "Debris Flow", "Dense Fog", 
                        "Dense Smoke", "Drought", "Dust Devil", "Dust Storm", "Excessive Heat", 
                        "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Fog", 
                        "Frost/Freeze", "Funnel Cloud", "Hail", "Heat", "Heavy Rain",
                        "Heavy Snow", "High Surf", "High Wind", "Hurricane/Typhoon", 
                        "Ice Storm", "Lakeshore Flood", "Lake-Effect Snow", "Lightning", 
                        "Marine Hail", "Marine High Wind", "Marine Strong Wind", 
                        "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", 
                        "Storm Tide", "Strong Wind", "Thunderstorm Wind", "Tornado", 
                        "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash",
                        "Waterspout", "Wildfire", "Winter Storm", "Winter Weather", "Unknown")

## Set all $used, $barred & new $NWSevents fields to upper case (prevents overwriting the $NWSevent in the next steps)
eventNames$used     <- toupper(eventNames$used)
eventNames$barred   <- toupper(eventNames$barred)
dataStorm$NWSevents <- toupper(dataStorm$EVTYPE)

## Replace $NWSevents variable in compliance with the 48 allowed event types + "Unknown"
for (i in 1:nrow(eventNames)) {
    dataStorm[grepl(eventNames[i,1], dataStorm$NWSevents) & 
                                        !grepl(eventNames[i,2], dataStorm$NWSevents), 10] <- eventNames[i,3]
}

Further processing carried out in the next stage was to substitute the alphanumeric based exponent descriptors in the property and crop damage assessment exponent variables for their numeric equivalents to simplify later processing. This was achieved using the following code chunk.

## Fix exponent data for the property & crop damage exponent variables
dataStorm[is.na(dataStorm[,7]), 7]  <- "0"
dataStorm[dataStorm[,7] == "B" | dataStorm[,7] == "b", 7]   <- "9"
dataStorm[dataStorm[,7] == "M" | dataStorm[,7] == "m", 7]   <- "6"
dataStorm[dataStorm[,7] == "K" | dataStorm[,7] == "k", 7]   <- "3"
dataStorm[dataStorm[,7] == "H" | dataStorm[,7] == "h", 7]   <- "2"
dataStorm[dataStorm[,7] == "+" | dataStorm[,7] == "-", 7]   <- "0"
dataStorm[dataStorm[,7] == "?", 7]                          <- "0"
dataStorm[is.na(dataStorm[,9]), 9]                          <- "0"
dataStorm[dataStorm[,9] == "B" | dataStorm[,9] == "b", 9]   <- "9"
dataStorm[dataStorm[,9] == "M" | dataStorm[,9] == "m", 9]   <- "6"
dataStorm[dataStorm[,9] == "K" | dataStorm[,9] == "k", 9]   <- "3"
dataStorm[dataStorm[,9] == "H" | dataStorm[,9] == "h", 9]   <- "2"
dataStorm[dataStorm[,9] == "+" | dataStorm[,9] == "-", 9]   <- "0"
dataStorm[dataStorm[,9] == "?", 9]                          <- "0"

Results

For the detailed analysis, the questions addressed were:

  1. Across the United States, which types of events (as indicated in the $EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

Therefore, the initial step of the detailed analysis was to calculate and add other variables to the data frame for each observation including the property and crop damage estimates (using the relevant damage numerical mantissa read from the original data set and the numerical exponent calculated in the previous section). Once this was done, new data frames containing the summary results of the injury, fatality, property damage and crop damage per NWS event type or State were prepared using the aggregate() and sum() functions. These estimates were placed in order to support the production of the summary figures. The following code chunk was used to carry out this processing.

## Calculate and store the property damage and crop damage figures
dataStorm$propertyDamage    <- dataStorm$PROPDMG*10^(as.numeric(dataStorm$PROPDMGEXP))
dataStorm$cropDamage        <- dataStorm$CROPDMG*10^(as.numeric(dataStorm$CROPDMGEXP))

## Compute some total damage + %ge levels by NSW event categories & re-order descending
totalInjuries               <- aggregate(INJURIES ~ NWSevents, dataStorm, sum)
totalFatalities             <- aggregate(FATALITIES ~ NWSevents, dataStorm, sum)
totalPropertyDamage         <- aggregate(propertyDamage ~ NWSevents, dataStorm, sum)
totalCropDamage             <- aggregate(cropDamage ~ NWSevents, dataStorm, sum)
totalInjuries$percent       <- round(totalInjuries[,2]*100/sum(totalInjuries[,2]),1)
totalFatalities$percent     <- round(totalFatalities[,2]*100/sum(totalFatalities[,2]),1)
totalPropertyDamage$percent <- round(totalPropertyDamage[,2]*100/sum(totalPropertyDamage[,2]),1)
totalCropDamage$percent     <- round(totalCropDamage[,2]*100/sum(totalCropDamage[,2]),1)
totalInjuries               <- totalInjuries[order(-totalInjuries$INJURIES),]
totalFatalities             <- totalFatalities[order(-totalFatalities$FATALITIES),]
totalPropertyDamage         <- totalPropertyDamage[order(-totalPropertyDamage$propertyDamage),]
totalCropDamage             <- totalCropDamage[order(-totalCropDamage$cropDamage),]

## Compute some total + %ge damage levels by state & order by state
stateInjuries               <- aggregate(INJURIES ~ STATE, dataStorm, sum)
stateFatalities             <- aggregate(FATALITIES ~ STATE, dataStorm, sum)
statePropertyDamage         <- aggregate(propertyDamage ~ STATE, dataStorm, sum)
stateCropDamage             <- aggregate(cropDamage ~ STATE, dataStorm, sum)
stateInjuries$percent       <- round(stateInjuries[,2]*100/sum(stateInjuries[,2]),1)
stateFatalities$percent     <- round(stateFatalities[,2]*100/sum(stateFatalities[,2]),1)
statePropertyDamage$percent <- round(statePropertyDamage[,2]*100/sum(statePropertyDamage[,2]),1)
stateCropDamage$percent     <- round(stateCropDamage[,2]*100/sum(stateCropDamage[,2]),1)
stateInjuries               <- stateInjuries[order(stateInjuries$STATE),]
stateFatalities             <- stateFatalities[order(stateFatalities$STATE),]
statePropertyDamage         <- statePropertyDamage[order(statePropertyDamage$STATE),]
stateCropDamage             <- stateCropDamage[order(stateCropDamage$STATE),]
stateResult                 <- cbind(stateInjuries[1:3],stateFatalities[2:3],stateCropDamage[2:3],
                                                                                  statePropertyDamage[2:3])
names(stateResult)          <- c("State", "Injuries", "pcInj", "Fatalities", "pcFat", "CropDmg", "pcCD", 
                                                                                         "PropDmg", "pcPD")

Finally, summary figures presenting the results were prepared. Figure 1 shows the percentages of the total injuries and fatalities arising from the top rated five event types (injuries on the left and fatalities on the right), where the results are presented as bar charts expressing the percentage of the total attributable to each of the five event types. The topmost bar, in each case, shows the figure for all other event types (excluding the top five) for comparison purposes. The totals occurrences for injuries and fatalities, separately, are given in main titles of these charts. The following code chunk was used to prepare Figure 1.

# % Bar charts for injuries & fatalities
par(mfrow=c(1,2), mar=c(5, 7, 4, 2), mgp=c(2, 1, 0))
values                      <- c(totalInjuries[1:5,3],100-sum(totalInjuries[1:5,3]))
lbls                        <- c(totalInjuries[1:5,1], "All Other Types")
lbls                        <- paste(lbls, values) 
lbls                        <- paste(lbls,"%",sep="")
barplot(values, cex.main=0.6, main=paste("Causes of", sum(totalInjuries[,2]), 
            "total injuries.", sep=" "), xlim=c(0,40), xlab="%", cex.axis=0.5, 
                            cex.lab=0.6, cex.names=0.5, names.arg=lbls, horiz=TRUE, las=1)
values                      <- c(totalFatalities[1:5,3],100-sum(totalFatalities[1:5,3]))
lbls                        <- c(totalFatalities[1:5,1], "All Other Types")
lbls                        <- paste(lbls, values) 
lbls                        <- paste(lbls,"%",sep="")
barplot(values, cex.main=0.6, main=paste("Causes of", sum(totalFatalities[,2]), 
            "total fatalities.", sep=" "), xlim=c(0,40), xlab="%",cex.axis=0.5, 
                            cex.lab=0.6, cex.names=0.5, names.arg=lbls, horiz=TRUE, las=1)

Figure 1: Injuries & Fatalities

From Figure 1, it is immediately possible to see that the causal event type for the highest incidences of injuries was Tornado (at 35.6%) but for fatalities it was Excessive Heat (at 20.6%), across the 1996-2011 data set examined. By comparing injuries and fatalities bar charts, it can be seen that four of the top five contributing event types are common to both, namely Excessive Heat, Tornado, Lightning and High Wind. The other event types that appear in the top five are Flash Flood for fatalities and Flood for injuries. The top five event types are responsible for almost 75% of all recorded injuries. The top five event types account for close to 63% of all recorded fatalities.

Figure 2 presents the property and crop damage by causal event type, in each case as a bar chart. It shows damages arising from the five top rated event types (crop damage on the left and property damage on the right), where the results are presented as bar charts giving the totals in Billions of Dollars. The topmost bar, in each case, shows the amount for all other event types (excluding the top five) for comparison purposes. The combined totals for crop and property damage, separately, are given in the main titles of these bar charts. The following code chunk was used to prepare Figure 2.

# B$ Bar charts for crop & property damage
par(mfrow=c(1,2), mar=c(5, 7, 4, 2), mgp=c(2, 1, 0))
values                      <- c(totalCropDamage[1:5,2],sum(totalCropDamage[,2])-
                                                                sum(totalCropDamage[1:5,2]))
values                      <- round(values/10^9, 2)
lbls                        <- c(totalCropDamage[1:5,1], "All Other Types")
lbls                        <- paste(lbls, values) 
lbls                        <- paste(lbls,"B$",sep=" ")
barplot(values, main=paste("Causes of", round(sum(totalCropDamage[,2])/10^9, 2), 
                        "B$ crop damage.", sep=" "), xlab="B$", cex.axis=0.5, cex.lab=0.6, 
                                cex.main=0.6, cex.names=0.5, names.arg=lbls, horiz=TRUE, las=1)
values                      <- c(totalPropertyDamage[1:5,2],sum(totalPropertyDamage[,2])-
                                                            sum(totalPropertyDamage[1:5,2]))
values                      <- round(values/10^9, 2)
lbls                        <- c(totalPropertyDamage[1:5,1], "All Other Types")
lbls                        <- paste(lbls, values) 
lbls                        <- paste(lbls,"B$",sep=" ")
barplot(values, main=paste("Causes of", round(sum(totalPropertyDamage[,2])/10^9, 2), 
                        "B$ property damage.", sep=" "), xlab="B$", cex.axis=0.5, cex.lab=0.6, 
                                cex.main=0.6, cex.names=0.5, names.arg=lbls, horiz=TRUE, las=1)

Figure 2: Crop & Property Damage

From the titles in Figure 2 it is obvious that the overwhelming majority of the economic consequences arise from property damage rather then crop damage (approx. 252 vs. 35 Billion Dollars). It should be noted that these amounts, although large, are summed across the 15 year span of the data and so reflect an overall annual combined economic impact of approx. 19.1 Billion Dollars / year. By examining the bar charts, it is apparent that the top five event types for crop damage are Drought, Hurricane/Typhoon, Flood, Hail and Frost/Freeze. For property damage they are Hurricane/Typhoon, Storm Tide, Flood, Tornado and Flash Flood. It should be noted, though, that all of the top five causes of property damage exceed the top cause of crop damage in terms of economic impact.

Figure 3 shows the injuries, fatalities, crop damage and property damage as percentages of the totals of each broken down by location (at the $STATE level - NB the definition of “State” here includes the additional locations listed explicitly in the Data Processing section above). The code used for generating Figure 3 was that in the following code chunk.

# % Bar charts for damage by state
par(mfrow=c(4,1), mar=c(4, 4, 2, 2), mgp=c(2, 1, 0))
lbls                        <- stateResult$State
values                      <- stateResult$pcInj
barplot(values, cex.main=1.5, main="% Injuries by state", xlab="State", ylab="%", names.arg=lbls, 
                                            ylim= c(0,25), cex.axis=1.0, cex.lab=1.0, cex.names=0.67, las=1)
values                      <- stateResult$pcFat
barplot(values, cex.main=1.5, main="% Fatalities by state", xlab="State", ylab="%", names.arg=lbls, 
                                            ylim= c(0,25), cex.axis=1.0, cex.lab=1.0, cex.names=0.67, las=1)
values                      <- stateResult$pcCD
barplot(values, cex.main=1.5, main="% Crop damage by state", xlab="State", ylab="%", names.arg=lbls, 
                                            ylim= c(0,25), cex.axis=1.0, cex.lab=1.0, cex.names=0.67, las=1)
values                      <- stateResult$pcPD
barplot(values, cex.main=1.5, main="% Property damage by state", xlab="State", ylab="%", names.arg=lbls, 
                                            ylim= c(0,25), cex.axis=1.0, cex.lab=1.0, cex.names=0.67, las=1)

Figure 3: Damage by State

Comparing the individual location percentage contributions to the severe weather damage, shown in the bar charts in Figure 3, reveals immediately that one state is the top contributor for three of the four categories of damage, namely Texas for injuries, fatalities and crop damage. Texas is also in the top four states in terms of property damage. The only other state to make such a major contribution to damage in all four categories is Florida.

Conclusions

In terms of public health, improvements to high temperature, tornado/storm and flooding warnings and preparedness are clear priorities for reducing the incidence of both injuries and fatalities. In terms of economic consequences, property damage is the most important concern and improvement likely requires more work on early warnings and preparedness for hurricanes, storm tides, flooding and tornadoes. For crop damage, tornadoes are not a major concern (likely due to the relatively small footprint of a funnel vortex compared to a major storm event) but drought is, along with the other main causes of crop damage - hurricanes, floods, hail and frost/freezing conditions. It is also apparent that consideration should be given to localising these efforts at individual state level.