Reproducible Research

Peer Assessment 2

This R program sorts through the National Weather Service's storm data to find which types of events cause the most damage, to people and to property.


Synopsis

We will see that, according to the National Weather Service's data, heavy surf and wind cause the most damage in both human and financial terms. The next-ranked event for the human toll is flood/rain/winds, while the next-ranked event for the financial toll is microburst winds.


Data Processing

First, we download the bzip2-compressed CSV file and read it into a data frame with

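# Download the compressed storm data to a temporary file.
# mode = "wb" keeps the binary bz2 archive intact on Windows.
f <- tempfile(fileext = ".csv.bz2")
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", 
    f, mode = "wb")
# stringsAsFactors = TRUE yields the factor columns shown in the str() output on R >= 4.0.
srcData <- read.csv(bzfile(f), stringsAsFactors = TRUE)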

Let us inspect what we have with the str ('structure') function.

str(srcData)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Four of these variables are of particular interest:

  1. FATALITIES
  2. INJURIES
  3. PROPDMG
  4. CROPDMG

Next, we can condense the cost to humans into a single score. I am going to brashly assume that a fatality is 10 times as bad as an injury.

augmented_data <- cbind(srcData, 10 * srcData$FATALITIES + srcData$INJURIES)
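To make the weighting concrete, here is a small worked example on made-up numbers. The `toy` data frame and its values are hypothetical and are not taken from the storm data; only the 10-to-1 weighting comes from the text above.

```r
# Hypothetical toy records, not taken from the storm data
toy <- data.frame(FATALITIES = c(0, 1, 2), INJURIES = c(15, 14, 0))

# Each fatality counts as much as 10 injuries
toy$HUMANTOLL <- 10 * toy$FATALITIES + toy$INJURIES

print(toy)
```

Under this weighting, the third record (2 fatalities, no injuries) scores 20 and so outranks the first record (15 injuries, no fatalities), which scores 15.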

Similarly, we can set up a column for the financial toll, taken here as the sum of the recorded property and crop damage figures.

augmented_data <- cbind(augmented_data, srcData$PROPDMG + srcData$CROPDMG)
colnames(augmented_data) <- c(colnames(srcData), "HUMANTOLL", "FINANCIALTOLL")
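The sum above adds PROPDMG and CROPDMG as raw numbers. The data set also carries PROPDMGEXP and CROPDMGEXP columns, which appear to encode the magnitude of each figure. The following is only a sketch of how such a scaling could be applied, assuming that K, M and B stand for thousands, millions and billions. The `toy` data frame, the `multiplier` lookup and the `PROPDMG_USD` column are hypothetical names introduced for illustration, and treating every other code as a factor of 1 is a simplification.

```r
# Hypothetical sample rows; the values are made up for illustration
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1, 5),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)

# Assumed meaning of the codes: K = thousand, M = million, B = billion
multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Look up each code; unknown or blank codes come back as NA
mult <- unname(multiplier[toupper(toy$PROPDMGEXP)])

# Simplification: treat any unrecognised code as a factor of 1
mult[is.na(mult)] <- 1

toy$PROPDMG_USD <- toy$PROPDMG * mult
print(toy)
```

Whether and how to interpret the remaining codes (digits, "?", "+", "-") is a separate decision that this sketch does not attempt to make.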

Results

We will now add up the data according to the different types of storm events.

human_toll_by_event <- by(augmented_data$HUMANTOLL, augmented_data$EVTYPE, sum)
financial_toll_by_event <- by(augmented_data$FINANCIALTOLL, augmented_data$EVTYPE, 
    sum)
smaller_data_frame <- data.frame(unique(srcData$EVTYPE), as.numeric(human_toll_by_event))
smaller_data_frame <- cbind(smaller_data_frame, as.numeric(financial_toll_by_event))
colnames(smaller_data_frame) <- c("EVTYPE", "HUMANTOLL", "FINANCIALTOLL")
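It is worth verifying that labels and totals line up in the step above: by() returns its sums in the order of the factor levels of EVTYPE, whereas unique() returns values in order of first appearance, and the two orders need not match. A minimal sketch on made-up data (the `toy`, `totals` and `safe` names and all values are hypothetical) shows the difference and takes the labels from the by() result itself:

```r
# Hypothetical toy data: event types appear in non-alphabetical order
toy <- data.frame(
  EVTYPE = factor(c("TORNADO", "FLOOD", "TORNADO", "HAIL")),
  TOLL   = c(10, 1, 5, 2)
)

totals <- by(toy$TOLL, toy$EVTYPE, sum)

# Order used by by(): the factor levels (alphabetical here)
print(names(totals))

# Order used by unique(): first appearance in the data
print(as.character(unique(toy$EVTYPE)))

# Taking the labels from the result keeps each total with its own event type
safe <- data.frame(EVTYPE = names(totals), TOLL = as.numeric(totals))
print(safe)
```

In this toy case, pairing unique() with the totals would attach the FLOOD total to the TORNADO label, so taking the labels from names() is the safer choice.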

After sorting this smaller data frame, we will plot only the most significant events (rather than all 985 of them) that affect humans directly.

smaller_data_frame <- smaller_data_frame[order(-smaller_data_frame$HUMANTOLL), 
    ]
plot(smaller_data_frame[1:5, 2], main = "Weather vs. Man", xaxt = "n", xlab = "Weather Event", 
    ylab = "Human Toll")
axis(side = 1, at = 1:5, labels = as.character(smaller_data_frame[1:5, 1]))

[Figure: "Weather vs. Man", human toll of the five highest-ranked event types]

as.character(smaller_data_frame[1:5, 1])
## [1] "Heavy surf and wind" "FLOOD/RAIN/WINDS"    "RECORD COLD/FROST"  
## [4] "ABNORMAL WARMTH"     "MICROBURST WINDS"

Finally, we can look at which weather events cost us the most financially.

smaller_data_frame <- smaller_data_frame[order(-smaller_data_frame$FINANCIALTOLL), 
    ]
# Column 3 holds FINANCIALTOLL (column 2 is HUMANTOLL)
plot(smaller_data_frame[1:5, 3], main = "Weather vs. Money", xaxt = "n", xlab = "Weather Event", 
    ylab = "Financial Toll")
axis(side = 1, at = 1:5, labels = as.character(smaller_data_frame[1:5, 1]))

[Figure: "Weather vs. Money", the five event types ranked highest by financial toll]

as.character(smaller_data_frame[1:5, 1])
## [1] "Heavy surf and wind" "MICROBURST WINDS"    "ABNORMAL WARMTH"    
## [4] "COLD WAVE"           "HIGH WINDS 73"