Synopsis

The focus of this report is to determine the weather event types that result in the largest impacts on public health and economic consequences. The analysis utilizes the storm database from the U.S. National Oceanic and Atmospheric Administration (NOAA) which contains estimates of fatalities, injuries, property damage and agricultural crop damage between the years of 1950 and 2010. The results of the analysis show that fatalities and injuries are most impacted by heat, tornado and flood events and property and crop damage are most impacted by hail, flood, lightning and thunderstorm events.

Methods (Loading and Processing Data)

library(dplyr)

The raw data file was downloaded and read from the bz2 archive. It comprised 902297 rows and 37 variables. The weather event types included 985 unique values.

fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
bz2file <- "repdata_data_StormData.csv.bz2"
if (!file.exists(bz2file))
    download.file(fileURL, bz2file, method="curl")
datStorm <- read.csv(bz2file)

dim(datStorm)
## [1] 902297     37
length(unique(datStorm$EVTYPE))
## [1] 985

The beginning date of the weather events was converted to POSIXlt format and a YEAR column was created. A table was created and displayed using a bar plot showing the number of events recorded by year between 1950 and 2010. Examination of the plot demonstrated that there were significantly fewer events recorded during the first 4 decades. The dataset was filtered to include only dates from 1995 to 2010.

datStorm$BGN_DATE <- as.POSIXlt(sub(" 0:00:00", "", datStorm$BGN_DATE), format="%m/%d/%Y")
datStorm$YEAR <- 1900 + datStorm$BGN_DATE$year

eventsbyyear <- as.data.frame(table(datStorm$YEAR)) %>% rename(YEAR=Var1)
barplot(Freq~YEAR, data=eventsbyyear, axis.lty=1, ylab="Number of Events Recorded", main="Storm Events by Year")

datStorm <- filter(datStorm, YEAR>=1995)

The columns were limited to the variables of interest for evaluating health and economic consequences and the rows were limited to those with non-zero entries in these variables (NOTE - I retained the BGN_DATE and YEAR columns to facilitate potential future evaluation of changing events and impacts over time). Limiting to these dates, relevant columns and non-zero measures resulted in a reduction of the dataset to 211775 rows, 9 variables and 372 unique values for event type.

datStorm <- select(datStorm, BGN_DATE, YEAR, EVTYPE, FATALITIES:CROPDMGEXP)
datStorm <- filter(datStorm, FATALITIES>0 | INJURIES>0 | PROPDMG>0 | CROPDMG>0)
dim(datStorm)
## [1] 211775      9
length(unique(datStorm$EVTYPE))
## [1] 372

The accompanying code book indicates that the PROPDMGEXP and CROPDMGEXP columns should only contain optional values such of K, M, B representing a scaling factor for the damage estimate of 10^3, 10^6, 10^9, respectively. Examination of the values in these columns reveals some unapproved codes.

unique(datStorm$PROPDMGEXP)
##  [1] "B" "M" "K" "m" ""  "+" "5" "0" "6" "4" "2" "7" "3" "H" "-"
unique(datStorm$CROPDMGEXP)
## [1] "M" ""  "m" "K" "B" "?" "k" "0"

I allowed lower case correlates of the allowed codes and calculated the number of rows with improper codes, rows with non-zero values and no associated scaling factor codes and the overall percentage ‘bad’ rows. This represented 0.15% of the rows but were removed to facilitated subsequent calculations using the scaling factor (the criteria for row retention was the corollary of that utilized to identiy the ‘bad’ rows).

allowedCodes <- c("K","M","B","k","m","b","")
nrowDamage <- nrow(datStorm[datStorm$PROPDMG>0 | datStorm$CROPDMG>0,])
nrowWrongCode <- nrow(datStorm[!(datStorm$PROPDMGEXP %in% allowedCodes) |
                               !(datStorm$CROPDMGEXP %in% allowedCodes), ])
nrowMissingCode <- nrow(datStorm[(datStorm$PROPDMG>0 & datStorm$PROPDMGEXP=="") |
                                 (datStorm$CROPDMG>0 & datStorm$CROPDMGEXP==""),])
pctBadRows = round((nrowWrongCode + nrowMissingCode) / nrowDamage * 100, digits=2)

paste("Rows with non-standard codes in PROPDMGEXP or CROPDMGEXP:", nrowWrongCode)
## [1] "Rows with non-standard codes in PROPDMGEXP or CROPDMGEXP: 232"
paste("Rows with non-zero damage estimates and no code in PROPDMGEXP or CROPDMGEXP:", nrowMissingCode)
## [1] "Rows with non-zero damage estimates and no code in PROPDMGEXP or CROPDMGEXP: 67"
paste0("The percentage of 'bad' rows was ", pctBadRows, "%")
## [1] "The percentage of 'bad' rows was 0.15%"
datStorm <- filter(datStorm, PROPDMGEXP %in% allowedCodes & 
                             CROPDMGEXP %in% allowedCodes,
                             !(PROPDMG>0 & PROPDMGEXP=="" |
                               CROPDMG>0 & CROPDMGEXP==""))

The scaling factor was utilized to calculate the actual property and crop damage in dollars and added to new columns PROPDMGADJ and CROPDMGADJ. A small function adjustDMG was created to perform the repeated scaling factor calculation. A loop cycled through the rows of the dataset applying the adjustDMG function to the property and crop damage columns.

adjustDMG <- function(dmg, exp) {
    if (exp=="K" | exp=="k") dmg * 1000
    else if (exp=="M" | exp=="m") dmg * 1000000
    else if (exp=="B" | exp=="b") dmg * 1000000000
    else dmg
}

##system.time({
    for (i in 1:nrow(datStorm)) {
        if (datStorm$PROPDMG[i]>0)
            datStorm$PROPDMGADJ[i] <- adjustDMG(datStorm$PROPDMG[i], datStorm$PROPDMGEXP[i])
        if (datStorm$CROPDMG[i]>0)
            datStorm$CROPDMGADJ[i] <- adjustDMG(datStorm$CROPDMG[i], datStorm$CROPDMGEXP[i])
    }
##})

In the raw dataset there were 985 unique values recorded for weather event types and review of the unique values ‘off-line’ revealed many variations of similar or identical events. Normalizing all these values would be a difficult task and thus I explored applying only to those in the TOP 10 event types for each measure. A small function createSummary to aggregate a measure column (i.e. FATALITIES) on the event type (EVTYPE) column, arrange the measure value in descending order, convert EVTYPE to a factor and add a rank column. The function was designed to be flexible and allow the TOP n rows to be returned.

createSummary <- function(df, x, y, n) {
    df <- aggregate(y~x, data=df, FUN=sum) %>% arrange(desc(y))
    df <- df[1:n,]
    df$x <- as.factor(df$x)
    df <- rename(df, EVTYPE=x, VALUE=y)
    df$RANK <- c(1:n)
    df
}

createSummary(datStorm, datStorm$EVTYPE, datStorm$FATALITIES, 10)
##            EVTYPE VALUE RANK
## 1  EXCESSIVE HEAT  1903    1
## 2         TORNADO  1542    2
## 3     FLASH FLOOD   934    3
## 4            HEAT   924    4
## 5       LIGHTNING   729    5
## 6           FLOOD   423    6
## 7     RIP CURRENT   360    7
## 8       TSTM WIND   241    8
## 9       HIGH WIND   239    9
## 10      AVALANCHE   223   10
createSummary(datStorm, datStorm$EVTYPE, datStorm$INJURIES, 10)
##               EVTYPE VALUE RANK
## 1            TORNADO 21740    1
## 2              FLOOD  6769    2
## 3     EXCESSIVE HEAT  6525    3
## 4          LIGHTNING  4631    4
## 5          TSTM WIND  3630    5
## 6               HEAT  2030    6
## 7        FLASH FLOOD  1734    7
## 8  THUNDERSTORM WIND  1426    8
## 9       WINTER STORM  1298    9
## 10 HURRICANE/TYPHOON  1275   10
createSummary(datStorm, datStorm$EVTYPE, datStorm$PROPDMGADJ, 10)
##               EVTYPE        VALUE RANK
## 1               HAIL 296745716320    1
## 2          LIGHTNING 268687265080    2
## 3              FLOOD 188422037050    3
## 4          TSTM WIND 148882361440    4
## 5            TORNADO  78625549460    5
## 6  HURRICANE/TYPHOON  69605840000    6
## 7     EXCESSIVE HEAT  67407753700    7
## 8        FLASH FLOOD  53665932310    8
## 9        STORM SURGE  43593536000    9
## 10 THUNDERSTORM WIND  41199281740   10
createSummary(datStorm, datStorm$EVTYPE, datStorm$CROPDMGADJ, 10)
##                EVTYPE        VALUE RANK
## 1           TSTM WIND 585113947350    1
## 2   THUNDERSTORM WIND 422804354000    2
## 3         FLASH FLOOD 180283315000    3
## 4                HAIL 156779127050    4
## 5           LIGHTNING 119180766040    5
## 6             TORNADO 117046595610    6
## 7               FLOOD  86082810400    7
## 8           HIGH WIND  52783561300    8
## 9  THUNDERSTORM WINDS  46900165400    9
## 10        STRONG WIND  32804953500   10

Using text matching criteria the values for event type (EVTYPE) were reassigned to normalize the event types for those in the TOP 10. This reduced the number of unique event types to 261. Note this process can be performed repeatedly and/or with the TOP n values for event types.

datStorm[grepl("HURRICANE",datStorm$EVTYPE),"EVTYPE"] <- "HURRICANE"
datStorm[grepl("THUNDERSTORM|TSTM",datStorm$EVTYPE),"EVTYPE"] <- "THUNDERSTORM"
datStorm[grepl("FLOOD",datStorm$EVTYPE),"EVTYPE"] <- "FLOOD"
datStorm[grepl("RIP CURRENT",datStorm$EVTYPE),"EVTYPE"] <- "RIP CURRENT"
datStorm[grepl("STORM SURGE",datStorm$EVTYPE),"EVTYPE"] <- "STORM SURGE"
datStorm[grepl("HEAT",datStorm$EVTYPE),"EVTYPE"] <- "HEAT"

length(unique(datStorm$EVTYPE))
## [1] 261

##Results After partial normalization of event types the dataset is passed to the createSummary function limited to the TOP 10 events for each measure. Again this and the subsequent bar plot is flexible and can include the TOP n event types.

FATALbyEVTYPE <- createSummary(datStorm, datStorm$EVTYPE, datStorm$FATALITIES, 10)
INJURbyEVTYPE <- createSummary(datStorm, datStorm$EVTYPE, datStorm$INJURIES, 10)
PROPDMGbyEVTYPE <- createSummary(datStorm, datStorm$EVTYPE, datStorm$PROPDMGADJ, 10)
CROPDMGbyEVTYPE <- createSummary(datStorm, datStorm$EVTYPE, datStorm$CROPDMGADJ, 10)

Bar plot showing the weather event types with the most impact on public health (fatalities and injuries).

par(mfrow=c(1,2), mar=c(5,8,2,1), oma=c(0,0,1,0))
with(FATALbyEVTYPE, barplot(VALUE~RANK, horiz=TRUE, las=2, names.arg=EVTYPE,
                                 xlab="", ylab="", main="Fatalities"))
with(INJURbyEVTYPE, barplot(VALUE~RANK, horiz=TRUE, las=2, names.arg=EVTYPE,
                               xlab="", ylab="", main="Injuries"))

Bar plot showing the weather event types with the most economic impact (property and crop damage).

par(mfrow=c(1,2), mar=c(5,8,2,1), oma=c(0,0,1,0))
with(PROPDMGbyEVTYPE, barplot(VALUE/10^9~RANK, horiz=TRUE, las=2, names.arg=EVTYPE,
                              xlab="Billions of $", ylab="", 
                              main="Estimated Property Damage"))
with(CROPDMGbyEVTYPE, barplot(VALUE/10^9~RANK, horiz=TRUE, las=2, names.arg=EVTYPE,
                              xlab="Billions of $", ylab="", 
                              main="Estimated Crop Damage"))