SYNOPSIS

This analysis of U.S. NOAA Storm Data from January 1950 through November 2011 proposes answers to the following two (2) questions:

DATA PROCESSING

Data processing consists of data loading, cleansing and analysis. The environment requires several packages (see code) for analysis and reporting.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(scales)

DATA LOAD

The NOAA Storm Data is downloaded, extracted and read into the analysis program.

stormDataURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
        
        temp <- tempfile()              ## Set up temporary file for download
        download.file(stormDataURL, destfile=temp)
        stormData_raw <- read.csv(temp, stringsAsFactors = FALSE)
        unlink(temp)                    ## Remove temporary file used in download
        
        str(stormData_raw)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

DATA CLEANSING

The following steps are taken to clean (“tidy”) the dataset before analysis activities begin.

  1. Variables (columns) not required to answer the questions are removed from the dataset. 37 columns are reduced to 10 columns for analysis.

  2. The beginning date of the storm event (BEGN_DATE) is converted to date format. The field “event_year” is created from the BEGN_DATE field.

stormData <- select(stormData_raw,BGN_DATE, TIME_ZONE, 
                          STATE, 
                          EVTYPE, 
                          FATALITIES, INJURIES, 
                          PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
        
        stormData$BGN_DATE <- as.Date(stormData$BGN_DATE,format="%m/%d/%Y")
        stormData$event_year <- year(stormData$BGN_DATE)
  1. The storm event variable (‘EVTYPE’) which classifies the storm event is the key field used to anchor our analysis to answer the questions. NOAA requires storm events to be classified by one of 48 values. However, due to poor data quality control, reported data has over 985 different classifications. Issues include miss-spelling and not following NOAA standards, resulting in many “one-off” values.

The following actions were taken to normalize ‘EVTYPE’ variable including:

  • Convert ‘EVTYPE’ variable to uppercase values and stored in the “event_type” field
  • Eliminate records what have “SUMMARY” in the ‘EVTYPE’ value because it’s unknown
  • Converge miss-spellings and non-standard to standard usage. If multiple “events” are listed, then the 1st event is used for standardization.

NOTE: The original count of 985 ‘EVTYPE’ have been consolidated to 165 values Detailed transformations can be seen in the R-code below:

stormData$event_type <- toupper(stormData$EVTYPE)
        stormData <- stormData[!grepl("SUMMARY", stormData$event_type),]
        stormData$event_type[grep("THUNDERSTORM*|THUNER*|THUD*|*MICROBURST*|*DOWNBURST*|*MICOBURST*|
                                  *GUSTNADO*|*TSTM*",stormData$event_type)] <- "THUNDERSTORM.WINDS"
        stormData$event_type[grep("*TROPICAL STORM*",stormData$event_type)] <- "TROPICAL.STORM"
        stormData$event_type[grep("*TORNADO*|*TORNDAO*",stormData$event_type)] <- "TORNADO"
        stormData$event_type[grep("*HIGH WINDS*|* WIND*|*WIND AND WAVE*|*WIND DAMAGE*|*WIND GUSTS*|
                                  *WIND STORM*|*WINDS*|^WND*",stormData$event_type)] <- "HIGH.WIND"
        stormData$event_type[grep("^HURRICANE*|*REMNANTS OF FLOYD*|*TYPHOON*", stormData$event_type)] <- "HURRICANE(TYPHOON)"
        stormData$event_type[grep("*HAIL*",stormData$event_type)] <- "HAIL"
        stormData$event_type[grep("HEAVY PRECIPATATION*|*HEAVY PRECIPITATION*|*HEAVY RAIN*|*HEAVY SHOWER*|*HVY RAIN*|
                                  *EXCESSIVE RAIN*|*EXCESSIVE RAINFALL*|EXCESSIVE WETNESS*|EXTREMELY WET*|
                                  *RECORD PRECIPITATION*|*RAINFALL*|*RAINSTORM*|*RAIN*",
                                  stormData$event_type)] <- "HEAVY.RAIN"
      
         ## THE TRANSFORMATION OF 'FLOOD' ITEMS MUST BE IN THE SEQUENCE BELOW
        
        stormData$event_type[grep("*FLASH FLOOD*|*FLOOD FLASH*|*FLOOD/FLASH*|*FLASHFLOOD*|
                                  *URBAN/SML*|*URBAN AND SMALL*|*URBAN SMALL*|*URBAN/SMALL*|
                                  *SMALL STREAM*",stormData$event_type)] <- "FLASHFL--D"
        stormData$event_type[grep("*COASTAL FLOOD*|*BEACH FLOOD*|*COASTAL FLOODING*|*COASTAL  FLOODING*|
                             *COASTAL/TIDAL FLOOD*|*COASTALFLOOD*|*COASTAL EROSION*|*CSTL FLOODING/EROSION*",
                                  stormData$event_type)] <- "COASTALFL--D"
        stormData$event_type[grep("*FLOOD*",stormData$event_type)] <- "FLOOD"
        stormData$event_type[grep("*FLASHFL--D*",stormData$event_type)] <- "FLASH.FLOOD"
        stormData$event_type[grep("*COASTALFL--D*",stormData$event_type)] <- "COASTAL.FLOOD"
        
        stormData$event_type[grep("*HEAVY SNOW*|*HEAVY WET SNOW*|*RECORD SNOW*|*SNOW*",stormData$event_type)] <- "HEAVY.SNOW"
        stormData$event_type[grep("*BLIZZARD*",stormData$event_type)] <- "BLIZZARD"
        stormData$event_type[grep("*COLD*|*COOL*|*RECORD LOW*|*LOW TEMPERATURE*|*EXTREME WIND CHILL*|^EXTREME COLD*|^EXTREME/RECORD COLD*|
                                  ^EXCESSIVE COLD*|^EXCESSIVE COLD*|^EXTENDED COLD*|^UNSEASONABLE COLD*|^UNSEASONABLY COLD*|
                                  ^UNUSUALLY COLD*|^WIND CHILL|*FREEZE*|*UNSEASONAL LOW*|*LOW TEMPERATURE*", stormData$event_type)] <- "COLD/WIND.CHILL"
        stormData$event_type[grep("*SLEET*|*FREEZING*",stormData$event_type)] <- "SLEET" 
        stormData$event_type[grep("*HEAT*|*RECORD HIGH*|*WARM*|*HIGH TEMPERATURE*|*RECORD TEMPERATURE*|*HOT *",stormData$event_type)] <- "HEAT" 
        stormData$event_type[grep("*LIGHTNING*|*LIGHTING*|*LIGNTNING*",stormData$event_type)] <- "LIGHTNING"
        stormData$event_type[grep("*FUNNEL*",stormData$event_type)] <- "FUNNEL.CLOUD"
        stormData$event_type[grep("*WATER SPOUT*|*WATERSPOUT*|*WAYTERSPOUT*",stormData$event_type)] <- "WATERSPOUT"
        stormData$event_type[grep("*VOLCANIC*",stormData$event_type)] <- "VOLCANIC.ASH"
        stormData$event_type[grep("*GLAZE*|*ICE STORM*|*ICY ROADS*|*PATCHY ICE*",stormData$event_type)] <- "ICE.STORM"
        stormData$event_type[grep("*DUST DEVIL*|*DUST DEVEL*",stormData$event_type)] <- "DUST.DEVIL"
        stormData$event_type[grep("*DUST STORM*|*DUSTSTORM*",stormData$event_type)] <- "DUST.STORM"

ANALYSIS

Analysis begins with preparation of 2 metrics which can be used to identify the high impact storm events: economic impact and population health impact. A summary table is prepared to generate reports and graphs.

  1. Property and crop damage estimates are summed to provide an economic impact value. Standard U.S. dollar property damage and crop damage values are calculated using the analysis described in How to Handle Exponent Value of PROPDMGEXP and CROPDMGEXP provided by the class mentor in the Reproducible Research Week 4 Group Forum. This is to resolve inconsistencies in coding within the source file.
       exp_Code <- c("H","h","K","k","M","m","B","b",
                      "+","-","?","0","1","2","3","4","5","6","7","8",""," ")
        exp_Value <- c(100,100,1000,1000,1000000,1000000,1000000000,1000000000,
                       1,0,0,10,10,10,10,10,10,10,10,10,0,0)
        exp_Table <- data.frame(exp_Code,exp_Value)
        
        stormData$prty_damage <- stormData$PROPDMG *
                exp_Table[match(stormData$PROPDMGEXP,exp_Table$exp_Code),"exp_Value"]
        stormData$crop_damage <- stormData$CROPDMG *
                exp_Table[match(stormData$CROPDMGEXP,exp_Table$exp_Code),"exp_Value"]
        stormData$economic_impact <- stormData$prty_damage + stormData$crop_damage
  1. To determine which events are harmful with respect to population health, a modified Fatalities and Weighted Injuries (mFWI) index is calculated. FWI, which is used in the UK, takes the number of non-fatal injuries and applies a weighting factor to determine a number of “fatality equivalents”. The UK calculation assumes being able to distinguish between minor and major injuries. Since we do not have this level of detail, mFWI will assume 600 injuries as equivalent to one fatality. See RailStaff article for more information.
       stormData$mFWI = (600*stormData$FATALITIES) + (stormData$INJURIES)
  1. A summary table for reporting is prepared:

    *  Storm data is grouped by the normalized 'event_type'
    
    *  Economic impact and modified Fatality/Weighted Injuries index are summed 
stormSummary <- stormData %>% group_by(event_type) %>%
                        summarize(economic_impact = sum(economic_impact),
                                  mFWI = sum(mFWI))
## `summarise()` ungrouping output (override with `.groups` argument)

RESULTS

To answer the 2 questions, summary storm data is sorted in descending order by the key metric to identify the “top 10” impacting storm events.

POPULATION HEALTH IMPACT

As indicated in the graph below which uses a modified Fatality/Weighted Injury index to assign impact level, tornadoes have the greatest impact on population health due to number of deaths and injuries across the United States.

top10rank_h <- stormSummary[order(-stormSummary$mFWI, stormSummary$mFWI),] 

top10rank_h <- top10rank_h[1:10,]

ggplot(data = top10rank_h, aes(x=reorder(event_type,mFWI), mFWI,fill=event_type, main="1950-2011 Top 10 Storm Events Impacting Population Health")) +
        geom_bar(stat = "identity", show.legend=FALSE) +
        geom_text(aes(label=mFWI, hjust = +0.25)) +
        coord_flip() + 
        scale_y_continuous(name="Fatality/Weighted Injury Calculation", labels=comma) +
        scale_x_discrete(name="Storm Event")

ECONOMIC IMPACT

As indicated in the graph below which uses the combination of property and crop damage assessment, floods have the greatest economic impact. In the period 1950-2011, floods have caused over $160 billion in damage.

top10rank_e <- stormSummary[order(-stormSummary$economic_impact, stormSummary$economic_impact),] 

top10rank_e <- top10rank_e[1:10,]

ggplot(data = top10rank_e, aes(x=reorder(event_type,economic_impact), 
                               economic_impact/1000000,fill=event_type,
                               main="1950-2011 Economic Impact from 10 Storm Events")) +
        geom_bar(stat = "identity", show.legend=FALSE) +
        
        geom_text(aes(label=sprintf("$%.2f", economic_impact/1000000), hjust = +0.25)) +
        coord_flip() + 
        scale_y_continuous(name="Economic Impact in USD (millions)", label=dollar_format()) +
        scale_x_discrete(name="Storm Event")