Preface

This document is authored in RMarkdown language. The codes used are in R programming language and the transformation from the RMarkdown to the current document format is via R-package knitr. You will need the necessary R programming setup to rerun the RMarkdown file.

This document is to fulfill the requirement for the Reproductible Research - Peer Assignment 2. It is a course offered by John Hopkins University Bloomberg School of Public Health on Coursera.

Initialization codes

The following codes initialize the session. It loads the required library, set the number of events to be displayed and define a simple function to convert suffix into number.

## Initialization process ...
# loading libraries

require(dplyr)
require(knitr)

## basic configuration setting

nb_top_event = 20

## Simple function to convert the exponential code (K,M,B) to equivalent decimal. 
## Note : This function return 1 to any other values as it is not defined in the
## NATIONAL WEATHER SERVICE INSTRUCTION. This is because the value returned by this function will be multipled by something else. 1 is multiplication neutral.
#
## Extract from the NATIONAL WEATHER SERVICE INSTRUCTION 10-1605 [pd01016005curr.pdf | pg 12/97]
# ----------------------------------------------------
# Estimates should be rounded to three significant digits, followed by an alphabetical 
# character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. 
# Alphabetical characters used to signify magnitude include 
# "K" for thousands, "M" for millions, and "B" for billions. 
# If additional precision is available, it may be provided in the narrative part of the entry.
# ----------------------------------------------------

multipler <- function(val) {
    if(val=="" ) { # to speedup the function since most values are empty
        mcoef <- 1
    } else if(val=="k" | val=="K") {
        mcoef <- 1000
    } else if (val=="m" | val =="M") {
        mcoef <- 1000000
    } else if (val=="b" | val =="B") {
        mcoef <- 1000000000
    } else { 
        mcoef <- 1
    }
    mcoef
}

Data Analysis Synopsis

We aim to understand and answer some basic questions about severe weather events.

Storm Data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database is used to understand the impact to each storm type to the human health together with the economic consequences.

This report load the full database but uses only the top 20 events to understand the harm inflicted to the population health and top 20 events to show the economic consequenses.

The top events for the harm caused are aggregated and sorted in descending order sequenced by total fatalities, total injuries, total fatal events and total injury events.

The top events for the economic damages are aggregated and sorted in descending order sequenced by total damages to property and crops, total event with property damages and total events with crop damages

We observe that tornado and excessive heat harm the population the most whereas flood and hurricane/typhoon cause the most economic damages to the country.

Tornado causes far more human loss compared to other events.

Economic lost due to property damages is much higher compared to crop damages.

We also show the average loss per event as additional information to be potentially considered in the decision making.

Analysis objective

This analysis is made to address the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

This report is intended for a government or municipal manager who might be responsible for preparing for severe weather events and will need to prioritize resources for different types of events. However, this document does not make any specific recommendations.

Data Preprocessing

The datafile is loaded directly from the website. The bzip2 compressed file is read directly using read.csv. The dataset will need much more cleansing if more fields are to be used.

Data loading

bzstormfile <- "stormdata.bz2" # just for file name consistency

if(!file.exists(bzstormfile)) { # no need to download everytime we run
   download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "bzstormfile")
}

#read.csv can read bzfiles directly. Hooray!

storm <- read.csv(bzstormfile)
dstorm <- tbl_df(storm)
warning("BGN_TIME and END_TIME format are not consistent across observation. Needs cleanup if using it")
## Warning: BGN_TIME and END_TIME format are not consistent across
## observation. Needs cleanup if using it
# if need to to convert ABCD format into AB:CD format, use the commands below
# xtime = gsub('^([0-9]{2})([0-9]+)$', '\\1:\\2',xtime)
# dstorm <- mutate(dstorm, start_date = mdy_hms(BGN_DATE), stop_date = mdy_hms(END_DATE), ....)

Data contained in file

str(storm)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Fields indicating harm to health

  1. FATALITIES
  2. INJURIES

Fields indicating economic consequences

  1. PROPDMG
  2. PROPDMGEXP
  3. CROPDMG
  4. CROPDMGEXP

Special note

The data within PROPDMGEXP and CROPDMGEXP are dirty as it has more than the (K,M,B) values as shown below. We will use the function multipler defined above to translate the exponential code into numeric value and use 1 for the invalid values

table(storm$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330
table(storm$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

Data extraction

Only the relevant fields are extracted from the full dataset.Damage Exponent values are transformed accordingly using the function multiplier defined above. The data is grouped by Event Type (EVTYPE)

# Selecting necessary columns only and 
# insert real value of damages to properties and crop
# add more needed colum into storm_damage if more detailed analysis is needed

storm_damage <- dstorm %>% 
                select(STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>% 
                mutate(PROP_MULTIPLER = sapply(PROPDMGEXP, multipler),
                       PROP_DAMAGE_VALUE=PROPDMG*PROP_MULTIPLER, 
                       CROP_MULTIPLER = sapply(CROPDMGEXP, multipler),
                       CROP_DAMAGE_VALUE=CROPDMG*CROP_MULTIPLER
                      ) %>%
                group_by(EVTYPE) 

storm_damage
## Source: local data frame [902,297 x 12]
## Groups: EVTYPE
## 
##    STATE  EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1     AL TORNADO          0       15    25.0          K       0           
## 2     AL TORNADO          0        0     2.5          K       0           
## 3     AL TORNADO          0        2    25.0          K       0           
## 4     AL TORNADO          0        2     2.5          K       0           
## 5     AL TORNADO          0        2     2.5          K       0           
## 6     AL TORNADO          0        6     2.5          K       0           
## 7     AL TORNADO          0        1     2.5          K       0           
## 8     AL TORNADO          0        0     2.5          K       0           
## 9     AL TORNADO          1       14    25.0          K       0           
## 10    AL TORNADO          0        0    25.0          K       0           
## ..   ...     ...        ...      ...     ...        ...     ...        ...
## Variables not shown: PROP_MULTIPLER (dbl), PROP_DAMAGE_VALUE (dbl),
##   CROP_MULTIPLER (dbl), CROP_DAMAGE_VALUE (dbl)

Results

Harm to the population health

From the reduced dataset storm_damage, we summarize the harm inflicted by each of the event type based on

  1. total fatalities
  2. total injuries
  3. number of fatal events
  4. number of events with injury

Only the top 20 events are selected based on the order enumerated above.

Fatalities

harmful_event <- storm_damage %>% 
                summarise(total_fatalities = sum(FATALITIES),
                          total_fatal_event=sum(FATALITIES>0),
                          total_injuries = sum(INJURIES),
                          total_injury_event=sum(INJURIES>0), 
                          average_fatalities_per_event = total_fatalities/total_fatal_event, 
                          average_injuries_per_event = total_injuries/total_injury_event
                          ) %>%
                arrange(desc(total_fatalities), 
                        desc(total_injuries), 
                        desc(total_fatal_event), 
                        desc(total_injury_event)
                        )

top_harmful_event <-  harmful_event[1:nb_top_event,]

The fatalities is extremely high for TORNADO followed by EXCESSIVE HEAT, FLASH FLOOD and HEAT.

par(mar=c(4,14,8,4), cex=0.75, las=1)
x_max = max(top_harmful_event$total_fatalities);

barplot(top_harmful_event$total_fatalities, 
        horiz=TRUE, 
        names.arg = top_harmful_event$EVTYPE, 
        xlim = c(0,ceiling(x_max/10^floor(log10(x_max)))*10^floor(log10(x_max))), 
        ylim=c(0,nb_top_event), 
        main=paste0("Total Fatalities of the top ", nb_top_event, " Event Type"), 
        xlab ="Total Fatalities", 
        ylab = ""
    )

Injuries

TORNADO causes the most injuries. The other events causes much less injuries compared to TORNADO.

x_max = max(top_harmful_event$total_fatalities);

par(mar=c(4,14,8,4), cex=0.75, las=1)

x_max = max(top_harmful_event$total_injuries);

barplot(top_harmful_event$total_injuries, 
        horiz=TRUE, 
        names.arg = top_harmful_event$EVTYPE, 
        xlim = c(0, ceiling(x_max/10^floor(log10(x_max)))*10^floor(log10(x_max))), 
        ylim=c(0,nb_top_event), 
        main=paste0("Total Injuries of the top ", nb_top_event, " Event Type"), 
        xlab ="Total Injuries", 
        ylab = ""
    )

Fatalities and injuries per event

The table below shows

  1. the average fatalities per event type, and
  2. the average injuries per event type

Only the top 20 events are selected based on the order enumerated above.

# rearrange harmful_event
harmful_event <- harmful_event %>%
                arrange(desc(average_fatalities_per_event),
                        desc(average_injuries_per_event))

top_harm_per_event <- harmful_event[1:nb_top_event, ] %>%
                        select(EVTYPE, 
                               total_fatal_event, 
                               total_fatalities,
                               average_fatalities_per_event, 
                               total_injury_event,
                               total_injuries,
                               average_injuries_per_event)
top_harm_per_event$average_fatalities_per_event <- round(top_harm_per_event$average_fatalities_per_event, 3)

top_harm_per_event$average_injuries_per_event <- round(top_harm_per_event$average_injuries_per_event, 3)

top_harm_per_event$average_injuries_per_event[is.nan(top_harm_per_event$average_injuries_per_event)]<-"_Not Applicable_"

kable(top_harm_per_event)
EVTYPE total_fatal_event total_fatalities average_fatalities_per_event total_injury_event total_injuries average_injuries_per_event
UNSEASONABLY WARM AND DRY 1 29 29.000 0 0 Not Applicable
TORNADOES, TSTM WIND, HAIL 1 25 25.000 0 0 Not Applicable
RECORD/EXCESSIVE HEAT 1 17 17.000 0 0 Not Applicable
TSUNAMI 2 33 16.500 1 129 129
COLD AND SNOW 1 14 14.000 0 0 Not Applicable
STORM SURGE/TIDE 1 11 11.000 1 5 5
EXTREME HEAT 9 96 10.667 5 155 31
WINTER STORMS 1 10 10.000 1 17 17
TROPICAL STORM GORDON 1 8 8.000 1 43 43
HEAT WAVE 26 172 6.615 9 309 34.333
UNSEASONABLY WARM 2 11 5.500 4 17 4.25
HEAT 178 937 5.264 46 2100 45.652
HEAT WAVES 1 5 5.000 0 0 Not Applicable
STORM SURGE 3 13 4.333 9 38 4.222
HEAT WAVE DROUGHT 1 4 4.000 1 15 15
Mudslide 1 4 4.000 1 2 2
HIGH WIND/SEAS 1 4 4.000 0 0 Not Applicable
TORNADO 1602 5633 3.516 7704 91346 11.857
MARINE MISHAP 2 7 3.500 1 5 5
HURRICANE/TYPHOON 19 64 3.368 12 1275 106.25

Damage to the economy

From the reduced dataset storm_damage, we summarize the economic damages by each of the event type based on

  1. total property and crop damages
  2. number of events with property damages
  3. number of events with crop damages

Only the top 20 events are selected based on the order enumarated above.

economic_damage <- storm_damage %>% 
                summarise(total_prop_damage = sum(PROP_DAMAGE_VALUE),
                          total_prop_event=sum(PROPDMG>0),
                          total_crop_damage = sum(CROP_DAMAGE_VALUE),
                          total_crop_event=sum(CROPDMG>0),
                          average_prop_damage_per_event = total_prop_damage/total_prop_event, 
                          average_crop_damage_per_event = total_crop_damage/total_crop_event
                          ) %>%
                arrange(desc(total_prop_damage+total_crop_damage), 
                        desc(total_prop_event), 
                        desc(total_crop_event)
                        )


top_economic_damage <-  economic_damage[1:nb_top_event,]

## prepare to draw stacked barplot
bplot_data <- rbind(property_damage = top_economic_damage$total_prop_damage,
                    crop_damage = top_economic_damage$total_crop_damage)

The total economic loss due to FLOOD is highest followed by HURRICANE/TYPHOON and TORNADO.

par(mar=c(4,14,8,4), cex=0.75, las=1)

abs_max <- max(top_economic_damage$total_prop_damage+top_economic_damage$total_crop_damage)
x_scale <- 10^floor(log10(abs_max))

barplot(bplot_data, 
        horiz=TRUE, 
        names.arg = top_economic_damage$EVTYPE, 
        xlim = c(0, x_max = ceiling(abs_max/x_scale))*x_scale, 
        ylim=c(0,nb_top_event), 
        main=paste0("Total Damage of the top ", nb_top_event, " Event Type"), 
        xlab ="Total Prop Damage Value, USD", 
        ylab = "",
        col=c("red","green")
    )

legend("topright", legend = c("Property Damage", "Crop Damage"), fill=c("red","green"))

Top 20 Events causing Economic Damages

Economic property damages per event

The table below shows the top 20 events based on the average property damages per event type.

# rearrange economic_damage
economic_damage <- economic_damage %>%
                arrange(desc(average_prop_damage_per_event))

top_prop_damage_per_event <- economic_damage[1:nb_top_event, ] %>%
                        select(EVTYPE, 
                               total_prop_event, 
                               total_prop_damage,
                               average_prop_damage_per_event)

kable(top_prop_damage_per_event)
EVTYPE total_prop_event total_prop_damage average_prop_damage_per_event
HEAVY RAIN/SEVERE WEATHER 1 2500000000 2500000000
TORNADOES, TSTM WIND, HAIL 1 1600000000 1600000000
HURRICANE/TYPHOON 69 69305840000 1004432464
HURRICANE OPAL 8 3172846000 396605750
STORM SURGE 173 43323536000 250425064
SEVERE THUNDERSTORM 6 1205360000 200893333
WILD FIRES 4 624100000 156025000
HURRICANE OPAL/HIGH WINDS 1 100000000 100000000
STORM SURGE/TIDE 47 4641188000 98748681
HURRICANE 121 11868319010 98085281
HAILSTORM 3 241000000 80333333
TYPHOON 9 600230000 66692222
WINTER STORM HIGH WINDS 1 60000000 60000000
HURRICANE ERIN 5 258100000 51620000
RIVER FLOOD 102 5118945500 50185740
HURRICANE EMILY 1 50000000 50000000
MAJOR FLOOD 3 105000000 35000000
WILDFIRES 3 100500000 33500000
HIGH WINDS/COLD 5 110500000 22100000
River Flooding 5 106155000 21231000

Economic crop damages per event

The table below shows the top 20 events based on the average crop damages per event type.

# rearrange economic_damage
economic_damage <- economic_damage %>%
                arrange(desc(average_crop_damage_per_event))

top_crop_damage_per_event <- economic_damage[1:nb_top_event, ] %>%
                        select(EVTYPE, 
                               total_crop_event, 
                               total_crop_damage,
                               average_crop_damage_per_event)

kable(top_crop_damage_per_event)
EVTYPE total_crop_event total_crop_damage average_crop_damage_per_event
RIVER FLOOD 20 5029459000 251472950
EXCESSIVE HEAT 2 492402000 246201000
ICE STORM 24 5022113500 209254729
EXCESSIVE WETNESS 1 142000000 142000000
DAMAGING FREEZE 3 262100000 87366667
HURRICANE/TYPHOON 33 2607872800 79026448
COLD AND WET CONDITIONS 1 66000000 66000000
HURRICANE 48 2741910000 57123125
DROUGHT 251 13972566000 55667594
Early Frost 1 42000000 42000000
HEAT 10 401461500 40146150
HURRICANE ERIN 4 136010000 34002500
FREEZE 15 446225000 29748333
EXTREME COLD 49 1292973000 26387204
FLOOD/RAIN/WINDS 5 112800000 22560000
Extreme Cold 1 20000000 20000000
BLIZZARD 6 112060000 18676667
Damaging Freeze 2 34130000 17065000
SEVERE THUNDERSTORMS 1 17000000 17000000
HEAVY RAINS 4 60500000 15125000