Synopsis

This work is for the Coursera class Reproducible Research, Peer Assessment 2

Research questions to be addressed

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

Harmful Events

Across the United States the most harmful events in terms of injuries and fatalaties are tornados (65% of injuries in the sample), floods, and heat. Windstorms and ligtning also account for a measurable number of injuries. See the results section for details.

Costly Events

Across the United States the most harmful events in terms of property and crop damage are flood (39% of property damage in the sample), marine events like hurricanes, typhoons and tropical storms. Tornados and other wind storms also cause significant damage. See the results section for details.

Data Processsing

Loading and preprocessing the data: 1. Download the assignment dataset as directed: Storm Data NB: My Win7 and R configuration does not allow R to download https directly. 2. We show syntax for using read.csv on the bz2 file directly. However, for efficiency we extract the file with 7-zip in the data folder and use fread from the data.table package. (fread was not working on the bzip2 file due to an EOF character causing an error).

After examining the basic properties of the file we can move on to cleaning or modifying the dataset, if necessary. We assume data has been downloaded to a data folder. A more complete solution would use R to check for the folder and data and if not found create the folder and attempt to download the data.

# Data are downloaded to the data directory from the URL
# https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
# check and set working directory and load packages we want

library(data.table)
library(xtable)
#getwd()
setwd("C:/StatWare/Rprog/RepData/project2")

#fileloc <- "C:/StatWare/Rprog/RepData/project2/data/repdata_data_StormData.csv.bz2"
fileloc <- "C:/StatWare/Rprog/RepData/project2/data/repdata_data_StormData.csv"

An aside, we show processing time for read.csv vs. fread below to indicate why we wanted to use fread.

system.time(data.table(read.csv(fileloc,nrows=100000)))
##    user  system elapsed 
##    2.45    0.02    2.47
system.time(fread(fileloc,nrows=100000))
##    user  system elapsed 
##    1.11    0.06    1.17
# Read and check the data - fread works on a subset but not the whole file
#storm.dt1 <- fread("C:/StatWare/Rprog/RepData/project2/data/repdata_data_StormData.csv.bz2",nrows=20)
storm.dt1 <- as.data.table(read.csv(fileloc))

Do some formatting on the begin date and create a begin year field. Then report the structure of the data.table.

# Make friendly dates and report the structure of our table
storm.dt1[,beg_dt := as.Date(strptime(storm.dt1$BGN_DATE, "%m/%d/%Y %H:%M:%S"))]
storm.dt1[,beg_yr := year(storm.dt1$beg_dt)]
str(storm.dt1)
## Classes 'data.table' and 'data.frame':   902297 obs. of  39 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ beg_dt    : Date, format: "1950-04-18" "1950-04-18" ...
##  $ beg_yr    : int  1950 1950 1951 1951 1951 1951 1951 1952 1952 1952 ...
##  - attr(*, ".internal.selfref")=<externalptr>
setkey(storm.dt1,beg_yr,EVTYPE)

Make a table of injuries and fatalities for addressing question 1. Include property and crop damage for addressing question 2 and create sensible dollar columns. Note that we are applying the filters relevant to our analysis before exploring column values (e.g. event types) in order to avoid spending time on data we are not going to use. Based on looking at the columns of the data itself and the documentationa we deem INJURIES and FATALITIES to be the relevant measures for addressing question 1 and PROPDMG and CROPDMG to be relevant for question 2. See Notes for more information.

event.dt1 <- storm.dt1[FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0 ][, prop_dmg := if (PROPDMGEXP=="M") {PROPDMG*1000} else {PROPDMG}, by=REFNUM][, prop_dmg := if (PROPDMGEXP=="B") {PROPDMG*1000*1000},by=REFNUM][, crop_dmg := if (CROPDMGEXP=="M") {CROPDMG*1000} else {CROPDMG}, by=REFNUM][, crop_dmg := if (CROPDMGEXP=="B") {CROPDMG*1000*1000}, by=REFNUM]

# Print the sums - look for weird values
props <- event.dt1[,list(sum(PROPDMG),sum(prop_dmg)),by=PROPDMGEXP]

head(props,n=10)
##     PROPDMGEXP          V1           V2
##  1:          K 10735292.10  10735292.10
##  2:          M   140694.45 140694450.00
##  3:                 527.41       527.41
##  4:          B      275.85 275850000.00
##  5:          0     7108.30      7108.30
##  6:          4       14.50        14.50
##  7:          +      117.00       117.00
##  8:          5      210.50       210.50
##  9:          h        2.00         2.00
## 10:          m       38.90        38.90

The dollar multipliers used above were found in the documentation already referenced. We ignore other values because the magnitudes of associated harms and costs are small.

We create a summary table that we would use in a larger report followed by a summary table of events for this report.

# And summarize by year, state and event type 
event.dt2 <- event.dt1[,list(injury=sum(INJURIES),fatality=sum(FATALITIES),prop_dmg=sum(prop_dmg),crop_dmg=sum(crop_dmg)),by=.(beg_yr, STATE, EVTYPE)]

# And by events only
event.tots <- event.dt2[,list(injury=sum(injury),fatality=sum(fatality),prop_dmg=sum(prop_dmg),crop_dmg=sum(crop_dmg)),by=.(EVTYPE)]

setorder(event.tots,-injury,-prop_dmg)

event.tots
##                       EVTYPE injury fatality    prop_dmg   crop_dmg
##   1:                 TORNADO  91346     5633  56925970.7  415113.11
##   2:               TSTM WIND   6957      504   4484983.4  554007.35
##   3:                   FLOOD   6789      470 144657716.8 5661968.45
##   4:          EXCESSIVE HEAT   6525     1903      7753.7  492402.00
##   5:               LIGHTNING   5230      816    928826.0   12092.09
##  ---                                                               
## 484:   HYPERTHERMIA/EXPOSURE      0        1         0.0       0.00
## 485:         UNSEASONAL RAIN      0        0         0.0   10000.00
## 486: THUNDERSTORM WIND (G40)      0        1         0.0       0.00
## 487:                DROWNING      0        1         0.0       0.00
## 488:             ICE ON ROAD      0        1         0.0       0.00
# Need to map to a new event field in order to get more roll-up in the data. We look at the totals in
# order to determine the level of effort.

sum(event.tots$injury)
## [1] 140528
sum(event.tots$fatality)
## [1] 15145
sum(event.tots$prop_dmg)
## [1] 427287980
sum(event.tots$crop_dmg)
## [1] 49094473

From a visual inspection we can quickly see logical groupings of the event types. We start with events that are more likely to have injuries and property damage in order to capture as much relevant information as possible with a minimum of processing.

# Example of thought process - look at Wind events, then we looked at tornado, flood, etc.
wind <- grep("WIND", event.tots$EVTYPE,ignore.case = TRUE)
wind.events <- event.tots[event.tots$EVTYPE %in% as.character(event.tots$EVTYPE[wind])]
wind.events$event <- "WIND"
head(wind.events)
##                EVTYPE injury fatality  prop_dmg crop_dmg event
## 1:          TSTM WIND   6957      504 4484983.4 554007.3  WIND
## 2:  THUNDERSTORM WIND   1488      133 3483265.1 414843.0  WIND
## 3:          HIGH WIND   1137      248 5270081.3 638571.3  WIND
## 4: THUNDERSTORM WINDS    908       64 1739628.6 190742.7  WIND
## 5:         HIGH WINDS    302       35  608421.7  40720.6  WIND
## 6:        STRONG WIND    280      103  175241.5  64953.5  WIND

For the wind subset we can ask what portion of injuries and damages does the subset represent? We find 0.08182 percent of injuries and 0.0415333 percent of damages. In testing we continued as above to arrive at the solution below.

# Actual solution
event.dt2$event <- event.dt2$EVTYPE # ensures no NA events

event.dt2$event[grep("WIND", event.dt2$EVTYPE, ignore.case=TRUE)] <- "WIND"
event.dt2$event[grep("TORNADO", event.dt2$EVTYPE, ignore.case=TRUE)] <- "TORNADO"
event.dt2$event[grep("FLOOD|FLD", event.dt2$EVTYPE, ignore.case=TRUE)] <- "FLOOD"
event.dt2$event[grep("HURRICANE|TYPHOON", event.dt2$EVTYPE, ignore.case=TRUE)] <- "HURRICANE/TYPHOON"
event.dt2$event[grep("HEAT", event.dt2$EVTYPE, ignore.case=TRUE)] <- "HEAT"
event.dt2$event[grep("SNOW|ICE|BLIZZARD|WINTER STORM|FREEZING|WINTRY MIX|SLEET|HAIL", event.dt2$EVTYPE, ignore.case=TRUE)] <- "WINTER STORM"

#NB: I kept getting an invalid factor error for these
#event.dt2$event[grep("WILD FIRES", event.dt2$EVTYPE, ignore.case=TRUE)] <- "FIRES"
#event.tots$event[grep("STORM SURGE", event.tots$EVTYPE, ignore.case=TRUE)] <- "MARINE EVENTS"
#event.tots$event[grep("TROPICAL STORM", event.tots$EVTYPE, ignore.case=TRUE)] <- "MARINE EVENTS"

One can continue the process above to get any degree of granularity but we have captured over 95% of injuries and property damage in fewer than 20 event categories, as will be seen later.

# Quick check of years
pre1990 <- event.dt2[beg_yr < 1990 , list(sum(injury),sum(fatality),sum(prop_dmg),sum(crop_dmg))]
post1990 <- event.dt2[beg_yr >= 1990, list(sum(injury),sum(fatality),sum(prop_dmg),sum(crop_dmg))]

pre1990
##       V1   V2       V3 V4
## 1: 66829 4058 26478645  0
post1990
##       V1    V2        V3       V4
## 1: 73699 11087 400809335 49094473

Results

Summarizing the data at the outset saves processing time and reduces the application of “manual” logic. We are able to ignore anomolies in the data, such as invalid values for dollar mutlipliers because they represent such a small portion of the data. We are then able to spend more time examing the results of a gently cleaned file.

Collection of weather event data has changed over time. However we see no reason to exclude early years based on results so far. A longer analysis would include more discussion of the impact of reporting changes and a breakdown of events, perhaps by decade.

Harmful Events

Summarizing and plotting by injuries and fatalities by year shows that we have measurable data in all years but also shows the impact of data collection changes in 1993.

library(ggplot2)

year.tots <- event.dt2[,list(injury=sum(injury),fatality=sum(fatality),prop_dmg=sum(prop_dmg),crop_dmg=sum(crop_dmg)),by=.(beg_yr)]
head(year.tots)
##    beg_yr injury fatality  prop_dmg crop_dmg
## 1:   1950    659       70  34481.65        0
## 2:   1951    524       34  65505.99        0
## 3:   1952   1915      230  94102.24        0
## 4:   1953   5131      519 596104.70        0
## 5:   1954    715       36  85805.32        0
## 6:   1955    926      129  82660.63        0
tail(year.tots)
##    beg_yr injury fatality  prop_dmg crop_dmg
## 1:   2006   3368      599 121937434  3534239
## 2:   2007   2191      421   5788934  1691152
## 3:   2008   2703      488  15568383  2209793
## 4:   2009   1354      333   5227204   522220
## 5:   2010   1855      425   9246488  1785286
## 6:   2011   7792     1002  20888982   666742
scalex <- scale_x_discrete(breaks=c(1950,1960,1970,1980,1990,2000,2010))
scaley <- scale_y_continuous(breaks=c(0,1000,2000,3000,4000,5000,6000,7000,8000,9000,10000),limits=c(0,10000))
gp1 <- ggplot(year.tots, aes(as.factor(beg_yr), injury))+scalex+scaley
gp1 + geom_jitter()+geom_point(aes(as.factor(beg_yr),fatality),position="jitter",colour="red")+labs(title="NOAA Storm Data: Injuries and Fatalities", y="Count", x="Begin Year of Event")
## Warning: Removed 1 rows containing missing values (geom_point).

We build a table of events with at least 1% of the data for injuries and damages. The table itself is in the Costly Events section but the same information gives us the most harmful events as indicated in the Synopsis: tornados outstrip all others for both injuries and fatalities. Floods cause injuries and fatalities and heat is associatied with nearly 21% of fatalities.

DT <- event.dt2[,list(sum(injury),sum(fatality),sum(prop_dmg),sum(crop_dmg)),by=event]
setorder(DT,-V1)
DT[,inj := (V1/140528)*100]
DT[,prop := (V3/427287980)*100]
DT[,death := (V2/15145)*100]
DT[,crop := (V4/49094473)*100]

DT[,list(sum(V1),sum(V2),sum(V3),sum(V4),sum(inj),sum(prop),sum(death),sum(crop))]

Costly Events

# What events have 1% of the data?

DT[inj > 1 | death > 1 |prop > 1 | crop > 1]

DT2 <- DT[inj > 1 | death > 1 | prop > 1 | crop > 1]
setorder(DT2,-V3)

grand <- DT2[,list(Event="TOTAL",Property=sum(V3),Property.percent=sum(prop),Crops=sum(V4),Crops.percent=sum(crop),Injuries=sum(V1),Injury.percent=sum(inj),Fatalities=sum(V2),Fatality.percent=sum(death))]

setnames(DT2,c("event","V1","V2","V3","V4","inj","prop","death","crop"),c("Event","Injuries","Fatalities","Property","Crops","Injury.percent","Property.percent","Fatality.percent","Crops.percent"))

setcolorder(DT2,c("Event","Property", "Property.percent","Crops", "Crops.percent", "Injuries", "Injury.percent", "Fatalities", "Fatality.percent"))

DT3 <- rbind(DT2,grand)
mytable <- xtable(DT3, caption='Table of Events with at least 1% of Harms or Costs. Dollars in Thousands.')
print(mytable,type="html")
Table of Events with at least 1% of Harms or Costs. Dollars in Thousands.
Event Property Property.percent Crops Crops.percent Injuries Injury.percent Fatalities Fatality.percent
1 FLOOD 167576344.68 39.22 12383577.20 25.22 8681.00 6.18 1553.00 10.25
2 HURRICANE/TYPHOON 85336430.01 19.97 5506127.80 11.22 1333.00 0.95 135.00 0.89
3 TORNADO 56981907.93 13.34 415121.36 0.85 91407.00 65.05 5636.00 37.21
4 STORM SURGE 43323536.00 10.14 5.00 0.00 38.00 0.03 13.00 0.09
5 WINTER STORM 30035979.27 7.03 8420537.25 17.15 7060.00 5.02 642.00 4.24
6 WIND 15928566.17 3.73 1964217.05 4.00 11344.00 8.07 1413.00 9.33
7 TROPICAL STORM 7703890.55 1.80 678346.00 1.38 340.00 0.24 58.00 0.38
8 WILDFIRE 4765114.00 1.12 295472.80 0.60 911.00 0.65 75.00 0.50
9 STORM SURGE/TIDE 4641188.00 1.09 850.00 0.00 5.00 0.00 11.00 0.07
10 DROUGHT 1046106.00 0.24 13972566.00 28.46 4.00 0.00 0.00 0.00
11 LIGHTNING 928825.98 0.22 12092.09 0.02 5230.00 3.72 816.00 5.39
12 HEAVY RAIN 694248.09 0.16 733399.80 1.49 251.00 0.18 98.00 0.65
13 EXTREME COLD 67737.40 0.02 1292973.00 2.63 231.00 0.16 160.00 1.06
14 HEAT 20325.75 0.00 904469.28 1.84 9224.00 6.56 3138.00 20.72
15 FROST/FREEZE 9480.00 0.00 1094086.00 2.23 0.00 0.00 0.00 0.00
16 AVALANCHE 3721.80 0.00 0.00 0.00 170.00 0.12 224.00 1.48
17 RIP CURRENTS 162.00 0.00 0.00 0.00 297.00 0.21 204.00 1.35
18 RIP CURRENT 1.00 0.00 0.00 0.00 232.00 0.17 368.00 2.43
19 TOTAL 419063564.63 98.08 47673840.63 97.11 136758.00 97.32 14544.00 96.03

Floods, hurricanes and typhoons account for 60% of property damages in the data set and over 36% of crop damages as measured in dollars. As with injuries, tornados are associated with high losses as well. Drought is very notable in our table accounts for almost 29% of crop damages. (Furthermore we captured this without undue effort on cleaning event categories).

Conclusions

Wind and water events have historically been and continue to be costly to human populations in terms of injuries, lives and the dollar cost of property and crops. Seasons and weather are not preventable; however understanding the patterns of costly and harmful events over time and geography can help us create strategies to mitigiate these costs. For example, not building homes in flood plains and “torndado alleys” or increasing the height of storm walls or distance from shoreline.

Notes

Useful extensions of the analysis would be: look for more current events. Examine dollar trends over years. Look at significant events (e.g. named hurricanes). Examine events by areas–we expect tornado damage in the midwest and fire damage in the west, for example, with marine events on the coasts. Add population data for per capita calculations, and so on.

  1. Storm Data Documentation

End of Document.