This work is for the Coursera class Reproducible Research, Peer Assessment 2
Across the United States, the most harmful events in terms of injuries and fatalities are tornadoes (65% of injuries in the sample), floods, and heat. Wind storms and lightning also account for a measurable number of injuries. See the results section for details.
Across the United States, the most harmful events in terms of property and crop damage are floods (39% of property damage in the sample) and marine events such as hurricanes, typhoons, and tropical storms. Tornadoes and other wind storms also cause significant damage. See the results section for details.
Loading and preprocessing the data:
1. Download the assignment dataset as directed: Storm Data. NB: My Win7 and R configuration does not allow R to download https directly.
2. We show the syntax for using read.csv on the bz2 file directly. However, for efficiency we extract the file with 7-zip into the data folder and use fread from the data.table package. (fread was not working on the bzip2 file due to an EOF character causing an error.)
After examining the basic properties of the file we can move on to cleaning or modifying the dataset, if necessary. We assume the data have been downloaded to a data folder. A more complete solution would use R to check for the folder and the data and, if not found, create the folder and attempt to download the data, as sketched below.
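A hedged sketch of that more complete setup follows; the folder and file names are the ones assumed elsewhere in this report, and on some Windows/R configurations an explicit download method may be needed for https.
# Sketch only: create the data folder and download the file if missing
data_dir  <- "data"
data_url  <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
data_file <- file.path(data_dir, "repdata_data_StormData.csv.bz2")
if (!dir.exists(data_dir)) dir.create(data_dir)
if (!file.exists(data_file)) {
  # method = "curl" or "libcurl" may be needed for https on older Windows setups
  download.file(data_url, destfile = data_file, mode = "wb")
}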
# Data are downloaded to the data directory from the URL
# https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
# check and set working directory and load packages we want
library(data.table)
library(xtable)
#getwd()
setwd("C:/StatWare/Rprog/RepData/project2")
#fileloc <- "C:/StatWare/Rprog/RepData/project2/data/repdata_data_StormData.csv.bz2"
fileloc <- "C:/StatWare/Rprog/RepData/project2/data/repdata_data_StormData.csv"
As an aside, we show the processing time for read.csv vs. fread below to indicate why we wanted to use fread.
system.time(data.table(read.csv(fileloc,nrows=100000)))
## user system elapsed
## 2.45 0.02 2.47
system.time(fread(fileloc,nrows=100000))
## user system elapsed
## 1.11 0.06 1.17
# Read and check the data - fread works on a subset but not the whole file
#storm.dt1 <- fread("C:/StatWare/Rprog/RepData/project2/data/repdata_data_StormData.csv.bz2",nrows=20)
storm.dt1 <- as.data.table(read.csv(fileloc))
Do some formatting on the begin date and create a begin year field. Then report the structure of the data.table.
# Make friendly dates and report the structure of our table
storm.dt1[,beg_dt := as.Date(strptime(storm.dt1$BGN_DATE, "%m/%d/%Y %H:%M:%S"))]
storm.dt1[,beg_yr := year(storm.dt1$beg_dt)]
str(storm.dt1)
## Classes 'data.table' and 'data.frame': 902297 obs. of 39 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
## $ BGN_TIME : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
## $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : Factor w/ 35 levels ""," N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_DATE : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_TIME : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ WFO : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ZONENAMES : Factor w/ 25112 levels ""," "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : Factor w/ 436774 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
## $ beg_dt : Date, format: "1950-04-18" "1950-04-18" ...
## $ beg_yr : int 1950 1950 1951 1951 1951 1951 1951 1952 1952 1952 ...
## - attr(*, ".internal.selfref")=<externalptr>
setkey(storm.dt1,beg_yr,EVTYPE)
Make a table of injuries and fatalities for addressing question 1. Include property and crop damage for addressing question 2 and create sensible dollar columns. Note that we apply the filters relevant to our analysis before exploring column values (e.g. event types) in order to avoid spending time on data we are not going to use. Based on the columns of the data itself and the documentation, we deem INJURIES and FATALITIES to be the relevant measures for addressing question 1 and PROPDMG and CROPDMG to be relevant for question 2. See Notes for more information.
# Keep only rows with some harm or cost, then convert the M/B exponent codes so
# dollar amounts are in thousands; rows not matching a condition keep their prior value
event.dt1 <- storm.dt1[FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0
  ][, prop_dmg := if (PROPDMGEXP == "M") {PROPDMG * 1000} else {PROPDMG}, by = REFNUM
  ][, prop_dmg := if (PROPDMGEXP == "B") {PROPDMG * 1000 * 1000}, by = REFNUM
  ][, crop_dmg := if (CROPDMGEXP == "M") {CROPDMG * 1000} else {CROPDMG}, by = REFNUM
  ][, crop_dmg := if (CROPDMGEXP == "B") {CROPDMG * 1000 * 1000}, by = REFNUM]
# Print the sums - look for weird values
props <- event.dt1[,list(sum(PROPDMG),sum(prop_dmg)),by=PROPDMGEXP]
head(props,n=10)
## PROPDMGEXP V1 V2
## 1: K 10735292.10 10735292.10
## 2: M 140694.45 140694450.00
## 3: 527.41 527.41
## 4: B 275.85 275850000.00
## 5: 0 7108.30 7108.30
## 6: 4 14.50 14.50
## 7: + 117.00 117.00
## 8: 5 210.50 210.50
## 9: h 2.00 2.00
## 10: m 38.90 38.90
The dollar multipliers used above were found in the documentation already referenced. We ignore other values because the magnitudes of associated harms and costs are small.
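As an illustrative alternative (not used in the analysis above), the same K/M/B multipliers could be applied through a vectorized lookup table, keeping dollars in thousands as before. The prop_dmg2/crop_dmg2 column names and the to_thousands helper are hypothetical.
# Sketch only: apply the K/M/B multipliers via a named lookup vector;
# unmatched exponent codes default to 1, i.e. they are ignored as discussed
mult <- c(K = 1, M = 1000, B = 1e6)
to_thousands <- function(amount, code) {
  m <- mult[as.character(code)]
  m[is.na(m)] <- 1
  amount * m
}
event.dt1[, prop_dmg2 := to_thousands(PROPDMG, PROPDMGEXP)]
event.dt1[, crop_dmg2 := to_thousands(CROPDMG, CROPDMGEXP)]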
We create a summary table that we would use in a larger report, followed by a summary table of events for this report.
# And summarize by year, state and event type
event.dt2 <- event.dt1[,list(injury=sum(INJURIES),fatality=sum(FATALITIES),prop_dmg=sum(prop_dmg),crop_dmg=sum(crop_dmg)),by=.(beg_yr, STATE, EVTYPE)]
# And by events only
event.tots <- event.dt2[,list(injury=sum(injury),fatality=sum(fatality),prop_dmg=sum(prop_dmg),crop_dmg=sum(crop_dmg)),by=.(EVTYPE)]
setorder(event.tots,-injury,-prop_dmg)
event.tots
## EVTYPE injury fatality prop_dmg crop_dmg
## 1: TORNADO 91346 5633 56925970.7 415113.11
## 2: TSTM WIND 6957 504 4484983.4 554007.35
## 3: FLOOD 6789 470 144657716.8 5661968.45
## 4: EXCESSIVE HEAT 6525 1903 7753.7 492402.00
## 5: LIGHTNING 5230 816 928826.0 12092.09
## ---
## 484: HYPERTHERMIA/EXPOSURE 0 1 0.0 0.00
## 485: UNSEASONAL RAIN 0 0 0.0 10000.00
## 486: THUNDERSTORM WIND (G40) 0 1 0.0 0.00
## 487: DROWNING 0 1 0.0 0.00
## 488: ICE ON ROAD 0 1 0.0 0.00
# Need to map to a new event field in order to get more roll-up in the data. We look at the totals in
# order to determine the level of effort.
sum(event.tots$injury)
## [1] 140528
sum(event.tots$fatality)
## [1] 15145
sum(event.tots$prop_dmg)
## [1] 427287980
sum(event.tots$crop_dmg)
## [1] 49094473
From a visual inspection we can quickly see logical groupings of the event types. We start with events that are more likely to have injuries and property damage in order to capture as much relevant information as possible with a minimum of processing.
# Example of thought process - look at Wind events, then we looked at tornado, flood, etc.
wind <- grep("WIND", event.tots$EVTYPE,ignore.case = TRUE)
wind.events <- event.tots[event.tots$EVTYPE %in% as.character(event.tots$EVTYPE[wind])]
wind.events$event <- "WIND"
head(wind.events)
## EVTYPE injury fatality prop_dmg crop_dmg event
## 1: TSTM WIND 6957 504 4484983.4 554007.3 WIND
## 2: THUNDERSTORM WIND 1488 133 3483265.1 414843.0 WIND
## 3: HIGH WIND 1137 248 5270081.3 638571.3 WIND
## 4: THUNDERSTORM WINDS 908 64 1739628.6 190742.7 WIND
## 5: HIGH WINDS 302 35 608421.7 40720.6 WIND
## 6: STRONG WIND 280 103 175241.5 64953.5 WIND
For the wind subset we can ask what portion of injuries and damages the subset represents. We find a proportion of 0.08182 of injuries (about 8.2%) and 0.0415333 of damages (about 4.2%). In testing we continued as above to arrive at the solution below.
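A hedged sketch of that share calculation; the exact values depend on whether crop damage is included alongside property damage.
# Sketch only: share of injuries and damages represented by the wind subset
sum(wind.events$injury) / sum(event.tots$injury)
sum(wind.events$prop_dmg + wind.events$crop_dmg) /
  sum(event.tots$prop_dmg + event.tots$crop_dmg)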
# Actual solution
event.dt2$event <- event.dt2$EVTYPE # ensures no NA events
event.dt2$event[grep("WIND", event.dt2$EVTYPE, ignore.case=TRUE)] <- "WIND"
event.dt2$event[grep("TORNADO", event.dt2$EVTYPE, ignore.case=TRUE)] <- "TORNADO"
event.dt2$event[grep("FLOOD|FLD", event.dt2$EVTYPE, ignore.case=TRUE)] <- "FLOOD"
event.dt2$event[grep("HURRICANE|TYPHOON", event.dt2$EVTYPE, ignore.case=TRUE)] <- "HURRICANE/TYPHOON"
event.dt2$event[grep("HEAT", event.dt2$EVTYPE, ignore.case=TRUE)] <- "HEAT"
event.dt2$event[grep("SNOW|ICE|BLIZZARD|WINTER STORM|FREEZING|WINTRY MIX|SLEET|HAIL", event.dt2$EVTYPE, ignore.case=TRUE)] <- "WINTER STORM"
#NB: Assigning these labels produced an invalid factor level problem: the event column
# inherits EVTYPE's factor levels, and "FIRES"/"MARINE EVENTS" are not existing levels.
# See the sketch after this block.
#event.dt2$event[grep("WILD FIRES", event.dt2$EVTYPE, ignore.case=TRUE)] <- "FIRES"
#event.tots$event[grep("STORM SURGE", event.tots$EVTYPE, ignore.case=TRUE)] <- "MARINE EVENTS"
#event.tots$event[grep("TROPICAL STORM", event.tots$EVTYPE, ignore.case=TRUE)] <- "MARINE EVENTS"
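A hedged, illustrative sketch of how the commented-out mappings could be made to work: start from a character vector instead of the factor. The event2 column name is hypothetical, and this is not run as part of the analysis.
# Sketch only: a character column accepts labels that are not existing factor levels
event.dt2$event2 <- as.character(event.dt2$EVTYPE)
event.dt2$event2[grep("WILD FIRE|WILDFIRE", event.dt2$EVTYPE, ignore.case=TRUE)] <- "FIRES"
event.dt2$event2[grep("STORM SURGE|TROPICAL STORM", event.dt2$EVTYPE, ignore.case=TRUE)] <- "MARINE EVENTS"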
One can continue the process above to get any degree of granularity, but we have captured over 95% of injuries and property damage in fewer than 20 event categories, as will be seen later (a quick coverage check is sketched below).
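A hedged sketch of such a coverage check (not part of the original analysis); it compares the largest mapped categories with the overall totals computed earlier.
# Sketch only: share of injuries and property damage captured by the top mapped categories
coverage <- event.dt2[, list(injury = sum(injury), prop_dmg = sum(prop_dmg)), by = event]
setorder(coverage, -injury)
top20 <- head(coverage, 20)
c(injury_share = sum(top20$injury) / sum(coverage$injury),
  prop_share   = sum(top20$prop_dmg) / sum(coverage$prop_dmg))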
# Quick check of years
pre1990 <- event.dt2[beg_yr < 1990 , list(sum(injury),sum(fatality),sum(prop_dmg),sum(crop_dmg))]
post1990 <- event.dt2[beg_yr >= 1990, list(sum(injury),sum(fatality),sum(prop_dmg),sum(crop_dmg))]
pre1990
## V1 V2 V3 V4
## 1: 66829 4058 26478645 0
post1990
## V1 V2 V3 V4
## 1: 73699 11087 400809335 49094473
Summarizing the data at the outset saves processing time and reduces the application of “manual” logic. We are able to ignore anomalies in the data, such as invalid values for dollar multipliers, because they represent such a small portion of the data. We are then able to spend more time examining the results of a gently cleaned file.
Collection of weather event data has changed over time. However, we see no reason to exclude early years based on results so far. A longer analysis would include more discussion of the impact of reporting changes and a breakdown of events, perhaps by decade.
Summarizing and plotting injuries and fatalities by year shows that we have measurable data in all years but also shows the impact of data collection changes in 1993.
library(ggplot2)
year.tots <- event.dt2[,list(injury=sum(injury),fatality=sum(fatality),prop_dmg=sum(prop_dmg),crop_dmg=sum(crop_dmg)),by=.(beg_yr)]
head(year.tots)
## beg_yr injury fatality prop_dmg crop_dmg
## 1: 1950 659 70 34481.65 0
## 2: 1951 524 34 65505.99 0
## 3: 1952 1915 230 94102.24 0
## 4: 1953 5131 519 596104.70 0
## 5: 1954 715 36 85805.32 0
## 6: 1955 926 129 82660.63 0
tail(year.tots)
## beg_yr injury fatality prop_dmg crop_dmg
## 1: 2006 3368 599 121937434 3534239
## 2: 2007 2191 421 5788934 1691152
## 3: 2008 2703 488 15568383 2209793
## 4: 2009 1354 333 5227204 522220
## 5: 2010 1855 425 9246488 1785286
## 6: 2011 7792 1002 20888982 666742
scalex <- scale_x_discrete(breaks=c(1950,1960,1970,1980,1990,2000,2010))
scaley <- scale_y_continuous(breaks=c(0,1000,2000,3000,4000,5000,6000,7000,8000,9000,10000),limits=c(0,10000))
gp1 <- ggplot(year.tots, aes(as.factor(beg_yr), injury))+scalex+scaley
gp1 + geom_jitter()+geom_point(aes(as.factor(beg_yr),fatality),position="jitter",colour="red")+labs(title="NOAA Storm Data: Injuries and Fatalities", y="Count", x="Begin Year of Event")
## Warning: Removed 1 rows containing missing values (geom_point).
We build a table of events with at least 1% of the data for injuries and damages. The table itself is in the Costly Events section, but the same information gives us the most harmful events as indicated in the Synopsis: tornadoes outstrip all others for both injuries and fatalities. Floods cause injuries and fatalities, and heat is associated with nearly 21% of fatalities.
DT <- event.dt2[,list(sum(injury),sum(fatality),sum(prop_dmg),sum(crop_dmg)),by=event]
setorder(DT,-V1)
DT[,inj := (V1/140528)*100]
DT[,prop := (V3/427287980)*100]
DT[,death := (V2/15145)*100]
DT[,crop := (V4/49094473)*100]
DT[,list(sum(V1),sum(V2),sum(V3),sum(V4),sum(inj),sum(prop),sum(death),sum(crop))]
# What events have 1% of the data?
DT[inj > 1 | death > 1 |prop > 1 | crop > 1]
DT2 <- DT[inj > 1 | death > 1 | prop > 1 | crop > 1]
setorder(DT2,-V3)
grand <- DT2[,list(Event="TOTAL",Property=sum(V3),Property.percent=sum(prop),Crops=sum(V4),Crops.percent=sum(crop),Injuries=sum(V1),Injury.percent=sum(inj),Fatalities=sum(V2),Fatality.percent=sum(death))]
setnames(DT2,c("event","V1","V2","V3","V4","inj","prop","death","crop"),c("Event","Injuries","Fatalities","Property","Crops","Injury.percent","Property.percent","Fatality.percent","Crops.percent"))
setcolorder(DT2,c("Event","Property", "Property.percent","Crops", "Crops.percent", "Injuries", "Injury.percent", "Fatalities", "Fatality.percent"))
DT3 <- rbind(DT2,grand)
mytable <- xtable(DT3, caption='Table of Events with at least 1% of Harms or Costs. Dollars in Thousands.')
print(mytable,type="html")
| | Event | Property | Property.percent | Crops | Crops.percent | Injuries | Injury.percent | Fatalities | Fatality.percent |
|---|---|---|---|---|---|---|---|---|---|
| 1 | FLOOD | 167576344.68 | 39.22 | 12383577.20 | 25.22 | 8681.00 | 6.18 | 1553.00 | 10.25 |
| 2 | HURRICANE/TYPHOON | 85336430.01 | 19.97 | 5506127.80 | 11.22 | 1333.00 | 0.95 | 135.00 | 0.89 |
| 3 | TORNADO | 56981907.93 | 13.34 | 415121.36 | 0.85 | 91407.00 | 65.05 | 5636.00 | 37.21 |
| 4 | STORM SURGE | 43323536.00 | 10.14 | 5.00 | 0.00 | 38.00 | 0.03 | 13.00 | 0.09 |
| 5 | WINTER STORM | 30035979.27 | 7.03 | 8420537.25 | 17.15 | 7060.00 | 5.02 | 642.00 | 4.24 |
| 6 | WIND | 15928566.17 | 3.73 | 1964217.05 | 4.00 | 11344.00 | 8.07 | 1413.00 | 9.33 |
| 7 | TROPICAL STORM | 7703890.55 | 1.80 | 678346.00 | 1.38 | 340.00 | 0.24 | 58.00 | 0.38 |
| 8 | WILDFIRE | 4765114.00 | 1.12 | 295472.80 | 0.60 | 911.00 | 0.65 | 75.00 | 0.50 |
| 9 | STORM SURGE/TIDE | 4641188.00 | 1.09 | 850.00 | 0.00 | 5.00 | 0.00 | 11.00 | 0.07 |
| 10 | DROUGHT | 1046106.00 | 0.24 | 13972566.00 | 28.46 | 4.00 | 0.00 | 0.00 | 0.00 |
| 11 | LIGHTNING | 928825.98 | 0.22 | 12092.09 | 0.02 | 5230.00 | 3.72 | 816.00 | 5.39 |
| 12 | HEAVY RAIN | 694248.09 | 0.16 | 733399.80 | 1.49 | 251.00 | 0.18 | 98.00 | 0.65 |
| 13 | EXTREME COLD | 67737.40 | 0.02 | 1292973.00 | 2.63 | 231.00 | 0.16 | 160.00 | 1.06 |
| 14 | HEAT | 20325.75 | 0.00 | 904469.28 | 1.84 | 9224.00 | 6.56 | 3138.00 | 20.72 |
| 15 | FROST/FREEZE | 9480.00 | 0.00 | 1094086.00 | 2.23 | 0.00 | 0.00 | 0.00 | 0.00 |
| 16 | AVALANCHE | 3721.80 | 0.00 | 0.00 | 0.00 | 170.00 | 0.12 | 224.00 | 1.48 |
| 17 | RIP CURRENTS | 162.00 | 0.00 | 0.00 | 0.00 | 297.00 | 0.21 | 204.00 | 1.35 |
| 18 | RIP CURRENT | 1.00 | 0.00 | 0.00 | 0.00 | 232.00 | 0.17 | 368.00 | 2.43 |
| 19 | TOTAL | 419063564.63 | 98.08 | 47673840.63 | 97.11 | 136758.00 | 97.32 | 14544.00 | 96.03 |
Floods, hurricanes, and typhoons account for nearly 60% of property damage in the data set and over 36% of crop damage as measured in dollars. As with injuries, tornadoes are associated with high losses as well. Drought is very notable in our table, accounting for almost 29% of crop damages. (Furthermore, we captured this without undue effort on cleaning event categories.)
Wind and water events have historically been and continue to be costly to human populations in terms of injuries, lives, and the dollar cost of property and crops. Seasons and weather are not preventable; however, understanding the patterns of costly and harmful events over time and geography can help us create strategies to mitigate these costs, for example by not building homes in flood plains and “tornado alleys”, or by increasing the height of storm walls or the distance from the shoreline.
Useful extensions of the analysis would be to look for more recent events, examine dollar trends over the years, look at significant events (e.g. named hurricanes), examine events by region (we expect tornado damage in the Midwest and fire damage in the West, for example, with marine events on the coasts), and add population data for per capita calculations, and so on.
End of Document.