Statistical analysis of a comprehensive weather database, compiled by the US authority NOAA over 60 years, reveals that both the number of fatalities and the economic damage caused by extreme weather events are increasing. The database actually consists of two methodologically distinct subsets: data collected from the 1950s to 1993, and data from 1993 to 2011. The pre-1993 data were classified into only three distinct extreme-weather event types (“tornado”, “thunderstorm wind”, “hail”), with tornadoes causing the greatest number of fatalities and the largest damage. In contrast, the data from 1993 onwards (including 1993) use a much finer granularity of 985 distinct event types, of which wind-related events and extreme heat/drought-related event types contribute roughly equally to economic damage and the number of fatalities.
Whereas the number of fatalities per year has remained within the same order of magnitude, the amount of damage done by extreme weather events has risen dramatically (about 6-fold) between the two periods: since 1997 it has exceeded 5 billion dollars each year, and it has exceeded 10 billion dollars three times in the past 20 years.
The weather database analyzed here is a copy acquired from the web.
Source URL: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
Date accessed: July 23rd 2014
Description: U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, containing 1.23 million records. This database tracks characteristics of major storms and weather events in the United States from the 1950s to 2011, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
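For reproducibility, the compressed file can be retrieved directly from the source URL into the project data directory. This is a minimal sketch; it assumes the ‘projectdata’ subdirectory used throughout this report already exists, and method = "curl" is used because the default download method may not support https on this R version.
# Download the compressed storm data into the project data directory (sketch)
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, destfile = "projectdata/repdata_data_StormData.csv.bz2", method = "curl")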
Detailed documentation of the database is also available; it describes how some of the variables are constructed and defined.
In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
All calculations were performed with R version 3.1.0 (2014-04-10).
For details about the software used, see bottom of report.
Load the libraries needed by the code below.
library(plyr) # mutate function
library(XML) # parse HTML helper table
library(lubridate) # parse character strings to date type
##
## Attaching package: 'lubridate'
##
## The following object is masked from 'package:plyr':
##
## here
library(ggplot2) #plots
library(Hmisc) # contents() function, a better summary(); can also be used to assign labels to data frame columns
## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: splines
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
##
## The following objects are masked from 'package:plyr':
##
## is.discrete, summarize
##
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
Set the working directory. The data file and the PDF files containing the documentation are stored in a subdirectory ‘projectdata’ inside this directory.
setwd("/home/knut/Documents/coursera/datascience/5-reprod-research/RepData_PeerAssessment2")
Uncompress the data with bunzip2, a common tool provided by most Unix operating systems. By default, bzip2 will not overwrite existing output files.
cmd <- "bunzip2 -k projectdata/repdata_data_StormData.csv.bz2"  # -k keeps the original .bz2 file
system(cmd)
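On systems where the bunzip2 command-line tool is not available, the same decompression can be done from within R, for instance with the R.utils package. This is a sketch under the assumption that R.utils is installed:
# Decompress within R; remove=FALSE keeps the original .bz2 file,
# skip=TRUE leaves an already-existing output file untouched.
R.utils::bunzip2("projectdata/repdata_data_StormData.csv.bz2", remove = FALSE, skip = TRUE)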
Try to determine how many rows there are in the data file.
filename <- "projectdata/repdata_data_StormData.csv"
nrow <- gsub('\\s\\S+$', "", system(paste("wc -l", filename), intern = TRUE), perl=TRUE)
The csv file contains 1232705 newlines. This is only an estimate of the row count, because the file might contain “multiline strings”.
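Where wc is not available, the line count can also be obtained in R itself, reading the file in chunks to keep memory use low (a sketch, subject to the same multiline-string caveat):
# Count lines by reading the file in chunks of 100,000 lines.
con <- file(filename, open = "r")
nlines <- 0
repeat {
  chunk <- readLines(con, n = 100000)
  if (length(chunk) == 0) break
  nlines <- nlines + length(chunk)
}
close(con)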
Set all relevant locale environment variables to English with a modern (multibyte) character set, to make sure the data are read in correctly.
Sys.setlocale('LC_ALL', 'en_US.UTF-8')
## [1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=de_DE.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=de_DE.UTF-8;LC_IDENTIFICATION=C"
Inspect a subset of the data file.
initial <- read.csv(filename, na.strings=c("","<NA>"), quote="",header=TRUE, nrows=187000, stringsAsFactors=FALSE)
classes <- sapply(initial, class)
classes
## X.STATE__. X.BGN_DATE. X.BGN_TIME. X.TIME_ZONE. X.COUNTY.
## "numeric" "character" "character" "character" "numeric"
## X.COUNTYNAME. X.STATE. X.EVTYPE. X.BGN_RANGE. X.BGN_AZI.
## "character" "character" "character" "numeric" "logical"
## X.BGN_LOCATI. X.END_DATE. X.END_TIME. X.COUNTY_END. X.COUNTYENDN.
## "logical" "logical" "logical" "numeric" "logical"
## X.END_RANGE. X.END_AZI. X.END_LOCATI. X.LENGTH. X.WIDTH.
## "numeric" "logical" "logical" "numeric" "numeric"
## X.F. X.MAG. X.FATALITIES. X.INJURIES. X.PROPDMG.
## "character" "numeric" "numeric" "numeric" "numeric"
## X.PROPDMGEXP. X.CROPDMG. X.CROPDMGEXP. X.WFO. X.STATEOFFIC.
## "character" "numeric" "logical" "character" "logical"
## X.ZONENAMES. X.LATITUDE. X.LONGITUDE. X.LATITUDE_E. X.LONGITUDE_.
## "logical" "numeric" "numeric" "numeric" "numeric"
## X.REMARKS. X.REFNUM.
## "logical" "numeric"
The database contains 37 columns. Not all of these are needed, though.
Read in the data file, ignoring columns that contain only NA values (found by trial and error), and turn the uppercase column names into lowercase.
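The all-NA columns could alternatively be identified programmatically from the column classes sampled above, since entirely empty columns are read as “logical”. This is only a sketch; the hand-built colClasses vector below additionally drops columns that are simply not needed for this analysis.
# Columns read as "logical" in the sample contain only NA values in this file;
# marking them as "NULL" makes read.csv skip them (sketch only).
auto_classes <- classes
auto_classes[classes == "logical"] <- "NULL"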
important_columns = c("NULL", "character", rep("NULL", 4), "character", "character", rep("NULL", 14), "numeric", "numeric", "numeric", "character", "numeric", "character", rep("NULL", 3), "numeric", "numeric", rep("NULL", 3), "numeric")
storms <- read.csv(filename,colClasses=important_columns)
names(storms) <- tolower(names(storms))
contents(storms)
##
## Data frame:storms 902297 observations and 12 variables Maximum # NAs:47
##
## Storage NAs
## bgn_date character 0
## state character 0
## evtype character 0
## fatalities double 0
## injuries double 0
## propdmg double 0
## propdmgexp character 0
## cropdmg double 0
## cropdmgexp character 0
## latitude double 47
## longitude double 0
## refnum double 0
The number of NA values is 0 for 11 out of 12 columns; only the nonessential “latitude” column contains missing values (47 NAs).
Fix the Napa Valley flood 115-billion-dollar error/outlier, where an “M” was miscoded as a “B”. Source: http://pubs.usgs.gov/of/2006/1182/pdf/ofr2006-1182.pdf
The abstract claims total damage of $300M. Indeed, it looks like the “B” should have been an “M”.
storms$propdmgexp[storms$refnum == 605943] <- "M"
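A quick sanity check, sketched here, confirms that the affected record now carries the corrected exponent:
# Show the corrected record; the propdmg value together with exponent "M"
# is now interpreted as millions rather than billions of dollars.
storms[storms$refnum == 605943, c("bgn_date", "state", "evtype", "propdmg", "propdmgexp")]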
Convert the “bgn_date” column to a date/time type, ignoring time zones and time of day.
storms$datetime <- parse_date_time(gsub('\\s\\S+$', "", storms$bgn_date, perl=TRUE), "%m/%d/%Y", truncated=3)
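A brief check that the conversion succeeded (a sketch): no dates should fail to parse, and the range should span the 1950s to 2011.
# Any NA here would indicate dates that could not be parsed.
sum(is.na(storms$datetime))
range(storms$datetime, na.rm = TRUE)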
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
This can be answered by determining the number of fatalities by year:
qplot(year(datetime), fatalities, data=storms, geom="bar", stat="identity", xlab="year")
The plot indicates that there is a notable increase in fatalities after 1993. This could be caused by more severe weather, or it could reflect a systematic issue: the event classification scheme changed in 1993.
I’ll introduce an artificial grouping variable to give a quantitative view of the data:
storms$period <- ifelse(year(storms$datetime) >= 1993, "postincl1993", "pre1993")
xtabs(fatalities ~period, data=storms)
## period
## postincl1993 pre1993
## 10865 4280
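The number of distinct event types used in each period illustrates the change in classification granularity mentioned in the synopsis (a minimal sketch):
# Count unique event-type labels per period.
tapply(storms$evtype, storms$period, function(x) length(unique(x)))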
The following table shows that fatalities have been attributed to a much more fine-grained set of event types since 1993:
xtabs(fatalities ~evtype + period, data=storms[storms$fatalities > 3,])
## period
## evtype postincl1993 pre1993
## AVALANCHE 10 0
## BLIZZARD 25 0
## COLD AND SNOW 14 0
## COLD/WIND CHILL 5 0
## DUST STORM 10 0
## EXCESSIVE HEAT 1287 0
## EXTREME COLD 29 0
## EXTREME COLD/WIND CHILL 34 0
## EXTREME HEAT 88 0
## FLASH FLOOD 220 0
## FLASH FLOOD/FLOOD 6 0
## FLASH FLOODING 4 0
## FLOOD 97 0
## FLOOD/FLASH FLOOD 5 0
## FOG 16 0
## GLAZE 5 0
## HAIL 0 4
## HEAT 735 0
## HEAT WAVE 150 0
## HEAT WAVE DROUGHT 4 0
## HEAT WAVES 5 0
## HEAVY RAIN 23 0
## HEAVY SNOW 15 0
## HEAVY SURF/HIGH SURF 5 0
## HIGH SURF 22 0
## HIGH WIND 45 0
## HIGH WIND/SEAS 4 0
## HURRICANE 36 0
## HURRICANE/TYPHOON 46 0
## ICE STORM 20 0
## LANDSLIDE 28 0
## LIGHTNING 9 0
## MARINE MISHAP 6 0
## MARINE TSTM WIND 5 0
## Mudslide 4 0
## RECORD/EXCESSIVE HEAT 17 0
## RIP CURRENT 6 0
## RIP CURRENTS/HEAVY SURF 4 0
## ROUGH SEAS 4 0
## STORM SURGE 10 0
## STORM SURGE/TIDE 11 0
## STRONG WIND 5 0
## THUNDERSTORM WIND 7 0
## THUNDERSTORM WINDS 8 0
## TORNADO 1037 2800
## TORNADOES, TSTM WIND, HAIL 25 0
## TROPICAL STORM 30 0
## TROPICAL STORM GORDON 8 0
## TSTM WIND 5 34
## TSUNAMI 32 0
## UNSEASONABLY WARM 10 0
## UNSEASONABLY WARM AND DRY 29 0
## WILDFIRE 38 0
## WILD/FOREST FIRE 4 0
## WINTER STORM 35 0
## WINTER STORMS 10 0
## WINTER WEATHER/MIX 11 0
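A more compact answer to the question can be obtained by summing fatalities per event type for the post-1993 period and listing the largest contributors (a sketch):
# Total fatalities per event type since 1993, largest first.
post <- storms[storms$period == "postincl1993", ]
fat <- aggregate(fatalities ~ evtype, data = post, FUN = sum)
head(fat[order(-fat$fatalities), ], 10)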
Determine the total amount of economic damage by year. In the database, the damage amount is encoded with an exponent code: “K” for thousands of dollars and “M” for millions of dollars.
storms$damageprop <- ifelse(storms$propdmgexp == "M", storms$propdmg *1000000, ifelse(storms$propdmgexp == "K",storms$propdmg *1000, 0))
storms$damagecrop <- ifelse(storms$cropdmgexp == "M", storms$cropdmg *1000000, ifelse(storms$cropdmgexp == "K",storms$cropdmg *1000, 0))
storms$damage <-storms$damagecrop + storms$damageprop
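Note that this conversion treats every exponent code other than “K” and “M” (for example “B”, lowercase variants, or empty strings) as zero damage. A tabulation of the codes, sketched here, shows how often the other values occur:
# How often does each property-damage exponent code occur?
table(storms$propdmgexp, useNA = "ifany")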
Plotting the total damage by year yields the following plot:
qplot(year(datetime), damage, data=storms, geom="bar", stat="identity", xlab="year", ylab="Damage to crops and property, in US dollars")
The following table lists, for each event type and each period (post-1993 and pre-1993), the damage done in US dollars (column “Freq”):
damage <- as.data.frame(xtabs(damage ~ evtype + period, data=storms))
damage[damage$Freq > 1000000000,]
## evtype period Freq
## 88 DROUGHT postincl1993 1.352e+10
## 133 EXTREME COLD postincl1993 1.361e+09
## 147 FLASH FLOOD postincl1993 1.656e+10
## 164 FLOOD postincl1993 2.793e+10
## 206 FROST/FREEZE postincl1993 1.104e+09
## 238 HAIL postincl1993 1.695e+10
## 284 HEAVY RAIN postincl1993 1.428e+09
## 304 HEAVY SNOW postincl1993 1.067e+09
## 354 HIGH WIND postincl1993 4.609e+09
## 397 HURRICANE postincl1993 8.910e+09
## 406 HURRICANE/TYPHOON postincl1993 4.904e+09
## 424 ICE STORM postincl1993 3.967e+09
## 759 THUNDERSTORM WIND postincl1993 3.898e+09
## 783 THUNDERSTORM WINDS postincl1993 1.924e+09
## 830 TORNADO postincl1993 2.144e+10
## 844 TROPICAL STORM postincl1993 3.232e+09
## 854 TSTM WIND postincl1993 5.039e+09
## 953 WILDFIRE postincl1993 4.021e+09
## 956 WILD/FOREST FIRE postincl1993 1.609e+09
## 972 WINTER STORM postincl1993 1.715e+09
## 1815 TORNADO pre1993 3.060e+10
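The total damage per period, which underlies the increase described in the synopsis, can be computed directly (a sketch):
# Total property and crop damage (in US dollars) per period.
xtabs(damage ~ period, data = storms)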
In the roughly 20-year period from 1993 to 2011, several of these event types each contributed as much damage as the tornado-related damage that occurred during the preceding 40-year period (1950s to 1993).
This document was prepared for the Coursera.com class Reproducible Research:
Peer Assessment 2
The instructions for this assignment can be found on this page: https://github.com/rdpeng/RepData_PeerAssessment1.
Please read this first for context and details.
Knut Behrends
knb@gfz-potsdam.de
July 2014
sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] splines grid stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] Hmisc_3.14-4 Formula_1.1-1 survival_2.37-7 lattice_0.20-29
## [5] ggplot2_1.0.0 lubridate_1.3.3 XML_3.98-1 plyr_1.8.1
## [9] rj_2.0.2-1 knitr_1.6 colorout_1.0-3
##
## loaded via a namespace (and not attached):
## [1] cluster_1.15.2 colorspace_1.2-4 dichromat_2.0-0
## [4] digest_0.6.4 evaluate_0.5.3 formatR_0.10
## [7] gtable_0.1.2 htmltools_0.2.4 labeling_0.2
## [10] latticeExtra_0.6-26 MASS_7.3-31 memoise_0.2.1
## [13] munsell_0.4.2 proto_0.3-10 RColorBrewer_1.0-5
## [16] Rcpp_0.11.1 reshape2_1.2.2 rmarkdown_0.2.49
## [19] scales_0.2.3 stringr_0.6.2 tools_3.1.0