Dangers of weather events in Unites States

Synopsis

U.S.A goverment have published data about weather events. This data set (called “Storm Data”) covers weather events observations from year 1950 to 2011. It also have information about casulties and material damages. Following document is my simple approach to determine most dangerous type of weather event and also most expensive (in terms of meterial damage) type. It is organised as explorary data analysis, no model will be built, there will be simple remarks and comments for results, and everything should be fully reproducible. Data was downloaded at 21 July 2014, and md5sum for data file is df4aa61fff89427db6b7f7b1113b5553.

Data Processing

Obtaining data

Data was opbtained form CloudFront data storage. URLs are included in following R code. Data and documentation files are downloaded only if main data file (StormData.csv.bz2) is missing.

if (!file.exists("StormData.csv.bz2")) {
  download.file(url = "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", 
                destfile = "StormData.csv.bz2")
  download.file(url = "http://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf", 
                destfile = "StormDataDoc.pdf")
  download.file(url = "http://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf", 
                destfile = "StormDataFaq.pdf")
}

Reading into data frame

File with data is just CSV file compressed with bz2 method. We can load it into data frame directly. Lets also observe structure of loaded data.

stormdata <- read.csv("StormData.csv.bz2")
str(stormdata)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "10/10/1954 0:00:00",..: 6523 6523 4213 11116 1426 1426 1462 2873 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "000","0000","00:00:00 AM",..: 212 257 2645 1563 2524 3126 122 1563 3126 3126 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4371 10087 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "?","ABNORMALLY DRY",..: 826 826 826 826 826 826 826 826 826 826 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","E","Eas","EE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","?","(01R)AFB GNRY RNG AL",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","10/10/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels "","?","0000",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","(0E4)PAYSON ARPT",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels "","2","43","9V9",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","5NM E OF FAIRPORT MI TO ROCK I - 5NM E OF FAIRPORT MI TO ROCK I",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "",".","\t","\t\t",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Some variables are not assigned properly, lets fitx that.

stormdata$BGN_DATE <- as.Date(stormdata$BGN_DATE, format = "%m/%d/%Y")
stormdata$END_DATE <- as.Date(stormdata$END_DATE, format = "%m/%d/%Y")

Names in data frame are written using big leters, lets chante it to lowercase.

names(stormdata) <- tolower(names(stormdata))
stormdata$evtype <- tolower(as.character(stormdata$evtype))

Data related to fatalities or injuries:

Data contains fatalities or injuries:

fi.stormdata <- stormdata[stormdata$injuries != 0 | stormdata$fatalities != 0, ]

Number of injuries and fatalities:

injuries <- sum(fi.stormdata$injuries)
fatalities <- sum(fi.stormdata$fatalities)

So, based on this data, we have 1.4053 × 10⁵ injuries and 1.5145 × 10⁴ fatalities. Lets order dataframe by number of fatalities and injuries.

fi.stormdata <- fi.stormdata[with(fi.stormdata, order(-fatalities, -injuries)), ]
head(fi.stormdata$fatalities)

## [1] 583 158 116 114  99  90

There are many different *evtype*s. Lets take 16 (abitrary number) types which are causing most of fatalities.

killer.evtypes <- NULL
for (event in 1:nrow(fi.stormdata)) {
  if (fi.stormdata$evtype[event] %in% killer.evtypes) {NULL}
  else {
    killer.evtypes <- c(killer.evtypes, fi.stormdata$evtype[event])
    }
  if (length(killer.evtypes) == 16) {break}
  }
killer.evtypes

##  [1] "heat"                       "tornado"                   
##  [3] "excessive heat"             "extreme heat"              
##  [5] "heat wave"                  "tsunami"                   
##  [7] "unseasonably warm and dry"  "tornadoes, tstm wind, hail"
##  [9] "tropical storm"             "flash flood"               
## [11] "heavy rain"                 "record/excessive heat"     
## [13] "hurricane/typhoon"          "flood"                     
## [15] "wildfire"                   "landslide"

Some evtype occurs only once so maybe it is good idea to filter data with them.

single.evtype <- names(table(fi.stormdata$evtype)[table(fi.stormdata$evtype) == 1])
head(single.evtype)

## [1] "avalance"                 "black ice"               
## [3] "brush fire"               "coastalstorm"            
## [5] "coastal flooding/erosion" "cold/winds"

Lets check if type of event that occured only once is not in type of most lethal type of events.

intersect(single.evtype, killer.evtypes)

## [1] "record/excessive heat"      "tornadoes, tstm wind, hail"
## [3] "unseasonably warm and dry"

So there are some types of events that are unique but also important in terms of number of casulties. So filtering is not good option.

Ltes chceck how much of total number of fatalities and injuries is covered by number of most lethal type of events (calculated above as killer.evtypes):

injuries.top <- sum(fi.stormdata$injuries[fi.stormdata$evtype %in% killer.evtypes])
fatalities.top <- sum(fi.stormdata$fatalities[fi.stormdata$evtype %in% killer.evtypes])

We have 1.4053 × 10⁵ injuries and 1.5145 × 10⁴ fatalities in whole data set, and 1.1203 × 10⁵ injuries and 1.0626 × 10⁴ fatalities in filtered dataset which is covering 79.7201 percent of injuries and 70.1618 percent of fatalites.

top.fi.stormdata <- fi.stormdata[fi.stormdata$evtype %in% killer.evtypes, ]
total.rows <- nrow(stormdata)
top.fi.rows <- nrow(top.fi.stormdata)

By doing thise type of filtering our working data set (related to fatalities and injuries) has 1.176 percent of original size.

Data related to property damages or crop damages

Property and crop damage is located in two types of columns. One column is having numerical vale (e.g propdmg) and second related column suppose to have order of magnitude (e.g propdmgexp) represented by letter from following list: h - hundred, k - tousand, m - milion, b - bilion, and “” (empty) for plain value. Since there are some different symbols in this type of columns lets examine effect of filtering.

valid.exp <- c("","B","b","K","k","M","m","h","H")
da.stormdata <- stormdata[stormdata$propdmgexp %in% valid.exp, ]
da.stormdata <- da.stormdata[da.stormdata$cropdmgexp %in% valid.exp, ]
da.stormdata$propdmgexp <- factor(da.stormdata$propdmgexp)
da.stormdata$cropdmgexp <- factor(da.stormdata$cropdmgexp)

Lets filter out data which contains information about 0 damages.

da.stormdata <- da.stormdata[da.stormdata$propdmg != 0 | da.stormdata$cropdmg != 0, ]

Since value and factor are in two different columns, lets create additional columns with proper values

converter <- function(letter) {
  if (letter == "h" || letter == "H" ) {
    multiplier <- 100
  } else if (letter == "M" || letter == "m") {
    multiplier <- 1000000
  } else if (letter == "K" || letter == "k") {
    multiplier <- 1000
  } else if (letter == "B" || letter == "b") {
    multiplier <- 1000000000
  } else {
    multiplier <- 1
  }
  multiplier
}

da.stormdata$propdmgtot <- sapply(da.stormdata$propdmgexp, converter)
da.stormdata$propdmgtot <- da.stormdata$propdmgtot * da.stormdata$propdmg

da.stormdata$cropdmgtot <- sapply(da.stormdata$cropdmgexp, converter)
da.stormdata$cropdmgtot <- da.stormdata$cropdmgtot * da.stormdata$cropdmg

I will assume the same methodology of analysis as in case with fatalities and injuries. I think that cropdmg is more important that propdmg. Lets calculate total costs of damages and sort dataframe descending by cropdmg and propdmg.

costproperties <- sum(da.stormdata$propdmgtot)
costcrops <- sum(da.stormdata$cropdmgtot)
da.stormdata <- da.stormdata[with(da.stormdata, order(-cropdmgtot, -propdmgtot)), ]
head(da.stormdata$propdmgtot)

## [1] 5.00e+09 5.00e+05 5.88e+09 0.00e+00 0.00e+00 0.00e+00

There are many different *evtype*s. Lets take 16 (abitrary number) types which are causing most of costs of crops damages.

crops.evtypes <- NULL
for (event in 1:nrow(da.stormdata)) {
  if (da.stormdata$evtype[event] %in% crops.evtypes) {NULL}
  else {
    crops.evtypes <- c(crops.evtypes, da.stormdata$evtype[event])
    }
  if (length(crops.evtypes) == 16) {break}
  }
crops.evtypes

##  [1] "river flood"       "ice storm"         "hurricane/typhoon"
##  [4] "drought"           "extreme cold"      "hurricane"        
##  [7] "flood"             "excessive heat"    "heat"             
## [10] "frost/freeze"      "damaging freeze"   "flash flood"      
## [13] "freeze"            "tropical storm"    "heavy rain"       
## [16] "high wind"

Lets calculate costs for top types of evens

costcrops.top <- sum(da.stormdata$cropdmgtot[da.stormdata$evtype %in% crops.evtypes])
costproperties.top <- sum(da.stormdata$propdmgtot[da.stormdata$evtype %in% crops.evtypes])

We have 4.2732 × 10¹¹ cost of properties damages and 4.9029 × 10¹⁰ cost of crops damages in whole data set, and 2.6585 × 10¹¹ cost of properties damages and 4.2561 × 10¹⁰ cost of crops damages in filtered dataset which is covering 62.213 percent of costs of properties and 86.8083 percent of costs of crops.

top.da.stormdata <- da.stormdata[da.stormdata$evtype %in% crops.evtypes, ]
top.da.rows <- nrow(top.da.stormdata)

By doing thise type of filtering our working data set (related to damages) has 4.3227 percent of original size.

Results

Fatalities and Injuries

Lets make barplot to visualise fatalities and injuries according to event type. Data preparation:

people <- data.frame(type = character(length(killer.evtypes)), fatalities = numeric(length(killer.evtypes)), injuries = numeric(length(killer.evtypes)))
people$type <- killer.evtypes
counter <- 1
for (type in killer.evtypes) {
  people[counter, "injuries"] <- sum(top.fi.stormdata$injuries[top.fi.stormdata$evtype == type])
  people[counter, "fatalities"] <- sum(top.fi.stormdata$fatalities[top.fi.stormdata$evtype == type])
  counter <- counter + 1
}
people$sum <- people$injuries + people$fatalities
maximum <- max(people$sum)
max.people <- people$type[people$sum == maximum]
head(people)

##             type fatalities injuries   sum
## 1           heat        937     2100  3037
## 2        tornado       5633    91346 96979
## 3 excessive heat       1903     6525  8428
## 4   extreme heat         96      155   251
## 5      heat wave        172      379   551
## 6        tsunami         33      129   162

Plot:

par(mai=c(1,3,1,1))
barplot(t(as.matrix(people[,c(2,3)])), names.arg = people$type, cex.names = 1, horiz = TRUE, las = 1, legend = TRUE, col = c("red","blue"), xlab = "Number of people affected")

plot of chunk unnamed-chunk-21

Costs of damages

Lets make barplot to visualise costs of damages according to event type. Data preparation:

damages <- data.frame(type = character(length(crops.evtypes)), crops = numeric(length(crops.evtypes)), properties = numeric(length(crops.evtypes)))
damages$type <- crops.evtypes
counter <- 1
for (type in crops.evtypes) {
  damages[counter, "properties"] <- sum(top.da.stormdata$propdmgtot[top.da.stormdata$evtype == type])
  damages[counter, "crops"] <- sum(top.da.stormdata$cropdmgtot[top.da.stormdata$evtype == type])
  counter <- counter + 1
}
damages$sum <- damages$properties + damages$crops
maximum <- max(damages$sum)
max.damage <- damages$type[damages$sum == maximum]
head(damages)

##                type     crops properties       sum
## 1       river flood 5.029e+09  5.119e+09 1.015e+10
## 2         ice storm 5.022e+09  3.945e+09 8.967e+09
## 3 hurricane/typhoon 2.608e+09  6.931e+10 7.191e+10
## 4           drought 1.397e+10  1.046e+09 1.502e+10
## 5      extreme cold 1.313e+09  6.774e+07 1.381e+09
## 6         hurricane 2.742e+09  1.187e+10 1.461e+10

Plot:

par(mai=c(1,3,1,1))
barplot(t(as.matrix(damages[,c(2,3)])), names.arg = damages$type, cex.names = 1, horiz = TRUE, las = 1, legend = TRUE, col = c("red","blue"), xlab = "Total value of damages")

plot of chunk unnamed-chunk-23

Summary

We can observe based on this simple analysis, that most dangerous type of event for people is tornado and most costly type of events is flood. We can take intersection of this two sets and examine worst yeas based on those type of events.

worst.events <- intersect(crops.evtypes, killer.evtypes)
worst.years <- stormdata$bgn_date[stormdata$evtype %in% worst.events]
hist(worst.years, breaks = "years", main = "Density of worst events in different years", xlab = "Years")

plot of chunk unnamed-chunk-24

Weather events that are causing most dangerous situations and are most devastating: hurricane/typhoon, flood, excessive heat, heat, flash flood, tropical storm, heavy rain.

R and packages information

Following versions of R and packages were used.

sessionInfo()

## R version 3.1.0 (2014-04-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## 
## locale:
##  [1] LC_CTYPE=pl_PL.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=pl_PL.UTF-8        LC_COLLATE=pl_PL.UTF-8    
##  [5] LC_MONETARY=pl_PL.UTF-8    LC_MESSAGES=pl_PL.UTF-8   
##  [7] LC_PAPER=pl_PL.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=pl_PL.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.6
## 
## loaded via a namespace (and not attached):
## [1] digest_0.6.4   evaluate_0.5.5 formatR_0.10   stringr_0.6.2 
## [5] tools_3.1.0