Synopsis

The National Climatic Data Center (NCDC) collects data about storms and other weather events that may result in fatalities, injuries, or property damage. Detailed information about this data is given in the National Weather Service Storm Data Description and the National Climatic Data Center Storm Events FAQ.

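This document discusses two questions: which types of weather events cause the most fatalities and injuries, and which types have the greatest economic impact.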

Data Processing

Environment Setup

library(plyr)
library(reshape)
library(ggplot2)
library(gridExtra)

Loading Data

The data is a copy of the NCDC storm data, hosted in a repository set up for the Coursera course “Reproducible Research”.

# download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "repda_data_StormData.csv.bz2")
# stormdataDF <- read.csv(bzfile("repda_head.csv.bz2"))
stormdataDF <- read.csv(bzfile("repda_data_StormData.csv.bz2"))
dim(stormdataDF)
## [1] 902297     37

The dataset has 902,297 rows and 37 columns.

names(stormdataDF)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

So first we reduce the size of the dataset to the columns we are interested in:

  • BGN_DATE: Recorded date of event
  • EVTYPE: type of event
  • FATALITIES: number of fatalities
  • INJURIES: number of injuries
  • PROPDMG: property damage amount in dollars, to be scaled by PROPDMGEXP
  • PROPDMGEXP: multiplier code: “K” for thousands, “M” for millions, and “B” for billions
  • CROPDMG: crop damage amount in dollars, to be scaled by CROPDMGEXP
  • CROPDMGEXP: multiplier code as in PROPDMGEXP
stormredDF <- stormdataDF[,c(2,8,23,24,25,26,27,28)]
head(stormredDF)
##             BGN_DATE  EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 1  4/18/1950 0:00:00 TORNADO          0       15    25.0          K
## 2  4/18/1950 0:00:00 TORNADO          0        0     2.5          K
## 3  2/20/1951 0:00:00 TORNADO          0        2    25.0          K
## 4   6/8/1951 0:00:00 TORNADO          0        2     2.5          K
## 5 11/15/1951 0:00:00 TORNADO          0        2     2.5          K
## 6 11/15/1951 0:00:00 TORNADO          0        6     2.5          K
##   CROPDMG CROPDMGEXP
## 1       0           
## 2       0           
## 3       0           
## 4       0           
## 5       0           
## 6       0
rm(stormdataDF)
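
Selecting columns by position breaks silently if the column order ever changes. As a minimal sketch, the same reduction can be written with column names. The data frame `toyDF` is a made-up stand-in for `stormdataDF` with illustrative values, and `keepCols` and `toyRedDF` are hypothetical names that do not appear in the analysis.

```r
# Toy stand-in for stormdataDF (illustrative values, not real records)
toyDF <- data.frame(
    STATE__    = c(1, 1),
    BGN_DATE   = c("4/18/1950 0:00:00", "1/6/1996 0:00:00"),
    EVTYPE     = c("TORNADO", "WINTER STORM"),
    FATALITIES = c(0, 0),
    INJURIES   = c(15, 0),
    PROPDMG    = c(25, 380),
    PROPDMGEXP = c("K", "K"),
    CROPDMG    = c(0, 38),
    CROPDMGEXP = c("", "K"),
    REFNUM     = c(1, 2),
    stringsAsFactors = FALSE
)

# Columns of interest, listed by name instead of by index
keepCols <- c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES",
              "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
toyRedDF <- toyDF[, keepCols]
names(toyRedDF)
```

Note that the later steps of this analysis address columns by position, so they would need the same treatment if the column order changed.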

Second, we reduce the time span that we explore, because in the first years of weather monitoring only tornadoes were recorded. Only since 1996 have the 48 event types defined in NWS Directive 10-1605 been recorded. Therefore we restrict the analysis to events from 1996 onward. Rows without a real timestamp are eliminated from the dataset.

# Keep only rows whose BGN_DATE looks like a date (month/day/...)
stormredDF <- subset(stormredDF, grepl("[0-9]+/[0-9]+", BGN_DATE))
# strptime() already returns POSIXlt, whose year field counts from 1900
stormredDF$date <- strptime(stormredDF$BGN_DATE, format = '%m/%d/%Y %H:%M:%S')
stormredDF$year <- 1900 + stormredDF$date$year
stormredDF <- subset(stormredDF, year > 1995)
# Normalize event names so that case variants are aggregated together
stormredDF$EVTYPE <- toupper(stormredDF$EVTYPE)
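
To illustrate the date filter in isolation, here is a small sketch on three made-up timestamps in the `BGN_DATE` format. The vector names `bgn`, `parsed`, `years`, and `keep` are hypothetical, and the `tz = "UTC"` argument is an assumption added only to make the example independent of the local time zone.

```r
# Illustrative timestamps; the last one is deliberately not a date
bgn <- c("4/18/1950 0:00:00", "1/6/1996 0:00:00", "not a date")

# Unparseable strings become NA instead of raising an error
parsed <- strptime(bgn, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")

# POSIXlt stores the year as an offset from 1900
years <- 1900 + parsed$year

# Keep only valid timestamps from 1996 onward
keep <- !is.na(years) & years > 1995
bgn[keep]
```

Only the 1996 entry survives: the 1950 entry is too early, and the invalid string is dropped because its year is `NA`.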

Results

Fatalities and Injuries

fatalitiesDF <- stormredDF[,c(10,2,3,4)]
fatatotalsumDF <- ddply(fatalitiesDF,.(EVTYPE),summarise,Fatalities=sum(FATALITIES),Injuries=sum(INJURIES))
fataordDF <- arrange(fatatotalsumDF, desc(Fatalities))
ftopten <- fataordDF[1:10,]
injuryordDF <- arrange(fatatotalsumDF, desc(Injuries))
itopten <- injuryordDF[1:10,]
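
The aggregation above relies on `ddply` and `arrange` from plyr. As a rough sketch of the same idea in base R, the totals per event type can be computed with `aggregate` and ranked with `order`. The data frame `toyDF` and the names `totals` and `topFatal` are hypothetical, and the numbers are invented for illustration.

```r
# Toy event records (invented values)
toyDF <- data.frame(
    EVTYPE     = c("FLOOD", "TORNADO", "FLOOD", "HEAT"),
    FATALITIES = c(1, 3, 2, 5),
    INJURIES   = c(0, 10, 4, 1),
    stringsAsFactors = FALSE
)

# Sum fatalities and injuries per event type
totals <- aggregate(toyDF[, c("FATALITIES", "INJURIES")],
                    by = list(EVTYPE = toyDF$EVTYPE), FUN = sum)

# Rank by fatalities, highest first, and keep the top two
topFatal <- head(totals[order(-totals$FATALITIES), ], 2)
topFatal
```

Ranking by injuries works the same way with `order(-totals$INJURIES)`.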

The ten weather phenomena that caused the most fatalities or injuries in the observed time range are the following:

# Bar charts of the ten event types with the most fatalities and injuries
# (binwidth dropped: it does not apply to a discrete x axis)
fatalPlot <- qplot(EVTYPE, data = ftopten, weight = Fatalities, geom = "bar") + 
    scale_y_continuous("Fatalities") + 
    theme(axis.text.x = element_text(angle = 90)) + 
    xlab("Top Ten Severe Weather Events") + 
    ggtitle("Total Fatalities")
injuryPlot <- qplot(EVTYPE, data = itopten, weight = Injuries, geom = "bar") + 
    scale_y_continuous("Injuries") + 
    theme(axis.text.x = element_text(angle = 90)) + 
    xlab("Top Ten Severe Weather Events") + 
    ggtitle("Total Injuries")
# Show both charts side by side
grid.arrange(fatalPlot, injuryPlot, ncol = 2)

Weather events related to high temperatures or storms account for the highest numbers of victims.

Economic Effects

Damage is recorded in two classes: property and crop. Each recorded amount comes with a multiplier code (“K”, “M”, or “B”) that has to be converted into a numeric factor before the amounts can be summed.

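mult <- function(exp) {
    # Map a multiplier code to its numeric factor, ignoring case;
    # any other code (including an empty one) counts as a factor of 1
    exp <- toupper(exp)
    if (exp == "K") { 1000 }
    else if (exp == "M") { 1000000 }
    else if (exp == "B") { 1000000000 }
    else 1
}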
damageDF <-  stormredDF[,c(10,2,5,7,6,8)]
damageDF$PROPDMGEXP <- sapply(as.character(damageDF$PROPDMGEXP), mult)
damageDF$CROPDMGEXP <- sapply(as.character(damageDF$CROPDMGEXP), mult)
damageDF$DMG <- damageDF$PROPDMG*damageDF$PROPDMGEXP + damageDF$CROPDMG*damageDF$CROPDMGEXP
damagesumDF <-  ddply(damageDF,.(EVTYPE),summarise,sum=sum(DMG))
damageordDF <- arrange(damagesumDF, desc(sum))
toptendamages <- damageordDF[1:10,]
toptendamages$sum <- toptendamages$sum / 1000000
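
The conversion above calls `mult` once per row through `sapply`. As an alternative sketch, the same mapping can be done in one vectorized step with a named lookup vector. The names `multipliers` and `toMultiplier` are hypothetical and not part of the analysis, and the treatment of unknown codes as a factor of 1 mirrors the `else 1` branch of `mult`.

```r
# Lookup table from multiplier code to numeric factor
multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

toMultiplier <- function(exp) {
    # Indexing by name returns NA for codes that are not in the table
    m <- multipliers[toupper(as.character(exp))]
    # Unknown or empty codes count as a factor of 1
    m[is.na(m)] <- 1
    unname(m)
}

# Codes in mixed case, plus an empty and an unknown code
toMultiplier(c("K", "m", "B", "", "?"))

# A recorded amount of 25 with code "K" stands for 25,000 dollars
25 * toMultiplier("K")
```

On a column with several hundred thousand rows, a single vectorized lookup of this kind avoids the per-element function calls made by `sapply`.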

The phenomena with the greatest economic impact are floods and various types of storms.

# Horizontal bar chart of total damage per event type
ggplot(toptendamages, aes(x = EVTYPE, y = sum)) + 
    geom_bar(stat = "identity") + 
    coord_flip() + 
    ggtitle("Top weather phenomena affecting the economy") + 
    xlab("Event types") + 
    ylab("Damage (in millions of dollars)")