The National Climatic Data Center (NCDC) collects data about storms and other weather events that may result in fatalities, injuries, or property damage. Detailed information about this data is given in the National Weather Service Storm Data Description and the National Climatic Data Center Storm Events FAQ.
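This document discusses two questions: which types of weather events cause the most fatalities and injuries, and which have the greatest economic impact.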
library(plyr)
library(reshape)
library(ggplot2)
library(gridExtra)
The source of data is a copy of the NCDC data, which is located on a repository installed for the Coursera course “Reproducible Research”.
# download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "repda_data_StormData.csv.bz2")
# stormdataDF <- read.csv(bzfile("repda_head.csv.bz2"))
stormdataDF <- read.csv(bzfile("repda_data_StormData.csv.bz2"))
dim(stormdataDF)
## [1] 902297 37
The dataset has 902,297 rows and 37 columns.
names(stormdataDF)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
So first we reduce the size of the dataset to the columns we are interested in:
stormredDF <- stormdataDF[,c(2,8,23,24,25,26,27,28)]
head(stormredDF)
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 1 4/18/1950 0:00:00 TORNADO 0 15 25.0 K
## 2 4/18/1950 0:00:00 TORNADO 0 0 2.5 K
## 3 2/20/1951 0:00:00 TORNADO 0 2 25.0 K
## 4 6/8/1951 0:00:00 TORNADO 0 2 2.5 K
## 5 11/15/1951 0:00:00 TORNADO 0 2 2.5 K
## 6 11/15/1951 0:00:00 TORNADO 0 6 2.5 K
## CROPDMG CROPDMGEXP
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
rm(stormdataDF)
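The column reduction above relies on numeric positions. As a minimal sketch, the same selection can be written with column names, which keeps working if the column order of the source file changes. The data frame `toyDF` and the vector `keep` are hypothetical stand-ins introduced only for illustration:

```r
# Toy stand-in for the full dataset; only the column names matter here.
toyDF <- data.frame(
  STATE__ = 1, BGN_DATE = "4/18/1950 0:00:00", EVTYPE = "TORNADO",
  FATALITIES = 0, INJURIES = 15, PROPDMG = 25, PROPDMGEXP = "K",
  CROPDMG = 0, CROPDMGEXP = "", REFNUM = 1,
  stringsAsFactors = FALSE
)

# The eight columns that positions 2, 8 and 23 to 28 refer to in names(stormdataDF).
keep <- c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES",
          "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")

toyredDF <- toyDF[, keep]
names(toyredDF)
```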
Second, we restrict the time span that we explore, because in the first years of weather monitoring only tornadoes were recorded. Only since 1996 have all 48 event types defined in NWS Directive 10-1605 been recorded, so we explore the time range from 1996 onward. Rows without a valid timestamp are removed from the dataset.
# keep only rows whose BGN_DATE looks like a month/day date
stormredDF <- subset(stormredDF, grepl("[0-9]+/[0-9]+", BGN_DATE, perl=TRUE))
# strptime already returns POSIXlt, so the year field can be read directly
stormredDF$date <- strptime(stormredDF$BGN_DATE,format='%m/%d/%Y %H:%M:%S')
stormredDF$year <- 1900 + stormredDF$date$year
stormredDF <- subset(stormredDF, year > 1995)
stormredDF$EVTYPE <- toupper(stormredDF$EVTYPE)
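The filtering step can be tried on a few rows. The following is only a sketch on invented data: `toyDF` and `recentDF` are hypothetical names, and it uses `as.Date` rather than `strptime`, which is an alternative to the approach in this report, not the same code. `as.Date` ignores the trailing time part once the format has been matched.

```r
# Invented sample rows, including one without a usable timestamp.
toyDF <- data.frame(
  BGN_DATE = c("4/18/1950 0:00:00", "1/6/1996 0:00:00",
               "not a date", "12/31/2011 0:00:00"),
  stringsAsFactors = FALSE
)

# Drop rows whose BGN_DATE does not start with month/day/year.
toyDF <- toyDF[grepl("^[0-9]+/[0-9]+/[0-9]{4}", toyDF$BGN_DATE), , drop = FALSE]

# Parse the date part and extract the year as an integer.
toyDF$year <- as.integer(format(as.Date(toyDF$BGN_DATE, format = "%m/%d/%Y"), "%Y"))

# Keep events from 1996 onward.
recentDF <- toyDF[toyDF$year > 1995, , drop = FALSE]
recentDF$year
```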
fatalitiesDF <- stormredDF[,c(10,2,3,4)]
fatatotalsumDF <- ddply(fatalitiesDF,.(EVTYPE),summarise,Fatalities=sum(FATALITIES),Injuries=sum(INJURIES))
fataordDF <- arrange(fatatotalsumDF, desc(Fatalities))
ftopten <- fataordDF[1:10,]
injuryordDF <- arrange(fatatotalsumDF, desc(Injuries))
itopten <- injuryordDF[1:10,]
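The summing and ranking above can be illustrated on a small example. This sketch uses base R (`aggregate`, `order`, `head`) instead of `ddply` and `arrange`, so it is an equivalent formulation rather than the code of this report; `toyDF`, `sumDF` and `topFatal` are hypothetical names and the numbers are invented.

```r
# Invented event records.
toyDF <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HEAT"),
  FATALITIES = c(1, 0, 2, 5, 1, 4),
  INJURIES   = c(10, 3, 7, 0, 2, 1),
  stringsAsFactors = FALSE
)

# Sum fatalities and injuries per event type.
sumDF <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toyDF, FUN = sum)

# Order by fatalities, largest first, and keep the top two event types.
topFatal <- head(sumDF[order(-sumDF$FATALITIES), ], 2)
topFatal
```

Unlike indexing with `[1:10,]`, `head(x, n)` returns fewer rows without padding when fewer than `n` event types exist.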
The 10 weather phenomena that caused the most fatalities or injuries in the observed time range are the following:
fatalPlot <- ggplot(ftopten, aes(x = EVTYPE, y = Fatalities)) + geom_bar(stat = "identity") +
scale_y_continuous("Fatalities") +
theme(axis.text.x = element_text(angle = 90)) +
xlab("Top Ten Severe Weather Event") +
ggtitle("Total Fatalities")
injuryPlot <- ggplot(itopten, aes(x = EVTYPE, y = Injuries)) + geom_bar(stat = "identity") +
scale_y_continuous("Injuries") +
theme(axis.text.x = element_text(angle = 90)) +
xlab("Top Ten Severe Weather Event") +
ggtitle("Total Injuries")
grid.arrange(fatalPlot, injuryPlot, ncol = 2)
Weather events related to high temperatures or storms cause the highest numbers of victims.
Damage is recorded in two classes: property and crop. Each recorded number is accompanied by a multiplier code: K for thousands, M for millions, and B for billions. Any other code is treated as a factor of 1.
# translate a single multiplier code into its numeric factor
mult <- function(exp) {
exp <- toupper(exp)
if (exp == "K") { 1000 }
else if (exp == "M") { 1000000 }
else if (exp == "B") { 1000000000 }
else 1
}
damageDF <- stormredDF[,c(10,2,5,7,6,8)]
damageDF$PROPDMGEXP <- sapply(as.character(damageDF$PROPDMGEXP), mult)
damageDF$CROPDMGEXP <- sapply(as.character(damageDF$CROPDMGEXP), mult)
damageDF$DMG <- damageDF$PROPDMG*damageDF$PROPDMGEXP + damageDF$CROPDMG*damageDF$CROPDMGEXP
damagesumDF <- ddply(damageDF,.(EVTYPE),summarise,sum=sum(DMG))
damageordDF <- arrange(damagesumDF, desc(sum))
toptendamages <- damageordDF[1:10,]
toptendamages$sum <- toptendamages$sum / 1000000
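The multiplier translation can also be written as a vectorized lookup that avoids calling a function once per row. This is a sketch under assumptions, not the method used above: `factors` and `multVec` are hypothetical names, and treating lowercase codes like their uppercase forms is a choice made for this example.

```r
# Hypothetical lookup table; codes that are not listed fall back to 1.
factors <- c(K = 1e3, M = 1e6, B = 1e9)

multVec <- function(exp) {
  # Look up every code at once; unknown or empty codes yield NA.
  f <- factors[toupper(as.character(exp))]
  f[is.na(f)] <- 1
  unname(f)
}

# Damage in dollars for a few invented value/code pairs.
dmg <- c(25, 2.5, 1, 7, 3)
code <- c("K", "m", "B", "", "0")
dmg * multVec(code)
```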
The phenomena with the greatest economic impact are floods and different types of storms.
ggplot(toptendamages, aes(x=EVTYPE, y=sum)) +
geom_bar(stat="identity") +
coord_flip() +
ggtitle("Top weather phenomena affecting the economy") +
xlab("Event types") +
ylab("Damage (in million dollars)")