The U.S. National Weather Service (NWS) recorded storms and other severe weather events from 1950 to 2011. Officially, these events are categorized into 48 types. In the data, however, the events are registered under many different names. The goal of this project is to map the recorded event types to the official ones and to find out which types caused the most casualties and economic damage. Our analysis shows that tornadoes are the most destructive in terms of both casualties and economic damage.
# set up working directory
setwd('E:/Rstudio/reproducible research/project2')
if(!file.exists('repdata-data-StormData.csv')) {
# download data
url <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
download.file(url, 'repdata-data-StormData.csv.bz2')
# unzip data
library(R.utils)
bunzip2('repdata-data-StormData.csv.bz2')
# unzip() and unz() not working for .bz2 file
}
Read the data using read.csv(). It is slow but reliable.
storm <- read.csv('repdata-data-StormData.csv', stringsAsFactors=FALSE,
na.strings=c('NA', ''))
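As an aside, read.csv() can usually read a bzip2-compressed file directly, because R's file connections detect the compression when opened for reading. The sketch below illustrates this on a tiny temporary file; the file contents and the names `tmp` and `small` are made up for the example and are not part of the storm data.

```r
# write a tiny compressed CSV to a temporary file (stand-in for the storm data)
tmp <- tempfile(fileext = '.csv.bz2')
con <- bzfile(tmp, 'w')
writeLines(c('EVTYPE,FATALITIES', 'TORNADO,3', ',0'), con)
close(con)

# read the compressed file directly, with the same options as used above;
# the empty EVTYPE field in the second row is read as NA
small <- read.csv(tmp, stringsAsFactors = FALSE, na.strings = c('NA', ''))
print(small)
```

If this works on your system, the separate bunzip2() step is optional.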
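The data has 37 variables. We are interested in five of them: ‘EVTYPE’ - the type of weather event, ‘FATALITIES’ - how many people were killed, ‘INJURIES’ - how many people were injured, ‘PROPDMG’ - the recorded property damage amount, and ‘CROPDMG’ - the recorded crop damage amount. Note that the data also stores a magnitude for each damage amount in the columns ‘PROPDMGEXP’ and ‘CROPDMGEXP’ (for example K for thousands, M for millions); this analysis does not use them and sums the recorded amounts as they are.
To make the data easier to process, we keep only the variables we are interested in.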
loss <- subset(storm, select = c('EVTYPE','FATALITIES','INJURIES','PROPDMG','CROPDMG'))
A list of official event types can be found on page 6 of the National Weather Service (NWS) Storm Data Documentation. The NWS identifies a total of 48 weather types. We convert all types to upper case and store them in a vector.
# all event types in alphabetical order
officialType <- c('Astronomical Low Tide', 'Avalanche', 'Blizzard', 'Coastal Flood',
'Cold/Wind Chill', 'Debris Flow', 'Dense Fog', 'Dense Smoke',
'Drought', 'Dust Devil', 'Dust Storm', 'Excessive Heat',
'Extreme Cold/Wind Chill', 'Flash Flood', 'Flood', 'Frost/Freeze',
'Funnel Cloud', 'Freezing Fog', 'Hail', 'Heat', 'Heavy Rain',
'Heavy Snow', 'High Surf', 'High Wind', 'Hurricane',
'Ice Storm', 'Lake-Effect Snow', 'Lakeshore Flood', 'Lightning',
'Marine Hail', 'Marine High Wind', 'Marine Strong Wind',
'Marine Thunderstorm Wind', 'Rip Current', 'Seiche', 'Sleet',
'Storm Surge/Tide', 'Strong Wind', 'Thunderstorm Wind', 'Tornado',
'Tropical Depression', 'Tropical Storm', 'Tsunami', 'Volcanic Ash',
'Waterspout', 'Wildfire', 'Winter Storm', 'Winter Weather'
)
# note: 'Hurricane (Typhoon)' shortened to 'Hurricane'
# convert to upper case
officialType <- toupper(officialType)
If we look into the storm data, there are an astonishing 985 unique weather types. Let’s order them alphabetically and examine them by eye.
sort(unique(loss$EVTYPE))
If we print them out, we can see that many of them are actually the same type but appear different because of slight differences in the names. Some differences are caused by extra spaces, lower case versus upper case, or simple typos. We need to correct these problems.
As the first step, we convert all event types to upper case.
loss$EVTYPE <- toupper(loss$EVTYPE)
This reduces the number of recorded weather types to 898, but that still far exceeds the number of official types. More cleaning is needed.
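Extra spaces, mentioned above as another source of duplicates, could be handled in a similar way. The following is only a sketch on made-up example strings, not a step of this analysis: it trims leading and trailing whitespace with trimws() and collapses repeated spaces with gsub() before converting to upper case.

```r
# made-up examples of the same event type written in different ways
x <- c(' Tornado', 'TORNADO', 'tornado ', 'Flash  Flood', 'FLASH FLOOD')

# trim the ends, collapse runs of whitespace to one space, convert to upper case
cleaned <- toupper(gsub('\\s+', ' ', trimws(x)))

# the five spellings reduce to two distinct types
print(unique(cleaned))
```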
We will use the function adist(), which calculates the Levenshtein distance between two strings, to find the closest official type for each recorded type and replace the recorded type with it. For this purpose we define a function:
# replace strings in vector a with closest strings in vector b
str_replace <- function(a, b) {
  # compute distance matrix between the two vectors
  # (rows correspond to strings in a, columns to strings in b)
  distMatrix <- adist(a, b)
  # for each string in a, obtain the index of the closest string in b
  closestPair <- apply(distMatrix, 1, which.min)
  # (i, closestPair[i]) is a pair indicating that i^th string in a
  # is closest to closestPair[i]^th string in b
  # return the closest string in b for each string in a
  b[closestPair]
}
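To make the matching idea concrete, here is a small sketch on made-up inputs; the vectors `recorded` and `official` below are illustrative and much shorter than the real ones. Each recorded name is matched to the official name with the smallest Levenshtein distance.

```r
# made-up recorded names with typos or plural forms
recorded <- c('TORNADOS', 'HAILL', 'FLASH FLOODS')
# a short stand-in for the official list
official <- c('TORNADO', 'HAIL', 'FLASH FLOOD', 'FLOOD')

# distance matrix: one row per recorded name, one column per official name
d <- adist(recorded, official)
print(d)

# pick, for each row, the official name with the smallest distance
nearest <- official[apply(d, 1, which.min)]
print(nearest)
```

Here 'FLASH FLOODS' is one edit away from 'FLASH FLOOD' but seven edits away from 'FLOOD', so the longer and more specific name is chosen.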
Let’s make a data frame of event types. The first column holds the recorded types in the data. The second holds the official NWS type that replaces the recorded type in the same row.
recordedType <- sort(unique(loss$EVTYPE))
dfEventType <- data.frame(recorded=recordedType,
official=str_replace(recordedType, officialType),
stringsAsFactors=FALSE)
Eyeballing dfEventType, the accuracy of the replacement is reasonably good. Further improvements in matching would have to be done manually, but I have run out of time for this project and have to move on to the next step.
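For readers who want to continue, the manual corrections could be sketched as a small lookup of overrides applied on top of the distance-based match. Everything below is hypothetical: `manualFix` and `demoType` are illustrative names, and the single override shown is just one example of an abbreviation that edit distance handles poorly ('TSTM WIND' is 4 edits from 'HIGH WIND' but 8 from 'THUNDERSTORM WIND').

```r
# abbreviation that edit distance maps to the wrong official type
print(adist('TSTM WIND', c('THUNDERSTORM WIND', 'HIGH WIND')))

# hypothetical table of manual overrides: names are recorded types,
# values are the official types they should map to
manualFix <- c('TSTM WIND' = 'THUNDERSTORM WIND')

# small stand-in for dfEventType after distance-based matching
demoType <- data.frame(recorded = c('TSTM WIND', 'TORNADOS'),
                       official = c('HIGH WIND', 'TORNADO'),
                       stringsAsFactors = FALSE)

# overwrite the official type wherever a manual override exists
hit <- demoType$recorded %in% names(manualFix)
demoType$official[hit] <- unname(manualFix[demoType$recorded[hit]])
print(demoType)
```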
# create a new column 'official' by looking up each element of loss$EVTYPE
# in dfEventType$recorded and taking the official type from the same row.
# match() is vectorised, so no explicit loop or progress printout is needed
loss$official <- dfEventType$official[match(loss$EVTYPE, dfEventType$recorded)]
To examine the losses and damage caused by each weather type, we group the data by official type.
library(dplyr)
byType <- group_by(loss, official)
The total losses and damage for each type are summarised below.
lossByType <- summarise(byType, totalCasualty=sum(FATALITIES)+sum(INJURIES),
totalDamage=sum(PROPDMG)+sum(CROPDMG))
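One way to check which types dominate before plotting is to sort the summary table. The sketch below uses a made-up table `toy` with invented numbers in place of lossByType; only the ordering step is the point.

```r
# made-up summary table; the numbers are invented for illustration
toy <- data.frame(official = c('TORNADO', 'HEAT', 'FLOOD'),
                  totalCasualty = c(120, 45, 80),
                  stringsAsFactors = FALSE)

# sort from the largest to the smallest total and show the top rows
ranked <- toy[order(toy$totalCasualty, decreasing = TRUE), ]
print(head(ranked, 2))
```

The same ordering applied to lossByType would list the most harmful event types first.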
Figure 1 shows the total casualties of each type of event in 1950-2011. Tornado is the most harmful, with a total of 96997 people killed or injured, far more than any other type of event.
# specify figure caption for figure 1
fig1_caption <- 'Figure 1: total casualties caused by each weather type in 1950-2011.
Tornado killed or injured far more people than any other type.'
library(ggplot2)
ggplot(lossByType, aes(x=official, y=totalCasualty)) +
geom_bar(stat="identity", fill="dark blue") +
# geom_text(aes(label=totalCasualty), vjust=-0.4) +
ylab("total casualty (person)") +
theme(axis.title.x = element_blank(), axis.text.x = element_text(angle = 70, hjust = 1))
Figure 1: total casualties caused by each weather type in 1950-2011. Tornado killed or injured far more people than any other type.
Figure 2 shows the total economic damage, in billions of dollars, of each type of event in 1950-2011. Again, tornado is the most costly type of event, with a total damage of 3.3 billion dollars.
# specify figure caption for figure 2
fig2_caption <- 'Figure 2: total economic damage caused by each weather
type in 1950-2011. Tornado is the most costly weather type.'
library(ggplot2)
ggplot(lossByType, aes(x=official, y=totalDamage/1e6)) +
geom_bar(stat="identity", fill="dark blue") +
# geom_text(aes(label=round(totalDamage/1e6,1)), vjust=-0.4) +
ylab("total damage (billion dollars)") +
theme(axis.title.x = element_blank(), axis.text.x = element_text(angle = 70, hjust = 1))
Figure 2: total economic damage caused by each weather type in 1950-2011. Tornado is the most costly weather type.