The storms and other severe weather events have caused both public health and economic problems for communities and municipalities of United States. These severe events have resulted in many fatalities and injuries, and agricultural crop and property damages.
The basic goal of this analysis work is to explore the NOAA Storm Database and to answer questions about the types of severe weather events that have the greatest impact to the following aspects:
The analysis shows the top 10 severe weather events that are most harmful to population health are
The analysis shows the top 10 severe weather events that have the greatest economic consequences are
Using the data from U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database from 1950 to end November 2011, the project seeks to gain insights of how severe weather events impact on U.S Public Health and Economic Disruptions. It was noted that the data collected from the earlier years were less complete as there were primarily due to lack of good records. However, data from recent years should be considered more complete.
# function : check if package is installed
is.installed <- function(mypkg){
is.element(mypkg, installed.packages()[,1])
}
# install "downloader" package if not installed
if (!is.installed("downloader")){
install.packages("downloader")
}
# install "dplyr" package if not installed
if (!is.installed("dplyr")){
install.packages("dplyr")
}
# install "ggplot2" package if not installed
if (!is.installed("ggplot2")){
install.packages("ggplot2")
}
# install "R.utils" package if not installed
if (!is.installed("R.utils")){
install.packages("R.utils")
}
# install "R.utils" package if not installed
if (!is.installed("data.table")){
install.packages("data.table")
}
# load required libraries
library(downloader)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.2
library(R.utils)
## Warning: package 'R.utils' was built under R version 3.1.2
## Loading required package: R.oo
## Warning: package 'R.oo' was built under R version 3.1.2
## Loading required package: R.methodsS3
## Warning: package 'R.methodsS3' was built under R version 3.1.2
## R.methodsS3 v1.6.1 (2014-01-04) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.18.0 (2014-02-22) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
##
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
##
## The following objects are masked from 'package:base':
##
## attach, detach, gc, load, save
##
## R.utils v1.34.0 (2014-10-07) successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
##
## The following object is masked from 'package:utils':
##
## timestamp
##
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, inherits, isOpen, parse, warnings
library(data.table)
##
## Attaching package: 'data.table'
##
## The following objects are masked from 'package:dplyr':
##
## between, last
library(knitr)
## Warning: package 'knitr' was built under R version 3.1.2
Storm Data was downloaded from Storm Data from NOAA on 12 December 2014.
# download data from data source
download("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "stormdata.cs.bz2")
# unzip downloaded data
bunzip2("stormdata.csv.bz2", destname=gsub("[.]bz2$", "", "stormdata.csv.bz2"), remove=FALSE, overwrite=TRUE)
# read in data
raw.data <- read.table("stormdata.csv", header = T, sep=",", as.is=TRUE)
str(raw.data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
summary(raw.data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE
## Min. : 1.0 Length:902297 Length:902297 Length:902297
## 1st Qu.:19.0 Class :character Class :character Class :character
## Median :30.0 Mode :character Mode :character Mode :character
## Mean :31.2
## 3rd Qu.:45.0
## Max. :95.0
##
## COUNTY COUNTYNAME STATE EVTYPE
## Min. : 0.0 Length:902297 Length:902297 Length:902297
## 1st Qu.: 31.0 Class :character Class :character Class :character
## Median : 75.0 Mode :character Mode :character Mode :character
## Mean :100.6
## 3rd Qu.:131.0
## Max. :873.0
##
## BGN_RANGE BGN_AZI BGN_LOCATI
## Min. : 0.000 Length:902297 Length:902297
## 1st Qu.: 0.000 Class :character Class :character
## Median : 0.000 Mode :character Mode :character
## Mean : 1.484
## 3rd Qu.: 1.000
## Max. :3749.000
##
## END_DATE END_TIME COUNTY_END COUNTYENDN
## Length:902297 Length:902297 Min. :0 Mode:logical
## Class :character Class :character 1st Qu.:0 NA's:902297
## Mode :character Mode :character Median :0
## Mean :0
## 3rd Qu.:0
## Max. :0
##
## END_RANGE END_AZI END_LOCATI
## Min. : 0.0000 Length:902297 Length:902297
## 1st Qu.: 0.0000 Class :character Class :character
## Median : 0.0000 Mode :character Mode :character
## Mean : 0.9862
## 3rd Qu.: 0.0000
## Max. :925.0000
##
## LENGTH WIDTH F MAG
## Min. : 0.0000 Min. : 0.000 Min. :0.0 Min. : 0.0
## 1st Qu.: 0.0000 1st Qu.: 0.000 1st Qu.:0.0 1st Qu.: 0.0
## Median : 0.0000 Median : 0.000 Median :1.0 Median : 50.0
## Mean : 0.2301 Mean : 7.503 Mean :0.9 Mean : 46.9
## 3rd Qu.: 0.0000 3rd Qu.: 0.000 3rd Qu.:1.0 3rd Qu.: 75.0
## Max. :2315.0000 Max. :4400.000 Max. :5.0 Max. :22000.0
## NA's :843563
## FATALITIES INJURIES PROPDMG
## Min. : 0.0000 Min. : 0.0000 Min. : 0.00
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.00
## Median : 0.0000 Median : 0.0000 Median : 0.00
## Mean : 0.0168 Mean : 0.1557 Mean : 12.06
## 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.: 0.50
## Max. :583.0000 Max. :1700.0000 Max. :5000.00
##
## PROPDMGEXP CROPDMG CROPDMGEXP
## Length:902297 Min. : 0.000 Length:902297
## Class :character 1st Qu.: 0.000 Class :character
## Mode :character Median : 0.000 Mode :character
## Mean : 1.527
## 3rd Qu.: 0.000
## Max. :990.000
##
## WFO STATEOFFIC ZONENAMES LATITUDE
## Length:902297 Length:902297 Length:902297 Min. : 0
## Class :character Class :character Class :character 1st Qu.:2802
## Mode :character Mode :character Mode :character Median :3540
## Mean :2875
## 3rd Qu.:4019
## Max. :9706
## NA's :47
## LONGITUDE LATITUDE_E LONGITUDE_ REMARKS
## Min. :-14451 Min. : 0 Min. :-14455 Length:902297
## 1st Qu.: 7247 1st Qu.: 0 1st Qu.: 0 Class :character
## Median : 8707 Median : 0 Median : 0 Mode :character
## Mean : 6940 Mean :1452 Mean : 3509
## 3rd Qu.: 9605 3rd Qu.:3549 3rd Qu.: 8735
## Max. : 17124 Max. :9706 Max. :106220
## NA's :40
## REFNUM
## Min. : 1
## 1st Qu.:225575
## Median :451149
## Mean :451149
## 3rd Qu.:676723
## Max. :902297
##
nrow <- nrow(raw.data)
ncol <- ncol(raw.data)
d <- strptime(raw.data$BGN_DATE, format="%m/%d/%Y %H:%M:%S")
d <- sort (d)
first.date <- head(d, 1)
first.date <- strsplit(as.character(first.date), split=" ")[[1]]
last.date <- tail(d, 1)
last.date <- strsplit(as.character(last.date), split=" ")[[1]]
A basic analysis of the data was performed and below were the findings:
There are relatively little missing data. The variables with missing data includes
The documentations on the NOAA website were reviewed for the definition of the variables required to perform the analysis. The documents are:
Based on the objectives of this project, the variables with the missing data are not used to perform the analysis. Therefore the analysis work is performed on a complete dataset.
The variables required for the detail analysis are
A subset from the dataset is created and the summary of the subset data is reviewed if cleansing and/or transformation are required.
sub.data <- data.frame(EVTYPE=raw.data$EVTYPE, FATALITIES=raw.data$FATALITIES, INJURIES=raw.data$INJURIES,
PROPDMG=raw.data$PROPDMG, PROPDMGEXP=raw.data$PROPDMGEXP, CROPDMG=raw.data$CROPDMG,
CROPDMGEXP=raw.data$CROPDMGEXP)
summary(sub.data)
## EVTYPE FATALITIES INJURIES
## HAIL :288661 Min. : 0.0000 Min. : 0.0000
## TSTM WIND :219940 1st Qu.: 0.0000 1st Qu.: 0.0000
## THUNDERSTORM WIND: 82563 Median : 0.0000 Median : 0.0000
## TORNADO : 60652 Mean : 0.0168 Mean : 0.1557
## FLASH FLOOD : 54277 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## FLOOD : 25326 Max. :583.0000 Max. :1700.0000
## (Other) :170878
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 :465934 Min. : 0.000 :618413
## 1st Qu.: 0.00 K :424665 1st Qu.: 0.000 K :281832
## Median : 0.00 M : 11330 Median : 0.000 M : 1994
## Mean : 12.06 0 : 216 Mean : 1.527 k : 21
## 3rd Qu.: 0.50 B : 40 3rd Qu.: 0.000 0 : 19
## Max. :5000.00 5 : 28 Max. :990.000 B : 9
## (Other): 84 (Other): 9
# number of event types
length(unique(sub.data$EVTYPE))
## [1] 985
It is observed that there are over 170,000 observations classified into 979 event types. The strategy to generalise event types is then performed.
In addition, there is a need to express the actual property and crop damage in dollars for purpose of cost comparison between two event types.
Event type of similar in nature are combined into a generic event type. They are
In addition, observations with event type ‘Summary’ are summary of statistics. To avoid impacting on the overall data analysis work, they are excluded.
# function which combine similar events to under a more generic event type.
Combine.Event.Type <- function (x){
x <- toupper(x)
for (i in 1:length(x)){
if (grepl("TSTM", x[i])){
x[i] <- "THUNDERSTORM"
} else if (grepl("HURRICANE", x[i])){
x[i] <- "HURRICANE"
} else if (grepl("RAIN|PRECIP?|DEPRESSION", x[i])){
x[i] <- "RAIN"
} else if (grepl("FLOOD|?SLIDE?|MUD?|EROSION|LANDSLUMP|FLD?", x[i])){
x[i] <- "FLOOD"
} else if (grepl("HEAT|HOT|HIGH TEMPERATURE?", x[i])){
x[i] <- "HEAT"
} else if (grepl("HAIL", x[i])){
x[i] <- "HAIL"
} else if (grepl("?STORM|^THU|SHOWERS", x[i])){
x[i] <- "THUNDERSTORM"
} else if (grepl("TORNADO|CLOUD|DUST DEVIL|LANDSPOUT|FUNNEL", x[i])){
x[i] <- "TORNADO"
} else if (grepl("SNOW|CHILL|COLD|BLIZZARD|WINTER|ICE|SLEET|FROST|FREEZ?|ICY|GLAZE|?EXPOSURE|HYPOTHERMIA", x[i])){
x[i] <- "SNOW"
} else if (grepl("WIND|?BURST?|GUST|DUST", x[i])){
x[i] <- "WIND"
} else if (grepl("TSUNAMI|TIDE|WATER?|SEICHE|MARINE|SEA?|WAVE|SURF|?TERSPOUT|SWELL?", x[i])){
x[i] <- "MARINE"
} else if (grepl("FIRE?", x[i])){
x[i] <- "FIRE"
} else if (grepl("VOLCANIC", x[i])){
x[i] <- "VOLCANIC"
} else if (grepl("LIGHTNING", x[i])){
x[i] <- "LIGHTNING"
}
}
return (x)
}
#cast to character type.
Event.Type <- as.character(sub.data$EVTYPE)
#combine event types
sub.data$EVTYPE <- as.factor(Combine.Event.Type(Event.Type))
# remove 'Summary' event type
prep.data <- sub.data[!grepl("^SUMMARY",sub.data$EVTYPE),]
# list of exponential expression for Property damage
unique(prep.data$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
# list of exponential expression for Property damage
unique(prep.data$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
It is observed that the PROPDMGEXP and CROPDMGEXP variables contain values that are non-standard exponential expressions for money. These are then set to 0.
Following are the standard exponential expressions translation:
prep.data$FATALITIES <- as.numeric(prep.data$FATALITIES)
prep.data$INJURIES <- as.numeric(prep.data$INJURIES)
prep.data$PROPDMG <- as.numeric(prep.data$PROPDMG)
prep.data$PROPDMGEXP <- as.character(prep.data$PROPDMGEXP)
prep.data$CROPDMG <- as.numeric(prep.data$CROPDMG)
prep.data$CROPDMGEXP <- as.character(prep.data$CROPDMGEXP)
# function to translate exponential expression to its corresponding numeric value
Convert.Symbol.Numeric <- function (x){
x <- toupper(x)
if (x=="H") return (100)
else if (x=="K") return (1000)
else if (x=="M") return (1000000)
else if (x=="B") return (1000000000)
return (0)
}
# translate from exponential expression to numeric value
prep.data$PROPDMGEXP <- sapply (prep.data$PROPDMGEXP, Convert.Symbol.Numeric)
prep.data$CROPDMGEXP <- sapply (prep.data$CROPDMGEXP, Convert.Symbol.Numeric)
# compute the actual damage in dollars
prep.data$PROPDMGVALUE <- as.numeric(prep.data$PROPDMG * prep.data$PROPDMGEXP)
prep.data$CROPDMGVALUE <- as.numeric(prep.data$CROPDMG * prep.data$CROPDMGEXP)
# summation of variables by event type
tmp.dt <- data.table(prep.data)
clean.data <- as.data.frame(tmp.dt[, list(FATALITIES=sum(FATALITIES),
INJURIES=sum(INJURIES),
PROPDMG=sum(PROPDMGVALUE),
CROPDMG=sum(CROPDMGVALUE)),
by=c("EVTYPE")])
# arrange from least to most impact of event types
impact.population <- arrange(data.frame(EVTYPE=clean.data$EVTYPE,
FATALITIES=clean.data$FATALITIES,
INJURIES=clean.data$INJURIES,
TOTAL=clean.data$FATALITIES+clean.data$INJURIES), desc(TOTAL))
# plot the top 10 most impact event type
sub.impact.population <- head(impact.population,10)
# number of fatalities and injuries to thousands
sub.impact.population$TOTAL <- sapply (sub.impact.population$TOTAL, function (x) x/1e3)
ggplot(sub.impact.population, aes(x=EVTYPE, y=TOTAL)) +
geom_bar(stat="identity") +
scale_x_discrete(limits=rev(sub.impact.population$EVTYPE)) +
coord_flip() +
xlab("Event Type") + theme_bw()+
ylab("Number of Fatalities and Injuries (in thousands)") +
ggtitle("Top 10 Event Types on Fatalities And Injuries")
The plot shows that highest impact is Tornado related events, such as, general tornado, types of clouds, dust devil, landspout, etc.
# arrange from least to most impact of event types
impact.economy <- arrange(data.frame(EVTYPE=as.factor(clean.data$EVTYPE),
PROPDMG=as.numeric(clean.data$PROPDMG),
CROPDMG=as.numeric(clean.data$CROPDMG),
TOTAL=clean.data$PROPDMG+clean.data$CROPDMG), desc(TOTAL))
# plot the top 10 most impact event type
sub.impact.economy <- head(impact.economy,10)
# convert dollar units to billions
sub.impact.economy$TOTAL <- sapply (sub.impact.economy$TOTAL, function (x) x/1e9)
ggplot(sub.impact.economy, aes(x=EVTYPE, y=TOTAL)) +
geom_bar(stat="identity") +
scale_x_discrete(limits=rev(sub.impact.economy$EVTYPE)) +
coord_flip() +
xlab("Event Type") + theme_bw()+
ylab("Property and Crop Damage Costs (in billions)") +
ggtitle("Top 10 Event Types on Crops and Property Damage Costs")
The plot shows that highest impact is Flood related events such as general floods, landslide, mudslide, soil erosion, landslump, etc.