Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
The basic goal of this analysis is to explore the NOAA Storm Database and answer some basic questions about severe weather events. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
knitr::opts_chunk$set(echo = TRUE, warning = FALSE)
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site: storm data[47Mb]
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
The basic goal of this analysis is to explore the NOAA Storm Database and answer the following basic questions about severe weather events.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
The data was downloaded from the above link and saved in a directory on the local computer. The working directory was set using the setwd() command in which one can provide a path to the directory where the data is placed. Then the data was loading in R using read.csv(). If the data is aleady loaded in the object "storm data", R will not load the data again. Since, the data is compressed using the bzip2 algorithm so unzipping the .csv file using bzfile().
if(!exists("storm.data")) {
storm.data <- read.csv(bzfile("repdata_data_StormData.csv.bz2"),header = TRUE)
}
dim(storm.data)
## [1] 902297 37
str(storm.data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
names(storm.data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
Health variables:
Economic variables:
Events - target variable:
voi <- c( "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
data_of_interest <- storm.data[, voi]
Checking for head and tail of data for any missing values:
head(data_of_interest)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
tail(data_of_interest)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 902292 WINTER WEATHER 0 0 0 K 0 K
## 902293 HIGH WIND 0 0 0 K 0 K
## 902294 HIGH WIND 0 0 0 K 0 K
## 902295 HIGH WIND 0 0 0 K 0 K
## 902296 BLIZZARD 0 0 0 K 0 K
## 902297 HEAVY SNOW 0 0 0 K 0 K
Looking at the head and tail of data there seem to be no NA values.
Checking to confirm for any missing values in data:
Check for health variables - there is no NA’s in the data.
sum(is.na(data_of_interest$FATALITIES))
## [1] 0
sum(is.na(data_of_interest$INJURIES))
## [1] 0
Check for economic variables for "size" of damage - there is no NA's in the data.
sum(is.na(data_of_interest$PROPDMG))
## [1] 0
sum(is.na(data_of_interest$CROPDMG))
## [1] 0
Check for economic variables for units damage - there is no NA’s in the data.
sum(is.na(data_of_interest$PROPDMGEXP))
## [1] 0
sum(is.na(data_of_interest$CROPDMGEXP))
## [1] 0
length(unique(data_of_interest$EVTYPE))
## [1] 985
So, there are total 985 types of events.
First 10 events that most appear in the dataset:
sort(table(data_of_interest$EVTYPE), decreasing = TRUE)[1:10]
##
## HAIL TSTM WIND THUNDERSTORM WIND TORNADO
## 288661 219940 82563 60652
## FLASH FLOOD FLOOD THUNDERSTORM WINDS HIGH WIND
## 54277 25326 20843 20212
## LIGHTNING HEAVY SNOW
## 15754 15708
Grouping the data based on the most occurring keywords(like WIND). New variable EVENTS is the transform variable of EVTYPE that have 10 different types of events: HEAT, FLOOD, etc., and type OTHER for events in which name the keyword is not found.
#create a new variable EVENT
data_of_interest$EVENT <- "OTHER"
event = c("HAIL", "HEAT", "FLOOD", "WIND", "STORM", "SNOW", "TORNADO", "WINTER", "RAIN")
for(key in event){
data_of_interest$EVENT[grep(key, data_of_interest$EVTYPE, ignore.case = TRUE)] <- key
}
# listing the transformed event types
sort(table(data_of_interest$EVENT), decreasing = TRUE)
##
## HAIL WIND STORM FLOOD TORNADO OTHER WINTER SNOW RAIN HEAT
## 289270 255362 113156 82686 60700 48970 19604 17660 12241 2648
sort(table(data_of_interest$PROPDMGEXP), decreasing = TRUE)[1:10]
##
## K M 0 B 5 1 2 ? m
## 465934 424665 11330 216 40 28 25 13 8 7
sort(table(data_of_interest$CROPDMGEXP), decreasing = TRUE)[1:10]
##
## K M k 0 B ? 2 m <NA>
## 618413 281832 1994 21 19 9 7 1 1
There is some mess in units, so we transform those variables in one unit (dollar) variable by the following rule:
New variable(s) is product of value of damage and dollar unit.
##for property
data_of_interest$PROPDMGEXP[is.na(data_of_interest$PROPDMGEXP)] <- 0 # NA's considered as dollars
data_of_interest$PROPDMGEXP[!grepl("K|M|B", data_of_interest$PROPDMGEXP, ignore.case = TRUE)] <- 0 # everything exept K,M,B is dollar
data_of_interest$PROPDMGEXP[grep("K", data_of_interest$PROPDMGEXP, ignore.case = TRUE)] <- "3"
data_of_interest$PROPDMGEXP[grep("M", data_of_interest$PROPDMGEXP, ignore.case = TRUE)] <- "6"
data_of_interest$PROPDMGEXP[grep("B", data_of_interest$PROPDMGEXP, ignore.case = TRUE)] <- "9"
data_of_interest$PROPDMGEXP <- as.numeric(data_of_interest$PROPDMGEXP)
#create a variable with actual property damage value
data_of_interest$property.damage <- data_of_interest$PROPDMG * 10^data_of_interest$PROPDMGEXP
##for crop
data_of_interest$CROPDMGEXP[is.na(data_of_interest$CROPDMGEXP)] <- 0 # NA's considered as dollars
data_of_interest$CROPDMGEXP[!grepl("K|M|B", data_of_interest$CROPDMGEXP, ignore.case = TRUE)] <- 0 # everything exept K,M,B is dollar
data_of_interest$CROPDMGEXP[grep("K", data_of_interest$CROPDMGEXP, ignore.case = TRUE)] <- "3"
data_of_interest$CROPDMGEXP[grep("M", data_of_interest$CROPDMGEXP, ignore.case = TRUE)] <- "6"
data_of_interest$CROPDMGEXP[grep("B", data_of_interest$CROPDMGEXP, ignore.case = TRUE)] <- "9"
data_of_interest$CROPDMGEXP <- as.numeric(data_of_interest$CROPDMGEXP)
#create a variable with actual property damage value
data_of_interest$crop.damage <- data_of_interest$CROPDMG * 10^data_of_interest$CROPDMGEXP
aggr.fatalities <- aggregate(FATALITIES~EVENT, data = data_of_interest, sum)
names(aggr.fatalities)[2] <- "TOTAL"
aggr.fatalities$type <- "fatalities"
aggr.injuries <- aggregate(INJURIES~EVENT, data = data_of_interest, sum)
names(aggr.injuries)[2] <- "TOTAL"
aggr.injuries$type <- "injuries"
aggr.health<- rbind(aggr.fatalities, aggr.injuries)
aggr.health
## EVENT TOTAL type
## 1 FLOOD 1524 fatalities
## 2 HAIL 15 fatalities
## 3 HEAT 3138 fatalities
## 4 OTHER 2626 fatalities
## 5 RAIN 114 fatalities
## 6 SNOW 164 fatalities
## 7 STORM 416 fatalities
## 8 TORNADO 5661 fatalities
## 9 WIND 1209 fatalities
## 10 WINTER 278 fatalities
## 11 FLOOD 8602 injuries
## 12 HAIL 1371 injuries
## 13 HEAT 9224 injuries
## 14 OTHER 12224 injuries
## 15 RAIN 305 injuries
## 16 SNOW 1164 injuries
## 17 STORM 5339 injuries
## 18 TORNADO 91407 injuries
## 19 WIND 9001 injuries
## 20 WINTER 1891 injuries
aggr.property <- aggregate(property.damage~EVENT, data = data_of_interest, sum)
names(aggr.property)[2] <- "TOTAL"
aggr.property$type <- "property"
aggr.crop <- aggregate(crop.damage~EVENT, data = data_of_interest, sum)
names(aggr.crop)[2] <- "TOTAL"
aggr.crop$type <- "crop"
aggr.economy <- rbind(aggr.property, aggr.crop)
aggr.economy
## EVENT TOTAL type
## 1 FLOOD 167502193929 property
## 2 HAIL 15733043048 property
## 3 HEAT 20325750 property
## 4 OTHER 97246712337 property
## 5 RAIN 3270230192 property
## 6 SNOW 1024169752 property
## 7 STORM 66304415393 property
## 8 TORNADO 58593098029 property
## 9 WIND 10847166618 property
## 10 WINTER 6777295251 property
## 11 FLOOD 12266906100 crop
## 12 HAIL 3046837473 crop
## 13 HEAT 904469280 crop
## 14 OTHER 23588880870 crop
## 15 RAIN 919315800 crop
## 16 SNOW 134683100 crop
## 17 STORM 6374474888 crop
## 18 TORNADO 417461520 crop
## 19 WIND 1403719150 crop
## 20 WINTER 47444000 crop
#transform EVENT variable to factor
aggr.health$EVENT <- as.factor(aggr.health$EVENT)
# plot FATALITIES and INJURIES by EVENT
type <- c("fatalities", "injuries")
health.plot <- ggplot(aggr.health, aes(x = EVENT, y = TOTAL, fill = type)) +
geom_bar(stat = "identity") +
coord_flip() +
xlab("Event Type") +
ylab("Total number of health impact") +
ggtitle("Weather event types impact on public health") +
theme(plot.title = element_text(hjust = 0.5))
print(health.plot)
The most harmful weather event for health (in number of total fatalites and injuries) is, by far, a TORNADO.
#transform EVENT variable to factor
aggr.economy$EVENT <- as.factor(aggr.economy$EVENT)
# plot PROPERTY damage and CROP damage by EVENT
economic.plot <- ggplot(aggr.economy, aes(x = EVENT, y = TOTAL, fill = type)) + geom_bar(stat = "identity") +
coord_flip() +
xlab("Event Type") +
ylab("Total damage in dollars") +
ggtitle("Weather event types impact on property and crop damage") +
theme(plot.title = element_text(hjust = 0.5))
print(economic.plot)
The most devastating weather event with the greatest economic cosequences (to property and crops) is a flood.