STORM_DATA_ANALYSIS

Exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database (Health and Economic Impacts)

Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

The basic goal of this analysis is to explore the NOAA Storm Database and answer some basic questions about severe weather events. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Basic Settings

knitr::opts_chunk$set(echo = TRUE, warning = FALSE)

Data

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site: storm data[47Mb]

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Goals

The basic goal of this analysis is to explore the NOAA Storm Database and answer the following basic questions about severe weather events.

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

Data processing

Loading Libraries

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Loading Data

The data was downloaded from the above link and saved in a directory on the local computer. The working directory was set using the setwd() command in which one can provide a path to the directory where the data is placed. Then the data was loading in R using read.csv(). If the data is aleady loaded in the object "storm data", R will not load the data again. Since, the data is compressed using the bzip2 algorithm so unzipping the .csv file using bzfile().

if(!exists("storm.data")) {
    storm.data <- read.csv(bzfile("repdata_data_StormData.csv.bz2"),header = TRUE)
  }

Examine Data

dim(storm.data)

## [1] 902297     37

str(storm.data)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

names(storm.data)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

Columns of Interest

Health variables:

FATALITIES: approx. number of deaths
INJURIES: approx. number of injuries

Economic variables:

PROPDMG: approx. property damages
PROPDMGEXP: the units for property damage value
CROPDMG: approx. crop damages
CROPDMGEXP: the units for crop damage value

Events - target variable:

EVTYPE: weather event (Tornados, Wind, Snow, Flood, etc..

Extract Variables of Interest

voi <- c( "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
data_of_interest <- storm.data[, voi]

Checking for head and tail of data for any missing values:

head(data_of_interest)

##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          K       0           
## 2 TORNADO          0        0     2.5          K       0           
## 3 TORNADO          0        2    25.0          K       0           
## 4 TORNADO          0        2     2.5          K       0           
## 5 TORNADO          0        2     2.5          K       0           
## 6 TORNADO          0        6     2.5          K       0

tail(data_of_interest)

##                EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 902292 WINTER WEATHER          0        0       0          K       0          K
## 902293      HIGH WIND          0        0       0          K       0          K
## 902294      HIGH WIND          0        0       0          K       0          K
## 902295      HIGH WIND          0        0       0          K       0          K
## 902296       BLIZZARD          0        0       0          K       0          K
## 902297     HEAVY SNOW          0        0       0          K       0          K

Looking at the head and tail of data there seem to be no NA values.

Checking to confirm for any missing values in data:

Check for health variables - there is no NA’s in the data.

sum(is.na(data_of_interest$FATALITIES))

## [1] 0

sum(is.na(data_of_interest$INJURIES))

## [1] 0

Check for economic variables for "size" of damage - there is no NA's in the data.

sum(is.na(data_of_interest$PROPDMG))

## [1] 0

sum(is.na(data_of_interest$CROPDMG))

## [1] 0

Check for economic variables for units damage - there is no NA’s in the data.

sum(is.na(data_of_interest$PROPDMGEXP))

## [1] 0

sum(is.na(data_of_interest$CROPDMGEXP))

## [1] 0

Types of Events

length(unique(data_of_interest$EVTYPE))

## [1] 985

So, there are total 985 types of events.

First 10 events that most appear in the dataset:

sort(table(data_of_interest$EVTYPE), decreasing = TRUE)[1:10]

## 
##               HAIL          TSTM WIND  THUNDERSTORM WIND            TORNADO 
##             288661             219940              82563              60652 
##        FLASH FLOOD              FLOOD THUNDERSTORM WINDS          HIGH WIND 
##              54277              25326              20843              20212 
##          LIGHTNING         HEAVY SNOW 
##              15754              15708

Transforming Variables

Grouping Data:

Grouping the data based on the most occurring keywords(like WIND). New variable EVENTS is the transform variable of EVTYPE that have 10 different types of events: HEAT, FLOOD, etc., and type OTHER for events in which name the keyword is not found.

#create a new variable EVENT
data_of_interest$EVENT <- "OTHER"
event = c("HAIL", "HEAT", "FLOOD", "WIND", "STORM", "SNOW", "TORNADO", "WINTER", "RAIN")
for(key in event){
  data_of_interest$EVENT[grep(key, data_of_interest$EVTYPE, ignore.case = TRUE)] <- key
}
# listing the transformed event types 
sort(table(data_of_interest$EVENT), decreasing = TRUE)

## 
##    HAIL    WIND   STORM   FLOOD TORNADO   OTHER  WINTER    SNOW    RAIN    HEAT 
##  289270  255362  113156   82686   60700   48970   19604   17660   12241    2648

Checking units for property damage value

sort(table(data_of_interest$PROPDMGEXP), decreasing = TRUE)[1:10]

## 
##             K      M      0      B      5      1      2      ?      m 
## 465934 424665  11330    216     40     28     25     13      8      7

Checking units for crop damage value

sort(table(data_of_interest$CROPDMGEXP), decreasing = TRUE)[1:10]

## 
##             K      M      k      0      B      ?      2      m   <NA> 
## 618413 281832   1994     21     19      9      7      1      1

There is some mess in units, so we transform those variables in one unit (dollar) variable by the following rule:

K or k: thousand dollars (10^3)
M or m: million dollars (10^6)
B or b: billion dollars (10^9)
the rest would be consider as dollars

New variable(s) is product of value of damage and dollar unit.

##for property
data_of_interest$PROPDMGEXP[is.na(data_of_interest$PROPDMGEXP)] <- 0 # NA's considered as dollars
data_of_interest$PROPDMGEXP[!grepl("K|M|B", data_of_interest$PROPDMGEXP, ignore.case = TRUE)] <- 0 # everything exept K,M,B is dollar
data_of_interest$PROPDMGEXP[grep("K", data_of_interest$PROPDMGEXP, ignore.case = TRUE)] <- "3"
data_of_interest$PROPDMGEXP[grep("M", data_of_interest$PROPDMGEXP, ignore.case = TRUE)] <- "6"
data_of_interest$PROPDMGEXP[grep("B", data_of_interest$PROPDMGEXP, ignore.case = TRUE)] <- "9"
data_of_interest$PROPDMGEXP <- as.numeric(data_of_interest$PROPDMGEXP)
#create a variable with actual property damage value
data_of_interest$property.damage <- data_of_interest$PROPDMG * 10^data_of_interest$PROPDMGEXP

##for crop
data_of_interest$CROPDMGEXP[is.na(data_of_interest$CROPDMGEXP)] <- 0 # NA's considered as dollars
data_of_interest$CROPDMGEXP[!grepl("K|M|B", data_of_interest$CROPDMGEXP, ignore.case = TRUE)] <- 0 # everything exept K,M,B is dollar
data_of_interest$CROPDMGEXP[grep("K", data_of_interest$CROPDMGEXP, ignore.case = TRUE)] <- "3"
data_of_interest$CROPDMGEXP[grep("M", data_of_interest$CROPDMGEXP, ignore.case = TRUE)] <- "6"
data_of_interest$CROPDMGEXP[grep("B", data_of_interest$CROPDMGEXP, ignore.case = TRUE)] <- "9"
data_of_interest$CROPDMGEXP <- as.numeric(data_of_interest$CROPDMGEXP)
#create a variable with actual property damage value
data_of_interest$crop.damage <- data_of_interest$CROPDMG * 10^data_of_interest$CROPDMGEXP

Analysis

Table of public health problems by event type

aggr.fatalities <- aggregate(FATALITIES~EVENT, data = data_of_interest, sum)
names(aggr.fatalities)[2] <- "TOTAL"
aggr.fatalities$type <- "fatalities"

aggr.injuries <- aggregate(INJURIES~EVENT, data = data_of_interest, sum)
names(aggr.injuries)[2] <- "TOTAL"
aggr.injuries$type <- "injuries"

aggr.health<- rbind(aggr.fatalities, aggr.injuries)
aggr.health

##      EVENT TOTAL       type
## 1    FLOOD  1524 fatalities
## 2     HAIL    15 fatalities
## 3     HEAT  3138 fatalities
## 4    OTHER  2626 fatalities
## 5     RAIN   114 fatalities
## 6     SNOW   164 fatalities
## 7    STORM   416 fatalities
## 8  TORNADO  5661 fatalities
## 9     WIND  1209 fatalities
## 10  WINTER   278 fatalities
## 11   FLOOD  8602   injuries
## 12    HAIL  1371   injuries
## 13    HEAT  9224   injuries
## 14   OTHER 12224   injuries
## 15    RAIN   305   injuries
## 16    SNOW  1164   injuries
## 17   STORM  5339   injuries
## 18 TORNADO 91407   injuries
## 19    WIND  9001   injuries
## 20  WINTER  1891   injuries

Table of property and crop damage by event type

aggr.property <- aggregate(property.damage~EVENT, data = data_of_interest, sum)
names(aggr.property)[2] <- "TOTAL"
aggr.property$type <- "property"

aggr.crop <- aggregate(crop.damage~EVENT, data = data_of_interest, sum)
names(aggr.crop)[2] <- "TOTAL"
aggr.crop$type <- "crop"

aggr.economy <- rbind(aggr.property, aggr.crop)
aggr.economy

##      EVENT        TOTAL     type
## 1    FLOOD 167502193929 property
## 2     HAIL  15733043048 property
## 3     HEAT     20325750 property
## 4    OTHER  97246712337 property
## 5     RAIN   3270230192 property
## 6     SNOW   1024169752 property
## 7    STORM  66304415393 property
## 8  TORNADO  58593098029 property
## 9     WIND  10847166618 property
## 10  WINTER   6777295251 property
## 11   FLOOD  12266906100     crop
## 12    HAIL   3046837473     crop
## 13    HEAT    904469280     crop
## 14   OTHER  23588880870     crop
## 15    RAIN    919315800     crop
## 16    SNOW    134683100     crop
## 17   STORM   6374474888     crop
## 18 TORNADO    417461520     crop
## 19    WIND   1403719150     crop
## 20  WINTER     47444000     crop

Results

Across the United States, which types of events are most harmful with respect to population health?

#transform EVENT variable to factor
aggr.health$EVENT <- as.factor(aggr.health$EVENT)
# plot FATALITIES and INJURIES by EVENT
type <- c("fatalities", "injuries")
health.plot <- ggplot(aggr.health, aes(x = EVENT, y = TOTAL, fill = type)) + 
  geom_bar(stat = "identity") +
  coord_flip() +
  xlab("Event Type") + 
  ylab("Total number of health impact") +
  ggtitle("Weather event types impact on public health") +
  theme(plot.title = element_text(hjust = 0.5))
print(health.plot)

The most harmful weather event for health (in number of total fatalites and injuries) is, by far, a TORNADO.

Across the United States, which types of events have the greatest economic consequences?

#transform EVENT variable to factor
aggr.economy$EVENT <- as.factor(aggr.economy$EVENT)

# plot PROPERTY damage and CROP damage by EVENT
economic.plot <- ggplot(aggr.economy, aes(x = EVENT, y = TOTAL, fill = type)) + geom_bar(stat = "identity") +
  coord_flip() +
  xlab("Event Type") + 
  ylab("Total damage in dollars") +
  ggtitle("Weather event types impact on property and crop damage") +
  theme(plot.title = element_text(hjust = 0.5))
print(economic.plot)

The most devastating weather event with the greatest economic cosequences (to property and crops) is a flood.