Analysis of storm data from the U.S. National Oceanic and Atmospheric Administration (NOAA)

Synopsis

This is a second course project for Reproducible Research, which is the fifth course in Coursera Data Science Specialization.

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, loss of crops and property damage. Preventing such outcomes as far as practical and possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The analysis of the data shows that tornadoes have the greatest human impact as measured by numbers of injuries and fatalities. The second biggest impact is heat. In terms of economic impact, as measured by damage to crops and property, flooding is the greatest threat.

Data Processing

Load libraries

The following libraries were used for the analysis

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.5.1
## -- Attaching packages -------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.6
## v tidyr   0.8.1     v stringr 1.3.0
## v readr   1.1.1     v forcats 0.3.0
## Warning: package 'readr' was built under R version 3.5.1
## Warning: package 'purrr' was built under R version 3.5.1
## Warning: package 'dplyr' was built under R version 3.5.1
## Warning: package 'forcats' was built under R version 3.5.1
## -- Conflicts ----------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Data

The data for this assignment comes in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. The file may be downloaded from here:

Storm Data

See links below for documentation of this data:

National Weather Service Storm Data Documentation

National Climatic Data Center Storm Events FAQ page

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Goals

The aim of this work is to answer the following questions.

  1. Across the United States, which types of severe weather events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

Loading data

The data was downloaded from the link above and saved into my R environment (use setwd command to set the file path of the working directory and use getwd command to check the current working directory). The read_csv command was used to import the zipped csv file. If object storm.data is already loaded, use the cached object insted of reloading each time the Rmd file is knitted.

storm.data <- read_csv("StormData.csv.bz2")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   STATE__ = col_double(),
##   COUNTY = col_double(),
##   BGN_RANGE = col_double(),
##   COUNTY_END = col_double(),
##   END_RANGE = col_double(),
##   LENGTH = col_double(),
##   WIDTH = col_double(),
##   F = col_integer(),
##   MAG = col_double(),
##   FATALITIES = col_double(),
##   INJURIES = col_double(),
##   PROPDMG = col_double(),
##   CROPDMG = col_double(),
##   LATITUDE = col_double(),
##   LONGITUDE = col_double(),
##   LATITUDE_E = col_double(),
##   LONGITUDE_ = col_double(),
##   REFNUM = col_double()
## )
## See spec(...) for full column specifications.

Examine the storm data

head(storm.data)
str(storm.data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  NA NA NA NA ...
##  $ BGN_LOCATI: chr  NA NA NA NA ...
##  $ END_DATE  : chr  NA NA NA NA ...
##  $ END_TIME  : chr  NA NA NA NA ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: chr  NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  NA NA NA NA ...
##  $ END_LOCATI: chr  NA NA NA NA ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  NA NA NA NA ...
##  $ WFO       : chr  NA NA NA NA ...
##  $ STATEOFFIC: chr  NA NA NA NA ...
##  $ ZONENAMES : chr  NA NA NA NA ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  NA NA NA NA ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
##  - attr(*, "spec")=List of 2
##   ..$ cols   :List of 37
##   .. ..$ STATE__   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ BGN_DATE  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ BGN_TIME  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ TIME_ZONE : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ COUNTY    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ COUNTYNAME: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ STATE     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ EVTYPE    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ BGN_RANGE : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ BGN_AZI   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ BGN_LOCATI: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ END_DATE  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ END_TIME  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ COUNTY_END: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ COUNTYENDN: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ END_RANGE : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ END_AZI   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ END_LOCATI: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ LENGTH    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ WIDTH     : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ F         : list()
##   .. .. ..- attr(*, "class")= chr  "collector_integer" "collector"
##   .. ..$ MAG       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ FATALITIES: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ INJURIES  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ PROPDMG   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ PROPDMGEXP: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ CROPDMG   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ CROPDMGEXP: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ WFO       : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ STATEOFFIC: list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ ZONENAMES : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ LATITUDE  : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ LONGITUDE : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ LATITUDE_E: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ LONGITUDE_: list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   .. ..$ REMARKS   : list()
##   .. .. ..- attr(*, "class")= chr  "collector_character" "collector"
##   .. ..$ REFNUM    : list()
##   .. .. ..- attr(*, "class")= chr  "collector_double" "collector"
##   ..$ default: list()
##   .. ..- attr(*, "class")= chr  "collector_guess" "collector"
##   ..- attr(*, "class")= chr "col_spec"

Variables for analysis

The following variables are of most interest with regard to health and economic outcomes

Health variables:

FATALITIES: approx. number of deaths INJURIES: approx. number of injuries

Economic variables:

PROPDMG: approx. property damags PROPDMGEXP: the units for property damage value CROPDMG: approx. crop damages CROPDMGEXP: the units for crop damage value Events - target variable:

EVTYPE: weather event (Tornados, Wind, Snow, Flood, etc..)

Create a subset of data containing just these required variables:

vars <- c( "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
mydata <- storm.data[, vars]

Data Processing

Check if any values are missing from each of the numeric variables as a basic quality check.

incomplete <- sum(!complete.cases(mydata))

sum(is.na(mydata$FATALITIES))
## [1] 0
sum(is.na(mydata$INJURIES))
## [1] 0
sum(is.na(mydata$PROPDMG))
## [1] 0
sum(is.na(mydata$CROPDMG))
## [1] 0

Transform the units so actual dollar values are shown. The rules applied are as follows:

*K or k: thousand dollars (10^3)

*M or m: million dollars (10^6)

*B or b: billion dollars (10^9)

*those without the units above are considered as dollars

mydata$PROPDMGEXP <- as.character(mydata$PROPDMGEXP)
mydata$PROPDMGEXP[is.na(mydata$PROPDMGEXP)] <- 0 # NA's considered as dollars
mydata$PROPDMGEXP[!grepl("K|M|B", mydata$PROPDMGEXP, ignore.case = TRUE)] <- 0 # everything exept K,M,B is dollar
mydata$PROPDMGEXP[grep("K", mydata$PROPDMGEXP, ignore.case = TRUE)] <- "3"
mydata$PROPDMGEXP[grep("M", mydata$PROPDMGEXP, ignore.case = TRUE)] <- "6"
mydata$PROPDMGEXP[grep("B", mydata$PROPDMGEXP, ignore.case = TRUE)] <- "9"
mydata$PROPDMGEXP <- as.numeric(as.character(mydata$PROPDMGEXP))
mydata$property.damage <- mydata$PROPDMG * 10^mydata$PROPDMGEXP

mydata$CROPDMGEXP <- as.character(mydata$CROPDMGEXP)
mydata$CROPDMGEXP[is.na(mydata$CROPDMGEXP)] <- 0 # NA's considered as dollars
mydata$CROPDMGEXP[!grepl("K|M|B", mydata$CROPDMGEXP, ignore.case = TRUE)] <- 0 # everything exept K,M,B is dollar
mydata$CROPDMGEXP[grep("K", mydata$CROPDMGEXP, ignore.case = TRUE)] <- "3"
mydata$CROPDMGEXP[grep("M", mydata$CROPDMGEXP, ignore.case = TRUE)] <- "6"
mydata$CROPDMGEXP[grep("B", mydata$CROPDMGEXP, ignore.case = TRUE)] <- "9"
mydata$CROPDMGEXP <- as.numeric(as.character(mydata$CROPDMGEXP))
mydata$crop.damage <- mydata$CROPDMG * 10^mydata$CROPDMGEXP



#summarise data by event type; transform EVTYPE into a new variable called EVENT

mydata$EVENT <- "OTHER"
# group by keyword in EVTYPE
mydata$EVENT[grep("HAIL", mydata$EVTYPE, ignore.case = TRUE)] <- "HAIL"
mydata$EVENT[grep("STORM", mydata$EVTYPE, ignore.case = TRUE)] <- "STORM"
mydata$EVENT[grep("WINTER", mydata$EVTYPE, ignore.case = TRUE)] <- "WINTER"
mydata$EVENT[grep("RAIN", mydata$EVTYPE, ignore.case = TRUE)] <- "RAIN"
mydata$EVENT[grep("ICE", mydata$EVTYPE, ignore.case = TRUE)] <- "ICE"
mydata$EVENT[grep("RAIN", mydata$EVTYPE, ignore.case = TRUE)] <- "RAIN"
mydata$EVENT[grep("SNOW", mydata$EVTYPE, ignore.case = TRUE)] <- "SNOW"
mydata$EVENT[grep("TORNADO", mydata$EVTYPE, ignore.case = TRUE)] <- "TORNADO"
mydata$EVENT[grep("HEAT", mydata$EVTYPE, ignore.case = TRUE)] <- "HEAT"
mydata$EVENT[grep("FLOOD", mydata$EVTYPE, ignore.case = TRUE)] <- "FLOOD"
mydata$EVENT[grep("WIND", mydata$EVTYPE, ignore.case = TRUE)] <- "WIND"


# list the transformed event types 
sort(table(mydata$EVENT), decreasing = TRUE)
## 
##    WIND    HAIL   FLOOD TORNADO   OTHER  WINTER    SNOW    RAIN    HEAT 
##  364901  289270   82713   60699   48878   19597   17677   12156    2648 
##     ICE   STORM 
##    2101    1657

Summarise fatalities, injuries, property and crop damage totals by event category

mydata_summarised <- mydata %>%
  group_by(EVENT, crop.damage, property.damage, FATALITIES, INJURIES) %>%
              summarise ()
crop_damage_summary <- aggregate(crop.damage~EVENT, mydata_summarised, sum)
arrange(crop_damage_summary, desc(crop.damage))
property_damage_summary <- aggregate(property.damage~EVENT, mydata_summarised, sum)
arrange(property_damage_summary, desc(property.damage))
fatalities_summary <- aggregate(FATALITIES~EVENT, mydata_summarised, sum)
arrange(fatalities_summary, desc(FATALITIES))
injuries_summary <- aggregate(INJURIES~EVENT, mydata_summarised, sum)
arrange(injuries_summary, desc(INJURIES))

Results

1. Across the U.S which severe weather event(s) are most harmful with respect to human health?

The barplot ranks top severe weather events by type according to the number of fatalities and injuries. We see that tornadoes are resposible for by far the highest proportion of injuries and fatalities.

human <- merge(fatalities_summary, injuries_summary)
totals1 <- arrange(human, desc(FATALITIES + INJURIES))
names_events <- totals1$EVENT
barplot(t(totals1[,-1]), names.arg = names_events, ylim = c(0,70000), beside = T, cex.names = 0.8, las=2, col = c("black", "red"), main="Number of fatalities and injuries by severe weather event")
legend("topright",c("FATALITIES","INJURIES"),fill=c("black", "red"),bty = "n")

2. Across the U.S which weather event(s) have the greatest economic consequences?

economic <- merge(property_damage_summary, crop_damage_summary)
totals <- arrange(economic, desc(property.damage + crop.damage))
names_events <- totals$EVENT
barplot(t(totals[,-1]), names.arg = names_events, ylim = c(0,160000000000), beside = T, cex.names = 0.8, las=2, col = c("green", "grey"), main="Economic loss ($) by severe weather event")
legend("topright",c("Property damage","Crop loss"),fill=c("green","grey"),bty = "n")

The barplot ranks top severe weather events by type according to the financial impact. We see that both flood and
tornado pose the biggest risk to property; flood, hail and wind (in order) are the three biggest risks to crops.