Peer Assignment 2 - Exploring the NOAA Database

Reproducible Research - Peer Assignment 2

Exploring the NOAA Storm Database: Synopsis

This assignment seeks to use the NOAA Storm Database to answer two questions: (1) Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health? (2) Across the United States, which types of events have the greatest economic impact?

Data Processing

The data should be downloaded and loaded into the directory. I’ll use an IF statement to first check if it already exists in the directory.

sessionInfo()

## R version 3.6.1 (2019-07-05)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17763)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.6.1  magrittr_1.5    tools_3.6.1     htmltools_0.4.0
##  [5] yaml_2.2.0      Rcpp_1.0.3      stringi_1.4.4   rmarkdown_2.1  
##  [9] knitr_1.27      stringr_1.4.0   xfun_0.12       digest_0.6.23  
## [13] rlang_0.4.4     evaluate_0.14

#Loading packages
library(knitr)

## Warning: package 'knitr' was built under R version 3.6.2

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.6.2

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.6.2

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(grid)
library(plyr)

## Warning: package 'plyr' was built under R version 3.6.2

## ------------------------------------------------------------------------------

## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)

## ------------------------------------------------------------------------------

## 
## Attaching package: 'plyr'

## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

#Downloading the data file & examining it's structure
if(!file.exists("StormData.csv.bz2")){
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "StormData.csv.bz2", method = "curl")
}

RawData <- read.csv("StormData.csv.bz2")
str(RawData)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

We can see that there are 37 variables across ~902k observations. The first step cleaning this data is to isolate only the variables that pertain to our specific questions. We can see there are variables pertaining to fatalities, injuries, property damage, and crop damage. We can subset the raw data on these variables.

SubSettedData <- RawData[, c("EVTYPE", "BGN_DATE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
str(SubSettedData)

## 'data.frame':    902297 obs. of  8 variables:
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...

##Need to fix some data types
SubSettedData$BGN_DATE <- as.POSIXct(SubSettedData$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")

The total damange for both property and crop actually isn’t given by any of these variables. We first need to convert the PROPDMGEXP and CROPDMGEXP variables into a mathamatical factor that can be multiplied against the PROPDMG and CROPDMG variables. That would give us two new TOTALDMG variables, and the old variables can be removed.

The mapping goes as follows: H: 1^10e2 K: 1^10e3 M: 1^10e6 B: 1^10e9 Any number: 1^10e(that number) Anything else: 1

Applying that mapping here:

## What are the unique elements in our two mappign columns?
unique(SubSettedData$PROPDMGEXP)

##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M

unique(SubSettedData$CROPDMGEXP)

## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M

## Creating the two new TOTAL variables
tempPROPDMG <- mapvalues(SubSettedData$PROPDMGEXP, c("K","M","", "B","m","+","0","5","6","?","4","2","3","h","7","H","-","1","8"), 
c(1e3,1e6,1,1e9,1e6,1,1,1e5,1e6,1,1e4,1e2,1e3,1,1e7,1e2,1,10,1e8))

tempCROPDMG <- mapvalues(SubSettedData$CROPDMGEXP, c("","M","K","m","B","?","0","k","2"),
c(1,1e6,1e3,1e6,1e9,1,1,1e3,1e2))

SubSettedData$TOTALPROPDMG <- as.numeric(tempPROPDMG) * SubSettedData$PROPDMG
SubSettedData$TOTALCROPDMG <- as.numeric(tempCROPDMG) * SubSettedData$CROPDMG

remove(tempPROPDMG)
remove(tempCROPDMG)

SubSettedData$TOTALDMG <- SubSettedData$TOTALPROPDMG + SubSettedData$TOTALCROPDMG

## Checking the adjustments
str(SubSettedData)

## 'data.frame':    902297 obs. of  11 variables:
##  $ EVTYPE      : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_DATE    : POSIXct, format: "1950-04-18" "1950-04-18" ...
##  $ FATALITIES  : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES    : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG     : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP  : Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP  : Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ TOTALPROPDMG: num  100 10 100 10 10 10 10 10 100 100 ...
##  $ TOTALCROPDMG: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ TOTALDMG    : num  100 10 100 10 10 10 10 10 100 100 ...

head(SubSettedData)

##    EVTYPE   BGN_DATE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 1950-04-18          0       15    25.0          K       0           
## 2 TORNADO 1950-04-18          0        0     2.5          K       0           
## 3 TORNADO 1951-02-20          0        2    25.0          K       0           
## 4 TORNADO 1951-06-08          0        2     2.5          K       0           
## 5 TORNADO 1951-11-15          0        2     2.5          K       0           
## 6 TORNADO 1951-11-15          0        6     2.5          K       0           
##   TOTALPROPDMG TOTALCROPDMG TOTALDMG
## 1          100            0      100
## 2           10            0       10
## 3          100            0      100
## 4           10            0       10
## 5           10            0       10
## 6           10            0       10

Most Dangerous Weather Events by Fatalaties and Injuries

I’ll construct to plots to show how fatalities and injuries vary accross the different types of weather events. I’ll limit the plots to the top 5 events.

First, we will do fatalities.

fatal <- aggregate(FATALITIES ~ EVTYPE, SubSettedData, sum)
fatal <- fatal[order(-fatal$FATALITIES), ][1:5,]
fatal$EVTYPE <- factor(fatal$EVTYPE, levels = fatal$EVTYPE)

ggplot(fatal, aes(x = EVTYPE, y = FATALITIES)) +
  geom_bar(stat = "identity", fill = "red", las = 4) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("Weather Event Type") + ylab("Deaths") + ggtitle("Number of Deaths by Top 5 Most Dangerous Weather Events")

## Warning: Ignoring unknown parameters: las

Now we do injuries.

injure <- aggregate(INJURIES ~ EVTYPE, SubSettedData, sum)
injure <- injure[order(-injure$INJURIES), ][1:5,]
injure$EVTYPE <- factor(injure$EVTYPE, levels = injure$EVTYPE)

ggplot(injure, aes(x = EVTYPE, y = INJURIES)) +
  geom_bar(stat = "identity", fill = "green", las = 4) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("Weather Event Type") + ylab("Non-Fatal Injuries") + ggtitle("Number of Injuries by Top 5 Most Dangerous Weather Events")

## Warning: Ignoring unknown parameters: las

Economic Impact

Lastly, we can go through a similar process to determine total damanges (property + crop) accross weather event type and order it by top 5.

## Summing the total damage by event type and calling it Total Damange. Then we subset that only by the top 5 descending
damages <- aggregate(TOTALDMG ~ EVTYPE, SubSettedData, sum)
damages <- damages[order(-damages$TOTALDMG), ][1:5,]
damages$EVTYPE <- factor(damages$EVTYPE, levels = damages$EVTYPE)

## Making the plot
ggplot(damages, aes(x = EVTYPE, y = TOTALDMG)) +
  geom_bar(stat = "identity", fill = "green", las = 4) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab("Weather Event Type") + ylab("Total Damange Caused in $") + ggtitle("Total $ Damage by Top 5 Weather Events")

## Warning: Ignoring unknown parameters: las

Results

We can see from our plots that tornados cause the most deaths, injuries, and damages.