Synopsis

This data analysis addresses the following questions:

Data is taken from NOAA’s storm database, with the relevant variables processed from 1993 onwards, and plotted.

Tornadoes have resulted in the greatest number of victims reported since 1993: 24931 victims. While flooding has resulted in the greatest damage cost-wise across the USA since 1993, with a total cost of $150,319,678,257.

Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Data Processing

Pre-processing/set-up

Necessary packages are loaded

library(plyr)
library(dplyr)
library(lubridate)
library(ggplot2)
library(reshape2)

The computing environment:

sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 17134)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 
## [2] LC_CTYPE=English_United Kingdom.1252   
## [3] LC_MONETARY=English_United Kingdom.1252
## [4] LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] reshape2_1.4.3  ggplot2_3.1.0   lubridate_1.7.4 dplyr_0.7.8    
## [5] plyr_1.8.4     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.0       knitr_1.21       bindr_0.1.1      magrittr_1.5    
##  [5] munsell_0.5.0    tidyselect_0.2.5 colorspace_1.4-0 R6_2.3.0        
##  [9] rlang_0.3.1      stringr_1.3.1    tools_3.5.2      grid_3.5.2      
## [13] gtable_0.2.0     xfun_0.4         withr_2.1.2      htmltools_0.3.6 
## [17] lazyeval_0.2.1   yaml_2.2.0       assertthat_0.2.0 digest_0.6.18   
## [21] tibble_2.0.1     crayon_1.3.4     bindrcpp_0.2.2   purrr_0.3.0     
## [25] glue_1.3.0       evaluate_0.12    rmarkdown_1.11   stringi_1.2.4   
## [29] compiler_3.5.2   pillar_1.3.1     scales_1.0.0     pkgconfig_2.0.2

Data download and read

## Download and unzip the data, provided it doesn't already exist
if(!file.exists("FStormData.csv.bz2")){
  fileUrl<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  download.file(fileUrl,destfile="FStormData.csv.bz2",method="curl")
}

## Read in the data
if(!"FStormData.csv.bz2" %in% ls()){
data <- read.csv("FStormData.csv.bz2")
}

Exploratory Data Analysis

The next step is to perform exploratory data analysis to determine what is relevant. The ‘str’ function is first used for this purpose.

str(data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

As can be seen, there are 902297 observations from 37 variables.

So as to visualize the data prior to cleaning and processing, a bar plot was drawn showing the number of different event types which were reported in the US per year. So as to be able to group the data by year, an extra variable was added with only the year of the reported event.

data$year <- year(mdy_hms(data$BGN_DATE))

## Creating a data frame where the event types reported are grouped by year
cutoffYear <- data %>% group_by(year) %>% summarise(n=n_distinct(EVTYPE))
## Bar plot 
plot(x=cutoffYear$year,y=cutoffYear$n, type = "h",
     main = "Number of Different Weather Events Reported per Year across the US",
     xlab = "Year", ylab = "Events Types")
Figure 1: Barplot showing the number of storm event types reported per year in the US

Figure 1: Barplot showing the number of storm event types reported per year in the US

From the plot, there is a large jump in the number of different reported incidents from 1993 onwards, while pre 1993, the number remains close to 0. It has been determined that the pre 1993 data will be insufficient to provide any satisfactory conclusions, therefore it will be cut.

Processing

## Filtering the data to only include data from post 1992
S_Data <- filter(data, year > 1992)

Health Impact

For the first question, to determine which type of event is most harmful with respect to population health, the first step is to create a data frame consisting of only the relevant variables: EVTYPE (event type), FATALITIES, and INJURIES. In addition, an extra variable is created by adding the values for the FATALITIES and INJURIEs columns together, so that the top 10 most harmful events can be determined.

## Relevant variables selected and placed in new data frame
healthData <- S_Data %>% select(EVTYPE,FATALITIES,INJURIES)

## New column added to help with ordering
healthData$victims <- healthData$FATALITIES + healthData$INJURIES

The sum of the fatalities and injuries per event type were calculated using the ‘aggregate’ function, then ordered by the decreasing number of victims. The data was then melted to allow the variables to be plotted simultaneously.

## Aggregate the fatalities, injuries, and victims columns by event type 
health <- aggregate(cbind(FATALITIES,INJURIES,victims)~EVTYPE,healthData,sum)

## Data frame created by ordering by the top 10 victim numbers
healthTop <- head(health[order(-health$victims),],10)

## Melt the fatalities and injuries variables so that they can be plotted together
healthM <- melt(healthTop,id.vars="EVTYPE",
                measure.vars=c("INJURIES","FATALITIES"))

Bar plot created to showcase results. Please jump to Results section.

healthPlot <- ggplot(healthM, aes(EVTYPE, value, fill = variable)) +
    geom_bar(stat="identity") +
    labs(title = "Victim (Injury & Fatality) Numbers per Event Type Reported across the USA since 1993",
         x = "Event Type", y = "Reported Incidents") +
    theme(axis.text.x = element_text(hjust = 1, angle = 20), 
          legend.title=element_blank()) +
    scale_fill_manual(values = c("steelblue1","red"))

Economic Impact

For the second question, to determine which types of events have the greatest economic consequences, the first step is to create a data frame consisting of only the relevant variables: EVTYPE (event type), PROPDMG (property damage), PROPDMGEXP (property damage exponent), CROPDMG (crop damage), CROPDMGEXP (crop damage exponent).

economData <- S_Data %>% select(EVTYPE,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)

To be able to calculate the damage done to property and crops, the exponents in the PROPDMGEXP and CROPDMGEXP columns needed to be replaced. This as done using the ‘revalue’ function. Symbols were assumed to be 0.

## Determining the values that compromise PROPDMGEXP and CROPDMGEXP
unique(economData$PROPDMGEXP)
##  [1]   B K M m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(economData$CROPDMGEXP)
## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M
## Dictionary created with keys and their respective values
dmgOrder <- c("-"=0,"?"=0,"+"=0,"h"=2,"H"=2,"k"=3,"K"=3,"m"=6,"M"=6,"B"=9)

## Empty cells were assigned a zero
economData[economData==''] <- 0

## The PROPDMGEXP and CROPDMGEXP values replaced using 'dmgOrder' dictionary
economData$PROPDMGEXP <- revalue(economData$PROPDMGEXP, dmgOrder, warn_missing = FALSE)
economData$CROPDMGEXP <- revalue(economData$CROPDMGEXP, dmgOrder, warn_missing = FALSE)

## Values converted from factor to numeric so that they can be further processed
economData$PROPDMGEXP <- as.numeric(as.character(economData$PROPDMGEXP))
economData$CROPDMGEXP <- as.numeric(as.character(economData$CROPDMGEXP))

With the completion of the previous step, the next step was to calculate the actual damage costs by multiplying the damage cost base by the damage exponent for both properties and crops. Then just like with the Health Impact analysis, an extra variable is created by adding the values for the total property damage cost (PROPDMGTOT) and the total crop damage cost (CROPDMGTOT) columns together, so that the top 10 most costly events can be determined.

## New variable added to contain damage costs base multiplied by exponent
economData$PROPDMGTOT <- economData$PROPDMG * 10^economData$PROPDMGEXP
economData$CROPDMGTOT <- economData$CROPDMG * 10^economData$CROPDMGEXP

## New column added to help with ordering
economData$TotalCost <- economData$PROPDMGTOT + economData$CROPDMGTOT

The sum of the PROPDMGTOT and CROPDMGTOT per event types were calculated using the ‘aggregate’ function, then ordered by the decreasing amount of damage. The data was then melted to allow the variables to be plotted simultaneously.

## Aggregate the total property damage, total crop damage, and the total added 
## together columns by event type 
cost <- aggregate(cbind(PROPDMGTOT,CROPDMGTOT,TotalCost)~EVTYPE,economData,sum)

## Data frame created by ordering by the top 10 costliest
costTop <- head(cost[order(-cost$TotalCost),],10)

## Melt the total property damage, total crop damage variables so that they can be plotted together
costM <- melt(costTop,id.vars="EVTYPE",
                measure.vars=c("PROPDMGTOT","CROPDMGTOT"))

Bar plot created to showcase results. Please jump to Results section.

costPlot <- ggplot(costM, aes(EVTYPE, value, fill = variable)) +
    geom_bar(stat="identity") +
    labs(title = "Economic Cost (Property & Crop Damage) per Event Type Reported across the USA since 1993",
         x = "Event Type", y = "Damage in USD") +
    theme(axis.text.x = element_text(hjust = 1, angle = 20)) +
    scale_fill_manual(values = c("steelblue1","red"),
                      name = "Type of Damage",
                      labels = c("Property","Crops"))

Results

Health Impact

healthTop
##                EVTYPE FATALITIES INJURIES victims
## 834           TORNADO       1621    23310   24931
## 130    EXCESSIVE HEAT       1903     6525    8428
## 170             FLOOD        470     6789    7259
## 464         LIGHTNING        816     5230    6046
## 856         TSTM WIND        241     3631    3872
## 275              HEAT        937     2100    3037
## 153       FLASH FLOOD        978     1777    2755
## 427         ICE STORM         89     1975    2064
## 760 THUNDERSTORM WIND        133     1488    1621
## 972      WINTER STORM        206     1321    1527
healthPlot
Figure 2: Top 10 most harmful (in terms of number of deaths and injuries) storm events to the US since 1993

Figure 2: Top 10 most harmful (in terms of number of deaths and injuries) storm events to the US since 1993

The figure shows that Tornadoes are the most harmful event to the population of the US in terms of the total number of victims affected - 24931 victims. This is due to the large number of injuries that result from the event, more than three times the number of injuries that result from the second most harmful event - Excessive Heat. However, in terms of the number of fatalities, the most harmful event to the population is Excessive Heat, with 1903 deaths, to tornadoes’ 1621.

Economic Impact

costTop
##                EVTYPE   PROPDMGTOT  CROPDMGTOT    TotalCost
## 170             FLOOD 144657709807  5661968450 150319678257
## 411 HURRICANE/TYPHOON  69305840000  2607872800  71913712800
## 670       STORM SURGE  43323536000        5000  43323541000
## 834           TORNADO  26349182107   414953270  26764135377
## 244              HAIL  15735267513  3025954473  18761221986
## 153       FLASH FLOOD  16822673979  1421317100  18243991079
## 95            DROUGHT   1046106000 13972566000  15018672000
## 402         HURRICANE  11868319010  2741910000  14610229010
## 590       RIVER FLOOD   5118945500  5029459000  10148404500
## 427         ICE STORM   3944927860  5022113500   8967041360
costPlot
Figure 3: Top 10 most economically damaging (in terms of cost to properties and crops) storm events to the US since 1993

Figure 3: Top 10 most economically damaging (in terms of cost to properties and crops) storm events to the US since 1993

The figure shows that Floods do the most economic damage in the US in terms of the total cost of $150,319,678,257. This is due to the large amount of property damage done; which is more than twice that of Hurricane/Typhoon damage. However, in terms of the cost of the damage done to crops, the most harmful event is Drought, with $13972566000 worth of damage, to flood’s $5661968450.

Conclusion

From 1993, tornadoes have been the most harmful event to the population of the US, while floods have done the greatest damage economically.