Course Project 2 - Analysis of Costs and Casualties in US Storm Data ’96-’11

Synopsys

The basic goal is to explore the NOAA Storm Database and answer the following questions:

Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

The main conclusions of this analysis are:

Tornados are by far the most harmful storm event in the US on population health over the periode 1996 - 2011. For more details see the analysis below.
Floods are by far the storm event in the US that have the greatest economic consequences. For more details see the analysis below.

Source data

The source data is originated from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

National Weather ServiceStorm Data Documentation

National Climatic Data Center Storm Events FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Data processing

In this section the processing of the source data is described, so it is reproducible:

Load libraries

The following libraries are used:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Download and load source data

First the data needs to be downloaded and processed for analysis. Set the working directory to /StormData and download the file to the subdirectory ‘/data’.

## First load the data
# checking and creating directories
if(!file.exists("data")){
  dir.create("data")
}
# download files from internet
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
filename <- "./data/rep-data-StormData.csv.bz2"

download.file(fileUrl, destfile = filename, method="curl") # curl: for https
list.files("./data")

## [1] "rep-data-StormData.csv.bz2"

# read data
data <- read.csv(filename, header=TRUE, sep = ",") # comma-seperated file with header

Check and look at the data

First a global look at the various variables and data set:

# check and look at the data
names(data)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

str(data)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

head(data)

##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
## 4         0                                               0         NA
## 5         0                                               0         NA
## 6         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
## 4         0                       0.0   100 2   0          0        2     2.5
## 5         0                       0.0   150 2   0          0        2     2.5
## 6         0                       1.5   177 2   0          0        6     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
## 4          K       0                                         3458      8626
## 5          K       0                                         3412      8642
## 6          K       0                                         3450      8748
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1       3051       8806              1
## 2          0          0              2
## 3          0          0              3
## 4          0          0              4
## 5          0          0              5
## 6          0          0              6

Let’s see what types of events (EVTYPE) exist?

events <- as.factor(unique(data$EVTYPE))
str(events)

##  Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 856 244 201 629 429 657 972 409 786 ...

What variables are used to determine harmful for population health and economic disaster?

Harmful for population is determined by the number of direct fatalities and injuries. This is listed in the following columns:

FATALITIES
INJURIES

# FATALITIES
summary(data$FATALITIES)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0000   0.0000   0.0168   0.0000 583.0000

# INJURIES
summary(data$INJURIES)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0000    0.0000    0.0000    0.1557    0.0000 1700.0000

What variables are used to determine the greatest economic consequences in the dataset?

According to the documentation of the dataset “Damage” is Flood related damage, crop related damage, other related damage and delayed damage. In the dataset only property damage (PROPDMG) and crop damage (CROPDMG) are listed.

With some assistance of the Discussion Forum, there is some important additional information about these columns: The ‘CROPDMGEXP’ is the exponent values for ‘CROPDMG’ (crop damage). In the same way, ‘PROPDMGEXP’ is the exponent values for ‘PROPDMG’ (property damage). You should use both to get the total values for crops and property damage. (B or b = Billion, M or m = Million, K or k = Thousand, H or h = Hundred). The number from one to ten represent the power of ten (10^The number). The symbols “-”, “+” and “?” refers to less than, greater than and low certainty. You have the option to ignore these three symbols altogether.

Therefore the following variables determine the economic consequences of the dataset:

PROPDMG
PROPDMGEXP
CROPDMG
CROPDMGEXP

Now let’s have a look at the various exponents in the dataset:

exponentPROP <- as.factor(unique(data$PROPDMGEXP))
exponentPROP

##  [1] K M   B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M

exponentCROP <- as.factor(unique(data$CROPDMGEXP))
exponentCROP

## [1]   M K m B ? 0 k 2
## Levels:  ? 0 2 B k K m M

These values relate to the following exponents and need to be converted. The following named list can be used to change the exponents:

exponent <- c(" " = 0,
              "-" = 0,
              "?" = 0,
              "+" = 1,
              "0" = 10,
              "1" = 10,
              "2" = 10,
              "3" = 10,
              "4" = 10,
              "5" = 10,
              "6" = 10,
              "7" = 10,
              "8" = 10,
              "h" = 100,
              "H" = 100,
              "K" = 1000,
              "m" = 1000000,
              "M" = 1000000,
              "B" = 1000000000)

Now, the economic consequences can be calculated as:

Total damage = Total damange property + Total damage crop

Total damage = PROPDMG * PROPDMGEXP + CROPDMG * CROPDMGEXP

Results

Now present the results:

Question 1) Across the United States, which types of events (variable EVTYPE) are most harmfull with respect to population health?

Below are the 15 most harmfull storm events listed order by the total casualties over the period 1996 - 2011. The total casualties consists of the total number of fatalities and total number of injuries.

# which events have most impact on populaties Health (fatalities and injuries)?
PopHealth <- data %>%
              select(EVTYPE, FATALITIES, INJURIES) %>%
              group_by(EVTYPE) %>%
              summarise(Fatalities = sum(FATALITIES), Injuries = sum(INJURIES),
                        Total = sum(FATALITIES) + sum(INJURIES)) %>%
              arrange(desc(Total)) 

PopHealth <- PopHealth[1:15, ]
PopHealth

## # A tibble: 15 × 4
##    EVTYPE            Fatalities Injuries Total
##    <chr>                  <dbl>    <dbl> <dbl>
##  1 TORNADO                 5633    91346 96979
##  2 EXCESSIVE HEAT          1903     6525  8428
##  3 TSTM WIND                504     6957  7461
##  4 FLOOD                    470     6789  7259
##  5 LIGHTNING                816     5230  6046
##  6 HEAT                     937     2100  3037
##  7 FLASH FLOOD              978     1777  2755
##  8 ICE STORM                 89     1975  2064
##  9 THUNDERSTORM WIND        133     1488  1621
## 10 WINTER STORM             206     1321  1527
## 11 HIGH WIND                248     1137  1385
## 12 HAIL                      15     1361  1376
## 13 HURRICANE/TYPHOON         64     1275  1339
## 14 HEAVY SNOW               127     1021  1148
## 15 WILDFIRE                  75      911   986

Now product a barchart that summarizes the casualties per event type. In order to present the barchart for the fatalities, injuries and total casualties the data first need to be wrangled:

lPopHealth <- PopHealth %>%
              pivot_longer(cols = c("Fatalities", "Injuries", "Total"),
                           names_to = "Impact", values_to = "Number")
lPopHealth$Impact <- as.factor(lPopHealth$Impact)
p <- ggplot(data = lPopHealth, aes(x= reorder(EVTYPE, -Number), y=Number , fill=Impact)) +
  geom_bar(stat="identity", position="dodge") +
  theme(axis.text.x = element_text(angle=60, hjust=1)) +
  ggtitle("Most Harmful Storm Event Types in US on Population Health ('96-'11)") +
  xlab("Storm Event Types") +
  ylab("Number of casualties")
p

Quesion 2) Across the United States, which types of events have the greatest economic consequences.

As stated above, the economic consequences can be calculated as: