On the impact of severe weather in the US on health and economy

Synopsis

The study the follows investigates the relationship between severe weather in the US based on the official data from the National Oceanic and Atmospheric Administration (NOAA). The NOAA database shows the number of injuries and fatalities occurred during a meteorological event, as well as the estimated damage caused to properties and crop during the event. Further details on the database are available at https://www.ncdc.noaa.gov/stormevents/.

Data Processing

Download and load the data

The first step in this study is to collect the data and load them into R. The data is taken from the Coursera Reproducible Research Course website, downloaded and stored as a data frame into a variable storm.

fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile = "repdata_data_StormData.csv.bz2", method = "curl")
storm <- read.csv("./repdata_data_StormData.csv.bz2")

Data inspection

Use colnames, str to have a first look at the variables.

colnames(storm)

##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

str(storm)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

It is convenient to consider the variable EVTYPE as factor. Also, in order to tidy up the entries of this variable, we shall convert all strings into upper-case. Furthermore, since the date is not in the right format, we have to fix the data type and, since the data set presents limited data before 1996, is shall be convenient to create an additional variable, BGN_YEAR to store the year where the event had happened.

# Converting EVTYPE in Uppercase and
# saving as factors
storm$EVTYPE <- toupper(storm$EVTYPE)
storm$EVTYPE <- factor(storm$EVTYPE)
# Converting the date into Date type
# and creating new variable BGN_YEAR
dates <- sapply(strsplit(storm$BGN_DATE, split = " "), function(x) x[[1]])
storm$BGN_DATE <- dates
storm$BGN_DATE <- as.Date(storm$BGN_DATE, "%m/%d/%Y")
storm$BGN_YEAR <- as.numeric(format(storm$BGN_DATE, "%Y"))
rm(dates)

Selection of the variables of interest

From the inspection it is clear that the variables of interest are

EVTYPE, the type of meteorological event occurred
INJURIES, counting the number of direct and indirect injuries due to the event
FATALITIES, counting the number of direct and indirect fatalities due to the event
PROPDMG, PROPDMGEXP, representing the quantity of monetary cost for property damage and the corresponding exponent
CROPDMG, CROPDMGEXP, representing the quantity of monetary cost for crop damage and the corresponding exponent
BGN_YEAR, the year where the meteorological event took place

To select and work with this specific subset it is convenient to load the library dplyr.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

In taking the selection, we have also changed the position of the variable BGN_YEAR at the front of the data set.

stormData <- select(storm, c(BGN_DATE, BGN_YEAR, EVTYPE, 
                             FATALITIES, INJURIES,
                             PROPDMG, PROPDMGEXP,
                             CROPDMG, CROPDMGEXP))

# Only take events from 1996 (before data are incomplete)
stormData <- filter(stormData, BGN_YEAR >= 1996)

# Only take data with an impact
stormData <- filter(stormData, FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0)

Evaluating the economic damage for property and for crop

From the preliminary data inspection it turns out that the monetary cost of property and crop damage are stored in two pairs of variables containing the monetary cost and the power of this cost. This power, however, is stored in several different ways, as shown in the tables.

table(storm$PROPDMGEXP)

## 
##             -      ?      +      0      1      2      3      4      5      6 
## 465934      1      8      5    216     25     13      4      4     28      4 
##      7      8      B      h      H      K      m      M 
##      5      1     40      1      6 424665      7  11330

table(storm$CROPDMGEXP)

## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

Some letters appear both in upper-case and lower-case, some other times, instead of the magnitude (k, M, B) a number is used (0 to 9). There are also some irregular entries (-, +, ?) which need to be addressed: we shall consider these as \(10^0 = 1\). The following function has aims to fix these exponents in an organic way.

ExponentConvert <- function(x) {
    if(x %in% 0:9) x <- 10 ^ as.numeric(x)
    else if(tolower(x) == "h") x <- 10 ^ 2
    else if(tolower(x) == "k") x <- 10 ^ 3
    else if(tolower(x) == "m") x <- 10 ^ 6
    else if(tolower(x) == "b") x <- 10 ^ 9
    else x <- 1
    return(x)
}

This function can be used to create two new variables, representing magnitude, as a power of 10.

# Adding power variables of economic damages
stormData$PROPDMGFACTOR <- sapply(stormData$PROPDMGEXP, ExponentConvert)
stormData$CROPDMGFACTOR <- sapply(stormData$CROPDMGEXP, ExponentConvert)

With these two new variables, we can now create the variables which represent the health impact and the economical cost of the meteorological event, which are labelled by HEALTHIMP and ECONCOST.

# Getting health impact and economic impact variables
stormData <- mutate(stormData, HEALTHIMP = FATALITIES + INJURIES)
stormData <- mutate(stormData, ECONCOST = PROPDMG * PROPDMGFACTOR + CROPDMG * CROPDMGFACTOR)

Selection of the data set of use

We can now remove from the data set the variables that we don’t need and focus only on the variable of interest.

stormData <- select(stormData, BGN_YEAR, EVTYPE, HEALTHIMP, ECONCOST)

This gives finally a tidy data set that can be analysed.

head(stormData)

##   BGN_YEAR       EVTYPE HEALTHIMP ECONCOST
## 1     1996 WINTER STORM         0   418000
## 2     1996      TORNADO         0   100000
## 3     1996    TSTM WIND         0     3000
## 4     1996    TSTM WIND         0     5000
## 5     1996    TSTM WIND         0     2000
## 6     1996    HIGH WIND         0   400000

Analysis of the processed data

Let us start by noticing that the variable EVTYPE has many different levels and some may be duplicates. For example:

evtypeUnique <- unique(stormData$EVTYPE)
evtypeUnique[grep("THUND", evtypeUnique)]

## [1] THUNDERSTORM             THUNDERSTORM WIND (G40)  THUNDERSTORM WIND       
## [4] MARINE THUNDERSTORM WIND
## 898 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD  FLASH FLOOD ... WND

This also shows that there are actually 898 levels. Going through all these levels is pointlessly time consuming, so we shall take a faster route. To this end, let us start by taking a new data set consisting of the total health impact due to severe weather, taken adding up the variable HEALTHIMP grouped by EVTYPE and let’s check the subset of the 95% most extreme events.

healthImpact <- stormData %>% group_by(EVTYPE) %>% summarise(HEALTHIMP = sum(HEALTHIMP))
subset(healthImpact, HEALTHIMP > quantile(HEALTHIMP, prob = 0.95))

## # A tibble: 10 × 2
##    EVTYPE            HEALTHIMP
##    <fct>                 <dbl>
##  1 EXCESSIVE HEAT         8188
##  2 FLASH FLOOD            2561
##  3 FLOOD                  7172
##  4 HEAT                   1459
##  5 HURRICANE/TYPHOON      1339
##  6 LIGHTNING              4792
##  7 THUNDERSTORM WIND      1530
##  8 TORNADO               22178
##  9 TSTM WIND              3870
## 10 WINTER STORM           1483

healthImpact <- as.data.frame(healthImpact)

As it is clear from the table shown, line 7 and line 9 both correspond to thunderstorm wind though sometimes this is recorded as TSTM. Let us also fix the record HURRICANTE/TYPHOON changing every possible hurricane to HURRICANE.

stormData[(stormData$EVTYPE == "TSTM WIND"), "EVTYPE"] <- "THUNDERSTORM WIND"
stormData[grep("HURRICANE", stormData$EVTYPE), "EVTYPE"] <- "HURRICANE"

Let’s do the same with the economic cost data set:

economicCost <- stormData %>% group_by(EVTYPE) %>% summarise(ECONCOST = sum(ECONCOST))
subset(as.data.frame(economicCost), ECONCOST > quantile(ECONCOST, prob = 0.95))

##                EVTYPE     ECONCOST
## 32            DROUGHT  14413667000
## 46        FLASH FLOOD  16557105610
## 48              FLOOD 148919611950
## 66               HAIL  17071172870
## 83          HIGH WIND   5881421660
## 86          HURRICANE  86467941810
## 139       STORM SURGE  43193541000
## 144 THUNDERSTORM WIND   8812957230
## 147           TORNADO  24900370720
## 150    TROPICAL STORM   8320186550

stormData[grep("STORM SURGE", stormData$EVTYPE), "EVTYPE"] <- "STORM SURGE"

Results

Once the problem of duplicates has been dealt with, we can finally recreate the data sets healthImpact and economicCost and plot a diagram to see order the events depending on their health impact and economic cost, repectively. Let us start by loading the package ggplot2.

library(ggplot2)

Health impact

# health Impact
healthImpact <- stormData %>% 
    group_by(EVTYPE) %>% 
    summarise(HEALTHIMP = sum(HEALTHIMP)) %>% 
    arrange(desc(HEALTHIMP))

ggplot(healthImpact[1:10,], aes(x = reorder(EVTYPE, -HEALTHIMP),
                                y = HEALTHIMP, colour = EVTYPE)) +
    geom_bar(stat = "identity", fill = "white") +
    ggtitle("Fatalities and Injuries in the US caused by severe weather") +
    xlab("Event") + ylab("Number of fatalities and injuries") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = "none", 
          plot.title = element_text(hjust = 0.5, face = "bold"))

The diagram shows the first 10 most significative events in terms of health impact (injuries and fatalities). The graph shows that

Tornado are the most significant meteorological event.

Economic cost

economicCost <- stormData %>% 
    group_by(EVTYPE) %>% 
    summarise(ECONCOST = sum(ECONCOST)) %>% 
    arrange(desc(ECONCOST))

ggplot(economicCost[1:10,], aes(x = reorder(EVTYPE, -ECONCOST),
                                y = ECONCOST / 1e6, colour = EVTYPE)) +
    geom_bar(stat = "identity", fill = "white") +
    ggtitle("Economic Cost in the US caused by severe weather") +
    xlab("Event") + ylab("Economic cost in million USD") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1),
          legend.position = "none",
          plot.title = element_text(hjust = 0.5, face = "bold"))

The diagram shows the first 10 most significative events in terms of economic cost. The graph shows that

Flood are the most significant meteorological event, followed by Hurricane.

Coursera Reproducible Research (PA 2)

Andrea Carpignani

2023-02-01