Storm Data Project

Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Data

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:

Storm Data [47Mb]

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

National Weather Service Storm Data Documentation

National Climatic Data Center Storm Events FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

library(tidyverse)

Synopsis

The basic goal of this assignment is to explore the NOAA Storm Database to determine the types of weather events that 1) are most harmful with respect to population health, and 2) have the greatest economic impact in the United States. The Storm Database was subset into two smaller dataframes, each with the data needed to answer one of the questions. Variable names and event types were changed to all lower case. No other transformation was performed on any of the original data, but additional variables were added to each subset to hold calculation results. group_by() and summarise() were used to get totals per event type in both subsets.

Data Processing

Read the data into R and take a look at its characteristics and structure.

stormdata <- read.csv("repdata_data_StormData.csv.bz2")

str(stormdata)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
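read.csv() can read the .bz2 file directly because R's file connections detect bzip2 compression when a file is opened for reading. A minimal round-trip sketch of that behaviour, using a small made-up data frame and a temporary file (both are assumptions for illustration, not part of the storm data):

```r
# Toy data standing in for the storm file (hypothetical values)
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(0, 1))

# Write a bzip2-compressed CSV to a temporary path
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() decompresses transparently; no manual unzip step is needed
roundtrip <- read.csv(path)
str(roundtrip)
```

The same holds for the full storm file, so there is no need to decompress it before reading.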

The analysis for this project will need the following variables:
- EVTYPE: Type of weather event
- FATALITIES
- INJURIES
- PROPDMG: property damage cost
- PROPDMGEXP: K=Thousands, M=Millions, etc.
- CROPDMG: crop damage
- CROPDMGEXP: K=Thousands, M=Millions, etc.

Create a subset of stormdata with only these variables.

events <- stormdata[, c(8,23,24,25,26,27,28)] ##variable column numbers

head(events)

##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          K       0           
## 2 TORNADO          0        0     2.5          K       0           
## 3 TORNADO          0        2    25.0          K       0           
## 4 TORNADO          0        2     2.5          K       0           
## 5 TORNADO          0        2     2.5          K       0           
## 6 TORNADO          0        6     2.5          K       0
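Selecting by column position works for this file but breaks silently if the column order ever changes. A hedged alternative is to select by name; the toy data frame below is an assumption that mimics a few of the real columns:

```r
# Toy stand-in for stormdata with one extra column (hypothetical values)
toy <- data.frame(
  STATE      = c("AL", "AL"),
  EVTYPE     = c("TORNADO", "TORNADO"),
  FATALITIES = c(0, 0),
  INJURIES   = c(15, 0),
  PROPDMG    = c(25, 2.5),
  PROPDMGEXP = c("K", "K"),
  CROPDMG    = c(0, 0),
  CROPDMGEXP = c("", "")
)

# Columns needed for the analysis, referenced by name instead of position
keep <- c("EVTYPE", "FATALITIES", "INJURIES",
          "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
events_by_name <- toy[, keep]
names(events_by_name)
```

Applied to the real data, stormdata[, keep] should return the same seven columns as the numeric index.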

Change variable names and EVTYPE to lowercase.

events$EVTYPE <- tolower(events$EVTYPE)
names(events) <- tolower(names(events))

head(events)

##    evtype fatalities injuries propdmg propdmgexp cropdmg cropdmgexp
## 1 tornado          0       15    25.0          K       0           
## 2 tornado          0        0     2.5          K       0           
## 3 tornado          0        2    25.0          K       0           
## 4 tornado          0        2     2.5          K       0           
## 5 tornado          0        2     2.5          K       0           
## 6 tornado          0        6     2.5          K       0

Check the data for NA’s.

colSums(is.na(events))

##     evtype fatalities   injuries    propdmg propdmgexp    cropdmg cropdmgexp 
##          0          0          0          0          0          0          0
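is.na() does not flag empty strings, and the str() output earlier shows many character columns filled with "". A small sketch, on made-up data, of counting blanks alongside NAs (count_blank is a hypothetical helper name):

```r
# Toy data: no NA values, but one blank exponent (hypothetical values)
toy <- data.frame(
  propdmg    = c(25, 2.5, 10),
  propdmgexp = c("K", "", "M")
)

# Hypothetical helper: number of empty-string entries in one column
count_blank <- function(col) sum(!is.na(col) & col == "")

na_counts    <- colSums(is.na(toy))
blank_counts <- vapply(toy, count_blank, integer(1))

na_counts
blank_counts
```

Blank exponents matter later, when the exponent columns are turned into multipliers.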

Question 1 - Population Health Impact

The first question asks about the effects of events on population health, so we only want data for events that resulted in fatalities or injuries. Create a new dataframe, health, that contains only the events that meet that criterion. Select columns evtype, fatalities, and injuries.

health <- events %>% filter(fatalities != 0 | injuries != 0) %>%
                    select(evtype,fatalities,injuries)

slice_sample(health, n=10)

##       evtype fatalities injuries
## 1    tornado          1        1
## 2    tornado          0        4
## 3    tornado          0        3
## 4  lightning          0        1
## 5      glaze          0        2
## 6    tornado          0       15
## 7  tstm wind          0        1
## 8  tstm wind          0        1
## 9    tornado          1        0
## 10   tornado          2       69

Add the number of fatalities and injuries for each row and put the total in a new column, “healtheffect”.

health <- health %>% 
          mutate(healtheffect = fatalities+injuries)

head(health)

##    evtype fatalities injuries healtheffect
## 1 tornado          0       15           15
## 2 tornado          0        2            2
## 3 tornado          0        2            2
## 4 tornado          0        2            2
## 5 tornado          0        6            6
## 6 tornado          0        1            1

Calculate the total health effect for each event type.

health <- health %>% 
         group_by(evtype) %>%
         summarise(healtheffect = sum(healtheffect))

head(health,10)

## # A tibble: 10 × 2
##    evtype                   healtheffect
##    <chr>                           <dbl>
##  1 avalance                            1
##  2 avalanche                         394
##  3 black ice                          25
##  4 blizzard                          906
##  5 blowing snow                       16
##  6 brush fire                          2
##  7 coastal flood                       5
##  8 coastal flooding                    3
##  9 coastal flooding/erosion            5
## 10 coastal storm                       5

The group_by function leaves the data sorted in alphabetical order by event type. Sort by healtheffect in descending order.

health <- health[order(-health$healtheffect),]

head(health,5)

## # A tibble: 5 × 2
##   evtype         healtheffect
##   <chr>                 <dbl>
## 1 tornado               96979
## 2 excessive heat         8428
## 3 tstm wind              7461
## 4 flood                  7259
## 5 lightning              6046
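The filter, sum, group, and sort steps above can also be written as one pipeline. This is a sketch on made-up rows, not the code used to produce the results here; arrange(desc()) stands in for the order() call:

```r
library(dplyr)

# Toy rows standing in for the events data frame (hypothetical values)
toy <- data.frame(
  evtype     = c("tornado", "flood", "tornado", "hail", "flood"),
  fatalities = c(1, 0, 2, 0, 3),
  injuries   = c(10, 4, 0, 0, 1)
)

health_toy <- toy %>%
  filter(fatalities != 0 | injuries != 0) %>%            # drop harmless events
  group_by(evtype) %>%
  summarise(healtheffect = sum(fatalities + injuries)) %>%
  arrange(desc(healtheffect))                             # largest total first

health_toy
```

The hail row is dropped by the filter, and the remaining event types come back ordered by their combined fatalities and injuries.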

Tornadoes are by far the most damaging event in terms of population health. Let’s see just how much more damaging they are. Total the health effect of all events other than tornado, and compare that total to tornado’s total.

tornadoes <- health[1,]
others <- sum(health$healtheffect[-1]) ##total all rows except tornado (row 1)

## create new dataframe to store & plot results
compare <- add_row(tornadoes, evtype = "others", healtheffect = others) 
compare

## # A tibble: 2 × 2
##   evtype  healtheffect
##   <chr>          <dbl>
## 1 tornado        96979
## 2 others         58694
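To put a number on the gap, the two totals above can be turned into shares. A short worked calculation using only the values printed in the table:

```r
# Totals copied from the comparison table above
healtheffect <- c(tornado = 96979, others = 58694)

# Share of all recorded fatalities and injuries, as a percentage
share <- round(100 * healtheffect / sum(healtheffect), 1)
share
```

Tornadoes account for roughly 62% of all recorded fatalities and injuries, with every other event type together making up the remaining 38%.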

Plot the results.

barplot(compare$healtheffect, width=.25,names.arg= c("Tornadoes", "All Others"),
    xlab="Event Type", ylab="Fatalities & Injuries", col="#1E7AB1", yaxt="n")
axis(2, at=seq(0,120000, by=20000), labels = c("","20K","40K","60K","80K","100K",""))
title("Tornadoes vs All Other Event Types",line=1)
mtext("1950 - November 2011",side=3,line=0)

Tornadoes are responsible for more fatalities and injuries than all other event types combined. Let’s see if this is true for economic impact as well.

Question 2 - Economic Impact

To check the economic impact of event types, we need to know the cost of property damage and crop damage. Filter the events dataframe to keep rows where propdmg or cropdmg is not zero, and select columns evtype, propdmg, propdmgexp, cropdmg, and cropdmgexp.

economic <- events %>% filter(propdmg != 0 | cropdmg != 0) %>%
                    select(evtype,propdmg, propdmgexp,cropdmg,cropdmgexp)
head(economic)

##    evtype propdmg propdmgexp cropdmg cropdmgexp
## 1 tornado    25.0          K       0           
## 2 tornado     2.5          K       0           
## 3 tornado    25.0          K       0           
## 4 tornado     2.5          K       0           
## 5 tornado     2.5          K       0           
## 6 tornado     2.5          K       0

Check the values in the PROPDMGEXP & CROPDMGEXP columns. The values will be used to create a multiplier (x1 & x2) for each DMG variable.

y1 <- unique(economic$propdmgexp)
y1

##  [1] "K" "M" "B" "m" ""  "+" "0" "5" "6" "4" "h" "2" "7" "3" "H" "-"

y2 <- unique(economic$cropdmgexp)
y2

## [1] ""  "M" "K" "m" "B" "?" "0" "k"

Use the information above to create multiplier variables. According to the NWS documentation, propdmgexp & cropdmgexp should only contain monetary indicators (e.g., K, M, B). The code below also treats h/H as hundreds. All other values (blanks, digits, and symbols) are assigned a multiplier of one (1).

economic <- economic %>%
      mutate(x1=case_when(propdmgexp=="h"|propdmgexp=="H" ~ 100,
                propdmgexp=="K" ~ 1000,
                propdmgexp=="m"|propdmgexp=="M" ~ 1000000,
                propdmgexp=="B" ~ 1000000000,
                TRUE ~ 1)
           )
economic <- economic %>%
      mutate(x2=case_when(cropdmgexp=="k"|cropdmgexp=="K" ~ 1000,
                cropdmgexp=="m"|cropdmgexp=="M" ~ 1000000,
                cropdmgexp=="B" ~ 1000000000, ## one billion, same as x1
                TRUE ~ 1)
           )

head(economic,10)

##     evtype propdmg propdmgexp cropdmg cropdmgexp   x1 x2
## 1  tornado    25.0          K       0            1000  1
## 2  tornado     2.5          K       0            1000  1
## 3  tornado    25.0          K       0            1000  1
## 4  tornado     2.5          K       0            1000  1
## 5  tornado     2.5          K       0            1000  1
## 6  tornado     2.5          K       0            1000  1
## 7  tornado     2.5          K       0            1000  1
## 8  tornado     2.5          K       0            1000  1
## 9  tornado    25.0          K       0            1000  1
## 10 tornado    25.0          K       0            1000  1
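The same mapping can be expressed as a named lookup vector, which keeps the exponent codes and their values in one place. This is a sketch under the assumption that case should be ignored for every code (the case_when() calls above only accept upper-case K for property damage); exp_multiplier is a hypothetical helper name:

```r
# Hypothetical helper: map exponent codes to numeric multipliers
exp_multiplier <- function(code) {
  lookup <- c(h = 1e2, k = 1e3, m = 1e6, b = 1e9)
  mult <- unname(lookup[tolower(code)])
  mult[is.na(mult)] <- 1   # blanks, digits and symbols fall back to 1
  mult
}

codes <- c("K", "M", "B", "m", "", "+", "0", "H", "?")
data.frame(code = codes, multiplier = exp_multiplier(codes))
```

A lookup like this makes it harder for the property and crop multipliers to drift apart, since both columns would go through the same table.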

Now multiply each damage variable by its multiplier (x1 or x2), store the results in new columns, and add them to get a total cost per row.

 economic  <- economic %>%
    mutate(totalprop=propdmg*x1)  %>%
    mutate(totalcrop=cropdmg*x2) %>%
    mutate(total=totalprop+totalcrop)

head(economic)

##    evtype propdmg propdmgexp cropdmg cropdmgexp   x1 x2 totalprop totalcrop
## 1 tornado    25.0          K       0            1000  1     25000         0
## 2 tornado     2.5          K       0            1000  1      2500         0
## 3 tornado    25.0          K       0            1000  1     25000         0
## 4 tornado     2.5          K       0            1000  1      2500         0
## 5 tornado     2.5          K       0            1000  1      2500         0
## 6 tornado     2.5          K       0            1000  1      2500         0
##   total
## 1 25000
## 2  2500
## 3 25000
## 4  2500
## 5  2500
## 6  2500

For easier review, select only the evtype and totals columns.

costs <- select(economic,evtype,totalprop,totalcrop, total)
head(costs)

##    evtype totalprop totalcrop total
## 1 tornado     25000         0 25000
## 2 tornado      2500         0  2500
## 3 tornado     25000         0 25000
## 4 tornado      2500         0  2500
## 5 tornado      2500         0  2500
## 6 tornado      2500         0  2500

Group by evtype and sum the totals.

costs <- costs %>% 
      group_by(evtype) %>%
      ## convert to billions for easier reading
      summarise(totalcost = round((sum(total)/1000000000),3))

costs <- costs[order(-costs$totalcost),] ##sort data by total cost in desc order

head(costs,5)

## # A tibble: 5 × 2
##   evtype            totalcost
##   <chr>                 <dbl>
## 1 flood                 150. 
## 2 hurricane/typhoon      70.6
## 3 tornado                57.4
## 4 storm surge            43.3
## 5 hail                   18.8
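The scaling and aggregation steps can be checked by hand on a few made-up rows. The sketch below uses base R aggregate() in place of group_by() and summarise(); the values are assumptions chosen so the arithmetic is easy to follow:

```r
# Toy rows: damage amounts with multipliers already resolved (hypothetical)
toy <- data.frame(
  evtype  = c("flood", "flood", "hail"),
  propdmg = c(2, 500, 1),
  x1      = c(1e9, 1e6, 1e9),
  cropdmg = c(500, 0, 250),
  x2      = c(1e6, 1, 1e6)
)

# Dollar cost per row: property plus crop damage
toy$total <- toy$propdmg * toy$x1 + toy$cropdmg * toy$x2

# Total per event type, converted to billions and sorted descending
cost <- aggregate(total ~ evtype, data = toy, FUN = sum)
cost$total_bn <- cost$total / 1e9
cost <- cost[order(-cost$total_bn), ]
cost
```

Here the two flood rows add up to 3 billion (2B + 0.5B property, 0.5B crop) and the hail row to 1.25 billion, so flood sorts first.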

Plot the results.

barplot(costs$totalcost[1:5], width=.25,names.arg= c("floods", "hurricanes","tornadoes","storm surge","hail"),
    xlab="Event Type", ylab="Property & Crop Damage Cost (Billions)", col="#ECE355")
title(main="Events with Highest Economic Impact", line=1)
mtext("1950 - November 2011",side=3,line=0)

Results

Floods and tornadoes are the only events that show up in the top 5 for both economic impact and public health impact. Each takes a top spot in our look at how weather events affect the US in terms of public health and costs.

Floods have the largest economic impact in the US, costing approximately $80B more than the next most costly event, hurricanes. They rank 4th in terms of their impact on public health.

Tornadoes come in at a distant 3rd in terms of costs, yet they account for more injuries/fatalities in the US than all other events combined.