Severe weather events from the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database - exploring health and economic impact.

Synopsis

  • The data is downloaded and loaded into R.
  • The required fields for the analysis are extracted.
  • The data is cleaned to ensure that the event types match closely to the documentation.
  • The economic data is cleaned to make actual dollar amounts.
  • The health data is combined from fatalities and injuries.
  • Graphs are created to show the top 10 events for health impact (fatalities and injuries combined) and economic cost (in dollars).

Data processing

Initial download

Data file is downloaded from https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2.

There is also some dcumentation of the database available. Here you will find how some of the variables are constructed/defined.

# download the data file if it doesnt already exist
if (!file.exists("StormData.csv.bz2"))
{
        fileUrl<-"https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
        download.file(fileUrl, destfile="StormData.csv.bz2", method="curl")
}

# Read the datafile
stormData <- read.csv("StormData.csv.bz2")

# Check the structure
str(stormData)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Extracting relevant fields

The event types are stored in EVTYPE

To evaluate health impact we need FATALITIES and INJURIES

To evaluate economic impact we need PROPDMG (property damage), CROPDMG (crop damage). We also need the PROPDMGEXP and CROPDMGEXP which are multipliers on the ###DMG variables. (for example K for thousand, M for million)

# setup libraries required
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

# create an extract with just the required fields
stormDataEx <- select(stormData, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, FATALITIES, INJURIES)

Exploring the data and making transformations required

Each of the fields is checked for its contents to see what needs to be cleaned.

EVTYPE seems to be inconsistent in upper/lower case, as well as has many sub groups. There is 985 EVTYPE factors, we will look at the most common 20 and clean them until they align with the documentation. Additional cleaning is added in this step as found by looking at the top 20 sorted by health and economic data.

The DMG fields are numbers, but need to be multiplied by the EXP fields to make sensible information.

The EXP fields need to be cleaned. Primarily it contains H, K, M, B (hundreds, thousands, millions, billions) But it also contains some direct multipliers like 3,4,5,6,7 (assumed to be the number of zeroes) Some lower case h and m (assumed to be H and M) Also some -, +, ? (assumed to be no multiplier)

FATALITIES and INJURIES are just raw numbers so can be combined directly to make a health impact variable

The 2 DMG variables are multiplied with their exponents and combined to make the ECON variable

# summary
summary(stormDataEx)
##                EVTYPE          PROPDMG          PROPDMGEXP    
##  HAIL             :288661   Min.   :   0.00          :465934  
##  TSTM WIND        :219940   1st Qu.:   0.00   K      :424665  
##  THUNDERSTORM WIND: 82563   Median :   0.00   M      : 11330  
##  TORNADO          : 60652   Mean   :  12.06   0      :   216  
##  FLASH FLOOD      : 54277   3rd Qu.:   0.50   B      :    40  
##  FLOOD            : 25326   Max.   :5000.00   5      :    28  
##  (Other)          :170878                     (Other):    84  
##     CROPDMG          CROPDMGEXP       FATALITIES          INJURIES        
##  Min.   :  0.000          :618413   Min.   :  0.0000   Min.   :   0.0000  
##  1st Qu.:  0.000   K      :281832   1st Qu.:  0.0000   1st Qu.:   0.0000  
##  Median :  0.000   M      :  1994   Median :  0.0000   Median :   0.0000  
##  Mean   :  1.527   k      :    21   Mean   :  0.0168   Mean   :   0.1557  
##  3rd Qu.:  0.000   0      :    19   3rd Qu.:  0.0000   3rd Qu.:   0.0000  
##  Max.   :990.000   B      :     9   Max.   :583.0000   Max.   :1700.0000  
##                    (Other):     9
# showing event types - sorted by top 20 as there is too many
head(sort(table(stormDataEx$EVTYPE), decreasing = TRUE), n=20)
## 
##                     HAIL                TSTM WIND        THUNDERSTORM WIND 
##                   288661                   219940                    82563 
##                  TORNADO              FLASH FLOOD                    FLOOD 
##                    60652                    54277                    25326 
##       THUNDERSTORM WINDS                HIGH WIND                LIGHTNING 
##                    20843                    20212                    15754 
##               HEAVY SNOW               HEAVY RAIN             WINTER STORM 
##                    15708                    11723                    11433 
##           WINTER WEATHER             FUNNEL CLOUD         MARINE TSTM WIND 
##                     7026                     6839                     6175 
## MARINE THUNDERSTORM WIND               WATERSPOUT              STRONG WIND 
##                     5812                     3796                     3566 
##     URBAN/SML STREAM FLD                 WILDFIRE 
##                     3392                     2761
# showing EXP field information
table(stormDataEx$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5      6 
## 465934      1      8      5    216     25     13      4      4     28      4 
##      7      8      B      h      H      K      m      M 
##      5      1     40      1      6 424665      7  11330
table(stormDataEx$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994
# upper case the EXP variables
stormDataEx$PROPDMGEXP <- toupper(stormDataEx$PROPDMGEXP)
stormDataEx$CROPDMGEXP <- toupper(stormDataEx$CROPDMGEXP)

# cleaning the EXP variables
stormDataEx$CROPDMGEXP[(stormDataEx$CROPDMGEXP == "")] <- 10^0
stormDataEx$CROPDMGEXP[(stormDataEx$CROPDMGEXP == "?")] <- 10^0
stormDataEx$CROPDMGEXP[(stormDataEx$CROPDMGEXP == "0")] <- 10^0
stormDataEx$CROPDMGEXP[(stormDataEx$CROPDMGEXP == "2")] <- 10^2
stormDataEx$CROPDMGEXP[(stormDataEx$CROPDMGEXP == "K")] <- 10^3
stormDataEx$CROPDMGEXP[(stormDataEx$CROPDMGEXP == "M")] <- 10^6
stormDataEx$CROPDMGEXP[(stormDataEx$CROPDMGEXP == "B")] <- 10^9
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "")] <- 10^0
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "-")] <- 10^0
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "?")] <- 10^0
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "+")] <- 10^0
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "0")] <- 10^0
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "1")] <- 10^1
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "2")] <- 10^2
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "3")] <- 10^3
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "4")] <- 10^4
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "5")] <- 10^5
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "6")] <- 10^6
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "7")] <- 10^7
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "8")] <- 10^8
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "H")] <- 10^2
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "K")] <- 10^3
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "M")] <- 10^6
stormDataEx$PROPDMGEXP[(stormDataEx$PROPDMGEXP == "B")] <- 10^9

stormDataEx$PROPDMGEXP <- as.numeric(stormDataEx$PROPDMGEXP)
stormDataEx$CROPDMGEXP <- as.numeric(stormDataEx$PROPDMGEXP)

# creating the total economic consequences variable
stormDataEx <- mutate(stormDataEx, ECON = (PROPDMG*PROPDMGEXP)+(CROPDMG*CROPDMGEXP) )


# upper case the EVTYPE
stormDataEx$EVTYPE <- toupper(stormDataEx$EVTYPE)

# check EVTYPE again - looking at the top 20
head(sort(table(stormDataEx$EVTYPE), decreasing = TRUE), n=20)
## 
##                     HAIL                TSTM WIND        THUNDERSTORM WIND 
##                   288661                   219942                    82564 
##                  TORNADO              FLASH FLOOD                    FLOOD 
##                    60652                    54277                    25327 
##       THUNDERSTORM WINDS                HIGH WIND                LIGHTNING 
##                    20843                    20214                    15754 
##               HEAVY SNOW               HEAVY RAIN             WINTER STORM 
##                    15708                    11742                    11433 
##           WINTER WEATHER             FUNNEL CLOUD         MARINE TSTM WIND 
##                     7045                     6844                     6175 
## MARINE THUNDERSTORM WIND               WATERSPOUT              STRONG WIND 
##                     5812                     3796                     3569 
##     URBAN/SML STREAM FLD                 WILDFIRE 
##                     3392                     2761
# cleaning the variables
stormDataEx$EVTYPE[(stormDataEx$EVTYPE == "TSTM WIND")] <- "THUNDERSTORM WIND"
stormDataEx$EVTYPE[(stormDataEx$EVTYPE == "THUNDERSTORM WINDS")] <- "THUNDERSTORM WIND"
stormDataEx$EVTYPE[(stormDataEx$EVTYPE == "MARINE TSTM WIND")] <- "MARINE THUNDERSTORM WIND"
stormDataEx$EVTYPE[(stormDataEx$EVTYPE == "URBAN/SML STREAM FLD")] <- "FLOOD"

# some cleaning as found in the health data
stormDataEx$EVTYPE[(stormDataEx$EVTYPE == "WILD/FOREST FIRE")] <- "WILDFIRE"

# some cleaning as found in the economic data
stormDataEx$EVTYPE[(stormDataEx$EVTYPE == "HURRICANE")] <- "HURRICANE/TYPHOON"
stormDataEx$EVTYPE[(stormDataEx$EVTYPE == "STORM SURGE")] <- "STORM SURGE/TIDE"
stormDataEx$EVTYPE[(stormDataEx$EVTYPE == "RIVER FLOOD")] <- "FLOOD"

# check EVTYPE again - looking at the top 20
head(sort(table(stormDataEx$EVTYPE), decreasing = TRUE), n=20)
## 
##        THUNDERSTORM WIND                     HAIL                  TORNADO 
##                   323349                   288661                    60652 
##              FLASH FLOOD                    FLOOD                HIGH WIND 
##                    54277                    28892                    20214 
##                LIGHTNING               HEAVY SNOW MARINE THUNDERSTORM WIND 
##                    15754                    15708                    11987 
##               HEAVY RAIN             WINTER STORM           WINTER WEATHER 
##                    11742                    11433                     7045 
##             FUNNEL CLOUD                 WILDFIRE               WATERSPOUT 
##                     6844                     4218                     3796 
##              STRONG WIND                 BLIZZARD                  DROUGHT 
##                     3569                     2719                     2488 
##                ICE STORM           EXCESSIVE HEAT 
##                     2006                     1678
# combining FATALITIES and INJURIES
stormDataEx <- mutate(stormDataEx, HEALTH = FATALITIES + INJURIES)

# summing the health impacts by event
healthdata <- with(stormDataEx, aggregate(HEALTH ~ EVTYPE, FUN = sum))

# arranging by descending number of total health impacts
healthdata <- arrange(healthdata, desc(HEALTH))

# summing the economic cost in dollars by event
econdata <- with(stormDataEx, aggregate(ECON ~ EVTYPE, FUN = sum))

# arranging by descending number of total economic cost
econdata <- arrange(econdata, desc(ECON))

Results

Figure 1 displays the total health impact numbers by event type

# plotting health using ggplot
g_h <- ggplot(healthdata[1:10,], aes(x=reorder(EVTYPE, -HEALTH),y=HEALTH)) + 
        geom_bar(stat="identity") + 
        xlab("Event types") + ylab("Total health impacts (fatalities and injuries)") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
        theme(legend.position="none") +
        ggtitle("Total health impact numbers (Injuries + Fatalities) by weather event type")
g_h

Figure 2 displays the total economic cost by event type

# plotting economic cost using ggplot
g_e <- ggplot(econdata[1:10,], aes(x=reorder(EVTYPE, -ECON),y=ECON)) + 
        geom_bar(stat="identity") + 
        xlab("Event types") + ylab("Total economic cost ($)") +
        theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
        theme(legend.position="none") +
        ggtitle("Total economic cost by weather event type")
g_e

Summary

  • Tornadoes are the highest cause of health impacts (fatalities and injuries) in the data.
  • Hurricanes/Typhoons are the highest cause of economic cost in the data.