This research analyses impact of various weather events on public health and economics of USA. The data is taken from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The events in the database start in the year 1950 and end in November 2011 and cover all states. The main goal of the analysis was to reveal which weather events were the most harmful for public health and which caused the biggest economic problems. Current research doesn’t take into account a lack of good records in earlier years and money inflation.

Data Processing

Here are the libraries we will need to process the data.

library(readr) # reading bz2
library(dplyr) # dealing with data tables
library(stringdist) # amatch() function
library(ggplot2) # plots
library(reshape2) # melting tables
library(scales) # number formatting

At first, we read the raw data taken here.

storm.data <- read_csv("repdata-data-StormData.csv.bz2")

Let’s take a quick look into the data.

str(storm.data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  NA NA NA NA ...
##  $ BGN_LOCATI: chr  NA NA NA NA ...
##  $ END_DATE  : chr  NA NA NA NA ...
##  $ END_TIME  : chr  NA NA NA NA ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: chr  NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  NA NA NA NA ...
##  $ END_LOCATI: chr  NA NA NA NA ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  NA NA NA NA ...
##  $ WFO       : chr  NA NA NA NA ...
##  $ STATEOFFIC: chr  NA NA NA NA ...
##  $ ZONENAMES : chr  NA NA NA NA ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  NA NA NA NA ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

To answer two addressed questions about public health and economic consequences caused by weather events we’re going to use the following variables from the dataset:

Let’s select necessary data and filter non-zero values from the initial dataset.

damage.data <- storm.data%>%select(c(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))%>%filter(FATALITIES > 0 | INJURIES  > 0 | PROPDMG > 0 | CROPDMG > 0)
str(damage.data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    254633 obs. of  7 variables:
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  NA NA NA NA ...
length(unique(damage.data$EVTYPE))
## [1] 485
head(unique(damage.data$EVTYPE), 20)
##  [1] "TORNADO"                      "TSTM WIND"                   
##  [3] "HAIL"                         "ICE STORM/FLASH FLOOD"       
##  [5] "WINTER STORM"                 "HURRICANE OPAL/HIGH WINDS"   
##  [7] "THUNDERSTORM WINDS"           "HURRICANE ERIN"              
##  [9] "HURRICANE OPAL"               "HEAVY RAIN"                  
## [11] "LIGHTNING"                    "THUNDERSTORM WIND"           
## [13] "DENSE FOG"                    "RIP CURRENT"                 
## [15] "THUNDERSTORM WINS"            "FLASH FLOODING"              
## [17] "FLASH FLOOD"                  "TORNADO F0"                  
## [19] "THUNDERSTORM WINDS LIGHTNING" "THUNDERSTORM WINDS/HAIL"

Now we have less data. However the list of event types is still huge and messy. According to the National Weather Service Storm Data Documentation, 2.1.1 Storm Data Event Table, there are only 48 events. We will clean our data by trying to coerce event type values to ones in this list (when possible).

event.types <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke", "Drought", "Dust Devil", "Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze", "Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Hurricane (Typhoon)", "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Winter Weather")
damage.data$EVTYPE2 <- event.types[amatch(damage.data$EVTYPE, toupper(event.types), maxDist = 10)]
length(unique(damage.data$EVTYPE2))
## [1] 49
table(is.na(damage.data$EVTYPE2))
## 
##  FALSE   TRUE 
## 253716    917
head(damage.data[is.na(damage.data$EVTYPE2), ], 10)
## Source: local data frame [10 x 8]
## 
##                            EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
##                             (chr)      (dbl)    (dbl)   (dbl)      (chr)
## 1       HURRICANE OPAL/HIGH WINDS          2        0     0.1          B
## 2    THUNDERSTORM WINDS LIGHTNING          0        0    85.0          K
## 3        LIGHTNING AND HEAVY RAIN          0        0    28.0          K
## 4  FLASH FLOODING/THUNDERSTORM WI          0        0   500.0          K
## 5          HIGH WINDS HEAVY RAINS          0        0     7.5          M
## 6           HIGH WINDS/HEAVY RAIN          0        0     5.0          K
## 7               SEVERE TURBULENCE          0        0    50.0          K
## 8  THUNDERSTORM WINDS/FUNNEL CLOU          0        0     5.0          K
## 9     HEAVY SURF COASTAL FLOODING          0        0    50.0          K
## 10        WINTER STORM HIGH WINDS          1       15    60.0          M
## Variables not shown: CROPDMG (dbl), CROPDMGEXP (chr), EVTYPE2 (chr)

Looks better now. We still have ~0.36% of unclassified events, but quick look into those cases shows that they are mostly mixed like ‘HURRICANE OPAL/HIGH WINDS’. We can’t neither classify them, nor get much value from them. The amount of such cases is also so insignificant, that it would not influence the results. So we just ignore them.

Public health data

Let’s prepare data to see which types of events are most harmful with respect to population health.

health.damage <- damage.data%>%select(EVTYPE2, FATALITIES, INJURIES)%>%filter((FATALITIES > 0 | INJURIES > 0) & !is.na(EVTYPE2))%>%group_by(EVTYPE2)%>%summarise_each(funs(sum), FATALITIES, INJURIES)%>%arrange(desc(FATALITIES))
str(health.damage)
## Classes 'tbl_df', 'tbl' and 'data.frame':    41 obs. of  3 variables:
##  $ EVTYPE2   : chr  "Tornado" "Excessive Heat" "Heat" "Flash Flood" ...
##  $ FATALITIES: num  5633 2020 1116 1041 818 ...
##  $ INJURIES  : num  91364 6703 2463 1804 5233 ...

Economic data

Let’s prepare data to see which types of events are most harmful with respect to economics.

economic.damage <- damage.data%>%select(EVTYPE2, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)%>%filter((PROPDMG > 0 | CROPDMG > 0) & !is.na(EVTYPE2))
str(economic.damage)
## Classes 'tbl_df', 'tbl' and 'data.frame':    244207 obs. of  5 variables:
##  $ EVTYPE2   : chr  "Tornado" "Tornado" "Tornado" "Tornado" ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  NA NA NA NA ...

Let’s make a summary of the data related to property and crops.

summary(economic.damage$PROPDMG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    2.00    8.00   44.43   25.00 5000.00
table(economic.damage$PROPDMGEXP)
## 
##      -      +      0      2      3      4      5      6      7      B 
##      1      5    209      1      1      4     18      3      2     37 
##      h      H      K      m      M 
##      1      6 228283      7  11294
summary(economic.damage$CROPDMG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   5.622   0.000 990.000
table(economic.damage$CROPDMGEXP)
## 
##     ?     0     B     k     K     m     M 
##     6    17     7    21 97830     1  1962

While PRORDMG and CROPDMG numbers are easy to understad, PROPDMGEXP and CROPDMGEXP values are not that obvious. We will use the research ‘How To Handle Exponent Value of PROPDMGEXP and CROPDMGEXP’ to make them more senisible and consistent. We will also replace NA’s with 0 and calculate total damage to property and crops.

exp <- data.frame(value = c('H', 'K', 'M', 'B', '+', '-', '?', ' ', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'), number = c(100, 1000, 1000000, 1000000000, 1, 0, 0, 0, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10))
economic.damage$PROPDMGEXP2 <- with(exp, number[match(economic.damage$PROPDMGEXP, value)])
economic.damage$CROPDMGEXP2 <- with(exp, number[match(economic.damage$CROPDMGEXP, value)])
economic.damage$PROPDMGEXP2[is.na(economic.damage$PROPDMGEXP2)] <- 0 
economic.damage$CROPDMGEXP2[is.na(economic.damage$CROPDMGEXP2)] <- 0 
economic.damage$TOTALDMG <- with(economic.damage, PROPDMG*PROPDMGEXP2+CROPDMG*CROPDMGEXP2)
head(economic.damage, 10)
## Source: local data frame [10 x 8]
## 
##    EVTYPE2 PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP PROPDMGEXP2 CROPDMGEXP2
##      (chr)   (dbl)      (chr)   (dbl)      (chr)       (dbl)       (dbl)
## 1  Tornado    25.0          K       0         NA        1000           0
## 2  Tornado     2.5          K       0         NA        1000           0
## 3  Tornado    25.0          K       0         NA        1000           0
## 4  Tornado     2.5          K       0         NA        1000           0
## 5  Tornado     2.5          K       0         NA        1000           0
## 6  Tornado     2.5          K       0         NA        1000           0
## 7  Tornado     2.5          K       0         NA        1000           0
## 8  Tornado     2.5          K       0         NA        1000           0
## 9  Tornado    25.0          K       0         NA        1000           0
## 10 Tornado    25.0          K       0         NA        1000           0
## Variables not shown: TOTALDMG (dbl)

The data looks better now. Let’s aggregate it.

economic.damage <- economic.damage%>%group_by(EVTYPE2)%>%summarise(TOTALDMG = sum(TOTALDMG))%>%arrange(desc(TOTALDMG))
str(economic.damage)
## Classes 'tbl_df', 'tbl' and 'data.frame':    48 obs. of  2 variables:
##  $ EVTYPE2 : chr  "Flood" "Hurricane (Typhoon)" "Tornado" "Storm Surge/Tide" ...
##  $ TOTALDMG: num  1.51e+11 7.55e+10 5.73e+10 4.80e+10 2.87e+10 ...

Results

Health consequenses

health.damage.melt <- melt(health.damage, id.vars = 'EVTYPE2')
health.damage.melt <- transform(health.damage.melt, EVTYPE2 = reorder(EVTYPE2, value))
ggplot(health.damage.melt, aes(x = factor(EVTYPE2), y = value, fill = variable)) +
        geom_bar(stat = 'identity', position = 'dodge') +
        coord_flip(ylim = c(0, 9000)) +
        ggtitle('Fatalities and injuries caused by weather events in USA 1950-2011') +
        ylab('Number of victims') + xlab('Type of event') +
        geom_text(aes(label = value), position = position_dodge(width = 1), size = 2, hjust = 0) +
        scale_fill_manual(name = '', labels = c('Fatalities', 'Injuries'), values = c('#FF6347', '#40E0D0'))
## ymax not defined: adjusting position using y instead

head(health.damage, 1)
## Source: local data frame [1 x 3]
## 
##   EVTYPE2 FATALITIES INJURIES
##     (chr)      (dbl)    (dbl)
## 1 Tornado       5633    91364

The major ammount of injuries (91364) happened because of tornadoes, this is an outstanding number which has a higher order of magnitude. For this plot, we limit the ‘Number of victims’ axis and crop the Tornado injuries bar in order to see other bars better.

The bars are ordered by total amount of victims (fatalities + injuries).

Economic consequenses

economic.damage <- transform(economic.damage, EVTYPE2 = reorder(EVTYPE2, TOTALDMG))
ggplot(economic.damage, aes(x = factor(EVTYPE2), y = TOTALDMG)) +
        geom_bar(stat = 'identity', fill = '#000099') + 
        coord_flip() +
        ggtitle('Damage to propery and crops caused by weather events in USA 1950-2011') + 
        ylab('Losses in $') + xlab('Type of event') +
        geom_text(aes(label = dollar(TOTALDMG)), size = 2, hjust = 0)

dollar(economic.damage[1, 2])
## [1] "$151,162,022,360"

The biggest loss of money was caused by floods ($151,162,022,360).