Synopsis

This analysis looks at the hard weather events in the US, based on data from time series taken from the National Oceanic and Atmospheric Administration (NOAA). This database not only tracks many events across all states in the US, but also provides information on injuries and property damages. The current data analysis makes use of the information and shows how tornados have the most harmful impact on people’s health, whereas floods are most expensive, ecnomically speaking.

Data Processing

First of all, libraries and the dataset are downlaoded and stored.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(ggplot2)

url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
#Download the dataset
download.file(url, "StormData.csv.bz2")
# Load the dataset
stormDataRaw <- read.csv("StormData.csv.bz2")
# Show the structure of the dataset
str(stormDataRaw)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "10/10/1954 0:00:00",..: 6523 6523 4213 11116 1426 1426 1462 2873 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "000","0000","00:00:00 AM",..: 212 257 2645 1563 2524 3126 122 1563 3126 3126 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4305 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "?","ABNORMAL WARMTH",..: 830 830 830 830 830 830 830 830 830 830 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","E","Eas","EE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","?","(01R)AFB GNRY RNG AL",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","10/10/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels "","?","0000",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","(0E4)PAYSON ARPT",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels "","2","43","9V9",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels ""," ","  ","   ",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

There are 902.297 observations w Only a subset will be taken for the analysis.

  1. For this analysis, the relevant variables are: t date (BGN_DATE), event type (EVTYPE), health impact (FATALITIES and INJURIES), monetary impact on crop and property (PROPDMG and CROPDMG) their corresponding exponents (PROPDMGEXP and CROPDMGEXP).

  2. According to the information provided by NOAA, the full set of wheather events ( is available since 1996. Between 1950 and 1995 only a subset (Tornado, Thunderstorm Wind and Hail) of these events is available in the storm database. In order to have o comparable basis for the analysis the dataset is limited to the observations between 1996 and 2011.

  3. The dataset contains a lot of observations that do not include information about health and/or economic damages. These observations are excluded from the analysis.

stormData <- select(stormDataRaw, BGN_DATE, EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP, FATALITIES, INJURIES)

# Format the BGN_DATE variable as a date
stormData$BGN_DATE <- as.Date(stormData$BGN_DATE, "%m/%d/%Y")
stormData$YEAR <- year(stormData$BGN_DATE)

# Tornado 1950 - 1954
# Tornado, Thunderstorm Wind, Hail 1955 - 1995
# 48 Events since 1996
# Only use events since 1996
stormData <- filter(stormData, YEAR >= 1996)

# Only use events with either health impact or economic damage
stormData <- filter(stormData, PROPDMG > 0 | CROPDMG > 0 | FATALITIES > 0 | INJURIES > 0)
#Partial results are shown
head(stormData)
##     BGN_DATE       EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP FATALITIES
## 1 1996-01-06 WINTER STORM     380          K      38          K          0
## 2 1996-01-11      TORNADO     100          K       0                     0
## 3 1996-01-11    TSTM WIND       3          K       0                     0
## 4 1996-01-11    TSTM WIND       5          K       0                     0
## 5 1996-01-11    TSTM WIND       2          K       0                     0
## 6 1996-01-18    HIGH WIND     400          K       0                     0
##   INJURIES YEAR
## 1        0 1996
## 2        0 1996
## 3        0 1996
## 4        0 1996
## 5        0 1996
## 6        0 1996

The data related to echonomical damages require some adjustments. Each variable - CROPDMG and PROPDMG - comes with a separate exponent - CROPDMGEXP and PROPDMGEXP. Then, the first step is to convert the exponent variables into a proper factor.

Most of the numerical exponents in CROPDMGEXP are missing. The factor is only calculated for the exponents provided in that variable.

table(stormData$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5      6 
##   8448      0      0      0      0      0      0      0      0      0      0 
##      7      8      B      h      H      K      m      M 
##      0      0     32      0      0 185474      0   7364
table(stormData$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 102767      0      0      0      2      0  96787      0   1762
stormData$PROPDMGEXP <- toupper(stormData$PROPDMGEXP)
stormData$CROPDMGEXP <- toupper(stormData$CROPDMGEXP)

stormData$CROPDMGFACTOR[(stormData$CROPDMGEXP == "")] <- 10^0
stormData$CROPDMGFACTOR[(stormData$CROPDMGEXP == "?")] <- 10^0
stormData$CROPDMGFACTOR[(stormData$CROPDMGEXP == "0")] <- 10^0
stormData$CROPDMGFACTOR[(stormData$CROPDMGEXP == "2")] <- 10^2
stormData$CROPDMGFACTOR[(stormData$CROPDMGEXP == "K")] <- 10^3
stormData$CROPDMGFACTOR[(stormData$CROPDMGEXP == "M")] <- 10^6
stormData$CROPDMGFACTOR[(stormData$CROPDMGEXP == "B")] <- 10^9

stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "")] <- 10^0
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "-")] <- 10^0
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "?")] <- 10^0
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "+")] <- 10^0
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "0")] <- 10^0
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "1")] <- 10^1
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "2")] <- 10^2
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "3")] <- 10^3
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "4")] <- 10^4
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "5")] <- 10^5
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "6")] <- 10^6
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "7")] <- 10^7
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "8")] <- 10^8
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "H")] <- 10^2
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "K")] <- 10^3
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "M")] <- 10^6
stormData$PROPDMGFACTOR[(stormData$PROPDMGEXP == "B")] <- 10^9

This analysis fatalities and injuries are the same, hence both variables are added to form a new variable HEALTH_INJ . A similar aproach is used for the economic impact. Both crop and property damages (in USD) are multiplied by their corresponding factor and added to form a new variable ECONOMIC_COST.

stormData <- mutate(stormData, HEALTH_INJ = FATALITIES + INJURIES)
stormData <- mutate(stormData, ECONOMIC_COST = PROPDMG * PROPDMGFACTOR + CROPDMG * CROPDMGFACTOR)

The event types also require a more detailled examination. In this variable, there are a lot of duplicates. Some examples are elements containing the string “THUND. Since this analysis is looking at the most harmful events, only part of the event types are cleaned. Therefore the health impact (HEALTH_INJ) is summed up per event type. Only event types in the 95% quantile are to be cleaned.

healthImpact <- with(stormData, aggregate(HEALTH_INJ ~ EVTYPE, FUN = sum))
subset(healthImpact, HEALTH_INJ > quantile(HEALTH_INJ, prob = 0.95))
##                EVTYPE HEALTH_INJ
## 41     EXCESSIVE HEAT       8188
## 50        FLASH FLOOD       2561
## 52              FLOOD       7172
## 83               HEAT       1459
## 100         HIGH WIND       1318
## 105 HURRICANE/TYPHOON       1339
## 124         LIGHTNING       4792
## 175 THUNDERSTORM WIND       1530
## 179           TORNADO      22178
## 185         TSTM WIND       3870
## 210          WILDFIRE        986
## 217      WINTER STORM       1483

There are only two event types in the 95% quantile, which are not compliant to the official types defined in:

https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf

These are TSTM WIND and HURRICANE/TYPHOON.

stormData$EVTYPE <- toupper(stormData$EVTYPE)
stormData$EVTYPE[(stormData$EVTYPE == "TSTM WIND")] <- "THUNDERSTORM WIND"
stormData$EVTYPE[(stormData$EVTYPE == "HURRICANE/TYPHOON")] <- "HURRICANE (TYPHOON)"

The same procedure is been used to clean the event types for the most important economic impacts. After summing up the economic cost ECONOMIC_COST, events in the 95% quantile are to be cleaned.

economicCost <- with(stormData, aggregate(ECONOMIC_COST ~ EVTYPE, FUN = sum))
subset(economicCost, ECONOMIC_COST > quantile(ECONOMIC_COST, prob = 0.95))
##                  EVTYPE ECONOMIC_COST
## 28              DROUGHT   14413667000
## 43          FLASH FLOOD   16557105610
## 45                FLOOD  148919611950
## 63                 HAIL   17071172870
## 84            HURRICANE   14554229010
## 86  HURRICANE (TYPHOON)   71913712800
## 139         STORM SURGE   43193541000
## 144   THUNDERSTORM WIND    8812957230
## 147             TORNADO   24900370720
## 150      TROPICAL STORM    8320186550

Once again there are only two event types in the 95% quantile, which are not compliant to the official types: HURRICANE and STORM SURGE.

stormData$EVTYPE[(stormData$EVTYPE == "HURRICANE")] <- "HURRICANE (TYPHOON)"
stormData$EVTYPE[(stormData$EVTYPE == "STORM SURGE")] <- "STORM SURGE/TIDE"

Results

The cleaned data frame stormData is been aggregated per EVTYPE and provided in a descending order in the new data frame healthImpact:

healthImpact <- stormData %>% 
                group_by(EVTYPE) %>% 
                summarise(HEALTH_INJ = sum(HEALTH_INJ)) %>% 
                arrange(desc(HEALTH_INJ))
#healthImpact[1:10,]
g1 <- ggplot(healthImpact[1:10,], aes(x=reorder(EVTYPE, -HEALTH_INJ),y=HEALTH_INJ,color=EVTYPE)) + 
      geom_bar(stat="identity", fill="white") + 
      theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
      xlab("Event") + ylab("Number of fatalities and injuries") +
      theme(legend.position="none") +
      ggtitle("Fatalities and injuries in the US caused by severe weather events")
g1

The previous figure shows that tornados are the most harmful weather events for people’s health.

For the conomic impact, the cleaned data frame stormData is been aggregated per EVTYPE and provided in a descending order in the new data frame economicCost.

economicCost <- stormData %>% 
                group_by(EVTYPE) %>% 
                summarise(ECONOMIC_COST = sum(ECONOMIC_COST)) %>% 
                arrange(desc(ECONOMIC_COST))
#economicCost[1:10,]
g1 <- ggplot(economicCost[1:10,], aes(x=reorder(EVTYPE, -ECONOMIC_COST),y=ECONOMIC_COST,color=EVTYPE)) + 
      geom_bar(stat="identity", fill="white") + 
      theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
      xlab("Event") + ylab("Economic cost in USD") +
      theme(legend.position="none") +
      ggtitle("Economic cost in the US caused by severe weather events")
g1

It can be seen in the previous figure, how floods cause the biggest economical damages in the area evaluated.