1. Synopsis

The aim of this report is to present an exploratory data analysis of the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database in support of two research questions (described below). The original data was first processed to reduce the variable BGN_DATE to the year of each event, so that the evolution of events over time can be followed. The data was then subset to the variables that represent the measurements of interest: population health is represented by FATALITIES and INJURIES, and economic consequences by PROPDMG and CROPDMG. The latter two variables were summed into a single variable, Total_DMG. The results present summaries of the total number of injuries, the total number of fatalities and the total economic damage (property and crop damage). As there are 985 different event types, the 15 events with the highest values of injuries, fatalities and economic damage were selected. The plots present the number of injuries, the number of fatalities and the economic damage per type of event.

Research questions

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

2. Data Processing

Data on U.S. severe weather events was downloaded from the URL below directly into the working directory.

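bz2file <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destfile <- file.path(getwd(), "StormData.csv.bz2")
# mode = "wb" keeps the compressed archive intact on Windows
download.file(bz2file, destfile, mode = "wb")
# read.csv() reads the .bz2 archive directly, without manual decompression
StormData <- read.csv(destfile)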
library(dplyr)
library(lubridate)
library(ggplot2)
library(gridExtra)
dim(StormData)
## [1] 902297     37

There are 902297 observations and 37 variables.
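
Re-running the report repeats the download each time. A minimal sketch of a cached download follows; the helper name fetch_once is hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not part of the original analysis): download a file
# only when no local copy exists yet, and return the local path.
fetch_once <- function(url, destfile) {
    if (!file.exists(destfile)) {
        # mode = "wb" keeps binary archives intact on Windows
        download.file(url, destfile, mode = "wb")
    }
    destfile
}

# Possible usage with the URL from this report (commented out so that the
# sketch does not trigger a download when run on its own):
# bz2file <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# StormData <- read.csv(fetch_once(bz2file, "StormData.csv.bz2"))
```

The function returns the path in both cases, so it can be passed straight to read.csv().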

tbl_df(StormData)
## Source: local data frame [902,297 x 37]
## 
##    STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME  STATE
##      <dbl>             <fctr>   <fctr>    <fctr>  <dbl>     <fctr> <fctr>
## 1        1  4/18/1950 0:00:00     0130       CST     97     MOBILE     AL
## 2        1  4/18/1950 0:00:00     0145       CST      3    BALDWIN     AL
## 3        1  2/20/1951 0:00:00     1600       CST     57    FAYETTE     AL
## 4        1   6/8/1951 0:00:00     0900       CST     89    MADISON     AL
## 5        1 11/15/1951 0:00:00     1500       CST     43    CULLMAN     AL
## 6        1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE     AL
## 7        1 11/16/1951 0:00:00     0100       CST      9     BLOUNT     AL
## 8        1  1/22/1952 0:00:00     0900       CST    123 TALLAPOOSA     AL
## 9        1  2/13/1952 0:00:00     2000       CST    125 TUSCALOOSA     AL
## 10       1  2/13/1952 0:00:00     2000       CST     57    FAYETTE     AL
## ..     ...                ...      ...       ...    ...        ...    ...
## Variables not shown: EVTYPE <fctr>, BGN_RANGE <dbl>, BGN_AZI <fctr>,
##   BGN_LOCATI <fctr>, END_DATE <fctr>, END_TIME <fctr>, COUNTY_END <dbl>,
##   COUNTYENDN <lgl>, END_RANGE <dbl>, END_AZI <fctr>, END_LOCATI <fctr>,
##   LENGTH <dbl>, WIDTH <dbl>, F <int>, MAG <dbl>, FATALITIES <dbl>,
##   INJURIES <dbl>, PROPDMG <dbl>, PROPDMGEXP <fctr>, CROPDMG <dbl>,
##   CROPDMGEXP <fctr>, WFO <fctr>, STATEOFFIC <fctr>, ZONENAMES <fctr>,
##   LATITUDE <dbl>, LONGITUDE <dbl>, LATITUDE_E <dbl>, LONGITUDE_ <dbl>,
##   REMARKS <fctr>, REFNUM <dbl>.

Clean the data to prepare it for analysis

# Parse the "month/day/year hour:minute:second" strings and keep only the year as a number
StormData$BGN_DATE <- year(mdy_hms(as.character(StormData$BGN_DATE)))
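
To make the conversion concrete, the same year extraction can be sketched on a few toy strings. This version uses base R only, so it runs without lubridate; the sample values are copied from the BGN_DATE column shown above.

```r
# Toy input in the same "month/day/year hour:minute:second" layout as BGN_DATE
bgn <- c("4/18/1950 0:00:00", "2/20/1951 0:00:00", "11/15/1951 0:00:00")

# Parse with base R, then keep the four-digit year as a number
parsed <- as.POSIXct(bgn, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
years <- as.numeric(format(parsed, "%Y"))
print(years)
```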

The date variable was reduced to the year so that totals can be computed per year (DataByYear, below). EVTYPE is converted to character so that event types can be used as grouping labels.

StormData$EVTYPE <- as.character(StormData$EVTYPE)

Subset data for total Injuries, total Fatalities and total economic damage, per type of event

EventsData <- StormData %>%
        select(BGN_DATE, INJURIES, FATALITIES, EVTYPE, PROPDMG, CROPDMG) %>%
        group_by(EVTYPE) %>%
        summarise(
                Total_INJURIES = sum(INJURIES),
                Total_FATALITIES = sum(FATALITIES),
                Total_DMG = sum(PROPDMG, CROPDMG)
        ) 
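
Note that Total_DMG adds PROPDMG and CROPDMG as recorded. The database stores the magnitude of each amount separately in PROPDMGEXP and CROPDMGEXP, so the raw sum mixes units and is not a dollar figure. The sketch below shows one possible way to scale the values first. It assumes that the codes K, M and B mean thousands, millions and billions, and it treats every other code as a multiplier of 1, which is a simplification. The helper name to_dollars and the toy rows are hypothetical.

```r
# Hypothetical helper: convert a damage value and its exponent code to dollars.
# Assumption: "K", "M" and "B" mean thousands, millions and billions;
# any other code (including blank) is treated as a multiplier of 1.
to_dollars <- function(value, exp_code) {
    multipliers <- c(K = 1e3, M = 1e6, B = 1e9)
    code <- toupper(as.character(exp_code))
    mult <- unname(multipliers[code])
    mult[is.na(mult)] <- 1
    value * mult
}

# Toy rows, not taken from the database
toy <- data.frame(
    PROPDMG = c(25, 2.5, 10),
    PROPDMGEXP = c("K", "M", ""),
    CROPDMG = c(0, 1, 5),
    CROPDMGEXP = c("", "k", "B")
)
toy$Total_DMG_USD <- to_dollars(toy$PROPDMG, toy$PROPDMGEXP) +
    to_dollars(toy$CROPDMG, toy$CROPDMGEXP)
print(toy$Total_DMG_USD)
```

Applying such a conversion before summarise() would change the damage ranking, so the totals reported here should be read as sums of unscaled values.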

Subset data per year and type of event, for injuries, fatalities and economic damage

DataByYear <- StormData %>%
        select(BGN_DATE, INJURIES, FATALITIES, EVTYPE, PROPDMG, CROPDMG) %>%
        group_by(BGN_DATE) %>%
        summarise(
                Total_INJURIES = sum(INJURIES),
                Total_FATALITIES = sum(FATALITIES),
                Total_DMG = sum(PROPDMG, CROPDMG)
        ) %>% arrange(desc(Total_INJURIES)) 

3. Results

Evolution of population health (injuries and fatalities) and economic indicators (crop and property damage) over the years

Plot yearly totals to understand how they evolve over time

p1 <- ggplot(DataByYear, aes(BGN_DATE, Total_INJURIES)) + geom_line(color = "blue") + xlab("") + ylab("Total Injuries") + ggtitle("Population health indicators and economy over the years")
p2 <- ggplot(DataByYear, aes(BGN_DATE, Total_FATALITIES)) + geom_line(color = "darkblue")
p3 <- ggplot(DataByYear, aes(BGN_DATE, Total_DMG)) + geom_line(color = "darkorange")
grid.arrange(p1, p2, p3, nrow = 3)

The higher totals in later years may be partly due to more events being recorded in recent years, rather than to a real increase in severe weather.
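
One way to check this explanation is to count how many records exist per year and compare that curve with the totals. The sketch below uses toy years as a stand-in. In the analysis, the input would be the numeric StormData$BGN_DATE column.

```r
# Toy stand-in for the year column; the real input would be StormData$BGN_DATE
years <- c(1950, 1950, 1951, 1995, 1995, 1995, 1995)

# table() counts how many records fall in each year
records_per_year <- as.data.frame(table(Year = years), stringsAsFactors = FALSE)
records_per_year$Year <- as.numeric(records_per_year$Year)
print(records_per_year)
```

If the record count rises in step with injuries, fatalities and damage, the growth in the plots is at least partly a reporting effect.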

Population health

Select the top 15 events with the highest impact on population health (injuries and fatalities)

TopINJ <- tbl_df(EventsData) %>% arrange(desc(Total_INJURIES)) %>% print(n = 15) 
## Source: local data frame [985 x 4]
## 
##               EVTYPE Total_INJURIES Total_FATALITIES  Total_DMG
##                <chr>          <dbl>            <dbl>      <dbl>
## 1            TORNADO          91346             5633 3312276.68
## 2          TSTM WIND           6957              504 1445168.21
## 3              FLOOD           6789              470 1067976.36
## 4     EXCESSIVE HEAT           6525             1903    1954.40
## 5          LIGHTNING           5230              816  606932.39
## 6               HEAT           2100              937     961.20
## 7          ICE STORM           1975               89   67689.62
## 8        FLASH FLOOD           1777              978 1599325.05
## 9  THUNDERSTORM WIND           1488              133  943635.62
## 10              HAIL           1361               15 1268289.66
## 11      WINTER STORM           1321              206  134699.58
## 12 HURRICANE/TYPHOON           1275               64   10637.85
## 13         HIGH WIND           1137              248  342014.77
## 14        HEAVY SNOW           1021              127  124417.71
## 15          WILDFIRE            911               75   88823.54
## ..               ...            ...              ...        ...
TopFATAL <- tbl_df(EventsData) %>% arrange(desc(Total_FATALITIES)) %>% print(n = 15) 
## Source: local data frame [985 x 4]
## 
##               EVTYPE Total_INJURIES Total_FATALITIES  Total_DMG
##                <chr>          <dbl>            <dbl>      <dbl>
## 1            TORNADO          91346             5633 3312276.68
## 2     EXCESSIVE HEAT           6525             1903    1954.40
## 3        FLASH FLOOD           1777              978 1599325.05
## 4               HEAT           2100              937     961.20
## 5          LIGHTNING           5230              816  606932.39
## 6          TSTM WIND           6957              504 1445168.21
## 7              FLOOD           6789              470 1067976.36
## 8        RIP CURRENT            232              368       1.00
## 9          HIGH WIND           1137              248  342014.77
## 10         AVALANCHE            170              224    1623.90
## 11      WINTER STORM           1321              206  134699.58
## 12      RIP CURRENTS            297              204     162.00
## 13         HEAT WAVE            309              172    1524.55
## 14      EXTREME COLD            231              160   13778.68
## 15 THUNDERSTORM WIND           1488              133  943635.62
## ..               ...            ...              ...        ...

Plot the top weather events by impact on population health

plot1 <- ggplot(head(TopFATAL, 15), aes(x = reorder(EVTYPE, Total_FATALITIES), y = Total_FATALITIES)) + geom_bar(fill = "darkblue", stat = "identity") + guides(fill = FALSE) + xlab("Type of Event") + ylab("Total number of Fatalities") + coord_flip() + ggtitle("Weather events and health effects in the US")

plot2 <- ggplot(head(TopINJ, 15), aes(x = reorder(EVTYPE, Total_INJURIES), y = Total_INJURIES)) + geom_bar(fill = "darkred", stat = "identity") + guides(fill = FALSE) + xlab("Type of Event") + ylab("Total number of Injuries") + coord_flip()

grid.arrange(plot1, plot2, nrow = 2)

The plots present the events with the highest impact on population health. Tornadoes have by far the highest totals for both injuries and fatalities in the U.S. The ranking of the remaining event types differs between fatalities and injuries, and some categories appear under more than one label (for example RIP CURRENT and RIP CURRENTS), which suggests that the way events are classified may influence the analysis.
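
The effect of the labelling can be sketched by merging a few spelling variants before totalling. The mapping below is illustrative only: it covers just the variants visible in the tables of this report, and the helper name normalise_evtype is hypothetical. The fatality figures are copied from the table above.

```r
# Hypothetical helper: map a few spelling variants onto one label.
# The mapping is illustrative and far from a complete cleaning of EVTYPE.
normalise_evtype <- function(evtype) {
    x <- toupper(trimws(as.character(evtype)))
    x[x %in% c("TSTM WIND", "THUNDERSTORM WINDS")] <- "THUNDERSTORM WIND"
    x[x == "RIP CURRENTS"] <- "RIP CURRENT"
    x
}

# Fatality totals copied from the table above
toy <- data.frame(
    EVTYPE = c("TSTM WIND", "THUNDERSTORM WIND", "RIP CURRENT", "RIP CURRENTS"),
    Total_FATALITIES = c(504, 133, 368, 204)
)
toy$EVTYPE <- normalise_evtype(toy$EVTYPE)

# Re-total after the labels have been merged
merged <- aggregate(Total_FATALITIES ~ EVTYPE, data = toy, FUN = sum)
print(merged)
```

On these numbers, the two rip-current rows together account for 572 fatalities, more than the 470 listed for FLOOD, so merging labels can change the ranking.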

Economic consequences

Top 15 events that impact the economy (property and crop damage)

TopDAMAGE <- tbl_df(EventsData) %>% arrange(desc(Total_DMG)) %>% print(n = 15)
## Source: local data frame [985 x 4]
## 
##                EVTYPE Total_INJURIES Total_FATALITIES  Total_DMG
##                 <chr>          <dbl>            <dbl>      <dbl>
## 1             TORNADO          91346             5633 3312276.68
## 2         FLASH FLOOD           1777              978 1599325.05
## 3           TSTM WIND           6957              504 1445168.21
## 4                HAIL           1361               15 1268289.66
## 5               FLOOD           6789              470 1067976.36
## 6   THUNDERSTORM WIND           1488              133  943635.62
## 7           LIGHTNING           5230              816  606932.39
## 8  THUNDERSTORM WINDS            908               64  464978.11
## 9           HIGH WIND           1137              248  342014.77
## 10       WINTER STORM           1321              206  134699.58
## 11         HEAVY SNOW           1021              127  124417.71
## 12           WILDFIRE            911               75   88823.54
## 13          ICE STORM           1975               89   67689.62
## 14        STRONG WIND            280              103   64610.71
## 15         HEAVY RAIN            251               98   61964.94
## ..                ...            ...              ...        ...

Plot the top weather events by economic damage

ggplot(head(TopDAMAGE, 15), aes(x = reorder(EVTYPE, Total_DMG), y = Total_DMG, fill = Total_DMG)) + geom_bar(stat = "identity") + guides(fill = "none") + xlab("Type of Event") + ylab("Total Property and Crop Damage") + ggtitle("Weather events and economic impact in the US") + theme(text = element_text(size = 11), axis.text.x = element_text(angle = 90, vjust = 1))

The plot shows the 15 event types with the highest economic damage. Tornadoes show the highest economic damage, which is consistent with the results for population health.