A data analysis on severe weather events that influence population health and the economy in the U.S.

Synopsis

This report is the second course project for the course Reproducible Research from the Coursera course Data Science Specialization.

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern. This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The main finding of the data analysis is that tornadoes, by far, have the largest impact on population health (measured by the number of injuries and fatalities) and floods cause the greatest economic damage in terms of US dollars (measured by property and crop damage).

Data Processing

The data

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Loading libraries

The following libraries are used in the data analysis:

library("dplyr")
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library("plyr")
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
library("ggplot2")

Loading the data

The data is obtained from the link above and saved. The working directory has been set in order to read the data in Rstudio.

setwd('C:/Users/Hidde/Documents/Coursera/Data Science Foundations using R Specialization/5_Reproducible_Research/Project2')

if(!exists("stormData")) {
      stormData <- read.csv(bzfile("repdata_data_StormData.csv.bz2"), header = TRUE)
}

Quick examination of the data

There are 902297 observations in the dataset and 37 variables.

dim(stormData)
## [1] 902297     37

The structure of the dataset is as follows:

str(stormData)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Extracting the variables of interest

The variables of interest are population health, economy, and weather event types. For each of these, the following variables are selected:

Population health:

  • “FATALITIES”: approximation of the number of deaths
  • INJURIES": approximation of the number of injuries

Economic variables:

  • “PROPDMG”: approximation of property damages
  • “PROPDMGEXP”: unit of property damage value
  • “CROPDMG”: approximation of crop damages
  • “CROPDMGEXP”: unit of crop damage value

Weather event types:

  • “EVTYPE”: weather events
variables <- c( "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
dataSelection <- stormData[, variables]

Missing values check

In the new dataset with selected variables, it is important to check the number of missing values. The number of missing values per variable:

sapply(dataSelection, function(x) sum(is.na(x)))
##     EVTYPE FATALITIES   INJURIES    PROPDMG PROPDMGEXP    CROPDMG CROPDMGEXP 
##          0          0          0          0          0          0          0

And in percentage:

sapply(dataSelection, function(x) sum(is.na(x))) / nrow(dataSelection)
##     EVTYPE FATALITIES   INJURIES    PROPDMG PROPDMGEXP    CROPDMG CROPDMGEXP 
##          0          0          0          0          0          0          0

since there are no missing values, the data analysis can continue without managing missing values.

Transformation of the extracted variables

First, the ten most occurring weather events are selected.

sort(table(dataSelection$EVTYPE), decreasing = T)[1:10]
## 
##               HAIL          TSTM WIND  THUNDERSTORM WIND            TORNADO 
##             288661             219940              82563              60652 
##        FLASH FLOOD              FLOOD THUNDERSTORM WINDS          HIGH WIND 
##              54277              25326              20843              20212 
##          LIGHTNING         HEAVY SNOW 
##              15754              15708

Then, weather events are grouped based on reoccuring keywords. For example, “THUNDERSTORM WIND”, “THUNDERSTORM WINDS”, “HIGH WIND” are grouped based on the keyword “WIND”. Weather events that do not have common keywords are grouped in “OTHER”.

dataSelection$EVENT <- "OTHER"
dataSelection$EVENT[grep("HAIL", dataSelection$EVTYPE, ignore.case = T)] <- "HAIL"
dataSelection$EVENT[grep("HEAT", dataSelection$EVTYPE, ignore.case = T)] <- "HEAT"
dataSelection$EVENT[grep("FLOOD", dataSelection$EVTYPE, ignore.case = T)] <- "FLOOD"
dataSelection$EVENT[grep("WIND", dataSelection$EVTYPE, ignore.case = T)] <- "WIND"
dataSelection$EVENT[grep("STORM", dataSelection$EVTYPE, ignore.case = T)] <- "STORM"
dataSelection$EVENT[grep("SNOW", dataSelection$EVTYPE, ignore.case = T)] <- "SNOW"
dataSelection$EVENT[grep("TORNADO", dataSelection$EVTYPE, ignore.case = T)] <- "TORNADO"
dataSelection$EVENT[grep("WINTER", dataSelection$EVTYPE, ignore.case = T)] <- "WINTER"
dataSelection$EVENT[grep("RAIN", dataSelection$EVTYPE, ignore.case = T)] <- "RAIN"

sort(table(dataSelection$EVENT), decreasing = T)
## 
##    HAIL    WIND   STORM   FLOOD TORNADO   OTHER  WINTER    SNOW    RAIN    HEAT 
##  289270  255362  113156   82686   60700   48970   19604   17660   12241    2648

The units of the economic variables are as follows.

sort(table(dataSelection$PROPDMGEXP), decreasing = T)[1:10]
## 
##             K      M      0      B      5      1      2      ?      m 
## 465934 424665  11330    216     40     28     25     13      8      7
sort(table(dataSelection$CROPDMGEXP), decreasing = T)[1:10]
## 
##             K      M      k      0      B      ?      2      m   <NA> 
## 618413 281832   1994     21     19      9      7      1      1

Since these units are not very representative, they are transformed into the same unit:

  • K or k represents thousand dollars: 10^3
  • M or m represents thousand dollars: 10^6
  • B or b represents thousand dollars: 10^9
  • The remainder represents dollars
dataSelection$PROPDMGEXP <- as.character(dataSelection$PROPDMGEXP)

# Missing values are considered as dollars (in this case 0)
dataSelection$PROPDMGEXP[is.na(dataSelection$PROPDMGEXP)] <- 0

dataSelection$PROPDMGEXP[!grep("K|M|B", dataSelection$PROPDMGEXP, ignore.case = T)] <- 0
dataSelection$PROPDMGEXP[grep("K", dataSelection$PROPDMGEXP, ignore.case = T)] <- "3"
dataSelection$PROPDMGEXP[grep("M", dataSelection$PROPDMGEXP, ignore.case = T)] <- "6"
dataSelection$PROPDMGEXP[grep("B", dataSelection$PROPDMGEXP, ignore.case = T)] <- "9"
dataSelection$PROPDMGEXP <- as.numeric(as.character(dataSelection$PROPDMGEXP))
## Warning: NAs introduced by coercion
dataSelection$PropertyDamage <- dataSelection$PROPDMG * 10^dataSelection$PROPDMGEXP

dataSelection$CROPDMGEXP <- as.character(dataSelection$CROPDMGEXP)
# Missing values are considered as dollars (in this case 0)
dataSelection$CROPDMGEXP[is.na(dataSelection$CROPDMGEXP)] <- 0

dataSelection$CROPDMGEXP[!grep("K|M|B", dataSelection$CROPDMGEXP, ignore.case = T)] <- 0
dataSelection$CROPDMGEXP[grep("K", dataSelection$CROPDMGEXP, ignore.case = T)] <- "3"
dataSelection$CROPDMGEXP[grep("M", dataSelection$CROPDMGEXP, ignore.case = T)] <- "6"
dataSelection$CROPDMGEXP[grep("B", dataSelection$CROPDMGEXP, ignore.case = T)] <- "9"
dataSelection$CROPDMGEXP <- as.numeric(as.character(dataSelection$CROPDMGEXP))
## Warning: NAs introduced by coercion
dataSelection$CropDamage <- dataSelection$CROPDMG * 10^dataSelection$CROPDMGEXP

A second check of the values of economic variables, showing the first ten most occurring:

sort(table(dataSelection$PropertyDamage), decreasing = T)[1:10]
## 
##      0   5000  10000   1000   2000  25000  50000   3000  20000  15000 
## 197257  31731  21788  17545  17186  17104  13596  10364   9182   8617
sort(table(dataSelection$CropDamage), decreasing = T)[1:10]
## 
##      0   5000  10000  50000  1e+05   1000   2000  25000  20000  5e+05 
## 261781   4097   2349   1984   1233    956    951    830    758    721

Analysis

The weather events are aggregated for the population health and economic variables.

Aggregation of the pupulation health variables:

aggFatalitiesInjuries <- ddply(dataSelection, .(EVENT), summarize, total = sum(FATALITIES + INJURIES, na.rm = T))
aggFatalitiesInjuries$Type <- "Fatalities and injuries"

aggFatalities <- ddply(dataSelection, .(EVENT), summarize, total = sum(FATALITIES, na.rm = T))
aggFatalities$Type <- "Fatalities"

aggInjuries <- ddply(dataSelection, .(EVENT), summarize, total = sum(INJURIES, na.rm = T))
aggInjuries$Type <- "Injuries"

aggHealth <- rbind(aggFatalities, aggInjuries)

healthByEvent <- join(aggFatalities, aggInjuries, by = "EVENT", type = "inner")
healthByEvent
##      EVENT total       Type total     Type
## 1    FLOOD  1524 Fatalities  8602 Injuries
## 2     HAIL    15 Fatalities  1371 Injuries
## 3     HEAT  3138 Fatalities  9224 Injuries
## 4    OTHER  2626 Fatalities 12224 Injuries
## 5     RAIN   114 Fatalities   305 Injuries
## 6     SNOW   164 Fatalities  1164 Injuries
## 7    STORM   416 Fatalities  5339 Injuries
## 8  TORNADO  5661 Fatalities 91407 Injuries
## 9     WIND  1209 Fatalities  9001 Injuries
## 10  WINTER   278 Fatalities  1891 Injuries

Aggregation of the economic variables:

aggPropCropDmg <- ddply(dataSelection, .(EVENT), summarize, total = sum(PropertyDamage + CropDamage, na.rm = T))
aggPropCropDmg$Type <- "Property and crop damage"

aggProp <- ddply(dataSelection, .(EVENT), summarize, total = sum(PropertyDamage, na.rm = T))
aggProp$Type <- "Property"

aggCrop <- ddply(dataSelection, .(EVENT), summarize, total = sum(CropDamage, na.rm = T))
aggCrop$Type <- "Crop"

aggEconomic <- rbind(aggProp, aggCrop)

EconomicByEvent <- join(aggProp, aggCrop, by = "EVENT", type = "inner")
EconomicByEvent
##      EVENT        total     Type       total Type
## 1    FLOOD 168184668589 Property 12266906100 Crop
## 2     HAIL  15736042956 Property  3046837470 Crop
## 3     HEAT     20325750 Property   904469280 Crop
## 4    OTHER  97248432317 Property 23588880870 Crop
## 5     RAIN   3270230190 Property   919315800 Crop
## 6     SNOW   1024339740 Property   134683100 Crop
## 7    STORM  66513046188 Property  6374474880 Crop
## 8  TORNADO  58603317864 Property   417461520 Crop
## 9     WIND  10847166565 Property  1403719150 Crop
## 10  WINTER   6777295251 Property    47444000 Crop

Results

The first question to analyze is about the types of weather events that are most harmful with respect to population health.

aggHealth$EVENT <- as.character(aggHealth$EVENT)
plotHealth <- ggplot(aggHealth, aes(x = EVENT, y = total, fill = Type)) +
      geom_bar(stat = "identity") + 
      scale_fill_brewer(palette="Pastel2") +
      coord_flip() + 
      xlab("Event") + 
      ylab("Number of health impacts") +
      ggtitle("Weather event types and impact on population health") +
      theme_bw() +
      theme(plot.title = element_text(hjust = 0.5))

print(plotHealth)

From the graph presented above it can be seen that tornadoes, by far, have the most influence on health impacts. Wind, heat, flood, and other weather events have the second largest impact on health. It can also be observed that injuries are the dominant force on health impacts.

The second question to analyze relates to the types of weather events that have the greatest economic consequences.

aggEconomic$EVENT <- as.character(aggEconomic$EVENT)
plotEconomic <- ggplot(aggEconomic, aes(x = EVENT, y = total, fill = Type)) +
      geom_bar(stat = "identity") + 
      scale_fill_brewer(palette="Pastel1") +
      coord_flip() + 
      xlab("Event") + 
      ylab("Economic damage in US dollars") +
      ggtitle("Weather event types and impact on economy (property and crop damage)") +
      theme_bw() +
      theme(plot.title = element_text(hjust = 0.5))

print(plotEconomic)

From the graph presented above it can be concluded that greatest economic damage, measured in US dollars, is due to floods. Other weather events have the second largest impact. It can be seen that economic damage on property is the main factor in economic damage, relatively to crop damage.