What weather events are most harmful to health and the economy?

Synopsis

Using data from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, we ask which types of severe weather events are most harmful to human health and to the economy. We characterize the health effects of weather events by the number of injuries and the number of fatalities they produce. We characterize the effects on the economy by the damage to crops and to property. We find that tornadoes are by far the most damaging events in terms of fatalities, injuries, and property damage, and that they also contribute substantially to crop damage. We conclude that reducing the adverse effects of tornadoes should receive the highest priority. Other important sources of damage that deserve attention are excessive heat (for health) and flash floods (for the economy).

Data Processing

The data processing proceeds in several steps.

Importing the relevant packages

We use the tidyverse, together with R.utils (for bunzip2) and data.table (for fread).

library(tidyverse)
## ── Attaching packages ───────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.7.1 (2016-02-15) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.23.0 successfully loaded. See ?R.oo for help.
## 
## Attaching package: 'R.oo'
## The following object is masked from 'package:R.methodsS3':
## 
##     throw
## The following objects are masked from 'package:methods':
## 
##     getClasses, getMethods
## The following objects are masked from 'package:base':
## 
##     attach, detach, load, save
## R.utils v2.9.0 successfully loaded. See ?R.utils for help.
## 
## Attaching package: 'R.utils'
## The following object is masked from 'package:tidyr':
## 
##     extract
## The following object is masked from 'package:utils':
## 
##     timestamp
## The following objects are masked from 'package:base':
## 
##     cat, commandArgs, getOption, inherits, isOpen, nullfile,
##     parse, warnings
library(data.table)
## 
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
## 
##     between, first, last
## The following object is masked from 'package:purrr':
## 
##     transpose

Data Acquisition and Reading

We decompress the file downloaded from the NOAA website and load it into memory. Because we use the tidyverse, we convert the result to a tibble.

sourcefile <- "repdata_data_StormData.csv.bz2"
destfile <- "repdata_data_StormData.csv"
# Decompress, keep the archive, and skip the step if the CSV already exists
bunzip2(sourcefile, destfile, remove = FALSE, skip = TRUE)
## [1] "repdata_data_StormData.csv"
## attr(,"temporary")
## [1] FALSE
d <- fread(input = destfile)
dt <- as_tibble(d)
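
As an aside, the decompression step can also be avoided by reading through a compressed connection. The following is a minimal sketch in base R, assuming only that `bzfile()` and `read.csv()` are available; it uses a tiny temporary file with made-up rows rather than the NOAA file, and it is not the method used in this analysis.

```r
# Write a tiny CSV into a temporary bzip2 file (toy rows, not the NOAA data).
path <- tempfile(fileext = ".csv.bz2")
con <- bzfile(path, open = "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,2", "HAIL,0"), con)
close(con)

# Read it back through a bzfile() connection, without creating an unzipped copy.
toy <- read.csv(bzfile(path), stringsAsFactors = FALSE)
print(toy)
```

For a file as large as the storm database, `fread()` on the decompressed CSV, as done above, is likely to be faster; the connection approach mainly saves disk space.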

Exploratory Data Analysis

Let us start analyzing the dataset: which fields does it contain?

names(dt)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"

The names show that:
1. The event type is recorded in the “EVTYPE” field.
2. Harm to health is recorded in “FATALITIES” and “INJURIES”.
3. Economic damage is recorded in “PROPDMG” and “CROPDMG”.
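
The field list also contains “PROPDMGEXP” and “CROPDMGEXP”. In the NOAA data these hold magnitude codes such as K (thousands), M (millions) and B (billions) for the corresponding damage values. The totals in this report are sums of the raw PROPDMG and CROPDMG values, so the sketch below only illustrates how the codes could be applied if dollar amounts were needed. The helper name `exp_multiplier` is hypothetical, and treating every other code as a multiplier of 1 is an assumption.

```r
# Hypothetical helper: map a magnitude code to a numeric multiplier.
# Assumption: codes other than K, M, B (including blanks) count as 1.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  m <- unname(lookup[toupper(code)])
  m[is.na(m)] <- 1
  m
}

# Toy rows (not taken from the dataset).
toy <- data.frame(
  PROPDMG = c(25, 2.5, 1, 10),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)
toy$prop_dollars <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)
print(toy)
```

Applying such a scaling before summing could change the ranking of event types, so any comparison with the totals reported here should keep that difference in mind.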

Let us print summary statistics for these four fields:

summary(dt$FATALITIES)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##   0.0000   0.0000   0.0000   0.0168   0.0000 583.0000
summary(dt$INJURIES)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0000    0.0000    0.0000    0.1557    0.0000 1700.0000
summary(dt$PROPDMG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    0.00   12.06    0.50 5000.00
summary(dt$CROPDMG)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   1.527   0.000 990.000

How many observations do we have? And how many types of events are considered in the dataset?

length(dt$EVTYPE)
## [1] 902297
length(unique(dt$EVTYPE))
## [1] 985

We have 902297 observations and 985 types of events.
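
Some of these labels may differ only in letter case or surrounding whitespace; this is an assumption and is not checked in this analysis. A minimal sketch of how such variants could be collapsed before counting, using made-up labels:

```r
# Hypothetical labels illustrating case and whitespace variants.
labels <- c("TSTM WIND", "tstm wind", " TSTM WIND", "HAIL", "Hail")

# Count distinct labels before and after normalization.
n_raw <- length(unique(labels))
clean <- toupper(trimws(labels))
n_clean <- length(unique(clean))

cat("distinct labels before:", n_raw, "after:", n_clean, "\n")
```

The analysis below keeps the labels exactly as recorded, so an event type that appears under several spellings is counted as several types.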

n <- 10

Given this large number of event types, we focus on the 10 with the largest totals for each measure of harm.

Prepare the data for the analysis

Let us group the events by event type:

gb<-group_by(dt, EVTYPE)

We then summarise each group by taking the totals of FATALITIES, INJURIES, PROPDMG and CROPDMG:

s<-summarise(gb,
             tf=sum(FATALITIES,na.rm=TRUE),
             ti=sum(INJURIES,na.rm=TRUE),
             tpd=sum(PROPDMG,na.rm=TRUE),
             tcd=sum(CROPDMG,na.rm=TRUE))
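
To make the shape of the result concrete, here is the same group-and-total pattern on a small made-up table, written with the pipe. It is a sketch only: the rows are invented, only two of the four measures are included, and the column names `tf` and `ti` follow the code above.

```r
library(dplyr)

# Toy events (invented values); one FATALITIES entry is missing on purpose.
toy <- data.frame(
  EVTYPE = c("TORNADO", "TORNADO", "HAIL", "HAIL", "FLOOD"),
  FATALITIES = c(3, 1, 0, NA, 2),
  INJURIES = c(10, 5, 1, 0, 4),
  stringsAsFactors = FALSE
)

# One row per event type, with totals that ignore missing values.
totals <- toy %>%
  group_by(EVTYPE) %>%
  summarise(
    tf = sum(FATALITIES, na.rm = TRUE),
    ti = sum(INJURIES, na.rm = TRUE)
  ) %>%
  arrange(desc(tf), desc(ti))

print(totals)
```

Each event type collapses to a single row, which is why the summary table `s` has one row per distinct EVTYPE label.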

Effects on health: injuries and fatalities

We start with the effects on health. We sort the totals by FATALITIES, breaking ties by INJURIES, then PROPDMG, then CROPDMG, and keep the first 10 rows; these are the 10 event types with the most fatalities. We expect fatalities and injuries to be correlated, so these should also be among the most important causes of injury. The following figure plots the total number of injured people against the total number of fatalities, both on logarithmic axes.

x<-head(arrange(s,desc(tf),desc(ti),desc(tpd),desc(tcd)),n=n)
title = paste(
    'Total number of injured people vs total number of fatalities for the\n',
    n, 'most important causes of fatality', sep=' ')
ggplot(data = x) +
    geom_point(mapping=aes(x=tf,y=ti,color=EVTYPE),size=8) +
    scale_x_log10("Total number of fatalities") +
    scale_y_log10("Total number of injured people") +
    labs(title = title)

The figure shows that injuries and fatalities are correlated, so sorting the data by fatalities or by injuries leads to the same conclusion: in both cases tornadoes are by far the most important cause, which makes them the most important factor affecting health.
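
The visual impression of correlation could also be quantified. The sketch below computes a Spearman rank correlation with base R's `cor()` on made-up totals; the numbers are illustrative and are not the values from the storm data.

```r
# Toy per-event totals (invented values, not from the dataset).
toy <- data.frame(
  tf = c(100, 40, 25, 10, 5),
  ti = c(900, 300, 350, 60, 20)
)

# Rank correlation: insensitive to the log scaling used in the figure.
rho <- cor(toy$tf, toy$ti, method = "spearman")
print(rho)
```

A rank-based measure is a reasonable choice here because the totals span several orders of magnitude and a single event type dominates both axes.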

Effects on the economy: property damages

Because we cannot expect crop damage to track property damage, we analyze the two separately. We sort the totals by PROPDMG, breaking ties by CROPDMG, then FATALITIES, then INJURIES, and again keep the first 10 rows; these are the 10 most important causes of property damage. The following figure plots total crop damage against total property damage, as summed from the CROPDMG and PROPDMG fields, on logarithmic axes.

x<-head(arrange(s,desc(tpd),desc(tcd),desc(tf),desc(ti)),n=n)
title = paste(
    'Total damage to crops vs total damage to property for the\n',
    n, 'most important causes of property damage', sep=' ')
ggplot(data = x) +
    geom_point(mapping=aes(x=tpd,y=tcd,color=EVTYPE),size=8) +
    scale_x_log10("Total property damages ($)") +
    scale_y_log10("Total crop damages ($)") +
    labs(title = title)

Effects on the economy: crop damages

We now sort the totals by CROPDMG, breaking ties by PROPDMG, then FATALITIES, then INJURIES, and keep the first 10 rows; these are the 10 most important causes of crop damage. The following figure plots total property damage against total crop damage, as summed from the PROPDMG and CROPDMG fields, on logarithmic axes.

x<-head(arrange(s,desc(tcd),desc(tpd),desc(tf),desc(ti)),n=n)
title = paste(
    'Total damage to property vs total damage to crop for the\n',
    n, 'most important causes of crop damage', sep=' ')
ggplot(data = x) +
    geom_point(mapping=aes(x=tcd,y=tpd,color=EVTYPE),size=8) +
    scale_x_log10("Total crop damages ($)") +
    scale_y_log10("Total property damages ($)") +
    labs(title = title)
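
The three rankings above repeat the same sort-and-truncate step with a different leading column. A minimal sketch of a helper that captures the pattern in base R follows; the name `top_by` is hypothetical, and unlike the code above it orders by a single column and does not break ties on the other measures.

```r
# Hypothetical helper: the n rows with the largest value in one column.
top_by <- function(totals, column, n = 10) {
  ranked <- totals[order(totals[[column]], decreasing = TRUE), ]
  head(ranked, n)
}

# Toy summary table (invented values).
toy <- data.frame(
  EVTYPE = c("A", "B", "C", "D"),
  tpd = c(5, 50, 20, 1),
  tcd = c(9, 2, 30, 4),
  stringsAsFactors = FALSE
)

top_prop <- top_by(toy, "tpd", n = 2)
top_crop <- top_by(toy, "tcd", n = 2)
print(top_prop)
print(top_crop)
```

With the summary table from this report, a call such as `top_by(s, "tpd", n)` would play the role of the `head(arrange(...))` expressions, up to the handling of ties.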

Results

Tornadoes are by far the most important cause of both fatalities and injuries; we therefore conclude that tornadoes are the most important events affecting health. The second most important event type affecting health is excessive heat; other important event types are lightning, floods, heat, thunderstorm wind (TSTM WIND) and flash floods.

For the economy we need to distinguish between property damage and crop damage. For property damage the most important cause is by far tornadoes, followed by thunderstorm wind and flash floods. For crop damage the most important cause is hail, followed by floods and flash floods; tornadoes are important too. We should also note that the total for property damage is about an order of magnitude greater than the total for crop damage (millions versus hundreds of thousands).

The final conclusion is that tornadoes are the most important cause of damage to both health and the economy, followed by excessive heat (for health) and flash floods (for both property and crops); for crops it is also important to address the problem of hail.