Exploratory analysis on the impact of weather events to public health and economics in the United States from 1950 to 2011

Lou Marvin Caraig

Synopsis

The analysis takes in account the U.S. National Oceanic and Atmospheric Administration’s (NOAA) database that tracks the effects of wheather events in terms of fatalities and injuries (public health damages) and in terms of property and crop damages (economical damages).

The purpose of the analsys is to address the following questions:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

The analysis starts by retrieving the total numbers of fatalities, injuries, property damages and crop damages for each weather event across the United States. After that the same analysis is performed but also grouping by region (West, Midwest, South, NorthEast) to verify if the results are different.

Data processing

The data comes from U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database which tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, property damages and crop damages. The data are obtained from here.

For the data processing we use dplyr and tidyr, for the data visualization we use ggplot2 and gridExtra and knitr for printing tables.

library(dplyr, warn.conflicts=FALSE)
library(tidyr)
library(ggplot2)
library(gridExtra)
library(knitr)

Downloading and reading data

First of all we download the data from the above provided link and saved it in the current directory by naming it stormData.csv.bz2. Now we can load the data (it may take a long time for loading it).

rawDt <- read.csv('stormData.csv.bz2')

Preprocessing

This dataset has a lot of columns so we first drop those we are not interested in, and we rename the remaining ones.

dt <- rawDt %>%
    select(state=STATE, evtype=EVTYPE,
           fatalities=FATALITIES, injuries=INJURIES,
           propertyDamage=PROPDMG, propertyDamangeExp=PROPDMGEXP,
           cropDamage=CROPDMG, cropDamageExp=CROPDMGEXP)

The values of propertyDamage and cropDamage are to be multiplied by a factor given by propertyDamageExp and cropDamageExp respectively. The multiplier factor can be one of k, m and b (case-insensitive) which mean *10^3, *10^6 and *10^9 resepectively.

calcMultiplier <- function(multipliers) {
    sapply(multipliers, function(v) {
        ifelse(v %in% c('k', 'K'), 10**3,
        ifelse(v %in% c('m', 'M'), 10**6,
        ifelse(v %in% c('b', 'B'), 10**9, 0)))})
}

dt <- dt %>%
    mutate(propertyDamangeExp=calcMultiplier(propertyDamangeExp),
           cropDamageExp=calcMultiplier(cropDamageExp),
           propertyDamage=propertyDamage * propertyDamangeExp,
           cropDamage=cropDamage * cropDamageExp) %>%
    select(-propertyDamangeExp, -cropDamageExp)

We add a column region that can be one between NorthEast, Midwest, South and West that can be useful for the analysis.

calcRegion <- function(states) {
    as.factor(sapply(states, function(v) {
        switch(as.character(v),
               CT='NorthEast', ME='NorthEast', MA='NorthEast', NH='NorthEast',
               RI='NorthEast', VT='NorthEast', NJ='NorthEast', NY='NorthEast',
               PA='NorthEast',
               IL='Midwest', IN='Midwest', MI='Midwest', OH='Midwest', WI='Midwest',
               IA='Midwest', KS='Midwest', MN='Midwest', MO='Midwest', NE='Midwest',
               ND='Midwest', SD='Midwest',
               DE='South', FL='South', GA='South', MD='South', NC='South', SC='South',
               VA='South', WV='South', AL='South', KY='South', MS='South', TN='South',
               AR='South', LA='South', OK='South', TX='South',
               AZ='West', CO='West', ID='West', MT='West', NV='West', NM='West', UT='West',
               WY='West', AK='West', CA='West', HI='West', OR='West', WA='West',
               'Unknown')
    }))
}

dt <- dt %>% mutate(region=calcRegion(state))

In order to sort the event types we use a function to calculate the harmfullness score. We consider two different scores: healthHarmfulnessScore and economicHarmfulnessScore. The first one is to measure the impact to the public health and correspond to the weighted mean between the total number of fatalities weighted 10, and the total number of injuries weighted 1. The second one is to measure the impact to the economics and correspond to the weighted mean between the total property damage and the total crop damage both weighted 1 (same as mean). So we add the these columns and we keep only the observations with a score greater than zero.

calcHealthHarmfulnessScore <- function(fatalities, injuries) {
    apply(cbind(fatalities, injuries), 1, function(v) {weighted.mean(v, c(10, 1))})
}

calcEconomicHarmfulnessScore <- function(propertyDamage, cropDamage) {
    apply(cbind(propertyDamage, cropDamage), 1, function(v) {weighted.mean(v, c(1, 1))})
}

dt <- dt %>%
    mutate(healthHarmfulnessScore=calcHealthHarmfulnessScore(fatalities, injuries),
           economicHarmfulnessScore=calcEconomicHarmfulnessScore(propertyDamage, cropDamage)) %>%
           filter(healthHarmfulnessScore > 0 | economicHarmfulnessScore > 0)

Results

Analysis of the events across the United States

After the initial preprocessing we now summarize our data by grouping by evtype.

summarizeDt <- function(dt, ...) {
    dt %>%
        group_by(...) %>%
        summarize(totalFatalities=sum(fatalities), totalInjuries=sum(injuries),
                  totalPropertyDamage=sum(propertyDamage), totalCropDamage=sum(cropDamage),
                  count=n(),
                  healthHarmfulnessScore=sum(healthHarmfulnessScore),
                  economicHarmfulnessScore=sum(economicHarmfulnessScore))
}

dtSummarized <- dt %>% summarizeDt(evtype)

We can now retrieve and plot the top 10 event types in terms of harmfullness. To find the most harmfull wheather events to the public health we get those with the highest healthHarmfulnessScore while to find the most harmfull to the economics we get those with the highest economicHarmfulnessScore.

dtHealth <- dtSummarized %>%
    select(-contains('Crop'), -contains('Property'), -contains('economic')) %>%
    top_n(10, healthHarmfulnessScore) %>%
    arrange(-healthHarmfulnessScore) %>%
    mutate(evtype=factor(evtype, levels=unique(evtype)))
dtEconomic <- dtSummarized %>%
    select(-contains('Fatalities'), -contains('Injuries'), -contains('health')) %>%
    top_n(10, economicHarmfulnessScore) %>%
    arrange(-economicHarmfulnessScore) %>%
    mutate(evtype=factor(evtype, levels=unique(evtype)),
           totalPropertyDamage=totalPropertyDamage / 10**9,
           totalCropDamage=totalCropDamage / 10**9)

Before the plotting some transformations are performed in order to make the datafareme suitable for plotting a bar plot. The unit of measurement for economical damages is also changed to billions of dollars.

barPlot <- function(dt, title, xlab, ylab, legendBreaks, legendLabels) {
    ggplot(dt, aes(evtype, total, fill=harmType)) +
        geom_bar(stat='identity', position='stack') +
        theme(axis.text.x=element_text(angle=45, hjust=1), legend.position=c(1, 1),
              legend.justification=c(1, 1)) +
        scale_fill_manual(name='Harm type', values=c('#e74c3c', '#3498db'),
                          breaks=legendBreaks, labels=legendLabels) +
        xlab(xlab) + ylab(ylab) + ggtitle(title)
}

healthBarPlot <- function(dtHealth) {
    dtHealth %>%
        select(fatalities=totalFatalities, injuries=totalInjuries,
               evtype, healthHarmfulnessScore) %>%
        gather(harmType, total, -evtype, -healthHarmfulnessScore) %>%
        barPlot('Top 10 weather events most harmfull to health',
                'Event type', 'Number of persons', c('fatalities', 'injuries'),
                c('Fatalities', 'Injuries'))
}

economicBarPlot <- function(dtEconomic) {
    dtEconomicPlot <- dtEconomic %>%
        select(cropDamage=totalCropDamage, propertyDamage=totalPropertyDamage,
               evtype, economicHarmfulnessScore) %>%
        gather(harmType, total, -evtype, -economicHarmfulnessScore) %>%
        barPlot('Top 10 weather events most harmfull to economics',
                'Event type', 'Damage in billions (USD)',
                c('cropDamage', 'propertyDamage'), c('Crop damage', 'Property Damage'))
}

healhtPlot <- healthBarPlot(dtHealth)
economicPlot <- economicBarPlot(dtEconomic)

grid.arrange(healhtPlot, economicPlot, nrow=2)

Analysis of the events across the United States by region

It is interesting to see if the most harmfull events change if we perform the analysis also by grouping by region. We can perform this analysis very similary to what we have just performed. The data are slightly different as we filter out those observations whose region is unknown. For this analysis we consider the top 5 events per region.

dtSummarizedRegion <- dt %>%
    filter(region != 'Unknown') %>%
    summarizeDt(evtype, region)

dtHealthRegion <- dtSummarizedRegion %>%
    select(-contains('Crop'), -contains('Property'), -contains('economic')) %>%
    group_by(region) %>%
    top_n(5, healthHarmfulnessScore) %>%
    arrange(-healthHarmfulnessScore)

dtEconomicRegion <- dtSummarizedRegion %>%
    select(-contains('Fatalities'), -contains('Injuries'), -contains('health')) %>%
    group_by(region) %>%
    top_n(5, economicHarmfulnessScore) %>%
    arrange(-economicHarmfulnessScore) %>%
    mutate(totalPropertyDamage=totalPropertyDamage / 10**9,
           totalCropDamage=totalCropDamage / 10**9)

barPlotRegion <- function(dt, title, xlab, ylab, legendBreaks, legendLabels) {
    barPlot(dt, title, xlab, ylab, legendBreaks, legendLabels) + facet_grid(. ~ region)
}

healthBarPlotRegion <- function(dtHealthRegion) {
    dtHealthRegion %>%
        select(fatalities=totalFatalities, injuries=totalInjuries, evtype, region,
               healthHarmfulnessScore) %>%
        gather(harmType, total, -evtype, -region, -healthHarmfulnessScore) %>%
        barPlotRegion('Top 5 weather events most harmfull to health by region',
                      'Event type', 'Number of persons', c('fatalities', 'injuries'),
                      c('Fatalities', 'Injuries'))
}

economicBarPlotRegion <- function(dtEconomic) {
    dtEconomicRegion %>%
        select(cropDamage=totalCropDamage, propertyDamage=totalPropertyDamage, evtype,
               region, economicHarmfulnessScore) %>%
        gather(harmType, total, -evtype, -region, -economicHarmfulnessScore) %>%
        barPlotRegion('Top 5 most harmfull weather events to economics by region',
                      'Event type', 'Damage in billions (USD)',
                      c('cropDamage', 'propertyDamage'), c('Crop damage', 'Property Damage'))
}

healhtPlotRegion <- healthBarPlotRegion(dtHealthRegion)
economicPlotRegion <- economicBarPlotRegion(dtEconomicRegion)

grid.arrange(healhtPlotRegion, economicPlotRegion, nrow=2)

Conclusions

Impact to public health

The most harmfull weather event across the United States is TORNADO with a total number of 5633 fatalities and 91346 injuries. TORNADOs are greatly more harfmull than the other events such as EXCESSIVE HEAT which is in second place with 1903 fatalities and 6525 injuries, and LIGHTNING which is in third place with 816 fatalities and 5230 injuries. In the first plot is possible to see how much TORNADOs are cleary the most harmfull events to public health.

By grouping by region TORNADOs are the most harmfull events in South, in Midwest and in NorthEast while in West the most harmfull events are AVALANCHEs.

Here are in detail the top 5 for the Midwest, NorthEast, South and West regions in order. Each table is ordered by event type harmfullness decreasingly.

transpose <- function(dt) {
    dt.T <- t(dt[,2:ncol(dt)])
    colnames(dt.T) <- dt$evtype
    dt.T
}

printTableHealth <- function(dt, reg) {
    kable(ungroup(dt) %>%
          filter(region==reg) %>%
          arrange(-healthHarmfulnessScore) %>%
          select(evtype, Fatalities=totalFatalities, Injuries=totalInjuries) %>%
          transpose())
}

printTableHealth(dtHealthRegion, 'Midwest')

	TORNADO	EXCESSIVE HEAT	HEAT	TSTM WIND	LIGHTNING
Fatalities	1886	594	702	160	159
Injuries	30941	4232	1262	2339	866

printTableHealth(dtHealthRegion, 'NorthEast')

	TORNADO	EXCESSIVE HEAT	LIGHTNING	TSTM WIND	FLASH FLOOD
Fatalities	219	492	76	77	101
Injuries	4169	621	1048	935	257

printTableHealth(dtHealthRegion, 'South')

	TORNADO	FLOOD	EXCESSIVE HEAT	LIGHTNING	FLASH FLOOD
Fatalities	3500	206	613	431	490
Injuries	55047	6391	1096	2586	984

printTableHealth(dtHealthRegion, 'West')

	AVALANCHE	LIGHTNING	EXCESSIVE HEAT	FLASH FLOOD	WINTER STORM
Fatalities	220	140	184	161	89
Injuries	164	712	260	344	696

Impact to economics

On the other hand the type of weather event that has the highest impact on ecomics is FLOOD with a total of ~145 billion dollars of property damage and ~6 billion dollars of crop damage. FLOODs are followed by HURRICANE/TYPHOONs with ~69 billion dollars of property damage and ~3 billion dollars of crop damage, TORNADOs with ~57 billion dollars of property damage and ~415 million dollars of crop damage.

By grouping by region the scenario is different. FLOODs are the most harmfull to economics only in NorthEast and West while in Midwest and South are TORNADOs and HURRICANE/TYPHOONs respectively.

Here are in detail the top 5 for the Midwest, NorthEast, South and West regions in order. Each table is ordered by event type harmfullness decreasingly.

printTableEconomic <- function(dt, reg) {
    kable(ungroup(dt) %>%
          filter(region==reg) %>%
          arrange(-economicHarmfulnessScore) %>%
          select(evtype, `Property Damage`=totalPropertyDamage,
                 `Crop Damage`=totalCropDamage) %>%
          transpose())
}

printTableEconomic(dtEconomicRegion, 'Midwest')

	TORNADO	FLOOD	RIVER FLOOD	HAIL	FLASH FLOOD
Property Damage	22.4578686	8.806530	5.080905	6.451396	5.2035404
Crop Damage	0.1194902	3.799456	5.022584	1.800628	0.8022694

printTableEconomic(dtEconomicRegion, 'NorthEast')

	FLOOD	FLASH FLOOD	TORNADO	DROUGHT	ICE STORM
Property Damage	6.148491	4.452846	3.716512	0.000005	0.578952
Crop Damage	0.019240	0.019570	0.008224	0.719600	0.008550

printTableEconomic(dtEconomicRegion, 'South')

	HURRICANE/TYPHOON	STORM SURGE	TORNADO	HURRICANE	FLOOD
Property Damage	69.033100	43.281360	29.8685764	9.907848	10.436738
Crop Damage	2.603501	0.000005	0.2789392	2.189930	1.133034

printTableEconomic(dtEconomicRegion, 'West')

	FLOOD	HAIL	WILDFIRE	WILD/FOREST FIRE	HEAVY RAIN
Property Damage	119.1395610	4.660423	3.9964420	2.6010650	0.5739176
Crop Damage	0.6622375	0.561866	0.1249014	0.1030946	0.7068108