Health and Economic Implications of Storms in US

Synopsis

This is a second course project for Reproducible Research course which is part of the Coursera’s Data Science Specialization.

Severe weather has serious economic and health impacts, causing property damage, crop damage, injury and even death. The purpose of the project was to determine which storm event(s) had the most significant economic and health effects.

This project involves exploring and analysing the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The analysis of the data revealed that floods cause the greatest economic impact as measured by property damage and crop damage. The analysis also indicated that tornadoes have the greatest health impact as measured by the number of injuries and fatalities.

Statement of the problem

This data analysis will address the following questions:

Across the United States, which types of events have the greatest economic consequences?
Across the United States, which types of events are most harmful with respect to population health?

Objectives

The main objective of this project is to determine which storm event(s) have the most significant economic and health effects.

To identify the weather events that have the greatest economic consequences.
To identify the weather events that are most harmful with respect to population health.

Data Processing

Load Libraries and prepare the R environment

This project used these libraries for data processing and analysis:

library(ggplot2)  
library(plyr) 
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(reshape2)
library(knitr)

Data

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site: storm data[47Mb]

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Loading the data

Uncomment the code below to download and unzip the data. The data was downloaded from the link above and saved to current working directory. If object strom_df is already loaded, use that cached object instead of loading it each time where the Rmd file is knitted.

#URL <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
#destfile <- "repdata_data_StormData.csv.bz2"
#download.file(URL, destfile)

if(!exists("storm_df")){
        storm_df <- read.csv(bzfile("repdata_data_StormData.csv.bz2"))
}

Examine the data set

In storm_df there is 37 columns (variables) and 902,297 rows (observations).

dim(storm_df)

## [1] 902297     37

Examine the structure of the data

str(storm_df)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Extracting variables of interest

From a list of variables in storm data set, the following are columns (variables) of interest:

Health variables:

FATALITIES: approx. number of deaths
INJURIES: approx. number of injuries

Economic variables:

PROPDMG: approx. property damages
PROPDMGEXP: the units for property damage value
CROPDMG: approx. crop damages
CROPDMGEXP: the units for crop damage value

Events - target variable:

EVTYPE: weather event (Tornados, Wind, Snow, Flood, etc..)

Extract variables of interest from original data set:

storm_data <- subset(storm_df, select = c("EVTYPE","FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"))

Checking for missing values

In every analysis we must check number of missing values in variables.

count_NAs <- function(df) {
        for (col_Name in colnames(df)) {
                NA_count <- 0
                NA_count < as.numeric(sum(is.na(df[,col_Name])))
                if(NA_count > 0) {
                        message(col_Name, ":", NA_count, "missing values")
                        } else {
                        message(col_Name, ":", "No NAs")
                        }
                }
}
count_NAs(storm_data)

## EVTYPE:No NAs

## FATALITIES:No NAs

## INJURIES:No NAs

## PROPDMG:No NAs

## PROPDMGEXP:No NAs

## CROPDMG:No NAs

## CROPDMGEXP:No NAs

There is no missing values (no NAS), so moving on to examine data integrity.

Transforming Property and Crop Damage Variables

There are 985 unique event types in EVTYPE column. Therefore, it is better to limit database to a reasonable number. We can make it by capitalizing all letters in EVTYPE column as well as subsetting only non-zero data regarding our target numbers.

## cleaning event type names
storm_data$EVTYPE <- toupper(storm_data$EVTYPE)
## eliminating zero data
storm_data <- storm_data[storm_data$FATALITIES !=0 | 
                       storm_data$INJURIES !=0 | 
                       storm_data$PROPDMG !=0 | 
                       storm_data$CROPDMG !=0, ]
length(unique(storm_data$EVTYPE))

## [1] 447

Now we have 447 unique event types and it seems like something to work with.

To determine economic consequences of storms, we need to calculate the total amount of damage in US Dollars. The data provided one column indicating the amount and another indicating the unit: “K” for thousands,“M” for millions, or “B” for billions. There is some mess in units, so we transform those variables in one unit (dollar) variable by the following rule:

K or k: thousand dollars (10^3)
M or m: million dollars (10^6)
B or b: billion dollars (10^9)
the rest would be consider as dollars

We need to combine these two columns into a single column representing the total cost dollars for each observation.

storm_data$PROPDMGEXP <- as.character(storm_data$PROPDMGEXP)
storm_data$PROPDMGEXP[is.na(storm_data$PROPDMGEXP)] <- 0 # NA's considered as dollars
storm_data$PROPDMGEXP[!grepl("K|M|B", storm_data$PROPDMGEXP, ignore.case = TRUE)] <- 0 # everything exept K,M,B is dollar
storm_data$PROPDMGEXP[grep("K", storm_data$PROPDMGEXP, ignore.case = TRUE)] <- "3"
storm_data$PROPDMGEXP[grep("M", storm_data$PROPDMGEXP, ignore.case = TRUE)] <- "6"
storm_data$PROPDMGEXP[grep("B", storm_data$PROPDMGEXP, ignore.case = TRUE)] <- "9"
storm_data$PROPDMGEXP <- as.numeric(as.character(storm_data$PROPDMGEXP))
storm_data$PROPDMG_DOL <- storm_data$PROPDMG * 10^storm_data$PROPDMGEXP

storm_data$CROPDMGEXP <- as.character(storm_data$CROPDMGEXP)
storm_data$CROPDMGEXP[is.na(storm_data$CROPDMGEXP)] <- 0 # NA's considered as dollars
storm_data$CROPDMGEXP[!grepl("K|M|B", storm_data$CROPDMGEXP, ignore.case = TRUE)] <- 0 # everything exept K,M,B is dollar
storm_data$CROPDMGEXP[grep("K", storm_data$CROPDMGEXP, ignore.case = TRUE)] <- "3"
storm_data$CROPDMGEXP[grep("M", storm_data$CROPDMGEXP, ignore.case = TRUE)] <- "6"
storm_data$CROPDMGEXP[grep("B", storm_data$CROPDMGEXP, ignore.case = TRUE)] <- "9"
storm_data$CROPDMGEXP <- as.numeric(as.character(storm_data$CROPDMGEXP))
storm_data$CROPDMG_DOL <- storm_data$CROPDMG * 10^storm_data$CROPDMGEXP

head(storm_data)

##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          3       0          0
## 2 TORNADO          0        0     2.5          3       0          0
## 3 TORNADO          0        2    25.0          3       0          0
## 4 TORNADO          0        2     2.5          3       0          0
## 5 TORNADO          0        2     2.5          3       0          0
## 6 TORNADO          0        6     2.5          3       0          0
##   PROPDMG_DOL CROPDMG_DOL
## 1       25000           0
## 2        2500           0
## 3       25000           0
## 4        2500           0
## 5        2500           0
## 6        2500           0

Analysis

Aggregating events for economic variables

Now we aggregate property and crop damage numbers in order to identify TOP-10 events contributing the total economic loss:

Economic_data <- aggregate(cbind(PROPDMG_DOL, CROPDMG_DOL) ~ EVTYPE, data = storm_data, FUN=sum)
Economic_data$ECONOMIC_LOSS <- Economic_data$PROPDMG_DOL + Economic_data$CROPDMG_DOL
Economic_data <- Economic_data[order(Economic_data$ECONOMIC_LOSS, decreasing = TRUE), ]
Top10_events_economy <- Economic_data[1:10,]
knitr::kable(Top10_events_economy, format = "markdown")

	EVTYPE	PROPDMG_DOL	CROPDMG_DOL	ECONOMIC_LOSS
78	FLOOD	144657709807	5661968450	150319678257
203	HURRICANE/TYPHOON	69305840000	2607872800	71913712800
371	TORNADO	56937160779	414953270	57352114049
317	STORM SURGE	43323536000	5000	43323541000
116	HAIL	15732267048	3025954473	18758221521
65	FLASH FLOOD	16140812067	1421317100	17562129167
43	DROUGHT	1046106000	13972566000	15018672000
194	HURRICANE	11868319010	2741910000	14610229010
280	RIVER FLOOD	5118945500	5029459000	10148404500
216	ICE STORM	3944927860	5022113500	8967041360

Aggregating events for public health variables

We aggregate fatalities and injuries numbers in order to identify TOP-10 events contributing the total people loss:

Health_data <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = storm_data, FUN=sum)
Health_data$TOTAL <- Health_data$FATALITIES + Health_data$INJURIES
Health_data <- Health_data[order(Health_data$TOTAL, decreasing = TRUE), ]
Top10_events_people <- Health_data[1:10,]
knitr::kable(Top10_events_people, format = "markdown")

	EVTYPE	FATALITIES	INJURIES	TOTAL
371	TORNADO	5633	91346	96979
54	EXCESSIVE HEAT	1903	6525	8428
386	TSTM WIND	504	6957	7461
78	FLOOD	470	6789	7259
233	LIGHTNING	816	5230	6046
133	HEAT	937	2100	3037
65	FLASH FLOOD	978	1777	2755
216	ICE STORM	89	1975	2064
329	THUNDERSTORM WIND	133	1488	1621
441	WINTER STORM	206	1321	1527

Results

Economic Effects of Storm Events

# aggregate PROPDMG_DOL, CROPDMG_DOL and ECONOMIC_LOSS by Event Type

Top10_events_economy2<- melt(Top10_events_economy, id=c("EVTYPE", "ECONOMIC_LOSS"), measure.vars=c("PROPDMG_DOL","CROPDMG_DOL"))

## transform EVTYPE to factor variable for economic variables
Top10_events_economy2$EVTYPE <- as.factor(Top10_events_economy2$EVTYPE)

## plotting economic loss
ggplot(data = Top10_events_economy2, aes(reorder(EVTYPE, ECONOMIC_LOSS), value, fill=variable)) +
    geom_bar(stat = "identity")+
    ggtitle("Crop and Property Damage by Weather Events")+
    theme(plot.title = element_text(hjust = 0.5))+
    labs(y = "Property and Crop Loss", x = "Event Type") +
    coord_flip()

The result of economic impact from the graph revealed that FLOOD, HURRICANE/TYPHOON and TORNADO are the main contributors to severe economic consequences out of all event types of weather events. Thus, the most devastating weather event with the greatest economic loss (property and crop damage) is a flood.

Health Effects of Storm Events

# aggregate FATALITIES, INJURIES and TOTAL by EVTYPE

Top10_events_people2<- melt(Top10_events_people, id=c("EVTYPE", "TOTAL"), measure.vars=c("FATALITIES","INJURIES"))

## transform EVTYPE to factor variable for health variables
Top10_events_people2$EVTYPE <- as.factor(Top10_events_people2$EVTYPE)

## plot FATALITIES and INJURIES by Event Type 
ggplot(data = Top10_events_people2, aes(x = reorder(EVTYPE, TOTAL), y = value, fill = variable)) +
    geom_bar(stat = "identity") +
    theme(plot.title = element_text(hjust = 0.5)) +
    ggtitle("Injuries and Fatalities by Weather Event") +
    labs(y = "Number of fatalities and injuries", x = "Event Type") +
    coord_flip()

The result of population health impact from the graph show that TORNADOS, EXCESSIVE HEAT and TSTM WIND are the main contributors to deaths and injuries out of all event types of weather events. Thus, the most harmful weather event for health (in number of total fatalites and injuries) is, by far, a tornado.

Conclusion

floods caused the most economic damage, with over $150 billion dollars in economic loss (property and crop damage). Tornadoes are the most dangerous having caused 5,633 fatalities and 91,346 injuries from 1950 to 2011.