Synopsis

This is a second course project for Reproducible Research course which is part of the Coursera’s Data Science Specialization.

Severe weather has serious economic and health impacts, causing property damage, crop damage, injury and even death. The purpose of the project was to determine which storm event(s) had the most significant economic and health effects.

This project involves exploring and analysing the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The analysis of the data revealed that floods cause the greatest economic impact as measured by property damage and crop damage. The analysis also indicated that tornadoes have the greatest health impact as measured by the number of injuries and fatalities.

Statement of the problem

This data analysis will address the following questions:

  1. Across the United States, which types of events have the greatest economic consequences?

  2. Across the United States, which types of events are most harmful with respect to population health?

Objectives

The main objective of this project is to determine which storm event(s) have the most significant economic and health effects.

  • To identify the weather events that have the greatest economic consequences.

  • To identify the weather events that are most harmful with respect to population health.

Data Processing

Load Libraries and prepare the R environment

This project used these libraries for data processing and analysis:

library(ggplot2)  
library(plyr) 
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(reshape2)
library(knitr)

Data

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site: storm data[47Mb]

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Loading the data

Uncomment the code below to download and unzip the data. The data was downloaded from the link above and saved to current working directory. If object strom_df is already loaded, use that cached object instead of loading it each time where the Rmd file is knitted.

#URL <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
#destfile <- "repdata_data_StormData.csv.bz2"
#download.file(URL, destfile)

if(!exists("storm_df")){
        storm_df <- read.csv(bzfile("repdata_data_StormData.csv.bz2"))
}

Examine the data set

In storm_df there is 37 columns (variables) and 902,297 rows (observations).

dim(storm_df)
## [1] 902297     37

Examine the structure of the data

str(storm_df)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29601 levels "","5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13513 1873 4598 10592 4372 10094 1973 23873 24418 4598 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 35 levels "","  N"," NW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_LOCATI: Factor w/ 54429 levels "","- 1 N Albion",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_DATE  : Factor w/ 6663 levels "","1/1/1993 0:00:00",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_TIME  : Factor w/ 3647 levels ""," 0900CST",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 24 levels "","E","ENE","ESE",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ END_LOCATI: Factor w/ 34506 levels "","- .5 NNW",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ WFO       : Factor w/ 542 levels ""," CI","$AC",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ STATEOFFIC: Factor w/ 250 levels "","ALABAMA, Central",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ZONENAMES : Factor w/ 25112 levels "","                                                                                                               "| __truncated__,..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436781 levels "","-2 at Deer Park\n",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Extracting variables of interest

From a list of variables in storm data set, the following are columns (variables) of interest:

Health variables:

  • FATALITIES: approx. number of deaths
  • INJURIES: approx. number of injuries

Economic variables:

  • PROPDMG: approx. property damages
  • PROPDMGEXP: the units for property damage value
  • CROPDMG: approx. crop damages
  • CROPDMGEXP: the units for crop damage value

Events - target variable:

  • EVTYPE: weather event (Tornados, Wind, Snow, Flood, etc..)

Extract variables of interest from original data set:

storm_data <- subset(storm_df, select = c("EVTYPE","FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"))

Checking for missing values

In every analysis we must check number of missing values in variables.

count_NAs <- function(df) {
        for (col_Name in colnames(df)) {
                NA_count <- 0
                NA_count < as.numeric(sum(is.na(df[,col_Name])))
                if(NA_count > 0) {
                        message(col_Name, ":", NA_count, "missing values")
                        } else {
                        message(col_Name, ":", "No NAs")
                        }
                }
}
count_NAs(storm_data)
## EVTYPE:No NAs
## FATALITIES:No NAs
## INJURIES:No NAs
## PROPDMG:No NAs
## PROPDMGEXP:No NAs
## CROPDMG:No NAs
## CROPDMGEXP:No NAs

There is no missing values (no NAS), so moving on to examine data integrity.

Transforming Property and Crop Damage Variables

There are 985 unique event types in EVTYPE column. Therefore, it is better to limit database to a reasonable number. We can make it by capitalizing all letters in EVTYPE column as well as subsetting only non-zero data regarding our target numbers.

## cleaning event type names
storm_data$EVTYPE <- toupper(storm_data$EVTYPE)
## eliminating zero data
storm_data <- storm_data[storm_data$FATALITIES !=0 | 
                       storm_data$INJURIES !=0 | 
                       storm_data$PROPDMG !=0 | 
                       storm_data$CROPDMG !=0, ]
length(unique(storm_data$EVTYPE))
## [1] 447

Now we have 447 unique event types and it seems like something to work with.

To determine economic consequences of storms, we need to calculate the total amount of damage in US Dollars. The data provided one column indicating the amount and another indicating the unit: “K” for thousands,“M” for millions, or “B” for billions. There is some mess in units, so we transform those variables in one unit (dollar) variable by the following rule:

  • K or k: thousand dollars (10^3)
  • M or m: million dollars (10^6)
  • B or b: billion dollars (10^9)
  • the rest would be consider as dollars

We need to combine these two columns into a single column representing the total cost dollars for each observation.

storm_data$PROPDMGEXP <- as.character(storm_data$PROPDMGEXP)
storm_data$PROPDMGEXP[is.na(storm_data$PROPDMGEXP)] <- 0 # NA's considered as dollars
storm_data$PROPDMGEXP[!grepl("K|M|B", storm_data$PROPDMGEXP, ignore.case = TRUE)] <- 0 # everything exept K,M,B is dollar
storm_data$PROPDMGEXP[grep("K", storm_data$PROPDMGEXP, ignore.case = TRUE)] <- "3"
storm_data$PROPDMGEXP[grep("M", storm_data$PROPDMGEXP, ignore.case = TRUE)] <- "6"
storm_data$PROPDMGEXP[grep("B", storm_data$PROPDMGEXP, ignore.case = TRUE)] <- "9"
storm_data$PROPDMGEXP <- as.numeric(as.character(storm_data$PROPDMGEXP))
storm_data$PROPDMG_DOL <- storm_data$PROPDMG * 10^storm_data$PROPDMGEXP

storm_data$CROPDMGEXP <- as.character(storm_data$CROPDMGEXP)
storm_data$CROPDMGEXP[is.na(storm_data$CROPDMGEXP)] <- 0 # NA's considered as dollars
storm_data$CROPDMGEXP[!grepl("K|M|B", storm_data$CROPDMGEXP, ignore.case = TRUE)] <- 0 # everything exept K,M,B is dollar
storm_data$CROPDMGEXP[grep("K", storm_data$CROPDMGEXP, ignore.case = TRUE)] <- "3"
storm_data$CROPDMGEXP[grep("M", storm_data$CROPDMGEXP, ignore.case = TRUE)] <- "6"
storm_data$CROPDMGEXP[grep("B", storm_data$CROPDMGEXP, ignore.case = TRUE)] <- "9"
storm_data$CROPDMGEXP <- as.numeric(as.character(storm_data$CROPDMGEXP))
storm_data$CROPDMG_DOL <- storm_data$CROPDMG * 10^storm_data$CROPDMGEXP
head(storm_data)
##    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO          0       15    25.0          3       0          0
## 2 TORNADO          0        0     2.5          3       0          0
## 3 TORNADO          0        2    25.0          3       0          0
## 4 TORNADO          0        2     2.5          3       0          0
## 5 TORNADO          0        2     2.5          3       0          0
## 6 TORNADO          0        6     2.5          3       0          0
##   PROPDMG_DOL CROPDMG_DOL
## 1       25000           0
## 2        2500           0
## 3       25000           0
## 4        2500           0
## 5        2500           0
## 6        2500           0

Analysis

Aggregating events for economic variables

Now we aggregate property and crop damage numbers in order to identify TOP-10 events contributing the total economic loss:

Economic_data <- aggregate(cbind(PROPDMG_DOL, CROPDMG_DOL) ~ EVTYPE, data = storm_data, FUN=sum)
Economic_data$ECONOMIC_LOSS <- Economic_data$PROPDMG_DOL + Economic_data$CROPDMG_DOL
Economic_data <- Economic_data[order(Economic_data$ECONOMIC_LOSS, decreasing = TRUE), ]
Top10_events_economy <- Economic_data[1:10,]
knitr::kable(Top10_events_economy, format = "markdown")
EVTYPE PROPDMG_DOL CROPDMG_DOL ECONOMIC_LOSS
78 FLOOD 144657709807 5661968450 150319678257
203 HURRICANE/TYPHOON 69305840000 2607872800 71913712800
371 TORNADO 56937160779 414953270 57352114049
317 STORM SURGE 43323536000 5000 43323541000
116 HAIL 15732267048 3025954473 18758221521
65 FLASH FLOOD 16140812067 1421317100 17562129167
43 DROUGHT 1046106000 13972566000 15018672000
194 HURRICANE 11868319010 2741910000 14610229010
280 RIVER FLOOD 5118945500 5029459000 10148404500
216 ICE STORM 3944927860 5022113500 8967041360

Aggregating events for public health variables

We aggregate fatalities and injuries numbers in order to identify TOP-10 events contributing the total people loss:

Health_data <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = storm_data, FUN=sum)
Health_data$TOTAL <- Health_data$FATALITIES + Health_data$INJURIES
Health_data <- Health_data[order(Health_data$TOTAL, decreasing = TRUE), ]
Top10_events_people <- Health_data[1:10,]
knitr::kable(Top10_events_people, format = "markdown")
EVTYPE FATALITIES INJURIES TOTAL
371 TORNADO 5633 91346 96979
54 EXCESSIVE HEAT 1903 6525 8428
386 TSTM WIND 504 6957 7461
78 FLOOD 470 6789 7259
233 LIGHTNING 816 5230 6046
133 HEAT 937 2100 3037
65 FLASH FLOOD 978 1777 2755
216 ICE STORM 89 1975 2064
329 THUNDERSTORM WIND 133 1488 1621
441 WINTER STORM 206 1321 1527

Results

Economic Effects of Storm Events

# aggregate PROPDMG_DOL, CROPDMG_DOL and ECONOMIC_LOSS by Event Type

Top10_events_economy2<- melt(Top10_events_economy, id=c("EVTYPE", "ECONOMIC_LOSS"), measure.vars=c("PROPDMG_DOL","CROPDMG_DOL"))

## transform EVTYPE to factor variable for economic variables
Top10_events_economy2$EVTYPE <- as.factor(Top10_events_economy2$EVTYPE)
## plotting economic loss
ggplot(data = Top10_events_economy2, aes(reorder(EVTYPE, ECONOMIC_LOSS), value, fill=variable)) +
    geom_bar(stat = "identity")+
    ggtitle("Crop and Property Damage by Weather Events")+
    theme(plot.title = element_text(hjust = 0.5))+
    labs(y = "Property and Crop Loss", x = "Event Type") +
    coord_flip()

The result of economic impact from the graph revealed that FLOOD, HURRICANE/TYPHOON and TORNADO are the main contributors to severe economic consequences out of all event types of weather events. Thus, the most devastating weather event with the greatest economic loss (property and crop damage) is a flood.

Health Effects of Storm Events

# aggregate FATALITIES, INJURIES and TOTAL by EVTYPE

Top10_events_people2<- melt(Top10_events_people, id=c("EVTYPE", "TOTAL"), measure.vars=c("FATALITIES","INJURIES"))

## transform EVTYPE to factor variable for health variables
Top10_events_people2$EVTYPE <- as.factor(Top10_events_people2$EVTYPE)
## plot FATALITIES and INJURIES by Event Type 
ggplot(data = Top10_events_people2, aes(x = reorder(EVTYPE, TOTAL), y = value, fill = variable)) +
    geom_bar(stat = "identity") +
    theme(plot.title = element_text(hjust = 0.5)) +
    ggtitle("Injuries and Fatalities by Weather Event") +
    labs(y = "Number of fatalities and injuries", x = "Event Type") +
    coord_flip()

The result of population health impact from the graph show that TORNADOS, EXCESSIVE HEAT and TSTM WIND are the main contributors to deaths and injuries out of all event types of weather events. Thus, the most harmful weather event for health (in number of total fatalites and injuries) is, by far, a tornado.

Conclusion

The analysis of the data revealed that floods cause the greatest economic impact as measured by property damage and crop damage. The analysis also indicated that tornadoes have the greatest health impact as measured by the number of injuries and fatalities.

floods caused the most economic damage, with over $150 billion dollars in economic loss (property and crop damage). Tornadoes are the most dangerous having caused 5,633 fatalities and 91,346 injuries from 1950 to 2011.