Reproducible Research Course Project 2

Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Assignment

The goal of this analysis is to use the NOAA Storm Database to answer some basic questions about severe weather events.

This analysis will address the following questions:

Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

Data

The data for this project come as a 47 MB file compressed with the bzip2 algorithm to reduce its size. It can be downloaded from the course web site:

Storm Data

Data Downloading and Processing

# downloading data
Url_data <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

File_data <- "StormData.csv.bz2"
if (!file.exists(File_data)) {
        download.file(Url_data, File_data, mode = "wb")
}

# reading data
rawdata <- read.csv(file = File_data, header=TRUE, sep=",")

The events in the database start in 1950 and end in November 2011. The earlier years generally contain fewer recorded events, most likely due to a lack of good records, so more recent years should be considered more complete. Because NOAA began recording all event types in January 1996, the dataset is filtered to remove events recorded before that date.

# subsetting by date
data <- rawdata
# parse the date part only; as.Date avoids storing a POSIXlt list in the data frame
data$BGN_DATE <- as.Date(rawdata$BGN_DATE, format = "%m/%d/%Y")
data <- subset(data, BGN_DATE >= as.Date("1996-01-01"))
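The claim that earlier years hold fewer records can be checked by counting events per year before choosing a cutoff. The following is a minimal sketch on a handful of made-up dates (the vector `bgn_date` is a hypothetical stand-in for `rawdata$BGN_DATE`, assumed to use the same "month/day/year time" text layout); it is not the code used to produce this report.

```r
# Toy stand-in for rawdata$BGN_DATE (values are invented for illustration).
bgn_date <- c("4/18/1950 0:00:00", "6/8/1951 0:00:00",
              "1/6/1996 0:00:00", "2/11/1996 0:00:00",
              "11/28/2011 0:00:00")

# Parse only the date part; text after the matched format is ignored.
event_date <- as.Date(bgn_date, format = "%m/%d/%Y")

# Count events per calendar year to see how coverage changes over time.
events_per_year <- table(format(event_date, "%Y"))
print(events_per_year)

# Keep only events that began on or after 1 January 1996.
recent <- event_date[event_date >= as.Date("1996-01-01")]
print(length(recent))
```

On the full dataset, the same `table()` call would show how the yearly counts change before and after 1996.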

Exploring the dataset, there are seven variables that pertain to the two questions. They are:

EVTYPE - type of event
FATALITIES - number of fatalities
INJURIES - number of injuries
PROPDMG - the size of property damage
PROPDMGEXP - the exponent values for ‘PROPDMG’ (property damage)
CROPDMG - the size of crop damage
CROPDMGEXP - the exponent values for ‘CROPDMG’ (crop damage)

data <- subset(data, select = c(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))

To further reduce the data size and make the analysis more efficient, event type names are converted to upper case, and events with no recorded fatalities, injuries, property damage, or crop damage are removed:

# cleaning event type names
data$EVTYPE <- toupper(data$EVTYPE)

# eliminating zero data
data <- data[data$FATALITIES !=0 | 
                       data$INJURIES !=0 | 
                       data$PROPDMG !=0 | 
                       data$CROPDMG !=0, ]
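The filter above keeps a row when at least one of the four impact columns is non-zero. A minimal sketch of the same rule on a made-up data frame, written with `rowSums()` as an equivalent formulation (the names `toy`, `impact_cols`, `has_impact`, and `kept` are hypothetical and the values are invented):

```r
# Toy events: two have no recorded impact at all.
toy <- data.frame(
  EVTYPE     = c("HAIL", "FLOOD", "TORNADO", "HIGH WIND"),
  FATALITIES = c(0, 0, 2, 0),
  INJURIES   = c(0, 0, 5, 0),
  PROPDMG    = c(0, 25, 10, 0),
  CROPDMG    = c(0, 0, 0, 0)
)

impact_cols <- c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")

# TRUE for rows where any impact column is non-zero.
has_impact <- rowSums(toy[, impact_cols] != 0) > 0

kept <- toy[has_impact, ]
print(kept$EVTYPE)
```

Only the FLOOD and TORNADO rows survive, since HAIL and HIGH WIND have zeros in every impact column.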

Processing Population Health Data

Data on fatalities and injuries is aggregated to allow the top ten events to be identified:

healthdata <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = data, FUN=sum)
healthdata$PEOPLE_LOSS <- healthdata$FATALITIES + healthdata$INJURIES
healthdata <- healthdata[order(healthdata$PEOPLE_LOSS, decreasing = TRUE), ]
Top10_events_people <- healthdata[1:10,]
knitr::kable(Top10_events_people, format = "markdown")

	EVTYPE	FATALITIES	INJURIES	PEOPLE_LOSS
149	TORNADO	1511	20667	22178
39	EXCESSIVE HEAT	1797	6391	8188
48	FLOOD	414	6758	7172
107	LIGHTNING	651	4141	4792
153	TSTM WIND	241	3629	3870
46	FLASH FLOOD	887	1674	2561
146	THUNDERSTORM WIND	130	1400	1530
182	WINTER STORM	191	1292	1483
69	HEAT	237	1222	1459
88	HURRICANE/TYPHOON	64	1275	1339
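The ranking depends on which measure is used. As a small sketch, the first rows of the table above can be re-sorted by fatalities alone instead of by the combined total (the data frame `top` simply re-enters published values from the table; the names `top`, `by_total`, and `by_fatal` are hypothetical):

```r
# Values copied from the top-ten table above.
top <- data.frame(
  EVTYPE     = c("TORNADO", "EXCESSIVE HEAT", "FLOOD", "LIGHTNING", "FLASH FLOOD"),
  FATALITIES = c(1511, 1797, 414, 651, 887),
  INJURIES   = c(20667, 6391, 6758, 4141, 1674)
)

# Rank by combined fatalities and injuries, as in the report.
by_total <- top[order(top$FATALITIES + top$INJURIES, decreasing = TRUE), ]

# Rank by fatalities only.
by_fatal <- top[order(top$FATALITIES, decreasing = TRUE), ]

print(as.character(by_total$EVTYPE[1]))
print(as.character(by_fatal$EVTYPE[1]))
```

Tornadoes lead on the combined measure, while excessive heat leads on fatalities alone. Every event type outside the top ten has a combined total below 1,339, so none of them can exceed the 1,797 fatalities recorded for excessive heat.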

Processing Economic Loss Data

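The exponent codes in the two columns PROPDMGEXP and CROPDMGEXP (for example H, K, M, and B for hundreds, thousands, millions, and billions) are converted to numeric powers of ten, so that the damage amounts can be computed as plain numbers and sorted: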

data$PROPDMGEXP <- gsub("[Hh]", "2", data$PROPDMGEXP)
data$PROPDMGEXP <- gsub("[Kk]", "3", data$PROPDMGEXP)
data$PROPDMGEXP <- gsub("[Mm]", "6", data$PROPDMGEXP)
data$PROPDMGEXP <- gsub("[Bb]", "9", data$PROPDMGEXP)
data$PROPDMGEXP <- gsub("\\+", "1", data$PROPDMGEXP)
data$PROPDMGEXP <- gsub("\\?|\\-|\\ ", "0",  data$PROPDMGEXP)
data$PROPDMGEXP <- as.numeric(data$PROPDMGEXP)

data$CROPDMGEXP <- gsub("[Hh]", "2", data$CROPDMGEXP)
data$CROPDMGEXP <- gsub("[Kk]", "3", data$CROPDMGEXP)
data$CROPDMGEXP <- gsub("[Mm]", "6", data$CROPDMGEXP)
data$CROPDMGEXP <- gsub("[Bb]", "9", data$CROPDMGEXP)
data$CROPDMGEXP <- gsub("\\+", "1", data$CROPDMGEXP)
data$CROPDMGEXP <- gsub("\\-|\\?|\\ ", "0", data$CROPDMGEXP)
data$CROPDMGEXP <- as.numeric(data$CROPDMGEXP)

data$PROPDMGEXP[is.na(data$PROPDMGEXP)] <- 0
data$CROPDMGEXP[is.na(data$CROPDMGEXP)] <- 0

library(dplyr)

data <- mutate(data, 
                    PROPDMGTOTAL = PROPDMG * (10 ^ PROPDMGEXP), 
                    CROPDMGTOTAL = CROPDMG * (10 ^ CROPDMGEXP))
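The arithmetic behind the total is amount times ten to the power of the exponent, so 25 with code K becomes 25 × 10^3 = 25,000. The sketch below illustrates this with a named-vector lookup, which is an alternative to the `gsub()` chain above rather than the method used in this report. The helper `exp_to_power` is hypothetical, and it makes a simplifying assumption: any code other than H, K, M, or B (including blanks and digits) is treated as an exponent of 0.

```r
# Hypothetical helper: map an exponent code to a power of ten.
# Assumption: codes other than H/K/M/B (any case) map to 0.
exp_to_power <- function(code) {
  lookup <- c(H = 2, K = 3, M = 6, B = 9)
  power <- unname(lookup[toupper(code)])
  power[is.na(power)] <- 0
  power
}

# Invented amounts and codes for illustration.
dmg  <- c(25, 2.5, 1.2, 500)
code <- c("K", "m", "B", "")

total <- dmg * 10^exp_to_power(code)
print(total)
```

The four toy rows come out as 25 thousand, 2.5 million, 1.2 billion, and 500 (no multiplier).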

ecodata <- aggregate(cbind(PROPDMGTOTAL, CROPDMGTOTAL) ~ EVTYPE, data = data, FUN=sum)
ecodata$ECONOMIC_LOSS <- ecodata$PROPDMGTOTAL + ecodata$CROPDMGTOTAL
ecodata <- ecodata[order(ecodata$ECONOMIC_LOSS, decreasing = TRUE), ]
Top10_events_economy <- ecodata[1:10,]
knitr::kable(Top10_events_economy, format = "markdown")

	EVTYPE	PROPDMGTOTAL	CROPDMGTOTAL	ECONOMIC_LOSS
48	FLOOD	143944833550	4974778400	148919611950
88	HURRICANE/TYPHOON	69305840000	2607872800	71913712800
141	STORM SURGE	43193536000	5000	43193541000
149	TORNADO	24616945710	283425010	24900370720
66	HAIL	14595143420	2476029450	17071172870
46	FLASH FLOOD	15222203910	1334901700	16557105610
86	HURRICANE	11812819010	2741410000	14554229010
32	DROUGHT	1046101000	13367566000	14413667000
152	TROPICAL STORM	7642475550	677711000	8320186550
83	HIGH WIND	5247860360	633561300	5881421660
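The totals in the table are long integers that are hard to compare at a glance. A minimal sketch of rescaling them to billions, using the first three values from the table above (the names `eco` and `LOSS_BILLIONS` are hypothetical, and the sketch assumes the damage figures are in US dollars):

```r
# Values copied from the first three rows of the table above.
eco <- data.frame(
  EVTYPE        = c("FLOOD", "HURRICANE/TYPHOON", "STORM SURGE"),
  ECONOMIC_LOSS = c(148919611950, 71913712800, 43193541000)
)

# Express the loss in billions, rounded to one decimal place.
eco$LOSS_BILLIONS <- round(eco$ECONOMIC_LOSS / 1e9, 1)
print(eco[, c("EVTYPE", "LOSS_BILLIONS")])
```

The same rescaled column could be used as the plotted variable to make axis labels easier to read.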

Final Analysis

Question #1: Identifying which weather events cause the most fatalities and injuries:

# plotting health loss
library(ggplot2)
g <- ggplot(data = Top10_events_people, aes(x = reorder(EVTYPE, PEOPLE_LOSS), y = PEOPLE_LOSS))
g <- g + geom_bar(stat = "identity", colour = "black", fill = "lightblue")
g <- g + labs(title = "Total Fatalities & Injuries in USA by weather events 1996-2011")
g <- g + labs(y = "Number of fatalities and injuries", x = "Event Type")
g <- g + coord_flip()
# apply the complete theme first; adding it last would reset the centered title
g <- g + theme_classic()
g <- g + theme(plot.title = element_text(hjust = 0.5))
print(g)

Question #2: Identifying which weather events cause the most economic damage:

# plotting economic loss
g <- ggplot(data = Top10_events_economy, aes(x = reorder(EVTYPE, ECONOMIC_LOSS), y = ECONOMIC_LOSS))
g <- g + geom_bar(stat = "identity", colour = "black", fill = "lightblue")
g <- g + labs(title = "Total Economic Loss in USA by weather events 1996-2011")
g <- g + labs(y = "Size of property and crop loss", x = "Event Type")
g <- g + coord_flip()
# apply the complete theme first; adding it last would reset the centered title
g <- g + theme_classic()
g <- g + theme(plot.title = element_text(hjust = 0.5))
print(g)

Results

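Based on the analysis of the NOAA data for 1996-2011, tornadoes caused the most combined fatalities and injuries (22,178), although excessive heat caused the most fatalities (1,797). Floods caused the greatest economic loss, with about 149 billion US dollars in combined property and crop damage.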