Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Assignment

The goal of this analysis is to use the NOAA Storm Database to answer some basic questions about severe weather events.

This analysis will address the following questions:

  1. Across the United States, which types of events are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

Data

The 47 MB file for this project has been compressed with the bzip2 algorithm to reduce its size. It can be downloaded from the course web site:

Storm Data

Data Downloading and Processing

# downloading data
Url_data <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

File_data <- "StormData.csv.bz2"
if (!file.exists(File_data)) {
        download.file(Url_data, File_data, mode = "wb")
}

# reading data
rawdata <- read.csv(file = File_data, header=TRUE, sep=",")

The events in the database start in 1950 and end in November 2011. Earlier years generally have fewer recorded events, most likely because of a lack of good records, so more recent years should be considered more complete. Because NOAA began recording all event types in January 1996, the dataset is filtered to remove events recorded before that date.

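# subsetting by date
data <- rawdata
# parse only the date part of BGN_DATE; a Date column is safer in a data frame than the POSIXlt returned by strptime
data$BGN_DATE <- as.Date(rawdata$BGN_DATE, format = "%m/%d/%Y")
data <- subset(data, BGN_DATE >= as.Date("1996-01-01"))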
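The date cutoff can be checked on a few hand-written rows. This is only a sketch: the data frame `toy` and its values are made up for illustration, and it assumes that parsing with `as.Date` is acceptable because only the calendar date matters for the cutoff.

```r
# Hypothetical rows in the same "%m/%d/%Y %H:%M:%S" layout as BGN_DATE
toy <- data.frame(
  BGN_DATE = c("12/31/1995 0:00:00", "1/1/1996 0:00:00", "11/30/2011 0:00:00"),
  stringsAsFactors = FALSE
)

# as.Date() reads the leading date and ignores the trailing time text
toy$BGN_DATE <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y")

# Keep events that began on or after 1 January 1996
toy_recent <- subset(toy, BGN_DATE >= as.Date("1996-01-01"))
print(toy_recent)
```

Only the two rows dated 1996 or later should remain in `toy_recent`.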

Seven variables in the dataset pertain to the two questions:

  1. EVTYPE - type of event

  2. FATALITIES - number of fatalities

  3. INJURIES - number of injuries

  4. PROPDMG - the size of property damage

  5. PROPDMGEXP - the exponent values for ‘PROPDMG’ (property damage)

  6. CROPDMG - the size of crop damage

  7. CROPDMGEXP - the exponent values for ‘CROPDMG’ (crop damage)

data <- subset(data, select = c(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))

To further reduce the data size and make the analysis more efficient, events with no recorded fatalities, injuries, property damage, or crop damage are removed from the main data:

#cleaning event types names
data$EVTYPE <- toupper(data$EVTYPE)

# eliminating zero data
data <- data[data$FATALITIES !=0 | 
                       data$INJURIES !=0 | 
                       data$PROPDMG !=0 | 
                       data$CROPDMG !=0, ]
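The same row filter can be written more compactly with `rowSums`. The sketch below uses a made-up data frame `toy`, and `impact_cols` is a hypothetical helper name. It is an equivalent formulation, not the code used in this analysis.

```r
# Hypothetical events: the first row has no recorded impact at all
toy <- data.frame(
  EVTYPE     = c("HAIL", "FLOOD", "HEAT"),
  FATALITIES = c(0, 0, 2),
  INJURIES   = c(0, 0, 5),
  PROPDMG    = c(0, 25, 0),
  CROPDMG    = c(0, 0, 0),
  stringsAsFactors = FALSE
)

impact_cols <- c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")

# A row is kept when at least one impact column is non-zero
keep <- rowSums(toy[, impact_cols] != 0) > 0
toy_nonzero <- toy[keep, ]
print(toy_nonzero)
```

The all-zero HAIL row is dropped, while FLOOD and HEAT are kept because each has at least one non-zero value.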

Processing Population Health Data

Data on fatalities and injuries is aggregated to allow the top ten events to be identified:

healthdata <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = data, FUN=sum)
healthdata$PEOPLE_LOSS <- healthdata$FATALITIES + healthdata$INJURIES
healthdata <- healthdata[order(healthdata$PEOPLE_LOSS, decreasing = TRUE), ]
Top10_events_people <- healthdata[1:10,]
knitr::kable(Top10_events_people, format = "markdown")
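
|     | EVTYPE            | FATALITIES | INJURIES | PEOPLE_LOSS |
|:----|:------------------|-----------:|---------:|------------:|
| 149 | TORNADO           |       1511 |    20667 |       22178 |
| 39  | EXCESSIVE HEAT    |       1797 |     6391 |        8188 |
| 48  | FLOOD             |        414 |     6758 |        7172 |
| 107 | LIGHTNING         |        651 |     4141 |        4792 |
| 153 | TSTM WIND         |        241 |     3629 |        3870 |
| 46  | FLASH FLOOD       |        887 |     1674 |        2561 |
| 146 | THUNDERSTORM WIND |        130 |     1400 |        1530 |
| 182 | WINTER STORM      |        191 |     1292 |        1483 |
| 69  | HEAT              |        237 |     1222 |        1459 |
| 88  | HURRICANE/TYPHOON |         64 |     1275 |        1339 |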
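To make the aggregation step concrete, here is a small sketch on made-up numbers (the data frame `toy` is hypothetical). It follows the same three steps: sum by event type, add the combined column, and sort in decreasing order. The unlabeled leading number in each output row is the row index from the aggregated data frame, carried through the sort.

```r
# Hypothetical events, with TORNADO appearing twice
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
  FATALITIES = c(1, 0, 2, 4),
  INJURIES   = c(10, 3, 5, 0),
  stringsAsFactors = FALSE
)

# Sum both columns within each event type
toy_health <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Combined count, then sort from most to least harmful
toy_health$PEOPLE_LOSS <- toy_health$FATALITIES + toy_health$INJURIES
toy_health <- toy_health[order(toy_health$PEOPLE_LOSS, decreasing = TRUE), ]
print(toy_health)
```

With these values TORNADO ranks first (18), followed by HEAT (4) and FLOOD (3).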

Processing Economic Loss Data

The exponent codes in the two columns PROPDMGEXP and CROPDMGEXP are converted to numeric powers of ten so that the damage amounts can be computed and sorted:

data$PROPDMGEXP <- gsub("[Hh]", "2", data$PROPDMGEXP)
data$PROPDMGEXP <- gsub("[Kk]", "3", data$PROPDMGEXP)
data$PROPDMGEXP <- gsub("[Mm]", "6", data$PROPDMGEXP)
data$PROPDMGEXP <- gsub("[Bb]", "9", data$PROPDMGEXP)
data$PROPDMGEXP <- gsub("\\+", "1", data$PROPDMGEXP)
data$PROPDMGEXP <- gsub("\\?|\\-|\\ ", "0",  data$PROPDMGEXP)
data$PROPDMGEXP <- as.numeric(data$PROPDMGEXP)

data$CROPDMGEXP <- gsub("[Hh]", "2", data$CROPDMGEXP)
data$CROPDMGEXP <- gsub("[Kk]", "3", data$CROPDMGEXP)
data$CROPDMGEXP <- gsub("[Mm]", "6", data$CROPDMGEXP)
data$CROPDMGEXP <- gsub("[Bb]", "9", data$CROPDMGEXP)
data$CROPDMGEXP <- gsub("\\+", "1", data$CROPDMGEXP)
data$CROPDMGEXP <- gsub("\\-|\\?|\\ ", "0", data$CROPDMGEXP)
data$CROPDMGEXP <- as.numeric(data$CROPDMGEXP)

data$PROPDMGEXP[is.na(data$PROPDMGEXP)] <- 0
data$CROPDMGEXP[is.na(data$CROPDMGEXP)] <- 0
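The same mapping can also be expressed as a single lookup, which avoids repeating the `gsub` calls for each column. This is a minimal sketch, not the code used in this analysis: `exp_to_power` is a hypothetical helper name, and it assumes each code is a single character (a letter, a digit, a symbol, or blank), mirroring the substitutions above.

```r
# Hypothetical helper: map an exponent code to a power of ten
exp_to_power <- function(code) {
  lookup <- c(H = 2, K = 3, M = 6, B = 9, "+" = 1, "-" = 0, "?" = 0)
  code <- toupper(trimws(as.character(code)))

  # Letters and symbols come from the lookup; unknown codes give NA
  power <- unname(lookup[code])

  # Single digits are used as the exponent directly
  digit <- grepl("^[0-9]$", code)
  power[digit] <- as.numeric(code[digit])

  # Blank or unrecognized codes are treated as 10^0
  power[is.na(power)] <- 0
  power
}

# Example: a PROPDMG of 2.5 with code "M" is 2.5 * 10^6
damage <- 2.5 * 10^exp_to_power("M")
print(exp_to_power(c("K", "m", "B", "", "+", "5", "?")))
print(damage)
```

A call such as `exp_to_power(data$PROPDMGEXP)` would then return the numeric exponents for a whole column in one step.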

library(dplyr)
data <- mutate(data, 
                    PROPDMGTOTAL = PROPDMG * (10 ^ PROPDMGEXP), 
                    CROPDMGTOTAL = CROPDMG * (10 ^ CROPDMGEXP))

ecodata <- aggregate(cbind(PROPDMGTOTAL, CROPDMGTOTAL) ~ EVTYPE, data = data, FUN=sum)
ecodata$ECONOMIC_LOSS <- ecodata$PROPDMGTOTAL + ecodata$CROPDMGTOTAL
ecodata <- ecodata[order(ecodata$ECONOMIC_LOSS, decreasing = TRUE), ]
Top10_events_economy <- ecodata[1:10,]
knitr::kable(Top10_events_economy, format = "markdown")

|     | EVTYPE            | PROPDMGTOTAL | CROPDMGTOTAL | ECONOMIC_LOSS |
|:----|:------------------|-------------:|-------------:|--------------:|
| 48  | FLOOD             | 143944833550 |   4974778400 |  148919611950 |
| 88  | HURRICANE/TYPHOON |  69305840000 |   2607872800 |   71913712800 |
| 141 | STORM SURGE       |  43193536000 |         5000 |   43193541000 |
| 149 | TORNADO           |  24616945710 |    283425010 |   24900370720 |
| 66  | HAIL              |  14595143420 |   2476029450 |   17071172870 |
| 46  | FLASH FLOOD       |  15222203910 |   1334901700 |   16557105610 |
| 86  | HURRICANE         |  11812819010 |   2741410000 |   14554229010 |
| 32  | DROUGHT           |   1046101000 |  13367566000 |   14413667000 |
| 152 | TROPICAL STORM    |   7642475550 |    677711000 |    8320186550 |
| 83  | HIGH WIND         |   5247860360 |    633561300 |    5881421660 |

Final Analysis

Question #1: Identifying which weather events cause the most fatalities and injuries:

#plotting health loss
library(ggplot2)
g <- ggplot(data = Top10_events_people, aes(x = reorder(EVTYPE, PEOPLE_LOSS), y = PEOPLE_LOSS))
g <- g + geom_bar(stat = "identity", colour = "black", fill="lightblue")
g <- g + labs(title = "Total Fatalities & Injuries in USA by weather events 1996-2011")
g <- g + labs(y = "Number of fatalities and injuries", x = "Event Type")
g <- g + coord_flip()
# apply the complete theme first; adding it last would reset the centered title
g <- g + theme_classic()
g <- g + theme(plot.title = element_text(hjust = 0.5))
print(g)

Question #2: Identifying which weather events cause the most economic damage:

#plotting economic loss
g <- ggplot(data = Top10_events_economy, aes(x = reorder(EVTYPE, ECONOMIC_LOSS), y = ECONOMIC_LOSS))
g <- g + geom_bar(stat = "identity", colour = "black", fill="lightblue")
g <- g + labs(title = "Total Economic Loss in USA by weather events 1996-2011")
g <- g + labs(y = "Property and crop damage (USD)", x = "Event Type")
g <- g + coord_flip()
# apply the complete theme first; adding it last would reset the centered title
g <- g + theme_classic()
g <- g + theme(plot.title = element_text(hjust = 0.5))
print(g)

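Results

Based on the NOAA data for 1996-2011, tornadoes caused the most combined fatalities and injuries (22,178), although excessive heat caused more fatalities (1,797 versus 1,511). Floods caused the greatest economic loss, about $148.9 billion in combined property and crop damage.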