Analysis of U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database.

Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. This project explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. The aim of the report is to determine which types of events across the United States are most harmful with respect to population health and which have the greatest economic consequences. Based on the analysis, it is concluded that tornadoes are the most harmful events in the United States with respect to population health, and floods have the greatest economic impact.

Data Set

The data for this assignment comes in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. It can be downloaded from the "Storm Data" link.

There is also some documentation of the database available. Here is how some of the variables are constructed/defined.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years are considered to be more complete.

Data Processing

Loading R Libraries

library(data.table)
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Loading the data into R

First, download the data from the URL provided and save it to a local directory. Then load the csv data into R. The file does not need to be unzipped by hand, because read.csv() can read the bzip2-compressed file directly.

dataStorm <- read.csv("C:/Users/Simran Kharbanda/Desktop/ABC/R/rep research week 4/repdata_data_StormData.csv.bz2", sep = ",", header = TRUE)
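As a quick check of the loading step, the sketch below writes a tiny bzip2-compressed CSV to a temporary file and reads it back with read.csv(). The table `toy` and its values are made up for illustration and are not NOAA data.

```r
# Toy stand-in for the storm file (hypothetical values, not NOAA data)
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(2, 0))

# Write the toy table as a bzip2-compressed CSV in a temporary location
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "wt")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() detects the compression and reads the file directly
toy_back <- read.csv(tmp, header = TRUE)
print(toy_back)
```

The same call pattern applies to the real file: only the path changes.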

Subsetting the data

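The dataset has many columns that this analysis does not need, so we keep only the event type, the casualty counts and the damage columns. We also drop rows whose event type is "?" and rows that record no injuries, fatalities, property damage or crop damage.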

# Columns needed for the health and economic analysis
selection <- c('EVTYPE', 'FATALITIES', 'INJURIES', 'PROPDMG', 'PROPDMGEXP', 'CROPDMG', 'CROPDMGEXP')
dataStorm <- as.data.table(dataStorm[, selection])
# Drop the placeholder event type "?" and rows with no casualties or damage
dataStorm <- dataStorm[EVTYPE != "?" & (INJURIES > 0 | FATALITIES > 0 | PROPDMG > 0 | CROPDMG > 0)]

Converting exponent columns into new columns

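Two columns, PROPDMGEXP and CROPDMGEXP, hold exponent codes rather than numbers. We will convert the codes K, M and B to 1000, 1000000 and 1000000000 respectively (H to 100, and a digit to the matching power of ten), and treat any missing or unrecognized code as a multiplier of 1.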

cols <- c("PROPDMGEXP", "CROPDMGEXP")
dataStorm[,  (cols) := c(lapply(.SD, toupper)), .SDcols = cols]

PROPDMGKey <-  c("\"\"" = 10^0, 
                 "-" = 10^0, "+" = 10^0, "0" = 10^0, "1" = 10^1, "2" = 10^2, "3" = 10^3,
                 "4" = 10^4, "5" = 10^5, "6" = 10^6, "7" = 10^7, "8" = 10^8, "9" = 10^9, 
                 "H" = 10^2, "K" = 10^3, "M" = 10^6, "B" = 10^9)
CROPDMGKey <-  c("\"\"" = 10^0, "?" = 10^0, "0" = 10^0, "K" = 10^3, "M" = 10^6, "B" = 10^9)

dataStorm[, PROPDMGEXP := PROPDMGKey[as.character(dataStorm[,PROPDMGEXP])]]
dataStorm[is.na(PROPDMGEXP), PROPDMGEXP := 10^0 ]

dataStorm[, CROPDMGEXP := CROPDMGKey[as.character(dataStorm[,CROPDMGEXP])] ]
dataStorm[is.na(CROPDMGEXP), CROPDMGEXP := 10^0 ]

Now we will create two new columns, PROPCOST and CROPCOST, by multiplying each damage amount by its multiplier.

dataStorm <- dataStorm[, .(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, PROPCOST = PROPDMG * PROPDMGEXP, CROPDMG, CROPDMGEXP, CROPCOST = CROPDMG * CROPDMGEXP)]
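To make the conversion and the cost calculation concrete, here is a minimal sketch on four made-up rows. The table `toy`, the shortened lookup vector `key` and the helper column `MULT` are assumptions for illustration; the report itself overwrites PROPDMGEXP in place.

```r
library(data.table)

# Toy rows (hypothetical values): K, lower-case m, an empty code, an unknown symbol
toy <- data.table(PROPDMG    = c(25, 2.5, 10, 3),
                  PROPDMGEXP = c("K", "m", "", "?"))

# Shortened version of the lookup key used in the report
key <- c("H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9)

# Normalise case, then look up each multiplier by name
toy[, PROPDMGEXP := toupper(PROPDMGEXP)]
toy[, MULT := unname(key[PROPDMGEXP])]

# Codes that are not in the key (here "" and "?") come back as NA; use 1 instead
toy[is.na(MULT), MULT := 1]

# Damage in dollars = reported amount * multiplier
toy[, PROPCOST := PROPDMG * MULT]
print(toy)
```

The first row becomes 25 * 1000 = 25000 and the second 2.5 * 1000000 = 2500000, while the last two rows keep their reported amounts because their codes fall back to 1.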

Data Analysis

1. Estimating total Health Impacts (Injuries and Fatalities)

Health_Impact <- dataStorm[, .(FATALITIES = sum(FATALITIES), INJURIES = sum(INJURIES), TOTAL_HEALTH_IMPACTS = sum(FATALITIES) + sum(INJURIES)), by = .(EVTYPE)]

Health_Impact <- Health_Impact[order(-TOTAL_HEALTH_IMPACTS), ]

Health_Impact <- Health_Impact[1:10, ]
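The group, sort and truncate pattern used here can be sketched on a handful of made-up rows. The table `toy`, the column name `TOTAL` and the cut-off of two rows are assumptions for illustration only.

```r
library(data.table)

# Toy event records (hypothetical values, not NOAA data)
toy <- data.table(EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
                  FATALITIES = c(3, 1, 2, 6, 0),
                  INJURIES   = c(10, 2, 5, 0, 1))

# Sum casualties per event type
totals <- toy[, .(FATALITIES = sum(FATALITIES),
                  INJURIES   = sum(INJURIES),
                  TOTAL      = sum(FATALITIES) + sum(INJURIES)),
              by = EVTYPE]

# Sort by the combined total, largest first, and keep the top two
top2 <- head(totals[order(-TOTAL)], 2)
print(top2)
```

This sketch uses head() rather than indexing with 1:2 because head() returns only the rows that exist, whereas indexing past the end of a data.table pads the result with NA rows. With the full storm data there are far more than ten event types, so either form gives the same ten rows.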

2. Estimating total Economic Impacts (Property Cost and Crop Cost)

Eco_Impact <- dataStorm[, .(PROPCOST = sum(PROPCOST), CROPCOST = sum(CROPCOST), TOTAL_ECO_IMPACTS = sum(PROPCOST) + sum(CROPCOST)), by = .(EVTYPE)]

Eco_Impact <- Eco_Impact[order(-TOTAL_ECO_IMPACTS), ]

Eco_Impact <- Eco_Impact[1:10, ]

Results

Answer 1 - Events across the United States that are most harmful with respect to population health

Health_Consequences <- melt(Health_Impact, id.vars = "EVTYPE", variable.name = "Fatalities_or_Injuries")

ggplot(Health_Consequences, aes(x = reorder(EVTYPE, -value), y = value)) + 
  geom_bar(stat = "identity", aes(fill = Fatalities_or_Injuries), position = "dodge") + 
  ylab("Total Injuries and Fatalities") + 
  xlab("Event Type") + 
  theme(axis.text.x = element_text(angle=45, hjust=1)) + 
  ggtitle("US Events that are most harmful to Population Health") + 
  theme(plot.title = element_text(hjust = 0.5))+
  labs(caption = "Figure 1. Events responsible for the most fatalities and injuries")

Therefore, Tornadoes have the most impact on population health.
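The reshaping step behind the plot can be illustrated on a small made-up table. melt() turns each measure column into rows, so every column other than EVTYPE, including the combined total, becomes its own level of the fill variable and therefore its own bar. The table `wide` and its values are assumptions for illustration.

```r
library(data.table)

# Toy summary table in wide form (hypothetical values)
wide <- data.table(EVTYPE     = c("TORNADO", "HEAT"),
                   FATALITIES = c(5, 6),
                   INJURIES   = c(15, 0),
                   TOTAL      = c(20, 6))

# One row per (event type, measure) pair; measure names go into the new column
long <- melt(wide, id.vars = "EVTYPE", variable.name = "Fatalities_or_Injuries")
print(long)
```

Here two event types and three measure columns give six rows in long form, which is the layout ggplot2 needs to draw dodged bars grouped by measure.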

Answer 2 - Events across the United States that have the greatest economic consequences

Eco_Consequences <- melt(Eco_Impact, id.vars = "EVTYPE", variable.name = "Damage_Type")

ggplot(Eco_Consequences, aes(x = reorder(EVTYPE, -value), y = value/1e9)) + 
  geom_bar(stat = "identity", aes(fill = Damage_Type), position = "dodge") + 
  ylab("Cost/Damage (in billion USD)") + 
  xlab("Event Type") + 
  theme(axis.text.x = element_text(angle=45, hjust=1)) + 
  ggtitle("US Events that have the Greatest Economic Consequence") + 
  theme(plot.title = element_text(hjust = 0.5))+
  labs(caption = "Figure 2. Events responsible for greatest economic consequence")

Therefore, Floods have the most economic impact.