Reproducible Research: Peer Assessment 2

Synopsis:

Storms and other severe weather events harm population health and damage property, which affects the country’s economic conditions. The U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database collects data from across the country so that the causes of the largest health and economic consequences can be studied and preventive actions taken. The results of this analysis address the following questions:
1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?

Processing Data

The analysis was performed on the Storm Events Database, provided by the National Climatic Data Center. The data come as a bzip2-compressed comma-separated-value file available here. There is also some documentation of the data available here.

Downloading data file

#setting local working directory
setwd("C:/Data/devtools/Git/RepData_PeerAssessment2")
library(knitr)
library(ggplot2)
#suppressMessages to suppress warning/ messages
suppressMessages(library(dplyr))
#setting working directory for knit
opts_knit$set(base.dir = "C:/Data/devtools/Git/RepData_PeerAssessment2")
stdata <- NULL
#Checking for file in current directory
if(!file.exists("PA2_StormData.bz2"))
{
        download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",destfile = "PA2_StormData.bz2",mode = "wb")
}

Reading and checking data from file

#reading data from csv file
stdata <- read.csv(bzfile("PA2_StormData.bz2"))
#getting column names
colnms <- names(stdata)
#rows & columns
rws <- nrow(stdata); cls <- ncol(stdata)

Data from file:
. Number of rows 902297
. Number of columns 37

Selecting the columns required for the analysis from the data frame, then plotting a histogram to show how many events the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database holds for each year (1950-2011).

knitr::opts_chunk$set(fig.width=40, fig.height=20, fig.path='figs/', warning=FALSE, message=FALSE)
#getting required data for analysis
prcdata <- stdata
names(prcdata) <- toupper(names(prcdata))
#getting required columns
prcdata <- prcdata[,c("BGN_DATE", "EVTYPE", "FATALITIES", "INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP")]
#formatting date
prcdata <- mutate(prcdata, BGN_DATE = as.Date(as.character(prcdata$BGN_DATE), "%m/%d/%Y"))
#Starting year
minDate <- min(prcdata$BGN_DATE)
maxDate <- max(prcdata$BGN_DATE)
#adding year column
prcdata$YEAR <- as.integer(format(prcdata$BGN_DATE, "%Y"))
opar=par(ps=26)
hist(prcdata$YEAR, breaks = 45, main="Number of events recorded per year", xlab="Year", ylab="Number of events", cex=1.0, cex.main=2.5)

The histogram supports the statement that the events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
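The per-year counts behind such a histogram can also be inspected directly. The following is a minimal sketch on a few made-up date strings in the same text format as BGN_DATE; the vector `bgn_date` and its values are hypothetical and stand in for the real column.

```r
# Toy stand-in for the BGN_DATE column (hypothetical values, same text format).
bgn_date <- c("4/18/1950 0:00:00", "6/8/1951 0:00:00",
              "11/15/1994 0:00:00", "11/28/2011 0:00:00",
              "11/28/2011 0:00:00")

# as.Date() parses the leading date and ignores the trailing time text.
dates <- as.Date(bgn_date, format = "%m/%d/%Y")
years <- as.integer(format(dates, "%Y"))

# Number of events per year, the quantity the histogram visualizes.
events_per_year <- table(years)
print(events_per_year)

# Keep only the years treated as well covered in this report.
recent <- years[years >= 1970]
```

Looking at `events_per_year` as a table makes it easy to pick a cut-off year by number rather than by eye.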

NOAA collected a considerable number of measurements for major storms and weather events from 1970 to 2011, so the analysis is restricted to those years.
Event type (EVTYPE) values need cleaning (consistent character sequences, no leading or trailing spaces) to get proper counts and labels from the data.

#filtering data from 1970 to 2011
strmdata <- filter(prcdata, YEAR >= 1970)
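#converting to upper case
evntlbls <- toupper(strmdata$EVTYPE)
#removing leading and trailing whitespace
evntlbls <- gsub("(^[[:space:]]+|[[:space:]]+$)", "", evntlbls)
#replacing each blank or punctuation character with a space
evntlbls <- gsub("[[:blank:][:punct:]+]", " ", evntlbls)
#note: this lower-case pattern never matches the upper-cased labels, so
#TSTM WIND and THUNDERSTORM WIND remain separate categories in the results
evntlbls <- gsub("^thunderstorm wind[:alnum:] | ^tstm wind[:alnum:]", "thunderstorm wind", evntlbls)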
#updating data with updated labels
strmdata$EVTYPE <- evntlbls
#unique(strmdata$EVTYPE)
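For comparison, here is one way the label clean-up could be written so that the thunderstorm-wind spellings really are merged. It is a sketch only: the function name `normalize_evtype`, the sample labels, and the decision to fold every label that starts with a thunderstorm-wind spelling into one category are assumptions, and applying it to the real data would change the counts reported in this document.

```r
# Hypothetical raw labels; the real values come from strmdata$EVTYPE.
raw <- c("  TSTM WIND", "Thunderstorm Winds", "TSTM WIND/HAIL",
         "THUNDERSTORM WIND (G40)", "HURRICANE/TYPHOON", "Flash Flood ")

normalize_evtype <- function(x) {
  x <- toupper(x)
  # Collapse each run of punctuation/whitespace into a single space.
  x <- gsub("[[:punct:][:space:]]+", " ", x)
  x <- trimws(x)
  # Assumption: any label starting with a thunderstorm-wind spelling
  # belongs to one category, whatever follows it.
  sub("^(THUNDERSTORM|TSTM) WINDS?( .*)?$", "THUNDERSTORM WIND", x)
}

clean <- normalize_evtype(raw)
print(clean)
```

The pattern is written in upper case because it runs after `toupper()`, and the alternatives are grouped with parentheses instead of being separated by spaces around `|`.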

Data Analysis

Subsetting the weather events that are most harmful to population health and those with the greatest economic consequences.

Major Weather Events harmful to Population Health

#Getting harmful events data from dataframe
hdata <- filter(strmdata,strmdata$FATALITIES > 0 | strmdata$INJURIES > 0)
#harmful data rows count
nrow(hdata)
## [1] 19585

Health Data Analysis

#Fatalities events counts
fatcounts <- aggregate(FATALITIES ~ EVTYPE,data=hdata,FUN=sum)
#InjuryEvents by aggregation
injcounts <- aggregate(INJURIES ~ EVTYPE,data=hdata,FUN=sum)
#Top ten records for FATALITIES and INJURIES
fatTop10 <- head(fatcounts[order(fatcounts$FATALITIES, decreasing = T), ], 10)
injTop10 <- head(injcounts[order(injcounts$INJURIES, decreasing = T), ], 10)
# Updating column names
colnames(fatTop10) <- c("Event", "Fatalities")
colnames(injTop10) <- c("Event", "Injuries")
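The aggregate, sort, and truncate steps above are repeated for each measure, so they could be wrapped in a small helper. This is a sketch on toy data; `top_events` and the data frame `toy` (with made-up counts) are hypothetical names, not part of the original analysis.

```r
# Toy data with hypothetical counts; the real input would be hdata.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT", "HAIL"),
  FATALITIES = c(5, 2, 3, 1, 4, 0)
)

# Sum one measure per event type and return the n largest totals.
top_events <- function(data, measure, n = 10) {
  totals <- aggregate(data[[measure]], by = list(Event = data$EVTYPE), FUN = sum)
  names(totals)[2] <- measure
  head(totals[order(totals[[measure]], decreasing = TRUE), ], n)
}

top2 <- top_events(toy, "FATALITIES", n = 2)
print(top2)
```

The same call with `measure = "INJURIES"` would produce the injury ranking, which keeps the two tables consistent with each other.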

Results

Health Data Top 10 records

. Fatal Events
. Injury Events

fatTop10
##              Event Fatalities
## 169        TORNADO       3272
## 27  EXCESSIVE HEAT       1903
## 36     FLASH FLOOD        978
## 61            HEAT        937
## 111      LIGHTNING        816
## 176      TSTM WIND        504
## 41           FLOOD        470
## 135    RIP CURRENT        368
## 82       HIGH WIND        248
## 2        AVALANCHE        224
injTop10
##                 Event Injuries
## 169           TORNADO    59611
## 176         TSTM WIND     6957
## 41              FLOOD     6789
## 27     EXCESSIVE HEAT     6525
## 111         LIGHTNING     5230
## 61               HEAT     2100
## 105         ICE STORM     1975
## 36        FLASH FLOOD     1777
## 158 THUNDERSTORM WIND     1488
## 59               HAIL     1361

Health Data plots

par(mfrow = c(1, 2), mar = c(14, 6, 4, 3), mgp = c(2, 1, 0), cex = 1.0, cex.lab=2, cex.main=2.5)

ylim <- c(0, 1.1*max(fatTop10$Fatalities))

fatalPlot <- barplot(fatTop10$Fatalities, names.arg = fatTop10$Event, main = 'Top 10 events for fatalities', ylab = 'Number of fatalities', ylim = ylim, cex.axis = 2)
text(x = fatalPlot, y = fatTop10$Fatalities, label = round(fatTop10$Fatalities, 0), pos = 3)
ylim <- c(0, 1.1*max(injTop10$Injuries))

injuryPlot <- barplot(injTop10$Injuries, names.arg = injTop10$Event, main = 'Top 10 events for injuries', ylab = 'Number of injuries', ylim = ylim, cex.axis = 2)
text(x = injuryPlot, y = injTop10$Injuries, label = round(injTop10$Injuries, 0), pos = 3)

Economic Data Analysis

#Economic consequence events data
edata <- filter(strmdata, strmdata$PROPDMG > 0 | strmdata$CROPDMG > 0)
#economic data rows count
nrow(edata)
## [1] 235473
#Function to convert damage amount unit:
# h -> hundred, k -> thousand, m -> million, b -> billion
convertCurrUnit <- function(e) 
{
        if (e %in% c('h', 'H')){
                return(2)
        } else if (e %in% c('k', 'K')) {
                return(3)
        } else if (e %in% c('m', 'M')) {
                return(6)
        } else if (e %in% c('b', 'B')) {
                return(9)
        } else if (!is.na(as.numeric(e))) {# if a digit
                return(as.numeric(e))
        } else if (e %in% c('', '-', '?', '+')) {
                return(0)
        } else {
                stop("Not valid.")
        }
}
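One caveat with the function above: `as.numeric()` applied to a factor returns the integer level code, not the digit shown in the label. If the exponent columns were read as factors (the `read.csv` default before R 4.0), the digit branch can therefore return unexpected exponents and inflate the totals. A vectorized variant that converts to character first might look like the sketch below; `exp_lookup`, `damage_exponent`, and the sample values are hypothetical, and treating every unrecognized symbol as 10^0 instead of stopping is an assumption.

```r
# Named lookup for the letter codes: hundred, thousand, million, billion.
exp_lookup <- c(H = 2, K = 3, M = 6, B = 9)

damage_exponent <- function(e) {
  e <- toupper(as.character(e))      # factors become their labels, not codes
  out <- unname(exp_lookup[e])       # NA for anything that is not H/K/M/B
  is_digit <- grepl("^[0-9]$", e)
  out[is_digit] <- as.numeric(e[is_digit])
  out[is.na(out)] <- 0               # assumption: '', '-', '?', '+' mean 10^0
  out
}

# Hypothetical damage figures with their exponent codes stored as a factor.
dmg  <- c(25, 2.5, 1, 3, 7)
exps <- factor(c("K", "m", "B", "", "2"))
total <- dmg * 10^damage_exponent(exps)
print(total)
```

Because the function works on whole vectors, it can be applied to a column directly without `sapply`.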

Calculating Property and Crop damage expenses

#Getting property damage
edata$PROPDMG <- edata$PROPDMG * (10 ** sapply(edata$PROPDMGEXP, FUN=convertCurrUnit))
#Getting crop damage
edata$CROPDMG <- edata$CROPDMG * (10 ** sapply(edata$CROPDMGEXP, FUN=convertCurrUnit))
# Total property and crop damage per event type
prcounts <- aggregate(PROPDMG ~ EVTYPE,data=edata,FUN=sum)
crcounts <- aggregate(CROPDMG ~ EVTYPE,data=edata,FUN=sum)
# Events caused most economic expenses
prevntTop10 <- head(prcounts[order(prcounts$PROPDMG, decreasing = T), ], 10)
crevntTop10 <- head(crcounts[order(crcounts$CROPDMG, decreasing = T), ], 10)

# Updating column names
colnames(prevntTop10) <- c("Event", "propDMG")
colnames(crevntTop10) <- c("Event", "cropDMG")

Results

Economic Data Top 10 records

. Property damage
. Crop damage

prevntTop10
##                  Event      propDMG
## 49         FLASH FLOOD 6.820237e+13
## 290 THUNDERSTORM WINDS 2.086532e+13
## 314            TORNADO 1.073677e+12
## 94                HAIL 3.157558e+11
## 196          LIGHTNING 1.729433e+11
## 62               FLOOD 1.446577e+11
## 172  HURRICANE TYPHOON 6.930584e+10
## 69            FLOODING 5.920825e+10
## 263        STORM SURGE 4.332354e+10
## 126         HEAVY SNOW 1.793259e+10
crevntTop10 
##                 Event     cropDMG
## 31            DROUGHT 13972566000
## 62              FLOOD  5661968450
## 230       RIVER FLOOD  5029459000
## 181         ICE STORM  5022113500
## 94               HAIL  3025974480
## 164         HURRICANE  2741910000
## 172 HURRICANE TYPHOON  2607872800
## 49        FLASH FLOOD  1421317100
## 44       EXTREME COLD  1312973000
## 81       FROST FREEZE  1094186000

Economic data plots

par(mfrow = c(1, 2), mar = c(12, 5, 3, 2), mgp = c(3, 1, 0), cex = 1.0, las = 3, cex.lab=2, cex.main=2.5)

prdmgplot <- barplot((prevntTop10$propDMG/1000000000), names.arg = prevntTop10$Event, main = 'Top 10 events for property damage', ylab = 'Property damage (billions of dollars)', log="y")
crdmgplot <- barplot((crevntTop10$cropDMG/1000000000), names.arg = crevntTop10$Event, main = 'Top 10 events for crop damage', ylab = 'Crop damage (billions of dollars)', log="y")

Conclusion

This report shows that Flash Flood, Thunderstorm Winds, Tornado, Hail, Lightning, and Flood events caused the greatest property damage (billions of dollars) across the United States, while Drought, Flood, River Flood, Ice Storm, Hail, and Hurricane events caused the greatest crop damage.
Tornadoes were by far the most harmful to population health (3272 fatalities and 59611 injuries), followed by Excessive Heat, Flash Flood, Heat, and Lightning for fatalities and by Tstm Wind, Flood, Excessive Heat, and Lightning for injuries. Building the necessary infrastructure to predict weather events early, keeping necessary equipment and medication available, and publishing safety precautions could help reduce population health problems.

Execute the script below in the command line (or R console) to generate the plot images and place them in the ‘./figure’ folder
knit2html("PA2_template.Rmd", "PA2_template.html")