Note: An updated version of this report for the Peer Assessment is available on my RPubs page and in my Reproducible Research repository. The plots are available in the figures folder.
In this analysis, we explore the United States National Oceanic and Atmospheric Administration’s (NOAA) Storm Database for 1950 - 2011 to answer some basic questions about severe weather events. We address the impact of severe weather events on public health and their economic consequences across the United States. Specifically, we analyze the fatalities, injuries, and property and crop damage estimates recorded over the years to determine which types of events are most harmful to population health and to the US economy. We found that tornadoes and excessive heat are the most harmful to population health, while floods, droughts, hail, and hurricanes have the greatest economic consequences.
# Set Working Directory
wd <- "~/Documents/Buva/Data Science/Data Science Course-John Hopkins/05-Reproducible Research/Peer Assessment#2"
setwd(wd)
knitr::opts_chunk$set(echo = TRUE, cache = TRUE) # Always make code visible and cache chunk results
options(scipen = 1) # Bias number printing towards fixed rather than scientific notation
library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.6.1 (2014-01-04) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.18.0 (2014-02-22) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
##
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
##
## The following objects are masked from 'package:base':
##
## attach, detach, gc, load, save
##
## R.utils v1.32.4 (2014-05-14) successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
##
## The following object is masked from 'package:utils':
##
## timestamp
##
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, inherits, isOpen, parse, warnings
library(ggplot2)
library(grid)
require(gridExtra)
## Loading required package: gridExtra
The data are processed as follows:
1. Source file “StormData.csv.bz2” is downloaded if not present
2. Source data is unzipped using bzfile() function and read via the read.csv() function
3. Examine the dataframe using dim() and names() functions
# Download the file from the website location to the local data directory
dwld_file <- function(fileurl, dest){
  if (!file.exists("data")) dir.create("data") # create the folder if it doesn't exist
  if (!file.exists(dest)) { # download the file if it's not already downloaded
    download.file(fileurl, destfile = dest, method = "curl")
  }
}
# Assign the data file destination
dest <- "./data/stormdata.csv.bz2"
# 1. Call the function to download from the URL
dwld_file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", dest)
# 2. Read the downloaded bz2 file into a data frame
StormData <- read.csv(bzfile(dest), header=TRUE)
# 3. Examine using the dim and names functions
dim(StormData)
## [1] 902297 37
# Display column names
names(StormData)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
There are 902297 rows and 37 variables/columns in total. The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
# Format Date to see how the data is distributed over years
StormData$year <- as.numeric(format(as.Date(StormData$BGN_DATE, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
hist(StormData$year, breaks=50, col="green", xlab="Storm Data - Years", main="Histogram of the total number of Severe Weather Events")
As the histogram above shows, the number of recorded events increases significantly around 1995. We therefore use the subset of the data from 1995 to 2011, where the records are most complete.
# Filter StormData from 1995
Data <- StormData[StormData$year >= 1995, ]
dim(Data)
## [1] 681500 38
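As a quick sanity check on the date handling behind this filter, the year extraction can be tried on a few strings in the same "%m/%d/%Y %H:%M:%S" layout. This is a minimal sketch: the sample dates and the names `bgn_date`, `year` and `recent` are illustrative assumptions, not rows taken from the NOAA file.

```r
# Sample date strings in the BGN_DATE layout (illustrative values only)
bgn_date <- c("4/18/1950 0:00:00", "1/6/1995 0:00:00", "11/28/2011 0:00:00")

# Parse to Date (the time part is dropped), then pull out the year as a number
year <- as.numeric(format(as.Date(bgn_date, format = "%m/%d/%Y %H:%M:%S"), "%Y"))

# Keep only the entries from 1995 onwards, mirroring the filter above
recent <- bgn_date[year >= 1995]

print(year)
print(recent)
```

If the format string did not match the data, `as.Date()` would return `NA` and the comparison `year >= 1995` would yield `NA` rows, so checking a handful of values first is cheap insurance.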
After keeping only the data from 1995 onwards, there are 681500 rows and 38 columns. However, we don't need all of the columns. For our analysis, we are interested only in the fatalities, injuries, and property and crop damage estimates for each event type, so we can narrow the dataset down to the columns relevant to our research. We do the following:
1. Identify the columns required for research
2. Filter the Dataset for the required columns
3. Examine column names and data with dim(), head() and names() functions
# 1. Identify the Required columns for this research
ReqdCols <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
# 2. Filter the StormData set with the reqd columns
FilteredStormData <- Data[, ReqdCols]
# 3. Display column names
names(FilteredStormData)
## [1] "EVTYPE" "FATALITIES" "INJURIES" "PROPDMG" "PROPDMGEXP"
## [6] "CROPDMG" "CROPDMGEXP"
head(FilteredStormData)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP
## 187560 FREEZING RAIN 0 0 0.0
## 187561 SNOW 0 0 0.0
## 187563 SNOW/ICE 0 0 0.0
## 187565 SNOW/ICE 0 0 0.0
## 187566 HURRICANE OPAL/HIGH WINDS 2 0 0.1 B
## 187575 HAIL 0 0 0.0
## CROPDMG CROPDMGEXP
## 187560 0
## 187561 0
## 187563 0
## 187565 0
## 187566 10 M
## 187575 0
dim(FilteredStormData)
## [1] 681500 7
To address the first part of the assignment question, we look at the number of fatalities and injuries caused by severe weather events. We want the 20 most severe types of weather events. We do the following:
1. Group and sum all Fatalities for all the event types
2. Assign new column names for the aggregate data set
3. Order the new dataset in the descending order and take out the top 20 for our analysis
4. Refresh the factor levels for the filtered dataset
Repeat Steps 1-4 for analysing the Injuries
# 1. Group and sum all Fatalities for each event type
Fatalities <- aggregate( FilteredStormData$FATALITIES, by=list(FilteredStormData$EVTYPE), sum)
# 2. Assign Column names
names(Fatalities) <- c("EventType", "Consequences")
# 3. Filter the top 20 Fatalities
Fatalities <- Fatalities[order(-Fatalities$Consequences), ] [1:20, ]
# 4. Refresh the factor levels with the new Fatalities subset
Fatalities$EventType <- factor(Fatalities$EventType, levels = Fatalities$EventType)
# Repeat Steps 1-4 for Injuries
# 1. Group and sum all Injuries for each event type
Injuries <- aggregate( FilteredStormData$INJURIES, by=list(FilteredStormData$EVTYPE), sum)
# 2. Assign Column names
names(Injuries) <- c("EventType", "Consequences")
# 3. Filter the top 20 Injuries
Injuries <- Injuries[order(-Injuries$Consequences), ] [1:20, ]
# 4. Refresh the factor levels with the new Injuries subset
Injuries$EventType <- factor(Injuries$EventType, levels = Injuries$EventType)
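Because steps 1-4 are identical for fatalities and injuries, they could be wrapped in one function. The following is only a sketch: `top_events` and the toy data frame are hypothetical names and made-up numbers introduced for illustration, and `head()` is used in place of `[1:20, ]` so that fewer than `n` groups do not produce `NA` rows.

```r
# Hypothetical helper: total one column per event type and keep the n largest
top_events <- function(data, value, n = 20) {
  totals <- aggregate(data[[value]], by = list(data$EVTYPE), FUN = sum)
  names(totals) <- c("EventType", "Consequences")
  totals <- head(totals[order(-totals$Consequences), ], n)
  # Fix the factor levels in ranked order so plots keep this ordering
  totals$EventType <- factor(totals$EventType, levels = totals$EventType)
  totals
}

# Toy data (made-up numbers, not taken from the NOAA file)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(3, 5, 4, 1, 1),
  stringsAsFactors = FALSE
)

top2 <- top_events(toy, "FATALITIES", n = 2)
print(top2)
```

With the real data, a call such as `top_events(FilteredStormData, "INJURIES")` would be expected to play the role of the injuries block above.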
For the second part of the question, we look at the property and crop damage caused by severe weather events. The database stores each damage figure as a value plus an exponent code, so we first convert these into dollar amounts. The following steps prepare the data for our analysis of economic impact:
1. Display more details on the PROPDMGEXP and CROPDMGEXP variables
2. Assign the numeric exponent used to calculate the financial damages: B for billion (9), M for million (6), K for thousand (3), H for hundred (2)
3. Check for values that cannot be converted to a number and set them to 0
4. Convert the PROPDMGEXP variable to numeric
5. Compute the financial property damages as PROPDMG * 10^PROPDMGEXP
Repeat steps 1-5 to calculate the crop damages
options(scipen = 999)
# 1. Display more details on the PROPDMGEXP variable
summary(FilteredStormData$PROPDMGEXP)
## - ? + 0 1 2 3 4 5
## 294516 1 5 4 187 25 13 4 3 26
## 6 7 8 B h H K m M
## 3 5 1 37 0 6 378706 6 7952
# 2. Assign the numeric value which will be exponentially applied for calculating the Financial Damages
# B for billion, M for million, k for kilo, h for hundred;
FilteredStormData$PROPDMGEXP <- as.character(FilteredStormData$PROPDMGEXP)
FilteredStormData$PROPDMGEXP[toupper(FilteredStormData$PROPDMGEXP) == "B" ] <- "9"
FilteredStormData$PROPDMGEXP[toupper(FilteredStormData$PROPDMGEXP) == "M" ] <- "6"
FilteredStormData$PROPDMGEXP[toupper(FilteredStormData$PROPDMGEXP) == "K" ] <- "3"
FilteredStormData$PROPDMGEXP[toupper(FilteredStormData$PROPDMGEXP) == "H" ] <- "2"
FilteredStormData$PROPDMGEXP[toupper(FilteredStormData$PROPDMGEXP) == "" ] <- "0"
# 3. Check for missing values
sum(is.na(as.numeric(FilteredStormData$PROPDMGEXP)))
## Warning: NAs introduced by coercion
## [1] 10
FilteredStormData$PROPDMGEXP[is.na(as.numeric(FilteredStormData$PROPDMGEXP))] <- 0
## Warning: NAs introduced by coercion
sum(is.na(as.numeric(FilteredStormData$PROPDMGEXP)))
## [1] 0
# 4. Convert the PROPDMGEXP variable to numeric
FilteredStormData$PROPDMGEXP <- as.numeric(FilteredStormData$PROPDMGEXP)
# 5. Compute the Financial Property Damages
FilteredStormData$FinancialDmgPROP <- FilteredStormData$PROPDMG * 10^(FilteredStormData$PROPDMGEXP)
# Repeat the same process for the CROPDMG
# 1. Display more details on CropDamageEXP variable
summary(FilteredStormData$CROPDMGEXP)
## ? 0 2 B k K m M
## 400088 5 8 1 7 3 279483 1 1904
# 2. Assign the numeric value which will be exponentially applied for calculating the Financial Damages
# B for billion, M for million, k for kilo, h for hundred;
FilteredStormData$CROPDMGEXP <- as.character(FilteredStormData$CROPDMGEXP)
FilteredStormData$CROPDMGEXP[toupper(FilteredStormData$CROPDMGEXP) == "B" ] <- "9"
FilteredStormData$CROPDMGEXP[toupper(FilteredStormData$CROPDMGEXP) == "M" ] <- "6"
FilteredStormData$CROPDMGEXP[toupper(FilteredStormData$CROPDMGEXP) == "K" ] <- "3"
FilteredStormData$CROPDMGEXP[toupper(FilteredStormData$CROPDMGEXP) == "H" ] <- "2"
FilteredStormData$CROPDMGEXP[toupper(FilteredStormData$CROPDMGEXP) == "" ] <- "0"
# 3. Check for missing values
sum(is.na(as.numeric(FilteredStormData$CROPDMGEXP)))
## Warning: NAs introduced by coercion
## [1] 5
FilteredStormData$CROPDMGEXP[is.na(as.numeric(FilteredStormData$CROPDMGEXP))] <- 0
## Warning: NAs introduced by coercion
sum(is.na(as.numeric(FilteredStormData$CROPDMGEXP)))
## [1] 0
# 4. Convert the CROPDMGEXP variable to numeric
FilteredStormData$CROPDMGEXP <- as.numeric(FilteredStormData$CROPDMGEXP)
# 5. Compute the Financial CROP Damages
FilteredStormData$FinancialDmgCROP <- FilteredStormData$CROPDMG * 10^(FilteredStormData$CROPDMGEXP)
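The property and crop conversions above follow the same rules, so they can be expressed as one lookup. The sketch below is an illustration under stated assumptions: `exp_to_power` is a hypothetical helper name, the sample values are made up, and the mapping simply mirrors the steps above (letters to powers, digits kept as they are, blank and unrecognised codes set to 0).

```r
# Hypothetical helper: turn an exponent code into a power of ten.
# B = 9, M = 6, K = 3, H = 2 (case-insensitive); single digits keep their value;
# blank and unrecognised codes such as "-", "?" and "+" become 0.
exp_to_power <- function(code) {
  code <- toupper(as.character(code))
  lookup <- c(B = 9, M = 6, K = 3, H = 2)
  power <- unname(lookup[code])            # NA wherever the code is not one of the letters
  is_digit <- grepl("^[0-9]$", code)
  power[is_digit] <- as.numeric(code[is_digit])
  power[is.na(power)] <- 0
  power
}

# Made-up damage values and codes to exercise each branch
dmg  <- c(0.1, 10, 25, 2, 5)
code <- c("B", "M", "k", "", "?")
financial <- dmg * 10^exp_to_power(code)
print(financial)
```

For example, a PROPDMG of 0.1 with code "B" works out to 0.1 * 10^9, that is 100 million dollars, which matches how the HURRICANE OPAL/HIGH WINDS row shown earlier is interpreted.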
With the financial damages calculated for both property and crops, we can now look at the damage caused by each type of severe weather event. Again, we focus on the 20 most severe types of weather events. We do the following:
1. Group and sum all PropertyDamage for all the event types
2. Assign new column names for the aggregate data set
3. Order the new dataset in the descending order and take out the top 20 for our analysis
4. Refresh the factor levels for the filtered dataset
Repeat steps 1-4 for crop damage
options(scipen = 999)
# 1. Group and sum all PropertyDamages for each event type
PROPDamage <- aggregate( FilteredStormData$FinancialDmgPROP, by=list(FilteredStormData$EVTYPE), sum)
# 2. Assign Column names
names(PROPDamage) <- c("EventType", "Consequences")
# 3. Filter the top 20 Property Damage values
PROPDamage <- PROPDamage[order(-PROPDamage$Consequences), ] [1:20, ]
# 4. Refresh the factor levels with the new Property Damage subset
PROPDamage$EventType <- factor(PROPDamage$EventType, levels = PROPDamage$EventType)
# Repeat the same process for the Crop Damage
# 1. Group and sum all CropDamages for each event type
CROPDamage <- aggregate( FilteredStormData$FinancialDmgCROP, by=list(FilteredStormData$EVTYPE), sum)
# 2. Assign Column names
names(CROPDamage) <- c("EventType", "Consequences")
# 3. Filter the top 20 Crop Damage values
CROPDamage <- CROPDamage[order(-CROPDamage$Consequences), ] [1:20, ]
# 4. Refresh the factor levels with the new Crop Damage subset
CROPDamage$EventType <- factor(CROPDamage$EventType, levels = CROPDamage$EventType)
Part 1: Our analysis yields the following top 20 severe weather event types by number of fatalities and by number of injuries, respectively:
Fatalities
## EventType Consequences
## 112 EXCESSIVE HEAT 1903
## 666 TORNADO 1545
## 134 FLASH FLOOD 934
## 231 HEAT 924
## 358 LIGHTNING 729
## 144 FLOOD 423
## 461 RIP CURRENT 360
## 288 HIGH WIND 241
## 683 TSTM WIND 241
## 16 AVALANCHE 223
## 462 RIP CURRENTS 204
## 787 WINTER STORM 195
## 233 HEAT WAVE 161
## 607 THUNDERSTORM WIND 131
## 121 EXTREME COLD 126
## 122 EXTREME COLD/WIND CHILL 125
## 254 HEAVY SNOW 115
## 524 STRONG WIND 103
## 280 HIGH SURF 99
## 70 COLD/WIND CHILL 95
Injuries
## EventType Consequences
## 666 TORNADO 21765
## 144 FLOOD 6769
## 112 EXCESSIVE HEAT 6525
## 358 LIGHTNING 4631
## 683 TSTM WIND 3630
## 231 HEAT 2030
## 134 FLASH FLOOD 1734
## 607 THUNDERSTORM WIND 1426
## 787 WINTER STORM 1298
## 313 HURRICANE/TYPHOON 1275
## 288 HIGH WIND 1093
## 206 HAIL 916
## 773 WILDFIRE 911
## 254 HEAVY SNOW 751
## 157 FOG 718
## 771 WILD/FOREST FIRE 545
## 632 THUNDERSTORM WINDS 444
## 103 DUST STORM 420
## 792 WINTER WEATHER 398
## 27 BLIZZARD 385
The following pair of bar charts shows the total fatalities and injuries caused by each severe weather event type across the United States during the period 1995 - 2011:
options(scipen = 999)
# Plot Fatalities graph
g1 <- ggplot(Fatalities, aes(x = EventType, y = Consequences)) +
  geom_bar(stat = "identity", fill = "555") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("Event Type") +
  ylab("Count") + ggtitle("Total Fatalities from\n Severe Weather Events\n in US (1995-2011)")
# Plot Injuries graph
g2 <- ggplot(Injuries, aes(x = EventType, y = Consequences)) +
  geom_bar(stat = "identity", fill = "555") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("Event Type") +
  ylab("Count") + ggtitle("Total Injuries from\n Severe Weather Events\n in US (1995-2011)")
# Arrange the plots side by side in two equal-width columns
grid.arrange(g1, g2, ncol = 2, widths = c(1, 1))
From the above bar charts, we observe that Excessive Heat and Tornado cause the most fatalities, and Tornado causes the most injuries, in the United States from 1995 to 2011.
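Some labels in the fatalities table look like spelling variants of the same event, for example RIP CURRENT and RIP CURRENTS, or TSTM WIND and THUNDERSTORM WIND. The sketch below shows how such variants might be merged before ranking. The mapping in `canonical` is an assumption made for this example and is not part of the analysis in this report; the four totals are copied from the fatalities table above.

```r
# Fatality totals copied from the table above for four labels
fatal <- data.frame(
  EventType    = c("RIP CURRENT", "RIP CURRENTS", "TSTM WIND", "THUNDERSTORM WIND"),
  Consequences = c(360, 204, 241, 131),
  stringsAsFactors = FALSE
)

# Assumed mapping from a variant label to a single label (illustrative only)
canonical <- c("RIP CURRENTS" = "RIP CURRENT", "TSTM WIND" = "THUNDERSTORM WIND")

# Replace variant labels, leaving all other labels unchanged
merged_label <- ifelse(fatal$EventType %in% names(canonical),
                       canonical[fatal$EventType], fatal$EventType)

# Re-total the counts under the merged labels
merged <- aggregate(fatal$Consequences, by = list(EventType = merged_label), FUN = sum)
names(merged)[2] <- "Consequences"
print(merged)
```

Under this assumed mapping, rip currents would total 564 fatalities and thunderstorm wind 372, which would move both up the ranking without changing the two leading event types.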
Part 2: Regarding the economic consequences, the following are the top 20 severe weather event types by property damage and by crop damage, respectively:
PROPDamage
## EventType Consequences
## 144 FLOOD 144022037057
## 313 HURRICANE/TYPHOON 69305840000
## 519 STORM SURGE 43193536000
## 666 TORNADO 24935939545
## 134 FLASH FLOOD 16047794571
## 206 HAIL 15048722103
## 306 HURRICANE 11812819010
## 677 TROPICAL STORM 7653335550
## 288 HIGH WIND 5259785375
## 773 WILDFIRE 4759064000
## 520 STORM SURGE/TIDE 4641188000
## 683 TSTM WIND 4482361440
## 326 ICE STORM 3643555810
## 607 THUNDERSTORM WIND 3399282992
## 310 HURRICANE OPAL 3172846000
## 771 WILD/FOREST FIRE 3001812500
## 247 HEAVY RAIN/SEVERE WEATHER 2500000000
## 787 WINTER STORM 1538047250
## 479 SEVERE THUNDERSTORM 1200310000
## 84 DROUGHT 1046106000
CROPDamage
## EventType Consequences
## 84 DROUGHT 13922066000
## 144 FLOOD 5422810400
## 306 HURRICANE 2741410000
## 206 HAIL 2614127070
## 313 HURRICANE/TYPHOON 2607872800
## 134 FLASH FLOOD 1343915000
## 121 EXTREME COLD 1292473000
## 179 FROST/FREEZE 1094086000
## 241 HEAVY RAIN 728399800
## 677 TROPICAL STORM 677836000
## 288 HIGH WIND 633561300
## 683 TSTM WIND 553947350
## 112 EXCESSIVE HEAT 492402000
## 607 THUNDERSTORM WIND 414354000
## 231 HEAT 401411500
## 159 FREEZE 396225000
## 666 TORNADO 296595770
## 773 WILDFIRE 295472800
## 76 DAMAGING FREEZE 262100000
## 117 EXCESSIVE WETNESS 142000000
The following pair of bar charts shows the total property damage and crop damage caused by each severe weather event type across the United States during the period 1995 - 2011:
options(scipen = 999)
# Plot Property Damage graph
g3 <- ggplot(PROPDamage, aes(x = EventType, y = Consequences)) +
  geom_bar(stat = "identity", fill = "555") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("Event Type") +
  ylab("Damage (USD)") + ggtitle("Total Property Damages by\n Severe Weather Events\n in US (1995-2011)")
# Plot Crop Damage graph
g4 <- ggplot(CROPDamage, aes(x = EventType, y = Consequences)) +
  geom_bar(stat = "identity", fill = "555") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + xlab("Event Type") +
  ylab("Damage (USD)") + ggtitle("Total Crop Damages by\n Severe Weather Events\n in US (1995-2011)")
# Arrange the plots side by side in two equal-width columns
grid.arrange(g3, g4, ncol = 2, widths = c(1, 1))
From the above bar charts, we observe that Flood and Hurricane/Typhoon cause the most property damage, while Drought causes the most crop damage, followed by Flood, in the United States from 1995 to 2011.
To conclude, from NOAA’s Storm Data we found that Excessive Heat and Tornado are the most harmful with respect to population health, while Flood, Drought and Hurricane/Typhoon have the greatest economic consequences.