The second project of the Coursera course Reproducible Data, requires the student to analyze the health and economic consequences of severe weather between the years 1950 - 2011. The data to be analyzed will be provided by US NOAA database which tracks weather events across the US. The questions to answer are as follows:
First the following packages will be loaded before running further code
library(R.utils)
## Loading required package: R.oo
## Loading required package: R.methodsS3
## R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.26.0 (2024-01-24 05:12:50 UTC) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
## The following object is masked from 'package:R.methodsS3':
##
## throw
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
## The following objects are masked from 'package:base':
##
## attach, detach, load, save
## R.utils v2.12.3 (2023-11-18 01:00:02 UTC) successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
## The following object is masked from 'package:utils':
##
## timestamp
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
library(data.table)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:R.utils':
##
## extract
Next, the storm data file is downloaded and stored
if (!file.exists("stormdata.csv.bz2")) {
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileURL, "stormdata.csv.bz2")
bunzip2("stormdata.csv.bz2", "stormdata.csv", remove=FALSE)
}
storm <- data.table::fread("stormdata.csv", fill=TRUE, header=TRUE)
head(storm)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1: 1.00 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2: 1.00 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3: 1.00 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4: 1.00 6/8/1951 0:00:00 0900 CST 89 MADISON AL
## 5: 1.00 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6: 1.00 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1: TORNADO 0 0 NA
## 2: TORNADO 0 0 NA
## 3: TORNADO 0 0 NA
## 4: TORNADO 0 0 NA
## 5: TORNADO 0 0 NA
## 6: TORNADO 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1: 0 14.0 100 3 0 0 15 25.0
## 2: 0 2.0 150 2 0 0 0 2.5
## 3: 0 0.1 123 2 0 0 2 25.0
## 4: 0 0.0 100 2 0 0 2 2.5
## 5: 0 0.0 150 2 0 0 2 2.5
## 6: 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1: K 0 3040 8812
## 2: K 0 3042 8755
## 3: K 0 3340 8742
## 4: K 0 3458 8626
## 5: K 0 3412 8642
## 6: K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1: 3051 8806 1
## 2: 0 0 2
## 3: 0 0 3
## 4: 0 0 4
## 5: 0 0 5
## 6: 0 0 6
Review the column names and extract only the required names for analysis
names(storm)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
stormdata <- storm %>% select(c("EVTYPE","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP"))
str(stormdata)
## Classes 'data.table' and 'data.frame': 902297 obs. of 7 variables:
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## - attr(*, ".internal.selfref")=<externalptr>
The remaining variables for analysis are:
The next step is to process the data based on the questions to be answered. The first question asks which storm events are most harmful with respect to population health.
The data will be processed by grouping event type with the total injuries and deaths. Then, the data will be grouped in descending order to clearly delineate the most harmful storm type. The goal is to get the data on the same plot for easier comparison
totalInjuries <- aggregate(INJURIES ~ EVTYPE, data = stormdata, FUN = sum)
topInjuries <- totalInjuries[order(-totalInjuries$INJURIES), ]
totalFatalities <- aggregate(FATALITIES ~ EVTYPE, data = stormdata, FUN = sum)
topFatalities <- totalFatalities[order(-totalFatalities$FATALITIES), ]
#### combine the injury and fatality data to one table and order the top 10
topHealth <- merge(topInjuries, topFatalities, all.x = TRUE)
topHealth[is.na(topHealth)] <- 0
topHealth2 <- topHealth[order(-topHealth$INJURIES),][1:10,]
topHealth2 <- gather(topHealth2,key=type, value=value, INJURIES,FATALITIES)
The second question asks which storm types have the greatest economic consequences. The methodology to determine the economic impact will be similar to the injuries and fatalities, but with a few extra steps to aggregate the costs. The data will be combined in order to be viewable on the same plot.
The first step is to create a function to set the property damage exponents to real numbers. This will allow us to collect the damage costs to property and crops
cost <- function(x) {
if (x == "H")
100
else if (x == "K")
1000
else if (x == "M")
1e+06
else if (x == "B")
1e+09
else if (x == "")
1
else if (x == "m")
1e+06
else if (x == "0")
1
else if (x == "5")
1e+05
else if (x == "6")
1e+06
else if (x == "4")
10000
else if (x == "2")
100
else if (x == "3")
1000
else if (x == "h")
100
else if (x == "7")
1e+07
else if (x == "1")
10
else if (x == "8")
1e+08
else
0
}
Next the data will be processed by grouping event type with property and crop damage costs. Then, the data will be grouped in descending order to clearly delineate the greatest economic impact.
econdata <- storm %>% select(c("EVTYPE","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP"))
econdata$Property <- econdata$PROPDMG*sapply(econdata$PROPDMGEXP, FUN = cost)
econdata$Crops <- econdata$CROPDMG*sapply(econdata$CROPDMGEXP, FUN = cost)
dmgValue <- aggregate(Property~EVTYPE,data = econdata, FUN = sum, na.rm = TRUE)
dmgValue <- dmgValue[with(dmgValue,order(-Property)),]
cropValue <- aggregate(Crops~EVTYPE,data = econdata, FUN = sum, na.rm = TRUE)
cropValue <- cropValue[with(cropValue,order(-Crops)),]
#### combine economic losses into one table and order the top 10
topDmg <- merge(dmgValue, cropValue, all.x = TRUE)
topDmg[is.na(topDmg)] <- 0
topDmg2 <- topDmg[order(-topDmg$Property),][1:10,]
topDmg2 <- gather(topDmg2,key=type, value=value, Property,Crops)
The first question asked for the events which caused the most fatalities and injuries. The below plot represents the top 10 storm events which were most harmful to population health
damagePlot <- ggplot(data=topHealth2, aes(reorder(EVTYPE, -value), value, fill=type)) +
geom_bar(position = "dodge", stat="identity") +
labs(x="Storm Type", y="Count") +
theme(axis.text.x = element_text(angle = 45, vjust=0.5)) +
ggtitle("Total Number of Injuries and Fatalities of the top 10 storm types")
print(damagePlot)
Based on the plot, Tornados had the greatest impact on population health.
The second question asked for the storm events which had the greatest economic consequences. The below plot represents the top 10 storm events with the greatest economic consequences.
costPlot <- ggplot(data=topDmg2, aes(reorder(EVTYPE, -value), value, fill=type)) +
geom_bar(position = "dodge", stat="identity") +
labs(x="Storm Type", y="Cost") +
theme(axis.text.x = element_text(angle = 45, vjust=0.5)) +
ggtitle("Total Property and Crop Damage Cost of the top 10 storm types")
print(costPlot)
Based on the plot, Floods had the greatest economic loss impact in the US