The following data analysis explores the U.S. National Oceanic and Atmospheric Administration (NOAA) Storm Database to answer two questions: which types of weather events are most harmful to population health, and which have the greatest economic consequences.
Based on the analysis, the major finding is that tornadoes have the most impact on overall population health. Additionally, the results show that floods and drought cause the most economic damage across the United States.
Load required R libraries:
library(ggplot2)
library(grid)
library(gridExtra)
library(dplyr)
We first download and read the NOAA Storm data. (The documentation can be found in the National Weather Service Storm Data Documentation.)
download.file(url="https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile="repdata-data-StormData.csv.bz2", method="curl")
rawData <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))
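Downloading the archive on every run is slow. One way to avoid that is a small guard that skips the download when the file is already on disk. The following is a minimal sketch; the helper name `fetchOnce` is hypothetical and not part of the original analysis.

```r
# Hypothetical helper (an assumption, not from the original analysis):
# download `url` to `destfile` only when the file is not already present.
# Extra arguments (for example method = "curl") are passed to download.file().
fetchOnce <- function(url, destfile, ...) {
  if (!file.exists(destfile)) {
    download.file(url = url, destfile = destfile, ...)
  }
  invisible(destfile)
}
```

Called with the same URL and destination file as above, followed by the same read.csv(bzfile(...)) step, this would leave the rest of the analysis unchanged.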
After reading the NOAA Storm data, we check its dimensions and display the first few rows of the dataset.
dim(rawData)
## [1] 902297 37
head(rawData)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0                                               0
## 2 TORNADO         0                                               0
## 3 TORNADO         0                                               0
## 4 TORNADO         0                                               0
## 5 TORNADO         0                                               0
## 6 TORNADO         0                                               0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0                      14.0   100 3   0          0
## 2         NA         0                       2.0   150 2   0          0
## 3         NA         0                       0.1   123 2   0          0
## 4         NA         0                       0.0   100 2   0          0
## 5         NA         0                       0.0   150 2   0          0
## 6         NA         0                       1.5   177 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0
## 2        0     2.5          K       0
## 3        2    25.0          K       0
## 4        2     2.5          K       0
## 5        2     2.5          K       0
## 6        6     2.5          K       0
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806              1
## 2     3042      8755          0          0              2
## 3     3340      8742          0          0              3
## 4     3458      8626          0          0              4
## 5     3412      8642          0          0              5
## 6     3450      8748          0          0              6
Then, we subset only attributes that we will use in this analysis:
EVTYPE - Event Types
INJURIES - Count of injuries
FATALITIES - Count of weather-related fatalities
PROPDMG, PROPDMGEXP - Property damage in USD
CROPDMG, CROPDMGEXP - Crop damage in USD
stormDf <- rawData[, c('EVTYPE', 'FATALITIES', 'INJURIES', 'PROPDMG',
'PROPDMGEXP', 'CROPDMG', 'CROPDMGEXP')]
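Because some attributes, such as PROPDMGEXP and CROPDMGEXP, contain coded symbols of magnitude (H for hundreds, K for thousands, M for millions, B for billions), we convert these symbols to powers of ten and use them to estimate the damage in dollars. Digits are used directly as exponents, while the symbols -, +, ? and blank entries are treated as an exponent of 0.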
Calculate Property Damage:
unique(stormDf$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
#Convert symbols of magnitude
stormDf$PROPDMGEXP <- as.character(stormDf$PROPDMGEXP)
stormDf$PROPDMGEXP <- gsub('B', 9, stormDf$PROPDMGEXP, ignore.case = TRUE)
stormDf$PROPDMGEXP <- gsub('M', 6, stormDf$PROPDMGEXP, ignore.case = TRUE)
stormDf$PROPDMGEXP <- gsub('K', 3, stormDf$PROPDMGEXP, ignore.case = TRUE)
stormDf$PROPDMGEXP <- gsub('H', 2, stormDf$PROPDMGEXP, ignore.case = TRUE)
stormDf$PROPDMGEXP <- gsub('\\-|\\+|\\?', 0, stormDf$PROPDMGEXP)
stormDf$PROPDMGEXP <- as.numeric(stormDf$PROPDMGEXP)
stormDf$PROPDMGEXP[is.na(stormDf$PROPDMGEXP)] <- 0
#Calculate property damage in USD
stormDf$PROPDMGVAL <- stormDf$PROPDMG * 10^stormDf$PROPDMGEXP
Calculate Crop Damage:
unique(stormDf$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
#Convert symbols of magnitude
stormDf$CROPDMGEXP <- as.character(stormDf$CROPDMGEXP)
stormDf$CROPDMGEXP <- gsub('B', 9, stormDf$CROPDMGEXP, ignore.case = TRUE)
stormDf$CROPDMGEXP <- gsub('M', 6, stormDf$CROPDMGEXP, ignore.case = TRUE)
stormDf$CROPDMGEXP <- gsub('K', 3, stormDf$CROPDMGEXP, ignore.case = TRUE)
stormDf$CROPDMGEXP <- gsub('H', 2, stormDf$CROPDMGEXP, ignore.case = TRUE)
stormDf$CROPDMGEXP <- gsub('\\-|\\+|\\?', 0, stormDf$CROPDMGEXP)
stormDf$CROPDMGEXP <- as.numeric(stormDf$CROPDMGEXP)
stormDf$CROPDMGEXP[is.na(stormDf$CROPDMGEXP)] <- 0
#Calculate crop damage in USD
stormDf$CROPDMGVAL <- stormDf$CROPDMG * 10^stormDf$CROPDMGEXP
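The property and crop conversions repeat the same steps, so they could be factored into one function. The sketch below applies the same mapping rules as the gsub() calls above through a lookup vector. The function name `expToPower` and the toy data frame are assumptions introduced for illustration only.

```r
# Hypothetical helper (an assumption, not from the original analysis):
# map a magnitude code to a power-of-ten exponent, following the same rules
# as the gsub() steps: H/K/M/B (any case) -> 2/3/6/9, digits -> themselves,
# anything else ("", "-", "+", "?") -> 0.
expToPower <- function(code) {
  code <- toupper(as.character(code))
  lookup <- c(H = 2, K = 3, M = 6, B = 9)
  power <- unname(lookup[code])                 # NA when the code is not a letter in the lookup
  isDigit <- grepl("^[0-9]$", code)
  power[isDigit] <- as.numeric(code[isDigit])   # single digits are used as exponents
  power[is.na(power)] <- 0                      # everything else becomes 10^0
  power
}

# Toy data for illustration only (not taken from the NOAA file)
toy <- data.frame(DMG = c(25, 2.5, 1, 3, 7),
                  EXP = c("K", "m", "B", "", "?"),
                  stringsAsFactors = FALSE)
toy$VAL <- toy$DMG * 10^expToPower(toy$EXP)
print(toy)
```

For example, a PROPDMG of 25.0 with a PROPDMGEXP of K becomes 25.0 * 10^3 = 25,000 USD.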
We will create fatalitiesDf and injureDf data frames to find the most harmful events over the period covered by the data.
fatalitiesDf represents the total deaths for each event type:
fatalitiesDf <- aggregate(FATALITIES ~ EVTYPE, data=stormDf, sum)
fatalitiesDf <- arrange(fatalitiesDf, desc(FATALITIES))
injureDf represents the total injuries for each event type:
injureDf <- aggregate(INJURIES ~ EVTYPE, data=stormDf, sum)
injureDf <- arrange(injureDf, desc(INJURIES))
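Both totals can also be computed in a single aggregate() call by putting cbind() on the left-hand side of the formula, which makes it easy to rank events by combined casualties. The sketch below runs on a small made-up data frame; the names `toy`, `healthDf` and `CASUALTIES` are illustrative, and treating one fatality and one injury as equal is only an assumption made for the example.

```r
# Toy data for illustration only (values are made up)
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(3, 1, 2, 4, 0),
  INJURIES   = c(15, 0, 6, 2, 1)
)

# cbind() on the left-hand side sums both columns per event type in one call
healthDf <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, sum)

# Combined count, used here only to rank the events
healthDf$CASUALTIES <- healthDf$FATALITIES + healthDf$INJURIES
healthDf <- healthDf[order(-healthDf$CASUALTIES), ]
print(healthDf)
```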
We will plot the top 10 events with the highest number of fatalities and injuries.
# ggplot() with geom_bar(stat="identity") is used here because the stat
# argument of qplot() is no longer supported in current ggplot2 releases
p1 <- ggplot(fatalitiesDf[1:10, ], aes(x=reorder(EVTYPE, -FATALITIES), y=FATALITIES)) +
geom_bar(stat="identity") +
labs(x="Event Type", y="Total Number of Fatalities", title="Fatalities") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p2 <- ggplot(injureDf[1:10, ], aes(x=reorder(EVTYPE, -INJURIES), y=INJURIES)) +
geom_bar(stat="identity") +
labs(x="Event Type", y="Total Number of Injuries", title="Injuries") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# gridExtra 2.0.0 and later take the overall title as top= (previously main=)
grid.arrange(p1, p2, ncol=2,
top="Top 10 Most Harmful Events with Respect to Population Health")
We will create propDf and cropDf data frames to identify the greatest economic impacts.
propDf represents the total property damage for each event type:
propDf <- aggregate(PROPDMGVAL ~ EVTYPE, data=stormDf, sum)
propDf <- arrange(propDf, desc(PROPDMGVAL))
cropDf represents the total crop damage for each event type:
cropDf <- aggregate(CROPDMGVAL ~ EVTYPE, data=stormDf, sum)
cropDf <- arrange(cropDf, desc(CROPDMGVAL))
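Property and crop damage can also be added together to rank events by total economic damage. The following sketch uses made-up values; `toy`, `totalDf` and `TOTALDMGVAL` are hypothetical names that do not appear in the original analysis.

```r
# Toy data for illustration only (values are made up, in USD)
toy <- data.frame(
  EVTYPE     = c("FLOOD", "DROUGHT", "FLOOD", "HAIL", "DROUGHT"),
  PROPDMGVAL = c(5e6, 1e5, 2e6, 3e5, 0),
  CROPDMGVAL = c(1e5, 4e6, 0, 2e5, 2e6)
)

# Total damage per record, then summed per event type and sorted descending
toy$TOTALDMGVAL <- toy$PROPDMGVAL + toy$CROPDMGVAL
totalDf <- aggregate(TOTALDMGVAL ~ EVTYPE, data = toy, sum)
totalDf <- totalDf[order(-totalDf$TOTALDMGVAL), ]
print(totalDf)
```

Applied to stormDf, the same three steps would give a single ranking alongside the separate property and crop rankings.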
Finally, we will plot the top 10 events with the greatest economic impact.
# ggplot() with geom_bar(stat="identity") replaces qplot(), whose stat
# argument is no longer supported in current ggplot2 releases
p3 <- ggplot(propDf[1:10, ], aes(x=reorder(EVTYPE, -PROPDMGVAL), y=PROPDMGVAL)) +
geom_bar(stat="identity") +
labs(x="Event Type", y="Total Property Damage (USD)", title="Property Damage") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p4 <- ggplot(cropDf[1:10, ], aes(x=reorder(EVTYPE, -CROPDMGVAL), y=CROPDMGVAL)) +
geom_bar(stat="identity") +
labs(x="Event Type", y="Total Crop Damage (USD)", title="Crop Damage") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
# gridExtra 2.0.0 and later take the overall title as top= (previously main=)
grid.arrange(p3, p4, ncol=2, top="Top 10 Events with the Greatest Economic Impact")
Ultimately, tornadoes have the highest impact on population health across the United States, causing the highest number of both fatalities and injuries. In addition, floods and drought have caused the greatest economic damage in the United States.