In this project, I examine how natural disasters affect population health and the economy in the United States. To do this, I use the storm database of the U.S. National Oceanic and Atmospheric Administration (NOAA), which records the characteristics of storms and other severe weather events. In my analysis, TORNADO is the event type with the largest impact on fatalities, injuries, property, and crops. The following sections walk through the analysis step by step.
The packages below are used in this RMarkdown file to analyze the data.
# Chunk options
knitr::opts_chunk$set(
  echo = TRUE,
  message = FALSE,
  warning = FALSE
)
# For data processing
library(tibble)
library(tidyr)
library(dplyr)
library(readr)
# For data visualization
library(ggplot2)
library(patchwork)
First, I download the bzip2 archive from the course website and decompress it to get the CSV file inside. The archive is about 47 MB, so a fast Internet connection is recommended. The decompressed file takes up about 535.6 MB of storage, so please confirm that you have enough free space before running the code chunk below.
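url <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
bz2Path <- './repdata_data_StormData.csv.bz2'
csvPath <- './repdata_data_StormData.csv'
if (!file.exists(bz2Path)) {
  download.file(url, bz2Path, method = 'curl')
}
# bzfile() only opens a connection, so copy its decompressed bytes to disk.
if (!file.exists(csvPath)) {
  input <- bzfile(bz2Path, open = 'rb')
  output <- file(csvPath, open = 'wb')
  while (length(chunk <- readBin(input, what = raw(), n = 1e7)) > 0) writeBin(chunk, output)
  close(input)
  close(output)
}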
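As an alternative to unpacking the archive on disk, R can read a bzip2-compressed CSV through a connection. The following is a minimal sketch on a tiny invented file, not on the storm data itself; the names `tmp`, `toy`, and `toy_back` are hypothetical and used only for illustration.

```r
# Toy stand-in for the storm archive: a tiny CSV written through bzfile().
tmp <- tempfile(fileext = '.csv.bz2')
toy <- data.frame(EVTYPE = c('TORNADO', 'HAIL'), FATALITIES = c(2, 0))
con <- bzfile(tmp, open = 'w')
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() accepts a connection, so the data are decompressed on the fly.
toy_back <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
print(toy_back)
```

readr's read_csv() is also documented to decompress files ending in .bz2 automatically, so passing it the archive path directly should avoid the extra storage for the unpacked CSV.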
Second, I read the CSV file into the variable df.
filePath <- 'repdata_data_StormData.csv'
df <- read_csv(filePath)
head(df, 10)
## # A tibble: 10 × 37
## STATE__ BGN_DATE BGN_T…¹ TIME_…² COUNTY COUNT…³ STATE EVTYPE BGN_R…⁴ BGN_AZI
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 1 4/18/195… 0130 CST 97 MOBILE AL TORNA… 0 <NA>
## 2 1 4/18/195… 0145 CST 3 BALDWIN AL TORNA… 0 <NA>
## 3 1 2/20/195… 1600 CST 57 FAYETTE AL TORNA… 0 <NA>
## 4 1 6/8/1951… 0900 CST 89 MADISON AL TORNA… 0 <NA>
## 5 1 11/15/19… 1500 CST 43 CULLMAN AL TORNA… 0 <NA>
## 6 1 11/15/19… 2000 CST 77 LAUDER… AL TORNA… 0 <NA>
## 7 1 11/16/19… 0100 CST 9 BLOUNT AL TORNA… 0 <NA>
## 8 1 1/22/195… 0900 CST 123 TALLAP… AL TORNA… 0 <NA>
## 9 1 2/13/195… 2000 CST 125 TUSCAL… AL TORNA… 0 <NA>
## 10 1 2/13/195… 2000 CST 57 FAYETTE AL TORNA… 0 <NA>
## # … with 27 more variables: BGN_LOCATI <chr>, END_DATE <chr>, END_TIME <chr>,
## # COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>, END_AZI <chr>,
## # END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>, F <dbl>, MAG <dbl>,
## # FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>, PROPDMGEXP <chr>,
## # CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <chr>, STATEOFFIC <chr>,
## # ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>, LATITUDE_E <dbl>,
## # LONGITUDE_ <dbl>, REMARKS <chr>, REFNUM <dbl>, and abbreviated variable …
Before moving on to the exploratory data analysis, I print some information about the dataset to understand its structure.
# To get the column names of this dataset.
colnames(df)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
# To get the unique values of column EVTYPE.
unique(df[, 'EVTYPE'])
## # A tibble: 977 × 1
## EVTYPE
## <chr>
## 1 TORNADO
## 2 TSTM WIND
## 3 HAIL
## 4 FREEZING RAIN
## 5 SNOW
## 6 ICE STORM/FLASH FLOOD
## 7 SNOW/ICE
## 8 WINTER STORM
## 9 HURRICANE OPAL/HIGH WINDS
## 10 THUNDERSTORM WINDS
## # … with 967 more rows
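The listing above contains overlapping labels such as TSTM WIND and THUNDERSTORM WINDS, which are counted as separate event types when grouping by EVTYPE. The sketch below shows one possible way to merge such variants before aggregating. It is not part of the original analysis: `normalise_evtype()` is a hypothetical helper, and the two substitutions it applies are assumptions that would need to be extended and checked against the full list of labels.

```r
# Hypothetical helper: normalise_evtype() is not part of the original analysis.
normalise_evtype <- function(x) {
  x <- toupper(trimws(x))      # unify case and strip outer whitespace
  x <- gsub('\\s+', ' ', x)    # collapse repeated whitespace
  # Assumed mapping: treat TSTM as an abbreviation of THUNDERSTORM.
  x <- gsub('\\bTSTM\\b', 'THUNDERSTORM', x, perl = TRUE)
  # Assumed mapping: fold the plural WINDS into WIND.
  x <- gsub('\\bWINDS\\b', 'WIND', x, perl = TRUE)
  x
}

toy <- c('TSTM WIND', 'THUNDERSTORM WINDS', ' Tstm  Wind ', 'HAIL')
unique(normalise_evtype(toy))
```

Applied to the real data, this would be something like `df$EVTYPE <- normalise_evtype(df$EVTYPE)` before the `group_by(EVTYPE)` steps.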
To keep the plots readable, I keep only the 20 event types with the most fatalities and the 20 with the most injuries.
harmful <- df %>%
  group_by(EVTYPE) %>%
  summarise(
    fatalities = sum(FATALITIES, na.rm = TRUE),
    injuries = sum(INJURIES, na.rm = TRUE)
  ) %>%
  pivot_longer(
    cols = 2:3,
    names_to = 'category',
    values_to = 'population'
  ) %>%
  arrange(desc(population))
harmful <- rbind(
  head(harmful[harmful$category == 'fatalities', ], 20),
  head(harmful[harmful$category == 'injuries', ], 20)
)
p1 <- ggplot(
  harmful[harmful$category == 'fatalities', ],
  aes(population, reorder(EVTYPE, -population, decreasing = TRUE))
) +
  geom_bar(stat = 'identity') +
  labs(
    title = 'Fatalities',
    x = 'Population',
    y = 'Event',
    caption = "Source: U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database"
  ) +
  theme(
    plot.caption = element_text(hjust = 0, face = "italic"),
    plot.title.position = "plot",
    plot.caption.position = "plot"
  )
p2 <- ggplot(
  harmful[harmful$category == 'injuries', ],
  aes(population, reorder(EVTYPE, -population, decreasing = TRUE))
) +
  geom_bar(stat = 'identity') +
  labs(
    title = 'Injuries',
    x = 'Population',
    y = 'Event'
  ) +
  theme(
    plot.caption = element_text(hjust = 0, face = "italic"),
    plot.title.position = "plot",
    plot.caption.position = "plot"
  )
p1 + p2
As the plots show, TORNADO caused the highest number of both fatalities and injuries among all event types.
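The top-20 selection used for the plots above combines `arrange()`, `head()`, and `rbind()`. A more compact way to express "the largest n rows per category" is dplyr's `slice_max()` on grouped data. The sketch below uses a small invented table with the same column names as `harmful`; the values and the name `top2` are illustrative only.

```r
library(dplyr)

# Toy long-format table with the same columns as `harmful` (values invented).
toy <- data.frame(
  EVTYPE = c('A', 'B', 'C', 'A', 'B', 'C'),
  category = rep(c('fatalities', 'injuries'), each = 3),
  population = c(5, 9, 1, 40, 10, 30)
)

# Keep the two largest rows within each category.
top2 <- toy %>%
  group_by(category) %>%
  slice_max(population, n = 2, with_ties = FALSE) %>%
  ungroup()
print(top2)
```

With `n = 20`, the same pattern would select the rows plotted in this report, and `with_ties = FALSE` guarantees that each category contributes exactly n rows.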
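To keep the plots readable, I keep only the 20 event types with the largest property damage and the 20 with the largest crop damage. Note that the totals below are sums of the raw PROPDMG and CROPDMG columns; the magnitude codes in PROPDMGEXP and CROPDMGEXP are not applied.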
economy <- df %>%
  group_by(EVTYPE) %>%
  summarise(
    property = sum(PROPDMG),
    crop = sum(CROPDMG)
  ) %>%
  pivot_longer(
    cols = 2:3,
    names_to = 'category',
    values_to = 'damages'
  ) %>%
  arrange(desc(damages))
economy_top20 <- rbind(
  head(economy[economy$category == 'property', ], 20),
  head(economy[economy$category == 'crop', ], 20)
)
p1 <- ggplot(
  economy_top20[economy_top20$category == 'property', ],
  aes(damages, reorder(EVTYPE, -damages, decreasing = TRUE))
) +
  geom_bar(stat = 'identity') +
  labs(
    title = 'Property',
    x = 'Damages',
    y = 'Event'
  ) +
  theme(
    plot.caption = element_text(hjust = 0, face = "italic"),
    plot.title.position = "plot",
    plot.caption.position = "plot"
  )
p2 <- ggplot(
  economy_top20[economy_top20$category == 'crop', ],
  aes(damages, reorder(EVTYPE, -damages, decreasing = TRUE))
) +
  geom_bar(stat = 'identity') +
  labs(
    title = 'Crop',
    x = 'Damages',
    y = 'Event'
  ) +
  theme(
    plot.caption = element_text(hjust = 0, face = "italic"),
    plot.title.position = "plot",
    plot.caption.position = "plot"
  )
p1 + p2
economy_sum <- economy %>%
  group_by(EVTYPE) %>%
  summarise(
    damages = sum(damages)
  ) %>%
  arrange(desc(damages))
p3 <- ggplot(
  head(economy_sum, 20),
  aes(damages, reorder(EVTYPE, -damages, decreasing = TRUE))
) +
  geom_bar(stat = 'identity') +
  labs(
    title = 'Total',
    x = 'Damages',
    y = 'Event'
  ) +
  theme(
    plot.caption = element_text(hjust = 0, face = "italic"),
    plot.title.position = "plot",
    plot.caption.position = "plot"
  )
p3
As the plots show, TORNADO has the greatest economic consequences, measured by the summed PROPDMG and CROPDMG values.
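The dataset also has the columns PROPDMGEXP and CROPDMGEXP, which hold a magnitude code for each damage figure. The sketch below shows one way these codes could be turned into dollar amounts before summing. It is not part of the analysis above and rests on stated assumptions: K, M, and B are taken to mean thousand, million, and billion, and any other code is treated as a multiplier of 1, which is a simplification. `exp_multiplier()` and the toy values are hypothetical, so the output says nothing about the ranking in the real data.

```r
# Assumed multipliers: K = thousand, M = million, B = billion.
# Any other code (blank, digit, symbol, NA) is treated as 1 here.
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(code)])
  mult[is.na(mult)] <- 1
  mult
}

# Invented rows that mimic the relevant storm-data columns.
toy <- data.frame(
  EVTYPE = c('FLOOD', 'TORNADO', 'TORNADO'),
  PROPDMG = c(5, 25, 2.5),
  PROPDMGEXP = c('B', 'K', 'M'),
  stringsAsFactors = FALSE
)
toy$prop_usd <- toy$PROPDMG * exp_multiplier(toy$PROPDMGEXP)

# Total damage in dollars per event type.
totals <- aggregate(prop_usd ~ EVTYPE, data = toy, FUN = sum)
print(totals)
```

In the toy table, TORNADO has the larger raw PROPDMG sum (27.5 versus 5), yet FLOOD has the larger dollar total once the codes are applied, which illustrates why the two measures can rank event types differently.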