Synopsis

In this project, I examine how natural disasters affect population health and the economy in the United States. To do this, I use the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database, which records the characteristics of storms and other natural disasters. Through my data analysis, I found that TORNADO is the event type with the biggest impact on fatalities, injuries, property damage, and crop damage. The following sections walk through the process of my data analysis.

Preparation

1. Required Packages in this RMarkdown

The packages below are used in this RMarkdown file to process and visualize the data.

# Chunk options
knitr::opts_chunk$set(
    echo = TRUE,
    message = FALSE,
    warning = FALSE
)

# For data processing
require(tibble)
## Loading required package: tibble
require(tidyr)
## Loading required package: tidyr
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(readr)
## Loading required package: readr
# For data visualization
require(ggplot2)
## Loading required package: ggplot2
require(patchwork)
## Loading required package: patchwork

2. Download and Load the Dataset

First, I download the bzip2 file from the course website and decompress it to get the CSV file inside. The bzip2 file is about 47 MB, so I recommend downloading it over a fast Internet connection. The decompressed CSV file takes up about 535.6 MB of additional storage. Please confirm that you have enough free space before you execute the code chunk below.

url <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'

if(!file.exists('./repdata_data_StormData.csv.bz2')) {
    download.file(url, './repdata_data_StormData.csv.bz2', method = 'curl')
}

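if(!file.exists('./repdata_data_StormData.csv')) {
    # bzfile() only opens a connection, so copy its contents out in chunks
    # to write the decompressed CSV to disk.
    src <- bzfile('./repdata_data_StormData.csv.bz2', open = 'rb')
    dst <- file('./repdata_data_StormData.csv', open = 'wb')
    while(length(chunk <- readBin(src, 'raw', n = 1e7)) > 0) {
        writeBin(chunk, dst)
    }
    close(src)
    close(dst)
}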
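As an alternative to decompressing the file on disk, read_csv() from readr can read a file ending in .bz2 directly. The sketch below illustrates this on a tiny temporary file; the file contents and the names tmp and sample_df are made up for the example and are not part of the NOAA data.

```r
library(readr)

# Write a tiny bzip2-compressed CSV to a temporary file (toy data, not the NOAA file).
tmp <- tempfile(fileext = '.csv.bz2')
con <- bzfile(tmp, open = 'w')
writeLines(c('EVTYPE,FATALITIES', 'TORNADO,3', 'HAIL,0'), con)
close(con)

# read_csv() decompresses files ending in .bz2 while reading them.
sample_df <- read_csv(tmp)
print(sample_df)
```

Reading the compressed file directly would save the extra disk space, at the cost of decompressing it again on every run.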

Second, I read the CSV file into the variable df.

filePath <- 'repdata_data_StormData.csv'
df <- read_csv(filePath)
head(df, 10)
## # A tibble: 10 × 37
##    STATE__ BGN_DATE  BGN_T…¹ TIME_…² COUNTY COUNT…³ STATE EVTYPE BGN_R…⁴ BGN_AZI
##      <dbl> <chr>     <chr>   <chr>    <dbl> <chr>   <chr> <chr>    <dbl> <chr>  
##  1       1 4/18/195… 0130    CST         97 MOBILE  AL    TORNA…       0 <NA>   
##  2       1 4/18/195… 0145    CST          3 BALDWIN AL    TORNA…       0 <NA>   
##  3       1 2/20/195… 1600    CST         57 FAYETTE AL    TORNA…       0 <NA>   
##  4       1 6/8/1951… 0900    CST         89 MADISON AL    TORNA…       0 <NA>   
##  5       1 11/15/19… 1500    CST         43 CULLMAN AL    TORNA…       0 <NA>   
##  6       1 11/15/19… 2000    CST         77 LAUDER… AL    TORNA…       0 <NA>   
##  7       1 11/16/19… 0100    CST          9 BLOUNT  AL    TORNA…       0 <NA>   
##  8       1 1/22/195… 0900    CST        123 TALLAP… AL    TORNA…       0 <NA>   
##  9       1 2/13/195… 2000    CST        125 TUSCAL… AL    TORNA…       0 <NA>   
## 10       1 2/13/195… 2000    CST         57 FAYETTE AL    TORNA…       0 <NA>   
## # … with 27 more variables: BGN_LOCATI <chr>, END_DATE <chr>, END_TIME <chr>,
## #   COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>, END_AZI <chr>,
## #   END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>, F <dbl>, MAG <dbl>,
## #   FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>, PROPDMGEXP <chr>,
## #   CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <chr>, STATEOFFIC <chr>,
## #   ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>, LATITUDE_E <dbl>,
## #   LONGITUDE_ <dbl>, REMARKS <chr>, REFNUM <dbl>, and abbreviated variable …

3. Exploratory Data Analysis

Before I move on to my exploratory data analysis, I print some information about this dataset to understand its structure.

# To get the column names of this dataset.
colnames(df)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
# To get the unique values of column EVTYPE.
unique(df[, 'EVTYPE'])
## # A tibble: 977 × 1
##    EVTYPE                   
##    <chr>                    
##  1 TORNADO                  
##  2 TSTM WIND                
##  3 HAIL                     
##  4 FREEZING RAIN            
##  5 SNOW                     
##  6 ICE STORM/FLASH FLOOD    
##  7 SNOW/ICE                 
##  8 WINTER STORM             
##  9 HURRICANE OPAL/HIGH WINDS
## 10 THUNDERSTORM WINDS       
## # … with 967 more rows
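The listing above shows closely related labels, such as TSTM WIND and THUNDERSTORM WINDS, counted as separate event types. Below is a minimal sketch of one light cleanup step, assuming that some labels differ only by letter case or surrounding whitespace; the toy vector evtype is made up for illustration. This step does not merge synonyms, and the analysis in this report uses EVTYPE as recorded.

```r
# Toy labels for illustration; the real EVTYPE column has 977 distinct values.
evtype <- c('TSTM WIND', 'tstm wind', ' TSTM WIND', 'Hail', 'HAIL', 'TORNADO')

# Trim surrounding whitespace and upper-case the labels before counting them.
evtype_clean <- toupper(trimws(evtype))

length(unique(evtype))        # distinct labels before cleaning
length(unique(evtype_clean))  # distinct labels after cleaning
```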

Questions

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Data Processing

To keep the plots readable, I keep only the top 20 event types for fatalities and the top 20 for injuries.

harmful <- df %>% 
    group_by(EVTYPE) %>% 
    summarise(
        fatalities = sum(FATALITIES, na.rm=TRUE),
        injuries = sum(INJURIES, na.rm=TRUE)
    ) %>% 
    pivot_longer(
        cols = 2:3,
        names_to = 'category',
        values_to = 'population'
    ) %>% 
    arrange(desc(population))

harmful <- rbind(head(harmful[harmful$category == 'fatalities', ], 20), head(harmful[harmful$category == 'injuries', ], 20))
    
p1 <- ggplot(harmful[harmful$category == 'fatalities', ], aes(population, reorder(EVTYPE, population))) +
    geom_bar(stat = 'identity') +
    labs(
        title = 'Fatalities',
        x = 'Population',
        y = 'Event',
        caption = "Source: U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database"
    ) +
    theme(
        plot.caption = element_text(hjust = 0, face= "italic"),
        plot.title.position = "plot",
        plot.caption.position = "plot"
    )

p2 <- ggplot(harmful[harmful$category == 'injuries', ], aes(population, reorder(EVTYPE, population))) +
    geom_bar(stat = 'identity') +
    labs(
        title = 'Injuries',
        x = 'Population',
        y = 'Event'
    ) +
    theme(
        plot.caption = element_text(hjust = 0, face= "italic"),
        plot.title.position = "plot",
        plot.caption.position = "plot"
    )

p1 + p2
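The top-20 selection above sorts the whole table and then binds two head() calls together. The same idea could also be sketched with group_by() and slice_max() from dplyr, which keeps the n largest rows within each category. The toy table and its numbers below are made up for illustration, and n = 2 stands in for the 20 used in this report.

```r
library(dplyr)
library(tidyr)

# Toy totals per event type (made-up numbers, not taken from the NOAA data).
toy <- tibble::tibble(
    EVTYPE = c('A', 'B', 'C', 'D'),
    fatalities = c(10, 40, 5, 20),
    injuries = c(7, 1, 90, 30)
)

top_n_events <- toy %>%
    pivot_longer(
        cols = c(fatalities, injuries),
        names_to = 'category',
        values_to = 'population'
    ) %>%
    group_by(category) %>%
    # Keep the n largest rows within each category (n = 2 here, 20 in the report).
    slice_max(population, n = 2, with_ties = FALSE) %>%
    ungroup()

print(top_n_events)
```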

Results

As the plots show, TORNADO caused the highest number of fatalities and the highest number of injuries among all event types.


2. Across the United States, which types of events have the greatest economic consequences?

Data Processing

To keep the plots readable, I keep only the top 20 event types for property damage and the top 20 for crop damage.

economy <- df %>% 
    group_by(EVTYPE) %>% 
    summarise(
        property = sum(PROPDMG, na.rm=TRUE),
        crop = sum(CROPDMG, na.rm=TRUE)
    ) %>% 
    pivot_longer(
        cols = 2:3,
        names_to = 'category',
        values_to = 'damages'
    ) %>% 
    arrange(desc(damages))

economy_top20 <- rbind(head(economy[economy$category == 'property', ], 20), head(economy[economy$category == 'crop', ], 20))

p1 <- ggplot(economy_top20[economy_top20$category == 'property', ], aes(damages, reorder(EVTYPE, damages))) +
    geom_bar(stat = 'identity') +
    labs(
        title = 'Property',
        x = 'Damages',
        y = 'Event'
    ) +
    theme(
        plot.caption = element_text(hjust = 0, face= "italic"),
        plot.title.position = "plot",
        plot.caption.position = "plot"
    )

p2 <- ggplot(economy_top20[economy_top20$category == 'crop', ], aes(damages, reorder(EVTYPE, damages))) +
    geom_bar(stat = 'identity') +
    labs(
        title = 'Crop',
        x = 'Damages',
        y = 'Event'
    ) +
    theme(
        plot.caption = element_text(hjust = 0, face= "italic"),
        plot.title.position = "plot",
        plot.caption.position = "plot"
    )

p1 + p2

economy_sum <- economy %>% 
    group_by(EVTYPE) %>% 
    summarise(
        damages = sum(damages)
    ) %>% 
    arrange(desc(damages))

p3 <- ggplot(head(economy_sum, 20), aes(damages, reorder(EVTYPE, damages))) +
    geom_bar(stat = 'identity') +
    labs(
        title = 'Total',
        x = 'Damages',
        y = 'Event'
    ) +
    theme(
        plot.caption = element_text(hjust = 0, face= "italic"),
        plot.title.position = "plot",
        plot.caption.position = "plot"
    )

p3

Results

As the plots show, TORNADO has the greatest economic consequences.
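The damage totals above add up PROPDMG and CROPDMG as recorded, without using the PROPDMGEXP and CROPDMGEXP columns. Below is a minimal sketch of how those columns could be applied, assuming they hold magnitude codes where K means thousands, M millions, and B billions. The helper exp_to_multiplier, the choice to treat every other code as 1, and the toy rows are assumptions made for illustration, and the rankings in this report may change if the codes are applied.

```r
library(dplyr)

# Assumed mapping from magnitude codes to multipliers (hypothetical helper).
# Any other code, including NA, is treated as a multiplier of 1 in this sketch.
exp_to_multiplier <- function(code) {
    code <- toupper(code)
    case_when(
        code == 'K' ~ 1e3,
        code == 'M' ~ 1e6,
        code == 'B' ~ 1e9,
        TRUE ~ 1
    )
}

# Toy rows (made-up values) using the same column names as the storm data.
toy <- tibble::tibble(
    EVTYPE = c('A', 'A', 'B'),
    PROPDMG = c(25, 2.5, 500),
    PROPDMGEXP = c('K', 'M', NA)
)

# Scale each damage figure by its multiplier, then total per event type.
toy_damage <- toy %>%
    mutate(property_usd = PROPDMG * exp_to_multiplier(PROPDMGEXP)) %>%
    group_by(EVTYPE) %>%
    summarise(property_usd = sum(property_usd, na.rm = TRUE))

print(toy_damage)
```

The same helper could be applied to CROPDMG and CROPDMGEXP before the two damage types are added together.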