Symphosis

The dataset for this projects comes from U.S. National Oceanic and Atmospheric Administration’s(NOAA).

Here can read full documentation.

Here is the dataset(47Mb)

The events in the database start in the year 1950 and end in November 2011.

The goal of this analysis is to answer the following question.

In summary, Torando causes greatest fatalities and injuries and flood, drought cause highest impact upon economy. The followings are how did I get those result using with R codes.

Import the required libraries and dataset.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.3     ✓ dplyr   1.0.7
## ✓ tidyr   1.1.3     ✓ stringr 1.4.0
## ✓ readr   2.0.1     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
library(fuzzyjoin)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(reactable)

# url <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
# download.file(url, destfile = 'data.csv.bz2')
data <- data.table::fread('data.csv.bz2')

Data Preprocessing

From the documentation of the dataset, the official events type are 48 and there were about 1000 events in the dataset at EVTYPE column! All that is just typo. So, first clean the EVTYPE column.

The official events are as follows. So, I make a data frame and try to match with EVTYPE column.

events <- c('ASTRONOMICAL LOW TIDE', 'AVALANCHE', 'BLIZZARD', 'COASTAL FLOOD', 'COLD/WIND CHILL', 'DEBRIS FLOW', 'DENSE FOG', 'DENSE SMOKE', 'DROUGHT', 'DUST DEVIL', 'DUST STORM', 'EXCESSIVE HEAT', 'EXTREME COLD/WIND CHILL', 'FLASH FLOOD', 'FLOOD', 'FROST/FREEZE', 'FUNNEL CLOUD', 'HAIL', 'HEAT', 'HEAVY RAIN', 'HEAVY SNOW', 'HIGH SURF', 'HIGH WIND', 'HURRICANE(TYPHOON)', 'ICE STORM', 'LAKE-EFFECT SNOW', 'LAKESHORE FLOOD', 'LIGHTNING', 'MARINE HAIL', 'MARINE HIGH WIND', 'MARINE STRONG WIND', 'MARINE THUNDERSTORM WIND', 'RIP CURRENT', 'SEICHE', 'SLEET', 'STORM SURGE/TIDE', 'STORM WIND', 'THUNDERSTORM WIND', 'TORNADO', 'TROPICAL DEPRESSION', 'TROPICAL STORM', 'TSUNAMI', 'VOLCANIC ASH', 'WATERSPOUT', 'WILDFIRE', 'WINTER STROM', 'WINTER WEATHER')
event_df <- data.frame(EVTYPE = events)

# to mark the original sequence 
data$ID <- seq.int(nrow(data))

# separate from the original dataset as it is too big. 
storm_events <- data %>% select(c( 'ID', 'EVTYPE'))

# before that code, I tried to clean the dataset with 1000 sample data for faster and I saw that most words 'THUNDERSTORM' are written as 'TSTM' so I first replace them. 
storm_events$EVTYPE <- str_replace_all(storm_events$EVTYPE, 'TSTM', 'THUNDERSTORM')

# match with official names. 
storm_events_test <- stringdist_left_join(storm_events, event_df, by='EVTYPE', method='dl')

# after these two steps there were only 21361 rows left to clean. If I tried to increase the max_dist, then most of the words with length less than 6 are changed. for example, HAIL is true, but it changed to HEAT, etc. So, I try to subset with length more than 7. 
storm_events_test_2 <- storm_events_test %>% filter(is.na(EVTYPE.y) & str_count(EVTYPE.x) > 7)

# match with official names, this time, max string distance with 6. 
storm_events_test_2 <- stringdist_left_join(storm_events_test_2, event_df, by=c('EVTYPE.x' = 'EVTYPE'), method='lcs', max_dist=6)

# select only required columns. 
storm_events_test_2 <- storm_events_test_2 %>% select(c("ID", EVTYPE.y = 'EVTYPE'))

# combine them all
clean_storm_words <- storm_events_test %>% 
    select(c('ID', EVTYPE = 'EVTYPE.y')) %>% 
    full_join(storm_events_test_2, by='ID') %>%
    mutate(EVTYPE = ifelse(is.na(EVTYPE.y),EVTYPE, EVTYPE.y)) %>%
    select(c('ID', 'EVTYPE'))

# combine with original data frame
data <- data %>% inner_join(clean_storm_words, by='ID')
data <- data %>% filter(!is.na(EVTYPE.y))
dim(data)
## [1] 890274     39
# only loss about 12000 rows from original data frame of over 900000 rows and with 46 unique names.
unique(data$EVTYPE.y)
##  [1] "TORNADO"                  "THUNDERSTORM WIND"       
##  [3] "HAIL"                     "SEICHE"                  
##  [5] "WINTER STROM"             "HEAVY RAIN"              
##  [7] "LIGHTNING"                "DENSE FOG"               
##  [9] "RIP CURRENT"              "FLASH FLOOD"             
## [11] "HIGH WIND"                "FUNNEL CLOUD"            
## [13] "HEAT"                     "FLOOD"                   
## [15] "WATERSPOUT"               "BLIZZARD"                
## [17] "COLD/WIND CHILL"          "HEAVY SNOW"              
## [19] "COASTAL FLOOD"            "ICE STORM"               
## [21] "AVALANCHE"                "MARINE HAIL"             
## [23] "HIGH SURF"                "DUST STORM"              
## [25] "SLEET"                    "DUST DEVIL"              
## [27] "EXCESSIVE HEAT"           "STORM WIND"              
## [29] "WILDFIRE"                 "WINTER WEATHER"          
## [31] "DROUGHT"                  "STORM SURGE/TIDE"        
## [33] "TROPICAL STORM"           "EXTREME COLD/WIND CHILL" 
## [35] "LAKESHORE FLOOD"          "LAKE-EFFECT SNOW"        
## [37] "FROST/FREEZE"             "VOLCANIC ASH"            
## [39] "TROPICAL DEPRESSION"      "MARINE THUNDERSTORM WIND"
## [41] "HURRICANE(TYPHOON)"       "MARINE HIGH WIND"        
## [43] "TSUNAMI"                  "DENSE SMOKE"             
## [45] "MARINE STRONG WIND"       "ASTRONOMICAL LOW TIDE"

It is ready to answer the question 1.

Which types of events are most harmful with respect to population health?

# Selected the required columns. 
data_select <- data %>% select(c('FATALITIES', 'INJURIES', 'PROPDMG', 'PROPDMGEXP', 'CROPDMG', 'CROPDMGEXP', EVENT = 'EVTYPE.y'))

# group by event and them sum all the fatalities, and all the injuries and select top 10 of each type. 
health_impact <- data_select %>% 
    group_by(EVENT)  %>%
    summarise(fatalities = sum(FATALITIES),
              injuries = sum(INJURIES)) %>%
    gather(key='health_impact', value='total',-EVENT) %>%
    group_by(health_impact) %>% 
    top_n(10,total)


ggplot(health_impact, aes(EVENT, total)) +
    geom_col(fill='darkblue') +
    facet_wrap(~ health_impact, scales='free') +
    theme_bw() +
    theme(axis.text.x = element_text(angle=90)) +
    labs(x = 'EVENTS', y = 'TOTAL IMPACTS', title = 'TOP 10 HEALTH IMPACTS BY NATURAL DISASTERS')

Which types of events have the greatest economic consequences?

First, need to calculate the Economic consequences. In the dataset it is not given as numbers and just exponent columns and number column. So, change to numbers and calculate the total cost.

unique(data_select$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "+" "0" "5" "m" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
unique(data_select$CROPDMGEXP)
## [1] ""  "K" "M" "B" "?" "0" "k" "2"
#changes these letter ... H,h = hundreds, K,k=thousands, M,m=Millions, B,b=billions, + = 1, - = 0, ? = 0, blank = 0, numeric, 0...8 = 10

data_select <- data_select %>%
    mutate(PROPDMGNUM = case_when(
        PROPDMGEXP == 'K' ~ 1000,
        PROPDMGEXP == 'M' | PROPDMGEXP == 'm' ~ 1000000,
        PROPDMGEXP == 'B' ~ 1000000000,
        PROPDMGEXP == 'h' | PROPDMGEXP == 'H' ~ 100,
        PROPDMGEXP == '+' ~ 1,
        PROPDMGEXP == '-' | PROPDMGEXP == '?' | PROPDMGEXP == "" ~ 0,
        TRUE ~ 10),
        CROPDMGNUM = case_when(
            CROPDMGEXP == 'K' | CROPDMGEXP == 'k' ~ 1000,
            CROPDMGEXP == 'M' ~ 1000000,
            CROPDMGEXP == 'B' ~ 1000000000,
            CROPDMGEXP == '?' | CROPDMGEXP == "" ~ 0,
            TRUE ~ 10),
        PROPDMG_TOTAL = PROPDMG * PROPDMGNUM,
        CROPDMG_TOTAL = CROPDMG * CROPDMGNUM)

economic_impact_byevent <- data_select %>% 
    group_by(EVENT) %>%
    summarize(CROP_LOSS = sum(CROPDMG_TOTAL),
              PROPERTY_LOSS = sum(PROPDMG_TOTAL)) %>%
    gather(key='damage', value='total_loss', -'EVENT') %>%
    group_by(damage) %>%
    top_n(10, total_loss)

ggplot(economic_impact_byevent, aes(fct_rev(fct_reorder(EVENT,total_loss)), total_loss/1000000000)) + 
    geom_col(color='blue',fill='darkblue') +
    labs(x='EVENTS', y='Total Economic Loss in Billions', title='TOP 10 ECONOMIC LOSS BY NATURAL DISASTERS') +
    facet_wrap(~ damage, scales='free') +
    theme_bw() + 
    theme(axis.text.x = element_text(angle=90)) +
    scale_y_continuous(labels= function(x){format(x, scientific=FALSE)})

Which years were most affected?

data_select$DATE <- data$BGN_DATE
data_select$DATE <-word(data_select$DATE, 1)

# Make a date column 
data_select$DATE <- mdy(data_select$DATE)

# make a year column with total damages are greater than 0.     
damage_select <- data_select %>% filter(PROPDMG_TOTAL > 0 | CROPDMG_TOTAL > 0)
damage_select$YEAR <- year(damage_select$DATE)
# calculate total fatalities and injuries per year
year_health <- damage_select %>% group_by(YEAR) %>%
    summarize(fatalities = sum(FATALITIES), 
              injuries = sum(INJURIES))%>%
    gather(key='damage_category', value='total', -YEAR) 

# calculate total damages per year
year_economic <- damage_select %>% group_by(YEAR) %>%
    summarize(prop_damage = sum(PROPDMG_TOTAL), 
              crop_damage = sum(CROPDMG_TOTAL)) %>%
    gather(key='damage_category', value='total', -YEAR) 


ggplot(year_health, aes(x=YEAR, y=total, color=damage_category)) + 
    theme_classic() + 
    geom_line(size=1) +
    labs(y = 'Total Numbers of People Effected', title = 'Number of people effected by Natural Disasters')

Which years are the most affected?

year_health %>% group_by(damage_category) %>% top_n(3, total)
## # A tibble: 6 × 3
## # Groups:   damage_category [2]
##    YEAR damage_category total
##   <dbl> <chr>           <dbl>
## 1  1953 fatalities        498
## 2  1998 fatalities        324
## 3  2011 fatalities        650
## 4  1953 injuries         4708
## 5  1998 injuries         9411
## 6  2011 injuries         6111

1998, 2011, 1953 are most effected years. Which type of natural disasters happened in that years?

year_select <- damage_select %>% filter(YEAR %in% c(1998, 2011, 1953)) %>%
    group_by(YEAR, EVENT) %>%
    summarize(count = n()) %>%
    group_by(YEAR) %>%
    top_n(7, count) %>%
    arrange(desc(YEAR), desc(count))
## `summarise()` has grouped output by 'YEAR'. You can override using the `.groups` argument.
reactable(year_select)

7 most common types of natural disasters are selected in each year. In 1953, records only contain for ‘Tornado’.

ggplot(year_economic, aes(x=YEAR, y=total/1000000000)) + 
    theme_classic() + 
    geom_line(size=1, color='navy') +
    facet_wrap(~ damage_category, scales='free', 
               labeller = labeller(damage_category = c('crop_damage' = 'Damage of Crop',
                                                       'prop_damage' = 'Damage of Properties'))) + 
    labs(y= 'Total Economic Loss in Billions', title='Total Economic Loss by Natural Disasters') +
    scale_y_continuous(labels=function(x){format(x, scientific=FALSE)})

For economic perspective, which years are most affected? I will combine the crop damage and properties damages as total damage.

reactable(year_economic %>% 
    spread(key=damage_category, value = total) %>%
    mutate(total = crop_damage + prop_damage,
           loss_in_billions = total/1000000000) %>%
    top_n(5, total) %>%
    select(c('YEAR', 'loss_in_billions')) %>%
    arrange(desc(loss_in_billions)))

In year 2006, 2005, there were most economic loss due to natural disasters. What types of disaster?

reactable(damage_select %>% filter(YEAR %in% c(2006, 2005)) %>%
    group_by(YEAR, EVENT) %>%
    summarize(count = n()) %>%
    group_by(YEAR) %>%
    top_n(7, count) %>%
    arrange(desc(YEAR), desc(count)))
## `summarise()` has grouped output by 'YEAR'. You can override using the `.groups` argument.

There are many things to discover. It is just Exploratory Data Analysis not included conclusions, forecasting and so on. I have to try and learn more.