Synopsis

This project explores the National Oceanic and Atmospheric Administration (NOAA) Storm Database through exploratory analysis of the health consequences and economic loss caused by severe weather events.

Data from 1950 to 2011 are used in the analysis. Health consequences are estimated from the FATALITIES and INJURIES columns, and financial loss is estimated from crop damage and property damage.

The results show that tornadoes caused the most fatalities and injuries, and floods caused the highest combined crop and property damage.

1. Data Processing

1.1 get the data

This dataset is storm data from the National Weather Service, with 902,297 rows and 37 columns.

We're going to dig into the health consequences and financial loss of the events.

suppressPackageStartupMessages({
        library(data.table)
        library(dplyr)
})


url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
filename <- "StormData.csv.bz2"

if(!file.exists(filename)){
        download.file(url = url, destfile = filename ,method = "curl")
}

data <- fread(filename, header = TRUE, na.strings = "")
data <- as_tibble(data)

head(data)
## # A tibble: 6 × 37
##   STATE__ BGN_DATE   BGN_T…¹ TIME_…² COUNTY COUNT…³ STATE EVTYPE BGN_R…⁴ BGN_AZI
##     <dbl> <chr>      <chr>   <chr>    <dbl> <chr>   <chr> <chr>    <dbl> <chr>  
## 1       1 4/18/1950… 0130    CST         97 MOBILE  AL    TORNA…       0 <NA>   
## 2       1 4/18/1950… 0145    CST          3 BALDWIN AL    TORNA…       0 <NA>   
## 3       1 2/20/1951… 1600    CST         57 FAYETTE AL    TORNA…       0 <NA>   
## 4       1 6/8/1951 … 0900    CST         89 MADISON AL    TORNA…       0 <NA>   
## 5       1 11/15/195… 1500    CST         43 CULLMAN AL    TORNA…       0 <NA>   
## 6       1 11/15/195… 2000    CST         77 LAUDER… AL    TORNA…       0 <NA>   
## # … with 27 more variables: BGN_LOCATI <chr>, END_DATE <chr>, END_TIME <chr>,
## #   COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>, END_AZI <chr>,
## #   END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>, F <int>, MAG <dbl>,
## #   FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>, PROPDMGEXP <chr>,
## #   CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <chr>, STATEOFFIC <chr>,
## #   ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>, LATITUDE_E <dbl>,
## #   LONGITUDE_ <dbl>, REMARKS <chr>, REFNUM <dbl>, and abbreviated variable …

1.2 subsetting the data

Only a few columns will be used for further analysis.

According to the Storm Data Documentation, EVTYPE represents the event type. We define the health consequences as the sum of FATALITIES and INJURIES, and the financial loss as the sum of property damage and crop damage.

# selecting the columns
data01 <- data %>% select("BGN_DATE", "EVTYPE", "STATE", "LENGTH", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")

# date class
suppressPackageStartupMessages({library(lubridate)})
data01$BGN_DATE <- mdy_hms(data01$BGN_DATE)
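As a quick illustration of the date conversion above, here is a minimal sketch of what mdy_hms() does with one string. The sample string is made up for illustration; it only assumes the month/day/year hour:minute:second layout that mdy_hms() expects.

```r
suppressPackageStartupMessages({library(lubridate)})

# made-up sample in month/day/year hour:minute:second layout (an assumption,
# not a value copied from the data)
sample_date <- "4/18/1950 0:00:00"

# mdy_hms() returns a POSIXct date-time (UTC by default)
parsed <- mdy_hms(sample_date)

class(parsed)
year(parsed)
month(parsed)
```

Once the column is a date-time, helpers such as year() can be used to group or filter events by period.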

1.3 processing the exponential problem in the data

To turn the exponent columns (PROPDMGEXP and CROPDMGEXP) into numeric multipliers, we first have to identify which codes they contain.

# check what's in the column
# hmmm...seems that there are numbers, 
# and the alphabetical characters are in both lower and upper case.
table(data01$PROPDMGEXP)
## 
##      -      ?      +      0      1      2      3      4      5      6      7 
##      1      8      5    216     25     13      4      4     28      4      5 
##      8      B      h      H      K      m      M 
##      1     40      1      6 424665      7  11330
table(data01$CROPDMGEXP)
## 
##      ?      0      2      B      k      K      m      M 
##      7     19      1      9     21 281832      1   1994
# Alphabetical characters used to signify magnitude
# include “K” for thousands, “M” for millions, and “B” for billions.

# we have to fix this!

exp_robot <- function(x){
        
        # blank codes were read as NA (na.strings = ""); NA cannot be tested
        # inside if(), so handle it first. A missing code counts as 0 damage.
        if(is.na(x)){return(0)}
        # symbols: multiply by 1, i.e. return the dmg var itself
        else if(x == "-" || x == "?" || x == "+"){return(10^0)}
        else if(x == "h" || x == "H"){return(10^2)}
        else if(x == "K" || x == "k"){return(10^3)}
        else if(x == "m" || x == "M"){return(10^6)}
        else if(x == "B"){return(10^9)}
        # remaining codes are digits, read as powers of ten
        else{return(10^as.numeric(x))}
        
}
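The same code-to-multiplier rules can also be written as a lookup table and applied to a whole column at once, which may be faster than calling a function row by row. The sketch below is only an illustration: exp_lookup and the toy vectors are hypothetical names and values, and the mapping simply mirrors the rules described above (missing code gives 0, symbols give 1, letters give their magnitude, digits give powers of ten).

```r
# hypothetical vectorised helper; not part of the original analysis
exp_lookup <- function(codes) {
        # named vector: exponent code -> multiplier
        multipliers <- c("-" = 1, "?" = 1, "+" = 1,
                         "h" = 1e2, "H" = 1e2,
                         "k" = 1e3, "K" = 1e3,
                         "m" = 1e6, "M" = 1e6,
                         "B" = 1e9)
        # digit codes "0" to "8" are read as 10^digit
        multipliers[as.character(0:8)] <- 10^(0:8)

        codes <- as.character(codes)
        out <- unname(multipliers[codes])
        # a missing code counts as 0, as in the rules above;
        # codes that are not in the table stay NA
        out[is.na(codes)] <- 0
        out
}

# toy values, made up for illustration
toy_codes  <- c("K", "M", NA, "5", "?")
toy_damage <- c(25, 1.5, 10, 2, 3)

toy_damage * exp_lookup(toy_codes)
```

For example, a damage value of 25 with code "K" becomes 25 * 10^3 = 25000, and 1.5 with code "M" becomes 1.5 * 10^6.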

1.4 calculate the financial loss and health consequences

Now we can calculate the financial loss and the health consequences.

# calculate the financial loss, and drop the columns that will no longer be used.
data02 <- data01 %>% 
                mutate(PROP_damage = PROPDMG * sapply(PROPDMGEXP, exp_robot) ) %>%
                mutate(CROP_damage = CROPDMG * sapply(CROPDMGEXP, exp_robot)) %>%
                mutate(financial_loss = PROP_damage + CROP_damage)

data02 <- data02 %>% select(-c("PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP"))

# calculate the health consequence
data02 <- data02 %>% mutate(health_cons = FATALITIES + INJURIES) %>%
                        select(-c("FATALITIES", "INJURIES"))

# what we got now
names(data02)
## [1] "BGN_DATE"       "EVTYPE"         "STATE"          "LENGTH"        
## [5] "PROP_damage"    "CROP_damage"    "financial_loss" "health_cons"

Next, rank the event types by total financial loss and by total health consequences.

data02$EVTYPE <- as.factor(data02$EVTYPE)

moneyloss_rank <- data02 %>% group_by(EVTYPE) %>% summarise(sum = sum(financial_loss)) %>% arrange(desc(sum)) 

hurting_rank <- data02 %>% group_by(EVTYPE) %>% summarise(sum = sum(health_cons)) %>% arrange(desc(sum))
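To see what the group, sum and sort steps do, here is a minimal sketch on a toy table. The rows and the name toy_rank are made up for illustration and are not values from the storm data.

```r
suppressPackageStartupMessages({library(dplyr)})

# toy data, made up for illustration only
toy <- tibble(
        EVTYPE = c("FLOOD", "TORNADO", "FLOOD", "HAIL", "TORNADO"),
        financial_loss = c(100, 40, 250, 10, 30)
)

# one row per event type, total loss per type, largest total first
toy_rank <- toy %>%
        group_by(EVTYPE) %>%
        summarise(sum = sum(financial_loss)) %>%
        arrange(desc(sum))

toy_rank
```

The two FLOOD rows collapse into one total of 350, the two TORNADO rows into 70, and the sorted table puts FLOOD first.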

1.5 plots

Turning the rankings into plots may make the results easier to interpret.

First, look at the bar plot of crop and property damage. As the title states, the estimated total loss on the y axis is in millions.

Flood, hurricane and tornado are the top three events by damage. Flood, at the top of the plot, caused about twice as much damage as the runner-up, hurricane.

library(ggplot2)

ggplot(moneyloss_rank[1:6,], aes( x = reorder(EVTYPE, -sum), y = sum/10^6 )) + 
        geom_col(fill = "pink", alpha = 3/4) + 
        labs(title = "Crop and Property Damage (in million)") + 
        xlab("event type") + ylab("estimated total loss") + 
        # apply theme_bw() first: a complete theme added later would reset the centred title
        theme_bw() + theme(plot.title = element_text(hjust = 0.5))

Next, look at the bar plot of injury and fatality cases.

Tornadoes caused far more injuries and fatalities than any other event type: from the beginning of data collection through 2011, they account for nearly 100 thousand cases.

ggplot(hurting_rank[1:6,], aes( x = reorder(EVTYPE, -sum), y = sum )) + 
        geom_col(fill = "pink", alpha = 3/4) + 
        labs(title = "Injury and Fatality cases") + 
        xlab("event type") + ylab("cases") + 
        # apply theme_bw() first: a complete theme added later would reset the centred title
        theme_bw() + theme(plot.title = element_text(hjust = 0.5))

2. Results

The events that lead to the most negative health consequences are Tornado, Excessive Heat, Thunderstorm Wind, Flood, Lightning and Heat.

The events that lead to the greatest financial loss are Flood, Hurricane/Typhoon, Tornado, Storm, Hail and Flash Flood.