Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

Data

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

Environment Setup

library(tidyverse)
library(plotly)
library(tidytext)
library(dygraphs)
library(cowplot)
library(lubridate)
library(stringi)
downloadDate <- date()
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.4
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] stringi_1.4.6    lubridate_1.7.8  cowplot_1.0.0    dygraphs_1.1.1.6
##  [5] tidytext_0.2.4   plotly_4.9.2.1   forcats_0.5.0    stringr_1.4.0   
##  [9] dplyr_0.8.5      purrr_0.3.4      readr_1.3.1      tidyr_1.0.2     
## [13] tibble_3.0.1     ggplot2_3.3.0    tidyverse_1.3.0 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.4        lattice_0.20-40   zoo_1.8-7         assertthat_0.2.1 
##  [5] digest_0.6.25     R6_2.4.1          cellranger_1.1.0  backports_1.1.6  
##  [9] reprex_0.3.0      evaluate_0.14     httr_1.4.1        pillar_1.4.3     
## [13] rlang_0.4.5       lazyeval_0.2.2    readxl_1.3.1      rstudioapi_0.11  
## [17] data.table_1.12.8 Matrix_1.2-18     rmarkdown_2.1     htmlwidgets_1.5.1
## [21] munsell_0.5.0     broom_0.5.5       compiler_3.6.3    janeaustenr_0.1.5
## [25] modelr_0.1.6      xfun_0.12         pkgconfig_2.0.3   htmltools_0.4.0  
## [29] tidyselect_1.0.0  fansi_0.4.1       viridisLite_0.3.0 crayon_1.3.4     
## [33] dbplyr_1.4.2      withr_2.2.0       SnowballC_0.7.0   grid_3.6.3       
## [37] nlme_3.1-145      jsonlite_1.6.1    gtable_0.3.0      lifecycle_0.2.0  
## [41] DBI_1.1.0         magrittr_1.5      scales_1.1.0      tokenizers_0.2.1 
## [45] cli_2.0.2         fs_1.3.2          xml2_1.3.1        ellipsis_0.3.0   
## [49] generics_0.0.2    vctrs_0.2.4       tools_3.6.3       glue_1.4.0       
## [53] hms_0.5.3         yaml_2.2.1        colorspace_1.4-1  rvest_0.3.5      
## [57] knitr_1.28        haven_2.2.0

Data Processing

Download date is Wed Apr 22 14:19:05 2020.

fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
fname <- c("./data/maindf.csv.bz2")

if(!file.exists("data")){
        dir.create("data")
}

download.file(fileUrl,destfile="./data/maindf.csv.bz2",method="curl")

maindf <- read.csv(fname, header = TRUE, na.strings=c("","NA"), stringsAsFactors = FALSE)

The below dataframe df tells us that there are 902,297 observations and 37 variables.

df <- maindf

str(df)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  NA NA NA NA ...
##  $ BGN_LOCATI: chr  NA NA NA NA ...
##  $ END_DATE  : chr  NA NA NA NA ...
##  $ END_TIME  : chr  NA NA NA NA ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  NA NA NA NA ...
##  $ END_LOCATI: chr  NA NA NA NA ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  NA NA NA NA ...
##  $ WFO       : chr  NA NA NA NA ...
##  $ STATEOFFIC: chr  NA NA NA NA ...
##  $ ZONENAMES : chr  NA NA NA NA ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  NA NA NA NA ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...
# sum(is.na(df$END_TIME))
# sum(is.na(df$END_DATE))
# range(mdy(df$BGN_DATE)

Purpose

The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis. Your analysis can consist of tables, figures, or other summaries. You may use any R package you want to support your analysis.

Data analysis must address the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

Strategy in Dimension Reduction

  1. Focus on the Purpose
  2. Analyze missing data
  • Harmful (fatalities and injuries)
  • EVTYPE
  • Economic consequences
  1. Analyze Date and time

Missing Data Analysis

## We will analyze the beginning date only ----
df$BGN_DATE <- as.POSIXct(df$BGN_DATE,format="%m/%d/%Y")
df$END_DATE <- as.POSIXct(df$END_DATE,format="%m/%d/%Y")

startDate <- min(df$BGN_DATE)
endDate <- max(df$BGN_DATE)

Date range from 1950-01-03, **2011-11-30

naAnalysis <- df %>%
    purrr::map_df(function(x) round(mean(is.na(x)),digits = 2)*100) %>%
    gather(EVType, naAverage) %>% 
    print(n = 10)
## # A tibble: 37 x 2
##    EVType     naAverage
##    <chr>          <dbl>
##  1 STATE__            0
##  2 BGN_DATE           0
##  3 BGN_TIME           0
##  4 TIME_ZONE          0
##  5 COUNTY             0
##  6 COUNTYNAME         0
##  7 STATE              0
##  8 EVTYPE             0
##  9 BGN_RANGE          0
## 10 BGN_AZI           61
## # … with 27 more rows
naAnalysis1 <- naAnalysis %>% arrange(naAverage) %>% filter(naAverage>0)
quantile(naAnalysis1$naAverage) # range 16 100
##     0%    25%    50%    75%   100% 
##  16.00  29.00  53.50  68.25 100.00
g1 <- naAnalysis %>% ggplot(aes(x=EVType, y=naAverage)) %>% + geom_point(aes(reorder(EVType, naAverage))) + theme(axis.text.x=element_text(angle=90,hjust=.1))+labs(x= "Event Type", y= "NA Average (%)", title = "Missing Data Analysis") 

gg1 <- naAnalysis1 %>% ggplot(aes(x=EVType, y=naAverage)) %>% + geom_point(aes(reorder(EVType, naAverage))) + theme(axis.text.x=element_text(angle=90,hjust=.1))+labs(x= "Event Type", y= "NA Average (%)", title = "Missing Data Analysis") 

ggg1 <- subplot(g1,gg1, shareY = TRUE, nrows = 1) %>% layout(autosize = T)
ggg1

The above chart shows the original 37 variables, 23 variables has 0 missing data and 14 variables with missing data. Base on our Purpose, we need to look at Harmful (FATALITIES, INJURIES) - 0 missing data, EVTYPE has 0missing data and Property/Crop damages missing data. EVTYPE has 0 missing data.

Property damages are divided into 2 variables which are:

  1. PROPDMG - 0 missing data
  2. PROPDMGEXP (unit of measure) - 52 % missing data

Crop damages are divided into 2 variables which are 1. CROPDMG - 0 missing data 2. CROPDMGEXP (unit of measure) - 69 % missing data

Property and Crop Unit of Measure Analysis

RgPropEXP <- table(unlist(df$PROPDMGEXP))
RgPropEXPdf <- tibble(PropUnit = names(RgPropEXP),FREQ = as.integer(RgPropEXP))
print(RgPropEXPdf, nrow=1:20)
## # A tibble: 18 x 2
##    PropUnit   FREQ
##    <chr>     <int>
##  1 -             1
##  2 ?             8
##  3 +             5
##  4 0           216
##  5 1            25
##  6 2            13
##  7 3             4
##  8 4             4
##  9 5            28
## 10 6             4
## 11 7             5
## 12 8             1
## 13 B            40
## 14 h             1
## 15 H             6
## 16 K        424665
## 17 m             7
## 18 M         11330
plot(log10(RgPropEXP))

RgCropEXP <- table(unlist(df$CROPDMGEXP))
RgCropEXPdf <- tibble(CropUnit = names(RgCropEXP),FREQ = as.integer(RgCropEXP))
gg2 <- plot(log10(RgCropEXP))

print(RgCropEXPdf, nrow=1:10)
## # A tibble: 8 x 2
##   CropUnit   FREQ
##   <chr>     <int>
## 1 ?             7
## 2 0            19
## 3 2             1
## 4 B             9
## 5 k            21
## 6 K        281832
## 7 m             1
## 8 M          1994
plot(log10(RgCropEXP))

### Result of Property and Crop “Unit” - Analysis

Property Damages Missing Data

PropDmgDF <- df %>% filter(is.na(PROPDMGEXP), PROPDMG > 0) 

g3 <- PropDmgDF %>% group_by(Y = year(BGN_DATE), MONTH = month(BGN_DATE, 
       label = TRUE, abbr = TRUE), STATE, EVTYPE) %>% 
        summarize(PROPDMG = sum(PROPDMG)) %>% arrange(STATE) ## analysis by month ----

gg3 <- g3 %>% ggplot(aes(x = MONTH, y = (PROPDMG))) + geom_col(aes(fill = EVTYPE)) + facet_wrap(g3$Y)+
        labs(x = "Month", y = "Total Property Damange (missing units)", 
             title = "Total Property Damage (missing units) by Month") +
        theme(axis.text.x = element_text(angle = 90, hjust =.5))

ggg3 <-
        g3 %>% ggplot(aes(x = STATE, y = (PROPDMG))) + geom_col(aes(fill = EVTYPE)) + labs(x = "State", y = "Total Property Damange (missing units)", title = "Total Property (missing units) by State") +
        theme(axis.text.x = element_text(angle = 90, hjust = .5))
ggplotly(gg3)
ggplotly(ggg3)

#### Crop Damages missing data

CropDmgDF <- df %>% filter(is.na(CROPDMGEXP), CROPDMG > 0) 

g4 <- CropDmgDF %>% group_by(Y = year(BGN_DATE), MONTH = month(BGN_DATE, 
       label = TRUE, abbr = TRUE), STATE, EVTYPE) %>% 
        summarize(CROPDMG = sum(CROPDMG)) %>% arrange(STATE) ## analysis by month ----

gg4 <- g4 %>% ggplot(aes(x = MONTH, y = (CROPDMG))) + geom_col(aes(fill = EVTYPE)) + facet_wrap(g4$Y)+
        labs(x = "Month", y = "Total Property Damange (missing units)", 
             title = "Total Crop Damage (missing units) by Month") +
        theme(axis.text.x = element_text(angle = 90, hjust =.5))

ggg4 <-
        g4 %>% ggplot(aes(x = STATE, y = (CROPDMG))) + geom_col(aes(fill = EVTYPE)) + labs(x = "State", y = "Total Crop Damange (missing units)", title = "Total Property (missing units) by State") +
        theme(axis.text.x = element_text(angle = 90, hjust = .5))
ggplotly(gg4)
ggplotly(ggg4)
greatflood <- df %>% group_by(Y=year(BGN_DATE), BGN_DATE) %>% filter(Y=="1993") %>% summarize(PropDmg = sum(PROPDMG)) ## total property dmg is 19 but at what unit

Reduce the dimension

Subset the data that answers the purpose.

Fstdata <- df %>% group_by(YEAR= year(BGN_DATE), STATE, EVTYPE) %>% summarize(fatTotal = sum(FATALITIES), injTotal =sum(INJURIES), propDmg =sum(PROPDMG), cropDmg = sum(CROPDMG)) %>% arrange(STATE)

library(plotluck)
plotluck(Fstdata, fatTotal~propDMG|EVTYPE)
## Factor variable EVTYPE has too many levels (985), truncating to 6
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

### Examine the data as follows:

  1. Fatality by year and state
  2. Injuries by year and state
  3. Property Damage by year and state
  4. Crop damage by year and state
  5. 1+2
  6. 3+4
  7. Month

Fatality by Year

The column chart below shows the intensity of fatalities change from 1950-1992 to 1993-2011 .

# examine fatality by year
g3 <-
        Fstdata %>% ggplot(aes(x = YEAR, y = (fatTotal))) + geom_col(aes(fill =
                STATE)) +geom_hline(yintercept = 100) + geom_vline(xintercept = 1993)+ labs(x = "Year", y = "Total Fatality", title = "Total Fatality by Year")
ggplotly(g3)

#### Fatality by State

Filtered out fatalities less than 0. Column chart shows 10 states with fatalities more than 500.

g4 <-
        Fstdata %>% filter(fatTotal>0) %>%  ggplot(aes(x = STATE, y = (fatTotal))) + geom_col(aes(fill=EVTYPE)) + geom_hline(yintercept = 500) + labs(x = "State", y = "Total Fatality", title = "Total Fatality by State")+theme(axis.text.x=element_text(angle=90,hjust=.5))
ggplotly(g4)

Property Damage by Year

Review missing data in PROPDMGEXP and remove any PROPDMG less than 0. We see 76 observations which ranges 15 states and total PROPDMG of 527.41 without a proper unit of measure. All within 1993, 1994 and 1995.