Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data for this assignment come as a comma-separated-value file compressed with the bzip2 algorithm to reduce its size. The file can be downloaded from the course web site; the URL appears in the download code below.
There is also some documentation of the database available, which describes how some of the variables are constructed and defined.
library(tidyverse)
library(plotly)
library(tidytext)
library(dygraphs)
library(cowplot)
library(lubridate)
library(stringi)
downloadDate <- date()
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Catalina 10.15.4
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] stringi_1.4.6 lubridate_1.7.8 cowplot_1.0.0 dygraphs_1.1.1.6
## [5] tidytext_0.2.4 plotly_4.9.2.1 forcats_0.5.0 stringr_1.4.0
## [9] dplyr_0.8.5 purrr_0.3.4 readr_1.3.1 tidyr_1.0.2
## [13] tibble_3.0.1 ggplot2_3.3.0 tidyverse_1.3.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.4 lattice_0.20-40 zoo_1.8-7 assertthat_0.2.1
## [5] digest_0.6.25 R6_2.4.1 cellranger_1.1.0 backports_1.1.6
## [9] reprex_0.3.0 evaluate_0.14 httr_1.4.1 pillar_1.4.3
## [13] rlang_0.4.5 lazyeval_0.2.2 readxl_1.3.1 rstudioapi_0.11
## [17] data.table_1.12.8 Matrix_1.2-18 rmarkdown_2.1 htmlwidgets_1.5.1
## [21] munsell_0.5.0 broom_0.5.5 compiler_3.6.3 janeaustenr_0.1.5
## [25] modelr_0.1.6 xfun_0.12 pkgconfig_2.0.3 htmltools_0.4.0
## [29] tidyselect_1.0.0 fansi_0.4.1 viridisLite_0.3.0 crayon_1.3.4
## [33] dbplyr_1.4.2 withr_2.2.0 SnowballC_0.7.0 grid_3.6.3
## [37] nlme_3.1-145 jsonlite_1.6.1 gtable_0.3.0 lifecycle_0.2.0
## [41] DBI_1.1.0 magrittr_1.5 scales_1.1.0 tokenizers_0.2.1
## [45] cli_2.0.2 fs_1.3.2 xml2_1.3.1 ellipsis_0.3.0
## [49] generics_0.0.2 vctrs_0.2.4 tools_3.6.3 glue_1.4.0
## [53] hms_0.5.3 yaml_2.2.1 colorspace_1.4-1 rvest_0.3.5
## [57] knitr_1.28 haven_2.2.0
Download date is Wed Apr 22 14:19:05 2020.
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
fname <- "./data/maindf.csv.bz2"
if(!file.exists("data")){
dir.create("data")
}
download.file(fileUrl, destfile = fname, method = "curl")
maindf <- read.csv(fname, header = TRUE, na.strings=c("","NA"), stringsAsFactors = FALSE)
The data frame df below contains 902,297 observations of 37 variables.
df <- maindf
str(df)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr NA NA NA NA ...
## $ BGN_LOCATI: chr NA NA NA NA ...
## $ END_DATE : chr NA NA NA NA ...
## $ END_TIME : chr NA NA NA NA ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr NA NA NA NA ...
## $ END_LOCATI: chr NA NA NA NA ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr NA NA NA NA ...
## $ WFO : chr NA NA NA NA ...
## $ STATEOFFIC: chr NA NA NA NA ...
## $ ZONENAMES : chr NA NA NA NA ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr NA NA NA NA ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
# sum(is.na(df$END_TIME))
# sum(is.na(df$END_DATE))
# range(mdy(df$BGN_DATE))
The basic goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. You must use the database to answer the questions below and show the code for your entire analysis. Your analysis can consist of tables, figures, or other summaries. You may use any R package you want to support your analysis.
Data analysis must address the following questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
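As a rough sketch of how the first question can eventually be answered, event types can be ranked by their combined fatality and injury counts. The snippet below is a minimal illustration run directly against df; it only covers population health and ignores the damage unit-of-measure issue discussed later, and the column names fatalities, injuries, and harmed are introduced here for illustration only.
# Sketch: rank event types by combined fatalities and injuries
df %>%
  group_by(EVTYPE) %>%
  summarize(fatalities = sum(FATALITIES),
            injuries = sum(INJURIES),
            harmed = fatalities + injuries) %>%
  arrange(desc(harmed)) %>%
  head(10)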
## We will analyze the beginning date only ----
df$BGN_DATE <- as.POSIXct(df$BGN_DATE,format="%m/%d/%Y")
df$END_DATE <- as.POSIXct(df$END_DATE,format="%m/%d/%Y")
startDate <- min(df$BGN_DATE)
endDate <- max(df$BGN_DATE)
The date range is from 1950-01-03 to 2011-11-30.
naAnalysis <- df %>%
purrr::map_df(function(x) round(mean(is.na(x)),digits = 2)*100) %>%
gather(EVType, naAverage) %>%
print(n = 10)
## # A tibble: 37 x 2
## EVType naAverage
## <chr> <dbl>
## 1 STATE__ 0
## 2 BGN_DATE 0
## 3 BGN_TIME 0
## 4 TIME_ZONE 0
## 5 COUNTY 0
## 6 COUNTYNAME 0
## 7 STATE 0
## 8 EVTYPE 0
## 9 BGN_RANGE 0
## 10 BGN_AZI 61
## # … with 27 more rows
naAnalysis1 <- naAnalysis %>% arrange(naAverage) %>% filter(naAverage>0)
quantile(naAnalysis1$naAverage) # range 16 100
## 0% 25% 50% 75% 100%
## 16.00 29.00 53.50 68.25 100.00
g1 <- naAnalysis %>% ggplot(aes(x = reorder(EVType, naAverage), y = naAverage)) + geom_point() + theme(axis.text.x = element_text(angle = 90, hjust = .1)) + labs(x = "Event Type", y = "NA Average (%)", title = "Missing Data Analysis")
gg1 <- naAnalysis1 %>% ggplot(aes(x = reorder(EVType, naAverage), y = naAverage)) + geom_point() + theme(axis.text.x = element_text(angle = 90, hjust = .1)) + labs(x = "Event Type", y = "NA Average (%)", title = "Missing Data Analysis")
ggg1 <- subplot(g1,gg1, shareY = TRUE, nrows = 1) %>% layout(autosize = T)
ggg1
The above chart shows the original 37 variables: 23 variables have no missing data and 14 have missing data. Based on our purpose, the key variables are the harm variables (FATALITIES, INJURIES), which have no missing data; EVTYPE, which also has no missing data; and the property/crop damage variables, which do have missing data.
Property damages are divided into 2 variables which are 1. PROPDMG - 0% missing data 2. PROPDMGEXP (unit of measure) - about 52% missing data
Crop damages are divided into 2 variables which are 1. CROPDMG - 0% missing data 2. CROPDMGEXP (unit of measure) - 69% missing data
RgPropEXP <- table(unlist(df$PROPDMGEXP))
RgPropEXPdf <- tibble(PropUnit = names(RgPropEXP),FREQ = as.integer(RgPropEXP))
print(RgPropEXPdf, n = 20)
## # A tibble: 18 x 2
## PropUnit FREQ
## <chr> <int>
## 1 - 1
## 2 ? 8
## 3 + 5
## 4 0 216
## 5 1 25
## 6 2 13
## 7 3 4
## 8 4 4
## 9 5 28
## 10 6 4
## 11 7 5
## 12 8 1
## 13 B 40
## 14 h 1
## 15 H 6
## 16 K 424665
## 17 m 7
## 18 M 11330
plot(log10(RgPropEXP))
RgCropEXP <- table(unlist(df$CROPDMGEXP))
RgCropEXPdf <- tibble(CropUnit = names(RgCropEXP), FREQ = as.integer(RgCropEXP))
print(RgCropEXPdf, n = 10)
## # A tibble: 8 x 2
## CropUnit FREQ
## <chr> <int>
## 1 ? 7
## 2 0 19
## 3 2 1
## 4 B 9
## 5 k 21
## 6 K 281832
## 7 m 1
## 8 M 1994
plot(log10(RgCropEXP))
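Before PROPDMG and CROPDMG can be combined into dollar amounts, the unit codes tabulated above need numeric multipliers. The sketch below assumes the common convention that K/k means thousands, M/m millions, B billions, H/h hundreds, and a bare digit d means 10^d, with every other code (including NA, "-", "?", "+") treated as a multiplier of 1; these interpretations are assumptions, not something stated in the dataset itself.
# Sketch: map exponent codes to numeric multipliers (assumed conventions)
expToMult <- function(e) {
  e <- toupper(ifelse(is.na(e), "", e))
  mult <- rep(1, length(e))            # default for "", "-", "?", "+" (assumption)
  mult[e == "H"] <- 1e2
  mult[e == "K"] <- 1e3
  mult[e == "M"] <- 1e6
  mult[e == "B"] <- 1e9
  isDigit <- grepl("^[0-9]$", e)
  mult[isDigit] <- 10^as.numeric(e[isDigit])
  mult
}
dmgDollars <- df %>%
  mutate(propDollars = PROPDMG * expToMult(PROPDMGEXP),
         cropDollars = CROPDMG * expToMult(CROPDMGEXP))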
### Results of the Property and Crop “Unit” Analysis
PropDmgDF <- df %>% filter(is.na(PROPDMGEXP), PROPDMG > 0)
g3 <- PropDmgDF %>% group_by(Y = year(BGN_DATE), MONTH = month(BGN_DATE,
label = TRUE, abbr = TRUE), STATE, EVTYPE) %>%
summarize(PROPDMG = sum(PROPDMG)) %>% arrange(STATE) ## analysis by month ----
gg3 <- g3 %>% ggplot(aes(x = MONTH, y = PROPDMG)) + geom_col(aes(fill = EVTYPE)) + facet_wrap(~ Y) +
labs(x = "Month", y = "Total Property Damage (missing units)",
title = "Total Property Damage (missing units) by Month") +
theme(axis.text.x = element_text(angle = 90, hjust = .5))
ggg3 <-
g3 %>% ggplot(aes(x = STATE, y = PROPDMG)) + geom_col(aes(fill = EVTYPE)) + labs(x = "State", y = "Total Property Damage (missing units)", title = "Total Property Damage (missing units) by State") +
theme(axis.text.x = element_text(angle = 90, hjust = .5))
ggplotly(gg3)
ggplotly(ggg3)
#### Crop Damages missing data
CropDmgDF <- df %>% filter(is.na(CROPDMGEXP), CROPDMG > 0)
g4 <- CropDmgDF %>% group_by(Y = year(BGN_DATE), MONTH = month(BGN_DATE,
label = TRUE, abbr = TRUE), STATE, EVTYPE) %>%
summarize(CROPDMG = sum(CROPDMG)) %>% arrange(STATE) ## analysis by month ----
gg4 <- g4 %>% ggplot(aes(x = MONTH, y = CROPDMG)) + geom_col(aes(fill = EVTYPE)) + facet_wrap(~ Y) +
labs(x = "Month", y = "Total Crop Damage (missing units)",
title = "Total Crop Damage (missing units) by Month") +
theme(axis.text.x = element_text(angle = 90, hjust = .5))
ggg4 <-
g4 %>% ggplot(aes(x = STATE, y = CROPDMG)) + geom_col(aes(fill = EVTYPE)) + labs(x = "State", y = "Total Crop Damage (missing units)", title = "Total Crop Damage (missing units) by State") +
theme(axis.text.x = element_text(angle = 90, hjust = .5))
ggplotly(gg4)
ggplotly(ggg4)
greatflood <- df %>% group_by(Y = year(BGN_DATE), BGN_DATE) %>% filter(Y == 1993) %>% summarize(PropDmg = sum(PROPDMG)) ## total property damage is 19, but in what unit?
Subset the data needed to answer the questions.
Fstdata <- df %>% group_by(YEAR= year(BGN_DATE), STATE, EVTYPE) %>% summarize(fatTotal = sum(FATALITIES), injTotal =sum(INJURIES), propDmg =sum(PROPDMG), cropDmg = sum(CROPDMG)) %>% arrange(STATE)
library(plotluck)
plotluck(Fstdata, fatTotal ~ propDmg | EVTYPE)
## Factor variable EVTYPE has too many levels (985), truncating to 6
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
### Examine the data as follows:
The column chart below shows how the intensity of fatalities changes between 1950-1992 and 1993-2011.
# examine fatality by year
g3 <-
Fstdata %>% ggplot(aes(x = YEAR, y = (fatTotal))) + geom_col(aes(fill =
STATE)) +geom_hline(yintercept = 100) + geom_vline(xintercept = 1993)+ labs(x = "Year", y = "Total Fatality", title = "Total Fatality by Year")
ggplotly(g3)
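To put a number on the shift the chart suggests, mean annual fatalities can be compared between the two periods separated by the 1993 reference line; a minimal sketch based on the Fstdata summary above:
# Sketch: compare mean annual fatalities before and after 1993
Fstdata %>%
  group_by(YEAR) %>%
  summarize(fatYear = sum(fatTotal)) %>%
  mutate(period = if_else(YEAR < 1993, "1950-1992", "1993-2011")) %>%
  group_by(period) %>%
  summarize(meanAnnualFatalities = mean(fatYear))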
#### Fatality by State
Rows with zero fatalities are filtered out. The column chart shows 10 states with more than 500 total fatalities.
g4 <-
Fstdata %>% filter(fatTotal>0) %>% ggplot(aes(x = STATE, y = (fatTotal))) + geom_col(aes(fill=EVTYPE)) + geom_hline(yintercept = 500) + labs(x = "State", y = "Total Fatality", title = "Total Fatality by State")+theme(axis.text.x=element_text(angle=90,hjust=.5))
ggplotly(g4)
Reviewing the missing data in PROPDMGEXP (keeping only records with PROPDMG greater than 0), we find 76 observations spanning 15 states, with a total PROPDMG of 527.41 but no unit of measure, all within 1993, 1994 and 1995.
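The figures quoted above can be reproduced directly from PropDmgDF, defined earlier; a minimal sketch:
# Sketch: summarize property-damage records that have a value but no unit code
PropDmgDF %>%
  summarize(nObs = n(),
            nStates = n_distinct(STATE),
            totalPropDmg = sum(PROPDMG),
            firstYear = min(year(BGN_DATE)),
            lastYear = max(year(BGN_DATE)))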