This project is part of the Data Analysis and Reproducible Research course assignment from Johns Hopkins University on Coursera. It analyzes the Storm Database from the U.S. National Oceanic and Atmospheric Administration (NOAA). The database covers events from 1950 through November 2011 and records:
1. Victims (fatalities and injuries)
2. Damage cost (property and crops/agriculture)
Data Source [https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2]
Data Documentation [https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf]
Data FAQ [https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf]
The data are imported from a local directory and all variables are summarized for an initial look; only the features required for the analysis will be selected later. If the compressed raw file is not yet present locally, it can be fetched from the Data Source URL above, as sketched below.
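A minimal download sketch (not part of the original analysis; the file name matches the read.csv() call below):
# download the raw archive once; read.csv() can read the .bz2 file directly
stormfile <- "repdata_data_StormData.csv.bz2"
if (!file.exists(stormfile)) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                destfile = stormfile, mode = "wb")
}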
# Import required libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Importing raw data from local directory
stormdata <- read.csv("repdata_data_StormData.csv.bz2")
# summarize each variable and its class
glimpse(stormdata)
## Rows: 902,297
## Columns: 37
## $ STATE__ <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ BGN_DATE <chr> "4/18/1950 0:00:00", "4/18/1950 0:00:00", "2/20/1951 0:00:0…
## $ BGN_TIME <chr> "0130", "0145", "1600", "0900", "1500", "2000", "0100", "09…
## $ TIME_ZONE <chr> "CST", "CST", "CST", "CST", "CST", "CST", "CST", "CST", "CS…
## $ COUNTY <dbl> 97, 3, 57, 89, 43, 77, 9, 123, 125, 57, 43, 9, 73, 49, 107,…
## $ COUNTYNAME <chr> "MOBILE", "BALDWIN", "FAYETTE", "MADISON", "CULLMAN", "LAUD…
## $ STATE <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",…
## $ EVTYPE <chr> "TORNADO", "TORNADO", "TORNADO", "TORNADO", "TORNADO", "TOR…
## $ BGN_RANGE <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ BGN_AZI <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ BGN_LOCATI <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ END_DATE <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ END_TIME <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ COUNTY_END <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ COUNTYENDN <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ END_RANGE <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ END_AZI <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ END_LOCATI <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ LENGTH <dbl> 14.0, 2.0, 0.1, 0.0, 0.0, 1.5, 1.5, 0.0, 3.3, 2.3, 1.3, 4.7…
## $ WIDTH <dbl> 100, 150, 123, 100, 150, 177, 33, 33, 100, 100, 400, 400, 2…
## $ F <int> 3, 2, 2, 2, 2, 2, 2, 1, 3, 3, 1, 1, 3, 3, 3, 4, 1, 1, 1, 1,…
## $ MAG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FATALITIES <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 4, 0, 0, 0, 0,…
## $ INJURIES <dbl> 15, 0, 2, 2, 2, 6, 1, 0, 14, 0, 3, 3, 26, 12, 6, 50, 2, 0, …
## $ PROPDMG <dbl> 25.0, 2.5, 25.0, 2.5, 2.5, 2.5, 2.5, 2.5, 25.0, 25.0, 2.5, …
## $ PROPDMGEXP <chr> "K", "K", "K", "K", "K", "K", "K", "K", "K", "K", "M", "M",…
## $ CROPDMG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ CROPDMGEXP <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ WFO <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ STATEOFFIC <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ ZONENAMES <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ LATITUDE <dbl> 3040, 3042, 3340, 3458, 3412, 3450, 3405, 3255, 3334, 3336,…
## $ LONGITUDE <dbl> 8812, 8755, 8742, 8626, 8642, 8748, 8631, 8558, 8740, 8738,…
## $ LATITUDE_E <dbl> 3051, 0, 0, 0, 0, 0, 0, 0, 3336, 3337, 3402, 3404, 0, 3432,…
## $ LONGITUDE_ <dbl> 8806, 0, 0, 0, 0, 0, 0, 0, 8738, 8737, 8644, 8640, 0, 8540,…
## $ REMARKS <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ REFNUM <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
Data tidying and cleaning consist of transforming unknown or confusing values into known, understandable ones, and removing unnecessary features/columns to simplify data processing.
# subset only the columns required for the analysis (COUNTY, COUNTYNAME, STATE, EVTYPE,
# FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP);
# see the named-selection sketch after this chunk
stormdatarequired <- stormdata[,c(5,6,7,8,23,24,25,26,27,28)]
# keep only events with an impact (FATALITIES > 0 or INJURIES > 0 or PROPDMG > 0 or CROPDMG > 0)
# and exclude EVTYPE == "?"
stormdatarequired_important <- filter(stormdatarequired, EVTYPE != "?" & (FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0))
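The positional subset above depends on the column order of the raw file. An equivalent, self-documenting selection by name (a sketch, using the column names reported by glimpse()):
# equivalent to stormdata[, c(5:8, 23:28)], but robust to column reordering
stormdatarequired <- select(stormdata,
                            COUNTY, COUNTYNAME, STATE, EVTYPE,
                            FATALITIES, INJURIES,
                            PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)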
# build a conversion table for property damage: map each magnitude symbol
# (K = thousand, M = million, B = billion, etc.) to its numeric multiplier;
# e.g. PROPDMG = 25 with PROPDMGEXP = "K" becomes 25 * 1000 = 25000 USD
PROPDMGEXP_Conv <- stormdatarequired_important |> group_by(PROPDMGEXP) |> count()
PROPConv <- c(10^0,10^0,10^0,10^0,10^2,10^3,10^4,10^5,10^6,10^7,10^9,10^2,10^3,10^6,10^2,10^6)
tablepropdmgconv <- cbind(PROPDMGEXP_Conv,PROPConv)
## New names:
## • `` -> `...3`
names(tablepropdmgconv) <- c("symbol","cnt","num_conv_prop")
# build the equivalent conversion table for crop damage magnitudes
CROPDMGEXP_Conv <- stormdatarequired_important |> group_by(CROPDMGEXP) |> count()
CROPConv <- c(10^0,10^0,10^0,10^9,10^3,10^6,10^3,10^6)
tablecropdmgconv <- cbind(CROPDMGEXP_Conv,CROPConv)
## New names:
## • `` -> `...3`
names(tablecropdmgconv) <- c("symbol","cnt","num_conv_crop")
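Both PROPConv and CROPConv rely on the order in which count() returns the symbols (the C-locale sort: punctuation, then digits, then upper- before lower-case letters). A minimal order-independent alternative (a sketch, not the original approach, assuming the same multiplier meanings as above: h/H = hundred, k/K = thousand, m/M = million, B = billion, a bare digit d = 10^d, everything else = 1):
# hypothetical explicit lookup; match() pairs each symbol with its multiplier
# regardless of sort order, and yields NA for any unanticipated symbol
exp_syms <- c("",  "-", "?", "+", "0", "2",  "3",  "4",  "5",  "6",  "7",
              "h", "H", "k", "K", "m", "M",  "B")
exp_mult <- c(1,   1,   1,   1,   1,   1e2,  1e3,  1e4,  1e5,  1e6,  1e7,
              1e2, 1e2, 1e3, 1e3, 1e6, 1e6,  1e9)
prop_mult <- exp_mult[match(stormdatarequired_important$PROPDMGEXP, exp_syms)]
crop_mult <- exp_mult[match(stormdatarequired_important$CROPDMGEXP, exp_syms)]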
#creating clean dataframe
# combining tablepropdmgconv & tablecropdmgconv by left join
stormdatarequired_imp_prop <- left_join(stormdatarequired_important,tablepropdmgconv,join_by(PROPDMGEXP == symbol))
stormdatarequired_imp_propcrop <- left_join(stormdatarequired_imp_prop,tablecropdmgconv,join_by(CROPDMGEXP == symbol))
stormdatarequired_imp_propcrop <- mutate(stormdatarequired_imp_propcrop,
                                         Total_Victims = FATALITIES + INJURIES,
                                         Prop_Damage_Value = PROPDMG * num_conv_prop,
                                         Crop_Damage_Value = CROPDMG * num_conv_crop,
                                         Total_Damage_Value = Prop_Damage_Value + Crop_Damage_Value)
stormdataDF <- stormdatarequired_imp_propcrop[,c(1:7,9,12,14:18)]
# ready data
head(stormdataDF)
## COUNTY COUNTYNAME STATE EVTYPE FATALITIES INJURIES PROPDMG CROPDMG
## 1 97 MOBILE AL TORNADO 0 15 25.0 0
## 2 3 BALDWIN AL TORNADO 0 0 2.5 0
## 3 57 FAYETTE AL TORNADO 0 2 25.0 0
## 4 89 MADISON AL TORNADO 0 2 2.5 0
## 5 43 CULLMAN AL TORNADO 0 2 2.5 0
## 6 77 LAUDERDALE AL TORNADO 0 6 2.5 0
## num_conv_prop num_conv_crop Total_Victims Prop_Damage_Value Crop_Damage_Value
## 1 1000 1 15 25000 0
## 2 1000 1 0 2500 0
## 3 1000 1 2 25000 0
## 4 1000 1 2 2500 0
## 5 1000 1 2 2500 0
## 6 1000 1 6 2500 0
## Total_Damage_Value
## 1 25000
## 2 2500
## 3 25000
## 4 2500
## 5 2500
## 6 2500
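Since left_join() leaves an NA multiplier for any magnitude symbol missing from the conversion tables, a quick sanity check (not part of the original output) confirms every row was matched; both sums should be 0:
# count rows whose damage multiplier failed to join
sum(is.na(stormdataDF$num_conv_prop))
sum(is.na(stormdataDF$num_conv_crop))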
This completes the data preparation; stormdataDF above is the tidy data set used for the analysis below.
This section analyzes the prepared data to highlight the impact of storms in terms of victims and economic damage.

## 3.1 Total Victims (FATALITIES and INJURIES)
head(stormdataDF |>
       group_by(EVTYPE) |>
       summarise(sum(Total_Victims), sum(FATALITIES), sum(INJURIES)) |>
       arrange(desc(`sum(Total_Victims)`)) |>
       rename("Storm Type" = EVTYPE,
              "Total Victims" = `sum(Total_Victims)`,
              "Fatalities" = `sum(FATALITIES)`,
              "Injuries" = `sum(INJURIES)`),
     10)
## # A tibble: 10 × 4
## `Storm Type` `Total Victims` Fatalities Injuries
## <chr> <dbl> <dbl> <dbl>
## 1 TORNADO 96979 5633 91346
## 2 EXCESSIVE HEAT 8428 1903 6525
## 3 TSTM WIND 7461 504 6957
## 4 FLOOD 7259 470 6789
## 5 LIGHTNING 6046 816 5230
## 6 HEAT 3037 937 2100
## 7 FLASH FLOOD 2755 978 1777
## 8 ICE STORM 2064 89 1975
## 9 THUNDERSTORM WIND 1621 133 1488
## 10 WINTER STORM 1527 206 1321
## Charting
## prepare data for chart
stormdataDF_victims <- stormdataDF[,c(1:6,11)]
stormdataDF_victims_chart <- stormdataDF_victims |> pivot_longer(cols = c(5:7),names_to = "Type",values_to = "Total")
chart_victim <- stormdataDF_victims_chart |> group_by(EVTYPE,Type) |> summarise(sum(Total)) |> filter(`sum(Total)`> 500)
## `summarise()` has grouped output by 'EVTYPE'. You can override using the
## `.groups` argument.
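The summarise() message above is informational only. A variant of the same aggregation that names the summary column and drops the grouping explicitly via dplyr's .groups argument would silence it (a sketch; chart_victim_alt is a hypothetical name, and the original chart_victim object is what the plot below uses):
# same aggregation, with an explicit column name and no residual grouping
chart_victim_alt <- stormdataDF_victims_chart |>
  group_by(EVTYPE, Type) |>
  summarise(Total_Victims = sum(Total), .groups = "drop") |>
  filter(Total_Victims > 500)
Naming the summary column also avoids the backtick-quoted `sum(Total)` references in the plotting code.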
ggplot(chart_victim, aes(x = fct_reorder(EVTYPE, `sum(Total)`), y = `sum(Total)`)) +
  geom_bar(stat = "identity", aes(fill = Type), position = "dodge") +
  coord_flip() +
  labs(x = "Storm Type", y = "Total Victims", title = "Total Victims per Storm Type") +
  theme_bw()
## 3.2 Total Damage (PROPDMG and CROPDMG)
head(stormdataDF |>
       group_by(EVTYPE) |>
       summarise(sum(Total_Damage_Value)/10^9,
                 sum(Crop_Damage_Value)/10^9,
                 sum(Prop_Damage_Value)/10^9) |>
       arrange(desc(`sum(Total_Damage_Value)/10^9`)) |>
       rename(Storm_Type = EVTYPE,
              "Total Damage (Billion USD)" = `sum(Total_Damage_Value)/10^9`,
              "Crop Damage Value (Billion USD)" = `sum(Crop_Damage_Value)/10^9`,
              "Prop Damage Value (Billion USD)" = `sum(Prop_Damage_Value)/10^9`),
     10)
## # A tibble: 10 × 4
## Storm_Type `Total Damage (Billion USD)` Crop Damage Value (Billion U…¹
## <chr> <dbl> <dbl>
## 1 FLOOD 150. 5.66
## 2 HURRICANE/TYPHOON 71.9 2.61
## 3 TORNADO 57.4 0.415
## 4 STORM SURGE 43.3 0.000005
## 5 HAIL 18.8 3.03
## 6 FLASH FLOOD 18.2 1.42
## 7 DROUGHT 15.0 14.0
## 8 HURRICANE 14.6 2.74
## 9 RIVER FLOOD 10.1 5.03
## 10 ICE STORM 8.97 5.02
## # ℹ abbreviated name: ¹​`Crop Damage Value (Billion USD)`
## # ℹ 1 more variable: `Prop Damage Value (Billion USD)` <dbl>
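The tibble print above abbreviates wide output and hides the property-damage column. Assuming the summary pipeline is first stored in a variable (damage_summary is a hypothetical name), all columns can be shown by widening the print:
# width = Inf prints every column instead of abbreviating
print(damage_summary, n = 10, width = Inf)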
stormdataDF_dmg <- stormdataDF[,c(1:4,12:14)]
stormdataDF_dmg_chart <- stormdataDF_dmg |> pivot_longer(cols = c(5:7),names_to = "Type",values_to = "Total")
chart_dmg <- stormdataDF_dmg_chart|>group_by(EVTYPE,Type) |> summarise(sum(Total)/10^9) |> filter(`sum(Total)/10^9`> 1)
## `summarise()` has grouped output by 'EVTYPE'. You can override using the
## `.groups` argument.
ggplot(chart_dmg, aes(x = fct_reorder(EVTYPE, `sum(Total)/10^9`), y = `sum(Total)/10^9`)) +
  geom_bar(stat = "identity", aes(fill = Type), position = "dodge") +
  coord_flip() +
  labs(x = "Storm Type", y = "Total Damage in Billion USD",
       title = "Total Damage per Storm Type in Billion USD") +
  theme_bw()
# Top states with damage
stormdataDF_dmg_county <- stormdataDF[,c(1:4,14)]
stormdataDF_dmg_county |>
  group_by(STATE) |>
  summarise(sum(Total_Damage_Value)/10^9) |>
  filter(`sum(Total_Damage_Value)/10^9` > 1) |>
  ggplot(aes(x = fct_reorder(STATE, `sum(Total_Damage_Value)/10^9`),
             y = `sum(Total_Damage_Value)/10^9`)) +
  geom_bar(fill = "#90d5ff", stat = "identity") +
  coord_flip() +
  theme_bw() +
  labs(x = "STATE", y = "Total Damage (Billion USD)",
       title = "Total Damage by State in Billion USD")
The analysis of the NOAA Storm Database above shows that storms have a substantial impact on both the US population and the national economy, and these findings should be communicated to the relevant agencies so that mitigation efforts can minimize future impact. In summary:
1. TORNADO events caused by far the most victims: 96,979 in total (5,633 fatalities and 91,346 injuries).
2. FLOOD events caused the greatest economic damage, at roughly 150 billion USD, followed by HURRICANE/TYPHOON at about 72 billion USD.