1. PROJECT DESCRIPTION

This Project is part of Data Analytic and Reproducible Research Course Assignment from John Hopkins University on Coursera. This is to analyze Storm Database from U.S. National Oceanic and Atmospheric Administration’s (NOAA) The database was collected from year 1950 - November 2011 which consists of : 1. Victims (Fatalities and Injuries) 2. Damage Cost (Property and Crop/Agricultures)

Data Source [https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2]

Data Documentation [https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf]

Data FAQ [https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf]

2. DATA TRANSFORMATION

2.1 IMPORTING DATA

Import data from local directories and summary all variables for initial analytic. Data features will be selected only for the required analysis

# Import required libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ✔ readr     2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Importing raw data from local directory
stormdata <- read.csv("repdata_data_StormData.csv.bz2")
#summarize variable and class 
glimpse(stormdata)
## Rows: 902,297
## Columns: 37
## $ STATE__    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ BGN_DATE   <chr> "4/18/1950 0:00:00", "4/18/1950 0:00:00", "2/20/1951 0:00:0…
## $ BGN_TIME   <chr> "0130", "0145", "1600", "0900", "1500", "2000", "0100", "09…
## $ TIME_ZONE  <chr> "CST", "CST", "CST", "CST", "CST", "CST", "CST", "CST", "CS…
## $ COUNTY     <dbl> 97, 3, 57, 89, 43, 77, 9, 123, 125, 57, 43, 9, 73, 49, 107,…
## $ COUNTYNAME <chr> "MOBILE", "BALDWIN", "FAYETTE", "MADISON", "CULLMAN", "LAUD…
## $ STATE      <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",…
## $ EVTYPE     <chr> "TORNADO", "TORNADO", "TORNADO", "TORNADO", "TORNADO", "TOR…
## $ BGN_RANGE  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ BGN_AZI    <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ BGN_LOCATI <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ END_DATE   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ END_TIME   <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ COUNTY_END <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ COUNTYENDN <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ END_RANGE  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ END_AZI    <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ END_LOCATI <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ LENGTH     <dbl> 14.0, 2.0, 0.1, 0.0, 0.0, 1.5, 1.5, 0.0, 3.3, 2.3, 1.3, 4.7…
## $ WIDTH      <dbl> 100, 150, 123, 100, 150, 177, 33, 33, 100, 100, 400, 400, 2…
## $ F          <int> 3, 2, 2, 2, 2, 2, 2, 1, 3, 3, 1, 1, 3, 3, 3, 4, 1, 1, 1, 1,…
## $ MAG        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FATALITIES <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 4, 0, 0, 0, 0,…
## $ INJURIES   <dbl> 15, 0, 2, 2, 2, 6, 1, 0, 14, 0, 3, 3, 26, 12, 6, 50, 2, 0, …
## $ PROPDMG    <dbl> 25.0, 2.5, 25.0, 2.5, 2.5, 2.5, 2.5, 2.5, 25.0, 25.0, 2.5, …
## $ PROPDMGEXP <chr> "K", "K", "K", "K", "K", "K", "K", "K", "K", "K", "M", "M",…
## $ CROPDMG    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ CROPDMGEXP <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ WFO        <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ STATEOFFIC <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ ZONENAMES  <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ LATITUDE   <dbl> 3040, 3042, 3340, 3458, 3412, 3450, 3405, 3255, 3334, 3336,…
## $ LONGITUDE  <dbl> 8812, 8755, 8742, 8626, 8642, 8748, 8631, 8558, 8740, 8738,…
## $ LATITUDE_E <dbl> 3051, 0, 0, 0, 0, 0, 0, 0, 3336, 3337, 3402, 3404, 0, 3432,…
## $ LONGITUDE_ <dbl> 8806, 0, 0, 0, 0, 0, 0, 0, 8738, 8737, 8644, 8640, 0, 8540,…
## $ REMARKS    <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ REFNUM     <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …

2.2 DATA TIDYING AND CLEANING

Data Tidying and cleaning consists of transforming unknown/confusing value to known and undesrtanable value, removing unnecessary features/columns to simplify the data processing.

# subset required data for analysis only
stormdatarequired <- stormdata[,c(5,6,7,8,23,24,25,26,27,28)]
# filter only the important event which is FATALITIES > 0 INJURIES > 0 PROPDMG > 0 and CROPDMG > 0 and exclude EVTYPE = ?
stormdatarequired_important <- (filter(stormdatarequired,EVTYPE != "?" & (FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0)))
#define table conversion for property dmg for each magnitude and define magnitude symbol to numerical symbol conversion
PROPDMGEXP_Conv <- stormdatarequired_important |> group_by(PROPDMGEXP) |> count()
PROPConv <- c(10^0,10^0,10^0,10^0,10^2,10^3,10^4,10^5,10^6,10^7,10^9,10^2,10^3,10^6,10^2,10^6)
tablepropdmgconv <- cbind(PROPDMGEXP_Conv,PROPConv)
## New names:
## • `` -> `...3`
names(tablepropdmgconv) <- c("symbol","cnt","num_conv_prop")
#define table conversion for crop dmg for each magnitude and define magnitude symbol to numerical symbol conversion
CROPDMGEXP_Conv <- stormdatarequired_important |> group_by(CROPDMGEXP) |> count()
CROPConv <- c(10^0,10^0,10^0,10^9,10^3,10^6,10^3,10^6)
tablecropdmgconv <- cbind(CROPDMGEXP_Conv,CROPConv)
## New names:
## • `` -> `...3`
names(tablecropdmgconv) <- c("symbol","cnt","num_conv_crop")
#creating clean dataframe
# combining tablepropdmgconv & tablecropdmgconv by left join
stormdatarequired_imp_prop <- left_join(stormdatarequired_important,tablepropdmgconv,join_by(PROPDMGEXP == symbol))
stormdatarequired_imp_propcrop <- left_join(stormdatarequired_imp_prop,tablecropdmgconv,join_by(CROPDMGEXP == symbol))
stormdatarequired_imp_propcrop <- mutate(stormdatarequired_imp_propcrop, Total_Victims = FATALITIES+INJURIES,Prop_Damage_Value = PROPDMG*num_conv_prop,Crop_Damage_Value = CROPDMG*num_conv_crop, Total_Damage_Value = Prop_Damage_Value+Crop_Damage_Value)
stormdataDF <- stormdatarequired_imp_propcrop[,c(1:7,9,12,14:18)]
# ready data 
head(stormdataDF)
##   COUNTY COUNTYNAME STATE  EVTYPE FATALITIES INJURIES PROPDMG CROPDMG
## 1     97     MOBILE    AL TORNADO          0       15    25.0       0
## 2      3    BALDWIN    AL TORNADO          0        0     2.5       0
## 3     57    FAYETTE    AL TORNADO          0        2    25.0       0
## 4     89    MADISON    AL TORNADO          0        2     2.5       0
## 5     43    CULLMAN    AL TORNADO          0        2     2.5       0
## 6     77 LAUDERDALE    AL TORNADO          0        6     2.5       0
##   num_conv_prop num_conv_crop Total_Victims Prop_Damage_Value Crop_Damage_Value
## 1          1000             1            15             25000                 0
## 2          1000             1             0              2500                 0
## 3          1000             1             2             25000                 0
## 4          1000             1             2              2500                 0
## 5          1000             1             2              2500                 0
## 6          1000             1             6              2500                 0
##   Total_Damage_Value
## 1              25000
## 2               2500
## 3              25000
## 4               2500
## 5               2500
## 6               2500

2.3 CREATING READY DATA

This procedure is to prepare ready data for analysis purposes

#creating clean dataframe
# combining tablepropdmgconv & tablecropdmgconv by left join
stormdatarequired_imp_prop <- left_join(stormdatarequired_important,tablepropdmgconv,join_by(PROPDMGEXP == symbol))
stormdatarequired_imp_propcrop <- left_join(stormdatarequired_imp_prop,tablecropdmgconv,join_by(CROPDMGEXP == symbol))
stormdatarequired_imp_propcrop <- mutate(stormdatarequired_imp_propcrop, Total_Victims = FATALITIES+INJURIES,Prop_Damage_Value = PROPDMG*num_conv_prop,Crop_Damage_Value = CROPDMG*num_conv_crop, Total_Damage_Value = Prop_Damage_Value+Crop_Damage_Value)

stormdataDF <- stormdatarequired_imp_propcrop[,c(1:7,9,12,14:18)]
# ready data 
head(stormdataDF)
##   COUNTY COUNTYNAME STATE  EVTYPE FATALITIES INJURIES PROPDMG CROPDMG
## 1     97     MOBILE    AL TORNADO          0       15    25.0       0
## 2      3    BALDWIN    AL TORNADO          0        0     2.5       0
## 3     57    FAYETTE    AL TORNADO          0        2    25.0       0
## 4     89    MADISON    AL TORNADO          0        2     2.5       0
## 5     43    CULLMAN    AL TORNADO          0        2     2.5       0
## 6     77 LAUDERDALE    AL TORNADO          0        6     2.5       0
##   num_conv_prop num_conv_crop Total_Victims Prop_Damage_Value Crop_Damage_Value
## 1          1000             1            15             25000                 0
## 2          1000             1             0              2500                 0
## 3          1000             1             2             25000                 0
## 4          1000             1             2              2500                 0
## 5          1000             1             2              2500                 0
## 6          1000             1             6              2500                 0
##   Total_Damage_Value
## 1              25000
## 2               2500
## 3              25000
## 4               2500
## 5               2500
## 6               2500

3. ANALYTIC RESULTS

This part is to prepare data for analytical purpose to highlight the impact of Storm in terms of Victims and Economical Damage ## 3.1 Total Victims (FATALITIES and INJURIES)

head(stormdataDF|>group_by(EVTYPE)|> summarise(sum(Total_Victims),sum(FATALITIES),sum(INJURIES))|> arrange(desc(`sum(Total_Victims)`)) |> rename("Storm Type" = EVTYPE,"Total Victims" = `sum(Total_Victims)`,"Fatalities" = `sum(FATALITIES)`, "Injuries" = `sum(INJURIES)`),10)
## # A tibble: 10 × 4
##    `Storm Type`      `Total Victims` Fatalities Injuries
##    <chr>                       <dbl>      <dbl>    <dbl>
##  1 TORNADO                     96979       5633    91346
##  2 EXCESSIVE HEAT               8428       1903     6525
##  3 TSTM WIND                    7461        504     6957
##  4 FLOOD                        7259        470     6789
##  5 LIGHTNING                    6046        816     5230
##  6 HEAT                         3037        937     2100
##  7 FLASH FLOOD                  2755        978     1777
##  8 ICE STORM                    2064         89     1975
##  9 THUNDERSTORM WIND            1621        133     1488
## 10 WINTER STORM                 1527        206     1321
## Charting
## prepare data for chart 
stormdataDF_victims <- stormdataDF[,c(1:6,11)]
stormdataDF_victims_chart <- stormdataDF_victims |> pivot_longer(cols = c(5:7),names_to = "Type",values_to = "Total")
chart_victim <- stormdataDF_victims_chart |> group_by(EVTYPE,Type) |> summarise(sum(Total)) |> filter(`sum(Total)`> 500)
## `summarise()` has grouped output by 'EVTYPE'. You can override using the
## `.groups` argument.
ggplot(chart_victim,aes(x= fct_reorder(EVTYPE,`sum(Total)`),y = `sum(Total)`))+geom_bar(stat = "identity",aes(fill = Type),position = "dodge")+coord_flip()+labs(x= "Storm Type", y= "Total Victims",title = "Total Victims per Storm Type")+theme_bw()

We can see from the table and chart above that TORNADO is the highest impact storm in terms of Total Victims followed by Excessive Heat, TSTM Wind, Flood and Lightning (see the detail on the table above and chart)

3.2 Total Damage (Property and Crops) in Billion USD

head(stormdataDF|>group_by(EVTYPE)|> summarise(sum(Total_Damage_Value)/10^9,sum(Crop_Damage_Value)/10^9,sum(Prop_Damage_Value)/10^9)|> arrange(desc(`sum(Total_Damage_Value)/10^9`)) |> rename(Storm_Type = EVTYPE, "Total Damage (Billion USD)" = `sum(Total_Damage_Value)/10^9`, "Crop Damage Value (Billion USD)" = `sum(Crop_Damage_Value)/10^9`,  "Prop Damage Value (Billion USD)" = `sum(Prop_Damage_Value)/10^9` ),10)
## # A tibble: 10 × 4
##    Storm_Type        `Total Damage (Billion USD)` Crop Damage Value (Billion U…¹
##    <chr>                                    <dbl>                          <dbl>
##  1 FLOOD                                   150.                         5.66    
##  2 HURRICANE/TYPHOON                        71.9                        2.61    
##  3 TORNADO                                  57.4                        0.415   
##  4 STORM SURGE                              43.3                        0.000005
##  5 HAIL                                     18.8                        3.03    
##  6 FLASH FLOOD                              18.2                        1.42    
##  7 DROUGHT                                  15.0                       14.0     
##  8 HURRICANE                                14.6                        2.74    
##  9 RIVER FLOOD                              10.1                        5.03    
## 10 ICE STORM                                 8.97                       5.02    
## # ℹ abbreviated name: ¹​`Crop Damage Value (Billion USD)`
## # ℹ 1 more variable: `Prop Damage Value (Billion USD)` <dbl>
stormdataDF_dmg <- stormdataDF[,c(1:4,12:14)]
stormdataDF_dmg_chart <- stormdataDF_dmg |> pivot_longer(cols = c(5:7),names_to = "Type",values_to = "Total")
chart_dmg <- stormdataDF_dmg_chart|>group_by(EVTYPE,Type) |> summarise(sum(Total)/10^9) |> filter(`sum(Total)/10^9`> 1) 
## `summarise()` has grouped output by 'EVTYPE'. You can override using the
## `.groups` argument.
ggplot(chart_dmg,aes(x=fct_reorder(EVTYPE,`sum(Total)/10^9`),y=`sum(Total)/10^9`))+geom_bar(stat = "identity",aes(fill = Type),position = "dodge")+coord_flip()+labs(x= "Storm Type",y = "Total Damage in Billion USD",title = "Total Damage per Storm Type in Billion USD") + theme_bw()

# Top states with damage
stormdataDF_dmg_county <- stormdataDF[,c(1:4,14)]
stormdataDF_dmg_county |> group_by(STATE) |> summarise(sum(Total_Damage_Value)/10^9)|>filter(`sum(Total_Damage_Value)/10^9` > 1) |> ggplot(aes(x=fct_reorder(STATE,`sum(Total_Damage_Value)/10^9`),y=`sum(Total_Damage_Value)/10^9`))+geom_bar(fill = "#90d5ff",stat = "identity") + coord_flip()+theme_bw()+labs(x="STATE",y= "Total Damage (Billion USD)",title = "Total Damage by State in Billion USD")

We can see from the table and chart above that Flood is the highest impact storm in terms of Economical Damage followed by Hurricane/Typhoon,Tornado, Storm Surge and Hail (see the detail on the table above and chart)

Top 5 Impacted Storm in terms of Economical Damage, CA, LA, FL, MS, TX (see above chart)

4. SUMMARY

From above analytical result from NOAA Storm database, Storm are impacting US National Economy and People as victims where it need to be addressed to relevant department in order to mitigate the impact in order to minimize the impact. Below are the summary :

1. Top 5 Victims (Fatalities and Injuries)

1.1 Tornado, ~96k victims
1.2 Excessive Heat, ~8k victims
1.3 TSTM Wind, 7.4k victims
1.4 Flood, 7.2k victims
1.5 Lightning, 6k victims

2. Top 5 Economical Damage in Billion USD (Property and Crops/Agricultural)

2.1 Flood, ~150
2.2 Hurricane/Typhoon, ~71
2.3 Tornado, ~57
2.4 Storm Surge, ~43
2.5 Hail, ~18

3. Top Impacted State in terms of Economical Damage,CA, LA, FL, MS, TX