Synopsis

From the collected data we conclude that heat waves are the leading cause of injuries and fatalities, while tornadoes cause the largest amount of economic damage, when event types are ranked by their average impact per recorded event.

Introduction

This is a data analysis of the effect of various types of natural phenomena on human life, measured by the injuries and fatalities they cause, and on the economy of the country, measured by crop damage and property damage. The analysis aims to answer two main questions.

  1. Across the United States, which types of events are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

To perform this analysis we use the NOAA Storm Database and carry out a preliminary evaluation. We recommend that the significant events identified in this report be analysed further with more comprehensive data.

Library Inclusions

Several libraries are required for this analysis: dplyr is used for table manipulation, tidyr for reshaping the summary columns into long format for plotting, and ggplot2 for making the plots.

library(dplyr)
library(tidyr)
library(ggplot2)

Data Processing

To start the analysis we first download the storm data set and save it as a file.

DataLink <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destFile <- "./StormData.csv.bz2"
download.file(DataLink,destFile)
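
As a minimal sketch, the download step can be wrapped so that the file is fetched only when it is not already on disk. The helper name fetch_if_missing and its downloader argument are hypothetical and not part of the original analysis. The explicit mode = "wb" is an assumption added for safety with binary archives.

```r
# Hypothetical helper: download a file only when it is not already on disk.
fetch_if_missing <- function(url, dest, downloader = utils::download.file) {
  if (!file.exists(dest)) {
    # mode = "wb" requests a binary transfer so the archive is not altered
    downloader(url, dest, mode = "wb")
  }
  invisible(dest)
}

# Possible usage with the variables defined above (needs network access):
# fetch_if_missing(DataLink, destFile)
```

Re-running the report would then reuse the local copy instead of downloading the archive again.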

Using the read.csv function we then read the storm data set into the variable Data and inspect its structure.

Data <- read.csv(destFile)
str(Data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...


Now we arrive at the preprocessing of the data. From the documentation on the NOAA website we find that crop damage and property damage are each split into two separate fields: a value and a corresponding exponent code (i.e. K, M, B). We need to combine the two fields into a single value in order to compare the metrics. To do this we map each exponent code to its multiplier, multiply the damage value by that multiplier, and store the result in a new column. In the code below an empty exponent is mapped to a multiplier of 0, and any code not listed in the mapping results in NA.

# Map each exponent code to its numeric multiplier (an empty code becomes 0)
Data <- Data %>% mutate(ConvertedCropExpValue =
                          case_when(
                            CROPDMGEXP == "" ~ 0,
                            CROPDMGEXP == "K" ~ 1e3,
                            CROPDMGEXP == "M" ~ 1e6,
                            CROPDMGEXP == "B" ~ 1e9),
                        ConvertedPropExpValue =
                          case_when(
                            PROPDMGEXP == "" ~ 0,
                            PROPDMGEXP == "K" ~ 1e3,
                            PROPDMGEXP == "M" ~ 1e6,
                            PROPDMGEXP == "B" ~ 1e9)
                        )

# Multiply each damage value by its multiplier to get the full damage amount
Data <- Data %>% mutate(CalculatedCropDmg = ConvertedCropExpValue * CROPDMG,
                        CalculatedPropDmg = ConvertedPropExpValue * PROPDMG)
              
head(Data)
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO
##   BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1         0                                               0         NA
## 2         0                                               0         NA
## 3         0                                               0         NA
## 4         0                                               0         NA
## 5         0                                               0         NA
## 6         0                                               0         NA
##   END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1         0                      14.0   100 3   0          0       15    25.0
## 2         0                       2.0   150 2   0          0        0     2.5
## 3         0                       0.1   123 2   0          0        2    25.0
## 4         0                       0.0   100 2   0          0        2     2.5
## 5         0                       0.0   150 2   0          0        2     2.5
## 6         0                       1.5   177 2   0          0        6     2.5
##   PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1          K       0                                         3040      8812
## 2          K       0                                         3042      8755
## 3          K       0                                         3340      8742
## 4          K       0                                         3458      8626
## 5          K       0                                         3412      8642
## 6          K       0                                         3450      8748
##   LATITUDE_E LONGITUDE_ REMARKS REFNUM ConvertedCropExpValue
## 1       3051       8806              1                     0
## 2          0          0              2                     0
## 3          0          0              3                     0
## 4          0          0              4                     0
## 5          0          0              5                     0
## 6          0          0              6                     0
##   ConvertedPropExpValue CalculatedCropDmg CalculatedPropDmg
## 1                  1000                 0             25000
## 2                  1000                 0              2500
## 3                  1000                 0             25000
## 4                  1000                 0              2500
## 5                  1000                 0              2500
## 6                  1000                 0              2500
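
The conversion can also be sketched with a lookup vector, which is one way to handle lowercase or unrecognised exponent codes explicitly. This is an illustration on toy data and not the method used in this report. The names exp_lookup, exp_to_multiplier and toy are hypothetical, and treating unrecognised codes as 0 is an assumption.

```r
# Hypothetical lookup table; the multipliers mirror the K, M, B mapping above.
exp_lookup <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: map exponent codes to multipliers, ignoring case.
exp_to_multiplier <- function(code) {
  mult <- unname(exp_lookup[toupper(code)])
  # Assumption: empty or unrecognised codes fall back to a multiplier of 0
  ifelse(is.na(mult), 0, mult)
}

# Toy data with an upper-case, lower-case, empty and unknown code
toy <- data.frame(PROPDMG    = c(25, 2.5, 1, 3, 7),
                  PROPDMGEXP = c("K", "m", "B", "", "?"))
toy$CalculatedPropDmg <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)
toy
```

Here 25 with code "K" becomes 25000 and 2.5 with code "m" becomes 2500000, while the rows with an empty or unknown code become 0.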

Dangerous Events

Now that we have preprocessed the data, we can turn to the questions posed at the start of the analysis.

Comparison of Disaster with respect to Health of the Population

To compare the disasters with respect to the health of the population, we first select the appropriate columns and group the data by the event type column (EVTYPE) using the group_by function. We then summarize the data by taking the mean of the INJURIES and FATALITIES columns, and add a new column, TotalHealthDamage, which is the sum of the two means. We take the mean for each disaster because the aim of this analysis is to find the harm each phenomenon causes on average. Finally, we arrange the rows in descending order of total damage.

HealthIssues <- Data %>%
  select(EVTYPE, INJURIES, FATALITIES) %>%
  group_by(EVTYPE) %>%
  # .groups = "drop" returns an ungrouped table, so arrange() sorts across all event types
  summarize(Injuries = mean(INJURIES),
            Fatalities = mean(FATALITIES),
            TotalHealthDamage = mean(INJURIES) + mean(FATALITIES),
            .groups = "drop") %>%
  arrange(desc(TotalHealthDamage))
head(HealthIssues)
## # A tibble: 6 × 4
##   EVTYPE                     Injuries Fatalities TotalHealthDamage
##   <chr>                         <dbl>      <dbl>             <dbl>
## 1 Heat Wave                      70         0                 70  
## 2 TROPICAL STORM GORDON          43         8                 51  
## 3 WILD FIRES                     37.5       0.75              38.2
## 4 THUNDERSTORMW                  27         0                 27  
## 5 TORNADOES, TSTM WIND, HAIL      0        25                 25  
## 6 HIGH WIND AND SEAS             20         3                 23  
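
To make the per-event averaging concrete, here is a small sketch on toy data. The data frame toy and the extra Events column (the number of records per event type) are hypothetical additions for illustration only. The count shows how many records each average is based on.

```r
library(dplyr)

# Toy data: event type "A" has three records, event type "B" has one
toy <- data.frame(
  EVTYPE     = c("A", "A", "A", "B"),
  INJURIES   = c(0, 3, 0, 10),
  FATALITIES = c(0, 0, 3, 2)
)

toy_summary <- toy %>%
  group_by(EVTYPE) %>%
  summarize(Events = n(),                      # records behind each average
            Injuries = mean(INJURIES),
            Fatalities = mean(FATALITIES),
            TotalHealthDamage = Injuries + Fatalities,
            .groups = "drop") %>%
  arrange(desc(TotalHealthDamage))
toy_summary
```

In this toy case "B" ranks first with an average of 12 from a single record, while "A" averages 2 over three records, so the count is useful context when reading a ranking by mean.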

We then plot the data from above as a ggplot2 bar plot. To do this we first gather the different health measures into a single column using the pivot_longer function from tidyr. We then plot the event type on the x axis and the average count on the y axis. We have chosen an arbitrary value of 5 as the number of disasters to plot, so the bar plot shows the top 5 disasters.

n_count <- 5

to_plot_health <- pivot_longer(HealthIssues[1:n_count,],
                        cols = c(TotalHealthDamage,Fatalities,Injuries), 
                        names_to = "Type", values_to = "Count")

to_plot_health$EVTYPE <- factor(to_plot_health$EVTYPE,levels = unique(to_plot_health$EVTYPE))

ggplot(to_plot_health, aes(x = EVTYPE, y = Count, fill = Type)) +
  geom_bar(position="dodge", stat="identity") +
  theme_grey(base_size = 22) +
  labs(title = "Top 5 Events that are Harmful to Population Health")

The above graph depicts the top 5 events that have caused the most harm to human life. The data for each event are further split into injuries and fatalities. The green bar depicts the injuries while the pink bar depicts the fatalities. The blue bar is the total damage, which is the sum of the two. From the graph we find that heat waves are the leading cause of injuries, at 70 injuries on average. They are followed by Tropical Storm Gordon and wild fires.
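
The reshaping step behind the plot can be illustrated on a small example. This is a sketch with hypothetical toy values (toy_wide, toy_long). It shows how pivot_longer stacks the measure columns into Type and Count, and how fixing the factor levels to the order of appearance keeps the ranking on the x axis.

```r
library(tidyr)

# Toy summary table, already sorted by rank: "B" first, then "A"
toy_wide <- data.frame(
  EVTYPE     = c("B", "A"),
  Injuries   = c(10, 1),
  Fatalities = c(2, 1)
)

# One row per event type and measure
toy_long <- pivot_longer(toy_wide,
                         cols = c(Injuries, Fatalities),
                         names_to = "Type", values_to = "Count")

# Without this step a character column would be plotted in alphabetical order
toy_long$EVTYPE <- factor(toy_long$EVTYPE, levels = unique(toy_long$EVTYPE))
levels(toy_long$EVTYPE)
```

The two-row wide table becomes a four-row long table, and the factor levels come out as "B" then "A" rather than alphabetically.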

Comparison of Disaster with respect to Economic Damage

To compare the disasters with respect to economic damage, we first select the appropriate columns and group the data by the event type column (EVTYPE) using the group_by function. We then summarize the data by taking the mean of the calculated crop damage and calculated property damage columns, and add a new column, Total_Economic_Damage, which is the sum of the two means. We take the mean for each disaster because the aim of this analysis is to find the damage each phenomenon causes on average. Finally, we arrange the rows in descending order of total damage.

EconomicIssues <- Data %>%
  select(EVTYPE, CalculatedCropDmg, CalculatedPropDmg) %>%
  group_by(EVTYPE) %>%
  # .groups = "drop" returns an ungrouped table, so arrange() sorts across all event types
  summarize(Crop_Damage = mean(CalculatedCropDmg),
            Property_Damage = mean(CalculatedPropDmg),
            Total_Economic_Damage = mean(CalculatedCropDmg) + mean(CalculatedPropDmg),
            .groups = "drop") %>%
  arrange(desc(Total_Economic_Damage))

We then plot the data from above as a ggplot2 bar plot. To do this we first gather the different types of damage into a single column using the pivot_longer function from tidyr. We then plot the event type on the x axis and the average damage on the y axis. We have again chosen an arbitrary value of 5 as the number of disasters to plot, so the bar plot shows the top 5 disasters.

n_count <- 5

to_plot_economics <- pivot_longer(EconomicIssues[1:n_count,],
                        cols = c(Total_Economic_Damage,Crop_Damage,Property_Damage), 
                        names_to = "Type", values_to = "Count")

to_plot_economics$EVTYPE <- factor(to_plot_economics$EVTYPE,levels = unique(to_plot_economics$EVTYPE))

ggplot(to_plot_economics, aes(x = EVTYPE, y = Count, fill = Type)) +
  geom_bar(position="dodge", stat="identity") +
  theme_grey(base_size = 22) +
  labs(title = "Top 5 Events by Economic Damage")

The above graph depicts the top 5 events that have caused the most economic damage. The data for each event are further split into crop damage and property damage. The green bar depicts the property damage while the pink bar depicts the crop damage. The blue bar is the total damage, which is the sum of the two. From the graph we find that tornadoes are the leading cause of damage, averaging over USD 1.5 billion per event. They are followed by heavy rains and hurricanes.
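
Selecting the top rows can also be written with dplyr's slice_max, which picks the rows with the largest values of a column and does not depend on the table being sorted beforehand. The sketch below uses hypothetical toy values and is an alternative to row indexing, not the code used in this report.

```r
library(dplyr)

# Toy summary table in arbitrary order
toy_damage <- data.frame(
  EVTYPE                = c("A", "B", "C", "D"),
  Total_Economic_Damage = c(5, 40, 1, 12)
)

n_top <- 2  # hypothetical cut-off, playing the role of n_count

# Keep the n_top rows with the largest total damage, largest first
top_events <- toy_damage %>% slice_max(Total_Economic_Damage, n = n_top)
top_events
```

With these toy values the result contains "B" and then "D".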

Results

From this analysis we conclude that heat waves are the leading cause of harm to human life, followed by storms, wildfires, thunderstorms, and tornadoes. We also find that the leading cause of economic damage is tornadoes, followed by heavy rains, hurricanes, storm surges, and wildfires. Further analysis of these events may be required to reach a firm conclusion.