From the collected data we conclude that heat waves are the leading cause of injuries and fatalities, while tornadoes cause the largest amount of economic damage.
This data analysis examines the effect of various types of natural phenomena on human life, measured by the injuries and fatalities they cause, and on the country's economy, measured by crop damage and property damage. It aims to answer two main questions: which events are most harmful to population health, and which cause the most economic damage.
To perform this analysis we use the NOAA Storm Database. The evaluation is preliminary, and the significant events identified in this report should be analysed further with more comprehensive data.
Several libraries are required for this analysis: dplyr is used for table manipulation, tidyr is used to reshape columns into long format for plotting, and ggplot2 is used to make the plots.
library(dplyr)
library(tidyr)
library(ggplot2)
To start the analysis we first download the storm data set and save it as a file.
DataLink <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destFile <- "./StormData.csv.bz2"
download.file(DataLink,destFile)
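Because the download above runs every time the report is rebuilt, it can help to skip it when the file is already on disk. The following is only a minimal sketch of that idea; `download_if_missing` is a hypothetical helper name introduced here for illustration and is not part of the original analysis.

```r
# Hypothetical helper: fetch a file only when it is not already on disk.
download_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    # mode = "wb" requests a binary transfer so the archive is written unchanged
    download.file(url, dest, mode = "wb")
  }
  # Return the destination path invisibly so the call can be chained or ignored
  invisible(dest)
}
```

With this helper, the call `download_if_missing(DataLink, destFile)` could stand in for the plain `download.file` call, under the assumption that a file already present at `destFile` is complete.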
Using read.csv, which can read the bzip2-compressed file directly, we then load the storm data set into the variable Data and inspect its structure.
Data <- read.csv(destFile)
str(Data)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Now we arrive at the preprocessing of the data. From the data book on the NOAA website we find that crop damage and property damage are each split into two separate fields: a value and a corresponding exponent (i.e. K, M, B, for thousands, millions and billions). We need to combine these two fields into a single value in order to compare the metrics. To do this we convert each exponent to its numeric multiplier, multiply the damage value by that multiplier, and store the result in a new column. An empty exponent is mapped to 0, and any exponent code not listed in case_when yields NA.
Data <- Data %>%
  mutate(
    # Numeric multiplier for each crop damage exponent code
    ConvertedCropExpValue = case_when(
      CROPDMGEXP == ""  ~ 0,
      CROPDMGEXP == "K" ~ 1e3,
      CROPDMGEXP == "M" ~ 1e6,
      CROPDMGEXP == "B" ~ 1e9),
    # Numeric multiplier for each property damage exponent code
    ConvertedPropExpValue = case_when(
      PROPDMGEXP == ""  ~ 0,
      PROPDMGEXP == "K" ~ 1e3,
      PROPDMGEXP == "M" ~ 1e6,
      PROPDMGEXP == "B" ~ 1e9)
  )
# Damage in dollars = recorded value * multiplier
Data <- Data %>%
  mutate(CalculatedCropDmg = ConvertedCropExpValue * CROPDMG,
         CalculatedPropDmg = ConvertedPropExpValue * PROPDMG)
head(Data)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM ConvertedCropExpValue
## 1 3051 8806 1 0
## 2 0 0 2 0
## 3 0 0 3 0
## 4 0 0 4 0
## 5 0 0 5 0
## 6 0 0 6 0
## ConvertedPropExpValue CalculatedCropDmg CalculatedPropDmg
## 1 1000 0 25000
## 2 1000 0 2500
## 3 1000 0 25000
## 4 1000 0 2500
## 5 1000 0 2500
## 6 1000 0 2500
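The conversion above can be tried on a small made-up table to see what happens to each kind of exponent code. This is only a sketch: `exp_to_multiplier` is a hypothetical helper, the toy values are invented, and treating lower-case codes like their upper-case counterparts is an assumption rather than something the original analysis does.

```r
library(dplyr)

# Hypothetical helper: map an exponent code to a numeric multiplier.
# Assumption: lower-case codes mean the same as upper-case ones; any code
# other than "", K, M or B is left as NA instead of being guessed.
exp_to_multiplier <- function(code) {
  code <- toupper(code)
  case_when(
    code == ""  ~ 0,
    code == "K" ~ 1e3,
    code == "M" ~ 1e6,
    code == "B" ~ 1e9,
    TRUE        ~ NA_real_
  )
}

# Invented toy rows, one per kind of exponent code
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 3, 10, 1),
  PROPDMGEXP = c("K", "m", "B", "", "?")
)

toy <- toy %>%
  mutate(CalculatedPropDmg = PROPDMG * exp_to_multiplier(PROPDMGEXP))
print(toy)
```

The first four toy rows come out as 25000, 2500000, 3e9 and 0, while the row with the unrecognised code `?` stays NA. Because `mean()` returns NA when any input is NA, it is worth checking how many rows end up NA before averaging.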
Now that we have preprocessed the data, we can turn to the questions posed at the start of the analysis.
To compare event types by their effect on population health, we first select the relevant columns and group the data by event type (EVTYPE) using group_by. We then summarize the data by taking the mean of the INJURIES and FATALITIES columns, and add a new column, TotalHealthDamage, which is the sum of the two means. We take the mean for each event type because the aim of this analysis is to find the harm each phenomenon causes on average. Finally, we arrange the rows in descending order of TotalHealthDamage.
HealthIssues <- Data %>%
  select(EVTYPE, INJURIES, FATALITIES) %>%
  group_by(EVTYPE) %>%
  # .groups (not .group) controls the grouping of the summarized result
  summarize(Injuries = mean(INJURIES),
            Fatalities = mean(FATALITIES),
            TotalHealthDamage = mean(INJURIES) + mean(FATALITIES),
            .groups = "drop") %>%
  arrange(desc(TotalHealthDamage))
head(HealthIssues)
## # A tibble: 6 × 4
## EVTYPE Injuries Fatalities TotalHealthDamage
## <chr> <dbl> <dbl> <dbl>
## 1 Heat Wave 70 0 70
## 2 TROPICAL STORM GORDON 43 8 51
## 3 WILD FIRES 37.5 0.75 38.2
## 4 THUNDERSTORMW 27 0 27
## 5 TORNADOES, TSTM WIND, HAIL 0 25 25
## 6 HIGH WIND AND SEAS 20 3 23
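Since the ranking rests on a per-event-type mean, it can be useful to see how many records stand behind each mean. The sketch below shows one way to add such a count with `n()`. The toy data and the `Records` column name are invented for illustration, and the original analysis does not include this step.

```r
library(dplyr)

# Invented toy data: event type "A" has one record, "B" has four
toy <- data.frame(
  EVTYPE     = c("A", "B", "B", "B", "B"),
  INJURIES   = c(70, 10, 0, 0, 10),
  FATALITIES = c(0, 1, 0, 0, 3)
)

toy_summary <- toy %>%
  group_by(EVTYPE) %>%
  summarize(
    Records = n(),                           # number of rows behind each mean
    Injuries = mean(INJURIES),
    Fatalities = mean(FATALITIES),
    TotalHealthDamage = Injuries + Fatalities,
    .groups = "drop"
  ) %>%
  arrange(desc(TotalHealthDamage))
print(toy_summary)
```

In this toy table, "A" ranks first with a mean of 70 injuries from a single record, while "B" averages 5 injuries and 1 fatality over four records. A count column makes that difference visible when reading the ranking.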
We then plot the data from above as a ggplot2 bar plot. To do this we first gather the separate health measures into a single column using the pivot_longer function from tidyr. We then plot the event type on the x axis and the mean count on the y axis, with one bar per measure. We have chosen an arbitrary value of 5 as the number of event types to plot, so the plot shows the top 5 events.
n_count <- 5
to_plot_health <- pivot_longer(HealthIssues[1:n_count, ],
                               cols = c(TotalHealthDamage, Fatalities, Injuries),
                               names_to = "Type", values_to = "Count")
# Keep the events in ranked order on the x axis
to_plot_health$EVTYPE <- factor(to_plot_health$EVTYPE,
                                levels = unique(to_plot_health$EVTYPE))
ggplot(to_plot_health, aes(x = EVTYPE, y = Count, fill = Type)) +
  geom_bar(position = "dodge", stat = "identity") +
  theme_grey(base_size = 22) +
  labs(title = "Top 5 Events that are Harmful to Population Health")
The graph above depicts the top 5 events that have caused the most damage to human life. The data for each event are further split into injuries and fatalities: the green bar depicts injuries, the pink bar depicts fatalities, and the blue bar is the total, which is the sum of the two. From the graph we find that heat waves are the leading cause of injuries, at 70 injuries on average. They are followed by Tropical Storm Gordon and wild fires.
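The reshaping step behind the plot can be seen on a small made-up table. This is a sketch with invented values, meant only to show what `pivot_longer` does to the columns.

```r
library(tidyr)

# Invented wide table: one row per event type, one column per measure
wide <- data.frame(
  EVTYPE     = c("A", "B"),
  Injuries   = c(70, 5),
  Fatalities = c(0, 1)
)

# Long table: one row per (event type, measure) pair
long <- pivot_longer(wide,
                     cols = c(Injuries, Fatalities),
                     names_to = "Type", values_to = "Count")
print(long)
```

The two-row wide table becomes a four-row long table with the columns EVTYPE, Type and Count. That is the shape ggplot2 needs to draw one bar per measure by mapping `fill = Type`.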
To compare event types by economic damage, we first select the relevant columns and group the data by event type (EVTYPE) using group_by. We then summarize the data by taking the mean of the CalculatedCropDmg and CalculatedPropDmg columns, and add a new column, Total_Economic_Damage, which is the sum of the two means. We take the mean for each event type because the aim of this analysis is to find the damage each phenomenon causes on average. Finally, we arrange the rows in descending order of Total_Economic_Damage.
EconomicIssues <- Data %>%
  select(EVTYPE, CalculatedCropDmg, CalculatedPropDmg) %>%
  group_by(EVTYPE) %>%
  # .groups (not .group) controls the grouping of the summarized result
  summarize(Crop_Damage = mean(CalculatedCropDmg),
            Property_Damage = mean(CalculatedPropDmg),
            Total_Economic_Damage = mean(CalculatedCropDmg) + mean(CalculatedPropDmg),
            .groups = "drop") %>%
  arrange(desc(Total_Economic_Damage))
We then plot the data from above as a ggplot2 bar plot. To do this we first gather the separate damage measures into a single column using the pivot_longer function from tidyr. We then plot the event type on the x axis and the mean damage on the y axis, with one bar per measure. We have again chosen an arbitrary value of 5 as the number of event types to plot, so the plot shows the top 5 events.
n_count <- 5
to_plot_economics <- pivot_longer(EconomicIssues[1:n_count, ],
                                  cols = c(Total_Economic_Damage, Crop_Damage, Property_Damage),
                                  names_to = "Type", values_to = "Count")
# Keep the events in ranked order on the x axis
to_plot_economics$EVTYPE <- factor(to_plot_economics$EVTYPE,
                                   levels = unique(to_plot_economics$EVTYPE))
ggplot(to_plot_economics, aes(x = EVTYPE, y = Count, fill = Type)) +
  geom_bar(position = "dodge", stat = "identity") +
  theme_grey(base_size = 22) +
  labs(title = "Top 5 Events that Cause Economic Damage")
The graph above depicts the top 5 events that have caused the most economic damage. The data for each event are further split into crop damage and property damage: the green bar depicts property damage, the pink bar depicts crop damage, and the blue bar is the total, which is the sum of the two. From the graph we find that tornadoes are the leading cause of damage, with an average of over USD 1.5 billion. They are followed by heavy rains and hurricanes.
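Both plots pick the top 5 rows by position after sorting. As an alternative sketch, `slice_max` from dplyr (version 1.0.0 or later) selects the top rows by a column directly. The toy data below are invented, and the original analysis does not take this approach.

```r
library(dplyr)

# Invented toy summary table
toy <- data.frame(
  EVTYPE = c("A", "B", "C", "D"),
  Total_Economic_Damage = c(5, 40, 1, 12)
)

n_count <- 2
# Rows with the n_count largest totals, returned in descending order
top_events <- toy %>%
  slice_max(Total_Economic_Damage, n = n_count, with_ties = FALSE)
print(top_events)
```

Here the two rows kept are "B" and "D". Setting `with_ties = FALSE` guarantees exactly `n_count` rows even when several event types share the same total.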
From this analysis we conclude that heat waves are the leading cause of harm to human life, followed by storms, wildfires, thunderstorms, and tornadoes. We also find that the leading cause of economic damage is tornadoes, followed by heavy rains, hurricanes, storm surges, and wildfires. Further analysis of these events may be required to reach a firm conclusion.