Several factors affect different aspects of a community, including public health and the economy. Some of these factors occur naturally, e.g. storms and other severe weather. Although it is almost impossible to prevent such natural disasters, the damage they cause (loss of life, injuries and property damage) can be mitigated.
The purpose of this project was to process and analyze the types of environmental events (e.g. rain, flooding, hurricanes) that caused the most damage to population health (i.e. fatalities and injuries) and to the economy (i.e. the cost of property and crop damage) between 1950 and 2011.
link <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(link, destfile = "StormData.csv.bz2")
stormdata <- read.csv("./StormData.csv.bz2", stringsAsFactors = FALSE)
##To check if data was loaded properly
head(stormdata)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL TORNADO
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL TORNADO
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL TORNADO
## 4 1 6/8/1951 0:00:00 0900 CST 89 MADISON AL TORNADO
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL TORNADO
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL TORNADO
## BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END COUNTYENDN
## 1 0 0 NA
## 2 0 0 NA
## 3 0 0 NA
## 4 0 0 NA
## 5 0 0 NA
## 6 0 0 NA
## END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES INJURIES PROPDMG
## 1 0 14.0 100 3 0 0 15 25.0
## 2 0 2.0 150 2 0 0 0 2.5
## 3 0 0.1 123 2 0 0 2 25.0
## 4 0 0.0 100 2 0 0 2 2.5
## 5 0 0.0 150 2 0 0 2 2.5
## 6 0 1.5 177 2 0 0 6 2.5
## PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES LATITUDE LONGITUDE
## 1 K 0 3040 8812
## 2 K 0 3042 8755
## 3 K 0 3340 8742
## 4 K 0 3458 8626
## 5 K 0 3412 8642
## 6 K 0 3450 8748
## LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3051 8806 1
## 2 0 0 2
## 3 0 0 3
## 4 0 0 4
## 5 0 0 5
## 6 0 0 6
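The loading step above downloads the file on every run. A minimal sketch of one way to skip the download when the file is already on disk; the helper name `fetch_once` is hypothetical and not part of the original analysis, and the usage lines are commented out to avoid a network call:

```r
# Hypothetical helper: download a file only if it is not already on disk.
fetch_once <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" keeps the compressed file intact on Windows
    download.file(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

# Usage with the names from the analysis:
# link <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# fetch_once(link, "StormData.csv.bz2")
# stormdata <- read.csv("StormData.csv.bz2", stringsAsFactors = FALSE)
```

Because `read.csv` can read the bz2-compressed file directly, as the original code does, no separate decompression step is needed.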
An exploratory data analysis was performed to give a brief overview of the whole data set. As not all the columns in the original data set are needed for this project, a new data set was created with only the columns/variables of interest.
##load needed package
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
up_data <- stormdata %>%
select("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
##Check the new data
head(up_data)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
The new data set was checked to find the number of events recorded during the study period.
Total_events <- length(up_data$EVTYPE)
Unique_events <- length(unique(up_data$EVTYPE))
The data set contains a total of 902297 recorded events, but only 985 unique event types. Some of the event types recorded can be grouped under a single event. For example, hurricane and typhoon events can be grouped under a single event, Hurricane.
up_data$EVTYPE[grepl("tornado", up_data$EVTYPE, ignore.case = TRUE)] <- "Tornado"
up_data$EVTYPE[grepl("FLOOD",up_data$EVTYPE, ignore.case = TRUE)] <- "Flooding"
up_data$EVTYPE[grepl("hurricane|typhoon",up_data$EVTYPE, ignore.case = TRUE)] <- "Hurricane"
up_data$EVTYPE<-factor(up_data$EVTYPE)
##Check data
head(up_data)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 Tornado 0 15 25.0 K 0
## 2 Tornado 0 0 2.5 K 0
## 3 Tornado 0 2 25.0 K 0
## 4 Tornado 0 2 2.5 K 0
## 5 Tornado 0 2 2.5 K 0
## 6 Tornado 0 6 2.5 K 0
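The grouping step above can be tried on a small vector first. This is a minimal sketch using made-up event labels (not taken from the data set) and a hypothetical `patterns` lookup that pairs each regular expression with its replacement label:

```r
# Toy event labels, made up for illustration
events <- c("TORNADO F1", "Flash Flood", "HURRICANE OPAL",
            "Typhoon", "HAIL", "waterspout/tornado")

# Hypothetical lookup: regular expression -> label that replaces any match
patterns <- c("tornado" = "Tornado",
              "flood" = "Flooding",
              "hurricane|typhoon" = "Hurricane")

# Apply each pattern in turn, ignoring case as in the analysis
for (p in names(patterns)) {
  events[grepl(p, events, ignore.case = TRUE)] <- patterns[[p]]
}

print(table(events))
```

Labels that match none of the patterns (here `HAIL`) are left unchanged, which is also how the original code behaves.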
## Sum fatalities and injuries for each event type, most harmful first
most_harmful <- up_data %>%
group_by(EVTYPE) %>%
summarise(Total_fatalities = sum(FATALITIES), Total_injuries = sum(INJURIES)) %>%
arrange(desc(Total_fatalities + Total_injuries))
##Check the data
head(most_harmful)
## # A tibble: 6 x 3
## EVTYPE Total_fatalities Total_injuries
## <fct> <dbl> <dbl>
## 1 Tornado 5661 91407
## 2 Flooding 1525 8604
## 3 EXCESSIVE HEAT 1903 6525
## 4 TSTM WIND 504 6957
## 5 LIGHTNING 816 5230
## 6 HEAT 937 2100
##create a variable with just the top 5 events with the most harmful effect on the population health
Total_data <- aggregate(Total_fatalities + Total_injuries ~ EVTYPE, data = most_harmful, FUN = sum)
## Rename the second column of the total data
names(Total_data)[2] <- "Causalties"
## order the total harm column in descending order to get the top events
Total_data <- Total_data[order(-Total_data$Causalties), ]
top5 <- Total_data[1:5, ]
print(top5)
## EVTYPE Causalties
## 728 Tornado 97068
## 137 Flooding 10129
## 113 EXCESSIVE HEAT 8428
## 741 TSTM WIND 7461
## 387 LIGHTNING 6046
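Since the summary table already holds one row per event type, the combined count can also be obtained by adding the two columns directly and sorting. A minimal sketch on a toy table with made-up values; the names `harm` and `top2` are illustrative only:

```r
# Toy summary table with one row per event type (values are made up)
harm <- data.frame(
  EVTYPE = c("A", "B", "C"),
  Total_fatalities = c(10, 50, 5),
  Total_injuries = c(100, 20, 1),
  stringsAsFactors = FALSE
)

# Combined count of fatalities and injuries per event type
harm$Casualties <- harm$Total_fatalities + harm$Total_injuries

# Sort in descending order and keep the top rows
top2 <- head(harm[order(-harm$Casualties), c("EVTYPE", "Casualties")], 2)
print(top2)
```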
Also recorded in the data set is the economic magnitude of each event, for both property and crop damage. The base amounts are stored in PROPDMG and CROPDMG, and the columns PROPDMGEXP and CROPDMGEXP hold an identifier giving the magnitude of each amount. The identifiers handled in this analysis are K (thousands), M (millions) and B (billions); any other value is treated as a multiplier of 1.
The identifiers were converted to numeric values in order to find the events with the greatest economic consequences.
- The total damage caused by each event was then calculated by multiplying the recorded damage amount by its magnitude.
library(dplyr)
library(tidyr)
## Replace each magnitude identifier with its numeric multiplier (any other value becomes 1)
up_data$PROPDMGEXP<-dplyr::recode(up_data$PROPDMGEXP,'K'=1000,'M'=1000000,'B'=1000000000,.default=1)
up_data$CROPDMGEXP<-dplyr::recode(up_data$CROPDMGEXP,'K'=1000,'M'=1000000,'B'=1000000000,.default=1)
##calculate the total amount of damage
up_data$PROPVAL <- up_data$PROPDMG * up_data$PROPDMGEXP
up_data$CROPVAL <- up_data$CROPDMG * up_data$CROPDMGEXP
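The conversion can also be written with a named lookup vector in base R. This is a sketch under one stated assumption that differs from the code above: lowercase identifiers such as `m` are treated like their uppercase form, whereas the `recode` call maps them to 1. The function name `exp_to_multiplier` is hypothetical.

```r
# Hypothetical helper: map magnitude identifiers to numeric multipliers.
# Assumption: lowercase identifiers count the same as uppercase ones.
exp_to_multiplier <- function(x) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  m <- unname(lookup[toupper(x)])
  m[is.na(m)] <- 1  # anything not in the lookup becomes 1
  m
}

# Toy amounts and identifiers (made up for illustration)
dmg <- c(25, 2.5, 10, 3)
dmg_exp <- c("K", "m", "B", "")
print(dmg * exp_to_multiplier(dmg_exp))
```

Whether lowercase identifiers should be scaled is a judgment call about the data; applying this variant to the full data set could change the totals reported here.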
library(magrittr)
cost_data <- up_data %>%
group_by(EVTYPE) %>%
summarise(tot_prop = sum(PROPVAL), tot_crop = sum(CROPVAL)) %>%
arrange(desc(tot_prop + tot_crop))
head(cost_data)
## # A tibble: 6 x 3
## EVTYPE tot_prop tot_crop
## <fct> <dbl> <dbl>
## 1 Flooding 167529740932. 12380099110
## 2 Hurricane 85336410030 5506117810
## 3 Tornado 58581598040. 417461520
## 4 STORM SURGE 43323536000 5000
## 5 HAIL 15727367053. 3025537890
## 6 DROUGHT 1046106000 13972566000
##create a variable with just the top 5 events that were most harmful to the economy
most_cost <- aggregate(tot_prop + tot_crop ~ EVTYPE, data = cost_data, FUN = sum)
## Rename the second column of the total data
names(most_cost)[2] <- "TOTDMGEXP"
## order the total damage column in descending order to get the top events
most_cost <- most_cost[order(-most_cost$TOTDMGEXP), ]
top5a <- most_cost[1:5, ]
print(top5a)
## EVTYPE TOTDMGEXP
## 137 Flooding 179909840042
## 340 Hurricane 90842527840
## 728 Tornado 58999059560
## 574 STORM SURGE 43323541000
## 194 HAIL 18752904943
library(ggplot2)
- Created a visual representation (using base R's barplot) of the types of events that are most harmful to population health (i.e. have the highest total number of recorded fatalities and injuries). Only the top 5 events were plotted to keep the figure uncluttered and easier to interpret. A couple of steps were taken to create the final plot.
par(tcl = 0.5, mgp = c(4, 0, 0), las = 1,
mar = c(6.1, 6.1, 5.1, 2.1),
family = 'serif')
##Plot graph
barplot(top5$Causalties, col = "Coral", xlab = "Type of Event Recorded", ylab = "Total number of Casualties", main = "Top 5 Events with the Most Harmful effects on the Population Health", sub = "(Between 1950-2011)", names.arg = top5$EVTYPE, las = 1)
Figure 1: Plot of the top 5 events with the most harmful effect on the population health (fatalities and injuries), as recorded between the years 1950-2011
From the plot above, tornadoes had the most harmful effect on population health, i.e. they caused the most fatalities and injuries.
par(tcl = 0.5, mgp = c(4, 0, 0), las = 1,
mar = c(6.1, 6.1, 5.1, 2.1),
family = 'serif')
##Plot graph
barplot(top5a$TOTDMGEXP, col = "Coral", xlab = "Type of Event Recorded", ylab = "Total Amount of Damages", main = "Top 5 Events with the Most Harmful effects on the Economy", sub = "(Between 1950-2011)", names.arg = top5a$EVTYPE, las = 1)
Figure 2: Plot of the top 5 events with the most negative effects on the economy (with respect to the properties and crops damaged), as recorded between the years 1950-2011
From the plot above, flooding had the most negative effect on the economy: it caused the highest combined cost of property and crop damage.