In this report we aim to answer two research questions:
1- Which types of events are most harmful with respect to population health across the U.S.?
2- Which types of events have the greatest economic consequences across the U.S.?
To answer these two questions we performed a simple analysis on the data obtained from the Coursera website. From these data we found that the most harmful type of event with respect to population health across the U.S. is the TORNADO event type. We also found that the event type with the greatest economic consequences is TROPICAL STORM GORDON.
I downloaded the data from the Reproducible Research course website and read it into the storm_df data frame. To keep the download and read steps reproducible, I used the code below.
# Load packages for data manipulation and visualization
library(ggplot2); library(dplyr); library(tidyr); library(readr)
data_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
data_file <- "repdata-data-StormData.csv.bz2"
## Download the compressed data file into the working directory;
## if the .csv.bz2 file already exists, skip the download.
# Skip everything if storm_df is already in the environment
# (i.e. this script was run before in the same session).
# 'storm_df' is the data frame that holds the contents of "StormData.csv"
if (!exists("storm_df")) {
  if (!file.exists(data_file)) {
    # Download in binary mode so the compressed file is not corrupted
    download.file(data_url, data_file, mode = "wb")
  }
  # read_csv() decompresses .bz2 files and returns a tibble
  storm_df <- read_csv(data_file)
}
First, let’s have a look at storm_df to get a sense of the data:
storm_df
## Source: local data frame [902,297 x 37]
##
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## (dbl) (chr) (int) (chr) (dbl) (chr) (chr)
## 1 1 4/18/1950 0:00:00 130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## 4 1 6/8/1951 0:00:00 900 CST 89 MADISON AL
## 5 1 11/15/1951 0:00:00 1500 CST 43 CULLMAN AL
## 6 1 11/15/1951 0:00:00 2000 CST 77 LAUDERDALE AL
## 7 1 11/16/1951 0:00:00 100 CST 9 BLOUNT AL
## 8 1 1/22/1952 0:00:00 900 CST 123 TALLAPOOSA AL
## 9 1 2/13/1952 0:00:00 2000 CST 125 TUSCALOOSA AL
## 10 1 2/13/1952 0:00:00 2000 CST 57 FAYETTE AL
## .. ... ... ... ... ... ... ...
## Variables not shown: EVTYPE (chr), BGN_RANGE (dbl), BGN_AZI (lgl),
## BGN_LOCATI (lgl), END_DATE (lgl), END_TIME (lgl), COUNTY_END (dbl),
## COUNTYENDN (lgl), END_RANGE (dbl), END_AZI (lgl), END_LOCATI (lgl),
## LENGTH (dbl), WIDTH (dbl), F (int), MAG (dbl), FATALITIES (dbl),
## INJURIES (dbl), PROPDMG (dbl), PROPDMGEXP (chr), CROPDMG (dbl),
## CROPDMGEXP (lgl), WFO (lgl), STATEOFFIC (lgl), ZONENAMES (lgl), LATITUDE
## (dbl), LONGITUDE (dbl), LATITUDE_E (dbl), LONGITUDE_ (dbl), REMARKS
## (lgl), REFNUM (dbl)
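The listing above shows CROPDMGEXP typed as lgl, which is the type readr typically assigns to a column in which it saw only missing values while guessing types. Below is a minimal base R sketch, on made-up vectors rather than the storm data, of how to test whether a column is entirely NA. The helper name `is_all_na` is hypothetical. Note that `!anyNA(x)` answers a different question: whether there are no NAs at all.

```r
# Hypothetical vectors for illustration; not taken from the storm data.
all_missing  <- c(NA, NA, NA)
some_missing <- c("K", NA, "M")

# TRUE only when every element is NA
is_all_na <- function(x) all(is.na(x))

is_all_na(all_missing)    # TRUE
is_all_na(some_missing)   # FALSE

# !anyNA() is TRUE only when the vector contains no NA at all
!anyNA(all_missing)       # FALSE
```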
Now, we need to analyze the data to answer the following questions:
1- Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2- Across the United States, which types of events have the greatest economic consequences?
To achieve that, first we need to prepare the data in a way that makes answering those questions as easy as possible.
In the next code chunk, I’ll use the dplyr package to preprocess the data.
# Process PROPDMGEXP by mapping each character to its corresponding numeric value.
# This takes 'H', 'K', 'M', etc. from the PROPDMGEXP variable and maps them to
# 100, 1000, 1000000, etc. respectively
PROPDMGEX_FILTERED <-
  storm_df$PROPDMGEXP[storm_df$PROPDMGEXP %in% c('', 'h', 'H', 'K', 'm', 'M', 'B')]
PROPDMGEXP_VALUES <-
  as.numeric(plyr::mapvalues(PROPDMGEX_FILTERED,
                             from = c('', 'h', 'H', 'K', 'm', 'M', 'B'),
                             to = c(1, 10^2, 10^2, 10^3, 10^6, 10^6, 10^9)))
# CROPDMG gets no such scaling here, because the CROPDMGEXP variable as read above contains only NAs.
# Check using all(is.na(storm_df$CROPDMGEXP))
# Now I will use dplyr to create a new df, storm_ready, to make the analysis easier.
# 1- Select from storm_df the variables of interest:
# EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG
# 2- Exclude rows where PROPDMGEXP contains values other than:
# '', 'h', 'H', 'K', 'm', 'M', 'B'.
# (Values other than these have no specific meaning)
# 3- Group by EVTYPE
# 4- Make 2 summaries:
# TOTDMG: that's the sum of (PROPDMG * PROPDMGEXP_VALUES + CROPDMG)
# HARMED_PEOPLE, that's the sum of (INJURIES + FATALITIES)
# 5- Arrange in descending order of HARMED_PEOPLE, then TOTDMG
# 6- Keep only event types with both TOTDMG > 0 and HARMED_PEOPLE > 0
# (the others are not relevant to our analysis)
# 7- Add a new variable, EVENT_INDEX, that's just recording the row index of the resulting df
storm_ready <- storm_df %>%
  select(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG) %>%
  filter(PROPDMGEXP %in% c('', 'h', 'H', 'K', 'm', 'M', 'B')) %>%
  group_by(EVTYPE) %>%
  summarise(TOTDMG = sum(PROPDMG * PROPDMGEXP_VALUES + CROPDMG),
            HARMED_PEOPLE = sum(INJURIES + FATALITIES)) %>%
  arrange(desc(HARMED_PEOPLE), desc(TOTDMG)) %>%
  filter(TOTDMG > 0 & HARMED_PEOPLE > 0) %>%
  mutate(EVENT_INDEX = 1:n())
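Step 4 multiplies PROPDMG by a multiplier vector that was built outside the pipeline, so inside `summarise()` that vector has one entry per filtered row while PROPDMG has one entry per row of the current group. Below is a minimal sketch, on made-up data and in base R, of an alternative that attaches the multiplier to each row before grouping. The names `toy`, `multiplier`, `key`, and `ROW_DMG` are hypothetical and used only for illustration.

```r
# Toy data for illustration only; the values are made up.
toy <- data.frame(
  EVTYPE     = c("A", "A", "B", "B"),
  PROPDMG    = c(2, 5, 1, 3),
  PROPDMGEXP = c("K", "M", "", "h"),
  CROPDMG    = c(10, 0, 4, 0),
  stringsAsFactors = FALSE
)

# Lookup from exponent code to multiplier (same mapping as in the report).
# "none" stands in for the empty string, which cannot be used as a name.
multiplier <- c(none = 1, h = 1e2, H = 1e2, K = 1e3, m = 1e6, M = 1e6, B = 1e9)

# Attach the multiplier to each row before any grouping
key <- ifelse(toy$PROPDMGEXP == "", "none", toy$PROPDMGEXP)
toy$ROW_DMG <- toy$PROPDMG * unname(multiplier[key]) + toy$CROPDMG

# Sum the per-row damage within each event type
tot <- tapply(toy$ROW_DMG, toy$EVTYPE, sum)
tot
```

Because the multiplier is looked up row by row, each PROPDMG value is scaled by its own exponent code regardless of how the rows are later grouped.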
Now the data in storm_ready is clean and ready for analysis.
Take a look at storm_ready to get a sense of it:
storm_ready
## Source: local data frame [163 x 4]
##
## EVTYPE TOTDMG HARMED_PEOPLE EVENT_INDEX
## (chr) (dbl) (dbl) (int)
## 1 TORNADO 2.056802e+12 96951 1
## 2 EXCESSIVE HEAT 3.423919e+10 8428 2
## 3 TSTM WIND 1.084023e+11 7461 3
## 4 FLOOD 1.123072e+12 7259 4
## 5 LIGHTNING 2.068139e+12 6046 5
## 6 HEAT 4.412652e+09 3037 6
## 7 FLASH FLOOD 5.143075e+11 2755 7
## 8 ICE STORM 1.678672e+12 2064 8
## 9 THUNDERSTORM WIND 7.685333e+11 1621 9
## 10 WINTER STORM 1.260491e+12 1527 10
## .. ... ... ... ...
Addressing The First Question:
To find the most harmful event type with respect to human health, we pick the first element of the EVTYPE variable in the storm_ready data frame. Because storm_ready is already arranged in descending order of HARMED_PEOPLE, this first element is the event type with the largest number of harmed people (injured + dead).
most_harmful <- storm_ready$EVTYPE[1]
So TORNADO turned out to be the most harmful event type.
We can see how many people were affected by TORNADO:
harmed_by_tornado <- storm_ready$HARMED_PEOPLE[1]
harmed_by_tornado
## [1] 96951
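Taking the first row works only because the table was sorted beforehand. As a small cross-check, here is a base R sketch on a made-up table (the name `toy` and its values are hypothetical) that compares the sort-then-first approach with a direct lookup of the maximum that does not depend on row order.

```r
# Toy summary table; the numbers are made up for illustration.
toy <- data.frame(
  EVTYPE        = c("A", "B", "C"),
  HARMED_PEOPLE = c(120, 950, 30),
  stringsAsFactors = FALSE
)

# Sort in descending order, then take the first row
sorted <- toy[order(-toy$HARMED_PEOPLE), ]
top_by_sort <- sorted$EVTYPE[1]

# Look up the maximum directly, without relying on row order
top_by_max <- toy$EVTYPE[which.max(toy$HARMED_PEOPLE)]

# Both approaches should name the same event type
top_by_sort == top_by_max
```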
The plot below shows EVENT_INDEX on the x-axis and the number of people harmed, on a \(\log_{10}\) scale, on the y-axis.
I chose a \(\log_{10}\) scale to compensate for the huge dispersion in the data and make the plot more readable.
g <- ggplot(data = storm_ready, aes(EVENT_INDEX, log10(HARMED_PEOPLE)))
g <- g +
  geom_bar(stat = "identity", fill = 'salmon', color = 'black') +
  labs(title = "Harmful Effect of Different Events across The U.S.",
       x = "Event Index",
       y = "No. of Affected People (scaled to log10)")
g
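The effect of the \(\log_{10}\) scaling on the y-axis can be illustrated with a few made-up counts that span several orders of magnitude. This is a base R sketch; the vector `harmed` is hypothetical and not taken from storm_ready.

```r
# Made-up counts spanning several orders of magnitude
harmed <- c(96000, 8400, 150, 12, 1)

# On a linear axis the largest value dwarfs all the others
range(harmed)

# After log10 the values are compressed into the interval from 0 to 5
log_harmed <- log10(harmed)
round(log_harmed, 2)

# A count of exactly 1 maps to 0
log10(1)
```

One consequence is that an event type with exactly one harmed person gets a bar of zero height on the log-scaled plot.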
Addressing The Second Question:
To find the event type with the greatest economic consequences:
most_harmful_economic <- subset(storm_ready, TOTDMG == max(TOTDMG))$EVTYPE
most_harmful_economic
## [1] "TROPICAL STORM GORDON"
The plot below shows EVENT_INDEX on the x-axis and the total economic damage on the y-axis.
g <- ggplot(data = storm_ready, aes(EVENT_INDEX, TOTDMG))
g <- g +
  geom_bar(stat = "identity", fill = 'salmon') +
  labs(title = "Total Economic Damage of Different Events across The U.S.",
       x = "Event Index",
       y = "Economic Damage")
g