The goal of this study is to analyze the main impacts of severe weather events, such as storms, on the economy and on population health in the United States, by answering two questions: which types of events are most harmful with respect to population health, and which have the greatest economic consequences?
The first section of the notebook covers the steps for extracting the data and transforming it into a format suitable for answering these questions. The damage amount (economic) columns are converted to their actual values in US dollars, and the FATALITIES and INJURIES counts for each event are added to obtain a total.
In the second section, the questions are answered with two bar plots that compare the event types and show which are the most significant in terms of economic and population health loss.
This section covers the steps for getting the data into a form that allows the questions to be answered.
First, the data file is downloaded from the course website using the download.file() function.
## downloading file
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile="Storm_Data.zip")
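Since the notebook downloads the archive every time it is run, it may be worth skipping the download when the file is already present. The following is a minimal sketch, assuming a hypothetical helper named download_if_missing() that is not part of the original analysis; its downloader argument is also an assumption, added so the function can be exercised without network access.

```r
## Hypothetical helper (not in the original analysis): download only when
## the destination file is absent. `downloader` defaults to download.file()
## and can be replaced by another function with the same arguments.
download_if_missing <- function(url, destfile, downloader = download.file) {
  if (file.exists(destfile)) {
    return(invisible(FALSE))   # file already there, nothing downloaded
  }
  ## binary mode, because the archive is a compressed (bz2) file
  downloader(url, destfile = destfile, mode = "wb")
  invisible(TRUE)              # a download was attempted
}

## Possible usage with the URL and file name from this notebook:
## download_if_missing(
##   "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
##   "Storm_Data.zip")
```

The function returns TRUE or FALSE invisibly, so a caller can tell whether a download was attempted.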
This file is in “csv.bz2” format, so it can be read directly with the read.csv() function, without decompressing it first. After reading the table, the str() function shows the dimensions of the data, the class of each variable, and their first elements.
## reading the compressed file
stormdata <- read.csv("Storm_Data.zip")
## showing its characteristics
str(stormdata)
## 'data.frame': 902297 obs. of 37 variables:
## $ STATE__ : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr "CST" "CST" "CST" "CST" ...
## $ COUNTY : num 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr "" "" "" "" ...
## $ BGN_LOCATI: chr "" "" "" "" ...
## $ END_DATE : chr "" "" "" "" ...
## $ END_TIME : chr "" "" "" "" ...
## $ COUNTY_END: num 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi NA NA NA NA NA NA ...
## $ END_RANGE : num 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr "" "" "" "" ...
## $ END_LOCATI: chr "" "" "" "" ...
## $ LENGTH : num 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
## $ WFO : chr "" "" "" "" ...
## $ STATEOFFIC: chr "" "" "" "" ...
## $ ZONENAMES : chr "" "" "" "" ...
## $ LATITUDE : num 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num 3051 0 0 0 0 ...
## $ LONGITUDE_: num 8806 0 0 0 0 ...
## $ REMARKS : chr "" "" "" "" ...
## $ REFNUM : num 1 2 3 4 5 6 7 8 9 10 ...
Not all of these variables are necessary for the analysis, since the study focuses only on the health and economic effects of these severe events on the whole country. So, from the stormdata data frame, a new one is generated by keeping the event type (EVTYPE), the variables that describe the impact of the events on population health (FATALITIES and INJURIES), and the ones that represent the impact on the economy (PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP), which refer to property and crop damage.
## new data frame
library(dplyr)
studyRawData <- select(stormdata, EVTYPE, FATALITIES,
INJURIES, PROPDMG, PROPDMGEXP,
CROPDMG, CROPDMGEXP)
head(studyRawData)
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 0 15 25.0 K 0
## 2 TORNADO 0 0 2.5 K 0
## 3 TORNADO 0 2 25.0 K 0
## 4 TORNADO 0 2 2.5 K 0
## 5 TORNADO 0 2 2.5 K 0
## 6 TORNADO 0 6 2.5 K 0
Now, it is necessary to check whether the numeric columns have missing values and, if so, to decide how to deal with them.
## checking missing values
na_sum <- function(X) {
### function that receives a variable and returns the number of NA values within it.
sum(is.na(X))
}
## applying the function to the numeric columns
sapply(studyRawData[, c("FATALITIES", "INJURIES", "PROPDMG", "CROPDMG")], na_sum)
## FATALITIES INJURIES PROPDMG CROPDMG
## 0 0 0 0
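The same count can be obtained without a helper function. The sketch below uses a small made-up data frame, not the storm data, and relies on base R's is.na() and colSums().

```r
## Toy data standing in for the numeric columns of studyRawData
## (values are made up for illustration).
toy <- data.frame(FATALITIES = c(0, 1, NA),
                  INJURIES   = c(15, 0, 2),
                  PROPDMG    = c(25, NA, NA))

## is.na() returns a logical matrix; colSums() counts the TRUEs per column.
na_counts <- colSums(is.na(toy))
print(na_counts)
```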
There are no missing values within the numeric columns. The variables PROPDMGEXP and CROPDMGEXP should represent the magnitude in dollars for the values in the variables PROPDMG and CROPDMG, with “K” for thousands, “M” for millions and “B” for billions. However, these two variables contain more distinct characters than they should.
## checking characters in variables PROPDMGEXP and CROPDMGEXP
unique(studyRawData$PROPDMGEXP); unique(studyRawData$CROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
At this stage, it was assumed that values whose exponent code is anything other than K, M or B (in either case) should be taken as they are, that is, multiplied by 1, when converting PROPDMG and CROPDMG into their actual dollar values. Three new variables, propertydmg_value, cropdmg_value and total_exp, were added as follows:
library(plyr)
## assigning a numeric multiplier to each character
prop_val <- mapvalues(studyRawData$PROPDMGEXP,
c("K","M","","B","m","+","0","5","6","?","4","2","3","h","7","H","-","1","8"),
c(1e3, 1e6,1,1e9,1e6,1,1,1,1,1,1,1,1,1,1,1,1,1,1))
crop_val <- mapvalues(studyRawData$CROPDMGEXP,
c("","M","K","m","B","?","0","k","2"),
c(1,1e6,1e3,1e6,1e9,1,1,1e3,1))
## computing the actual dollar values
studyRawData$propertydmg_value <- as.numeric(prop_val)*studyRawData$PROPDMG
studyRawData$cropdmg_value <- as.numeric(crop_val)*studyRawData$CROPDMG
## adding the total column
studyRawData$total_exp <- studyRawData$propertydmg_value+studyRawData$cropdmg_value
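The same conversion can be sketched in base R with a named lookup vector instead of mapvalues(). This is an alternative illustration on made-up values, not the method used in this notebook; the names dmg, exp_code and multipliers are hypothetical. It keeps the same assumption: any code other than K, M or B (in either case) leaves the value unchanged.

```r
## Toy damage figures and exponent codes (made-up values for illustration).
dmg      <- c(25, 2.5, 1, 3, 10)
exp_code <- c("K", "m", "B", "", "?")

## Named lookup vector: recognised letters map to their multiplier.
multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

## Upper-case the codes so "k" and "m" match, then look them up by name.
## Codes absent from the lookup (including "") give NA, which is replaced
## by 1 so that those values are taken as they are.
mult <- unname(multipliers[toupper(exp_code)])
mult[is.na(mult)] <- 1

dmg_value <- dmg * mult
print(dmg_value)
```

A lookup vector avoids listing every unexpected character by hand, at the cost of treating any new, unseen code as a multiplier of 1 without warning.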
Now the columns PROPDMGEXP, CROPDMGEXP, PROPDMG and CROPDMG can be dropped, leaving just the variables that will be used in the analysis.
A new variable called pop_health_loss is also added as the sum of the FATALITIES and INJURIES columns.
## population health loss
studyRawData$pop_health_loss <- studyRawData$FATALITIES+studyRawData$INJURIES
## selecting variables
studyData <- studyRawData[,c("EVTYPE", "FATALITIES",
"INJURIES","pop_health_loss",
"propertydmg_value","cropdmg_value","total_exp")]
## converting to factor
studyData$EVTYPE <- as.factor(studyData$EVTYPE)
## first six rows
head(studyData)
## EVTYPE FATALITIES INJURIES pop_health_loss propertydmg_value cropdmg_value
## 1 TORNADO 0 15 15 25000 0
## 2 TORNADO 0 0 0 2500 0
## 3 TORNADO 0 2 2 25000 0
## 4 TORNADO 0 2 2 2500 0
## 5 TORNADO 0 2 2 2500 0
## 6 TORNADO 0 6 6 2500 0
## total_exp
## 1 25000
## 2 2500
## 3 25000
## 4 2500
## 5 2500
## 6 2500
Finally, the total expenditures and the population health loss were summarized for each event type in two separate tables, using the summarize() function.
detach(package:plyr)
library(dplyr)
## population health table
HealthTable <- studyData %>%
group_by(EVTYPE) %>%
summarize(Total_health_loss = sum(pop_health_loss, na.rm = TRUE)) %>%
arrange(desc(Total_health_loss))
## economic table
EconomicTable <- studyData %>%
group_by(EVTYPE) %>%
summarize(total_expenditures = sum(total_exp, na.rm = TRUE))%>%
arrange(desc(total_expenditures))
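To make the group-and-sum step concrete, here is a small base R sketch of the same idea on made-up data. It uses tapply() and order() rather than the dplyr pipeline above, purely for illustration.

```r
## Toy data standing in for studyData (made-up values for illustration).
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HAIL", "FLOOD"),
                  pop_health_loss = c(15, 2, 6, 0, 1))

## Sum the losses within each event type ...
totals <- tapply(toy$pop_health_loss, toy$EVTYPE, sum, na.rm = TRUE)

## ... then sort from the largest total to the smallest.
totals <- totals[order(totals, decreasing = TRUE)]
print(totals)
```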
The ggplot2 library was used to draw two bar plots in order to compare the events.
For the first question: Across the United States, which types of events are most harmful with respect to population health?
library(ggplot2)
##barplot
ggplot(data=head(HealthTable,5), aes(x=EVTYPE, y=Total_health_loss)) +
xlab("Event Type") +
ylab("Total health loss") +
ggtitle("Most harmful events across the United States") +
geom_bar(stat="identity")
For the second question: Across the United States, which types of events have the greatest economic consequences?
library(ggplot2)
##barplot
ggplot(data=head(EconomicTable,5), aes(x=EVTYPE, y=total_expenditures)) +
xlab("Event Type") +
ylab("Economic consequences") +
ggtitle("Greatest economic consequences by events") +
geom_bar(stat="identity")
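In both plots the bars follow the order of the EVTYPE factor levels, which as.factor() sets alphabetically, so the tallest bar does not necessarily come first. One possible refinement, sketched here on made-up values and not part of the original analysis, is to reorder the factor levels by the plotted total with reorder() from the stats package.

```r
## Toy summary table (made-up values for illustration).
toy <- data.frame(EVTYPE = c("FLOOD", "HAIL", "TORNADO"),
                  total  = c(3, 1, 21))

## reorder() sorts factor levels by a numeric key; negating the key
## puts the event with the largest total first.
toy$EVTYPE <- reorder(toy$EVTYPE, -toy$total)
print(levels(toy$EVTYPE))

## Build the plot only if ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(toy, aes(x = EVTYPE, y = total)) +
    geom_col() +               # equivalent to geom_bar(stat = "identity")
    xlab("Event Type") +
    ylab("Total")
}
```

Calling print(p) then draws the bars from the largest total to the smallest.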