This set of exercises contains 1 question.
Each part of the exercise contributes equally for a total of 100 possible points, again based on your best efforts.
You may only use functions from Base R to complete the exercises.
Email Dr. Sanchez Gomez (josesa@ucr.edu) or Jericho Lawson (jlaws011@ucr.edu) if you have any questions.
To complete this assignment, create a folder named “DA6” and save this R-Markdown file and any provided data files in it.
One of the fundamental methods for data visualization is the scatterplot. When we are interested in visualizing only 2 variables, a single scatterplot is often a good approach. However, as the number of numerical variables increases, it can become challenging to choose the right tool.
A common approach for visualizing several numerical variables is the pair plot, which displays a scatterplot for every pair of variables in a matrix of plots. In this assignment, we create code that allows us to visualize multivariate numerical data through pair plots.
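For reference, Base R already ships a function, pairs(), that draws a figure of this kind. The sketch below is only an illustration of what a pair plot looks like: it uses the built-in iris data rather than the assignment data, and the colors and the names num_cols and species_colors are arbitrary choices made for this example.

```r
# Illustration only: iris is a built-in dataset, not the assignment data.
num_cols <- 1:4  # the four numeric columns of iris

# One color per observation, chosen by the level of the Species factor.
species_colors <- c("forestgreen", "darkred", "steelblue")[as.integer(iris$Species)]

# pairs() draws one scatterplot for every pair of the selected columns.
pairs(iris[, num_cols], col = species_colors, pch = 19,
      main = "Pair plot of the iris measurements")
```

The assignment builds a similar figure by hand, which allows boxplots on the diagonal.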
In the following chunk, paste the functions we have created for cleaning data-frames and factorizing character variables in data-frames.
#######################################################
#######################################################
remove_na <- function(df) { ## Automate the removal of missing observations.
  n_obs <- nrow(df)
  missing <- rep(FALSE, n_obs)
  for (obs_ind in seq_len(n_obs)) { ## seq_len() is safe when df has 0 rows.
    obs <- df[obs_ind, ]
    n_missing <- sum(is.na(obs))
    if (n_missing > 0) {
      missing[obs_ind] <- TRUE
    }
  }
  df_red <- df[!missing, ]
  return(df_red)
}
#######################################################
#######################################################
to_factors <- function(df) { ## Turn all character variables into factors.
  n_vars <- ncol(df)
  for (var_ind in seq_len(n_vars)) {
    var <- df[, var_ind]
    if (is.character(var)) { ## Safer than comparing class(var) to a string.
      df[, var_ind] <- factor(var)
    }
  }
  return(df)
}
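One way to cross-check the two helpers is to compare them against vectorized Base R calls. The sketch below is not part of the assignment solution: it uses a hypothetical toy data-frame (toy) and the functions complete.cases(), vapply() and lapply() to do the same two steps without explicit loops.

```r
# Hypothetical toy data-frame, used only for illustration.
toy <- data.frame(
  id    = 1:4,
  group = c("a", "b", NA, "a"),
  value = c(2.5, NA, 1.0, 3.5),
  stringsAsFactors = FALSE
)

# complete.cases() flags the rows that have no missing values.
toy_clean <- toy[complete.cases(toy), ]

# Find the character columns and convert each of them to a factor.
is_chr <- vapply(toy_clean, is.character, logical(1))
toy_clean[is_chr] <- lapply(toy_clean[is_chr], factor)

str(toy_clean)  # rows 2 and 3 are dropped; group is now a factor
```

Running remove_na() followed by to_factors() on the same toy data-frame should keep the same two rows.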
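Load the dataset cdc. Once it is
loaded, remove all the observations with missing entries. Verify if the
dataset has any variables of class character, and turn them into factors
if needed. Save a clean version of the dataset, including only
the first 1000 clean observations in a data-frame named
cdc_clean.
source("http://www.openintro.org/stat/data/cdc.R")
cdc_clean = remove_na(cdc)            ## drop observations with missing entries
cdc_clean = to_factors(cdc_clean)     ## convert any character variables to factors
cdc_clean = head(cdc_clean, 1000)     ## keep only the first 1000 clean observations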
Create a vector color_by_gender, which assigns
the color "forestgreen" to observations from
cdc_clean with gender == "f", and
"darkred" to the observations corresponding to
gender == "m". Also create a vector
pointstyle_by_gender which assigns the value 17 to all
observations in cdc_clean with gender == "f"
and 8 to all observations with gender == "m".
color_by_gender = ifelse(cdc_clean$gender == "m", "darkred", "forestgreen")
pointstyle_by_gender = ifelse(cdc_clean$gender == "m", 8, 17)
Change the names of the variables in cdc so
the names are properly capitalized, and include the measurement units.
For example for the variable height, change the name to
"Height (in.)". Change the names so they are appropriate as
the axis of a plot. Do this for the 4 numerical variables in the data.
Also, change the name of the variable "gender" to
"Gender".
colnames(cdc_clean) = c("genhlth", "exerany", "hlthplan", "smoke100",
                        "Height (in.)", "Weight (lbs)", "Desired Weight (lbs)",
                        "Age (Years)", "Gender")
Create a boxplot of the variable Height (in) from the
data-frame cdc_clean, dividing the observations by the
variable Gender. The title and the y-label of the plot must
be the name of the variable. Use colnames(cdc_clean)[5] to
access the name without writing the characters directly. Furthermore,
the x-label must be "Gender".
boxplot(cdc_clean[, 5] ~ Gender, data = cdc_clean,
        main = colnames(cdc_clean)[5],   ## title is the variable name
        xlab = "Gender", ylab = colnames(cdc_clean)[5])
Create a vector numeric_vars which contains the column
indexes of the 4 numeric variables in the dataset. Then, using a
for() loop iterating over the vector numeric_vars,
generate a figure containing 4 different plots aligned on a single
row.
Each plot must be a boxplot display corresponding to one of the
variables Height (in), Weight (lbs),
Desired Weight (lbs) and Age (Years). Similar
to the plot you created in part (B), the plots must be titled and have
y-label according to the name of the variable, and the x-label must be
"Gender". Make sure that the axis labels are appropriate
for each of the individual plots.
Are there any interesting patterns you see in the data? Are there observable differences of each variable according to gender?
(Hint: use par() to set up the multi-plot figure).
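As a hedged illustration of the hint, the sketch below places two boxplots side by side with par(mfrow = ...). It uses the built-in mtcars data rather than the assignment data, and old_par is a name chosen for this example. Saving the value returned by par() makes it possible to restore the previous layout afterwards.

```r
# Illustration only: mtcars is a built-in dataset, not the assignment data.
# par() returns the previous settings, so they can be restored later.
old_par <- par(mfrow = c(1, 2))  # 1 row, 2 columns

boxplot(mpg ~ cyl, data = mtcars, main = "mpg", xlab = "cyl", ylab = "mpg")
boxplot(hp ~ cyl, data = mtcars, main = "hp", xlab = "cyl", ylab = "hp")

par(old_par)  # back to the layout that was active before
```

Restoring the layout prevents later plots in the same document from being squeezed into the multi-plot grid.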
## Insert code here:
numeric_vars = c(5, 6, 7, 8)   ## column indexes of the 4 numeric variables
par(mfrow = c(1, 4))           ## 1 row, 4 columns
for (i in numeric_vars) {
  boxplot(cdc_clean[, i] ~ Gender, data = cdc_clean,
          xlab = "Gender", ylab = colnames(cdc_clean)[i],
          main = colnames(cdc_clean)[i])   ## title is the variable name
}
Answer here: Men tend to be taller and heavier than women, and their desired weight is also higher. The women in the sample tend to be older than the men, which may reflect the fact that women live longer on average.
Using the data-frame cdc_clean, create a scatterplot of
the variable Height (in) on the x-axis and the variable
Weight (lbs) on the y-axis. The title must be
Height (in) vs. Weight (lbs). The x and y labels must be
the names of the variables. Furthermore, assign the color and shape of
each point according to the gender of the observation. Make sure to
access the values and names of the variables through
cdc_clean[, ind] and colnames(cdc_clean)[ind]
instead of using the name directly.
plot(cdc_clean[, 5], cdc_clean[, 6],
     xlab = colnames(cdc_clean)[5], ylab = colnames(cdc_clean)[6],
     main = paste(colnames(cdc_clean)[5], "vs.", colnames(cdc_clean)[6]),
     col = color_by_gender, pch = pointstyle_by_gender)
Is there any association between height and weight? Is there any relationship between both variables and the gender of the individuals? Comment below.
Hint: In order to give the title without writing it explicitly by
hand, consider using the function paste(), which
concatenates character objects.
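A minimal sketch of the hint, using placeholder strings in place of colnames(cdc_clean)[ind]; the names x_name, y_name and plot_title are chosen only for this example.

```r
# Placeholder strings standing in for colnames(cdc_clean)[ind].
x_name <- "Height (in.)"
y_name <- "Weight (lbs)"

# paste() joins its arguments with a single space by default.
plot_title <- paste(x_name, "vs.", y_name)
print(plot_title)
```

The resulting string can be passed directly to the main argument of plot().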
Answer here: Based on the data, men are generally taller and heavier than women. As height and weight increase, the points are increasingly dominated by men.
Create a pair plot: a figure with a 4-by-4 matrix of plots, where the
first column/row corresponds to Height (in), the second
column/row to Weight (lbs), third column/row to
Desired Weight (lbs) and the fourth column/row to
Age (Years).
If i != j, the plot corresponding to the position
(i,j) must be a scatterplot of the i-th vs
j-th numerical variables with titles and labels, color and
point shapes similar to what we created in part (F). For example, the
figure in the position (3,2) should be a scatterplot of
wtdesire on the y-axis, and weight on the
x-axis.
If i == j, the plot in the position
(i,i) must be a boxplot of the i-th numerical
variable, with titles, label and colors similar to the plots you
generated in part (E). For example, the plot in the position
(2,2) is a boxplot divided by gender of the variable
weight.
(Hint: you can achieve this by modifying par(), and also
implementing two for() loops)
After generating these plots, comment on the figure. Do all variables
present associations? Are all variables associated with the variable
gender? Comment below.
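## Insert code here.
par(mfrow = c(4, 4), mar = c(4, 4, 2, 1))   ## 4 x 4 grid, filled row by row
for (i in numeric_vars) {     ## row i: variable on the y-axis
  for (j in numeric_vars) {   ## column j: variable on the x-axis
    if (i == j) {
      ## Diagonal: boxplot of the i-th variable divided by gender.
      boxplot(cdc_clean[, i] ~ cdc_clean$Gender, xlab = "Gender",
              ylab = colnames(cdc_clean)[i], main = colnames(cdc_clean)[i])
    } else {
      ## Off-diagonal: j-th variable on x, i-th variable on y.
      plot(cdc_clean[, j], cdc_clean[, i],
           xlab = colnames(cdc_clean)[j], ylab = colnames(cdc_clean)[i],
           main = paste(colnames(cdc_clean)[j], "vs.", colnames(cdc_clean)[i]),
           col = color_by_gender, pch = pointstyle_by_gender)
    }
  }
}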
Answer here: