Instructions

Exercise 1: Multivariate Visualization in R.

Scatterplots are one of the fundamental tools for data visualization. When we are interested in only 2 variables, a single scatterplot is often a good approach. However, as the number of numerical variables increases, choosing the right tool becomes more challenging.

A common approach for visualizing multivariate data is the pair plot, which displays a scatterplot for every pair of variables, arranged in a matrix of plots. In this assignment, we write code that visualizes multivariate numerical data through pair plots.
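For reference, base R already ships a ready-made pair plot in the function pairs(). The sketch below is only an illustration on the built-in iris data, not the assignment's dataset; the names numeric_cols and species_colors are made up for this example.

```r
# Pair plot of the four numeric columns of the built-in iris data.
# iris is used here only for illustration; it is not the assignment's dataset.
numeric_cols <- iris[, 1:4]

# One color per species, picked by the integer code of the factor level.
species_colors <- c("forestgreen", "darkred", "steelblue")[as.integer(iris$Species)]

# pairs() draws one scatterplot for every pair of columns in a matrix layout.
pairs(numeric_cols, col = species_colors, pch = 19,
      main = "Pair plot of iris measurements")
```

The assignment builds a similar figure by hand, which gives control over what appears on the diagonal.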

In the following chunk, paste the functions we have created for cleaning data-frames and factorizing character variables in data-frames.

#######################################################
#######################################################
remove_na <- function(df) { ## Automatize the removal of missing observations.
  n_obs <- nrow(df)
  missing <- rep(FALSE, n_obs)
  for (obs_ind in seq_len(n_obs)) { ## seq_len() also works when df has zero rows.
    obs <- df[obs_ind, ]
    n_missing <- sum(is.na(obs))
    if(n_missing > 0) {
      missing[obs_ind] <- TRUE
    }
  }
  df_red <- df[!missing, ]

  return(df_red)
}

#######################################################
#######################################################
to_factors <- function(df) { ## Turn all character variables to factors.
  n_vars <- ncol(df)
  for (var_ind in seq_len(n_vars)) {
    var <- df[, var_ind]
    if (is.character(var)) { ## is.character() is also safe for columns with more than one class.
      df[, var_ind] <- factor(var)
    }
  }
  return(df)
}
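As a point of comparison, the same two cleaning steps can be sketched without explicit loops. This is an alternative under stated assumptions, not the course's required solution: the data frame toy and its values are hypothetical, and complete.cases() and lapply() are base R functions swapped in for the loops above.

```r
# Toy data frame (hypothetical values) with one missing entry and one character column.
toy <- data.frame(
  height = c(70, NA, 64),
  gender = c("m", "f", "f"),
  stringsAsFactors = FALSE
)

# Keep only the rows without any NA; same effect as looping over the rows.
toy_clean <- toy[complete.cases(toy), ]

# Convert every character column to a factor, leaving other columns untouched.
toy_clean[] <- lapply(toy_clean, function(col) {
  if (is.character(col)) factor(col) else col
})

str(toy_clean)
```

The loop versions are easier to step through when learning; the vectorized versions are shorter and typically faster on large data-frames.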
  1. Run the chunk below to load the dataset cdc. Once it is loaded, remove all the observations with missing entries. Verify if the dataset has any variables of class character, and turn them into factors if needed. Save a clean version of the dataset, including only the first 1000 clean observations in a data-frame named cdc_clean.
source("http://www.openintro.org/stat/data/cdc.R")

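cdc_clean = remove_na(cdc)            # drop observations with missing entries
cdc_clean = to_factors(cdc_clean)     # turn any character variables into factors
cdc_clean = head(cdc_clean, 1000)     # keep only the first 1000 clean observations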
  1. First, create a vector color_by_gender, which assigns the color "forestgreen" to observations from cdc_clean with gender == "f", and "darkred" to the observations with gender == "m". Also create a vector pointstyle_by_gender, which assigns the value 17 to all observations in cdc_clean with gender == "f" and 8 to all observations with gender == "m".
color_by_gender = ifelse(cdc_clean$gender == 'm', "darkred", "forestgreen")
pointstyle_by_gender = ifelse(cdc_clean$gender == 'm', 8, 17)
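ifelse() works well for two categories. When there are more levels, a named lookup vector is a common alternative. The sketch below uses a small hypothetical gender vector in place of cdc_clean$gender; color_map, pch_map, colors_toy and pch_toy are illustrative names.

```r
# Hypothetical gender labels standing in for cdc_clean$gender.
gender <- factor(c("m", "f", "f", "m"))

# Named lookup vectors: one entry per gender level.
color_map <- c(f = "forestgreen", m = "darkred")
pch_map   <- c(f = 17, m = 8)

# Index by the level names; unname() drops the names for cleaner output.
colors_toy <- unname(color_map[as.character(gender)])
pch_toy    <- unname(pch_map[as.character(gender)])
```

Indexing with as.character(gender) rather than the factor itself avoids accidentally indexing by the factor's integer codes.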
  1. Change the names of the columns in the data-frame cdc_clean so that they are properly capitalized and include the measurement units. For example, change the name of the variable height to "Height (in.)". The new names should be suitable as axis labels of a plot. Do this for the 4 numerical variables in the data. Also, change the name of the variable "gender" to "Gender".
colnames(cdc_clean) = c("genhlth", "exerany", "hlthplan", "smoke100", "Height (in.)", "Weight (lbs)", "Desired Weight (lbs)", "Age (Years)", "Gender")
  1. Create a boxplot of the variable Height (in) from the data-frame cdc_clean, dividing the observations by the variable Gender. The title and the y-label of the plot must be the name of the variable. Use colnames(cdc_clean)[5] to access the name without writing the characters directly. Furthermore, the x-label must be "Gender".
boxplot(cdc_clean[, 5] ~ Gender, data = cdc_clean, main = colnames(cdc_clean)[5], xlab = "Gender", ylab = colnames(cdc_clean)[5])

  1. Define a vector numeric_vars which contains the column indexes of the 4 numeric variables in the dataset. Then, using a for() iterating over the vector numeric_vars, generate a figure containing 4 different plots aligned on a single row.

Each plot must be a boxplot of one of the variables Height (in), Weight (lbs), Desired Weight (lbs) and Age (Years). As in the plot you created in part (B), the title and y-label of each plot must be the name of the variable, and the x-label must be "Gender". Make sure that the axis labels are appropriate for each individual plot.

Are there any interesting patterns you see in the data? Are there observable differences of each variable according to gender?

(Hint: use par() to setup the multi-plot figure).
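A minimal sketch of the par() pattern the hint refers to, shown on the built-in mtcars data rather than cdc_clean. The name old_par is illustrative. par() returns the previous settings when it changes them, which makes it easy to restore the layout afterwards.

```r
# Save the current graphical parameters while switching to a 1-by-2 layout.
old_par <- par(mfrow = c(1, 2))

# Two side-by-side boxplots from the built-in mtcars data (illustration only).
boxplot(mpg ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "mpg", main = "mpg")
boxplot(hp ~ cyl, data = mtcars, xlab = "Cylinders", ylab = "hp", main = "hp")

# Restore the previous layout so later plots are not affected.
par(old_par)
```

With mfrow, the panels are filled row by row, so a 1-by-4 layout places four consecutive plots on a single row.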

## Insert code here:
numeric_vars = c(5, 6, 7, 8)   # column indexes of the 4 numeric variables
par(mfrow = c(1, 4))           # 1 row, 4 columns of plots
for (i in numeric_vars) {
  # Title and y-label are both the name of the i-th variable.
  boxplot(cdc_clean[, i] ~ Gender, data = cdc_clean, xlab = "Gender", ylab = colnames(cdc_clean)[i], main = colnames(cdc_clean)[i])
}


Answer here: Men tend to be taller and heavier than women, and their desired weight is also higher. Women in the sample appear slightly older than men, possibly because women live longer on average.


  1. Using the data-frame cdc_clean, create a scatterplot of the variable Height (in) on the x-axis and the variable Weight (lbs) on the y-axis. The title must be Height (in) vs. Weight (lbs). The x and y labels must be the names of the variables. Furthermore, assign the color and shape of each point according to the gender of the observation. Make sure to access the values and names of the variables through cdc_clean[, ind] and colnames(cdc_clean)[ind] instead of using the name directly.
plot(cdc_clean[, 5], cdc_clean[, 6], xlab = colnames(cdc_clean)[5], ylab = colnames(cdc_clean)[6], main = paste(colnames(cdc_clean)[5], "vs.", colnames(cdc_clean)[6]), col = color_by_gender, pch = pointstyle_by_gender)

Is there any association between height and weight? Is there any relationship between both variables and the gender of the individuals? Comment below.

Hint: In order to give the title without writing it explicitly by hand, consider using the function paste(), which concatenates character objects.
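As a small illustration of the hint, the sketch below builds a title from two names with paste(). The vector var_names is a hypothetical stand-in for colnames(cdc_clean)[5:6].

```r
# Hypothetical column names standing in for colnames(cdc_clean)[5:6].
var_names <- c("Height (in.)", "Weight (lbs)")

# paste() joins its arguments with a single space by default.
plot_title <- paste(var_names[1], "vs.", var_names[2])
print(plot_title)
```

Because the title is assembled from the column names, it stays correct if the columns are renamed later.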


Answer here: Height and weight appear to be positively associated: taller individuals tend to weigh more. Men are generally taller and heavier than women, so the points with the largest heights and weights are mostly men.


  1. Finally, let's create a “pairs” plot. The goal is to create a 4-by-4 figure with 16 different plots. In this 4-by-4 figure, the first column/row corresponds to Height (in), the second column/row to Weight (lbs), the third column/row to Desired Weight (lbs) and the fourth column/row to Age (Years).
  • If i != j, the plot in position (i,j) must be a scatterplot of the i-th numerical variable (y-axis) against the j-th numerical variable (x-axis), with titles, labels, colors and point shapes similar to the plot we created in part (F). For example, the plot in position (3,2) should be a scatterplot with wtdesire on the y-axis and weight on the x-axis.

  • If i == j, the plot in position (i,i) must be a boxplot of the i-th numerical variable, with titles, labels and colors similar to the plots you generated in part (E). For example, the plot in position (2,2) is a boxplot of the variable weight divided by gender.

(Hint: you can achieve this by modifying par(), and also implementing two for() loops)

After generating these plots, comment on the figure. Are all pairs of variables associated? Are all variables associated with the variable gender? Comment below.

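par(mfrow = c(4, 4), mar = c(4, 4, 2, 1))   # 4-by-4 grid with smaller margins
for (i in numeric_vars) {        # i: row of the grid, variable on the y-axis
  for (j in numeric_vars) {      # j: column of the grid, variable on the x-axis
    if (i == j) {
      # Diagonal: boxplot of the i-th variable divided by gender.
      boxplot(cdc_clean[, i] ~ Gender, data = cdc_clean, xlab = "Gender", ylab = colnames(cdc_clean)[i], main = colnames(cdc_clean)[i])
    } else {
      # Off-diagonal: scatterplot of the j-th variable (x) vs the i-th variable (y).
      plot(cdc_clean[, j], cdc_clean[, i], xlab = colnames(cdc_clean)[j], ylab = colnames(cdc_clean)[i], main = paste(colnames(cdc_clean)[j], "vs.", colnames(cdc_clean)[i]), col = color_by_gender, pch = pointstyle_by_gender)
    }
  }
}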

## Insert code here.

Answer here: