STA 363/663 Lab 2

Complete all Questions and submit your final PDF or html (either works!) under Assignments in Canvas.

The Goal

In our last lab we encountered a data set with missing data. In that lab, we used complete case analysis (CCA) to handle the missing data. However, we now know that using CCA requires some assumptions and can have some repercussions if those assumptions are violated. How else can we handle missing data?

In our lab today, we will explore different ways of handling missing data. We will also work on refining our ability to work with objects in R. Pay attention to how we work with rows and columns today, because this is going to be very important as we move into our unit on predictive accuracy next week.

The Data

Our data from today comes from the World Happiness Report (https://worldhappiness.report/). We will be working with the most recent data from the 2023 report. The link you need to download the data is on Canvas.

Cleaning the data

When you load in your data set, you will notice a few things about the data set that suggest we would benefit from data cleaning. Data cleaning is a broad term that relates to getting a data set ready to work with. This can include creating new features, handling missing data, merging data frames, changing the names of columns, formatting features, and so on. We will practice a few common data cleaning steps today, and we will learn more as we move through the course.

Renaming a data frame and storing things

Our first data cleaning step has to do with the name of the file. The name of the data set (data frame) we have just loaded is really long: DataForFigureWHR2023. You probably don’t want to type that all through the lab today. Because of this, our first data cleaning step is to rename our data set.

We are going to change the name of our data set to Happiness. Doing this in R requires (1) creating a new copy of the data set under the new name that you want and (2) removing the old data set. Let’s see how to do that.

# Step 1: 
# Create a new copy of the data set
# under the new code name we want
Happiness <- PUT THE CURRENT NAME OF YOUR DATA SET HERE
Happiness <- data.frame(Happiness)

# Step 2: 
# Remove the original data set
rm(PUT THE ORIGINAL NAME OF YOUR DATA SET HERE)

Your Happiness data set should have $n = 137$ rows and 19 columns.

The code for Step 1 is particularly important. In this code, we are introduced to the <- operator. I think of this as a storage operator. We take whatever stuff is to the right of the <- (<- stuff) and we store it under the name on the left of the operator (whereWePutIt <- stuff)

Question 1

Using this idea of storing, create an object called holder which contains (stores) the numbers 1 through 7 (1:7). Once you have done this, type holder into a chunk and press play to show what has been stored under that name!

Removing Columns

The next step in our data cleaning process is to remove columns from our data set that we don’t need. It turns out that we only need some of the columns in the data set for our analysis today, and it can be easier to work with data if we remove columns we don’t need.

Columns we don’t need: 3 - 5, 12 - 19

Let’s practice by removing the columns 3, 4, and 5.

When we work with data frames in R, we use the following notation to access rows and columns:

dataframe[rows, columns]

So, if I want to print out the first row in the third column, I use:

Happiness[1,3]

Question 2

What part of the Happiness data would we print out with the code Happiness[3,4]?

If we leave either the rows or the columns part of the notation blank, this just means all. For instance, the following code tells R to print out the entire first row (so information on all the columns for this row).

Happiness[ 1 , ]

Similarly, this code prints out the first 3 rows (1, 2, and 3), with information on all the columns for those rows.

Happiness[ c(1:3) , ]

The c part of the code tells R we want a collection of things. In this case. it tells R we want to print out more than one row. The same notation would work for columns, just on the other side of the ,.

Question 3

What code would print out only the 2nd column?

We want to remove columns 3, 4, and 5. To do this, we need to run this code. NOTE: YOU NEED TO RUN THIS CODE EXACTY ONCE!

# Remove columns 3 -5 
Happiness <- Happiness[,-c(3:5)]

Look, there’s the <- again!! This symbol and the idea of storing things will be very helpful as we move through our course.

Question 4

Happiness <- Happiness[,-c(3:5)]

One of the skills we will practice in this course is translating code into words that clearly explain what each part of the code is doing. Let’s try it. In words, explain what the code above is doing. Make sure you are clear about what the <- part of the code and the - part of the code do.

Question 5

We have already removed columns 3 to 5. Next, we need to remove columns 9 through 16. Show and run the code you would need to do remove columns 9 through 16.
Originally, I told you we needed to remove columns 3-5 and columns 12 - 19. Why are we now removing 9 - 16 instead of 12 - 19?

Changing column names

At this point, we should have a data frame called Happiness with the following column names:

“Country name”
“Ladder score”
“Logged GDP per capita”
“Social support”
“Healthy life expectancy”
“Freedom to make life choices”
“Generosity”
“Perceptions of corruption”

Like the name of the data set, these column names are tricky to work with. In R, we want columns names with no spaces and no special characters (^, &, etc.). This means that before we proceed, we want to change the names of our columns.

There are a few ways to do this, but we will use the function colnames today. This function prints out the names of the columns in a data set:

# Print out the column names in the data set
colnames(Happiness)

However, this function can also be used to change the names of columns. To do so, you just need to adapt the code below by filling in the name you want for each column in place of Column1, Column2, etc.

colnames(Happiness) <- c("Column1", "Column2", "Column3", "Column4", "Column5", "Column6","Column7","Column8")

Question 6

Use the code above to change the names of the columns to “country”, “happiness”, “logGDP”, “socialSupport”, “healthyLifeExpectancy”, “freedomChoices”, “generosity”, and “perceptionCorruption”. Print out the new column names (colnames(Happiness)) as the answer to this question.

Note: You need the ” ” symbols around each name. This tells R this is a word and not a command.

Why didn’t you do that for us???

Wow, that was quite a few steps!!! Why didn’t I format that for you??

Well, in this course our goal is to help you get prepared to use statistics in a career, in an internship, in research, etc. In the real world, data is messy, and data cleaning is an absolutely essential skill. Because of this, we will practice data cleaning in this course so you are ready to perform similar cleaning tasks on real data you encounter.

The Full Data Set

Now that we have our data set ready to go, let’s see what our client wants us to do with it.

We have a client who is interested in the relationship between X = the log GDP per capita (essentially the wealth of the average person in the population of a country) and Y = the happiness score of that country.

Question 7

Is this a prediction task or an association task?

To address the goal of the client, for today we will build a least squares linear regression (LSLR) model.

# Train the model
full_model <- lm( happiness ~ logGDP, data = Happiness)

# Look at the regression coefficients
full_model$coefficients

# Build a 95% confidence interval
confint(full_model, level = .95)

Question 8

Write down the trained LSLR model (i.e., the regression line) using proper notation. Round to two decimal places.

Hint: To do this, you will need to use the white space (not a chunk!!) in your Markdown file. Type and adapt the following:

$$\widehat{Happiness} = $$

This result above is what we get using the whole data set. Now, let’s see what happens if we introduce some missing data.

Missing Data

This data set is fully observed for our two variables of interest, which means that we don’t actually have any missing data for happiness score or log GDP. This is actually great for us today because it gives us the opportunity to create missing data, try out some of our techniques from class, and then determine how well they work!

When we worked with the penguin data in Lab 1, we performed complete case analysis (CCA). This requires the very strong assumption of missing completely at random (MCAR), which means that the representative nature of our sample is preserved even when the data are missing. In other words, we assume that the missing data is just a random sample from the original data set.

However, in the real world we often encounter what is called missing at random (MAR) data. This is data in which the reason a variable is missing is related to another variable in the data set. For instance, if there is missing information on how many drinks a person consumes, this could be related to age - folks who are underage may be less likely to report the number of drinks that they consume.

Today, we are going to create MAR data related to social support in a country. If a country has high social support (with a score above .8), we are going to give that country a higher chance of having missing data on happiness score than a country with lower social support.

How can we do that?? We will build our first function. Copy and paste the following into a chunk and press play:

create_MAR <- function(data, percentMissing, variableMissing){
  
  # Set a seed 
  set.seed(10)

  # How many rows in data?
  n <- nrow(data)
  
  # Choose which rows get missing data
  missing <- sample(1:n, percentMissing*n, prob= c(ifelse(data$socialSupport > .8, .7,.3)), replace = FALSE)
  
  # Create a new data set
  data_MAR <- data
  
  # Fill in NA for the variable with missing dat
  # and the rows with missing data
  data_MAR[missing, variableMissing] <- NA
  
  # Output the data set
  data_MAR
}

Now, it will seem like nothing has happened, but it has. We have just taught R a function called create_MAR, which introduces MAR missing data into a data set related to social support. It takes the following inputs:

data: the data set (Happiness)
percentMissing: how much of the data you want to be missing (ex: .15 = 15%)
variableMissing: which variables we want to have missing data? For us, this is the happiness score.

This means that to create a version of the Happiness data set with 10% MAR data for happiness score, we use:

happiness10 <- create_MAR(Happiness, .10, "happiness")

And here is the <- operator again! We create the data set with 10% MAR missing data, and then we store the data set under the name happiness10.

Question 9

Using the function, create a data set called happiness25 with 25% missing data for happiness score. How many rows in this data set are missing information on happiness score? Hint: The summary command is helpful for this.

We won’t use the happiness10 data set anymore, so you can remove it:

rm(happiness10)

Let’s see what happens if we perform a complete case analysis on the happiness25 data set. We know that the assumption of MCAR is violated - how much does that impact our result?

Question 10

Train a LSLR model with X = log GDP and Y = happiness score using the happiness25 data set. Store your model output under the name MAR25_model. Write down the trained LSLR model using proper notation.

Note: When you run lm with missing data, it performs available case analysis (removing all rows without information in X or Y). Since we only have missing data on happiness score, and happiness score is our variable of interest, CCA and ACA are the same thing :)

Now let’s compare confidence intervals, which is generally what we are reporting if we are doing inference. As a reminder, the confidence interval from the LSLR model trained on the model with no missing data is:

	Lower Bound	Upper Bound
Intercept	-2.41	-.51
Slope	.64	.84

Question 11

Build a 95% confidence interval for the population slope using the model you trained in Question 10. Is this the same interval you got from the model trained on the full data set? If it is different, explain what about the interval is different than when we used the full data set.

Imputation

Okay, so this is what happens if we use CCA. What other options are there? One option if we do not want to use CCA is to impute, or fill in, the values for the rows that contain missing data. We learned a few of these techniques in class. Let’s try them out now.

We are going to try two different imputation techniques: unconditional mean imputation (UMI) and regression imputation. For each technique, we are going to:

Use the technique to impute the missing data.
Train our LSLR model using the imputed and observed data.
Compare the 95% confidence interval we get after imputation with what we would have gotten on the fully observed data set.
Assess the accuracy of our imputations.

Techinque 1: Unconditional Mean

The first technique for imputation we will try is unconditional mean imputation (UMI). As a reminder, this means that we find the mean of happiness score in our happiness25 data set, and we use that mean to fill in all the missing values for happiness score.

Let’s start off with finding the mean, which is the value we are going use for all our imputations. We learned the code for how to do this in class, and the skeleton is provided below.

# Find the mean of happiness score
# in the happiness25 data set
mean(      , na.rm = TRUE)

Question 12

Fill in the blanks in the skeleton code above to find the mean happiness score in the observed (non-missing) values in the happiness25 data set. State the mean (round to two decimal places).
What do you think the na.rm = TRUE part of the code does?

Now that we have the mean, we want to fill in this value for all the missing values in happiness score. The first step is to determine which rows in the happiness variable have missing data. In R, a code called is.na checks each row in a column and responds TRUE if the row has missing data and FALSE otherwise.

is.na(happiness25$happiness)

Question 13

Does row 11 have a missing value for happiness score in the happiness25 data set?

Ideally, we do not want to go through this list and find all the TRUE results. Luckily, R can do this for us using the which command. This asks R to look at a vector and find which entries are equal to a certain value.

which( is.na( happiness25$happiness) == TRUE)

Question 14

How would you change the code to find all the values that do NOT have missing data?

We’ve now learned all the code we need for UMI! A skeleton code is provided below, which means there are some pieces for you to fill in.

# Find the rows missing happiness score in the happiness25 data set
whichMissing <- which(is.na(    ) == TRUE)

# Create a data set called happiness_UMI
happinessUMI <- happiness25

# Fill in the missing values in with the mean
happinessUMI[     , "happiness"] <-

Question 15

Fill in the blanks in the skeleton code above to use UMI to impute the missing values for happiness score. Run the code and show your code.

Now we want to train our LSLR model using these imputations and compare them to what we got using the data set with no missing data.

Question 16

Train a LSLR model with X = log GDP and Y = happiness using the happinessUMI data set. Store your model output under the name UMI_model. Build and show a 95% confidence interval for the population slope using this trained model.

As a reminder, the confidence interval from the LSLR model trained on no missing data (the full model) is:

	Lower Bound	Upper Bound
Intercept	-2.41	-.51
Slope	.64	.84

Question 17

Discuss how the upper and lower bounds of the confidence intervals you got on the UMI imputed data differ from the above, correct confidence intervals.

Assessing Imputation Accuracy

The final step with our UMI imputations is to assess the quality of our imputations. Because we created the missing data ourselves, we know what the missing values are supposed to be. This means we can check to see how accurate our imputations are! This doesn’t typically happen with missing data, and we’ll talk about how we assess accuracy in those situations next week.

Let $y_i$ be the true value of happiness for row $i$ in the data set. Let $y_i^{*}$ be the value we imputed for happiness score for row $i$ in the data set. We want to see how close $y_i$ is to $y_i^{*}$.

Now, we have $n_{miss}= 34$ rows of missing data in our happiness25 data set. This means that ideally, we would like some metric (numeric tool) that allows us to summarize how well we are imputing the values for all of these missing values. A common metric we use for this is called the root mean squared error (RMSE).

\[RMSE = \sqrt{\frac{1}{n_{miss}} \sum_{i=1}^{n_{miss}} (y_i - y_i^{*})^2}\] Hmm, this looks like a lot. Let’s break it down.

Part 1

\[y_i - y_i^*\] This is the distance between each true value of happiness $y_i$ and the imputed value $y_i^{*}$. In other words, it’s the residual for row $i$!

Part 2

\[(y_i - y_i^*)^2\]

We square the values to get rid of any negative numbers; we don’t want a large negative value and a large positive value to cancel each other out, as each would represent a large difference between the truth and the imputation.

Part 3

\[\sum_{i=1}^{n_{miss}} (y_i - y_i^{*})^2\] Hey, this should look familiar! It looks just like the residual sum of squares (RSS). The only difference is that we only sum over the rows $i$ in the data that had missing data on happiness score, meaning the rows for which we actually had imputation.

Part 4

\[MSE = {\frac{1}{n_{miss}} \sum_{i=1}^{n_{miss}} (y_i - y_i^{*})^2}\] The next step is to divide the RSS by the number of imputed values. This gives us the mean squared error (MSE), which is the average squared imputation error, meaning the average squared distance between the true value and the imputation. We don’t want the squared difference, though. So…

Part 5

\[RMSE = \sqrt{\frac{1}{n_{miss}} \sum_{i=1}^{n_{miss}} (y_i - y_i^{*})^2}\]

We take the square root of the MSE to get the root mean squared error (RMSE). This gives us a measure that describes, on average, how far away the imputations are from the truth.

Question 18

Based on the definition of the RMSE, if we are using RMSE to assess imputation accuracy, do larger or smaller values of the RMSE indicate more accurate imputations?

As we move through the semester, we will learn several different metrics we can use to assess the performance of a model. Each will have a different job and a different interpretation, so it is important to think through how the metric is built so we understand what it means.

Let’s build a quick function to compute the RMSE so we don’t have to it by hand. Some of it is filled in, but you need to fill in the rest!

compute_RMSE <- function(truth, imputations){

  # Part 1
  part1 <- truth - imputations
  
  #Part 2
  part2 <- 
  
  #Part 3: RSS 
  part3 <- 
  
  #Part4: MSE
  part4 <- 1/length(imputations)*
  
  # Part 5:RMSE
  sqrt(        )
}

Let’s use this to compute the RMSE for our UMI imputations.

# Pull the imputations
imputations_UMI <-  happinessUMI[whichMissing, "happiness"]

# Pull the true values of happiness corresponding to those rows
truth <- Happiness[whichMissing, "happiness"]

# Compute the RSME for UMI
compute_RMSE(truth , imputations_UMI )

Question 19

What is the RMSE for UMI? Round to two decimal places.

Imputation: Regression

Okay, so that is what happens when we use UMI. Now, let’s see what happens if we try a regression imputation model instead.

We want to build a model that can impute happiness score as accurately as we can. Since our data set is small today, we want to use all other features in the data set to build a model to impute happiness score.

We learned this code in class, and it is provided again below.

# Build a model for imputation
imputation_engine <- lm(happiness ~ ., data = happiness25[,-c(1)])

# Create a new data set to store our imputations
happinessRegression <- happiness25

# Fill in the missing values in with
# predictions from the regression model
happinessRegression[whichMissing, "happiness"] <-
predict(imputation_engine, newdata = happiness25[whichMissing,-c(1)])

Question 20

You will notice that we excluded column 1 as a feature in training our imputation engine. Why did we need to do that?

Question 21

Train the LSLR model with X = log GDP and Y = happiness using the happinessRegression data set. Build and show a 95% confidence interval for the population slope.

Question 22

Discuss how this interval compares to the one on the full data set (with no missing data).

As a reminder, the confidence interval from the LSLR model trained on no missing data (the full model) is:

	Lower Bound	Upper Bound
Intercept	-2.41	-.51
Slope	.64	.84

Now that we have checked how the confidence intervals compare, let’s try assessing the predictive accuracy of regression imputation.

Question 23

What is the RMSE for regression imputation?

Question 24

What is the percent improvement in the RMSE for regression imputation versus UMI? In other words, compute:

\[\frac{(RMSE_{UMI} - RMSE_{Regression})}{RMSE_{UMI}} \times 100\]

When do we NOT use imputation

Imputation is cool, but it relies on two things.

1. We need enough data that is fully observed (no missingness) to build a reliable imputation engine.
1. There needs to be strong enough relationships in the data set for our imputations to make sense.

This means that if we have a variable with, say 75% of the rows that are missing, imputation is usually NOT a good choice. With only 25% of the data observed, this is likely not enough information for us to reasonably estimate the missing values.

Similarly, if we have a variable that appears to be mostly independent of all other variables we could use for imputation, our imputation engine essentially just guesses at the values to impute, and this isn’t helpful either.

Question 25

Suppose we are in the situation where a feature has 80% of its rows missing. How would you handle the missing data in this feature?

There is more than one correct answer - just brainstorm.

Question 26

Suppose we are in the situation where we can’t seem to find a strong imputation engine, meaning the feature is not highly related to other variables in the data set. The missingness is 25%. How would you handle the missing data for this feature?

Again, there is more than one correct answer - just brainstorm.

Next Steps

So, in this case we were able to check to see how different imputation techniques performed because we created the missing data. This means that we had the results of the full data set to compare to.

However, when we have missing data in the real world, we don’t know what the true values are - they are missing. In this case, how can we assess the accuracy of our imputations?? This moves us into a very important concept in Statistical Learning called test data. We will work on this starting next week!

References

This work was created by Dr. Nicole Dalzell, Associate Teaching Professor at Wake Forest University, and is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. Last updated 2024 September 1.

The Data

The data set used in this lab is from the World Happiness Report. The data set can be accessed here.

Citation: Helliwell, J. F., Layard, R., Sachs, J. D., Aknin, L. B., De Neve, J.-E., & Wang, S. (Eds.). (2023). World Happiness Report 2023 (11th ed.). Sustainable Development Solutions Network.