Mean imputation of missing data in R

Learning objectives

All of this material will appear on the exam. Take notes on the workflow, functions, and concepts.

Main objectives

By the end of this lesson you will know how to …

Identify all of the missing values in a column of a dataframe or vector
Replaces all the NAs in a column with a new value, such as the mean.
Know how a for() loop can work on many columns for you.

Review

set a working directory in RStudio
confirm the location of the working directory with getwd()
confirm a file is present with and list.files(pattern = ...)
load typical R data file in spreadsheet format with read.csv()
Use basic R functions to check data you’ve loaded (e.g. dim, summary, etc.)
The function mean() has the important arguement ‘na.rm = T’
REview the difference between accessing columns with $ and [,]
Create a basic PCA, make a screeplot, and make and interpret a simple biplot.

Preliminaries

For this task we will be using measurements on birds stored in the file walsh2017morphology.csv.

First we need to set our working directory to where the file is located.

# run getwd()
getwd()                 # TODO

## [1] "/Users/james/Documents/Comp Bio Fall 2022/FinalProject"

Check for the presence of the walsh2017morphology.csv file in the working directory with list.files()

# run list.files()
  list.files()           # TODO

##  [1] "07-mean_imputation.docx"                                        
##  [2] "07-mean_imputation.Rmd"                                         
##  [3] "1.159051856-159301856.ALL.chr1_GRCh38.genotypes.20170504.vcf.gz"
##  [4] "8.21122256-21362256.ALL.chr8_GRCh38.genotypes.20170504.vcf"     
##  [5] "all_loci.vcf"                                                   
##  [6] "bird_snps_remove_NAs.html"                                      
##  [7] "bird_snps_remove_NAs.Rmd"                                       
##  [8] "fst_exploration_in_class-STUDENT.html"                          
##  [9] "fst_exploration_in_class-STUDENT.Rmd"                           
## [10] "removing_fixed_alleles.html"                                    
## [11] "removing_fixed_alleles.Rmd"                                     
## [12] "rsconnect"                                                      
## [13] "transpose_VCF_data.html"                                        
## [14] "transpose_VCF_data.Rmd"                                         
## [15] "walsh2017morphology.csv"

If you have lots of files in the working directory, you can search for the file specifically with list.files(pattern = “walsh”)

# Run list.files() with pattern = "walsh"
list.files(pattern = "walsh")                 # TODO

## [1] "walsh2017morphology.csv"

Load the .csv file with the read.csv() function.

# add read.csv() to load the file
df <- read.csv(file = "walsh2017morphology.csv") # TODO

Always check to make sure the data looks like what you expected with head(), summary() and other functions.

# run head(), summary(), and dim() on the data
   # TODO
head(df)

##    spp wing bill weight
## 1 NESP   56  8.5   18.2
## 2 NESP   56  8.5   20.7
## 3 NESP   59  8.0   17.6
## 4 NESP   59  8.2   16.0
## 5 NESP   60  8.3   16.5
## 6 NESP   58  8.5   16.0

summary(df)

##      spp                 wing            bill           weight    
##  Length:73          Min.   :53.00   Min.   :7.900   Min.   :14.5  
##  Class :character   1st Qu.:56.00   1st Qu.:8.400   1st Qu.:16.0  
##  Mode  :character   Median :57.00   Median :8.600   Median :17.0  
##                     Mean   :57.01   Mean   :8.782   Mean   :17.4  
##                     3rd Qu.:58.00   3rd Qu.:9.240   3rd Qu.:18.9  
##                     Max.   :60.00   Max.   :9.900   Max.   :21.7  
##                     NA's   :10      NA's   :10      NA's   :12

dim(df)

## [1] 73  4

The first column of the dataframe has the species of each sample. We’ll save that to a separate vector and then remove that column from the dataframe. (PCA functions only take on numeric columns, so we need to get the character data out of the way)

First, a vector with the species code/

# Select the column spp using $spp
## and save it to a vector
spp_vector <- df$spp   # TODO

Second, remove the column from the data frame using negative indexing.

# remove the first column using
## negative indexing, e.g. [, -1]
df02 <- df[, -1]  # TODO

Mean imputation

We could just use na.omit() on the dataframe and get rid of all of the NAs. This is what is normally done for PCA. This approach can have problems for some analyses, such as analyses of SNPs.

Another option is to impute missing data. The most common form of imputation is mean imputation.

The process is

Mean: Calculate the mean of the column
NAs: Locate all the NAs
Replace: Replace the NAs with the mean of the column
Repeat: Repeat for all other columns.

We can get a sense for how many NAs there are with summary()

# call summary on df02
summary(df02) # TODO

##       wing            bill           weight    
##  Min.   :53.00   Min.   :7.900   Min.   :14.5  
##  1st Qu.:56.00   1st Qu.:8.400   1st Qu.:16.0  
##  Median :57.00   Median :8.600   Median :17.0  
##  Mean   :57.01   Mean   :8.782   Mean   :17.4  
##  3rd Qu.:58.00   3rd Qu.:9.240   3rd Qu.:18.9  
##  Max.   :60.00   Max.   :9.900   Max.   :21.7  
##  NA's   :10      NA's   :10      NA's   :12

We can calculate the mean of the wing column with mean(), being sure to include na.rm = TRUE so that R doesn’t give us an error

# Call mean() on df$02$wing
## set na.rm = TRUE
mean_wing <- mean(df02$wing,    #TODO
                  na.rm = TRUE )

A digression related to `for()` loops

We can do the same thing with the column index of 1 also.

mean_wing <- mean(df02[, 1], na.rm = TRUE)

Its useful to remember this because when we use for() loops we have to cycle through part of a dataframe (usually columns, sometimes rows). This is done within a for() loop by assigning the relevant index to an object, often called i, and using that object to pull out the column we want.

Within the for() loop it works something like this:

i <- 1
head(df02[, i])

## [1] 56 56 59 59 60 58

The for() loop works on the next column by updating i to 2, kind of like this

i <- 2
head(df02[, i])

## [1] 8.5 8.5 8.0 8.2 8.3 8.5

Back to mean imputation

We have the mean of our first column, wing length

mean_wing

## [1] 57.00794

We identify the locations of the NAs using the general form of which(is.na(x) == TRUE). I’ll separate this out into parts first to show how it works.

First, a logical vector of TRUE/FALSE is a NA present

# call is.na()
is_NA_wing <- is.na(df02$wing)

Now use which() to determine which elements of the vector contain TRUE

# call which()
i_NA_wing <- which(is_NA_wing == TRUE)

In a single line I can do it like this

i_NA_wing <- which(is.na(df02$wing) == TRUE)

Again, we can do this with bracket notation by referring to the wing column as column 1:

i_NA_wing <- which(is.na(df02[, 1]) == TRUE)

This vector of indices can then pull out the NAs from the dataframe:

df02$wing[i_NA_wing]

##  [1] NA NA NA NA NA NA NA NA NA NA

I can then assigned the mean value of wings to the NAs

df02$wing[i_NA_wing] <- mean_wing

With a column index its done like this

df02[i_NA_wing, 1] <- mean_wing

Note that the mean before doing this (stored in mean_wing) and after this is the same. I can check this with a logical comparison

## Add == to compare the two elements
mean_wing ==  mean(df02$wing)

## [1] TRUE

Since our dataframe is small we can easily do mean imputation on the remaining features.

First we need the means of the two remaining columns

# call mean() on the columns;
## and set na.rm = TRUE
mean_bill   <- mean(df02$bill, na.rm = TRUE)    # TODO
mean_weight <- mean(df02$weight, na.rm = TRUE)  # TODO

Of course, we can do this with column indices too

# Set the column index to  3
mean_bill   <- mean(df02[, 2], na.rm = TRUE) 
mean_weight <- mean(df02[, 3], na.rm = TRUE) # TODO

We now need the locations of the NAs

i_NA_bill   <- which(is.na(df02$bill)   == TRUE)
i_NA_weight <- which(is.na(df02$weight) == TRUE)

Or with column indices

# Set the column indices to be 2 and 3
i_NA_bill   <- which(is.na(df02[, 2]) == TRUE) # TODO
i_NA_weight <- which(is.na(df02[, 3]) == TRUE) # TODO

We can check that we are getting just NAs by using our i_NA_ vectors to access the elements of the columns.

df02$bill[i_NA_bill]

##  [1] NA NA NA NA NA NA NA NA NA NA

df02$weight[i_NA_weight]

##  [1] NA NA NA NA NA NA NA NA NA NA NA NA

Now do the the replacement of the NAs

df02$bill[i_NA_bill]     <- mean_bill
df02$weight[i_NA_weight] <- mean_weight

For completeness, let’s do this with column indices

# set the column indices with 2 and 3
df02[i_NA_bill, 2] <- mean_bill
df02[i_NA_weight, 3] <- mean_weight  # TODO

We can check that there are now values in these rows

df02$bill[i_NA_bill]

##  [1] 8.782063 8.782063 8.782063 8.782063 8.782063 8.782063 8.782063 8.782063
##  [9] 8.782063 8.782063

df02$weight[i_NA_weight]

##  [1] 17.40328 17.40328 17.40328 17.40328 17.40328 17.40328 17.40328 17.40328
##  [9] 17.40328 17.40328 17.40328 17.40328

Calling summary on the data shows us that everything is filled in - no more NAs are reported in the summary output.

summary(df02)

##       wing            bill           weight    
##  Min.   :53.00   Min.   :7.900   Min.   :14.5  
##  1st Qu.:56.00   1st Qu.:8.400   1st Qu.:16.0  
##  Median :57.00   Median :8.782   Median :17.4  
##  Mean   :57.01   Mean   :8.782   Mean   :17.4  
##  3rd Qu.:58.00   3rd Qu.:9.120   3rd Qu.:18.5  
##  Max.   :60.00   Max.   :9.900   Max.   :21.7

Mean imputation with a `for()` loop

Let’s do this all again, but instead of doing each column (wing, bill, weight) explicitly, we’ll use a for() loop.

First I want to start with a fresh set of data, so I’ll go back to the df object and make df02 again be removing column 1 (spp)

df02 <- df[, -1]

I will write a very explicit, overly long for() loop to try to be very clear about each step.

There are three columns in the dataframe which need to have the NAs replaced, so I cam going to make a vector with c(1, 2, 3). i in the loop will take on each of these values. I’ll have the loop print out i each time it takes on a new value. The loop will then pull out the current column into a separate vector and calculate the mean of that vector. I’ll have to report the mean to us too. After that, the loop will find all of the NAs, and report the number of NAs to us. Finally, it will replace the NAs in the vector. Now that we’ve done our imputation, I will replace the original column in the dataframe with our working vector of imputed data.

for(i in c(1,2,3)){
  
  # print the current value of i
  print(i)
  
  # get the current column
  column_i <- df02[, i]
  
  # get the mean of the current column
  mean_i <- mean(column_i, na.rm = TRUE)
  
  # report the mean
  cat("The mean of column ", i, "is ", mean_i,"\n")
  
  # get the NAs in the current column
  NAs_i <- which(is.na(column_i))
  
  # report the number of NAs
  N_NAs <- length(NAs_i)
  cat("The number of NAs in column ", i, "is ", N_NAs,"\n")
  
  # replace the NAs in the current column
  column_i[NAs_i] <- mean_i
  
  # replace the original column with the
  ## updated columns
  df02[, i] <- column_i
  
  cat("\n")
  
}

## [1] 1
## The mean of column  1 is  57.00794 
## The number of NAs in column  1 is  10 
## 
## [1] 2
## The mean of column  2 is  8.782063 
## The number of NAs in column  2 is  10 
## 
## [1] 3
## The mean of column  3 is  17.40328 
## The number of NAs in column  3 is  12

All of this printing of output is not necessary; indeed, this loop can be made much shorter. However, hopefully this makes it clear what is going on.

for() loops and functions

Loops are great for getting things done. One drawback of the loop I have above is that it is hard coded - it is set up to only work for a particular dataframe. This can be seen because the name of the dataframe object appears several times within the loop. If you want to use the loop for something else, then there are several things you need to change by hand. For example, if I want to grade 75 submissions to an assignment where people are cleaning data to put into a PCA, I’ll have to change the for loop 75 times.

One way to solve this problem is to set the for() loop in a function. This takes a little thinking – which I’ll do for us – but makes things more flexible and adaptable. (I’ll also remove all the information that the loop was giving us - it was a little too verbose). Now, instead of modifying 75 of for() loops to grade an assignment, I can just change what goes into the function, and it will do the rest.

Here’s the function:

mean_imputation <- function(df){
  n_cols <- ncol(df)
  
  for(i in 1:n_cols){
  # get the current column
  column_i <- df[, i]
  
  # get the mean of the current column
  mean_i <- mean(column_i, na.rm = TRUE)
  
  # get the NAs in the current column
  NAs_i <- which(is.na(column_i))
  
  # report the number of NAs
  N_NAs <- length(NAs_i)
  
  # replace the NAs in the current column
  column_i[NAs_i] <- mean_i
  
  # replace the original column with the
  ## updated columns
  df[, i] <- column_i
  
  }
  
  return(df)
}

Let me test my function now. I need to start with a clean slate. I’ll back to my original dataframe df and remove the first column again to remake df02

df02 <- df[, -1]

Check to make sure I have NAs to clean up again:

summary(df02)

##       wing            bill           weight    
##  Min.   :53.00   Min.   :7.900   Min.   :14.5  
##  1st Qu.:56.00   1st Qu.:8.400   1st Qu.:16.0  
##  Median :57.00   Median :8.600   Median :17.0  
##  Mean   :57.01   Mean   :8.782   Mean   :17.4  
##  3rd Qu.:58.00   3rd Qu.:9.240   3rd Qu.:18.9  
##  Max.   :60.00   Max.   :9.900   Max.   :21.7  
##  NA's   :10      NA's   :10      NA's   :12

Now run my handy imputation function

df02 <- mean_imputation(df02)

And check that the NAs are gone

summary(df02)

##       wing            bill           weight    
##  Min.   :53.00   Min.   :7.900   Min.   :14.5  
##  1st Qu.:56.00   1st Qu.:8.400   1st Qu.:16.0  
##  Median :57.00   Median :8.782   Median :17.4  
##  Mean   :57.01   Mean   :8.782   Mean   :17.4  
##  3rd Qu.:58.00   3rd Qu.:9.120   3rd Qu.:18.5  
##  Max.   :60.00   Max.   :9.900   Max.   :21.7

Review - PCA in R

As an exercise, brainstorm the steps you’d have to do to conduct PCA with these data, then run the code below and see if you’ve missed anything.

We always scale data for PCA.

df02scaled <- scale(df02)

We don’t have to remove any NAs because we did imputation.

We can run a PCA with prcomp() or another R function, such as vegan::rda()

# add prcomp() and assign it to an object called
## my_pca
my_pca <- prcomp(df02scaled) #TODO

A scree plot is used to see which features are important.

# add screeplot() to make the scree plot
screeplot(my_pca) #TODO

Now we’ll make the biplot. Look at the biplot and interpret the relationship between the 3 features bill, weight, and wing. Then read the information below.

# add biplot() to see the biplot
biplot(my_pca) #TODO