Learning objectives
All of this material will appear on the exam. Take notes on the
workflow, functions, and concepts.
Main objectives
By the end of this lesson you will know how to …
- Identify all of the missing values in a column of a dataframe or
vector
- Replaces all the
NAs
in a column with a new value, such
as the mean.
- Know how a
for()
loop can work on many columns for
you.
Review
- set a working directory in RStudio
- confirm the location of the working directory with
getwd()
- confirm a file is present with and
list.files(pattern = ...)
- load typical R data file in spreadsheet format with
read.csv()
- Use basic R functions to check data you’ve loaded (e.g. dim,
summary, etc.)
- The function
mean()
has the important arguement ‘na.rm
= T’
- REview the difference between accessing columns with
$
and [,]
- Create a basic PCA, make a screeplot, and make and interpret a
simple biplot.
Preliminaries
For this task we will be using measurements on birds stored in the
file walsh2017morphology.csv
.
First we need to set our working directory to where the file is
located.
# run getwd()
getwd() # TODO
## [1] "/Users/james/Documents/Comp Bio Fall 2022/FinalProject"
Check for the presence of the walsh2017morphology.csv
file in the working directory with list.files()
# run list.files()
list.files() # TODO
## [1] "07-mean_imputation.docx"
## [2] "07-mean_imputation.Rmd"
## [3] "1.159051856-159301856.ALL.chr1_GRCh38.genotypes.20170504.vcf.gz"
## [4] "8.21122256-21362256.ALL.chr8_GRCh38.genotypes.20170504.vcf"
## [5] "all_loci.vcf"
## [6] "bird_snps_remove_NAs.html"
## [7] "bird_snps_remove_NAs.Rmd"
## [8] "fst_exploration_in_class-STUDENT.html"
## [9] "fst_exploration_in_class-STUDENT.Rmd"
## [10] "removing_fixed_alleles.html"
## [11] "removing_fixed_alleles.Rmd"
## [12] "rsconnect"
## [13] "transpose_VCF_data.html"
## [14] "transpose_VCF_data.Rmd"
## [15] "walsh2017morphology.csv"
If you have lots of files in the working directory, you can search
for the file specifically with list.files(pattern = “walsh”)
# Run list.files() with pattern = "walsh"
list.files(pattern = "walsh") # TODO
## [1] "walsh2017morphology.csv"
Load the .csv file with the read.csv()
function.
# add read.csv() to load the file
df <- read.csv(file = "walsh2017morphology.csv") # TODO
Always check to make sure the data looks like what you expected with
head()
, summary()
and other functions.
# run head(), summary(), and dim() on the data
# TODO
head(df)
## spp wing bill weight
## 1 NESP 56 8.5 18.2
## 2 NESP 56 8.5 20.7
## 3 NESP 59 8.0 17.6
## 4 NESP 59 8.2 16.0
## 5 NESP 60 8.3 16.5
## 6 NESP 58 8.5 16.0
summary(df)
## spp wing bill weight
## Length:73 Min. :53.00 Min. :7.900 Min. :14.5
## Class :character 1st Qu.:56.00 1st Qu.:8.400 1st Qu.:16.0
## Mode :character Median :57.00 Median :8.600 Median :17.0
## Mean :57.01 Mean :8.782 Mean :17.4
## 3rd Qu.:58.00 3rd Qu.:9.240 3rd Qu.:18.9
## Max. :60.00 Max. :9.900 Max. :21.7
## NA's :10 NA's :10 NA's :12
dim(df)
## [1] 73 4
The first column of the dataframe has the species of each sample.
We’ll save that to a separate vector and then remove that column from
the dataframe. (PCA functions only take on numeric columns, so we need
to get the character data out of the way)
First, a vector with the species code/
# Select the column spp using $spp
## and save it to a vector
spp_vector <- df$spp # TODO
Second, remove the column from the data frame using negative
indexing.
# remove the first column using
## negative indexing, e.g. [, -1]
df02 <- df[, -1] # TODO
Mean imputation
We could just use na.omit()
on the dataframe and get rid
of all of the NAs. This is what is normally done for PCA. This approach
can have problems for some analyses, such as analyses of SNPs.
Another option is to impute missing data. The most
common form of imputation is mean imputation.
The process is
- Mean: Calculate the mean of the column
- NAs: Locate all the NAs
- Replace: Replace the NAs with the mean of the
column
- Repeat: Repeat for all other columns.
We can get a sense for how many NAs there are with
summary()
# call summary on df02
summary(df02) # TODO
## wing bill weight
## Min. :53.00 Min. :7.900 Min. :14.5
## 1st Qu.:56.00 1st Qu.:8.400 1st Qu.:16.0
## Median :57.00 Median :8.600 Median :17.0
## Mean :57.01 Mean :8.782 Mean :17.4
## 3rd Qu.:58.00 3rd Qu.:9.240 3rd Qu.:18.9
## Max. :60.00 Max. :9.900 Max. :21.7
## NA's :10 NA's :10 NA's :12
We can calculate the mean of the wing column with
mean()
, being sure to include na.rm = TRUE so that R
doesn’t give us an error
# Call mean() on df$02$wing
## set na.rm = TRUE
mean_wing <- mean(df02$wing, #TODO
na.rm = TRUE )
Back to mean imputation
We have the mean of our first column, wing length
mean_wing
## [1] 57.00794
We identify the locations of the NAs using the general form of
which(is.na(x) == TRUE)
. I’ll separate this out into parts
first to show how it works.
First, a logical vector of TRUE/FALSE is a NA present
# call is.na()
is_NA_wing <- is.na(df02$wing)
Now use which()
to determine which elements of the
vector contain TRUE
# call which()
i_NA_wing <- which(is_NA_wing == TRUE)
In a single line I can do it like this
i_NA_wing <- which(is.na(df02$wing) == TRUE)
Again, we can do this with bracket notation by referring to the wing
column as column 1:
i_NA_wing <- which(is.na(df02[, 1]) == TRUE)
This vector of indices can then pull out the NAs from the
dataframe:
df02$wing[i_NA_wing]
## [1] NA NA NA NA NA NA NA NA NA NA
I can then assigned the mean value of wings to the NAs
df02$wing[i_NA_wing] <- mean_wing
With a column index its done like this
df02[i_NA_wing, 1] <- mean_wing
Note that the mean before doing this (stored in
mean_wing
) and after this is the same. I can check this
with a logical comparison
## Add == to compare the two elements
mean_wing == mean(df02$wing)
## [1] TRUE
Since our dataframe is small we can easily do mean imputation on the
remaining features.
First we need the means of the two remaining columns
# call mean() on the columns;
## and set na.rm = TRUE
mean_bill <- mean(df02$bill, na.rm = TRUE) # TODO
mean_weight <- mean(df02$weight, na.rm = TRUE) # TODO
Of course, we can do this with column indices too
# Set the column index to 3
mean_bill <- mean(df02[, 2], na.rm = TRUE)
mean_weight <- mean(df02[, 3], na.rm = TRUE) # TODO
We now need the locations of the NAs
i_NA_bill <- which(is.na(df02$bill) == TRUE)
i_NA_weight <- which(is.na(df02$weight) == TRUE)
Or with column indices
# Set the column indices to be 2 and 3
i_NA_bill <- which(is.na(df02[, 2]) == TRUE) # TODO
i_NA_weight <- which(is.na(df02[, 3]) == TRUE) # TODO
We can check that we are getting just NAs by using our i_NA_ vectors
to access the elements of the columns.
df02$bill[i_NA_bill]
## [1] NA NA NA NA NA NA NA NA NA NA
df02$weight[i_NA_weight]
## [1] NA NA NA NA NA NA NA NA NA NA NA NA
Now do the the replacement of the NAs
df02$bill[i_NA_bill] <- mean_bill
df02$weight[i_NA_weight] <- mean_weight
For completeness, let’s do this with column indices
# set the column indices with 2 and 3
df02[i_NA_bill, 2] <- mean_bill
df02[i_NA_weight, 3] <- mean_weight # TODO
We can check that there are now values in these rows
df02$bill[i_NA_bill]
## [1] 8.782063 8.782063 8.782063 8.782063 8.782063 8.782063 8.782063 8.782063
## [9] 8.782063 8.782063
df02$weight[i_NA_weight]
## [1] 17.40328 17.40328 17.40328 17.40328 17.40328 17.40328 17.40328 17.40328
## [9] 17.40328 17.40328 17.40328 17.40328
Calling summary on the data shows us that everything is filled in -
no more NAs are reported in the summary output.
summary(df02)
## wing bill weight
## Min. :53.00 Min. :7.900 Min. :14.5
## 1st Qu.:56.00 1st Qu.:8.400 1st Qu.:16.0
## Median :57.00 Median :8.782 Median :17.4
## Mean :57.01 Mean :8.782 Mean :17.4
## 3rd Qu.:58.00 3rd Qu.:9.120 3rd Qu.:18.5
## Max. :60.00 Max. :9.900 Max. :21.7
Mean imputation with a for()
loop
Let’s do this all again, but instead of doing each column (wing,
bill, weight) explicitly, we’ll use a for()
loop.
First I want to start with a fresh set of data, so I’ll go back to
the df object and make df02 again be removing column 1 (spp)
df02 <- df[, -1]
I will write a very explicit, overly long for() loop to try to be
very clear about each step.
There are three columns in the dataframe which need to have the NAs
replaced, so I cam going to make a vector with c(1, 2, 3).
i
in the loop will take on each of these values. I’ll have
the loop print out i
each time it takes on a new value. The
loop will then pull out the current column into a separate vector and
calculate the mean of that vector. I’ll have to report the mean to us
too. After that, the loop will find all of the NAs, and report the
number of NAs to us. Finally, it will replace the NAs in the vector. Now
that we’ve done our imputation, I will replace the original column in
the dataframe with our working vector of imputed data.
for(i in c(1,2,3)){
# print the current value of i
print(i)
# get the current column
column_i <- df02[, i]
# get the mean of the current column
mean_i <- mean(column_i, na.rm = TRUE)
# report the mean
cat("The mean of column ", i, "is ", mean_i,"\n")
# get the NAs in the current column
NAs_i <- which(is.na(column_i))
# report the number of NAs
N_NAs <- length(NAs_i)
cat("The number of NAs in column ", i, "is ", N_NAs,"\n")
# replace the NAs in the current column
column_i[NAs_i] <- mean_i
# replace the original column with the
## updated columns
df02[, i] <- column_i
cat("\n")
}
## [1] 1
## The mean of column 1 is 57.00794
## The number of NAs in column 1 is 10
##
## [1] 2
## The mean of column 2 is 8.782063
## The number of NAs in column 2 is 10
##
## [1] 3
## The mean of column 3 is 17.40328
## The number of NAs in column 3 is 12
All of this printing of output is not necessary; indeed, this loop
can be made much shorter. However, hopefully this makes it clear what is
going on.
for() loops and functions
Loops are great for getting things done. One drawback of the loop I
have above is that it is hard coded - it is set up to
only work for a particular dataframe. This can be seen because the name
of the dataframe object appears several times within the loop. If you
want to use the loop for something else, then there are several things
you need to change by hand. For example, if I want to grade 75
submissions to an assignment where people are cleaning data to put into
a PCA, I’ll have to change the for loop 75 times.
One way to solve this problem is to set the for()
loop
in a function. This takes a little thinking – which I’ll do for us – but
makes things more flexible and adaptable. (I’ll also remove all the
information that the loop was giving us - it was a little too
verbose). Now, instead of modifying 75 of
for()
loops to grade an assignment, I can just change what
goes into the function, and it will do the rest.
Here’s the function:
mean_imputation <- function(df){
n_cols <- ncol(df)
for(i in 1:n_cols){
# get the current column
column_i <- df[, i]
# get the mean of the current column
mean_i <- mean(column_i, na.rm = TRUE)
# get the NAs in the current column
NAs_i <- which(is.na(column_i))
# report the number of NAs
N_NAs <- length(NAs_i)
# replace the NAs in the current column
column_i[NAs_i] <- mean_i
# replace the original column with the
## updated columns
df[, i] <- column_i
}
return(df)
}
Let me test my function now. I need to start with a clean slate. I’ll
back to my original dataframe df
and remove the first
column again to remake df02
df02 <- df[, -1]
Check to make sure I have NAs to clean up again:
summary(df02)
## wing bill weight
## Min. :53.00 Min. :7.900 Min. :14.5
## 1st Qu.:56.00 1st Qu.:8.400 1st Qu.:16.0
## Median :57.00 Median :8.600 Median :17.0
## Mean :57.01 Mean :8.782 Mean :17.4
## 3rd Qu.:58.00 3rd Qu.:9.240 3rd Qu.:18.9
## Max. :60.00 Max. :9.900 Max. :21.7
## NA's :10 NA's :10 NA's :12
Now run my handy imputation function
df02 <- mean_imputation(df02)
And check that the NAs are gone
summary(df02)
## wing bill weight
## Min. :53.00 Min. :7.900 Min. :14.5
## 1st Qu.:56.00 1st Qu.:8.400 1st Qu.:16.0
## Median :57.00 Median :8.782 Median :17.4
## Mean :57.01 Mean :8.782 Mean :17.4
## 3rd Qu.:58.00 3rd Qu.:9.120 3rd Qu.:18.5
## Max. :60.00 Max. :9.900 Max. :21.7
Review - PCA in R
As an exercise, brainstorm the steps you’d have to do to conduct PCA
with these data, then run the code below and see if you’ve missed
anything.
We always scale data for PCA.
df02scaled <- scale(df02)
We don’t have to remove any NAs
because we did
imputation.
We can run a PCA with prcomp()
or another R function,
such as vegan::rda()
# add prcomp() and assign it to an object called
## my_pca
my_pca <- prcomp(df02scaled) #TODO
A scree plot is used to see which features are
important.
# add screeplot() to make the scree plot
screeplot(my_pca) #TODO

Now we’ll make the biplot. Look at the biplot and interpret the
relationship between the 3 features bill, weight, and wing. Then read
the information below.
# add biplot() to see the biplot
biplot(my_pca) #TODO
