Learning objectives
All of this material will appear on the exam. Take notes on the
workflow, functions, and concepts.
Main objectives
By the end of this lesson you will know how to:
- set a working directory in RStudio
- confirm the location of the working directory with getwd()
- confirm a file is present with list.files() and
list.files(pattern = ...)
- load a typical data file in spreadsheet format with read.csv()
Review
- Use basic R functions to check data you’ve loaded (e.g. dim,
summary, etc.)
- Create a basic PCA, make a screeplot, and make and interpret a
simple biplot.
Introduction
R and R packages include many datasets that can easily be loaded with
the data() function.
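As a quick illustration, the sketch below loads the built-in iris dataset and looks at it. The choice of iris is only an example; any dataset that ships with R would work the same way.

```r
# Load a dataset that ships with base R (in the 'datasets' package)
data(iris)

# Look at the first few rows and the dimensions
head(iris)
dim(iris)
```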
Real data analysis typically requires you to load files from your hard
drive into R. For typical statistical and machine learning analyses
these files are .csv files, which stands for “comma-separated
values.” These are essentially lightweight spreadsheet files
that can be opened by a text editor, R, or a spreadsheet program.
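To see that a .csv file really is just text, here is a minimal sketch that writes a tiny, made-up table to a temporary file, prints the raw lines, and then reads it back. The object names (tmp_csv, toy, toy_back) and the values are invented for illustration.

```r
# Write a tiny, made-up table to a temporary .csv file
tmp_csv <- tempfile(fileext = ".csv")
toy <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))
write.csv(toy, file = tmp_csv, row.names = FALSE)

# readLines() shows the raw text: a header row, then one line per record
raw_lines <- readLines(tmp_csv)
print(raw_lines)

# read.csv() parses the same text back into a data frame
toy_back <- read.csv(tmp_csv)
print(toy_back)
```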
Bioinformatics and computational biology analyses often involve more
complicated data, but the basic ideas behind loading data are the
same.
Key to loading data into R is that R knows where to find the file.
This isn’t hard to do, but it is easy to forget to do it or to not get
it correct on the first try.
This lesson will guide you through the process and give you practice
with some tools to maximize success.
Step 1: Save your R script file in a good place
First, save this or whatever R code file you’re working with in a
place where you want it to live and where you can easily find it on
your computer.
The easiest thing to do is make a new folder on your computer’s
desktop and save this file there.
Step 2: Set R’s working directory to where the script file is
You must make sure that R has its working directory
set to where this file is located. You need to do this
every time you need to load a file. (The only exception is
if you use RStudio projects.)
- On the top RStudio menu, click on “Session”,
- Then “Set Working Directory”,
- Then “To Source File Location”.
This will set the working directory to where the file is saved. R
will tell you in the console where the working directory is set by
printing out something like setwd("C:/Users/nlb24…")
Again, unless you are using an RStudio project you must do this every
time you need to load the file with the data. One thing you can do is
copy and paste the code from the console that appears when you set the
working directory.
Copy and paste it below:
# Paste the setwd("C:/Users/nlb24....") code below
# setwd("~/Desktop/R/BIOSC1540")
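The same thing can be done entirely in code. The sketch below is only an illustration: it uses tempdir() as a stand-in for the folder that holds your script, and the names old_wd, practice_dir, and new_wd are made up for this example.

```r
# Remember the current working directory so it can be restored later
old_wd <- getwd()

# tempdir() stands in for the folder that holds your script and data
practice_dir <- tempdir()
setwd(practice_dir)

# getwd() should now report the new location
new_wd <- getwd()
print(new_wd)

# Restore the original working directory
setwd(old_wd)
```

Saving the old location first is a small habit that makes it easy to undo a setwd() call that went to the wrong place.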
Step 3: Copy your data file to the folder where the R code file is
Download your data and save it in the same folder as this or whatever
code file you are using. For this exercise, download the file
walsh2017morphology.csv and save it where this .Rmd file is located.
Step 4: Confirm your working directory and the presence of the
file
Check the location of your working directory with
getwd()
# run getwd()
getwd()
## [1] "/Users/jasonlee/Desktop/R/BIOSC1540"
Check for the presence of the “walsh2017morphology.csv” file in the
working directory with list.files()
# run list.files()
list.files()
## [1] "BIOSC1540_Exam3.Rmd" "center_function.R"
## [3] "code_checkpoint_vcfR.html" "code_checkpoint_vcfR.Rmd"
## [5] "vcfR_test.vcf" "vcfR_test.vcf.gz"
## [7] "walsh2017morphology.csv" "working_directory_practice.Rmd"
If you have lots of files in the working directory, you can search
for the file specifically with list.files(pattern = "walsh")
# Run list.files() with pattern = "walsh"
list.files(pattern = "walsh")
## [1] "walsh2017morphology.csv"
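Two related checks are worth knowing. The pattern argument is treated as a regular expression, so it can match file endings as well as name fragments, and file.exists() gives a simple TRUE or FALSE answer for one specific file. The sketch below uses a scratch folder and invented file names so that it runs anywhere.

```r
# Create a scratch folder with a few empty, made-up files
scratch <- file.path(tempdir(), "list_files_demo")
dir.create(scratch, showWarnings = FALSE)
file.create(file.path(scratch, c("walsh_demo.csv", "notes.txt", "other.csv")))

# pattern is a regular expression; "\\.csv$" matches names ending in .csv
csv_files <- list.files(path = scratch, pattern = "\\.csv$")
print(csv_files)

# file.exists() returns TRUE or FALSE for one specific file
has_walsh <- file.exists(file.path(scratch, "walsh_demo.csv"))
print(has_walsh)
```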
Load the .csv file
CSV files can be read in with the read.csv()
function.
# add read.csv() to load the file
df <- read.csv(file = "walsh2017morphology.csv")
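If the working directory is wrong, read.csv() fails with a fairly terse error. One possible way to get a clearer message is to check for the file first. The helper read_csv_checked() below is hypothetical (it is not part of R or of this lesson), and the demo file and its values are made up.

```r
# Hypothetical helper: read a .csv only if it exists, otherwise explain why not
read_csv_checked <- function(path) {
  if (!file.exists(path)) {
    stop("Could not find '", path, "' from working directory ", getwd())
  }
  read.csv(file = path)
}

# Demo with a small made-up file in a temporary folder
demo_path <- file.path(tempdir(), "demo_morphology.csv")
write.csv(data.frame(spp = c("A", "B"), wing = c(56, 59)),
          file = demo_path, row.names = FALSE)
demo_df <- read_csv_checked(demo_path)
print(demo_df)
```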
Always check that the data look like what you expected with head(),
summary(), and other functions.
# run head(), summary(), and dim() on the data
head(df)
## spp wing bill weight
## 1 NESP 56 8.5 18.2
## 2 NESP 56 8.5 20.7
## 3 NESP 59 8.0 17.6
## 4 NESP 59 8.2 16.0
## 5 NESP 60 8.3 16.5
## 6 NESP 58 8.5 16.0
summary(df)
## spp wing bill weight
## Length:73 Min. :53.00 Min. :7.900 Min. :14.5
## Class :character 1st Qu.:56.00 1st Qu.:8.400 1st Qu.:16.0
## Mode :character Median :57.00 Median :8.600 Median :17.0
## Mean :57.01 Mean :8.782 Mean :17.4
## 3rd Qu.:58.00 3rd Qu.:9.240 3rd Qu.:18.9
## Max. :60.00 Max. :9.900 Max. :21.7
## NA's :10 NA's :10 NA's :12
dim(df)
## [1] 73 4
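Two other checks that can be useful are str(), which shows each column's type, and a per-column count of missing values. The sketch below uses a small made-up data frame (toy_df) that borrows the lesson's column names; the values are invented.

```r
# Small made-up data frame with the same column names as the lesson data
toy_df <- data.frame(
  spp    = c("A", "A", "B", "B"),
  wing   = c(56, NA, 59, 60),
  bill   = c(8.5, 8.0, NA, 8.3),
  weight = c(18.2, 17.6, NA, 16.5)
)

# str() shows each column's type alongside its first few values
str(toy_df)

# Count missing values per column
na_counts <- colSums(is.na(toy_df))
print(na_counts)
```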
Review - Run a PCA
We scale data before running a PCA so that features measured on
different scales contribute comparably. The first column is character
data, so we’ll drop it using df2 <- df[, -1].
# add [, -1] to drop the first column
df2 <- df[, -1] #TODO
# add scale() to scale the data
df2_scale <- scale(df2) #TODO
We need to remove NAs with na.omit()
# add na.omit() to remove the NAs
# assign the output to df2_scale_noNA
df2_scale_noNA <- na.omit(df2_scale) #TODO
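It helps to know that na.omit() drops every row that contains at least one NA, not just the individual missing cells. The sketch below shows this on a small made-up table (the names toy, toy_scaled, and toy_clean and the values are invented).

```r
# Made-up measurements with NAs in two different rows
toy <- data.frame(
  wing   = c(56, 59, NA, 60),
  bill   = c(8.5, NA, 8.2, 8.3),
  weight = c(18.2, 17.6, 16.0, 16.5)
)

# scale() skips NAs when computing each column's mean and standard deviation
toy_scaled <- scale(toy)

# na.omit() removes any row that still contains an NA
toy_clean <- na.omit(toy_scaled)

# Compare the number of rows before and after
nrow(toy_scaled)
nrow(toy_clean)
```

Because rows are dropped after scaling, the remaining columns may no longer have a mean of exactly zero. prcomp() centers its input by default, so this should not affect the centering step.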
We can run a PCA with prcomp()
# add prcomp() and assign it to an object called
## my_pca
my_pca <- prcomp(df2_scale_noNA) #TODO
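As an aside, prcomp() can also do the scaling itself through its scale. argument. The sketch below uses the built-in iris measurements (not the sparrow data) to check that scaling by hand and scaling inside prcomp() give the same standard deviations for the principal components.

```r
# Numeric columns of a built-in dataset, used here only as an example
num_data <- iris[, 1:4]

# Option 1: scale first, then run the PCA
pca_manual <- prcomp(scale(num_data))

# Option 2: let prcomp() scale the data
pca_builtin <- prcomp(num_data, scale. = TRUE)

# The two approaches should agree
all.equal(pca_manual$sdev, pca_builtin$sdev)
```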
Let’s look at the scree plot. There are only 3 features, and PC3 is
fairly tall on the scree plot, so in a real analysis we should look at
it.
# add screeplot() to make the scree plot
screeplot(my_pca) #TODO
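The bar heights on a scree plot are the variances of the principal components, so the same information can be read as numbers. The sketch below uses the built-in iris data as a stand-in; demo_pca and var_explained are names chosen for this example.

```r
# PCA on a built-in dataset, used here only as an example
demo_pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Variance of each PC is its standard deviation squared;
# dividing by the total gives the proportion of variance explained
var_explained <- demo_pca$sdev^2 / sum(demo_pca$sdev^2)
print(round(var_explained, 3))

# summary() reports the same proportions plus cumulative totals
summary(demo_pca)
```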

Now we’ll make the biplot. Look at the biplot and interpret the
relationship between the 3 features bill, weight, and wing. Then read
the information below.
# add biplot() to see the biplot
biplot(my_pca) #TODO

How to interpret the biplot
In the biplot created above, the “bill” and “weight” vectors point to
the left, and “wing” points straight down.
This means that bill and weight are correlated with PC1, which
biplot() draws on the horizontal axis by default. Wing is correlated
with PC2, the vertical axis.
The bill and weight vectors point in nearly the same direction, so the
raw data for these features are likely to be highly correlated with
each other.
The “wing” vector points straight down at about a 90-degree (right)
angle to not only PC1, but also bill and weight. We can therefore say
that the wing vector is orthogonal to PC1, bill, and
weight.
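A biplot reading can be checked against numbers. The arrows are drawn from the loadings stored in the rotation element of the prcomp() output, and features whose arrows point the same way would be expected to show a high positive correlation in cor(). The sketch below uses the built-in USArrests data rather than the sparrow data, so the values it prints say nothing about bill, weight, or wing.

```r
# PCA on a built-in dataset, used here only as an example
demo_pca <- prcomp(USArrests, scale. = TRUE)

# Loadings of each feature on PC1 and PC2 (the two axes of the default biplot)
loadings <- demo_pca$rotation[, 1:2]
print(round(loadings, 2))

# Pairwise correlations between the raw features
cor_matrix <- cor(USArrests)
print(round(cor_matrix, 2))
```

With data that contain NAs, as the sparrow data do, cor() needs an argument such as use = "complete.obs" to return values instead of NA.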