Learning objectives

All of this material will appear on the exam. Take notes on the workflow, functions, and concepts.

Main objectives

By the end of this lesson you will know how to:

  • set a working directory in RStudio
  • confirm the location of the working directory with getwd()
  • confirm a file is present with list.files() and list.files(pattern = ...)
  • load a typical data file in spreadsheet format with read.csv()

Review

  • Use basic R functions to check data you’ve loaded (e.g. dim, summary, etc.)
  • Create a basic PCA, make a screeplot, and make and interpret a simple biplot.

Introduction

R and R packages have many datasets that can easily be loaded with the data() function.
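
As a quick illustration, a built-in dataset can be loaded and inspected in a few lines. This is only a sketch: the iris dataset is an assumption chosen because it ships with base R, and it is not part of this lesson's data.

```r
# Load the built-in iris dataset (ships with base R's "datasets" package)
data(iris)

# Quick checks on what was loaded
dim(iris)    # number of rows and columns
head(iris)   # first six rows
```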

Real data analysis typically requires you to load files from your hard drive into R. For typical statistical and machine learning analyses these files are .csv files, which stands for “comma-separated values.” These are essentially lightweight spreadsheet files that can be opened by a text editor, R, or a spreadsheet program. Bioinformatics and computational biology analyses often involve more complicated data, but the basic ideas behind loading data are the same.
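
To see that a .csv file really is just text, the sketch below writes a tiny data frame to a temporary file and reads the raw lines back. The object names (toy, toy_file, toy2) and the values are made up for illustration.

```r
# A toy data frame (made-up values, for illustration only)
toy <- data.frame(id = c("a", "b"), value = c(1.5, 2.0))

# Write it to a temporary .csv file
toy_file <- tempfile(fileext = ".csv")
write.csv(toy, file = toy_file, row.names = FALSE)

# Look at the raw text: one header line, then one comma-separated line per row
readLines(toy_file)

# Read the same file back in as a data frame
toy2 <- read.csv(toy_file)
toy2
```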

The key to loading data into R is making sure R knows where to find the file. This isn’t hard to do, but it is easy to forget to do it or to not get it right on the first try.

This lesson will guide you through the process and give you practice with some tools to maximize success.

Step 1: Save your R script file in a good place

First, save this file (or whatever R code file you’re working with) in a place where you want it to live and where you can easily find it on your computer.

The easiest thing to do is make a new folder on your computer’s desktop and save this file there.

Step 2: Set R’s working directory to where the script file is

You must make sure that R has its working directory set to where this file is located. You need to do this every time you need to load a file. (The only exception is if you use RStudio projects.)

  1. On the top RStudio menu, click on “Session”,
  2. Then “Set Working Directory”,
  3. Then “To Source File Location”.

This will set the working directory to the folder where the file is saved. R will tell you in the console where the working directory is set by printing out something like setwd("C:/Users/nlb24…").

Again, unless you are using an RStudio project you must do this every time you need to load the file with the data. One thing you can do is copy and paste the code from the console that appears when you set the working directory.

Copy and paste it below:

# Paste the setwd("C:/Users/nlb24....") code below
# setwd("~/Desktop/R/BIOSC1540")
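
The same thing can be done entirely in code. A minimal sketch follows, using tempdir() as a stand-in for your own folder (that substitution and the name old_wd are assumptions for illustration). setwd() invisibly returns the previous working directory, which makes it easy to switch back.

```r
# Switch to another folder and remember where we started;
# tempdir() stands in for your own folder path
old_wd <- setwd(tempdir())

# Confirm the change
getwd()

# Switch back to the original working directory
setwd(old_wd)
getwd()
```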

Step 3: Copy your data file to the folder where the R code file is

Download your data and save it in the same folder as this file (or whatever code file you are using). For this exercise, download the file walsh2017morphology.csv and save it to where this .Rmd file is located.

Step 4: Confirm your working directory and the presence of the file

Check the location of your working directory with getwd()

# run getwd()
getwd()
## [1] "/Users/jasonlee/Desktop/R/BIOSC1540"

Check for the presence of the “walsh2017morphology.csv” file in the working directory with list.files()

# run list.files()
list.files()
## [1] "BIOSC1540_Exam3.Rmd"            "center_function.R"             
## [3] "code_checkpoint_vcfR.html"      "code_checkpoint_vcfR.Rmd"      
## [5] "vcfR_test.vcf"                  "vcfR_test.vcf.gz"              
## [7] "walsh2017morphology.csv"        "working_directory_practice.Rmd"

If you have lots of files in the working directory, you can search for the file specifically with list.files(pattern = "walsh"). The pattern is treated as a regular expression and is matched against the file names.

# Run list.files() with pattern = "walsh"
list.files(pattern = "walsh")
## [1] "walsh2017morphology.csv"
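
Two related checks are sketched below in a scratch folder so the example does not depend on your own files. The folder and file names (scratch, walsh_demo.csv, notes.txt) are made up for illustration. Because pattern is a regular expression, "\\.csv$" matches names ending in .csv, while file.exists() gives a simple TRUE or FALSE for one exact file.

```r
# Create a scratch folder with two empty files (names are made up)
scratch <- tempfile("wd_demo_")
dir.create(scratch)
file.create(file.path(scratch, c("walsh_demo.csv", "notes.txt")))

# List only the files whose names end in .csv
csv_files <- list.files(path = scratch, pattern = "\\.csv$")
csv_files

# Check for one exact file name
file.exists(file.path(scratch, "walsh_demo.csv"))
file.exists(file.path(scratch, "missing.csv"))
```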

Load the .csv file

CSV files can be read in with the read.csv() function.

# add read.csv() to load the file
df <- read.csv(file = "walsh2017morphology.csv")
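
If the working directory is wrong, read.csv() stops with an error because it cannot open the file. One way to get a friendlier message is to check first. The helper below, load_if_present(), is hypothetical (it is not a base R function) and is only a sketch of the idea.

```r
# Hypothetical helper: read a .csv only if the file is actually there
load_if_present <- function(path) {
  if (!file.exists(path)) {
    message("Cannot find '", path, "'; working directory is ", getwd())
    return(NULL)
  }
  read.csv(file = path)
}

# Returns a data frame if the file is found, otherwise NULL with a message
df <- load_if_present("walsh2017morphology.csv")
```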

Always check to make sure the data looks like what you expected with head(), summary() and other functions.

# run head(), summary(), and dim() on the data
head(df)
##    spp wing bill weight
## 1 NESP   56  8.5   18.2
## 2 NESP   56  8.5   20.7
## 3 NESP   59  8.0   17.6
## 4 NESP   59  8.2   16.0
## 5 NESP   60  8.3   16.5
## 6 NESP   58  8.5   16.0
summary(df)
##      spp                 wing            bill           weight    
##  Length:73          Min.   :53.00   Min.   :7.900   Min.   :14.5  
##  Class :character   1st Qu.:56.00   1st Qu.:8.400   1st Qu.:16.0  
##  Mode  :character   Median :57.00   Median :8.600   Median :17.0  
##                     Mean   :57.01   Mean   :8.782   Mean   :17.4  
##                     3rd Qu.:58.00   3rd Qu.:9.240   3rd Qu.:18.9  
##                     Max.   :60.00   Max.   :9.900   Max.   :21.7  
##                     NA's   :10      NA's   :10      NA's   :12
dim(df)
## [1] 73  4
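
A few other base R checks are often useful at this point, especially when summary() reports NAs. The sketch below uses a small toy data frame with the same column names as the lesson data. The values and missing entries are made up for illustration.

```r
# Toy data with the same columns as the lesson data (made-up values)
toy <- data.frame(
  spp    = c("NESP", "NESP", "NESP", "NESP"),
  wing   = c(56, 56, 59, NA),
  bill   = c(8.5, 8.5, 8.0, NA),
  weight = c(18.2, 20.7, NA, NA)
)

# Column types at a glance
str(toy)

# Number of NAs in each column
na_counts <- colSums(is.na(toy))
na_counts

# Number of rows with no NAs at all
complete_rows <- sum(complete.cases(toy))
complete_rows
```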

Review - Run a PCA

We scale the data before running a PCA so that features measured in different units contribute equally. The first column is character data, so we’ll drop it using df2 <- df[, -1].

# add [, -1] to drop the first column
df2 <- df[, -1]

# add scale() to scale the data
df2_scale <- scale(df2)
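
To see what scale() does, the sketch below scales a small toy data frame and confirms that each column ends up with a mean of about 0 and a standard deviation of 1. The object names toy and toy_scale are assumptions for illustration.

```r
# Toy data: two numeric columns on different scales
toy <- data.frame(
  wing   = c(56, 56, 59, 59, 60),
  weight = c(18.2, 20.7, 17.6, 16.0, 16.5)
)

# Center each column on its mean and divide by its standard deviation
toy_scale <- scale(toy)

# Column means are (numerically) zero and standard deviations are one
colMeans(toy_scale)
apply(toy_scale, 2, sd)
```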

We need to remove the rows that contain NAs with na.omit().

# add na.omit() to remove the NAs
# assign the output to df2_scale_noNA

df2_scale_noNA <- na.omit(df2_scale)
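
Note that na.omit() drops every row that has an NA in any column, so it is worth checking how many rows remain. A minimal sketch with made-up toy data:

```r
# Toy data with missing values in different rows (made-up values)
toy <- data.frame(
  wing = c(56, NA, 59, 60),
  bill = c(8.5, 8.5, NA, 8.3)
)

# Rows 2 and 3 each contain an NA, so both are dropped
toy_noNA <- na.omit(toy)

# Compare the number of rows before and after
nrow(toy)
nrow(toy_noNA)
```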

We can run a PCA with prcomp().

# add prcomp() and assign it to an object called
# my_pca
my_pca <- prcomp(df2_scale_noNA)

Let’s look at the scree plot. There are only 3 features, and the bar for PC3 is fairly tall on the scree plot, so in a real analysis we should look at PC3 as well.

# add screeplot() to make the scree plot
screeplot(my_pca)
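
The heights of the scree plot bars are the variances of the principal components, so the same information can be read off as numbers. The sketch below uses the built-in iris measurements as a stand-in (an assumption, so that the example runs without the lesson's .csv file). With the lesson data you would use my_pca in place of demo_pca.

```r
# Stand-in PCA on the four numeric iris columns (not the lesson data)
demo_pca <- prcomp(scale(iris[, 1:4]))

# Variance of each PC is its standard deviation squared
pc_var <- demo_pca$sdev^2

# Proportion of the total variance explained by each PC
var_explained <- pc_var / sum(pc_var)
round(var_explained, 3)

# Running total across PCs
cumsum(var_explained)

# summary() reports the same proportions in a table
summary(demo_pca)
```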

Now we’ll make the biplot. Look at the biplot and interpret the relationship between the 3 features bill, weight, and wing. Then read the information below.

# add biplot() to see the biplot
biplot(my_pca)

How to interpret the biplot

In the biplot created above, the “bill” and “weight” vectors point to the left, and “wing” points straight down.

This means that bill and weight are correlated with PC1, which biplot() draws on the horizontal axis by default. Wing is correlated with PC2, the vertical axis.

The bill and weight vectors point in nearly the same direction, so the raw data for these two features are going to be highly correlated with each other.

The “wing” vector points straight down, at about a 90-degree (right) angle to not only PC1 but also to bill and weight. We can therefore say that the wing vector is approximately orthogonal to PC1, bill, and weight, which suggests that wing is close to uncorrelated with them.
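
A visual reading of the biplot can be checked against numbers. The arrows are based on the loadings stored in the rotation element of the prcomp() output, and the correlations among the raw features come from cor(). The sketch below again uses the built-in iris measurements as a stand-in (an assumption, not the lesson data). For the lesson data you would look at my_pca$rotation and cor(df2, use = "complete.obs").

```r
# Stand-in data and PCA (not the lesson data)
demo     <- iris[, 1:4]
demo_pca <- prcomp(scale(demo))

# Loadings of each feature on PC1 and PC2; features with similar
# loadings have arrows pointing in similar directions on the biplot
loadings <- demo_pca$rotation[, 1:2]
round(loadings, 2)

# Correlation matrix of the raw features, for comparison with the arrows
feature_cor <- cor(demo)
round(feature_cor, 2)
```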