Learning objectives

All of this material will appear on the exam. Take notes on the workflow, functions, and concepts.

Main objectives

By the end of this lesson you will know how to:

  • set a working directory in RStudio
  • confirm the location of the working directory with getwd()
  • confirm a file is present with list.files() and list.files(pattern = ...)
  • load a typical data file in spreadsheet format with read.csv()

Review

  • Use basic R functions to check data you’ve loaded (e.g. dim, summary, etc.)
  • Create a basic PCA, make a screeplot, and make and interpret a simple biplot.

Introduction

R and R packages have many datasets that can easily be loaded with the data() function.
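As a minimal sketch of loading a built-in dataset, the example below uses iris, which ships with base R. The choice of iris is only an illustration and is not part of this lesson's data.

```r
# iris is one of the datasets bundled with base R (the datasets package)
data(iris)

# Quick checks that the data frame is now available
dim(iris)    # number of rows and columns
head(iris)   # first six rows
```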

Real data analysis typically requires you to load files from your hard drive into R. For typical statistical and machine learning analyses these are .csv files, which stands for “comma-separated values.” These are essentially lightweight spreadsheet files stored as plain text, so they can be opened by a text editor, R, or a spreadsheet program. Bioinformatics and computational biology analyses often involve more complicated data, but the basic ideas behind loading data are the same.
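To see that a .csv file really is plain text, here is a small sketch that writes a toy data frame to a temporary file and reads it back. The data frame toy and its values are made up for illustration, and the temporary file stands in for a file you would normally download.

```r
# Made-up values, purely for illustration
toy <- data.frame(id = c("a", "b"), bill = c(9.1, 8.7), wing = c(57, 59))

# Write the data frame to a temporary .csv file
tmp <- tempfile(fileext = ".csv")
write.csv(toy, tmp, row.names = FALSE)

# Read the raw text: one header line, then one line per row
csv_lines <- readLines(tmp)
print(csv_lines)

# Read the same file back in as a data frame
toy2 <- read.csv(tmp)
print(toy2)
```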

Key to loading data into R is that R knows where to find the file. This isn’t hard to do, but it is easy to forget to do it or to not get it correct on the first try.

This lesson will guide you through the process and give you practice with some tools to maximize success.

Step 1: Save your R script file in a good place

First, save this or whatever R code file you’re working with in a place where you want it to live and where you can easily find it on your computer.

The easiest thing to do is make a new folder on your computer’s desktop and save this file there.

Step 2: Set R’s working directory to where the script file is

You must make sure that R has its working directory set to where this file is located. You need to do this every time you need to load a file. (The only exception is if you use RStudio projects.)

  1. On the top RStudio menu, click on “Session”,
  2. Then “Set Working Directory”,
  3. Then “To Source File Location”.

This will set the working directory to where the file is saved. R will tell you in the console where the working directory is set by printing out something like setwd("C:/Users/nlb24…").

Again, unless you are using an RStudio project you must do this every time you need to load the file with the data. One thing you can do is copy the setwd() line that appears in the console when you set the working directory and paste it into your script, so you can rerun it next time.

Copy and paste it below:
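The pasted line will look different on every computer. As a hedged sketch, the folder name below (my_lesson_folder on the desktop) is hypothetical; replace it with the path RStudio printed in your console. The dir.exists() check is an addition, not part of the lesson, so that the code does not stop with an error when the folder is missing.

```r
# Hypothetical folder; replace with the path RStudio printed in your console
lesson_dir <- file.path(path.expand("~"), "Desktop", "my_lesson_folder")

if (dir.exists(lesson_dir)) {
  setwd(lesson_dir)   # point R at the folder holding the script and data
} else {
  message("Folder not found: ", lesson_dir)
}
```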

Step 3: Copy your data file to the folder where the R code file is

Download your data and save it in the same folder as the code file you are using. For this exercise, download the file walsh2017morphology.csv and save it to where this .Rmd file is located.

Step 4: Confirm your working directory and the presence of the file

Check the location of your working directory with getwd().

Check for the presence of the “walsh2017morphology.csv” file in the working directory with list.files().

If you have lots of files in the working directory, you can search for the file specifically with list.files(pattern = "walsh"). Note that code needs straight quotes; curly quotes copied from a document will cause an error.
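The checks in this step can be sketched as follows. The final file.exists() call is not covered in this lesson; it is included as one more way to test for an exact file name.

```r
# Where is R currently looking for files?
wd <- getwd()
print(wd)

# Everything in the working directory
all_files <- list.files()
print(all_files)

# Only files whose names match the regular expression "walsh"
walsh_files <- list.files(pattern = "walsh")
print(walsh_files)

# TRUE if the exact file is in the working directory
file.exists("walsh2017morphology.csv")
```

If walsh_files comes back empty (character(0)), the file is not in the working directory, and read.csv() will fail until you move the file or reset the working directory.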

Load the .csv file

CSV files can be read in with the read.csv() function.

Always check to make sure the data looks like what you expected with head(), summary() and other functions.
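A minimal sketch of loading and checking the file, assuming walsh2017morphology.csv is already in the working directory. The file.exists() guard and the name csv_file are additions for illustration; they let the code report a missing file instead of stopping with an error.

```r
csv_file <- "walsh2017morphology.csv"

if (file.exists(csv_file)) {
  df <- read.csv(csv_file)

  print(dim(df))       # number of rows and columns
  print(head(df))      # first six rows
  print(summary(df))   # per-column summaries, including NA counts
  str(df)              # column names and types
} else {
  message(csv_file, " not found in ", getwd())
}
```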

Review - Run a PCA

We always scale data for PCA. The first column is character data so we’ll drop that using df2 <- df[, -1].

We need to remove NAs with na.omit(), because prcomp() stops with an error if the data contain missing values.

We can run a PCA with prcomp().
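The three preparation steps above can be sketched as one short workflow. So that the code runs on its own, it builds a stand-in data frame with made-up values; in the lesson, df comes from read.csv("walsh2017morphology.csv"). The column name id and the object names df3 and pca are assumptions, and scaling is done here with prcomp()'s scale. argument, which is one of several ways to scale.

```r
# Stand-in data with made-up values so the sketch runs on its own.
# In the lesson, df comes from read.csv("walsh2017morphology.csv").
set.seed(42)
n <- 30
weight <- rnorm(n, mean = 20, sd = 2)
df <- data.frame(
  id     = paste0("bird", seq_len(n)),         # hypothetical character column
  bill   = 0.4 * weight + rnorm(n, sd = 0.3),  # built to track weight
  weight = weight,
  wing   = rnorm(n, mean = 58, sd = 2)
)
df$wing[3] <- NA   # one missing value to show what na.omit() does

df2 <- df[, -1]      # drop the first (character) column
df3 <- na.omit(df2)  # drop every row that contains an NA

# scale. = TRUE standardizes each column to unit variance before the PCA
pca <- prcomp(df3, scale. = TRUE)
summary(pca)
```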

Let’s look at the scree plot. There are only three features and PC3 is fairly tall on the scree plot, so in a real analysis we should look at it.
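A hedged sketch of the scree plot step. The data are a made-up stand-in (not the Walsh data), so the bar heights will differ from the ones described above; the proportion-of-variance calculation is an addition that puts numbers on what the bars show.

```r
# Stand-in data with made-up values; use your own NA-free numeric data instead
set.seed(42)
weight <- rnorm(30, mean = 20, sd = 2)
df3 <- data.frame(bill   = 0.4 * weight + rnorm(30, sd = 0.3),
                  weight = weight,
                  wing   = rnorm(30, mean = 58, sd = 2))
pca <- prcomp(df3, scale. = TRUE)

# Bar heights are the variances of the principal components
screeplot(pca, type = "barplot")

# Proportion of total variance captured by each PC
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 2)
```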

Now we’ll make the biplot. Look at the biplot and interpret the relationship between the 3 features bill, weight, and wing. Then read the information below.
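A minimal biplot sketch, again on made-up stand-in data rather than the Walsh file, so the arrows may point in different directions than described in this lesson. The sign of a principal component is arbitrary, so arrows pointing left in one analysis can point right in another without changing the interpretation.

```r
# Stand-in data with made-up values; use your own NA-free numeric data instead
set.seed(42)
weight <- rnorm(30, mean = 20, sd = 2)
df3 <- data.frame(bill   = 0.4 * weight + rnorm(30, sd = 0.3),
                  weight = weight,
                  wing   = rnorm(30, mean = 58, sd = 2))
pca <- prcomp(df3, scale. = TRUE)

# By default biplot() plots PC1 (horizontal) against PC2 (vertical)
biplot(pca)
```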

How to interpret the biplot

In the biplot created above, the “bill” and “weight” vectors point to the left, and “wing” points straight down.

This means that bill and weight are correlated with PC1, which biplot() draws on the horizontal axis by default. Wing is correlated with PC2, the vertical axis.

The bill and weight vectors point in nearly the same direction, so the raw data of these features are going to be highly correlated with each other.

The “wing” vector points straight down at about a 90-degree (right) angle to not only PC1, but also bill and weight. We can therefore say that the wing vector is approximately orthogonal to PC1, bill, and weight, which suggests that wing is largely uncorrelated with bill and weight in the raw data.
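One way to check a biplot interpretation is to compare it with the correlation matrix of the raw features. The sketch below uses made-up stand-in data in which bill was deliberately built to track weight and wing was simulated independently, so it illustrates the logic rather than the real Walsh results.

```r
# Stand-in data with made-up values; use your own NA-free numeric data instead
set.seed(42)
weight <- rnorm(30, mean = 20, sd = 2)
df3 <- data.frame(bill   = 0.4 * weight + rnorm(30, sd = 0.3),
                  weight = weight,
                  wing   = rnorm(30, mean = 58, sd = 2))

# Pairwise correlations between the raw features
cor_mat <- cor(df3)
round(cor_mat, 2)

# Loadings: how strongly each feature contributes to each PC
pca <- prcomp(df3, scale. = TRUE)
round(pca$rotation, 2)
```

Vectors that point the same way on the biplot should correspond to a large positive correlation in cor_mat, and vectors at roughly right angles should correspond to a correlation near zero.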