When starting a new project in R, you will need to designate a working directory. Think of it as your project's home, from which you can easily load and save files. You can find your current working directory using getwd(), and change it using setwd().
# Work directory
getwd()
## [1] "C:/Users/wuj95/OneDrive - McMaster University/HTHSCI 3CB3"
# Change directory using setwd() and substitute text in the bracket with any directory path such as "/path/to/directory"
setwd("C:/Users/wuj95/OneDrive - McMaster University")
setwd("C:/Users/wuj95/")
setwd("C:/Users/wuj95/OneDrive - McMaster University/HTHSCI 3CB3")
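If you move between directories often, it can help to remember where you started. The following is a minimal sketch that relies on the fact that setwd() invisibly returns the previous working directory; the use of tempdir() as the destination is only an assumption for illustration.

```r
# Remember where we started
start_dir = getwd()

# setwd() invisibly returns the directory we were in before the change
old_dir = setwd(tempdir())   # tempdir() is just a stand-in destination

# ... work inside the temporary directory ...

# Return to the original working directory
setwd(old_dir)
getwd()
```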
You can create new directories or folders directly from R using dir.create(). Note that it is good practice to name directories and any files saved in your environment without spaces or hyphens (underscores, as in week_1, are a safe alternative), as these are not always ‘computer friendly’ characters.
# Create directory called week_1
dir.create(path = './week_1')
## Warning in dir.create(path = "./week_1"): '.\week_1' already exists
# Notice how the full path does not need to be written out? If you know your current location, you can write the path relative to it (although I usually write out the full path anyway, as good practice, so that I know exactly where all the files are located). The symbol './' stands for the current working directory, so it replaces all the preceding directories.
dir.exists('./week_1')
## [1] TRUE
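The warning above appears because the directory already existed. One way to avoid it, sketched below, is to check with dir.exists() before calling dir.create(). The folder name week_1_demo and its location inside tempdir() are assumptions chosen so that the example does not touch your project.

```r
# Build a path inside a temporary folder (an assumption, to avoid touching your project)
new_dir = file.path(tempdir(), 'week_1_demo')

# Only create the directory if it is not already there
if (!dir.exists(new_dir)) {
  dir.create(new_dir)
}

dir.exists(new_dir)
```

file.path() joins the pieces of a path with the correct separator, so you do not need to type the slashes yourself.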
R comes with a ‘base’ system which contains the most fundamental functions for working with datasets. The base system comes pre-installed.
In addition, users also have access to an extensive library of open-source packages that other users have created. Packages add functions to R that enable you to perform specific tasks. Some packages provide enhanced functions commonly used across many tasks. Other packages provide niche functions that you will seek out only when needed.
Packages are typically obtained from one of two repositories. The first and most common is the Comprehensive R Archive Network (CRAN), which is for general-purpose use. The other is Bioconductor, which, as the name suggests, contains packages tailored to computational biology.
For now, we will learn how to install packages from CRAN by installing the dplyr package, which is used for common data wrangling.
# Install packages using install.packages()
install.packages("dplyr", repos = "http://cran.us.r-project.org")
## Installing package into 'C:/Users/wuj95/AppData/Local/R/win-library/4.4'
## (as 'lib' is unspecified)
## package 'dplyr' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\local_wuj95\Temp\Rtmpkxp1N7\downloaded_packages
# Load packages using library()
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
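Installing a package every time a script runs is unnecessary. A minimal sketch of one way to check first is shown below; is_installed() is a hypothetical helper written for this example, not a function that ships with R. The sketch also shows the '::' operator, which calls a function from a named package and so avoids any ambiguity from the masking messages printed when dplyr was loaded.

```r
# Hypothetical helper: TRUE if a package can be loaded, FALSE otherwise
is_installed = function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

# 'stats' ships with R, so this should be TRUE
is_installed('stats')

# Install dplyr only when it is missing (left commented out; needs internet access)
# if (!is_installed('dplyr')) install.packages('dplyr')

# The '::' operator calls a function from a specific package
stats::median(c(1, 5, 9))   # 5
```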
Vectors are the basic units of data structure in R. A vector is a collection of numbers, text, dates, and so on. The collection can be composed of a single unit (e.g. a single number) or multiple units (e.g. multiple words).
To create a vector of multiple units, you will use the concatenate function c(). We refer to the individual units within a vector as elements.
# Create a vector with a single digit and store in the variable 'x'
x = 1
# You can assign vectors to variables using either '=' or '<-'. I prefer '=' as it requires less typing.
x <- 1
# Do it again but with a letter, and store it in a different variable name of your choice. Note that a vector containing letters or other non-numerical characters requires quotation marks.
y = 'a'
# Create a longer vector composed of two numbers, two letters, or a number and a letter
two_num = c(1, 2)
two_letter = c('a', 'b')
num_letter = c(1, 'a')
# You can create a vector of consecutive integers by using ':' without needing c()
three_num = 3:5
print(three_num)
## [1] 3 4 5
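Besides c() and ':', a few other base R functions are handy for building and checking vectors. The sketch below uses seq(), rep() and length(); the variable names by_two and repeated are made up for this example.

```r
# seq() builds a regular sequence with a step size of your choice
by_two = seq(from = 2, to = 10, by = 2)
by_two              # 2 4 6 8 10

# rep() repeats elements
repeated = rep(c('a', 'b'), times = 3)
repeated            # "a" "b" "a" "b" "a" "b"

# length() counts the elements in a vector
length(by_two)      # 5
length(repeated)    # 6
```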
Variables can be categorized by their data type. This is important because R restricts which operations can be performed on each data type. For example, you cannot perform math operations on a variable that is of the ‘character’ type.
Note that when a vector contains both letters and numbers, every element is automatically converted to the character class, because a number can be stored as text but a letter cannot be stored as a number.
# Use class() to identify the data type of your variables x and y
class(x)
## [1] "numeric"
class(y)
## [1] "character"
class(num_letter)
## [1] "character"
# Try out some basic arithmetic on the variables created above
x + 1
## [1] 2
y - 1
## Error in y - 1: non-numeric argument to binary operator
two_num * 4
## [1] 4 8
num_letter / 2
## Error in num_letter/2: non-numeric argument to binary operator
three_num ^ 2
## [1] 9 16 25
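The errors above come from character vectors. When the text actually holds digits, it can be converted back to numbers. The following sketch illustrates this with as.numeric(); the variable names digits_as_text and digits are assumptions for the example.

```r
# A number stored alongside a letter becomes text
num_letter = c(1, 'a')
num_letter          # "1" "a"

# Text that looks like a number can be converted back with as.numeric()
digits_as_text = c('1', '2', '3')
digits = as.numeric(digits_as_text)
class(digits)       # "numeric"
digits + 1          # 2 3 4

# Text that is not a number becomes NA (with a warning, silenced here)
suppressWarnings(as.numeric('a'))
```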
A common data manipulation exercise involves selecting a specific set of numbers or letters from your variable. To do so, you need to know where the desired elements are located within the variable. You can find this out programmatically using functions such as grep() or which(), which we will cover in future sessions, or you can visually inspect your data to identify the location of specific elements. To subset, we use square brackets ‘[]’ and place the location, or index, of the desired elements inside them.
# First create a vector containing 100 numbers drawn from a normal distribution with mean = 0 and standard deviation = 1 using rnorm()
x = rnorm(n = 100, mean = 0, sd = 1)
# Inspect the data and subset an element. I will choose the 7th element
x[7]
## [1] -0.4051123
# To subset multiple elements, use the concatenate function c()
x[c(7, 19, 25)]
## [1] -0.4051123 0.7833215 0.1254946
# If the elements you would like to subset are consecutive, you can use the ':' operator without needing c()
x[7:9]
## [1] -0.4051123 -1.9760318 -0.2935765
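As a small preview of finding locations programmatically, the sketch below uses which() and a logical condition on a short, fixed vector (the vector v and its values are made up so that the results are easy to check by eye).

```r
# A small fixed vector so the results are easy to follow
v = c(5, -2, 8, -1, 3)

# which() returns the positions where a condition is TRUE
neg_index = which(v < 0)
neg_index           # 2 4

# Those positions can go straight into the square brackets
v[neg_index]        # -2 -1

# A logical condition inside [] does the same in one step
v[v < 0]            # -2 -1

# A negative index drops an element instead of selecting it
v[-1]               # -2 8 -1 3
```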
Another common exercise is sorting data alphabetically or numerically. We can accomplish this using either the sort() or the order() function.
# First create a vector containing 100 numbers drawn from a normal distribution with mean = 0 and standard deviation = 1 using rnorm()
x = rnorm(n = 100, mean = 0, sd = 1)
# Sort numerically in descending order
sort(x, decreasing = TRUE)
## [1] 3.77449279 2.47625770 2.30446357 2.23306968 1.81747045 1.80815956
## [7] 1.64157885 1.56788402 1.56479339 1.43849967 1.43223059 1.37715498
## [13] 1.37359356 1.36827760 1.29055708 1.22182489 1.20936804 1.11164832
## [19] 1.05645487 1.03855283 0.81892912 0.81206481 0.77667037 0.75735701
## [25] 0.68989358 0.63393617 0.57222075 0.56094987 0.54034969 0.41034843
## [31] 0.37636256 0.36255909 0.35657541 0.34463730 0.28553923 0.27377089
## [37] 0.26222432 0.23814079 0.22234079 0.22019727 0.20855381 0.19674297
## [43] 0.17234222 0.09745390 0.08395051 0.06384979 0.06011980 0.01351980
## [49] -0.02974315 -0.04905034 -0.09792052 -0.10924248 -0.11267489 -0.18403729
## [55] -0.19285723 -0.20504634 -0.27245500 -0.27337915 -0.37085926 -0.40998068
## [61] -0.41840386 -0.42372424 -0.42624931 -0.42841250 -0.44480126 -0.45397791
## [67] -0.46416008 -0.46919066 -0.47969587 -0.50843228 -0.53313928 -0.53624672
## [73] -0.55605674 -0.56030081 -0.58530222 -0.63191737 -0.66779854 -0.67208196
## [79] -0.67467309 -0.67720382 -0.73714354 -0.77945578 -0.85439591 -0.91827863
## [85] -1.04470419 -1.05455647 -1.05696118 -1.06293220 -1.09023031 -1.15323290
## [91] -1.15799996 -1.24832793 -1.33155765 -1.34956312 -1.36177857 -1.37855582
## [97] -1.74561913 -1.87207493 -1.95738769 -2.16578957
# Sort numerically in ascending order
sort(x, decreasing = FALSE)
## [1] -2.16578957 -1.95738769 -1.87207493 -1.74561913 -1.37855582 -1.36177857
## [7] -1.34956312 -1.33155765 -1.24832793 -1.15799996 -1.15323290 -1.09023031
## [13] -1.06293220 -1.05696118 -1.05455647 -1.04470419 -0.91827863 -0.85439591
## [19] -0.77945578 -0.73714354 -0.67720382 -0.67467309 -0.67208196 -0.66779854
## [25] -0.63191737 -0.58530222 -0.56030081 -0.55605674 -0.53624672 -0.53313928
## [31] -0.50843228 -0.47969587 -0.46919066 -0.46416008 -0.45397791 -0.44480126
## [37] -0.42841250 -0.42624931 -0.42372424 -0.41840386 -0.40998068 -0.37085926
## [43] -0.27337915 -0.27245500 -0.20504634 -0.19285723 -0.18403729 -0.11267489
## [49] -0.10924248 -0.09792052 -0.04905034 -0.02974315 0.01351980 0.06011980
## [55] 0.06384979 0.08395051 0.09745390 0.17234222 0.19674297 0.20855381
## [61] 0.22019727 0.22234079 0.23814079 0.26222432 0.27377089 0.28553923
## [67] 0.34463730 0.35657541 0.36255909 0.37636256 0.41034843 0.54034969
## [73] 0.56094987 0.57222075 0.63393617 0.68989358 0.75735701 0.77667037
## [79] 0.81206481 0.81892912 1.03855283 1.05645487 1.11164832 1.20936804
## [85] 1.22182489 1.29055708 1.36827760 1.37359356 1.37715498 1.43223059
## [91] 1.43849967 1.56479339 1.56788402 1.64157885 1.80815956 1.81747045
## [97] 2.23306968 2.30446357 2.47625770 3.77449279
# Create a random vector containing 100 letters
y = sample(letters, 100, replace = TRUE)
# Sort in reverse alphabetical order (decreasing = TRUE)
sort(y, decreasing = TRUE)
## [1] "z" "z" "y" "y" "x" "w" "v" "v" "v" "v" "v" "u" "u" "t" "t" "t" "t" "r"
## [19] "r" "r" "r" "r" "r" "r" "q" "q" "q" "q" "p" "p" "p" "p" "o" "o" "o" "n"
## [37] "n" "n" "m" "m" "m" "m" "m" "m" "m" "l" "l" "l" "l" "l" "l" "l" "l" "k"
## [55] "k" "k" "j" "j" "j" "j" "j" "i" "i" "i" "i" "i" "h" "h" "h" "h" "h" "h"
## [73] "g" "g" "g" "g" "f" "f" "f" "f" "f" "e" "e" "e" "e" "e" "d" "d" "d" "d"
## [91] "c" "c" "c" "b" "b" "b" "b" "a" "a" "a"
# Sort in alphabetical order (decreasing = FALSE)
sort(y, decreasing = FALSE)
## [1] "a" "a" "a" "b" "b" "b" "b" "c" "c" "c" "d" "d" "d" "d" "e" "e" "e" "e"
## [19] "e" "f" "f" "f" "f" "f" "g" "g" "g" "g" "h" "h" "h" "h" "h" "h" "i" "i"
## [37] "i" "i" "i" "j" "j" "j" "j" "j" "k" "k" "k" "l" "l" "l" "l" "l" "l" "l"
## [55] "l" "m" "m" "m" "m" "m" "m" "m" "n" "n" "n" "o" "o" "o" "p" "p" "p" "p"
## [73] "q" "q" "q" "q" "r" "r" "r" "r" "r" "r" "r" "t" "t" "t" "t" "u" "u" "v"
## [91] "v" "v" "v" "v" "w" "x" "y" "y" "z" "z"
# In contrast to sort(), order() returns the indices of the elements arranged in descending or ascending order, rather than the elements themselves. The output of order() can be used inside the square brackets [] to sort the original data according to the rearranged indices.
index = order(x, decreasing = TRUE)
x[index]
## [1] 3.77449279 2.47625770 2.30446357 2.23306968 1.81747045 1.80815956
## [7] 1.64157885 1.56788402 1.56479339 1.43849967 1.43223059 1.37715498
## [13] 1.37359356 1.36827760 1.29055708 1.22182489 1.20936804 1.11164832
## [19] 1.05645487 1.03855283 0.81892912 0.81206481 0.77667037 0.75735701
## [25] 0.68989358 0.63393617 0.57222075 0.56094987 0.54034969 0.41034843
## [31] 0.37636256 0.36255909 0.35657541 0.34463730 0.28553923 0.27377089
## [37] 0.26222432 0.23814079 0.22234079 0.22019727 0.20855381 0.19674297
## [43] 0.17234222 0.09745390 0.08395051 0.06384979 0.06011980 0.01351980
## [49] -0.02974315 -0.04905034 -0.09792052 -0.10924248 -0.11267489 -0.18403729
## [55] -0.19285723 -0.20504634 -0.27245500 -0.27337915 -0.37085926 -0.40998068
## [61] -0.41840386 -0.42372424 -0.42624931 -0.42841250 -0.44480126 -0.45397791
## [67] -0.46416008 -0.46919066 -0.47969587 -0.50843228 -0.53313928 -0.53624672
## [73] -0.55605674 -0.56030081 -0.58530222 -0.63191737 -0.66779854 -0.67208196
## [79] -0.67467309 -0.67720382 -0.73714354 -0.77945578 -0.85439591 -0.91827863
## [85] -1.04470419 -1.05455647 -1.05696118 -1.06293220 -1.09023031 -1.15323290
## [91] -1.15799996 -1.24832793 -1.33155765 -1.34956312 -1.36177857 -1.37855582
## [97] -1.74561913 -1.87207493 -1.95738769 -2.16578957
# The outputs of x[index] and sort(x, decreasing = TRUE) are the same. You can confirm this element by element using the == operator, or with a single result using all.equal()
x[index] == sort(x, decreasing = TRUE)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [91] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
all.equal(x[index], sort(x, decreasing = TRUE))
## [1] TRUE
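With 100 random numbers it is hard to see what order() is returning. The sketch below repeats the idea on a three-element vector (the values are made up) so that each index can be traced by hand.

```r
v = c(30, 10, 20)

# order() reports where each sorted value currently sits:
# the smallest value (10) is at position 2, then 20 at position 3, then 30 at position 1
order(v)                      # 2 3 1

# Feeding those positions back into [] reproduces sort()
v[order(v)]                   # 10 20 30

# identical() returns a single TRUE or FALSE
identical(v[order(v)], sort(v))   # TRUE
```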
Ordering may seem like a more tedious alternative to sorting. However, knowing the locations, or indices, of elements is important when working with a dataframe consisting of more than a single vector.
We will now create a dataframe and inspect the output.
# Create a dataframe by placing two vectors, x and y, inside the data.frame() function and assign it to the variable dat
dat = data.frame(x = rnorm(100, mean = 0, sd = 1),
                 y = sample(letters, 100, replace = TRUE))
# It is very, very important to do a quick inspection of any new dataframe you create or load to catch any unexpected abnormalities!!!
# Inspect the structure of the dataframe using class() and str()
class(dat)
## [1] "data.frame"
str(dat)
## 'data.frame': 100 obs. of 2 variables:
## $ x: num 0.3057 -1.03298 -1.28432 -0.6868 0.00927 ...
## $ y: chr "f" "v" "n" "v" ...
# str() provides a quick glimpse of the data and a summary of important characteristics such as the number of observations (rows) and number of variables (columns). You can also obtain these counts with nrow() and ncol(), or both at once with dim()
nrow(dat)
## [1] 100
ncol(dat)
## [1] 2
dim(dat)
## [1] 100 2
# To add additional variables (columns) to your dataframe, you can use the cbind() function.
dat = cbind(dat, z = 2)
str(dat)
## 'data.frame': 100 obs. of 3 variables:
## $ x: num 0.3057 -1.03298 -1.28432 -0.6868 0.00927 ...
## $ y: chr "f" "v" "n" "v" ...
## $ z: num 2 2 2 2 2 2 2 2 2 2 ...
# Notice that only a single number, 2, was assigned to z? This is convenient when your new variable has only one distinct value, as you do not need to write it out many times (R automatically repeats, or recycles, the value to fill the dataframe). Note that this only works when the number of rows is evenly divisible by the number of elements in the new variable. In our case the dataframe has 100 rows, so a z containing 2 elements is recycled 50 times, whereas a z containing 3 elements returns an error when you attempt to add it to the dataframe.
cbind(dat, z = 1:2) %>% head()
## x y z z
## 1 0.305700067 f 2 1
## 2 -1.032981152 v 2 2
## 3 -1.284324667 n 2 1
## 4 -0.686800286 v 2 2
## 5 0.009271487 y 2 1
## 6 -1.010318651 o 2 2
cbind(dat, z = 1:3) %>% head()
## Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 100, 3
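Columns can also be added with the '$' symbol, which follows the same recycling behaviour for single values. The sketch below is one possible illustration on a four-row dataframe; demo_dat and its column names are assumptions made for this example.

```r
demo_dat = data.frame(x = c(1.5, 2.5, 3.5, 4.5))

# A single value is repeated to fill every row
demo_dat$z = 2

# rep() makes the repetition explicit: 1 2 1 2
demo_dat$w = rep(1:2, times = 2)

demo_dat
```

Writing the repetition out with rep() can make it easier to see exactly which value lands in which row.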
You can select specific elements from a dataframe using the ‘[]’ function. However, in contrast to vectors, dataframes are two-dimensional and therefore require more nuance when using [] to specify either column or row selections. We do this by using a comma ‘,’ to separate rows and columns, as in ‘dat[row, column]’.
# Create a dataframe by placing three vectors, x, y and z, inside the data.frame() function and assign it to the variable dat (z has only two elements, so it is recycled to fill all 100 rows)
dat = data.frame(x = rnorm(100, mean = 0, sd = 1),
                 y = sample(letters, 100, replace = TRUE),
                 z = c(1, 4))
# Select a single row or column of your choice
dat[5, ]
## x y z
## 5 -0.7123735 r 1
dat[, 2]
## [1] "q" "w" "o" "c" "r" "m" "f" "l" "l" "l" "f" "z" "u" "c" "v" "z" "p" "i"
## [19] "a" "v" "z" "h" "b" "s" "g" "w" "g" "u" "j" "r" "z" "z" "k" "i" "s" "v"
## [37] "r" "h" "l" "d" "f" "y" "k" "a" "q" "e" "w" "n" "b" "f" "i" "h" "u" "v"
## [55] "l" "x" "t" "s" "i" "w" "x" "h" "p" "n" "e" "o" "a" "f" "c" "q" "d" "k"
## [73] "q" "r" "a" "q" "g" "a" "b" "n" "t" "n" "n" "t" "i" "i" "o" "w" "j" "g"
## [91] "k" "l" "g" "o" "f" "s" "t" "q" "x" "j"
# Select the intersection of a row and column
dat[5, 2]
## [1] "r"
# Select the intersection of multiple rows and columns
dat[5:10, 2:3]
## y z
## 5 r 1
## 6 m 4
## 7 f 1
## 8 l 4
## 9 l 1
## 10 l 4
# If selecting only a single column, you can also use the '$' symbol followed by the name of the column. This returns the column as a one-dimensional vector.
dat$y
## [1] "q" "w" "o" "c" "r" "m" "f" "l" "l" "l" "f" "z" "u" "c" "v" "z" "p" "i"
## [19] "a" "v" "z" "h" "b" "s" "g" "w" "g" "u" "j" "r" "z" "z" "k" "i" "s" "v"
## [37] "r" "h" "l" "d" "f" "y" "k" "a" "q" "e" "w" "n" "b" "f" "i" "h" "u" "v"
## [55] "l" "x" "t" "s" "i" "w" "x" "h" "p" "n" "e" "o" "a" "f" "c" "q" "d" "k"
## [73] "q" "r" "a" "q" "g" "a" "b" "n" "t" "n" "n" "t" "i" "i" "o" "w" "j" "g"
## [91] "k" "l" "g" "o" "f" "s" "t" "q" "x" "j"
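Rows and columns do not have to be selected by position. The following sketch shows selection by column name and by a logical condition, using a small made-up dataframe (demo_dat and its values are assumptions for illustration).

```r
demo_dat = data.frame(x = c(-1, 2, 0.5, -3),
                      y = c('a', 'b', 'c', 'd'))

# Select a column by name rather than by position
demo_dat[, 'y']                 # "a" "b" "c" "d"

# Keep only the rows where x is positive (rows 2 and 3)
demo_dat[demo_dat$x > 0, ]

# Combine both: the y values of rows where x is positive
demo_dat[demo_dat$x > 0, 'y']   # "b" "c"
```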
In addition to the class() and str() functions, there are other ways to obtain summaries of data, such as summary() and table().
# Create a dataframe by placing two vectors, x and y, inside the data.frame() function and assign it to the variable dat
dat = data.frame(x = rnorm(100, mean = 0, sd = 1),
                 y = sample(letters, 100, replace = TRUE))
# Use the summary() function to obtain descriptive statistics of numerical variables
summary(dat)
## x y
## Min. :-3.11674 Length:100
## 1st Qu.:-0.55576 Class :character
## Median :-0.01420 Mode :character
## Mean : 0.04266
## 3rd Qu.: 0.72542
## Max. : 2.20747
# If the dataset is large, summary() can take a long time. Therefore, it is good practice to use summary() on select variables of interest
summary(dat$x)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.11674 -0.55576 -0.01420 0.04266 0.72542 2.20747
# Notice how summary() is not very useful for character type variables. For these variables, it is more informative to tabulate the frequency of appearance of each unique element by using table().
table(dat$y)
##
## a b c d e f g h i j k l m n o p q r s t u v w x y z
## 8 3 2 2 4 1 7 5 4 4 6 5 2 1 2 2 5 4 4 1 4 5 1 6 4 8
# We can combine table() with sort() to arrange characters based on their frequency of appearance
sort(table(dat$y), decreasing = TRUE)
##
## a z g k x h l q v e i j r s u y b c d m o p f n t w
## 8 8 7 6 6 5 5 5 5 4 4 4 4 4 4 4 3 2 2 2 2 2 1 1 1 1
sort(table(dat$y), decreasing = FALSE)
##
## f n t w c d m o p b e i j r s u y h l q v k x g a z
## 1 1 1 1 2 2 2 2 2 3 4 4 4 4 4 4 4 5 5 5 5 6 6 7 8 8
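Counts are often easier to compare as proportions. The sketch below applies prop.table() to a short, fixed vector and also pulls out the most frequent element by name; letters_vec and counts are names made up for this example.

```r
letters_vec = c('a', 'b', 'a', 'c', 'a', 'b')

counts = table(letters_vec)
counts              # a: 3, b: 2, c: 1

# prop.table() converts counts into proportions that sum to 1
prop.table(counts)

# names() on the sorted table gives the most frequent element first
names(sort(counts, decreasing = TRUE))[1]   # "a"
```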
Recall that sort() can rearrange vectors. However, sort() will not work on a dataframe. Therefore, it is better to use order() to rearrange a dataframe based on select variables.
# Create a dataframe by placing two vectors, x and y, inside the data.frame() function and assign it to the variable dat
dat = data.frame(x = rnorm(100, mean = 0, sd = 1),
                 y = sample(letters, 100, replace = TRUE))
# Sort data (this fails because sort() does not accept a dataframe)
sort(dat)
## Error in xtfrm.data.frame(x): cannot xtfrm data frames
# Order data by x in descending order and show the first rows
index = order(dat$x, decreasing = TRUE)
dat[index, ] %>% head()
## x y
## 13 2.171104 c
## 76 2.033512 k
## 88 1.973879 b
## 3 1.829745 m
## 17 1.639099 f
## 73 1.621700 n
# Order data by x in ascending order
index = order(dat$x, decreasing = FALSE)
dat[index, ] %>% head()
## x y
## 89 -2.672860 f
## 5 -2.612869 u
## 38 -2.475703 q
## 36 -2.212770 w
## 12 -1.980042 x
## 41 -1.654493 h
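order() also accepts more than one variable, in which case later variables break ties in earlier ones. The following is a minimal sketch on a made-up four-row dataframe; placing a minus sign in front of a numeric column is one common way to reverse the direction for that column only.

```r
demo_dat = data.frame(x = c(3, 1, 2, 5),
                      y = c('b', 'a', 'b', 'a'))

# Order by y alphabetically; within each letter, order by x from largest to smallest
index = order(demo_dat$y, -demo_dat$x)
index               # 4 2 1 3

demo_dat[index, ]
```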
We can save data created in our R session to any directory. We can save it in human-readable formats such as .txt or .csv, which can be opened in Notepad or Excel.
Alternatively, we can save data as R objects in .rds format. This is useful when your data is not a conventional two-dimensional object. This format compresses the data, which saves space, and stores an exact copy of the R object with no alterations, but the file can only be opened in R.
# Save as a .txt file with the following parameters: retain column names, discard row names, do not use any quotation marks, and separate data by tabs
write.table(dat, file = './dat.txt',
            col.names = TRUE, row.names = FALSE, quote = FALSE, sep = '\t')
# Save as .rds file, no alterations or parameters needed
saveRDS(dat, file = './dat.RDS')
# To read a .txt file:
dat = read.table(file = './dat.txt', header = TRUE, sep = '\t') # we specify header = TRUE because the file was saved with column names intact, and sep = '\t' because the data are separated by tabs
# To read a .rds file:
dat = readRDS(file = './dat.RDS')
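To convince yourself that nothing is lost when saving and reloading, you can compare the reloaded object with the original. The sketch below does this for both a .csv and a .rds file. It writes to paths generated by tempfile() so that nothing in your project is overwritten; demo_dat and the path variable names are assumptions made for this example.

```r
demo_dat = data.frame(x = c(1.5, 2.5, 3.5),
                      y = c('a', 'b', 'c'))

# tempfile() gives a throwaway path, so nothing in your project is overwritten
csv_path = tempfile(fileext = '.csv')
rds_path = tempfile(fileext = '.rds')

# Write and read back a .csv file
write.csv(demo_dat, file = csv_path, row.names = FALSE)
from_csv = read.csv(csv_path)

# Write and read back a .rds file
saveRDS(demo_dat, file = rds_path)
from_rds = readRDS(rds_path)

# Compare each reloaded copy with the original
all.equal(demo_dat, from_csv)
all.equal(demo_dat, from_rds)
```

all.equal() tolerates tiny numerical differences, which makes it a reasonable choice for data that has passed through a text file.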