I. Basics of RStudio

R versus RStudio: R is the engine that runs your code, while RStudio is the dashboard you interact with. RStudio is an interface that adds many convenient features and tools, making R much easier to use. We will use R and RStudio throughout this workshop and in three other sessions: Basic Quantitative Methods, Social Network Analysis, and Text Analysis.

A. Download R and RStudio

First you will have to download R from https://cloud.r-project.org/. Next, download RStudio from https://www.rstudio.com/products/rstudio/download/.

B. RStudio

RStudio adds many features that make working with R more convenient. The RStudio window is typically divided into four parts:

  • Console
  • Workspace and History
  • Files, Plots, Packages, and Help
  • R Script and Data View

The console is where you type commands and see the output of your code. Once you import data into R or create new variables or data frames, they appear in the workspace at the top right of the RStudio window, under the Environment tab.

In the same place, there is also a history tab, which contains a list of commands you have used so far.

Similar to Stata’s do file, you can save a record of your coding commands in the R script file to reproduce your work and/or save your progress. The R script file is shown on the top left of the RStudio window.

If you click on a data frame saved in the environment, the data view will be available as a tab in the same place as the R script.

To start a new project, first go to File –> New Project –> New Directory –> New Project.

Then enter a directory name. This will be the main directory in which you store the files used in the project. To choose the parent folder in which this project directory will be created, click Browse... and select the location of your choice.
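
Once the project is open, it can help to confirm where R is looking for files. The lines below are a minimal sketch using base R only; the name project_dir is illustrative and not part of the workshop materials.

```r
# Print the current working directory; inside an RStudio project this is
# normally the project folder.
project_dir <- getwd()
project_dir

# List the files R can see in that folder without typing a full path.
list.files(project_dir)
```

Files stored in this folder can then be referred to by file name alone.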

II. Basic R Operations and Different Data Types

A. Basic Arithmetic

R can be used as a basic calculator. Let’s try some basic arithmetic operations.

  • Addition: +
  • Subtraction: -
  • Multiplication: *
  • Division: /
  • Exponentiation: ^
  • Modulo: %%
2+3
## [1] 5
4*5-8
## [1] 12
(6/3)+5
## [1] 7
2^3
## [1] 8
5%%2
## [1] 1
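
As a further sketch, the modulo operator pairs naturally with integer division, %/%, which is not in the list above but is part of base R. The example also shows one precedence rule that often surprises new users. The variable names are chosen for illustration only.

```r
dividend <- 17
divisor  <- 5

quotient  <- dividend %/% divisor   # integer division: 3
remainder <- dividend %% divisor    # modulo: 2

# The two results reconstruct the original number.
divisor * quotient + remainder == dividend

# Exponentiation is evaluated before the unary minus sign.
-2^2     # evaluated as -(2^2), which is -4
(-2)^2   # 4
```

When in doubt about the order of operations, add parentheses as in the (6/3)+5 example above.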

B. Variable Assignments

In order to assign a value to a variable of your choice, use the operator <- as shown in the code below:

x <- 70
x
## [1] 70
y <- 50
y
## [1] 50
x+y
## [1] 120
z <- x+y #combine with the arithmetic operator

z
## [1] 120
w <- x-y

What is the value of w?

w
## [1] 20

C. Data Types

Besides the whole numbers used so far, variables can hold other types of data. We will use three basic data types:

  • numeric
  • logical
  • character

numeric_var <- 34.5
char_var <- "R Workshop"
logical_var <- TRUE

You can use a function (we will cover this later) class() to check the type of variable assigned.

class(numeric_var)
## [1] "numeric"
class(char_var)
## [1] "character"
class(logical_var)
## [1] "logical"

For a character variable, it should also be noted that R is case-sensitive:

char_var2 <- "r workshop"

char_var2 == char_var
## [1] FALSE
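
If you want a comparison that ignores case, one option is to convert both strings to lower case first. The sketch below uses the base R function tolower(); the objects score and Score are made up to show that object names are case-sensitive too.

```r
char_var  <- "R Workshop"
char_var2 <- "r workshop"

# Convert both strings to lower case before comparing.
tolower(char_var) == tolower(char_var2)

# Object names are case-sensitive as well: score and Score are two different objects.
score <- 1
Score <- 2
score == Score
```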

III. Working with Various Data Structures

There are many ways to store more than one value. We will cover: vector, matrix, factor, data frame, and list. Each is suitable for different contexts.

A. Vector

A vector stores values of a single data type in one dimension (n x 1). Create one with the function c():

score_winter <- c(3, 4, 4, 4, 5, 5, 5, 5, 6)

#let's check what a vector looks like:
score_winter
## [1] 3 4 4 4 5 5 5 5 6
#a vector can also store other data types, such as character strings:
student_names <- c("Steve", "Carol", "Sam", "Maddie", "Aaron", "Erin", "Ian", "Kyle", "Lucy")
student_names
## [1] "Steve"  "Carol"  "Sam"    "Maddie" "Aaron"  "Erin"   "Ian"    "Kyle"  
## [9] "Lucy"
#it can, however, only hold one type of variable:
test_vector <- c("Steve", 1)
class(test_vector) #1 is coerced into a character type.
## [1] "character"
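
The coercion shown above follows a fixed order in base R: logical values become numeric, and everything becomes character once a string is present. A short sketch (the name mixed is illustrative):

```r
class(c(TRUE, 2))      # logical + numeric   -> "numeric"
class(c(2, "a"))       # numeric + character -> "character"
class(c(TRUE, "a"))    # logical + character -> "character"

mixed <- c(TRUE, 2, "a")
mixed                  # every element is now a string
```

This is why it is worth checking class() after building a vector from mixed inputs.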

You can also name the elements of a vector using the names() function:

names(score_winter) <- student_names
score_winter
##  Steve  Carol    Sam Maddie  Aaron   Erin    Ian   Kyle   Lucy 
##      3      4      4      4      5      5      5      5      6

You can also perform arithmetic operations on vectors:

score_spring <- c(4, 3, 5, 5, 3, 3, 5, 6, 5)
names(score_spring) <- student_names

total_score <- score_spring + score_winter
total_score
##  Steve  Carol    Sam Maddie  Aaron   Erin    Ian   Kyle   Lucy 
##      7      7      9      9      8      8     10     11     11

The sum of all values in a vector can be calculated using the sum() function:

sum_all_score <- sum(total_score)
average_score <- sum_all_score/9
average_score
## [1] 8.888889
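
The calculation above hard-codes the number of students. As a sketch of an alternative, length() returns the number of elements and mean() computes the average directly; both are base R functions. The vector is re-created here without names so the example runs on its own, and n_students is an illustrative name.

```r
total_score <- c(7, 7, 9, 9, 8, 8, 10, 11, 11)

n_students <- length(total_score)   # number of elements in the vector
sum(total_score) / n_students       # same calculation, without typing 9
mean(total_score)                   # built-in shortcut for the average
```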

There are different ways to select certain elements in the vector:

total_score[1]
## Steve 
##     7
total_score[c(2,5,7)]
## Carol Aaron   Ian 
##     7     8    10
total_score[c("Ian", "Kyle")]
##  Ian Kyle 
##   10   11
total_score[3:9]
##    Sam Maddie  Aaron   Erin    Ian   Kyle   Lucy 
##      9      9      8      8     10     11     11

You can also filter for elements with some desired characteristics:

selection <- score_winter > 4
score_winter[selection]
## Aaron  Erin   Ian  Kyle  Lucy 
##     5     5     5     5     6
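
Conditions can also be combined. The sketch below uses the element-wise "and" operator & to keep students who scored above 4 in both semesters, and which() to find their positions. The vectors are re-created so the example runs on its own; both_high is an illustrative name.

```r
student_names <- c("Steve", "Carol", "Sam", "Maddie", "Aaron", "Erin", "Ian", "Kyle", "Lucy")
score_winter <- c(3, 4, 4, 4, 5, 5, 5, 5, 6)
score_spring <- c(4, 3, 5, 5, 3, 3, 5, 6, 5)
names(score_winter) <- student_names
names(score_spring) <- student_names

# TRUE only where both conditions hold for the same student.
both_high <- score_winter > 4 & score_spring > 4

score_winter[both_high]          # winter scores of the matching students
which(both_high)                 # their positions in the vector
names(score_winter)[both_high]   # their names
```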

B. Matrix

A matrix also stores one type of data, but in two dimensions (n x m: n rows and m columns). We use the function matrix() to create a matrix object in R.

#try this first:
?matrix

#create a matrix containing a value of 1 to 16 for 4x4 matrix,  2x8 matrix, and 8x2 matrix, respectively
matrix(1:16, nrow = 4, ncol = 4)
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
matrix(1:16, nrow = 2, ncol = 8)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,]    1    3    5    7    9   11   13   15
## [2,]    2    4    6    8   10   12   14   16
matrix(1:16, nrow = 8, ncol = 2)
##      [,1] [,2]
## [1,]    1    9
## [2,]    2   10
## [3,]    3   11
## [4,]    4   12
## [5,]    5   13
## [6,]    6   14
## [7,]    7   15
## [8,]    8   16

Note: ? is a useful command to call R Documentation explaining the specific function you are curious about. R Documentation for each function generally contains Description, Usage, Arguments, Details, and Examples.
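
As a small illustration of what the matrix() help page describes, the sketch below compares the two fill orders controlled by the byrow argument and checks the result with dim(). The object names values, by_column, and by_row are illustrative only.

```r
values <- 1:6

by_column <- matrix(values, nrow = 2, byrow = FALSE)  # default: fill down each column
by_row    <- matrix(values, nrow = 2, byrow = TRUE)   # fill across each row

by_column
by_row

dim(by_row)   # number of rows, then number of columns
```

Only nrow is given here; R works out the number of columns from the length of the input.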

Going back to the vectors created earlier, now we will construct a matrix from them.

score_winter
##  Steve  Carol    Sam Maddie  Aaron   Erin    Ian   Kyle   Lucy 
##      3      4      4      4      5      5      5      5      6
score_spring
##  Steve  Carol    Sam Maddie  Aaron   Erin    Ian   Kyle   Lucy 
##      4      3      5      5      3      3      5      6      5
student_score <- c(score_winter, score_spring)
student_score
##  Steve  Carol    Sam Maddie  Aaron   Erin    Ian   Kyle   Lucy  Steve  Carol 
##      3      4      4      4      5      5      5      5      6      4      3 
##    Sam Maddie  Aaron   Erin    Ian   Kyle   Lucy 
##      5      5      3      3      5      6      5
student_score_matrix <- matrix(student_score, byrow = FALSE, nrow = 9) #byrow indicates whether the matrix is filled by row (TRUE) or by column (FALSE); nrow sets the number of rows, 9 here.

student_score
##  Steve  Carol    Sam Maddie  Aaron   Erin    Ian   Kyle   Lucy  Steve  Carol 
##      3      4      4      4      5      5      5      5      6      4      3 
##    Sam Maddie  Aaron   Erin    Ian   Kyle   Lucy 
##      5      5      3      3      5      6      5
student_score_matrix
##       [,1] [,2]
##  [1,]    3    4
##  [2,]    4    3
##  [3,]    4    5
##  [4,]    4    5
##  [5,]    5    3
##  [6,]    5    3
##  [7,]    5    5
##  [8,]    5    6
##  [9,]    6    5

Notice that in the matrix, the names you have assigned to the vectors are gone. We will rename the rows and columns of this matrix using rownames() and colnames().

rownames(student_score_matrix) <- student_names
colnames(student_score_matrix) <- c("Winter", "Spring")
student_score_matrix
##        Winter Spring
## Steve       3      4
## Carol       4      3
## Sam         4      5
## Maddie      4      5
## Aaron       5      3
## Erin        5      3
## Ian         5      5
## Kyle        5      6
## Lucy        6      5

As with vectors, we can also perform basic arithmetic operations on a matrix:

total_student_score <- rowSums(student_score_matrix)
total_student_score
##  Steve  Carol    Sam Maddie  Aaron   Erin    Ian   Kyle   Lucy 
##      7      7      9      9      8      8     10     11     11
total_score_bySemester <- colSums(student_score_matrix)
total_score_bySemester
## Winter Spring 
##     41     39

Now that we have each student's total, the next step is to add it as a new column to the original matrix, using the cbind() function. To add a new row, use the rbind() function instead.

all_score_matrix <- cbind(student_score_matrix, total_student_score)
all_score_matrix
##        Winter Spring total_student_score
## Steve       3      4                   7
## Carol       4      3                   7
## Sam         4      5                   9
## Maddie      4      5                   9
## Aaron       5      3                   8
## Erin        5      3                   8
## Ian         5      5                  10
## Kyle        5      6                  11
## Lucy        6      5                  11
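
For the row case, a minimal sketch with rbind() might look like the following. It uses only the first three students so that it runs on its own; semester_total and with_total_row are illustrative names.

```r
# A small version of the score matrix (filled by column: Winter first, then Spring).
student_score_matrix <- matrix(c(3, 4, 4, 4, 3, 5), nrow = 3,
                               dimnames = list(c("Steve", "Carol", "Sam"),
                                               c("Winter", "Spring")))

semester_total <- colSums(student_score_matrix)

# The argument name "Total" becomes the name of the new row.
with_total_row <- rbind(student_score_matrix, Total = semester_total)
with_total_row
```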

Similar to vectors, we can also select elements in matrices. As a matrix has two dimensions, we have to specify both dimensions. If you want to select all elements in a specific row or column, you can leave the number in column or row blank. Alternatively, specific subsets of a matrix can also be selected.

all_score_matrix[4,3]
## [1] 9
#select all elements in a specific row or column
all_score_matrix[7,]
##              Winter              Spring total_student_score 
##                   5                   5                  10
all_score_matrix[,1]
##  Steve  Carol    Sam Maddie  Aaron   Erin    Ian   Kyle   Lucy 
##      3      4      4      4      5      5      5      5      6
#subset of a matrix
all_score_matrix[3:5,]
##        Winter Spring total_student_score
## Sam         4      5                   9
## Maddie      4      5                   9
## Aaron       5      3                   8
all_score_matrix[,1:2]
##        Winter Spring
## Steve       3      4
## Carol       4      3
## Sam         4      5
## Maddie      4      5
## Aaron       5      3
## Erin        5      3
## Ian         5      5
## Kyle        5      6
## Lucy        6      5
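
Rows and columns can also be selected by name or by a logical condition rather than by position. The sketch below uses a reduced four-student matrix called scores, an illustrative object that is not part of the workshop data.

```r
scores <- matrix(c(3, 4, 4, 5, 4, 3, 5, 6), nrow = 4,
                 dimnames = list(c("Steve", "Carol", "Sam", "Kyle"),
                                 c("Winter", "Spring")))

scores["Kyle", "Spring"]           # one cell, by row and column name
scores[c("Steve", "Sam"), ]        # several rows by name
scores[scores[, "Spring"] > 4, ]   # rows that meet a condition

# When only one row matches, R returns a plain vector unless drop = FALSE is set.
scores[scores[, "Spring"] > 5, , drop = FALSE]
```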

C. Factor

A factor is a data type for categorical variables, such as sex (M/F) or grades (A/B/C/D). To specify that a vector contains a categorical variable, we use the factor() function.

#back to the original example: we assign values to the sex variable for all observations:
sex_vector <- c("M", "F", "M", "F", "M", "F", "M", "M", "F")
names(sex_vector) <- student_names
sex_vector
##  Steve  Carol    Sam Maddie  Aaron   Erin    Ian   Kyle   Lucy 
##    "M"    "F"    "M"    "F"    "M"    "F"    "M"    "M"    "F"
#However, this is still a vector with characters, not factor. To turn this into a factor, we use factor():

sex_vector_factor <- factor(sex_vector)
sex_vector_factor
##  Steve  Carol    Sam Maddie  Aaron   Erin    Ian   Kyle   Lucy 
##      M      F      M      F      M      F      M      M      F 
## Levels: F M

If you want to code an ordinal categorical variable such as grades, add the ordered and levels arguments to factor().

winter_grade <- c("D", "C", "C", "C", "B", "B", "B", "B", "A")
spring_grade <- c("C", "D", "B", "B", "C", "C", "B", "A", "B")

winter_grade_factor <- factor(winter_grade, ordered = TRUE, levels = c("D", "C", "B", "A"))
spring_grade_factor <- factor(spring_grade, ordered = TRUE, levels = c("D", "C", "B", "A"))

winter_grade_factor
## [1] D C C C B B B B A
## Levels: D < C < B < A
spring_grade_factor
## [1] C D B B C C B A B
## Levels: D < C < B < A

Take a quick overview of the factor variables with summary():

summary(winter_grade_factor)
## D C B A 
## 1 3 4 1
summary(spring_grade_factor)
## D C B A 
## 1 3 4 1
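
Because the grade factors are ordered, they can be compared against a level, which is not possible with an unordered factor. The sketch below re-creates the winter grades so it runs on its own and uses only base R functions.

```r
winter_grade <- c("D", "C", "C", "C", "B", "B", "B", "B", "A")
winter_grade_factor <- factor(winter_grade, ordered = TRUE,
                              levels = c("D", "C", "B", "A"))

levels(winter_grade_factor)        # the categories, lowest first
as.integer(winter_grade_factor)    # the underlying integer codes
winter_grade_factor >= "B"         # comparisons respect the level order
sum(winter_grade_factor >= "B")    # number of students with a B or better
table(winter_grade_factor)         # counts per level
```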

D. Data Frame

So far, each observation has come with variables of several data types. It is more convenient to combine them all into one data set, which can be done with a data frame. A data frame typically holds variables of different types as columns and one observation per row. Create one with data.frame():

student_df <- data.frame(score_winter, score_spring, winter_grade_factor, spring_grade_factor, sex_vector_factor)

student_df
##        score_winter score_spring winter_grade_factor spring_grade_factor
## Steve             3            4                   D                   C
## Carol             4            3                   C                   D
## Sam               4            5                   C                   B
## Maddie            4            5                   C                   B
## Aaron             5            3                   B                   C
## Erin              5            3                   B                   C
## Ian               5            5                   B                   B
## Kyle              5            6                   B                   A
## Lucy              6            5                   A                   B
##        sex_vector_factor
## Steve                  M
## Carol                  F
## Sam                    M
## Maddie                 F
## Aaron                  M
## Erin                   F
## Ian                    M
## Kyle                   M
## Lucy                   F

Useful functions for a first inspection of a data frame are head(), which shows the first few observations, and str(), which shows the structure of the data frame.

str(student_df)
## 'data.frame':    9 obs. of  5 variables:
##  $ score_winter       : num  3 4 4 4 5 5 5 5 6
##  $ score_spring       : num  4 3 5 5 3 3 5 6 5
##  $ winter_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 1 2 2 2 3 3 3 3 4
##  $ spring_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 2 1 3 3 2 2 3 4 3
##  $ sex_vector_factor  : Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 2 1
head(student_df)
##        score_winter score_spring winter_grade_factor spring_grade_factor
## Steve             3            4                   D                   C
## Carol             4            3                   C                   D
## Sam               4            5                   C                   B
## Maddie            4            5                   C                   B
## Aaron             5            3                   B                   C
## Erin              5            3                   B                   C
##        sex_vector_factor
## Steve                  M
## Carol                  F
## Sam                    M
## Maddie                 F
## Aaron                  M
## Erin                   F
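
A few other base R functions are handy for the same purpose. The sketch below applies them to a two-column version of the data, re-created here so the example is self-contained.

```r
student_df <- data.frame(
  score_winter = c(3, 4, 4, 4, 5, 5, 5, 5, 6),
  score_spring = c(4, 3, 5, 5, 3, 3, 5, 6, 5)
)

dim(student_df)           # number of rows, then number of columns
names(student_df)         # column names
head(student_df, n = 3)   # first three observations
tail(student_df, n = 3)   # last three observations
summary(student_df)       # summary statistics for each column
```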

Similar to a matrix, you can also select elements in the data frame.

student_df[3,4]
## [1] B
## Levels: D < C < B < A
student_df[3,] #select one observation
##     score_winter score_spring winter_grade_factor spring_grade_factor
## Sam            4            5                   C                   B
##     sex_vector_factor
## Sam                 M
student_df[,3] #select one variable
## [1] D C C C B B B B A
## Levels: D < C < B < A
#Alternatively, if you know the variable name:
student_df$winter_grade_factor
## [1] D C C C B B B B A
## Levels: D < C < B < A

Alternatively, you can also use a function subset() to select elements in the data frame.

?subset
#selecting students with grades greater than C in the winter semester
subset(student_df, winter_grade_factor > "C")
##       score_winter score_spring winter_grade_factor spring_grade_factor
## Aaron            5            3                   B                   C
## Erin             5            3                   B                   C
## Ian              5            5                   B                   B
## Kyle             5            6                   B                   A
## Lucy             6            5                   A                   B
##       sex_vector_factor
## Aaron                 M
## Erin                  F
## Ian                   M
## Kyle                  M
## Lucy                  F
#selecting students with grades equal to A in the spring semester
subset(student_df, spring_grade_factor == "A")
##      score_winter score_spring winter_grade_factor spring_grade_factor
## Kyle            5            6                   B                   A
##      sex_vector_factor
## Kyle                 M
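
subset() also accepts combined conditions and a select argument for choosing columns. The following is a sketch on a reduced data frame; the column name sex and the object high_spring_f are illustrative.

```r
student_df <- data.frame(
  score_winter = c(3, 4, 4, 4, 5, 5, 5, 5, 6),
  score_spring = c(4, 3, 5, 5, 3, 3, 5, 6, 5),
  sex = factor(c("M", "F", "M", "F", "M", "F", "M", "M", "F")),
  row.names = c("Steve", "Carol", "Sam", "Maddie", "Aaron", "Erin", "Ian", "Kyle", "Lucy")
)

# Female students who scored 5 or more in spring, keeping only the score columns.
high_spring_f <- subset(student_df, score_spring >= 5 & sex == "F",
                        select = c(score_winter, score_spring))
high_spring_f

# The same selection written with bracket indexing.
student_df[student_df$score_spring >= 5 & student_df$sex == "F",
           c("score_winter", "score_spring")]
```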

Reorder the data frame by the values of the spring_grade_factor variable, using order().

order(student_df$spring_grade_factor)
## [1] 2 1 5 6 3 4 7 9 8
student_df[order(student_df$spring_grade_factor),]
##        score_winter score_spring winter_grade_factor spring_grade_factor
## Carol             4            3                   C                   D
## Steve             3            4                   D                   C
## Aaron             5            3                   B                   C
## Erin              5            3                   B                   C
## Sam               4            5                   C                   B
## Maddie            4            5                   C                   B
## Ian               5            5                   B                   B
## Lucy              6            5                   A                   B
## Kyle              5            6                   B                   A
##        sex_vector_factor
## Carol                  F
## Steve                  M
## Aaron                  M
## Erin                   F
## Sam                    M
## Maddie                 F
## Ian                    M
## Lucy                   F
## Kyle                   M
#reverse the order:
student_df[order(student_df$spring_grade_factor, decreasing = T), ]
##        score_winter score_spring winter_grade_factor spring_grade_factor
## Kyle              5            6                   B                   A
## Sam               4            5                   C                   B
## Maddie            4            5                   C                   B
## Ian               5            5                   B                   B
## Lucy              6            5                   A                   B
## Steve             3            4                   D                   C
## Aaron             5            3                   B                   C
## Erin              5            3                   B                   C
## Carol             4            3                   C                   D
##        sex_vector_factor
## Kyle                   M
## Sam                    M
## Maddie                 F
## Ian                    M
## Lucy                   F
## Steve                  M
## Aaron                  M
## Erin                   F
## Carol                  F
#only interested in the score in spring semester, ordered by the grade:
student_df$score_spring[order(student_df$spring_grade_factor, decreasing = T)]
## [1] 6 5 5 5 5 4 3 3 3
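
order() can take more than one sorting key, with later keys breaking ties in earlier ones. The sketch below sorts by spring grade and then by winter score, highest first, using a minus sign on the numeric column. The column name spring_grade and the object sorted_df are illustrative.

```r
student_df <- data.frame(
  score_winter = c(3, 4, 4, 4, 5, 5, 5, 5, 6),
  spring_grade = factor(c("C", "D", "B", "B", "C", "C", "B", "A", "B"),
                        levels = c("D", "C", "B", "A"), ordered = TRUE),
  row.names = c("Steve", "Carol", "Sam", "Maddie", "Aaron", "Erin", "Ian", "Kyle", "Lucy")
)

# First key: spring grade, lowest first. Second key: winter score, highest first.
sorted_df <- student_df[order(student_df$spring_grade, -student_df$score_winter), ]
sorted_df
```

Students who tie on both keys keep their original relative order.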

E. List

A list can store many different types of variables and data structures under one name.

student_info <- list(winter_grade_factor, spring_grade_factor, student_df)
student_info
## [[1]]
## [1] D C C C B B B B A
## Levels: D < C < B < A
## 
## [[2]]
## [1] C D B B C C B A B
## Levels: D < C < B < A
## 
## [[3]]
##        score_winter score_spring winter_grade_factor spring_grade_factor
## Steve             3            4                   D                   C
## Carol             4            3                   C                   D
## Sam               4            5                   C                   B
## Maddie            4            5                   C                   B
## Aaron             5            3                   B                   C
## Erin              5            3                   B                   C
## Ian               5            5                   B                   B
## Kyle              5            6                   B                   A
## Lucy              6            5                   A                   B
##        sex_vector_factor
## Steve                  M
## Carol                  F
## Sam                    M
## Maddie                 F
## Aaron                  M
## Erin                   F
## Ian                    M
## Kyle                   M
## Lucy                   F
#name each item stored in the list
names(student_info) <- c("winter_grade", "spring_grade", "student_df")
student_info
## $winter_grade
## [1] D C C C B B B B A
## Levels: D < C < B < A
## 
## $spring_grade
## [1] C D B B C C B A B
## Levels: D < C < B < A
## 
## $student_df
##        score_winter score_spring winter_grade_factor spring_grade_factor
## Steve             3            4                   D                   C
## Carol             4            3                   C                   D
## Sam               4            5                   C                   B
## Maddie            4            5                   C                   B
## Aaron             5            3                   B                   C
## Erin              5            3                   B                   C
## Ian               5            5                   B                   B
## Kyle              5            6                   B                   A
## Lucy              6            5                   A                   B
##        sex_vector_factor
## Steve                  M
## Carol                  F
## Sam                    M
## Maddie                 F
## Aaron                  M
## Erin                   F
## Ian                    M
## Kyle                   M
## Lucy                   F
#call an item in the list:
student_info$student_df
##        score_winter score_spring winter_grade_factor spring_grade_factor
## Steve             3            4                   D                   C
## Carol             4            3                   C                   D
## Sam               4            5                   C                   B
## Maddie            4            5                   C                   B
## Aaron             5            3                   B                   C
## Erin              5            3                   B                   C
## Ian               5            5                   B                   B
## Kyle              5            6                   B                   A
## Lucy              6            5                   A                   B
##        sex_vector_factor
## Steve                  M
## Carol                  F
## Sam                    M
## Maddie                 F
## Aaron                  M
## Erin                   F
## Ian                    M
## Kyle                   M
## Lucy                   F
student_info[[3]]
##        score_winter score_spring winter_grade_factor spring_grade_factor
## Steve             3            4                   D                   C
## Carol             4            3                   C                   D
## Sam               4            5                   C                   B
## Maddie            4            5                   C                   B
## Aaron             5            3                   B                   C
## Erin              5            3                   B                   C
## Ian               5            5                   B                   B
## Kyle              5            6                   B                   A
## Lucy              6            5                   A                   B
##        sex_vector_factor
## Steve                  M
## Carol                  F
## Sam                    M
## Maddie                 F
## Aaron                  M
## Erin                   F
## Ian                    M
## Kyle                   M
## Lucy                   F
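
The difference between single and double brackets is worth a quick sketch: single brackets return a smaller list, while double brackets and $ return the item itself. The list below is a reduced, illustrative version of student_info.

```r
student_info <- list(
  winter_grade = c("D", "C", "B"),
  scores = data.frame(winter = c(3, 4, 5), spring = c(4, 3, 3))
)

class(student_info["scores"])     # single brackets: still a list
class(student_info[["scores"]])   # double brackets: the data frame itself
student_info$scores$spring        # chain $ to reach a column inside the item

# Add a new item to an existing list by assigning to a new name.
student_info$n_students <- nrow(student_info$scores)
names(student_info)
```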

IV. Install and Load R Packages

An R package is a collection of functions, developed and shared by R developers, that meets needs not covered by base R (the default package that comes with R when you download it). To use an R package, you first need to install it and then load it into your workspace.

A. Install R Packages from CRAN

CRAN stands for the Comprehensive R Archive Network, the official repository from which R packages can be downloaded for free. We can download R packages from CRAN by calling the install.packages() function. Here, we are installing the tidyr package, which we will use in the next part of this workshop. After installing a package, you must also load it into your workspace by calling the library() function.

#install.packages("tidyr")
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.1.1
#installing multiple packages at once:
#install.packages(c("ggplot2", "dplyr"))

RStudio also provides an easy interface for installing and loading R packages: click the Install button in the Packages tab and type the name of the package you want to install. To load a package, go to the Packages tab and check the box in front of its name.
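
To avoid reinstalling a package that is already present, one possible pattern is to check for it first. The helper is_installed() below is a hypothetical name written for this sketch; it wraps the base R function requireNamespace(). The install line is commented out, as in the code above.

```r
is_installed <- function(pkg) {
  # requireNamespace() returns TRUE when the package can be loaded, FALSE otherwise.
  requireNamespace(pkg, quietly = TRUE)
}

is_installed("stats")   # "stats" ships with R, so this is TRUE

# Install only when the package is missing:
# if (!is_installed("tidyr")) install.packages("tidyr")
```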

B. Install R Packages from GitHub

Alternatively, you can also download the package from GitHub using the install_github() function from the devtools package:

#devtools::install_github("tidyverse/ggplot2")

V. Reading Data into R

Last but not least, I will cover how to read different files into R. Specifically, I will go over different methods for reading both flat files, such as .csv, and Excel sheets into R.

A. Reading files from Base R

First, we will read a .csv file into R using read.csv() from base R. Before running the code, make sure to download the file here and that the file is in your R working directory.

student_df <- read.csv("student_data.csv")
student_df
##     name score_winter score_spring winter_grade_factor spring_grade_factor
## 1  Steve            3            4                   D                   C
## 2  Carol            4            3                   C                   D
## 3    Sam            4            5                   C                   B
## 4 Maddie            4            5                   C                   B
## 5  Aaron            5            3                   B                   C
## 6   Erin            5            3                   B                   C
## 7    Ian            5            5                   B                   B
## 8   Kyle            5            6                   B                   A
## 9   Lucy            6            5                   A                   B
##   sex_vector_factor
## 1                 M
## 2                 F
## 3                 M
## 4                 F
## 5                 M
## 6                 F
## 7                 M
## 8                 M
## 9                 F
#examine the structure of the data frame: do you notice anything different from the previous data frame we have constructed?
str(student_df)
## 'data.frame':    9 obs. of  6 variables:
##  $ name               : chr  "Steve" "Carol" "Sam" "Maddie" ...
##  $ score_winter       : int  3 4 4 4 5 5 5 5 6
##  $ score_spring       : int  4 3 5 5 3 3 5 6 5
##  $ winter_grade_factor: chr  "D" "C" "C" "C" ...
##  $ spring_grade_factor: chr  "C" "D" "B" "B" ...
##  $ sex_vector_factor  : chr  "M" "F" "M" "F" ...

Alternatively, you can use the read.table() function to read a .csv file by specifying the separator as a comma (sep = ","). Other separators can be specified in the same way.

#specify also that the first row is the name header by including the option header = T:
read.table("student_data.csv", header = T, sep = ",") #a .csv file is basically separated by ","
##     name score_winter score_spring winter_grade_factor spring_grade_factor
## 1  Steve            3            4                   D                   C
## 2  Carol            4            3                   C                   D
## 3    Sam            4            5                   C                   B
## 4 Maddie            4            5                   C                   B
## 5  Aaron            5            3                   B                   C
## 6   Erin            5            3                   B                   C
## 7    Ian            5            5                   B                   B
## 8   Kyle            5            6                   B                   A
## 9   Lucy            6            5                   A                   B
##   sex_vector_factor
## 1                 M
## 2                 F
## 3                 M
## 4                 F
## 5                 M
## 6                 F
## 7                 M
## 8                 M
## 9                 F
#Again, do you see the same problem as read.csv() here?
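
The question above concerns column types: text columns are read as character rather than factor. The sketch below reproduces this on a tiny made-up file written to a temporary location, so it does not depend on student_data.csv. It assumes R 4.0 or later, where stringsAsFactors defaults to FALSE; csv_path, demo_df, and demo_factor_df are illustrative names.

```r
# Write a small illustrative .csv file to a temporary location.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score,grade",
             "Steve,3,D",
             "Carol,4,C"), csv_path)

demo_df <- read.csv(csv_path)
sapply(demo_df, class)          # name and grade are read as character

# stringsAsFactors = TRUE converts every text column to an unordered factor.
demo_factor_df <- read.csv(csv_path, stringsAsFactors = TRUE)
sapply(demo_factor_df, class)
```

Note that stringsAsFactors = TRUE converts all text columns, including name, and cannot create ordered factors.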

After inspecting the imported data, we need to convert the grade and sex variables into factors, using the factor() function:

student_df$winter_grade_factor <- factor(student_df$winter_grade_factor, order = T, levels = c("D", "C", "B", "A"))
str(student_df)
## 'data.frame':    9 obs. of  6 variables:
##  $ name               : chr  "Steve" "Carol" "Sam" "Maddie" ...
##  $ score_winter       : int  3 4 4 4 5 5 5 5 6
##  $ score_spring       : int  4 3 5 5 3 3 5 6 5
##  $ winter_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 1 2 2 2 3 3 3 3 4
##  $ spring_grade_factor: chr  "C" "D" "B" "B" ...
##  $ sex_vector_factor  : chr  "M" "F" "M" "F" ...
student_df
##     name score_winter score_spring winter_grade_factor spring_grade_factor
## 1  Steve            3            4                   D                   C
## 2  Carol            4            3                   C                   D
## 3    Sam            4            5                   C                   B
## 4 Maddie            4            5                   C                   B
## 5  Aaron            5            3                   B                   C
## 6   Erin            5            3                   B                   C
## 7    Ian            5            5                   B                   B
## 8   Kyle            5            6                   B                   A
## 9   Lucy            6            5                   A                   B
##   sex_vector_factor
## 1                 M
## 2                 F
## 3                 M
## 4                 F
## 5                 M
## 6                 F
## 7                 M
## 8                 M
## 9                 F
student_df$spring_grade_factor <- factor(student_df$spring_grade_factor, order = T, levels = c("D", "C", "B", "A"))
str(student_df)
## 'data.frame':    9 obs. of  6 variables:
##  $ name               : chr  "Steve" "Carol" "Sam" "Maddie" ...
##  $ score_winter       : int  3 4 4 4 5 5 5 5 6
##  $ score_spring       : int  4 3 5 5 3 3 5 6 5
##  $ winter_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 1 2 2 2 3 3 3 3 4
##  $ spring_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 2 1 3 3 2 2 3 4 3
##  $ sex_vector_factor  : chr  "M" "F" "M" "F" ...
student_df$sex_vector_factor <- factor(student_df$sex_vector_factor, order = F)
str(student_df)
## 'data.frame':    9 obs. of  6 variables:
##  $ name               : chr  "Steve" "Carol" "Sam" "Maddie" ...
##  $ score_winter       : int  3 4 4 4 5 5 5 5 6
##  $ score_spring       : int  4 3 5 5 3 3 5 6 5
##  $ winter_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 1 2 2 2 3 3 3 3 4
##  $ spring_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 2 1 3 3 2 2 3 4 3
##  $ sex_vector_factor  : Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 2 1
#Alternatively, you can also pre-specify the data type in the read.table function:
student_df <- read.table("student_data.csv", header = T, sep = ",", 
                         colClasses = c("character", "numeric", "numeric", "factor", "factor", "factor"))
str(student_df) #notice the grade factors: they are not ordinal. You will need to manually change the type of the variable:
## 'data.frame':    9 obs. of  6 variables:
##  $ name               : chr  "Steve" "Carol" "Sam" "Maddie" ...
##  $ score_winter       : num  3 4 4 4 5 5 5 5 6
##  $ score_spring       : num  4 3 5 5 3 3 5 6 5
##  $ winter_grade_factor: Factor w/ 4 levels "A","B","C","D": 4 3 3 3 2 2 2 2 1
##  $ spring_grade_factor: Factor w/ 4 levels "A","B","C","D": 3 4 2 2 3 3 2 1 2
##  $ sex_vector_factor  : Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 2 1
student_df$winter_grade_factor <- factor(student_df$winter_grade_factor, ordered = TRUE, levels = c("D", "C", "B", "A"))
student_df$spring_grade_factor <- factor(student_df$spring_grade_factor, ordered = TRUE, levels = c("D", "C", "B", "A"))

str(student_df)
## 'data.frame':    9 obs. of  6 variables:
##  $ name               : chr  "Steve" "Carol" "Sam" "Maddie" ...
##  $ score_winter       : num  3 4 4 4 5 5 5 5 6
##  $ score_spring       : num  4 3 5 5 3 3 5 6 5
##  $ winter_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 1 2 2 2 3 3 3 3 4
##  $ spring_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 2 1 3 3 2 2 3 4 3
##  $ sex_vector_factor  : Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 2 1
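
The reason for declaring the grades as ordered is that comparison operators such as > are only meaningful for ordered factors. The following is a minimal sketch using a made-up vector (grades is not part of the workshop data):

```r
# Hypothetical grades, used only for illustration
grades <- c("D", "C", "B", "A", "B")

# Unordered factor: the levels are simply sorted alphabetically
grades_unordered <- factor(grades)

# Ordered factor: we state explicitly that D < C < B < A
grades_ordered <- factor(grades, levels = c("D", "C", "B", "A"), ordered = TRUE)

# Comparisons work element by element on the ordered factor
above_c <- grades_ordered > "C"
print(above_c)
# FALSE FALSE  TRUE  TRUE  TRUE

# On an unordered factor, `>` is not meaningful: R warns and returns NA
unordered_result <- suppressWarnings(grades_unordered > "C")
print(unordered_result)

# Functions such as max() also respect the declared order
best_grade <- max(grades_ordered)
print(best_grade)
```

Without the levels argument, R would sort the levels alphabetically, which would treat "A" as the lowest grade.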

B. Reading files from the readr package

The functions in the readr package are generally faster and easier to use than read.table() from base R. The main difference is that the output is a tibble instead of a plain data frame. A tibble is basically a data frame with some additional behavior, such as a more compact printout, which we will learn more about later in this workshop.

library(readr)

student_df_readr <- read_csv("student_data.csv")
## Rows: 9 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): name, winter_grade_factor, spring_grade_factor, sex_vector_factor
## dbl (2): score_winter, score_spring
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(student_df_readr)
## spec_tbl_df [9 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ name               : chr [1:9] "Steve" "Carol" "Sam" "Maddie" ...
##  $ score_winter       : num [1:9] 3 4 4 4 5 5 5 5 6
##  $ score_spring       : num [1:9] 4 3 5 5 3 3 5 6 5
##  $ winter_grade_factor: chr [1:9] "D" "C" "C" "C" ...
##  $ spring_grade_factor: chr [1:9] "C" "D" "B" "B" ...
##  $ sex_vector_factor  : chr [1:9] "M" "F" "M" "F" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   name = col_character(),
##   ..   score_winter = col_double(),
##   ..   score_spring = col_double(),
##   ..   winter_grade_factor = col_character(),
##   ..   spring_grade_factor = col_character(),
##   ..   sex_vector_factor = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
student_df_readr
## # A tibble: 9 × 6
##   name   score_winter score_spring winter_grade_factor spring_grade_factor
##   <chr>         <dbl>        <dbl> <chr>               <chr>              
## 1 Steve             3            4 D                   C                  
## 2 Carol             4            3 C                   D                  
## 3 Sam               4            5 C                   B                  
## 4 Maddie            4            5 C                   B                  
## 5 Aaron             5            3 B                   C                  
## 6 Erin              5            3 B                   C                  
## 7 Ian               5            5 B                   B                  
## 8 Kyle              5            6 B                   A                  
## 9 Lucy              6            5 A                   B                  
## # … with 1 more variable: sex_vector_factor <chr>
#do you notice something we need to fix here?

Again, the imported columns do not have the data types we want: the grade and sex columns are read as character rather than factor. The col_types option lets you specify the data type of each variable, much like colClasses in read.table() from base R. In its compact form, col_types is a string with one letter per column:

  • c for character
  • n for numeric
  • d for double
  • i for integer
  • l for logical
  • f for factor
student_df_readr <- read_csv("student_data.csv", col_types = "cnnfff")
str(student_df_readr)
## spec_tbl_df [9 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ name               : chr [1:9] "Steve" "Carol" "Sam" "Maddie" ...
##  $ score_winter       : num [1:9] 3 4 4 4 5 5 5 5 6
##  $ score_spring       : num [1:9] 4 3 5 5 3 3 5 6 5
##  $ winter_grade_factor: Factor w/ 4 levels "D","C","B","A": 1 2 2 2 3 3 3 3 4
##  $ spring_grade_factor: Factor w/ 4 levels "C","D","B","A": 1 2 3 3 1 1 3 4 3
##  $ sex_vector_factor  : Factor w/ 2 levels "M","F": 1 2 1 2 1 2 1 1 2
##  - attr(*, "spec")=
##   .. cols(
##   ..   name = col_character(),
##   ..   score_winter = col_number(),
##   ..   score_spring = col_number(),
##   ..   winter_grade_factor = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   spring_grade_factor = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   sex_vector_factor = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE)
##   .. )
##  - attr(*, "problems")=<externalptr>
#do you notice something missing here?

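Notice that the grade factors are not ordered, and that their levels simply follow the order in which the values first appear in the file (for example, "C", "D", "B", "A" for the spring grades). To get an ordinal categorical variable, we need to spell out the col_types option in more detail: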

of <- col_factor(levels = c("D", "C", "B", "A"), ordered = T)
cha <- col_character()
int <- col_integer()
fac <- col_factor(levels = c("F", "M"))
student_df_readr <- read_csv("student_data.csv", col_types = list(cha, int, int, of, of, fac))
str(student_df_readr)
## spec_tbl_df [9 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ name               : chr [1:9] "Steve" "Carol" "Sam" "Maddie" ...
##  $ score_winter       : int [1:9] 3 4 4 4 5 5 5 5 6
##  $ score_spring       : int [1:9] 4 3 5 5 3 3 5 6 5
##  $ winter_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 1 2 2 2 3 3 3 3 4
##  $ spring_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 2 1 3 3 2 2 3 4 3
##  $ sex_vector_factor  : Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 2 1
##  - attr(*, "spec")=
##   .. cols(
##   ..   name = col_character(),
##   ..   score_winter = col_integer(),
##   ..   score_spring = col_integer(),
##   ..   winter_grade_factor = col_factor(levels = c("D", "C", "B", "A"), ordered = TRUE, include_na = FALSE),
##   ..   spring_grade_factor = col_factor(levels = c("D", "C", "B", "A"), ordered = TRUE, include_na = FALSE),
##   ..   sex_vector_factor = col_factor(levels = c("F", "M"), ordered = FALSE, include_na = FALSE)
##   .. )
##  - attr(*, "problems")=<externalptr>
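
The column types can also be matched by name rather than by position, using cols(). The following is a minimal sketch, assuming readr 2.0 or later (needed for I()) and a small made-up CSV string standing in for student_data.csv:

```r
library(readr)

# A small inline CSV used only for illustration (made-up subset of columns)
csv_text <- "name,score_winter,winter_grade_factor,sex_vector_factor
Steve,3,D,M
Carol,4,C,F
Lucy,6,A,F"

# Name each column explicitly instead of relying on its position
spec <- cols(
  name = col_character(),
  score_winter = col_integer(),
  winter_grade_factor = col_factor(levels = c("D", "C", "B", "A"), ordered = TRUE),
  sex_vector_factor = col_factor(levels = c("F", "M"))
)

# I() tells readr to treat the string as literal data rather than a file path
demo_df <- read_csv(I(csv_text), col_types = spec)
str(demo_df$winter_grade_factor)
```

Naming the columns makes the specification easier to read and less fragile if the column order in the file changes.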

The readr package also has an equivalent of read.table(): read_delim(), whose delim option lets you read files that use a different separator:

read_delim("student_data.csv", delim = ",", col_types = list(cha, int, int, of, of, fac))
## # A tibble: 9 × 6
##   name   score_winter score_spring winter_grade_factor spring_grade_factor
##   <chr>         <int>        <int> <ord>               <ord>              
## 1 Steve             3            4 D                   C                  
## 2 Carol             4            3 C                   D                  
## 3 Sam               4            5 C                   B                  
## 4 Maddie            4            5 C                   B                  
## 5 Aaron             5            3 B                   C                  
## 6 Erin              5            3 B                   C                  
## 7 Ian               5            5 B                   B                  
## 8 Kyle              5            6 B                   A                  
## 9 Lucy              6            5 A                   B                  
## # … with 1 more variable: sex_vector_factor <fct>
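
To illustrate the delim option with separators other than a comma, here is a minimal sketch. The inline strings are made-up stand-ins for real files, and I() assumes readr 2.0 or later:

```r
library(readr)

# Made-up data that uses ";" as the separator
semicolon_text <- "name;score_winter;score_spring
Steve;3;4
Carol;4;3"

# "cii" means: character, integer, integer
semicolon_df <- read_delim(I(semicolon_text), delim = ";", col_types = "cii")
print(semicolon_df)

# The same idea for tab-separated data ("\t" is the tab character)
tab_text <- "name\tscore_winter\nSteve\t3\nCarol\t4"
tab_df <- read_delim(I(tab_text), delim = "\t", col_types = "ci")
print(tab_df)
```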

C. Reading files from the readxl package

So far, we have only covered flat files. What if we want to read data from an Excel file? We can do so with the readxl package. First, download the Excel file that we will be working with here.

library(readxl)
excel_sheets("student_data.xlsx") #show all worksheets
## [1] "2021" "2022"
student_df_2021 <- read_excel("student_data.xlsx", sheet = 1)
student_df_2022 <- read_excel("student_data.xlsx", sheet = 2)

str(student_df_2021)
## tibble [9 × 6] (S3: tbl_df/tbl/data.frame)
##  $ name               : chr [1:9] "Steve" "Carol" "Sam" "Maddie" ...
##  $ score_winter       : num [1:9] 3 4 4 4 5 5 5 5 6
##  $ score_spring       : num [1:9] 4 3 5 5 3 3 5 6 5
##  $ winter_grade_factor: chr [1:9] "D" "C" "C" "C" ...
##  $ spring_grade_factor: chr [1:9] "C" "D" "B" "B" ...
##  $ sex_vector_factor  : chr [1:9] "M" "F" "M" "F" ...
str(student_df_2022)
## tibble [9 × 6] (S3: tbl_df/tbl/data.frame)
##  $ name               : chr [1:9] "Steve" "Carol" "Sam" "Maddie" ...
##  $ score_winter       : num [1:9] 4 4 3 3 5 6 4 5 5
##  $ score_spring       : num [1:9] 4 4 4 4 5 5 6 4 6
##  $ winter_grade_factor: chr [1:9] "C" "C" "D" "D" ...
##  $ spring_grade_factor: chr [1:9] "C" "C" "C" "C" ...
##  $ sex_vector_factor  : chr [1:9] "M" "F" "M" "F" ...

Note that, unlike readr, readxl cannot import a column directly as a factor, so we need to convert the variables to factors after importing the data from Excel:

student_df_2021$winter_grade_factor <- factor(student_df_2021$winter_grade_factor, levels = c("D", "C", "B", "A"), ordered = T)
student_df_2021$spring_grade_factor <- factor(student_df_2021$spring_grade_factor, levels = c("D", "C", "B", "A"), ordered = T)
student_df_2021$sex_vector_factor <- factor(student_df_2021$sex_vector_factor, levels = c("F", "M"), ordered = F)

str(student_df_2021)
## tibble [9 × 6] (S3: tbl_df/tbl/data.frame)
##  $ name               : chr [1:9] "Steve" "Carol" "Sam" "Maddie" ...
##  $ score_winter       : num [1:9] 3 4 4 4 5 5 5 5 6
##  $ score_spring       : num [1:9] 4 3 5 5 3 3 5 6 5
##  $ winter_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 1 2 2 2 3 3 3 3 4
##  $ spring_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 2 1 3 3 2 2 3 4 3
##  $ sex_vector_factor  : Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 2 1
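
When several columns share the same levels, the conversion can be written once and applied to all of them. The following is a minimal sketch, assuming a small made-up data frame (demo_df) in place of the tibble returned by read_excel():

```r
# Made-up stand-in for an imported sheet, with grades stored as character
demo_df <- data.frame(
  name = c("Steve", "Carol", "Lucy"),
  winter_grade_factor = c("D", "C", "A"),
  spring_grade_factor = c("C", "D", "B"),
  stringsAsFactors = FALSE
)

grade_cols <- c("winter_grade_factor", "spring_grade_factor")
grade_levels <- c("D", "C", "B", "A")

# lapply() runs the same factor() call on every grade column
demo_df[grade_cols] <- lapply(
  demo_df[grade_cols],
  factor, levels = grade_levels, ordered = TRUE
)

str(demo_df)
```

This avoids repeating the same factor() call for each column, which becomes tedious when a sheet has many grade columns.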

D. Reading data in RStudio

RStudio also has a point-and-click option for importing a data set, which simply runs the base R, readr, or readxl functions that we covered earlier. To use it, go to the File menu and select Import Dataset.

VI. If & Else Statements and For-Loop

When writing code, you will sometimes want to execute a portion of it only if certain conditions are true and skip it otherwise. You may also want to run the same lines of code a fixed number of times. Both are very useful when you are manipulating or reshaping your data to get them ready for statistical analysis. Hence, in this section, I will cover if-else statements and for-loops, which are the backbone of coding that you should know before delving into more elaborate R exercises.

A. The If-Else Statement

If and else statements change the behavior of your script depending on a condition. The code inside the braces of an if statement is executed only when the condition, written inside the parentheses, evaluates to TRUE. The syntax of the if statement is as follows.

if(student_df$winter_grade_factor[1] > "C"){ #the condition inside the parentheses tests whether the student in row 1 has a grade greater than C
  print(student_df$name[1]) #if so, his or her name will be printed out here.
}

#notice that print() was not executed in the previous lines of code, because the grade in row 1 (D) is not greater than C.
#Let's try again:
if(student_df$winter_grade_factor[6] > "C"){ #this runs the same test for the student in row 6
  print(student_df$name[6])
}
## [1] "Erin"
#Now, because Erin (the student in row 6) has a grade greater than "C", the name is printed.

Alternatively, there may be times when you want one of two different sets of code to run, depending on whether a condition is true or false. You could write two separate if statements, but it is more efficient to check the condition once and tell R which set to run depending on the result. This is called an if-else statement.

if(student_df$winter_grade_factor[1] > "C") {
  print("Pass")
} else {
  print("Fail")  
}
## [1] "Fail"
#Let's try again with Erin
if(student_df$winter_grade_factor[6] > "C") {
  print("Pass")
} else {
  print("Fail")  
}
## [1] "Pass"
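
When more than two outcomes are needed, the same construct can be chained with else if. The following is a minimal sketch. The single grade and the "Borderline" category are made up for illustration and are not part of the workshop data:

```r
# Made-up ordered grade for a single student
grade <- factor("C", levels = c("D", "C", "B", "A"), ordered = TRUE)

# if() expects a single TRUE or FALSE, so we test one grade at a time
if (grade > "C") {
  result <- "Pass"
} else if (grade == "C") {
  result <- "Borderline"   # hypothetical third category
} else {
  result <- "Fail"
}
print(result)
```

R checks the conditions from top to bottom and runs only the first branch whose condition is TRUE.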

B. The For Loop

Checking every row manually is not very convenient. A for-loop solves this by adding another layer of commands that tells R to keep running the same code until it has gone over all rows in the data set. Another useful function is nrow(), which returns the number of rows of a data frame or a matrix (ncol() returns the number of columns).

row_num <- nrow(student_df) #get the number of rows, i.e., the number of observations in the data set
#check the number of rows we have:
row_num
## [1] 9
#put in the for-loop:
for(i in 1:row_num){
  if(student_df$winter_grade_factor[i] > "C") {
    cat(student_df$name[i], ": ", "Pass","\n")
  } else {
    cat(student_df$name[i], ": ", "Fail", "\n") 
  }
}
## Steve :  Fail 
## Carol :  Fail 
## Sam :  Fail 
## Maddie :  Fail 
## Aaron :  Pass 
## Erin :  Pass 
## Ian :  Pass 
## Kyle :  Pass 
## Lucy :  Pass
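
For a simple pass/fail label, the loop can often be replaced by the vectorized ifelse() function from base R, which tests every row at once. The following is a minimal sketch on a made-up data frame (demo_df) standing in for student_df:

```r
# Made-up stand-in for student_df with an ordered grade column
demo_df <- data.frame(
  name = c("Steve", "Aaron", "Lucy"),
  stringsAsFactors = FALSE
)
demo_df$winter_grade_factor <- factor(
  c("D", "B", "A"),
  levels = c("D", "C", "B", "A"), ordered = TRUE
)

# ifelse() checks the condition for every row and returns one label per row
demo_df$result <- ifelse(demo_df$winter_grade_factor > "C", "Pass", "Fail")
print(demo_df)

# If a loop is still needed, seq_len() is safe even when there are zero rows,
# whereas 1:0 would count backwards (1, 0)
for (i in seq_len(nrow(demo_df))) {
  cat(demo_df$name[i], ": ", demo_df$result[i], "\n")
}
```

The loop remains the more general tool when each step involves more than producing one value per row.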

VII. Basic Functions

You can also define functions of your own design. This is particularly useful when you have to repeat similar lines of code (which is often the case). Functions are usually defined at the beginning of the script or in a separate R script, for example one named defined functions, just to keep them organized.

#first define a function that you would repeat often
grade_check <- function(name, grade){ #name and grade here are the two arguments you need to put in when using this function
  if(grade > "C"){ #check if the grade is greater than C.
    cat(name, ": ", "Pass", "\n")
  } else {
    cat(name, ": ", "Fail", "\n")
  }
}

#now use the function we just created. The code is much more concise in this fashion.
for(i in 1:row_num){
  grade_check(student_df$name[i], student_df$winter_grade_factor[i])
}
## Steve :  Fail 
## Carol :  Fail 
## Sam :  Fail 
## Maddie :  Fail 
## Aaron :  Pass 
## Erin :  Pass 
## Ian :  Pass 
## Kyle :  Pass 
## Lucy :  Pass
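
A function can also return a value instead of printing it, so that the result can be stored and reused. The following is a minimal sketch. The function grade_result(), its cutoff argument, and the grades vector are hypothetical and only illustrate the idea:

```r
# Hypothetical variant of grade_check() that returns the label
# cutoff has a default value, so it can be omitted when calling the function
grade_result <- function(grade, cutoff = "C") {
  if (grade > cutoff) {
    return("Pass")
  }
  "Fail" # the last evaluated expression is returned automatically
}

# Made-up ordered grades
grades <- factor(c("D", "C", "B", "A"), levels = c("D", "C", "B", "A"), ordered = TRUE)

# Apply the function to each grade and collect the results in a character vector
results <- vapply(seq_along(grades), function(i) grade_result(grades[i]), character(1))
print(results)

# Override the default cutoff for a stricter rule
strict_results <- vapply(seq_along(grades), function(i) grade_result(grades[i], cutoff = "B"), character(1))
print(strict_results)
```

Returning the label rather than printing it means the results can be saved as a new column or passed on to other functions.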