Comparing R and RStudio: R is like an engine that runs all the code, while RStudio is a dashboard that users interact with. RStudio is an interface that adds many convenient features and tools, making R much easier to use. We will use R and RStudio throughout this workshop, as well as in three other sessions: Basic Quantitative Methods, Social Network Analysis, and Text Analysis.
First you will have to download R from https://cloud.r-project.org/. Next, download RStudio from https://www.rstudio.com/products/rstudio/download/.
RStudio has many features that are much more convenient for users. Typically, on the RStudio Screen, you should see four parts:
The console is where you type commands and see the output of your code. Once you import data into R or create new variables or a new data frame, these will be shown in the workspace at the top right of the RStudio window, under the Environment tab.
In the same place, there is also a history tab, which contains a list of commands you have used so far.
Similar to Stata’s do file, you can save a record of your coding commands in the R script file to reproduce your work and/or save your progress. The R script file is shown on the top left of the RStudio window.
If you click on a data frame saved in the environment, the data view
will be available as a tab in the same place as the R script.
To start a new project, first go to File –> New Project –> New Directory –> New Project.
Then, enter a directory name. This will be the main directory in which
you store the files used in the project. To choose the parent directory
in which this R project will be created as a sub-directory, click on
Browse... and select the location of your choice.
R can be used as a basic calculator. Let’s try some basic arithmetic operations.
Operators: +, -, *, /, ^ (exponent), %% (modulo)
2+3
## [1] 5
4*5-8
## [1] 12
(6/3)+5
## [1] 7
2^3
## [1] 8
5%%2
## [1] 1
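The calculator examples above rely on operator precedence without spelling it out. The following sketch is an illustration added here, not part of the original workshop material. It shows how R orders the operators and introduces `%/%`, base R's integer-division operator, as the companion to `%%`.

```r
# Exponentiation binds tighter than multiplication, which binds tighter than addition
precedence_result <- 2 + 3 * 2^2      # evaluated as 2 + (3 * (2^2))
grouped_result <- (2 + 3) * 2^2       # parentheses force the addition first

# %/% returns the integer quotient and %% returns the remainder
quotient <- 17 %/% 5
remainder <- 17 %% 5

precedence_result   # 14
grouped_result      # 20
quotient            # 3
remainder           # 2
```

When in doubt about precedence, adding parentheses costs nothing and makes the intent explicit.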
In order to assign a value to a variable of your choice, use the
operator <- as shown in the code below:
x <- 70
x
## [1] 70
y <- 50
y
## [1] 50
x+y
## [1] 120
z <- x+y #combine with the arithmetic operator
z
## [1] 120
w <- x-y
What is the value of w?
w
## [1] 20
Apart from assigning integer values to variables, you can also assign other data types to variables:
- numeric
- logical
- character
numeric_var <- 34.5
char_var <- "R Workshop"
logical_var <- TRUE
You can use a function (we will cover this later)
class() to check the type of variable assigned.
class(numeric_var)
## [1] "numeric"
class(char_var)
## [1] "character"
class(logical_var)
## [1] "logical"
For a character variable, it should also be noted that R is case-sensitive:
char_var2 <- "r workshop"
char_var2 == char_var
## [1] FALSE
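As a small illustration of the case-sensitivity point, the sketch below compares the two strings again after converting both to lower case with `tolower()`. It also shows, as an aside that the original text does not cover, that a plain number is stored as "numeric" unless the `L` suffix is used.

```r
char_var <- "R Workshop"
char_var2 <- "r workshop"

# Exact comparison is case-sensitive
exact_match <- char_var == char_var2

# Converting both sides to lower case gives a case-insensitive comparison
loose_match <- tolower(char_var) == tolower(char_var2)

exact_match   # FALSE
loose_match   # TRUE

# A plain number is stored as "numeric"; the L suffix requests an integer
class(70)     # "numeric"
class(70L)    # "integer"
```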
There are many ways to store more than one value. We will cover: vector, matrix, factor, data frame, and list. Each is suitable for different contexts.
A vector stores values of a single data type in one dimension (n x 1).
We create one using the function c():
score_winter <- c(3, 4, 4, 4, 5, 5, 5, 5, 6)
#let's check what a vector looks like:
score_winter
## [1] 3 4 4 4 5 5 5 5 6
#a vector can also store other types of data, such as character strings:
student_names <- c("Steve", "Carol", "Sam", "Maddie", "Aaron", "Erin", "Ian", "Kyle", "Lucy")
student_names
## [1] "Steve" "Carol" "Sam" "Maddie" "Aaron" "Erin" "Ian" "Kyle"
## [9] "Lucy"
#it can, however, only hold one type of variable:
test_vector <- c("Steve", 1)
class(test_vector) #1 is coerced into a character type.
## [1] "character"
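The coercion seen in test_vector follows a general rule: when types are mixed inside `c()`, R converts everything to the most flexible type present. A minimal sketch of that rule, using made-up values for illustration:

```r
# Mixing types in c() coerces every element to the most flexible type present
mixed_logical_numeric <- c(TRUE, 2, 3.5)   # TRUE becomes 1
mixed_numeric_character <- c(1, "Steve")   # 1 becomes "1"

class(mixed_logical_numeric)      # "numeric"
class(mixed_numeric_character)    # "character"

# Converting back is explicit; text that is not a number becomes NA (with a warning)
recovered <- suppressWarnings(as.numeric(mixed_numeric_character))
recovered                         # 1 NA
```

This is why a single stray text entry in a column of numbers is enough to turn the whole vector into characters.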
You can also name the elements in your vector, using the
names() function:
names(score_winter) <- student_names
score_winter
## Steve Carol Sam Maddie Aaron Erin Ian Kyle Lucy
## 3 4 4 4 5 5 5 5 6
You can also perform arithmetic operations on vectors:
score_spring <- c(4, 3, 5, 5, 3, 3, 5, 6, 5)
names(score_spring) <- student_names
total_score <- score_spring + score_winter
total_score
## Steve Carol Sam Maddie Aaron Erin Ian Kyle Lucy
## 7 7 9 9 8 8 10 11 11
The sum of all values in a vector can be calculated using the
sum() function:
sum_all_score <- sum(total_score)
average_score <- sum_all_score/9
average_score
## [1] 8.888889
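The average above divides by a hard-coded 9. As a sketch of a more general approach, `length()` can supply the number of elements, and base R's `mean()` computes the same quantity directly. The vector is retyped here so that the example runs on its own.

```r
total_score <- c(7, 7, 9, 9, 8, 8, 10, 11, 11)

# length() avoids hard-coding the number of students
average_manual <- sum(total_score) / length(total_score)

# mean() computes the same quantity in one call
average_builtin <- mean(total_score)

average_builtin   # 8.888889
```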
There are different ways to select certain elements in the vector:
total_score[1]
## Steve
## 7
total_score[c(2,5,7)]
## Carol Aaron Ian
## 7 8 10
total_score[c("Ian", "Kyle")]
## Ian Kyle
## 10 11
total_score[3:9]
## Sam Maddie Aaron Erin Ian Kyle Lucy
## 9 9 8 8 10 11 11
You can also filter for elements with some desired characteristics:
selection <- score_winter > 4
score_winter[selection]
## Aaron Erin Ian Kyle Lucy
## 5 5 5 5 6
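Logical selection can combine several conditions. The sketch below, which rebuilds score_winter so it is self-contained, uses `&` to keep scores within a range and `which()` to find the position of the highest score. Neither call appears in the original text; both are base R.

```r
score_winter <- c(Steve = 3, Carol = 4, Sam = 4, Maddie = 4, Aaron = 5,
                  Erin = 5, Ian = 5, Kyle = 5, Lucy = 6)

# & keeps elements that satisfy both conditions
mid_scores <- score_winter[score_winter >= 4 & score_winter < 6]

# which() returns the positions (with names) of the TRUE elements
top_positions <- which(score_winter == max(score_winter))

names(mid_scores)   # "Carol" "Sam" "Maddie" "Aaron" "Erin" "Ian" "Kyle"
top_positions       # Lucy 9
```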
A matrix also stores one type of data, but in two dimensions (n x m). We use
the function matrix() to create a matrix object in R.
#try this first:
?matrix
#create a matrix containing a value of 1 to 16 for 4x4 matrix, 2x8 matrix, and 8x2 matrix, respectively
matrix(1:16, nrow = 4, ncol = 4)
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
matrix(1:16, nrow = 2, ncol = 8)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 1 3 5 7 9 11 13 15
## [2,] 2 4 6 8 10 12 14 16
matrix(1:16, nrow = 8, ncol = 2)
## [,1] [,2]
## [1,] 1 9
## [2,] 2 10
## [3,] 3 11
## [4,] 4 12
## [5,] 5 13
## [6,] 6 14
## [7,] 7 15
## [8,] 8 16
Note: ? is a useful command to call R Documentation
explaining the specific function you are curious about. R Documentation
for each function generally contains Description, Usage, Arguments,
Details, and Examples.
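The matrices above are all filled column by column, which is the default. As a short illustration of the `byrow` argument described in the matrix() documentation, the sketch below builds the same values both ways and checks the result with `dim()`.

```r
# byrow = TRUE fills across each row before moving to the next one
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
by_col <- matrix(1:6, nrow = 2, ncol = 3)   # default: fill column by column

by_row[1, ]   # 1 2 3
by_col[1, ]   # 1 3 5

# dim() reports the number of rows and columns
dim(by_row)   # 2 3
```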
Going back to the vectors created earlier, now we will construct a matrix from them.
score_winter
## Steve Carol Sam Maddie Aaron Erin Ian Kyle Lucy
## 3 4 4 4 5 5 5 5 6
score_spring
## Steve Carol Sam Maddie Aaron Erin Ian Kyle Lucy
## 4 3 5 5 3 3 5 6 5
student_score <- c(score_winter, score_spring)
student_score
## Steve Carol Sam Maddie Aaron Erin Ian Kyle Lucy Steve Carol
## 3 4 4 4 5 5 5 5 6 4 3
## Sam Maddie Aaron Erin Ian Kyle Lucy
## 5 5 3 3 5 6 5
student_score_matrix <- matrix(student_score, byrow = F, nrow = 9) #byrow indicates whether you are filling the matrix by row first (T) or by column first (F), and nrow indicates the number of rows (here 9) for the matrix.
student_score
## Steve Carol Sam Maddie Aaron Erin Ian Kyle Lucy Steve Carol
## 3 4 4 4 5 5 5 5 6 4 3
## Sam Maddie Aaron Erin Ian Kyle Lucy
## 5 5 3 3 5 6 5
student_score_matrix
## [,1] [,2]
## [1,] 3 4
## [2,] 4 3
## [3,] 4 5
## [4,] 4 5
## [5,] 5 3
## [6,] 5 3
## [7,] 5 5
## [8,] 5 6
## [9,] 6 5
Notice that in the matrix, the names you have assigned to the vectors
are gone. We will rename the rows and columns of this matrix using
rownames() and colnames().
rownames(student_score_matrix) <- student_names
colnames(student_score_matrix) <- c("Winter", "Spring")
student_score_matrix
## Winter Spring
## Steve 3 4
## Carol 4 3
## Sam 4 5
## Maddie 4 5
## Aaron 5 3
## Erin 5 3
## Ian 5 5
## Kyle 5 6
## Lucy 6 5
As with vectors, we can also perform basic arithmetic operations on a matrix:
total_student_score <- rowSums(student_score_matrix)
total_student_score
## Steve Carol Sam Maddie Aaron Erin Ian Kyle Lucy
## 7 7 9 9 8 8 10 11 11
total_score_bySemester <- colSums(student_score_matrix)
total_score_bySemester
## Winter Spring
## 41 39
Now that we have computed each student's total score, the next step is
adding it as a new column to the original matrix, using the cbind()
function. To add a new row, we use the rbind() function
instead.
all_score_matrix <- cbind(student_score_matrix, total_student_score)
all_score_matrix
## Winter Spring total_student_score
## Steve 3 4 7
## Carol 4 3 7
## Sam 4 5 9
## Maddie 4 5 9
## Aaron 5 3 8
## Erin 5 3 8
## Ian 5 5 10
## Kyle 5 6 11
## Lucy 6 5 11
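The text mentions rbind() without showing it. The following sketch uses a three-student subset of the scores, retyped here so it runs on its own, and appends the semester totals as a new row. The row name "Total" is chosen only for illustration.

```r
student_score_matrix <- matrix(c(3, 4, 4, 4, 3, 5), nrow = 3,
                               dimnames = list(c("Steve", "Carol", "Sam"),
                                               c("Winter", "Spring")))

# colSums() gives one total per semester; rbind() appends it as a new row
semester_total <- colSums(student_score_matrix)
with_total_row <- rbind(student_score_matrix, Total = semester_total)

with_total_row["Total", ]   # Winter 11, Spring 12
```

Note that a totals row mixes summary values with observations, so it is usually better kept in a separate object when the matrix will be analyzed further.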
Similar to vectors, we can also select elements in matrices. As a matrix has two dimensions, we have to specify both dimensions. If you want to select all elements in a specific row or column, you can leave the number in column or row blank. Alternatively, specific subsets of a matrix can also be selected.
all_score_matrix[4,3]
## [1] 9
#select all elements in a specific row or column
all_score_matrix[7,]
## Winter Spring total_student_score
## 5 5 10
all_score_matrix[,1]
## Steve Carol Sam Maddie Aaron Erin Ian Kyle Lucy
## 3 4 4 4 5 5 5 5 6
#subset of a matrix
all_score_matrix[3:5,]
## Winter Spring total_student_score
## Sam 4 5 9
## Maddie 4 5 9
## Aaron 5 3 8
all_score_matrix[,1:2]
## Winter Spring
## Steve 3 4
## Carol 4 3
## Sam 4 5
## Maddie 4 5
## Aaron 5 3
## Erin 5 3
## Ian 5 5
## Kyle 5 6
## Lucy 6 5
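Because the matrix has row and column names, elements can also be selected by name instead of position. The sketch below uses a three-student subset, with the third column named "Total" for brevity. It also shows `drop = FALSE`, a base R argument that keeps a single column as a matrix.

```r
all_score_matrix <- matrix(c(3, 4, 4, 4, 3, 5, 7, 7, 9), nrow = 3,
                           dimnames = list(c("Steve", "Carol", "Sam"),
                                           c("Winter", "Spring", "Total")))

# Names can replace positions in either dimension
all_score_matrix["Sam", "Total"]                # 9
all_score_matrix[c("Steve", "Sam"), "Spring"]   # Steve 4, Sam 5

# A single column is simplified to a vector unless drop = FALSE is given
winter_column <- all_score_matrix[, "Winter", drop = FALSE]
dim(winter_column)                              # 3 1
```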
A factor is a data type for categorical variables, such as sex
(M/F) or grades (A/B/C/D). To specify that a vector holds a
categorical variable, we use the factor()
function.
#back to the original example: we assign values to the sex variable for all observations:
sex_vector <- c("M", "F", "M", "F", "M", "F", "M", "M", "F")
names(sex_vector) <- student_names
sex_vector
## Steve Carol Sam Maddie Aaron Erin Ian Kyle Lucy
## "M" "F" "M" "F" "M" "F" "M" "M" "F"
#However, this is still a vector with characters, not factor. To turn this into a factor, we use factor():
sex_vector_factor <- factor(sex_vector)
sex_vector_factor
## Steve Carol Sam Maddie Aaron Erin Ian Kyle Lucy
## M F M F M F M M F
## Levels: F M
Alternatively, if you want to code an ordinal categorical variable such as
grades, we add further arguments to factor(): one marking the factor as
ordered, and levels giving the order of the categories.
winter_grade <- c("D", "C", "C", "C", "B", "B", "B", "B", "A")
spring_grade <- c("C", "D", "B", "B", "C", "C", "B", "A", "B")
winter_grade_factor <- factor(winter_grade, ordered = T, levels = c("D", "C", "B", "A"))
spring_grade_factor <- factor(spring_grade, ordered = T, levels = c("D", "C", "B", "A"))
winter_grade_factor
## [1] D C C C B B B B A
## Levels: D < C < B < A
spring_grade_factor
## [1] C D B B C C B A B
## Levels: D < C < B < A
Take a quick overview of the factor variables with
summary():
summary(winter_grade_factor)
## D C B A
## 1 3 4 1
summary(spring_grade_factor)
## D C B A
## 1 3 4 1
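To make the effect of an ordered factor more concrete, the sketch below rebuilds the winter grades and shows that comparisons follow the declared level order rather than alphabetical order. `levels()` and `as.integer()` are base R functions not used in the original text.

```r
winter_grade <- c("D", "C", "C", "C", "B", "B", "B", "B", "A")
winter_grade_factor <- factor(winter_grade, ordered = TRUE,
                              levels = c("D", "C", "B", "A"))

# levels() lists the categories from lowest to highest
levels(winter_grade_factor)          # "D" "C" "B" "A"

# Comparisons follow the level order, not alphabetical order
above_c <- winter_grade_factor > "C"
sum(above_c)                         # 5 students hold a B or an A

# as.integer() exposes the underlying level codes
as.integer(winter_grade_factor)      # 1 2 2 2 3 3 3 3 4
```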
As we have seen so far, each observation comes with several data
types. It is more convenient to combine them all into one data set. This
can be done with a data frame, a data structure that typically holds
variables of different types as columns and observations as rows. We
create one using data.frame().
student_df <- data.frame(score_winter, score_spring, winter_grade_factor, spring_grade_factor, sex_vector_factor)
student_df
## score_winter score_spring winter_grade_factor spring_grade_factor
## Steve 3 4 D C
## Carol 4 3 C D
## Sam 4 5 C B
## Maddie 4 5 C B
## Aaron 5 3 B C
## Erin 5 3 B C
## Ian 5 5 B B
## Kyle 5 6 B A
## Lucy 6 5 A B
## sex_vector_factor
## Steve M
## Carol F
## Sam M
## Maddie F
## Aaron M
## Erin F
## Ian M
## Kyle M
## Lucy F
Two useful functions for a first inspection of a data frame are
head(), which shows the first few observations, and
str(), which shows the structure of the data frame.
str(student_df)
## 'data.frame': 9 obs. of 5 variables:
## $ score_winter : num 3 4 4 4 5 5 5 5 6
## $ score_spring : num 4 3 5 5 3 3 5 6 5
## $ winter_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 1 2 2 2 3 3 3 3 4
## $ spring_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 2 1 3 3 2 2 3 4 3
## $ sex_vector_factor : Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 2 1
head(student_df)
## score_winter score_spring winter_grade_factor spring_grade_factor
## Steve 3 4 D C
## Carol 4 3 C D
## Sam 4 5 C B
## Maddie 4 5 C B
## Aaron 5 3 B C
## Erin 5 3 B C
## sex_vector_factor
## Steve M
## Carol F
## Sam M
## Maddie F
## Aaron M
## Erin F
Similar to a matrix, you can also select elements in the data frame.
student_df[3,4]
## [1] B
## Levels: D < C < B < A
student_df[3,] #select one observation
## score_winter score_spring winter_grade_factor spring_grade_factor
## Sam 4 5 C B
## sex_vector_factor
## Sam M
student_df[,3] #select one variable
## [1] D C C C B B B B A
## Levels: D < C < B < A
#Alternatively, if you know the variable name:
student_df$winter_grade_factor
## [1] D C C C B B B B A
## Levels: D < C < B < A
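Positions are not the only way to index a data frame. As a sketch on a four-student toy data frame (a subset retyped here so the code runs on its own), columns can be selected by name, rows by a logical condition, and new columns added with `$`.

```r
student_df <- data.frame(
  score_winter = c(3, 4, 4, 5),
  score_spring = c(4, 3, 5, 6),
  row.names = c("Steve", "Carol", "Sam", "Kyle")
)

# Select several columns by name
student_df[, c("score_winter", "score_spring")]

# Keep only the rows that satisfy a condition (note the trailing comma)
improved <- student_df[student_df$score_spring > student_df$score_winter, ]
rownames(improved)   # "Steve" "Sam" "Kyle"

# Add a derived column with $
student_df$total <- student_df$score_winter + student_df$score_spring
student_df$total     # 7 7 9 11
```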
Alternatively, you can also use a function subset() to
select elements in the data frame.
?subset
#selecting students with grades greater than C in the winter semester
subset(student_df, winter_grade_factor > "C")
## score_winter score_spring winter_grade_factor spring_grade_factor
## Aaron 5 3 B C
## Erin 5 3 B C
## Ian 5 5 B B
## Kyle 5 6 B A
## Lucy 6 5 A B
## sex_vector_factor
## Aaron M
## Erin F
## Ian M
## Kyle M
## Lucy F
#selecting students with grades equal to A in the spring semester
subset(student_df, spring_grade_factor == "A")
## score_winter score_spring winter_grade_factor spring_grade_factor
## Kyle 5 6 B A
## sex_vector_factor
## Kyle M
Reorder the data frame by the values of the
spring_grade_factor variable, using
order().
order(student_df$spring_grade_factor)
## [1] 2 1 5 6 3 4 7 9 8
student_df[order(student_df$spring_grade_factor),]
## score_winter score_spring winter_grade_factor spring_grade_factor
## Carol 4 3 C D
## Steve 3 4 D C
## Aaron 5 3 B C
## Erin 5 3 B C
## Sam 4 5 C B
## Maddie 4 5 C B
## Ian 5 5 B B
## Lucy 6 5 A B
## Kyle 5 6 B A
## sex_vector_factor
## Carol F
## Steve M
## Aaron M
## Erin F
## Sam M
## Maddie F
## Ian M
## Lucy F
## Kyle M
#reverse the order:
student_df[order(student_df$spring_grade_factor, decreasing = T), ]
## score_winter score_spring winter_grade_factor spring_grade_factor
## Kyle 5 6 B A
## Sam 4 5 C B
## Maddie 4 5 C B
## Ian 5 5 B B
## Lucy 6 5 A B
## Steve 3 4 D C
## Aaron 5 3 B C
## Erin 5 3 B C
## Carol 4 3 C D
## sex_vector_factor
## Kyle M
## Sam M
## Maddie F
## Ian M
## Lucy F
## Steve M
## Aaron M
## Erin F
## Carol F
#only interested in the score in spring semester, ordered by the grade:
student_df$score_spring[order(student_df$spring_grade_factor, decreasing = T)]
## [1] 6 5 5 5 5 4 3 3 3
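order() also accepts more than one sorting key, which is useful for breaking ties. The sketch below uses made-up scores for four students. It sorts by spring score from high to low and, within ties, by winter score from high to low. Negating a numeric column is one common way to reverse its order.

```r
student_df <- data.frame(
  name = c("Steve", "Carol", "Sam", "Kyle"),
  score_winter = c(3, 4, 4, 5),
  score_spring = c(4, 3, 5, 3)
)

# Sort by spring score (high to low), breaking ties with winter score (high to low)
sort_index <- order(-student_df$score_spring, -student_df$score_winter)
sorted_df <- student_df[sort_index, ]

as.character(sorted_df$name)   # "Sam" "Steve" "Kyle" "Carol"
```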
A list can store many different types of variables and data structures under one name.
student_info <- list(winter_grade_factor, spring_grade_factor, student_df)
student_info
## [[1]]
## [1] D C C C B B B B A
## Levels: D < C < B < A
##
## [[2]]
## [1] C D B B C C B A B
## Levels: D < C < B < A
##
## [[3]]
## score_winter score_spring winter_grade_factor spring_grade_factor
## Steve 3 4 D C
## Carol 4 3 C D
## Sam 4 5 C B
## Maddie 4 5 C B
## Aaron 5 3 B C
## Erin 5 3 B C
## Ian 5 5 B B
## Kyle 5 6 B A
## Lucy 6 5 A B
## sex_vector_factor
## Steve M
## Carol F
## Sam M
## Maddie F
## Aaron M
## Erin F
## Ian M
## Kyle M
## Lucy F
#name each item stored in the list
names(student_info) <- c("winter_grade", "spring_grade", "student_df")
student_info
## $winter_grade
## [1] D C C C B B B B A
## Levels: D < C < B < A
##
## $spring_grade
## [1] C D B B C C B A B
## Levels: D < C < B < A
##
## $student_df
## score_winter score_spring winter_grade_factor spring_grade_factor
## Steve 3 4 D C
## Carol 4 3 C D
## Sam 4 5 C B
## Maddie 4 5 C B
## Aaron 5 3 B C
## Erin 5 3 B C
## Ian 5 5 B B
## Kyle 5 6 B A
## Lucy 6 5 A B
## sex_vector_factor
## Steve M
## Carol F
## Sam M
## Maddie F
## Aaron M
## Erin F
## Ian M
## Kyle M
## Lucy F
#call an item in the list:
student_info$student_df
## score_winter score_spring winter_grade_factor spring_grade_factor
## Steve 3 4 D C
## Carol 4 3 C D
## Sam 4 5 C B
## Maddie 4 5 C B
## Aaron 5 3 B C
## Erin 5 3 B C
## Ian 5 5 B B
## Kyle 5 6 B A
## Lucy 6 5 A B
## sex_vector_factor
## Steve M
## Carol F
## Sam M
## Maddie F
## Aaron M
## Erin F
## Ian M
## Kyle M
## Lucy F
student_info[[3]]
## score_winter score_spring winter_grade_factor spring_grade_factor
## Steve 3 4 D C
## Carol 4 3 C D
## Sam 4 5 C B
## Maddie 4 5 C B
## Aaron 5 3 B C
## Erin 5 3 B C
## Ian 5 5 B B
## Kyle 5 6 B A
## Lucy 6 5 A B
## sex_vector_factor
## Steve M
## Carol F
## Sam M
## Maddie F
## Aaron M
## Erin F
## Ian M
## Kyle M
## Lucy F
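One detail worth illustrating is the difference between single and double brackets on a list. The sketch below uses a small stand-in list whose item names are chosen for illustration. `[[ ]]` and `$` return the stored object, while `[ ]` returns a shorter list.

```r
student_info <- list(
  winter_grade = c("D", "C", "B"),
  scores = data.frame(score_winter = c(3, 4, 5))
)

# [[ ]] and $ return the stored object itself
class(student_info[["scores"]])   # "data.frame"

# [ ] returns a shorter list that still wraps the object
class(student_info["scores"])     # "list"

# New items can be added by name
student_info$term <- "Winter"
names(student_info)               # "winter_grade" "scores" "term"
```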
An R package is a collection of functions, developed and shared by R developers to meet needs that base R (the default set of packages that comes with R) does not cover. In order to use an R package, you first need to install it and then load it into your workspace.
CRAN stands for the Comprehensive R Archive Network, the official
repository from which R packages can be downloaded for free. We
can download R packages from CRAN by calling the
install.packages() function. Here, we are installing the
tidyr package, which we will be using in the next part of
this workshop. After installing the package, you must also load it
into your workspace by calling the library() function.
#install.packages("tidyr")
library(tidyr)
## Warning: package 'tidyr' was built under R version 4.1.1
#installing multiple packages at once:
#install.packages(c("ggplot2", "dplyr"))
RStudio also provides an easy interface to install and load R
packages: click the Install button and type the
name of the package you want to install. To load the package, go
to the Packages tab and check the box in front of the
package you want to load.
Alternatively, you can also download the package from GitHub using
the install_github() function from the
devtools package:
#devtools::install_github("tidyverse/ggplot2")
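Since installing a package only needs to happen once, a script can check whether it is already available before installing. The following is a minimal sketch of that pattern using base R's `requireNamespace()` and `installed.packages()`. The install call itself is left commented out, as in the examples above.

```r
# Install a package only when it is not already available.
# "tidyr" is used here because the workshop relies on it later.
package_name <- "tidyr"

if (!requireNamespace(package_name, quietly = TRUE)) {
  # Left commented out so the sketch has no side effects when pasted:
  # install.packages(package_name)
  message(package_name, " is not installed yet")
} else {
  message(package_name, " is already installed")
}

# installed.packages() lists everything in the current library paths
is_installed <- package_name %in% rownames(installed.packages())
```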
Last but not least, I will cover how to read different files into R. Specifically, I will go over different methods to read both flat files, such as .csv, and Excel sheets into R.
First, we will read a .csv file into R using read.csv()
from base R. Before running the code, make sure to download the file here
and to place it in your R working directory.
student_df <- read.csv("student_data.csv")
student_df
## name score_winter score_spring winter_grade_factor spring_grade_factor
## 1 Steve 3 4 D C
## 2 Carol 4 3 C D
## 3 Sam 4 5 C B
## 4 Maddie 4 5 C B
## 5 Aaron 5 3 B C
## 6 Erin 5 3 B C
## 7 Ian 5 5 B B
## 8 Kyle 5 6 B A
## 9 Lucy 6 5 A B
## sex_vector_factor
## 1 M
## 2 F
## 3 M
## 4 F
## 5 M
## 6 F
## 7 M
## 8 M
## 9 F
#examine the structure of the data frame: do you notice anything different from the previous data frame we have constructed?
str(student_df)
## 'data.frame': 9 obs. of 6 variables:
## $ name : chr "Steve" "Carol" "Sam" "Maddie" ...
## $ score_winter : int 3 4 4 4 5 5 5 5 6
## $ score_spring : int 4 3 5 5 3 3 5 6 5
## $ winter_grade_factor: chr "D" "C" "C" "C" ...
## $ spring_grade_factor: chr "C" "D" "B" "B" ...
## $ sex_vector_factor : chr "M" "F" "M" "F" ...
Alternatively, you can use the read.table()
function to read a .csv file, provided you specify that the
separator is a comma. The same function therefore also handles files
with other types of separators.
#specify also that the first row is the name header by including the option header = T:
read.table("student_data.csv", header = T, sep = ",") #a .csv file is basically separated by ","
## name score_winter score_spring winter_grade_factor spring_grade_factor
## 1 Steve 3 4 D C
## 2 Carol 4 3 C D
## 3 Sam 4 5 C B
## 4 Maddie 4 5 C B
## 5 Aaron 5 3 B C
## 6 Erin 5 3 B C
## 7 Ian 5 5 B B
## 8 Kyle 5 6 B A
## 9 Lucy 6 5 A B
## sex_vector_factor
## 1 M
## 2 F
## 3 M
## 4 F
## 5 M
## 6 F
## 7 M
## 8 M
## 9 F
#Again, do you see the same problem as read.csv() here?
After inspecting the imported data, we need to change the grade and
sex variables into factors, using the factor()
function:
student_df$winter_grade_factor <- factor(student_df$winter_grade_factor, ordered = T, levels = c("D", "C", "B", "A"))
str(student_df)
## 'data.frame': 9 obs. of 6 variables:
## $ name : chr "Steve" "Carol" "Sam" "Maddie" ...
## $ score_winter : int 3 4 4 4 5 5 5 5 6
## $ score_spring : int 4 3 5 5 3 3 5 6 5
## $ winter_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 1 2 2 2 3 3 3 3 4
## $ spring_grade_factor: chr "C" "D" "B" "B" ...
## $ sex_vector_factor : chr "M" "F" "M" "F" ...
student_df
## name score_winter score_spring winter_grade_factor spring_grade_factor
## 1 Steve 3 4 D C
## 2 Carol 4 3 C D
## 3 Sam 4 5 C B
## 4 Maddie 4 5 C B
## 5 Aaron 5 3 B C
## 6 Erin 5 3 B C
## 7 Ian 5 5 B B
## 8 Kyle 5 6 B A
## 9 Lucy 6 5 A B
## sex_vector_factor
## 1 M
## 2 F
## 3 M
## 4 F
## 5 M
## 6 F
## 7 M
## 8 M
## 9 F
student_df$spring_grade_factor <- factor(student_df$spring_grade_factor, ordered = T, levels = c("D", "C", "B", "A"))
str(student_df)
## 'data.frame': 9 obs. of 6 variables:
## $ name : chr "Steve" "Carol" "Sam" "Maddie" ...
## $ score_winter : int 3 4 4 4 5 5 5 5 6
## $ score_spring : int 4 3 5 5 3 3 5 6 5
## $ winter_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 1 2 2 2 3 3 3 3 4
## $ spring_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 2 1 3 3 2 2 3 4 3
## $ sex_vector_factor : chr "M" "F" "M" "F" ...
student_df$sex_vector_factor <- factor(student_df$sex_vector_factor, ordered = F)
str(student_df)
## 'data.frame': 9 obs. of 6 variables:
## $ name : chr "Steve" "Carol" "Sam" "Maddie" ...
## $ score_winter : int 3 4 4 4 5 5 5 5 6
## $ score_spring : int 4 3 5 5 3 3 5 6 5
## $ winter_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 1 2 2 2 3 3 3 3 4
## $ spring_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 2 1 3 3 2 2 3 4 3
## $ sex_vector_factor : Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 2 1
#Alternatively, you can also pre-specify the data type in the read.table function:
student_df <- read.table("student_data.csv", header = T, sep = ",",
colClasses = c("character", "numeric", "numeric", "factor", "factor", "factor"))
str(student_df) #notice the grade factors: they are not ordinal. You will need to manually change the type of the variable:
## 'data.frame': 9 obs. of 6 variables:
## $ name : chr "Steve" "Carol" "Sam" "Maddie" ...
## $ score_winter : num 3 4 4 4 5 5 5 5 6
## $ score_spring : num 4 3 5 5 3 3 5 6 5
## $ winter_grade_factor: Factor w/ 4 levels "A","B","C","D": 4 3 3 3 2 2 2 2 1
## $ spring_grade_factor: Factor w/ 4 levels "A","B","C","D": 3 4 2 2 3 3 2 1 2
## $ sex_vector_factor : Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 2 1
student_df$winter_grade_factor <- factor(student_df$winter_grade_factor, ordered = T, levels = c("D", "C", "B", "A"))
student_df$spring_grade_factor <- factor(student_df$spring_grade_factor, ordered = T, levels = c("D", "C", "B", "A"))
str(student_df)
## 'data.frame': 9 obs. of 6 variables:
## $ name : chr "Steve" "Carol" "Sam" "Maddie" ...
## $ score_winter : num 3 4 4 4 5 5 5 5 6
## $ score_spring : num 4 3 5 5 3 3 5 6 5
## $ winter_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 1 2 2 2 3 3 3 3 4
## $ spring_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 2 1 3 3 2 2 3 4 3
## $ sex_vector_factor : Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 2 1
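For readers who do not have student_data.csv at hand, the same read-then-convert workflow can be tried on a small file written to a temporary location. The sketch below uses made-up column names and three rows. The remark about character columns assumes R 4.0 or later, where read.csv() no longer converts strings to factors by default.

```r
# Write a tiny CSV to a temporary file so the example does not need student_data.csv
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,score_winter,winter_grade",
             "Steve,3,D",
             "Carol,4,C",
             "Lucy,6,A"), csv_path)

toy_df <- read.csv(csv_path)
sapply(toy_df, class)   # text columns arrive as "character" in R >= 4.0

# Convert the grade column to an ordered factor after reading
toy_df$winter_grade <- factor(toy_df$winter_grade, ordered = TRUE,
                              levels = c("D", "C", "B", "A"))

as.character(toy_df$name[toy_df$winter_grade > "C"])   # "Lucy"
```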
readr package
The readr package is generally faster and easier to use
than read.table() from base R. The difference is that
the output is a tibble instead of a data frame. A tibble is basically a
data frame with some additional features, which we will learn about
later in this workshop.
library(readr)
student_df_readr <- read_csv("student_data.csv")
## Rows: 9 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): name, winter_grade_factor, spring_grade_factor, sex_vector_factor
## dbl (2): score_winter, score_spring
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(student_df_readr)
## spec_tbl_df [9 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ name : chr [1:9] "Steve" "Carol" "Sam" "Maddie" ...
## $ score_winter : num [1:9] 3 4 4 4 5 5 5 5 6
## $ score_spring : num [1:9] 4 3 5 5 3 3 5 6 5
## $ winter_grade_factor: chr [1:9] "D" "C" "C" "C" ...
## $ spring_grade_factor: chr [1:9] "C" "D" "B" "B" ...
## $ sex_vector_factor : chr [1:9] "M" "F" "M" "F" ...
## - attr(*, "spec")=
## .. cols(
## .. name = col_character(),
## .. score_winter = col_double(),
## .. score_spring = col_double(),
## .. winter_grade_factor = col_character(),
## .. spring_grade_factor = col_character(),
## .. sex_vector_factor = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
student_df_readr
## # A tibble: 9 × 6
## name score_winter score_spring winter_grade_factor spring_grade_factor
## <chr> <dbl> <dbl> <chr> <chr>
## 1 Steve 3 4 D C
## 2 Carol 4 3 C D
## 3 Sam 4 5 C B
## 4 Maddie 4 5 C B
## 5 Aaron 5 3 B C
## 6 Erin 5 3 B C
## 7 Ian 5 5 B B
## 8 Kyle 5 6 B A
## 9 Lucy 6 5 A B
## # … with 1 more variable: sex_vector_factor <chr>
#do you notice something we need to fix here?
Again, the imported data has different data types than we
want. The col_types option lets us specify the data type of each
variable, and is equivalent to
colClasses in read.table() in base R:
- c for character
- n for numeric
- d for double
- i for integer
- l for logical
- f for factor
student_df_readr <- read_csv("student_data.csv", col_types = "cnnfff")
str(student_df_readr)
## spec_tbl_df [9 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ name : chr [1:9] "Steve" "Carol" "Sam" "Maddie" ...
## $ score_winter : num [1:9] 3 4 4 4 5 5 5 5 6
## $ score_spring : num [1:9] 4 3 5 5 3 3 5 6 5
## $ winter_grade_factor: Factor w/ 4 levels "D","C","B","A": 1 2 2 2 3 3 3 3 4
## $ spring_grade_factor: Factor w/ 4 levels "C","D","B","A": 1 2 3 3 1 1 3 4 3
## $ sex_vector_factor : Factor w/ 2 levels "M","F": 1 2 1 2 1 2 1 1 2
## - attr(*, "spec")=
## .. cols(
## .. name = col_character(),
## .. score_winter = col_number(),
## .. score_spring = col_number(),
## .. winter_grade_factor = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. spring_grade_factor = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
## .. sex_vector_factor = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE)
## .. )
## - attr(*, "problems")=<externalptr>
#do you notice something missing here?
Notice that the grade factors are not ordered. If we would like to
have an ordinal categorical variable, we will need to further specify
the col_types option:
of <- col_factor(levels = c("D", "C", "B", "A"), ordered = T)
cha <- col_character()
int <- col_integer()
fac <- col_factor(levels = c("F", "M"))
student_df_readr <- read_csv("student_data.csv", col_types = list(cha, int, int, of, of, fac))
str(student_df_readr)
## spec_tbl_df [9 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ name : chr [1:9] "Steve" "Carol" "Sam" "Maddie" ...
## $ score_winter : int [1:9] 3 4 4 4 5 5 5 5 6
## $ score_spring : int [1:9] 4 3 5 5 3 3 5 6 5
## $ winter_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 1 2 2 2 3 3 3 3 4
## $ spring_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 2 1 3 3 2 2 3 4 3
## $ sex_vector_factor : Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 2 1
## - attr(*, "spec")=
## .. cols(
## .. name = col_character(),
## .. score_winter = col_integer(),
## .. score_spring = col_integer(),
## .. winter_grade_factor = col_factor(levels = c("D", "C", "B", "A"), ordered = TRUE, include_na = FALSE),
## .. spring_grade_factor = col_factor(levels = c("D", "C", "B", "A"), ordered = TRUE, include_na = FALSE),
## .. sex_vector_factor = col_factor(levels = c("F", "M"), ordered = FALSE, include_na = FALSE)
## .. )
## - attr(*, "problems")=<externalptr>
There is also an equivalent function to read.table() in
the readr package: read_delim() which you can
further customize in the case that the file has a different
separator:
read_delim("student_data.csv", delim = ",", col_types = list(cha, int, int, of, of, fac))
## # A tibble: 9 × 6
## name score_winter score_spring winter_grade_factor spring_grade_factor
## <chr> <int> <int> <ord> <ord>
## 1 Steve 3 4 D C
## 2 Carol 4 3 C D
## 3 Sam 4 5 C B
## 4 Maddie 4 5 C B
## 5 Aaron 5 3 B C
## 6 Erin 5 3 B C
## 7 Ian 5 5 B B
## 8 Kyle 5 6 B A
## 9 Lucy 6 5 A B
## # … with 1 more variable: sex_vector_factor <fct>
readxl package
So far, we have only covered flat files. What if we want to
read data from an Excel file? We can do so by using the
readxl package. First, download the Excel file that we will
be working with here.
library(readxl)
excel_sheets("student_data.xlsx") #show all worksheets
## [1] "2021" "2022"
student_df_2021 <- read_excel("student_data.xlsx", sheet = 1)
student_df_2022 <- read_excel("student_data.xlsx", sheet = 2)
str(student_df_2021)
## tibble [9 × 6] (S3: tbl_df/tbl/data.frame)
## $ name : chr [1:9] "Steve" "Carol" "Sam" "Maddie" ...
## $ score_winter : num [1:9] 3 4 4 4 5 5 5 5 6
## $ score_spring : num [1:9] 4 3 5 5 3 3 5 6 5
## $ winter_grade_factor: chr [1:9] "D" "C" "C" "C" ...
## $ spring_grade_factor: chr [1:9] "C" "D" "B" "B" ...
## $ sex_vector_factor : chr [1:9] "M" "F" "M" "F" ...
str(student_df_2022)
## tibble [9 × 6] (S3: tbl_df/tbl/data.frame)
## $ name : chr [1:9] "Steve" "Carol" "Sam" "Maddie" ...
## $ score_winter : num [1:9] 4 4 3 3 5 6 4 5 5
## $ score_spring : num [1:9] 4 4 4 4 5 5 6 4 6
## $ winter_grade_factor: chr [1:9] "C" "C" "D" "D" ...
## $ spring_grade_factor: chr [1:9] "C" "C" "C" "C" ...
## $ sex_vector_factor : chr [1:9] "M" "F" "M" "F" ...
Note that, unlike with the readr package, we cannot declare
variables as factors while importing; we need to convert them after
importing the data from Excel:
student_df_2021$winter_grade_factor <- factor(student_df_2021$winter_grade_factor, levels = c("D", "C", "B", "A"), ordered = T)
student_df_2021$spring_grade_factor <- factor(student_df_2021$spring_grade_factor, levels = c("D", "C", "B", "A"), ordered = T)
student_df_2021$sex_vector_factor <- factor(student_df_2021$sex_vector_factor, levels = c("F", "M"), ordered = F)
str(student_df_2021)
## tibble [9 × 6] (S3: tbl_df/tbl/data.frame)
## $ name : chr [1:9] "Steve" "Carol" "Sam" "Maddie" ...
## $ score_winter : num [1:9] 3 4 4 4 5 5 5 5 6
## $ score_spring : num [1:9] 4 3 5 5 3 3 5 6 5
## $ winter_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 1 2 2 2 3 3 3 3 4
## $ spring_grade_factor: Ord.factor w/ 4 levels "D"<"C"<"B"<"A": 2 1 3 3 2 2 3 4 3
## $ sex_vector_factor : Factor w/ 2 levels "F","M": 2 1 2 1 2 1 2 2 1
RStudio has a point-and-click option to import a data set, which
simply runs the base R, readr, or
readxl functions that we have covered earlier. To use it,
go to the File tab and select the
Import Data option.
When writing code, you will sometimes want to execute a portion of it only if certain conditions are true, and skip it otherwise. You may also want to run the same lines of code a fixed number of times. Both are very useful when you are manipulating or reshaping your data to get it ready for statistical analysis. Hence, in this section, I will cover if-else statements and for-loops, which are the backbone of coding and which you should know before delving into more elaborate R exercises.
If and else statements change the behavior of your
script. The code within an if statement is executed only when the
condition inside the parentheses returns TRUE.
The syntax for the if statement is as follows.
if(student_df$winter_grade_factor[1] > "C"){ #the condition inside the parentheses tests whether the student in row 1 has a grade greater than C
print(student_df$name[1]) #if so, his or her name will be printed out here.
}
#notice that print() was not executed in the previous lines of code.
#Let's try again:
if(student_df$winter_grade_factor[6] > "C"){ #similarly, this has a similar test for the student in row 6
print(student_df$name[6])
}
## [1] "Erin"
#Now, because Erin (the student in row 6) has a grade greater than "C", the name is printed.
Alternatively, there might be times when you want one of two different sets of code to run, depending on whether a condition holds. You could write two separate if statements or, more effectively, check the condition once and tell R which set to run depending on the result. This is called an if-else statement:
if(student_df$winter_grade_factor[1] > "C") {
print("Pass")
} else {
print("Fail")
}
## [1] "Fail"
#Let's try again with Erin
if(student_df$winter_grade_factor[6] > "C") {
print("Pass")
} else {
print("Fail")
}
## [1] "Pass"
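When there are more than two outcomes, if-else statements can be chained with `else if`. The sketch below assigns a letter grade to a single score. The cut-offs are an assumption inferred from the example data (3 is a D, 4 a C, 5 a B, 6 an A), not something the workshop states.

```r
score <- 5   # example value; try 3, 4, or 6 as well

# Conditions are checked from top to bottom and the first TRUE branch runs
if (score >= 6) {
  grade <- "A"
} else if (score >= 5) {
  grade <- "B"
} else if (score >= 4) {
  grade <- "C"
} else {
  grade <- "D"
}

grade   # "B"
```

Because only the first matching branch runs, the order of the conditions matters: placing `score >= 4` first would label every passing score a C.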
However, it is not very convenient to check every row manually.
This issue can be solved with a for-loop, which adds another layer of
commands telling R to keep running the code in a loop until it
has gone over all rows in the data set. Another useful function is
nrow(), which returns the number of rows in a data
frame or a matrix.
row_num <- nrow(student_df) #get the number of rows, i.e. the number of observations in the data set
#check the number of rows we have:
row_num
## [1] 9
#put in the for-loop:
for(i in 1:row_num){
if(student_df$winter_grade_factor[i] > "C") {
cat(student_df$name[i], ": ", "Pass","\n")
} else {
cat(student_df$name[i], ": ", "Fail", "\n")
}
}
## Steve : Fail
## Carol : Fail
## Sam : Fail
## Maddie : Fail
## Aaron : Pass
## Erin : Pass
## Ian : Pass
## Kyle : Pass
## Lucy : Pass
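For a simple two-way test like this one, base R's vectorized `ifelse()` offers an alternative to looping over rows. The sketch below runs on a five-student stand-in for the data and is meant as a complement to the for-loop, not a replacement for learning it.

```r
student_names <- c("Steve", "Carol", "Sam", "Aaron", "Lucy")
winter_grade_factor <- factor(c("D", "C", "C", "B", "A"), ordered = TRUE,
                              levels = c("D", "C", "B", "A"))

# ifelse() evaluates the condition for every element at once
result <- ifelse(winter_grade_factor > "C", "Pass", "Fail")
result   # "Fail" "Fail" "Fail" "Pass" "Pass"

# seq_along() is a safer loop range than 1:length(x) because it is empty when x is empty
for (i in seq_along(student_names)) {
  cat(student_names[i], ": ", result[i], "\n", sep = "")
}
```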
You can also define functions of your own design. This is particularly useful when you have to repeat similar lines of code, which is often the case. Functions are usually defined at the beginning of a script, or in a separate R script (named, for example, defined functions) to keep them organized.
#first define a function that you would repeat often
grade_check <- function(name, grade){ #name and grade here are the two arguments you need to put in when using this function
if(grade > "C"){ #check if the grade is greater than C.
cat(name, ": ", "Pass", "\n")
} else {
cat(name, ": ", "Fail", "\n")
}
}
#now use the function we just created. The code is much more concise in this fashion.
for(i in 1:row_num){
grade_check(student_df$name[i], student_df$winter_grade_factor[i])
}
## Steve : Fail
## Carol : Fail
## Sam : Fail
## Maddie : Fail
## Aaron : Pass
## Erin : Pass
## Ian : Pass
## Kyle : Pass
## Lucy : Pass
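grade_check() prints its result, which is convenient for reading but cannot be stored. As a sketch of a variation, the function below returns the result instead. Both the name grade_status and the passing_grade argument (with a default of "C") are hypothetical additions for illustration.

```r
# Returning a value (instead of printing) lets the result be stored and reused.
# passing_grade is a hypothetical argument with a default of "C".
grade_status <- function(grade, passing_grade = "C") {
  ifelse(grade > passing_grade, "Pass", "Fail")
}

grades <- factor(c("D", "C", "B", "A"), ordered = TRUE,
                 levels = c("D", "C", "B", "A"))

default_status <- grade_status(grades)
lenient_status <- grade_status(grades, passing_grade = "D")

default_status   # "Fail" "Fail" "Pass" "Pass"
lenient_status   # "Fail" "Pass" "Pass" "Pass"
```

Because the function works on a whole vector, no loop is needed, and the returned values could be stored as a new column of the data frame.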