Introduction to R and RStudio

What is R?
What is RStudio?
Basics of R
The “data” in R
- Using the existing data sets in R
- Reading external data files in R
Graphics in R
Installing Packages in R

What is R?

R is an open source implementation of S (S-Plus is a commercial implementation).
R is freely available under the GNU general public license.
R is developed by the R Core Team and distributed and archived through the Comprehensive R Archive Network (CRAN).
At the time of writing, the current version of R is 3.3.2.
R can be used for statistical computing and graphics.
R provides a wide variety of statistical and graphical tools: linear and nonlinear modelling, statistical tests, time series analysis, classification, clustering, etc. (for more information, see https://www.r-project.org/about.html).
If you have used either SPSS or SAS to run statistical analyses before, you can check out the ebook titled “R for SAS and SPSS Users” under the “Resources” page on eClass.

What is RStudio?

Despite its extensive capabilities, the default user interface of R is a bit plain for users who want to access all their code, output, and other elements in one place.
RStudio (available from https://www.rstudio.com/) has been created to make R easier to use.
RStudio includes a code editor as well as debugging and visualization tools. It therefore allows us to see all input and output sections of R on the same screen.
In EDPY 607, we will use R through RStudio for psychometric data analysis.

Basics of R

After installing both R and RStudio on our computer, we need to open RStudio, which will start R for us in the background. When RStudio opens, the first screen should look like the screenshot in Figure 1. The “Console” is the part where R executes our commands. We can type in this area and hit “enter” to execute our R commands.

Figure 1. A screenshot of RStudio

For example, we can use R as a calculator to run some mathematical operations. Anything that follows “#” on a line is treated as a comment, so R doesn’t execute that part.

2+2

[1] 4

25*4

[1] 100

666/111

[1] 6

3^2       #Square of 3

[1] 9

sqrt(81)  #Square root of 81

[1] 9

sum(1:10) #To sum numbers from 1 to 10

[1] 55

prod(c(5, 7, 8)) #To find the product of 5, 7, and 8

[1] 280
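A few more operators are handy when we use R as a calculator. The lines below are only a small sketch of integer division, remainders, rounding, and absolute values; the numbers are arbitrary and chosen just for illustration.

```r
17 %/% 5        #Integer division: how many whole times 5 fits into 17
17 %% 5         #Remainder after dividing 17 by 5
round(22/7, 2)  #Round 22/7 to two decimal places
abs(-3)         #Absolute value of -3
```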

As you can see, R executes our commands immediately and shows us the results on the same screen, called the “Console”. Although this is good for quick computations, we may want to keep our R code in a script file so that we can run the same analysis again in the future. To open an empty script file, find the “File” menu in the top-left corner and select “R Script” under “New File”. This will create an empty R script file in which we can type our code and which we can save under a file name of our choice. You can select the part of the script you want to execute and click on the “Run” button. In the following example, we will assign a scalar (such as 5) to an object (such as a):

a <- 5
a

[1] 5

When naming objects, we need to remember two things: 1. The names of objects must start with a letter; they can contain letters, numbers, periods, and underscores, but they cannot contain spaces. 2. If we are not sure, we should double-check whether the name we want to use has already been “assigned” to something else.

For example, “c” is already the name of a function in R. Typing c in the console prints its definition:

function (...)  .Primitive("c")
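As a rough sketch of the naming rules above, the lines below create two validly named objects and use exists to check whether a name is already taken. The object names (exam_score and exam.score) are made up for illustration, and make.names is used only to show how R would repair a name that breaks the rules.

```r
exam_score <- 85        #Underscores are allowed in names
exam.score <- 90        #Periods are allowed too
exists("exam_score")    #TRUE because we have just created this object
exists("c")             #TRUE because c is already a function in R
make.names("2nd exam")  #Shows a valid version of a name that starts with a number and has a space
```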

c stands for “combine” (or “concatenate”) in R and allows you to combine numbers or character strings into vectors:

A <- c(1, 3, 5, 7, 9)
B <- c(2, 4, 6, 8, 10)
C <- c("Edmonton", "Calgary", "Red Deer", "Fort McMurray")
A

[1] 1 3 5 7 9

B

[1]  2  4  6  8 10

C

[1] "Edmonton"      "Calgary"       "Red Deer"      "Fort McMurray"

is(A)

[1] "numeric" "vector"

is(B)

[1] "numeric" "vector"

is(C)

[1] "character"           "vector"              "data.frameRowLabels"
[4] "SuperClassMethod"

In this example, A and B are numeric vectors and C is a character vector. We can also build a vector by repeating the same element (either a number or a character string). The first argument of rep tells R what we want repeated, while the second argument tells R how many times it should be repeated.

Threes <- rep(3, 5)
Threes

[1] 3 3 3 3 3

Aaaas <- rep("a", 5)
Aaaas

[1] "a" "a" "a" "a" "a"
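The rep function can also repeat a whole vector. The sketch below assumes we want to see the difference between the times and each arguments of rep; the object names Pattern1 and Pattern2 are made up for illustration.

```r
Pattern1 <- rep(c(1, 2), times = 3) #Repeat the whole vector three times
Pattern1
Pattern2 <- rep(c(1, 2), each = 3)  #Repeat each element three times before moving on
Pattern2
```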

If we want to build a sequence, there are two main ways. The first is to type numb1:numb2, and R will return a vector of every integer from numb1 to numb2, endpoints included:

OneToFive <- 1:5
OneToFive

[1] 1 2 3 4 5

FiveToOne <- 5:1
FiveToOne

[1] 5 4 3 2 1

The second way is the seq function. seq builds a vector that starts at its first argument (the first element of the vector) and runs up to its second argument (the last element of the vector), incrementing by whatever is given in the “by” argument. If “by” is not specified, the step defaults to 1.

SmallSteps <- seq(1, 7, by = .5)
SmallSteps

 [1] 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0

BigInts <- seq(1, 7)
BigInts

[1] 1 2 3 4 5 6 7
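Two other uses of seq may be helpful. The sketch below assumes we either know how many values we want (the length.out argument) or want to count downward (a negative “by”); the object names FiveSteps and Countdown are made up for illustration.

```r
FiveSteps <- seq(0, 1, length.out = 5) #Five equally spaced values from 0 to 1
FiveSteps
Countdown <- seq(10, 0, by = -2)       #A negative "by" makes the sequence count downward
Countdown
```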

The function length tells us how many elements there are in the vector A:

length(A)

[1] 5

We use brackets to pick out an element of the vector A. Since A is a vector, we write A, followed by [, the position of the element, and ]. For instance, the second and fifth elements of A can be extracted as follows:

A[2]

[1] 3

A[5]

[1] 9

But if we try to pull out the seventh element, R will not give an error message. Because this element doesn’t exist, R returns NA (its symbol for a missing value) instead.

A[7]

[1] NA
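Brackets can also pick out several elements at once. The lines below are a short sketch of the most common patterns; the vector A is defined again here only so that the example runs on its own.

```r
A <- c(1, 3, 5, 7, 9) #Same vector as above
A[c(1, 3)]            #The first and third elements
A[2:4]                #The second through fourth elements
A[-1]                 #Everything except the first element
A[A > 4]              #Only the elements that are greater than 4
```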

Let’s say we want to combine two vectors of the same length into a matrix. We can bind them together as columns as follows:

X <- cbind(A, B)
X

     A  B
[1,] 1  2
[2,] 3  4
[3,] 5  6
[4,] 7  8
[5,] 9 10

Instead of columns, we can also bind the vectors together as rows:

Y <- rbind(A, B)
Y

  [,1] [,2] [,3] [,4] [,5]
A    1    3    5    7    9
B    2    4    6    8   10
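Once we have a matrix, we need two positions inside the brackets: the row first and the column second. The sketch below rebuilds the matrix X from above so that it runs on its own and shows a few ways to look inside it.

```r
A <- c(1, 3, 5, 7, 9)
B <- c(2, 4, 6, 8, 10)
X <- cbind(A, B)
dim(X)    #Number of rows and number of columns
X[2, ]    #The entire second row
X[, "B"]  #The entire column named B
X[3, 2]   #The element in the third row and second column
```

Leaving the row or column position empty, as in X[2, ], tells R to return everything along that dimension.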

Descriptive statistics are also pretty straightforward in R. For example:

sum(A)  #sum of all the values in vector A

[1] 25

mean(A) #mean of the values in vector A

[1] 5

sd(A)   #standard deviation of the values in vector A

[1] 3.162278
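A few other descriptive functions work the same way. This is only a brief sketch, and the vector A is defined again so that the lines run on their own.

```r
A <- c(1, 3, 5, 7, 9)
median(A)  #median of the values in vector A
var(A)     #variance of the values in vector A (the square of sd)
range(A)   #smallest and largest values in vector A
summary(A) #minimum, quartiles, mean, and maximum in one call
```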

The “data” in R

Using the existing data sets in R

R comes with many data sets for us to use. We can see all available data sets with the following command:

data()

In the following example, I will use a data set called “cars”. To activate the “cars” dataset, we use the following commands:

data(cars)
?cars

The data command will activate the cars data set and ? will open the help page for the cars data set. If a help page exists for any data set, function, etc., R can open it using ?.

The head function previews the data set for us. By default, the head function only prints the first six rows of the data, but we can see more of the data set using a command such as head(cars, 10), which prints the first 10 rows. If you want to see the entire data set, just type cars in the console.

head(cars)

  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10

In the data set, we see two variables, speed and dist, which represent the speed of the cars and the distance taken to stop (based on the information from the help page for the cars data set). If I only want to see or select the first variable, speed, I can use one of the following commands:

cars[,1]   #1 because I want to select the first column
cars$speed #The dollar sign $ allows us to use the column names directly

Similarly, I can see the data for a specific row or a range of rows using the following commands:

cars[5, 2]      #To see the 5th row and second column of the data

[1] 16

cars[1:3, 1]    #To see the first 3 rows under the first column of the data

[1] 4 4 7

cars$speed[1:3] #To see the first 3 rows for the speed column

[1] 4 4 7
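Before working with a data set, it often helps to check its size and structure. The lines below are a minimal sketch using the cars data set; the object name FastCars and the cut-off of 20 are made up for illustration.

```r
nrow(cars)                          #Number of rows (observations)
ncol(cars)                          #Number of columns (variables)
str(cars)                           #Compact overview of the variables and their types
FastCars <- cars[cars$speed > 20, ] #Keep only the rows where speed is above 20
head(FastCars)
```

The condition goes in the row position of the brackets, and the column position is left empty so that all columns are kept.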

Reading external data files in R

R is capable of reading in many types of data files. One way to import data files into R is to use the “Import Dataset” option in the “File” menu in RStudio. However, this can become tedious, especially if we open many data files in R. Instead of manually importing the files individually, we can use the following statements. read.csv expects comma-separated values, while read.table by default expects values separated by white space. The file argument refers to the data file we want to read into R, and header=TRUE means that the first row in our data file includes the variable names. If the names are not included in the data, then we can use header=FALSE and R will name our variables V1, V2, V3, etc. I call my data sets “mydata1” and “mydata2” (but you can choose any other name as long as it does not start with a number). Once the data sets are properly imported, I will be able to use them by referring to “mydata1” and “mydata2” in R.

mydata1 <- read.csv(file, header = TRUE)
mydata2 <- read.table(file, header = TRUE)
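To try read.csv without having a data file at hand, we can write a small data set to a temporary file and read it back. This is only a sketch: the data values and the object names (scores, csvfile, mydata0) are made up for illustration.

```r
scores <- data.frame(id = 1:3, score = c(10, 12, 9)) #A tiny made-up data set
csvfile <- tempfile(fileext = ".csv")                #A temporary file name ending in .csv
write.csv(scores, csvfile, row.names = FALSE)        #Save the data with variable names in the first row
mydata0 <- read.csv(csvfile, header = TRUE)          #Read the file back into R
mydata0
```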

If we have an SPSS data set, we can also directly import this data set into R using the foreign package:

install.packages("foreign") #Assuming this is the first time I am using this package
library(foreign)            #Activate the package
mydata3 <- read.spss(file, to.data.frame = TRUE)  #Read the SPSS file (e.g., data.sav) as a data frame rather than a list

In the R statements above, file refers to the full path of the data set. Note that R does not accept a single backslash (“\”) when specifying the file path. Therefore, we need to use either a double backslash (“\\”) or a forward slash (“/”). You can see the following examples where I open a .csv file on my desktop using the read.csv function.

mydata4 <- read.csv("C:\\Users\\Okan\\Desktop\\data.csv", header = TRUE)
#or
mydata4 <- read.csv("C:/Users/Okan/Desktop/data.csv", header = TRUE)
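Another way to avoid problems with slashes is to let R build the path for us. The sketch below uses the file.path function with the same folder names as in the example above; the object name mypath is made up, and file.exists simply reports whether such a file is present on the computer where the code is run.

```r
mypath <- file.path("C:", "Users", "Okan", "Desktop", "data.csv") #Joins the pieces with forward slashes
mypath
file.exists(mypath) #TRUE only if this file actually exists on the computer
```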

If we are planning to open several data files from the same folder, then we can change the working directory in R. “Working Directory” refers to the default location that we want R to use when importing and exporting files, output, etc. To find out our current working directory, we can use:

getwd()

I can change the working directory to any location on my computer. For example, I can set my desktop as the working directory using the following command:

setwd("C:/Users/Okan/Desktop")

Because my working directory is now my desktop, any file in this location can be called without writing the full path. For example, let’s assume that I have a data file called “example.txt” on my desktop. I can import this file using the following command:

mydata5 <- read.table("example.txt", header = TRUE)
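The whole workflow can be tried on any computer by using R's temporary folder in place of the desktop. This is a sketch under that assumption: the file contents and the object names (old_wd, mydata6) are made up for illustration.

```r
old_wd <- getwd()                                   #Remember the current working directory
setwd(tempdir())                                    #Use R's temporary folder as the working directory
writeLines(c("x y", "1 2", "3 4"), "example.txt")   #Create a tiny made-up data file in that folder
mydata6 <- read.table("example.txt", header = TRUE) #No full path is needed
mydata6
setwd(old_wd)                                       #Go back to the original working directory
```

Saving the old working directory first makes it easy to return to where we started once we are done.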

Graphics in R

R is also capable of generating various graphics. For example, we can draw a simple scatterplot using the plot function in R. In the plot function, the variable we mention first becomes the x axis and the second variable becomes the y axis. A comma separates the two variables inside the function.

plot(cars$speed, cars$dist)

We can make our plot more informative by adding a title and changing the axis labels as follows:

plot(cars$speed, cars$dist, xlab = "Speed of Cars", ylab = "Distance Taken to Stop", 
     main = "An Example Plot")

Another plot type would be a histogram. We can use the hist function to draw a histogram:

hist(cars$speed, main = "Histogram of Speed")
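If we want to keep our plots, one option is to draw them into a PDF file instead of the RStudio window. The sketch below assumes a temporary folder and a made-up file name (cars_plots.pdf); the pch, col, and breaks arguments are optional extras that change the plotting symbol, its color, and the approximate number of bars.

```r
plotfile <- file.path(tempdir(), "cars_plots.pdf") #Made-up file name in R's temporary folder
pdf(plotfile)                                      #Open a PDF file to draw into
plot(cars$speed, cars$dist, xlab = "Speed of Cars", ylab = "Distance Taken to Stop",
     main = "An Example Plot", pch = 19, col = "blue")
hist(cars$speed, main = "Histogram of Speed", xlab = "Speed of Cars", breaks = 10)
dev.off()                                          #Close the file so that the plots are saved
```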

Installing Packages in R

The functions we have used so far (e.g., sum, prod, plot) are included in the base version of R. To use other packages available from the CRAN website, we first need to install them with the install.packages command. For example, the package mirt is the one that we will use for estimating IRT models. In order to install the mirt package into R, your computer needs to be connected to the internet so that it can download the package from the CRAN website. The following commands will download and install the package on your computer. Once it is installed, the package stays on your computer, so you only need to install it once. You just need to activate it using the library command in each new R session before using it.

install.packages("mirt") #Installs the package mirt
library("mirt")          #Activates the package

You can also install several packages at once:

install.packages(c("psych", "lme4", "ggplot2"))
library("psych")
library("lme4")
library("ggplot2")
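Because a package only needs to be installed once, it can be useful to check which packages are already on the computer before installing anything. The lines below are a sketch of one way to do this with requireNamespace; the helper is_installed and the object names wanted and to_install are made up for illustration, and the install.packages line is left as a comment so that nothing is downloaded by accident.

```r
#Hypothetical helper: returns TRUE for each package that is already installed
is_installed <- function(pkgs) {
  vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
}

is_installed(c("stats", "utils"))       #These two packages come with R
wanted <- c("psych", "lme4", "ggplot2")
to_install <- wanted[!is_installed(wanted)] #Keep only the packages that are missing
to_install
#install.packages(to_install)           #Remove the first # to install the missing packages
```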

To open the help page for a package, we can again use ? (e.g., ?lme4) or help(package = "lme4"), or we can look up the package manual on CRAN (e.g., https://cran.r-project.org/web/packages/mirt/index.html).