Recitation 3: Basic R

Set Working Directory

The first step to any R project is to set the working directory. This is a folder where R will know to look for data files to import, save your code, and save any additional files created by your code (plots, graphs, cleaned data files, etc.)

I chose a folder on my Desktop entitled “TA 107” that contains the folder “Recitations” where all of my recitation resources and code will be saved. I recommend that you make access easier by creating a folder on your desktop for this class. You can create another folder within it called “Recitations” or “R stuff” or whatever makes the most sense for your organizational style.

To get the path of the folder you select as your working directory, do one of the following:

select the folder and go to “Properties” then copy and paste the “Location”
Shift + right-click + select “copy as path”

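Note: Windows users need to use forward slashes (or doubled backslashes) instead of single backslashes in their file path, because R treats a single backslash inside quotation marks as an escape character. A path copied from Windows will contain backslashes, so replace them before running the code. Remember to enclose your file path in quotation marks.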

setwd("/Users/kiraflemke/Desktop/TA 107/Recitations")

Any time you need to check what your working directory is set to, use the code below:

getwd()

## [1] "/Users/kiraflemke"

You should set the working directory at the top of each R file you create to document where your resources are saved.
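The steps above can be sketched as follows. This is only an illustration: the folder is a hypothetical one created under tempdir() so that the code runs on any machine, and the object names (project_dir, old_dir, current_folder) are made up for this example. Substitute your own path in practice.

```r
# Hypothetical folder used only for illustration; replace with your own path
project_dir <- file.path(tempdir(), "Recitations")

# Create the folder if it does not exist yet
if (!dir.exists(project_dir)) {
  dir.create(project_dir)
}

# setwd() invisibly returns the previous working directory, so we can restore it later
old_dir <- setwd(project_dir)

# Name of the folder R is now working in
current_folder <- basename(getwd())
current_folder

# Return to the original working directory
setwd(old_dir)
```

Checking with dir.exists() first avoids the "cannot change working directory" error that setwd() gives when the path is misspelled.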

R as a Calculator

R can be used to perform basic or complex arithmetic, just like a calculator. The basic operation signs are below:

#addition 
6+9

## [1] 15

#subtraction:
11-7

## [1] 4

#multiplication
5*12

## [1] 60

#division
12/6

## [1] 2

#exponents
5^2

## [1] 25

# roots
25^(1/2)

## [1] 5

Use parentheses to dictate order of operations:

(6+9)/3

## [1] 5

6+(9/3)

## [1] 9
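When parentheses are left out, R falls back on its default order of operations. The short sketch below, using made-up numbers, shows two cases that commonly surprise people.

```r
# Multiplication and division are evaluated before addition and subtraction
2 + 3 * 4      # 14
(2 + 3) * 4    # 20

# Exponents are evaluated before a leading minus sign
-2^2           # -4
(-2)^2         # 4
```

When in doubt, adding parentheses costs nothing and makes the intended order explicit.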

Objects in R

You can assign values to objects in R using ‘<-’

This allows exact values to be repeatedly used in calculations without having to repeat previous operations. You can call these values by typing the name of the object.

calc1 <- (6+9)/3
calc1

## [1] 5

calc1*100

## [1] 500

calc2 <- calc1*100

Rules for Naming Objects in R:

Object names cannot start with a number (ex. ‘object1’ can be used but ‘1object’ cannot)
Object names cannot contain spaces or dashes (ex. ‘object_1’ or ‘object.1’ can be used but ‘object 1’ and ‘object-1’ cannot)

As often as possible, try to use names that are meaningful to you, so you don’t forget what the object represents.
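One way to check a name before using it is the base R function make.names(), which rewrites any string into a syntactically valid name. The sketch below assumes that a candidate is valid exactly when make.names() returns it unchanged; the object names candidates and is_valid are made up for this example.

```r
candidates <- c("object1", "1object", "object_1", "object.1", "object 1", "object-1")

# make.names() repairs invalid names, so a valid name comes back unchanged
is_valid <- candidates == make.names(candidates)

# Show each candidate next to the result of the check
data.frame(candidates, is_valid)
```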

Vectors

Objects can also be sets of values, called vectors. These can be created by typing a set of values separated by commas within the wrapper ‘c()’. This is called concatenating.

vec1 <- c(1,7,12,15,28)
vec1

## [1]  1  7 12 15 28

You can specify a set of consecutive numbers using a colon. You can even do this several times within the same vector.

vec2 <- c(1:20)
vec2

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

vec2.1 <- c(1:4,5:12,18)
vec2.1

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 18

If you want to create a vector of ordered, non-consecutive numbers, for example counting by 2, you can use the seq() function.

vec3 <- seq(from = 1, to = 20, by = 2)
vec3

##  [1]  1  3  5  7  9 11 13 15 17 19
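A few more variations on seq(), shown here as a sketch with made-up values. The object names (countdown, evenly, short) are only for illustration.

```r
# Count down from 10 in steps of 3
countdown <- seq(from = 10, to = 1, by = -3)
countdown      # 10 7 4 1

# Ask for 5 evenly spaced values instead of giving a step size
evenly <- seq(from = 0, to = 1, length.out = 5)
evenly         # 0.00 0.25 0.50 0.75 1.00

# If the step does not land exactly on 'to', the sequence stops before it
short <- seq(from = 1, to = 20, by = 6)
short          # 1 7 13 19
```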

Manipulating Vectors

Once you have a vector created, you can perform operations on the vector. These can be saved as new objects.

vec3/2

##  [1] 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5

vec3a <- vec3/2

You can also perform functions across entire vectors. This allows you to easily perform descriptive statistics.

mean(vec3a)

## [1] 5

median(vec3a)

## [1] 5

sum(vec3a)

## [1] 50

summary(vec3a)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.50    2.75    5.00    5.00    7.25    9.50
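Operations also work between two vectors of the same length, element by element. The sketch below uses hypothetical vote counts (votes_cast and registered are invented for this example).

```r
# Hypothetical counts for three precincts
votes_cast <- c(120, 300, 45)
registered <- c(200, 400, 90)

# Division pairs up the first elements, the second elements, and so on
turnout <- votes_cast / registered
turnout            # 0.60 0.75 0.50

max(turnout)       # 0.75
length(turnout)    # 3
```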

Data Classes

The vectors above contain numeric data. A vector can contain one of the following four data classes:

Numeric – numbers, can be used in calculations
Character – character strings can contain letters, words, numbers, and/or spaces but they must be contained within quotation marks "" (character vectors cannot be used in mathematical calculations)
Logical – logical data can take the value of TRUE or FALSE (also specified by T and F). Quotation marks are not used to refer to logical data.
Factor – a limited set of character or integer values that correspond to a set of possible responses called levels

A single vector can only contain one data class. The class() function can be used to identify the type of data within a vector.

name <- c("person1","person2","person3","person4")
class(name)

## [1] "character"

age <- c(18,19,22,20)
class(age)

## [1] "numeric"

partyid <- c("D","D","I","R")
class(partyid)

## [1] "character"

gender <- c("M","F","M","M")
class(gender)

## [1] "character"

voted.2020 <- c(T,T,F,T)
class(voted.2020)

## [1] "logical"
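Because a vector can hold only one class, R silently converts the elements to a common class when you mix them. A short sketch with made-up vectors (mixed1 and mixed2 are names chosen for this example):

```r
# Numbers and logicals become character when any element is a character string
mixed1 <- c(1, "a", TRUE)
mixed1            # "1" "a" "TRUE"
class(mixed1)     # "character"

# Logicals become numbers when mixed with numbers (TRUE is 1, FALSE is 0)
mixed2 <- c(1, TRUE, FALSE)
mixed2            # 1 1 0
class(mixed2)     # "numeric"
```

If class() returns something you did not expect, a stray character value in the vector is a common cause.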

Coercing Data Classes

You can change the class of a vector from one to another:

# Changing a numeric vector to a character vector:
class(age)

## [1] "numeric"

age.char <- as.character(age)
class(age.char)

## [1] "character"

age.char

## [1] "18" "19" "22" "20"

# Changing a character vector of numbers to a numeric vector:
age.num <- as.numeric(age.char)
class(age.num)

## [1] "numeric"

age.num

## [1] 18 19 22 20

# A character vector of words cannot be transformed into a numeric vector. R will not know what numeric values to assign, so it will replace each element with NA.
class(partyid)

## [1] "character"

partyid1 <- as.numeric(partyid)

## Warning: NAs introduced by coercion

class(partyid1)

## [1] "numeric"

partyid1

## [1] NA NA NA NA

Factor Variables: The benefit of factor variables is that they are categorized into numbered levels. This allows R to calculate degrees of freedom and use them in statistical models.

# character to factor 
partyid.fact <- as.factor(partyid)
class(partyid.fact)

## [1] "factor"

partyid.fact

## [1] D D I R
## Levels: D I R

# Notice that the 'partyid.fact' is displayed as the list of vector elements and a list of the 3 possible levels.

# numeric to factor 
age.fact <- as.factor(age)
class(age.fact)

## [1] "factor"

age.fact

## [1] 18 19 22 20
## Levels: 18 19 20 22

# factor to character 
partyid2 <- as.character(partyid.fact)
class(partyid2)

## [1] "character"

partyid2

## [1] "D" "D" "I" "R"

# In order to change a factor vector into a numeric vector, it needs to be transformed into a character vector first. Otherwise, R will assign unrelated numbers to each level - rather than the existing meaningful numbers.

age1 <- as.numeric(as.character(age.fact))
age1

## [1] 18 19 22 20
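To see why the intermediate as.character() step matters, the sketch below compares the two conversions side by side. It rebuilds the age vector so that it runs on its own; the names codes and values are made up for this example.

```r
age <- c(18, 19, 22, 20)
age.fact <- as.factor(age)

# The levels are stored in sorted order
levels(age.fact)                               # "18" "19" "20" "22"

# Direct conversion returns the position of each value among the levels
codes <- as.numeric(age.fact)
codes                                          # 1 2 4 3

# Converting through character recovers the original ages
values <- as.numeric(as.character(age.fact))
values                                         # 18 19 22 20
```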

Indexing

In order to select elements from a vector, we must specify which elements we would like to select using brackets [ ]

name[1] #returns the first element from the name vector

## [1] "person1"

age[2:4] #returns elements 2,3, and 4 from the age vector

## [1] 19 22 20

partyid[c(1,3,4)] #returns elements 1,3, and 4 from the partyid vector

## [1] "D" "I" "R"

elements1 <- c(1,3,4)
partyid[elements1]

## [1] "D" "I" "R"

gender[-3] #returns elements 1,2, and 4 from the gender vector

## [1] "M" "F" "M"
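Brackets also accept a vector of TRUE and FALSE values in place of position numbers, in which case R keeps the elements marked TRUE. A minimal sketch, rebuilding two of the vectors above so it runs on its own (older and older.names are names invented here):

```r
name <- c("person1", "person2", "person3", "person4")
age <- c(18, 19, 22, 20)

# The comparison returns one TRUE or FALSE per element
age > 19                         # FALSE FALSE TRUE TRUE

# Keep only the ages above 19
older <- age[age > 19]
older                            # 22 20

# The same TRUE/FALSE vector can select from a different vector of equal length
older.names <- name[age > 19]
older.names                      # "person3" "person4"
```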

Combining Vectors into a Matrix

A matrix is a two-dimensional table of data organized into rows and columns. We can combine vectors into a matrix using the following commands:

cbind( ) - each vector becomes a column
rbind( ) - each vector becomes a row

matrix1 <- rbind(name,age,partyid,gender,voted.2020)
matrix1

##            [,1]      [,2]      [,3]      [,4]     
## name       "person1" "person2" "person3" "person4"
## age        "18"      "19"      "22"      "20"     
## partyid    "D"       "D"       "I"       "R"      
## gender     "M"       "F"       "M"       "M"      
## voted.2020 "TRUE"    "TRUE"    "FALSE"   "TRUE"

matrix2 <- cbind(name,age,partyid,gender,voted.2020)
matrix2

##      name      age  partyid gender voted.2020
## [1,] "person1" "18" "D"     "M"    "TRUE"    
## [2,] "person2" "19" "D"     "F"    "TRUE"    
## [3,] "person3" "22" "I"     "M"    "FALSE"   
## [4,] "person4" "20" "R"     "M"    "TRUE"

*Notice that each value shows up in the matrix within quotes, indicating that it is class character. This is because matrices can only contain vectors of the same class.
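For comparison, a matrix built only from numeric vectors stays numeric. The sketch below uses a hypothetical second vector, years.vote, computed from age; num.matrix is also a name chosen just for this example.

```r
age <- c(18, 19, 22, 20)
years.vote <- age - 18     # hypothetical second numeric vector

num.matrix <- cbind(age, years.vote)
num.matrix

class(num.matrix[, 1])     # "numeric"
dim(num.matrix)            # 4 2

# Matrices are indexed as [row, column]; this is row 3 of the years.vote column
num.matrix[3, 2]
```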

Exercise 1

https://kira-f.shinyapps.io/Intro_R_Exercise/

Recitation 4: Data Frames and Conditional Logic

Data Frames

You can combine vectors to create a matrix or dataframe of related values. This allows the data to take on the structure of a spreadsheet. A dataframe can contain vectors of different classes.

mydata <- cbind.data.frame(name,age,partyid,gender,voted.2020)
mydata

##      name age partyid gender voted.2020
## 1 person1  18       D      M       TRUE
## 2 person2  19       D      F       TRUE
## 3 person3  22       I      M      FALSE
## 4 person4  20       R      M       TRUE

dim(mydata) #dimensions of the data frame returned as the # of rows and columns

## [1] 4 5

colnames(mydata) # the names of all variables included in the data frame

## [1] "name"       "age"        "partyid"    "gender"     "voted.2020"
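A data frame can also be built with data.frame(). The sketch below is one possible version, not the code used to produce the output in this handout. Note that before R 4.0.0 text columns were converted to factors by default (which is why some output in this handout shows "Levels:"); setting stringsAsFactors = FALSE keeps them as character in every version. The names mydata2 and col.classes are made up for this example.

```r
name <- c("person1", "person2", "person3", "person4")
age <- c(18, 19, 22, 20)
voted.2020 <- c(TRUE, TRUE, FALSE, TRUE)

# stringsAsFactors = FALSE keeps text columns as character in every R version
mydata2 <- data.frame(name, age, voted.2020, stringsAsFactors = FALSE)

nrow(mydata2)     # 4
ncol(mydata2)     # 3

# Class of each column: character, numeric, logical
col.classes <- sapply(mydata2, class)
col.classes
```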

Indexing in Data Frames

If you want to locate a specific value within a dataframe, you have to give R the “address” of that value by first specifying the name of the dataframe, then identifying the row (observation / individual) and column (vector / variable) that contain the value.

The “address” will always be specified in the format ‘[row #,column #]’.
‘[row #,]’ will print all values in the specified row.
‘[,column #]’ will print all values in the specified column.
*Don’t forget to include the comma.

# print the first column, which contains the vector "name"

#use brackets '[]' to identify the column by number 
mydata[,1]

## [1] person1 person2 person3 person4
## Levels: person1 person2 person3 person4

# you can better understand the content of a variable using the table() or summary() functions
#table is most useful for factor variables
table(mydata[,4])

## 
## F M 
## 1 3

#summary is better for numeric variables
summary(mydata[,2])

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   18.75   19.50   19.75   20.50   22.00

# print the first row of the data frame, which contains information about "person1"
mydata[1,]

##      name age partyid gender voted.2020
## 1 person1  18       D      M       TRUE

#print the gender of person 2 - row 2, column 4
mydata[2,4]

## [1] F
## Levels: F M

Identifying Rows based on Conditional Logic

There will be times when you want to identify which rows contain a specific value for a variable. In these cases, the which() command will give you the row numbers of all observations that meet your inclusion criteria. The length() command will count the number of observations in a vector.

# Identify all rows for which the gender is male  
which(mydata[,4] == "M")

## [1] 1 3 4

# Identify how many respondents identify as male 
length(which(mydata[,4] == "M"))

## [1] 3

# Identify the party ID of all respondents who identify as male 
males <- which(mydata[,4] == "M")
mydata[males,3]

## [1] D I R
## Levels: D I R
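It can help to look at what the condition produces before it is passed to which(). The sketch below rebuilds two of the vectors so that it runs on its own; is.male, male.rows, and n.male are names invented for this example.

```r
gender <- c("M", "F", "M", "M")
partyid <- c("D", "D", "I", "R")

# The condition returns one TRUE or FALSE per observation
is.male <- gender == "M"
is.male                      # TRUE FALSE TRUE TRUE

# which() converts the TRUE positions into row numbers
male.rows <- which(is.male)
male.rows                    # 1 3 4

# Summing a logical vector counts the TRUE values
n.male <- sum(is.male)
n.male                       # 3, same as length(which(gender == "M"))

partyid[male.rows]           # "D" "I" "R"
```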

Subsetting

You can select multiple rows and columns at once. If you then assign the selected rows and columns to a new object, you can create a subset that only includes your selected variables and observations.

mydata[c(1:3),] # rows 1,2,3 and all columns

##      name age partyid gender voted.2020
## 1 person1  18       D      M       TRUE
## 2 person2  19       D      F       TRUE
## 3 person3  22       I      M      FALSE

mydata[,c(1:4)] # all rows and columns 1-4

##      name age partyid gender
## 1 person1  18       D      M
## 2 person2  19       D      F
## 3 person3  22       I      M
## 4 person4  20       R      M

mydata[c(-4),c(-3)] # rows 1,2,3 and columns 1,2,4,5

##      name age gender voted.2020
## 1 person1  18      M       TRUE
## 2 person2  19      F       TRUE
## 3 person3  22      M      FALSE

rows <- c(1,3,4)
cols <- c(1,2,5)
mydata[rows,cols]

##      name age voted.2020
## 1 person1  18       TRUE
## 3 person3  22      FALSE
## 4 person4  20       TRUE

sub.mydata <- mydata[rows,cols]

Subsetting based on Conditional Logic

You can also create a subset of rows based on the value they contain for a specific variable (inclusion criteria). This can be accomplished by indexing with brackets [ ] or the subset( ) function.

male.mydata1 <- mydata[mydata[,4]=="M",] #make sure to include the comma before the bracket to indicate that you want to include all columns in your subset
dim(male.mydata1) # there are only 3 rows in the subset. These are the students that are coded as male.

## [1] 3 5

male.mydata2 <- subset(mydata, mydata[,4]=="M")
dim(male.mydata2)

## [1] 3 5

Inclusion criteria are specified using operators. You saw above that “==” was used to specify only rows where the variable was “exactly equal to” the following value. The other operators are as follows:

Logical Operators Apply to all Vector Classes

==	‘exactly equals’
!=	‘is/are not equal to’

Logical Operators Apply only to Numeric Vectors

<	‘less than’
>	‘greater than’
>=	‘greater or equal’
<=	‘less or equal’

### Create a subset of all respondents that do not have a Party ID of Republican

mydata.not.gop <- mydata[mydata[,3] != "R",] #make sure to include the comma before the bracket to indicate that you want to include all columns in your subset
dim(mydata.not.gop)

## [1] 3 5

table(mydata.not.gop[,3]) # check that all observations in the subset have a party ID other than "R"

## 
## D I R 
## 2 1 0

### Create a subset of all respondents that are 20 or older

age20.mydata <- subset(mydata, mydata[,2] >= 20)
dim(age20.mydata)

## [1] 2 5

table(age20.mydata[,2])

## 
## 20 22 
##  1  1
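The two approaches give the same result for the data above, but they differ when the variable contains missing values (NA). The sketch below uses a hypothetical data frame, df, with one missing age to show the difference; all object names are made up for this example.

```r
# Hypothetical data with one missing age
df <- data.frame(id = 1:4, age = c(18, NA, 22, 20))

# Bracket indexing keeps a row of NAs wherever the condition is NA
by.bracket <- df[df[, 2] >= 20, ]
nrow(by.bracket)     # 3

# subset() drops rows where the condition is NA
by.subset <- subset(df, df[, 2] >= 20)
nrow(by.subset)      # 2

# Wrapping the condition in which() makes bracket indexing drop them too
by.which <- df[which(df[, 2] >= 20), ]
nrow(by.which)       # 2
```

Checking dim() after subsetting, as in the examples above, is a simple way to catch this kind of surprise.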

Multiple Conditions

You can select multiple inclusion criteria for your subset. These are combined using:

& - AND
| - OR

### Create a subset of all male students aged 20 or younger

male20.mydata <- mydata[mydata[,4] == "M" & mydata[,2] <= 20,] 
male20.mydata

##      name age partyid gender voted.2020
## 1 person1  18       D      M       TRUE
## 4 person4  20       R      M       TRUE

# there are 2 rows in this subset. These are the students that are coded as male AND whose ages are less than or equal to 20. 
#Those who are above 20 are not included, even if they are male. 
#Those that are coded female  are not included even if they are below 20. 

### Create a subset of students that have a Party ID of Dem OR Rep

DR.mydata <- subset(mydata, mydata[,3] =="D" | mydata[,3] == "R")
DR.mydata

##      name age partyid gender voted.2020
## 1 person1  18       D      M       TRUE
## 2 person2  19       D      F       TRUE
## 4 person4  20       R      M       TRUE

# there are 3 rows in this subset. These are the students whose party identification is D OR R.
# It includes both students that have "D" and those that have "R" as the value for the partyid variable, but not students with any other value for partyid.
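The sketch below shows what & and | return on plain vectors before they are used for subsetting. It also uses %in%, a base R operator not covered above, as a shorter way of writing several OR conditions on the same variable. The names either, either2, and between are invented for this example.

```r
partyid <- c("D", "D", "I", "R")
age <- c(18, 19, 22, 20)

# OR: TRUE when at least one condition is TRUE
either <- partyid == "D" | partyid == "R"
either                               # TRUE TRUE FALSE TRUE

# %in% checks each element against a set of allowed values
either2 <- partyid %in% c("D", "R")
either2                              # TRUE TRUE FALSE TRUE

# AND: TRUE only when both conditions are TRUE
between <- age > 18 & age < 22
between                              # FALSE TRUE FALSE TRUE
```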

Putting it all together: Best practice for subsetting combines all of the above skills.

First, you want to identify the rows that will be included in your subset using the which() function.
Then you want to check how many observations will be in your final subset.
Finally, you want to create your subset and check that the final dimensions include the expected number of rows and columns.

### Create a subset of all students over 18 that are not Independents

# Identify rows and save as an object
DR18 <- which(mydata[,2] > 18 & mydata[,3] != "I")
DR18

## [1] 2 4

# check the number of rows that meet criteria
length(DR18)

## [1] 2

# create a subset of only the identified rows and all columns 
DR18.mydata <- mydata[DR18,] 
# check that the subset dimensions match the number of identified rows and original number of columns
dim(DR18.mydata)

## [1] 2 5

# our subset is small enough that we can just print it in the console to visually check that it is correct, but with larger datasets, this will not be possible. 
DR18.mydata

##      name age partyid gender voted.2020
## 2 person2  19       D      F       TRUE
## 4 person4  20       R      M       TRUE
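With larger datasets, the visual check can be replaced by checks written in code. The sketch below is one possible approach using stopifnot(), which stops with an error if any check is FALSE and prints nothing otherwise. It rebuilds a smaller version of the data so it runs on its own; keep and kept.mydata are names chosen for this example.

```r
mydata <- data.frame(
  name = c("person1", "person2", "person3", "person4"),
  age = c(18, 19, 22, 20),
  partyid = c("D", "D", "I", "R"),
  stringsAsFactors = FALSE
)

# Identify the rows, then build the subset
keep <- which(mydata[, 2] > 18 & mydata[, 3] != "I")
kept.mydata <- mydata[keep, ]

# Each line is a check; an error here means the subset is not what we expected
stopifnot(
  nrow(kept.mydata) == length(keep),
  ncol(kept.mydata) == ncol(mydata),
  all(kept.mydata[, 2] > 18),
  all(kept.mydata[, 3] != "I")
)
```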

Exercise 2

https://kira-f.shinyapps.io/Data_Frame_Exercise/

Recitation 4: Extended - Indexing Data Frames Using Column Names

So far we have been specifying the column of our datasets using brackets. When using data frames, we can also identify columns using a dollar sign ‘$’ and the recognized variable name.

Note: Do not do this on Problem Set 1. We will learn more about data frame notation in week 5.

#use a dollar sign '$' to identify the column by variable name.  
mydata$name

## [1] person1 person2 person3 person4
## Levels: person1 person2 person3 person4

#print the gender of person 2 - row 2, column 4
mydata$gender[2]

## [1] F
## Levels: F M

#when indexing with variable names, you do not need to include a comma after the row number in brackets. Since column is already specified outside the brackets, R knows that it is only looking for one dimension: the row number.
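Column names can also be used inside the brackets, in quotation marks. The sketch below shows several ways of reaching the same value in a small, hypothetical two-row data frame (small.data is a name invented here).

```r
small.data <- data.frame(
  name = c("person1", "person2"),
  gender = c("M", "F"),
  stringsAsFactors = FALSE
)

# All four lines return the gender of person 2
small.data$gender[2]          # dollar sign, then row number
small.data[2, "gender"]       # row number, column name in quotes
small.data[["gender"]][2]     # double brackets extract the column as a vector
small.data[2, 2]              # row number, column number
```

Using names rather than numbers means the code keeps working if the columns are later reordered.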

Creating New Variables in a Data Frame

You can create new variables within a dataframe by:

Naming a new variable and assigning it the value NA. This creates an empty column at the end of that dataframe.
Assigning values to the new column. You can specify individual values in the order you want them to appear using c(), or you can perform an operation on an existing variable. This operation can be a numeric calculation or a logical statement (which returns a T or F for each row of that dataframe).

mydata$major <- NA
mydata$major <- c("psci", "compsci","econ","comm")
mydata$major # this variable contains the major of each student in the dataset

## [1] "psci"    "compsci" "econ"    "comm"

dim(mydata) #each time you add a variable, the # of columns should increase

## [1] 4 6

colnames(mydata) #the name of your new variable should appear

## [1] "name"       "age"        "partyid"    "gender"     "voted.2020"
## [6] "major"

mydata$years.vote <- NA
mydata$years.vote <- mydata$age - 18
mydata$years.vote # this variable contains the number of years since each student turned 18 and became eligible to vote

## [1] 0 1 4 2

dim(mydata)

## [1] 4 7

colnames(mydata)

## [1] "name"       "age"        "partyid"    "gender"     "voted.2020"
## [6] "major"      "years.vote"

mydata$twenties <- NA
mydata$twenties <- mydata$age > 20
mydata$twenties # this variable contains a value of TRUE or FALSE for each student, which indicates whether or not they are older than 20 (note that a 20-year-old is coded FALSE)

## [1] FALSE FALSE  TRUE FALSE

dim(mydata)

## [1] 4 8

colnames(mydata)

## [1] "name"       "age"        "partyid"    "gender"     "voted.2020"
## [6] "major"      "years.vote" "twenties"
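The NA step makes the new column visible before it is filled, but R will also create the column if you assign values to a new name directly. A minimal sketch with a hypothetical data frame and made-up variable names (new.data, age.months, can.vote):

```r
new.data <- data.frame(
  name = c("person1", "person2", "person3", "person4"),
  age = c(18, 19, 22, 20),
  stringsAsFactors = FALSE
)

# Numeric calculation on an existing variable
new.data$age.months <- new.data$age * 12

# Logical statement on an existing variable
new.data$can.vote <- new.data$age >= 18

ncol(new.data)          # 4: two original columns plus two new ones
new.data$age.months     # 216 228 264 240
new.data$can.vote       # TRUE TRUE TRUE TRUE
```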

Below is a copy of the subsetting examples from Recitation 4, written using column names rather than column numbers for indexing:

Subsetting based on Conditional Logic

You can also create a subset of rows based on the value they contain for a specific variable (inclusion criteria). This can be accomplished by indexing with brackets [ ] or the subset( ) function.

male.mydata1 <- mydata[mydata$gender=="M",] #make sure to include the comma before the bracket to indicate that you want to include all columns in your subset
dim(male.mydata1) # there are only 3 rows in the subset. These are the students that are coded as male.

## [1] 3 8

male.mydata2 <- subset(mydata, mydata$gender=="M")
dim(male.mydata2)

## [1] 3 8

Logical Operators Apply to all Vector Classes

==	‘exactly equals’
!=	‘is/are not equal to’

Logical Operators Apply only to Numeric Vectors

<	‘less than’
>	‘greater than’
>=	‘greater or equal’
<=	‘less or equal’

mydata.not.gop <- mydata[mydata$partyid != "R",] #make sure to include the comma before the bracket to indicate that you want to include all columns in your subset
dim(mydata.not.gop)

## [1] 3 8

table(mydata.not.gop$partyid)

## 
## D I R 
## 2 1 0

vote1.mydata <- subset(mydata, mydata$years.vote >= 1)
dim(vote1.mydata)

## [1] 3 8

table(vote1.mydata$years.vote)

## 
## 1 2 4 
## 1 1 1

Multiple Conditions

You can select multiple inclusion criteria for your subset. These are combined using:

& - AND

| - OR

male.mydata3 <- mydata[mydata$gender=="M" & mydata$age <= 20,] 
male.mydata3

##      name age partyid gender voted.2020 major years.vote twenties
## 1 person1  18       D      M       TRUE  psci          0    FALSE
## 4 person4  20       R      M       TRUE  comm          2    FALSE

# there are 2 rows in this subset. These are the students that are coded as male AND whose ages are less than or equal to 20. 
#Those who are above 20 are not included, even if they are male. 
#Those that are coded female  are not included even if they are below 20. 

DR.mydata2 <- subset(mydata, mydata$partyid=="D" | mydata$partyid=="R")
DR.mydata2

##      name age partyid gender voted.2020   major years.vote twenties
## 1 person1  18       D      M       TRUE    psci          0    FALSE
## 2 person2  19       D      F       TRUE compsci          1    FALSE
## 4 person4  20       R      M       TRUE    comm          2    FALSE

# there are 3 rows in this subset. These are the students whose party identification is D OR R.
# It includes both students that have "D" and those that have "R" as the value for the partyid variable, but not students with any other value for partyid.
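Inside subset(), column names can also be written on their own, without repeating the name of the data frame, and the select argument picks which columns to keep. The sketch below rebuilds a smaller version of the data so it runs on its own; people, young.males, and party.only are names invented for this example.

```r
people <- data.frame(
  name = c("person1", "person2", "person3", "person4"),
  age = c(18, 19, 22, 20),
  partyid = c("D", "D", "I", "R"),
  gender = c("M", "F", "M", "M"),
  stringsAsFactors = FALSE
)

# Same condition as above, written with bare column names
young.males <- subset(people, gender == "M" & age <= 20)
young.males$name                 # "person1" "person4"

# select keeps only the listed columns
party.only <- subset(people, partyid == "D" | partyid == "R", select = c(name, partyid))
dim(party.only)                  # 3 2
```

This shortcut only works inside subset(); with bracket indexing you still need to write the data frame name before each column.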

PSCI 107 R Sessions 1 & 2: Intro to R
