The first step in any R project is to set the working directory. This is the folder where R looks for data files to import and where it saves your code and any additional files created by your code (plots, graphs, cleaned data files, etc.).
I chose a folder on my Desktop entitled “TA 107” that contains the folder “Recitations”, where all of my recitation resources and code will be saved. I recommend that you make access easier by creating a folder on your Desktop for this class. You can create another folder within it called “Recitations” or “R stuff” or whatever makes the most sense for your organizational style.
To get the path of the folder you select as your working directory, do one of the following:
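Note: Windows users need to write their file path with forward slashes (or doubled backslashes) instead of single backslashes, because R treats a single backslash inside quotation marks as an escape character.
Remember to enclose your file path in quotation marks.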
setwd("/Users/kiraflemke/Desktop/TA 107/Recitations")
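As a small sketch of how slashes work in Windows file paths: inside an R string a single backslash starts an escape sequence, so a path copied from Windows has to be written with forward slashes or doubled backslashes. The folder below is hypothetical (substitute your own), and `setwd()` is only called if that folder actually exists.

```r
# Hypothetical Windows folder, written in two ways that R accepts
path_forward <- "C:/Users/yourname/Desktop/TA 107/Recitations"
path_doubled <- "C:\\Users\\yourname\\Desktop\\TA 107\\Recitations"

# Convert backslashes to forward slashes;
# fixed = TRUE makes "\\" match one literal backslash
path_converted <- gsub("\\", "/", path_doubled, fixed = TRUE)
path_converted == path_forward

# Only set the working directory if the folder is really there
if (dir.exists(path_forward)) {
  setwd(path_forward)
}
```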
Any time you need to check what your working directory is set to, use the code below:
getwd()
## [1] "/Users/kiraflemke"
You should set the working directory at the top of each R file you create to document where your resources are saved.
R can be used to perform basic or complex arithmetic, just like a calculator. The basic operation signs are below:
#addition
6+9
## [1] 15
#subtraction:
11-7
## [1] 4
#multiplication
5*12
## [1] 60
#division
12/6
## [1] 2
#exponents
5^2
## [1] 25
# roots
25^(1/2)
## [1] 5
Use parentheses to dictate order of operations:
(6+9)/3
## [1] 5
6+(9/3)
## [1] 9
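As a brief illustration of why the parentheses matter, the sketch below compares a few expressions with and without them. It relies only on R's standard precedence (exponents first, then multiplication and division, then addition and subtraction); the object names are made up for the example.

```r
a <- 6 + 9 / 3      # division happens first: same as 6 + (9 / 3)
b <- (6 + 9) / 3    # parentheses force the addition first
c1 <- 2 * 3^2       # exponent happens first: same as 2 * (3^2)
d <- (2 * 3)^2      # parentheses force the multiplication first
e <- -2^2           # exponent happens before the minus sign: -(2^2)

c(a, b, c1, d, e)
```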
You can assign values to objects in R using ‘<-’
This allows exact values to be repeatedly used in calculations without having to repeat previous operations. You can call these values by typing the name of the object.
calc1 <- (6+9)/3
calc1
## [1] 5
calc1*100
## [1] 500
calc2 <- calc1*100
Object names cannot start with a number (ex. ‘object1’ can be used but ‘1object’ cannot)
Object names cannot contain spaces or dashes (ex. ‘object_1’ or ‘object.1’ can be used but ‘object 1’ and ‘object-1’ cannot)
As often as possible, try to use names that are meaningful to you, so you don’t forget what the object represents.
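One way to check a name before using it, sketched here with the base R function `make.names()`: it rewrites any string into a syntactically valid name, so a candidate is valid exactly when `make.names()` leaves it unchanged. The vector `candidates` and the object `is_valid` are illustrative names, not part of the recitation code.

```r
# Candidate names taken from the rules above
candidates <- c("object1", "1object", "object_1", "object.1", "object 1", "object-1")

# A name is valid if make.names() does not need to change it
is_valid <- candidates == make.names(candidates)

data.frame(candidates, is_valid)
```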
Objects can also be sets of values, called vectors. These can be created by typing a set of values separated by commas within the wrapper ‘c()’. This is called concatenating.
vec1 <- c(1,7,12,15,28)
vec1
## [1] 1 7 12 15 28
You can specify a set of consecutive numbers using a colon. You can even do this several times within the same vector.
vec2 <- c(1:20)
vec2
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
vec2.1 <- c(1:4,5:12,18)
vec2.1
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 18
If you want to create a vector of ordered, non-consecutive numbers, for example counting by 2, you can use the seq() function.
vec3 <- seq(from = 1, to = 20, by = 2)
vec3
## [1] 1 3 5 7 9 11 13 15 17 19
Once you have a vector created, you can perform operations on the vector. These can be saved as new objects.
vec3/2
## [1] 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5
vec3a <- vec3/2
You can also apply functions to entire vectors. This allows you to easily calculate descriptive statistics.
mean(vec3a)
## [1] 5
median(vec3a)
## [1] 5
sum(vec3a)
## [1] 50
summary(vec3a)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.50 2.75 5.00 5.00 7.25 9.50
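To connect these functions to each other, here is a short sketch that rebuilds the same vector and checks that the mean is just the sum divided by the number of elements. `length()` and `range()` are base R functions; `manual_mean` is an illustrative name.

```r
vec3a <- seq(from = 1, to = 20, by = 2) / 2

# The mean is the sum divided by the number of elements
manual_mean <- sum(vec3a) / length(vec3a)
manual_mean
mean(vec3a)

# range() returns the minimum and maximum that summary() reports
range(vec3a)
```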
The vectors above contain numeric data. Vectors can contain one of the following four data classes:
Numeric – numbers, can be used in calculations
Character – character strings can contain letters, words, numbers, and/or spaces but they must be contained within quotation marks "" (character vectors cannot be used in mathematical calculations)
Logical – logical data can take the value of TRUE or FALSE (also specified by T and F). Quotation marks are not used to refer to logical data.
Factor – a limited set of character or integer values that correspond to a set of possible responses called levels
A single vector can only contain one data class. The class() function can be used to identify the type of data within a vector.
name <- c("person1","person2","person3","person4")
class(name)
## [1] "character"
age <- c(18,19,22,20)
class(age)
## [1] "numeric"
partyid <- c("D","D","I","R")
class(partyid)
## [1] "character"
gender <- c("M","F","M","M")
class(gender)
## [1] "character"
voted.2020 <- c(T,T,F,T)
class(voted.2020)
## [1] "logical"
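To see what "only one data class" means in practice, the sketch below mixes classes inside c(). R does not raise an error; it silently converts every element to a single class. The object names are illustrative.

```r
mixed1 <- c(18, "D")      # number + character  -> everything becomes character
mixed2 <- c(18, TRUE)     # number + logical    -> everything becomes numeric (TRUE is 1)
mixed3 <- c("D", TRUE)    # character + logical -> everything becomes character

class(mixed1)
class(mixed2)
class(mixed3)
mixed2
```

If a vector you expected to be numeric reports class "character", a stray non-numeric entry is a likely cause.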
You can change the class of a vector from one to another:
# Changing a numeric vector to a character vector:
class(age)
## [1] "numeric"
age.char <- as.character(age)
class(age.char)
## [1] "character"
age.char
## [1] "18" "19" "22" "20"
# Changing a character vector of numbers to a numeric vector:
age.num <- as.numeric(age.char)
class(age.num)
## [1] "numeric"
age.num
## [1] 18 19 22 20
# A character vector of words cannot be transformed into a numeric vector. R will not know what numeric values to assign, so it will replace each element with NA.
class(partyid)
## [1] "character"
partyid1 <- as.numeric(partyid)
## Warning: NAs introduced by coercion
class(partyid1)
## [1] "numeric"
partyid1
## [1] NA NA NA NA
Factor Variables: The benefit of factor variables is that they are categorized into numbered levels. This allows R to calculate degrees of freedom and use them in statistical models.
# character to factor
partyid.fact <- as.factor(partyid)
class(partyid.fact)
## [1] "factor"
partyid.fact
## [1] D D I R
## Levels: D I R
# Notice that the 'partyid.fact' is displayed as the list of vector elements and a list of the 3 possible levels.
# numeric to factor
age.fact <- as.factor(age)
class(age.fact)
## [1] "factor"
age.fact
## [1] 18 19 22 20
## Levels: 18 19 20 22
# factor to character
partyid2 <- as.character(partyid.fact)
class(partyid2)
## [1] "character"
partyid2
## [1] "D" "D" "I" "R"
# In order to change a factor vector into a numeric vector, it needs to be transformed into a character vector first. Otherwise, R will assign unrelated numbers to each level - rather than the existing meaningful numbers.
age1 <- as.numeric(as.character(age.fact))
age1
## [1] 18 19 22 20
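A minimal sketch of the pitfall described in the comment above, using the same age values: converting the factor directly returns the position of each value among the sorted levels rather than the ages themselves. `wrong` and `right` are illustrative names.

```r
age <- c(18, 19, 22, 20)
age.fact <- as.factor(age)      # levels are sorted: 18 19 20 22

# Direct conversion returns the internal level codes, not the ages
wrong <- as.numeric(age.fact)
wrong

# Going through character first recovers the original values
right <- as.numeric(as.character(age.fact))
right
```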
In order to select elements from a vector, we must specify which elements we would like to select using brackets [ ]
name[1] #returns the first element from the name vector
## [1] "person1"
age[2:4] #returns elements 2,3, and 4 from the age vector
## [1] 19 22 20
partyid[c(1,3,4)] #returns elements 1,3, and 4 from the partyid vector
## [1] "D" "I" "R"
elements1 <- c(1,3,4)
partyid[elements1]
## [1] "D" "I" "R"
gender[-3]#returns elements 1,2, and 4 from the gender vector
## [1] "M" "F" "M"
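Brackets also accept a vector of TRUE/FALSE values in place of position numbers. The sketch below is a preview of the idea that the subsetting examples later in this document rely on; `is_20_plus` is an illustrative name.

```r
name <- c("person1", "person2", "person3", "person4")
age <- c(18, 19, 22, 20)

# A comparison returns one TRUE/FALSE value per element
is_20_plus <- age >= 20
is_20_plus

# Only the elements where the logical vector is TRUE are returned
age[is_20_plus]
name[is_20_plus]
```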
A matrix is a two-dimensional table of data organized into rows and columns. We can combine vectors into a matrix using the following commands:
matrix1 <- rbind(name,age,partyid,gender,voted.2020)
matrix1
## [,1] [,2] [,3] [,4]
## name "person1" "person2" "person3" "person4"
## age "18" "19" "22" "20"
## partyid "D" "D" "I" "R"
## gender "M" "F" "M" "M"
## voted.2020 "TRUE" "TRUE" "FALSE" "TRUE"
matrix2 <- cbind(name,age,partyid,gender,voted.2020)
matrix2
## name age partyid gender voted.2020
## [1,] "person1" "18" "D" "M" "TRUE"
## [2,] "person2" "19" "D" "F" "TRUE"
## [3,] "person3" "22" "I" "M" "FALSE"
## [4,] "person4" "20" "R" "M" "TRUE"
Notice that each value shows up in the matrix within quotes, indicating that it is class character. This is because all values in a matrix must be of the same class.
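One practical consequence, sketched with a smaller two-column matrix: the ages stored in a character matrix cannot be used in calculations until they are converted back to numeric. `age_from_matrix` is an illustrative name.

```r
name <- c("person1", "person2", "person3", "person4")
age <- c(18, 19, 22, 20)
matrix2 <- cbind(name, age)

# Every cell was converted to character, including the ages
class(matrix2[1, "age"])

# Convert the age column back to numeric before doing arithmetic
age_from_matrix <- as.numeric(matrix2[, "age"])
mean(age_from_matrix)
```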
You can combine vectors to create a matrix or dataframe of related values. This allows the data to take on the structure of a spreadsheet. A dataframe can contain vectors of different classes.
mydata <- cbind.data.frame(name,age,partyid,gender,voted.2020)
mydata
## name age partyid gender voted.2020
## 1 person1 18 D M TRUE
## 2 person2 19 D F TRUE
## 3 person3 22 I M FALSE
## 4 person4 20 R M TRUE
dim(mydata) #dimensions of the data frame returned as the # of rows and columns
## [1] 4 5
colnames(mydata) # the names of all variables included in the data frame
## [1] "name" "age" "partyid" "gender" "voted.2020"
If you want to locate a specific value within a dataframe, you have to give R the “address” of that value by first specifying the name of the dataframe, then identifying the row (observation / individual) and column (vector / variable) that contain the value.
The “address” will always be specified in the format ‘[row #,column #]’.
‘[row #,]’ will print all values in the specified row.
‘[,column #]’ will print all values in the specified column. Don’t forget to include the comma.
# print the first column, which contains the vector "name"
#use brackets '[]' to identify the column by number
mydata[,1]
## [1] person1 person2 person3 person4
## Levels: person1 person2 person3 person4
# you can better understand the content of a variable using the table() or summary() functions
#table is most useful for factor variables
table(mydata[,4])
##
## F M
## 1 3
#summary is better for numeric variables
summary(mydata[,2])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 18.75 19.50 19.75 20.50 22.00
# print the first row of the data frame, which contains information about "person1"
mydata[1,]
## name age partyid gender voted.2020
## 1 person1 18 D M TRUE
#print the gender of person 2 - row 2, column 4
mydata[2,4]
## [1] F
## Levels: F M
There will be times when you want to identify which rows contain a specific value for a variable. In these cases, the which() command will give you the row numbers of all observations that meet your inclusion criteria. The length() command will count the number of observations in a vector.
# Identify all rows for which the gender is male
which(mydata[,4] == "M")
## [1] 1 3 4
# Identify how many respondents identify as male
length(which(mydata[,4] == "M"))
## [1] 3
# Identify the party ID of all respondents who identify as male
males <- which(mydata[,4] == "M")
mydata[males,3]
## [1] D I R
## Levels: D I R
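As a sketch of what which() and length() are doing here: the comparison inside which() produces a logical vector with one value per row, and because TRUE counts as 1 and FALSE as 0, sum() on that vector gives the same count. The data frame is rebuilt with data.frame() so the block runs on its own; `is_male` is an illustrative name.

```r
mydata <- data.frame(
  name = c("person1", "person2", "person3", "person4"),
  age = c(18, 19, 22, 20),
  partyid = c("D", "D", "I", "R"),
  gender = c("M", "F", "M", "M"),
  voted.2020 = c(TRUE, TRUE, FALSE, TRUE)
)

is_male <- mydata[, 4] == "M"   # logical vector, one value per row
is_male
which(is_male)                  # row numbers where the condition is TRUE
length(which(is_male))          # number of matching rows
sum(is_male)                    # same count, since TRUE is treated as 1
```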
You can select multiple rows and columns at once. If you then assign the selected rows and columns to a new object, you can create a subset that only includes your selected variables and observations.
mydata[c(1:3),] # rows 1,2,3 and all columns
## name age partyid gender voted.2020
## 1 person1 18 D M TRUE
## 2 person2 19 D F TRUE
## 3 person3 22 I M FALSE
mydata[,c(1:4)] # all rows and columns 1-4
## name age partyid gender
## 1 person1 18 D M
## 2 person2 19 D F
## 3 person3 22 I M
## 4 person4 20 R M
mydata[c(-4),c(-3)] # rows 1,2,3 and columns 1,2,4,5
## name age gender voted.2020
## 1 person1 18 M TRUE
## 2 person2 19 F TRUE
## 3 person3 22 M FALSE
rows <- c(1,3,4)
cols <- c(1,2,5)
mydata[rows,cols]
## name age voted.2020
## 1 person1 18 TRUE
## 3 person3 22 FALSE
## 4 person4 20 TRUE
sub.mydata <- mydata[rows,cols]
You can also create a subset of rows based on the value they contain for a specific variable (inclusion criteria). This can be accomplished by indexing with brackets [ ] or the subset( ) function.
male.mydata1 <- mydata[mydata[,4]=="M",] #make sure to include the comma before the bracket to indicate that you want to include all columns in your subset
dim(male.mydata1) # there are only 3 rows in the subset. These are the students that are coded as male.
## [1] 3 5
male.mydata2 <- subset(mydata, mydata[,4]=="M")
dim(male.mydata2)
## [1] 3 5
Inclusion criteria are specified using operators. You saw above that “==” was used to specify only rows where the variable was “exactly equal to” the following value. The other operators are as follows:
| == | ‘exactly equals’ |
| != | ‘is/are not equal to’ |
| < | ‘less than’ |
| > | ‘greater than’ |
| >= | ‘greater or equal’ |
| <= | ‘less or equal’ |
### Create a subset of all respondents that do not have a Party ID of Republican
mydata.not.gop <- mydata[mydata[,3] != "R",] #make sure to include the comma before the bracket to indicate that you want to include all columns in your subset
dim(mydata.not.gop)
## [1] 3 5
table(mydata.not.gop[,3]) # check that all observations in the subset have a party ID other than "R"
##
## D I R
## 2 1 0
### Create a subset of all respondents that are 20 or older
age20.mydata <- subset(mydata, mydata[,2] >= 20)
dim(age20.mydata)
## [1] 2 5
table(age20.mydata[,2])
##
## 20 22
## 1 1
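To make the comparison operators concrete, the sketch below applies each one to the age values from the example data. Each comparison returns a logical vector with one TRUE or FALSE per element, which is what the brackets and subset() use to decide which rows to keep.

```r
age <- c(18, 19, 22, 20)

age == 20   # FALSE FALSE FALSE  TRUE
age != 20   #  TRUE  TRUE  TRUE FALSE
age <  20   #  TRUE  TRUE FALSE FALSE
age >  20   # FALSE FALSE  TRUE FALSE
age >= 20   # FALSE FALSE  TRUE  TRUE
age <= 20   #  TRUE  TRUE FALSE  TRUE
```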
You can select multiple inclusion criteria for your subset. These are combined using:
& - AND
| - OR
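A short sketch of how & and | combine criteria element by element, using plain vectors with the same values as the example data. The object names `male` and `young` are illustrative, and `%in%` is a base R operator shown here only as a compact alternative to chaining == with |.

```r
age <- c(18, 19, 22, 20)
partyid <- c("D", "D", "I", "R")
gender <- c("M", "F", "M", "M")

male <- gender == "M"     # TRUE FALSE TRUE TRUE
young <- age <= 20        # TRUE TRUE FALSE TRUE

male & young              # TRUE only where both conditions are TRUE
male | young              # TRUE where at least one condition is TRUE

# Two ways to ask "is the party ID D or R?"
partyid == "D" | partyid == "R"
partyid %in% c("D", "R")
```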
### Create a subset of all male students aged 20 or younger
male20.mydata <- mydata[mydata[,4] == "M" & mydata[,2] <= 20,]
male20.mydata
## name age partyid gender voted.2020
## 1 person1 18 D M TRUE
## 4 person4 20 R M TRUE
# there are 2 rows in this subset. These are the students that are coded as male AND whose ages are less than or equal to 20.
#Those who are above 20 are not included, even if they are male.
#Those that are coded female are not included even if they are below 20.
### Create a subset of students that have a Party ID of Dem OR Rep
DR.mydata <- subset(mydata, mydata[,3] =="D" | mydata[,3] == "R")
DR.mydata
## name age partyid gender voted.2020
## 1 person1 18 D M TRUE
## 2 person2 19 D F TRUE
## 4 person4 20 R M TRUE
# there are 3 rows in this subset. These are the students whose party identification is D OR R.
# It includes both students that have "D" and those that have "R" as the value for the partyid variable, but not students with any other value for partyid.
Putting it all together: Best practice for subsetting combines all of the above skills.
First, you want to identify the rows that will be included in your subset using the which() function.
Then you want to check how many observations will be in your final subset.
Finally, you want to create your subset and check that the final dimensions include the expected number of rows and columns.
### Create a subset of all students over 18 that are not Independents
# Identify rows and save as an object
DR18 <- which(mydata[,2] > 18 & mydata[,3] != "I")
DR18
## [1] 2 4
# check the number of rows that meet criteria
length(DR18)
## [1] 2
# create a subset of only the identified rows and all columns
DR18.mydata <- mydata[DR18,]
# check that the subset dimensions match the number of identified rows and original number of columns
dim(DR18.mydata)
## [1] 2 5
# our subset is small enough that we can just print it in the console to visually check that it is correct, but with larger datasets, this will not be possible.
DR18.mydata
## name age partyid gender voted.2020
## 2 person2 19 D F TRUE
## 4 person4 20 R M TRUE
So far we have been specifying the column of our datasets using brackets. When using data frames, we can also identify columns using a dollar sign ‘$’ and the recognized variable name.
Note: Do not do this on Problem Set 1. We will learn more about data frame notation in week 5.
#use a dollar sign '$' to identify the column by variable name.
mydata$name
## [1] person1 person2 person3 person4
## Levels: person1 person2 person3 person4
#print the gender of person 2 - row 2, column 4
mydata$gender[2]
## [1] F
## Levels: F M
#when indexing with variable names, you do not need to include a comma after the row number in brackets. Since column is already specified outside the brackets, R knows that it is only looking for one dimension: the row number.
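The sketch below checks that the different ways of addressing the same cell agree. The data frame is rebuilt with data.frame() so the block runs on its own, and the object names `by_number`, `by_name`, and `by_dollar` are illustrative.

```r
mydata <- data.frame(
  name = c("person1", "person2", "person3", "person4"),
  age = c(18, 19, 22, 20),
  partyid = c("D", "D", "I", "R"),
  gender = c("M", "F", "M", "M"),
  voted.2020 = c(TRUE, TRUE, FALSE, TRUE)
)

by_number <- mydata[2, 4]          # row 2, column 4
by_name   <- mydata[2, "gender"]   # row 2, column named "gender"
by_dollar <- mydata$gender[2]      # gender column first, then element 2

as.character(by_number)

# All three point at the same value
identical(by_number, by_name) && identical(by_name, by_dollar)
```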
You can create new variables within a dataframe by:
Naming a new variable and assigning it the value NA. This creates an empty column at the end of that dataframe.
Assigning values to the new column. You can specify individual values in the order you want them to appear using c(), or you can perform an operation on an existing variable. This operation can be a numeric calculation or a logical statement (which returns TRUE or FALSE for each row of that dataframe).
mydata$major <- NA
mydata$major <- c("psci", "compsci","econ","comm")
mydata$major # this variable contains the major of each student in the dataset
## [1] "psci" "compsci" "econ" "comm"
dim(mydata) #each time you add a variable, the # of columns should increase
## [1] 4 6
colnames(mydata) #the name of your new variable should appear
## [1] "name" "age" "partyid" "gender" "voted.2020"
## [6] "major"
mydata$years.vote <- NA
mydata$years.vote <- mydata$age - 18
mydata$years.vote # this variable contains the number of years since each student turned 18 and became eligible to vote
## [1] 0 1 4 2
dim(mydata)
## [1] 4 7
colnames(mydata)
## [1] "name" "age" "partyid" "gender" "voted.2020"
## [6] "major" "years.vote"
mydata$twenties <- NA
mydata$twenties <- mydata$age > 20
mydata$twenties # this variable contains a value of TRUE or FALSE for each student, which indicates whether or not they are older than 20 (a student who is exactly 20 is coded FALSE because the condition is age > 20)
## [1] FALSE FALSE TRUE FALSE
dim(mydata)
## [1] 4 8
colnames(mydata)
## [1] "name" "age" "partyid" "gender" "voted.2020"
## [6] "major" "years.vote" "twenties"
Below is a copy of the subsetting examples from recitation 4, written using column names, rather than column numbers, for indexing:
You can also create a subset of rows based on the value they contain for a specific variable (inclusion criteria). This can be accomplished by indexing with brackets [ ] or the subset( ) function.
male.mydata1 <- mydata[mydata$gender=="M",] #make sure to include the comma before the bracket to indicate that you want to include all columns in your subset
dim(male.mydata1) # there are only 3 rows in the subset. These are the students that are coded as male.
## [1] 3 8
male.mydata2 <- subset(mydata, mydata$gender=="M")
dim(male.mydata2)
## [1] 3 8
Inclusion criteria are specified using operators. You saw above that “==” was used to specify only rows where the variable was “exactly equal to” the following value. The other operators are as follows:
| == | ‘exactly equals’ |
| != | ‘is/are not equal to’ |
| < | ‘less than’ |
| > | ‘greater than’ |
| >= | ‘greater or equal’ |
| <= | ‘less or equal’ |
mydata.not.gop <- mydata[mydata$partyid != "R",] #make sure to include the comma before the bracket to indicate that you want to include all columns in your subset
dim(mydata.not.gop)
## [1] 3 8
table(mydata.not.gop$partyid)
##
## D I R
## 2 1 0
vote1.mydata <- subset(mydata, mydata$years.vote >= 1)
dim(vote1.mydata)
## [1] 3 8
table(vote1.mydata$years.vote)
##
## 1 2 4
## 1 1 1
You can select multiple inclusion criteria for your subset. These are combined using & (AND) and | (OR):
male.mydata3 <- mydata[mydata$gender=="M" & mydata$age <= 20,]
male.mydata3
## name age partyid gender voted.2020 major years.vote twenties
## 1 person1 18 D M TRUE psci 0 FALSE
## 4 person4 20 R M TRUE comm 2 FALSE
# there are 2 rows in this subset. These are the students that are coded as male AND whose ages are less than or equal to 20.
#Those who are above 20 are not included, even if they are male.
#Those that are coded female are not included even if they are below 20.
DR.mydata2 <- subset(mydata, mydata$partyid=="D" | mydata$partyid=="R")
DR.mydata2
## name age partyid gender voted.2020 major years.vote twenties
## 1 person1 18 D M TRUE psci 0 FALSE
## 2 person2 19 D F TRUE compsci 1 FALSE
## 4 person4 20 R M TRUE comm 2 FALSE
# there are 3 rows in this subset. These are the students whose party identification is D OR R.
# It includes both students that have "D" and those that have "R" as the value for the partyid variable, but not students with any other value for partyid.
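As an optional shorthand, subset() evaluates its condition inside the data frame, so column names can be written without the `mydata$` prefix. The sketch below rebuilds a five-column version of the example data and checks that both spellings return the same rows; the `select` argument of subset() keeps only the listed columns. The object names `dr1`, `dr2`, and `young_males` are illustrative.

```r
mydata <- data.frame(
  name = c("person1", "person2", "person3", "person4"),
  age = c(18, 19, 22, 20),
  partyid = c("D", "D", "I", "R"),
  gender = c("M", "F", "M", "M"),
  voted.2020 = c(TRUE, TRUE, FALSE, TRUE)
)

# With and without the mydata$ prefix
dr1 <- subset(mydata, mydata$partyid == "D" | mydata$partyid == "R")
dr2 <- subset(mydata, partyid == "D" | partyid == "R")
identical(dr1, dr2)

# Keep only the name and age columns for male students aged 20 or younger
young_males <- subset(mydata, gender == "M" & age <= 20, select = c(name, age))
young_males
```

This shorthand works only inside subset(); when indexing with brackets, the column still has to be written as `mydata$gender` or `mydata[,4]`.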