Last time, we introduced the very basics of R data manipulation. Let’s continue down that road by introducing the difference between matrices and data frames.
R treats a table (rows by columns) whose data are all of one type (for instance, all numeric, all logical, or all text) as a matrix. So let’s consider a hypothetical data set of students’ GPAs:
gpa <- c(4.0, 3.7, 3.2, 3.8, 2.1)
studentid <- c(1001,1002,1003,1004,1005)
We would create a matrix from this data like so:
studentgpa <- cbind(gpa,studentid) # use ? to find out what a function does, e.g. ?cbind
studentgpa # remember we can just type an object's name and hit enter to find out what it contains
## gpa studentid
## [1,] 4.0 1001
## [2,] 3.7 1002
## [3,] 3.2 1003
## [4,] 3.8 1004
## [5,] 2.1 1005
So far, so good! We can even use functions to find the mean, median, and other summary statistics of a dataset. First, we need to learn how R indexes the rows and columns of a matrix:
studentgpa[,1] # returns the first column
studentgpa[1,] # returns the first row
studentgpa[3,2] # returns the studentID of the third student
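The indexing rules above can be taken one step further. The following is a minimal sketch (it rebuilds the matrix so that it runs on its own) showing how to check a matrix's shape and type, and that a column can be selected by name as well as by position. `is.matrix()`, `dim()`, and `typeof()` are base R functions; the object names `by.position` and `by.name` are made up for this illustration.

```r
# Rebuild the example matrix so this block runs on its own
gpa <- c(4.0, 3.7, 3.2, 3.8, 2.1)
studentid <- c(1001, 1002, 1003, 1004, 1005)
studentgpa <- cbind(gpa, studentid)

is.matrix(studentgpa)   # TRUE: cbind() on two vectors gives a matrix
dim(studentgpa)         # 5 2: rows first, then columns
typeof(studentgpa)      # "double": every cell is numeric

# cbind() kept the vector names as column names, so these are equivalent
by.position <- studentgpa[, 1]
by.name     <- studentgpa[, "gpa"]
identical(by.position, by.name)  # TRUE
```

Selecting by name tends to be easier to read later on, and it keeps working if the column order changes.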
And now let’s do a simple operation to find the average GPA:
mean(studentgpa[,1])
What is the median GPA of the students in the dataset? Give the R code and answer.
Sounds good, right? Let’s see what happens when we try to incorporate the students’ names:
names <- c("Billy","Trini","Zack","Kimberly","Jason") # what are those quotation marks? They indicate text being entered.
studentgpa <- cbind(studentgpa,names)
studentgpa
## gpa studentid names
## [1,] "4" "1001" "Billy"
## [2,] "3.7" "1002" "Trini"
## [3,] "3.2" "1003" "Zack"
## [4,] "3.8" "1004" "Kimberly"
## [5,] "2.1" "1005" "Jason"
Using the studentgpa dataset as modified, take the average and median of the gpa column.
Wait. Why doesn’t this work any more?
Remember: R puts data of a single kind into a matrix. So a matrix can be all logical, all numeric, or all text. When we enter text and numbers, as above, that means that we’re telling R that this matrix should now be all text. And you can’t add, subtract, or divide text!
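The conversion described above can be checked directly. This is a small sketch, not the only way to do it: it rebuilds both versions of the matrix under the made-up names `numeric.matrix` and `mixed.matrix`, compares their types with `typeof()`, and then converts the text column back to numbers with `as.numeric()`.

```r
# Rebuild the example vectors so this block runs on its own
gpa <- c(4.0, 3.7, 3.2, 3.8, 2.1)
studentid <- c(1001, 1002, 1003, 1004, 1005)
names <- c("Billy", "Trini", "Zack", "Kimberly", "Jason")

numeric.matrix <- cbind(gpa, studentid)
mixed.matrix   <- cbind(numeric.matrix, names)

typeof(numeric.matrix)  # "double"
typeof(mixed.matrix)    # "character": the numbers were converted to text

# The values are still recoverable by converting the column back to numbers
gpa.recovered <- as.numeric(mixed.matrix[, "gpa"])
mean(gpa.recovered)
```

Converting back and forth like this works, but it is easy to forget, which is one reason to store mixed data differently.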
So that means we need a new kind of R data structure: the data frame.
studentgpa.df <- data.frame(studentid,names,gpa) # Again, use ?data.frame() to learn more here
studentgpa.df
## studentid names gpa
## 1 1001 Billy 4.0
## 2 1002 Trini 3.7
## 3 1003 Zack 3.2
## 4 1004 Kimberly 3.8
## 5 1005 Jason 2.1
Note that all of the quotation marks have disappeared! That’s great: it means that text is text and numbers are numbers. Let’s try that out:
Using the studentgpa.df dataset, take the average and median of the gpa column.
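Before computing anything, it can help to confirm that each column really has the type we expect. The sketch below uses base R's `str()` and `sapply()` for that, and then shows three equivalent ways to pull a single column out of a data frame. One assumption to flag: in R versions before 4.0, `data.frame()` converted text columns to factors by default, so the names column may be reported as a factor rather than as character on an older installation.

```r
# Rebuild the example data frame so this block runs on its own
gpa <- c(4.0, 3.7, 3.2, 3.8, 2.1)
studentid <- c(1001, 1002, 1003, 1004, 1005)
names <- c("Billy", "Trini", "Zack", "Kimberly", "Jason")
studentgpa.df <- data.frame(studentid, names, gpa)

str(studentgpa.df)            # one line per column, with its type
sapply(studentgpa.df, class)  # the class of each column

# Three equivalent ways to pull out the gpa column as a numeric vector
studentgpa.df$gpa
studentgpa.df[["gpa"]]
studentgpa.df[, 3]
```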
How does working with data frames help us? In a lot of ways, really, but one of the easiest advantages comes in plotting data. Here’s a toy example:
plot(studentid,gpa,
main="Student GPAs by Student ID", # note these options!
xlab="Student ID",
ylab="GPAs")
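The call above reads from the loose vectors `studentid` and `gpa`. As a sketch of how the same scatterplot might be drawn from the data frame instead, the columns can be pulled out with `$`, assuming the data frame is built as it was earlier:

```r
# Rebuild the example data frame so this block runs on its own
gpa <- c(4.0, 3.7, 3.2, 3.8, 2.1)
studentid <- c(1001, 1002, 1003, 1004, 1005)
names <- c("Billy", "Trini", "Zack", "Kimberly", "Jason")
studentgpa.df <- data.frame(studentid, names, gpa)

# Same scatterplot as above, but reading both columns from the data frame
plot(studentgpa.df$studentid, studentgpa.df$gpa,
     main = "Student GPAs by Student ID",
     xlab = "Student ID",
     ylab = "GPAs")
```

Reading both columns from one object means the x and y values always come from the same rows.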
We could also make a bar chart:
barplot(gpa,
main="Student GPAs",
names.arg=studentid)
One of the things that Few recommends is intelligently presenting the data. For instance, you might want to arrange GPAs from lowest to highest:
gpa.sorted <- sort(gpa) # watch out: sort() returns the sorted values of a vector; order() returns the positions that would sort it, which is what we need to reorder the rows of a data frame. This is non-trivial.
gpa.sorted
## [1] 2.1 3.2 3.7 3.8 4.0
And now let’s plot it:
barplot(gpa.sorted,
main="Student GPAs",
names.arg=studentid)
Something has gone badly wrong in the last plot. What is it?
Using data frames allows us to reliably order data and keep everything else in order. So going back to `studentgpa.df`, we can order the data:
studentgpa.df.ordered <- studentgpa.df[order(studentgpa.df$gpa),] # order by the data frame's own gpa column
studentgpa.df
## studentid names gpa
## 1 1001 Billy 4.0
## 2 1002 Trini 3.7
## 3 1003 Zack 3.2
## 4 1004 Kimberly 3.8
## 5 1005 Jason 2.1
studentgpa.df.ordered #Always check to make sure things have gone as planned!
## studentid names gpa
## 5 1005 Jason 2.1
## 3 1003 Zack 3.2
## 2 1002 Trini 3.7
## 4 1004 Kimberly 3.8
## 1 1001 Billy 4.0
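To see why this works, it may help to look at what `order()` returns on its own. The sketch below uses only the `gpa` vector: `sort()` gives back the values, while `order()` gives back the positions of those values, which is why it can be used inside the square brackets to rearrange whole rows.

```r
gpa <- c(4.0, 3.7, 3.2, 3.8, 2.1)

sort(gpa)        # the values themselves, lowest to highest
order(gpa)       # 5 3 2 4 1: the positions of those values in gpa
gpa[order(gpa)]  # indexing by those positions reproduces sort(gpa)

# For highest-to-lowest, both functions accept decreasing = TRUE
order(gpa, decreasing = TRUE)  # 1 4 2 3 5
```

Note that the positions 5 3 2 4 1 match the row labels in the ordered data frame printed above.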
And now we can plot:
barplot(studentgpa.df.ordered$gpa, # note that we could also use studentgpa.df.ordered[,3]
main="Student GPAs",
names.arg=studentgpa.df.ordered$names)