Handout 2: R Data Basics

Quick Notes on R Data Types: Matrices and Data Frames

Last time, we introduced the very basics of R data manipulation. Let’s continue down that road by looking at the difference between matrices and data frames.

Matrices

R treats a table (a row-by-column arrangement) of data that is all of one type (for instance, numeric, logical, or text) as a matrix. So let’s consider a hypothetical data set of students’ GPAs:

gpa <- c(4.0, 3.7, 3.2, 3.8, 2.1)
studentid <- c(1001,1002,1003,1004,1005)

We would create a matrix from this data like so:

studentgpa <- cbind(gpa, studentid) # use ? to find out what a function does, e.g. ?cbind
studentgpa # remember we can just type an object's name and hit enter to find out what it contains

##      gpa studentid
## [1,] 4.0      1001
## [2,] 3.7      1002
## [3,] 3.2      1003
## [4,] 3.8      1004
## [5,] 2.1      1005

So far, so good! We can even compute summary statistics such as the mean and median of a dataset. (Note that R’s built-in mode() reports how an object is stored, not its most common value.) First, we need to learn how R indexes the rows and columns of a matrix:

studentgpa[,1] # returns the first column
studentgpa[1,] # returns the first row
studentgpa[3,2] # returns the studentID of the third student
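
The same bracket notation also accepts column names, which can be easier to read than positions. The following is a minimal sketch that rebuilds the matrix from above; the object names gpa.column and third.id are made up for illustration.

```r
# Rebuild the example matrix so this block runs on its own
gpa <- c(4.0, 3.7, 3.2, 3.8, 2.1)
studentid <- c(1001, 1002, 1003, 1004, 1005)
studentgpa <- cbind(gpa, studentid)

dim(studentgpa)       # number of rows, then number of columns
colnames(studentgpa)  # column names taken from the vectors passed to cbind()

# Select by column name instead of by position
gpa.column <- studentgpa[, "gpa"]        # same values as studentgpa[,1]
third.id <- studentgpa[3, "studentid"]   # same value as studentgpa[3,2]
```

Selecting by name keeps working even if the columns are later rearranged, which is why it is often preferred once a table has more than a few columns.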

And now let’s do a simple operation to find the average GPA:

mean(studentgpa[,1])

Quick Exercise 1

What is the median GPA of the students in the dataset? Give the R code and answer.

Sounds good, right? Let’s see what happens when we try to incorporate the students’ names:

names <- c("Billy","Trini","Zack","Kimberly","Jason") # what are those quotation marks? They tell R that these values are text (character strings).

studentgpa <- cbind(studentgpa,names)

studentgpa

##      gpa   studentid names     
## [1,] "4"   "1001"    "Billy"   
## [2,] "3.7" "1002"    "Trini"   
## [3,] "3.2" "1003"    "Zack"    
## [4,] "3.8" "1004"    "Kimberly"
## [5,] "2.1" "1005"    "Jason"

Quick Exercise 2

Using the studentgpa matrix as modified, try to take the mean and median of the gpa column.

Wait. Why doesn’t this work any more?

Remember: a matrix holds data of a single type. So a matrix can be all logical, all numeric, or all text. When we combine text and numbers, as above, R converts everything to text, which is why the numbers now appear in quotation marks. And you can’t add, subtract, or divide text!
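
To see this conversion directly, here is a small sketch using a two-row toy matrix. The object names numbers, words, mixed, gpa.text, and gpa.numeric are made up for illustration and are not part of the handout’s dataset.

```r
# Two short vectors of different types
numbers <- c(4.0, 3.7)
words <- c("Billy", "Trini")

# Binding them into one matrix forces a single type for every cell
mixed <- cbind(numbers, words)
typeof(mixed)       # reports the storage type shared by the whole matrix

# The numeric column is now text, so it has to be converted back
# before any arithmetic can be done on it
gpa.text <- mixed[, "numbers"]
gpa.numeric <- as.numeric(gpa.text)
mean(gpa.numeric)
```

Converting back with as.numeric() works here, but it is a workaround. Keeping each column in its own type from the start is the cleaner approach.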

Data Frames

So we need a new kind of R data structure: the data frame.

studentgpa.df <- data.frame(studentid, names, gpa) # Again, use ?data.frame to learn more here
studentgpa.df

##   studentid    names gpa
## 1      1001    Billy 4.0
## 2      1002    Trini 3.7
## 3      1003     Zack 3.2
## 4      1004 Kimberly 3.8
## 5      1005    Jason 2.1

Note that all of the quotation marks have disappeared! That’s great: it means that text is text and numbers are numbers. Let’s try that out:
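
One way to confirm that each column kept its own type is to ask R for the class of every column. This is a minimal sketch that rebuilds the data frame from above; column.classes is a name made up for illustration. Note that versions of R before 4.0 store text columns as factors by default, so the names column may report "factor" rather than "character" depending on your installation.

```r
# Rebuild the example data frame so this block runs on its own
studentid <- c(1001, 1002, 1003, 1004, 1005)
names <- c("Billy", "Trini", "Zack", "Kimberly", "Jason")
gpa <- c(4.0, 3.7, 3.2, 3.8, 2.1)
studentgpa.df <- data.frame(studentid, names, gpa)

str(studentgpa.df)  # compact summary: one line per column, with its type

# Class of each column, returned as a named character vector
column.classes <- sapply(studentgpa.df, class)
column.classes

# A single column can be pulled out by name with $
is.numeric(studentgpa.df$gpa)
```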

Quick Exercise 3

Using the studentgpa.df data frame, take the mean and median of the gpa column.

Using Data Frames: Basic Plotting Applications

How does working with data frames help us? In a lot of ways, really, but one of the easiest advantages comes in plotting data. Here’s a toy example:

plot(studentid,gpa,
     main="Student GPAs by Student ID", # note these options!
     xlab="Student ID",
     ylab="GPAs")
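
The call above uses the stand-alone vectors. As a sketch of how the same scatterplot might be drawn straight from the data frame, plot() also accepts a formula of the form y ~ x together with a data argument naming the data frame in which to look up the columns.

```r
# Rebuild the example data frame so this block runs on its own
studentid <- c(1001, 1002, 1003, 1004, 1005)
names <- c("Billy", "Trini", "Zack", "Kimberly", "Jason")
gpa <- c(4.0, 3.7, 3.2, 3.8, 2.1)
studentgpa.df <- data.frame(studentid, names, gpa)

# Formula interface: gpa on the vertical axis, studentid on the horizontal,
# with both columns looked up inside studentgpa.df
plot(gpa ~ studentid, data = studentgpa.df,
     main = "Student GPAs by Student ID",
     xlab = "Student ID",
     ylab = "GPAs")
```

Pulling the columns from the data frame means the plot always reflects whatever rows the data frame currently holds, rather than relying on separate vectors that may have been changed since.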

We could also make a bar chart:

barplot(gpa,
        main="Student GPAs",
        names.arg=studentid)

One of the things that Few recommends is intelligently presenting the data. For instance, you might want to arrange GPAs from lowest to highest:

gpa.sorted <- sort(gpa)  # watch out: sort() returns the sorted values of a single vector; order() returns the row positions needed to reorder a whole data frame. This is non-trivial.
gpa.sorted

## [1] 2.1 3.2 3.7 3.8 4.0

And now let’s plot it:

barplot(gpa.sorted,
        main="Student GPAs",
        names.arg=studentid)

Quick Exercise 4

Something has gone badly wrong in the last plot. What is it?

Data frames let us reorder rows while keeping each student’s ID, name, and GPA together. So going back to studentgpa.df, we can order the data:

studentgpa.df.ordered <- studentgpa.df[order(studentgpa.df$gpa),] # order by the gpa column stored inside the data frame
studentgpa.df

##   studentid    names gpa
## 1      1001    Billy 4.0
## 2      1002    Trini 3.7
## 3      1003     Zack 3.2
## 4      1004 Kimberly 3.8
## 5      1005    Jason 2.1

studentgpa.df.ordered # Always check to make sure things have gone as planned!

##   studentid    names gpa
## 5      1005    Jason 2.1
## 3      1003     Zack 3.2
## 2      1002    Trini 3.7
## 4      1004 Kimberly 3.8
## 1      1001    Billy 4.0
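
The row labels on the left of the ordered output (5, 3, 2, 4, 1) are exactly what order() returned. The sketch below shows this on the plain gpa vector; row.order and high.to.low are names made up for illustration.

```r
gpa <- c(4.0, 3.7, 3.2, 3.8, 2.1)

# order() returns positions, not values: the position of the smallest
# value first, then the next smallest, and so on
row.order <- order(gpa)
row.order

# Using those positions as an index gives the same result as sort()
gpa[row.order]

# The same idea works from highest to lowest
high.to.low <- order(gpa, decreasing = TRUE)
high.to.low
```

Because order() gives positions rather than values, the same index can be applied to every column at once, which is what the row index in studentgpa.df[order(...),] does.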

And now we can plot:

barplot(studentgpa.df.ordered$gpa, # Note that we could also use studentgpa.df.ordered[,3]
        main="Student GPAs",
        names.arg=studentgpa.df.ordered$names)