By now, you should be comfortable with scalars and vectors. Next, we’ll cover the next two most common data objects: matrices and dataframes.
A matrix is a rectangular collection of data with m rows and n columns. You can think of a matrix as a collection of n column vectors, where each vector has length m.
There are many ways to create matrices:
# matrix() function
matrix(1:9, nrow = 3, ncol = 3)
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE) # fill by row
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
matrix(c("a", "b", "c", "d"), nrow = 2, ncol = 2) # string matrix
## [,1] [,2]
## [1,] "a" "c"
## [2,] "b" "d"
# rbind(), combine vectors by row
rbind(1:10, 91:100)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 1 2 3 4 5 6 7 8 9 10
## [2,] 91 92 93 94 95 96 97 98 99 100
# cbind(), combine vectors by column
cbind(c(1, 2, 3, 4, 5), c(11, 12, 13, 14, 15))
## [,1] [,2]
## [1,] 1 11
## [2,] 2 12
## [3,] 3 13
## [4,] 4 14
## [5,] 5 15
Once you’ve created a matrix, you can find its dimensions (the number of rows and columns) using the dim() function. This function returns a vector of length 2, where the first element is the number of rows, and the second element is the number of columns.
If you just want to know the number of rows or the number of columns, you can use nrow() or ncol():
mtx <- matrix(100, nrow = 2, ncol = 50)
dim(mtx) # first element is # of rows, second is # of columns
## [1] 2 50
nrow(mtx)
## [1] 2
ncol(mtx)
## [1] 50
Just like a vector, a matrix can contain only one type of data: either numbers or strings, not both!
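To see what "not both" means in practice, here is a small sketch. The object names nums and mixed are made up for this illustration. When numbers and strings are put into the same matrix, R converts everything to strings without warning you:

```r
# A matrix built only from numbers stays numeric
nums <- matrix(c(1, 2, 3, 4), nrow = 2, ncol = 2)
is.numeric(nums)     # TRUE

# Mixing numbers and strings: every element is converted to a string
mixed <- matrix(c(1, "a", 2, "b"), nrow = 2, ncol = 2)
mixed                # the numbers now print with quotes around them
is.character(mixed)  # TRUE
```

If you need numbers and strings side by side, that is a sign you want a dataframe rather than a matrix.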
A dataframe looks a lot like a matrix at first: it is also rectangular and has m rows and n columns. However, unlike matrices, dataframes can contain both string vectors and numeric vectors within the same object. For this reason, most large datasets in R, for example, a survey including numeric data and text data, will be stored as dataframes.
To create a dataframe, you can use the data.frame() function. Let’s create a dataframe of fictional survey data. I’ll create 5 entries for Males and 5 entries for Females. I’ll then generate 10 heights from a normal distribution with mean 150 and standard deviation 10.
Survey <- data.frame(Gender = rep(c("Female", "Male"), times = 5),
                     Height = rnorm(10, mean = 150, sd = 10), # Heights come from N(mean = 150, sd = 10)
                     stringsAsFactors = FALSE # don't convert strings to factors
                     )
Survey # Print the dataframe!
## Gender Height
## 1 Female 140.8
## 2 Male 143.6
## 3 Female 149.8
## 4 Male 139.6
## 5 Female 148.9
## 6 Male 155.2
## 7 Female 139.9
## 8 Male 149.0
## 9 Female 183.8
## 10 Male 151.0
You’ll notice I set the stringsAsFactors argument to false. This tells R NOT to convert the strings (the Gender column) to a factor datatype. In R 4.0.0 and later this is already the default, but spelling it out does no harm. We’ll talk about factors later.
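If you are curious what stringsAsFactors actually changes, the sketch below builds the same tiny dataframe twice, once with each setting, and checks the class of the Gender column. The names as.strings and as.factors and the two heights are hypothetical values chosen only for this example:

```r
# Same data, two settings of stringsAsFactors
as.strings <- data.frame(Gender = c("Female", "Male"),
                         Height = c(150, 160),
                         stringsAsFactors = FALSE)
as.factors <- data.frame(Gender = c("Female", "Male"),
                         Height = c(150, 160),
                         stringsAsFactors = TRUE)

class(as.strings$Gender)   # "character"
class(as.factors$Gender)   # "factor"
```

Both dataframes look identical when printed, so checking the class is the only reliable way to tell them apart.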
When you want to try out a new function but don’t have any data, you can always play with the preloaded datasets in R’s datasets package. For example, the dataframe called ChickWeight contains data about the weight of several chickens over time. Let’s look at this dataset and learn the functions names(), head(), dim(), and View():
?ChickWeight # Tell me about the ChickWeight dataset
dim(ChickWeight) # What are the dimensions of ChickWeight?
## [1] 578 4
names(ChickWeight) # What are the column names of ChickWeight?
## [1] "weight" "Time" "Chick" "Diet"
head(ChickWeight) # Show me the first few rows of ChickWeight
## weight Time Chick Diet
## 1 42 0 1 1
## 2 51 2 1 1
## 3 59 4 1 1
## 4 64 6 1 1
## 5 76 8 1 1
## 6 93 10 1 1
If you want to look at an entire dataframe in a separate window, use the View() command:
View(ChickWeight)
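A few variations on these commands can be handy when View() is not available (for example, when running R in a plain terminal). This is only a sketch: tail() and str() are base R functions that have not been introduced in this chapter, so treat them as optional extras.

```r
# ChickWeight ships with R in the datasets package, so nothing needs loading
head(ChickWeight, n = 3)   # only the first 3 rows
tail(ChickWeight, n = 3)   # the last 3 rows
nrow(ChickWeight)          # number of rows: 578
str(ChickWeight)           # compact summary of each column and its type
```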
Once you have a matrix or a dataframe loaded in R, you can access specific rows or columns by indexing the object. Why would you want to do this? Well, let’s say you want to calculate the mean weight of chickens at time 0. To do this, you first need to access just the weight data at time 0. You can easily do this using indexing.
To index a matrix or dataframe, use square brackets with a comma inside: [rows, columns]. The first entry is the row(s) you want, and the second is the column(s). If you want all rows or all columns, leave that entry blank:
mtx <- matrix(1:25, nrow = 5, ncol = 5)
mtx[,] # All rows and columns
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 6 11 16 21
## [2,] 2 7 12 17 22
## [3,] 3 8 13 18 23
## [4,] 4 9 14 19 24
## [5,] 5 10 15 20 25
mtx[1,1] # What is in the first row and first column?
## [1] 1
mtx[,5] # What is in the fifth column?
## [1] 21 22 23 24 25
mtx[2,] # What is in the second row?
## [1] 2 7 12 17 22
mtx[1:3,] # What is in rows 1 through 3?
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 6 11 16 21
## [2,] 2 7 12 17 22
## [3,] 3 8 13 18 23
mtx[, c(2, 4)] # What is in columns 2 and 4?
## [,1] [,2]
## [1,] 6 16
## [2,] 7 17
## [3,] 8 18
## [4,] 9 19
## [5,] 10 20
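Notice in the output above that asking for a single row or column gives back a plain vector, not a matrix. The sketch below shows two related bracket tricks that this chapter has not covered, so take them as optional: the drop = FALSE argument keeps the result as a matrix, and a negative index leaves rows or columns out. The names col5.vector and col5.matrix are made up for the example.

```r
mtx <- matrix(1:25, nrow = 5, ncol = 5)

col5.vector <- mtx[, 5]                 # a plain vector of length 5
col5.matrix <- mtx[, 5, drop = FALSE]   # still a matrix: 5 rows, 1 column

dim(col5.vector)   # NULL, because a vector has no dimensions
dim(col5.matrix)   # 5 1

# A negative index removes rows or columns instead of selecting them
mtx[-1, ]          # everything except the first row
```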
As you can guess, you can also use brackets to index vectors (but with just a single argument):
a <- seq(from = 0, to = 100, by = 10)
a[1] # What is the first element of a?
## [1] 0
a[3:5] # What are the third through fifth elements of a?
## [1] 20 30 40
You can also use bracket indexing with dataframes. However, you can additionally index columns of a dataframe using the $ symbol followed by the name of the column:
Survey$Gender
## [1] "Female" "Male" "Female" "Male" "Female" "Male" "Female"
## [8] "Male" "Female" "Male"
Survey$Height
## [1] 140.8 143.6 149.8 139.6 148.9 155.2 139.9 149.0 183.8 151.0
ChickWeight$weight[1:10] # First ten elements of the chicken weight vector
## [1] 42 51 59 64 76 93 106 125 149 171
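The $ symbol is a shortcut; the same column can be pulled out with brackets as well. Here is a minimal sketch using a hypothetical stand-in called survey.demo (with fixed heights, so the result does not depend on random numbers):

```r
# Hypothetical stand-in for the Survey dataframe
survey.demo <- data.frame(Gender = c("Female", "Male", "Female", "Male"),
                          Height = c(140.8, 143.6, 149.8, 139.6),
                          stringsAsFactors = FALSE)

survey.demo$Height        # by name, with $
survey.demo[, "Height"]   # by name, with brackets
survey.demo[, 2]          # by position, with brackets
survey.demo[["Height"]]   # by name, with double brackets

# All four return the same numeric vector
identical(survey.demo$Height, survey.demo[, 2])   # TRUE
```

Indexing by name is usually the safer habit, because it keeps working if the columns are later reordered.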
Frequently, you’ll want to index a dataframe based on certain criteria. For example, in our Survey dataframe, we might just want the heights of Males. Or in the ChickWeight dataframe, we might want the weights from a specific Chick. We can accomplish this by putting a logical vector inside the brackets:
Survey$Gender == "Male" # Which rows are from males?
## [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
Survey[Survey$Gender == "Male",] # Give me the rows of Survey from males
## Gender Height
## 2 Male 143.6
## 4 Male 139.6
## 6 Male 155.2
## 8 Male 149.0
## 10 Male 151.0
ChickWeight$weight[ChickWeight$Time == 0] # What are the weights at time 0?
## [1] 42 40 43 42 41 41 41 42 42 41 43 41 41 41 41 41 42 39 43 41 40 41 43
## [24] 42 40 42 39 39 39 42 42 41 39 41 41 39 41 41 42 41 42 42 42 42 41 40
## [47] 41 39 40 41
ChickWeight$weight[ChickWeight$Time == 2] # What are the weights at time 2?
## [1] 51 49 39 49 42 49 49 50 51 44 51 49 48 49 49 45 51 35 48 47 50 55 52
## [24] 52 49 48 46 46 48 48 53 49 50 49 53 48 48 49 50 55 51 49 55 51 50 52
## [47] 53 50 53 54
ChickWeight$weight[ChickWeight$Time == 0 &
                   ChickWeight$Diet == 1] # What are the weights at time 0 for Diet 1?
## [1] 42 40 43 42 41 41 41 42 42 41 43 41 41 41 41 41 42 39 43 41
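Logical vectors can be stored, combined, and counted before they are used as an index. The sketch below uses a hypothetical six-row dataframe called survey.demo with fixed heights; the 145 cutoff and the names is.male and is.tall are also just assumptions for the example.

```r
# Hypothetical stand-in for the Survey dataframe
survey.demo <- data.frame(Gender = c("Female", "Male", "Female", "Male", "Female", "Male"),
                          Height = c(140.8, 143.6, 149.8, 139.6, 148.9, 155.2),
                          stringsAsFactors = FALSE)

is.male <- survey.demo$Gender == "Male"   # TRUE for rows 2, 4, 6
is.tall <- survey.demo$Height > 145       # TRUE for rows 3, 5, 6

sum(is.male)                       # TRUE counts as 1, so this is the number of males: 3
survey.demo[is.male & is.tall, ]   # AND: males taller than 145 (row 6 only)
survey.demo[is.male | is.tall, ]   # OR: males, or anyone taller than 145
which(is.male & is.tall)           # row numbers that match: 6
```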
Once you know how to index a dataframe to get the data vectors you want, you can then easily calculate descriptive statistics:
# Statistics from our Survey
Male.Heights <- Survey$Height[Survey$Gender == "Male"]
mean(Male.Heights) # What is the mean height of males?
## [1] 147.7
sd(Male.Heights) # What is the standard deviation of the male heights?
## [1] 6.148
# Descriptive statistics from ChickWeight
Time.0.Weights <- ChickWeight$weight[ChickWeight$Time == 0]
mean(Time.0.Weights) # What is the mean weight at time 0?
## [1] 41.06
var(Time.0.Weights) # What is the variance of weights at time 0?
## [1] 1.282
max(Time.0.Weights) # What is the maximum weight at time 0?
## [1] 43
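The same recipe (index first, then summarize) works with more than one condition. As a sketch, here is one way to compare the Diet 1 chicks at the first and last measurement times. The names diet1.start and diet1.end are made up for the example, and it assumes that 21 is the last value of Time in ChickWeight.

```r
# Weights of the Diet 1 chicks at the first and last measurement times
diet1.start <- ChickWeight$weight[ChickWeight$Time == 0 & ChickWeight$Diet == 1]
diet1.end   <- ChickWeight$weight[ChickWeight$Time == 21 & ChickWeight$Diet == 1]

length(diet1.start)   # how many Diet 1 chicks were weighed at time 0: 20
mean(diet1.start)     # mean weight at time 0
mean(diet1.end)       # mean weight at time 21

# How much heavier is the average Diet 1 chick at the end?
mean(diet1.end) - mean(diet1.start)
```

Keep an eye on length() when you do this: the two vectors need not be the same length if some chicks were not weighed at the later time.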