Vectors have 1 dimension: a sequence (row) of items of the same class
Matrix <- rows and columns of the same class, e.g. a “table” of numbers or integers
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
Data Frame <- columns of equal length, each with its own class, e.g. a “table” with strings, numbers, integers, dates, factors etc.
## Name Age Pet
## 1 Butch 4 dog
## 2 Fluffy 2 bunny
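A minimal sketch of the three structures side by side. The object names v, m and pets are made up for this illustration; the pets values come from the table above.

```r
# Hypothetical objects, named only for this illustration
v <- c(1, 2, 3, 4)                         # vector: 1 dimension
m <- matrix(1:8, nrow = 2, byrow = TRUE)   # matrix: rows and columns, one class
pets <- data.frame(Name = c("Butch", "Fluffy"),
                   Age  = c(4, 2),
                   Pet  = c("dog", "bunny"))  # data frame: one class per column

is.null(dim(v))        # TRUE: a plain vector has no dim attribute
dim(m)                 # 2 4
is.numeric(pets$Age)   # TRUE, while Name and Pet hold text
```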
Seeing the number of rows and columns:
dim(matrix(1:12, ncol=4))
## [1] 3 4
attributes(matrix(1:12, ncol=4))$dim
## [1] 3 4
Both give 3 rows, 4 columns
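If only one of the two numbers is needed, nrow() and ncol() return them separately. A short sketch, with m as an arbitrary name for the same matrix:

```r
m <- matrix(1:12, ncol = 4)   # the name m is arbitrary

nrow(m)     # 3
ncol(m)     # 4
length(m)   # 12 items in total
```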
Rows always come first, then columns: matrix[row, column]
r_matrix <- matrix(1:12, ncol=4, byrow=TRUE) #3 x 4 matrix, filled row by row
r_matrix[2,] #second row
## [1] 5 6 7 8
r_matrix[1,] #first row
## [1] 1 2 3 4
r_matrix[,2] #second column
## [1] 2 6 10
r_matrix[,1] #first column
## [1] 1 5 9
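The same row-first rule picks out a single cell or several rows at once. A small sketch; row2 is a made-up name, and drop = FALSE is only needed when the result should stay a matrix:

```r
r_matrix <- matrix(1:12, ncol = 4, byrow = TRUE)

r_matrix[2, 3]        # one cell: row 2, column 3 -> 7
r_matrix[c(1, 3), ]   # rows 1 and 3, all columns

row2 <- r_matrix[2, , drop = FALSE]   # hypothetical name
dim(row2)             # 1 4: still a matrix, not a plain vector
```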
Filling a matrix by column (the default) or by row (byrow=TRUE):
c_matrix <- matrix(1:12, ncol=4)
c_matrix
## [,1] [,2] [,3] [,4]
## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
r_matrix <- matrix(1:12, ncol=4, byrow=TRUE)
r_matrix
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
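Both matrices hold the same 12 values; only the layout differs. One way to see the link, as a sketch: filling 4 rows by column and then transposing with t() gives the by-row matrix.

```r
c_matrix <- matrix(1:12, ncol = 4)                 # filled column by column
r_matrix <- matrix(1:12, ncol = 4, byrow = TRUE)   # filled row by row

c_matrix[1, ]   # 1 4 7 10
r_matrix[1, ]   # 1 2 3 4

# Fill 4 rows by column, then swap rows and columns
all(t(matrix(1:12, nrow = 4)) == r_matrix)   # TRUE
```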
Vector to Matrix:
age <- c(23, 44, 15, 12, 31, 16)
age
## [1] 23 44 15 12 31 16
dim(age) <- c(2, 3) #rows first, columns second
age
## [,1] [,2] [,3]
## [1,] 23 15 31
## [2,] 44 12 16
class(age)
## [1] "matrix"
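Note that class(age) prints "matrix" "array" in R 4.0.0 and later; is.matrix() gives the same answer in every version. The sketch below also shows that matrix() builds the same result without setting dim(). The name age2 is made up for the comparison.

```r
age <- c(23, 44, 15, 12, 31, 16)
age2 <- matrix(age, nrow = 2)   # hypothetical name; filled column by column

dim(age) <- c(2, 3)             # rows first, columns second
is.matrix(age)                  # TRUE
all(age == age2)                # TRUE: same values in the same cells
```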
Joining vectors by column, by rows:
x <- c(1, 2, 3, 4) #four elements
y <- c(20, 30, 40, 50) #four elements
cbind(x, y)
## x y
## [1,] 1 20
## [2,] 2 30
## [3,] 3 40
## [4,] 4 50
rbind(x, y)
## [,1] [,2] [,3] [,4]
## x 1 2 3 4
## y 20 30 40 50
a <- c(1, 2, 3) #three elements
b <- c(20, 30, 40, 50, 60) #five elements
cbind(a, b)
## Warning in cbind(a, b): number of rows of result is not a multiple of
## vector length (arg 1)
## a b
## [1,] 1 20
## [2,] 2 30
## [3,] 3 40
## [4,] 1 50
## [5,] 2 60
a <- c(1, 2, 3) #three elements
b <- c(20, 30, 40, 50, 60) #five elements
rbind(a, b)
## Warning in rbind(a, b): number of columns of result is not a multiple of
## vector length (arg 1)
## [,1] [,2] [,3] [,4] [,5]
## a 1 2 3 1 2
## b 20 30 40 50 60
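The warning only appears when the longer length is not a multiple of the shorter one. When it is a multiple, the shorter vector is recycled silently, so it can be worth comparing lengths first. A sketch; same_length is just an illustrative check, not something built into cbind():

```r
a <- c(1, 2)             # two elements
b <- c(20, 30, 40, 50)   # four elements

ab <- cbind(a, b)        # no warning: 4 is a multiple of 2
ab[, "a"]                # 1 2 1 2

# A simple guard before binding (illustration only)
same_length <- length(a) == length(b)
same_length              # FALSE
```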
r_matrix
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
summary(r_matrix)
## V1 V2 V3 V4
## Min. :1 Min. : 2 Min. : 3 Min. : 4
## 1st Qu.:3 1st Qu.: 4 1st Qu.: 5 1st Qu.: 6
## Median :5 Median : 6 Median : 7 Median : 8
## Mean :5 Mean : 6 Mean : 7 Mean : 8
## 3rd Qu.:7 3rd Qu.: 8 3rd Qu.: 9 3rd Qu.:10
## Max. :9 Max. :10 Max. :11 Max. :12
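summary() treats each column separately. If only one statistic is needed, colMeans(), rowSums() or apply() return it directly. A short sketch on the same matrix:

```r
r_matrix <- matrix(1:12, ncol = 4, byrow = TRUE)

colMeans(r_matrix)        # 5 6 7 8, the Mean row of summary()
rowSums(r_matrix)         # 10 26 42
apply(r_matrix, 2, max)   # 9 10 11 12, the Max. row (2 = work column by column)
```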
r_matrix
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
r_matrix[3, 2] <- 4
r_matrix
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 4 11 12
r_matrix[2, ] <- c(1,3)
r_matrix
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 1 3 1 3
## [3,] 9 4 11 12
r_matrix[1:2, 3:4] <- c(8, 4, 2, 1)
r_matrix
## [,1] [,2] [,3] [,4]
## [1,] 1 2 8 2
## [2,] 1 3 4 1
## [3,] 9 4 11 12
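Two rules explain the results above: values that are too short are recycled, and a block of cells is filled column by column. A sketch on a made-up matrix of zeros (m is not part of the original example):

```r
m <- matrix(0, nrow = 3, ncol = 4)   # hypothetical matrix of zeros

m[2, ] <- c(1, 3)                # 2 values into 4 cells: recycled as 1 3 1 3
m[1:2, 3:4] <- c(8, 4, 2, 1)     # column 3 gets 8, 4; then column 4 gets 2, 1

m[2, ]      # 1 3 4 1
m[1, 3:4]   # 8 2
```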
#Remember to use "" for strings (text) as names
rownames(r_matrix) <- c("row1", "row2", "row3")
colnames(r_matrix) <- c("col1", "col2", "col3", "col4")
r_matrix
## col1 col2 col3 col4
## row1 1 2 8 2
## row2 1 3 4 1
## row3 9 4 11 12
#If we make a mistake, fix it easily with:
colnames(r_matrix)[3] <- "3rd"
r_matrix
## col1 col2 3rd col4
## row1 1 2 8 2
## row2 1 3 4 1
## row3 9 4 11 12
r_matrix[,c("col2", "3rd")] #instead of r_matrix[, c(2,3)]
## col2 3rd
## row1 2 8
## row2 3 4
## row3 4 11
r_matrix["row1",] #instead of r_matrix[1,]
## col1 col2 3rd col4
## 1 2 8 2
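Names also work for a single cell, and dimnames() returns both sets of names at once. A sketch on a freshly built matrix (so the values are the original 1 to 12, not the edited ones above):

```r
r_matrix <- matrix(1:12, ncol = 4, byrow = TRUE)
rownames(r_matrix) <- c("row1", "row2", "row3")
colnames(r_matrix) <- c("col1", "col2", "col3", "col4")

r_matrix["row3", "col2"]         # one cell by name -> 10
dimnames(r_matrix)               # row names and column names in one list
"col5" %in% colnames(r_matrix)   # FALSE: check a name before using it
```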
Standard arithmetic operations work element by element, e.g.
r_matrix + 4
## col1 col2 3rd col4
## row1 5 6 12 6
## row2 5 7 8 5
## row3 13 8 15 16
r_matrix * 12
## col1 col2 3rd col4
## row1 12 24 96 24
## row2 12 36 48 12
## row3 108 48 132 144
matrix_one <- matrix(1:4, nrow=2, ncol=4, byrow=TRUE)
#matrix: 1, 2, 3, 4 in two rows
matrix_two <- matrix(10:13, nrow=2, ncol=4, byrow=TRUE)
#matrix: 10, 11, 12, 13 in two rows
matrix_one + matrix_two
## [,1] [,2] [,3] [,4]
## [1,] 11 13 15 17
## [2,] 11 13 15 17
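The * operator also works cell by cell on two matrices of the same shape. The matrix product from linear algebra is a different operator, %*%. A brief sketch using the two matrices above:

```r
matrix_one <- matrix(1:4, nrow = 2, ncol = 4, byrow = TRUE)
matrix_two <- matrix(10:13, nrow = 2, ncol = 4, byrow = TRUE)

matrix_one * matrix_two        # cell by cell: each row is 10 22 36 52
matrix_one %*% t(matrix_two)   # matrix product (2 x 4 times 4 x 2): every cell is 120
```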
df <- data.frame(name = c("ash","jane","paul","mark"),
score = c(67,56,87,91))
df
## name score
## 1 ash 67
## 2 jane 56
## 3 paul 87
## 4 mark 91
nrow(df)
## [1] 4
ncol(df)
## [1] 2
dim(df)
## [1] 4 2
nrow(), ncol() and dim() work the same way as for a matrix
str(data.frame(name = c("ash","jane","paul","mark"),
score = c(67,56,87,91)))
## 'data.frame': 4 obs. of 2 variables:
## $ name : Factor w/ 4 levels "ash","jane","mark",..: 1 2 4 3
## $ score: num 67 56 87 91
str(df) shows the structure of a data frame
lists the variables in the df and their class
“name” = factor (categorical variable)
“score” = numeric (continuous variable)
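Continuous variables: can take any numeric value in a range, e.g. 1, 2, 3.5, 4.66 etc.
Categorical variables: take only one of a fixed set of categories, e.g. "dog", "bunny" or "yes", "no".
In R, categorical variables are stored as factors.
In our df, name is a factor variable having 4 unique levels. (data.frame() turns strings into factors by default only in R versions before 4.0.0; in later versions, pass stringsAsFactors = TRUE to get this output.)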
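A small sketch of what a factor stores: a sorted set of levels plus an integer code per value. The vector is built directly with factor(), so it behaves the same in any R version.

```r
name <- factor(c("ash", "jane", "paul", "mark"))

levels(name)         # "ash" "jane" "mark" "paul" (sorted alphabetically)
nlevels(name)        # 4
as.integer(name)     # 1 2 4 3, the codes shown by str()
as.character(name)   # back to plain text
```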
Missing values in R are represented by NA; NaN (“not a number”, e.g. the result of 0/0) also counts as missing in is.na(). Now we’ll check whether a data set has missing values (using the same data frame df).
df[1:2, 2] <- NA #injecting NA into rows 1 and 2 of column 2
df
## name score
## 1 ash NA
## 2 jane NA
## 3 paul 87
## 4 mark 91
is.na(df) #checks the entire data set for NA's, returns a logical matrix
## name score
## [1,] FALSE TRUE
## [2,] FALSE TRUE
## [3,] FALSE FALSE
## [4,] FALSE FALSE
table(is.na(df)) #returns a table of the logical output
##
## FALSE TRUE
## 6 2
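The table above counts NAs over the whole data set. To see where they are, count per column or ask for the row positions. A sketch that rebuilds the same df with the two NAs already in place:

```r
df <- data.frame(name = c("ash", "jane", "paul", "mark"),
                 score = c(NA, NA, 87, 91))

anyNA(df)                # TRUE: at least one value is missing
colSums(is.na(df))       # name 0, score 2
which(is.na(df$score))   # 1 2: the rows with a missing score
```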
# returns the rows that have no missing values
df[complete.cases(df), ]
## name score
## 3 paul 87
## 4 mark 91
# returns the rows that have at least one missing value (! means NOT)
df[!complete.cases(df), ]
## name score
## 1 ash NA
## 2 jane NA
Missing values are bad for calculations: we can’t get the mean of a set of data if values are missing. R does not “automagically” convert NAs to 0.
mean(df[, 2])
## [1] NA
mean(df[, 2], na.rm = TRUE)
## [1] 89
We tell R to ignore NAs using na.rm = TRUE, i.e. remove NAs before calculating
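Why it matters that NA is not treated as 0: the two give different answers. A sketch with the same four scores; score_zero is a made-up copy used only for the comparison.

```r
score <- c(NA, NA, 87, 91)

NA + 1                       # NA: an unknown value stays unknown
sum(score, na.rm = TRUE)     # 178
mean(score, na.rm = TRUE)    # 89, the mean of the 2 known values

score_zero <- score          # hypothetical copy with NAs replaced by 0
score_zero[is.na(score_zero)] <- 0
mean(score_zero)             # 44.5, a different answer
```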
Dropping the rows with NAs into a new data frame:
new_df <- na.omit(df)
new_df
## name score
## 3 paul 87
## 4 mark 91
df
## name score
## 1 ash NA
## 2 jane NA
## 3 paul 87
## 4 mark 91
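As the output shows, na.omit() leaves df untouched and new_df keeps the original row labels 3 and 4. A sketch of how to check this and, if wanted, renumber the rows:

```r
df <- data.frame(name = c("ash", "jane", "paul", "mark"),
                 score = c(NA, NA, 87, 91))
new_df <- na.omit(df)

nrow(df)                   # 4: the original is unchanged
nrow(new_df)               # 2
rownames(new_df)           # "3" "4": the original row labels are kept
rownames(new_df) <- NULL   # optional: renumber the rows from 1
rownames(new_df)           # "1" "2"
```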
summary(df)
## name score
## ash :1 Min. :87
## jane:1 1st Qu.:88
## mark:1 Median :89
## paul:1 Mean :89
## 3rd Qu.:90
## Max. :91
## NA's :2
Joining a character vector, a numeric vector and a date vector
employee <- c("John Doe", "Peter Gynn", "Jolie Hope")
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1', '2008-3-25', '2007-3-14'))
employ_data <- data.frame(employee, salary, startdate)
employ_data
## employee salary startdate
## 1 John Doe 21000 2010-11-01
## 2 Peter Gynn 23400 2008-03-25
## 3 Jolie Hope 26800 2007-03-14
Always check that you’ve done what you intended, using str() or summary()
str(employ_data)
## 'data.frame': 3 obs. of 3 variables:
## $ employee : Factor w/ 3 levels "John Doe","Jolie Hope",..: 1 3 2
## $ salary : num 21000 23400 26800
## $ startdate: Date, format: "2010-11-01" "2008-03-25" ...
We didn’t want the names as factors, though; data.frame() did this by default here (the default in R versions before 4.0.0). How do we fix that?
employ_data <- data.frame(employee, salary, startdate,
stringsAsFactors = FALSE)
str(employ_data)
## 'data.frame': 3 obs. of 3 variables:
## $ employee : chr "John Doe" "Peter Gynn" "Jolie Hope"
## $ salary : num 21000 23400 26800
## $ startdate: Date, format: "2010-11-01" "2008-03-25" ...
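If the data frame already exists with a factor column, the column can also be converted afterwards with as.character(). A sketch; stringsAsFactors = TRUE is set here only to force the factor column in every R version.

```r
employee <- c("John Doe", "Peter Gynn", "Jolie Hope")
salary <- c(21000, 23400, 26800)
# stringsAsFactors = TRUE forces the factor column in every R version
employ_data <- data.frame(employee, salary, stringsAsFactors = TRUE)

is.factor(employ_data$employee)      # TRUE
employ_data$employee <- as.character(employ_data$employee)
is.character(employ_data$employee)   # TRUE
```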
employ_data[1, 2] #values from row1, column 2
## [1] 21000
employ_data[, 2] #values from column 2
## [1] 21000 23400 26800
But why remember column numbers if we’ve given them names? No need, it’s better and far easier just to use the names! Change: df[row, column] to df$column[row]
employ_data$salary[1] #values from row1, column 2
## [1] 21000
employ_data$salary #values from column 2
## [1] 21000 23400 26800
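Names can also be used inside the square brackets, and a condition can pick rows. A sketch on a cut-down copy of the data (no start dates); high is a made-up name.

```r
employ_data <- data.frame(employee = c("John Doe", "Peter Gynn", "Jolie Hope"),
                          salary = c(21000, 23400, 26800))

employ_data[["salary"]]     # same as employ_data$salary
employ_data[2, "salary"]    # row 2 of the salary column -> 23400

high <- employ_data[employ_data$salary > 22000, ]   # hypothetical name: rows by condition
nrow(high)                  # 2
```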
John Doe gets a raise…
employ_data$salary[1] <- 32000
employ_data
## employee salary startdate
## 1 John Doe 32000 2010-11-01
## 2 Peter Gynn 23400 2008-03-25
## 3 Jolie Hope 26800 2007-03-14
We need extra info:
early_riser <- c("yes", "no", "yes")
employ_data_updated <- cbind(employ_data, early_riser)
employ_data_updated
## employee salary startdate early_riser
## 1 John Doe 32000 2010-11-01 yes
## 2 Peter Gynn 23400 2008-03-25 no
## 3 Jolie Hope 26800 2007-03-14 yes
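A new column can also be added in place with $, as long as it has one value per row. A sketch on a cut-down copy of the data (no start dates):

```r
employ_data <- data.frame(employee = c("John Doe", "Peter Gynn", "Jolie Hope"),
                          salary = c(32000, 23400, 26800))

employ_data$early_riser <- c("yes", "no", "yes")   # one value per row
names(employ_data)                                 # "employee" "salary" "early_riser"
employ_data$early_riser[employ_data$employee == "Peter Gynn"]   # "no"
```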
I’d love to give an example of how we use all of this in real life. Task: clean the data in an Excel spreadsheet, then reshape and restructure it so that we can use it in Tableau. Spoiler alert: this is going to look difficult now, but by the end of this series you’ll do it easily! See here: Excel To Tableau