Working in More Dimensions

Vectors have one dimension: a single sequence (a row) of items, all of the same class

Matrix <- (row vectors & column vectors) of the same class, e.g. a “table” containing only numbers

##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8

Data Frame <- columns of different classes, each the same length, e.g. a “table” with strings, numbers, integers, dates, factors etc.

##     Name Age   Pet
## 1  Butch   4   dog
## 2 Fluffy   2 bunny
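A minimal sketch of the difference, reusing the made-up pets from the table above (the object names `pets_matrix` and `pets_df` are only for illustration): a matrix holds a single class, so mixing text and numbers turns everything into text, while a data frame keeps a class per column.

```r
# A matrix can hold only one class: mixing text and numbers coerces to character
pets_matrix <- cbind(Name = c("Butch", "Fluffy"), Age = c(4, 2))
matrix_type <- typeof(pets_matrix)       # "character": the ages became text

# A data frame keeps a separate class for each column
pets_df <- data.frame(Name = c("Butch", "Fluffy"), Age = c(4, 2))
column_classes <- sapply(pets_df, class)
age_class <- column_classes[["Age"]]     # "numeric": the ages stay numbers
```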

Matrices

Seeing number of rows and columns

dim(matrix(1:12, ncol=4))
## [1] 3 4
attributes(matrix(1:12, ncol=4))$dim
## [1] 3 4

Both give 3 rows, 4 columns

Always have rows first, then columns: matrix[nrows, ncols]

Accessing data from a matrix

r_matrix <- matrix(1:12, ncol=4, byrow=TRUE) #filled row by row, creation is covered next
r_matrix[2,] #second row
## [1] 5 6 7 8
r_matrix[1,] #first row
## [1] 1 2 3 4
r_matrix[,2] #second column
## [1]  2  6 10
r_matrix[,1] #first column
## [1] 1 5 9
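The same [row, column] rule also picks out a single value. A small sketch, assuming the same 3 x 4 matrix; it also shows `drop = FALSE`, which keeps the result as a matrix instead of simplifying it to a plain vector.

```r
r_matrix <- matrix(1:12, ncol = 4, byrow = TRUE)

one_value <- r_matrix[2, 3]                    # row 2, column 3

row_as_vector <- r_matrix[2, ]                 # dimensions are dropped
row_as_matrix <- r_matrix[2, , drop = FALSE]   # still a 1 x 4 matrix

dim(row_as_vector)   # NULL
dim(row_as_matrix)   # 1 4
```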

Matrix creation

c_matrix <- matrix(1:12, ncol=4)
c_matrix
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
r_matrix <- matrix(1:12, ncol=4, byrow=TRUE)
r_matrix
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12

Matrix creation cont.

Vector to Matrix:

age <- c(23, 44, 15, 12, 31, 16)
age
## [1] 23 44 15 12 31 16
dim(age) <- c(2, 3) #rows first, columns second
age
##      [,1] [,2] [,3]
## [1,]   23   15   31
## [2,]   44   12   16
class(age)
## [1] "matrix"
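As a sketch, setting `dim()` on a vector should give the same result as calling `matrix()` directly. Note that from R 4.0.0 onwards `class()` on a matrix returns both "matrix" and "array", so `is.matrix()` is the safer check across versions.

```r
age <- c(23, 44, 15, 12, 31, 16)

by_dim <- age
dim(by_dim) <- c(2, 3)                        # rows first, columns second

by_matrix <- matrix(age, nrow = 2, ncol = 3)  # filled column by column

same_result <- identical(by_dim, by_matrix)
is_a_matrix <- is.matrix(by_dim)              # works in old and new R versions
```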

Matrix creation cont.

Joining vectors by column, by rows:

x <- c(1, 2, 3, 4) #four elements
y <- c(20, 30, 40, 50) #four elements
cbind(x, y)
##      x  y
## [1,] 1 20
## [2,] 2 30
## [3,] 3 40
## [4,] 4 50
rbind(x, y)
##   [,1] [,2] [,3] [,4]
## x    1    2    3    4
## y   20   30   40   50

Matrix creation what ifs…

a <- c(1, 2, 3) #three elements
b <- c(20, 30, 40, 50, 60) #five elements
cbind(a, b)
## Warning in cbind(a, b): number of rows of result is not a multiple of
## vector length (arg 1)
##      a  b
## [1,] 1 20
## [2,] 2 30
## [3,] 3 40
## [4,] 1 50
## [5,] 2 60

Matrix creation what ifs…

a <- c(1, 2, 3) #three elements
b <- c(20, 30, 40, 50, 60) #five elements
rbind(a, b)
## Warning in rbind(a, b): number of columns of result is not a multiple of
## vector length (arg 1)
##   [,1] [,2] [,3] [,4] [,5]
## a    1    2    3    1    2
## b   20   30   40   50   60
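In both what-ifs R recycles the shorter vector and only warns about it. If recycled values are not what you want, one option (a sketch, not the only way) is to pad the shorter vector with NA before binding; the name `a_padded` is just for illustration.

```r
a <- c(1, 2, 3)                  # three elements
b <- c(20, 30, 40, 50, 60)       # five elements

# Extending the length of a vector fills the new positions with NA
a_padded <- a
length(a_padded) <- max(length(a), length(b))

padded <- cbind(a = a_padded, b = b)   # no warning, no recycled values
```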

Matrix Summaries

r_matrix
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12
summary(r_matrix)
##        V1          V2           V3           V4    
##  Min.   :1   Min.   : 2   Min.   : 3   Min.   : 4  
##  1st Qu.:3   1st Qu.: 4   1st Qu.: 5   1st Qu.: 6  
##  Median :5   Median : 6   Median : 7   Median : 8  
##  Mean   :5   Mean   : 6   Mean   : 7   Mean   : 8  
##  3rd Qu.:7   3rd Qu.: 8   3rd Qu.: 9   3rd Qu.:10  
##  Max.   :9   Max.   :10   Max.   :11   Max.   :12
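summary() works column by column. If you only need one statistic, or want it per row, a sketch using base R helpers looks like this:

```r
r_matrix <- matrix(1:12, ncol = 4, byrow = TRUE)

col_means <- colMeans(r_matrix)        # one mean per column
row_sums  <- rowSums(r_matrix)         # one total per row
col_max   <- apply(r_matrix, 2, max)   # 2 = apply the function over columns
row_max   <- apply(r_matrix, 1, max)   # 1 = apply the function over rows
```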

Matrices: Replacing Values

r_matrix
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12
r_matrix[3, 2] <- 4
r_matrix
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9    4   11   12

Matrices: Replacing Values cont

r_matrix[2, ] <- c(1,3)
r_matrix
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    1    3    1    3
## [3,]    9    4   11   12
r_matrix[1:2, 3:4] <- c(8, 4, 2, 1)
r_matrix
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    8    2
## [2,]    1    3    4    1
## [3,]    9    4   11   12
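Values can also be replaced by condition rather than by position. A small sketch, starting again from the original matrix and using an arbitrary cut-off of 10:

```r
r_matrix <- matrix(1:12, ncol = 4, byrow = TRUE)

# The logical test picks every cell greater than 10; those cells become 0
r_matrix[r_matrix > 10] <- 0

third_row <- r_matrix[3, ]
```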

Matrices: Naming Columns and Rows

#Remember to use "" for strings (text) as names
rownames(r_matrix) <- c("row1", "row2", "row3") 
colnames(r_matrix) <- c("col1", "col2", "col3", "col4")
r_matrix
##      col1 col2 col3 col4
## row1    1    2    8    2
## row2    1    3    4    1
## row3    9    4   11   12
#If we make a mistake, fix it easily with:
colnames(r_matrix)[3] <- "3rd"
r_matrix
##      col1 col2 3rd col4
## row1    1    2   8    2
## row2    1    3   4    1
## row3    9    4  11   12

Matrices: Using Names as Indices

r_matrix[,c("col2", "3rd")] #instead of r_matrix[, c(2,3)]
##      col2 3rd
## row1    2   8
## row2    3   4
## row3    4  11
r_matrix["row1",] #instead of r_matrix[1,]
## col1 col2  3rd col4 
##    1    2    8    2

Matrix Calculations

Standard arithmetic works element by element, e.g.

r_matrix + 4
##      col1 col2 3rd col4
## row1    5    6  12    6
## row2    5    7   8    5
## row3   13    8  15   16
r_matrix * 12
##      col1 col2 3rd col4
## row1   12   24  96   24
## row2   12   36  48   12
## row3  108   48 132  144

Adding Matrices Together

matrix_one <- matrix(1:4, nrow=2, ncol=4, byrow=TRUE)
#matrix: 1, 2, 3, 4 recycled to fill both rows
matrix_two <- matrix(10:13, nrow=2, ncol=4, byrow=TRUE)
#matrix: 10, 11, 12, 13 recycled to fill both rows
matrix_one + matrix_two
##      [,1] [,2] [,3] [,4]
## [1,]   11   13   15   17
## [2,]   11   13   15   17
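Adding works cell by cell, so both matrices need the same dimensions. A sketch of what happens otherwise (the 2 x 2 matrix `small` is made up for the example):

```r
matrix_one <- matrix(1:4, nrow = 2, ncol = 4, byrow = TRUE)
matrix_two <- matrix(10:13, nrow = 2, ncol = 4, byrow = TRUE)

# Multiplication with * is also cell by cell
elementwise_product <- matrix_one * matrix_two

# A 2 x 2 matrix cannot be added to a 2 x 4 matrix: R stops with an error
small <- matrix(1:4, nrow = 2)
result <- tryCatch(matrix_one + small,
                   error = function(e) conditionMessage(e))
```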

Data Frames

df <- data.frame(name = c("ash","jane","paul","mark"), 
                 score = c(67,56,87,91))
df
##   name score
## 1  ash    67
## 2 jane    56
## 3 paul    87
## 4 mark    91

Data Frame Dimensions

nrow(df)
## [1] 4
ncol(df)
## [1] 2
dim(df)
## [1] 4 2

Same as for a matrix

Data Frame Structure

str(data.frame(name = c("ash","jane","paul","mark"), 
                 score = c(67,56,87,91)))
## 'data.frame':    4 obs. of  2 variables:
##  $ name : Factor w/ 4 levels "ash","jane","mark",..: 1 2 4 3
##  $ score: num  67 56 87 91

str(df) shows the structure of a data frame

lists the variables in the df and their class

“name” = factor (categorical variable)

“score” = numeric (continuous variable)

Aside - Variable Type

Continuous variables: can take any numeric value e.g. 1, 2, 3.5, 4.66 etc.

Categorical variables: take only a limited set of distinct values (categories) e.g. “dog”, “bunny”.

In R, categorical variables are stored as factors.

In our df, name is a factor variable having 4 unique levels.
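A short sketch of what a factor stores, building the name column by hand: the unique values become levels (sorted alphabetically by default), and each entry is kept as an integer code pointing at its level.

```r
name <- factor(c("ash", "jane", "paul", "mark"))

name_levels <- levels(name)       # the unique categories
n_levels    <- nlevels(name)      # how many categories there are
codes       <- as.integer(name)   # position of each entry within the levels
```

The codes match the `1 2 4 3` shown by str() above.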

Aside 2 - Missing Values

Missing values in R are represented by NA (not available); NaN (not a number) is also treated as missing by is.na(). Now we’ll check whether a data set has missing values, using the same data frame df.

df[1:2, 2] <- NA #injecting NA at 1st, 2nd row and 2nd column
df
##   name score
## 1  ash    NA
## 2 jane    NA
## 3 paul    87
## 4 mark    91

is.na(df) #checks the entire data set for NA's, returns a logical matrix
##       name score
## [1,] FALSE  TRUE
## [2,] FALSE  TRUE
## [3,] FALSE FALSE
## [4,] FALSE FALSE
table(is.na(df)) #returns a table of the logical output
## 
## FALSE  TRUE 
##     6     2
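The table gives the total, but not where the NA's are. A sketch that counts them per column instead, rebuilding the same df so the block runs on its own:

```r
df <- data.frame(name = c("ash", "jane", "paul", "mark"),
                 score = c(67, 56, 87, 91))
df[1:2, 2] <- NA

# TRUE counts as 1, so summing the logical matrix by column counts the NA's
na_per_column <- colSums(is.na(df))

any_missing <- any(is.na(df))   # a quick yes/no check
```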

Finding Valid Values

# returns a data frame of the rows with no missing values
df[complete.cases(df), ]
##   name score
## 3 paul    87
## 4 mark    91
# returns a data frame of the rows that do have (! = NOT complete) missing values
df[!complete.cases(df), ]
##   name score
## 1  ash    NA
## 2 jane    NA

NA’s and calculations

Missing values are bad for calculations: we can’t get the mean of a set of data if values are missing. R does not “automagically” convert NA’s to 0.

mean(df[, 2])
## [1] NA
mean(df[, 2], na.rm = TRUE)
## [1] 89

We tell R to ignore NA’s using na.rm = TRUE, i.e. remove NA’s before calculating
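The same na.rm argument is accepted by other summary functions too. A small sketch using the scores as a plain vector:

```r
score <- c(NA, NA, 87, 91)

total_with_na <- sum(score)                 # NA: one missing value spoils the sum
total         <- sum(score, na.rm = TRUE)   # NA's removed first
highest       <- max(score, na.rm = TRUE)
n_valid       <- sum(!is.na(score))         # how many values were actually used
```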

Removing NA’s from data

new_df <- na.omit(df)
new_df
##   name score
## 3 paul    87
## 4 mark    91

Data Frame Summaries

df
##   name score
## 1  ash    NA
## 2 jane    NA
## 3 paul    87
## 4 mark    91
summary(df)
##    name       score   
##  ash :1   Min.   :87  
##  jane:1   1st Qu.:88  
##  mark:1   Median :89  
##  paul:1   Mean   :89  
##           3rd Qu.:90  
##           Max.   :91  
##           NA's   :2

Data Frame: Creating from Scratch

Joining a character vector, a numeric vector and a date vector

employee <- c("John Doe", "Peter Gynn", "Jolie Hope")
salary <- c(21000, 23400, 26800)
startdate <- as.Date(c('2010-11-1', '2008-3-25', '2007-3-14'))
employ_data <- data.frame(employee, salary, startdate)
employ_data
##     employee salary  startdate
## 1   John Doe  21000 2010-11-01
## 2 Peter Gynn  23400 2008-03-25
## 3 Jolie Hope  26800 2007-03-14

Checking Data Frame Creation

Always make sure you’ve done what you wanted to using str() or summary()

str(employ_data)
## 'data.frame':    3 obs. of  3 variables:
##  $ employee : Factor w/ 3 levels "John Doe","Jolie Hope",..: 1 3 2
##  $ salary   : num  21000 23400 26800
##  $ startdate: Date, format: "2010-11-01" "2008-03-25" ...

When R doesn’t know what we wanted

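We didn’t want the names stored as factors, but data.frame() did this by default (in R versions before 4.0.0; from 4.0.0 on, the default is stringsAsFactors = FALSE). How do we fix that?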

employ_data <- data.frame(employee, salary, startdate, 
                          stringsAsFactors = FALSE)
str(employ_data)
## 'data.frame':    3 obs. of  3 variables:
##  $ employee : chr  "John Doe" "Peter Gynn" "Jolie Hope"
##  $ salary   : num  21000 23400 26800
##  $ startdate: Date, format: "2010-11-01" "2008-03-25" ...
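If the data frame already exists with a factor column, you don't have to rebuild it. A sketch of converting the column afterwards; stringsAsFactors = TRUE is set explicitly here only to reproduce the old default on any R version.

```r
employee <- c("John Doe", "Peter Gynn", "Jolie Hope")
salary <- c(21000, 23400, 26800)

# Force the old behaviour so the column starts out as a factor
employ_data <- data.frame(employee, salary, stringsAsFactors = TRUE)
class_before <- class(employ_data$employee)

# Convert the factor back to plain text
employ_data$employee <- as.character(employ_data$employee)
class_after <- class(employ_data$employee)
```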

Data Frame: Getting Values Out

employ_data[1, 2] #values from row1, column 2
## [1] 21000
employ_data[, 2] #values from column 2
## [1] 21000 23400 26800

Data Frame: Getting Values Out An Easier Way

But why remember column numbers if we’ve given them names? No need, it’s better and far easier just to use the names! Change: df[row, column] to df$column[row]

employ_data$salary[1] #value from row 1 of the salary column (column 2)
## [1] 21000
employ_data$salary #values from column 2
## [1] 21000 23400 26800
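Names also work inside the square brackets, and a condition can pick the rows for you. A sketch with the same employee data (the 23000 cut-off is arbitrary):

```r
employ_data <- data.frame(employee = c("John Doe", "Peter Gynn", "Jolie Hope"),
                          salary = c(21000, 23400, 26800),
                          stringsAsFactors = FALSE)

by_dollar   <- employ_data$salary          # column by name with $
by_brackets <- employ_data[["salary"]]     # same column with [[ ]]
by_name     <- employ_data[, "salary"]     # same column with [row, column]

# Names of everyone earning more than 23000
high_earners <- employ_data$employee[employ_data$salary > 23000]
```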

Data Frame: Manipulating Values

John Doe gets a raise…

employ_data$salary[1] <- 32000
employ_data
##     employee salary  startdate
## 1   John Doe  32000 2010-11-01
## 2 Peter Gynn  23400 2008-03-25
## 3 Jolie Hope  26800 2007-03-14

Data Frame: Adding Values

We need extra info:

early_riser <- c("yes", "no", "yes")
employ_data_updated <- cbind(employ_data, early_riser)
employ_data_updated
##     employee salary  startdate early_riser
## 1   John Doe  32000 2010-11-01         yes
## 2 Peter Gynn  23400 2008-03-25          no
## 3 Jolie Hope  26800 2007-03-14         yes
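Two other ways to add data, as a sketch: assign a new column with $, and add a new row with rbind(). The extra employee "Sam Example" is invented purely for illustration.

```r
employ_data <- data.frame(employee = c("John Doe", "Peter Gynn", "Jolie Hope"),
                          salary = c(32000, 23400, 26800),
                          stringsAsFactors = FALSE)

# New column: assign a vector with one value per row
employ_data$early_riser <- c("yes", "no", "yes")

# New row: a one-row data frame with the same column names (hypothetical person)
new_row <- data.frame(employee = "Sam Example", salary = 25000,
                      early_riser = "no", stringsAsFactors = FALSE)
employ_data <- rbind(employ_data, new_row)
```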

Finally, Why do all this?

I’d love to give an example of how we use all of this in real life. Task: clean data in an Excel spreadsheet, then reshape and restructure it so that we can use it in Tableau. Spoiler alert: this is going to look difficult now, but by the end of this series you’ll do it easily! See here: Excel To Tableau