Jester - Collaborative Filtering

Starting Up
Data Preparation
Create Similarity Matrix
Neighborhood and Weighted Mean
Execution
Vectorized Implementation

Starting Up

The ppt slides explaining the background behind the data set and collaborative filtering methods can be found here:

https://raw.githubusercontent.com/Tyllis/Data607/master/jester-collaborative-filtering.pptx

Load the packages and the file into R.

library(dplyr)
library(knitr)
theURL <- "https://raw.githubusercontent.com/Tyllis/Data607/master/jester-data-1.csv"
jester <- read.csv(theURL, header = F, stringsAsFactors = F)

Data Preparation

Save the 1st column in num_rate, then removed it from the data.

# Save 1st column
num_rate <- jester[,1]   
# Remove 1st column
jester <- jester[,-1]

Inspect the data:

# Check dimension
dim(jester)

## [1] 24983   100

num_row <- dim(jester)[1]
num_col <- dim(jester)[2]
# Rename rows and columns
names(jester) <- c(1:num_col)
row.names(jester) <- c(1:num_row)
# Show portion of data
kable(jester[1:10,1:15])

1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
-7.82	8.79	-9.66	-8.16	-7.52	-8.50	-9.85	4.17	-8.98	-4.76	-8.50	-6.75	-7.18	8.45	-7.18
4.08	-0.29	6.36	4.37	-2.38	-9.66	-0.73	-5.34	8.88	9.22	6.75	8.64	4.42	7.43	4.56
99.00	99.00	99.00	99.00	9.03	9.27	9.03	9.27	99.00	99.00	7.33	7.57	9.37	6.17	-6.36
99.00	8.35	99.00	99.00	1.80	8.16	-2.82	6.21	99.00	1.84	7.33	6.60	6.31	8.11	-7.23
8.50	4.61	-4.17	-5.39	1.36	1.60	7.04	4.61	-0.44	5.73	8.25	6.84	-3.93	7.23	-2.33
-6.17	-3.54	0.44	-8.50	-7.09	-4.32	-8.69	-0.87	-6.65	-1.80	-6.80	-5.73	-5.00	-8.59	0.49
99.00	99.00	99.00	99.00	8.59	-9.85	7.72	8.79	99.00	99.00	4.27	7.62	-6.26	2.96	6.07
6.84	3.16	9.17	-6.21	-8.16	-1.70	9.27	1.41	-5.19	-4.42	8.20	-7.86	-6.94	-7.96	0.29
-3.79	-3.54	-9.42	-6.89	-8.74	-0.29	-5.29	-8.93	-7.86	-1.60	-2.91	-0.29	-4.85	-0.49	-8.74
3.01	5.15	5.15	3.01	6.41	5.15	8.93	2.52	3.01	8.16	5.53	6.02	4.47	5.44	-4.66

Replace all the value “99” by “NA”

# Replace all value 99 by NA
jester <- as.data.frame(apply(jester, 2, function(x) replace(x, x==99, NA)))

Save the indices of NA. The result is a list containing 24983 row elements, each element contains the column indices of NA.

# Save indices of NA
na_idx <- apply(jester, 1, function(x) which(is.na(x)))

Create Similarity Matrix

Next, create several helper functions.

# This function takes a vector and returns the magnitude of the vector.
mag <- function(x) sqrt(sum(x^2, na.rm=T))

# This function takes two vectors and returns the dot product of the two vectors.
dot <- function(x,y) sum(x*y, na.rm=T)

# This function takes two vectors and returns the cosine of the angles between the two vectors
cosine <- function(x,y) dot(x,y)/ (mag(x)*mag(y))

# Function that takes a vector and subtract the mean from each vector element.
center <- function(x) x-mean(x, na.rm=T)

# Function that takes two vectors and return the similarity of the vectors.
sim <- function(x,y) cosine(center(x), center(y))

Using the above helper functions, we can create the similarity matrix.

# Create similarity matrix
sim_mat <- c()
for (col in 1:num_col){
  sim_mat <- rbind(sim_mat, apply(jester, 2, sim, jester[,col]))
}

The result is a matrix sim_mat. Each row contains the similarities vector.

For example, the similarity between Joke #19 and Joke #66 is stored in row 19 col 66.

# Check sim_mat[19,66]
sim(jester[,19], jester[,66]) == sim_mat[19,66]

##   66 
## TRUE

Below code picks a pair of random jokes and check if the sim_mat stores the similarity value correctly.

# Check a pair of random jokes
random_jokes <- sample(1:num_col, size=2)
joke1 <- random_jokes[1]
joke2 <- random_jokes[2]
print(paste(joke1, joke2))

## [1] "62 55"

sim(jester[,joke1], jester[,joke2])

## [1] 0.2314762

sim_mat[joke1,joke2]

##        55 
## 0.2314762

Also, another check is that the similarity matrix should be symmetric, meaning that sim_mat[i,j] = sim_mat[j,i]

# Check symetric
sim_mat[joke1, joke2] == sim_mat[joke2, joke1]

##   55 
## TRUE

Lastly, the diagonal of the similarity matrix should be all ones, because the similarity between each joke and itself should be 1.

# Check diagnal of similiarity matrix
diag(sim_mat)

##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Check a portion of the matrix.

# Inspect the similarity matrix
dim(sim_mat)

## [1] 100 100

kable(sim_mat[1:10,1:15], digits = 3)

1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
1.000	0.363	0.356	0.284	0.142	0.185	0.162	0.154	0.319	0.305	0.236	0.195	0.130	0.170	0.143
0.363	1.000	0.280	0.300	0.176	0.209	0.171	0.105	0.234	0.255	0.263	0.281	0.157	0.287	0.115
0.356	0.280	1.000	0.362	0.177	0.244	0.147	0.149	0.307	0.279	0.382	0.255	0.129	0.208	0.113
0.284	0.300	0.362	1.000	0.165	0.285	0.159	0.074	0.359	0.230	0.281	0.286	0.114	0.289	0.074
0.142	0.176	0.177	0.165	1.000	0.172	0.200	0.162	0.092	0.148	0.242	0.197	0.148	0.176	0.128
0.185	0.209	0.244	0.285	0.172	1.000	0.114	0.058	0.264	0.224	0.334	0.388	0.085	0.353	0.048
0.162	0.171	0.147	0.159	0.200	0.114	1.000	0.211	0.090	0.149	0.186	0.155	0.167	0.155	0.214
0.154	0.105	0.149	0.074	0.162	0.058	0.211	1.000	0.119	0.168	0.163	0.076	0.224	0.102	0.230
0.319	0.234	0.307	0.359	0.092	0.264	0.090	0.119	1.000	0.336	0.226	0.260	0.129	0.223	0.108
0.305	0.255	0.279	0.230	0.148	0.224	0.149	0.168	0.336	1.000	0.311	0.246	0.142	0.194	0.142

Neighborhood and Weighted Mean

Create a function that can search for the N-neighbors for a selected joke. The function takes three values:

x = a vector containing the similarity values. This vector can be retrieved from sim_mat by calling sim_mat[jokeID, ].
i = a vector containing the indices to exclude from the search. This vector is the indices of the jokes the user never rate before. This vector can be retrieved from na_idx created earlier, by calling na_idx[[userID]].
n = a integer indicating how many neighbors to look for.

# Function that takes a similarity vector x and a integer n and returns the top n neighbors but excluding the indices in the vector i. It returns a vector containing the similarity values, and the names of the vector are their indices.
n_neighbor <- function(x, i, n){
  # If na_idx is not empty, remove the indices from vector x
  if(length(na_idx[[i]])!=0) x=x[-na_idx[[i]]]
  # Sort x indecreasing order, take the top n+1 elements
  neighbors <- head(sort(x, decreasing = T), n+1)
  # Check if 1st element is 1. Remove it if it is.
  if(max(neighbors)==1){
    neighbors <- neighbors[-1]
  }else{
    neighbors <- neighbors[1:n]
  }
  # Return all Non-NA elements
  return(neighbors[-which(is.na(neighbors))])
}

Create a function that calculates the weighted mean. The function will perform the following task:

Given the userID, jokeID, and n, look for the n-neighbors of the joke relevant to the user
Retrieve the similarity values of the n-neighbors of the joke
Retrieve the rating values of the n-neighbors of the joke that the user had rated
Calculate the weighted mean.

# Function that takes a jokeID and a userID and estimate a rating based on collaborating filtering
cal_rating <- function(userID, jokeID, n){
  # look for n-neighbors
  neighbors <- n_neighbor(sim_mat[jokeID,], userID, n)  
  # retrieve the column indices of the neighbors
  idx <- as.integer(names(neighbors))
  # retrieve the rating values of the joke neighbors
  val <- jester[userID, idx]
  # calculate weighted mean
  weighted.mean(val, neighbors)
}

Execution

Now it is ready to fill in the “NA” cells in the jester table. A nested for-loop implementation can be used here. Here, I used n = 100 so I can compare with the vectorized implementation below. Predicted ratings are also saved in a list predict_ratings.

n <- 100
jester_cf <- jester
predict_ratings <- list()
for (i in 1:num_row){
  temp <- c()
  if (length(na_idx[[i]])!=0){ 
    for (j in na_idx[[i]]){
      val <- cal_rating(i,j,n)
      jester_cf[i,j] <- val
      temp <- c(temp, val)
    }
    names(temp) <- na_idx[[i]]
  }
  predict_ratings[[i]] <- temp
}

Let’s see portion of the data.

# Inspect portion of data
kable(jester_cf[1:10,1:15], digits = 2)

1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
-7.82	8.79	-9.66	-8.16	-7.52	-8.50	-9.85	4.17	-8.98	-4.76	-8.50	-6.75	-7.18	8.45	-7.18
4.08	-0.29	6.36	4.37	-2.38	-9.66	-0.73	-5.34	8.88	9.22	6.75	8.64	4.42	7.43	4.56
7.42	7.48	7.46	7.37	9.03	9.27	9.03	9.27	7.45	7.29	7.33	7.57	9.37	6.17	-6.36
3.00	8.35	3.34	3.45	1.80	8.16	-2.82	6.21	2.83	1.84	7.33	6.60	6.31	8.11	-7.23
8.50	4.61	-4.17	-5.39	1.36	1.60	7.04	4.61	-0.44	5.73	8.25	6.84	-3.93	7.23	-2.33
-6.17	-3.54	0.44	-8.50	-7.09	-4.32	-8.69	-0.87	-6.65	-1.80	-6.80	-5.73	-5.00	-8.59	0.49
4.57	4.64	4.50	4.42	8.59	-9.85	7.72	8.79	4.39	4.44	4.27	7.62	-6.26	2.96	6.07
6.84	3.16	9.17	-6.21	-8.16	-1.70	9.27	1.41	-5.19	-4.42	8.20	-7.86	-6.94	-7.96	0.29
-3.79	-3.54	-9.42	-6.89	-8.74	-0.29	-5.29	-8.93	-7.86	-1.60	-2.91	-0.29	-4.85	-0.49	-8.74
3.01	5.15	5.15	3.01	6.41	5.15	8.93	2.52	3.01	8.16	5.53	6.02	4.47	5.44	-4.66

Vectorized Implementation

Alternatively, a vectorized implementation can be used and is much more efficient.

First, create a 24983 x 100 matrix that fills all of the non-NA cells of jester with 1, and all NA cells with 0.

# Create a matrix with just one and zero
jester_oz <- as.matrix(jester)
jester_oz[!is.na(jester_oz)] <- 1
jester_oz[is.na(jester_oz)] <- 0

Next, turn all of the NA cells of jester into 0.

# Turn all NA cells to zero
jester[is.na(jester)] <- 0
jester <- as.matrix(jester)

Perform matrix multiplication.

Here, jester x sim_mat will yield the sum of ratings*weights for each cell, while jester_oz x sim_mat will yield the sum of weights for that cell. Then, the element-wise multiplication between jester x sim_mat and 1/(jester_oz x sim_mat) will give us the weighted mean for each cell.

jester_vcf <- (jester %*% sim_mat) * 1/(jester_oz %*% sim_mat)

The predicted ratings can be pulled out.

predict_ratingsv <- list()
for (i in 1:num_row){
  predict_ratingsv[[i]] <- jester_vcf[i, na_idx[[i]]]
}

To check if the looping and the vectorized implementations are the same:

random_row <- sample(c(1:num_row), size=1)
random_row

## [1] 1479

round(predict_ratingsv[[random_row]] - predict_ratings[[random_row]], 5)

##  71  72  73  74  75  76  77  78  79  80  83  84  85  86  87  88  90  91 
##   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0 
##  92  93  94  95  96  97  98  99 100 
##   0   0   0   0   0   0   0   0   0

As you can see, they are the same.

The complete result can be found here:

https://raw.githubusercontent.com/Tyllis/Data607/master/jester_cf.csv