The ppt slides explaining the background behind the data set and collaborative filtering methods can be found here:
https://raw.githubusercontent.com/Tyllis/Data607/master/jester-collaborative-filtering.pptx
Load the packages and the file into R.
library(dplyr)
library(knitr)
theURL <- "https://raw.githubusercontent.com/Tyllis/Data607/master/jester-data-1.csv"
jester <- read.csv(theURL, header = F, stringsAsFactors = F)
Save the 1st column in num_rate, then removed it from the data.
# Save 1st column
num_rate <- jester[,1]
# Remove 1st column
jester <- jester[,-1]
Inspect the data:
# Check dimension
dim(jester)
## [1] 24983 100
num_row <- dim(jester)[1]
num_col <- dim(jester)[2]
# Rename rows and columns
names(jester) <- c(1:num_col)
row.names(jester) <- c(1:num_row)
# Show portion of data
kable(jester[1:10,1:15])
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -7.82 | 8.79 | -9.66 | -8.16 | -7.52 | -8.50 | -9.85 | 4.17 | -8.98 | -4.76 | -8.50 | -6.75 | -7.18 | 8.45 | -7.18 |
| 4.08 | -0.29 | 6.36 | 4.37 | -2.38 | -9.66 | -0.73 | -5.34 | 8.88 | 9.22 | 6.75 | 8.64 | 4.42 | 7.43 | 4.56 |
| 99.00 | 99.00 | 99.00 | 99.00 | 9.03 | 9.27 | 9.03 | 9.27 | 99.00 | 99.00 | 7.33 | 7.57 | 9.37 | 6.17 | -6.36 |
| 99.00 | 8.35 | 99.00 | 99.00 | 1.80 | 8.16 | -2.82 | 6.21 | 99.00 | 1.84 | 7.33 | 6.60 | 6.31 | 8.11 | -7.23 |
| 8.50 | 4.61 | -4.17 | -5.39 | 1.36 | 1.60 | 7.04 | 4.61 | -0.44 | 5.73 | 8.25 | 6.84 | -3.93 | 7.23 | -2.33 |
| -6.17 | -3.54 | 0.44 | -8.50 | -7.09 | -4.32 | -8.69 | -0.87 | -6.65 | -1.80 | -6.80 | -5.73 | -5.00 | -8.59 | 0.49 |
| 99.00 | 99.00 | 99.00 | 99.00 | 8.59 | -9.85 | 7.72 | 8.79 | 99.00 | 99.00 | 4.27 | 7.62 | -6.26 | 2.96 | 6.07 |
| 6.84 | 3.16 | 9.17 | -6.21 | -8.16 | -1.70 | 9.27 | 1.41 | -5.19 | -4.42 | 8.20 | -7.86 | -6.94 | -7.96 | 0.29 |
| -3.79 | -3.54 | -9.42 | -6.89 | -8.74 | -0.29 | -5.29 | -8.93 | -7.86 | -1.60 | -2.91 | -0.29 | -4.85 | -0.49 | -8.74 |
| 3.01 | 5.15 | 5.15 | 3.01 | 6.41 | 5.15 | 8.93 | 2.52 | 3.01 | 8.16 | 5.53 | 6.02 | 4.47 | 5.44 | -4.66 |
Replace all the value “99” by “NA”
# Replace all value 99 by NA
jester <- as.data.frame(apply(jester, 2, function(x) replace(x, x==99, NA)))
Save the indices of NA. The result is a list containing 24983 row elements, each element contains the column indices of NA.
# Save indices of NA
na_idx <- apply(jester, 1, function(x) which(is.na(x)))
Next, create several helper functions.
# This function takes a vector and returns the magnitude of the vector.
mag <- function(x) sqrt(sum(x^2, na.rm=T))
# This function takes two vectors and returns the dot product of the two vectors.
dot <- function(x,y) sum(x*y, na.rm=T)
# This function takes two vectors and returns the cosine of the angles between the two vectors
cosine <- function(x,y) dot(x,y)/ (mag(x)*mag(y))
# Function that takes a vector and subtract the mean from each vector element.
center <- function(x) x-mean(x, na.rm=T)
# Function that takes two vectors and return the similarity of the vectors.
sim <- function(x,y) cosine(center(x), center(y))
Using the above helper functions, we can create the similarity matrix.
# Create similarity matrix
sim_mat <- c()
for (col in 1:num_col){
sim_mat <- rbind(sim_mat, apply(jester, 2, sim, jester[,col]))
}
The result is a matrix sim_mat. Each row contains the similarities vector.
For example, the similarity between Joke #19 and Joke #66 is stored in row 19 col 66.
# Check sim_mat[19,66]
sim(jester[,19], jester[,66]) == sim_mat[19,66]
## 66
## TRUE
Below code picks a pair of random jokes and check if the sim_mat stores the similarity value correctly.
# Check a pair of random jokes
random_jokes <- sample(1:num_col, size=2)
joke1 <- random_jokes[1]
joke2 <- random_jokes[2]
print(paste(joke1, joke2))
## [1] "62 55"
sim(jester[,joke1], jester[,joke2])
## [1] 0.2314762
sim_mat[joke1,joke2]
## 55
## 0.2314762
Also, another check is that the similarity matrix should be symmetric, meaning that sim_mat[i,j] = sim_mat[j,i]
# Check symetric
sim_mat[joke1, joke2] == sim_mat[joke2, joke1]
## 55
## TRUE
Lastly, the diagonal of the similarity matrix should be all ones, because the similarity between each joke and itself should be 1.
# Check diagnal of similiarity matrix
diag(sim_mat)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
Check a portion of the matrix.
# Inspect the similarity matrix
dim(sim_mat)
## [1] 100 100
kable(sim_mat[1:10,1:15], digits = 3)
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.000 | 0.363 | 0.356 | 0.284 | 0.142 | 0.185 | 0.162 | 0.154 | 0.319 | 0.305 | 0.236 | 0.195 | 0.130 | 0.170 | 0.143 |
| 0.363 | 1.000 | 0.280 | 0.300 | 0.176 | 0.209 | 0.171 | 0.105 | 0.234 | 0.255 | 0.263 | 0.281 | 0.157 | 0.287 | 0.115 |
| 0.356 | 0.280 | 1.000 | 0.362 | 0.177 | 0.244 | 0.147 | 0.149 | 0.307 | 0.279 | 0.382 | 0.255 | 0.129 | 0.208 | 0.113 |
| 0.284 | 0.300 | 0.362 | 1.000 | 0.165 | 0.285 | 0.159 | 0.074 | 0.359 | 0.230 | 0.281 | 0.286 | 0.114 | 0.289 | 0.074 |
| 0.142 | 0.176 | 0.177 | 0.165 | 1.000 | 0.172 | 0.200 | 0.162 | 0.092 | 0.148 | 0.242 | 0.197 | 0.148 | 0.176 | 0.128 |
| 0.185 | 0.209 | 0.244 | 0.285 | 0.172 | 1.000 | 0.114 | 0.058 | 0.264 | 0.224 | 0.334 | 0.388 | 0.085 | 0.353 | 0.048 |
| 0.162 | 0.171 | 0.147 | 0.159 | 0.200 | 0.114 | 1.000 | 0.211 | 0.090 | 0.149 | 0.186 | 0.155 | 0.167 | 0.155 | 0.214 |
| 0.154 | 0.105 | 0.149 | 0.074 | 0.162 | 0.058 | 0.211 | 1.000 | 0.119 | 0.168 | 0.163 | 0.076 | 0.224 | 0.102 | 0.230 |
| 0.319 | 0.234 | 0.307 | 0.359 | 0.092 | 0.264 | 0.090 | 0.119 | 1.000 | 0.336 | 0.226 | 0.260 | 0.129 | 0.223 | 0.108 |
| 0.305 | 0.255 | 0.279 | 0.230 | 0.148 | 0.224 | 0.149 | 0.168 | 0.336 | 1.000 | 0.311 | 0.246 | 0.142 | 0.194 | 0.142 |
Create a function that can search for the N-neighbors for a selected joke. The function takes three values:
sim_mat by calling sim_mat[jokeID, ].na_idx created earlier, by calling na_idx[[userID]].# Function that takes a similarity vector x and a integer n and returns the top n neighbors but excluding the indices in the vector i. It returns a vector containing the similarity values, and the names of the vector are their indices.
n_neighbor <- function(x, i, n){
# If na_idx is not empty, remove the indices from vector x
if(length(na_idx[[i]])!=0) x=x[-na_idx[[i]]]
# Sort x indecreasing order, take the top n+1 elements
neighbors <- head(sort(x, decreasing = T), n+1)
# Check if 1st element is 1. Remove it if it is.
if(max(neighbors)==1){
neighbors <- neighbors[-1]
}else{
neighbors <- neighbors[1:n]
}
# Return all Non-NA elements
return(neighbors[-which(is.na(neighbors))])
}
Create a function that calculates the weighted mean. The function will perform the following task:
# Function that takes a jokeID and a userID and estimate a rating based on collaborating filtering
cal_rating <- function(userID, jokeID, n){
# look for n-neighbors
neighbors <- n_neighbor(sim_mat[jokeID,], userID, n)
# retrieve the column indices of the neighbors
idx <- as.integer(names(neighbors))
# retrieve the rating values of the joke neighbors
val <- jester[userID, idx]
# calculate weighted mean
weighted.mean(val, neighbors)
}
Now it is ready to fill in the “NA” cells in the jester table. A nested for-loop implementation can be used here. Here, I used n = 100 so I can compare with the vectorized implementation below. Predicted ratings are also saved in a list predict_ratings.
n <- 100
jester_cf <- jester
predict_ratings <- list()
for (i in 1:num_row){
temp <- c()
if (length(na_idx[[i]])!=0){
for (j in na_idx[[i]]){
val <- cal_rating(i,j,n)
jester_cf[i,j] <- val
temp <- c(temp, val)
}
names(temp) <- na_idx[[i]]
}
predict_ratings[[i]] <- temp
}
Let’s see portion of the data.
# Inspect portion of data
kable(jester_cf[1:10,1:15], digits = 2)
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -7.82 | 8.79 | -9.66 | -8.16 | -7.52 | -8.50 | -9.85 | 4.17 | -8.98 | -4.76 | -8.50 | -6.75 | -7.18 | 8.45 | -7.18 |
| 4.08 | -0.29 | 6.36 | 4.37 | -2.38 | -9.66 | -0.73 | -5.34 | 8.88 | 9.22 | 6.75 | 8.64 | 4.42 | 7.43 | 4.56 |
| 7.42 | 7.48 | 7.46 | 7.37 | 9.03 | 9.27 | 9.03 | 9.27 | 7.45 | 7.29 | 7.33 | 7.57 | 9.37 | 6.17 | -6.36 |
| 3.00 | 8.35 | 3.34 | 3.45 | 1.80 | 8.16 | -2.82 | 6.21 | 2.83 | 1.84 | 7.33 | 6.60 | 6.31 | 8.11 | -7.23 |
| 8.50 | 4.61 | -4.17 | -5.39 | 1.36 | 1.60 | 7.04 | 4.61 | -0.44 | 5.73 | 8.25 | 6.84 | -3.93 | 7.23 | -2.33 |
| -6.17 | -3.54 | 0.44 | -8.50 | -7.09 | -4.32 | -8.69 | -0.87 | -6.65 | -1.80 | -6.80 | -5.73 | -5.00 | -8.59 | 0.49 |
| 4.57 | 4.64 | 4.50 | 4.42 | 8.59 | -9.85 | 7.72 | 8.79 | 4.39 | 4.44 | 4.27 | 7.62 | -6.26 | 2.96 | 6.07 |
| 6.84 | 3.16 | 9.17 | -6.21 | -8.16 | -1.70 | 9.27 | 1.41 | -5.19 | -4.42 | 8.20 | -7.86 | -6.94 | -7.96 | 0.29 |
| -3.79 | -3.54 | -9.42 | -6.89 | -8.74 | -0.29 | -5.29 | -8.93 | -7.86 | -1.60 | -2.91 | -0.29 | -4.85 | -0.49 | -8.74 |
| 3.01 | 5.15 | 5.15 | 3.01 | 6.41 | 5.15 | 8.93 | 2.52 | 3.01 | 8.16 | 5.53 | 6.02 | 4.47 | 5.44 | -4.66 |
Alternatively, a vectorized implementation can be used and is much more efficient.
First, create a 24983 x 100 matrix that fills all of the non-NA cells of jester with 1, and all NA cells with 0.
# Create a matrix with just one and zero
jester_oz <- as.matrix(jester)
jester_oz[!is.na(jester_oz)] <- 1
jester_oz[is.na(jester_oz)] <- 0
Next, turn all of the NA cells of jester into 0.
# Turn all NA cells to zero
jester[is.na(jester)] <- 0
jester <- as.matrix(jester)
Perform matrix multiplication.
Here, jester x sim_mat will yield the sum of ratings*weights for each cell, while jester_oz x sim_mat will yield the sum of weights for that cell. Then, the element-wise multiplication between jester x sim_mat and 1/(jester_oz x sim_mat) will give us the weighted mean for each cell.
jester_vcf <- (jester %*% sim_mat) * 1/(jester_oz %*% sim_mat)
The predicted ratings can be pulled out.
predict_ratingsv <- list()
for (i in 1:num_row){
predict_ratingsv[[i]] <- jester_vcf[i, na_idx[[i]]]
}
To check if the looping and the vectorized implementations are the same:
random_row <- sample(c(1:num_row), size=1)
random_row
## [1] 1479
round(predict_ratingsv[[random_row]] - predict_ratings[[random_row]], 5)
## 71 72 73 74 75 76 77 78 79 80 83 84 85 86 87 88 90 91
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 92 93 94 95 96 97 98 99 100
## 0 0 0 0 0 0 0 0 0
As you can see, they are the same.
The complete result can be found here:
https://raw.githubusercontent.com/Tyllis/Data607/master/jester_cf.csv