To do a recommendation system based on the Knn algorithm, we will get data from packages in R. The aim is to know which packages are similar to others. For example, if if you go to download rpart, we want to be able to recommend that you look at randomForest too, because it is similar (maybe better!)
This is going to be an item-based recommender: Item-based recommendations look for items that are similar to items the user already likes, and recommends those similar items. Given that a user hops over to a package page, we want to show them other packages “similar” to that package that they might like.
Let’s first download the dataset we will use:
user.package.matrix <- read.csv("http://jakeporway.com/teaching/data/user_packages.csv", h=T, as.is=T)
We can see that the rows in the dataset are the R packages (a total of 2487), and the columns are 53 users selected randomly. Within the table, there is a 1 if the user has downloaded the package in the row, and a 0 if he has not.
Let’s fix the annoying fact that the rownames don’t get saved well in CSVs. We move them from the first column into the rownames:
row.names(user.package.matrix) <- user.package.matrix[,1]
user.package.matrix <- user.package.matrix[,-1]
user.package.matrix[1:5,1:5]
## user1 user2 user3 user4 user5
## abind 1 1 0 1 1
## AcceptanceSampling 0 1 1 1 1
## ACCLMA 0 0 1 1 1
## accuracy 1 1 1 0 0
## acepack 0 1 1 1 1
We can see how the dataset appears now, with the rownames being the names of the packages.
Now, how we want our recommender to work? We could to association rules (see which packages are installed together more often). However, we want to personalized recommendations per user. Therefore, if we know all the packages installed on a user’s machine, we can recommend some new packages to her by finding the probability that each CRAN package would be installed on her machine and recommending the highest probability packages. To do that we need to create a probability that a package is installed on someone’s machine given the other packages they’ve installed.
To compute the probability that a given package P has been installed, we’ll find the kNN packages that are most similar to P’s install profile (i.e. who has installed it) and then compute the probability P is installed as the mean number of similar packages that this users has already installed.
For instance, if we want to find the probability that rpart is installed on user U’s machine, we first find the k most similar packages to rpart. By similar, again, we mean we’re looking at packages that are often installed with rpart and not installed when rpart’s not installed (the rows of user.package.matrix). Once we have those kNN, we can see how many of them user U has installed. The more similar packages this user’s installed, the more likely we believe package P will be installed.
To begin with, we compute the distance matrix between each of the packages (how similar/ different they are based on when they’re installed together):
distances <- as.matrix(dist(user.package.matrix, method="euclidean"))
This creates a matrix with pakages on the rows and columns, calculating the distance between each package (obviously, a matrix with 0 in the diagonal, as the distance of a package with the same package is 0).
Now, the funcion k.nearest.neighbors, which computes the nearest neighbors:
k.nearest.neighbors <- function(i, distance.matrix, k = 5)
{
ordered.neighbors <- order(distance.matrix[i, ])
# This just gives us the list of points that are
# closest to row i in descending order.
# The first entry is always 0 (the closest point is the point itself) so
# let's ignore that entry and return points 2:(k+1) instead of 1:k
return(ordered.neighbors[2:(k + 1)])
}
Now, here’s a function to compute the probability that a package is installed on a user’s machine for a given user:
installation.probability <- function(user, package, user.package.matrix, distances, k = 25)
{
# Find the kNN closest packages to this package in question, e.g.
# randomForest's neighbors are sparkTable, atmi, brainwaver, and others.
neighbors <- k.nearest.neighbors(which(row.names(user.package.matrix) == package), distances, k)
# Now that we know the other packages that are installed with this one, return
# the mean of how many of those packages this user has already installed.
return(mean(user.package.matrix[neighbors, user]))
}
For example:
installation.probability(1, "abind", user.package.matrix, distances)
## [1] 0.96
installation.probability(1, "aylmer", user.package.matrix, distances)
## [1] 0.64
Therefore, for user 1, the probability that abind is installed is 96%, whereas the probability of installing aylmer is 64%.
So if we can predict the probability that a package is installed on a user’s machine, then we can run this function for all packages that the user has not yet installed and return the ones with the highest probability of being installed (yet aren’t yet installed). We do this with this function, which takes in the user in question, the user.package.matrix, the distances between packages, and the number of k neighbors you want to use. The function will return an ordered list of the packages recommended to that user:
most.probable.packages <- function(user, user.package.matrix, distances, k = 25)
{
probabilities <- rep(0, nrow(user.package.matrix))
for (i in 1:nrow(user.package.matrix)) { # For each package
if (user.package.matrix[i,user] == 1) {
next # Eh, they've already installed this one
}
probabilities[i] <- installation.probability(user, row.names(user.package.matrix)[i], user.package.matrix, distances, k)
}
return(order(probabilities, decreasing=T))
}
So, for instance, let’s recommend to user 1 the packages he is most likely to install:
user <- 1
listing <- most.probable.packages(user, user.package.matrix, distances)
rownames(user.package.matrix)[listing[1:10]]
## [1] "CircSpatial" "Daim" "gsarima"
## [4] "hotspots" "migui" "MixSim"
## [7] "NetIndices" "regress" "ResearchMethods"
## [10] "RImageJ"