I will build a recommender system to recommend jokes to users. I am using the Jester Dataset, which is available at http://eigentaste.berkeley.edu/dataset/.
The Jester data set lists each user in a separate row and each of the 100 jokes in a separate column, for 23,500 users in total. The data frame is sparse because not every user rated every joke. Ratings range from -10 to 10, and a joke that was not rated is coded as 99.
The data frame needs to be tidied. The first column, which lists the number of jokes rated by each user, is removed, and every 99 is replaced with NA. Each remaining column is labeled with the number of the joke it represents.
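The tidying step can be sketched in a few lines. This is an illustrative Python sketch, not the R code used for this analysis; the function name `tidy_row` and the sample rows are hypothetical, and only the rules stated above (drop the count column, treat 99 as missing) come from the text.

```python
# Sentinel the raw file uses for "not rated" (as stated in the text).
UNRATED = 99

def tidy_row(raw_row):
    """Drop the leading rating-count column and map 99 to None (missing)."""
    ratings = raw_row[1:]  # the first column is the number of jokes rated
    return [None if value == UNRATED else value for value in ratings]

# Hypothetical raw rows: [count, joke 1, joke 2, joke 3, joke 4]
raw_rows = [
    [2, 99, 8.11, 99, -2.28],
    [4, -4.37, -3.88, 0.73, -3.2],
]
tidy_rows = [tidy_row(row) for row in raw_rows]
```

Because valid ratings lie between -10 and 10, the value 99 can never be a real rating, so the replacement is unambiguous.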
The first 5 users’ ratings of the first 10 jokes are shown below.
##       1     2    3    4     5     6     7     8     9    10
## 1    NA  8.11   NA   NA -2.28 -4.22  5.49 -2.62    NA -2.28
## 2 -4.37 -3.88 0.73 -3.2 -6.41  1.17  7.82 -4.76 -6.41  0.73
## 3    NA    NA   NA   NA  0.73    NA  5.53  3.25    NA    NA
## 4  0.34 -6.55 2.86   NA -3.64  1.12  5.34  2.33    NA  2.33
## 5    NA    NA   NA   NA  9.13    NA -9.32 -2.04    NA    NA
The data frame is stored as a realRatingMatrix, which supports compact storage of sparse matrices.
Most of the jokes that were rated by more than 60 people received positive ratings.
The data is first split into a training set and a test set. The training set holds 80% of the data and the test set holds the remaining 20%.
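A split of this kind can be sketched as follows. This is a minimal Python illustration under stated assumptions: the split is made at random over user ids, and the function name `split_users`, the seed, and the 100 example ids are all hypothetical rather than taken from the analysis.

```python
import random

def split_users(user_ids, train_fraction=0.8, seed=1):
    """Shuffle user ids and split them into training and test lists."""
    shuffled = list(user_ids)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical example with 100 user ids.
train_users, test_users = split_users(range(1, 101))
```

Splitting by user keeps all of a person's ratings on one side, so the test users are people the model has never seen.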
An item-based collaborative filter recommends jokes based on the similarity between jokes. In the following model, the similarity between two jokes is the cosine similarity of the ratings that users gave them.
## Recommendations as 'topNList' with n = 5 for 4757 users.
## [1] "topNList"
## attr(,"package")
## [1] "recommenderlab"
## [1] "items" "ratings" "itemLabels" "n"
## [[1]]
## [1] 52 63 94 99 41
##
## [[2]]
## integer(0)
##
## [[3]]
## [1] 95 94 88 96 99
##
## [[4]]
## integer(0)
##
## [[5]]
## integer(0)
##
## [[6]]
## [1] 25 52 97 99 86
##
## [[7]]
## integer(0)
##
## [[8]]
## [1] 59 3 99 41 75
##
## [[9]]
## [1] 3 52 4 59 51
##
## [[10]]
## [1] 94 95 84 97 100
The list above shows the numbers of the 5 top recommended jokes for each of the first 10 users. Users 2, 4, 5, and 7 have no recommendations. Perhaps the matrix is so sparse that there is not enough data from enough users to make a prediction for them. It is also possible that these users have already rated all 100 jokes, in which case there are no unrated jokes left to recommend.
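To make the item-based approach concrete, here is a minimal Python sketch. It is not the recommenderlab code that produced the output above; the function names `item_cosine` and `recommend`, the scoring rule (a similarity-weighted average of the user's own ratings), and the toy matrix are assumptions chosen for illustration.

```python
import math

def item_cosine(ratings, i, j):
    """Cosine similarity between joke columns i and j over users who rated both."""
    pairs = [(row[i], row[j]) for row in ratings
             if row[i] is not None and row[j] is not None]
    dot = sum(a * b for a, b in pairs)
    norm_i = math.sqrt(sum(a * a for a, _ in pairs))
    norm_j = math.sqrt(sum(b * b for _, b in pairs))
    if norm_i == 0 or norm_j == 0:
        return 0.0
    return dot / (norm_i * norm_j)

def recommend(ratings, user, n=5):
    """Rank the user's unrated jokes by a similarity-weighted average rating."""
    n_items = len(ratings[0])
    rated = [j for j in range(n_items) if ratings[user][j] is not None]
    scores = {}
    for i in range(n_items):
        if ratings[user][i] is not None:
            continue  # only jokes the user has not rated are candidates
        weights = [(item_cosine(ratings, i, j), ratings[user][j]) for j in rated]
        total = sum(abs(w) for w, _ in weights)
        if total > 0:
            scores[i] = sum(w * r for w, r in weights) / total
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Toy matrix: rows are users, columns are jokes, None means "not rated".
ratings = [
    [5.0, 4.0, None, 1.0],
    [4.0, 5.0, 4.5, None],
    [-3.0, -2.0, -4.0, 2.0],
    [1.0, 2.0, 3.0, 4.0],   # this user has rated every joke
]
first_user = recommend(ratings, 0)
full_user = recommend(ratings, 3)
```

In this sketch the last user receives an empty list because every joke is already rated. Whether the same mechanism explains the empty `integer(0)` entries in the output above is an assumption that would need to be checked against the test data.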
The following recommender system is built by identifying the 60 most similar jokes. (The previous model was built by identifying the 30 most similar jokes.) Many of the recommended jokes differ from those of the previous recommender. Because this model considers a larger number of similar items, I would expect its recommendations to be more accurate.
## [[1]]
## [1] 52 63 82 84 79
##
## [[2]]
## integer(0)
##
## [[3]]
## [1] 95 94 86 84 90
##
## [[4]]
## integer(0)
##
## [[5]]
## integer(0)
##
## [[6]]
## [1] 86 90 99 82 97
##
## [[7]]
## integer(0)
##
## [[8]]
## [1] 3 59 79 64 43
##
## [[9]]
## [1] 3 43 84 82 59
##
## [[10]]
## [1] 94 84 96 86 100
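The effect of the neighborhood size (30 versus 60) can be pictured as pruning a similarity table. The Python sketch below is illustrative only: the function name `keep_top_k`, the dictionary layout, and the similarity values are hypothetical, and the sketch assumes the model keeps, for each joke, only its k most similar other jokes.

```python
def keep_top_k(similarities, k):
    """For each item, keep only its k largest similarities to other items."""
    pruned = {}
    for item, row in similarities.items():
        others = {j: s for j, s in row.items() if j != item}  # ignore self-similarity
        top = sorted(others, key=others.get, reverse=True)[:k]
        pruned[item] = {j: others[j] for j in top}
    return pruned

# Hypothetical similarity table for three jokes.
sim = {
    "joke_a": {"joke_a": 1.0, "joke_b": 0.9, "joke_c": 0.2},
    "joke_b": {"joke_a": 0.9, "joke_b": 1.0, "joke_c": 0.5},
    "joke_c": {"joke_a": 0.2, "joke_b": 0.5, "joke_c": 1.0},
}
pruned = keep_top_k(sim, k=1)
```

A larger k lets weaker neighbors contribute to each score, which can change the ranking of recommended jokes. Whether it improves accuracy is an empirical question.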
A user-based collaborative filter recommends the jokes that are most preferred by similar users. The similarity between users is measured by cosine similarity. The predicted 5 best jokes for each user are shown below. Users 2, 4, 5, and 7 again receive no recommendations; these are the same users who received none from the item-based models.
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 18742 users.
## $description
## [1] "UBCF-Real data: contains full or sample of data set"
##
## $data
## 18742 x 100 rating matrix of class 'realRatingMatrix' with 1363038 ratings.
## Normalized using center on rows.
##
## $method
## [1] "cosine"
##
## $nn
## [1] 25
##
## $sample
## [1] FALSE
##
## $normalize
## [1] "center"
##
## $verbose
## [1] FALSE
## [[1]]
## [1] 69 14 46 68 25
##
## [[2]]
## integer(0)
##
## [[3]]
## [1] 81 95 76 84 87
##
## [[4]]
## integer(0)
##
## [[5]]
## integer(0)
##
## [[6]]
## [1] 10 86 11 81 24
##
## [[7]]
## integer(0)
##
## [[8]]
## [1] 81 99 97 72 91
##
## [[9]]
## [1] 28 39 34 12 6
##
## [[10]]
## [1] 76 1 89 99 80
To see what distinguishes users 2, 4, 5, and 7 in the test set from the other test users, the heat map above shows the similarity between users. The redder the box, the greater the similarity between the two users. User 2 is very similar to users 5, 6, and 7. User 4 is also very similar to users 5, 6, and 7. The users who receive no recommendations therefore appear to be linked. Perhaps the data is too sparse to make recommendations for users 2, 4, 5, and 7.
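The values behind a user-similarity heat map can be sketched as a pairwise matrix. This Python sketch is an illustration, not the code used to draw the heat map; the function names `user_cosine` and `similarity_matrix` and the three toy users are hypothetical, and similarity is computed only over jokes that both users rated.

```python
import math

def user_cosine(u, v):
    """Cosine similarity between two users over the jokes both have rated."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    dot = sum(a * b for a, b in pairs)
    norm_u = math.sqrt(sum(a * a for a, _ in pairs))
    norm_v = math.sqrt(sum(b * b for _, b in pairs))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(users):
    """Symmetric matrix of pairwise user similarities."""
    return [[user_cosine(u, v) for v in users] for u in users]

# Toy ratings: rows are users, None means "not rated".
users = [
    [8.0, None, -2.0, 5.0],
    [7.5, 1.0, -3.0, None],
    [-6.0, 2.0, 4.0, -5.0],
]
matrix = similarity_matrix(users)
```

Each cell of `matrix` corresponds to one box of a heat map: the first two toy users agree on the jokes they share and score close to 1, while the third disagrees with the first and scores below 0.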
The following recommendations use the Pearson correlation between users as the similarity measure. For the first 10 users, the recommended jokes are similar to those from the cosine-based model, and users 2, 4, 5, and 7 still have no jokes recommended to them.
## [[1]]
## [1] 69 14 68 25 46
##
## [[2]]
## integer(0)
##
## [[3]]
## [1] 72 98 76 85 80
##
## [[4]]
## integer(0)
##
## [[5]]
## integer(0)
##
## [[6]]
## [1] 10 11 81 6 86
##
## [[7]]
## integer(0)
##
## [[8]]
## [1] 97 83 99 72 91
##
## [[9]]
## [1] 28 34 25 45 6
##
## [[10]]
## [1] 1 80 91 76 3
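For reference, Pearson similarity between two users can be sketched as below. This is a minimal Python illustration rather than the recommenderlab implementation; the function name `pearson` and the two toy users are hypothetical, and the correlation is computed over co-rated jokes only.

```python
import math

def pearson(u, v):
    """Pearson correlation between two users over the jokes both have rated."""
    pairs = [(a, b) for a, b in zip(u, v) if a is not None and b is not None]
    if len(pairs) < 2:
        return 0.0  # not enough shared ratings to measure agreement
    mean_u = sum(a for a, _ in pairs) / len(pairs)
    mean_v = sum(b for _, b in pairs) / len(pairs)
    cov = sum((a - mean_u) * (b - mean_v) for a, b in pairs)
    var_u = sum((a - mean_u) ** 2 for a, _ in pairs)
    var_v = sum((b - mean_v) ** 2 for _, b in pairs)
    if var_u == 0 or var_v == 0:
        return 0.0
    return cov / math.sqrt(var_u * var_v)

# A harsh rater and a generous rater who rank the shared jokes identically.
harsh = [-9.0, -7.0, -8.0, None]
generous = [6.0, 8.0, 7.0, 9.0]
agreement = pearson(harsh, generous)
```

Pearson subtracts each user's mean before comparing, so the two toy users correlate perfectly despite their different rating levels. The cosine-based model output earlier reports that ratings were centered on rows before similarity was computed, which may be one reason the two sets of recommendations are so close.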