2017-03-03

Outline

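  • Package overview
  • Data Preparation
  • Data Cleaning & Exploration
  • Recommenderlab
    • Data processing
  • recommender
    • Overview of recommendation methods
    • Parameter meanings, using UBCF as an example
  • Building a recommendation model, using UBCF as an example
  • Evaluating the recommender system
    • Evaluating rating predictions
    • Evaluating recommendation results
  • Reference
  • IBCF code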

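Package overview

  • A framework for developing and testing recommendation algorithms on rating data (matrices) and 0-1 (binary) data
  • The package's data types are built on S4 classes; the abstract ratingMatrix class represents the data to be analyzed
  • Provides the image() function for drawing heatmaps
  • Provides many convenient functions for computing and evaluating recommendations and ratings
  • Related write-ups note that the package is less suitable for large-scale data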

Hands-on demo

Data Preparation

About the data source

GroupLens is a research lab in the Department of Computer Science and Engineering at the University of Minnesota, Twin Cities specializing in recommender systems, online communities, mobile and ubiquitous technologies, digital libraries, and local geographic information systems.

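  • This walkthrough uses the MovieLens 100K dataset to try out the package
    • The data are movie ratings
    • Ratings from nearly 1,000 users on nearly 1,700 movies
    • 100,000 ratings in total, on a five-level scale from 1 to 5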

Data Cleaning & Exploration

# u.data is the whitespace-delimited ratings file from MovieLens 100K (no header row)
movie <- read.table("u.data", header = FALSE)
head(movie)
   V1  V2 V3        V4
1 196 242  3 881250949
2 186 302  3 891717742
3  22 377  1 878887116
4 244  51  2 880606923
5 166 346  1 886397596
6 298 474  4 884182806
  • V1 = user id
  • V2 = item id
  • V3 = rating
  • V4 is the timestamp; it is not used here for now

Data Exploration

  • A simple EDA: look at how the ratings are distributed
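
The rating distribution can be tabulated in a couple of lines. The sketch below is only illustrative: `movie_sample` is a hypothetical stand-in built from the six rows shown by head(movie) above, so the counts say nothing about the full 100,000 ratings. With the real data, replace `movie_sample` with `movie`.

```r
# Six rows copied from the head(movie) output; the full file has 100,000 rows
movie_sample <- data.frame(
  V1 = c(196, 186, 22, 244, 166, 298),  # user id
  V2 = c(242, 302, 377, 51, 346, 474),  # item id
  V3 = c(3, 3, 1, 2, 1, 4)              # rating
)

# Count ratings at each level, keeping levels that have no ratings
rating_counts <- table(factor(movie_sample$V3, levels = 1:5))
rating_counts

# Share of each rating level
rating_share <- round(prop.table(rating_counts), 2)
rating_share

# Uncomment to draw the distribution as a bar chart
# barplot(rating_counts, xlab = "Rating", ylab = "Count")
```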

Transforming the data

  • Convert the data from log (long) format into a rating matrix: one row per user, one column per item
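library(dplyr)
library(tidyr)

# Keep user, item and rating; spread items into columns; then drop the user-id column
temp <- movie %>%
  select(1:3) %>%
  spread(V2, V3) %>%
  select(-1)
temp[1:10, 1:10]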
    1  2  3  4  5  6  7  8  9 10
1   5  3  4  3  3  5  4  1  5  3
2   4 NA NA NA NA NA NA NA NA  2
3  NA NA NA NA NA NA NA NA NA NA
4  NA NA NA NA NA NA NA NA NA NA
5   4  3 NA NA NA NA NA NA NA NA
6   4 NA NA NA NA NA  2  4  4 NA
7  NA NA NA  5 NA NA  5  5  5  4
8  NA NA NA NA NA NA  3 NA NA NA
9  NA NA NA NA NA  5  4 NA NA NA
10  4 NA NA  4 NA NA  4 NA  4 NA
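
For readers without dplyr and tidyr, the same reshaping can be sketched in base R. This is a minimal illustration on a hypothetical four-row `ratings` data frame, and it assumes user and item ids are consecutive integers starting at 1 (which the 943 x 1682 result later in the deck suggests is the case for this dataset).

```r
# Hypothetical long-format ratings: one row per (user, item, rating)
ratings <- data.frame(
  V1 = c(1, 1, 2, 3),  # user id
  V2 = c(1, 2, 1, 3),  # item id
  V3 = c(5, 3, 4, 2)   # rating
)

n_users <- max(ratings$V1)
n_items <- max(ratings$V2)

# Start with an all-NA matrix: NA means "not rated"
rating_mat <- matrix(NA_real_, nrow = n_users, ncol = n_items,
                     dimnames = list(NULL, as.character(seq_len(n_items))))

# Fill in the observed ratings using (row, column) index pairs
rating_mat[cbind(ratings$V1, ratings$V2)] <- ratings$V3
rating_mat
```

Unlike the data.frame produced by spread(), this result is already a matrix, so it needs no as.matrix() step later.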

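Recommenderlab data processing

  • Before processing data with recommenderlab, the data must be converted to a realRatingMatrix

realRatingMatrix: the data structure the recommenderlab package uses for graded ratings such as 1 to 5. It has to be converted from a matrix.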

class(temp) 
[1] "data.frame"
  • temp is currently a data.frame; it must first be converted to a matrix before it can become the realRatingMatrix that recommenderlab accepts
library("recommenderlab")
temp_mov = temp %>% 
  as.matrix() %>% 
  as("realRatingMatrix")

Recommenderlab data processing

  • What is temp_mov now?
class(temp_mov)
[1] "realRatingMatrix"
attr(,"package")
[1] "recommenderlab"
temp_mov
943 x 1682 rating matrix of class 'realRatingMatrix' with 100000 ratings.
  • realRatingMatrix is a type that the recommenderlab package can work with
  • 943 is the number of users and 1682 is the number of items
  • A realRatingMatrix can easily be converted to a matrix or a list

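recommender - Overview of recommendation methods

  • recommenderlab provides several recommendation techniques for realRatingMatrix; the exact number depends on the package version
# Show just two of them; the slide has no room for more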
recommenderRegistry$get_entries(dataType = "realRatingMatrix")[c(3,9)]
$IBCF_realRatingMatrix
Recommender method: IBCF for realRatingMatrix
Description: Recommender based on item-based collaborative filtering.
Reference: NA
Parameters:
   k   method normalize normalize_sim_matrix alpha na_as_zero
1 30 "Cosine"  "center"                FALSE   0.5      FALSE

$UBCF_realRatingMatrix
Recommender method: UBCF for realRatingMatrix
Description: Recommender based on user-based collaborative filtering.
Reference: NA
Parameters:
    method nn sample normalize
1 "cosine" 25  FALSE  "center"

recommender - Parameter meanings, using UBCF as an example

$UBCF_realRatingMatrix
Recommender method: UBCF for realRatingMatrix
Description: Recommender based on user-based collaborative filtering.
Reference: NA
Parameters:
    method nn sample normalize
1 "cosine" 25  FALSE  "center"
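  • method: the similarity measure; the default is cosine similarity
  • nn: the number of nearest-neighbor users to use
  • normalize: how ratings are normalized; "center" subtracts each user's mean rating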
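
To make the three parameters concrete, here is a minimal base R sketch of the user-based idea on a hypothetical 4 x 5 toy matrix. It is not recommenderlab's implementation, which may weight neighbors or handle missing values differently. The function names `cosine_sim` and `predict_ubcf` are made up for this illustration.

```r
# Toy ratings: 4 users x 5 items, NA = not rated (hypothetical values)
r <- matrix(c( 5,  3,  4, NA, 1,
               4, NA,  5,  2, 1,
               1,  5, NA,  4, 5,
              NA,  4,  2,  5, 4),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("u", 1:4), paste0("M", 1:5)))

# normalize = "center": subtract each user's mean rating from that user's row
user_mean <- rowMeans(r, na.rm = TRUE)
centered <- r - user_mean

# method = "cosine": cosine similarity over the items both users have rated
cosine_sim <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  sum(a[both] * b[both]) / sqrt(sum(a[both]^2) * sum(b[both]^2))
}

# nn: use only the nn most similar users who have rated the item
predict_ubcf <- function(target, item, nn = 2) {
  others <- setdiff(rownames(r), target)
  others <- others[!is.na(r[others, item])]
  sims <- sapply(others, function(u) cosine_sim(centered[target, ], centered[u, ]))
  top <- names(sort(sims, decreasing = TRUE))[seq_len(min(nn, length(sims)))]
  # Average the neighbors' centered ratings, then add the target user's mean back
  unname(user_mean[target] + mean(centered[top, item]))
}

pred <- predict_ubcf("u1", "M4", nn = 2)
pred
```

The sketch uses a plain average of the neighbors' centered ratings to stay short; a similarity-weighted average is a common alternative.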

Building the model

  • Recommender() is the function in the recommenderlab package for building models
  • Note: name every column (item) of the data before building the model.
colnames(temp_mov) <- paste("M", 1:1682, sep = "") 
as(temp_mov[1,1:10], "list")
$`1`
 M1  M2  M3  M4  M5  M6  M7  M8  M9 M10 
  5   3   4   3   3   5   4   1   5   3 
# Build the user-based recommendation model on the first 700 users
temp_mov.recommModel <- Recommender(temp_mov[1:700], method = "UBCF")
temp_mov.recommModel
Recommender of type 'UBCF' for 'realRatingMatrix' 
learned using 700 users.

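Top-N recommendation with the model

  • Once the model is built, the next step is prediction and recommendation; in recommenderlab this is again done with predict()
  • In this demo, recommendations are made for users 701 to 703
  • predict() has a type parameter that selects between:
    • Rating prediction (type = "ratings")
    • Top-N recommendation (used when type is not given, as in the call below)
## Top-N recommendation; n = 5 means top 5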
temp_mov.predict1 <- predict(temp_mov.recommModel, temp_mov[701:703], n = 5)
temp_mov.predict1 
Recommendations as 'topNList' with n = 5 for 3 users. 
$`701`
[1] "M302" "M268" "M258" "M126" "M475"

$`702`
[1] "M50"  "M272" "M172" "M302" "M174"

$`703`
[1] "M313" "M98"  "M174" "M427" "M125"

Predicting ratings with the model

  • Predict the users' ratings of movies
temp_mov.predict2 <- predict(temp_mov.recommModel,temp_mov[701:703], type = "ratings")
temp_mov.predict2 
3 x 1682 rating matrix of class 'realRatingMatrix' with 4935 ratings.
  • Inspect the predicted ratings for M1 to M16 (16 items are enough to look at)
     M1  M2  M3  M4  M5  M6  M7  M8  M9 M10 M11 M12 M13 M14 M15 M16
701  NA 4.2 4.1 4.2 4.2 4.3 4.4 4.2 4.4 4.2 4.2 4.3 4.2 4.4 4.3 4.2
702 2.6 2.4 2.5 2.4 2.5 2.5 2.3 2.5 2.4 2.5 2.4 2.5 2.4 2.5 2.5 2.5
703  NA 3.6 3.6 3.5 3.6 3.5  NA 3.6  NA 3.5 3.7 3.7 3.5 3.6  NA 3.6

Evaluating the recommender system

Preparing the data to evaluate the models

To evaluate models, you need to build them with some data and test them on some other data. This chapter will show you how to prepare the two sets of data. The recommenderlab package contains prebuilt tools that help in this task.

  • Training set: the users from which the model learns
  • Testing set: the users on which the model is applied and tested
  • Inner monologue: this is really just the usual modeling problem XD

Model evaluation

  • Build a rating-prediction model, splitting the data 8:2 into training and test sets
  • evaluationScheme()
    • The function recommenderlab provides for setting up model evaluation, so there is no need to write our own
    • data: This is the initial dataset
    • method: split, cross-validation, or bootstrap
    • train: This is the percentage of data in the training set
    • given: This is the number of items to keep
    • goodRating: This is the rating threshold
    • k: This is the number of times to run the evaluation

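Evaluating predicted ratings

# Evaluation scheme: of the 943 users, 80% are used for training and 20% for testing
# For each test user, 15 items are given to the recommender; the remaining items are used to compute the error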
model.eval <- evaluationScheme(temp_mov[1:943], method = "split", 
                               train = 0.8, given = 15 , goodRating = 5) 
model.eval
Evaluation scheme with 15 items given
Method: 'split' with 1 run(s).
Training set proportion: 0.800
Good ratings: >=5.000000
Data set: 943 x 1682 rating matrix of class 'realRatingMatrix' with 100000 ratings.

evaluationScheme() data splits

  • How the split data sets differ
getData(model.eval, "train")
754 x 1682 rating matrix of class 'realRatingMatrix' with 80899 ratings.
getData(model.eval, "known")
189 x 1682 rating matrix of class 'realRatingMatrix' with 2835 ratings.
getData(model.eval, "unknown")
189 x 1682 rating matrix of class 'realRatingMatrix' with 16266 ratings.
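
The three sizes above can be cross-checked with simple arithmetic. The sketch below only uses numbers printed in the output; the variable names are chosen for this illustration.

```r
n_users <- 943   # users in the full data set
n_train <- 754   # rows reported for "train"
n_test  <- 189   # rows reported for "known" and "unknown"
given   <- 15    # items per test user handed to the recommender

# Every user lands in exactly one of the two groups
stopifnot(n_train + n_test == n_users)

# Training share, close to the requested 0.8
train_share <- n_train / n_users
round(train_share, 3)

# "known" holds exactly `given` ratings per test user
known_ratings <- n_test * given
known_ratings

# train + known + unknown ratings add up to the full data set
total_ratings <- 80899 + known_ratings + 16266
total_ratings
```

So "known" and "unknown" describe the same 189 test users: "known" is what the model sees, and "unknown" is what its predictions are scored against.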

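Model evaluation

  • Use the training data to build a user-based recommender
model.ubcf <- Recommender(getData(model.eval, "train"), method = "UBCF")
  • For the known part of the test data (15 rated items per user), compute predicted ratings with the user-based algorithm
predict.ubcf <- predict(model.ubcf, getData(model.eval, "known"), type = "ratings") 
  • Measure the prediction error against the held-out ratings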
error_ubcf = calcPredictionAccuracy(predict.ubcf, getData(model.eval, "unknown"))
error_ubcf
RMSE  MSE  MAE 
1.05 1.10 0.85 
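
The three error measures are standard and easy to reproduce by hand. The sketch below uses hypothetical `actual` and `predicted` vectors, not values from the MovieLens run, and skips pairs where either value is missing.

```r
# Hypothetical held-out ratings and model predictions for one user
actual    <- c(4, 3, 5, 2, 1, NA)   # NA: the user never rated this item
predicted <- c(3.5, 3, 4, 2.5, 2, 3)

# Only pairs with both a true and a predicted rating can be scored
ok  <- !is.na(actual) & !is.na(predicted)
err <- predicted[ok] - actual[ok]

mse  <- mean(err^2)      # mean squared error
rmse <- sqrt(mse)        # root mean squared error, in rating units
mae  <- mean(abs(err))   # mean absolute error

round(c(RMSE = rmse, MSE = mse, MAE = mae), 2)
```

Because RMSE is the square root of MSE, the two always move together; MAE is less sensitive to a few large misses.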

Evaluating Top-N recommendations

  • Evaluating the recommendations

  • evaluate() is the function for evaluating recommendation results:
    • x: This is the object containing the evaluation scheme.
    • method: This is the recommendation technique.
    • n: This is the number of items to recommend to each user. If we specify a vector of n, the function will evaluate the recommender performance for each value of n.

Run time of the recommendation evaluation

results_I <- evaluate(x = model.eval, method = "IBCF", n = seq(10, 100, 10) )
IBCF run fold/sample [model time/prediction time]
     1  [52sec/0.26sec] 
results_U <- evaluate(x = model.eval, method = "UBCF", n = seq(10, 100, 10) )
UBCF run fold/sample [model time/prediction time]
     1  [0.01sec/2.2sec] 

Evaluating recommendation results

  • Using getConfusionMatrix(), we can extract a list of confusion matrices
head(getConfusionMatrix(results_U)[[1]]) %>% 
  kable
    TP   FP FN   TN precision recall  TPR  FPR
10 1.9  8.1 17 1640      0.19   0.18 0.18 0.00
20 3.0 17.0 16 1631      0.15   0.25 0.25 0.01
30 3.9 26.1 15 1622      0.13   0.30 0.30 0.02
40 4.7 35.3 14 1613      0.12   0.34 0.34 0.02
50 5.4 44.6 14 1603      0.11   0.37 0.37 0.03
60 5.9 54.1 13 1594      0.10   0.40 0.40 0.03
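
In the table, the row label is n, and the four counts in each row sum to 1667 = 1682 - 15, that is, all items minus the 15 given ones. The sketch below shows how such counts and rates can be derived for a single hypothetical user with 20 items. The item lists are invented for illustration, and the table reports averages over test users, so its recall column need not equal TP / (TP + FN) computed from the averaged counts.

```r
all_items   <- paste0("M", 1:20)
known_items <- c("M1", "M2", "M3")                # given to the recommender
recommended <- c("M5", "M8", "M9", "M12", "M15")  # top-N list, n = 5
good_items  <- c("M5", "M9", "M10", "M17")        # held-out items rated >= goodRating

# Only items that were not given to the recommender are scored
candidates <- setdiff(all_items, known_items)

TP <- length(intersect(recommended, good_items))  # recommended and good
FP <- length(recommended) - TP                    # recommended but not good
FN <- length(good_items) - TP                     # good but not recommended
TN <- length(candidates) - TP - FP - FN           # neither

precision <- TP / (TP + FP)  # share of recommendations that are good
recall    <- TP / (TP + FN)  # share of good items that were recommended (= TPR)
FPR       <- FP / (FP + TN)  # share of non-good items that were recommended

c(TP = TP, FP = FP, FN = FN, TN = TN)
round(c(precision = precision, recall = recall, FPR = FPR), 2)
```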

Reference

IBCF example - Discussion

Modeling -> Prediction -> Recommendation

# Item-based recommendation
model.ibcf <- Recommender(temp_mov[1:700], method = "IBCF")
model.ibcf 
Recommender of type 'IBCF' for 'realRatingMatrix' 
learned using 700 users.
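  • Generate Top-N recommendations for users 701 to 703 with the item-based model
## Top-N recommendation; n = 5 means top 5
# The original slide passed temp_mov.recommModel (the UBCF model) here and printed
# temp_mov.predict1, so the listing below is identical to the UBCF result
pre_model_ibcf <- predict(model.ibcf, temp_mov[701:703], n = 5)
pre_model_ibcf
Recommendations as 'topNList' with n = 5 for 3 users. 
as(pre_model_ibcf, "list")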
$`701`
[1] "M302" "M268" "M258" "M126" "M475"

$`702`
[1] "M50"  "M272" "M172" "M302" "M174"

$`703`
[1] "M313" "M98"  "M174" "M427" "M125"

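Modeling -> Prediction -> Recommendation

# The original slide passed temp_mov.recommModel (the UBCF model) here, which is
# why the rating count printed below matches the UBCF output
pre_model_ibcf_rating <- predict(model.ibcf,
                                 temp_mov[701:703], type = "ratings")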
pre_model_ibcf_rating
3 x 1682 rating matrix of class 'realRatingMatrix' with 4935 ratings.
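# Evaluation scheme: of the 943 users, 80% are used for training and 20% for testing
# For each test user, 15 items are given to the recommender; the remaining items are used to compute the error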
model.eval <- evaluationScheme(temp_mov[1:943], method = "split", 
                               train = 0.8, given = 15 , goodRating = 5) 
model.eval
Evaluation scheme with 15 items given
Method: 'split' with 1 run(s).
Training set proportion: 0.800
Good ratings: >=5.000000
Data set: 943 x 1682 rating matrix of class 'realRatingMatrix' with 100000 ratings.

Modeling -> Prediction -> Recommendation

# Build the model on the training data
model.ibcf <- Recommender(getData(model.eval, "train"), method = "IBCF")
# Predict ratings
predict.ibcf <- predict(model.ibcf, 
                        getData(model.eval, "known"), type = "ratings")
# Compute the test error of the predicted ratings
error_ibcf = calcPredictionAccuracy(predict.ibcf,
                                    getData(model.eval, "unknown"))
error_ibcf
RMSE  MSE  MAE 
1.19 1.40 0.85