Recommender System with R

2017-03-03

Outline

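Package overview
Data Preparation
Data Cleaning & Exploration
Recommenderlab
- Data processing
recommender
- Overview of recommendation methods
- Parameter meanings, using UBCF as an example
Building a recommendation model - UBCF example
Evaluating the recommender system
- Evaluating rating predictions
- Evaluating recommendation results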
Reference
IBCF code

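Package overview

A framework for developing and testing recommendation algorithms on rating data (matrices) and 0-1 (binary) data
The recommenderlab package stores its data in S4 classes, with the abstract ratingMatrix class as the common type for data to be analyzed
It provides an image() method for drawing a rating matrix as a heatmap
It provides many convenience functions for computing and evaluating recommendations and rating predictions
Write-ups by other users note that it is less suitable for large-scale data

Hands-on demo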

Data Preparation

Hold Me & Touch Me

About the data source

GroupLens is a research lab in the Department of Computer Science and Engineering at the University of Minnesota, Twin Cities specializing in recommender systems, online communities, mobile and ubiquitous technologies, digital libraries, and local geographic information systems.

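This walkthrough uses the MovieLens 100K data set to try out the package
- The data are movie ratings
- Ratings from nearly 1,000 users on nearly 1,700 movies
- 100,000 ratings in total, on a five-level scale from 1 to 5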

Data Cleaning & Exploration

# u.data has no header row, so the columns are named V1..V4
movie <- read.table("u.data", header = FALSE, stringsAsFactors = TRUE)
head(movie)

   V1  V2 V3        V4
1 196 242  3 881250949
2 186 302  3 891717742
3  22 377  1 878887116
4 244  51  2 880606923
5 166 346  1 886397596
6 298 474  4 884182806

V1 = user id
V2 = item id
V3 = rating
V4 is the timestamp, which is not used here for now

Data Exploration

A simple EDA: look at how the ratings are distributed
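The chart from this slide is not reproduced here. As a minimal sketch of the same idea, the snippet below tabulates a rating column with base R. The data frame `toy` is a made-up stand-in for `movie` (its values are not MovieLens data); on the real data the same call would be applied to `movie$V3`.

```r
# Toy stand-in for the `movie` data frame (hypothetical values, not MovieLens)
toy <- data.frame(
  V1 = c(1, 1, 2, 2, 3, 3, 3),         # user id
  V2 = c(10, 11, 10, 12, 11, 12, 13),  # item id
  V3 = c(5, 3, 4, 4, 1, 5, 4)          # rating
)

# Count how often each rating level 1..5 occurs (unused levels show as 0)
rating_counts <- table(factor(toy$V3, levels = 1:5))
rating_share  <- prop.table(rating_counts)

print(rating_counts)
print(round(rating_share, 2))

# Bar chart of the same counts
barplot(rating_counts, xlab = "rating", ylab = "count")
```

Using `factor(..., levels = 1:5)` keeps rating levels that never occur in the table, which makes gaps in the distribution visible.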

Transforming the data

Convert the data from log (long) format into a rating matrix

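library(dplyr)   # select(), %>%
library(tidyr)   # spread()

# One row per user, one column per item; unrated cells become NA
temp <- movie %>% 
  select(1:3) %>% 
  spread(V2, V3) %>% 
  select(-1)       # drop the user id column
temp[1:10, 1:10]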

    1  2  3  4  5  6  7  8  9 10
1   5  3  4  3  3  5  4  1  5  3
2   4 NA NA NA NA NA NA NA NA  2
3  NA NA NA NA NA NA NA NA NA NA
4  NA NA NA NA NA NA NA NA NA NA
5   4  3 NA NA NA NA NA NA NA NA
6   4 NA NA NA NA NA  2  4  4 NA
7  NA NA NA  5 NA NA  5  5  5  4
8  NA NA NA NA NA NA  3 NA NA NA
9  NA NA NA NA NA  5  4 NA NA NA
10  4 NA NA  4 NA NA  4 NA  4 NA

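Data processing with recommenderlab

Before recommenderlab can work with the data, it has to be converted to a realRatingMatrix

realRatingMatrix is the data structure recommenderlab uses for real-valued ratings such as the 1-5 scale here; it has to be converted from a matrix.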

class(temp)

[1] "data.frame"

temp is currently a data.frame, so it must first be turned into a matrix before it can be converted into the realRatingMatrix that recommenderlab accepts

library("recommenderlab")
temp_mov = temp %>% 
  as.matrix() %>% 
  as("realRatingMatrix")

Data processing with recommenderlab

What is temp_mov now?

class(temp_mov)

[1] "realRatingMatrix"
attr(,"package")
[1] "recommenderlab"

temp_mov

943 x 1682 rating matrix of class 'realRatingMatrix' with 100000 ratings.

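realRatingMatrix is a type that the functions in the recommenderlab package can operate on
943 is the number of users and 1682 is the number of items
A realRatingMatrix can easily be converted to a matrix or a list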

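recommender - Overview of recommendation methods

recommenderlab registers several recommendation techniques for realRatingMatrix; the exact list depends on the package version

# Show just two of them to save space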
recommenderRegistry$get_entries(dataType = "realRatingMatrix")[c(3,9)]

$IBCF_realRatingMatrix
Recommender method: IBCF for realRatingMatrix
Description: Recommender based on item-based collaborative filtering.
Reference: NA
Parameters:
   k   method normalize normalize_sim_matrix alpha na_as_zero
1 30 "Cosine"  "center"                FALSE   0.5      FALSE

$UBCF_realRatingMatrix
Recommender method: UBCF for realRatingMatrix
Description: Recommender based on user-based collaborative filtering.
Reference: NA
Parameters:
    method nn sample normalize
1 "cosine" 25  FALSE  "center"

recommender - Parameter meanings, using UBCF as an example

$UBCF_realRatingMatrix
Recommender method: UBCF for realRatingMatrix
Description: Recommender based on user-based collaborative filtering.
Reference: NA
Parameters:
    method nn sample normalize
1 "cosine" 25  FALSE  "center"

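method: the similarity measure; the default is cosine similarity
nn: the number of nearest-neighbor users to use
normalize: how ratings are normalized; "center" subtracts each user's mean rating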
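To make the three parameters concrete, here is a minimal base-R sketch of the neighbor-selection step they control. It is an illustration under simplifying assumptions, not recommenderlab's implementation: the matrix `r`, the helper names `cosine_sim` and `nearest_users`, and the choice to compute similarity only over co-rated items are all assumptions of this example.

```r
# Toy user x item rating matrix (hypothetical values); NA = not rated
r <- matrix(c( 5, 4, NA,  1,
               4, 5,  2, NA,
               1, 2,  5,  4,
              NA, 1,  4,  5),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("u", 1:4), paste0("M", 1:4)))

# normalize = "center": subtract each user's mean rating from that user's row
centered <- r - rowMeans(r, na.rm = TRUE)

# method = "cosine": cosine similarity over the items both users have rated
cosine_sim <- function(a, b) {
  both <- !is.na(a) & !is.na(b)
  sum(a[both] * b[both]) / sqrt(sum(a[both]^2) * sum(b[both]^2))
}

# nn: keep only the nn users most similar to the target user
nearest_users <- function(target, nn) {
  others <- setdiff(rownames(centered), target)
  sims <- sapply(others, function(u) cosine_sim(centered[target, ], centered[u, ]))
  head(sort(sims, decreasing = TRUE), nn)
}

neighbors <- nearest_users("u1", nn = 2)
print(round(neighbors, 2))
```

In this toy matrix u2 rates items much like u1 does, so it comes out as the closest neighbor; a user-based recommender then combines the neighbors' ratings to score the items the target user has not rated yet.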

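Building the model

Recommender() is the function in the recommenderlab package that builds a model
Note: before building the model, every column of the data must be given a name.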

colnames(temp_mov) <- paste("M", 1:1682, sep = "") 
as(temp_mov[1,1:10], "list")

$`1`
 M1  M2  M3  M4  M5  M6  M7  M8  M9 M10 
  5   3   4   3   3   5   4   1   5   3

# Build a user-based recommendation model
temp_mov.recommModel <- Recommender(temp_mov[1:700], method = "UBCF")
temp_mov.recommModel

Recommender of type 'UBCF' for 'realRatingMatrix' 
learned using 700 users.

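Top-N recommendations from the model

Once the model is built, the next step is prediction and recommendation; in recommenderlab this is again done with predict()
In this demo I generate recommendations for users 701 to 703
predict() has a type argument that selects between:
- rating prediction
- Top-N recommendation

## Top-N recommendation; n = 5 means the top 5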
temp_mov.predict1 <- predict(temp_mov.recommModel, temp_mov[701:703], n = 5)
temp_mov.predict1

Recommendations as 'topNList' with n = 5 for 3 users.

$`701`
[1] "M302" "M268" "M258" "M126" "M475"

$`702`
[1] "M50"  "M272" "M172" "M302" "M174"

$`703`
[1] "M313" "M98"  "M174" "M427" "M125"

Predicting ratings with the model

Predicted user ratings for movies

temp_mov.predict2 <- predict(temp_mov.recommModel,temp_mov[701:703], type = "ratings")
temp_mov.predict2

3 x 1682 rating matrix of class 'realRatingMatrix' with 4935 ratings.

Predicted ratings for M1 to M16 (only the first 16 items are shown)

     M1  M2  M3  M4  M5  M6  M7  M8  M9 M10 M11 M12 M13 M14 M15 M16
701  NA 4.2 4.1 4.2 4.2 4.3 4.4 4.2 4.4 4.2 4.2 4.3 4.2 4.4 4.3 4.2
702 2.6 2.4 2.5 2.4 2.5 2.5 2.3 2.5 2.4 2.5 2.4 2.5 2.4 2.5 2.5 2.5
703  NA 3.6 3.6 3.5 3.6 3.5  NA 3.6  NA 3.5 3.7 3.7 3.5 3.6  NA 3.6

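Model evaluation

To evaluate a rating-prediction model, split the data 80:20 into a training set and a test set
evaluationScheme()
- The function recommenderlab provides for setting up an evaluation, so we do not have to write our own
- data: This is the initial dataset
- method: split, cross-validation, bootstrap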
- train: This is the percentage of data in the training set
- given: This is the number of items to keep
- goodRating: This is the rating threshold
- k: This is the number of times to run the evaluation
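The roles of `train` and `given` with `method = "split"` can be sketched in base R as follows. This is a toy illustration of the idea, assuming a small fully rated matrix; the variable names are hypothetical and the code does not reproduce recommenderlab's internals.

```r
set.seed(1)  # reproducible toy example

# Toy rating matrix: 10 users x 8 items, every cell rated (hypothetical values)
n_users <- 10
n_items <- 8
ratings <- matrix(sample(1:5, n_users * n_items, replace = TRUE),
                  nrow = n_users,
                  dimnames = list(paste0("u", 1:n_users), paste0("M", 1:n_items)))

train_share <- 0.8  # plays the role of `train`
given       <- 3    # plays the role of `given`

# method = "split": a random 80% of the users form the training set
train_idx <- sample(n_users, size = floor(train_share * n_users))
train <- ratings[train_idx, , drop = FALSE]
test  <- ratings[-train_idx, , drop = FALSE]

# For each test user, `given` randomly chosen ratings stay visible ("known");
# the rest are hidden and later used to score the predictions ("unknown")
known <- unknown <- test
for (u in seq_len(nrow(test))) {
  keep <- sample(n_items, given)
  known[u, -keep]  <- NA
  unknown[u, keep] <- NA
}

cat("train users:", nrow(train), " test users:", nrow(test), "\n")
cat("known ratings:", sum(!is.na(known)),
    " unknown ratings:", sum(!is.na(unknown)), "\n")
```

Every test rating ends up in exactly one of the two parts, so the model only ever sees the "known" ratings of a test user and is scored on the "unknown" ones.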

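Evaluating the model's rating predictions

# Evaluation scheme: of the 943 users, 80% go to training and 20% to testing
# For each test user, 15 items are given to the recommender; the remaining items are used to compute the error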
model.eval <- evaluationScheme(temp_mov[1:943], method = "split", 
                               train = 0.8, given = 15 , goodRating = 5) 
model.eval

Evaluation scheme with 15 items given
Method: 'split' with 1 run(s).
Training set proportion: 0.800
Good ratings: >=5.000000
Data set: 943 x 1682 rating matrix of class 'realRatingMatrix' with 100000 ratings.

`evaluationScheme()`: how the data sets differ

Differences between the split data sets

getData(model.eval, "train")

754 x 1682 rating matrix of class 'realRatingMatrix' with 80899 ratings.

getData(model.eval, "known")

189 x 1682 rating matrix of class 'realRatingMatrix' with 2835 ratings.

getData(model.eval, "unknown")

189 x 1682 rating matrix of class 'realRatingMatrix' with 16266 ratings.
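The three printed sizes are consistent with the scheme's settings, which can be checked with plain arithmetic. The sketch below only re-uses numbers shown in the output above; the variable names are chosen for this example.

```r
n_users   <- 943
n_ratings <- 100000
given     <- 15

train_users <- 754
test_users  <- 189

# Training and test users together cover all users (754 / 943 is about 0.80)
stopifnot(train_users + test_users == n_users)

# "known" holds exactly `given` ratings for each test user
known_ratings <- test_users * given
stopifnot(known_ratings == 2835)

# train + known + unknown ratings add back up to the full data set
stopifnot(80899 + known_ratings + 16266 == n_ratings)

cat("share of users in training:", round(train_users / n_users, 3), "\n")
```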

Model evaluation

Build a user-based recommender from the training data

model.ubcf <- Recommender(getData(model.eval, "train"), method = "UBCF")

For the known part of the testing data (15 rated items per user), compute predicted ratings with the user-based algorithm

predict.ubcf <- predict(model.ubcf, getData(model.eval, "known"), type = "ratings")

Size of the prediction error

error_ubcf = calcPredictionAccuracy(predict.ubcf, getData(model.eval, "unknown"))
error_ubcf

RMSE  MSE  MAE 
1.05 1.10 0.85
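The three error measures can be written out in a few lines of base R. This is a minimal sketch on hypothetical vectors, assuming that only pairs with both an actual and a predicted rating are compared; it is meant to show the definitions, not the package's internal code.

```r
# Hypothetical true and predicted ratings; NA = no value to compare
actual    <- c(4,   3, 5, NA,  2, 1)
predicted <- c(3.5, 3, 4, 4.2, 3, NA)

# Only pairs where both values are present enter the error
ok  <- !is.na(actual) & !is.na(predicted)
err <- predicted[ok] - actual[ok]

mse  <- mean(err^2)       # mean squared error
rmse <- sqrt(mse)         # root mean squared error
mae  <- mean(abs(err))    # mean absolute error

print(round(c(RMSE = rmse, MSE = mse, MAE = mae), 2))
```

Because RMSE is the square root of MSE, the two reported values should agree up to rounding: 1.05 squared is about 1.10, as in the output above.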

Evaluating Top-N recommendations

Evaluating the recommendations
evaluate() is the function used to assess the recommendation results:
- x: This is the object containing the evaluation scheme.
- method: This is the recommendation technique.
- n: This is the number of items to recommend to each user. We can specify a vector of n values, in which case the function evaluates the recommender's performance for each n.

    TP   FP FN   TN precision recall  TPR  FPR
10 1.9  8.1 17 1640      0.19   0.18 0.18 0.00
20 3.0 17.0 16 1631      0.15   0.25 0.25 0.01
30 3.9 26.1 15 1622      0.13   0.30 0.30 0.02
40 4.7 35.3 14 1613      0.12   0.34 0.34 0.02
50 5.4 44.6 14 1603      0.11   0.37 0.37 0.03
60 5.9 54.1 13 1594      0.10   0.40 0.40 0.03
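The columns of this table follow the usual confusion-matrix definitions, sketched below in base R for a single hypothetical test user. The item lists and the candidate count are made up for illustration. Note that the table reports averages over all test users, so a ratio column such as recall is probably an average of per-user values and need not equal the ratio of the averaged counts.

```r
# Hypothetical single test user with a top-10 list
recommended  <- paste0("M", 1:10)                # items in the top-N list
liked        <- paste0("M", c(2, 5, 9, 40, 41))  # held-out items rated >= goodRating
n_candidates <- 100                              # items that could have been recommended

TP <- length(intersect(recommended, liked))  # recommended and liked
FP <- length(setdiff(recommended, liked))    # recommended but not liked
FN <- length(setdiff(liked, recommended))    # liked but not recommended
TN <- n_candidates - TP - FP - FN            # neither recommended nor liked

precision <- TP / (TP + FP)   # share of the list that was liked
recall    <- TP / (TP + FN)   # share of liked items that were found (same as TPR)
FPR       <- FP / (FP + TN)   # share of not-liked items that were recommended

print(round(c(TP = TP, FP = FP, FN = FN, TN = TN,
              precision = precision, recall = recall, FPR = FPR), 2))
```

In every row of the table TP + FP equals n, the list length, which is why precision in the first row is 1.9 / 10 = 0.19.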

Reference

IBCF example - discussion

Modeling -> Prediction -> Recommendation

# Item-based recommendation model
model.ibcf <- Recommender(temp_mov[1:700], method = "IBCF")
model.ibcf

Recommender of type 'IBCF' for 'realRatingMatrix' 
learned using 700 users.

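Use the item-based model to produce Top-N recommendations for users 701 to 703

## Top-N recommendation; n = 5 means the top 5
pre_model_ibcf <- predict(model.ibcf, temp_mov[701:703], n = 5)
pre_model_ibcf

Recommendations as 'topNList' with n = 5 for 3 users.

as(pre_model_ibcf, "list")

The listing recorded in the slides came from the UBCF predictions (temp_mov.predict1), so it repeats the UBCF result; the lists produced by model.ibcf will generally differ.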

$`701`
[1] "M302" "M268" "M258" "M126" "M475"

$`702`
[1] "M50"  "M272" "M172" "M302" "M174"

$`703`
[1] "M313" "M98"  "M174" "M427" "M125"

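Modeling -> Prediction -> Recommendation

pre_model_ibcf_rating <- predict(model.ibcf,
                                 temp_mov[701:703], type = "ratings")
pre_model_ibcf_rating

The count recorded in the slides came from a call that passed the UBCF model (temp_mov.recommModel); the number of ratings predicted by model.ibcf may differ.

3 x 1682 rating matrix of class 'realRatingMatrix' with 4935 ratings.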

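# Evaluation scheme: of the 943 users, 80% go to training and 20% to testing
# For each test user, 15 items are given to the recommender; the remaining items are used to compute the error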
model.eval <- evaluationScheme(temp_mov[1:943], method = "split", 
                               train = 0.8, given = 15 , goodRating = 5) 
model.eval

Evaluation scheme with 15 items given
Method: 'split' with 1 run(s).
Training set proportion: 0.800
Good ratings: >=5.000000
Data set: 943 x 1682 rating matrix of class 'realRatingMatrix' with 100000 ratings.

Modeling -> Prediction -> Recommendation

# Build the model on the training data
model.ibcf <- Recommender(getData(model.eval, "train"), method = "IBCF")
# Predict ratings
predict.ibcf <- predict(model.ibcf, 
                        getData(model.eval, "known"), type = "ratings")
# Compute the test error of the rating predictions
error_ibcf = calcPredictionAccuracy(predict.ibcf,
                                    getData(model.eval, "unknown"))
error_ibcf

RMSE  MSE  MAE 
1.19 1.40 0.85
