As with the other models, this starts with reading in the data and getting a feel for the basic structure of the profiles and reviews.
require(dplyr)
# read the ratings and gender files, then join gender onto the ratings by user
rec_data <- read.table('data/ratings.dat',
                       sep = ',',
                       col.names = c('user_id', 'profile_id', 'rating'))
gen_data <- read.table('data/gender.dat',
                       sep = ',',
                       col.names = c('user_id', 'gender'))
data <- merge(rec_data, gen_data, by = 'user_id', all = TRUE)
# Number of reviews made by each user
user_data <- data %>%
  dplyr::group_by(user_id) %>%
  summarize(reviews = n())

# Number of unique users
unique_users <- length(unique(user_data$user_id))

# Number of reviews per profile
profile_data <- data %>%
  dplyr::group_by(profile_id) %>%
  summarize(reviews = n()) %>%
  arrange(desc(reviews))
Below are a few samples of the most reviewed profiles with their review counts, as well as a sample of users along with the number of reviews each has made.
Number of Unique Users: 220970
Number of Reviews Per Profile:
profile_id | reviews |
---|---|
NA | 85611 |
156148 | 33389 |
31116 | 28398 |
193687 | 23649 |
121859 | 23639 |
83773 | 23113 |
Number of Reviews per User:
user_id | reviews |
---|---|
1 | 345 |
2 | 97 |
3 | 20 |
4 | 101 |
5 | 105 |
6 | 96 |
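The summaries above can be rendered with knitr::kable, as the later chunks do; a minimal sketch using the objects built so far:

```r
# Headline count plus the top rows of each summary table
paste("Number of Unique Users:", unique_users)
knitr::kable(head(profile_data, 6))   # most reviewed profiles (NA profile included)
knitr::kable(head(user_data, 6))      # reviews for the first few user ids
```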
In order to make the data more manageable for local evaluation, I reduced it to the 100 most reviewed profiles.
# top 100 most-reviewed profiles (row 1 is the NA profile, so take rows 2:101)
most_reviewed <- profile_data[2:101,]
final_data <- merge(most_reviewed,
                    data, by = 'profile_id',
                    all.x = TRUE)
## Summary of Most Reviewed Profiles
A quick review of the most reviewed profiles in the smaller data set.
total_reviews <- length(unique(final_data$user_id)) * length(unique(final_data$profile_id))
knitr::kable(head(most_reviewed, 12), col.names = c("Profiles", "User Reviews"))
Profiles | User Reviews |
---|---|
156148 | 33389 |
31116 | 28398 |
193687 | 23649 |
121859 | 23639 |
83773 | 23113 |
22319 | 21387 |
71636 | 21284 |
89855 | 20634 |
20737 | 18550 |
162707 | 18224 |
68989 | 16591 |
60983 | 16253 |
To build the recommenderlab realRatingMatrix, I reduced the data to user_id, profile_id, and rating, then took the data wide to turn it into a user-by-profile matrix and converted that to a realRatingMatrix.
# Reducing to necessary fields
final_data <- final_data[, c("user_id", "profile_id", "rating")]

# create wide user-by-profile matrix of ratings
data_matrix <- tidyr::spread(final_data,
                             key = 'profile_id',
                             value = 'rating')
data_names <- data_matrix$user_id
data_matrix$user_id <- NULL
data_matrix <- as.matrix(data_matrix)
rownames(data_matrix) <- data_names  # keep the user ids as the matrix row names
dating <- as(data_matrix, 'realRatingMatrix')
To get an understanding of the data, it is important to evaluate the distribution of rating values in the final ratings matrix.
The NA values were removed so as not to skew the data. The most common rating is 10, with a solid showing for 6 and 8 as well.
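A quick way to look at that distribution is to pull the stored ratings out of the realRatingMatrix; a minimal sketch:

```r
# Non-missing ratings only; unrated user/profile pairs are not stored in the matrix
rating_values <- getRatings(dating)
table(rating_values)
hist(rating_values, main = "Distribution of Ratings", xlab = "Rating")
```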
As previously, I set up a schema to evaluate models prior to building and testing the diversity component.
Two models are compared: Item-Based Collaborative Filtering and Singular Value Decomposition.
# set the seed before the random train/test split so the scheme is reproducible
set.seed(1492)
scheme <- evaluationScheme(dating, method = "split",
                           train = 0.8,
                           given = 1,
                           goodRating = 7)
# candidate models: item-based CF with cosine similarity, and SVD with centering
evaluation_models <- list(
  IBCF_COS = list(name = 'IBCF',
                  param = list(method = 'cosine',
                               normalize = 'z-score',
                               k = 40)),
  SVD_CENT = list(name = 'SVD',
                  param = list(normalize = 'center')))
# evaluate top-N lists for n = 10 through 20 under the scheme
list_results <- evaluate(x = scheme,
                         method = evaluation_models,
                         n = seq(10, 20))
average_matrices <- lapply(list_results, avg)
An evaluation results list is created and plotted. Clearly neither model is particularly good.
Both the ROC curve and the precision-recall curve show that the item-based recommender is of little use, with all of its settings producing the same results.
The SVD model is not much better, but it does show a stronger baseline and improves slightly as n increases.
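For reference, these curves can be drawn directly from the evaluation results list with recommenderlab's plot method; a minimal sketch:

```r
# ROC curves (TPR vs FPR) for both models across n = 10..20
plot(list_results, annotate = TRUE, legend = "topleft")
# precision-recall curves for the same runs
plot(list_results, "prec/rec", annotate = TRUE, legend = "bottomright")
```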
Based on these graphs I will be using the Singular Value Decomposition model for the diversity test.
rec_svd <- Recommender(getData(scheme,
"known"),
method = "SVD",
list(k =40,
normalize ='center',
type='ratings'))
p <- predict(rec_svd, getData(scheme, "known"), type="ratings")
err <-calcPredictionAccuracy(p, getData(scheme, "unknown"))
err2 <- calcPredictionAccuracy(p,
getData(scheme, "unknown"),
given = 5,
goodRating = 7)
Accuracy on the held-out (unknown) ratings, first for err and then for err2:

RMSE | MSE | MAE |
---|---|---|
3.411992 | 11.64169 | 2.572059 |

RMSE | MSE | MAE |
---|---|---|
3.411992 | 11.64169 | 2.572059 |

Both calls return identical values because the given and goodRating arguments only come into play when evaluating top-N lists, not ratings predictions.
The next step generates the top 60 recommendations for each user in the hold-out set and extracts them into two lists, predictions and ratings.
pred_svd <- predict(rec_svd,
getData(scheme, "unknown"),
n = 60)
predictions <- getList(pred_svd)
ratings <- getRatings(pred_svd)
getList(pred_svd)[[35]][1:5]
[1] "102328" "179192" "61157" "34328" "54929"
getRatings(pred_svd)[[35]][1:5]
[1] 9.012046 8.890652 8.565071 8.538497 8.535509
In order to add some diversity to the recommender system, the top three profiles are presented along with two more that are randomly selected from the remaining 57 predictions. The hope is that by reaching a bit further down the list, the odds improve of finding potential dates who are appealing in ways the site's questionnaire does not capture.
## Adding Serendipity to the List of Predictions
div_pred <- list()
div_rate <- list()
for (i in seq_along(predictions)){
  p_top <- predictions[[i]][1:3]   # first three predictions
  r_top <- ratings[[i]][1:3]       # first three ratings
  # sample the same two positions from the tail so profiles and ratings stay paired
  idx <- sample(4:60, size = 2, replace = FALSE)
  p_diversity <- predictions[[i]][idx]
  r_diversity <- ratings[[i]][idx]
  div_pred[[i]] <- c(p_top, p_diversity)
  div_rate[[i]] <- c(r_top, r_diversity)
}
Both profiles and ratings are generated, and judging by the ratings, the diversified predictions are in the ballpark of the originals (although slightly lower).
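The listing for observation 41 below can be produced by indexing into the two lists built above; a minimal sketch:

```r
# Inspect one held-out user's diversified recommendations (observation 41)
div_pred[[41]]
div_rate[[41]]
```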
Test Observation 41 Profiles:
75169, 93891, 54929, 61157, 45992
Test Observation 41 Ratings:
8.372086, 8.3496249, 8.3268807, 8.1493106, 8.2385773
In order to estimate the change in variation due to the new recommendation selection algorithm, the mean of each user's predicted ratings and of each user's diversity ratings is calculated. Row-wise differences are then taken, and the mean difference is computed by summing them and dividing by the number of rows.
# mean predicted rating per user for the original lists and for the diversified lists
prediction_variance <- unlist(lapply(ratings, mean))
diversity_variance <- unlist(lapply(div_rate, mean))
# average row-wise difference between the two sets of means
difference <- sum(diversity_variance - prediction_variance, na.rm = TRUE) /
  length(prediction_variance)
Prediction Variance Increase Due to Diversity 0.1426675
Based on the method used to select recommendations in the diversity algorithm, the average rating of the recommended list changes by about 0.14.
There is no guarantee that such a small shift will make a pragmatic difference in outcomes, but it does show that the new selection model adds a measurable amount of variety.
In the grander scheme of things, all of this offline work is a decent starting place; however, the real value of a recommender system is in applied performance.
Whether these recommendations are useful can only be ascertained through extensive monitoring and performance testing live, online.
Were I to use a recommender system like this one live, I would build a number of features into the site to gather data and estimate the utility of the diversity we believe we added. Three features would be particularly useful:

- A post-viewing survey asking whether the recommendation was good (or whether the profile was desirable)
- A timer that tracks how long users spend on each profile, comparing the organic recommendations to the diversity-enhanced recommendations
- A counter that tracks how many times a member views a profile, again comparing the organic recommendations to the diversity recommendations
I think with these sorts of measurements, we could use A/B testing to figure out whether or not the new algorithm provides reasonably meaningful suggestions relative to the standard model.
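As a rough illustration of what that A/B comparison might look like, the share of positive post-viewing survey responses in each arm could be compared with a simple two-proportion test. The counts below are placeholders for illustration, not real results:

```r
# Hypothetical counts: positive survey responses out of surveyed views, per arm
organic_positive <- 420; organic_n <- 1000   # placeholder values
diverse_positive <- 455; diverse_n <- 1000   # placeholder values
prop.test(x = c(organic_positive, diverse_positive),
          n = c(organic_n, diverse_n))
```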