Introduction
The goal for the final project is for you to build out a recommender system using a large dataset (e.g., 1M+ ratings or 10k+ users, 10k+ items). There are three deliverables, with separate dates:
Planning Document. Find an interesting dataset and describe the system you plan to build out. If you would like to use one of the datasets you have already worked with, you should add a unique element or incorporate additional data (e.g., explicit features you scrape from another source, such as image analysis on movie posters).
The overall goal, however, will be to produce quality recommendations by extracting insights from a large dataset. You may do so using Spark, or another distributed computing method, OR by effectively applying one of the more advanced mathematical techniques we have covered. There is no preference for one over the other, as long as the recommender works!
Implementation. In this final project deliverable, you’ll build out the system that you describe in the planning document.
Project Highlights
- I plan to use recommenderlab, an R package that provides a convenient framework for evaluating and comparing recommendation algorithms and quickly identifying the best-suited approach. recommenderlab accepts two types of rating matrix for modeling:
  - a real rating matrix, consisting of actual user ratings, which requires normalization;
  - a binary rating matrix, consisting of 0s and 1s, where a 1 indicates that the product was purchased. This is the matrix type needed for this analysis, and it does not require normalization.
- I will arrange the purchase history in a rating matrix, with orders in rows and products in columns. This format is often called a user-item matrix because “users” (e.g. customers or orders) tend to be on the rows and “items” (e.g. products) on the columns.
- recommenderlab can estimate multiple algorithms at a time, so I will create a list of algorithms and consider evaluation schemes that work on a binary rating matrix.
- I will compare the performance of the algorithms using the ROC curve, precision/recall, RMSE, MSE and MAE, and use the best-performing algorithm to predict a list of top items for a new customer.
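The last two points can be sketched roughly as follows. This is a minimal illustration rather than the final implementation: the toy purchase matrix is simulated, and the seed, the choice of algorithms and the parameter values (`k`, `nn`, the list sizes in `n`) are all assumptions. It covers only the top-N metrics (ROC curve and precision/recall).

```r
library(recommenderlab)

set.seed(612)

# Hypothetical toy data: 200 orders x 30 products, about 15% of cells purchased
m <- matrix(rbinom(200 * 30, 1, 0.15), nrow = 200,
            dimnames = list(paste0("order", 1:200), paste0("item", 1:30)))

# Keep orders with at least two items so one item can be withheld for testing
m <- m[rowSums(m) >= 2, ]
brm <- as(m, "binaryRatingMatrix")

# 5-fold cross-validation; given = -1 withholds one item per test order
scheme <- evaluationScheme(brm, method = "cross-validation", k = 5, given = -1)

# Candidate algorithms, passed to evaluate() in a single call
algorithms <- list(
  "random items"  = list(name = "RANDOM",  param = NULL),
  "popular items" = list(name = "POPULAR", param = NULL),
  "item-based CF" = list(name = "IBCF",    param = list(k = 5)),
  "user-based CF" = list(name = "UBCF",    param = list(nn = 25))
)

# Evaluate top-N lists of several lengths for every algorithm
results <- evaluate(scheme, algorithms, type = "topNList", n = c(1, 3, 5, 10))

# Confusion-matrix metrics averaged over the folds for one algorithm
avg(results[["popular items"]])

plot(results, annotate = TRUE)              # ROC curve (TPR against FPR)
plot(results, "prec/rec", annotate = TRUE)  # precision against recall
```

Passing all candidates as one named list keeps the comparison fair, since every algorithm is scored on the same folds.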
About the Data
The data for this project comes from the UCI Machine Learning Repository, an online archive of large datasets which includes a wide variety of data types, analysis tasks, and application areas.
In this project I’ll use the Online Retail dataset, donated to UCI in 2015 by the School of Engineering at London South Bank University. The dataset covers transactions occurring between 01/Dec/2010 and 09/Dec/2011 for a UK-based and registered online retail company, and contains 541,909 observations of eight variables. The data is too large for my GitHub but can be downloaded from: http://archive.ics.uci.edu/ml/machine-learning-databases/00352/
Loading the Online Retail dataset
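A minimal sketch of the loading step, assuming the workbook has been downloaded from the URL above and saved as `Online Retail.xlsx` in the working directory. The `retail_path` variable and the `file.exists()` guard are additions for illustration.

```r
library(readxl)

# Assumption: the workbook from the UCI URL was saved locally under this name
retail_path <- "Online Retail.xlsx"

if (file.exists(retail_path)) {
  # trim_ws = TRUE strips leading and trailing whitespace from text cells
  retail <- read_excel(retail_path, trim_ws = TRUE)
} else {
  message("Download the workbook first and place it at: ", retail_path)
}
```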
Data structure
## tibble [541,909 x 8] (S3: tbl_df/tbl/data.frame)
## $ InvoiceNo : chr [1:541909] "536365" "536365" "536365" "536365" ...
## $ StockCode : chr [1:541909] "85123A" "71053" "84406B" "84029G" ...
## $ Description: chr [1:541909] "WHITE HANGING HEART T-LIGHT HOLDER" "WHITE METAL LANTERN" "CREAM CUPID HEARTS COAT HANGER" "KNITTED UNION FLAG HOT WATER BOTTLE" ...
## $ Quantity : num [1:541909] 6 6 8 6 6 2 6 6 6 32 ...
## $ InvoiceDate: POSIXct[1:541909], format: "2010-12-01 08:26:00" "2010-12-01 08:26:00" ...
## $ UnitPrice : num [1:541909] 2.55 3.39 2.75 3.39 3.39 7.65 4.25 1.85 1.85 1.69 ...
## $ CustomerID : num [1:541909] 17850 17850 17850 17850 17850 ...
## $ Country : chr [1:541909] "United Kingdom" "United Kingdom" "United Kingdom" "United Kingdom" ...
Remove the Cancelled Transactions
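In this dataset, an invoice number that starts with the letter "C" marks a cancellation. One possible way to drop those rows is sketched below on a small hypothetical data frame (`retail_toy` is invented for illustration and only mimics the columns shown in the data structure above).

```r
# Hypothetical toy rows; the real data has 541,909 rows and eight columns
retail_toy <- data.frame(
  InvoiceNo = c("536365", "536365", "C536379", "536381", "C536383"),
  StockCode = c("85123A", "71053", "D", "22139", "35004C"),
  Quantity  = c(6, 6, -1, 23, -1),
  stringsAsFactors = FALSE
)

# Flag cancellations by the leading "C" in the invoice number
is_cancelled <- startsWith(retail_toy$InvoiceNo, "C")

# Keep only the rows that are not cancellations
retail_kept <- retail_toy[!is_cancelled, ]
nrow(retail_kept)  # 3 rows remain
```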
Exploring the dataset
Excluding the missing values
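A hedged sketch of this step on hypothetical rows: count the missing values per column, then keep only rows that are complete in the columns the analysis depends on. Which columns to require (here `Description` and `CustomerID`) is an assumption, not something the plan above fixes.

```r
# Hypothetical toy rows with gaps in Description and CustomerID
retail_toy <- data.frame(
  InvoiceNo   = c("536365", "536366", "536367", "536368"),
  Description = c("WHITE METAL LANTERN", NA,
                  "CREAM CUPID HEARTS COAT HANGER",
                  "KNITTED UNION FLAG HOT WATER BOTTLE"),
  CustomerID  = c(17850, 17850, NA, 13047),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
colSums(is.na(retail_toy))

# Assumption: rows must be complete in these two columns
required <- c("Description", "CustomerID")
retail_complete <- retail_toy[complete.cases(retail_toy[, required]), ]
```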
Pre-processing
Convert data to numeric
Create the sparse Matrix
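One way to build the order-by-product matrix described in the project highlights is sketched below. The `purchases` data frame is a hypothetical stand-in for the cleaned data, and the sketch uses base R cross-tabulation rather than a tidyverse pipeline, which is an illustrative choice.

```r
library(recommenderlab)

# Hypothetical long-format purchase history: one row per order line
purchases <- data.frame(
  InvoiceNo   = c("536365", "536365", "536366",
                  "536367", "536367", "536367"),
  Description = c("WHITE METAL LANTERN", "CREAM CUPID HEARTS COAT HANGER",
                  "WHITE METAL LANTERN",
                  "WHITE METAL LANTERN", "CREAM CUPID HEARTS COAT HANGER",
                  "KNITTED UNION FLAG HOT WATER BOTTLE"),
  stringsAsFactors = FALSE
)

# Cross-tabulate orders (rows) against products (columns)
counts <- table(purchases$InvoiceNo, purchases$Description)

# Collapse counts to 0/1: 1 means the product appears in the order
purchase_matrix <- matrix(
  as.integer(counts > 0), nrow = nrow(counts),
  dimnames = list(rownames(counts), colnames(counts))
)

# Coerce to recommenderlab's sparse binary rating matrix
ratings_matrix <- as(purchase_matrix, "binaryRatingMatrix")
ratings_matrix
```

Collapsing to 0/1 discards quantities, which matches the binary rating matrix the plan calls for.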
Modified Distributions
Algorithms - Modeling
Set up List of Algorithms
Comparison of the Model Accuracy
Model Performance Metrics - RMSE, MSE and MAE
ROC curve
Precision and Recall
Training, Testing and Evaluation Data sets
Prepare training dataset
Prepare testing set
Prepare evaluation set
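The three sets could be derived from a single recommenderlab evaluation scheme, as sketched below. The data are simulated, and mapping the "known" part of the held-out orders to the testing set and the "unknown" part to the evaluation set is an assumption about how these headings will be filled in. The POPULAR model is only a placeholder for whichever algorithm performs best.

```r
library(recommenderlab)

set.seed(612)

# Hypothetical toy data: 300 orders x 20 products
m <- matrix(rbinom(300 * 20, 1, 0.2), nrow = 300,
            dimnames = list(paste0("order", 1:300), paste0("item", 1:20)))
m <- m[rowSums(m) >= 3, ]
brm <- as(m, "binaryRatingMatrix")

# 80/20 split; given = -1 withholds one item from each held-out order
scheme <- evaluationScheme(brm, method = "split", train = 0.8, given = -1)

train_set    <- getData(scheme, "train")    # orders used to fit the model
test_known   <- getData(scheme, "known")    # held-out orders minus one item
test_unknown <- getData(scheme, "unknown")  # the withheld items, for scoring

# Placeholder model fitted on the training orders
model <- Recommender(train_set, method = "POPULAR")

# Top-3 recommendations for each held-out order
top3 <- predict(model, test_known, n = 3)
head(as(top3, "list"), 2)
```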
Conclusion
References