612 Final Plan

Find an interesting dataset and describe the system you plan to build out. If you would like to use one of the datasets you have already worked with, you should add a unique element or incorporate additional data (e.g., explicit features you scrape from another source, such as image analysis on movie posters). The overall goal, however, will be to produce quality recommendations by extracting insights from a large dataset. You may do so using Spark, or another distributed computing method, OR by effectively applying one of the more advanced mathematical techniques we have covered. There is no preference for one over the other, as long as your recommender works! The planning document should be written up and published as a notebook on GitHub or in RPubs. Please submit the link in the Unit 4 folder, due Tuesday, July 3.

# Load the required packages
library(ggplot2)
library(tidyverse)

## ── Attaching packages ───────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ tibble  2.0.1     ✔ purrr   0.3.0
## ✔ tidyr   0.8.3     ✔ dplyr   0.7.8
## ✔ readr   1.3.1     ✔ stringr 1.3.1
## ✔ tibble  2.0.1     ✔ forcats 0.4.0

## ── Conflicts ──────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(kableExtra)
library(knitr)
library(recommenderlab)

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following object is masked from 'package:tidyr':
## 
##     expand

## Loading required package: arules

## 
## Attaching package: 'arules'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

## Loading required package: proxy

## 
## Attaching package: 'proxy'

## The following object is masked from 'package:Matrix':
## 
##     as.matrix

## The following objects are masked from 'package:stats':
## 
##     as.dist, dist

## The following object is masked from 'package:base':
## 
##     as.matrix

## Loading required package: registry

library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

## The following object is masked from 'package:purrr':
## 
##     transpose

library(reshape2)

## 
## Attaching package: 'reshape2'

## The following objects are masked from 'package:data.table':
## 
##     dcast, melt

## The following object is masked from 'package:tidyr':
## 
##     smiths

Data Set Up

I will be using the Book-Crossing dataset [1]. This dataset contains book ratings (on a scale of 0-10, where 0 marks an implicit interaction and 1-10 are explicit scores), user IDs (along with user age and user location), and book IDs (along with author, title, and year published).

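# Load the three Book-Crossing tables from the working directory.
# Note: the original BX CSV dump is semicolon-delimited; if these files are
# unmodified copies of it, pass sep = ";" to each read.csv() call.
users <- read.csv("BX-Users.csv")
books <- read.csv("BX-Books.csv")
ratings <- read.csv("BX-Book-Ratings.csv")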
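
Because a rating of 0 in Book-Crossing records an implicit interaction rather than a low score, it may help to separate the two kinds of ratings before modelling. The following is a minimal sketch on a toy data frame, not the real data. The names `ratings_toy`, `explicit`, and `implicit_share` are hypothetical, and the column names `User.ID` and `Book.Rating` assume that `read.csv()` converted the original headers into syntactic R names.

```r
# Toy stand-in for the `ratings` table (assumed column names, made-up values).
ratings_toy <- data.frame(
  User.ID     = c(1, 1, 2, 2, 3, 3),
  ISBN        = c("A", "B", "A", "C", "B", "C"),
  Book.Rating = c(0, 8, 5, 0, 10, 7),
  stringsAsFactors = FALSE
)

# Keep only explicit ratings (1-10); a 0 marks an implicit interaction.
explicit <- ratings_toy[ratings_toy$Book.Rating > 0, ]

# Share of rows that are implicit, useful for judging how much data is dropped.
implicit_share <- mean(ratings_toy$Book.Rating == 0)

# Distribution of the explicit scores.
table(explicit$Book.Rating)
```

Whether the implicit rows should be discarded or modelled separately is a design decision for the project. The sketch only shows how to tell them apart.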

Project Idea

I will build a recommender system for the books in this dataset, following these steps:

1. Split the data into training and test sets
2. Use sparklyr to produce insights into this large dataset
3. Fit the recommender system on the training data
4. Compare accuracy across datasets and algorithms
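
Steps 1, 3, and 4 of the plan could be sketched with `recommenderlab`, which is already loaded above. This is an illustrative sketch under stated assumptions, not the final pipeline. It uses a small synthetic rating matrix in place of the Book-Crossing data, and the choice of algorithms (`"POPULAR"` and `"UBCF"`), the 80/20 split, `given = 5`, and `goodRating = 7` are placeholder values picked for the example.

```r
library(recommenderlab)

set.seed(612)

# Synthetic stand-in for explicit Book-Crossing ratings (1-10):
# 60 users x 25 books, with 10 unrated books per user.
n_users <- 60
n_items <- 25
m <- matrix(
  as.numeric(sample(1:10, n_users * n_items, replace = TRUE)),
  nrow = n_users,
  dimnames = list(paste0("u", seq_len(n_users)), paste0("b", seq_len(n_items)))
)
for (i in seq_len(n_users)) m[i, sample(n_items, 10)] <- NA
r <- as(m, "realRatingMatrix")

# Step 1: hold out 20% of users; for each test user, 5 ratings are
# "known" to the model and the rest are withheld for scoring.
scheme <- evaluationScheme(r, method = "split", train = 0.8,
                           given = 5, goodRating = 7)

# Steps 3 and 4: fit each algorithm on the training users, predict the
# withheld ratings, and collect the error metrics side by side.
algorithms <- c("POPULAR", "UBCF")
acc <- sapply(algorithms, function(method) {
  model <- Recommender(getData(scheme, "train"), method = method)
  pred  <- predict(model, getData(scheme, "known"), type = "ratings")
  calcPredictionAccuracy(pred, getData(scheme, "unknown"))
})

acc  # rows: RMSE, MSE, MAE; one column per algorithm
```

On random data the error values carry no meaning. The point is the shape of the workflow, which should carry over once the synthetic matrix is replaced by a `realRatingMatrix` built from the real ratings.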

Acknowledgements: [1] Improving Recommendation Lists Through Topic Diversification, Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, Georg Lausen; Proceedings of the 14th International World Wide Web Conference (WWW ’05), May 10-14, 2005, Chiba, Japan. To appear.