Data 643

Final Project Proposal

Task: Build a recommendation system on a large dataset that produces high-quality recommendations.

To complete this task, I’ve chosen the Book-Crossing dataset, collected by Cai-Nicolas Ziegler (University of Freiburg) and available at http://www2.informatik.uni-freiburg.de/~cziegler/BX/

The dataset contains 278,858 users (anonymized, but with demographic information) who provided 1,149,780 ratings (explicit and implicit) on 271,379 books.
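To get a feel for how sparse the user-by-book matrix is, the counts above can be turned into a density figure. This is a rough sketch that assumes each user rates a given book at most once and averages over all registered users, whether or not they rated anything. The result is on the order of 0.0015% of all possible user-book pairs.

```python
# Counts quoted above for the Book-Crossing dataset.
n_users = 278858
n_books = 271379
n_ratings = 1149780

# Fraction of the user x book matrix that actually holds a rating
# (assumes at most one rating per user-book pair).
possible_pairs = n_users * n_books
density = float(n_ratings) / possible_pairs

# Average over ALL registered users, including those with no ratings.
avg_ratings_per_user = float(n_ratings) / n_users

print("density: {:.6%}".format(density))
print("average ratings per user: {:.2f}".format(avg_ratings_per_user))
```

A matrix this sparse is one reason to consider factorization methods, which only need the observed entries.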

I’ve chosen this dataset because it contains both implicit and explicit ratings, plus uniform contextual information on both users and items.
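In the Book-Crossing files, a Book.Rating of 0 marks an implicit interaction and values from 1 to 10 are explicit scores. A minimal sketch of separating the two kinds of feedback might look like the following. The function name `split_ratings` and the sample rows are hypothetical and not taken from the dataset.

```python
def split_ratings(rows):
    """Separate (user_id, isbn, rating) tuples into explicit and implicit lists."""
    explicit, implicit = [], []
    for user_id, isbn, rating in rows:
        if rating == 0:
            # 0 encodes an implicit interaction (no score given).
            implicit.append((user_id, isbn))
        elif 1 <= rating <= 10:
            # 1 to 10 are explicit scores.
            explicit.append((user_id, isbn, rating))
        else:
            raise ValueError("rating out of range: {!r}".format(rating))
    return explicit, implicit


# Made-up rows for illustration only; not taken from the dataset.
sample_rows = [
    (1, "0000000001", 0),
    (1, "0000000002", 7),
    (2, "0000000001", 10),
    (3, "0000000003", 0),
]
explicit, implicit = split_ratings(sample_rows)
print("explicit: {}, implicit: {}".format(len(explicit), len(implicit)))
```

Keeping the two sets apart lets each feed a different part of a hybrid model, for example explicit scores for factorization and implicit interactions for a binary signal.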

I propose to build the recommendation system with hybrid filtering: a Multinomial Naive Bayes model combined with alternating least squares (ALS), and perhaps a binary classification step.
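As an illustration of the alternating least squares idea only, here is a small pure-Python sketch restricted to a single latent factor (rank 1), so that each update has a closed form. It is not the Spark implementation the project would use. The names `als_rank1` and `regularized_loss`, the regularization value, and the toy ratings are all assumptions made for this example.

```python
def regularized_loss(ratings, u, v, lam):
    """Squared error on observed ratings plus an L2 penalty on the factors."""
    err = sum((r - u[a] * v[b]) ** 2 for (a, b), r in ratings.items())
    penalty = lam * (sum(x ** 2 for x in u.values()) +
                     sum(x ** 2 for x in v.values()))
    return err + penalty


def als_rank1(ratings, n_iters=20, lam=0.1):
    """Fit r_ab ~ u_a * v_b by alternating closed-form ridge updates.

    ratings: dict mapping (user, item) -> rating (observed entries only).
    """
    users = sorted({a for a, _ in ratings})
    items = sorted({b for _, b in ratings})
    u = {a: 1.0 for a in users}
    v = {b: 1.0 for b in items}
    for _ in range(n_iters):
        # Hold item factors fixed and solve for each user factor.
        for a in users:
            num = sum(r * v[b] for (x, b), r in ratings.items() if x == a)
            den = lam + sum(v[b] ** 2 for (x, b) in ratings if x == a)
            u[a] = num / den
        # Hold user factors fixed and solve for each item factor.
        for b in items:
            num = sum(r * u[a] for (a, y), r in ratings.items() if y == b)
            den = lam + sum(u[a] ** 2 for (a, y) in ratings if y == b)
            v[b] = num / den
    return u, v


# Toy explicit ratings, made up for illustration.
toy = {("a", "x"): 8, ("a", "y"): 4, ("b", "x"): 6,
       ("b", "z"): 9, ("c", "y"): 5, ("c", "z"): 10}
loss_before = regularized_loss(toy, {k: 1.0 for k in "abc"},
                               {k: 1.0 for k in "xyz"}, 0.1)
u, v = als_rank1(toy)
loss_after = regularized_loss(toy, u, v, 0.1)
predicted_a_z = u["a"] * v["z"]  # score for a pair that was never rated
print("loss before: {:.2f}, after: {:.2f}".format(loss_before, loss_after))
```

Each half-step minimizes the regularized loss exactly with the other side held fixed, so the loss cannot increase from one iteration to the next. A full model would use several latent factors per user and item rather than one.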

All of this will be done in Apache Spark, hosted on Databricks, and coded in Python 2.7.

Below are the data points at my disposal. I am also hoping to find a way to get each book’s genre from its ISBN (I would need help or guidance here).

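# Paths to the three Book-Crossing CSV files
ratings = 'D:/databases/BX-CSV-Dump/BX-Book-Ratings.csv'
users = 'D:/databases/BX-CSV-Dump/BX-Users.csv'
books = 'D:/databases/BX-CSV-Dump/BX-Books.csv'
# The files are semicolon-delimited, hence read.csv2 with sep = ";"
df1 = read.csv2(ratings, header = TRUE, sep = ";")
df2 = read.csv2(users, header = TRUE, sep = ";")
df3 = read.csv2(books, header = TRUE, sep = ";")
# Column names of the ratings, users, and books tables, in that order
c(colnames(df1), colnames(df2), colnames(df3))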

##  [1] "User.ID"             "ISBN"                "Book.Rating"        
##  [4] "User.ID"             "Location"            "Age"                
##  [7] "ISBN"                "Book.Title"          "Book.Author"        
## [10] "Year.Of.Publication" "Publisher"           "Image.URL.S"        
## [13] "Image.URL.M"         "Image.URL.L"
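None of the columns above carries a genre, so any genre would have to come from an external catalogue keyed on ISBN (none is chosen in this proposal). A possible first step, sketched below, is to validate each ISBN-10 with its standard check digit and convert it to ISBN-13, on the assumption that the catalogue expects 13-digit keys. The helper names are hypothetical.

```python
def is_valid_isbn10(isbn):
    """Return True if the string passes the ISBN-10 check-digit test."""
    isbn = isbn.strip().upper()
    if len(isbn) != 10:
        return False
    total = 0
    for pos, ch in enumerate(isbn):
        if ch.isdigit():
            value = int(ch)
        elif ch == "X" and pos == 9:
            # 'X' stands for 10 and is only allowed as the check character.
            value = 10
        else:
            return False
        # Weights run from 10 down to 1 across the ten positions.
        total += (10 - pos) * value
    return total % 11 == 0


def isbn10_to_isbn13(isbn):
    """Convert a valid ISBN-10 to ISBN-13 with the 978 prefix."""
    if not is_valid_isbn10(isbn):
        raise ValueError("not a valid ISBN-10: {!r}".format(isbn))
    core = "978" + isbn.strip()[:9]
    # ISBN-13 weights alternate 1, 3, 1, 3, ... over the first 12 digits.
    total = sum(int(ch) * (1 if idx % 2 == 0 else 3)
                for idx, ch in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)


print(isbn10_to_isbn13("0306406152"))
```

Rows whose ISBN fails the check could be set aside before any lookup, so that malformed keys do not count as missing genres.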