Planning Document Objective

Find an interesting dataset and describe the system you plan to build out. The overall goal, however, will be to produce quality recommendations by extracting insights from a large dataset. You may do so using Spark, or another distributed computing method, OR by effectively applying one of the more advanced mathematical techniques we have covered.

Final Project Proposal

For my final project, I plan to use the Goodreads dataset to build and deploy a book recommender engine. The dataset is large, with approximately 1 million user ratings across 10,000 books. It also includes book_tags and tags files that provide additional information. The amount and quality of information in this dataset should enable me to build an engine that produces quality recommendations. Once I have created the recommendation engine, I will explore various strategies for deploying it in a product-like manner. To that end, the speed of the system will be as important as the quality of the recommendations. Potential tools for deploying the system include Spark, algorithms that excel at quick performance, or alternative means.
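
As a rough illustration of the kind of engine described above, the sketch below implements item-based collaborative filtering with cosine similarity in plain Python. It is only one possible approach, not necessarily the method the final system will use. The toy ratings are invented for illustration (they are not drawn from the dataset), and the function names build_item_vectors, cosine, and recommend are hypothetical.

```python
from collections import defaultdict
from math import sqrt

# Toy (user_id, book_id, rating) triples in the same three-column shape as the
# ratings file. These values are made up for illustration only.
RATINGS = [
    (1, 10, 5), (1, 20, 4), (1, 30, 1),
    (2, 10, 4), (2, 20, 5),
    (3, 20, 2), (3, 30, 5), (3, 40, 4),
    (4, 10, 5), (4, 40, 2),
]


def build_item_vectors(ratings):
    """Map each book_id to a sparse {user_id: rating} vector."""
    vectors = defaultdict(dict)
    for user_id, book_id, rating in ratings:
        vectors[book_id][user_id] = rating
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse rating vectors."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[u] * b[u] for u in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


def recommend(user_id, ratings, top_n=2):
    """Rank unseen books by a similarity-weighted average of the user's ratings."""
    items = build_item_vectors(ratings)
    seen = {b: r for u, b, r in ratings if u == user_id}
    scores = {}
    for candidate, vec in items.items():
        if candidate in seen:
            continue
        sims = [(cosine(vec, items[b]), r) for b, r in seen.items()]
        weight = sum(s for s, _ in sims)
        if weight > 0:
            scores[candidate] = sum(s * r for s, r in sims) / weight
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


if __name__ == "__main__":
    # User 2 has rated books 10 and 20, so only books 30 and 40 can be suggested.
    print(recommend(2, RATINGS))
```

A pure-Python loop like this recomputes similarities on every request, so it would not meet the speed goal at full scale. Precomputing the item-item similarities offline, or moving the computation to Spark, are the kinds of trade-offs the deployment phase would need to weigh.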

Tables 1 and 2 below use the skimr package to provide a summary of the ratings and books datasets.

Table 1 - Ratings Data Set

Data summary
Name ratings
Number of rows 981756
Number of columns 3
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Variable type: numeric

skim_variable  n_missing  complete_rate  mean      sd        p0  p25    p50    p75    p100   hist
book_id        0          1              4943.28   2873.21   1   2457   4921   7414   10000  ▇▇▇▇▇
user_id        0          1              25616.76  15228.34  1   12372  25077  38572  53424  ▇▇▇▇▆
rating         0          1              3.86      0.98      1   3      4      5      5      ▁▂▆▇▆
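
The figures in Table 1 also give a rough sense of how sparse the user-book rating matrix is, which matters for both recommendation quality and speed. The short calculation below is a sketch under one loud assumption: that user_id and book_id run contiguously from 1 to their maximum values (53424 and 10000). The table suggests this (both start at 1) but does not confirm it. The function name matrix_density is hypothetical.

```python
# Figures read from Table 1.
N_RATINGS = 981_756   # number of rows in the ratings data
MAX_USER_ID = 53_424  # p100 of user_id (assumed to equal the number of users)
MAX_BOOK_ID = 10_000  # p100 of book_id (assumed to equal the number of books)


def matrix_density(n_ratings, n_users, n_books):
    """Fraction of user-book cells that actually hold a rating."""
    return n_ratings / (n_users * n_books)


density = matrix_density(N_RATINGS, MAX_USER_ID, MAX_BOOK_ID)
ratings_per_user = N_RATINGS / MAX_USER_ID

print(f"density: {density:.4%}")
print(f"average ratings per user: {ratings_per_user:.1f}")
```

Under that assumption, fewer than 0.2% of the possible user-book pairs have a rating, so a sparse representation of the matrix is likely to be the practical choice.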

Table 2 - Books Data Set

Data summary
Name books
Number of rows 10000
Number of columns 23
_______________________
Column type frequency:
character 7
numeric 16
________________________
Group variables None

Variable type: character

skim_variable    n_missing  complete_rate  min  max  empty  n_unique  whitespace
isbn             700        0.93           7    10   0      9300      0
authors          0          1.00           3    742  0      4664      0
original_title   590        0.94           1    196  0      9258      0
title            0          1.00           2    186  0      9964      0
language_code    1084       0.89           2    5    0      25        0
image_url        0          1.00           52   88   0      6669      0
small_image_url  0          1.00           52   86   0      6669      0

Variable type: numeric

skim_variable              n_missing  complete_rate  mean          sd            p0            p25           p50           p75           p100          hist
id                         0          1.00           5.000500e+03  2.886900e+03  1.00          2.500750e+03  5.000500e+03  7.500250e+03  1.000000e+04  ▇▇▇▇▇
book_id                    0          1.00           5.264697e+06  7.575462e+06  1.00          4.627575e+04  3.949655e+05  9.382225e+06  3.328864e+07  ▇▂▁▁▁
best_book_id               0          1.00           5.471214e+06  7.827330e+06  1.00          4.791175e+04  4.251235e+05  9.636113e+06  3.553423e+07  ▇▂▁▁▁
work_id                    0          1.00           8.646183e+06  1.175106e+07  87.00         1.008841e+06  2.719525e+06  1.451775e+07  5.639960e+07  ▇▂▁▁▁
books_count                0          1.00           7.571000e+01  1.704700e+02  1.00          2.300000e+01  4.000000e+01  6.700000e+01  3.455000e+03  ▇▁▁▁▁
isbn13                     585        0.94           9.755044e+12  4.428619e+11  195170342.00  9.780316e+12  9.780452e+12  9.780831e+12  9.790008e+12  ▁▁▁▁▇
original_publication_year  21         1.00           1.981990e+03  1.525800e+02  -1750.00      1.990000e+03  2.004000e+03  2.011000e+03  2.017000e+03  ▁▁▁▁▇
average_rating             0          1.00           4.000000e+00  2.500000e-01  2.47          3.850000e+00  4.020000e+00  4.180000e+00  4.820000e+00  ▁▁▃▇▁
ratings_count              0          1.00           5.400124e+04  1.573700e+05  2716.00       1.356875e+04  2.115550e+04  4.105350e+04  4.780653e+06  ▇▁▁▁▁
work_ratings_count         0          1.00           5.968732e+04  1.678038e+05  5510.00       1.543875e+04  2.383250e+04  4.591500e+04  4.942365e+06  ▇▁▁▁▁
work_text_reviews_count    0          1.00           2.919960e+03  6.124380e+03  3.00          6.940000e+02  1.402000e+03  2.744250e+03  1.552540e+05  ▇▁▁▁▁
ratings_1                  0          1.00           1.345040e+03  6.635630e+03  11.00         1.960000e+02  3.910000e+02  8.850000e+02  4.561910e+05  ▇▁▁▁▁
ratings_2                  0          1.00           3.110880e+03  9.717120e+03  30.00         6.560000e+02  1.163000e+03  2.353250e+03  4.368020e+05  ▇▁▁▁▁
ratings_3                  0          1.00           1.147589e+04  2.854645e+04  323.00        3.112000e+03  4.894000e+03  9.287000e+03  7.933190e+05  ▇▁▁▁▁
ratings_4                  0          1.00           1.996570e+04  5.144736e+04  750.00        5.405750e+03  8.269500e+03  1.602350e+04  1.481305e+06  ▇▁▁▁▁
ratings_5                  0          1.00           2.378981e+04  7.976889e+04  754.00        5.334000e+03  8.836000e+03  1.730450e+04  3.011543e+06  ▇▁▁▁▁