Find an interesting dataset and describe the system you plan to build out. The overall goal, however, will be to produce quality recommendations by extracting insights from a large dataset. You may do so using Spark, or another distributed computing method, OR by effectively applying one of the more advanced mathematical techniques we have covered.
For my final project I plan to use the Goodreads dataset to create and deploy a book recommender engine. The dataset is large, with approximately 1 million user ratings across 10,000 books, and it also includes book_tags and tags files that provide additional information. The amount and quality of information in this dataset should enable me to build a recommendation engine that produces quality recommendations. Once I have created the engine, I will explore various strategies for deploying it in a product-like manner. To that end, the speed of the system will be as important as the quality of its recommendations. Potential tools for deploying the system include Spark, algorithms chosen for their speed, or alternative approaches.
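One candidate modeling approach is collaborative filtering with alternating least squares (ALS), which Spark provides and which sparklyr exposes from R. The sketch below is a minimal, illustrative outline only; the file name `ratings.csv`, the local Spark master, and the hyperparameter values are assumptions rather than final design decisions.

```r
# Minimal sketch: ALS collaborative filtering on the ratings data via sparklyr.
# Assumes a local Spark installation and a "ratings.csv" file with the
# user_id, book_id, and rating columns summarized in Table 1 below.
library(sparklyr)

sc <- spark_connect(master = "local")

ratings_tbl <- spark_read_csv(sc, name = "ratings", path = "ratings.csv")

# Hold out 20% of the ratings to measure recommendation quality.
splits <- sdf_random_split(ratings_tbl, train = 0.8, test = 0.2, seed = 42)

# Fit an ALS matrix-factorization model (hyperparameters are placeholders).
als_model <- ml_als(
  splits$train,
  rating ~ user_id + book_id,
  rank                = 10,
  reg_param           = 0.1,
  cold_start_strategy = "drop"  # drop test rows with unseen users/items
)

# Score the held-out ratings and compute RMSE.
predictions <- ml_predict(als_model, splits$test)
ml_regression_evaluator(
  predictions,
  label_col      = "rating",
  prediction_col = "prediction",
  metric_name    = "rmse"
)

# Precompute top-5 book recommendations per user.
top_books <- ml_recommend(als_model, type = "item", n = 5)

spark_disconnect(sc)
```

If ALS turns out to be too slow to serve interactively, precomputing top-N lists (as `ml_recommend` does above) or looking up neighbors in the learned factor space are options I will weigh during the deployment phase.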
Tables 1 and 2 below use the skimr package to provide a summary of the ratings and books datasets.
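The summaries can be generated along the lines of the sketch below; the CSV file names are assumptions about how the raw files are stored locally.

```r
# Sketch of how the skimr summaries in Tables 1 and 2 can be produced.
# The file names "ratings.csv" and "books.csv" are assumptions.
library(readr)
library(skimr)

ratings <- read_csv("ratings.csv")  # user_id, book_id, rating
books   <- read_csv("books.csv")    # one row per book, 23 columns

skim(ratings)  # Table 1
skim(books)    # Table 2
```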
Table 1: Data summary of the ratings dataset

| Name | ratings |
|:---|:---|
| Number of rows | 981756 |
| Number of columns | 3 |
| Column type frequency: | |
| numeric | 3 |
| Group variables | None |
**Variable type: numeric**
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| book_id | 0 | 1 | 4943.28 | 2873.21 | 1 | 2457 | 4921 | 7414 | 10000 | ▇▇▇▇▇ |
| user_id | 0 | 1 | 25616.76 | 15228.34 | 1 | 12372 | 25077 | 38572 | 53424 | ▇▇▇▇▆ |
| rating | 0 | 1 | 3.86 | 0.98 | 1 | 3 | 4 | 5 | 5 | ▁▂▆▇▆ |
Table 2: Data summary of the books dataset

| Name | books |
|:---|:---|
| Number of rows | 10000 |
| Number of columns | 23 |
| Column type frequency: | |
| character | 7 |
| numeric | 16 |
| Group variables | None |
**Variable type: character**
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| isbn | 700 | 0.93 | 7 | 10 | 0 | 9300 | 0 |
| authors | 0 | 1.00 | 3 | 742 | 0 | 4664 | 0 |
| original_title | 590 | 0.94 | 1 | 196 | 0 | 9258 | 0 |
| title | 0 | 1.00 | 2 | 186 | 0 | 9964 | 0 |
| language_code | 1084 | 0.89 | 2 | 5 | 0 | 25 | 0 |
| image_url | 0 | 1.00 | 52 | 88 | 0 | 6669 | 0 |
| small_image_url | 0 | 1.00 | 52 | 86 | 0 | 6669 | 0 |
**Variable type: numeric**
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 5.000500e+03 | 2.886900e+03 | 1.00 | 2.500750e+03 | 5.000500e+03 | 7.500250e+03 | 1.000000e+04 | ▇▇▇▇▇ |
| book_id | 0 | 1.00 | 5.264697e+06 | 7.575462e+06 | 1.00 | 4.627575e+04 | 3.949655e+05 | 9.382225e+06 | 3.328864e+07 | ▇▂▁▁▁ |
| best_book_id | 0 | 1.00 | 5.471214e+06 | 7.827330e+06 | 1.00 | 4.791175e+04 | 4.251235e+05 | 9.636113e+06 | 3.553423e+07 | ▇▂▁▁▁ |
| work_id | 0 | 1.00 | 8.646183e+06 | 1.175106e+07 | 87.00 | 1.008841e+06 | 2.719525e+06 | 1.451775e+07 | 5.639960e+07 | ▇▂▁▁▁ |
| books_count | 0 | 1.00 | 7.571000e+01 | 1.704700e+02 | 1.00 | 2.300000e+01 | 4.000000e+01 | 6.700000e+01 | 3.455000e+03 | ▇▁▁▁▁ |
| isbn13 | 585 | 0.94 | 9.755044e+12 | 4.428619e+11 | 195170342.00 | 9.780316e+12 | 9.780452e+12 | 9.780831e+12 | 9.790008e+12 | ▁▁▁▁▇ |
| original_publication_year | 21 | 1.00 | 1.981990e+03 | 1.525800e+02 | -1750.00 | 1.990000e+03 | 2.004000e+03 | 2.011000e+03 | 2.017000e+03 | ▁▁▁▁▇ |
| average_rating | 0 | 1.00 | 4.000000e+00 | 2.500000e-01 | 2.47 | 3.850000e+00 | 4.020000e+00 | 4.180000e+00 | 4.820000e+00 | ▁▁▃▇▁ |
| ratings_count | 0 | 1.00 | 5.400124e+04 | 1.573700e+05 | 2716.00 | 1.356875e+04 | 2.115550e+04 | 4.105350e+04 | 4.780653e+06 | ▇▁▁▁▁ |
| work_ratings_count | 0 | 1.00 | 5.968732e+04 | 1.678038e+05 | 5510.00 | 1.543875e+04 | 2.383250e+04 | 4.591500e+04 | 4.942365e+06 | ▇▁▁▁▁ |
| work_text_reviews_count | 0 | 1.00 | 2.919960e+03 | 6.124380e+03 | 3.00 | 6.940000e+02 | 1.402000e+03 | 2.744250e+03 | 1.552540e+05 | ▇▁▁▁▁ |
| ratings_1 | 0 | 1.00 | 1.345040e+03 | 6.635630e+03 | 11.00 | 1.960000e+02 | 3.910000e+02 | 8.850000e+02 | 4.561910e+05 | ▇▁▁▁▁ |
| ratings_2 | 0 | 1.00 | 3.110880e+03 | 9.717120e+03 | 30.00 | 6.560000e+02 | 1.163000e+03 | 2.353250e+03 | 4.368020e+05 | ▇▁▁▁▁ |
| ratings_3 | 0 | 1.00 | 1.147589e+04 | 2.854645e+04 | 323.00 | 3.112000e+03 | 4.894000e+03 | 9.287000e+03 | 7.933190e+05 | ▇▁▁▁▁ |
| ratings_4 | 0 | 1.00 | 1.996570e+04 | 5.144736e+04 | 750.00 | 5.405750e+03 | 8.269500e+03 | 1.602350e+04 | 1.481305e+06 | ▇▁▁▁▁ |
| ratings_5 | 0 | 1.00 | 2.378981e+04 | 7.976889e+04 | 754.00 | 5.334000e+03 | 8.836000e+03 | 1.730450e+04 | 3.011543e+06 | ▇▁▁▁▁ |