Through Kaggle, I found a collection of board game ratings, along with informative game descriptions, sourced from a prominent board game resource, boardgamegeek.com. The database comprises approximately 19 million ratings from around 411,000 users of around 22,000 different board games. In total, the data is spread across nine tables. One table contains the ratings. A second contains basic game information such as year released, number of players, and a text description. A third table breaks down ratings by game instead of user, providing the number of times each game was given each rating (0-10 by tenths). Three tables provide the themes, mechanics, and subcategories associated with each Game ID. The final three tables center on the game makers, covering the associated artists, designers, and publishers, though data is only provided for creators associated with three or more games.
URL: https://www.kaggle.com/datasets/threnjen/board-games-database-from-boardgamegeek?select=games.csv
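To give a sense of how the tables fit together, here is a minimal R sketch of loading the ratings and game tables and joining them on the game ID. Only games.csv is confirmed by the dataset URL; the ratings file name and the column names ("BGGId", "Username") are assumptions based on the dataset description.

```r
# Minimal sketch (R, data.table). games.csv is confirmed by the dataset URL;
# the ratings file name and column names are assumptions from the description.
library(data.table)

games   <- fread("games.csv")          # basic game info: year, player counts, description
ratings <- fread("user_ratings.csv")   # ~19M rows: user, game ID, rating (assumed name)

# Join ratings to game metadata on the game ID column (assumed to be "BGGId")
rated <- merge(ratings, games, by = "BGGId")

# Quick sanity checks on the scale of the data
length(unique(rated$Username))   # ~411,000 users (column name assumed)
length(unique(rated$BGGId))      # ~22,000 games
```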
The goal here is to make good, personalized recommendations for people who want to find new board games to play. This recommender will be built on a lot of data, which will hopefully translate to accurate predictions. However, the ultimate model will have to be determined by assessing, and possibly combining, multiple options. I’ve used R for every other assignment in this class so far, so I will probably rely on it for this project. This means I may have to use sampling to train and build a model due to capacity limitations, as sketched below. I will almost certainly need cloud processing at some point if I want to take full advantage of the depth of the data provided. One thing I find particularly fortunate about this data set is the amount of information provided about the games themselves. Many variables are reported and can therefore be considered by the model, and in designing the final recommender there is a lot to consider about how each should be weighted. Since most of my work this semester centered on the Jester dataset, there were only two things to base the recommendations on: the text of the jokes and the similarity between ratings. Here, the model will likely be more complex because of how much we are told about the individual games.
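As a rough illustration of the sampling idea mentioned above, the sketch below draws a random subset of users before building a sparse ratings matrix that a collaborative-filtering model could use. The column names ("Username", "BGGId", "Rating") and the sample size of 10,000 users are assumptions for illustration, not part of the dataset documentation.

```r
# Sketch: sample users to keep the ratings matrix at a manageable size.
# Column names and the 10,000-user sample size are assumptions.
library(data.table)
library(Matrix)

set.seed(42)
sampled_users <- sample(unique(ratings$Username), 10000)
sub <- ratings[Username %in% sampled_users]

# Sparse user-by-game matrix of ratings, suitable as input to a
# collaborative-filtering model (e.g., via the recommenderlab package)
rating_mat <- sparseMatrix(
  i = as.integer(factor(sub$Username)),
  j = as.integer(factor(sub$BGGId)),
  x = sub$Rating
)
dim(rating_mat)   # users x games after sampling
```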