Hey there! Welcome to another R project of mine. This time, the project is a little simpler but I think pretty handy and not something I learned in school. This is somewhat related to some work stuff I’m doing so I did this to evaluate a few new tricks.

Cosine similarity and Jaccard similarity are two ways of measuring how similar two documents are by looking at the words they share. I’ll have pictures below to help, but with this kind of technique you can build a simple recommender system. Imagine you have a pile of Lego pieces and a bunch of different instructions, such as one for a truck and one for a house. With cosine or Jaccard similarity, you could compare the list of pieces in your pile against the list of pieces each set of instructions needs to see what you could build. This is more or less the problem I’m trying to solve.

The dataset we’re working with is a list of videogames on Steam. We’ll use just the game titles to find games with similar titles.

The rough flow of this project will be: importing & transforming the data, cosine similarity, Jaccard similarity, then what the next step would be.

Anyhow, let’s see what we have. Packages used here: tidyverse, tm, & proxy.

And with that…

Monster Hunter World, the input game title we will use. Great game by the way!

So here is where I’m going to import the dataset locally, shorten it, and transform it a little bit. The dataframe is games_df which has plenty of data but we really just want the title column.

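Quick note on why we are shortening the dataset: similarity analysis like this compares every title against every other title, so the similarity matrix grows with the square of the number of rows and quickly gets expensive to compute. We’ll be working with just the first 5,000 observations out of roughly 50,000.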

The transformation taking place is converting the text stored in the titles to a document term matrix that keeps track of which terms (or words) appear in which documents (or titles). Usually we would clean the text data up by stemming words like “running” to just “run” but I’m too lazy for that.
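
To make the idea of a document-term matrix concrete, here is a small base R sketch on three made-up titles. It builds raw word counts by hand, so it only illustrates the shape of the result and is not how tm builds or weights its matrix. The names `toy_titles`, `toy_tokens`, `toy_vocab` and `toy_dtm` are mine, for illustration only.

```r
# Three made-up titles (not taken from the Steam dataset)
toy_titles <- c("Monster Hunter World", "Monster Truck Rally", "World Rally")

# Lowercase each title and split it into words
toy_tokens <- strsplit(tolower(toy_titles), "[^a-z0-9]+")

# The vocabulary is every distinct word across all titles
toy_vocab <- sort(unique(unlist(toy_tokens)))

# One row per title, one column per word, each cell counting occurrences
toy_dtm <- t(sapply(toy_tokens, function(words) {
  table(factor(words, levels = toy_vocab))
}))
rownames(toy_dtm) <- toy_titles

toy_dtm
```

Each row is a title and each column is a word, which is the layout the similarity calculations later on rely on.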

I included a head() call to show the first few rows of the dataset (head() prints 6 rows by default).

#Loading the packages and the dataset
library(tidyverse)
library(tm)
games_df <- read_csv("games.csv")
## Rows: 50872 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (2): title, rating
## dbl  (6): app_id, positive_ratio, user_reviews, price_final, price_original,...
## lgl  (4): win, mac, linux, steam_deck
## date (1): date_release
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Shortening the dataset
games_df <- games_df[1:5000,]
#inspecting first 6 rows
head(games_df)
## # A tibble: 6 × 13
##   app_id title date_release win   mac   linux rating positive_ratio user_reviews
##    <dbl> <chr> <date>       <lgl> <lgl> <lgl> <chr>           <dbl>        <dbl>
## 1  13500 Prin… 2008-11-21   TRUE  FALSE FALSE Very …             84         2199
## 2  22364 BRIN… 2011-08-03   TRUE  FALSE FALSE Posit…             85           21
## 3 113020 Mona… 2013-04-24   TRUE  TRUE  TRUE  Very …             92         3722
## 4 226560 Esca… 2014-11-18   TRUE  FALSE FALSE Mixed              61          873
## 5 249050 Dung… 2014-10-27   TRUE  TRUE  FALSE Very …             88         8784
## 6 250180 META… 2015-09-14   TRUE  FALSE FALSE Very …             90         5579
## # ℹ 4 more variables: price_final <dbl>, price_original <dbl>, discount <dbl>,
## #   steam_deck <lgl>
#Create a corpus from just the titles of the games
corpus <- Corpus(VectorSource(games_df$title))
#Create a document-term matrix using TF-IDF
dtm <- DocumentTermMatrix(corpus)
tfidf_matrix <- weightTfIdf(dtm)

Great, now to some actual fun stuff. First up will be cosine similarity, somewhat explained by this picture. Suppose you had two statements: “Deep learning can be hard” and “Deep learning can be simple.” You can turn these statements into vectors and compare them in a matrix like so.

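Cosine similarity looks at the angle between these two vectors using the formula: CS = (A . B) / (||A|| ||B||), where A . B is the dot product and ||A|| is the length of vector A. The two statements share four words, so A . B = 4, and each vector has length √5, which gives 4 / 5. Don’t lose too much sleep, the answer is 0.8 or 80% similarity which is good.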
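
To check the 0.8 figure, here is a minimal base R sketch. It uses raw word counts rather than TF-IDF weights, and the names `vocab`, `to_vector` and `cosine_sim` are my own helpers for illustration.

```r
# Vocabulary covering both sentences, in a fixed order
vocab <- c("deep", "learning", "can", "be", "hard", "simple")

# Count how often each vocabulary word appears in a sentence
to_vector <- function(sentence) {
  words <- strsplit(tolower(sentence), " ")[[1]]
  sapply(vocab, function(w) sum(words == w))
}

a <- to_vector("Deep learning can be hard")
b <- to_vector("Deep learning can be simple")

# Cosine similarity: dot product divided by the product of the vector lengths
cosine_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

cosine_sim(a, b)
```

The two vectors differ only in the "hard" and "simple" positions, which is why the score is high but not 1.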

So below we are going to create cosine similarities and a function that will take a game title and spit out the top 10 similar game titles along with their scores. The title we’re going to input is “Monster Hunter: World - Deluxe Kit”.

#Compute cosine similarities
normalize <- function(x) x / sqrt(sum(x^2))
cosine_similarities <- tcrossprod(apply(tfidf_matrix, 2, normalize))

#Function to get top recommendations for a given game
get_recommendations <- function(game_title, similarity_matrix, df) {
  game_index <- which(df$title == game_title)
  if (length(game_index) == 0) {
    stop("Game not found in the dataset")
  }
  similarity_scores <- similarity_matrix[game_index, ]
  similar_indices <- order(similarity_scores, decreasing = TRUE)[1:10]
  
  #Create a data frame with game titles and similarity scores
  recommendations <- data.frame(
    title = df$title[similar_indices],
    similarity_score = similarity_scores[similar_indices]
  )
  return(recommendations)
}

#Example: Get top 10 recommendations with similarity scores for a specific game
target_game <- "Monster Hunter: World - Deluxe Kit"
game_index <- which(games_df$title == target_game)
recommendations <- get_recommendations(target_game, cosine_similarities, games_df)

#Print the recommendations with similarity scores
print(recommendations)
##                                                        title similarity_score
## 2648                      Monster Hunter: World - Deluxe Kit        0.4579883
## 1953                        The Sims™ 4 Decor to the Max Kit        0.2444297
## 2232         Street Fighter V - Champion Edition Upgrade Kit        0.2444297
## 1426      Romance of the Three Kingdoms IX with Power Up Kit        0.2095111
## 2420         Galactic Civilizations III - Mech Parts Kit DLC        0.2095111
## 4220               Monster Hunter: World Original Soundtrack        0.1412891
## 4624               Monster Hunter: World - Gesture: Hadoken!        0.1412891
## 2333         Monster Hunter: World - Gesture: Feverish Dance        0.1177409
## 2727             Monster Hunter: World - Gesture: Air Splits        0.1177409
## 2640 Monster Hunter: World - The Handler's Mischievous Dress        0.1009208

As we see from the results, not bad. There are some questionable results, like the unrelated games that only share the word “Kit”, but overall a good first pass. One thing to note is that the similarity is symmetric: the score between our input and another title is the same whichever one you start from, because it is based on the words they share. We will talk about improvements later.
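
One thing I’d double check: under cosine similarity a title compared with itself should score exactly 1, but the top row of the table above is about 0.46. A likely reason (my reading of the code, not something I verified on the Steam data) is that `apply(tfidf_matrix, 2, normalize)` rescales each term column rather than each title row. Here is a minimal sketch of row-wise normalization on a made-up 3-title matrix. The `weights` values are illustrative only.

```r
# Toy document-term weights: rows are titles, columns are terms (made-up numbers)
weights <- rbind(
  title_a = c(1, 1, 0),
  title_b = c(1, 0, 1),
  title_c = c(0, 0, 2)
)

# Divide every row by its own Euclidean length so each title becomes a unit vector
# (a row of all zeros would give NaN and would need to be dropped first)
unit_rows <- weights / sqrt(rowSums(weights^2))

# Dot products between unit vectors are the cosine similarities
cosine_sim_matrix <- tcrossprod(unit_rows)

# Every title should match itself with a score of 1
diag(cosine_sim_matrix)
```

On the real data the same idea would apply to `as.matrix(tfidf_matrix)`, and the scores and ranking would likely change from the table above.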

Next we move on to Jaccard similarity, which is a more straightforward calculation: the size of the intersection of two sets divided by the size of their union, or J(A, B) = |A ∩ B| / |A ∪ B|.

For example, suppose we have the following two sets of data:

E = [‘cat’, ‘dog’, ‘hippo’, ‘monkey’]
F = [‘monkey’, ‘rhino’, ‘ostrich’, ‘salmon’]

To calculate the Jaccard similarity between them, we first count the observations that appear in both sets, then divide by the number of observations that appear in either set:

Number of observations in both: {‘monkey’} = 1
Number of observations in either: {‘cat’, ‘dog’, ‘hippo’, ‘monkey’, ‘rhino’, ‘ostrich’, ‘salmon’} = 7
Jaccard similarity: 1 / 7 = 0.142857

Since this number is fairly low, it indicates that the two sets are quite dissimilar.
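
The same animal example in base R, as a quick sanity check. The names `set_e`, `set_f` and `jaccard` are mine, for illustration.

```r
# The two example sets from the text
set_e <- c("cat", "dog", "hippo", "monkey")
set_f <- c("monkey", "rhino", "ostrich", "salmon")

# Jaccard similarity: shared items divided by all distinct items
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

jaccard(set_e, set_f)  # 1 shared item out of 7 distinct items
```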

Now let’s use our data and see what happens.

#Calculate Jaccard similarity between the target game and all other games
jaccard_similarity <- function(set1, set2) {
  intersection <- length(intersect(set1, set2))
  union <- length(union(set1, set2))
  return(intersection / union)
}

jaccard_similarities <- apply(as.matrix(tfidf_matrix), 1, function(row) {
  jaccard_similarity(row, tfidf_matrix[game_index, ])
})

#Adjust recommender function for Jaccard
#(ranks the precomputed jaccard_similarities vector; the tfidf_matrix argument is not used)
get_recommendations <- function(game_title, tfidf_matrix, df) {
  game_index <- which(df$title == game_title)
  if (length(game_index) == 0) {
    stop("Game not found in the dataset")
  }
  #Find the top 10 similar games
  similar_indices <- order(jaccard_similarities, decreasing = TRUE)[1:10]
  #Create a data frame with game titles and Jaccard similarity scores
  recommendations <- data.frame(
    title = df$title[similar_indices],
    jaccard_similarity_score = jaccard_similarities[similar_indices]
  )
  return(recommendations)
}

#Example: Get top 10 recommendations with Jaccard similarity scores for a specific game
target_game <- "Monster Hunter: World - Deluxe Kit"
recommendations <- get_recommendations(target_game, tfidf_matrix, games_df)
#Print the recommendations with Jaccard similarity scores
print(recommendations)
##                                           title jaccard_similarity_score
## 2648         Monster Hunter: World - Deluxe Kit                1.0000000
## 4220  Monster Hunter: World Original Soundtrack                0.5000000
## 4624  Monster Hunter: World - Gesture: Hadoken!                0.5000000
## 501          Dying Light - Harran Inmate Bundle                0.3750000
## 685          WRC 9 FIA World Rally Championship                0.3750000
## 326  Community College Hero: Knowledge is Power                0.3333333
## 643           Touhou Monster TD ~ Nagae Iku DLC                0.3333333
## 657         Disney•Pixar Cars 2: The Video Game                0.3333333
## 1510        Orcs Must Die! - Artifacts of Power                0.3333333
## 1825            Just Cause™ 4: Toy Vehicle Pack                0.3333333

The results are a little more rigid, with lots of tied scores, and I think a little worse than cosine, but overall not bad for the little code we wrote. To make things stronger we could clean up the input and game titles by removing stop words and standardizing the text, though for my actual work project this wouldn’t be helpful. I’m all ears if anyone has fixes for my code, since I hardcoded a bunch of stuff that is probably in a package somewhere. Hope you learned something.
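
Since the function above receives rows of tf-idf weights rather than sets of words, one alternative worth trying is to run Jaccard directly on each title’s set of words. This is a minimal sketch, assuming a simple lowercase-and-split tokenizer. The helper `title_tokens` is hypothetical and mine, not part of tm, and it only keeps ASCII letters and digits.

```r
# Hypothetical helper: lowercase a title and split it into a set of unique words
title_tokens <- function(title) {
  words <- strsplit(tolower(title), "[^a-z0-9]+")[[1]]
  unique(words[words != ""])
}

# Jaccard similarity between two sets of words
jaccard_similarity <- function(set1, set2) {
  length(intersect(set1, set2)) / length(union(set1, set2))
}

target <- title_tokens("Monster Hunter: World - Deluxe Kit")
candidates <- c(
  "Monster Hunter: World Original Soundtrack",
  "Dying Light - Harran Inmate Bundle"
)

# Score each candidate title against the target's word set
scores <- sapply(candidates, function(title) {
  jaccard_similarity(target, title_tokens(title))
})
scores
```

With word sets, the soundtrack title shares three of seven distinct words with the target, while the Dying Light bundle shares none.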

One more thing - to take an analysis like this to the next step we would incorporate the other data we have like reviews or ratings to factor into the sorting of our results to promote better games or more relevant ones.
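
As a rough idea of what that could look like, here is a sketch that blends a similarity score with the positive_ratio column. The three rows and the 70/30 weighting are made up for illustration and are not results from the dataset.

```r
# Made-up candidates: similarity scores and review ratios are illustrative only
candidates <- data.frame(
  title = c("Game A", "Game B", "Game C"),
  similarity_score = c(0.50, 0.45, 0.20),
  positive_ratio = c(60, 95, 90)
)

# Assumed weighting: 70% title similarity, 30% share of positive reviews
similarity_weight <- 0.7
candidates$blended_score <- similarity_weight * candidates$similarity_score +
  (1 - similarity_weight) * candidates$positive_ratio / 100

# Sort so the best blended score comes first
ranked <- candidates[order(candidates$blended_score, decreasing = TRUE), ]
ranked
```

Here Game B overtakes Game A despite a slightly lower similarity, because its reviews are much better. The right weighting would need tuning on real data.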