Steam Tags Clustering

Description

This report clusters Steam tags using k-means and hierarchical clustering. The dataset is hosted on Kaggle and can be downloaded here: https://www.kaggle.com/nikdavis/steam-store-games

The report is structured as follows:

Data Extraction
Data Cleaning
Data Preparation
Optimal number of clusters
Generate Dendrogram
Network of Tags
Recommendation

The purpose of this clustering is to find the connections between tags, and to see how game developers could use them and how players could benefit from them.

1. Data Extraction

Import the necessary libraries:

Library ggplot2: graphics and visualization.
Library ade4: tools for multivariate data analysis.

Then read the data into a data frame so it can be used.

library(ggplot2)
library(ade4)

t2 <- read.csv("data/steamspy_tag_data.csv", header = TRUE, row.names = 1, sep = ",")

2. Data Cleaning

Before cleaning, the data had 29,022 rows and 371 columns.

Rows whose maximum value is 60 or less are removed, as are columns whose maximum value is 5000 or less.

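# Maximum value in each row (game)
rowMax = apply(t2, 1, max)

# Keep rows whose maximum exceeds 60
t3 = t2[rowMax > 60,]
rowMax = rowMax[rowMax > 60]

# Maximum value in each column (tag), computed on the unfiltered data
colMax = apply(t2, 2, max)

# Keep columns whose maximum exceeds 5000
t3 = t3[,colMax > 5000]
colMax = colMax[colMax > 5000]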

After cleaning, the data has the following dimensions:

dim(t3)

## [1] 3644   48
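
The filtering step can be tried on a small made-up matrix to see which rows and columns survive. This is only a minimal sketch: the matrix `toy` and the thresholds 5 and 8 are assumptions chosen for illustration and are not taken from the Steam data.

```r
# Toy vote matrix: rows are games, columns are tags (hypothetical values)
toy <- matrix(c(10, 0, 2,
                 3, 1, 0,
                 9, 4, 7),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("g1", "g2", "g3"), c("a", "b", "c")))

row_max <- apply(toy, 1, max)   # largest value per game
col_max <- apply(toy, 2, max)   # largest value per tag, over all games

# Keep games whose largest value exceeds 5,
# and tags whose largest value exceeds 8
kept <- toy[row_max > 5, col_max > 8, drop = FALSE]
kept
```

`drop = FALSE` keeps the result a matrix even when only one row or column remains.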

3. Data Preparation

Once the cleaning is complete, the next step is to transform the data into presence/absence values: an entry becomes 1 if it is at least half of its row's maximum, and 0 otherwise.

for(i in 1:dim(t3)[1])
  for(j in 1:dim(t3)[2])
  {
    if(t3[i,j] < 0.5*rowMax[i]) 
    {
      t3[i,j] = 0
    } else 
    {
      t3[i,j] = 1
    }
    
  }
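
The same presence/absence rule can probably be written without explicit loops. The sketch below uses a made-up matrix `votes` (an assumption, not the Steam data) and relies on R recycling a vector of length `nrow` down the columns of a matrix, so each entry is compared with the threshold of its own row.

```r
# Toy vote matrix (hypothetical values): rows are games, columns are tags
votes <- matrix(c(10, 4, 5,
                   2, 8, 3),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("g1", "g2"), c("a", "b", "c")))
row_max <- apply(votes, 1, max)

# row_max has one entry per row and is recycled down the columns,
# so votes[i, j] is compared with 0.5 * row_max[i]
presence <- (votes >= 0.5 * row_max) * 1
presence
```

Converting a data frame with `as.matrix()` first keeps the recycling behaviour easy to reason about.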

Next, compute the pairwise distances between tags and project them onto two dimensions with classical multidimensional scaling.

dist = dist.binary(t(t3), method = 8, upper = TRUE)

mds = cmdscale(dist)
df2 = data.frame(Dim.1 = mds[,1], Dim.2 =  mds[,2])
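
To get a feel for what `cmdscale` does, it can be run on points whose layout is already known. The sketch below uses four made-up points and base R's Euclidean `dist` rather than `dist.binary`; the tag distances in this report are not guaranteed to embed this exactly in two dimensions, so treat it as an illustration only.

```r
# Four hypothetical points in the plane
pts <- matrix(c(0, 0,
                3, 0,
                0, 4,
                3, 4),
              ncol = 2, byrow = TRUE)
d <- dist(pts)               # Euclidean pairwise distances

emb <- cmdscale(d, k = 2)    # classical MDS into 2 dimensions
d_emb <- dist(emb)           # distances between the embedded points

# For Euclidean input, the embedding preserves the distances
# (the coordinates themselves may be rotated or reflected)
max_err <- max(abs(as.vector(d) - as.vector(d_emb)))
max_err
```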

4. Optimal number of clusters

Before running the unsupervised clustering, it helps to determine the number of clusters that gives the best result. K-means and the elbow method are used to find the optimal number of clusters.

set.seed(2021)
k <- 2 # number of clusters
km <- kmeans(df2, k)
# km$cluster

# plot(df2, col = km$cluster)

sse <- numeric(5)

set.seed(2021)
for (k in 1:5) { # number of clusters
  km <- kmeans(df2, k)
  sse[k] <- km$tot.withinss
}

plot(sse, type = "b")

Based on the result, the optimal number of clusters is k = 4.
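
Reading the elbow off a plot is a judgment call. One possible aid, shown here on synthetic data, is to look at the fraction of the remaining SSE removed by each additional cluster. The data, the seed, and the use of `nstart = 10` are all assumptions for this sketch and are not part of the analysis above.

```r
set.seed(1)  # arbitrary seed for reproducibility
# Three well-separated synthetic groups of 20 points each (made-up data)
centers <- rbind(c(0, 0), c(10, 0), c(0, 10))
xy <- centers[rep(1:3, each = 20), ] + matrix(rnorm(120, sd = 0.5), ncol = 2)

# Total within-cluster sum of squares for k = 1..6
sse <- sapply(1:6, function(k) kmeans(xy, centers = k, nstart = 10)$tot.withinss)

# Fraction of the remaining SSE removed by each additional cluster:
# rel_drop[i] describes the step from i to i + 1 clusters
rel_drop <- -diff(sse) / sse[-length(sse)]
round(rel_drop, 2)
```

With data built from three groups, the largest relative drop would be expected at the step from 2 to 3 clusters, after which the gains flatten out.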

Generate Dendrogram

With the number of clusters determined, the data can be visualized as a dendrogram.

options(repr.plot.width=12, repr.plot.height = 12)

fit <- hclust(dist, method="ward.D2")
plot(fit)
groups <- cutree(fit, k=4)
rect.hclust(fit, k=4, border="red")
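
The way `hclust` and `cutree` work together can be checked on a tiny example where the grouping is obvious. The six points below are made up for illustration; only the linkage method matches the code above.

```r
# Six hypothetical points on a line forming two obvious groups
x <- matrix(c(0, 0.1, 0.2, 10, 10.1, 10.2), ncol = 1,
            dimnames = list(paste0("p", 1:6), NULL))

hc <- hclust(dist(x), method = "ward.D2")  # same linkage as above
grp <- cutree(hc, k = 2)                   # cut the tree into 2 groups
grp

# Members of each group
members <- split(names(grp), grp)
members
```

Applying the same `split(names(groups), groups)` idea to the `groups` vector above should list the tags in each of the four clusters.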

Based on the cluster split, 3 of the 4 tag clusters have a connecting theme:

Cluster 1: Focus on story, characters, and lore.

Cluster 2: Almost no limitations for the player.

Cluster 3: Focus on co-op and multiplayer.

For Cluster 4, it is difficult to find a connecting theme.

5. Network of Tags

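The tags can also be turned into a network to find out which tags have the most connections and which have the fewest. The distances are first converted to similarities (1 - distance), and similarities below 0.11 are set to 0 so that weak links are dropped.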

dist_ = as.matrix(dist)
dist_ = 1 - dist_

dist_[dist_ < 0.11] = 0

#install.packages("igraph")
library("igraph")

net = graph_from_adjacency_matrix(dist_, weighted = TRUE, mode = "undirected", diag = F)
E(net)$width = E(net)$weight*40 - 3

options(repr.plot.width=13, repr.plot.height = 13)

plot(net, vertex.color = rgb(0.9,0.9,0.5,0.8), vertex.frame.color = rgb(0.99,0.99, 0.99, 0.4), 
     vertex.shape = "circle",  vertex.size = 12, vertex.size2 = NA, vertex.label.color = "black", 
     vertex.label.family = "serif", vertex.label.cex = 1, edge.color = "blue", edge.lty = "solid", edge.curved = 0)

The connections between tags can be read in two possible ways:

The thicker the blue line, the more games were made with the tags it connects; tags with no blue lines were used in fewer games.
The thicker the line, the more strongly the tags are correlated and appear together in a game.

Since the line width is derived from the pairwise similarity between tags, the second reading is closest to what the code computes.

Tags with no connections:

Racing
Moddable
Class_based
Mature
Mod
Third Person Shooter
Crime
Dinosaurs
Moba
E-sports
Post apocalyptic
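
A list like this could also be derived from the thresholded matrix instead of being read off the plot. The sketch below uses a made-up similarity matrix `sim` with hypothetical tag names A to D; in this report the corresponding object would be `dist_` after the 0.11 cutoff.

```r
# Hypothetical symmetric similarity matrix for four tags
sim <- matrix(c(1.00, 0.40, 0.05, 0.02,
                0.40, 1.00, 0.20, 0.01,
                0.05, 0.20, 1.00, 0.03,
                0.02, 0.01, 0.03, 1.00),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("A", "B", "C", "D"), c("A", "B", "C", "D")))

sim[sim < 0.11] <- 0   # drop weak links, same cutoff as above
diag(sim) <- 0         # ignore each tag's similarity with itself

degree <- rowSums(sim > 0)               # remaining links per tag
isolated <- names(degree)[degree == 0]   # tags with no links at all
isolated
```

Sorting `degree` in decreasing order would likewise show which tags have the most connections.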

6. Recommendation

Developers who want to pursue a blue ocean strategy could use the tags with no connections to create new types of games, since these tags rarely appear together with other tags in existing games.