This report clusters Steam tags using clustering algorithms. The dataset is hosted on Kaggle and can be downloaded here: https://www.kaggle.com/nikdavis/steam-store-games
The report is structured as follows:
Data Extraction
Data Cleaning
Data Preparation
Optimal number of clusters
Generate Dendrogram
Network of Tags
Recommendation
The purpose of this clustering is to uncover the connections between tags, how game developers could use them, and how they could benefit players.
Import the necessary libraries.
ggplot2: graphics and visualization. ade4: tools for multivariate data analysis.
Then read the tag data into a data frame so it can be used.
library(ggplot2)
library(ade4)
t2 <- read.csv("data/steamspy_tag_data.csv", header = TRUE, row.names = 1, sep = ",")
Before cleaning, the data had 29,022 rows and 371 columns.
Rows (games) whose largest tag value is 60 or less are removed, as are columns (tags) whose largest value is 5000 or less.
rowMax = apply(t2, 1, max)
t3 = t2[rowMax > 60,]
rowMax = rowMax[rowMax > 60]
colMax = apply(t2, 2, max)
t3 = t3[,colMax > 5000]
colMax = colMax[colMax > 5000]
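The filter can be tried on a small made-up matrix to see which rows and columns survive. This is only a sketch: the values, the names `toy`, `row_min`, and `col_min`, and the column threshold of 50 are assumptions for illustration, not part of the original data.

```r
# Toy tag matrix: rows are games, columns are tags (values are made up).
toy <- data.frame(
  action = c(100, 5, 80),
  indie  = c(20, 3, 90),
  rare   = c(1, 2, 4),
  row.names = c("game_a", "game_b", "game_c")
)

row_min <- 60  # hypothetical threshold on each game's largest tag value
col_min <- 50  # hypothetical threshold on each tag's largest value

row_max <- apply(toy, 1, max)
col_max <- apply(toy, 2, max)

# Keep games and tags whose maximum exceeds the thresholds.
kept <- toy[row_max > row_min, col_max > col_min, drop = FALSE]
print(dim(kept))
```

Here `game_b` and the `rare` tag are dropped because their largest values fall below the thresholds.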
After cleaning, the data has the following dimensions:
dim(t3)
## [1] 3644 48
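Once the data is cleaned, the next step is to transform it into presence/absence form: a tag is marked 1 for a game if its value is at least half of that game's largest tag value, and 0 otherwise.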
# Mark a tag as present (1) when it reaches half of the row's maximum, else absent (0).
for (i in 1:dim(t3)[1]) {
  for (j in 1:dim(t3)[2]) {
    if (t3[i,j] < 0.5*rowMax[i]) {
      t3[i,j] = 0
    } else {
      t3[i,j] = 1
    }
  }
}
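The same thresholding can be written without explicit loops. The following is a minimal sketch on a made-up matrix (the names `toy` and `binary` are hypothetical); for a data frame such as `t3`, one option would be to convert it with `as.matrix()` first.

```r
# Toy matrix: rows are games, columns are tags (values are made up).
toy <- matrix(c(100, 60, 10,
                  8,  2,  4),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("game_a", "game_b"), c("action", "indie", "rpg")))

row_max <- apply(toy, 1, max)

# Comparing a matrix with a vector of length nrow recycles the vector down
# each column, so row i is compared with 0.5 * row_max[i].
binary <- (toy >= 0.5 * row_max) * 1L
print(binary)
```

For `game_a` the cut-off is 50, so `action` and `indie` are present; for `game_b` the cut-off is 4, so `action` and `rpg` are present.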
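Next, compute the pairwise distances between tags (the matrix is transposed so that tags, not games, are compared) and project them into two dimensions with classical multidimensional scaling.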
dist = dist.binary(t(t3), method = 8, upper = TRUE)
mds = cmdscale(dist)
df2 = data.frame(Dim.1 = mds[,1], Dim.2 = mds[,2])
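To see what a binary distance between tags looks like, here is a small sketch on made-up data. It swaps in base R's `dist(method = "binary")`, which is the Jaccard distance, as a stand-in; it is not the same coefficient as `dist.binary(method = 8)` from ade4, and the matrix `pa` is hypothetical.

```r
# Presence/absence matrix: rows are games, columns are tags (toy values).
pa <- matrix(c(1, 1, 0,
               1, 1, 0,
               0, 0, 1,
               1, 0, 1),
             nrow = 4, byrow = TRUE,
             dimnames = list(paste0("game_", 1:4), c("action", "shooter", "puzzle")))

# Transpose so that distances are computed between tags, not games.
# Base R's "binary" method is the Jaccard distance, used here as a stand-in.
d <- dist(t(pa), method = "binary")
print(round(as.matrix(d), 2))

# Classical multidimensional scaling into two coordinates per tag.
coords <- cmdscale(d, k = 2)
print(dim(coords))
```

Tags that share most of their games (`action` and `shooter`) end up close together, while tags that never co-occur (`shooter` and `puzzle`) are at the maximum distance of 1.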
Before the unsupervised clustering, it helps to determine the optimal number of clusters. K-means and the elbow method are used here: the total within-cluster sum of squares is computed for k = 1 to 5 and plotted against k.
set.seed(2021)
k <- 2 # number of clusters
km <- kmeans(df2, k)
# km$cluster
# plot(df2, col = km$cluster)
sse <- numeric(5)
set.seed(2021)
for (k in 1:5) { # number of clusters
  km <- kmeans(df2, k)
  sse[k] <- km$tot.withinss
}
plot(sse, type = "b")
Based on the elbow plot, the optimal number of clusters is k = 4.
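Reading an elbow by eye can be subjective, so one rough aid is to look at how much of the remaining error each extra cluster removes. The sketch below uses synthetic points around three made-up centres (not the tag data) and adds `nstart = 10`, which is an assumption beyond the original code; it illustrates the idea and does not reproduce the k = 4 result.

```r
set.seed(1)

# Synthetic 2-D points drawn around three made-up centres.
centres <- rbind(c(0, 0), c(5, 5), c(0, 5))
pts <- do.call(rbind, lapply(1:3, function(i) {
  cbind(rnorm(30, centres[i, 1], 0.5), rnorm(30, centres[i, 2], 0.5))
}))

# Total within-cluster sum of squares for k = 1..6; nstart reruns k-means
# from several random starts and keeps the best fit.
ks <- 1:6
sse <- sapply(ks, function(k) kmeans(pts, centers = k, nstart = 10)$tot.withinss)

# Fraction of the remaining SSE removed by each additional cluster.
rel_drop <- -diff(sse) / sse[-length(sse)]
print(round(rel_drop, 2))
```

A sharp fall in `rel_drop` after some k suggests that further clusters add little, which is the same judgment the elbow plot supports visually.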
With the number of clusters chosen, the tags can be visualized with a dendrogram.
options(repr.plot.width=12, repr.plot.height = 12)
fit <- hclust(dist, method="ward.D2")
plot(fit)
groups <- cutree(fit, k=4)
rect.hclust(fit, k=4, border="red")
Cluster 1: Focus on story, characters, and lore.
Cluster 2: Almost no limitations for the player.
Cluster 3: Focus on co-op and multiplayer.
Cluster 4: It is difficult to find a connecting theme.
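When interpreting clusters like these, it can help to print the members of each group rather than read them off the dendrogram. A minimal sketch, assuming made-up tag names and coordinates (the real analysis would use `groups` from `cutree(fit, k=4)`):

```r
# Toy 2-D coordinates for six made-up tags.
coords <- matrix(c(0, 0,  0.1, 0.2,  5, 5,  5.2, 4.9,  10, 0,  9.8, 0.3),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(c("story", "lore", "coop", "multiplayer",
                                   "sandbox", "open_world"), NULL))

# Ward clustering on Euclidean distances, cut into three groups.
fit <- hclust(dist(coords), method = "ward.D2")
groups <- cutree(fit, k = 3)

# Group tag names by their cluster label.
print(split(names(groups), groups))
```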
## 5. Network of Tags
The tags can also be turned into a network to find out which tags have the most connections and which have the least.
dist_ = as.matrix(dist)
dist_ = 1 - dist_
dist_[dist_ < 0.11] = 0
#install.packages("igraph")
library("igraph")
net = graph_from_adjacency_matrix(dist_, weighted = TRUE, mode = "undirected", diag = F)
E(net)$width = E(net)$weight*40 - 3
options(repr.plot.width=13, repr.plot.height = 13)
# "serif" is a device-independent font family; "Times" is not found in the Windows font database.
plot(net, vertex.color = rgb(0.9,0.9,0.5,0.8), vertex.frame.color = rgb(0.99,0.99, 0.99, 0.4),
     vertex.shape = "circle", vertex.size = 12, vertex.size2 = NA, vertex.label.color = "black",
     vertex.label.family = "serif", vertex.label.cex = 1, edge.color = "blue", edge.lty = "solid", edge.curved = 0)
The connections between tags have two possible interpretations:
The thicker the blue line, the more games were made with the tags it connects; tags with no blue lines were used in fewer games.
The thicker the line, the more strongly the two tags are correlated, that is, the more often they appear together in the same game.
Tags with no connections:
Racing
Moddable
Class_based
Mature
Mod
Third Person Shooter
Crime
Dinosaurs
Moba
E-sports
Post apocalyptic
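Unconnected tags can also be found programmatically from the node degrees instead of reading them off the plot. The sketch below uses a made-up similarity matrix (`sim` and its tag names are hypothetical) with the same 0.11 cut-off as the code above.

```r
library(igraph)

# Toy similarity matrix between four made-up tags.
tags <- c("action", "shooter", "puzzle", "racing")
sim <- matrix(c(1.00, 0.60, 0.20, 0.05,
                0.60, 1.00, 0.15, 0.02,
                0.20, 0.15, 1.00, 0.08,
                0.05, 0.02, 0.08, 1.00),
              nrow = 4, byrow = TRUE, dimnames = list(tags, tags))

# Drop weak links, mirroring the 0.11 cut-off used above.
sim[sim < 0.11] <- 0

net <- graph_from_adjacency_matrix(sim, weighted = TRUE, mode = "undirected", diag = FALSE)

# Tags with degree zero have no edges left after thresholding.
deg <- degree(net)
isolated <- names(deg)[deg == 0]
print(isolated)

# Tags tied for the highest number of connections.
print(names(deg)[deg == max(deg)])
```

In this toy example only `racing` is isolated, because all of its similarities fall below the cut-off.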