I found a dataset on Kaggle that focuses on video game sales for Sony’s home console, the PlayStation 4. Originally released in 2013, the console has seen thousands of titles released on the platform. This dataset covers significant titles, their release year, global sales, and individual region sales. For this assignment, I will present only the most representative games from the data. (https://www.kaggle.com/datasets/sidtwr/videogames-sales-dataset)
# Simple PS4 Games Regional Sales Cluster Analysis
# Load only essential libraries
library(stats) # For k-means clustering
library(graphics) # For basic plotting
# Read the CSV data
ps4_sales <- read.csv("PS4_GamesSales.csv")
# Extract regional sales data
region_sales <- ps4_sales[, c("North.America", "Europe", "Japan", "Rest.of.World")]
# Calculate proportions of total sales for each region
region_props <- region_sales / rowSums(region_sales)
# Run k-means clustering with 4 clusters
set.seed(123) # For reproducibility
kmeans_result <- kmeans(region_props, centers = 4, nstart = 25)
# Add cluster assignments back to the data
ps4_sales$cluster <- kmeans_result$cluster
# Calculate average proportions by cluster
cluster_means <- aggregate(region_props, by = list(Cluster = kmeans_result$cluster), mean)
print(cluster_means)
## Cluster North.America Europe Japan Rest.of.World
## 1 1 0.3930138 0.3979180 0.051164770 0.15790336
## 2 2 0.2467637 0.1805520 0.485779511 0.08690481
## 3 3 0.2544907 0.5633883 0.029457342 0.15266363
## 4 4 0.6526774 0.1636757 0.004121053 0.17952585
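The analysis above fixes the number of clusters at 4 without showing how that value compares with alternatives. One common sanity check is an elbow plot of the total within-cluster sum of squares for several values of k. The sketch below is only an illustration: `toy_props` is a hypothetical, randomly generated stand-in for `region_props` (not the Kaggle data), and the range of k from 1 to 8 is an assumption.

```r
# Hypothetical stand-in for region_props: 60 synthetic "games" whose four
# regional shares sum to 1. This is NOT the Kaggle data.
set.seed(123)
raw <- matrix(runif(60 * 4), ncol = 4,
              dimnames = list(NULL, c("North.America", "Europe",
                                      "Japan", "Rest.of.World")))
toy_props <- raw / rowSums(raw)

# Total within-cluster sum of squares for k = 1..8 (range is an assumption)
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(toy_props, centers = k, nstart = 25)$tot.withinss
})

# Look for the k after which the curve flattens out
plot(k_values, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares",
     main = "Elbow check on toy data")
```

To apply the same check to the real data, `toy_props` would be swapped for `region_props`; a bend near k = 4 would support the choice made above.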
# Show the most representative games from each cluster (closest to cluster centers)
cluster_centers <- kmeans_result$centers
representative_games <- vector("list", 4)
for (i in 1:4) {
  # Indices of the games assigned to cluster i
  cluster_i_games <- which(kmeans_result$cluster == i)
  # Squared Euclidean distance from each game to the cluster center
  distances <- numeric(length(cluster_i_games))
  for (j in seq_along(cluster_i_games)) {
    game_idx <- cluster_i_games[j]
    distances[j] <- sum((region_props[game_idx, ] - cluster_centers[i, ])^2)
  }
  # Get the 3 most representative games (fewer if the cluster is smaller)
  closest_indices <- cluster_i_games[head(order(distances), 3)]
  representative_games[[i]] <- ps4_sales[closest_indices, c("Game", "Genre", "Publisher", "Global")]
  cat("\nCluster", i, "representative games:\n")
  print(representative_games[[i]])
}
##
## Cluster 1 representative games:
## Game Genre Publisher
## 26 Overwatch Shooter Blizzard Entertainment
## 19 Horizon: Zero Dawn Action Sony Interactive Entertainment
## 81 Middle-Earth: Shadow of War Action Warner Bros. Interactive Entertainment
## Global
## 26 4.54
## 19 5.82
## 81 2.04
##
## Cluster 2 representative games:
## Game Genre Publisher Global
## 25 Monster Hunter: World Action Capcom 4.67
## 97 Persona 5 Role-Playing Deep Silver 1.64
## 82 Dragon Quest XI Role-Playing Square Enix 2.04
##
## Cluster 3 representative games:
## Game Genre Publisher Global
## 41 Assassin's Creed Syndicate Action Ubisoft 3.60
## 59 Mafia III Action-Adventure 2K Games 2.87
## 90 The Crew Racing Ubisoft 1.79
##
## Cluster 4 representative games:
## Game Genre Publisher Global
## 42 NBA 2K17 Sports 2K Sports 3.52
## 63 Madden NFL 18 Sports EA Sports 2.62
## 37 NBA 2K16 Sports 2K Sports 3.98
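The nested loop above computes, for each game, the squared Euclidean distance to the center of its own cluster. The same quantity can be computed without loops by indexing the centers matrix with the cluster assignments. This is a minimal sketch on hypothetical synthetic data (`toy_props` and `km` are illustrative names, not objects from the analysis above):

```r
# Hypothetical synthetic proportions (NOT the Kaggle data)
set.seed(123)
raw <- matrix(runif(40 * 4), ncol = 4)
toy_props <- raw / rowSums(raw)
km <- kmeans(toy_props, centers = 4, nstart = 25)

# Repeat each row's own cluster center so it lines up with toy_props
own_center <- km$centers[km$cluster, , drop = FALSE]

# Squared Euclidean distance from every row to its own cluster center
dist_sq <- rowSums((toy_props - own_center)^2)

# Row indices of up to 3 closest rows per cluster
closest <- lapply(split(seq_len(nrow(toy_props)), km$cluster), function(idx) {
  head(idx[order(dist_sq[idx])], 3)
})
print(closest)
```

A quick consistency check is that `sum(dist_sq)` should match `km$tot.withinss`, since both are the total squared distance of points to their assigned centers. With the real data, `as.matrix(region_props)` would take the place of `toy_props`.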
# Basic visualization: plot North America vs Europe sales by cluster
plot(region_props[, "North.America"], region_props[, "Europe"],
     col = kmeans_result$cluster, pch = 16,
     xlab = "North America (proportion)", ylab = "Europe (proportion)",
     main = "Regional Sales Clusters")
# Add a legend
legend("topright", legend = paste("Cluster", 1:4),
       col = 1:4, pch = 16, cex = 0.8)
# Plot cluster profiles (average regional proportions)
barplot(t(as.matrix(cluster_means[, -1])),
        beside = TRUE,
        names.arg = paste("Cluster", 1:4),
        col = c("skyblue", "coral", "lightgreen", "purple"),
        legend.text = c("North America", "Europe", "Japan", "Rest of World"),
        main = "Average Regional Sales Proportions by Cluster",
        ylab = "Proportion of Sales")
Looking at this data, primarily the second graph (the bar plot), North America has the highest average proportion of sales when the four clusters are taken together. I remember doing a presentation for a different class and finding out that video games outsell movies and music combined. It does not surprise me that North America has a higher proportion of sales overall than other regions, because there are so many outlets and cheap ways to play games, whether on a mobile device, a computer, or a cheaper handheld. Gaming seems to get more popular every year as more games are released and access is easier than it was many years ago. Also, home console gaming is less popular in Japan, where people tend to prefer on-the-go gaming, so mobile devices and Nintendo’s handhelds stand above the competition in that region.
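The regional comparison can be checked directly against the cluster means printed earlier. The sketch below re-enters those printed values by hand and computes two things: the unweighted average share per region across the four clusters, and the leading region within each cluster. One caveat is that this average treats every cluster equally; it is not weighted by how many games or how many sales each cluster contains, since the write-up does not report those figures.

```r
# Cluster means copied from the printed output of print(cluster_means)
cluster_means <- data.frame(
  Cluster       = 1:4,
  North.America = c(0.3930138, 0.2467637, 0.2544907, 0.6526774),
  Europe        = c(0.3979180, 0.1805520, 0.5633883, 0.1636757),
  Japan         = c(0.051164770, 0.485779511, 0.029457342, 0.004121053),
  Rest.of.World = c(0.15790336, 0.08690481, 0.15266363, 0.17952585)
)

# Unweighted average share per region (each cluster counts equally)
avg_share <- colMeans(cluster_means[, -1])
print(round(avg_share, 3))

# Region with the largest share inside each cluster
regions <- names(cluster_means)[-1]
leader <- regions[apply(cluster_means[, -1], 1, which.max)]
print(data.frame(Cluster = cluster_means$Cluster, Leader = leader))
```

On these numbers North America has the largest unweighted average, but it leads only cluster 4; Europe leads clusters 1 (narrowly) and 3, and Japan leads cluster 2.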
Video Games Sales Dataset. (2019, May 10). Kaggle. https://www.kaggle.com/datasets/sidtwr/videogames-sales-dataset
https://claude.ai/ (Used to create syntax for analysis)