Over the past few decades, the video game business has grown and changed enormously, emerging as a major force in the worldwide entertainment sector. Understanding the factors that lead to video game success is becoming increasingly important for developers, publishers, and other stakeholders as the industry grows. Global sales data lets us examine market dynamics, spot new trends, and arrive at well-informed conclusions.
To find the trends and insights that drive industry success, this report examines a data set of worldwide video game sales. Through data preparation, clustering analysis, and visualizations, we explore the links between several features, including genre, platform, publisher, and sales figures. The analysis uses the k-means algorithm, an unsupervised learning technique, to find clusters in the data set, and the number of clusters is determined with the Elbow Method.
The first step was to load the data and inspect its structure and class. The data set consists of character and numeric columns, so some data processing is needed before clustering.
library(readr)
Warning: package ‘readr’ was built under R version 4.3.2
videogameglobalsales <- read_csv("clustering final/videogameglobalsales.csv")
Rows: 16598 Columns: 11── Column specification ───────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (4): name, platform, genre, publisher
dbl (7): rank, year, na_sales, eu_sales, jp_sales, other_sales, global_sales
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(videogameglobalsales)
class(videogameglobalsales)
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
colnames(videogameglobalsales)
[1] "rank" "name" "platform" "year" "genre" "publisher"
[7] "na_sales" "eu_sales" "jp_sales" "other_sales" "global_sales"
head(videogameglobalsales)
I then checked for missing values in the data set. The only column with missing data was the year column (271 rows), and I decided to omit all rows with a missing year because they make up a small share of the 16,598 rows.
missing_values_per_column <- colSums(is.na(videogameglobalsales))
print(missing_values_per_column)
rank name platform year genre publisher na_sales
0 0 0 271 0 0 0
eu_sales jp_sales other_sales global_sales
0 0 0 0
video_game_global_sales1 <- na.omit(videogameglobalsales)
missing_values_per_column <- colSums(is.na(video_game_global_sales1))
print(missing_values_per_column)
rank name platform year genre publisher na_sales
0 0 0 0 0 0 0
eu_sales jp_sales other_sales global_sales
0 0 0 0
In this step I checked the data type of each variable and the head of the data set to get a clear view of what each column contains.
print(sapply(video_game_global_sales1, class))
rank name platform year genre publisher na_sales
"numeric" "character" "character" "numeric" "character" "character" "numeric"
eu_sales jp_sales other_sales global_sales
"numeric" "numeric" "numeric" "numeric"
head(video_game_global_sales1)
I identified all character and integer columns in the data set and converted them to factors to prepare the data for clustering. No column in this data set is stored as an integer, so only the four character columns change. At this stage I also summarized the data set.
character_columns <- sapply(video_game_global_sales1, is.character)
integer_columns <- sapply(video_game_global_sales1, is.integer)
video_game_global_sales1[character_columns] <-lapply(video_game_global_sales1[character_columns], as.factor)
video_game_global_sales1[integer_columns] <- lapply(video_game_global_sales1[integer_columns], as.factor)
print(sapply(video_game_global_sales1, class))
rank name platform year genre publisher na_sales
"numeric" "factor" "factor" "numeric" "factor" "factor" "numeric"
eu_sales jp_sales other_sales global_sales
"numeric" "numeric" "numeric" "numeric"
summary(video_game_global_sales1)
rank name platform year
Min. : 1 Need for Speed: Most Wanted: 12 DS :2133 Min. :1980
1st Qu.: 4136 FIFA 14 : 9 PS2 :2127 1st Qu.:2003
Median : 8295 LEGO Marvel Super Heroes : 9 PS3 :1304 Median :2007
Mean : 8293 Ratatouille : 9 Wii :1290 Mean :2006
3rd Qu.:12442 Angry Birds Star Wars : 8 X360 :1235 3rd Qu.:2010
Max. :16600 Cars : 8 PSP :1197 Max. :2020
(Other) :16272 (Other):7041
genre publisher na_sales eu_sales
Action :3253 Electronic Arts : 1339 Min. : 0.0000 Min. : 0.0000
Sports :2304 Activision : 966 1st Qu.: 0.0000 1st Qu.: 0.0000
Misc :1710 Namco Bandai Games : 928 Median : 0.0800 Median : 0.0200
Role-Playing:1471 Ubisoft : 918 Mean : 0.2654 Mean : 0.1476
Shooter :1282 Konami Digital Entertainment: 823 3rd Qu.: 0.2400 3rd Qu.: 0.1100
Adventure :1276 THQ : 712 Max. :41.4900 Max. :29.0200
(Other) :5031 (Other) :10641
jp_sales other_sales global_sales
Min. : 0.00000 Min. : 0.00000 Min. : 0.0100
1st Qu.: 0.00000 1st Qu.: 0.00000 1st Qu.: 0.0600
Median : 0.00000 Median : 0.01000 Median : 0.1700
Mean : 0.07866 Mean : 0.04832 Mean : 0.5402
3rd Qu.: 0.04000 3rd Qu.: 0.04000 3rd Qu.: 0.4800
Max. :10.22000 Max. :10.57000 Max. :82.7400
At this stage we identified outliers using the z-score and removed them: any row with an absolute z-score above 3 in at least one of the sales columns was dropped. In this data set a total of 470 outliers were identified and removed.
numeric_columns <- c("na_sales", "eu_sales", "jp_sales", "other_sales", "global_sales")
z_scores <- scale(video_game_global_sales1[numeric_columns])
outliers <- apply(abs(z_scores) > 3, 1, any)
cat("Number of outliers identified:", sum(outliers), "\n")
Number of outliers identified: 470
video_game_global_sales1 <- video_game_global_sales1[!outliers, ]
summary(video_game_global_sales1)
rank name platform year
Min. : 204 Need for Speed: Most Wanted : 11 DS :2090 Min. :1980
1st Qu.: 4498 LEGO Marvel Super Heroes : 9 PS2 :2060 1st Qu.:2003
Median : 8534 Ratatouille : 9 Wii :1262 Median :2007
Mean : 8525 Angry Birds Star Wars : 8 PS3 :1261 Mean :2007
3rd Qu.:12562 Cars : 8 X360 :1195 3rd Qu.:2010
Max. :16600 Lego Batman 3: Beyond Gotham: 8 PSP :1183 Max. :2020
(Other) :15804 (Other):6806
genre publisher na_sales eu_sales
Action :3176 Electronic Arts : 1292 Min. :0.0000 Min. :0.0000
Sports :2240 Activision : 932 1st Qu.:0.0000 1st Qu.:0.0000
Misc :1670 Namco Bandai Games : 913 Median :0.0700 Median :0.0200
Role-Playing:1382 Ubisoft : 900 Mean :0.1957 Mean :0.1012
Adventure :1270 Konami Digital Entertainment: 804 3rd Qu.:0.2200 3rd Qu.:0.1000
Shooter :1232 THQ : 709 Max. :2.7100 Max. :1.6600
(Other) :4887 (Other) :10307
jp_sales other_sales global_sales
Min. :0.00000 Min. :0.00000 Min. :0.0100
1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.0600
Median :0.00000 Median :0.01000 Median :0.1600
Mean :0.04761 Mean :0.03213 Mean :0.3769
3rd Qu.:0.03000 3rd Qu.:0.03000 3rd Qu.:0.4300
Max. :1.01000 Max. :0.61000 Max. :5.0200
dim(video_game_global_sales1)
[1] 15857 11
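As an illustration of the z-score rule used above, here is a minimal sketch on a small made-up data frame. The column values and the names `toy`, `z`, and `outlier_rows` are assumptions for the example only; nothing here is taken from the sales data.

```r
# Toy data (hypothetical values): each column has one extreme entry
toy <- data.frame(
  na_sales = c(rep(0.1, 19), 10),
  eu_sales = c(5, rep(0.2, 19))
)

# scale() centres each column on its mean and divides by its standard deviation
z <- scale(toy)

# A row is an outlier if any column is more than 3 standard deviations from the mean
outlier_rows <- apply(abs(z) > 3, 1, any)
which(outlier_rows)

# Keep only the rows that were not flagged
toy_clean <- toy[!outlier_rows, ]
nrow(toy_clean)
```

Because a row is dropped when any single column exceeds the threshold, a game with ordinary sales in one region can still be removed for an extreme value in another.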
This is the last step of data cleaning and processing: I standardized all numeric sales columns and encoded genre as a numeric code so that it can be used for clustering, since my main goal is to identify the relationship between video game genre and global sales.
video_game_global_sales1$genre_code <- as.numeric(factor(video_game_global_sales1$genre))
unique_genres <- levels(factor(video_game_global_sales1$genre))
cat("Genre Codes:\n")
Genre Codes:
# as.numeric(factor(...)) assigns codes in level order, so the code of the i-th level is i
# (pairing the levels with unique(genre_code) would list the codes in order of first appearance)
for (i in seq_along(unique_genres)) {
  cat(sprintf("%s: %d\n", unique_genres[i], i))
}
Action: 1
Adventure: 2
Fighting: 3
Misc: 4
Platform: 5
Puzzle: 6
Racing: 7
Role-Playing: 8
Shooter: 9
Simulation: 10
Sports: 11
Strategy: 12
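To make the encoding step concrete, the following sketch shows how `as.numeric(factor(...))` numbers categories on a made-up vector. The vector `genres` and the helper names are assumptions for illustration. Codes follow the sorted factor levels, not the order in which values first appear.

```r
# Hypothetical genre labels, deliberately not in alphabetical order
genres <- c("Sports", "Action", "Puzzle", "Action")

genre_factor <- factor(genres)      # levels are sorted: Action, Puzzle, Sports
codes <- as.numeric(genre_factor)   # each value becomes the index of its level

# A reliable lookup table pairs every level with its position
mapping <- data.frame(
  genre = levels(genre_factor),
  code = seq_along(levels(genre_factor)),
  stringsAsFactors = FALSE
)
print(mapping)

# unique(codes) returns codes in order of first appearance, so it does not
# line up with levels() and should not be used to build the lookup table
unique(codes)
```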
video_game_global_sales1[numeric_columns] <- scale(video_game_global_sales1[numeric_columns])
video_game_global_sales1$genre_code <- as.numeric(factor(video_game_global_sales1$genre))
summary(video_game_global_sales1)
rank name platform year genre
Min. : 204 Need for Speed: Most Wanted : 11 DS :2090 Min. :1980 Action :3176
1st Qu.: 4498 LEGO Marvel Super Heroes : 9 PS2 :2060 1st Qu.:2003 Sports :2240
Median : 8534 Ratatouille : 9 Wii :1262 Median :2007 Misc :1670
Mean : 8525 Angry Birds Star Wars : 8 PS3 :1261 Mean :2007 Role-Playing:1382
3rd Qu.:12562 Cars : 8 X360 :1195 3rd Qu.:2010 Adventure :1270
Max. :16600 Lego Batman 3: Beyond Gotham: 8 PSP :1183 Max. :2020 Shooter :1232
(Other) :15804 (Other):6806 (Other) :4887
publisher na_sales eu_sales jp_sales
Electronic Arts : 1292 Min. :-0.58000 Min. :-0.502985 Min. :-0.3941
Activision : 932 1st Qu.:-0.58000 1st Qu.:-0.502985 1st Qu.:-0.3941
Namco Bandai Games : 913 Median :-0.37251 Median :-0.403545 Median :-0.3941
Ubisoft : 900 Mean : 0.00000 Mean : 0.000000 Mean : 0.0000
Konami Digital Entertainment: 804 3rd Qu.: 0.07211 3rd Qu.:-0.005785 3rd Qu.:-0.1457
THQ : 709 Max. : 7.45291 Max. : 7.750540 Max. : 7.9658
(Other) :10307
other_sales global_sales genre_code cluster
Min. :-0.50462 Min. :-0.64902 Min. : 1.000 1:4969
1st Qu.:-0.50462 1st Qu.:-0.56056 1st Qu.: 2.000 2:3954
Median :-0.34759 Median :-0.38365 Median : 6.000 3:6934
Mean : 0.00000 Mean : 0.00000 Mean : 5.911
3rd Qu.:-0.03352 3rd Qu.: 0.09402 3rd Qu.: 9.000
Max. : 9.07441 Max. : 8.21437 Max. :12.000
Now that the data is cleaned and standardized, I proceed to clustering. I use the Elbow Method to determine the number of clusters for the selected features.
library(ggplot2)
Warning: package ‘ggplot2’ was built under R version 4.3.2
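elbow_point <- function(wcss_values) {
  # Simple heuristic (not used further below): WCSS generally falls as k grows, so look
  # for the sharpest bend, i.e. the largest second difference of the WCSS curve
  bend <- diff(wcss_values, differences = 2)
  return(which.max(bend) + 1)
}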
calculate_wcss <- function(data, k) {
kmeans_result <- kmeans(data, centers = k, nstart = 10)
return(kmeans_result$tot.withinss)
}
features_for_clustering <- video_game_global_sales1[, c("global_sales", "genre_code")]
k_values <- 1:10
wcss_values <- sapply(k_values, function(k) calculate_wcss(features_for_clustering, k))
plot(k_values, wcss_values, type = "b", pch = 19, frame = FALSE,
xlab = "Number of Clusters (k)", ylab = "Within-Cluster Sum of Squares (WCSS)",
main = "Elbow Method for Optimal Number of Clusters")
Based on the elbow graph, I decided to use 3 clusters: the WCSS decreases only gradually beyond 3 clusters, suggesting that the third cluster still captures meaningful variation in the data while further clusters add little.
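The elbow reading can be sketched on synthetic data where the right answer is known. The example below is an assumption-laden illustration, not the analysis itself: the three groups, the object names (`toy`, `wcss`, `drops`), and the seed are all made up.

```r
# Build three tight, well-separated groups of 9 points each (hypothetical data)
offsets <- expand.grid(dx = c(-0.5, 0, 0.5), dy = c(-0.5, 0, 0.5))
centres <- data.frame(cx = c(0, 10, 0), cy = c(0, 0, 10))
toy <- do.call(rbind, lapply(seq_len(nrow(centres)), function(i) {
  data.frame(x = centres$cx[i] + offsets$dx, y = centres$cy[i] + offsets$dy)
}))

set.seed(42)  # k-means uses random starting centres; the seed value is arbitrary
k_values <- 1:6
wcss <- sapply(k_values, function(k) {
  kmeans(toy, centers = k, nstart = 25)$tot.withinss
})

# Reduction in WCSS gained by each additional cluster
drops <- -diff(wcss)
round(wcss, 2)
round(drops, 2)
```

On this toy data the reductions are large up to k = 3 and small afterwards, which is the pattern the elbow plot is read for.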
Using this number of clusters, I clustered on the two main features of this project, global sales and genre.
kmeans_result <- kmeans(features_for_clustering, centers = 3, nstart = 10)
str(kmeans_result$cluster)
int [1:15857] 1 2 1 1 1 1 1 3 2 3 ...
length(kmeans_result$cluster)
[1] 15857
video_game_global_sales1$cluster <- as.factor(kmeans_result$cluster)
summary(video_game_global_sales1$cluster)
1 2 3
6934 3954 4969
According to the results, the data is grouped into three clusters: the first cluster contains 6,934 observations, and the second and third contain 3,954 and 4,969. The clusters are shown in the scatter plot below for clarity.
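Cluster numbers returned by `kmeans()` are arbitrary and can change between runs, because the algorithm starts from random centres. The sketch below, on made-up one-dimensional data, shows two hedged ways to keep results comparable: fixing a seed, and renumbering clusters by their centres. The seed value and the names `fit_a`, `fit_b`, and `stable_label` are assumptions for illustration.

```r
# Toy data with three obvious groups (hypothetical values)
x <- matrix(c(0, 0.5, 1, 10, 10.5, 11, 20, 20.5, 21), ncol = 1,
            dimnames = list(NULL, "x"))

# The same seed gives the same random starts, so the labels are identical
set.seed(123)
fit_a <- kmeans(x, centers = 3, nstart = 25)
set.seed(123)
fit_b <- kmeans(x, centers = 3, nstart = 25)
identical(fit_a$cluster, fit_b$cluster)

# Renumber clusters by ascending centre so that label 1 always has the lowest centre
stable_label <- rank(fit_a$centers[, "x"])[fit_a$cluster]
unname(stable_label)
```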
Scatter plot of ‘global_sales’ vs. ‘genre_code’ with cluster coloring
plot(features_for_clustering, col = c("black", "red", "green")[kmeans_result$cluster], pch = 16,
main = "K-Means Clustering", xlab = "Global Sales", ylab = "Genre Code")
The scatter plot shows that the observations were divided into 3 groups according to their genre codes. Next, I determine which genres fall in each cluster, using the actual genre names rather than the codes.
Understanding which type of genres falls in which cluster.
genre_mapping <- unique(video_game_global_sales1[, c("genre_code", "genre")])
table_summary_with_genres <- table(video_game_global_sales1$cluster, video_game_global_sales1$genre)
print(table_summary_with_genres)
Action Adventure Fighting Misc Platform Puzzle Racing Role-Playing Shooter Simulation Sports
1 3176 1270 818 1670 0 0 0 0 0 0 0
2 0 0 0 0 820 557 1195 1382 0 0 0
3 0 0 0 0 0 0 0 0 1232 835 2240
Strategy
1 0
2 0
3 662
From this table we can see that the genres were split evenly, with 4 genres in each cluster. This is also presented graphically in the plot below.
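One possible explanation for the even split is scale: genre_code runs from 1 to 12 while the standardized sales values have a standard deviation of 1, so Euclidean distance is driven mostly by the code. Because the codes follow alphabetical order, the clusters then collect alphabetically adjacent genres. The sketch below checks this idea on synthetic data; the data frame `toy` and the seed are assumptions for illustration only.

```r
# Synthetic data (hypothetical): 12 genre codes, each with three standardized
# sales values, so genre_code spans 1 to 12 while sales only span -1 to 1
toy <- expand.grid(genre_code = 1:12, global_sales = c(-1, 0, 1))

set.seed(1)  # arbitrary seed for the random starting centres
fit <- kmeans(toy, centers = 3, nstart = 50)

# Cross-tabulate clusters against genre codes, as in the table above
tab <- table(cluster = fit$cluster, genre_code = toy$genre_code)
print(tab)

# How many clusters does each code appear in, and how many codes per cluster?
colSums(tab > 0)
rowSums(tab > 0)
```

On this toy data every code lands in a single cluster and each cluster covers four consecutive codes, regardless of the sales values.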
library(ggplot2)
plot_data <- as.data.frame(table_summary_with_genres)
plot_data$Var1 <- factor(plot_data$Var1, levels = unique(video_game_global_sales1$cluster))
ggplot(plot_data, aes(x = Var1, y = Freq, fill = Var2)) +
geom_bar(stat = "identity") +
labs(title = "Genre Distribution in Each Cluster",
x = "Cluster",
y = "Count") +
scale_fill_brewer(palette = "Set3") + # Adjust the color palette as needed
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels
The graph above shows how the genres are distributed across the clusters and how many games each genre contributes. The Action genre has the highest count in the data set (3,176 games) and falls in cluster 1, together with Adventure, Fighting, and Misc; the largest genres in clusters 2 and 3 are Role-Playing (1,382) and Sports (2,240).
Summary of the cluster
cluster_summary <- aggregate(. ~ cluster, data = video_game_global_sales1[, c("global_sales", "genre_code", "cluster")], mean)
print(cluster_summary)
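As an illustration of what this per-cluster summary computes, here is a small sketch on made-up values. The numbers in `toy` are hypothetical and chosen only to show the shape of the result; they are not the output of the analysis.

```r
# Hypothetical clustered data: two games per cluster
toy <- data.frame(
  cluster = factor(c(1, 1, 2, 2, 3, 3)),
  global_sales = c(0.2, 0.4, -0.5, -0.3, 1.0, 1.4),
  genre_code = c(1, 3, 6, 8, 10, 12)
)

# Mean of every remaining column within each cluster
toy_summary <- aggregate(. ~ cluster, data = toy, FUN = mean)
print(toy_summary)

# Rank the clusters from highest to lowest mean global sales
ranked <- toy_summary[order(-toy_summary$global_sales), ]
as.character(ranked$cluster)
```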
Global Sales:
Cluster 2: contains the genres with the lowest average global sales; the games in this cluster are possibly cheaper titles.
Cluster 1: shows moderate average global sales, spread over the largest number of games (6,934).
Cluster 3: has the highest average global_sales, suggesting strong sales for the genres in this cluster even though it holds fewer games than cluster 1 (4,969); the games in this cluster are possibly more expensive and distinctive.
This was also explained by the centroids plot below
library(ggplot2)
centroids <- as.data.frame(kmeans_result$centers)
centroids$cluster <- factor(1:nrow(centroids)) # Add a factor column for cluster assignments
ggplot(centroids, aes(x = genre_code, y = global_sales, color = cluster)) +
geom_point(size = 3) +
labs(title = "Cluster Centroids", x = "Genre Code", y = "Global Sales") +
theme_minimal()
Cluster 1: This cluster has moderate average global sales, higher than cluster 2 and lower than cluster 3.
Cluster 2: This cluster has the lowest average global sales of all clusters.
Cluster 3: This cluster has the highest average global sales of all clusters.
Analysis
However, the only categorical variable considered in this clustering was genre, so before concluding I look at how other variables in the original data relate to the clusters.
First, I count how many publishers appear in each cluster.
# Count the distinct publishers in each cluster
# (mean() is not defined for a factor column, so count unique values instead)
cluster_comparison <- aggregate(publisher ~ cluster, data = video_game_global_sales1,
                                FUN = function(x) length(unique(x)))
print(cluster_comparison)
library(ggplot2)
ggplot(cluster_comparison, aes(x = cluster, y = publisher)) +
  geom_col() +
  labs(title = "Count of Publishers in Each Cluster",
       x = "Cluster", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
After counting, I noticed that almost every publisher appears in each cluster, with only a few absent from some clusters. I therefore analysed this relationship further by identifying the 10 most common publishers in each cluster.
Identifying the top 10 publishers in each cluster
# Load necessary packages
library(dplyr)
Warning: package ‘dplyr’ was built under R version 4.3.2
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(ggplot2)
top_publishers_by_cluster <- video_game_global_sales1 %>%
group_by(cluster, publisher) %>%
summarise(count = n()) %>%
arrange(cluster, desc(count)) %>%
group_by(cluster) %>%
top_n(10, wt = count) %>%
ungroup()
`summarise()` has grouped output by 'cluster'. You can override using the `.groups` argument.
ggplot(top_publishers_by_cluster, aes(x = reorder(publisher, -count), y = count, fill = as.factor(cluster))) +
geom_bar(stat = "identity") +
labs(title = "Top 10 Publishers in Each Cluster",
x = "Publisher",
y = "Count") +
scale_fill_brewer(palette = "Set3") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
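The counting and top-N logic above can also be sketched in base R on a tiny made-up data set. The publishers "A" to "D" and the object names are hypothetical; the example keeps only the single most common publisher per cluster and also lists publishers that occur in exactly one cluster.

```r
# Hypothetical cluster assignments and publishers
toy <- data.frame(
  cluster = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  publisher = c("A", "A", "B", "C", "B", "B", "C", "A", "C", "C", "D")
)

# Games per (cluster, publisher) pair, dropping pairs that never occur
counts <- as.data.frame(table(cluster = toy$cluster, publisher = toy$publisher),
                        stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]

# Most common publisher in each cluster (use head(..., 10) for a top 10)
top <- do.call(rbind, lapply(split(counts, counts$cluster), function(d) {
  head(d[order(-d$Freq), ], 1)
}))
print(top)

# Publishers that appear in exactly one cluster
n_clusters <- tapply(counts$cluster, counts$publisher, function(x) length(unique(x)))
exclusive <- names(n_clusters)[n_clusters == 1]
exclusive
```

Checking exclusivity on the full counts, rather than on a top-10 chart, avoids mistaking "not in the top 10" for "not present in the cluster".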
Cluster 3: This cluster contains a variety of publishers but is dominated by Electronic Arts and Konami Digital Entertainment, and it is the only cluster whose top 10 includes Take-Two Interactive, which makes it unique in that respect. One possible conclusion is that publishers sell their most sophisticated games in this cluster, which would explain the high sales despite the lower count.
Cluster 2: This cluster is dominated by Namco Bandai Games, although that publisher is also prominent in other clusters. It is the only cluster whose top 10 includes Capcom and Tecmo Koei. Given its low sales, one possible conclusion is that publishers sell their cheapest games in this cluster.
Cluster 1: This cluster contains games from most publishers, and it is the only cluster whose top 10 includes Square Enix and Nintendo. Cluster 1 has the highest game count of all clusters, and one possible conclusion is that most publishers charge moderate prices for these games, which would be consistent with the moderate global sales.
Identifying the platforms that appear in each cluster
# Count the distinct platforms in each cluster
# (mean() is not defined for a factor column, so count unique values instead)
cluster_comparison <- aggregate(platform ~ cluster, data = video_game_global_sales1,
                                FUN = function(x) length(unique(x)))
print(cluster_comparison)
library(ggplot2)
ggplot(cluster_comparison, aes(x = cluster, y = platform)) +
  geom_col() +
  labs(title = "Count of Platform in Each Cluster",
       x = "Cluster", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
From the counts above we can conclude that almost every platform appears in each cluster, so the next step is to see which 10 platforms are the most common in each cluster.
Platform Analysis
Using the plot below, I identified the 10 most common platforms in each cluster.
top_platforms_by_cluster <- video_game_global_sales1 %>%
group_by(cluster, platform) %>%
summarise(count = n()) %>%
arrange(cluster, desc(count)) %>%
group_by(cluster) %>%
top_n(10, wt = count) %>%
ungroup()
`summarise()` has grouped output by 'cluster'. You can override using the `.groups` argument.
ggplot(top_platforms_by_cluster, aes(x = reorder(platform, -count), y = count, fill = as.factor(cluster))) +
geom_bar(stat = "identity") +
labs(title = "Top 10 Platforms in Each Cluster",
x = "Platform",
y = "Count") +
scale_fill_brewer(palette = "Set3") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels
Cluster 1: This cluster contains a variety of platforms but is dominated by DS and PS games. Its high count indicates that most platforms in cluster 1 carry games with average sales.
Cluster 2: This cluster contains a variety of platforms but is dominated by PS2 and PSP games. It is also the only cluster whose top 10 includes 3DS and PSV. The games in this cluster are mostly considered cheap and not complex.
Cluster 3: This cluster is dominated by PC games, and it is the only cluster whose top 10 includes GC games. Platforms such as GC may carry more complex and expensive games, since this cluster has the highest global sales despite its lower count.
Identifying the yearly sales trend of video games
names(video_game_global_sales1)
[1] "rank" "name" "platform" "year" "genre" "publisher"
[7] "na_sales" "eu_sales" "jp_sales" "other_sales" "global_sales" "genre_code"
[13] "cluster"
ggplot(video_game_global_sales1, aes(x = year, y = global_sales, color = as.factor(cluster))) +
geom_point() +
stat_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "black") +
facet_wrap(~cluster, scales = "free_y", ncol = 1) +
labs(title = "Global Sales Trend by Year Group Within Each Cluster",
x = "Year Group", y = "Global Sales") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Cluster 1: Sales in cluster 1 fall mainly between 1990 and 2017, with no sales recorded afterwards. There is also a small number of sales between 1980 and 1990.
Cluster 2: Sales in cluster 2 fall mainly between 1996 and 2017, with no sales recorded afterwards, so the genres in cluster 2 were popular only in that period. This suggests that video game sales in this data set have declined since 2017. There is also some presence in the 1980s.
Cluster 3: Sales in cluster 3 fall between 1995 and 2017, with no sales recorded afterwards except for a small presence in 2020. The genres in cluster 3 kept a steady presence throughout that period.
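The dashed lines in the trend plots come from a linear fit of sales on year. To put a number on such a trend, one option is to fit the same kind of model per cluster and read off the slope. This is a sketch on made-up values; the data frame `toy` and its slopes are assumptions, not results from the data set.

```r
# Hypothetical yearly sales for two clusters: one declining, one rising
toy <- data.frame(
  cluster = rep(c("1", "2"), each = 5),
  year = rep(2010:2014, times = 2)
)
toy$global_sales <- ifelse(toy$cluster == "1",
                           1.0 - 0.10 * (toy$year - 2010),
                           0.2 + 0.05 * (toy$year - 2010))

# Fit global_sales ~ year separately in each cluster and keep the slope
slopes <- sapply(split(toy, toy$cluster), function(d) {
  coef(lm(global_sales ~ year, data = d))[["year"]]
})
round(slopes, 3)
```

A negative slope indicates falling sales per year within that cluster, and a positive slope indicates rising sales.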
library(ggplot2)
create_clustered_na_trend_graph <- function(cluster) {
current_data <- video_game_global_sales1[video_game_global_sales1$cluster == cluster, ]
ggplot(current_data, aes(x = year, y = na_sales, color = as.factor(cluster))) +
geom_point() +
stat_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "black") +
labs(title = paste("Cluster", cluster, "- Trend of NA Sales"),
x = "Year Group", y = "NA Sales") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
}
for (clust in unique(video_game_global_sales1$cluster)) {
plot <- create_clustered_na_trend_graph(clust)
print(plot)
}
combined_plot <- ggplot(video_game_global_sales1, aes(x = year, y = na_sales, color = as.factor(cluster))) +
geom_point() +
stat_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "black") +
labs(title = "Trend of NA Sales by Cluster",
x = "Year Group", y = "NA Sales") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
facet_wrap(~as.factor(cluster), scales = "free_y", ncol = 2)
print(combined_plot)
NA sales are highest in cluster 3, followed by cluster 2 and then cluster 1. One possible conclusion is that buyers in North America prefer the complex games found in cluster 3.
library(ggplot2)
create_clustered_eu_trend_graph <- function(cluster) {
current_data <- video_game_global_sales1[video_game_global_sales1$cluster == cluster, ]
ggplot(current_data, aes(x = year, y = eu_sales, color = as.factor(cluster))) +
geom_point() +
stat_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "black") +
labs(title = paste("Cluster", cluster, "- Trend of EU Sales"),
x = "Year Group", y = "EU Sales") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
}
for (clust in unique(video_game_global_sales1$cluster)) {
plot <- create_clustered_eu_trend_graph(clust)
print(plot)
}
combined_plot_eu <- ggplot(video_game_global_sales1, aes(x = year, y = eu_sales, color = as.factor(cluster))) +
geom_point() +
stat_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "black") +
labs(title = "Trend of EU Sales by Cluster",
x = "Year Group", y = "EU Sales") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
facet_wrap(~as.factor(cluster), scales = "free_y", ncol = 2)
print(combined_plot_eu)
EU sales are most dominant in cluster 2, followed by cluster 1 and lastly cluster 3. This suggests that residents of the European Union prefer simpler, less costly games.
In summary, this analysis examines global video game sales by genre, using clustering to identify distinct patterns and relationships. Clustering on global sales and genre revealed three clusters, each with its own characteristics. Cluster 3 emerged as the leader in global sales, followed by cluster 1 and lastly cluster 2. A decline in sales after 2016 was observed across all clusters. Beyond genre, the exploration of publishers and platforms showed that specific publishers and platforms dominate each cluster. The examination of regional sales trends highlighted the different preferences of North American and European audiences. This analysis sheds light on the market dynamics of the video game industry and offers insights that developers, publishers, and stakeholders can use to tailor their strategies by genre, region, and platform. Continued exploration and adaptation to emerging trends are essential for navigating the video game market and for determining why the market is declining.