WQD7004 Group Assignment
Group Members (Team 5)
- Siti Nur Liyana Roslan (23067122)
- Karenina Kamila (23117951)
- Nur Ridwana Mohd Rafix (24054165)
- Bong Hui Xin (22089462)
- Salvin A/L Ravindran (17167138)
Project Title: Predicting User Ratings and Classifying Popularity in Video Games
Dataset
Title: Video Game Reviews and Ratings
Year of Publication: 2024
Dataset Description: Consists of 18 features and 47,774 rows of synthetic data on video game reviews and ratings.
Dataset Category: Entertainment
| Numerical | Categorical |
|---|---|
| User.Rating, Price, Release.Year, Game.Length..Hours., Min.Number.of.Players | Game.Title, Age.Group.Targeted, Platform, Requires.Special.Device, Developer, Publisher, Genre, Multiplayer, Graphics.Quality, Soundtrack.Quality, Story.Quality, Game.Mode, User.Review.Text |
Introduction
In this project, our team used R to analyse video game reviews and ratings, identify key factors influencing high ratings, and propose actionable insights for game developers and publishers. The analysis involved data visualisation, summary statistics, and a word cloud to extract patterns and trends from user reviews.
Objectives
1. To classify popular and non-popular video games.
2. To predict user rating of video games.
By achieving these objectives, this research aims to provide a data-driven foundation for making informed decisions in game design and marketing strategies, ultimately enhancing the player experience and boosting game success in a competitive market.
Loading Necessary Libraries
# Load necessary libraries
#install.packages(c("tidyverse", "ggplot2", "dplyr", "corrplot", "vcd", "wordcloud", "RColorBrewer", "tm"))
#install.packages(c("rpart","rpart.plot", "caret", "e1071", "randomForest"))
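library(rpart)
library(e1071)
library(dplyr)
# Remaining packages used below (inferred from the attach messages and the functions called later)
for (pkg in c("caret", "randomForest", "tidyverse", "corrplot", "vcd", "wordcloud", "tm", "rpart.plot")) {
  library(pkg, character.only = TRUE)
}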
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: ggplot2
## Loading required package: lattice
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ✔ readr 2.1.5
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ randomForest::combine() masks dplyr::combine()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ✖ purrr::lift() masks caret::lift()
## ✖ randomForest::margin() masks ggplot2::margin()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## corrplot 0.95 loaded
## Loading required package: grid
## Loading required package: RColorBrewer
## Loading required package: NLP
##
## Attaching package: 'NLP'
##
## The following object is masked from 'package:ggplot2':
##
## annotate
Load the Dataset
data <- read.csv("video_game_reviews.csv")
# Display summary statistics of the dataset
summary(data)
## Game.Title User.Rating Age.Group.Targeted Price
## Length:47774 Min. :10.10 Length:47774 Min. :19.99
## Class :character 1st Qu.:24.30 Class :character 1st Qu.:29.99
## Mode :character Median :29.70 Mode :character Median :39.84
## Mean :29.72 Mean :39.95
## 3rd Qu.:35.10 3rd Qu.:49.96
## Max. :49.50 Max. :59.99
## Platform Requires.Special.Device Developer
## Length:47774 Length:47774 Length:47774
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Publisher Release.Year Genre Multiplayer
## Length:47774 Min. :2010 Length:47774 Length:47774
## Class :character 1st Qu.:2013 Class :character Class :character
## Mode :character Median :2016 Mode :character Mode :character
## Mean :2016
## 3rd Qu.:2020
## Max. :2023
## Game.Length..Hours. Graphics.Quality Soundtrack.Quality Story.Quality
## Min. : 5.00 Length:47774 Length:47774 Length:47774
## 1st Qu.:18.80 Class :character Class :character Class :character
## Median :32.50 Mode :character Mode :character Mode :character
## Mean :32.48
## 3rd Qu.:46.30
## Max. :60.00
## User.Review.Text Game.Mode Min.Number.of.Players
## Length:47774 Length:47774 Min. : 1.000
## Class :character Class :character 1st Qu.: 3.000
## Mode :character Mode :character Median : 5.000
## Mean : 5.117
## 3rd Qu.: 7.000
## Max. :10.000
## [1] 47774 18
## 'data.frame': 47774 obs. of 18 variables:
## $ Game.Title : chr "Grand Theft Auto V" "The Sims 4" "Minecraft" "Bioshock Infinite" ...
## $ User.Rating : num 36.4 38.3 26.8 38.4 30.1 38.6 33.1 32.3 26.7 23.9 ...
## $ Age.Group.Targeted : chr "All Ages" "Adults" "Teens" "All Ages" ...
## $ Price : num 41.4 57.6 44.9 48.3 55.5 ...
## $ Platform : chr "PC" "PC" "PC" "Mobile" ...
## $ Requires.Special.Device: chr "No" "No" "Yes" "Yes" ...
## $ Developer : chr "Game Freak" "Nintendo" "Bungie" "Game Freak" ...
## $ Publisher : chr "Innersloth" "Electronic Arts" "Capcom" "Nintendo" ...
## $ Release.Year : int 2015 2015 2012 2015 2022 2017 2020 2012 2010 2013 ...
## $ Genre : chr "Adventure" "Shooter" "Adventure" "Sports" ...
## $ Multiplayer : chr "No" "Yes" "Yes" "No" ...
## $ Game.Length..Hours. : num 55.3 34.6 13.9 41.9 13.2 48.8 36.9 52.1 56.4 46 ...
## $ Graphics.Quality : chr "Medium" "Low" "Low" "Medium" ...
## $ Soundtrack.Quality : chr "Average" "Poor" "Good" "Good" ...
## $ Story.Quality : chr "Poor" "Poor" "Average" "Excellent" ...
## $ User.Review.Text : chr "Solid game, but too many bugs." "Solid game, but too many bugs." "Great game, but the graphics could be better." "Solid game, but the graphics could be better." ...
## $ Game.Mode : chr "Offline" "Offline" "Offline" "Online" ...
## $ Min.Number.of.Players : int 1 3 5 4 1 4 3 3 10 5 ...
## Rows: 47,774
## Columns: 18
## $ Game.Title <chr> "Grand Theft Auto V", "The Sims 4", "Minecraft…
## $ User.Rating <dbl> 36.4, 38.3, 26.8, 38.4, 30.1, 38.6, 33.1, 32.3…
## $ Age.Group.Targeted <chr> "All Ages", "Adults", "Teens", "All Ages", "Ad…
## $ Price <dbl> 41.41, 57.56, 44.93, 48.29, 55.49, 51.73, 46.4…
## $ Platform <chr> "PC", "PC", "PC", "Mobile", "PlayStation", "Xb…
## $ Requires.Special.Device <chr> "No", "No", "Yes", "Yes", "Yes", "No", "No", "…
## $ Developer <chr> "Game Freak", "Nintendo", "Bungie", "Game Frea…
## $ Publisher <chr> "Innersloth", "Electronic Arts", "Capcom", "Ni…
## $ Release.Year <int> 2015, 2015, 2012, 2015, 2022, 2017, 2020, 2012…
## $ Genre <chr> "Adventure", "Shooter", "Adventure", "Sports",…
## $ Multiplayer <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "No", …
## $ Game.Length..Hours. <dbl> 55.3, 34.6, 13.9, 41.9, 13.2, 48.8, 36.9, 52.1…
## $ Graphics.Quality <chr> "Medium", "Low", "Low", "Medium", "High", "Low…
## $ Soundtrack.Quality <chr> "Average", "Poor", "Good", "Good", "Poor", "Av…
## $ Story.Quality <chr> "Poor", "Poor", "Average", "Excellent", "Good"…
## $ User.Review.Text <chr> "Solid game, but too many bugs.", "Solid game,…
## $ Game.Mode <chr> "Offline", "Offline", "Offline", "Online", "Of…
## $ Min.Number.of.Players <int> 1, 3, 5, 4, 1, 4, 3, 3, 10, 5, 1, 5, 10, 5, 4,…
Data Pre-processing
## [1] "Game.Title" "User.Rating"
## [3] "Age.Group.Targeted" "Price"
## [5] "Platform" "Requires.Special.Device"
## [7] "Developer" "Publisher"
## [9] "Release.Year" "Genre"
## [11] "Multiplayer" "Game.Length..Hours."
## [13] "Graphics.Quality" "Soundtrack.Quality"
## [15] "Story.Quality" "User.Review.Text"
## [17] "Game.Mode" "Min.Number.of.Players"
# Dropping unnecessary columns; based on our objectives and correlation (numerical)
data1 <- data[c("Game.Title", "User.Rating", "Price", "Platform", "Release.Year", "Genre", "Game.Length..Hours.", "User.Review.Text")]
# Converting the categorical chr columns (Genre, Platform and User.Review.Text) to factor
data1$Genre <- factor(data1$Genre)
data1$Platform <- factor(data1$Platform)
data1$User.Review.Text <- factor(data1$User.Review.Text)
# Checking for missing values
sum(is.na(data1))
## [1] 0
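The total above confirms that there are no missing values. If the total had been non-zero, a per-column count would show where the gaps are. The following is a minimal sketch on a small, hypothetical data frame (the values and the names `toy`, `na_per_column` and `toy_complete` are made up for illustration and are not part of the project data).

```r
# Hypothetical toy data frame standing in for data1 (illustrative values only)
toy <- data.frame(
  User.Rating = c(36.4, NA, 26.8, 38.4),
  Price       = c(41.41, 57.56, NA, 48.29),
  Genre       = c("Adventure", "Shooter", NA, "Sports"),
  stringsAsFactors = TRUE
)

# Count missing values per column instead of only in total
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Keep only the rows that have no missing values
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```

Dropping incomplete rows is only one option; imputing values may be preferable when many rows are affected.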
EDA
# After cleaning the dataset, we explored the summary of data1 and computed the standard deviation and mean of User.Rating.
summary(data1)
## Game.Title User.Rating Price Platform
## Length:47774 Min. :10.10 Min. :19.99 Mobile :9589
## Class :character 1st Qu.:24.30 1st Qu.:29.99 Nintendo Switch:9596
## Mode :character Median :29.70 Median :39.84 PC :9599
## Mean :29.72 Mean :39.95 PlayStation :9633
## 3rd Qu.:35.10 3rd Qu.:49.96 Xbox :9357
## Max. :49.50 Max. :59.99
##
## Release.Year Genre Game.Length..Hours.
## Min. :2010 RPG : 4873 Min. : 5.00
## 1st Qu.:2013 Shooter : 4869 1st Qu.:18.80
## Median :2016 Strategy : 4867 Median :32.50
## Mean :2016 Puzzle : 4822 Mean :32.48
## 3rd Qu.:2020 Simulation: 4784 3rd Qu.:46.30
## Max. :2023 Adventure : 4750 Max. :60.00
## (Other) :18809
## User.Review.Text
## Great game, but the graphics could be better. : 4067
## Disappointing game, but the gameplay is amazing. : 4020
## Disappointing game, but the graphics could be better.: 4018
## Solid game, but too many bugs. : 4017
## Solid game, but the gameplay is amazing. : 3996
## Great game, but too many bugs. : 3977
## (Other) :23679
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.10 24.30 29.70 29.72 35.10 49.50
## [1] 7.550131
## [1] 29.71933
# Next, we found the highest and lowest rated games.
high <- data1[order(-data1$User.Rating), ][1, ]
high
## Game.Title User.Rating Price Platform Release.Year Genre
## 15184 Just Dance 2024 49.5 59.17 PlayStation 2014 Action
## Game.Length..Hours. User.Review.Text
## 15184 59.7 Disappointing game, but the gameplay is amazing.
## Game.Title User.Rating Price Platform Release.Year Genre
## 17711 Tetris 10.1 20.08 Xbox 2013 Party
## Game.Length..Hours. User.Review.Text
## 17711 6.3 Solid game, but the graphics could be better.
# Since our focus attribute is 'User Rating', we explored its quartiles, which are used in the modelling part later on.
quantiles <- quantile(data1$User.Rating, probs = c(0.25, 0.5, 0.75))
cat("25th Percentile:", quantiles[1], "\n")
cat("Median:", quantiles[2], "\n")
cat("75th Percentile:", quantiles[3], "\n")
## 25th Percentile: 24.3
## Median: 29.7
## 75th Percentile: 35.1
# 1. Univariate Analysis
# Bar plot of number of games by genres
ggplot(data1, aes(x = Genre)) + geom_bar(fill = "red", color = "black") + labs(title = "Number of Games by Genres", x = "Game Genres", y = "Number of Games" )
# The graph shows that the number of games is similar across genres, so the dataset is reasonably balanced and does not require additional balancing.
# Bar plot of game count by platform
ggplot(data1, aes(x = Platform)) + geom_bar(fill = "lightblue", color = "black") + labs(title = "Number of Games by Platform", x = "Platform", y = "Number of Games")
# The same applies to platform: the games are spread almost equally across the platforms.
# Histogram of user rating for games released in year 2022
y22 <- data1%>%filter(Release.Year==2022)
ggplot(y22, aes(x = User.Rating)) + geom_histogram(binwidth = 1, fill = "orange", color = "black") + labs(title = "User Ratings of Games Released in 2022", x = "User Rating", y = "Number of Games")
# The histogram shows that ratings for games released in 2022 peak at about 32.5. There is a noticeable dip between rating values 27.5 and 32.5, which creates a "valley" in the distribution. The dip and subsequent recovery could be due to various factors, such as shifts in user preferences or game characteristics that polarized ratings into higher and lower groups.
# 2. Bivariate Analysis
# Line graph of number of games released over years
count_release <- data1 %>% count(Release.Year)
ggplot(count_release, aes(x = Release.Year, y = n)) + geom_point() + geom_line() + labs(title = "Number of Games Released Over Years", x = "Years", y = "Number of Games")
# The number of releases peaked in 2020, which might be due to the global pandemic: lockdowns and social distancing measures increased the demand for home entertainment. Gaming on digital platforms resonated with audiences because it provided a sense of normalcy and connection in a socially isolated world.
# Line graph of average rating by release year
ggplot(data1, aes(x = Release.Year, y = User.Rating)) + stat_summary(fun = "mean", geom = "line", color = "orange") + labs(title = "Average User Rating by Release Year", x = "Release Year", y = "Average Rating")
# Although 2020 had the highest number of game releases, it also received the lowest average rating. Possible explanations include a larger player base and harsher standards in a crowded market. The high volume of releases may also have diluted quality, so that some games struggled to stand out and the average quality dropped in an oversaturated market.
# Scatter plot for User Rating vs. Game Length (Hours) in year 2020
v <- data1%>%filter(Release.Year==2020)
ggplot(v, aes(x = User.Rating, y = Game.Length..Hours.)) + geom_point() + geom_smooth(method = lm) + labs(title = "Correlation between User Rating and Game Length (Hours) in year 2020")
## `geom_smooth()` using formula = 'y ~ x'
# This scatter plot shows a positive correlation, as the points form an upward trend. Generally, games with longer playtimes tend to have higher ratings, which might suggest that users favor longer games.
# 3. Multivariate Analysis
# Correlation matrix with numerical data
b <- data1[sapply(data1, is.numeric)]
b <- cor(b)
corrplot(b, method = "color", main = "Correlation Matrix with Numerical Data")
# This correlation matrix shows that 'Price' and 'Game Length Hours' are positively correlated with 'User Rating', with 'Price' showing the strongest correlation. Other attributes, such as release year, show no relation at all.
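A colour plot shows the direction of each relationship, but the actual coefficients are easier to compare when printed. The sketch below prints a rounded correlation matrix and runs `cor.test()` for one pair. It uses simulated toy data with a made-up relationship (an assumption for illustration only), so the numbers it prints are not the project's results.

```r
set.seed(1)  # toy data only; this is not the project dataset
n <- 200
price        <- runif(n, 20, 60)
length_hours <- runif(n, 5, 60)
year         <- sample(2010:2023, n, replace = TRUE)
# Made-up relationship so that the example has something to detect
rating <- 0.5 * price + 0.2 * length_hours + rnorm(n, sd = 1)

toy <- data.frame(User.Rating = rating, Price = price,
                  Game.Length..Hours. = length_hours, Release.Year = year)

# Print the Pearson coefficients behind the colour plot
cor_matrix <- cor(toy)
print(round(cor_matrix, 2))

# Test a single pair and report its estimate and 95% confidence interval
price_test <- cor.test(toy$User.Rating, toy$Price)
print(price_test$estimate)
print(price_test$conf.int)
```

On the real data, the same calls on `data1[sapply(data1, is.numeric)]` would give the coefficients that the plot encodes as colours.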
##
## Pearson's Chi-squared test
##
## data: data1$Genre and data1$Platform
## X-squared = 24.441, df = 36, p-value = 0.9282
# With p = 0.9282, the chi-squared test gives no evidence of an association between 'Genre' and 'Platform'; the two variables appear to be independent.
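A chi-squared p-value measures evidence against independence, not the strength of a relationship. An effect size such as Cramér's V can be computed from the statistic reported above. The sketch below uses only the reported values (X-squared = 24.441, df = 36, 47774 rows, 5 platforms); the number of genres is derived from the degrees of freedom, and the variable names are chosen here for illustration.

```r
# Values reported in the Genre vs Platform test output
chi_sq     <- 24.441
df         <- 36
n_obs      <- 47774
n_platform <- 5                          # platform levels shown in summary(data1)
n_genre    <- df / (n_platform - 1) + 1  # df = (rows - 1) * (cols - 1), so 10 genres

# p-value recomputed from the statistic; a large value means no evidence of association
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Cramér's V: 0 means no association, 1 means perfect association
cramers_v <- sqrt(chi_sq / (n_obs * min(n_genre - 1, n_platform - 1)))

print(round(p_value, 4))
print(round(cramers_v, 4))
```

A Cramér's V of roughly 0.01 is negligible, which is consistent with the large p-value.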
chisq.test(data1$Genre, data1$User.Review.Text)
##
## Pearson's Chi-squared test
##
## data: data1$Genre and data1$User.Review.Text
## X-squared = 101.88, df = 99, p-value = 0.4013
##
## Pearson's Chi-squared test
##
## data: data1$Platform and data1$User.Review.Text
## X-squared = 48.642, df = 44, p-value = 0.2915
# Likewise, 'User Review Text' shows no significant association with either 'Genre' (p = 0.4013) or 'Platform' (p = 0.2915).
quality<- data[c("User.Rating", "Graphics.Quality", "Soundtrack.Quality", "Story.Quality")]
chisq.test(quality$User.Rating, quality$Graphics.Quality)
## Warning in chisq.test(quality$User.Rating, quality$Graphics.Quality):
## Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: quality$User.Rating and quality$Graphics.Quality
## X-squared = 1205.3, df = 1173, p-value = 0.2498
## Warning in chisq.test(quality$User.Rating, quality$Soundtrack.Quality):
## Chi-squared approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: quality$User.Rating and quality$Soundtrack.Quality
## X-squared = 1191.8, df = 1173, p-value = 0.3446
## Warning in chisq.test(quality$User.Rating, quality$Story.Quality): Chi-squared
## approximation may be incorrect
##
## Pearson's Chi-squared test
##
## data: quality$User.Rating and quality$Story.Quality
## X-squared = 1162.2, df = 1173, p-value = 0.583
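# None of the three tests is significant at the 5% level (p = 0.2498, 0.3446 and 0.583), so they provide no evidence that 'Graphics Quality', 'Soundtrack Quality' or 'Story Quality' is associated with 'User Rating'. The warnings arise because 'User Rating' is a continuous variable, so the contingency table has many cells with very small expected counts and the chi-squared approximation is unreliable.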
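Because 'User Rating' is numeric and the quality columns are categorical, a test that compares the rating distribution across groups is a more natural fit than a chi-squared test. The sketch below shows a one-way ANOVA and a Kruskal-Wallis test on simulated toy data. The data are random and the quality labels are assigned arbitrarily (both are assumptions for illustration), so the output says nothing about the real dataset.

```r
set.seed(42)  # toy data only; ratings and labels are simulated
toy <- data.frame(
  User.Rating   = round(rnorm(300, mean = 30, sd = 7.5), 1),
  Story.Quality = factor(sample(c("Poor", "Average", "Good", "Excellent"),
                                300, replace = TRUE))
)

# Mean rating per quality level
print(aggregate(User.Rating ~ Story.Quality, data = toy, FUN = mean))

# One-way ANOVA: do the group means differ?
anova_fit <- aov(User.Rating ~ Story.Quality, data = toy)
print(summary(anova_fit))

# Kruskal-Wallis: rank-based alternative that does not assume normality
kw <- kruskal.test(User.Rating ~ Story.Quality, data = toy)
print(kw)
```

On the project data, the same formulas could be applied to the `quality` data frame, one quality column at a time.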
# 4. Word Cloud
# Visualizing user review texts with word cloud
reviews <- as.character(data1$User.Review.Text)
reviews <- gsub("\\s+", " ", reviews)
words <- unlist(strsplit(reviews, "\\s+"))
# Use a separate name so that the tm::stopwords() function is not shadowed
custom_stopwords <- c(stopwords("en"), "but", "and", "many")
words <- words[!tolower(words) %in% custom_stopwords]
# Create a frequency table of the words
freq <- table(words)
# Generate the word cloud
wordcloud(names(freq), freq, min.freq = 1, scale = c(4, 0.5), colors = brewer.pal(8, "PuOr"), random.order = FALSE)
# The word cloud shows the words "amazing," "graphics," "better," "bugs," and "gameplay" in similar sizes, suggesting they occur with nearly equal frequency in the comments. This indicates that these terms are key themes in player feedback and likely central to players' experiences with the games.
# In contrast, words like "great," "disappointing," and "solid" appear slightly smaller, meaning they are mentioned less frequently but still contribute to the overall sentiment. These terms might reflect players' general impressions but are not as central as the dominant themes.
# The prominence of "graphics" and "gameplay" in the word cloud suggests they are significant factors influencing players' opinions. Positive words ("amazing", "great" and "solid") might highlight players' appreciation for engaging gameplay, while "better" comes from the phrase "the graphics could be better" and so points to room for improvement. Negative words (e.g., "bugs" and "disappointing") might indicate that technical issues affected players' enjoyment.
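The sizes in a word cloud come directly from a word frequency table, so printing the table makes the comparison exact. The sketch below counts words in three review strings taken from the dataset preview above. It strips punctuation first, and it uses a small hand-picked stopword list (an assumption made here so that the example runs without the tm package).

```r
# Three review strings that appear in the dataset preview
reviews <- c(
  "Solid game, but too many bugs.",
  "Great game, but the graphics could be better.",
  "Disappointing game, but the gameplay is amazing."
)

# Lower-case the text and strip punctuation so "game," and "game" count as one word
clean <- tolower(gsub("[[:punct:]]", "", reviews))
words <- unlist(strsplit(clean, "\\s+"))

# Hand-picked stopwords for this toy example (not the tm::stopwords("en") list)
custom_stopwords <- c("but", "the", "is", "too", "many", "could", "be")
words <- words[!words %in% custom_stopwords]

# Frequency table, most frequent word first
freq <- sort(table(words), decreasing = TRUE)
print(freq)
```

Here "game" is counted three times and every other remaining word once, which is the kind of difference the word cloud renders as font size.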
Prepare Data For Modelling (Classification)
Create a new column, Popularity, that classifies each game based on the 75th-percentile rating threshold, then split the data into 70% training and 30% testing sets.
# Add Popularity column
data$Popularity <- ifelse(data$User.Rating >= 35.1, "Popular", "Unpopular")
data$Popularity <- as.factor(data$Popularity)
data$User.Rating <- as.numeric(data$User.Rating)
# Splitting the data into 70% training and 30% testing
set.seed(123)
train_index <- createDataPartition(data$Popularity, p=0.7, list=FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]
message(
"Initial: ", nrow(data), " rows.\n",
"Train: ", nrow(train_data), " rows \n",
"Test: ", nrow(test_data), " rows ."
)
## Initial: 47774 rows.
## Train: 33443 rows
## Test: 14331 rows .
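The popularity threshold of 35.1 used above is the 75th percentile found during EDA. Instead of typing the value by hand, it can be derived from the data, which keeps the label consistent if the dataset changes. The sketch below illustrates this on simulated toy ratings (an assumption for illustration; `ratings`, `threshold` and `class_share` are names chosen here).

```r
set.seed(123)  # toy ratings only; not the project dataset
ratings <- round(rnorm(1000, mean = 29.7, sd = 7.5), 1)

# Derive the threshold from the data instead of hard-coding it
threshold <- quantile(ratings, probs = 0.75, names = FALSE)

# Label games at or above the threshold as Popular
popularity <- factor(ifelse(ratings >= threshold, "Popular", "Unpopular"))

# Share of each class; roughly a quarter of games should be Popular
class_share <- prop.table(table(popularity))
print(threshold)
print(round(class_share, 3))
```

A split of about 25% to 75% means the classes are imbalanced, which is why a stratified split on the label is a reasonable choice.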
Objective 1 : Classify Popular and Unpopular Games
Classification Algorithm 1 : Decision Tree Model
# Build the Decision Tree model
dt_model <- rpart(Popularity ~ Price + Game.Length..Hours.,
data = train_data,
method = "class")
# Predict on the test data
dt_predictions <- predict(dt_model, newdata = test_data, type = "class")
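# Visualize the Decision Tree
rpart.plot(dt_model)
# Confusion matrix to evaluate the Decision Tree model
confusionMatrix(dt_predictions, test_data$Popularity)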
## Confusion Matrix and Statistics
##
## Reference
## Prediction Popular Unpopular
## Popular 3187 286
## Unpopular 446 10412
##
## Accuracy : 0.9489
## 95% CI : (0.9452, 0.9525)
## No Information Rate : 0.7465
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8631
##
## Mcnemar's Test P-Value : 4.183e-09
##
## Sensitivity : 0.8772
## Specificity : 0.9733
## Pos Pred Value : 0.9177
## Neg Pred Value : 0.9589
## Prevalence : 0.2535
## Detection Rate : 0.2224
## Detection Prevalence : 0.2423
## Balanced Accuracy : 0.9253
##
## 'Positive' Class : Popular
##
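The main statistics in the output above follow directly from the four counts in the confusion matrix. As a worked check, the sketch below recomputes them by hand from the Decision Tree counts, with "Popular" as the positive class (the variable names are chosen here for illustration).

```r
# Decision Tree confusion matrix from the output above (rows = prediction, columns = reference)
cm <- matrix(c(3187, 446, 286, 10412), nrow = 2,
             dimnames = list(Prediction = c("Popular", "Unpopular"),
                             Reference  = c("Popular", "Unpopular")))

tp <- cm["Popular", "Popular"]      # popular games predicted as popular
fn <- cm["Unpopular", "Popular"]    # popular games predicted as unpopular
fp <- cm["Popular", "Unpopular"]    # unpopular games predicted as popular
tn <- cm["Unpopular", "Unpopular"]  # unpopular games predicted as unpopular

accuracy    <- (tp + tn) / sum(cm)
sensitivity <- tp / (tp + fn)       # share of popular games that were found
specificity <- tn / (tn + fp)       # share of unpopular games that were found
precision   <- tp / (tp + fp)       # reported as Pos Pred Value
balanced    <- (sensitivity + specificity) / 2

print(round(c(Accuracy = accuracy, Sensitivity = sensitivity,
              Specificity = specificity, Precision = precision,
              Balanced = balanced), 4))
```

Since about 75% of the test games are unpopular, sensitivity and balanced accuracy are more informative here than accuracy alone.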
Classification Algorithm 2 : SVM Model
# Train the SVM model
svm_model <- svm(Popularity ~ Price + Game.Length..Hours.,
data = train_data,
type = "C-classification",
kernel = "linear")
# Predict on the test data
svm_predictions <- predict(svm_model, newdata = test_data)
# Confusion matrix to evaluate the SVM model
confusionMatrix(svm_predictions, test_data$Popularity)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Popular Unpopular
## Popular 3345 272
## Unpopular 288 10426
##
## Accuracy : 0.9609
## 95% CI : (0.9576, 0.964)
## No Information Rate : 0.7465
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8966
##
## Mcnemar's Test P-Value : 0.5262
##
## Sensitivity : 0.9207
## Specificity : 0.9746
## Pos Pred Value : 0.9248
## Neg Pred Value : 0.9731
## Prevalence : 0.2535
## Detection Rate : 0.2334
## Detection Prevalence : 0.2524
## Balanced Accuracy : 0.9477
##
## 'Positive' Class : Popular
##
Prepare Data For Modelling (Regression)
Prepare the data by splitting it into 70% training and 30% testing sets, with User.Rating as the target variable.
# Regression split
set.seed(123)
train_index <- createDataPartition(data$User.Rating, p=0.7, list=FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]
message(
"Initial: ", nrow(data), " rows.\n",
"Train: ", nrow(train_data), " rows \n",
"Test: ", nrow(test_data), " rows ."
)
## Initial: 47774 rows.
## Train: 33444 rows
## Test: 14330 rows .
Objective 2 : Predict The Future Rating of Games
Regression Algorithm 1 : Random Forest Model
train_data$User.Rating <- as.numeric(train_data$User.Rating)
test_data$User.Rating <- as.numeric(test_data$User.Rating)
# Train the Random Forest model
rf_model <- randomForest(User.Rating ~ Price + Game.Length..Hours.,
data = train_data,
importance = TRUE,
ntree = 50)
# Make predictions on the test data
rf_predictions <- predict(rf_model, newdata = test_data)
# Calculate RMSE, R-squared and MAE
rmse <- sqrt(mean((rf_predictions - test_data$User.Rating)^2))
rsq <- cor(rf_predictions, test_data$User.Rating)^2
mae <- mean(abs(rf_predictions - test_data$User.Rating))
# Visualize prediction using plot
rf <- data.frame(y_test = test_data$User.Rating, y_pred = round(rf_predictions,1))
subset_rf<- rf[1:35, ]
par(mar = c(5, 4, 4, 2))
plot(subset_rf$y_test, type = "l", col = "black", lwd = 2, xlab = "Index", ylab = "Value")
lines(subset_rf$y_pred, col = "blue", lwd = 2)
legend("topright", legend = c("Actual", "Predicted"), col = c("black", "blue"), lwd = 2)
title(main="Actual vs Predicted for Random Forest Regressor Model")
Regression Algorithm 2 : Linear Regression Model
# Fit the linear regression model
lr_model <- lm(User.Rating ~ Price + Game.Length..Hours.,
data = train_data)
# Make predictions on the test set
lr_predictions <- predict(lr_model, newdata = test_data)
# Calculate RMSE, R-squared and MAE
rmse_lr <- sqrt(mean((lr_predictions - test_data$User.Rating)^2))
rsq_lr <- cor(lr_predictions, test_data$User.Rating)^2
mae_lr <- mean(abs(lr_predictions - test_data$User.Rating))
# Visualize prediction using plot
lr <- data.frame(y_test = test_data$User.Rating, y_pred = round(lr_predictions,1))
subset_lr<- lr[1:35, ]
par(mar = c(5, 4, 4, 2))
plot(subset_lr$y_test, type = "l", col = "black", lwd = 2, xlab = "Index", ylab = "Value")
lines(subset_lr$y_pred, col = "blue", lwd = 2)
legend("topright", legend = c("Actual", "Predicted"), col = c("black", "blue"), lwd = 2)
title(main="Actual vs Predicted for Linear Regression Model")
Comparing Random Forest and Linear Regression
# Compare model performance
evaluation_table <- data.frame(
Model = c('Random Forest', 'Linear Regression'),
R_squared = c(rsq, rsq_lr),
RMSE = c(rmse, rmse_lr),
MAE = c(mae, mae_lr)
)
print(evaluation_table)
## Model R_squared RMSE MAE
## 1 Random Forest 0.9725199 1.245908 1.059441
## 2 Linear Regression 0.9761487 1.160752 1.007670
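The three metrics in the table are computed with the same formulas for both models, so they can be wrapped in one helper. The sketch below is one possible version; `regression_metrics` is a hypothetical helper that is not part of the original analysis. It also reports R-squared as 1 - SSE/SST next to the squared correlation used above, because the two can differ on test data when predictions are biased.

```r
# Hypothetical helper: evaluation metrics for any vector of predictions
regression_metrics <- function(actual, predicted) {
  residuals <- actual - predicted
  sse <- sum(residuals^2)                 # sum of squared errors
  sst <- sum((actual - mean(actual))^2)   # total sum of squares
  c(RMSE = sqrt(mean(residuals^2)),
    MAE  = mean(abs(residuals)),
    R2   = 1 - sse / sst,                 # coefficient of determination
    Cor2 = cor(actual, predicted)^2)      # squared correlation, as used in the report
}

# Small worked example with made-up numbers
actual    <- c(1, 2, 3, 4)
predicted <- c(1.5, 2, 2.5, 4)
m <- regression_metrics(actual, predicted)
print(round(m, 4))
```

With the objects from the report, `regression_metrics(test_data$User.Rating, rf_predictions)` and the same call with `lr_predictions` would produce the rows of the comparison table.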
Conclusion
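According to the correlation analysis, price and game length (hours) are positively correlated with user rating. This implies that these factors can be used to predict the ratings of upcoming games. Based on the evaluation results above, the SVM model performed best for classification (accuracy 0.9609 against 0.9489 for the Decision Tree), and the Linear Regression model performed best for regression (R-squared 0.9761, RMSE 1.16 and MAE 1.01 against 0.9725, 1.25 and 1.06 for the Random Forest).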