WQD7004 Group Assignment

Group Members (Team 5)

Siti Nur Liyana Roslan (23067122)
Karenina Kamila (23117951)
Nur Ridwana Mohd Rafix (24054165)
Bong Hui Xin (22089462)
Salvin A/L Ravindran (17167138)

Project Title: Predicting User Ratings and Classifying Popularity in Video Games

Dataset

Title: Video Game Reviews and Ratings

Source: https://www.kaggle.com/datasets/jahnavipaliwal/video-game-reviews-and-ratings

Year of Publication: 2024

Dataset Description: Consists of 18 features and 47,774 rows of synthetic data on video game reviews and ratings.

Dataset Category: Entertainment

Numerical: User.Rating, Price, Release.Year, Game.Length..Hours., Min.Number.of.Players

Categorical: Game.Title, Age.Group.Targeted, Platform, Requires.Special.Device, Developer, Publisher, Genre, Multiplayer, Graphics.Quality, Soundtrack.Quality, Story.Quality, Game.Mode, User.Review.Text

Introduction

In this project, our team used R to analyse video game reviews and ratings, identify the key factors associated with high ratings, and propose actionable insights for game developers and publishers. The analysis involved data visualisation, summary statistics, and a word cloud to extract patterns and trends from user reviews.

Objectives

1. To classify popular and non-popular video games.

2. To predict user rating of video games.

By achieving these objectives, this research aims to provide a data-driven foundation for making informed decisions in game design and marketing strategies, ultimately enhancing the player experience and boosting game success in a competitive market.

Loading Necessary Libraries

# Load necessary libraries
#install.packages(c("tidyverse", "ggplot2", "dplyr", "corrplot", "vcd", "wordcloud", "RColorBrewer", "tm"))
#install.packages(c("rpart","rpart.plot", "caret", "e1071", "randomForest"))
library(rpart)
library(e1071)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

library(rpart.plot)
library(randomForest)

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ✔ readr     2.1.5

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ randomForest::combine() masks dplyr::combine()
## ✖ dplyr::filter()         masks stats::filter()
## ✖ dplyr::lag()            masks stats::lag()
## ✖ purrr::lift()           masks caret::lift()
## ✖ randomForest::margin()  masks ggplot2::margin()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(corrplot)

## corrplot 0.95 loaded

library(vcd)

## Loading required package: grid

library(wordcloud)

## Loading required package: RColorBrewer

library(RColorBrewer)
library(tm)

## Loading required package: NLP
## 
## Attaching package: 'NLP'
## 
## The following object is masked from 'package:ggplot2':
## 
##     annotate

Load the Dataset

data <- read.csv("video_game_reviews.csv")

# Display summary statistics of the dataset
summary(data)

##   Game.Title         User.Rating    Age.Group.Targeted     Price      
##  Length:47774       Min.   :10.10   Length:47774       Min.   :19.99  
##  Class :character   1st Qu.:24.30   Class :character   1st Qu.:29.99  
##  Mode  :character   Median :29.70   Mode  :character   Median :39.84  
##                     Mean   :29.72                      Mean   :39.95  
##                     3rd Qu.:35.10                      3rd Qu.:49.96  
##                     Max.   :49.50                      Max.   :59.99  
##    Platform         Requires.Special.Device  Developer        
##  Length:47774       Length:47774            Length:47774      
##  Class :character   Class :character        Class :character  
##  Mode  :character   Mode  :character        Mode  :character  
##                                                               
##                                                               
##                                                               
##   Publisher          Release.Year     Genre           Multiplayer       
##  Length:47774       Min.   :2010   Length:47774       Length:47774      
##  Class :character   1st Qu.:2013   Class :character   Class :character  
##  Mode  :character   Median :2016   Mode  :character   Mode  :character  
##                     Mean   :2016                                        
##                     3rd Qu.:2020                                        
##                     Max.   :2023                                        
##  Game.Length..Hours. Graphics.Quality   Soundtrack.Quality Story.Quality     
##  Min.   : 5.00       Length:47774       Length:47774       Length:47774      
##  1st Qu.:18.80       Class :character   Class :character   Class :character  
##  Median :32.50       Mode  :character   Mode  :character   Mode  :character  
##  Mean   :32.48                                                               
##  3rd Qu.:46.30                                                               
##  Max.   :60.00                                                               
##  User.Review.Text    Game.Mode         Min.Number.of.Players
##  Length:47774       Length:47774       Min.   : 1.000       
##  Class :character   Class :character   1st Qu.: 3.000       
##  Mode  :character   Mode  :character   Median : 5.000       
##                                        Mean   : 5.117       
##                                        3rd Qu.: 7.000       
##                                        Max.   :10.000

# Get the dimension of the dataset
dim(data)

## [1] 47774    18

# Get the structure of the dataset
str(data)

## 'data.frame':    47774 obs. of  18 variables:
##  $ Game.Title             : chr  "Grand Theft Auto V" "The Sims 4" "Minecraft" "Bioshock Infinite" ...
##  $ User.Rating            : num  36.4 38.3 26.8 38.4 30.1 38.6 33.1 32.3 26.7 23.9 ...
##  $ Age.Group.Targeted     : chr  "All Ages" "Adults" "Teens" "All Ages" ...
##  $ Price                  : num  41.4 57.6 44.9 48.3 55.5 ...
##  $ Platform               : chr  "PC" "PC" "PC" "Mobile" ...
##  $ Requires.Special.Device: chr  "No" "No" "Yes" "Yes" ...
##  $ Developer              : chr  "Game Freak" "Nintendo" "Bungie" "Game Freak" ...
##  $ Publisher              : chr  "Innersloth" "Electronic Arts" "Capcom" "Nintendo" ...
##  $ Release.Year           : int  2015 2015 2012 2015 2022 2017 2020 2012 2010 2013 ...
##  $ Genre                  : chr  "Adventure" "Shooter" "Adventure" "Sports" ...
##  $ Multiplayer            : chr  "No" "Yes" "Yes" "No" ...
##  $ Game.Length..Hours.    : num  55.3 34.6 13.9 41.9 13.2 48.8 36.9 52.1 56.4 46 ...
##  $ Graphics.Quality       : chr  "Medium" "Low" "Low" "Medium" ...
##  $ Soundtrack.Quality     : chr  "Average" "Poor" "Good" "Good" ...
##  $ Story.Quality          : chr  "Poor" "Poor" "Average" "Excellent" ...
##  $ User.Review.Text       : chr  "Solid game, but too many bugs." "Solid game, but too many bugs." "Great game, but the graphics could be better." "Solid game, but the graphics could be better." ...
##  $ Game.Mode              : chr  "Offline" "Offline" "Offline" "Online" ...
##  $ Min.Number.of.Players  : int  1 3 5 4 1 4 3 3 10 5 ...

glimpse(data)

## Rows: 47,774
## Columns: 18
## $ Game.Title              <chr> "Grand Theft Auto V", "The Sims 4", "Minecraft…
## $ User.Rating             <dbl> 36.4, 38.3, 26.8, 38.4, 30.1, 38.6, 33.1, 32.3…
## $ Age.Group.Targeted      <chr> "All Ages", "Adults", "Teens", "All Ages", "Ad…
## $ Price                   <dbl> 41.41, 57.56, 44.93, 48.29, 55.49, 51.73, 46.4…
## $ Platform                <chr> "PC", "PC", "PC", "Mobile", "PlayStation", "Xb…
## $ Requires.Special.Device <chr> "No", "No", "Yes", "Yes", "Yes", "No", "No", "…
## $ Developer               <chr> "Game Freak", "Nintendo", "Bungie", "Game Frea…
## $ Publisher               <chr> "Innersloth", "Electronic Arts", "Capcom", "Ni…
## $ Release.Year            <int> 2015, 2015, 2012, 2015, 2022, 2017, 2020, 2012…
## $ Genre                   <chr> "Adventure", "Shooter", "Adventure", "Sports",…
## $ Multiplayer             <chr> "No", "Yes", "Yes", "No", "Yes", "Yes", "No", …
## $ Game.Length..Hours.     <dbl> 55.3, 34.6, 13.9, 41.9, 13.2, 48.8, 36.9, 52.1…
## $ Graphics.Quality        <chr> "Medium", "Low", "Low", "Medium", "High", "Low…
## $ Soundtrack.Quality      <chr> "Average", "Poor", "Good", "Good", "Poor", "Av…
## $ Story.Quality           <chr> "Poor", "Poor", "Average", "Excellent", "Good"…
## $ User.Review.Text        <chr> "Solid game, but too many bugs.", "Solid game,…
## $ Game.Mode               <chr> "Offline", "Offline", "Offline", "Online", "Of…
## $ Min.Number.of.Players   <int> 1, 3, 5, 4, 1, 4, 3, 3, 10, 5, 1, 5, 10, 5, 4,…

Data Pre-processing

# View the column names
colnames(data)

##  [1] "Game.Title"              "User.Rating"            
##  [3] "Age.Group.Targeted"      "Price"                  
##  [5] "Platform"                "Requires.Special.Device"
##  [7] "Developer"               "Publisher"              
##  [9] "Release.Year"            "Genre"                  
## [11] "Multiplayer"             "Game.Length..Hours."    
## [13] "Graphics.Quality"        "Soundtrack.Quality"     
## [15] "Story.Quality"           "User.Review.Text"       
## [17] "Game.Mode"               "Min.Number.of.Players"

# Keep only the columns needed for our objectives (chosen using the correlations between numerical columns)
data1 <- data[c("Game.Title", "User.Rating", "Price", "Platform", "Release.Year", "Genre", "Game.Length..Hours.", "User.Review.Text")]

# Convert the categorical chr columns to factors (Game.Title is left as chr)
data1$Genre <- factor(data1$Genre)
data1$Platform <- factor(data1$Platform)
data1$User.Review.Text <- factor(data1$User.Review.Text)

# Check for missing values across the whole data frame
sum(is.na(data1))

## [1] 0
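A single total does not show which columns would hold missing values, and `is.na()` does not catch empty strings in character columns. A minimal sketch of a per-column check follows; the `toy` data frame is made up for illustration and is not the project data.

```r
# Hypothetical toy data frame standing in for the real dataset
toy <- data.frame(
  rating = c(36.4, NA, 26.8),
  genre  = c("Adventure", "", "Sports"),
  stringsAsFactors = FALSE
)

# Count NA values per column
na_counts <- colSums(is.na(toy))

# Count empty strings per column (only character columns can contain them)
empty_counts <- sapply(toy, function(col) {
  if (is.character(col)) sum(col == "", na.rm = TRUE) else 0L
})

na_counts
empty_counts
```

The same two calls could be applied to `data1` to confirm that no column contains blank entries.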

EDA

# After cleaning the dataset, we explore the summary of data1 and compute the mean and standard deviation of User.Rating.
summary(data1)

##   Game.Title         User.Rating        Price                  Platform   
##  Length:47774       Min.   :10.10   Min.   :19.99   Mobile         :9589  
##  Class :character   1st Qu.:24.30   1st Qu.:29.99   Nintendo Switch:9596  
##  Mode  :character   Median :29.70   Median :39.84   PC             :9599  
##                     Mean   :29.72   Mean   :39.95   PlayStation    :9633  
##                     3rd Qu.:35.10   3rd Qu.:49.96   Xbox           :9357  
##                     Max.   :49.50   Max.   :59.99                         
##                                                                           
##   Release.Year         Genre       Game.Length..Hours.
##  Min.   :2010   RPG       : 4873   Min.   : 5.00      
##  1st Qu.:2013   Shooter   : 4869   1st Qu.:18.80      
##  Median :2016   Strategy  : 4867   Median :32.50      
##  Mean   :2016   Puzzle    : 4822   Mean   :32.48      
##  3rd Qu.:2020   Simulation: 4784   3rd Qu.:46.30      
##  Max.   :2023   Adventure : 4750   Max.   :60.00      
##                 (Other)   :18809                      
##                                               User.Review.Text
##  Great game, but the graphics could be better.        : 4067  
##  Disappointing game, but the gameplay is amazing.     : 4020  
##  Disappointing game, but the graphics could be better.: 4018  
##  Solid game, but too many bugs.                       : 4017  
##  Solid game, but the gameplay is amazing.             : 3996  
##  Great game, but too many bugs.                       : 3977  
##  (Other)                                              :23679

summary(data1$User.Rating)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   10.10   24.30   29.70   29.72   35.10   49.50

sd(data1$User.Rating)

## [1] 7.550131

mean(data1$User.Rating)

## [1] 29.71933

# Next, we find the highest- and lowest-rated games.
high <- data1[order(-data1$User.Rating), ][1, ]
high

##            Game.Title User.Rating Price    Platform Release.Year  Genre
## 15184 Just Dance 2024        49.5 59.17 PlayStation         2014 Action
##       Game.Length..Hours.                                 User.Review.Text
## 15184                59.7 Disappointing game, but the gameplay is amazing.

low <- data1[order(data1$User.Rating), ][1, ]
low

##       Game.Title User.Rating Price Platform Release.Year Genre
## 17711     Tetris        10.1 20.08     Xbox         2013 Party
##       Game.Length..Hours.                              User.Review.Text
## 17711                 6.3 Solid game, but the graphics could be better.

# Since our target attribute is 'User Rating', we compute its quartiles, which are used later in the modelling part.
quantiles <- quantile(data1$User.Rating, probs = c(0.25, 0.5, 0.75))
cat("25th Percentile:", quantiles[1], "\n")

## 25th Percentile: 24.3

cat("Median:", quantiles[2], "\n")

## Median: 29.7

cat("75th Percentile:", quantiles[3], "\n")

## 75th Percentile: 35.1

# 1. Univariate Analysis
# Bar plot of number of games by genres
ggplot(data1, aes(x = Genre)) + geom_bar(fill = "red", color = "black") + labs(title = "Number of Games by Genres", x = "Game Genres", y = "Number of Games" )

# The graph shows that each genre has a similar number of games, so the dataset is well balanced across genres and does not need rebalancing.

# Bar plot of game count by platform
ggplot(data1, aes(x = Platform)) + geom_bar(fill = "lightblue", color = "black") + labs(title = "Number of Games by Platform", x = "Platform", y = "Number of Games")

# The same holds for platform: the games are spread almost evenly across the five platforms.

# Histogram of user rating for games released in year 2022
y22 <- data1%>%filter(Release.Year==2022)
ggplot(y22, aes(x = User.Rating)) + geom_histogram(binwidth = 1, fill = "orange", color = "black") + labs(title = "Distribution of User Ratings for Games Released in 2022", x = "User Rating", y = "Number of Games")

# The histogram shows that ratings for most games released in 2022 peak at around 32.5. There is a noticeable dip between ratings of 27.5 and 32.5, which creates a "valley" in the distribution. The dip and the subsequent recovery could be due to various factors, such as shifts in user preferences or game characteristics that polarized ratings into higher and lower groups.

# 2. Bivariate Analysis
# Line graph of number of games released over years
count_release <- data1 %>% count(Release.Year)
ggplot(count_release, aes(x = Release.Year, y = n)) + geom_point() + geom_line() + labs(title = "Number of Games Released Over Years", x = "Years", y = "Number of Games")

# The number of games released peaks in 2020. This may be due to the global pandemic, which increased the demand for home entertainment while lockdowns and social distancing measures were in place. Gaming on digital platforms may have resonated with audiences because it provided a sense of normalcy and connection in a socially isolated world.

# Line graph of average rating by release year
ggplot(data1, aes(x = Release.Year, y = User.Rating)) + stat_summary(fun = "mean", geom = "line", color = "orange") + labs(title = "Average User Rating by Release Year", x = "Release Year", y = "Average Rating")

# Although 2020 has the highest number of game releases, it also has the lowest average rating. Possible explanations include a larger player base and harsher standards in a crowded market. A high volume of releases may also dilute quality: some games may have struggled to stand out, and average quality may have dropped in an oversaturated market.

# Scatter plot for User Rating vs. Game Length (Hours) in year 2020
v <- data1%>%filter(Release.Year==2020)
ggplot(v, aes(x = User.Rating, y = Game.Length..Hours.)) + geom_point() + geom_smooth(method = lm) + labs(title = "Correlation between User Rating and Game Length (Hours) in year 2020")

## `geom_smooth()` using formula = 'y ~ x'

# This scatter plot shows a positive correlation, as the points form an upward trend. In general, games with longer playtimes tend to have higher ratings, which suggests that users favour longer games.

# 3. Multivariate Analysis
# Correlation matrix with numerical data
b <- data1[sapply(data1, is.numeric)]
b <- cor(b)
corrplot(b, method = "color", main = "Correlation Matrix with Numerical Data")

# This correlation matrix shows that 'Price' and 'Game Length (Hours)' are positively correlated with 'User Rating', with 'Price' showing the stronger correlation. 'Release Year' shows little to no linear correlation with the other attributes.
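A colour plot alone does not show how strong each relationship is. The sketch below prints the coefficients and tests one of them. It runs on simulated stand-in data: the `toy` data frame and the coefficients used to generate it are assumptions, not the project's results.

```r
set.seed(1)
n <- 200

# Simulated stand-ins (hypothetical) for Price, Game.Length..Hours. and User.Rating
price        <- runif(n, 20, 60)
length_hours <- runif(n, 5, 60)
rating       <- 0.5 * price + 0.2 * length_hours + rnorm(n)
toy <- data.frame(price, length_hours, rating)

# Pearson correlation matrix, rounded for easier reading
cor_mat <- round(cor(toy), 2)
cor_mat

# Test whether one coefficient differs from zero
ct <- cor.test(toy$price, toy$rating)
ct$estimate
ct$p.value
```

On the real data, `cor(data1[sapply(data1, is.numeric)])` would give the numbers behind the plot; `corrplot()` also accepts `addCoef.col` to print them on the tiles.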

# 4. Correlation Analysis for Categorical Data
chisq.test(data1$Genre, data1$Platform)

## 
##  Pearson's Chi-squared test
## 
## data:  data1$Genre and data1$Platform
## X-squared = 24.441, df = 36, p-value = 0.9282

# The test finds no evidence of an association between 'Genre' and 'Platform' (p = 0.9282), so we cannot reject the hypothesis that the two are independent.

chisq.test(data1$Genre, data1$User.Review.Text)

## 
##  Pearson's Chi-squared test
## 
## data:  data1$Genre and data1$User.Review.Text
## X-squared = 101.88, df = 99, p-value = 0.4013

chisq.test(data1$Platform, data1$User.Review.Text)

## 
##  Pearson's Chi-squared test
## 
## data:  data1$Platform and data1$User.Review.Text
## X-squared = 48.642, df = 44, p-value = 0.2915

# 'User Review Text' shows no statistically significant association with either 'Genre' (p = 0.4013) or 'Platform' (p = 0.2915).
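A chi-squared p-value only says whether independence can be rejected; it does not measure how strong an association is. One common effect size for two categorical variables is Cramér's V. The sketch below computes it by hand on simulated data; the variable names and category labels are illustrative assumptions.

```r
set.seed(42)
n <- 5000

# Two independently sampled categorical variables (simulated, hypothetical)
genre    <- sample(c("RPG", "Shooter", "Puzzle", "Sports"), n, replace = TRUE)
platform <- sample(c("PC", "Mobile", "Xbox"), n, replace = TRUE)

tab  <- table(genre, platform)
test <- chisq.test(tab)

# Cramer's V: effect size in [0, 1]; values near 0 mean a negligible association
cramers_v <- sqrt(unname(test$statistic) / (sum(tab) * (min(dim(tab)) - 1)))

test$p.value   # p > 0.05 means independence cannot be rejected
cramers_v
```

Because the two variables are generated independently here, V should come out close to zero. The `vcd` package loaded earlier also reports Cramér's V through `assocstats()`.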

quality<- data[c("User.Rating", "Graphics.Quality", "Soundtrack.Quality", "Story.Quality")]
chisq.test(quality$User.Rating, quality$Graphics.Quality)

## Warning in chisq.test(quality$User.Rating, quality$Graphics.Quality):
## Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  quality$User.Rating and quality$Graphics.Quality
## X-squared = 1205.3, df = 1173, p-value = 0.2498

chisq.test(quality$User.Rating, quality$Soundtrack.Quality)

## Warning in chisq.test(quality$User.Rating, quality$Soundtrack.Quality):
## Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  quality$User.Rating and quality$Soundtrack.Quality
## X-squared = 1191.8, df = 1173, p-value = 0.3446

chisq.test(quality$User.Rating, quality$Story.Quality)

## Warning in chisq.test(quality$User.Rating, quality$Story.Quality): Chi-squared
## approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  quality$User.Rating and quality$Story.Quality
## X-squared = 1162.2, df = 1173, p-value = 0.583

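# None of the three tests finds a statistically significant association between the quality attributes and 'User Rating' (p = 0.2498, 0.3446, and 0.583, all above 0.05). A larger p-value does not indicate a stronger relationship. The warnings arise because 'User Rating' is continuous, so each distinct rating becomes its own category and many cells have very small expected counts; the chi-squared test is not well suited to this comparison.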
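Because User.Rating is continuous, a chi-squared test treats every distinct rating as a separate category, which is what triggers the warnings above. A test that compares a numeric variable across groups may be a better fit. The sketch below uses the Kruskal-Wallis test as one alternative (not the method used in this report), on simulated data.

```r
set.seed(7)
n <- 600

# Simulated (hypothetical) data: a continuous rating and a three-level quality label
quality <- factor(sample(c("Low", "Medium", "High"), n, replace = TRUE),
                  levels = c("Low", "Medium", "High"))
rating  <- rnorm(n, mean = 30, sd = 7.5)

# Mean rating per quality level
group_means <- tapply(rating, quality, mean)

# Kruskal-Wallis rank-sum test: do the rating distributions differ between levels?
kw <- kruskal.test(rating ~ quality)

group_means
kw$p.value
```

On the real data the equivalent call would be `kruskal.test(User.Rating ~ Graphics.Quality, data = data)`, with a box plot per level as a visual check.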

# 5. WORD CLOUD
# Visualizing user review texts with word cloud
reviews <- as.character(data1$User.Review.Text)
reviews <- gsub("\\s+", " ", reviews)
words <- unlist(strsplit(reviews, "\\s+"))
# Stop words to exclude (named stop_words so it does not mask tm::stopwords())
stop_words <- c(stopwords("en"), "but", "and", "many")
words <- words[!tolower(words) %in% stop_words]
# Create a frequency table of the words
freq <- table(words)
# Generate the word cloud
wordcloud(names(freq), freq, min.freq = 1, scale = c(4, 0.5), colors = brewer.pal(8, "PuOr"), random.order = FALSE)

# The word cloud shows the words "amazing," "graphics," "better," "bugs," and "gameplay" at similar sizes, which means they occur with nearly equal frequency in the comments. These terms are the key themes in player feedback and are likely central to players' experiences with the games.
# In contrast, words like "great," "disappointing," and "solid" appear slightly smaller, meaning they are mentioned less often but still contribute to the overall sentiment. These terms reflect players' general impressions rather than the dominant themes.
# The prominence of "graphics" and "gameplay" suggests they are significant factors in players' opinions. Positive words ("amazing", "great", and "solid") may reflect appreciation for engaging gameplay, while negative words ("bugs" and "disappointing") may indicate that technical issues affected players' enjoyment. Note that "better" comes from the phrase "the graphics could be better", so it signals criticism rather than praise.
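Splitting on whitespace alone leaves punctuation attached to tokens (for example "bugs." and "game,"), so the same word can be counted under several spellings. The sketch below shows one way to normalise tokens before counting, using three review strings from the dataset preview. The short stop-word list is a hand-written assumption, much smaller than `tm::stopwords("en")`.

```r
# Three review strings taken from the dataset preview above
reviews <- c(
  "Solid game, but too many bugs.",
  "Great game, but the graphics could be better.",
  "Disappointing game, but the gameplay is amazing."
)

# Lower-case, strip punctuation, then split on whitespace
clean <- gsub("[[:punct:]]", "", tolower(reviews))
words <- unlist(strsplit(clean, "\\s+"))

# Small hand-written stop-word list (an assumption for this sketch)
stop_words <- c("but", "the", "is", "too", "many", "could", "be")
words <- words[!words %in% stop_words]

# Word frequencies, most frequent first
freq <- sort(table(words), decreasing = TRUE)
freq
```

The resulting `freq` table can be passed to `wordcloud(names(freq), freq, ...)` in the same way as before.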

Prepare Data For Modelling (Classification)

Create a new column, “Popularity”, that labels each game as Popular or Unpopular using the 75th-percentile rating (35.1) as the threshold, then split the data into 70% training and 30% testing sets.

# Check the structure of the dataset
str(data)

## 'data.frame':    47774 obs. of  18 variables:
##  $ Game.Title             : chr  "Grand Theft Auto V" "The Sims 4" "Minecraft" "Bioshock Infinite" ...
##  $ User.Rating            : num  36.4 38.3 26.8 38.4 30.1 38.6 33.1 32.3 26.7 23.9 ...
##  $ Age.Group.Targeted     : chr  "All Ages" "Adults" "Teens" "All Ages" ...
##  $ Price                  : num  41.4 57.6 44.9 48.3 55.5 ...
##  $ Platform               : chr  "PC" "PC" "PC" "Mobile" ...
##  $ Requires.Special.Device: chr  "No" "No" "Yes" "Yes" ...
##  $ Developer              : chr  "Game Freak" "Nintendo" "Bungie" "Game Freak" ...
##  $ Publisher              : chr  "Innersloth" "Electronic Arts" "Capcom" "Nintendo" ...
##  $ Release.Year           : int  2015 2015 2012 2015 2022 2017 2020 2012 2010 2013 ...
##  $ Genre                  : chr  "Adventure" "Shooter" "Adventure" "Sports" ...
##  $ Multiplayer            : chr  "No" "Yes" "Yes" "No" ...
##  $ Game.Length..Hours.    : num  55.3 34.6 13.9 41.9 13.2 48.8 36.9 52.1 56.4 46 ...
##  $ Graphics.Quality       : chr  "Medium" "Low" "Low" "Medium" ...
##  $ Soundtrack.Quality     : chr  "Average" "Poor" "Good" "Good" ...
##  $ Story.Quality          : chr  "Poor" "Poor" "Average" "Excellent" ...
##  $ User.Review.Text       : chr  "Solid game, but too many bugs." "Solid game, but too many bugs." "Great game, but the graphics could be better." "Solid game, but the graphics could be better." ...
##  $ Game.Mode              : chr  "Offline" "Offline" "Offline" "Online" ...
##  $ Min.Number.of.Players  : int  1 3 5 4 1 4 3 3 10 5 ...

# Add Popularity column
data$Popularity <- ifelse(data$User.Rating >= 35.1, "Popular", "Unpopular")
data$Popularity <- as.factor(data$Popularity)
data$User.Rating <- as.numeric(data$User.Rating)

# Split the data into training (70%) and testing (30%) sets
set.seed(123)
train_index <- createDataPartition(data$Popularity, p=0.7, list=FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]

message(
  "Initial: ", nrow(data), " rows.\n",
  "Train: ", nrow(train_data), " rows  \n",
  "Test: ", nrow(test_data), " rows ."
)

## Initial: 47774 rows.
## Train: 33443 rows  
## Test: 14331 rows .
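The threshold of 35.1 is typed in by hand from the quantile output shown earlier. Deriving it in code keeps the label consistent if the data changes. The sketch below does this on simulated ratings; the simulated vector only mimics the reported mean and standard deviation and is not the project data.

```r
set.seed(123)

# Simulated ratings (hypothetical) with roughly the mean and spread reported above
rating <- round(rnorm(1000, mean = 29.7, sd = 7.55), 1)

# Derive the cut-off from the data instead of typing it by hand
threshold <- unname(quantile(rating, probs = 0.75))

# Label each rating relative to the threshold
popularity <- factor(ifelse(rating >= threshold, "Popular", "Unpopular"),
                     levels = c("Popular", "Unpopular"))

threshold
prop.table(table(popularity))   # roughly one quarter should be "Popular"
```

A 75th-percentile cut-off yields roughly a 25/75 class split by construction, which is why the confusion matrices report a No Information Rate of about 0.75.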

Objective 1 : Classify Popular and Unpopular Games

Classification Algorithm 1 : Decision Tree Model

# Build the Decision Tree model
dt_model <- rpart(Popularity ~ Price + Game.Length..Hours.,
                  data = train_data,
                  method = "class")
# Predict on the test data
dt_predictions <- predict(dt_model, newdata = test_data, type = "class")
# Visualize the Decision Tree
rpart.plot(dt_model)

# Confusion matrix to evaluate the model
confusionMatrix(dt_predictions, test_data$Popularity)

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Popular Unpopular
##   Popular      3187       286
##   Unpopular     446     10412
##                                           
##                Accuracy : 0.9489          
##                  95% CI : (0.9452, 0.9525)
##     No Information Rate : 0.7465          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8631          
##                                           
##  Mcnemar's Test P-Value : 4.183e-09       
##                                           
##             Sensitivity : 0.8772          
##             Specificity : 0.9733          
##          Pos Pred Value : 0.9177          
##          Neg Pred Value : 0.9589          
##              Prevalence : 0.2535          
##          Detection Rate : 0.2224          
##    Detection Prevalence : 0.2423          
##       Balanced Accuracy : 0.9253          
##                                           
##        'Positive' Class : Popular         
##
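The headline statistics can be recomputed directly from the four cell counts, which makes clear what each one measures. The sketch below uses the counts from the Decision Tree confusion matrix; the variable names are our own.

```r
# Counts copied from the Decision Tree confusion matrix above
tp <- 3187   # predicted Popular,   actually Popular
fp <- 286    # predicted Popular,   actually Unpopular
fn <- 446    # predicted Unpopular, actually Popular
tn <- 10412  # predicted Unpopular, actually Unpopular

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
sensitivity <- tp / (tp + fn)   # share of Popular games correctly identified
specificity <- tn / (tn + fp)   # share of Unpopular games correctly identified
precision   <- tp / (tp + fp)   # reported by caret as "Pos Pred Value"

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, precision = precision), 4)
```

These agree with the Accuracy, Sensitivity, Specificity, and Pos Pred Value lines in the output. Sensitivity is the lowest of the four, so the tree misses more Popular games than it mislabels Unpopular ones.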

Classification Algorithm 2 : SVM Model

# Train the SVM model
svm_model <- svm(Popularity ~ Price + Game.Length..Hours.,
                 data = train_data,
                 type = "C-classification",
                 kernel = "linear")
# Predict on the test data
svm_predictions <- predict(svm_model, newdata = test_data)
# Confusion matrix to evaluate the SVM model
confusionMatrix(svm_predictions, test_data$Popularity)

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Popular Unpopular
##   Popular      3345       272
##   Unpopular     288     10426
##                                          
##                Accuracy : 0.9609         
##                  95% CI : (0.9576, 0.964)
##     No Information Rate : 0.7465         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.8966         
##                                          
##  Mcnemar's Test P-Value : 0.5262         
##                                          
##             Sensitivity : 0.9207         
##             Specificity : 0.9746         
##          Pos Pred Value : 0.9248         
##          Neg Pred Value : 0.9731         
##              Prevalence : 0.2535         
##          Detection Rate : 0.2334         
##    Detection Prevalence : 0.2524         
##       Balanced Accuracy : 0.9477         
##                                          
##        'Positive' Class : Popular        
##

Prepare Data For Modelling (Regression)

Prepare the data by splitting it into 70% training and 30% testing sets, with User.Rating as the target variable.

#regression
set.seed(123)
train_index <- createDataPartition(data$User.Rating, p=0.7, list=FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]

message(
  "Initial: ", nrow(data), " rows.\n",
  "Train: ", nrow(train_data), " rows  \n",
  "Test: ", nrow(test_data), " rows ."
)

## Initial: 47774 rows.
## Train: 33444 rows  
## Test: 14330 rows .

Objective 2 : Predict The Future Rating of Games

Regression Algorithm 1 : Random Forest Model

train_data$User.Rating <- as.numeric(train_data$User.Rating)
test_data$User.Rating <- as.numeric(test_data$User.Rating)
# Train the Random Forest model
rf_model <- randomForest(User.Rating ~ Price + Game.Length..Hours.,
                         data = train_data,
                         importance = TRUE,
                         ntree = 50)
# Make predictions on the test data
rf_predictions <- predict(rf_model, newdata = test_data)
# Calculate RMSE, R-squared (squared correlation), and MAE
rmse <- sqrt(mean((rf_predictions - test_data$User.Rating)^2))
rsq <- cor(rf_predictions, test_data$User.Rating)^2
mae <- mean(abs(rf_predictions - test_data$User.Rating))

# Visualize prediction using plot
rf <- data.frame(y_test = test_data$User.Rating, y_pred = round(rf_predictions,1))
subset_rf<- rf[1:35, ]
par(mar = c(5, 4, 4, 2))
plot(subset_rf$y_test, type = "l", col = "black", lwd = 2, xlab = "Index", ylab = "Value")
lines(subset_rf$y_pred, col = "blue", lwd = 2)
legend("topright", legend = c("Actual", "Predicted"), col = c("black", "blue"), lwd = 2)
title(main="Actual vs Predicted for Random Forest Regressor Model")

Regression Algorithm 2 : Linear Regression Model

# Fit the linear regression model
lr_model <- lm(User.Rating ~ Price + Game.Length..Hours.,
            data = train_data)

# Make predictions on the test set
lr_predictions <- predict(lr_model, newdata = test_data)

# Calculate RMSE, R-squared (squared correlation), and MAE
rmse_lr <- sqrt(mean((lr_predictions - test_data$User.Rating)^2))
rsq_lr <- cor(lr_predictions, test_data$User.Rating)^2
mae_lr <- mean(abs(lr_predictions - test_data$User.Rating))

# Visualize prediction using plot
lr <- data.frame(y_test = test_data$User.Rating, y_pred = round(lr_predictions,1))
subset_lr<- lr[1:35, ]
par(mar = c(5, 4, 4, 2))
plot(subset_lr$y_test, type = "l", col = "black", lwd = 2, xlab = "Index", ylab = "Value")
lines(subset_lr$y_pred, col = "blue", lwd = 2)
legend("topright", legend = c("Actual", "Predicted"), col = c("black", "blue"), lwd = 2)
title(main="Actual vs Predicted for Linear Regression Model")

Comparing Random Forest and Linear Regression

# Compare model performance
evaluation_table <- data.frame(
  Model = c('Random Forest', 'Linear Regression'),
  R_squared = c(rsq, rsq_lr),
  RMSE = c(rmse, rmse_lr),
  MAE = c(mae, mae_lr)
)
print(evaluation_table)

##               Model R_squared     RMSE      MAE
## 1     Random Forest 0.9725199 1.245908 1.059441
## 2 Linear Regression 0.9761487 1.160752 1.007670
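The R_squared column is the squared Pearson correlation between predictions and actual values. On a test set this can differ from the coefficient of determination, 1 - SSE/SST, which also penalises systematic bias in the predictions. The sketch below defines both side by side; the helper names and the four data points are made up for illustration.

```r
# Helper functions for the regression metrics (names are our own)
rmse_fn <- function(actual, predicted) sqrt(mean((predicted - actual)^2))
mae_fn  <- function(actual, predicted) mean(abs(predicted - actual))

# Coefficient of determination: 1 - SSE / SST
r2_fn <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Tiny made-up example (hypothetical values)
actual    <- c(30, 25, 40, 35)
predicted <- c(28, 27, 41, 33)

rmse_fn(actual, predicted)
mae_fn(actual, predicted)
r2_fn(actual, predicted)
cor(actual, predicted)^2   # squared correlation, as used in the report
```

Applying `r2_fn(test_data$User.Rating, rf_predictions)` alongside the existing `rsq` would show whether the two definitions agree for these models.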

Conclusion

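According to the correlation analysis, price and game length are clearly correlated with user rating, which implies that the ratings of upcoming games can be predicted from these two factors. Based on the modelling results above, the best-performing models are the SVM for classification (accuracy 0.9609, versus 0.9489 for the Decision Tree) and Linear Regression for rating prediction (R-squared 0.9761, RMSE 1.1608, MAE 1.0077, versus 0.9725, 1.2459, and 1.0594 for Random Forest).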
