#Load the packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.2.0
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Both of us like playing video games, and we often read game reviews
on a variety of platforms. We therefore chose this dataset, which
records over 15 thousand games along with attributes such as scores and
sales figures. We believe this research would benefit stakeholders
in the game industry for the following reasons:
1. For game developers and publishers,
understanding the factors that lead to a high-scoring game helps them
create games that appeal to their target audience, which can result in
higher sales, increased user engagement, and an improved brand reputation.
The results of our research can guide game development, allowing teams
to make better choices regarding gameplay mechanics, graphics,
storylines, and marketing strategies.
2. For investors who care about whether a game can
succeed, identifying factors associated with high-scoring games can
help them make more informed investment decisions and better assess the
risks and returns associated with a particular project.
3. For the professional game reviewers, recognizing the
elements that contribute to high-scoring games can help reviewers refine
their evaluation criteria, ensuring they provide accurate and reliable
assessments that effectively guide gamers’ choices.
4. For researchers who study topics similar
to ours, investigating the factors behind popular games can
advance their understanding of game design, player behavior, and the
gaming industry as a whole. Our research can support game design education
and contribute to the development of game development methods in the
field.
5. Most importantly, for gamers themselves,
the ability to discern the factors that contribute to a high-scoring game
provides valuable information, helping them make better
purchasing decisions. This can lead to increased satisfaction and trust
in game reviews written by gamers, as well as a better
understanding of their personal game preferences.
raw = read.csv('raw_data.csv')
#Remove rows with NA critic scores
data = raw %>% drop_na(Critic_Score)
head(data)
## Name Platform Year_of_Release Genre Publisher
## 1 Wii Sports Wii 2006 Sports Nintendo
## 2 Mario Kart Wii Wii 2008 Racing Nintendo
## 3 Wii Sports Resort Wii 2009 Sports Nintendo
## 4 New Super Mario Bros. DS 2006 Platform Nintendo
## 5 Wii Play Wii 2006 Misc Nintendo
## 6 New Super Mario Bros. Wii Wii 2009 Platform Nintendo
## NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales Critic_Score Critic_Count
## 1 41.36 28.96 3.77 8.45 82.53 76 51
## 2 15.68 12.76 3.79 3.29 35.52 82 73
## 3 15.61 10.93 3.28 2.95 32.77 80 73
## 4 11.28 9.14 6.50 2.88 29.80 89 65
## 5 13.96 9.18 2.93 2.84 28.92 58 41
## 6 14.44 6.94 4.70 2.24 28.32 87 80
## User_Score User_Count Developer Rating
## 1 8 322 Nintendo E
## 2 8.3 709 Nintendo E
## 3 8 192 Nintendo E
## 4 8.5 431 Nintendo E
## 5 6.6 129 Nintendo E
## 6 8.4 594 Nintendo E
The above code reads the CSV file of raw (unprocessed) data and then
removes the rows in which Critic_Score is NA (missing).
The reason is that our topic is to discover the factors of a game that
lead to high critic scores. We consider the critic score to be a better
index than the user score because critic reviews are usually written by experts with
a lot of experience and knowledge of games, so they can give more
objective and comprehensive assessments. In addition, the review platforms
use standardized criteria and systems to rate games, so the scores are
consistent and easier to compare. Moreover, critic scores are
a better reference because they often come out before the release of the game
and help players decide whether they should buy the
game.
The last line of the code shows the first 6 rows of the data so that I
can check whether the NAs have been removed.
#Subsetting the features of the game (independent variable)
#Four variables we consider important:
x1 <- c("Platform", "Year_of_Release", "Genre", "Publisher")
x_imp <- data[x1]
head(x_imp)
## Platform Year_of_Release Genre Publisher
## 1 Wii 2006 Sports Nintendo
## 2 Wii 2008 Racing Nintendo
## 3 Wii 2009 Sports Nintendo
## 4 DS 2006 Platform Nintendo
## 5 Wii 2006 Misc Nintendo
## 6 Wii 2009 Platform Nintendo
#Separate games based on critic scores into three levels: low, medium, high
min(data["Critic_Score"])
## [1] 13
max(data["Critic_Score"])
## [1] 98
(98-13)/3
## [1] 28.33333
low_score <- data %>% filter(Critic_Score <= 41.3)
med_score <- data %>% filter(between(Critic_Score, 42, 70))
high_score <- data %>% filter(Critic_Score >= 71)
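The three thresholds above were worked out by hand from the minimum, the maximum, and one third of the range. As a minimal sketch of the same idea, the base R function `cut()` can derive equal-width bins directly. The `scores` vector below holds hypothetical values, not rows from the dataset. Note that exact equal-width boundaries fall at about 41.33 and 69.67, so a score of 70 lands in the high bin here, whereas the hand-written filters above place it in the medium group.

```r
# Toy critic scores (hypothetical values, not taken from the dataset)
scores <- c(13, 41, 42, 69, 70, 71, 98)

# Four break points give three equal-width bins between min and max
breaks <- seq(min(scores), max(scores), length.out = 4)

# include.lowest = TRUE keeps the minimum score inside the first bin
score_level <- cut(scores, breaks = breaks,
                   labels = c("low", "medium", "high"),
                   include.lowest = TRUE)

data.frame(scores, score_level)
```

Applied to the real data, the same call could add a single `score_level` column instead of creating three separate data frames, which may make later grouping easier.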
We don’t need to rename the variables, because the original dataset
from Kaggle is well-formatted; we only need to explain some
variables.
Note: xx_Sales are in millions.
NA_Sales: The number of units sold in North
America.
EU_Sales: The number of units sold in
Europe.
JP_Sales: The number of units sold in Japan.
The dataset lists Japan rather than Asia as a whole because Japan is home to many famous game
companies (like Sony and Nintendo), and Japanese players buy a lot of
games.
Other_Sales: The total number of units sold in
all regions other than the above three (such as Africa and the rest of
Asia).
Critic_Score: The mean score from
game-rating platforms like IGN (Imagine Games Network), GameSpot,
Metacritic and so on.
User_Score: The mean score given by game players
themselves. Some well-known platforms where players rate games are:
Steam: One of the most popular
game distribution platforms for video game players, where players can
leave reviews and ratings for games they’ve played. Steam is a PC platform, so it
is a reliable source for judging a PC game (for example, its
optimization and the novelty of its gameplay).
Google Play Store and Apple App Store: Both of these mobile app stores
allow users to rate and review mobile games. From my experience, Google
Play has more games, while the Apple App Store offers more variety.
GOG: A digital distribution platform that specializes in classic and
indie games, where users can leave reviews and ratings for games they’ve
played.
Rating: The appropriate age range for
players. It provides parents and guardians with guidance on the game’s
content. Here is a detailed explanation of each rating:
Everyone (E): Meaning that games are suitable for all ages of players,
as the game doesn’t have objectionable content.
Everyone 10+ (E10+): Games suitable for children aged ten and up, with
mild violence or cartoonish violence, some crude humor, and minimal
suggestive content.
Teen (T): Games suitable for teenagers aged thirteen and up, with mild
to moderate violence, some language, and some suggestive content.
Mature (M): Games suitable for players aged seventeen and up, with
intense violence, strong language, sexual content, and/or use of
drugs.
Adults Only (AO): Games containing extreme violence, explicit sexual
content, and/or gambling, and are only suitable for players aged
eighteen and up.
K-A (Kids to Adults): A video game content rating that was used by the
Entertainment Software Rating Board (ESRB) in North America from 1994 to
1998. The K-A rating was used for video games that were considered
appropriate for players of all ages, before the E rating was
introduced in 1998.
#Data frame restricted to games that have user ratings (might be useful)
user_data <- data %>% drop_na(User_Count)
#Summarize important variables
df1 <- data %>% group_by(Platform) %>% summarise(n = n(), mean_score = mean(Critic_Score)) %>% arrange(desc(mean_score))
df1
## # A tibble: 17 × 3
## Platform n mean_score
## <chr> <int> <dbl>
## 1 DC 14 87.4
## 2 PC 715 75.9
## 3 XOne 169 73.3
## 4 PS4 252 72.1
## 5 PS 200 71.5
## 6 PSV 120 70.8
## 7 WiiU 90 70.7
## 8 PS3 820 70.4
## 9 XB 725 69.9
## 10 GC 448 69.5
## 11 PS2 1298 68.7
## 12 X360 916 68.6
## 13 PSP 462 67.4
## 14 GBA 438 67.4
## 15 3DS 168 67.1
## 16 DS 717 63.8
## 17 Wii 585 62.8
df2 <- data %>% group_by(Genre) %>% summarise(n = n(), mean_score = mean(Critic_Score)) %>% arrange(desc(mean_score))
df2
## # A tibble: 12 × 3
## Genre n mean_score
## <chr> <int> <dbl>
## 1 Role-Playing 737 72.7
## 2 Strategy 302 72.1
## 3 Sports 1194 72.0
## 4 Shooter 944 70.2
## 5 Fighting 409 69.2
## 6 Simulation 352 68.6
## 7 Platform 497 68.1
## 8 Racing 742 68.0
## 9 Puzzle 224 67.4
## 10 Action 1890 66.6
## 11 Misc 523 66.6
## 12 Adventure 323 65.3
df3 <- data %>% group_by(Publisher) %>% summarise(n = n(), mean_score = mean(Critic_Score)) %>% arrange(desc(n))
df3
## # A tibble: 304 × 3
## Publisher n mean_score
## <chr> <int> <dbl>
## 1 Electronic Arts 1029 74.5
## 2 Activision 569 69.7
## 3 Ubisoft 558 68.5
## 4 THQ 405 66.7
## 5 Sony Computer Entertainment 349 74.0
## 6 Konami Digital Entertainment 328 68.3
## 7 Sega 319 69.7
## 8 Nintendo 310 75.5
## 9 Take-Two Interactive 294 75.0
## 10 Namco Bandai Games 279 66.4
## # … with 294 more rows
We can see that df3 contains too many publishers to inspect one by one, and some of them published only a few games yet have high mean scores.
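One possible way to keep small publishers from dominating the ranking is to require a minimum number of games before a publisher is included. The sketch below uses a toy data frame with hypothetical publishers and scores, and `min_games` is an assumed cut-off rather than a value taken from our analysis.

```r
library(dplyr)

# Toy data (hypothetical publishers and scores, for illustration only)
games <- data.frame(
  Publisher = c("A", "A", "A", "B", "B", "C"),
  Critic_Score = c(80, 70, 75, 60, 62, 95)
)

# Assumed cut-off; a suitable value for the real data would need to be chosen
min_games <- 2

publisher_summary <- games %>%
  group_by(Publisher) %>%
  summarise(n = n(), mean_score = mean(Critic_Score)) %>%
  filter(n >= min_games) %>%        # drop publishers with too few games
  arrange(desc(mean_score))

publisher_summary
```

Publisher "C" has the highest score in the toy data but only one game, so it is dropped before the ranking is made.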
We divided the games into three levels (high, medium, and low) by their critic scores. The first summary groups the dataset by the ‘Platform’ column, then calculates the count ‘n’ and the mean of ‘Critic_Score’ for each platform. It stores the result in df1, arranged in descending order of mean score, and finally displays df1. df2 does the same for ‘Genre’, and df3 for ‘Publisher’ (sorted by count instead). I added the ‘Platform’ column out of curiosity, as I want to see which platforms tend to have high-scoring games.
#First purpose: See which factors lead to high critic score.
#Maybe we can clean df3?
This step is done in the previous two questions.
library(ggplot2)
p <- ggplot(data, aes(x=Genre, y=Critic_Score)) + geom_boxplot() + theme(axis.text.x = element_text(angle =90, vjust = 0.5, hjust=1))
p
df3 <- data[, c('Year_of_Release', 'Critic_Score')] %>% drop_na(Year_of_Release)
df4 <- df3 %>% group_by(Year_of_Release) %>% summarise(mean_score = mean(Critic_Score)) %>% arrange(Year_of_Release)
df4 <- slice(df4, 1:(n() - 1)) #Remove the last row, whose year is missing (drop_na() does not catch it, probably because it is stored as text rather than as a true NA)
df4$Year_of_Release <- as.numeric(df4$Year_of_Release) #Convert the year to a number so that geom_path() connects the points
q2 <- ggplot(data=df4, aes(x=Year_of_Release, y=mean_score)) + geom_path()+ geom_point()
q2 + theme(axis.text.x = element_text(angle =90, vjust = 0.5, hjust=1))
Now, plot the genre distribution for each of the three score levels.
#Low
p1 <- ggplot(low_score, aes(x=forcats::fct_infreq(Genre))) + geom_bar()
p1
#Medium
p2 <- ggplot(med_score, aes(x=forcats::fct_infreq(Genre))) + geom_bar()
p2
#High
p3 <- ggplot(high_score, aes(x=forcats::fct_infreq(Genre))) + geom_bar()
p3
I plotted a boxplot of Critic Score by Genre, a line plot with points to show the trend in the mean score of games by year, and three bar plots, one for each score level, to show the frequency of each genre.
Central tendency refers to the value around which the data tends to cluster. The most common measures of central tendency are mean, median, and mode. Mean is the arithmetic average of the data, median is the middle value when the data is arranged in ascending or descending order, and mode is the value that appears most frequently in the data. These measures help to understand the distribution of the data and can be used to make comparisons between different groups or variables in the dataset.
In order to calculate these measures in R, we can use various built-in functions. One thing to notice is that, since mode is not always well-defined and can be less informative compared to the other two measures, it is not used as frequently.
Another useful summary is the range of the data, which is simply the difference between the maximum and minimum values of the variable. Unlike the mean, median, and mode, the range is a measure of spread rather than of central tendency. It helps to understand how widely the data vary and can hint at outliers in the dataset.
The following code walks through the process of calculating these measures.
#Summary
One easy way to calculate several of these measures at once is the ‘summary()’ function in R.
# Use Critic_Score as an example
summary(data$Critic_Score)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 13.00 60.00 71.00 68.97 79.00 98.00
The ‘summary()’ function outputs the minimum, 1st quartile, median, mean, 3rd quartile, and maximum of the input variable.
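‘summary()’ reports the minimum and maximum but not the spread as a single number. As a small sketch, the base R functions below compute the range and two other common measures of spread. The vector `x` holds hypothetical values chosen for illustration, not the real scores.

```r
# Toy scores (hypothetical values, for illustration only)
x <- c(13, 60, 71, 79, 98)

x_range <- range(x)       # minimum and maximum as a length-2 vector
x_width <- diff(x_range)  # range as a single number: max - min
x_iqr   <- IQR(x)         # interquartile range: 3rd quartile - 1st quartile
x_sd    <- sd(x)          # sample standard deviation

x_width
x_iqr
x_sd
```

The interquartile range is less sensitive to extreme values than the full range, which can be helpful for a variable with a few very large observations.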
#Mean
To calculate the mean of a variable, we can use the mean() function in R.
# Calculate the mean of the Global_Sales variable
mean_sales <- mean(data$Global_Sales)
# Print the result
cat("The mean of Global_Sales is:", mean_sales)
## The mean of Global_Sales is: 0.6890353
If we then want to calculate the mean ‘Critic_Score’ for different subsets of the data, such as the groups mentioned in the previous question, we can use the ‘group_by()’ and ‘summarise()’ functions from the dplyr package.
# Load the dplyr package
library(dplyr)
# Take the 'data' dataframe and group it by the 'Genre' variable using the pipe operator
data %>%
group_by(Genre) %>%
# Calculate the mean of the 'Critic_Score' variable for each group of 'Genre'
summarise(mean_Critic_Score = mean(Critic_Score))
## # A tibble: 12 × 2
## Genre mean_Critic_Score
## <chr> <dbl>
## 1 Action 66.6
## 2 Adventure 65.3
## 3 Fighting 69.2
## 4 Misc 66.6
## 5 Platform 68.1
## 6 Puzzle 67.4
## 7 Racing 68.0
## 8 Role-Playing 72.7
## 9 Shooter 70.2
## 10 Simulation 68.6
## 11 Sports 72.0
## 12 Strategy 72.1
This code creates a new data frame with two columns: ‘Genre’ and ‘mean_Critic_Score’.
#Median
To calculate the median of a variable, we can use the median() function in R.
# Calculate the median of the Global_Sales variable
median_sales <- median(data$Global_Sales)
# Print the result
cat("The median of Global_Sales is:", median_sales)
## The median of Global_Sales is: 0.24
#Mode
Calculating the mode of a variable differs from the steps before, since R has no built-in function for the statistical mode (the built-in mode() returns the storage type of an object instead). However, we can write our own function to calculate it:
# Define a function to calculate the mode
get_mode <- function(x) {
uniqx <- unique(x)
uniqx[which.max(tabulate(match(x, uniqx)))]
}
# Calculate the mode of the Platform variable
mode_platform <- get_mode(data$Platform)
# Print the result
cat("The mode of Platform is:", mode_platform)
## The mode of Platform is: PS2
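When two or more values are tied for the highest count, the function above returns only the one that appears first in the data. As a hedged alternative, the sketch below uses base R's `table()` to return every tied value. `get_modes` is a hypothetical helper name, and the input vector is toy data.

```r
# Return all values that share the highest frequency (hypothetical helper)
get_modes <- function(x) {
  counts <- table(x)                     # frequency of each distinct value
  names(counts)[counts == max(counts)]   # keep every value with the top count
}

# Toy platform labels with a tie between "PS2" and "Wii"
platforms <- c("PS2", "Wii", "PS2", "Wii", "DS")
modes <- get_modes(platforms)
modes
```

Because `table()` stores the values as names, the result is always a character vector, so numeric input would need converting back with `as.numeric()`.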
#Correlational Analysis
Correlation analysis is a technique used to examine the relationship between two variables in a dataset. Correlation coefficients measure the strength and direction of the relationship between two variables. The most commonly used correlation coefficient is the Pearson correlation coefficient, which measures the linear relationship between two variables. It ranges from -1 to 1, where -1 indicates a perfect negative correlation, 0 indicates no linear correlation, and 1 indicates a perfect positive correlation.
To calculate the Pearson correlation coefficient in R, we can use the ‘cor()’ function. It takes two numeric variables as input and returns the correlation coefficient.
Because the variables may be stored as a non-numeric type and may contain NA values, we first convert them to numeric and then pass the ‘pairwise.complete.obs’ option in the ‘use’ argument, so that only observations with both values present enter the calculation of the correlation coefficient.
#Take Critic score and User score as an example
data$User_Score <- as.numeric(data$User_Score)
## Warning: NAs introduced by coercion
cor_result <- cor(data$Critic_Score, data$User_Score, use = "pairwise.complete.obs")
cor_result
## [1] 0.5808778
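The coercion warning above appears because some User_Score entries are not numbers, so `as.numeric()` turns them into NA. The sketch below reproduces the idea on toy vectors. The values are hypothetical, and the string "tbd" is only an assumed stand-in for whatever non-numeric entry the real column contains.

```r
# Toy vectors (hypothetical values); "tbd" stands in for any non-numeric entry
critic <- c(76, 82, 80, 89, 58, NA)
user_raw <- c("8", "8.3", "tbd", "8.5", "6.6", "8.4")

# Non-numeric strings become NA; suppressWarnings() hides the coercion warning
user <- suppressWarnings(as.numeric(user_raw))

# Only rows where both values are present are used (rows 1, 2, 4 and 5 here)
r <- cor(critic, user, use = "pairwise.complete.obs")
r
```

It can be worth counting the NAs after conversion, for example with `sum(is.na(user))`, to see how many games are left out of the correlation.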
#Visualization
We can also visualize the relationship between two variables using a scatterplot. Scatterplots help to show the nature of the relationship between two variables. If the points on the scatterplot form a roughly linear pattern, a linear relationship between the variables is plausible.
This code creates a scatter plot with Critic_Score on the x-axis and User_Score on the y-axis.
library(ggplot2)
# Create a scatter plot
ggplot(data, aes(x = Critic_Score, y = User_Score)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "lm", se = TRUE) +
labs(x = "Critic Score", y = "User Score") +
theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 1120 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 1120 rows containing missing values (`geom_point()`).
The output is a scatter plot that shows the relationship between critic
scores and user scores. We can see that there is a moderate positive correlation
between these two variables (r ≈ 0.58), which means that as critic scores increase, user
scores also tend to increase.
###Q7. A look forward to what further questions the analysis suggests and what it enables
Data analysis is an iterative process that involves constantly asking new questions and refining our understanding of the data. Once we have explored the data and identified some initial insights, we can start to ask more specific questions that help us understand the data in greater detail.
The code we have covered so far aims to shed light on the factors that play a role in determining the success of a video game. Initially, we subset the variables (including genre, platform, publisher, and release year) and sales to explore the relationship. We then used a simple linear fit (the ‘lm’ smoother in the scatter plot) to model the relationship between critic scores and user scores. Moving forward, we can investigate which variables have the most significant impact on sales. By doing so, we can gain further insight into what factors contribute the most to a video game’s success.
Furthermore, we grouped the games into three levels (low, medium, and high score) using score thresholds, so that in future analysis we can look for patterns within each group and understand the characteristics of successful games.
Based on the current analysis, some further questions that could be explored include:
Is there a relationship between a game’s platform and its rating? More specifically, are there any significant differences in ‘Critic_Score’ between different ‘Platform’ categories? We could use a hypothesis test such as a t-test (for two platforms) or ANOVA (for several) to explore this question.
Do more recent games generally have better ratings? More specifically, does the Year_of_Release of a game have an impact on its Critic_Score? We could use a linear regression model to explore this question.
For the other variables, such as ‘NA_Sales’, ‘EU_Sales’, ‘JP_Sales’, and ‘Other_Sales’, can we find any relationship or trend? We could use a visualization such as a line graph to explore this question.
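The first two questions could be approached with base R's `aov()` and `lm()` functions. The sketch below runs both on a small toy data frame with hypothetical values. It only illustrates the function calls and is not a result from the real dataset.

```r
# Toy data (hypothetical values, not the real dataset)
toy <- data.frame(
  Platform = rep(c("PC", "PS2", "Wii"), each = 4),
  Year_of_Release = c(2004, 2006, 2008, 2010,
                      2003, 2005, 2007, 2009,
                      2006, 2008, 2010, 2012),
  Critic_Score = c(78, 74, 80, 76,
                   70, 66, 72, 68,
                   60, 64, 62, 66)
)

# One-way ANOVA: do mean critic scores differ between platforms?
anova_fit <- aov(Critic_Score ~ Platform, data = toy)
summary(anova_fit)

# Simple linear regression: is release year associated with critic score?
lm_fit <- lm(Critic_Score ~ Year_of_Release, data = toy)
summary(lm_fit)
```

On the real data, Year_of_Release would first need to be numeric, and the ANOVA assumptions (roughly normal residuals and similar variances across platforms) would need checking before the p-value is interpreted.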
By analyzing these questions, we will be able to draw some preliminary conclusions about the data. If there are many highly rated games in earlier years, such as 2010, we can infer that a high rating is not solely based on hardware advancement or graphics, but also on the gameplay.
Furthermore, if a platform accounts for a large share of the games in the high-rating group, this suggests that the platform tends to host well-received games, which could help consumers decide which platform’s games to buy.