R Notebook

#Load the packages
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.2.0
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Motivation

Both of us like playing video games, and we often read game reviews on a variety of platforms. We therefore chose this dataset, which records over 15 thousand games along with attributes such as scores and sales. We believe this research would be beneficial to stakeholders in the game industry for the following reasons:
1. For game developers and publishers, understanding the factors that lead to a high-scoring game helps them create games that appeal to their target audience, resulting in higher sales, increased user engagement, and improved brand reputation. The results of our research can guide game development, allowing people to make better choices regarding gameplay mechanics, graphics, storylines, and marketing strategies.
2. For investors who care about whether a game will succeed, identifying factors associated with high-scoring games can help them make more informed investment decisions and better assess the risks and returns associated with a particular project.
3. For professional game reviewers, recognizing the elements that contribute to high-scoring games can help them refine their evaluation criteria, ensuring they provide accurate and reliable assessments that effectively guide gamers’ choices.
4. For scientists who study topics similar to ours, investigating the factors that lead to popular games can advance their understanding of game design, player behavior, and the gaming industry as a whole. Our research can support game design education and contribute to the development of game-development methods in the field.
5. Most importantly, for gamers themselves, being able to discern the factors that contribute to a high-scoring game provides valuable information that helps them make better purchasing decisions. This can lead to increased satisfaction and trust in game reviews written by gamers, as well as a better understanding of their personal preferences in games.

Q1. Reading the data into R in a tabular format, identifying, subsetting, and renaming the variables for your use

raw = read.csv('raw_data.csv')
#Remove rows with NA critic scores
data = raw %>% drop_na(Critic_Score)
head(data)

##                        Name Platform Year_of_Release    Genre Publisher
## 1                Wii Sports      Wii            2006   Sports  Nintendo
## 2            Mario Kart Wii      Wii            2008   Racing  Nintendo
## 3         Wii Sports Resort      Wii            2009   Sports  Nintendo
## 4     New Super Mario Bros.       DS            2006 Platform  Nintendo
## 5                  Wii Play      Wii            2006     Misc  Nintendo
## 6 New Super Mario Bros. Wii      Wii            2009 Platform  Nintendo
##   NA_Sales EU_Sales JP_Sales Other_Sales Global_Sales Critic_Score Critic_Count
## 1    41.36    28.96     3.77        8.45        82.53           76           51
## 2    15.68    12.76     3.79        3.29        35.52           82           73
## 3    15.61    10.93     3.28        2.95        32.77           80           73
## 4    11.28     9.14     6.50        2.88        29.80           89           65
## 5    13.96     9.18     2.93        2.84        28.92           58           41
## 6    14.44     6.94     4.70        2.24        28.32           87           80
##   User_Score User_Count Developer Rating
## 1          8        322  Nintendo      E
## 2        8.3        709  Nintendo      E
## 3          8        192  Nintendo      E
## 4        8.5        431  Nintendo      E
## 5        6.6        129  Nintendo      E
## 6        8.4        594  Nintendo      E

Code explanation

The above code reads the CSV file of raw (unprocessed) data and then removes the rows in which Critic_Score is NA (missing).
The reason is that our topic is to discover the factors of a game that lead to high critic scores. We consider the critic score to be a better index than the user score because critic reviews are usually written by experts with a lot of experience and knowledge of games, so they can give more objective and comprehensive assessments. In addition, critic platforms use standardized criteria and systems to rate games, so the scores are more consistent and easier to compare. Moreover, critic scores are a better reference because they often come out before or around the release of a game and help players decide whether they should buy it.
The last line of the code shows the first 6 rows of the data as a quick check of the result.
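
Looking at the first rows only gives a partial check. As a minimal sketch, assuming a toy data frame with made-up values in place of the raw file, the effect of drop_na() can be confirmed directly by counting the remaining NAs and the rows removed:

```r
library(tidyr)

# Toy data frame standing in for the raw file (all values are made up)
toy <- data.frame(
  Name = c("Game A", "Game B", "Game C"),
  Critic_Score = c(76, NA, 89),
  User_Score = c(8.0, 7.5, NA)
)

# Naming a column makes drop_na() remove rows that are NA in that column only
cleaned <- drop_na(toy, Critic_Score)

# No missing critic scores should remain
sum(is.na(cleaned$Critic_Score))

# Number of rows that were dropped
nrow(toy) - nrow(cleaned)
```

Rows that are missing only in other columns (User_Score in this toy example) are kept, which matches the intent of cleaning on Critic_Score alone.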

#Subsetting the features of the game (independent variable)
#Four variables we consider important:
x1 <- c("Platform", "Year_of_Release", "Genre", "Publisher")
x_imp <- data[x1]
head(x_imp)

##   Platform Year_of_Release    Genre Publisher
## 1      Wii            2006   Sports  Nintendo
## 2      Wii            2008   Racing  Nintendo
## 3      Wii            2009   Sports  Nintendo
## 4       DS            2006 Platform  Nintendo
## 5      Wii            2006     Misc  Nintendo
## 6      Wii            2009 Platform  Nintendo

#Separate games based on critic scores into three levels: low, medium, high
min(data["Critic_Score"])

## [1] 13

max(data["Critic_Score"])

## [1] 98

(98-13)/3

## [1] 28.33333

#Integer cut-offs near the thirds of the range (13 to 98); the conditions leave no gaps between levels
low_score <- data %>% filter(Critic_Score < 42)
med_score <- data %>% filter(between(Critic_Score, 42, 70))
high_score <- data %>% filter(Critic_Score > 70)
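
As an alternative sketch (not the approach used above), base R's cut() can derive three equal-width levels from the range itself. The scores below are made up to span the observed range of 13 to 98. Because the exact cut points fall at about 41.33 and 69.67, a score of 70 lands in the high level here, unlike with the hand-written thresholds.

```r
# Made-up scores covering the observed range of 13 to 98
scores <- c(13, 41, 42, 55, 69, 70, 71, 98)

# breaks = 3 asks cut() for three equal-width intervals over the range of scores;
# cut() pads the outer limits slightly so the minimum and maximum are included
level <- cut(scores, breaks = 3, labels = c("low", "medium", "high"))

# Count how many scores fall in each level
table(level)
```

The resulting factor could be stored as a new column, so that one data frame carries the level instead of three separate subsets.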

Identifying name and unit of variables:

We don’t need to rename the variables because the original dataset from Kaggle is well-formatted; we only need to explain some of them.

Note: the xx_Sales variables are in millions of copies.

NA_Sales: The number of sales in North America.
EU_Sales: The number of sales in Europe.
JP_Sales: The number of sales in Japan. The dataset lists Japan rather than Asia because Japan has many famous game companies (like Sony and Nintendo), and Japanese players buy a lot of games.
Other_Sales: The total number of sales in all areas other than the three above (such as Africa and other Asian countries).
Critic_Score: The mean score from game-rating platforms such as IGN (Imagine Games Network), GameSpot, and Metacritic.
User_Score: The mean score from game players themselves. Some famous platforms are:
Steam: One of the most popular game distribution platforms, where players can leave reviews and ratings for games they’ve played. Steam is a PC platform, so it is a reliable place to look at the score of a PC game (covering aspects like game optimization and novelty of gameplay).
Google Play Store and Apple App Store: Both of these mobile app stores allow users to rate and review mobile games. In my experience, Google Play has more games, while the Apple App Store has a higher variety.
GOG: A digital distribution platform that specializes in classic and indie games, where users can leave reviews and ratings for games they’ve played.
Rating: The appropriate age range for players. It provides parents and guardians with guidance on the game’s content. Here is a detailed explanation of each rating:
Everyone (E): Meaning that games are suitable for all ages of players, as the game doesn’t have objectionable content.
Everyone 10+ (E10+): Games suitable for children aged ten and up, with mild violence or cartoonish violence, some crude humor, and minimal suggestive content.
Teen (T): Games suitable for teenagers aged thirteen and up, with mild to moderate violence, some language, and some suggestive content.
Mature (M): Games suitable for players aged seventeen and up, with intense violence, strong language, sexual content, and/or use of drugs.
Adults Only (AO): Games containing extreme violence, explicit sexual content, and/or gambling, and are only suitable for players aged eighteen and up.
K-A (Kids to Adults): A video game content rating that was used by the Entertainment Software Rating Board (ESRB) in North America from 1994 to 1998. The K-A rating marked video games considered appropriate for players of all ages, before the E rating was introduced in 1998.

Q2. Forming the core data frame that you’ll be working from by reshaping and summarizing the data, storing new data frames in memory where appropriate

#Data frame focused on users' opinions of the games (might be useful)
user_data <- data %>% drop_na(User_Count)

#Summarize important variables
df1 <- data %>% group_by(Platform) %>% summarise(n = n(), mean_score = mean(Critic_Score)) %>% arrange(desc(mean_score))
df1

## # A tibble: 17 × 3
##    Platform     n mean_score
##    <chr>    <int>      <dbl>
##  1 DC          14       87.4
##  2 PC         715       75.9
##  3 XOne       169       73.3
##  4 PS4        252       72.1
##  5 PS         200       71.5
##  6 PSV        120       70.8
##  7 WiiU        90       70.7
##  8 PS3        820       70.4
##  9 XB         725       69.9
## 10 GC         448       69.5
## 11 PS2       1298       68.7
## 12 X360       916       68.6
## 13 PSP        462       67.4
## 14 GBA        438       67.4
## 15 3DS        168       67.1
## 16 DS         717       63.8
## 17 Wii        585       62.8

df2 <- data %>% group_by(Genre) %>% summarise(n = n(), mean_score = mean(Critic_Score)) %>% arrange(desc(mean_score))
df2

## # A tibble: 12 × 3
##    Genre            n mean_score
##    <chr>        <int>      <dbl>
##  1 Role-Playing   737       72.7
##  2 Strategy       302       72.1
##  3 Sports        1194       72.0
##  4 Shooter        944       70.2
##  5 Fighting       409       69.2
##  6 Simulation     352       68.6
##  7 Platform       497       68.1
##  8 Racing         742       68.0
##  9 Puzzle         224       67.4
## 10 Action        1890       66.6
## 11 Misc           523       66.6
## 12 Adventure      323       65.3

df3 <- data %>% group_by(Publisher) %>% summarise(n = n(), mean_score = mean(Critic_Score)) %>% arrange(desc(n))
df3

## # A tibble: 304 × 3
##    Publisher                        n mean_score
##    <chr>                        <int>      <dbl>
##  1 Electronic Arts               1029       74.5
##  2 Activision                     569       69.7
##  3 Ubisoft                        558       68.5
##  4 THQ                            405       66.7
##  5 Sony Computer Entertainment    349       74.0
##  6 Konami Digital Entertainment   328       68.3
##  7 Sega                           319       69.7
##  8 Nintendo                       310       75.5
##  9 Take-Two Interactive           294       75.0
## 10 Namco Bandai Games             279       66.4
## # … with 294 more rows

We can see that df3 contains too many publishers, and some of them have published only a few games but have high mean scores.

Code explanation

We divided the games into three levels (high, medium, and low) by their critic scores. The summary code groups the dataset by the ‘Platform’ column, then calculates the count ‘n’ and the mean ‘Critic_Score’ for each platform. It stores the result in df1, arranged in descending order of mean score, and finally displays df1. df2 and df3 repeat this for ‘Genre’ and ‘Publisher’, with df3 arranged by the number of games instead. I included the ‘Platform’ column out of curiosity, as I want to see which platforms tend to have high-scoring games.

Q3. Cleaning the data according to your purpose for the dataset

#First purpose: See which factors lead to high critic score.
#Maybe we can clean df3?

This step was done in the previous two questions.
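
One possible way to clean the publisher summary, given that many publishers have only a few games, is to keep only publishers above a minimum count. The sketch below uses a made-up table in the same shape as the publisher summary; min_games is a hypothetical cut-off, not a value chosen in this notebook.

```r
library(dplyr)

# Toy publisher summary (all values are made up)
publisher_summary <- data.frame(
  Publisher = c("Big Pub", "Mid Pub", "Tiny Pub"),
  n = c(500, 60, 2),
  mean_score = c(74.5, 69.7, 91.0)
)

# Hypothetical minimum number of games a publisher needs to be kept
min_games <- 50

# Drop small publishers, then rank the rest by mean score
publisher_clean <- publisher_summary %>%
  filter(n >= min_games) %>%
  arrange(desc(mean_score))

publisher_clean
```

A filter like this keeps a publisher with two highly scored games from ranking above publishers with hundreds of games.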

Q4. Plotting distributions of your data

library(ggplot2)
p <- ggplot(data, aes(x=Genre, y=Critic_Score)) + geom_boxplot() + theme(axis.text.x = element_text(angle =90, vjust = 0.5, hjust=1)) 
p

#Convert the release year to a number; non-numeric placeholders become NA and are then dropped
df_year <- data[, c('Year_of_Release', 'Critic_Score')] %>%
  mutate(Year_of_Release = suppressWarnings(as.numeric(Year_of_Release))) %>%
  drop_na(Year_of_Release)
df4 <- df_year %>% group_by(Year_of_Release) %>% summarise(mean_score = mean(Critic_Score)) %>% arrange(Year_of_Release)

#With a numeric x-axis, geom_path() connects the yearly means
q2 <- ggplot(data=df4, aes(x=Year_of_Release, y=mean_score)) + geom_path()+ geom_point()
q2 + theme(axis.text.x = element_text(angle =90, vjust = 0.5, hjust=1))

Now, plot distributions of three levels of games

#Low
p1 <- ggplot(low_score, aes(x=forcats::fct_infreq(Genre))) + geom_bar()
p1

#Medium
p2 <- ggplot(med_score, aes(x=forcats::fct_infreq(Genre))) + geom_bar()
p2

#High
p3 <- ggplot(high_score, aes(x=forcats::fct_infreq(Genre))) + geom_bar()
p3

Code explanation

I plotted a boxplot of Critic Score by Genre, a plot of the mean score of games in each year to see the trend, and three bar plots, one for each level of game, to see the frequency of each genre.

Q5. Calculating basic statistical metrics like central tendency measures

Central tendency refers to the value around which the data tends to cluster. The most common measures of central tendency are mean, median, and mode. Mean is the arithmetic average of the data, median is the middle value when the data is arranged in ascending or descending order, and mode is the value that appears most frequently in the data. These measures help to understand the distribution of the data and can be used to make comparisons between different groups or variables in the dataset.

In order to calculate these measures in R, we can use various built-in functions. One thing to notice is that, since mode is not always well-defined and can be less informative compared to the other two measures, it is not used as frequently.

Another useful measure, although it describes spread rather than central tendency, is the range of the data, which is simply the difference between the maximum and minimum values of the variable. This measure helps to understand the spread of the data and can be used to identify outliers in the dataset.
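
The range and related measures of spread can be computed with base R functions. The short sketch below uses made-up scores rather than the dataset:

```r
# Made-up scores for illustration
scores <- c(13, 60, 71, 79, 98)

range(scores)        # minimum and maximum as a pair
diff(range(scores))  # range as a single number: maximum minus minimum
IQR(scores)          # interquartile range: 3rd quartile minus 1st quartile
sd(scores)           # standard deviation
```

The interquartile range is less sensitive to extreme values than the full range, so comparing the two gives a rough sense of whether outliers are present.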

The following code walks through the process of calculating these measures.

Summary

An easy way to calculate several of these measures at once is the ‘summary()’ function in R.

# use Critic_Score as an example
summary(data$Critic_Score)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   60.00   71.00   68.97   79.00   98.00

The ‘summary()’ function outputs the min, 1st quartile, median, mean, 3rd quartile, and max of the input variable.

Mean

To calculate the mean of a variable, we can use the mean() function in R.

# Calculate the mean of the Global_Sales variable
mean_sales <- mean(data$Global_Sales)

# Print the result
cat("The mean of Global_Sales is:", mean_sales)

## The mean of Global_Sales is: 0.6890353

If we then want to calculate the mean ‘Critic_Score’ for different subsets of the data, as in the previous questions, we can use the ‘group_by()’ and ‘summarise()’ functions from the dplyr package.

# Load the dplyr package
library(dplyr)

# Take the 'data' dataframe and group it by the 'Genre' variable using the pipe operator
data %>% 
  group_by(Genre) %>% 
  # Calculate the mean of the 'Critic_Score' variable for each group of 'Genre'
  summarise(mean_Critic_Score = mean(Critic_Score))

## # A tibble: 12 × 2
##    Genre        mean_Critic_Score
##    <chr>                    <dbl>
##  1 Action                    66.6
##  2 Adventure                 65.3
##  3 Fighting                  69.2
##  4 Misc                      66.6
##  5 Platform                  68.1
##  6 Puzzle                    67.4
##  7 Racing                    68.0
##  8 Role-Playing              72.7
##  9 Shooter                   70.2
## 10 Simulation                68.6
## 11 Sports                    72.0
## 12 Strategy                  72.1

This code creates a new data frame with two columns: ‘Genre’ and ‘mean_Critic_Score’.

Median

To calculate the median of a variable, we can use the median() function in R.

# Calculate the median of the Global_Sales variable
median_sales <- median(data$Global_Sales)

# Print the result
cat("The median of Global_Sales is:", median_sales)

## The median of Global_Sales is: 0.24

Mode

Calculating the mode is different from the steps before, since R has no built-in function for the statistical mode (the built-in mode() reports an object's storage type instead). However, we can write our own function to calculate it:

# Define a function to calculate the mode
get_mode <- function(x) {
  uniqx <- unique(x)
  uniqx[which.max(tabulate(match(x, uniqx)))]
}

# Calculate the mode of the Platform variable
mode_platform <- get_mode(data$Platform)

# Print the result
cat("The mode of Platform is:", mode_platform)

## The mode of Platform is: PS2
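
When two values are tied for the highest frequency, get_mode() returns only the one that appears first in the data. As a hedged sketch, a hypothetical helper get_modes() (not part of the original analysis) can return every tied value instead:

```r
# Return every value tied for the highest frequency, as character strings
get_modes <- function(x) {
  counts <- table(x)
  names(counts)[counts == max(counts)]
}

# Made-up platform labels with a tie between "PS2" and "PC"
platforms <- c("PS2", "PC", "PS2", "PC", "Wii")
get_modes(platforms)
```

Because table() ignores NA values by default, missing entries do not count toward any mode in this sketch.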

Q6. Some light correlational analysis/visualization of key variables.

Correlational Analysis

Correlation analysis is a technique used to examine the relationship between two variables in a dataset. Correlation coefficients measure the strength and direction of that relationship. The most commonly used one is the Pearson correlation coefficient, which measures the linear relationship between two variables. It ranges from -1 to 1, where -1 indicates a perfect negative correlation, 0 indicates no linear correlation, and 1 indicates a perfect positive correlation.

To calculate the Pearson correlation coefficient in R, we can use the ‘cor()’ function. It takes two numeric variables as input and returns the correlation coefficient.

Because the variables may contain missing (NA) values, we set the ‘use’ argument to ‘pairwise.complete.obs’ so that only observations complete for both variables enter the calculation. A variable stored as text must first be converted with as.numeric().

#Take Critic_Score and User_Score as an example
data$User_Score <- as.numeric(data$User_Score)

## Warning: NAs introduced by coercion

cor_result <- cor(data$Critic_Score, data$User_Score, use = "pairwise.complete.obs")
cor_result

## [1] 0.5808778
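
To make the definition concrete, the sketch below computes Pearson's r by hand on a handful of illustrative score pairs (not the full dataset) and compares it with cor(). The NA mimics a game without a user score.

```r
# Illustrative paired scores; the NA mimics a game without a user score
critic <- c(76, 82, 80, 89, 58, 87)
user   <- c(8.0, 8.3, NA, 8.5, 6.6, 8.4)

# Keep only the pairs where both values are present
ok <- complete.cases(critic, user)
x <- critic[ok]
y <- user[ok]

# Pearson's r from its definition: the sum of cross-products of deviations,
# scaled by the square root of the product of the sums of squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# The built-in function with the same missing-value handling
r_builtin <- cor(critic, user, use = "pairwise.complete.obs")

all.equal(r_manual, r_builtin)
```

With only two variables, "pairwise.complete.obs" and "complete.obs" use the same rows; the two options differ only when a correlation matrix of three or more variables is computed.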

Visualization

We can also visualize the relationship between two variables using a scatterplot. Scatterplots help to show the nature of the relationship between two variables. If the points on the scatterplot form a linear pattern, then we can assume a linear relationship between the variables.

This code will create a scatter plot with Critic_Score on the x-axis and User_Score on the y-axis.

library(ggplot2)

# Create a scatter plot
ggplot(data, aes(x = Critic_Score, y = User_Score)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = TRUE) +
  labs(x = "Critic Score", y = "User Score") +
  theme_bw()

## `geom_smooth()` using formula = 'y ~ x'

## Warning: Removed 1120 rows containing non-finite values (`stat_smooth()`).

## Warning: Removed 1120 rows containing missing values (`geom_point()`).

The output is a scatter plot that shows the relationship between critic scores and user scores. We can see that there is a moderate positive correlation between these two variables (r ≈ 0.58), which means that as critic scores increase, user scores also tend to increase.

Q7. A look forward to what further questions the analysis suggests and what it enables

Data analysis is an iterative process that involves constantly asking new questions and refining our understanding of the data. Once we have explored the data and identified some initial insights, we can start to ask more specific questions that help us understand the data in greater detail.

The code we have covered so far aims to shed light on the factors that play a role in determining how highly a video game is scored. Initially, we subset the variables (genre, platform, publisher, and release year) and summarized critic scores across them. We have not yet fitted a regression model; doing so would let us investigate which variables have the most significant impact on scores and sales, and so give further insight into what contributes most to a video game’s success.

Furthermore, we split the games into three groups (low, medium, and high score) based on ranges of their critic scores, so in future analysis we can use these groups to identify patterns in the data and understand the characteristics of the games in each group.

Based on the current analysis, some further questions that could be explored include:

Is there a relationship between a game’s platform and its score? To be more specific, are there any significant differences in ‘Critic_Score’ between different ‘Platform’ categories? We could use a hypothesis test such as ANOVA (or t-tests for pairs of platforms) to explore this question.
Do more recent games generally have better scores? To be more specific, does the ‘Year_of_Release’ of a game have an impact on its ‘Critic_Score’? We could use a linear regression model to explore this question.
For the other variables, such as ‘NA_Sales’, ‘EU_Sales’, ‘JP_Sales’, and ‘Other_Sales’, can we find any relationship or trend? We could use visualizations such as line graphs to explore this question.
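
The first two questions could be approached along the following lines. This is only a sketch on simulated data (the sim data frame and its effect sizes are invented, not taken from the dataset), using base R's aov() for a one-way ANOVA and lm() for a simple linear regression:

```r
set.seed(1)

# Simulated stand-in for the games data (not the real dataset)
sim <- data.frame(
  Platform = rep(c("PC", "PS2", "Wii"), each = 30),
  Year_of_Release = sample(2000:2015, 90, replace = TRUE)
)
# Invented effect: PC games score 5 points higher on average, plus noise
sim$Critic_Score <- 70 + ifelse(sim$Platform == "PC", 5, 0) + rnorm(90, sd = 10)

# One-way ANOVA: does mean Critic_Score differ across Platform levels?
anova_fit <- aov(Critic_Score ~ Platform, data = sim)
summary(anova_fit)

# Simple linear regression: is Critic_Score associated with release year?
lm_fit <- lm(Critic_Score ~ Year_of_Release, data = sim)
coef(summary(lm_fit))
```

On the real data, the same two calls would take the cleaned data frame in place of sim, with Year_of_Release converted to a number beforehand.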

By analyzing these questions, we will be able to draw some preliminary conclusions about the data. If there are many highly rated games in earlier years, such as 2010, we can infer that a high rating is not based solely on hardware advancement or graphics, but also on the gameplay.

Furthermore, if we know that a platform accounts for many of the games in the high-score group, then we can infer that this platform tends to host well-made games, which could serve as a recommendation for consumers to buy games on this platform.