Video Game Dataset Link: https://www.kaggle.com/rush4ratio/video-game-sales-with-ratings

# load all necessary packages
library(boot)
library(readr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ dplyr   1.0.5
## ✓ tibble  3.1.0     ✓ stringr 1.4.0
## ✓ tidyr   1.1.3     ✓ forcats 0.5.1
## ✓ purrr   0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
# load the data
videogames <- read_csv("/Users/tracylam/Desktop/school/STAT 167/Final Project/Video_Games_Sales.csv")
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   Name = col_character(),
##   Platform = col_character(),
##   Year_of_Release = col_character(),
##   Genre = col_character(),
##   Publisher = col_character(),
##   NA_Sales = col_double(),
##   EU_Sales = col_double(),
##   JP_Sales = col_double(),
##   Other_Sales = col_double(),
##   Global_Sales = col_double(),
##   Critic_Score = col_double(),
##   Critic_Count = col_double(),
##   User_Score = col_character(),
##   User_Count = col_double(),
##   Developer = col_character(),
##   Rating = col_character()
## )
#Cleaning the data
# Setting data to view total number of NA observations 
all_NA <- videogames %>% filter(is.na(Critic_Score)| is.na(Name)| is.na(Critic_Count)| is.na(Genre)| is.na(User_Score)| is.na(User_Count)| is.na(Developer)| is.na(Rating)| User_Score == "tbd"|Year_of_Release == "N/A")    # 9893 rows with NA 
# Setting data to contain no NA values 
videogames1 <- 
  videogames %>%
  filter(!is.na(Name), !is.na(Platform), !is.na(Year_of_Release), !is.na(Genre), !is.na(Publisher), !is.na(NA_Sales), !is.na(EU_Sales), !is.na(JP_Sales), !is.na(Other_Sales), !is.na(Global_Sales), !is.na(Critic_Score), !is.na(User_Score), User_Score != "tbd", !is.na(Rating), !is.na(Developer), Year_of_Release != "N/A")
# Converting columns to become double due to having numeric values 
videogames1$User_Score <- as.double(videogames1$User_Score)
videogames1$Year_of_Release <- as.double(videogames1$Year_of_Release)
videogames1
## # A tibble: 6,826 x 16
##    Name     Platform Year_of_Release Genre Publisher  NA_Sales EU_Sales JP_Sales
##    <chr>    <chr>              <dbl> <chr> <chr>         <dbl>    <dbl>    <dbl>
##  1 Wii Spo… Wii                 2006 Spor… Nintendo      41.4     29.0      3.77
##  2 Mario K… Wii                 2008 Raci… Nintendo      15.7     12.8      3.79
##  3 Wii Spo… Wii                 2009 Spor… Nintendo      15.6     10.9      3.28
##  4 New Sup… DS                  2006 Plat… Nintendo      11.3      9.14     6.5 
##  5 Wii Play Wii                 2006 Misc  Nintendo      14.0      9.18     2.93
##  6 New Sup… Wii                 2009 Plat… Nintendo      14.4      6.94     4.7 
##  7 Mario K… DS                  2005 Raci… Nintendo       9.71     7.47     4.13
##  8 Wii Fit  Wii                 2007 Spor… Nintendo       8.92     8.03     3.6 
##  9 Kinect … X360                2010 Misc  Microsoft…    15        4.89     0.24
## 10 Wii Fit… Wii                 2009 Spor… Nintendo       9.01     8.49     2.53
## # … with 6,816 more rows, and 8 more variables: Other_Sales <dbl>,
## #   Global_Sales <dbl>, Critic_Score <dbl>, Critic_Count <dbl>,
## #   User_Score <dbl>, User_Count <dbl>, Developer <chr>, Rating <chr>

Question: Which video games are different audiences least attracted to? (Kids? Adults? Teens?)

Splitting Data Set into Ratings

We grouped the games by rating to define our audiences. Because there was too little data for the EC rating, we removed it from our observations. We paired K-A with the E-rated games because K-A (Kids to Adults) was the name of the E rating before 1998. We also grouped AO games with Mature games because they target a similar age range. With these conditions, we looked at the audience in four major categories:

  • E for Everyone: young kids and family-friendly content
  • E10 for Everyone 10+: kids and anyone aged 10 or older
  • T for Teen: teenagers
  • M for Mature: anyone aged 17 and up
#makes sure user_score is numeric
videogames1 <- transform(videogames1, User_Score = as.numeric(User_Score))
#Filter the games into their own respective ratings
#Everyone Games
E_Rated_Games <- filter(videogames1, Rating == "E" | Rating == "K-A")
#E10 Games
E10_Rated_Games <- filter(videogames1, Rating == "E10+")
#T Games
T_Rated_Games <- filter(videogames1, Rating == "T")
#M Games
M_Rated_Games <- filter(videogames1, Rating == "M" | Rating == "AO")
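
The four filters above can also be written as a single lookup that adds an audience column and keeps all games in one table. This is a minimal sketch, not the code used for the results in this report; the function name `audience_of` and the column name `Audience` are hypothetical, and ratings outside the four groups (such as EC) map to `NA`.

```r
# Hypothetical helper: map raw ESRB ratings to the four audience groups
# used in this analysis. Ratings not listed (for example "EC") become NA.
audience_of <- function(rating) {
  lookup <- c("E" = "E", "K-A" = "E", "E10+" = "E10",
              "T" = "T", "M" = "M", "AO" = "M")
  unname(lookup[as.character(rating)])
}

# Usage with the cleaned data (assumes videogames1 from the steps above):
# videogames1$Audience <- audience_of(videogames1$Rating)
# table(videogames1$Audience, useNA = "ifany")
```

A single column like this would make it possible to summarise or plot all four audiences in one call instead of repeating the code per tibble.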

To visualize the data, we plotted global sales across genres for each rating. (Only the code for the Everyone games is shown; the code for the other ratings is hidden in the HTML because it is identical apart from the rating tibble. The results are still provided.)

#For E Rated Games
ggplot(data = E_Rated_Games) + # plot the E-rated games
  geom_point(mapping = aes(x = Genre, y = Global_Sales, color = Genre), position = "jitter") +
  ggtitle("E Rated Games Global Sales Across Genres") +
  theme(axis.text.x = element_text(size=10, angle=90))

As the plots show, each rating had an abundance of games in specific genres, and the genres with the most games typically also had the highest global sales. For instance, among E-rated games, sports and racing had many titles and sold the best. For E10-rated games, many genres sold well despite having fewer titles than action or misc. Teen games showed a similar spread to E10, with many genres doing well. Among Mature games, action, role-playing, and shooter games appear to do best, while the other genres perform moderately.

Global Sales for Each Rating

To gauge how attractive a game is to a particular audience, we looked at global sales. We took a summary of global sales and used the 25th percentile (the first quartile) as the cutoff for games that performed poorly in terms of sales. We also filtered out any NAs so the data would be more accurate, and we repeated this process for each rating. (Once again, the code for the other ratings is hidden because it mirrors the code for the Everyone rating; the results are shown.)
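
The cutoff can also be computed directly rather than copied from the `summary()` output. The sketch below is one way to do that, not the code used for the results here; `low_sales_by_genre` is a hypothetical helper name, and it assumes a rating tibble with `Global_Sales` and `Genre` columns. `quantile()` with its default settings gives the first quartile that `summary()` reports, before rounding for display.

```r
# Hypothetical helper: keep the games at or below the p-th quantile of
# Global_Sales and count them per genre, most frequent first.
low_sales_by_genre <- function(games, p = 0.25) {
  cutoff <- quantile(games$Global_Sales, probs = p, na.rm = TRUE)
  low <- games[!is.na(games$Global_Sales) & games$Global_Sales <= cutoff, ]
  counts <- as.data.frame(table(Genre = low$Genre), stringsAsFactors = FALSE)
  names(counts)[2] <- "amount"
  counts[order(-counts$amount), ]
}

# Usage (assumes the rating tibbles defined earlier):
# low_sales_by_genre(E_Rated_Games)
# low_sales_by_genre(M_Rated_Games)
```

Passing the tibble as an argument means each rating is filtered against its own quartile, which avoids pairing one rating's cutoff with another rating's games.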

E (Everyone) Rating

#counts how many games each genre has within the E rating
E_Rated_Games %>% count(Genre) %>% arrange(n)
##           Genre   n
## 1      Fighting   6
## 2       Shooter  22
## 3      Strategy  42
## 4     Adventure  49
## 5  Role-Playing  73
## 6        Puzzle  89
## 7    Simulation 103
## 8          Misc 164
## 9        Action 191
## 10     Platform 240
## 11       Racing 350
## 12       Sports 754
#Use summary to find the 25th percentile, the cutoff for games that sold poorly
summary(E_Rated_Games$Global_Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.1300  0.3300  0.9416  0.8900 82.5300
#Creates a new tibble of the games that performed poorly in sales
ELGS <- filter(E_Rated_Games, Global_Sales <= 0.1300)
ELGS <- ELGS %>% select(Name, Genre, Global_Sales) %>% arrange(Global_Sales)
head(ELGS)
##                        Name      Genre Global_Sales
## 1 Space Invaders Revolution    Shooter         0.01
## 2 Valentino Rossi: The Game     Racing         0.01
## 3       Ship Simulator 2008 Simulation         0.01
## 4                  NBA 2K17     Sports         0.01
## 5    Don Bradman Cricket 14     Sports         0.01
## 6     Rayman Raving Rabbids       Misc         0.01
#groups by genre and counts the amount in each genre
ELGS %>% group_by(Genre) %>% summarise(amount = n()) %>% arrange(desc(amount))
## # A tibble: 12 x 2
##    Genre        amount
##    <chr>         <int>
##  1 Sports          145
##  2 Racing          106
##  3 Platform         61
##  4 Action           48
##  5 Puzzle           42
##  6 Misc             39
##  7 Simulation       30
##  8 Role-Playing     20
##  9 Strategy         20
## 10 Shooter          16
## 11 Adventure        15
## 12 Fighting          3

E10 (Everyone 10 and up) Rating

##           Genre   n
## 1      Fighting  14
## 2        Puzzle  24
## 3    Simulation  28
## 4     Adventure  32
## 5       Shooter  34
## 6      Strategy  62
## 7        Sports  74
## 8        Racing  79
## 9          Misc  80
## 10 Role-Playing 100
## 11     Platform 104
## 12       Action 299
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.1100  0.2900  0.5812  0.6175 10.1200
##                        Name     Genre Global_Sales
## 1   Super Stardust Ultra VR   Shooter         0.01
## 2     Azure Striker Gunvolt    Action         0.01
## 3                Worms: WMD    Action         0.01
## 4        Doki-Doki Universe Adventure         0.01
## 5 Spider-Man: Friend or Foe    Action         0.01
## 6                    Dokuro    Action         0.01
## # A tibble: 12 x 2
##    Genre        amount
##    <chr>         <int>
##  1 Action           69
##  2 Strategy         32
##  3 Role-Playing     29
##  4 Platform         25
##  5 Racing           15
##  6 Adventure        14
##  7 Shooter          14
##  8 Misc             12
##  9 Puzzle           10
## 10 Sports            9
## 11 Simulation        7
## 12 Fighting          6

T (Teen) Rating

##           Genre   n
## 1        Puzzle   5
## 2      Platform  56
## 3     Adventure  82
## 4        Sports 104
## 5          Misc 129
## 6        Racing 135
## 7      Strategy 139
## 8    Simulation 161
## 9       Shooter 285
## 10     Fighting 313
## 11 Role-Playing 387
## 12       Action 582
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.1000  0.2500  0.5793  0.6300 12.8400
##                                         Name        Genre Global_Sales
## 1                                Blackthorne       Action         0.01
## 2                                   Contrast     Platform         0.01
## 3                                    Sengoku     Strategy         0.01
## 4                       Guild Wars: Factions Role-Playing         0.01
## 5                Codename: Panzers Phase Two     Strategy         0.01
## 6 Neverwinter Nights 2: Mask of the Betrayer Role-Playing         0.01
## # A tibble: 12 x 2
##    Genre        amount
##    <chr>         <int>
##  1 Action          140
##  2 Role-Playing     98
##  3 Strategy         84
##  4 Shooter          75
##  5 Fighting         50
##  6 Simulation       50
##  7 Adventure        44
##  8 Racing           32
##  9 Misc             19
## 10 Platform         17
## 11 Sports           16
## 12 Puzzle            2

M (Mature 17 and up) Rating

##           Genre   n
## 1      Platform   3
## 2    Simulation   5
## 3          Misc  11
## 4        Sports  11
## 5        Racing  17
## 6      Strategy  24
## 7      Fighting  45
## 8     Adventure  85
## 9  Role-Playing 152
## 10      Shooter 523
## 11       Action 558
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.1200  0.3200  0.9956  0.9575 21.0400
##                        Name     Genre Global_Sales
## 1   Super Stardust Ultra VR   Shooter         0.01
## 2     Azure Striker Gunvolt    Action         0.01
## 3                Worms: WMD    Action         0.01
## 4        Doki-Doki Universe Adventure         0.01
## 5 Spider-Man: Friend or Foe    Action         0.01
## 6                    Dokuro    Action         0.01
## # A tibble: 12 x 2
##    Genre        amount
##    <chr>         <int>
##  1 Action           75
##  2 Strategy         33
##  3 Role-Playing     31
##  4 Platform         27
##  5 Racing           16
##  6 Shooter          16
##  7 Adventure        15
##  8 Misc             13
##  9 Puzzle           12
## 10 Sports           10
## 11 Simulation        7
## 12 Fighting          6

Across all of these ratings, the genres with the most games are also usually the ones with the most games that sold poorly. One explanation is the sheer number of games in those genres. For instance, although some sports games sold very well among E-rated games, not every sports game for everyone will do well. These popular genres may be crowded because companies saw them doing well for specific age groups and released their own titles, which does not always end in a good result.

Although global sales can tell us how well a game is received by its audience, our data makes it difficult to use sales as the only factor: the dataset rounds the sales figures, so many games share the same value. With this in mind, we next look at critic and user scores as another way to see how a game performs with different audiences.

Critic and User Score with Each Major Rating

In this part, we look at the average critic score and user score within each major rating to see whether games are scored differently across ratings. Before doing so, we keep only games with at least 30 critic reviews and at least 30 user reviews so that a fair number of people have given their opinions. I also divided the critic score by 10 to put it on the same scale as the user score for the plot. (The code for the other ratings is hidden in the HTML because it matches the E-rated code, just with the tibble for the respective rating.)

#E Rated
E_Rated_SC <- filter (E_Rated_Games, Critic_Count >= 30, User_Count >= 30)
mean(E_Rated_SC$Critic_Score/10)
## [1] 7.996542
mean(E_Rated_SC$User_Score)
## [1] 7.524207
## [1] 7.764904
## [1] 7.463462
## [1] 7.657666
## [1] 7.578426
## [1] 7.740964
## [1] 7.402811
#Created data frame with the average critic and user score for each rating
Ratings_SI <- data.frame("Rating" = c("E", "E", "E10", "E10", "T", "T", "M", "M"), 
                         "Scores" = c(7.996542, 7.524207, 7.764904, 7.463462, 7.657666, 7.578426, 7.740964, 7.402811),
                         "Type" = c("Critic", "User", "Critic", "User", "Critic", "User", "Critic", "User"))
#plot the results to see how the critic and user scores differ from each other
#red is user scores while blue is critic score
ggplot(data = Ratings_SI) +
  geom_bar(aes(x = Rating, y = Scores, fill = Type == "Critic"), stat = "identity") +
  theme(legend.position = "none") +
  ggtitle("Average Critic and User Scores Among Ratings")
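
The averages in `Ratings_SI` are typed in by hand, so they can drift out of sync with the data. As a sketch of how the same data frame could be assembled from the rating tibbles instead, the hypothetical helper `mean_scores` below applies the review-count filter and returns the two averages for one rating. It assumes the column names of this dataset and is not the code that produced the plot.

```r
# Hypothetical helper: average critic score (rescaled to 0-10) and user score
# for one rating tibble, keeping games with enough critic and user reviews.
mean_scores <- function(games, label, min_count = 30) {
  kept <- games[!is.na(games$Critic_Count) & games$Critic_Count >= min_count &
                !is.na(games$User_Count) & games$User_Count >= min_count, ]
  data.frame(Rating = label,
             Scores = c(mean(kept$Critic_Score / 10), mean(kept$User_Score)),
             Type = c("Critic", "User"))
}

# Usage (assumes the four rating tibbles defined earlier):
# Ratings_SI <- rbind(mean_scores(E_Rated_Games, "E"),
#                     mean_scores(E10_Rated_Games, "E10"),
#                     mean_scores(T_Rated_Games, "T"),
#                     mean_scores(M_Rated_Games, "M"))
```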

As the plot shows, the average critic and user scores vary little across ratings. Critic and user scores may therefore not be the best way to determine the least attractive games for the different major audiences. It may also indicate that rating matters little when critics or users score a game, which we will see again in the potential models. I looked at the means because critic and user scores follow a roughly normal distribution, unlike sales.

Global Sales, Critic Score, and User Score with Each Major Rating

In this section, we combine the three factors that can indicate whether a video game is well liked by its target audience. By taking the games whose global sales, critic score, and user score all fall below the 25th percentile, we can list the games that do poorly on every measure. We also counted the genres that appear most often among these games and saw that action seems to be the least popular among these audiences, which matches what we observed previously when we focused on genre specifically.

The following genres did the worst in terms of global sales, critic score, and user score, based on the counts and plots below:

  • E-Rated: Sports, Platform, Racing, Action, and Misc
  • E10-Rated: Action and Platform
  • T-Rated: Action, Shooter, and Fighting
  • M-Rated: Shooter and Action

As we can see, action is common across the ratings as one of the least attractive genres, which supports our earlier claim that action is the least popular genre. The games listed for each rating are restricted to the action and shooter genres, which are among the least popular. (I hid the plotting code for the other ratings in the HTML because it matches the code for E-rated games, just with the respective tibbles.)
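
The combined filter described in this section can be sketched as a function that applies the lower-quartile test to several columns at once. `below_all_quantiles` is a hypothetical name and the default column list is an assumption based on this dataset; the results shown in this report come from the hard-coded cutoffs instead.

```r
# Hypothetical helper: keep the games that fall strictly below the p-th
# quantile on every listed metric at once.
below_all_quantiles <- function(games,
                                metrics = c("Global_Sales", "Critic_Score", "User_Score"),
                                p = 0.25) {
  flags <- lapply(metrics, function(m) {
    games[[m]] < quantile(games[[m]], probs = p, na.rm = TRUE)
  })
  keep <- Reduce(`&`, flags)       # TRUE only where every metric is low
  games[!is.na(keep) & keep, ]
}

# Usage (assumes the rating tibbles defined earlier):
# worst_T <- below_all_quantiles(T_Rated_Games)
# table(worst_T$Genre)
```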

E for Everyone:

#Gets summaries of global sales, critic score, and user score to obtain the lower 25% 
summary(E_Rated_Games$Global_Sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.1300  0.3300  0.9416  0.8900 82.5300
summary(E_Rated_Games$Critic_Score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.00   63.00   73.00   70.69   81.00   97.00
summary(E_Rated_Games$User_Score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.500   6.400   7.500   7.162   8.300   9.600
#filters to get the games that fit all the worst info and arranges them as needed
LSGE <- filter(E_Rated_Games, Global_Sales < 0.1300, Critic_Score < 63.00, User_Score < 6.400)
LSGE <- select(LSGE, Name, Genre, Global_Sales, Critic_Score, User_Score)
LSGE <- arrange(LSGE, Global_Sales, Critic_Score, User_Score)
#gives information about what genres these least attractive games are for this audience
LSGEG <- LSGE %>% group_by(Genre) %>% summarise(amount = n()) %>% arrange(desc(amount))
LSGEG
## # A tibble: 11 x 2
##    Genre      amount
##    <chr>       <int>
##  1 Sports         19
##  2 Platform       16
##  3 Racing         16
##  4 Action         12
##  5 Misc           10
##  6 Simulation      7
##  7 Adventure       6
##  8 Puzzle          6
##  9 Strategy        4
## 10 Fighting        2
## 11 Shooter         1
#plots the info about genre to get a visual
ggplot(data = LSGEG) +
  geom_bar(aes(x = Genre, y = amount, fill = Genre), stat = "identity") +
  ggtitle("Least Popular Genres of E Rated Games") +
  theme(axis.text.x = element_text(size=10, angle=90))

LSGEA <- LSGE %>% filter(Genre == "Action" | Genre == "Shooter")
#top part of these least attractive games
head(LSGEA)
##                                          Name   Genre Global_Sales Critic_Score
## 1                  E.T. The Extra-Terrestrial  Action         0.01           46
## 2                   Space Invaders Revolution Shooter         0.01           49
## 3                              Turn It Around  Action         0.03           39
## 4                               Dance Factory  Action         0.03           56
## 5 Gravity Falls: Legend of the Gnome Gemulets  Action         0.04           46
## 6                          Beyblade Evolution  Action         0.04           49
##   User_Score
## 1        2.4
## 2        5.2
## 3        4.5
## 4        5.8
## 5        5.8
## 6        6.3

E10 for Everyone 10+:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.1100  0.2900  0.5812  0.6175 10.1200
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   19.00   60.00   70.00   68.23   78.00   95.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   6.200   7.200   6.903   8.000   9.300
## # A tibble: 11 x 2
##    Genre        amount
##    <chr>         <int>
##  1 Action           14
##  2 Platform          8
##  3 Misc              5
##  4 Role-Playing      5
##  5 Strategy          4
##  6 Adventure         3
##  7 Fighting          3
##  8 Puzzle            3
##  9 Shooter           3
## 10 Simulation        2
## 11 Sports            2

##                                Name   Genre Global_Sales Critic_Score
## 1                Super Dungeon Bros  Action         0.01           42
## 2         Spider-Man: Friend or Foe  Action         0.01           57
## 3      Tenkai Knights: Brave Battle  Action         0.02           26
## 4      Invizimals: The Lost Kingdom  Action         0.02           50
## 5 Touhou Genso Rondo: Bullet Ballet Shooter         0.02           55
## 6                       Nacho Libre  Action         0.03           56
##   User_Score
## 1        2.3
## 2        5.3
## 3        2.4
## 4        5.0
## 5        4.7
## 6        5.8

T for Teens:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.1000  0.2500  0.5793  0.6300 12.8400
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   21.00   62.00   72.00   69.67   79.00   98.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.50    6.70    7.60    7.32    8.30    9.50
## # A tibble: 12 x 2
##    Genre        amount
##    <chr>         <int>
##  1 Action           23
##  2 Shooter          18
##  3 Fighting         13
##  4 Role-Playing      8
##  5 Simulation        7
##  6 Racing            6
##  7 Adventure         4
##  8 Sports            4
##  9 Misc              2
## 10 Platform          2
## 11 Puzzle            2
## 12 Strategy          2

##                                         Name   Genre Global_Sales Critic_Score
## 1               Aquaman: Battle for Atlantis  Action         0.01           27
## 3                Army Men: Major Malfunction Shooter         0.02           36
## 4       Raven Squad: Operation Hidden Dagger Shooter         0.02           39
## 5                            Lost: Via Domus  Action         0.02           52
## 6 Knight's Apprentice: Memorick's Adventures  Action         0.02           53
## 7                                  Dark Void  Action         0.02           57
##   User_Score
## 1        3.1
## 3        4.0
## 4        3.1
## 5        6.0
## 6        4.5
## 7        5.1

M for Mature Audiences 17+:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0100  0.1200  0.3200  0.9956  0.9575 21.0400
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   13.00   63.00   75.00   71.98   83.00   98.00
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   6.500   7.600   7.178   8.200   9.400
## # A tibble: 10 x 2
##    Genre        amount
##    <chr>         <int>
##  1 Shooter          31
##  2 Action           25
##  3 Role-Playing      8
##  4 Fighting          6
##  5 Racing            6
##  6 Adventure         4
##  7 Misc              1
##  8 Simulation        1
##  9 Sports            1
## 10 Strategy          1

##                          Name   Genre Global_Sales Critic_Score User_Score
## 1     Carmageddon: Max Damage  Action         0.01           51        5.5
## 2 Prototype: Biohazard Bundle  Action         0.01           56        3.1
## 3        Conflict: Denied Ops Shooter         0.01           58        4.8
## 4                     RoboCop Shooter         0.02           30        3.6
## 5            Chicago Enforcer Shooter         0.02           33        2.8
## 6     Final Fight: Streetwise  Action         0.02           42        4.0

Potential Models with Rating as a Predictor

Since we were looking at ratings specifically for one of our objective questions, we decided to see whether we could build models that use rating as a predictor for global sales, critic score, or user score. For these models, the adjusted \(R^2\) is very low, which indicates that the model explains little of the variation in the response. Based on the p-values below, no individual rating coefficient is significant at the 0.05 level in any model, although the overall F-tests are significant. In the model called ‘lm.fit4’, critic score and user score are both significant predictors of global sales. Even so, that model still has a low adjusted \(R^2\), which indicates that it is not a good model.

#model for Global Sales with predictor Ratings
lm.fit1 <- lm(Global_Sales ~ Rating, videogames1)
lm.sum1 <- summary(lm.fit1)
lm.sum1
## 
## Call:
## lm(formula = Global_Sales ~ Rating, data = videogames1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.985 -0.651 -0.419  0.001 81.589 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)    1.950      1.955   0.998    0.318
## RatingE       -1.009      1.955  -0.516    0.606
## RatingE10+    -1.369      1.956  -0.700    0.484
## RatingK-A     -0.030      2.764  -0.011    0.991
## RatingM       -0.955      1.955  -0.488    0.625
## RatingRP      -1.920      2.764  -0.695    0.487
## RatingT       -1.371      1.955  -0.701    0.483
## 
## Residual standard error: 1.955 on 6819 degrees of freedom
## Multiple R-squared:  0.009733,   Adjusted R-squared:  0.008862 
## F-statistic: 11.17 on 6 and 6819 DF,  p-value: 1.927e-12
#model for Critic Score with predictor Ratings
lm.fit2 <- lm(Critic_Score ~ Rating, videogames1)
lm.sum2 <- summary(lm.fit2)
lm.sum2
## 
## Call:
## lm(formula = Critic_Score ~ Rating, data = videogames1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -58.964  -7.964   2.321  10.321  28.330 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    93.00      13.82   6.728 1.86e-11 ***
## RatingE       -22.32      13.83  -1.614   0.1065    
## RatingE10+    -24.77      13.83  -1.791   0.0733 .  
## RatingK-A      -1.00      19.55  -0.051   0.9592    
## RatingM       -21.04      13.83  -1.521   0.1282    
## RatingRP      -30.00      19.55  -1.535   0.1249    
## RatingT       -23.33      13.82  -1.687   0.0916 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.82 on 6819 degrees of freedom
## Multiple R-squared:  0.0078, Adjusted R-squared:  0.006927 
## F-statistic: 8.934 on 6 and 6819 DF,  p-value: 9.692e-10
#model for User Score and Ratings
videogames1 <- transform(videogames1, User_Score = as.numeric(User_Score))
lm.fit3 <- lm(User_Score ~ Rating, videogames1)
lm.sum3 <- summary(lm.fit3)
lm.sum3
## 
## Call:
## lm(formula = User_Score ~ Rating, data = videogames1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.8203 -0.7203  0.3384  1.0384  2.4384 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    8.600      1.435   5.992 2.18e-09 ***
## RatingE       -1.438      1.436  -1.002    0.316    
## RatingE10+    -1.697      1.436  -1.182    0.237    
## RatingK-A     -1.200      2.030  -0.591    0.554    
## RatingM       -1.423      1.436  -0.991    0.322    
## RatingRP      -1.800      2.030  -0.887    0.375    
## RatingT       -1.280      1.436  -0.891    0.373    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.435 on 6819 degrees of freedom
## Multiple R-squared:  0.00853,    Adjusted R-squared:  0.007658 
## F-statistic: 9.778 on 6 and 6819 DF,  p-value: 9.361e-11
#model for Global Sales with Ratings, Critic_Score, and User_Score
lm.fit4 <- lm(Global_Sales ~ Rating + Critic_Score + User_Score, videogames1)
lm.sum4 <- summary(lm.fit4)
lm.sum4
## 
## Call:
## lm(formula = Global_Sales ~ Rating + Critic_Score + User_Score, 
##     data = videogames1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -1.960 -0.671 -0.297  0.188 81.462 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.827736   1.905562  -0.434    0.664    
## RatingE      -0.284086   1.899190  -0.150    0.881    
## RatingE10+   -0.573713   1.899841  -0.302    0.763    
## RatingK-A    -0.102918   2.684795  -0.038    0.969    
## RatingM      -0.278313   1.899357  -0.147    0.884    
## RatingRP     -0.933544   2.685164  -0.348    0.728    
## RatingT      -0.592475   1.899167  -0.312    0.755    
## Critic_Score  0.038450   0.002047  18.786  < 2e-16 ***
## User_Score   -0.092807   0.019712  -4.708 2.55e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.898 on 6817 degrees of freedom
## Multiple R-squared:  0.06616,    Adjusted R-squared:  0.06506 
## F-statistic: 60.37 on 8 and 6817 DF,  p-value: < 2.2e-16
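
In the summaries above, each rating coefficient is a comparison against the reference level AO, the one rating missing from the coefficient tables. The intercept's standard error equals the residual standard error in each model, which is consistent with that level containing only one game and would explain the very wide standard errors. As a sketch under those observations, the models could be refit on the four grouped audiences with E as the reference level. The helper name `fit_by_audience` and the `Audience` column are hypothetical, and this is not the model reported above.

```r
# Hypothetical helper: refit a rating model on the four grouped audiences,
# with "E" as the reference level instead of the alphabetical default.
fit_by_audience <- function(games, response = "Global_Sales") {
  lookup <- c("E" = "E", "K-A" = "E", "E10+" = "E10",
              "T" = "T", "M" = "M", "AO" = "M")
  games$Audience <- factor(unname(lookup[as.character(games$Rating)]),
                           levels = c("E", "E10", "T", "M"))
  # Rows whose rating is outside the four groups get NA and are dropped by lm()
  lm(reformulate("Audience", response = response), data = games)
}

# Usage (assumes videogames1 from the cleaning step):
# summary(fit_by_audience(videogames1))
# anova(fit_by_audience(videogames1))   # overall F-test for Audience
```

With E as the reference, each coefficient reads as the difference in mean response between that audience and E-rated games.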

To evaluate the models further, we used the validation set approach, which splits the data into a training set and a testing set. After fitting the models, we look at their adjusted \(R^2\) values and MSE to see whether the models are usable.

Model Validation Set Approach for Potential Models:

dim(videogames1)  #dimensions of dataset
## [1] 6826   16
#split data 50/50 into training and test set
set.seed(167) #so analysis is reproducible
train.idx <- sample(nrow(videogames1), nrow(videogames1) %/% 2) #random sample of row indices for the training set
train <- videogames1[train.idx, ] #training set
test <- videogames1[-train.idx, ] #validation/test set
#Global sales~ratings training model
lm.train.gr <- lm(Global_Sales ~ Rating, train)
#Obtaining MSE and adjusted R^2 for training set
mean(lm.train.gr$residuals^2)
## [1] 2.457608
summary(lm.train.gr)$adj.r.squared
## [1] 0.009812566
#Global sales~ratings testing model
lm.test.gr <- lm(Global_Sales ~ Rating, test)
#Obtaining MSE and adjusted R^2 for testing set
mean(lm.test.gr$residuals^2)
## [1] 5.163233
summary(lm.test.gr)$adj.r.squared
## [1] 0.008262002
#Critic_Score~rating training model
lm.train.cr <- lm(Critic_Score ~ Rating, train)
#Obtaining MSE and adjusted R^2 for training set
mean(lm.train.cr$residuals^2)
## [1] 194.8205
summary(lm.train.cr)$adj.r.squared
## [1] 0.008314343
#Critic_Score~rating testing model
lm.test.cr <- lm(Critic_Score ~ Rating, test)
#Obtaining MSE and adjusted R^2 for testing set
mean(lm.test.cr$residuals^2)
## [1] 186.8181
summary(lm.test.cr)$adj.r.squared
## [1] 0.005090109
train <- transform(train, User_Score = as.numeric(User_Score))
test <- transform(test, User_Score = as.numeric(User_Score))
#User_Score~rating training model
lm.train.ur <- lm(User_Score ~ Rating, train)
#Obtaining MSE and adjusted R^2 for training set
mean(lm.train.ur$residuals^2)
## [1] 2.091849
summary(lm.train.ur)$adj.r.squared
## [1] 0.005982134
#User_Score~rating testing model
lm.test.ur <- lm(User_Score ~ Rating, test)
#Obtaining MSE and adjusted R^2 for testing set
mean(lm.test.ur$residuals^2)
## [1] 2.023327
summary(lm.test.ur)$adj.r.squared
## [1] 0.008659012
#Global sales~ratings+critic score+user score training model
lm.train.grcu <- lm(Global_Sales ~ Rating + Critic_Score + User_Score, train)
#Obtaining MSE and adjusted R^2 for training set
mean(lm.train.grcu$residuals^2)
## [1] 2.263025
summary(lm.train.grcu)$adj.r.squared
## [1] 0.08767471
#Global sales~ratings+critic score+user score testing model
lm.test.grcu <- lm(Global_Sales ~ Rating + Critic_Score + User_Score, test)
#Obtaining MSE and adjusted R^2 for testing set
mean(lm.test.grcu$residuals^2)
## [1] 4.918558
summary(lm.test.grcu)$adj.r.squared
## [1] 0.05470487

As seen above, the adjusted \(R^2\) values for the models fit on the testing set are very low, which is unfavorable. The model that predicts critic score from rating has the largest MSE (about 187 to 195), but critic score is on a 0 to 100 scale, so its MSE is not directly comparable with the MSE of the sales models; its adjusted \(R^2\) is among the lowest. If we were to choose the best model out of all of these, it would be the last one, which uses critic score, user score, and rating as predictors for global sales, although it is still not a good model. Because the adjusted \(R^2\) is low on both the training and testing sets, the models look underfit rather than overfit: the predictors explain very little of the variation. These models also tell us that rating may not be an important factor in predicting any of the variables used in the models above.
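
The "testing" figures above come from models refit on the held-out half, so they measure in-sample fit on that half rather than out-of-sample error. A more conventional validation-set estimate fits on the training half only and scores its predictions on the test half. The sketch below shows one way to do that; `validation_mse` is a hypothetical helper, it assumes the data has a `Rating` column as in `videogames1`, and its output would differ from the numbers printed above.

```r
# Hypothetical helper: fit on the training half, then score the held-out half.
validation_mse <- function(formula, data, seed = 167) {
  set.seed(seed)
  n <- nrow(data)
  train_idx <- sample(n, n %/% 2)        # indices are bounded by nrow(data)
  train <- data[train_idx, ]
  test <- data[-train_idx, ]
  # Drop test rows whose Rating never appears in training; predict() would
  # fail on an unseen factor level.
  test <- test[test$Rating %in% unique(train$Rating), ]
  fit <- lm(formula, data = train)
  response <- all.vars(formula)[1]
  c(train_mse = mean(residuals(fit)^2),
    test_mse = mean((test[[response]] - predict(fit, newdata = test))^2,
                    na.rm = TRUE))
}

# Usage (assumes videogames1 from the cleaning step):
# validation_mse(Global_Sales ~ Rating + Critic_Score + User_Score, videogames1)
```

A test MSE well above the training MSE would point to overfitting, while similar and uniformly poor values on both halves would point to a model that simply explains little.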