Project 1 fniyas

Author

Fatimah Niyas

Video Games Project 1

Video Games Introduction

Video games are an extremely popular form of entertainment, and they are developing rapidly over time. They span numerous genres, such as action, FPS, strategy, and many more. The data set, PC Games Reviews, consists of hundreds of thousands of reviews for different PC games. These reviews include ratings from users who played the game as well as reviewers' ratings. The data was pulled from Metacritic.com. In my visualization, I plan on exploring the correlation between the ratings made by users and the ratings made by reviewers.

PC games dataset

library(tidyverse) #load necessary libraries

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(ggplot2)
library(RColorBrewer)
setwd("C:/Users/asman/Documents/data110") #set working directory
videogames <- read_csv("metacritic_pc_games.csv") #read dataset

Rows: 513250 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (12): Game Title, Game Poster, Game Release Date, Game Developer, Genre,...
dbl  (2): Overall Metascore, Rating Given By The Reviewer

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
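The message above suggests specifying column types. As a minimal sketch of what that could look like, the example below parses a tiny inline CSV instead of the real file. The placeholder string "tbd" for missing user ratings is an assumption made for illustration; the actual non-numeric entries in `Overall User Rating` would need to be checked first.

```r
library(readr)

# Tiny inline CSV standing in for metacritic_pc_games.csv (hypothetical values).
csv_text <- paste(
  "Game Title,Overall User Rating,Rating Given By The Reviewer",
  "Game A,8.1,90",
  "Game B,tbd,75",
  sep = "\n"
)

toy_csv <- read_csv(
  I(csv_text),                    # I() tells readr to treat the string as literal data
  na = c("", "NA", "tbd"),        # "tbd" as a missing-value marker is an assumption
  col_types = cols(
    `Game Title` = col_character(),
    `Overall User Rating` = col_double(),
    `Rating Given By The Reviewer` = col_double()
  )
)

toy_csv
```

Declaring the types up front also quiets the column specification message and makes `Overall User Rating` numeric from the start.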

summary(videogames)

  Game Title        Game Poster        Game Release Date  Game Developer    
 Length:513250      Length:513250      Length:513250      Length:513250     
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
    Genre            Platforms         Product Rating     Overall Metascore
 Length:513250      Length:513250      Length:513250      Min.   :16.00    
 Class :character   Class :character   Class :character   1st Qu.:72.00    
 Mode  :character   Mode  :character   Mode  :character   Median :80.00    
                                                          Mean   :78.04    
                                                          3rd Qu.:86.00    
                                                          Max.   :97.00    
                                                          NA's   :69       
 Overall User Rating Reviewer Name      Reviewer Type     
 Length:513250       Length:513250      Length:513250     
 Class :character    Class :character   Class :character  
 Mode  :character    Mode  :character   Mode  :character  
                                                          
                                                          
                                                          
                                                          
 Rating Given By The Reviewer Review Date        Review Text       
 Min.   :  0.0                Length:513250      Length:513250     
 1st Qu.:  4.0                Class :character   Class :character  
 Median :  9.0                Mode  :character   Mode  :character  
 Mean   : 20.3                                                     
 3rd Qu.: 10.0                                                     
 Max.   :100.0                                                     
 NA's   :2801

Shortening Dataset

videogames1 <- videogames %>%
  head(5000) #This keeps the first 5000 rows so that the visualization is smaller and appears neater.
videogames1

# A tibble: 5,000 × 14
   `Game Title`         `Game Poster` `Game Release Date` `Game Developer` Genre
   <chr>                <chr>         <chr>               <chr>            <chr>
 1 .hackG.U. Last Reco… https://stat… 3-Nov-17            CyberConnect2    Genr…
 2 .hackG.U. Last Reco… https://stat… 3-Nov-17            CyberConnect2    Genr…
 3 .hackG.U. Last Reco… https://stat… 3-Nov-17            CyberConnect2    Genr…
 4 .hackG.U. Last Reco… https://stat… 3-Nov-17            CyberConnect2    Genr…
 5 .hackG.U. Last Reco… https://stat… 3-Nov-17            CyberConnect2    Genr…
 6 .hackG.U. Last Reco… https://stat… 3-Nov-17            CyberConnect2    Genr…
 7 .hackG.U. Last Reco… https://stat… 3-Nov-17            CyberConnect2    Genr…
 8 007_ NightFire       https://stat… 28-Nov-02           Gearbox Software Genr…
 9 007_ NightFire       https://stat… 28-Nov-02           Gearbox Software Genr…
10 007_ NightFire       https://stat… 28-Nov-02           Gearbox Software Genr…
# ℹ 4,990 more rows
# ℹ 9 more variables: Platforms <chr>, `Product Rating` <chr>,
#   `Overall Metascore` <dbl>, `Overall User Rating` <chr>,
#   `Reviewer Name` <chr>, `Reviewer Type` <chr>,
#   `Rating Given By The Reviewer` <dbl>, `Review Date` <chr>,
#   `Review Text` <chr>
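Because the file appears to be sorted by game title, the first 5000 rows cover only the games at the start of the alphabet. One alternative, sketched here on a small made-up table rather than the real data, is to draw rows at random with `slice_sample()`. The table, its column names, and the seed value are all hypothetical.

```r
library(dplyr)

# Toy stand-in for the full review table (hypothetical values).
toy_reviews <- tibble(
  game = rep(c("Game A", "Game B", "Game C", "Game D"), each = 5),
  reviewer_rating = seq(5, 100, by = 5)
)

set.seed(110)  # fix the seed so the same rows are drawn on every render
toy_sample <- toy_reviews %>%
  slice_sample(n = 8)  # 8 rows drawn at random, without replacement

toy_sample
```

On the real data, the same idea would be `videogames %>% slice_sample(n = 5000)`.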

Cleaning Dataset

videogamevalues <- videogames1 %>%
  select(`Game Title`, `Game Release Date`, Platforms, `Overall User Rating`, `Rating Given By The Reviewer`) %>%
  mutate(`Overall User Rating` = suppressWarnings(as.numeric(`Overall User Rating`)), #read in as text, so convert to numbers; non-numeric entries become NA
         reviewer_rating = `Rating Given By The Reviewer` / 10) %>% #Divided by 10 so that it matches the user rating
  filter(!is.na(`Overall User Rating`)) %>% #getting rid of NAs
  filter(!is.na(reviewer_rating))
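To make the cleaning steps concrete, here is a small sketch on made-up rows. The values, the "tbd" placeholder, and the new column name `user_rating` are assumptions for illustration only.

```r
library(dplyr)

# Four hypothetical reviews; the user rating is text, as in the imported data.
toy <- tibble(
  `Overall User Rating` = c("8.1", "tbd", "6.4", NA),
  `Rating Given By The Reviewer` = c(90, 75, NA, 60)
)

toy_clean <- toy %>%
  mutate(
    # Text that is not a number (such as "tbd") turns into NA here.
    user_rating = suppressWarnings(as.numeric(`Overall User Rating`)),
    # Reviewer scores are out of 100, user ratings out of 10.
    reviewer_rating = `Rating Given By The Reviewer` / 10
  ) %>%
  filter(!is.na(user_rating), !is.na(reviewer_rating))

toy_clean
```

Only the first row survives: it is the only one with both a numeric user rating and a reviewer rating.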

Linear Regression Analysis

#linear1 <- lm(`Overall User Rating` ~  `Rating Given By The Reviewer`, data = videogames1)
#summary(linear1)

Analysis

The model has the equation: user rating = 0.04(rating given by the reviewer) + 4.08

In the regression output, the p-value for Rating Given By The Reviewer is marked with three asterisks, which suggests it is a meaningful variable for explaining the linear increase in Overall User Ratings. However, the Adjusted R-Squared value indicates that only about 22% of the variation is explained by the model. In other words, about 78% of the variation in the data is not explained by this model.
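As a quick check of the arithmetic, the sketch below plugs the coefficients and Adjusted R-Squared reported in the regression output into the equation. The helper name `predict_user_rating` and the example score of 80 are made up for illustration.

```r
# Values copied from the regression output in this report.
intercept <- 4.08065
slope <- 0.03930
adj_r_squared <- 0.2172

# Hypothetical helper: predicted user rating (0-10) for a reviewer score (0-100).
predict_user_rating <- function(reviewer_score) {
  intercept + slope * reviewer_score
}

predicted_at_80 <- predict_user_rating(80)  # 4.08065 + 0.03930 * 80
unexplained_share <- 1 - adj_r_squared      # share of variation the model leaves unexplained

predicted_at_80
unexplained_share
```

A reviewer score of 80 gives a predicted user rating of about 7.2, and the unexplained share comes out to roughly 0.78.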

*Side note: R would not render the document with the linear regression model included, so I had to turn it into a comment. The model still ran in my session and gave me the following output:

Call:
lm(formula = Overall User Rating ~ Rating Given By The Reviewer, data = videogames1)

Residuals:
    Min      1Q  Median      3Q     Max
-5.2106 -0.7281  0.1754  0.8120  3.1404

Coefficients:
                             Estimate Std. Error t value Pr(>|t|)
(Intercept)                   4.08065    0.07918   51.54   <2e-16 ***
Rating Given By The Reviewer  0.03930    0.00108   36.39   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.144 on 4770 degrees of freedom
  (228 observations deleted due to missingness)
Multiple R-squared: 0.2173, Adjusted R-squared: 0.2172
F-statistic: 1325 on 1 and 4770 DF, p-value: < 2.2e-16
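One possible reason the model would not render (this is a guess, not something confirmed in this project) is that `Overall User Rating` is imported as text, and `lm()` needs a numeric response. The sketch below shows the general pattern on made-up data: convert the text column to numbers first, then fit. The column names and values are hypothetical.

```r
# Hypothetical data: the user rating arrives as text, with "tbd" for missing scores.
toy <- data.frame(
  user_rating_chr = c("7.9", "8.4", "tbd", "6.1", "9.0", "5.5", "7.2", "tbd"),
  reviewer_score = c(80, 85, 70, 60, 95, 50, 75, 88),
  stringsAsFactors = FALSE
)

# Convert to numeric; entries that are not numbers become NA.
toy$user_rating <- suppressWarnings(as.numeric(toy$user_rating_chr))

# lm() drops rows with NA in the model variables by default.
fit <- lm(user_rating ~ reviewer_score, data = toy)
summary(fit)
```

The two "tbd" rows are dropped during fitting, which is the same kind of deletion the real output reports as "observations deleted due to missingness".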

Creating Data Visualization - Scatterplot

videogames_plot <- ggplot(videogamevalues, aes(x = `Overall User Rating`, y = reviewer_rating, col = "#CB7876")) +
  geom_point() + 
  geom_smooth(method = 'lm', formula = y~x, color = "#8BA47C", linewidth = 1.5) + #adding linear regression line
  labs(title = "Video Games User Ratings Versus Reviewer Ratings",
  caption = "Source: Metacritic.com") +
  xlab("User Ratings") +
  ylab ("Reviewer Ratings") +
  theme_minimal() #changing theme
videogames_plot
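A note on the point color: anything placed inside `aes()` is treated as data, so `col = "#CB7876"` there is read as a category label, gets a default color, and produces a legend. The sketch below, using a small made-up data frame, shows the two usual alternatives: a fixed color set outside `aes()`, and a column mapped inside `aes()` to get one color per group. The data and the two-platform palette are hypothetical.

```r
library(ggplot2)

# Hypothetical ratings for six reviews on two platforms.
toy <- data.frame(
  user_rating = c(6.5, 7.2, 8.1, 5.9, 8.8, 7.7),
  reviewer_rating = c(6.0, 7.0, 8.5, 5.0, 9.0, 7.5),
  platform = c("PC", "PC", "Switch", "Switch", "PC", "Switch")
)

# Fixed color: set it outside aes(), so every point is drawn in that color and no legend appears.
fixed_colour_plot <- ggplot(toy, aes(x = user_rating, y = reviewer_rating)) +
  geom_point(color = "#CB7876") +
  theme_minimal()

# One color per group: map a column inside aes() and let ggplot2 build the legend.
grouped_colour_plot <- ggplot(toy, aes(x = user_rating, y = reviewer_rating, color = platform)) +
  geom_point() +
  scale_color_manual(values = c(PC = "#CB7876", Switch = "#81B2D9")) +
  theme_minimal()

fixed_colour_plot
grouped_colour_plot
```

With the second form there is no need to assign colors point by point; ggplot2 assigns them from the values in the mapped column.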

Making an additional visualization

videogames2 <- videogames %>%
  head(1000) #This keeps the first 1000 rows so that the visualization is smaller and appears neater.
videogames2

# A tibble: 1,000 × 14
   `Game Title`         `Game Poster` `Game Release Date` `Game Developer` Genre
   <chr>                <chr>         <chr>               <chr>            <chr>
 1 .hackG.U. Last Reco… https://stat… 3-Nov-17            CyberConnect2    Genr…
 2 .hackG.U. Last Reco… https://stat… 3-Nov-17            CyberConnect2    Genr…
 3 .hackG.U. Last Reco… https://stat… 3-Nov-17            CyberConnect2    Genr…
 4 .hackG.U. Last Reco… https://stat… 3-Nov-17            CyberConnect2    Genr…
 5 .hackG.U. Last Reco… https://stat… 3-Nov-17            CyberConnect2    Genr…
 6 .hackG.U. Last Reco… https://stat… 3-Nov-17            CyberConnect2    Genr…
 7 .hackG.U. Last Reco… https://stat… 3-Nov-17            CyberConnect2    Genr…
 8 007_ NightFire       https://stat… 28-Nov-02           Gearbox Software Genr…
 9 007_ NightFire       https://stat… 28-Nov-02           Gearbox Software Genr…
10 007_ NightFire       https://stat… 28-Nov-02           Gearbox Software Genr…
# ℹ 990 more rows
# ℹ 9 more variables: Platforms <chr>, `Product Rating` <chr>,
#   `Overall Metascore` <dbl>, `Overall User Rating` <chr>,
#   `Reviewer Name` <chr>, `Reviewer Type` <chr>,
#   `Rating Given By The Reviewer` <dbl>, `Review Date` <chr>,
#   `Review Text` <chr>

Creating New Data Visualization

videogamesvalues2 <- videogames2 %>%
  select(`Overall Metascore`, `Rating Given By The Reviewer`, Platforms) %>% #Selecting needed variables
  mutate(reviewer_rating2 = `Rating Given By The Reviewer` / 10) %>% #Divided by 10 to match the scale used in the first plot
  ggplot(aes(x = `Overall Metascore`, y = reviewer_rating2, group = Platforms, fill = Platforms)) + #fill already splits the areas by platform
  labs(title = "Video Games Reviewer Rating Versus Metascore by Platforms",
  caption = "Source: Metacritic.com") +
  xlab("Overall Metascore") +
  ylab ("Reviewer Ratings") +
  scale_fill_manual(values = c("#F4D35E","#FFB88A", "#FF9C5B","#FBC2C2","#E39B99","#81B2D9", "#8C7DA8","#64557B"))+ #Adding custom colors
  geom_area() +
  theme(legend.position = "right", panel.background = element_rect(fill = "#8BA47C"), panel.grid = element_blank() ) #Changing themes and background color
videogamesvalues2

Warning: Removed 25 rows containing non-finite values (`stat_align()`).
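Some values of `Platforms` list several platforms together, so they show up as one combined group in the legend. One possible way to show each platform on its own is to split those values into one row per platform and then facet. This is only a sketch on made-up data: the comma separator is an assumption about how the platforms are stored, and a scatterplot is used instead of an area chart to keep the example short.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical rows where some reviews list several platforms in one string.
toy <- tibble(
  Platforms = c("PC", "PC, Switch", "PC, Switch, XboxOne", "Switch"),
  `Overall Metascore` = c(70, 80, 90, 75),
  reviewer_rating2 = c(6.5, 8.0, 9.5, 7.0)
)

# One row per platform; the ", " separator is an assumption about the data.
toy_long <- toy %>%
  separate_rows(Platforms, sep = ",\\s*")

# One panel per individual platform.
platform_plot <- ggplot(toy_long, aes(x = `Overall Metascore`, y = reviewer_rating2)) +
  geom_point() +
  facet_wrap(~ Platforms)

platform_plot
```

Note that a review listed under three platforms is counted three times after the split, once in each panel.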

Conclusion

I started off by shortening this dataset to its first 5000 rows so that my visualization wouldn't be overcrowded with data. I then selected the variables that I wanted to work with. I used mutate to divide the reviewer ratings by 10 so that they are on the same scale as the user ratings. Lastly, I filtered out any NAs that were in the dataset.
My first visualization represents the correlation between the ratings made by reviewers and the ratings made by the game's users. There was nothing too interesting, but I noticed that the reviewer ratings tend to be lower than the user ratings. This could be because users are more familiar with the game than a reviewer would be. In my second visualization, I displayed the reviewer ratings, the overall Metascore, and the platform type. I used my own color palette in order to make it pop more. Something interesting I noticed in this visualization is that the platform group with the highest reviews is one that combines multiple platforms: iPhone/iPad, PlayStation4, Switch, XboxOne, and PC.
In my first visualization, I tried to add color to the points to make it more interesting, but it did not turn out to be the color I wanted. Originally I wanted to add multiple colors, but I thought I would have to do it manually for all 5000 points, which I did not know how to do. Also, my legend did not turn out right. In my second visualization, I wish I could have displayed each platform individually instead of having some of them grouped together.