videogame <- read_csv("~/Math 215 Fall 2021/Final Project/videogame.csv", 
    col_types = cols(...1 = col_skip()))
glimpse(videogame)
## Rows: 399
## Columns: 7
## $ PopularityByRatings <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
## $ Game                <chr> "Minecraft", "Portal 2", "Portal", "Grand Theft Au…
## $ ReleaseYear         <dbl> 2011, 2011, 2007, 2013, 2011, 2010, 2011, 2015, 20…
## $ AverageRating       <dbl> 4.04, 4.25, 4.05, 3.81, 3.59, 4.24, 4.37, 4.02, 4.…
## $ NumberOfReviews     <dbl> 10, 8, 9, 7, 3, 9, 16, 18, 10, 7, 10, 3, 5, 9, 6, …
## $ GameTags            <chr> "sandbox, survival, fantasy,", "puzzle platformer,…
## $ Multiplayer         <dbl> 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1,…

This data was scraped from the website glitchwave at the following link: https://glitchwave.com/charts/popular/game/all-time. Our data consists of 399 of the most commonly reviewed video games on the glitchwave website. The games are described by their release year, their GameTags, which describe the type of game, and a binary variable Multiplayer, which takes the value 1 if the game has a multiplayer mode. The rest of the variables come from the reviews the games receive. AverageRating is out of 5, NumberOfReviews is the number of written reviews, and PopularityByRatings orders the games by how many times they were rated. For example, “Minecraft” has a PopularityByRatings value of 1, which means that out of all the games, it was given a rating out of 5 the most frequently.

For this research project, I will be focusing on the relationship between the GameTags variable and the AverageRating variable in order to investigate which types of games tend to receive higher ratings than the average game. Thus the research question I will be addressing is: Are there certain types of games that tend to receive higher or lower ratings than the average rating of all games? The types of games I will be looking at are sandbox, horror, first-person shooter, open world, RPG, and action games. If you are unfamiliar with these terms, I will describe each one when I look at that category.

For this investigation I will be using bootstrapping. To answer the research question, we will look at the difference in mean ratings between games in a given category and all other games; this difference in means is our sample statistic. With bootstrapping, we resample our data with replacement to get many different samples and build a distribution of these differences in means. From that distribution we take the standard error and combine it with the observed difference in means to create a confidence interval for the parameter. Bootstrapping is useful for answering the research question because the resulting confidence interval tells us whether there is a significant difference between the average ratings of a certain type of game and the ratings of all other games. If the confidence interval contains the value 0, then the difference is not significant; if 0 is not in the interval, then we know that this type of game is rated either higher or lower than the average game.
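Concretely, for each category the interval will have the form \((\bar{x}_{\text{category}} - \bar{x}_{\text{other}}) \pm 2 \cdot SE\), where \(\bar{x}_{\text{category}} - \bar{x}_{\text{other}}\) is the difference in mean ratings and \(SE\) is the standard deviation of the bootstrap distribution of differences.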

Since we are investigating types of games, it is useful to use the GameTags variable to create new indicator variables, one per category, where a game is given the value 1 if it falls under that category and 0 if it does not.

videogame <- videogame %>%
  mutate(sandbox = case_when(
    str_detect(GameTags, 'sandbox') ~ 1,
    str_detect(GameTags, 'sandbox', negate = TRUE) ~ 0))
head(videogame %>%
  dplyr::select(Game, AverageRating, GameTags, sandbox))
## # A tibble: 6 × 4
##   Game                        AverageRating GameTags                     sandbox
##   <chr>                               <dbl> <chr>                          <dbl>
## 1 Minecraft                            4.04 sandbox, survival, fantasy,        1
## 2 Portal 2                             4.25 puzzle platformer, physics …       0
## 3 Portal                               4.05 puzzle platformer, physics …       0
## 4 Grand Theft Auto V                   3.81 crime, open world, mission-…       0
## 5 The Elder Scrolls V: Skyrim          3.59 action RPG, open world, wes…       0
## 6 Fallout: New Vegas                   4.24 western RPG, post-apocalypt…       0

Using the mutate function to create the new variable sandbox, we use str_detect to determine whether the word “sandbox” appears in the GameTags variable and, if it does, give the sandbox variable the value 1. In the second str_detect, the negate = TRUE argument says that if the word “sandbox” is not found, a value of 0 is assigned instead. We want to repeat this same process to create binary variables for all of the categories we are interested in. Since the code looks the same except that the word “sandbox” is replaced by a new category, that R code is omitted; a sketch of how the remaining indicators could be built in one pass is shown below.
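The following is only a sketch of that omitted step, not the exact code used for the results in this report: the tag strings to search for (for example, whether the site writes “first-person shooter” or something shorter) are assumptions, and plain str_detect does substring matching just as in the sandbox example above.

# Sketch only: tag strings are assumed, not taken from the data.
tag_patterns <- c(horror = "horror",
                  firstPersonShooter = "first-person shooter",
                  openWorld = "open world",
                  RPG = "RPG",
                  action = "action")

# Create one 0/1 indicator column per tag.
for (col in names(tag_patterns)) {
  videogame[[col]] <- as.numeric(str_detect(videogame$GameTags, tag_patterns[[col]]))
}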

head(videogame%>%
       dplyr::select(Game, sandbox, horror, firstPersonShooter, openWorld, RPG, action))
## # A tibble: 6 × 7
##   Game                   sandbox horror firstPersonShoot… openWorld   RPG action
##   <chr>                    <dbl>  <dbl>             <dbl>     <dbl> <dbl>  <dbl>
## 1 Minecraft                    1      0                 0         0     0      0
## 2 Portal 2                     0      0                 0         0     0      0
## 3 Portal                       0      0                 0         0     0      0
## 4 Grand Theft Auto V           1      0                 0         1     0      0
## 5 The Elder Scrolls V: …       0      0                 0         1     1      1
## 6 Fallout: New Vegas           0      0                 1         1     1      1

Now that we have our data set up, we can start looking at each category starting with sandbox.

Sandbox: A sandbox game is one that allows the player a large degree of freedom in terms of what is done in the game. Basically, the player isn’t told what they have to do. To start, we can plot the mean ratings for sandbox games compared to the mean ratings for all other games.

ggplot(videogame, aes(x = as.factor(sandbox),
                      y = AverageRating)) +
  geom_boxplot(aes(color = as.factor(sandbox))) +
  theme(title = element_text(size = 22),
        legend.key.size = unit(2, 'cm'),
        axis.text = element_text(size = 18),
        axis.title = element_text(size = 18)) +
  labs(
    x = "Sandbox",
    y = "Average Rating",
    title = "Mean Rating of Games",
    subtitle = "sandbox vs non-sandbox"
  ) +
  coord_flip()

From this plot, we can see that the ratings of sandbox games tend to be slightly higher than those of the rest of the games. Based on the distributions in the box plots, there doesn’t appear to be a large difference in average ratings between sandbox games and non-sandbox games, but we can use bootstrapping to determine whether the difference is significant.

We start by finding the difference between the average rating of sandbox games and that of non-sandbox games in our data, which gives our observed statistic. In this R code, we use the filter function to keep only games where the sandbox variable is 1 and then use the summarize function to get the mean of their ratings. We do the same for games where the sandbox variable is 0 and then take the difference between these two values.

sandboxAverage<-videogame %>%
  filter(sandbox==1) %>%
  summarize(sandboxMean=mean(AverageRating))
nonSandboxAverage<-videogame %>%
  filter(sandbox==0) %>%
  summarize(nonSandboxMean=mean(AverageRating))
sandboxAverage-nonSandboxAverage
##   sandboxMean
## 1  0.09058376

This shows that the observed difference in mean ratings is .0906, meaning that in our data sandbox games have an average rating about .09 higher than other games. Now we want to set up bootstrapping in order to resample our data with replacement and get many samples, rather than just the one observed sample.

First we can get rid of all of the unnecessary variables while we are focusing on sandbox games.

sandboxData <- videogame %>%
  dplyr::select(Game, sandbox, AverageRating) %>%
  mutate(sandboxAverageRating = case_when(
    sandbox == 1 ~ AverageRating,
    sandbox == 0 ~ 0)) %>%
  mutate(AverageRating = case_when(
    sandbox == 1 ~ 0,
    sandbox == 0 ~ AverageRating))

We use the select function to drop all variables except Game, sandbox, and AverageRating. We also create a new variable called sandboxAverageRating which keeps the average rating for sandbox games but replaces the value for non-sandbox games with 0. Once this is done, we change the original AverageRating variable by giving all of the sandbox games a value of 0. We do this so that the two groups' rating values are in separate columns, which makes it easier to calculate the mean of each.

Although the rating values are separated, the 0’s in each column would make the averages inaccurate, so we replace all of the 0’s with NA values using the following code:

sandboxData[sandboxData == 0] <- NA                   # replace every remaining 0 with NA
sandboxData$sandbox[is.na(sandboxData$sandbox)] <- 0  # restore the 0's in the sandbox indicator column
head(sandboxData)
## # A tibble: 6 × 4
##   Game                        sandbox AverageRating sandboxAverageRating
##   <chr>                         <dbl>         <dbl>                <dbl>
## 1 Minecraft                         1         NA                    4.04
## 2 Portal 2                          0          4.25                NA   
## 3 Portal                            0          4.05                NA   
## 4 Grand Theft Auto V                1         NA                    3.81
## 5 The Elder Scrolls V: Skyrim       0          3.59                NA   
## 6 Fallout: New Vegas                0          4.24                NA

This is what our data looks like now. Sandbox games have their average ratings under a different variable than the other games.
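As an aside, the same NA-coding could be done in a single mutate with if_else, which avoids creating temporary 0’s and then replacing them. This is just an alternative sketch, not the code used above:

sandboxData <- videogame %>%
  dplyr::select(Game, sandbox, AverageRating) %>%
  mutate(sandboxAverageRating = if_else(sandbox == 1, AverageRating, NA_real_),
         AverageRating = if_else(sandbox == 0, AverageRating, NA_real_))

Because mutate evaluates its arguments in order, sandboxAverageRating is computed from the original AverageRating column before that column is overwritten.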

To use bootstrapping, we first want to set a seed, because each time the code is run without one the resamples will change and the numbers described below would not line up exactly.

set.seed(8888)
sandboxData %>% slice_sample(n = 3, replace = TRUE)
## # A tibble: 3 × 4
##   Game                sandbox AverageRating sandboxAverageRating
##   <chr>                 <dbl>         <dbl>                <dbl>
## 1 Dead Space 2              0          3.71                   NA
## 2 Donkey Kong Country       0          3.91                   NA
## 3 Stardew Valley            0          3.89                   NA

This shows how we can use the slice_sample function to randomly sample rows with replacement in order to get a new dataset. Although each of these games showed up only once, there is a chance that a game shows up multiple times while other games do not show up at all, which is why the bootstrap samples are not all identical.

For this next part, we set \(n\) equal to 399 because our bootstrap samples will be as large as the original sample. This R code defines the original sample so that it can be resampled with replacement in the next chunk.

n<-399
orig_sample<- sandboxData%>%
  slice_sample(n=n, replace=FALSE)

For one single bootstrap, we get the following:

set.seed(8888)
orig_sample %>%
  slice_sample(n = n, replace = TRUE) %>%
  summarize(meanRatingDifference = mean(sandboxAverageRating,na.rm=TRUE)-mean(AverageRating,na.rm=TRUE))
## # A tibble: 1 × 1
##   meanRatingDifference
##                  <dbl>
## 1                0.137

This means that the original observed sample was resampled with replacement and the new sample produced a difference in mean ratings of .137. Note that we use the na.rm = TRUE argument to drop the NA values we created earlier so that the means can be computed. We want to repeat this process many times to get a distribution of these differences in mean ratings. In the next R code, we set the number of trials to 1000, which means that the same process is repeated 1000 times to get 1000 different samples. We define sandboxData_399_bs to store all of these results, and the map_dfr function binds the results of the repeated resamples into a single data frame.

set.seed(8888)
num_trials<-1000
sandboxData_399_bs <- 1:num_trials %>%
  map_dfr(
    ~orig_sample %>%
      slice_sample(n = n, replace = TRUE) %>%
      summarize(meanRatingDifference = mean(sandboxAverageRating,na.rm=TRUE)-mean(AverageRating,na.rm=TRUE))) %>%
  mutate(n = n)

hist(sandboxData_399_bs$meanRatingDifference)

Once we have all of our samples, we can plot a histogram of the differences in mean ratings between sandbox games and non-sandbox games. We can see that this histogram is centered slightly below .1. For a 95% confidence interval (significance level \(\alpha=.05\)), we want to find the range into which 95% of the differences fall. Looking at the histogram, since 0 is so close to the middle, it will be inside this 95% confidence interval, which would make the difference in average ratings between sandbox games and non-sandbox games insignificant. To be sure of this, we can use the skim function to get the mean of the distribution as well as the standard error, which we can then use to calculate a 95% confidence interval.

sandboxData_399_bs %>%
  skim(meanRatingDifference)
Data summary
Name Piped data
Number of rows 1000
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

| skim_variable        | n_missing | complete_rate | mean |   sd |   p0 |  p25 |  p50 |  p75 | p100 | hist  |
|:---------------------|----------:|--------------:|-----:|-----:|-----:|-----:|-----:|-----:|-----:|:------|
| meanRatingDifference |         6 |          0.99 | 0.09 | 0.07 | -0.1 | 0.04 | 0.08 | 0.13 | 0.38 | ▂▇▆▁▁ |

From this output, we get that the distribution of differences in mean ratings between sandbox games and non-sandbox games has a mean of \(.0879\) and a standard deviation of \(.0753\). To get a 95% confidence interval, we take the range from two standard deviations below the mean to two standard deviations above the mean.

sandboxLowerBound<-.0879-2*(.0753)
sandboxLowerBound
## [1] -0.0627
sandboxUpperBound<-.0879+2*(.0753)
sandboxUpperBound
## [1] 0.2385

The 95% confidence interval we obtain is \((-.0627,.2385)\), which means that we are 95% confident that the population difference in average ratings between sandbox and non-sandbox games is between \(-.0627\) and \(.2385\). As we had previously guessed, 0 is contained in this interval, which means we are not confident that sandbox games typically receive different average ratings compared to other games; the difference in average ratings between sandbox games and non-sandbox games is not significant.

Horror: The next category we are going to investigate is horror. Since we are repeating the same process as before, just replacing sandbox games with horror games, only the histogram and the summary statistics for the confidence interval will be reported. The same will be done for all of the remaining categories, as they follow exactly the same process; the only thing that changes is the binary variable that defines the type of game. Since the same pipeline is reused for every category, it could also be wrapped in a small helper function, sketched below.
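Here is a minimal sketch of such a helper, assuming the 0/1 indicator columns created earlier (horror, firstPersonShooter, and so on) exist in videogame. It mirrors the sandbox pipeline, resampling all 399 rows with replacement and taking the difference in mean AverageRating between the tagged and untagged games; it is not the exact code used for the results below.

bootstrap_diffs <- function(data, indicator, reps = 1000) {
  # For each repetition, resample the full data set with replacement and
  # compute the difference in mean ratings between the two groups.
  map_dfr(1:reps, ~ data %>%
            slice_sample(n = nrow(data), replace = TRUE) %>%
            summarize(meanRatingDifference =
                        mean(AverageRating[.data[[indicator]] == 1]) -
                        mean(AverageRating[.data[[indicator]] == 0])))
}

# Hypothetical usage:
# set.seed(8888)
# horrorData_399_bs <- bootstrap_diffs(videogame, "horror")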

hist(horrorData_399_bs$meanRatingDifference)

For this histogram, which shows the bootstrap distribution of differences between the average ratings of horror games and non-horror games, we see that the distribution is centered slightly above .1. Although this is centered farther from 0 than our previous plot with the sandbox data, it still appears that 0 would be in a 95% confidence interval, as it is not far from the center. We need the summary produced by the skim function to obtain the confidence interval:

horrorData_399_bs %>%
  skim(meanRatingDifference)
Data summary
Name Piped data
Number of rows 1000
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

| skim_variable        | n_missing | complete_rate | mean |   sd |    p0 |  p25 |  p50 |  p75 | p100 | hist  |
|:---------------------|----------:|--------------:|-----:|-----:|------:|-----:|-----:|-----:|-----:|:------|
| meanRatingDifference |         0 |             1 | 0.12 | 0.08 | -0.14 | 0.07 | 0.12 | 0.17 | 0.39 | ▁▃▇▃▁ |

This output shows that the mean difference from our bootstrap distribution is .114 with a standard deviation of .0775. Calculating the confidence interval the same way as before goes as follows:

horrorLowerBound<-.114-2*(.0775)
horrorLowerBound
## [1] -0.041
horrorUpperBound<-.114+2*(.0775)
horrorUpperBound
## [1] 0.269

The confidence interval obtained for the difference in average ratings between horror games and non-horror games is \((-.041,.269)\). Since this 95% confidence interval contains 0, we once again find that we are not confident that there is a difference between horror games and non-horror games when it comes to average ratings.

First-Person Shooters: The next game type we will be looking at are first-person shooters. These games involve combat that allows the player to play from the point of view of the character in the game. A popular example of a first-person shooter game is any Call of Duty® game. Following the same steps as before for bootstrapping yields:

hist(firstPersonShooterData_399_bs$meanRatingDifference)

Looking at this histogram that shows the distribution of differences in average ratings between FPS and non-FPS games obtained from bootstrap samples, we see that the distribution is centered at about -.15 and almost all of the data falls below 0. This means we can expect our confidence interval to be completely negative.

firstPersonShooterData_399_bs %>%
  skim(meanRatingDifference)
Data summary
Name Piped data
Number of rows 1000
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

| skim_variable        | n_missing | complete_rate |  mean |   sd |    p0 |   p25 |   p50 |  p75 | p100 | hist  |
|:---------------------|----------:|--------------:|------:|-----:|------:|------:|------:|-----:|-----:|:------|
| meanRatingDifference |         0 |             1 | -0.14 | 0.06 | -0.36 | -0.18 | -0.14 | -0.1 | 0.01 | ▁▂▇▇▂ |

From the skim function, we get that the mean of the differences in average rating between FPS games and non-FPS games is -.143 and the standard deviation is .0576. Finding the confidence interval in the same way as before goes as follows:

firstPersonShooterLowerBound<- -.143-2*(.0576)
firstPersonShooterLowerBound
## [1] -0.2582
firstPersonShooterUpperBound<- -.143+2*(.0576)
firstPersonShooterUpperBound
## [1] -0.0278

The 95% confidence interval obtained is \((-.258,-.0278)\). Since all of the values in this confidence interval are negative, we are 95% confident that first-person shooter games have a population average rating that is less than the average rating of non-first-person shooter games. Thus we have found that the difference in average ratings between first-person shooter games and non-first-person shooter games is significant.

RPG: RPG stands for role-playing game, a game where the player advances through some sort of storyline and completes tasks or quests along the way. Doing the same bootstrapping once again yields:

hist(RPGData_399_bs$meanRatingDifference)

Looking at this histogram, we see that the distribution of differences between the average ratings of RPG games and non-RPG games is centered at about .15, and hardly any samples produced a difference below 0. Thus we can expect the confidence interval to include only positive values.

RPGData_399_bs %>%
  skim(meanRatingDifference)
Data summary
Name Piped data
Number of rows 1000
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

| skim_variable        | n_missing | complete_rate | mean |   sd |    p0 |  p25 |  p50 |  p75 | p100 | hist  |
|:---------------------|----------:|--------------:|-----:|-----:|------:|-----:|-----:|-----:|-----:|:------|
| meanRatingDifference |         0 |             1 | 0.16 | 0.05 | -0.01 | 0.12 | 0.16 | 0.19 | 0.32 | ▁▃▇▅▁ |

From the RPG data, we see that the distribution has a mean of .159 and a standard deviation of .0526. This produces a confidence interval of \((.0538,.2642)\), so, as we expected, we are 95% confident that the difference between the population average rating of RPG games and that of non-RPG games is positive, with RPG games tending to receive higher ratings. This shows that there is a significant difference between RPG games and non-RPG games when it comes to average ratings.

Open World: An open world game is one that allows a player to explore freely. There are objectives, but the player does not need to be doing them to be playing the game. Performing bootstrapping yields:

hist(openWorldData_399_bs$meanRatingDifference)

The distribution of differences in our bootstrap samples between the average rating of open world games and non-open world games is centered at about -.08, and because it is centered so close to 0, it doesn’t appear that there is a significant difference between the ratings.

openWorldData_399_bs %>%
  skim(meanRatingDifference)
Data summary
Name Piped data
Number of rows 1000
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

| skim_variable        | n_missing | complete_rate |  mean |   sd |   p0 |   p25 |   p50 |   p75 | p100 | hist  |
|:---------------------|----------:|--------------:|------:|-----:|-----:|------:|------:|------:|-----:|:------|
| meanRatingDifference |         0 |             1 | -0.08 | 0.07 | -0.3 | -0.13 | -0.08 | -0.04 |  0.1 | ▁▃▇▅▁ |

The skim function shows a mean of -.0844 and a standard deviation of .0657. This means that our 95% confidence interval for the difference in average ratings between open world games and non-open world games is \((-.2158,.047)\). Since \(0\) is in this interval, we are not confident that there is a difference between the population average ratings of open world games and those of non-open world games.

Action: The last type of game we will be looking at is action. Bootstrapping gives us:

hist(actionData_399_bs$meanRatingDifference)

Looking at this histogram that shows the distribution of differences in average ratings between action and non-action games obtained from bootstrap samples, we see that the distribution is centered at about .06 and while most of the differences are positive, there are enough below 0 that the confidence interval still probably contains 0.

actionData_399_bs %>%
  skim(meanRatingDifference)
Data summary
Name Piped data
Number of rows 1000
Number of columns 2
_______________________
Column type frequency:
numeric 1
________________________
Group variables None

Variable type: numeric

| skim_variable        | n_missing | complete_rate | mean |   sd |    p0 |  p25 |  p50 |  p75 | p100 | hist  |
|:---------------------|----------:|--------------:|-----:|-----:|------:|-----:|-----:|-----:|-----:|:------|
| meanRatingDifference |         0 |             1 | 0.06 | 0.05 | -0.11 | 0.03 | 0.06 | 0.09 | 0.22 | ▁▂▇▃▁ |

The skim function produces a mean of .0595 and a standard deviation of .0457, which means that the 95% confidence interval is \((-.0319,.1509)\). Thus 0 is contained in our 95% confidence interval, and therefore we are not confident that there is a difference between the population average ratings of action games and those of non-action games.

CONCLUSIONS: Through this investigation, we used bootstrapping to compare the average ratings of games in certain genres to the average ratings of games that did not fall into those genres. The biggest advantage of this research method is that bootstrapping provides us with many samples by resampling the original data with replacement, compared to having to repeat an experiment or find new observational units to get more samples.

Through the use of bootstrapping, we were able to get confidence intervals for the differences in average ratings between games of a certain genre and games not of that genre. We say that we are not confident that there is a difference between these groups if 0 is contained in the confidence interval. With this in mind, we found that the average ratings for sandbox, horror, open world, and action games were not significantly different from the ratings of games that did not fall into those categories. However, when looking at first-person shooter games we obtained a confidence interval that was entirely negative. This means that we are confident that the population average rating of first-person shooter games is significantly lower than that of games not classified as first-person shooters. When looking at RPG games, we obtained an entirely positive confidence interval, which means that the population average rating of RPG games is significantly higher than that of games that do not fall under the RPG category.

With the research question, the goal was to determine whether there were categories of games that tended to receive better or worse ratings than others. The investigation showed that for many categories of games, the ratings could not be determined to be significantly different from the average game rating. However, we did find that first-person shooter games receive significantly worse ratings and RPG games receive significantly better ratings. One thing we do need to be careful about is the population of games to which this can be applied and what the relationship between type of game and average rating actually is. First of all, this was an observational study, so we are unable to conclude that there is a cause-and-effect relationship between type of game and ratings; there could be confounding variables behind this correlation that we could not mitigate, since random assignment is not possible in an observational study. Also, the data came from the 399 games that were most frequently rated, which means that all of the games in our sample are very popular. We therefore should not generalize these results to games that have few players and aren’t popular. While the applications of my findings are fairly limited, the investigation could be considered when making a game: if success is defined by high ratings, then it might be smart to make some kind of RPG and to avoid making a first-person shooter.