0 - Loading in Data

Use this code to read in your data. I’ve loaded packages and made a few clean up changes to the data to try and make it a little more useable, and I’ve filtered to games that are only in one category.

A glimpse of each of the data frames:

## Rows: 21,831
## Columns: 9
## $ id            <dbl> 30549, 822, 13, 68448, 36218, 9209, 178900, 167791, 1733…
## $ name          <chr> "Pandemic", "Carcassonne", "Catan", "7 Wonders", "Domini…
## $ year          <dbl> 2008, 2000, 1995, 2010, 2008, 2004, 2015, 2016, 2015, 20…
## $ rank          <dbl> 106, 190, 429, 73, 104, 192, 101, 4, 16, 36, 31, 183, 14…
## $ avg_rating    <dbl> 7.59, 7.42, 7.14, 7.74, 7.61, 7.41, 7.60, 8.42, 8.11, 7.…
## $ bayes_average <dbl> 7.487, 7.309, 6.970, 7.634, 7.499, 7.305, 7.508, 8.274, …
## $ users_rated   <dbl> 108975, 108738, 108024, 89982, 81561, 76171, 74419, 7421…
## $ url           <chr> "/boardgame/30549/pandemic", "/boardgame/822/carcassonne…
## $ thumbnail     <chr> "https://cf.geekdo-images.com/S3ybV1LAp-8SnHIXLLjVqA__mi…

## Rows: 4,327
## Columns: 14
## $ id            <dbl> 30549, 9209, 14996, 133473, 171, 244521, 320, 148949, 12…
## $ primary       <chr> "Pandemic", "Ticket to Ride", "Ticket to Ride: Europe", …
## $ yearpublished <dbl> 2008, 2004, 2005, 2013, 1475, 2018, 1948, 2014, 2012, 20…
## $ minplayers    <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 1, 1, 2, 2, 2, 2, 1, 2,…
## $ maxplayers    <dbl> 4, 5, 5, 5, 2, 4, 4, 5, 5, 6, 7, 100, 4, 2, 4, 2, 6, 4, …
## $ playingtime   <dbl> 45, 60, 60, 15, 0, 45, 90, 60, 45, 45, 20, 25, 30, 20, 4…
## $ minplaytime   <dbl> 45, 30, 30, 15, 0, 45, 90, 40, 45, 45, 20, 25, 30, 15, 4…
## $ maxplaytime   <dbl> 45, 60, 60, 15, 0, 45, 90, 60, 45, 45, 20, 25, 30, 20, 4…
## $ minage        <dbl> 8, 8, 8, 8, 6, 10, 10, 10, 8, 10, 8, 10, 8, 8, 6, 12, 8,…
## $ category      <fct> Medical, Trains, Trains, Card Game, Abstract Strategy, M…
## $ owned         <dbl> 168364, 105748, 92203, 65495, 43531, 45597, 51494, 32271…
## $ trading       <dbl> 2508, 930, 709, 964, 192, 178, 522, 442, 625, 901, 306, …
## $ wanting       <dbl> 625, 692, 692, 171, 30, 1337, 86, 886, 541, 500, 339, 51…
## $ wishing       <dbl> 9344, 6620, 5937, 2330, 334, 8482, 447, 5351, 4867, 4191…

##        id             name                year           rank      
##  Min.   :     1   Length:21831       Min.   :   0   Min.   :    1  
##  1st Qu.: 12308   Class :character   1st Qu.:2001   1st Qu.: 5458  
##  Median :104994   Mode  :character   Median :2011   Median :10916  
##  Mean   :118145                      Mean   :1987   Mean   :10916  
##  3rd Qu.:207219                      3rd Qu.:2017   3rd Qu.:16374  
##  Max.   :350992                      Max.   :3500   Max.   :21831  
##    avg_rating    bayes_average    users_rated         url           
##  Min.   :1.040   Min.   :0.000   Min.   :    30   Length:21831      
##  1st Qu.:5.830   1st Qu.:5.510   1st Qu.:    56   Class :character  
##  Median :6.450   Median :5.545   Median :   122   Mode  :character  
##  Mean   :6.417   Mean   :5.682   Mean   :   867                     
##  3rd Qu.:7.040   3rd Qu.:5.674   3rd Qu.:   392                     
##  Max.   :9.570   Max.   :8.511   Max.   :108975                     
##   thumbnail        
##  Length:21831      
##  Class :character  
##  Mode  :character  
##                    
##                    
##

1 - Ratings

(1.1) We’ve got our data! A natural question to start with - What are the top 5 rated games? Make it so that your answer only displays the name, year, rating, and the users_rated.

library(dplyr)


ratings %>% arrange(desc(avg_rating)) %>% slice(1:5) %>% select (name, year, avg_rating, users_rated)

## # A tibble: 5 × 4
##   name                                               year avg_rating users_rated
##   <chr>                                             <dbl>      <dbl>       <dbl>
## 1 Malhya: Lands of Legends                           2022       9.57          41
## 2 Erune                                              2021       9.56          43
## 3 TerroriXico                                        2018       9.43          70
## 4 System Gateway (fan expansion for Android: Netru…  2021       9.4           47
## 5 DEFCON 1                                           2022       9.36          60

(1.2) What does the distribution of ratings look like? Create a ggplot histogram to find out. Make sure you’re choosing a good binwidth.

library(ggplot2)

ggplot(data = ratings, aes(x= avg_rating)) + geom_histogram(bins = 100, fill = "blue", color = "black")

(1.3) Are some games rated by more users than others? The data relevant to this is in users_rated. Create an appropriate plot to find out, using a different kind of plot than the histogram you used in the last graph.

library(ggplot2)
ggplot(data = ratings, aes(x = avg_rating, y =users_rated)) + geom_point()

Ratings and Number of Ratings connected?

Is there a relationship between the ratings and the number of users that rated it? Those top rated games were only rated by around 70 people each, that seems like not very many….

(1.4) Create a scatterplot to look at the relationship between the two continuous variables, avg_rating and users_rated.

library(ggplot2)
library(dplyr)

ratings %>% ggplot() +
  aes(x = avg_rating, y = users_rated) + 
  geom_point()

(1.5) Something you might have noticed is that users_rated is has a lot of values close to zero. Create the plot again, but this time use a log scale on users_rated.

ratings %>% ggplot() +
  aes(x = avg_rating, y = log(users_rated)) + 
  geom_point()

The greatest variation of ratings happens when not many people rate a game. Perhaps we should only consider games that have more than a certain threshold of number of ratings…

Recent games are rated higher?

(1.6) When we looked at the top 5 games, they all also seem to be pretty recent games. How many games are there for each year in the ratings data?

ratings %>% group_by(year) %>%
  summarise(count = n())

## # A tibble: 188 × 2
##     year count
##    <dbl> <int>
##  1     0   193
##  2   100     1
##  3   200     1
##  4   400     2
##  5   500     1
##  6   550     2
##  7   600     1
##  8   700     2
##  9   762     1
## 10  1000     2
## # ℹ 178 more rows

Whoa, wait, what? Look at all the games that have 0, 200, or 3000 as their year! Some of these are chess and checkers, where they can’t know when it started. But Evil Dead 2: The Official Board Game is probably not from the year 0, and Backgammon is probably not from the year 3000. (…Maybe it’s BC?)

Regardless of the reason, let’s go ahead and narrow down our data using this and our work on the number of users.

(1.7) Filter the data to games that have more than 50 user ratings, and games that are from 1980 to 2022.

year_range <- seq(1980,2022, by = 1)
ratings %>% filter(users_rated > 50, year %in% year_range)

## # A tibble: 16,073 × 9
##        id name   year  rank avg_rating bayes_average users_rated url   thumbnail
##     <dbl> <chr> <dbl> <dbl>      <dbl>         <dbl>       <dbl> <chr> <chr>    
##  1  30549 Pand…  2008   106       7.59          7.49      108975 /boa… https://…
##  2    822 Carc…  2000   190       7.42          7.31      108738 /boa… https://…
##  3     13 Catan  1995   429       7.14          6.97      108024 /boa… https://…
##  4  68448 7 Wo…  2010    73       7.74          7.63       89982 /boa… https://…
##  5  36218 Domi…  2008   104       7.61          7.50       81561 /boa… https://…
##  6   9209 Tick…  2004   192       7.41          7.30       76171 /boa… https://…
##  7 178900 Code…  2015   101       7.6           7.51       74419 /boa… https://…
##  8 167791 Terr…  2016     4       8.42          8.27       74216 /boa… https://…
##  9 173346 7 Wo…  2015    16       8.11          7.98       69472 /boa… https://…
## 10  31260 Agri…  2007    36       7.93          7.81       66093 /boa… https://…
## # ℹ 16,063 more rows

(1.8) Create a new variable called recent that indicates whether or not the game is from before or after 2015. It should be equal to true if year is greater than or equal to 2015.

ratings <- ratings %>% mutate(recent = ifelse(year >= 2015, TRUE, FALSE))
ratings

## # A tibble: 21,831 × 10
##        id name   year  rank avg_rating bayes_average users_rated url   thumbnail
##     <dbl> <chr> <dbl> <dbl>      <dbl>         <dbl>       <dbl> <chr> <chr>    
##  1  30549 Pand…  2008   106       7.59          7.49      108975 /boa… https://…
##  2    822 Carc…  2000   190       7.42          7.31      108738 /boa… https://…
##  3     13 Catan  1995   429       7.14          6.97      108024 /boa… https://…
##  4  68448 7 Wo…  2010    73       7.74          7.63       89982 /boa… https://…
##  5  36218 Domi…  2008   104       7.61          7.50       81561 /boa… https://…
##  6   9209 Tick…  2004   192       7.41          7.30       76171 /boa… https://…
##  7 178900 Code…  2015   101       7.6           7.51       74419 /boa… https://…
##  8 167791 Terr…  2016     4       8.42          8.27       74216 /boa… https://…
##  9 173346 7 Wo…  2015    16       8.11          7.98       69472 /boa… https://…
## 10  31260 Agri…  2007    36       7.93          7.81       66093 /boa… https://…
## # ℹ 21,821 more rows
## # ℹ 1 more variable: recent <lgl>

(1.9) Create a density graph of the distribution of the ratings, and color or facet it by your new variable recent. Remember to set the transparency, if needed.

library(dplyr)
library(ggplot2)
ratings %>% ggplot()+aes(x = avg_rating, color = recent) + geom_density()

(1.10) What can you conclude from this graph?

(1.11) And finally, for this section: What are the top 5 games now, after our filter?

2 - Details

Moving to the second details dataframe now, you might want to glimpse it to remind yourself of the variables

Number of players

(2.1) Create a new column called player_range in your dataframe that contains the difference between the minimum number of players and the maximum number of players.

glimpse(details)

## Rows: 4,327
## Columns: 14
## $ id            <dbl> 30549, 9209, 14996, 133473, 171, 244521, 320, 148949, 12…
## $ primary       <chr> "Pandemic", "Ticket to Ride", "Ticket to Ride: Europe", …
## $ yearpublished <dbl> 2008, 2004, 2005, 2013, 1475, 2018, 1948, 2014, 2012, 20…
## $ minplayers    <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 1, 1, 2, 2, 2, 2, 1, 2,…
## $ maxplayers    <dbl> 4, 5, 5, 5, 2, 4, 4, 5, 5, 6, 7, 100, 4, 2, 4, 2, 6, 4, …
## $ playingtime   <dbl> 45, 60, 60, 15, 0, 45, 90, 60, 45, 45, 20, 25, 30, 20, 4…
## $ minplaytime   <dbl> 45, 30, 30, 15, 0, 45, 90, 40, 45, 45, 20, 25, 30, 15, 4…
## $ maxplaytime   <dbl> 45, 60, 60, 15, 0, 45, 90, 60, 45, 45, 20, 25, 30, 20, 4…
## $ minage        <dbl> 8, 8, 8, 8, 6, 10, 10, 10, 8, 10, 8, 10, 8, 8, 6, 12, 8,…
## $ category      <fct> Medical, Trains, Trains, Card Game, Abstract Strategy, M…
## $ owned         <dbl> 168364, 105748, 92203, 65495, 43531, 45597, 51494, 32271…
## $ trading       <dbl> 2508, 930, 709, 964, 192, 178, 522, 442, 625, 901, 306, …
## $ wanting       <dbl> 625, 692, 692, 171, 30, 1337, 86, 886, 541, 500, 339, 51…
## $ wishing       <dbl> 9344, 6620, 5937, 2330, 334, 8482, 447, 5351, 4867, 4191…

details <- details %>% mutate(player_range = maxplayers - minplayers )

(2.2) Use dplyr to find the average, largest, and smallest player range.

details %>% summarise(average = mean(player_range), largest = max(player_range), smallest = min(player_range))

## # A tibble: 1 × 3
##   average largest smallest
##     <dbl>   <dbl>    <dbl>
## 1    4.09     997       -4

(2.3) The average seems reasonable. The other two… not so much. Use filter to find which games have those values, it might be something we’d be interested in looking at later on.

details %>% filter(player_range == 997 | player_range == (-4))

## # A tibble: 6 × 15
##       id primary     yearpublished minplayers maxplayers playingtime minplaytime
##    <dbl> <chr>               <dbl>      <dbl>      <dbl>       <dbl>       <dbl>
## 1 189890 Scrimish C…          2015          2        999         100          10
## 2   3515 Men Are fr…          1998          4          0          60          60
## 3 146596 Logo Party           2013          4          0          60          30
## 4  28567 I Don't Kn…          2007          2        999           5           5
## 5  51624 Know It or…          2009          4          0          45          45
## 6  19531 Charoodles           2005          4          0          90          90
## # ℹ 8 more variables: maxplaytime <dbl>, minage <dbl>, category <fct>,
## #   owned <dbl>, trading <dbl>, wanting <dbl>, wishing <dbl>,
## #   player_range <dbl>

(2.4) To wrap this piece up, I’ve noticed in the past that it’s sometimes hard to find a game for a big group. Create a bar graph to find the most common number of max players.

## you need maxplayers to be a factor here, so I've done that for you. 
details <- details %>% mutate(maxplayers = factor(maxplayers, ordered = T))
details %>% ggplot() + aes(x = maxplayers) + geom_bar()

General Questions about the Details

(2.5) Use dplyr and the details dataframe to answer the following questions:

How many games are in each category?
What is the highest minimum age? (this information is in the column minage)
What is the game owned by the most people? (this information is in the column owned)
Are there any games that can only be played by one person? Hint: you’ll need maxplayers here.
What is the average playingtime for each category of game?
What is the category that takes the longest to play, on average? Hint: Arrange your previous answer.

#glimpse(details)
library(tidyverse)

details %>% drop_na()%>%
  group_by(category) %>% summarise(total_game = n()) %>% arrange(-total_game)

## # A tibble: 73 × 2
##    category             total_game
##    <fct>                     <int>
##  1 "Card Game"                 992
##  2 "Abstract Strategy"         655
##  3 "Dice"                      242
##  4 "Party Game"                182
##  5 "Economic"                  158
##  6 "Fantasy"                   126
##  7 "Action / Dexterity"        124
##  8 "\"Childrens Game\""        112
##  9 "Sports"                    101
## 10 "Word Game"                  75
## # ℹ 63 more rows

Some graphs of Details

(2.6) Create a bar graph that shows how many games there are for each year.

details %>% group_by(yearpublished) %>% summarise(total_game = n()) %>%  ggplot() + aes(x= yearpublished, y = total_game) + geom_point()

details %>% group_by(yearpublished) %>% summarise(total_game = n()) %>%  ggplot() + aes(x= yearpublished, y = total_game) + geom_bar(stat= "identity")

(2.7) Aaaaand we’ve got the same issue here that we had in ratings. Filter the data for years between 1980 and 2022 and plot it again.

year_range <- seq(1980,2022, by=1)

details %>% group_by(yearpublished) %>% 
  filter(yearpublished %in% year_range) %>%
  summarise(total_game = n()) %>%  
  ggplot() + 
  aes(x= yearpublished, y = total_game) + 
  geom_bar(stat = "identity")

3 - Ratings + Details

(3.1) Use the correct join to create a dataset that contains the columns from both datasets. The datasets are linked by the variable id. It should only contain games that appear in both datasets.

inner_table <- inner_join(ratings, details, by = "id")
inner_table

## # A tibble: 4,327 × 24
##        id name   year  rank avg_rating bayes_average users_rated url   thumbnail
##     <dbl> <chr> <dbl> <dbl>      <dbl>         <dbl>       <dbl> <chr> <chr>    
##  1  30549 Pand…  2008   106       7.59          7.49      108975 /boa… https://…
##  2   9209 Tick…  2004   192       7.41          7.30       76171 /boa… https://…
##  3  14996 Tick…  2005   131       7.54          7.42       63264 /boa… https://…
##  4 133473 Sush…  2013   481       7.04          6.93       38011 /boa… https://…
##  5    171 Chess  1475   430       7.15          6.97       30810 /boa… https://…
##  6 244521 The …  2018    61       7.85          7.68       29635 /boa… https://…
##  7    320 Scra…  1948  1916       6.27          6.16       28245 /boa… https://…
##  8 148949 Ista…  2014   133       7.57          7.42       27078 /boa… https://…
##  9 123540 Toka…  2012   597       6.97          6.81       22188 /boa… https://…
## 10 100901 Flas…  2011   380       7.19          7.04       21950 /boa… https://…
## # ℹ 4,317 more rows
## # ℹ 15 more variables: recent <lgl>, primary <chr>, yearpublished <dbl>,
## #   minplayers <dbl>, maxplayers <ord>, playingtime <dbl>, minplaytime <dbl>,
## #   maxplaytime <dbl>, minage <dbl>, category <fct>, owned <dbl>,
## #   trading <dbl>, wanting <dbl>, wishing <dbl>, player_range <dbl>

(3.2) Use this dataset to answer the following questions:

What is the category of the highest rated game? (Consider selecting just the name, avg_rating, and category. Reminder, the highest rated game at the beginning has been removed for < 50 user ratings, so this will be a different game.)
What is the average rating for each category? Arrange by descending order of avg_rating.

#glimpse(inner_table)
library(dplyr)
library(tidyverse)
inner_table %>% summarise(max_avg_rating = max(avg_rating)) %>% 
  inner_join(inner_table, by = c("max_avg_rating" = "avg_rating" )) %>%
  select(avg_rating = max_avg_rating, name, category)

## # A tibble: 1 × 3
##   avg_rating name               category  
##        <dbl> <chr>              <fct>     
## 1       9.07 Thug Life the Game Miniatures

inner_table %>% drop_na() %>% 
  group_by(category) %>% 
  summarise(avg_rating_category = mean(avg_rating))

## # A tibble: 73 × 2
##    category             avg_rating_category
##    <fct>                              <dbl>
##  1 "\"Childrens Game\""                5.55
##  2 "Abstract Strategy"                 6.31
##  3 "Action / Dexterity"                6.14
##  4 "Adventure"                         6.06
##  5 "Age of Reason"                     7.19
##  6 "American Civil War"                7.24
##  7 "American West"                     6.62
##  8 "Ancient"                           6.55
##  9 "Animals"                           6.26
## 10 "Arabian"                           6   
## # ℹ 63 more rows

4 - Your Turn

Using one or both of these dataframes of games, state a question and create a plot (or plots) of your own to answer that question. You may use the original dataframes, the filtered dataframes, the joined dataframe, or any combination of your choice.

Your question doesn’t have to have an interesting answer, but your plot must answer the question. However, extra credit will be given for interesting plots and questions.

details %>% group_by(yearpublished) %>% 
  filter(yearpublished %in% year_range) %>%
  summarise(total_game = n()) %>%  
  ggplot() + 
  aes(x= yearpublished, y = total_game) + 
  geom_bar(stat = "identity")

Exam 2 - Fall 2022

Dr. Shoemaker