Use this code to read in your data. I’ve loaded packages and made a few clean up changes to the data to try and make it a little more useable, and I’ve filtered to games that are only in one category.
A glimpse of each of the data frames:
## Rows: 21,831
## Columns: 9
## $ id <dbl> 30549, 822, 13, 68448, 36218, 9209, 178900, 167791, 1733…
## $ name <chr> "Pandemic", "Carcassonne", "Catan", "7 Wonders", "Domini…
## $ year <dbl> 2008, 2000, 1995, 2010, 2008, 2004, 2015, 2016, 2015, 20…
## $ rank <dbl> 106, 190, 429, 73, 104, 192, 101, 4, 16, 36, 31, 183, 14…
## $ avg_rating <dbl> 7.59, 7.42, 7.14, 7.74, 7.61, 7.41, 7.60, 8.42, 8.11, 7.…
## $ bayes_average <dbl> 7.487, 7.309, 6.970, 7.634, 7.499, 7.305, 7.508, 8.274, …
## $ users_rated <dbl> 108975, 108738, 108024, 89982, 81561, 76171, 74419, 7421…
## $ url <chr> "/boardgame/30549/pandemic", "/boardgame/822/carcassonne…
## $ thumbnail <chr> "https://cf.geekdo-images.com/S3ybV1LAp-8SnHIXLLjVqA__mi…
## Rows: 4,327
## Columns: 14
## $ id <dbl> 30549, 9209, 14996, 133473, 171, 244521, 320, 148949, 12…
## $ primary <chr> "Pandemic", "Ticket to Ride", "Ticket to Ride: Europe", …
## $ yearpublished <dbl> 2008, 2004, 2005, 2013, 1475, 2018, 1948, 2014, 2012, 20…
## $ minplayers <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 1, 1, 2, 2, 2, 2, 1, 2,…
## $ maxplayers <dbl> 4, 5, 5, 5, 2, 4, 4, 5, 5, 6, 7, 100, 4, 2, 4, 2, 6, 4, …
## $ playingtime <dbl> 45, 60, 60, 15, 0, 45, 90, 60, 45, 45, 20, 25, 30, 20, 4…
## $ minplaytime <dbl> 45, 30, 30, 15, 0, 45, 90, 40, 45, 45, 20, 25, 30, 15, 4…
## $ maxplaytime <dbl> 45, 60, 60, 15, 0, 45, 90, 60, 45, 45, 20, 25, 30, 20, 4…
## $ minage <dbl> 8, 8, 8, 8, 6, 10, 10, 10, 8, 10, 8, 10, 8, 8, 6, 12, 8,…
## $ category <fct> Medical, Trains, Trains, Card Game, Abstract Strategy, M…
## $ owned <dbl> 168364, 105748, 92203, 65495, 43531, 45597, 51494, 32271…
## $ trading <dbl> 2508, 930, 709, 964, 192, 178, 522, 442, 625, 901, 306, …
## $ wanting <dbl> 625, 692, 692, 171, 30, 1337, 86, 886, 541, 500, 339, 51…
## $ wishing <dbl> 9344, 6620, 5937, 2330, 334, 8482, 447, 5351, 4867, 4191…
## id name year rank
## Min. : 1 Length:21831 Min. : 0 Min. : 1
## 1st Qu.: 12308 Class :character 1st Qu.:2001 1st Qu.: 5458
## Median :104994 Mode :character Median :2011 Median :10916
## Mean :118145 Mean :1987 Mean :10916
## 3rd Qu.:207219 3rd Qu.:2017 3rd Qu.:16374
## Max. :350992 Max. :3500 Max. :21831
## avg_rating bayes_average users_rated url
## Min. :1.040 Min. :0.000 Min. : 30 Length:21831
## 1st Qu.:5.830 1st Qu.:5.510 1st Qu.: 56 Class :character
## Median :6.450 Median :5.545 Median : 122 Mode :character
## Mean :6.417 Mean :5.682 Mean : 867
## 3rd Qu.:7.040 3rd Qu.:5.674 3rd Qu.: 392
## Max. :9.570 Max. :8.511 Max. :108975
## thumbnail
## Length:21831
## Class :character
## Mode :character
##
##
##
(1.1) We’ve got our data! A natural question to start with - What are the top 5 rated games? Make it so that your answer only displays the name, year, rating, and the users_rated.
library(dplyr)
ratings %>% arrange(desc(avg_rating)) %>% slice(1:5) %>% select (name, year, avg_rating, users_rated)
## # A tibble: 5 × 4
## name year avg_rating users_rated
## <chr> <dbl> <dbl> <dbl>
## 1 Malhya: Lands of Legends 2022 9.57 41
## 2 Erune 2021 9.56 43
## 3 TerroriXico 2018 9.43 70
## 4 System Gateway (fan expansion for Android: Netru… 2021 9.4 47
## 5 DEFCON 1 2022 9.36 60
(1.2) What does the distribution of ratings look like? Create a ggplot histogram to find out. Make sure you’re choosing a good binwidth.
library(ggplot2)
ggplot(data = ratings, aes(x= avg_rating)) + geom_histogram(bins = 100, fill = "blue", color = "black")
(1.3) Are some games rated by more users than others? The data
relevant to this is in users_rated. Create an
appropriate plot to find out, using a different kind of plot than the
histogram you used in the last graph.
library(ggplot2)
ggplot(data = ratings, aes(x = avg_rating, y =users_rated)) + geom_point()
Is there a relationship between the ratings and the number of users that rated it? Those top rated games were only rated by around 70 people each, that seems like not very many….
(1.4) Create a scatterplot to look at the relationship
between the two continuous variables, avg_rating and
users_rated.
library(ggplot2)
library(dplyr)
ratings %>% ggplot() +
aes(x = avg_rating, y = users_rated) +
geom_point()
(1.5) Something you might have noticed is that
users_rated is has a lot of values close to zero.
Create the plot again, but this time use a log scale on
users_rated.
ratings %>% ggplot() +
aes(x = avg_rating, y = log(users_rated)) +
geom_point()
The greatest variation of ratings happens when not many people rate a game. Perhaps we should only consider games that have more than a certain threshold of number of ratings…
(1.6) When we looked at the top 5 games, they all also seem to be
pretty recent games. How many games are there for each year in
the ratings data?
ratings %>% group_by(year) %>%
summarise(count = n())
## # A tibble: 188 × 2
## year count
## <dbl> <int>
## 1 0 193
## 2 100 1
## 3 200 1
## 4 400 2
## 5 500 1
## 6 550 2
## 7 600 1
## 8 700 2
## 9 762 1
## 10 1000 2
## # ℹ 178 more rows
Whoa, wait, what? Look at all the games that have 0, 200, or 3000 as their year! Some of these are chess and checkers, where they can’t know when it started. But Evil Dead 2: The Official Board Game is probably not from the year 0, and Backgammon is probably not from the year 3000. (…Maybe it’s BC?)
Regardless of the reason, let’s go ahead and narrow down our data using this and our work on the number of users.
(1.7) Filter the data to games that have more than 50 user ratings, and games that are from 1980 to 2022.
year_range <- seq(1980,2022, by = 1)
ratings %>% filter(users_rated > 50, year %in% year_range)
## # A tibble: 16,073 × 9
## id name year rank avg_rating bayes_average users_rated url thumbnail
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 30549 Pand… 2008 106 7.59 7.49 108975 /boa… https://…
## 2 822 Carc… 2000 190 7.42 7.31 108738 /boa… https://…
## 3 13 Catan 1995 429 7.14 6.97 108024 /boa… https://…
## 4 68448 7 Wo… 2010 73 7.74 7.63 89982 /boa… https://…
## 5 36218 Domi… 2008 104 7.61 7.50 81561 /boa… https://…
## 6 9209 Tick… 2004 192 7.41 7.30 76171 /boa… https://…
## 7 178900 Code… 2015 101 7.6 7.51 74419 /boa… https://…
## 8 167791 Terr… 2016 4 8.42 8.27 74216 /boa… https://…
## 9 173346 7 Wo… 2015 16 8.11 7.98 69472 /boa… https://…
## 10 31260 Agri… 2007 36 7.93 7.81 66093 /boa… https://…
## # ℹ 16,063 more rows
(1.8) Create a new variable called recent that
indicates whether or not the game is from before or after 2015.
It should be equal to true if year is greater than or equal to 2015.
ratings <- ratings %>% mutate(recent = ifelse(year >= 2015, TRUE, FALSE))
ratings
## # A tibble: 21,831 × 10
## id name year rank avg_rating bayes_average users_rated url thumbnail
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 30549 Pand… 2008 106 7.59 7.49 108975 /boa… https://…
## 2 822 Carc… 2000 190 7.42 7.31 108738 /boa… https://…
## 3 13 Catan 1995 429 7.14 6.97 108024 /boa… https://…
## 4 68448 7 Wo… 2010 73 7.74 7.63 89982 /boa… https://…
## 5 36218 Domi… 2008 104 7.61 7.50 81561 /boa… https://…
## 6 9209 Tick… 2004 192 7.41 7.30 76171 /boa… https://…
## 7 178900 Code… 2015 101 7.6 7.51 74419 /boa… https://…
## 8 167791 Terr… 2016 4 8.42 8.27 74216 /boa… https://…
## 9 173346 7 Wo… 2015 16 8.11 7.98 69472 /boa… https://…
## 10 31260 Agri… 2007 36 7.93 7.81 66093 /boa… https://…
## # ℹ 21,821 more rows
## # ℹ 1 more variable: recent <lgl>
(1.9) Create a density graph of the distribution of the
ratings, and color or facet it by your new variable
recent. Remember to set the transparency, if
needed.
library(dplyr)
library(ggplot2)
ratings %>% ggplot()+aes(x = avg_rating, color = recent) + geom_density()
(1.10) What can you conclude from this graph?
(1.11) And finally, for this section: What are the top 5 games now, after our filter?
Moving to the second details dataframe now, you
might want to glimpse it to remind yourself of the
variables
(2.1) Create a new column called player_range in
your dataframe that contains the difference between the minimum number
of players and the maximum number of players.
glimpse(details)
## Rows: 4,327
## Columns: 14
## $ id <dbl> 30549, 9209, 14996, 133473, 171, 244521, 320, 148949, 12…
## $ primary <chr> "Pandemic", "Ticket to Ride", "Ticket to Ride: Europe", …
## $ yearpublished <dbl> 2008, 2004, 2005, 2013, 1475, 2018, 1948, 2014, 2012, 20…
## $ minplayers <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 1, 1, 2, 2, 2, 2, 1, 2,…
## $ maxplayers <dbl> 4, 5, 5, 5, 2, 4, 4, 5, 5, 6, 7, 100, 4, 2, 4, 2, 6, 4, …
## $ playingtime <dbl> 45, 60, 60, 15, 0, 45, 90, 60, 45, 45, 20, 25, 30, 20, 4…
## $ minplaytime <dbl> 45, 30, 30, 15, 0, 45, 90, 40, 45, 45, 20, 25, 30, 15, 4…
## $ maxplaytime <dbl> 45, 60, 60, 15, 0, 45, 90, 60, 45, 45, 20, 25, 30, 20, 4…
## $ minage <dbl> 8, 8, 8, 8, 6, 10, 10, 10, 8, 10, 8, 10, 8, 8, 6, 12, 8,…
## $ category <fct> Medical, Trains, Trains, Card Game, Abstract Strategy, M…
## $ owned <dbl> 168364, 105748, 92203, 65495, 43531, 45597, 51494, 32271…
## $ trading <dbl> 2508, 930, 709, 964, 192, 178, 522, 442, 625, 901, 306, …
## $ wanting <dbl> 625, 692, 692, 171, 30, 1337, 86, 886, 541, 500, 339, 51…
## $ wishing <dbl> 9344, 6620, 5937, 2330, 334, 8482, 447, 5351, 4867, 4191…
details <- details %>% mutate(player_range = maxplayers - minplayers )
(2.2) Use dplyr to find the average, largest, and smallest player range.
details %>% summarise(average = mean(player_range), largest = max(player_range), smallest = min(player_range))
## # A tibble: 1 × 3
## average largest smallest
## <dbl> <dbl> <dbl>
## 1 4.09 997 -4
(2.3) The average seems reasonable. The other two… not so much. Use filter to find which games have those values, it might be something we’d be interested in looking at later on.
details %>% filter(player_range == 997 | player_range == (-4))
## # A tibble: 6 × 15
## id primary yearpublished minplayers maxplayers playingtime minplaytime
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 189890 Scrimish C… 2015 2 999 100 10
## 2 3515 Men Are fr… 1998 4 0 60 60
## 3 146596 Logo Party 2013 4 0 60 30
## 4 28567 I Don't Kn… 2007 2 999 5 5
## 5 51624 Know It or… 2009 4 0 45 45
## 6 19531 Charoodles 2005 4 0 90 90
## # ℹ 8 more variables: maxplaytime <dbl>, minage <dbl>, category <fct>,
## # owned <dbl>, trading <dbl>, wanting <dbl>, wishing <dbl>,
## # player_range <dbl>
(2.4) To wrap this piece up, I’ve noticed in the past that it’s sometimes hard to find a game for a big group. Create a bar graph to find the most common number of max players.
## you need maxplayers to be a factor here, so I've done that for you.
details <- details %>% mutate(maxplayers = factor(maxplayers, ordered = T))
details %>% ggplot() + aes(x = maxplayers) + geom_bar()
(2.5) Use dplyr and the details dataframe to
answer the following questions:
minage)owned)maxplayers here.playingtime for each category of
game?#glimpse(details)
library(tidyverse)
details %>% drop_na()%>%
group_by(category) %>% summarise(total_game = n()) %>% arrange(-total_game)
## # A tibble: 73 × 2
## category total_game
## <fct> <int>
## 1 "Card Game" 992
## 2 "Abstract Strategy" 655
## 3 "Dice" 242
## 4 "Party Game" 182
## 5 "Economic" 158
## 6 "Fantasy" 126
## 7 "Action / Dexterity" 124
## 8 "\"Childrens Game\"" 112
## 9 "Sports" 101
## 10 "Word Game" 75
## # ℹ 63 more rows
(2.6) Create a bar graph that shows how many games there are for each year.
details %>% group_by(yearpublished) %>% summarise(total_game = n()) %>% ggplot() + aes(x= yearpublished, y = total_game) + geom_point()
details %>% group_by(yearpublished) %>% summarise(total_game = n()) %>% ggplot() + aes(x= yearpublished, y = total_game) + geom_bar(stat= "identity")
(2.7) Aaaaand we’ve got the same issue here that we had in ratings. Filter the data for years between 1980 and 2022 and plot it again.
year_range <- seq(1980,2022, by=1)
details %>% group_by(yearpublished) %>%
filter(yearpublished %in% year_range) %>%
summarise(total_game = n()) %>%
ggplot() +
aes(x= yearpublished, y = total_game) +
geom_bar(stat = "identity")
(3.1) Use the correct join to create a dataset that contains
the columns from both datasets. The datasets are linked by the
variable id. It should only contain games that appear in
both datasets.
inner_table <- inner_join(ratings, details, by = "id")
inner_table
## # A tibble: 4,327 × 24
## id name year rank avg_rating bayes_average users_rated url thumbnail
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 30549 Pand… 2008 106 7.59 7.49 108975 /boa… https://…
## 2 9209 Tick… 2004 192 7.41 7.30 76171 /boa… https://…
## 3 14996 Tick… 2005 131 7.54 7.42 63264 /boa… https://…
## 4 133473 Sush… 2013 481 7.04 6.93 38011 /boa… https://…
## 5 171 Chess 1475 430 7.15 6.97 30810 /boa… https://…
## 6 244521 The … 2018 61 7.85 7.68 29635 /boa… https://…
## 7 320 Scra… 1948 1916 6.27 6.16 28245 /boa… https://…
## 8 148949 Ista… 2014 133 7.57 7.42 27078 /boa… https://…
## 9 123540 Toka… 2012 597 6.97 6.81 22188 /boa… https://…
## 10 100901 Flas… 2011 380 7.19 7.04 21950 /boa… https://…
## # ℹ 4,317 more rows
## # ℹ 15 more variables: recent <lgl>, primary <chr>, yearpublished <dbl>,
## # minplayers <dbl>, maxplayers <ord>, playingtime <dbl>, minplaytime <dbl>,
## # maxplaytime <dbl>, minage <dbl>, category <fct>, owned <dbl>,
## # trading <dbl>, wanting <dbl>, wishing <dbl>, player_range <dbl>
(3.2) Use this dataset to answer the following questions:
selecting just the name, avg_rating, and category.
Reminder, the highest rated game at the beginning has been removed for
< 50 user ratings, so this will be a different game.)#glimpse(inner_table)
library(dplyr)
library(tidyverse)
inner_table %>% summarise(max_avg_rating = max(avg_rating)) %>%
inner_join(inner_table, by = c("max_avg_rating" = "avg_rating" )) %>%
select(avg_rating = max_avg_rating, name, category)
## # A tibble: 1 × 3
## avg_rating name category
## <dbl> <chr> <fct>
## 1 9.07 Thug Life the Game Miniatures
inner_table %>% drop_na() %>%
group_by(category) %>%
summarise(avg_rating_category = mean(avg_rating))
## # A tibble: 73 × 2
## category avg_rating_category
## <fct> <dbl>
## 1 "\"Childrens Game\"" 5.55
## 2 "Abstract Strategy" 6.31
## 3 "Action / Dexterity" 6.14
## 4 "Adventure" 6.06
## 5 "Age of Reason" 7.19
## 6 "American Civil War" 7.24
## 7 "American West" 6.62
## 8 "Ancient" 6.55
## 9 "Animals" 6.26
## 10 "Arabian" 6
## # ℹ 63 more rows
Using one or both of these dataframes of games, state a question and create a plot (or plots) of your own to answer that question. You may use the original dataframes, the filtered dataframes, the joined dataframe, or any combination of your choice.
Your question doesn’t have to have an interesting answer, but your plot must answer the question. However, extra credit will be given for interesting plots and questions.
details %>% group_by(yearpublished) %>%
filter(yearpublished %in% year_range) %>%
summarise(total_game = n()) %>%
ggplot() +
aes(x= yearpublished, y = total_game) +
geom_bar(stat = "identity")