
This week’s Tidy Tuesday is about video game data from Steam! First let’s load the data into the workspace:

library(tidyverse)  # provides %>%, mutate(), filter(), and ggplot() used below

vg <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-07-30/video_games.csv")

head(vg)
## # A tibble: 6 x 10
##   number game         release_date price owners      developer  publisher 
##    <dbl> <chr>        <chr>        <dbl> <chr>       <chr>      <chr>     
## 1     1. Half-Life 2  Nov 16, 2004  9.99 10,000,000~ Valve      Valve     
## 2     3. Counter-Str~ Nov 1, 2004   9.99 10,000,000~ Valve      Valve     
## 3    21. Counter-Str~ Mar 1, 2004   9.99 10,000,000~ Valve      Valve     
## 4    47. Half-Life 2~ Nov 1, 2004   4.99 5,000,000 ~ Valve      Valve     
## 5    36. Half-Life: ~ Jun 1, 2004   9.99 2,000,000 ~ Valve      Valve     
## 6    52. CS2D         Dec 24, 2004 NA    1,000,000 ~ Unreal So~ Unreal So~
## # ... with 3 more variables: average_playtime <dbl>,
## #   median_playtime <dbl>, metascore <dbl>

I definitely want to do something with the “owners” column, but its current format (a text range such as “10,000,000 .. 20,000,000”) is rough to deal with. I’m going to create a new column holding the lower bound of each range. This is certainly not a great assumption, since it understates every game’s owner count, but at least it gives me numbers to work with!

# Keep the text before the ".." separator, drop the commas, and convert to a number
vg <- vg %>%
    mutate(owners2 = as.numeric(gsub(",", "", substr(owners, 1, regexpr("\\.{2}", owners) - 2))))
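
To see what that one-liner is doing, here is a small sketch on made-up strings in the same “low .. high” format. The sample values and the helper name `lower_bound` are mine for illustration, not part of the data set:

```r
# Hypothetical helper: pull the lower bound out of a "low .. high" owners string.
lower_bound <- function(owners) {
    # regexpr() gives the position of the first "..";
    # the number ends two characters earlier (just before the space).
    end <- regexpr("\\.{2}", owners) - 2
    # Strip the thousands separators, then convert to numeric.
    as.numeric(gsub(",", "", substr(owners, 1, end)))
}

# Made-up examples, not rows from the data
sample_owners <- c("10,000,000 .. 20,000,000", "0 .. 20,000")
lower_bound(sample_owners)
```

The first string should come back as ten million and the second as zero, which is a reminder of how rough the “lower bound” assumption is for the smallest bucket.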

What games have been played the most total hours?

most_played <- vg %>%
    mutate(total_playtime = owners2 * average_playtime) %>%
    arrange(desc(total_playtime)) %>%
    slice(1:20)
    
ggplot(most_played, aes(reorder(game, total_playtime), total_playtime)) +
    geom_bar(aes(fill = average_playtime), stat = "identity") +
    coord_flip()

What games have made the most money (assuming no DLC purchases)?

most_money <- vg %>%
    mutate(sales = owners2 * price) %>%
    arrange(desc(sales)) %>%
    slice(1:20)
    
ggplot(most_money, aes(reorder(game, sales), sales)) +
    geom_bar(aes(fill = price), stat = "identity") +
    coord_flip()

Let’s see how price relates to the number of owners. There are 200 unique prices in the data, so I’m going to group them into buckets to make them easier to work with.

my_prices <- c(0, 0.49, 0.99, 4.99, 9.99, 14.99, 19.99, 49.99, 99.99, 1000)

round_prices <- vg %>%
    filter(!is.na(price)) %>%
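    # include.lowest = TRUE keeps a price of exactly 0 in the first bucket instead of NA
    mutate(round_price = cut(price, breaks = my_prices, include.lowest = TRUE))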

price_buckets <- round_prices %>%
    count(round_price)

ggplot(price_buckets, aes(round_price, n)) +
    geom_bar(stat = "identity")
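
For anyone unsure how `cut()` decides which bucket a price lands in, here is a quick sketch using a few made-up prices (not taken from the data). By default the intervals are closed on the right, so 0.99 belongs to `(0.49,0.99]`, and a price of exactly 0 falls outside the first interval and becomes `NA` unless `include.lowest = TRUE` is set:

```r
my_prices <- c(0, 0.49, 0.99, 4.99, 9.99, 14.99, 19.99, 49.99, 99.99, 1000)

# Made-up prices for illustration only
example_prices <- c(0, 0.99, 5.00, 59.99)

# Default behaviour: right-closed intervals, lowest break excluded
default_bins <- cut(example_prices, breaks = my_prices)

# Same breaks, but the first interval also includes its lower edge
closed_bins <- cut(example_prices, breaks = my_prices, include.lowest = TRUE)

data.frame(price = example_prices, default_bins, closed_bins)
```

Note that `cut()` rounds the break values when it builds the labels, so the bucket between 9.99 and 14.99 shows up on the axis as `(9.99,15]`.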

Cool. Now let’s see total owners and average owners per game by price bucket.

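# stat = "sum" only counts overlapping points; stat_summary() adds up owners2
# within each bucket (the fun argument needs ggplot2 >= 3.3.0)
ggplot(round_prices, aes(round_price, owners2)) +
    stat_summary(fun = sum, geom = "bar")

# PUBG is throwing off the y axis... let's yank it out for now.
round_prices2 <- filter(round_prices, game != "PLAYERUNKNOWN'S BATTLEGROUNDS")

ggplot(round_prices2, aes(round_price, owners2)) +
    stat_summary(fun = sum, geom = "bar")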

avg_players_by_price <- round_prices2 %>%
    group_by(round_price) %>%
    summarise(avg_players = mean(owners2))

ggplot(avg_players_by_price, aes(round_price, avg_players)) + 
    geom_bar(stat = "identity")
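
The total and the average per bucket can also be computed in a single `summarise()` call, which makes it easy to check the plotted numbers. A minimal sketch on a made-up data frame (the `toy` data, its bucket names, and the `toy_summary` columns are invented for illustration):

```r
library(dplyr)

# Made-up data: three cheap games and one pricey game
toy <- data.frame(
    round_price = c("cheap", "cheap", "cheap", "pricey"),
    owners2     = c(20000, 20000, 100000, 50000)
)

toy_summary <- toy %>%
    group_by(round_price) %>%
    summarise(
        games        = n(),            # number of games in the bucket
        total_owners = sum(owners2),   # what the "total owners" plot shows
        avg_owners   = mean(owners2)   # what the "average owners" plot shows
    )

toy_summary
```

Keeping the game count next to the totals helps when reading the charts: a bucket can have a large total simply because it contains many games.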
