Load Package

Load Data

(Following along with David Robinson's YouTube screencast)

Quick Overview

The only missing values are in gain. It is calculated as the change from the previous month, so the first month of each game has no value. There are 1258 games in the dataset, hence 1258 missing values. Otherwise the dataset is complete.
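A minimal sketch of why each game's first month has no gain, assuming gain is simply the month-over-month difference in avg within a game. The toy data frame and its values are invented for illustration, and only base R is used:

```r
# Toy data: two hypothetical games, three months each, in chronological order.
toy <- data.frame(
  gamename = rep(c("Game A", "Game B"), each = 3),
  avg      = c(100, 120, 90, 10, 15, 30)
)

# Assumed definition: gain = this month's avg minus the previous month's avg,
# computed separately for each game. The first month has no predecessor,
# so its gain is NA.
toy$gain <- ave(toy$avg, toy$gamename, FUN = function(x) c(NA, diff(x)))

toy
sum(is.na(toy$gain))  # one NA per game
```

With 1258 games, the same logic would give 1258 missing values, which matches n_missing for gain in the skim output.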

The numerical variables avg and peak are both heavily right-skewed, so a log transformation may be needed.
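To illustrate what the log transformation does to skewed counts, here is a small base R sketch. The player numbers are invented, and the +1 offset is there because some months have 0 players and log10(0) is not finite:

```r
# Invented right-skewed values spanning several orders of magnitude.
players <- c(0, 3, 50, 200, 760, 12000, 1500000)

# Adding 1 before taking logs keeps zero counts finite: log10(0 + 1) = 0.
log_players <- log10(players + 1)

summary(players)      # mean far above the median: heavy right skew
summary(log_players)  # mean and median much closer together
```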

month and year are in separate columns, which we can combine into a single date column.

avg_peak_perc is stored as character because of the % sign, though we can simply recalculate it from avg and peak.
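As a quick sanity check that recalculating loses nothing, the sketch below compares both routes for a single row. The three values are copied from the first row of the dataset (Counter-Strike: Global Offensive, February 2021). Parsing the string is shown only for comparison and is not the approach taken in this post:

```r
# Values from the first row of the dataset.
perc_chr <- "65.9567%"
avg      <- 741013.24
peak     <- 1123485

# Option 1: strip the % sign and rescale to a proportion.
from_string <- as.numeric(sub("%", "", perc_chr, fixed = TRUE)) / 100

# Option 2: recompute directly from avg and peak.
from_ratio <- avg / peak

# The two agree up to the rounding in the printed string.
all.equal(from_string, from_ratio, tolerance = 1e-4)
```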

skim(dataset)
Data summary
Name dataset
Number of rows 83631
Number of columns 7
_______________________
Column type frequency:
character 3
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
gamename 0 1 3 81 0 1258 0
month 0 1 3 9 0 12 0
avg_peak_perc 0 1 2 8 0 71354 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
year 0 1.00 2017.34 2.24 2012 2016.00 2018.00 2019.00 2021.0 ▂▅▆▇▅
avg 0 1.00 2765.73 26500.54 0 53.11 203.07 763.97 1584886.8 ▇▁▁▁▁
gain 1258 0.98 -10.29 3790.65 -250249 -38.18 -1.62 22.24 426446.1 ▁▇▁▁▁
peak 0 1.00 5469.95 50184.65 0 137.00 500.00 1727.00 3236027.0 ▇▁▁▁▁
glimpse(dataset)
## Rows: 83,631
## Columns: 7
## $ gamename      <chr> "Counter-Strike: Global Offensive", "Dota 2", "PLAYER...
## $ year          <dbl> 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021,...
## $ month         <chr> "February", "February", "February", "February", "Febr...
## $ avg           <dbl> 741013.24, 404832.13, 198957.52, 120982.64, 117742.27...
## $ gain          <dbl> -2196.42, -27839.52, -2289.67, 49215.90, -24374.98, 1...
## $ peak          <dbl> 1123485, 651615, 447390, 196799, 224276, 133620, 1464...
## $ avg_peak_perc <chr> "65.9567%", "62.1275%", "44.4707%", "61.4752%", "52.4...

Make the corrections mentioned above.

dataset_2 <- dataset %>% 
  mutate(date = ymd(paste0(year, month, 1)),
         avg_peak_perc = avg/peak) %>% 
  select(-c(year, month))
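The date step relies on lubridate's ymd() parsing a pasted string such as "2021February1". As a cross-check, here is a base R sketch of the same idea that does not depend on string parsing. It is an alternative for illustration, not what the post uses, and it assumes month always holds a full English month name:

```r
year  <- c(2021, 2012)
month <- c("February", "July")

# month.name is base R's built-in vector of English month names;
# match() turns "February" into 2, "July" into 7, and so on.
month_num <- match(month, month.name)

# Build an ISO date string for the first day of each month and convert it.
date <- as.Date(sprintf("%d-%02d-01", as.integer(year), month_num))
date
```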

It looks clean now.

summary(dataset_2)
##    gamename              avg                 gain                peak        
##  Length:83631       Min.   :      0.0   Min.   :-250249.0   Min.   :      0  
##  Class :character   1st Qu.:     53.1   1st Qu.:    -38.2   1st Qu.:    137  
##  Mode  :character   Median :    203.1   Median :     -1.6   Median :    500  
##                     Mean   :   2765.7   Mean   :    -10.3   Mean   :   5470  
##                     3rd Qu.:    764.0   3rd Qu.:     22.2   3rd Qu.:   1727  
##                     Max.   :1584886.8   Max.   : 426446.1   Max.   :3236027  
##                                         NA's   :1258                         
##  avg_peak_perc         date           
##  Min.   :0.0000   Min.   :2012-07-01  
##  1st Qu.:0.3445   1st Qu.:2016-03-01  
##  Median :0.4382   Median :2018-02-01  
##  Mean   :0.4244   Mean   :2017-10-19  
##  3rd Qu.:0.5135   3rd Qu.:2019-09-01  
##  Max.   :0.8884   Max.   :2021-02-01  
##  NA's   :156
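The 156 NA's in avg_peak_perc most likely come from months where peak is 0. Since the maximum is 0.8884 rather than Inf, these would be 0/0 cases. This is an inference from the summary, not something checked against the data. A small base R sketch of the behaviour, with invented values:

```r
avg  <- c(0, 50, 0)
peak <- c(0, 100, 10)

ratio <- avg / peak
ratio          # NaN, 0.5, 0: only 0/0 is undefined
is.na(ratio)   # NaN counts as missing, so summary() reports it under NA's
```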

Visual Plots

Check the number of games over time. We see that Steam has a steady number of games supplied onto the platform.

dataset_2 %>% 
  count(date) %>% 
  ggplot(aes(x = date, y = n)) +
  geom_line()+
  labs(x = NULL,
       y = "Number of games",
       title = "Number of games on Steam by date")

Check the distributions of avg and peak. Both are roughly symmetric on a log scale, with the (monthly) peak number of players higher than the average number of players, as expected. The peak value is also reached less often than the average, which is also expected.

Quite a few games have fewer than 10 players, while only a few reach over 1 million.

dataset_2 %>% 
  pivot_longer(cols = c(avg, peak)) %>% 
  ggplot(aes(x = value+1, fill = name)) + ## prevent log 0
  geom_histogram(bins = 100) +
  scale_x_log10(labels = comma, breaks = 10^seq(1, 6)) +
  labs(x = "Number of Players",
       y = "Count",
       title = "Comparison between average monthly number of players and peak monthly number")

Next we look at the peak-to-average ratio. World of Warships stands out from the rest: its ratio indicates the game was popular at one point (high peak value), but players abandoned it quickly (low average value).

Generally, the most popular games with high player numbers have a ratio below 2.

dataset_2 %>%
  filter(avg > 100) %>% ## has at least 100 monthly players
  filter(date == max(date)) %>%
  ggplot(aes(x = avg, y = 1/avg_peak_perc, label = gamename)) +
  geom_point() +
  scale_x_log10(labels = comma, breaks = 10 ^ seq(1, 6)) +
  scale_y_log10(breaks = 2^(1:5)) +
  labs(x = "Average monthly number of players",
       y = "Peak/average ratio",
       title = "Stability of player base in Feb 2021"
       ) +
  geom_text_repel()
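The y-axis above is 1/avg_peak_perc, i.e. peak divided by avg. To make the scale easier to read, this base R sketch shows how a few ratio values translate back into the average as a share of the peak (the ratios are picked for illustration):

```r
# Peak/average ratios chosen for illustration.
ratio <- c(2, 10, 100)

# The average as a percentage of the peak is the inverse of the ratio.
avg_share_of_peak <- 100 / ratio

data.frame(ratio, avg_share_of_peak)
# A ratio of 2 means the average is 50% of the peak,
# 10 means 10%, and 100 means 1%.
```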

We apply similar logic to all dates, keeping rows where the ratio is at least 10. Life is Strange is the most hyped game: with a ratio over 100, the average monthly number of players is less than 1% of the peak value.

N.B. the game in fact received very high ratings. The high ratio is due to it being a single-player game, and hence having low replayability once the storyline is finished. A large increase in the average monthly number of players indicates the first game was received well enough to attract a large fan base.

dataset_2 %>%
  filter(avg > 100) %>%
  filter(1/avg_peak_perc >= 10) %>% 
  ggplot(aes(x = avg, y = 1/avg_peak_perc, label = gamename)) +
  geom_point() +
  scale_x_log10(labels = comma, breaks = 10 ^ seq(1, 8)) +
  scale_y_log10(breaks = 2^(1:7)) +
  labs(x = "Average monthly number of players",
       y = "Peak/average ratio",
       title = "Most abandoned games by players after peak"
       ) +
  geom_text_repel()

Next we check the most popular games on Steam.

PUBG was the most popular game from its release, reaching about 1.5 million average monthly players and surpassing every other game on the platform by a large margin, though its popularity has since dwindled substantially. Dota 2 has had a stable player base throughout its history on the platform, but it has been overtaken by CS: Global Offensive since 2020.

The red dashed line marks March 2020, when lockdowns started around the world. Quite a few games received a boost from people being forced to stay at home.

dataset_2 %>% 
  filter(fct_lump(gamename, 12, w = avg) != "Other") %>% 
  mutate(gamename = fct_reorder(gamename,.x = -avg, .fun = mean)) %>% 
  ggplot(aes(x = date, y = avg)) +
  geom_line()+
  scale_y_continuous(labels = comma) +
  expand_limits(y = 0) +
  facet_wrap(~gamename, scales = "free_y")+
  labs(x = NULL,
       y = "Average Monthly Number of Players",
       title = "Top 12 most popular games on Steam") +
  geom_vline(xintercept = as.Date("2020-03-01"), colour = "red", lty = 2)
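The filter at the top of this pipeline uses fct_lump() with w = avg, which, as I understand it, ranks games by the sum of their avg values and lumps everything outside the top 12 into "Other". A rough base R equivalent on invented data, keeping the top 2 instead of 12 (ties may be handled differently by forcats):

```r
# Invented data: four hypothetical games, two months each.
toy <- data.frame(
  gamename = c("A", "A", "B", "B", "C", "C", "D", "D"),
  avg      = c(500, 700, 20, 30, 9000, 8000, 100, 150)
)

# Total avg per game, which is what the weights are summed into.
total_avg <- tapply(toy$avg, toy$gamename, sum)

# Keep the n games with the largest totals (n = 2 here, 12 in the post).
top_games <- names(sort(total_avg, decreasing = TRUE))[1:2]
top_rows  <- toy[toy$gamename %in% top_games, ]

top_games
```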