1 Data

This analysis uses data on Steam games to examine how price, recommendations, release year, whether a game is free, and whether it is multiplayer relate to the positive-review rate (pct_pos_total).

library(readr)  # provides read_csv()

games <- read_csv("steam_games.csv.csv", show_col_types = FALSE)

Check the variable names in the data.

names(games)
##  [1] "appid"                    "name"                    
##  [3] "release_date"             "required_age"            
##  [5] "price"                    "dlc_count"               
##  [7] "detailed_description"     "about_the_game"          
##  [9] "short_description"        "reviews"                 
## [11] "header_image"             "website"                 
## [13] "support_url"              "support_email"           
## [15] "windows"                  "mac"                     
## [17] "linux"                    "metacritic_score"        
## [19] "metacritic_url"           "achievements"            
## [21] "recommendations"          "notes"                   
## [23] "supported_languages"      "full_audio_languages"    
## [25] "packages"                 "developers"              
## [27] "publishers"               "categories"              
## [29] "genres"                   "screenshots"             
## [31] "movies"                   "user_score"              
## [33] "score_rank"               "positive"                
## [35] "negative"                 "estimated_owners"        
## [37] "average_playtime_forever" "average_playtime_2weeks" 
## [39] "median_playtime_forever"  "median_playtime_2weeks"  
## [41] "discount"                 "peak_ccu"                
## [43] "tags"                     "pct_pos_total"           
## [45] "num_reviews_total"        "pct_pos_recent"          
## [47] "num_reviews_recent"

Keep only the variables used in the analysis.

games2 <- games[, c(
  "name",
  "price",
  "recommendations",
  "categories",
  "genres",
  "pct_pos_total",
  "positive",
  "negative",
  "num_reviews_total",
  "release_date"
)]

2 Organizing Data

Exclude games whose positive-review rate is 0 or missing.

games2 <- subset(games2, !is.na(pct_pos_total) & pct_pos_total > 0)

Exclude items that differ in nature from a full game, such as demos, playtests, DLC, and soundtracks. These are identified by keywords in the title.

games2 <- games2[
  !grepl("Demo|Playtest|DLC|Soundtrack", games2$name, ignore.case = TRUE),
]
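Because the filter above matches substrings of the title, a name containing "Demon" or "Demolition" also matches "Demo". The sketch below uses made-up titles to compare the pattern above with a word-boundary variant. The variant is only an illustration and was not used to produce the results in this report.

```r
# Hypothetical titles, used only to illustrate the matching behaviour
titles <- c("Demon Hunter", "Space Demo", "Idle Soundtrack", "Demolition Derby")

# Pattern used in the analysis: matches the keyword anywhere inside the name
loose <- grepl("Demo|Playtest|DLC|Soundtrack", titles, ignore.case = TRUE)

# Variant with word boundaries (\\b): matches the keyword only as a whole word
strict <- grepl("\\b(Demo|Playtest|DLC|Soundtrack)\\b", titles,
                ignore.case = TRUE, perl = TRUE)

data.frame(titles, loose, strict)
```

With the loose pattern all four titles are flagged, while the word-boundary variant flags only "Space Demo" and "Idle Soundtrack".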

Check a summary of the data.

summary(games2)
##      name               price         recommendations    categories       
##  Length:52954       Min.   :  0.000   Min.   :      0   Length:52954      
##  Class :character   1st Qu.:  0.990   1st Qu.:      0   Class :character  
##  Mode  :character   Median :  4.990   Median :      0   Mode  :character  
##                     Mean   :  7.869   Mean   :   1686                     
##                     3rd Qu.:  9.990   3rd Qu.:    159                     
##                     Max.   :199.990   Max.   :4401572                     
##     genres          pct_pos_total       positive          negative        
##  Length:52954       Min.   :  1.00   Min.   :      0   Min.   :      0.0  
##  Class :character   1st Qu.: 67.00   1st Qu.:     12   1st Qu.:      1.0  
##  Mode  :character   Median : 81.00   Median :     36   Median :     10.0  
##                     Mean   : 77.11   Mean   :   2120   Mean   :    351.9  
##                     3rd Qu.: 91.00   3rd Qu.:    182   3rd Qu.:     47.0  
##                     Max.   :100.00   Max.   :7480813   Max.   :1135108.0  
##  num_reviews_total  release_date       
##  Min.   :     10   Min.   :1997-06-30  
##  1st Qu.:     21   1st Qu.:2018-03-15  
##  Median :     55   Median :2021-02-22  
##  Mean   :   2225   Mean   :2020-09-04  
##  3rd Qu.:    238   3rd Qu.:2023-07-27  
##  Max.   :8632939   Max.   :2025-03-10

recommendations is heavily right-skewed, so it is log-transformed. To handle games with 0 recommendations, log(1 + recommendations) is used.

games2$log_rec <- log1p(games2$recommendations)
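As a small check of what the transformation does, the sketch below applies log1p() to a few values taken from the summary above (0, the third quartile 159, and the maximum 4401572) and confirms that it agrees with log(1 + x).

```r
# Values taken from the summary of recommendations above
x <- c(0, 159, 4401572)

# log1p(x) computes log(1 + x); zero stays at zero
transformed <- log1p(x)
transformed

# Same result as writing the formula out by hand
all.equal(transformed, log(1 + x))
```

The transformed maximum is about 15.3, which shows how strongly the long right tail is compressed.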

Create release_year, the year of release, from the release date.

games2$release_date <- as.Date(games2$release_date)
games2$release_year <- as.numeric(format(games2$release_date, "%Y"))

Games priced at $0 are treated as free games, and free_dummy is created accordingly.

games2$free_dummy <- ifelse(games2$price == 0, 1, 0)

If categories contains Multi-player or Multiplayer, the game is treated as a multiplayer game, and the multiplayer dummy is created accordingly.

games2$multiplayer <- ifelse(
  grepl("Multi-player|Multiplayer", games2$categories, ignore.case = TRUE),
  1,
  0
)

Steam may assign several genres to a single game. This analysis creates a dummy variable for each of six major genres: Action, Adventure, RPG, Simulation, Strategy, and Indie.

games2$action <- ifelse(grepl("Action", games2$genres, ignore.case = TRUE), 1, 0)
games2$adventure <- ifelse(grepl("Adventure", games2$genres, ignore.case = TRUE), 1, 0)
games2$rpg <- ifelse(grepl("RPG", games2$genres, ignore.case = TRUE), 1, 0)
games2$simulation <- ifelse(grepl("Simulation", games2$genres, ignore.case = TRUE), 1, 0)
games2$strategy <- ifelse(grepl("Strategy", games2$genres, ignore.case = TRUE), 1, 0)
games2$indie <- ifelse(grepl("Indie", games2$genres, ignore.case = TRUE), 1, 0)
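The six assignments above follow one pattern, so they could also be written as a loop. The sketch below shows this on a toy data frame; the genre strings and their comma-separated format are assumptions made for illustration, and as.integer() is used in place of ifelse(), which gives the same 0/1 values.

```r
# Toy data with hypothetical genre strings (comma-separated format is assumed)
toy <- data.frame(
  genres = c("Action,Indie", "RPG,Strategy", "Casual"),
  stringsAsFactors = FALSE
)

genre_list <- c("Action", "Adventure", "RPG", "Simulation", "Strategy", "Indie")

# One 0/1 column per genre, named in lower case (action, adventure, ...)
for (g in genre_list) {
  toy[[tolower(g)]] <- as.integer(grepl(g, toy$genres, ignore.case = TRUE))
}

toy
```

A game with none of the six genres, like the third row here, gets 0 in every dummy and so belongs to the reference category in the regression.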

3 Descriptive Statistics

Check descriptive statistics for the main variables used in the analysis.

summary(games2[, c(
  "price",
  "recommendations",
  "log_rec",
  "pct_pos_total",
  "num_reviews_total",
  "release_year",
  "free_dummy",
  "multiplayer"
)])
##      price         recommendations      log_rec       pct_pos_total   
##  Min.   :  0.000   Min.   :      0   Min.   : 0.000   Min.   :  1.00  
##  1st Qu.:  0.990   1st Qu.:      0   1st Qu.: 0.000   1st Qu.: 67.00  
##  Median :  4.990   Median :      0   Median : 0.000   Median : 81.00  
##  Mean   :  7.869   Mean   :   1686   Mean   : 2.008   Mean   : 77.11  
##  3rd Qu.:  9.990   3rd Qu.:    159   3rd Qu.: 5.075   3rd Qu.: 91.00  
##  Max.   :199.990   Max.   :4401572   Max.   :15.297   Max.   :100.00  
##  num_reviews_total  release_year    free_dummy      multiplayer    
##  Min.   :     10   Min.   :1997   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:     21   1st Qu.:2018   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :     55   Median :2021   Median :0.0000   Median :0.0000  
##  Mean   :   2225   Mean   :2020   Mean   :0.2147   Mean   :0.2104  
##  3rd Qu.:    238   3rd Qu.:2023   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :8632939   Max.   :2025   Max.   :1.0000   Max.   :1.0000

4 Distribution of Recommendations

Because recommendations takes very large values for a small number of games, first check the distribution restricted to values of 10,000 or less.

hist(
  games2$recommendations[games2$recommendations <= 10000],
  breaks = 50,
  main = "Distribution of Recommendations (<= 10,000)",
  xlab = "Recommendations",
  ylab = "Number of Games"
)

Next, check the distribution of recommendations after the log transformation.

hist(
  games2$log_rec,
  breaks = 50,
  main = "Distribution of log(Recommendations + 1)",
  xlab = "log(Recommendations + 1)",
  ylab = "Number of Games"
)

5 Regression Analysis

First, regress the positive-review rate on price, log-transformed recommendations (log_rec), release year, the free-game dummy, and the multiplayer dummy.

model1_all <- lm(
  pct_pos_total ~ price + log_rec + release_year + free_dummy + multiplayer,
  data = games2
)

summary(model1_all)
## 
## Call:
## lm(formula = pct_pos_total ~ price + log_rec + release_year + 
##     free_dummy + multiplayer, data = games2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -79.906  -9.300   3.287  12.455  38.442 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.333e+03  4.671e+01 -49.949   <2e-16 ***
## price        -7.157e-02  7.298e-03  -9.807   <2e-16 ***
## log_rec       1.239e+00  2.630e-02  47.119   <2e-16 ***
## release_year  1.192e+00  2.312e-02  51.576   <2e-16 ***
## free_dummy    4.536e-01  1.987e-01   2.282   0.0225 *  
## multiplayer  -3.833e+00  1.844e-01 -20.781   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.88 on 52948 degrees of freedom
## Multiple R-squared:  0.07832,    Adjusted R-squared:  0.07823 
## F-statistic: 899.8 on 5 and 52948 DF,  p-value: < 2.2e-16

Next, estimate a model that adds the major-genre dummy variables.

model2_all <- lm(
  pct_pos_total ~ price + log_rec + release_year + free_dummy + multiplayer +
    action + adventure + rpg + simulation + strategy + indie,
  data = games2
)

summary(model2_all)
## 
## Call:
## lm(formula = pct_pos_total ~ price + log_rec + release_year + 
##     free_dummy + multiplayer + action + adventure + rpg + simulation + 
##     strategy + indie, data = games2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -75.590  -9.157   3.138  11.982  43.881 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.496e+03  4.605e+01 -54.205  < 2e-16 ***
## price        -2.033e-02  7.289e-03  -2.789 0.005294 ** 
## log_rec       1.341e+00  2.593e-02  51.705  < 2e-16 ***
## release_year  1.274e+00  2.280e-02  55.870  < 2e-16 ***
## free_dummy    7.090e-01  1.961e-01   3.615 0.000301 ***
## multiplayer  -3.514e+00  1.885e-01 -18.639  < 2e-16 ***
## action       -1.907e+00  1.539e-01 -12.386  < 2e-16 ***
## adventure    -1.322e+00  1.525e-01  -8.673  < 2e-16 ***
## rpg          -2.840e+00  1.832e-01 -15.498  < 2e-16 ***
## simulation   -6.458e+00  1.747e-01 -36.971  < 2e-16 ***
## strategy     -1.925e+00  1.866e-01 -10.317  < 2e-16 ***
## indie         2.771e+00  1.634e-01  16.961  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.54 on 52942 degrees of freedom
## Multiple R-squared:  0.1147, Adjusted R-squared:  0.1146 
## F-statistic: 623.8 on 11 and 52942 DF,  p-value: < 2.2e-16

6 Interpretation

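The coefficient on price is negative and significant in both models (-0.072 in model1_all and -0.020 in model2_all), so higher-priced games tend to have a lower positive-review rate, although the magnitude is small.

The coefficient on log_rec is positive and significant (1.239 and 1.341), so games with more recommendations also tend to have a higher positive-review rate.

The coefficient on free_dummy is positive and significant (0.454 with p = 0.0225 in model1_all, and 0.709 in model2_all), which suggests that free games have a slightly higher positive-review rate than paid games once the other variables are held fixed.

The coefficient on multiplayer is negative and significant (-3.833 and -3.514), so multiplayer games tend to have a lower positive-review rate.

R-squared is 0.078 in model1_all and 0.115 in model2_all, so these variables explain only a small part of the variation in the positive-review rate.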
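To give a rough sense of magnitude, the sketch below converts two coefficients from the model1_all output above into changes in the positive-review rate. The coefficient values are copied from that output; the two scenarios (doubling 1 + recommendations, and a $10 price difference) are chosen only for illustration.

```r
# Coefficients copied from summary(model1_all)
b_log_rec <- 1.239
b_price   <- -0.07157

# Doubling (1 + recommendations) raises log_rec by log(2)
effect_double_rec <- b_log_rec * log(2)

# A price that is $10 higher, other variables held fixed
effect_price_10 <- b_price * 10

round(c(double_rec = effect_double_rec, price_plus_10 = effect_price_10), 3)
```

Under this model, doubling 1 + recommendations is associated with roughly 0.86 percentage points more positive reviews, and a $10 higher price with roughly 0.72 points fewer. These are associations, not causal effects.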

7 Additional Analysis

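Because the positive-review rate tends to be unstable for games with few reviews, an additional analysis splits the sample into four groups at the quartiles of the total number of reviews and runs the regression within each group.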

games2$total_reviews <- games2$num_reviews_total

games2$review_group <- cut(
  games2$total_reviews,
  breaks = quantile(games2$total_reviews, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE),
  include.lowest = TRUE,
  labels = c("low", "medium_low", "medium_high", "high")
)

table(games2$review_group)
## 
##         low  medium_low medium_high        high 
##       13766       12890       13060       13238
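The four groups are not exactly equal in size because many games share the same review count, and tied values always fall on the same side of a break. The sketch below reproduces this behaviour on a small made-up vector, using the same cut() and quantile() calls as above.

```r
# Hypothetical review counts with ties at 10 and 11
reviews <- c(10, 10, 10, 11, 11, 11, 11, 14, 20, 30, 50, 80)

# Quartile breaks, as in the analysis
breaks <- quantile(reviews, probs = c(0, 0.25, 0.5, 0.75, 1))

grp <- cut(
  reviews,
  breaks = breaks,
  include.lowest = TRUE,
  labels = c("low", "medium_low", "medium_high", "high")
)

table(grp)
```

Here the twelve values split into groups of 3, 4, 2, and 3 rather than four groups of 3. If ties were heavy enough to make two quantiles equal, cut() would stop with an error about non-unique breaks, which did not happen in this data.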

For each review-count group, examine how price, recommendations, and the other variables relate to the positive-review rate.

model_low <- lm(
  pct_pos_total ~ price + log_rec + release_year + free_dummy + multiplayer,
  data = subset(games2, review_group == "low")
)

model_medium_low <- lm(
  pct_pos_total ~ price + log_rec + release_year + free_dummy + multiplayer,
  data = subset(games2, review_group == "medium_low")
)

model_medium_high <- lm(
  pct_pos_total ~ price + log_rec + release_year + free_dummy + multiplayer,
  data = subset(games2, review_group == "medium_high")
)

model_high <- lm(
  pct_pos_total ~ price + log_rec + release_year + free_dummy + multiplayer,
  data = subset(games2, review_group == "high")
)
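The four calls above differ only in the subset, so the same fits could be produced with split() and lapply(). The sketch below shows the idea on simulated data with two groups and a single regressor; the toy data frame and its values are assumptions and carry no information about the Steam data.

```r
set.seed(1)
n <- 200

# Simulated stand-in for games2 (values are arbitrary)
toy <- data.frame(
  pct_pos_total = runif(n, 1, 100),
  price = round(runif(n, 0, 60), 2),
  review_group = factor(sample(c("low", "high"), n, replace = TRUE))
)

# Fit the same formula within each group
models <- lapply(
  split(toy, toy$review_group),
  function(d) lm(pct_pos_total ~ price, data = d)
)

# Coefficients side by side: one column per group
coefs <- sapply(models, coef)
coefs
```

Collecting the coefficients in one matrix makes it easier to compare groups than reading four separate summaries.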

summary(model_low)
## 
## Call:
## lm(formula = pct_pos_total ~ price + log_rec + release_year + 
##     free_dummy + multiplayer, data = subset(games2, review_group == 
##     "low"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -77.286 -11.878   4.571  16.106  41.020 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.403e+03  1.290e+02 -34.123  < 2e-16 ***
## price        -1.323e-02  1.342e-02  -0.986    0.324    
## log_rec      -2.194e-02  3.945e+00  -0.006    0.996    
## release_year  2.216e+00  6.385e-02  34.705  < 2e-16 ***
## free_dummy   -2.665e-01  4.562e-01  -0.584    0.559    
## multiplayer  -2.252e+00  4.844e-01  -4.649 3.37e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.44 on 13760 degrees of freedom
## Multiple R-squared:  0.08367,    Adjusted R-squared:  0.08334 
## F-statistic: 251.3 on 5 and 13760 DF,  p-value: < 2.2e-16
summary(model_medium_low)
## 
## Call:
## lm(formula = pct_pos_total ~ price + log_rec + release_year + 
##     free_dummy + multiplayer, data = subset(games2, review_group == 
##     "medium_low"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -78.557 -10.516   3.232  13.521  44.518 
## 
## Coefficients: (1 not defined because of singularities)
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.120e+03  1.037e+02 -30.082  < 2e-16 ***
## price        -1.359e-01  1.870e-02  -7.266 3.92e-13 ***
## log_rec              NA         NA      NA       NA    
## release_year  1.582e+00  5.134e-02  30.819  < 2e-16 ***
## free_dummy   -3.123e+00  3.961e-01  -7.884 3.43e-15 ***
## multiplayer  -3.619e+00  4.113e-01  -8.801  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.34 on 12885 degrees of freedom
## Multiple R-squared:  0.078,  Adjusted R-squared:  0.07771 
## F-statistic: 272.5 on 4 and 12885 DF,  p-value: < 2.2e-16
summary(model_medium_high)
## 
## Call:
## lm(formula = pct_pos_total ~ price + log_rec + release_year + 
##     free_dummy + multiplayer, data = subset(games2, review_group == 
##     "medium_high"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.640  -8.749   2.756  11.594  35.637 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.373e+03  8.106e+01 -29.278  < 2e-16 ***
## price        -1.119e-01  1.888e-02  -5.928 3.15e-09 ***
## log_rec       2.671e-01  5.904e-02   4.524 6.13e-06 ***
## release_year  1.214e+00  4.015e-02  30.232  < 2e-16 ***
## free_dummy   -1.614e+00  3.956e-01  -4.079 4.55e-05 ***
## multiplayer  -3.974e+00  3.333e-01 -11.922  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.02 on 13054 degrees of freedom
## Multiple R-squared:  0.07837,    Adjusted R-squared:  0.07802 
## F-statistic:   222 on 5 and 13054 DF,  p-value: < 2.2e-16
summary(model_high)
## 
## Call:
## lm(formula = pct_pos_total ~ price + log_rec + release_year + 
##     free_dummy + multiplayer, data = subset(games2, review_group == 
##     "high"))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.504  -6.294   2.734   9.183  23.005 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -831.02866   63.28836 -13.131  < 2e-16 ***
## price          -0.10845    0.01051 -10.323  < 2e-16 ***
## log_rec         1.20231    0.06481  18.552  < 2e-16 ***
## release_year    0.44967    0.03132  14.358  < 2e-16 ***
## free_dummy      2.59852    0.47534   5.467 4.67e-08 ***
## multiplayer    -5.76976    0.24654 -23.403  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.69 on 13232 degrees of freedom
## Multiple R-squared:  0.0835, Adjusted R-squared:  0.08315 
## F-statistic: 241.1 on 5 and 13232 DF,  p-value: < 2.2e-16
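In the medium_low output above, log_rec is reported as NA "because of singularities". A likely explanation, not verified here, is that log_rec does not vary within that group, for example if every game in it has 0 recommendations. The sketch below reproduces the symptom on simulated data with a constant regressor.

```r
set.seed(42)

# Simulated data: log_rec is constant, so it is collinear with the intercept
toy <- data.frame(
  y = rnorm(50),
  price = runif(50, 0, 20),
  log_rec = 0
)

fit <- lm(y ~ price + log_rec, data = toy)

# lm() cannot estimate the constant column and reports NA for it
coef(fit)

# On the real data, the within-group spread could be checked with:
# tapply(games2$log_rec, games2$review_group, sd)
```

A standard deviation of 0 (or close to 0) in a group would confirm that the log_rec coefficient cannot be estimated reliably there.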

8 Interpretation

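In low, the group with the fewest reviews, the coefficient on price is negative but not significant (-0.013, p = 0.324). In medium_low, medium_high, and high, the coefficient on price is negative and significant (-0.136, -0.112, and -0.108).

This suggests that the relationship between price and the positive-review rate may differ with the number of reviews: no relationship is detected among the games with the fewest reviews, while the other three groups show a negative relationship of similar size. However, R-squared is only about 0.08 in every model, so the explanatory power of these variables, price included, is limited. In addition, log_rec could not be estimated in medium_low (its coefficient is NA because of a singularity) and is estimated very imprecisely in low (standard error 3.945), so the log_rec results for these two groups should not be interpreted.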