EDA_Report

Review the data

## Dataset Shape: 3024 13

## 
## Dataset Info:

## 'data.frame':    3024 obs. of  13 variables:
##  $ UserID                       : chr  "c9889ab0-9cfc-4a75-acd9-5eab1df0015c" "7c9e413c-ecca-45f2-a780-2826a07952a2" "fd61e419-1a92-4f43-a8c7-135842ad328a" "bdb7f6d1-ff9a-468c-afe7-43f32a94293e" ...
##  $ Age                          : num  49 15 23 31 37 38 20 34 38 46 ...
##  $ Gender                       : chr  "Male" "Male" "Male" "Male" ...
##  $ Country                      : chr  "Norway" "Switzerland" "China" "Mexico" ...
##  $ Device                       : chr  "Android" "iOS" "Android" "Android" ...
##  $ GameGenre                    : chr  "Battle Royale" "Action RPG" "Fighting" "Racing" ...
##  $ SessionCount                 : int  9 11 9 12 10 16 9 15 10 18 ...
##  $ AverageSessionLength         : num  12.83 19.39 8.87 19.56 15.23 ...
##  $ SpendingSegment              : chr  "Minnow" "Minnow" "Minnow" "Minnow" ...
##  $ InAppPurchaseAmount          : num  11.4 6.37 15.81 13.49 10.86 ...
##  $ FirstPurchaseDaysAfterInstall: num  28 18 30 9 15 28 24 30 1 19 ...
##  $ PaymentMethod                : chr  "Apple Pay" "Debit Card" "Apple Pay" "Debit Card" ...
##  $ LastPurchaseDate             : chr  "2025-03-19" "2025-06-08" "2025-06-02" "2025-04-01" ...

## 
## Summary Statistics (Numerical):

##                               vars    n    mean     sd  median trimmed     mad
## UserID*                          1 3024 1512.50 873.10 1512.50 1512.50 1120.85
## Age                              2 2964   33.53  11.99   33.00   33.53   14.83
## Gender*                          3 3024    2.62   0.56    3.00    2.65    0.00
## Country*                         4 3024   14.60   7.75   14.00   14.57    8.90
## Device*                          5 3024    2.39   0.53    2.00    2.38    0.00
## GameGenre*                       6 3024    9.02   4.43    9.00    9.07    5.93
## SessionCount                     7 3024   10.07   3.12   10.00    9.99    2.97
## AverageSessionLength             8 3024   20.07   8.59   20.31   20.11   10.90
## SpendingSegment*                 9 3024    1.89   0.38    2.00    1.95    0.00
## InAppPurchaseAmount             10 2888  102.58 454.34   11.98   18.45    8.74
## FirstPurchaseDaysAfterInstall   11 2888   15.38   8.95   16.00   15.47   11.86
## PaymentMethod*                  12 3024    4.86   2.10    5.00    4.88    2.97
## LastPurchaseDate*               13 3024  110.53  67.56  112.00  110.65   87.47
##                                 min     max   range  skew kurtosis    se
## UserID*                        1.00 3024.00 3023.00  0.00    -1.20 15.88
## Age                           13.00   54.00   41.00  0.01    -1.19  0.22
## Gender*                        1.00    4.00    3.00 -0.49    -0.27  0.01
## Country*                       1.00   28.00   27.00  0.04    -1.10  0.14
## Device*                        1.00    3.00    2.00  0.06    -1.12  0.01
## GameGenre*                     1.00   16.00   15.00 -0.06    -1.21  0.08
## SessionCount                   1.00   22.00   21.00  0.27     0.12  0.06
## AverageSessionLength           5.01   34.99   29.98 -0.04    -1.20  0.16
## SpendingSegment*               1.00    3.00    2.00 -1.12     2.59  0.01
## InAppPurchaseAmount            0.00 4964.45 4964.45  7.69    64.44  8.45
## FirstPurchaseDaysAfterInstall  0.00   30.00   30.00 -0.08    -1.21  0.17
## PaymentMethod*                 1.00    8.00    7.00 -0.08    -1.14  0.04
## LastPurchaseDate*              1.00  226.00  225.00 -0.01    -1.21  1.23

Distribution of numerical columns

Distribution of categorical columns

## 
## Unique values in : Gender
##        Female   Male  Other 
##     60   1098   1810     56 
## 
##  Gender uni_value : 4 
## -----------------------------------------------------------------------------------------------

## 
## Unique values in : Country
##               Afghanistan    Australia   Bangladesh       Brazil       Canada 
##           60           93          113          101          105          105 
##        China      Denmark        Egypt       France      Germany        India 
##          101          106          101           92          114          242 
##         Iran        Italy        Japan       Mexico  Netherlands       Norway 
##          105          112          112          115          114          107 
##       Russia Saudi Arabia  South Korea        Spain    Sri Lanka       Sweden 
##           93          116          103           96           88          106 
##  Switzerland       Turkey           UK          USA 
##          119           90          108          107 
## 
##  Country uni_value : 28 
## -----------------------------------------------------------------------------------------------

## 
## Unique values in : Device
##         Android     iOS 
##      60    1738    1226 
## 
##  Device uni_value : 3 
## -----------------------------------------------------------------------------------------------

## 
## Unique values in : GameGenre
##                  Action RPG     Adventure Battle Royale          Card 
##            60           191           170           189           202 
##        Casual      Fighting        MMORPG          MOBA        Puzzle 
##           209           181           198           181           206 
##        Racing  Role Playing       Sandbox    Simulation        Sports 
##           191           193           219           219           217 
##      Strategy 
##           198 
## 
##  GameGenre uni_value : 16 
## -----------------------------------------------------------------------------------------------

## 
## Unique values in : SpendingSegment
## Dolphin  Minnow   Whale 
##     412    2544      68 
## 
##  SpendingSegment uni_value : 3 
## -----------------------------------------------------------------------------------------------

## 
## Unique values in : PaymentMethod
##                       Apple Pay Carrier Billing     Credit Card      Debit Card 
##             136             374             417             410             433 
##       Gift Card      Google Pay          Paypal 
##             422             431             401 
## 
##  PaymentMethod uni_value : 8 
## -----------------------------------------------------------------------------------------------

##  chr [1:3024] "2025-03-19" "2025-06-08" "2025-06-02" "2025-04-01" ...

## 
## Sample LastPurchaseDate:

##  [1] "2025-03-19" "2025-06-08" "2025-06-02" "2025-04-01" "2025-05-05"
##  [6] "2025-05-05" "2025-06-04" "2025-04-26" "2025-04-08" "2025-02-03"

## 
## Missing Values per Column:
##  UserID                        : 0
##  Age                           : 60
##  Gender                        : 60
##  Country                       : 60
##  Device                        : 60
##  GameGenre                     : 60
##  SessionCount                  : 0
##  AverageSessionLength          : 0
##  SpendingSegment               : 0
##  InAppPurchaseAmount           : 136
##  FirstPurchaseDaysAfterInstall : 136
##  PaymentMethod                 : 136
##  LastPurchaseDate              : 136

## 
## Found 412 problematic observations in total.

## 
## Number of Duplicated Rows:

## [1] 0

Key Observations

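I. Numerical feature analysis and insights

1. Age (Age)

The mean player age is about 33.53 (SD 11.99), ranging from 13 to 54. The distribution is clearly multimodal, with three peaks at roughly 20–25, 30–35, and 45–50, and only a negligible right skew (skewness 0.01). This suggests that the game has several distinct age groups (for example younger, early-middle-aged, and older players), a difference that segmentation should explore. Age has 60 missing values, about 2% of the data.

2. Session count (SessionCount)

The mean session count is 10.07 (SD 3.12), ranging from 1 to 22. The distribution peaks at 10–12 sessions, is roughly symmetric, and is close to normal, which indicates that most players play at a stable and similar frequency. This feature has no missing values.

3. Average session length (AverageSessionLength)

The mean session length is 20.07 minutes (SD 8.59), ranging from 5.01 to 34.99 minutes. The distribution is roughly symmetric but flatter than a normal curve (kurtosis -1.20) and slightly bimodal, which hints at two core behavior patterns: players who prefer quick sessions and heavy players who tend toward long, immersive sessions. This feature has no missing values.

4. In-app purchase amount (InAppPurchaseAmount)

This feature is heavily right-skewed. The mean is 102.58 (SD 454.34), but most amounts fall in the $0–20 range (median 11.98), and the long tail is driven by a small number of high spenders (Whales). Values range from 0 to 4964.45. Any distance-based algorithm (such as K-Means) should be preceded by a log transformation or robust scaling of this feature. The column has 136 missing values, about 4.5% of the data.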
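The two options named above (log transformation and robust scaling) behave quite differently, and a small sketch may help when choosing between them. It uses synthetic amounts rather than the report's data, and the `skewness()` helper is a hypothetical convenience function written for this illustration.

```r
# Synthetic right-skewed amounts (illustrative only, not the report's data)
set.seed(42)
amount <- c(rexp(950, rate = 1 / 12), runif(50, min = 500, max = 5000))

# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

# Option 1: log(1 + x) compresses the long right tail and is defined at 0
amount_log <- log1p(amount)

# Option 2: robust scaling centers on the median and divides by the IQR
amount_robust <- (amount - median(amount)) / IQR(amount)

# Compare the shape of the three versions
round(c(raw    = skewness(amount),
        log1p  = skewness(amount_log),
        robust = skewness(amount_robust)), 2)
```

Robust scaling is a linear rescaling, so it limits the influence of outliers on the scale but leaves the skewness unchanged. Only the log transformation changes the shape of the distribution.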

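5. Days from install to first purchase (FirstPurchaseDaysAfterInstall)

The mean time to first purchase is 15.38 days (SD 8.95), ranging from 0 to 30 days. The histogram shows a peak around 25–30 days after install, which indicates that a sizable share of users make their first purchase relatively late. This column also has 136 missing values, on the same rows as the missing purchase amounts.

II. Categorical feature statistics and segmentation considerations

1. Gender (Gender) and device type (Device)

Gender is dominated by Male (59.9%), followed by Female (36.3%) and Other (1.9%), with 60 missing values. Device consists of Android (57.5%) and iOS (40.5%), also with 60 missing values. Both features suit one-hot encoding to capture platform and gender differences, but the small Other category in Gender needs care.

2. Country (Country) and game genre (GameGenre)

Country covers 27 categories worldwide, with India the largest (8%). With this many categories, direct one-hot encoding would inflate the dimensionality. Consider aggregating countries into regions (for example by continent) or using target encoding. GameGenre has 15 categories with a fairly even distribution, so label encoding, or one-hot encoding followed by dimensionality reduction, are both options.

3. Spending segment (SpendingSegment)

This feature is extremely imbalanced: low spenders (Minnow) make up 84.1%, mid spenders (Dolphin) 13.6%, and high spenders (Whale) only 2.2%. The column is very important and should be used to validate the segmentation results, but it should not be fed into the clustering model itself, to avoid information leakage.

4. Payment method (PaymentMethod)

There are 7 payment methods with a fairly even distribution. One-hot encoding can be used to analyze payment preferences across groups. This column has 136 missing values.

5. Other

The LastPurchaseDate column is stored with the wrong type (character) and needs to be converted to a proper date type.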

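III. Data quality and missing-value handling strategy

1. Data cleaning preparation

First, missing values that appear as blank strings in the categorical columns must be converted to standard NA values so that later operations treat them as missing.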
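The report does not show the code for this conversion, so the following is only a minimal base R sketch on a toy data frame. The helper name `blank_to_na` is hypothetical.

```r
# Toy data: blank and whitespace-only strings stand in for missing values
toy <- data.frame(
  Gender = c("Male", "", "Female", " "),
  Device = c("Android", "iOS", "", "iOS"),
  SessionCount = c(9L, 11L, 9L, 12L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: trim whitespace, then turn empty strings into NA
blank_to_na <- function(x) {
  x <- trimws(x)
  x[!is.na(x) & x == ""] <- NA_character_
  x
}

# Apply the helper to character columns only; numeric columns are untouched
is_chr <- vapply(toy, is.character, logical(1))
toy_clean <- toy
toy_clean[is_chr] <- lapply(toy[is_chr], blank_to_na)

# Missing values per column after the conversion
colSums(is.na(toy_clean))
```

In a dplyr pipeline the same idea can be written with `across()` and `na_if()`.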

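2. Key missingness patterns

The missingness heatmap shows two main groups of missing values: 60 rows and 136 rows.

Key relationship in the 136-row group: the missing values in InAppPurchaseAmount, FirstPurchaseDaysAfterInstall, and PaymentMethod overlap completely. There are two plausible explanations: either these rows belong to non-paying players (in which case the amounts should be filled with 0), or an error occurred when the data was exported.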
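A complete overlap like this can be checked directly instead of being read off a heatmap. The sketch below uses a toy data frame with the same three column names and is not the report's own code.

```r
# Toy data: the three payment columns are missing on exactly the same rows
toy <- data.frame(
  InAppPurchaseAmount           = c(11.4, NA, 15.8, NA, 10.9),
  FirstPurchaseDaysAfterInstall = c(28, NA, 30, NA, 15),
  PaymentMethod                 = c("Apple Pay", NA, "Debit Card", NA, "Paypal")
)

# Logical matrix of missingness, one column per variable
miss <- is.na(toy)

# Rows where every payment column is missing vs. rows where at least one is
all_missing <- rowSums(miss) == ncol(miss)
any_missing <- rowSums(miss) > 0

# Complete overlap means the two row sets coincide
overlap_is_complete <- identical(all_missing, any_missing)
overlap_is_complete
sum(all_missing)
```

If `overlap_is_complete` is `FALSE`, some rows are only partially missing and would need to be inspected separately before deciding between deletion and imputation.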

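3. Handling decisions and implementation strategy

Because this missingness is concentrated in the core payment-behavior fields, this report adopts the data-conversion-error hypothesis. To keep the analysis accurate and model training robust, the decisions are as follows:

Delete 136 records: the key payment information for these 136 players is missing and their reliability is low, so they are removed from the dataset.

Impute 60 records with the median or mode: the other 60 missing values in Age, Gender, Country, Device, and GameGenre are a small share of the data. Numerical columns are imputed with the median and categorical columns with the mode, which preserves the remaining non-payment behavioral information.

Recommended next steps

Run data cleaning and missing-value handling: delete the 136 records and impute the remaining 60.

Feature engineering: apply a log transformation to InAppPurchaseAmount, and encode or aggregate the categorical features as appropriate.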

Data Preprocessing

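library(dplyr)

# 1. Define the reference ("current") date
current_date <- as.Date("2025-08-24")

df <- df %>%
  # Drop the rows with no purchase amount (the 136 incomplete payment records)
  filter(!is.na(InAppPurchaseAmount)) %>%
  mutate(
    # 2. Convert the text column to Date
    LastPurchaseDate = as.Date(LastPurchaseDate),
    # 3. Compute the date difference (R returns a "difftime" object)
    #    and use as.numeric() to turn it into a number of days
    DaysSinceLastPurchase = as.numeric(difftime(current_date, LastPurchaseDate, units = "days"))
  )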

# Define feature lists
drop_cols <- c('UserID', 'LastPurchaseDate')  # Drop after engineering

numerical_cols <- c('Age', 'SessionCount', 'AverageSessionLength', 
                    'InAppPurchaseAmount', 'FirstPurchaseDaysAfterInstall', 
                    'DaysSinceLastPurchase')

categorical_cols_label <- c("SpendingSegment")  # low cardinality (ordered)

categorical_cols_onehot <- c('Gender', 'Device', 'PaymentMethod')  # low cardinality (unordered)

categorical_cols_freq <- c('Country', 'GameGenre')  # high cardinality

# Impute missing categorical values with the mode and numeric values with the median

# Mode helper (base R has no built-in function for the statistical mode)
get_mode <- function(v) {
  # Unique non-NA values
  uniqv <- unique(v[!is.na(v)])
  # If everything is NA, return NA
  if (length(uniqv) == 0) {
    return(NA)
  }
  # Return the most frequent value (the first one wins on ties)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# --- Columns to impute ---
numerical_cols_to_impute <- numerical_cols[1:3] # Age, SessionCount, AverageSessionLength
categorical_cols_to_impute <- c(categorical_cols_label, categorical_cols_onehot, categorical_cols_freq)

# --- Handle all columns in one pass with across() ---
df <- df %>%
  mutate(
    # Numeric columns:
    # across() applies the lambda after ~ to every column
    # listed in numerical_cols_to_impute
    across(all_of(numerical_cols_to_impute), 
           ~ ifelse(is.na(.), median(., na.rm = TRUE), .)),
    
    # Categorical columns:
    across(all_of(categorical_cols_to_impute),
           ~ ifelse(is.na(.), get_mode(.), .))
  )

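# 1. Ordering rule for the ordinal levels
size_levels <- c("Minnow", "Dolphin", "Whale")

# 2. Do all steps inside a single mutate()
df_raw <- df %>%
  mutate(
    # Build the final numeric column directly
    num_SpendingSegment = as.numeric(   # outer step (second): factor -> integer code
      factor(SpendingSegment,           # inner step (first): character -> ordered factor
             levels = size_levels,
             ordered = TRUE)
    )
  )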

categorical_cols_onehot

## [1] "Gender"        "Device"        "PaymentMethod"

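library(fastDummies)

# One-hot encode with dummy_cols()
# select_columns: the columns to encode
# remove_selected_columns = TRUE: drop the original columns after encoding
df_raw <- dummy_cols(
  df_raw,  # start from df_raw (not df) so that num_SpendingSegment is kept
  select_columns = categorical_cols_onehot, # pass the whole vector of column names
  remove_selected_columns = TRUE,  # drop the original Gender, Device, and PaymentMethod columns
  # Key point: create only K-1 dummy variables per column
  remove_first_dummy = TRUE
)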

# 3. Frequency-encode each high-cardinality column
for (col_name in categorical_cols_freq) {
  
  # 3a. Count how often each level occurs
  # paste0() names the new column dynamically (e.g. "Country_frequency", "GameGenre_frequency")
  # .data[[col_name]] tells dplyr that col_name holds a column name
  counts <- df_raw %>%
    count(.data[[col_name]], name = paste0(col_name, "_frequency"))
  
  print(paste("--- Processing:", col_name, "---"))
  print(counts)
  
  # 3b. Join the frequency table back onto df_raw
  # by = col_name joins on the column being encoded
  df_raw <- df_raw %>%
    left_join(counts, by = col_name)
}

## [1] "--- Processing: Country ---"
##         Country Country_frequency
## 1   Afghanistan                91
## 2     Australia               109
## 3    Bangladesh                99
## 4        Brazil               100
## 5        Canada                96
## 6         China                93
## 7       Denmark               101
## 8         Egypt                96
## 9        France                89
## 10      Germany               108
## 11        India               294
## 12         Iran               102
## 13        Italy               106
## 14        Japan               105
## 15       Mexico               107
## 16  Netherlands               109
## 17       Norway               104
## 18       Russia                88
## 19 Saudi Arabia               110
## 20  South Korea               101
## 21        Spain                95
## 22    Sri Lanka                85
## 23       Sweden                98
## 24  Switzerland               113
## 25       Turkey                84
## 26           UK               105
## 27          USA               100
## [1] "--- Processing: GameGenre ---"
##        GameGenre GameGenre_frequency
## 1     Action RPG                 184
## 2      Adventure                 162
## 3  Battle Royale                 179
## 4           Card                 195
## 5         Casual                 202
## 6       Fighting                 172
## 7         MMORPG                 188
## 8           MOBA                 173
## 9         Puzzle                 195
## 10        Racing                 182
## 11  Role Playing                 183
## 12       Sandbox                 207
## 13    Simulation                 267
## 14        Sports                 210
## 15      Strategy                 189

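library(psych)  # for describe()

# Standardize
df_clean <- df_raw %>%
  mutate(
    # scale() returns a one-column matrix; as.vector() strips the matrix attributes
    across(
      where(is.numeric), 
      ~ as.vector(scale(.x)) # key: convert the result back to a plain vector
    )
  )

describe(df_clean) # confirm that the columns are now standardized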

Key Observations

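I. Dataset cleaning and feature engineering

To build a clean, standardized dataset, we carried out the following key preprocessing steps:

1. Data type conversion:

LastPurchaseDate was converted to a date type. Blank strings in the categorical columns were converted to standard missing values.

2. Missing-value handling:

The 136 records with missing payment information (such as InAppPurchaseAmount) were deleted so that the analysis focuses on paying players. The remaining 60 missing values were imputed with the median or mode.

3. Numeric conversion and standardization:

InAppPurchaseAmount was transformed with $\log(1+x)$ to address its heavy right skew.

4. Categorical encoding: Gender, Device, and PaymentMethod use one-hot encoding; SpendingSegment uses label encoding; Country and GameGenre use frequency encoding.

All numerical features were standardized.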

Chi-squared / Fisher's exact tests

## Warning in chisq.test(table(df$Gender, df$SpendingSegment)): Chi-squared
## approximation may be incorrect

## [1] "--- Chi-squared test results: Gender vs. SpendingSegment ---"

## 
##  Pearson's Chi-squared test
## 
## data:  table(df$Gender, df$SpendingSegment)
## X-squared = 5.0399, df = 4, p-value = 0.2832

## Warning in chisq.test(table(df$Country, df$SpendingSegment)): Chi-squared
## approximation may be incorrect

## [1] "--- Chi-squared test results: Country vs. SpendingSegment ---"

## 
##  Pearson's Chi-squared test
## 
## data:  table(df$Country, df$SpendingSegment)
## X-squared = 48.349, df = 52, p-value = 0.6183

## [1] "--- Chi-squared test results: Device vs. SpendingSegment ---"

## 
##  Pearson's Chi-squared test
## 
## data:  table(df$Device, df$SpendingSegment)
## X-squared = 0.59249, df = 2, p-value = 0.7436

## Warning in chisq.test(table(df$GameGenre, df$SpendingSegment)): Chi-squared
## approximation may be incorrect

## [1] "--- Chi-squared test results: GameGenre vs. SpendingSegment ---"

## 
##  Pearson's Chi-squared test
## 
## data:  table(df$GameGenre, df$SpendingSegment)
## X-squared = 34.372, df = 28, p-value = 0.1889

## [1] "--- Chi-squared test results: PaymentMethod vs. SpendingSegment ---"

## 
##  Pearson's Chi-squared test
## 
## data:  table(df$PaymentMethod, df$SpendingSegment)
## X-squared = 9.9452, df = 12, p-value = 0.6208

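## Gender: number of cross-tabulation cells with expected count below 5: 1

## Country: number of cross-tabulation cells with expected count below 5: 26

## Device: number of cross-tabulation cells with expected count below 5: 0

## GameGenre: number of cross-tabulation cells with expected count below 5: 14

## PaymentMethod: number of cross-tabulation cells with expected count below 5: 0

## [1] "--- Fisher's exact test results: Gender vs. SpendingSegment ---"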

## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  10000 replicates)
## 
## data:  table(df$Gender, df$SpendingSegment)
## p-value = 0.3325
## alternative hypothesis: two.sided

## [1] "--- Fisher's exact test results: Country vs. SpendingSegment ---"

## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  10000 replicates)
## 
## data:  table(df$Country, df$SpendingSegment)
## p-value = 0.6188
## alternative hypothesis: two.sided

## [1] "--- Fisher's exact test results: GameGenre vs. SpendingSegment ---"

## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  10000 replicates)
## 
## data:  table(df$GameGenre, df$SpendingSegment)
## p-value = 0.2667
## alternative hypothesis: two.sided

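This analysis tests whether five core categorical variables (Gender, Country, Device, GameGenre, PaymentMethod) are statistically associated with the target variable SpendingSegment. Because some cells in the Gender, Country, and GameGenre cross-tabulations have expected counts below 5, those three variables were retested with Fisher's exact test (simulated p-values based on 10,000 replicates), which does not rely on the chi-squared approximation.

All p-values, across both tests, fall between 0.1889 and 0.7436 and are well above the 0.05 significance level. We therefore cannot reject the null hypothesis that each variable is independent of SpendingSegment.

Conclusion: a user's gender, country, device type, game genre, and payment method show no statistically significant association with the tendency to become a high-value player. These basic attributes are likely to be weak features in a predictive model, so the analysis should shift its focus to behavioral features that reflect engagement and value (such as SessionCount and DaysSinceLastPurchase).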
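The report does not include the code behind these tests. As a minimal sketch of the procedure described above (check the expected counts, then fall back to Fisher's exact test with a simulated p-value), the following uses synthetic data with assumed category proportions. `B = 10000` matches the replicate count shown in the output.

```r
set.seed(1)
n <- 600

# Synthetic categorical data (illustrative only, proportions are assumptions)
toy <- data.frame(
  Gender = sample(c("Female", "Male", "Other"), n, replace = TRUE,
                  prob = c(0.37, 0.61, 0.02)),
  SpendingSegment = sample(c("Dolphin", "Minnow", "Whale"), n, replace = TRUE,
                           prob = c(0.14, 0.84, 0.02))
)
tab <- table(toy$Gender, toy$SpendingSegment)

# Expected counts under independence; the chi-squared approximation is
# usually considered unreliable when some of them fall below 5
chi <- suppressWarnings(chisq.test(tab))
n_low <- sum(chi$expected < 5)

# Fall back to Fisher's exact test with a Monte Carlo p-value when needed
if (n_low > 0) {
  res <- fisher.test(tab, simulate.p.value = TRUE, B = 10000)
} else {
  res <- chi
}

n_low
res$p.value
```

Simulating the p-value keeps Fisher's test tractable for large tables such as Country by SpendingSegment, at the cost of a small Monte Carlo error in the reported value.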

Difference analysis

## 
##  Looks normal: Age 
## [1] 0.009

## 
##  Looks normal: SessionCount 
## [1] 0.279

## 
##  Looks normal: AverageSessionLength 
## [1] -0.037

## 
## Skewed! InAppPurchaseAmount 
## [1] 7.693

## 
##  Looks normal: FirstPurchaseDaysAfterInstall 
## [1] -0.078

## 
##  Looks normal: DaysSinceLastPurchase 
## [1] 0.029

## 
## Skewed! Gender_Male 
## [1] -0.501

## 
## Skewed! Gender_Other 
## [1] 7.103

## 
##  Looks normal: Device_iOS 
## [1] 0.392

## 
## Skewed! PaymentMethod_Carrier Billing 
## [1] 2.022

## 
## Skewed! PaymentMethod_Credit Card 
## [1] 2.051

## 
## Skewed! PaymentMethod_Debit Card 
## [1] 1.96

## 
## Skewed! PaymentMethod_Gift Card 
## [1] 2.003

## 
## Skewed! PaymentMethod_Google Pay 
## [1] 1.968

## 
## Skewed! PaymentMethod_Paypal 
## [1] 2.088

## 
## Skewed! Country_frequency 
## [1] 2.556

## 
## Skewed! GameGenre_frequency 
## [1] 1.693

Not every variable passes the normality check, so nonparametric methods are used for the tests.
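The check above labels each column from its skewness alone. The report does not state the cutoff it used; the flags shown are consistent with an absolute skewness of about 0.5, which the sketch below adopts purely as an assumption. The helper names `skewness` and `flag_skew` are hypothetical.

```r
# Sample skewness: third central moment divided by the cubed standard deviation
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

# Hypothetical cutoff: the report does not state the threshold it used
skew_cutoff <- 0.5

# Label a column from its absolute skewness
flag_skew <- function(x, cutoff = skew_cutoff) {
  if (abs(skewness(x)) > cutoff) "skewed" else "roughly symmetric"
}

# Synthetic examples (illustrative only)
set.seed(7)
symmetric_x <- rnorm(1000)  # symmetric by construction
skewed_x    <- rexp(1000)   # right-skewed by construction

c(symmetric = flag_skew(symmetric_x), skewed = flag_skew(skewed_x))
```

A low skewness only shows that a distribution is roughly symmetric; it does not establish normality, so a flat or multimodal column can still pass this check. Using nonparametric tests throughout, as the report does, avoids relying on that distinction.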

Wilcoxon Test

## [1] "--- Wilcoxon test results: InAppPurchaseAmount vs. Device ---"

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  InAppPurchaseAmount by Device
## W = 963785, p-value = 0.06789
## alternative hypothesis: true location shift is not equal to 0

## [1] "--- Wilcoxon test results: DaysSinceLastPurchase vs. Device ---"

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  DaysSinceLastPurchase by Device
## W = 1025054, p-value = 0.3366
## alternative hypothesis: true location shift is not equal to 0

## [1] "--- Wilcoxon test results: AverageSessionLength vs. Device ---"

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  AverageSessionLength by Device
## W = 983512, p-value = 0.3532
## alternative hypothesis: true location shift is not equal to 0

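Using the Wilcoxon rank-sum test, we examined whether the core player metrics differ between device platforms (Device).

The results show that iOS and Android users do not differ significantly in purchase amount (InAppPurchaseAmount), average session length (AverageSessionLength), or churn propensity (DaysSinceLastPurchase). All three p-values exceed 0.05 (purchase amount 0.06789, average session length 0.3532, churn propensity 0.3366), although the purchase amount result is close to the threshold.

Final conclusion: device type shows no statistically significant relationship with a user's monetary value, session length, or recency. Device should therefore be treated as a weak predictor when building a predictive model, and the analysis should concentrate on more discriminating behavioral features.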
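For reference, a two-group comparison of this kind can be run with base R's `wilcox.test()` and a formula. This is a minimal sketch on synthetic data with the report's column names, not the report's own code.

```r
set.seed(3)

# Synthetic data: two device groups drawn from the same skewed distribution
toy <- data.frame(
  Device = rep(c("Android", "iOS"), times = c(120, 80)),
  InAppPurchaseAmount = rexp(200, rate = 1 / 12)
)

# Wilcoxon rank-sum test: numeric outcome on the left, two-level group on the right
res <- wilcox.test(InAppPurchaseAmount ~ Device, data = toy)
res$p.value

# Group medians give the direction and size of any difference
aggregate(InAppPurchaseAmount ~ Device, data = toy, FUN = median)
```

Reporting the group medians next to the p-value helps separate "no detectable difference" from "a difference too small to matter".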

Kruskal Test

## [1] "--- Kruskal-Wallis test results: InAppPurchaseAmount vs. Gender ---"

## 
##  Kruskal-Wallis rank sum test
## 
## data:  InAppPurchaseAmount by Gender
## Kruskal-Wallis chi-squared = 0.72908, df = 2, p-value = 0.6945

## [1] "--- Kruskal-Wallis test results: InAppPurchaseAmount vs. Country ---"

## 
##  Kruskal-Wallis rank sum test
## 
## data:  InAppPurchaseAmount by Country
## Kruskal-Wallis chi-squared = 35.437, df = 26, p-value = 0.1025

## [1] "--- Kruskal-Wallis test results: InAppPurchaseAmount vs. GameGenre ---"

## 
##  Kruskal-Wallis rank sum test
## 
## data:  InAppPurchaseAmount by GameGenre
## Kruskal-Wallis chi-squared = 10.507, df = 14, p-value = 0.7243

## [1] "--- Kruskal-Wallis test results: InAppPurchaseAmount vs. PaymentMethod ---"

## 
##  Kruskal-Wallis rank sum test
## 
## data:  InAppPurchaseAmount by PaymentMethod
## Kruskal-Wallis chi-squared = 3.3457, df = 6, p-value = 0.7644

## [1] "--- Kruskal-Wallis test results: AverageSessionLength vs. Gender ---"

## 
##  Kruskal-Wallis rank sum test
## 
## data:  AverageSessionLength by Gender
## Kruskal-Wallis chi-squared = 1.5851, df = 2, p-value = 0.4527

## [1] "--- Kruskal-Wallis test results: AverageSessionLength vs. Country ---"

## 
##  Kruskal-Wallis rank sum test
## 
## data:  AverageSessionLength by Country
## Kruskal-Wallis chi-squared = 32.605, df = 26, p-value = 0.1739

## [1] "--- Kruskal-Wallis test results: AverageSessionLength vs. GameGenre ---"

## 
##  Kruskal-Wallis rank sum test
## 
## data:  AverageSessionLength by GameGenre
## Kruskal-Wallis chi-squared = 11.018, df = 14, p-value = 0.6846

## [1] "--- Kruskal-Wallis test results: AverageSessionLength vs. PaymentMethod ---"

## 
##  Kruskal-Wallis rank sum test
## 
## data:  AverageSessionLength by PaymentMethod
## Kruskal-Wallis chi-squared = 1.975, df = 6, p-value = 0.922

## [1] "--- Kruskal-Wallis test results: DaysSinceLastPurchase vs. Gender ---"

## 
##  Kruskal-Wallis rank sum test
## 
## data:  DaysSinceLastPurchase by Gender
## Kruskal-Wallis chi-squared = 1.4987, df = 2, p-value = 0.4727

## [1] "--- Kruskal-Wallis test results: DaysSinceLastPurchase vs. Country ---"

## 
##  Kruskal-Wallis rank sum test
## 
## data:  DaysSinceLastPurchase by Country
## Kruskal-Wallis chi-squared = 29.496, df = 26, p-value = 0.289

## [1] "--- Kruskal-Wallis test results: DaysSinceLastPurchase vs. GameGenre ---"

## 
##  Kruskal-Wallis rank sum test
## 
## data:  DaysSinceLastPurchase by GameGenre
## Kruskal-Wallis chi-squared = 10.262, df = 14, p-value = 0.7428

## [1] "--- Kruskal-Wallis test results: DaysSinceLastPurchase vs. PaymentMethod ---"

## 
##  Kruskal-Wallis rank sum test
## 
## data:  DaysSinceLastPurchase by PaymentMethod
## Kruskal-Wallis chi-squared = 5.0641, df = 6, p-value = 0.5356

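The Kruskal-Wallis tests show that none of the core behavioral metrics (purchase amount, average session length, days since last purchase) differs significantly across the main categorical groups (all p-values exceed $\alpha = 0.05$).

Purchase amount (InAppPurchaseAmount): Gender ($P = 0.6945$), Country ($P = 0.1025$), GameGenre ($P = 0.7243$), and PaymentMethod ($P = 0.7644$) have no statistically significant effect on the distribution of purchase amounts. Insight: even after allowing for the skew in the data, a player's country, gender, or preferred game genre shows no detectable association with eventual spending.

Session length and churn propensity (engagement and churn propensity): for AverageSessionLength, every p-value is above $0.17$ (the lowest, for Country, is $0.1739$), so average session length does not differ significantly across gender, country, game genre, or payment method. For DaysSinceLastPurchase, every p-value is above $0.28$, so churn propensity shows no significant association with demographic attributes or preferences.

Final conclusion: a necessary shift in modeling strategy

Taken together, these tests confirm the earlier insight: basic categorical attributes (demographics and preferences) have little or no detectable effect on the core behavioral metrics.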
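The twelve tests above share one pattern, so they can be run from a grid of outcome and grouping columns. The sketch below uses synthetic data and a reduced set of columns. The Holm adjustment via `p.adjust()` is an addition for illustration and is not something the report applies.

```r
set.seed(5)
n <- 300

# Synthetic data (illustrative only)
toy <- data.frame(
  Gender = sample(c("Female", "Male", "Other"), n, replace = TRUE),
  PaymentMethod = sample(c("Apple Pay", "Credit Card", "Paypal", "Gift Card"),
                         n, replace = TRUE),
  InAppPurchaseAmount = rexp(n, rate = 1 / 12),
  AverageSessionLength = runif(n, min = 5, max = 35)
)

outcomes <- c("InAppPurchaseAmount", "AverageSessionLength")
groups   <- c("Gender", "PaymentMethod")

# One row per outcome/group combination
grid <- expand.grid(outcome = outcomes, group = groups, stringsAsFactors = FALSE)

# Build the formula "outcome ~ group" for each row and keep the p-value
grid$p_value <- mapply(function(y, g) {
  kruskal.test(reformulate(g, response = y), data = toy)$p.value
}, grid$outcome, grid$group, USE.NAMES = FALSE)

# Assumption: adjust for multiple testing with the Holm method
grid$p_adjusted <- p.adjust(grid$p_value, method = "holm")
grid
```

With many tests run side by side, an adjusted p-value guards against reading one chance result as a finding. In this report every raw p-value is already above 0.05, so the adjustment would not change the conclusions.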

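Final analysis summary and strategy confirmation

Core findings (consolidated across all test results):

1. Independence of demographics:

Under the Fisher and Kruskal-Wallis tests, none of the basic categorical features (Gender, Country, Device, GameGenre, PaymentMethod) shows a statistically significant association with the core behavioral metrics (purchase amount, session length, churn propensity).

2. Conclusion:

These static features show no detectable predictive power for a user's spending value.

Structural skew and weak correlations: InAppPurchaseAmount and similar columns remain skewed even after log transformation, which points to the complexity of the data and to zero inflation. Correlations among the numerical features are very low ($|r| < 0.20$), so multicollinearity is unlikely to be a concern, and a tree-based model such as Random Forest is a reasonable choice.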
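The correlation analysis itself is not reproduced in this report, so the following is only a sketch of how the $|r| < 0.20$ claim could be checked. It assumes Spearman correlation, which is a choice made here because of the skewed columns, and uses synthetic, independently generated features.

```r
set.seed(11)
n <- 500

# Synthetic numerical features, generated independently (illustrative only)
toy <- data.frame(
  Age = sample(13:54, n, replace = TRUE),
  SessionCount = rpois(n, lambda = 10),
  AverageSessionLength = runif(n, min = 5, max = 35),
  InAppPurchaseAmount = rexp(n, rate = 1 / 12)
)

# Spearman correlation is rank-based, so extreme amounts do not dominate
r <- cor(toy, method = "spearman", use = "pairwise.complete.obs")

# Largest absolute correlation among distinct pairs (upper triangle only)
max_abs_r <- max(abs(r[upper.tri(r)]))
max_abs_r < 0.20
```

Low pairwise correlations rule out strong linear or monotonic redundancy between features. They do not by themselves indicate which model family will perform best, so that choice is better settled by validation.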

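3. Definition of payment churn:

The 136 logically inconsistent records were deleted, so that subsequent analysis rests on a logically consistent "paying player group" and a "credible Minnow group".

Final conclusion: a necessary shift in modeling strategy

The conclusion of this analysis is clear: the predictive model should focus on numerical features that reflect a player's level of engagement and behavioral intent. Any attempt to separate high-value players using basic demographic data lacks statistical support in this dataset.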

EDA_Report

2025-11-12
