2023 Thick Data and Meaning Mining: Hands-On Practice

About the Data Source

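The data for this exercise come from two waves of an online survey provided by smilepoll.tw. The two waves were merged by respondent id into a panel dataset. The first wave (code B) is the "Unification-Independence Opinion Survey", fielded 2018.10.22 to 2018.11.13, N=886 (completion rate 86.7%). The second wave (code D) is the "2018 Local Election Post-Election Mood Notes", fielded 2019.01.21 to 2019.02.19, N=1,297 (completion rate 78%).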
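
Merging two waves by respondent id can be sketched as below. The data frames `wave_B` and `wave_D`, the key column `id`, and all values are hypothetical; the actual merge behind the course data file is not shown in this document. An inner join keeps only respondents who answered both waves, so a panel file built this way has fewer rows than either wave.

```r
# Toy stand-ins for the two survey waves (hypothetical values and column names)
wave_B <- data.frame(id = c(1, 2, 3, 4), B46r = c(1, 2, 2, 3))
wave_D <- data.frame(id = c(2, 3, 4, 5, 6), D25r = c(1, 1, 2, 3, 2))

# Inner join on respondent id: only people present in BOTH waves are kept
panel <- merge(wave_B, wave_D, by = "id")

print(panel)
nrow(panel)  # number of respondents observed in both waves
```

Passing `all = TRUE` to `merge()` would instead keep everyone and fill the unanswered wave with `NA`.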

Project Setup

Open a new project and put both the script file and the data file (dataBD.rda) in that project folder.
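
Once the files are in place, loading the data might look like the sketch below. It assumes the working directory is the project folder; the object names stored inside the file are whatever `load()` reports, and are not guessed here.

```r
# Assumes the working directory is the project folder that holds dataBD.rda
data_file <- "dataBD.rda"

if (file.exists(data_file)) {
  loaded_objects <- load(data_file)  # load() returns the names of the restored objects
  print(loaded_objects)
} else {
  message("Cannot find ", data_file, " in ", getwd())
}
```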

Loading the Data and Inspecting Variables

## Learn more about sjPlot with 'browseVignettes("sjPlot")'.

## Gender (x) <categorical> 
## # total N=579 valid N=576 mean=0.43 sd=0.50
## 
## Value |  Label |   N | Raw % | Valid % | Cum. %
## -----------------------------------------------
##     0 | Female | 329 | 56.82 |   57.12 |  57.12
##     1 |   Male | 247 | 42.66 |   42.88 | 100.00
##  <NA> |   <NA> |   3 |  0.52 |    <NA> |   <NA>
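
The columns of a frequency table like the one above can be reproduced by hand. The sketch below uses base R only and rebuilds the gender vector from the printed counts; it illustrates how Raw %, Valid % and Cum. % relate to each other, and is not the code that produced the table.

```r
# Rebuild the gender vector from the counts in the table above:
# 329 coded 0 (female), 247 coded 1 (male), 3 missing
gender <- c(rep(0, 329), rep(1, 247), rep(NA, 3))

n_valid   <- table(gender)                          # counts, NA excluded
raw_pct   <- 100 * n_valid / length(gender)         # denominator includes NA
valid_pct <- 100 * n_valid / sum(!is.na(gender))    # denominator excludes NA
cum_pct   <- cumsum(valid_pct)                      # running total of valid %

freq <- data.frame(
  value     = names(n_valid),
  n         = as.vector(n_valid),
  raw_pct   = round(as.vector(raw_pct), 2),
  valid_pct = round(as.vector(valid_pct), 2),
  cum_pct   = round(as.vector(cum_pct), 2)
)
print(freq)
```

Raw % divides by all 579 respondents, while Valid % divides by the 576 who answered, which is why the two columns differ slightly.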

## x <numeric> 
## # total N=579 valid N=579 mean=34.95 sd=11.40
## 
## Value |  N | Raw % | Valid % | Cum. %
## -------------------------------------
##    15 |  2 |  0.35 |    0.35 |   0.35
##    17 |  4 |  0.69 |    0.69 |   1.04
##    18 |  4 |  0.69 |    0.69 |   1.73
##    19 | 12 |  2.07 |    2.07 |   3.80
##    20 | 11 |  1.90 |    1.90 |   5.70
##    21 | 13 |  2.25 |    2.25 |   7.94
##    22 | 13 |  2.25 |    2.25 |  10.19
##    23 | 21 |  3.63 |    3.63 |  13.82
##    24 | 20 |  3.45 |    3.45 |  17.27
##    25 | 23 |  3.97 |    3.97 |  21.24
##    26 | 29 |  5.01 |    5.01 |  26.25
##    27 | 17 |  2.94 |    2.94 |  29.19
##    28 | 21 |  3.63 |    3.63 |  32.82
##    29 | 13 |  2.25 |    2.25 |  35.06
##    30 | 26 |  4.49 |    4.49 |  39.55
##    31 | 25 |  4.32 |    4.32 |  43.87
##    32 | 17 |  2.94 |    2.94 |  46.80
##    33 | 31 |  5.35 |    5.35 |  52.16
##    34 | 22 |  3.80 |    3.80 |  55.96
##    35 | 25 |  4.32 |    4.32 |  60.28
##    36 | 24 |  4.15 |    4.15 |  64.42
##    37 | 15 |  2.59 |    2.59 |  67.01
##    38 | 14 |  2.42 |    2.42 |  69.43
##    39 | 13 |  2.25 |    2.25 |  71.68
##    40 | 11 |  1.90 |    1.90 |  73.58
##    41 | 12 |  2.07 |    2.07 |  75.65
##    42 | 11 |  1.90 |    1.90 |  77.55
##    43 |  9 |  1.55 |    1.55 |  79.10
##    44 |  6 |  1.04 |    1.04 |  80.14
##    45 | 16 |  2.76 |    2.76 |  82.90
##    46 |  8 |  1.38 |    1.38 |  84.28
##    47 | 12 |  2.07 |    2.07 |  86.36
##    48 |  9 |  1.55 |    1.55 |  87.91
##    49 |  6 |  1.04 |    1.04 |  88.95
##    50 |  6 |  1.04 |    1.04 |  89.98
##    51 |  4 |  0.69 |    0.69 |  90.67
##    52 |  7 |  1.21 |    1.21 |  91.88
##    53 |  5 |  0.86 |    0.86 |  92.75
##    54 |  5 |  0.86 |    0.86 |  93.61
##    55 |  4 |  0.69 |    0.69 |  94.30
##    56 |  2 |  0.35 |    0.35 |  94.65
##    57 |  4 |  0.69 |    0.69 |  95.34
##    58 |  4 |  0.69 |    0.69 |  96.03
##    60 |  2 |  0.35 |    0.35 |  96.37
##    61 |  1 |  0.17 |    0.17 |  96.55
##    62 |  3 |  0.52 |    0.52 |  97.06
##    63 |  8 |  1.38 |    1.38 |  98.45
##    64 |  1 |  0.17 |    0.17 |  98.62
##    66 |  3 |  0.52 |    0.52 |  99.14
##    68 |  1 |  0.17 |    0.17 |  99.31
##    74 |  1 |  0.17 |    0.17 |  99.48
##    78 |  1 |  0.17 |    0.17 |  99.65
##    79 |  1 |  0.17 |    0.17 |  99.83
##    90 |  1 |  0.17 |    0.17 | 100.00
##  <NA> |  0 |  0.00 |    <NA> |   <NA>

## Education level (x) <categorical> 
## # total N=579 valid N=572 mean=2.03 sd=0.53
## 
## Value |           Label |   N | Raw % | Valid % | Cum. %
## --------------------------------------------------------
##     1 |   Below college |  72 | 12.44 |   12.59 |  12.59
##     2 |         College | 408 | 70.47 |   71.33 |  83.92
##     3 | Graduate school |  92 | 15.89 |   16.08 | 100.00
##  <NA> |            <NA> |   7 |  1.21 |    <NA> |   <NA>

## Place of residence (x) <categorical> 
## # total N=579 valid N=558 mean=7.98 sd=5.58
## 
## Value |                                      Label |   N | Raw % | Valid % | Cum. %
## -----------------------------------------------------------------------------------
##     1 |                                Taipei City |  56 |  9.67 |   10.04 |  10.04
##     2 |                            New Taipei City | 120 | 20.73 |   21.51 |  68.64
##     3 |                               Keelung City |   5 |  0.86 |    0.90 |  70.07
##     4 |                               Taoyuan City |  42 |  7.25 |    7.53 |  77.60
##     5 |                               Hsinchu City |  13 |  2.25 |    2.33 |  79.93
##     6 |                             Hsinchu County |  12 |  2.07 |    2.15 |  82.08
##     7 |                              Miaoli County |   8 |  1.38 |    1.43 |  83.51
##     8 |                              Taichung City |  73 | 12.61 |   13.08 |  96.59
##     9 |                            Changhua County |  19 |  3.28 |    3.41 | 100.00
##    10 |                              Nantou County |   5 |  0.86 |    0.90 |  10.93
##    11 |                              Yunlin County |  18 |  3.11 |    3.23 |  14.16
##    12 |                                Chiayi City |   5 |  0.86 |    0.90 |  15.05
##    13 |                              Chiayi County |  13 |  2.25 |    2.33 |  17.38
##    14 |                                Tainan City |  45 |  7.77 |    8.06 |  25.45
##    15 |                             Kaohsiung City |  99 | 17.10 |   17.74 |  43.19
##    16 |                            Pingtung County |  14 |  2.42 |    2.51 |  45.70
##    17 |                             Taitung County |   0 |  0.00 |    0.00 |  45.70
##    18 |                             Hualien County |   5 |  0.86 |    0.90 |  46.59
##    19 |                               Yilan County |   3 |  0.52 |    0.54 |  47.13
##    20 |                              Penghu County |   2 |  0.35 |    0.36 |  69.00
##    21 |                              Kinmen County |   1 |  0.17 |    0.18 |  69.18
##    22 |                          Lienchiang County |   0 |  0.00 |    0.00 |  69.18
##    23 | Mainland China (incl. Hong Kong and Macau) |   0 |  0.00 |    0.00 |  69.18
##    24 |                                      Other |   0 |  0.00 |    0.00 |  69.18
##  <NA> |                                       <NA> |  21 |  3.63 |    <NA> |   <NA>
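
The Cum. % column of the residence table does not increase from top to bottom. Its values are consistent with the codes having been accumulated in text sort order ("1", "10", "11", ..., "19", "2", "20", ...) rather than numeric order. The sketch below, using only the printed counts, recomputes the column both ways; the reason for the text ordering is not stated in the source, so treat that part as an inference.

```r
# Counts for residence codes 1..24, copied from the N column of the table above
n <- c(56, 120, 5, 42, 13, 12, 8, 73, 19, 5, 18, 5, 13, 45, 99, 14,
       0, 5, 3, 2, 1, 0, 0, 0)
names(n) <- 1:24

# Cumulative valid % in numeric (displayed) order
cum_numeric <- round(100 * cumsum(n) / sum(n), 2)

# Cumulative valid % when the codes are sorted as text
text_order <- order(as.character(1:24), method = "radix")
cum_text   <- round(100 * cumsum(n[text_order]) / sum(n), 2)

cum_numeric["2"]  # New Taipei City when accumulating in numeric order
cum_text["2"]     # New Taipei City when accumulating in text order (printed value: 68.64)
```

Read top to bottom, `cum_numeric` is the column a reader would normally expect.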

Based on the information in your tables and charts, write down your basic impressions of this dataset here:

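1. Women outnumber men (57.12% versus 42.88% of valid responses).

2. Ages cluster most heavily around 30, with a noticeable dip just before 30 (13 respondents aged 29) and a small bump in the early 60s (8 respondents aged 63). There is even one respondent aged 90. I suspect the age profile reflects how often people use the internet, or the audience that the polling organization reaches.

3. College-level education is the most common (71.33% of valid responses). I suspect the questionnaire was distributed by college students, which would raise their share, or that college students were more willing to respond because of research or course projects.

4. The six special municipalities still account for the largest shares: New Taipei City is the largest (21.51%), and Taichung City (13.08%) exceeds Taipei City (10.04%). Yunlin County (3.23%) exceeds Pingtung County (2.51%). The outlying islands have far smaller shares than the main island, and there are no respondents from Lienchiang County.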

Variable Selection and MCA Analysis

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

## [1] 564

##  [1] "Gender"  "college" "B23r"    "B25r"    "B29r"    "B33r"    "B39r"   
##  [8] "B42r"    "B46r"    "B47r"    "B51r"    "B53r"    "B54r"    "B56r"   
## [15] "B57r"    "D25r"    "D52r"    "D58r"    "D61r"    "D81r"    "D99r"   
## [22] "D146r"   "D147r"
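
The code for this step is not echoed in the knitted output, so the following is only a sketch of what a multiple correspondence analysis on selected items could look like. It uses simulated data rather than the real survey, assumes the FactoMineR package is installed, and treats `Gender` and `college` as supplementary qualitative variables, which is an assumption rather than something the source states.

```r
# Simulated stand-in for the selected survey items (NOT the real dataBD)
set.seed(1)
n <- 200
toy <- data.frame(
  Gender  = factor(sample(c("female", "male"), n, replace = TRUE)),
  college = factor(sample(c("no", "yes"), n, replace = TRUE)),
  B46r    = factor(sample(1:3, n, replace = TRUE)),
  B47r    = factor(sample(1:3, n, replace = TRUE)),
  B57r    = factor(sample(1:3, n, replace = TRUE))
)

toy <- na.omit(toy)  # keep complete cases only
nrow(toy)            # number of respondents entering the analysis

if (requireNamespace("FactoMineR", quietly = TRUE)) {
  # Assumption: columns 1 and 2 are supplementary, the rest are active
  res_toy <- FactoMineR::MCA(toy, quali.sup = 1:2, graph = FALSE)
  print(head(res_toy$eig))        # eigenvalues and share of variance per dimension
  print(head(res_toy$var$coord))  # category coordinates used to draw the map
}
```

Supplementary variables are projected onto the map without influencing the axes, which makes them convenient for demographic background variables.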

Exploration

# The 20 best-represented response categories (highest cos2)
plot(resBD, axes=c(1, 2), new.plot=TRUE, 
     col.var="black", col.ind="black", col.ind.sup="black",
     col.quali.sup="darkgreen", col.quanti.sup="blue",
     label=c("var"), cex=0.7, 
     selectMod = "cos2 20",   # try raising 20 to a higher number to see more variable categories
     invisible=c("ind", "quali.sup"), 
     autoLab = "yes",
     xlim=c(-1, 1.5), ylim=c(-1.3, 1.5),
     title="")

## Warning: Removed 1 rows containing missing values (`geom_point()`).

## Warning: Removed 1 rows containing missing values (`geom_text_repel()`).
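
The `selectMod = "cos2 20"` setting keeps the categories that are best represented on the plotted plane. The idea can be sketched in base R with a small, entirely hypothetical cos2 matrix: sum the cos2 over the two plotted dimensions and keep the top k.

```r
# Hypothetical cos2 values for six categories on dimensions 1 and 2
cos2 <- matrix(
  c(0.40, 0.10,
    0.05, 0.02,
    0.20, 0.35,
    0.01, 0.03,
    0.30, 0.05,
    0.10, 0.10),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("B46r_1", "B46r_2", "B47r_1", "B47r_2", "B57r_1", "B57r_2"),
    c("Dim 1", "Dim 2")
  )
)

# Quality of representation on the plane = sum of cos2 over the two plotted axes
plane_quality <- rowSums(cos2)

# Keep the k best-represented categories (k = 3 here; the plot above uses 20)
k <- 3
top_k <- names(sort(plane_quality, decreasing = TRUE))[seq_len(k)]
top_k
```

With a fitted MCA result such as `resBD`, the same ranking could presumably be computed from `rowSums(resBD$var$cos2[, 1:2])`.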

# Map of respondents
plot(resBD, axes=c(1, 2), new.plot=TRUE,
     col.var="red", col.ind="brown", col.ind.sup="black",
     col.quali.sup="darkgreen", col.quanti.sup="blue",
     label=c("var"), cex=0.8,
     selectMod = "cos2",
     invisible=c("var", "quali.sup"),
     xlim=c(-1, 1.5),
     title="")

My Observations

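1. B46r_1 (people's livelihood would improve after unification) and B47r_1 (the economy would improve after unification): I expected livelihood to be mainly about the economy, yet the two categories appear only weakly related on the plot. I suspect respondents see livelihood as broader than the economy, also covering the social welfare system, the education system, and so on.

2. B29r_2 (would not accept forming one country with a democratized mainland China), B47r_2 (the economy would not improve after unification), and B57r_1 (identifies solely as Taiwanese) are more closely associated. This matches the main campaign message of the pan-Green camp, which wants Taiwanese to be one people with one country: Taiwan is a separate entity from China, the two peoples differ in character, and unification would bring no economic benefit. It is therefore reasonable to infer that this group leans toward the pan-Green camp.

3. B25r_2 (the name of the country in the respondent's mind is the Republic of China) lies close to B39r_1 (can accept the mainland residence permit). This suggests that this group supports "Republic of China" as the country's name, sees the country as currently divided, and can therefore accept that the two sides of the strait currently belong to different systems.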

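Bold Hypotheses

Now, using the plot, find questionnaire items that you think are interesting and related, and write down your hypotheses.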

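## Use a chi-square test to check the association between potentially related variables
library(sjPlot)
library(sjmisc)

sjt.xtab(dataBD$B46r, dataBD$B47r,    ## replace these two variables with the ones you want to examine
         show.row.prc = TRUE, # show row percentages
         show.col.prc = TRUE  # show column percentages
)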

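B46r (rows) by B47r (columns). Each cell: N / row % / column %.

B46r  | B47r = 1              | B47r = 2              | B47r = 3             | Total
------------------------------------------------------------------------------------------------
1     | 102 / 91.1 % / 45.1 % | 9 / 8 % / 3.7 %       | 1 / 0.9 % / 0.9 %    | 112 / 100 % / 19.3 %
2     | 101 / 27.8 % / 44.7 % | 228 / 62.8 % / 93.1 % | 34 / 9.4 % / 31.5 %  | 363 / 100 % / 62.7 %
3     | 23 / 22.1 % / 10.2 %  | 8 / 7.7 % / 3.3 %     | 73 / 70.2 % / 67.6 % | 104 / 100 % / 18 %
Total | 226 / 39 % / 100 %    | 245 / 42.3 % / 100 %  | 108 / 18.7 % / 100 % | 579 / 100 % / 100 %
χ²=377.438 · df=4 · Cramer’s V=0.571 · p=0.000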
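
The test statistics under the table can be checked by hand from the cell counts. The sketch below uses base R's `chisq.test()` and the usual Cramér's V formula, V = sqrt(χ² / (N × (min(rows, cols) − 1))); it is an independent check, not the code sjPlot runs internally.

```r
# Cell counts copied from the B46r x B47r table above
counts <- matrix(
  c(102,   9,  1,
    101, 228, 34,
     23,   8, 73),
  nrow = 3, byrow = TRUE,
  dimnames = list(B46r = c("1", "2", "3"), B47r = c("1", "2", "3"))
)

test <- chisq.test(counts)
chi2 <- unname(test$statistic)  # Pearson chi-square statistic
n    <- sum(counts)             # total number of respondents

# Cramer's V rescales chi-square to the 0..1 range
cramers_v <- sqrt(chi2 / (n * (min(dim(counts)) - 1)))

round(chi2, 3)
round(cramers_v, 3)
test$p.value
```

A V of about 0.57 indicates a strong association between the two items. The printed `p=0.000` is a rounded value and is better read as p < 0.001.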

sjt.xtab(dataBD$B29r, dataBD$B47r,    ## replace these two variables with the ones you want to examine
         show.row.prc = TRUE, # show row percentages
         show.col.prc = TRUE  # show column percentages
)

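B29r (rows) by B47r (columns). Each cell: N / row % / column %.

B29r  | B47r = 1              | B47r = 2              | B47r = 3             | Total
------------------------------------------------------------------------------------------------
1     | 129 / 64.8 % / 57.1 % | 53 / 26.6 % / 21.6 %  | 17 / 8.5 % / 15.7 %  | 199 / 100 % / 34.4 %
2     | 66 / 23.7 % / 29.2 %  | 164 / 59 % / 66.9 %   | 48 / 17.3 % / 44.4 % | 278 / 100 % / 48 %
3     | 31 / 30.4 % / 13.7 %  | 28 / 27.5 % / 11.4 %  | 43 / 42.2 % / 39.8 % | 102 / 100 % / 17.6 %
Total | 226 / 39 % / 100 %    | 245 / 42.3 % / 100 %  | 108 / 18.7 % / 100 % | 579 / 100 % / 100 %
χ²=129.085 · df=4 · Cramer’s V=0.334 · p=0.000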

sjt.xtab(dataBD$B25r, dataBD$B39r,    ## replace these two variables with the ones you want to examine
         show.row.prc = TRUE, # show row percentages
         show.col.prc = TRUE  # show column percentages
)

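B25r (rows) by B39r (columns). Each cell: N / row % / column %.

B25r  | B39r = 1              | B39r = 2              | B39r = 3              | Total
-------------------------------------------------------------------------------------------------
0     | 9 / 81.8 % / 4.5 %    | 2 / 18.2 % / 0.9 %    | 0 / 0 % / 0 %         | 11 / 100 % / 1.9 %
1     | 97 / 24.3 % / 48 %    | 179 / 44.9 % / 81.4 % | 123 / 30.8 % / 78.3 % | 399 / 100 % / 68.9 %
2     | 96 / 56.8 % / 47.5 %  | 39 / 23.1 % / 17.7 %  | 34 / 20.1 % / 21.7 %  | 169 / 100 % / 29.2 %
Total | 202 / 34.9 % / 100 %  | 220 / 38 % / 100 %    | 157 / 27.1 % / 100 %  | 579 / 100 % / 100 %
χ²=67.056 · df=4 · Cramer’s V=0.241 · Fisher’s p=0.000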
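
This table reports "Fisher's p" rather than a plain p. A likely reason, stated here as an assumption about sjPlot's behaviour rather than a documented fact, is that some expected cell counts fall below 5, where the chi-square approximation is less reliable. The sketch below checks the expected counts and runs Fisher's exact test in base R with a simulated p-value.

```r
# Cell counts copied from the B25r x B39r table above (rows: B25r = 0, 1, 2)
counts <- matrix(
  c( 9,   2,   0,
    97, 179, 123,
    96,  39,  34),
  nrow = 3, byrow = TRUE,
  dimnames = list(B25r = c("0", "1", "2"), B39r = c("1", "2", "3"))
)

# Expected counts under independence; the sparse B25r = 0 row is the concern
expected <- suppressWarnings(chisq.test(counts))$expected
round(expected, 2)
min(expected) < 5

# Fisher's exact test, using a Monte Carlo p-value to keep the run short
set.seed(2023)
fisher <- fisher.test(counts, simulate.p.value = TRUE, B = 10000)
fisher$p.value
```

Only 11 respondents fall in the B25r = 0 row, so recoding or dropping that category before testing would be another reasonable option.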

# To examine more pairs, copy and paste the block above, change the variables, and run it a few more times!

Next, write down the pairs of variables that, after your own verification, are very likely to be related (at most three interesting pairs).

Knit Your Own Report

Finally, use knit to save your analysis as an HTML file and submit it to your instructor.