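1 Learning Objectives

This lecture introduces R's data types and data structures, that is, the objects R can read and operate on. For example, c() combines numbers or text into a collection: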

A<-c("台北市","新北市", "桃園市", "台中市","台南市","高雄市")
print(A)

## [1] "台北市" "新北市" "桃園市" "台中市" "台南市" "高雄市"

B<-c(0,1,2,3,4,5,6,7,8,9)
print(B)

##  [1] 0 1 2 3 4 5 6 7 8 9

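A is a collection of text and B is a collection of numbers. Other commonly used data types include logical values and dates.

R data structures can be grouped by dimension:
- One-dimensional: vector
- Two-dimensional: matrix, data frame
- Multi-dimensional: array, list, table

Starting from here, we will learn R's data types and the operations on them.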

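R is an object-oriented language. Here this means that the user gives a name to a function, vector, matrix, loop, and so on (or bundles code and data into one object), and the system stores it in the working environment, where it can be called, run, or modified. Like stacking building blocks, the user combines objects to reach the desired result.
- An R environment holds many objects, which ls() lists. For example: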
```
## [1] "a" "A" "b" "B" "f"
```
Variables defined in the global environment can be used anywhere, so they are called global variables. Variables defined inside a function are called local variables and cannot be used outside that function.

## [1] 4
## [1] 25

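In the example above, .GlobalEnv shows the environment we are currently in, and y and Y are not in it. A function that takes x runs in a new environment of its own, and the local variables in that environment cannot be reached from the global environment.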
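The code behind the output above is not shown, so the following is only a minimal sketch of the same idea. The names f, x, and y are illustrative assumptions, chosen so that the two calls return 4 and 25.

```r
# Hypothetical example: f, x, and y are illustrative names
x <- 2                      # global variable

f <- function(x) {
  y <- x^2                  # y is local to each call of f
  y
}

print(f(x))                 # 4
print(f(5))                 # 25

# y was created inside f, so it is not visible here
print(exists("y", inherits = FALSE))
```

Because y lives only inside the function's own environment, asking for it after the call finds nothing, while x remains available as a global variable.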
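A function is like a black box: we put variables or data in, and it returns a result computed according to its settings. For example, to get the mean of a variable, R provides the mean function: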

?mean
mean(x, trim = 0, na.rm = FALSE)

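In this function, x is the variable. It can be a vector or a matrix, but its elements cannot be character strings. trim sets the fraction (0 to 0.5) of observations to drop from each end of the sorted data before the mean is computed. For example, take this variable: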

x<-c(100000, 10000000, c(1:10)); sort(x)

##  [1] 1e+00 2e+00 3e+00 4e+00 5e+00 6e+00 7e+00 8e+00 9e+00 1e+01 1e+05 1e+07

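This variable contains some extreme values. Setting $\texttt{trim=0.1}$ or $\texttt{trim=0.2}$ removes 10% or 20% of the 12 numbers from each end. Since $0.1\times 12=1.2$ and $0.2 \times 12=2.4$ are rounded down, this drops 1 or 2 numbers from each end of the sorted data. The following compares the results with and without trim: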

x<-c(100000, 10000000, c(1:10))
sort(x)

##  [1] 1e+00 2e+00 3e+00 4e+00 5e+00 6e+00 7e+00 8e+00 9e+00 1e+01 1e+05 1e+07

mean(x)

## [1] 841671

mean(x, trim=0.1)

## [1] 10005

mean(sort(x)[2:11])

## [1] 10005

mean(x, trim=0.2)

## [1] 6.5

mean(sort(x)[3:10])

## [1] 6.5

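With $\texttt{trim=0.2}$, only the 3rd to 10th values of the sorted data remain, that is, sort(x)[3:10]. Remember to sort x with $\texttt{sort()}$ before taking the mean by hand.
This is also a chance to get familiar with scientific notation: e+01 means $\times 10^{1}$, so 1e+01 is the two-digit number 10, and 1e+05 is the six-digit number 100000.
Rousselet's blog explains further why extreme values are trimmed.
Most of the time we do not use every argument when calling a function, but the better we understand what each argument means, the more we can get out of the function.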

2 Data Types

2.1 Numeric

Numbers are stored either as numeric or as integer. An arbitrary number, for example:

X<-c(2, 4, 6, 8); X

## [1] 2 4 6 8

class(X)

## [1] "numeric"

typeof(X)

## [1] "double"

str(X)

##  num [1:4] 2 4 6 8

The $\texttt{class()}$ function reports the class of a data type or data structure.

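Large numbers can be written in scientific notation:

y=c(1.1e+06); y

## [1] 1100000

class(y)

## [1] "numeric"

An integer looks like this:

u<-as.integer(c(4)); class(u)

## [1] "integer"

Integers are a subset of numerics: the largest integer R can store is 2147483647, far below the largest numeric value. In ordinary arithmetic and statistics the two behave almost identically. The main difference is that integers take less storage, while ordinary numerics are floating-point numbers that carry a fractional part even when it is not displayed. Because computers work in binary, such a value is represented as a significand together with an exponent; values stored this way are called floating-point numbers, and arithmetic on them, which covers every calculation involving decimals, is called floating-point arithmetic.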
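The points in this paragraph can be checked directly. The sketch below uses only base R; the variable names int_max, size_int, and size_dbl are illustrative.

```r
# Largest value an integer can hold
int_max <- .Machine$integer.max
print(int_max)                    # 2147483647

# Integers need less memory than doubles of the same length
size_int <- object.size(integer(1000))
size_dbl <- object.size(numeric(1000))
print(size_int < size_dbl)        # TRUE

# Doubles are binary floating-point values, so some decimals are only approximate
print(0.1 + 0.2 == 0.3)                    # FALSE
print(isTRUE(all.equal(0.1 + 0.2, 0.3)))   # TRUE
```

The last two lines show why exact comparison of decimals with == is risky, and why all.equal(), which allows a small tolerance, is usually preferred.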

Try the following commands:

is.numeric(100)
is.integer(100)
typeof(100)

At first glance 100 looks like an integer, but R stores it as a double (a number with a fractional part).

Another example:

a<-c(7, 8.5, 9); class(a)

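## [1] "numeric"

b<-as.integer(a); b

## [1] 7 8 9

$\texttt{as.integer()}$ returns the vector as integers. The whole numbers print the same as before, but the fractional part is dropped, so 8.5 becomes 8.

If you type a whole number followed by L, R stores it as an integer.

a<-3L; class(a)

## [1] "integer"

So what are integers good for? Most of the time the difference does not matter. For example, both as.numeric() and as.integer() can convert a date: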

sys_date <- Sys.Date()
as.numeric(sys_date)

## [1] 19791

as.integer(sys_date)

## [1] 19791

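Why does converting today's date give a number? R counts dates from 1 January 1970, which is day 0, and adds 1 for every day after that. Subtracting the origin from today's date therefore gives the same number that as.numeric() and as.integer() return: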

date_of_origin <- as.Date("1970-01-01")
as.integer(sys_date) - as.integer(date_of_origin)

## [1] 19791

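A logical vector built from a condition on one variable can be used to select elements of another. When no element meets the condition, R returns an empty vector and still reports its type, here numeric(0):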

b <- c(1, 2, 3, 10)
h<-c(100, 200, 300, 500)
ok<-h>300
b[ok]

## [1] 10

ok<-h>1000
b[ok]

## numeric(0)

2.2 Character

A variable can hold character strings, such as a respondent's gender, a student's name, or a country name.

Try:

state.abb
class(state.abb)

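Text is easier for data users to read than numbers, but it is more limited because it cannot be used in arithmetic. It is very useful in data visualization, though. For example, Figure 2.1 shows the population of each state:

Figure 2.1: Dot plot of US state populations

Figure 2.2 shows the same populations after sorting:

Figure 2.2: Dot plot of state populations after sorting

state.x77 is a matrix, so we take its Population column, combine it with state.abb into a data frame, and draw a dot plot with dotplot or dotchart.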
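The plotting code for Figures 2.1 and 2.2 is not shown, so here is one possible sketch of the steps just described, using base R's dotchart(). The object names pop and pop_sorted, the axis label, and the cex setting are assumptions, not the original code.

```r
# Population column of the state.x77 matrix, labelled with state abbreviations
pop <- data.frame(state = state.abb,
                  population = state.x77[, "Population"])

# Figure 2.1 style: states in their original (alphabetical) order
dotchart(pop$population, labels = pop$state, cex = 0.6,
         xlab = "Population (thousands)")

# Figure 2.2 style: sort by population before plotting
pop_sorted <- pop[order(pop$population), ]
dotchart(pop_sorted$population, labels = pop_sorted$state, cex = 0.6,
         xlab = "Population (thousands)")
```

Sorting the data frame first is what turns the scattered dots of the first plot into a readable ranking in the second.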

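Numbers can also be treated as text. Suppose the first person to arrive in class joins group 1, the second joins group 2, and so on for five groups, and after class group 1 closes the doors and windows, group 2 takes out the trash, and so on: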

char1<-c("1","2","3","4","5"); char1

## [1] "1" "2" "3" "4" "5"

char2<-c(1, 2, "文字"); char2

## [1] "1"    "2"    "文字"

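Here every number is treated as a character string and cannot be used in arithmetic. Strings that look like numbers, such as "1", can be converted back with as.numeric(), but strings that are not numbers, such as "文字", become NA. Categories stored as text need explicit recoding instead (see the section on factors).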
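A short sketch of this behaviour, reusing char1 and char2 from above (num1 and num2 are illustrative names):

```r
char1 <- c("1", "2", "3", "4", "5")
num1 <- as.numeric(char1)        # numeric-looking strings convert cleanly
print(sum(num1))                 # 15

char2 <- c(1, 2, "文字")
# The non-numeric string becomes NA (R also issues a warning, silenced here)
num2 <- suppressWarnings(as.numeric(char2))
print(num2)
```

This is why the population columns in the open data example later in this section can be repaired with as.numeric(), while labels such as "Female" cannot.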

class(row.names(quakes))

## [1] "character"

head(state.x77)

##            Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766

class(row.names(state.x77))

## [1] "character"

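Numbers are often misread as strings. For example, we download a file called opendata106N0101.csv from the government open data site. The $\texttt{str()}$ function shows the type of each variable in the data: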

library(foreign); library(tidyverse)
file<-here::here('data','opendata106N0101.csv')
opendf<-read.csv(file, header=T, sep=',', fileEncoding = 'UTF-8')
str(opendf)

## 'data.frame':    375 obs. of  4 variables:
##  $ code      : chr  "新北市板橋區" "新北市三重區" "新北市中和區" "新北市永和區" ...
##  $ 年底人口數: chr  "551480" "387484" "413590" "222585" ...
##  $ 土地面積  : num  23.14 16.32 20.14 5.71 19.74 ...
##  $ 人口密度  : chr  "23835" "23747" "20532" "38956" ...

file<-here::here('data','opendata106N0101.csv')
dat<-read.csv(file, header=T, stringsAsFactors = F)
nrow(dat) #check how many rows; n=375

## [1] 375

dat <- dat[-c(369:375),] #delete the rows of small islands and notes
head(dat, n=3)

##           code 年底人口數 土地面積 人口密度
## 1 新北市板橋區     551480    23.14    23835
## 2 新北市三重區     387484    16.32    23747
## 3 新北市中和區     413590    20.14    20532

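In the new data, the variable code is read as a string, but so are 年底人口數 (year-end population) and 人口密度 (population density). We create a new numeric variable (popu) to replace the string variable: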

dat <- dat %>% mutate(popu=as.numeric(年底人口數))
str(dat)

## 'data.frame':    368 obs. of  5 variables:
##  $ code      : chr  "新北市板橋區" "新北市三重區" "新北市中和區" "新北市永和區" ...
##  $ 年底人口數: chr  "551480" "387484" "413590" "222585" ...
##  $ 土地面積  : num  23.14 16.32 20.14 5.71 19.74 ...
##  $ 人口密度  : chr  "23835" "23747" "20532" "38956" ...
##  $ popu      : num  551480 387484 413590 222585 416524 ...

mutate() can turn number-like strings into numbers:

file<-here::here('data','opendata106N0101.csv')
df<-read.csv(file, 
             header=F, stringsAsFactors = F)
colnames(df) <-df[1,]
df <- df[-c(1, 370:376),] #delete the first row and the rows of notes
df<-df%>% mutate(popu=as.numeric(年底人口數))
head(df, n=3)

##           code 年底人口數 土地面積 人口密度   popu
## 2 新北市板橋區     551480  23.1373    23835 551480
## 3 新北市三重區     387484   16.317    23747 387484
## 4 新北市中和區     413590   20.144    20532 413590

min(df$popu) #check if data has no missing value

## [1] 685

$\texttt{transform}$ works in a similar way:

library(data.table)
file<-here::here('data','opendata106N0101.csv')
DT <- read.csv(file, header=F, stringsAsFactors = F)
colnames(DT) <-DT[1,]
nrow(DT) #check how many rows; n=376

## [1] 376

DT <- DT[-c(1, 370:376),] #delete the first row and the rows of notes
DT <- data.table(DT)
DT <- DT %>% transform(年底人口數=as.numeric(年底人口數))
head(DT, n=3)

##            code 年底人口數 土地面積 人口密度
## 1: 新北市板橋區     551480  23.1373    23835
## 2: 新北市三重區     387484   16.317    23747
## 3: 新北市中和區     413590   20.144    20532

min(DT$年底人口數) #check if data has no missing value

## [1] 685

2.3 Factor

Some text represents categories, such as gender, yes or no, or region. An example from Verzani (p.10):

x=c("Yes","No","No","Yes","Yes"); x

## [1] "Yes" "No"  "No"  "Yes" "Yes"

factor(x)

## [1] Yes No  No  Yes Yes
## Levels: No Yes

table(x)

## x
##  No Yes 
##   2   3

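In this example x is character data, and factor() converts it into a factor with two levels, No and Yes.

A factor is stored as integers, but it can take only a small number of fixed values. Factors look like text, yet functions written for text, such as nchar() and gsub(), do not necessarily work on them. Try the following code yourself: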

#character
a <- letters[1:5]
typeof(a)
nchar(a)
#factor
b <- factor(a)
typeof(b)
nchar(b)

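We recommend setting stringsAsFactors = FALSE so that text is not read as factors when a file is loaded (this is already the default since R 4.0.0), and then converting text to factors where needed.

The advantage of factors is readability. In a cross-tabulation, for example, a factor variable shows its category labels directly: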

str(Chile)

## 'data.frame':    2700 obs. of  8 variables:
##  $ region    : Factor w/ 5 levels "C","M","N","S",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ population: int  175000 175000 175000 175000 175000 175000 175000 175000 175000 175000 ...
##  $ sex       : Factor w/ 2 levels "F","M": 2 2 1 1 1 1 2 1 1 2 ...
##  $ age       : int  65 29 38 49 23 28 26 24 41 41 ...
##  $ education : Factor w/ 3 levels "P","PS","S": 1 2 1 1 3 1 2 3 1 1 ...
##  $ income    : int  35000 7500 15000 35000 35000 7500 35000 15000 15000 15000 ...
##  $ statusquo : num  1.01 -1.3 1.23 -1.03 -1.1 ...
##  $ vote      : Factor w/ 4 levels "A","N","U","Y": 4 2 4 2 2 2 2 2 3 2 ...

kableExtra::kable_styling(knitr::kable(table(Chile$sex, Chile$vote)),
                        bootstrap_options = "striped", full_width = F)

	A	N	U	Y
F	104	363	362	480
M	83	526	226	388

Whether stored as strings or factors, labels make a cross-tabulation easier to read. Compare what happens when regions are coded as numbers:

Chile$ncode<-as.numeric(Chile$region) 
kableExtra::kable_styling(knitr::kable(table(Chile$ncode, Chile$vote)),
                          bootstrap_options = "striped", full_width = F)

	A	N	U	Y
1	44	210	141	174
2	2	18	23	38
3	30	102	46	135
4	42	214	148	275
5	69	345	230	246

Plots also show the advantage of factors with labelled categories. For example, Figure 2.3 shows the relationship between sex and vote:

library(lattice)
plot(Chile$sex, Chile$vote, xlab="Sex", ylab="Vote")

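Figure 2.3: Sex and vote (1)

2.3.1 Converting Factors

Sometimes we need to convert a factor to numbers, for example education from elementary school, junior high, and so on to a scale of 1 to 6, or gender from male and female to 0 and 1. The as.numeric() function does this, as in Table 2.1: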

gender<-as.numeric(Chile$sex) 
kableExtra::kable_styling(knitr::kable(table(gender), caption="性別分佈"),
                          bootstrap_options = "striped", full_width = F)

Table 2.1: Distribution of gender
gender	Freq
1	1379
2	1321

R assigns numbers to the categories in alphabetical order of the levels. Recoding those numbers further is then easy:

sex <- c()
sex[gender==2]<-0
sex[gender==1]<-1
kableExtra::kable_styling(knitr::kable(table(sex)),bootstrap_options = "striped", full_width = F)

sex	Freq
0	1321
1	1379

We can also write code that converts the factor directly into the numbers we need:

ngender<-c()
ngender[Chile$sex=='F']<-1
ngender[Chile$sex=='M']<-0
kableExtra::kable_styling(knitr::kable(table(ngender)),
                          bootstrap_options = "striped", full_width = F)

ngender	Freq
0	1321
1	1379

Or convert the factor to strings:

Chile$gender[Chile$sex=="F"]<-"Female"
Chile$gender[Chile$sex=="M"]<-"Male"
class(Chile$gender)

## [1] "character"

kableExtra::kable_styling(knitr::kable(table(Chile$gender)),
                          bootstrap_options = "striped", full_width = F)

Var1	Freq
Female	1379
Male	1321

The sapply and lapply functions can convert factors to strings: first use sapply to find the factor columns in the data, then use lapply to convert those columns to character:

str(Chile)

## 'data.frame':    2700 obs. of  10 variables:
##  $ region    : Factor w/ 5 levels "C","M","N","S",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ population: int  175000 175000 175000 175000 175000 175000 175000 175000 175000 175000 ...
##  $ sex       : Factor w/ 2 levels "F","M": 2 2 1 1 1 1 2 1 1 2 ...
##  $ age       : int  65 29 38 49 23 28 26 24 41 41 ...
##  $ education : Factor w/ 3 levels "P","PS","S": 1 2 1 1 3 1 2 3 1 1 ...
##  $ income    : int  35000 7500 15000 35000 35000 7500 35000 15000 15000 15000 ...
##  $ statusquo : num  1.01 -1.3 1.23 -1.03 -1.1 ...
##  $ vote      : Factor w/ 4 levels "A","N","U","Y": 4 2 4 2 2 2 2 2 3 2 ...
##  $ ncode     : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ gender    : chr  "Male" "Male" "Female" "Female" ...

data_char2 <- Chile                                              # Duplicate data
fac_cols <- sapply(data_char2, is.factor)                           # Identify all factor columns
data_char2[fac_cols] <- lapply(data_char2[fac_cols], as.character)  # Convert all factor
str(data_char2)

## 'data.frame':    2700 obs. of  10 variables:
##  $ region    : chr  "N" "N" "N" "N" ...
##  $ population: int  175000 175000 175000 175000 175000 175000 175000 175000 175000 175000 ...
##  $ sex       : chr  "M" "M" "F" "F" ...
##  $ age       : int  65 29 38 49 23 28 26 24 41 41 ...
##  $ education : chr  "P" "PS" "P" "P" ...
##  $ income    : int  35000 7500 15000 35000 35000 7500 35000 15000 15000 15000 ...
##  $ statusquo : num  1.01 -1.3 1.23 -1.03 -1.1 ...
##  $ vote      : chr  "Y" "N" "Y" "N" ...
##  $ ncode     : num  3 3 3 3 3 3 3 3 3 3 ...
##  $ gender    : chr  "Male" "Male" "Female" "Female" ...

What about converting numbers into a factor (categories)? For example, take this numeric vector:

##  ISLR::Auto$cylinders   n  percent
##                     3   4 0.010204
##                     4 199 0.507653
##                     5   3 0.007653
##                     6  83 0.211735
##                     8 103 0.262755

Convert the numbers to strings first, then to a factor:

library(ISLR); library(dplyr)
data(Auto)
CYL <- c()
CYL[Auto$cylinders==3] <- '3 cylinders'
CYL[Auto$cylinders==4] <- '4 cylinders'
CYL[Auto$cylinders==5] <- '5 cylinders'
CYL[Auto$cylinders==6] <- '6 cylinders'
CYL[Auto$cylinders==8] <- '8 cylinders'
df <- data.frame(CYL=CYL)

df %>% janitor::tabyl(CYL)

##          CYL   n  percent
##  3 cylinders   4 0.010204
##  4 cylinders 199 0.507653
##  5 cylinders   3 0.007653
##  6 cylinders  83 0.211735
##  8 cylinders 103 0.262755
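The same conversion can be written more compactly with the levels and labels arguments of factor(). The sketch below uses a short made-up vector cyl in place of ISLR::Auto$cylinders so that it runs without extra packages; cyl, cyl_levels, and cyl_f are illustrative names.

```r
# Hypothetical cylinder counts standing in for ISLR::Auto$cylinders
cyl <- c(4, 8, 6, 4, 3, 5, 8, 4)

# factor() maps each sorted unique value to a label in one step
cyl_levels <- sort(unique(cyl))
cyl_f <- factor(cyl, levels = cyl_levels,
                labels = paste(cyl_levels, "cylinders"))

print(levels(cyl_f))
print(table(cyl_f))
```

This avoids one assignment line per category, which matters once a variable has more than a handful of distinct values.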

Figure 2.4 shows the distribution:

df$fac_cols <- as.factor(df$CYL)

ggplot2::ggplot(df,aes(x = fac_cols)) +
  geom_bar(stat='count', aes(fill = fac_cols))

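Figure 2.4: Bar chart of cylinder counts

Strings can also be cross-tabulated or plotted. For example, Figure 2.5 shows how vote choice differs between women and men: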

	A	N	U	Y
Female	104	363	362	480
Male	83	526	226	388

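Figure 2.5: Sex and vote (2)

Note that strings which are not numbers cannot be converted to numbers, because R cannot tell which number each string should receive. The call below returns NA for every element: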

as.numeric(Chile$gender)

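☛ Practice converting categorical variables such as citizen in AMSsurvey.

In the example above:

The dplyr package provides mutate_if, which converts every variable of a given type in a data set (for example every factor) to numbers or strings in one step: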
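The original code for this step is missing, so the following is only a sketch. It assumes dplyr is installed and uses the built-in iris data, whose Species column is a factor, rather than the Chile data used elsewhere in this section.

```r
library(dplyr)

# Convert every factor column to character in one step
iris_chr <- iris %>% mutate_if(is.factor, as.character)
print(class(iris_chr$Species))   # "character"

# The same idea written with across() and where()
iris_chr2 <- iris %>% mutate(across(where(is.factor), as.character))
print(class(iris_chr2$Species))  # "character"
```

Both forms leave the numeric columns untouched, because the predicate is.factor selects only the columns to be converted.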

$\texttt{recode_factor()}$ in dplyr can recode one factor into another, and can also turn strings into a factor.

library(carData); library(dplyr)
Chile |> mutate(sex.new = recode_factor(sex, `F`='Female', `M`="Male"))|>
  janitor::tabyl(sex.new)

##  sex.new    n percent
##   Female 1379  0.5107
##     Male 1321  0.4893

To change the values (strings) of a string variable:

file <- here::here('data','africa2023.csv'); dt <- read.csv(file, header=T)
dt |> mutate(sub = recode_factor(subregion, `Eastern Africa`='East', 
                                 `Middle Africa`="Middle",
             `Northern Africa`='North', `Southern Africa`='South',
             `Western Africa` = 'West'))|>
 janitor::tabyl(sub)

##     sub  n percent
##    East 18 0.33333
##  Middle  9 0.16667
##   North  6 0.11111
##   South  5 0.09259
##    West 16 0.29630

To turn some categories into missing values (NA_character_):

dt |> mutate(sub2 = recode_factor(subregion, `Eastern Africa`='East', 
                                 `Middle Africa`="Middle",
             `Northern Africa`='North', `Southern Africa`='South',
             .default = NA_character_))|>
 janitor::tabyl(sub2)

##    sub2  n percent valid_percent
##    East 18 0.33333        0.4737
##  Middle  9 0.16667        0.2368
##   North  6 0.11111        0.1579
##   South  5 0.09259        0.1316
##    <NA> 16 0.29630            NA

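To collapse the values of a continuous variable, use $\texttt{recode}$ from dplyr, or $\texttt{recode}$ from car to merge several values at once. The code below shows how to merge values and set some values to missing.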

library(janitor)
#open data
file <- here::here('data', 'PP0797B2.sav')
PP0797B2 <- sjlabelled::read_spss(file)

PP07 <- PP0797B2 %>% mutate(nQ1 = dplyr::recode(Q1, `1`= 1, `2`= 1, `3`= 0, `4` = 0)) %>%
    mutate(nQ1 = ifelse(nQ1>1, NA, nQ1)) %>%
  mutate(NEWQ2 = car::recode(Q2, "1:2=1; 3:4=0; 95:98=NA"))
PP07 %>% tabyl(nQ1)

##  nQ1    n percent valid_percent
##    0  534  0.2595         0.291
##    1 1301  0.6322         0.709
##   NA  223  0.1084            NA

PP07 %>% tabyl(NEWQ2)

##  NEWQ2    n percent valid_percent
##      0 1688 0.82021         0.895
##      1  198 0.09621         0.105
##   <NA>  172 0.08358            NA

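dplyr also provides $\texttt{mutate}$, which can change selected variables to numbers or strings. For example, we change the year variable in the absentee data from the pscl package from numeric to character so that later plots do not treat it as a continuous number.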

library(pscl)
attach(absentee)
admit.1 <- absentee %>% 
           mutate(across(matches('year'), as.character)) %>% 
     data.table::data.table()
typeof(admit.1$year)

## [1] "character"

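2.3.1.1 Order of Factor Levels

Sometimes we want to set the order of a factor's categories ourselves instead of accepting R's automatic ordering. For example: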

x<-c("花蓮縣","臺北市","屏東縣","臺南市","高雄市");x
table(x)

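The elements of x are stored in the order "花蓮縣", "臺北市", "屏東縣", "臺南市", "高雄市", and table(x) lists the categories in the order R sorts the strings, which is not a geographic order.

To list the categories from north to south and then east, add the levels argument: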

## [1] 花蓮縣 臺北市 屏東縣 臺南市 高雄市
## Levels: 臺北市 臺南市 高雄市 屏東縣 花蓮縣

xf	Freq
臺北市	1
臺南市	1
高雄市	1
屏東縣	1
花蓮縣	1
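The code for this step is not shown. A minimal sketch, assuming the factor is called xf as in the table above, would be the following. The strings in levels must match the values of x exactly (for example 臺 versus 台); any value that does not match a level becomes NA and its count is 0.

```r
x <- c("花蓮縣", "臺北市", "屏東縣", "臺南市", "高雄市")

# Levels listed from north to south, then east
xf <- factor(x, levels = c("臺北市", "臺南市", "高雄市", "屏東縣", "花蓮縣"))
print(xf)
print(table(xf))

# The integer codes follow the order of the levels, not the order of x
print(as.integer(xf))   # 5 1 4 2 3
```

Checking that no element of xf is NA is a quick way to confirm that every value was matched to a level.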

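Sometimes we want a particular category to serve as the reference category. In a regression model, for example, the output reports coefficients for k-1 categories of a factor. To change which group is omitted, use $\texttt{relevel}$ to choose the reference group, as in Table 2.2:

Table 2.2: Infant Model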

	Dependent variable:

	infant
	(1)	(2)

income	-0.003	-0.003
	(0.008)	(0.008)

regionAmericas	-84.730^***
	(22.500)

regionAsia	-44.800^**
	(20.800)

regionEurope	-113.500^***
	(31.400)

region.rAfrica		84.730^***
		(22.500)

region.rAsia		39.940^*
		(23.070)

region.rEurope		-28.740
		(29.850)

Constant	143.200^***	58.510^***
	(13.860)	(18.590)


Observations	101	101
R²	0.257	0.257
Adjusted R²	0.226	0.226
Residual Std. Error (df = 96)	79.870	79.870
F Statistic (df = 4; 96)	8.311^***	8.311^***

Note:	*p<0.1; **p<0.05; ***p<0.01
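The data and code behind Table 2.2 are not shown. The sketch below illustrates the same use of relevel() on a small made-up data frame; toy, region, infant, and the values are hypothetical and are not the data used in the table.

```r
# Hypothetical toy data; region and infant are illustrative names
toy <- data.frame(
  region = factor(c("Africa", "Americas", "Asia", "Europe",
                    "Africa", "Americas", "Asia", "Europe")),
  infant = c(120, 40, 70, 10, 100, 30, 80, 12)
)

# By default the first level (Africa) is the reference category
m1 <- lm(infant ~ region, data = toy)

# relevel() moves the chosen level to the front, making it the reference
toy$region.r <- relevel(toy$region, ref = "Americas")
m2 <- lm(infant ~ region.r, data = toy)

print(names(coef(m1)))
print(names(coef(m2)))
```

As in Table 2.2, the Americas coefficient in the first model and the Africa coefficient in the second have the same size and opposite signs, because both measure the gap between the same two groups.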

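Incidentally, factor() has a logical ordered argument. Once levels is specified, setting ordered to TRUE does not change the order in which the categories are listed, although it does make the factor an ordered one whose levels can be compared with < and >. The ordered() function returns such an ordered factor directly, for example: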

od<-ordered(1:20); class(od)

## [1] "ordered" "factor"

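☛ Practice setting the order of vote in Chile to "Y", "N", "A", "U".

2.4 Logical

Data can be logical, either true (TRUE) or false (FALSE), which is especially useful for filtering data.

## [1] "logical"

For example, we first create some data and a logical vector: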

a<-c(0:9); a

##  [1] 0 1 2 3 4 5 6 7 8 9

ok<-a>5; ok

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

Then use the logical vector to filter the data:

a[ok]

## [1] 6 7 8 9

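The next example matches a 6-element vector against logical vectors. A logical vector shorter than the data, such as c(TRUE,FALSE), is recycled until it covers every element: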

x <- c(2,7,9,2,NA,5)                 # 6 elements
r <- c(TRUE,TRUE,FALSE,FALSE,FALSE,FALSE)
x[r]

## [1] 2 7

x[c(TRUE,FALSE)] # odd numbered elements

## [1]  2  9 NA

☛ Run the following code. How many observations remain after filtering?

head(Duncan)
ok<-Duncan$income>50  
Duncan$income[ok]

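Incidentally, the $\texttt{length()}$ function returns the length of a vector.

Sometimes the data contain missing values, NA. A logical vector built with $\texttt{is.na()}$ can remove them. For example: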

library(data.table)
H<-data.table(Age=c(NA,"30-39","40-49","20-29",
                "20-29","60-69","60-69","30-39","60-69",
                "30-39","20-29","20-29","30-39","40-49",
               "40-49", "40-49","50-59","50-59","20-29",NA), 
              Vote=c("Ding", NA, "Ko", "Ko", "Ko", 
                     "Ding","Ding",NA,NA, "Ko","Yao","Yao", "Yao","Ding","Ko","Ko","Yao","Yao","Ding","Ko"),
              pride=c(3,NA, 7, 3, NA, 5, 5, 4, 
                      NA,2,5, 1,6,8, 7, 6, 1, 3,3,5))

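To find respondents' average city pride (pride), we should remove the missing values first and take the mean of the remaining respondents; otherwise the result is NA: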

ok	Freq
FALSE	3
TRUE	17

## [1] NA

## [1] 4.353
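The code that produced the output above is not shown. One way to reproduce it, using the pride values from H and the name ok that appears in the table, might be:

```r
pride <- c(3, NA, 7, 3, NA, 5, 5, 4, NA, 2, 5, 1, 6, 8, 7, 6, 1, 3, 3, 5)

ok <- !is.na(pride)          # TRUE where pride was answered
print(table(ok))             # FALSE 3, TRUE 17

print(mean(pride))           # NA, because missing values propagate
print(round(mean(pride[ok]), 3))            # 4.353
print(round(mean(pride, na.rm = TRUE), 3))  # same result without the logical vector
```

The na.rm = TRUE argument is a shortcut for the same filtering, but building the logical vector explicitly makes it easy to count how many respondents were dropped.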

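The example above deals with respondents who did not answer pride. If we want to show the relationship between age and vote in H, does it matter whether we first use a logical vector to drop the missing values?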

	Ding	Ko	Yao
20-29	1	2	2
30-39	0	1	1
40-49	1	3	0
50-59	0	0	2
60-69	2	0	0

	Ding	Ko	Yao
20-29	1	2	2
30-39	0	1	1
40-49	1	3	0
50-59	0	0	2
60-69	2	0	0

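Dropping the missing values does not change the cross-tabulation, because table() by default leaves out missing values that cannot be tabulated. When we cross-tabulate with other variables, however, differences will appear.

2.5 Date

Type $\texttt{Sys.Date()}$ to show today's date. A date prints like text, but R stores it as a number of days. as.Date() and as.POSIXct() convert strings into date data.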

v<-c("1/27/2020", "6/26/2020", "12/31/2021"); class(v)

## [1] "character"

v.date1<-as.Date(v, format='%m/%d/%Y'); class(v.date1); v.date1

## [1] "Date"

## [1] "2020-01-27" "2020-06-26" "2021-12-31"

v.date2<-as.POSIXct(v, format='%m/%d/%Y'); class(v.date2);v.date2

## [1] "POSIXct" "POSIXt"

## [1] "2020-01-27 CST" "2020-06-26 CST" "2021-12-31 CST"

Or:

v<-c("", "6/26/2018", "12/31/2018")
as.Date(v, format='%m/%d/%Y')

## [1] NA           "2018-06-26" "2018-12-31"

format() converts date data into a different format, for example:

today <- Sys.Date()
format(today, format='%m/%d/%Y')

## [1] "03/09/2024"

format(today, format='%Y-%m-%d')

## [1] "2024-03-09"

format(v.date2, "%m-%d-%Y")

## [1] "01-27-2020" "06-26-2020" "12-31-2021"

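2.5.1 Date Formats

The date format codes are:

Code	Meaning	Example
%d	Day of the month	01-31
%a	Abbreviated weekday	Mon
%A	Full weekday	Monday
%m	Month (number)	01-12
%b	Abbreviated month	Jan
%B	Full month name	January
%y	Two-digit year	18
%Y	Four-digit year	2018

We use format() to reformat data that are already dates, for example: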

Today<-Sys.Date(); Today

## [1] "2024-03-09"

today_format1<-format(Today, format='%Y-%b-%d'); today_format1

## [1] "2024-Mar-09"

today_format2<-format(Today, format='%b/%d/%y'); today_format2

## [1] "Mar/09/24"

today_format3<-format(Today, format='%Y年%b月%d日(%a)'); today_format3

## [1] "2024年Mar月09日(Sat)"

2.5.2 Extracting Parts of a Date

format can extract parts of a date as follows:

j <- c("1/1/2018","2/11/2019", "7/6/2020")
j <- as.Date(j, format="%m/%d/%Y")
print(format(j, "%m"))

## [1] "01" "02" "07"

print(format(j, "%Y"))

## [1] "2018" "2019" "2020"

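We can read data with a date column from a csv file. read.csv() brings the column in as character strings rather than dates, so it may not be parsed fully, but the data can still be plotted. For example: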

file <- here::here('data','tsaipopularity0921.csv')
tmp <- read.csv(file, header = T, sep=',')
tmp$Date <- format(tmp$Date, format="%y-%b")
tmp$Date

##  [1] "18-Mar" "18-Jun" "18-Sep" "18-Dec" "19-Mar" "19-Jun" "19-Sep" "19-Dec"
##  [9] "20-Mar" "20-Mar" "20-Jun" "20-Sep" "20-Dec" "21-Mar" "21-Jun"

Plotted as Figure 2.6:

ggplot(data=tmp, aes(x=Date, y=Tsai, group=1)) +
  geom_line(size=1.5, col="goldenrod") +
  geom_point(shape=1, size=3) +
  labs(y="%", 
       subtitle="Data: Taiwan's Election and Democratization Study") +
  theme_bw() +
  ggtitle("Percentage of Satisfaction with President Tsai's Performance")+
  geom_text(data=tmp, aes(x=Date, label=Tsai), 
            vjust=0, hjust=-0.3, size=5) +
  geom_text(label="Level 3 alert", x=14, y=45, col="#FF5511", size=4) +
  geom_text(label="Presidential election, Covid-19", x=7.9, y=70, col="#FF5511", size=4) +
  geom_text(label="One Country Two Systems", x=4.8, y=35, col="#FF5511", size=4) +
  theme(axis.title= element_text(color="blue", size=14, face="bold"),
        axis.text = element_text(size=9))

Figure 2.6: Satisfaction with the president's performance

2.5.3 Differences Between Dates

Enter two people's birthdays:

xi<-"1953-06-15" #Xi's birthday
tsai<-"1956-08-31" #Tsai's birthday

We can convert the strings into dates and then compute the difference between the two dates:

as.Date(c(xi,tsai), origin="1904-01-01")

## [1] "1953-06-15" "1956-08-31"

difftime(tsai, xi)

## Time difference of 1173 days

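In this example origin is optional, because it is ignored when the input is a date string. When a number is converted to a date, however, the starting date determines the result: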

as.Date(1100, origin="2018-08-01")

## [1] "2021-08-05"

2.6 The lubridate Package

lubridate is a package dedicated to dates and times. It can convert numbers directly into the corresponding dates.

## [1] "2010-12-15" "2023-12-25"

## [1] "Date"

We can add times and compute the difference between them:

## Time difference of 1.678 hours

Adding "UTC" (Coordinated Universal Time) or a "tz" (time zone) argument produces the POSIXct or POSIXlt format:

## [1] "POSIXct" "POSIXt"

## [1] "POSIXct" "POSIXt"

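POSIXct stores the number of seconds elapsed since the start of 1970, while POSIXlt stores the same moment as separate components such as year, month, and day.
In lubridate, interval() represents the span between two times, from which we can compute how much time has passed: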

## [1] 2024-02-15 08:30:30 UTC--2024-02-15 10:11:10 UTC

## [1] "1H 40M 40S"

## [1] "6040s (~1.68 hours)"
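The code for this section is not shown. The sketch below is one way to obtain output like that above, assuming lubridate is installed; the object names d, t1, t2, and span are illustrative, and the input values are inferred from the printed results.

```r
library(lubridate)

# Numbers in year-month-day order become Date objects
d <- ymd(c(20101215, 20231225))
print(d)
print(class(d))              # "Date"

# Date-times are POSIXct; ymd_hms() uses UTC unless tz is given
t1 <- ymd_hms("2024-02-15 08:30:30")
t2 <- ymd_hms("2024-02-15 10:11:10")
print(t2 - t1)               # a difftime in hours

span <- interval(t1, t2)
print(span)
print(as.period(span))       # hours, minutes, seconds
print(as.duration(span))     # exact number of seconds
```

A period expresses the span in clock units, while a duration is a fixed number of seconds. The two can differ across events such as daylight saving changes.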

2.7 strftime, strptime

Besides lubridate, strptime converts strings into date-times, for example:

## [1] "2020-01-01 EST" "2021-01-01 EST"

strftime converts dates into strings, for example:

## [1] "2023-01-01" "2024-01-25"

## [1] "character"

## [1] "01-01" "01-25"

If only the month and day are needed, converting the date to a string makes it easier to read.
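The calls behind the output in this section are not shown. A base R sketch that gives similar results, with the time zone "EST" and the object names dt, d, and s as assumptions, is:

```r
# strptime(): character -> date-time (POSIXlt); tz = "EST" is an assumed time zone
dt <- strptime(c("2020-01-01", "2021-01-01"), format = "%Y-%m-%d", tz = "EST")
print(dt)
print(class(dt))

# strftime(): date -> character
d <- as.Date(c("2023-01-01", "2024-01-25"))
s <- strftime(d, format = "%Y-%m-%d")
print(s)
print(class(s))                        # "character"
print(strftime(d, format = "%m-%d"))   # month and day only
```

Note that the result of strftime() is plain text, so it no longer sorts or subtracts like a date. It is best applied as the last step before display.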

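2.7.1 Example 1: Changes in Lai Ching-te's Poll Numbers

In 2023 many polling organizations published polls on four potential presidential candidates. Here we plot Lai Ching-te's poll numbers as a line chart, Figure 2.7:

Figure 2.7: Lai Ching-te's poll numbers in 2023

2.7.2 Example 2: US Inflation Rate, 2021-2023

We take the US inflation rate for 2021 to 2023 from a web page and plot it as a line chart, Figure 2.8. We use lubridate to convert the strings to dates, then strftime to convert them back to strings. Without converting to dates first, the data would be ordered incorrectly.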

inflation <- data.frame(rate=c(6.4, 6.0,    5.0,    4.9,    4.0,    3.0,    3.2,    3.7,    3.7,    3.2,    3.1,    3.4,    
7.5,    7.9,    8.5,    8.3,    8.6,    9.1,    8.5,    8.3,    8.2,    7.7,    7.1,    6.5,    
1.4,    1.7,    2.6,    4.2,    5.0,    5.4,    5.4,    5.3,    5.4,    6.2,    6.8,    7.0),
y.m = c("2023.1","2023.2","2023.3","2023.4","2023.5","2023.6",
         "2023.7","2023.8","2023.9","2023.10","2023.11","2023.12","2022.1","2022.2","2022.3","2022.4","2022.5","2022.6",
         "2022.7","2022.8","2022.9","2022.10","2022.11","2022.12","2021.1","2021.2","2021.3","2021.4","2021.5","2021.6",
         "2021.7","2021.8","2021.9","2021.10","2021.11","2021.12"))

dat <- inflation %>% mutate(date = lubridate::ym(y.m)) %>% 
          mutate(date = strftime(date, format = "%Y-%m")) 

g2 <- ggplot(dat, aes(x = date, y = rate, group=1)) +
     geom_point() +
     geom_line(colour = "#BB0000") +
   theme_bw() +
    theme(axis.text = element_text(size=8, angle = 45,vjust = 3)) 
g2

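Figure 2.8: US inflation rate, 2021-2023

Of the three years, inflation peaked in mid-2022. Over the same period the US dollar kept rising against the New Taiwan dollar, because the US kept raising interest rates to curb inflation; exchanging NT dollars for US dollars became less and less worthwhile, while the cost of living in the US kept climbing.
After learning lubridate, strptime, and strftime, try improving Figure 2.6 so that the X axis shows the year.
Now that data types have been introduced, we turn to data structures.

3 Data Structures

3.1 One-Dimensional

3.1.1 Vector

A vector is the most common data structure. It can be written as: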

example<-c(0,1,2,3,4)
print(example)

## [1] 0 1 2 3 4

Or:

c(2,4,6,8)->A; A

## [1] 2 4 6 8

A vector of English letters:

c(letters)

##  [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"

c(LETTERS)

##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"

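Sometimes we name the elements of a vector to make plotting easier. In Figure 3.1, two vectors show an investor's asset allocation and the amount of cash held from 2016 to 2019:

shares <- c(150, 40,  65)
names(shares) <- c('Finance','Technology','Cash')
shares

##    Finance Technology       Cash 
##        150         40         65 

class(shares)

## [1] "numeric"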

cash<-c(100, 120, 80, 65)
names(cash) <- c(2016, 2017, 2018, 2019)
par(mfrow=c(1,2), bg='lightgreen',mai=c(0.4,0.3,0.1,0.3))
pie(shares); barplot(cash, cex.axis = 0.8)

Figure 3.1: Asset allocation

Arithmetic can be done inside a vector, for example:

j<-c(2*2, 2*9, 10-2, 3^3); j

## [1]  4 18  8 27

A vector can be added to, subtracted from, multiplied by, or divided by a number:

R<-c(100, 200, 300); R/5; sqrt(R)

## [1] 20 40 60

## [1] 10.00 14.14 17.32

One vector can be combined with another:

c(j, R)

## [1]   4  18   8  27 100 200 300

Or a vector can contain other vectors:

Y<-c(j, c(9:5), R[c(1,2)]); Y

##  [1]   4  18   8  27   9   8   7   6   5 100 200

Because R vectors can be concatenated, we can keep adding data.

A minus sign removes elements from a vector, for example:

Y[-c(8:12)]

## [1]  4 18  8 27  9  8  7

3.2 Two-Dimensional

3.2.1 Matrix

One of R's data structures is the matrix. For example, VADeaths is a matrix:

data("VADeaths"); VADeaths

##       Rural Male Rural Female Urban Male Urban Female
## 50-54       11.7          8.7       15.4          8.4
## 55-59       18.1         11.7       24.3         13.6
## 60-64       26.9         20.3       37.0         19.3
## 65-69       41.0         30.9       54.6         35.1
## 70-74       66.0         54.3       71.1         50.0

class(VADeaths)

## [1] "matrix" "array"

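A matrix in R resembles a matrix in mathematics. Dimensions are read rows first, then columns, and matrix() fills the values column by column unless byrow=TRUE is set. For example, a $3\times 3$ matrix can be written as: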

m<-matrix(c(1:9), nrow=3, ncol=3); m

##      [,1] [,2] [,3]
## [1,]    1    4    7
## [2,]    2    5    8
## [3,]    3    6    9

And a $3\times 2$ matrix:

n<-matrix(c(1:6), nrow=3, ncol=2);n

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

The product of the two matrices is:

m<-matrix(c(1:9), nrow=3, ncol=3); n<-matrix(c(1:6), nrow=3, ncol=2); m%*%n

##      [,1] [,2]
## [1,]   30   66
## [2,]   36   81
## [3,]   42   96

Matrix multiplication requires the number of columns of the first matrix to equal the number of rows of the second.

The diagonal of the matrix is:

diag(m)

## [1] 1 5 9

The transpose is:

t(m)

##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
## [3,]    7    8    9

To select or replace part of a matrix, for example to select the element in the second row and third column of m and set it to 0:

m[2,3]; m[2,3]<-0

## [1] 8

Enter the following command to get the matrix n, then transpose it:

n<-matrix(c(1:6), nrow=3, ncol=2, byrow=T)

Rows and columns are named with the dimnames argument, which takes the row names and the column names, for example:

n<-matrix(c(1:6), nrow=3, ncol=2, dimnames = list(c("a","b","c"),c("A","B"))); n

##   A B
## a 1 4
## b 2 5
## c 3 6

3.2.2 Special Matrices

A matrix whose main diagonal elements are 1 and whose other elements are all 0 is called the identity matrix I.

diag(1, nrow=5)

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    0    0    0    0
## [2,]    0    1    0    0    0
## [3,]    0    0    1    0    0
## [4,]    0    0    0    1    0
## [5,]    0    0    0    0    1

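Every eigenvalue of the identity matrix is 1, and the product of any matrix with a conformable identity matrix is the matrix itself, that is:

\[ AI = A, \quad IA = A \]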
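These properties can be verified with %*%, R's matrix multiplication operator. In the sketch below the names A, I2, and I3 are illustrative; because A is 3 x 2, the identity matrix must be 3 x 3 on the left and 2 x 2 on the right.

```r
A <- matrix(1:6, nrow = 3, ncol = 2)   # a 3 x 2 matrix

I3 <- diag(1, nrow = 3)                # 3 x 3 identity
I2 <- diag(1, nrow = 2)                # 2 x 2 identity

print(all(I3 %*% A == A))              # TRUE
print(all(A %*% I2 == A))              # TRUE
print(eigen(I3)$values)                # all eigenvalues equal 1
```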

The identity matrix is needed when deriving the least-squares regression coefficients. Suppose the regression model is:

\[ y=X\beta +\epsilon \]

The normal equation can be written as:

\[(X'X)\hat{\beta}=X'y\]

Multiply both sides of the normal equation by $(X'X)^{-1}$:

\[\begin{aligned} (X'X)\hat{\beta} &= X'y \\ (X'X)^{-1}(X'X)\hat{\beta} &= (X'X)^{-1}X'y \end{aligned}\]

Because $(X'X)^{-1}(X'X)=I$:

\[\begin{aligned} (X'X)^{-1}(X'X)\hat{\beta} &= I \hat{\beta} \\ \hat{\beta} &= (X'X)^{-1}X'y \end{aligned}\]
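The final formula can be checked numerically. The sketch below uses simulated data, so the variables n, x1, y, and the true coefficients 2 and 3 are assumptions made only for illustration. It computes $\hat{\beta}$ from the matrix formula and compares it with lm().

```r
set.seed(1)                          # simulated data, for illustration only
n <- 50
x1 <- rnorm(n)
y  <- 2 + 3 * x1 + rnorm(n)

X <- cbind(1, x1)                    # design matrix with an intercept column

# beta_hat = (X'X)^{-1} X'y
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
print(beta_hat)

# lm() should give the same coefficients
print(coef(lm(y ~ x1)))
```

In practice lm() is preferred, since it relies on a numerically more stable decomposition instead of inverting $X'X$ directly, but the two agree on well-behaved data like this.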

3.3 Data Frame

A data frame combines vectors into a matrix-like data set whose variables can be named. For example:

v1<-c(170, 175, 166, 172, 165, 157, 167, 167, 
        156, 160)
v2<-c("F","M","M","M","F","F","F","F","M","F")
v3<-v1/10 + 42
tmp<-data.frame(height=v1,gender=v2,weight=v3, 
                 stringsAsFactors = FALSE)
tmp

##    height gender weight
## 1     170      F   59.0
## 2     175      M   59.5
## 3     166      M   58.6
## 4     172      M   59.2
## 5     165      F   58.5
## 6     157      F   57.7
## 7     167      F   58.7
## 8     167      F   58.7
## 9     156      M   57.6
## 10    160      F   58.0

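Every column of a data frame must have the same length, and every row must have the same number of entries (numbers, text, and so on).

If the result is not explicitly made a data frame, R treats it as a matrix. For example: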

H<-cbind(LETTERS[1:6], seq(10,60, 10))
H

##      [,1] [,2]
## [1,] "A"  "10"
## [2,] "B"  "20"
## [3,] "C"  "30"
## [4,] "D"  "40"
## [5,] "E"  "50"
## [6,] "F"  "60"

class(H)

## [1] "matrix" "array"

3.3.1 structure

structure() creates an object with specified attributes, for example:

df <- structure(list( 
         year = c(2001, 2002, 2004, 2006),
        length_days = c(366.3240, 365.4124, 366.5323423, 364.9573234)),
        .Names = c("year", "length of days") ,
        row.names = c(NA, -4L)  ,class = "data.frame",
        comment = 'Example Data Set')
df

##   year length of days
## 1 2001          366.3
## 2 2002          365.4
## 3 2004          366.5
## 4 2006          365.0

Another example:

dtt <- structure(list
        (model = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L),
  .Label = c("ma", "mb", "mc"), class = "factor"), 
  year = c(2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L),
  V = c(0.16, 0.14, 0.11, 0.13, 0.15, 0.16, 0.24, 0.17, 0.12, 0.13, 0.15, 0.15, 0.2, 0.16, 0.11, 0.12, 0.12, 0.15),
  lower = c(0.11, 0.11, 0.07, 0.09, 0.11, 0.12, 0.16, 0.12, 0.04, 0.09, 0.09, 0.11, 0.14, 0.1, 0.07, 0.08, 0.05, 0.1), 
  upper = c(0.21, 0.19, 0.17, 0.17, 0.19, 0.2, 0.29, 0.23, 0.16, 0.17, 0.16, 0.2, 0.26, 0.27, 0.15, 0.16, 0.15, 0.19)), 
     .Names = c("model", "year", "V", "lower", "upper"),
    class = "data.frame", row.names = c(NA, -18L))
dtt

##    model year    V lower upper
## 1     ma 2005 0.16  0.11  0.21
## 2     ma 2006 0.14  0.11  0.19
## 3     ma 2007 0.11  0.07  0.17
## 4     ma 2008 0.13  0.09  0.17
## 5     ma 2009 0.15  0.11  0.19
## 6     ma 2010 0.16  0.12  0.20
## 7     mb 2005 0.24  0.16  0.29
## 8     mb 2006 0.17  0.12  0.23
## 9     mb 2007 0.12  0.04  0.16
## 10    mb 2008 0.13  0.09  0.17
## 11    mb 2009 0.15  0.09  0.16
## 12    mb 2010 0.15  0.11  0.20
## 13    mc 2005 0.20  0.14  0.26
## 14    mc 2006 0.16  0.10  0.27
## 15    mc 2007 0.11  0.07  0.15
## 16    mc 2008 0.12  0.08  0.16
## 17    mc 2009 0.12  0.05  0.15
## 18    mc 2010 0.15  0.10  0.19

structure() is best suited to small data sets, but because it assigns variable attributes directly, it is worth using often.

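3.3.2 matrix or data frame?

A data frame allows variables of different types, but a matrix accepts only one type.
ggplot2 can only plot from a data frame.
A matrix can have column names, but its columns are reached by position or name inside brackets, such as matrix[,1] or matrix[,"a"], not with matrix$a.
A matrix has no $\texttt{stringsAsFactors=F}$ setting; only a data frame does.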
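The difference in how columns are accessed can be seen in a small sketch. The objects m, res, and df and their values are made up for illustration.

```r
m <- cbind(a = c(1, 2, 3), b = c(10, 20, 30))   # numeric matrix with column names
print(m[, 1])        # by position
print(m[, "a"])      # by name

# m$a fails because $ does not work on atomic vectors such as matrices
res <- try(m$a, silent = TRUE)
print(inherits(res, "try-error"))   # TRUE

df <- as.data.frame(m)
print(df$a)          # $ works on a data frame
```

Converting with as.data.frame() keeps the column names, so code written with $ works once the matrix has become a data frame.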

We create two simple variables and combine them into a matrix:

a<-c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
b<-c(52.5, 48.4, 57.1, 60.1, 71.1)
dftest <-cbind(a, b)
class(dftest)

## [1] "matrix" "array"

dftest

##      a           b     
## [1,] "Monday"    "52.5"
## [2,] "Tuesday"   "48.4"
## [3,] "Wednesday" "57.1"
## [4,] "Thursday"  "60.1"
## [5,] "Friday"    "71.1"

#ggplot(data=dftest, aes(x=a, y=b, fill=a)) +
#  geom_bar(stat = 'identity')

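Running the ggplot2 code in the block above produces an error. A matrix can still be plotted, though. Using vote and sex from the Chile data as an example, see Figure 3.2: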

tm<-cbind(carData::Chile$vote, carData::Chile$sex)
class(tm)

## [1] "matrix" "array"

plot(table(tm[,2], tm[,1]))

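Figure 3.2: Chile election

If we turn the Monday to Friday example into a data frame, we can draw Figure 3.3 to show the level of happiness on each day:

dt <- as.data.frame(dftest)
dt$b <- as.numeric(dt$b)   # cbind() stored b as character; convert it back to numeric
dt$date<-factor(dt$a, levels=c("Monday", "Tuesday", 
            "Wednesday", "Thursday", "Friday"))
ggplot(data=dt, aes(x = date, y=b, fill = date)) +
   geom_bar(stat = 'identity')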

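Figure 3.3: Happiness from Monday to Friday

Some useful commands for data frames and matrices:
- nrow(x): the number of rows of the data frame or matrix x, which is also the number of observations
- ncol(x): the number of columns of x, which is also the number of variables
- dim(x): the number of rows and columns of x together
- str(x): the structure of x, including variable names and types
- head(x): the first 6 rows of x
- head(x, n=a): the first a rows of x
- colnames(x): show or set the variable (column) names of x
- rownames(x): show or set the name of each row of x
For more on rownames, see this blog post.

For example, to find out how many observations AMSsurvey has: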

nrow(AMSsurvey)

## [1] 24

To change the column names:

colnames(tmp)<-c("V1","V2","V3"); tmp

##     V1 V2   V3
## 1  170  F 59.0
## 2  175  M 59.5
## 3  166  M 58.6
## 4  172  M 59.2
## 5  165  F 58.5
## 6  157  F 57.7
## 7  167  F 58.7
## 8  167  F 58.7
## 9  156  M 57.6
## 10 160  F 58.0

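Some data frames show a running number on the far left and some do not. The running number usually appears because the data were read from a spreadsheet such as a csv file and R assigned row numbers; it is not the first column, and the real first column has a column name. Other data frames use strings as row names: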

library(ISLR)
head(College, n=3)

##                              Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University     Yes 1660   1232    721        23        52
## Adelphi University               Yes 2186   1924    512        16        29
## Adrian College                   Yes 1428   1097    336        22        50
##                              F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University        2885         537     7440       3300   450
## Adelphi University                  2683        1227    12280       6450   750
## Adrian College                      1036          99    11250       3750   400
##                              Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University     2200  70       78      18.1          12   7041
## Adelphi University               1500  29       30      12.2          16  10527
## Adrian College                   1165  53       66      12.9          30   8735
##                              Grad.Rate
## Abilene Christian University        60
## Adelphi University                  56
## Adrian College                      54

rownames can help us delete data we do not need. For example, we have data on the US states:

head(state.x77, n=5)

##            Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
## Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
## Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
## Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
## California      21198   5114        1.1    71.71   10.3    62.6    20 156361

Suppose we want to delete the rows for Alabama, Alaska, and Arkansas. First create a character vector holding the three names:

names.to.delete<-c('Alabama', 'Alaska', 'Arkansas')

Then use which(rownames(data) %in% vector) to return the indices of the matching rows:

rows.to.delete<-which(rownames(state.x77) %in% names.to.delete)

Finally, use $\texttt{data[-c(), ]}$ to drop the selected rows:

newstate <- state.x77[-c(rows.to.delete),]
head(newstate, n=5)

##             Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
## Arizona           2212   4530        1.8    70.55    7.8    58.1    15 113417
## California       21198   5114        1.1    71.71   10.3    62.6    20 156361
## Colorado          2541   4884        0.7    72.06    6.8    63.9   166 103766
## Connecticut       3100   5348        1.1    72.48    3.1    56.0   139   4862
## Delaware           579   4809        0.9    70.06    6.2    54.6   103   1982
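
One caveat with negative indices is worth knowing: if no name matches, which() returns an empty vector, and indexing with an empty negative vector drops every row instead of none. A hedged alternative is to keep rows with a logical vector; the names keep and newstate2 below are our own, chosen for illustration:

```r
names.to.delete <- c("Alabama", "Alaska", "Arkansas")

# TRUE for the rows we want to keep
keep <- !(rownames(state.x77) %in% names.to.delete)
newstate2 <- state.x77[keep, ]
nrow(newstate2)                  # 47

# The pitfall: an empty index with a minus sign selects no rows at all
nrow(state.x77[-integer(0), ])   # 0
```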

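More advanced ways of filtering data will be introduced later in the course.

3.4 Multidimensional

3.4.1 Array (array)

An array can hold more than one matrix. An array whose third dimension has length 1 holds a single matrix and has the same contents as that matrix: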

Array1 <- array(1:12, dim = c(2, 6, 1)); Array1

## , , 1
## 
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    3    5    7    9   11
## [2,]    2    4    6    8   10   12

An array that holds several matrices looks like this:

Array2 <- array(1:12, dim = c(2, 3, 2)); Array2

## , , 1
## 
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6
## 
## , , 2
## 
##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12

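The advantage of an array is that it holds more than one matrix at once. If we need only one of the matrices, we can extract it like this: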

A12<-Array2[,,2]; A12

##      [,1] [,2] [,3]
## [1,]    7    9   11
## [2,]    8   10   12
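
Indexing an array is easier to read when the dimensions are named. The sketch below builds the same values as Array2 under a new name, Array3, with made-up dimension names (r1, c1, m1, and so on), and then summarises each matrix with apply():

```r
# Array3 and its dimnames are hypothetical, for illustration only
Array3 <- array(1:12, dim = c(2, 3, 2),
                dimnames = list(c("r1", "r2"),
                                c("c1", "c2", "c3"),
                                c("m1", "m2")))

dim(Array3)                # 2 3 2
Array3["r1", "c2", "m2"]   # a single element: 9
Array3[, , "m2"]           # the second matrix, same as Array3[, , 2]
apply(Array3, 3, sum)      # sum of each matrix: m1 = 21, m2 = 57
```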

3.4.2 List (list)

A list places no restriction on the length or structure of its elements. For example:

listA<-list(tmp, H, c(xi,tsai)); listA

## [[1]]
##     V1 V2   V3
## 1  170  F 59.0
## 2  175  M 59.5
## 3  166  M 58.6
## 4  172  M 59.2
## 5  165  F 58.5
## 6  157  F 57.7
## 7  167  F 58.7
## 8  167  F 58.7
## 9  156  M 57.6
## 10 160  F 58.0
## 
## [[2]]
##      [,1] [,2]
## [1,] "A"  "10"
## [2,] "B"  "20"
## [3,] "C"  "30"
## [4,] "D"  "40"
## [5,] "E"  "50"
## [6,] "F"  "60"
## 
## [[3]]
## [1] "1953-06-15" "1956-08-31"

As another example, we create two data frames inside a list and give each a name:

list(A=data.frame(x=c(1:5),y=c(101:105)), 
     B=data.frame(v1=rep(NA,6)))

## $A
##   x   y
## 1 1 101
## 2 2 102
## 3 3 103
## 4 4 104
## 5 5 105
## 
## $B
##   v1
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA

To extract one component of a list, we can write:

listA[[3]]

## [1] "1953-06-15" "1956-08-31"
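
Double brackets return the element itself, whereas single brackets return a shorter list. A small sketch with a hypothetical list lst shows the difference:

```r
# lst is a hypothetical list used only for illustration
lst <- list(num = 1:3, chr = c("a", "b"))

class(lst[1])      # "list": single brackets return a sub-list
class(lst[[1]])    # "integer": double brackets return the element itself
lst$chr            # "a" "b": the dollar sign works for named elements
lst[["num"]][2]    # 2: index into the extracted vector
```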

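The advantage of a list is that it can store data that have not yet been put into a common format. When a list becomes large, however, the contents of its matrices and arrays are printed element by element and are hard to locate. Naming the components in advance makes it much easier to tell which elements come from which data. For example:

listB<-list(data=tmp, vec=m, char=c(tsai, xi))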
listB[["data"]]

##     V1 V2   V3
## 1  170  F 59.0
## 2  175  M 59.5
## 3  166  M 58.6
## 4  172  M 59.2
## 5  165  F 58.5
## 6  157  F 57.7
## 7  167  F 58.7
## 8  167  F 58.7
## 9  156  M 57.6
## 10 160  F 58.0

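If the elements of a list all have the same length, we can use $\texttt{setDT()}$ from the data.table package to convert the list into a data.table. For example: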

X = list(1:5, letters[1:5], c('Y','Y','N','Y','N'),
         c("2/27/2018", "6/26/2018", "12/31/2018","1/20/2019","4/8/2019")); X

## [[1]]
## [1] 1 2 3 4 5
## 
## [[2]]
## [1] "a" "b" "c" "d" "e"
## 
## [[3]]
## [1] "Y" "Y" "N" "Y" "N"
## 
## [[4]]
## [1] "2/27/2018"  "6/26/2018"  "12/31/2018" "1/20/2019"  "4/8/2019"

library(data.table)  # setDT() comes from the data.table package
X.dt<-setDT(X); X.dt

##    V1 V2 V3         V4
## 1:  1  a  Y  2/27/2018
## 2:  2  b  Y  6/26/2018
## 3:  3  c  N 12/31/2018
## 4:  4  d  Y  1/20/2019
## 5:  5  e  N   4/8/2019

Exercise: try combining c('a','b','c'), c(1,2,3,4), and
c('2018-01-01', '2018-04-04', '2018-04-05', '2018-06-18', '2018-10-10') into a single list.

3.4.3 Table (table)

The Titanic data is of class table and is also a four-dimensional array:

class(Titanic); Titanic

## [1] "table"

## , , Age = Child, Survived = No
## 
##       Sex
## Class  Male Female
##   1st     0      0
##   2nd     0      0
##   3rd    35     17
##   Crew    0      0
## 
## , , Age = Adult, Survived = No
## 
##       Sex
## Class  Male Female
##   1st   118      4
##   2nd   154     13
##   3rd   387     89
##   Crew  670      3
## 
## , , Age = Child, Survived = Yes
## 
##       Sex
## Class  Male Female
##   1st     5      1
##   2nd    11     13
##   3rd    13     14
##   Crew    0      0
## 
## , , Age = Adult, Survived = Yes
## 
##       Sex
## Class  Male Female
##   1st    57    140
##   2nd    14     80
##   3rd    75     76
##   Crew  192     20

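We can see that this dataset has four variables: class, sex, age, and survival. Because the array has four dimensions, extracting a two-way table requires fixing two of them; fixing three yields a vector, and fixing all four yields a single element. For example, to see the class and sex of the children who did not survive, which is the first table, we can enter: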

T1<-Titanic[, , 1, 1]
class(T1); T1

## [1] "table"

##       Sex
## Class  Male Female
##   1st     0      0
##   2nd     0      0
##   3rd    35     17
##   Crew    0      0
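
Position numbers such as 1, 1 are easy to mix up, so it may help to index by level name instead. A minimal sketch (T1b is our own name for the result):

```r
dimnames(Titanic)                       # names and levels of the four dimensions

T1b <- Titanic[, , "Child", "No"]       # same cells as Titanic[, , 1, 1]
all(T1b == Titanic[, , 1, 1])           # TRUE
Titanic["3rd", "Male", "Child", "No"]   # a single cell: 35
```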

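This shows that all the children who did not survive were in third class, with about twice as many boys as girls (35 versus 17). If we want to explore the relationship between class and survival, we can first try displaying: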

Titanic[, 1, 1,]

##       Survived
## Class  No Yes
##   1st   0   5
##   2nd   0  11
##   3rd  35  13
##   Crew  0   0

Titanic[, 1, 2,]

##       Survived
## Class   No Yes
##   1st  118  57
##   2nd  154  14
##   3rd  387  75
##   Crew 670 192

Titanic[, 2, 1,]

##       Survived
## Class  No Yes
##   1st   0   1
##   2nd   0  13
##   3rd  17  14
##   Crew  0   0

Titanic[, 2, 2,]

##       Survived
## Class   No Yes
##   1st    4 140
##   2nd   13  80
##   3rd   89  76
##   Crew   3  20
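
The four tables above split the class-by-survival relationship across sex and age. One possible way to collapse them into a single table is margin.table() from base R, which sums over the dimensions that are not listed; class_surv is our own name for the result:

```r
# Keep Class (dimension 1) and Survived (dimension 4); sum over Sex and Age
class_surv <- margin.table(Titanic, margin = c(1, 4))
class_surv
#       Survived
# Class   No Yes
#   1st  122 203
#   2nd  167 118
#   3rd  528 178
#   Crew 673 212

# Row proportions: share of survivors within each class
prop.table(class_surv, margin = 1)
```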

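3.4.4 Data table (data.table)

A data table (data.table) is an extension of the data frame that lets us compute on specific rows or columns directly inside the brackets. For example: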

library(data.table)
DT = data.table(a = 1:3, b = c(10,20,30))
DT

##    a  b
## 1: 1 10
## 2: 2 20
## 3: 3 30

DT[1:3, sum(a)]

## [1] 6

DT[1:2, mean(b)]

## [1] 15

After creating a data table, we first compute the sum of variable a and then the mean of the first two values of variable b.
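
The general pattern is DT[i, j, by]: choose rows with i, compute with j, and optionally group with by. The sketch below assumes a hypothetical table DT2 with a grouping column g that is not part of the example above:

```r
library(data.table)

# DT2 and its column g are hypothetical, for illustration only
DT2 <- data.table(g = c("x", "x", "y"), a = 1:3, b = c(10, 20, 30))

DT2[a >= 2, sum(b)]                  # rows where a >= 2: 20 + 30 = 50
DT2[, .(mean_b = mean(b)), by = g]   # mean of b in each group: x = 15, y = 30
```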

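We can also plot directly. For example, we first create a normally distributed variable and then draw the density plot shown in Figure 3.4: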

DT <-data.table(x=rnorm(1000, 0, 1))

DT[, plot(density(x), type='l', xlab='x', ylab='',
          lwd=3, col='#1111ff')]

Figure 3.4: Probability density of a normal distribution

## NULL

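data.table offers many features and is more complex than the data frame; this course, however, mainly uses data frames to handle data. Students interested in data.table can consult the package's website.

3.5 Summary

Among the various data structures, the data frame is the most commonly used. To extract data, it is best to convert other structures into a data frame first. For example, suppose we want to inspect a variable in a dataset: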

g<-Titanic[ , , 2, 2]; class(g)

## [1] "table"

Typing $\texttt{g\$Class}$ produces an error and displays nothing, because g is a table rather than a data frame. Convert it instead:

g<-data.frame(g)

Typing $\texttt{g\$Class}$ now displays the variable.
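
The same conversion works on the whole table. As a rough sketch, as.data.frame() turns the four-dimensional Titanic table into a long data frame with one row per combination of the four variables and a Freq column of counts; tdf is our own name:

```r
tdf <- as.data.frame(Titanic)   # columns: Class, Sex, Age, Survived, Freq
str(tdf)
nrow(tdf)                       # 4 * 2 * 2 * 2 = 32 combinations
levels(tdf$Class)               # "1st" "2nd" "3rd" "Crew"
sum(tdf$Freq)                   # total number of people: 2201
```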

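4 Basic Operations

4.1 Vector Operations

Vectors are ordered, so their elements have a fixed sequence. When a vector is added to, subtracted from, multiplied by, or divided by another vector of the same length, the operation is carried out element by element in that order. When the other operand is a single number, that number is applied to every element. We take a scalar Sca as an example: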

X<-c(10,20,30,40,50,60); Sca<-10
X+Sca

## [1] 20 30 40 50 60 70

With multiplication or division, every element is multiplied or divided by the same number:

X/Sca

## [1] 1 2 3 4 5 6

When multiplying or dividing by another vector, the operation is applied to each pair of corresponding elements:

Y<-c(5,10,6,8,25,6)
X/Y; X*Y

## [1]  2  2  5  5  2 10

## [1]   50  200  180  320 1250  360
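
When the two vectors differ in length, R reuses (recycles) the shorter one. This is the same mechanism that makes the scalar example work, but it can hide mistakes, so a quick illustration may be useful:

```r
X <- c(10, 20, 30, 40, 50, 60)

# c(1, 2) is reused three times
X + c(1, 2)          # 11 22 31 42 51 62

# 6 is not a multiple of 4: R still returns a result but issues a warning
X + c(1, 2, 3, 4)

# When in doubt, compare the lengths before operating
length(X) == length(c(5, 10, 6, 8, 25, 6))   # TRUE
```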

We can also compute squares and square roots:

a<-c(2,3,4); b<-a^2; print(b)

## [1]  4  9 16

c<-sqrt(b); print(c)

## [1] 2 3 4

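4.2 Rounding

There are three functions for rounding numbers:

round: rounds to the nearest value at the specified number of decimal places
floor: rounds down to the nearest integer
ceiling: rounds up to the nearest integer

The following example shows what each function does: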

EX<-c(2.54, 3.111, 10.999)
round(EX, digits=4)

## [1]  2.540  3.111 10.999

floor(EX)

## [1]  2  3 10

ceiling(EX)

## [1]  3  4 11
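
Two details are easy to miss. R's round() follows the IEC 60559 convention of rounding a trailing 5 to the even digit, and floor() and ceiling() behave differently from simple truncation for negative numbers. A short sketch using base R functions only:

```r
round(c(0.5, 1.5, 2.5))    # 0 2 2: halves go to the nearest even integer
round(2.567, digits = 1)   # 2.6

floor(-2.5)                # -3: always toward minus infinity
ceiling(-2.5)              # -2: always toward plus infinity
trunc(-2.5)                # -2: drops the fractional part, toward zero
```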

We can also set, before running any commands, the number of significant digits that R displays:

options(digits = 5)
print(EX)

## [1]  2.540  3.111 10.999

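One annoyance is that R, in order to align the printed numbers, may print more digits after the decimal point than requested. We can use sprintf to control this, but the result is a character string rather than a number: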

perc<-MASS::Animals$brain/MASS::Animals$body
print(perc[1:3], digits=2)

## [1] 6.00 0.91 3.29

sprintf(perc[1:3], fmt="%#.2f")

## [1] "6.00" "0.91" "3.29"

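Summary: none of the display options above can limit the number of decimal places without rounding. For example, they cannot show 10.999 as 10.99; that requires truncating the value itself first, for instance with trunc(EX*100)/100.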
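
A possible workaround is to truncate the number before displaying it. The helper below, trunc_dec, is a hypothetical name of our own and not a base R function; it is only a sketch and, because of floating-point representation, a value that is stored slightly below its printed form may lose one more digit than expected:

```r
# trunc_dec is a hypothetical helper, not part of base R
trunc_dec <- function(x, digits = 2) {
  # shift the decimal point, drop the fractional part, shift back
  trunc(x * 10^digits) / 10^digits
}

EX <- c(2.54, 3.111, 10.999)
trunc_dec(EX)                    # 2.54  3.11 10.99
sprintf("%.2f", trunc_dec(EX))   # "2.54" "3.11" "10.99"
```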

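5 Data Types Assignment

When submitting the assignment, please copy and paste each question so that the teaching assistant can check it easily.

1. Using the Orange data, recode the Tree variable as A, B, C, D, E, and then show the count of each category.

2. Use weekdays() to show that today's class date is a Friday.

3. In the mtcars data, how many rows have wt greater than or equal to 2?

4. Create a data frame or data table that contains a running serial number, one numeric variable, and one categorical variable.

5. Create a matrix whose diagonal is {0,0,0}.

6. Use data.table to compute the number of male and female survivors in Titanic.

7. How many observations does the faithful data have?

8. Create three $3\times 2$ arrays.

9. Recode the employed variable in the Arrests data as 0, 1.

10. Show today's date in three different formats.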

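6 Basic Operations Assignment

1. Use code to compute $\text{log}(\frac{14}{5})=$?

2. Use code to compute $1\times 2\times 3\times \dots \times 8=$?

3. Use code to count how many letters the English alphabet has.

4. Use code to show how many days of this year have passed as of the day you do this assignment.

5. Write a piece of code that converts today's temperature to Fahrenheit.

6. Xiao Jie's grandfather was born in ROC (Minguo) year 35 and his grandmother in year 36; his father was born in year 53 and his mother in year 60; he himself is 27 years old and his younger sister is 23. What is the average age of Xiao Jie's family? What is the age difference between the oldest and the youngest? What is the sum of their ages?

7. Continuing from the previous question, what is the average age of the 3 women in Xiao Jie's family? Of the 3 men?

8. Continuing from the previous question, suppose Xiao Jie marries and his wife is 26, so the family gains one member. What is the average age of the women in the family after the marriage?

9. Create a data frame that lists the seven colors of the rainbow in Chinese and English. Then sort the whole data frame in ascending order of the English names.

10. Continuing from the previous question, use the data.table package to sort that data frame.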

7 Last Updated

Last updated: 2024-03-09 11:51:10

Statistical Methods for the Social Sciences

R Data Types and Basic Operations

蔡佳泓

2024
