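This week's class introduces R's data types and data structures, that is, the objects that R can read and operate on. For example, we use c() to represent a collection of numbers or of text: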
A<-c("台北市","新北市", "桃園市", "台中市","台南市","高雄市")
print(A)
[1] "台北市" "新北市" "桃園市" "台中市" "台南市" "高雄市"
B<-c(0,1,2,3,4,5,6,7,8,9)
print(B)
[1] 0 1 2 3 4 5 6 7 8 9
A is a collection of text and B is a collection of numbers. Other commonly used data types include logical values and dates.
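As a minimal sketch of the four types just named, class() reports the type of each kind of value. The example values and the names `types` and `type_names` are made up for illustration.

```r
# Hypothetical example values, one per data type named in the text
types <- list(
  text    = c("台北市", "新北市"),      # character
  number  = c(0, 1, 2),                # numeric
  logical = c(TRUE, FALSE),            # logical
  date    = as.Date("2022-03-05")      # Date
)
type_names <- sapply(types, class)     # class of each list element
print(type_names)
```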
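R's data structures can be one-dimensional (vectors), two-dimensional (matrices and data frames), or multidimensional (arrays).
Starting from there, we get to know R's data types and operations.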
Before turning to R's data types, we first get to know R's functions.
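R is an object-oriented language. Here this means that the user gives a name to a function, vector, matrix, loop, and so on, or bundles code and data into a single object; the system stores it in the working environment, where the user can call, run, or modify it. Users can stack objects like building blocks to reach the result they want.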
R's environment holds all kinds of objects, which the function ls() lists. For example:
[1] "a" "A" "b" "B" "f"
[1] 4
[1] 25
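The idea of storing named objects and calling them later can be sketched as follows. This is only an illustration: it uses a separate environment, `demo_env`, so that nothing in the global workspace is touched, and the objects `a` and `f` are hypothetical.

```r
# A fresh environment, so the sketch does not alter the global workspace
demo_env <- new.env()
assign("a", 4, envir = demo_env)                  # store a number
assign("f", function(x) x^2, envir = demo_env)    # store a function

object_names <- ls(demo_env)   # names of the stored objects, sorted
print(object_names)
print(demo_env$a)              # retrieve the number
print(demo_env$f(5))           # call the stored function
```

Calling ls() with no argument lists the current (usually global) environment in the same way.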
R's function for the mean is:
?mean
mean(x, trim = 0, na.rm = FALSE)
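x is the variable; it can be a vector or a matrix, but its elements cannot be character strings. trim sets the fraction of observations to drop from each end of the sorted variable before averaging. For example, take the following variable: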
x<-c(100000, 10000000, c(1:10))
sort(x)
## [1] 1e+00 2e+00 3e+00 4e+00 5e+00 6e+00 7e+00 8e+00 9e+00 1e+01 1e+05 1e+07
mean(x)
## [1] 841671
mean(x, trim=0.1)
## [1] 10005
mean(sort(x)[2:11])
## [1] 10005
mean(x, trim=0.2)
## [1] 6.5
mean(sort(x)[3:10])
## [1] 6.5
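The pattern in these results can be checked with a small helper. This is a sketch that assumes the rule documented for mean(): floor(n * trim) observations are dropped from each end of the sorted vector. The function name `trimmed_by_hand` is hypothetical.

```r
x <- c(100000, 10000000, 1:10)

# Hypothetical helper: drop floor(n * trim) values from each end, then average
trimmed_by_hand <- function(v, trim) {
  n <- length(v)
  k <- floor(n * trim)
  mean(sort(v)[(k + 1):(n - k)])
}

by_hand_10 <- trimmed_by_hand(x, 0.1)   # 12 * 0.1 = 1.2, so 1 value per end
by_hand_20 <- trimmed_by_hand(x, 0.2)   # 12 * 0.2 = 2.4, so 2 values per end
print(c(by_hand_10, mean(x, trim = 0.1)))
print(c(by_hand_20, mean(x, trim = 0.2)))
```

With 12 observations, trim=0.1 and trim=0.2 therefore correspond to sort(x)[2:11] and sort(x)[3:10], matching the calls above.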
X<-c(2, 4, 6, 8); X
[1] 2 4 6 8
class(X)
[1] “numeric”
typeof(X)
[1] “double”
str(X)
num [1:4] 2 4 6 8
y=c(1.1e+06); y
[1] 1100000
class(y)
[1] “numeric”
u<-as.integer(c(4)); class(u)
[1] “integer”
is.numeric(100)
is.integer(100)
a<-c(7, 8.5, 9); class(a)
[1] “numeric”
b<-as.integer(a); b
[1] 7 8 9
Append L to a number and R stores it as an integer.
a<-3L; class(a)
[1] “integer”
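A short sketch of the difference between numeric and integer storage, using made-up values. Note that as.integer() drops the fractional part rather than rounding.

```r
plain_number <- is.integer(100)    # FALSE: 100 is stored as double
with_suffix  <- is.integer(100L)   # TRUE: the L suffix requests an integer
sequence_type <- typeof(1:3)       # "integer": the colon operator yields integers

# as.integer() truncates toward zero
truncated <- as.integer(c(8.5, -8.5))
print(truncated)
```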
sys_date <- Sys.Date()
as.numeric(sys_date)
[1] 19056
as.integer(sys_date)
[1] 19056
date_of_origin <- as.Date("1970-01-01")
as.integer(sys_date) - as.integer(date_of_origin)
[1] 19056
R shows the nature of the variable and the corresponding result:
b <- c(1, 2, 3, 10)
h<-c(100, 200, 300, 500)
ok<-h>300
b[ok]
[1] 10
ok<-h>1000
b[ok]
numeric(0)
state.abb
class(state.abb)
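Figure 2.1: Dot plot of the population of each U.S. state
Figure 2.2: Dot plot of state populations after sorting
state.x77 is a matrix, so we take the population column out of this matrix, combine it with state.abb into a data frame, and then draw a dot plot with the dotplot or dotchart command.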
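The steps just described might look like the sketch below. It is not the code behind Figures 2.1 and 2.2, which is not shown here; the names `pop` and `pop_sorted` and the plotting options are assumptions.

```r
# state.x77 is a matrix, so select the column by name with [ , "Population"]
pop <- data.frame(state = state.abb,
                  population = state.x77[, "Population"])
pop_sorted <- pop[order(pop$population), ]   # smallest population first

# dotchart() is in base graphics; dotplot() would need the lattice package
dotchart(pop_sorted$population, labels = pop_sorted$state,
         cex = 0.6, xlab = "Population")
```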
char1<-c("1","2","3","4","5"); char1
[1] “1” “2” “3” “4” “5”
char2<-c(1, 2, "文字"); char2
[1] “1” “2” “文字”
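Character strings that are not numerals cannot be turned into numbers with as.numeric, but they can be recoded with code (see the section on factors).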
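To illustrate, the sketch below contrasts strings that contain only numerals with a vector that mixes in ordinary text. The values are made up, and suppressWarnings() is used only to silence the coercion warning.

```r
digits_as_text <- c("1", "2", "3")
from_digits <- as.numeric(digits_as_text)   # numerals convert cleanly

mixed <- c("1", "2", "abc")
# "abc" has no numeric meaning, so it becomes NA (with a warning)
from_mixed <- suppressWarnings(as.numeric(mixed))

print(from_digits)
print(from_mixed)
```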
class(row.names(quakes))
[1] “character”
head(state.x77)
           Population Income Illiteracy Life Exp Murder HS Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766
class(row.names(state.x77))
[1] “character”
library(foreign); library(tidyverse)
file<-here::here('data','opendata106N0101.csv')
opendf<-read.csv(file, header=T, sep=',', fileEncoding = 'UTF-8')
str(opendf)
## 'data.frame': 375 obs. of 4 variables:
## $ code : chr "新北市板橋區" "新北市三重區" "新北市中和區" "新北市永和區" ...
## $ 年底人口數: chr "551480" "387484" "413590" "222585" ...
## $ 土地面積 : num 23.14 16.32 20.14 5.71 19.74 ...
## $ 人口密度 : chr "23835" "23747" "20532" "38956" ...
file<-here::here('data','opendata106N0101.csv')
dat<-read.csv(file, header=T, stringsAsFactors = F)
nrow(dat) #check how many rows; n=375
[1] 375
dat <- dat[-c(369:375),] #delete the rows of small islands and notes
head(dat, n=3)
code 年底人口數 土地面積 人口密度
1 新北市板橋區 551480 23.14 23835
2 新北市三重區 387484 16.32 23747
3 新北市中和區 413590 20.14 20532
dat <- dat %>% mutate(popu=as.numeric(年底人口數))
str(dat)
## 'data.frame': 368 obs. of 5 variables:
## $ code : chr "新北市板橋區" "新北市三重區" "新北市中和區" "新北市永和區" ...
## $ 年底人口數: chr "551480" "387484" "413590" "222585" ...
## $ 土地面積 : num 23.14 16.32 20.14 5.71 19.74 ...
## $ 人口密度 : chr "23835" "23747" "20532" "38956" ...
## $ popu : num 551480 387484 413590 222585 416524 ...
library(data.table)
file<-here::here('data','opendata106N0101.csv')
DT <- read.csv(file, header=F, stringsAsFactors = F)
colnames(DT) <-DT[1,]
nrow(DT) #check how many rows; n=376
DT <- DT[-c(1, 370:376),] #delete the first row and the rows of notes
DT <- data.table(DT)
DT <- DT %>% transform(年底人口數=as.numeric(年底人口數))
head(DT, n=3)
str(DT)
min(DT$年底人口數) #check if data has no missing value
x=c("Yes","No","No","Yes","Yes"); x
## [1] "Yes" "No" "No" "Yes" "Yes"
factor(x)
## [1] Yes No No Yes Yes
## Levels: No Yes
table(x)
## x
## No Yes
## 2 3
The function factor() converts x into a factor with two categories, No and Yes.
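A factor stores its categories as levels plus an integer code for each observation. The sketch below shows both parts for the same x; the names `f` and `codes` are assumptions.

```r
x <- c("Yes", "No", "No", "Yes", "Yes")
f <- factor(x)

lv <- levels(f)          # categories, sorted alphabetically: "No" "Yes"
codes <- as.integer(f)   # position of each observation's level: No = 1, Yes = 2
print(lv)
print(codes)
```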
str(Chile)
## 'data.frame': 2700 obs. of 8 variables:
## $ region : Factor w/ 5 levels "C","M","N","S",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ population: int 175000 175000 175000 175000 175000 175000 175000 175000 175000 175000 ...
## $ sex : Factor w/ 2 levels "F","M": 2 2 1 1 1 1 2 1 1 2 ...
## $ age : int 65 29 38 49 23 28 26 24 41 41 ...
## $ education : Factor w/ 3 levels "P","PS","S": 1 2 1 1 3 1 2 3 1 1 ...
## $ income : int 35000 7500 15000 35000 35000 7500 35000 15000 15000 15000 ...
## $ statusquo : num 1.01 -1.3 1.23 -1.03 -1.1 ...
## $ vote : Factor w/ 4 levels "A","N","U","Y": 4 2 4 2 2 2 2 2 3 2 ...
kableExtra::kable_styling(knitr::kable(table(Chile$sex, Chile$vote)),
bootstrap_options = "striped", full_width = F)
 | A | N | U | Y |
---|---|---|---|---|
F | 104 | 363 | 362 | 480 |
M | 83 | 526 | 226 | 388 |
Chile$ncode<-as.numeric(Chile$region)
kableExtra::kable_styling(knitr::kable(table(Chile$ncode, Chile$vote)),
bootstrap_options = "striped", full_width = F)
 | A | N | U | Y |
---|---|---|---|---|
1 | 44 | 210 | 141 | 174 |
2 | 2 | 18 | 23 | 38 |
3 | 30 | 102 | 46 | 135 |
4 | 42 | 214 | 148 | 275 |
5 | 69 | 345 | 230 | 246 |
library(lattice)
plot(Chile$sex, Chile$vote, xlab="Sex", ylab="Vote")
Figure 2.3: Sex and vote (1)
gender<-as.numeric(Chile$sex)
kableExtra::kable_styling(knitr::kable(table(gender), caption="性別分佈"),
bootstrap_options = "striped", full_width = F)
gender | Freq |
---|---|
1 | 1379 |
2 | 1321 |
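R converts categories to numbers following the alphabetical order of the category labels. Recoding those numbers further is then easy: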
sex <- c()
sex[gender==2]<-0
sex[gender==1]<-1
kableExtra::kable_styling(knitr::kable(table(sex)),bootstrap_options = "striped", full_width = F)
sex | Freq |
---|---|
0 | 1321 |
1 | 1379 |
ngender<-c()
ngender[Chile$sex=='F']<-1
ngender[Chile$sex=='M']<-0
kableExtra::kable_styling(knitr::kable(table(ngender)),
bootstrap_options = "striped", full_width = F)
ngender | Freq |
---|---|
0 | 1321 |
1 | 1379 |
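The same recoding can be tried without the Chile data. The sketch below uses a small made-up factor and ifelse(), which is one alternative to assigning into an empty vector; the names `codes` and `female` are assumptions.

```r
# Hypothetical five-person example
sex <- factor(c("M", "M", "F", "F", "F"))

codes <- as.numeric(sex)             # alphabetical level order: F = 1, M = 2
female <- ifelse(sex == "F", 1, 0)   # explicit recode: F = 1, M = 0

print(table(codes))
print(table(female))
```

Writing the mapping out with ifelse() keeps the meaning of 0 and 1 visible in the code, rather than relying on the alphabetical order of the levels.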
Chile$gender[Chile$sex=="F"]<-"Female"
Chile$gender[Chile$sex=="M"]<-"Male"
class(Chile$gender)
[1] “character”
kableExtra::kable_styling(knitr::kable(table(Chile$gender)),
bootstrap_options = "striped", full_width = F)
Var1 | Freq |
---|---|
Female | 1379 |
Male | 1321 |
str(Chile)
## 'data.frame': 2700 obs. of 10 variables:
## $ region : Factor w/ 5 levels "C","M","N","S",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ population: int 175000 175000 175000 175000 175000 175000 175000 175000 175000 175000 ...
## $ sex : Factor w/ 2 levels "F","M": 2 2 1 1 1 1 2 1 1 2 ...
## $ age : int 65 29 38 49 23 28 26 24 41 41 ...
## $ education : Factor w/ 3 levels "P","PS","S": 1 2 1 1 3 1 2 3 1 1 ...
## $ income : int 35000 7500 15000 35000 35000 7500 35000 15000 15000 15000 ...
## $ statusquo : num 1.01 -1.3 1.23 -1.03 -1.1 ...
## $ vote : Factor w/ 4 levels "A","N","U","Y": 4 2 4 2 2 2 2 2 3 2 ...
## $ ncode : num 3 3 3 3 3 3 3 3 3 3 ...
## $ gender : chr "Male" "Male" "Female" "Female" ...
data_char2 <- Chile # Duplicate data
fac_cols <- sapply(data_char2, is.factor) # Identify all factor columns
data_char2[fac_cols] <- lapply(data_char2[fac_cols], as.character) # Convert all factor
str(data_char2)
## 'data.frame': 2700 obs. of 10 variables:
## $ region : chr "N" "N" "N" "N" ...
## $ population: int 175000 175000 175000 175000 175000 175000 175000 175000 175000 175000 ...
## $ sex : chr "M" "M" "F" "F" ...
## $ age : int 65 29 38 49 23 28 26 24 41 41 ...
## $ education : chr "P" "PS" "P" "P" ...
## $ income : int 35000 7500 15000 35000 35000 7500 35000 15000 15000 15000 ...
## $ statusquo : num 1.01 -1.3 1.23 -1.03 -1.1 ...
## $ vote : chr "Y" "N" "Y" "N" ...
## $ ncode : num 3 3 3 3 3 3 3 3 3 3 ...
## $ gender : chr "Male" "Male" "Female" "Female" ...
## ISLR::Auto$cylinders n percent
## 3 4 0.010204
## 4 199 0.507653
## 5 3 0.007653
## 6 83 0.211735
## 8 103 0.262755
library(ISLR); library(dplyr)
data(Auto)
CYL <- c()
CYL[Auto$cylinders==3] <- '3 cylinders'
CYL[Auto$cylinders==4] <- '4 cylinders'
CYL[Auto$cylinders==5] <- '5 cylinders'
CYL[Auto$cylinders==6] <- '6 cylinders'
CYL[Auto$cylinders==8] <- '8 cylinders'
df <- data.frame(CYL=CYL)
df %>% janitor::tabyl(CYL)
##          CYL   n  percent
##  3 cylinders   4 0.010204
##  4 cylinders 199 0.507653
##  5 cylinders   3 0.007653
##  6 cylinders  83 0.211735
##  8 cylinders 103 0.262755
df$fac_cols <- as.factor(df$CYL)
ggplot2::ggplot(df,aes(x = fac_cols)) +
geom_bar(stat='count', aes(fill = fac_cols))
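Figure 2.4: Bar chart of the number of cylinders
 | A | N | U | Y |
---|---|---|---|---|
Female | 104 | 363 | 362 | 480 |
Male | 83 | 526 | 226 | 388 |
Figure 2.5: Sex and vote (2)
A character variable cannot be converted to numbers directly, because R cannot tell which string should be given which number.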
as.numeric(Chile$gender)
☛ As an exercise, try converting categorical variables in AMSsurvey, such as citizen.
Var1 | Freq |
---|---|
82 | 3 |
84 | 4 |
86 | 3 |
88 | 4 |
90 | 3 |
92 | 4 |
93 | 1 |
x<-c("花蓮縣","臺北市","屏東縣","臺南市","高雄市");x
table(x)
Use the levels argument to set the order of the categories:
xf<-factor(x, levels=c("臺北市", "臺南市","高雄市","屏東縣","花蓮縣"))
xf | Freq |
---|---|
臺北市 | 1 |
臺南市 | 1 |
高雄市 | 1 |
屏東縣 | 1 |
花蓮縣 | 1 |
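The function factor() has a logical argument, ordered. Once levels is specified, setting ordered to TRUE does not change the order in which the levels appear. The function ordered(), however, returns a factor that is already ordered, for example: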
od<-ordered(1:20); class(od)
[1] “ordered” “factor”
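What the ordered argument does change is the class of the result and whether comparisons are meaningful. The sketch below uses made-up values; `f_plain`, `f_ord`, and `below_high` are hypothetical names.

```r
sizes <- c("low", "high", "medium", "low")
lv <- c("low", "medium", "high")

f_plain <- factor(sizes, levels = lv)                   # unordered factor
f_ord   <- factor(sizes, levels = lv, ordered = TRUE)   # ordered factor

same_levels <- identical(levels(f_plain), levels(f_ord))  # level order is the same
print(is.ordered(f_plain))
print(is.ordered(f_ord))

# Comparisons such as < are only meaningful for the ordered factor
below_high <- f_ord < "high"
print(below_high)
```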
☛ As an exercise, set the level order of vote in Chile to "Y", "N", "A", "U".
a<-c(0:9); a
[1] 0 1 2 3 4 5 6 7 8 9
ok<-a>5; ok
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
a[ok]
[1] 6 7 8 9
x <- c(2,7,9,2,NA,5) # 6 elements
r <- c(TRUE,TRUE,FALSE,FALSE,FALSE,FALSE)
x[r]
[1] 2 7
x[c(TRUE,FALSE)] # odd numbered elements
[1] 2 9 NA
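When the logical vector itself contains NA, the selection returns NA in that position. A sketch with the same x, where `keep`, `with_na`, `without_na`, and `n_true` are assumed names:

```r
x <- c(2, 7, 9, 2, NA, 5)
keep <- x > 4                  # FALSE TRUE TRUE FALSE NA TRUE

with_na    <- x[keep]          # the NA in keep yields an NA element
without_na <- x[which(keep)]   # which() returns only the positions that are TRUE
n_true     <- sum(keep, na.rm = TRUE)   # each TRUE counts as 1

print(with_na)
print(without_na)
print(n_true)
```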
☛ Run the following code and answer: how many observations remain in the variable after filtering?
head(Duncan)
ok<-Duncan$income>50
Duncan$income[ok]
library(data.table)
H<-data.table(Age=c(NA,"30-39","40-49","20-29",
"20-29","60-69","60-69","30-39","60-69",
"30-39","20-29","20-29","30-39","40-49",
"40-49", "40-49","50-59","50-59","20-29",NA),
Vote=c("Ding", NA, "Ko", "Ko", "Ko",
"Ding","Ding",NA,NA, "Ko","Yao","Yao", "Yao","Ding","Ko","Ko","Yao","Yao","Ding","Ko"),
pride=c(3,NA, 7, 3, NA, 5, 5, 4,
NA,2,5, 1,6,8, 7, 6, 1, 3,3,5))
ok | Freq |
---|---|
FALSE | 3 |
TRUE | 17 |
## [1] NA
## [1] 4.353
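The code that produced these results is not shown. One way to arrive at the same numbers is sketched below, assuming `ok` flags the respondents who answered pride; only the pride values from H are repeated here so that the sketch runs on its own.

```r
# pride values copied from the H data above
pride <- c(3, NA, 7, 3, NA, 5, 5, 4, NA, 2, 5, 1, 6, 8, 7, 6, 1, 3, 3, 5)

ok <- !is.na(pride)               # TRUE where pride was answered
print(table(ok))

mean_all   <- mean(pride)         # NA, because missing values are included
mean_valid <- mean(pride[ok])     # mean over the answered cases only
print(mean_all)
print(mean_valid)
```

mean(pride, na.rm = TRUE) gives the same value as mean(pride[ok]).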
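The example above deals with respondents who did not answer pride. If we want to show the relationship between age and vote in the H data, does it matter whether we set a logical condition to drop the missing values?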
 | Ding | Ko | Yao |
---|---|---|---|
20-29 | 1 | 2 | 2 |
30-39 | 0 | 1 | 1 |
40-49 | 1 | 3 | 0 |
50-59 | 0 | 0 | 2 |
60-69 | 2 | 0 | 0 |
 | Ding | Ko | Yao |
---|---|---|---|
20-29 | 1 | 2 | 2 |
30-39 | 0 | 1 | 1 |
40-49 | 1 | 3 | 0 |
50-59 | 0 | 0 | 2 |
60-69 | 2 | 0 | 0 |
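R automatically drops missing values that cannot be tabulated. Once we cross-tabulate with other variables, however, differences appear.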
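How table() treats missing values can be made visible with its useNA argument. The sketch below uses a small made-up sample rather than H.

```r
# Hypothetical mini-sample with missing age and vote
age  <- c(NA, "30-39", "40-49", "20-29", "20-29")
vote <- c("Ding", NA, "Ko", "Ko", "Yao")

t_default <- table(age, vote)                    # cases with any NA are left out
t_with_na <- table(age, vote, useNA = "ifany")   # NA shown as its own category

print(t_default)
print(t_with_na)
```

Comparing sum(t_default) with sum(t_with_na) shows how many cases the default table silently omits.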
as.Date() and as.POSIXct() can convert character strings into dates.
v<-c("1/27/2020", "6/26/2020", "12/31/2021"); class(v)
[1] “character”
v.date1<-as.Date(v, format='%m/%d/%Y'); class(v.date1); v.date1
[1] "Date"
[1] "2020-01-27" "2020-06-26" "2021-12-31"
v.date2<-as.POSIXct(v, format='%m/%d/%Y'); class(v.date2);v.date2
[1] "POSIXct" "POSIXt"
[1] "2020-01-27 CST" "2020-06-26 CST" "2021-12-31 CST"
Or:
v<-c("", "6/26/2018", "12/31/2018")
as.Date(v, format='%m/%d/%Y')
[1] NA “2018-06-26” “2018-12-31”
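Two points are worth checking by hand: the format string must match the layout of the text, and once converted, dates support arithmetic. A sketch with assumed names `d`, `gap`, and `mismatch`:

```r
v <- c("1/27/2020", "12/31/2021")
d <- as.Date(v, format = "%m/%d/%Y")

gap <- as.numeric(d[2] - d[1])   # number of days between the two dates

# A string in year-month-day layout does not fit "%m/%d/%Y", so the result is NA
mismatch <- as.Date("2020-01-27", format = "%m/%d/%Y")

print(gap)
print(mismatch)
```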
format() converts data that already has the Date class into a different format, for example:
today <- Sys.Date()
format(today, format='%m/%d/%Y')
[1] “03/05/2022”
format(today, format='%Y-%m-%d')
[1] “2022-03-05”
format(v.date2, "%m-%d-%Y")
[1] “01-27-2020” “06-26-2020” “12-31-2021”
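Symbol | Meaning | Example |
---|---|---|
%d | Day of the month | 01-31 |
%a | Abbreviated weekday name | Mon |
%A | Full weekday name | Monday |
%m | Month (number) | 01-12 |
%b | Abbreviated month name | Jan |
%B | Full month name | January |
%y | Two-digit year | 18 |
%Y | Four-digit year | 2018 |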
The function format() works on data that is already in date format, for example:
Today<-Sys.Date(); Today
[1] “2022-03-05”
today_format1<-format(Today, format='%Y-%b-%d'); today_format1
[1] “2022- 3-05”
today_format2<-format(Today, format='%b/%d/%y'); today_format2
[1] " 3/05/22"
today_format3<-format(Today, format='%Y年%b月%d日(%a)'); today_format3
[1] “2022年 3月05日(六)”
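The month appears as " 3" and the weekday as 六 because %b and %a follow the locale of the R session, which was evidently a Chinese one here. As a sketch, switching LC_TIME to the "C" locale yields English abbreviations; the variable names are assumptions, and the original locale is restored at the end.

```r
old_locale <- Sys.getlocale("LC_TIME")       # remember the current setting
invisible(Sys.setlocale("LC_TIME", "C"))     # "C" gives English names

d <- as.Date("2022-03-05")
english_label <- format(d, "%Y-%b-%d")       # abbreviated month name
weekday_label <- format(d, "%a")             # abbreviated weekday name
numeric_label <- format(d, "%Y-%m-%d")       # digits only, same in any locale

invisible(Sys.setlocale("LC_TIME", old_locale))   # restore the setting
print(c(english_label, weekday_label, numeric_label))
```

When output has to look the same on every machine, the numeric codes %m and %d avoid the dependence on locale.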
format can also extract parts of a date, as follows:
j <- c("1/1/2018","2/11/2019", "7/6/2020")
j <- as.Date(j, format="%m/%d/%Y")
print(format(j, "%m"))
[1] “01” “02” “07”
print(format(j, "%Y"))
[1] “2018” “2019” “2020”
file <- here::here('data','tsaipopularity0921.csv')
tmp <- read.csv(file, header = T, sep=',')
tmp$Date <- format(tmp$Date, format="%y-%b")
tmp$Date
[1] "18-Mar" "18-Jun" "18-Sep" "18-Dec" "19-Mar" "19-Jun" "19-Sep" "19-Dec"
[9] "20-Mar" "20-Mar" "20-Jun" "20-Sep" "20-Dec" "21-Mar" "21-Jun"
ggplot(data=tmp, aes(x=Date, y=Tsai, group=1)) +
geom_line(size=1.5, col="goldenrod") +
geom_point(shape=1, size=3) +
labs(y="%",
subtitle="Data: Taiwan's Election and Democratization Study") +
theme_bw() +
ggtitle("Percentage of Satisfaction with President Tsai's Performance")+
geom_text(data=tmp, aes(x=Date, label=Tsai),
vjust=0, hjust=-0.3, size=5) +
geom_text(label="Level 3 alert", x=14, y=45, col="#FF5511", size=4) +
geom_text(label="Presidential election, Covid-19", x=7.9, y=70, col="#FF5511", size=4) +
geom_text(label="One Country Two Systems", x=4.8, y=35, col="#FF5511", size=4) +
theme(axis.title= element_text(color="blue", size=14, face="bold"),
axis.text = element_text(size=9))
Figure 2.6: Satisfaction with the president's performance
xi<-"1953-06-15" #Xi's birthday
tsai<-"1956-08-31" #Tsai's birthday
as.Date(c(xi,tsai), origin="1904-01-01")
[1] “1953-06-15” “1956-08-31”
difftime(tsai, xi)
Time difference of 1173 days
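The origin argument is optional when converting date strings. When computing the date that a number stands for, however, supply the starting date through origin: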
as.Date(1100, origin="2018-08-01")
[1] “2021-08-05”
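Both calculations can be cross-checked with ordinary date arithmetic. In this sketch the dates are converted explicitly with as.Date(), and the names `gap_days`, `gap_weeks`, and `later` are assumptions.

```r
xi   <- as.Date("1953-06-15")
tsai <- as.Date("1956-08-31")

gap_days  <- as.numeric(difftime(tsai, xi, units = "days"))
gap_weeks <- as.numeric(difftime(tsai, xi, units = "weeks"))

# Adding a number to a Date moves it forward by that many days
later <- as.Date("2018-08-01") + 1100

print(gap_days)
print(gap_weeks)
print(later)
```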
Having covered data types, we now turn to data structures.
example<-c(0,1,2,3,4)
print(example)
## [1] 0 1 2 3 4
Or:
c(2,4,6,8)->A; A
## [1] 2 4 6 8
c(letters)
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
c(LETTERS)
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
shares <- c(150, 40, 65)
names(shares) <- c('Finance','Technology','Cash')
shares
   Finance Technology       Cash 
       150         40         65 
class(shares)
[1] “numeric”
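Names also make it possible to select elements by label instead of by position. A brief sketch, in which `cash_amount`, `proportions`, and `largest` are assumed names:

```r
shares <- c(Finance = 150, Technology = 40, Cash = 65)   # names set at creation

cash_amount <- shares[["Cash"]]               # select one element by name
proportions <- shares / sum(shares)           # names are kept through arithmetic
largest <- names(shares)[which.max(shares)]   # label of the largest holding

print(cash_amount)
print(round(proportions, 3))
print(largest)
```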
cash<-c(100, 120, 80, 65)
names(cash) <- c(2016, 2017, 2018, 2019)
par(mfrow=c(1,2), bg='lightgreen',mai=c(0.4,0.3,0.1,0.3))
pie(shares); barplot(cash, cex.axis = 0.8)
Figure 3.1: Allocation of funds
j<-c(2*2, 2*9, 10-2, 3^3); j
## [1] 4 18 8 27
R<-c(100, 200, 300); R/5; sqrt(R)
## [1] 20 40 60
## [1] 10.00 14.14 17.32
c(j, R)
## [1] 4 18 8 27 100 200 300
Y<-c(j, c(9:5), R[c(1,2)]); Y
## [1] 4 18 8 27 9 8 7 6 5 100 200
R's vectors can be concatenated, so we can add more data to them.
Y[-c(8:12)]
## [1] 4 18 8 27 9 8 7
One of R's data structures is the matrix. VADeaths, for example, is a data set stored as a matrix:
data("VADeaths"); VADeaths
## Rural Male Rural Female Urban Male Urban Female
## 50-54 11.7 8.7 15.4 8.4
## 55-59 18.1 11.7 24.3 13.6
## 60-64 26.9 20.3 37.0 19.3
## 65-69 41.0 30.9 54.6 35.1
## 70-74 66.0 54.3 71.1 50.0
class(VADeaths)
## [1] "matrix" "array"
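Matrices in R resemble matrices in mathematics. A matrix is indexed by row first, then by column. For example, a \(3\times 3\) matrix can be written as: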
m<-matrix(c(1:9), nrow=3, ncol=3); m
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
n<-matrix(c(1:6), nrow=3, ncol=2);n
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
m<-matrix(c(1:9), nrow=3, ncol=3); n<-matrix(c(1:6), nrow=3, ncol=2); m%*%n
## [,1] [,2]
## [1,] 30 66
## [2,] 36 81
## [3,] 42 96
diag(m)
## [1] 1 5 9
t(m)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
m[2,3]; m[2,3]<-0
## [1] 8
n<-matrix(c(1:6), nrow=3, ncol=2, byrow=T)
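The effect of byrow is easiest to see side by side. In this sketch `by_col` and `by_row` are assumed names.

```r
by_col <- matrix(1:6, nrow = 3, ncol = 2)                 # default: fill column by column
by_row <- matrix(1:6, nrow = 3, ncol = 2, byrow = TRUE)   # fill row by row

print(by_col)
print(by_row)

# Filling by row equals transposing a 2 x 3 matrix that was filled by column
same <- identical(by_row, t(matrix(1:6, nrow = 2, ncol = 3)))
print(same)
```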
The dimnames argument assigns names to the rows and the columns respectively, for example:
n<-matrix(c(1:6), nrow=3, ncol=2, dimnames = list(c("a","b","c"),c("A","B"))); n
## A B
## a 1 4
## b 2 5
## c 3 6
diag(1, nrow=5)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0 0 0 0
## [2,] 0 1 0 0 0
## [3,] 0 0 1 0 0
## [4,] 0 0 0 1 0
## [5,] 0 0 0 0 1
A %*% I = A
I %*% A = A
\[ y=X\beta +\epsilon \]
\[(X'X)\hat{\beta}=X'y\]
\[\begin{align*} (X'X)\hat{\beta}=X'y \\ (X'X)^{-1}(X'X)\hat{\beta}=(X'X)^{-1}X'y \\ \end{align*}\]
Because \((X'X)^{-1}(X'X)=I\), it follows that:
\[\begin{align*} (X'X)^{-1}(X'X)\hat{\beta}= I \hat{\beta} \\ \hat{\beta} = (X'X)^{-1}X'y \end{align*}\]
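The closed-form solution can be checked numerically. The sketch below simulates a small data set (the coefficients 2 and 3, the sample size, and the seed are arbitrary assumptions), applies the formula with matrix operators, and compares the result with lm().

```r
set.seed(1)                      # arbitrary seed for reproducibility
n  <- 50
x1 <- rnorm(n)
y  <- 2 + 3 * x1 + rnorm(n)      # simulated data, assumed true model

X <- cbind(1, x1)                # design matrix with an intercept column

# beta_hat = (X'X)^{-1} X'y, written with t(), %*% and solve()
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y

lm_coef <- coef(lm(y ~ x1))      # the same estimates from lm()
print(cbind(formula = as.numeric(beta_hat), lm = as.numeric(lm_coef)))
```

lm() itself relies on a QR decomposition rather than an explicit inverse, so the two columns agree only up to rounding error.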
v1<-c(170, 175, 166, 172, 165, 157, 167, 167,
156, 160)
v2<-c("F","M","M","M","F","F","F","F","M","F")
v3<-v1/10 + 42
tmp<-data.frame(height=v1,gender=v2,weight=v3,
stringsAsFactors = FALSE)
tmp
## height gender weight
## 1 170 F 59.0
## 2 175 M 59.5
## 3 166 M 58.6
## 4 172 M 59.2
## 5 165 F 58.5
## 6 157 F 57.7
## 7 167 F 58.7
## 8 167 F 58.7
## 9 156 M 57.6
## 10 160 F 58.0
When vectors are combined with cbind(), R treats the result as a matrix. For example:
H<-cbind(LETTERS[1:6], seq(10,60, 10))
H
## [,1] [,2]
## [1,] "A" "10"
## [2,] "B" "20"
## [3,] "C" "30"
## [4,] "D" "40"
## [5,] "E" "50"
## [6,] "F" "60"
class(H)
## [1] "matrix" "array"
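Because a matrix holds a single type, the numbers in H were turned into text. A sketch of the contrast with a data frame, which keeps a type per column; `H_mat` and `H_df` are assumed names.

```r
H_mat <- cbind(LETTERS[1:3], seq(10, 30, 10))   # matrix: everything becomes character
H_df  <- data.frame(letter = LETTERS[1:3],
                    value  = seq(10, 30, 10))   # data frame: each column keeps its type

print(typeof(H_mat))
print(sapply(H_df, class))
total <- sum(H_df$value)   # arithmetic works on the numeric column
print(total)
```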
df <- structure(list(
year = c(2001, 2002, 2004, 2006),
length_days = c(366.3240, 365.4124, 366.5323423, 364.9573234)),
.Names = c("year", "length of days") ,
row.names = c(NA, -4L) ,class = "data.frame",
comment = 'Example Data Set')
df
## year length of days
## 1 2001 366.3
## 2 2002 365.4
## 3 2004 366.5
## 4 2006 365.0
dtt <- structure(list
(model = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L),
.Label = c("ma", "mb", "mc"), class = "factor"),
year = c(2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L),
V = c(0.16, 0.14, 0.11, 0.13, 0.15, 0.16, 0.24, 0.17, 0.12, 0.13, 0.15, 0.15, 0.2, 0.16, 0.11, 0.12, 0.12, 0.15),
lower = c(0.11, 0.11, 0.07, 0.09, 0.11, 0.12, 0.16, 0.12, 0.04, 0.09, 0.09, 0.11, 0.14, 0.1, 0.07, 0.08, 0.05, 0.1),
upper = c(0.21, 0.19, 0.17, 0.17, 0.19, 0.2, 0.29, 0.23, 0.16, 0.17, 0.16, 0.2, 0.26, 0.27, 0.15, 0.16, 0.15, 0.19)),
.Names = c("model", "year", "V", "lower", "upper"),
class = "data.frame", row.names = c(NA, -18L))
dtt
## model year V lower upper
## 1 ma 2005 0.16 0.11 0.21
## 2 ma 2006 0.14 0.11 0.19
## 3 ma 2007 0.11 0.07 0.17
## 4 ma 2008 0.13 0.09 0.17
## 5 ma 2009 0.15 0.11 0.19
## 6 ma 2010 0.16 0.12 0.20
## 7 mb 2005 0.24 0.16 0.29
## 8 mb 2006 0.17 0.12 0.23
## 9 mb 2007 0.12 0.04 0.16
## 10 mb 2008 0.13 0.09 0.17
## 11 mb 2009 0.15 0.09 0.16
## 12 mb 2010 0.15 0.11 0.20
## 13 mc 2005 0.20 0.14 0.26
## 14 mc 2006 0.16 0.10 0.27
## 15 mc 2007 0.11 0.07 0.15
## 16 mc 2008 0.12 0.08 0.16
## 17 mc 2009 0.12 0.05 0.15
## 18 mc 2010 0.15 0.10 0.19
With a matrix, use matrix[,1] to tell the system which column is meant; matrix$a cannot be used.
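A short sketch of the difference, using a made-up matrix with named columns. tryCatch() is used here only so that the failing call does not stop the script.

```r
mat <- cbind(a = c(1, 2, 3), b = c(4, 5, 6))   # numeric matrix with column names

by_position <- mat[, 1]     # column selected by position
by_name     <- mat[, "a"]   # column selected by name

# $ does not work on a matrix; the error is caught and replaced by a marker string
dollar_result <- tryCatch(mat$a, error = function(e) "error")

# After conversion to a data frame, $ works
df_col <- as.data.frame(mat)$a

print(by_position)
print(dollar_result)
print(df_col)
```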
a<-c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
b<-c(52.5, 48.4, 57.1, 60.1, 71.1)
dftest <-cbind(a, b)
class(dftest)
## [1] "matrix" "array"
dftest
## a b
## [1,] "Monday" "52.5"
## [2,] "Tuesday" "48.4"
## [3,] "Wednesday" "57.1"
## [4,] "Thursday" "60.1"
## [5,] "Friday" "71.1"
#ggplot(data=dftest, aes(x=a, y=b, fill=a)) +
# geom_bar(stat = 'identity')
tm<-cbind(carData::Chile$vote, carData::Chile$sex)
class(tm)
## [1] "matrix" "array"
plot(table(tm[,2], tm[,1]))
Figure 3.2: Chilean election
dt <- as.data.frame(dftest)
dt$b <- as.numeric(dt$b) # cbind() stored b as character, so convert it back
dt$a<-factor(dt$a, levels=c("Monday", "Tuesday",
"Wednesday", "Thursday", "Friday"))
ggplot(data=dt, aes(x=a, y=b, fill=a)) +
geom_bar(stat = 'identity')
Figure 3.3: Happiness level from Monday to Friday
Several commands are useful with data frames and matrices.
The number of observations in AMSsurvey:
nrow(AMSsurvey)
[1] 24
colnames(tmp)<-c("V1","V2","V3"); tmp
## V1 V2 V3
## 1 170 F 59.0
## 2 175 M 59.5
## 3 166 M 58.6
## 4 172 M 59.2
## 5 165 F 58.5
## 6 157 F 57.7
## 7 167 F 58.7
## 8 167 F 58.7
## 9 156 M 57.6
## 10 160 F 58.0
library(ISLR)
head(College, n=3)
## Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University Yes 1660 1232 721 23 52
## Adelphi University Yes 2186 1924 512 16 29
## Adrian College Yes 1428 1097 336 22 50
## F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University 2885 537 7440 3300 450
## Adelphi University 2683 1227 12280 6450 750
## Adrian College 1036 99 11250 3750 400
## Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University 2200 70 78 18.1 12 7041
## Adelphi University 1500 29 30 12.2 16 10527
## Adrian College 1165 53 66 12.9 30 8735
## Grad.Rate
## Abilene Christian University 60
## Adelphi University 56
## Adrian College 54
head(state.x77, n=5)
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## California 21198 5114 1.1 71.71 10.3 62.6 20 156361
names.to.delete<-c('Alabama', 'Alaska', 'Arkansas')
which(rownames(data) %in% vector) returns the positions of the rows to be selected:
rows.to.delete<-which(rownames(state.x77) %in% names.to.delete)
newstate <- state.x77[-c(rows.to.delete),]
head(newstate, n=5)
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## California 21198 5114 1.1 71.71 10.3 62.6 20 156361
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
## Connecticut 3100 5348 1.1 72.48 3.1 56.0 139 4862
## Delaware 579 4809 0.9 70.06 6.2 54.6 103 1982
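An alternative sketch drops rows with a logical condition instead of -which(). The two agree when at least one name matches, but they differ when nothing matches; "Nowhere" is a made-up name used to show this.

```r
names_to_delete <- c("Alabama", "Alaska", "Arkansas")

# Logical version: keep the rows whose names are NOT in the list
keep <- !(rownames(state.x77) %in% names_to_delete)
newstate <- state.x77[keep, ]

# If nothing matches, which() returns integer(0), and x[-integer(0), ] selects no rows
no_match <- which(rownames(state.x77) %in% "Nowhere")
emptied  <- state.x77[-no_match, ]
kept_all <- state.x77[!(rownames(state.x77) %in% "Nowhere"), ]

print(nrow(newstate))
print(nrow(emptied))
print(nrow(kept_all))
```

The logical form is therefore the safer choice when the list of names might not match any row.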
Array1 <- array(1:12, dim = c(2, 6, 1)); Array1
## , , 1
##
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 3 5 7 9 11
## [2,] 2 4 6 8 10 12
Array2 <- array(1:12, dim = c(2, 3, 2)); Array2
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
A12<-Array2[,,2]; A12
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
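An array is indexed with one position per dimension, and apply() can summarise along any of them. A sketch using the same values as Array2; `arr`, `one_cell`, and `layer_sums` are assumed names.

```r
arr <- array(1:12, dim = c(2, 3, 2))   # 2 rows, 3 columns, 2 layers

print(dim(arr))
one_cell <- arr[1, 2, 2]            # row 1, column 2 of the second layer
layer_sums <- apply(arr, 3, sum)    # total of each layer (third dimension)

print(one_cell)
print(layer_sums)
```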
listA<-list(tmp, H, c(xi,tsai)); listA
## [[1]]
## V1 V2 V3
## 1 170 F 59.0
## 2 175 M 59.5
## 3 166 M 58.6
## 4 172 M 59.2
## 5 165 F 58.5
## 6 157 F 57.7
## 7 167 F 58.7
## 8 167 F 58.7
## 9 156 M 57.6
## 10 160 F 58.0
##
## [[2]]
## [,1] [,2]
## [1,] "A" "10"
## [2,] "B" "20"
## [3,] "C" "30"
## [4,] "D" "40"
## [5,] "E" "50"
## [6,] "F" "60"
##
## [[3]]
## [1] "1953-06-15" "1956-08-31"
list(A=data.frame(x=c(1:5),y=c(101:105)),
B=data.frame(v1=rep(NA,6)))
## $A
## x y
## 1 1 101
## 2 2 102
## 3 3 103
## 4 4 104
## 5 5 105
##
## $B
## v1
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
listA[[3]]
## [1] "1953-06-15" "1956-08-31"
listB<-list(data=tmp, vec=m, char=c(tsai, xi));
listB[["data"]]
## V1 V2 V3
## 1 170 F 59.0
## 2 175 M 59.5
## 3 166 M 58.6
## 4 172 M 59.2
## 5 165 F 58.5
## 6 157 F 57.7
## 7 167 F 58.7
## 8 167 F 58.7
## 9 156 M 57.6
## 10 160 F 58.0
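The three ways of reaching into a list behave differently: single brackets return a shorter list, while double brackets and $ return the element itself. A sketch with a made-up list `lst`:

```r
lst <- list(data = data.frame(x = 1:3),
            vec  = c(10, 20),
            char = c("a", "b"))

sub_list <- lst["vec"]      # single bracket: still a list, of length 1
element  <- lst[["vec"]]    # double bracket: the numeric vector itself
same     <- identical(lst$vec, element)   # $ is shorthand for [[ ]] by name
second   <- lst[["vec"]][2] # index into the extracted element

print(class(sub_list))
print(class(element))
print(second)
```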
X = list(1:5, letters[1:5], c('Y','Y','N','Y','N'),
c("2/27/2018", "6/26/2018", "12/31/2018","1/20/2019","4/8/2019")); X
## [[1]]
## [1] 1 2 3 4 5
##
## [[2]]
## [1] "a" "b" "c" "d" "e"
##
## [[3]]
## [1] "Y" "Y" "N" "Y" "N"
##
## [[4]]
## [1] "2/27/2018" "6/26/2018" "12/31/2018" "1/20/2019" "4/8/2019"
X.dt<-setDT(X); X.dt
## V1 V2 V3 V4
## 1: 1 a Y 2/27/2018
## 2: 2 b Y 6/26/2018
## 3: 3 c N 12/31/2018
## 4: 4 d Y 1/20/2019
## 5: 5 e N 4/8/2019
Try combining c('a','b','c'), c(1,2,3,4), and
c('2018-01-01', '2018-04-04', '2018-04-05', '2018-06-18', '2018-10-10') into a list.
class(Titanic); Titanic
## [1] "table"
## , , Age = Child, Survived = No
##
## Sex
## Class Male Female
## 1st 0 0
## 2nd 0 0
## 3rd 35 17
## Crew 0 0
##
## , , Age = Adult, Survived = No
##
## Sex
## Class Male Female
## 1st 118 4
## 2nd 154 13
## 3rd 387 89
## Crew 670 3
##
## , , Age = Child, Survived = Yes
##
## Sex
## Class Male Female
## 1st 5 1
## 2nd 11 13
## 3rd 13 14
## Crew 0 0
##
## , , Age = Adult, Survived = Yes
##
## Sex
## Class Male Female
## 1st 57 140
## 2nd 14 80
## 3rd 75 76
## Crew 192 20
T1<-Titanic[, , 1, 1]
class(T1); T1
## [1] "table"
## Sex
## Class Male Female
## 1st 0 0
## 2nd 0 0
## 3rd 35 17
## Crew 0 0
Titanic[, 1, 1,]
## Survived
## Class No Yes
## 1st 0 5
## 2nd 0 11
## 3rd 35 13
## Crew 0 0
Titanic[, 1, 2,]
## Survived
## Class No Yes
## 1st 118 57
## 2nd 154 14
## 3rd 387 75
## Crew 670 192
Titanic[, 2, 1,]
## Survived
## Class No Yes
## 1st 0 1
## 2nd 0 13
## 3rd 17 14
## Crew 0 0
Titanic[, 2, 2,]
## Survived
## Class No Yes
## 1st 4 140
## 2nd 13 80
## 3rd 89 76
## Crew 3 20
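Instead of slicing one combination at a time, the table can be collapsed over the dimensions that are not of interest. The sketch below uses apply() with the dimension numbers 1 (Class) and 4 (Survived); the result names are assumptions.

```r
# Total by survival status, summing over Class, Sex, and Age
by_survival <- apply(Titanic, 4, sum)

# Class by Survived, summing over Sex and Age
class_by_survival <- apply(Titanic, c(1, 4), sum)

print(by_survival)
print(class_by_survival)
```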
library(data.table)
DT = data.table(a = 1:3, b = c(10,20,30))
DT
## a b
## 1: 1 10
## 2: 2 20
## 3: 3 30
DT[1:3, sum(a)]
## [1] 6
DT[1:2, mean(b)]
## [1] 15
DT <-data.table(x=rnorm(1000, 0, 1))
DT[, plot(density(x), type='l', xlab='x', ylab='',
lwd=3, col='lightblue')]
Figure 3.4: Probability density of the normal distribution
## NULL
g<-Titanic[ , , 2, 2]; class(g)
## [1] "table"
Entering \(\texttt{g\$Class}\) raises an error and nothing is displayed. Change g to a data frame:
g<-data.frame(g)
Entering \(\texttt{g\$Class}\) now displays the variable.
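What the conversion produces can be inspected directly. This sketch uses as.data.frame(), which reshapes a table into one row per cell with a Freq column; `g_df` and `crew_female` are assumed names.

```r
g <- Titanic[, , 2, 2]       # adults who survived, still a table
g_df <- as.data.frame(g)     # one row per Class-by-Sex cell, plus Freq

print(names(g_df))
print(levels(g_df$Class))    # $ now works because g_df is a data frame

# Look up one cell by its labels
crew_female <- g_df$Freq[g_df$Class == "Crew" & g_df$Sex == "Female"]
print(crew_female)
```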
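Because the elements of a vector have positions, its numbers have an order. When a vector is added to, subtracted from, multiplied by, or divided by another vector of the same length, the operation proceeds element by element in that order. We take a scalar, Sca, as an example: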
X<-c(10,20,30,40,50,60); Sca<-10
X+Sca
[1] 20 30 40 50 60 70
X/Sca
[1] 1 2 3 4 5 6
Y<-c(5,10,6,8,25,6)
X/Y; X*Y
[1]  2  2  5  5  2 10
[1]   50  200  180  320 1250  360
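The scalar case is an instance of recycling: the shorter vector is repeated until it matches the longer one. A sketch with a length-2 vector, where `plus_scalar` and `plus_pair` are assumed names:

```r
X <- c(10, 20, 30, 40, 50, 60)

plus_scalar <- X + 10        # 10 is reused for every element
plus_pair   <- X + c(1, 2)   # c(1, 2) is reused as 1, 2, 1, 2, 1, 2

print(plus_scalar)
print(plus_pair)
```

R gives a warning when the longer length is not a multiple of the shorter one.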
a<-c(2,3,4); b<-a^2; print(b)
[1] 4 9 16
c<-sqrt(b); print(c)
[1] 2 3 4
EX<-c(2.54, 3.111, 10.999)
round(EX, digits=4)
## [1] 2.540 3.111 10.999
floor(EX)
## [1] 2 3 10
ceiling(EX)
## [1] 3 4 11
options(digits = 5)
print(EX)
## [1] 2.540 3.111 10.999
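To align printed numbers, R may print more digits after the decimal point than are needed. The sprintf function can control this, but it returns character strings rather than numbers: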
perc<-MASS::Animals$brain/MASS::Animals$body
print(perc[1:3], digits=2)
## [1] 6.00 0.91 3.29
sprintf(perc[1:3], fmt="%#.2f")
## [1] "6.00" "0.91" "3.29"
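The difference between formatting for display and rounding the values themselves can be seen in a sketch with made-up ratios (they are not read from MASS::Animals):

```r
ratio <- c(6, 0.9090909, 3.2857143)   # hypothetical values

as_text <- sprintf("%.2f", ratio)     # character strings with two decimals
rounded <- round(ratio, 2)            # still numeric, usable in arithmetic

print(as_text)
print(class(as_text))
print(rounded)
print(sum(rounded))                   # works; sum(as_text) would be an error
```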
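Use the weekdays() command to show that the date of today's class is a Tuesday.
In the mtcars data, how many rows have wt greater than or equal to 2?
Use the features of the data.table package to sort the data frame above.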
Last updated 03/05/2022