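R's data types and data structures are the objects that R can read and operate on. For example, we use `c()` to represent a collection of numbers or strings:
A<-c("台北市","新北市", "桃園市", "台中市","台南市","高雄市")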
print(A)
## [1] "台北市" "新北市" "桃園市" "台中市" "台南市" "高雄市"
B<-c(0,1,2,3,4,5,6,7,8,9)
print(B)
## [1] 0 1 2 3 4 5 6 7 8 9
R data structures can be one-dimensional, two-dimensional, or multi-dimensional:
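This part covers R's data types and operations. R is an object-oriented language. Object-oriented means that the user gives the system a name for a function, vector, matrix, loop, and so on, or bundles code and data into a single object; the system stores it in the working environment, where the user can call, run, or modify it. Users can stack objects like building blocks to reach the result they want.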
An R environment holds all kinds of objects, which the `ls()` function lists. For example:
## [1] "a" "A" "b" "B" "f"
## [1] 4
## [1] 25
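The calls that produced the output above did not survive, so the following is only a minimal sketch of how objects live in an environment. It uses a scratch environment and hypothetical object names (`env`, `a`, `f`) so that it does not touch the real workspace.

```r
# Hypothetical objects stored in a scratch environment (names are illustrative)
env <- new.env()
assign("a", 4, envir = env)
assign("f", function(x) x^2, envir = env)   # a function is an object too

ls(env)                 # names of the objects stored in env
get("a", envir = env)   # retrieve a stored value
env$f(5)                # call the stored function
```

Calling `ls()` with no argument lists the current working environment instead.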
R's function for the mean is:
?mean
mean(x, trim = 0, na.rm = FALSE)
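`x` is the variable. It can be a vector or a matrix, but its elements cannot be strings. `trim` sets the fraction of observations to drop from each end of the sorted data before averaging. For example, take the following variable: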
x<-c(100000, 10000000, c(1:10))
sort(x)
## [1] 1e+00 2e+00 3e+00 4e+00 5e+00 6e+00 7e+00 8e+00 9e+00 1e+01 1e+05 1e+07
mean(x)
## [1] 841671
mean(x, trim=0.1)
## [1] 10005
mean(sort(x)[2:11])
## [1] 10005
mean(x, trim=0.2)
## [1] 6.5
mean(sort(x)[3:10])
## [1] 6.5
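The matching results above suggest how `trim` works: it drops `floor(n * trim)` observations from each end of the sorted data. A small sketch that checks this by hand; `trimmed_by_hand` is a hypothetical helper, not part of R, and it assumes `trim` is below 0.5.

```r
x <- c(100000, 10000000, 1:10)

# Drop floor(n * p) observations from each end of sort(x), then average
trimmed_by_hand <- function(x, p) {
  k <- floor(length(x) * p)
  s <- sort(x)
  mean(s[(k + 1):(length(s) - k)])
}

trimmed_by_hand(x, 0.1)   # same value as mean(x, trim = 0.1)
trimmed_by_hand(x, 0.2)   # same value as mean(x, trim = 0.2)
```

With 12 observations, `trim = 0.1` removes one value per end and `trim = 0.2` removes two, which is why the two extreme values stop dominating the mean.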
X<-c(2, 4, 6, 8); X
## [1] 2 4 6 8
class(X)
## [1] "numeric"
typeof(X)
## [1] "double"
str(X)
##  num [1:4] 2 4 6 8
y=c(1.1e+06); y
## [1] 1100000
class(y)
## [1] "numeric"
u<-as.integer(c(4)); class(u)
## [1] "integer"
is.numeric(100)
is.integer(100)
typeof(100)
a<-c(7, 8.5, 9); class(a)
## [1] "numeric"
b<-as.integer(a); b
## [1] 7 8 9
Append `L` to a number and R stores it as an integer.
a<-3L; class(a)
## [1] "integer"
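To make the difference between double and integer storage concrete, here is a short sketch using only base R. The comments state what each call returns.

```r
is.numeric(100)    # TRUE: doubles count as numeric
is.integer(100)    # FALSE: a bare 100 is stored as double
typeof(100L)       # "integer": the L suffix requests integer storage
as.integer(8.5)    # 8: conversion truncates toward zero, it does not round
identical(3L, as.integer(3))   # TRUE
```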
sys_date <- Sys.Date()
as.numeric(sys_date)
## [1] 19791
as.integer(sys_date)
## [1] 19791
date_of_origin <- as.Date("1970-01-01")
as.integer(sys_date) - as.integer(date_of_origin)
## [1] 19791
R displays the type of the variable and the corresponding result:
b <- c(1, 2, 3, 10)
h<-c(100, 200, 300, 500)
ok<-h>300
b[ok]
## [1] 10
ok<-h>1000
b[ok]
## numeric(0)
state.abb
class(state.abb)
Figure 2.1: Dot plot of population by U.S. state
Figure 2.2: Dot plot of state populations, sorted
`state.x77` is a matrix, so we extract its population column, combine it with `state.abb` into a data frame, and draw a dot plot with `dotplot` or `dotchart`.
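The plotting code itself is not shown, so the following is one possible sketch of those steps using base graphics. The data frame name `pop` and its column names are assumptions made for illustration.

```r
# state.x77 is a matrix, so columns are selected by name inside [ , ]
pop <- data.frame(state = state.abb,
                  population = state.x77[, "Population"])

# Sort by population so the dots read from smallest to largest
pop <- pop[order(pop$population), ]

dotchart(pop$population, labels = pop$state, cex = 0.6,
         xlab = "Population")
```

`lattice::dotplot()` would be the alternative if a formula interface is preferred.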
char1<-c("1","2","3","4","5"); char1
## [1] "1" "2" "3" "4" "5"
char2<-c(1, 2, "文字"); char2
## [1] "1" "2" "文字"
Non-numeric strings cannot be converted to numbers with `as.numeric`, but they can be recoded with code (see the section on factors).
class(row.names(quakes))
## [1] "character"
head(state.x77)
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## California 21198 5114 1.1 71.71 10.3 62.6 20 156361
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
class(row.names(state.x77))
## [1] "character"
library(foreign); library(tidyverse)
file<-here::here('data','opendata106N0101.csv')
opendf<-read.csv(file, header=T, sep=',', fileEncoding = 'UTF-8')
str(opendf)
## 'data.frame': 375 obs. of 4 variables:
## $ code : chr "新北市板橋區" "新北市三重區" "新北市中和區" "新北市永和區" ...
## $ 年底人口數: chr "551480" "387484" "413590" "222585" ...
## $ 土地面積 : num 23.14 16.32 20.14 5.71 19.74 ...
## $ 人口密度 : chr "23835" "23747" "20532" "38956" ...
file<-here::here('data','opendata106N0101.csv')
dat<-read.csv(file, header=T, stringsAsFactors = F)
nrow(dat) #check how many rows; n=375
## [1] 375
dat <- dat[-c(369:375),] #delete the rows of small islands and notes
head(dat, n=3)
## code 年底人口數 土地面積 人口密度
## 1 新北市板橋區 551480 23.14 23835
## 2 新北市三重區 387484 16.32 23747
## 3 新北市中和區 413590 20.14 20532
dat <- dat %>% mutate(popu=as.numeric(年底人口數))
str(dat)
## 'data.frame': 368 obs. of 5 variables:
## $ code : chr "新北市板橋區" "新北市三重區" "新北市中和區" "新北市永和區" ...
## $ 年底人口數: chr "551480" "387484" "413590" "222585" ...
## $ 土地面積 : num 23.14 16.32 20.14 5.71 19.74 ...
## $ 人口密度 : chr "23835" "23747" "20532" "38956" ...
## $ popu : num 551480 387484 413590 222585 416524 ...
file<-here::here('data','opendata106N0101.csv')
df<-read.csv(file,
header=F, stringsAsFactors = F)
colnames(df) <-df[1,]
df <- df[-c(1, 370:376),] #delete the first row and the rows of notes
df<-df%>% mutate(popu=as.numeric(年底人口數))
head(df, n=3)
## code 年底人口數 土地面積 人口密度 popu
## 2 新北市板橋區 551480 23.1373 23835 551480
## 3 新北市三重區 387484 16.317 23747 387484
## 4 新北市中和區 413590 20.144 20532 413590
min(df$popu) #check if data has no missing value
## [1] 685
library(data.table)
file<-here::here('data','opendata106N0101.csv')
DT <- read.csv(file, header=F, stringsAsFactors = F)
colnames(DT) <-DT[1,]
nrow(DT) #check how many rows; n=376
## [1] 376
DT <- DT[-c(1, 370:376),] #delete the first row and the rows of notes
DT <- data.table(DT)
DT <- DT %>% transform(年底人口數=as.numeric(年底人口數))
head(DT, n=3)
## code 年底人口數 土地面積 人口密度
## 1: 新北市板橋區 551480 23.1373 23835
## 2: 新北市三重區 387484 16.317 23747
## 3: 新北市中和區 413590 20.144 20532
min(DT$年底人口數) #check if data has no missing value
## [1] 685
x=c("Yes","No","No","Yes","Yes"); x
## [1] "Yes" "No" "No" "Yes" "Yes"
factor(x)
## [1] Yes No No Yes Yes
## Levels: No Yes
table(x)
## x
## No Yes
## 2 3
The `factor()` function converts x into a factor with two levels, No and Yes.
`nchar()`, `gsub()`, and so on. Try the following code yourself:
#character
a <- letters[1:5]
typeof(a)
nchar(a)
#factor
b <- factor(a)
typeof(b)
nchar(b)
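For readers who want to see what the exercise is getting at, here is a hedged sketch of the contrast between a character vector and a factor. The error from `nchar()` on a factor is trapped with `tryCatch()` so the script keeps running.

```r
a <- letters[1:5]
b <- factor(a)

typeof(a)   # "character"
typeof(b)   # "integer": a factor stores integer codes plus a levels attribute
nchar(a)    # 1 1 1 1 1

# nchar() refuses factors, so trap the error instead of stopping the script
res <- tryCatch(nchar(b), error = function(e) conditionMessage(e))
res

nchar(as.character(b))   # convert back to character first
```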
Use `stringsAsFactors = FALSE` to stop text from being read as factors when importing a file, then convert text to factors as needed.
str(Chile)
## 'data.frame': 2700 obs. of 8 variables:
## $ region : Factor w/ 5 levels "C","M","N","S",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ population: int 175000 175000 175000 175000 175000 175000 175000 175000 175000 175000 ...
## $ sex : Factor w/ 2 levels "F","M": 2 2 1 1 1 1 2 1 1 2 ...
## $ age : int 65 29 38 49 23 28 26 24 41 41 ...
## $ education : Factor w/ 3 levels "P","PS","S": 1 2 1 1 3 1 2 3 1 1 ...
## $ income : int 35000 7500 15000 35000 35000 7500 35000 15000 15000 15000 ...
## $ statusquo : num 1.01 -1.3 1.23 -1.03 -1.1 ...
## $ vote : Factor w/ 4 levels "A","N","U","Y": 4 2 4 2 2 2 2 2 3 2 ...
kableExtra::kable_styling(knitr::kable(table(Chile$sex, Chile$vote)),
bootstrap_options = "striped", full_width = F)
| | A | N | U | Y |
|---|---|---|---|---|
| F | 104 | 363 | 362 | 480 |
| M | 83 | 526 | 226 | 388 |
Chile$ncode<-as.numeric(Chile$region)
kableExtra::kable_styling(knitr::kable(table(Chile$ncode, Chile$vote)),
bootstrap_options = "striped", full_width = F)
| | A | N | U | Y |
|---|---|---|---|---|
| 1 | 44 | 210 | 141 | 174 |
| 2 | 2 | 18 | 23 | 38 |
| 3 | 30 | 102 | 46 | 135 |
| 4 | 42 | 214 | 148 | 275 |
| 5 | 69 | 345 | 230 | 246 |
library(lattice)
plot(Chile$sex, Chile$vote, xlab="Sex", ylab="Vote")
Figure 2.3: Sex and vote (1)
gender<-as.numeric(Chile$sex)
kableExtra::kable_styling(knitr::kable(table(gender), caption="性別分佈"),
bootstrap_options = "striped", full_width = F)
gender | Freq |
---|---|
1 | 1379 |
2 | 1321 |
R converts categories to numbers in alphabetical order of the levels. From there, recoding the numbers further is easy:
sex <- c()
sex[gender==2]<-0
sex[gender==1]<-1
kableExtra::kable_styling(knitr::kable(table(sex)),bootstrap_options = "striped", full_width = F)
sex | Freq |
---|---|
0 | 1321 |
1 | 1379 |
ngender<-c()
ngender[Chile$sex=='F']<-1
ngender[Chile$sex=='M']<-0
kableExtra::kable_styling(knitr::kable(table(ngender)),
bootstrap_options = "striped", full_width = F)
ngender | Freq |
---|---|
0 | 1321 |
1 | 1379 |
Chile$gender[Chile$sex=="F"]<-"Female"
Chile$gender[Chile$sex=="M"]<-"Male"
class(Chile$gender)
## [1] "character"
kableExtra::kable_styling(knitr::kable(table(Chile$gender)),
bootstrap_options = "striped", full_width = F)
Var1 | Freq |
---|---|
Female | 1379 |
Male | 1321 |
str(Chile)
## 'data.frame': 2700 obs. of 10 variables:
## $ region : Factor w/ 5 levels "C","M","N","S",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ population: int 175000 175000 175000 175000 175000 175000 175000 175000 175000 175000 ...
## $ sex : Factor w/ 2 levels "F","M": 2 2 1 1 1 1 2 1 1 2 ...
## $ age : int 65 29 38 49 23 28 26 24 41 41 ...
## $ education : Factor w/ 3 levels "P","PS","S": 1 2 1 1 3 1 2 3 1 1 ...
## $ income : int 35000 7500 15000 35000 35000 7500 35000 15000 15000 15000 ...
## $ statusquo : num 1.01 -1.3 1.23 -1.03 -1.1 ...
## $ vote : Factor w/ 4 levels "A","N","U","Y": 4 2 4 2 2 2 2 2 3 2 ...
## $ ncode : num 3 3 3 3 3 3 3 3 3 3 ...
## $ gender : chr "Male" "Male" "Female" "Female" ...
data_char2 <- Chile # Duplicate data
fac_cols <- sapply(data_char2, is.factor) # Identify all factor columns
data_char2[fac_cols] <- lapply(data_char2[fac_cols], as.character) # Convert all factor
str(data_char2)
## 'data.frame': 2700 obs. of 10 variables:
## $ region : chr "N" "N" "N" "N" ...
## $ population: int 175000 175000 175000 175000 175000 175000 175000 175000 175000 175000 ...
## $ sex : chr "M" "M" "F" "F" ...
## $ age : int 65 29 38 49 23 28 26 24 41 41 ...
## $ education : chr "P" "PS" "P" "P" ...
## $ income : int 35000 7500 15000 35000 35000 7500 35000 15000 15000 15000 ...
## $ statusquo : num 1.01 -1.3 1.23 -1.03 -1.1 ...
## $ vote : chr "Y" "N" "Y" "N" ...
## $ ncode : num 3 3 3 3 3 3 3 3 3 3 ...
## $ gender : chr "Male" "Male" "Female" "Female" ...
## ISLR::Auto$cylinders n percent
## 3 4 0.010204
## 4 199 0.507653
## 5 3 0.007653
## 6 83 0.211735
## 8 103 0.262755
library(ISLR); library(dplyr)
data(Auto)
CYL <- c()
CYL[Auto$cylinders==3] <- '3 cylinders'
CYL[Auto$cylinders==4] <- '4 cylinders'
CYL[Auto$cylinders==5] <- '5 cylinders'
CYL[Auto$cylinders==6] <- '6 cylinders'
CYL[Auto$cylinders==8] <- '8 cylinders'
df <- data.frame(CYL=CYL)
df %>% janitor::tabyl(CYL)
## CYL n percent
## 3 cylinders 4 0.010204
## 4 cylinders 199 0.507653
## 5 cylinders 3 0.007653
## 6 cylinders 83 0.211735
## 8 cylinders 103 0.262755
df$fac_cols <- as.factor(df$CYL)
ggplot2::ggplot(df,aes(x = fac_cols)) +
geom_bar(stat='count', aes(fill = fac_cols))
Figure 2.4: Bar chart of the number of cylinders
| | A | N | U | Y |
|---|---|---|---|---|
| Female | 104 | 363 | 362 | 480 |
| Male | 83 | 526 | 226 | 388 |
Figure 2.5: Sex and vote (2)
R cannot tell which string should be assigned which number.
as.numeric(Chile$gender)
☛ Exercise: try converting categorical variables such as `citizen` in `AMSsurvey`.
\(\texttt{recode_factor()}\) in the dplyr package can recode one factor into another, and can also convert strings into a factor.
library(carData); library(dplyr)
Chile |> mutate(sex.new = recode_factor(sex, `F`='Female', `M`="Male"))|>
janitor::tabyl(sex.new)
## sex.new n percent
## Female 1379 0.5107
## Male 1321 0.4893
file <- here::here('data','africa2023.csv'); dt <- read.csv(file, header=T)
dt |> mutate(sub = recode_factor(subregion, `Eastern Africa`='East',
`Middle Africa`="Middle",
`Northern Africa`='North', `Southern Africa`='South',
`Western Africa` = 'West'))|>
janitor::tabyl(sub)
## sub n percent
## East 18 0.33333
## Middle 9 0.16667
## North 6 0.11111
## South 5 0.09259
## West 16 0.29630
dt |> mutate(sub2 = recode_factor(subregion, `Eastern Africa`='East',
`Middle Africa`="Middle",
`Northern Africa`='North', `Southern Africa`='South',
.default = NA_character_))|>
janitor::tabyl(sub2)
## sub2 n percent valid_percent
## East 18 0.33333 0.4737
## Middle 9 0.16667 0.2368
## North 6 0.11111 0.1579
## South 5 0.09259 0.1316
## <NA> 16 0.29630 NA
Several values can be merged with \(\texttt{recode}\) from dplyr or with \(\texttt{recode}\) from car. The code below shows how to merge values and how to set certain values to missing.
library(janitor)
#open data
file <- here::here('data', 'PP0797B2.sav')
PP0797B2 <- sjlabelled::read_spss(file)
PP07 <- PP0797B2 %>% mutate(nQ1 = dplyr::recode(Q1, `1`= 1, `2`= 1, `3`= 0, `4` = 0)) %>%
mutate(nQ1 = ifelse(nQ1>1, NA, nQ1)) %>%
mutate(NEWQ2 = car::recode(Q2, "1:2=1; 3:4=0; 95:98=NA"))
PP07 %>% tabyl(nQ1)
## nQ1 n percent valid_percent
## 0 534 0.2595 0.291
## 1 1301 0.6322 0.709
## NA 223 0.1084 NA
PP07 %>% tabyl(NEWQ2)
## NEWQ2 n percent valid_percent
## 0 1688 0.82021 0.895
## 1 198 0.09621 0.105
## <NA> 172 0.08358 NA
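The same kind of merge can also be written in base R, which may help in seeing what the recode rules do. This is a sketch on a hypothetical vector `q`, not the survey data above; it assumes 1 to 4 are substantive answers and 95 to 98 are non-response codes.

```r
# Hypothetical survey item: 1-4 are substantive answers, 95-98 are non-response codes
q <- c(1, 2, 3, 4, 95, 98, 2, 3)

new_q <- rep(NA_real_, length(q))   # start from missing
new_q[q %in% 1:2] <- 1              # merge 1 and 2
new_q[q %in% 3:4] <- 0              # merge 3 and 4; 95-98 stay NA

table(new_q, useNA = "ifany")
```

Starting from `NA` means any code that is not explicitly mapped ends up missing, which is the safer default for unexpected values.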
library(pscl)
attach(absentee)
admit.1 <- absentee %>%
mutate(across(matches('year'), as.character)) %>%
data.table::data.table()
typeof(admit.1$year)
## [1] "character"
x<-c("花蓮縣","臺北市","屏東縣","臺南市","高雄市");x
table(x)
the `levels` argument:
## [1] <NA> <NA> <NA> <NA> <NA>
## Levels: 臺北市 臺南市 高雄市 屏東縣 花蓮縣
xf | Freq |
---|---|
臺北市 | 0 |
臺南市 | 0 |
高雄市 | 0 |
屏東縣 | 0 |
花蓮縣 | 0 |
Dependent variable: infant

| | (1) | (2) |
|---|---|---|
| income | -0.003 (0.008) | -0.003 (0.008) |
| regionAmericas | -84.730*** (22.500) | |
| regionAsia | -44.800** (20.800) | |
| regionEurope | -113.500*** (31.400) | |
| region.rAfrica | | 84.730*** (22.500) |
| region.rAsia | | 39.940* (23.070) |
| region.rEurope | | -28.740 (29.850) |
| Constant | 143.200*** (13.860) | 58.510*** (18.590) |
| Observations | 101 | 101 |
| R2 | 0.257 | 0.257 |
| Adjusted R2 | 0.226 | 0.226 |
| Residual Std. Error (df = 96) | 79.870 | 79.870 |
| F Statistic (df = 4; 96) | 8.311*** | 8.311*** |

Note: standard errors in parentheses; *p<0.1; **p<0.05; ***p<0.01
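The `factor()` function has a logical `ordered` argument. As long as `levels` is specified, the levels appear in the same order whether or not `ordered` is TRUE. The `ordered()` function, however, returns a factor that is already marked as ordered, for example: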
od<-ordered(1:20); class(od)
## [1] "ordered" "factor"
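A small sketch of the difference, using a hypothetical vector of clothing sizes. Both factors share the same level order; only the second carries the "ordered" class, which is what allows comparisons.

```r
size <- c("M", "S", "L", "M")

f <- factor(size, levels = c("S", "M", "L"))                  # unordered
o <- factor(size, levels = c("S", "M", "L"), ordered = TRUE)  # ordered

levels(f); levels(o)   # same level order in both
class(f); class(o)     # only o carries the "ordered" class
o < "L"                # comparisons are defined for ordered factors only
```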
☛ Exercise: reorder the levels of `vote` in `Chile` as "Y", "N", "A", "U".
## [1] "logical"
a<-c(0:9); a
## [1] 0 1 2 3 4 5 6 7 8 9
ok<-a>5; ok
## [1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE
a[ok]
## [1] 6 7 8 9
x <- c(2,7,9,2,NA,5) # 6 elements
r <- c(TRUE,TRUE,FALSE,FALSE,FALSE,FALSE)
x[r]
## [1] 2 7
x[c(TRUE,FALSE)] # odd numbered elements
## [1] 2 9 NA
☛ Exercise: run the following code. How many observations remain after filtering?
head(Duncan)
ok<-Duncan$income>50
Duncan$income[ok]
library(data.table)
H<-data.table(Age=c(NA,"30-39","40-49","20-29",
"20-29","60-69","60-69","30-39","60-69",
"30-39","20-29","20-29","30-39","40-49",
"40-49", "40-49","50-59","50-59","20-29",NA),
Vote=c("Ding", NA, "Ko", "Ko", "Ko",
"Ding","Ding",NA,NA, "Ko","Yao","Yao", "Yao","Ding","Ko","Ko","Yao","Yao","Ding","Ko"),
pride=c(3,NA, 7, 3, NA, 5, 5, 4,
NA,2,5, 1,6,8, 7, 6, 1, 3,3,5))
ok | Freq |
---|---|
FALSE | 3 |
TRUE | 17 |
## [1] NA
## [1] 4.353
| | Ding | Ko | Yao |
|---|---|---|---|
| 20-29 | 1 | 2 | 2 |
| 30-39 | 0 | 1 | 1 |
| 40-49 | 1 | 3 | 0 |
| 50-59 | 0 | 0 | 2 |
| 60-69 | 2 | 0 | 0 |

| | Ding | Ko | Yao |
|---|---|---|---|
| 20-29 | 1 | 2 | 2 |
| 30-39 | 0 | 1 | 1 |
| 40-49 | 1 | 3 | 0 |
| 50-59 | 0 | 0 | 2 |
| 60-69 | 2 | 0 | 0 |
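R automatically drops missing values that cannot be tabulated. Once we cross-tabulate with other variables, however, the difference shows up.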
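To see how many cases a cross-tabulation silently drops, `table()` can be asked to keep the missing values. The sketch below uses two short hypothetical vectors rather than the full data above.

```r
age  <- c(NA, "30-39", "40-49", "20-29", NA)
vote <- c("Ding", NA, "Ko", "Ko", "Ko")

t_default <- table(age, vote)                   # pairs with any NA are dropped
t_with_na <- table(age, vote, useNA = "ifany")  # NA gets its own row and column

sum(t_default)   # 2 complete pairs
sum(t_with_na)   # all 5 cases are counted
```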
`as.Date()` and `as.POSIXct()` convert strings into date data.
v<-c("1/27/2020", "6/26/2020", "12/31/2021"); class(v)
## [1] "character"
v.date1<-as.Date(v, format='%m/%d/%Y'); class(v.date1); v.date1
## [1] "Date"
## [1] "2020-01-27" "2020-06-26" "2021-12-31"
v.date2<-as.POSIXct(v, format='%m/%d/%Y'); class(v.date2);v.date2
## [1] "POSIXct" "POSIXt"
## [1] "2020-01-27 CST" "2020-06-26 CST" "2021-12-31 CST"
v<-c("", "6/26/2018", "12/31/2018")
as.Date(v, format='%m/%d/%Y')
## [1] NA "2018-06-26" "2018-12-31"
`format()` converts data of class Date into different display formats, for example:
today <- Sys.Date()
format(today, format='%m/%d/%Y')
## [1] "03/09/2024"
format(today, format='%Y-%m-%d')
## [1] "2024-03-09"
format(v.date2, "%m-%d-%Y")
## [1] "01-27-2020" "06-26-2020" "12-31-2021"
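Symbol | Meaning | Example |
---|---|---|
%d | Day of the month | 01-31 |
%a | Abbreviated weekday | Mon |
%A | Full weekday | Monday |
%m | Month (number) | 01-12 |
%b | Abbreviated month | Jan |
%B | Full month name | January |
%y | Two-digit year | 18 |
%Y | Four-digit year | 2018 |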
The `format()` function converts data that is already of class Date, for example:
Today<-Sys.Date(); Today
## [1] "2024-03-09"
today_format1<-format(Today, format='%Y-%b-%d'); today_format1
## [1] "2024-Mar-09"
today_format2<-format(Today, format='%b/%d/%y'); today_format2
## [1] "Mar/09/24"
today_format3<-format(Today, format='%Y年%b月%d日(%a)'); today_format3
## [1] "2024年Mar月09日(Sat)"
`format` can also extract parts of a date:
j <- c("1/1/2018","2/11/2019", "7/6/2020")
j <- as.Date(j, format="%m/%d/%Y")
print(format(j, "%m"))
## [1] "01" "02" "07"
print(format(j, "%Y"))
## [1] "2018" "2019" "2020"
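`format()` returns strings, so extracted years and months usually need converting before they can be used in arithmetic or grouping. A minimal sketch with the same three dates; the names `yr` and `mo` are illustrative.

```r
j <- as.Date(c("1/1/2018", "2/11/2019", "7/6/2020"), format = "%m/%d/%Y")

yr <- as.integer(format(j, "%Y"))   # 2018 2019 2020
mo <- as.integer(format(j, "%m"))   # 1 2 7
table(yr)                           # counts per year
```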
file <- here::here('data','tsaipopularity0921.csv')
tmp <- read.csv(file, header = T, sep=',')
tmp$Date <- format(tmp$Date, format="%y-%b")
tmp$Date
## [1] "18-Mar" "18-Jun" "18-Sep" "18-Dec" "19-Mar" "19-Jun" "19-Sep" "19-Dec"
## [9] "20-Mar" "20-Mar" "20-Jun" "20-Sep" "20-Dec" "21-Mar" "21-Jun"
ggplot(data=tmp, aes(x=Date, y=Tsai, group=1)) +
geom_line(size=1.5, col="goldenrod") +
geom_point(shape=1, size=3) +
labs(y="%",
subtitle="Data: Taiwan's Election and Democratization Study") +
theme_bw() +
ggtitle("Percentage of Satisfaction with President Tsai's Performance")+
geom_text(data=tmp, aes(x=Date, label=Tsai),
vjust=0, hjust=-0.3, size=5) +
geom_text(label="Level 3 alert", x=14, y=45, col="#FF5511", size=4) +
geom_text(label="Presidential election, Covid-19", x=7.9, y=70, col="#FF5511", size=4) +
geom_text(label="One Country Two Systems", x=4.8, y=35, col="#FF5511", size=4) +
theme(axis.title= element_text(color="blue", size=14, face="bold"),
axis.text = element_text(size=9))
Figure 2.6: Satisfaction with the president's performance
xi<-"1953-06-15" #Xi's birthday
tsai<-"1956-08-31" #Tsai's birthday
as.Date(c(xi,tsai), origin="1904-01-01")
## [1] "1953-06-15" "1956-08-31"
difftime(tsai, xi)
## Time difference of 1173 days
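The `origin` argument is optional when the input is a date string. When computing the date that a number represents, however, a starting date is needed (R versions before 4.3.0 require it explicitly; later versions default to 1970-01-01):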
as.Date(1100, origin="2018-08-01")
## [1] "2021-08-05"
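The relationship runs both ways: a Date is stored as a day count from an origin, and a day count plus an origin gives back a Date. A short sketch of the round trip:

```r
d <- as.Date("2021-08-05")
as.numeric(d)                            # days since 1970-01-01
as.numeric(d - as.Date("2018-08-01"))    # 1100 days after the chosen origin
as.Date(1100, origin = "2018-08-01")     # back to the date
```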
lubridate is a package dedicated to handling dates and times; it can convert numbers directly into the corresponding dates.
## [1] "2010-12-15" "2023-12-25"
## [1] "Date"
## Time difference of 1.678 hours
## [1] "POSIXct" "POSIXt"
## [1] "POSIXct" "POSIXt"
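POSIXct stores a time as the number of seconds elapsed since the start of 1970 (UTC); POSIXlt stores the same time as a list of components such as year, month, day, and hour.
In the lubridate package, an interval object such as `time.interval` represents the span between two times, from which the elapsed time can be computed: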
## [1] 2024-02-15 08:30:30 UTC--2024-02-15 10:11:10 UTC
## [1] "1H 40M 40S"
## [1] "6040s (~1.68 hours)"
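The lubridate calls behind this output are not shown. As a cross-check, the same elapsed time can be computed in base R with `difftime()`; this sketch is a base-R substitute, not the lubridate code, and the variable names are illustrative.

```r
start <- as.POSIXct("2024-02-15 08:30:30", tz = "UTC")
end   <- as.POSIXct("2024-02-15 10:11:10", tz = "UTC")

elapsed <- difftime(end, start, units = "secs")
as.numeric(elapsed)          # 6040 seconds
as.numeric(elapsed) / 3600   # about 1.68 hours
```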
Besides lubridate, `strptime` converts strings to dates, for example:
## [1] "2020-01-01 EST" "2021-01-01 EST"
`strftime` converts dates to strings, for example:
## [1] "2023-01-01" "2024-01-25"
## [1] "character"
## [1] "01-01" "01-25"
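The calls that produced this output are missing, so here is a hedged sketch of a round trip through both functions. The time zone is fixed to UTC (an assumption) so the result does not depend on the machine's locale settings.

```r
s <- c("2023-01-01", "2024-01-25")

lt <- strptime(s, format = "%Y-%m-%d", tz = "UTC")   # string -> POSIXlt
class(lt)

md <- strftime(lt, format = "%m-%d", tz = "UTC")     # date-time -> string
md
class(md)
```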
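The following example uses lubridate to convert strings to dates, then `strftime` to convert them back to strings. Without converting to dates first, the data would be sorted in the wrong order.
inflation <- data.frame(rate=c(6.4, 6.0, 5.0, 4.9, 4.0, 3.0, 3.2, 3.7, 3.7, 3.2, 3.1, 3.4,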
7.5, 7.9, 8.5, 8.3, 8.6, 9.1, 8.5, 8.3, 8.2, 7.7, 7.1, 6.5,
1.4, 1.7, 2.6, 4.2, 5.0, 5.4, 5.4, 5.3, 5.4, 6.2, 6.8, 7.0),
y.m = c("2023.1","2023.2","2023.3","2023.4","2023.5","2023.6",
"2023.7","2023.8","2023.9","2023.10","2023.11","2023.12","2022.1","2022.2","2022.3","2022.4","2022.5","2022.6",
"2022.7","2022.8","2022.9","2022.10","2022.11","2022.12","2021.1","2021.2","2021.3","2021.4","2021.5","2021.6",
"2021.7","2021.8","2021.9","2021.10","2021.11","2021.12"))
dat <- inflation %>% mutate(date = lubridate::ym(y.m)) %>%
mutate(date = strftime(date, format = "%Y-%m"))
g2 <- ggplot(dat, aes(x = date, y = rate, group=1)) +
geom_point() +
geom_line(colour = "#BB0000") +
theme_bw() +
theme(axis.text = element_text(size=8, angle = 45,vjust = 3))
g2
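Figure 2.8: U.S. inflation rate, 2021-2023
Over these three years, inflation peaked in mid-2022. Meanwhile the U.S. dollar kept rising against the New Taiwan dollar: the United States kept raising interest rates to curb inflation, so exchanging New Taiwan dollars for U.S. dollars became less and less favorable, while local prices in the United States kept climbing.
After learning lubridate, `strptime`, and `strftime`, try to improve Figure 2.6 so that the x-axis shows the year.
Having covered data types, we now turn to data structures.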
example<-c(0,1,2,3,4)
print(example)
## [1] 0 1 2 3 4
Or:
c(2,4,6,8)->A; A
## [1] 2 4 6 8
c(letters)
## [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
## [20] "t" "u" "v" "w" "x" "y" "z"
c(LETTERS)
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
shares <- c(150, 40, 65)
names(shares) <- c('Finance','Technology','Cash')
shares
##    Finance Technology       Cash
##        150         40         65
class(shares)
## [1] "numeric"
cash<-c(100, 120, 80, 65)
names(cash) <- c(2016, 2017, 2018, 2019)
par(mfrow=c(1,2), bg='lightgreen',mai=c(0.4,0.3,0.1,0.3))
pie(shares); barplot(cash, cex.axis = 0.8)
Figure 3.1: Allocation of funds
j<-c(2*2, 2*9, 10-2, 3^3); j
## [1] 4 18 8 27
R<-c(100, 200, 300); R/5; sqrt(R)
## [1] 20 40 60
## [1] 10.00 14.14 17.32
c(j, R)
## [1] 4 18 8 27 100 200 300
Y<-c(j, c(9:5), R[c(1,2)]); Y
## [1] 4 18 8 27 9 8 7 6 5 100 200
R vectors can be concatenated, which lets us add to the data.
Y[-c(8:12)]
## [1] 4 18 8 27 9 8 7
One of R's data structures is the matrix. For example, `VADeaths` is a matrix dataset:
data("VADeaths"); VADeaths
## Rural Male Rural Female Urban Male Urban Female
## 50-54 11.7 8.7 15.4 8.4
## 55-59 18.1 11.7 24.3 13.6
## 60-64 26.9 20.3 37.0 19.3
## 65-69 41.0 30.9 54.6 35.1
## 70-74 66.0 54.3 71.1 50.0
class(VADeaths)
## [1] "matrix" "array"
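R matrices work like matrices in mathematics. A matrix is indexed by row first, then column. For example, a \(3\times 3\) matrix can be written as: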
m<-matrix(c(1:9), nrow=3, ncol=3); m
## [,1] [,2] [,3]
## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
n<-matrix(c(1:6), nrow=3, ncol=2);n
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
m<-matrix(c(1:9), nrow=3, ncol=3); n<-matrix(c(1:6), nrow=3, ncol=2); m%*%n
## [,1] [,2]
## [1,] 30 66
## [2,] 36 81
## [3,] 42 96
diag(m)
## [1] 1 5 9
t(m)
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
m[2,3]; m[2,3]<-0
## [1] 8
n<-matrix(c(1:6), nrow=3, ncol=2, byrow=T)
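The effect of `byrow` is easiest to see side by side. A minimal sketch; `by_col` and `by_row` are illustrative names.

```r
by_col <- matrix(1:6, nrow = 3, ncol = 2)                # default: fill down columns
by_row <- matrix(1:6, nrow = 3, ncol = 2, byrow = TRUE)  # fill across rows

by_col[1, ]   # 1 4
by_row[1, ]   # 1 2
identical(by_row, t(matrix(1:6, nrow = 2, ncol = 3)))    # TRUE
```

Filling by row is the same as filling the transposed shape by column and then transposing.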
The `dimnames` argument assigns names to the rows and the columns respectively, for example:
n<-matrix(c(1:6), nrow=3, ncol=2, dimnames = list(c("a","b","c"),c("A","B"))); n
## A B
## a 1 4
## b 2 5
## c 3 6
diag(1, nrow=5)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0 0 0 0
## [2,] 0 1 0 0 0
## [3,] 0 0 1 0 0
## [4,] 0 0 0 1 0
## [5,] 0 0 0 0 1
A %*% I = A; I %*% A = A
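The identity property can be checked numerically. This sketch uses a small hypothetical matrix `A` and also contrasts `%*%` with the element-wise `*`, a common source of confusion.

```r
A <- matrix(as.numeric(1:9), nrow = 3)
I <- diag(1, nrow = 3)     # 3 x 3 identity matrix

all.equal(A %*% I, A)      # TRUE
all.equal(I %*% A, A)      # TRUE
A * I                      # element-wise product: keeps only the diagonal of A
```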
\[ y=X\beta +\epsilon \]
\[(X'X)\hat{\beta}=X'y\]
\[\begin{align*} (X'X)\hat{\beta} &= X'y \\ (X'X)^{-1}(X'X)\hat{\beta} &= (X'X)^{-1}X'y \end{align*}\]
\[\begin{align*} (X'X)^{-1}(X'X)\hat{\beta} &= I \hat{\beta} \\ \hat{\beta} &= (X'X)^{-1}X'y \end{align*}\]
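The final formula can be verified against `lm()`. The sketch below uses simulated data (the seed, sample size, and coefficients are arbitrary choices for illustration) and computes \((X'X)^{-1}X'y\) directly with matrix operations.

```r
set.seed(1)                       # simulated data, purely illustrative
n  <- 50
x1 <- rnorm(n)
y  <- 2 + 3 * x1 + rnorm(n)

X <- cbind(1, x1)                 # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y   # (X'X)^{-1} X'y

cbind(by_hand = drop(beta_hat), lm = coef(lm(y ~ x1)))
```

In practice `lm()` uses a QR decomposition rather than an explicit inverse, which is numerically more stable; the explicit form is shown only because it mirrors the derivation.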
v1<-c(170, 175, 166, 172, 165, 157, 167, 167,
156, 160)
v2<-c("F","M","M","M","F","F","F","F","M","F")
v3<-v1/10 + 42
tmp<-data.frame(height=v1,gender=v2,weight=v3,
stringsAsFactors = FALSE)
tmp
## height gender weight
## 1 170 F 59.0
## 2 175 M 59.5
## 3 166 M 58.6
## 4 172 M 59.2
## 5 165 F 58.5
## 6 157 F 57.7
## 7 167 F 58.7
## 8 167 F 58.7
## 9 156 M 57.6
## 10 160 F 58.0
R treats it as a matrix. For example:
H<-cbind(LETTERS[1:6], seq(10,60, 10))
H
## [,1] [,2]
## [1,] "A" "10"
## [2,] "B" "20"
## [3,] "C" "30"
## [4,] "D" "40"
## [5,] "E" "50"
## [6,] "F" "60"
class(H)
## [1] "matrix" "array"
df <- structure(list(
year = c(2001, 2002, 2004, 2006),
length_days = c(366.3240, 365.4124, 366.5323423, 364.9573234)),
.Names = c("year", "length of days") ,
row.names = c(NA, -4L) ,class = "data.frame",
comment = 'Example Data Set')
df
## year length of days
## 1 2001 366.3
## 2 2002 365.4
## 3 2004 366.5
## 4 2006 365.0
dtt <- structure(list
(model = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L),
.Label = c("ma", "mb", "mc"), class = "factor"),
year = c(2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L, 2005L, 2006L, 2007L, 2008L, 2009L, 2010L),
V = c(0.16, 0.14, 0.11, 0.13, 0.15, 0.16, 0.24, 0.17, 0.12, 0.13, 0.15, 0.15, 0.2, 0.16, 0.11, 0.12, 0.12, 0.15),
lower = c(0.11, 0.11, 0.07, 0.09, 0.11, 0.12, 0.16, 0.12, 0.04, 0.09, 0.09, 0.11, 0.14, 0.1, 0.07, 0.08, 0.05, 0.1),
upper = c(0.21, 0.19, 0.17, 0.17, 0.19, 0.2, 0.29, 0.23, 0.16, 0.17, 0.16, 0.2, 0.26, 0.27, 0.15, 0.16, 0.15, 0.19)),
.Names = c("model", "year", "V", "lower", "upper"),
class = "data.frame", row.names = c(NA, -18L))
dtt
## model year V lower upper
## 1 ma 2005 0.16 0.11 0.21
## 2 ma 2006 0.14 0.11 0.19
## 3 ma 2007 0.11 0.07 0.17
## 4 ma 2008 0.13 0.09 0.17
## 5 ma 2009 0.15 0.11 0.19
## 6 ma 2010 0.16 0.12 0.20
## 7 mb 2005 0.24 0.16 0.29
## 8 mb 2006 0.17 0.12 0.23
## 9 mb 2007 0.12 0.04 0.16
## 10 mb 2008 0.13 0.09 0.17
## 11 mb 2009 0.15 0.09 0.16
## 12 mb 2010 0.15 0.11 0.20
## 13 mc 2005 0.20 0.14 0.26
## 14 mc 2006 0.16 0.10 0.27
## 15 mc 2007 0.11 0.07 0.15
## 16 mc 2008 0.12 0.08 0.16
## 17 mc 2009 0.12 0.05 0.15
## 18 mc 2010 0.15 0.10 0.19
With a matrix, use `matrix[,1]` to tell the system where the vector is; `matrix$a` does not work.
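A short sketch of the point, using a hypothetical matrix `m` with named columns. The error from `$` is trapped so the message can be inspected.

```r
m <- cbind(a = 1:3, b = c(10, 20, 30))   # a numeric matrix with named columns

m[, 1]      # first column by position
m[, "a"]    # same column by name

# $ is not defined for matrices, so trap the error to show the message
tryCatch(m$a, error = function(e) conditionMessage(e))

as.data.frame(m)$a   # works once the matrix becomes a data frame
```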
a<-c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
b<-c(52.5, 48.4, 57.1, 60.1, 71.1)
dftest <-cbind(a, b)
class(dftest)
## [1] "matrix" "array"
dftest
## a b
## [1,] "Monday" "52.5"
## [2,] "Tuesday" "48.4"
## [3,] "Wednesday" "57.1"
## [4,] "Thursday" "60.1"
## [5,] "Friday" "71.1"
#ggplot(data=dftest, aes(x=a, y=b, fill=a)) +
# geom_bar(stat = 'identity')
tm<-cbind(carData::Chile$vote, carData::Chile$sex)
class(tm)
## [1] "matrix" "array"
plot(table(tm[,2], tm[,1]))
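Figure 3.2: Chilean election
dt <- as.data.frame(dftest)
dt$b <- as.numeric(dt$b)  # cbind() stored b as text; convert back to numbers for the y-axis
dt$date<-factor(dt$a, levels=c("Monday", "Tuesday",
"Wednesday", "Thursday", "Friday"))
ggplot(data=dt, aes(x = date, y=b, fill = date)) +
geom_bar(stat = 'identity')
Figure 3.3: Happiness from Monday to Friday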
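Several commands are useful for data frames and matrices:
nrow(x): the number of rows of the data frame or matrix x, which is also the number of observations
ncol(x): the number of columns of x, which is also the number of variables
dim(x): the number of rows and columns of x together
str(x): the structure of x, with variable names and types
head(x): the first 6 rows of x
head(x, n=a): the first a rows of x
colnames(x): show or set the variable (column) names of x
rownames(x): show or set the name of each row of x
For more on rownames, see this blog.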
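The commands listed above can be tried on a small hypothetical data frame `d`:

```r
d <- data.frame(height = c(170, 175, 166), gender = c("F", "M", "M"))

nrow(d); ncol(d); dim(d)             # 3 rows, 2 columns
colnames(d) <- c("V1", "V2")         # set column names
rownames(d) <- c("r1", "r2", "r3")   # set row names
head(d, n = 2)                       # first two rows
d["r2", "V1"]                        # names can be used for indexing
```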
How many observations does `AMSsurvey` have?
nrow(AMSsurvey)
## [1] 24
colnames(tmp)<-c("V1","V2","V3"); tmp
## V1 V2 V3
## 1 170 F 59.0
## 2 175 M 59.5
## 3 166 M 58.6
## 4 172 M 59.2
## 5 165 F 58.5
## 6 157 F 57.7
## 7 167 F 58.7
## 8 167 F 58.7
## 9 156 M 57.6
## 10 160 F 58.0
library(ISLR)
head(College, n=3)
## Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University Yes 1660 1232 721 23 52
## Adelphi University Yes 2186 1924 512 16 29
## Adrian College Yes 1428 1097 336 22 50
## F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University 2885 537 7440 3300 450
## Adelphi University 2683 1227 12280 6450 750
## Adrian College 1036 99 11250 3750 400
## Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University 2200 70 78 18.1 12 7041
## Adelphi University 1500 29 30 12.2 16 10527
## Adrian College 1165 53 66 12.9 30 8735
## Grad.Rate
## Abilene Christian University 60
## Adelphi University 56
## Adrian College 54
head(state.x77, n=5)
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Alabama 3615 3624 2.1 69.05 15.1 41.3 20 50708
## Alaska 365 6315 1.5 69.31 11.3 66.7 152 566432
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## Arkansas 2110 3378 1.9 70.66 10.1 39.9 65 51945
## California 21198 5114 1.1 71.71 10.3 62.6 20 156361
names.to.delete<-c('Alabama', 'Alaska', 'Arkansas')
`which(rownames(data) %in% vector)` returns the positions of the rows whose names appear in the vector:
rows.to.delete<-which(rownames(state.x77) %in% names.to.delete)
newstate <- state.x77[-c(rows.to.delete),]
head(newstate, n=5)
## Population Income Illiteracy Life Exp Murder HS Grad Frost Area
## Arizona 2212 4530 1.8 70.55 7.8 58.1 15 113417
## California 21198 5114 1.1 71.71 10.3 62.6 20 156361
## Colorado 2541 4884 0.7 72.06 6.8 63.9 166 103766
## Connecticut 3100 5348 1.1 72.48 3.1 56.0 139 4862
## Delaware 579 4809 0.9 70.06 6.2 54.6 103 1982
Array1 <- array(1:12, dim = c(2, 6, 1)); Array1
## , , 1
##
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 3 5 7 9 11
## [2,] 2 4 6 8 10 12
Array2 <- array(1:12, dim = c(2, 3, 2)); Array2
## , , 1
##
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
##
## , , 2
##
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
A12<-Array2[,,2]; A12
## [,1] [,2] [,3]
## [1,] 7 9 11
## [2,] 8 10 12
listA<-list(tmp, H, c(xi,tsai)); listA
## [[1]]
## V1 V2 V3
## 1 170 F 59.0
## 2 175 M 59.5
## 3 166 M 58.6
## 4 172 M 59.2
## 5 165 F 58.5
## 6 157 F 57.7
## 7 167 F 58.7
## 8 167 F 58.7
## 9 156 M 57.6
## 10 160 F 58.0
##
## [[2]]
## [,1] [,2]
## [1,] "A" "10"
## [2,] "B" "20"
## [3,] "C" "30"
## [4,] "D" "40"
## [5,] "E" "50"
## [6,] "F" "60"
##
## [[3]]
## [1] "1953-06-15" "1956-08-31"
list(A=data.frame(x=c(1:5),y=c(101:105)),
B=data.frame(v1=rep(NA,6)))
## $A
## x y
## 1 1 101
## 2 2 102
## 3 3 103
## 4 4 104
## 5 5 105
##
## $B
## v1
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
listA[[3]]
## [1] "1953-06-15" "1956-08-31"
listB<-list(data=tmp, vec=m, char=c(tsai, xi));
listB[["data"]]
## V1 V2 V3
## 1 170 F 59.0
## 2 175 M 59.5
## 3 166 M 58.6
## 4 172 M 59.2
## 5 165 F 58.5
## 6 157 F 57.7
## 7 167 F 58.7
## 8 167 F 58.7
## 9 156 M 57.6
## 10 160 F 58.0
X = list(1:5, letters[1:5], c('Y','Y','N','Y','N'),
c("2/27/2018", "6/26/2018", "12/31/2018","1/20/2019","4/8/2019")); X
## [[1]]
## [1] 1 2 3 4 5
##
## [[2]]
## [1] "a" "b" "c" "d" "e"
##
## [[3]]
## [1] "Y" "Y" "N" "Y" "N"
##
## [[4]]
## [1] "2/27/2018" "6/26/2018" "12/31/2018" "1/20/2019" "4/8/2019"
X.dt<-setDT(X); X.dt
## V1 V2 V3 V4
## 1: 1 a Y 2/27/2018
## 2: 2 b Y 6/26/2018
## 3: 3 c N 12/31/2018
## 4: 4 d Y 1/20/2019
## 5: 5 e N 4/8/2019
Try combining c('a','b','c'), c(1,2,3,4), and
c('2018-01-01', '2018-04-04', '2018-04-05', '2018-06-18', '2018-10-10') into a single list.
class(Titanic); Titanic
## [1] "table"
## , , Age = Child, Survived = No
##
## Sex
## Class Male Female
## 1st 0 0
## 2nd 0 0
## 3rd 35 17
## Crew 0 0
##
## , , Age = Adult, Survived = No
##
## Sex
## Class Male Female
## 1st 118 4
## 2nd 154 13
## 3rd 387 89
## Crew 670 3
##
## , , Age = Child, Survived = Yes
##
## Sex
## Class Male Female
## 1st 5 1
## 2nd 11 13
## 3rd 13 14
## Crew 0 0
##
## , , Age = Adult, Survived = Yes
##
## Sex
## Class Male Female
## 1st 57 140
## 2nd 14 80
## 3rd 75 76
## Crew 192 20
T1<-Titanic[, , 1, 1]
class(T1); T1
## [1] "table"
## Sex
## Class Male Female
## 1st 0 0
## 2nd 0 0
## 3rd 35 17
## Crew 0 0
Titanic[, 1, 1,]
## Survived
## Class No Yes
## 1st 0 5
## 2nd 0 11
## 3rd 35 13
## Crew 0 0
Titanic[, 1, 2,]
## Survived
## Class No Yes
## 1st 118 57
## 2nd 154 14
## 3rd 387 75
## Crew 670 192
Titanic[, 2, 1,]
## Survived
## Class No Yes
## 1st 0 1
## 2nd 0 13
## 3rd 17 14
## Crew 0 0
Titanic[, 2, 2,]
## Survived
## Class No Yes
## 1st 4 140
## 2nd 13 80
## 3rd 89 76
## Crew 3 20
library(data.table)
DT = data.table(a = 1:3, b = c(10,20,30))
DT
## a b
## 1: 1 10
## 2: 2 20
## 3: 3 30
DT[1:3, sum(a)]
## [1] 6
DT[1:2, mean(b)]
## [1] 15
DT <-data.table(x=rnorm(1000, 0, 1))
DT[, plot(density(x), type='l', xlab='x', ylab='',
lwd=3, col='#1111ff')]
Figure 3.4: Probability density of the normal distribution
## NULL
g<-Titanic[ , , 2, 2]; class(g)
## [1] "table"
g<-data.frame(g)
X<-c(10,20,30,40,50,60); Sca<-10
X+Sca
## [1] 20 30 40 50 60 70
X/Sca
## [1] 1 2 3 4 5 6
Y<-c(5,10,6,8,25,6)
X/Y; X*Y
## [1] 2 2 5 5 2 10
## [1] 50 200 180 320 1250 360
a<-c(2,3,4); b<-a^2; print(b)
## [1] 4 9 16
c<-sqrt(b); print(c)
## [1] 2 3 4
EX<-c(2.54, 3.111, 10.999)
round(EX, digits=4)
## [1] 2.540 3.111 10.999
floor(EX)
## [1] 2 3 10
ceiling(EX)
## [1] 3 4 11
options(digits = 5)
print(EX)
## [1] 2.540 3.111 10.999
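To align printed numbers, R may print more decimal places than needed. Consider using the `sprintf` function for control, keeping in mind that it returns strings rather than numbers: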
perc<-MASS::Animals$brain/MASS::Animals$body
print(perc[1:3], digits=2)
## [1] 6.00 0.91 3.29
sprintf(perc[1:3], fmt="%#.2f")
## [1] "6.00" "0.91" "3.29"
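The trade-off between `round()` and `sprintf()` can be sketched with a hypothetical vector `v`: the first keeps numbers, the second fixes the display but returns text.

```r
v <- c(6, 0.9123, 3.2895)

r <- round(v, 2)           # still numeric, so arithmetic keeps working
s <- sprintf("%.2f", v)    # character, fixed to two decimals

r + 1
s
is.character(s)
as.numeric(s) + 1          # convert back before computing
```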
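Use the `weekdays()` command to show that today's class date falls on a Friday.
In the `mtcars` data, how many rows have `wt` greater than or equal to 2?
Use the data.table package to sort the data frame above.
Last updated: 2024-03-09 11:51:10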