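This chapter covers reading data into R, exporting data, and basic commands. R can read external data from statistical software, spreadsheet software, text editors, the web, and other sources. Create a new folder named data inside the folder of your R project and store the data files there. For example, to read Stata data: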
#Library
library(readstata13)
#working path
file <- here::here('data','Mystata.dta')
#read dataset
df1 <- read.dta13(file)
#show the first five variables in the data
str(df1[, 1:5])
## 'data.frame': 1244 obs. of 5 variables:
## $ Q1: int 0 0 12 1 0 0 1 0 0 11 ...
## $ Q2: int 1 1 2 2 2 2 2 2 2 1 ...
## $ Q3: int 1 1 1 1 1 1 1 98 96 1 ...
## $ Q4: int 2 2 1 3 1 4 4 98 3 3 ...
## $ Q5: int 2 3 2 2 2 2 3 2 3 2 ...
scan('~/Dropbox/EastAsia2024/data/voteshare', comment.char = '#', dec='.')
[1] 55.6 66.1 36.8 65.1 50.9 44.9 48.7 52.4 48.5 53.0 51.9
Because of comment.char = '#', R treats text that follows '#' as a comment, not as data.
ucladat <- scan("https://stats.idre.ucla.edu/stat/data/scan.txt",
what = list(age = 'numeric', name = ""))
ndat <- data.table::setDF(ucladat)
ndat
## age name
## 1 12 bobby
## 2 24 kate
## 3 35 david
## 4 20 michael
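One detail in the call above is worth checking: scan() takes the type of each field from the example value supplied in what, not from a type name. The string 'numeric' is itself a character value, so that field is read as character. The following is a minimal sketch using inline text instead of a file; the text and the object names as_chr and as_num are made up for illustration.

```r
# Inline text stands in for a data file; the values are illustrative only.
txt <- "12 bobby 24 kate 35 david"

# The string "numeric" is a character value, so age is read as character.
as_chr <- scan(text = txt, what = list(age = "numeric", name = ""), quiet = TRUE)
class(as_chr$age)

# A numeric example value (0) makes scan() read age as a number.
as_num <- scan(text = txt, what = list(age = 0, name = ""), quiet = TRUE)
class(as_num$age)
mean(as_num$age)
```

If a field was already read as character, as.numeric() converts it afterwards.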
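Data can also be typed directly into R. For example, to create a dataset called tmp, first declare two variables: the first is the company name and the second is its current market value (in billions of US dollars):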
tmp<-scan(what=list(company="character",
marketvalue="numeric"))
Apple, 851,
Alphabet, 719,
Microsoft, 703,
Amazon, 701,
Tencent, 496,
Berkshire Hathaway, 492,
Alibaba, 470,
Facebook, 464,
Jpmorgan Chase, 375,
Johnson & Johnson, 344
tmp1 <- data.table::setDT(tmp)
tmp1
marketvalue <- here::here('data', 'marketvalue.txt')
write.table(tmp1, marketvalue)
file <- here::here('data','tencompanies.txt')
dt <-scan(file, comment.char = '#',
what=list(company="character",
marketvalue="numeric"))
ndt <- data.table::setDF(dt)
ndt
## company marketvalue
## 1 Apple 851
## 2 Alphabet 719
## 3 Microsoft 703
## 4 Amazon 701
## 5 Tencent 496
## 6 Berkshire Hathaway 492
## 7 Alibaba 470
## 8 Facebook 464
## 9 Jpmorgan Chase 375
## 10 Johnson & Johnson 344
library(tidyverse)
ndt <- ndt %>% mutate(marketvalue=as.numeric(marketvalue))
str(ndt)
## 'data.frame': 10 obs. of 2 variables:
## $ company : chr "Apple" "Alphabet" "Microsoft" "Amazon" ...
## $ marketvalue: num 851 719 703 701 496 492 470 464 375 344
file <- here::here('data', 'councilor.csv')
csv1<-read.csv(file,
header=TRUE, sep=",", fileEncoding = 'BIG5')
head(csv1)
## Year budget unit contracter open
## 1 2015 676 水利處 台球 Yes
## 2 2016 673 新建工程處 茂盛 Yes
## 3 2016 270 新建工程處 冠君 Yes
## 4 2016 255 新建工程處 金煌 Yes
## 5 2016 235 新建工程處 聖鋒 Yes
## 6 2016 190 新建工程處 福呈 No
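header=TRUE means the first row is treated as variable names, sep specifies the field separator, and fileEncoding='BIG5' tells R that the file is encoded in BIG5 so that the Chinese text is displayed correctly.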
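To see what fileEncoding does without the course data, the sketch below writes a tiny BIG5-encoded file to a temporary location and reads it back. It assumes the R session runs in a UTF-8 locale and that the platform's iconv supports BIG5; the two data rows and the names tmp_csv and big5 are illustrative.

```r
# Write a tiny BIG5-encoded CSV to a temporary file (illustrative data).
tmp_csv <- tempfile(fileext = ".csv")
con <- file(tmp_csv, open = "w", encoding = "BIG5")
writeLines(c("unit,budget", "水利處,676", "公園處,155"), con)
close(con)

# Declaring the file encoding lets R convert the text while reading.
big5 <- read.csv(tmp_csv, header = TRUE, sep = ",", fileEncoding = "BIG5")
big5$unit

# Remove the temporary file.
unlink(tmp_csv)
```

Reading the same file without fileEncoding would typically show garbled characters or stop with an invalid-input error.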
R lets users control whether strings in the data are treated as factors, through the stringsAsFactors argument:
path <- here::here('data', 'councilor.csv')
csv2<-read.csv(path,
header=TRUE, sep=",",
fileEncoding = 'BIG5',
stringsAsFactors = F)
class(csv1$unit); table(csv1$unit)
## [1] "character"
##
## 公園處 新建工程處 水利處
## 1 8 1
class(csv2$unit)
## [1] "character"
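Both csv1 and csv2 show character above because, since R 4.0.0, stringsAsFactors defaults to FALSE. A small sketch with inline data (the text and object names are illustrative) makes the difference visible:

```r
# Inline CSV text stands in for a file; the values are illustrative.
csv_text <- "unit,budget\nA,676\nB,673\nB,270"

# Strings converted to factors.
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(as_factor$unit)
levels(as_factor$unit)

# Strings kept as character (the default since R 4.0.0).
as_string <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(as_string$unit)
```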
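The readr package also provides read_csv(). It assumes UTF-8 input by default, so a file in another encoding such as BIG5 needs an explicit locale, for example locale = readr::locale(encoding = 'BIG5').
path <- here::here('data', 'tsaipopularity0921.csv')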
csv.tsai <- readr::read_csv(path,
col_names = TRUE)
head(csv.tsai)
## # A tibble: 6 × 2
## Date Tsai
## <chr> <dbl>
## 1 18-Mar 26.4
## 2 18-Jun 23
## 3 18-Sep 23.6
## 4 18-Dec 21.6
## 5 19-Mar 30.4
## 6 19-Jun 44.6
font.add('LiSu', 'Lisu.ttf')
barplot(table(csv1$unit), family='LiSu')
Figure 2.1: Font test
The same chart can be drawn with the plotting functions of ggplot2, as in Figure 2.2:
font.add("GenRyuMin2JP-R","GenRyuMin2-R.ttc")
library(ggplot2)
p<-ggplot(data=csv1, aes(x=factor(unit))) +
geom_bar(stat="count") +
theme(text=element_text(family="GenRyuMin2JP-R", size=12)) +
labs(x='Unit')
p
Figure 2.2: A ggplot2 example
file <- here::here("data", "Studentsfull.txt")
students<-read.table(file, header=TRUE, sep="")
head(students)
## ID Name Department Score Gender
## 1 10322011 Ariel Aerospace 78 F
## 2 10325023 Becky Physics 86 F
## 3 10430101 Carl Journalism 69 M
## 4 10401032 Dimitri English 83 M
## 5 10307120 Enrique Chemistry 80 M
## 6 10207005 Fernando Chemistry 66 M
The readr package has a read_table command that reads txt files. For data whose variables are separated by spaces, it can be set up like this:
file <- here::here('data', 'hsb2.txt')
hsb2 <- readr::read_table(file, col_names = TRUE,
col_types = NULL, na = "NA", skip = 0,
comment = "")
head(hsb2)
## # A tibble: 6 × 11
## `"id"` `"gender"` `"race"` `"ses"` `"schtyp"` `"prog"` `"read"` `"write"`
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 "\"1\"" 70 "\"male\"" "\"whi… "\"low\"" "\"publ… "\"gene… 57
## 2 "\"2\"" 121 "\"female\"" "\"whi… "\"middle… "\"publ… "\"voca… 68
## 3 "\"3\"" 86 "\"male\"" "\"whi… "\"high\"" "\"publ… "\"gene… 44
## 4 "\"4\"" 141 "\"male\"" "\"whi… "\"high\"" "\"publ… "\"voca… 63
## 5 "\"5\"" 172 "\"male\"" "\"whi… "\"middle… "\"publ… "\"acad… 47
## 6 "\"6\"" 113 "\"male\"" "\"whi… "\"middle… "\"publ… "\"acad… 44
## # ℹ 3 more variables: `"math"` <dbl>, `"science"` <dbl>, `"socst"` <dbl>
The readr package also has a read_delim command that reads txt files. For data whose variables are separated by semicolons, it can be set up like this:
lambda <- here::here('data','lambda.txt')
tmp <- readr::read_delim(lambda, delim = ';', show_col_types = F)
tmp
## # A tibble: 5 × 4
## `Rating of Department` `Modal Category` `Right Guesses` `Wrong Guesses`
## <chr> <chr> <dbl> <dbl>
## 1 軍公教人員 (N=30) "TVBS" 15 15
## 2 私部門職員 (N=30) " 三立" 15 15
## 3 勞工 (N=25) " 民視" 15 10
## 4 農林漁牧 (N=10) " 三立" 8 2
## 5 總數(N=95) <NA> 53 42
Tab-separated files can be read with read_tsv:
fruit <- here::here("data","fruits.tsv")
tmp <- readr::read_tsv(fruit)
head(tmp)
## # A tibble: 6 × 6
## ID Name Age Fruits Drink Music
## <dbl> <chr> <dbl> <chr> <chr> <chr>
## 1 1 John 22 Pear Coffee K-Pop
## 2 2 Alice 30 Strawberry Soda Country
## 3 3 Ben 28 Banana Soda R&B
## 4 4 Eve 35 Mango Juice R&B
## 5 5 Mia 29 Durian Coffee R&B
## 6 6 Paul 38 Pear Water Country
The readr package ships with some example data files that you can try yourself, for example:
DELIM <- readr::read_delim(readr::readr_example('chickens.csv'),
delim=',')
head(DELIM)
## # A tibble: 5 × 4
## chicken sex eggs_laid motto
## <chr> <chr> <dbl> <chr>
## 1 Foghorn Leghorn rooster 0 That's a joke, ah say, that's a jok…
## 2 Chicken Little hen 3 The sky is falling!
## 3 Ginger hen 12 Listen. We'll either die free chick…
## 4 Camilla the Chicken hen 7 Bawk, buck, ba-gawk.
## 5 Ernie The Giant Chicken rooster 0 Put Captain Solo in the cargo hold.
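R has packages that read files from other statistical software directly. Stata data from version 12 or earlier can be read with read.dta() in the foreign package:
library(foreign)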
ucladata<-c("https://stats.idre.ucla.edu/stat/data/test.dta")
udata1<-read.dta(ucladata)
head(udata1)
## make model mpg weight price
## 1 AMC Concord 22 2930 4099
## 2 AMC Pacer 17 3350 4749
## 3 AMC Spirit 22 2640 3799
## 4 Buick Century 20 3250 4816
## 5 Buick Electra 15 4080 7827
Data saved by Stata 13 or later can be read with the readstata13 package:
library(readstata13)
file <- here::here('data', 'Mystata.dta')
udata2<-read.dta13(file)
str(udata2$Q1)
## int [1:1244] 0 0 12 1 0 0 1 0 0 11 ...
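The convert.factors argument controls whether the values of labelled variables are converted to factors. If they are not converted, they remain integer or numeric.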
file <- here::here('data', 'Mystata.dta')
udata3 <- readstata13::read.dta13(file, convert.factors=F)
class(udata2$partyid); class(udata3$partyid)
## [1] "factor"
## [1] "integer"
udata3 %>% janitor::tabyl(partyid) %>%
janitor::adorn_totals()
## partyid n percent
## 1 287 0.230707
## 2 246 0.197749
## 3 4 0.003215
## 4 21 0.016881
## 5 2 0.001608
## 6 54 0.043408
## 7 557 0.447749
## 9 73 0.058682
## Total 1244 1.000000
The foreign package can also read SPSS data, using read.spss():
library(foreign)
file <- here::here('data', 'PP0797B2.sav')
dv<-read.spss(file,
use.value.labels=F, to.data.frame=TRUE)
dv %>% janitor::tabyl(Q1) %>%
janitor::adorn_totals()
## Q1 n percent
## 1 617 0.299806
## 2 684 0.332362
## 3 443 0.215258
## 4 91 0.044218
## 95 10 0.004859
## 96 57 0.027697
## 97 52 0.025267
## 98 104 0.050534
## Total 2058 1.000000
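Setting use.value.labels=F means that the value labels stored in the file are not applied when the data are read, so that, for example, low, middle, and high education become 1, 2, and 3. The advantage is that categorical variables do not have to be converted to numbers afterwards; the drawback is that you need to consult the original data to learn what each value means. If to.data.frame=T is not set, the data are returned as a list. Try removing use.value.labels=F and compare the result. The labels can also be assigned by hand:
dv$Q1n <- NA_character_ #start with an empty character column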
dv$Q1n[dv$Q1==1]<-'非常不同意'
dv$Q1n[dv$Q1==2]<-'不同意'
dv$Q1n[dv$Q1==3]<-'同意'
dv$Q1n[dv$Q1==4]<-'非常同意'
dv$Q1n=factor(dv$Q1n, levels=c('非常不同意','不同意','同意','非常同意'))
par(bg='lightblue', family='HanWangWCL07')
barplot(table(dv$Q1n), col='white')
Figure 2.3: Bar plot of the recoded labels
Another option is the haven package, using read_sav():
udata1<-haven::read_sav(file, encoding = 'UTF-8')
udata1[1:4, 1:3]
## # A tibble: 4 × 3
## Q1 Q2 Q3
## <dbl+lbl> <dbl+lbl> <dbl+lbl>
## 1 96 [很難說] 3 [同意] 2 [不同意]
## 2 1 [非常不同意] 4 [非常同意] 2 [不同意]
## 3 1 [非常不同意] 4 [非常同意] 1 [非常不同意]
## 4 3 [同意] 3 [同意] 2 [不同意]
pie(table(udata1$Q1))
Figure 2.4: Pie chart of data read with the haven package
A further option is the sjlabelled package. For its features, see the web page of the package author, Daniel Lüdecke.
file <- here::here('data', 'PP1697C1.sav')
udata4<-sjlabelled::read_spss(file)
sjlabelled::get_labels(udata4$Q10)
## [1] "非常不同意" "不同意" "既不同意也不反對" "同意"
## [5] "非常同意" "拒答" "看情形" "無意見"
## [9] "不知道"
Labels can also be set with functions in the sjlabelled package, for example:
#set_labels(udata4$Q7, labels='總統滿意度')
# set_labels(udata4$Q8, labels='政治興趣')
par(bg='#0022FF33')
barplot(table(sjlabelled::as_label(udata4$Q8)),
col='white', family='YouYuan', cex.names=0.8)
Figure 2.5: Bar plot of data read with the sjlabelled package
library(sjlabelled); library(knitr)
library(kableExtra)
crx<-table(as_label(udata4$Q8), as_label(udata4$Q7))
kable(crx, format = 'pandoc',
caption = '政治興趣與總統滿意度') %>%
kable_styling(bootstrap_options = "striped", full_width = F)
政治興趣 | 非常不滿意 | 不太滿意 | 有點滿意 | 非常滿意 | 拒答 | 看情形 | 無意見 | 不知道 |
---|---|---|---|---|---|---|---|---|
完全沒興趣 | 97 | 92 | 71 | 7 | 7 | 5 | 27 | 29 |
不太有興趣 | 95 | 150 | 122 | 14 | 5 | 11 | 38 | 25 |
還算有興趣 | 55 | 69 | 84 | 16 | 2 | 2 | 10 | 4 |
非常有興趣 | 12 | 10 | 14 | 9 | 0 | 1 | 1 | 2 |
拒答 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
看情形 | 9 | 3 | 1 | 2 | 2 | 0 | 1 | 0 |
無意見 | 0 | 0 | 3 | 1 | 0 | 0 | 2 | 1 |
不知道 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
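With the sjlabelled package, the variable labels and value labels of SPSS data can be inspected conveniently, without checking them against the SPSS file. The package offers many other features as well.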
test.missing <- read.table(
"https://stats.idre.ucla.edu/stat/data/test_missing_comma.txt",
header = TRUE, sep = ",")
head(test.missing)
## prgtype gender id ses schtyp level
## 1 general 0 70 4 1 1
## 2 vocati 1 121 4 NA 1
## 3 general 0 86 NA NA 1
## 4 vocati 0 141 4 3 1
## 5 academic 0 172 4 2 1
## 6 academic 0 113 4 2 1
college<-read.csv('https://stats.moe.gov.tw/files/detail/111/111_ab110_S.csv', header=TRUE)
nrow(college)
## [1] 148
showtext::showtext_auto()
library(tidyverse)
df<- college %>% mutate (degree=學位生_正式修讀學位外國生+
學位生_僑生.含港澳.+學位生_正式修讀學位陸生)
newdf <- df[order(df$degree, decreasing=T), ]
newdat<- data.frame(school=newdf$學校名稱[1:5],
degree=newdf$degree[1:5] )
newdat$school<-factor(newdat$school, levels=newdf$學校名稱[1:5])
ggplot(data=newdat, aes(x=school, y=degree)) +
geom_bar(stat='identity', fill='#EECCEE') +
theme_bw() + #apply the complete theme first
theme(text=element_text(family='YouYuan', size=11)) #then set the font so it is not overwritten
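Figure 2.6: Top five schools by number of overseas degree students
The four packages readr, haven, sjlabelled, and foreign each provide different functions for reading data, with different features. For example, sjlabelled can display variable labels and names, and readr handles text files with many kinds of separators. Try them out.
R lets users export data after processing it, so that other users can work with the data on other platforms.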
file <- here::here('data', 'voteshare')
vs<-scan(file, comment.char = '#', dec='.')
vs
vsnew<-c(vs, 61.9, 31.8, 44.5)
vsnew
## [1] 55.6 66.1 36.8 65.1 50.9 44.9 48.7 52.4 48.5 53.0 51.9 61.9 31.8 44.5
write.table(vsnew,'vsnew.txt')
read.table('vsnew.txt')
## x
## 1 55.6
## 2 66.1
## 3 36.8
## 4 65.1
## 5 50.9
## 6 44.9
## 7 48.7
## 8 52.4
## 9 48.5
## 10 53.0
## 11 51.9
## 12 61.9
## 13 31.8
## 14 44.5
de<-data.frame(name=state.abb, region=state.region, area=state.area)
region.a<-substr(state.region, 1,1)
region.a
## [1] "S" "W" "W" "S" "W" "W" "N" "S" "S" "S" "W" "W" "N" "N" "N" "N" "S" "S" "N"
## [20] "S" "N" "N" "N" "S" "N" "W" "N" "W" "N" "N" "W" "N" "S" "N" "N" "S" "W" "N"
## [39] "N" "S" "N" "S" "S" "W" "N" "S" "W" "S" "N" "W"
de <- data.frame(de, region.short=as.factor(region.a))
head(de)
## name region area region.short
## 1 AL South 51609 S
## 2 AK West 589757 W
## 3 AZ West 113909 W
## 4 AR South 53104 S
## 5 CA West 158693 W
## 6 CO West 104247 W
write.csv(de, 'state.csv', row.names = F)
state<-read.csv('state.csv', header=TRUE)
head(state)
## name region area region.short
## 1 AL South 51609 S
## 2 AK West 589757 W
## 3 AZ West 113909 W
## 4 AR South 53104 S
## 5 CA West 158693 W
## 6 CO West 104247 W
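When using write.csv(), there is no need to specify a separator, and none has to be given when the file is read back; the data are still imported correctly.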
R stores every variable created at the command line in the global environment. globalenv() returns this environment, and ls() lists the objects it contains:
globalenv()
## <environment: R_GlobalEnv>
ls(envir = globalenv())
## [1] "college" "crx" "csv.tsai" "csv1" "csv2"
## [6] "de" "DELIM" "df" "df1" "dt"
## [11] "dv" "file" "fruit" "hsb2" "lambda"
## [16] "ndat" "ndt" "newdat" "newdf" "p"
## [21] "path" "region.a" "state" "students" "test.missing"
## [26] "tmp" "ucladat" "ucladata" "udata1" "udata2"
## [31] "udata3" "udata4" "vs" "vsnew"
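The ls() command returns the objects in a given environment. After attach(), the variables of a data frame are directly visible in R. However, changes made to an attached variable are not saved back to the data frame, so remember to export the data or record the changes in a script. For example:
head(csv2)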
## Year budget unit contracter open
## 1 2015 676 水利處 台球 Yes
## 2 2016 673 新建工程處 茂盛 Yes
## 3 2016 270 新建工程處 冠君 Yes
## 4 2016 255 新建工程處 金煌 Yes
## 5 2016 235 新建工程處 聖鋒 Yes
## 6 2016 190 新建工程處 福呈 No
attach(csv2)
contracter
## [1] "台球" "茂盛" "冠君" "金煌" "聖鋒" "福呈" "盛吉" "茂盛"
## [9] "冠君" "未發包"
contracter[1]<-"未發包"
csv2$contracter[10]<-"台球"
csv2
## Year budget unit contracter open
## 1 2015 676 水利處 台球 Yes
## 2 2016 673 新建工程處 茂盛 Yes
## 3 2016 270 新建工程處 冠君 Yes
## 4 2016 255 新建工程處 金煌 Yes
## 5 2016 235 新建工程處 聖鋒 Yes
## 6 2016 190 新建工程處 福呈 No
## 7 2015 155 公園處 盛吉 Yes
## 8 2016 154 新建工程處 茂盛 Yes
## 9 2016 142 新建工程處 冠君 Yes
## 10 2016 123 新建工程處 台球 Yes
detach(csv2)
csv2
## Year budget unit contracter open
## 1 2015 676 水利處 台球 Yes
## 2 2016 673 新建工程處 茂盛 Yes
## 3 2016 270 新建工程處 冠君 Yes
## 4 2016 255 新建工程處 金煌 Yes
## 5 2016 235 新建工程處 聖鋒 Yes
## 6 2016 190 新建工程處 福呈 No
## 7 2015 155 公園處 盛吉 Yes
## 8 2016 154 新建工程處 茂盛 Yes
## 9 2016 142 新建工程處 冠君 Yes
## 10 2016 123 新建工程處 台球 Yes
rm(list=ls()) #remove all data
data(mtcars) #suppose we analyze mtcars
m1<-lm(mpg ~ cyl, data=mtcars) #regression
summary(m1) #results
##
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.981 -2.119 0.222 1.072 7.519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.885 2.074 18.27 < 2e-16 ***
## cyl -2.876 0.322 -8.92 6.1e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.21 on 30 degrees of freedom
## Multiple R-squared: 0.726, Adjusted R-squared: 0.717
## F-statistic: 79.6 on 1 and 30 DF, p-value: 6.11e-10
mydata<-data.frame(date=as.Date(c("2018-03-13",
"2018-03-14","2018-03-15"),
format='%Y-%m-%d'),
workinghours=c(4, 3, 4)) #create your own data
save.image("test.Rdata") #save all results to Rdata
rm(list=ls()) #remove all data
load("test.Rdata") #load Rdata
ls(envir = globalenv(),10) #display objects in this environment
## [1] "m1" "mtcars" "mydata"
mydata #display your data
## date workinghours
## 1 2018-03-13 4
## 2 2018-03-14 3
## 3 2018-03-15 4
newfile <- here::here('data','voteshare')
vs<-scan(newfile, comment.char = '#', dec='.')
vs
## [1] 55.6 66.1 36.8 65.1 50.9 44.9 48.7 52.4 48.5 53.0 51.9
vs2<-vs/100
saveRDS(vs, "vs.rds")
saveRDS(vs2, 'vs2.rds')
rm(vs); rm(vs2)
vs<-readRDS('vs.rds')
vs2<-readRDS('vs2.rds')
vs; vs2
## [1] 55.6 66.1 36.8 65.1 50.9 44.9 48.7 52.4 48.5 53.0 51.9
## [1] 0.556 0.661 0.368 0.651 0.509 0.449 0.487 0.524 0.485 0.530 0.519
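The two storage formats used above behave differently on reload: readRDS() returns a single object that can be assigned to any name, while load() restores objects under their original names and returns those names. A minimal sketch with temporary files (the vector and the file names are illustrative):

```r
# Illustrative vector and temporary file paths.
shares <- c(55.6, 66.1, 36.8)
rds_file <- tempfile(fileext = ".rds")
rda_file <- tempfile(fileext = ".RData")

# saveRDS()/readRDS(): one object, and the name is chosen when reading.
saveRDS(shares, rds_file)
renamed <- readRDS(rds_file)

# save()/load(): objects come back under their original names.
save(shares, file = rda_file)
rm(shares)
restored <- load(rda_file)   # character vector of the restored object names
restored
identical(renamed, shares)

# Remove the temporary files.
unlink(c(rds_file, rda_file))
```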
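R can read a file of existing commands and run many lines of code without opening the script, which saves a good deal of space and time. For example, suppose we write a user-defined function whose code is long; we can first save it as a script file and run it directly later.
sink("twohistograms.R") #define a new script file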
cat("set.seed(02138)") #input a function that sets starting number for random number
cat("\n") #end of line
cat("#write R script to a file without opening a document")
cat("\n") #end of line
cat("fnorm<-function(mu){ #create a function with a parameter: mu
sample.o<-rnorm(20,mu,1/sqrt(mu)) #define the 1st vector that generates random numbers
sample.i<-sample.o+runif(1,0,10) #define the 2nd vector that generates random numbers
par(mfrow=c(1,2)) #set parameter of graphic for 1*2 graphics
hist(sample.o, col=1, main='', #histogram with Basic R
xlab='Original sample')
hist(sample.i, col=4, main='', #another histogram
xlab='Original sample + random number')
}")
cat("\n") #end of function
sink() #save the script in the specified file
file.show("twohistograms.R") #Opening an editor to show the script
source("twohistograms.R")
fnorm(1)
par(mfrow=c(1,2))
library(car)
with(Duncan, hist(income, col=2))
with(Salaries, hist(salary, col=6))
Figure 4.1: Histograms of two variables with similar names
names(mtcars)
## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
## [11] "carb"
which(Orange$circumference>100)
## [1] 4 5 6 7 10 11 12 13 14 18 19 20 21 24 25 26 27 28 32 33 34 35
The which() function can be used to filter the data:
oc<-which(Orange$circumference>100) #create a vector
#of data that meets a condition
oc
## [1] 4 5 6 7 10 11 12 13 14 18 19 20 21 24 25 26 27 28 32 33 34 35
Orange[oc,] #match data with the vector
## Tree age circumference
## 4 1 1004 115
## 5 1 1231 120
## 6 1 1372 142
## 7 1 1582 145
## 10 2 664 111
## 11 2 1004 156
## 12 2 1231 172
## 13 2 1372 203
## 14 2 1582 203
## 18 3 1004 108
## 19 3 1231 115
## 20 3 1372 139
## 21 3 1582 140
## 24 4 664 112
## 25 4 1004 167
## 26 4 1231 179
## 27 4 1372 209
## 28 4 1582 214
## 32 5 1004 125
## 33 5 1231 142
## 34 5 1372 174
## 35 5 1582 177
rep(3, 5)
## [1] 3 3 3 3 3
c(rep("大", 3), rep("中", 1), rep("小",2))
## [1] "大" "大" "大" "中" "小" "小"
seq(1,10)
## [1] 1 2 3 4 5 6 7 8 9 10
seq(100,110, by=2)
## [1] 100 102 104 106 108 110
seq(5:10)
## [1] 1 2 3 4 5 6
seq(100:110)
## [1] 1 2 3 4 5 6 7 8 9 10 11
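The last two results above may look surprising. When seq() receives a single vector of length greater than one, it returns the positions of that vector, just as seq_along() does, rather than a sequence between its end points. A short comparison (the object names are illustrative):

```r
# One vector argument: positions of its elements (1 to 6).
positions <- seq(5:10)
positions
seq_along(5:10)

# Two arguments: the sequence from 5 to 10.
from_to <- seq(5, 10)
from_to
```

To generate the values 5 through 10, pass the end points as separate arguments or write 5:10.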
latvija<-c("Daugavpils","Jēkabpils","Jelgava",
"Liepāja","Rēzekne","Rīga","Valmiera",
"Ventspils")
grep("pils", latvija)
## [1] 1 2 8
latvija[grep("pils", latvija)]
## [1] "Daugavpils" "Jēkabpils" "Ventspils"
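Besides returning positions, grep() can return the matching values directly with value = TRUE, and grepl() returns a logical vector that is convenient for subsetting. A small sketch with a shortened, illustrative vector:

```r
# A shortened vector for illustration.
cities <- c("Daugavpils", "Jelgava", "Ventspils")

# Positions of the matches.
idx <- grep("pils", cities)
idx

# The matching values themselves.
hits <- grep("pils", cities, value = TRUE)
hits

# One TRUE/FALSE per element, usable as a subsetting index.
flag <- grepl("pils", cities)
flag
cities[!flag]
```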
opendata <- here::here("data","opendata106N0101.csv")
dat <- readr::read_csv(opendata, col_names = TRUE)
district<-dat[grep("區", dat$code), ]
head(dat, n=3)
## # A tibble: 3 × 4
## code 年底人口數 土地面積 人口密度
## <chr> <chr> <dbl> <chr>
## 1 新北市板橋區 551480 23.1 23835
## 2 新北市三重區 387484 16.3 23747
## 3 新北市中和區 413590 20.1 20532
#create a list
L <- list(a=c('lecture', 'movie'), b=c('Movie channel'), c=c(1:10),
d=c('movie','food', "news",'car','music'))
#select elements in L
match.s<-grep('movie', L) ; match.s
## [1] 1 4
#subset of L
L[grep('movie', L)]
## $a
## [1] "lecture" "movie"
##
## $d
## [1] "movie" "food" "news" "car" "music"
library(tidyverse)
#dat2 <-dat[grep("臺", dat$code), ]
#change 臺 to 台
dat2 <- dat%>% mutate(code=gsub("臺", "台", dat$code))
#subset
dat2[grep('台北市', dat2$code), ]
## # A tibble: 12 × 4
## code 年底人口數 土地面積 人口密度
## <chr> <chr> <dbl> <chr>
## 1 台北市松山區 206988 9.29 22286
## 2 台北市信義區 225753 11.2 20143
## 3 台北市大安區 309969 11.4 27283
## 4 台北市中山區 230710 13.7 16862
## 5 台北市中正區 159608 7.61 20981
## 6 台北市大同區 129278 5.68 22754
## 7 台北市萬華區 191850 8.85 21673
## 8 台北市文山區 274424 31.5 8709
## 9 台北市南港區 122155 21.8 5593
## 10 台北市內湖區 287771 31.6 9113
## 11 台北市士林區 288295 62.4 4622
## 12 台北市北投區 256456 56.8 4513
#Open data
opendata <- here::here("data","opendata106N0101.csv")
dat <- readr::read_csv(opendata, col_names = TRUE)
#create a new variable from 'code'
dat2 <- dat%>% dplyr::mutate(city=substr(dat$code, 1,3))
head(dat2, n=3)
## # A tibble: 3 × 5
## code 年底人口數 土地面積 人口密度 city
## <chr> <chr> <dbl> <chr> <chr>
## 1 新北市板橋區 551480 23.1 23835 新北市
## 2 新北市三重區 387484 16.3 23747 新北市
## 3 新北市中和區 413590 20.1 20532 新北市
dat2 <- dat2[-c(371:375),]
dat3 <- dat2 %>%
dplyr::group_by(city) %>%
dplyr::summarize(avg.area=mean(土地面積, na.rm = T),
sum.area=sum(土地面積, na.rm = T)) %>%
dplyr::filter(sum.area >3)
ggplot2::ggplot(data=dat3, aes(y=sum.area,
x=reorder(city, -sum.area))) +
geom_point() +
theme(axis.text = element_text(family="HanWangYanKai", size=7),
axis.title = element_text(family="Georgia", size=14)) +
xlab("City") +
ylab("Area")
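Figure 4.2: Land area by city and county
\(\blacksquare\) Exercise: draw a plot of the population of each city and county (hint: convert the character variable 年底人口數 to numeric).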
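sub() or gsub(): replace a specified string. sub() replaces only the first match in each element, whereas gsub() replaces every match. For example:
country<-c( "United States", "Republic of Kenya", "Republic of Korea")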
sub('Republic of', '', country)
## [1] "United States" " Kenya" " Korea"
U<-matrix(c('文殊蘭花與蝴蝶蘭花','茶花','杜鵑花',
'玫瑰花','菊花','蘭花'), nrow=3, ncol=2)
U
## [,1] [,2]
## [1,] "文殊蘭花與蝴蝶蘭花" "玫瑰花"
## [2,] "茶花" "菊花"
## [3,] "杜鵑花" "蘭花"
sub('蘭花','蘭', U)
## [,1] [,2]
## [1,] "文殊蘭與蝴蝶蘭花" "玫瑰花"
## [2,] "茶花" "菊花"
## [3,] "杜鵑花" "蘭"
gsub('蘭花','蘭', U)
## [,1] [,2]
## [1,] "文殊蘭與蝴蝶蘭" "玫瑰花"
## [2,] "茶花" "菊花"
## [3,] "杜鵑花" "蘭"
zodiac<-c( "(mouse)", "(ox)", "(tiger)", "(rabbit)", "(dragon)")
zodiac<-sub("\\(","", zodiac)
sub("\\)","", zodiac)
## [1] "mouse" "ox" "tiger" "rabbit" "dragon"
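The two sub() calls above can be combined into one step. In the sketch below, a character class matches either parenthesis, and gsub() removes every match; the second variant uses fixed = TRUE so that the parenthesis needs no escaping. The shortened vector and object names are illustrative.

```r
# A shortened vector for illustration.
zodiac <- c("(mouse)", "(ox)", "(tiger)")

# Inside [], parentheses are literal, so one gsub() removes both kinds.
clean <- gsub("[()]", "", zodiac)
clean

# fixed = TRUE treats the pattern as plain text instead of a regular expression.
no_open <- sub("(", "", zodiac, fixed = TRUE)
no_open
```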
country<-c("People's Republic of China",
"Democratic Republic of Congo",
"United States",
"Republic of Kenya", "Republic of Korea",
"Democratic People's Republic of Korea")
country[grep('^Republic of', country)]
## [1] "Republic of Kenya" "Republic of Korea"
gsub("^Republic of", "", country)
## [1] "People's Republic of China"
## [2] "Democratic Republic of Congo"
## [3] "United States"
## [4] " Kenya"
## [5] " Korea"
## [6] "Democratic People's Republic of Korea"
a <- c("Every day, Customs and Border Protection agents
encounter thousands of illegal immigrants trying
to enter our country. We are out of space to hold
them, and we have no way to promptly return them
back home to their country. America proudly
welcomes millions of lawful immigrants who enrich
our society and contribute to our nation, but all
Americans are hurt by uncontrolled illegal migration.")
strsplit(a, split=" ")
## [[1]]
## [1] "Every" "day," "Customs" "and" "Border"
## [6] "Protection" "agents" "" "" ""
## [11] "" "" "" "" "encounter"
## [16] "thousands" "of" "illegal" "immigrants" "trying"
## [21] "" "" "" "" ""
## [26] "" "" "to" "enter" "our"
## [31] "country." "We" "are" "out" "of"
## [36] "space" "to" "hold" "" ""
## [41] "" "" "" "" ""
## [46] "them," "and" "we" "have" "no"
## [51] "way" "to" "promptly" "return" "them"
## [56] "" "" "" "" ""
## [61] "" "" "back" "home" "to"
## [66] "their" "country." "America" "proudly" ""
## [71] "" "" "" "" ""
## [76] "" "welcomes" "millions" "of" "lawful"
## [81] "immigrants" "who" "enrich" "" ""
## [86] "" "" "" "" "our"
## [91] "society" "and" "contribute" "to" "our"
## [96] "nation," "but" "all" "" ""
## [101] "" "" "" "" "Americans"
## [106] "are" "hurt" "by" "uncontrolled" "illegal"
## [111] "migration."
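The empty strings in the result come from runs of spaces in the indented text: splitting on a single space produces one empty element for each extra space. Splitting on the regular expression "\\s+" (one or more whitespace characters, including line breaks) avoids them. A minimal sketch with a short, illustrative string:

```r
# A short string with repeated spaces and a line break (illustrative).
speech <- "Every day,  agents\n   encounter thousands"

# Splitting on a single space leaves empty strings behind.
by_space <- strsplit(speech, split = " ")[[1]]
by_space

# Splitting on one or more whitespace characters does not.
words <- strsplit(speech, split = "\\s+")[[1]]
words
length(words)
```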
x<-c(2,4,6)
cat(x, "\n");
## 2 4 6
cat("summation:", sum(x), "\n", "average:", mean(x))
## summation: 12
## average: 4
library(dplyr)
x1 = starwars$mass[1:6]
weight <- paste(x1, "kg", sep=" ")
x2 = starwars$height[1:6]
height <- paste0(x2, " ", "cm")
data.table::data.table(name=starwars$name[1:6], weight, height)
## name weight height
## <char> <char> <char>
## 1: Luke Skywalker 77 kg 172 cm
## 2: C-3PO 75 kg 167 cm
## 3: R2-D2 32 kg 96 cm
## 4: Darth Vader 136 kg 202 cm
## 5: Leia Organa 49 kg 150 cm
## 6: Owen Lars 120 kg 178 cm
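apply() lets users apply a function, including one they write themselves, to the object under study. For example, suppose we want the minimum, maximum, mean, and number of observations of each variable in faithful:
head(faithful)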
## eruptions waiting
## 1 3.600 79
## 2 1.800 54
## 3 3.333 74
## 4 2.283 62
## 5 4.533 85
## 6 2.883 55
apply(faithful,2,function(x) c(min(x),max(x),
mean(x),length(x)))
## eruptions waiting
## [1,] 1.600 43.0
## [2,] 5.100 96.0
## [3,] 3.488 70.9
## [4,] 272.000 272.0
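The 2 inside apply() means that the statistic is computed along the columns; a 1 means it is computed along the rows. For example, consider quarterly sales for different years:
dataapp<-read.table(text="Q1 Q2 Q3 Q4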
1980 70 55 60 70
1990 60 70 55 69
2000 80 50 90 66
2010 80 60 70 88",header=T)
apply(dataapp, 1, mean)
## 1980 1990 2000 2010
## 63.75 63.50 71.50 74.50
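This gives the average sales for each year. Readers can compute the average sales of each quarter across years on their own.
Analyses across variables like this are less common, so we mostly use 2 to compute statistics for individual variables. Still, apply() is also useful for cross-tabulations. For example, with hypothetical data on drinks and health, we can compute the totals for the three drinks: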
drink<-c(rep("coffee",14), rep("juice",10),
rep("coffee",5), rep("soda",13),
rep("juice",10),rep("soda",8))
heart<-c(rep("healthy", 21),
rep("not healthy",6),
rep("not healthy", 7),
rep("healthy",11),
rep("refuse to answer",15))
tdh<-table(drink,heart); tdh
## heart
## drink healthy not healthy refuse to answer
## coffee 14 5 0
## juice 10 3 7
## soda 8 5 8
apply(tdh,1,sum)
## coffee juice soda
## 19 20 21
apply(tdh,2,sum)
## healthy not healthy refuse to answer
## 32 13 15
addmargins() computes the row or column totals of a table. Here we practice with apply() instead, and also compute the marginal probabilities:
row.margin<-apply(tdh,1,function(x)
100*sum(x)/length(drink))
row.margin<-round(row.margin,1)
col.margin<-apply(tdh,2,function(x)
100*sum(x)/length(heart))
col.margin<-round(col.margin,1)
tdh<-cbind(tdh, row.margin)
tdh<-rbind(tdh, col.margin=c(col.margin,100))
tdh
## healthy not healthy refuse to answer row.margin
## coffee 14.0 5.0 0 31.7
## juice 10.0 3.0 7 33.3
## soda 8.0 5.0 8 35.0
## col.margin 53.3 21.7 25 100.0
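In the example above, 100\(\times\) sum(x)/length(variable) gives the row and column marginal probabilities in percent, that is, \(\frac{\sum y_{i}}{N}\) multiplied by 100. The original table is then combined with the marginal probabilities. Because col.margin has only three elements, we append 100 in the final step and name the new row col.margin.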
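For comparison, the same kind of table can be produced with the built-in helpers prop.table() and addmargins(). The sketch below uses a small made-up dataset rather than the drink data above, and the object names are illustrative.

```r
# Small made-up data for illustration.
drink <- c("coffee", "coffee", "juice", "soda", "soda", "soda")
heart <- c("healthy", "not healthy", "healthy",
           "healthy", "healthy", "not healthy")
tab <- table(drink, heart)

# Counts with row and column totals added.
addmargins(tab)

# Cell and marginal percentages: proportions first, then margins, then scale.
pct <- round(100 * addmargins(prop.table(tab)), 1)
pct
```

The margins added by addmargins() are named Sum, so the overall total sits in pct["Sum", "Sum"].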
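The examples at the start of this section used built-in functions such as mean(), but we can also write our own. For example, to obtain the sum of squares of sales for each quarter: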
apply(dataapp, 2, function(x) sum(x^2))
## Q1 Q2 Q3 Q4
## 21300 14025 19625 21761
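We could use the subset() command to split the data into groups and then compute statistics, but tapply() gives the same result more easily. Take the sleep data as an example:
table(sleep$group)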
##
## 1 2
## 10 10
tapply(sleep$extra,sleep$group,mean)
## 1 2
## 0.75 2.33
tapply(sleep$extra,sleep$group,function(x) sum(x^2))
## 1 2
## 34.43 90.37
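The example shows that the first argument of tapply() is the data to be summarized by group, the second is the grouping variable, and the third is the function. As with apply(), the function can be user-defined; here we compute the mean and the sum of squares.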
tapply() accepts more than one grouping variable. The example below groups Age by Music in the fruits data:
file <- here::here('data', 'fruits.tsv')
dt <- readr::read_tsv(file)
tr <- tapply(dt$Age, dt$Music, mean)
tr
## Country K-Pop R&B Rock
## 31.88 25.60 28.50 29.44
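When several grouping variables are needed, they are passed to tapply() as a list, and the result is a table with one dimension per grouping variable. A minimal sketch with the built-in mtcars data, chosen here only because it ships with R:

```r
# Mean mpg by number of cylinders (rows) and transmission type (columns).
two_way <- tapply(mtcars$mpg,
                  list(cyl = mtcars$cyl, am = mtcars$am),
                  mean)
round(two_way, 1)
dim(two_way)
```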
lapply() is used to compute statistics on a list or a data frame, and it returns the result as a list. For example:
x <- list(a = sample(1:100, 10), beta = exp(-3:3), logic = c(TRUE,FALSE,FALSE,TRUE))
# compute the list mean for each list element
lapply(x, mean)
## $a
## [1] 47
##
## $beta
## [1] 4.535
##
## $logic
## [1] 0.5
ISLR::Auto %>% select_if(is.numeric) %>%
lapply(quantile)
## $mpg
## 0% 25% 50% 75% 100%
## 9.00 17.00 22.75 29.00 46.60
##
## $cylinders
## 0% 25% 50% 75% 100%
## 3 4 4 8 8
##
## $displacement
## 0% 25% 50% 75% 100%
## 68.0 105.0 151.0 275.8 455.0
##
## $horsepower
## 0% 25% 50% 75% 100%
## 46.0 75.0 93.5 126.0 230.0
##
## $weight
## 0% 25% 50% 75% 100%
## 1613 2225 2804 3615 5140
##
## $acceleration
## 0% 25% 50% 75% 100%
## 8.00 13.78 15.50 17.02 24.80
##
## $year
## 0% 25% 50% 75% 100%
## 70 73 76 79 82
##
## $origin
## 0% 25% 50% 75% 100%
## 1 1 1 2 3
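lapply takes one function per call, so two different statistics cannot be requested as separate arguments. A user-defined function can be supplied instead:
mean_squared_dev <- function(x) sum(x^2 - mean(x))/length(x)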
carData::Duncan %>% select_if(is.numeric) %>%
lapply(mean_squared_dev)
## $income
## [1] 2295
##
## $education
## [1] 3576
##
## $prestige
## [1] 3197
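Two remarks on the function above, offered as a sketch rather than a correction of the author's intent. First, it subtracts the mean after squaring each value, so it returns mean(x^2) - mean(x); the conventional mean squared deviation centers the values before squaring. Second, a single function passed to lapply() can return several statistics at once as a named vector. The function name two_stats and the toy list are illustrative.

```r
# Return the mean and the (centered) mean squared deviation together.
two_stats <- function(x) {
  c(mean = mean(x), msd = mean((x - mean(x))^2))
}

# Toy list for illustration.
toy <- list(a = c(1, 2, 3), b = c(2, 4, 6, 8))
res <- lapply(toy, two_stats)
res
```

Replacing lapply() with sapply() in the last call would arrange the same numbers in a matrix with one column per list element.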
sapply() returns a named vector, for example:
st <- carData::Duncan %>% select_if(is.numeric) %>%
sapply(mean)
st
## income education prestige
## 41.87 52.56 47.69
class(st)
## [1] "numeric"
datalist3<-list(dep=c(rep(1,30),rep(2,30),
rep(3,30), rep(4,60), rep(5,40)),
x1=c(rep(1,35),rep(0,80)),
x2=c(rep("low",2),rep("middle",3),
rep("high",11)))
dep<-datalist3[["dep"]]
length(dep)
## [1] 190
We can use sapply to compute the length and the mean of each element:
sapply(datalist3, length)
## dep x1 x2
## 190 115 16
sapply(datalist3, mean)
## dep x1 x2
## 3.2632 0.3043 NA
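Note that dplyr::select() cannot handle list data, so we have to use base R functions here.
Next, we drop the last variable and recompute the mean as well as the 25th and 75th percentiles: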
dl13<-datalist3[-length(datalist3)]
sapply(dl13, mean)
## dep x1
## 3.2632 0.3043
sapply(dl13, function(x) quantile(x,c (.25,.75)))
## dep x1
## 25% 2 0
## 75% 4 1
nchar() can also serve as the function for lapply():
dl2<-datalist3[-c(1,2)]
nchar_x2<-lapply(dl2, nchar)
nchar_x2[[1]][1:6]
## [1] 3 3 6 6 6 4
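As the discussion above shows, the apply family works on two broad kinds of input, ordinary data and list data, and accepts both built-in and user-defined functions.
We can combine select and sapply to filter variables, for example: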
fruits <- here::here("data","fruits.tsv")
tmp <- readr::read_tsv(fruits, show_col_types = FALSE)
.f <- tmp %>% select(which(sapply(., class)=="character"))
head(.f)
## # A tibble: 6 × 4
## Name Fruits Drink Music
## <chr> <chr> <chr> <chr>
## 1 John Pear Coffee K-Pop
## 2 Alice Strawberry Soda Country
## 3 Ben Banana Soda R&B
## 4 Eve Mango Juice R&B
## 5 Mia Durian Coffee R&B
## 6 Paul Pear Water Country
In sapply(., class)==, the "." refers to the data frame passed in by the pipe. Another example:
UsingR::dowdata %>% sapply(., mean)
## Date Open High Low Close
## NA 10609 10777 10448 10609
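Use # to explain what each piece of code does (in Chinese or English), and show the results of running it.
Import the IDRE dataset hsb2_small ("https://stats.idre.ucla.edu/stat/data/hsb2_small.csv") and display its variable names.
Use site="http://faculty.gvsu.edu/kilburnw/nes2008.RData" and load(file=url(site)). After reading the data with these commands, first show the distribution of V083097. Then recode this variable into "Democrat", "Republican", "Independent", and "Other party (SPECIFY)", and show the frequency distribution of the recoded variable.
Export the hsb2_small data in text format and in rds format.
Import the 2008 presidential election data (2008Election.csv) and find the town.id where the KMT vote share is highest. (Hint: use the function that returns the maximum value.)
Try importing the Studentsfull file used in this week's class again, but this time with a different import function.
List the rows for Da'an District (大安區) in the government open data file opendata106N0101.csv.
In the Studentsfull.txt data, change Journalism to Communication and display the rows that now belong to Communication.
How many words are repeated in the following text?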
Four score and seven years ago our fathers brought forth on this continent, a new nation, conceived in Liberty, and dedicated to the proposition that all men are created equal. Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived and so dedicated, can long endure. We are met on a great battle-field of that war. We have come to dedicate a portion of that field, as a final resting place for those who here gave their lives that that nation might live. It is altogether fitting and proper that we should do this. But, in a larger sense, we can not dedicate—we can not consecrate—we can not hallow—this ground. The brave men, living and dead, who struggled here, have consecrated it, far above our poor power to add or detract. The world will little note, nor long remember what we say here, but it can never forget what they did here. It is for us the living, rather, to be dedicated here to the unfinished work which they who fought here have thus far so nobly advanced. It is rather for us to be here dedicated to the great task remaining before us—that from these honored dead we take increased devotion to that cause for which they gave the last full measure of devotion—that we here highly resolve that these dead shall not have died in vain—that this nation, under God, shall have a new birth of freedom—and that government of the people, by the people, for the people, shall not perish from the earth.
Read the data at the link (https://www.statlearning.com/s/Advertising.csv) from Gareth James's website, Resources - Second Edition — An Introduction to Statistical Learning (statlearning.com), and display the variable names and their types.
A student has the following data:
db <- tibble(salary=c('42,000','55,000','45,000','66,000', '65,000'),
years=c(3,4,3,5,5), bonus=c(5000,4000,5000,6000,5000))
db
## # A tibble: 5 × 3
## salary years bonus
## <chr> <dbl> <dbl>
## 1 42,000 3 5000
## 2 55,000 4 4000
## 3 45,000 3 5000
## 4 66,000 5 6000
## 5 65,000 5 5000
Help the student remove the thousands separators from the first variable.
Last updated: 2025-03-10 20:30:48