我們將介紹R
的繪圖指令,包括ggplot2
的套件中的指令,把資料視覺化。雖然之前我們已經看過很多例子,但是仍然有必要詳細介紹包括直方圖、折線圖、散佈圖等等圖形,結合dplyr
套件中的指令,以培養視覺化資料的能力,並且在圖形上面顯示統計結果。例如我們可以視覺化資料的分佈並且找出每一個類別的平均數:
file<-here("data","studentsfull.txt")
library(ggplot2)
students<-read.table(file, sep='', header=TRUE)
p <- ggplot(students, aes(x=Department, y=Score)) +
geom_point() +
theme_classic()
p + ggplot2::stat_summary(fun.y='mean', colour='red',size=3, geom = "point")
有人說,「一張照片勝過千言萬語」。越來越來媒體報導、研究報告、論文以圖形表現變數之間的關係,讓讀者可以很快地了解。
R
內建了許多基本的繪圖指令,讓我們可以很輕鬆地使用,但是ggplot2
提升了繪圖的層次,所以我們必須學習並且熟練ggplot2
。
在R
的基礎繪圖功能中,繪圖的元素有「點」、「線」、「大小」、「粗細」以及「顏色」等等。
xn<-c(1:10)
yn<-c(1:10)
col<-c(80, 75, 70, 60, 50, 40, 30, 20, 10, 0)
graycol<-paste("gray",col,sep="")
graycol
## [1] "gray80" "gray75" "gray70" "gray60" "gray50" "gray40" "gray30" "gray20"
## [9] "gray10" "gray0"
plot(xn, yn,
cex=c(rep(1, 5), rep(2,5)),
col=graycol,
pch=c(1,2,5,6,8, 16, 17, 20, 22, 24),
xlab="pch", ylab="cex",
xaxt="n", yaxt="n",
xlim=c(0,11),ylim=c(0,11), font.lab=2, cex.lab=1.5)
axis(1, at=c(1:10), labels=c(1,2,5,6,8, 16, 17, 20, 22, 24))
axis(2, at=c(1:10), labels=c(rep(1, 5), rep(2,5)))
abline(v=6, lwd=2, col='red'); abline(h=6, lwd=3, lty=2)
Figure 2.1: 圖型的元素
with(anscombe, plot(x1, y1, pch=16, cex=2, col='blue',
xlab='X', ylab='Y', main='x1-y1', xaxt='n'))
axis(side=1, labels=c(1,"" ,"" ,"" ,"" ,6), at=c(4,6,8,10,12,14))
Figure 2.2: 散佈圖一
with(anscombe, summary(x1))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 6.5 9.0 9.0 11.5 14.0
with(anscombe, summary(y1))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.26 6.32 7.58 7.50 8.57 10.84
x<-c(1: max(anscombe$x1))
y<-c(1: max(anscombe$x1))
plot(x, y, type='n')
with(anscombe, points(x1, y1, pch=16, cex=2, col='gray25'))
Figure 2.3: 散佈圖二
x<-c(1: max(anscombe$x1))
y<-c(1: max(anscombe$x1))
plot(x, y, type='n')
with(anscombe, points(x1[which(x1<=8)],
y1[which(x1<=8)], pch=16, cex=2, col='red'))
with(anscombe, points(x1[which(x1>8)],
y1[which(x1>8)], pch=10, cex=2, col='#ee0022'))
abline(v=8, lwd=2, lty=3)
Figure 2.4: 散佈圖三
#RRGGBB
的顏色。可以查這個網站輸入Hex6位數找到喜歡的顏色。
plot(x, y, type='n')
with(anscombe, points(x1[which(x1<=8)],
y1[which(x1<=8)], pch=16, cex=2, col='red'))
with(anscombe, points(x1[which(x1>8)],
y1[which(x1>8)], pch=16, cex=2, col='darkblue'))
abline(v=8, lwd=2, lty=3)
with(anscombe, text(x1, y1,
c(1: nrow(anscombe)), pos=4))
Figure 2.5: 散佈圖四
plot(x, y, type='n', xlim=c(1, 16), xaxt='n')
axis(1, labels = c(2:15), at=c(2:15))
with(anscombe, points(x1[which(x1<=8)], y1[which(x1<=8)], pch=16,
cex=2, col='red'))
with(anscombe, points(x1[which(x1>8)], y1[which(x1>8)], pch=22,
cex=2, col='darkblue'))
abline(v=8, lwd=2, lty=3)
with(anscombe, text(x1, y1, paste(x1, y1, sep=","), pos=4))
Figure 2.6: 散佈圖五
☛ 請根據ISLR::College這筆資料的Top10perc排序,篩選出新生為高中前10%畢業生的比率最高的前10名學校,觀察錄取率(Accept/Apps)與曾捐款的校友比率(perc.alumni)的關係,然後註記學校的名稱。
TXT<-here("data","studentsfull.txt")
students<-read.table(TXT, header=T, sep='')
stu <- table(students$Department)
barplot(stu, main="Departments",
xlab="", ylab="frequency")
Figure 2.7: 直方圖一
barplot(100*stu/nrow(students), main="Departments",
xlab="", ylab="Percent", border='red',
col='#0011EE22', ylim=c(0, 20))
Figure 2.8: 直方圖二
student.table <- table(students$Gender, students$Department)
barplot(100*prop.table(student.table, margin=2),
col=c('brown', 'white'),
legend = levels(students$Gender))
Figure 2.9: 直方圖三
student.table <- table(students$Gender, students$Department)
barplot(100*prop.table(student.table, margin=2),
col=c('blue', '#EE330011'))
legend("top", fill=c('blue', '#EE330011'), c("M","F"))
Figure 2.10: 直方圖與圖例
par(xpd=T, mar=par()$mar+c(0,0,0,6))
TXT<-here::here("data","studentsfull.txt")
students<-read.table(TXT, header=T, sep='')
stu <- table(students$Department)
student.table <- table(students$Gender, students$Department)
barplot(student.table, col=c("red", "blue"),
legend=c('female','male'))
Figure 2.11: 圖例在圖外面
pt.students<-100*prop.table(student.table, 2)
barplot(pt.students, col=c("salmon1", "royalblue1"),
ylim=c(0,100),xlim=c(0.3,8))
legend(8.5, 90, c('female','male'),
col=c("salmon1", "royalblue1"),
pch=c(20,20), bty='n')
Figure 2.12: 圖例在圖外面
barplot(100*prop.table(student.table, margin=2),
names.arg=c("Aer","Che.", "Eco.", "Eng.", "Jou.", "Mec.","Phy."))
Figure 2.13: 直方圖四
☛請嘗試讀取PP0797B2.sav這筆資料,然後畫長條圖表示partyid這個變數中,政黨各類別的相對次數。
boxplot(mtcars$mpg, ylim=c(0,40), yaxt='n')
Figure 2.14: 箱型圖一
quantile(mtcars$mpg, c(.25, .5, .75), type=6)
## 25% 50% 75%
## 15.27 19.20 22.80
R
計算如下:
qu<-quantile(mtcars$mpg, c(.25, .5, .75), type=7)
qu
## 25% 50% 75%
## 15.43 19.20 22.80
upper<- qu[3]+1.5*(qu[3]-qu[1])
lower<- qu[1]-1.5*(qu[3]-qu[1])
cat("upper", upper); cat("/", "lower", lower)
## upper 33.86
## / lower 4.363
# upper inner fence
min(max(mtcars$mpg), upper)
## [1] 33.86
# lower inner fence
max(min(mtcars$mpg), lower)
## [1] 10.4
boxplot(mtcars$mpg, ylim=c(0,40), yaxt='n')
axis(2, at=c(1:40, by=5), labels=c(1:40,by=5))
Figure 2.15: 盒型圖二
boxplot(state.area, ylab="Area of State")
Figure 2.16: 盒型圖三
y<-state.abb
identify(rep(1, length(y)), y, labels=seq_along(y)))
Orange$tree <-ordered(Orange$Tree, levels=c(1,2,3,4,5))
with(Orange, boxplot(circumference ~ tree))
Figure 2.17: 盒型圖四
☛請試著畫ISLR
套件中
par(mfrow=c(1,2))
hist(USArrests$Assault, col="tomato2", main="breaks_default")
hist(USArrests$Assault, breaks = 15, col="tomato3",main="breaks_15")
Figure 2.18: 長條圖一
hist(USArrests$Assault, col="tomato4", freq = F,
xlab="Assault", main="Assault in 50 States", breaks = 10)
Figure 2.19: 長條圖二
R
會自動挑選適合該變數分佈離散程度的寬度,breaks參數越大,寬度越小,有可能出現某一間隔沒有任何觀察值之狀況。但是寬度越大,變數的分佈越粗略。
par(mfrow=c(1,2))
y <- c()
for (i in 1:1000)
{x=rnorm(1000,0,1)
x.sample <- sample(x, 100)
y[i]=mean(x.sample)}
hist(y, 10, probability = T)
rug(jitter(y))
hist(y, 50, probability = T)
rug(jitter(y))
Figure 2.20: 兩個長條圖並列
par(mfrow=c(1,1))
set.seed(02138)
y <- c()
for (i in 1:1000)
{x=rnorm(1000,0,1)
x.sample <- sample(x, 100)
y[i]=mean(x.sample)}
hist(y, 100, probability = T, col="gray90")
curve(dunif(x, min=min(y), max=max(y)), add=T, col="blue", lwd=2)
curve(dnorm(x, mean=mean(y), sd=sd(y)), add=T, col="red", lwd=2)
Figure 2.21: 加上機率密度曲線的長條圖
Y<-ggplot2::diamonds$price/1000
hist(Y, freq = F , breaks=25)
lines(density(Y), lty=2,
lwd=2, col="#e122aa")
curve(dchisq(x, df=4), add=T,
lwd=2, col="darkblue")
curve(dexp(x, rate=.25), lwd=2, add=T)
Figure 2.22: 鑽石價格機率密度
with(airquality, plot(Wind, type='b'))
Figure 2.23: 空氣品質折線圖一
或者是圖 2.24 設定不同的顏色以及符號,顯示臭氧層的變化情形:
with(airquality, plot(Ozone, type='o', pch=16,
cex=1.2, lty=2, lwd=2, col='red'))
Figure 2.24: 空氣品質折線圖二
plot(LakeHuron, type = "o", pch=16, cex=1.2, lty=2) ## Index plot
Figure 2.25: Lake Huron折線圖
CSV<-here::here("data","Tondutrend.csv")
trend<-read.csv(CSV, header=T, sep=",")
tonduts<-ts(trend, start=1992.6, frequency=2)
tonduts
## Time Series:
## Start = 1992.6
## End = 2019.6
## Frequency = 2
## 統一 維持現狀 獨立 拒答
## 1993 2.4 66.1 7.7 23.80
## 1993 18.6 42.2 4.0 35.20
## 1994 20.4 51.4 11.4 16.70
## 1994 NA NA NA NA
## 1995 NA NA NA NA
## 1995 NA NA NA NA
## 1996 19.1 46.1 11.1 23.80
## 1996 22.3 41.1 12.2 24.40
## 1997 21.0 48.5 14.1 16.40
## 1997 27.8 47.1 16.1 9.00
## 1998 22.8 50.8 17.9 9.00
## 1998 20.5 49.9 17.4 12.20
## 1999 20.3 50.5 20.0 9.20
## 1999 18.8 53.8 18.9 8.50
## 2000 13.3 58.8 17.8 10.10
## 2000 20.8 52.5 16.2 10.60
## 2001 26.8 51.1 9.0 13.00
## 2001 23.8 55.5 13.3 7.40
## 2002 26.2 52.5 15.4 6.00
## 2002 21.5 53.5 17.9 7.10
## 2003 21.6 52.9 18.8 6.80
## 2003 17.6 56.6 19.2 6.60
## 2004 14.9 55.9 21.6 7.70
## 2004 12.8 60.2 19.6 7.40
## 2005 14.1 57.7 20.7 7.50
## 2005 15.7 59.3 19.8 5.20
## 2006 16.2 58.4 20.1 5.20
## 2006 14.9 60.9 18.5 5.70
## 2007 15.9 60.7 18.6 4.80
## 2007 18.2 53.3 23.1 5.40
## 2008 12.9 58.2 20.8 8.00
## 2008 11.5 59.6 20.4 8.50
## 2009 10.5 59.6 25.7 4.10
## 2009 10.2 63.1 21.3 5.50
## 2010 10.6 63.9 19.6 5.90
## 2010 12.3 61.0 22.8 3.80
## 2011 10.0 64.6 21.0 4.40
## 2011 11.1 61.3 23.0 4.60
## 2012 10.8 63.7 19.1 6.40
## 2012 10.7 64.8 19.1 5.40
## 2013 11.1 62.8 19.6 6.50
## 2013 12.3 59.2 23.3 5.20
## 2014 11.6 61.3 21.7 5.40
## 2014 10.9 60.8 22.9 5.40
## 2015 8.8 61.6 23.6 5.90
## 2015 9.6 61.6 20.5 8.40
## 2016 10.8 60.6 22.2 6.40
## 2016 10.0 61.3 22.8 5.90
## 2017 11.4 60.1 22.5 5.90
## 2017 13.2 58.7 23.1 5.00
## 2018 14.4 59.8 19.9 5.90
## 2018 16.5 58.1 20.0 5.50
## 2019 17.5 57.9 19.5 5.20
## 2019 10.6 57.6 25.8 6.00
## 2020 7.5 57.6 28.0 6.98
R
的優點在於允許時間點有缺漏,畫出來的圖形就有缺少某個或是某些時間點,不會強制連接前後兩個時間點。接下來是畫折線圖。圖 2.26 顯示2004年以來統獨趨勢一直相當穩定,但是在2016年之後偏統的支持度似乎上升,到了2019年又下降,而偏獨立的民意上升:
# Margin
par(xpd=NA, mar=par()$mar+c(2.5, 0, 0, 0),family='HanWangKaiMediumChuIn')
# Plot
plot(tonduts, plot.type=c("single"), lty=c(1,2,3,2),ylab="%",xlab=NULL,pch='1', lwd=3,frame.plot=F,col=c("gray20","gray60", "black", "gray80"),xaxt="n", main="台灣民眾的統獨立場, 1992.6-2020.12")
axis(1, at=seq(1992,2020,by=2))
axis(2, at=seq(10,70,by=10))
legend("bottomright", c("統一","維持現狀"),inset=c(0.35, -0.6), col=c("gray20","gray60"),lty=c(1,2), bty='n', lwd=3)
legend("bottomright", c("獨立","無反應"), inset=c(0, -0.6), col=c("black", "gray80"),lty=c(3,2), bty='n', lwd=3)
text(2000, 30, paste("第一次政黨輪替"))
Figure 2.26: 統獨趨勢圖
R
會出現figure margin too large的錯誤訊息,這時候請調整一下par(mar=c()),例如par(mar=c(-0.2,-0.2,0,0))
library(here);library(foriegn)
CSV<-here("data","Tondutrend.csv")
trend<-read.csv(CSV, header=T, sep=",")
tonduts<-ts(trend, start=1992.6, frequency=2)
par(xpd=NA, mar=par()$mar+c(2.5, 0, 0, 0),
family='HanWangKaiMediumChuIn')
library(here);library(foriegn);library(showtext)
showtext.auto()
font.add("細明體", "MingLiu.ttf")
par(xpd=NA,
mar=par()$mar+c(0, 0, 0, 0),
family='細明體')
☛請畫圖表示MASS::accdeaths的趨勢。
with(anscombe, symbols(x2, y1, circles=anscombe$x1, inches=0.2, fg='blue'))
Figure 2.27: 特殊點狀圖一
with(USArrests, symbols(Murder, Assault, circles=UrbanPop, inches=0.12, bg="red"))
Figure 2.28: 特殊點狀圖二
m.arrest<-with(USArrests, lm(Assault ~ Murder+UrbanPop))
summary(m.arrest)
##
## Call:
## lm(formula = Assault ~ Murder + UrbanPop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -107.78 -25.99 -1.97 22.49 111.82
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23.620 33.217 -0.71 0.481
## Murder 15.071 1.572 9.59 1.2e-12 ***
## UrbanPop 1.175 0.473 2.48 0.017 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 47.8 on 47 degrees of freedom
## Multiple R-squared: 0.684, Adjusted R-squared: 0.671
## F-statistic: 51 on 2 and 47 DF, p-value: 1.69e-12
☛請畫圖表示MASS::Boston這筆資料中,與生師比(ptratio)與房價中位數(medv)的關係,並且考慮低社會地位人口比例(lstat)的作用。
#設定路徑
setwd(here::here())
setwd(here::here('Fig'))
#設定圖形輸出的檔案名稱與大小
png("pie.png", width=4, height = 6, units = 'in', res=300)
#繪圖
pie(table(ISLR::Auto$origin))
#結束
dev.off()
quartz_off_screen 2
接下來我們介紹用ggplot2
繪圖。基本的指令為:
ggplot2
的例子,有興趣的同學可參考。ggplot(ChickWeight, aes(x=Diet), stat='bin') +
geom_bar(fill='red', col='blue') +
ggtitle('Diet')
Figure 3.1: ggplot2直方圖一
stu<-here::here('data','studentsfull.txt')
students<-read.table(stu, header=T, sep='')
stu.t <- as.data.frame(table(students$Department, dnn=c("Dep")))
stu.t
## Dep Freq
## 1 Aerospace 4
## 2 Chemistry 3
## 3 Economics 4
## 4 English 3
## 5 Journalism 4
## 6 Mechanics 5
## 7 Physics 3
g1 <- ggplot(stu.t, aes(x=Dep, y=Freq)) +
geom_bar(stat = 'identity', fill="#4411FF33") + ylab('Count')
g1
Figure 3.2: ggplot2直方圖二
dt <- scan(what=list(company="character",
marketvalue="numeric"))
#Apple 851
#Microsoft 703
#Amazon 701
#Facebook 464
#DT<-data.table::setDT(dt)
dt <- data.frame(company=c("Apple", "Microsoft",
"Amazon", "Facebook"),
marketvalue=c(851,703,701,464))
DT <- dt %>% mutate(company=as.factor(company),
marketvalue=as.numeric(marketvalue))
ggplot(DT, aes(x=reorder(company, -marketvalue),
y=marketvalue)) +
geom_bar(aes(fill=company),stat="identity") +
theme_classic()
Figure 3.3: 次數分配轉成直方圖
ggplot2
裡面用☛請由小而大,呈現reshape2
的
ggplot(data=tips, aes(x=time)) +
geom_bar(aes(y=(..count..)/sum(..count..)),
stat="count", fill="#ee0011") +
scale_y_continuous(label = scales::percent)
Figure 3.4: 加上百分比的直方圖
g1 + geom_text(data=stu.t, aes(x=Dep, label=Freq, y=Freq+0.2), size=8)
Figure 3.5: 加上資料標籤的直方圖
salary<-alr4::salary
ggplot(salary, aes(x=rank)) +
stat_count(geom='bar', fill='#0aaee1') +
geom_text(aes(label=..count..), stat='count', colour='white', vjust=1.6)
Figure 3.6: ggplot2直方圖三
full<-here::here('data','studentsfull.txt')
students<-read.table(full, header=T, sep='')
stu <- dplyr::mutate(students, Pass=(Score>70))
kableExtra::kable_styling(knitr::kable(stu))
ID | Name | Department | Score | Gender | Pass |
---|---|---|---|---|---|
10322011 | Ariel | Aerospace | 78 | F | TRUE |
10325023 | Becky | Physics | 86 | F | TRUE |
10430101 | Carl | Journalism | 69 | M | FALSE |
10401032 | Dimitri | English | 83 | M | TRUE |
10307120 | Enrique | Chemistry | 80 | M | TRUE |
10207005 | Fernando | Chemistry | 66 | M | FALSE |
10305019 | George | Mechanics | 75 | F | TRUE |
10305022 | Howell | Mechanics | 81 | M | TRUE |
10305029 | Ian | Mechanics | 60 | M | FALSE |
10305031 | Julio | Mechanics | 89 | M | TRUE |
10322014 | Kaori | Aerospace | 82 | F | TRUE |
10425026 | Luke | Physics | 88 | M | TRUE |
10401022 | Miguel | English | 92 | M | TRUE |
10501006 | Neo | English | 77 | M | TRUE |
10321010 | Olivia | Economics | 85 | F | TRUE |
10321011 | Peter | Economics | 88 | M | TRUE |
10405017 | Qing | Mechanics | 88 | F | TRUE |
10422007 | Ricky | Aerospace | 91 | M | TRUE |
10422008 | Seiko | Aerospace | 80 | F | TRUE |
10430005 | Terresa | Journalism | 62 | F | FALSE |
10530009 | Usla | Journalism | 87 | F | TRUE |
10421001 | Vivian | Economics | 70 | F | FALSE |
10307018 | Wendy | Chemistry | 85 | F | TRUE |
10425003 | Physics | 93 | M | TRUE | |
10221030 | Yoko | Economics | 66 | F | FALSE |
10430015 | Zoe | Journalism | 92 | F | TRUE |
ggplot(stu, aes(x=Department, fill=Pass)) +
geom_bar(position='dodge')
Figure 3.7: 並列直方圖
library(dplyr)
stu <- students %>% mutate(Pass=(Score>70))
stu.ag <- summarize(group_by(stu, Department, Pass), Count=n())
stu.ag <- stu.ag %>% mutate(Pct=Count/sum(Count))
kableExtra::kable_styling(knitr::kable(stu.ag))
Department | Pass | Count | Pct |
---|---|---|---|
Aerospace | TRUE | 4 | 1.0000 |
Chemistry | FALSE | 1 | 0.3333 |
Chemistry | TRUE | 2 | 0.6667 |
Economics | FALSE | 2 | 0.5000 |
Economics | TRUE | 2 | 0.5000 |
English | TRUE | 3 | 1.0000 |
Journalism | FALSE | 2 | 0.5000 |
Journalism | TRUE | 2 | 0.5000 |
Mechanics | FALSE | 1 | 0.2000 |
Mechanics | TRUE | 4 | 0.8000 |
Physics | TRUE | 3 | 1.0000 |
ggplot(stu.ag, aes(x=Department, y=Pct, fill=Pass)) +
geom_bar(stat='identity') +
scale_y_continuous(label = scales::percent)
Figure 3.8: ggbarplot2堆疊直方圖
newstu<-stu.ag[stu.ag$Pass==TRUE,]
ggplot(newstu, aes(x=Department, y=Pct, fill=Pass)) +
geom_bar(stat='identity') +
scale_y_continuous(label = scales::percent)
Figure 3.9: ggplot2堆疊直方圖
☛請用
qplot(cut, data = diamonds,
geom = "bar", fill = color, stat="count")
Figure 3.10: qplot直方圖
df<-ggplot2::diamonds
df <- df %>% select(cut, color)
newdf <-reshape2::melt(df, id.vars=c("cut"))
colnames(newdf)<-c("cut", "color.g", "color")
qdf <- summarize(group_by(newdf, cut, color), N=n())
g1 <- ggplot(qdf, aes(x=cut, y=N, fill=color)) +
geom_bar(stat="identity")
g1 + scale_fill_grey(start = 0.2, end = 0.9) +
theme_bw()
Figure 3.11: 灰階原始數目直方圖
ggplot2
的長條圖使用的函數為ggplot(mtcars, aes(x=mpg)) +
geom_histogram(stat='bin', binwidth=1, fill="#a310ec")
Figure 3.12: ggplot2長條圖一
ggplot(mtcars, aes(x=wt, fill=as.factor(am))) +
geom_density(position="identity", alpha=.4)
Figure 3.13: ggplot2長條圖二
☛請畫出
\[\frac{H+2\times Double+3\times Triple+4\times HR}{AB}\]
如果需要表現兩個變數之間的關係,散佈圖可以顯示兩個變數對應的位置。散佈圖用的函數是
ggplot(mtcars, aes(x=wt, y=mpg)) +
geom_point()
Figure 3.14: ggplot2散佈圖一
set.seed(02138)
ggplot(data=data.frame(x=rnorm(100, 1, 1.5),y=rnorm(100,0,1)),
aes(x=x,y=y))+
geom_point(col="#b10a2c", shape=5, size=2)
Figure 3.15: ggplot2散佈圖二
散佈圖可以加上迴歸線表現兩個變數的相關,如圖3.16:
ggplot(mtcars, aes(x=wt, y=mpg)) +
geom_point(col='red') +
geom_smooth(method="lm", se=F, col='blue')
Figure 3.16: ggplot2散佈圖加上迴歸線
ggplot(Orange, aes(x=age, y=circumference)) +
geom_point(aes(col=Tree), size=3) +
geom_smooth(method="loess", se=F) +
theme_bw()
Figure 3.17: 散佈圖加上無母數迴歸線
☛請用散佈圖加迴歸線表示長打率跟三振率(SO/AB)的關係:
兩個變數的散佈圖可表示兩個變數之間的相關,例如圖 3.18:
ggplot(anscombe, aes(x=x1, y=y1)) +
geom_point(size=3)
Figure 3.18: 三個變數散佈圖一
n <- nrow(anscombe)
anscombe$r <- rep(0, n)
anscombe$r[anscombe$x1>8]<-1
sc1<-ggplot(anscombe, aes(x=x1, y=y1, shape=factor(r)))
sc1 + geom_point(aes(color=factor(r)) , size=3)
Figure 3.19: 三個變數散佈圖二
sc2<-ggplot(anscombe, aes(x=x1, y=y1))
sc2 + geom_point(aes(color=x2) , size=3)
Figure 3.20: 三個變數散佈圖三
☛reshape2
裡面的 french_fries 針對油炸與薯條的味道進行實驗。在這筆資料裡有幾個測量薯條口味的變數,請找兩個變數,畫出散佈圖,但是同時要顯示三個實驗組別的分佈。
盒型圖可以表現一個變數或者是一個類別變數對應另一個連續變數的分佈,不過ggplot2
無法直接畫出一個連續變數的盒型圖,所以我們先創造一個只有一個值的變數,然後對應我們要顯示的連續變數,例如圖 3.21:
mtcars <- mutate(mtcars, X=1)
ggplot(mtcars, aes(x=X, y=mpg)) +
geom_boxplot() +
labs(x="",y='mpg') +
stat_summary(fun.y=median, geom="point", shape=16, size=2) +
theme_bw()
Figure 3.21: 盒型圖一
g1 <-ggplot(mtcars, aes(x=cyl, y=hp, group=cyl)) +
geom_boxplot() +
labs(x='Cylinder', y="Horse Power") +
stat_summary(fun.y=median, geom="point", shape=16, size=2) +
theme_bw()
g1
Figure 3.22: 盒型圖二
g1 + scale_x_discrete(limit = c(4, 6, 8),
labels = c("4","6","8")) + theme_bw()
Figure 3.23: 盒型圖三
g1 + scale_y_continuous(limits = c(50,200)) +
scale_x_discrete(limit = c(4, 6, 8),
labels = c("4","6","8")) +
theme_bw()
Figure 3.24: 盒型圖四
☛請分別畫出
過去有所有的圓餅圖,甜甜圈圖可以顯示資料的次數分佈,例如我們想知道ggplot2
裡面的的diamonds
資料中的cut的次數分配,先針對這個變數計算個數:
dt <-group_by(diamonds, cut)
dat = summarize(dt, count=n())
kableExtra::kable_styling(knitr::kable(dat))
cut | count |
---|---|
Fair | 1610 |
Good | 4906 |
Very Good | 12082 |
Premium | 13791 |
Ideal | 21551 |
# Add addition columns, needed for drawing with geom_rect.
dat$fraction = dat$count / sum(dat$count)
dat = dat[order(dat$fraction), ]
dat$ymax = cumsum(dat$fraction)
dat$ymin = c(0, head(dat$ymax, n=-1))
kableExtra::kable_styling(knitr::kable(dat))
cut | count | fraction | ymax | ymin |
---|---|---|---|---|
Fair | 1610 | 0.0298 | 0.0298 | 0.0000 |
Good | 4906 | 0.0910 | 0.1208 | 0.0298 |
Very Good | 12082 | 0.2240 | 0.3448 | 0.1208 |
Premium | 13791 | 0.2557 | 0.6005 | 0.3448 |
Ideal | 21551 | 0.3995 | 1.0000 | 0.6005 |
p1 = ggplot(dat, aes(fill=cut, ymax=ymax, ymin=ymin, xmax=4, xmin=3)) +
geom_rect() +
coord_polar(theta="y") +
xlim(c(0, 4)) +
theme(panel.grid=element_blank()) +
theme(axis.text=element_blank()) +
theme(axis.ticks=element_blank()) +
annotate("text", x = 0, y = 0, label = "Diamond Cut") +
labs(title="") +
xlab("") +
ylab("")
p1
Figure 3.25: 甜甜圈圖一
☛請嘗試把
之前我們用R
的基本指令對airquality
這筆時間序列資料畫折線圖,例如圖 3.26:
ggplot(airquality, aes(x=Day, y=Wind)) +
geom_line()
Figure 3.26: ggplot折線圖一
N <- summarize(airquality, n())
airquality <- mutate(airquality, index=seq_along(1:N[,1]))
ggplot(airquality, aes(x=index, y=Wind)) +
geom_line()
Figure 3.27: ggplot折線圖二
airquality$Date <- paste0(airquality$Month, "/", airquality$Day)
airquality$Date <- as.Date(airquality$Date, format="%m/%d")
ggplot(airquality, aes(x=Date, y=Wind)) +
geom_line()
Figure 3.28: ggplot2折線圖三
airquality.n <- select(airquality, Date, Wind, Temp)
library(reshape2)
airquality.l<-melt(airquality.n, id.vars=c('Date'))
head(airquality.l)
## Date variable value
## 1 2021-05-01 Wind 7.4
## 2 2021-05-02 Wind 8.0
## 3 2021-05-03 Wind 12.6
## 4 2021-05-04 Wind 11.5
## 5 2021-05-05 Wind 14.3
## 6 2021-05-06 Wind 14.9
ggplot(airquality.l, aes(x=Date)) +
geom_line(aes(y=value, col=variable))
Figure 3.29: ggplot2折線圖四
☛請畫出
ggplot2
可以應用在比較迴歸係數的大小以及標準誤的範圍,方便讀者瞭解模型的估計結果。例如我們估計有三個自變數的迴歸模型:
m1 <- with(mtcars, lm(mpg ~ wt + hp + am))
summary(m1)
##
## Call:
## lm(formula = mpg ~ wt + hp + am)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.422 -1.792 -0.379 1.225 5.532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.00288 2.64266 12.87 2.8e-13 ***
## wt -2.87858 0.90497 -3.18 0.00357 **
## hp -0.03748 0.00961 -3.90 0.00055 ***
## am 2.08371 1.37642 1.51 0.14127
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.54 on 28 degrees of freedom
## Multiple R-squared: 0.84, Adjusted R-squared: 0.823
## F-statistic: 49 on 3 and 28 DF, p-value: 2.91e-11
df <- data.frame(coef=coef(m1)[-1], se=coef(summary(m1))[-1,
"Std. Error"] )
dt <- mutate(df, lower=coef-qt(0.975,
nrow(mtcars))*se, upper=coef+qt(0.975,
nrow(mtcars))*se)
s1 <- ggplot() +
geom_segment(data=dt,
mapping=aes(x=row.names(dt), y=lower,
xend=row.names(dt), yend=upper),
size=2, color="blue")
s1
Figure 3.30: ggplot圓點圖一
s1 + geom_point(data=dt,
aes(x=row.names(dt), y=coef), size=4, shape=21, fill="white") +
scale_x_discrete(breaks=c('wt','hp','am'),
labels=c("Weight", "Horse Power", "Auto")) +
xlab("") +
ylab("Estimates") +
coord_flip() +
theme(axis.text.y = element_text(face="bold", color="#993333", size=10))
Figure 3.31: ggplot圓點圖二
V<-c(50, 100, 30, 200)
rev(V)
## [1] 200 30 100 50
☛請把係數名稱跟係數順序根據 3.30 得到的順序排列一遍:
tmp <- dt[order(rev(row.names(dt))),]
s2=ggplot() +
geom_point(data=tmp, aes(x=row.names(tmp),
y=rev(coef)), size=4, shape=21, fill="#e21b33") +
scale_x_discrete(breaks=c("wt","hp","am"),
labels=c( "Auto", "Horse Power","Weight")) +
ylab("") +
xlab("") +
coord_flip() +
theme(axis.text.y = element_text(face="bold", color="#00ABFE", size=14) )
s2+geom_segment(data=dt, mapping=aes(x=row.names(dt), y=rev(lower), xend=row.names(dt), yend=rev(upper)),
size=2, color="#FABEC1")
Figure 3.32: ggplot2係數圖
除了上述的圖形,ggplot2
還有更多有趣的圖形,也允許使用者設定圖形風格、資料點顏色、副標題、圖例的標題等等,請同學上網多多參考他人的例子。例如圖3.33顯示經濟學人雜誌風格的直方圖:
library(ggthemes)
# Number of Cases
dt <- data.frame(Year=c(2013, 2014, 2015, 2016, 2017),
Amount=c(65, 56, 78, 85, 79))
# Economist theme
p1<-ggplot() + theme_economist() + scale_fill_economist() +
geom_bar(aes(x=Year, y=Amount),
data=dt, stat='identity')
#data label
p3 <- p1+ theme(axis.title= element_text(color="blue", size=14, face="bold"),
axis.text = element_text(size=14),
legend.text=element_text(size=18))+
ggtitle("Number of Cases") +
labs(y="N", subtitle="Children, Spouses, Financial, etc.")
p3
Figure 3.33: 主題風繪圖
ggplot(diamonds, aes(cut, price)) +
geom_bar(stat = "summary_bin", fun.y = "mean", aes(fill=cut))
Figure 3.34: 平均值直方圖
lattice
套件用格子(trellis)圖形來產生高品質的資料視覺化,尤其是用在多變量的資料。
Deepayan Sakar在2008年出版「Lattice: Multivariate Data Visualisation in R」,詳細地介紹lattice套件的功能。有興趣的同學可以閱讀。
我們很快地介紹lattice套件的直方圖、長條圖、雙變數直方圖、散佈圖、折線圖、雙變數長條圖、箱型圖、點狀圖功能等等。
我們用
library(lattice)
barchart(starwars$skin_color,col="gray60",
scale=list(cex=0.5, relation="free"),
aspect=.8)
Figure 4.1: lattice直方圖一
lattice
的直方圖是橫的,適合類別多的變數,例如我國面對最重要的問題,或者最常看的網站,如果用橫的直方圖會比較容易呈現。
☛請嘗試對
長條圖可用來描述中央趨勢,例如圖 4.2顯示噴發時間的分佈:
histogram(faithful$eruptions, col='#a01eb2')
Figure 4.2: lattice長條圖一
histogram跟R
基礎指令hist函數得到的圖形相比,多了一個外框。
lattice
並沒有單變數的直方圖,以長條圖代替直方圖。但是比較類別變數與連續變數之間的關係時,可以用直方圖,並且可以控制另一個類別變數。例如圖 4.3顯示品種跟顏色的關係:
library(MASS)
data(crabs)
crabs$sp<-factor(crabs$sp, labels=c("Blue",
"Orange"))
crabs$sex<-factor(crabs$sex, labels=c("Female",
"Male"))
barchart(sp ~ BD|sex, data=crabs,
col=RColorBrewer::brewer.pal(n=4, 'RdGy'),
main=paste0("Crab's Body Depth(mm)"))
Figure 4.3: lattice直方圖二
我們可以直接對一個表格資料畫圖,表示兩個變數之間的關係。例如使用
lattice::dotplot(VADeaths, group=F, type="o")
Figure 4.4: lattice點狀圖一
vad <- as.data.frame.table(VADeaths)
names(vad) <- c("age", "demographic", "deaths")
head(vad, n=3)
## age demographic deaths
## 1 50-54 Rural Male 11.7
## 2 55-59 Rural Male 18.1
## 3 60-64 Rural Male 26.9
dpt<-dotplot(age ~ deaths, vad, group=demographic,
type = "o")
update(dpt, auto.key=list(points=T, lines=T))
Figure 4.5: lattice點狀圖二
在上面的例子中,
我們想知道每一類型的樹的平均樹圍,並且畫圖。首先用
orange.mean <- tapply(Orange$circumference,
Orange$Tree, mean)
dotplot(names(orange.mean) ~ orange.mean,
aspect = .5,
ylab = "Group",
xlab = "Mean Circumference")
Figure 4.6: lattice點狀圖一
dplyr
的o.mean<-Orange %>% group_by(Tree) %>% summarise(mean(circumference))
dotplot(as.factor(Tree) ~ `mean(circumference)`, aspect=0.5, data=o.mean,
ylab = "Group",
xlab = "Mean Circumference")
Figure 4.7: lattice點狀圖二
dplyr
的pipeline,而是要在函數內設定data=。o.mean<-Orange %>% group_by(Tree) %>% summarise(avg=mean(circumference))
dotplot(reorder(rownames(o.mean),-avg) ~ avg, aspect=0.5, data=o.mean,
ylab = "Group",
xlab = "Mean Circumference",
cex=1.5)
Figure 4.8: lattice點狀圖三
兩個連續變數可以用散佈圖表示相對位置,例如圖4.9顯示E與NOx之間的關係:
with(lattice::ethanol, xyplot(NOx ~ E))
Figure 4.9: lattice散佈圖一
with(lattice::ethanol, xyplot(NOx ~ E|C, col='#FF3300'))
Figure 4.10: lattice散佈圖二
在lattice
,類別變數對應連續變數的密度長條圖,可以一次展示在同一個圖上面。用以下指令畫圖4.11:
histogram(~temp|origin, data=nycflights13::weather,
type="density", layout=c(1,3))
Figure 4.11: lattice長條圖
箱型圖也可以表示連續變數在每一個類別上面的分佈,例如圖 4.12:
lattice::bwplot(as.factor(cyl) ~ mpg, data=mtcars)
Figure 4.12: lattice箱型圖
lattice::xyplot(MASS::accdeaths)
Figure 4.13: lattice折線圖一
tsp<-TSstudio::ts_reshape(MASS::accdeaths,
type = 'long')
head(tsp, n=5)
## year month value
## 1 1973 1 9007
## 2 1973 2 8106
## 3 1973 3 8928
## 4 1973 4 9137
## 5 1973 5 10017
xyplot(value ~ month | as.factor(year),
data=tsp, type="l", col="black",
index.cond=list(c(4,5,6,1,2,3)),
scales = list(at=c(1,6, 12),
labels=c("Jan.","Jun.","Dec.")),
strip = strip.custom(bg="gray90"),
ylab="percent")
Figure 4.14: 每年的意外死亡折線圖
ggplot2
,但是提供ggplot2
以外的選擇。
ggpubr
指的是ggplot publication ready。請下載ggpubr
這個套件。詳細介紹可以參考這個網站。
ggpubr
的優點是美化ggplot2
的圖形,指令相當類似,例如圖5.1顯示鄉村與城市的死亡率長條圖,並且以虛線標示變數的平均值:
mor<-lattice::USRegionalMortality
ggpubr::gghistogram(mor, x = "Rate", bins=40,
add = "mean", rug = TRUE,
color = "Status", fill = "Status",
palette = c("#EE0011", "#111100"))
Figure 5.1: ggpubr長條圖
ggpubr
的圖形顯示出觀察值。
p <- ggpubr::ggboxplot(mtcars, x = "cyl", y = "mpg",
color = "cyl", palette =c("#00AFBB", "#E7B800", "#FC4E07"),
add = "jitter", shape = "cyl")
p
Figure 5.2: ggpubr的箱型圖
ggstatsplot
套件結合圖形與統計,直接呈現圖形上面的統計結果。先下載套件:
r install.packages('ggstatsplot')
# for reproducibility
set.seed(02139)
data(ISLR::Default)
# plot
ggstatsplot::ggpiestats(
data = ISLR::Default,
x = student,
y = default,
title = "Student and Default", # title for the plot
legend.title = "student", # title for the legend
caption = substitute(paste(italic("Source"), ": ISLR")),
messages = FALSE
)
Figure 5.3: 卡方檢定圓餅圖
ggstatsplot
可以顯示相關係數:
ggstatsplot::ggscatterstats(
data = alr4::florida,
x = Gore,
y = Bush,
xlab = "Vote for Gore",
ylab = "Vote for Bush",
title = "County-by-county vote for president in Florida in 2000 for Bush, Gore",
caption = substitute(paste(italic("Source"), ": alr4")),
messages = FALSE,
bf.message = F,
results.subtitle = T
)
Figure 5.4: ggstatsplot散佈圖與迴歸係數
還記得一開始的課程目標是畫出類別變數與連續變數之間的關係,並且標示各類別的平均值?請見圖 6.1。
library(alr4)
data(salary)
p <- ggplot(salary, aes(x=rank, y=salary)) +
geom_point() +
theme_classic()
p + ggplot2::stat_summary(fun='mean', colour='red',size=3, geom = "point")
Figure 6.1: ggplot2散佈圖加上平均值
p + stat_summary(fun = mean, colour='#2200ae',size=1.3,
fun.min = function(x) mean(x) - sd(x),
fun.max = function(x) mean(x) + sd(x),
geom = "pointrange") +
stat_summary(fun = mean, geom = "line") +
scale_x_discrete(breaks=c('Asst','Assoc','Prof'),
labels=c('Assistant Professor','Associate Professor', 'Full Professor'))
Figure 6.2: ggplot2散佈圖加上平均值及標準差
MASS::birthwt
)的分佈。
reshape2::tips
這筆資料的tips與total_bill兩個變數的關係,並且標記顧客是否抽煙(smoker)。請記得加上圖例。
ggplot2
繪製直方圖顯示hsb2的race與ses的關係。
flights
這筆資料之中的arr_delay,畫圖顯示每一家航空公司(carrier)的分布情況:
airquality
這筆資料的8,9兩個月份的資料,並且用ggplot2
畫出
library(dplyr)
DT <- tibble(Country=c("Spain", "France", "Germany","Italy", "Sweden"),
Populist=c(52,58,65,50, 73),
Mixed=c(70,72,70,65,81),
Nonpopulist=c(85,84,88,68, 90))
kableExtra::kable_styling(knitr::kable(DT))
Country | Populist | Mixed | Nonpopulist |
---|---|---|---|
Spain | 52 | 70 | 85 |
France | 58 | 72 | 84 |
Germany | 65 | 70 | 88 |
Italy | 50 | 65 | 68 |
Sweden | 73 | 81 | 90 |
ggthemes
套件set.seed(2019)
m=1000; n=580; p1=0.54; p2=0.60; p3=0.70; p4=0.80
res1=rbinom(m, n, p1);res2=rbinom(m, n, p2); res3=rbinom(m,n,p3);
res4=rbinom(m, n, p4)
dplyr
指令以及
最後更新日期 03/26/2021