Grammar of graphics
In-class Exercises:
1. Find out what each code chunk (indicated by ‘##’) in the R script does and provide comments.
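Both chunks in this exercise draw a single point for the first row of `women`. A minimal base R sketch of why the two row selections agree; the names `keep`, `by_name` and `by_position` are illustrative and not part of the original script.

```r
# women ships with base R (datasets package)
# lattice's subset= argument keeps rows where the condition is TRUE;
# row.names(women) is a character vector, so the 1 is compared as "1"
keep <- row.names(women) == 1

by_name     <- women[keep, ]   # the row that subset= selects for xyplot()
by_position <- women[1, ]      # the row handed to ggplot()

print(by_name)
print(by_position)
```

Either way the plot receives one observation, so each call draws exactly one point.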
# Plot the women data with lattice: weight on the y-axis, height on the x-axis, drawing one point for the row named 1
lattice::xyplot(weight ~ height,
data=women,
subset=row.names(women)==1, type='p')
# Plot the women data with ggplot2: weight on the y-axis, height on the x-axis, drawing one point for the first row
library(ggplot2)
ggplot(data=women[1,], aes(height, weight))+
geom_point()
2. The data set is concerned with grade 8 pupils (age about 11 years) in elementary schools in the Netherlands. After deleting pupils with missing values, the number of pupils is 2,287 and the number of schools is 131. Class size ranges from 4 to 35. The response variables are the scores on a language test and on an arithmetic test. The research interest is in how the two test scores depend on the pupil’s intelligence (verbal IQ) and on the number of pupils in a school class.
The class size is categorized into small, medium, and large with roughly equal numbers of observations in each category. The verbal IQ is categorized into low, middle and high with roughly equal numbers of observations in each category. Reproduce the plot below.
Column 1: School ID
Column 2: Pupil ID
Column 3: Verbal IQ score
Column 4: The number of pupils in a class
Column 5: Language test score
Column 6: Arithmetic test score
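The categorization described above can be sketched as follows. This is an illustration on simulated class sizes, not the exercise data; `size_demo`, `brks` and `sizef_demo` are made-up names, and the break points use exact thirds as one possible choice.

```r
set.seed(1)
# Simulated class sizes between 4 and 35 (hypothetical stand-in for the real column)
size_demo <- sample(4:35, 300, replace = TRUE)

# Tercile break points: minimum, 1/3 and 2/3 quantiles, maximum
brks <- quantile(size_demo, probs = c(0, 1/3, 2/3, 1))

# include.lowest = TRUE keeps the minimum value inside the first interval
sizef_demo <- cut(size_demo, breaks = brks,
                  labels = c("Small", "Medium", "Large"),
                  ordered_result = TRUE, include.lowest = TRUE)

table(sizef_demo)
```

Because class size is discrete, ties at the break points mean the three groups are only roughly equal in size.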
# Read the file, load the packages, and look at the data
dta <- read.table("C:/Users/boss/Desktop/data_management/langMathDutch.txt", header = TRUE)
head(dta)
  school pupil IQV size lang arith
1 1 17001 15.0 29 46 24
2 1 17002 14.5 29 45 19
3 1 17003 9.5 29 33 24
4 1 17004 11.0 29 46 26
5 1 17005 8.0 29 20 9
6 1 17006 9.5 29 30 13
[1] "school" "pupil" "IQV" "size" "lang" "arith"
library(ggplot2)
library(tidyverse)
library(lattice)
library(gridExtra)
# Add new columns that cut size into "Small", "Medium", "Large" and IQV into "Low", "Middle", "High", as ordered factors
dta <- dta %>% mutate(sizef=cut(size,
breaks=quantile(size,
probs=c(0, .33, .66, 1)),
label=c("Small", "Medium", "Large"),
ordered=T,
include.lowest=T))
dta <- dta %>% mutate(IQV_f=cut(IQV,
breaks=quantile(IQV,
probs=c(0, .33,.66, 1)),
label=c("Low", "Middle", "High"),
ordered=T,
include.lowest=T))
# Check that the new columns were added
head(dta)
  school pupil IQV size lang arith sizef IQV_f
1 1 17001 15.0 29 46 24 Large High
2 1 17002 14.5 29 45 19 Large High
3 1 17003 9.5 29 33 24 Large Low
4 1 17004 11.0 29 46 26 Large Low
5 1 17005 8.0 29 20 9 Large Low
6 1 17006 9.5 29 30 13 Large Low
# Combine the subplots into one faceted figure for comparison
p1 <- ggplot(dta,
aes(lang, arith)) +
stat_smooth(method="lm",
formula=y ~ x) +
geom_point(shape=20) +
facet_grid(sizef ~ IQV_f) +
labs(x="Language score", y="Arithmetic score") +
theme_bw()
p1
3. Use the USPersonalExpenditure{datasets} for this problem. This data set consists of United States personal expenditures (in billions of dollars) in the categories food and tobacco, household operation, medical and health, personal care, and private education for the years 1940, 1945, 1950, 1955 and 1960.
Plot the US personal expenditure data in the style of the third plot on the “Time Use” case study in the course web page. You might want to transform the dollar amounts to log base 10 unit first.
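A sketch of the preparation step this exercise suggests: reshape the matrix to one row per category and year, then take log base 10. It uses base R only; `long` and `log_exp` are illustrative names, and `as.data.frame(as.table(...))` is just one of several ways to reshape a matrix.

```r
# USPersonalExpenditure is a 5 x 5 numeric matrix in the datasets package
long <- as.data.frame(as.table(USPersonalExpenditure),
                      stringsAsFactors = FALSE)
names(long) <- c("category", "year", "expenditure")

# log base 10 of the dollar amounts (billions of dollars)
long$log_exp <- log10(long$expenditure)

head(long)
```

On this scale a value of 1 corresponds to 10 billion dollars, and negative values are amounts below 1 billion.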
# Load packages
library(ggplot2)
library(tidyverse)
library(lattice)
library(grid)
library(gridExtra)
# Load the data
require(stats)
USPersonalExpenditure
  1940 1945 1950 1955 1960
Food and Tobacco 22.200 44.500 59.60 73.2 86.80
Household Operation 10.500 15.500 29.00 36.5 46.20
Medical and Health 3.530 5.760 9.71 14.0 21.10
Personal Care 1.040 1.980 2.45 3.4 5.40
Private Education 0.341 0.974 1.80 2.6 3.64
num [1:5, 1:5] 22.2 10.5 3.53 1.04 0.341 44.5 15.5 5.76 1.98 0.974 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:5] "Food and Tobacco" "Household Operation" "Medical and Health" "Personal Care" ...
..$ : chr [1:5] "1940" "1945" "1950" "1955" ...
# Copy the matrix, then rename its columns and rows
dta <- USPersonalExpenditure
colnames(dta) <- c("1940y","1945y","1950y","1955y","1960y")
rownames(dta) <- c("F&T","HO","M&H","PC","PE")
# Look at the data again
head(dta)
  1940y 1945y 1950y 1955y 1960y
F&T 22.200 44.500 59.60 73.2 86.80
HO 10.500 15.500 29.00 36.5 46.20
M&H 3.530 5.760 9.71 14.0 21.10
PC 1.040 1.980 2.45 3.4 5.40
PE 0.341 0.974 1.80 2.6 3.64
Var1 Var2 value
1 F&T 1940y 22.200
2 HO 1940y 10.500
3 M&H 1940y 3.530
4 PC 1940y 1.040
5 PE 1940y 0.341
6 F&T 1945y 44.500
# Rename the columns
colnames(dta1) <- c("activity", "year", "expenditure")
# Take log10 of expenditure and round to 2 decimal places
logexpenditure <- round(log10(dta1$expenditure),2)
head(logexpenditure)
[1] 1.35 1.02 0.55 0.02 -0.47 1.65
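The plotting step in this solution calls `qplot()`, which is deprecated in recent ggplot2 releases (3.4.0 and later). A hedged sketch of the same kind of plot written with `ggplot()`; the data frame `long` is rebuilt here so the block runs on its own, and its column names are assumptions rather than the names used in the script.

```r
library(ggplot2)

# Rebuild a long-format table: one row per category and year
long <- as.data.frame(as.table(USPersonalExpenditure))
names(long) <- c("category", "year", "expenditure")
long$log_exp <- log10(long$expenditure)

p <- ggplot(long, aes(x = log_exp, y = category)) +
  geom_point() +
  # horizontal segment from 0 to each point
  geom_segment(aes(xend = 0, yend = category)) +
  geom_vline(xintercept = 0, colour = "grey50") +
  facet_wrap(~ year, nrow = 1)
p
```

Putting the transformed value inside the data frame, rather than in a separate vector, keeps the plot data and the aesthetics in one place.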
# Draw the plot
qplot(logexpenditure, activity, data = dta1) +
geom_segment(aes(xend = 0, yend = activity)) +
geom_vline(xintercept = 0, colour = "grey50") +
facet_wrap(~ year, nrow = 1)
4. A sample of 158 children with autism spectrum disorder was recruited. Social development was assessed using the Vineland Adaptive Behavior Interview survey form, a parent-reported measure of socialization. It is a combined score that included assessment of interpersonal relationships, play/leisure time activities, and coping skills. Initial language development was assessed using the Sequenced Inventory of Communication Development (SICD) scale. These assessments were repeated on these children when they were 3, 5, 9, and 13 years of age.
Source: West, B.T., Welch, K.B., & Galecki, A.T. (2002). Linear Mixed Models: Practical Guide Using Statistical Software. p. 220-271.
Data: autism{WWGbook}
Column 1: Age (in years)
Column 2: Vineland Socialization Age Equivalent score
Column 3: Sequenced Inventory of Communication Development Expressive Group (1 = Low, 2 = Medium, 3 = High)
Column 4: Child ID
Replicate the two plots above using ggplot2.
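The second plot in this solution is drawn from group means with standard-error bars. A base R sketch of that summary on a tiny made-up data set; `toy`, its values, and the helper `se()` are hypothetical and only illustrate the formula SE = sd / sqrt(n).

```r
# Hypothetical scores for two groups at two time points (not the autism data)
toy <- data.frame(
  group = rep(c("L", "H"), each = 6),
  age2  = rep(c(0, 1), times = 6),
  score = c(5, 9, 7, 11, 6, 10, 12, 20, 14, 22, 13, 21)
)

# Standard error of the mean
se <- function(x) sd(x) / sqrt(length(x))

# One mean and one SE per group-by-age cell
means <- aggregate(score ~ group + age2, data = toy, FUN = mean)
ses   <- aggregate(score ~ group + age2, data = toy, FUN = se)

summary_tbl <- merge(means, ses, by = c("group", "age2"),
                     suffixes = c("_mean", "_se"))
summary_tbl
```

The error bars then run from `score_mean - score_se` to `score_mean + score_se` for each cell.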
# Load packages
library(ggplot2)
library(tidyverse)
library(lattice)
library(grid)
library(gridExtra)
library(hrbrthemes)
# Read the data
pacman::p_load(WWGbook)
data("autism", package="WWGbook")
dta <- WWGbook::autism
# Look at the data
head(dta)
  age vsae sicdegp childid
1 2 6 3 1
2 3 7 3 1
3 5 18 3 1
4 9 25 3 1
5 13 27 3 1
6 2 17 3 3
# Recode sicdegp as a factor with levels L, M, H, in that order
dta$sic <- factor(dta$sicdegp, levels = c(1,2,3), labels = c("L", "M", "H"))
# Look at the data again
head(dta)
  age vsae sicdegp childid sic
1 2 6 3 1 H
2 3 7 3 1 H
3 5 18 3 1 H
4 9 25 3 1 H
5 13 27 3 1 H
6 2 17 3 3 H
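A small sketch of what `factor()` with `levels` and `labels` does in the recoding step above, using a made-up vector `codes`:

```r
codes <- c(3, 1, 2, 3, 1)   # hypothetical group codes

# levels fixes the order 1, 2, 3; labels renames the levels in that same order
grp <- factor(codes, levels = c(1, 2, 3), labels = c("L", "M", "H"))

levels(grp)
as.character(grp)
table(grp)
```

The level order is what ggplot2 uses to arrange facets and legend entries, so L, M, H appear in that sequence rather than alphabetically.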
# Drop rows with missing values
dta1<-na.omit(dta)
# Convert age to the scale used for the plot (log10, rounded to 2 decimals)
logage <- round(log10(dta1$age),2)
# Draw the plot
p0 <- ggplot(data=dta1,
aes(x=logage,
y=vsae))+
geom_point()+
geom_path(color="grey")+
labs(x='Age (in years, centered)',
y='VSAE score',
title = NULL)+
stat_smooth(data=dta1,
aes(color=childid),
formula=y ~ x,
method='lm',
se=FALSE)+
facet_grid(. ~ sic)
p0
# For the second plot, first subtract 2 from age, as in the target figure
p1 <- dta1 %>% mutate(age2=age-2) %>%
group_by(sic, age2) %>%
dplyr::summarise(n=n(), m_p=mean(vsae),se_p=sd(vsae)/sqrt(n)) %>%
ggplot() +
aes(age2, m_p, group=sic, shape=sic) +
# Then add the plot layers
geom_errorbar(aes(ymin=m_p - se_p,
ymax=m_p + se_p),
width=.2, size=.3,group="sic") +
geom_line(aes(linetype=sic), show.legend=T)+
geom_point(size=rel(3),show.legend=T) +
scale_shape(guide=guide_legend(title=NULL)) +
labs(x="AGE(in years-2)", y="VSAE score") +
theme_ipsum() +
theme(legend.position=c(.1, .8))
p1
5. Use the diabetes dataset to generate a plot similar to the one below and interpret the plot.
# Load packages
library(ggplot2)
library(tidyverse)
library(lattice)
library(grid)
library(gridExtra)
library(hrbrthemes)
library(reshape2)
library(ggalluvial)
# Read the data
dta <- read.csv("C:/Users/boss/Desktop/data_management/diabetes_mell.csv")
# Look at the data
head(dta)
  SEQN RIAGENDR RIDRETH1 DIQ010 BMXBMI gender race diabetes BMI
1 51624 1 3 2 32.22 Males White No Overweight
2 51626 1 4 2 22.00 Males Black No Normal weight
3 51627 1 4 2 18.22 Males Black No Normal weight
4 51628 2 4 1 42.39 Females Black Yes Overweight
5 51629 1 1 2 32.61 Males Hispanic No Overweight
6 51630 2 3 2 30.57 Females White No Overweight
race gender diabetes BMI Freq
1 Black Females No Normal weight 347
2 Hispanic Females No Normal weight 712
3 White Females No Normal weight 998
4 Black Males No Normal weight 429
5 Hispanic Males No Normal weight 706
6 White Males No Normal weight 873
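The source does not show how the frequency table `dta_v3` was produced. One way such a table can be built, sketched on a few made-up rows: cross-tabulate the categorical columns and convert the result to a data frame. `toy` and `freq` are hypothetical names, and this is not necessarily the method used here.

```r
# Hypothetical individuals (not the diabetes data)
toy <- data.frame(
  race     = c("White", "Black", "White", "Hispanic", "White"),
  gender   = c("Males", "Females", "Males", "Males", "Females"),
  diabetes = c("No", "Yes", "No", "No", "No"),
  BMI      = c("Overweight", "Overweight", "Overweight",
               "Normal weight", "Normal weight")
)

# table() counts every combination; as.data.frame() adds a Freq column
freq <- as.data.frame(table(toy))

# Keep only combinations that actually occur
freq <- freq[freq$Freq > 0, ]
freq
```

A table in this shape, one row per combination with a `Freq` column, is what the `y = Freq` aesthetic in an alluvial plot expects.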
p <- ggplot(dta_v3,
aes(axis1=race,
axis2=gender,
axis3=diabetes,
y=Freq)) +
scale_x_discrete(limits=c("race",
"gender",
"diabetes"),
expand=c(.1, .01)) +
labs(x='',
y='No.individuals') +
geom_alluvium(aes(fill=BMI)) +
geom_stratum() +
geom_text(stat="stratum",
aes(label=after_stat(stratum))) +
scale_fill_manual(values=c('skyblue','hotpink'))+
theme_minimal() +
ggtitle("Diabetes in overall population in US 2009-2010") +
labs(subtitle = "stratified by race, gender and diabetes mellitus") +
theme(legend.position = "bottom")
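# Apart from the colors, the plot roughly matches the target
# The plot suggests that, across races and for both men and women, few overweight people have diabetes
# However, some White individuals have diabetes even though they are not overweight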
p
6. Find out what each code chunk (indicated by ‘##’) in the R script does and provide comments.
# Load ggplot2 and open its help page
# library(ggplot2)
# ?ggplot2
# install.packages('formatR')
# library(formatR)
# Install and load gapminder
# install.packages('gapminder')
# library(gapminder)
# Load the gapminder data and inspect its structure
# data(gapminder)
# str(gapminder)
# Store the data as gap
# gap <- gapminder
# Draw the empty plot frame (axes only, no layers yet)
# ggplot(data = gap, aes(x = lifeExp))
# Add a histogram of the data
# ggplot(data = gap, aes(x = lifeExp)) + geom_histogram()
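# Fill the histogram bars blue with black outlines, add a title and x- and y-axis labels, and apply theme_classic() for a plain look without gridlines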
# ggplot(data = gap, aes(x = lifeExp)) + geom_histogram(fill = 'blue', color =
# 'black', bins = 10) + ggtitle('Life expectancy for the gap dataset') +
# xlab('Life expectancy (years)') + ylab('Frequency') + theme_classic()
# Switch to boxplots and add a title and x- and y-axis labels; fill = continent colors the boxes and theme_minimal() applies a minimal theme
# ggplot(data = gap, aes(x = continent, y = lifeExp, fill = continent)) +
# geom_boxplot() + ggtitle('Boxplots for lifeExp by continent') +
# xlab('Continent') + ylab('Life expectancy (years)') + theme_minimal() #+
# guides(fill = FALSE)
# What happens if you un-hashtag `guides(fill = FALSE)` and the plus sign in
# lines 68 and 69 above?
# The color legend for the continents on the right no longer appears
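In current ggplot2 releases, `guides(fill = FALSE)` is deprecated in favour of `guides(fill = "none")`. A sketch of the same idea on the built-in `iris` data, which is an assumption made here so that the block does not need the gapminder package; `p_box` is an illustrative name.

```r
library(ggplot2)

p_box <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  theme_minimal() +
  guides(fill = "none")   # suppress the fill legend
p_box
```

The boxes keep their fill colors; only the legend is dropped, which is reasonable here because the x-axis already names each group.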
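# Switch to a scatterplot, add a title and x- and y-axis labels, use theme_classic() to style the non-data elements, and use the second theme() call to adjust the legend position, text sizes, and so on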
# ggplot(data = gap, aes(x = lifeExp, y = gdpPercap, color = continent, shape =
# continent)) + geom_point(size = 5, alpha = 0.5) + ggtitle('Scatterplot of life
# expectancy by gdpPercap') + theme_classic() + xlab('Life expectancy (years)') +
# ylab('gdpPercap (USD)') + theme(legend.position = 'top', plot.title =
# element_text(hjust = 0.5, size = 20), legend.title = element_text(size = 10),
# legend.text = element_text(size = 5), axis.text.x = element_text(angle = 45,
# hjust = 1))
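# In the ggplot code above, what are the arguments inside of our second
# 'theme' argument doing?
# From what I found online, theme() adjusts the non-data elements of a plot: here legend.position = 'top' moves the legend above the plot, plot.title centers the title (hjust = 0.5) at size 20, legend.title and legend.text set the legend font sizes, and axis.text.x rotates the x-axis labels by 45 degrees. I tried deleting that call and rerunning, and did not notice any difference.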
# Also, this R Markdown file would not knit, so every line was commented out with #
# The End
Statistical graphics
Exercises:
1. The distribution of personal disposable income in Taiwan in 2015 has a story to tell.
Revise the following plot to enhance that message.
Income Count
1 160,000 and under 807160
2 160,000 to 179,999 301650
3 180,000 to 199,999 313992
4 200,000 to 219,999 329290
5 220,000 to 239,999 369583
6 240,000 to 259,999 452671
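When income brackets are stored as text, ggplot2 orders the bars alphabetically rather than by income. A sketch of fixing the order with a factor, using only the six rows printed above; `inc` and `p_inc` are illustrative names, and the full data set has more brackets than are shown here.

```r
library(ggplot2)

inc <- data.frame(
  Income = c("160,000 and under", "160,000 to 179,999", "180,000 to 199,999",
             "200,000 to 219,999", "220,000 to 239,999", "240,000 to 259,999"),
  Count  = c(807160, 301650, 313992, 329290, 369583, 452671)
)

# Keep the brackets in the order they appear in the data
inc$Income <- factor(inc$Income, levels = unique(inc$Income))

p_inc <- ggplot(inc, aes(x = Income, y = Count)) +
  geom_col() +
  coord_flip() +
  labs(y = "Number of persons")
p_inc
```

With the brackets in income order, the shape of the distribution can be read directly along the axis.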
names(dta) <- c("Income", "Count")
p <- ggplot(data = dta, aes(x = Income, y = Count)) + geom_col(aes(x = Income, y = Count)) +
coord_flip() + labs(title = "Distribution of disposable personal income in Taiwan in 2015",
y = "Number of persons")
p
2. Comment on how the graphs presented in this link violate the principles for effective graphics and how you would revise them.
3. Sarah Leo at the Economist magazine published a data set to accompany the story about how scientific publishing is dominated by men. The plot on the left panel below is the original graph that appeared in the article.
Help her find a better plot.