getwd() # check the working directory
## [1] "/Users/xrb/Desktop/R语言学习/UCLA网站学习/R Graphics_ Introduction to ggplot2 (1)_files"
Note: this document records the overall content of the Introduction to ggplot2 seminar from the UCLA website.
First, load the R packages used in this tutorial:
library(pacman)
p_load(ggplot2,MASS,tidyverse)
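Basic elements of the plotting grammar:
The basic syntax of the ggplot2 package is:
ggplot() # note: the function is ggplot(); do not confuse it with the package name "ggplot2"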
ggplot(dataset, aes(x=xvar, y=yvar)) + geom_function()
Make a scatter plot:
# scatter plot of volume vs sales
ggplot(txhousing, aes(x=volume, y=sales)) +
geom_point()
## Warning: Removed 568 rows containing missing values (`geom_point()`).
Without the geom_*() part, only the x and y axes are drawn:
ggplot(txhousing, aes(x=volume, y=sales))
Layers can be added one at a time with "+", including geoms, stats, scales, faceting, and themes.
Enrich the plot above by adding a rug layer:
ggplot(txhousing, aes(x=volume, y=sales)) + # x=volume and y=sales inherited by all layers
geom_point() +
geom_rug(aes(color=median)) # color will only apply to the rug plot because not specified in ggplot()
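One way to see that each "+" adds a layer is to store the plot in an object and inspect it. The following is a minimal sketch; the object names p and n_layers are chosen here for illustration only.

```r
library(ggplot2)

# Store the plot in an object instead of printing it
p <- ggplot(txhousing, aes(x = volume, y = sales)) +
  geom_point() +
  geom_rug(aes(color = median))

# Each geom added with `+` becomes one layer of the plot
n_layers <- length(p$layers)
n_layers
```

Printing p (or simply typing p at the console) draws the plot with both layers.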
aes() arguments. Commonly used aesthetics:
x: positioning along x-axis
y: positioning along y-axis
color: color of objects; for 2-d objects, the color of the object’s outline (compare to fill below)
fill: fill color of objects
linetype: how lines should be drawn (solid, dashed, dotted, etc.)
shape: shape of markers in scatter plots
size: how large objects appear
alpha: transparency of objects (value between 0, transparent, and 1, opaque; the inverse of how many stacked objects it takes to be opaque)
Add some aesthetics to improve the scatter plot we just produced:
# mapping color to median inside of aes()
ggplot(txhousing, aes(x=volume, y=sales)) +
geom_point(aes(color=median))
## Warning: Removed 568 rows containing missing values (`geom_point()`).
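The difference between mapping an aesthetic inside aes() and setting it outside aes() can be checked on the built plot data. This is a rough sketch, assuming the same txhousing data; the names p_map and p_set are illustrative only.

```r
library(ggplot2)

# Mapping: color varies with the median variable and produces a legend
p_map <- ggplot(txhousing, aes(x = volume, y = sales)) +
  geom_point(aes(color = median))

# Setting: every point gets the same fixed color and no legend is drawn
p_set <- ggplot(txhousing, aes(x = volume, y = sales)) +
  geom_point(color = "steelblue")

# layer_data() returns the data as drawn, including the final colours
set_colours <- unique(layer_data(p_set)$colour)
map_colours <- unique(layer_data(p_map)$colour)
set_colours
length(map_colours)
```

With the set color there is a single colour value for all points, whereas the mapped version yields many distinct colours.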
# load an example data set
data(Sitka)
This data set describes the growth of trees over time, some of which were grown in ozone-enriched chambers. Variables:
size: numeric, log of size (height times diameter squared)
Time: numeric, time of measurement (days since January 1, 1988)
tree: integer, tree id
treat: factor, treatment group, 2 levels=“control” and “ozone”
Exercise:
A. Create a scatter plot of Time vs size to view the growth of trees over time.
B. Color the scatter plot points by the variable treat.
C. Add an additional geom_smooth() (loess) layer to the graph.
D. SET the color of the loess smooth to “green” rather than have it colored by treat. Why is there only one smoothed curve now?
# color and fill are mapped to treat in ggplot(), so the smooth is grouped by treat
# note: method = lm fits a straight line; omit method to get the default loess smooth asked for in C
ggplot(Sitka,aes(x = Time,y = size,color = treat,fill = treat)) + geom_point() + geom_smooth(method = lm)
## `geom_smooth()` using formula = 'y ~ x'
# color is SET to "green" inside geom_smooth(), so the smooth is no longer grouped by treat
ggplot(Sitka,aes(x = Time,y = size,color = treat)) + geom_point() + geom_smooth(method = lm,col = "green")
## `geom_smooth()` using formula = 'y ~ x'
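The answer to part D can be checked by counting the groups that the smooth layer was computed for. The sketch below assumes the same linear smooth as the solution; the object names are illustrative.

```r
library(ggplot2)
library(MASS)  # provides the Sitka data set

base <- ggplot(Sitka, aes(x = Time, y = size, color = treat)) +
  geom_point()

# Smooth inherits color = treat, so it is fitted separately per treatment
p_mapped <- base + geom_smooth(method = "lm", formula = y ~ x)

# Setting color in the layer overrides the mapping, which removes the grouping
p_set <- base + geom_smooth(method = "lm", formula = y ~ x, color = "green")

# Layer 2 is the smooth; count the distinct groups in its computed data
groups_mapped <- length(unique(layer_data(p_mapped, 2)$group))
groups_set <- length(unique(layer_data(p_set, 2)$group))
c(mapped = groups_mapped, set = groups_set)
```

Because the set color replaces the only discrete mapping on that layer, ggplot2 has nothing left to group by and fits one curve to all trees.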
Geom functions draw different geometric shapes for a data set.
Common geoms:
geom_bar(): bars with bases on the x-axis
geom_boxplot(): boxes-and-whiskers
geom_errorbar(): T-shaped error bars
geom_density(): density plots
geom_histogram(): histogram
geom_line(): lines
geom_point(): points (scatterplot)
geom_ribbon(): bands spanning y-values across a range of x-values
geom_smooth(): smoothed conditional means (e.g. loess smooth)
geom_text(): text
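Drawing common plot types:
1. Histogram: distribution of a continuous variable
ggplot(txhousing, aes(x=median)) +
geom_histogram(color="steelblue",fill="lightblue") # SET colors for the whole geom: color is the outline color, fill is the fill color
2. Density plot: a smoothed histogram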
ggplot(txhousing, aes(x=median)) +
geom_density()
## Warning: Removed 616 rows containing non-finite values (`stat_density()`).
Density plots can be drawn by group:
ggplot(txhousing, aes(x=median,color = factor(month))) +
# month is numeric, so convert it with factor() before using it as a grouping variable (similar to Stata's encode command)
geom_density()
## Warning: Removed 616 rows containing non-finite values (`stat_density()`).
3. Box plot: shows the distribution of the y variable at each value of the x variable
ggplot(txhousing, aes(x=factor(year), y=median)) +
geom_boxplot() # distribution of median house prices in each year
## Warning: Removed 616 rows containing non-finite values (`stat_boxplot()`).
4. Bar plot: counts the frequency of a categorical variable
ggplot(diamonds, aes(x=cut, fill=clarity)) +
geom_bar()
5. Scatter plot: shows how two variables vary together
ggplot(txhousing, aes(x=volume, y=sales,
color=median, alpha=listings, size=inventory)) + # map more variables to enrich the plot
geom_point()
## Warning: Removed 1468 rows containing missing values (`geom_point()`).
6. Line plot
ggplot(txhousing, aes(x=date, y=sales, color=city)) +
geom_line()
## Warning: Removed 430 rows containing missing values (`geom_line()`).
Try visualizing the trend of the provincial digital inclusive finance index in China:
library(haven)
merge6 <- read_dta("~/Desktop/论文/小论文-数字经济与农户收入/数据/目前可用的/merge6.dta")
ggplot(merge6, aes(x=year,y=index_aggregate,color=prov_name_eng)) +
geom_line()
## Warning: Removed 434 rows containing missing values (`geom_line()`).
We will be using the Sitka data set again for this exercise.
A. Using 2 different geoms, compare the distribution of size between the two levels of treat. Use a different color for each distribution.
B. Use a bar plot to visualize the crosstabulation of Time and treat. Put Time on the x-axis.
C. Create a line graph of size over Time, with separate lines by tree and lines colored by treat.
D. Imagine you plan to submit your line graph of size over time by tree to a journal that does not print color graphs. How else can you distinguish between the 2 treatments on the graph?
Solutions:
ggplot(Sitka,aes(x=size,fill=treat)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(Sitka,aes(x=size,color=treat)) +
geom_density()
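# rows = treat, columns = Time, so that Time is on the x-axis and each bar is stacked by treat
tab = table(Sitka$treat,Sitka$Time)
barplot(tab,
xlab = "Time",
ylab = "Frequency",
col = c("green","blue"),
legend.text = TRUE)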
ggplot(Sitka,aes(x=Time,y=size,group=tree))+
geom_line(aes(color=treat))
D. When a journal requires black-and-white figures, use linetype to distinguish the groups:
ggplot(Sitka,aes(x=Time,y=size,group=tree))+
geom_line(aes(linetype=treat))
---
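Stats, Scales, Coordinate Systems, and Faceting
Stat functions statistically transform the data, usually as some form of summary, such as a mean, a standard deviation, or a confidence interval.
Each stat function is associated with a default geom, so no geom is needed to render shapes.
# using the txhousing data set, summarize sales (y) for each year (x)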
ggplot(txhousing,aes(x=year,y=sales))+
stat_summary()
## Warning: Removed 568 rows containing non-finite values (`stat_summary()`).
## No summary function supplied, defaulting to `mean_se()`
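The default summary can be replaced by naming the summary function and the geom explicitly. This is a small sketch, assuming a ggplot2 version in which stat_summary() has the fun argument (3.3.0 or later); the object names are illustrative.

```r
library(ggplot2)

# Mean of sales per year, drawn as a line instead of the default pointrange
p_mean <- ggplot(txhousing, aes(x = year, y = sales)) +
  stat_summary(fun = mean, geom = "line")

# The computed layer contains one row per year
summary_rows <- layer_data(p_mean)
nrow(summary_rows)
```

Rows with missing sales are dropped before the mean is taken, which is why the same "non-finite values" warning appears.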
Create a new plot where x is mapped to Time and y is mapped to size. Then, add a stat_summary() layer.
ggplot(Sitka,aes(x=Time,y=size))+
stat_summary()
## No summary function supplied, defaulting to `mean_se()`
Here is a color scale that ggplot2 chooses for us:
ggplot(diamonds, aes(x=carat, y=price, color=cut)) + # color is mapped to cut
geom_point()
Use scale_color_manual() to set the colors by hand through its values= argument:
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
scale_color_manual(values=c("red", "yellow", "green", "blue", "violet"))
Scale functions for the y-axis: add scale_y_continuous() to adjust the y-axis breaks.
Before:
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
scale_color_manual(values=c("red", "yellow", "green", "blue", "violet"))
At this point the y-axis tick marks are at 0, 5000, 10000, and 15000.
After:
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) +
scale_y_continuous(breaks=c(0,2500,5000,7500,10000,12500,15000,17500)) # set the y-axis breaks
Next, we change the unit on the y-axis of the plot above from dollars to thousands of dollars: give each break a matching label, then retitle the y-axis price(thousands of dollars).
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) +
scale_y_continuous(breaks=c(0,2500,5000,7500,10000,12500,15000,17500),
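labels = c(0,1.5,5,7.5,10,12.5,15,17.5), # each label is paired by position with the break above; 1.5 is deliberately paired with 2500 to show that labels are free text (2.5 would be the accurate value)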
name = "price(thousands of dollars)")
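Typing the labels by hand makes it easy to introduce a mismatch between a break and its label. As an alternative sketch, labels= also accepts a function that is applied to the breaks; to_thousands is a hypothetical helper defined here, not part of ggplot2.

```r
library(ggplot2)

# Hypothetical helper: convert dollars to thousands of dollars
to_thousands <- function(x) x / 1000

p_scaled <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  scale_y_continuous(breaks = seq(0, 17500, by = 2500),
                     labels = to_thousands,  # computed from the breaks
                     name = "price(thousands of dollars)")

# The labels that will be shown for the first three breaks
to_thousands(c(0, 2500, 5000))
```

Because the labels are derived from the breaks, changing the breaks later keeps the two in step.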
Scale functions for the x-axis: restrict the range of x values and change the axis titles:
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
xlim(c(0,3))+ # keep only carat values between 0 and 3
labs(x="CARAT", y="PRICE", color="Diamond_CUT", title="CARAT vs PRICE by CUT") # change the x-axis, y-axis, legend, and plot titles
## Warning: Removed 32 rows containing missing values (`geom_point()`).
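The warning above appears because xlim() discards the rows that fall outside the limits. If the goal is only to zoom in, coord_cartesian() is an alternative that keeps all rows; the sketch below compares the two, with illustrative object names.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point()

# xlim() turns points outside the range into missing values, which are dropped
p_lim <- p + xlim(c(0, 3))

# coord_cartesian() only changes the visible window; no rows are removed
p_zoom <- p + coord_cartesian(xlim = c(0, 3))

# Number of diamonds heavier than 3 carats, i.e. the rows xlim() removes
n_outside <- sum(diamonds$carat > 3)
n_outside

# All x values survive in the zoomed version
zoom_data <- layer_data(p_zoom)
anyNA(zoom_data$x)
```

The distinction matters most when a stat is involved, since a smooth or box plot computed after xlim() no longer sees the dropped rows.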
All of the plots above include a legend by default, which acts as the guide for a scale; it can be removed with guides().
Guides for each scale can be set scale-by-scale with the guide argument, or en masse with guides().
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
labs(x="CARAT", y="PRICE",title="CARAT vs PRICE by CUT")+
guides(color="none") # remove the legend for the color scale mapped to cut
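The coordinate system defines the plane on which objects are positioned in the plot; most plots use the Cartesian coordinate system.
Faceting functions split the plot into small panels, one for each subset of the faceting variable. The functions covered here are facet_wrap() and facet_grid().
facet_wrap():
ggplot(diamonds, aes(x=carat, y=price)) +
geom_point() +
facet_wrap(~cut) # one panel per level of cut, wrapped into rows and columns
facet_grid(): lets you specify directly which variables split the plot along rows and columns. Put the row variable before ~ and the column variable after ~.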
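ggplot(diamonds, aes(x=carat, y=price)) +
geom_point(color = "steelblue",shape = 17,size = 0.5) + # set the color, shape, and size of the points
facet_grid(clarity~cut) # rows split by clarity, columns split by cut
The plot above uses facet_grid(clarity~cut) to show the scatter of price against carat for each subset of diamonds, with one row per clarity level and one column per cut level.
That covers the main uses of the facet functions; for more detailed facet syntax and arguments, see Facets for ggplot2 in R.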
Use the Sitka data set.
A. Recreate the line plot of Time vs size, with the color of the lines mapped to treat. Use scale_color_manual() to change the colors to “orange” and “purple”.
B. Use scale_x_continuous() to convert the x-axis from days to months. First, relabel the tick marks from (150,180,210,240) to (5,6,7,8). Then retitle the x-axis “time(months)”.
C. Split the scatter plot into a panel of scatter plots by tree. (Note: Make the graph area large; graph may take a few seconds to appear)
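# A. group = tree draws one line per tree, as in the earlier line plot
ggplot(Sitka,aes(x=Time,y=size,group=tree,color=treat)) +
geom_line()
ggplot(Sitka,aes(x=Time,y=size,group=tree,color=treat)) +
geom_line()+
scale_color_manual(values=c("orange","purple"))+ # change the line colors
facet_wrap(~treat) # optional: one panel per treatment
# B.
ggplot(Sitka,aes(x=Time,y=size,group=tree,color=treat)) +
geom_line()+
scale_color_manual(values=c("orange","purple")) + # change the line colors
scale_x_continuous(breaks = c(150,180,210,240),
labels = c(5,6,7,8),
name = "time(months)") # relabel and retitle the x-axis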
# C. one panel per tree, as the exercise asks
ggplot(Sitka,aes(x=Time,y=size))+
geom_point(color = "forestgreen")+
scale_x_continuous(breaks = c(150,180,210,240),
labels = c(5,6,7,8))+ # relabel the x-axis ticks
facet_wrap(~tree)
Main arguments for theme settings:
element_line() - can specify color, linewidth, linetype, etc.
element_rect() - can specify fill, color, linewidth, etc.
element_text() - can specify family, face, size, color, angle, etc.
element_blank() - removes theme elements from graph
Set the color and line width of the axis lines:
ggplot(txhousing, aes(x=volume, y=sales, color=median)) +
geom_point() +
theme(axis.line=element_line(color="black",linewidth = 2)) # linewidth in mm
## Warning: Removed 568 rows containing missing values (`geom_point()`).
Going further, details such as the background color, fonts, and axis titles can also be set:
ggplot(txhousing, aes(x=volume, y=sales, color=median)) +
geom_point() +
theme(axis.line=element_line(color="black", linewidth=2),
panel.background=element_rect(fill="white", color="gray"),
title=element_text(family="serif", face="bold"))
## Warning: Removed 568 rows containing missing values (`geom_point()`).
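When the same theme settings are needed on several plots, they can be stored once and added like any other layer. This is a sketch under that assumption; my_theme is a name chosen here for illustration.

```r
library(ggplot2)

# Hypothetical name: a reusable collection of the theme settings used above
my_theme <- theme(axis.line = element_line(color = "black", linewidth = 2),
                  panel.background = element_rect(fill = "white", color = "gray"),
                  title = element_text(family = "serif", face = "bold"))

# The same object can be added to any plot
p1 <- ggplot(txhousing, aes(x = volume, y = sales, color = median)) +
  geom_point() +
  my_theme

p2 <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point() +
  my_theme
```

Changing my_theme in one place then updates every plot that is rebuilt with it.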
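Besides adjusting the appearance one argument at a time, ggplot2 also provides complete themes, which make it simpler to change the background and line style of a plot.
Some complete themes: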
theme_bw()
theme_light()
theme_dark()
theme_classic()
Below, we call complete themes directly to change the appearance of the plot:
ggplot(txhousing, aes(x=volume, y=sales, color=median)) +
geom_point() +
theme_classic()+
labs(title = 'classic style')
## Warning: Removed 568 rows containing missing values (`geom_point()`).
ggplot(txhousing, aes(x=volume, y=sales, color=median)) +
geom_point() +
theme_dark() +
labs(title = 'dark style')
## Warning: Removed 568 rows containing missing values (`geom_point()`).
ggsave() saves a plot to a file (by default, the last plot displayed). Supported formats include eps/ps, tex (pictex), pdf, jpeg, tiff, png, bmp, svg, and wmf.
Some arguments:
width
height
units: units of width and height of plot file ("in", "cm" or "mm")
dpi: plot resolution in dots per inch
plot: name of object with stored plot
#save last displayed plot as pdf
ggsave("plot.pdf") # save in pdf format under the file name "plot.pdf"
## Saving 7 x 5 in image
## Warning: Removed 568 rows containing missing values (`geom_point()`).
#if you're working with lots of graphs, you can store them in R objects
p <- ggplot(Sitka, aes(x=Time, y=size)) +
geom_point()
#You can then use the plot argument of ggsave() to specify which plot to save instead of the last
ggsave("myplot.png", plot=p) # plot= specifies which plot to save
## Saving 7 x 5 in image
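The size arguments listed above can be combined in a single call. The sketch below writes to a temporary folder so that nothing is left in the working directory; the file name and the 12 x 8 cm size are arbitrary choices for illustration.

```r
library(ggplot2)
library(MASS)  # provides the Sitka data set

p <- ggplot(Sitka, aes(x = Time, y = size)) +
  geom_point()

# Build a path inside the session's temporary folder
out_file <- file.path(tempdir(), "sitka_points.pdf")

# Save a 12 cm x 8 cm figure; dpi only matters for raster formats such as png
ggsave(out_file, plot = p, width = 12, height = 8, units = "cm")

file.exists(out_file)
```

Giving width, height, and units explicitly avoids depending on the size of the current graphics window, which is what produces the "Saving 7 x 5 in image" message.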
This exercise uses the Rabbit data set from the MASS package to create plots. Run data(Rabbit) and then click on Rabbit in the RStudio Environment pane.
The Rabbit data set describes an experiment where:
5 rabbits were each given both a treatment drug (MDL) and a control drug.
After injection of either drug, each rabbit was then given one of 6 doses (6.25, 12.5, 25, 50, 100, 200 micrograms) of another chemical that interacts with the drug to affect blood pressure.
Change in blood pressure was measured as the outcome.
Thus each rabbit’s blood pressure was measured 12 times, 6 each for treatment and control.
The data set contains 60 rows (5 rabbits measured 12 times) of the following 5 variables:
BPchange: change in blood pressure relative to the start of the experiment
Dose: dose of interacting chemical in micrograms
Run: label of trial
Treatment: Control or MDL
Animal: animal ID (“R1” through “R5”)
Goal: create a dose-response curve for each rabbit under each treatment, resulting in 10 curves (2 each for 5 rabbits)
Constraints: no color, but publication quality (imagine submitting to a journal that only accepts non-color figures)
Hints (possible steps):
A. First, try creating a line graph with Dose on the x-axis and BPchange on the y-axis, with separate linetypes by Animal.
Why does this graph look wrong?
B. Draw separate lines by Treatment. How can we accomplish this without color?
Some of the line patterns still look a little too similar to distinguish between rabbits.
C. Add a scatter plot where the shape of the points is mapped to Animal.
Next we will change the shapes used. See ?pch for a list of codes for shapes.
D. Use scale_shape_manual() to change the shapes of the points. Use the shapes corresponding to the codes (0, 3, 8, 16, 17).
Ok, the graph has all the data we want on it. Now, we’ll prepare it for publication.
E. Change the x-axis title to “Dose(mcg)” and the y-axis title to “Change in blood pressure”.
Finally, we will change some of the theme() elements.
F. First, change the background from gray to white (or remove it) using panel.background in theme().
G. Next, change the color of the grid lines to “gray90”, a light gray, using panel.grid.
H. Use title to change the face of the titles (axes and legend) to bold.
I. Use strip.text to change the face of the facet titles to bold.
J. Save your last plot as public.png.
library(pacman)
p_load(MASS,ggplot2)
data(Rabbit)#import data
Plots:
#A.
ggplot(Rabbit,aes(x=Dose,y=BPchange,linetype=Animal))+
geom_line() # one line per rabbit: blood pressure change against dose
#B.
ggplot(Rabbit,aes(x=Dose,y=BPchange,linetype=Animal))+
geom_line()+
facet_grid(Animal~Treatment) # split into control and treatment panels (the drug may have a blood-pressure-lowering effect)
#C-D.
ggplot(Rabbit,aes(x=Dose,y=BPchange,linetype=Animal,shape=Animal))+ # map a different point shape to each rabbit
geom_line()+
facet_grid(Animal~Treatment)+
geom_point()+ # add a scatter plot layer
scale_shape_manual(values = c(0,3,8,16,17)) # see ?pch for the shape codes; these are chosen to be easy to tell apart
#E-J. adjust the appearance for publication
public = ggplot(Rabbit,aes(x=Dose,y=BPchange,linetype=Animal,shape=Animal))+
geom_line()+
facet_grid(Animal~Treatment)+
geom_point()+
scale_shape_manual(values = c(0,3,8,16,17))+
labs(x="Dose(mcg)",y="Change in blood pressure",title = "BP vs Dose")+
theme(plot.title = element_text(hjust = 0.5), # center the plot title horizontally
panel.background = element_rect(fill = "white"), # white background fill
panel.grid = element_line(colour = "grey90"), # grid line color
title = element_text(face = "bold"), # bold plot, axis, and legend titles
strip.text = element_text(face = "bold")) # bold facet titles
ggsave("public.pdf",plot = public) # saved as PDF here; use a .png file name for a PNG file
## Saving 7 x 5 in image
# Alternative layout: facet by Treatment only, so all five rabbits share a panel
public2 = ggplot(Rabbit,aes(x=Dose,y=BPchange,linetype=Animal,shape=Animal))+
geom_line()+
facet_wrap(~Treatment)+
geom_point()+
scale_shape_manual(values = c(0,3,8,16,17))+
labs(x="Dose(mcg)",y="Change in blood pressure",title = "BP vs Dose")+
theme(plot.title = element_text(hjust = 0.5), # center the plot title horizontally
panel.background = element_rect(fill = "white"), # white background fill
panel.grid = element_line(colour = "grey90"), # grid line color
title = element_text(face = "bold"), # bold plot, axis, and legend titles
strip.text = element_text(face = "bold")) # bold facet titles
ggsave("public2.pdf",plot = public2)
## Saving 7 x 5 in image
Load the data set used in this part:
library(MASS)
data("birthwt")
birthwt
The data set describes the relationship between risk factors and low infant birth weight. Variables:
low: 0/1 indicator of birth weight < 2.5 kg
age: mother’s age
lwt: mother’s weight in pounds
race: mother’s race (1=white, 2=black, 3=other)
smoke: 0/1 indicator of smoking during pregnancy
ptl: number of previous premature labors
ht: 0/1 indicator of history of hypertension
ui: 0/1 indicator of uterine irritability
ftv: number of physician visits during first trimester
bwt: birth weight in grams
In R, the factor() function can be used to encode categorical variables.
Aesthetics in ggplot2 that can be mapped to either numeric or factor variables:
x and y: continuous or discrete axes
color and fill: color gradient scales or evenly-spaced hue scales
Aesthetics that can only be mapped to categorical variables:
shape
linetype
Aesthetics that should only be mapped to numeric variables (ggplot2 warns when they are mapped to a discrete variable):
size
alpha
Mapping different variable types to the aesthetics that accept both produces different plots:
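ggplot(birthwt,aes(x=age,y=bwt,color=race))+ # race mapped as a numeric variable
geom_point()
ggplot(birthwt,aes(x=age,y=bwt,color=factor(race)))+ # race mapped as a categorical variable
geom_point()
So before plotting, it is best to convert categorical variables to the correct type with factor(); in the same step you can give each category the label you want.
For example, to draw a box plot describing the relationship between infant birth weight (bwt) and whether the mother smoked (smoke):
birthwt$smokef = factor(birthwt$smoke,levels = 0:1,labels = c("did not smoke","smoked")) # data preparation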
ggplot(birthwt,aes(y=bwt,fill=smokef))+
geom_boxplot()
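Since labels are matched to levels purely by position, it is worth checking the recoding before plotting. One possible check is to cross-tabulate the original codes against the new labels, as in this sketch (the object name check is illustrative):

```r
library(MASS)  # provides the birthwt data set
data(birthwt)

birthwt$smokef <- factor(birthwt$smoke, levels = 0:1,
                         labels = c("did not smoke", "smoked"))

# Each original code should land in exactly one label column
check <- table(code = birthwt$smoke, label = birthwt$smokef)
check
```

All counts should sit on the diagonal; a count off the diagonal, or a total below the number of rows, would indicate a mislabeled or unlisted level.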
A. For the birthwt data, convert ht to a factor and label the values 0 and 1 “non-hyper” and “hyper”, respectively.
B. Create boxplots of bwt (birth weight), colored by ht, with separate panels by smokef.
birthwt$htr = factor(birthwt$ht,levels = 0:1,labels = c("non-hyper","hyper")) # convert ht (history of hypertension) to a factor
ggplot(birthwt,aes(y=bwt,fill=htr))+
geom_boxplot()+
facet_wrap(~smokef)
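When two data points have the same values on a plot, they usually occupy the same position, so one hides the other.
For example:
# race is coded 1=white, 2=black, 3=other, so the labels must be given in that order
birthwt$racef = factor(birthwt$race,levels = 1:3,labels = c("white","black","other"))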
ggplot(birthwt, aes(x=racef, y=age)) +
geom_point()
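The data set has 189 data points, but far fewer than 189 points are visible in this plot because many of them overlap exactly.
Adding position="jitter" to the previous command shifts each point slightly, which makes it easier to see how many points there are at each age.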
ggplot(birthwt, aes(x=racef, y=age)) +
geom_point(position = "jitter")
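The default jitter moves points both horizontally and vertically, which slightly changes the displayed ages. A possible refinement, sketched below, is position_jitter() with height = 0 so that only the horizontal position is perturbed; the width of 0.2 and the seed are arbitrary choices, and the factor labels assume the coding 1=white, 2=black, 3=other from the variable list.

```r
library(ggplot2)
library(MASS)  # provides the birthwt data set
data(birthwt)

birthwt$racef <- factor(birthwt$race, levels = 1:3,
                        labels = c("white", "black", "other"))

# Jitter horizontally only; seed makes the random offsets reproducible
p_jit <- ggplot(birthwt, aes(x = racef, y = age)) +
  geom_point(position = position_jitter(width = 0.2, height = 0, seed = 1))

# With height = 0 the plotted ages are the recorded ages
jittered <- layer_data(p_jit)
same_ages <- isTRUE(all.equal(sort(as.numeric(jittered$y)),
                              sort(as.numeric(birthwt$age))))
same_ages
```

Keeping the y values exact is useful here because age is the quantity being read off the plot.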
By default geom_bar() stacks the bars it generates. The following position arguments adjust how the bars are placed:
position="stack": stack elements vertically (the default for geom_bar())
position="dodge": move elements side-by-side (the default for geom_boxplot())
position="fill": stack elements vertically and standardize heights to 1, so that the length of each segment can be read as a proportion.
For example:
ggplot(birthwt, aes(x=low, fill=racef)) +
geom_bar()+ # stacked by default
labs(title = "stack bars with the same x-position")+
theme(plot.title = element_text(hjust = 0.5))
ggplot(birthwt, aes(x=low, fill=racef)) +
geom_bar(position="dodge")+ # placed side by side
labs(title = "dodging emphasizes counts")+
theme(plot.title = element_text(hjust = 0.5))
ggplot(birthwt, aes(x=low, fill=racef)) +
geom_bar(position="fill")+ # each bar standardized to height 1
labs(title = "filling emphasizes proportions")+
theme(plot.title = element_text(hjust = 0.5))
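The proportions shown by position="fill" can be reproduced numerically with base R, which is a handy cross-check when reading values off the filled bars. This is a sketch, again assuming race is coded 1=white, 2=black, 3=other; props is an illustrative name.

```r
library(MASS)  # provides the birthwt data set
data(birthwt)

birthwt$racef <- factor(birthwt$race, levels = 1:3,
                        labels = c("white", "black", "other"))

# Row proportions: share of each race within low = 0 and within low = 1
counts <- table(low = birthwt$low, racef = birthwt$racef)
props <- prop.table(counts, margin = 1)
round(props, 3)
```

Each row sums to 1, matching the standardized bar heights in the filled plot.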
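Error bars and confidence bands describe the statistical uncertainty of a variable; they are drawn with geom_errorbar() and geom_ribbon(), respectively.
Relevant arguments:
x: horizontal positioning of error bar or band
ymin: vertical position of lower error bar or band
ymax: vertical position of upper error bar or band
For example, the following code estimates the mean birth weight and the 95% confidence interval of the mean for each of the three races in the birthwt data set. The means and confidence limits are stored in a new data frame named bwt_by_racef:
# mean_cl_normal() is a ggplot2 wrapper that needs the Hmisc package to be installed
bwt_by_racef <- do.call(rbind,
tapply(birthwt$bwt, birthwt$racef, mean_cl_normal))
bwt_by_racef$racef <- row.names(bwt_by_racef)
names(bwt_by_racef) <- c("mean", "lower", "upper", "racef")
bwt_by_racef
## mean lower upper racef
## white 3102.719 2955.235 3250.202 white
## black 2719.692 2461.722 2977.662 black
## other 2805.284 2629.127 2981.441 other
Next, we use geom_point() and geom_errorbar() to plot the mean birth weight and its 95% confidence limits for each race: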
ggplot(bwt_by_racef, aes(x=racef, y=mean)) +
geom_point() +
geom_errorbar(aes(ymin=lower, ymax=upper))+
labs(title = "mean birthweight by race")+
theme(plot.title = element_text(hjust = 0.5))
ggplot(bwt_by_racef, aes(x=racef, y=mean)) +
geom_point() +
geom_errorbar(aes(ymin=lower, ymax=upper),width=0.1)+ # width adjusts the width of the error bar caps
labs(title = "mean birthweight by race")+
theme(plot.title = element_text(hjust = 0.5))
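If the Hmisc package is not available, the same kind of table can be computed in base R. The sketch below uses a t-based interval, mean ± t × sd/√n, which is the usual normal-theory confidence interval for a mean; mean_ci is a hypothetical helper written for this illustration, and the factor labels assume the coding 1=white, 2=black, 3=other.

```r
library(MASS)  # provides the birthwt data set
data(birthwt)

birthwt$racef <- factor(birthwt$race, levels = 1:3,
                        labels = c("white", "black", "other"))

# Hypothetical helper: mean and t-based confidence limits of a numeric vector
mean_ci <- function(x, level = 0.95) {
  n <- length(x)
  half_width <- qt((1 + level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(mean = mean(x), lower = mean(x) - half_width, upper = mean(x) + half_width)
}

# One row per race, with the group name kept as a column for plotting
ci <- as.data.frame(do.call(rbind, tapply(birthwt$bwt, birthwt$racef, mean_ci)))
ci$racef <- row.names(ci)
ci
```

The resulting data frame has the same columns (mean, lower, upper, racef) as bwt_by_racef, so it can be passed to the geom_point() and geom_errorbar() calls unchanged.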
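This time, we will create a plot of predicted values with a confidence band, based on a regression of birth weight on age. First, we run the model and add the predicted values and confidence limits to the original data set for plotting: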
# linear regression of birth weight on age
m <- lm(bwt ~ age, data=birthwt)
# get predicted values (fit) and confidence limits (lwr and upr)
preddata <- predict(m, interval="confidence")
# add predicted values to original data
birthwt <- cbind(birthwt, preddata)
head(birthwt)
## low age lwt race smoke ptl ht ui ftv bwt smokef htr racef
## 85 0 19 182 2 0 0 0 1 0 2523 did not smoke non-hyper black
## 86 0 33 155 3 0 0 0 0 3 2551 did not smoke non-hyper other
## 87 0 20 105 1 1 0 0 0 1 2557 smoked non-hyper white
## 88 0 21 108 1 1 0 0 1 2 2594 smoked non-hyper white
## 89 0 18 107 1 1 0 0 1 0 2600 smoked non-hyper white
## 91 0 21 124 3 0 0 0 0 0 2622 did not smoke non-hyper other
## fit lwr upr
## 85 2891.909 2757.969 3025.849
## 86 3065.925 2846.442 3285.408
## 87 2904.339 2781.794 3026.883
## 88 2916.768 2803.295 3030.242
## 89 2879.479 2732.358 3026.600
## 91 2916.768 2803.295 3030.242
Next, use geom_line() to draw the fitted line and geom_ribbon() to draw the confidence band:
ggplot(birthwt, aes(x=age, y=fit)) +
geom_line(color="steelblue") +
geom_ribbon(aes(ymin=lwr, ymax=upr),fill="blue", alpha=.5) # alpha sets the transparency
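For a simple linear fit, a similar line and band can also be obtained without precomputing predictions, by letting geom_smooth() fit the model. This is offered as an alternative sketch rather than a replacement for the approach above; the object names are illustrative.

```r
library(ggplot2)
library(MASS)  # provides the birthwt data set
data(birthwt)

# geom_smooth() fits lm(bwt ~ age) internally and draws the 95% band
p_smooth <- ggplot(birthwt, aes(x = age, y = bwt)) +
  geom_smooth(method = "lm", formula = y ~ x, level = 0.95)

# The computed layer holds the fitted values and the band limits
band <- layer_data(p_smooth)
head(band[, c("x", "y", "ymin", "ymax")])
```

The explicit predict() route remains the more flexible option, since the stored fit, lwr, and upr columns can be reused in tables or other plots.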
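Sometimes we need to add notes or annotations directly to a plot that are not represented by any variable in the plotted data set. For example:
adding a text label to a single point on a scatter plot
highlighting part of the plot with a box
The annotate() function does this.
Suppose we want to flag an outlier as a possible data error. Use x= and y= to set where the text goes, and label= for the text itself.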
ggplot(birthwt, aes(x=age, y=bwt, color=racef)) +
geom_point() +
annotate("text", x=42, y=5000, label="Data error?") # notice first argument is "text", not geom_text
Add a rectangle to the plot to highlight a region:
ggplot(birthwt, aes(x=age, y=bwt, color=racef)) +
geom_point() +
annotate("rect", xmin=13, xmax=46, ymin=2215, ymax=3673, alpha=.2)
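Annotation coordinates do not have to be typed in by hand; they can be taken from the data. The sketch below looks up the oldest mother in birthwt, which is presumably the point being flagged above, and places the label just over it. The 150-gram vertical offset is an arbitrary choice.

```r
library(ggplot2)
library(MASS)  # provides the birthwt data set
data(birthwt)

# Row of the oldest mother in the data set
outlier <- birthwt[which.max(birthwt$age), ]

# Place the label slightly above that point
p_note <- ggplot(birthwt, aes(x = age, y = bwt)) +
  geom_point() +
  annotate("text", x = outlier$age, y = outlier$bwt + 150,
           label = "Data error?")

outlier[, c("age", "bwt")]
```

Deriving the position from the data keeps the label attached to the right point if the data set is later filtered or updated.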
For the last set of exercises, we will use a data set stored on the UCLA IDRE website, loaded with the following code:
hsb <- read.csv("https://stats.idre.ucla.edu/stat/data/hsbdemo.csv")
This data set contains demographic and academic data for 200 high school students. We will be using the following variables:
read, write, math, science: academic test scores
female: gender, factor with levels “female” and “male”
honors: enrollment in honors program, factor with 2 levels “enrolled” and “not enrolled”
ses: socioeconomic status, factor with 3 levels, “low”, “middle”, “high”
schtyp: school type, factor with 2 levels, “private” and “public”
1. Create a scatter plot of math (x) vs read (y), with different shapes by prog. Color all of the points red.
2. Find the outlier at math=45, read=63. Add annotation text next to this outlier that says “error?”
3. Create a bar graph that displays the counts of students that fall into groups made up of the following 4 variables: female, prog, schtyp, ses.
From such a graph we can know, for example, how many female students in the academic program who go to public school and are of high socioeconomic status are in the data set.
4. Try to recreate this graph:
Note that the background has been entirely removed and that the axis and legend titles are red and in “mono” font.
#1-2.
ggplot(hsb, aes(x=math, y=read, shape=prog)) +
geom_point(color="red") +
annotate("text", x=45, y=65, label="error?") # placed just above the outlier at math=45, read=63
#3. start with counts by female, then add prog, schtyp, and ses
ggplot(hsb,aes(x=female))+
geom_bar()
ggplot(hsb, aes(x=female, fill=prog)) +
geom_bar(position="dodge", width=.5) +
facet_grid(schtyp ~ ses)
#4.
# reg = lm(write~read,data = hsb)
# preddata <- predict(reg, interval="confidence")
# hsb = cbind(hsb,preddata) # merge the predictions into the data set
ggplot(hsb,aes(x=read,y=write,color=math))+
geom_point()+
geom_smooth(color="red")+
labs(x="Reading Score",y="Writing Score",color="Math Score")+
theme(title = element_text(family = "mono",color = "red"),
panel.background = element_blank())
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'