getwd()  # show the working directory
## [1] "/Users/xrb/Desktop/R语言学习/UCLA网站学习/R Graphics_ Introduction to ggplot2 (1)_files"

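Introduction to ggplot2: study notes

Note: this document records the overall content of the Introduction to ggplot2 seminar on the UCLA website.


2023-8-3 study notes:

1. Basics

First, load the R packages used in these notes:

library(pacman)
p_load(ggplot2,MASS,tidyverse)

Next, get to know the basic elements of the plotting grammar:

2. Basic syntax

The basic syntax of the ggplot2 package is:

ggplot() # note the difference from the package name "ggplot2"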
ggplot(dataset, aes(x=xvar, y=yvar)) + geom_function() 

Draw a scatter plot:

# scatter plot of volume vs sales 
ggplot(txhousing, aes(x=volume, y=sales)) +   
  geom_point()
## Warning: Removed 568 rows containing missing values (`geom_point()`).

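Without a geom_*() layer, only the x and y axes are drawn:

ggplot(txhousing, aes(x=volume, y=sales))

Layers and other components can be added one at a time with +, including geoms, stats, scales, facets, and themes.

Enrich the plot above by adding a rug layer: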

ggplot(txhousing, aes(x=volume, y=sales)) +     # x=volume and y=sales inherited by all layers     
  geom_point() +   
  geom_rug(aes(color=median))   # color will only apply to the rug plot because not specified in ggplot()
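
To check that the color mapping really stays local to the rug layer, one can inspect the data each layer ends up with once the plot is built. The sketch below assumes ggplot2's layer_data() helper and simply counts the distinct colors per layer; the object names are illustrative.

```r
library(ggplot2)

p <- ggplot(txhousing, aes(x = volume, y = sales)) +
  geom_point() +
  geom_rug(aes(color = median))

# layer_data() returns the computed data frame for one layer of a built plot
point_colors <- unique(layer_data(p, 1)$colour)  # layer 1: geom_point()
rug_colors   <- unique(layer_data(p, 2)$colour)  # layer 2: geom_rug()

length(point_colors)  # the points share one default color
length(rug_colors)    # the rug has many colors, driven by median
```

Moving color=median into ggplot() instead would make both layers inherit it.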

3. Aesthetics (aes)

Commonly used aesthetics:

  • x: positioning along x-axis

  • y: positioning along y-axis

  • color: color of objects; for 2-d objects, the color of the object’s outline (compare to fill below)

  • fill: fill color of objects

  • linetype: how lines should be drawn (solid, dashed, dotted, etc.)

  • shape: shape of markers in scatter plots

  • size: how large objects appear

  • alpha: transparency of objects (value between 0, transparent, and 1, opaque – inverse of how many stacked objects it will take to be opaque)

Add an aesthetic mapping to improve the scatter plot drawn earlier:

# mapping color to median inside of aes() 
ggplot(txhousing, aes(x=volume, y=sales)) +   
  geom_point(aes(color=median))
## Warning: Removed 568 rows containing missing values (`geom_point()`).

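Exercise 1:

# load an example data set

data(Sitka)

This data set describes the growth of trees over time; some of the trees were grown in ozone-enriched chambers. Variables:

  • size: numeric, log of size (height times diameter squared)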

  • Time: numeric, time of measurement (days since January 1, 1988)

  • tree: integer, tree id

  • treat: factor, treatment group, 2 levels=“control” and “ozone”

Tasks:

A. Create a scatter plot of Time vs size to view the growth of trees over time.

B. Color the scatter plot points by the variable treat.

C. Add an additional geom_smooth() (loess) layer to the graph.

D. SET the color of the loess smooth to “green” rather than have it colored by treat. Why is there only one smoothed curve now?

ggplot(Sitka,aes(x = Time,y = size,color = treat,fill = treat)) +
  geom_point() + 
  geom_smooth(method = lm)  # fitted separately for each treat group (method = lm gives a linear fit; omit it for the default loess smooth)
## `geom_smooth()` using formula = 'y ~ x'

ggplot(Sitka,aes(x = Time,y = size,color = treat)) +   
  geom_point() +    
  geom_smooth(method = lm,col = "green")  # no longer grouped by treat
## `geom_smooth()` using formula = 'y ~ x'
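
The answer to question D can be checked by counting the groups the smoother was fitted for. This is a minimal sketch, assuming layer_data() from ggplot2 and that the smooth is the second layer of each plot; setting color to a constant removes the color mapping for that layer, so nothing is left to split the data by treat.

```r
library(ggplot2)
library(MASS)  # provides the Sitka data set

base <- ggplot(Sitka, aes(x = Time, y = size, color = treat)) +
  geom_point()

mapped <- base + geom_smooth(method = "lm")                   # color still mapped to treat
fixed  <- base + geom_smooth(method = "lm", color = "green")  # color set to a constant

# Layer 2 is the smooth; count the groups it was fitted for
n_groups_mapped <- length(unique(layer_data(mapped, 2)$group))
n_groups_fixed  <- length(unique(layer_data(fixed, 2)$group))

c(mapped = n_groups_mapped, fixed = n_groups_fixed)
```

To keep two curves while still drawing them in green, one option is to add group = treat inside aes() for the smooth layer.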

4. Geoms (geometric objects)

Geom functions draw the data as different geometric shapes.

Commonly used geoms:

  • geom_bar(): bars with bases on the x-axis

  • geom_boxplot(): boxes-and-whiskers

  • geom_errorbar(): T-shaped error bars

  • geom_density(): density plots

  • geom_histogram(): histogram

  • geom_line(): lines

  • geom_point(): points (scatterplot)

  • geom_ribbon(): bands spanning y-values across a range of x-values

  • geom_smooth(): smoothed conditional means (e.g. loess smooth)

  • geom_text(): text

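Common plot types:

1. Histogram: shows the distribution of a continuous variable

ggplot(txhousing, aes(x=median)) +    
  geom_histogram(color="steelblue",fill="lightblue")  # set colors directly: color is the outline color, fill is the fill color

2. Density plot: a smoothed version of the histogram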

ggplot(txhousing, aes(x=median)) +    
  geom_density()
## Warning: Removed 616 rows containing non-finite values (`stat_density()`).

Density plots can be drawn by group:

ggplot(txhousing, aes(x=median,color = factor(month))) +   
# month is numeric, so convert it to a grouping variable with factor() first (similar to the encode command in Stata)
  geom_density()
## Warning: Removed 616 rows containing non-finite values (`stat_density()`).

3. Box plot: shows the distribution of the y variable at each value of the x variable

ggplot(txhousing, aes(x=factor(year), y=median)) +    
  geom_boxplot()  # distribution of the median sale price in each year
## Warning: Removed 616 rows containing non-finite values (`stat_boxplot()`).

4. Bar plot: counts the frequency of a categorical variable

ggplot(diamonds, aes(x=cut, fill=clarity)) +    
  geom_bar()

5. Scatter plot: shows the relationship between two variables

ggplot(txhousing, aes(x=volume, y=sales,
                      color=median, alpha=listings, size=inventory)) +  # map more variables to enrich the plot
  geom_point() 
## Warning: Removed 1468 rows containing missing values (`geom_point()`).

6. Line plot

ggplot(txhousing, aes(x=date, y=sales, color=city)) +    
  geom_line()
## Warning: Removed 430 rows containing missing values (`geom_line()`).

Try visualizing the trend of China's provincial digital inclusive finance index:

library(haven) 
merge6 <- read_dta("~/Desktop/论文/小论文-数字经济与农户收入/数据/目前可用的/merge6.dta")

ggplot(merge6, aes(x=year,y=index_aggregate,color=prov_name_eng)) +
  geom_line()
## Warning: Removed 434 rows containing missing values (`geom_line()`).

Exercise 2:

We will be using the Sitka data set again for this exercise.

A. Using 2 different geoms, compare the distribution of size between the two levels of treat .Use a different color for each distribution.

B. Use a bar plot to visualize the crosstabulation of Time and treat. Put Time on the x-axis.

C. Create a line graph of size over Time, with separate lines by tree and lines colored by treat.

D. Imagine you plan to submit your line graph of size over time by tree to a journal that does not print color graphs. How else can you distinguish between the 2 treatments on the graph?

Solution:

ggplot(Sitka,aes(x=size,fill=treat)) +
  geom_histogram() 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(Sitka,aes(x=size,color=treat)) +   
  geom_density() 

# B. crosstabulation of Time and treat, with Time on the x-axis
ggplot(Sitka,aes(x=factor(Time),fill=treat)) +
  geom_bar() +
  labs(x="Time")

ggplot(Sitka,aes(x=Time,y=size,group=tree))+
  geom_line(aes(color=treat))

D. When the journal requires black-and-white figures, use different line types (linetype) to distinguish the groups:

ggplot(Sitka,aes(x=Time,y=size,group=tree))+   
  geom_line(aes(linetype=treat))

---


2023-8-4

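5. Stats, Scales, Coordinate Systems, and Facets

5.1 stat function

Stat functions apply a statistical transformation to the data, usually some form of summary such as a mean, a standard deviation, or a confidence interval.

Each stat function is associated with a default geom, so no geom is needed to draw the shapes.

# using the txhousing data set, summarize sales (y) for each year (x) 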
ggplot(txhousing,aes(x=year,y=sales))+
  stat_summary()
## Warning: Removed 568 rows containing non-finite values (`stat_summary()`).
## No summary function supplied, defaulting to `mean_se()`

Create a new plot where x is mapped to Time and y is mapped to size. Then, add a stat_summary() layer.

ggplot(Sitka,aes(x=Time,y=size))+
  stat_summary()
## No summary function supplied, defaulting to `mean_se()`
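
The message above says stat_summary() falls back to mean_se(). As a rough illustration of what that default computes, the sketch below calls mean_se() on a small made-up vector and compares it with the mean plus or minus one standard error computed by hand; the vector x is purely illustrative.

```r
library(ggplot2)

x <- c(2, 4, 4, 6)

# mean_se() returns the mean (y) and the mean -/+ one standard error (ymin, ymax)
summary_default <- mean_se(x)

# The same quantities computed by hand
manual_mean <- mean(x)
manual_se   <- sd(x) / sqrt(length(x))

summary_default
c(ymin = manual_mean - manual_se, y = manual_mean, ymax = manual_mean + manual_se)
```

In a plot, stat_summary() applies this summary to the y values at each distinct x.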

5.2 scale function

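  • Scales define which aesthetic values are mapped to which data values.

Here is a color scale that ggplot2 chooses for us:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) +  # colors are assigned by cut
  geom_point()

Use scale_color_manual() to set the colors by hand through its values argument: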

ggplot(diamonds, aes(x=carat, y=price, color=cut)) +    
  geom_point() +   
  scale_color_manual(values=c("red", "yellow", "green", "blue", "violet"))

  • Scale functions for the y-axis

Add scale_y_continuous() to adjust the breaks (tick positions) of the y-axis.

Before:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + 
  geom_point() +
  scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) 

At this point the y-axis tick marks are at 0, 5000, 10000, and 15000.

After:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + 
  geom_point() +
  scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) + 
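  scale_y_continuous(breaks=c(0,2500,5000,7500,10000,12500,15000,17500))  # set the y-axis breaks

Next, change the unit on the y-axis of the plot above from dollars to thousands of dollars: supply a label for each break, then retitle the y-axis "price(thousands of dollars)".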

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + 
  geom_point() +
  scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) + 
  scale_y_continuous(breaks=c(0,2500,5000,7500,10000,12500,15000,17500),
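                     labels = c(0,1.5,5,7.5,10,12.5,15,17.5),  # labels pair with the breaks above by position; 1.5 is deliberately paired with 2500 to show that labels are free text (the accurate label would be 2.5)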
                     name = "price(thousands of dollars)")
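
Because labels are matched to breaks purely by position, hand-typed labels can drift out of sync with the breaks. One way to avoid that, sketched here, is to pass a function to labels so that each label is computed from its break. The helper to_thousands is a hypothetical name introduced for this example.

```r
library(ggplot2)

price_breaks <- seq(0, 17500, by = 2500)

# Hypothetical helper: convert dollars to thousands of dollars
to_thousands <- function(dollars) dollars / 1000

p <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  scale_y_continuous(breaks = price_breaks,
                     labels = to_thousands,  # a function is applied to the breaks
                     name = "price(thousands of dollars)")

to_thousands(price_breaks)
```

Printing p draws the plot; the last line only shows the labels the function would produce.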

  • Scale functions for the x-axis

Restrict the x-axis to a range of values and change the axis titles:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) +    
  geom_point() +   
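  xlim(c(0,3))+  # keep only carat values between 0 and 3; points outside the range are removed
  labs(x="CARAT", y="PRICE", color="Diamond_CUT", title="CARAT vs PRICE by CUT")  # change the x-axis, y-axis, legend, and plot titles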
## Warning: Removed 32 rows containing missing values (`geom_point()`).

  • Guides (legends)

All of the plots above include a legend by default, which serves as the guide for the scale; guides() can remove it.

Guides for each scale can be set scale-by-scale with the guide argument, or en masse with guides().

ggplot(diamonds, aes(x=carat, y=price, color=cut)) +    
  geom_point() +   
  labs(x="CARAT", y="PRICE",title="CARAT vs PRICE by CUT")+
  guides(color="none")  # remove the legend for the color mapped to cut

5.3 Coordinate system

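A coordinate system defines the plane on which objects are positioned in the plot; most plots use the Cartesian coordinate system.

Faceting functions split a plot into small panels, one for each subset of the faceting variable. This part covers facet_wrap() and facet_grid().

  • facet_wrap():
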
ggplot(diamonds, aes(x=carat, y=price)) +    
  geom_point() +    
  facet_wrap(~cut)  # one panel per level of cut, wrapped into rows and columns

  • facet_grid():

facet_grid() lets you specify directly which variables split the plot along rows and columns. Put the row variable before ~ and the column variable after ~.

ggplot(diamonds, aes(x=carat, y=price)) +       
  geom_point(color = "steelblue",shape = 17,size = 0.5) +  # set the color, shape, and size of the points
  facet_grid(clarity~cut)  # rows are split by clarity, columns by cut

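The plot above uses facet_grid(clarity~cut) to show the scatter of price against carat for each subset of diamonds, with one row per clarity level and one column per cut level.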
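
The size of the panel grid follows directly from the number of levels of the two faceting variables. A small sketch of that arithmetic for the diamonds data:

```r
library(ggplot2)

# facet_grid(clarity ~ cut) lays out one row per clarity level
# and one column per cut level
n_rows <- nlevels(diamonds$clarity)
n_cols <- nlevels(diamonds$cut)

c(rows = n_rows, cols = n_cols, panels = n_rows * n_cols)
```

With this many panels it usually helps to enlarge the plotting area before drawing.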

That covers the main uses of the facet functions; for more detailed facet syntax and options, see Facets for ggplot2 in R.

Exercise 3:

Use the Sitka data set.

A. Recreate the line plot of Time vs size, with the color of the lines mapped to treat. Use scale_color_manual() to change the colors to “orange” and “purple”.

B. Use scale_x_continuous() to convert the x-axis from days to months. First, relabel the tick marks from (150,180,210,240) to (5,6,7,8). Then retitle the x-axis “time(months)”.

C. Split the scatter plot into a panel of scatter plots by tree. (Note: Make the graph area large; graph may take a few seconds to appear)

Solution:

# A. line plot with one line per tree (group=tree), colored by treat
ggplot(Sitka,aes(x=Time,y=size,group=tree,color=treat)) +
  geom_line()

ggplot(Sitka,aes(x=Time,y=size,group=tree,color=treat)) +   
  geom_line()+
  scale_color_manual(values=c("orange","purple"))+  # change the line colors
  facet_wrap(~treat)  # optional: one panel per treatment

# B. convert the x-axis from days to months
ggplot(Sitka,aes(x=Time,y=size,group=tree,color=treat)) +      
  geom_line()+   
  scale_color_manual(values=c("orange","purple")) +  # change the line colors
  scale_x_continuous(breaks = c(150,180,210,240),
                     labels = c(5,6,7,8),
                     name = "time(months)")  # relabel and retitle the x-axis

# C. one scatter-plot panel per tree
ggplot(Sitka,aes(x=Time,y=size))+
  geom_point(color = "forestgreen")+
  scale_x_continuous(breaks = c(150,180,210,240),
                      labels = c(5,6,7,8))+  # relabel the x-axis
  facet_wrap(~tree)

6. Themes

Changing the look of a theme through arguments

The main theme element functions:

  • element_line() - can specify color, linewidth, linetype, etc.

  • element_rect() - can specify fill, color, linewidth, etc.

  • element_text() - can specify family, face, size, color, angle, etc.

  • element_blank() - removes theme elements from graph

Set the color and line width of the axis lines:

ggplot(txhousing, aes(x=volume, y=sales, color=median)) +    
  geom_point() +   
  theme(axis.line=element_line(color="black",linewidth = 2)) # linewidth in mm 
## Warning: Removed 568 rows containing missing values (`geom_point()`).

Going further, you can set the background color, fonts, titles, and other details:

ggplot(txhousing, aes(x=volume, y=sales, color=median)) + 
  geom_point() +
  theme(axis.line=element_line(color="black", linewidth=2),
        panel.background=element_rect(fill="white", color="gray"),
        title=element_text(family="serif", face="bold")) 
## Warning: Removed 568 rows containing missing values (`geom_point()`).

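Changing the overall look with complete themes

Besides adjusting the appearance one argument at a time, ggplot2 also provides complete themes, which make it simpler to change the background and line style of a plot.

Some complete themes: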

  • theme_bw()

  • theme_light()

  • theme_dark()

  • theme_classic()

Below, we apply complete themes directly to change the look of the plot:

ggplot(txhousing, aes(x=volume, y=sales, color=median)) + 
  geom_point() +
  theme_classic()+
  labs(title = 'classic style')
## Warning: Removed 568 rows containing missing values (`geom_point()`).

ggplot(txhousing, aes(x=volume, y=sales, color=median)) + 
  geom_point() +
  theme_dark() +
  labs(title = 'dark style')
## Warning: Removed 568 rows containing missing values (`geom_point()`).

7. Saving plots to files

ggsave() saves a plot to a file (by default, the last plot displayed). Supported formats include eps/ps, tex (pictex), pdf, jpeg, tiff, png, bmp, svg, and wmf.

Some of its arguments:

  • width

  • height

  • units: units of width and height of plot file ("in", "cm", or "mm")

  • dpi: plot resolution in dots per inch

  • plot: name of object with stored plot

#save last displayed plot as pdf
ggsave("plot.pdf")  # save as PDF to the file "plot.pdf"
## Saving 7 x 5 in image
## Warning: Removed 568 rows containing missing values (`geom_point()`).
#if you're working with lots of graphs, you can store them in R objects
p <- ggplot(Sitka, aes(x=Time, y=size)) + 
  geom_point()
#You can then use the plot argument of ggsave() to specify which plot to save instead of the last
ggsave("myplot.png", plot=p)  # plot= specifies which plot to save
## Saving 7 x 5 in image
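
The size arguments listed above can be combined in a single call. The following is a minimal sketch, assuming a 12 cm by 8 cm figure and a file written to R's temporary directory; the file name and dimensions are illustrative choices, not values from the seminar.

```r
library(ggplot2)
library(MASS)  # provides the Sitka data set

p <- ggplot(Sitka, aes(x = Time, y = size)) +
  geom_point()

# Write to a temporary directory so the working directory is left untouched
out_file <- file.path(tempdir(), "sitka_points.pdf")

# width and height are interpreted in the given units
ggsave(out_file, plot = p, width = 12, height = 8, units = "cm")

file.exists(out_file)
```

For raster formats such as PNG, the dpi argument additionally controls the resolution of the saved image.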

Exercise 4:

About the data:

This exercise uses the Rabbit data set from the MASS package to create plots. Run data(Rabbit) and then click on Rabbit in the RStudio Environment pane.

The Rabbit data set describes an experiment where:

5 rabbits were each given both a treatment drug (MDL) and a control drug.

After injection of either drug, each rabbit was then given one of 6 doses (6.25, 12.5, 25, 50, 100, 200 micrograms) of another chemical that interacts with the drug to affect blood pressure.

Change in blood pressure was measured as the outcome.

Thus each rabbit’s blood pressure was measured 12 times, 6 each for treatment and control.

The data set contains 60 rows (5 rabbits measured 12 times) of the following 5 variables:

  • BPchange: change in blood pressure relative to the start of the experiment

  • Dose: dose of interacting chemical in micrograms

  • Run: label of trial

  • Treatment: Control or MDL

  • Animal: animal ID (“R1” through “R5”)

Requirements:

Goal: create a dose-response curve for each rabbit under each treatment, resulting in 10 curves (2 each for 5 rabbits)

Constraints: no color, but publication quality (imagine submitting to a journal that only accepts non-color figures)

Hints (possible steps):

A. First, try creating a line graph with Dose on the x-axis and BPchange on the y-axis, with separate linetypes by Animal.

Why does this graph look wrong?

B. Draw separate lines by Treatment. How can we accomplish this without color?

Some of the line patterns still look a little too similar to distinguish between rabbits.

C. Add a scatter plot where the shape of the points is mapped to Animal.

Next we will change the shapes used. See ?pch for a list of codes for shapes.

D. Use scale_shape_manual() to change the shapes of the points. Use the shapes corresponding to the codes (0, 3, 8, 16, 17).

Ok, the graph has all the data we want on it. Now, we’ll prepare it for publication.

E. Change the x-axis title to “Dose(mcg)” and the y-axis title to “Change in blood pressure”.

Finally, we will change some of the theme() elements.

F. First, change the background from gray to white (or remove it) using panel.background in theme().

G. Next, change the color of the grid lines to “gray90”, a light gray, using panel.grid.

H. Use title to change the face of the titles (axes and legend) to bold.

I. Use strip.text to change the face of the facet titles to bold.

J. Save your last plot as public.png.

Solution:

library(pacman)
p_load(MASS,ggplot2)
data(Rabbit)#import data

Plots:

#A.
ggplot(Rabbit,aes(x=Dose,y=BPchange,linetype=Animal))+
  geom_line()  # line plot of blood pressure change against dose, one linetype per rabbit

#B.
ggplot(Rabbit,aes(x=Dose,y=BPchange,linetype=Animal))+   
  geom_line()+
  facet_grid(Animal~Treatment)  # split into control and treatment panels (the drug appears to have a blood-pressure-lowering effect)

#C.
ggplot(Rabbit,aes(x=Dose,y=BPchange,linetype=Animal,shape=Animal))+  # map a different point shape to each rabbit
  geom_line()+
  facet_grid(Animal~Treatment)+
  geom_point()+  # add a scatter layer
  scale_shape_manual(values = c(0,3,8,16,17))  # see ?pch for the shape codes; these shapes are easy to tell apart

#D. adjust the appearance to meet publication requirements
public = ggplot(Rabbit,aes(x=Dose,y=BPchange,linetype=Animal,shape=Animal))+  
  geom_line()+
  facet_grid(Animal~Treatment)+
  geom_point()+
  scale_shape_manual(values = c(0,3,8,16,17))+
  labs(x="Dose(mcg)",y="Change in blood pressure",title = "BP vs Dose")+
  theme(plot.title = element_text(hjust = 0.5),  # center the title horizontally
        panel.background = element_rect(fill = "white"),  # white panel background
        panel.grid = element_line(colour = "grey90"),  # grid line color
        title = element_text(face = "bold"),  # bold plot, axis, and legend titles
        strip.text = element_text(face = "bold"))  # bold facet titles
ggsave("public.pdf",plot = public)
## Saving 7 x 5 in image
#E.
public2 = ggplot(Rabbit,aes(x=Dose,y=BPchange,linetype=Animal,shape=Animal))+  
  geom_line()+
  facet_wrap(~Treatment)+
  geom_point()+
  scale_shape_manual(values = c(0,3,8,16,17))+
  labs(x="Dose(mcg)",y="Change in blood pressure",title = "BP vs Dose")+
  theme(plot.title = element_text(hjust = 0.5),  # center the title horizontally
        panel.background = element_rect(fill = "white"),  # white panel background
        panel.grid = element_line(colour = "grey90"),  # grid line color
        title = element_text(face = "bold"),  # bold plot, axis, and legend titles
        strip.text = element_text(face = "bold"))  # bold facet titles
ggsave("public2.pdf",plot = public2)
## Saving 7 x 5 in image

2023-8-5

8. Factors vs numeric variables in ggplot2

Load the data set used in this part:

library(MASS)
data("birthwt")

The birthwt data set describes the relationship between risk factors and low infant birth weight. Its variables include:

  • low: 0/1 indicator of birth weight < 2.5 kg

  • age: mother’s age

  • lwt: mother’s weight in pounds

  • race: mother’s race, (1=white, 2=black, 3=other)

  • smoke: 0/1 indicator of smoking during pregnancy

  • ptl: number of previous premature labors

  • ht: 0/1 indicator of history of hypertension

  • ui: 0/1 indicator of uterine irritability

  • ftv: number of physician visits during first trimester

  • bwt: birth weight in grams

In R, use the factor() function to encode categorical variables.

Aesthetics in ggplot2 that can be mapped to either numeric or factor variables:

  • x and y: continuous or discrete axes

  • color and fill: color gradient scales or evenly-spaced hue scales

Aesthetics that can only be mapped to categorical variables:

  • shape

  • linetype

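Aesthetics that should only be mapped to numeric variables (ggplot2 issues a warning when a categorical variable is used):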

  • size

  • alpha

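For aesthetics that accept both types, the type of the variable changes the resulting plot:

ggplot(birthwt,aes(x=age,y=bwt,color=race))+  # race mapped as a numeric variable
  geom_point()

ggplot(birthwt,aes(x=age,y=bwt,color=factor(race)))+  # race mapped as a categorical variable
  geom_point()

So before plotting, it is best to convert categorical variables to the right type with factor(); the same call can also assign the label you want to each category.

For example, to draw a box plot describing the relationship between infant birth weight (bwt) and whether the mother smoked (smoke):

birthwt$smokef = factor(birthwt$smoke,levels = 0:1,labels = c("did not smoke","smoked"))  # preprocess the data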
ggplot(birthwt,aes(y=bwt,fill=smokef))+
  geom_boxplot()
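
The pairing between levels and labels in factor() is positional, which is easy to get wrong when the codes are not in the order one expects. A minimal sketch on a small made-up vector (the values in smoke are illustrative, not taken from birthwt):

```r
# Codes as they might appear in the raw data (0 = no, 1 = yes)
smoke <- c(0, 1, 1, 0)

# levels lists the codes; labels gives the text for each code, in the same order
smokef <- factor(smoke, levels = 0:1, labels = c("did not smoke", "smoked"))

as.character(smokef)
table(smokef)
```

Checking a converted variable with table() against the codebook is a cheap way to catch swapped labels.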

Exercise 5

A. For the birthwt data, convert ht to a factor and label the values 0 and 1 “non-hyper” and “hyper”, respectively.

B. Create boxplots of bwt(birth weight), colored by ht , with separate panels by smokef.

birthwt$htr = factor(birthwt$ht,levels = 0:1,labels = c("non-hyper","hyper"))  # convert ht (history of hypertension) to a factor
ggplot(birthwt,aes(y=bwt,fill=htr))+
  geom_boxplot()+
  facet_wrap(~smokef)


2023-8-6

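9. Plotting overlapping data

9.1 Overlapping values in scatter plots

When two data points have the same values on the plot, they occupy the same position, so one obscures the other.

For example: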

birthwt$racef = factor(birthwt$race,levels = 1:3,labels = c("white","black","other"))  # race is coded 1=white, 2=black, 3=other
ggplot(birthwt, aes(x=racef, y=age)) +   
  geom_point() 

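The data set has 189 observations, but far fewer than 189 points are visible in this plot because many of them overlap exactly. Adding position="jitter" to the original call shifts each point by a small random amount, which makes it clearer how many points there are at each age.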

ggplot(birthwt, aes(x=racef, y=age)) +      
  geom_point(position = "jitter")
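
The effect of jittering can also be checked numerically by counting the distinct positions that are actually plotted. This sketch assumes layer_data() from ggplot2 and uses position_jitter() with a fixed seed so the result is reproducible; it uses factor(race) directly so that it runs on the unmodified birthwt data.

```r
library(ggplot2)
library(MASS)  # provides the birthwt data set

base <- ggplot(birthwt, aes(x = factor(race), y = age))

plain    <- base + geom_point()
jittered <- base + geom_point(position = position_jitter(seed = 1))  # seed fixed for reproducibility

# Count the distinct plotted positions in each version
n_plain    <- nrow(unique(layer_data(plain)[, c("x", "y")]))
n_jittered <- nrow(unique(layer_data(jittered)[, c("x", "y")]))

c(observations = nrow(birthwt), plain = n_plain, jittered = n_jittered)
```

Note that jitter moves points in both directions by default; passing height = 0 to position_jitter() keeps the plotted ages exact.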

9.2 Overlapping values in bar plots

By default, geom_bar() stacks bars that share the same x position. The position argument controls how the bars are placed:

  • position="stack": stack elements vertically (the default for geom_bar())

  • position="dodge": move elements side-by-side (the default for geom_boxplot())

  • position="fill": stack elements vertically, standardize heights to 1, so that the length of each segment can be read as the proportion for that value.

For example:

ggplot(birthwt, aes(x=low, fill=racef)) +   
  geom_bar()+  # stacked by default
  labs(title = "stack bars with the same x-position")+
  theme(plot.title = element_text(hjust = 0.5))

ggplot(birthwt, aes(x=low, fill=racef)) +   
  geom_bar(position="dodge")+  # place the bars side by side
  labs(title = "dodging emphasizes counts")+
  theme(plot.title = element_text(hjust = 0.5))

ggplot(birthwt, aes(x=low, fill=racef)) +   
  geom_bar(position="fill")+  # standardize the height of each bar to 1
  labs(title = "filling emphasizes proportions")+
  theme(plot.title = element_text(hjust = 0.5))
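
The proportions that position="fill" displays can be reproduced from a plain crosstabulation. A small sketch in base R, using the numeric race codes so that it does not depend on the racef variable created earlier:

```r
library(MASS)  # provides the birthwt data set

counts <- table(race = birthwt$race, low = birthwt$low)

# Divide each column by its total; these are the segment heights
# that position = "fill" draws for each value of low
props <- prop.table(counts, margin = 2)

counts
round(props, 3)
colSums(props)  # each bar sums to 1
```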

10. Error bars & confidence bands

Error bars and confidence bands describe the uncertainty around estimates; they are drawn with geom_errorbar() and geom_ribbon(), respectively.

Relevant aesthetics:

  • x: horizontal positioning of error bar or band

  • ymin: vertical position of lower error bar or band

  • ymax: vertical position of upper error bar or band

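For example, the following code estimates the mean birth weight and the 95% confidence interval of the mean for each of the three race groups in the birthwt data set. The means and confidence limits are stored in a new data frame named bwt_by_racef (mean_cl_normal() comes from ggplot2 and requires the Hmisc package to be installed):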

bwt_by_racef <- do.call(rbind, 
                        tapply(birthwt$bwt, birthwt$racef, mean_cl_normal))
bwt_by_racef$racef <- row.names(bwt_by_racef)
names(bwt_by_racef) <- c("mean", "lower", "upper", "racef")
bwt_by_racef
##           mean    lower    upper racef
## white 3102.719 2955.235 3250.202 white
## black 2719.692 2461.722 2977.662 black
## other 2805.284 2629.127 2981.441 other
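
For readers without Hmisc, the same kind of interval can be computed by hand. The sketch below assumes that mean_cl_normal() returns t-based limits, mean plus or minus qt(0.975, n-1) times the standard error, and groups by the numeric race codes (1 = white, 2 = black, 3 = other); mean_ci is a hypothetical helper written for this example.

```r
library(MASS)  # provides the birthwt data set

# Hypothetical helper: mean and t-based confidence limits of the mean
mean_ci <- function(x, level = 0.95) {
  n <- length(x)
  half_width <- qt((1 + level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(mean = mean(x), lower = mean(x) - half_width, upper = mean(x) + half_width)
}

# One row per race code, columns mean / lower / upper
ci_by_race <- t(sapply(split(birthwt$bwt, birthwt$race), mean_ci))
ci_by_race
```

The three rows should agree with the table printed above up to rounding.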

Next, we use geom_point() and geom_errorbar() to plot the mean birth weight and its 95% confidence limits for each race group:

ggplot(bwt_by_racef, aes(x=racef, y=mean)) +   
  geom_point() +   
  geom_errorbar(aes(ymin=lower, ymax=upper))+
  labs(title = "mean birthweight by race")+   
  theme(plot.title = element_text(hjust = 0.5))

ggplot(bwt_by_racef, aes(x=racef, y=mean)) +   
  geom_point() +   
  geom_errorbar(aes(ymin=lower, ymax=upper),width=0.1)+  # width adjusts the width of the error bars
  labs(title = "mean birthweight by race")+   
  theme(plot.title = element_text(hjust = 0.5))

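This time, we regress birth weight on age and create a plot of predicted values with a confidence band. First, we fit the model and add the predicted values and confidence limits to the original data set for plotting: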

# linear regression of birth weight on age
m <- lm(bwt ~ age, data=birthwt)
# get predicted values (fit) and confidence limits (lwr and upr)
preddata <- predict(m, interval="confidence")
# add predicted values to original data
birthwt <- cbind(birthwt, preddata)
head(birthwt)
##    low age lwt race smoke ptl ht ui ftv  bwt        smokef       htr racef
## 85   0  19 182    2     0   0  0  1   0 2523 did not smoke non-hyper black
## 86   0  33 155    3     0   0  0  0   3 2551 did not smoke non-hyper other
## 87   0  20 105    1     1   0  0  0   1 2557        smoked non-hyper white
## 88   0  21 108    1     1   0  0  1   2 2594        smoked non-hyper white
## 89   0  18 107    1     1   0  0  1   0 2600        smoked non-hyper white
## 91   0  21 124    3     0   0  0  0   0 2622 did not smoke non-hyper other
##         fit      lwr      upr
## 85 2891.909 2757.969 3025.849
## 86 3065.925 2846.442 3285.408
## 87 2904.339 2781.794 3026.883
## 88 2916.768 2803.295 3030.242
## 89 2879.479 2732.358 3026.600
## 91 2916.768 2803.295 3030.242

Then use geom_line() to draw the fitted line and geom_ribbon() to draw the confidence band:

ggplot(birthwt, aes(x=age, y=fit)) +    
  geom_line(color="steelblue") +   
  geom_ribbon(aes(ymin=lwr, ymax=upr),fill="blue", alpha=.5)  # alpha adjusts the transparency
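
As an alternative that is not part of the seminar code above, geom_smooth(method = "lm") can draw a fitted line and band in one step. The sketch below checks, at the youngest age in the data, that the band it draws matches the confidence limit from predict(); it assumes layer_data() returns the smoother's x, ymin, and ymax columns.

```r
library(ggplot2)
library(MASS)  # provides the birthwt data set

m <- lm(bwt ~ age, data = birthwt)

p <- ggplot(birthwt, aes(x = age, y = bwt)) +
  geom_smooth(method = "lm", level = 0.95)

band <- layer_data(p)                  # grid of x values with y, ymin, ymax
youngest <- band[which.min(band$x), ]  # left edge of the band

# Confidence limits from predict() at the same age
by_hand <- predict(m, newdata = data.frame(age = youngest$x),
                   interval = "confidence")

smooth_lower  <- youngest$ymin
predict_lower <- unname(by_hand[1, "lwr"])
c(smooth_lower = smooth_lower, predict_lower = predict_lower)
```

The manual predict() plus geom_ribbon() route remains the more flexible one, since it works for any model whose limits you can compute.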

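11. Annotating plots

Sometimes we need to add notes or annotations directly to a plot that are not represented by any variable in the plotted data set. For example:

  • adding a text label to a single point on a scatter plot

  • highlighting part of the plot with a box

The annotate() function does this.

Suppose we want to flag an outlier as a possible data error. Use x= and y= to position the text, and label= to supply its content.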

ggplot(birthwt, aes(x=age, y=bwt, color=racef)) +
  geom_point() + 
  annotate("text", x=42, y=5000, label="Data error?")  # notice first argument is "text", not geom_text

Add a rectangle to the plot to highlight a region:

ggplot(birthwt, aes(x=age, y=bwt, color=racef)) +
  geom_point() + 
  annotate("rect", xmin=13, xmax=46, ymin=2215, ymax=3673, alpha=.2)

Final exercises

For the last set of exercises, we use a data set stored on the UCLA IDRE website, loaded with the following code:

hsb <- read.csv("https://stats.idre.ucla.edu/stat/data/hsbdemo.csv")

This data set contains demographic and academic data for 200 high school students. We will be using the following variables:

  • read, write, math, science: academic test scores

  • female: gender, factor with levels “female” and “male”

  • honors: enrollment in honors program, factor with 2 levels “enrolled” and “not enrolled”

  • ses: socioeconomic status, factor with 3 levels, “low”, “middle”, “high”

  • schtyp: school type, factor with 2 levels, “private” and “public”

Tasks:

1. Create a scatter plot of math(x) vs read(y), with different shapes by prog. Color all of the points red.

2. Find the outlier at math=45, read=63. Add annotation text next to this outlier that says “error?”

3. Create a bar graph that displays the counts of students that fall into groups made up of the following 4 variables: female, prog, schtyp, ses.

From such a graph we can know, for example, how many female students in the academic program who go to public school who are of high socioeconomic status are in the data set.

4.Try to recreate this graph:

Note that the background has been entirely removed and that the axis and legend titles are red and in “mono” font.

Solution:

ggplot(hsb, aes(x=math, y=read, shape=prog)) +
  geom_point(color="red") +
  annotate("text", x=35, y=64, label="error?")

ggplot(hsb,aes(x=female))+
  geom_bar()

ggplot(hsb, aes(x=female, fill=prog)) +   
  geom_bar(position="dodge", width=.5) +   
  facet_grid(schtyp ~ ses)

#4.
# reg = lm(write~read,data = hsb)
# preddata <- predict(reg, interval="confidence")
# hsb = cbind(hsb,preddata)  # merge the predictions into the data set
ggplot(hsb,aes(x=read,y=write,color=math))+
  geom_point()+
  geom_smooth(color="red")+
  labs(x="Reading Score",y="Writing Score",color="Math Score")+
  theme(title = element_text(family = "mono",color = "red"),
        panel.background = element_blank())
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'