Note: This document summarizes the content of the Introduction to ggplot2 seminar from the UCLA website.
First, load the R packages used in the seminar:
library(pacman)
p_load(ggplot2,MASS,tidyverse)
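The basic elements of the grammar of graphics:
The basic syntax of the ggplot2 package is:
ggplot() # note: the function is ggplot(), not to be confused with the package name "ggplot2"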
ggplot(dataset, aes(x=xvar, y=yvar)) + geom_function()
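As a minimal sketch of the template above, the following uses R's built-in mtcars data (an illustrative choice that is not part of the seminar). It shows that the call returns an ordinary plot object, which can be stored, inspected, and extended before it is drawn.

```r
library(ggplot2)

# Template: ggplot(dataset, aes(x = xvar, y = yvar)) + geom_function()
# mtcars ships with R and is used here only for illustration.
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# p is a plot object; typing p (or print(p)) at the console draws it.
# Each geom_*() call contributes one layer to the object.
n_layers <- length(p$layers)
n_layers
```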
Create a scatter plot:
# scatter plot of volume vs sales
ggplot(txhousing, aes(x=volume, y=sales)) +
geom_point()
## Warning: Removed 568 rows containing missing values (`geom_point()`).
Without a geom_*() layer, only the x and y axes are drawn:
ggplot(txhousing, aes(x=volume, y=sales))
Layers are added one at a time with "+"; they include geoms, stats, scales, faceting, and themes.
Enrich the plot above by adding a rug layer:
ggplot(txhousing, aes(x=volume, y=sales)) + # x=volume and y=sales inherited by all layers
geom_point() +
geom_rug(aes(color=median)) # color will only apply to the rug plot because not specified in ggplot()
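The inheritance rule described in the comments above can be checked by inspecting the plot object. This is a small sketch that relies on the `mapping` and `layers` fields of a ggplot object; note that ggplot2 stores the `color` aesthetic internally under the name `colour`.

```r
library(ggplot2)

p <- ggplot(txhousing, aes(x = volume, y = sales)) +
  geom_point() +
  geom_rug(aes(color = median))

# Plot-level mappings: inherited by every layer
plot_aes <- names(p$mapping)

# Mappings declared on the rug layer only (layer 2)
rug_aes <- names(p$layers[[2]]$mapping)

# geom_point() (layer 1) declares no mapping of its own
point_aes_n <- length(p$layers[[1]]$mapping)

plot_aes
rug_aes
point_aes_n
```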
aes() arguments: commonly used aesthetics
x: positioning along x-axis
y: positioning along y-axis
color: color of objects; for 2-d objects, the color of the object's outline (compare to fill below)
fill: fill color of objects
linetype: how lines should be drawn (solid, dashed, dotted, etc.)
shape: shape of markers in scatter plots
size: how large objects appear
alpha: transparency of objects (value between 0, transparent, and 1, opaque; the inverse of how many stacked objects it takes to be opaque)
Add some of these aesthetics to refine the scatter plot produced above:
# mapping color to median inside of aes()
ggplot(txhousing, aes(x=volume, y=sales)) +
geom_point(aes(color=median))
## Warning: Removed 568 rows containing missing values (`geom_point()`).
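Mapping an aesthetic inside aes() differs from setting it to a constant outside aes(). The sketch below builds the two kinds of layer and inspects what each one stores; the `aes_params` field is an internal detail of ggplot2 layers and is used here only to make the difference visible.

```r
library(ggplot2)

# Mapping: color varies with the data, so it goes inside aes()
mapped <- geom_point(aes(color = median))

# Setting: one fixed color for every point, so it goes outside aes()
fixed <- geom_point(color = "green")

mapped_aes <- names(mapped$mapping)     # a data-driven colour mapping
fixed_color <- fixed$aes_params$colour  # a constant stored on the layer

mapped_aes
fixed_color
```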
# load the Sitka dataset (from the MASS package) for the exercises
data(Sitka)
This dataset describes the growth of trees over time, with some of the trees grown in ozone-enriched chambers. Variables:
size: numeric, log of size (height times diameter squared)
Time: numeric, time of measurement (days since January 1, 1988)
tree: integer, tree id
treat: factor, treatment group, 2 levels: "control" and "ozone"
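Exercise:
A. Create a scatter plot of Time vs size to view the growth of trees over time.
B. Color the scatter plot points by the variable treat.
C. Add an additional geom_smooth() (loess) layer to the graph.
D. SET the color of the loess smooth to "green" rather than have it colored by treat. Why is there only one smoothed curve now?
Solution (note: method = "lm" fits a straight line; omit the method argument to get the default loess smooth that the exercise asks for):
# color and fill are mapped to treat in ggplot(), so the smooth is fitted separately for each treat group
ggplot(Sitka, aes(x = Time, y = size, color = treat, fill = treat)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'
# color is SET to "green" on the smooth layer, which overrides the mapping to treat for that layer; the layer is no longer grouped by treat, so only one curve is drawn
ggplot(Sitka, aes(x = Time, y = size, color = treat)) +
  geom_point() +
  geom_smooth(method = "lm", color = "green")
## `geom_smooth()` using formula = 'y ~ x'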
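The answer to question D can be verified by counting the groups that the smoothing layer was fitted to. This sketch uses layer_data() to pull the computed data for layer 2; a straight-line fit is used only to keep the example quick, and the grouping behaves the same way for a loess smooth. The helper `n_groups` is a hypothetical name introduced for illustration.

```r
library(ggplot2)
data(Sitka, package = "MASS")

base <- ggplot(Sitka, aes(x = Time, y = size, color = treat)) +
  geom_point()

# Smooth inherits color = treat, so it is grouped by treat
by_treat <- base + geom_smooth(method = "lm", formula = y ~ x)

# Setting color on the layer removes the mapping, and the grouping with it
one_curve <- base + geom_smooth(method = "lm", formula = y ~ x, color = "green")

# Hypothetical helper: number of groups in the smoothing layer (layer 2)
n_groups <- function(p) length(unique(layer_data(p, 2)$group))

groups_mapped <- n_groups(by_treat)
groups_set <- n_groups(one_curve)
c(mapped = groups_mapped, set = groups_set)
```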
Geom functions draw different geometric shapes from the dataset.
Commonly used geoms:
geom_bar(): bars with bases on the x-axis
geom_boxplot(): boxes-and-whiskers
geom_errorbar(): T-shaped error bars
geom_density(): density plots
geom_histogram(): histogram
geom_line(): lines
geom_point(): points (scatterplot)
geom_ribbon(): bands spanning y-values across a range of x-values
geom_smooth(): smoothed conditional means (e.g. loess smooth)
geom_text(): text
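Drawing common plot types:
1. Histogram: the distribution of a single continuous variable
ggplot(txhousing, aes(x=median)) +
  geom_histogram(color="steelblue", fill="lightblue") # SET colors for this geom only: color is the outline color, fill is the fill color
2. Density plot: a smoothed version of the histogram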
ggplot(txhousing, aes(x=median)) +
geom_density()
## Warning: Removed 616 rows containing non-finite values (`stat_density()`).
Density plots can be drawn by group:
ggplot(txhousing, aes(x=median, color=factor(month))) +
# month is numeric, so factor() converts it into a grouping variable first (similar to the encode command in Stata)
geom_density()
## Warning: Removed 616 rows containing non-finite values (`stat_density()`).
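To see why factor() is needed, the sketch below counts the groups in the density layer with and without the conversion. It assumes that ggplot2 forms groups only from discrete variables, so a numeric month leaves all rows in a single group. `n_groups` is a hypothetical helper, and suppressWarnings() hides the message ggplot2 may emit when it drops the continuous color mapping.

```r
library(ggplot2)

# Numeric month: no discrete variable, so every row falls in one group
p_numeric <- ggplot(txhousing, aes(x = median, color = month)) +
  geom_density(na.rm = TRUE)

# factor(month): one group, and therefore one curve, per month
p_factor <- ggplot(txhousing, aes(x = median, color = factor(month))) +
  geom_density(na.rm = TRUE)

# Hypothetical helper: number of groups in the first layer
n_groups <- function(p) {
  length(unique(suppressWarnings(layer_data(p))$group))
}

groups_numeric <- n_groups(p_numeric)
groups_factor <- n_groups(p_factor)
c(numeric = groups_numeric, factor = groups_factor)
```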
3. Box plot: shows how the y variable is distributed at each value of the x variable
ggplot(txhousing, aes(x=factor(year), y=median)) +
  geom_boxplot() # distribution of the median sale price in each year
## Warning: Removed 616 rows containing non-finite values (`stat_boxplot()`).
4. Bar plot: counts the frequency of a categorical variable
ggplot(diamonds, aes(x=cut, fill=clarity)) +
geom_bar()
5. Scatter plot: shows how two variables vary together
ggplot(txhousing, aes(x=volume, y=sales,
  color=median, alpha=listings, size=inventory)) + # map further variables to enrich the plot
geom_point()
## Warning: Removed 1468 rows containing missing values (`geom_point()`).
6. Line graph
ggplot(txhousing, aes(x=date, y=sales, color=city)) +
geom_line()
## Warning: Removed 430 rows containing missing values (`geom_line()`).
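As a further example, visualize the trend of China's provincial digital inclusive finance index (read from a local Stata file):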
library(haven)
merge6 <- read_dta("~/Desktop/论文/小论文-数字经济与农户收入/数据/目前可用的/merge6.dta")
ggplot(merge6, aes(x=year,y=index_aggregate,color=prov_name_eng)) +
geom_line()
## Warning: Removed 434 rows containing missing values (`geom_line()`).
Exercise: We will be using the Sitka data set again for this exercise.
A. Using 2 different geoms, compare the distribution of size between the two levels of treat. Use a different color for each distribution.
B. Use a bar plot to visualize the crosstabulation of Time and treat. Put Time on the x-axis.
C. Create a line graph of size over Time, with separate lines by tree and lines colored by treat.
D. Imagine you plan to submit your line graph of size over time by tree to a journal that does not print color graphs. How else can you distinguish between the 2 treatments on the graph?
Solution:
A. A histogram and a density plot of size, colored by treat:
ggplot(Sitka,aes(x=size,fill=treat)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(Sitka,aes(x=size,color=treat)) +
geom_density()
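B. A bar plot of the Time by treat crosstabulation, with Time on the x-axis (base R barplot() puts the table's columns on the x-axis, so Time must be the column variable):
tab <- table(Sitka$treat, Sitka$Time)
barplot(tab,
        xlab = "Time",
        ylab = "frequency",
        col = c("green", "blue"),
        legend.text = TRUE) # each bar is stacked by treat
C. A line graph of size over Time, one line per tree, colored by treat:
ggplot(Sitka, aes(x=Time, y=size, group=tree)) +
  geom_line(aes(color=treat))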
D. If the journal requires black-and-white figures, linetype can distinguish the groups instead of color:
ggplot(Sitka,aes(x=Time,y=size,group=tree))+
geom_line(aes(linetype=treat))
---
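Stats, Scales, Coordinate Systems, and Faceting
Stat functions statistically transform the data, usually as some form of summary such as a mean, a standard deviation, or a confidence interval.
Each stat function is associated with a default geom, so no separate geom is needed to draw the result.
# using the txhousing dataset, summarize sales (y) for each year (x)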
ggplot(txhousing,aes(x=year,y=sales))+
stat_summary()
## Warning: Removed 568 rows containing non-finite values (`stat_summary()`).
## No summary function supplied, defaulting to `mean_se()`
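The message above says that stat_summary() falls back to mean_se(). The following sketch, on a small made-up vector, shows what that function returns (the mean, and the mean minus and plus one standard error) and reproduces the numbers by hand.

```r
library(ggplot2)

# Made-up values, used only to illustrate the summary
x <- c(2, 4, 6, 8)

# Default summary of stat_summary(): y = mean, ymin/ymax = mean -/+ 1 SE
s <- mean_se(x)
s

# The same quantities computed by hand
se <- sd(x) / sqrt(length(x))
by_hand <- c(y = mean(x), ymin = mean(x) - se, ymax = mean(x) + se)
by_hand
```

stat_summary() draws these three values with its default geom, a point with a vertical range.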
Create a new plot where x is mapped to Time and y is mapped to size. Then, add a stat_summary() layer.
ggplot(Sitka,aes(x=Time,y=size))+
stat_summary()
## No summary function supplied, defaulting to `mean_se()`
Here is a color scale that ggplot2 chooses for us:
ggplot(diamonds, aes(x=carat, y=price, color=cut)) + # color is mapped to cut
geom_point()
Use scale_color_manual() to choose the colors yourself through its values argument:
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
scale_color_manual(values=c("red", "yellow", "green", "blue", "violet"))
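An unnamed values vector is matched to the factor levels by position. As a hedged alternative, a named vector states explicitly which level gets which color; the sketch below assumes the same five colors as above, and `cut_colors` is a name introduced for illustration.

```r
library(ggplot2)

# The levels of cut, in the order an unnamed vector would be matched to
cut_levels <- levels(diamonds$cut)
cut_levels

# Named vector: each color is tied to a level by name, not by position
cut_colors <- c("Fair" = "red", "Good" = "yellow", "Very Good" = "green",
                "Premium" = "blue", "Ideal" = "violet")

p <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  scale_color_manual(values = cut_colors)
```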
Scale functions for the y-axis
Add scale_y_continuous() to adjust the breaks (tick positions) on the y-axis.
Before:
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
scale_color_manual(values=c("red", "yellow", "green", "blue", "violet"))
At this point the y-axis ticks are at 0, 5000, 10000, and 15000.
After:
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) +
scale_y_continuous(breaks=c(0,2500,5000,7500,10000,12500,15000,17500)) # set the y-axis breaks
Next, change the y-axis unit from dollars to thousands of dollars: supply a label for each break, then retitle the y-axis "price(thousands of dollars)".
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) +
scale_y_continuous(breaks=c(0,2500,5000,7500,10000,12500,15000,17500),
labels = c(0,1.5,5,7.5,10,12.5,15,17.5), # labels are matched to breaks by position; 1.5 is deliberately paired with 2500 to show this (a correct axis would use 2.5)
name = "price(thousands of dollars)")
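Typing labels by hand makes it easy for a label to drift away from its break, as the deliberate 1.5 above shows. One way to avoid that, sketched here, is to compute the labels from the breaks; `price_breaks` is a name introduced for illustration.

```r
library(ggplot2)

# Breaks in dollars
price_breaks <- seq(0, 17500, by = 2500)

# Labels derived from the breaks, so they cannot get out of step
price_labels <- price_breaks / 1000

p <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  scale_y_continuous(breaks = price_breaks,
                     labels = price_labels,
                     name = "price (thousands of dollars)")

price_labels
```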
Scale functions for the x-axis
Restrict the x-axis to a range of values and change the axis titles:
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
xlim(c(0,3)) + # keep only carat values between 0 and 3
labs(x="CARAT", y="PRICE", color="Diamond_CUT", title="CARAT vs PRICE by CUT") # retitle the x-axis, y-axis, and legend, and add a plot title
## Warning: Removed 32 rows containing missing values (`geom_point()`).
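The warning appears because xlim() discards observations outside the range before drawing. If the aim is only to zoom in, coord_cartesian() crops the view and keeps every row. The sketch below contrasts the two; it is offered as an aside rather than as part of the seminar's solution.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()

# xlim(): values outside 0-3 are treated as missing and dropped
p_lim <- base + xlim(c(0, 3))

# coord_cartesian(): all rows are kept; only the visible window changes
p_zoom <- base + coord_cartesian(xlim = c(0, 3))

# Rows that fall outside the range and are removed by xlim()
n_outside <- sum(diamonds$carat < 0 | diamonds$carat > 3)
n_outside

# Rows still present in the zoomed plot's data
n_zoom <- nrow(layer_data(p_zoom))
n_zoom
```

The distinction matters most for layers that compute statistics, such as geom_smooth(), because dropped rows change the fit while a cropped view does not.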
All the plots above show, by default, a legend that acts as the guide for a scale; guides() can remove it.
Guides for each scale can be set scale-by-scale with the guide argument, or en masse with guides().
ggplot(diamonds, aes(x=carat, y=price, color=cut)) +
geom_point() +
labs(x="CARAT", y="PRICE",title="CARAT vs PRICE by CUT")+
guides(color="none") # remove the legend for the color mapped to cut
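Coordinate systems define the plane on which objects are positioned in the plot; most plots use the Cartesian coordinate system.
Faceting functions split the plot into small panels, one per subset of the faceting variable(s). This section covers facet_wrap() and facet_grid().
facet_wrap() function:
ggplot(diamonds, aes(x=carat, y=price)) +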
geom_point() +
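facet_wrap(~cut) # one panel per level of cut, wrapped into rows and columns
facet_grid() function: facet_grid() lets you specify directly which variables split the plot along rows and along columns. Put the row variable before ~ and the column variable after ~.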
ggplot(diamonds, aes(x=carat, y=price)) +
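geom_point(color = "steelblue", shape = 17, size = 0.5) + # set the color, shape, and size of the points
facet_grid(clarity~cut) # rows split by clarity, columns split by cut
With facet_grid(clarity~cut), each row is one clarity subset and each column is one cut subset of the diamonds, and each panel shows the scatter of price against carat for that subset.
These are the main uses of the facet functions; for more detailed facet syntax and options, see Facets for ggplot2 in R.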
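The difference between the two faceting functions shows up in the number and arrangement of panels. This sketch counts the panels each plot produces; the `ncol` argument is optional and is included only to show how facet_wrap() lays out its panels, and `n_panels` is a hypothetical helper.

```r
library(ggplot2)

base <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point()

# facet_wrap(): one panel per level of cut, wrapped into 2 columns
p_wrap <- base + facet_wrap(~cut, ncol = 2)

# facet_grid(): one row per clarity level, one column per cut level
p_grid <- base + facet_grid(clarity ~ cut)

# Hypothetical helper: number of panels that contain data
n_panels <- function(p) length(unique(layer_data(p)$PANEL))

panels_wrap <- n_panels(p_wrap)
panels_grid <- n_panels(p_grid)
c(wrap = panels_wrap, grid = panels_grid)
```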
Exercise: Use the Sitka data set.
A. Recreate the line plot of Time vs size, with the color of the lines mapped to treat. Use scale_color_manual() to change the colors to "orange" and "purple".
B. Use scale_x_continuous() to convert the x-axis from days to months. First, relabel the tick marks from (150,180,210,240) to (5,6,7,8). Then retitle the x-axis "time(months)".
C. Split the scatter plot into a panel of scatter plots by tree. (Note: Make the graph area large; graph may take a few seconds to appear)
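Solution:
# A. group = tree draws one line per tree; without it, geom_line() joins all trees of a treatment into a single zigzag
ggplot(Sitka, aes(x=Time, y=size, group=tree, color=treat)) +
  geom_line()
ggplot(Sitka, aes(x=Time, y=size, group=tree, color=treat)) +
  geom_line() +
  scale_color_manual(values=c("orange","purple")) # change the line colors
# B. relabel the x-axis from days to months
ggplot(Sitka, aes(x=Time, y=size, group=tree, color=treat)) +
  geom_line() +
  scale_color_manual(values=c("orange","purple")) + # change the line colors
  scale_x_continuous(breaks = c(150,180,210,240),
                     labels = c(5,6,7,8),
                     name = "time(months)") # change the x-axis ticks and title
# C. one scatter plot per tree, as the exercise asks
ggplot(Sitka, aes(x=Time, y=size)) +
  geom_point(color = "forestgreen") +
  scale_x_continuous(breaks = c(150,180,210,240),
                     labels = c(5,6,7,8)) + # change the x-axis ticks
  facet_wrap(~tree)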