Introduction to ggplot2: Study Notes

Note: this document records the overall content of the Introduction to ggplot2 seminar on the UCLA website.

2023-8-3 study notes:

1. Basics

First, load the R packages used in these notes:

library(pacman)
p_load(ggplot2,MASS,tidyverse)

Next, get to know the basic elements of the plotting grammar.

2. Basic syntax

The basic syntax of the ggplot2 package is:

ggplot() # note: the function is ggplot(), while the package is named "ggplot2"
ggplot(dataset, aes(x=xvar, y=yvar)) + geom_function() 

Make a scatter plot:

# scatter plot of volume vs sales 
ggplot(txhousing, aes(x=volume, y=sales)) +   
  geom_point()
## Warning: Removed 568 rows containing missing values (`geom_point()`).

If the geom layer is omitted, only the x and y axes are drawn:

ggplot(txhousing, aes(x=volume, y=sales))
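The layering idea can be checked with a minimal sketch. It assumes ggplot2 3.x, where a plot object keeps its layers in a list reachable as p$layers; that field is an implementation detail and not part of the seminar, so treat it as an assumption.

```r
library(ggplot2)

# Base plot: data and aesthetic mappings only, so just the axes are drawn
p_axes <- ggplot(txhousing, aes(x = volume, y = sales))

# Adding a geom appends one layer to the stored plot object
p_points <- p_axes + geom_point()

# Layer counts before and after (p$layers is an internal field; assumption)
n_layers_axes <- length(p_axes$layers)
n_layers_points <- length(p_points$layers)
print(c(axes_only = n_layers_axes, with_points = n_layers_points))
```

Storing the base plot in a variable also makes it easy to try several geoms on the same data and mappings.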

Plot components can be added one at a time with "+", including geoms, stats, scales, faceting, and themes.

Enrich the plot above by adding a rug layer:

ggplot(txhousing, aes(x=volume, y=sales)) +     # x=volume and y=sales inherited by all layers     
  geom_point() +   
  geom_rug(aes(color=median))   # color will only apply to the rug plot because not specified in ggplot()

3. Aesthetics (aes)

Commonly used aesthetics:

  • x: positioning along x-axis

  • y: positioning along y-axis

  • color: color of objects; for 2-d objects, the color of the object’s outline (compare to fill below)

  • fill: fill color of objects

  • linetype: how lines should be drawn (solid, dashed, dotted, etc.)

  • shape: shape of markers in scatter plots

  • size: how large objects appear

  • alpha: transparency of objects (value between 0, transparent, and 1, opaque – inverse of how many stacked objects it will take to be opaque)

Add an aesthetic mapping to enrich the scatter plot produced above:

# mapping color to median inside of aes() 
ggplot(txhousing, aes(x=volume, y=sales)) +   
  geom_point(aes(color=median))
## Warning: Removed 568 rows containing missing values (`geom_point()`).
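The difference between mapping an aesthetic and setting it comes up again in the exercise below, so here is a small sketch of both forms. The color "steelblue" is an arbitrary choice for illustration.

```r
library(ggplot2)

# Mapped: color varies with the data, so it goes inside aes() and gets a legend
p_mapped <- ggplot(txhousing, aes(x = volume, y = sales)) +
  geom_point(aes(color = median))

# Set: one constant color for every point, so it goes outside aes(); no legend
p_set <- ggplot(txhousing, aes(x = volume, y = sales)) +
  geom_point(color = "steelblue")
```

Printing p_mapped and p_set side by side shows a color gradient with a legend in the first plot and uniformly colored points in the second.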

Exercise:

# Load an example data set (Sitka ships with the MASS package)

data(Sitka)

This data set describes the growth of trees over time; some of the trees were grown in ozone-enriched chambers. Variables:

  • size: numeric, log of size (height times diameter^2)

  • Time: numeric, time of measurement (days since January 1, 1988)

  • tree: integer, tree id

  • treat: factor, treatment group, 2 levels=“control” and “ozone”

Questions:

A. Create a scatter plot of Time vs size to view the growth of trees over time.

B. Color the scatter plot points by the variable treat.

C. Add an additional geom_smooth() (loess) layer to the graph.

D. SET the color of the loess smooth to “green” rather than have it colored by treat. Why is there only one smoothed curve now?

ggplot(Sitka,aes(x = Time,y = size,color = treat,fill = treat)) +
  geom_point() + 
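  geom_smooth(method = lm)  # one fit per treat group; this is a linear fit, whereas omitting method gives the loess smooth the exercise asks for on a data set this small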
## `geom_smooth()` using formula = 'y ~ x'

ggplot(Sitka,aes(x = Time,y = size,color = treat)) +   
  geom_point() +    
  geom_smooth(method = lm,col = "green")  # color is set rather than mapped here, so the smooth is no longer grouped by treat
## `geom_smooth()` using formula = 'y ~ x'
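One way to see why question D yields a single curve is to count the groups that the smooth layer actually computes. This is a sketch, assuming layer_data() (a ggplot2 helper that returns a layer's computed data) behaves as in ggplot2 3.x; the variable names are made up for illustration.

```r
library(ggplot2)
data(Sitka, package = "MASS")

# Color mapped to treat for every layer: the smooth is fitted per group
p_grouped <- ggplot(Sitka, aes(x = Time, y = size, color = treat)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Color set to a constant in the smooth layer: the mapping is overridden there
p_single <- ggplot(Sitka, aes(x = Time, y = size, color = treat)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, color = "green")

# layer_data() returns the computed data of a layer; layer 2 is the smooth
n_curves_grouped <- length(unique(layer_data(p_grouped, 2)$group))
n_curves_single <- length(unique(layer_data(p_single, 2)$group))
print(c(grouped = n_curves_grouped, single = n_curves_single))
```

Grouping is derived from the discrete variables mapped in a layer. Once color is a constant in geom_smooth(), no discrete mapping is left in that layer, so all observations fall into one group.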

4. Geometric objects (geoms)

Geom functions draw different geometric shapes from the data set.

Commonly used geoms:

  • geom_bar(): bars with bases on the x-axis

  • geom_boxplot(): boxes-and-whiskers

  • geom_errorbar(): T-shaped error bars

  • geom_density(): density plots

  • geom_histogram(): histogram

  • geom_line(): lines

  • geom_point(): points (scatterplot)

  • geom_ribbon(): bands spanning y-values across a range of x-values

  • geom_smooth(): smoothed conditional means (e.g. loess smooth)

  • geom_text(): text

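Common plot types:

  1. Histogram: shows the distribution of a continuous variable

ggplot(txhousing, aes(x=median)) +    
  geom_histogram(color="steelblue",fill="lightblue")  # colors are set, not mapped: color is the bar outline, fill is the bar interior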

  2. Density plot: a smoothed version of the histogram

ggplot(txhousing, aes(x=median)) +    
  geom_density()
## Warning: Removed 616 rows containing non-finite values (`stat_density()`).

Density plots can be drawn by group:

ggplot(txhousing, aes(x=median,color = factor(month))) +   
# month is numeric, so convert it with factor() before using it as a grouping variable (roughly comparable to the encode command in Stata)
  geom_density()
## Warning: Removed 616 rows containing non-finite values (`stat_density()`).
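The effect of factor() can be checked directly. The sketch below assumes layer_data() from ggplot2 3.x and counts how many separate density curves the grouped plot computes.

```r
library(ggplot2)

# month is stored as a number, so ggplot2 treats it as continuous by default
is_numeric_month <- is.numeric(txhousing$month)

# factor() turns it into a discrete variable with one level per month
p_by_month <- ggplot(txhousing, aes(x = median, color = factor(month))) +
  geom_density(na.rm = TRUE)  # na.rm = TRUE silences the missing-value warning

# One density curve (group) per month
n_groups <- length(unique(layer_data(p_by_month)$group))
print(n_groups)
```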

  3. Box plot: shows the distribution of the y variable at each value of the x variable

ggplot(txhousing, aes(x=factor(year), y=median)) +    
  geom_boxplot()  # distribution of the median sale price in each year
## Warning: Removed 616 rows containing non-finite values (`stat_boxplot()`).

  4. Bar plot: counts the frequencies of a categorical variable

ggplot(diamonds, aes(x=cut, fill=clarity)) +    
  geom_bar()

  5. Scatter plot: shows how two variables vary together

ggplot(txhousing, aes(x=volume, y=sales,
                      color=median, alpha=listings, size=inventory)) +  # map more variables to enrich the plot
  geom_point() 
## Warning: Removed 1468 rows containing missing values (`geom_point()`).

  6. Line graph

ggplot(txhousing, aes(x=date, y=sales, color=city)) +    
  geom_line()
## Warning: Removed 430 rows containing missing values (`geom_line()`).

Try visualizing the trend of the provincial digital inclusive finance index in China:

library(haven) 
merge6 <- read_dta("~/Desktop/论文/小论文-数字经济与农户收入/数据/目前可用的/merge6.dta")

ggplot(merge6, aes(x=year,y=index_aggregate,color=prov_name_eng)) +
  geom_line()
## Warning: Removed 434 rows containing missing values (`geom_line()`).

Exercise:

We will be using the Sitka data set again for this exercise.

A. Using 2 different geoms, compare the distribution of size between the two levels of treat. Use a different color for each distribution.

B. Use a bar plot to visualize the crosstabulation of Time and treat. Put Time on the x-axis.

C. Create a line graph of size over Time, with separate lines by tree and lines colored by treat.

D. Imagine you plan to submit your line graph of size over time by tree to a journal that does not print color graphs. How else can you distinguish between the 2 treatments on the graph?

Solutions:

ggplot(Sitka,aes(x=size,fill=treat)) +
  geom_histogram() 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
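The message above suggests choosing a bin width explicitly. A minimal sketch follows; the width of 0.25 (in units of log size) and the overlaid, semi-transparent bars are arbitrary choices for illustration, not part of the seminar solution.

```r
library(ggplot2)
data(Sitka, package = "MASS")

# binwidth is given in units of the x variable (log size)
# position = "identity" overlays the two groups instead of stacking them,
# and alpha keeps both distributions visible where they overlap
p_hist <- ggplot(Sitka, aes(x = size, fill = treat)) +
  geom_histogram(binwidth = 0.25, position = "identity", alpha = 0.5)
```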

ggplot(Sitka,aes(x=size,color=treat)) +   
  geom_density() 

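tab = table(Sitka$treat,Sitka$Time)  # rows = treat, columns = Time, so Time ends up on the x-axis
barplot(tab,
        xlab = "Time",
        ylab = "Frequency",
        col = c("green","blue"),  # one color per treat level
        legend.text = TRUE)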
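The same cross-tabulation can also be drawn with ggplot2 itself. This is a sketch rather than the seminar's official answer; wrapping Time in factor() is an assumption made here so that each measurement day gets its own evenly spaced bar.

```r
library(ggplot2)
data(Sitka, package = "MASS")

# geom_bar() counts rows; fill = treat splits each bar by treatment group
p_bar <- ggplot(Sitka, aes(x = factor(Time), fill = treat)) +
  geom_bar() +
  labs(x = "Time", y = "count")

# The same counts as a plain cross-tabulation, for comparison
tab <- table(Sitka$treat, Sitka$Time)
print(tab)
```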

ggplot(Sitka,aes(x=Time,y=size,group=tree))+
  geom_line(aes(color=treat))

D. When a journal requires black-and-white figures, the groups can be distinguished by line type (linetype):

ggplot(Sitka,aes(x=Time,y=size,group=tree))+   
  geom_line(aes(linetype=treat))
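For print, the line types can also be chosen by hand and the gray background removed. The sketch below assumes the two treat levels are named "control" and "ozone" as in the variable description above; the specific line types and theme_bw() are illustrative choices.

```r
library(ggplot2)
data(Sitka, package = "MASS")

p_bw <- ggplot(Sitka, aes(x = Time, y = size, group = tree)) +
  geom_line(aes(linetype = treat)) +
  # named vector: each treat level gets an explicit line type
  scale_linetype_manual(values = c(control = "solid", ozone = "dashed")) +
  theme_bw()  # white background with black grid border
```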

---

2023-8-4

5. Stats, scales, coordinate systems, and facets

5.1 stat function

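Stat functions apply a statistical transformation to the data, usually some form of summary such as a mean, a standard deviation, or a confidence interval.

Each stat function is associated with a default geom, so no separate geom is needed to draw the result.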

# Using the txhousing data set, summarize sales (y) for each year (x)
ggplot(txhousing,aes(x=year,y=sales))+
  stat_summary()
## Warning: Removed 568 rows containing non-finite values (`stat_summary()`).
## No summary function supplied, defaulting to `mean_se()`
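To make the summary step concrete, the sketch below asks stat_summary() for the mean only and compares it with yearly means computed by hand. It assumes ggplot2 3.3.0 or later, where the argument is called fun, and uses layer_data() to read back what the stat computed.

```r
library(ggplot2)

# Mean of sales per year, drawn as a line instead of the default pointrange
p_mean <- ggplot(txhousing, aes(x = year, y = sales)) +
  stat_summary(fun = mean, geom = "line", na.rm = TRUE)

# The same yearly means computed by hand; aggregate() drops rows with NA sales
manual <- aggregate(sales ~ year, data = txhousing, FUN = mean)

# Values computed by the stat, sorted by year for comparison
computed <- layer_data(p_mean)
computed <- computed[order(computed$x), ]
print(all.equal(computed$y, manual$sales))
```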

Create a new plot where x is mapped to Time and y is mapped to size. Then, add a stat_summary() layer.

ggplot(Sitka,aes(x=Time,y=size))+
  stat_summary()
## No summary function supplied, defaulting to `mean_se()`

5.2 scale function

  • Scales define how data values are mapped to aesthetic values.

Here is a color scale that ggplot2 chooses for us:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) +  # color is assigned by cut
  geom_point()

The colors can be chosen by hand with the values= argument of scale_color_manual():

ggplot(diamonds, aes(x=carat, y=price, color=cut)) +    
  geom_point() +   
  scale_color_manual(values=c("red", "yellow", "green", "blue", "violet"))

  • Scale functions for the y-axis

Add scale_y_continuous() to adjust the breaks (tick positions) of the y-axis.

Before the adjustment:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + 
  geom_point() +
  scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) 

At this point the y-axis tick marks are at 0, 5000, 10000, and 15000.

After the adjustment:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + 
  geom_point() +
  scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) + 
  scale_y_continuous(breaks=c(0,2500,5000,7500,10000,12500,15000,17500))  # set the y-axis breaks

Next, change the unit on the y-axis of the plot above from dollars to thousands of dollars: give each break a matching label, then retitle the y-axis as price(thousands of dollars).

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + 
  geom_point() +
  scale_color_manual(values=c("red", "yellow", "green", "blue", "violet")) + 
  scale_y_continuous(breaks=c(0,2500,5000,7500,10000,12500,15000,17500),
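                     labels = c(0,2.5,5,7.5,10,12.5,15,17.5),  # labels are matched to breaks by position and never checked against them, so a wrong entry (say 1.5 for 2500) would be drawn exactly as typed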
                     name = "price(thousands of dollars)")
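Because hand-typed labels are not checked against the breaks, one option is to compute them. The labels argument of scale_y_continuous() also accepts a function of the breaks; to_thousands below is a hypothetical helper written for this sketch, not a ggplot2 function.

```r
library(ggplot2)

price_breaks <- seq(0, 17500, by = 2500)

# Hypothetical helper: convert dollar breaks to thousands of dollars
to_thousands <- function(breaks) breaks / 1000

p_price <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  scale_y_continuous(breaks = price_breaks,
                     labels = to_thousands,  # applied to the breaks when the plot is built
                     name = "price (thousands of dollars)")

# The labels that will be shown, one per break
print(to_thousands(price_breaks))
```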

  • Scale functions for the x-axis

Restrict the x-axis to a range and change the axis titles:

ggplot(diamonds, aes(x=carat, y=price, color=cut)) +    
  geom_point() +   
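  xlim(c(0,3))+  # keep only carat values between 0 and 3; rows outside the range are dropped
  labs(x="CARAT", y="PRICE", color="Diamond_CUT", title="CARAT vs PRICE by CUT")  # change the x-axis, y-axis, legend, and plot titles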
## Warning: Removed 32 rows containing missing values (`geom_point()`).
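The warning appears because xlim() changes the scale limits and discards rows outside them. If the goal is only to zoom in, coord_cartesian() is a common alternative that keeps every row. The sketch below contrasts the two; the count at the end should correspond to the removed rows reported in the warning above.

```r
library(ggplot2)

# xlim() sets scale limits: rows with carat outside 0-3 become NA and are dropped
p_limited <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  xlim(c(0, 3))

# coord_cartesian() only zooms the view: every row stays in the data
p_zoomed <- ggplot(diamonds, aes(x = carat, y = price, color = cut)) +
  geom_point() +
  coord_cartesian(xlim = c(0, 3))

# Number of rows outside the limits, i.e. the rows xlim() removes
n_outside <- sum(diamonds$carat < 0 | diamonds$carat > 3)
print(n_outside)
```

The distinction matters most when a stat is involved: with xlim() a smooth or box plot is computed from the remaining rows only, while with coord_cartesian() it is computed from all rows and then cropped.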

  • Guides (legends)

Every plot above shows a legend by default, which acts as the guide for a scale; it can be removed with guides().

Guides for each scale can be set scale-by-scale with the guide argument, or en masse with guides().

ggplot(diamonds, aes(x=carat, y=price, color=cut)) +    
  geom_point() +   
  labs(x="CARAT", y="PRICE",title="CARAT vs PRICE by CUT")+
  guides(color="none")  # remove the legend for the color mapped to cut

5.3 Coordinate system

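A coordinate system defines the plane in which objects are positioned on the plot. Most plots use the Cartesian coordinate system.

5.4 Faceting

Faceting functions split a plot into small panels, one per subset of the faceting variable(s). This part covers facet_wrap() and facet_grid().

  • facet_wrap():

ggplot(diamonds, aes(x=carat, y=price)) +    
  geom_point() +    
  facet_wrap(~cut) # one panel per level of cut, wrapped into rows and columns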

  • facet_grid():

facet_grid() lets you specify directly which variables split the plot along rows and columns. Put the row variable before ~ and the column variable after it.

ggplot(diamonds, aes(x=carat, y=price)) +       
  geom_point(color = "steelblue",shape = 17,size = 0.5) +  # set the color, shape, and size of the points
  facet_grid(clarity~cut)  # rows are split by clarity, columns by cut

The plot above uses facet_grid(clarity~cut) to show scatter plots of price against carat, with one row per clarity level and one column per cut level.
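The size of the grid follows from the number of levels of the two faceting variables, which is worth checking before drawing a large grid. A small sketch, using only the diamonds data shipped with ggplot2:

```r
library(ggplot2)

p_grid <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(size = 0.5) +
  facet_grid(clarity ~ cut)

# Rows come from clarity, columns from cut
n_rows <- nlevels(diamonds$clarity)
n_cols <- nlevels(diamonds$cut)
print(c(rows = n_rows, cols = n_cols, panels = n_rows * n_cols))
```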

That covers the main uses of the faceting functions; for more detailed facet syntax and options, see Facets for ggplot2 in R.

Exercise:

Use the Sitka data set.

A. Recreate the line plot of Time vs size, with the color of the lines mapped to treat. Use scale_color_manual() to change the colors to “orange” and “purple”.

B. Use scale_x_continuous() to convert the x-axis from days to months. First, relabel the tick marks from (150,180,210,240) to (5,6,7,8). Then retitle the x-axis “time(months)”.

C. Split the scatter plot into a panel of scatter plots by tree. (Note: Make the graph area large; graph may take a few seconds to appear)

Solutions:

ggplot(Sitka,aes(x=Time,y=size,color=treat,group=tree)) +  # group=tree draws one line per tree, as in the earlier line plot
  geom_line()

ggplot(Sitka,aes(x=Time,y=size,color=treat,group=tree)) +   
  geom_line()+
  scale_color_manual(values=c("orange","purple"))  # change the line colors

ggplot(Sitka,aes(x=Time,y=size,color=treat,group=tree)) +      
  geom_line()+   
  scale_color_manual(values=c("orange","purple")) +  # change the line colors
  scale_x_continuous(breaks = c(150,180,210,240),
                     labels = c(5,6,7,8),
                     name = "time(months)")  # relabel and retitle the x-axis

ggplot(Sitka,aes(x=Time,y=size))+
  geom_point(color = "forestgreen")+
  scale_x_continuous(breaks = c(150,180,210,240),
                     labels = c(5,6,7,8))+  # relabel the x-axis ticks in months
  facet_wrap(~tree)  # one panel per tree, as exercise C asks