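The most common time-dependent graph is the time series line plot. Other options include dumbbell charts and slope graphs.
A time series is a set of quantitative values obtained at successive time points. The interval between time points (e.g., hours, days, weeks, months, or years) is usually constant.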
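As a small illustration of the definition above, a regularly spaced monthly index can be built with base R's seq function. This is only a sketch with arbitrarily chosen dates, not part of any dataset used here. It also shows that "equal" monthly spacing is equal in calendar terms only: the gaps measured in days vary.

```r
# Build six month-start dates (illustrative values, chosen arbitrarily)
dates <- seq(as.Date("2020-01-01"), by = "month", length.out = 6)

# Gaps between successive time points, in days
gaps <- as.numeric(diff(dates))

print(dates)
print(gaps)
```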
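Consider the economics time series that ships with the ggplot2 package. It contains monthly US economic data collected from July 1967 to April 2015 (574 observations). Let's plot the personal savings rate (psavert). We can do this with a simple line plot.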
library(tidyverse)
## -- Attaching packages ---------------------- tidyverse 1.3.0 --
## √ ggplot2 3.3.0 √ purrr 0.3.3
## √ tibble 2.1.3 √ dplyr 0.8.5
## √ tidyr 1.0.2 √ stringr 1.4.0
## √ readr 1.3.1 √ forcats 0.5.0
## -- Conflicts ------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(DT)
economics %>% datatable()
economics %>% glimpse()
## Observations: 574
## Variables: 6
## $ date <date> 1967-07-01, 1967-08-01, 1967-09-01, 1967-10-01, 1967-11-0...
## $ pce <dbl> 506.7, 509.8, 515.6, 512.2, 517.4, 525.1, 530.9, 533.6, 54...
## $ pop <dbl> 198712, 198911, 199113, 199311, 199498, 199657, 199808, 19...
## $ psavert <dbl> 12.6, 12.6, 11.9, 12.9, 12.8, 11.8, 11.7, 12.3, 11.7, 12.3...
## $ uempmed <dbl> 4.5, 4.7, 4.6, 4.9, 4.7, 4.8, 5.1, 4.5, 4.1, 4.6, 4.4, 4.4...
## $ unemploy <dbl> 2944, 2945, 2958, 3143, 3066, 3018, 2878, 3001, 2877, 2709...
economics %>%
ggplot(aes(date,psavert)) +
geom_line()
economics %>%
ggplot(aes(date,psavert)) +
geom_line() +
scale_y_continuous(breaks = seq(2,20,2)) +
scale_x_date(date_breaks = "2 year") +
labs(title = "Personal Savings Rate",
x = "Date",
y = "Personal Savings Rate") +
theme(axis.text.x = element_text(angle = 90,vjust = 0.5),
plot.title = element_text(hjust = 0.5))
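The scale_x_date function can be used to reformat dates. In the graph below, tick marks appear every 5 years and dates are displayed in MM-YYYY format (the "%m-%Y" format string).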
ggplot(economics, aes(x = date, y = psavert)) +
geom_line(color = "indianred3",
size=1 ) +
geom_smooth() +
scale_x_date(date_breaks = '5 years',date_labels = "%m-%Y") +
labs(title = "Personal Savings Rate",
subtitle = "1967 to 2015",
x = "",
y = "Personal Savings Rate") +
theme(axis.text.x = element_text(angle = 90,hjust = 0.5,vjust = 0.5),
plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
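The date_labels argument takes strftime-style format codes, so a format string can be previewed with base R's format function before it goes into a plot. The snippet below is a minimal sketch; the date is an arbitrary example, and the variable names are made up for illustration.

```r
# An arbitrary example date
d <- as.Date("1970-01-01")

# Two-digit month and four-digit year, as used in the plot above
numeric_label <- format(d, "%m-%Y")

# Abbreviated month name and two-digit year; the month name depends on the locale
short_label <- format(d, "%b-%y")

print(numeric_label)
print(short_label)
```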
When plotting time series, make sure the date variable is of class Date rather than class character. See the section on date values for details.
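A quick way to check and fix the class is sketched below. The data frame df_chr and its values are hypothetical, invented only to show the conversion; the format string assumes dates written as year-month-day.

```r
# Hypothetical data frame with dates stored as text (values invented)
df_chr <- data.frame(
  date  = c("2020-01-01", "2020-02-01", "2020-03-01"),
  value = c(1.2, 1.5, 1.1),
  stringsAsFactors = FALSE
)

# Dates read from a file often arrive as character strings
class_before <- class(df_chr$date)

# Convert to class Date so that scale_x_date() can be applied
df_chr$date <- as.Date(df_chr$date, format = "%Y-%m-%d")
class_after <- class(df_chr$date)

print(class_before)
print(class_after)
```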
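Let's close this section with a multivariate time series (more than one series). We'll compare closing prices for Apple and Facebook from January 1, 2018 to the present.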
library(quantmod)
library(dplyr)
# get apple (AAPL) closing prices
apple <- getSymbols("AAPL",
return.class = "data.frame",
from="2018-01-01")
apple <- AAPL %>%
mutate(Date = as.Date(row.names(.))) %>%
select(Date, AAPL.Close) %>%
rename(Close = AAPL.Close) %>%
mutate(Company = "Apple")
# get facebook (FB) closing prices
# NOTE: Facebook's ticker changed from FB to META in 2022, so "FB" may no longer resolve
facebook <- getSymbols("FB",
return.class = "data.frame",
from="2018-01-01")
facebook <- FB %>%
mutate(Date = as.Date(row.names(.))) %>%
select(Date, FB.Close) %>%
rename(Close = FB.Close) %>%
mutate(Company = "Facebook")
# combine data for both companies
mseries <- rbind(apple, facebook)
mseries %>% datatable()
# plot data
library(ggplot2)
ggplot(mseries,
aes(x=Date, y= Close, group = Company,color=Company)) +
geom_line(size=1) +
scale_x_date(date_breaks = '1 months',expand = c(0,0)) +
scale_y_continuous(limits = c(130, 230),
breaks = seq(130, 230, 10),
labels = scales::dollar) +
labs(title = "NASDAQ Closing Prices",
caption = "source: Yahoo Finance",
y = "Closing Price") +
scale_color_brewer(palette = "Dark2") +
theme(axis.text.x = element_text(angle = 90,hjust = 0.5,vjust = 0.5),
plot.title = element_text(hjust = 0.5))
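Dumbbell charts are useful for displaying change between two time points for several groups or observations. The geom_dumbbell function from the ggalt package is used.
Using the gapminder dataset, let's plot the change in life expectancy from 1952 to 2007 in the Americas. The dataset is in long format. To create the dumbbell plot, we need to convert it to wide format.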
pacman::p_load(ggalt)
library(tidyr)
library(dplyr)
# load data
data(gapminder, package = "gapminder")
# subset data
plotdata_long <- filter(gapminder,
continent == "Americas" &
year %in% c(1952, 2007)) %>%
select(country, year, lifeExp)
plotdata_long
## # A tibble: 50 x 3
## country year lifeExp
## <fct> <int> <dbl>
## 1 Argentina 1952 62.5
## 2 Argentina 2007 75.3
## 3 Bolivia 1952 40.4
## 4 Bolivia 2007 65.6
## 5 Brazil 1952 50.9
## 6 Brazil 2007 72.4
## 7 Canada 1952 68.8
## 8 Canada 2007 80.7
## 9 Chile 1952 54.7
## 10 Chile 2007 78.6
## # ... with 40 more rows
# convert data to wide format
plotdata_wide <- spread(plotdata_long, year, lifeExp)
names(plotdata_wide) <- c("country", "y1952", "y2007")
# create dumbbell plot
ggplot(plotdata_wide,
aes(y = country,x = y1952,xend = y2007)) + # life expectancy rose in every country
geom_dumbbell()
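The long-to-wide step can also be written with tidyr's pivot_wider, which the tidyr documentation recommends over the superseded spread. The sketch below assumes tidyr 1.0.0 or later and uses a toy data frame (toy_long, a hypothetical name) with four values copied from the printed tibble above. The names_prefix argument produces the y1952 and y2007 column names directly, so no separate renaming step is needed.

```r
library(tidyr)

# Toy long-format data (life expectancies copied from the printed tibble above)
toy_long <- data.frame(
  country = c("Argentina", "Argentina", "Bolivia", "Bolivia"),
  year    = c(1952, 2007, 1952, 2007),
  lifeExp = c(62.5, 75.3, 40.4, 65.6)
)

# One row per country, one column per year, with a "y" prefix on the new names
toy_wide <- pivot_wider(toy_long,
                        names_from   = year,
                        values_from  = lifeExp,
                        names_prefix = "y")
print(toy_wide)
```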
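The graph will be easier to read if the countries are sorted and the points are sized and colored. In the next graph, we'll sort by 1952 life expectancy, modify the line and point sizes, color the points, add titles and labels, and simplify the theme.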
# create dumbbell plot
ggplot(plotdata_wide,
aes(y = reorder(country, y1952), # order countries by 1952 life expectancy
x = y1952,
xend = y2007)) +
geom_dumbbell(size = 1.2,
size_x = 3,
size_xend = 3,
colour = "grey",
colour_x = "blue",
colour_xend = "red") +
theme_minimal() +
labs(title = "Change in Life Expectancy",
subtitle = "1952 to 2007",
x = "Life Expectancy (years)",
y = "")
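Slope graphs are useful when there are several groups and several time points. Let's plot life expectancy for six Central American countries in 1992, 1997, 2002, and 2007. We'll again use the gapminder data. To create the slope graph, we'll use the newggslopegraph function from the CGPfunctions package.
The newggslopegraph function takes the following arguments, in order: the data frame, the time variable (a factor), the numeric variable to plot, and the grouping variable (one line per group).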
pacman::p_load(CGPfunctions)
df <- gapminder %>%
filter(year %in% c(1992, 1997, 2002, 2007),
country %in% c("Panama", "Costa Rica", "Nicaragua",
"Honduras", "El Salvador", "Guatemala")) %>%
mutate(year = factor(year),
lifeExp = round(lifeExp))
df
## # A tibble: 24 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <fct> <dbl> <int> <dbl>
## 1 Costa Rica Americas 1992 76 3173216 6160.
## 2 Costa Rica Americas 1997 77 3518107 6677.
## 3 Costa Rica Americas 2002 78 3834934 7723.
## 4 Costa Rica Americas 2007 79 4133884 9645.
## 5 El Salvador Americas 1992 67 5274649 4444.
## 6 El Salvador Americas 1997 70 5783439 5155.
## 7 El Salvador Americas 2002 71 6353681 5352.
## 8 El Salvador Americas 2007 72 6939688 5728.
## 9 Guatemala Americas 1992 63 8486949 4439.
## 10 Guatemala Americas 1997 66 9803875 4684.
## # ... with 14 more rows
# create slope graph
newggslopegraph(df, year, lifeExp, country) +
labs(title="Life Expectancy by Country",
subtitle="Central America",
caption="source: gapminder") +
theme_gray()
##
## Converting 'year' to an ordered factor
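In the graph above, Costa Rica has the highest life expectancy in every year studied. Guatemala has the lowest, and caught up with Honduras (also low, at 69) in 2002.
A simple area chart is basically a line plot with the region between the line and the x-axis filled in.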
ggplot(economics, aes(x = date, y = psavert)) +
geom_line() +
theme_gray() +
labs(title = "Personal Savings Rate",
x = "Date",
y = "Personal Savings Rate") +
scale_x_date(expand = c(0,0)) +
scale_y_continuous(expand = c(0,0))
# basic area chart
ggplot(economics, aes(x = date, y = psavert)) +
geom_area(fill="lightblue", color="black") +
labs(title = "Personal Savings Rate",
x = "Date",
y = "Personal Savings Rate") +
scale_x_date(expand = c(0,0)) +
scale_y_continuous(expand = c(0,0))
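A stacked area chart can be used to show differences between groups over time. Consider the uspopage dataset from the gcookbook package. We'll plot the age distribution of the US population from 1900 to 2002.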
pacman::p_load(gcookbook)
uspopage %>% datatable()
# stacked area chart
data(uspopage, package = "gcookbook")
ggplot(uspopage, aes(x = Year,
y = Thousands)) +
geom_line(aes(col = AgeGroup,linetype = AgeGroup),size = 1.5 ) +
scale_color_brewer(palette = "Set2")
ggplot(uspopage, aes(x = Year,
y = Thousands,
fill = AgeGroup)) +
geom_area() +
labs(title = "US Population by age",
x = "Year",
y = "Population in Thousands")
data(uspopage, package = "gcookbook")
uspopage %>%
filter(AgeGroup %in% c('5-14')) %>%
ggplot(aes(x = Year,y = Thousands)) +
geom_area(fill = "steelblue") +
labs(title = "US Population by age",
x = "Year",
y = "Population in Thousands") +
scale_x_continuous(expand = c(0,0)) +
scale_y_continuous(expand = c(0,0))
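It is best to avoid scientific notation in graphs. How likely is it that the average reader knows that 3e+05 means 300,000? It is easy to change the scale in ggplot2: simply divide the Thousands variable by 1000 and report it as millions.
The fct_rev function from the forcats package can be used to reverse the levels of the AgeGroup variable.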
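Both ideas can be tried on small values before they go into the plot. The snippet below is a sketch: the population figure and the three age labels are hypothetical stand-ins, not values taken from uspopage, and the variable names are made up for illustration.

```r
# Write 3e+05 out in full, with thousands separators
plain_label <- format(3e5, big.mark = ",", scientific = FALSE)

# Rescaling: a count recorded in thousands, reported in millions
thousands <- 300000            # hypothetical value: 300,000 thousand people
millions  <- thousands / 1000  # the same count expressed in millions

# Reversing the level order of a factor (hypothetical age labels)
age <- factor(c("<5", "5-14", "15-24"),
              levels = c("<5", "5-14", "15-24"))
reversed_levels <- levels(forcats::fct_rev(age))

print(plain_label)
print(millions)
print(reversed_levels)
```

Reversing the levels changes only the stacking and legend order; the underlying data are untouched.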
# stacked area chart
data(uspopage, package = "gcookbook")
ggplot(uspopage, aes(x = Year,
y = Thousands/1000,
fill = forcats::fct_rev(AgeGroup))) + # reverse the factor levels
geom_area(color = "black") + # black outline around each area
labs(title = "US Population by age",
subtitle = "1900 to 2002",
caption = "source: U.S. Census Bureau, 2003, HS-3",
x = "Year",
y = "Population in Millions",
fill = "Age Group") +
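scale_fill_brewer(palette = "Set2") # change the color palette
Clearly, the number of children has not changed much over the past 100 years. Stacked area charts are most useful when interest is in both (1) group change over time and (2) overall change over time. Place the most important groups at the bottom; they are the easiest to interpret in this type of plot.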