Data Visualization with R: Chapter 6, Time Series Graphs

The most common time-dependent graph is the time series line graph. Other options include dumbbell charts and slope graphs.

6.1 Time series

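A time series is a set of quantitative values obtained at successive points in time. The interval between time points (for example hours, days, weeks, months, or years) is usually constant.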

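Consider the economics time series that ships with the ggplot2 package. It contains monthly US economic data collected from July 1967 to April 2015. Let's plot the personal savings rate (psavert). A simple line plot is enough for this.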

library(tidyverse)

## -- Attaching packages ---------------------- tidyverse 1.3.0 --

## √ ggplot2 3.3.0     √ purrr   0.3.3
## √ tibble  2.1.3     √ dplyr   0.8.5
## √ tidyr   1.0.2     √ stringr 1.4.0
## √ readr   1.3.1     √ forcats 0.5.0

## -- Conflicts ------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(DT)
economics %>% datatable()

economics %>% glimpse()

## Observations: 574
## Variables: 6
## $ date     <date> 1967-07-01, 1967-08-01, 1967-09-01, 1967-10-01, 1967-11-0...
## $ pce      <dbl> 506.7, 509.8, 515.6, 512.2, 517.4, 525.1, 530.9, 533.6, 54...
## $ pop      <dbl> 198712, 198911, 199113, 199311, 199498, 199657, 199808, 19...
## $ psavert  <dbl> 12.6, 12.6, 11.9, 12.9, 12.8, 11.8, 11.7, 12.3, 11.7, 12.3...
## $ uempmed  <dbl> 4.5, 4.7, 4.6, 4.9, 4.7, 4.8, 5.1, 4.5, 4.1, 4.6, 4.4, 4.4...
## $ unemploy <dbl> 2944, 2945, 2958, 3143, 3066, 3018, 2878, 3001, 2877, 2709...

economics %>% 
  ggplot(aes(date,psavert)) +
  geom_line()

economics %>% 
  ggplot(aes(date,psavert)) +
  geom_line() +
  scale_y_continuous(breaks = seq(2,20,2)) +
  scale_x_date(date_breaks = "2 year") +
  labs(title = "Personal Savings Rate",
       x = "Date",
       y = "Personal Savings Rate") +
  theme(axis.text.x = element_text(angle = 90,vjust = 0.5),
        plot.title = element_text(hjust = 0.5))

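The scale_x_date function can be used to reformat dates. In the graph below, tick marks appear every 5 years and dates are shown in numeric MM-YYYY format (the "%m-%Y" code). A smoothed trend line from geom_smooth is added as well.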

ggplot(economics, aes(x = date, y = psavert)) +
  geom_line(color = "indianred3", 
            size=1 ) +
  geom_smooth() +
  scale_x_date(date_breaks = '5 years',date_labels = "%m-%Y") +
  labs(title = "Personal Savings Rate",
       subtitle = "1967 to 2015",
       x = "",
       y = "Personal Savings Rate")  +
  theme(axis.text.x = element_text(angle = 90,hjust = 0.5,vjust = 0.5),
        plot.title = element_text(hjust = 0.5))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
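
The date_labels argument of scale_x_date takes strftime-style format codes. As a quick check outside of ggplot2, base R's format() can preview what a given code produces. The sample date below is arbitrary and chosen only for illustration; it is not a value read from the plot.

```r
# Preview strftime-style codes accepted by date_labels.
# The sample date is an assumption made for this sketch.
d <- as.Date("2015-01-01")

numeric_label <- format(d, "%m-%Y")  # two-digit month, four-digit year
short_year    <- format(d, "%y")     # two-digit year
month_abbr    <- format(d, "%b")     # abbreviated month name (locale-dependent)

print(numeric_label)
print(short_year)
print(month_abbr)
```

Combining the last two codes as "%b-%y" would give an abbreviated month name followed by a two-digit year, if that style is preferred for the axis.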

When plotting time series, make sure the date variable is of class Date and not class character. See the section on date values for details.
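
As a minimal sketch of that check, the snippet below builds a small hypothetical data frame (df_chr and its values are invented for illustration) whose dates arrive as character strings, then converts the column with base R's as.Date.

```r
# Hypothetical data frame whose dates were read in as character strings
df_chr <- data.frame(
  date  = c("2020-01-01", "2020-02-01", "2020-03-01"),
  value = c(10, 12, 11),
  stringsAsFactors = FALSE
)
class(df_chr$date)   # character

# Convert to class Date so that scale_x_date() can be applied to the axis
df_chr$date <- as.Date(df_chr$date, format = "%Y-%m-%d")
class(df_chr$date)   # Date
```

Once the column is of class Date, arithmetic on it works in days, which is one easy way to confirm that the conversion succeeded.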

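Let's close this section with a multivariate time series (more than one series). We will compare closing prices for Apple and Facebook from January 1, 2018 to the present.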

library(quantmod)
library(dplyr)

# get apple (AAPL) closing prices
apple <- getSymbols("AAPL", 
                    return.class = "data.frame", 
                    from="2018-01-01")

apple <- AAPL %>% 
  mutate(Date = as.Date(row.names(.))) %>%
  select(Date, AAPL.Close) %>%
  rename(Close = AAPL.Close) %>%
  mutate(Company = "Apple")

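# get facebook (FB) closing prices
# note: Meta Platforms has traded under the ticker META since June 2022,
# so "FB" may no longer resolve on Yahoo Finance
facebook <- getSymbols("FB", 
                       return.class = "data.frame", 
                       from="2018-01-01")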

facebook <- FB %>% 
  mutate(Date = as.Date(row.names(.))) %>%
  select(Date, FB.Close) %>%
  rename(Close = FB.Close) %>%
  mutate(Company = "Facebook")

# combine data for both companies
mseries <- rbind(apple, facebook)

mseries %>% datatable()

# plot data
library(ggplot2)
ggplot(mseries, 
       aes(x=Date, y= Close, group = Company,color=Company)) + 
  geom_line(size=1) +
  scale_x_date(date_breaks = '1 months',expand = c(0,0)) +
  scale_y_continuous(limits = c(130, 230), 
                     breaks = seq(130, 230, 10),
                     labels = scales::dollar) +
  labs(title = "NASDAQ Closing Prices",
       caption = "source: Yahoo Finance",
       y = "Closing Price") +
  scale_color_brewer(palette = "Dark2") +
  theme(axis.text.x = element_text(angle = 90,hjust = 0.5,vjust = 0.5),
        plot.title = element_text(hjust = 0.5))

6.2 Dumbbell charts

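Dumbbell charts are useful for displaying change between two time points for several groups or observations. They are drawn with the geom_dumbbell function from the ggalt package.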

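Using the gapminder dataset, let's plot the change in life expectancy from 1952 to 2007 in the Americas. The dataset is in long format. To create the dumbbell chart, we need to convert it to wide format.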

pacman::p_load(ggalt)
library(tidyr)
library(dplyr)

# load data
data(gapminder, package = "gapminder")

# subset data
plotdata_long <- filter(gapminder,
                        continent == "Americas" &
                        year %in% c(1952, 2007)) %>%
  select(country, year, lifeExp)

plotdata_long

## # A tibble: 50 x 3
##    country    year lifeExp
##    <fct>     <int>   <dbl>
##  1 Argentina  1952    62.5
##  2 Argentina  2007    75.3
##  3 Bolivia    1952    40.4
##  4 Bolivia    2007    65.6
##  5 Brazil     1952    50.9
##  6 Brazil     2007    72.4
##  7 Canada     1952    68.8
##  8 Canada     2007    80.7
##  9 Chile      1952    54.7
## 10 Chile      2007    78.6
## # ... with 40 more rows

# convert data to wide format
plotdata_wide <- spread(plotdata_long, year, lifeExp)
names(plotdata_wide) <- c("country", "y1952", "y2007")

# create dumbbell plot
ggplot(plotdata_wide, 
       aes(y = country,x = y1952,xend = y2007)) +        # life expectancy rose in every country
  geom_dumbbell()
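
The long-to-wide step above uses spread. In tidyr 1.0.0 and later the same reshaping can also be written with pivot_wider; the following is a sketch on a few rows copied from the plotdata_long output, not a replacement for the author's code. The data frame names long and wide are chosen here for illustration.

```r
library(tidyr)

# A few rows copied from the plotdata_long output shown earlier
long <- data.frame(
  country = c("Argentina", "Argentina", "Bolivia", "Bolivia"),
  year    = c(1952L, 2007L, 1952L, 2007L),
  lifeExp = c(62.5, 75.3, 40.4, 65.6)
)

# One row per country; names_prefix yields the columns y1952 and y2007,
# so no separate renaming step is needed
wide <- pivot_wider(long,
                    names_from   = year,
                    values_from  = lifeExp,
                    names_prefix = "y")
wide
```

Using names_prefix avoids column names that start with a digit, which would otherwise need backticks inside aes().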

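The graph is easier to read if the countries are sorted and the points are sized and colored. In the next graph, we sort by 1952 life expectancy, modify the line and point sizes, color the points, add titles and labels, and simplify the theme.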

# create dumbbell plot
ggplot(plotdata_wide, 
       aes(y = reorder(country, y1952),  # order countries by 1952 life expectancy
           x = y1952,
           xend = y2007)) +  
  geom_dumbbell(size = 1.2,
                size_x = 3, 
                size_xend = 3,
                colour = "grey", 
                colour_x = "blue", 
                colour_xend = "red") +
  theme_minimal() + 
  labs(title = "Change in Life Expectancy",
       subtitle = "1952 to 2007",
       x = "Life Expectancy (years)",
       y = "")

6.3 Slope graphs

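Slope graphs are useful when there are several groups and several time points. Let's plot life expectancy for six Central American countries in 1992, 1997, 2002, and 2007. We will again use the gapminder data. To create the slope graph, we will use the newggslopegraph function from the CGPfunctions package.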

The newggslopegraph function takes the following arguments (in order):

data frame
time variable (which must be a factor)
numeric variable to be plotted
grouping variable (creating one line per group).
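
Since the time variable must be a factor, a numeric year column has to be converted first. The following base R sketch shows what that conversion does; the year vector is a hypothetical stand-in for the gapminder column.

```r
# Hypothetical numeric year column, standing in for gapminder$year
year <- c(1992, 1997, 2002, 2007, 1992, 1997, 2002, 2007)

# factor() uses the sorted unique values as levels
year_f <- factor(year)
is.factor(year_f)
levels(year_f)

# An ordered factor additionally records that 1992 < 1997 < 2002 < 2007
year_o <- factor(year, ordered = TRUE)
is.ordered(year_o)
```

Because the levels are sorted, the time points keep their chronological order on the x-axis even though they are no longer numeric.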

pacman::p_load(CGPfunctions)
df <- gapminder %>%
  filter(year %in% c(1992, 1997, 2002, 2007),
         # "Belize" is not in gapminder, so it matches no rows (6 countries remain)
         country %in% c("Panama", "Costa Rica", "Nicaragua", "Honduras",
                        "El Salvador", "Guatemala", "Belize")) %>%
  mutate(year = factor(year),
         lifeExp = round(lifeExp)) 

df

## # A tibble: 24 x 6
##    country     continent year  lifeExp     pop gdpPercap
##    <fct>       <fct>     <fct>   <dbl>   <int>     <dbl>
##  1 Costa Rica  Americas  1992       76 3173216     6160.
##  2 Costa Rica  Americas  1997       77 3518107     6677.
##  3 Costa Rica  Americas  2002       78 3834934     7723.
##  4 Costa Rica  Americas  2007       79 4133884     9645.
##  5 El Salvador Americas  1992       67 5274649     4444.
##  6 El Salvador Americas  1997       70 5783439     5155.
##  7 El Salvador Americas  2002       71 6353681     5352.
##  8 El Salvador Americas  2007       72 6939688     5728.
##  9 Guatemala   Americas  1992       63 8486949     4439.
## 10 Guatemala   Americas  1997       66 9803875     4684.
## # ... with 14 more rows

# create slope graph
newggslopegraph(df, year, lifeExp, country) +
  labs(title="Life Expectancy by Country", 
       subtitle="Central America", 
       caption="source: gapminder") +
  theme_gray()

## 
## Converting 'year' to an ordered factor

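In the graph above, Costa Rica has the highest life expectancy in every year studied. Guatemala has the lowest, and it caught up with Honduras (also low, at 69) in 2002.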

6.4 Area charts

A simple area chart is basically a line graph with the region between the line and the x-axis filled in.

ggplot(economics, aes(x = date, y = psavert)) +
  geom_line() +
  theme_gray() +
  labs(title = "Personal Savings Rate",
       x = "Date",
       y = "Personal Savings Rate") +
  scale_x_date(expand = c(0,0)) +
  scale_y_continuous(expand = c(0,0))

# basic area chart
ggplot(economics, aes(x = date, y = psavert)) +
  geom_area(fill="lightblue", color="black") +
  labs(title = "Personal Savings Rate",
       x = "Date",
       y = "Personal Savings Rate") +
  scale_x_date(expand = c(0,0)) +
  scale_y_continuous(expand = c(0,0))

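A stacked area chart can be used to show differences between groups over time. Consider the uspopage dataset from the gcookbook package. We will plot the age distribution of the US population from 1900 to 2002.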

pacman::p_load(gcookbook)
uspopage %>% datatable()

# stacked area chart
data(uspopage, package = "gcookbook")
ggplot(uspopage, aes(x = Year,
                     y = Thousands)) +
  geom_line(aes(col = AgeGroup,linetype = AgeGroup),size = 1.5 ) +
  scale_color_brewer(palette = "Set2")

ggplot(uspopage, aes(x = Year,
                     y = Thousands, 
                     fill = AgeGroup)) +
  geom_area() +
  labs(title = "US Population by age",
       x = "Year",
       y = "Population in Thousands")

data(uspopage, package = "gcookbook")

uspopage %>% 
  filter(AgeGroup %in% c('5-14')) %>% 
  ggplot(aes(x = Year,y = Thousands)) +
  geom_area(fill = "steelblue") +
  labs(title = "US Population by age",
       x = "Year",
       y = "Population in Thousands") +
  scale_x_continuous(expand = c(0,0)) +
  scale_y_continuous(expand = c(0,0))

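It is best to avoid scientific notation in graphs. How likely is it that the average reader knows that 3e+05 means 300,000? Changing the scale is easy in ggplot2: simply divide the Thousands variable by 1000 and report it as millions. The next chart also makes the following changes:

create black borders to highlight the difference between groups
reverse the order of the groups to match increasing age
improve labeling
choose a different color scheme
choose a simpler theme.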
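
The rescaling described above can be checked quickly in base R. The values in this sketch are illustrative and are not taken from uspopage.

```r
# Population counts expressed in thousands (illustrative values only)
thousands <- c(300000, 250000)

# The same number in scientific and in fixed notation
sci   <- format(thousands[1], scientific = TRUE)
fixed <- format(thousands[1], scientific = FALSE, big.mark = ",")
print(sci)
print(fixed)

# Dividing by 1000 converts thousands to millions,
# which are small enough to print without an exponent
millions <- thousands / 1000
print(millions)
```

Rescaling the variable, rather than only reformatting the axis labels, also lets the axis title state the unit directly.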

The levels of the AgeGroup variable can be reversed with the fct_rev function from the forcats package.

# stacked area chart
data(uspopage, package = "gcookbook")
ggplot(uspopage, aes(x = Year,
                     y = Thousands/1000, 
                     fill = forcats::fct_rev(AgeGroup))) +   # reverse the factor levels
  geom_area(color = "black") +                               # black borders
  labs(title = "US Population by age",
       subtitle = "1900 to 2002",
       caption = "source: U.S. Census Bureau, 2003, HS-3",
       x = "Year",
       y = "Population in Millions",
       fill = "Age Group") +
  scale_fill_brewer(palette = "Set2") +                      # change the color scheme
  theme_minimal()                                            # simpler theme
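
The factor reversal used in the chart above can be seen in isolation in the sketch below. The age factor is a small illustrative stand-in; apart from "5-14", which appears earlier in this chapter, its labels are assumptions and not a copy of the uspopage levels.

```r
library(forcats)

# Illustrative age-group factor with an explicit level order
age <- factor(c("<5", "5-14", "15-24", "5-14"),
              levels = c("<5", "5-14", "15-24"))
levels(age)

# fct_rev() reverses the order of the levels
age_rev <- fct_rev(age)
levels(age_rev)

# The values themselves are unchanged; only the level order differs
as.character(age_rev)
```

Because ggplot2 stacks areas and orders legend entries by factor level, reversing the levels changes the stacking order without touching the data.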

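Apparently, the number of young children has not changed much over the past 100 years. Stacked area charts are most useful when the interest is in both (1) group change over time and (2) overall change over time. Place the most important groups at the bottom, since these are the easiest to interpret in this type of plot.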

LJJ

2020/3/20
