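Data Visualization with R: Chapter 8, Other Graphs

The graphs in this chapter can be very useful, but they don't fit easily into the other chapters.

8.1 3-D Scatterplot

The ggplot2 package and its extensions cannot create 3-D plots. However, you can create a 3-D scatterplot with the scatterplot3d function in the scatterplot3d package.

Suppose we want to use the mtcars data to plot the relationship between automobile mileage, engine displacement, and car weight.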

# basic 3-D scatterplot
pacman::p_load(tidyverse,scatterplot3d,DT)
library(scatterplot3d)
with(mtcars, {
   scatterplot3d(x = disp,
                 y = wt, 
                 z = mpg,
                 main="3-D Scatterplot Example 1")
})

Now let's modify the graph: replace the points with filled blue circles, add drop lines to the x-y plane, and create more meaningful labels.

library(scatterplot3d)
with(mtcars, {
  scatterplot3d(x = disp, 
                y = wt, 
                z = mpg, 
                # filled blue circles
                color="blue", 
                pch=19, 
                # lines to the horizontal plane
                type = "h",
                main = "3-D Scatterplot Example 2",
                xlab = "Displacement (cu. in.)",
                ylab = "Weight (lb/1000)",
                zlab = "Miles/(US) Gallon")
})

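Next, let's label the points. We can do this by saving the result of the scatterplot3d function to an object, using its xyz.convert function to convert the coordinates from 3-D (x, y, z) to a 2-D projection (x, y), and applying the text function to add labels to the graph.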

with(mtcars, {
  s3d <- scatterplot3d(
    x = disp, 
    y = wt, 
    z = mpg,
    color = "blue", 
    pch = 19,      
    type = "h",
    main = "3-D Scatterplot Example 3",
    xlab = "Displacement (cu. in.)",
    ylab = "Weight (lb/1000)",
    zlab = "Miles/(US) Gallon")
  
  # convert 3-D coords to 2D projection
  s3d.coords <- s3d$xyz.convert(disp, wt, mpg) 
  
  # plot text with 50% shrink and place to right of points
  text(s3d.coords$x, 
       s3d.coords$y,   
       labels = row.names(mtcars),  
       cex = .5, 
       pos = 4)
})

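Almost there. As a final step, we'll add information on the number of cylinders in each car. To do this, we'll add a column to the mtcars data frame indicating the color for each point. For good measure, we'll shorten the y-axis, change the drop lines to dashed lines, and add a legend. A 3-D plot like this works best when it is kept simple; with a large amount of data it becomes uncomfortable to read.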

# create column indicating point color
mtcars$pcolor[mtcars$cyl == 4] <- "red"
mtcars$pcolor[mtcars$cyl == 6] <- "blue"
mtcars$pcolor[mtcars$cyl == 8] <- "darkgreen"

with(mtcars, {
    s3d <- scatterplot3d(
      x = disp, 
      y = wt, 
      z = mpg,
      color = pcolor, 
      pch = 19, 
      type = "h", 
      lty.hplot = 2, 
      scale.y = .75,
      main = "3-D Scatterplot Example 4",
      xlab = "Displacement (cu. in.)",
      ylab = "Weight (lb/1000)",
      zlab = "Miles/(US) Gallon")
    
     s3d.coords <- s3d$xyz.convert(disp, wt, mpg)
     text(s3d.coords$x, 
          s3d.coords$y, 
          labels = row.names(mtcars), 
          pos = 4, 
          cex = .5)  
     
# add the legend
legend(#location
       "topleft", 
       inset=.05,
       # suppress legend box, shrink text 50%
       bty="n", 
       cex=.5, 
       title="Number of Cylinders",
       c("4", "6", "8"), 
       fill=c("red", "blue", "darkgreen"))
})
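
The three assignment statements that build the pcolor column can also be collapsed into a single lookup. The following is a minimal sketch in base R, assuming the same three colors; the names cyl_colors, cars, and pcolor_lookup are introduced here for illustration only and are not part of the original example.

```r
# Hypothetical lookup table: cylinder count -> point color
cyl_colors <- c("4" = "red", "6" = "blue", "8" = "darkgreen")

# Work on a fresh copy of the built-in data
cars <- datasets::mtcars

# Index the named vector by cylinder count (converted to character)
pcolor_lookup <- unname(cyl_colors[as.character(cars$cyl)])

# One color per car; the counts per color follow the cylinder counts
table(pcolor_lookup)
```

The resulting vector could be passed to the color argument of scatterplot3d in the same way as the pcolor column.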

8.2 Biplot

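A biplot is a specialized graph that attempts to represent the relationships between observations, between variables, and between observations and variables, in a low-dimensional (usually two-dimensional) space. It is easiest to see how this works with an example. Let's create a biplot for the mtcars dataset, using the fviz_pca function from the factoextra package.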

data(mtcars)

# fit a principal components model
fit <- prcomp(x = mtcars, 
              center = TRUE, 
              scale. = TRUE)

# plot the results
pacman::p_load(factoextra)
fviz_pca(fit, 
         repel = TRUE, 
         labelsize = 3) + 
  labs(title = "Biplot of mtcars data")
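
To see what the biplot is built from, it can help to inspect the fitted object directly. The sketch below uses only base R; the object names scores, loadings, and var_explained are illustrative. It assumes the usual construction of a PCA biplot, in which the points are the observation scores on the first two components and the arrows are derived from the variable loadings.

```r
# Fit PCA on the original (unmodified) mtcars data
fit <- prcomp(datasets::mtcars, center = TRUE, scale. = TRUE)

# Observation scores on the first two principal components (the points)
scores <- fit$x[, 1:2]

# Variable loadings on the first two components (the arrow directions)
loadings <- fit$rotation[, 1:2]

# Proportion of variance explained by each component
var_explained <- fit$sdev^2 / sum(fit$sdev^2)

head(scores, 3)
round(loadings, 2)
round(var_explained[1:2], 3)
```

Cars with similar scores sit close together in the biplot, and variables whose loadings point in similar directions tend to be positively correlated.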

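8.3 Bubble Chart

A bubble chart is basically just a scatterplot in which the point size is proportional to the value of a third quantitative variable.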

library(ggplot2)
ggplot(mtcars, 
       aes(x = wt, y = mpg, size = hp,col = hp)) +
  geom_point()

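We can improve the default appearance by increasing the size of the bubbles, choosing a different point shape and color, and adding some transparency.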

ggplot(mtcars, 
       aes(x = wt, y = mpg, size = hp)) +
  geom_point(alpha = .5, 
             fill="cornflowerblue", 
             color="black", 
             shape=21) +
  scale_size_continuous(range = c(1, 14),name = "Gross horsepower") +
  labs(title = "Auto mileage by weight and horsepower",
       subtitle = "Motor Trend US Magazine (1973-74 models)",
       x = "Weight (1000 lbs)",
       y = "Miles/(US) gallon")

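The range parameter in the scale_size_continuous function specifies the minimum and maximum size of the plotting symbol. The default is range = c(1, 6). The shape option in the geom_point function specifies a circle with a border color and a fill color. Clearly, miles per gallon decreases as car weight and horsepower increase. However, there is one car with low weight, high horsepower, and high gas mileage. Going back to the data, it is the Lotus Europa. Bubble charts are controversial for the same reason that pie charts are controversial: people are better at judging length than volume.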
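
Going back to the data can be as simple as a row filter. This is a minimal sketch in base R; the two cutoffs (under 2,000 lbs and over 100 horsepower) are arbitrary values chosen for illustration, and the names cars and light_powerful are not from the original text.

```r
# Work on a fresh copy of the built-in data
cars <- datasets::mtcars

# Cars that are light (wt is in 1000 lbs) yet have more than 100 horsepower;
# both cutoffs are arbitrary choices made for this illustration
light_powerful <- cars[cars$wt < 2 & cars$hp > 100, c("mpg", "wt", "hp")]
light_powerful
```

With these cutoffs the filter returns a single row, the Lotus Europa.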

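8.4 Flow Diagram

A flow diagram represents a set of dynamic relationships. It usually captures the physical or metaphorical flow of people, materials, communications, or objects through a set of nodes in a network.

In a Sankey diagram, the width of the line between two nodes is proportional to the amount of flow. We'll demonstrate this with UK energy forecast data. The data contain energy production and consumption forecasts for the year 2050. Building the graph requires two data frames, one containing the node names and the other containing the links between the nodes and the amount of flow between them.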
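
The text does not show the two data frames, so the following is only a sketch of their general shape. It uses invented toy values rather than the UK forecast data, and it assumes the networkD3 package as one way to draw the diagram (the text does not name a package). The object names nodes, links, and p are illustrative.

```r
# Toy node table: one row per node (names are invented for illustration)
nodes <- data.frame(name = c("Coal", "Gas", "Electricity", "Homes", "Industry"),
                    stringsAsFactors = FALSE)

# Toy link table: source and target are zero-based row indices into `nodes`,
# value is the amount of flow (made-up numbers, not the UK forecast data)
links <- data.frame(source = c(0, 1, 2, 2),
                    target = c(2, 2, 3, 4),
                    value  = c(40, 60, 55, 45))

# Every link must point at an existing node
stopifnot(all(c(links$source, links$target) %in% (seq_len(nrow(nodes)) - 1)))

# Build the diagram only if networkD3 is installed (an assumption of this sketch)
if (requireNamespace("networkD3", quietly = TRUE)) {
  p <- networkD3::sankeyNetwork(Links = links, Nodes = nodes,
                                Source = "source", Target = "target",
                                Value = "value", NodeID = "name",
                                fontSize = 12, nodeWidth = 30)
}
```

Printing p in an interactive session displays the diagram. Keeping nodes and links in separate tables means a node name is stored once, however many links refer to it.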

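8.5 Heatmap

A heatmap displays a set of data using colored tiles for each variable value within each observation. There are many varieties of heatmaps. Although base R comes with a heatmap function, we'll use the more powerful superheat package (I love these names). First, let's create a heatmap for the mtcars dataset that comes with base R. The mtcars dataset contains information on 32 cars measured on 11 variables.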

mtcars %>% DT::datatable()

pacman::p_load(superheat)
# drop the helper color column if it is still present
superheat(mtcars %>% select(-any_of("pcolor")), scale = TRUE)

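The scale = TRUE option standardizes the columns to a mean of zero and a standard deviation of one. Looking at the graph, we can see that the Merc 230 has a quarter-mile time (qsec) that is well above average (bright yellow). The Lotus Europa has a weight that is well below average (dark blue). We can use clustering to sort the rows and/or columns.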
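
The standardization itself can be reproduced in base R. The sketch below assumes that scale = TRUE corresponds to ordinary column-wise z-scores, as described in the text; the object names cars and z are illustrative.

```r
# Work on a fresh copy of the built-in data
cars <- datasets::mtcars

# Standardize each column: subtract its mean, divide by its standard deviation
z <- scale(cars)

# Column means are (numerically) zero and standard deviations are one
round(colMeans(z), 10)
apply(z, 2, sd)

# Standardized values behind the two cells discussed above
z["Merc 230", "qsec"]
z["Lotus Europa", "wt"]
```

A positive value means the car is above the column average and a negative value means it is below, which is what the tile colors encode.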

In the next example, we'll sort the rows so that similar cars appear near each other. We will also adjust the text and label sizes.

# sorted heat map
superheat(mtcars %>% select(-any_of("pcolor")),
          scale = TRUE,
          left.label.text.size=3,
          bottom.label.text.size=3,
          bottom.label.size = .05,
          row.dendrogram = TRUE )

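Here we can see that the Toyota Corolla and the Fiat 128 have similar characteristics. The Lincoln Continental and the Cadillac Fleetwood also have similar characteristics.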

The superheat function requires that the data be in a particular format. Specifically:

The data must be all numeric.
The row names are used to label the left axis. If the desired labels are in a column variable, the variable must be converted to row names (more on this below).
Missing values are allowed.
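
These requirements can be checked before plotting. The helper below is a hypothetical sketch in base R, not part of superheat: prepare_for_heatmap moves a label column into the row names and stops if any remaining column is not numeric. The toy data frame is invented for illustration.

```r
# Hypothetical helper: move a label column into the row names and
# verify that every remaining column is numeric
prepare_for_heatmap <- function(data, label_col) {
  out <- as.data.frame(data)
  row.names(out) <- as.character(out[[label_col]])
  out[[label_col]] <- NULL
  stopifnot(all(vapply(out, is.numeric, logical(1))))
  out
}

# Toy input with a label column and one missing value (invented numbers)
toy <- data.frame(id = c("a", "b", "c"),
                  x  = c(1.5, NA, 3.0),
                  y  = c(10, 20, 30))

prepared <- prepare_for_heatmap(toy, "id")
prepared
```

The missing value is left in place, since missing values are allowed.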

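Let's use a heatmap to display changes in life expectancy over time for Asian countries. The data come from the gapminder dataset.

Since the data are in long format, we first have to convert them to wide format. Then we need to make sure the result is a data frame and convert the variable country into row names. Finally, we'll sort the data by 2007 life expectancy. While we are at it, let's change the color scheme.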

# load data
data(gapminder, package="gapminder")

# subset Asian countries
asia <- gapminder %>%
  filter(continent == "Asia") %>%
  select(year, country, lifeExp)

# convert to wide format: one row per country, one column per year
plotdata <- asia %>% spread(key = year, value = lifeExp)

# save country as row names
plotdata <- as.data.frame(plotdata)
row.names(plotdata) <- plotdata$country

plotdata$country <- NULL

# row order
sort.order <- order(plotdata$"2007")

library(RColorBrewer)
colors <- rev(brewer.pal(5, "Blues"))

plotdata %>% datatable()

# create the heat map
library(superheat)
superheat(plotdata,
          scale = FALSE,
          left.label.text.size=3,
          bottom.label.text.size=3,
          bottom.label.size = .05,
          heat.pal = colors,
          order.rows = sort.order,
          title = "Life Expectancy in Asia")

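Japan, Hong Kong, and Israel have the highest life expectancies. South Korea was doing well in the 80s but has since lost some ground. Life expectancy in Cambodia took a sharp hit in 1977. To see what else you can do with heatmaps, look at the superheat function.

Perhaps a time series plot would work better: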

# return to long format for plotting
plotdata1 <- plotdata %>% 
  rownames_to_column(var = "Country") %>% 
  gather(key = year, value = lifeExp, -Country)

plotdata1 %>% 
  ggplot(aes(year, lifeExp, group = Country, col = Country)) +
  geom_point() +
  geom_line() + 
  # hide the legend; the labels at 1952 identify the lines
  theme(legend.position = "none") +
  geom_label(data = plotdata1 %>% filter(year == 1952),
             aes(label = Country), alpha = 0.1)

The result is not pretty at all!

Let's try a dumbbell chart instead:

plotdata_long <- filter(gapminder,
                        continent == "Americas" &
                        year %in% c(1952, 2007)) %>%
  select(country, year, lifeExp)

# convert data to wide format
plotdata_wide <- spread(plotdata_long, year, lifeExp)
names(plotdata_wide) <- c("country", "y1952", "y2007")

plotdata_wide

## # A tibble: 25 x 3
##    country            y1952 y2007
##    <fct>              <dbl> <dbl>
##  1 Argentina           62.5  75.3
##  2 Bolivia             40.4  65.6
##  3 Brazil              50.9  72.4
##  4 Canada              68.8  80.7
##  5 Chile               54.7  78.6
##  6 Colombia            50.6  72.9
##  7 Costa Rica          57.2  78.8
##  8 Cuba                59.4  78.3
##  9 Dominican Republic  45.9  72.2
## 10 Ecuador             48.4  75.0
## # ... with 15 more rows

# create dumbbell plot
library(ggalt)

ggplot(plotdata_wide, 
       aes(y = reorder(country, y1952),
           x = y1952,
           xend = y2007)) +  
  geom_dumbbell(size = 1.2,
                size_x = 3, 
                size_xend = 3,
                colour = "grey", 
                colour_x = "blue", 
                colour_xend = "red") +
  labs(title = "Change in Life Expectancy",
       subtitle = "1952 to 2007",
       x = "Life Expectancy (years)",
       y = "")

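Patterns are easier to discern here. For example, Haiti had the lowest life expectancy in 1952 and still had the lowest in 2007. Paraguay started relatively high but made almost no progress.

8.6 Radar Chart

See: radar chart

8.7 Scatterplot Matrix

A scatterplot matrix is a collection of scatterplots organized as a grid. It is similar to a correlation plot, but instead of displaying correlations, it displays the underlying data.

We can illustrate its use by examining the relationships between mammal size and sleep characteristics. The data come from the msleep dataset that ships with ggplot2. Brain weight and body weight are highly skewed (think mouse and elephant), so we'll transform them to log brain weight and log body weight before creating the graph.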

library(GGally)

# prepare data
data(msleep, package="ggplot2")
library(dplyr)
df <- msleep %>% 
  mutate(log_brainwt = log(brainwt),
         log_bodywt = log(bodywt)) %>%
  select(log_brainwt, log_bodywt, sleep_total, sleep_rem)

df %>% datatable()

# create a scatterplot matrix
ggpairs(df)

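By default, the principal diagonal contains the kernel density plot for each variable. The cells below the principal diagonal contain the scatterplots defined by the intersection of the row and column variables. The variable along the top is the x-axis, and the variable along the right side is the y-axis. The cells above the principal diagonal contain the correlation coefficients.

For example, as brain weight increases, total sleep time and time in REM sleep decrease. The graph can be modified by creating custom functions.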

# custom function for density plot
my_density <- function(data, mapping, ...){
  ggplot(data = data, mapping = mapping) + 
    geom_density(alpha = 0.5,
                 fill = "cornflowerblue", ...)
}

# custom function for scatterplot
my_scatter <- function(data, mapping, ...){
  ggplot(data = data, mapping = mapping) + 
    geom_point(alpha = 0.5,
               color = "cornflowerblue") + 
    geom_smooth(method=lm, 
                se=FALSE, ...)
}


# create scatterplot matrix
ggpairs(df, 
        lower=list(continuous = my_scatter), 
        diag = list(continuous = my_density)) +
  labs(title = "Mammal size and sleep characteristics")

The code produces a series of warnings reporting that rows with missing or non-finite values were removed before plotting: 27, 22, or 35 rows, depending on the panel.

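8.8 Waterfall Chart

A waterfall chart illustrates the cumulative effect of a sequence of positive and negative values. For example, we can plot the cumulative effect of revenue and expenses for a fictional company. First, let's create a dataset.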

# create company income statement
category <- c("Sales", "Services", "Fixed Costs", 
              "Variable Costs", "Taxes")
amount <- c(101000, 52000, -23000, -15000, -10000)
income <- data.frame(category, amount) 
income

##         category amount
## 1          Sales 101000
## 2       Services  52000
## 3    Fixed Costs -23000
## 4 Variable Costs -15000
## 5          Taxes -10000
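
The cumulative effect that the chart displays is just a running total. As a minimal sketch in base R, using the same figures as the income statement above (the names bar_start, bar_end, steps, and net are introduced here for illustration):

```r
# Same figures as the income statement above
category <- c("Sales", "Services", "Fixed Costs", "Variable Costs", "Taxes")
amount   <- c(101000, 52000, -23000, -15000, -10000)

# Each bar ends at the running total and starts where the previous bar ended
bar_end   <- cumsum(amount)
bar_start <- bar_end - amount

steps <- data.frame(category, amount, bar_start, bar_end)
steps

# The final running total is the net amount
net <- bar_end[length(bar_end)]
net  # 105000
```

Positive amounts push the running total up and negative amounts pull it down, so the last bar ends at the net figure.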

Now we can visualize this as a waterfall chart, using the waterfall function in the waterfalls package.

pacman::p_load(waterfalls)
waterfall(income)

We can also add a total (net) column. Since the result is a ggplot2 graph, we can use additional functions to customize the result.

# create waterfall chart with total column
waterfall(income,calc_total = TRUE)
waterfall(income,calc_total = TRUE,
          total_axis_text = "Net",
          total_rect_text_color = "black",
          total_rect_color = "goldenrod1") +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "West Coast Profit and Loss", 
       subtitle = "Year 2017",
       y="", 
       x="") +
  theme(plot.title = element_text(hjust = 0.5))

8.9 Word Cloud

Word clouds belong to the field of text mining and will be explored later.