Data Visualization with R, Chapter 3: Univariate Graphs

3.1 Discrete variables
3.2 Continuous variables

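A univariate graph plots the distribution of a single variable. The variable can be categorical (e.g., race, sex) or quantitative (e.g., age, weight).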

3.1 Discrete variables

The distribution of a single discrete variable is typically plotted with a bar chart, a pie chart, or (less commonly) a treemap.

3.1.1 Bar chart

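The Marriage dataset contains the marriage records of 98 individuals in Mobile County, Alabama. The bar chart below displays the distribution of wedding participants by race.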

pacman::p_load(tidyverse,DT)

data(Marriage, package = "mosaicData")

Marriage %>% glimpse()

## Observations: 98
## Variables: 15
## $ bookpageID    <fct> B230p539, B230p677, B230p766, B230p892, B230p994, B23...
## $ appdate       <fct> 10/29/96, 11/12/96, 11/19/96, 12/2/96, 12/9/96, 12/26...
## $ ceremonydate  <fct> 11/9/96, 11/12/96, 11/27/96, 12/7/96, 12/14/96, 12/26...
## $ delay         <int> 11, 0, 8, 5, 5, 0, 16, 0, 28, 10, 8, 0, 4, 4, 0, 4, 9...
## $ officialTitle <fct> CIRCUIT JUDGE , MARRIAGE OFFICIAL, MARRIAGE OFFICIAL,...
## $ person        <fct> Groom, Groom, Groom, Groom, Groom, Groom, Groom, Groo...
## $ dob           <fct> 4/11/64, 8/6/64, 2/20/62, 5/20/56, 12/14/66, 2/21/70,...
## $ age           <dbl> 32.60274, 32.29041, 34.79178, 40.57808, 30.02192, 26....
## $ race          <fct> White, White, Hispanic, Black, White, White, White, W...
## $ prevcount     <int> 0, 1, 1, 1, 0, 1, 1, 1, 0, 3, 1, 1, 0, 0, 1, 0, 0, 0,...
## $ prevconc      <fct> NA, Divorce, Divorce, Divorce, NA, NA, Divorce, Divor...
## $ hs            <int> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 1...
## $ college       <int> 7, 0, 3, 4, 0, 0, 0, 0, 0, 6, 2, 1, 1, 0, 0, 4, 2, NA...
## $ dayOfBirth    <dbl> 102.00, 219.00, 51.50, 141.00, 348.50, 52.50, 284.25,...
## $ sign          <fct> Aries, Leo, Pisces, Gemini, Saggitarius, Pisces, Libr...

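# plot the distribution of race, with bars ordered by frequency
p1 <- ggplot(Marriage, aes(x = race %>% as_factor() %>% fct_infreq())) + 
  geom_bar() +
  labs(x = "race")

# same plot with the default (alphabetical) order of the factor levels
p2 <- ggplot(Marriage, aes(x = race)) + 
  geom_bar()

library(gridExtra)

# show both plots side by side
grid.arrange(p1, p2, ncol = 2)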

Most participants are white, followed by black, with very few Hispanic or American Indian participants.

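You can change the bar fill and border colors by adding options to the geom_bar function, and set the axis labels and title with the labs function.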

# plot the distribution of race with modified colors and labels
ggplot(Marriage, aes(x = race)) + 
  geom_bar(fill = "cornflowerblue", 
           color="black") +
  labs(x = "Race", 
       y = "Frequency", 
       title = "Participants by race")

3.1.1.1 Plotting proportions

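Bars can represent proportions rather than counts. For bar charts, aes(x = race) is actually shorthand for aes(x = race, y = ..count..), where ..count.. is a special variable holding the frequency within each category. You can use it to compute proportions by specifying the y variable explicitly. (Recent ggplot2 releases spell this after_stat(count) and warn that the ..count.. notation is deprecated.)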

# plot the distribution as percentages
ggplot(Marriage, 
       aes(x = race, 
           y = ..count.. / sum(..count..))) +    # compute proportions
  geom_bar() +
  labs(x = "Race", 
       y = "Percent", 
       title  = "Participants by race") +
  scale_y_continuous(labels = scales::percent)
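
As a cross-check on the percentages, the same proportions can be computed without ggplot2. The following is a minimal base R sketch, not the chapter's own code: it rebuilds a race vector from the counts reported by count(race) in this chapter (1, 22, 1, 74) instead of loading the Marriage data, and the variable names are made up for illustration.

```r
# Rebuild the race variable from the counts reported in this chapter
race <- rep(c("American Indian", "Black", "Hispanic", "White"),
            times = c(1, 22, 1, 74))

counts <- table(race)                      # frequency of each category
props  <- as.vector(counts) / sum(counts)  # proportion of each category
names(props) <- names(counts)

# Format the proportions as whole-number percent labels
pct_labels <- paste0(round(100 * props), "%")

print(round(props, 3))
print(pct_labels)
```

The proportions sum to one, which is what the expression ..count.. / sum(..count..) guarantees for the bar heights.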

3.1.1.2 Sorting categories

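It is often helpful to sort the bars by frequency. In the code below, the frequencies are calculated explicitly, and the reorder function is then used to sort the categories by frequency. The option stat = "identity" tells the plotting function not to compute counts, because they are supplied directly.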

plotdata <- Marriage %>%
 count(race)

plotdata

## # A tibble: 4 x 2
##   race                n
##   <fct>           <int>
## 1 American Indian     1
## 2 Black              22
## 3 Hispanic            1
## 4 White              74

ggplot(plotdata, 
       aes(x = race, 
           y = n)) + 
  geom_bar(stat = "identity") +
  labs(x = "Race", 
       y = "Frequency", 
       title  = "Participants by race")

ggplot(plotdata, 
       aes(x = reorder(race, n), 
           y = n)) + 
  geom_bar(stat = "identity") +
  labs(x = "Race", 
       y = "Frequency", 
       title  = "Participants by race")

The bars are sorted in ascending order. Use reorder(race, -n) to sort in descending order.
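
To see what reorder does to the category order, it can help to inspect the factor levels directly. This is a small sketch on a hypothetical toy data frame (the names toy, group, and n are made up for illustration); it only shows how the sign of the second argument flips the ordering.

```r
# Hypothetical toy data: three categories with their counts
toy <- data.frame(group = c("A", "B", "C"),
                  n     = c(5, 20, 10))

# reorder() returns a factor whose levels are sorted by the second argument
asc  <- levels(reorder(toy$group, toy$n))    # smallest count first
desc <- levels(reorder(toy$group, -toy$n))   # largest count first

print(asc)
print(desc)
```

ggplot2 draws discrete axes in level order, so changing the levels is what changes the order of the bars.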

3.1.1.3 Labeling bars

Finally, you may want to label each bar with its numerical value.

# plot the bars with numeric labels
ggplot(plotdata, 
       aes(x = race, 
           y = n)) + 
  geom_bar(stat = "identity") +
  geom_text(aes(label = n),      # add the count as a label
            vjust=-0.5) +
  labs(x = "Race", 
       y = "Frequency", 
       title  = "Participants by race")

Here geom_text adds the labels, and vjust controls their vertical justification.

Putting these ideas together, you can create a graph like the one below.

plotdata <- Marriage %>%
  count(race) %>%
  mutate(pct = n / sum(n),
         pctlabel = paste0(round(pct*100), "%"))

plotdata

## # A tibble: 4 x 4
##   race                n    pct pctlabel
##   <fct>           <int>  <dbl> <chr>   
## 1 American Indian     1 0.0102 1%      
## 2 Black              22 0.224  22%     
## 3 Hispanic            1 0.0102 1%      
## 4 White              74 0.755  76%

ggplot(plotdata, 
       aes(x = reorder(race, -pct),
           y = pct)) + 
  geom_bar(stat = "identity", 
           fill = "indianred3", 
           color = "black") +
  geom_text(aes(label = pctlabel), 
            vjust = -0.25) +
  scale_y_continuous(labels = scales::percent) +
  labs(x = "Race", 
       y = "Percent", 
       title  = "Participants by race")

3.1.1.4 Overlapping labels

Category labels may overlap if (1) there are many categories or (2) the labels are long. Consider the distribution of marriage officiates.

# basic bar chart with overlapping labels
ggplot(Marriage, aes(x = officialTitle)) + 
  geom_bar() +
  labs(x = "Officiate",
       y = "Frequency",
       title = "Marriages by officiate")

In this case, you can flip the x and y axes.

ggplot(Marriage, aes(x = officialTitle)) + 
  geom_bar() +
  labs(x = "Officiate",
       y = "Frequency",
       title = "Marriages by officiate") +
  coord_flip()

ggplot(Marriage, aes(x = officialTitle %>% as_factor() %>% fct_infreq())) + 
  geom_bar() +
  labs(x = "Officiate",
       y = "Frequency",
       title = "Marriages by officiate") +
  coord_flip()

Alternatively, you can rotate the axis labels.

ggplot(Marriage, aes(x = officialTitle)) + 
  geom_bar() +
  labs(x = "",
       y = "Frequency",
       title = "Marriages by officiate") +
  theme(axis.text.x = element_text(angle = 45, 
                                   hjust = 1))

Finally, you can try staggering the labels. The trick is to add a newline \n to every other label.

# bar chart with staggered labels
lbls <- paste0(c("", "\n"), levels(Marriage$officialTitle))
lbls

## [1] "BISHOP"              "\nCATHOLIC PRIEST"   "CHIEF CLERK"        
## [4] "\nCIRCUIT JUDGE "    "ELDER"               "\nMARRIAGE OFFICIAL"
## [7] "MINISTER"            "\nPASTOR"            "REVEREND"

factor(Marriage$officialTitle, labels = lbls)

##  [1] \nCIRCUIT JUDGE     \nMARRIAGE OFFICIAL \nMARRIAGE OFFICIAL
##  [4] MINISTER            MINISTER            \nMARRIAGE OFFICIAL
##  [7] \nMARRIAGE OFFICIAL \nMARRIAGE OFFICIAL \nMARRIAGE OFFICIAL
## [10] MINISTER            \nPASTOR            CHIEF CLERK        
## [13] \nMARRIAGE OFFICIAL MINISTER            \nMARRIAGE OFFICIAL
## [16] \nPASTOR            \nPASTOR            \nMARRIAGE OFFICIAL
## [19] REVEREND            \nPASTOR            \nPASTOR           
## [22] MINISTER            \nMARRIAGE OFFICIAL \nMARRIAGE OFFICIAL
## [25] \nMARRIAGE OFFICIAL \nMARRIAGE OFFICIAL \nMARRIAGE OFFICIAL
## [28] ELDER               \nPASTOR            \nCATHOLIC PRIEST  
## [31] \nMARRIAGE OFFICIAL \nPASTOR            \nPASTOR           
## [34] MINISTER            \nMARRIAGE OFFICIAL BISHOP             
## [37] \nPASTOR            \nMARRIAGE OFFICIAL \nMARRIAGE OFFICIAL
## [40] \nPASTOR            \nMARRIAGE OFFICIAL MINISTER           
## [43] \nMARRIAGE OFFICIAL MINISTER            \nPASTOR           
## [46] \nMARRIAGE OFFICIAL \nMARRIAGE OFFICIAL MINISTER           
## [49] MINISTER            \nCIRCUIT JUDGE     \nMARRIAGE OFFICIAL
## [52] \nMARRIAGE OFFICIAL MINISTER            MINISTER           
## [55] \nMARRIAGE OFFICIAL \nMARRIAGE OFFICIAL \nMARRIAGE OFFICIAL
## [58] \nMARRIAGE OFFICIAL MINISTER            \nPASTOR           
## [61] CHIEF CLERK         \nMARRIAGE OFFICIAL MINISTER           
## [64] \nMARRIAGE OFFICIAL \nPASTOR            \nPASTOR           
## [67] \nMARRIAGE OFFICIAL REVEREND            \nPASTOR           
## [70] \nPASTOR            MINISTER            \nMARRIAGE OFFICIAL
## [73] \nMARRIAGE OFFICIAL \nMARRIAGE OFFICIAL \nMARRIAGE OFFICIAL
## [76] \nMARRIAGE OFFICIAL ELDER               \nPASTOR           
## [79] \nCATHOLIC PRIEST   \nMARRIAGE OFFICIAL \nPASTOR           
## [82] \nPASTOR            MINISTER            \nMARRIAGE OFFICIAL
## [85] BISHOP              \nPASTOR            \nMARRIAGE OFFICIAL
## [88] \nMARRIAGE OFFICIAL \nPASTOR            \nMARRIAGE OFFICIAL
## [91] MINISTER            \nMARRIAGE OFFICIAL MINISTER           
## [94] \nPASTOR            \nMARRIAGE OFFICIAL \nMARRIAGE OFFICIAL
## [97] MINISTER            MINISTER           
## 9 Levels: BISHOP \nCATHOLIC PRIEST CHIEF CLERK \nCIRCUIT JUDGE  ... REVEREND

ggplot(Marriage, 
       aes(x=factor(officialTitle, labels = lbls))) + 
  geom_bar() +
  labs(x = "",
       y = "Frequency",
       title = "Marriages by officiate")
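
The staggering trick relies on R's vector recycling: paste0 repeats the two-element prefix vector c("", "\n") along the longer vector of labels. A minimal sketch, using a hypothetical vector of titles rather than the actual factor levels:

```r
# Hypothetical category labels (any character vector works)
titles <- c("BISHOP", "ELDER", "MINISTER", "PASTOR", "REVEREND")

# The two-element prefix vector is recycled along the longer vector:
# "", "\n", "", "\n", "", ...
staggered <- paste0(c("", "\n"), titles)

print(staggered)

# Labels in even positions now start with a newline
starts_with_newline <- startsWith(staggered, "\n")
print(starts_with_newline)
```

Because every second label starts on a new line, neighboring labels sit at different heights and no longer collide.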

3.1.2 Pie chart

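Pie charts are controversial in statistics. If your goal is to compare the frequencies of categories, you are better off with a bar chart (humans are better at judging the length of bars than the area of pie slices). If your goal is to compare each category with the whole (e.g., what portion of participants are Hispanic compared to all participants) and the number of categories is small, then a pie chart may work for you. It takes a bit more code to make an attractive pie chart in R.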

# create a basic ggplot2 pie chart
plotdata <- Marriage %>%
  count(race) %>%
  arrange(desc(race)) %>%
  mutate(prop = round(n * 100 / sum(n), 1),
         lab.ypos = cumsum(prop) - 0.5  * prop)

plotdata

## # A tibble: 4 x 4
##   race                n  prop lab.ypos
##   <fct>           <int> <dbl>    <dbl>
## 1 White              74  75.5     37.8
## 2 Hispanic            1   1       76  
## 3 Black              22  22.4     87.7
## 4 American Indian     1   1       99.4

ggplot(plotdata,
       aes(x = "",
           y = prop,
           fill = race)) +
  geom_bar(width = 1,
           stat = "identity",
           color = "black") +
  coord_polar("y",
              start = 0,
              direction = -1) +
  theme_void()
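
The lab.ypos column computed above is the midpoint of each slice along the stacked percentage axis, which is where a slice label would be centered. A short sketch of that arithmetic, using the prop values from the table above (the names slice_end and slice_start are made up for illustration):

```r
# Slice sizes in percent, in stacking order (values from the table above)
prop <- c(75.5, 1, 22.4, 1)

slice_end   <- cumsum(prop)            # where each slice ends
slice_start <- slice_end - prop        # where each slice begins
lab_ypos    <- slice_end - 0.5 * prop  # midpoint, same formula as lab.ypos

print(lab_ypos)
```

Subtracting half of a slice from its cumulative end is the same as averaging the slice's start and end positions.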

Now let's add labels and remove the legend.

# create a pie chart with slice labels
plotdata <- Marriage %>%
  count(race) %>%
  arrange(desc(race)) %>%
  mutate(prop = round(n*100/sum(n), 1),
         lab.ypos = cumsum(prop) - 0.5*prop)

plotdata$label <- paste0(plotdata$race, "\n",
                         round(plotdata$prop), "%")

ggplot(plotdata, 
       aes(x = "", 
           y = prop, 
           fill = race)) +
  geom_bar(width = 1, 
           stat = "identity", 
           color = "black") +
  geom_text(aes(y = lab.ypos, label = label), 
            color = "black") +
  coord_polar("y", 
              start = 0, 
              direction = -1) +
  theme_void() +
  theme(legend.position = "none") +
  labs(title = "Participants by race")

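A pie chart makes it easy to compare each slice with the whole. For example, Black participants can be seen to make up roughly a quarter of all participants.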

3.1.3 Treemap

An alternative to the pie chart is the treemap. Unlike a pie chart, it can handle categorical variables with many levels.

pacman::p_load(treemapify)

# create a treemap of marriage officials
plotdata <- Marriage %>%
  count(officialTitle)

plotdata

## # A tibble: 9 x 2
##   officialTitle           n
##   <fct>               <int>
## 1 "BISHOP"                2
## 2 "CATHOLIC PRIEST"       2
## 3 "CHIEF CLERK"           2
## 4 "CIRCUIT JUDGE "        2
## 5 "ELDER"                 2
## 6 "MARRIAGE OFFICIAL"    44
## 7 "MINISTER"             20
## 8 "PASTOR"               22
## 9 "REVEREND"              2

ggplot(plotdata, 
       aes(fill = officialTitle, 
           area = n)) +
  geom_treemap() + 
  labs(title = "Marriages by officiate")

Here is a more useful version with labels.

# create a treemap with tile labels
ggplot(plotdata, 
       aes(fill = officialTitle, 
           area = n, 
           label = officialTitle)) +
  geom_treemap() + 
  geom_treemap_text(colour = "white", 
                    place = "centre") +
  labs(title = "Marriages by officiate") +
  theme(legend.position = "none")

3.2 Continuous variables

The distribution of a single continuous variable is typically plotted with a histogram, a kernel density plot, or a dot plot.

3.2.1 Histogram

# plot the age distribution using a histogram
ggplot(Marriage, aes(x = age)) +
  geom_histogram() + 
  labs(title = "Participants by age",
       x = "Age")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Most participants appear to be in their early 20s, with another group around age 40 and a much smaller group in their late 60s and early 70s.

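Histogram colors can be modified with two options: fill (the fill color of the bars) and color (the border color around the bars).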

# plot the histogram with blue bars and white borders
ggplot(Marriage, 
       aes(x = age)) +
  geom_histogram(fill = "cornflowerblue", 
                 color = "white") + 
  labs(title="Participants by age",
       x = "Age")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

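One of the most important histogram options is bins, which controls the number of bins into which the numeric variable is divided (i.e., the number of bars in the plot). The default is 30, but it is helpful to try both smaller and larger values to get a better impression of the shape of the distribution.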

# plot the histogram with 20 bins
ggplot(Marriage, aes(x = age)) +
  geom_histogram(fill = "cornflowerblue", 
                 color = "white", 
                 bins = 20) + 
  labs(title="Participants by age", 
       subtitle = "number of bins = 20",
       x = "Age")

Alternatively, you can specify binwidth, the width of the bins represented by the bars.

# plot the histogram with a binwidth of 5
ggplot(Marriage, aes(x = age)) +
  geom_histogram(fill = "cornflowerblue", 
                 color = "white", 
                 binwidth = 5) + 
  labs(title="Participants by age", 
       subtitle = "binwidth = 5 years",
       x = "Age")
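
The bins and binwidth options are two ways of describing the same partition: a fixed bin width over the data range implies a number of bins. The base R sketch below illustrates this on hypothetical ages. It places bin edges at multiples of the bin width, which is an assumption made for clarity; ggplot2 chooses its own bin alignment, so its counts can differ.

```r
# Hypothetical ages, for illustration only
ages <- c(18.5, 22.1, 23.4, 25.0, 31.2, 38.9, 41.7, 44.3, 62.8, 71.5)

binwidth <- 5

# Bin edges at multiples of the bin width that cover the data range
lower  <- floor(min(ages) / binwidth) * binwidth
upper  <- ceiling(max(ages) / binwidth) * binwidth
breaks <- seq(lower, upper, by = binwidth)

# Count the observations falling in each half-open bin [a, b)
counts <- table(cut(ages, breaks = breaks, right = FALSE))

n_bins <- length(breaks) - 1
print(n_bins)
print(counts)
```

Halving the bin width roughly doubles the number of bins, which is why either option can be used to tune the level of detail.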

As with bar charts, the y-axis can represent either counts or percent of the total.

ggplot(Marriage, 
       aes(x = age, 
           y= ..count.. / sum(..count..))) +
  geom_histogram(fill = "cornflowerblue", 
                 color = "white", 
                 binwidth = 5) + 
  labs(title="Participants by age", 
       y = "Percent",
       x = "Age") +
  scale_y_continuous(labels = scales::percent)

3.2.2 Kernel density plot

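An alternative to the histogram is the kernel density plot. Technically, kernel density estimation is a nonparametric method for estimating the probability density function of a continuous random variable. In essence, we are trying to draw a smoothed histogram in which the area under the curve equals one.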

# Create a kernel density plot of age
ggplot(Marriage, aes(x = age)) +
  geom_density(color = "red", linewidth = 2) +   # linewidth replaces size for lines in ggplot2 >= 3.4.0
  labs(title = "Participants by age")

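The graph shows the distribution of ages. For example, the proportion of observations between ages 20 and 40 is represented by the area under the curve between 20 and 40 on the x-axis. As with the previous charts, we can use fill and color to specify the fill and border colors.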

# Create a kernel density plot of age with a filled curve
ggplot(Marriage, aes(x = age)) +
  geom_density(fill = "indianred3") + 
  labs(title = "Participants by age")
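
The claim that areas under the density curve correspond to proportions can be checked numerically. The sketch below uses base R's density function on simulated, hypothetical ages (not the Marriage data) and a simple trapezoid rule; area_between is a helper written for this illustration, not a library function.

```r
set.seed(42)
# Hypothetical ages drawn from two groups, for illustration only
ages <- c(rnorm(70, mean = 25, sd = 4), rnorm(30, mean = 45, sd = 6))

d <- density(ages)   # kernel density estimate on a grid of x values

# Trapezoid-rule area under the estimated curve between lo and hi
area_between <- function(d, lo, hi) {
  keep <- d$x >= lo & d$x <= hi
  x <- d$x[keep]
  y <- d$y[keep]
  sum(diff(x) * (y[-1] + y[-length(y)]) / 2)
}

total_area  <- area_between(d, -Inf, Inf)  # should be close to 1
share_20_40 <- area_between(d, 20, 40)     # estimated share aged 20 to 40

print(total_area)
print(share_20_40)
```

The total is only approximately one because the curve is evaluated on a finite grid, but it is close enough to read partial areas as proportions.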

3.2.2.1 Smoothing parameter

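The degree of smoothness is controlled by the bandwidth parameter bw. To find the default value for a particular variable, use the bw.nrd0 function. Larger values give a smoother curve, while smaller values give less smoothing.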

# default bandwidth for the age variable
bw.nrd0(Marriage$age)

## [1] 5.181946

# Create a kernel density plot of age
ggplot(Marriage, aes(x = age)) +
  geom_density(fill = "deepskyblue", 
               bw = 1) + 
  labs(title = "Participants by age",
       subtitle = "bandwidth = 1")

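In this example, the default bandwidth for age is 5.18. Choosing a value of 1 results in less smoothing and more detail. Kernel density plots make it easy to see which values are most frequent and which are relatively rare. However, it can be difficult to explain the meaning of the y-axis to a non-statistician. (But it will make you look really smart at parties!)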
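
The relationship between the default and a user-supplied bandwidth can also be inspected outside of ggplot2. This sketch uses base R's density function on simulated, hypothetical ages; it only confirms which bandwidth each estimate ends up using.

```r
set.seed(42)
# Hypothetical ages, for illustration only
ages <- c(rnorm(70, mean = 25, sd = 4), rnorm(30, mean = 45, sd = 6))

default_bw <- bw.nrd0(ages)          # rule-of-thumb bandwidth
d_default  <- density(ages)          # uses bw.nrd0 unless told otherwise
d_narrow   <- density(ages, bw = 1)  # less smoothing, more detail

print(default_bw)
print(d_default$bw)
print(d_narrow$bw)
```

Plotting d_default and d_narrow with plot() and lines() is one way to compare the two levels of smoothing side by side.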

3.2.3 Dot plot

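Another alternative to the histogram is the dot plot. Again, the quantitative variable is divided into bins, but rather than summary bars, each observation is represented by a dot. The dots are stacked, with each dot representing one observation. This works best when the number of observations is small (say, fewer than 150).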

# plot the age distribution using a dotplot
ggplot(Marriage, aes(x = age)) +
  geom_dotplot() + 
  labs(title = "Participants by age",
       y = "Proportion",
       x = "Age")

## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

The fill and color options can be used to specify the fill and border color of each dot, respectively.

# Plot ages as a dot plot using 
# gold dots with black borders
ggplot(Marriage, aes(x = age)) +
  geom_dotplot(fill = "gold", 
               color = "black") + 
  labs(title = "Participants by age",
       y = "Proportion",
       x = "Age")

## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

# y axis isn't really meaningful, so hide it
ggplot(Marriage, aes(x = age)) +
  geom_dotplot(fill = "gold", 
               color = "black") + 
  labs(title = "Participants by age",
       x = "Age") +
  scale_y_continuous(NULL,breaks = NULL)

## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.

Many more options are available. See the help for details and examples.