利用R进行数据可视化——第七章统计模型图

统计模型描述一个或多个解释变量与一个或多个响应变量之间的关系。图表可以帮助可视化这些关系。

在本节中，我们将关注具有单个响应变量的模型，这些响应变量可以是定量的(数字) ，也可以是二进制的(yes / no)。

7.1 相关系数图

相关图通过使用颜色或阴影显示它们的相关性，帮助您可视化一组定量变量之间的成对关系。考虑一下 Saratoga Houses 的数据集，其中包含了2006年萨拉托加县房屋的销售价格和特征。为了探讨定量变量之间的关系，我们可以计算皮尔逊积矩相关系数。

library(tidyverse)

## -- Attaching packages -------------------

## √ ggplot2 3.3.0     √ purrr   0.3.3
## √ tibble  2.1.3     √ dplyr   0.8.5
## √ tidyr   1.0.2     √ stringr 1.4.0
## √ readr   1.3.1     √ forcats 0.5.0

## -- Conflicts --- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(mosaicData)
data(SaratogaHouses, package="mosaicData")
SaratogaHouses %>% DT::datatable()

SaratogaHouses %>% glimpse()

## Observations: 1,728
## Variables: 16
## $ price           <int> 132500, 181115, 109000, 155000, 86060, 120000, 1530...
## $ lotSize         <dbl> 0.09, 0.92, 0.19, 0.41, 0.11, 0.68, 0.40, 1.21, 0.8...
## $ age             <int> 42, 0, 133, 13, 0, 31, 33, 23, 36, 4, 123, 1, 13, 1...
## $ landValue       <int> 50000, 22300, 7300, 18700, 15000, 14000, 23300, 146...
## $ livingArea      <int> 906, 1953, 1944, 1944, 840, 1152, 2752, 1662, 1632,...
## $ pctCollege      <int> 35, 51, 51, 51, 51, 22, 51, 35, 51, 44, 51, 51, 41,...
## $ bedrooms        <int> 2, 3, 4, 3, 2, 4, 4, 4, 3, 3, 7, 3, 2, 3, 3, 3, 3, ...
## $ fireplaces      <int> 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ bathrooms       <dbl> 1.0, 2.5, 1.0, 1.5, 1.0, 1.0, 1.5, 1.5, 1.5, 1.5, 1...
## $ rooms           <int> 5, 6, 8, 5, 3, 8, 8, 9, 8, 6, 12, 6, 4, 5, 8, 4, 7,...
## $ heating         <fct> electric, hot water/steam, hot water/steam, hot air...
## $ fuel            <fct> electric, gas, gas, gas, gas, gas, oil, oil, electr...
## $ sewer           <fct> septic, septic, public/commercial, septic, public/c...
## $ waterfront      <fct> No, No, No, No, No, No, No, No, No, No, No, No, No,...
## $ newConstruction <fct> No, No, No, No, Yes, No, No, No, No, No, No, No, No...
## $ centralAir      <fct> No, No, No, No, Yes, No, No, No, No, No, No, No, No...

SaratogaHouses %>% select_if(is.numeric)->df
# calulate the correlations
r <- cor(df, use="complete.obs")
round(r,2)

##            price lotSize   age landValue livingArea pctCollege bedrooms
## price       1.00    0.16 -0.19      0.58       0.71       0.20     0.40
## lotSize     0.16    1.00 -0.02      0.06       0.16      -0.03     0.11
## age        -0.19   -0.02  1.00     -0.02      -0.17      -0.04     0.03
## landValue   0.58    0.06 -0.02      1.00       0.42       0.23     0.20
## livingArea  0.71    0.16 -0.17      0.42       1.00       0.21     0.66
## pctCollege  0.20   -0.03 -0.04      0.23       0.21       1.00     0.16
## bedrooms    0.40    0.11  0.03      0.20       0.66       0.16     1.00
## fireplaces  0.38    0.09 -0.17      0.21       0.47       0.25     0.28
## bathrooms   0.60    0.08 -0.36      0.30       0.72       0.18     0.46
## rooms       0.53    0.14 -0.08      0.30       0.73       0.16     0.67
##            fireplaces bathrooms rooms
## price            0.38      0.60  0.53
## lotSize          0.09      0.08  0.14
## age             -0.17     -0.36 -0.08
## landValue        0.21      0.30  0.30
## livingArea       0.47      0.72  0.73
## pctCollege       0.25      0.18  0.16
## bedrooms         0.28      0.46  0.67
## fireplaces       1.00      0.44  0.32
## bathrooms        0.44      1.00  0.52
## rooms            0.32      0.52  1.00

ggcorrplot包中的ggcorrplot函数可用于可视化这些相关性。默认情况下，它会创建一个ggplot2图形，其中较深的红色表示较强的正相关性，较深的蓝色表示较强的负相关性，白色表示无相关性。

7.1.1 普通相关系数图

library(ggplot2)
pacman::p_load(ggcorrplot)
ggcorrplot(r)

从图中可以看出，浴室和居住面积的增加与房价的上涨有关，而老房子的价格往往较低。老房子的浴室往往也更少。 ggcorrplot函数有许多自定义输出的选项。例如

hc.order = TRUE reorders the variables, placing variables with similar correlation patterns together.
type = "lower" plots the lower portion of the correlation matrix.
lab = TRUE overlays the correlation coefficients (as text) on the plot.

ggcorrplot(r, 
           hc.order = TRUE, 
           type = "lower",
           lab = TRUE)

7.1.2 方块图绘制相关系数图

corr <- round(cor(df), 2)
df <- reshape2::melt(corr)

df %>% head()

##         Var1  Var2 value
## 1      price price  1.00
## 2    lotSize price  0.16
## 3        age price -0.19
## 4  landValue price  0.58
## 5 livingArea price  0.71
## 6 pctCollege price  0.20

df %>% 
  ggplot(aes(Var1,Var2,fill = value)) +              # fill填充数值
  geom_tile() +
  theme(axis.text.x = element_text(angle = 45,vjust = 0.8,hjust = 0.5)) +
  scale_x_discrete(expand = c(0,0)) +
  scale_y_discrete(expand = c(0,0)) +
  geom_label(aes(label = value,size = value),col = "white") +
  theme(legend.position = "NULL")

df %>% 
  ggplot(aes(Var1,Var2,fill = value)) +
  geom_tile() +
  theme(axis.text.x = element_text(angle = 45,vjust = 0.8,hjust = 0.5)) +
  scale_x_discrete(expand = c(0,0)) +
  scale_y_discrete(expand = c(0,0)) +
  geom_text(aes(label = value,size = value),col = "white") +
  theme(legend.position = "NULL")

7.2 线性回归

线性回归允许我们在其他变量保持不变的情况下，探索定量响应变量和解释变量之间的关系。考虑Saratoga数据集中的房价预测，包括地块大小(平方英尺)、年龄(年份)、土地价值(1000美元)、居住面积(平方英尺)、卧室和浴室的数量，以及房子是否在海滨。

houses_lm <- lm(price ~ lotSize + age + landValue +
                livingArea + bedrooms + bathrooms +
                waterfront, 
                data = SaratogaHouses)
summary(houses_lm)

## 
## Call:
## lm(formula = price ~ lotSize + age + landValue + livingArea + 
##     bedrooms + bathrooms + waterfront, data = SaratogaHouses)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -220208  -35416   -5443   27570  464320 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.399e+05  1.647e+04   8.491  < 2e-16 ***
## lotSize       7.501e+03  2.075e+03   3.615 0.000309 ***
## age          -1.360e+02  5.416e+01  -2.512 0.012099 *  
## landValue     9.093e-01  4.583e-02  19.841  < 2e-16 ***
## livingArea    7.518e+01  4.158e+00  18.080  < 2e-16 ***
## bedrooms     -5.767e+03  2.388e+03  -2.414 0.015863 *  
## bathrooms     2.455e+04  3.332e+03   7.366 2.71e-13 ***
## waterfrontNo -1.207e+05  1.560e+04  -7.738 1.70e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 59370 on 1720 degrees of freedom
## Multiple R-squared:  0.6378, Adjusted R-squared:  0.6363 
## F-statistic: 432.6 on 7 and 1720 DF,  p-value: < 2.2e-16

从结果中，我们可以估计出，在其他变量不变的情况下，每增加一平方英尺的居住面积，房价就会增加75美元。此外，滨水住宅比非滨水住宅多花费约120,726美元，同样控制了模型中的其他变量。

visreg包提供了可视化这些条件关系的工具。visreg函数接受(1)模型和(2)感兴趣的变量，并绘制条件关系图，控制其他变量。选项gg = TRUE用于生成ggplot2图。

# conditional plot of price vs. living area
library(ggplot2)
pacman::p_load(visreg)
visreg(houses_lm, 
       "livingArea", 
       gg = TRUE)

从图中可以看出，在控制了地块面积、年龄、居住面积、卧室浴室数量、滨水位置之后，销售价格随居住面积成线性增长。

visreg是如何工作的?拟合模型用于预测响应变量的值，跨越所选解释变量的范围。其他变量设置为它们的中值(对于数值变量)或最常见的类别(对于分类变量)。用户可以覆盖这些默认值，并为模型中的任何变量选择特定的值。

继续这个例子，在控制了其他七个变量的情况下，绘制了滨水和非滨水住宅之间的价格差异。由于生成了ggplot2图，所以可以添加其他ggplot2函数来定制图。

# conditional plot of price vs. waterfront location
visreg(houses_lm, "waterfront", gg = TRUE) +
  scale_y_continuous(label = scales::dollar) +
  labs(title = "Relationship between price and location",
       subtitle = "controlling for lot size, age, land value, bedrooms and bathrooms",
       caption = "source: Saratoga Housing Data (2006)",
       y = "Home Price",
       x = "Waterfront")

水上的房子少得多，而且往往更贵(即使考虑到面积、年代和土地价值)。vizreg包提供了广泛的绘图功能。有关详细信息，请参阅使用visreg的回归模型的可视化。

7.3 Logistic回归

逻辑回归可以用来探讨二元响应变量和解释变量之间的关系，而其他变量则保持不变。二进制响应变量有两个级别(是/否、存活/死亡、通过/失败、恶性/良性)。与线性回归一样，我们可以使用visreg包来可视化这些关系。

利用CPS85的数据，让我们预测结婚的对数概率，给定一个人的性别、年龄、种族和工作领域。

# fit logistic model for predicting
# marital status: married/single
data(CPS85, package = "mosaicData")
cps85_glm <- glm(married ~ sex + age + race + sector, 
                 family="binomial", 
                 data=CPS85)

cps85_glm %>% summary()

## 
## Call:
## glm(formula = married ~ sex + age + race + sector, family = "binomial", 
##     data = CPS85)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4376  -0.9395  -0.6566   1.1493   2.2723  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    1.770342   0.474398   3.732  0.00019 ***
## sexM           0.014040   0.216778   0.065  0.94836    
## age           -0.057216   0.009357  -6.115 9.69e-10 ***
## raceW         -0.216076   0.289098  -0.747  0.45481    
## sectorconst   -0.320205   0.577917  -0.554  0.57953    
## sectormanag   -0.255371   0.383185  -0.666  0.50513    
## sectormanuf   -0.455678   0.363865  -1.252  0.21045    
## sectorother   -0.083591   0.368836  -0.227  0.82071    
## sectorprof    -0.395464   0.317199  -1.247  0.21249    
## sectorsales   -0.730764   0.457628  -1.597  0.11030    
## sectorservice  0.183573   0.324572   0.566  0.57168    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 687.81  on 533  degrees of freedom
## Residual deviance: 634.89  on 523  degrees of freedom
## AIC: 656.89
## 
## Number of Fisher Scoring iterations: 4

在保持其他变量不变的情况下，利用拟合模型，我们将年龄与结婚概率之间的关系形象化。同样，visreg函数获取模型和感兴趣的变量，并绘制条件关系图，控制其他变量。

选项gg = TRUE用于生成ggplot2图。scale = “response”选项创建一个基于概率(而不是log-odds)比例的图表。

visreg(cps85_glm, "age", 
       gg = TRUE, 
       scale="response") +
  labs(y = "Prob(Married)", 
       x = "Age",
       title = "Relationship of age and marital status",
       subtitle = "controlling for sex, race, and job sector",
       caption = "source: Current Population Survey 1985")

在控制其他变量的情况下，20岁时结婚的概率大约为0.5,60岁时下降到0.1。我们可以通过添加by选项来创建多个条件图。例如，下面的代码将按年龄绘制结婚的概率，男女分开，控制种族和工作部门。

# plot results
library(ggplot2)
library(visreg)
visreg(cps85_glm, "age",
       by = "sex",
       gg = TRUE, 
       scale="response") +
  labs(y = "Prob(Married)", 
       x = "Age",
       title = "Relationship of age and marital status",
       subtitle = "controlling for race and job sector",
       caption = "source: Current Population Survey 1985")

在这些数据中，男性和女性结婚的可能性非常相似。

7.4 生存图

在许多研究设置中，响应变量是事件发生的时间。这在医疗研究中经常是正确的，我们关心的是恢复的时间，死亡的时间，或者复发的时间。

如果该事件没有发生为一个观察(无论是因为研究结束或病人退出)，该观察被称为审查。

生存包中的NCCTG肺癌数据集提供了晚期肺癌患者治疗后生存时间的数据。这项研究对患者进行了长达34个月的跟踪调查。

每个病人的结果由两个变量来衡量

time - survival time in days
status - 1=censored, 2=dead

因此，时间=305和状态=2的患者在治疗后活了305天。另一位患者的时间=400，状态=1，至少存活了400天，但随后就消失在研究中。时间=1022，状态=1的患者存活至研究结束(34个月)。

生存图(也称为Kaplan-Meier曲线)可以用来说明一个个体生存到时间t并包括时间t的概率。

# plot survival curve
library(survival)
pacman::p_load(survminer)

data(lung)
sfit <- survfit(Surv(time, status) ~  1, data=lung)
ggsurvplot(sfit,
            title="Kaplan-Meier curve for lung cancer survival")

大约50%的患者在治疗后300天仍然存活。运行summary(sfit)以获得更多细节。患者组是否具有相同的生存概率常常引起人们的极大兴趣。在下一幅图中，我们比较了男性和女性的生存曲线。

# plot survival curve for men and women
sfit <- survfit(Surv(time, status) ~  sex, data=lung)
ggsurvplot(sfit, 
           conf.int=TRUE, 
           pval=TRUE,
           legend.labs=c("Male", "Female"), 
           legend.title="Sex",  
           palette=c("cornflowerblue", "indianred3"), 
           title="Kaplan-Meier Curve for lung cancer survival",
           xlab = "Time (days)")

ggsurvplot有许多选项。特别地，conf.int提供了置信区间，而pval提供了比较生存曲线的log-rank检验。 p值(0.0013)为男性和女性在治疗后生存概率不同提供了有力的证据。

7.5 马赛克图

马赛克图可以使用矩形显示类别变量之间的关系，这些矩形的面积表示任何给定级别组合的个案比例。磁贴的颜色还可以指示变量之间的程度关系。尽管可以使用ggmosaic软件包通过ggplot2创建镶嵌图，但我建议改用vcd软件包。该软件包虽然不会创建ggplot2图形，但它提供了一种更全面的方法来可视化分类数据。

人们对泰坦尼克号着迷（或者对狮子座着迷？）。在泰坦尼克号灾难中，性别和阶级在生存中扮演什么角色？我们可以使用下面的代码来可视化这三个类别变量之间的关系。

Titanic

## , , Age = Child, Survived = No
## 
##       Sex
## Class  Male Female
##   1st     0      0
##   2nd     0      0
##   3rd    35     17
##   Crew    0      0
## 
## , , Age = Adult, Survived = No
## 
##       Sex
## Class  Male Female
##   1st   118      4
##   2nd   154     13
##   3rd   387     89
##   Crew  670      3
## 
## , , Age = Child, Survived = Yes
## 
##       Sex
## Class  Male Female
##   1st     5      1
##   2nd    11     13
##   3rd    13     14
##   Crew    0      0
## 
## , , Age = Adult, Survived = Yes
## 
##       Sex
## Class  Male Female
##   1st    57    140
##   2nd    14     80
##   3rd    75     76
##   Crew  192     20

# create a table
tbl <- xtabs(~Survived + Class + Sex, Titanic)
ftable(Titanic)

##                    Survived  No Yes
## Class Sex    Age                   
## 1st   Male   Child            0   5
##              Adult          118  57
##       Female Child            0   1
##              Adult            4 140
## 2nd   Male   Child            0  11
##              Adult          154  14
##       Female Child            0  13
##              Adult           13  80
## 3rd   Male   Child           35  13
##              Adult          387  75
##       Female Child           17  14
##              Adult           89  76
## Crew  Male   Child            0   0
##              Adult          670 192
##       Female Child            0   0
##              Adult            3  20

# create a mosaic plot from the table
library(vcd)

## Loading required package: grid

mosaic(Titanic, main = "Titanic data")

切片的大小与该级别组合中的情况百分比成比例。死亡人数显然比幸存者还多。遇难者主要是三等舱男性乘客和男乘务员（人数最多）。如果我们假设这三个变量是独立的，则可以检查模型中的残差并对阴影进行匹配。在下图中，深蓝色代表比给定独立性超出预期的情况。如果保持独立，深红色代表的案件少于预期。

mosaic(tbl, 
       shade = TRUE,
       legend = TRUE,
       labeling_args = list(set_varnames = c(Sex = "Gender",
                                             Survived = "Survived",
                                             Class = "Passenger Class")),
       set_labels = list(Survived = c("No", "Yes"),
                         Class = c("1st", "2nd", "3rd", "Crew"),
                         Sex = c("F", "M")),
       main = "Titanic data")

## Warning in legend(residuals, gpfun, residuals_type): All residuals are zero.