Multiple Regression Model

Suppose we are given a set of variables \(X_{ik}, Y_{i}\), for \(i=1,2,\ldots, n\) and \(k=1,2,\ldots,p\), where \(n\) is the number of observations and \(p\) is the number of regressors, and we want to show whether there exists some relationship (specifically, a linear relationship) between the regressor variables \(X_{ik}\) (also called explanatory variables) and the outcome \(Y_{i}\). The multiple regression model assumes that the relationship between the outcome (\(Y\)) and the regressors (\(X\)) is linear, once we add some noise to every relation. Thus the model takes the form

\[Y_{i}=\beta_0+\beta_1X_{i,1}+\cdots+ \beta_pX_{i,p} + \varepsilon_i = \sum_{k=0}^p \beta_k X_{i,k} + \varepsilon_i = \mathbf{X}_i'\boldsymbol{\beta} + \varepsilon_i,\qquad i=1,2,\ldots,n\] where \(X_{i,0}=1\) for \(i=1,\ldots,n\). Here we understand the notation \(\mathbf v'\) as the transpose of the corresponding vector \(\mathbf v\in\mathbb{R}^{p+1}\).

Sometimes these equations are written in matrix notation as \[\mathbf{Y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\varepsilon}\] where \(\mathbf{Y}=(Y_1,\ldots,Y_n)'\) is the outcome vector, \((\mathbf{X})_{i,k} = X_{i,k}\) is an \(n\times (p+1)\) matrix of regressors, \(\boldsymbol\beta\) is the vector of the \((p+1)\) coefficients, and \(\boldsymbol\varepsilon\) is the noise vector. All vectors are treated as column vectors.
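
Under this notation the least squares estimate is \(\hat{\boldsymbol\beta}=(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{Y}\), which is what lm computes (internally via a QR decomposition rather than an explicit inverse). A minimal sketch on simulated data, just to illustrate the equivalence (all names and values here are made up):

set.seed(1)
n <- 100
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 2*x1 - 0.5*x2 + rnorm(n)        # true coefficients: 1, 2, -0.5
X  <- cbind(1, x1, x2)                    # design matrix with a column of ones
betaHat <- solve(t(X) %*% X, t(X) %*% y)  # closed-form least squares solution
cbind(betaHat, coef(lm(y ~ x1 + x2)))     # the two columns should agree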

The model is called linear because it is linear in the coefficients \(\beta\), not necessarily in the regressors themselves. For example, a model such as \(Y_i=\beta_0+\beta_1X_{i,1}+\beta_2X_{i,1}^2+\varepsilon_i\) is still a linear model, since it is linear in the \(\beta\)'s.

We also make normality assumptions, i.e., the errors \(\varepsilon_i \sim \mathcal{N}(0,\sigma^2)\) are independent with a common variance. (Usually we don't know the value of \(\sigma^2\).)
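
In practice \(\sigma^2\) is estimated from the residuals as \[\hat\sigma^2=\frac{1}{n-p-1}\sum_{i=1}^n\left(Y_i-\mathbf{X}_i'\hat{\boldsymbol\beta}\right)^2,\] whose square root is the residual standard error reported by the summary of an lm fit (available as summary(fit)$sigma).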

Swiss Data

We load the swiss dataset, which contains a standardized fertility measure and socio-economic indicators for each of the 47 French-speaking provinces of Switzerland at about 1888. The variables of this dataset are:

  • Fertility: Ig, ‘common standardized fertility measure’
  • Agriculture: % of males involved in agriculture as occupation
  • Examination: % draftees receiving highest mark on army examination
  • Education: % education beyond primary school for draftees.
  • Catholic: % ‘catholic’ (as opposed to ‘protestant’).
  • Infant.Mortality: live births who live less than 1 year.

All variables but Fertility give proportions of the population.

data("swiss")
str(swiss)
## 'data.frame':    47 obs. of  6 variables:
##  $ Fertility       : num  80.2 83.1 92.5 85.8 76.9 76.1 83.8 92.4 82.4 82.9 ...
##  $ Agriculture     : num  17 45.1 39.7 36.5 43.5 35.3 70.2 67.8 53.3 45.2 ...
##  $ Examination     : int  15 6 5 12 17 9 16 14 12 16 ...
##  $ Education       : int  12 9 5 7 15 7 7 8 7 13 ...
##  $ Catholic        : num  9.96 84.84 93.4 33.77 5.16 ...
##  $ Infant.Mortality: num  22.2 22.2 20.2 20.3 20.6 26.6 23.6 24.9 21 24.4 ...

First steps

We start with a pairs/correlation plot of the variables. The upper triangle of the matrix gives the correlation between each pair of variables; for example, the empirical correlation between the Fertility and Agriculture variables has a value of 0.3530792. The lower triangle gives a scatterplot of each pair of variables together with a fitted loess smooth curve (NOT A LINEAR MODEL!).

library(ggplot2); library(GGally)
ggpairs(swiss, lower = list(continuous = "smooth"))

Now we investigate the performance of the linear regression with all variables as regressors and Fertility as the outcome.

fitAll <- lm(Fertility ~ ., swiss)
summary(fitAll)$coefficients
##                    Estimate  Std. Error   t value     Pr(>|t|)
## (Intercept)      66.9151817 10.70603759  6.250229 1.906051e-07
## Agriculture      -0.1721140  0.07030392 -2.448142 1.872715e-02
## Examination      -0.2580082  0.25387820 -1.016268 3.154617e-01
## Education        -0.8709401  0.18302860 -4.758492 2.430605e-05
## Catholic          0.1041153  0.03525785  2.952969 5.190079e-03
## Infant.Mortality  1.0770481  0.38171965  2.821568 7.335715e-03

So, how do we interpret these results? Let's focus on the estimated weight for Agriculture: -0.172114. Our model estimates an expected 0.17 decrease (because the weight is negative) in standardized fertility for every one-percentage-point increase in the percentage of males involved in agriculture, while the rest of the variables are held fixed (constant). Also, if we want to perform the test

\[H_0: \beta_{\text{Agr}}=0\qquad \text{vs}\qquad H_a:\beta_{\text{Agr}}\neq 0\] the \(t\)-statistic turns out to be significant, with a \(p\)-value of 0.018.
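
As a sanity check, the \(t\)-statistic is simply the estimate divided by its standard error, and the \(p\)-value comes from a \(t\) distribution with \(n-p-1 = 47-5-1 = 41\) degrees of freedom. Recomputing it by hand from the values reported in the table above:

tAgr <- -0.1721140 / 0.07030392                      # estimate / standard error
2 * pt(abs(tAgr), df = 47 - 6, lower.tail = FALSE)   # two-sided p-value, about 0.019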

Now we contrast the previous model with one where we take away all the regressors except the intercept and the Agriculture variable.

fitAgr <- lm(Fertility ~ Agriculture, swiss)
summary(fitAgr)$coefficients
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 60.3043752 4.25125562 14.185074 3.216304e-18
## Agriculture  0.1942017 0.07671176  2.531577 1.491720e-02

But now the Agriculture variable has a positive (directly proportional) impact on Fertility, and this result, using ONLY this variable as a regressor, is significant too. So this contrast shows how consequential the decision of removing a certain variable from our model can be. Many things have to be examined before that kind of decision is made, for example, the correlation between variables or external information such as empirical knowledge.
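
One quick diagnostic is to inspect the pairwise correlations among the variables (the same information shown in the upper triangle of the ggpairs plot above); strongly correlated regressors are a typical source of this kind of sign flip:

round(cor(swiss), 2)    # correlation matrix of all variables in the dataset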

Now, let's see what happens when we include an unnecessary variable in our model. For that, consider a new variable z defined as a linear combination of two variables that are already in our model.

z <- swiss$Agriculture + swiss$Education
fitUnn <- lm(Fertility ~ . + z, swiss)
fitUnn
## 
## Call:
## lm(formula = Fertility ~ . + z, data = swiss)
## 
## Coefficients:
##      (Intercept)       Agriculture       Examination         Education  
##          66.9152           -0.1721           -0.2580           -0.8709  
##         Catholic  Infant.Mortality                 z  
##           0.1041            1.0770                NA

So, fortunately, R detects the redundant variable and gives it an NA coefficient. Thus, the next time we get an NA coefficient, an exact (or nearly exact) linear combination among the regressors can be one of the reasons for it.
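
When this happens, the alias command can reveal exactly which linear combination produced the NA coefficient:

alias(fitUnn)    # shows that z is Agriculture + Education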

Dummy variables

Also, we can add factor variables. Consider the linear model

\[Y_{i}= \beta_0+\beta_1 B_{i,1} + \varepsilon_i\]

where \(B_{i,1}\) is a binary variable for every \(i=1,2,\ldots,n\). We can think of the control group of a clinical trial (\(1\) if the patient is in the control group and \(0\) for placebo). Then the mean for the control group (\(B_{i,1}=1\)) is \(\mathbf{E}[Y_i]=\beta_0+\beta_1\), and for the people not in the control group it is \(\mathbf{E}[Y_i]=\beta_0\). Hence \(\beta_1\) can be interpreted as the change (increase or decrease) when comparing those in the group to those not in it. Also, once we include a binary variable that codes whether someone is in the control group, it is not necessary to include a second binary variable coding that they are not.
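
In symbols, the dummy coefficient is exactly the difference between the two group means: \[\mathbf{E}[Y_i\mid B_{i,1}=1]-\mathbf{E}[Y_i\mid B_{i,1}=0]=(\beta_0+\beta_1)-\beta_0=\beta_1.\]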

For more than 2 levels (more than two categories), for example the religious belief of people, we should expect more than two different answers: Buddhism, Christianity, Hinduism, agnosticism and Judaism (to name some). Let us consider a model that tries to measure the response of some variable \(Y_i\) in terms of political parties in the USA: Republican, Democrat and Independent. So our linear model would be

\[Y_i = \beta_0 + \beta_1X_{i,1}+\beta_2X_{i,2} + \varepsilon_i\] where \(X_{i,1}= 1\) for Republicans and 0 otherwise, and \(X_{i,2}= 1\) for Democrats and 0 otherwise. Note that we do not include a binary variable for the Independent party because it would be redundant, i.e., if subject \(i\) is neither Republican nor Democrat then they must be Independent. Also, by computing the means for every case we have

  • for Republicans: \(\mathbf{E}[Y_i]=\beta_0 +\beta_1\),
  • for Democrats: \(\mathbf{E}[Y_i]=\beta_0 + \beta_2\), and
  • for Independents: \(\mathbf{E}[Y_i]=\beta_0\).

Then we must interpret the coefficients as follows:

  • \(\beta_1\) is the difference between Republicans and Independents;
  • \(\beta_2\) measures the difference between Democrats and Independents; and
  • \(\beta_1-\beta_2\) is the difference between Republicans and Democrats (see the short check below).
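
For instance, the last bullet follows directly from the group means listed above: \[\beta_1-\beta_2=(\beta_0+\beta_1)-(\beta_0+\beta_2)=\mathbf{E}[Y_i\mid \text{Republican}]-\mathbf{E}[Y_i\mid \text{Democrat}].\]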

Insect Sprays dataset

Let's see an example. First we load the InsectSprays dataset from the datasets package. This dataset measures the counts of insects in agricultural experimental units treated with different insecticides and includes 72 observations on 2 variables:

  • count: Insect count, and
  • spray: The type of spray (factor variable).
data("InsectSprays")
str(InsectSprays)
## 'data.frame':    72 obs. of  2 variables:
##  $ count: num  10 7 20 14 14 12 10 23 17 20 ...
##  $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

Linear Regression with Dummy variables

Remember, it is useful to plot our data before running any model.

library(dplyr)   # for the %>% pipe (and group_by/mutate used later)
InsectSprays %>% ggplot(aes(y = count, x = spray, fill = spray)) +
        geom_violin(color = "black", size = 1) +
        xlab("Type of spray") + ylab("Count")

Since "A" is the first level of the factor variable spray (which we can check with levels(InsectSprays$spray)), when we run a linear model of count on spray the lm command will take that first level as the reference level. That is:

fitInsects <- lm(count~spray, InsectSprays)
summary(fitInsects)$coefficients
##                Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)  14.5000000   1.132156 12.8074279 1.470512e-19
## sprayB        0.8333333   1.601110  0.5204724 6.044761e-01
## sprayC      -12.4166667   1.601110 -7.7550382 7.266893e-11
## sprayD       -9.5833333   1.601110 -5.9854322 9.816910e-08
## sprayE      -11.0000000   1.601110 -6.8702352 2.753922e-09
## sprayF        2.1666667   1.601110  1.3532281 1.805998e-01

The interpretation of the coefficients depends completely on the reference level. For example, the coefficient for sprayD, -9.5833333, tells us that the difference in mean count between spray D and spray A is -9.583. On the other hand, the intercept coefficient, 14.5, is the mean of the count variable for sprayA. Following this mean comparison, we can obtain the mean of every spray by taking the sum of the intercept coefficient and the respective coefficient of that spray. Let's check this for the mean of sprayC.

data.frame(
        "Sum Coefficients" = sum(summary(fitInsects)$coefficients[c(1,3),1]), 
           "Mean from data" = mean(subset(InsectSprays, spray == "C")$count))
##   Sum.Coefficients Mean.from.data
## 1         2.083333       2.083333

Now, what happens if we explicitly include a binary variable for every spray, including sprayA, and what if we also remove the intercept? Creating dummy variables is easy with the dummy_cols command from the fastDummies library.

library(fastDummies)
InsectSprayDummy <- dummy_cols(InsectSprays)
head(InsectSprayDummy)
##   count spray spray_A spray_B spray_C spray_D spray_E spray_F
## 1    10     A       1       0       0       0       0       0
## 2     7     A       1       0       0       0       0       0
## 3    20     A       1       0       0       0       0       0
## 4    14     A       1       0       0       0       0       0
## 5    14     A       1       0       0       0       0       0
## 6    12     A       1       0       0       0       0       0
fitAllDummy <- lm(count ~ . -spray, InsectSprayDummy) 
summary(fitAllDummy)$coefficients
##               Estimate Std. Error    t value     Pr(>|t|)
## (Intercept)  16.666667   1.132156 14.7211815 1.573471e-22
## spray_A      -2.166667   1.601110 -1.3532281 1.805998e-01
## spray_B      -1.333333   1.601110 -0.8327558 4.079858e-01
## spray_C     -14.583333   1.601110 -9.1082663 2.794343e-13
## spray_D     -11.750000   1.601110 -7.3386603 4.035610e-10
## spray_E     -13.166667   1.601110 -8.2234633 1.054387e-11

The command automatically drops the last dummy variable because, as we said a few lines ago, it is redundant to include all six parameters together with an intercept when there are only six means. On the other hand, when we exclude the intercept coefficient the output turns out to be different.

fitNoInter <- lm(count ~ spray - 1, InsectSprays) 
summary(fitNoInter)$coefficients
##         Estimate Std. Error   t value     Pr(>|t|)
## sprayA 14.500000   1.132156 12.807428 1.470512e-19
## sprayB 15.333333   1.132156 13.543487 1.001994e-20
## sprayC  2.083333   1.132156  1.840148 7.024334e-02
## sprayD  4.916667   1.132156  4.342749 4.953047e-05
## sprayE  3.500000   1.132156  3.091448 2.916794e-03
## sprayF 16.666667   1.132156 14.721181 1.573471e-22

This model computes the count mean of every spray. We can verify this by taking the sum of the coefficients like before.

Coef <- summary(fitInsects)$coefficients
MeansData <- InsectSprays %>% group_by(spray) %>% summarise(avg = mean(count))
data.frame(
        "Means from regression" = Coef[1,1] + c(0,Coef[-1,1]),
        "Means from data" = MeansData$avg, 
        row.names = paste0("spray", LETTERS[1:6]))
##        Means.from.regression Means.from.data
## sprayA             14.500000       14.500000
## sprayB             15.333333       15.333333
## sprayC              2.083333        2.083333
## sprayD              4.916667        4.916667
## sprayE              3.500000        3.500000
## sprayF             16.666667       16.666667

Although the means can be derived from both linear models, the \(p\)-values turn out to have a different interpretation. In the model without the intercept and with all binary variables included, the resulting \(p\)-values come from testing whether the mean insect count for each spray is zero, i.e.,

\[H_0: \beta_i^\star = 0 \qquad\text{vs}\qquad H_a: \beta_i^\star\neq0.\]

where every \(\beta_i^\star\) has to be interpreted as the mean of the count variable for each spray separately.

Conversely, when we include the intercept and drop the reference level, the test performed for each coefficient is whether that spray differs from the reference one. This is because the interpretation of the coefficients changes from one parameterization to the other.

Finally, if we need to change the reference level we can do it with the relevel command. For example, suppose we want to set sprayC as the reference level and compare its mean with the other sprays via hypothesis tests.

spray2 <- relevel(InsectSprays$spray, "C")
fitRefC <- lm(count ~ spray2, InsectSprays)
summary(fitRefC)$coef
##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept)  2.083333   1.132156 1.840148 7.024334e-02
## spray2A     12.416667   1.601110 7.755038 7.266893e-11
## spray2B     13.250000   1.601110 8.275511 8.509776e-12
## spray2D      2.833333   1.601110 1.769606 8.141205e-02
## spray2E      1.416667   1.601110 0.884803 3.794750e-01
## spray2F     14.583333   1.601110 9.108266 2.794343e-13

Note that the coefficients have changed because now we are comparing each spray with our reference level, i.e., sprayC.

In summary, the estimated coefficients (and the group means they encode) can be obtained from any of these models, no matter which reference level we pick. However, the selection of the reference level should be a decision based on empirical knowledge or the purpose of our research.

Adjust the linear model

Suppose we need to explain some variable, given by \(Y_i\), with two kinds of regressors: a continuous one and a binary one. Of course we can apply a linear regression model: suppose \(X_{i,2}\) is the continuous variable and \(B_{i,1}\) the binary one; then the model could be:

\[Y_{i} = \beta_0 + \beta_1 B_{i,1} + \beta_2X_{i,2} +\varepsilon_i\]

Recall that the interpretation of the coefficients is obtained by taking the expected value of the model. For simplicity, suppose that \(B_{i,1}=1\) when \(i\in \Upsilon\) (recall the example of control groups) and \(B_{i,1}=0\) otherwise. Then we have the following means:

  • When \(i\in\Upsilon\), \(\mathbf{E}[Y_i]= \beta_0 + \beta_1 + \beta_2X_{i,2} = \beta_\star + \beta_2X_{i,2}\)
  • When \(i\notin\Upsilon\), \(\mathbf{E}[Y_i]= \beta_0 + \beta_2X_{i,2}\)

So we get two different lines: they share the same slope but have different intercepts. Almost the same information, right? In other words, this model tells us that the two different responses of the binary variable only cause a vertical shift of the line. Let's check an example…

For example, recall the swiss dataset from the first section.

head(swiss)
##              Fertility Agriculture Examination Education Catholic
## Courtelary        80.2        17.0          15        12     9.96
## Delemont          83.1        45.1           6         9    84.84
## Franches-Mnt      92.5        39.7           5         5    93.40
## Moutier           85.8        36.5          12         7    33.77
## Neuveville        76.9        43.5          17        15     5.16
## Porrentruy        76.1        35.3           9         7    90.57
##              Infant.Mortality
## Courtelary               22.2
## Delemont                 22.2
## Franches-Mnt             20.2
## Moutier                  20.3
## Neuveville               20.6
## Porrentruy               26.6

Here we don't have a binary variable, but we can create one from the data. The Catholic variable represents the percentage of the population with Catholicism as religious belief. Then, consider a binary variable \(B_{i}\) that is \(0\) when at most \(50\%\) of the population is Catholic, and \(1\) otherwise.

swiss2 <- swiss %>% mutate(catholicBin = factor(ifelse(Catholic <= 50, 0, 1)))
head(swiss2)
##   Fertility Agriculture Examination Education Catholic Infant.Mortality
## 1      80.2        17.0          15        12     9.96             22.2
## 2      83.1        45.1           6         9    84.84             22.2
## 3      92.5        39.7           5         5    93.40             20.2
## 4      85.8        36.5          12         7    33.77             20.3
## 5      76.9        43.5          17        15     5.16             20.6
## 6      76.1        35.3           9         7    90.57             26.6
##   catholicBin
## 1           0
## 2           1
## 3           1
## 4           0
## 5           0
## 6           1

We can think of our new variable as answering the question Is the population mostly Catholic? Now suppose we want to explain Fertility via Agriculture.

fit1 <- lm(Fertility ~ Agriculture, swiss2)
summary(fit1)$coef
##               Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 60.3043752 4.25125562 14.185074 3.216304e-18
## Agriculture  0.1942017 0.07671176  2.531577 1.491720e-02
swiss2 %>% ggplot(aes(x=Agriculture, y=Fertility)) + geom_point(size=2) +
        geom_abline(intercept = coef(fit1)[1], slope = coef(fit1)[2], 
                    size = 1, alpha = 0.8, color = "darkgreen")

Positive impact from Agriculture. Now, what happens if we add the binary variable catholicBin?

swiss2 %>% ggplot(aes(x=Agriculture, y=Fertility, color=catholicBin)) +
        geom_point(size = 2) + 
        geom_abline(intercept = coef(fit1)[1], slope = coef(fit1)[2], 
                    size = 1, alpha = 0.8, color = "darkgreen") 

Now we can see that the fitted line could differ depending on the value of catholicBin.

fit2 <- lm(Fertility ~ Agriculture + catholicBin, swiss2)
summary(fit2)$coef
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)  60.8322366  4.1058630 14.815944 1.032493e-18
## Agriculture   0.1241776  0.0810977  1.531210 1.328763e-01
## catholicBin1  7.8843292  3.7483622  2.103406 4.118221e-02

So, as a result of this model, we have two parallel lines (same slope, different intercepts).

swiss2 %>% ggplot(aes(x=Agriculture, y=Fertility, color=catholicBin)) +
        geom_point(size = 2) + 
        geom_abline(intercept = coef(fit2)[1], slope = coef(fit2)[2], 
                    size = 1, alpha = 0.8, color = "#f8766d") +
        geom_abline(intercept = coef(fit2)[1] + coef(fit2)[3], slope = coef(fit2)[2], 
                    size = 1, color = "#00bfc4")

Although this model is completely valid and useful in some contexts, we can extract more information as follows: Consider the linear model

\[Y_i=\beta_0 + \beta_1 B_{i,1} + \beta_2 X_{i,2} + \beta_3 B_{i,1}X_{i,2} + \varepsilon_i.\]

In this case, by taking the expected value for each group we obtain the following fitted lines:

  • If \(i \in\Upsilon\implies \mathbf{E}[Y_i]=\beta_0+ \beta_1+\beta_2X_{i,2} + \beta_3 X_{i,2} = \beta_0^\star + \beta_1^\star X_{i,2}\)
  • If \(i\notin\Upsilon \implies \mathbf{E}[Y_i]=\beta_0+\beta_2X_{i,2}\)
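
In particular, \(\beta_1^\star=\beta_2+\beta_3\), so the interaction coefficient \(\beta_3\) is the difference in slopes between the two groups, while \(\beta_1\) is the difference in intercepts.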

So, in this case we obtain two completely different fitted lines, one for each response of the binary variable. Let's see how this works with the lm command.

Recall the lm (basic) syntax:

lm(formula, data)

We could pass the formula of our adjusted model by writing the interaction term \(B_{i,1}X_{i,2}\) explicitly (as varB:varX). Fortunately, R is pretty smart (I mean the people who have programmed everything here) and by setting formula = outcome ~ varB*varX we fit exactly the previous model, i.e., we obtain estimates of both main effects and the interaction coefficient.

fit3 <- lm(Fertility ~ catholicBin*Agriculture, swiss2)
summary(fit3)$coef
##                             Estimate  Std. Error    t value     Pr(>|t|)
## (Intercept)              62.04993019  4.78915566 12.9563402 1.919379e-16
## catholicBin1              2.85770359 10.62644275  0.2689238 7.892745e-01
## Agriculture               0.09611572  0.09881204  0.9727127 3.361364e-01
## catholicBin1:Agriculture  0.08913512  0.17610660  0.5061430 6.153416e-01
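
The * in the formula is shorthand for both main effects plus their interaction, so the expanded formula gives identical estimates; a quick verification (fit3b is just an auxiliary name):

fit3b <- lm(Fertility ~ catholicBin + Agriculture + catholicBin:Agriculture, swiss2)
all.equal(coef(fit3), coef(fit3b))    # should be TRUE: the two formulas are equivalent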

We can also plot this in two different ways: by hardcoding the fitted lines, and by using geom_smooth(method="lm") from the ggplot2 library.

library(gridExtra); library(ggpubr)
p1 <- swiss2 %>% ggplot(aes(x=Agriculture, y=Fertility, color=catholicBin)) +
        geom_point(size = 2) + 
        geom_abline(intercept = coef(fit3)[1], slope = coef(fit3)[3], 
                    size = 1, alpha = 0.8, color = "#f8766d") +
        geom_abline(intercept = coef(fit3)[1] + coef(fit3)[2], 
                    slope = coef(fit3)[3] + coef(fit3)[4],
                    size = 1, color = "#00bfc4") + ggtitle("Hardcoding") + 
        theme(legend.position = "bottom")

p2 <- swiss2 %>% ggplot(aes(x=Agriculture, y=Fertility, color=catholicBin)) +
        geom_point(size = 2) + geom_smooth(method="lm") + ggtitle("geom_smooth()") + 
        theme(legend.position = "bottom") + stat_regline_equation()

grid.arrange(p1,p2, nrow=1)

Adjustment

Until this point, we have only seen how to specify the model, how to interpret the coefficients, and the hypothesis tests performed by the model. Now we want to explore how some variables can impact the effect of other variables on the resulting outcome. Recall the idea of control groups and the effect of considering a linear model without interaction, i.e., \(Y=\beta_0+\beta_1B+\beta_2X\) where \(B\) is a binary variable. In the examples below the data is simulated and we are not going to check the numerical results; the goal is only to give some intuition about what is happening and about when to be cautious.

In the plot shown just above, the horizontal dashed lines represent the means of the data grouped by the response of the binary variable. So, the gap between the two dashed lines is the marginal effect when disregarding the x variable. About the same quantity can be found as the difference of the intercepts of the two fitted lines (if you aren't sure why this is true, check the previous section and the interpretation of the coefficients when adding a binary variable with no interaction with the other regressor). In this case, we can also compare the groups directly by choosing some (suitably) short interval of x which includes observations from both groups. However, this is not always the case.

Now we have two groups for which a direct comparison is not possible. Moreover, the marginal effect of disregarding x is larger than the estimated difference between the two groups, i.e., than the direct comparison of the two fitted intercepts. For example, if the data tell us that some subject has an observed x value less than 1 or greater than 1.5, then we already know whether this subject belongs to the control group or not. In other words, the variable x is already explaining the effect of the binary variable.
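
Since the simulated data behind these plots is not reproduced here, the following is a minimal sketch of how such a scenario could be generated and visualized (all variable names and parameter values are illustrative, not the ones behind the original figures):

library(ggplot2); library(dplyr)
set.seed(42)
n <- 100
# two groups that barely overlap in x, so x almost determines group membership
simdat <- data.frame(group = factor(rep(c(0, 1), each = n/2)),
                     x     = c(runif(n/2, 0, 1.1), runif(n/2, 0.9, 2)))
simdat$y <- 1 + 2*(simdat$group == 1) + 1.5*simdat$x + rnorm(n, sd = 0.3)

fitSim <- lm(y ~ x + group, simdat)                   # model without interaction
groupMeans <- simdat %>% group_by(group) %>% summarise(avg = mean(y))

ggplot(simdat, aes(x = x, y = y, color = group)) + geom_point() +
        geom_hline(data = groupMeans, aes(yintercept = avg, color = group),
                   linetype = "dashed") +             # marginal means, disregarding x
        geom_abline(intercept = coef(fitSim)[1], slope = coef(fitSim)[2]) +
        geom_abline(intercept = coef(fitSim)[1] + coef(fitSim)[3],
                    slope = coef(fitSim)[2])          # parallel fitted lines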

These notes aren't finished yet.