In this document I explore a situation in which the slope of a line depends on other variables.
Assume we have data on some beetle species, consisting of the age in days and the body length in millimeters; we have also recorded the sex of each beetle. The idea is that the relation between age and length differs between the sexes. Next we simulate data for this scenario:
# Packages used below: dplyr for %>% and ggplot2 for plotting
library(dplyr)
library(ggplot2)
set.seed(333)
n <- 250
# Randomly assign a sex to each of the n beetles
Sex <- sample(c("F", "M"), n, replace = TRUE)
# Ages uniformly distributed between 18 and 100 days
Age <- runif(n, 18, 100)
z1 <- data.frame(Sex = factor(Sex), Age, mm_1 = NA, mm_2 = NA)
# Case 1: same slope (1.2) for both sexes, intercepts differ by 30
z1[z1$Sex %in% "F", "mm_1"] <- 77 + 1.2 * z1[z1$Sex %in% "F", "Age"] + rnorm(nrow(z1[z1$Sex %in% "F", ]), 0, sd = 70)
z1[z1$Sex %in% "M", "mm_1"] <- 77 + 30 + 1.2 * z1[z1$Sex %in% "M", "Age"] + rnorm(nrow(z1[z1$Sex %in% "M", ]), 0, sd = 70)
# Case 2: intercepts differ by 18 and slopes differ by 6.6
z1[z1$Sex %in% "F", "mm_2"] <- 77 + 1.2 * z1[z1$Sex %in% "F", "Age"] + rnorm(nrow(z1[z1$Sex %in% "F", ]), 0, sd = 70)
z1[z1$Sex %in% "M", "mm_2"] <- 77 + 18 + (1.2 + 6.6) * z1[z1$Sex %in% "M", "Age"] + rnorm(nrow(z1[z1$Sex %in% "M", ]), 0, sd = 70)
p1 <- z1 %>% ggplot(aes(x = Age, y = mm_1, color = Sex)) + geom_point() + theme_bw() + labs(title = "Case 1")
p2 <- z1 %>% ggplot(aes(x = Age, y = mm_2, color = Sex)) + geom_point() + theme_bw() + labs(title = "Case 2")
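The two panels can be rendered side by side; one way is with the gridExtra package (an assumption on my part, any layout tool such as patchwork works equally well):
# Arrange the two ggplot objects in one row (assumes gridExtra is installed)
library(gridExtra)
grid.arrange(p1, p2, ncol = 2)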
Figure 1.1: Scatterplots of Age vs length (in mm) by sex of the beetles.
We can formulate a linear model for this scenario as follows:
\[\begin{equation} E(mm|Age=x_1, I_M=x_2)= \beta_0+\beta_1x_1+\beta_2x_2+\beta_3(x_1 x_2)\tag{1.1} \end{equation}\]
Where \(I_M\) is an indicator variable for sex \(M\), that is \(I_M=0\) for female beetles and \(I_M=1\) for male beetles.
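In R this indicator is built automatically when a factor enters a model formula; inspecting the design matrix makes the coding explicit (model.matrix is base R, and the rows shown simply correspond to the first beetles in the simulated data):
# The column SexM of the design matrix is exactly the indicator I_M above
head(model.matrix(~ Sex, data = z1))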
Model (1.1) is special in that it actually specifies two simple linear regressions of mm on Age, one for each of the two sexes. Let us verify this by evaluating model (1.1) for each sex.
For female beetles model (1.1) results in
\[\begin{equation} E(mm|Age=x_1, I_M=0)= \beta_0+\beta_1x_1+\beta_2 \times 0 + \beta_3(x_1 \times 0)= \beta_0+\beta_1x_1\tag{1.2} \end{equation}\]
So \(\beta_0\) is the intercept and \(\beta_1\) is the slope of \(mm\) on \(Age\) for female beetles. Now let us check the model for male beetles:
\[\begin{equation} E(mm|Age=x_1, I_M=1)= \beta_0+\beta_1x_1+\beta_2 \times 1 + \beta_3(x_1 \times 1)= (\beta_0+\beta_2)+(\beta_1+\beta_3)x_1 \tag{1.3} \end{equation}\]
In this case, the intercept and slope for male beetles are \((\beta_0+\beta_2)\) and \((\beta_1+\beta_3)\) respectively. Thus \(\beta_2\) can be read as the difference in intercepts between male and female beetles and, similarly, \(\beta_3\) as the difference in slopes.
It should now be clear that model (1.1) is effectively modeling a different line for each sex, and in particular a different slope for each sex.
Let us now explore how to fit such a model in R:
For case 1:
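The output below comes from fitting the interaction model with lm (the formula is the one shown in the Call; the object name m1 is arbitrary):
# Fit model (1.1): Age * Sex expands to Age + Sex + Age:Sex
m1 <- lm(mm_1 ~ Age * Sex, data = z1)
summary(m1)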
##
## Call:
## lm(formula = mm_1 ~ Age * Sex, data = z1)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -185.106  -45.195   -1.139   44.537  217.417
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  79.0386    15.3770   5.140 5.60e-07 ***
## Age           1.4992     0.2436   6.154 3.05e-09 ***
## SexM         14.3469    23.3845   0.614    0.540
## Age:SexM     -0.2327     0.3621  -0.643    0.521
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 67.27 on 246 degrees of freedom
## Multiple R-squared:  0.1975, Adjusted R-squared:  0.1877
## F-statistic: 20.18 on 3 and 246 DF,  p-value: 9.997e-12
And for case 2:
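The fit is identical except for the response (again, the object name m2 is arbitrary):
# Same interaction model for the second simulated response
m2 <- lm(mm_2 ~ Age * Sex, data = z1)
summary(m2)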
##
## Call:
## lm(formula = mm_2 ~ Age * Sex, data = z1)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -223.641  -38.973   -5.263   51.094  193.015
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  76.6988    16.2271   4.727 3.85e-06 ***
## Age           1.3680     0.2571   5.322 2.32e-07 ***
## SexM          0.6211    24.6774   0.025     0.98
## Age:SexM      6.5763     0.3821  17.209  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 70.99 on 246 degrees of freedom
## Multiple R-squared:  0.9218, Adjusted R-squared:  0.9208
## F-statistic: 966.5 on 3 and 246 DF,  p-value: < 2.2e-16
As intended (you can verify this by checking how the data for mm_1 and mm_2 were simulated), in case 1 the only difference is in the intercepts: the slopes are not significantly different, since the Age:SexM coefficient is not significant. In case 2, by contrast, the slopes are significantly different, since the Age:SexM coefficient is significant.
The above example illustrated how a categorical variable can be used to influence the slope of the line describing the statistical association between two numeric variables. This scenario led me to wonder whether the same is possible with a numeric variable; that is, a numeric variable affecting the slope of the line describing the statistical association between two other numeric variables. Something of the form
\[\begin{equation} E(Y|X=x, Z=z)= \beta_0+(\beta_1+\beta_2z)x \tag{2.1} \end{equation}\]
Let us see how the mathematical model (2.1) can be represented in a three dimensional space:
Figure 2.1: Graphical representation of model (2.1)
The surface in Figure 2.1 has the following parameters:
\[\begin{equation} Y=-0.06599867+(0.0+1.5z)x \tag{2.2} \end{equation}\]
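A minimal base-R sketch of how such a surface can be drawn (the grid resolution and viewing angles are arbitrary choices):
# Evaluate equation (2.2) on a grid and draw the surface with persp()
xs <- seq(-1, 1, length.out = 30)
zs <- seq(-1, 1, length.out = 30)
ys <- outer(xs, zs, function(x, z) -0.06599867 + (0 + 1.5 * z) * x)
persp(xs, zs, ys, xlab = "x", ylab = "z", zlab = "y", theta = 30, phi = 20)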
Now we simulate some data following equation (2.2):
# expand_grid() comes from the tidyr package
library(tidyr)
n <- 25
z <- seq(-1, 1, length.out = n)
x <- seq(-1, 1, length.out = n)
# All combinations of x and z: a 25 x 25 grid of 625 points
z2 <- expand_grid(x, z) %>% data.frame()
slpe_coefs <- c(0, 1.5)
intrcp <- -0.06599867
# y follows equation (2.2) plus Gaussian noise
z2$y <- intrcp + (slpe_coefs[1] + slpe_coefs[2] * z2$z) * z2$x + rnorm(nrow(z2), sd = 0.1)
summary(z2)
##        x            z            y
##  Min.   :-1.0   Min.   :-1.0   Min.   :-1.59765
##  1st Qu.:-0.5   1st Qu.:-0.5   1st Qu.:-0.36109
##  Median : 0.0   Median : 0.0   Median :-0.05965
##  Mean   : 0.0   Mean   : 0.0   Mean   :-0.06670
##  3rd Qu.: 0.5   3rd Qu.: 0.5   3rd Qu.: 0.23599
##  Max.   : 1.0   Max.   : 1.0   Max.   : 1.40399
Next we present a 3D scatterplot of this data set.
Figure 2.2: Scatterplot of the simulated data following model (2.2)
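The plot itself can be produced in several ways; a sketch assuming the lattice package is available:
# 3D scatterplot of the simulated points (assumes lattice is installed)
library(lattice)
cloud(y ~ x * z, data = z2)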
The challenge now is how to specify this model appropriately in a linear regression. First we may note that model (2.1) can be rewritten as
\[\begin{equation} E(Y|X=x, Z=z)= \beta_0+\beta_1x+\beta_2zx \tag{2.3} \end{equation}\]
Let’s try to do this with the lm function:
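This is the call that produced the output below (the object name fit_a is arbitrary):
# First attempt: in a formula, x * z expands to x + z + x:z
fit_a <- lm(y ~ x + x * z, data = z2)
summary(fit_a)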
##
## Call:
## lm(formula = y ~ x + x * z, data = z2)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.315515 -0.068998 -0.003604  0.061892  0.310250
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0666986  0.0039675 -16.811   <2e-16 ***
## x            0.0082439  0.0066024   1.249    0.212
## z            0.0006239  0.0066024   0.094    0.925
## x:z          1.4884205  0.0109870 135.471   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09919 on 621 degrees of freedom
## Multiple R-squared:  0.9673, Adjusted R-squared:  0.9671
## F-statistic:  6118 on 3 and 621 DF,  p-value: < 2.2e-16
We got an extra term associated with the z variable, which does not appear in model (2.3). Let’s try again:
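The fix is to subtract the unwanted main effect (again, fit_b is an arbitrary name):
# The - operator removes a term from the expanded formula
fit_b <- lm(y ~ x + x * z - z, data = z2)
summary(fit_b)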
##
## Call:
## lm(formula = y ~ x + x * z - z, data = z2)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.315203 -0.068634 -0.003188  0.062047  0.309626
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.066699   0.003964  -16.82   <2e-16 ***
## x            0.008244   0.006597    1.25    0.212
## x:z          1.488420   0.010978  135.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09911 on 622 degrees of freedom
## Multiple R-squared:  0.9673, Adjusted R-squared:  0.9672
## F-statistic:  9192 on 2 and 622 DF,  p-value: < 2.2e-16
It may be a surprise that subtracting variables inside an lm formula works this way: in R’s formula syntax the - operator removes a term, so y ~ x + x * z - z expands to y ~ x + z + x:z and then drops z. Finally, we could achieve the same using the function I():
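The corresponding call (fit_c is an arbitrary name):
# Inside I(), * is ordinary multiplication rather than formula crossing,
# so I(x * z) enters the model as a single constructed regressor
fit_c <- lm(y ~ x + I(x * z), data = z2)
summary(fit_c)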
##
## Call:
## lm(formula = y ~ x + I(x * z), data = z2)
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.315203 -0.068634 -0.003188  0.062047  0.309626
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.066699   0.003964  -16.82   <2e-16 ***
## x            0.008244   0.006597    1.25    0.212
## I(x * z)     1.488420   0.010978  135.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09911 on 622 degrees of freedom
## Multiple R-squared:  0.9673, Adjusted R-squared:  0.9672
## F-statistic:  9192 on 2 and 622 DF,  p-value: < 2.2e-16
We can see that the parameter estimates are very close to the true values used to simulate the data, shown in equation (2.2).
Linear models are commonly specified as
\[\begin{equation} E(Y|X_1=x_1, X_2=x_2,...,X_p=x_p)= \beta_0+\beta_1x_1+\beta_2x_2+...+\beta_px_p \tag{3.1} \end{equation}\]
Most of the time the \(X_j\) variables are assumed to be unrelated, and one functional consequence of this kind of model is that the effects of the different covariates are independent (also called additive effects). Usually one of the \(X_j\) variables is of principal interest to the researcher, and the other covariates are included in the model only to take into account their effect on the outcome of interest. The inclusion of these secondary covariates in a linear model is intended to “adjust” for their effect and to obtain an “adjusted” or “corrected” effect for the covariate of principal interest, \(X_j\). However, this adjustment only modifies the intercept component of the relation between \(Y\) and \(X_j\); the slope component remains unaffected.
With our proposal, we also allow a secondary covariate to affect the slope component of the relation between \(Y\) and \(X_j\).
Therefore, we propose the following approach to analyze the relation of a covariate of interest, call it \(X\), with an outcome of interest \(Y\), and to study the possible modifying effects of another set of covariates, call them \(Z_1,Z_2,\ldots,Z_p\), on the relation between \(X\) and \(Y\). The model is as follows:
\[\begin{equation} E(Y|X=x, Z_1=z_1,...,Z_p=z_p)= \beta_0+\beta_1x+\gamma_1z_1+...+\gamma_pz_p+\\ (\alpha_1z_1+...+\alpha_pz_p)x \tag{3.2} \end{equation}\]
which can be rewritten as
\[\begin{equation} E(Y|X=x, Z_1=z_1,...,Z_p=z_p)= (\beta_0+\gamma_1z_1+...+\gamma_pz_p)+\\ (\beta_1+\alpha_1z_1+...+\alpha_pz_p)x \tag{3.3} \end{equation}\]
where \((\beta_0+\gamma_1z_1+...+\gamma_pz_p)\) is the intercept component of the linear relation between \(X\) and \(Y\), which depends on \(Z_j\) covariates and \((\beta_1+\alpha_1z_1+...+\alpha_pz_p)\) is the slope component of the linear relation between \(X\) and \(Y\), also affected or modified by \(Z_j\) covariates.
Models (3.2) and (3.3) are linear in their parameters, so the usual theory of linear models applies to their estimation. Care is needed only in the specification of these models, and we have shown that it is possible to fit them using the appropriate formulation in R.
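As a concrete illustration, a sketch of a model of form (3.2) with two modifiers (the data frame dat and all variable names are placeholders):
# x * (z1 + z2) expands to x + z1 + z2 + x:z1 + x:z2, matching (3.2) with p = 2
fit <- lm(y ~ x * (z1 + z2), data = dat)
summary(fit)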