1.1 and 6.1 Simple Linear Regression Model

\(\def\a{\boldsymbol{a}}\) \(\def\b{\boldsymbol{b}}\) \(\def\c{\boldsymbol{c}}\) \(\def\f{\boldsymbol{f}}\) \(\def\g{\boldsymbol{g}}\) \(\def\h{\boldsymbol{h}}\) \(\def\j{\boldsymbol{j}}\) \(\def\t{\boldsymbol{t}}\) \(\def\u{\boldsymbol{u}}\) \(\def\v{\boldsymbol{v}}\) \(\def\x{\boldsymbol{x}}\) \(\def\y{\boldsymbol{y}}\) \(\def\z{\boldsymbol{z}}\) \(\def\A{\boldsymbol{\mathrm{A}}}\) \(\def\B{\boldsymbol{\mathrm{B}}}\) \(\def\C{\boldsymbol{\mathrm{C}}}\) \(\def\D{\boldsymbol{\mathrm{D}}}\) \(\def\E{\boldsymbol{\mathrm{E}}}\) \(\def\H{\boldsymbol{\mathrm{H}}}\) \(\def\I{\boldsymbol{\mathrm{I}}}\) \(\def\J{\boldsymbol{\mathrm{J}}}\) \(\def\M{\boldsymbol{\mathrm{M}}}\) \(\def\O{\boldsymbol{\mathrm{O}}}\) \(\def\P{\boldsymbol{\mathrm{P}}}\) \(\def\Q{\boldsymbol{\mathrm{Q}}}\) \(\def\T{\boldsymbol{\mathrm{T}}}\) \(\def\U{\boldsymbol{\mathrm{U}}}\) \(\def\X{\boldsymbol{\mathrm{X}}}\) \(\def\Y{\boldsymbol{\mathrm{Y}}}\) \(\def\bmu{\boldsymbol{\mu}}\) \(\def\ep{\boldsymbol{\varepsilon}}\) \(\def\Sig{\boldsymbol{\Sigma}}\) \(\def\zeros{\boldsymbol{0}}\) \(\def\diag{\mathrm{diag}}\) \(\def\rank{\mathrm{rank}}\) \(\def\tr{^\top}\) \(\def\ds{\displaystyle}\) \(\def\bea{\begin{eqnarray}}\) \(\def\nnn{\nonumber}\) \(\def\eea{\nnn\end{eqnarray}}\) \(\def\bpm{\begin{pmatrix}}\) \(\def\epm{\end{pmatrix}}\) \(\def\var{\mbox{var}}\) \(\def\cov{\mbox{cov}}\) \(\def\Reals{\mathbb{R}}\) \(\def\abs{\mbox{abs}}\) \(\def\sion{\sum_{i=1}^n}\) \(\def\res{\hat{\varepsilon}}\) \(\newcommand{\EV}[1]{E\left(#1\right)}\) \(\newcommand{\trace}[1]{\mathrm{tr}\left(#1\right)}\)

Historical Origins:

  • Regression analysis was first developed by Sir Francis Galton in the latter part of the 19th Century. Galton had studied the relation between heights of parents and children and noted that the heights of children of both tall and short parents appeared to “revert” or “regress” to the mean of the group. He considered this tendency to be a regression to “mediocrity”. Galton developed a mathematical description of this regression tendency, the precursor of today’s regression models.

Functional Relation between Two Variables

A functional relation: \(y=f(x)\)

\(y\): dependent variable; \(x\): independent variable

Statistical Relation between Two Variables

A statistical relation, unlike a functional relation, is not a perfect one. In general, the observations for a statistical relation do not fall directly on the curve of relationship.

  • R Example 1.1.1: Consider a famous data set on the heights of children and their parents collected by Galton (1886). The dataset GaltonFamilies is included in the R package HistData, which can be installed with the following command.
install.packages("HistData")

The following commands show how to load the package, display its first 10 rows, and display the dimensions of the data frame.

require(HistData)
## Loading required package: HistData
GaltonFamilies[1:10,]
##    family father mother midparentHeight children childNum gender childHeight
## 1     001   78.5   67.0           75.43        4        1   male        73.2
## 2     001   78.5   67.0           75.43        4        2 female        69.2
## 3     001   78.5   67.0           75.43        4        3 female        69.0
## 4     001   78.5   67.0           75.43        4        4 female        69.0
## 5     002   75.5   66.5           73.66        4        1   male        73.5
## 6     002   75.5   66.5           73.66        4        2   male        72.5
## 7     002   75.5   66.5           73.66        4        3 female        65.5
## 8     002   75.5   66.5           73.66        4        4 female        65.5
## 9     003   75.0   64.0           72.06        2        1   male        71.0
## 10    003   75.0   64.0           72.06        2        2 female        68.0
dim(GaltonFamilies)
## [1] 934   8

A scatterplot of the dataset can be obtained as follows:

plot(GaltonFamilies$midparentHeight,GaltonFamilies$childHeight,xlab="Midparent Height",ylab="Child Height")

Here the child height is taken as the dependent or response variable \(y\), and the midparent height is taken as the independent, explanatory, or predictor variable \(x\).

Basic Concepts

A regression model is a formal means of expressing the two essential ingredients of a statistical relation:

  1. A tendency of the response variable \(y\) to vary with the predictor variable \(x\) in a systematic fashion.

  2. A scattering of points around the curve of statistical relationship.

Use of Regression Analysis

  • When deciding which model to use for our data, we need to consider the purpose of the analysis. Common purposes include:
    • Prediction
    • Data Description or Explanation
    • Parameter Estimation
    • Variable Selection or Screening
    • Control of Output

Regression and Causality

  • The existence of a statistical relation between the response variable \(y\) and the explanatory or predictor variable \(x\) does not imply in any way that \(y\) depends causally on \(x\). No matter how strong the statistical relation between \(x\) and \(y\) is, no cause-and-effect pattern is necessarily implied by the regression model. For example, data on size of vocabulary (\(x\)) and writing speed (\(y\)) for a sample of young children aged 5-10 will show a positive regression relation. This relation does not imply, however, that an increase in vocabulary causes a faster writing speed. Here, other explanatory variables, such as the age of the child and amount of education, affect both the vocabulary (\(x\)) and the writing speed (\(y\)): older children have a larger vocabulary and a faster writing speed.

  • Even when a strong statistical relationship reflects causal conditions, the causal conditions may act in the opposite direction, from \(y\) to \(x\). Consider, for instance, the calibration of a thermometer. Here, readings of the thermometer are taken at different known temperatures, and the regression relation is studied so that the accuracy of predictions made by using the thermometer readings can be assessed. For this purpose, the thermometer reading is the predictor variable \(x\), and the actual temperature is the response variable \(y\) to be predicted. However, the causal pattern here does not go from \(x\) to \(y\), but in the opposite direction: the actual temperature (\(y\)) affects the thermometer reading (\(x\)).

Simple Linear Regression Model

\[\begin{align} y=\beta_0+\beta_1x+\varepsilon \label{eq1}\tag{1} \end{align}\]

where \(y\) is the response (dependent) variable and \(x\) is the predictor (independent) variable. \(\beta_0\) and \(\beta_1\) are the intercept and slope, respectively, and \(\varepsilon\) is the model error. We ordinarily view (1) in a data setting where pairs of observations \((x_1,y_1),(x_2,y_2),\ldots,(x_n,y_n)\) are taken on experimental units, and estimates of the parameters \(\beta_0\) and \(\beta_1\) are sought. Thus, for the observed data, the model can be written as \(y_i=\beta_0+\beta_1x_i+\varepsilon_i\) for \(i=1,2,\ldots,n\).
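To visualize the two essential ingredients of a statistical relation, a systematic tendency plus scatter, here is a minimal simulation sketch of model (1); the values \(\beta_0=10\), \(\beta_1=0.8\), and error standard deviation \(2\) are chosen only for illustration.

set.seed(1)                  # for reproducibility
x=runif(50,60,80)            # 50 hypothetical predictor values
eps=rnorm(50,mean=0,sd=2)    # model errors
y=10+0.8*x+eps               # model (1) with beta0=10, beta1=0.8
plot(x,y)                    # points scatter around the line
abline(10,0.8,col="blue")    # the true regression line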

6.2 Estimation of \(\boldsymbol{\beta}_0\), \(\boldsymbol{\beta}_1\), and \(\sigma^2\)

Least-squares Approach

Seek estimators \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize the sum of squares of the deviations \(y_i-\hat{y}_i\) of the \(n\) observed \(y_i\)’s from their predicted values \(\hat{y}_i=\hat{\beta}_0+\hat{\beta}_1x_i\). That is, we minimize the residual sum of squares \(\sum\limits_{i=1}^n(y_i-\hat{y}_i)^2\).
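Differentiating the least squares function \(Q(\beta_0,\beta_1)=\sion(y_i-\beta_0-\beta_1x_i)^2\) with respect to each parameter and setting the derivatives to zero yields the score (normal) equations \[ \bea \nnn \sion(y_i-\hat{\beta}_0-\hat{\beta}_1x_i)&=&0 \\ \nnn \sion x_i(y_i-\hat{\beta}_0-\hat{\beta}_1x_i)&=&0, \eea \] whose solution is \[ \hat{\beta}_1=\frac{\sion(x_i-\bar{x})(y_i-\bar{y})}{\sion(x_i-\bar{x})^2}=\frac{s_{xy}}{s_{x}^2} \quad\mbox{and}\quad \hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}. \] These score equations and closed-form estimates are used repeatedly in Section 6.4 below.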

6.4 Coefficient of Determination

  • The predicted value of \(y\) at \(x\) is \(\hat{y}=\hat{\beta}_0+\hat{\beta}_1 x\).
  • The fitted value of \(y\) at \(x_i\) is \(\hat{y}_i=\hat{\beta}_0+\hat{\beta}_1 x_i=\bar{y}+\hat{\beta}_1(x_i-\bar{x})\).
  • The residuals are the differences between the observed values and the fitted values, denoted by \(\res_i=y_i-\hat{y}_i\).
  • Evaluating the least squares function at \(\hat{\beta}_0\) and \(\hat{\beta}_1\), we obtain the residual sum of squares (or error sum of squares) denoted by \[ SSE=Q(\hat{\beta}_0,\hat{\beta}_1)=\sum_{i=1}^n \res_i^2. \]
  • Suppose we have no information about the \(x\)’s, and want to estimate the mean of the \(y\)’s based only on \(y_1, y_2, \ldots, y_n\). Then the least squares estimate of the mean of the \(y\)’s is \(\bar{y}=\frac{1}{n}\sion y_i\).
  • Consider the following interesting decomposition of \(\ds{(n-1)s_y^2=\sion (y_i-\bar{y})^2}\): \[ \bea \nnn \sion (y_i-\bar{y})^2&=&\sion (y_i-\hat{y}_i+\hat{y}_i-\bar{y})^2 \\ \nnn &=& \sion (y_i-\hat{y}_i)^2+\sion(\hat{y}_i-\bar{y})^2+2\sion(y_i-\hat{y}_i)(\hat{y}_i-\bar{y}) \\ \nnn &=& \sion (y_i-\hat{y}_i)^2+\sion(\hat{y}_i-\bar{y})^2+2\hat{\beta}_1\sion(y_i-\hat{\beta}_0-\hat{\beta}_1x_i)(x_i-\bar{x}) \eea \]
  • Note that the score equations imply that \[ \bea \nnn \sion(y_i-\hat{\beta}_0-\hat{\beta}_1x_i)(x_i-\bar{x}) &=&\sion x_i (y_i-\hat{\beta}_0-\hat{\beta}_1x_i)-\bar{x}\sion(y_i-\hat{\beta}_0-\hat{\beta}_1x_i) \\ \nnn &=& 0+0=0. \eea \]
  • So, we have \[ \bea \nnn \sion (y_i-\bar{y})^2&=& \sion (y_i-\hat{y}_i)^2+\sion(\hat{y}_i-\bar{y})^2 \\ \nnn SST &=& SSE + SSR \eea \] where \(SST=\sion (y_i-\bar{y})^2\) is the total sum of squares and \(SSR=\sion(\hat{y}_i-\bar{y})^2\) is the regression sum of squares.
  • To assess the proportion of variation explained by the regression model compared with that explained using only the mean, we consider the coefficient of determination, defined by \(\frac{SSR}{SST}\).
  • Since \(SST=\sion (y_i-\bar{y})^2=(n-1)s_y^2\) and \(SSR=\sion(\hat{y}_i-\bar{y})^2=\hat{\beta}_1^2\sion(x_i-\bar{x})^2= \hat{\beta}_1^2(n-1)s_{x}^2\), it follows that \[ \bea \nnn \frac{SSR}{SST}&=&\hat{\beta}_1^2\frac{s_{x}^2}{s_{y}^2} \\ \nnn &=& \left(\frac{s_{xy}}{s_{x}^2}\right)^2\frac{s_{x}^2}{s_{y}^2} \\ \nnn &=& \left(\frac{s_{xy}}{s_{x}s_{y}}\right)^2 \\ \nnn &=& r^2 \eea \] where \(\ds{r= \frac{s_{xy}}{s_{x}s_{y}}}\) is the sample correlation between the \(x\)’s and the \(y\)’s.
  • We also can express the estimate of the slope of the regression line in terms of the correlation: \[ \bea \nnn \hat{\beta}_1&=&\frac{s_{xy}}{s_{x}^2}\\ \nnn &=& \frac{rs_{x}s_{y}}{s_{x}^2} \\ \nnn &=& \frac{rs_{y}}{s_{x}}. \eea \]
  • Note that \(s_x\) is the sample standard deviation of the \(x\)’s and \(s_y\) is the sample standard deviation of the \(y\)’s.
  • So, if the \(x\) value increases by one standard deviation, the method of least squares estimates that the \(y\) value increases by \(r\) standard deviations.
  • R Example 6.4.1: Consider again the Galton (1886) data set on the heights of children and their parents (the data set used in R Example 1.1.1). Suppose we first consider a model to predict the height of each child based on the midparent height. Every statistical package has built-in functions to perform common tasks such as fitting a simple linear regression model. In R, this simple linear regression model can be fit and viewed using the following built-in commands.
lm.model=lm(childHeight~midparentHeight,data=GaltonFamilies)
lm.model
## 
## Call:
## lm(formula = childHeight ~ midparentHeight, data = GaltonFamilies)
## 
## Coefficients:
##     (Intercept)  midparentHeight  
##         22.6362           0.6374

So, the fitted model in this case is \[ \mbox{Predicted child height}=22.6362+0.6374 \cdot \mbox{Midparent height}. \]
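As a quick sanity check, the fitted line can be used for prediction with the built-in predict function; the midparent height of 70 inches below is a value chosen only for illustration.

predict(lm.model,newdata=data.frame(midparentHeight=70))   # about 22.6362+0.6374*70 = 67.25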

Now, let’s see how to obtain this directly from the formulas we have derived in Chapter 6. First, let’s compute the summary statistics.

x=GaltonFamilies$midparentHeight
y=GaltonFamilies$childHeight
x.bar=mean(x)
y.bar=mean(y)
s.x=sd(x)
s.y=sd(y)
r=cor(x,y)
x.bar;y.bar;s.x;s.y;r
## [1] 69.20677
## [1] 66.74593
## [1] 1.80237
## [1] 3.579251
## [1] 0.3209499

So, we can then compute the least squares estimates of \(\beta_1\) and \(\beta_0\):

beta1.hat=r*s.y/s.x
beta0.hat=y.bar-beta1.hat*x.bar
beta1.hat;beta0.hat
## [1] 0.6373609
## [1] 22.63624

So, we see that \[ \hat{\beta}_1=\frac{rs_y}{s_x}=\frac{(0.3209499)(3.579251)}{1.80237}=0.6373609 \] and \[ \hat{\beta}_0=66.74593-(0.6373609)(69.20677)=22.63624 \]

The regression line (in blue) can be added to the scatterplot we drew in R Example 1.1.1.

plot(GaltonFamilies$midparentHeight,GaltonFamilies$childHeight,xlab="Midparent Height",ylab="Child Height")
abline(lm.model$coef,col="blue")

The unbiased estimate of \(\sigma^2\) is \[ s^2=\frac{SSE}{n-2}=\frac{(1-r^2) SST}{n-2}=\frac{(1-r^2)(n-1)s_y^2}{n-2} \] and can be computed as follows.

n=length(x)
(1-r^2)*(n-1)*s.y^2/(n-2)
## [1] 11.50372

So, we have \[ s^2=\frac{(1-r^2)(n-1)s_y^2}{n-2}=11.50372. \]
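As a consistency sketch, we can verify the decomposition \(SST=SSE+SSR\) from Section 6.4 numerically and confirm that the residual standard error reported by R below is \(s=\sqrt{s^2}\) and the multiple R-squared is \(r^2\).

SSE=sum(resid(lm.model)^2)            # residual sum of squares
SSR=sum((fitted(lm.model)-y.bar)^2)   # regression sum of squares
SST=(n-1)*s.y^2                       # total sum of squares
SSE+SSR-SST                           # approximately 0, so SST = SSE + SSR
sqrt(SSE/(n-2))                       # about 3.392, the residual standard error
SSR/SST                               # about 0.103, which equals r^2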

The built-in function summary() can be used to obtain these estimates and some additional information about the fitted regression line as follows.

summary(lm(childHeight~midparentHeight,data=GaltonFamilies))
## 
## Call:
## lm(formula = childHeight ~ midparentHeight, data = GaltonFamilies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9570 -2.6989 -0.2155  2.7961 11.6848 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     22.63624    4.26511   5.307 1.39e-07 ***
## midparentHeight  0.63736    0.06161  10.345  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.392 on 932 degrees of freedom
## Multiple R-squared:  0.103,  Adjusted R-squared:  0.102 
## F-statistic:   107 on 1 and 932 DF,  p-value: < 2.2e-16

1.A.1 Installing R and a Basic Introduction

  • Downloading and Installing R on Windows
    • Go to http://cran.r-project.org.
    • Look under the Section titled “Download and Install R” and click on “Download R for Windows”.
    • Click on “base”.
    • Click on “Download R 3.3.4 for Windows” (or current version) and save it on your computer.
    • Click on the new desktop icon titled “R-3.3.4-win.exe” (or current setup file) to install R on your computer.
    • The setup process is self-explanatory.
  • Downloading and Installing R on a Mac
    • Go to http://cran.r-project.org.
    • Look under the Section titled “Download and Install R” and click on “Download R for (Mac) OS X”.
    • Click on “R-3.4.3.pkg” (or current version) and save it on your computer.
    • Double click on the installer file titled “R-3.4.3.pkg” (or current setup file) to install R on your computer.
    • The setup process is self-explanatory.
  • R Example 1.A.1: Consider the following data set: \[ 0.44, -6.23, 1, 2, 3, 4, 7, 0, 0, 0 \]

(a) Compute the sample mean, sample standard deviation, and sample median for this variable using built-in functions.

(b) Repeat (a) by directly using formulas rather than the built-in functions.

  • Answer: (a) We can enter the data into R and view the vector as follows.
x=c(.44,-6.23,1:4,7,rep(0,3))
x
##  [1]  0.44 -6.23  1.00  2.00  3.00  4.00  7.00  0.00  0.00  0.00

The sample summary statistics can be computed with the following built-in functions.

mean(x)
## [1] 1.121
sd(x)
## [1] 3.422801
median(x)
## [1] 0.72

(b) We can also compute these quantities directly in R. The formula for the mean \[ \bar{x}=\frac{1}{n}\sum_{i=1}^n x_i \] can be computed in R as follows.

sum(x)
## [1] 11.21
n=length(x)
n
## [1] 10
x.bar=sum(x)/n
x.bar
## [1] 1.121

So, the sample mean is \(\bar{x}=1.121\).

Then, we can use the mean to directly compute the variance based on the formula \[ s^2=\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2. \]

x-x.bar
##  [1] -0.681 -7.351 -0.121  0.879  1.879  2.879  5.879 -1.121 -1.121 -1.121
(x-x.bar)^2
##  [1]  0.463761 54.037201  0.014641  0.772641  3.530641  8.288641 34.562641
##  [8]  1.256641  1.256641  1.256641
sum((x-x.bar)^2)
## [1] 105.4401
sum((x-x.bar)^2)/(n-1)
## [1] 11.71557
s=sqrt(sum((x-x.bar)^2)/(n-1))
s
## [1] 3.422801

So, the sample standard deviation is \(s=3.423\).

Finally, to compute the median, we need to sort the data.

sorted.x=sort(x)
sorted.x
##  [1] -6.23  0.00  0.00  0.00  0.44  1.00  2.00  3.00  4.00  7.00

Then the sample median is usually defined as \[ \frac{x_{\left(\lfloor (n+1)/2 \rfloor\right)}+x_{\left(\lfloor n/2+1 \rfloor\right)}}{2} \] where \(\lfloor \ \rfloor\) is the floor function and \(x_{(i)}\) denotes the \(i\)th smallest value. (For odd \(n\), both order statistics coincide, so this reduces to the middle value; for even \(n\), it averages the two middle values.) In R, this can be computed as follows.

c(sorted.x[5],sorted.x[6])
## [1] 0.44 1.00
x.median=(sorted.x[5]+sorted.x[6])/2

So, the sample median is \(\frac{x_{(5)}+x_{(6)}}{2}=0.72\).
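As a small sketch, the median formula above can be wrapped in a reusable function; the name my.median is ours, chosen for illustration, and it agrees with the built-in median for both odd and even \(n\).

my.median=function(v){
  n=length(v)
  s=sort(v)
  (s[floor((n+1)/2)]+s[floor(n/2+1)])/2  # average of the two middle order statistics
}
my.median(x)   # 0.72, matching median(x)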

  • R Example 1.A.2: Now, we consider the Galton (1886) data set again to illustrate some additional R commands and syntax while also improving the linear model for the heights of the children based on their parents.

It is important to know how to read data from a file. The data set “HeightData.txt” is a text file containing the Galton (1886) data set. In order to read the data into R at the command line, we need to know some commands for finding and changing the filepath. The command getwd() returns the current working directory.

getwd()
## [1] "D:/OneDrive - University of Louisville/2022 Spring/MATH 668 Linear Stat modeling/NotesPreparation/Lecture 1 Chapter 1"

The directory can be changed using the setwd function as illustrated below. Note that you must use slashes instead of backslashes (even in Windows which uses backslashes in its filepaths; for more discussion, see http://www.cs.ucsb.edu/~pconrad/topics/BackslashVsForwardSlash/).

setwd("D:/OneDrive - University of Louisville/2020 Spring/2020 Spring Linear Statistics Modeling/data")
getwd()
## [1] "D:/OneDrive - University of Louisville/2020 Spring/2020 Spring Linear Statistics Modeling/data"

The command dir() displays all of the files and directories in the current working directory.

dir()
##  [1] "Body_fat_example.txt"  "Clinic Data.txt"       "Clinic Data.xls"      
##  [4] "create yield data.txt" "Forbes data.txt"       "HeightData.txt"       
##  [7] "Hematology data.txt"   "Hematology data.xlsx"  "HW5-3.txt"            
## [10] "HW5-3code.txt"         "LandRentData.txt"      "LandRentData.xlsx"    
## [13] "us2003sars.txt"        "us65+rates.txt"        "YieldData.txt"

Now, the read.table function can be used to read the data into R. The optional input sep="\t" tells R that the file is tab-delimited. The option header=TRUE indicates that the file has a header row with variable names. Detailed documentation for the read.table function can be obtained with the command help(read.table).

D=read.table("HeightData.txt",sep="\t",header=TRUE)

The following command extracts the first ten rows of the data frame D and stores them in a new data frame D10.

D10=D[1:10,]

There are various ways to select subsets of rows and columns in R. Here are three different ways to extract rows 1-7 and 9 from D10.

D10[c(1,2,3,4,5,6,7,9),]
D10[c(1:7,9),]
D10[-c(8,10),]
##   family father mother midparentHeight children childNum gender childHeight
## 1    001   78.5   67.0           75.43        4        1   male        73.2
## 2    001   78.5   67.0           75.43        4        2 female        69.2
## 3    001   78.5   67.0           75.43        4        3 female        69.0
## 4    001   78.5   67.0           75.43        4        4 female        69.0
## 5    002   75.5   66.5           73.66        4        1   male        73.5
## 6    002   75.5   66.5           73.66        4        2   male        72.5
## 7    002   75.5   66.5           73.66        4        3 female        65.5
## 9    003   75.0   64.0           72.06        2        1   male        71.0

The columns of a data frame represent variables, and as we see in D10, different columns can have different types; for example, most of the variables are quantitative, but the 7th column, gender, is qualitative. Here are two ways to extract a vector representing the 7th column of D10.

D10[,7]
D10$gender
##  [1] "male"   "female" "female" "female" "male"   "male"   "female" "female"
##  [9] "male"   "female"

Furthermore, we can simultaneously select specified rows and columns as illustrated in the following example.

D10[-c(8,10),7:8]
##   gender childHeight
## 1   male        73.2
## 2 female        69.2
## 3 female        69.0
## 4 female        69.0
## 5   male        73.5
## 6   male        72.5
## 7 female        65.5
## 9   male        71.0

Now, let’s attach D10.

attach(D10)

The following boolean operation checks which children are female.

gender=="female"
##  [1] FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE

R also allows boolean vectors to be used to select rows to be extracted. For example, the following command displays all data for female children in D10.

D10[gender=="female",]
##    family father mother midparentHeight children childNum gender childHeight
## 2     001   78.5   67.0           75.43        4        2 female        69.2
## 3     001   78.5   67.0           75.43        4        3 female        69.0
## 4     001   78.5   67.0           75.43        4        4 female        69.0
## 7     002   75.5   66.5           73.66        4        3 female        65.5
## 8     002   75.5   66.5           73.66        4        4 female        65.5
## 10    003   75.0   64.0           72.06        2        2 female        68.0
detach(D10)

Now, we can use subsetting to recreate the scatterplot of midparent height and child height, but with female children plotted in blue and male children plotted in red, as shown in the following code. The optional argument type="n" plots invisible points but sets up the axes so that all of the points will fit on the graph. The function points then adds points to the existing plot, and we use the optional argument col to specify the color for each group of points. Then we can fit regression lines for each group using the lm function with an appropriate boolean vector in the subset option.

attach(D)
plot(midparentHeight,childHeight,type="n")
points(midparentHeight[gender=="female"],childHeight[gender=="female"],col="blue")
points(midparentHeight[gender=="male"],childHeight[gender=="male"],col="red")
lm.females=lm(childHeight~midparentHeight,subset=(gender=="female"))
lm.males=lm(childHeight~midparentHeight,subset=(gender=="male"))
abline(lm.females$coef,col="blue")
abline(lm.males$coef,col="red")

A 3-dimensional scatterplot with the fitted model superimposed can be obtained with the rgl package. First, if it is not already installed, we need to install the package rgl with the command install.packages("rgl"). Then we load the package with the require function.

require(rgl)
## Loading required package: rgl

Now, the following code can be used to create a 3D scatterplot with separate regression planes for female and male children, using both parents’ heights as covariates.

clear3d()
plot3d(mother,father,childHeight,type="n")
points3d(mother[gender=="female"],father[gender=="female"],childHeight[gender=="female"],col='blue')
points3d(mother[gender=="male"],father[gender=="male"],childHeight[gender=="male"],col='red')
lm2.females=lm(childHeight~mother+father,subset=(gender=="female"))
lm2.males=lm(childHeight~mother+father,subset=(gender=="male"))
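# planes3d(a,b,c,d) draws the plane a*x+b*y+c*z+d=0; rewriting the fitted
# model childHeight = intercept + a*mother + b*father in this form gives
# c = -1 and d = intercept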
a=coef(lm2.females)["mother"]
b=coef(lm2.females)["father"]
c=-1
d=coef(lm2.females)["(Intercept)"]
planes3d(a, b, c, d, alpha = 0.5, col='blue')
a=coef(lm2.males)["mother"]
b=coef(lm2.males)["father"]
c=-1
d=coef(lm2.males)["(Intercept)"]
planes3d(a, b, c, d, alpha = 0.5, col='red')
detach(D)

References

  1. Rencher and Schaalje (2008)
  2. Kutner et al. (2005)

Kutner, Michael H., Christopher J. Nachtsheim, John Neter, and William Li. 2005. Applied Linear Statistical Models. 5th ed. Boston: McGraw-Hill Irwin.

Rencher, Alvin C., and G. Bruce Schaalje. 2008. Linear Models in Statistics. Hoboken, NJ: John Wiley & Sons.