Simple Linear Regression Model

1.1 and 6.1 Simple Linear Regression Model

\(\def\a{\boldsymbol{a}}\) \(\def\b{\boldsymbol{b}}\) \(\def\c{\boldsymbol{c}}\) \(\def\f{\boldsymbol{f}}\) \(\def\g{\boldsymbol{g}}\) \(\def\h{\boldsymbol{h}}\) \(\def\j{\boldsymbol{j}}\) \(\def\t{\boldsymbol{t}}\) \(\def\u{\boldsymbol{u}}\) \(\def\v{\boldsymbol{v}}\) \(\def\x{\boldsymbol{x}}\) \(\def\y{\boldsymbol{y}}\) \(\def\z{\boldsymbol{z}}\) \(\def\A{\boldsymbol{\mathrm{A}}}\) \(\def\B{\boldsymbol{\mathrm{B}}}\) \(\def\C{\boldsymbol{\mathrm{C}}}\) \(\def\D{\boldsymbol{\mathrm{D}}}\) \(\def\E{\boldsymbol{\mathrm{E}}}\) \(\def\H{\boldsymbol{\mathrm{H}}}\) \(\def\I{\boldsymbol{\mathrm{I}}}\) \(\def\J{\boldsymbol{\mathrm{J}}}\) \(\def\M{\boldsymbol{\mathrm{M}}}\) \(\def\O{\boldsymbol{\mathrm{O}}}\) \(\def\P{\boldsymbol{\mathrm{P}}}\) \(\def\Q{\boldsymbol{\mathrm{Q}}}\) \(\def\T{\boldsymbol{\mathrm{T}}}\) \(\def\U{\boldsymbol{\mathrm{U}}}\) \(\def\X{\boldsymbol{\mathrm{X}}}\) \(\def\Y{\boldsymbol{\mathrm{Y}}}\) \(\def\bmu{\boldsymbol{\mu}}\) \(\def\ep{\boldsymbol{\varepsilon}}\) \(\def\Sig{\boldsymbol{\Sigma}}\) \(\def\zeros{\boldsymbol{0}}\) \(\def\diag{\mathrm{diag}}\) \(\def\rank{\mathrm{rank}}\) \(\def\tr{^\top}\) \(\def\ds{\displaystyle}\) \(\def\bea{\begin{eqnarray}}\) \(\def\nnn{\nonumber}\) \(\def\eea{\nnn\end{eqnarray}}\) \(\def\bpm{\begin{pmatrix}}\) \(\def\epm{\end{pmatrix}}\) \(\def\var{\mbox{var}}\) \(\def\cov{\mbox{cov}}\) \(\def\Reals{\mathbb{R}}\) \(\def\abs{\mbox{abs}}\) \(\def\sion{\sum_{i=1}^n}\) \(\def\res{\hat{\varepsilon}}\) \(\newcommand{\EV}[1]{E\left(#1\right)}\) \(\newcommand{\trace}[1]{\mathrm{tr}\left(#1\right)}\)

Historic Origins:

Regression analysis was first developed by Sir Francis Galton in the latter part of the 19th Century. Galton had studied the relation between heights of parents and children and noted that the heights of children of both tall and short parents appeared to “revert” or “regress” to the mean of the group. He considered this tendency to be a regression to “mediocrity”. Galton developed a mathematical description of this regression tendency, the precursor of today’s regression models.

Functional Relation between Two Variables

A function relation: \(y=f(x)\)

\(y\): dependent variable \(x\): independent variable

Statistical Relation between Two Variables

A statistical relation, unlike functional relation, is not a perfect one. In general, the observations for a statistical relation do not fall directly on the curve of relationship.

R Example 1.1.1: Consider a famous data set on the heights of children based on their parent(s) collected by Galton in (1886). The dataset GaltonFamilies is included in the R package HistData which can be installed with the following command.

install.packages("HistData")

The following commands show how to load the package, display its first 10 rows, and display the dimension of the data frame.

require(HistData)

## Loading required package: HistData

## Warning: package 'HistData' was built under R version 4.0.4

GaltonFamilies[1:10,]

##    family father mother midparentHeight children childNum gender childHeight
## 1     001   78.5   67.0           75.43        4        1   male        73.2
## 2     001   78.5   67.0           75.43        4        2 female        69.2
## 3     001   78.5   67.0           75.43        4        3 female        69.0
## 4     001   78.5   67.0           75.43        4        4 female        69.0
## 5     002   75.5   66.5           73.66        4        1   male        73.5
## 6     002   75.5   66.5           73.66        4        2   male        72.5
## 7     002   75.5   66.5           73.66        4        3 female        65.5
## 8     002   75.5   66.5           73.66        4        4 female        65.5
## 9     003   75.0   64.0           72.06        2        1   male        71.0
## 10    003   75.0   64.0           72.06        2        2 female        68.0

dim(GaltonFamilies)

## [1] 934   8

A scatterplot of the dataset can be obtained as the following:

plot(GaltonFamilies$midparentHeight,GaltonFamilies$childHeight,xlab="Midparent Height",ylab="Child Height")

Here the Child Height is taken as the dependent or response variable \(y\), and midparent Height is taken as the independent, explanatory or predictor variable \(x\).

Basic Concepts

A regression model is a formal means of expressing the two essential ingredients of a statistical relation:

A tendency of the response variable \(y\) to vary with the predictor variable \(x\) in a systematic fashion.
A scattering of points around the curve of statistical relationship.

Use of Regression Analysis

When deciding on what model we want to use to model data, we need to consider the purpose of our analysis. There are various purposes that we may be interested in:
- Prediction
- Data Description or Explanation
- Parameter Estimation
- Variable Selection or Screening
- Control of Output

Regression and Causality

The existence of a statistical relation between the response variable \(y\) and the explanatory or predictor variable \(x\) does not imply in any way that \(y\) depends causally on \(x\). No matter how strongly is the statistical relation between \(x\) and \(y\), no cause-and-effect patter is necessarily implied by the regression model. For example,data on size of vocabulary (\(x\)) and writing speed (\(y\)) for a sample of young children aged 5-10 will show a positive regression relation. This relation does not imply, however, that an increase in vocabulary causes a faster writing speed. Here, other explanatory variables, such as age of the child and amount of education, affect both the vocabulary (\(x\)) and the writing speed (\(y\)). Older children have a larger vocabulary and a faster writing speed.
Even when a strong statistical relationship reflects causal conditions, the causal conditions may act in the opposite direction, from \(y\) to \(x\). Consider, for instance, the calibration of a thermometer. Here, readings of the thermometer are taken at different known temperatures, and the regression relation is studied so that the accuracy of predictions made by using the thermometer readings can be assessed. For this purpose, the thermometer reading is the predictor variable \(x\), and the actual temperature is the response variable \(y\) to be predicted. However, the causal pattern here does not go from \(x\) to \(y\), but in the opposite direction: the actual temperature (\(y\)) affects the thermometer reading (\(x\)).

Simple Linear Regression Model

We begin with the description

\[\begin{align} y=\beta_0+\beta_1x+\varepsilon \label{eq1}\tag{1} \end{align}\]

where \(y\) is the response (dependent) variable, \(x\) is predictor(independent) variable. \(\beta_0\) and \(\beta_1\) are the intercept and slope respectively and \(\varepsilon\) is the model error. We ordinarly view (1) in a data setting where pairs of observations \((x_1,y_1),(x_2,y_2),...,(x_n,y_n)\) are taken on experimental units, and estimates of the parameters \(\beta_0\) and \(\beta_1\) are sought. Thus, we can write the model as the following definition.

Definition 6.1.1 (p.127): The simple linear regression model with \(n\) observations can be written as \[ y_i=\beta_0+\beta_1 x_i+\varepsilon_i, i=1, 2, \ldots, n \] with the following assumptions:
- \(E(\varepsilon_i)=0\) for \(i=1,\ldots,n\), or equivalently, \(E(y_i)=\beta_0+\beta_1x_i\)
- \(\var(\varepsilon_i)=\sigma^2\) for \(i=1,\ldots,n\), or equivalently, \(var(y_i)=\sigma^2\) (homoscedasticity of the errors, homogeneous variance)
- \(\cov(\varepsilon_i,\varepsilon_j)=0\) for all \(i\neq j\), or equivalently, \(cov(y_i,y_j)=0\) (uncorrelated errors).
Here, \(\beta_0\) and \(\beta_1\) are fixed but unknown model parameters representing the intercept and the slope of the regression line, respectively.
In this chapter, we assume the \(x_i\)’s are not random.
\(y\) is called the dependent or response variable, \(y_i\) is the value of the response variable in the ith trial
\(x\) is called the independent, predictor, or explanatory variable, \(x_i\) is a known constant, the value of the predictor variable in the ith trial
\(\varepsilon\) is a random variable called the error term.\(\varepsilon_i\) is a the value of the in the ith trial
Remark:

1.Simple: There is only one \(x\) to predict the response \(y\)

2.Linear: The model is linear in parameters \(\beta_0\) and \(\beta_1\)
- Example: \(y_i=\beta_0+e^{\beta_1 x_i}+\varepsilon_i\) is not linear. But \(y_i=\beta_0+\beta_1x_i+\beta_2x_i^2+\varepsilon_i\) is linear.
Important Features of Model:
- 1. The response \(y\); in the ith trial is the sum of two components: (1) the constant term \(\beta_0+\beta_1 x_i\) and (2) the random term \(\varepsilon_i\). Hence, \(y_i\) is a random variable.
- 1. Since \(E(\varepsilon_i) = 0\), it follows: \(E(y_i) = E(\beta_0+\beta_1x_i+\varepsilon_i)=\beta_0+\beta_1x_i+E(\varepsilon_i)=\beta_0+\beta_1x_i\) Thus, the response \(y_i\), when the level of \(x\) in the ith trial is \(x_i\), comes from a probability distribution whose mean is: \(E(y_i)=\beta_0+\beta_1x_i\) We therefore know that the regression function for model in Definition 6.1.1 is \[E(y)=\beta_0+\beta_1x\] since the regression function relates the means of the probability distributions of \(y\) for given \(x\) to the level of \(x\)
- 1. The response \(y_i\) in the ith trial exceeds or falls short of the value of the regression function by the error term amount \(\varepsilon\).
- 1. The error terms \(\varepsilon_i\) are assumed to have constant variance \(\sigma^2\). It therefore follows that the responses \(y_i\) have the same constant variance : \(var(y_i)=var(\beta_0+\beta_1x_i+\varepsilon_i)=var(\varepsilon_i)=\sigma^2\) Thus, this regression model assumes that the probability distributions of \(y\) have the same variance \(\sigma^2\) regardless of the level of the predictor variable \(x\).
- 1. The error terms are assumed to be uncorrelated. Since the error terms \(\varepsilon_i\) and \(\varepsilon_j\) are uncorrelated, so are the responses \(y_i\) and \(y_j\)
- 1. In summary, this model implies that the responses \(y_i\) come from probability distributions whose means are \(E(y_i)=\beta_0+\beta_1 x_i\) and whose variances are \(\sigma^2\), the same for all levels of \(x\). Further, any two responses \(y_i\) and \(y_j\) are uncorrelated

6.2 Estimation of \(\boldsymbol{\beta}_0\), \(\boldsymbol{\beta}_1\), and \(\sigma^2\)

Least-squares Approach

Seek estimators \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize the sum of squares of the deviations \(y_i-\hat{y}_i\) of the \(n\) observed \(y_i\)’s from their predicted values \(\hat{y_i}=\hat{\beta}_0+\hat{\beta_1}x_i\). That is to minimize the residual sum of squares \(\sum\limits_{i=1}^n（y_i-\hat{y_i})^2\)

Definition 6.2.1 (p.128): The least square estimates of \(\beta_0\) and \(\beta_1\) are the values which minimize the least squares function \[ Q(b_0,b_1)=\sum_{i=1}^n \left(y_i-b_0-b_1 x_i\right)^2. \] The minimizers are often denoted by \(\hat{\beta}_0\) and \(\hat{\beta}_1\).
Theorem 6.2.1 (p.128): Given observed data \((x_1,y_1), \ldots, (x_n,y_n)\) from a simple linear regression model such that there is some \(k\) such that \(x_k\neq x_1\), the least squares estimates of \(\beta_0\) and \(\beta_1\) are \[ \hat{\beta}_1=\frac{s_{xy}}{s_{x}^2}=\frac{\sion (x_i-\bar{x})y_i}{\sion (x_i-\bar{x})^2} \] and \[ \hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x} \] where \(\ds{s_{xy}=\frac{1}{n-1}\sion (x_i-\bar{x})(y_i-\bar{y})=\frac{1}{n-1}\sion (x_i-\bar{x})y_i=\frac{1}{n-1}\sion x_i(y_i-\bar{y})}\) and \(\ds{s_{x}^2=\frac{1}{n-1}\sion (x_i-\bar{x})^2=\frac{1}{n-1}\left(\sion x_i^2-n\bar{x}^2\right)}\).
Proof: Differentiating \(Q\) with respect to \(b_0\) and \(b_1\), we obtain \[ \frac{\partial Q}{\partial b_0}=-2\sum_{i=1}^n\left(y_i-b_0-b_1 x_i\right) \] and \[ \frac{\partial Q}{\partial b_1}=-2\sum_{i=1}^n x_i\left(y_i-b_0-b_1 x_i\right). \]
Setting these equal to 0, we obtain the system of equations (called the score equations) \[ \begin{array}{c} \ds{\sum_{i=1}^n\left(y_i-\hat{\beta}_0-\hat{\beta}_1 x_i\right)=0} \\ \ds{\sum_{i=1}^nx_i\left(y_i-\hat{\beta}_0-\hat{\beta}_1 x_i\right)=0} \end{array} \] where \(\hat{\beta}_0\) and \(\hat{\beta}_1\) denote the values of \(b_0\) and \(b_1\) which solves the system. (Technically, the hat notation should not be used until we verify that the solution to \(\ds{\frac{\partial Q}{\partial b_0}=\frac{\partial Q}{\partial b_1}=0}\) is a minimizer of \(Q\).)
The following steps show how to solve this system.
We rewrite the sums as \(\ds{\sion y_i=n\hat{\beta}_0+\hat{\beta}_1\sion x_i}\) and \(\ds{\sion x_i y_i= \hat{\beta}_0 \sion x_i +\hat{\beta}_1 \sion x_i^2}\).
Dividing both sides of \(\ds{\sion y_i=n\hat{\beta}_0+\hat{\beta}_1\sion x_i}\) by \(n\), we get \(\ds{\bar{y}=\hat{\beta}_0+\hat{\beta}_1 \bar{x} \Longrightarrow \hat{\beta}_0=\bar{y}-\hat{\beta}_1\bar{x}}\).
Then we substitute \(\hat{\beta}_0\) into \(\ds{\sion x_i y_i= \hat{\beta}_0 \sion x_i +\hat{\beta}_1 \sion x_i^2}\) and solve for \(\hat{\beta}_1\): \[ \bea \nnn \sion x_iy_i&=&(\bar{y}-\hat{\beta}_1\bar{x})\sion x_i +\hat{\beta}_1 \sion x_i^2 \\ \nnn \sion x_iy_i&=&(\bar{y}-\hat{\beta}_1\bar{x})n\bar{x} +\hat{\beta}_1 \sion x_i^2 \\ \nnn \sion x_iy_i&=&n\bar{x}\bar{y} -\hat{\beta}_1 n \bar{x}^2 +\hat{\beta}_1\sion x_i^2 \\ \nnn \sion x_iy_i-n\bar{x}\bar{y}&=&\hat{\beta}_1\left(\sion x_i^2-n \bar{x}^2\right) \\ \nnn \sion (x_i-\bar{x})(y_i-\bar{y})&=&\hat{\beta}_1 \sion (x_i-\bar{x})^2 \\ \nnn (n-1)s_{xy}&=& (n-1)\hat{\beta}_1 s_{x}^2 \\ \nnn \hat{\beta}_1&=&\frac{s_{xy}}{s_{x}^2} \ \ \ \ (s_x^2>0 \mbox{ since not all } x_i\mbox{'s are the same}). \eea \]
Note that \[ s_{xy}=\frac{1}{n-1}\sion (x_i-\bar{x})(y_i-\bar{y})=\frac{1}{n-1}\sion (x_i-\bar{x})y_i-\bar{y}\sion(x_i-\bar{x})=\frac{1}{n-1}\sion (x_i-\bar{x})y_i-0, \] so \(\ds{\hat{\beta}_1=\frac{\sion (x_i-\bar{x})y_i}{\sion (x_i-\bar{x})^2}}\).
To see that \(\hat{\beta}_0\) and \(\hat{\beta}_1\) minimize \(Q\), we show that \(Q\) is a convex function by showing its matrix of second partial derivatives is positive definite for all \(b_0\) and \(b_1\). We compute \[ \frac{\partial^2 Q}{\partial (b_0,b_1)}=\bpm 2n & 2n\bar{x} \\ 2n\bar{x} & 2\sion x_i^2\epm \] which is positive definite since \(2n>0\) and \[ \bea \nnn \det\left(\frac{\partial^2 Q}{\partial (b_0,b_1)}\right)&=&4n\sion x_i^2-4n^2\bar{x}^2 \\ \nnn &=& 4n \left(\sion x_i^2-n\bar{x}^2\right) \\ \nnn &=& 4n (n-1)s_x^2 > 0. \eea \]
Theorem 6.2.2: For the simple linear regression model, we have \(E(\hat{\beta}_0)=\beta_0\), \(E(\hat{\beta}_1)=\beta_1\), \(\ds{\var(\hat{\beta}_1)=\frac{\sigma^2}{\sion (x_i-\bar{x})^2}}\) and \(\ds{\var(\hat{\beta}_0)=\sigma^2\left(\frac{1}{n}+\frac{\bar{x}^2}{\sion (x_i-\bar{x})^2}\right)}\).
Proof: Here, we just prove that \(\hat{\beta}_1\) is unbiased; the other results can be obtained in a similar manner using properties of expectations and variances, and will be proved more generally in Chapter 7. \[ \bea \nnn \EV{\hat{\beta}_1}&=& \EV{\frac{\sion (x_i-\bar{x})y_i}{\sion (x_i-\bar{x})^2}} \\ \nnn &=& \EV{\sion\frac{(x_i-\bar{x})}{(n-1)s_x^2} y_i} \\ \nnn &=& \sion\frac{(x_i-\bar{x})}{(n-1)s_x^2}\EV{y_i} \\ \nnn &=& \sion\frac{ (x_i-\bar{x})}{(n-1)s_x^2}(\beta_0+\beta_1 x_i) \\ \nnn &=& \frac{\beta_0}{(n-1)s_x^2}\sion (x_i-\bar{x})+\frac{\beta_1}{(n-1)s_x^2}\sion (x_i-\bar{x})x_i \\ \nnn &=& 0+\frac{\beta_1}{(n-1)s_x^2}\sion (x_i-\bar{x})x_i-\frac{\beta_1}{(n-1)s_x^2}\sion (x_i-\bar{x})\bar{x} \\ \nnn &=& 0+\frac{\beta_1}{(n-1)s_x^2}\sion (x_i-\bar{x})^2 \\ \nnn &=& \beta_1. \eea \]
Theorem 6.2.3 (p.131): Let \(\ds{s^2=\frac{1}{n-2}\sion (y_i-\hat{y}_i)^2}\). Then \(E(s^2)=\sigma^2\) for the simple linear regression model.
Proof: First, we note that \[ y_i-\bar{y}=\beta_0+\beta_1 x_i+\varepsilon_i-(\beta_0+\beta_1 \bar{x}+\bar{\varepsilon})=(\varepsilon_i-\bar{\varepsilon})+\beta_1(x_i-\bar{x}) \] where \(\bar{\varepsilon}=\frac{1}{n}\sion \varepsilon_i\). Then this implies that \[ \bea \nnn \EV{y_i-\hat{y}_i}&=& \EV{\varepsilon_i-\bar{\varepsilon}+\beta_1(x_i-\bar{x})} \\ \nnn &=& \EV{\varepsilon_i-\bar{\varepsilon}}+\beta_1(x_i-\bar{x}) \\ \nnn &=& 0+\beta_1(x_i-\bar{x})=\beta_1(x_i-\bar{x}) \\ \eea \] and \[ \bea \nnn \var(y_i-\bar{y})&=& \var(\varepsilon_i-\bar{\varepsilon}+\beta_1(x_i-\bar{x})) \\ \nnn &=& \var(\varepsilon_i-\bar{\varepsilon}) \\ \nnn &=& \var\left(\left(1-\frac{1}{n}\right)\varepsilon_i-\frac{1}{n}\sum_{j\neq i} \varepsilon_j \right) \\ \nnn &=& \left(1-\frac{1}{n}\right)^2\sigma^2+\frac{1}{n^2}\sum_{j\neq i} \sigma^2 \\ \nnn &=& \left(1-\frac{1}{n}\right)^2\sigma^2+\frac{1}{n^2}(n-1)\sigma^2 \\ \nnn &=& \left(1-\frac{2}{n}+\frac{1}{n^2}+\frac{1}{n}-\frac{1}{n^2}\right)\sigma^2 \\ \nnn &=& \left(1-\frac{1}{n}\right)\sigma^2. \eea \] Also, we have \[ \bea \nnn \cov(y_i,\hat{\beta}_1)&=& \cov\left(y_i,\sum_{j=1}^n \frac{(x_j-\bar{x})y_j}{(n-1)s_x^2}\right) \\ \nnn &=& \sum_{j=1}^n\frac{(x_j-\bar{x})}{(n-1)s_x^2}\cov(y_i,y_j) \\ \nnn &=& \frac{(x_i-\bar{x})}{(n-1)s_x^2}\var(y_i) \\ \nnn &=& \frac{(x_i-\bar{x})}{(n-1)s_x^2}\sigma^2 \eea \] and \[ \bea \nnn \cov(\bar{y},\hat{\beta}_1)&=&\sum_{i=1}^n \frac{1}{n}\cov(y_i,\hat{\beta}_1) \\ \nnn &=& \frac{1}{n}\sion \frac{(x_i-\bar{x})}{(n-1)s_x^2}\sigma^2\\ \nnn &=& \frac{\sigma^2}{n(n-1)s_x^2}\sion (x_i-\bar{x})\\ \nnn &=& 0. \eea \] Now, we compute \[ \bea \nnn \EV{\sion (y_i-\hat{y}_i)^2}&=& \sion\EV{(y_i-\hat{y}_i)^2} \\ \nnn &=& \sion\EV{(y_i-\hat{y}_i)^2} \\ \nnn &=& \sion\left\{\var(y_i-\hat{y}_i)+\left(\EV{y_i-\hat{y}_i}\right)^2\right\} \\ \nnn &=& \sion\left\{\var(y_i-\hat{y}_i)+\left(\beta_0+\beta_1x_i-\EV{\hat{\beta}_0}+x_i\EV{\hat{\beta}_1}\right)^2\right\} \\ \nnn &=& \sion\left\{\var(y_i-\hat{y}_i)+0\right\} \\ \nnn &=& \sion\left\{\var(y_i-\bar{y}-\hat{\beta}_1(x_i-\bar{x}))\right\} \\ \nnn &=& \sion\left\{\var(y_i-\bar{y})+(x_i-\bar{x})^2\var(\hat{\beta}_1)-2(x_i-\bar{x})\cov(y_i-\bar{y},\hat{\beta}_1)\right\} \\ \nnn &=& \sion\left\{\var(y_i-\bar{y})+(x_i-\bar{x})^2\var(\hat{\beta}_1)-2(x_i-\bar{x})[\cov(y_i,\hat{\beta}_1)-\cov(\bar{y},\hat{\beta_1})]\right\} \\ \nnn &=& \sion\left\{\left(1-\frac{1}{n}\right)\sigma^2+(x_i-\bar{x})^2\frac{\sigma^2}{\sum_{j=1}^n (x_j-\bar{x})^2}- 2(x_i-\bar{x})(\frac{(x_i-\bar{x})\sigma^2}{(n-1)s_x^2}-0)\right\} \\ \nnn &=& \sion\frac{n-1}{n}\sigma^2+\sion(x_i-\bar{x})^2\frac{\sigma^2}{\sum_{j=1}^n (x_j-\bar{x})^2}-2\sigma^2 \\ \nnn &=& (n-1)\sigma^2+\sigma^2-2\sigma \\ \nnn &=& (n-2)\sigma^2. \eea \] Thus, we have \(\EV{s^2}=\frac{1}{n-2}\EV{\sion(y_i-\hat{y}_i)^2}=\sigma^2\).

6.4 Coefficient of Determination

The predicted value of \(y\) at \(x\) is \(\hat{y}=\hat{\beta}_0+\hat{\beta}_1 x\).
The fitted value of \(y\) at \(x_i\) is \(\hat{y}_i=\hat{\beta}_0+\hat{\beta}_1 x_i=\bar{y}+\hat{\beta}_1(x_i-\bar{x})\).
The residuals are the differences between the fitted values and the actual values and denoted by \(\res_i=y_i-\hat{y}_i\).
Evaluating the least squares function at \(\hat{\beta}_0\) and \(\hat{\beta}_1\), we obtain the residual sum of squares (or error sum of squares) denoted by \[ SSE=Q(\hat{\beta}_0,\hat{\beta}_1)=\sum_{i=1}^n \res_i^2. \]
Suppose we have no information about the \(x\)’s, and want to estimate the mean of the \(y\)’s based only on \(y_1, y_2, \ldots, y_n\). Then the least squares estimate of the mean of the \(y\)’s is \(\bar{y}=\frac{1}{n}\sion y_i\).
Consider the following interesting decomposition of \(\ds{(n-1)s_y^2=\sion (y_i-\bar{y})^2}\): \[ \bea \nnn \sion (y_i-\bar{y})^2&=&\sion (y_i-\hat{y}_i+\hat{y}_i-\bar{y})^2 \\ \nnn &=& \sion (y_i-\hat{y}_i)^2+\sion(\hat{y}_i-\bar{y})^2+2\sion(y_i-\hat{y}_i)(\hat{y}_i-\bar{y}) \\ \nnn &=& \sion (y_i-\hat{y}_i)^2+\sion(\hat{y}_i-\bar{y})^2+2\hat{\beta}_1\sion(y_i-\hat{\beta}_0-\hat{\beta}_1x_i)(x_i-\bar{x}) \eea \]
Note that the score equations imply that \[ \bea \nnn \sion(y_i-\hat{\beta}_0-\hat{\beta}_1x_i)(x_i-\bar{x}) &=&\sion x_i (y_i-\hat{\beta}_0-\hat{\beta}_1x_i)-\bar{x}\sion(y_i-\hat{\beta}_0-\hat{\beta}_1x_i) \\ \nnn &=& 0+0=0. \eea \]
So, we have \[ \bea \nnn \sion (y_i-\bar{y})^2&=& \sion (y_i-\hat{y}_i)^2+\sion(\hat{y}_i-\bar{y})^2 \\ \nnn SST &=& SSE + SSR \eea \] where \(SST=\sion (y_i-\bar{y})^2\) is the total sum of squares and \(SSR=\sion(\hat{y}_i-\bar{y})^2\) is the regression sum of squares.
To assess the proportion of variation explained by the regression model compared with that explained only using the mean, we can consider the coefficient of determination defined by \(\frac{SSR}{SST}\).
Since \(SST=\sion (y_i-\bar{y})^2=(n-1)s_y^2\) and \(SSR=\sion(\hat{y}_i-\bar{y})^2=\hat{\beta}_1^2\sion(x_i-\bar{x})^2= \hat{\beta}_1^2(n-1)s_{x}^2\), it follows that \[ \bea \nnn \frac{SSR}{SST}&=&\hat{\beta}_1^2\frac{s_{x}^2}{s_{y}^2} \\ \nnn &=& \left(\frac{s_{xy}}{s_{x}^2}\right)^2\frac{s_{x}^2}{s_{y}^2} \\ \nnn &=& \left(\frac{s_{xy}}{s_{x}s_{y}}\right)^2 \\ \nnn &=& r^2 \eea \] where \(\ds{r= \frac{s_{xy}}{s_{x}s_{y}}}\) is the sample correlation between the \(x\)’s and the \(y\)’s.
We also can express the estimate of the slope of the regression line in terms of the correlation: \[ \bea \nnn \hat{\beta}_1&=&\frac{s_{xy}}{s_{x}^2}\\ \nnn &=& \frac{rs_{x}s_{y}}{s_{x}^2} \\ \nnn &=& \frac{rs_{y}}{s_{x}}. \eea \]
Note that \(s_x\) is the sample standard deviation of the \(x\)’s and \(s_y\) is the sample standard deviation of the \(y\)’s.
So, if the \(x\) value increases by one standard deviation, the method of least squares estimates that the \(y\) value increases by \(r\) standard deviations.
R Example 6.4.1: Consider a famous data set on the heights of children based on their parent(s) collected by Galton in (1886).(The data set we used in R Example 1.1.1) Suppose we first consider a model to predict the height of each child based on the midParent height. Every statistical package has built-in functions to perform common tasks such as fitting a simple linear regression model. In R, this simple linear regression model can be fit and viewed using the following built-in commands.

lm.model=lm(childHeight~midparentHeight,data=GaltonFamilies)
lm.model

## 
## Call:
## lm(formula = childHeight ~ midparentHeight, data = GaltonFamilies)
## 
## Coefficients:
##     (Intercept)  midparentHeight  
##         22.6362           0.6374

So, the fitted model in this case is \[ \mbox{Predicted child height}=22.6362+.6374 \cdot \mbox{Midparent height}. \]

Now, let’s see how to obtain this directly from the formulas we have derived in Chapter 6. First, let’s compute the summary statistics.

x=GaltonFamilies$midparentHeight
y=GaltonFamilies$childHeight
x.bar=mean(x)
y.bar=mean(y)
s.x=sd(x)
s.y=sd(y)
r=cor(x,y)
x.bar;y.bar;s.x;s.y;r

## [1] 69.20677

## [1] 66.74593

## [1] 1.80237

## [1] 3.579251

## [1] 0.3209499

So, then we can compute the least squares estimates of \(\beta_1\) and \(\beta_0\)

beta1.hat=r*s.y/s.x
beta0.hat=y.bar-beta1.hat*x.bar
beta1.hat;beta0.hat

## [1] 0.6373609

## [1] 22.63624

So, we see that \[ \hat{\beta}_1=\frac{rs_y}{s_x}=\frac{(0.3209499)(3.579251)}{1.80237}=0.6373609 \] and \[ \hat{\beta}_0=66.74593-(0.6373609)(69.20677)=22.63624 \]

The regression line can be added to the scatterplot we draw in R Example 1.1.1 in blue.

plot(GaltonFamilies$midparentHeight,GaltonFamilies$childHeight,xlab="Midparent Height",ylab="Child Height")
abline(lm.model$coef,col="blue")

The unbiased estimate of \(\sigma^2\) is \[ s^2=\frac{SSE}{n-2}=\frac{(1-r^2) SST}{n-2}=\frac{(1-r^2)(n-1)s_y^2}{n-2} \]] and can be computed as follows.

n=length(x)
(1-r^2)*(n-1)*s.y^2/(n-2)

## [1] 11.50372

So, we have \[ s^2=\frac{(1-r^2)(n-1)s_y^2}{n-2}=11.50372. \]

We saw that the built-in function summary() can be used to obtain the estimate and some other information of the regression line as follows.

summary(lm(childHeight~midparentHeight,data=GaltonFamilies))

## 
## Call:
## lm(formula = childHeight ~ midparentHeight, data = GaltonFamilies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9570 -2.6989 -0.2155  2.7961 11.6848 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     22.63624    4.26511   5.307 1.39e-07 ***
## midparentHeight  0.63736    0.06161  10.345  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.392 on 932 degrees of freedom
## Multiple R-squared:  0.103,  Adjusted R-squared:  0.102 
## F-statistic:   107 on 1 and 932 DF,  p-value: < 2.2e-16

1.A.1 Installing R and a Basic Introduction

Downloading and Installing R on Windows
- Go to http://cran.r-project.org.
- Look under the Section titled “Download and Install R” and click on “Download R for Windows”.
- Click on “base”.
- Click on “Download R 3.3.4 for Windows” (or current version) and save it on your computer.
- Click on the new desktop icon titled “R-3.3.4-win.exe” (or current setup file) to install R on your computer.
- The setup process is self-explanatory.
Downloading and Installing R on a Mac
- Go to http://cran.r-project.org.
- Look under the Section titled “Download and Install R” and click on “Download R for (Mac) OS X”.
- Click on “R-3.4.3.pkg” (or current version) and save it on your computer.
- Double click on the installer file titled “R-3.4.3.pkg” (or current setup file) to install R on your computer.
- The setup process is self-explanatory.
R Example 1.A.1: Consider the following data set: \[ 0.44, -6.23, 1, 2, 3, 4, 7, 0, 0, 0 \]

( a ) Compute the sample mean, sample standard deviation, and sample median for this variable using built-in functions.

( b ) Repeat ( a ) by directly using formulas rather than the built-in functions.

Answer: ( a ) We can enter the data into R and view the vector as follows.

x=c(.44,-6.23,1:4,7,rep(0,3))
x

##  [1]  0.44 -6.23  1.00  2.00  3.00  4.00  7.00  0.00  0.00  0.00

The sample summary statistics can be computer with the following built-in functions.

mean(x)

## [1] 1.121

sd(x)

## [1] 3.422801

median(x)

## [1] 0.72

( b ) We can also compute these quantites directly in R. The formula for the mean \[ \bar{x}=\frac{1}{n}\sum_{i=1}^n x_i \] can be computed in R as follows.

sum(x)

## [1] 11.21

n=length(x)
n

## [1] 10

x.bar=sum(x)/n
x.bar

## [1] 1.121

So, the sample mean is \(\bar{x}=1.121\).

Then, we can use the mean to directly compute the variance based on the formula \[ s^2=\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2. \]

x-x.bar

##  [1] -0.681 -7.351 -0.121  0.879  1.879  2.879  5.879 -1.121 -1.121 -1.121

(x-x.bar)^2

##  [1]  0.463761 54.037201  0.014641  0.772641  3.530641  8.288641 34.562641
##  [8]  1.256641  1.256641  1.256641

sum((x-x.bar)^2)

## [1] 105.4401

sum((x-x.bar)^2)/(n-1)

## [1] 11.71557

s=sqrt(sum((x-x.bar)^2)/(n-1))
s

## [1] 3.422801

So, the sample standard deviation is \(s=3.423\).

Finally, to compute the median, we need to sort the data.

sorted.x=sort(x)
sorted.x

##  [1] -6.23  0.00  0.00  0.00  0.44  1.00  2.00  3.00  4.00  7.00

Then the sample median is usually defined as \[ \frac{x_{\left(\lfloor{(n+1)/2\rfloor}\right)}+x_{\left(\lfloor{n/2+1\rfloor}\right)}}{2} \] where \(\lfloor \ \rfloor\) is the floor function. In R, this can be computed as follows.

c(sorted.x[5],sorted.x[6])

## [1] 0.44 1.00

x.median=(sorted.x[5]+sorted.x[6])/2

So, the sample median is \(\frac{x_{(5)}+x_{(6)}}{2}=0.72\).

R Example 1.A.2: Now, we consider the Galton (1886) data set again to illustrate some additional R commands and syntax while also improving the linear model for the heights of the children based on their parents.

It is important to know how to read data from a file. The data set “HeightData.txt” is a text file which has the Galton (1886) data set. In order to read the data into R at the command line, we need to know some commands for finding and changing the filepath. The command getwd() returns the current working directory.

getwd()

## [1] "D:/OneDrive - University of Louisville/2022 Spring/MATH 668 Linear Stat modeling/NotesPreparation/Lecture 1 Chapter 1"

The directory can be changed using the setwd function as illustrated below. Note that you must use slashes instead of backslashes (even in Windows which uses backslashes in its filepaths; for more discussion, see http://www.cs.ucsb.edu/~pconrad/topics/BackslashVsForwardSlash/).

setwd("D:/OneDrive - University of Louisville/2020 Spring/2020 Spring Linear Statistics Modeling/data")
getwd()

## [1] "D:/OneDrive - University of Louisville/2020 Spring/2020 Spring Linear Statistics Modeling/data"

The command dir() displays all of the files and directories in the current working directory.

dir()

##  [1] "Body_fat_example.txt"  "Clinic Data.txt"       "Clinic Data.xls"      
##  [4] "create yield data.txt" "Forbes data.txt"       "HeightData.txt"       
##  [7] "Hematology data.txt"   "Hematology data.xlsx"  "HW5-3.txt"            
## [10] "HW5-3code.txt"         "LandRentData.txt"      "LandRentData.xlsx"    
## [13] "us2003sars.txt"        "us65+rates.txt"        "YieldData.txt"

Now, the read.table function can be used to read the data into R. The optional input sep=“\t” tells R that te file is tab-delimited. The option header=TRUE indicates that the file has a header with variable names. Detailed documentation for the function read.table can be obtained with the command help(read.table).

D=read.table("HeightData.txt",sep="\t",header=TRUE)

The following command displays the first ten rows of the data frame D and stores it in a new data frame D10.

D10=D[1:10,]

There are various ways to select subsets of rows and columns in R. Here are three different ways to extract rows 1-7 and 9 from D10.

D10[c(1,2,3,4,5,6,7,9),]
D10[c(1:7,9),]
D10[-c(8,10),]

##   family father mother midparentHeight children childNum gender childHeight
## 1    001   78.5   67.0           75.43        4        1   male        73.2
## 2    001   78.5   67.0           75.43        4        2 female        69.2
## 3    001   78.5   67.0           75.43        4        3 female        69.0
## 4    001   78.5   67.0           75.43        4        4 female        69.0
## 5    002   75.5   66.5           73.66        4        1   male        73.5
## 6    002   75.5   66.5           73.66        4        2   male        72.5
## 7    002   75.5   66.5           73.66        4        3 female        65.5
## 9    003   75.0   64.0           72.06        2        1   male        71.0

The columns of a data frame represent variables, and as we see in D10, different columns can be different types; for example, most of the variables are quantitative, but the 7th column gender is qualitative. Here are two ways to extract a vector representing the 7th column of D10.

D10[,7]
D10$gender

##  [1] "male"   "female" "female" "female" "male"   "male"   "female" "female"
##  [9] "male"   "female"

Furthermore, we can simultaneously select specified rows and columns as illustrated in the following example.

D10[-c(8,10),7:8]

##   gender childHeight
## 1   male        73.2
## 2 female        69.2
## 3 female        69.0
## 4 female        69.0
## 5   male        73.5
## 6   male        72.5
## 7 female        65.5
## 9   male        71.0

Now, let’s attach D10.

attach(D10)

The following boolean operation checks which children are females.

gender=="female"

##  [1] FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE

R also allows boolean vectors to be used to select rows to be extracted. For example, the following command displays all data for female children in D10.

D10[gender=="female",]

##    family father mother midparentHeight children childNum gender childHeight
## 2     001   78.5   67.0           75.43        4        2 female        69.2
## 3     001   78.5   67.0           75.43        4        3 female        69.0
## 4     001   78.5   67.0           75.43        4        4 female        69.0
## 7     002   75.5   66.5           73.66        4        3 female        65.5
## 8     002   75.5   66.5           73.66        4        4 female        65.5
## 10    003   75.0   64.0           72.06        2        2 female        68.0

detach(D10)

Now, we can use subsetting to recreate the scatterplot for midparent height and child height but with female children plotted in blue and male children plotted in red as shown in the following code. The optional argument type=“n” plots invisible points, but creates the axes so that all of the points will fit on the graph. The function points then adds points to the existing plot, and we use the optional argument col to specify the color for each group of points. Then we can fit regression lines for each group using the lm function with an appropriate boolean vector in the subset option.

attach(D)
plot(midparentHeight,childHeight,type="n")
points(midparentHeight[gender=="female"],childHeight[gender=="female"],col="blue")
points(midparentHeight[gender=="male"],childHeight[gender=="male"],col="red")
lm.females=lm(childHeight~midparentHeight,subset=(gender=="female"))
lm.males=lm(childHeight~midparentHeight,subset=(gender=="male"))
abline(lm.females$coef,col="blue")
abline(lm.males$coef,col="red")

A 3-dimensional scatterplot with the fitted model superimposed can be obtained as follows. First, if it is not already installed, we need to install the package rgl with the command install.packages("rgl"). Then we load the package with the require function.

require(rgl)

## Loading required package: rgl

## Warning: package 'rgl' was built under R version 4.0.5

## Warning: package 'knitr' was built under R version 4.0.5

Similarly, the following code can be used to create separate 3D plots and regression models for children by gender using both parents’ heights as covariates.

clear3d()
plot3d(mother,father,childHeight,type="n")
points3d(mother[gender=="female"],father[gender=="female"],childHeight[gender=="female"],col='blue')
points3d(mother[gender=="male"],father[gender=="male"],childHeight[gender=="male"],col='red')
lm2.females=lm(childHeight~mother+father,subset=(gender=="female"))
lm2.males=lm(childHeight~mother+father,subset=(gender=="male"))
a=coef(lm2.females)["mother"]
b=coef(lm2.females)["father"]
c=-1
d=coef(lm2.females)["(Intercept)"]
planes3d(a, b, c, d, alpha = 0.5, col='blue')
a=coef(lm2.males)["mother"]
b=coef(lm2.males)["father"]
c=-1
d=coef(lm2.males)["(Intercept)"]
planes3d(a, b, c, d, alpha = 0.5, col='red')
detach(D)

Here are some additional references for learning R.
- R Manuals
- Using R for Introductory Statistics by John Verzani
- R for Beginners by Emmanuel Paradis
- R Functions for Regression Analysis by Vito Ricci
- Link to some other additional references at http://lib.stat.cmu.edu/R/CRAN/other-docs.html

Reference

Rencher and Schaalje (2008)
Kutner et al. (2005)

Kutner, Michael H, Christopher J Nachtsheim, John Neter, William Li, and others. 2005. Applied Linear Statistical Models. Vol. 5. McGraw-Hill Irwin Boston.

Rencher, Alvin C, and G Bruce Schaalje. 2008. Linear Models in Statistics. John Wiley & Sons.

Simple Linear Regression Model

Dan Han

Notes for MATH 668

1.1 and 6.1 Simple Linear Regression Model

Historic Origins:

Functional Relation between Two Variables

Statistical Relation between Two Variables

Basic Concepts

Use of Regression Analysis

Regression and Causality

Simple Linear Regression Model

6.2 Estimation of \(\boldsymbol{\beta}_0\), \(\boldsymbol{\beta}_1\), and \(\sigma^2\)

Least-squares Approach

6.4 Coefficient of Determination

1.A.1 Installing R and a Basic Introduction

Reference