\(\def\a{\boldsymbol{a}}\) \(\def\b{\boldsymbol{b}}\) \(\def\c{\boldsymbol{c}}\) \(\def\f{\boldsymbol{f}}\) \(\def\g{\boldsymbol{g}}\) \(\def\h{\boldsymbol{h}}\) \(\def\j{\boldsymbol{j}}\) \(\def\t{\boldsymbol{t}}\) \(\def\u{\boldsymbol{u}}\) \(\def\v{\boldsymbol{v}}\) \(\def\x{\boldsymbol{x}}\) \(\def\y{\boldsymbol{y}}\) \(\def\z{\boldsymbol{z}}\) \(\def\A{\boldsymbol{\mathrm{A}}}\) \(\def\B{\boldsymbol{\mathrm{B}}}\) \(\def\C{\boldsymbol{\mathrm{C}}}\) \(\def\D{\boldsymbol{\mathrm{D}}}\) \(\def\E{\boldsymbol{\mathrm{E}}}\) \(\def\H{\boldsymbol{\mathrm{H}}}\) \(\def\I{\boldsymbol{\mathrm{I}}}\) \(\def\J{\boldsymbol{\mathrm{J}}}\) \(\def\M{\boldsymbol{\mathrm{M}}}\) \(\def\O{\boldsymbol{\mathrm{O}}}\) \(\def\P{\boldsymbol{\mathrm{P}}}\) \(\def\Q{\boldsymbol{\mathrm{Q}}}\) \(\def\T{\boldsymbol{\mathrm{T}}}\) \(\def\U{\boldsymbol{\mathrm{U}}}\) \(\def\X{\boldsymbol{\mathrm{X}}}\) \(\def\Y{\boldsymbol{\mathrm{Y}}}\) \(\def\bmu{\boldsymbol{\mu}}\) \(\def\ep{\boldsymbol{\varepsilon}}\) \(\def\Sig{\boldsymbol{\Sigma}}\) \(\def\zeros{\boldsymbol{0}}\) \(\def\diag{\mathrm{diag}}\) \(\def\rank{\mathrm{rank}}\) \(\def\tr{^\top}\) \(\def\ds{\displaystyle}\) \(\def\bea{\begin{eqnarray}}\) \(\def\nnn{\nonumber}\) \(\def\eea{\nnn\end{eqnarray}}\) \(\def\bpm{\begin{pmatrix}}\) \(\def\epm{\end{pmatrix}}\) \(\def\var{\mbox{var}}\) \(\def\cov{\mbox{cov}}\) \(\def\Reals{\mathbb{R}}\) \(\def\abs{\mbox{abs}}\) \(\def\sion{\sum_{i=1}^n}\) \(\def\res{\hat{\varepsilon}}\) \(\newcommand{\EV}[1]{E\left(#1\right)}\) \(\newcommand{\trace}[1]{\mathrm{tr}\left(#1\right)}\)
A function relation: \(y=f(x)\)
\(y\): dependent variable \(x\): independent variable
A statistical relation, unlike functional relation, is not a perfect one. In general, the observations for a statistical relation do not fall directly on the curve of relationship.
GaltonFamilies is included in the R package HistData which can be installed with the following command.install.packages("HistData")
The following commands show how to load the package, display its first 10 rows, and display the dimension of the data frame.
require(HistData)
## Loading required package: HistData
## Warning: package 'HistData' was built under R version 4.0.4
GaltonFamilies[1:10,]
## family father mother midparentHeight children childNum gender childHeight
## 1 001 78.5 67.0 75.43 4 1 male 73.2
## 2 001 78.5 67.0 75.43 4 2 female 69.2
## 3 001 78.5 67.0 75.43 4 3 female 69.0
## 4 001 78.5 67.0 75.43 4 4 female 69.0
## 5 002 75.5 66.5 73.66 4 1 male 73.5
## 6 002 75.5 66.5 73.66 4 2 male 72.5
## 7 002 75.5 66.5 73.66 4 3 female 65.5
## 8 002 75.5 66.5 73.66 4 4 female 65.5
## 9 003 75.0 64.0 72.06 2 1 male 71.0
## 10 003 75.0 64.0 72.06 2 2 female 68.0
dim(GaltonFamilies)
## [1] 934 8
A scatterplot of the dataset can be obtained as the following:
plot(GaltonFamilies$midparentHeight,GaltonFamilies$childHeight,xlab="Midparent Height",ylab="Child Height")
Here the Child Height is taken as the dependent or response variable \(y\), and midparent Height is taken as the independent, explanatory or predictor variable \(x\).
A regression model is a formal means of expressing the two essential ingredients of a statistical relation:
A tendency of the response variable \(y\) to vary with the predictor variable \(x\) in a systematic fashion.
A scattering of points around the curve of statistical relationship.
The existence of a statistical relation between the response variable \(y\) and the explanatory or predictor variable \(x\) does not imply in any way that \(y\) depends causally on \(x\). No matter how strongly is the statistical relation between \(x\) and \(y\), no cause-and-effect patter is necessarily implied by the regression model. For example,data on size of vocabulary (\(x\)) and writing speed (\(y\)) for a sample of young children aged 5-10 will show a positive regression relation. This relation does not imply, however, that an increase in vocabulary causes a faster writing speed. Here, other explanatory variables, such as age of the child and amount of education, affect both the vocabulary (\(x\)) and the writing speed (\(y\)). Older children have a larger vocabulary and a faster writing speed.
Even when a strong statistical relationship reflects causal conditions, the causal conditions may act in the opposite direction, from \(y\) to \(x\). Consider, for instance, the calibration of a thermometer. Here, readings of the thermometer are taken at different known temperatures, and the regression relation is studied so that the accuracy of predictions made by using the thermometer readings can be assessed. For this purpose, the thermometer reading is the predictor variable \(x\), and the actual temperature is the response variable \(y\) to be predicted. However, the causal pattern here does not go from \(x\) to \(y\), but in the opposite direction: the actual temperature (\(y\)) affects the thermometer reading (\(x\)).
\[\begin{align} y=\beta_0+\beta_1x+\varepsilon \label{eq1}\tag{1} \end{align}\]
where \(y\) is the response (dependent) variable, \(x\) is predictor(independent) variable. \(\beta_0\) and \(\beta_1\) are the intercept and slope respectively and \(\varepsilon\) is the model error. We ordinarly view (1) in a data setting where pairs of observations \((x_1,y_1),(x_2,y_2),...,(x_n,y_n)\) are taken on experimental units, and estimates of the parameters \(\beta_0\) and \(\beta_1\) are sought. Thus, we can write the model as the following definition.
Definition 6.1.1 (p.127): The simple linear regression model with \(n\) observations can be written as \[ y_i=\beta_0+\beta_1 x_i+\varepsilon_i, i=1, 2, \ldots, n \] with the following assumptions:
Here, \(\beta_0\) and \(\beta_1\) are fixed but unknown model parameters representing the intercept and the slope of the regression line, respectively.
In this chapter, we assume the \(x_i\)’s are not random.
\(y\) is called the dependent or response variable, \(y_i\) is the value of the response variable in the ith trial
\(x\) is called the independent, predictor, or explanatory variable, \(x_i\) is a known constant, the value of the predictor variable in the ith trial
\(\varepsilon\) is a random variable called the error term.\(\varepsilon_i\) is a the value of the in the ith trial
Remark:
1.Simple: There is only one \(x\) to predict the response \(y\)
2.Linear: The model is linear in parameters \(\beta_0\) and \(\beta_1\)
Important Features of Model:
Seek estimators \(\hat{\beta}_0\) and \(\hat{\beta}_1\) that minimize the sum of squares of the deviations \(y_i-\hat{y}_i\) of the \(n\) observed \(y_i\)’s from their predicted values \(\hat{y_i}=\hat{\beta}_0+\hat{\beta_1}x_i\). That is to minimize the residual sum of squares \(\sum\limits_{i=1}^n(y_i-\hat{y_i})^2\)
lm.model=lm(childHeight~midparentHeight,data=GaltonFamilies)
lm.model
##
## Call:
## lm(formula = childHeight ~ midparentHeight, data = GaltonFamilies)
##
## Coefficients:
## (Intercept) midparentHeight
## 22.6362 0.6374
So, the fitted model in this case is \[ \mbox{Predicted child height}=22.6362+.6374 \cdot \mbox{Midparent height}. \]
Now, let’s see how to obtain this directly from the formulas we have derived in Chapter 6. First, let’s compute the summary statistics.
x=GaltonFamilies$midparentHeight
y=GaltonFamilies$childHeight
x.bar=mean(x)
y.bar=mean(y)
s.x=sd(x)
s.y=sd(y)
r=cor(x,y)
x.bar;y.bar;s.x;s.y;r
## [1] 69.20677
## [1] 66.74593
## [1] 1.80237
## [1] 3.579251
## [1] 0.3209499
So, then we can compute the least squares estimates of \(\beta_1\) and \(\beta_0\)
beta1.hat=r*s.y/s.x
beta0.hat=y.bar-beta1.hat*x.bar
beta1.hat;beta0.hat
## [1] 0.6373609
## [1] 22.63624
So, we see that \[ \hat{\beta}_1=\frac{rs_y}{s_x}=\frac{(0.3209499)(3.579251)}{1.80237}=0.6373609 \] and \[ \hat{\beta}_0=66.74593-(0.6373609)(69.20677)=22.63624 \]
The regression line can be added to the scatterplot we draw in R Example 1.1.1 in blue.
plot(GaltonFamilies$midparentHeight,GaltonFamilies$childHeight,xlab="Midparent Height",ylab="Child Height")
abline(lm.model$coef,col="blue")
The unbiased estimate of \(\sigma^2\) is \[ s^2=\frac{SSE}{n-2}=\frac{(1-r^2) SST}{n-2}=\frac{(1-r^2)(n-1)s_y^2}{n-2} \]] and can be computed as follows.
n=length(x)
(1-r^2)*(n-1)*s.y^2/(n-2)
## [1] 11.50372
So, we have \[ s^2=\frac{(1-r^2)(n-1)s_y^2}{n-2}=11.50372. \]
We saw that the built-in function summary() can be used to obtain the estimate and some other information of the regression line as follows.
summary(lm(childHeight~midparentHeight,data=GaltonFamilies))
##
## Call:
## lm(formula = childHeight ~ midparentHeight, data = GaltonFamilies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9570 -2.6989 -0.2155 2.7961 11.6848
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.63624 4.26511 5.307 1.39e-07 ***
## midparentHeight 0.63736 0.06161 10.345 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.392 on 932 degrees of freedom
## Multiple R-squared: 0.103, Adjusted R-squared: 0.102
## F-statistic: 107 on 1 and 932 DF, p-value: < 2.2e-16
( a ) Compute the sample mean, sample standard deviation, and sample median for this variable using built-in functions.
( b ) Repeat ( a ) by directly using formulas rather than the built-in functions.
x=c(.44,-6.23,1:4,7,rep(0,3))
x
## [1] 0.44 -6.23 1.00 2.00 3.00 4.00 7.00 0.00 0.00 0.00
The sample summary statistics can be computer with the following built-in functions.
mean(x)
## [1] 1.121
sd(x)
## [1] 3.422801
median(x)
## [1] 0.72
( b ) We can also compute these quantites directly in R. The formula for the mean \[ \bar{x}=\frac{1}{n}\sum_{i=1}^n x_i \] can be computed in R as follows.
sum(x)
## [1] 11.21
n=length(x)
n
## [1] 10
x.bar=sum(x)/n
x.bar
## [1] 1.121
So, the sample mean is \(\bar{x}=1.121\).
Then, we can use the mean to directly compute the variance based on the formula \[ s^2=\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2. \]
x-x.bar
## [1] -0.681 -7.351 -0.121 0.879 1.879 2.879 5.879 -1.121 -1.121 -1.121
(x-x.bar)^2
## [1] 0.463761 54.037201 0.014641 0.772641 3.530641 8.288641 34.562641
## [8] 1.256641 1.256641 1.256641
sum((x-x.bar)^2)
## [1] 105.4401
sum((x-x.bar)^2)/(n-1)
## [1] 11.71557
s=sqrt(sum((x-x.bar)^2)/(n-1))
s
## [1] 3.422801
So, the sample standard deviation is \(s=3.423\).
Finally, to compute the median, we need to sort the data.
sorted.x=sort(x)
sorted.x
## [1] -6.23 0.00 0.00 0.00 0.44 1.00 2.00 3.00 4.00 7.00
Then the sample median is usually defined as \[ \frac{x_{\left(\lfloor{(n+1)/2\rfloor}\right)}+x_{\left(\lfloor{n/2+1\rfloor}\right)}}{2} \] where \(\lfloor \ \rfloor\) is the floor function. In R, this can be computed as follows.
c(sorted.x[5],sorted.x[6])
## [1] 0.44 1.00
x.median=(sorted.x[5]+sorted.x[6])/2
So, the sample median is \(\frac{x_{(5)}+x_{(6)}}{2}=0.72\).
It is important to know how to read data from a file. The data set “HeightData.txt” is a text file which has the Galton (1886) data set. In order to read the data into R at the command line, we need to know some commands for finding and changing the filepath. The command getwd() returns the current working directory.
getwd()
## [1] "D:/OneDrive - University of Louisville/2022 Spring/MATH 668 Linear Stat modeling/NotesPreparation/Lecture 1 Chapter 1"
The directory can be changed using the setwd function as illustrated below. Note that you must use slashes instead of backslashes (even in Windows which uses backslashes in its filepaths; for more discussion, see http://www.cs.ucsb.edu/~pconrad/topics/BackslashVsForwardSlash/).
setwd("D:/OneDrive - University of Louisville/2020 Spring/2020 Spring Linear Statistics Modeling/data")
getwd()
## [1] "D:/OneDrive - University of Louisville/2020 Spring/2020 Spring Linear Statistics Modeling/data"
The command dir() displays all of the files and directories in the current working directory.
dir()
## [1] "Body_fat_example.txt" "Clinic Data.txt" "Clinic Data.xls"
## [4] "create yield data.txt" "Forbes data.txt" "HeightData.txt"
## [7] "Hematology data.txt" "Hematology data.xlsx" "HW5-3.txt"
## [10] "HW5-3code.txt" "LandRentData.txt" "LandRentData.xlsx"
## [13] "us2003sars.txt" "us65+rates.txt" "YieldData.txt"
Now, the read.table function can be used to read the data into R. The optional input sep=“\t” tells R that te file is tab-delimited. The option header=TRUE indicates that the file has a header with variable names. Detailed documentation for the function read.table can be obtained with the command help(read.table).
D=read.table("HeightData.txt",sep="\t",header=TRUE)
The following command displays the first ten rows of the data frame D and stores it in a new data frame D10.
D10=D[1:10,]
There are various ways to select subsets of rows and columns in R. Here are three different ways to extract rows 1-7 and 9 from D10.
D10[c(1,2,3,4,5,6,7,9),]
D10[c(1:7,9),]
D10[-c(8,10),]
## family father mother midparentHeight children childNum gender childHeight
## 1 001 78.5 67.0 75.43 4 1 male 73.2
## 2 001 78.5 67.0 75.43 4 2 female 69.2
## 3 001 78.5 67.0 75.43 4 3 female 69.0
## 4 001 78.5 67.0 75.43 4 4 female 69.0
## 5 002 75.5 66.5 73.66 4 1 male 73.5
## 6 002 75.5 66.5 73.66 4 2 male 72.5
## 7 002 75.5 66.5 73.66 4 3 female 65.5
## 9 003 75.0 64.0 72.06 2 1 male 71.0
The columns of a data frame represent variables, and as we see in D10, different columns can be different types; for example, most of the variables are quantitative, but the 7th column gender is qualitative. Here are two ways to extract a vector representing the 7th column of D10.
D10[,7]
D10$gender
## [1] "male" "female" "female" "female" "male" "male" "female" "female"
## [9] "male" "female"
Furthermore, we can simultaneously select specified rows and columns as illustrated in the following example.
D10[-c(8,10),7:8]
## gender childHeight
## 1 male 73.2
## 2 female 69.2
## 3 female 69.0
## 4 female 69.0
## 5 male 73.5
## 6 male 72.5
## 7 female 65.5
## 9 male 71.0
Now, let’s attach D10.
attach(D10)
The following boolean operation checks which children are females.
gender=="female"
## [1] FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE
R also allows boolean vectors to be used to select rows to be extracted. For example, the following command displays all data for female children in D10.
D10[gender=="female",]
## family father mother midparentHeight children childNum gender childHeight
## 2 001 78.5 67.0 75.43 4 2 female 69.2
## 3 001 78.5 67.0 75.43 4 3 female 69.0
## 4 001 78.5 67.0 75.43 4 4 female 69.0
## 7 002 75.5 66.5 73.66 4 3 female 65.5
## 8 002 75.5 66.5 73.66 4 4 female 65.5
## 10 003 75.0 64.0 72.06 2 2 female 68.0
detach(D10)
Now, we can use subsetting to recreate the scatterplot for midparent height and child height but with female children plotted in blue and male children plotted in red as shown in the following code. The optional argument type=“n” plots invisible points, but creates the axes so that all of the points will fit on the graph. The function points then adds points to the existing plot, and we use the optional argument col to specify the color for each group of points. Then we can fit regression lines for each group using the lm function with an appropriate boolean vector in the subset option.
attach(D)
plot(midparentHeight,childHeight,type="n")
points(midparentHeight[gender=="female"],childHeight[gender=="female"],col="blue")
points(midparentHeight[gender=="male"],childHeight[gender=="male"],col="red")
lm.females=lm(childHeight~midparentHeight,subset=(gender=="female"))
lm.males=lm(childHeight~midparentHeight,subset=(gender=="male"))
abline(lm.females$coef,col="blue")
abline(lm.males$coef,col="red")
A 3-dimensional scatterplot with the fitted model superimposed can be obtained as follows. First, if it is not already installed, we need to install the package rgl with the command install.packages("rgl"). Then we load the package with the require function.
require(rgl)
## Loading required package: rgl
## Warning: package 'rgl' was built under R version 4.0.5
## Warning: package 'knitr' was built under R version 4.0.5
Similarly, the following code can be used to create separate 3D plots and regression models for children by gender using both parents’ heights as covariates.
clear3d()
plot3d(mother,father,childHeight,type="n")
points3d(mother[gender=="female"],father[gender=="female"],childHeight[gender=="female"],col='blue')
points3d(mother[gender=="male"],father[gender=="male"],childHeight[gender=="male"],col='red')
lm2.females=lm(childHeight~mother+father,subset=(gender=="female"))
lm2.males=lm(childHeight~mother+father,subset=(gender=="male"))
a=coef(lm2.females)["mother"]
b=coef(lm2.females)["father"]
c=-1
d=coef(lm2.females)["(Intercept)"]
planes3d(a, b, c, d, alpha = 0.5, col='blue')
a=coef(lm2.males)["mother"]
b=coef(lm2.males)["father"]
c=-1
d=coef(lm2.males)["(Intercept)"]
planes3d(a, b, c, d, alpha = 0.5, col='red')
detach(D)
Kutner, Michael H, Christopher J Nachtsheim, John Neter, William Li, and others. 2005. Applied Linear Statistical Models. Vol. 5. McGraw-Hill Irwin Boston.
Rencher, Alvin C, and G Bruce Schaalje. 2008. Linear Models in Statistics. John Wiley & Sons.