Sameer Mathur
October 2017
Simple Linear Regression
Suppose we want to construct a model that will predict a woman's weight using her height.
We need data.
We could collect data. For example, a dataset called women, built into R, contains the heights and weights of 15 women.
We can use this dataset to construct the model.
women
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
7 64 132
8 65 135
9 66 139
10 67 142
11 68 146
12 69 150
13 70 154
14 71 159
15 72 164
attach(women)
library(psych)
describe(women)[, c(1:5, 8:9)] # selected columns
vars n mean sd median min max
height 1 15 65.00 4.47 65 58 72
weight 2 15 136.73 15.50 135 115 164
height is measured in inches;
weight is measured in lbs
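If the psych package is unavailable, base R provides a rough equivalent of describe(); a minimal sketch using only built-in functions:

summary(women)      # min, quartiles, median, mean, max for each column
sapply(women, sd)   # standard deviations of height and weight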
plot(height, weight, xlab="height (inches)", ylab="weight (lbs)")
In R, the function for fitting a linear model is called lm().
The format is
fit <- lm(formula, data)
The formula is typically written as \( Y \sim X_1 + X_2 + \ldots + X_k \)
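For example, using a hypothetical data frame df that contains the variables:

fit1 <- lm(Y ~ X, data = df)             # one predictor
fit2 <- lm(Y ~ X1 + X2 + X3, data = df)  # several predictors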
When the Regression Model contains just one independent variable \( X \), the model is:
\[ Y = b_0 + b_1*X + e \]
This describes a line with slope \( b_1 \) and \( y \)-intercept \( b_0 \).
\( X \): Independent Variable;
\( Y \): Dependent Variable
In an experimental context we may have data points which reflect such a relationship between \( Y \) and \( X \), but only approximately.
Say there are \( n \) such points and call them \( \{(X_i, Y_i),\ i = 1, \ldots, n\} \).
We can describe the approximate relation between \( Y_i \) and \( X_i \) by introducing an error term \( e_i \) to capture the deviation of each data point from the line:
\[ Y_i = b_0 + b_1*X_i + e_i \]
This relationship between the parameters and the data points is called a linear regression model.
The \( e_i \) are the error terms; their estimates from the fitted line are called residuals.
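For reference, the least-squares estimates of the two parameters have standard closed forms:
\[ b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X} \]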
weight: Dependent variable
height: Independent variable
Model:
\[ weight = b_0 + b_1*height + e \]
The lm() function takes the formula weight ~ height and the women data frame as input.
It returns the estimated coefficients \( b_0 \) and \( b_1 \) as output.
fit <- lm(weight ~ height, data = women)
summary(fit)
Call:
lm(formula = weight ~ height, data = women)
Residuals:
Min 1Q Median 3Q Max
-1.7333 -1.1333 -0.3833 0.7417 3.1167
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
height 3.45000 0.09114 37.85 1.09e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
\( b_0 = -87.52 \), \( b_1 = 3.45 \)
Model: \( weight = -87.52 + 3.45*height \)
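To see the fit, we can redraw the scatterplot and overlay the fitted line; a minimal sketch in base R:

plot(women$height, women$weight, xlab = "height (inches)", ylab = "weight (lbs)")
abline(fit)  # draws the fitted line weight = -87.52 + 3.45*height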
fit$coefficients
(Intercept) height
-87.51667 3.45000
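With the fitted model, the predict() function returns predicted weights for new heights; for example, a hypothetical height of 70 inches:

predict(fit, newdata = data.frame(height = 70))
# by hand: -87.51667 + 3.45*70 = 153.98 lbs, matching fitted(fit)[13]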
If repeated samples were taken and a 95% confidence interval were computed for each sample, 95% of the intervals would contain the true value of the coefficient.
confint(fit)
2.5 % 97.5 %
(Intercept) -100.342655 -74.690679
height 3.253112 3.646888
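Each interval is simply estimate ± t × (standard error). A quick check that approximately reproduces the interval for height (small differences come from the rounded standard error):

3.45 + c(-1, 1) * qt(0.975, df = 13) * 0.09114
# approx. 3.253 to 3.647, matching confint(fit)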
# weight
women$weight
[1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164
fitted(fit)
1 2 3 4 5 6 7 8
112.5833 116.0333 119.4833 122.9333 126.3833 129.8333 133.2833 136.7333
9 10 11 12 13 14 15
140.1833 143.6333 147.0833 150.5333 153.9833 157.4333 160.8833
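Each fitted value is just the estimated line evaluated at the corresponding height:

-87.51667 + 3.45 * women$height  # reproduces fitted(fit)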
Residuals are the vertical distances between the data points and the fitted line.
The Ordinary Least Squares (OLS) method chooses the line that minimizes the sum of squared residuals: the accuracy of a line through the sample points is measured by this sum, and the goal is to make it as small as possible.
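For a single predictor, the OLS slope reduces to the sample covariance divided by the variance of the predictor (the closed form shown earlier); a quick verification:

cov(women$height, women$weight) / var(women$height)   # 3.45, the slope
mean(women$weight) - 3.45 * mean(women$height)        # -87.51667, the intercept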
residuals(fit)
1 2 3 4 5 6
2.41666667 0.96666667 0.51666667 0.06666667 -0.38333333 -0.83333333
7 8 9 10 11 12
-1.28333333 -1.73333333 -1.18333333 -1.63333333 -1.08333333 -0.53333333
13 14 15
0.01666667 1.56666667 3.11666667
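Equivalently, each residual is the observed weight minus the fitted weight:

women$weight - fitted(fit)  # reproduces residuals(fit)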
The multiple R-squared in the summary(fit) output above indicates that the model accounts for 99.1% of the variance in weights.
The F-statistic tests whether the predictor variables, taken together, predict the response variable above chance levels.
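Both statistics can be reconstructed from the residuals; a sketch using the sums of squares:

rss <- sum(residuals(fit)^2)                       # residual sum of squares
tss <- sum((women$weight - mean(women$weight))^2)  # total sum of squares
1 - rss/tss                                        # R-squared: 0.991
((tss - rss)/1) / (rss/13)                         # F-statistic: approx. 1433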