Sameer Mathur
October 2017
Simple Linear Regression
Suppose we want to construct a model that will predict a woman's weight using her height.
We need data.
We could collect data. For example, a dataset called women, built into R, contains the heights and weights of 15 women.
We can use this dataset to construct the model.
women
height weight
1 58 115
2 59 117
3 60 120
4 61 123
5 62 126
6 63 129
7 64 132
8 65 135
9 66 139
10 67 142
11 68 146
12 69 150
13 70 154
14 71 159
15 72 164
attach(women)
library(psych)
describe(women)[, c(1:5, 8:9)] # selected columns
vars n mean sd median min max
height 1 15 65.00 4.47 65 58 72
weight 2 15 136.73 15.50 135 115 164
height is measured in inches;
weight is measured in lbs
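If the psych package is unavailable, base R provides a rough equivalent of describe(); a minimal sketch using only built-in functions:

summary(women)      # min, quartiles, median, mean, max for each column
sapply(women, sd)   # standard deviations of height and weight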
plot(height, weight, xlab="height (inches)", ylab="weight (lbs)")
In R, the function for fitting a linear model is called lm().
The format is
fit <- lm(formula, data)
The formula is typically written as \( Y \sim X_1 + X_2 + \ldots + X_k \)
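For example, using a hypothetical data frame df that contains the variables:

fit1 <- lm(Y ~ X, data = df)             # one predictor
fit2 <- lm(Y ~ X1 + X2 + X3, data = df)  # several predictors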
When the Regression Model contains just one independent variable \( X \), the model is:
\[ Y = b_0 + b_1*X + e \]
This describes a line with slope \( b_1 \) and \( y \)-intercept \( b_0 \).
\( X \): Independent Variable;
\( Y \): Dependent Variable
In an experimental context we may have data points which reflect such a relationship between \( Y \) and \( X \), but only approximately.
Say there are \( n \) such points and call them \( \{(X_i, Y_i),\ i = 1, \ldots, n\} \).
We can describe the approximate relation between \( Y_i \) and \( X_i \) by introducing an error term \( e_i \) to capture the deviation of each data point from the line:
\[ Y_i = b_0 + b_1*X_i + e_i \]
This relationship between the parameters and the data points is called a linear regression model.
The \( e_i \) are the error terms; their estimates from the fitted line are called residuals.
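For reference, the least-squares estimates of the two parameters have standard closed forms:
\[ b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad b_0 = \bar{Y} - b_1 \bar{X} \]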
weight: Dependent variable
height: Independent variable
Model:
\[ weight = b_0 + b_1*height + e \]
The lm() function takes the formula weight ~ height and the women data frame as input.
It returns the estimated coefficients \( b_0 \) and \( b_1 \) as output.
fit <- lm(weight ~ height, data = women)
summary(fit)
Call:
lm(formula = weight ~ height, data = women)
Residuals:
Min 1Q Median 3Q Max
-1.7333 -1.1333 -0.3833 0.7417 3.1167
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -87.51667 5.93694 -14.74 1.71e-09 ***
height 3.45000 0.09114 37.85 1.09e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.525 on 13 degrees of freedom
Multiple R-squared: 0.991, Adjusted R-squared: 0.9903
F-statistic: 1433 on 1 and 13 DF, p-value: 1.091e-14
\( b_0 = -87.52 \), \( b_1 = 3.45 \)
Model: \( weight = -87.52 + 3.45*height \)
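To see the fit, we can redraw the scatterplot and overlay the fitted line; a minimal sketch in base R:

plot(women$height, women$weight, xlab = "height (inches)", ylab = "weight (lbs)")
abline(fit)  # draws the fitted line weight = -87.52 + 3.45*height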
fit$coefficients
(Intercept) height
-87.51667 3.45000
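With the fitted model, the predict() function returns predicted weights for new heights; for example, a hypothetical height of 70 inches:

predict(fit, newdata = data.frame(height = 70))
# by hand: -87.51667 + 3.45*70 = 153.98 lbs, matching fitted(fit)[13]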
If repeated samples were taken and a 95% confidence interval were computed for each sample, 95% of the intervals would contain the true value of the coefficient.
confint(fit)
2.5 % 97.5 %
(Intercept) -100.342655 -74.690679
height 3.253112 3.646888
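Each interval is simply estimate ± t × (standard error). A quick check that approximately reproduces the interval for height (small differences come from the rounded standard error):

3.45 + c(-1, 1) * qt(0.975, df = 13) * 0.09114
# approx. 3.253 to 3.647, matching confint(fit)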
# weight
women$weight
[1] 115 117 120 123 126 129 132 135 139 142 146 150 154 159 164
fitted(fit)
1 2 3 4 5 6 7 8
112.5833 116.0333 119.4833 122.9333 126.3833 129.8333 133.2833 136.7333
9 10 11 12 13 14 15
140.1833 143.6333 147.0833 150.5333 153.9833 157.4333 160.8833
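Each fitted value is just the estimated line evaluated at the corresponding height:

-87.51667 + 3.45 * women$height  # reproduces fitted(fit)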
Residuals are the vertical distances between the data points and the fitted line.
The Ordinary Least Squares (OLS) method chooses the line that minimizes the sum of squared residuals: the accuracy of a line through the sample points is measured by this sum, and the goal is to make it as small as possible.
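For a single predictor, the OLS slope reduces to the sample covariance divided by the variance of the predictor (the closed form shown earlier); a quick verification:

cov(women$height, women$weight) / var(women$height)   # 3.45, the slope
mean(women$weight) - 3.45 * mean(women$height)        # -87.51667, the intercept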
residuals(fit)
1 2 3 4 5 6
2.41666667 0.96666667 0.51666667 0.06666667 -0.38333333 -0.83333333
7 8 9 10 11 12
-1.28333333 -1.73333333 -1.18333333 -1.63333333 -1.08333333 -0.53333333
13 14 15
0.01666667 1.56666667 3.11666667
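Equivalently, each residual is the observed weight minus the fitted weight:

women$weight - fitted(fit)  # reproduces residuals(fit)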
The multiple R-squared in the summary(fit) output above indicates that the model accounts for 99.1% of the variance in weights.
The F-statistic tests whether the predictor variables, taken together, predict the response variable above chance levels.
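Both statistics can be reconstructed from the residuals; a sketch using the sums of squares:

rss <- sum(residuals(fit)^2)                       # residual sum of squares
tss <- sum((women$weight - mean(women$weight))^2)  # total sum of squares
1 - rss/tss                                        # R-squared: 0.991
((tss - rss)/1) / (rss/13)                         # F-statistic: approx. 1433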