Data available at https://www2.math.binghamton.edu/p/people/kargin/math448/start

Prediction

Let us try to predict IQ from physical characteirstics of an individual. The data is on a sample of n = 38 college students. IQ is from the revised Wechsler Adult Intelligence Scale. Brain size is from MRI scans. Also height and weight are available.

rm(list = ls())
iqsize <- read.csv("./data/iqsize.csv", header=T)
attach(iqsize)
head(iqsize)
##   PIQ  Brain Height Weight
## 1 124  81.69   64.5    118
## 2 150 103.84   73.3    143
## 3 128  96.54   68.8    172
## 4 134  95.15   65.0    147
## 5 110  92.88   69.0    146
## 6 131  99.13   64.5    138
summary(iqsize)
##       PIQ             Brain            Height          Weight     
##  Min.   : 72.00   Min.   : 79.06   Min.   :62.00   Min.   :106.0  
##  1st Qu.: 89.25   1st Qu.: 85.48   1st Qu.:66.00   1st Qu.:135.2  
##  Median :115.00   Median : 90.54   Median :68.00   Median :146.5  
##  Mean   :111.34   Mean   : 90.68   Mean   :68.42   Mean   :151.1  
##  3rd Qu.:128.00   3rd Qu.: 94.95   3rd Qu.:70.38   3rd Qu.:172.0  
##  Max.   :150.00   Max.   :107.95   Max.   :77.00   Max.   :192.0

Now let us estimate this model.

model <- lm(PIQ ~ Brain + Height + Weight)
summary(model)
## 
## Call:
## lm(formula = PIQ ~ Brain + Height + Weight)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -32.74 -12.09  -3.84  14.17  51.69 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.114e+02  6.297e+01   1.768 0.085979 .  
## Brain        2.060e+00  5.634e-01   3.657 0.000856 ***
## Height      -2.732e+00  1.229e+00  -2.222 0.033034 *  
## Weight       5.599e-04  1.971e-01   0.003 0.997750    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.79 on 34 degrees of freedom
## Multiple R-squared:  0.2949, Adjusted R-squared:  0.2327 
## F-statistic: 4.741 on 3 and 34 DF,  p-value: 0.007215

It is interesting that IQ appears to be predictable, although the \(R^2\) (a goodness of fit measure) is pretty low.

Let us omit the statistically insignificant Weight variable and re-estimate the mode.

model <- lm(PIQ ~ Brain + Height)
summary(model)
## 
## Call:
## lm(formula = PIQ ~ Brain + Height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.750 -12.090  -3.841  14.174  51.690 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 111.2757    55.8673   1.992 0.054243 .  
## Brain         2.0606     0.5466   3.770 0.000604 ***
## Height       -2.7299     0.9932  -2.749 0.009399 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.51 on 35 degrees of freedom
## Multiple R-squared:  0.2949, Adjusted R-squared:  0.2546 
## F-statistic: 7.321 on 2 and 35 DF,  p-value: 0.002208

Now let us try to perform prediction. First let us look at the confidence interval for the regression mean. The results include the prediction and the confidence interval and some other variables such as the standard error of the regression mean.

predict(model, interval="confidence", se.fit=T,
        newdata=data.frame(Brain=90, Height=70))
## $fit
##        fit      lwr     upr
## 1 105.6391 98.23722 113.041
## 
## $se.fit
## [1] 3.646064
## 
## $df
## [1] 35
## 
## $residual.scale
## [1] 19.50957

And now let us calculate the prediction interval. (We omitted here the option se.fit = T that calculates the standard error. It would give the same value for the standard error.)

predict(model, interval="prediction", 
        newdata=data.frame(Brain=90, Height=70))
##        fit      lwr      upr
## 1 105.6391 65.34688 145.9314

Note that the prediction interval is extremely wide due. The model is not very accurate.

detach(iqsize)
rm(list = ls()) #remove everything in the global environment 

Model comparison

Let us load a new database. this dataset consists of variables possibly relating to blood pressures of \(n = 39\) Peruvians who have moved from rural high-altitude areas to urban lower-altitude areas (Peru dataset).

Let us read the model and attach it so that the variables in this dataset can be visible to R without mentioning the name of the dataset.

peru <- read.csv("./data/peru.csv", header=T)
attach(peru) 
head(peru)
##   Age Years Weight Height Chin Forearm Calf Pulse Systol Diastol
## 1  21     1   71.0   1629  8.0     7.0 12.7    88    170      76
## 2  22     6   56.5   1569  3.3     5.0  8.0    64    120      60
## 3  24     5   56.0   1561  3.3     1.3  4.3    68    125      75
## 4  24     1   61.0   1619  3.7     3.0  4.3    52    148     120
## 5  25     1   65.0   1566  9.0    12.7 20.7    72    140      78
## 6  27    19   62.0   1639  3.0     3.3  5.7    72    106      72

Now let us run the linear regression. (We also included one new variable, fraction of life lived in urban area.)

FracLife <- Years/Age
model.1 <- lm(Systol ~ Age + Years + FracLife + Weight + Height + Chin +
              Forearm + Calf + Pulse)
summary(model.1)
## 
## Call:
## lm(formula = Systol ~ Age + Years + FracLife + Weight + Height + 
##     Chin + Forearm + Calf + Pulse)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.3442  -6.3972   0.0507   5.7292  14.5257 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  146.81907   48.97096   2.998 0.005526 ** 
## Age           -1.12144    0.32741  -3.425 0.001855 ** 
## Years          2.45538    0.81458   3.014 0.005306 ** 
## FracLife    -115.29395   30.16900  -3.822 0.000648 ***
## Weight         1.41393    0.43097   3.281 0.002697 ** 
## Height        -0.03464    0.03686  -0.940 0.355194    
## Chin          -0.94369    0.74097  -1.274 0.212923    
## Forearm       -1.17085    1.19329  -0.981 0.334612    
## Calf          -0.15867    0.53716  -0.295 0.769810    
## Pulse          0.11455    0.17043   0.672 0.506818    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.655 on 29 degrees of freedom
## Multiple R-squared:  0.6674, Adjusted R-squared:  0.5641 
## F-statistic: 6.465 on 9 and 29 DF,  p-value: 5.241e-05

Now let us run a reduced model, which only includes variables that appear significant

model.2 <- lm(Systol ~ Age + Years + FracLife + Weight)
summary(model.2)
## 
## Call:
## lm(formula = Systol ~ Age + Years + FracLife + Weight)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.890  -5.976   0.058   5.407  16.835 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  116.8354    21.9797   5.316 6.69e-06 ***
## Age           -0.9507     0.3164  -3.004 0.004971 ** 
## Years          2.3393     0.7714   3.032 0.004621 ** 
## FracLife    -108.0728    28.3302  -3.815 0.000549 ***
## Weight         0.8324     0.2754   3.022 0.004742 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.795 on 34 degrees of freedom
## Multiple R-squared:  0.5974, Adjusted R-squared:   0.55 
## F-statistic: 12.61 on 4 and 34 DF,  p-value: 2.142e-06

Now we want to check if the full model is more useful than the reduced model. We need to extract the SSEs of both models. We could do it by hand but there is an excellent function in R for this purpose. It shows the sum of squared errors as RSS (residual sum of squares) and it also calculates the F statistic for us.

anova(model.2, model.1)
## Analysis of Variance Table
## 
## Model 1: Systol ~ Age + Years + FracLife + Weight
## Model 2: Systol ~ Age + Years + FracLife + Weight + Height + Chin + Forearm + 
##     Calf + Pulse
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     34 2629.7                           
## 2     29 2172.6  5    457.12 1.2204 0.3247
#This is to illustrate how the calculation is done behind the scenes
(2629.71-2172.58)/(34-29) / (2172.58/29)
## [1] 1.220371
pf(1.220371, 5, 29, lower.tail=F) 
## [1] 0.3247213
detach(peru)
rm(list = ls()) #remove everything in the environment