Introduction

In this R guide we will talk about tools for comparing models and methods for building models. Then we will work with the water dataset to create models using forward selection and backward elimination.

Tools for Comparing Models

There are several tools for comparing models. \(R^2\) and adjusted \(R^2\) can be used if necessary, but they are not good comparison tools. Since we are such competent statisticians, this R guide will cover comparison tools that are good to use: the t-test, the partial F-test, AIC, and Mallows’ \(C_p\).

t-test

The t-test is a familiar topic for us. In R, we can use the summary function to see the p-values for our predictors, which are produced by t-tests. Whether or not a term is significant will determine whether we should add or drop the term when we’re building our model. This process of adding or dropping terms is discussed more in the Methods for Building Models section. Note: this method doesn’t work for categorical predictors with more than two categories, because such a predictor is represented by several coefficients at once.
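
As a minimal sketch of reading those p-values programmatically, the block below pulls the `Pr(>|t|)` column out of the coefficient table. It uses the built-in mtcars data as a stand-in (an assumption, not the water data used later), and the 0.05 cutoff is likewise assumed.

```r
# Sketch: read per-term t-test p-values from a fitted lm.
# mtcars ships with R and is used here only for illustration.
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# The coefficient table has one row per term and a "Pr(>|t|)" column.
coef_table <- summary(fit)$coefficients
p_values <- coef_table[, "Pr(>|t|)"]
print(round(p_values, 4))

alpha <- 0.05  # assumed significance level

# Predictors (ignoring the intercept) whose t-test is not significant at alpha.
not_significant <- setdiff(names(p_values)[p_values >= alpha], "(Intercept)")
print(not_significant)
```

Working from the table rather than the printed summary makes it easier to rank many candidate terms at once.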

F-test (partial)

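The partial F-test is very similar to the t-test, but it compares two nested models and can test several coefficients at once. In R, we fit the smaller model and the larger model and pass both to the anova function, which reports the p-value for the terms that were added. Whether or not those terms are significant will determine whether we should add or drop them when we’re building our model. This process of adding or dropping terms is discussed more in the Methods for Building Models section. Note: this method works for all predictors, including categorical predictors with more than two categories; when only one coefficient is tested, it gives the same p-value as the t-test.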
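
A minimal sketch of a partial F-test, again assuming the built-in mtcars data as a stand-in. Here `factor(cyl)` has three categories, so it adds two coefficients, which is exactly the situation a single t-test cannot handle.

```r
# Sketch: partial F-test for a categorical predictor with three categories.
# mtcars is a stand-in dataset; factor(cyl) enters as two dummy variables.
reduced <- lm(mpg ~ wt, data = mtcars)
full <- lm(mpg ~ wt + factor(cyl), data = mtcars)

# anova() on two nested lm fits tests whether the extra terms
# jointly improve on the reduced model.
f_test <- anova(reduced, full)
print(f_test)

# Row 2 holds the comparison of the full model against the reduced one.
p_value <- f_test[2, "Pr(>F)"]
print(p_value)
```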

AIC

AIC stands for Akaike Information Criterion, and we use it to balance model fit against the number of predictors in our model. The AIC can be expressed as \(\text{AIC} = 2(k+1) - 2\log L\), where \(k+1\) is the number of coefficients in our model and \(\log L\) is the log-likelihood of our model. We want the AIC to be as low (or as negative) as possible: a larger log-likelihood, meaning a better fit, lowers the AIC, while every additional coefficient raises it, so an overly complicated model is penalized. Note: the AIC for one model isn’t useful on its own; we will always compare the AICs from two or more models to decide which one is better.

When deciding which model is better, we must consider the AIC and the number of predictors. If the simplest model has the lowest AIC, we will definitely use it. If the model with the lowest AIC is the more complex model, its AIC must be at least 10 units lower than the AIC of the simpler model for us to choose it. Otherwise, if the two AICs are within 10 units, we will use the simpler model.
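
The sketch below recomputes AIC by hand and applies the 10-unit rule. The mtcars data and the helper names `manual_aic` and `choose_by_aic` are assumptions made for illustration. One detail worth knowing: for lm fits, R also counts the error variance as an estimated parameter, so its AIC is 2 units larger than the formula above for every model, which leaves differences between models unchanged.

```r
# Sketch: AIC by hand, and the 10-unit rule of thumb.
# mtcars is a stand-in dataset; both helper functions are hypothetical.
small <- lm(mpg ~ wt, data = mtcars)
large <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# 2 * (number of estimated parameters) - 2 * log-likelihood.
# logLik() reports the parameter count in its "df" attribute; for lm this
# is the number of coefficients plus one for the error variance.
manual_aic <- function(fit) {
  ll <- logLik(fit)
  2 * attr(ll, "df") - 2 * as.numeric(ll)
}

# Prefer the complex model only if its AIC is at least `margin` lower.
choose_by_aic <- function(simple, complex, margin = 10) {
  if (AIC(simple) - AIC(complex) >= margin) "complex" else "simple"
}

print(c(small = manual_aic(small), large = manual_aic(large)))
print(choose_by_aic(small, large))
```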

Mallows’ \(C_p\)

Mallows’ \(C_p\) is the fourth tool used to compare models. Like AIC, a smaller \(C_p\) is better, and for a model with \(k\) predictors we want \(C_p\) to be as close to \(k+1\) as possible. For this calculation we compare the complete model with a model that uses a subset of the predictors, to try to get a model that contains only the significant predictors.
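
A minimal sketch of the calculation, using the standard definition \(C_p = SSE_p / MSE_{full} - n + 2p\), where \(p\) is the number of coefficients in the subset model. The helper name `mallows_cp` and the mtcars data are assumptions for illustration.

```r
# Sketch: Mallows' C_p for a subset model relative to the complete model.
# mallows_cp is a hypothetical helper; mtcars is a stand-in dataset.
mallows_cp <- function(subset_fit, full_fit) {
  n <- nobs(full_fit)
  p <- length(coef(subset_fit))  # coefficients, including the intercept
  sse_subset <- sum(residuals(subset_fit)^2)
  mse_full <- sum(residuals(full_fit)^2) / df.residual(full_fit)
  sse_subset / mse_full - n + 2 * p
}

full <- lm(mpg ~ wt + hp + qsec, data = mtcars)
sub <- lm(mpg ~ wt + hp, data = mtcars)

# Compare this value with 3, the number of coefficients in the subset model.
print(mallows_cp(sub, full))

# The complete model always scores exactly its own number of coefficients.
print(mallows_cp(full, full))
```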

Methods for Building Models

Now we will discuss several methods that can be used to build regression models.

Forward Selection

One method is forward selection. In this stepwise method, we start with the simplest model possible and gradually add more predictors based on their significance. For example, we might start by making SLR models between our response and each possible predictor. Then we see which predictors are significant. We can eliminate any predictors that aren’t significant, and we know that we want to add the predictor that is the most significant. Then we would create MLR models with our response and two predictors: the most significant predictor from our first step, plus each of our remaining predictors in turn. If any of the newly added predictors are significant, we would keep the most significant one and continue this process.

Overall, we gradually add predictors based on their significance. One way to do this is to add the predictor with the smallest p-value, provided that p-value is smaller than our significance level. We stop adding predictors when none of the remaining ones is significant. Alternatively, we can evaluate significance based on AIC. In this case, we add a predictor if the model’s AIC drops by at least 10 units. If the AIC doesn’t drop by 10 or more, we wouldn’t add the predictor. Note that it’s very important to communicate whether you’re using p-values or AIC to build your model.
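
The p-value version of this loop can be sketched with the add1 function, which reports an F-test for each candidate term. Unlike the walkthrough above, this sketch reconsiders every remaining candidate at each step instead of discarding the ones that were not significant on their own. The function name `forward_select`, the 0.05 cutoff, and the mtcars data are assumptions for illustration.

```r
# Sketch: forward selection by p-value using add1().
# forward_select is a hypothetical helper; mtcars is a stand-in dataset.
forward_select <- function(data, response, candidates, alpha = 0.05) {
  # Start from the intercept-only model.
  fit <- lm(reformulate("1", response = response), data = data)
  repeat {
    remaining <- setdiff(candidates, attr(terms(fit), "term.labels"))
    if (length(remaining) == 0) break
    # One F-test per candidate term, each added to the current model.
    tab <- add1(fit, scope = remaining, test = "F")
    tab <- tab[rownames(tab) != "<none>", , drop = FALSE]
    best <- which.min(tab[["Pr(>F)"]])
    # Stop once the best remaining candidate is not significant.
    if (tab[["Pr(>F)"]][best] >= alpha) break
    fit <- update(fit, as.formula(paste(". ~ . +", rownames(tab)[best])))
  }
  fit
}

selected <- forward_select(mtcars, "mpg", c("wt", "hp", "qsec", "drat"))
print(formula(selected))
```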

Backward Elimination

Another stepwise method for building models is backward elimination. In this method, we start with the most complicated model. Then we analyze the significance of the predictors, and if there are any predictors that are not significant, we drop the least significant one. Then we analyze our new model with one less term, see if any predictors are not significant, and drop the least significant one. This process repeats until we have a model where all of the predictors are significant.

Similar to forward selection, we can use p-values or AIC for backward elimination. Again, it’s important to express which method is being used.
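
A minimal sketch of the p-value version, using drop1 to test each term in the current model. The function name `backward_eliminate`, the 0.05 cutoff, and the mtcars data are assumptions for illustration.

```r
# Sketch: backward elimination by p-value using drop1().
# backward_eliminate is a hypothetical helper; mtcars is a stand-in dataset.
backward_eliminate <- function(fit, alpha = 0.05) {
  repeat {
    # Nothing left to drop once only the intercept remains.
    if (length(attr(terms(fit), "term.labels")) == 0) break
    # One F-test per term, each removed from the current model.
    tab <- drop1(fit, test = "F")
    tab <- tab[rownames(tab) != "<none>", , drop = FALSE]
    worst <- which.max(tab[["Pr(>F)"]])
    # Stop once even the least significant term is significant.
    if (tab[["Pr(>F)"]][worst] < alpha) break
    fit <- update(fit, as.formula(paste(". ~ . -", rownames(tab)[worst])))
  }
  fit
}

full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
final <- backward_eliminate(full)
print(formula(final))
```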

All Subsets Regression

A third way to build models is to use all subsets regression. For this method, we fit all possible models for our data and analyze all of the models created. We use the model comparison tools described in the preceding section to see which one is best. If two models are nested, we can use any of the tools (t-test, F-test, AIC, or Mallows’ \(C_p\)), but it’s best to use one of the formal tests (t-test or F-test). For models that are not nested, we can only use AIC or Mallows’ \(C_p\) to compare the models. Note: two models are nested if all of the predictors from one model are included in the other model.
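
For a small number of candidate predictors, every subset can be enumerated directly with combn. The sketch below ranks the resulting models by AIC; the helper name `all_subsets_aic` and the mtcars data are assumptions for illustration, and the approach becomes slow quickly because the number of models doubles with each extra predictor.

```r
# Sketch: fit every non-empty subset of the candidates and rank by AIC.
# all_subsets_aic is a hypothetical helper; mtcars is a stand-in dataset.
all_subsets_aic <- function(data, response, candidates) {
  # All combinations of size 1, 2, ..., length(candidates).
  subsets <- unlist(
    lapply(seq_along(candidates),
           function(m) combn(candidates, m, simplify = FALSE)),
    recursive = FALSE
  )
  fits <- lapply(subsets, function(vars) {
    lm(reformulate(vars, response = response), data = data)
  })
  out <- data.frame(
    model = vapply(subsets, paste, character(1), collapse = " + "),
    n_predictors = lengths(subsets),
    AIC = vapply(fits, AIC, numeric(1))
  )
  out[order(out$AIC), ]  # lowest AIC first
}

ranking <- all_subsets_aic(mtcars, "mpg", c("wt", "hp", "qsec"))
print(ranking)
```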

Building Models

Forward Selection Example

Use the water dataset and forward selection to build the model.

Call and attach the data.

library(alr3)

## Loading required package: car

## Warning: package 'car' was built under R version 3.4.3

data(water)
attach(water)

First, make 6 SLR models. If a predictor doesn’t have a significant relationship with the response in its SLR model, eliminate it.

mod1 <- lm(BSAAM ~ APMAM)
mod2 <- lm(BSAAM ~ APSAB)
mod3 <- lm(BSAAM ~ APSLAKE)
mod4 <- lm(BSAAM ~ OPBPC)
mod5 <- lm(BSAAM ~ OPRC)
mod6 <- lm(BSAAM ~ OPSLAKE)

summary(mod1)

## 
## Call:
## lm(formula = BSAAM ~ APMAM)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37043 -16339  -5457  17158  72467 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    63364       9917   6.389 1.21e-07 ***
## APMAM           1965       1249   1.573    0.123    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25080 on 41 degrees of freedom
## Multiple R-squared:  0.05692,    Adjusted R-squared:  0.03391 
## F-statistic: 2.474 on 1 and 41 DF,  p-value: 0.1234

summary(mod2)

## 
## Call:
## lm(formula = BSAAM ~ APSAB)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41314 -16784  -5101  16492  70942 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    67152       9689   6.931 2.06e-08 ***
## APSAB           2279       1909   1.194    0.239    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25390 on 41 degrees of freedom
## Multiple R-squared:  0.0336, Adjusted R-squared:  0.01003 
## F-statistic: 1.425 on 1 and 41 DF,  p-value: 0.2394

summary(mod3)

## 
## Call:
## lm(formula = BSAAM ~ APSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -46438 -16907  -5661  19028  69464 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    63864       9249   6.905 2.25e-08 ***
## APSLAKE         2818       1709   1.649    0.107    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25010 on 41 degrees of freedom
## Multiple R-squared:  0.06217,    Adjusted R-squared:  0.0393 
## F-statistic: 2.718 on 1 and 41 DF,  p-value: 0.1069

summary(mod4)

## 
## Call:
## lm(formula = BSAAM ~ OPBPC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -21183  -7298   -819   4731  38430 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  40017.4     3589.1   11.15 5.47e-14 ***
## OPBPC         2940.1      240.6   12.22 3.00e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11990 on 41 degrees of freedom
## Multiple R-squared:  0.7845, Adjusted R-squared:  0.7793 
## F-statistic: 149.3 on 1 and 41 DF,  p-value: 2.996e-15

summary(mod5)

## 
## Call:
## lm(formula = BSAAM ~ OPRC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24356  -5514   -522   7448  24854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  21741.4     4044.1   5.376 3.32e-06 ***
## OPRC          4667.3      311.3  14.991  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10150 on 41 degrees of freedom
## Multiple R-squared:  0.8457, Adjusted R-squared:  0.842 
## F-statistic: 224.7 on 1 and 41 DF,  p-value: < 2.2e-16

summary(mod6)

## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17603.8  -5338.0    332.1   3410.6  20875.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  27014.6     3218.9   8.393 1.93e-10 ***
## OPSLAKE       3752.5      215.7  17.394  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8922 on 41 degrees of freedom
## Multiple R-squared:  0.8807, Adjusted R-squared:  0.8778 
## F-statistic: 302.6 on 1 and 41 DF,  p-value: < 2.2e-16

APMAM, APSLAKE, and APSAB are not significant, so eliminate them now. OPSLAKE has the smallest p-value (and the largest test statistic), so we know that we want it in our model because it’s the most significant.

Now make 2 MLR models with OPSLAKE and each of the two remaining possible predictors (OPBPC and OPRC).
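
The first step, one SLR model per candidate, can also be collected into a single vector of p-values instead of reading six summaries. This is a sketch under assumptions: `slr_p_values` is a hypothetical helper, it only handles numeric predictors, and the built-in mtcars data stands in so the block runs without alr3.

```r
# Sketch: p-value of each candidate predictor in its own SLR model.
# slr_p_values is a hypothetical helper; mtcars is a stand-in dataset.
slr_p_values <- function(data, response, candidates) {
  vapply(candidates, function(x) {
    fit <- lm(reformulate(x, response = response), data = data)
    # For a numeric predictor, the coefficient row is named after it.
    summary(fit)$coefficients[x, "Pr(>|t|)"]
  }, numeric(1))
}

p <- slr_p_values(mtcars, "mpg", c("wt", "hp", "qsec", "drat"))
print(sort(p))  # most significant predictor first
```

With alr3 loaded, the same call on the water data would be slr_p_values(water, "BSAAM", c("APMAM", "APSAB", "APSLAKE", "OPBPC", "OPRC", "OPSLAKE")).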

mod4a <- lm(BSAAM ~ OPBPC + OPSLAKE)
mod5a <- lm(BSAAM ~ OPRC + OPSLAKE)

summary(mod4a)

## 
## Call:
## lm(formula = BSAAM ~ OPBPC + OPSLAKE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17591.0  -5276.6    275.6   3380.7  20867.0 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 27050.95    3540.07   7.641 2.44e-09 ***
## OPBPC          14.37     546.41   0.026    0.979    
## OPSLAKE      3736.16     658.24   5.676 1.35e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9033 on 40 degrees of freedom
## Multiple R-squared:  0.8807, Adjusted R-squared:  0.8747 
## F-statistic: 147.6 on 2 and 40 DF,  p-value: < 2.2e-16

summary(mod5a)

## 
## Call:
## lm(formula = BSAAM ~ OPRC + OPSLAKE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15991.2  -6484.6   -498.3   4700.1  19945.8 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  22891.2     3277.8   6.984 1.98e-08 ***
## OPRC          1866.5      638.8   2.922   0.0057 ** 
## OPSLAKE       2400.8      503.3   4.770 2.46e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8201 on 40 degrees of freedom
## Multiple R-squared:  0.9017, Adjusted R-squared:  0.8967 
## F-statistic: 183.4 on 2 and 40 DF,  p-value: < 2.2e-16

OPBPC is not significant, so we can drop it. OPRC has a small p-value and is significant, so we want our model to include this.

End with lm(BSAAM ~ OPRC + OPSLAKE)

Alternative Method for Forward Selection

Use the stepAIC function from the MASS package.

simplemod <- lm(BSAAM ~1)
library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:alr3':
## 
##     forbes

stepAIC(simplemod, direction = "forward", scope = list(upper = BSAAM ~ APMAM + APSAB + OPBPC + OPSLAKE + APSLAKE + OPRC))

## Start:  AIC=873.65
## BSAAM ~ 1
## 
##           Df  Sum of Sq        RSS    AIC
## + OPSLAKE  1 2.4087e+10 3.2640e+09 784.24
## + OPRC     1 2.3131e+10 4.2199e+09 795.28
## + OPBPC    1 2.1458e+10 5.8928e+09 809.64
## + APSLAKE  1 1.7004e+09 2.5651e+10 872.89
## + APMAM    1 1.5567e+09 2.5794e+10 873.13
## <none>                  2.7351e+10 873.65
## + APSAB    1 9.1891e+08 2.6432e+10 874.18
## 
## Step:  AIC=784.24
## BSAAM ~ OPSLAKE
## 
##           Df Sum of Sq        RSS    AIC
## + APSLAKE  1 663368666 2600641788 776.47
## + APSAB    1 661988129 2602022326 776.49
## + OPRC     1 574050696 2689959758 777.92
## + APMAM    1 524283532 2739726922 778.71
## <none>                 3264010454 784.24
## + OPBPC    1     56424 3263954031 786.24
## 
## Step:  AIC=776.47
## BSAAM ~ OPSLAKE + APSLAKE
## 
##         Df Sum of Sq        RSS    AIC
## + OPRC   1 531694203 2068947585 768.63
## <none>               2600641788 776.47
## + APSAB  1  33349091 2567292697 777.91
## + APMAM  1  11041158 2589600630 778.28
## + OPBPC  1    122447 2600519341 778.46
## 
## Step:  AIC=768.63
## BSAAM ~ OPSLAKE + APSLAKE + OPRC
## 
##         Df Sum of Sq        RSS    AIC
## <none>               2068947585 768.63
## + APSAB  1  11814207 2057133378 770.39
## + APMAM  1   1410311 2067537274 770.60
## + OPBPC  1    583748 2068363837 770.62

## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + APSLAKE + OPRC)
## 
## Coefficients:
## (Intercept)      OPSLAKE      APSLAKE         OPRC  
##       15425         2390         1712         1797

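This gives us our final model using forward selection with AIC. Note that stepAIC adds a term whenever doing so lowers the AIC at all, so it keeps APSLAKE, which the p-value approach eliminated in its first step.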

Model: lm(formula = BSAAM ~ OPSLAKE + APSLAKE + OPRC)

Backward Elimination

Start with the biggest possible model.

modB <- lm(BSAAM ~ APMAM + APSAB + OPBPC + OPSLAKE + APSLAKE + OPRC)
summary(modB)

## 
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + OPBPC + OPSLAKE + APSLAKE + 
##     OPRC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12690  -4936  -1424   4173  18542 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15944.67    4099.80   3.889 0.000416 ***
## APMAM         -12.77     708.89  -0.018 0.985725    
## APSAB        -664.41    1522.89  -0.436 0.665237    
## OPBPC          69.70     461.69   0.151 0.880839    
## OPSLAKE      2211.58     752.69   2.938 0.005729 ** 
## APSLAKE      2270.68    1341.29   1.693 0.099112 .  
## OPRC         1916.45     641.36   2.988 0.005031 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7557 on 36 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9123 
## F-statistic: 73.82 on 6 and 36 DF,  p-value: < 2.2e-16

Delete APMAM (least significant)

modB1 <- lm(BSAAM ~ APSAB + OPBPC + OPSLAKE + APSLAKE + OPRC)
summary(modB1)

## 
## Call:
## lm(formula = BSAAM ~ APSAB + OPBPC + OPSLAKE + APSLAKE + OPRC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12696  -4933  -1396   4187  18550 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15930.84    3972.50   4.010 0.000283 ***
## APSAB        -673.42    1418.96  -0.475 0.637873    
## OPBPC          68.94     453.50   0.152 0.879996    
## OPSLAKE      2212.62     740.28   2.989 0.004952 ** 
## APSLAKE      2263.86    1269.35   1.783 0.082714 .  
## OPRC         1915.75     631.46   3.034 0.004399 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7454 on 37 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9147 
## F-statistic: 91.05 on 5 and 37 DF,  p-value: < 2.2e-16

Delete OPBPC (least significant)

modB2 <- lm(BSAAM ~ APSAB + OPSLAKE + APSLAKE + OPRC)
summary(modB2)

## 
## Call:
## lm(formula = BSAAM ~ APSAB + OPSLAKE + APSLAKE + OPRC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12750  -5095  -1494   4245  18594 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15749.8     3740.8   4.210 0.000151 ***
## APSAB         -650.6     1392.8  -0.467 0.643055    
## OPSLAKE       2295.4      494.8   4.639 4.07e-05 ***
## APSLAKE       2244.9     1246.9   1.800 0.079735 .  
## OPRC          1910.2      622.3   3.070 0.003942 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7358 on 38 degrees of freedom
## Multiple R-squared:  0.9248, Adjusted R-squared:  0.9169 
## F-statistic: 116.8 on 4 and 38 DF,  p-value: < 2.2e-16

Delete APSAB (least significant)

modB3 <- lm(BSAAM ~ OPSLAKE + APSLAKE + OPRC)
summary(modB3)

## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + APSLAKE + OPRC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -12964  -5140  -1252   4446  18649 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  15424.6     3638.4   4.239 0.000133 ***
## OPSLAKE       2389.8      447.1   5.346 4.19e-06 ***
## APSLAKE       1712.5      500.5   3.421 0.001475 ** 
## OPRC          1797.5      567.8   3.166 0.002998 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared:  0.9244, Adjusted R-squared:  0.9185 
## F-statistic: 158.9 on 3 and 39 DF,  p-value: < 2.2e-16

All are significant now! Don’t delete more.

End with modB3 <- lm(BSAAM ~ OPRC + OPSLAKE + APSLAKE).

Note: Overall, backward elimination takes less typing, but it will generally give a bigger model in the end.

Alternative Function for Backward Elimination

mod <- lm(BSAAM ~ APMAM + APSAB + OPBPC + OPSLAKE + APSLAKE + OPRC)
library(MASS)
stepAIC(mod)

## Start:  AIC=774.36
## BSAAM ~ APMAM + APSAB + OPBPC + OPSLAKE + APSLAKE + OPRC
## 
##           Df Sum of Sq        RSS    AIC
## - APMAM    1     18537 2055849271 772.36
## - OPBPC    1   1301629 2057132362 772.39
## - APSAB    1  10869771 2066700504 772.58
## <none>                 2055830733 774.36
## - APSLAKE  1 163662571 2219493304 775.65
## - OPSLAKE  1 493012936 2548843669 781.60
## - OPRC     1 509894399 2565725132 781.89
## 
## Step:  AIC=772.36
## BSAAM ~ APSAB + OPBPC + OPSLAKE + APSLAKE + OPRC
## 
##           Df Sum of Sq        RSS    AIC
## - OPBPC    1   1284108 2057133378 770.39
## - APSAB    1  12514566 2068363837 770.62
## <none>                 2055849271 772.36
## - APSLAKE  1 176735690 2232584961 773.90
## - OPSLAKE  1 496370866 2552220136 779.66
## - OPRC     1 511413723 2567262994 779.91
## 
## Step:  AIC=770.39
## BSAAM ~ APSAB + OPSLAKE + APSLAKE + OPRC
## 
##           Df  Sum of Sq        RSS    AIC
## - APSAB    1   11814207 2068947585 768.63
## <none>                  2057133378 770.39
## - APSLAKE  1  175480984 2232614362 771.91
## - OPRC     1  510159318 2567292697 777.91
## - OPSLAKE  1 1165227857 3222361235 787.68
## 
## Step:  AIC=768.63
## BSAAM ~ OPSLAKE + APSLAKE + OPRC
## 
##           Df  Sum of Sq        RSS    AIC
## <none>                  2068947585 768.63
## - OPRC     1  531694203 2600641788 776.47
## - APSLAKE  1  621012173 2689959758 777.92
## - OPSLAKE  1 1515918540 3584866125 790.27

## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + APSLAKE + OPRC)
## 
## Coefficients:
## (Intercept)      OPSLAKE      APSLAKE         OPRC  
##       15425         2390         1712         1797

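At each step, the output lists the AIC that would result from dropping each term, and stepAIC drops the term whose removal gives the lowest AIC. It stops when <none> is at the top of the list, meaning that dropping any remaining term would raise the AIC.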

Our final model is lm(formula = BSAAM ~ OPSLAKE + APSLAKE + OPRC)
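
One thing that can be confusing: the AIC values printed by stepAIC come from extractAIC, which for lm fits is \(n \log(RSS/n)\) plus twice the number of coefficients, so they do not match what the AIC function returns. The sketch below, which assumes the built-in mtcars data as a stand-in, checks that the two differ only by a constant that depends on \(n\), so model comparisons on the same data come out the same either way.

```r
# Sketch: relate the AIC scale printed by stepAIC to AIC().
# mtcars is a stand-in dataset.
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nobs(fit)
rss <- sum(residuals(fit)^2)
n_coef <- length(coef(fit))

# The value stepAIC prints for an lm fit.
step_scale_aic <- n * log(rss / n) + 2 * n_coef
print(c(by_hand = step_scale_aic, extractAIC = extractAIC(fit)[2]))

# The gap to AIC() depends only on n, not on which predictors are used.
gap <- AIC(fit) - extractAIC(fit)[2]
print(c(gap = gap, expected = n + n * log(2 * pi) + 2))
```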

Extra Example

Now use the OK Cupid data.

data <- read.csv("http://cknudson.com/data/OKCupid.csv")
attach(data)
names(data)

## [1] "Sex"             "Height"          "IdealMateHeight" "Age"

Forward Selection

mod1 <- lm(IdealMateHeight ~ Sex)
mod2 <- lm(IdealMateHeight ~ Height)
mod3 <- lm(IdealMateHeight ~ Age)
summary(mod1)

## 
## Call:
## lm(formula = IdealMateHeight ~ Sex)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.147 -1.697  0.303  1.303  6.303 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  71.1471     0.3164  224.88   <2e-16 ***
## SexM         -5.4501     0.4508  -12.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.609 on 132 degrees of freedom
## Multiple R-squared:  0.5255, Adjusted R-squared:  0.5219 
## F-statistic: 146.2 on 1 and 132 DF,  p-value: < 2.2e-16

summary(mod2)

## 
## Call:
## lm(formula = IdealMateHeight ~ Height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6262 -2.6511 -0.1796  2.7022  8.9938 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 82.47144    4.93571  16.709  < 2e-16 ***
## Height      -0.20665    0.07266  -2.844  0.00516 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.676 on 132 degrees of freedom
## Multiple R-squared:  0.05774,    Adjusted R-squared:  0.0506 
## F-statistic: 8.089 on 1 and 132 DF,  p-value: 0.005163

summary(mod3)

## 
## Call:
## lm(formula = IdealMateHeight ~ Age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4880 -2.4797 -0.4506  3.5203  8.5286 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 68.213927   2.623250  26.004   <2e-16 ***
## Age          0.008304   0.086889   0.096    0.924    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.787 on 132 degrees of freedom
## Multiple R-squared:  6.92e-05,   Adjusted R-squared:  -0.007506 
## F-statistic: 0.009135 on 1 and 132 DF,  p-value: 0.924

Sex is the most significant, so try adding each of the other predictors to Sex.

mod2a <- lm(IdealMateHeight ~ Height + Sex)
mod3a <- lm(IdealMateHeight ~ Age + Sex)
summary(mod2a)

## 
## Call:
## lm(formula = IdealMateHeight ~ Height + Sex)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1770 -1.2642  0.2219  1.2602  5.3020 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.2456     3.9171  10.019  < 2e-16 ***
## Height        0.4930     0.0604   8.162 2.38e-13 ***
## SexM         -8.5383     0.5281 -16.168  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.132 on 131 degrees of freedom
## Multiple R-squared:  0.6854, Adjusted R-squared:  0.6806 
## F-statistic: 142.7 on 2 and 131 DF,  p-value: < 2.2e-16

summary(mod3a)

## 
## Call:
## lm(formula = IdealMateHeight ~ Age + Sex)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.1488 -1.6945  0.3026  1.3046  6.3005 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 71.1599812  1.8304576  38.876   <2e-16 ***
## Age         -0.0004307  0.0600899  -0.007    0.994    
## SexM        -5.4501283  0.4525571 -12.043   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.619 on 131 degrees of freedom
## Multiple R-squared:  0.5255, Adjusted R-squared:  0.5182 
## F-statistic: 72.53 on 2 and 131 DF,  p-value: < 2.2e-16

Height is more significant. Try adding Age to Sex and Height.

mod2ab <- lm(IdealMateHeight ~ Height + Sex + Age)
summary(mod2ab)

## 
## Call:
## lm(formula = IdealMateHeight ~ Height + Sex + Age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1639 -1.2728  0.2494  1.3221  5.2108 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.65668    4.14920   9.558  < 2e-16 ***
## Height       0.49372    0.06066   8.140  2.8e-13 ***
## SexM        -8.54402    0.53027 -16.113  < 2e-16 ***
## Age         -0.01520    0.04913  -0.309    0.758    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.14 on 130 degrees of freedom
## Multiple R-squared:  0.6857, Adjusted R-squared:  0.6784 
## F-statistic: 94.52 on 3 and 130 DF,  p-value: < 2.2e-16

Age is not significant, so don’t add it. Only Height and Sex are important. Final Model: lm(IdealMateHeight ~ Height + Sex)

Backward Elimination

modF <- lm(IdealMateHeight ~ Height + Sex + Age)
summary(modF)

## 
## Call:
## lm(formula = IdealMateHeight ~ Height + Sex + Age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1639 -1.2728  0.2494  1.3221  5.2108 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.65668    4.14920   9.558  < 2e-16 ***
## Height       0.49372    0.06066   8.140  2.8e-13 ***
## SexM        -8.54402    0.53027 -16.113  < 2e-16 ***
## Age         -0.01520    0.04913  -0.309    0.758    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.14 on 130 degrees of freedom
## Multiple R-squared:  0.6857, Adjusted R-squared:  0.6784 
## F-statistic: 94.52 on 3 and 130 DF,  p-value: < 2.2e-16

Age isn’t significant. Drop it.

modF1 <- lm(IdealMateHeight ~ Height + Sex)
summary(modF1)

## 
## Call:
## lm(formula = IdealMateHeight ~ Height + Sex)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.1770 -1.2642  0.2219  1.2602  5.3020 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  39.2456     3.9171  10.019  < 2e-16 ***
## Height        0.4930     0.0604   8.162 2.38e-13 ***
## SexM         -8.5383     0.5281 -16.168  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.132 on 131 degrees of freedom
## Multiple R-squared:  0.6854, Adjusted R-squared:  0.6806 
## F-statistic: 142.7 on 2 and 131 DF,  p-value: < 2.2e-16

All are now significant. Keep only sex and height. Final Model: lm(IdealMateHeight ~ Height + Sex)

3/13 Model Building

Emily Eberspacher

3/13/2018
