In this R guide we will talk about tools to compare models and methods to build models. Then we will work with the water dataset to create models using forward selection and backward elimination.
There are several tools for comparing models. \(R^2\) and adjusted \(R^2\) can be used if necessary, but these are not good comparison tools; \(R^2\) in particular never decreases when a predictor is added, so it always favors the bigger model. Since we are such competent statisticians, this R guide will cover comparison tools that are good to use, which include the t-test, the partial F-test, AIC, and Mallows’ \(C_p\).
The t-test is a familiar topic for us. In R, we can use the summary function to see the p-values for our predictors, which are produced by a t-test. Whether or not a term is significant will determine whether we should add or drop the term when we’re building our model. This process of adding or dropping terms is discussed more in the Methods for Building Models section. Note: this method doesn’t work for categorical predictors with more than two categories, because summary reports a separate t-test for each dummy variable rather than one test for the predictor as a whole.
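As a minimal sketch of reading the t-test p-values programmatically, the block below pulls them out of the summary output. It assumes the built-in mtcars dataset as a stand-in (it is not the water data), and the object names are only illustrative.

```r
# Stand-in data: the built-in mtcars dataset (an assumption, not the water data).
mod <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# summary() stores the coefficient table as a matrix; the column
# "Pr(>|t|)" holds the t-test p-values.
coef_table <- summary(mod)$coefficients
p_values <- coef_table[, "Pr(>|t|)"]

# Drop the intercept and list the predictors from most to least significant.
sort(p_values[-1])
```

Sorting the p-values this way makes it easy to spot the candidate to add or drop first.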
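The partial F-test is very similar to the t-test, but it compares two nested models, so it can test several coefficients at once. Instead of the summary function, we use the anova function with the smaller model and the larger model as its two arguments; the p-value in the output is produced by the partial F-test. Whether or not the added terms are significant will determine whether we should add or drop them when we’re building our model. This process of adding or dropping terms is discussed more in the Methods for Building Models section. Note: this method works for any pair of nested models, including models with categorical predictors that have more than two categories.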
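A partial F-test in R might look like the sketch below. It assumes the built-in mtcars dataset as a stand-in, with cyl converted to a three-category factor to mimic a categorical predictor; none of these names come from the water data.

```r
# Stand-in data: built-in mtcars, with cyl (4, 6, or 8 cylinders) treated
# as a categorical predictor with three categories.
cars <- transform(mtcars, cyl = factor(cyl))

reduced <- lm(mpg ~ wt, data = cars)        # model without the term
full    <- lm(mpg ~ wt + cyl, data = cars)  # model with the term

# anova() on two nested models runs the partial F-test for the extra terms.
f_test <- anova(reduced, full)
f_test

# p-value for adding cyl (both of its dummy variables at once)
f_test[2, "Pr(>F)"]
```

Because cyl has three categories, the test has 2 numerator degrees of freedom, which is exactly the situation a single t-test cannot handle.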
AIC stands for Akaike Information Criterion, and we use it to balance our model fit with the number of predictors in our model. The AIC can be expressed in the following equation: \(AIC = 2(k+1) - 2\log L\), where \(\log L\) is the log-likelihood of our model and (k+1) is the number of coefficients in our model. We want the AIC to be as low/negative as possible because a low AIC means a high log-likelihood (a good fit) without more predictors than necessary. The \(2(k+1)\) term penalizes each added coefficient, which keeps us from making our model overly complicated. Note: the AIC for one model isn’t useful; we will always compare the AICs from two or more models fit to the same data to decide which one is better.
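The formula can be checked by hand, as in the sketch below (again assuming the built-in mtcars data as a stand-in). One caveat: R's AIC function also counts the error variance as a parameter, so its penalty is 2(k+2) rather than 2(k+1). The extra 2 is the same for every linear model, so comparisons between models are unaffected.

```r
# Stand-in data: built-in mtcars (an assumption, not the water data).
mod <- lm(mpg ~ wt + hp, data = mtcars)

ll <- logLik(mod)        # log-likelihood of the fitted model
n_par <- attr(ll, "df")  # parameters R counts: the coefficients plus the error variance

# AIC = 2 * (number of parameters) - 2 * (log-likelihood)
aic_by_hand <- 2 * n_par - 2 * as.numeric(ll)
c(by_hand = aic_by_hand, built_in = AIC(mod))
```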
When deciding which model is better, we must consider the AIC and the number of predictors. If the simplest model has the lowest AIC, we will definitely use it. If the model with the lowest AIC is the more complex model, its AIC must be at least 10 units lower than the AIC of the simpler model for us to choose it. Otherwise, if the two AICs are within 10 units, we will use the simpler model.
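The decision rule above can be written as a small function. This is only a sketch: `choose_by_aic` is a hypothetical helper name, the 10-unit threshold is the rule of thumb from this guide, and mtcars is a stand-in dataset.

```r
# Hypothetical helper: choose between a simpler and a more complex model
# using the 10-unit rule described above.
choose_by_aic <- function(simple, complex, threshold = 10) {
  # Keep the complex model only if its AIC is at least `threshold` lower.
  if (AIC(simple) - AIC(complex) >= threshold) complex else simple
}

simple  <- lm(mpg ~ wt, data = mtcars)
complex <- lm(mpg ~ wt + qsec + drat, data = mtcars)

chosen <- choose_by_aic(simple, complex)
formula(chosen)
```

When the simpler model already has the lower AIC, the difference is negative and the function returns the simpler model, which matches the first case of the rule.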
Mallows’ \(C_p\) is the fourth tool used to compare models. Like AIC, a smaller \(C_p\) is better; we also want our \(C_p\) to be as close to (k+1) as possible, where k is the number of predictors in the subset model. For this calculation we compare the complete model with a model containing a subset of the predictors, to try to get a model that contains only the significant predictors.
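A sketch of the calculation, assuming the common definition \(C_p = RSS_{subset} / MSE_{full} - n + 2(k+1)\), where \(MSE_{full}\) is the residual mean square of the complete model. The helper name `mallows_cp` is hypothetical, and mtcars is a stand-in dataset.

```r
# Hypothetical helper: Mallows' C_p for a subset model relative to the full model.
mallows_cp <- function(subset_mod, full_mod) {
  n <- nrow(model.frame(full_mod))
  mse_full <- sum(resid(full_mod)^2) / df.residual(full_mod)
  rss_sub <- sum(resid(subset_mod)^2)
  n_coef <- length(coef(subset_mod))  # this is k + 1
  rss_sub / mse_full - n + 2 * n_coef
}

full_mod   <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
subset_mod <- lm(mpg ~ wt + hp, data = mtcars)

mallows_cp(subset_mod, full_mod)  # compare this with k + 1 = 3
mallows_cp(full_mod, full_mod)    # the full model always gives exactly its own k + 1
```

Since the complete model always scores exactly k+1, the comparison is only informative for the subset models.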
Now we will discuss the possible methods for building regression models. There are several methods that can be used.
One method is forward selection. In this stepwise method, we start with the simplest model possible and gradually add more predictors based on their significance. For example, we might start by making SLR models between our response and each possible predictor. Then we see which predictors are significant. We can eliminate any predictors that aren’t significant, and we know that we want to add the predictor that is the most significant. Then we would create MLR models with our response and two predictors: the most significant predictor from our first step, plus each of our remaining predictors in turn. If any of the newly added predictors are significant, we would keep the most significant one and continue this process.
Overall, we gradually add predictors based on their significance. One way to do this is to add the predictor with the smallest p-value if the p-value is smaller than our significance level. We stop adding predictors when they are no longer significant. Alternatively, we can evaluate significance based on AIC. In this case, we add a predictor if the model’s AIC drops by at least 10 units. If the AIC doesn’t drop by 10 or more, we wouldn’t add the predictor. Note that it’s very important to communicate whether you’re using p-values or AIC to build your model.
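One forward-selection step based on p-values can be sketched with the add1 function from base R. The example assumes the built-in mtcars data as a stand-in and a 0.05 significance level; both are illustrative choices rather than part of the water analysis.

```r
# Stand-in data: built-in mtcars (an assumption, not the water data).
current <- lm(mpg ~ 1, data = mtcars)  # simplest model: intercept only
candidates <- ~ wt + hp + qsec + drat  # predictors we are willing to add

# add1() fits each one-predictor extension; test = "F" adds p-values.
step1 <- add1(current, scope = candidates, test = "F")
step1

# Pick the candidate with the smallest p-value (row 1 is "<none>").
p_vals <- step1[-1, "Pr(>F)"]
best <- rownames(step1)[-1][which.min(p_vals)]

# Add it only if it is significant at the chosen level.
if (min(p_vals) < 0.05) {
  current <- update(current, as.formula(paste(". ~ . +", best)))
}
formula(current)
```

Repeating the add1 call on the updated model gives the next step, and the process stops once no candidate is significant.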
Another stepwise method for building models is backward elimination. In this method, we start with the most complicated model. Then we analyze the significance of the predictors, and if there are any predictors that are not significant, we drop the least significant one. Then we analyze our new model with one fewer term, see if any predictors are insignificant, and drop the least significant one. This process repeats until we have a model where all of the predictors are significant.
Similar to forward selection, we can use p-values or AIC for backward elimination. Again, it’s important to express which method is being used.
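Backward elimination by p-values can be sketched as a loop around the drop1 function from base R. The example assumes mtcars as a stand-in dataset and a 0.05 significance level. With `test = "F"`, drop1 runs a partial F-test for each term, which for a single numeric predictor gives the same p-value as the t-test.

```r
# Stand-in data: built-in mtcars (an assumption, not the water data).
current <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)  # most complicated model
alpha <- 0.05

repeat {
  # Stop if there are no predictors left to drop.
  if (length(attr(terms(current), "term.labels")) == 0) break

  tab <- drop1(current, test = "F")   # row 1 is "<none>"
  p_vals <- tab[-1, "Pr(>F)"]
  if (max(p_vals) <= alpha) break     # every predictor is significant

  # Drop the least significant predictor and refit.
  worst <- rownames(tab)[-1][which.max(p_vals)]
  current <- update(current, as.formula(paste(". ~ . -", worst)))
}
formula(current)
```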
A third way to build models is to use all subsets regression. For this method, we fit all possible models for our data and analyze all of the models created. We use the model comparison tools described in the preceding section to see which one is best. If two models are nested, we can use the t-test, F-test, AIC, or Mallows’ \(C_p\), but it’s best to use one of the formal tests (t-test or F-test). For models that are not nested, we can only use AIC or Mallows’ \(C_p\) to compare the models. Note: two models are nested if all of the predictors from one model are included in the other model.
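All subsets regression can be sketched in base R by enumerating the predictor combinations with combn and comparing the fits by AIC. The example assumes mtcars as a stand-in dataset with three candidate predictors; with p candidates there are \(2^p - 1\) non-empty subsets, so this brute-force approach is only practical for a small p.

```r
# Stand-in data: built-in mtcars (an assumption, not the water data).
predictors <- c("wt", "hp", "qsec")

# Build every non-empty subset of the predictors.
subsets <- unlist(
  lapply(seq_along(predictors),
         function(m) combn(predictors, m, simplify = FALSE)),
  recursive = FALSE
)

# Fit each model and record its AIC.
fits <- lapply(subsets, function(vars) {
  lm(reformulate(vars, response = "mpg"), data = mtcars)
})
results <- data.frame(
  model = sapply(subsets, paste, collapse = " + "),
  n_predictors = lengths(subsets),
  AIC = sapply(fits, AIC)
)

# Models sorted from lowest to highest AIC.
results[order(results$AIC), ]
```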
Use the water dataset and forward selection to build the model.
Call and attach the data.
library(alr3)
## Loading required package: car
## Warning: package 'car' was built under R version 3.4.3
data(water)
attach(water)
First, make 6 SLR models. If a predictor doesn’t have a significant relationship with the response in SLR, eliminate it.
mod1 <- lm(BSAAM ~ APMAM)
mod2 <- lm(BSAAM ~ APSAB)
mod3 <- lm(BSAAM ~ APSLAKE)
mod4 <- lm(BSAAM ~ OPBPC)
mod5 <- lm(BSAAM ~ OPRC)
mod6 <- lm(BSAAM ~ OPSLAKE)
summary(mod1)
##
## Call:
## lm(formula = BSAAM ~ APMAM)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37043 -16339 -5457 17158 72467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63364 9917 6.389 1.21e-07 ***
## APMAM 1965 1249 1.573 0.123
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25080 on 41 degrees of freedom
## Multiple R-squared: 0.05692, Adjusted R-squared: 0.03391
## F-statistic: 2.474 on 1 and 41 DF, p-value: 0.1234
summary(mod2)
##
## Call:
## lm(formula = BSAAM ~ APSAB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41314 -16784 -5101 16492 70942
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67152 9689 6.931 2.06e-08 ***
## APSAB 2279 1909 1.194 0.239
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25390 on 41 degrees of freedom
## Multiple R-squared: 0.0336, Adjusted R-squared: 0.01003
## F-statistic: 1.425 on 1 and 41 DF, p-value: 0.2394
summary(mod3)
##
## Call:
## lm(formula = BSAAM ~ APSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46438 -16907 -5661 19028 69464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63864 9249 6.905 2.25e-08 ***
## APSLAKE 2818 1709 1.649 0.107
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25010 on 41 degrees of freedom
## Multiple R-squared: 0.06217, Adjusted R-squared: 0.0393
## F-statistic: 2.718 on 1 and 41 DF, p-value: 0.1069
summary(mod4)
##
## Call:
## lm(formula = BSAAM ~ OPBPC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21183 -7298 -819 4731 38430
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40017.4 3589.1 11.15 5.47e-14 ***
## OPBPC 2940.1 240.6 12.22 3.00e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11990 on 41 degrees of freedom
## Multiple R-squared: 0.7845, Adjusted R-squared: 0.7793
## F-statistic: 149.3 on 1 and 41 DF, p-value: 2.996e-15
summary(mod5)
##
## Call:
## lm(formula = BSAAM ~ OPRC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24356 -5514 -522 7448 24854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21741.4 4044.1 5.376 3.32e-06 ***
## OPRC 4667.3 311.3 14.991 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10150 on 41 degrees of freedom
## Multiple R-squared: 0.8457, Adjusted R-squared: 0.842
## F-statistic: 224.7 on 1 and 41 DF, p-value: < 2.2e-16
summary(mod6)
##
## Call:
## lm(formula = BSAAM ~ OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17603.8 -5338.0 332.1 3410.6 20875.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27014.6 3218.9 8.393 1.93e-10 ***
## OPSLAKE 3752.5 215.7 17.394 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8922 on 41 degrees of freedom
## Multiple R-squared: 0.8807, Adjusted R-squared: 0.8778
## F-statistic: 302.6 on 1 and 41 DF, p-value: < 2.2e-16
APMAM, APSLAKE, and APSAB are insignificant, so eliminate them now. OPSLAKE has the smallest p-value (and the largest test statistic), so we know that we want it in our model because it’s the most significant.
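The six separate fits above can also be produced in a loop. The sketch below uses a hypothetical helper, `slr_pvalues`, and demonstrates it on the built-in mtcars data so that it runs without the alr3 package.

```r
# Hypothetical helper: fit one SLR per predictor and return the slope p-values.
slr_pvalues <- function(response, predictors, data) {
  sapply(predictors, function(p) {
    fit <- lm(reformulate(p, response = response), data = data)
    summary(fit)$coefficients[2, "Pr(>|t|)"]  # row 2 is the slope
  })
}

# Demonstrated on built-in mtcars as a stand-in dataset.
p_slr <- slr_pvalues("mpg", c("wt", "hp", "qsec", "drat"), mtcars)
sort(p_slr)
```

With the water data loaded as above, the equivalent call would be `slr_pvalues("BSAAM", c("APMAM", "APSAB", "APSLAKE", "OPBPC", "OPRC", "OPSLAKE"), water)`.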
Now make 2 MLR models with OPSLAKE and each of the remaining two possible predictors (OPBPC and OPRC).
mod4a <- lm(BSAAM ~ OPBPC + OPSLAKE)
mod5a <- lm(BSAAM ~ OPRC + OPSLAKE)
summary(mod4a)
##
## Call:
## lm(formula = BSAAM ~ OPBPC + OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17591.0 -5276.6 275.6 3380.7 20867.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27050.95 3540.07 7.641 2.44e-09 ***
## OPBPC 14.37 546.41 0.026 0.979
## OPSLAKE 3736.16 658.24 5.676 1.35e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9033 on 40 degrees of freedom
## Multiple R-squared: 0.8807, Adjusted R-squared: 0.8747
## F-statistic: 147.6 on 2 and 40 DF, p-value: < 2.2e-16
summary(mod5a)
##
## Call:
## lm(formula = BSAAM ~ OPRC + OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15991.2 -6484.6 -498.3 4700.1 19945.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22891.2 3277.8 6.984 1.98e-08 ***
## OPRC 1866.5 638.8 2.922 0.0057 **
## OPSLAKE 2400.8 503.3 4.770 2.46e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8201 on 40 degrees of freedom
## Multiple R-squared: 0.9017, Adjusted R-squared: 0.8967
## F-statistic: 183.4 on 2 and 40 DF, p-value: < 2.2e-16
OPBPC is not significant, so we can drop it. OPRC has a small p-value and is significant, so we want our model to include this.
End with lm(BSAAM ~ OPRC + OPSLAKE)
Alternatively, use the stepAIC command from the MASS package to run forward selection based on AIC.
simplemod <- lm(BSAAM ~1)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:alr3':
##
## forbes
stepAIC(simplemod, direction = "forward", scope = list(upper = BSAAM ~ APMAM + APSAB + OPBPC + OPSLAKE + APSLAKE + OPRC))
## Start: AIC=873.65
## BSAAM ~ 1
##
## Df Sum of Sq RSS AIC
## + OPSLAKE 1 2.4087e+10 3.2640e+09 784.24
## + OPRC 1 2.3131e+10 4.2199e+09 795.28
## + OPBPC 1 2.1458e+10 5.8928e+09 809.64
## + APSLAKE 1 1.7004e+09 2.5651e+10 872.89
## + APMAM 1 1.5567e+09 2.5794e+10 873.13
## <none> 2.7351e+10 873.65
## + APSAB 1 9.1891e+08 2.6432e+10 874.18
##
## Step: AIC=784.24
## BSAAM ~ OPSLAKE
##
## Df Sum of Sq RSS AIC
## + APSLAKE 1 663368666 2600641788 776.47
## + APSAB 1 661988129 2602022326 776.49
## + OPRC 1 574050696 2689959758 777.92
## + APMAM 1 524283532 2739726922 778.71
## <none> 3264010454 784.24
## + OPBPC 1 56424 3263954031 786.24
##
## Step: AIC=776.47
## BSAAM ~ OPSLAKE + APSLAKE
##
## Df Sum of Sq RSS AIC
## + OPRC 1 531694203 2068947585 768.63
## <none> 2600641788 776.47
## + APSAB 1 33349091 2567292697 777.91
## + APMAM 1 11041158 2589600630 778.28
## + OPBPC 1 122447 2600519341 778.46
##
## Step: AIC=768.63
## BSAAM ~ OPSLAKE + APSLAKE + OPRC
##
## Df Sum of Sq RSS AIC
## <none> 2068947585 768.63
## + APSAB 1 11814207 2057133378 770.39
## + APMAM 1 1410311 2067537274 770.60
## + OPBPC 1 583748 2068363837 770.62
##
## Call:
## lm(formula = BSAAM ~ OPSLAKE + APSLAKE + OPRC)
##
## Coefficients:
## (Intercept) OPSLAKE APSLAKE OPRC
## 15425 2390 1712 1797
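This gives us our final model using forward selection with AIC. Note that stepAIC adds a predictor whenever doing so lowers the AIC, not only when the AIC drops by at least 10 units, and it reconsiders every remaining predictor at each step. As a result it adds APSLAKE, which the p-value approach above eliminated after the SLR step.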
Model: lm(formula = BSAAM ~ OPSLAKE + APSLAKE + OPRC)
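The AIC values printed by stepAIC will not match what the AIC function returns for the same model. For linear models, stepAIC relies on extractAIC, which computes \(n \log(RSS/n) + 2(k+1)\) and omits constants that are identical for every model fit to the same data. The sketch below checks this on the built-in mtcars data, used here as a stand-in.

```r
# Stand-in data: built-in mtcars (an assumption, not the water data).
mod <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)
rss <- sum(resid(mod)^2)
n_coef <- length(coef(mod))  # this is k + 1

# The version stepAIC() prints: n * log(RSS / n) + 2 * (number of coefficients)
step_aic <- n * log(rss / n) + 2 * n_coef

# extractAIC() returns c(edf, AIC); the second element is the criterion.
c(by_hand = step_aic, extractAIC = extractAIC(mod)[2], AIC = AIC(mod))
```

Because the two versions differ only by a constant, they rank the models identically, and differences in AIC between models are the same under either version.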
Now use backward elimination with the water dataset. Start with the biggest possible model.
modB <- lm(BSAAM ~ APMAM + APSAB + OPBPC + OPSLAKE + APSLAKE + OPRC)
summary(modB)
##
## Call:
## lm(formula = BSAAM ~ APMAM + APSAB + OPBPC + OPSLAKE + APSLAKE +
## OPRC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12690 -4936 -1424 4173 18542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15944.67 4099.80 3.889 0.000416 ***
## APMAM -12.77 708.89 -0.018 0.985725
## APSAB -664.41 1522.89 -0.436 0.665237
## OPBPC 69.70 461.69 0.151 0.880839
## OPSLAKE 2211.58 752.69 2.938 0.005729 **
## APSLAKE 2270.68 1341.29 1.693 0.099112 .
## OPRC 1916.45 641.36 2.988 0.005031 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7557 on 36 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9123
## F-statistic: 73.82 on 6 and 36 DF, p-value: < 2.2e-16
Delete APMAM (least significant)
modB1 <- lm(BSAAM ~ APSAB + OPBPC + OPSLAKE + APSLAKE + OPRC)
summary(modB1)
##
## Call:
## lm(formula = BSAAM ~ APSAB + OPBPC + OPSLAKE + APSLAKE + OPRC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12696 -4933 -1396 4187 18550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15930.84 3972.50 4.010 0.000283 ***
## APSAB -673.42 1418.96 -0.475 0.637873
## OPBPC 68.94 453.50 0.152 0.879996
## OPSLAKE 2212.62 740.28 2.989 0.004952 **
## APSLAKE 2263.86 1269.35 1.783 0.082714 .
## OPRC 1915.75 631.46 3.034 0.004399 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7454 on 37 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9147
## F-statistic: 91.05 on 5 and 37 DF, p-value: < 2.2e-16
Delete OPBPC (least significant)
modB2 <- lm(BSAAM ~ APSAB + OPSLAKE + APSLAKE + OPRC)
summary(modB2)
##
## Call:
## lm(formula = BSAAM ~ APSAB + OPSLAKE + APSLAKE + OPRC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12750 -5095 -1494 4245 18594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15749.8 3740.8 4.210 0.000151 ***
## APSAB -650.6 1392.8 -0.467 0.643055
## OPSLAKE 2295.4 494.8 4.639 4.07e-05 ***
## APSLAKE 2244.9 1246.9 1.800 0.079735 .
## OPRC 1910.2 622.3 3.070 0.003942 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7358 on 38 degrees of freedom
## Multiple R-squared: 0.9248, Adjusted R-squared: 0.9169
## F-statistic: 116.8 on 4 and 38 DF, p-value: < 2.2e-16
Delete APSAB (least significant)
modB3 <- lm(BSAAM ~ OPSLAKE + APSLAKE + OPRC)
summary(modB3)
##
## Call:
## lm(formula = BSAAM ~ OPSLAKE + APSLAKE + OPRC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12964 -5140 -1252 4446 18649
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15424.6 3638.4 4.239 0.000133 ***
## OPSLAKE 2389.8 447.1 5.346 4.19e-06 ***
## APSLAKE 1712.5 500.5 3.421 0.001475 **
## OPRC 1797.5 567.8 3.166 0.002998 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7284 on 39 degrees of freedom
## Multiple R-squared: 0.9244, Adjusted R-squared: 0.9185
## F-statistic: 158.9 on 3 and 39 DF, p-value: < 2.2e-16
All are significant now! Don’t delete more.
End with modB3 <- lm(BSAAM ~ OPRC + OPSLAKE + APSLAKE).
Note: Overall, backward elimination is less typing, but it generally will give a bigger model in the end.
Backward elimination can also be done with the stepAIC command.
mod <- lm(BSAAM ~ APMAM + APSAB + OPBPC + OPSLAKE + APSLAKE + OPRC)
library(MASS)
stepAIC(mod)
## Start: AIC=774.36
## BSAAM ~ APMAM + APSAB + OPBPC + OPSLAKE + APSLAKE + OPRC
##
## Df Sum of Sq RSS AIC
## - APMAM 1 18537 2055849271 772.36
## - OPBPC 1 1301629 2057132362 772.39
## - APSAB 1 10869771 2066700504 772.58
## <none> 2055830733 774.36
## - APSLAKE 1 163662571 2219493304 775.65
## - OPSLAKE 1 493012936 2548843669 781.60
## - OPRC 1 509894399 2565725132 781.89
##
## Step: AIC=772.36
## BSAAM ~ APSAB + OPBPC + OPSLAKE + APSLAKE + OPRC
##
## Df Sum of Sq RSS AIC
## - OPBPC 1 1284108 2057133378 770.39
## - APSAB 1 12514566 2068363837 770.62
## <none> 2055849271 772.36
## - APSLAKE 1 176735690 2232584961 773.90
## - OPSLAKE 1 496370866 2552220136 779.66
## - OPRC 1 511413723 2567262994 779.91
##
## Step: AIC=770.39
## BSAAM ~ APSAB + OPSLAKE + APSLAKE + OPRC
##
## Df Sum of Sq RSS AIC
## - APSAB 1 11814207 2068947585 768.63
## <none> 2057133378 770.39
## - APSLAKE 1 175480984 2232614362 771.91
## - OPRC 1 510159318 2567292697 777.91
## - OPSLAKE 1 1165227857 3222361235 787.68
##
## Step: AIC=768.63
## BSAAM ~ OPSLAKE + APSLAKE + OPRC
##
## Df Sum of Sq RSS AIC
## <none> 2068947585 768.63
## - OPRC 1 531694203 2600641788 776.47
## - APSLAKE 1 621012173 2689959758 777.92
## - OPSLAKE 1 1515918540 3584866125 790.27
##
## Call:
## lm(formula = BSAAM ~ OPSLAKE + APSLAKE + OPRC)
##
## Coefficients:
## (Intercept) OPSLAKE APSLAKE OPRC
## 15425 2390 1712 1797
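At each step, the output lists the AIC that would result from dropping each predictor, and stepAIC drops the predictor whose removal gives the lowest AIC. It stops when dropping nothing (the <none> row) gives the lowest AIC.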
Our final model is lm(formula = BSAAM ~ OPSLAKE + APSLAKE + OPRC)
Now use the OkCupid data with forward selection.
data <- read.csv("http://cknudson.com/data/OKCupid.csv")
attach(data)
names(data)
## [1] "Sex" "Height" "IdealMateHeight" "Age"
mod1 <- lm(IdealMateHeight ~ Sex)
mod2 <- lm(IdealMateHeight ~ Height)
mod3 <- lm(IdealMateHeight ~ Age)
summary(mod1)
##
## Call:
## lm(formula = IdealMateHeight ~ Sex)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.147 -1.697 0.303 1.303 6.303
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.1471 0.3164 224.88 <2e-16 ***
## SexM -5.4501 0.4508 -12.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.609 on 132 degrees of freedom
## Multiple R-squared: 0.5255, Adjusted R-squared: 0.5219
## F-statistic: 146.2 on 1 and 132 DF, p-value: < 2.2e-16
summary(mod2)
##
## Call:
## lm(formula = IdealMateHeight ~ Height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6262 -2.6511 -0.1796 2.7022 8.9938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 82.47144 4.93571 16.709 < 2e-16 ***
## Height -0.20665 0.07266 -2.844 0.00516 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.676 on 132 degrees of freedom
## Multiple R-squared: 0.05774, Adjusted R-squared: 0.0506
## F-statistic: 8.089 on 1 and 132 DF, p-value: 0.005163
summary(mod3)
##
## Call:
## lm(formula = IdealMateHeight ~ Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4880 -2.4797 -0.4506 3.5203 8.5286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.213927 2.623250 26.004 <2e-16 ***
## Age 0.008304 0.086889 0.096 0.924
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.787 on 132 degrees of freedom
## Multiple R-squared: 6.92e-05, Adjusted R-squared: -0.007506
## F-statistic: 0.009135 on 1 and 132 DF, p-value: 0.924
Sex is most significant, so try adding each of the other ones to sex.
mod2a <- lm(IdealMateHeight ~ Height + Sex)
mod3a <- lm(IdealMateHeight ~ Age + Sex)
summary(mod2a)
##
## Call:
## lm(formula = IdealMateHeight ~ Height + Sex)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1770 -1.2642 0.2219 1.2602 5.3020
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.2456 3.9171 10.019 < 2e-16 ***
## Height 0.4930 0.0604 8.162 2.38e-13 ***
## SexM -8.5383 0.5281 -16.168 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.132 on 131 degrees of freedom
## Multiple R-squared: 0.6854, Adjusted R-squared: 0.6806
## F-statistic: 142.7 on 2 and 131 DF, p-value: < 2.2e-16
summary(mod3a)
##
## Call:
## lm(formula = IdealMateHeight ~ Age + Sex)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.1488 -1.6945 0.3026 1.3046 6.3005
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.1599812 1.8304576 38.876 <2e-16 ***
## Age -0.0004307 0.0600899 -0.007 0.994
## SexM -5.4501283 0.4525571 -12.043 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.619 on 131 degrees of freedom
## Multiple R-squared: 0.5255, Adjusted R-squared: 0.5182
## F-statistic: 72.53 on 2 and 131 DF, p-value: < 2.2e-16
Height is more significant than Age, so add it. Then try adding Age to Sex and Height.
mod2ab <- lm(IdealMateHeight ~ Height + Sex + Age)
summary(mod2ab)
##
## Call:
## lm(formula = IdealMateHeight ~ Height + Sex + Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1639 -1.2728 0.2494 1.3221 5.2108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.65668 4.14920 9.558 < 2e-16 ***
## Height 0.49372 0.06066 8.140 2.8e-13 ***
## SexM -8.54402 0.53027 -16.113 < 2e-16 ***
## Age -0.01520 0.04913 -0.309 0.758
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.14 on 130 degrees of freedom
## Multiple R-squared: 0.6857, Adjusted R-squared: 0.6784
## F-statistic: 94.52 on 3 and 130 DF, p-value: < 2.2e-16
Age is not significant, so don’t add it. Only height and sex are important. Final Model: lm(IdealMateHeight ~ Height + Sex)
Now use backward elimination on the same data, starting with the biggest possible model.
modF <- lm(IdealMateHeight ~ Height + Sex + Age)
summary(modF)
##
## Call:
## lm(formula = IdealMateHeight ~ Height + Sex + Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1639 -1.2728 0.2494 1.3221 5.2108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.65668 4.14920 9.558 < 2e-16 ***
## Height 0.49372 0.06066 8.140 2.8e-13 ***
## SexM -8.54402 0.53027 -16.113 < 2e-16 ***
## Age -0.01520 0.04913 -0.309 0.758
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.14 on 130 degrees of freedom
## Multiple R-squared: 0.6857, Adjusted R-squared: 0.6784
## F-statistic: 94.52 on 3 and 130 DF, p-value: < 2.2e-16
Age isn’t significant. Drop it.
modF1 <- lm(IdealMateHeight ~ Height + Sex)
summary(modF1)
##
## Call:
## lm(formula = IdealMateHeight ~ Height + Sex)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.1770 -1.2642 0.2219 1.2602 5.3020
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.2456 3.9171 10.019 < 2e-16 ***
## Height 0.4930 0.0604 8.162 2.38e-13 ***
## SexM -8.5383 0.5281 -16.168 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.132 on 131 degrees of freedom
## Multiple R-squared: 0.6854, Adjusted R-squared: 0.6806
## F-statistic: 142.7 on 2 and 131 DF, p-value: < 2.2e-16
All are now significant. Keep only sex and height. Final Model: lm(IdealMateHeight ~ Height + Sex)