(note) explain what code is saying #Data Set For this R guide, focused on multiple linear regression, we will use a data set called ais in the alr3 package. This data comes from the Australian Institute of Sport Data. The data contains the height, weight, sex, percent body fat, and a number of other health statistics of 102 male and 100 female athletes. In our examples of multiple linear regression we will stick to the named variables, and hold percent body fat as the response. That leaves weight and height as quantitative predictors and sex as a categorical predictor.

library(alr3)

## Loading required package: car

data(ais)
attach(ais)

Setting up the Multiple Linear Regression Model

To set up our model in R, it is very similar to simple linear regression. We simply add the predictors in the second part of the statement. Once it is created and named we will call the model to see its parameters.

mlr.mod <- lm(Bfat ~ Ht + Wt + Sex)
mlr.mod

## 
## Call:
## lm(formula = Bfat ~ Ht + Wt + Sex)
## 
## Coefficients:
## (Intercept)           Ht           Wt          Sex  
##     3.12251     -0.09389      0.28531     11.90504

Interpreting the Regression Model

The model gives us four coefficients, one for each variable as well as one for the intercept. To think about what the intercept coefficient means, it is the value we would expect to find for percent body fat if a male(this means the value for sex=0) had no height and weight. Obviously this is nonsense, so the intercept on its own means essentially nothing. The value for height means that for every cm somebody gains in height, their expected percentage of body fat goes down by .09389 percent. The value for weight means with each kiligram gained we expect percentage of body fat to increase by 0.28531. Because sex is a categorical predictor, its value has a slightly different meaning. It means that we expect females to have 11.90504 percent more body fat than men. In total, we come out with the following equation. \[\hat{Body Fat} = -.09389\hat{Height} + .28531\hat{Weight} + 11.90504I(sex)\]

Point Prediction

Now that we have model coefficients we can make predictions with them. We just multiply the model coefficients with a vector containing the values of our predictors that we choose. The first thing we have to make sure of is that the first value in our vector is 1, which will be multiplied by the intercept coefficient, which should always present. The other piece to make sure of is that the values you want to use for each predictor is in the same order as it is set up in our model. Let’s find out what our model predicts for a woman who is 160 cm tall and weighs 45 kilograms. To quickly get the model coefficients in R as a vector we use the “coef” function and simply input our model. To multiply our two vectors we just put %*% in between them.

coef(mlr.mod)%*%c(1, 160, 45, 1)

##          [,1]
## [1,] 12.84474

Based on our model, we would expect this woman to have 12.84474 percent body fat.

Mean Squared Error

The mean squared error is an estimation of the population variance. In our case, the population is Australian athletes. To calculate the mean squared error, we can take the sum of the squared residuals and divide it by the number of degrees of freedom. The number of degrees of freedom is n - (k + 1), where n is the sample size and k is the number of predictors. We could calculate that by hand but we should really let R do the work and save us a lot of time. We can find the mean squared error through R by taking a summary of our model.

summary(mlr.mod)

## 
## Call:
## lm(formula = Bfat ~ Ht + Wt + Sex)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9986 -2.0107 -0.4103  1.9666 13.7210 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.12251    6.26857   0.498   0.6190    
## Ht          -0.09389    0.04110  -2.285   0.0234 *  
## Wt           0.28531    0.02838  10.055   <2e-16 ***
## Sex         11.90504    0.59529  19.999   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.423 on 198 degrees of freedom
## Multiple R-squared:  0.6988, Adjusted R-squared:  0.6943 
## F-statistic: 153.1 on 3 and 198 DF,  p-value: < 2.2e-16

In the summary, the mean squared error is shown as residual standard error, which for our model is 3.423, which is our best estimate of the population variance.

F-Test

Another piece of information found in the summary of our model is the overall F-test. The null hypothesis in this F-test is that the coefficients for each of the predictors is equal to zero, or in simpler terms, insignificant. The alternative hypothesis is that at least one of the predictors is significant. The test statistic is determined by a numerator of the explained variation in the model over the number of predictors and a denominator of the unexplained variation in the model over the degrees of freedom. This can be difficult to calculate manually but thankfully R can do it for us. It is actually found in the same summary as the mean squared error. We see that our test statistic is 151.1, resulting in a p-value less than 2.2 times 10^-16. This means we reject the null hypothesis in favor of the alternative, which is that at least one of our predictors is significant.

R-Squared

R-sqaured, or the coefficient of determination, is the proportion of variability in the response that can be explained by its linear relationship with the predictors. In simpler terms, it is a scale from zero to one of how well the predictors fit the model. We also have the adjusted R-squared value, which is a modified version of R-squared that penalizes based on the number of predictors. This adjusted R-squared will always be less than or equal to the regular R-squared. It should also be noted that the adjusted R-squared is much less reliable than R-squared. Both of these values can be found in the summary of our model. Let’s call it up again just to refresh.

summary(mlr.mod)

## 
## Call:
## lm(formula = Bfat ~ Ht + Wt + Sex)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9986 -2.0107 -0.4103  1.9666 13.7210 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.12251    6.26857   0.498   0.6190    
## Ht          -0.09389    0.04110  -2.285   0.0234 *  
## Wt           0.28531    0.02838  10.055   <2e-16 ***
## Sex         11.90504    0.59529  19.999   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.423 on 198 degrees of freedom
## Multiple R-squared:  0.6988, Adjusted R-squared:  0.6943 
## F-statistic: 153.1 on 3 and 198 DF,  p-value: < 2.2e-16

Our R-squared value is 0.6988. This means that we can explain nearly 70 percent of the variation in our response based on its linear relationship with our predictors. Coincidentally, we typically look for a level of at least 0.7 as a benchmark for what we can consider a good model fit.

Testing Independent Coefficients

If we want to test the significance of independent predictors, we can perform t-tests. In previous statistics classes you have likely dealt with t-tests before, and these are no different. Our null hypothesis we’re operating under is that the coefficient for the predictor is not significant, or that it is zero. The alternative hypothesis is that the coefficient is significant. The test statistic is beta coefficient over the standard error of that coefficient. The standard error can also be defined as the sample variance. This statistic operates on a t distribution with n - (k + 1) degrees of freedom. To perform these tests in R we once again look at the summary of our model. What we’re really interested in is the p-values associated with these tests. Using a typical significance level of .05, we find that all three of our predictors have p-values less than the significance level, meaning all of our predictors are significant. Had we used a very strict p-value of .01, we would have found height to be insignificant as it has a p-value of .0234.

Interaction Terms

Sometimes we might think that the interaction of two predictors might be significant. For example, if we think that the slope of our height predictor varies with gender, we can create and interaction model to test it. First we’ll create a model with height, gender, and a third term of the two multiplied together. Then we’ll take a summary of the model to see if the multiplied term is significant.

int.mod <- lm(Bfat ~ Ht + Sex + Ht*Sex)
summary(int.mod)

## 
## Call:
## lm(formula = Bfat ~ Ht + Sex + Ht * Sex)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.843 -2.474 -0.670  2.392 16.145 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  -5.12949    9.61848  -0.533  0.59443   
## Ht            0.07752    0.05180   1.496  0.13614   
## Sex         -28.19777   13.01626  -2.166  0.03148 * 
## Ht:Sex        0.21560    0.07212   2.989  0.00315 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.115 on 198 degrees of freedom
## Multiple R-squared:  0.5647, Adjusted R-squared:  0.5581 
## F-statistic: 85.62 on 3 and 198 DF,  p-value: < 2.2e-16

We do find that our interaction term is significant, with a p-value of less than 0.01. This means we have different slopes for each gender. Think about it like this: for a male, the values of both terms that include sex is zero so the intercept and slope are just those shown in the table. However, for females, the intercept will decrease by 28.19777 from -5.12949 while the slope will increase by .2156 from .07752. So for males we would have an equation like this: \[\hat{Body Fat} = -5.12949 + .07752\hat{height}\]

For females the equation would be this: \[\hat{Body Fat} = -33.32726 + .29312\hat{height}\] #Intervals for Specific Data Frames There are two types of intervals we can create to estimate our response value. They both invlolve predicting and interval for the response based on a given set of values for the predictors. ##Prediction Intervals The first is called a prediction interval. This is used to predict an individual value of the response based and given values for the predictors. To do this in R, first we must create a data frame for our given predictor values. Let’s just use the same values from earlier, a woman with a height of 160 cm and a weight of 45 kg. We can simply use the data frame command and enter in our desired values to set this up.

newdata <- data.frame(Ht = 160, Wt = 45, Sex = 1)

Then we use the predict function and enter in our original model, followed by our data frame and then out interval type, which in this case is predict. The last bit is our confidence level.

pred <- predict(mlr.mod, newdata, interval = "predict", level = .95)
pred

##        fit      lwr      upr
## 1 12.84474 5.992902 19.69658

This function gives us lower and upper bounds for an interval, (5.992902, 19.69658). This means that if we had an individual woman of 160 cm and 45 kg, we would be 95% confident that her percentage of body fat would fall inside the interval. If you notice the fit value, that is the center of the interval. It is the same as our point prediction, which is what we would expect by using the same predictor values. ##Confidence Intervals The second type of interval we can construct is a confidence interval. This interval is centered at the same spot as the prediction interval, but will always be more narrow. This is because the confidence interval is the interval on which we are 95% confident (or whatever level we choose) the true population mean. We set up the prediction function the same way as before except we change our interval type to confidence.

conf <- predict(mlr.mod, newdata, interval = "confidence", level = .95)
conf

##        fit      lwr      upr
## 1 12.84474 11.66384 14.02565

We used the same data frame as before and find exactly what we hoped for: a skinnier interval centered at the same spot. This means we are 95% confident that the true mean percentage of body fat for all women of 160 cm and 45 kg is between 11.66384 and 14.02565.

Confidence Intervals for a Parameter

Suppose we want to construct a confidence interval for the parameters in our model. R actually gives us a confint function we can use to do just this. We first enter in our model and then the confidence level.

confint(mlr.mod, level = .95)

##                  2.5 %      97.5 %
## (Intercept) -9.2392275 15.48424185
## Ht          -0.1749278 -0.01284377
## Wt           0.2293531  0.34126582
## Sex         10.7311222 13.07895414

R gives us confidence intervals for each parameter in the model. What these numbers mean is that for each parameter, we are 95% confident that the true population mean lies on the interval.

The Quadratic Regerssion Model

When looking at the residuals for your linear model, you might noticed that you are systematically under predicting for some regions and systematically over predicting for others. If this were to happen you should use a quadratic model. This will only occur when your predictor seems to have a non-linear relationship with your explainatory variable. As far as why we look into quadratic regression models, we want to make the most accurate model we can, and when we have a systematic over prediction/under prediction in your current model, we are able to make it more accurate with a quadratic regression model.

When looking to see if you should use quadratic regression, you should look at the residuals of a linear model first.

line1 <- lm(Bfat ~ Ht)
plot(Ht,Bfat)
abline(line1)

From our graph it is hard to tell if we should use our linear regression model. The best way to tell is to look at the residuals of our model. To get the graphs of residuals we use the plot(x) command where x is our linear model.

par(mfrow = c(2,2)) in this case will allow us to see all 4 graphs at once. (in a 2 by 2 figure)

par(mfrow = c(2,2))
plot(line1)

The top left plot of our residuals includes a red trend line. This helps us notice that we are systematically over predicting for some groups, and systematically over predicting for others. This means that we want to use a quadratic model to improve our accuracy.

With this in mind, we can look into if a quadratic model would aid us. In the next few lines of code we use I(Ht^2) for the quadratic term.

par(mfrow = c(2,2))
line2 <- lm(Bfat ~ Ht + I(Ht^2))
plot(line2)

Focusing on the top left graph, with the exception of a few points on the far left, these residuals are far more reasonable. To test if our new model is more significant that the last we need to use a partial F-test.

The Partial F-test

One test we can use to determine if a set of predictors are significant is the partial F-test.

The partial F-test takes two models, the complete model H0, and the partial model Ha. For a test stat, the F-test uses the F stat, which is:

F= ([SSER-SSEC]/[k-g])/(SSEC/[n-(k+1)])

where SSER is the unexplained variation for the reduced model, SSEC is the unexplained variation for the complete model. K is the number of predictors in the larger model and G is the number of perdictors in the smalle model. Lastly, n is the size of your sample.

We would use the Partial F-test to test if a subset of predictors explains the data more adequately with less predictors. This has a few advantages over other tests, one of them being that this gives you the opportunity to eliminate a group of predictors all at once. Unfortunately, this method also could lead you to you dropping a single good predictor if also dropped with several bad predictors.

Coming back to our models from before, we use anova(x,y) to perform our partial f-test where x and y are the two models, 1 of which is a subset of the other.

anova(line1,line2)

## Analysis of Variance Table
## 
## Model 1: Bfat ~ Ht
## Model 2: Bfat ~ Ht + I(Ht^2)
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1    200 7428.9                              
## 2    199 7226.8  1    202.09 5.5648 0.01929 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In the event of a low P-value we use the most complicated/longer model. In the event of a high P-value we use the smaller version of the model. In our case, our p-value is .01929, therefor we would use the longer model with Ht^2 due to it being more significant.

Model Building

When it comes to building a model, there are two methods of step-wise creation. There is forward selection, and backwards elimination. Forward selection starts with a model of just the intercept, and looks at how much adding a predictor would increase its significance. The predictor that adds the most significance is then added to the model. This is repeated until adding in a predictor would not add to the models significance. Backwards elimination starts with a model containing all the predictors, and looks to see if removing a variable would increase the significance. It then removes the predictor that would cause the greatest increase to the models significance. This is repeated until removing any predictors would cause the significance to decrease. Significance in any of these tests can be measured in a variety of ways. One of the ways aside from the usual tests is the Akaike information criterion. This method has a simple equation of 2(K+1)-2log(likelihood of the model) where K is the number of predictors. This way prioritizes lower scores. Another method of determining what predictors to use is through the T-test and using the most significant predictors.

We use these methods of model creation so that if someone were to get our data, they would be able to follow our steps and create the same model. This is important when you consider that if someone were to choose a different method of model creation they could very well end with a different model then you.

One method of constructing a model is by using the Akaike information criterion (AIC for short). In the following code we use step(x,y,z) to automate the process of building our model. For step(x,z,y) to work, x must be a linear model, y must be the scope that step should walk through. We do this by using the list(lower = h, upper = j) where h is only the intercept, and J is the equation with all possible predictors. Lastly, z must be direction = “forward” for forward selection, or direction = “backward” for backwards elimination.

lower.scope <-lm(Bfat ~ 1)
upper.scope <-lm(Bfat ~ ., data = ais)

stepFwd <- step(lm(Bfat ~ 1, data = ais), scope = list(lower = lower.scope, upper=upper.scope), direction = "forward")

## Start:  AIC=737.45
## Bfat ~ 1
## 
##         Df Sum of Sq    RSS    AIC
## + SSF    1    7142.0  559.1 209.64
## + Label 16    5620.6 2080.5 505.08
## + Sex    1    3733.1 3968.0 605.51
## + Sport  9    3089.2 4611.9 651.88
## + Hc     1    2183.3 5517.8 672.11
## + Hg     1    2175.7 5525.4 672.39
## + RCC    1    1875.6 5825.5 683.07
## + LBM    1    1008.3 6692.8 711.10
## + Ht     1     272.3 7428.9 732.18
## + BMI    1     270.9 7430.2 732.22
## + Ferr   1     259.0 7442.1 732.54
## + WCC    1      90.0 7611.1 737.08
## <none>               7701.1 737.45
## + Wt     1       0.0 7701.1 739.45
## 
## Step:  AIC=209.64
## Bfat ~ SSF
## 
##         Df Sum of Sq    RSS     AIC
## + Label 16    363.26 195.83  29.735
## + Sex    1    315.09 244.00  44.155
## + LBM    1    210.66 348.43 116.122
## + Wt     1    174.40 384.69 136.123
## + BMI    1    127.14 431.95 159.529
## + Hg     1    119.62 439.47 163.013
## + Ht     1    110.36 448.73 167.228
## + RCC    1    102.17 456.92 170.882
## + Hc     1     96.33 462.76 173.447
## + Ferr   1     48.81 510.28 193.190
## + Sport  9     66.22 492.87 202.178
## <none>               559.09 209.644
## + WCC    1      4.50 554.59 210.012
## 
## Step:  AIC=29.74
## Bfat ~ SSF + Label
## 
##        Df Sum of Sq    RSS    AIC
## + LBM   1    3.8890 191.94 27.683
## <none>              195.83 29.735
## + Ht    1    1.2881 194.54 30.402
## + WCC   1    0.8281 195.00 30.879
## + Ferr  1    0.8254 195.01 30.882
## + Hc    1    0.4624 195.37 31.258
## + Wt    1    0.2375 195.59 31.490
## + BMI   1    0.1822 195.65 31.547
## + Hg    1    0.1142 195.72 31.617
## + RCC   1    0.1000 195.73 31.632
## 
## Step:  AIC=27.68
## Bfat ~ SSF + Label + LBM
## 
##        Df Sum of Sq     RSS      AIC
## + Wt    1   110.920  81.023 -144.535
## + BMI   1     6.825 185.117   22.369
## <none>              191.942   27.683
## + Hc    1     0.671 191.271   28.976
## + Ferr  1     0.587 191.356   29.065
## + WCC   1     0.582 191.360   29.070
## + Hg    1     0.237 191.705   29.434
## + RCC   1     0.120 191.822   29.557
## + Ht    1     0.056 191.886   29.624
## 
## Step:  AIC=-144.53
## Bfat ~ SSF + Label + LBM + Wt
## 
##        Df Sum of Sq    RSS     AIC
## <none>              81.023 -144.53
## + Hc    1  0.257686 80.765 -143.18
## + BMI   1  0.166228 80.857 -142.95
## + RCC   1  0.119506 80.903 -142.83
## + Hg    1  0.084595 80.938 -142.75
## + Ht    1  0.053253 80.970 -142.67
## + WCC   1  0.027137 80.996 -142.60
## + Ferr  1  0.000739 81.022 -142.54

This function starts with an equation of just the intercept, and looks at each of the variables to see which one would reduce the AIC the most. After looking at all the predictors, it chooses SSF. This is then done again, this time with the base equation consisting of the intercept and SSF. This is repeated until adding another predictor would cause the AIC to increase. In the end the function comes up with SSF + Label + LBM + Wt.

The following code for step is the same as before, only this time direction = “backward” rather than “forward”.

lower.scope <-lm(Bfat ~ 1)
upper.scope <-lm(Bfat ~ ., data = ais)
stepback <- step(lm(Bfat ~ ., data = ais), scope = list(lower = lower.scope, upper=upper.scope), direction = "backward")

## Start:  AIC=-133.34
## Bfat ~ Sex + Ht + Wt + LBM + RCC + WCC + Hc + Hg + Ferr + BMI + 
##     SSF + Label + Sport
## 
## 
## Step:  AIC=-133.34
## Bfat ~ Sex + Ht + Wt + LBM + RCC + WCC + Hc + Hg + Ferr + BMI + 
##     SSF + Label
## 
## 
## Step:  AIC=-133.34
## Bfat ~ Ht + Wt + LBM + RCC + WCC + Hc + Hg + Ferr + BMI + SSF + 
##     Label
## 
##         Df Sum of Sq     RSS      AIC
## - RCC    1     0.000  79.906 -135.338
## - Ferr   1     0.016  79.922 -135.297
## - WCC    1     0.110  80.016 -135.060
## - Hg     1     0.165  80.071 -134.923
## - Hc     1     0.218  80.124 -134.787
## - Ht     1     0.399  80.305 -134.333
## - BMI    1     0.519  80.424 -134.032
## <none>                79.906 -133.339
## - SSF    1    11.565  91.470 -108.035
## - Label 16    41.795 121.700  -80.355
## - Wt     1    42.922 122.827  -48.492
## - LBM    1   111.957 191.863   41.600
## 
## Step:  AIC=-135.34
## Bfat ~ Ht + Wt + LBM + WCC + Hc + Hg + Ferr + BMI + SSF + Label
## 
##         Df Sum of Sq     RSS      AIC
## - Ferr   1     0.016  79.922 -137.297
## - WCC    1     0.111  80.017 -137.058
## - Hg     1     0.165  80.071 -136.923
## - Hc     1     0.344  80.250 -136.470
## - Ht     1     0.405  80.311 -136.316
## - BMI    1     0.535  80.441 -135.990
## <none>                79.906 -135.338
## - SSF    1    11.643  91.549 -109.862
## - Label 16    42.008 121.914  -82.001
## - Wt     1    43.487 123.393  -49.564
## - LBM    1   111.992 191.898   39.637
## 
## Step:  AIC=-137.3
## Bfat ~ Ht + Wt + LBM + WCC + Hc + Hg + BMI + SSF + Label
## 
##         Df Sum of Sq     RSS      AIC
## - WCC    1     0.105  80.027 -139.032
## - Hg     1     0.155  80.078 -138.905
## - Hc     1     0.329  80.252 -138.466
## - Ht     1     0.402  80.324 -138.284
## - BMI    1     0.535  80.457 -137.950
## <none>                79.922 -137.297
## - SSF    1    11.627  91.549 -111.862
## - Label 16    42.601 122.523  -82.994
## - Wt     1    43.688 123.611  -51.208
## - LBM    1   112.737 192.659   38.436
## 
## Step:  AIC=-139.03
## Bfat ~ Ht + Wt + LBM + Hc + Hg + BMI + SSF + Label
## 
##         Df Sum of Sq     RSS      AIC
## - Hg     1     0.154  80.181 -140.644
## - Hc     1     0.299  80.326 -140.279
## - Ht     1     0.380  80.407 -140.076
## - BMI    1     0.507  80.534 -139.756
## <none>                80.027 -139.032
## - SSF    1    11.837  91.864 -113.167
## - Label 16    42.569 122.596  -84.874
## - Wt     1    43.649 123.676  -53.102
## - LBM    1   113.130 193.157   36.958
## 
## Step:  AIC=-140.64
## Bfat ~ Ht + Wt + LBM + Hc + BMI + SSF + Label
## 
##         Df Sum of Sq     RSS      AIC
## - Hc     1     0.196  80.377 -142.150
## - Ht     1     0.428  80.609 -141.568
## - BMI    1     0.532  80.713 -141.308
## <none>                80.181 -140.644
## - SSF    1    11.757  91.939 -115.004
## - Label 16    44.479 124.660  -83.500
## - Wt     1    43.546 123.727  -55.018
## - LBM    1   113.650 193.831   35.661
## 
## Step:  AIC=-142.15
## Bfat ~ Ht + Wt + LBM + BMI + SSF + Label
## 
##         Df Sum of Sq     RSS      AIC
## - Ht     1     0.479  80.857 -142.949
## - BMI    1     0.592  80.970 -142.667
## <none>                80.377 -142.150
## - SSF    1    11.651  92.028 -116.807
## - Label 16    47.342 127.720  -80.603
## - Wt     1    43.378 123.755  -56.973
## - LBM    1   113.920 194.297   34.147
## 
## Step:  AIC=-142.95
## Bfat ~ Wt + LBM + BMI + SSF + Label
## 
##         Df Sum of Sq     RSS      AIC
## - BMI    1     0.166  81.023 -144.535
## <none>                80.857 -142.949
## - SSF    1    12.235  93.092 -116.485
## - Label 16    47.732 128.588  -81.234
## - Wt     1   104.260 185.117   22.369
## - LBM    1   113.447 194.303   32.153
## 
## Step:  AIC=-144.53
## Bfat ~ Wt + LBM + SSF + Label
## 
##         Df Sum of Sq     RSS      AIC
## <none>                81.023 -144.535
## - SSF    1    12.599  93.622 -117.339
## - Label 16    51.589 132.612  -77.010
## - Wt     1   110.920 191.942   27.683
## - LBM    1   114.571 195.594   31.490

This function works in a similar way to the first one, only the starting equation is of all the predictors, and it looks at how much the AIC would improve if one of the predictors were to be removed. In this instance the predictors are removed in the order of Sport, Sex, RCC, ferr, WCC, Hg, Hc, Ht, and lastly BMI.

MultiCollinearity

Another thing to look into is multicolliniarity. Multicolliniarity is when two or more predictors have a strong correlation. This can lead to several negative side effects. Multicolliniarity leads to unstable point estimates, t-tests end up dropping variables more often, and confidence intervals are wider than necessary.

To check for multicolliniarity there are two methods. The first involves looking at a graph to see if there is an obvious correlation between two predictors. The second method involves getting the correlation as a number between two variables. (between -1 and 1) For the first method, the code below provides us with plots of all variables plotted against each other.

plot(ais)

For our data, this is a ton of information to take in at once. This is where the second method comes in. The code below provides us with the correlations between several of the variables in our model above. In the code below, we will look at the correlation between Wt and LBM, and Wt and SSF. 3 is the index for Wt, 4 is the index for LBM, and 11 is the index for SSF.

cor(ais[3],ais[4]) #Wt and LBM

##         LBM
## Wt 0.930904

cor(ais[3],ais[11]) #Wt and SSF

##          SSF
## Wt 0.1542267

This shows us that we have a very significant correlation between weight and LBM. In this case where the correlation is greater than .9, we would omit one of the two variables. The new model might be:

mod1 <-lm(Bfat ~ SSF + Label,data=ais)
mod1

## 
## Call:
## lm(formula = Bfat ~ SSF + Label, data = ais)
## 
## Coefficients:
##    (Intercept)             SSF    Labelf-field      Labelf-gym  
##         5.2977          0.1509         -1.1163         -1.5561  
## Labelf-netball      Labelf-row     Labelf-swim   Labelf-t_400m  
##        -0.6344          0.4291         -1.3133         -1.1780  
## Labelf-t_sprnt   Labelf-tennis   Labelm-b_ball    Labelm-field  
##        -1.4165         -1.1114         -4.2733         -3.4588  
##     Labelm-row     Labelm-swim   Labelm-t_400m  Labelm-t_sprnt  
##        -3.7832         -4.2165         -3.9260         -3.6451  
##  Labelm-tennis   Labelm-w_polo  
##        -3.7408         -3.4274

Residual Analysis in Simple Regression

When looking at residuals we use several plots. All of these are provided when using the code “plot(model)” The first of these is the standard residual plot. This will include a red trend line that indicates the trend of the residuals. This is helpful when trying to figure out if our model is missing anything. Another of the plots we use is QQnorm. QQnorm plots the standardized residuals against the theoretical quantiles. This plot reduces the weight given to outliers. The third model given is the scale-location plot. The scale-location plot plots the square root of the standardized residuals against the fitted values. This plot helps us by reducing the impact of squed data, in addition to reducing the heteroscedasticity of our data. “plot(model)” also gives us the Cook’s distance, which helps us recognize the leverage value of each point. In some cases this points out when we have a few outlying points.

par(mfrow = c(2,2))  #used to plot 2 by 2 plots on screen
plot(line2)

In addition to these plots we can use several transformations for when some problems occur in our data. First we were given the Square root transformation. This transformation is used to help reduce the heteroscedasticity of our data. Second we were given the log transformation. This transformation is used to also reduce heteroscedasticity of our data. The downside is that when using the log transformation we need to make sure our data doesn’t ever have a point where y=0. Lastly we were given the inverse transformation. We were told to use this transformation when we have some knowledge of a relationship between the variables beforehand. We use these transformations because we know the real world is not made up of linear relationships. If we were to find something in our data that would indicate it was not a linear relationship, it wouldn’t be outside the realm of possibility for us to have a non-linear relationship.

Following are a few graphs of what happen to our first model when we transform them.

par(mfrow = c(2,2))
line1 <- lm(Bfat ~ Ht + I(Ht^2))
plot(line1)

line1 <- lm(Bfat^.5 ~ Ht + I(Ht^2))
plot(line1)

line1 <- lm(log(Bfat) ~ Ht + I(Ht^2))
plot(line1)

line1 <- lm(1/Bfat ~ Ht + I(Ht^2))
plot(line1)

The first set of graphs is of our initial model, the second set of graphs is the square root transformation, the third is the log transformation, and lastly is the inverse transformation. In the case of our data, none of the transformations aided in cleaning up our data. This can be attributed to a great data set.

R guide chapter 4 and 5

Nate Persing

March 26, 2018