ASSIGNMENT 11 - LINEAR REGRESSION IN R


Problem 1

Perform a linear regression analysis fitting the Max Heart Rate to Age using the lm function in R. What is the resulting equation?

Age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
MaxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)

health <- data.frame(Age = Age, MaxHR = MaxHR)
health.lm <- lm(MaxHR ~ Age, data = health)

The equation for the linear model is:

\[\widehat{MaxHR} = 210.0484584 - 0.7977266\times Age\]
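
The coefficients above come straight from the fitted model object:

# Intercept and slope of the fitted line
coef(health.lm)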

Is the effect of Age on Max HR significant? What is the significance level?

First, we will establish a hypothesis test:

  • \(H_0: \beta_1 = 0\). The slope is equal to zero, i.e. there is no linear relationship between Age and MaxHR.

  • \(H_A: \beta_1 \neq 0\). The slope is not equal to zero, i.e. there is a linear relationship between Age and MaxHR.

We will take a p-value < 0.05 as statistically significant and a p-value < 0.01 as statistically highly significant. The summary of the linear model and its p-value is as follows:

summary(health.lm)
## 
## Call:
## lm(formula = MaxHR ~ Age, data = health)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.9258 -2.5383  0.3879  3.1867  6.6242 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 210.04846    2.86694   73.27  < 2e-16 ***
## Age          -0.79773    0.06996  -11.40 3.85e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.578 on 13 degrees of freedom
## Multiple R-squared:  0.9091, Adjusted R-squared:  0.9021 
## F-statistic:   130 on 1 and 13 DF,  p-value: 3.848e-08

The resulting p-value is highly significant, but with the small sample size of 15, there should be some hesitation in rejecting \(H_0\). However, since the p-value is far below 0.001 and the test is two-sided, we reject \(H_0\) in favor of \(H_A\).
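
As a sanity check, the two-sided p-value can be recomputed by hand from the reported t statistic and the residual degrees of freedom:

# Two-sided p-value for the Age slope (t = -11.40 on 13 df)
t_age <- summary(health.lm)$coefficients["Age", "t value"]
2 * pt(-abs(t_age), df = df.residual(health.lm))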

Does the fitted model approximate the standard equation for max heart rate given age?

The standard equation for max heart rate given age is:

\[MaxHR = 220 - Age\]

  • where 220 is the intercept and the Age coefficient is \(-1\).

Does the model built from our limited data capture this behavior despite the small sample size? Taking a 95% confidence interval from our model, we would want the coefficients above (220 and \(-1\)) to fall within the bounds.

confint(health.lm, level = 0.95)
##                  2.5 %      97.5 %
## (Intercept) 203.854813 216.2421034
## Age          -0.948872  -0.6465811

The intervals fail to capture both the intercept and the Age coefficient. The likely explanation is that the sample size is too small and does not reflect the larger population as a whole.
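
A quick programmatic check confirms that neither standard value falls inside its interval:

# Is 220 inside the intercept interval, and -1 inside the Age interval?
ci <- confint(health.lm, level = 0.95)
c(intercept = ci["(Intercept)", 1] <= 220 & 220 <= ci["(Intercept)", 2],
  age       = ci["Age", 1] <= -1 & -1 <= ci["Age", 2])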

Plot the fitted relationship between Max HR and Age.

plot(health$MaxHR ~ health$Age, xlab = "Age", ylab = "Max Heart Rate")
abline(health.lm, col = "red")

In addition to the good visual fit of the model, the \(R^2\) value (0.9091) is close to 1, which means that the variability in Max Heart Rate is very well described by Age.
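
The exact value can be pulled from the model summary:

# Proportion of the variance in MaxHR explained by Age
summary(health.lm)$r.squared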

Comment on fit and significance

The model fits the data very well visually. In addition, the highly significant p-value makes it very unlikely that the observed relationship between Age and Max Heart Rate arose by chance.


Problem 2

Using the Auto dataset, perform a linear regression analysis using mpg as the dependent variable and the other 4 (displacement, horsepower, weight, acceleration) as independent variables. Perform this taking 40 random points and also using the entire dataset (392 points).

# Data for all points; repo_url is assumed to be defined earlier and to
# point at the raw Auto dataset file
library(dplyr)  # provides sample_n() used below
auto <- read.csv(repo_url, header = FALSE, sep = "")
colnames(auto) <- c("displacement", "horsepower", "weight", "acceleration", "mpg")
auto.all_lm <- lm(mpg ~ displacement + horsepower + weight + acceleration, data = auto)

# Sample of 40 random points
set.seed(8675309)
auto.samp <- sample_n(auto, 40)
auto.samp_lm <- lm(mpg ~ displacement + horsepower + weight + acceleration, data = auto.samp)

What is the final linear regression fit equation?

Using a sample of 40

\[\widehat{mpg} = 54.4881551 + 0.0111858\,displacement - 0.082517\,horsepower \\ - 0.0063925\,weight - 0.279526\,acceleration\]

Using entire dataset

\[\widehat{mpg} = 45.2511397 - 0.0060009\,displacement - 0.0436077\,horsepower \\ - 0.0052805\,weight - 0.023148\,acceleration\]

Comment on the difference

The intercept and the coefficients differ considerably between the two fits; some coefficients even switch sign.
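
A side-by-side view makes the differences, including the sign flip on displacement, easy to see:

# Compare the coefficients from the 40-point sample and the full dataset
cbind(sample_40 = coef(auto.samp_lm), full_data = coef(auto.all_lm))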

Which of the 4 independent variables have a significant impact on mpg?

Using a sample of 40

(auto.samp_lm.out <- summary(auto.samp_lm))
## 
## Call:
## lm(formula = mpg ~ displacement + horsepower + weight + acceleration, 
##     data = auto.samp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8354 -3.0397 -0.3918  2.4682 15.0053 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  54.488155   7.818121   6.969 4.18e-08 ***
## displacement  0.011186   0.025296   0.442   0.6611    
## horsepower   -0.082517   0.059811  -1.380   0.1765    
## weight       -0.006393   0.003168  -2.018   0.0513 .  
## acceleration -0.279526   0.379037  -0.737   0.4658    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.04 on 35 degrees of freedom
## Multiple R-squared:  0.6922, Adjusted R-squared:  0.657 
## F-statistic: 19.68 on 4 and 35 DF,  p-value: 1.456e-08

The p-value for weight shows the most evidence of an association with the response variable mpg, but it is weak, i.e. > 0.05. The other coefficients show little to no evidence for rejecting \(H_0\); that is, they are of little use in describing the behavior of the response variable mpg.

Using entire dataset

(auto.all_lm.out <- summary(auto.all_lm))
## 
## Call:
## lm(formula = mpg ~ displacement + horsepower + weight + acceleration, 
##     data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.378  -2.793  -0.333   2.193  16.256 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  45.2511397  2.4560447  18.424  < 2e-16 ***
## displacement -0.0060009  0.0067093  -0.894  0.37166    
## horsepower   -0.0436077  0.0165735  -2.631  0.00885 ** 
## weight       -0.0052805  0.0008109  -6.512  2.3e-10 ***
## acceleration -0.0231480  0.1256012  -0.184  0.85388    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.247 on 387 degrees of freedom
## Multiple R-squared:  0.707,  Adjusted R-squared:  0.704 
## F-statistic: 233.4 on 4 and 387 DF,  p-value: < 2.2e-16

Using the entire dataset, weight emerges as highly significant in explaining the behavior of the response variable. Horsepower is also significant, with a p-value < 0.01. The remaining variables show little to no significance.

What are the corresponding significance levels and standard errors of each of the coefficients?

# Gather the p-values (column 4) and standard errors (column 2)
# from both model summaries into one table
auto_sig <- data.frame(summary(auto.samp_lm)$coefficients[, 4, drop = FALSE],
                       summary(auto.all_lm)$coefficients[, 4, drop = FALSE],
                       summary(auto.samp_lm)$coefficients[, 2, drop = FALSE],
                       summary(auto.all_lm)$coefficients[, 2, drop = FALSE])

colnames(auto_sig) <- c("Sample.Significance", "All.Significance",
                        "Sample.StdError", "All.StdError")

auto_sig
##              Sample.Significance All.Significance Sample.StdError
## (Intercept)         4.183387e-08     7.072099e-55      7.81812093
## displacement        6.610665e-01     3.716584e-01      0.02529599
## horsepower          1.764543e-01     8.848982e-03      0.05981130
## weight              5.131401e-02     2.302545e-10      0.00316785
## acceleration        4.657555e-01     8.538765e-01      0.37903659
##              All.StdError
## (Intercept)  2.4560446927
## displacement 0.0067093055
## horsepower   0.0165734633
## weight       0.0008108541
## acceleration 0.1256011622

The significance levels when using the entire dataset are stronger than those from the sample, but displacement and acceleration are still poor predictors of the response. The standard errors also shrink as the sample size grows, consistent with the law of large numbers.
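
A quick ratio shows how much the standard errors shrink when moving from 40 points to the full dataset:

# How many times larger are the sample standard errors?
auto_sig$Sample.StdError / auto_sig$All.StdError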

What is the 95% confidence interval for each model?

Using a sample of 40

confint(auto.samp_lm)
##                    2.5 %       97.5 %
## (Intercept)  38.61652581 7.035978e+01
## displacement -0.04016776 6.253944e-02
## horsepower   -0.20394037 3.890643e-02
## weight       -0.01282363 3.853096e-05
## acceleration -1.04901122 4.899591e-01

Using the entire dataset

confint(auto.all_lm)
##                     2.5 %       97.5 %
## (Intercept)  40.422278855 50.080000544
## displacement -0.019192122  0.007190380
## horsepower   -0.076193029 -0.011022433
## weight       -0.006874738 -0.003686277
## acceleration -0.270094049  0.223798050

# Collect the fitted coefficients side by side for comparison
auto_interval <- data.frame(samp_coeffs = auto.samp_lm$coefficients,
                            all_coeffs = auto.all_lm$coefficients)

The confidence intervals for the sample are much wider than those for the entire dataset. Because the sample intervals are so wide, they contain all of the corresponding coefficients from the complete-data model.

The complete dataset produces much narrower intervals, consistent with the weak law of large numbers: the more observations \((x_i, y_i)\) used, the closer the estimates will be to the population values.
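
The difference in precision can be quantified by comparing the widths of the intervals directly:

# Width (upper bound minus lower bound) of each 95% confidence interval
cbind(sample_40 = apply(confint(auto.samp_lm), 1, diff),
      full_data = apply(confint(auto.all_lm), 1, diff))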

Comment on the fit and significance

A good way to visualize the fit when using multiple variables is to plot the residuals against the fitted values. We would like to see relatively uniform variation about the horizontal line at zero. We will only use the complete dataset in this step.

plot(auto.all_lm$residuals~auto.all_lm$fitted.values)
abline(h = 0, lty = 3, col = "red")

The model is a fairly good fit for the variables with respect to mpg. The residuals do, however, suggest a slight non-linearity.
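
One way to make the slight curvature easier to see is to overlay a smoother on the residual plot above (a sketch using lowess; any smoother would do):

# Overlay a lowess curve on the residual plot to highlight any trend
lines(lowess(auto.all_lm$fitted.values, auto.all_lm$residuals),
      col = "blue", lty = 2)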

The significance levels of some of the variables are very weak, and those variables should not be used in the model. There is a way to determine the best variables to use while reducing the complexity of the model at the same time.

Building a better model for mpg

From the above model, we run into the problem of overfitting by using every variable. Some variables have a significance level larger than we would want, e.g. > 0.05. This could be due to their lack of ability to explain the variability in mpg, or because other variables interfere with their ability to explain it. We should be asking questions such as: what will be the effect of removing specific variables from the model? There are many combinations of variables that could be used in this model, specifically 15. Applying the above method to all 15 combinations would be a little maddening: weighing the number of variables used against the resulting \(R^2\) and p-values. Ideally, we want to minimize the number of variables used while maximizing the fit.
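
The count of 15 is just the number of non-empty subsets of the 4 candidate variables:

# Non-empty subsets of 4 variables: 2^4 - 1 = 15
sum(choose(4, 1:4))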

There is an R package, leaps, which can perform variable selection. We will use the Bayesian Information Criterion (BIC), which rewards goodness of fit while penalizing the number of variables used, preventing overfitting. We will keep the 4 best models at each step, i.e. at each number of variables, so we can at least see the impact each variable has on the fit by itself.

library(leaps)  # provides regsubsets() for best-subset selection
auto.leap <- regsubsets(mpg ~ displacement + horsepower + weight + acceleration,
                        data = auto, nbest = 4)
plot(auto.leap, scale = "bic")

From the output, the lowest BIC utilizes horsepower and weight. We will use these variables in our model. It is important to note that using weight by itself is nearly equal to using all of the variables.
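
The same conclusion can be read off programmatically from the regsubsets output (a sketch using the summary fields of leaps):

# Identify the subset of variables with the lowest BIC
auto.leap_sum <- summary(auto.leap)
auto.leap_sum$which[which.min(auto.leap_sum$bic), ]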

auto.best <- lm(mpg ~ horsepower + weight, data = auto)
summary(auto.best)
## 
## Call:
## lm(formula = mpg ~ horsepower + weight, data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.0762  -2.7340  -0.3312   2.1752  16.2601 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 45.6402108  0.7931958  57.540  < 2e-16 ***
## horsepower  -0.0473029  0.0110851  -4.267 2.49e-05 ***
## weight      -0.0057942  0.0005023 -11.535  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.24 on 389 degrees of freedom
## Multiple R-squared:  0.7064, Adjusted R-squared:  0.7049 
## F-statistic: 467.9 on 2 and 389 DF,  p-value: < 2.2e-16

The resulting p-values are all highly significant, and the \(R^2\) value is nearly identical to that of the model using all the variables. We have therefore dramatically reduced the complexity of the model while retaining essentially the same fit.
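
Comparing the two \(R^2\) values directly:

# R-squared of the full model vs. the reduced model
c(all_vars = summary(auto.all_lm)$r.squared,
  reduced  = summary(auto.best)$r.squared)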

The resulting equation is:

\[\widehat{mpg} = 45.6402108 + (-0.0473029)horsepower + (-0.0057942) weight\]

The resulting fit can be plotted using the residuals as carried out previously:

plot(auto.best$residuals~auto.best$fitted.values)
abline(h = 0, lty = 3, col = "red")

Comparing this plot to the one generated above, they are nearly identical. With the complexity dramatically reduced, this model is much preferable to the original using all of the variables.
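
For a direct visual comparison, the two residual plots can be drawn side by side (a sketch):

# Residuals of the full and reduced models, side by side
par(mfrow = c(1, 2))
plot(auto.all_lm$residuals ~ auto.all_lm$fitted.values, main = "All variables")
abline(h = 0, lty = 3, col = "red")
plot(auto.best$residuals ~ auto.best$fitted.values, main = "horsepower + weight")
abline(h = 0, lty = 3, col = "red")
par(mfrow = c(1, 1))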