Age <- c(18, 23, 25, 35, 65, 54, 34, 56, 72, 19, 23, 42, 18, 39, 37)
MaxHR <- c(202, 186, 187, 180, 156, 169, 174, 172, 153, 199, 193, 174, 198, 183, 178)
health <- data.frame(Age = Age, MaxHR = MaxHR)
health.lm <- lm(MaxHR ~ Age, data = health)
The equation for the linear model is:
\[\widehat{MaxHR} = 210.0484584 - 0.7977266\times Age\]
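Rather than transcribing these values by hand, the coefficients can be read straight from the fitted object:
coef(health.lm)   # intercept and Age slope used in the equation above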
First, we will establish a hypothesis test:
\(H_0: \beta_1 = 0\). The slope is equal to zero, i.e. there is no linear relationship between Age and MaxHR.
\(H_A: \beta_1 \neq 0\). The slope is not equal to zero, i.e. there is a linear relationship between Age and MaxHR.
We will use the conventional thresholds: a p-value below 0.05 is statistically significant, and a p-value below 0.01 is statistically highly significant. The summary of the linear model and its p-values is as follows:
summary(health.lm)
##
## Call:
## lm(formula = MaxHR ~ Age, data = health)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.9258 -2.5383 0.3879 3.1867 6.6242
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 210.04846 2.86694 73.27 < 2e-16 ***
## Age -0.79773 0.06996 -11.40 3.85e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.578 on 13 degrees of freedom
## Multiple R-squared: 0.9091, Adjusted R-squared: 0.9021
## F-statistic: 130 on 1 and 13 DF, p-value: 3.848e-08
The resulting p-value is highly significant, but with the small sample size of 15 there should be some hesitation in rejecting \(H_0\). However, since the p-value is far below 0.001 and the test statistic comes from a two-sided test, we have strong grounds to reject \(H_0\) in favor of \(H_A\).
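As a sanity check, that p-value can be recomputed from the t statistic and the 13 residual degrees of freedom reported above:
t_age <- summary(health.lm)$coefficients["Age", "t value"]
2 * pt(-abs(t_age), df = 13)   # two-sided p-value; should reproduce 3.85e-08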
A widely cited rule-of-thumb equation for max heart rate given an age is:
\[MaxHR = 220 - Age\]
Does the model from our limited data capture this behavior despite the small sample size? Taking a 95% confidence interval for our model's coefficients, we would want the rule-of-thumb values (an intercept of 220 and a slope of -1) to fall within the bounds.
confint(health.lm, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 203.854813 216.2421034
## Age -0.948872 -0.6465811
The interval fails to capture both the rule-of-thumb intercept and its Age coefficient. What is likely the case is that the sample size is too small and does not reflect the larger population as a whole.
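That check can be made programmatic, testing whether 220 and -1 land inside the bounds:
ci <- confint(health.lm, level = 0.95)
ci["(Intercept)", 1] <= 220 & 220 <= ci["(Intercept)", 2]   # FALSE: 220 lies above the interval
ci["Age", 1] <= -1 & -1 <= ci["Age", 2]                     # FALSE: -1 lies below the interval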
plot(health$MaxHR ~ health$Age, xlab = "Age", ylab = "Max Heart Rate")
abline(health.lm, col = "red")
In addition to the good visible fit of the model, the \(R^2\) value is approximately 0.91, which means that the variability in max heart rate is very well described by the variable Age.
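To see how far the fitted line sits from the 220 - Age rule in practice, we can compare predictions at a few illustrative ages (the ages chosen here are arbitrary):
ages <- c(20, 40, 60)
cbind(age = ages,
      model = round(predict(health.lm, newdata = data.frame(Age = ages)), 1),
      rule = 220 - ages)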
# Data for the entire dataset; repo_url is assumed to hold the URL of a
# whitespace-delimited copy of the auto data
library(dplyr)   # provides sample_n(), used below
auto <- read.csv(repo_url, header = FALSE, sep = "")
colnames(auto) <- c("displacement", "horsepower", "weight", "acceleration", "mpg")
auto.all_lm <- lm(mpg ~ displacement + horsepower + weight + acceleration, data = auto)
# Reproducible sample of 40 observations
set.seed(8675309)
auto.samp <- sample_n(auto, 40)
auto.samp_lm <- lm(mpg ~ displacement + horsepower + weight + acceleration, data = auto.samp)
Using a sample of 40
\[\widehat{mpg} = 54.4881551 + 0.0111858\,displacement - 0.082517\,horsepower \\ - 0.0063925\,weight - 0.279526\,acceleration\]
Using entire dataset
\[\widehat{mpg} = 45.2511397 - 0.0060009\,displacement - 0.0436077\,horsepower \\ - 0.0052805\,weight - 0.023148\,acceleration\]
Comment on the difference
The intercept and coefficients are quite different across all the variables; some even switch signs.
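Placing the two sets of coefficients side by side makes the differences, including the sign flip on displacement, explicit:
cbind(sample = coef(auto.samp_lm), all = coef(auto.all_lm))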
Using a sample of 40
(auto.samp_lm.out <- summary(auto.samp_lm))
##
## Call:
## lm(formula = mpg ~ displacement + horsepower + weight + acceleration,
## data = auto.samp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8354 -3.0397 -0.3918 2.4682 15.0053
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.488155 7.818121 6.969 4.18e-08 ***
## displacement 0.011186 0.025296 0.442 0.6611
## horsepower -0.082517 0.059811 -1.380 0.1765
## weight -0.006393 0.003168 -2.018 0.0513 .
## acceleration -0.279526 0.379037 -0.737 0.4658
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.04 on 35 degrees of freedom
## Multiple R-squared: 0.6922, Adjusted R-squared: 0.657
## F-statistic: 19.68 on 4 and 35 DF, p-value: 1.456e-08
Of the p-values, weight shows the most evidence of an association with changes in the response variable mpg, but even that evidence is weak, i.e. the p-value is greater than 0.05. The other coefficients show little to no evidence for rejecting \(H_0\); in other words, they are of little use in describing the behavior of the response variable mpg.
Using entire dataset
(auto.all_lm.out <- summary(auto.all_lm))
##
## Call:
## lm(formula = mpg ~ displacement + horsepower + weight + acceleration,
## data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.378 -2.793 -0.333 2.193 16.256
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.2511397 2.4560447 18.424 < 2e-16 ***
## displacement -0.0060009 0.0067093 -0.894 0.37166
## horsepower -0.0436077 0.0165735 -2.631 0.00885 **
## weight -0.0052805 0.0008109 -6.512 2.3e-10 ***
## acceleration -0.0231480 0.1256012 -0.184 0.85388
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.247 on 387 degrees of freedom
## Multiple R-squared: 0.707, Adjusted R-squared: 0.704
## F-statistic: 233.4 on 4 and 387 DF, p-value: < 2.2e-16
Using the entire dataset, weight emerges as highly significant in explaining the behavior of the response variable. Horsepower is also significant, with a p-value below 0.01. The remaining variables have little to no significance.
auto_sig <- data.frame(summary(auto.samp_lm)$coefficients[,4,drop = FALSE],
summary(auto.all_lm)$coefficients[,4,drop = FALSE],
summary(auto.samp_lm)$coefficients[,2,drop = FALSE],
summary(auto.all_lm)$coefficients[,2,drop = FALSE]
)
colnames(auto_sig) <- c("Sample.Significance", "All.Significance",
"Sample.StdError", "All.StdError")
auto_sig
## Sample.Significance All.Significance Sample.StdError
## (Intercept) 4.183387e-08 7.072099e-55 7.81812093
## displacement 6.610665e-01 3.716584e-01 0.02529599
## horsepower 1.764543e-01 8.848982e-03 0.05981130
## weight 5.131401e-02 2.302545e-10 0.00316785
## acceleration 4.657555e-01 8.538765e-01 0.37903659
## All.StdError
## (Intercept) 2.4560446927
## displacement 0.0067093055
## horsepower 0.0165734633
## weight 0.0008108541
## acceleration 0.1256011622
The significance levels when using the entire dataset are stronger than those of the sample, but displacement and acceleration are still poor predictors of the response. The standard errors also shrink as the sample size increases, consistent with the law of large numbers.
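As an informal illustration of that shrinkage, the model can be refit on progressively larger random samples while tracking the standard error of the weight coefficient. This is only a sketch; the sample sizes are arbitrary and the exact numbers depend on the seed:
set.seed(8675309)
sapply(c(40, 100, 200, nrow(auto)), function(n) {
  fit <- lm(mpg ~ displacement + horsepower + weight + acceleration,
            data = sample_n(auto, n))
  summary(fit)$coefficients["weight", "Std. Error"]
})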
Using a sample of 40
confint(auto.samp_lm)
## 2.5 % 97.5 %
## (Intercept) 38.61652581 7.035978e+01
## displacement -0.04016776 6.253944e-02
## horsepower -0.20394037 3.890643e-02
## weight -0.01282363 3.853096e-05
## acceleration -1.04901122 4.899591e-01
Using the entire dataset
confint(auto.all_lm)
## 2.5 % 97.5 %
## (Intercept) 40.422278855 50.080000544
## displacement -0.019192122 0.007190380
## horsepower -0.076193029 -0.011022433
## weight -0.006874738 -0.003686277
## acceleration -0.270094049 0.223798050
auto_interval <- data.frame(samp_coeffs = auto.samp_lm$coefficients,
                            all_coeffs = auto.all_lm$coefficients)
The confidence intervals for the sample set are much wider than those for the entire dataset; indeed, they are wide enough to contain every coefficient estimate from the complete dataset.
The complete dataset produces much narrower intervals, in line with the weak law of large numbers: the more observations \((x_i, y_i)\) we use, the closer our estimates get to the population values.
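The width ratios quantify how much wider the sample intervals are:
samp_width <- apply(confint(auto.samp_lm), 1, diff)
all_width <- apply(confint(auto.all_lm), 1, diff)
round(samp_width / all_width, 1)   # sample interval width relative to full-data width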
A good way to visualize the fit when using multiple variables is to plot the residuals against the fitted values. We would like to see relatively uniform scatter about the zero line. We will use only the complete dataset in this step.
plot(auto.all_lm$residuals ~ auto.all_lm$fitted.values,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 3, col = "red")
The model is a fairly good fit with respect to mpg, although the residuals show a slight curve, suggesting some non-linearity.
The significance levels of some of the variables are very weak, and those variables should not be used in the model. There is a way to determine the best variables to use when fitting a model while reducing its complexity at the same time.
From the above model, we run into the problem of overfitting by using every variable. Some variables have a significance level larger than we would want, e.g. greater than 0.05, either because they lack the ability to explain the variability in mpg or because other variables are interfering with that ability. We should be asking questions such as: what is the effect of removing specific variables from the model? There are many combinations of variables that can be used in this model, specifically 15. Applying the above method to all 15 combinations would be a little maddening, weighing the number of variables used against the resulting \(R^2\) and p-values. Ideally, we want to minimize the number of variables used while maximizing the fit.
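The count of 15 is simply the number of non-empty subsets of the four predictors:
sum(choose(4, 1:4))   # 4 + 6 + 4 + 1 = 15 candidate models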
There is an R package, leaps, which automates variable selection. We will use the Bayesian Information Criterion (BIC), which seeks the best-fitting model while penalizing the number of variables used, thereby discouraging overfitting. We will keep the best 4 models at each step, i.e. for each number of variables introduced, to see the impact each variable has on the fit by itself.
library(leaps)   # provides regsubsets()
auto.leap <- regsubsets(mpg ~ displacement + horsepower + weight + acceleration,
                        data = auto, nbest = 4)
plot(auto.leap, scale = "bic")
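The BIC values behind the plot can also be inspected numerically; summary() on a regsubsets object exposes them along with the predictors in each candidate model:
auto.leap.sum <- summary(auto.leap)
auto.leap.sum$bic                                      # BIC for each candidate model
auto.leap.sum$which[which.min(auto.leap.sum$bic), ]    # predictors in the lowest-BIC model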
From the output, the lowest BIC uses horsepower and weight, so we will use these variables in our model. It is important to note that using weight by itself is nearly equal to using all of the variables.
auto.best <- lm(mpg ~ horsepower + weight, data = auto)
summary(auto.best)
##
## Call:
## lm(formula = mpg ~ horsepower + weight, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.0762 -2.7340 -0.3312 2.1752 16.2601
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.6402108 0.7931958 57.540 < 2e-16 ***
## horsepower -0.0473029 0.0110851 -4.267 2.49e-05 ***
## weight -0.0057942 0.0005023 -11.535 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.24 on 389 degrees of freedom
## Multiple R-squared: 0.7064, Adjusted R-squared: 0.7049
## F-statistic: 467.9 on 2 and 389 DF, p-value: < 2.2e-16
The resulting p-values are all highly significant, and the \(R^2\) value is nearly identical to the one obtained using all of the variables. We have therefore dramatically reduced the complexity of the model while obtaining essentially the same fit.
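A nested-model F-test makes the same point formally: under the null hypothesis the dropped terms (displacement and acceleration) contribute nothing, and a large p-value here supports keeping the reduced model:
anova(auto.best, auto.all_lm)   # compares the reduced model against the full model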
The resulting equation is:
\[\widehat{mpg} = 45.6402108 - 0.0473029\,horsepower - 0.0057942\,weight\]
The resulting fit can be plotted using the residuals as carried out previously:
plot(auto.best$residuals ~ auto.best$fitted.values,
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 3, col = "red")
Comparing this plot to the one generated above, they are nearly identical; since the complexity has been dramatically reduced, this model is much preferable to the original with all variables used.
Comment on fit and significance
The model fits the data very well visually. In addition, the highly significant p-values make it very unlikely that the model predicts the response variable well by pure chance.
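Finally, the reduced model is simple enough to use for prediction directly. The input values below are arbitrary and serve only to show the interface:
new_car <- data.frame(horsepower = 100, weight = 2500)
predict(auto.best, newdata = new_car, interval = "prediction")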