In this R guide we will cover concepts related to multiple linear regression (MLR). Multiple linear regression is used to analyze the linear relationship between a quantitative response and multiple predictors. For multiple linear regression, the predictors can be quantitative or qualitative, and when building an MLR model one can choose to include interaction terms. Since there are different types of predictors, there is a variety of MLR models that can be created. In this R guide we will create one model without an interaction term, one model with an interaction term, one transformed model, and one polynomial model. We will describe these models in detail in their corresponding sections.
For this document we will be using the Melanoma dataset in the MASS library. Let’s call and attach our data so that we won’t have to continuously reference it throughout our work.
library(MASS)
data(Melanoma)
attach(Melanoma)
Now that our dataset is attached, let’s become more familiar with it. First, type ?Melanoma to learn why and how the data was collected, what the variables are, and how many entries there are in the dataset. Then, use the head command to see the first six rows of our dataset and the names command to view all of the variables. Finally, use the summary command to see an overview of the data.
?Melanoma
head(Melanoma)
## time status sex age year thickness ulcer
## 1 10 3 1 76 1972 6.76 1
## 2 30 3 1 56 1968 0.65 0
## 3 35 2 1 41 1977 1.34 0
## 4 99 3 0 71 1968 2.90 0
## 5 185 1 1 52 1965 12.08 1
## 6 204 1 1 28 1971 4.84 1
names(Melanoma)
## [1] "time" "status" "sex" "age" "year" "thickness"
## [7] "ulcer"
summary(Melanoma)
## time status sex age
## Min. : 10 Min. :1.00 Min. :0.0000 Min. : 4.00
## 1st Qu.:1525 1st Qu.:1.00 1st Qu.:0.0000 1st Qu.:42.00
## Median :2005 Median :2.00 Median :0.0000 Median :54.00
## Mean :2153 Mean :1.79 Mean :0.3854 Mean :52.46
## 3rd Qu.:3042 3rd Qu.:2.00 3rd Qu.:1.0000 3rd Qu.:65.00
## Max. :5565 Max. :3.00 Max. :1.0000 Max. :95.00
## year thickness ulcer
## Min. :1962 Min. : 0.10 Min. :0.000
## 1st Qu.:1968 1st Qu.: 0.97 1st Qu.:0.000
## Median :1970 Median : 1.94 Median :0.000
## Mean :1970 Mean : 2.92 Mean :0.439
## 3rd Qu.:1972 3rd Qu.: 3.56 3rd Qu.:1.000
## Max. :1977 Max. :17.42 Max. :1.000
From our output we see that there are 205 entries in our dataset, and we’re given the names and meanings of our variables. We can also see the range of values for each of our variables. Becoming familiar with the data is a necessary first step to working with the data and interpreting the results, but there is no need to submit the results from the previous commands.
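If all we want to confirm is the number of observations and variables, we can also ask for the dimensions directly; this is optional, and its output (205 rows and 7 columns) is not shown here.
dim(Melanoma)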
The first type of MLR model that we will create has two quantitative predictors without an interaction term. This type of model is used when we want to analyze how each predictor individually affects our response, and we don’t think that the interaction between our predictors will impact our model. This type of model has the following form: \[\hat{y_i}= \hat{\beta_0}+\hat{\beta_1} x_{i1}+\hat{\beta_2} x_{i2}\] where: \(\hat{y_i}\) = predicted response
\(\hat{\beta_0}\) = estimate of the intercept
\(\hat{\beta_1}\) = estimate for the regression coefficient for the first predictor
\(\hat{\beta_2}\) = estimate for the regression coefficient for the second predictor
Using the Melanoma dataset, let’s create the model. Our response is survival time, which is the number of days that a person with malignant melanoma survived once they were diagnosed. Our first predictor is age, which is the age in years of the patient, and the second predictor is thickness, which is the patient’s tumor thickness in millimeters (mm). The reasoning behind these variable choices is that the age of the patient and/or the thickness of their tumor may affect how long the patient survives.
Now that our variables are assigned, let’s create our model. Use the lm command, inputting our response, followed by “~”; then enter the name of one predictor, followed by “+” and the name of our second predictor. Since we aren’t considering an interaction term in our model we use “+” in our input. Let’s name our model “mod1”, so that we can reference it later.
mod1 <- lm(time ~ age + thickness)
mod1
##
## Call:
## lm(formula = time ~ age + thickness)
##
## Coefficients:
## (Intercept) age thickness
## 3281.47 -17.73 -68.04
Our output tells us the values for each of our regression coefficients. The intercept for our model is 3281.47. However, the intercept cannot be meaningfully interpreted because an age of zero is outside the range of ages in our dataset. Our coefficient for age is -17.73, which means that, while holding a patient’s tumor thickness constant, for every one year increase in a patient’s age we would expect the patient’s survival time to decrease by 17.73 days. Our coefficient for tumor thickness is -68.04, which means that, while holding age constant, for every 1 mm increase in a patient’s tumor thickness we would expect their survival time to decrease by 68.04 days. Therefore, we can write out our model in the following form: \[\hat{time}= 3281.47-17.73age_i-68.04 thickness_i\]
Before we move on to our analysis, we need to see if our model is one that will produce accurate outputs. By looking at the diagnostic plots for our model and checking the linear regression assumptions we are able to determine how reliable our model is.
Use the plot command, inputting our model name, to see the four diagnostic plots of our model. Also run par(mfrow = c(2,2)) prior to the plot command so that the plots appear in a 2x2 grid and all four can be viewed at the same time.
par(mfrow = c(2,2))
plot(mod1)
The Residual vs. Fitted Values plot gives the fitted values of our model plotted against their residuals. When looking at this plot, we want our residuals to be homoscedastic, meaning that they have a constant variance, because the constant variance assumption is one of four assumptions we make when we build our MLR model. If our residuals are not homoscedastic, our MSE will be inflated or deflated, causing our confidence and prediction intervals to be too wide for closely packed residuals or too narrow for widely spread residuals. We also do not want there to be a trend in our residuals; if we see a trend in our residuals this means that our mean function is incomplete, and our point predictions will not be accurate.
By looking at the Residual vs. Fitted Values plot for our model, we see that there is not a significant trend in our residuals. However, our residuals are not homoscedastic. This is an issue that can be resolved by applying a transformation to one variable or multiple variables in our model. We will transform our model after we explain the remaining diagnostic plots and regression assumptions.
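If you would rather draw this first diagnostic plot yourself, a minimal sketch using the fitted and resid helper functions is shown below; it contains the same information as the first panel produced by plot(mod1).
plot(fitted(mod1), resid(mod1), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0) # horizontal reference line at zero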
The second plot given by the plot command is the Normal Q-Q plot, which illustrates how closely the residuals follow a normal distribution. Another regression assumption is that our residuals are normally distributed, and we want this assumption to hold for our model. This means that we want our residuals to fall along the diagonal line in the Normal Q-Q plot. If our residuals are not normally distributed our confidence and prediction intervals will be too wide or too narrow because the tests that we use to calculate these intervals are based on the quantiles of a normal distribution.
By looking at the Normal Q-Q plot for our model we see that the data points fall very close to the diagonal line, so we can conclude that our residuals are normally distributed.
The third plot is the Scale-Location plot, which plots the square roots of the absolute standardized residuals against the fitted values. For the purposes of this R guide, it serves the same role as the Residual vs. Fitted Values plot: checking for non-constant variance.
The final diagnostic plot is the Residuals vs. Leverage plot, which we will refer to as the Cook’s Distance plot because it uses Cook’s distance (the red dashed contour lines) to illustrate the influence that each data point has on the fitted model. We want all of the data points to influence the fit roughly equally, so we want all of the points to fall inside the red lines in this plot. If some data points have a significantly higher influence than others, the outputs of our model would be skewed towards those influential points.
By looking at the Cook’s Distance plot for our model we see that all of the points fall between the red lines. Thus, no one point has a significantly higher weight on our response than the others.
In addition to the diagnostic plots, there are linear regression assumptions that we need to consider when evaluating our model. When we create an MLR model we make four assumptions, and if they hold we can rely on our model to produce accurate estimates and intervals, but if one or more of our assumptions doesn’t hold our model may not produce reliable outputs.
The significance of the constant variance assumption and the normality assumption was explained in the previous section, and we concluded that the normality assumption holds for our model but the equal variance assumption does not.
Another regression assumption is that the mean of the residuals equals zero. If this assumption doesn’t hold, our point estimates, confidence intervals, and prediction intervals will not be accurate because the tests that we use to calculate these intervals are based on the quantiles of a normal distribution with a mean of 0. When R fits a least-squares model with an intercept, the residuals are automatically forced to sum to zero, so we don’t need to check this assumption.
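If you want to see this for yourself, the mean of the residuals can be computed directly from the fitted model; it will be zero up to floating-point rounding, so the output is not shown here.
mean(resid(mod1)) # effectively zero for any least-squares fit with an intercept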
The final assumption is the independence assumption, which states that any value of our error term is statistically independent of any other error value. If any data points are dependent on others, the standard error for our model will be underestimated, which will cause our point predictions, confidence intervals, and prediction intervals to be inaccurate.
During our data exploration we learned that the Melanoma data was collected from 205 patients in Denmark with malignant melanoma. There was no evidence that these observations were dependent on one another, and we assume that the survival time and medical care received were independent for each of these patients. Thus, we can conclude that the independence assumption holds for our model.
Since the constant variance assumption does not hold for our model, we will use transformations to create a model with residuals which better meet this assumption. We can take the square-root (sqrt), natural log (log), or reciprocal (1/variable name) of our response and/or one or more of our predictors to transform our model. Through a process of creating different models and comparing their diagnostic plots, we will be able to come up with a better model for our data.
Below is the code for various transformations of our model. We want to use whichever model is the most homoscedastic, with normally distributed residuals. We also want to use a model that doesn’t have a trend in the residuals, because a trend in the residuals will cause point predictions to be incorrect. Also, we want all of the data points to be inside the red lines in the Cook’s Distance plot because we want all of our data points to have an equal amount of influence on our response. Keep in mind that there are many more possible transformations than those listed below.
To save space in this R guide, none of these outputs will be provided.
mod1a <- lm(log(time) ~ age + thickness)
mod1b <- lm(sqrt(time) ~ age + thickness)
mod1c <- lm(time ~ sqrt(age) + thickness)
mod1d <- lm((time) ~ (age) + log(thickness))
mod1e <- lm(time ~ age + I(1/thickness)) # I() is needed so that / means division rather than formula nesting
mod1f <- lm(log(time) ~ age + I(1/thickness))
mod1g <- lm(sqrt(time) ~ log(age) + thickness)
par(mfrow = c(2,2))
plot(mod1a)
plot(mod1b)
plot(mod1c)
plot(mod1d)
plot(mod1e)
plot(mod1f)
plot(mod1g)
After looking at the diagnostic plots for these models and others, it’s clear that taking the natural log of tumor thickness creates the best model. The diagnostic plots are shown below.
mod_log <- lm((time) ~ (age) + log(thickness))
par(mfrow = c(2,2))
plot(mod_log)
By looking at the Residuals vs. Fitted plot and the Scale-Location plot, we are able to see that the residuals for our new model are homoscedastic, which leads us to believe that this model is better than our original model with no transformations. We also see that there is a very minor trend in our residuals, but since the trend is so subtle it is acceptable to use this model. Also, the Normal Q-Q plot shows us that our residuals are normally distributed, and the Cook’s Distance plot shows us that all of our data points have a relatively equal influence on our response. Lastly, transforming our model doesn’t affect the independence of our observations, so we can conclude that the independence assumption holds for our new model.
Thus, all of the linear regression assumptions hold for our model, so we can use it for further analysis. Let’s call on our model to access the coefficients and write out the equation for our model. Recall, our new model name is “mod_log”.
mod_log
##
## Call:
## lm(formula = (time) ~ (age) + log(thickness))
##
## Coefficients:
## (Intercept) age log(thickness)
## 3236.42 -18.66 -168.99
Thus, our model can be written in the following form: \[\hat{time}= 3236.42-18.66age_i-168.99\log(thickness_i)\] So for every one year increase in a patient’s age, we expect their survival time to decrease by 18.66 days, and for every one-unit increase in the natural log of a patient’s tumor thickness, we expect their survival time to decrease by 168.99 days, all else being held equal.
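Because thickness enters the model on the log scale, it can be easier to think in terms of multiplicative changes in thickness. As a quick sketch, the expected change in survival time when a tumor’s thickness doubles (so log(thickness) increases by log(2)) can be computed from the fitted coefficient:
coef(mod_log)["log(thickness)"] * log(2) # roughly a 117-day decrease in expected survival time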
Next, let’s examine the multicollinearity of our predictors. Multicollinearity exists in a model when the predictors are correlated with one another, and it is a problem because it inflates the standard errors of a model. Inflated standard errors will cause confidence intervals to be too wide, and they make it more difficult to determine the importance of the predictors in the model. If the absolute value of the correlation coefficient for two predictors is greater than 0.9, the model raises concern and may not be reliable.
We can use the cor command, inputting our two predictors, to calculate the correlation between them.
cor(age, log(thickness))
## [1] 0.1588602
The correlation between age and the natural log of tumor thickness is 0.1588602. The number is positive, which means that as one predictor increases, we would expect the other one to increase. However, since the correlation coefficient is so close to zero we can conclude that there is very little correlation between our predictors. Thus, we can continue using our model without concern.
Since our correlation coefficient wasn’t exactly zero, there is still some collinearity in our model, so the standard errors of our coefficients are slightly inflated. The variance inflation factor (VIF) measures how inflated a coefficient’s standard error is. For a given predictor it is calculated as \(1/(1-R_j^2)\), where \(R_j^2\) is the coefficient of determination obtained by regressing that predictor on the remaining predictors. If the VIF for a predictor is greater than 10 we should be concerned that our model may not be reliable.
With only two predictors, \(R_j^2\) is simply the square of the correlation coefficient that we calculated above, so we can compute the VIF directly.
VIF <- 1/(1 - cor(age, log(thickness))^2)
VIF
The VIF for our predictors is approximately 1.026. This is well below 10, so we do not need to be concerned about the standard errors for our model being too inflated.
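As a cross-check, the vif function in the car package (assuming that package is installed; it is not otherwise used in this guide) reports the VIF for each predictor directly from the fitted model. Its output is not shown here.
library(car)
vif(mod_log)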
Now let’s calculate the MSE of our model (denoted \(s^2\)), which is the point estimate for the constant variance of our residuals (denoted \({\sigma}^2\)). As its name implies, the MSE is the average of the squared errors for our model. The summary command can be used to find the mean square error.
summary(mod_log)
##
## Call:
## lm(formula = (time) ~ (age) + log(thickness))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2386.8 -621.5 -77.3 699.6 3273.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3236.418 245.520 13.182 < 2e-16 ***
## age -18.664 4.516 -4.133 5.25e-05 ***
## log(thickness) -168.992 74.393 -2.272 0.0242 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1062 on 202 degrees of freedom
## Multiple R-squared: 0.1136, Adjusted R-squared: 0.1048
## F-statistic: 12.94 on 2 and 202 DF, p-value: 5.161e-06
In our output, the “Residual standard error” gives us the standard error of our model, which is a point estimate for the constant standard deviation of our errors. The MSE is the square of this, so the MSE for our model is approximately \(1062^2 = 1127844\). Thus, the point estimate for the constant variance of our residuals is about 1127844.
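If we would rather avoid the rounding that comes from squaring the printed residual standard error, we can pull the unrounded value out of the summary object instead; this returns approximately 1127099, and the small difference from the hand calculation above is due only to the rounding of the printed residual standard error.
summary(mod_log)$sigma^2 # MSE computed from the unrounded residual standard error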
Now let’s check to see if there is a significant relationship between our response, survival time, and our predictors, age and the natural log of tumor thickness. This is important because our linear regression model is not useful unless there is a significant relationship between our response and predictors.
For the following tests we will use a significance level of \({\alpha} = 0.05\).
First, we can use an F-test to see if our response has a significant relationship with any of our predictors. The null hypothesis states that survival time is not related to age or the natural log of tumor thickness. In other words, it states that both predictors have coefficients of 0 in our regression model. The alternative hypothesis states that survival time is related to at least one of our predictors. Equivalently, this means that at least one of our predictors has a coefficient not equal to 0. Our hypotheses can be written in the following forms: \[H_0:{\beta_1}= {\beta_2}=0\] \[H_A: \text{at least one of } {\beta_1}, {\beta_2} \neq 0\] If the test produces a p-value that is smaller than our significance level we will reject the null hypothesis in favor of the alternative and conclude that survival time is related to at least one of our predictors.
We can use the summary command to access the results of the F-test.
summary(mod_log)
##
## Call:
## lm(formula = (time) ~ (age) + log(thickness))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2386.8 -621.5 -77.3 699.6 3273.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3236.418 245.520 13.182 < 2e-16 ***
## age -18.664 4.516 -4.133 5.25e-05 ***
## log(thickness) -168.992 74.393 -2.272 0.0242 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1062 on 202 degrees of freedom
## Multiple R-squared: 0.1136, Adjusted R-squared: 0.1048
## F-statistic: 12.94 on 2 and 202 DF, p-value: 5.161e-06
The last line of the summary output yields the statistics for the F-test. The degrees of freedom for the F-distribution are 2 (\(df_1\) = number of parameters - 1) and 202 (\(df_2\) = number of observations - number of parameters). The F-statistic is 12.94, and the corresponding p-value is \(5.16 * 10^{-6}\), which is less than our significance level. Thus, we can conclude that survival time has a significant relationship with age and/or the natural log of tumor thickness.
Next, we can use a T-test to determine which of our predictors has a significant relationship with survival time after accounting for the other predictor. For example, we can test whether or not survival time is related to age after accounting for the natural log of tumor thickness. The null hypothesis states that after accounting for the other predictor, survival time is not related to the predictor we are testing, which means that the predictor has a coefficient of 0 in our model. The alternative hypothesis states that after accounting for the other predictor, survival time is related to the predictor of concern, so the predictor has a coefficient not equal to 0. Our hypotheses can be written in the following form: \[H_0:{\beta_i}=0\] \[H_A:{\beta_i} \neq 0\] where \(i\) equals 1 when we’re testing the relationship between survival time and age, and \(i\) equals 2 when we’re testing the relationship between survival time and the natural log of tumor thickness.
If the test shows that our predictor is significant, we will reject the null hypothesis in favor of the alternative and conclude that survival time is related to our predictor.
We can look at the summary output above to get the needed information for the T-test. In the “Residual standard error” row of the output we see that the T-distribution has 202 degrees of freedom (df = number of observations - number of parameters). In the “age” row of the output we see that the test statistic for age is -4.133, and the corresponding p-value is \(5.25 * 10^{-5}\). This p-value is less than our significance level of 0.05, so we can conclude that survival time is significantly related to age. In the “log(thickness)” row of our output we see that the test statistic for log(thickness) is -2.272, and the corresponding p-value is 0.0242. This p-value is greater than the p-value for age, but it’s still less than our significance level of 0.05. Therefore, we can conclude that survival time is also significantly related to the natural log of tumor thickness. However, note that if we were using a smaller significance level, such as \({\alpha} = 0.01\), we would conclude that survival time is not significantly related to the natural log of tumor thickness.
Next, we can use our multiple linear regression model to make predictions about our data. We can select a patient from our dataset and enter their age and tumor thickness into our model to get a predicted value of their survival time.
For example, let’s predict the survival time for the 100th patient in our data. The survival time, age, and tumor thickness for this patient are in the 100th row of our dataset, and to access these values we use indexing. Specifically, we list the name of the dataset, followed by a set of brackets. In the brackets we list the row and column numbers for the value we want. Since we want to know the age of the patient in the 100th row, we will list 100 as the row number and 4 as the column number, because age is the variable in the fourth column of our data. Let’s call this value “obsAge”. We will repeat this indexing with 6 as the column number for tumor thickness because that variable is listed in the sixth column of our dataset. Let’s call that value “obsThick”.
obsAge <- Melanoma[100, 4]
obsAge
## [1] 69
obsThick <- Melanoma[100,6]
obsThick
## [1] 12.88
Our output tells us that the age of the 100th patient is 69 years, and their tumor is 12.88mm thick.
Now, we can use this age and tumor thickness to predict the survival time for this patient. To do so, we will use the coef command and matrix multiplication. To use the coef command we input the name of our model, and it outputs the coefficients of our model in a vector of length three, with the intercept listed first, followed by the coefficient for age and the coefficient for the natural log of tumor thickness. Then we multiply this vector by another vector of length three with 1 as the first entry, followed by the observed age and the natural log of the observed tumor thickness. This will give us the estimate for the patient’s survival time. Use %*% to matrix multiply.
predTime <- coef(mod_log)%*%c(1, obsAge, log(obsThick))
predTime
## [,1]
## [1,] 1516.746
Our output tells us that the predicted survival time for the patient in the 100th row is 1516.746 days.
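Matrix multiplication makes the arithmetic explicit, but the same prediction can be obtained with the predict command, which applies the log transformation for us because it is part of the model formula; this returns the same 1516.746 days.
predict(mod_log, newdata = data.frame(age = obsAge, thickness = obsThick))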
Now we can calculate the residual to see how close this patient’s predicted survival time is to their actual survival time. First, let’s find out what the patient’s actual survival time is. Again, we will use indexing. Our row number is still 100, but now our column number is 1 because survival time is listed in the first column of our dataset.
obsTime <- Melanoma[100, 1]
obsTime
## [1] 1958
The patient’s actual survival time is 1958 days.
The residual is the observed value minus the expected value, or observed minus predicted. Thus, to find the residual we will subtract this patient’s predicted survival time from their actual survival time.
resid <- obsTime - predTime
resid
## [,1]
## [1,] 441.2536
Our residual is 441.2536, which means that our model underestimated this patient’s survival time by 441.2536 days.
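Since this patient is one of the observations used to fit the model, we could also read this residual straight from the fitted model object; this is just a quicker route to the same 441.2536.
residuals(mod_log)[100] # residual for the 100th observation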
Now we can use our model to create several intervals.
First let’s calculate confidence intervals for each of our regression coefficients. The confint command will give us the lower and upper bounds for each of these confidence intervals. Let’s calculate 95% confidence intervals for each of our parameters. The confint command defaults to a 95% confidence level, so the only argument that we need to provide is the name of our model.
confint(mod_log)
## 2.5 % 97.5 %
## (Intercept) 2752.30759 3720.528777
## age -27.56771 -9.759347
## log(thickness) -315.67821 -22.305464
Looking at the first line of the output, we are 95% confident that the true intercept for our model, \({\beta_0}\), is between 2752.30759 and 3720.528777. However, keep in mind that for our specific model we can’t interpret this interval because our intercept can’t be interpreted.
Looking at the second line of the output, we are 95% confident that the true value for our age coefficient, \({\beta_1}\), is between -27.56771 and -9.759347. Thus, we are 95% confident that on average, while holding tumor thickness constant, a one year increase in a patient’s age is associated with a decrease in survival time of between 9.759347 and 27.56771 days. Put another way, if we repeatedly drew random samples and constructed an interval in this way each time, about 95% of those intervals would contain the true coefficient.
From the last line, we are 95% confident that the coefficient for log(thickness), \({\beta_2}\), is between -315.67821 and -22.305464. Therefore, we are 95% confident that on average, while holding age constant, a one unit increase in the natural log of a patient’s tumor thickness is associated with a decrease in survival time of between 22.305464 and 315.67821 days.
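The confint command also accepts a level argument if we want a confidence level other than the default 95%; for example, 99% intervals (output not shown) can be requested as follows.
confint(mod_log, level = 0.99)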
Confidence intervals can also be constructed for other quantities related to our dataset. For example, let’s construct a 95% confidence interval for the average survival time of patients with an age of 69 years and a tumor thickness of 12.88mm. To do so, we need to create a new data frame that holds the predictor values we are interested in. Use the data.frame command, and input the arguments “age=69” and “thickness=12.88” to generate it. Let’s call our new data frame “newdata”.
newdata <- data.frame(age = 69, thickness = 12.88)
Now we can create our confidence interval for the mean survival time of these patients. We will use the predict command with three arguments. First we list the name of our model and the name of our new data frame. Then we tell R that we want to construct a confidence interval. A 95% confidence interval is the default, so we do not need to specify our confidence level. Let’s name this interval “conf”.
conf <- predict(mod_log, newdata, interval = "confidence")
conf
## fit lwr upr
## 1 1516.746 1184.28 1849.213
The first column of the output gives us an estimate of the mean survival time of all patients who are 69 years old and have a tumor that is 12.88mm thick; the second and third columns give us the lower and upper bounds of our 95% confidence interval. From the first column, the estimated average survival time for patients who are 69 years old and have a tumor that is 12.88mm thick is 1516.746 days. Also, from the bounds of the confidence interval we are 95% confident that the average survival time for patients who are 69 years old with a tumor that is 12.88mm thick is between 1184.28 and 1849.213 days.
In addition, we can create prediction intervals for an individual observation. To demonstrate this, let’s create a 95% prediction interval for an individual patient whose age is 69 and tumor thickness is 12.88mm. Notice that instead of looking at the average survival time of all patients with these characteristics, we now want to estimate the survival time of a single patient. Since we are still using the age of 69 years and tumor thickness of 12.88mm, we can reuse the new data frame that we created for the last calculation. However, this time we will tell R that we want to create a prediction interval, not a confidence interval. Let’s name this interval “pred”.
pred <- predict(mod_log, newdata, interval = "prediction")
pred
## fit lwr upr
## 1 1516.746 -602.826 3636.319
Now we predict that an individual patient who is 69 years old with a tumor that is 12.88mm thick will survive 1516.746 days. Also, we are 95% confident that if a patient is 69 years old and has a 12.88mm thick tumor, their survival time will fall between -602.826 and 3636.319 days. Notice that the lower bound for our prediction interval is negative, but survival time can’t be a negative number, so we say that their predicted survival time is between 0 and 3636.319 days. Notice also that this interval is centered around the same value as the confidence interval in the previous section, but it is much wider than the confidence interval because we are now predicting for an individual rather than estimating a mean.
The model used previously in the analysis of the Melanoma data did not include an interaction term for age and the natural log of tumor thickness. So, in that model, it was assumed that the relationship between the survival time of the patient and the thickness of the tumor was not dependent on the age of the patient, and similarly, that the relationship between survival time and age was not dependent on the thickness of the tumor. However, we do not know that this is the case, and such a relationship may exist in this data. We can create a model that includes a term to account for this interaction by using the “*” operator instead of “+”:
(mod_Interact <- lm(time ~ age * log(thickness), data = Melanoma))
##
## Call:
## lm(formula = time ~ age * log(thickness), data = Melanoma)
##
## Coefficients:
## (Intercept) age log(thickness)
## 3126.752 -16.541 -8.280
## age:log(thickness)
## -2.879
Then, the regression equation relating age and thickness with an interaction term is \[\hat{\text{time}} = 3126.752 -16.541\times \text{age} - 8.280\times \log(\text{thickness}) - 2.879\times \text{age}\times \log(\text{thickness}).\]
While it is easy to add such a term relating age and tumor thickness to our model, if there is no interaction between age and thickness, then it is not beneficial to include this term. We can use hypothesis tests to determine whether or not the interaction term is needed to accurately model the data by looking at the summary of the model.
For the following test, we will use a significance level of \(\alpha = 0.05\).
One way we can test to see if there is a relationship between survival time and the interaction between age and tumor thickness is by using a T-test. We conduct the t-test in the same manner we did previously. Similar to the previous t-test, our null hypothesis is that there is no relationship between this term and the response variable: \[H_0: \beta_i = 0\] The alternative hypothesis is that there is such a relationship: \[H_A: \beta_i \neq 0\] We look at the interaction model summary to find the results of the t-test.
summary(mod_Interact)
##
## Call:
## lm(formula = time ~ age * log(thickness), data = Melanoma)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2383.2 -611.6 -103.3 700.5 3250.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3126.752 306.394 10.205 <2e-16 ***
## age -16.541 5.742 -2.881 0.0044 **
## log(thickness) -8.280 278.017 -0.030 0.9763
## age:log(thickness) -2.879 4.798 -0.600 0.5492
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1063 on 201 degrees of freedom
## Multiple R-squared: 0.1151, Adjusted R-squared: 0.1019
## F-statistic: 8.718 on 3 and 201 DF, p-value: 1.829e-05
The interaction term is listed in the Coefficients table as age:log(thickness). By looking at the p-value of the t-test, in the last column labeled Pr(>|t|), we can decide whether or not the interaction term is significant. Since the p-value of \(0.5492\) is larger than our significance level of \(0.05\), we fail to reject the null hypothesis; there is no evidence that this term adds value to the model. In other words, we have no evidence that the relationship between the natural log of tumor thickness and survival time depends on age, or that the relationship between age and survival time depends on the natural log of tumor thickness. So, we do not need to include the interaction term in our model.
We can get the same result by doing a partial F-test. A partial F-test compares two nested models, and since our two models are identical except for the interaction term, the test tells us whether adding that term significantly improves the fit. We complete a hypothesis test, where the null hypothesis is that the interaction term is insignificant, and the alternative hypothesis is that the interaction term is significant.
We use the anova command to compare these models, with both models as inputs.
anova(mod_log, mod_Interact)
## Warning in anova.lmlist(object, ...): models with response '"time"' removed
## because response differs from model 1
## Analysis of Variance Table
##
## Response: (time)
## Df Sum Sq Mean Sq F value Pr(>F)
## age 1 23350122 23350122 20.7170 9.185e-06 ***
## log(thickness) 1 5816069 5816069 5.1602 0.02416 *
## Residuals 202 227673917 1127099
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Notice the warning at the top of the output: because we wrote the response as (time) when fitting mod_log but as time in mod_Interact, R treats the two responses as different and drops the interaction model from the comparison, so the table shown is just the sequential ANOVA for mod_log rather than the partial F-test we wanted. If we refit mod_log with the response written simply as time and rerun the comparison, the partial F-test gives the same p-value as the t-test above, \(0.5492\); when only a single term is added, the partial F-statistic is the square of the t-statistic. Since this p-value is larger than \(\alpha\), we again fail to reject the null hypothesis and conclude that there is no evidence of an interaction between the two predictors.
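For completeness, one way to rerun the comparison so that both fits specify the response identically (and anova actually performs the partial F-test) is sketched below; its output is omitted here.
mod_log2 <- lm(time ~ age + log(thickness), data = Melanoma) # same model as mod_log, response written as time
anova(mod_log2, mod_Interact)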
Thus far, we have looked at the linear relationship between predictors and a response as well as a logarithmic transformation of that relationship. Additionally, we can consider a case where a predictor and a response have a polynomial relationship. To demonstrate such a case, we will use the Cars93 dataset from the MASS library and look at the relationship between the average price of a car and miles per gallon in the city. First, we call and attach the data before plotting the two variables to see the general relationship between them.
library(MASS)
data("Cars93")
attach(Cars93)
plot(Price, MPG.city)
From the plot, it appears that there is a negative correlation between price and City MPG, but the relationship does not appear to be linear.
Even though the plot shows that the pattern of the points is slightly curved, we fit a linear model anyway, since this will help us see whether or not a linear model is appropriate.
mod_Linear <- lm(Price ~ MPG.city)
We look at a residual plot to determine whether or not a linear model is the best choice for the relationship between car price and city MPG. The residuals should be fairly evenly spread out, with approximately the same number of points above and below zero across the range of fitted values. If this is not the case, then the model we have chosen does not accurately fit the data.
plot(mod_Linear$residuals ~ mod_Linear$fitted.values)
abline(0,0)
From looking at the residual plot, we can see that the residuals are not evenly spread out. For fitted values from roughly \(5\) to \(15\), there are almost no residuals above zero, while there are many residuals above zero for the remaining fitted values. Therefore, we can conclude that this linear model is not correctly fitting the data.
Since a linear model was not appropriate, we can look at other models. For example, we can look at a quadratic model instead of a linear model.
To create a quadratic model, we add the square of miles per gallon, inside I(), to our linear model:
mod_Quad <- lm(Price ~ MPG.city + I(MPG.city^2))
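An equivalent way to specify polynomial terms is with the poly function; with raw = TRUE it uses ordinary powers of the predictor, so the sketch below produces the same fitted values and coefficients as the I() version above.
mod_Quad2 <- lm(Price ~ poly(MPG.city, 2, raw = TRUE))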
Again, we look at the residual plot to determine whether or not the model accurately fits the data:
plot(mod_Quad$residuals ~ mod_Quad$fitted.values)
abline(0,0)
From the residual plot, it seems that the quadratic model is a better fit for the data than the linear model. The residuals are more evenly spread out, and there is less of a pattern.
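To see the fit itself rather than just its residuals, we can overlay the fitted quadratic curve on a scatterplot of the data (this time with Price on the vertical axis, since it is the response); this is a quick sketch that evaluates the fitted model over a grid of city MPG values.
plot(MPG.city, Price)
mpg_grid <- seq(min(MPG.city), max(MPG.city), length.out = 100) # grid of city MPG values
lines(mpg_grid, predict(mod_Quad, newdata = data.frame(MPG.city = mpg_grid))) # fitted quadratic curve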
We can check to confirm that the quadratic term is significant and is needed to better model the relationship between city miles per gallon and car price. We do this by looking at the summary of the quadratic model. Similarly to the way we checked to see if the interaction term was significant in the previous example, we use a t-test to see if the quadratic term is significant. Our null hypothesis is that it is not, \[H_0: \beta = 0\], and our alternative hypothesis is that the quadratic term is significant, or \[H_A: \beta \neq 0\]. We look at the model summary to find the results of the t-test.
summary(mod_Quad)
##
## Call:
## lm(formula = Price ~ MPG.city + I(MPG.city^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.263 -4.436 -0.848 2.664 38.341
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 77.95022 10.14757 7.682 1.84e-11 ***
## MPG.city -3.85901 0.78148 -4.938 3.61e-06 ***
## I(MPG.city^2) 0.05244 0.01422 3.686 0.000388 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.319 on 90 degrees of freedom
## Multiple R-squared: 0.4383, Adjusted R-squared: 0.4258
## F-statistic: 35.12 on 2 and 90 DF, p-value: 5.334e-12
The p-value is found in the last column of the row labeled I(MPG.city^2) under Pr(>|t|). The p-value of \(0.000388\) is small, so we can conclude that the quadratic term is significant and should be included in the model. So, the quadratic model is a better fit for modeling the relationship between city miles per gallon and average car price than the linear model.
Similarly to the quadratic model, we can create a model that includes a third degree term for the predictor. We make this model the same way as the quadratic model, except this time we add a term where we raise city MPG to the third power:
mod_3rd <- lm(Price ~ MPG.city + I(MPG.city^2) + I(MPG.city^3))
We look at the residuals of the model to determine whether or not it accurately fits the data:
plot(mod_3rd$residuals ~ mod_3rd$fitted.values)
abline(0,0)
This residual plot looks very similar to the residual plot of the quadratic model, so it is possible that the additional term is unnecessary.
We can look at the model summary to determine whether or not the third degree term is significant:
summary(mod_3rd)
##
## Call:
## lm(formula = Price ~ MPG.city + I(MPG.city^2) + I(MPG.city^3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.035 -4.511 -0.844 2.589 38.308
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73.2257052 39.3988447 1.859 0.0664 .
## MPG.city -3.3080612 4.5068106 -0.734 0.4649
## I(MPG.city^2) 0.0322051 0.1635939 0.197 0.8444
## I(MPG.city^3) 0.0002324 0.0018717 0.124 0.9015
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.36 on 89 degrees of freedom
## Multiple R-squared: 0.4384, Adjusted R-squared: 0.4195
## F-statistic: 23.16 on 3 and 89 DF, p-value: 3.599e-11
The p-value for the third degree term is \(0.9015\), found under Pr(>|t|) in the last row of the Coefficients section. Since this p-value is large, we conclude that the term is unnecessary and that the quadratic model is a better fit for the data.
When choosing a model, it can often be difficult to determine which predictors should be included and which are irrelevant. There are several ways that such a decision can be made. A model can be built using either a stepwise method or the all subsets regression method. Both stepwise methods, forward selection and backward elimination, begin with a starting model and then add or remove one predictor at a time until no further change improves the model, leaving only the necessary predictors. The all subsets regression method fits a model with every possible combination of predictors and compares a fit criterion, such as the AIC, across them. In this R guide, we will provide examples only for the two stepwise methods, but a sketch of the all subsets approach is shown below.
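Although we won’t pursue it further, all subsets regression is available through the regsubsets function in the leaps package (assuming that package is installed; it is not otherwise used in this guide). A minimal sketch for the predictors used in the next example follows; its summary reports the best subset of each size, and its output is not shown here.
library(leaps)
allsub <- regsubsets(Price ~ MPG.city + MPG.highway + EngineSize + Horsepower + RPM + Rev.per.mile + Fuel.tank.capacity, data = Cars93)
summary(allsub)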
The backward elimination method begins with a model that includes every predictor. Gradually, predictors are eliminated, beginning with the least significant predictor. The elimination ends when removing any of the remaining predictors results in a worse model (a higher AIC). We use the Cars93 dataset in the MASS library again to demonstrate this method. We will build a model that predicts the price of a car based on the following predictors: miles per gallon in the city, miles per gallon on the highway, engine size, horsepower, revolutions per minute, engine revolutions per mile, and fuel tank capacity. We begin with a model using all of these predictors:
carMod_full <- lm(Price ~ MPG.city + MPG.highway + EngineSize + Horsepower + RPM + Rev.per.mile + Fuel.tank.capacity, data = Cars93)
Then, we use the stepAIC function from the MASS package (which we have already loaded) to do the backward elimination. The function defaults to backward elimination, so the only input we need is the name of the full model:
stepAIC(carMod_full)
## Start: AIC=339.35
## Price ~ MPG.city + MPG.highway + EngineSize + Horsepower + RPM +
## Rev.per.mile + Fuel.tank.capacity
##
## Df Sum of Sq RSS AIC
## - MPG.highway 1 0.02 3009.3 337.35
## - Fuel.tank.capacity 1 0.15 3009.4 337.35
## - RPM 1 0.32 3009.6 337.36
## - MPG.city 1 23.09 3032.3 338.06
## - EngineSize 1 29.54 3038.8 338.25
## <none> 3009.2 339.35
## - Rev.per.mile 1 155.75 3165.0 342.04
## - Horsepower 1 589.15 3598.4 353.97
##
## Step: AIC=337.35
## Price ~ MPG.city + EngineSize + Horsepower + RPM + Rev.per.mile +
## Fuel.tank.capacity
##
## Df Sum of Sq RSS AIC
## - Fuel.tank.capacity 1 0.14 3009.4 335.35
## - RPM 1 0.32 3009.6 335.36
## - EngineSize 1 29.73 3039.0 336.26
## <none> 3009.3 337.35
## - MPG.city 1 76.27 3085.5 337.67
## - Rev.per.mile 1 160.69 3170.0 340.19
## - Horsepower 1 589.14 3598.4 351.97
##
## Step: AIC=335.35
## Price ~ MPG.city + EngineSize + Horsepower + RPM + Rev.per.mile
##
## Df Sum of Sq RSS AIC
## - RPM 1 0.31 3009.7 333.36
## - EngineSize 1 33.28 3042.7 334.37
## <none> 3009.4 335.35
## - MPG.city 1 116.98 3126.4 336.90
## - Rev.per.mile 1 174.72 3184.1 338.60
## - Horsepower 1 603.97 3613.4 350.36
##
## Step: AIC=333.36
## Price ~ MPG.city + EngineSize + Horsepower + Rev.per.mile
##
## Df Sum of Sq RSS AIC
## - EngineSize 1 59.45 3069.2 333.18
## <none> 3009.7 333.36
## - MPG.city 1 125.16 3134.9 335.15
## - Rev.per.mile 1 175.58 3185.3 336.63
## - Horsepower 1 1694.81 4704.5 372.90
##
## Step: AIC=333.18
## Price ~ MPG.city + Horsepower + Rev.per.mile
##
## Df Sum of Sq RSS AIC
## <none> 3069.2 333.18
## - Rev.per.mile 1 116.75 3185.9 334.65
## - MPG.city 1 152.35 3221.5 335.69
## - Horsepower 1 2477.71 5546.9 386.22
##
## Call:
## lm(formula = Price ~ MPG.city + Horsepower + Rev.per.mile, data = Cars93)
##
## Coefficients:
## (Intercept) MPG.city Horsepower Rev.per.mile
## -0.024781 -0.355801 0.138255 0.003262
First, the function calculates the AIC for each model where one of the seven predictors is removed, as well as the model where none of the predictors is removed. The model with the lowest AIC is taken, and the process is repeated until a model remains whose AIC is lowest when none of the remaining predictors is removed. In this example, the first step reveals that removing miles per gallon on the highway results in the lowest AIC. In the next step, fuel tank capacity is removed, followed by RPM and engine size. After these four predictors are removed, removing any of the remaining predictors results in a higher AIC. So, our final model is one in which the price of a car is predicted by miles per gallon in the city, horsepower, and engine revolutions per mile.
Similarly, we can find a good model using the forward selection method. This process starts with a minimal model, here one containing only an intercept. Then, at each step, the predictor whose addition lowers the AIC the most is added, until adding any of the remaining predictors no longer improves the model. Again, we can use the stepAIC function, but forward selection is not the default, so we will need to specify this, as well as our beginning model and the largest possible model. We start by creating a model with no predictors:
carMod_simple <- lm(Price ~ 1, data = Cars93)
Then, we can use this model, as well as our full model as inputs to the command.
stepAIC(carMod_simple, direction = "forward", scope = list(upper = carMod_full))
## Start: AIC=422.83
## Price ~ 1
##
## Df Sum of Sq RSS AIC
## + Horsepower 1 5333.1 3250.9 334.53
## + Fuel.tank.capacity 1 3294.2 5289.9 379.81
## + EngineSize 1 3063.8 5520.2 383.77
## + MPG.city 1 3034.5 5549.5 384.26
## + MPG.highway 1 2698.5 5885.5 389.73
## + Rev.per.mile 1 1560.7 7023.3 406.17
## <none> 8584.0 422.83
## + RPM 1 0.2 8583.8 424.83
##
## Step: AIC=334.53
## Price ~ Horsepower
##
## Df Sum of Sq RSS AIC
## + MPG.highway 1 73.637 3177.2 334.40
## <none> 3250.9 334.53
## + MPG.city 1 64.974 3185.9 334.65
## + Fuel.tank.capacity 1 59.411 3191.5 334.81
## + Rev.per.mile 1 29.372 3221.5 335.69
## + RPM 1 9.863 3241.0 336.25
## + EngineSize 1 7.666 3243.2 336.31
##
## Step: AIC=334.4
## Price ~ Horsepower + MPG.highway
##
## Df Sum of Sq RSS AIC
## + Rev.per.mile 1 79.421 3097.8 334.04
## <none> 3177.2 334.40
## + Fuel.tank.capacity 1 8.966 3168.3 336.14
## + MPG.city 1 0.417 3176.8 336.39
## + RPM 1 0.351 3176.9 336.39
## + EngineSize 1 0.000 3177.2 336.40
##
## Step: AIC=334.04
## Price ~ Horsepower + MPG.highway + Rev.per.mile
##
## Df Sum of Sq RSS AIC
## <none> 3097.8 334.04
## + EngineSize 1 63.998 3033.8 334.10
## + RPM 1 33.113 3064.7 335.05
## + MPG.city 1 28.675 3069.2 335.18
## + Fuel.tank.capacity 1 19.153 3078.7 335.47
##
## Call:
## lm(formula = Price ~ Horsepower + MPG.highway + Rev.per.mile,
## data = Cars93)
##
## Coefficients:
## (Intercept) Horsepower MPG.highway Rev.per.mile
## 1.999974 0.140969 -0.294868 0.002492
The model begins with no predictors, and the first step is to look at simple models with each of the predictors to determine which has the lowest AIC. We find that a model with only horsepower as a predictor has the lowest AIC. So, we take this model, add each of the remaining predictors in turn, and keep the one with the lowest AIC. This process ends at the step where the AIC is lowest if no additional predictors are added. In this example, after horsepower, highway miles per gallon and revolutions per mile were added, and the remaining predictors did not lower the AIC. Our final model using this method uses horsepower, highway miles per gallon, and revolutions per mile to predict the price of the car. This is a slightly different model than the one we found using backward elimination. It is possible to get different models using the two different methods, since forward selection tends to favor smaller models, while backward elimination tends to favor models with more predictors.