Module IV: Multiple Linear Regression Models

Amoud University

Introduction

This R module covers the multiple linear regression method together with its associated statistics and tests. It also discusses how to assess a model's accuracy and what to consider when deciding whether we have the best model for our data. We will go over how to perform a multiple linear regression in R, what we can do with the resulting model, its interpretation, confidence intervals, significance testing, and other pertinent details.

Dataset

To illustrate and clarify the methods, we will walk through everything using a built-in data set that provides information about cars. It includes the number of cylinders, engine displacement, horsepower, weight, quarter-mile time, engine type, transmission type, rear axle ratio, number of gears, number of carburetors, and miles per gallon. Our data must first be loaded into R, which we can do with a single line of code that calls the data set already shipped with R.

data("mtcars")

Our data set is now available in the R environment, where we can use it and perform analysis on it. The next step is always to familiarize yourself with your data: to choose the best methods, you need to know what the data include and how they look. Two commands give us a quick peek. The first is head(), which displays the first six rows of data. The second is names(), which displays all of the variable names; this is crucial because the names are case-sensitive and must be typed exactly as they appear.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
names(mtcars)
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb"

Modifying Data Set

Now that we have an idea of what the data contain, we must choose what we want to examine. Multiple linear regression is a tool for examining the association between two or more predictor variables (quantitative or categorical) and one dependent variable of interest that we wish to forecast. With this data, we hope to develop a model that predicts the miles per gallon (mpg) a car will achieve based on the other factors: the number of cylinders in the engine (cyl), displacement (disp, measured in cubic inches), gross horsepower (hp), rear axle ratio (drat), weight (wt, in 1000 lbs), quarter-mile time (qsec), V engine or straight engine (vs), transmission type (am, 0 = automatic and 1 = manual), number of forward gears (gear), and number of carburetors (carb). To reference these variables by name without retyping mtcars$ each time, we attach the data set:

attach(mtcars)

After preparing the data, the best way to see the relationship between variables is to plot them and examine how the plots look. We will use the plot() command, which lists the x-axis variable first and the y-axis variable second, separated by a comma. We will plot each predictor variable against our response variable, mpg.

plot(mpg, cyl)

plot(mpg, gear)

plot(mpg, am)

plot(mpg, vs)

plot(mpg, disp)

plot(mpg, carb)

plot(mpg, hp)

plot(mpg, qsec)

plot(mpg, drat)

plot(mpg, wt)
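Typing ten separate plot() calls works, but a short loop can draw the same scatterplots in a single panel layout. A minimal sketch (the 2 x 5 panel grid is just one reasonable choice):

# Draw every predictor against mpg in a 2 x 5 grid of panels.
predictors <- c("cyl", "gear", "am", "vs", "disp", "carb", "hp", "qsec", "drat", "wt")
par(mfrow = c(2, 5))                  # 2 rows x 5 columns of plotting panels
for (v in predictors) {
  plot(mtcars[["mpg"]], mtcars[[v]], xlab = "mpg", ylab = v)
}
par(mfrow = c(1, 1))                  # restore the default single-panel layout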

These plots start to give us an idea of what might be related to mpg, and also of which variables may be collinear with each other. We will keep these in mind as we continue our analysis, and we will use these plots to assess some other things mentioned later in this guide. The ggpairs() function from the GGally package conveniently shows all of these pairwise plots, along with correlations and each variable's distribution, in one matrix:

library(GGally)
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'mtcars':
## 
##     mpg
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggpairs(mtcars,
        lower = list(continuous = wrap("points", colour = "blue")),
        diag = list(continuous = wrap("barDiag", colour = "red")))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Multiple Linear Regression Model

Model Formulation

The next step is to fit our regression model to get some concrete numbers about the relationship between our predictors and the response. We are looking for numbers that tell us about the strength of the fit and the error of our model. The coefficient of determination is the big one, because it tells us how well our predictor variables jointly predict our response variable.

The multiple linear model command is a very simple one. It is set up like:

modelname <- lm(response ~ predictor1 + predictor2 + predictor3 + ..., data = dataset)

As mentioned earlier, we want to use these variables to predict mpg.

Then we just call the model's name and it prints its coefficients.

mreg <- lm(mpg~cyl+disp+hp+drat+wt+qsec+vs+am+gear+carb,data=mtcars)
mreg
## 
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + vs + 
##     am + gear + carb, data = mtcars)
## 
## Coefficients:
## (Intercept)          cyl         disp           hp         drat           wt  
##    12.30337     -0.11144      0.01334     -0.02148      0.78711     -3.71530  
##        qsec           vs           am         gear         carb  
##     0.82104      0.31776      2.52023      0.65541     -0.19942

As we can see here, our model was created under the name mreg, so just calling that name in a command shows us our coefficients. We can now write a regression equation from this output:

\(\widehat{mpg} = 12.303 - 0.111\,cyl + 0.013\,disp - 0.021\,hp + 0.787\,drat - 3.715\,wt + 0.821\,qsec + 0.318\,vs + 2.520\,am + 0.655\,gear - 0.199\,carb\), where each variable stands for the observed value for a given car.
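This equation is exactly what R evaluates when we ask the model for a prediction. As a quick sketch, predict() returns the fitted mpg for a new car whose specifications we supply (the values below are made up purely for illustration):

# A hypothetical car; every predictor in the model must be supplied.
newcar <- data.frame(cyl = 6, disp = 200, hp = 120, drat = 3.9, wt = 2.8,
                     qsec = 17, vs = 0, am = 1, gear = 4, carb = 2)
predict(mreg, newdata = newcar)   # plugs the values into the fitted equation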

MLR Interpretation

Now to interpret our model and its equation. Interpretation is important so that other people understand what the model is telling us. Multiple linear regression is a little trickier to interpret than simple linear regression, but it is still understandable. The best way to explain a coefficient is to say what we expect to happen to the response variable when we increase one predictor variable by one unit while holding all other variables constant. For our model this sounds like: for every thousand-pound increase in the weight of the car, the expected mpg decreases by 3.715, with all other variables held constant.

The other thing to interpret would be the intercept, but we can only do that if it makes sense for all of our predictor variables to be zero. In our case several variables cannot be zero (no car has zero weight or zero cylinders), so we cannot interpret the intercept.

Model’s Output

There are a couple of things to look at in the output, including the coefficient of determination (\(R^2\) and adjusted \(R^2\)), t-tests, the F-test, coefficient confidence intervals, and the mean squared error. To find these we can just look at the model summary using the summary() command.

summary(mreg)
## 
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + vs + 
##     am + gear + carb, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

Correlation/Coefficient of Determination

Arguably the most important measures of how good a model is are \(R\) and \(R^2\). These tell us how well our model predicts the desired response variable; the closer \(R^2\) is to one, the better, and a value of 1 would mean our model fits perfectly. Unlike simple linear models, we now also have an adjusted \(R^2\) that takes into account how many predictors we are using. It penalizes a model for extra predictors, so predictors that contribute little actually hurt this value; for this reason adjusted \(R^2\) is generally preferred to plain \(R^2\) when comparing models with different numbers of predictors.

As we can see, our \(R^2\) value is \(.869\), which is very good: our model predicts mpg well, and the predictors explain \(86.9\%\) of the variability in mpg. Our adjusted \(R^2\) is a little lower, as expected, since we are using 10 predictors. It could improve if one or more predictors that contribute little were removed. One thing to note, though, is that comparing raw \(R^2\) values is not a great way of deciding which model is better than another.
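Both values can also be pulled out of the summary object directly, which is handy when comparing several models in code. A small sketch:

s <- summary(mreg)
s$r.squared       # multiple R-squared (0.869 in the summary above)
s$adj.r.squared   # adjusted R-squared (0.8066 in the summary above)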

Mean Squared Error

Mean squared error is exactly what it sounds like: the sum of our squared errors divided by the residual degrees of freedom (here 21). Its square root is the residual standard error reported in the model summary. This is a good measure of how accurate a model is, because we obviously want as little error as possible. To find it we can use the following formula:

sqrt(sum((mreg$residuals)^2)/21)
## [1] 2.650197

Or we can take the easy route and look in the summary:

summary(mreg)
## 
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + vs + 
##     am + gear + carb, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

As you can see these values are equal. There are other useful pieces of information in this summary such as the coefficients for each predictor and degrees of freedom for the model.
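R also exposes the residual standard error directly, so we do not need to hard-code the 21 degrees of freedom. A small sketch of two equivalent ways:

sqrt(sum(residuals(mreg)^2) / df.residual(mreg))  # same formula, df taken from the model
sigma(mreg)                                       # residual standard error, 2.650197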

Hypothesis Testing

t-Test

An important thing to do with our model is to test the significance of a single predictor variable to see if it is related to our response variable. We use a hypothesis test to do this: the null hypothesis is that the predictor's coefficient is equal to \(0\), and the alternative hypothesis is that its coefficient is not equal to \(0\). To find our p-value we run a t-test with \(n-(k+1)\) degrees of freedom. Our test statistic is \(b_j/s_{b_j}\), the point estimate of the coefficient divided by the standard error of that coefficient. The easiest way to find this is to pull those numbers from the model summary and divide them; the third row of the coefficient table corresponds to disp.

tstat <- coef(summary(mreg))[3,1]/coef(summary(mreg))[3,2]   # row 3 = disp: estimate / std. error
tstat
## [1] 0.7467585

Now we can use our test statistic and the 21 degrees of freedom of our model to run the t-test. We double the pt() result because we need to include both tails of the distribution, not just one. The command goes: pt(test statistic, degrees of freedom, lower.tail = TRUE/FALSE), where FALSE gives the area to the right of the test statistic.

2*pt(tstat, 21, lower.tail=FALSE)
## [1] 0.4634887

Here we can see our p-value is larger than our .05 level of significance, so we do not have enough evidence to reject the null hypothesis: engine displacement does not have a significant linear relationship with mpg once the other predictors are in the model.
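Note that doubling the upper tail only works when the test statistic is positive; a sign-safe version takes the absolute value first. A minimal sketch (it reproduces the same p-value here because tstat is already positive):

2*pt(abs(tstat), df.residual(mreg), lower.tail=FALSE)
## [1] 0.4634887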

We can do another one of these tests for our intercept.

tstat2 <- coef(summary(mreg))[1,1]/coef(summary(mreg))[1,2]   # row 1 = (Intercept): estimate / std. error
tstat2
## [1] 0.6573058
2*pt(tstat2, 21, lower.tail=FALSE)
## [1] 0.5181244

We have another large p-value, so the intercept is not significant to our model either. As before, we still cannot interpret it, since no car could reasonably have a value of 0 for every predictor.

An easier way to do this is to just look in the model's summary at the t-test and significance level for each coefficient.

summary(mreg)
## 
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + vs + 
##     am + gear + carb, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

As you can see these values are exactly the same, so using the summary() command is a great time saver.

F-test

An additional way to see whether there is a linear relationship between the predictor variables and the response variable is the F-test. This test determines whether there is any relationship at all between the response and its predictors. It is set up like our t-test: the null hypothesis is that there is no relationship between the response variable and its predictors, and the alternative is that there is some linear relationship between them. A common \(\alpha\) value for rejecting the null hypothesis is .05. We can find the \(F\)-test \(p\)-value at the bottom of our model's summary.

summary(mreg)
## 
## Call:
## lm(formula = mpg ~ cyl + disp + hp + drat + wt + qsec + vs + 
##     am + gear + carb, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4506 -1.6044 -0.1196  1.2193  4.6271 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 12.30337   18.71788   0.657   0.5181  
## cyl         -0.11144    1.04502  -0.107   0.9161  
## disp         0.01334    0.01786   0.747   0.4635  
## hp          -0.02148    0.02177  -0.987   0.3350  
## drat         0.78711    1.63537   0.481   0.6353  
## wt          -3.71530    1.89441  -1.961   0.0633 .
## qsec         0.82104    0.73084   1.123   0.2739  
## vs           0.31776    2.10451   0.151   0.8814  
## am           2.52023    2.05665   1.225   0.2340  
## gear         0.65541    1.49326   0.439   0.6652  
## carb        -0.19942    0.82875  -0.241   0.8122  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8066 
## F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

Our \(p\)-value is very small, so we can reject our null hypothesis and say that there is a relationship between at least one of the predictor variables and the response variable. One thing you might notice is that the p-values for our \(F\)- and \(t\)-tests are not identical, as they were in simple linear regression. We will also do a partial \(F\)-test later.
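The same F-test p-value can be recovered by hand from the summary object, which stores the F statistic and its two degrees of freedom. A small sketch:

fstat <- summary(mreg)$fstatistic                                  # value, numdf, dendf
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)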

Coefficient Confidence Intervals

Another thing to look at is the confidence intervals for our coefficients. Our point estimates are not exact, so we want a range in which we are a certain percent confident the actual value lies. A common choice is a 95% confidence interval, so we will create one using the confint() command.

confint(mreg)
##                    2.5 %      97.5 %
## (Intercept) -26.62259745 51.22934576
## cyl          -2.28468553  2.06180457
## disp         -0.02380146  0.05047194
## hp           -0.06675236  0.02378812
## drat         -2.61383350  4.18805545
## wt           -7.65495413  0.22434628
## qsec         -0.69883421  2.34091571
## vs           -4.05880242  4.69432805
## am           -1.75681208  6.79726585
## gear         -2.44999107  3.76081711
## carb         -1.92290442  1.52406591

We can interpret this output to mean: for every thousand pounds the car's weight increases, we are 95% confident that the corresponding mean mpg changes by between -7.655 and 0.224, with all other variables held constant.
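confint() also accepts a level argument and can be restricted to a single coefficient; for example, a 90% interval for the weight coefficient alone would be a narrower interval from the same model:

confint(mreg, "wt", level = 0.90)   # 90% confidence interval for the wt coefficient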

Model Assumptions

An important consideration for a multiple linear regression model is that our assumptions are met. One key assumption is that our errors (or residuals) have an approximately normal distribution. A residual is the observed value minus the predicted value for a given observation. The easiest check is to extract the residuals with modelname$residuals and make a histogram of them with the hist() command.

resid<- mreg$residuals
hist(resid)

The histogram gives no strong evidence against approximate normality: it has fairly even tails and looks fairly bell-shaped.

Another good way to check this assumption is a normal quantile plot: if the errors are normally distributed, they will follow the straight reference line.

qqnorm(resid)
qqline(resid)

As we can see from this plot our errors follow the straight line well so we can say this assumption is met.
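A formal complement to these visual checks is the Shapiro-Wilk test, where a large p-value is consistent with normally distributed residuals. A small sketch:

shapiro.test(resid)   # null hypothesis: the residuals are normally distributed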

Another assumption is homoscedasticity, meaning constant variance of the errors throughout our data. We can check this by plotting our residuals around a horizontal line at 0.

plot(mreg$residuals ~ disp)
abline(0,0)

We will say this assumption is met, as the plot gives no evidence of changing variance and the points appear to be randomly scattered as opposed to showing a pattern of some sort.
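Residuals are also commonly plotted against the fitted values, which checks for constant variance across the model as a whole rather than along a single predictor. A small sketch:

plot(mreg$fitted.values, mreg$residuals)   # look for an even band around zero
abline(h = 0)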

Our last assumption is that the cars in this data set are independent of one another. We will assume this is met, as the data include many different makes, models, and types of vehicle, so none of them should have an effect on any of the others.

Residual Analysis

There are a few diagnostic plots we would like to look at that give us more of an idea of how good our model is and how well it is predicting. Calling plot() on the model produces them from the residuals.

plot(mreg)

Looking at the Residuals vs Fitted plot, we can see a slight pattern to the residuals of our model, where we would rather see the residuals distributed as randomly as possible. However, we may be able to fix this by dropping some predictors or adding transformations to our model. The Normal Q-Q plot shows little deviation from the line, so the quantiles of our residuals follow an approximately normal distribution. The Scale-Location plot of standardized residuals again shows some data points clumped together, whereas we would rather have them evenly spread out. Lastly, the Residuals vs Leverage plot with its Cook's distance contours shows that some data points have a lot of leverage and may be affecting the accuracy of our model by skewing the results.

Transformations

After completing our residual analysis, we concluded that some transformations may benefit our model. Let's see if it is possible to improve the model by transforming some of our predictors, for example with a square root or a log transformation.

mreg1 = lm(mpg ~cyl+log(disp)+log(hp)+drat+wt+qsec+vs+am+gear+carb, data=mtcars)
summary(mreg1)
## 
## Call:
## lm(formula = mpg ~ cyl + log(disp) + log(hp) + drat + wt + qsec + 
##     vs + am + gear + carb, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3225 -1.6278 -0.4725  1.1672  3.7616 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 52.749992  25.678622   2.054   0.0526 .
## cyl          0.934117   1.010657   0.924   0.3658  
## log(disp)   -4.923860   3.852996  -1.278   0.2152  
## log(hp)     -3.406400   3.003988  -1.134   0.2696  
## drat         0.169684   1.549901   0.109   0.9139  
## wt          -0.975286   1.506152  -0.648   0.5243  
## qsec         0.156231   0.686120   0.228   0.8221  
## vs          -0.005731   1.904313  -0.003   0.9976  
## am           0.986507   2.084110   0.473   0.6408  
## gear         1.706226   1.463070   1.166   0.2566  
## carb        -0.973755   0.653397  -1.490   0.1510  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.452 on 21 degrees of freedom
## Multiple R-squared:  0.8879, Adjusted R-squared:  0.8345 
## F-statistic: 16.63 on 10 and 21 DF,  p-value: 8.023e-08

Judging by the summary, the log transformation did improve our model: the \(R^2\) value increased from \(.869\) to \(.8879\). We can also check whether the diagnostic plots improved for the new model:

plot(mreg1)

The Residuals vs Fitted plot improved slightly, although the points are still not evenly distributed. The quantiles still follow a normal distribution, and the standardized residuals are more evenly spread out. The Residuals vs Leverage plot also improved: the data are no longer as bunched up, and all of the points lie within the dashed Cook's distance lines, i.e., within the threshold for acceptable influence.

Reducing The Model

We made a model using all of the variables we were given, but that might not be our best model: we may be able to build a better one using fewer predictors. Here we go through some steps for reducing the model and explain how to tell whether one model is better than another.

Step AIC

Now that we feel we have two comparable models, we will run what is called Akaike Information Criterion (AIC) analysis on each of them in the backwards direction. This means R takes our full model with all of the variables and removes one variable at a time so that the model's AIC improves at each step, where a smaller value means a better model. When the AIC cannot get any better, it stops and R tells us this is the best model to use. It essentially gets rid of the worst predictor at every step until removing anything more would mean dropping a good predictor rather than a bad one. We should be left with our best predictors after using the stepAIC() command on our model, which is part of the R package MASS. A rough rule of thumb is that a more complicated model should lower the AIC by around 10 before it is clearly preferred over a simpler one.

Mallow's \(C_p\) is another way to test models for adding or removing predictors: \(C_p = \dfrac{\sum_i (Y_i - \hat{Y}_i)^2}{MSE_{full}} - n + 2(k+1)\), where the sum of squared errors comes from the candidate submodel with \(k\) predictors and the \(MSE\) from the full model. Ideally, this would roughly equal \(k+1\), but the statistic is not used by many statisticians, so it is better to rely on AIC, which is a much better-known criterion.
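For reference, here is a hedged sketch of computing Mallow's \(C_p\) by hand for the wt + qsec + am submodel that will appear below, benchmarking against the MSE of the full model (packages such as leaps can compute this as well):

sse_sub  <- sum(residuals(lm(mpg ~ wt + qsec + am, data = mtcars))^2)  # SSE of the submodel
mse_full <- sum(residuals(mreg)^2) / df.residual(mreg)                 # MSE of the full model
n <- nrow(mtcars); k <- 3                                              # k predictors in the submodel
sse_sub / mse_full - n + 2 * (k + 1)                                   # compare to k + 1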

library(MASS)
stepAIC(mreg)
## Start:  AIC=70.9
## mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - cyl   1    0.0799 147.57 68.915
## - vs    1    0.1601 147.66 68.932
## - carb  1    0.4067 147.90 68.986
## - gear  1    1.3531 148.85 69.190
## - drat  1    1.6270 149.12 69.249
## - disp  1    3.9167 151.41 69.736
## - hp    1    6.8399 154.33 70.348
## - qsec  1    8.8641 156.36 70.765
## <none>              147.49 70.898
## - am    1   10.5467 158.04 71.108
## - wt    1   27.0144 174.51 74.280
## 
## Step:  AIC=68.92
## mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - vs    1    0.2685 147.84 66.973
## - carb  1    0.5201 148.09 67.028
## - gear  1    1.8211 149.40 67.308
## - drat  1    1.9826 149.56 67.342
## - disp  1    3.9009 151.47 67.750
## - hp    1    7.3632 154.94 68.473
## <none>              147.57 68.915
## - qsec  1   10.0933 157.67 69.032
## - am    1   11.8359 159.41 69.384
## - wt    1   27.0280 174.60 72.297
## 
## Step:  AIC=66.97
## mpg ~ disp + hp + drat + wt + qsec + am + gear + carb
## 
##        Df Sum of Sq    RSS    AIC
## - carb  1    0.6855 148.53 65.121
## - gear  1    2.1437 149.99 65.434
## - drat  1    2.2139 150.06 65.449
## - disp  1    3.6467 151.49 65.753
## - hp    1    7.1060 154.95 66.475
## <none>              147.84 66.973
## - am    1   11.5694 159.41 67.384
## - qsec  1   15.6830 163.53 68.200
## - wt    1   27.3799 175.22 70.410
## 
## Step:  AIC=65.12
## mpg ~ disp + hp + drat + wt + qsec + am + gear
## 
##        Df Sum of Sq    RSS    AIC
## - gear  1     1.565 150.09 63.457
## - drat  1     1.932 150.46 63.535
## <none>              148.53 65.121
## - disp  1    10.110 158.64 65.229
## - am    1    12.323 160.85 65.672
## - hp    1    14.826 163.35 66.166
## - qsec  1    26.408 174.94 68.358
## - wt    1    69.127 217.66 75.350
## 
## Step:  AIC=63.46
## mpg ~ disp + hp + drat + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - drat  1     3.345 153.44 62.162
## - disp  1     8.545 158.64 63.229
## <none>              150.09 63.457
## - hp    1    13.285 163.38 64.171
## - am    1    20.036 170.13 65.466
## - qsec  1    25.574 175.67 66.491
## - wt    1    67.572 217.66 73.351
## 
## Step:  AIC=62.16
## mpg ~ disp + hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - disp  1     6.629 160.07 61.515
## <none>              153.44 62.162
## - hp    1    12.572 166.01 62.682
## - qsec  1    26.470 179.91 65.255
## - am    1    32.198 185.63 66.258
## - wt    1    69.043 222.48 72.051
## 
## Step:  AIC=61.52
## mpg ~ hp + wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## - hp    1     9.219 169.29 61.307
## <none>              160.07 61.515
## - qsec  1    20.225 180.29 63.323
## - am    1    25.993 186.06 64.331
## - wt    1    78.494 238.56 72.284
## 
## Step:  AIC=61.31
## mpg ~ wt + qsec + am
## 
##        Df Sum of Sq    RSS    AIC
## <none>              169.29 61.307
## - am    1    26.178 195.46 63.908
## - qsec  1   109.034 278.32 75.217
## - wt    1   183.347 352.63 82.790
## 
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
## 
## Coefficients:
## (Intercept)           wt         qsec           am  
##       9.618       -3.917        1.226        2.936

First running this on our original model with every variable, R gives us an output telling us to keep only 3 variables as predictors: quarter mile time, weight, and transmission type.

Now if we run this on our transformation model we get:

stepAIC(mreg1)
## Start:  AIC=65.93
## mpg ~ cyl + log(disp) + log(hp) + drat + wt + qsec + vs + am + 
##     gear + carb
## 
##             Df Sum of Sq    RSS    AIC
## - vs         1    0.0001 126.28 63.928
## - drat       1    0.0721 126.35 63.947
## - qsec       1    0.3118 126.59 64.007
## - am         1    1.3473 127.63 64.268
## - wt         1    2.5214 128.80 64.561
## - cyl        1    5.1370 131.42 65.204
## - log(hp)    1    7.7323 134.01 65.830
## <none>                   126.28 65.928
## - gear       1    8.1782 134.46 65.936
## - log(disp)  1    9.8204 136.10 66.325
## - carb       1   13.3555 139.63 67.146
## 
## Step:  AIC=63.93
## mpg ~ cyl + log(disp) + log(hp) + drat + wt + qsec + am + gear + 
##     carb
## 
##             Df Sum of Sq    RSS    AIC
## - drat       1    0.0720 126.35 61.947
## - qsec       1    0.3497 126.63 62.017
## - am         1    1.4160 127.70 62.285
## - wt         1    2.5408 128.82 62.566
## - cyl        1    5.4640 131.74 63.284
## - log(hp)    1    7.9966 134.28 63.893
## <none>                   126.28 63.928
## - gear       1    8.2188 134.50 63.946
## - log(disp)  1    9.9454 136.22 64.354
## - carb       1   13.3899 139.67 65.153
## 
## Step:  AIC=61.95
## mpg ~ cyl + log(disp) + log(hp) + wt + qsec + am + gear + carb
## 
##             Df Sum of Sq    RSS    AIC
## - qsec       1    0.3086 126.66 60.025
## - am         1    1.4720 127.82 60.317
## - wt         1    2.5345 128.89 60.582
## - cyl        1    5.3971 131.75 61.285
## <none>                   126.35 61.947
## - log(hp)    1    8.5533 134.91 62.043
## - gear       1    8.6525 135.00 62.066
## - log(disp)  1   10.1591 136.51 62.421
## - carb       1   13.3729 139.72 63.166
## 
## Step:  AIC=60.02
## mpg ~ cyl + log(disp) + log(hp) + wt + am + gear + carb
## 
##             Df Sum of Sq    RSS    AIC
## - am         1    1.1724 127.83 58.320
## - wt         1    2.3372 129.00 58.610
## - cyl        1    5.1185 131.78 59.292
## <none>                   126.66 60.025
## - gear       1    8.7232 135.38 60.156
## - log(hp)    1    9.3330 135.99 60.300
## - log(disp)  1   12.4852 139.15 61.033
## - carb       1   15.9928 142.65 61.830
## 
## Step:  AIC=58.32
## mpg ~ cyl + log(disp) + log(hp) + wt + gear + carb
## 
##             Df Sum of Sq    RSS    AIC
## - wt         1    2.7307 130.56 56.996
## - cyl        1    6.6592 134.49 57.945
## <none>                   127.83 58.320
## - log(hp)    1    8.7242 136.56 58.432
## - gear       1   15.8483 143.68 60.060
## - log(disp)  1   16.0475 143.88 60.104
## - carb       1   16.7435 144.58 60.258
## 
## Step:  AIC=57
## mpg ~ cyl + log(disp) + log(hp) + gear + carb
## 
##             Df Sum of Sq    RSS    AIC
## - log(hp)    1     7.249 137.81 56.725
## <none>                   130.56 56.996
## - cyl        1    13.526 144.09 58.150
## - gear       1    32.779 163.34 62.164
## - carb       1    38.572 169.13 63.279
## - log(disp)  1    53.640 184.20 66.010
## 
## Step:  AIC=56.73
## mpg ~ cyl + log(disp) + gear + carb
## 
##             Df Sum of Sq    RSS    AIC
## - cyl        1     8.707 146.52 56.686
## <none>                   137.81 56.725
## - gear       1    26.481 164.29 60.349
## - carb       1    60.918 198.73 66.439
## - log(disp)  1    97.216 235.03 71.807
## 
## Step:  AIC=56.69
## mpg ~ log(disp) + gear + carb
## 
##             Df Sum of Sq    RSS    AIC
## <none>                   146.52 56.686
## - gear       1    20.189 166.71 58.817
## - carb       1    52.570 199.09 64.497
## - log(disp)  1   152.562 299.08 77.519
## 
## Call:
## lm(formula = mpg ~ log(disp) + gear + carb, data = mtcars)
## 
## Coefficients:
## (Intercept)    log(disp)         gear         carb  
##      51.789       -6.592        1.787       -1.227

This output tells us to keep only 3 variables as well: log of displacement, number of gears, and number of carburetors. Then we want to run a model for each using the recommended variables.

mreg2<-lm(mpg~qsec+wt+am, data=mtcars)
summary(mreg2)
## 
## Call:
## lm(formula = mpg ~ qsec + wt + am, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4811 -1.5555 -0.7257  1.4110  4.6610 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   9.6178     6.9596   1.382 0.177915    
## qsec          1.2259     0.2887   4.247 0.000216 ***
## wt           -3.9165     0.7112  -5.507 6.95e-06 ***
## am            2.9358     1.4109   2.081 0.046716 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared:  0.8497, Adjusted R-squared:  0.8336 
## F-statistic: 52.75 on 3 and 28 DF,  p-value: 1.21e-11
mreg3<-lm(mpg~log(disp)+gear+carb, data=mtcars)
summary(mreg3)
## 
## Call:
## lm(formula = mpg ~ log(disp) + gear + carb, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.0461 -1.3931 -0.5111  1.8053  4.2983 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  51.7887     8.5069   6.088 1.45e-06 ***
## log(disp)    -6.5917     1.2208  -5.399 9.31e-06 ***
## gear          1.7869     0.9097   1.964  0.05950 .  
## carb         -1.2271     0.3872  -3.170  0.00368 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.288 on 28 degrees of freedom
## Multiple R-squared:  0.8699, Adjusted R-squared:  0.8559 
## F-statistic:  62.4 on 3 and 28 DF,  p-value: 1.62e-12

Here we can see that the model with the transformation does not improve much on the reduced model without any transformations, and transformed models are harder to interpret, so we will stick with mreg2 instead.
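We can also compare the two reduced models directly with the AIC() function, where lower is better. (AIC() keeps additive constants that stepAIC() drops, so the absolute numbers differ from the step output, but the comparison between models is what matters.) A small sketch:

AIC(mreg2, mreg3)   # lower AIC indicates the preferred model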

Partial F Test

Instead of using the stepAIC() command, we could also have created a nested model and used the anova(first model, second model) command to test whether the nested model is an adequate replacement for the parent model. This command tests the null hypothesis that the coefficients of the dropped predictors are all equal to 0 against the alternative that at least one of them is not 0. If the nested model is an adequate replacement, we do not want to reject the null hypothesis; rejecting it would mean that the predictors we dropped were significant to the model and should not have been removed.

nestmreg = lm(mpg ~ wt + qsec + am,data=mtcars)
anova(mreg,nestmreg)
## Analysis of Variance Table
## 
## Model 1: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Model 2: mpg ~ wt + qsec + am
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     21 147.49                           
## 2     28 169.29 -7   -21.791 0.4432 0.8636

From the ANOVA table above, the p-value is quite large, so we do not have enough evidence to reject the null hypothesis. This means the nested model using weight, quarter-mile time, and transmission type is a good replacement for the model using all of the predictors: we were able to build a simpler model that is still accurate in predicting the response variable mpg.

Multicollinearity

Multicollinearity is an important issue that needs to be addressed in multiple linear regression. If our data display multicollinearity, our predictor variables are correlated with one another, which inflates the standard errors of the coefficients and makes the individual estimates unstable. The first thing to do to check for multicollinearity is to plot your data; this produces a scatterplot matrix where you can see how all of the variables are related.

plot(mtcars)

From this plot it may be hard to tell what is correlated with what, but one pair of variables that looks highly correlated is weight and displacement. Displacement did not survive into our reduced model, which removes part of the concern, but we should still check the correlations between all of the variables we are using with the cor() command. If the data do not display strong multicollinearity, these values will all be below about .8 in absolute value.

cor(mtcars)
##             mpg        cyl       disp         hp        drat         wt
## mpg   1.0000000 -0.8521620 -0.8475514 -0.7761684  0.68117191 -0.8676594
## cyl  -0.8521620  1.0000000  0.9020329  0.8324475 -0.69993811  0.7824958
## disp -0.8475514  0.9020329  1.0000000  0.7909486 -0.71021393  0.8879799
## hp   -0.7761684  0.8324475  0.7909486  1.0000000 -0.44875912  0.6587479
## drat  0.6811719 -0.6999381 -0.7102139 -0.4487591  1.00000000 -0.7124406
## wt   -0.8676594  0.7824958  0.8879799  0.6587479 -0.71244065  1.0000000
## qsec  0.4186840 -0.5912421 -0.4336979 -0.7082234  0.09120476 -0.1747159
## vs    0.6640389 -0.8108118 -0.7104159 -0.7230967  0.44027846 -0.5549157
## am    0.5998324 -0.5226070 -0.5912270 -0.2432043  0.71271113 -0.6924953
## gear  0.4802848 -0.4926866 -0.5555692 -0.1257043  0.69961013 -0.5832870
## carb -0.5509251  0.5269883  0.3949769  0.7498125 -0.09078980  0.4276059
##             qsec         vs          am       gear        carb
## mpg   0.41868403  0.6640389  0.59983243  0.4802848 -0.55092507
## cyl  -0.59124207 -0.8108118 -0.52260705 -0.4926866  0.52698829
## disp -0.43369788 -0.7104159 -0.59122704 -0.5555692  0.39497686
## hp   -0.70822339 -0.7230967 -0.24320426 -0.1257043  0.74981247
## drat  0.09120476  0.4402785  0.71271113  0.6996101 -0.09078980
## wt   -0.17471588 -0.5549157 -0.69249526 -0.5832870  0.42760594
## qsec  1.00000000  0.7445354 -0.22986086 -0.2126822 -0.65624923
## vs    0.74453544  1.0000000  0.16834512  0.2060233 -0.56960714
## am   -0.22986086  0.1683451  1.00000000  0.7940588  0.05753435
## gear -0.21268223  0.2060233  0.79405876  1.0000000  0.27407284
## carb -0.65624923 -0.5696071  0.05753435  0.2740728  1.00000000
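A more direct numerical check is the variance inflation factor, which measures how much each coefficient's variance is inflated by correlation with the other predictors; values above roughly 5 to 10 are a warning sign. A small sketch using vif() from the car package (assuming it is installed):

library(car)   # provides vif(); install.packages("car") if needed
vif(mreg2)     # VIF for each predictor in the reduced model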

Conclusion

Overall, after studying this module, you ought to have a solid knowledge of both multiple linear regression and how to develop one using R. These procedures can be duplicated with any data that is available in R or that can be imported into R. Last but not least, a multiple linear regression can likewise be built in Python, STATA, and SPSS, as noted in the sections below.

Python Code for MLR Model


STATA Code for MLR Model


SPSS Code for MLR Model


Homework

About Dataset

Context

Although many studies have been undertaken in the past on factors affecting life expectancy, considering demographic variables, income composition, and mortality rates, the effects of immunization and the Human Development Index were not taken into account. Also, some of the past research used multiple linear regression on a data set covering a single year for all countries. This motivates resolving both of these points by formulating a regression model, based on a mixed-effects model and multiple linear regression, using data from 2000 to 2015 for all countries. Important immunizations like Hepatitis B, Polio, and Diphtheria will also be considered. In a nutshell, this study focuses on immunization factors, mortality factors, economic factors, social factors, and other health-related factors. Since the observations in this data set come from different countries, it will be easier for a country to determine which predicting factor contributes to a lower life expectancy, and hence which area it should prioritize in order to efficiently improve the life expectancy of its population.

Content

The project relies on the accuracy of the data. The Global Health Observatory (GHO) data repository under the World Health Organization (WHO) keeps track of health status as well as many other related factors for all countries, and its data sets are made publicly available for health data analysis. The data set on life expectancy and health factors for 193 countries was collected from the WHO data repository, and the corresponding economic data from the United Nations website. Among all categories of health-related factors, only the most representative critical factors were chosen. It has been observed that in the past 15 years there has been huge development in the health sector, resulting in improved human mortality rates, especially in developing nations, in comparison to the past 30 years. Therefore, this project considers data from 2000 to 2015 for 193 countries. The individual data files were merged into a single data set. Initial visual inspection of the data showed some missing values; as the data sets were from WHO, we found no evident errors. Missing data were handled in R using the missmap command (from the Amelia package), which indicated that most of the missing data were for population, Hepatitis B, and GDP. The missing data came from lesser-known countries like Vanuatu, Tonga, Togo, and Cabo Verde; finding all data for these countries was difficult, so it was decided to exclude them from the final model data set. The final merged file (the final data set) consists of 22 columns and 2938 rows, meaning 20 predicting variables. All predicting variables were then divided into several broad categories: immunization-related factors, mortality factors, economic factors, and social factors.

Acknowledgements

The data was collected from the WHO and United Nations websites with the help of Deeksha Russell and Duan Wang, and from the owners of the Kaggle page (https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who).

Inspiration

The data-set aims to answer the following key questions:

- Do the various predicting factors initially chosen really affect life expectancy? Which predicting variables actually affect it?
- Should a country with a low life expectancy (<65) increase its healthcare expenditure in order to improve its average lifespan?
- How do infant and adult mortality rates affect life expectancy?
- Does life expectancy have a positive or negative correlation with eating habits, lifestyle, exercise, smoking, drinking alcohol, etc.?
- What is the impact of schooling on the lifespan of humans?
- Does life expectancy have a positive or negative relationship with alcohol consumption?
- Do densely populated countries tend to have lower life expectancy?
- What is the impact of immunization coverage on life expectancy?

Due Date

15-June-2022

Thanks for your attention