Introduction

This R guide will include information about Multiple Linear Regression and all of the statistics and tests that come with it. It also will talk about ways to check how accurate a model is and what we can look at to determine if we have the best model for our data. We will discuss how to run a multiple linear regression in R and what we can do with our model, interpretations, intervals, significance test, and some more useful information.

Install required packages


The following packages will be required and should therefore be loaded first. If they are not already installed, begin by installing them using the install.packages() function e.g. install.packages(“stargazer”)

library(stargazer) # styling tables
library(kableExtra) # display table formatting

Data set


To demonstrate and explain the methods, we will be walking through everything using the nhanes data set. A description of the data set and the variable name labels can be found here. The code below downloads the data set and then displays the first 6 observations and a few columns of this data set.

setwd("D:/stemresearch/R/analysis/linear-regression")
nhanesdata <- readRDS(file = url("http://drmathematics.com/learning/datasets/nhanesdata.RDS"))

Understanding your data set

Now our data set is in the R environment so we can use it and manipulate it to run our analysis. The next thing to familiarize yourself with the data set. A person needs to know what their data contains and how it looks to know the best techniques for using it. There are two commands that give us a quick glance. The first is the headTail() function from the psych library which shows you the first m rows and n columns of data (the default is m = n = 4). The second one is the names() function which shows you all your variable names which will be very important because using these variables requires them to be spelled correctly and case sensitive every time.

headtail = psych::headTail(nhanesdata[, c(1, 2, 11:15)], top = 5, bottom = 3)
kbl(headtail, 
    caption = "Table 1: Showing first 6 observations and a few variables.") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Table 1: Showing first 6 observations and a few variables.
id region highbp sex race age agegroup
1 1400 South No Male White 54 50-59
2 1401 South No Female White 41 40-49
3 1402 South No Female Other 21 20-29
4 1404 South Yes Female White 63 60-69
5 1405 South No Female White 64 60-69
NA NA NA NA NA
10335 48764 North West No Female White 60 60-69
10336 48768 North West No Female White 29 20-29
10337 48770 North West No Male White 31 30-39
names(nhanesdata)
##  [1] "id"            "region"        "smsa"          "hhsize"       
##  [5] "hhsize_1_5"    "hhsize1"       "hhsize2"       "hhsize3"      
##  [9] "hhsize4"       "hhsize5"       "highbp"        "sex"          
## [13] "race"          "age"           "agegroup"      "height"       
## [17] "weight"        "bmi"           "bpsystolic"    "bpdiastolic"  
## [21] "cholesterol"   "triglycerides" "hdl"           "hemoglobin"   
## [25] "hemoglobin2"   "hct"           "tibc"          "iron"         
## [29] "heartatk"      "diabetes"      "place"         "corpuscl"     
## [33] "trnsfern"      "albumin"       "vitaminc"      "zinc"         
## [37] "copper"        "porphyrn"      "lead"          "highlead"     
## [41] "health"

Multiple Linear Regression


Creating the Regression Model

The next step is to create our regression model to get some concrete numbers about the relationship between our predictors. We are looking for numbers that will tell us about our correlation and our error of our model. Our correlation is the big one because that tells us how well our predictor variable predicts our response variable.

The multiple linear model command is a very simple one. It is setup like:

  • modelname = lm(response ~ predictor1 + predictor2 + predictor3 + … + predictork)

Like mentioned earlier, we want to use the variables age, bmi, hemoglobin, bpdiastolic and race to predict serum cholesterol of respondents in the sample. This is done as follows.

mlmodel = lm(cholesterol ~ age + bmi + hemoglobin + bpdiastolic + I(race),
             data = nhanesdata)

In the above code, the multiple linear regression model was created under the name mlmodel. This object mlmodel stores several attributes related to the computed linear model. Some of these attributes can be seen by running the code below.

attributes(mlmodel)
## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "contrasts"     "xlevels"       "call"          "terms"        
## [13] "model"        
## 
## $class
## [1] "lm"

The summary() function is used to get the table of coefficients as is illustrated in the code below. Results are presented immediately after the code.

results = summary(mlmodel)
results
## 
## Call:
## lm(formula = cholesterol ~ age + bmi + hemoglobin + bpdiastolic + 
##     I(race), data = nhanesdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -166.03  -30.18   -4.04   25.38  635.51 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  120.69104    5.25473  22.968  < 2e-16 ***
## age            1.04211    0.02682  38.860  < 2e-16 ***
## bmi            0.81147    0.09825   8.259  < 2e-16 ***
## hemoglobin     0.28172    0.33317   0.846  0.39781    
## bpdiastolic    0.28274    0.03864   7.317 2.72e-13 ***
## I(race)Black  -4.38889    1.48910  -2.947  0.00321 ** 
## I(race)Other   0.65934    3.21492   0.205  0.83751    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.91 on 10330 degrees of freedom
## Multiple R-squared:  0.1739, Adjusted R-squared:  0.1734 
## F-statistic: 362.5 on 6 and 10330 DF,  p-value: < 2.2e-16

We can extract only the coefficients table using results$coef as shown in syntax below. The results are styled and presented in Table 2.

results.coef = results$coef
kbl(results.coef, digits = c(2, 2, 2, 4), 
    align = rep("r", times = ncol(results.coef)),
    caption = "Table 2: Multiple linear regression results for cholesterol.") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")
Table 2: Multiple linear regression results for cholesterol.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 120.69 5.25 22.97 0.0000
age 1.04 0.03 38.86 0.0000
bmi 0.81 0.10 8.26 0.0000
hemoglobin 0.28 0.33 0.85 0.3978
bpdiastolic 0.28 0.04 7.32 0.0000
I(race)Black -4.39 1.49 -2.95 0.0032
I(race)Other 0.66 3.21 0.21 0.8375

The stargazer() function from the stargazer library could also be used to present the results as shown in Table 2 (this is one of the formats used for reporting in scientific journals).

stargazer(mlmodel, type = "html",
          ci = FALSE, 
          header = FALSE,
          single.row = TRUE,
          star.cutoffs = c(0.1, 0.05, 0.01, 0.001),
          digits = 2,
          covariate.labels = c("Age", "Body mass index", 
                               "Hemoglobin", "Diastolic blood pressure",
                               "Race: Black", "Race: Other"),
          title = "Table 3: Multiple linear regression analysis for serum cholesterol.")
Table 3: Multiple linear regression analysis for serum cholesterol.
Dependent variable:
cholesterol
Age 1.04**** (0.03)
Body mass index 0.81**** (0.10)
Hemoglobin 0.28 (0.33)
Diastolic blood pressure 0.28**** (0.04)
Race: Black -4.39*** (1.49)
Race: Other 0.66 (3.21)
Constant 120.69**** (5.25)
Observations 10,337
R2 0.17
Adjusted R2 0.17
Residual Std. Error 44.91 (df = 10330)
F Statistic 362.45**** (df = 6; 10330)
Note: p<0.1; p<0.05; p<0.01

Write the regression model


The results presented in the above tables lead to the following regression model.

\[ \hat{\text{cholesterol}} = 120.69 + 1.04 \, \text{age + 0.81} \, \text{bmi + 0.28} \, \text{hemoglobin + 0.28} \, \text{bpdiastolic - 4.39} \, \text{Race: Black} + 0.66 \, \, \text{Race: Other} \] Interpretation of the above model is similar to that of the simple linear regression mode. The best way to interpret this model is to say what we expect to happen to the response variable (cholesterol) when we increase one predictor variable by one unit, while holding all other variables constant.

For the above model, we could for example say:

  • Age: For every 1 year increase in the age of the respondent, the serum cholesterol level will increase by 1.04, with all other variables held constant.

  • Race: Black: Note that White race is the reference category. So we could interpret this as: Serum cholesterol level of blacks is 4.39mg/dL lower than that of whites. Similarly, serum cholesterol level of other races is 0.66 higher than that of whites. *We have used lower for blacks and higher for other races because the regression coefficients are negative and positive respectively.

Other coefficients can be interpreted in a similar manner.

Conclusion


Overall, you should now have a great understanding of what multiple linear regression is as well as how to build a multiple linear regression model using R. These methods can be replicated for any data set.


STEM Research
https://stemresearchs.com