Multiple linear regression

Introduction

This R guide will include information about Multiple Linear Regression and all of the statistics and tests that come with it. It also will talk about ways to check how accurate a model is and what we can look at to determine if we have the best model for our data. We will discuss how to run a multiple linear regression in R and what we can do with our model, interpretations, intervals, significance test, and some more useful information.

Install required packages

The following packages will be required and should therefore be loaded first. If they are not already installed, begin by installing them using the install.packages() function e.g. install.packages(“stargazer”)

library(stargazer) # styling tables
library(kableExtra) # display table formatting

Data set

To demonstrate and explain the methods, we will be walking through everything using the nhanes data set. A description of the data set and the variable name labels can be found here. The code below downloads the data set and then displays the first 6 observations and a few columns of this data set.

setwd("D:/stemresearch/R/analysis/linear-regression")
nhanesdata <- readRDS(file = url("http://drmathematics.com/learning/datasets/nhanesdata.RDS"))

Understanding your data set

Now our data set is in the R environment so we can use it and manipulate it to run our analysis. The next thing to familiarize yourself with the data set. A person needs to know what their data contains and how it looks to know the best techniques for using it. There are two commands that give us a quick glance. The first is the headTail() function from the psych library which shows you the first m rows and n columns of data (the default is m = n = 4). The second one is the names() function which shows you all your variable names which will be very important because using these variables requires them to be spelled correctly and case sensitive every time.

headtail = psych::headTail(nhanesdata[, c(1, 2, 11:15)], top = 5, bottom = 3)
kbl(headtail, 
    caption = "Table 1: Showing first 6 observations and a few variables.") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Table 1: Showing first 6 observations and a few variables.
	id	region	highbp	sex	race	age	agegroup
1	1400	South	No	Male	White	54	50-59
2	1401	South	No	Female	White	41	40-49
3	1402	South	No	Female	Other	21	20-29
4	1404	South	Yes	Female	White	63	60-69
5	1405	South	No	Female	White	64	60-69
…	…	NA	NA	NA	NA	…	NA
10335	48764	North West	No	Female	White	60	60-69
10336	48768	North West	No	Female	White	29	20-29
10337	48770	North West	No	Male	White	31	30-39

names(nhanesdata)

##  [1] "id"            "region"        "smsa"          "hhsize"       
##  [5] "hhsize_1_5"    "hhsize1"       "hhsize2"       "hhsize3"      
##  [9] "hhsize4"       "hhsize5"       "highbp"        "sex"          
## [13] "race"          "age"           "agegroup"      "height"       
## [17] "weight"        "bmi"           "bpsystolic"    "bpdiastolic"  
## [21] "cholesterol"   "triglycerides" "hdl"           "hemoglobin"   
## [25] "hemoglobin2"   "hct"           "tibc"          "iron"         
## [29] "heartatk"      "diabetes"      "place"         "corpuscl"     
## [33] "trnsfern"      "albumin"       "vitaminc"      "zinc"         
## [37] "copper"        "porphyrn"      "lead"          "highlead"     
## [41] "health"

Multiple Linear Regression

Creating the Regression Model

The next step is to create our regression model to get some concrete numbers about the relationship between our predictors. We are looking for numbers that will tell us about our correlation and our error of our model. Our correlation is the big one because that tells us how well our predictor variable predicts our response variable.

The multiple linear model command is a very simple one. It is setup like:

modelname = lm(response ~ predictor1 + predictor2 + predictor3 + … + predictork)

Like mentioned earlier, we want to use the variables age, bmi, hemoglobin, bpdiastolic and race to predict serum cholesterol of respondents in the sample. This is done as follows.

mlmodel = lm(cholesterol ~ age + bmi + hemoglobin + bpdiastolic + I(race),
             data = nhanesdata)

In the above code, the multiple linear regression model was created under the name mlmodel. This object mlmodel stores several attributes related to the computed linear model. Some of these attributes can be seen by running the code below.

attributes(mlmodel)

## $names
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "contrasts"     "xlevels"       "call"          "terms"        
## [13] "model"        
## 
## $class
## [1] "lm"

The summary() function is used to get the table of coefficients as is illustrated in the code below. Results are presented immediately after the code.

results = summary(mlmodel)
results

## 
## Call:
## lm(formula = cholesterol ~ age + bmi + hemoglobin + bpdiastolic + 
##     I(race), data = nhanesdata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -166.03  -30.18   -4.04   25.38  635.51 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  120.69104    5.25473  22.968  < 2e-16 ***
## age            1.04211    0.02682  38.860  < 2e-16 ***
## bmi            0.81147    0.09825   8.259  < 2e-16 ***
## hemoglobin     0.28172    0.33317   0.846  0.39781    
## bpdiastolic    0.28274    0.03864   7.317 2.72e-13 ***
## I(race)Black  -4.38889    1.48910  -2.947  0.00321 ** 
## I(race)Other   0.65934    3.21492   0.205  0.83751    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 44.91 on 10330 degrees of freedom
## Multiple R-squared:  0.1739, Adjusted R-squared:  0.1734 
## F-statistic: 362.5 on 6 and 10330 DF,  p-value: < 2.2e-16

We can extract only the coefficients table using results$coef as shown in syntax below. The results are styled and presented in Table 2.

results.coef = results$coef
kbl(results.coef, digits = c(2, 2, 2, 4), 
    align = rep("r", times = ncol(results.coef)),
    caption = "Table 2: Multiple linear regression results for cholesterol.") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left")

Table 2: Multiple linear regression results for cholesterol.
	Estimate	Std. Error	t value	Pr(>\|t\|)
(Intercept)	120.69	5.25	22.97	0.0000
age	1.04	0.03	38.86	0.0000
bmi	0.81	0.10	8.26	0.0000
hemoglobin	0.28	0.33	0.85	0.3978
bpdiastolic	0.28	0.04	7.32	0.0000
I(race)Black	-4.39	1.49	-2.95	0.0032
I(race)Other	0.66	3.21	0.21	0.8375

The stargazer() function from the stargazer library could also be used to present the results as shown in Table 2 (this is one of the formats used for reporting in scientific journals).

stargazer(mlmodel, type = "html",
          ci = FALSE, 
          header = FALSE,
          single.row = TRUE,
          star.cutoffs = c(0.1, 0.05, 0.01, 0.001),
          digits = 2,
          covariate.labels = c("Age", "Body mass index", 
                               "Hemoglobin", "Diastolic blood pressure",
                               "Race: Black", "Race: Other"),
          title = "Table 3: Multiple linear regression analysis for serum cholesterol.")

**Table 3: Multiple linear regression analysis for serum cholesterol.**

	Dependent variable:

	cholesterol

Age	1.04^**** (0.03)
Body mass index	0.81^**** (0.10)
Hemoglobin	0.28 (0.33)
Diastolic blood pressure	0.28^**** (0.04)
Race: Black	-4.39^*** (1.49)
Race: Other	0.66 (3.21)
Constant	120.69^**** (5.25)

Observations	10,337
R²	0.17
Adjusted R²	0.17
Residual Std. Error	44.91 (df = 10330)
F Statistic	362.45^**** (df = 6; 10330)

Note:	p<0.1; p<0.05; p<0.01

Write the regression model

The results presented in the above tables lead to the following regression model.

\[ \hat{\text{cholesterol}} = 120.69 + 1.04 \, \text{age + 0.81} \, \text{bmi + 0.28} \, \text{hemoglobin + 0.28} \, \text{bpdiastolic - 4.39} \, \text{Race: Black} + 0.66 \, \, \text{Race: Other} \] Interpretation of the above model is similar to that of the simple linear regression mode. The best way to interpret this model is to say what we expect to happen to the response variable (cholesterol) when we increase one predictor variable by one unit, while holding all other variables constant.

For the above model, we could for example say:

Age: For every 1 year increase in the age of the respondent, the serum cholesterol level will increase by 1.04, with all other variables held constant.
Race: Black: Note that White race is the reference category. So we could interpret this as: Serum cholesterol level of blacks is 4.39mg/dL lower than that of whites. Similarly, serum cholesterol level of other races is 0.66 higher than that of whites. *We have used lower for blacks and higher for other races because the regression coefficients are negative and positive respectively.

Other coefficients can be interpreted in a similar manner.

Conclusion

Overall, you should now have a great understanding of what multiple linear regression is as well as how to build a multiple linear regression model using R. These methods can be replicated for any data set.

STEM Research
https://stemresearchs.com