Hello there! In this teaching guide we (Iana Borisenko and Anastasia Bakhareva) will introduce you to linear regression models and teach you how to conduct and analyze them properly!
First of all, we would like to introduce the concept of linear regression by answering the most common questions:
The aim of linear regression is to model a mathematical function, so that we can use this model to predict Y when only X is known.
A linear regression model is needed when you want to predict the value of an output variable Y based on one or more input predictor variables X.
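To make this concrete, here is a minimal sketch with made-up numbers (not our dataset yet): we fit a line to a handful of x/y pairs and then ask the model for predictions at new values of x.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)  # made-up values, roughly y = 2x
toy_model <- lm(y ~ x)            # estimate intercept and slope
predict(toy_model, newdata = data.frame(x = c(6, 7)))  # predict Y for unseen X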
Now it is time to start our teaching guide to linear regression modeling. We will demonstrate how to conduct linear regression in a simple, easy-to-understand way.
First, load the libraries that will be needed for reading and manipulating our data and for building the graphs.
library(dplyr)
library(ggplot2)
library(stringr)
library(readr)
library(knitr)
library(psych)
library(magrittr)
library(kableExtra)
library(tidytext)
library(RColorBrewer)
library(haven)
Next, let's load the dataset and have a look at its structure.
df <- read_csv("D:/Documents/data1_1.csv")
str(df)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 40 obs. of 3 variables:
## $ books : num 0 1 0 2 4 4 1 4 3 0 ...
## $ attend: num 9 15 10 16 10 20 11 20 15 15 ...
## $ grade : num 45 57 45 51 65 88 44 87 89 59 ...
## - attr(*, "spec")=
## .. cols(
## .. books = col_double(),
## .. attend = col_double(),
## .. grade = col_double()
## .. )
Then, it is necessary to make sure each variable has the correct (numeric) type.
df$books = as.numeric(as.character(df$books))
df$attend = as.numeric(as.character(df$attend))
df$grade = as.numeric(as.character(df$grade))
We can glance at the main characteristics of our dataset with the function `summary()`.
summary(df)
##     books       attend          grade
## Min. :0 Min. : 6.00 Min. :37.00
## 1st Qu.:1 1st Qu.:10.00 1st Qu.:51.00
## Median :2 Median :15.00 Median :60.50
## Mean :2 Mean :14.10 Mean :63.55
## 3rd Qu.:3 3rd Qu.:18.25 3rd Qu.:73.00
## Max. :4 Max. :20.00 Max. :97.00
Seems legit! Now it is time to check for outliers, which we can easily do with graphs.
We construct boxplots as follows:
par(mfrow=c(1, 3))
boxplot(df$books, main="books", sub=paste("Outlier rows: ", boxplot.stats(df$books)$out))
boxplot(df$attend, main="attendance", sub=paste("Outlier rows: ", boxplot.stats(df$attend)$out))
boxplot(df$grade, main="grades", sub=paste("Outlier rows: ", boxplot.stats(df$grade)$out))
The next step in our preparation is a density plot, which helps us check whether the distribution of each variable is close to normal.
# use additional libraries to put all graphs into one image
library(cowplot)
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
##
## ggsave
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
bo = ggplot(data = df, aes(x = books)) + geom_density(col = "pink", fill = "pink", alpha = 0.5) + xlab("Books") + ggtitle("Books read by students")
at = ggplot(data = df, aes(x = attend)) + geom_density(col = "blue", fill = "blue", alpha = 0.5) + xlab("Attendance") + ggtitle("Attendance of students")
gr = ggplot(data = df, aes(x = grade)) + geom_density(col = "yellow", fill = "yellow", alpha = 0.5) + xlab("Grades") + ggtitle("Grades of students")
plot_grid(bo, at, gr)
The plots for books and attendance look pretty good, though the grades plot is not that perfect. However, we can surely work with that.
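If you want a numeric check to back up the visual impression, one common option (our addition; the visual check is usually enough for a teaching example) is the Shapiro-Wilk test, whose null hypothesis is that the variable is normally distributed:
# Shapiro-Wilk normality test: a small p-value suggests departure from normality
shapiro.test(df$grade)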
Now that we have checked our dataset and made sure it is suitable to work with, it is time for some analysis. We start by visualizing the relationships between variables with scatter plots.
sc1 = ggplot(data = df, aes(x = books, y = grade)) + geom_point() + geom_smooth(method = lm, fill="blue", color="blue", se = FALSE) + xlab("Books") + ylab("Grade") + ggtitle("Relationship between books and grade")
sc2 = ggplot(data = df, aes(x = attend, y = grade)) + geom_point() + geom_smooth(method = lm, fill="blue", color="blue", se = FALSE) + xlab("Attendance") + ylab("Grade") + ggtitle("Relationship between attendance and grade")
sc3 = ggplot(data = df, aes(x = attend, y = books)) + geom_point() + geom_smooth(method=lm, fill="blue", color="blue", se = FALSE) + xlab("Attendance") + ylab("Books") + ggtitle("Relationship between attendance and books")
plot_grid(sc1, sc2, sc3)
It is worth mentioning that scatter plots visualize linear relationships between the dependent variable and the independent variables. Ideally, a scatter plot is drawn for each predictor against the response, along with the line of best fit, as we can clearly see on our graphs.
Correlation is a statistical measure that suggests the level of linear dependence between two variables that occur in a pair, and we have three such pairs to deal with.
We will have a look at them in this fine visualisation:
# run the libraries to create this graph
library(sjlabelled)
##
## Attaching package: 'sjlabelled'
## The following objects are masked from 'package:haven':
##
## as_factor, read_sas, read_spss, read_stata, write_sas,
## zap_labels
library(sjmisc)
library(sjstats)
##
## Attaching package: 'sjstats'
## The following objects are masked from 'package:psych':
##
## pca, phi
library(ggeffects)
library(sjPlot)
##
## Attaching package: 'sjPlot'
## The following objects are masked from 'package:cowplot':
##
## plot_grid, save_plot
q = cor(df)
sjp.corr(df, show.legend = TRUE)## Computing correlation using pearson-method with listwise-deletion...
## Warning: Removed 6 rows containing missing values (geom_text).
From what we can see, all the relationships between our variables:
- have a positive direction (which can be read from the sign in front of each number);
- have medium strength, since they are bigger than 0.3, and each is marked as significant (the ** next to each coefficient).
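If you want the exact coefficient together with a significance test for a single pair, cor.test() reports the estimate, its confidence interval and a p-value; a minimal sketch for one pair (our addition, the plot above already summarizes all three):
# Pearson correlation test for one pair of variables
cor.test(df$books, df$grade)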
Modeling Time! Since we have seen the linear relationships pictorially in the scatter plots and by computing the correlations, it is time to build the models.
model1 = lm(grade ~ books, data = df)
model2 = lm(grade ~ attend, data = df)
model3 = lm(grade ~ books + attend, data = df)
To analyze them, we need the function `tab_model()`:
library(snakecase)
sjPlot::tab_model(model1)

*grade*

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 52.07 | 44.17 – 59.98 | <0.001 |
| books | 5.74 | 2.51 – 8.97 | 0.001 |
| Observations | 40 | | |
| R2 / adjusted R2 | 0.242 / 0.222 | | |
sjPlot::tab_model(model2)

*grade*

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 37.00 | 20.99 – 53.01 | <0.001 |
| attend | 1.88 | 0.80 – 2.97 | 0.002 |
| Observations | 40 | | |
| R2 / adjusted R2 | 0.233 / 0.212 | | |
sjPlot::tab_model(model3)

*grade*

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 37.38 | 22.20 – 52.56 | <0.001 |
| books | 4.04 | 0.60 – 7.47 | 0.027 |
| attend | 1.28 | 0.13 – 2.43 | 0.035 |
| Observations | 40 | | |
| R2 / adjusted R2 | 0.329 / 0.292 | | |
These models predict the variable grade from the two numerical variables attend and books.
Every variable comes with a p-value: if it is below 0.05, the variable is significant in this model; if it is above 0.05, it is not. Basically, the p-value lets us check whether we went wrong somewhere in the early stages, so that we can once again be convinced that everything is fine with our data and model.
Adjusted R-squared, like Multiple R-squared, shows how much better our model is than the average model (0 - you did an oopsie, 1 - the model completely describes the data), but it also makes an adjustment for the number of variables: R-squared will always increase as the number of variables grows, while Adjusted R-squared corrects for this. The higher its value the better, but if it is suspiciously high then, most likely, something went wrong. It can also be interpreted as the proportion of variance in the data that can be explained by the model.
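To demystify that "proportion explained" reading, R-squared can be recomputed by hand from model1 (a small sketch of ours; the result should match the Multiple R-squared in the table above):
rss <- sum(residuals(model1)^2)           # variation the model leaves unexplained
tss <- sum((df$grade - mean(df$grade))^2) # total variation in grade
1 - rss / tss                             # R-squared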
As mentioned earlier, among the coefficients we are interested not only in the p-values and variable names but also in the estimates.
Estimate is the value of b: as the independent variable increases by one, the predicted variable should, according to the model, increase by this very estimate (and, respectively, decrease by it when the independent variable decreases by one).
In general, a linear model is written like this:
\[Y = b_0 + b_1 x_1 + e\]
Thus, in these three cases the formulas go like this: \[grade_1 = 52.07 + 5.74 \cdot books + e\] \[grade_2 = 37.00 + 1.88 \cdot attend + e\] \[grade_3 = 37.38 + 4.04 \cdot books + 1.28 \cdot attend + e\]
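These formulas are exactly what predict() evaluates for you. For example, the expected grade of a hypothetical student who read 3 books and attended 15 classes (the numbers are ours, purely for illustration):
# Plug books = 3, attend = 15 into model3's formula
predict(model3, newdata = data.frame(books = 3, attend = 15))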
Sometimes it is useful to center a predictor (subtract its mean), so that the intercept becomes the predicted grade at average attendance rather than at zero attendance:
df$attend_cent = df$attend - mean(df$attend)
model2.1 = lm(grade ~ attend_cent, data = df)
sjPlot::tab_model(model2.1)

*grade*

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 63.55 | 58.96 – 68.14 | <0.001 |
| attend cent | 1.88 | 0.80 – 2.97 | 0.002 |
| Observations | 40 | | |
| R2 / adjusted R2 | 0.233 / 0.212 | | |
sjPlot::tab_model(model2) # compare with the non-centered version of the same model

*grade*

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 37.00 | 20.99 – 53.01 | <0.001 |
| attend | 1.88 | 0.80 – 2.97 | 0.002 |
| Observations | 40 | | |
| R2 / adjusted R2 | 0.233 / 0.212 | | |
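Note what changed and what did not: centering leaves the slope (and R-squared) untouched and only moves the intercept, which now equals the mean grade. A quick check:
coef(model2)    # intercept = predicted grade at attend = 0
coef(model2.1)  # intercept = predicted grade at average attendance
mean(df$grade)  # ... which equals the mean grade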
ANOVA helps us to compare nested models: models in which everything is the same, but one of them adds one or more variables that the other does not take into account.
Here we look at the p-value and the RSS (residual sum of squares). If the p-value is less than 0.05, we should prefer the model with the lower RSS. If the p-value is bigger than 0.05, then it does not matter which model we pick - they are both equally good.
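The RSS can also be pulled out directly: for an lm object, deviance() returns exactly the residual sum of squares shown in the anova() table (a small sketch of ours):
deviance(model1)  # RSS of the one-predictor model
deviance(model3)  # RSS of the two-predictor model (should be lower)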
anova(model1, model3)
anova(model2, model3)
As we can see, the p-value is less than 0.05 in both ANOVA tests, so we should consider model3 the better one in both cases.
Linear regression makes several assumptions about the data, such as:
- linearity of the relationship between predictors and outcome;
- normality of the residuals;
- homogeneity of the residual variance (homoscedasticity);
- independence of the residuals.
All these assumptions can be checked by producing diagnostic plots that visualize the residual errors. We would like to analyze model3, as it is the best model according to ANOVA.
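Alongside the plots below, one quick numeric companion (our addition) is to flag observations whose standardized residuals fall outside roughly ±2; these are candidates for a closer look, not automatic deletions:
# Row numbers of observations with unusually large residuals in model3
which(abs(rstandard(model3)) > 2)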
par(mfrow = c(2, 2))
plot(model3)
Conclusion for model3: the Residuals vs Fitted line is roughly flat, so the relationship is linear, as can be clearly seen on our plot; the Q-Q plot follows the diagonal fairly closely, so the residuals look close to normal; and the Scale-Location plot shows an even spread of points - once again, no problems with our graphs!
Despite the fact that in the previous step we chose the better model, we would also like to draw your attention to the assumptions of the other two models.
We will give only a brief overview, but we strongly recommend that you work through them yourself with the help of the algorithm presented earlier.
par(mfrow = c(2, 2))
plot(model1)
Conclusion for model1: the Residuals vs Fitted plot doesn't look as good as in the previous model, since the line is curved, though it is more or less OK. The same goes for the Q-Q plot, which is not as straight as we would wish. The other two plots look pretty good.
par(mfrow = c(2, 2))
plot(model2)
Conclusion for model2: here again two plots look nice, but the other two, Residuals vs Fitted and Scale-Location, do not look that fine. The red lines on both of these plots are curved and the dots are not evenly spread, so there are no good signs of variance homogeneity.
Now it is time to work with another dataset!
- We are going to focus on the specific features of multiple linear regression.
- This part of the guide will be more practical than theoretical, so that you get more practice with coding and interpretation.
ESS5e03 <- read_sav("D:/Documents/ESS5e03.sav")
data = ESS5e03
savevars <- c("eduyrs", "agea", "agertr", "gndr",
"cntry", "icmnact", "hinctnta")
ESS1 <- data[savevars]
rm(data, savevars)
The dataset has multiple variables containing data about each respondent's gender, age, years spent in education, income level and desired retirement age. It is noticeable that the subject of people's desire to retire has been studied with deep interest over the last decade, and a great amount of psychological and social research has been conducted on it.
This research names various factors that influence the desired retirement age; the most significant are, surely, family factors, physical well-being and the socioeconomic context. It would be interesting to see whether the data in our dataset confirm these findings or not.
First, let's select only the variables that are of interest for the further analysis.
ESS2 = ESS1 %>%
select(eduyrs, agea, agertr, gndr, cntry, icmnact, hinctnta)
Next, we should assign the appropriate type to them.
ESS2$eduyrs = as.numeric(as.character(ESS2$eduyrs))
ESS2$agertr = as.numeric(as.character(ESS2$agertr))
ESS2$agea = as.numeric(as.character(ESS2$agea))
ESS2$hinctnta <- as.numeric(ESS2$hinctnta)
Let us take a look at our dataset. Here you can see the structure of the dataset and a basic summary of our data.
str(ESS2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 52458 obs. of 7 variables:
## $ eduyrs : num 15 15 13 15 15 13 17 13 12 12 ...
## $ agea : num 22 43 19 23 58 62 26 40 48 68 ...
## $ agertr : num NA NA NA NA 60 60 NA NA 58 61 ...
## $ gndr : 'haven_labelled' num 1 1 2 2 2 2 1 1 1 1 ...
## ..- attr(*, "label")= chr "Gender"
## ..- attr(*, "format.spss")= chr "F1.0"
## ..- attr(*, "display_width")= int 6
## ..- attr(*, "labels")= Named num 1 2 9
## .. ..- attr(*, "names")= chr "Male" "Female" "No answer"
## $ cntry : 'haven_labelled' chr "BE" "BE" "BE" "BE" ...
## ..- attr(*, "label")= chr "Country"
## ..- attr(*, "format.spss")= chr "A2"
## ..- attr(*, "display_width")= int 7
## ..- attr(*, "labels")= Named chr "UA" "GB" "BE" "DE" ...
## .. ..- attr(*, "names")= chr "Ukraine" "United Kingdom" "Belgium" "Germany" ...
## $ icmnact : 'haven_labelled' num 3 1 3 1 3 2 1 1 1 2 ...
## ..- attr(*, "label")= chr "Interviewer code, main activity of respondent"
## ..- attr(*, "format.spss")= chr "F1.0"
## ..- attr(*, "display_width")= int 9
## ..- attr(*, "labels")= Named num 1 2 3 9
## .. ..- attr(*, "names")= chr "In paid work" "Retired" "All others" "Not available"
## $ hinctnta: num NA 5 1 10 6 5 6 2 6 8 ...
summary(ESS2)
##     eduyrs           agea           agertr           gndr
## Min. : 0.00 Min. : 14.00 Min. : 0.00 Min. :1.000
## 1st Qu.:10.00 1st Qu.: 33.00 1st Qu.: 56.00 1st Qu.:1.000
## Median :12.00 Median : 48.00 Median : 60.00 Median :2.000
## Mean :12.29 Mean : 48.51 Mean : 60.04 Mean :1.546
## 3rd Qu.:15.00 3rd Qu.: 63.00 3rd Qu.: 64.00 3rd Qu.:2.000
## Max. :55.00 Max. :102.00 Max. :120.00 Max. :2.000
## NA's :629 NA's :153 NA's :27902 NA's :21
## cntry icmnact hinctnta
## Length:52458 Min. :1.000 Min. : 1.000
## Class :haven_labelled 1st Qu.:1.000 1st Qu.: 3.000
## Mode :character Median :2.000 Median : 5.000
## Mean :1.812 Mean : 5.049
## 3rd Qu.:3.000 3rd Qu.: 7.000
## Max. :3.000 Max. :10.000
## NA's :134 NA's :12620
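The summary already reveals substantial missingness (agertr alone has 27,902 NAs); a compact way to count missing values per column, added here for convenience:
# Number of NAs in each column of ESS2
colSums(is.na(ESS2))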
Next, here is a description of the meaning of each variable.
Label <- c("`eduyrs`", "`agea`", "`agertr`", "`gndr`", "`cntry`", "`icmnact`", "`hinctnta`")
Meaning <- c("Years of education completed", "Age of the respondent", "At what age the respondent would like to retire", "Gender of the respondent", "Country", "Working activity of the respondent", "Household's total net income")
fd <- data.frame(Label, Meaning, stringsAsFactors = FALSE)
kable(fd) %>%
kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)

| Label | Meaning |
|---|---|
| `eduyrs` | Years of education completed |
| `agea` | Age of the respondent |
| `agertr` | At what age the respondent would like to retire |
| `gndr` | Gender of the respondent |
| `cntry` | Country |
| `icmnact` | Working activity of the respondent |
| `hinctnta` | Household's total net income |
In the graph below, we immediately look at all of the following:
- how the quantitative variables are distributed: here it is clearly seen that variables 1 and 3 are spread out quite evenly across their range;
- the correlation coefficients (which are not really impressive, by the way), so we can also immediately look at the numerical values (see the sketch after this list). The strongest correlations, certainly, are between agertr and eduyrs, as well as between agertr and hinctnta.
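Before the plot, the same numbers can be obtained directly with cor(); note use = "pairwise.complete.obs", which handles the many missing values in this dataset (this chunk is our addition):
# Correlation matrix of the quantitative variables, ignoring NAs pairwise
cor(ESS2[, c("eduyrs", "agea", "agertr", "hinctnta")],
    use = "pairwise.complete.obs")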
# helper passed to ggpairs: scatter plot with a fitted regression line in each lower panel
my_fn1 <- function(data, mapping, ...){
p1 <- ggplot(data = data, mapping = mapping) +
geom_point() +
geom_smooth(method=lm, fill="blue", color="blue", ...)
p1
}
library(GGally)
g1 = ggpairs(ESS2, columns = c(1,2,3,7), lower = list(continuous = my_fn1), title= "1")
g1
Next, we keep only respondents currently in paid work (icmnact == 1), drop rows with missing values, and turn gender into a factor:
ESS4 <- na.omit(subset(ESS2, icmnact==1))
ESS4$gndr = as.factor(ESS4$gndr)
Use `summary()` to get a more detailed summary than previously (but also a bigger one).
Linear Regression Model with 1 predictor
m1 <- lm(agertr ~ agea, data = ESS4)
summary(m1)
##
## Call:
## lm(formula = agertr ~ agea, data = ESS4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.250 -2.506 0.195 2.505 58.728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.33278 0.56274 84.11 <2e-16 ***
## agea 0.24455 0.01035 23.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.169 on 7425 degrees of freedom
## Multiple R-squared: 0.06996, Adjusted R-squared: 0.06983
## F-statistic: 558.5 on 1 and 7425 DF, p-value: < 2.2e-16
If you want to add a predictor to the model, you should add it with + after the first predictor.
Linear Regression Model with 2 predictors
m2 <- lm(agertr ~ agea + gndr, data = ESS4)
summary(m2)
##
## Call:
## lm(formula = agertr ~ agea + gndr, data = ESS4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.962 -2.723 -0.002 2.384 57.993
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.40151 0.56280 86.00 <2e-16 ***
## agea 0.23869 0.01025 23.30 <2e-16 ***
## gndr2 -1.52691 0.11879 -12.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.113 on 7424 degrees of freedom
## Multiple R-squared: 0.09021, Adjusted R-squared: 0.08996
## F-statistic: 368 on 2 and 7424 DF, p-value: < 2.2e-16
anova(m1, m2)
Linear Regression Model with 3 predictors
m3 <- lm(agertr ~ agea + gndr + hinctnta, data = ESS4)
summary(m3)
##
## Call:
## lm(formula = agertr ~ agea + gndr + hinctnta, data = ESS4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.321 -2.667 0.033 2.424 57.785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.27044 0.59474 79.481 < 2e-16 ***
## agea 0.24357 0.01026 23.745 < 2e-16 ***
## gndr2 -1.44434 0.11939 -12.097 < 2e-16 ***
## hinctnta 0.13259 0.02295 5.776 7.95e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.102 on 7423 degrees of freedom
## Multiple R-squared: 0.09428, Adjusted R-squared: 0.09391
## F-statistic: 257.6 on 3 and 7423 DF, p-value: < 2.2e-16
anova(m2, m3)
Linear Regression Model with 4 predictors
m4 <- lm(agertr ~ agea + gndr + hinctnta + eduyrs, data = ESS4)
summary(m4)
##
## Call:
## lm(formula = agertr ~ agea + gndr + hinctnta + eduyrs, data = ESS4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.964 -2.634 0.073 2.459 57.756
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.74656 0.62295 73.436 < 2e-16 ***
## agea 0.24686 0.01022 24.145 < 2e-16 ***
## gndr2 -1.52126 0.11930 -12.751 < 2e-16 ***
## hinctnta 0.07234 0.02410 3.002 0.0027 **
## eduyrs 0.13201 0.01672 7.896 3.31e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.081 on 7422 degrees of freedom
## Multiple R-squared: 0.1018, Adjusted R-squared: 0.1013
## F-statistic: 210.3 on 4 and 7422 DF, p-value: < 2.2e-16
anova(m3, m4)
If X1 and X2 interact, this means that the effect of X1 on Y depends on the value of X2, and vice versa.
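For illustration only, this is how an interaction is written in R: the * operator adds both main effects and their product, letting (for example) the age slope differ between genders. We have not checked whether this actually improves on m4:
m_int <- lm(agertr ~ agea * gndr, data = ESS4)  # agea + gndr + agea:gndr
summary(m_int)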
Let us write down the formula of the most interesting model - m4:
\[agertr = 45.75 + 0.25 \cdot agea - 1.52 \cdot gndr2 + 0.07 \cdot hinctnta + 0.13 \cdot eduyrs + e\]
Then again, as in the previous case, we build the models and run ANOVA tests to pick the best one. All of the p-values are less than 0.05, so the models are certainly not equally good; in every comparison the second (larger) model has the lower RSS, so m4 comes out on top.
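Besides nested ANOVA, an information criterion gives a quick one-line ranking (our addition): lower AIC is better, with a built-in penalty for extra predictors.
AIC(m1, m2, m3, m4)  # the model with the smallest AIC is preferred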
Here we will not analyze every step; we just have a look at the most interesting model, with a bunch of predictors, to practice once again.
par(mfrow = c(2, 2))
plot(m4)
We hope this little guide has been useful and that you now have a full understanding of how to conduct linear regression models and of what is going on in this process. Thank you for your attention!