Hello there! In this teaching guide we (Iana Borisenko and Anastasia Bakhareva) will introduce you to linear regression models and teach you how to conduct and analyze them properly!
First of all, we would like to introduce the concept of linear regression by answering the most common questions:
The aim of linear regression is to model a mathematical function, so that we can use this model to predict Y when only X is known.
A linear regression model is needed when you want to predict the value of an output variable Y based on one or more input predictor variables X.
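To make this concrete, here is a minimal sketch with made-up numbers (not our dataset yet): we fit a line to a handful of x/y pairs and then ask the model for predictions at new values of x.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)  # made-up values, roughly y = 2x
toy_model <- lm(y ~ x)            # estimate intercept and slope
predict(toy_model, newdata = data.frame(x = c(6, 7)))  # predict Y for unseen X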
Now it is time to start our teaching guide to linear regression modeling. We will demonstrate how to conduct linear regression in a simple, easy-to-understand way.
First, load the libraries that will be needed for reading and manipulating our data and for building the graphs.
library(dplyr)
library(ggplot2)
library(stringr)
library(readr)
library(knitr)
library(psych)
library(magrittr)
library(kableExtra)
library(tidytext)
library(RColorBrewer)
library(haven)
Next, let's load the dataset and have a look at its structure.
df <- read_csv("D:/Documents/data1_1.csv")
str(df)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 40 obs. of 3 variables:
## $ books : num 0 1 0 2 4 4 1 4 3 0 ...
## $ attend: num 9 15 10 16 10 20 11 20 15 15 ...
## $ grade : num 45 57 45 51 65 88 44 87 89 59 ...
## - attr(*, "spec")=
## .. cols(
## .. books = col_double(),
## .. attend = col_double(),
## .. grade = col_double()
## .. )
Then, it is necessary to make sure each variable has the correct (numeric) type.
df$books = as.numeric(as.character(df$books))
df$attend = as.numeric(as.character(df$attend))
df$grade = as.numeric(as.character(df$grade))
We can glance at the main characteristics of our dataset with the function `summary()`.
summary(df)
##     books       attend          grade
## Min. :0 Min. : 6.00 Min. :37.00
## 1st Qu.:1 1st Qu.:10.00 1st Qu.:51.00
## Median :2 Median :15.00 Median :60.50
## Mean :2 Mean :14.10 Mean :63.55
## 3rd Qu.:3 3rd Qu.:18.25 3rd Qu.:73.00
## Max. :4 Max. :20.00 Max. :97.00
Seems legit! Now it is time to check for outliers, which we can easily do with graphs.
We construct boxplots as follows:
par(mfrow=c(1, 3))
boxplot(df$books, main="books", sub=paste("Outlier rows: ", boxplot.stats(df$books)$out))
boxplot(df$attend, main="attendance", sub=paste("Outlier rows: ", boxplot.stats(df$attend)$out))
boxplot(df$grade, main="grades", sub=paste("Outlier rows: ", boxplot.stats(df$grade)$out))
The next step in our preparation is a density plot, which helps us check whether the distribution of each variable is close to normal.
# use additional libraries to put all graphs into one image
library(cowplot)
##
## Attaching package: 'cowplot'
## The following object is masked from 'package:ggplot2':
##
## ggsave
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
bo = ggplot(data = df, aes(x = books)) + geom_density(col = "pink", fill = "pink", alpha = 0.5) + xlab("Books") + ggtitle("Books read by students")
at = ggplot(data = df, aes(x = attend)) + geom_density(col = "blue", fill = "blue", alpha = 0.5) + xlab("Attendance") + ggtitle("Attendance of students")
gr = ggplot(data = df, aes(x = grade)) + geom_density(col = "yellow", fill = "yellow", alpha = 0.5) + xlab("Grades") + ggtitle("Grades of students")
plot_grid(bo, at, gr)
The plots for books and attendance look pretty good, though the grades plot is not that perfect. However, we can surely work with that.
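If you want a numeric check to back up the visual impression, one common option (our addition; the visual check is usually enough for a teaching example) is the Shapiro-Wilk test, whose null hypothesis is that the variable is normally distributed:
# Shapiro-Wilk normality test: a small p-value suggests departure from normality
shapiro.test(df$grade)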
Now that we have checked our dataset and made sure it is suitable to work with, it is time for some analysis. We start by visualizing the relationships between variables with scatter plots.
sc1 = ggplot(data = df, aes(x = books, y = grade)) + geom_point() + geom_smooth(method = lm, fill="blue", color="blue", se = FALSE) + xlab("Books") + ylab("Grade") + ggtitle("Relationship between books and grade")
sc2 = ggplot(data = df, aes(x = attend, y = grade)) + geom_point() + geom_smooth(method = lm, fill="blue", color="blue", se = FALSE) + xlab("Attendance") + ylab("Grade") + ggtitle("Relationship between attendance and grade")
sc3 = ggplot(data = df, aes(x = attend, y = books)) + geom_point() + geom_smooth(method=lm, fill="blue", color="blue", se = FALSE) + xlab("Attendance") + ylab("Books") + ggtitle("Relationship between attendance and books")
plot_grid(sc1, sc2, sc3)
It is worth mentioning that scatter plots visualize linear relationships between the dependent variable and the independent variables. Ideally, a scatter plot is drawn for each predictor against the response, along with the line of best fit, as we can clearly see on our graphs.
Correlation is a statistical measure that suggests the level of linear dependence between two variables that occur in a pair, and we have three such pairs to deal with.
We will have a look at them in this fine visualisation:
# run the libraries to create this graph
library(sjlabelled)
##
## Attaching package: 'sjlabelled'
## The following objects are masked from 'package:haven':
##
## as_factor, read_sas, read_spss, read_stata, write_sas,
## zap_labels
library(sjmisc)
library(sjstats)
##
## Attaching package: 'sjstats'
## The following objects are masked from 'package:psych':
##
## pca, phi
library(ggeffects)
library(sjPlot)
##
## Attaching package: 'sjPlot'
## The following objects are masked from 'package:cowplot':
##
## plot_grid, save_plot
q = cor(df)
sjp.corr(df, show.legend = TRUE)## Computing correlation using pearson-method with listwise-deletion...
## Warning: Removed 6 rows containing missing values (geom_text).
From what we can see, all the relationships between our variables:
- have a positive direction (which can be read from the sign in front of each number);
- have medium strength, since they are bigger than 0.3, and each is marked as significant (the ** next to each coefficient).
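If you want the exact coefficient together with a significance test for a single pair, cor.test() reports the estimate, its confidence interval and a p-value; a minimal sketch for one pair (our addition, the plot above already summarizes all three):
# Pearson correlation test for one pair of variables
cor.test(df$books, df$grade)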
Modeling Time! Since we have seen the linear relationships pictorially in the scatter plots and by computing the correlations, it is time to build the models.
model1 = lm(grade ~ books, data = df)
model2 = lm(grade ~ attend, data = df)
model3 = lm(grade ~ books + attend, data = df)
To analyze them, we need the function `tab_model()`:
library(snakecase)
sjPlot::tab_model(model1)

*grade*

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 52.07 | 44.17 – 59.98 | <0.001 |
| books | 5.74 | 2.51 – 8.97 | 0.001 |
| Observations | 40 | | |
| R2 / adjusted R2 | 0.242 / 0.222 | | |
sjPlot::tab_model(model2)

*grade*

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 37.00 | 20.99 – 53.01 | <0.001 |
| attend | 1.88 | 0.80 – 2.97 | 0.002 |
| Observations | 40 | | |
| R2 / adjusted R2 | 0.233 / 0.212 | | |
sjPlot::tab_model(model3)

*grade*

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 37.38 | 22.20 – 52.56 | <0.001 |
| books | 4.04 | 0.60 – 7.47 | 0.027 |
| attend | 1.28 | 0.13 – 2.43 | 0.035 |
| Observations | 40 | | |
| R2 / adjusted R2 | 0.329 / 0.292 | | |
These models predict the variable grade from the two numerical variables attend and books.
Every variable comes with a p-value: if it is below 0.05, the variable is significant in this model; if it is above 0.05, it is not. Basically, the p-value lets us check whether we went wrong somewhere in the early stages, so that we can once again be convinced that everything is fine with our data and model.
Adjusted R-squared, like Multiple R-squared, shows how much better our model is than the average model (0 - you did an oopsie, 1 - the model completely describes the data), but it also makes an adjustment for the number of variables: R-squared will always increase as the number of variables grows, while Adjusted R-squared corrects for this. The higher its value the better, but if it is suspiciously high then, most likely, something went wrong. It can also be interpreted as the proportion of variance in the data that can be explained by the model.
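To demystify that "proportion explained" reading, R-squared can be recomputed by hand from model1 (a small sketch of ours; the result should match the Multiple R-squared in the table above):
rss <- sum(residuals(model1)^2)           # variation the model leaves unexplained
tss <- sum((df$grade - mean(df$grade))^2) # total variation in grade
1 - rss / tss                             # R-squared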
As mentioned earlier, among the coefficients we are interested not only in the p-values and variable names but also in the estimates.
Estimate is the value of b: as the independent variable increases by one, the predicted variable should, according to the model, increase by this very estimate (and, respectively, decrease by it when the independent variable decreases by one).
In general, a linear model is written like this:
\[Y = b_0 + b_1 x_1 + e\]
Thus, in these three cases the formulas go like this: \[grade_1 = 52.07 + 5.74 \cdot books + e\] \[grade_2 = 37.00 + 1.88 \cdot attend + e\] \[grade_3 = 37.38 + 4.04 \cdot books + 1.28 \cdot attend + e\]
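These formulas are exactly what predict() evaluates for you. For example, the expected grade of a hypothetical student who read 3 books and attended 15 classes (the numbers are ours, purely for illustration):
# Plug books = 3, attend = 15 into model3's formula
predict(model3, newdata = data.frame(books = 3, attend = 15))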
Sometimes it is useful to center a predictor (subtract its mean), so that the intercept becomes the predicted grade at average attendance rather than at zero attendance:
df$attend_cent = df$attend - mean(df$attend)
model2.1 = lm(grade ~ attend_cent, data = df)
sjPlot::tab_model(model2.1)

*grade*

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 63.55 | 58.96 – 68.14 | <0.001 |
| attend cent | 1.88 | 0.80 – 2.97 | 0.002 |
| Observations | 40 | | |
| R2 / adjusted R2 | 0.233 / 0.212 | | |
sjPlot::tab_model(model2) # compare with the non-centered version of the same model

*grade*

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 37.00 | 20.99 – 53.01 | <0.001 |
| attend | 1.88 | 0.80 – 2.97 | 0.002 |
| Observations | 40 | | |
| R2 / adjusted R2 | 0.233 / 0.212 | | |
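Note what changed and what did not: centering leaves the slope (and R-squared) untouched and only moves the intercept, which now equals the mean grade. A quick check:
coef(model2)    # intercept = predicted grade at attend = 0
coef(model2.1)  # intercept = predicted grade at average attendance
mean(df$grade)  # ... which equals the mean grade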
ANOVA helps us to compare nested models: models in which everything is the same, but one of them adds one or more variables that the other does not take into account.
Here we look at the p-value and the RSS (residual sum of squares). If the p-value is less than 0.05, we should prefer the model with the lower RSS. If the p-value is bigger than 0.05, then it does not matter which model we pick - they are both equally good.
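The RSS can also be pulled out directly: for an lm object, deviance() returns exactly the residual sum of squares shown in the anova() table (a small sketch of ours):
deviance(model1)  # RSS of the one-predictor model
deviance(model3)  # RSS of the two-predictor model (should be lower)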
anova(model1, model3)
anova(model2, model3)
As we can see, the p-value is less than 0.05 in both ANOVA tests, so we should consider model3 the better one in both cases.
Linear regression makes several assumptions about the data, such as:
- linearity of the relationship between predictors and outcome;
- normality of the residuals;
- homogeneity of the residual variance (homoscedasticity);
- independence of the residuals.
All these assumptions can be checked by producing diagnostic plots that visualize the residual errors. We would like to analyze model3, as it is the best model according to ANOVA.
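Alongside the plots below, one quick numeric companion (our addition) is to flag observations whose standardized residuals fall outside roughly ±2; these are candidates for a closer look, not automatic deletions:
# Row numbers of observations with unusually large residuals in model3
which(abs(rstandard(model3)) > 2)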
par(mfrow = c(2, 2))
plot(model3)
Conclusion for model3: the Residuals vs Fitted line is roughly flat, so the relationship is linear, as can be clearly seen on our plot; the Q-Q plot follows the diagonal fairly closely, so the residuals look close to normal; and the Scale-Location plot shows an even spread of points - once again, no problems with our graphs!
Despite the fact that in the previous step we chose the better model, we would also like to draw your attention to the assumptions of the other two models.
We will give only a brief overview, but we strongly recommend that you work through them yourself with the help of the algorithm presented earlier.
par(mfrow = c(2, 2))
plot(model1)
Conclusion for model1: the Residuals vs Fitted plot doesn't look as good as in the previous model, since the line is curved, though it is more or less OK. The same goes for the Q-Q plot, which is not as straight as we would wish. The other two plots look pretty good.
par(mfrow = c(2, 2))
plot(model2)
Conclusion for model2: here again two plots look nice, but the other two, Residuals vs Fitted and Scale-Location, do not look that fine. The red lines on both of these plots are curved and the dots are not evenly spread, so there are no good signs of variance homogeneity.
Now it is time to work with another dataset!
- We are going to focus on the specific features of multiple linear regression.
- This part of the guide will be more practical than theoretical, so that you get more practice with coding and interpretation.
ESS5e03 <- read_sav("D:/Documents/ESS5e03.sav")
data = ESS5e03
savevars <- c("eduyrs", "agea", "agertr", "gndr",
"cntry", "icmnact", "hinctnta")
ESS1 <- data[savevars]
rm(data, savevars)
The dataset has multiple variables containing data about each respondent's gender, age, years spent in education, income level and desired retirement age. It is noticeable that the subject of people's desire to retire has been studied with deep interest over the last decade, and a great amount of psychological and social research has been conducted on it.
This research names various factors that influence the desired retirement age; the most significant are, surely, family factors, physical well-being and the socioeconomic context. It would be interesting to see whether the data in our dataset confirm these findings or not.
First, let's select only the variables that are of interest for the further analysis.
ESS2 = ESS1 %>%
select(eduyrs, agea, agertr, gndr, cntry, icmnact, hinctnta)
Next, we should assign the appropriate type to them.
ESS2$eduyrs = as.numeric(as.character(ESS2$eduyrs))
ESS2$agertr = as.numeric(as.character(ESS2$agertr))
ESS2$agea = as.numeric(as.character(ESS2$agea))
ESS2$hinctnta <- as.numeric(ESS2$hinctnta)
Let us take a look at our dataset. Here you can see the structure of the dataset and a basic summary of our data.
str(ESS2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 52458 obs. of 7 variables:
## $ eduyrs : num 15 15 13 15 15 13 17 13 12 12 ...
## $ agea : num 22 43 19 23 58 62 26 40 48 68 ...
## $ agertr : num NA NA NA NA 60 60 NA NA 58 61 ...
## $ gndr : 'haven_labelled' num 1 1 2 2 2 2 1 1 1 1 ...
## ..- attr(*, "label")= chr "Gender"
## ..- attr(*, "format.spss")= chr "F1.0"
## ..- attr(*, "display_width")= int 6
## ..- attr(*, "labels")= Named num 1 2 9
## .. ..- attr(*, "names")= chr "Male" "Female" "No answer"
## $ cntry : 'haven_labelled' chr "BE" "BE" "BE" "BE" ...
## ..- attr(*, "label")= chr "Country"
## ..- attr(*, "format.spss")= chr "A2"
## ..- attr(*, "display_width")= int 7
## ..- attr(*, "labels")= Named chr "UA" "GB" "BE" "DE" ...
## .. ..- attr(*, "names")= chr "Ukraine" "United Kingdom" "Belgium" "Germany" ...
## $ icmnact : 'haven_labelled' num 3 1 3 1 3 2 1 1 1 2 ...
## ..- attr(*, "label")= chr "Interviewer code, main activity of respondent"
## ..- attr(*, "format.spss")= chr "F1.0"
## ..- attr(*, "display_width")= int 9
## ..- attr(*, "labels")= Named num 1 2 3 9
## .. ..- attr(*, "names")= chr "In paid work" "Retired" "All others" "Not available"
## $ hinctnta: num NA 5 1 10 6 5 6 2 6 8 ...
summary(ESS2)
##     eduyrs           agea           agertr           gndr
## Min. : 0.00 Min. : 14.00 Min. : 0.00 Min. :1.000
## 1st Qu.:10.00 1st Qu.: 33.00 1st Qu.: 56.00 1st Qu.:1.000
## Median :12.00 Median : 48.00 Median : 60.00 Median :2.000
## Mean :12.29 Mean : 48.51 Mean : 60.04 Mean :1.546
## 3rd Qu.:15.00 3rd Qu.: 63.00 3rd Qu.: 64.00 3rd Qu.:2.000
## Max. :55.00 Max. :102.00 Max. :120.00 Max. :2.000
## NA's :629 NA's :153 NA's :27902 NA's :21
## cntry icmnact hinctnta
## Length:52458 Min. :1.000 Min. : 1.000
## Class :haven_labelled 1st Qu.:1.000 1st Qu.: 3.000
## Mode :character Median :2.000 Median : 5.000
## Mean :1.812 Mean : 5.049
## 3rd Qu.:3.000 3rd Qu.: 7.000
## Max. :3.000 Max. :10.000
## NA's :134 NA's :12620
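The summary already reveals substantial missingness (agertr alone has 27,902 NAs); a compact way to count missing values per column, added here for convenience:
# Number of NAs in each column of ESS2
colSums(is.na(ESS2))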
Next, here is a description of the meaning of each variable.
Label <- c("`eduyrs`", "`agea`", "`agertr`", "`gndr`", "`cntry`", "`icmnact`", "`hinctnta`")
Meaning <- c("Years of education completed", "Age of the respondent", "At what age the respondent would like to retire", "Gender of the respondent", "Country", "Working activity of the respondent", "Household's total net income")
fd <- data.frame(Label, Meaning, stringsAsFactors = FALSE)
kable(fd) %>%
kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)

| Label | Meaning |
|---|---|
| `eduyrs` | Years of education completed |
| `agea` | Age of the respondent |
| `agertr` | At what age the respondent would like to retire |
| `gndr` | Gender of the respondent |
| `cntry` | Country |
| `icmnact` | Working activity of the respondent |
| `hinctnta` | Household's total net income |
In the graph below, we immediately look at all of the following:
- how the quantitative variables are distributed: here it is clearly seen that variables 1 and 3 are spread out quite evenly across their range;
- the correlation coefficients (which are not really impressive, by the way), so we can also immediately look at the numerical values (see the sketch after this list). The strongest correlations, certainly, are between agertr and eduyrs, as well as between agertr and hinctnta.
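Before the plot, the same numbers can be obtained directly with cor(); note use = "pairwise.complete.obs", which handles the many missing values in this dataset (this chunk is our addition):
# Correlation matrix of the quantitative variables, ignoring NAs pairwise
cor(ESS2[, c("eduyrs", "agea", "agertr", "hinctnta")],
    use = "pairwise.complete.obs")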
# helper passed to ggpairs: scatter plot with a fitted regression line in each lower panel
my_fn1 <- function(data, mapping, ...){
p1 <- ggplot(data = data, mapping = mapping) +
geom_point() +
geom_smooth(method=lm, fill="blue", color="blue", ...)
p1
}
library(GGally)
g1 = ggpairs(ESS2, columns = c(1,2,3,7), lower = list(continuous = my_fn1), title= "1")
g1
Next, we keep only respondents currently in paid work (icmnact == 1), drop rows with missing values, and turn gender into a factor:
ESS4 <- na.omit(subset(ESS2, icmnact==1))
ESS4$gndr = as.factor(ESS4$gndr)
Use `summary()` to get a more detailed summary than previously (but also a bigger one).
Linear Regression Model with 1 predictor
m1 <- lm(agertr ~ agea, data = ESS4)
summary(m1)
##
## Call:
## lm(formula = agertr ~ agea, data = ESS4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.250 -2.506 0.195 2.505 58.728
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.33278 0.56274 84.11 <2e-16 ***
## agea 0.24455 0.01035 23.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.169 on 7425 degrees of freedom
## Multiple R-squared: 0.06996, Adjusted R-squared: 0.06983
## F-statistic: 558.5 on 1 and 7425 DF, p-value: < 2.2e-16
If you want to add a predictor to the model, you should add it with + after the first predictor.
Linear Regression Model with 2 predictors
m2 <- lm(agertr ~ agea + gndr, data = ESS4)
summary(m2)
##
## Call:
## lm(formula = agertr ~ agea + gndr, data = ESS4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -57.962 -2.723 -0.002 2.384 57.993
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.40151 0.56280 86.00 <2e-16 ***
## agea 0.23869 0.01025 23.30 <2e-16 ***
## gndr2 -1.52691 0.11879 -12.85 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.113 on 7424 degrees of freedom
## Multiple R-squared: 0.09021, Adjusted R-squared: 0.08996
## F-statistic: 368 on 2 and 7424 DF, p-value: < 2.2e-16
anova(m1, m2)
Linear Regression Model with 3 predictors
m3 <- lm(agertr ~ agea + gndr + hinctnta, data = ESS4)
summary(m3)
##
## Call:
## lm(formula = agertr ~ agea + gndr + hinctnta, data = ESS4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.321 -2.667 0.033 2.424 57.785
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 47.27044 0.59474 79.481 < 2e-16 ***
## agea 0.24357 0.01026 23.745 < 2e-16 ***
## gndr2 -1.44434 0.11939 -12.097 < 2e-16 ***
## hinctnta 0.13259 0.02295 5.776 7.95e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.102 on 7423 degrees of freedom
## Multiple R-squared: 0.09428, Adjusted R-squared: 0.09391
## F-statistic: 257.6 on 3 and 7423 DF, p-value: < 2.2e-16
anova(m2, m3)
Linear Regression Model with 4 predictors
m4 <- lm(agertr ~ agea + gndr + hinctnta + eduyrs, data = ESS4)
summary(m4)
##
## Call:
## lm(formula = agertr ~ agea + gndr + hinctnta + eduyrs, data = ESS4)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.964 -2.634 0.073 2.459 57.756
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.74656 0.62295 73.436 < 2e-16 ***
## agea 0.24686 0.01022 24.145 < 2e-16 ***
## gndr2 -1.52126 0.11930 -12.751 < 2e-16 ***
## hinctnta 0.07234 0.02410 3.002 0.0027 **
## eduyrs 0.13201 0.01672 7.896 3.31e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.081 on 7422 degrees of freedom
## Multiple R-squared: 0.1018, Adjusted R-squared: 0.1013
## F-statistic: 210.3 on 4 and 7422 DF, p-value: < 2.2e-16
anova(m3, m4)
If X1 and X2 interact, this means that the effect of X1 on Y depends on the value of X2, and vice versa.
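For illustration only, this is how an interaction is written in R: the * operator adds both main effects and their product, letting (for example) the age slope differ between genders. We have not checked whether this actually improves on m4:
m_int <- lm(agertr ~ agea * gndr, data = ESS4)  # agea + gndr + agea:gndr
summary(m_int)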
Let us write down the formula of the most interesting model - m4:
\[agertr = 45.75 + 0.25 \cdot agea - 1.52 \cdot gndr2 + 0.07 \cdot hinctnta + 0.13 \cdot eduyrs + e\]
Then again, as in the previous case, we build the models and run ANOVA tests to pick the best one. All of the p-values are less than 0.05, so the models are certainly not equally good; in every comparison the second (larger) model has the lower RSS, so m4 comes out on top.
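Besides nested ANOVA, an information criterion gives a quick one-line ranking (our addition): lower AIC is better, with a built-in penalty for extra predictors.
AIC(m1, m2, m3, m4)  # the model with the smallest AIC is preferred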
Here we will not analyze every step; we just have a look at the most interesting model, with a bunch of predictors, to practice once again.
par(mfrow = c(2, 2))
plot(m4)
We hope this little guide has been useful and that you now have a full understanding of how to conduct linear regression models and of what is going on in this process. Thank you for your attention!