Data 1

Data from Applying Regression and Correlation, Miles and Shevlin (2001).

Variables:

library(foreign)
library(sjPlot)
library(sjmisc)
library(sjlabelled)
library(ggplot2)
df <- read.csv("data1_1.csv")
summary(df)
##      books       attend          grade      
##  Min.   :0   Min.   : 6.00   Min.   :37.00  
##  1st Qu.:1   1st Qu.:10.00   1st Qu.:51.00  
##  Median :2   Median :15.00   Median :60.50  
##  Mean   :2   Mean   :14.10   Mean   :63.55  
##  3rd Qu.:3   3rd Qu.:18.25   3rd Qu.:73.00  
##  Max.   :4   Max.   :20.00   Max.   :97.00

Vizualizing the data

From the histogram above it can be seen, that there are equal amount of people (8), who have read 0, 1, 2, 3, 4 books

From the histogram above it can be seen, that the most frequently recorded amount of attended classes is 20, then goes 15 and then 10. Moreover, an outlier can be seen in the left, it tells us that there is the only one student who attended less than 10 classes during the course.

Correlations between variables

sjp.corr(df)

It can be seen that correlation coefficients are approximately 0,5 => it is possible to say that we have medium correlation between variables

Let’s construct scatterplots in order to investigate the correlation deeper:

From the scatterplot it can be seen, that those students, who attended more lectures were more likely to get a higher grade

The scatterplot again shows that those who read more books during the course were more likely to get higher grades

Building models

model1 <- lm(grade ~ books, data = df)
tab_model(model1)
  grade
Predictors Estimates CI p
(Intercept) 52.07 44.17 – 59.98 <0.001
books 5.74 2.51 – 8.97 0.001
Observations 40
R2 / adjusted R2 0.242 / 0.222

Regression equation for the first model: the predicted value of grade = 52,07 + 5,74 * books

model2 <- lm(grade ~ attend, data = df)
tab_model(model2)
  grade
Predictors Estimates CI p
(Intercept) 37.00 20.99 – 53.01 <0.001
attend 1.88 0.80 – 2.97 0.002
Observations 40
R2 / adjusted R2 0.233 / 0.212

Regression equation for the second model: the predicted value of grade = 37,00 + 1,88 * attend

model3 <- lm(grade ~ books + attend, data = df)
tab_model(model3)
  grade
Predictors Estimates CI p
(Intercept) 37.38 22.20 – 52.56 <0.001
books 4.04 0.60 – 7.47 0.027
attend 1.28 0.13 – 2.43 0.035
Observations 40
R2 / adjusted R2 0.329 / 0.292

Regression equation for the third model: the predicted value of grade = 37,38 + 4,04 * books + 1,28 * attend

Judging by the Adjusted R-squared values we can claim that the third model has the highiest explanatory power, as its’ Adjusted R-squared value equals to 0.292Р±, which is larger than the other two

Comparing models

anova(model1, model3)
## Analysis of Variance Table
## 
## Model 1: grade ~ books
## Model 2: grade ~ books + attend
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     38 8250.4                              
## 2     37 7306.2  1    944.16 4.7814 0.03517 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the Analysis of Variance Table we can see, that Df indicates 1, which means, that a more complex model has just one additional parameter.

The p-value here estimates 0.03517, which is less than 0,05, therefore the more complex model (model3) is significantly better than the simplier model (model1), therefore we should favour the more complex one

anova(model2, model3)
## Analysis of Variance Table
## 
## Model 1: grade ~ attend
## Model 2: grade ~ books + attend
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     38 8353.4                              
## 2     37 7306.2  1    1047.1 5.3028 0.02701 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the Analysis of Variance Table we can see, that Df indicates 1, which means, that a more complex model has just one additional parameter.

The p-value here estimates 0.02701, which is less than 0,05, therefore the more complex model (model3) is significantly better than the simplier model (model2), therefore we should favour the more complex one

Plots for model3

Centering variables (shifting the zero to the mean)

df$attend_cent <- df$attend - mean(df$attend)
model4 <- lm(grade ~ attend_cent, data=df)
tab_model(model4)
  grade
Predictors Estimates CI p
(Intercept) 63.55 58.96 – 68.14 <0.001
attend cent 1.88 0.80 – 2.97 0.002
Observations 40
R2 / adjusted R2 0.233 / 0.212

Data 2

Data from the 5th wave of ESS

library(haven)
library(foreign)
ESS2 <- read.spss("ESS5e03.sav",  to.data.frame = TRUE)

Now from the whole dataset let’s choose the 7 variables, that are going to be observed:

savevars <- c("eduyrs", "agea", "agertr", "gndr", 
              "cntry", "icmnact", "hinctnta")
data = ESS2
ESS1 <- data[savevars]
rm(data, savevars)

Now let’s look at the head of the dataset:

head(ESS1)
##   eduyrs agea agertr   gndr   cntry      icmnact hinctnta
## 1     15   22   <NA>   Male Belgium   All others     <NA>
## 2     15   43   <NA>   Male Belgium In paid work        F
## 3     13   19   <NA> Female Belgium   All others        J
## 4     15   23   <NA> Female Belgium In paid work        H
## 5     15   58     60 Female Belgium   All others        S
## 6     13   62     60 Female Belgium      Retired        F

From the table it can be seen that the variable hinctnta is needed to be transformed in numeric form:

ESS1$hinctnta1<-as.numeric(ESS1$hinctnta)

Now the highiest income is 10 and the lowest is 1.

Then, let’s omit all the NAs from our dataset:

ESS2 <- na.omit(ESS1)

Now, when we have a new dataset with the needed variables and without NAs, let’s make a histogrm of households’ income since we are going to use the variable as a continuous predictor:

From the histogram it can be seen, that most households have net income less than 2.5. Nevertheless, the amount of households, which have medium net income isn’t much smaller, while the number of those, whose net income is higher than medium is aproximately twice lower.

Building models for data2

data <- read.spss("ESS5e03.sav", to.data.frame = TRUE)
savevars <- c("eduyrs", "agea", "agertr", "gndr", "hinctnta")
ESS2 <- data[savevars]
rm(data, savevars)
head(ESS2)
##   eduyrs agea agertr   gndr hinctnta
## 1     15   22   <NA>   Male     <NA>
## 2     15   43   <NA>   Male        F
## 3     13   19   <NA> Female        J
## 4     15   23   <NA> Female        H
## 5     15   58     60 Female        S
## 6     13   62     60 Female        F

Here we change those variables which have wrong data type from factor to numeric, we can use as.numeric(as.character()) functions to do so.

ESS1$hinctnta1<-as.numeric(ESS1$hinctnta) #decile income, 1 - min, 10 - max
ESS1$eduyrs1<-as.numeric(as.character(ESS1$eduyrs))
ESS1$agea1<-as.numeric(as.character(ESS1$agea))
ESS1$agertr1<-as.numeric(as.character(ESS1$agertr))

We have five main variables to use.

To clean the data, we can remove NA’s with na.omit() function:

ESS2 <- na.omit(ESS1)
model1 <- lm(agertr ~ agea, data = ESS2)
tab_model(model1)
  agertr
Predictors Estimates CI p
(Intercept) 48.00
agea 17 -10.00
agea 18 -7.33
agea 19 -2.67
agea 20 -10.00
agea 21 -10.00
agea 22 -15.00
agea 27 -5.00
agea 29 -10.00
agea 32 -20.00
agea 35 -5.00
agea 38 -17.50
agea 39 -20.00
agea 41 -17.33
agea 42 -8.00
agea 43 -10.00
agea 44 -13.00
agea 45 -10.13
agea 46 -10.21
agea 47 -11.00
agea 48 -10.47
agea 49 -10.90
agea 50 -10.90
agea 51 -10.83
agea 52 -10.64
agea 53 -10.70
agea 54 -10.25
agea 55 -10.40
agea 56 -10.30
agea 57 -10.23
agea 58 -9.69
agea 59 -10.09
agea 60 -9.72
agea 61 -9.89
agea 62 -9.59
agea 63 -9.83
agea 64 -9.61
agea 65 -9.10
agea 66 -9.17
agea 67 -9.00
agea 68 -9.16
agea 69 -9.61
agea 70 -9.61
agea 71 -9.93
agea 72 -9.84
agea 73 -9.94
agea 74 -10.59
agea 75 -9.26
agea 76 -9.84
agea 77 -9.42
agea 78 -9.75
agea 79 -9.77
agea 80 -9.85
agea 81 -9.45
agea 82 -9.68
agea 83 -9.53
agea 84 -9.44
agea 85 -8.39
agea 86 -8.74
agea 87 -9.77
agea 88 -10.08
agea 89 -7.56
agea 90 -8.18
agea 91 -9.64
agea 92 -5.11
agea 93 -6.40
agea 94 -7.43
agea 95 -7.14
agea 96 -8.33
agea 97 -10.00
agea 101 0.00
Observations 19624
model2 <- lm(agertr ~ agea + gndr, data = ESS2)
tab_model(model2)
  agertr
Predictors Estimates CI p
(Intercept) 50.11
agea 17 -11.41
agea 18 -8.04
agea 19 -4.08
agea 20 -12.11
agea 21 -10.00
agea 22 -17.11
agea 27 -7.11
agea 29 -12.11
agea 32 -20.00
agea 35 -7.11
agea 38 -18.56
agea 39 -22.11
agea 41 -18.04
agea 42 -10.11
agea 43 -11.06
agea 44 -13.00
agea 45 -11.01
agea 46 -11.28
agea 47 -11.94
agea 48 -11.53
agea 49 -11.92
agea 50 -11.87
agea 51 -11.79
agea 52 -11.65
agea 53 -11.66
agea 54 -11.26
agea 55 -11.28
agea 56 -11.32
agea 57 -11.15
agea 58 -10.71
agea 59 -11.07
agea 60 -10.74
agea 61 -10.87
agea 62 -10.55
agea 63 -10.79
agea 64 -10.56
agea 65 -10.01
agea 66 -10.19
agea 67 -10.01
agea 68 -10.17
agea 69 -10.62
agea 70 -10.64
agea 71 -10.90
agea 72 -10.79
agea 73 -10.86
agea 74 -11.56
agea 75 -10.23
agea 76 -10.73
agea 77 -10.45
agea 78 -10.50
agea 79 -10.62
agea 80 -10.75
agea 81 -10.46
agea 82 -10.56
agea 83 -10.40
agea 84 -10.11
agea 85 -9.34
agea 86 -9.33
agea 87 -10.57
agea 88 -10.68
agea 89 -8.75
agea 90 -8.82
agea 91 -10.21
agea 92 -6.28
agea 93 -7.39
agea 94 -7.73
agea 95 -7.14
agea 96 -10.45
agea 97 -10.00
agea 101 0.00
Female -2.11
Observations 19624
model3 <- lm(agertr ~ agea + gndr + hinctnta1, data = ESS2)
tab_model(model3)
  agertr
Predictors Estimates CI p
(Intercept) 49.16
agea 17 -11.73
agea 18 -8.05
agea 19 -4.01
agea 20 -11.44
agea 21 -10.14
agea 22 -17.45
agea 27 -6.73
agea 29 -12.30
agea 32 -20.29
agea 35 -6.73
agea 38 -18.87
agea 39 -22.02
agea 41 -18.20
agea 42 -10.16
agea 43 -11.51
agea 44 -12.78
agea 45 -11.02
agea 46 -11.27
agea 47 -11.88
agea 48 -11.47
agea 49 -11.85
agea 50 -11.78
agea 51 -11.67
agea 52 -11.57
agea 53 -11.56
agea 54 -11.14
agea 55 -11.17
agea 56 -11.22
agea 57 -10.99
agea 58 -10.57
agea 59 -10.92
agea 60 -10.55
agea 61 -10.65
agea 62 -10.35
agea 63 -10.59
agea 64 -10.31
agea 65 -9.75
agea 66 -9.92
agea 67 -9.72
agea 68 -9.87
agea 69 -10.28
agea 70 -10.31
agea 71 -10.53
agea 72 -10.41
agea 73 -10.46
agea 74 -11.17
agea 75 -9.82
agea 76 -10.29
agea 77 -10.04
agea 78 -10.07
agea 79 -10.21
agea 80 -10.28
agea 81 -10.00
agea 82 -10.12
agea 83 -9.91
agea 84 -9.67
agea 85 -8.89
agea 86 -8.84
agea 87 -10.11
agea 88 -10.15
agea 89 -8.29
agea 90 -8.38
agea 91 -9.76
agea 92 -5.79
agea 93 -6.98
agea 94 -7.21
agea 95 -6.75
agea 96 -10.11
agea 97 -9.28
agea 101 0.72
Female -2.02
hinctnta 1 0.14
Observations 19624
model4 <- lm(agertr ~ agea + gndr + hinctnta1 + eduyrs, data = ESS2)
tab_model(model4)
  agertr
Predictors Estimates CI p
(Intercept) 48.92
agea 17 -11.27
agea 18 -7.63
agea 19 -3.86
agea 20 -11.05
agea 21 -9.41
agea 22 -17.19
agea 27 -6.85
agea 29 -11.50
agea 32 -21.34
agea 35 -8.02
agea 38 -19.16
agea 39 -22.01
agea 41 -18.08
agea 42 -9.42
agea 43 -11.33
agea 44 -12.98
agea 45 -10.92
agea 46 -11.28
agea 47 -11.82
agea 48 -11.43
agea 49 -11.78
agea 50 -11.70
agea 51 -11.58
agea 52 -11.46
agea 53 -11.47
agea 54 -11.03
agea 55 -11.08
agea 56 -11.09
agea 57 -10.86
agea 58 -10.50
agea 59 -10.86
agea 60 -10.50
agea 61 -10.59
agea 62 -10.29
agea 63 -10.57
agea 64 -10.25
agea 65 -9.71
agea 66 -9.86
agea 67 -9.64
agea 68 -9.75
agea 69 -10.22
agea 70 -10.22
agea 71 -10.43
agea 72 -10.35
agea 73 -10.34
agea 74 -11.05
agea 75 -9.77
agea 76 -10.13
agea 77 -9.94
agea 78 -10.01
agea 79 -10.00
agea 80 -10.15
agea 81 -9.86
agea 82 -10.00
agea 83 -9.77
agea 84 -9.48
agea 85 -8.78
agea 86 -8.63
agea 87 -10.06
agea 88 -9.97
agea 89 -8.10
agea 90 -8.14
agea 91 -10.08
agea 92 -5.66
agea 93 -6.72
agea 94 -6.87
agea 95 -6.41
agea 96 -10.42
agea 97 -8.79
agea 101 1.07
Female -2.01
hinctnta 1 0.08
eduyrs 1 0.87
eduyrs 2 -1.19
eduyrs 3 -0.04
eduyrs 4 -0.96
eduyrs 5 -0.15
eduyrs 6 0.54
eduyrs 7 -0.06
eduyrs 8 -0.04
eduyrs 9 1.35
eduyrs 10 0.60
eduyrs 11 -0.21
eduyrs 12 -0.07
eduyrs 13 0.23
eduyrs 14 0.61
eduyrs 15 0.49
eduyrs 16 0.49
eduyrs 17 1.37
eduyrs 18 1.77
eduyrs 19 1.87
eduyrs 20 2.76
eduyrs 21 2.31
eduyrs 22 2.45
eduyrs 23 2.09
eduyrs 24 3.53
eduyrs 25 2.36
eduyrs 26 3.50
eduyrs 27 7.39
eduyrs 28 4.49
eduyrs 29 4.32
eduyrs 30 3.93
eduyrs 31 2.42
eduyrs 33 12.03
eduyrs 34 3.40
eduyrs 35 2.10
eduyrs 36 -8.39
eduyrs 40 6.54
eduyrs 42 10.09
eduyrs 43 1.65
eduyrs 45 1.39
eduyrs 47 4.05
eduyrs 50 -0.89
Observations 19624

Comparing models

anova(model1, model2)
## Analysis of Variance Table
## 
## Model 1: agertr ~ agea
## Model 2: agertr ~ agea + gndr
##   Res.Df RSS Df Sum of Sq F Pr(>F)
## 1  19553   0                      
## 2  19552   0  1         0
anova(model2, model3)
## Analysis of Variance Table
## 
## Model 1: agertr ~ agea + gndr
## Model 2: agertr ~ agea + gndr + hinctnta1
##   Res.Df RSS Df Sum of Sq F Pr(>F)
## 1  19552   0                      
## 2  19551   0  1         0
anova(model3, model4)
## Analysis of Variance Table
## 
## Model 1: agertr ~ agea + gndr + hinctnta1
## Model 2: agertr ~ agea + gndr + hinctnta1 + eduyrs
##   Res.Df RSS Df Sum of Sq F Pr(>F)
## 1  19551   0                      
## 2  19510   0 41         0