Data from Applying Regression and Correlation, Miles and Shevlin (2001).
Variables:
books = number of books read during the course
attend = number of classes attended during the course
grade = the grade the student received for the course
library(foreign)
library(sjPlot)
library(sjmisc)
library(sjlabelled)
library(ggplot2)
df <- read.csv("data1_1.csv")
summary(df)
## books attend grade
## Min. :0 Min. : 6.00 Min. :37.00
## 1st Qu.:1 1st Qu.:10.00 1st Qu.:51.00
## Median :2 Median :15.00 Median :60.50
## Mean :2 Mean :14.10 Mean :63.55
## 3rd Qu.:3 3rd Qu.:18.25 3rd Qu.:73.00
## Max. :4 Max. :20.00 Max. :97.00
books: from the table it can be seen that the maximum number of books read during the course was 4 and the minimum was 0; both the mean and the median equal 2.
attend: the maximum number of classes attended by a student during the course was 20 and the minimum was 6; the mean is 14.10 and the median is 15.00.
grade: the maximum grade for the course was 97.00 and the minimum was 37.00; the mean is 63.55 and the median is 60.50.
From the histogram above it can be seen that equal numbers of students (8 each) read 0, 1, 2, 3 and 4 books.
From the histogram above it can be seen that the most frequent number of attended classes is 20, followed by 15 and then 10. An outlier is also visible on the left: only one student attended fewer than 10 classes during the course.
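The code for these histograms is not shown in the text; a minimal ggplot2 sketch that would produce similar plots (a bin width of 1 is assumed here) is:
# Hypothetical reconstruction of the two histograms above
ggplot(df, aes(x = books)) +
  geom_histogram(binwidth = 1, colour = "white") +
  labs(x = "Books read during the course", y = "Number of students")
ggplot(df, aes(x = attend)) +
  geom_histogram(binwidth = 1, colour = "white") +
  labs(x = "Classes attended during the course", y = "Number of students")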
sjp.corr(df)
It can be seen that the correlation coefficients are approximately 0.5, so we can say there is a medium (moderate) correlation between the variables.
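The exact coefficients behind the plot can also be inspected numerically:
cor(df)  # Pearson correlation matrix of books, attend and grade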
Let’s construct scatterplots in order to investigate these relationships further:
From the first scatterplot it can be seen that students who attended more classes tended to receive higher grades.
The second scatterplot shows that students who read more books during the course also tended to receive higher grades.
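The scatterplot code is likewise not shown; a possible ggplot2 sketch, with a fitted regression line added for reference, is:
ggplot(df, aes(x = attend, y = grade)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
ggplot(df, aes(x = books, y = grade)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)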
model1 <- lm(grade ~ books, data = df)
tab_model(model1)
Dependent variable: grade

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 52.07 | 44.17 – 59.98 | <0.001 |
| books | 5.74 | 2.51 – 8.97 | 0.001 |
| Observations | 40 | ||
| R2 / adjusted R2 | 0.242 / 0.222 | ||
Regression equation for the first model: the predicted value of grade = 52.07 + 5.74 * books
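For example, a student who read 3 books is predicted to score about 52.07 + 5.74 * 3 ≈ 69.3 points; the same value can be obtained with predict():
predict(model1, newdata = data.frame(books = 3))  # about 69.3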
model2 <- lm(grade ~ attend, data = df)
tab_model(model2)
Dependent variable: grade

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 37.00 | 20.99 – 53.01 | <0.001 |
| attend | 1.88 | 0.80 – 2.97 | 0.002 |
| Observations | 40 | ||
| R2 / adjusted R2 | 0.233 / 0.212 | ||
Regression equation for the second model: the predicted value of grade = 37.00 + 1.88 * attend
model3 <- lm(grade ~ books + attend, data = df)
tab_model(model3)
Dependent variable: grade

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 37.38 | 22.20 – 52.56 | <0.001 |
| books | 4.04 | 0.60 – 7.47 | 0.027 |
| attend | 1.28 | 0.13 – 2.43 | 0.035 |
| Observations | 40 | ||
| R2 / adjusted R2 | 0.329 / 0.292 | ||
Regression equation for the third model: the predicted value of grade = 37.38 + 4.04 * books + 1.28 * attend
Judging by the adjusted R-squared values, we can claim that the third model has the highest explanatory power: its adjusted R-squared equals 0.292, which is larger than that of the other two models.
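If needed, the adjusted R-squared values can also be extracted directly from the fitted model objects for a side-by-side comparison:
summary(model1)$adj.r.squared  # grade ~ books
summary(model2)$adj.r.squared  # grade ~ attend
summary(model3)$adj.r.squared  # grade ~ books + attend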
anova(model1, model3)
## Analysis of Variance Table
##
## Model 1: grade ~ books
## Model 2: grade ~ books + attend
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 38 8250.4
## 2 37 7306.2 1 944.16 4.7814 0.03517 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the Analysis of Variance table we can see that Df equals 1, which means the more complex model has just one additional parameter.
The p-value is 0.03517, which is less than 0.05, so the more complex model (model3) fits significantly better than the simpler model (model1) and should therefore be preferred.
anova(model2, model3)
## Analysis of Variance Table
##
## Model 1: grade ~ attend
## Model 2: grade ~ books + attend
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 38 8353.4
## 2 37 7306.2 1 1047.1 5.3028 0.02701 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the Analysis of Variance table we can see that Df equals 1, which means the more complex model has just one additional parameter.
The p-value is 0.02701, which is less than 0.05, so the more complex model (model3) fits significantly better than the simpler model (model2) and should therefore be preferred.
Finally, let’s centre the attend variable around its mean:
df$attend_cent <- df$attend - mean(df$attend)
model4 <- lm(grade ~ attend_cent, data=df)
tab_model(model4)
Dependent variable: grade

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 63.55 | 58.96 – 68.14 | <0.001 |
| attend cent | 1.88 | 0.80 – 2.97 | 0.002 |
| Observations | 40 | ||
| R2 / adjusted R2 | 0.233 / 0.212 | ||
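Because attend was centred, the intercept of model4 (63.55) is the predicted grade for a student with average attendance; for a simple regression this coincides with the mean grade, which can be checked directly:
mean(df$grade)  # 63.55, equal to the intercept of model4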
Data from the 5th wave of the European Social Survey (ESS).
library(haven)
library(foreign)
ESS2 <- read.spss("ESS5e03.sav", to.data.frame = TRUE)
Now, from the whole dataset, let’s select the 7 variables that we are going to work with:
eduyrs = years of full-time education completed
agea = age of a person
agertr = what age you would like to/would have liked to retire
gndr = gender of a person
cntry = country
icmnact = main activity of respondent
hinctnta = household’s total net income, all sources
savevars <- c("eduyrs", "agea", "agertr", "gndr",
"cntry", "icmnact", "hinctnta")
data <- ESS2
ESS1 <- data[savevars]
rm(data, savevars)
Now let’s look at the head of the dataset:
head(ESS1)
## eduyrs agea agertr gndr cntry icmnact hinctnta
## 1 15 22 <NA> Male Belgium All others <NA>
## 2 15 43 <NA> Male Belgium In paid work F
## 3 13 19 <NA> Female Belgium All others J
## 4 15 23 <NA> Female Belgium In paid work H
## 5 15 58 60 Female Belgium All others S
## 6 13 62 60 Female Belgium Retired F
From the table it can be seen that the variable hinctnta needs to be converted to numeric form:
ESS1$hinctnta1 <- as.numeric(ESS1$hinctnta)
Now the highest income is coded 10 and the lowest is 1.
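Since as.numeric() applied to a factor returns the underlying level codes, it is worth checking that the recoded values indeed run from 1 to 10; a quick sanity check:
levels(ESS1$hinctnta)                 # original income categories
range(ESS1$hinctnta1, na.rm = TRUE)   # should be 1 and 10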
Then, let’s omit all the NAs from our dataset:
ESS2 <- na.omit(ESS1)
Now that we have a dataset with the required variables and no NAs, let’s make a histogram of household income, since we are going to use this variable as a continuous predictor:
From the histogram it can be seen that most households have a net income below 2.5. Nevertheless, the number of households with a medium net income is not much smaller, while the number of those with an above-medium net income is roughly half as large.
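The plotting code is not shown; a minimal ggplot2 sketch of how such a histogram could be produced (a bin width of 1 is assumed) is:
ggplot(ESS2, aes(x = hinctnta1)) +
  geom_histogram(binwidth = 1, colour = "white") +
  labs(x = "Household net income (decile)", y = "Number of households")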
data <- read.spss("ESS5e03.sav", to.data.frame = TRUE)
savevars <- c("eduyrs", "agea", "agertr", "gndr", "hinctnta")
ESS2 <- data[savevars]
rm(data, savevars)
head(ESS2)
## eduyrs agea agertr gndr hinctnta
## 1 15 22 <NA> Male <NA>
## 2 15 43 <NA> Male F
## 3 13 19 <NA> Female J
## 4 15 23 <NA> Female H
## 5 15 58 60 Female S
## 6 13 62 60 Female F
Here we change the variables that have the wrong data type from factor to numeric; we can use as.numeric(as.character()) to do so (a short illustration of why as.character() is needed follows the code below).
ESS1$hinctnta1 <- as.numeric(ESS1$hinctnta)  # decile income, 1 = min, 10 = max
ESS1$eduyrs1 <- as.numeric(as.character(ESS1$eduyrs))
ESS1$agea1 <- as.numeric(as.character(ESS1$agea))
ESS1$agertr1 <- as.numeric(as.character(ESS1$agertr))
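The difference matters because as.numeric() applied directly to a factor returns the internal level codes rather than the stored values; a small standalone illustration:
x <- factor(c("12", "15", "8"))
as.numeric(x)                # level codes: 1 2 3
as.numeric(as.character(x))  # actual values: 12 15 8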
We now have the five main variables we will use.
To clean the data, we remove the NAs with the na.omit() function:
ESS2 <- na.omit(ESS1)
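A quick way to see how many observations listwise deletion removed:
nrow(ESS1) - nrow(ESS2)  # number of rows dropped because of missing values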
model1 <- lm(agertr ~ agea, data = ESS2)  # agea enters as a factor here, so a separate coefficient is estimated for each age level
tab_model(model1)
Dependent variable: agertr

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 48.00 | ||
| agea 17 | -10.00 | ||
| agea 18 | -7.33 | ||
| agea 19 | -2.67 | ||
| agea 20 | -10.00 | ||
| agea 21 | -10.00 | ||
| agea 22 | -15.00 | ||
| agea 27 | -5.00 | ||
| agea 29 | -10.00 | ||
| agea 32 | -20.00 | ||
| agea 35 | -5.00 | ||
| agea 38 | -17.50 | ||
| agea 39 | -20.00 | ||
| agea 41 | -17.33 | ||
| agea 42 | -8.00 | ||
| agea 43 | -10.00 | ||
| agea 44 | -13.00 | ||
| agea 45 | -10.13 | ||
| agea 46 | -10.21 | ||
| agea 47 | -11.00 | ||
| agea 48 | -10.47 | ||
| agea 49 | -10.90 | ||
| agea 50 | -10.90 | ||
| agea 51 | -10.83 | ||
| agea 52 | -10.64 | ||
| agea 53 | -10.70 | ||
| agea 54 | -10.25 | ||
| agea 55 | -10.40 | ||
| agea 56 | -10.30 | ||
| agea 57 | -10.23 | ||
| agea 58 | -9.69 | ||
| agea 59 | -10.09 | ||
| agea 60 | -9.72 | ||
| agea 61 | -9.89 | ||
| agea 62 | -9.59 | ||
| agea 63 | -9.83 | ||
| agea 64 | -9.61 | ||
| agea 65 | -9.10 | ||
| agea 66 | -9.17 | ||
| agea 67 | -9.00 | ||
| agea 68 | -9.16 | ||
| agea 69 | -9.61 | ||
| agea 70 | -9.61 | ||
| agea 71 | -9.93 | ||
| agea 72 | -9.84 | ||
| agea 73 | -9.94 | ||
| agea 74 | -10.59 | ||
| agea 75 | -9.26 | ||
| agea 76 | -9.84 | ||
| agea 77 | -9.42 | ||
| agea 78 | -9.75 | ||
| agea 79 | -9.77 | ||
| agea 80 | -9.85 | ||
| agea 81 | -9.45 | ||
| agea 82 | -9.68 | ||
| agea 83 | -9.53 | ||
| agea 84 | -9.44 | ||
| agea 85 | -8.39 | ||
| agea 86 | -8.74 | ||
| agea 87 | -9.77 | ||
| agea 88 | -10.08 | ||
| agea 89 | -7.56 | ||
| agea 90 | -8.18 | ||
| agea 91 | -9.64 | ||
| agea 92 | -5.11 | ||
| agea 93 | -6.40 | ||
| agea 94 | -7.43 | ||
| agea 95 | -7.14 | ||
| agea 96 | -8.33 | ||
| agea 97 | -10.00 | ||
| agea 101 | 0.00 | ||
| Observations | 19624 | ||
model2 <- lm(agertr ~ agea + gndr, data = ESS2)
tab_model(model2)
Dependent variable: agertr

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 50.11 | ||
| agea 17 | -11.41 | ||
| agea 18 | -8.04 | ||
| agea 19 | -4.08 | ||
| agea 20 | -12.11 | ||
| agea 21 | -10.00 | ||
| agea 22 | -17.11 | ||
| agea 27 | -7.11 | ||
| agea 29 | -12.11 | ||
| agea 32 | -20.00 | ||
| agea 35 | -7.11 | ||
| agea 38 | -18.56 | ||
| agea 39 | -22.11 | ||
| agea 41 | -18.04 | ||
| agea 42 | -10.11 | ||
| agea 43 | -11.06 | ||
| agea 44 | -13.00 | ||
| agea 45 | -11.01 | ||
| agea 46 | -11.28 | ||
| agea 47 | -11.94 | ||
| agea 48 | -11.53 | ||
| agea 49 | -11.92 | ||
| agea 50 | -11.87 | ||
| agea 51 | -11.79 | ||
| agea 52 | -11.65 | ||
| agea 53 | -11.66 | ||
| agea 54 | -11.26 | ||
| agea 55 | -11.28 | ||
| agea 56 | -11.32 | ||
| agea 57 | -11.15 | ||
| agea 58 | -10.71 | ||
| agea 59 | -11.07 | ||
| agea 60 | -10.74 | ||
| agea 61 | -10.87 | ||
| agea 62 | -10.55 | ||
| agea 63 | -10.79 | ||
| agea 64 | -10.56 | ||
| agea 65 | -10.01 | ||
| agea 66 | -10.19 | ||
| agea 67 | -10.01 | ||
| agea 68 | -10.17 | ||
| agea 69 | -10.62 | ||
| agea 70 | -10.64 | ||
| agea 71 | -10.90 | ||
| agea 72 | -10.79 | ||
| agea 73 | -10.86 | ||
| agea 74 | -11.56 | ||
| agea 75 | -10.23 | ||
| agea 76 | -10.73 | ||
| agea 77 | -10.45 | ||
| agea 78 | -10.50 | ||
| agea 79 | -10.62 | ||
| agea 80 | -10.75 | ||
| agea 81 | -10.46 | ||
| agea 82 | -10.56 | ||
| agea 83 | -10.40 | ||
| agea 84 | -10.11 | ||
| agea 85 | -9.34 | ||
| agea 86 | -9.33 | ||
| agea 87 | -10.57 | ||
| agea 88 | -10.68 | ||
| agea 89 | -8.75 | ||
| agea 90 | -8.82 | ||
| agea 91 | -10.21 | ||
| agea 92 | -6.28 | ||
| agea 93 | -7.39 | ||
| agea 94 | -7.73 | ||
| agea 95 | -7.14 | ||
| agea 96 | -10.45 | ||
| agea 97 | -10.00 | ||
| agea 101 | 0.00 | ||
| Female | -2.11 | ||
| Observations | 19624 | ||
model3 <- lm(agertr ~ agea + gndr + hinctnta1, data = ESS2)
tab_model(model3)
Dependent variable: agertr

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 49.16 | ||
| agea 17 | -11.73 | ||
| agea 18 | -8.05 | ||
| agea 19 | -4.01 | ||
| agea 20 | -11.44 | ||
| agea 21 | -10.14 | ||
| agea 22 | -17.45 | ||
| agea 27 | -6.73 | ||
| agea 29 | -12.30 | ||
| agea 32 | -20.29 | ||
| agea 35 | -6.73 | ||
| agea 38 | -18.87 | ||
| agea 39 | -22.02 | ||
| agea 41 | -18.20 | ||
| agea 42 | -10.16 | ||
| agea 43 | -11.51 | ||
| agea 44 | -12.78 | ||
| agea 45 | -11.02 | ||
| agea 46 | -11.27 | ||
| agea 47 | -11.88 | ||
| agea 48 | -11.47 | ||
| agea 49 | -11.85 | ||
| agea 50 | -11.78 | ||
| agea 51 | -11.67 | ||
| agea 52 | -11.57 | ||
| agea 53 | -11.56 | ||
| agea 54 | -11.14 | ||
| agea 55 | -11.17 | ||
| agea 56 | -11.22 | ||
| agea 57 | -10.99 | ||
| agea 58 | -10.57 | ||
| agea 59 | -10.92 | ||
| agea 60 | -10.55 | ||
| agea 61 | -10.65 | ||
| agea 62 | -10.35 | ||
| agea 63 | -10.59 | ||
| agea 64 | -10.31 | ||
| agea 65 | -9.75 | ||
| agea 66 | -9.92 | ||
| agea 67 | -9.72 | ||
| agea 68 | -9.87 | ||
| agea 69 | -10.28 | ||
| agea 70 | -10.31 | ||
| agea 71 | -10.53 | ||
| agea 72 | -10.41 | ||
| agea 73 | -10.46 | ||
| agea 74 | -11.17 | ||
| agea 75 | -9.82 | ||
| agea 76 | -10.29 | ||
| agea 77 | -10.04 | ||
| agea 78 | -10.07 | ||
| agea 79 | -10.21 | ||
| agea 80 | -10.28 | ||
| agea 81 | -10.00 | ||
| agea 82 | -10.12 | ||
| agea 83 | -9.91 | ||
| agea 84 | -9.67 | ||
| agea 85 | -8.89 | ||
| agea 86 | -8.84 | ||
| agea 87 | -10.11 | ||
| agea 88 | -10.15 | ||
| agea 89 | -8.29 | ||
| agea 90 | -8.38 | ||
| agea 91 | -9.76 | ||
| agea 92 | -5.79 | ||
| agea 93 | -6.98 | ||
| agea 94 | -7.21 | ||
| agea 95 | -6.75 | ||
| agea 96 | -10.11 | ||
| agea 97 | -9.28 | ||
| agea 101 | 0.72 | ||
| Female | -2.02 | ||
| hinctnta 1 | 0.14 | ||
| Observations | 19624 | ||
model4 <- lm(agertr ~ agea + gndr + hinctnta1 + eduyrs, data = ESS2)
tab_model(model4)
Dependent variable: agertr

| Predictors | Estimates | CI | p |
|---|---|---|---|
| (Intercept) | 48.92 | ||
| agea 17 | -11.27 | ||
| agea 18 | -7.63 | ||
| agea 19 | -3.86 | ||
| agea 20 | -11.05 | ||
| agea 21 | -9.41 | ||
| agea 22 | -17.19 | ||
| agea 27 | -6.85 | ||
| agea 29 | -11.50 | ||
| agea 32 | -21.34 | ||
| agea 35 | -8.02 | ||
| agea 38 | -19.16 | ||
| agea 39 | -22.01 | ||
| agea 41 | -18.08 | ||
| agea 42 | -9.42 | ||
| agea 43 | -11.33 | ||
| agea 44 | -12.98 | ||
| agea 45 | -10.92 | ||
| agea 46 | -11.28 | ||
| agea 47 | -11.82 | ||
| agea 48 | -11.43 | ||
| agea 49 | -11.78 | ||
| agea 50 | -11.70 | ||
| agea 51 | -11.58 | ||
| agea 52 | -11.46 | ||
| agea 53 | -11.47 | ||
| agea 54 | -11.03 | ||
| agea 55 | -11.08 | ||
| agea 56 | -11.09 | ||
| agea 57 | -10.86 | ||
| agea 58 | -10.50 | ||
| agea 59 | -10.86 | ||
| agea 60 | -10.50 | ||
| agea 61 | -10.59 | ||
| agea 62 | -10.29 | ||
| agea 63 | -10.57 | ||
| agea 64 | -10.25 | ||
| agea 65 | -9.71 | ||
| agea 66 | -9.86 | ||
| agea 67 | -9.64 | ||
| agea 68 | -9.75 | ||
| agea 69 | -10.22 | ||
| agea 70 | -10.22 | ||
| agea 71 | -10.43 | ||
| agea 72 | -10.35 | ||
| agea 73 | -10.34 | ||
| agea 74 | -11.05 | ||
| agea 75 | -9.77 | ||
| agea 76 | -10.13 | ||
| agea 77 | -9.94 | ||
| agea 78 | -10.01 | ||
| agea 79 | -10.00 | ||
| agea 80 | -10.15 | ||
| agea 81 | -9.86 | ||
| agea 82 | -10.00 | ||
| agea 83 | -9.77 | ||
| agea 84 | -9.48 | ||
| agea 85 | -8.78 | ||
| agea 86 | -8.63 | ||
| agea 87 | -10.06 | ||
| agea 88 | -9.97 | ||
| agea 89 | -8.10 | ||
| agea 90 | -8.14 | ||
| agea 91 | -10.08 | ||
| agea 92 | -5.66 | ||
| agea 93 | -6.72 | ||
| agea 94 | -6.87 | ||
| agea 95 | -6.41 | ||
| agea 96 | -10.42 | ||
| agea 97 | -8.79 | ||
| agea 101 | 1.07 | ||
| Female | -2.01 | ||
| hinctnta 1 | 0.08 | ||
| eduyrs 1 | 0.87 | ||
| eduyrs 2 | -1.19 | ||
| eduyrs 3 | -0.04 | ||
| eduyrs 4 | -0.96 | ||
| eduyrs 5 | -0.15 | ||
| eduyrs 6 | 0.54 | ||
| eduyrs 7 | -0.06 | ||
| eduyrs 8 | -0.04 | ||
| eduyrs 9 | 1.35 | ||
| eduyrs 10 | 0.60 | ||
| eduyrs 11 | -0.21 | ||
| eduyrs 12 | -0.07 | ||
| eduyrs 13 | 0.23 | ||
| eduyrs 14 | 0.61 | ||
| eduyrs 15 | 0.49 | ||
| eduyrs 16 | 0.49 | ||
| eduyrs 17 | 1.37 | ||
| eduyrs 18 | 1.77 | ||
| eduyrs 19 | 1.87 | ||
| eduyrs 20 | 2.76 | ||
| eduyrs 21 | 2.31 | ||
| eduyrs 22 | 2.45 | ||
| eduyrs 23 | 2.09 | ||
| eduyrs 24 | 3.53 | ||
| eduyrs 25 | 2.36 | ||
| eduyrs 26 | 3.50 | ||
| eduyrs 27 | 7.39 | ||
| eduyrs 28 | 4.49 | ||
| eduyrs 29 | 4.32 | ||
| eduyrs 30 | 3.93 | ||
| eduyrs 31 | 2.42 | ||
| eduyrs 33 | 12.03 | ||
| eduyrs 34 | 3.40 | ||
| eduyrs 35 | 2.10 | ||
| eduyrs 36 | -8.39 | ||
| eduyrs 40 | 6.54 | ||
| eduyrs 42 | 10.09 | ||
| eduyrs 43 | 1.65 | ||
| eduyrs 45 | 1.39 | ||
| eduyrs 47 | 4.05 | ||
| eduyrs 50 | -0.89 | ||
| Observations | 19624 | ||
anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: agertr ~ agea
## Model 2: agertr ~ agea + gndr
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 19553 0
## 2 19552 0 1 0
anova(model2, model3)
## Analysis of Variance Table
##
## Model 1: agertr ~ agea + gndr
## Model 2: agertr ~ agea + gndr + hinctnta1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 19552 0
## 2 19551 0 1 0
anova(model3, model4)
## Analysis of Variance Table
##
## Model 1: agertr ~ agea + gndr + hinctnta1
## Model 2: agertr ~ agea + gndr + hinctnta1 + eduyrs
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 19551 0
## 2 19510 0 41 0