Regression

Data 1

Data from Applying Regression and Correlation, Miles and Shevlin (2001).

Variables:

books = books read during the course
attend = classes attended during the course
grade = a grade of a person

library(foreign)
library(sjPlot)
library(sjmisc)
library(sjlabelled)
library(ggplot2)

df <- read.csv("data1_1.csv")
summary(df)

##      books       attend          grade      
##  Min.   :0   Min.   : 6.00   Min.   :37.00  
##  1st Qu.:1   1st Qu.:10.00   1st Qu.:51.00  
##  Median :2   Median :15.00   Median :60.50  
##  Mean   :2   Mean   :14.10   Mean   :63.55  
##  3rd Qu.:3   3rd Qu.:18.25   3rd Qu.:73.00  
##  Max.   :4   Max.   :20.00   Max.   :97.00

books From the table it can be seen, that the maximal amoumt of books read during the course was 4 and the minimal was 0. Moreover, both Mean and Median measures equals to 2
attend From the table it can be seen, that the maximal amoumt of classes attended by a student during the course was 20 and the minimal was 6.Mean measure here equals to 14,10 and Median measure equals to 15,00
grade From the table it can be seen, that the maximal grade for the course was 97,00 and the minimal was 37,00. Mean measure here equals to 63,55 and Median measure equals to 60,50

Vizualizing the data

From the histogram above it can be seen, that there are equal amount of people (8), who have read 0, 1, 2, 3, 4 books

From the histogram above it can be seen, that the most frequently recorded amount of attended classes is 20, then goes 15 and then 10. Moreover, an outlier can be seen in the left, it tells us that there is the only one student who attended less than 10 classes during the course.

Correlations between variables

sjp.corr(df)

It can be seen that correlation coefficients are approximately 0,5 => it is possible to say that we have medium correlation between variables

Let’s construct scatterplots in order to investigate the correlation deeper:

From the scatterplot it can be seen, that those students, who attended more lectures were more likely to get a higher grade

The scatterplot again shows that those who read more books during the course were more likely to get higher grades

Building models

model1 <- lm(grade ~ books, data = df)
tab_model(model1)

	grade
Predictors	Estimates	CI	p
(Intercept)	52.07	44.17 – 59.98	<0.001
books	5.74	2.51 – 8.97	0.001
Observations	40
R² / adjusted R²	0.242 / 0.222

Regression equation for the first model: the predicted value of grade = 52,07 + 5,74 * books

model2 <- lm(grade ~ attend, data = df)
tab_model(model2)

	grade
Predictors	Estimates	CI	p
(Intercept)	37.00	20.99 – 53.01	<0.001
attend	1.88	0.80 – 2.97	0.002
Observations	40
R² / adjusted R²	0.233 / 0.212

Regression equation for the second model: the predicted value of grade = 37,00 + 1,88 * attend

model3 <- lm(grade ~ books + attend, data = df)
tab_model(model3)

	grade
Predictors	Estimates	CI	p
(Intercept)	37.38	22.20 – 52.56	<0.001
books	4.04	0.60 – 7.47	0.027
attend	1.28	0.13 – 2.43	0.035
Observations	40
R² / adjusted R²	0.329 / 0.292

Regression equation for the third model: the predicted value of grade = 37,38 + 4,04 * books + 1,28 * attend

Judging by the Adjusted R-squared values we can claim that the third model has the highiest explanatory power, as its’ Adjusted R-squared value equals to 0.292Р±, which is larger than the other two

Comparing models

anova(model1, model3)

## Analysis of Variance Table
## 
## Model 1: grade ~ books
## Model 2: grade ~ books + attend
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     38 8250.4                              
## 2     37 7306.2  1    944.16 4.7814 0.03517 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the Analysis of Variance Table we can see, that Df indicates 1, which means, that a more complex model has just one additional parameter.

The p-value here estimates 0.03517, which is less than 0,05, therefore the more complex model (model3) is significantly better than the simplier model (model1), therefore we should favour the more complex one

anova(model2, model3)

## Analysis of Variance Table
## 
## Model 1: grade ~ attend
## Model 2: grade ~ books + attend
##   Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
## 1     38 8353.4                              
## 2     37 7306.2  1    1047.1 5.3028 0.02701 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the Analysis of Variance Table we can see, that Df indicates 1, which means, that a more complex model has just one additional parameter.

The p-value here estimates 0.02701, which is less than 0,05, therefore the more complex model (model3) is significantly better than the simplier model (model2), therefore we should favour the more complex one

Plots for `model3`

Centering variables (shifting the zero to the mean)

df$attend_cent <- df$attend - mean(df$attend)
model4 <- lm(grade ~ attend_cent, data=df)
tab_model(model4)

	grade
Predictors	Estimates	CI	p
(Intercept)	63.55	58.96 – 68.14	<0.001
attend cent	1.88	0.80 – 2.97	0.002
Observations	40
R² / adjusted R²	0.233 / 0.212

Data 2

Data from the 5th wave of ESS

library(haven)
library(foreign)
ESS2 <- read.spss("ESS5e03.sav",  to.data.frame = TRUE)

Now from the whole dataset let’s choose the 7 variables, that are going to be observed:

eduyrs = years of full-time education completed
agea = age of a person
agertr = what age you would like to/would have liked to retire
gndr = gender of a person
cntry = country
icmnact = main activity of respondent
hinctnta = household’s total net income, all sources

savevars <- c("eduyrs", "agea", "agertr", "gndr", 
              "cntry", "icmnact", "hinctnta")
data = ESS2
ESS1 <- data[savevars]
rm(data, savevars)

Now let’s look at the head of the dataset:

head(ESS1)

##   eduyrs agea agertr   gndr   cntry      icmnact hinctnta
## 1     15   22   <NA>   Male Belgium   All others     <NA>
## 2     15   43   <NA>   Male Belgium In paid work        F
## 3     13   19   <NA> Female Belgium   All others        J
## 4     15   23   <NA> Female Belgium In paid work        H
## 5     15   58     60 Female Belgium   All others        S
## 6     13   62     60 Female Belgium      Retired        F

From the table it can be seen that the variable hinctnta is needed to be transformed in numeric form:

ESS1$hinctnta1<-as.numeric(ESS1$hinctnta)

Now the highiest income is 10 and the lowest is 1.

Then, let’s omit all the NAs from our dataset:

ESS2 <- na.omit(ESS1)

Now, when we have a new dataset with the needed variables and without NAs, let’s make a histogrm of households’ income since we are going to use the variable as a continuous predictor:

From the histogram it can be seen, that most households have net income less than 2.5. Nevertheless, the amount of households, which have medium net income isn’t much smaller, while the number of those, whose net income is higher than medium is aproximately twice lower.

Building models for data2

data <- read.spss("ESS5e03.sav", to.data.frame = TRUE)
savevars <- c("eduyrs", "agea", "agertr", "gndr", "hinctnta")
ESS2 <- data[savevars]
rm(data, savevars)
head(ESS2)

##   eduyrs agea agertr   gndr hinctnta
## 1     15   22   <NA>   Male     <NA>
## 2     15   43   <NA>   Male        F
## 3     13   19   <NA> Female        J
## 4     15   23   <NA> Female        H
## 5     15   58     60 Female        S
## 6     13   62     60 Female        F

Here we change those variables which have wrong data type from factor to numeric, we can use as.numeric(as.character()) functions to do so.

ESS1$hinctnta1<-as.numeric(ESS1$hinctnta) #decile income, 1 - min, 10 - max
ESS1$eduyrs1<-as.numeric(as.character(ESS1$eduyrs))
ESS1$agea1<-as.numeric(as.character(ESS1$agea))
ESS1$agertr1<-as.numeric(as.character(ESS1$agertr))

We have five main variables to use.

agertr1 - age of desired retirement
agea1 - age
eduyrs1 - total years of education
hinctnta1 - household’s total net income
gndr - gender

To clean the data, we can remove NA’s with na.omit() function:

ESS2 <- na.omit(ESS1)

model1 <- lm(agertr ~ agea, data = ESS2)
tab_model(model1)

	agertr
Predictors	Estimates	CI	p
(Intercept)	48.00
agea 17	-10.00
agea 18	-7.33
agea 19	-2.67
agea 20	-10.00
agea 21	-10.00
agea 22	-15.00
agea 27	-5.00
agea 29	-10.00
agea 32	-20.00
agea 35	-5.00
agea 38	-17.50
agea 39	-20.00
agea 41	-17.33
agea 42	-8.00
agea 43	-10.00
agea 44	-13.00
agea 45	-10.13
agea 46	-10.21
agea 47	-11.00
agea 48	-10.47
agea 49	-10.90
agea 50	-10.90
agea 51	-10.83
agea 52	-10.64
agea 53	-10.70
agea 54	-10.25
agea 55	-10.40
agea 56	-10.30
agea 57	-10.23
agea 58	-9.69
agea 59	-10.09
agea 60	-9.72
agea 61	-9.89
agea 62	-9.59
agea 63	-9.83
agea 64	-9.61
agea 65	-9.10
agea 66	-9.17
agea 67	-9.00
agea 68	-9.16
agea 69	-9.61
agea 70	-9.61
agea 71	-9.93
agea 72	-9.84
agea 73	-9.94
agea 74	-10.59
agea 75	-9.26
agea 76	-9.84
agea 77	-9.42
agea 78	-9.75
agea 79	-9.77
agea 80	-9.85
agea 81	-9.45
agea 82	-9.68
agea 83	-9.53
agea 84	-9.44
agea 85	-8.39
agea 86	-8.74
agea 87	-9.77
agea 88	-10.08
agea 89	-7.56
agea 90	-8.18
agea 91	-9.64
agea 92	-5.11
agea 93	-6.40
agea 94	-7.43
agea 95	-7.14
agea 96	-8.33
agea 97	-10.00
agea 101	0.00
Observations	19624

model2 <- lm(agertr ~ agea + gndr, data = ESS2)
tab_model(model2)

	agertr
Predictors	Estimates	CI	p
(Intercept)	50.11
agea 17	-11.41
agea 18	-8.04
agea 19	-4.08
agea 20	-12.11
agea 21	-10.00
agea 22	-17.11
agea 27	-7.11
agea 29	-12.11
agea 32	-20.00
agea 35	-7.11
agea 38	-18.56
agea 39	-22.11
agea 41	-18.04
agea 42	-10.11
agea 43	-11.06
agea 44	-13.00
agea 45	-11.01
agea 46	-11.28
agea 47	-11.94
agea 48	-11.53
agea 49	-11.92
agea 50	-11.87
agea 51	-11.79
agea 52	-11.65
agea 53	-11.66
agea 54	-11.26
agea 55	-11.28
agea 56	-11.32
agea 57	-11.15
agea 58	-10.71
agea 59	-11.07
agea 60	-10.74
agea 61	-10.87
agea 62	-10.55
agea 63	-10.79
agea 64	-10.56
agea 65	-10.01
agea 66	-10.19
agea 67	-10.01
agea 68	-10.17
agea 69	-10.62
agea 70	-10.64
agea 71	-10.90
agea 72	-10.79
agea 73	-10.86
agea 74	-11.56
agea 75	-10.23
agea 76	-10.73
agea 77	-10.45
agea 78	-10.50
agea 79	-10.62
agea 80	-10.75
agea 81	-10.46
agea 82	-10.56
agea 83	-10.40
agea 84	-10.11
agea 85	-9.34
agea 86	-9.33
agea 87	-10.57
agea 88	-10.68
agea 89	-8.75
agea 90	-8.82
agea 91	-10.21
agea 92	-6.28
agea 93	-7.39
agea 94	-7.73
agea 95	-7.14
agea 96	-10.45
agea 97	-10.00
agea 101	0.00
Female	-2.11
Observations	19624

model3 <- lm(agertr ~ agea + gndr + hinctnta1, data = ESS2)
tab_model(model3)

	agertr
Predictors	Estimates	CI	p
(Intercept)	49.16
agea 17	-11.73
agea 18	-8.05
agea 19	-4.01
agea 20	-11.44
agea 21	-10.14
agea 22	-17.45
agea 27	-6.73
agea 29	-12.30
agea 32	-20.29
agea 35	-6.73
agea 38	-18.87
agea 39	-22.02
agea 41	-18.20
agea 42	-10.16
agea 43	-11.51
agea 44	-12.78
agea 45	-11.02
agea 46	-11.27
agea 47	-11.88
agea 48	-11.47
agea 49	-11.85
agea 50	-11.78
agea 51	-11.67
agea 52	-11.57
agea 53	-11.56
agea 54	-11.14
agea 55	-11.17
agea 56	-11.22
agea 57	-10.99
agea 58	-10.57
agea 59	-10.92
agea 60	-10.55
agea 61	-10.65
agea 62	-10.35
agea 63	-10.59
agea 64	-10.31
agea 65	-9.75
agea 66	-9.92
agea 67	-9.72
agea 68	-9.87
agea 69	-10.28
agea 70	-10.31
agea 71	-10.53
agea 72	-10.41
agea 73	-10.46
agea 74	-11.17
agea 75	-9.82
agea 76	-10.29
agea 77	-10.04
agea 78	-10.07
agea 79	-10.21
agea 80	-10.28
agea 81	-10.00
agea 82	-10.12
agea 83	-9.91
agea 84	-9.67
agea 85	-8.89
agea 86	-8.84
agea 87	-10.11
agea 88	-10.15
agea 89	-8.29
agea 90	-8.38
agea 91	-9.76
agea 92	-5.79
agea 93	-6.98
agea 94	-7.21
agea 95	-6.75
agea 96	-10.11
agea 97	-9.28
agea 101	0.72
Female	-2.02
hinctnta 1	0.14
Observations	19624

model4 <- lm(agertr ~ agea + gndr + hinctnta1 + eduyrs, data = ESS2)
tab_model(model4)

	agertr
Predictors	Estimates	CI	p
(Intercept)	48.92
agea 17	-11.27
agea 18	-7.63
agea 19	-3.86
agea 20	-11.05
agea 21	-9.41
agea 22	-17.19
agea 27	-6.85
agea 29	-11.50
agea 32	-21.34
agea 35	-8.02
agea 38	-19.16
agea 39	-22.01
agea 41	-18.08
agea 42	-9.42
agea 43	-11.33
agea 44	-12.98
agea 45	-10.92
agea 46	-11.28
agea 47	-11.82
agea 48	-11.43
agea 49	-11.78
agea 50	-11.70
agea 51	-11.58
agea 52	-11.46
agea 53	-11.47
agea 54	-11.03
agea 55	-11.08
agea 56	-11.09
agea 57	-10.86
agea 58	-10.50
agea 59	-10.86
agea 60	-10.50
agea 61	-10.59
agea 62	-10.29
agea 63	-10.57
agea 64	-10.25
agea 65	-9.71
agea 66	-9.86
agea 67	-9.64
agea 68	-9.75
agea 69	-10.22
agea 70	-10.22
agea 71	-10.43
agea 72	-10.35
agea 73	-10.34
agea 74	-11.05
agea 75	-9.77
agea 76	-10.13
agea 77	-9.94
agea 78	-10.01
agea 79	-10.00
agea 80	-10.15
agea 81	-9.86
agea 82	-10.00
agea 83	-9.77
agea 84	-9.48
agea 85	-8.78
agea 86	-8.63
agea 87	-10.06
agea 88	-9.97
agea 89	-8.10
agea 90	-8.14
agea 91	-10.08
agea 92	-5.66
agea 93	-6.72
agea 94	-6.87
agea 95	-6.41
agea 96	-10.42
agea 97	-8.79
agea 101	1.07
Female	-2.01
hinctnta 1	0.08
eduyrs 1	0.87
eduyrs 2	-1.19
eduyrs 3	-0.04
eduyrs 4	-0.96
eduyrs 5	-0.15
eduyrs 6	0.54
eduyrs 7	-0.06
eduyrs 8	-0.04
eduyrs 9	1.35
eduyrs 10	0.60
eduyrs 11	-0.21
eduyrs 12	-0.07
eduyrs 13	0.23
eduyrs 14	0.61
eduyrs 15	0.49
eduyrs 16	0.49
eduyrs 17	1.37
eduyrs 18	1.77
eduyrs 19	1.87
eduyrs 20	2.76
eduyrs 21	2.31
eduyrs 22	2.45
eduyrs 23	2.09
eduyrs 24	3.53
eduyrs 25	2.36
eduyrs 26	3.50
eduyrs 27	7.39
eduyrs 28	4.49
eduyrs 29	4.32
eduyrs 30	3.93
eduyrs 31	2.42
eduyrs 33	12.03
eduyrs 34	3.40
eduyrs 35	2.10
eduyrs 36	-8.39
eduyrs 40	6.54
eduyrs 42	10.09
eduyrs 43	1.65
eduyrs 45	1.39
eduyrs 47	4.05
eduyrs 50	-0.89
Observations	19624

Comparing models

anova(model1, model2)

## Analysis of Variance Table
## 
## Model 1: agertr ~ agea
## Model 2: agertr ~ agea + gndr
##   Res.Df RSS Df Sum of Sq F Pr(>F)
## 1  19553   0                      
## 2  19552   0  1         0

anova(model2, model3)

## Analysis of Variance Table
## 
## Model 1: agertr ~ agea + gndr
## Model 2: agertr ~ agea + gndr + hinctnta1
##   Res.Df RSS Df Sum of Sq F Pr(>F)
## 1  19552   0                      
## 2  19551   0  1         0

anova(model3, model4)

## Analysis of Variance Table
## 
## Model 1: agertr ~ agea + gndr + hinctnta1
## Model 2: agertr ~ agea + gndr + hinctnta1 + eduyrs
##   Res.Df RSS Df Sum of Sq F Pr(>F)
## 1  19551   0                      
## 2  19510   0 41         0

Regression

Anya Gorobtsova, group 173

April,18 (seminar)

Data 1

Vizualizing the data

Correlations between variables

Building models

Comparing models

Plots for `model3`

Centering variables (shifting the zero to the mean)

Data 2

Building models for data2

Comparing models

Regression

Anya Gorobtsova, group 173

April,18 (seminar)

Data 1

Vizualizing the data

Correlations between variables

Building models

Comparing models

Plots for model3

Centering variables (shifting the zero to the mean)

Data 2

Building models for data2

Comparing models

Plots for `model3`