Introduction

In this work I consider TIMSS (Trends in Mathematics and Science Study) in 2015 for 8th grade. The focus of the analysis is on Russia. There are two analytical parts of the project: part with exploratory factor analysis and part with regression analysis. The aim of factor analysis part is to find out some latent factors that have effect on home possessions. And the research question for regression part is the following: what is more important for student achievement nowadays: books or computers?

Getting started

This section is devoted to preparation of data before analysis. First, it is necessary to upload packages and dataset.

library(foreign)
library(ggplot2)
library(dplyr)
library(psych)
library(knitr)
library(kableExtra)
library(DescTools)
library(car)
library(lmtest)
library(polycor)
library(corrplot)
library(GPArotation)
library(lm.beta)
data1 <- read.spss("BSGRUSM6.sav", to.data.frame = TRUE, use.value.labels = TRUE)

Then, we create a separate dataset only with variables that we will use in analysis and get rid of NA’s because EFA does not work with them.

data2<-data1[c("BSBG06A", "BSBG06B", "BSBG06C", "BSBG06D", "BSBG06E", "BSBG06F", "BSBG06G", "BSBG06H", "BSBG06I", "BSBG06J", "BSBG06K", "BSMMAT01", "BSBG04", "BSBGSCM", "BSBGSVM", "BSBG01")]

data3<-na.omit(data2)

Also, it is important to check type of variables to be sure they are right.

str(data3)

## 'data.frame':    4575 obs. of  16 variables:
##  $ BSBG06A : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 2 1 ...
##  $ BSBG06B : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 2 1 1 1 ...
##  $ BSBG06C : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ BSBG06D : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ BSBG06E : Factor w/ 2 levels "Yes","No": 1 1 2 1 1 1 1 1 1 1 ...
##  $ BSBG06F : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ BSBG06G : Factor w/ 2 levels "Yes","No": 1 2 2 2 1 2 2 1 2 1 ...
##  $ BSBG06H : Factor w/ 2 levels "Yes","No": 2 2 2 2 1 2 2 1 1 2 ...
##  $ BSBG06I : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 1 1 1 1 1 ...
##  $ BSBG06J : Factor w/ 2 levels "Yes","No": 1 2 2 1 1 2 2 2 2 1 ...
##  $ BSBG06K : Factor w/ 2 levels "Yes","No": 2 2 2 2 1 2 2 2 1 2 ...
##  $ BSMMAT01: Factor w/ 4775 levels "273.05727","282.6379",..: 3049 1640 1984 515 155 1013 1384 2896 1205 464 ...
##  $ BSBG04  : Factor w/ 5 levels "0вЂ“10 books",..: 2 4 2 3 5 3 2 2 3 5 ...
##  $ BSBGSCM : Factor w/ 132 levels "3.19621","3.36684",..: 86 54 67 92 48 38 6 6 39 38 ...
##  $ BSBGSVM : Factor w/ 86 levels "2.99893","4.33377",..: 83 56 75 24 86 53 61 43 56 43 ...
##  $ BSBG01  : Factor w/ 2 levels "Girl","Boy": 2 2 1 2 2 1 1 2 2 1 ...
##  - attr(*, "na.action")= 'omit' Named int [1:205] 8 26 92 111 114 116 148 246 319 354 ...
##   ..- attr(*, "names")= chr [1:205] "8" "26" "92" "111" ...

All variables are factors. The variables for factor analysis will be processed later. The variables BSMMAT01, BSBGSCM and BSBGSVM should be in numeric form, so we do transformation.

data3$BSMMAT01<-as.numeric(as.character(data3$BSMMAT01))
data3$BSBGSCM<-as.numeric(as.character(data3$BSBGSCM))
data3$BSBGSVM<-as.numeric(as.character(data3$BSBGSVM))

class(data3$BSMMAT01)

## [1] "numeric"

class(data3$BSBGSCM)

## [1] "numeric"

class(data3$BSBGSVM)

## [1] "numeric"

Finally, we split our dataset into two: the first one (data_fa) is for factor analysis and the second one (data_reg) is for regression analysis.

save1<-c("BSBG06A", "BSBG06B", "BSBG06C", "BSBG06D", "BSBG06E", "BSBG06F", "BSBG06G", "BSBG06H", "BSBG06I", "BSBG06J", "BSBG06K")
save2<-c("BSMMAT01", "BSBG04", "BSBG06A", "BSBG06B", "BSBGSCM", "BSBGSVM", "BSBG01")

data_fa <- data3[save1] 
data_reg <- data3[save2]

In addition, we need to recode variable BSBG04 because its values does not extracted properly from file with data.

data_reg$BSBG04a <- ifelse(data_reg$BSBG04=="0вЂ“10 books", "0-10 books",
                           ifelse(data_reg$BSBG04=="11вЂ“25 books", "11-25 books",
                                  ifelse(data_reg$BSBG04=="26вЂ“100 books","26-100 books",
                                         ifelse(data_reg$BSBG04=="101вЂ“200 books", "101-200 books",
                                                ifelse(data_reg$BSBG04=="More than 200", "More than 200", NA)))))

data_reg$BSBG04a <- factor(data_reg$BSBG04a, levels = c("0-10 books", "11-25 books", "26-100 books", "101-200 books", "More than 200"))

Data description

In this part of project, the variables for further analysis are described in detail. To get information about variables, press buttons.

Variables for regression analysis

There are 7 variables to use in a regression analysis. The predictor variables are amount of books at home (BSBG04), possession of own computer at home (BSBG06A), possession of shared computer at home (BSBG06B), gender (BSBG01), value of mathematics for a student (BSBGSVM) and student’s confidence in mathematics (BSBGSCM). The outcome variable is proficiency scores in mathematics and science (BSMMAT01). In order not to write this long name, I will use math achievement instead of proficiency scores in mathematics and science. To get information about variables, press buttons.

BSBG04

Amount of books at home (BSBG04) is one of predictors in further analysis. It is an ordinal variable. The distribution of the variable is presented below.

BSBG06A

Possession of own computer at home (BSBG06A) is one of predictors in further analysis. It is a binomial variable. The distribution of the variable is on the barchart below.

BSBG06B

Possession of computer shared with others at home (BSBG06B) is one of predictors in further analysis. It is a binomial variable. The distribution of the variable is on the barchart below.

BSBG01

Gender (BSBG01) is a variable used in construction of interaction effect as a moderator. It is a binomial variable with levels Male/Female. The distribution of the variable is on the barchart below.

BSBGSCM

Student’s confidence in mathematics (BSBSCM) is one of predictors in further analysis. It is a numeric variable. The descriptive statistics and distribution of the variable are below.

N	Mean	SD	Median	Min	Max	Skew	Kurtosis
4575	9.89	2	9.68	3.2	15.93	0.39	1.51

BSBGSVM

Value of mathematics for student (BSBGSVM) is one of predictors in further analysis. It is a numeric variable. The descriptive statistics and distribution of the variable are below.

N	Mean	SD	Median	Min	Max	Skew	Kurtosis
4575	9.38	1.88	9.07	3	13.65	0.26	0.39

BSBMMAT01

BSMMAT01 (outcome variable) is a plausible value for math achievement. The descriptive statistics and distribution are presented below.

N	Mean	SD	Median	Min	Max	Skew	Kurtosis
4575	538.39	79.37	540.55	273.06	819.82	-0.12	-0.21

Variales for EFA

For exploratory factor analysis there are 11 variables to use which represents answers on question “Do you have any of these things at your home?”

Variable	Description
BSBG06A	A computer or tablet of your own
BSBG06B	A computer or tablet that is shared with other people at home
BSBG06C	Study desk/table for your use
BSBG06D	Your own room
BSBG06E	Internet connection
BSBG06F	Your own mobile phone
BSBG06G	A gaming system
BSBG06H	Musical instruments
BSBG06I	A car
BSBG06J	A flat (or house) with 4 or more rooms
BSBG06K	Dishwasher

Here is the table with frequencies of answers which demonstrates that all variables are filled with responses. Since these variables are binomial, it is impossible to count mean, median and other descriptive statistics.

	BSBG06A	BSBG06B	BSBG06C	BSBG06D	BSBG06E	BSBG06F	BSBG06G	BSBG06H	BSBG06I	BSBG06J	BSBG06K
Yes	3889	3836	4198	3100	4422	4488	1205	1687	3108	2339	942
No	686	739	377	1475	153	87	3370	2888	1467	2236	3633

Regression analysis (part 1)

In this part of research I will try to predict math achievement (BSMMAT01) with several variables where some of them are connected with educational resources and others are about internal factors. All variables seems to be in the right format, so it is possible to proceed to analysis.

Building model

First, I build a multiple regression model with 5 predicting variables.

model1 <- lm(BSMMAT01 ~ BSBG04a + BSBG06A + BSBG06B + BSBGSCM + BSBGSVM, data = data_reg)
summary(model1)

## 
## Call:
## lm(formula = BSMMAT01 ~ BSBG04a + BSBG06A + BSBG06B + BSBGSCM + 
##     BSBGSVM, data = data_reg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -269.602  -45.662    0.645   46.235  235.808 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          362.7882     7.3888  49.099  < 2e-16 ***
## BSBG04a11-25 books     8.4761     4.4866   1.889 0.058931 .  
## BSBG04a26-100 books   18.4978     4.3937   4.210 2.60e-05 ***
## BSBG04a101-200 books  31.7300     4.8422   6.553 6.27e-11 ***
## BSBG04aMore than 200  27.9010     5.2593   5.305 1.18e-07 ***
## BSBG06ANo             11.1731     2.8870   3.870 0.000110 ***
## BSBG06BNo            -14.3247     2.8142  -5.090 3.72e-07 ***
## BSBGSCM               18.2209     0.5591  32.587  < 2e-16 ***
## BSBGSVM               -2.2706     0.5896  -3.851 0.000119 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69.59 on 4566 degrees of freedom
## Multiple R-squared:  0.2327, Adjusted R-squared:  0.2313 
## F-statistic: 173.1 on 8 and 4566 DF,  p-value: < 2.2e-16

As seen from the output above, the model overall is statistically significant, however there is one variable which is insignificant. The model explains 23% of variance. According to regression coefficients, if effects of all other predictors are held constant:

for students who have 26-100 books at home math achievement is higher by 18.4978 than for students who have 0-10 books at home
compared to students with 0-10 books at home, for students who have 101-200 books at home math achievement is higher by 31.73
compared to students with 0-10 books at home, for students who have more than 200 books at home math achievement is higher by 27.9010
for students who do not have own personal computer at home math achievement is higher by 11.1731 than for those who do have it
for students who do not have shared with others personal computer at home math achievement is lower by 14.3247 than for those who do have it
if confidence in maths increase by one unit, math achievement increases by 18.2209
if value of maths increase by one unit, math achievement decreases by 2.2706

Now, let us try to add an interaction effect in the model. The variable for gender (BSBG01) interacts with variable for confidence in maths (BSBGSCM).

model2 <- lm(BSMMAT01 ~ BSBG04a + BSBG06A + BSBG06B + BSBGSCM*BSBG01 + BSBGSVM, data = data_reg)
summary(model2)

## 
## Call:
## lm(formula = BSMMAT01 ~ BSBG04a + BSBG06A + BSBG06B + BSBGSCM * 
##     BSBG01 + BSBGSVM, data = data_reg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -269.324  -45.189    0.893   46.637  237.806 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          359.9186     8.9975  40.002  < 2e-16 ***
## BSBG04a11-25 books     8.4889     4.4864   1.892   0.0585 .  
## BSBG04a26-100 books   18.8409     4.3984   4.284 1.88e-05 ***
## BSBG04a101-200 books  32.0632     4.8463   6.616 4.11e-11 ***
## BSBG04aMore than 200  28.4075     5.2681   5.392 7.31e-08 ***
## BSBG06ANo             11.4926     2.8932   3.972 7.23e-05 ***
## BSBG06BNo            -14.6035     2.8190  -5.180 2.31e-07 ***
## BSBGSCM               18.3825     0.7512  24.471  < 2e-16 ***
## BSBG01Boy              7.8570    10.4491   0.752   0.4521    
## BSBGSVM               -2.3434     0.5919  -3.959 7.64e-05 ***
## BSBGSCM:BSBG01Boy     -0.4514     1.0350  -0.436   0.6627    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69.58 on 4564 degrees of freedom
## Multiple R-squared:  0.2332, Adjusted R-squared:  0.2315 
## F-statistic: 138.8 on 10 and 4564 DF,  p-value: < 2.2e-16

As seen from the output above, the model overall is statistically significant, however there are two variables which are insignificant. The interaction effect is also insignificant. The model still explains 23% of variance. According to regression coefficients, if effects of all other predictors are held constant:

for students who have 26-100 books at home math achievement is higher by 18.8409 than for students who have 0-10 books at home
compared to students with 0-10 books at home, for students who have 101-200 books at home math achievement is higher by 32.0632
compared to students with 0-10 books at home, for students who have more than 200 books at home math achievement is higher by 28.4075
for students who do not have own personal computer or tablet at home math achievement is higher by 11.4926 than for those who do have it
for students who do not have shared with others personal computer or tablet at home math achievement is lower by 14.6035 than for those who do have it
if confidence in maths increase by one unit, math achievement increases by 18.3825
if value of maths increase by one unit, math achievement decreases by 2.3434

Model comparison

It is necessary to compare these two models to reveal which one is better. To do this, ANOVA is implemented.

anova(model1, model2)

## Analysis of Variance Table
## 
## Model 1: BSMMAT01 ~ BSBG04a + BSBG06A + BSBG06B + BSBGSCM + BSBGSVM
## Model 2: BSMMAT01 ~ BSBG04a + BSBG06A + BSBG06B + BSBGSCM * BSBG01 + BSBGSVM
##   Res.Df      RSS Df Sum of Sq      F Pr(>F)
## 1   4566 22109090                           
## 2   4564 22095440  2     13650 1.4097 0.2443

The result of ANOVA tells that the first model is better than the second one as p-value is large. Thus, I will use the model without interaction effect in further analysis.

Model diagnostics

In this section, it is examined that assumptions for regression model are true.

No prefect multicolinearity - there is no perfect linear relationship between predictors. It is checked by Variance Inflation Factors (VIF) below.

VIF(model1)

##             GVIF Df GVIF^(1/(2*Df))
## BSBG04a 1.018361  4        1.002277
## BSBG06A 1.003738  1        1.001867
## BSBG06B 1.013453  1        1.006704
## BSBGSCM 1.179296  1        1.085954
## BSBGSVM 1.162255  1        1.078079

All numbers are less than five, so it means that there is no multicollinearity and the assumption is true.

Homoscedasticity - constant variance of residuals

ncvTest(model1)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 15.92173, Df = 1, p = 6.6016e-05

bptest(model1)

## 
##  studentized Breusch-Pagan test
## 
## data:  model1
## BP = 36.565, df = 8, p-value = 1.384e-05

Both tests have significant p-value which means that variance of residual is not constant and the assumption of homoscedasticity is violated.

Independent errors - for any two observations the residual terms is uncorrelated (or independent). It is checked by Durbin-Watson Test below.

dwt(model1)

##  lag Autocorrelation D-W Statistic p-value
##    1       0.3953192      1.209085       0
##  Alternative hypothesis: rho != 0

The closer to 2 that the value is, the better. Here we have the value equals to 1.209085, which is not so close to 2 meaning that the assumption has not been met.

Linearity and normality of residuals - the mean values of the outcome variable for each increment of the predictors lie along a straight line; the residuals in the model are random, normally distributed variables with a mean of 0.

par(mfrow=c(2,2))
plot(model1)

The first plot shows that the assumption of linearity has been met as points lie more or less along the line. In the Normal Q-Q plot residuals almost perfectly lie along the line which means that the assumption of normally distributed errors is true. Also, from the plots it is seen that there are three outliers, however none of them is an influential case as they do not cross Cook?s distance.

Exploratory factor analysis

This section presents procedure of exploratory factor analysis and its results. In the dataset there are 11 variables that measures presence or absence of one or another things at students’ homes for which I will try to discover latent factors.

Preliminary preparation

At first, it is necessary to look at correlation matrix in order to understand whether EFA is possible.

cor1 <- hetcor (data_fa)
cor2 <- cor1$correlations
corrplot(cor2)

The picture on the matrix seems to be not very appropriate to execute EFA, however it is still possibly to try factor analysis as there are some correlations.

Since all variables in our dataset are binomial, it is also necessary to transform data into numeric form as EFA works only with numbers.

datafa<-as.data.frame(lapply(data_fa, as.numeric))

The data is in the right format and now we need to reveal how many factors should be extracted.

fa.parallel(datafa)

## Parallel analysis suggests that the number of factors =  5  and the number of components =  4

As seen from the plot, recommended number of factors is 5. At this point, we are ready to move to factor analysis.

EFA

Here, we apply fa function for exploratory factor analysis using 5 factors and argument cor=“mixed” because we have imitation of quantitative data.

fa1<-fa(datafa, 5, cor = "mixed")

fa1

## Factor Analysis using method =  minres
## Call: fa(r = datafa, nfactors = 5, cor = "mixed")
## Standardized loadings (pattern matrix) based upon correlation matrix
##           MR2   MR1   MR3   MR5   MR4   h2    u2 com
## BSBG06A  0.02  0.03 -0.01  0.82  0.02 0.70 0.298 1.0
## BSBG06B  0.69 -0.12 -0.13 -0.14  0.10 0.43 0.569 1.3
## BSBG06C  0.01  0.97 -0.01  0.02  0.00 0.94 0.058 1.0
## BSBG06D -0.05  0.46  0.01  0.06  0.31 0.37 0.626 1.8
## BSBG06E  0.67  0.11  0.01  0.21  0.00 0.63 0.373 1.3
## BSBG06F  0.54  0.15  0.33  0.04 -0.17 0.61 0.386 2.1
## BSBG06G -0.07 -0.16  0.55  0.23  0.02 0.38 0.619 1.6
## BSBG06H  0.19  0.12  0.29 -0.19 -0.01 0.16 0.839 2.9
## BSBG06I  0.28 -0.04  0.20  0.06  0.39 0.35 0.648 2.5
## BSBG06J  0.00  0.05  0.06  0.02  0.69 0.51 0.488 1.0
## BSBG06K  0.00  0.07  0.69 -0.06  0.12 0.52 0.475 1.1
## 
##                        MR2  MR1  MR3  MR5  MR4
## SS loadings           1.43 1.32 1.15 0.89 0.83
## Proportion Var        0.13 0.12 0.10 0.08 0.08
## Cumulative Var        0.13 0.25 0.35 0.44 0.51
## Proportion Explained  0.25 0.23 0.21 0.16 0.15
## Cumulative Proportion 0.25 0.49 0.69 0.85 1.00
## 
##  With factor correlations of 
##      MR2  MR1  MR3  MR5  MR4
## MR2 1.00 0.27 0.30 0.17 0.04
## MR1 0.27 1.00 0.26 0.42 0.16
## MR3 0.30 0.26 1.00 0.34 0.22
## MR5 0.17 0.42 0.34 1.00 0.10
## MR4 0.04 0.16 0.22 0.10 1.00
## 
## Mean item complexity =  1.6
## Test of the hypothesis that 5 factors are sufficient.
## 
## The degrees of freedom for the null model are  55  and the objective function was  2.4 with Chi Square of  10983.34
## The degrees of freedom for the model are 10  and the objective function was  0.04 
## 
## The root mean square of the residuals (RMSR) is  0.01 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic number of observations is  4575 with the empirical chi square  87.44  with prob <  1.7e-14 
## The total number of observations was  4575  with Likelihood Chi Square =  172.55  with prob <  8.2e-32 
## 
## Tucker Lewis Index of factoring reliability =  0.918
## RMSEA index =  0.06  and the 90 % confidence intervals are  0.052 0.068
## BIC =  88.27
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    MR2  MR1  MR3  MR5  MR4
## Correlation of (regression) scores with factors   0.87 0.97 0.83 0.86 0.78
## Multiple R square of scores with factors          0.76 0.95 0.69 0.74 0.61
## Minimum correlation of possible factor scores     0.51 0.89 0.38 0.48 0.21

Now, we analyze an output of factor analysis. Cumulative variance is 0.53 which is more or less fine. For two factors proportion variance is less than 0.1 which is not really good. The proportion explained is rather unequal. Here we have oblique rotation because there is factor correlation matrix in the output; oblique rotation is used automatically because we have argument cor=“mixed”. RMSR equals to 0.01 which is very good. RMSEA index = 0.06 which is acceptable. Tucker Lewis Index also has a good score (0.918). Overall, the results of factor analysis are somewhat fine but could be better.

Let us look at the structure of factors now.

fa.diagram(fa1)

As seen from the diagram, there are three factors that are not meaningful because they contain less than three variables. Moreover, there are one variable that does not fall into any factor. Thus, the number of factors should be decreased and the variable BSBG06H should be deleted.

datafa1 <- select(datafa, -BSBG06H)

Now, the number of factors is 4.

fa2<-fa(datafa1, 4, cor = "mixed")

fa2

## Factor Analysis using method =  minres
## Call: fa(r = datafa1, nfactors = 4, cor = "mixed")
## Standardized loadings (pattern matrix) based upon correlation matrix
##           MR1   MR2   MR3   MR4   h2   u2 com
## BSBG06A  0.38  0.01  0.30 -0.04 0.30 0.70 1.9
## BSBG06B -0.18  0.72 -0.19  0.11 0.49 0.51 1.3
## BSBG06C  0.88  0.03 -0.05  0.01 0.77 0.23 1.0
## BSBG06D  0.53 -0.06 -0.02  0.31 0.41 0.59 1.7
## BSBG06E  0.25  0.63  0.11 -0.05 0.59 0.41 1.4
## BSBG06F  0.19  0.57  0.28 -0.14 0.59 0.41 1.9
## BSBG06G -0.10 -0.04  0.74  0.04 0.51 0.49 1.0
## BSBG06I  0.00  0.28  0.22  0.40 0.36 0.64 2.4
## BSBG06J  0.09 -0.03  0.07  0.66 0.48 0.52 1.1
## BSBG06K  0.05  0.11  0.48  0.21 0.37 0.63 1.5
## 
##                        MR1  MR2  MR3  MR4
## SS loadings           1.46 1.42 1.16 0.84
## Proportion Var        0.15 0.14 0.12 0.08
## Cumulative Var        0.15 0.29 0.40 0.49
## Proportion Explained  0.30 0.29 0.24 0.17
## Cumulative Proportion 0.30 0.59 0.83 1.00
## 
##  With factor correlations of 
##      MR1  MR2  MR3  MR4
## MR1 1.00 0.26 0.31 0.17
## MR2 0.26 1.00 0.22 0.08
## MR3 0.31 0.22 1.00 0.15
## MR4 0.17 0.08 0.15 1.00
## 
## Mean item complexity =  1.5
## Test of the hypothesis that 4 factors are sufficient.
## 
## The degrees of freedom for the null model are  45  and the objective function was  2.27 with Chi Square of  10383.08
## The degrees of freedom for the model are 11  and the objective function was  0.1 
## 
## The root mean square of the residuals (RMSR) is  0.03 
## The df corrected root mean square of the residuals is  0.05 
## 
## The harmonic number of observations is  4575 with the empirical chi square  263.53  with prob <  4.1e-50 
## The total number of observations was  4575  with Likelihood Chi Square =  449.97  with prob <  1.5e-89 
## 
## Tucker Lewis Index of factoring reliability =  0.826
## RMSEA index =  0.093  and the 90 % confidence intervals are  0.086 0.101
## BIC =  357.26
## Fit based upon off diagonal values = 0.99
## Measures of factor score adequacy             
##                                                    MR1  MR2  MR3  MR4
## Correlation of (regression) scores with factors   0.90 0.87 0.82 0.77
## Multiple R square of scores with factors          0.82 0.75 0.68 0.59
## Minimum correlation of possible factor scores     0.64 0.50 0.36 0.18

The results are following. Cumulative variance is 0.49 which is somewhat good. For one factor proportion variance is still less than 0.1 which is not good. The proportion explained is rather unequal. Here we again have oblique rotation because there is factor correlation matrix in the output; oblique rotation is used automatically because we have argument cor=“mixed”. RMSR equals to 0.03 which is good. RMSEA index = 0.093 which is not so good. Tucker Lewis Index also has not enough good score (0.826). Overall, the results of factor analysis are rather undesirable.

Probably, it will be useful to look at factor structure again.

fa.diagram(fa2)

In the diagram it is seen two not meaningful factors, so let us try to reduce number of factors to 3.

fa3<-fa(datafa1, 3, cor = "mixed")

fa3

## Factor Analysis using method =  minres
## Call: fa(r = datafa1, nfactors = 3, cor = "mixed")
## Standardized loadings (pattern matrix) based upon correlation matrix
##           MR1   MR2   MR3   h2   u2 com
## BSBG06A  0.07  0.33  0.23 0.24 0.76 1.9
## BSBG06B  0.59 -0.21 -0.09 0.29 0.71 1.3
## BSBG06C  0.06  0.90 -0.05 0.83 0.17 1.0
## BSBG06D -0.12  0.54  0.21 0.36 0.64 1.4
## BSBG06E  0.76  0.13  0.02 0.67 0.33 1.1
## BSBG06F  0.64  0.06  0.15 0.53 0.47 1.1
## BSBG06G  0.00 -0.08  0.57 0.30 0.70 1.0
## BSBG06I  0.20 -0.01  0.44 0.29 0.71 1.4
## BSBG06J -0.12  0.15  0.41 0.20 0.80 1.5
## BSBG06K  0.07  0.00  0.62 0.42 0.58 1.0
## 
##                        MR1  MR2  MR3
## SS loadings           1.46 1.37 1.31
## Proportion Var        0.15 0.14 0.13
## Cumulative Var        0.15 0.28 0.41
## Proportion Explained  0.35 0.33 0.32
## Cumulative Proportion 0.35 0.68 1.00
## 
##  With factor correlations of 
##      MR1  MR2  MR3
## MR1 1.00 0.36 0.34
## MR2 0.36 1.00 0.34
## MR3 0.34 0.34 1.00
## 
## Mean item complexity =  1.3
## Test of the hypothesis that 3 factors are sufficient.
## 
## The degrees of freedom for the null model are  45  and the objective function was  2.27 with Chi Square of  10383.08
## The degrees of freedom for the model are 18  and the objective function was  0.29 
## 
## The root mean square of the residuals (RMSR) is  0.06 
## The df corrected root mean square of the residuals is  0.09 
## 
## The harmonic number of observations is  4575 with the empirical chi square  1274.85  with prob <  1e-259 
## The total number of observations was  4575  with Likelihood Chi Square =  1335.99  with prob <  7.8e-273 
## 
## Tucker Lewis Index of factoring reliability =  0.681
## RMSEA index =  0.127  and the 90 % confidence intervals are  0.121 0.132
## BIC =  1184.28
## Fit based upon off diagonal values = 0.95
## Measures of factor score adequacy             
##                                                    MR1  MR2  MR3
## Correlation of (regression) scores with factors   0.88 0.92 0.82
## Multiple R square of scores with factors          0.78 0.85 0.67
## Minimum correlation of possible factor scores     0.55 0.70 0.34

The output tells the following. Cumulative variance is 0.41 which is somewhat acceptable. For all factors proportion variance is bigger than 0.1 which is good. The proportion explained is almost equal. Here we again have oblique rotation because there is factor correlation matrix in the output; oblique rotation is used automatically because we have argument cor=“mixed”. RMSR equals to 0.06 which is rather acceptable. RMSEA index = 0.127 which is not good. Tucker Lewis Index also has rather small score (0.681). Overall, although the fit is not really good, the explanatory aspect seems to be better than in previous models.

Intrepretation

At this point it is possible to name and interpret factors. In order to understand the structure, it again is necessary to look at the diagram.

fa.diagram(fa3)

The first factor (MR1) can be named as ‘resources for internet connection’. If a student has his/her own mobile phone and a computer shared with other people, then it is likely to have internet connection.
The second factor (MR2) can be called ‘private possessions’. If a student has his/her own room and study desk, he/she also has own computer or tablet.
The third factor (MR3) is ‘country-specific indicators of wealth’. When a student’s family has a car and a flat or a house with at least 4 rooms, they are likely to have a dishwasher and a gaming system. It seems that the status of family determines values of the variables.

Model fit

It is necessary to check model fit using Cronbach’s alpha. For this dataframes for each factor are created.

MR1<- as.data.frame(datafa[c("BSBG06E", "BSBG06F", "BSBG06B")])
MR2<- as.data.frame(datafa[c("BSBG06C", "BSBG06D", "BSBG06A")])
MR3<- as.data.frame(datafa[c("BSBG06K", "BSBG06G", "BSBG06I", "BSBG06J")])

And now we test scales of each factor using alpha function. Cronbach’s alpha > 0.7 indicates good reliability. Overall, tests show bad results but it can be because variables are binary.

MR1

As seen from the output below, for the first factor we have Cronbach’s alpha = 0.34 which is bad result. But there are no dropping items as deleting of any variable does not lead to significant increase of reliability.

psych::alpha(MR1, check.keys = TRUE)

## 
## Reliability analysis   
## Call: psych::alpha(x = MR1, check.keys = TRUE)
## 
##   raw_alpha std.alpha G6(smc) average_r  S/N   ase mean   sd median_r
##       0.25      0.34    0.26      0.15 0.52 0.016  1.1 0.16     0.14
## 
##  lower alpha upper     95% confidence boundaries
## 0.22 0.25 0.29 
## 
##  Reliability if an item is dropped:
##         raw_alpha std.alpha G6(smc) average_r  S/N alpha se var.r med.r
## BSBG06E      0.12      0.18    0.10      0.10 0.22    0.017    NA  0.10
## BSBG06F      0.19      0.24    0.14      0.14 0.32    0.019    NA  0.14
## BSBG06B      0.33      0.34    0.21      0.21 0.52    0.019    NA  0.21
## 
##  Item statistics 
##            n raw.r std.r r.cor r.drop mean   sd
## BSBG06E 4575  0.54  0.68  0.40   0.19  1.0 0.18
## BSBG06F 4575  0.44  0.66  0.35   0.17  1.0 0.14
## BSBG06B 4575  0.86  0.63  0.26   0.15  1.2 0.37
## 
## Non missing response frequency for each item
##            1    2 miss
## BSBG06E 0.97 0.03    0
## BSBG06F 0.98 0.02    0
## BSBG06B 0.84 0.16    0

MR2

For the second factor, alpha equals to 0.41 which is bad. Nevertheless, none of variables should be deleted to increase reliability.

psych::alpha(MR2, check.keys = TRUE)

## 
## Reliability analysis   
## Call: psych::alpha(x = MR2, check.keys = TRUE)
## 
##   raw_alpha std.alpha G6(smc) average_r  S/N   ase mean   sd median_r
##       0.38      0.41    0.32      0.19 0.68 0.015  1.2 0.25     0.18
## 
##  lower alpha upper     95% confidence boundaries
## 0.35 0.38 0.41 
## 
##  Reliability if an item is dropped:
##         raw_alpha std.alpha G6(smc) average_r  S/N alpha se var.r med.r
## BSBG06C      0.23      0.24    0.14      0.14 0.31    0.022    NA  0.14
## BSBG06D      0.29      0.30    0.18      0.18 0.43    0.020    NA  0.18
## BSBG06A      0.35      0.39    0.24      0.24 0.65    0.017    NA  0.24
## 
##  Item statistics 
##            n raw.r std.r r.cor r.drop mean   sd
## BSBG06C 4575  0.60  0.70  0.44   0.28  1.1 0.28
## BSBG06D 4575  0.78  0.68  0.39   0.24  1.3 0.47
## BSBG06A 4575  0.62  0.65  0.31   0.19  1.1 0.36
## 
## Non missing response frequency for each item
##            1    2 miss
## BSBG06C 0.92 0.08    0
## BSBG06D 0.68 0.32    0
## BSBG06A 0.85 0.15    0

MR3

As for MR3, Cronbach’s alpha is 0.43 which is not really good result. However, there are no dropping items as deleting of any variable does not lead to significant increase of reliability.

psych::alpha(MR3, check.keys = TRUE)

## 
## Reliability analysis   
## Call: psych::alpha(x = MR3, check.keys = TRUE)
## 
##   raw_alpha std.alpha G6(smc) average_r  S/N   ase mean   sd median_r
##       0.42      0.43    0.37      0.16 0.75 0.014  1.6 0.27     0.15
## 
##  lower alpha upper     95% confidence boundaries
## 0.4 0.42 0.45 
## 
##  Reliability if an item is dropped:
##         raw_alpha std.alpha G6(smc) average_r  S/N alpha se  var.r med.r
## BSBG06K      0.33      0.33    0.25      0.14 0.48    0.017 0.0043  0.13
## BSBG06G      0.38      0.38    0.29      0.17 0.61    0.016 0.0013  0.16
## BSBG06I      0.33      0.34    0.27      0.15 0.52    0.017 0.0055  0.14
## BSBG06J      0.38      0.38    0.30      0.17 0.62    0.016 0.0025  0.16
## 
##  Item statistics 
##            n raw.r std.r r.cor r.drop mean   sd
## BSBG06K 4575  0.59  0.63  0.41   0.27  1.8 0.40
## BSBG06G 4575  0.57  0.59  0.34   0.21  1.7 0.44
## BSBG06I 4575  0.63  0.62  0.39   0.26  1.3 0.47
## BSBG06J 4575  0.63  0.59  0.33   0.21  1.5 0.50
## 
## Non missing response frequency for each item
##            1    2 miss
## BSBG06K 0.21 0.79    0
## BSBG06G 0.26 0.74    0
## BSBG06I 0.68 0.32    0
## BSBG06J 0.51 0.49    0

Extracting scores

In order to use factors in regression analysis it is necessary to extract factor scores and include them in a dataset.

fascores<-as.data.frame(fa1$scores)
datareg<-cbind(data_reg,fascores)

Regression analysis (part 2)

In this part of I will continue predicting math achievement (BSMMAT01). Here, factor scores will be added into the model as predictors.

Updating model

model3 <- lm(BSMMAT01 ~ BSBG04a + BSBG06A + BSBG06B + BSBGSCM + BSBGSVM + MR1 + MR2 + MR3, data = datareg)
summary(model3)

## 
## Call:
## lm(formula = BSMMAT01 ~ BSBG04a + BSBG06A + BSBG06B + BSBGSCM + 
##     BSBGSVM + MR1 + MR2 + MR3, data = datareg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -267.153  -45.149    0.536   46.458  231.857 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          362.0634     7.4089  48.869  < 2e-16 ***
## BSBG04a11-25 books     8.0074     4.4831   1.786  0.07415 .  
## BSBG04a26-100 books   17.2101     4.4045   3.907 9.47e-05 ***
## BSBG04a101-200 books  30.0463     4.8665   6.174 7.23e-10 ***
## BSBG04aMore than 200  26.2766     5.2851   4.972 6.88e-07 ***
## BSBG06ANo             12.7431     3.1253   4.077 4.63e-05 ***
## BSBG06BNo             -3.0490     3.9581  -0.770  0.44114    
## BSBGSCM               18.0868     0.5601  32.291  < 2e-16 ***
## BSBGSVM               -2.1584     0.5890  -3.665  0.00025 ***
## MR1                    0.8342     1.2380   0.674  0.50048    
## MR2                   -7.0313     1.7490  -4.020 5.91e-05 ***
## MR3                    1.4608     1.4692   0.994  0.32014    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 69.44 on 4563 degrees of freedom
## Multiple R-squared:  0.2364, Adjusted R-squared:  0.2345 
## F-statistic: 128.4 on 11 and 4563 DF,  p-value: < 2.2e-16

As seen from the output above, the model itself is statistically significant, however there are four variables which are insignificant. The model explains 23% of variance. According to regression coefficients, if effects of all other predictors are held constant:

for students who have 26-100 books at home math achievement is higher by 17.2101 than for students who have 0-10 books at home
compared to students with 0-10 books at home, for students who have 101-200 books at home math achievement is higher by 30.0463
compared to students with 0-10 books at home, for students who have more than 200 books at home math achievement is higher by 26.2766
for students who do not have own personal computer at home math achievement is higher by 12.7431 than for those who do have it
if confidence in maths increase by one unit, math achievement increases by 18.0868
if value of maths increase by one unit, math achievement decreases by 2.1584
if there is an increase in private possessions by one unit, there is a decrease of math achievement by 7.0313

Interpretation

It is necessary to give some meaning to the results above. Thus, in this part I will give some interpretation.

Starting with books, all regression models show that there almost no difference in math achievement between students with 0-10 books at home and 11-25 books at home, but starting from 26 books the math achievement increases with growing number of books compared to small amount of books.

Talking about computers and tablets, possession of own computer or tablet is associated with decreasing math achievement, however having a computer or a tablet shared with others at home is related to increase of math achievement. It is possible to assume that the presence of computer or tablet itself is good, but having it as own property imply that owner can do whatever he/she want without control which can badly affect studies.

If we want to compare importance of books and computers, it is necessary to look at standardized coefficients in regression.

lm.beta(model3)

## 
## Call:
## lm(formula = BSMMAT01 ~ BSBG04a + BSBG06A + BSBG06B + BSBGSCM + 
##     BSBGSVM + MR1 + MR2 + MR3, data = datareg)
## 
## Standardized Coefficients::
##          (Intercept)   BSBG04a11-25 books  BSBG04a26-100 books 
##           0.00000000           0.04598862           0.10569543 
## BSBG04a101-200 books BSBG04aMore than 200            BSBG06ANo 
##           0.13740292           0.09791692           0.05732738 
##            BSBG06BNo              BSBGSCM              BSBGSVM 
##          -0.01413943           0.45537096          -0.05116060 
##                  MR1                  MR2                  MR3 
##           0.01145199          -0.09516489           0.01854671

As sees from the output, the coefficients for books are larger than for computer ownership, so it is possible to assume that books are still more important for students’ achievement.

Moving forward, students who are confident in maths seem to have better math achievement. However, it is hard to interpret result about value of mathematics.

Finally, having own room, table and computer is associated with decreasing math achievement. It can be assumed that staying in room alone allow students to do some entertaining activities instead of studying which may lead to lower scores.

Conclusion

To conclude, it was found out 3 latent factors, namely resources for internet connection, private possessions and status. And due to regression analysis I revealed that books are more important for students achievement than computers.

Final exam. Student achievement in Russia

Artyom Kulikov, BSC-172