In this work I consider TIMSS (Trends in Mathematics and Science Study) in 2015 for 8th grade. The focus of the analysis is on Russia. There are two analytical parts of the project: part with exploratory factor analysis and part with regression analysis. The aim of factor analysis part is to find out some latent factors that have effect on home possessions. And the research question for regression part is the following: what is more important for student achievement nowadays: books or computers?
This section is devoted to preparation of data before analysis. First, it is necessary to upload packages and dataset.
library(foreign)
library(ggplot2)
library(dplyr)
library(psych)
library(knitr)
library(kableExtra)
library(DescTools)
library(car)
library(lmtest)
library(polycor)
library(corrplot)
library(GPArotation)
library(lm.beta)
data1 <- read.spss("BSGRUSM6.sav", to.data.frame = TRUE, use.value.labels = TRUE)
Then, we create a separate dataset only with variables that we will use in analysis and get rid of NA’s because EFA does not work with them.
data2<-data1[c("BSBG06A", "BSBG06B", "BSBG06C", "BSBG06D", "BSBG06E", "BSBG06F", "BSBG06G", "BSBG06H", "BSBG06I", "BSBG06J", "BSBG06K", "BSMMAT01", "BSBG04", "BSBGSCM", "BSBGSVM", "BSBG01")]
data3<-na.omit(data2)
Also, it is important to check type of variables to be sure they are right.
str(data3)
## 'data.frame': 4575 obs. of 16 variables:
## $ BSBG06A : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 2 1 ...
## $ BSBG06B : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 2 1 1 1 ...
## $ BSBG06C : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ BSBG06D : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ BSBG06E : Factor w/ 2 levels "Yes","No": 1 1 2 1 1 1 1 1 1 1 ...
## $ BSBG06F : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ BSBG06G : Factor w/ 2 levels "Yes","No": 1 2 2 2 1 2 2 1 2 1 ...
## $ BSBG06H : Factor w/ 2 levels "Yes","No": 2 2 2 2 1 2 2 1 1 2 ...
## $ BSBG06I : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 1 1 1 1 1 ...
## $ BSBG06J : Factor w/ 2 levels "Yes","No": 1 2 2 1 1 2 2 2 2 1 ...
## $ BSBG06K : Factor w/ 2 levels "Yes","No": 2 2 2 2 1 2 2 2 1 2 ...
## $ BSMMAT01: Factor w/ 4775 levels "273.05727","282.6379",..: 3049 1640 1984 515 155 1013 1384 2896 1205 464 ...
## $ BSBG04 : Factor w/ 5 levels "0–10 books",..: 2 4 2 3 5 3 2 2 3 5 ...
## $ BSBGSCM : Factor w/ 132 levels "3.19621","3.36684",..: 86 54 67 92 48 38 6 6 39 38 ...
## $ BSBGSVM : Factor w/ 86 levels "2.99893","4.33377",..: 83 56 75 24 86 53 61 43 56 43 ...
## $ BSBG01 : Factor w/ 2 levels "Girl","Boy": 2 2 1 2 2 1 1 2 2 1 ...
## - attr(*, "na.action")= 'omit' Named int [1:205] 8 26 92 111 114 116 148 246 319 354 ...
## ..- attr(*, "names")= chr [1:205] "8" "26" "92" "111" ...
All variables are factors. The variables for factor analysis will be processed later. The variables BSMMAT01, BSBGSCM and BSBGSVM should be in numeric form, so we do transformation.
data3$BSMMAT01<-as.numeric(as.character(data3$BSMMAT01))
data3$BSBGSCM<-as.numeric(as.character(data3$BSBGSCM))
data3$BSBGSVM<-as.numeric(as.character(data3$BSBGSVM))
class(data3$BSMMAT01)
## [1] "numeric"
class(data3$BSBGSCM)
## [1] "numeric"
class(data3$BSBGSVM)
## [1] "numeric"
Finally, we split our dataset into two: the first one (data_fa) is for factor analysis and the second one (data_reg) is for regression analysis.
save1<-c("BSBG06A", "BSBG06B", "BSBG06C", "BSBG06D", "BSBG06E", "BSBG06F", "BSBG06G", "BSBG06H", "BSBG06I", "BSBG06J", "BSBG06K")
save2<-c("BSMMAT01", "BSBG04", "BSBG06A", "BSBG06B", "BSBGSCM", "BSBGSVM", "BSBG01")
data_fa <- data3[save1]
data_reg <- data3[save2]
In addition, we need to recode variable BSBG04 because its values does not extracted properly from file with data.
data_reg$BSBG04a <- ifelse(data_reg$BSBG04=="0–10 books", "0-10 books",
ifelse(data_reg$BSBG04=="11–25 books", "11-25 books",
ifelse(data_reg$BSBG04=="26–100 books","26-100 books",
ifelse(data_reg$BSBG04=="101–200 books", "101-200 books",
ifelse(data_reg$BSBG04=="More than 200", "More than 200", NA)))))
data_reg$BSBG04a <- factor(data_reg$BSBG04a, levels = c("0-10 books", "11-25 books", "26-100 books", "101-200 books", "More than 200"))
In this part of project, the variables for further analysis are described in detail. To get information about variables, press buttons.
There are 7 variables to use in a regression analysis. The predictor variables are amount of books at home (BSBG04), possession of own computer at home (BSBG06A), possession of shared computer at home (BSBG06B), gender (BSBG01), value of mathematics for a student (BSBGSVM) and student’s confidence in mathematics (BSBGSCM). The outcome variable is proficiency scores in mathematics and science (BSMMAT01). In order not to write this long name, I will use math achievement instead of proficiency scores in mathematics and science. To get information about variables, press buttons.
Amount of books at home (BSBG04) is one of predictors in further analysis. It is an ordinal variable. The distribution of the variable is presented below.
Possession of own computer at home (BSBG06A) is one of predictors in further analysis. It is a binomial variable. The distribution of the variable is on the barchart below.
Possession of computer shared with others at home (BSBG06B) is one of predictors in further analysis. It is a binomial variable. The distribution of the variable is on the barchart below.
Gender (BSBG01) is a variable used in construction of interaction effect as a moderator. It is a binomial variable with levels Male/Female. The distribution of the variable is on the barchart below.
Student’s confidence in mathematics (BSBSCM) is one of predictors in further analysis. It is a numeric variable. The descriptive statistics and distribution of the variable are below.
| N | Mean | SD | Median | Min | Max | Skew | Kurtosis |
|---|---|---|---|---|---|---|---|
| 4575 | 9.89 | 2 | 9.68 | 3.2 | 15.93 | 0.39 | 1.51 |
Value of mathematics for student (BSBGSVM) is one of predictors in further analysis. It is a numeric variable. The descriptive statistics and distribution of the variable are below.
| N | Mean | SD | Median | Min | Max | Skew | Kurtosis |
|---|---|---|---|---|---|---|---|
| 4575 | 9.38 | 1.88 | 9.07 | 3 | 13.65 | 0.26 | 0.39 |
BSMMAT01 (outcome variable) is a plausible value for math achievement. The descriptive statistics and distribution are presented below.
| N | Mean | SD | Median | Min | Max | Skew | Kurtosis |
|---|---|---|---|---|---|---|---|
| 4575 | 538.39 | 79.37 | 540.55 | 273.06 | 819.82 | -0.12 | -0.21 |
For exploratory factor analysis there are 11 variables to use which represents answers on question “Do you have any of these things at your home?”
| Variable | Description |
|---|---|
| BSBG06A | A computer or tablet of your own |
| BSBG06B | A computer or tablet that is shared with other people at home |
| BSBG06C | Study desk/table for your use |
| BSBG06D | Your own room |
| BSBG06E | Internet connection |
| BSBG06F | Your own mobile phone |
| BSBG06G | A gaming system |
| BSBG06H | Musical instruments |
| BSBG06I | A car |
| BSBG06J | A flat (or house) with 4 or more rooms |
| BSBG06K | Dishwasher |
Here is the table with frequencies of answers which demonstrates that all variables are filled with responses. Since these variables are binomial, it is impossible to count mean, median and other descriptive statistics.
| BSBG06A | BSBG06B | BSBG06C | BSBG06D | BSBG06E | BSBG06F | BSBG06G | BSBG06H | BSBG06I | BSBG06J | BSBG06K | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Yes | 3889 | 3836 | 4198 | 3100 | 4422 | 4488 | 1205 | 1687 | 3108 | 2339 | 942 |
| No | 686 | 739 | 377 | 1475 | 153 | 87 | 3370 | 2888 | 1467 | 2236 | 3633 |
In this part of research I will try to predict math achievement (BSMMAT01) with several variables where some of them are connected with educational resources and others are about internal factors. All variables seems to be in the right format, so it is possible to proceed to analysis.
First, I build a multiple regression model with 5 predicting variables.
model1 <- lm(BSMMAT01 ~ BSBG04a + BSBG06A + BSBG06B + BSBGSCM + BSBGSVM, data = data_reg)
summary(model1)
##
## Call:
## lm(formula = BSMMAT01 ~ BSBG04a + BSBG06A + BSBG06B + BSBGSCM +
## BSBGSVM, data = data_reg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -269.602 -45.662 0.645 46.235 235.808
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 362.7882 7.3888 49.099 < 2e-16 ***
## BSBG04a11-25 books 8.4761 4.4866 1.889 0.058931 .
## BSBG04a26-100 books 18.4978 4.3937 4.210 2.60e-05 ***
## BSBG04a101-200 books 31.7300 4.8422 6.553 6.27e-11 ***
## BSBG04aMore than 200 27.9010 5.2593 5.305 1.18e-07 ***
## BSBG06ANo 11.1731 2.8870 3.870 0.000110 ***
## BSBG06BNo -14.3247 2.8142 -5.090 3.72e-07 ***
## BSBGSCM 18.2209 0.5591 32.587 < 2e-16 ***
## BSBGSVM -2.2706 0.5896 -3.851 0.000119 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 69.59 on 4566 degrees of freedom
## Multiple R-squared: 0.2327, Adjusted R-squared: 0.2313
## F-statistic: 173.1 on 8 and 4566 DF, p-value: < 2.2e-16
As seen from the output above, the model overall is statistically significant, however there is one variable which is insignificant. The model explains 23% of variance. According to regression coefficients, if effects of all other predictors are held constant:
Now, let us try to add an interaction effect in the model. The variable for gender (BSBG01) interacts with variable for confidence in maths (BSBGSCM).
model2 <- lm(BSMMAT01 ~ BSBG04a + BSBG06A + BSBG06B + BSBGSCM*BSBG01 + BSBGSVM, data = data_reg)
summary(model2)
##
## Call:
## lm(formula = BSMMAT01 ~ BSBG04a + BSBG06A + BSBG06B + BSBGSCM *
## BSBG01 + BSBGSVM, data = data_reg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -269.324 -45.189 0.893 46.637 237.806
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 359.9186 8.9975 40.002 < 2e-16 ***
## BSBG04a11-25 books 8.4889 4.4864 1.892 0.0585 .
## BSBG04a26-100 books 18.8409 4.3984 4.284 1.88e-05 ***
## BSBG04a101-200 books 32.0632 4.8463 6.616 4.11e-11 ***
## BSBG04aMore than 200 28.4075 5.2681 5.392 7.31e-08 ***
## BSBG06ANo 11.4926 2.8932 3.972 7.23e-05 ***
## BSBG06BNo -14.6035 2.8190 -5.180 2.31e-07 ***
## BSBGSCM 18.3825 0.7512 24.471 < 2e-16 ***
## BSBG01Boy 7.8570 10.4491 0.752 0.4521
## BSBGSVM -2.3434 0.5919 -3.959 7.64e-05 ***
## BSBGSCM:BSBG01Boy -0.4514 1.0350 -0.436 0.6627
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 69.58 on 4564 degrees of freedom
## Multiple R-squared: 0.2332, Adjusted R-squared: 0.2315
## F-statistic: 138.8 on 10 and 4564 DF, p-value: < 2.2e-16
As seen from the output above, the model overall is statistically significant, however there are two variables which are insignificant. The interaction effect is also insignificant. The model still explains 23% of variance. According to regression coefficients, if effects of all other predictors are held constant:
It is necessary to compare these two models to reveal which one is better. To do this, ANOVA is implemented.
anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: BSMMAT01 ~ BSBG04a + BSBG06A + BSBG06B + BSBGSCM + BSBGSVM
## Model 2: BSMMAT01 ~ BSBG04a + BSBG06A + BSBG06B + BSBGSCM * BSBG01 + BSBGSVM
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4566 22109090
## 2 4564 22095440 2 13650 1.4097 0.2443
The result of ANOVA tells that the first model is better than the second one as p-value is large. Thus, I will use the model without interaction effect in further analysis.
In this section, it is examined that assumptions for regression model are true.
No prefect multicolinearity - there is no perfect linear relationship between predictors. It is checked by Variance Inflation Factors (VIF) below.
VIF(model1)
## GVIF Df GVIF^(1/(2*Df))
## BSBG04a 1.018361 4 1.002277
## BSBG06A 1.003738 1 1.001867
## BSBG06B 1.013453 1 1.006704
## BSBGSCM 1.179296 1 1.085954
## BSBGSVM 1.162255 1 1.078079
All numbers are less than five, so it means that there is no multicollinearity and the assumption is true.
Homoscedasticity - constant variance of residuals
ncvTest(model1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 15.92173, Df = 1, p = 6.6016e-05
bptest(model1)
##
## studentized Breusch-Pagan test
##
## data: model1
## BP = 36.565, df = 8, p-value = 1.384e-05
Both tests have significant p-value which means that variance of residual is not constant and the assumption of homoscedasticity is violated.
Independent errors - for any two observations the residual terms is uncorrelated (or independent). It is checked by Durbin-Watson Test below.
dwt(model1)
## lag Autocorrelation D-W Statistic p-value
## 1 0.3953192 1.209085 0
## Alternative hypothesis: rho != 0
The closer to 2 that the value is, the better. Here we have the value equals to 1.209085, which is not so close to 2 meaning that the assumption has not been met.
Linearity and normality of residuals - the mean values of the outcome variable for each increment of the predictors lie along a straight line; the residuals in the model are random, normally distributed variables with a mean of 0.
par(mfrow=c(2,2))
plot(model1)
The first plot shows that the assumption of linearity has been met as points lie more or less along the line. In the Normal Q-Q plot residuals almost perfectly lie along the line which means that the assumption of normally distributed errors is true. Also, from the plots it is seen that there are three outliers, however none of them is an influential case as they do not cross Cook?s distance.
This section presents procedure of exploratory factor analysis and its results. In the dataset there are 11 variables that measures presence or absence of one or another things at students’ homes for which I will try to discover latent factors.
At first, it is necessary to look at correlation matrix in order to understand whether EFA is possible.
cor1 <- hetcor (data_fa)
cor2 <- cor1$correlations
corrplot(cor2)
The picture on the matrix seems to be not very appropriate to execute EFA, however it is still possibly to try factor analysis as there are some correlations.
Since all variables in our dataset are binomial, it is also necessary to transform data into numeric form as EFA works only with numbers.
datafa<-as.data.frame(lapply(data_fa, as.numeric))
The data is in the right format and now we need to reveal how many factors should be extracted.
fa.parallel(datafa)
## Parallel analysis suggests that the number of factors = 5 and the number of components = 4
As seen from the plot, recommended number of factors is 5. At this point, we are ready to move to factor analysis.
Here, we apply fa function for exploratory factor analysis using 5 factors and argument cor=“mixed” because we have imitation of quantitative data.
fa1<-fa(datafa, 5, cor = "mixed")
fa1
## Factor Analysis using method = minres
## Call: fa(r = datafa, nfactors = 5, cor = "mixed")
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR2 MR1 MR3 MR5 MR4 h2 u2 com
## BSBG06A 0.02 0.03 -0.01 0.82 0.02 0.70 0.298 1.0
## BSBG06B 0.69 -0.12 -0.13 -0.14 0.10 0.43 0.569 1.3
## BSBG06C 0.01 0.97 -0.01 0.02 0.00 0.94 0.058 1.0
## BSBG06D -0.05 0.46 0.01 0.06 0.31 0.37 0.626 1.8
## BSBG06E 0.67 0.11 0.01 0.21 0.00 0.63 0.373 1.3
## BSBG06F 0.54 0.15 0.33 0.04 -0.17 0.61 0.386 2.1
## BSBG06G -0.07 -0.16 0.55 0.23 0.02 0.38 0.619 1.6
## BSBG06H 0.19 0.12 0.29 -0.19 -0.01 0.16 0.839 2.9
## BSBG06I 0.28 -0.04 0.20 0.06 0.39 0.35 0.648 2.5
## BSBG06J 0.00 0.05 0.06 0.02 0.69 0.51 0.488 1.0
## BSBG06K 0.00 0.07 0.69 -0.06 0.12 0.52 0.475 1.1
##
## MR2 MR1 MR3 MR5 MR4
## SS loadings 1.43 1.32 1.15 0.89 0.83
## Proportion Var 0.13 0.12 0.10 0.08 0.08
## Cumulative Var 0.13 0.25 0.35 0.44 0.51
## Proportion Explained 0.25 0.23 0.21 0.16 0.15
## Cumulative Proportion 0.25 0.49 0.69 0.85 1.00
##
## With factor correlations of
## MR2 MR1 MR3 MR5 MR4
## MR2 1.00 0.27 0.30 0.17 0.04
## MR1 0.27 1.00 0.26 0.42 0.16
## MR3 0.30 0.26 1.00 0.34 0.22
## MR5 0.17 0.42 0.34 1.00 0.10
## MR4 0.04 0.16 0.22 0.10 1.00
##
## Mean item complexity = 1.6
## Test of the hypothesis that 5 factors are sufficient.
##
## The degrees of freedom for the null model are 55 and the objective function was 2.4 with Chi Square of 10983.34
## The degrees of freedom for the model are 10 and the objective function was 0.04
##
## The root mean square of the residuals (RMSR) is 0.01
## The df corrected root mean square of the residuals is 0.03
##
## The harmonic number of observations is 4575 with the empirical chi square 87.44 with prob < 1.7e-14
## The total number of observations was 4575 with Likelihood Chi Square = 172.55 with prob < 8.2e-32
##
## Tucker Lewis Index of factoring reliability = 0.918
## RMSEA index = 0.06 and the 90 % confidence intervals are 0.052 0.068
## BIC = 88.27
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## MR2 MR1 MR3 MR5 MR4
## Correlation of (regression) scores with factors 0.87 0.97 0.83 0.86 0.78
## Multiple R square of scores with factors 0.76 0.95 0.69 0.74 0.61
## Minimum correlation of possible factor scores 0.51 0.89 0.38 0.48 0.21
Now, we analyze an output of factor analysis. Cumulative variance is 0.53 which is more or less fine. For two factors proportion variance is less than 0.1 which is not really good. The proportion explained is rather unequal. Here we have oblique rotation because there is factor correlation matrix in the output; oblique rotation is used automatically because we have argument cor=“mixed”. RMSR equals to 0.01 which is very good. RMSEA index = 0.06 which is acceptable. Tucker Lewis Index also has a good score (0.918). Overall, the results of factor analysis are somewhat fine but could be better.
Let us look at the structure of factors now.
fa.diagram(fa1)
As seen from the diagram, there are three factors that are not meaningful because they contain less than three variables. Moreover, there are one variable that does not fall into any factor. Thus, the number of factors should be decreased and the variable BSBG06H should be deleted.
datafa1 <- select(datafa, -BSBG06H)
Now, the number of factors is 4.
fa2<-fa(datafa1, 4, cor = "mixed")
fa2
## Factor Analysis using method = minres
## Call: fa(r = datafa1, nfactors = 4, cor = "mixed")
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR1 MR2 MR3 MR4 h2 u2 com
## BSBG06A 0.38 0.01 0.30 -0.04 0.30 0.70 1.9
## BSBG06B -0.18 0.72 -0.19 0.11 0.49 0.51 1.3
## BSBG06C 0.88 0.03 -0.05 0.01 0.77 0.23 1.0
## BSBG06D 0.53 -0.06 -0.02 0.31 0.41 0.59 1.7
## BSBG06E 0.25 0.63 0.11 -0.05 0.59 0.41 1.4
## BSBG06F 0.19 0.57 0.28 -0.14 0.59 0.41 1.9
## BSBG06G -0.10 -0.04 0.74 0.04 0.51 0.49 1.0
## BSBG06I 0.00 0.28 0.22 0.40 0.36 0.64 2.4
## BSBG06J 0.09 -0.03 0.07 0.66 0.48 0.52 1.1
## BSBG06K 0.05 0.11 0.48 0.21 0.37 0.63 1.5
##
## MR1 MR2 MR3 MR4
## SS loadings 1.46 1.42 1.16 0.84
## Proportion Var 0.15 0.14 0.12 0.08
## Cumulative Var 0.15 0.29 0.40 0.49
## Proportion Explained 0.30 0.29 0.24 0.17
## Cumulative Proportion 0.30 0.59 0.83 1.00
##
## With factor correlations of
## MR1 MR2 MR3 MR4
## MR1 1.00 0.26 0.31 0.17
## MR2 0.26 1.00 0.22 0.08
## MR3 0.31 0.22 1.00 0.15
## MR4 0.17 0.08 0.15 1.00
##
## Mean item complexity = 1.5
## Test of the hypothesis that 4 factors are sufficient.
##
## The degrees of freedom for the null model are 45 and the objective function was 2.27 with Chi Square of 10383.08
## The degrees of freedom for the model are 11 and the objective function was 0.1
##
## The root mean square of the residuals (RMSR) is 0.03
## The df corrected root mean square of the residuals is 0.05
##
## The harmonic number of observations is 4575 with the empirical chi square 263.53 with prob < 4.1e-50
## The total number of observations was 4575 with Likelihood Chi Square = 449.97 with prob < 1.5e-89
##
## Tucker Lewis Index of factoring reliability = 0.826
## RMSEA index = 0.093 and the 90 % confidence intervals are 0.086 0.101
## BIC = 357.26
## Fit based upon off diagonal values = 0.99
## Measures of factor score adequacy
## MR1 MR2 MR3 MR4
## Correlation of (regression) scores with factors 0.90 0.87 0.82 0.77
## Multiple R square of scores with factors 0.82 0.75 0.68 0.59
## Minimum correlation of possible factor scores 0.64 0.50 0.36 0.18
The results are following. Cumulative variance is 0.49 which is somewhat good. For one factor proportion variance is still less than 0.1 which is not good. The proportion explained is rather unequal. Here we again have oblique rotation because there is factor correlation matrix in the output; oblique rotation is used automatically because we have argument cor=“mixed”. RMSR equals to 0.03 which is good. RMSEA index = 0.093 which is not so good. Tucker Lewis Index also has not enough good score (0.826). Overall, the results of factor analysis are rather undesirable.
Probably, it will be useful to look at factor structure again.
fa.diagram(fa2)
In the diagram it is seen two not meaningful factors, so let us try to reduce number of factors to 3.
fa3<-fa(datafa1, 3, cor = "mixed")
fa3
## Factor Analysis using method = minres
## Call: fa(r = datafa1, nfactors = 3, cor = "mixed")
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR1 MR2 MR3 h2 u2 com
## BSBG06A 0.07 0.33 0.23 0.24 0.76 1.9
## BSBG06B 0.59 -0.21 -0.09 0.29 0.71 1.3
## BSBG06C 0.06 0.90 -0.05 0.83 0.17 1.0
## BSBG06D -0.12 0.54 0.21 0.36 0.64 1.4
## BSBG06E 0.76 0.13 0.02 0.67 0.33 1.1
## BSBG06F 0.64 0.06 0.15 0.53 0.47 1.1
## BSBG06G 0.00 -0.08 0.57 0.30 0.70 1.0
## BSBG06I 0.20 -0.01 0.44 0.29 0.71 1.4
## BSBG06J -0.12 0.15 0.41 0.20 0.80 1.5
## BSBG06K 0.07 0.00 0.62 0.42 0.58 1.0
##
## MR1 MR2 MR3
## SS loadings 1.46 1.37 1.31
## Proportion Var 0.15 0.14 0.13
## Cumulative Var 0.15 0.28 0.41
## Proportion Explained 0.35 0.33 0.32
## Cumulative Proportion 0.35 0.68 1.00
##
## With factor correlations of
## MR1 MR2 MR3
## MR1 1.00 0.36 0.34
## MR2 0.36 1.00 0.34
## MR3 0.34 0.34 1.00
##
## Mean item complexity = 1.3
## Test of the hypothesis that 3 factors are sufficient.
##
## The degrees of freedom for the null model are 45 and the objective function was 2.27 with Chi Square of 10383.08
## The degrees of freedom for the model are 18 and the objective function was 0.29
##
## The root mean square of the residuals (RMSR) is 0.06
## The df corrected root mean square of the residuals is 0.09
##
## The harmonic number of observations is 4575 with the empirical chi square 1274.85 with prob < 1e-259
## The total number of observations was 4575 with Likelihood Chi Square = 1335.99 with prob < 7.8e-273
##
## Tucker Lewis Index of factoring reliability = 0.681
## RMSEA index = 0.127 and the 90 % confidence intervals are 0.121 0.132
## BIC = 1184.28
## Fit based upon off diagonal values = 0.95
## Measures of factor score adequacy
## MR1 MR2 MR3
## Correlation of (regression) scores with factors 0.88 0.92 0.82
## Multiple R square of scores with factors 0.78 0.85 0.67
## Minimum correlation of possible factor scores 0.55 0.70 0.34
The output tells the following. Cumulative variance is 0.41 which is somewhat acceptable. For all factors proportion variance is bigger than 0.1 which is good. The proportion explained is almost equal. Here we again have oblique rotation because there is factor correlation matrix in the output; oblique rotation is used automatically because we have argument cor=“mixed”. RMSR equals to 0.06 which is rather acceptable. RMSEA index = 0.127 which is not good. Tucker Lewis Index also has rather small score (0.681). Overall, although the fit is not really good, the explanatory aspect seems to be better than in previous models.
At this point it is possible to name and interpret factors. In order to understand the structure, it again is necessary to look at the diagram.
fa.diagram(fa3)
It is necessary to check model fit using Cronbach’s alpha. For this dataframes for each factor are created.
MR1<- as.data.frame(datafa[c("BSBG06E", "BSBG06F", "BSBG06B")])
MR2<- as.data.frame(datafa[c("BSBG06C", "BSBG06D", "BSBG06A")])
MR3<- as.data.frame(datafa[c("BSBG06K", "BSBG06G", "BSBG06I", "BSBG06J")])
And now we test scales of each factor using alpha function. Cronbach’s alpha > 0.7 indicates good reliability. Overall, tests show bad results but it can be because variables are binary.
As seen from the output below, for the first factor we have Cronbach’s alpha = 0.34 which is bad result. But there are no dropping items as deleting of any variable does not lead to significant increase of reliability.
psych::alpha(MR1, check.keys = TRUE)
##
## Reliability analysis
## Call: psych::alpha(x = MR1, check.keys = TRUE)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.25 0.34 0.26 0.15 0.52 0.016 1.1 0.16 0.14
##
## lower alpha upper 95% confidence boundaries
## 0.22 0.25 0.29
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## BSBG06E 0.12 0.18 0.10 0.10 0.22 0.017 NA 0.10
## BSBG06F 0.19 0.24 0.14 0.14 0.32 0.019 NA 0.14
## BSBG06B 0.33 0.34 0.21 0.21 0.52 0.019 NA 0.21
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## BSBG06E 4575 0.54 0.68 0.40 0.19 1.0 0.18
## BSBG06F 4575 0.44 0.66 0.35 0.17 1.0 0.14
## BSBG06B 4575 0.86 0.63 0.26 0.15 1.2 0.37
##
## Non missing response frequency for each item
## 1 2 miss
## BSBG06E 0.97 0.03 0
## BSBG06F 0.98 0.02 0
## BSBG06B 0.84 0.16 0
For the second factor, alpha equals to 0.41 which is bad. Nevertheless, none of variables should be deleted to increase reliability.
psych::alpha(MR2, check.keys = TRUE)
##
## Reliability analysis
## Call: psych::alpha(x = MR2, check.keys = TRUE)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.38 0.41 0.32 0.19 0.68 0.015 1.2 0.25 0.18
##
## lower alpha upper 95% confidence boundaries
## 0.35 0.38 0.41
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## BSBG06C 0.23 0.24 0.14 0.14 0.31 0.022 NA 0.14
## BSBG06D 0.29 0.30 0.18 0.18 0.43 0.020 NA 0.18
## BSBG06A 0.35 0.39 0.24 0.24 0.65 0.017 NA 0.24
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## BSBG06C 4575 0.60 0.70 0.44 0.28 1.1 0.28
## BSBG06D 4575 0.78 0.68 0.39 0.24 1.3 0.47
## BSBG06A 4575 0.62 0.65 0.31 0.19 1.1 0.36
##
## Non missing response frequency for each item
## 1 2 miss
## BSBG06C 0.92 0.08 0
## BSBG06D 0.68 0.32 0
## BSBG06A 0.85 0.15 0
As for MR3, Cronbach’s alpha is 0.43 which is not really good result. However, there are no dropping items as deleting of any variable does not lead to significant increase of reliability.
psych::alpha(MR3, check.keys = TRUE)
##
## Reliability analysis
## Call: psych::alpha(x = MR3, check.keys = TRUE)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.42 0.43 0.37 0.16 0.75 0.014 1.6 0.27 0.15
##
## lower alpha upper 95% confidence boundaries
## 0.4 0.42 0.45
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## BSBG06K 0.33 0.33 0.25 0.14 0.48 0.017 0.0043 0.13
## BSBG06G 0.38 0.38 0.29 0.17 0.61 0.016 0.0013 0.16
## BSBG06I 0.33 0.34 0.27 0.15 0.52 0.017 0.0055 0.14
## BSBG06J 0.38 0.38 0.30 0.17 0.62 0.016 0.0025 0.16
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## BSBG06K 4575 0.59 0.63 0.41 0.27 1.8 0.40
## BSBG06G 4575 0.57 0.59 0.34 0.21 1.7 0.44
## BSBG06I 4575 0.63 0.62 0.39 0.26 1.3 0.47
## BSBG06J 4575 0.63 0.59 0.33 0.21 1.5 0.50
##
## Non missing response frequency for each item
## 1 2 miss
## BSBG06K 0.21 0.79 0
## BSBG06G 0.26 0.74 0
## BSBG06I 0.68 0.32 0
## BSBG06J 0.51 0.49 0
In order to use factors in regression analysis it is necessary to extract factor scores and include them in a dataset.
fascores<-as.data.frame(fa1$scores)
datareg<-cbind(data_reg,fascores)
In this part of I will continue predicting math achievement (BSMMAT01). Here, factor scores will be added into the model as predictors.
model3 <- lm(BSMMAT01 ~ BSBG04a + BSBG06A + BSBG06B + BSBGSCM + BSBGSVM + MR1 + MR2 + MR3, data = datareg)
summary(model3)
##
## Call:
## lm(formula = BSMMAT01 ~ BSBG04a + BSBG06A + BSBG06B + BSBGSCM +
## BSBGSVM + MR1 + MR2 + MR3, data = datareg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -267.153 -45.149 0.536 46.458 231.857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 362.0634 7.4089 48.869 < 2e-16 ***
## BSBG04a11-25 books 8.0074 4.4831 1.786 0.07415 .
## BSBG04a26-100 books 17.2101 4.4045 3.907 9.47e-05 ***
## BSBG04a101-200 books 30.0463 4.8665 6.174 7.23e-10 ***
## BSBG04aMore than 200 26.2766 5.2851 4.972 6.88e-07 ***
## BSBG06ANo 12.7431 3.1253 4.077 4.63e-05 ***
## BSBG06BNo -3.0490 3.9581 -0.770 0.44114
## BSBGSCM 18.0868 0.5601 32.291 < 2e-16 ***
## BSBGSVM -2.1584 0.5890 -3.665 0.00025 ***
## MR1 0.8342 1.2380 0.674 0.50048
## MR2 -7.0313 1.7490 -4.020 5.91e-05 ***
## MR3 1.4608 1.4692 0.994 0.32014
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 69.44 on 4563 degrees of freedom
## Multiple R-squared: 0.2364, Adjusted R-squared: 0.2345
## F-statistic: 128.4 on 11 and 4563 DF, p-value: < 2.2e-16
As seen from the output above, the model itself is statistically significant, however there are four variables which are insignificant. The model explains 23% of variance. According to regression coefficients, if effects of all other predictors are held constant:
It is necessary to give some meaning to the results above. Thus, in this part I will give some interpretation.
Starting with books, all regression models show that there almost no difference in math achievement between students with 0-10 books at home and 11-25 books at home, but starting from 26 books the math achievement increases with growing number of books compared to small amount of books.
Talking about computers and tablets, possession of own computer or tablet is associated with decreasing math achievement, however having a computer or a tablet shared with others at home is related to increase of math achievement. It is possible to assume that the presence of computer or tablet itself is good, but having it as own property imply that owner can do whatever he/she want without control which can badly affect studies.
If we want to compare importance of books and computers, it is necessary to look at standardized coefficients in regression.
lm.beta(model3)
##
## Call:
## lm(formula = BSMMAT01 ~ BSBG04a + BSBG06A + BSBG06B + BSBGSCM +
## BSBGSVM + MR1 + MR2 + MR3, data = datareg)
##
## Standardized Coefficients::
## (Intercept) BSBG04a11-25 books BSBG04a26-100 books
## 0.00000000 0.04598862 0.10569543
## BSBG04a101-200 books BSBG04aMore than 200 BSBG06ANo
## 0.13740292 0.09791692 0.05732738
## BSBG06BNo BSBGSCM BSBGSVM
## -0.01413943 0.45537096 -0.05116060
## MR1 MR2 MR3
## 0.01145199 -0.09516489 0.01854671
As sees from the output, the coefficients for books are larger than for computer ownership, so it is possible to assume that books are still more important for students’ achievement.
Moving forward, students who are confident in maths seem to have better math achievement. However, it is hard to interpret result about value of mathematics.
Finally, having own room, table and computer is associated with decreasing math achievement. It can be assumed that staying in room alone allow students to do some entertaining activities instead of studying which may lead to lower scores.
To conclude, it was found out 3 latent factors, namely resources for internet connection, private possessions and status. And due to regression analysis I revealed that books are more important for students achievement than computers.