In order to conduct regression analysis and further EFA TIMSS dataset of the year 2015 for the 8th graders in Russia was used.
Research question for regression: Can such variables as number of books at home, availability of one’s own or shared tablet/PC, gender and age be considered as predictors of one’s math achievement?
Research question for EFA: Are there any latent factors describing one’s home possessions?
Variables for EFA include:
library(foreign)
data1 <- read.spss("BSGRUSM6 (1).sav", to.data.frame = TRUE, use.value.labels = TRUE)
data2<-data1[c("BSMMAT01","BSBG04", "BSBG06A", "BSBG06B", "BSBG01", "BSBG06A","BSBG06B","BSBG06C", "BSBG06D", "BSBG06E", "BSBG06F", "BSBG06G", "BSBG06H", "BSBG06I", "BSBG06J", "BSBG06K","BSDAGE")]
data3<-na.omit(data2)
save1<-c("BSMMAT01","BSBG04", "BSBG06A", "BSBG06B", "BSBG01", "BSDAGE" )
save2<-c("BSBG06A","BSBG06B","BSBG06C", "BSBG06D", "BSBG06E", "BSBG06F", "BSBG06G", "BSBG06H", "BSBG06I", "BSBG06J", "BSBG06K")
data_reg <- data3[save1]
data_fa <- data3[save2]
names(data_reg)[names(data_reg) == "BSMMAT01"] <- "matach"
names(data_reg)[names(data_reg) == "BSBG01"] <- "gender"
names(data_reg)[names(data_reg) == "BSBG04"] <- "books"
names(data_reg)[names(data_reg) == "BSBG06A"] <- "pcown"
names(data_reg)[names(data_reg) == "BSBG06B"] <- "pcshared"
names(data_reg)[names(data_reg) == "BSDAGE"] <- "age"
data_reg$matach<-as.numeric(as.character(data_reg$matach))
data_reg$age<-as.numeric(as.character(data_reg$age))
str(data_reg)
## 'data.frame': 4598 obs. of 6 variables:
## $ matach : num 570 506 523 433 389 ...
## $ books : Factor w/ 5 levels "0–10 books","11–25 books",..: 2 4 2 3 5 3 2 2 3 5 ...
## $ pcown : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 2 1 ...
## $ pcshared: Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 2 1 1 1 ...
## $ gender : Factor w/ 2 levels "Girl","Boy": 2 2 1 2 2 1 1 2 2 1 ...
## $ age : num 15.4 15.5 14.2 14.9 14.6 ...
names(data_fa)[names(data_fa) == "BSBG06A"] <- "pcown"
names(data_fa)[names(data_fa) == "BSBG06B"] <- "pcshared"
names(data_fa)[names(data_fa) == "BSBG06C"] <- "studydesk"
names(data_fa)[names(data_fa) == "BSBG06D"] <- "ownroom"
names(data_fa)[names(data_fa) == "BSBG06E"] <- "internet"
names(data_fa)[names(data_fa) == "BSBG06F"] <- "mobphone"
names(data_fa)[names(data_fa) == "BSBG06G"] <- "gamingsys"
names(data_fa)[names(data_fa) == "BSBG06H"] <- "musinst"
names(data_fa)[names(data_fa) == "BSBG06I"] <- "car"
names(data_fa)[names(data_fa) == "BSBG06J"] <- "flat"
names(data_fa)[names(data_fa) == "BSBG06K"] <- "dishwasher"
str(data_fa)
## 'data.frame': 4598 obs. of 11 variables:
## $ pcown : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 2 1 ...
## $ pcshared : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 2 1 1 1 ...
## $ studydesk : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ ownroom : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ internet : Factor w/ 2 levels "Yes","No": 1 1 2 1 1 1 1 1 1 1 ...
## $ mobphone : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
## $ gamingsys : Factor w/ 2 levels "Yes","No": 1 2 2 2 1 2 2 1 2 1 ...
## $ musinst : Factor w/ 2 levels "Yes","No": 2 2 2 2 1 2 2 1 1 2 ...
## $ car : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 1 1 1 1 1 ...
## $ flat : Factor w/ 2 levels "Yes","No": 1 2 2 1 1 2 2 2 2 1 ...
## $ dishwasher: Factor w/ 2 levels "Yes","No": 2 2 2 2 1 2 2 2 1 2 ...
library(summarytools)
print(dfSummary(data_fa, graph.magnif = 0.75,style="grid", varnumbers=FALSE, valid.col=FALSE, na.col=FALSE),
max.tbl.height = 300, method = "render")
| Variable | Stats / Values | Freqs (% of Valid) | Graph | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| pcown [factor] | 1. Yes 2. No |
|
|||||||||
| pcshared [factor] | 1. Yes 2. No |
|
|||||||||
| studydesk [factor] | 1. Yes 2. No |
|
|||||||||
| ownroom [factor] | 1. Yes 2. No |
|
|||||||||
| internet [factor] | 1. Yes 2. No |
|
|||||||||
| mobphone [factor] | 1. Yes 2. No |
|
|||||||||
| gamingsys [factor] | 1. Yes 2. No |
|
|||||||||
| musinst [factor] | 1. Yes 2. No |
|
|||||||||
| car [factor] | 1. Yes 2. No |
|
|||||||||
| flat [factor] | 1. Yes 2. No |
|
|||||||||
| dishwasher [factor] | 1. Yes 2. No |
|
Generated by summarytools 0.9.6 (R version 4.0.0)
2020-06-28
For further EFA analysis we have 11 yes/no questions, which are aimed at finding out whether a respondent possesses certain items, like their own or shared pc/tablet, study desk, own room, inernet connection at home, mobile phone, gaming system, music instrument, car, flat or dishwasher. In the table above some frequencies of answers can be observed.
library(ggplot2)
ggplot(data_reg, aes(x = gender)) +
geom_bar(col = "navy", fill = "cornflowerblue") +
ggtitle("Gender distribution barplot")+
xlab("boy/girl")+
ylab("Number of observations")+
theme_minimal()
The number of boys and girls in the data is almost the same which means that we have a representative sample.
t1 <- t.test(matach ~ gender, data_reg)
t1
##
## Welch Two Sample t-test
##
## data: matach by gender
## t = -2.4761, df = 4563.7, p-value = 0.01332
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -10.391998 -1.207794
## sample estimates:
## mean in group Girl mean in group Boy
## 535.1722 540.9721
Mean math achievement score for girls is 535.1722, which is a little bit lower than for boys whose mean achievement score is 540.9721 Additionally t-test produced relatively low p-value(<0.05). Therefore, maybe it is possible to hypothesise that further regression analysis might prove the fact that gender can be considered as a predictor of math achievement scores with boys performing better than boys.
library(ggplot2)
library(ggrepel)
ggplot(data_reg, aes(x = books, fill= gender)) +
geom_bar(col="grey") +
ggtitle("Number of books at home barplot")+
xlab("Number of books at home")+
ylab("Number of observations")+
geom_text_repel(stat='count', aes(label=..count..), vjust=0.4,size = 3, position = position_stack(vjust = 0.5)) +
theme_minimal()
The barplot shows that the majority of respondents have less than 100 books at home.
library(dplyr)
data_reg %>%
group_by(books) %>%
summarize(mean = mean(matach))
## # A tibble: 5 x 2
## books mean
## <fct> <dbl>
## 1 0–10 books 514.
## 2 11–25 books 525.
## 3 26–100 books 541.
## 4 101–200 books 557.
## 5 More than 200 555.
aov.out <- aov(data_reg$matach ~ data_reg$books)
summary(aov.out)
## Df Sum Sq Mean Sq F value Pr(>F)
## data_reg$books 4 792827 198207 32.32 <2e-16 ***
## Residuals 4593 28167322 6133
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It can be seen that mean math achievement scores are incresing when the number of books also increasing. Additionally, anova produced very small p-value (<2e-16), which means that the difference in math achievement scores across groups with different numbers of books is statisticcally significant. Thus, it is again possible to assume that number of books can be a predictor of one’s math performance in further regression analysis with bigger number of books being associated with higher performance scores.
t2 <- t.test(matach ~ pcown, data_reg)
t2
##
## Welch Two Sample t-test
##
## data: matach by pcown
## t = -3.7896, df = 918.85, p-value = 0.0001607
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -19.63468 -6.23655
## sample estimates:
## mean in group Yes mean in group No
## 536.2291 549.1647
By looking at mean values it can be concluded that mean math performance scores among those who doesn’t have their own devices is bigger than among those who have. Moreover, t-test comparing math achievement between two groups produced relatively small p-value (= 0.0001607), which means that the diference is statistically significant and it can be assumed that one’s possesion of his/her own pc or tablet can be a predictor of one’s math performance with those who doesn’t have their own device performing better than those who have.
model1<-lm(matach ~ books+pcown+pcshared+age+gender, data = data_reg)
summary(model1)
##
## Call:
## lm(formula = matach ~ books + pcown + pcshared + age + gender,
## data = data_reg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -248.495 -54.581 1.953 53.622 268.051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 785.956 40.271 19.516 < 2e-16 ***
## books11–25 books 10.675 4.957 2.153 0.0313 *
## books26–100 books 26.298 4.853 5.419 6.29e-08 ***
## books101–200 books 42.524 5.347 7.953 2.27e-15 ***
## booksMore than 200 40.822 5.808 7.029 2.39e-12 ***
## pcownNo 13.449 3.200 4.203 2.69e-05 ***
## pcsharedNo -21.797 3.105 -7.020 2.54e-12 ***
## age -18.659 2.714 -6.874 7.06e-12 ***
## genderBoy 10.218 2.299 4.444 9.05e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 77.23 on 4589 degrees of freedom
## Multiple R-squared: 0.05479, Adjusted R-squared: 0.05314
## F-statistic: 33.25 on 8 and 4589 DF, p-value: < 2.2e-16
Adjusted R-squared is 0.05258, which means that model explains aproximately 5% of variance. All predictors have statistically significant p-values (smaller than 0.05) and the coefficients show that:
Interpretation of coefficients:
Comparing with people, who have 0-10 books at home,for those who have 1–25 books math performance increases by 10.675
Comparing with people, who have 0-10 books at home,for those who have 26–100 books math performance increases by 26.298
Comparing with people, who have 0-10 books at home,for those who have 101–200 books math performance increases by 42.524
Comparing with people, who have 0-10 books at home,for those who have more than 200 math performance increases by 40.822
So generally we can say that the more books a person has at home, the better his/her math performance is
Comparing with people, who have their own pc or tablet,for those who don’t math performance increases by 13.449 This means that people without their own device perform better in math than those who have.
Comparing with people, who have shared pc or tablet,for those who don’t math performance decreases by 21.797 In this case it means if a respondent have a computer which he or she shares with other people than he or she has better math performance than those who don’t have one
With one unit increase in age math performance decreases by 18.659 In simple words this means that the the older a person is the worse his/her math performance is
Comparing with girls, for boys math performance increases by 10.218
library(sjPlot)
model2<-lm(matach ~ books+pcown+pcshared+age*gender,data = data_reg)
summary(model2)
##
## Call:
## lm(formula = matach ~ books + pcown + pcshared + age * gender,
## data = data_reg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -247.860 -54.599 1.927 53.929 267.951
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 762.097 57.690 13.210 < 2e-16 ***
## books11–25 books 10.684 4.958 2.155 0.0312 *
## books26–100 books 26.257 4.854 5.410 6.63e-08 ***
## books101–200 books 42.445 5.349 7.935 2.62e-15 ***
## booksMore than 200 40.802 5.809 7.024 2.46e-12 ***
## pcownNo 13.385 3.202 4.180 2.97e-05 ***
## pcsharedNo -21.772 3.105 -7.011 2.72e-12 ***
## age -17.038 3.904 -4.364 1.30e-05 ***
## genderBoy 56.562 80.262 0.705 0.4810
## age:genderBoy -3.140 5.435 -0.578 0.5635
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 77.24 on 4588 degrees of freedom
## Multiple R-squared: 0.05486, Adjusted R-squared: 0.053
## F-statistic: 29.59 on 9 and 4588 DF, p-value: < 2.2e-16
plot_model(model2, type = "int")
Adjusted R-squared hasn’t changed much. However, in this model p-values for the interaction efffect and for the gender variable alone is not statistically significant.
Interpretation of interaction effect:
So from the plot it can be seen that with age math achievement for boys decreases more drastically than for girls and when they are almost 18 years old their math performance becomes more or less the same as in case with girls.
anova(model1,model2)
## Analysis of Variance Table
##
## Model 1: matach ~ books + pcown + pcshared + age + gender
## Model 2: matach ~ books + pcown + pcshared + age * gender
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 4589 27373517
## 2 4588 27371527 1 1990.7 0.3337 0.5635
First model is better as p-value is statistically insignificant (0.5635).
library(car)
vif(model1)
## GVIF Df GVIF^(1/(2*Df))
## books 1.016461 4 1.002043
## pcown 1.008033 1 1.004008
## pcshared 1.011161 1 1.005565
## age 1.004747 1 1.002371
## gender 1.017660 1 1.008791
Values are less than 5. Therefore, it can be concluded that we do not have multicollinearity.
par(mfrow = c(2,2))
plot(model1)
Residuals VS Fitted (we can see that dots are not quite evenly dispersed around zero, which means that we face the problem of heteroscedasticity)
Normal Q-Q plot shows that our data is normally distributed
Also we do not have any leverages or influential cases, as Cook’s distance line is not present on the last plot
library(lmtest)
bptest(model1)
##
## studentized Breusch-Pagan test
##
## data: model1
## BP = 20.857, df = 8, p-value = 0.007538
ncvTest(model1)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 5.050145, Df = 1, p = 0.024624
These two tests also produce significant p-values, which proves the fact that we have the problem of heteroscedasticity.
Before starting the actual factor analysis we need to look at the correlation matrix and see whether we do have some potential factors:
library(polycor)
library(corrplot)
dat.cor <- hetcor(data_fa)
dat.cor<- dat.cor$correlations
dat.cor
## pcown pcshared studydesk ownroom internet
## pcown 1.000000000 -0.0759647395 0.3782703719 0.25719768 0.32617536
## pcshared -0.075964740 1.0000000000 -0.0003508223 0.00549872 0.37566069
## studydesk 0.378270372 -0.0003508223 1.0000000000 0.50262082 0.39863957
## ownroom 0.257197684 0.0054987201 0.5026208189 1.00000000 0.16481039
## internet 0.326175362 0.3756606865 0.3986395701 0.16481039 1.00000000
## mobphone 0.253977640 0.3399911414 0.3677667669 0.17860624 0.58337993
## gamingsys 0.280805149 -0.0566202120 0.0701626218 0.10423669 0.15747422
## musinst -0.001411598 0.1151978114 0.1682043566 0.05924328 0.21193472
## car 0.182116855 0.2084456543 0.1915666181 0.19299980 0.32018956
## flat 0.130631750 0.0302197019 0.1612083980 0.33053152 0.09930588
## dishwasher 0.189138072 0.0428103525 0.2347650651 0.20959728 0.22124916
## mobphone gamingsys musinst car flat dishwasher
## pcown 0.25397764 0.28080515 -0.001411598 0.1821169 0.13063175 0.18913807
## pcshared 0.33999114 -0.05662021 0.115197811 0.2084457 0.03021970 0.04281035
## studydesk 0.36776677 0.07016262 0.168204357 0.1915666 0.16120840 0.23476507
## ownroom 0.17860624 0.10423669 0.059243284 0.1929998 0.33053152 0.20959728
## internet 0.58337993 0.15747422 0.211934724 0.3201896 0.09930588 0.22124916
## mobphone 1.00000000 0.23002366 0.265221446 0.2463445 0.04563401 0.36891892
## gamingsys 0.23002366 1.00000000 0.118672347 0.2333951 0.13740871 0.38882885
## musinst 0.26522145 0.11867235 1.000000000 0.1722299 0.07820802 0.22539367
## car 0.24634448 0.23339510 0.172229868 1.0000000 0.33791602 0.31540963
## flat 0.04563401 0.13740871 0.078208015 0.3379160 1.00000000 0.24549622
## dishwasher 0.36891892 0.38882885 0.225393674 0.3154096 0.24549622 1.00000000
corrplot(dat.cor, method = "square", type="lower")
According to correlation plot it is hard to say whether we have certain groups of variables, which can comprise factors. However, there are some of the variables which have relatively high correlation indexes (those with more intense blue colours)
Turning variables into numeric form:
datafa<-as.data.frame(lapply(data_fa, as.numeric))
library(psych)
fa.parallel(datafa)
## Parallel analysis suggests that the number of factors = 5 and the number of components = 4
Interpretation of the Parallel Analysis screen plot:
From the plot it can be seen that 5 factors in the “Factor Analysis” lie above the corresponding simulated data line and 4 components in the “Principal Components” parallel analysis lie above the corresponding simulated data line.
Therefore, Parallel analysis suggests that the number of factors = 5 and the number of components = 4
fa1<-fa(datafa, 5, cor = "mixed")
fa1
## Factor Analysis using method = minres
## Call: fa(r = datafa, nfactors = 5, cor = "mixed")
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR2 MR1 MR3 MR5 MR4 h2 u2 com
## pcown 0.03 0.04 0.00 0.79 0.02 0.67 0.333 1.0
## pcshared 0.68 -0.12 -0.12 -0.15 0.10 0.43 0.571 1.3
## studydesk 0.01 0.97 -0.01 0.01 0.00 0.96 0.044 1.0
## ownroom -0.05 0.44 0.00 0.06 0.32 0.37 0.631 1.9
## internet 0.69 0.10 0.01 0.21 0.00 0.64 0.358 1.2
## mobphone 0.56 0.16 0.31 0.03 -0.14 0.61 0.392 1.9
## gamingsys -0.06 -0.16 0.54 0.24 0.02 0.38 0.617 1.6
## musinst 0.18 0.12 0.29 -0.19 -0.01 0.16 0.842 2.9
## car 0.28 -0.05 0.20 0.05 0.38 0.34 0.655 2.5
## flat -0.01 0.04 0.05 0.02 0.71 0.54 0.461 1.0
## dishwasher 0.01 0.07 0.69 -0.06 0.11 0.52 0.476 1.1
##
## MR2 MR1 MR3 MR5 MR4
## SS loadings 1.46 1.32 1.13 0.86 0.86
## Proportion Var 0.13 0.12 0.10 0.08 0.08
## Cumulative Var 0.13 0.25 0.36 0.43 0.51
## Proportion Explained 0.26 0.23 0.20 0.15 0.15
## Cumulative Proportion 0.26 0.49 0.70 0.85 1.00
##
## With factor correlations of
## MR2 MR1 MR3 MR5 MR4
## MR2 1.00 0.29 0.30 0.16 0.07
## MR1 0.29 1.00 0.26 0.41 0.18
## MR3 0.30 0.26 1.00 0.34 0.25
## MR5 0.16 0.41 0.34 1.00 0.12
## MR4 0.07 0.18 0.25 0.12 1.00
##
## Mean item complexity = 1.6
## Test of the hypothesis that 5 factors are sufficient.
##
## The degrees of freedom for the null model are 55 and the objective function was 2.42 with Chi Square of 11105.62
## The degrees of freedom for the model are 10 and the objective function was 0.03
##
## The root mean square of the residuals (RMSR) is 0.01
## The df corrected root mean square of the residuals is 0.03
##
## The harmonic number of observations is 4598 with the empirical chi square 77.23 with prob < 1.8e-12
## The total number of observations was 4598 with Likelihood Chi Square = 153.09 with prob < 8.6e-28
##
## Tucker Lewis Index of factoring reliability = 0.929
## RMSEA index = 0.056 and the 90 % confidence intervals are 0.048 0.064
## BIC = 68.76
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## MR2 MR1 MR3 MR5 MR4
## Correlation of (regression) scores with factors 0.87 0.98 0.83 0.84 0.79
## Multiple R square of scores with factors 0.76 0.96 0.68 0.71 0.62
## Minimum correlation of possible factor scores 0.53 0.92 0.37 0.42 0.24
fa.diagram(fa1)
Description of the model fit:
First of all, looking at the factor loadings it can be seen that almost all variables belong to only one factor, this is also proved by relatively low complexity values. Communality value (h2) is not always higher than uniqueness(u2), which is not a goof sign.
Proportion explained: the explained variance should be evenly distributed among factors. In this model we can see that the first three factors have approximately the same value of explained proportion, while the other two have smaller values. This is not perfect, but looking on other fit indexes it is possible to say that the model actually has a good fit
Proportion variance: A factor should explain at least 10% of the variance. In this model it can be seen that MR5 and MR4 do not meet this criterion, but they are still rather close to it, as they both explain 8% of variance.
Cumulative Variance: looking at this parameter we can see that all in all our model explains 51% of variance
Also Chi Square of 11105.62 tells us that observed and expected data aren’t significantly different, which is good
Tucker Lewis Index of factoring reliability = 0.929, which is a very good measure of model fit (it should be >0.9)
RMSR index = 0.01 , which is also good, as it should be <0,05
However, factor MR5 include only one variable, which is rather meaningless. That is why, probably there is some logic in trying to reduce the number of factors
fa2<-fa(datafa, 4, cor = "mixed")
fa2
## Factor Analysis using method = minres
## Call: fa(r = datafa, nfactors = 4, cor = "mixed")
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR2 MR1 MR3 MR4 h2 u2 com
## pcown -0.02 0.40 0.29 -0.04 0.29 0.71 1.8
## pcshared 0.70 -0.19 -0.21 0.12 0.46 0.54 1.4
## studydesk 0.06 0.86 -0.05 0.00 0.75 0.25 1.0
## ownroom -0.08 0.54 -0.02 0.31 0.42 0.58 1.6
## internet 0.64 0.24 0.09 -0.04 0.59 0.41 1.3
## mobphone 0.62 0.18 0.25 -0.12 0.62 0.38 1.6
## gamingsys -0.03 -0.09 0.72 0.04 0.49 0.51 1.0
## musinst 0.25 0.01 0.14 0.06 0.11 0.89 1.7
## car 0.28 -0.01 0.21 0.40 0.36 0.64 2.4
## flat -0.04 0.09 0.07 0.66 0.48 0.52 1.1
## dishwasher 0.14 0.03 0.49 0.20 0.39 0.61 1.5
##
## MR2 MR1 MR3 MR4
## SS loadings 1.53 1.44 1.14 0.84
## Proportion Var 0.14 0.13 0.10 0.08
## Cumulative Var 0.14 0.27 0.37 0.45
## Proportion Explained 0.31 0.29 0.23 0.17
## Cumulative Proportion 0.31 0.60 0.83 1.00
##
## With factor correlations of
## MR2 MR1 MR3 MR4
## MR2 1.00 0.28 0.22 0.09
## MR1 0.28 1.00 0.31 0.18
## MR3 0.22 0.31 1.00 0.17
## MR4 0.09 0.18 0.17 1.00
##
## Mean item complexity = 1.5
## Test of the hypothesis that 4 factors are sufficient.
##
## The degrees of freedom for the null model are 55 and the objective function was 2.42 with Chi Square of 11105.62
## The degrees of freedom for the model are 17 and the objective function was 0.13
##
## The root mean square of the residuals (RMSR) is 0.03
## The df corrected root mean square of the residuals is 0.05
##
## The harmonic number of observations is 4598 with the empirical chi square 439.34 with prob < 1.1e-82
## The total number of observations was 4598 with Likelihood Chi Square = 607.74 with prob < 3.3e-118
##
## Tucker Lewis Index of factoring reliability = 0.827
## RMSEA index = 0.087 and the 90 % confidence intervals are 0.081 0.093
## BIC = 464.38
## Fit based upon off diagonal values = 0.99
## Measures of factor score adequacy
## MR2 MR1 MR3 MR4
## Correlation of (regression) scores with factors 0.87 0.90 0.82 0.77
## Multiple R square of scores with factors 0.76 0.81 0.67 0.59
## Minimum correlation of possible factor scores 0.52 0.61 0.34 0.18
fa.diagram(fa2)
In this case BIC equals to 464.38 which is much higher than in the model with 5 factors (BIC=68.76). Therefore, according to this estimate we can say that this model is worse. However, BIC estimates will show better results for models with bigger amount of factors. But a factor containing just one variable, like MR5 in the first model is not good and it is better to choose a model which is conceptually more understandable and has some logic behind it even though the fit of the model with less factors is a little bit worse:
Tucker Lewis Index of factoring reliability is smaller and equals to 0.827, which is not that good as in first case
RMSR index is bigger (0.03) , but it is still smaller than 0.05, which is a good sign
Proportion explained is again not completely evenly distributed but it is not much worse than in the first case
Proportion variance is almost everywhere bigger than 10%
Cumulative variance is 0.45, which slightly lower than in the model with 5 factors
However, this model is still not good enough. It can be seen that variable ‘musinst’ is left alone in both of the previous models and does not refer to any of the factors. Moreover, in both cases the uniqueness of this variable is very much higher than the communality. Additionally, factors MR3 and MR4 still do not make much sense. Thus, probably there is some logic behind building a model without ‘musinst’ variable and with only 3 factors
library(dplyr)
datafa1 <- select(datafa, -musinst)
fa3<-fa(datafa1, 3, cor = "mixed")
fa3
## Factor Analysis using method = minres
## Call: fa(r = datafa1, nfactors = 3, cor = "mixed")
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR2 MR1 MR3 h2 u2 com
## pcown 0.04 0.33 0.24 0.24 0.76 1.9
## pcshared 0.62 -0.20 -0.10 0.31 0.69 1.3
## studydesk 0.06 0.90 -0.05 0.82 0.18 1.0
## ownroom -0.12 0.54 0.21 0.36 0.64 1.4
## internet 0.74 0.14 0.03 0.66 0.34 1.1
## mobphone 0.64 0.07 0.15 0.55 0.45 1.1
## gamingsys -0.01 -0.08 0.59 0.31 0.69 1.0
## car 0.21 -0.01 0.43 0.29 0.71 1.4
## flat -0.11 0.14 0.40 0.19 0.81 1.4
## dishwasher 0.08 0.00 0.61 0.41 0.59 1.0
##
## MR2 MR1 MR3
## SS loadings 1.47 1.37 1.31
## Proportion Var 0.15 0.14 0.13
## Cumulative Var 0.15 0.28 0.42
## Proportion Explained 0.35 0.33 0.32
## Cumulative Proportion 0.35 0.68 1.00
##
## With factor correlations of
## MR2 MR1 MR3
## MR2 1.00 0.36 0.33
## MR1 0.36 1.00 0.34
## MR3 0.33 0.34 1.00
##
## Mean item complexity = 1.3
## Test of the hypothesis that 3 factors are sufficient.
##
## The degrees of freedom for the null model are 45 and the objective function was 2.29 with Chi Square of 10525.5
## The degrees of freedom for the model are 18 and the objective function was 0.28
##
## The root mean square of the residuals (RMSR) is 0.05
## The df corrected root mean square of the residuals is 0.09
##
## The harmonic number of observations is 4598 with the empirical chi square 1220.74 with prob < 4e-248
## The total number of observations was 4598 with Likelihood Chi Square = 1286.68 with prob < 2.9e-262
##
## Tucker Lewis Index of factoring reliability = 0.697
## RMSEA index = 0.124 and the 90 % confidence intervals are 0.118 0.13
## BIC = 1134.88
## Fit based upon off diagonal values = 0.96
## Measures of factor score adequacy
## MR2 MR1 MR3
## Correlation of (regression) scores with factors 0.88 0.92 0.82
## Multiple R square of scores with factors 0.77 0.85 0.67
## Minimum correlation of possible factor scores 0.55 0.70 0.34
fa.diagram(fa3)
The fit of this model is a little bit worse:
BIC of the model is higher than in previous models (1134.88).
Tucker Lewis Index of factoring reliability = 0.697, which is smaller than in both other cases
The root mean square of the residuals (RMSR) is 0.05, which is higher than in other 2 models
However, proportion explained is almost evenly distributed and proportion variance is higher than 10% in all factors
Cumulative variance is 42%, which is not very much smaller than in the previous model
Additionally, it is important to mention that the argument ‘cor=mixed’ uses oblique rotation by default allowing the factors to be correlated between each other.
All in all, it is possible to conclude that model with 3 factors and without ‘musinst’ variable is conceptually more logical than others two that is why it is better to choose this model.
Factor names:
MR2: Variables which were assigned to this factor mostly concerned with the technologies necessary for organizing a comfortable study environment at home. Computer, internet connection and mobile phone have become important tools which help people search for necessary information and communicate with others. Therefore, the name for this factor can be Technologies necessary for studying
MR1: Variables which were assigned to this factor mostly concerned with personal space of a person which can be crucial in organizing study process at home. Therefore, the name for this factor can be Personal space of a person
MR3: Variables which were assigned to this factor mostly serve the purpose of indicating wealth of a person, as not any family can afford to have gaming system, dishwasher, flat or a car. Therefore, the name for this factor can be Wealth indicators
library(psych)
MR1<- as.data.frame(datafa1[c("studydesk", "ownroom", "pcown")])
psych::alpha(MR1,check.keys = TRUE)
##
## Reliability analysis
## Call: psych::alpha(x = MR1, check.keys = TRUE)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.38 0.41 0.32 0.19 0.68 0.015 1.2 0.25 0.18
##
## lower alpha upper 95% confidence boundaries
## 0.35 0.38 0.41
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## studydesk 0.23 0.24 0.14 0.14 0.32 0.022 NA 0.14
## ownroom 0.29 0.30 0.18 0.18 0.43 0.020 NA 0.18
## pcown 0.35 0.39 0.24 0.24 0.64 0.017 NA 0.24
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## studydesk 4598 0.60 0.70 0.44 0.28 1.1 0.28
## ownroom 4598 0.78 0.68 0.39 0.24 1.3 0.47
## pcown 4598 0.62 0.65 0.31 0.19 1.2 0.36
##
## Non missing response frequency for each item
## 1 2 miss
## studydesk 0.92 0.08 0
## ownroom 0.68 0.32 0
## pcown 0.85 0.15 0
Std. Cronbach’s alpha is 0.41, which is rather small and indicates not very good scale reliability.
MR2<- as.data.frame(datafa1[c("internet", "mobphone", "pcshared")])
psych::alpha(MR2,check.keys = TRUE)
##
## Reliability analysis
## Call: psych::alpha(x = MR2, check.keys = TRUE)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.26 0.35 0.27 0.15 0.54 0.016 1.1 0.16 0.14
##
## lower alpha upper 95% confidence boundaries
## 0.23 0.26 0.29
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## internet 0.13 0.19 0.10 0.10 0.23 0.017 NA 0.10
## mobphone 0.20 0.24 0.14 0.14 0.32 0.019 NA 0.14
## pcshared 0.35 0.36 0.22 0.22 0.55 0.019 NA 0.22
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## internet 4598 0.55 0.68 0.40 0.20 1.0 0.18
## mobphone 4598 0.45 0.67 0.36 0.18 1.0 0.14
## pcshared 4598 0.86 0.63 0.26 0.16 1.2 0.37
##
## Non missing response frequency for each item
## 1 2 miss
## internet 0.97 0.03 0
## mobphone 0.98 0.02 0
## pcshared 0.84 0.16 0
Std. Cronbach’s alpha is 0.35, which is rather small and indicates not very good scale reliability.
MR3<- as.data.frame(datafa1[c("dishwasher", "gamingsys", "car", "flat")])
psych::alpha(MR3,check.keys = TRUE)
##
## Reliability analysis
## Call: psych::alpha(x = MR3, check.keys = TRUE)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.42 0.43 0.37 0.16 0.75 0.014 1.6 0.27 0.15
##
## lower alpha upper 95% confidence boundaries
## 0.4 0.42 0.45
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## dishwasher 0.33 0.33 0.25 0.14 0.49 0.017 0.0043 0.13
## gamingsys 0.38 0.38 0.29 0.17 0.61 0.016 0.0014 0.16
## car 0.33 0.34 0.27 0.15 0.52 0.017 0.0053 0.14
## flat 0.38 0.38 0.30 0.17 0.62 0.016 0.0025 0.16
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## dishwasher 4598 0.59 0.63 0.41 0.26 1.8 0.40
## gamingsys 4598 0.58 0.59 0.34 0.21 1.7 0.44
## car 4598 0.63 0.62 0.39 0.26 1.3 0.47
## flat 4598 0.63 0.59 0.33 0.22 1.5 0.50
##
## Non missing response frequency for each item
## 1 2 miss
## dishwasher 0.21 0.79 0
## gamingsys 0.26 0.74 0
## car 0.68 0.32 0
## flat 0.51 0.49 0
Std. Cronbach’s alpha is 0.43, which is rather small and indicates not very good scale reliability.
fascores<-as.data.frame(fa3$scores)
datareg<-cbind(data_reg,fascores)
names(datareg)[names(datareg) == "MR1"] <- "personal"
names(datareg)[names(datareg) == "MR2"] <- "tech"
names(datareg)[names(datareg) == "MR3"] <- "wealth"
model3<-lm(matach ~ books+pcown+pcshared+age+gender+wealth, data = datareg)
summary(model3)
##
## Call:
## lm(formula = matach ~ books + pcown + pcshared + age + gender +
## wealth, data = datareg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -252.985 -54.562 2.304 54.382 263.086
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 786.447 40.246 19.541 < 2e-16 ***
## books11–25 books 10.136 4.958 2.044 0.04099 *
## books26–100 books 25.413 4.861 5.228 1.79e-07 ***
## books101–200 books 41.122 5.370 7.658 2.29e-14 ***
## booksMore than 200 39.276 5.834 6.732 1.88e-11 ***
## pcownNo 16.814 3.445 4.881 1.09e-06 ***
## pcsharedNo -21.417 3.106 -6.895 6.13e-12 ***
## age -18.655 2.712 -6.877 6.92e-12 ***
## genderBoy 9.742 2.305 4.226 2.42e-05 ***
## wealth -3.300 1.256 -2.627 0.00863 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 77.18 on 4588 degrees of freedom
## Multiple R-squared: 0.05621, Adjusted R-squared: 0.05436
## F-statistic: 30.36 on 9 and 4588 DF, p-value: < 2.2e-16
Adjusted R-squared (0.05436) is not much bigger than in the first model(0.05314). New variable “wealth” has statistically significant effect on the outcome varible as p-value relatively small (0.00863). And coefficient shows that with one unit increase in welath math achievement decreases by 3.300. In simple words, the results show that the higher wealth indicator of a person the lower his/her math achievement.
After conducting EFA 3 latent factors were found:
Technologies necessary for studying
Personal space of a person
Wealth indicators
Regression analysis showed that:
The more books a person has at home, the better his/her math performance is. There is some logic behind this finding as it is a common knowledge that reading enhances one’s mental abilities. Moreover, big amount of books indicate that the family a respondent is intelligent enough to encourage their children to study
People without their own device(pc or tablet) perform better in math than those who have one. Such finding can be explained by the fact that a person who has his/her own devices has more time and ability to play games, surf the net and spend time in the social media instead of studying. Therefore, such people do not spend enough time learning and practicing math as they have more pleasant things to do
If a respondent have a computer which he or she shares with other people than he or she has better math performance than those who don’t have one. In this case people have limited amount of time for computer or tablet usage as they have to share it with other people in the house. As result, they have more time to do homework and as a result better math performance.
Gender also appeared to be a significant predictor of math achievement with boys showing higher math achievement scores than girls. Such findings can be explained by the teacher’s stereotypes in Russia concerning the fact that girls are better in humanitarian subjects rather than in STEM ones like boys. This can certainly influence the educational approaches and as a result student’s performance
Finally, analysis showed that the higher wealth indicator of a person the lower his/her math achievement.