Exam task

Introduction

In order to conduct regression analysis and further EFA TIMSS dataset of the year 2015 for the 8th graders in Russia was used.

Research question for regression: Can such variables as number of books at home, availability of one’s own or shared tablet/PC, gender and age be considered as predictors of one’s math achievement?

Research question for EFA: Are there any latent factors describing one’s home possessions?

Variables for EFA include:

BSBG06A (Possession of own pc/tablet)
BSBG06B (Possession of shared pc/tablet)
BSBG06C (Possession of a study desk)
BSBG06D (Possession of own room)
BSBG06E (Internet connection at home)
BSBG06F (Possession of mobile phone)
BSBG06G (Possession of a gaming system)
BSBG06H (Possession of a musical instrument
BSBG06I (Possession of a car)
BSBG06J (Possession of a flat)
BSBG06K (Possession of sa dishwasher)

Data preparation

Importing data

library(foreign)
data1 <- read.spss("BSGRUSM6 (1).sav", to.data.frame = TRUE, use.value.labels = TRUE)
data2<-data1[c("BSMMAT01","BSBG04", "BSBG06A", "BSBG06B", "BSBG01", "BSBG06A","BSBG06B","BSBG06C", "BSBG06D", "BSBG06E", "BSBG06F", "BSBG06G", "BSBG06H", "BSBG06I", "BSBG06J", "BSBG06K","BSDAGE")]
data3<-na.omit(data2)

Making two separate datasets for factor analysis and regression analysis

save1<-c("BSMMAT01","BSBG04", "BSBG06A", "BSBG06B", "BSBG01", "BSDAGE" )
save2<-c("BSBG06A","BSBG06B","BSBG06C", "BSBG06D", "BSBG06E", "BSBG06F", "BSBG06G", "BSBG06H", "BSBG06I", "BSBG06J", "BSBG06K")
data_reg <- data3[save1] 
data_fa <- data3[save2]

Renaming vars for regression

names(data_reg)[names(data_reg) == "BSMMAT01"] <- "matach"
names(data_reg)[names(data_reg) == "BSBG01"] <- "gender"
names(data_reg)[names(data_reg) == "BSBG04"] <- "books"
names(data_reg)[names(data_reg) == "BSBG06A"] <- "pcown"
names(data_reg)[names(data_reg) == "BSBG06B"] <- "pcshared"
names(data_reg)[names(data_reg) == "BSDAGE"] <- "age"
data_reg$matach<-as.numeric(as.character(data_reg$matach))
data_reg$age<-as.numeric(as.character(data_reg$age))
str(data_reg)

## 'data.frame':    4598 obs. of  6 variables:
##  $ matach  : num  570 506 523 433 389 ...
##  $ books   : Factor w/ 5 levels "0–10 books","11–25 books",..: 2 4 2 3 5 3 2 2 3 5 ...
##  $ pcown   : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 2 1 ...
##  $ pcshared: Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 2 1 1 1 ...
##  $ gender  : Factor w/ 2 levels "Girl","Boy": 2 2 1 2 2 1 1 2 2 1 ...
##  $ age     : num  15.4 15.5 14.2 14.9 14.6 ...

Renaming vars for EFA

names(data_fa)[names(data_fa) == "BSBG06A"] <- "pcown"
names(data_fa)[names(data_fa) == "BSBG06B"] <- "pcshared"
names(data_fa)[names(data_fa) == "BSBG06C"] <- "studydesk"
names(data_fa)[names(data_fa) == "BSBG06D"] <- "ownroom"
names(data_fa)[names(data_fa) == "BSBG06E"] <- "internet"
names(data_fa)[names(data_fa) == "BSBG06F"] <- "mobphone"
names(data_fa)[names(data_fa) == "BSBG06G"] <- "gamingsys"
names(data_fa)[names(data_fa) == "BSBG06H"] <- "musinst"
names(data_fa)[names(data_fa) == "BSBG06I"] <- "car"
names(data_fa)[names(data_fa) == "BSBG06J"] <- "flat"
names(data_fa)[names(data_fa) == "BSBG06K"] <- "dishwasher"
str(data_fa)

## 'data.frame':    4598 obs. of  11 variables:
##  $ pcown     : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 2 1 ...
##  $ pcshared  : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 2 1 1 1 ...
##  $ studydesk : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ ownroom   : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ internet  : Factor w/ 2 levels "Yes","No": 1 1 2 1 1 1 1 1 1 1 ...
##  $ mobphone  : Factor w/ 2 levels "Yes","No": 1 1 1 1 1 1 1 1 1 1 ...
##  $ gamingsys : Factor w/ 2 levels "Yes","No": 1 2 2 2 1 2 2 1 2 1 ...
##  $ musinst   : Factor w/ 2 levels "Yes","No": 2 2 2 2 1 2 2 1 1 2 ...
##  $ car       : Factor w/ 2 levels "Yes","No": 2 2 2 1 1 1 1 1 1 1 ...
##  $ flat      : Factor w/ 2 levels "Yes","No": 1 2 2 1 1 2 2 2 2 1 ...
##  $ dishwasher: Factor w/ 2 levels "Yes","No": 2 2 2 2 1 2 2 2 1 2 ...

Descriptive statistics

Summary of dataset with variables for EFA

library(summarytools)
print(dfSummary(data_fa, graph.magnif = 0.75,style="grid", varnumbers=FALSE, valid.col=FALSE, na.col=FALSE), 
      max.tbl.height = 300, method = "render")

Data Frame Summary

data_fa

Dimensions: 4598 x 11
Duplicates: 4165

Variable

Stats / Values

Freqs (% of Valid)

Graph

pcown [factor]

1. Yes 2. No

3907	(	85.0%	)
691	(	15.0%	)

pcshared [factor]

1. Yes 2. No

3851	(	83.8%	)
747	(	16.2%	)

studydesk [factor]

1. Yes 2. No

4216	(	91.7%	)
382	(	8.3%	)

ownroom [factor]

1. Yes 2. No

3116	(	67.8%	)
1482	(	32.2%	)

internet [factor]

1. Yes 2. No

4442	(	96.6%	)
156	(	3.4%	)

mobphone [factor]

1. Yes 2. No

4508	(	98.0%	)
90	(	2.0%	)

gamingsys [factor]

1. Yes 2. No

1211	(	26.3%	)
3387	(	73.7%	)

musinst [factor]

1. Yes 2. No

1693	(	36.8%	)
2905	(	63.2%	)

car [factor]

1. Yes 2. No

3118	(	67.8%	)
1480	(	32.2%	)

flat [factor]

1. Yes 2. No

2349	(	51.1%	)
2249	(	48.9%	)

dishwasher [factor]

1. Yes 2. No

945	(	20.5%	)
3653	(	79.5%	)

Generated by summarytools 0.9.6 (R version 4.0.0)
2020-06-28

For further EFA analysis we have 11 yes/no questions, which are aimed at finding out whether a respondent possesses certain items, like their own or shared pc/tablet, study desk, own room, inernet connection at home, mobile phone, gaming system, music instrument, car, flat or dishwasher. In the table above some frequencies of answers can be observed.

Gender distribution in the sample

library(ggplot2)
ggplot(data_reg, aes(x = gender)) +
  geom_bar(col = "navy", fill = "cornflowerblue") +
  ggtitle("Gender distribution barplot")+
  xlab("boy/girl")+
  ylab("Number of observations")+
  theme_minimal()

The number of boys and girls in the data is almost the same which means that we have a representative sample.

T-est (math achievement scores by gender)

t1 <- t.test(matach ~ gender, data_reg)
t1

## 
##  Welch Two Sample t-test
## 
## data:  matach by gender
## t = -2.4761, df = 4563.7, p-value = 0.01332
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.391998  -1.207794
## sample estimates:
## mean in group Girl  mean in group Boy 
##           535.1722           540.9721

Mean math achievement score for girls is 535.1722, which is a little bit lower than for boys whose mean achievement score is 540.9721 Additionally t-test produced relatively low p-value(<0.05). Therefore, maybe it is possible to hypothesise that further regression analysis might prove the fact that gender can be considered as a predictor of math achievement scores with boys performing better than boys.

Number of books at home barplot

library(ggplot2)
library(ggrepel)
ggplot(data_reg, aes(x = books, fill= gender)) +
  geom_bar(col="grey") +
  ggtitle("Number of books at home barplot")+
  xlab("Number of books at home")+
  ylab("Number of observations")+
  geom_text_repel(stat='count', aes(label=..count..), vjust=0.4,size = 3, position = position_stack(vjust = 0.5)) +
  theme_minimal()

The barplot shows that the majority of respondents have less than 100 books at home.

Mean math achievement scores by number of books + anova

library(dplyr)
data_reg %>% 
  group_by(books) %>% 
  summarize(mean = mean(matach))

## # A tibble: 5 x 2
##   books          mean
##   <fct>         <dbl>
## 1 0–10 books     514.
## 2 11–25 books    525.
## 3 26–100 books   541.
## 4 101–200 books  557.
## 5 More than 200  555.

aov.out <- aov(data_reg$matach ~ data_reg$books) 
summary(aov.out)

##                  Df   Sum Sq Mean Sq F value Pr(>F)    
## data_reg$books    4   792827  198207   32.32 <2e-16 ***
## Residuals      4593 28167322    6133                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

It can be seen that mean math achievement scores are incresing when the number of books also increasing. Additionally, anova produced very small p-value (<2e-16), which means that the difference in math achievement scores across groups with different numbers of books is statisticcally significant. Thus, it is again possible to assume that number of books can be a predictor of one’s math performance in further regression analysis with bigger number of books being associated with higher performance scores.

T-test (Mean math achievement scores according to one’s possesion of own pc/tablet)

t2 <- t.test(matach ~ pcown, data_reg)
t2

## 
##  Welch Two Sample t-test
## 
## data:  matach by pcown
## t = -3.7896, df = 918.85, p-value = 0.0001607
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -19.63468  -6.23655
## sample estimates:
## mean in group Yes  mean in group No 
##          536.2291          549.1647

By looking at mean values it can be concluded that mean math performance scores among those who doesn’t have their own devices is bigger than among those who have. Moreover, t-test comparing math achievement between two groups produced relatively small p-value (= 0.0001607), which means that the diference is statistically significant and it can be assumed that one’s possesion of his/her own pc or tablet can be a predictor of one’s math performance with those who doesn’t have their own device performing better than those who have.

T-test(Mean math achievement scores by one’s possesion of shared pc/tablet)

t3 <- t.test(matach ~ pcshared, data_reg)
t3

## 
##  Welch Two Sample t-test
## 
## data:  matach by pcshared
## t = 7.6733, df = 1031, p-value = 3.874e-14
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  18.48986 31.19584
## sample estimates:
## mean in group Yes  mean in group No 
##          542.2091          517.3662

Here mean math performance scores for those who have shared pc or tablet is bigger than for those who have not. Addirionally, t-test comparing math achievement between two groups produced small p-value (= 3.874e-14), which means that the diference is statistically significant and it can be assumed that one’s possesion of shared pc or tablet can be a predictor of one’s math performance with those who doesn’t have their own device performing worse than those who have.

Regression

Model 1

model1<-lm(matach ~ books+pcown+pcshared+age+gender, data = data_reg) 
summary(model1)

## 
## Call:
## lm(formula = matach ~ books + pcown + pcshared + age + gender, 
##     data = data_reg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -248.495  -54.581    1.953   53.622  268.051 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         785.956     40.271  19.516  < 2e-16 ***
## books11–25 books     10.675      4.957   2.153   0.0313 *  
## books26–100 books    26.298      4.853   5.419 6.29e-08 ***
## books101–200 books   42.524      5.347   7.953 2.27e-15 ***
## booksMore than 200   40.822      5.808   7.029 2.39e-12 ***
## pcownNo              13.449      3.200   4.203 2.69e-05 ***
## pcsharedNo          -21.797      3.105  -7.020 2.54e-12 ***
## age                 -18.659      2.714  -6.874 7.06e-12 ***
## genderBoy            10.218      2.299   4.444 9.05e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 77.23 on 4589 degrees of freedom
## Multiple R-squared:  0.05479,    Adjusted R-squared:  0.05314 
## F-statistic: 33.25 on 8 and 4589 DF,  p-value: < 2.2e-16

Adjusted R-squared is 0.05258, which means that model explains aproximately 5% of variance. All predictors have statistically significant p-values (smaller than 0.05) and the coefficients show that:

Interpretation of coefficients:

Comparing with people, who have 0-10 books at home,for those who have 1–25 books math performance increases by 10.675
Comparing with people, who have 0-10 books at home,for those who have 26–100 books math performance increases by 26.298
Comparing with people, who have 0-10 books at home,for those who have 101–200 books math performance increases by 42.524
Comparing with people, who have 0-10 books at home,for those who have more than 200 math performance increases by 40.822
So generally we can say that the more books a person has at home, the better his/her math performance is
Comparing with people, who have their own pc or tablet,for those who don’t math performance increases by 13.449 This means that people without their own device perform better in math than those who have.
Comparing with people, who have shared pc or tablet,for those who don’t math performance decreases by 21.797 In this case it means if a respondent have a computer which he or she shares with other people than he or she has better math performance than those who don’t have one
With one unit increase in age math performance decreases by 18.659 In simple words this means that the the older a person is the worse his/her math performance is
Comparing with girls, for boys math performance increases by 10.218

Model 2 (Adding interaction effect)

library(sjPlot)
model2<-lm(matach ~ books+pcown+pcshared+age*gender,data = data_reg)
summary(model2)

## 
## Call:
## lm(formula = matach ~ books + pcown + pcshared + age * gender, 
##     data = data_reg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -247.860  -54.599    1.927   53.929  267.951 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         762.097     57.690  13.210  < 2e-16 ***
## books11–25 books     10.684      4.958   2.155   0.0312 *  
## books26–100 books    26.257      4.854   5.410 6.63e-08 ***
## books101–200 books   42.445      5.349   7.935 2.62e-15 ***
## booksMore than 200   40.802      5.809   7.024 2.46e-12 ***
## pcownNo              13.385      3.202   4.180 2.97e-05 ***
## pcsharedNo          -21.772      3.105  -7.011 2.72e-12 ***
## age                 -17.038      3.904  -4.364 1.30e-05 ***
## genderBoy            56.562     80.262   0.705   0.4810    
## age:genderBoy        -3.140      5.435  -0.578   0.5635    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 77.24 on 4588 degrees of freedom
## Multiple R-squared:  0.05486,    Adjusted R-squared:  0.053 
## F-statistic: 29.59 on 9 and 4588 DF,  p-value: < 2.2e-16

plot_model(model2, type = "int")

Adjusted R-squared hasn’t changed much. However, in this model p-values for the interaction efffect and for the gender variable alone is not statistically significant.

Interpretation of interaction effect:

So from the plot it can be seen that with age math achievement for boys decreases more drastically than for girls and when they are almost 18 years old their math performance becomes more or less the same as in case with girls.

Comparison of models

anova(model1,model2)

## Analysis of Variance Table
## 
## Model 1: matach ~ books + pcown + pcshared + age + gender
## Model 2: matach ~ books + pcown + pcshared + age * gender
##   Res.Df      RSS Df Sum of Sq      F Pr(>F)
## 1   4589 27373517                           
## 2   4588 27371527  1    1990.7 0.3337 0.5635

First model is better as p-value is statistically insignificant (0.5635).

Checking for multicollinearity:

library(car)
vif(model1)

##              GVIF Df GVIF^(1/(2*Df))
## books    1.016461  4        1.002043
## pcown    1.008033  1        1.004008
## pcshared 1.011161  1        1.005565
## age      1.004747  1        1.002371
## gender   1.017660  1        1.008791

Values are less than 5. Therefore, it can be concluded that we do not have multicollinearity.

Model diagnostics

par(mfrow = c(2,2))
plot(model1)

Residuals VS Fitted (we can see that dots are not quite evenly dispersed around zero, which means that we face the problem of heteroscedasticity)
Normal Q-Q plot shows that our data is normally distributed
Also we do not have any leverages or influential cases, as Cook’s distance line is not present on the last plot

Checking for heteroscedasticity again:

library(lmtest)
bptest(model1)

## 
##  studentized Breusch-Pagan test
## 
## data:  model1
## BP = 20.857, df = 8, p-value = 0.007538

ncvTest(model1)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 5.050145, Df = 1, p = 0.024624

These two tests also produce significant p-values, which proves the fact that we have the problem of heteroscedasticity.

EFA

Correlation matrix

Before starting the actual factor analysis we need to look at the correlation matrix and see whether we do have some potential factors:

library(polycor)
library(corrplot)
dat.cor <- hetcor(data_fa)
dat.cor<- dat.cor$correlations
dat.cor

##                   pcown      pcshared     studydesk    ownroom   internet
## pcown       1.000000000 -0.0759647395  0.3782703719 0.25719768 0.32617536
## pcshared   -0.075964740  1.0000000000 -0.0003508223 0.00549872 0.37566069
## studydesk   0.378270372 -0.0003508223  1.0000000000 0.50262082 0.39863957
## ownroom     0.257197684  0.0054987201  0.5026208189 1.00000000 0.16481039
## internet    0.326175362  0.3756606865  0.3986395701 0.16481039 1.00000000
## mobphone    0.253977640  0.3399911414  0.3677667669 0.17860624 0.58337993
## gamingsys   0.280805149 -0.0566202120  0.0701626218 0.10423669 0.15747422
## musinst    -0.001411598  0.1151978114  0.1682043566 0.05924328 0.21193472
## car         0.182116855  0.2084456543  0.1915666181 0.19299980 0.32018956
## flat        0.130631750  0.0302197019  0.1612083980 0.33053152 0.09930588
## dishwasher  0.189138072  0.0428103525  0.2347650651 0.20959728 0.22124916
##              mobphone   gamingsys      musinst       car       flat dishwasher
## pcown      0.25397764  0.28080515 -0.001411598 0.1821169 0.13063175 0.18913807
## pcshared   0.33999114 -0.05662021  0.115197811 0.2084457 0.03021970 0.04281035
## studydesk  0.36776677  0.07016262  0.168204357 0.1915666 0.16120840 0.23476507
## ownroom    0.17860624  0.10423669  0.059243284 0.1929998 0.33053152 0.20959728
## internet   0.58337993  0.15747422  0.211934724 0.3201896 0.09930588 0.22124916
## mobphone   1.00000000  0.23002366  0.265221446 0.2463445 0.04563401 0.36891892
## gamingsys  0.23002366  1.00000000  0.118672347 0.2333951 0.13740871 0.38882885
## musinst    0.26522145  0.11867235  1.000000000 0.1722299 0.07820802 0.22539367
## car        0.24634448  0.23339510  0.172229868 1.0000000 0.33791602 0.31540963
## flat       0.04563401  0.13740871  0.078208015 0.3379160 1.00000000 0.24549622
## dishwasher 0.36891892  0.38882885  0.225393674 0.3154096 0.24549622 1.00000000

corrplot(dat.cor, method = "square", type="lower")

According to correlation plot it is hard to say whether we have certain groups of variables, which can comprise factors. However, there are some of the variables which have relatively high correlation indexes (those with more intense blue colours)

Turning variables into numeric form:

datafa<-as.data.frame(lapply(data_fa, as.numeric))

How many factors should be extracted?

library(psych)
fa.parallel(datafa)

## Parallel analysis suggests that the number of factors =  5  and the number of components =  4

Interpretation of the Parallel Analysis screen plot:

From the plot it can be seen that 5 factors in the “Factor Analysis” lie above the corresponding simulated data line and 4 components in the “Principal Components” parallel analysis lie above the corresponding simulated data line.
Therefore, Parallel analysis suggests that the number of factors = 5 and the number of components = 4

Building a factor model with 5 factors

fa1<-fa(datafa, 5, cor = "mixed")

fa1

## Factor Analysis using method =  minres
## Call: fa(r = datafa, nfactors = 5, cor = "mixed")
## Standardized loadings (pattern matrix) based upon correlation matrix
##              MR2   MR1   MR3   MR5   MR4   h2    u2 com
## pcown       0.03  0.04  0.00  0.79  0.02 0.67 0.333 1.0
## pcshared    0.68 -0.12 -0.12 -0.15  0.10 0.43 0.571 1.3
## studydesk   0.01  0.97 -0.01  0.01  0.00 0.96 0.044 1.0
## ownroom    -0.05  0.44  0.00  0.06  0.32 0.37 0.631 1.9
## internet    0.69  0.10  0.01  0.21  0.00 0.64 0.358 1.2
## mobphone    0.56  0.16  0.31  0.03 -0.14 0.61 0.392 1.9
## gamingsys  -0.06 -0.16  0.54  0.24  0.02 0.38 0.617 1.6
## musinst     0.18  0.12  0.29 -0.19 -0.01 0.16 0.842 2.9
## car         0.28 -0.05  0.20  0.05  0.38 0.34 0.655 2.5
## flat       -0.01  0.04  0.05  0.02  0.71 0.54 0.461 1.0
## dishwasher  0.01  0.07  0.69 -0.06  0.11 0.52 0.476 1.1
## 
##                        MR2  MR1  MR3  MR5  MR4
## SS loadings           1.46 1.32 1.13 0.86 0.86
## Proportion Var        0.13 0.12 0.10 0.08 0.08
## Cumulative Var        0.13 0.25 0.36 0.43 0.51
## Proportion Explained  0.26 0.23 0.20 0.15 0.15
## Cumulative Proportion 0.26 0.49 0.70 0.85 1.00
## 
##  With factor correlations of 
##      MR2  MR1  MR3  MR5  MR4
## MR2 1.00 0.29 0.30 0.16 0.07
## MR1 0.29 1.00 0.26 0.41 0.18
## MR3 0.30 0.26 1.00 0.34 0.25
## MR5 0.16 0.41 0.34 1.00 0.12
## MR4 0.07 0.18 0.25 0.12 1.00
## 
## Mean item complexity =  1.6
## Test of the hypothesis that 5 factors are sufficient.
## 
## The degrees of freedom for the null model are  55  and the objective function was  2.42 with Chi Square of  11105.62
## The degrees of freedom for the model are 10  and the objective function was  0.03 
## 
## The root mean square of the residuals (RMSR) is  0.01 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic number of observations is  4598 with the empirical chi square  77.23  with prob <  1.8e-12 
## The total number of observations was  4598  with Likelihood Chi Square =  153.09  with prob <  8.6e-28 
## 
## Tucker Lewis Index of factoring reliability =  0.929
## RMSEA index =  0.056  and the 90 % confidence intervals are  0.048 0.064
## BIC =  68.76
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    MR2  MR1  MR3  MR5  MR4
## Correlation of (regression) scores with factors   0.87 0.98 0.83 0.84 0.79
## Multiple R square of scores with factors          0.76 0.96 0.68 0.71 0.62
## Minimum correlation of possible factor scores     0.53 0.92 0.37 0.42 0.24

fa.diagram(fa1)

Description of the model fit:

First of all, looking at the factor loadings it can be seen that almost all variables belong to only one factor, this is also proved by relatively low complexity values. Communality value (h2) is not always higher than uniqueness(u2), which is not a goof sign.
Proportion explained: the explained variance should be evenly distributed among factors. In this model we can see that the first three factors have approximately the same value of explained proportion, while the other two have smaller values. This is not perfect, but looking on other fit indexes it is possible to say that the model actually has a good fit
Proportion variance: A factor should explain at least 10% of the variance. In this model it can be seen that MR5 and MR4 do not meet this criterion, but they are still rather close to it, as they both explain 8% of variance.
Cumulative Variance: looking at this parameter we can see that all in all our model explains 51% of variance
Also Chi Square of 11105.62 tells us that observed and expected data aren’t significantly different, which is good
Tucker Lewis Index of factoring reliability = 0.929, which is a very good measure of model fit (it should be >0.9)
RMSR index = 0.01 , which is also good, as it should be <0,05
However, factor MR5 include only one variable, which is rather meaningless. That is why, probably there is some logic in trying to reduce the number of factors

Building a factor model with 4 factors

fa2<-fa(datafa, 4, cor = "mixed")

fa2

## Factor Analysis using method =  minres
## Call: fa(r = datafa, nfactors = 4, cor = "mixed")
## Standardized loadings (pattern matrix) based upon correlation matrix
##              MR2   MR1   MR3   MR4   h2   u2 com
## pcown      -0.02  0.40  0.29 -0.04 0.29 0.71 1.8
## pcshared    0.70 -0.19 -0.21  0.12 0.46 0.54 1.4
## studydesk   0.06  0.86 -0.05  0.00 0.75 0.25 1.0
## ownroom    -0.08  0.54 -0.02  0.31 0.42 0.58 1.6
## internet    0.64  0.24  0.09 -0.04 0.59 0.41 1.3
## mobphone    0.62  0.18  0.25 -0.12 0.62 0.38 1.6
## gamingsys  -0.03 -0.09  0.72  0.04 0.49 0.51 1.0
## musinst     0.25  0.01  0.14  0.06 0.11 0.89 1.7
## car         0.28 -0.01  0.21  0.40 0.36 0.64 2.4
## flat       -0.04  0.09  0.07  0.66 0.48 0.52 1.1
## dishwasher  0.14  0.03  0.49  0.20 0.39 0.61 1.5
## 
##                        MR2  MR1  MR3  MR4
## SS loadings           1.53 1.44 1.14 0.84
## Proportion Var        0.14 0.13 0.10 0.08
## Cumulative Var        0.14 0.27 0.37 0.45
## Proportion Explained  0.31 0.29 0.23 0.17
## Cumulative Proportion 0.31 0.60 0.83 1.00
## 
##  With factor correlations of 
##      MR2  MR1  MR3  MR4
## MR2 1.00 0.28 0.22 0.09
## MR1 0.28 1.00 0.31 0.18
## MR3 0.22 0.31 1.00 0.17
## MR4 0.09 0.18 0.17 1.00
## 
## Mean item complexity =  1.5
## Test of the hypothesis that 4 factors are sufficient.
## 
## The degrees of freedom for the null model are  55  and the objective function was  2.42 with Chi Square of  11105.62
## The degrees of freedom for the model are 17  and the objective function was  0.13 
## 
## The root mean square of the residuals (RMSR) is  0.03 
## The df corrected root mean square of the residuals is  0.05 
## 
## The harmonic number of observations is  4598 with the empirical chi square  439.34  with prob <  1.1e-82 
## The total number of observations was  4598  with Likelihood Chi Square =  607.74  with prob <  3.3e-118 
## 
## Tucker Lewis Index of factoring reliability =  0.827
## RMSEA index =  0.087  and the 90 % confidence intervals are  0.081 0.093
## BIC =  464.38
## Fit based upon off diagonal values = 0.99
## Measures of factor score adequacy             
##                                                    MR2  MR1  MR3  MR4
## Correlation of (regression) scores with factors   0.87 0.90 0.82 0.77
## Multiple R square of scores with factors          0.76 0.81 0.67 0.59
## Minimum correlation of possible factor scores     0.52 0.61 0.34 0.18

fa.diagram(fa2)

In this case BIC equals to 464.38 which is much higher than in the model with 5 factors (BIC=68.76). Therefore, according to this estimate we can say that this model is worse. However, BIC estimates will show better results for models with bigger amount of factors. But a factor containing just one variable, like MR5 in the first model is not good and it is better to choose a model which is conceptually more understandable and has some logic behind it even though the fit of the model with less factors is a little bit worse:

Tucker Lewis Index of factoring reliability is smaller and equals to 0.827, which is not that good as in first case
RMSR index is bigger (0.03) , but it is still smaller than 0.05, which is a good sign
Proportion explained is again not completely evenly distributed but it is not much worse than in the first case
Proportion variance is almost everywhere bigger than 10%
Cumulative variance is 0.45, which slightly lower than in the model with 5 factors
However, this model is still not good enough. It can be seen that variable ‘musinst’ is left alone in both of the previous models and does not refer to any of the factors. Moreover, in both cases the uniqueness of this variable is very much higher than the communality. Additionally, factors MR3 and MR4 still do not make much sense. Thus, probably there is some logic behind building a model without ‘musinst’ variable and with only 3 factors

Building a factor model with 3 factors and without ‘musinst’ variable

library(dplyr)
datafa1 <- select(datafa, -musinst)
fa3<-fa(datafa1, 3, cor = "mixed")

fa3

## Factor Analysis using method =  minres
## Call: fa(r = datafa1, nfactors = 3, cor = "mixed")
## Standardized loadings (pattern matrix) based upon correlation matrix
##              MR2   MR1   MR3   h2   u2 com
## pcown       0.04  0.33  0.24 0.24 0.76 1.9
## pcshared    0.62 -0.20 -0.10 0.31 0.69 1.3
## studydesk   0.06  0.90 -0.05 0.82 0.18 1.0
## ownroom    -0.12  0.54  0.21 0.36 0.64 1.4
## internet    0.74  0.14  0.03 0.66 0.34 1.1
## mobphone    0.64  0.07  0.15 0.55 0.45 1.1
## gamingsys  -0.01 -0.08  0.59 0.31 0.69 1.0
## car         0.21 -0.01  0.43 0.29 0.71 1.4
## flat       -0.11  0.14  0.40 0.19 0.81 1.4
## dishwasher  0.08  0.00  0.61 0.41 0.59 1.0
## 
##                        MR2  MR1  MR3
## SS loadings           1.47 1.37 1.31
## Proportion Var        0.15 0.14 0.13
## Cumulative Var        0.15 0.28 0.42
## Proportion Explained  0.35 0.33 0.32
## Cumulative Proportion 0.35 0.68 1.00
## 
##  With factor correlations of 
##      MR2  MR1  MR3
## MR2 1.00 0.36 0.33
## MR1 0.36 1.00 0.34
## MR3 0.33 0.34 1.00
## 
## Mean item complexity =  1.3
## Test of the hypothesis that 3 factors are sufficient.
## 
## The degrees of freedom for the null model are  45  and the objective function was  2.29 with Chi Square of  10525.5
## The degrees of freedom for the model are 18  and the objective function was  0.28 
## 
## The root mean square of the residuals (RMSR) is  0.05 
## The df corrected root mean square of the residuals is  0.09 
## 
## The harmonic number of observations is  4598 with the empirical chi square  1220.74  with prob <  4e-248 
## The total number of observations was  4598  with Likelihood Chi Square =  1286.68  with prob <  2.9e-262 
## 
## Tucker Lewis Index of factoring reliability =  0.697
## RMSEA index =  0.124  and the 90 % confidence intervals are  0.118 0.13
## BIC =  1134.88
## Fit based upon off diagonal values = 0.96
## Measures of factor score adequacy             
##                                                    MR2  MR1  MR3
## Correlation of (regression) scores with factors   0.88 0.92 0.82
## Multiple R square of scores with factors          0.77 0.85 0.67
## Minimum correlation of possible factor scores     0.55 0.70 0.34

fa.diagram(fa3)

The fit of this model is a little bit worse:

BIC of the model is higher than in previous models (1134.88).
Tucker Lewis Index of factoring reliability = 0.697, which is smaller than in both other cases
The root mean square of the residuals (RMSR) is 0.05, which is higher than in other 2 models
However, proportion explained is almost evenly distributed and proportion variance is higher than 10% in all factors
Cumulative variance is 42%, which is not very much smaller than in the previous model
Additionally, it is important to mention that the argument ‘cor=mixed’ uses oblique rotation by default allowing the factors to be correlated between each other.
All in all, it is possible to conclude that model with 3 factors and without ‘musinst’ variable is conceptually more logical than others two that is why it is better to choose this model.

Factor names:

MR2: Variables which were assigned to this factor mostly concerned with the technologies necessary for organizing a comfortable study environment at home. Computer, internet connection and mobile phone have become important tools which help people search for necessary information and communicate with others. Therefore, the name for this factor can be Technologies necessary for studying
MR1: Variables which were assigned to this factor mostly concerned with personal space of a person which can be crucial in organizing study process at home. Therefore, the name for this factor can be Personal space of a person
MR3: Variables which were assigned to this factor mostly serve the purpose of indicating wealth of a person, as not any family can afford to have gaming system, dishwasher, flat or a car. Therefore, the name for this factor can be Wealth indicators

Scale reliability

library(psych)
MR1<- as.data.frame(datafa1[c("studydesk", "ownroom", "pcown")])
psych::alpha(MR1,check.keys = TRUE)

## 
## Reliability analysis   
## Call: psych::alpha(x = MR1, check.keys = TRUE)
## 
##   raw_alpha std.alpha G6(smc) average_r  S/N   ase mean   sd median_r
##       0.38      0.41    0.32      0.19 0.68 0.015  1.2 0.25     0.18
## 
##  lower alpha upper     95% confidence boundaries
## 0.35 0.38 0.41 
## 
##  Reliability if an item is dropped:
##           raw_alpha std.alpha G6(smc) average_r  S/N alpha se var.r med.r
## studydesk      0.23      0.24    0.14      0.14 0.32    0.022    NA  0.14
## ownroom        0.29      0.30    0.18      0.18 0.43    0.020    NA  0.18
## pcown          0.35      0.39    0.24      0.24 0.64    0.017    NA  0.24
## 
##  Item statistics 
##              n raw.r std.r r.cor r.drop mean   sd
## studydesk 4598  0.60  0.70  0.44   0.28  1.1 0.28
## ownroom   4598  0.78  0.68  0.39   0.24  1.3 0.47
## pcown     4598  0.62  0.65  0.31   0.19  1.2 0.36
## 
## Non missing response frequency for each item
##              1    2 miss
## studydesk 0.92 0.08    0
## ownroom   0.68 0.32    0
## pcown     0.85 0.15    0

Std. Cronbach’s alpha is 0.41, which is rather small and indicates not very good scale reliability.

MR2<- as.data.frame(datafa1[c("internet", "mobphone", "pcshared")])
psych::alpha(MR2,check.keys = TRUE)

## 
## Reliability analysis   
## Call: psych::alpha(x = MR2, check.keys = TRUE)
## 
##   raw_alpha std.alpha G6(smc) average_r  S/N   ase mean   sd median_r
##       0.26      0.35    0.27      0.15 0.54 0.016  1.1 0.16     0.14
## 
##  lower alpha upper     95% confidence boundaries
## 0.23 0.26 0.29 
## 
##  Reliability if an item is dropped:
##          raw_alpha std.alpha G6(smc) average_r  S/N alpha se var.r med.r
## internet      0.13      0.19    0.10      0.10 0.23    0.017    NA  0.10
## mobphone      0.20      0.24    0.14      0.14 0.32    0.019    NA  0.14
## pcshared      0.35      0.36    0.22      0.22 0.55    0.019    NA  0.22
## 
##  Item statistics 
##             n raw.r std.r r.cor r.drop mean   sd
## internet 4598  0.55  0.68  0.40   0.20  1.0 0.18
## mobphone 4598  0.45  0.67  0.36   0.18  1.0 0.14
## pcshared 4598  0.86  0.63  0.26   0.16  1.2 0.37
## 
## Non missing response frequency for each item
##             1    2 miss
## internet 0.97 0.03    0
## mobphone 0.98 0.02    0
## pcshared 0.84 0.16    0

Std. Cronbach’s alpha is 0.35, which is rather small and indicates not very good scale reliability.

MR3<- as.data.frame(datafa1[c("dishwasher", "gamingsys", "car", "flat")])
psych::alpha(MR3,check.keys = TRUE)

## 
## Reliability analysis   
## Call: psych::alpha(x = MR3, check.keys = TRUE)
## 
##   raw_alpha std.alpha G6(smc) average_r  S/N   ase mean   sd median_r
##       0.42      0.43    0.37      0.16 0.75 0.014  1.6 0.27     0.15
## 
##  lower alpha upper     95% confidence boundaries
## 0.4 0.42 0.45 
## 
##  Reliability if an item is dropped:
##            raw_alpha std.alpha G6(smc) average_r  S/N alpha se  var.r med.r
## dishwasher      0.33      0.33    0.25      0.14 0.49    0.017 0.0043  0.13
## gamingsys       0.38      0.38    0.29      0.17 0.61    0.016 0.0014  0.16
## car             0.33      0.34    0.27      0.15 0.52    0.017 0.0053  0.14
## flat            0.38      0.38    0.30      0.17 0.62    0.016 0.0025  0.16
## 
##  Item statistics 
##               n raw.r std.r r.cor r.drop mean   sd
## dishwasher 4598  0.59  0.63  0.41   0.26  1.8 0.40
## gamingsys  4598  0.58  0.59  0.34   0.21  1.7 0.44
## car        4598  0.63  0.62  0.39   0.26  1.3 0.47
## flat       4598  0.63  0.59  0.33   0.22  1.5 0.50
## 
## Non missing response frequency for each item
##               1    2 miss
## dishwasher 0.21 0.79    0
## gamingsys  0.26 0.74    0
## car        0.68 0.32    0
## flat       0.51 0.49    0

Std. Cronbach’s alpha is 0.43, which is rather small and indicates not very good scale reliability.

Adding factors to regression data set

fascores<-as.data.frame(fa3$scores)
datareg<-cbind(data_reg,fascores)
names(datareg)[names(datareg) == "MR1"] <- "personal"
names(datareg)[names(datareg) == "MR2"] <- "tech"
names(datareg)[names(datareg) == "MR3"] <- "wealth"

Adding one of the factors to regression model:

model3<-lm(matach ~ books+pcown+pcshared+age+gender+wealth, data = datareg) 
summary(model3)

## 
## Call:
## lm(formula = matach ~ books + pcown + pcshared + age + gender + 
##     wealth, data = datareg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -252.985  -54.562    2.304   54.382  263.086 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         786.447     40.246  19.541  < 2e-16 ***
## books11–25 books     10.136      4.958   2.044  0.04099 *  
## books26–100 books    25.413      4.861   5.228 1.79e-07 ***
## books101–200 books   41.122      5.370   7.658 2.29e-14 ***
## booksMore than 200   39.276      5.834   6.732 1.88e-11 ***
## pcownNo              16.814      3.445   4.881 1.09e-06 ***
## pcsharedNo          -21.417      3.106  -6.895 6.13e-12 ***
## age                 -18.655      2.712  -6.877 6.92e-12 ***
## genderBoy             9.742      2.305   4.226 2.42e-05 ***
## wealth               -3.300      1.256  -2.627  0.00863 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 77.18 on 4588 degrees of freedom
## Multiple R-squared:  0.05621,    Adjusted R-squared:  0.05436 
## F-statistic: 30.36 on 9 and 4588 DF,  p-value: < 2.2e-16

Adjusted R-squared (0.05436) is not much bigger than in the first model(0.05314). New variable “wealth” has statistically significant effect on the outcome varible as p-value relatively small (0.00863). And coefficient shows that with one unit increase in welath math achievement decreases by 3.300. In simple words, the results show that the higher wealth indicator of a person the lower his/her math achievement.

Conclusion

After conducting EFA 3 latent factors were found:

Technologies necessary for studying
Personal space of a person
Wealth indicators

Regression analysis showed that:

The more books a person has at home, the better his/her math performance is. There is some logic behind this finding as it is a common knowledge that reading enhances one’s mental abilities. Moreover, big amount of books indicate that the family a respondent is intelligent enough to encourage their children to study
People without their own device(pc or tablet) perform better in math than those who have one. Such finding can be explained by the fact that a person who has his/her own devices has more time and ability to play games, surf the net and spend time in the social media instead of studying. Therefore, such people do not spend enough time learning and practicing math as they have more pleasant things to do
If a respondent have a computer which he or she shares with other people than he or she has better math performance than those who don’t have one. In this case people have limited amount of time for computer or tablet usage as they have to share it with other people in the house. As result, they have more time to do homework and as a result better math performance.
Gender also appeared to be a significant predictor of math achievement with boys showing higher math achievement scores than girls. Such findings can be explained by the teacher’s stereotypes in Russia concerning the fact that girls are better in humanitarian subjects rather than in STEM ones like boys. This can certainly influence the educational approaches and as a result student’s performance
Finally, analysis showed that the higher wealth indicator of a person the lower his/her math achievement.

Exam task

Anna Gorobtsova, group 172

Introduction

Data preparation

Importing data

Making two separate datasets for factor analysis and regression analysis

Renaming vars for regression

Renaming vars for EFA

Descriptive statistics

Summary of dataset with variables for EFA

Data Frame Summary

data_fa

Gender distribution in the sample

T-est (math achievement scores by gender)

Number of books at home barplot

Mean math achievement scores by number of books + anova

T-test (Mean math achievement scores according to one’s possesion of own pc/tablet)

T-test(Mean math achievement scores by one’s possesion of shared pc/tablet)

Regression

Model 1

Model 2 (Adding interaction effect)

Comparison of models

Checking for multicollinearity:

Model diagnostics

Checking for heteroscedasticity again:

EFA

Correlation matrix

How many factors should be extracted?

Building a factor model with 5 factors

Building a factor model with 4 factors

Building a factor model with 3 factors and without ‘musinst’ variable

Scale reliability

Adding factors to regression data set

Adding one of the factors to regression model:

Conclusion