Group members

  • Anna Gorobtsova

  • Nadezda Bykova

  • Artem Kulikov

  • Anastasia Vlasenko

Aim of the study and research question

For the exploratory factor analysis (EFA) and the subsequent regression analysis, TIMSS 2015 data on Singapore is used. The variables selected for the analysis measure students’ attitudes towards mathematics. The first question we would like to answer is whether there are latent factors in the data; exploratory factor analysis will be conducted to answer it. Secondly, regression analysis will be conducted in order to see which variables can predict students’ math achievement.

Data preparation

Loading necessary packages

library(foreign)
library(psych)
library(dplyr)
library(polycor)
library(corrplot)
library(sjPlot)
library(summarytools)
library(lmtest)
library(car)
library(gridExtra)
library(ggplot2)
library(GPArotation)

Importing data

data1 <- read.spss("BSGSGPM6.sav", to.data.frame = TRUE, use.value.labels = TRUE)
data2<-data1[c("BSBM17A", "BSBM17B", "BSBM17C", "BSBM17D", "BSBM17E", "BSBM17F", "BSBM17G", "BSBM17H", "BSBM17I", "BSBM18A", "BSBM18B", "BSBM18C", "BSBM18D", "BSBM18E", "BSBM18F", "BSBM18G", "BSBM18H", "BSBM18I", "BSBM18J", "BSBM19A", "BSBM19B", "BSBM19C", "BSBM19D", "BSBM19E", "BSMMAT01", "BSBG01", "BSBG07A", "BSBG07B", "BSBG10A")]
data3<-na.omit(data2)
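
A quick sanity check on how many cases listwise deletion removed (a minimal sketch using base R):

nrow(data2) - nrow(data3)  # rows dropped by na.omit()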

Making two separate datasets for factor analysis and regression analysis

save1<-c("BSBM17A", "BSBM17B", "BSBM17C", "BSBM17D", "BSBM17E", "BSBM17F", "BSBM17G", "BSBM17H", "BSBM17I", "BSBM18A", "BSBM18B", "BSBM18C", "BSBM18D", "BSBM18E", "BSBM18F", "BSBM18G", "BSBM18H", "BSBM18I", "BSBM18J", "BSBM19A", "BSBM19B", "BSBM19C", "BSBM19D", "BSBM19E")
save2<-c("BSMMAT01", "BSBG01", "BSBG07A", "BSBG07B", "BSBG10A")
data_fa <- data3[save1] 
data_reg <- data3[save2] 

Descriptive statistics

Summary of dataset with variables for EFA

print(dfSummary(data_fa, graph.magnif = 0.75,style="grid", varnumbers=FALSE, valid.col=FALSE, na.col=FALSE), 
      max.tbl.height = 300, method = "render")

Data Frame Summary

data_fa

Dimensions: 5875 x 24
Duplicates: 194
All 24 items are 4-level factors: 1. Agree a lot, 2. Agree a little, 3. Disagree a little, 4. Disagree a lot. Frequencies (% of valid):

Variable   Agree a lot     Agree a little   Disagree a little   Disagree a lot
BSBM17A    2232 (38.0%)    2433 (41.4%)      766 (13.0%)         444 (7.6%)
BSBM17B     964 (16.4%)    1325 (22.6%)     1751 (29.8%)        1835 (31.2%)
BSBM17C     610 (10.4%)    1672 (28.5%)     2074 (35.3%)        1519 (25.9%)
BSBM17D    1947 (33.1%)    2718 (46.3%)      938 (16.0%)         272 (4.6%)
BSBM17E    2021 (34.4%)    2296 (39.1%)      980 (16.7%)         578 (9.8%)
BSBM17F     947 (16.1%)    2081 (35.4%)     2088 (35.5%)         759 (12.9%)
BSBM17G    1487 (25.3%)    2185 (37.2%)     1457 (24.8%)         746 (12.7%)
BSBM17H    1178 (20.1%)    2169 (36.9%)     1747 (29.7%)         781 (13.3%)
BSBM17I    1924 (32.8%)    1566 (26.7%)     1297 (22.1%)        1088 (18.5%)
BSBM18A    2413 (41.1%)    3018 (51.4%)      365 (6.2%)           79 (1.3%)
BSBM18B    2265 (38.6%)    2592 (44.1%)      792 (13.5%)         226 (3.9%)
BSBM18C    1551 (26.4%)    2819 (48.0%)     1190 (20.3%)         315 (5.4%)
BSBM18D    1186 (20.2%)    2600 (44.3%)     1647 (28.0%)         442 (7.5%)
BSBM18E    2332 (39.7%)    2608 (44.4%)      739 (12.6%)         196 (3.3%)
BSBM18F    2688 (45.8%)    2348 (40.0%)      652 (11.1%)         187 (3.2%)
BSBM18G    1702 (29.0%)    2957 (50.3%)      987 (16.8%)         229 (3.9%)
BSBM18H    1929 (32.8%)    2774 (47.2%)      917 (15.6%)         255 (4.3%)
BSBM18I    2376 (40.4%)    2725 (46.4%)      593 (10.1%)         181 (3.1%)
BSBM18J    2137 (36.4%)    2824 (48.1%)      693 (11.8%)         221 (3.8%)
BSBM19A    1452 (24.7%)    2311 (39.3%)     1305 (22.2%)         807 (13.7%)
BSBM19B     749 (12.8%)    1659 (28.2%)     2380 (40.5%)        1087 (18.5%)
BSBM19C    1366 (23.2%)    1486 (25.3%)     1664 (28.3%)        1359 (23.1%)
BSBM19D    1222 (20.8%)    2403 (40.9%)     1664 (28.3%)         586 (10.0%)
BSBM19E    1115 (19.0%)    1968 (33.5%)     1823 (31.0%)         969 (16.5%)


For the factor analysis we have 24 variables: Likert-scale items about attitudes towards math and math lessons among students from Singapore. Each item measures the extent to which a student agrees with a certain statement concerning math lessons, so the variables have 4 levels ranging from ‘Agree a lot’ to ‘Disagree a lot’. The table above shows the answer frequencies.

Gender distribution in the sample

ggplot(data_reg, aes(x = BSBG01)) +
  geom_bar(col = "white", fill = "cornflowerblue") +
  ggtitle("Gender distribution barplot")+
  xlab("boy/girl")+
  ylab("Number of observations")+
  theme_minimal()

The numbers of boys and girls in the data are almost equal, so the sample is well balanced with respect to gender (though balance alone does not guarantee representativeness).
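
As a quick numeric check of this balance, the shares can be tabulated directly (a minimal sketch):

prop.table(table(data_reg$BSBG01))  # proportion of girls and boys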

Math achievement scores by gender

data_reg$BSMMAT01<-as.numeric(as.character(data_reg$BSMMAT01))
data_reg %>% 
  group_by(BSBG01) %>% 
  summarize(mean = mean(BSMMAT01))
## # A tibble: 2 x 2
##   BSBG01  mean
##   <fct>  <dbl>
## 1 Girl    622.
## 2 Boy     612.

The mean math achievement score for girls is 621.55, slightly higher than for boys, whose mean score is 611.64. It is therefore possible to hypothesise that the regression analysis will show gender to be a predictor of math achievement scores, with girls performing better than boys.
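
Before the regression, this gap can be checked informally with a Welch two-sample t-test (a sketch only: it uses a single plausible value as the outcome and ignores the TIMSS sampling design, so the result is merely indicative):

t.test(BSMMAT01 ~ BSBG01, data = data_reg)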

Distribution of parent’s education

m <- ggplot(data_reg, aes(x = BSBG07A)) +
  geom_bar(col = "white", fill = "pink") +
  ggtitle("Mother's education barplot")+
  xlab("Level of mother's education")+
  ylab("Number of observations")+
  coord_flip()+
  theme_minimal()

f <- ggplot(data_reg, aes(x = BSBG07B)) +
  geom_bar(col = "white", fill = "cornflowerblue") +
  ggtitle("Father's education barplot")+
  xlab("Level of father's education")+
  ylab("Number of observations")+
  coord_flip()+
  theme_minimal()

grid.arrange(m,f, nrow=2)

There are quite a lot of ‘Don’t know’ responses, which is not very convenient. Nevertheless, in both cases most parents have upper secondary education or a bachelor’s degree. It is also interesting that more fathers than mothers have a postgraduate degree, while relatively more mothers have at most some primary or lower secondary education or did not go to school at all. For all the other levels of education the distributions are roughly the same for mothers and fathers.

Math achievement scores by mother’s education

data_reg %>% 
  group_by(BSBG07A) %>% 
  summarize(mean = mean(BSMMAT01))
## # A tibble: 8 x 2
##   BSBG07A                                                  mean
##   <fct>                                                   <dbl>
## 1 Some Primary or Lower secondary or did not go to school  580.
## 2 Lower secondary                                          591.
## 3 Upper secondary                                          613.
## 4 Post-secondary, non-tertiary                             619.
## 5 Short-cycle tertiary                                     633.
## 6 Bachelor’s or equivalent                                 659.
## 7 Postgraduate degree                                      666.
## 8 Don’t know                                               598.

The table above shows that the mean achievement scores increase as the level of mother’s education increases. As with gender, we can hypothesise that the level of mother’s education predicts a student’s math achievement, with higher levels of mother’s education associated with higher scores.
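
One informal way to quantify this monotone trend is a Spearman rank correlation between education level and achievement, excluding the non-ordinal ‘Don’t know’ category (a sketch; the level label is copied from the table above):

edu_known <- subset(data_reg, BSBG07A != "Don’t know")
cor.test(as.numeric(edu_known$BSBG07A), edu_known$BSMMAT01, method = "spearman")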

Math achievement scores by father’s education

data_reg %>% 
  group_by(BSBG07B) %>% 
  summarize(mean = mean(BSMMAT01))
## # A tibble: 8 x 2
##   BSBG07B                                                  mean
##   <fct>                                                   <dbl>
## 1 Some Primary or Lower secondary or did not go to school  579.
## 2 Lower secondary                                          593.
## 3 Upper secondary                                          609.
## 4 Post-secondary, non-tertiary                             606.
## 5 Short-cycle tertiary                                     630.
## 6 Bachelor’s or equivalent                                 658.
## 7 Postgraduate degree                                      665.
## 8 Don’t know                                               600.

The same pattern can be observed for father’s education. We can therefore also hypothesise that the level of father’s education is a predictor of math achievement, with higher levels of father’s education associated with higher math achievement scores.

Distribution by country of birth

ggplot(data_reg, aes(x = BSBG10A)) +
  geom_bar(col = "white", fill = "cornflowerblue") +
  ggtitle("Were you born in Singapore")+
  xlab("yes/no")+
  ylab("Number of observations")+
  theme_minimal()

The bar plot shows that the majority of respondents were born in Singapore; the group of those who were not is almost 5 times smaller. It is therefore questionable whether including this variable in the regression analysis would yield generalizable results, as the non-native group is underrepresented in this sample.

Math achievement scores by country of birth

data_reg %>% 
  group_by(BSBG10A) %>% 
  summarize(mean = mean(BSMMAT01))
## # A tibble: 2 x 2
##   BSBG10A  mean
##   <fct>   <dbl>
## 1 Yes      614.
## 2 No       631.

The mean score for students born outside Singapore is higher. However, we must keep in mind that this group is much smaller, which may have influenced the result.
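
The group sizes behind these means can be verified directly (sketch):

table(data_reg$BSBG10A)  # counts of students born / not born in Singapore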

EFA

Correlation matrix:

Before starting the actual factor analysis we need to look at the correlation matrix and see whether we do have some potential factors:

# hetcor() from the polycor package computes heterogeneous correlations
# (polychoric here, since all items are ordered factors)
dat.cor <- hetcor(data_fa)
dat.cor <- dat.cor$correlations
dat.cor
##            BSBM17A    BSBM17B    BSBM17C    BSBM17D    BSBM17E    BSBM17F
## BSBM17A  1.0000000 -0.7193920 -0.7288701  0.7355962  0.9111533  0.7247957
## BSBM17B -0.7193920  1.0000000  0.7543606 -0.5770894 -0.7337688 -0.6041232
## BSBM17C -0.7288701  0.7543606  1.0000000 -0.5989825 -0.7423630 -0.6128955
## BSBM17D  0.7355962 -0.5770894 -0.5989825  1.0000000  0.7402890  0.6496178
## BSBM17E  0.9111533 -0.7337688 -0.7423630  0.7402890  1.0000000  0.7576284
## BSBM17F  0.7247957 -0.6041232 -0.6128955  0.6496178  0.7576284  1.0000000
## BSBM17G  0.8157328 -0.6925543 -0.6838195  0.6717706  0.8482756  0.7881664
## BSBM17H  0.7707337 -0.6338387 -0.6630358  0.6922386  0.7863611  0.7173778
## BSBM17I  0.8771027 -0.7351604 -0.7198372  0.6815484  0.9046162  0.7419462
## BSBM18A  0.4365172 -0.3361437 -0.3483428  0.4440063  0.4220274  0.3824189
## BSBM18B  0.4239973 -0.3355252 -0.3407935  0.4049221  0.4128685  0.3541809
## BSBM18C  0.5570614 -0.4150562 -0.4632566  0.5509088  0.5336086  0.4726867
## BSBM18D  0.4954503 -0.3723860 -0.4110799  0.5337899  0.4682367  0.4417382
## BSBM18E  0.4109797 -0.3163164 -0.3273311  0.4056564  0.3953043  0.3436193
## BSBM18F  0.4153566 -0.3241408 -0.3476863  0.4314988  0.4049488  0.3287098
## BSBM18G  0.3701707 -0.2727177 -0.2972288  0.4104661  0.3596911  0.3385384
## BSBM18H  0.3450139 -0.2410395 -0.2769005  0.4173687  0.3393638  0.3221327
## BSBM18I  0.3482216 -0.2553291 -0.2809522  0.4138137  0.3366036  0.3053723
## BSBM18J  0.3411972 -0.2596593 -0.3045194  0.3714719  0.3247526  0.3060154
## BSBM19A  0.6897207 -0.5912628 -0.5229339  0.5005319  0.6939807  0.5508194
## BSBM19B -0.4832689  0.4906922  0.4422061 -0.3166211 -0.4799602 -0.3596238
## BSBM19C -0.6473779  0.6173160  0.5552635 -0.4500244 -0.6630397 -0.5156707
## BSBM19D  0.6748903 -0.5596667 -0.5335571  0.5080798  0.6680382  0.5677637
## BSBM19E -0.4147306  0.4672779  0.4091927 -0.2669650 -0.4391049 -0.3525716
##            BSBM17G    BSBM17H    BSBM17I    BSBM18A    BSBM18B    BSBM18C
## BSBM17A  0.8157328  0.7707337  0.8771027  0.4365172  0.4239973  0.5570614
## BSBM17B -0.6925543 -0.6338387 -0.7351604 -0.3361437 -0.3355252 -0.4150562
## BSBM17C -0.6838195 -0.6630358 -0.7198372 -0.3483428 -0.3407935 -0.4632566
## BSBM17D  0.6717706  0.6922386  0.6815484  0.4440063  0.4049221  0.5509088
## BSBM17E  0.8482756  0.7863611  0.9046162  0.4220274  0.4128685  0.5336086
## BSBM17F  0.7881664  0.7173778  0.7419462  0.3824189  0.3541809  0.4726867
## BSBM17G  1.0000000  0.7508902  0.8288109  0.4140434  0.3830032  0.4787517
## BSBM17H  0.7508902  1.0000000  0.7809962  0.4554865  0.4913806  0.6331046
## BSBM17I  0.8288109  0.7809962  1.0000000  0.3873663  0.4030038  0.5025247
## BSBM18A  0.4140434  0.4554865  0.3873663  1.0000000  0.6711455  0.6068716
## BSBM18B  0.3830032  0.4913806  0.4030038  0.6711455  1.0000000  0.7497485
## BSBM18C  0.4787517  0.6331046  0.5025247  0.6068716  0.7497485  1.0000000
## BSBM18D  0.4316913  0.5783473  0.4369200  0.5444553  0.6551203  0.7975683
## BSBM18E  0.3721247  0.4762511  0.3775339  0.5931073  0.7449749  0.6764502
## BSBM18F  0.3701431  0.4827557  0.3873250  0.5919589  0.7733271  0.6763973
## BSBM18G  0.3462319  0.4351105  0.3313616  0.5536995  0.5961035  0.6017152
## BSBM18H  0.3059449  0.4389687  0.3094893  0.4917850  0.5681181  0.5917624
## BSBM18I  0.3043495  0.4303230  0.3020523  0.6022166  0.6343243  0.6113786
## BSBM18J  0.3267990  0.4258687  0.3127983  0.5661035  0.6391627  0.6045479
## BSBM19A  0.6380101  0.5411217  0.7480869  0.3429004  0.3344971  0.3400501
## BSBM19B -0.4721218 -0.3479669 -0.5441074 -0.2049625 -0.2095191 -0.1819877
## BSBM19C -0.6232494 -0.5100653 -0.7380489 -0.2641201 -0.2751570 -0.2966087
## BSBM19D  0.6633560  0.5509171  0.7069892  0.3601472  0.3693986  0.3635825
## BSBM19E -0.4308426 -0.3459162 -0.4794271 -0.1647835 -0.1694237 -0.1725734
##            BSBM18D    BSBM18E    BSBM18F     BSBM18G     BSBM18H     BSBM18I
## BSBM17A  0.4954503  0.4109797  0.4153566  0.37017075  0.34501392  0.34822165
## BSBM17B -0.3723860 -0.3163164 -0.3241408 -0.27271765 -0.24103952 -0.25532907
## BSBM17C -0.4110799 -0.3273311 -0.3476863 -0.29722876 -0.27690045 -0.28095216
## BSBM17D  0.5337899  0.4056564  0.4314988  0.41046607  0.41736870  0.41381368
## BSBM17E  0.4682367  0.3953043  0.4049488  0.35969108  0.33936381  0.33660364
## BSBM17F  0.4417382  0.3436193  0.3287098  0.33853839  0.32213272  0.30537228
## BSBM17G  0.4316913  0.3721247  0.3701431  0.34623192  0.30594492  0.30434946
## BSBM17H  0.5783473  0.4762511  0.4827557  0.43511053  0.43896873  0.43032300
## BSBM17I  0.4369200  0.3775339  0.3873250  0.33136158  0.30948934  0.30205227
## BSBM18A  0.5444553  0.5931073  0.5919589  0.55369948  0.49178501  0.60221664
## BSBM18B  0.6551203  0.7449749  0.7733271  0.59610347  0.56811808  0.63432429
## BSBM18C  0.7975683  0.6764502  0.6763973  0.60171519  0.59176237  0.61137857
## BSBM18D  1.0000000  0.6591867  0.6307510  0.61235983  0.66709662  0.60363191
## BSBM18E  0.6591867  1.0000000  0.8344016  0.62571618  0.59227405  0.66727259
## BSBM18F  0.6307510  0.8344016  1.0000000  0.66018751  0.63356709  0.69893239
## BSBM18G  0.6123598  0.6257162  0.6601875  1.00000000  0.63725315  0.65738366
## BSBM18H  0.6670966  0.5922741  0.6335671  0.63725315  1.00000000  0.70150717
## BSBM18I  0.6036319  0.6672726  0.6989324  0.65738366  0.70150717  1.00000000
## BSBM18J  0.5777493  0.6665217  0.6571562  0.61485954  0.60417791  0.71258447
## BSBM19A  0.2962577  0.2861207  0.2886201  0.24803162  0.19295750  0.21627977
## BSBM19B -0.1523205 -0.1607949 -0.1621349 -0.12951734 -0.08171768 -0.08257291
## BSBM19C -0.2525691 -0.2332858 -0.2288745 -0.18561717 -0.14136484 -0.15154191
## BSBM19D  0.3281293  0.3171978  0.3048265  0.26355465  0.22709306  0.23710884
## BSBM19E -0.1553963 -0.1367925 -0.1257406 -0.09158992 -0.07198792 -0.07916595
##            BSBM18J    BSBM19A     BSBM19B    BSBM19C    BSBM19D     BSBM19E
## BSBM17A  0.3411972  0.6897207 -0.48326887 -0.6473779  0.6748903 -0.41473060
## BSBM17B -0.2596593 -0.5912628  0.49069224  0.6173160 -0.5596667  0.46727795
## BSBM17C -0.3045194 -0.5229339  0.44220608  0.5552635 -0.5335571  0.40919272
## BSBM17D  0.3714719  0.5005319 -0.31662112 -0.4500244  0.5080798 -0.26696505
## BSBM17E  0.3247526  0.6939807 -0.47996019 -0.6630397  0.6680382 -0.43910486
## BSBM17F  0.3060154  0.5508194 -0.35962382 -0.5156707  0.5677637 -0.35257160
## BSBM17G  0.3267990  0.6380101 -0.47212179 -0.6232494  0.6633560 -0.43084263
## BSBM17H  0.4258687  0.5411217 -0.34796694 -0.5100653  0.5509171 -0.34591620
## BSBM17I  0.3127983  0.7480869 -0.54410745 -0.7380489  0.7069892 -0.47942706
## BSBM18A  0.5661035  0.3429004 -0.20496246 -0.2641201  0.3601472 -0.16478346
## BSBM18B  0.6391627  0.3344971 -0.20951910 -0.2751570  0.3693986 -0.16942371
## BSBM18C  0.6045479  0.3400501 -0.18198775 -0.2966087  0.3635825 -0.17257339
## BSBM18D  0.5777493  0.2962577 -0.15232050 -0.2525691  0.3281293 -0.15539630
## BSBM18E  0.6665217  0.2861207 -0.16079494 -0.2332858  0.3171978 -0.13679254
## BSBM18F  0.6571562  0.2886201 -0.16213485 -0.2288745  0.3048265 -0.12574063
## BSBM18G  0.6148595  0.2480316 -0.12951734 -0.1856172  0.2635546 -0.09158992
## BSBM18H  0.6041779  0.1929575 -0.08171768 -0.1413648  0.2270931 -0.07198792
## BSBM18I  0.7125845  0.2162798 -0.08257291 -0.1515419  0.2371088 -0.07916595
## BSBM18J  1.0000000  0.2347068 -0.12377651 -0.1633140  0.2471607 -0.11813449
## BSBM19A  0.2347068  1.0000000 -0.62863345 -0.7683127  0.7386583 -0.50530068
## BSBM19B -0.1237765 -0.6286334  1.00000000  0.7645472 -0.5683495  0.53180002
## BSBM19C -0.1633140 -0.7683127  0.76454719  1.0000000 -0.6558558  0.57176167
## BSBM19D  0.2471607  0.7386583 -0.56834948 -0.6558558  1.0000000 -0.44979620
## BSBM19E -0.1181345 -0.5053007  0.53180002  0.5717617 -0.4497962  1.00000000
corrplot(dat.cor, method = "circle")

From the correlation plot above it can be clearly seen that there are several groups of variables with rather high mutual correlations, which can therefore be our potential factors.

Turning variables into numeric form:

datafa<-as.data.frame(lapply(data_fa, as.numeric))

How many factors should be extracted?

fa.parallel(datafa)

## Parallel analysis suggests that the number of factors =  4  and the number of components =  3

Interpretation of the Parallel Analysis scree plot:

  • From the plot it can be seen that 4 factors in the “Factor Analysis” series lie above the corresponding simulated-data line, while 3 components in the “Principal Components” series lie above theirs.

  • Therefore, parallel analysis suggests 4 factors and 3 components; since we are fitting a factor model, we proceed with 4 factors. A robustness check on the polychoric correlations is sketched below.
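
Since the factor model below is estimated on polychoric (“mixed”) correlations, a reasonable robustness check is to rerun the parallel analysis on the polychoric correlation matrix as well (a sketch; this can be slow with 5875 observations):

fa.parallel(datafa, cor = "poly")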

Building a factor model with 4 factors

fa1<-fa(datafa, 4, cor = "mixed")
fa1
## Factor Analysis using method =  minres
## Call: fa(r = datafa, nfactors = 4, cor = "mixed")
## Standardized loadings (pattern matrix) based upon correlation matrix
##           MR1   MR2   MR3   MR4   h2    u2 com
## BSBM17A  0.87 -0.03 -0.07  0.05 0.86 0.138 1.0
## BSBM17B -0.65  0.05  0.21 -0.04 0.63 0.367 1.2
## BSBM17C -0.73  0.02  0.08 -0.03 0.62 0.380 1.0
## BSBM17D  0.79  0.19  0.10 -0.05 0.65 0.354 1.2
## BSBM17E  0.92 -0.03 -0.06  0.01 0.90 0.098 1.0
## BSBM17F  0.85  0.05  0.04 -0.07 0.67 0.329 1.0
## BSBM17G  0.84  0.01 -0.09 -0.02 0.80 0.205 1.0
## BSBM17H  0.84  0.05  0.10  0.12 0.76 0.241 1.1
## BSBM17I  0.79 -0.04 -0.22  0.03 0.88 0.116 1.2
## BSBM18A  0.10  0.26 -0.08  0.42 0.53 0.475 1.9
## BSBM18B -0.04 -0.04 -0.07  0.94 0.83 0.168 1.0
## BSBM18C  0.37  0.07  0.17  0.62 0.75 0.250 1.8
## BSBM18D  0.32  0.31  0.14  0.37 0.66 0.341 3.2
## BSBM18E -0.02  0.24 -0.04  0.66 0.74 0.259 1.3
## BSBM18F -0.02  0.30 -0.05  0.61 0.76 0.239 1.5
## BSBM18G  0.05  0.65 -0.03  0.13 0.61 0.387 1.1
## BSBM18H  0.08  0.81  0.03 -0.04 0.66 0.342 1.0
## BSBM18I -0.03  0.86 -0.06  0.02 0.76 0.241 1.0
## BSBM18J -0.03  0.60 -0.07  0.23 0.63 0.373 1.3
## BSBM19A  0.27  0.02 -0.63  0.06 0.72 0.279 1.4
## BSBM19B  0.08 -0.05  0.86 -0.01 0.68 0.324 1.0
## BSBM19C -0.15 -0.02  0.80 -0.01 0.83 0.170 1.1
## BSBM19D  0.35 -0.01 -0.48  0.10 0.62 0.383 1.9
## BSBM19E -0.10  0.01  0.56  0.00 0.40 0.604 1.1
## 
##                        MR1  MR2  MR3  MR4
## SS loadings           7.24 3.31 3.00 3.39
## Proportion Var        0.30 0.14 0.13 0.14
## Cumulative Var        0.30 0.44 0.56 0.71
## Proportion Explained  0.43 0.20 0.18 0.20
## Cumulative Proportion 0.43 0.62 0.80 1.00
## 
##  With factor correlations of 
##       MR1   MR2   MR3   MR4
## MR1  1.00  0.42 -0.62  0.48
## MR2  0.42  1.00 -0.06  0.80
## MR3 -0.62 -0.06  1.00 -0.20
## MR4  0.48  0.80 -0.20  1.00
## 
## Mean item complexity =  1.3
## Test of the hypothesis that 4 factors are sufficient.
## 
## The degrees of freedom for the null model are  276  and the objective function was  23.57 with Chi Square of  138249
## The degrees of freedom for the model are 186  and the objective function was  1.26 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.02 
## 
## The harmonic number of observations is  5875 with the empirical chi square  1291.37  with prob <  1.2e-164 
## The total number of observations was  5875  with Likelihood Chi Square =  7366.33  with prob <  0 
## 
## Tucker Lewis Index of factoring reliability =  0.923
## RMSEA index =  0.081  and the 90 % confidence intervals are  0.079 0.083
## BIC =  5752.13
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    MR1  MR2  MR3  MR4
## Correlation of (regression) scores with factors   0.99 0.95 0.95 0.96
## Multiple R square of scores with factors          0.97 0.91 0.91 0.93
## Minimum correlation of possible factor scores     0.94 0.82 0.82 0.86
fa.diagram(fa1)

Description of the model fit:

  • First of all, looking at the factor loadings we can see that almost all variables load on only one factor, which is also reflected in the low complexity values. Only ‘BSBM18D’ has a high complexity value, and its uniqueness is also higher than for the other items. However, its shared variance (communality) is still higher than its unique variance, which is a good sign.

  • Proportion explained: ideally the explained variance is distributed evenly among the factors. In this model factor MR1 explains the largest share of the variance, 43%, while the other factors explain roughly equal shares. This is not perfect, but judging by the other fit indices the model actually fits well.

  • Proportion variance: A factor should explain at least 10% of the variance. In this model it can be seen that all the factors meet this criterion.

  • Cumulative Variance: looking at this parameter we can see that in total the model explains 71% of the variance.

  • The model’s Likelihood Chi Square (7366.33, p < .001) is significant, meaning the observed and model-implied correlations do differ statistically; with n = 5875, however, the chi-square test is extremely sensitive, so we rely on the descriptive fit indices below.

  • Tucker Lewis Index of factoring reliability = 0.923, which is a very good measure of model fit (it should be >0.9)

  • RMSR = 0.02, which is also good, as it should be < 0.05

Therefore, judging by these values, the model with 4 factors fits well. Note also that fa() applies an oblique rotation (oblimin) by default, which allows the factors to correlate, while the argument cor = "mixed" bases the analysis on mixed correlations (polychoric for these ordinal items). Some of the 4 factors indeed correlate rather highly, and the inter-factor correlations can be inspected directly, as shown below.
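
The inter-factor correlations reported above can also be extracted directly from the fitted object (sketch):

round(fa1$Phi, 2)  # factor correlation matrix of the oblique solution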

Factor names:

  • MR1: The variables assigned to this factor mostly measure the extent to which a student likes doing mathematics. The factor includes Likert-scale items such as “I like mathematics”, “Mathematics is boring” and “I look forward to mathematics class”. Therefore, this factor can be named Student’s attitude towards mathematics

  • MR2: The variables assigned to this factor mostly measure teacher support: the extent to which the teacher cares about students understanding the subject and helps them make sense of their mistakes and of tasks they don’t understand. The factor includes items such as “My teacher tells me how to do better when I make a mistake” and “My teacher listens to what I have to say”. Therefore, this factor can be named Level of teacher’s support

  • MR3: The variables assigned to this factor mostly measure how a student assesses his or her own understanding of mathematics. The factor includes items such as “I usually do well in mathematics”, “I learn things quickly in mathematics” and “Mathematics is not one of my strengths”. Therefore, this factor can be named Self-perceived mathematical abilities

  • MR4: The variables assigned to this factor mostly measure the extent to which students understand the materials and requirements presented to them by the teacher. The factor includes items such as “My teacher is good at explaining mathematics” and “My teacher is easy to understand”. Therefore, this factor can be named Level of clarity of teacher’s requirements and materials

Scale reliability

MR1<- as.data.frame(datafa[c("BSBM17A", "BSBM17B", "BSBM17C", "BSBM17D", "BSBM17E", "BSBM17F", "BSBM17G", "BSBM17H", "BSBM17I")])
psych::alpha(MR1,check.keys = TRUE)
## 
## Reliability analysis   
## Call: psych::alpha(x = MR1, check.keys = TRUE)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
##       0.94      0.94    0.94      0.65  16 0.0011  2.2 0.79     0.64
## 
##  lower alpha upper     95% confidence boundaries
## 0.94 0.94 0.94 
## 
##  Reliability if an item is dropped:
##          raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
## BSBM17A       0.93      0.93    0.93      0.63  14   0.0013 0.0060  0.64
## BSBM17B-      0.94      0.94    0.94      0.66  16   0.0012 0.0063  0.65
## BSBM17C-      0.94      0.94    0.94      0.66  15   0.0012 0.0069  0.65
## BSBM17D       0.94      0.94    0.94      0.67  16   0.0012 0.0058  0.66
## BSBM17E       0.93      0.93    0.93      0.63  13   0.0014 0.0049  0.63
## BSBM17F       0.94      0.94    0.94      0.66  15   0.0012 0.0069  0.65
## BSBM17G       0.93      0.93    0.93      0.64  14   0.0013 0.0068  0.64
## BSBM17H       0.94      0.94    0.94      0.65  15   0.0012 0.0076  0.64
## BSBM17I       0.93      0.93    0.93      0.63  14   0.0013 0.0058  0.64
## 
##  Item statistics 
##             n raw.r std.r r.cor r.drop mean   sd
## BSBM17A  5875  0.87  0.88  0.87   0.84  1.9 0.90
## BSBM17B- 5875  0.78  0.77  0.74   0.71  2.2 1.07
## BSBM17C- 5875  0.79  0.78  0.75   0.73  2.2 0.95
## BSBM17D  5875  0.74  0.75  0.71   0.69  1.9 0.82
## BSBM17E  5875  0.90  0.90  0.90   0.87  2.0 0.95
## BSBM17F  5875  0.79  0.79  0.76   0.73  2.5 0.91
## BSBM17G  5875  0.86  0.86  0.85   0.82  2.2 0.97
## BSBM17H  5875  0.82  0.82  0.79   0.77  2.4 0.95
## BSBM17I  5875  0.89  0.88  0.87   0.85  2.3 1.10
## 
## Non missing response frequency for each item
##            1    2    3    4 miss
## BSBM17A 0.38 0.41 0.13 0.08    0
## BSBM17B 0.16 0.23 0.30 0.31    0
## BSBM17C 0.10 0.28 0.35 0.26    0
## BSBM17D 0.33 0.46 0.16 0.05    0
## BSBM17E 0.34 0.39 0.17 0.10    0
## BSBM17F 0.16 0.35 0.36 0.13    0
## BSBM17G 0.25 0.37 0.25 0.13    0
## BSBM17H 0.20 0.37 0.30 0.13    0
## BSBM17I 0.33 0.27 0.22 0.19    0

Cronbach’s alpha is 0.94, which indicates very good scale reliability: the items are highly internally consistent, i.e. they appear to measure the same underlying construct.

MR2<- as.data.frame(datafa[c("BSBM18I", "BSBM18H", "BSBM18G", "BSBM18J")])
psych::alpha(MR2,check.keys = TRUE)
## 
## Reliability analysis   
## Call: psych::alpha(x = MR2, check.keys = TRUE)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
##       0.84      0.84     0.8      0.56 5.1 0.0035  1.9 0.64     0.55
## 
##  lower alpha upper     95% confidence boundaries
## 0.83 0.84 0.84 
## 
##  Reliability if an item is dropped:
##         raw_alpha std.alpha G6(smc) average_r S/N alpha se   var.r med.r
## BSBM18I      0.77      0.77    0.69      0.53 3.4   0.0052 0.00036  0.52
## BSBM18H      0.80      0.80    0.73      0.57 3.9   0.0046 0.00190  0.56
## BSBM18G      0.80      0.80    0.73      0.58 4.1   0.0045 0.00308  0.60
## BSBM18J      0.80      0.80    0.73      0.57 4.0   0.0045 0.00082  0.56
## 
##  Item statistics 
##            n raw.r std.r r.cor r.drop mean   sd
## BSBM18I 5875  0.84  0.85  0.78   0.71  1.8 0.76
## BSBM18H 5875  0.82  0.81  0.72   0.66  1.9 0.81
## BSBM18G 5875  0.80  0.80  0.70   0.64  2.0 0.78
## BSBM18J 5875  0.81  0.81  0.71   0.65  1.8 0.78
## 
## Non missing response frequency for each item
##            1    2    3    4 miss
## BSBM18I 0.40 0.46 0.10 0.03    0
## BSBM18H 0.33 0.47 0.16 0.04    0
## BSBM18G 0.29 0.50 0.17 0.04    0
## BSBM18J 0.36 0.48 0.12 0.04    0

Cronbach’s alpha is 0.84, which indicates good scale reliability and internal consistency of the items.

MR3<- as.data.frame(datafa[c("BSBM19A", "BSBM19B", "BSBM19C", "BSBM19D", "BSBM19E")])
psych::alpha(MR3,check.keys = TRUE)
## 
## Reliability analysis   
## Call: psych::alpha(x = MR3, check.keys = TRUE)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
##       0.86      0.86    0.84      0.54   6 0.0029  2.6 0.78     0.52
## 
##  lower alpha upper     95% confidence boundaries
## 0.85 0.86 0.86 
## 
##  Reliability if an item is dropped:
##          raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
## BSBM19A-      0.81      0.81    0.78      0.52 4.3   0.0040 0.0102  0.50
## BSBM19B       0.82      0.82    0.80      0.54 4.7   0.0037 0.0145  0.54
## BSBM19C       0.80      0.80    0.76      0.50 4.0   0.0043 0.0092  0.48
## BSBM19D-      0.83      0.83    0.80      0.55 5.0   0.0035 0.0117  0.52
## BSBM19E       0.86      0.86    0.84      0.61 6.2   0.0029 0.0061  0.62
## 
##  Item statistics 
##             n raw.r std.r r.cor r.drop mean   sd
## BSBM19A- 5875  0.84  0.84  0.80   0.73  2.8 0.98
## BSBM19B  5875  0.80  0.80  0.74   0.68  2.6 0.92
## BSBM19C  5875  0.87  0.86  0.84   0.77  2.5 1.09
## BSBM19D- 5875  0.77  0.78  0.71   0.65  2.7 0.90
## BSBM19E  5875  0.70  0.70  0.57   0.53  2.5 0.98
## 
## Non missing response frequency for each item
##            1    2    3    4 miss
## BSBM19A 0.25 0.39 0.22 0.14    0
## BSBM19B 0.13 0.28 0.41 0.19    0
## BSBM19C 0.23 0.25 0.28 0.23    0
## BSBM19D 0.21 0.41 0.28 0.10    0
## BSBM19E 0.19 0.33 0.31 0.16    0

Cronbach’s alpha is 0.86, which indicates good scale reliability and internal consistency of the items.

MR4<- as.data.frame(datafa[c("BSBM18A", "BSBM18B", "BSBM18C", "BSBM18D", "BSBM18E", "BSBM18F")])
psych::alpha(MR4,check.keys = TRUE)
## 
## Reliability analysis   
## Call: psych::alpha(x = MR4, check.keys = TRUE)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
##       0.89      0.89    0.89      0.58 8.3 0.0021  1.9 0.63     0.56
## 
##  lower alpha upper     95% confidence boundaries
## 0.89 0.89 0.9 
## 
##  Reliability if an item is dropped:
##         raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
## BSBM18A      0.89      0.89    0.88      0.62 8.3   0.0023 0.0051  0.61
## BSBM18B      0.86      0.86    0.85      0.56 6.4   0.0027 0.0092  0.55
## BSBM18C      0.87      0.87    0.85      0.57 6.6   0.0027 0.0087  0.56
## BSBM18D      0.88      0.88    0.86      0.59 7.2   0.0024 0.0076  0.58
## BSBM18E      0.87      0.87    0.85      0.57 6.6   0.0026 0.0074  0.56
## BSBM18F      0.87      0.87    0.85      0.57 6.7   0.0026 0.0066  0.56
## 
##  Item statistics 
##            n raw.r std.r r.cor r.drop mean   sd
## BSBM18A 5875  0.70  0.72  0.62   0.59  1.7 0.65
## BSBM18B 5875  0.85  0.85  0.81   0.77  1.8 0.80
## BSBM18C 5875  0.84  0.83  0.80   0.75  2.0 0.82
## BSBM18D 5875  0.80  0.79  0.73   0.69  2.2 0.85
## BSBM18E 5875  0.83  0.83  0.80   0.75  1.8 0.78
## BSBM18F 5875  0.83  0.83  0.79   0.74  1.7 0.78
## 
## Non missing response frequency for each item
##            1    2    3    4 miss
## BSBM18A 0.41 0.51 0.06 0.01    0
## BSBM18B 0.39 0.44 0.13 0.04    0
## BSBM18C 0.26 0.48 0.20 0.05    0
## BSBM18D 0.20 0.44 0.28 0.08    0
## BSBM18E 0.40 0.44 0.13 0.03    0
## BSBM18F 0.46 0.40 0.11 0.03    0

Cronbach’s alpha is 0.89, which indicates very good scale reliability and internal consistency of the items.
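
As a compact recap, the four raw alphas can be collected in one call (a sketch reusing the item sets defined above):

scales <- list(MR1 = MR1, MR2 = MR2, MR3 = MR3, MR4 = MR4)
round(sapply(scales, function(s) psych::alpha(s, check.keys = TRUE)$total$raw_alpha), 2)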

Adding factors to regression data set

fascores<-as.data.frame(fa1$scores)
datareg<-cbind(data_reg,fascores)
datareg$BSMMAT01<-as.numeric(as.character(datareg$BSMMAT01))
names(datareg)[names(datareg) == "BSMMAT01"] <- "matach"
names(datareg)[names(datareg) == "BSBG01"] <- "gender"
names(datareg)[names(datareg) == "BSBG07A"] <- "educmat"
names(datareg)[names(datareg) == "BSBG07B"] <- "educfat"
names(datareg)[names(datareg) == "BSBG10A"] <- "countrybrn"

names(datareg)[names(datareg) == "MR1"] <- "likemat"
names(datareg)[names(datareg) == "MR2"] <- "teachsup"
names(datareg)[names(datareg) == "MR3"] <- "matab"
names(datareg)[names(datareg) == "MR4"] <- "clreq"
names(datareg)
## [1] "matach"     "gender"     "educmat"    "educfat"    "countrybrn"
## [6] "likemat"    "teachsup"   "matab"      "clreq"

Regression analysis

Research question and Hypotheses:

Do parents’ education and a student’s confidence in his or her math abilities influence math achievement?

Hypotheses:

H1: Higher educational levels of parents will be associated with higher math achievement of their children.

H2: Higher self-perceived math abilities of a student will be associated with higher math achievement, irrespective of the student’s gender.

Recoding variables:

# Recode parents' education to an ordinal 1-7 scale; "Don't know" becomes NA
datareg$educmat <- ifelse(datareg$educmat == "Some Primary or Lower secondary or did not go to school",1,
                          ifelse(datareg$educmat == "Lower secondary", 2,
                                 ifelse(datareg$educmat =="Upper secondary", 3,
                                        ifelse(datareg$educmat == "Post-secondary, non-tertiary", 4,
                                               ifelse(datareg$educmat == "Short-cycle tertiary",5,
                                                      ifelse(datareg$educmat == "Bachelor’s or equivalent",6,
                                                             ifelse(datareg$educmat == "Postgraduate degree", 7, NA)))))))

datareg$educfat <- ifelse(datareg$educfat == "Some Primary or Lower secondary or did not go to school",1,
                          ifelse(datareg$educfat == "Lower secondary", 2,
                                 ifelse(datareg$educfat =="Upper secondary", 3,
                                        ifelse(datareg$educfat == "Post-secondary, non-tertiary", 4,
                                               ifelse(datareg$educfat == "Short-cycle tertiary",5,
                                                      ifelse(datareg$educfat == "Bachelor’s or equivalent",6,
                                                             ifelse(datareg$educfat == "Postgraduate degree", 7, NA)))))))
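
For reference, the nested ifelse() above can be written more compactly with match(); a sketch assuming the factor labels exactly as printed earlier (recode_edu is a hypothetical helper name):

recode_edu <- function(x) {
  levs <- c("Some Primary or Lower secondary or did not go to school",
            "Lower secondary", "Upper secondary",
            "Post-secondary, non-tertiary", "Short-cycle tertiary",
            "Bachelor’s or equivalent", "Postgraduate degree")
  match(as.character(x), levs)  # "Don't know" (and anything else) becomes NA
}
# e.g. datareg$educmat <- recode_edu(data_reg$BSBG07A)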

Model:

regmodel <- lm(matach~educfat+educmat+matab*gender, data=datareg)
summary(regmodel)
## 
## Call:
## lm(formula = matach ~ educfat + educmat + matab * gender, data = datareg)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -312.432  -39.042    7.688   46.332  190.835 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     565.0991     3.2186 175.574  < 2e-16 ***
## educfat           7.6696     0.7967   9.627  < 2e-16 ***
## educmat           8.2342     0.8263   9.965  < 2e-16 ***
## matab            26.0294     1.5306  17.006  < 2e-16 ***
## genderBoy       -13.1746     2.2349  -5.895 4.09e-09 ***
## matab:genderBoy   8.8848     2.2362   3.973 7.23e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 66.86 on 3611 degrees of freedom
##   (2258 observations deleted due to missingness)
## Multiple R-squared:  0.2907, Adjusted R-squared:  0.2897 
## F-statistic:   296 on 5 and 3611 DF,  p-value: < 2.2e-16
plot_model(regmodel, type = "int")

The model explains approximately 29% of the variance (Adjusted R-squared: 0.2897). All coefficients are statistically significant and show the following:

Meaningful interpretation of the results:

Father’s education: the general trend is that the higher the father’s educational level, the higher the student’s math achievement

Mother’s education: the same trend can be observed for mother’s education. Children whose mothers have higher educational levels tend to have higher math achievement. One may therefore conclude that children with well-educated parents tend to do better in their math classes

Self-perceived mathematical abilities of a student: the higher a student’s confidence in his or her math abilities, the higher the achievement scores, meaning that students who are confident in their abilities tend to perform better

Gender: Additionally, the coefficients show that boys tend to have lower math achievement scores

Interaction effect: however, there is an interaction between gender and self-perceived math abilities. Specifically, as self-perceived math abilities increase, math achievement increases at different rates for boys and girls. The results suggest that boys who are confident in their math abilities start to perform better than girls with the same level of self-perceived mathematical abilities.

Technical interpretation of coefficients:

  • With a one-unit increase in “educfat”, math achievement increases by 7.67 points, holding the other predictors constant

  • With a one-unit increase in “educmat”, math achievement increases by 8.23 points, holding the other predictors constant

  • With a one-unit increase in “matab”, math achievement increases by 26.03 points for girls (the reference category)

  • If a student is a boy, his math achievement is 13.17 points lower than a girl’s at matab = 0 (roughly the average level), and the matab slope is 8.88 points steeper for boys
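
The interaction coefficient implies different simple slopes of matab for the two genders; a sketch recovering them from the fitted model (Girl is the reference category):

b <- coef(regmodel)
slope_girl <- unname(b["matab"])                         # about 26.0
slope_boy  <- unname(b["matab"] + b["matab:genderBoy"])  # about 26.0 + 8.9 = 34.9
c(girl = slope_girl, boy = slope_boy)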

Model diagnostics:

par(mfrow = c(2,2))
plot(regmodel)

  • Residuals vs Fitted: the dots are not quite evenly dispersed around zero, which means we face the problem of heteroscedasticity

  • The Normal Q-Q plot shows that the residuals are approximately normally distributed

  • Also, we do not have any high-leverage or influential cases, as the Cook’s distance line is not even visible on the last plot

Checking for heteroscedasticity again:

bptest(regmodel)
## 
##  studentized Breusch-Pagan test
## 
## data:  regmodel
## BP = 120.17, df = 5, p-value < 2.2e-16
ncvTest(regmodel)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 126.3135, Df = 1, p = < 2.22e-16

Both tests produce significant p-values, which confirms that we have a heteroscedasticity problem.
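
Given this result, one common remedy is to report heteroscedasticity-consistent standard errors; a sketch assuming the sandwich package is installed (coeftest() comes from lmtest, which is already loaded):

library(sandwich)
coeftest(regmodel, vcov = vcovHC(regmodel, type = "HC3"))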

Checking for multicollinearity:

vif(regmodel)
##      educfat      educmat        matab       gender matab:gender 
##     1.725496     1.718170     1.896043     1.008564     1.873943

All values are well below 5. Therefore, it can be concluded that we do not have a multicollinearity problem.

Conclusion:

After conducting EFA 4 latent factors were found:

  • Student’s attitude towards mathematics

  • Level of teacher’s support

  • Self-perceived mathematical abilities

  • Level of clarity of teacher’s requirements and materials

Additionally, the regression analysis supported the first hypothesis: higher levels of parents’ education predict higher math achievement of their children. The second hypothesis was supported only partially. The regression did show that higher self-perceived math abilities are associated with higher math achievement. However, achievement increases with self-perceived math ability at different rates for boys and girls. To be more specific, girls tend to have higher math achievement than boys when the level of self-perceived math abilities is low or medium, but at relatively high levels boys’ achievement increases more steeply than that of girls with the same level of self-perceived math abilities. In other words, there is a significant interaction effect between self-perceived math abilities and gender.