Project 2. Inspecting the structure of scales measuring students’ attitudes towards Math in Islamic countries: Exploratory Factor Analysis for the case of Qatar

Introduction

A substantial body of research exists in the sociology of education. To tackle various research problems in the field, a number of measurement instruments have been created to capture such concepts as student motivation, (un)favorable attitudes towards a particular discipline, teacher-student relationships, students’ self-concept regarding different subjects (e.g. how well a student thinks he or she is doing in a particular subject) and many more.

However, behind extensive questionnaires there are often underlying latent variables, each representing a higher-level concept. These latent variables are usually called factors, and the set of variables (questions) that constitutes a factor is often called a scale. Determining the factor structure in different thematic settings of the sociology of education is therefore a natural task for exploratory factor analysis (EFA).

Finally, factor detection serves two useful purposes for subsequent regression analysis. First, it reduces the dimensionality of the data at hand, so that only a small set of variables needs to be used. Second, the factor scores of each factor can be used as a single predictor in the regression.

In plain words, factor analysis combines many highly correlated variables into a few major factors (= predictors), each of which captures a distinct higher-level concept within the data. As a consequence, fewer predictors are left for regression modeling, which saves degrees of freedom and helps to avoid multicollinearity.
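To make this concrete, here is a minimal sketch on simulated data (not the TIMSS data). Six hypothetical items are generated from two latent traits, and base R's factanal() reduces them to two factor scores per respondent. All variable names and simulation settings are assumptions made for illustration only.

```r
set.seed(42)
n <- 500

# Two hypothetical latent traits (illustrative, not TIMSS variables)
liking     <- rnorm(n)
confidence <- rnorm(n)

# Six observed items: three driven by each trait, plus noise
items <- data.frame(
  like1 = liking + rnorm(n, sd = 0.5),
  like2 = liking + rnorm(n, sd = 0.5),
  like3 = liking + rnorm(n, sd = 0.5),
  conf1 = confidence + rnorm(n, sd = 0.5),
  conf2 = confidence + rnorm(n, sd = 0.5),
  conf3 = confidence + rnorm(n, sd = 0.5)
)

# Items that share a trait are highly correlated with each other
round(cor(items)[1:3, 1:3], 2)

# Maximum-likelihood factor analysis with two factors and varimax rotation
efa <- factanal(items, factors = 2, rotation = "varimax", scores = "regression")

# Six correlated items are replaced by two factor scores per respondent
factor_scores <- as.data.frame(efa$scores)
dim(factor_scores)
round(cor(factor_scores), 2)
```

The two columns of factor_scores could then enter a regression in place of the six original items, which is the idea behind using the extracted factors as predictors later in this report.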

It also goes without saying that students’ attitudes towards Math vary from country to country. As a baseline, it is often stated that, on average, boys tend to hold more favorable attitudes towards math than girls. This pattern is part of the broader problem of gender differences in STEM, which is discussed in many scientific papers.

Still, a separate strand of literature addresses gender differences in Science in Muslim countries in particular. Overall, there is no clear consensus on whether women in Arab countries are now more inclined to seek a career in Science and to develop more favorable attitudes towards it. Some papers suggest that this trend does exist in the modern world, whereas other studies report the opposite.

Data description

The current analysis is based on the latest TIMSS 2015 data release. TIMSS is a cross-national study, repeated every four years, that assesses 8th grade students (~ 14-15 y.o.) and records their educational process in terms of motivation, attitudes and learning experiences.

This short study has 2 major objectives. First, it aims to explore the structure of scales measuring the attitudes towards mathematics of 8th grade pupils in Arab countries, Qatar in particular. Second, it aims to use the extracted factors to explain the variation in the academic achievement of students in Qatar, controlling for the student’s gender, the highest level of parental education, and whether the student was born in Qatar.

Variables description

library(formattable)

# TIMSS item codes of the attitude questions
Variable <- c("BSBM17A", "BSBM17B", "BSBM17C", "BSBM17D","BSBM17E","BSBM17F","BSBM17G","BSBM17H","BSBM17I","BSBM18A","BSBM18B","BSBM18C","BSBM18D","BSBM18E","BSBM18F","BSBM18G","BSBM18H","BSBM18I","BSBM19A","BSBM19B","BSBM19C","BSBM19D","BSBM19E")

# Short paraphrases of the item wording
Description <- c("I enjoy learning math", "I wish not to study math", "I find math boring", "Math helps to learn interesting things","I like math","I like numbers","I like math problems","I look forward to math classes","Teacher expects me to do math","Teacher is easy to understand","I am interested in what teacher says","Teacher offers interesting things to do","Teacher gives clear answers","Teacher explains well","Teacher shows learned","Teacher gives different things to help learning math","Teacher tells how to do better doing math","Teacher listens","I usually do well in maths","Math is more difficult than other subjects","Math is not my strength","I learn quickly in math","Math makes me nervous")

VarTable0 <- data.frame(Variable, Description)

# Render the table; one alignment entry per column, item codes highlighted
formattable(VarTable0, 
            align = c("c","l"), 
            list(Variable = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold"))))

Variable	Description
BSBM17A	I enjoy learning math
BSBM17B	I wish not to study math
BSBM17C	I find math boring
BSBM17D	Math helps to learn interesting things
BSBM17E	I like math
BSBM17F	I like numbers
BSBM17G	I like math problems
BSBM17H	I look forward to math classes
BSBM17I	Teacher expects me to do math
BSBM18A	Teacher is easy to understand
BSBM18B	I am interested in what teacher says
BSBM18C	Teacher offers interesting things to do
BSBM18D	Teacher gives clear answers
BSBM18E	Teacher explains well
BSBM18F	Teacher shows learned
BSBM18G	Teacher gives different things to help learning math
BSBM18H	Teacher tells how to do better doing math
BSBM18I	Teacher listens
BSBM19A	I usually do well in maths
BSBM19B	Math is more difficult than other subjects
BSBM19C	Math is not my strength
BSBM19D	I learn quickly in math
BSBM19E	Math makes me nervous

Research question

On average, are female students in Qatar doing better at Maths compared to male students?

Descriptive statistics

library(ggplot2)

# Overall distribution of math scores; the dashed red line marks the mean
ggplot(qatar, aes(x = BSMMAT01)) +
  geom_histogram(bins = 30, alpha = 0.3, color = "black", fill = "cornflowerblue") +
  labs(title = "Math scores distribution of Qatar 8th grade students\n",
       y = "", x = "Math score") +
  theme_minimal() +
  geom_vline(aes(xintercept = mean(BSMMAT01)), color = "red", linetype = "dashed", size = 0.5) +
  # Annotate the plot with the overall mean and standard deviation
  annotate(x = 650, y = 350, geom = "text", size = 4,
           label = paste("Mean score = ", round(mean(qatar$BSMMAT01), 1))) +
  annotate(x = 650, y = 330, geom = "text", size = 4,
           label = paste("SD(score) = ", round(sd(qatar$BSMMAT01), 1))) +
  scale_x_continuous(breaks = seq(0, 1000, by = 100), limits = c(130, 750)) +
  theme(text = element_text(size = 10),
        plot.title = element_text(hjust = 0.5), title = element_text(size = 12))

# Per-gender mean scores, used as facet annotations
ann_text <- data.frame(
  BSMMAT01 = c(640, 640),
  count = c(200, 200),
  ITSEX = c("Male", "Female"),
  label = c(
    paste("Mean score = ", round(mean(qatar$BSMMAT01[qatar$ITSEX == "Male"]), 1)),
    paste("Mean score = ", round(mean(qatar$BSMMAT01[qatar$ITSEX == "Female"]), 1))
  )
)
                       
ggplot(qatar, aes(x = BSMMAT01, group = ITSEX))+
  geom_histogram(alpha = 0.3, color = "black", fill = "cornflowerblue")+
  facet_wrap(~ITSEX)+
  labs(title = "Math scores distribution of Qatar 8th grade students by gender
            ", y  = "count", x = "Math score")+
  theme_minimal()+
  geom_text(data = ann_text, aes(x = BSMMAT01, y = count, group = ITSEX), label=ann_text$label)+
  theme(text = element_text(size = 12),
       plot.title = element_text(hjust = 0.5), title = element_text(size = 12))

As can be seen in the histograms, the mean score of female students is slightly higher than that of male students.

# Mean scores by birthplace (born in Qatar: yes / no), used as facet annotations
ann_text1 <- data.frame(
  BSMMAT01 = c(640, 640),
  count = c(200, 200),
  BSBG10A = c("yes", "no"),
  label = c(
    paste("Mean score = ", round(mean(qatar$BSMMAT01[qatar$BSBG10A == "yes"]), 1)),
    paste("Mean score = ", round(mean(qatar$BSMMAT01[qatar$BSBG10A == "no"]), 1))
  )
)
                       
ggplot(qatar, aes(x = BSMMAT01, group = BSBG10A))+
  geom_histogram(alpha = 0.3, color = "black", fill = "cornflowerblue")+
  facet_wrap(~BSBG10A)+
  labs(title = "Math scores distribution of students by the fact of being\noriginally born in Qatar
            ", y  = "count", x = "Math score")+
  theme_minimal()+
  geom_text(data = ann_text1, aes(x = BSMMAT01, y = count, group = BSBG10A), label=ann_text1$label)+
  theme(text = element_text(size = 12),
       plot.title = element_text(hjust = 0.5), title = element_text(size = 12))

The average score of migrant (non-native) students is much higher than that of native students.

# Mean scores by highest level of parental education, used as facet annotations
ann_text2 <- data.frame(
  BSMMAT01 = rep(650, 4),
  count = rep(190, 4),
  BSDGEDUP = c("Uni or higher", "Post/Upper secondary", "Primary/Lower secondary", "Don't know"),
  label = c(
    paste("Mean score = ", round(mean(qatar$BSMMAT01[qatar$BSDGEDUP == "Uni or higher"]), 1)),
    paste("Mean score = ", round(mean(qatar$BSMMAT01[qatar$BSDGEDUP == "Post/Upper secondary"]), 1)),
    paste("Mean score = ", round(mean(qatar$BSMMAT01[qatar$BSDGEDUP == "Primary/Lower secondary"]), 1)),
    paste("Mean score = ", round(mean(qatar$BSMMAT01[qatar$BSDGEDUP == "Don't know"]), 1))
  )
)
                       
ggplot(qatar, aes(x = BSMMAT01, group = BSDGEDUP))+
  geom_histogram(alpha = 0.3, color = "black", fill = "cornflowerblue")+
  facet_wrap(~BSDGEDUP)+
  labs(title = "Math scores distribution by the highest level\nof parental education
            ", y  = "count", x = "Math score")+
  theme_minimal()+
  geom_text(data = ann_text2, aes(x = BSMMAT01, y = count, group = BSDGEDUP), label=ann_text2$label, size = 3.5)+
  theme(text = element_text(size = 12),
       plot.title = element_text(hjust = 0.5), title = element_text(size = 12))

Children of parents with higher education do, on average, better at Maths than children whose parents’ educational attainment is less than a university diploma.

Exploratory factor analysis

library(pander)
library(ggplot2)
library(ggcorrplot)
library(polycor)
corr <- as.matrix(hetcor(as.data.frame(scales)))

ggcorrplot(corr, hc.order = TRUE, type = "lower",
     outline.col = "white")+
  theme_minimal()+
  labs(title = "Correlation matrix for 24 items in the survey\n(see variables description table for more detail)
            ", y  = "", x = "")+
  theme(text = element_text(size = 10),
       plot.title = element_text(hjust = 0.5), title = element_text(size = 12), axis.text.x = element_text(angle = 90, hjust = 1))

The lower triangle of the correlation matrix shows that items from the same block (17, 18 or 19) are often highly correlated with each other.

How many factors to extract?

library(psych)
fa.parallel(scales, fa="both", n.iter=100)

## Parallel analysis suggests that the number of factors =  4  and the number of components =  3

As the scree plot suggests, it is possible to extract 4 factors. However, it can be clearly observed that the 4th factor is on the very edge of the threshold. Therefore, it is reasonable to retain only 3 factors for further analysis.
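The logic behind parallel analysis can be sketched by hand in base R. The example below is an illustration on simulated data, not a reimplementation of fa.parallel(): it covers only the principal-component variant, and the data, the 100 replications and the 95th-percentile cut-off are assumptions chosen for the sketch. Eigenvalues of the observed correlation matrix are compared with eigenvalues obtained from random data of the same size, and a component is retained only while its eigenvalue exceeds the random benchmark.

```r
set.seed(1)
n <- 400

# Simulated data: two latent traits behind six items (illustrative only)
f1 <- rnorm(n)
f2 <- rnorm(n)
x <- cbind(
  f1 + rnorm(n), f1 + rnorm(n), f1 + rnorm(n),
  f2 + rnorm(n), f2 + rnorm(n), f2 + rnorm(n)
)

# Eigenvalues of the observed correlation matrix
observed <- eigen(cor(x), only.values = TRUE)$values

# Eigenvalues from pure-noise data of the same dimensions, repeated 100 times
random <- replicate(100, {
  noise <- matrix(rnorm(n * ncol(x)), nrow = n)
  eigen(cor(noise), only.values = TRUE)$values
})

# Benchmark: 95th percentile of the random eigenvalues at each position
threshold <- apply(random, 1, quantile, probs = 0.95)

# Count the leading components whose eigenvalue beats the benchmark
n_retain <- sum(cumprod(observed > threshold))
n_retain
```

A factor whose eigenvalue only barely exceeds the benchmark, as the 4th factor does here, is the kind of borderline case where dropping it is a defensible judgment call.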

Now that the number of factors is chosen, let us start the EFA. I will first use an orthogonal rotation (rotate="varimax") and then compare it with a non-orthogonal one (rotate="oblimin"). An unrotated solution is usually harder to interpret than a rotated one, so the no-rotation model is skipped.

Orthogonal rotation model

#fa <- fa(scales, nfactors=3, rotate="varimax", fm="ml",cor= "mixed") 

Metric <- c("Proportion Var","Cumulative Var","RMSR","RMSEA","TLI")

Result_varimax <- c("0.30 - 0.26 - 0.11","0.67","0.03","0.082","0.917")

Comment <- c("meets the criterion for explaining at least 0.1", "good", "meets the criterion of being <0.05", "borderline, as slightly above [0.06,0.08]","acceptable, as belongs to [0.9, 0.95]")

tabl <- data.frame(Metric, Result_varimax, Comment)

formattable(tabl, 
            align =c("l","l","l"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold"))))

Metric	Result_varimax	Comment
Proportion Var	0.30 - 0.26 - 0.11	meets the criterion for explaining at least 0.1
Cumulative Var	0.67	good
RMSR	0.03	meets the criterion of being <0.05
RMSEA	0.082	borderline, as slightly above [0.06,0.08]
TLI	0.917	acceptable, as belongs to [0.9, 0.95]

Non-orthogonal rotation model

#fa(scales, nfactors=3, rotate="oblimin", fm="ml", cor= "mixed") 

Metric <- c("Proportion Var","Cumulative Var","RMSR","RMSEA","TLI")

Result_oblimin <- c("0.29 - 0.28 - 0.10","0.67","0.03","0.082","0.917")

Comment <- c("meets the criterion for explaining at least 0.1", "good (the same)", "meets the criterion of being <0.05 (the same)", "borderline, as slightly above [0.06,0.08] (the same)","acceptable, as belongs to [0.9, 0.95] (the same)")

tabl1 <- data.frame(Metric, Result_oblimin, Comment)

formattable(tabl1, 
            align =c("l","l","l"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold"))))

Metric	Result_oblimin	Comment
Proportion Var	0.29 - 0.28 - 0.10	meets the criterion for explaining at least 0.1
Cumulative Var	0.67	good (the same)
RMSR	0.03	meets the criterion of being <0.05 (the same)
RMSEA	0.082	borderline, as slightly above [0.06,0.08] (the same)
TLI	0.917	acceptable, as belongs to [0.9, 0.95] (the same)

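As a result, the models with different rotation options are practically the same in terms of fit. This is expected, since rotation changes the loadings but not how well the model reproduces the correlations. Thus, following a rule of thumb, we will proceed with the simpler (orthogonal) model.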
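Why the fit indices coincide can be checked with a small base R sketch on simulated data. It is an illustration under stated assumptions, not the analysis above: the items are made up, and promax() stands in for the oblimin rotation because it ships with base R. The common part of the model-implied correlation matrix is L L' for an orthogonal solution and L Phi L' for an oblique one, and both equal the unrotated L L'.

```r
set.seed(7)
n <- 500
f1 <- rnorm(n)
f2 <- rnorm(n)

# Six hypothetical items, three per latent trait
items <- data.frame(
  a1 = f1 + rnorm(n, sd = 0.6), a2 = f1 + rnorm(n, sd = 0.6), a3 = f1 + rnorm(n, sd = 0.6),
  b1 = f2 + rnorm(n, sd = 0.6), b2 = f2 + rnorm(n, sd = 0.6), b3 = f2 + rnorm(n, sd = 0.6)
)

# Unrotated maximum-likelihood solution
unrotated <- factanal(items, factors = 2, rotation = "none")
L <- unclass(loadings(unrotated))

# Orthogonal rotation (varimax)
L_varimax <- unclass(varimax(L)$loadings)

# Oblique rotation (promax, used here instead of oblimin)
pm <- promax(L)
L_promax <- unclass(pm$loadings)
Phi <- solve(t(pm$rotmat) %*% pm$rotmat)  # factor correlation matrix

# Common part of the model-implied correlation matrix under each solution
implied_unrotated <- L %*% t(L)
implied_varimax   <- L_varimax %*% t(L_varimax)
implied_promax    <- L_promax %*% Phi %*% t(L_promax)

# Both differences are zero up to numerical precision
diff_varimax <- max(abs(implied_unrotated - implied_varimax))
diff_promax  <- max(abs(implied_unrotated - implied_promax))
c(varimax = diff_varimax, promax = diff_promax)
```

Since the reproduced correlations are identical, residual-based indices such as RMSR, RMSEA and TLI cannot distinguish between rotations. The choice between them rests on interpretability.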

Let us explore the factor structure in more detail:

fa.diagram(fa(scales, nfactors=3, rotate="varimax", fm="ml",cor= "mixed"))

## 
## mixed.cor is deprecated, please use mixedCor.

First, it is worth stressing that the factor structure is plausible: each factor has enough items loading on it, and the loadings are fairly high.

Factor interpretation

ML2 = “Teacher-student relationships”. As can be seen from the diagram, all the questions marked “18” load on a single factor. This factor combines the teacher’s perceived expectations, behavior and level of engagement in the study process, as well as the teacher’s perceived ability to explain the subject and give useful advice during practice.

ML1 = “Positive self-affirmation on math”. As shown in the diagram, ML1 groups almost all the variables marked “17”, together with 2 items from another block. Looking at the wording of the questions behind the variables shows that this factor combines all the positively formulated questions related to math. It covers the following topics: enjoyment and interest while doing math, working with numbers and mathematical problems; positive anticipation of math classes; self-assessed ability to learn quickly and do well in math.

ML3 = “Negative self-affirmation on math”. The situation is reversed for ML3. Looking at the wording of the questions behind the variables shows that this factor combines all the negatively formulated questions related to math. Overall, it combines self-doubt in math and negative anticipation of math classes because the subject is perceived as boring or anxiety-provoking.

Estimating model fit

#ML1<- as.data.frame(scales[c(1,4:9,20,23)])
#psych::alpha(ML1, check.keys = TRUE)

#ML2<- as.data.frame(scales[,10:19])
#psych::alpha(ML2, check.keys=TRUE)

#ML3<- as.data.frame(scales[c(21,22,24,2,3)])
#psych::alpha(ML3, check.keys = TRUE)

Factor <- c("Teacher-student relationships","Positive self-affirmation on math","Negative self-affirmation on math")
Std_alpha <- c("0.93","0.94","0.78")

Comment <- c("more than 0.7 => excellent :))", "more than 0.7 => excellent :)))", "more than 0.7 => good :)")

tabl12 <- data.frame(Factor, Std_alpha, Comment)

formattable(tabl12, 
            align =c("l","c","l"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold"))))

Factor	Std_alpha	Comment
Teacher-student relationships	0.93	more than 0.7 => excellent :))
Positive self-affirmation on math	0.94	more than 0.7 => excellent :)))
Negative self-affirmation on math	0.78	more than 0.7 => good :)

Overall, the most vulnerable scale is “Negative self-affirmation on math”. Its lower std. alpha probably results from the smaller number of items in the scale. However, some of its standardized loadings are also relatively low, so this scale would benefit from methodological refinement.
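The dependence of alpha on the number of items follows from the formula for standardized Cronbach's alpha, alpha = k * r_bar / (1 + (k - 1) * r_bar), where k is the number of items and r_bar is the average inter-item correlation. The sketch below uses hypothetical correlation matrices rather than the TIMSS items, and the helper names std_alpha and make_R are made up for this illustration. It assumes all items are keyed in the same direction.

```r
# Standardized Cronbach's alpha from a correlation matrix:
# alpha = k * r_bar / (1 + (k - 1) * r_bar)
std_alpha <- function(R) {
  k <- ncol(R)
  r_bar <- mean(R[lower.tri(R)])  # average inter-item correlation
  k * r_bar / (1 + (k - 1) * r_bar)
}

# Hypothetical scale with k items that all correlate at r
make_R <- function(k, r) {
  R <- matrix(r, nrow = k, ncol = k)
  diag(R) <- 1
  R
}

# Same average correlation (0.4), different number of items
alpha_5  <- std_alpha(make_R(5, 0.4))
alpha_10 <- std_alpha(make_R(10, 0.4))
round(c(five_items = alpha_5, ten_items = alpha_10), 3)
```

With the average correlation held at 0.4, going from 10 items to 5 lowers alpha from about 0.87 to about 0.77, so a shorter scale has a lower alpha even when its items are equally coherent.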

Regression analysis: explaining math achievement

model <- lm(BSMMAT01 ~., data = bfi)
library(sjPlot) 
tab_model(model,show.est = T, show.ci = F, show.std = F, show.se = F, show.p = T, show.stat = T, show.r2 = T, show.fstat = T, show.obs = T,string.stat = "t-value", title = "Academic achievement in Mathematics by 8th grade students in Qatar",string.p = "p-value", dv.labels = "Score on mathematics",digits = 3)

Academic achievement in Mathematics by 8th grade students in Qatar
	Score on mathematics
Predictors	Estimates	t-value	p-value
(Intercept)	495.778	206.042	<0.001
Male	-7.659	-3.113	0.002
Don’t know	-13.131	-4.307	<0.001
Post/Upper secondary	-41.237	-12.445	<0.001
Primary/Lower secondary	-73.224	-14.223	<0.001
yes	-50.751	-20.193	<0.001
ML 2	-9.305	-7.648	<0.001
ML 1	-11.818	-9.602	<0.001
ML 3	33.367	27.162	<0.001
Observations	4472
R² / adjusted R²	0.329 / 0.328

Overall, the multiple regression model with 3 control variables and 3 factor-score variables explains about 33% of the variation in academic achievement in Maths. All the predictors are statistically significant at the 0.01 level or better.

Coefficient <- c("Intercept","Gender = Male","Parental edu = don't know","Parental edu = Post/Upper secondary","Parental edu = Primary/Lower secondary","Born in Qatar = yes","Teacher-student relationships","Positive self-affirmation on maths","Negative self-affirmation on maths")

Interpretation <- c("For a non-native female student whose parents hold a higher education diploma (with all the factor scores held at zero), the predicted math score is about 496 points (max 750)",
                    "A non-native male student whose parents hold a higher education diploma is predicted to score 7.66 points lower than a comparable female student",
                    "A non-native female student whose parents' educational attainment is unknown is predicted to score 13.13 points lower than one whose parents hold a higher education diploma",
                    "A non-native female student whose parents hold a Post/Upper secondary education diploma is predicted to score 41.24 points lower than one whose parents hold a higher education diploma",
                    "A non-native female student whose parents hold a Primary/Lower secondary education diploma is predicted to score 73.22 points lower than one whose parents hold a higher education diploma",
                    "A native female student whose parents hold a higher education diploma is predicted to score 50.75 points lower than a comparable non-native student",
                    "For a non-native female student whose parents hold a higher education diploma, each additional point towards total disagreement on the questions about the teacher being attentive and helpful is associated with a 9.31 point decrease in math score",
                    "For a non-native female student whose parents hold a higher education diploma, each additional point towards total disagreement on the questions about enjoyment and interest in maths is associated with an 11.82 point decrease in math score",
                    "For a non-native female student whose parents hold a higher education diploma, each additional point towards total disagreement on the questions about being unable to cope with math is associated with a 33.37 point increase in math score")

tabl123 <- data.frame(Coefficient, Interpretation)

formattable(tabl123, 
            align =c("l","l"), 
            list(`Indicator Name` = formatter(
              "span", style = ~ style(color = "grey",font.weight = "bold"))))

Coefficient	Interpretation
Intercept	For a non-native female student whose parents hold a higher education diploma (with all the factor scores held at zero), the predicted math score is about 496 points (max 750)
Gender = Male	A non-native male student whose parents hold a higher education diploma is predicted to score 7.66 points lower than a comparable female student
Parental edu = don’t know	A non-native female student whose parents’ educational attainment is unknown is predicted to score 13.13 points lower than one whose parents hold a higher education diploma
Parental edu = Post/Upper secondary	A non-native female student whose parents hold a Post/Upper secondary education diploma is predicted to score 41.24 points lower than one whose parents hold a higher education diploma
Parental edu = Primary/Lower secondary	A non-native female student whose parents hold a Primary/Lower secondary education diploma is predicted to score 73.22 points lower than one whose parents hold a higher education diploma
Born in Qatar = yes	A native female student whose parents hold a higher education diploma is predicted to score 50.75 points lower than a comparable non-native student
Teacher-student relationships	For a non-native female student whose parents hold a higher education diploma, each additional point towards total disagreement on the questions about the teacher being attentive and helpful is associated with a 9.31 point decrease in math score
Positive self-affirmation on maths	For a non-native female student whose parents hold a higher education diploma, each additional point towards total disagreement on the questions about enjoyment and interest in maths is associated with an 11.82 point decrease in math score
Negative self-affirmation on maths	For a non-native female student whose parents hold a higher education diploma, each additional point towards total disagreement on the questions about being unable to cope with math is associated with a 33.37 point increase in math score
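As a worked example of how the coefficients combine, the sketch below computes predicted scores for one hypothetical student profile. The coefficients are copied from the regression table above. The profile, the vector names and the choice of a one-unit change in ML3 are assumptions made for illustration.

```r
# Coefficients copied from the regression table above
coefs <- c(
  intercept           = 495.778,
  male                = -7.659,
  edu_dont_know       = -13.131,
  edu_upper_secondary = -41.237,
  edu_lower_secondary = -73.224,
  born_in_qatar       = -50.751,
  ML2                 = -9.305,
  ML1                 = -11.818,
  ML3                 = 33.367
)

# Hypothetical profile: native male student, parents with Post/Upper secondary
# education, all factor scores at zero
profile <- c(
  intercept = 1, male = 1, edu_dont_know = 0, edu_upper_secondary = 1,
  edu_lower_secondary = 0, born_in_qatar = 1, ML2 = 0, ML1 = 0, ML3 = 0
)

# Predicted score = sum of coefficient * value over all terms
predicted <- sum(coefs * profile[names(coefs)])

# Same student, but one unit higher on the ML3 factor score
profile_ml3 <- profile
profile_ml3["ML3"] <- 1
predicted_ml3 <- sum(coefs * profile_ml3[names(coefs)])

round(c(baseline = predicted, ml3_plus_one = predicted_ml3), 3)
```

The baseline prediction is 495.778 - 7.659 - 41.237 - 50.751 = 396.131 points, and a one-unit increase in ML3 raises it by 33.367 to 429.498.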

Model diagnostics

library(car)
vif(model)

##              GVIF Df GVIF^(1/(2*Df))
## ITSEX    1.036420  1        1.018047
## BSDGEDUP 1.061340  3        1.009971
## BSBG10A  1.071250  1        1.035012
## ML2      1.013716  1        1.006835
## ML1      1.037622  1        1.018637
## ML3      1.033572  1        1.016647

All the (generalized) variance inflation factors are close to 1, well below the common threshold of 5, so multicollinearity is not a concern. Judging by the histograms above, the math scores are approximately normally distributed, and nothing raises any substantial concerns.
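The last column of the vif() output can be reproduced by hand, which clarifies what it adjusts for. The sketch below copies the GVIF and Df values from the output above and applies the exponent 1/(2*Df), which makes terms with several dummy variables (here the 3-Df parental education factor) comparable with 1-Df terms. Squaring the adjusted value to compare it with the usual VIF rule of thumb is a common convention that is assumed here, not something shown in the output.

```r
# GVIF and degrees of freedom copied from the vif() output above
gvif <- c(ITSEX = 1.036420, BSDGEDUP = 1.061340, BSBG10A = 1.071250,
          ML2 = 1.013716, ML1 = 1.037622, ML3 = 1.033572)
df   <- c(ITSEX = 1, BSDGEDUP = 3, BSBG10A = 1, ML2 = 1, ML1 = 1, ML3 = 1)

# Adjusted value reported in the third column: GVIF^(1/(2*Df))
adjusted <- gvif^(1 / (2 * df))
round(adjusted, 6)

# Squared adjusted values are on the same scale as an ordinary VIF
vif_scale <- adjusted^2
all(vif_scale < 5)
```

For the 1-Df terms the adjusted value is simply the square root of the GVIF, so every predictor here stays far below the threshold of 5.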

Conclusion

Overall, the underlying structure of the data can be described as having 3 major latent variables, each meaningfully covering a different subset of items. The first latent factor can be named “Teacher-student relationships” and refers to the teacher’s perceived behavior and level of engagement in the study process. The second latent variable can be named “Positive self-affirmation on math” and consists of the questions about enjoyment and interest while doing math. The last latent factor can be named “Negative self-affirmation on math”, which is the reversed version of the previous factor. It reflects self-doubt in math and negative anticipation of math classes.

Turning to the descriptive statistics, the Math scores are distributed approximately normally among students in Qatar. Girls, on average, score higher in math than boys, which answers the research question stated at the beginning in the affirmative.

Also, being a non-native student in Qatar is associated with higher scores in math. Besides, children of parents with higher education do, on average, better at Math than children whose parents’ educational attainment is lower.

Of the three latent variables whose factor scores were used in the regression analysis, “Negative self-affirmation on math” has the largest effect. In plain words, if a student firmly rejects the statements that he or she is not good at math, or that math is boring and anxiety-provoking, this attitude is associated, on average, with an increase in math score that is larger than the decrease associated with disagreeing with positive statements about math in general or about teachers.