The following document explores the structure of scales measuring students’ attitudes towards mathematics, using Norway as an example. The analysis is based on the TIMSS series, an international assessment of mathematics and science in the fourth and eighth grades that has been conducted every four years since 1995; only the 2015 round is covered here.
Norway has a public education system that several online sources describe as one of the best in Europe and among the most popular destinations for foreign students (e.g. https://www.justlanded.com/english/Norway/Norway-Guide/Education/Schools-in-Norway). It is also known that Norway participated in four rounds of TIMSS (1995, 2003, 2007, and 2011) before 2015, the results of which prompted the Ministry of Education and Research in Norway to commission several examinations of the state of education in the country and, eventually, led to a reform of the entire school system in 2006 (source: http://timssandpirls.bc.edu/timss2015/encyclopedia/countries/norway/use-and-impact-of-timss/). One possible, if admittedly superficial, conclusion from this is that the TIMSS assessment gave the Norwegian government the impetus to successfully improve the country’s education system. With that in mind, I thought it would be interesting to investigate how the scales used to measure attitudes in the country’s education system are structured, namely, whether there are latent factors capable of combining several scales into one. In line with the interests of educational sociology, I am going to conduct this investigation on students’ attitudes towards one of the STEM subjects, those subjects being an arena of the 21st-century gender-inequality debate. Only the 8th grade is covered.
Questions that assess students’ attitudes towards mathematics are coded BSBM17A–BSBM17I, BSBM18A–BSBM18J, and BSBM19A–BSBM19E. The dataset on which an Exploratory Factor Analysis will be carried out has the following structure: 4,795 observations of these 24 variables.
library(dplyr)
library(foreign)
library(ggplot2)
# Read the TIMSS 2015 Norway eighth-grade student file and keep the 24 attitude items
timss_2015 <- read.spss("BSGNO8M6.sav", to.data.frame = TRUE)
timss_NOR <- timss_2015 %>%
  dplyr::select(BSBM17A, BSBM17B, BSBM17C, BSBM17D, BSBM17E, BSBM17F,
                BSBM17G, BSBM17H, BSBM17I, BSBM18A, BSBM18B, BSBM18C,
                BSBM18D, BSBM18E, BSBM18F, BSBM18G, BSBM18H, BSBM18I,
                BSBM18J, BSBM19A, BSBM19B, BSBM19C, BSBM19D, BSBM19E)
str(timss_NOR)
## 'data.frame': 4795 obs. of 24 variables:
## $ BSBM17A: Factor w/ 4 levels "Agree a lot",..: 2 2 1 2 2 1 1 2 4 3 ...
## $ BSBM17B: Factor w/ 4 levels "Agree a lot",..: 3 3 4 4 3 4 4 2 2 NA ...
## $ BSBM17C: Factor w/ 4 levels "Agree a lot",..: 2 3 3 3 2 3 4 3 1 2 ...
## $ BSBM17D: Factor w/ 4 levels "Agree a lot",..: 2 2 1 2 3 1 1 3 4 3 ...
## $ BSBM17E: Factor w/ 4 levels "Agree a lot",..: 3 3 1 2 3 1 1 2 3 3 ...
## $ BSBM17F: Factor w/ 4 levels "Agree a lot",..: 4 3 3 2 4 2 1 3 3 3 ...
## $ BSBM17G: Factor w/ 4 levels "Agree a lot",..: 3 3 1 2 2 2 1 1 3 3 ...
## $ BSBM17H: Factor w/ 4 levels "Agree a lot",..: 3 3 3 2 3 2 1 3 4 3 ...
## $ BSBM17I: Factor w/ 4 levels "Agree a lot",..: 3 3 1 2 3 1 1 1 4 3 ...
## $ BSBM18A: Factor w/ 4 levels "Agree a lot",..: 3 1 2 2 2 2 2 3 2 3 ...
## $ BSBM18B: Factor w/ 4 levels "Agree a lot",..: 2 4 2 2 1 2 2 3 2 2 ...
## $ BSBM18C: Factor w/ 4 levels "Agree a lot",..: 2 3 3 1 3 1 1 3 3 3 ...
## $ BSBM18D: Factor w/ 4 levels "Agree a lot",..: 3 3 4 1 3 1 2 3 3 3 ...
## $ BSBM18E: Factor w/ 4 levels "Agree a lot",..: 2 4 3 2 3 2 2 3 3 3 ...
## $ BSBM18F: Factor w/ 4 levels "Agree a lot",..: 1 4 2 1 2 1 1 2 3 4 ...
## $ BSBM18G: Factor w/ 4 levels "Agree a lot",..: 1 3 2 1 2 1 2 3 2 3 ...
## $ BSBM18H: Factor w/ 4 levels "Agree a lot",..: 1 3 3 1 2 1 1 3 2 NA ...
## $ BSBM18I: Factor w/ 4 levels "Agree a lot",..: 1 4 1 1 2 1 1 3 1 NA ...
## $ BSBM18J: Factor w/ 4 levels "Agree a lot",..: 1 3 2 1 2 1 1 4 3 2 ...
## $ BSBM19A: Factor w/ 4 levels "Agree a lot",..: 1 3 1 1 2 1 1 1 3 2 ...
## $ BSBM19B: Factor w/ 4 levels "Agree a lot",..: 2 1 4 4 4 4 4 4 2 3 ...
## $ BSBM19C: Factor w/ 4 levels "Agree a lot",..: 4 1 4 4 3 3 4 4 2 3 ...
## $ BSBM19D: Factor w/ 4 levels "Agree a lot",..: 2 4 2 2 2 1 1 1 3 2 ...
## $ BSBM19E: Factor w/ 4 levels "Agree a lot",..: 3 1 4 4 4 3 4 4 3 3 ...
## - attr(*, "variable.labels")= Named chr "Country ID - Numeric Code" "Booklet ID" "School ID" "Class ID" ...
## ..- attr(*, "names")= chr "IDCNTRY" "IDBOOK" "IDSCHOOL" "IDCLASS" ...
## - attr(*, "codepage")= int 65001
A detailed description of the variables can be found in the image below:
The distribution of responses across the categories of each variable can be seen in the bar charts below.
library(lattice)
# Build one bar chart per variable instead of repeating the same call 24 times
charts <- lapply(names(timss_NOR), function(v)
  barchart(timss_NOR[[v]], col = "mediumturquoise", xlab = "", main = v))
names(charts) <- names(timss_NOR)
# Display the 24 charts in four 3x2 grids
pdp::grid.arrange(grobs = charts[c("BSBM17A", "BSBM17B", "BSBM17C",
                                   "BSBM17D", "BSBM17E", "BSBM17F")], ncol = 3)
pdp::grid.arrange(grobs = charts[c("BSBM17G", "BSBM17H", "BSBM18F",
                                   "BSBM18G", "BSBM19D", "BSBM19E")], ncol = 3)
pdp::grid.arrange(grobs = charts[c("BSBM17I", "BSBM18A", "BSBM18B",
                                   "BSBM18C", "BSBM18D", "BSBM18E")], ncol = 3)
pdp::grid.arrange(grobs = charts[c("BSBM18H", "BSBM18I", "BSBM18J",
                                   "BSBM19A", "BSBM19B", "BSBM19C")], ncol = 3)
To make sure that an Exploratory Factor Analysis makes sense on these data at all, we can look at the correlations between all possible pairs of variables: if many of them are substantial, the assumption that some latent factors are present has grounds.
library(polycor)
library(ggcorrplot)
# hetcor() computes polychoric correlations for the ordinal (factor) items
timssNOR_hetcor <- hetcor(timss_NOR)
timssNOR_cor <- round(timssNOR_hetcor$correlations, 1)
ggcorrplot(timssNOR_cor, hc.order = TRUE, type = "lower",
           ggtheme = ggplot2::theme_bw,
           colors = c("darkturquoise", "white", "#E46726"))
The heatmap above shows many strong correlations between variables (absolute values above 0.4), which means we can proceed.
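As a quick numeric check on that visual impression, we can count the variable pairs that clear the same threshold; a minimal sketch using the rounded matrix computed above:
# Number of variable pairs with |correlation| >= 0.4
sum(abs(timssNOR_cor[lower.tri(timssNOR_cor)]) >= 0.4)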
The number of latent factors presumed to be present can be determined with parallel analysis.
library(psych)
# Convert the ordinal factor items to numeric codes (1-4) for the parallel analysis
timss_NOR[sapply(timss_NOR, is.factor)] <- lapply(timss_NOR[sapply(timss_NOR, is.factor)], as.numeric)
fa.parallel(timss_NOR,
            fa = "both",
            n.iter = 100)
## Parallel analysis suggests that the number of factors = 5 and the number of components = 3
The parallel analysis suggests that 5 latent factors underlie the scales in the data: on the plot, the blue line with triangles (eigenvalues of the actual data) drops down to the red dashed line (eigenvalues of simulated random data) at five on the horizontal axis.
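If you prefer not to eyeball the plot, the suggested counts can also be read off the object fa.parallel() returns; a small sketch (plot = FALSE merely suppresses the chart):
pa <- fa.parallel(timss_NOR, fa = "both", n.iter = 100, plot = FALSE)
pa$nfact  # suggested number of factors (5 here)
pa$ncomp  # suggested number of principal components (3 here)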
Let’s trust the plot for now and try to build a factor model with 5 factors. I start with no rotation for this baseline model.
fa_n5 <- fa(timss_NOR, nfactors=5, rotate="none", fm="ml", cor='mixed')
##
## mixed.cor is deprecated, please use mixedCor.
fa.diagram(fa_n5)
Well, the visualization of this 5-factor model looks questionable. The first thing that stands out is that two of the five identified factors have no scales (variables) assigned to them at all. Another issue is that one of the three remaining factors is represented by only a single variable, whereas the common rule of thumb obliges researchers to have at least 3 scales (variables) per factor. These issues suggest reducing the number of factors and, perhaps, adding a rotation, which will likely solve the problem of the “empty” factors.
To check which rotation is needed, an orthogonal one, which assumes the factors are uncorrelated, or an oblique one, which allows them to correlate, we can look at the model’s output.
print(fa_n5$loadings, cutoff = 0, digits = 2)
##
## Loadings:
## ML1 ML2 ML3 ML4 ML5
## BSBM17A 0.89 -0.17 0.16 0.01 0.00
## BSBM17B -0.71 0.17 -0.08 0.01 0.32
## BSBM17C -0.79 0.18 -0.12 -0.01 0.25
## BSBM17D 0.81 0.03 0.19 -0.06 0.06
## BSBM17E 0.92 -0.22 0.16 0.07 0.01
## BSBM17F 0.83 -0.14 0.21 0.03 0.11
## BSBM17G 0.87 -0.18 0.19 0.02 0.07
## BSBM17H 0.87 -0.07 0.20 0.03 0.00
## BSBM17I 0.87 -0.24 0.09 0.13 -0.01
## BSBM18A 0.51 0.28 -0.15 -0.11 0.13
## BSBM18B 0.61 0.55 -0.20 -0.06 -0.01
## BSBM18C 0.77 0.29 0.10 -0.38 -0.03
## BSBM18D 0.75 0.34 0.11 -0.26 0.05
## BSBM18E 0.57 0.60 -0.14 0.04 -0.04
## BSBM18F 0.63 0.61 -0.16 0.10 -0.05
## BSBM18G 0.54 0.52 -0.11 0.07 0.04
## BSBM18H 0.54 0.58 -0.03 0.14 0.04
## BSBM18I 0.53 0.63 -0.12 0.21 0.02
## BSBM18J 0.56 0.59 -0.13 0.12 -0.03
## BSBM19A 0.69 -0.29 -0.42 -0.04 0.19
## BSBM19B -0.49 0.35 0.60 0.04 0.08
## BSBM19C -0.63 0.36 0.52 0.06 0.09
## BSBM19D 0.70 -0.25 -0.39 0.02 0.20
## BSBM19E -0.59 0.16 0.40 -0.03 0.13
##
## ML1 ML2 ML3 ML4 ML5
## SS loadings 12.0 3.34 1.51 0.36 0.31
## Proportion Var 0.5 0.14 0.06 0.01 0.01
## Cumulative Var 0.5 0.64 0.70 0.72 0.73
print(fa_n5)
## Factor Analysis using method = ml
## Call: fa(r = timss_NOR, nfactors = 5, rotate = "none", fm = "ml", cor = "mixed")
## Standardized loadings (pattern matrix) based upon correlation matrix
## ML1 ML2 ML3 ML4 ML5 h2 u2 com
## BSBM17A 0.89 -0.17 0.16 0.01 0.00 0.84 0.161 1.1
## BSBM17B -0.71 0.17 -0.08 0.01 0.32 0.64 0.358 1.5
## BSBM17C -0.79 0.18 -0.12 -0.01 0.25 0.74 0.263 1.4
## BSBM17D 0.81 0.03 0.19 -0.06 0.06 0.70 0.297 1.1
## BSBM17E 0.92 -0.22 0.16 0.07 0.01 0.92 0.081 1.2
## BSBM17F 0.83 -0.14 0.21 0.03 0.11 0.77 0.227 1.2
## BSBM17G 0.87 -0.18 0.19 0.02 0.07 0.83 0.166 1.2
## BSBM17H 0.87 -0.07 0.20 0.03 0.00 0.80 0.199 1.1
## BSBM17I 0.87 -0.24 0.09 0.13 -0.01 0.83 0.168 1.2
## BSBM18A 0.51 0.28 -0.15 -0.11 0.13 0.39 0.611 2.0
## BSBM18B 0.61 0.55 -0.20 -0.06 -0.01 0.71 0.287 2.2
## BSBM18C 0.77 0.29 0.10 -0.38 -0.03 0.83 0.167 1.8
## BSBM18D 0.75 0.34 0.11 -0.26 0.05 0.75 0.248 1.7
## BSBM18E 0.57 0.60 -0.14 0.04 -0.04 0.71 0.293 2.1
## BSBM18F 0.63 0.61 -0.16 0.10 -0.05 0.80 0.200 2.2
## BSBM18G 0.54 0.52 -0.11 0.07 0.04 0.58 0.420 2.1
## BSBM18H 0.54 0.58 -0.03 0.14 0.04 0.66 0.343 2.1
## BSBM18I 0.53 0.63 -0.12 0.21 0.02 0.74 0.258 2.3
## BSBM18J 0.56 0.59 -0.13 0.12 -0.03 0.69 0.308 2.2
## BSBM19A 0.69 -0.29 -0.42 -0.04 0.19 0.77 0.234 2.3
## BSBM19B -0.49 0.35 0.60 0.04 0.08 0.73 0.271 2.6
## BSBM19C -0.63 0.36 0.52 0.06 0.09 0.81 0.188 2.7
## BSBM19D 0.70 -0.25 -0.39 0.02 0.20 0.74 0.263 2.1
## BSBM19E -0.59 0.16 0.40 -0.03 0.13 0.55 0.453 2.1
##
## ML1 ML2 ML3 ML4 ML5
## SS loadings 12.00 3.34 1.51 0.36 0.31
## Proportion Var 0.50 0.14 0.06 0.01 0.01
## Cumulative Var 0.50 0.64 0.70 0.72 0.73
## Proportion Explained 0.68 0.19 0.09 0.02 0.02
## Cumulative Proportion 0.68 0.88 0.96 0.98 1.00
##
## Mean item complexity = 1.8
## Test of the hypothesis that 5 factors are sufficient.
##
## The degrees of freedom for the null model are 276 and the objective function was 23.69 with Chi Square of 113368.8
## The degrees of freedom for the model are 166 and the objective function was 0.78
##
## The root mean square of the residuals (RMSR) is 0.01
## The df corrected root mean square of the residuals is 0.02
##
## The harmonic number of observations is 4654 with the empirical chi square 557.67 with prob < 7.9e-44
## The total number of observations was 4795 with Likelihood Chi Square = 3735.32 with prob < 0
##
## Tucker Lewis Index of factoring reliability = 0.947
## RMSEA index = 0.067 and the 90 % confidence intervals are 0.065 0.069
## BIC = 2328.42
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy
## ML1 ML2 ML3 ML4 ML5
## Correlation of (regression) scores with factors 0.99 0.96 0.93 0.80 0.72
## Multiple R square of scores with factors 0.98 0.93 0.86 0.64 0.52
## Minimum correlation of possible factor scores 0.97 0.86 0.73 0.27 0.04
It can be seen that, for the last two factors, the minimum correlation of possible factor scores is quite low (0.27 for ML4 and 0.04 for ML5), another argument against keeping five factors. Also, judging by the factor loadings, some variables (scales) cannot be attributed to a single factor, e.g. BSBM18B.
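The com column in the output above quantifies the cross-loading problem: Hoffman’s item complexity is near 1 when a variable loads on a single factor and grows above 2 for cross-loaders like BSBM18B. A quick way to list the worst offenders:
# Sort items by complexity; values well above 1 indicate cross-loading
round(sort(fa_n5$complexity, decreasing = TRUE), 1)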
Let’s construct a factor model with 3 factors (the number of components the parallel analysis suggested) and an oblique rotation, which allows the factors to correlate.
fa_n3 <- fa(timss_NOR, nfactors=3, rotate="oblimin", fm="ml", cor='mixed')
##
## mixed.cor is deprecated, please use mixedCor.
fa.diagram(fa_n3)
It looks so much better! All three factors now have scales assigned to them, and the correlations between factors are depicted with connecting arrows. Not all pairs are correlated: arrows appear only for the ML1-ML3 and ML1-ML2 pairs, though those correlation values are quite high.
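The exact inter-factor correlations behind those arrows are stored in the model’s Phi matrix (the diagram only draws arrows for correlations above its cutoff):
# Inter-factor correlations estimated under the oblimin rotation
round(fa_n3$Phi, 2)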
Now let’s look at the loadings:
print(fa_n3$loadings, cutoff = 0, digits = 2)
##
## Loadings:
## ML1 ML2 ML3
## BSBM17A 0.89 0.00 -0.04
## BSBM17B -0.66 0.01 0.12
## BSBM17C -0.78 0.03 0.08
## BSBM17D 0.77 0.17 0.08
## BSBM17E 0.94 -0.05 -0.07
## BSBM17F 0.90 -0.01 0.03
## BSBM17G 0.93 -0.04 -0.01
## BSBM17H 0.87 0.08 0.04
## BSBM17I 0.84 -0.06 -0.14
## BSBM18A 0.08 0.49 -0.15
## BSBM18B -0.04 0.83 -0.13
## BSBM18C 0.49 0.46 0.05
## BSBM18D 0.48 0.50 0.10
## BSBM18E -0.03 0.85 -0.03
## BSBM18F -0.01 0.88 -0.05
## BSBM18G 0.02 0.74 -0.02
## BSBM18H 0.08 0.78 0.10
## BSBM18I -0.04 0.86 0.02
## BSBM18J -0.02 0.83 -0.01
## BSBM19A 0.22 0.04 -0.69
## BSBM19B 0.13 -0.01 0.92
## BSBM19C -0.08 0.01 0.86
## BSBM19D 0.24 0.07 -0.64
## BSBM19E -0.08 -0.15 0.63
##
## ML1 ML2 ML3
## SS loadings 7.08 5.55 2.97
## Proportion Var 0.29 0.23 0.12
## Cumulative Var 0.29 0.53 0.65
The SS loadings indicate that all three factors are worth retaining: for each of them the value is greater than 1. And, judging by the loadings, every variable can be attributed to at least one of the factors: there is no variable whose loadings are below 0.4 on all three.
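The same conclusion is easier to see if the loadings are reprinted with a 0.4 cutoff, which blanks out the smaller entries:
print(fa_n3$loadings, cutoff = 0.4, digits = 2)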
So, I am pretty satisfied with the layout of this model and will stick to it, proceeding to assess its validity and fit.
fa_n3$TLI
## [1] 0.9240114
fa_n3$RMSEA[1]
## RMSEA
## 0.08066856
The first value listed above is the Tucker-Lewis Index (TLI), which, being greater than 0.9, indicates an acceptable model. The second is the Root Mean Square Error of Approximation (RMSEA), which also supports the model’s validity: its value of 0.08 sits at the upper bound of what is conventionally acceptable for this measure.
Now, the reliability of each factor can be assessed. This is done with Cronbach’s alpha: a value above 0.7 indicates good internal consistency.
# check.keys = TRUE reverse-scores negatively keyed items (marked with "-" in the output)
ML1 <- as.data.frame(timss_NOR[c("BSBM17A", "BSBM17B", "BSBM17C", "BSBM17D", "BSBM17E",
                                 "BSBM17F", "BSBM17G", "BSBM17H", "BSBM17I", "BSBM18C")])
alpha(ML1,
      check.keys = T,
      keys = T)
##
## Reliability analysis
## Call: alpha(x = ML1, keys = T, check.keys = T)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.94 0.95 0.95 0.63 17 0.0012 2.4 0.8 0.65
##
## lower alpha upper 95% confidence boundaries
## 0.94 0.94 0.95
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## BSBM17A 0.94 0.94 0.94 0.62 15 0.0013 0.0095 0.63
## BSBM17B- 0.94 0.94 0.94 0.66 17 0.0012 0.0083 0.67
## BSBM17C- 0.94 0.94 0.94 0.64 16 0.0013 0.0103 0.64
## BSBM17D 0.94 0.94 0.94 0.64 16 0.0013 0.0102 0.65
## BSBM17E 0.93 0.93 0.93 0.61 14 0.0014 0.0075 0.63
## BSBM17F 0.94 0.94 0.94 0.63 15 0.0013 0.0094 0.64
## BSBM17G 0.94 0.94 0.94 0.62 15 0.0014 0.0088 0.63
## BSBM17H 0.94 0.94 0.94 0.63 15 0.0013 0.0096 0.63
## BSBM17I 0.94 0.94 0.94 0.63 15 0.0013 0.0089 0.63
## BSBM18C 0.95 0.95 0.95 0.66 18 0.0012 0.0065 0.67
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## BSBM17A 4716 0.87 0.87 0.86 0.83 2.1 0.94
## BSBM17B- 4697 0.74 0.73 0.68 0.66 2.2 1.09
## BSBM17C- 4641 0.81 0.80 0.78 0.76 2.6 1.01
## BSBM17D 4689 0.78 0.79 0.76 0.73 2.1 0.91
## BSBM17E 4681 0.91 0.91 0.91 0.88 2.3 1.03
## BSBM17F 4713 0.83 0.83 0.81 0.79 2.7 0.91
## BSBM17G 4694 0.87 0.87 0.86 0.84 2.4 0.99
## BSBM17H 4691 0.86 0.86 0.84 0.82 2.7 0.96
## BSBM17I 4717 0.84 0.83 0.82 0.79 2.7 1.15
## BSBM18C 4716 0.68 0.69 0.64 0.62 1.9 0.84
##
## Non missing response frequency for each item
## 1 2 3 4 miss
## BSBM17A 0.29 0.42 0.19 0.10 0.02
## BSBM17B 0.16 0.24 0.24 0.36 0.02
## BSBM17C 0.21 0.37 0.24 0.18 0.03
## BSBM17D 0.29 0.41 0.22 0.08 0.02
## BSBM17E 0.26 0.35 0.23 0.16 0.02
## BSBM17F 0.11 0.32 0.38 0.20 0.02
## BSBM17G 0.21 0.36 0.27 0.17 0.02
## BSBM17H 0.12 0.30 0.35 0.23 0.02
## BSBM17I 0.21 0.20 0.24 0.35 0.02
## BSBM18C 0.35 0.43 0.18 0.04 0.02
The first factor (ML1) shows really good internal consistency, as its alpha is above 0.9.
ML2 <- as.data.frame(timss_NOR[c("BSBM18A", "BSBM18B", "BSBM18D", "BSBM18E", "BSBM18F",
                                 "BSBM18G", "BSBM18H", "BSBM18J", "BSBM18I")])
alpha(ML2, check.keys = T, keys = T)
##
## Reliability analysis
## Call: alpha(x = ML2, keys = T, check.keys = T)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.91 0.91 0.91 0.52 9.8 0.002 1.8 0.63 0.54
##
## lower alpha upper 95% confidence boundaries
## 0.9 0.91 0.91
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## BSBM18A 0.91 0.91 0.90 0.56 10.2 0.0020 0.0038 0.54
## BSBM18B 0.89 0.89 0.89 0.51 8.4 0.0023 0.0092 0.54
## BSBM18D 0.90 0.90 0.90 0.53 9.1 0.0021 0.0100 0.54
## BSBM18E 0.89 0.89 0.89 0.51 8.4 0.0023 0.0084 0.52
## BSBM18F 0.89 0.89 0.88 0.50 8.2 0.0024 0.0069 0.53
## BSBM18G 0.90 0.90 0.89 0.52 8.7 0.0022 0.0104 0.54
## BSBM18H 0.90 0.90 0.89 0.52 8.6 0.0023 0.0092 0.53
## BSBM18J 0.90 0.90 0.89 0.52 8.6 0.0023 0.0088 0.53
## BSBM18I 0.89 0.89 0.89 0.51 8.5 0.0023 0.0081 0.53
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## BSBM18A 4708 0.59 0.60 0.52 0.50 1.8 0.75
## BSBM18B 4712 0.80 0.80 0.77 0.73 1.8 0.83
## BSBM18D 4699 0.72 0.71 0.66 0.63 2.2 0.90
## BSBM18E 4701 0.80 0.80 0.77 0.73 1.8 0.84
## BSBM18F 4693 0.83 0.83 0.81 0.77 1.6 0.83
## BSBM18G 4679 0.76 0.75 0.71 0.68 2.0 0.88
## BSBM18H 4711 0.77 0.77 0.74 0.70 1.8 0.86
## BSBM18J 4710 0.77 0.78 0.74 0.70 1.6 0.77
## BSBM18I 4707 0.78 0.78 0.76 0.72 1.6 0.78
##
## Non missing response frequency for each item
## 1 2 3 4 miss
## BSBM18A 0.36 0.49 0.13 0.03 0.02
## BSBM18B 0.44 0.38 0.14 0.04 0.02
## BSBM18D 0.24 0.40 0.27 0.09 0.02
## BSBM18E 0.43 0.37 0.16 0.04 0.02
## BSBM18F 0.55 0.30 0.10 0.04 0.02
## BSBM18G 0.34 0.40 0.20 0.06 0.02
## BSBM18H 0.46 0.35 0.15 0.04 0.02
## BSBM18J 0.57 0.31 0.09 0.03 0.02
## BSBM18I 0.57 0.30 0.10 0.03 0.02
The second factor (ML2) is also highly reliable: its alpha is above 0.9 as well.
ML3 <- as.data.frame(timss_NOR[c("BSBM19A", "BSBM19B", "BSBM19C", "BSBM19D", "BSBM19E")])
alpha(ML3,
check.keys = T,
keys = T)
##
## Reliability analysis
## Call: alpha(x = ML3, keys = T, check.keys = T)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.87 0.87 0.85 0.57 6.7 0.0029 3 0.77 0.56
##
## lower alpha upper 95% confidence boundaries
## 0.86 0.87 0.88
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## BSBM19A- 0.84 0.84 0.80 0.56 5.2 0.0037 0.0058 0.55
## BSBM19B 0.84 0.84 0.81 0.57 5.4 0.0037 0.0078 0.59
## BSBM19C 0.82 0.82 0.79 0.54 4.7 0.0042 0.0065 0.52
## BSBM19D- 0.84 0.84 0.81 0.57 5.3 0.0036 0.0068 0.56
## BSBM19E 0.87 0.87 0.84 0.62 6.6 0.0031 0.0035 0.63
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## BSBM19A- 4701 0.82 0.83 0.78 0.72 3.1 0.84
## BSBM19B 4708 0.82 0.81 0.75 0.70 3.0 1.00
## BSBM19C 4703 0.87 0.86 0.83 0.77 3.0 1.05
## BSBM19D- 4687 0.81 0.82 0.76 0.70 2.8 0.93
## BSBM19E 4672 0.74 0.74 0.63 0.60 3.3 0.89
##
## Non missing response frequency for each item
## 1 2 3 4 miss
## BSBM19A 0.39 0.42 0.15 0.05 0.02
## BSBM19B 0.09 0.22 0.28 0.41 0.02
## BSBM19C 0.12 0.18 0.27 0.43 0.02
## BSBM19D 0.26 0.40 0.25 0.10 0.02
## BSBM19E 0.04 0.17 0.28 0.51 0.03
For the final, third factor (ML3) the alpha drops a little, yet it is still quite high at 0.87. Thus, it can be concluded that all 3 factors are reliable and all the variables assigned to them should be kept as they are.
Now that the latent factors have been discovered, all that is left is to label them. Considering the scales that ended up in each factor, I would label them as follows:
ML1 = “Interest in learning”, as its questions mostly concern students’ interest in doing mathematical exercises, in the way the subject is taught, in learning other things besides mathematics, etc.
ML2 = “Attitudes towards a teacher”. Here, the questions assess a student’s attitude towards the teacher personally and towards the way the teacher teaches the subject.
ML3 = “Performance in Mathematics”. The questions (scales) in this factor concern how well a student performs in Math from his/her own point of view.
With the Exploratory Factor Analysis finished, we can play around a little and build a regression model predicting Math achievement (BSMMAT01) from the factors we acquired. I am also going to add the student’s gender (a throwback to the STEM gender-inequality debate mentioned at the very beginning), parental education, and whether the student was born in the country or outside as controls, to see whether achievement in Mathematics differs across the levels of these controls.
fascores <- as.data.frame(fa_n3$scores)
timssNOR_lm <- cbind(timss_NOR, fascores, timss_2015$BSMMAT01, timss_2015$BSBG10A,
                     timss_2015$BSBG01, timss_2015$BSDGEDUP)
timssNOR_lm <- timssNOR_lm %>% rename(BSMMAT01 = `timss_2015$BSMMAT01`,
                                      BSB_bornincountry = `timss_2015$BSBG10A`,
                                      BSB_gender = `timss_2015$BSBG01`,
                                      BSB_parentedu = `timss_2015$BSDGEDUP`)
# Caution: if read.spss() imported BSMMAT01 as a factor, as.numeric() returns its
# level codes rather than the original TIMSS score scale, so coefficients are best
# interpreted in relative terms
timssNOR_lm$BSMMAT01 <- as.numeric(timssNOR_lm$BSMMAT01)
lm1 <- lm(BSMMAT01 ~ ML1 + ML2 + ML3, data = timssNOR_lm)
lm2 <- lm(BSMMAT01 ~ ML1 + ML2 + ML3 + BSB_bornincountry + BSB_gender + BSB_parentedu,
          data = timssNOR_lm)
summary(lm1)
##
## Call:
## lm(formula = BSMMAT01 ~ ML1 + ML2 + ML3, data = timssNOR_lm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3420.9 -780.2 57.0 856.4 3426.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2447.15 16.99 144.075 < 2e-16 ***
## ML1 14.28 25.13 0.568 0.56977
## ML2 61.36 20.38 3.010 0.00263 **
## ML3 860.35 22.16 38.827 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1082 on 4058 degrees of freedom
## (733 observations deleted due to missingness)
## Multiple R-squared: 0.3792, Adjusted R-squared: 0.3788
## F-statistic: 826.3 on 3 and 4058 DF, p-value: < 2.2e-16
summary(lm2)
##
## Call:
## lm(formula = BSMMAT01 ~ ML1 + ML2 + ML3 + BSB_bornincountry +
## BSB_gender + BSB_parentedu, data = timssNOR_lm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3539.3 -749.8 61.3 827.5 4430.8
##
## Coefficients:
## Estimate Std. Error
## (Intercept) 2714.447 32.012
## ML1 7.552 25.214
## ML2 53.018 20.481
## ML3 824.766 22.435
## BSB_bornincountryNo -347.423 63.971
## BSB_genderBoy -91.606 34.094
## BSB_parenteduPost-secondary but not University -319.806 52.647
## BSB_parenteduUpper Secondary -523.284 90.958
## BSB_parenteduLower Secondary -678.977 137.537
## BSB_parenteduSome Primary, Lower Secondary or No School -858.333 218.754
## BSB_parenteduDon't Know -238.120 38.588
## t value Pr(>|t|)
## (Intercept) 84.795 < 2e-16 ***
## ML1 0.300 0.76457
## ML2 2.589 0.00967 **
## ML3 36.762 < 2e-16 ***
## BSB_bornincountryNo -5.431 5.95e-08 ***
## BSB_genderBoy -2.687 0.00724 **
## BSB_parenteduPost-secondary but not University -6.075 1.36e-09 ***
## BSB_parenteduUpper Secondary -5.753 9.44e-09 ***
## BSB_parenteduLower Secondary -4.937 8.28e-07 ***
## BSB_parenteduSome Primary, Lower Secondary or No School -3.924 8.87e-05 ***
## BSB_parenteduDon't Know -6.171 7.48e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1059 on 3913 degrees of freedom
## (871 observations deleted due to missingness)
## Multiple R-squared: 0.4018, Adjusted R-squared: 0.4002
## F-statistic: 262.8 on 10 and 3913 DF, p-value: < 2.2e-16
The first output above summarizes the regression model with only the 3 factors as predictors; the second belongs to the model with the factors plus the control variables.
Both models are statistically significant, and the model’s performance improves after adding the controls: the R-squared value increases from 0.38 to 0.40. Both values are high enough by social-science standards to call these good models for explaining the association between the predictors and the outcome variable. Note that, among the three factors, the effect of ML1 (“Interest in learning”) is not statistically significant; this holds in both models, so it is not an artifact of the controls. The other two factors are significant and positively associated with Math achievement in both models, meaning that the higher a student scores on “Attitudes towards a teacher” and “Performance in Mathematics”, the higher their achievement in Mathematics. In the second model, the controls are also statistically significant. Being a boy is, surprisingly, negatively associated with the outcome: a boy’s Math achievement is 91.6 units lower, all other variables held constant. The same direction is observed for being born outside the country, which corresponds to a 347.4-unit drop. All parental-education categories are likewise associated with lower achievement relative to the reference “Higher education” category, other variables held constant.
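A formal comparison of the two models requires fitting them to the same rows, since lm() dropped different numbers of incomplete cases (733 vs 871). A sketch of such an F-test on the shared complete cases:
# Refit both models on rows complete for all second-model variables,
# then test whether the added controls significantly improve fit
vars_full <- c("BSMMAT01", "ML1", "ML2", "ML3",
               "BSB_bornincountry", "BSB_gender", "BSB_parentedu")
cc <- complete.cases(timssNOR_lm[, vars_full])
lm1_cc <- update(lm1, data = timssNOR_lm[cc, ])
lm2_cc <- update(lm2, data = timssNOR_lm[cc, ])
anova(lm1_cc, lm2_cc)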
To make sure no regression assumptions are violated, I will run some model diagnostics. First of all, according to the histograms below, the residuals of both models are distributed almost normally, which is a good sign.
par(mfrow = c(1, 2))
hist(resid(lm1),
xlab = "Residuals",
main = "Histogram of Residuals, Model 1",
col = "darkolivegreen2",
border = "black",
breaks = 20)
hist(resid(lm2),
xlab = "Residuals",
main = "Histogram of Residuals, Model 2",
col = "darkolivegreen2",
border = "black",
breaks = 20)
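Histograms can hide deviations in the tails; an optional Q-Q check of the same residuals:
par(mfrow = c(1, 2))
qqnorm(resid(lm1), main = "Normal Q-Q, Model 1"); qqline(resid(lm1))
qqnorm(resid(lm2), main = "Normal Q-Q, Model 2"); qqline(resid(lm2))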
Then, a check of the variance inflation factors (VIF) shows that everything is okay and no multicollinearity is present in either model: no VIF (or adjusted GVIF) value exceeds 5.
library(car)
vif(lm1)
## ML1 ML2 ML3
## 2.380667 1.517942 1.745981
vif(lm2)
## GVIF Df GVIF^(1/(2*Df))
## ML1 2.402638 1 1.550045
## ML2 1.532760 1 1.238047
## ML3 1.800557 1 1.341848
## BSB_bornincountry 1.011003 1 1.005486
## BSB_gender 1.016605 1 1.008268
## BSB_parentedu 1.057944 5 1.005649
Finally, influential observations are checked. For the first model there are some high-influence points (the labeled peaks on the Cook’s distance plot, left), yet on the residuals-vs-leverage plot (right) they do not fall outside the Cook’s distance contours, which makes them tolerable.
par(mfrow = c(1, 2))
plot(lm1, 4)
plot(lm1, 5)
Much the same pattern is observed for the second model. That means neither model violates the regression assumptions drastically.
par(mfrow = c(1, 2))
plot(lm2, 4)
plot(lm2, 5)
The results of the presented analysis can be briefly summarized in the following conclusions.
Questions assessing students’ attitudes towards Mathematics from TIMSS 2015 have a statistically reliable underlying structure, in which 3 latent factors can be found. The factors can be labeled as “Interest in learning”, “Attitudes towards a teacher”, and “Performance in Mathematics”.
38% of the variation in students’ Mathematics achievement can be explained by the 3 discovered factor scores alone, and 40% by the same scores together with the student’s gender, parental education, and an indicator of whether the student was born in the country. Factor scores underlying students’ attitudes towards Mathematics are positively associated with achievement in this subject (significantly so for two of the three factors), whereas all the other mentioned variables are negatively associated with it.