knitr::opts_chunk$set(message = FALSE)
knitr::opts_chunk$set(warning  = FALSE)

Aim of the study and research question

This research project examines Japanese students’ attitudes towards mathematics, how the variables used in the analysis group into factors, and how those factors affect students’ academic performance. The research question is: which metrics related to mathematics lessons influence the academic performance of Japanese students? The analysis uses TIMSS 2015 data for Japan.

Data preparation and descriptive statistics

A total of 4,745 responses were received from students in Japan. I used 29 variables to answer the research question; each is listed in the table below. For convenience in interpreting the results, the variables have been renamed. The outcome variable that will subsequently be predicted is “BSMMAT01”, renamed “success”, which measures the student’s mathematics achievement.

Twenty-four variables related to mathematics were used for the factor analysis. The first 9 variables capture students’ attitudes towards learning mathematics (whether they enjoy it, find it interesting, and so on); their common question is: “How much do you agree with these statements about learning mathematics?”. The following 10 variables concern the student’s attitude towards mathematics lessons and the teacher, and answer the common question: “How much do you agree with these statements about your mathematics lessons?”. The last 5 questions examine the student’s general attitude towards mathematics, in particular how easy or difficult the subject is for him or her; their common question is: “How much do you agree with these statements about mathematics?”.

As control variables for the later regression analysis, I used 4 additional variables: gender, mother’s education, father’s education, and whether the student was born in Japan.

library(haven)
library(dplyr)
library(foreign)
library(psych)
library(polycor)
library(corrplot)
library(sjPlot)
library(summarytools)
library(lmtest)
library(car)
library(gridExtra)
library(ggplot2)
library(GPArotation)
library(ggcorrplot)
library(DT)
library(gtable)
library(grid)
library(knitr)
library(kableExtra)
data <- read.spss("~/Downloads/BSGJPNM6.sav", to.data.frame = TRUE, use.value.labels = TRUE) %>% select(BSBM17A, BSBM17B, BSBM17C, BSBM17D, BSBM17E, BSBM17F, BSBM17G, BSBM17H, BSBM17I, BSBM18A, BSBM18B, BSBM18C, BSBM18D, BSBM18E, BSBM18F, BSBM18G, BSBM18H, BSBM18I, BSBM18J, BSBM19A, BSBM19B, BSBM19C, BSBM19D, BSBM19E, BSBG01, BSBG07A, BSBG07B, BSBG10A, BSMMAT01)


data = data %>% rename( enjoy= BSBM17A, study = BSBM17B, boring = BSBM17C, intrst = BSBM17D, likemath = BSBM17E, likenumb = BSBM17F, likesolve= BSBM17G, lookfrwrd= BSBM17H, fav = BSBM17I, expect = BSBM18A, undrstnd= BSBM18B, intrst_teacher= BSBM18C, intrst_things = BSBM18D, answers= BSBM18E, explain = BSBM18F, learned = BSBM18G, help = BSBM18H, mistake = BSBM18I, listen = BSBM18J, dowell= BSBM19A, diffclt = BSBM19B, not_strength= BSBM19C, quickly= BSBM19D, nervous = BSBM19E, sex = BSBG01, mother = BSBG07A, father = BSBG07B, country = BSBG10A, success = BSMMAT01)

data$success = as.numeric(as.character(data$success))

`Name of variable` <- c("BSBG01", "BSBG07A", "BSBG07B", "BSBG10A", "BSMMAT01", "BSBM17A", "BSBM17B", "BSBM17C", "BSBM17D", "BSBM17E", "BSBM17F", "BSBM17G", "BSBM17H", "BSBM17I", "BSBM18A", "BSBM18B", "BSBM18C", "BSBM18D", "BSBM18E", "BSBM18F", "BSBM18G", "BSBM18H", "BSBM18I", "BSBM18J", "BSBM19A", "BSBM19B", "BSBM19C", "BSBM19D", "BSBM19E")

`Question` <- c("Are you a girl or a boy?",
                "What is the highest level of education completed by your mother (or stepmother or female guardian)?",
                "What is the highest level of education completed by your father (or stepfather or male guardian)?",
                "Were you born in Japan?",
                "Assesment of student's achievement",
                "How much do you agree with these statements about learning mathematics? I enjoy learning mathematics",
                "How much do you agree with these statements about learning mathematics? I wish I did not have to study mathematics",
                "How much do you agree with these statements about learning mathematics? Mathematics is boring", 
                "How much do you agree with these statements about learning mathematics? I learn many interesting things in mathematics",
                "How much do you agree with these statements about learning mathematics? I like mathematics",
                "How much do you agree with these statements about learning mathematics? I like any schoolwork that involves numbers", 
                "How much do you agree with these statements about learning mathematics? I like to solve mathematics problems",
                "How much do you agree with these statements about learning mathematics? I look forward to mathematics class",
                "How much do you agree with these statements about learning mathematics? Mathematics is one of my favorite subjects",
                "How much do you agree with these statements about your mathematics lessons? I know what my teacher expects me to do", 
                "How much do you agree with these statements about your mathematics lessons? My teacher is easy to understand", 
                "How much do you agree with these statements about your mathematics lessons? I am interested in what my teacher says", 
                "How much do you agree with these statements about your mathematics lessons? My teacher gives me interesting things to do", 
                "How much do you agree with these statements about your mathematics lessons? My teacher has clear answers to my questions", 
                "How much do you agree with these statements about your mathematics lessons? My teacher is good at explaining mathematics", 
                "How much do you agree with these statements about your mathematics lessons? My teacher lets me show what I have learned", 
                "How much do you agree with these statements about your mathematics lessons? My teacher does a variety of things to help us learn", 
                "How much do you agree with these statements about your mathematics lessons? My teacher tells me how to do better when I make a mistake", 
                "How much do you agree with these statements about your mathematics lessons? My teacher listens to what I have to say", 
                "How much do you agree with these statements about mathematics? I usually do well in mathematics", 
                "How much do you agree with these statements about mathematics? Mathematics is more difficult for me than for many of my classmates", 
                "How much do you agree with these statements about mathematics? Mathematics is not one of my strengths", 
                "How much do you agree with these statements about mathematics? I learn things quickly in mathematics", 
                "How much do you agree with these statements about mathematics? Mathematics makes me nervous")

`New name` <- c("sex" , "mother" , "father" , "country" , "success" , "enjoy", "study", "boring", "intrst", "likemath", "likenumb", "likesolve", "lookfrwrd", "fav", "expect", "undrstnd", "intrst_teacher", "intrst_things", "answers", "explain", "learned", "help", "mistake", "listen", "dowell", "diffclt", "not_strength", "quickly", "nervous")

tab <- data.frame(`Name of variable`, `Question`,`New name`)

tab %>% datatable(colnames = c('Previous name' = 2, 'New name' = 4), options = list(pageLength=5, scrollX='400px', scrollY='270px'))

It is also important to examine missing values. First, let’s look at what share of observations contain at least one missing value.

kable(paste0(round(((nrow(data) - nrow(na.omit(data)))/nrow(data))*100,2), "%"), col.names = "Percent of NA") %>% 
  kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)
Percent of NA
4%

Missing data accounts for about 4% of the total. This percentage is small, so I decided to use imputation, which fills in missing values with plausible substitutes; this approach is most useful precisely when the share of missing data is small. Before imputing, we need to see which specific variables contain NAs, so I summarise each variable.
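
The chunk that produced the overview below is not echoed in the document. A minimal sketch of how such a per-variable summary could be obtained (assuming summarytools::dfSummary, which matches the layout of the output, plus a quick NA count per variable):

# Quick check: number of missing values in each variable
sort(colSums(is.na(data)), decreasing = TRUE)

# Per-variable overview: levels, frequencies and missing counts (summarytools is loaded above)
print(dfSummary(data, graph.col = FALSE), method = "render")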

Variable overview (counts and % of valid responses; last column = missing values):

Variable          Agree a lot     Agree a little    Disagree a little   Disagree a lot    Missing
enjoy             739 (15.6%)     1741 (36.8%)      1626 (34.3%)        631 (13.3%)       8 (0.2%)
study             588 (12.4%)     1191 (25.2%)      1993 (42.1%)        962 (20.3%)       11 (0.2%)
boring            501 (10.6%)     1304 (27.7%)      2185 (46.4%)        717 (15.2%)       38 (0.8%)
intrst            553 (11.7%)     1607 (34.0%)      1997 (42.3%)        567 (12.0%)       21 (0.4%)
likemath          707 (15.0%)     1315 (27.9%)      1732 (36.7%)        961 (20.4%)       30 (0.6%)
likenumb          265 (5.6%)      925 (19.5%)       2571 (54.3%)        975 (20.6%)       9 (0.2%)
likesolve         592 (12.5%)     1402 (29.6%)      1825 (38.6%)        914 (19.3%)       12 (0.3%)
lookfrwrd         337 (7.1%)      1107 (23.4%)      2219 (46.9%)        1070 (22.6%)      12 (0.3%)
fav               631 (13.3%)     953 (20.1%)       1798 (38.0%)        1353 (28.6%)      10 (0.2%)
expect            232 (4.9%)      1185 (25.0%)      2386 (50.4%)        931 (19.7%)       11 (0.2%)
undrstnd          937 (19.8%)     2466 (52.1%)      973 (20.6%)         357 (7.5%)        12 (0.3%)
intrst_teacher    645 (13.6%)     1793 (37.9%)      1729 (36.6%)        563 (11.9%)       15 (0.3%)
intrst_things     253 (5.4%)      1172 (24.8%)      2482 (52.5%)        821 (17.4%)       17 (0.4%)
answers           1048 (22.2%)    2580 (54.6%)      843 (17.8%)         257 (5.4%)        17 (0.4%)
explain           1149 (24.3%)    2352 (49.7%)      906 (19.2%)         323 (6.8%)        15 (0.3%)
learned           343 (7.3%)      1504 (31.9%)      2161 (45.8%)        712 (15.1%)       25 (0.5%)
help              1084 (22.9%)    2660 (56.2%)      761 (16.1%)         224 (4.7%)        16 (0.3%)
mistake           1024 (21.6%)    2543 (53.7%)      912 (19.3%)         253 (5.3%)        13 (0.3%)
listen            736 (15.6%)     2356 (49.8%)      1244 (26.3%)        396 (8.4%)        13 (0.3%)
dowell            275 (5.8%)      993 (21.0%)       2274 (48.0%)        1191 (25.2%)      12 (0.3%)
diffclt           754 (15.9%)     1519 (32.1%)      1875 (39.6%)        588 (12.4%)       9 (0.2%)
not_strength      1507 (31.9%)    1398 (29.6%)      1217 (25.7%)        606 (12.8%)       17 (0.4%)
quickly           348 (7.4%)      1295 (27.4%)      2422 (51.2%)        662 (14.0%)       18 (0.4%)
nervous           544 (11.5%)     1070 (22.6%)      2092 (44.2%)        1024 (21.6%)      15 (0.3%)

Remaining variables:

sex: Girl 2417 (51.0%), Boy 2325 (49.0%); missing 3 (0.1%)
mother: Some Primary or Lower sec 2 (0.0%), Lower secondary 94 (2.0%), Upper secondary 1354 (28.7%), Post-secondary non-tertiary 99 (2.1%), Short-cycle tertiary 995 (21.1%), Bachelor’s or equivalent 827 (17.5%), Postgraduate degree 35 (0.7%), Don’t know 1315 (27.9%); missing 24 (0.5%)
father: Some Primary or Lower sec 5 (0.1%), Lower secondary 133 (2.8%), Upper secondary 1053 (22.4%), Post-secondary non-tertiary 103 (2.2%), Short-cycle tertiary 301 (6.4%), Bachelor’s or equivalent 1341 (28.5%), Postgraduate degree 107 (2.3%), Don’t know 1668 (35.4%); missing 34 (0.7%)
country (born in Japan): Yes 4696 (99.1%), No 41 (0.9%); missing 8 (0.2%)
success (numeric): mean (sd) = 585.3 (88.4); min ≤ median ≤ max = 243.2 ≤ 586.8 ≤ 865.1; IQR (CV) = 117.2 (0.2); 4743 distinct values; 0 (0.0%) missing

As can be seen from the results, there are missing values in every factor variable. To replace them, I use the mode of each variable. After imputation, the same check shows that no missing values remain.

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

data =  data %>% mutate_if(is.factor, funs(replace(.,is.na(.), Mode(na.omit(.)))))

kable(paste0(round(((nrow(data) - nrow(na.omit(data)))/nrow(data))*100,2), "%"), col.names = "Percent of NA") %>% 
  kable_styling(bootstrap_options=c("bordered", "responsive","striped"), full_width = FALSE)
Percent of NA
0%

Distribution of variables connected with mathematics

Before proceeding with the factor analysis, it is necessary to look at the distribution of the 24 variables related to mathematics. All of them measure how much a student agrees with a statement about mathematics lessons or about his or her attitude towards mathematics. Each variable has 4 levels, from “Agree a lot” to “Disagree a lot”. The summarytools package gives a convenient view of the distribution of responses for each variable.
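
The chunk behind the summary below is again not shown. As an alternative way to view all 24 distributions at once (a sketch using tidyr and ggplot2, not the method used in the original), the items can be reshaped to long format and plotted as faceted bar charts:

library(tidyr)

# One row per student x item, then a small bar chart per item
data %>%
  select(enjoy:nervous) %>%
  pivot_longer(everything(), names_to = "item", values_to = "response") %>%
  ggplot(aes(x = response)) +
  geom_bar() +
  facet_wrap(~ item, ncol = 4) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))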

Responses after imputation (counts and % of valid responses):

Variable          Agree a lot     Agree a little    Disagree a little   Disagree a lot
enjoy             739 (15.6%)     1749 (36.9%)      1626 (34.3%)        631 (13.3%)
study             588 (12.4%)     1191 (25.1%)      2004 (42.2%)        962 (20.3%)
boring            501 (10.6%)     1304 (27.5%)      2223 (46.8%)        717 (15.1%)
intrst            553 (11.7%)     1607 (33.9%)      2018 (42.5%)        567 (11.9%)
likemath          707 (14.9%)     1315 (27.7%)      1762 (37.1%)        961 (20.3%)
likenumb          265 (5.6%)      925 (19.5%)       2580 (54.4%)        975 (20.5%)
likesolve         592 (12.5%)     1402 (29.5%)      1837 (38.7%)        914 (19.3%)
lookfrwrd         337 (7.1%)      1107 (23.3%)      2231 (47.0%)        1070 (22.6%)
fav               631 (13.3%)     953 (20.1%)       1808 (38.1%)        1353 (28.5%)
expect            232 (4.9%)      1185 (25.0%)      2397 (50.5%)        931 (19.6%)
undrstnd          937 (19.7%)     2478 (52.2%)      973 (20.5%)         357 (7.5%)
intrst_teacher    645 (13.6%)     1808 (38.1%)      1729 (36.4%)        563 (11.9%)
intrst_things     253 (5.3%)      1172 (24.7%)      2499 (52.7%)        821 (17.3%)
answers           1048 (22.1%)    2597 (54.7%)      843 (17.8%)         257 (5.4%)
explain           1149 (24.2%)    2367 (49.9%)      906 (19.1%)         323 (6.8%)
learned           343 (7.2%)      1504 (31.7%)      2186 (46.1%)        712 (15.0%)
help              1084 (22.8%)    2676 (56.4%)      761 (16.0%)         224 (4.7%)
mistake           1024 (21.6%)    2556 (53.9%)      912 (19.2%)         253 (5.3%)
listen            736 (15.5%)     2369 (49.9%)      1244 (26.2%)        396 (8.3%)
dowell            275 (5.8%)      993 (20.9%)       2286 (48.2%)        1191 (25.1%)
diffclt           754 (15.9%)     1519 (32.0%)      1884 (39.7%)        588 (12.4%)
not_strength      1524 (32.1%)    1398 (29.5%)      1217 (25.6%)        606 (12.8%)
quickly           348 (7.3%)      1295 (27.3%)      2440 (51.4%)        662 (14.0%)
nervous           544 (11.5%)     1070 (22.6%)      2107 (44.4%)        1024 (21.6%)

Interpretation:

  • Some of the variables are fairly symmetrically distributed, with relatively few students choosing the extreme options 1 or 4. However, the items “Mathematics is boring”, “I wish I did not have to study mathematics”, “I like any schoolwork that involves numbers”, “I look forward to mathematics class”, “Mathematics is one of my favorite subjects”, “I know what my teacher expects me to do”, “I usually do well in mathematics”, “I learn things quickly in mathematics”, “Mathematics makes me nervous” and “My teacher gives me interesting things to do” are shifted towards “Disagree a little” and “Disagree a lot”. In contrast, the items “My teacher is easy to understand”, “My teacher has clear answers to my questions”, “My teacher is good at explaining mathematics”, “My teacher does a variety of things to help us learn”, “My teacher tells me how to do better when I make a mistake” and “Mathematics is not one of my strengths” are shifted towards agreement.

  • To summarize, most students report that they are not particularly good at mathematics, do not consider it a strength and do not especially like it; many are also unsure what the teacher expects of them and do not find the tasks interesting. On the positive side, mathematics does not make most students nervous, although few of them look forward to mathematics class. At the same time, students generally rate the interaction with the teacher well: they agree that the teacher explains the material clearly, explains mistakes, and answers questions.

Distribution of gender and mathematics achievement score in the sample

th = theme(plot.title = element_text(size=10, hjust = 0.5, face="bold"),
           text = element_text(size=8),
        axis.ticks = element_blank(),
        panel.grid = element_blank(),
        rect = element_blank(),
        panel.grid.major.y = element_line(color = "grey92", size = 0.5))

plot1 = ggplot(data, aes(x = sex, fill = sex)) +
  geom_bar() +
  ggtitle("Gender distribution\n in the sample")+
  xlab("")+
  ylab("Number of observations")+
  th

legend = gtable_filter(ggplotGrob(plot1), "guide-box") 

plot2 = ggplot(data, aes(x = sex, y = success, fill = sex, color = sex)) + 
        geom_boxplot(alpha = 0.5) +
        ggtitle("Comparing mathematic achievement score\nof different Gender") +
        xlab("") +
        ylab("Mathematic achievement score") +
        th
#plot graphs
grid.arrange(arrangeGrob( plot1 + theme(legend.position="none"), plot2 + theme(legend.position="none"), ncol = 2), legend, widths=unit.c(unit(1, "npc") - legend$width, legend$width))

Interpretation:

  • The gender of students in the sample is distributed fairly evenly, which supports the representativeness of the sample. Comparing the achievement of boys and girls, there is no clear dependence of performance on gender: the medians are at approximately the same level and outliers are present in both boxplots. However, the interquartile range (the “body” of the boxplot) is wider for boys: their first quartile is lower and their third quartile is higher than for girls, so boys’ scores are somewhat more spread out.

Distribution of parents’ education in the sample

plot1 = ggplot(data, aes(x = mother)) +
  geom_bar(fill="#AF52CC", alpha = 0.5, color = "black") +
  ggtitle("Mother's education plot")+
  xlab("")+
  ylab("")+
  coord_flip() +
  theme_minimal()

plot2 =  ggplot(data, aes(x = father)) +
  geom_bar(fill="#6BECCE", col= "black", alpha = 0.5) +
  ggtitle("Father's education plot")+
  xlab("")+
  ylab("Number of observations")+
  coord_flip()+
  theme_minimal()

grid.arrange(arrangeGrob(plot1, plot2, nrow = 2,
             left = textGrob("Level of education", rot = 90, vjust = 1, 
                                          gp = gpar(fontsize = 12))))

Interpretation:

  • Many students do not know their parents’ educational level: in each case, almost one third report “Don’t know”. Very few parents have only primary education or none at all. For both parents, the most common reported categories are upper secondary education and a bachelor’s degree. It is also interesting that more fathers than mothers hold a postgraduate degree, while relatively more mothers report upper secondary education as their highest level.

Distribution of birth country and math achievement score in the sample

plot1 = data %>% 
  group_by(country) %>% 
  summarise(Number = n()) %>%
  mutate(Percent = prop.table(Number)*100) %>% 
ggplot(aes(country, Percent)) + 
  geom_col(aes(fill = country)) +
  ggtitle("Country of born (Japan)\n distribution in sample")+
  xlab("")+
  ylab("Percentage")+
  theme(plot.title = element_text(hjust = 0.5)) +
  geom_text(aes(label = sprintf("%.2f%%", Percent)), vjust = 3,size = 4) + 
  th

legend = gtable_filter(ggplotGrob(plot1), "guide-box") 

plot2 = ggplot(data, aes(x = country, y = success, fill = country, color = country)) + 
        geom_boxplot(alpha = 0.5) +
        ggtitle("Comparing mathematic achievement score\nof country of born") +
        xlab("") +
        ylab("Mathematic achievement score") +
        th
#plot graphs
grid.arrange(arrangeGrob( plot1 + theme(legend.position="none"), plot2 + theme(legend.position="none"), ncol = 2), legend, widths=unit.c(unit(1, "npc") - legend$width, legend$width))

Interpretation:

  • The graph shows that this variable is extremely unbalanced: slightly more than 99% of students were born in Japan, so one group contains about 4,700 students and the other only about 40. I nevertheless keep this variable for the regression, since even the smaller group is larger than commonly cited minimums for a group in regression analysis (recommendations range from about 8–10 observations up to at least 25 respondents).

Factor Analysis

As noted earlier, the factor analysis is performed on the 24 variables related to mathematics. I therefore split the data into two sets: the first keeps the 24 variables for the EFA, and the remaining 5 are stored in a separate dataset. In addition, a correlation matrix is built to check that factor analysis makes sense for these data.

data1 = data[,c(25:29)]
data = data[, c(1:24)]

corr = hetcor(data)
corrplot(corr$correlations, method="number", title = "", tl.cex = 0.5, tl.col = "black", number.cex=0.5)

Interpretation:

  • Correlations greater than about ±0.3 in absolute value can be considered substantial. The plot shows roughly 5 groups of variables with fairly high correlations within each group, and these are our potential factors (a small numeric check of the strongly correlated pairs is sketched below).
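
As a complement to the correlation plot, the pairs exceeding the 0.3 threshold can also be listed directly (a sketch; the threshold simply mirrors the rule of thumb used above):

r <- corr$correlations
high <- which(abs(r) > 0.3 & upper.tri(r), arr.ind = TRUE)
# Each strongly correlated pair with its correlation, rounded to two decimals
data.frame(var1 = rownames(r)[high[, 1]], var2 = colnames(r)[high[, 2]], r = round(r[high], 2))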

How many factors do we need?

fa.parallel(corr$correlations, 4745) # the second argument is the number of observations

## Parallel analysis suggests that the number of factors =  5  and the number of components =  3

Interpretation of the parallel analysis scree plot:

  • Parallel analysis suggests 5 latent factors.
  • Five triangles (the actual-data factor eigenvalues) lie above the “FA Simulated Data” line, and none sits on it, so a 5-factor solution looks reasonable.
  • The analysis suggests 3 components, since three crosses lie above the “Principal Components Simulated Data” line.

Building a factor model with 5 factors

To run the factor analysis with categorical (ordinal) variables, they first have to be converted to numeric. I then fit a 5-factor model with the fa() function. I settled on 5 factors because, despite some weaker values for individual variables, this solution is the most interpretable and has good statistical characteristics (RMSR, chi-square, Tucker–Lewis Index).

I use cor = “mixed” so that fa() treats the ordinal items appropriately. Oblimin is the default rotation in fa(), but I specify rotate = “oblimin” explicitly, since an oblique rotation allows some correlation between the factors, which is likely in the real world.

data <- data.frame(lapply(data, function(x) as.numeric(as.factor(x))))
fa_plot5 <- fa(data, nfactors=5, rotate = "oblimin", cor='mixed')
fa_plot5
## Factor Analysis using method =  minres
## Call: fa(r = data, nfactors = 5, rotate = "oblimin", cor = "mixed")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                  MR1   MR2   MR3   MR5   MR4   h2    u2 com
## enjoy           0.86  0.04  0.00 -0.01 -0.06 0.83 0.170 1.0
## study          -0.16 -0.06  0.14  0.01  0.57 0.57 0.435 1.3
## boring         -0.30 -0.05  0.05 -0.06  0.60 0.75 0.252 1.5
## intrst          0.79  0.03  0.08  0.12 -0.06 0.72 0.277 1.1
## likemath        0.91  0.02 -0.09 -0.05 -0.01 0.93 0.071 1.0
## likenumb        0.92 -0.02  0.02  0.03  0.01 0.82 0.185 1.0
## likesolve       0.88 -0.02 -0.08 -0.01  0.00 0.85 0.149 1.0
## lookfrwrd       0.71  0.06  0.11  0.19 -0.15 0.75 0.247 1.3
## fav             0.82  0.00 -0.16 -0.02 -0.02 0.87 0.127 1.1
## expect          0.16  0.08 -0.12  0.51  0.13 0.45 0.549 1.5
## undrstnd       -0.05  0.65 -0.04  0.21 -0.13 0.71 0.291 1.3
## intrst_teacher  0.12  0.26  0.06  0.56 -0.11 0.73 0.269 1.6
## intrst_things   0.00  0.00 -0.04  0.93  0.00 0.88 0.117 1.0
## answers        -0.02  0.79 -0.03  0.02  0.01 0.64 0.361 1.0
## explain        -0.08  0.84  0.00  0.02 -0.12 0.74 0.258 1.1
## learned         0.07  0.36  0.02  0.29  0.11 0.38 0.622 2.2
## help            0.04  0.86  0.03 -0.04  0.02 0.71 0.286 1.0
## mistake         0.04  0.90  0.01 -0.08  0.07 0.72 0.283 1.0
## listen          0.05  0.63 -0.01  0.16  0.07 0.58 0.417 1.2
## dowell          0.20  0.07 -0.70  0.06  0.17 0.68 0.316 1.3
## diffclt         0.15  0.05  0.81 -0.04  0.14 0.59 0.411 1.1
## not_strength   -0.18  0.00  0.72  0.02  0.13 0.80 0.203 1.2
## quickly         0.29  0.04 -0.54  0.08  0.10 0.58 0.421 1.7
## nervous        -0.22 -0.12  0.27  0.04  0.37 0.51 0.493 2.8
## 
##                        MR1  MR2  MR3  MR5  MR4
## SS loadings           6.23 4.43 2.56 2.23 1.35
## Proportion Var        0.26 0.18 0.11 0.09 0.06
## Cumulative Var        0.26 0.44 0.55 0.64 0.70
## Proportion Explained  0.37 0.26 0.15 0.13 0.08
## Cumulative Proportion 0.37 0.63 0.79 0.92 1.00
## 
##  With factor correlations of 
##       MR1   MR2   MR3   MR5   MR4
## MR1  1.00  0.42 -0.64  0.48 -0.49
## MR2  0.42  1.00 -0.10  0.71 -0.24
## MR3 -0.64 -0.10  1.00 -0.17  0.28
## MR5  0.48  0.71 -0.17  1.00 -0.21
## MR4 -0.49 -0.24  0.28 -0.21  1.00
## 
## Mean item complexity =  1.3
## Test of the hypothesis that 5 factors are sufficient.
## 
## The degrees of freedom for the null model are  276  and the objective function was  22.31 with Chi Square of  105642.5
## The degrees of freedom for the model are 166  and the objective function was  1.17 
## 
## The root mean square of the residuals (RMSR) is  0.02 
## The df corrected root mean square of the residuals is  0.03 
## 
## The harmonic number of observations is  4745 with the empirical chi square  1017.8  with prob <  2.1e-122 
## The total number of observations was  4745  with Likelihood Chi Square =  5535.02  with prob <  0 
## 
## Tucker Lewis Index of factoring reliability =  0.915
## RMSEA index =  0.083  and the 90 % confidence intervals are  0.081 0.084
## BIC =  4129.86
## Fit based upon off diagonal values = 1
## Measures of factor score adequacy             
##                                                    MR1  MR2  MR3  MR5  MR4
## Correlation of (regression) scores with factors   0.99 0.97 0.94 0.96 0.88
## Multiple R square of scores with factors          0.97 0.93 0.88 0.92 0.77
## Minimum correlation of possible factor scores     0.95 0.87 0.77 0.83 0.54

Interpretation of the Factor Analysis:

  • h2 (communality) shows the share of a variable’s variance that is explained by the retained factors; higher is better. u2 (uniqueness) is the complementary, unexplained share. Several variables are borderline, with 40–60% of their variance left unexplained: study (u2 = 0.43), expect (u2 = 0.55), learned (u2 = 0.62) and nervous (u2 = 0.49).

  • Complexity (com) shows how cleanly a variable belongs to a single factor; values below about 1.5 are desirable. Two variables do not belong clearly to one factor: learned (com = 2.2) and nervous (com = 2.8). A compact printout of the salient loadings is sketched after this list.

  • Proportion Explained: ideally the explained variance would be spread fairly evenly across the factors. Here it is not: the first factor accounts for 37% of the explained variance, while MR4 accounts for only 8%.

  • Proportion Var is also not ideal, since two of the five factors each explain less than 10% of the total variance.

  • Despite these weaker points, the model fits well overall. The root mean square of the residuals (RMSR) is 0.02, which is good because it is below 0.05.

  • The Tucker–Lewis Index of factoring reliability is 0.915, which indicates a good fit (it should be above 0.9, with 1 as the maximum).

  • The RMSEA index is 0.083 (90% CI 0.081–0.084), which is above the conventional 0.05 cutoff, so this indicator is less favourable than the others.

  • The likelihood chi-square (5535.02, p < .001) is significant, formally rejecting the hypothesis that 5 factors are sufficient; however, with 4,745 observations even small discrepancies become significant, so the descriptive fit indices above are more informative.
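
To focus on the salient loadings discussed above, the pattern matrix can be printed with a cutoff (a small sketch; the 0.4 threshold is my choice, not part of the original output):

# Hide loadings below 0.4 and sort variables by factor
print(fa_plot5$loadings, cutoff = 0.4, sort = TRUE)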

Now let’s look at the graph.

fa.diagram(fa_plot5)

Factor <- c("MR1" , "MR2" , "MR3" , "MR4" , "MR5")

Variables <- c("likenumb, likemath, likesolve, enjoy, fav, intrst, lookfrwrd" , "mistake, help, explain, answers, undrstnd, listen, learned" , "diffclt, not_strength, dowell, quickly" , "boring, study, nervous", "intrst_teacher, intrst_things, expect")


Explanation <- c("The factor is responsible for the student's love of maths, whether he or she likes it. The higher the values assigned to a particular individual in the future, the more he or she likes doing mathematics." , "The factor is responsible for students' interaction with the teacher during mathematics lessons." , "The factor is responsible for how easy maths is for the student." , "The factor is responsible for the student not liking mathematics and feeling uncomfortable doing it." , "The factor is responsible for the student's attitude towards what the teacher says in class.")

tab <- data.frame(Factor, Variables, Explanation)

tab %>% datatable(options = list(scrollX='400px', scrollY='270px'))

Interpretation of the Factor Analysis:

  • Each factor must have at least 3 variables associated with it. The red lines show a negative correlation between the variable and the factor. As you can see from the graph, each of the five factors has at least 3 variables associated with it.

  • All of the displayed loadings are at least 0.4 in absolute value, which is good. If a loading were lower, the variable would not be sufficiently related to its factor and could be dropped.

  • The factors correlate with each other. MR2 and MR5 are strongly correlated (0.71): both are built from items about mathematics lessons (“How much do you agree with these statements about your mathematics lessons?”). MR1 and MR3 are strongly negatively correlated (−0.64): MR1 contains most of the items measuring students’ liking of and interest in mathematics, whereas MR3 contains the items about how difficult mathematics feels to the student, so the negative relationship is expected. MR1 and MR2 correlate only moderately (0.42, just above the 0.3 threshold), which may reflect that a student who likes mathematics also tends to like the teacher, and vice versa. Since MR5 relates to what the teacher says in class, the correlation between MR1 and MR5 (0.48) can be explained in the same way. Finally, the correlation between MR1 and MR4 is negative (−0.49): both factors draw mainly on the “learning mathematics” items, but MR1 captures liking mathematics while MR4 captures discomfort with studying it.

Cronbach’s alpha

Cronbach’s alpha is a consistency measure of how closely a set of items hang together as a group. It has to be computed separately for each scale (factor). The argument check.keys = T automatically reverse-scores negatively keyed items so that all items point in the same direction; this is important whenever some items are reverse-worded, which may be the case for the third factor in particular. All factors, however, need to be checked for consistency. Cronbach’s alpha should be at least 0.7; lower values suggest that the items of the scale do not measure the same underlying concept.

MR1

# MR1
psych::alpha(data[,c("likenumb", "likemath", "likesolve", "enjoy", "fav", "intrst", "lookfrwrd")], check.keys = T)
## 
## Reliability analysis   
## Call: psych::alpha(x = data[, c("likenumb", "likemath", "likesolve", 
##     "enjoy", "fav", "intrst", "lookfrwrd")], check.keys = T)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
##       0.95      0.95    0.94      0.72  18 0.0011  2.7 0.78      0.7
## 
##  lower alpha upper     95% confidence boundaries
## 0.94 0.95 0.95 
## 
##  Reliability if an item is dropped:
##           raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
## likenumb       0.94      0.94    0.93      0.72  16   0.0013 0.0045  0.70
## likemath       0.93      0.93    0.92      0.70  14   0.0015 0.0018  0.68
## likesolve      0.94      0.94    0.93      0.71  15   0.0014 0.0035  0.69
## enjoy          0.94      0.94    0.93      0.72  15   0.0014 0.0041  0.69
## fav            0.94      0.94    0.93      0.71  15   0.0014 0.0028  0.69
## intrst         0.94      0.94    0.94      0.74  17   0.0012 0.0035  0.75
## lookfrwrd      0.94      0.94    0.94      0.74  17   0.0012 0.0035  0.75
## 
##  Item statistics 
##              n raw.r std.r r.cor r.drop mean   sd
## likenumb  4745  0.86  0.86  0.83   0.81  2.9 0.78
## likemath  4745  0.93  0.92  0.92   0.89  2.6 0.97
## likesolve 4745  0.89  0.89  0.87   0.85  2.6 0.93
## enjoy     4745  0.88  0.88  0.86   0.84  2.5 0.91
## fav       4745  0.90  0.89  0.88   0.85  2.8 0.99
## intrst    4745  0.82  0.83  0.78   0.76  2.5 0.85
## lookfrwrd 4745  0.82  0.82  0.78   0.76  2.9 0.85
## 
## Non missing response frequency for each item
##              1    2    3    4 miss
## likenumb  0.06 0.19 0.54 0.21    0
## likemath  0.15 0.28 0.37 0.20    0
## likesolve 0.12 0.30 0.39 0.19    0
## enjoy     0.16 0.37 0.34 0.13    0
## fav       0.13 0.20 0.38 0.29    0
## intrst    0.12 0.34 0.43 0.12    0
## lookfrwrd 0.07 0.23 0.47 0.23    0

Interpretation:

  • Cronbach’s alpha is 0.9427067, which indicates very good scale reliability. This means that the variables used have common covariance and are likely to measure the same underlying concept.

MR2

# MR2
psych::alpha(data[,c("mistake", "help", "explain", "answers", "undrstnd", "listen", "learned")], check.keys = T)
## 
## Reliability analysis   
## Call: psych::alpha(x = data[, c("mistake", "help", "explain", "answers", 
##     "undrstnd", "listen", "learned")], check.keys = T)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
##       0.89      0.89    0.89      0.53   8 0.0025  2.2 0.62     0.57
## 
##  lower alpha upper     95% confidence boundaries
## 0.88 0.89 0.89 
## 
##  Reliability if an item is dropped:
##          raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
## mistake       0.87      0.87    0.86      0.52 6.5   0.0030 0.0122  0.56
## help          0.87      0.87    0.86      0.52 6.5   0.0030 0.0121  0.55
## explain       0.86      0.86    0.85      0.52 6.4   0.0031 0.0082  0.56
## answers       0.87      0.87    0.87      0.53 6.7   0.0029 0.0128  0.55
## undrstnd      0.87      0.87    0.85      0.53 6.6   0.0030 0.0091  0.57
## listen        0.87      0.88    0.87      0.54 7.0   0.0028 0.0134  0.57
## learned       0.89      0.89    0.89      0.59 8.5   0.0024 0.0050  0.58
## 
##  Item statistics 
##             n raw.r std.r r.cor r.drop mean   sd
## mistake  4745  0.81  0.81  0.78   0.73  2.1 0.78
## help     4745  0.81  0.81  0.78   0.73  2.0 0.76
## explain  4745  0.83  0.83  0.81   0.75  2.1 0.84
## answers  4745  0.79  0.79  0.74   0.70  2.1 0.78
## undrstnd 4745  0.80  0.80  0.77   0.71  2.2 0.82
## listen   4745  0.76  0.76  0.70   0.66  2.3 0.82
## learned  4745  0.63  0.63  0.52   0.50  2.7 0.81
## 
## Non missing response frequency for each item
##             1    2    3    4 miss
## mistake  0.22 0.54 0.19 0.05    0
## help     0.23 0.56 0.16 0.05    0
## explain  0.24 0.50 0.19 0.07    0
## answers  0.22 0.55 0.18 0.05    0
## undrstnd 0.20 0.52 0.21 0.08    0
## listen   0.16 0.50 0.26 0.08    0
## learned  0.07 0.32 0.46 0.15    0

Interpretation:

  • Cronbach’s alpha is 0.8889813, which indicates very good scale reliability. This means that the variables used have common covariance and are likely to measure the same underlying concept.

MR3

# MR3
psych::alpha(data[,c("diffclt", "not_strength", "dowell", "quickly")], check.keys = T)
## 
## Reliability analysis   
## Call: psych::alpha(x = data[, c("diffclt", "not_strength", "dowell", 
##     "quickly")], check.keys = T)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
##       0.82      0.82    0.79      0.54 4.7 0.0041  2.3 0.72     0.56
## 
##  lower alpha upper     95% confidence boundaries
## 0.81 0.82 0.83 
## 
##  Reliability if an item is dropped:
##              raw_alpha std.alpha G6(smc) average_r S/N alpha se  var.r med.r
## diffclt           0.79      0.80    0.73      0.57 4.0   0.0051 0.0011  0.58
## not_strength      0.75      0.75    0.68      0.50 3.1   0.0063 0.0052  0.49
## dowell-           0.77      0.77    0.69      0.52 3.3   0.0057 0.0058  0.54
## quickly-          0.79      0.79    0.72      0.56 3.8   0.0051 0.0039  0.59
## 
##  Item statistics 
##                 n raw.r std.r r.cor r.drop mean   sd
## diffclt      4745  0.79  0.78  0.67   0.61  2.5 0.90
## not_strength 4745  0.86  0.84  0.78   0.71  2.2 1.03
## dowell-      4745  0.81  0.82  0.75   0.67  2.1 0.83
## quickly-     4745  0.77  0.79  0.68   0.62  2.3 0.79
## 
## Non missing response frequency for each item
##                 1    2    3    4 miss
## diffclt      0.16 0.32 0.40 0.12    0
## not_strength 0.32 0.29 0.26 0.13    0
## dowell       0.06 0.21 0.48 0.25    0
## quickly      0.07 0.27 0.51 0.14    0

Interpretation:

  • Cronbach’s alpha is 0.8243822, which indicates very good scale reliability. This means that the variables used have common covariance and are likely to measure the same underlying concept.

MR4

# MR4
psych::alpha(data[,c("boring", "study", "nervous")], check.keys = T)
## 
## Reliability analysis   
## Call: psych::alpha(x = data[, c("boring", "study", "nervous")], check.keys = T)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
##       0.77      0.77     0.7      0.52 3.3 0.0059  2.7 0.74     0.51
## 
##  lower alpha upper     95% confidence boundaries
## 0.75 0.77 0.78 
## 
##  Reliability if an item is dropped:
##         raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## boring       0.61      0.61    0.44      0.44 1.6   0.0112    NA  0.44
## study        0.67      0.68    0.51      0.51 2.1   0.0094    NA  0.51
## nervous      0.76      0.77    0.62      0.62 3.3   0.0068    NA  0.62
## 
##  Item statistics 
##            n raw.r std.r r.cor r.drop mean   sd
## boring  4745  0.85  0.86  0.76   0.67  2.7 0.86
## study   4745  0.84  0.83  0.71   0.61  2.7 0.93
## nervous 4745  0.79  0.79  0.60   0.53  2.8 0.92
## 
## Non missing response frequency for each item
##            1    2    3    4 miss
## boring  0.11 0.27 0.47 0.15    0
## study   0.12 0.25 0.42 0.20    0
## nervous 0.11 0.23 0.44 0.22    0

Interpretation:

  • Cronbach’s alpha is 0.7680854, which indicates good scale reliability. This means that the variables used have common covariance and are likely to measure the same underlying concept.

MR5

# MR5
psych::alpha(data[,c("intrst_teacher", "intrst_things", "expect")], check.keys = T)
## 
## Reliability analysis   
## Call: psych::alpha(x = data[, c("intrst_teacher", "intrst_things", 
##     "expect")], check.keys = T)
## 
##   raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
##       0.78      0.78    0.72      0.54 3.5 0.0056  2.7 0.68     0.51
## 
##  lower alpha upper     95% confidence boundaries
## 0.77 0.78 0.79 
## 
##  Reliability if an item is dropped:
##                raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## intrst_teacher      0.68      0.68    0.51      0.51 2.1   0.0093    NA  0.51
## intrst_things       0.60      0.60    0.43      0.43 1.5   0.0117    NA  0.43
## expect              0.81      0.81    0.68      0.68 4.3   0.0055    NA  0.68
## 
##  Item statistics 
##                   n raw.r std.r r.cor r.drop mean   sd
## intrst_teacher 4745  0.86  0.84  0.75   0.64  2.5 0.87
## intrst_things  4745  0.88  0.88  0.81   0.71  2.8 0.78
## expect         4745  0.77  0.78  0.57   0.51  2.8 0.79
## 
## Non missing response frequency for each item
##                   1    2    3    4 miss
## intrst_teacher 0.14 0.38 0.36 0.12    0
## intrst_things  0.05 0.25 0.53 0.17    0
## expect         0.05 0.25 0.51 0.20    0

Interpretation:

  • Cronbach’s alpha is 0.7800495, which indicates good scale reliability. This means that the variables used have common covariance and are likely to measure the same underlying concept.

Thus, it can be concluded that all 5 factors are reliable and all the variables assigned to them should be kept as they are now.

Regression Analysis

Preparing the data

Before moving on to the regression analysis, the factor scores need to be attached to the remaining variables. Using the scores element of the fa() output, I obtain each respondent’s score on every factor: the higher the score, the more strongly that factor is expressed for that individual. In addition to the factors, I also use the control variables gender, country of birth and parents’ education. All five factors, together with the controls, are considered as candidate predictors; the model comparison below determines which ones to keep.

fascores<-fa(data, nfactors=5, rotate = "oblimin", cor='mixed', scores = T)
fascores = fascores$scores
data1 <- cbind(data1,fascores) 
data1 = data1 %>% rename( factor_like_math = MR1, factor_teacher = MR2, factor_hard_math = MR3, factor_uncomfotable_math = MR4,factor_inrst_teacher = MR5)

Choosing the best model

library(car)
m1 = lm(success ~ sex, data = data1) 
m2 = lm(success ~ sex + mother + father, data = data1)
m3 = lm(success ~ sex + mother + father + country, data = data1)
m4 = lm(success ~ sex + mother + father + country + factor_like_math, data = data1)
m5 = lm(success ~ sex + mother + father + country + factor_like_math + factor_teacher, data = data1)
m6 = lm(success ~ sex + mother + father + country + factor_like_math + factor_teacher + factor_hard_math, data = data1)
m7 = lm(success ~ sex + mother + father + country + factor_like_math + factor_teacher + factor_hard_math + factor_uncomfotable_math, data = data1)
m8 = lm(success ~ sex + mother + father + country + factor_like_math + factor_teacher + factor_hard_math + factor_uncomfotable_math + factor_inrst_teacher, data = data1)

anova(m1, m2, m3, m4, m5, m6, m7, m8)
## Analysis of Variance Table
## 
## Model 1: success ~ sex
## Model 2: success ~ sex + mother + father
## Model 3: success ~ sex + mother + father + country
## Model 4: success ~ sex + mother + father + country + factor_like_math
## Model 5: success ~ sex + mother + father + country + factor_like_math + 
##     factor_teacher
## Model 6: success ~ sex + mother + father + country + factor_like_math + 
##     factor_teacher + factor_hard_math
## Model 7: success ~ sex + mother + father + country + factor_like_math + 
##     factor_teacher + factor_hard_math + factor_uncomfotable_math
## Model 8: success ~ sex + mother + father + country + factor_like_math + 
##     factor_teacher + factor_hard_math + factor_uncomfotable_math + 
##     factor_inrst_teacher
##   Res.Df      RSS Df Sum of Sq         F    Pr(>F)    
## 1   4743 37048617                                     
## 2   4729 31003897 14   6044720   87.6032 < 2.2e-16 ***
## 3   4728 31003267  1       631    0.1279  0.720595    
## 4   4727 26065195  1   4938071 1001.9115 < 2.2e-16 ***
## 5   4726 26027368  1     37828    7.6750  0.005621 ** 
## 6   4725 23757296  1   2270072  460.5869 < 2.2e-16 ***
## 7   4724 23278778  1    478518   97.0891 < 2.2e-16 ***
## 8   4723 23278014  1       764    0.1550  0.693782    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation:

  • Several of the model comparisons are significant: m2, m4, m5, m6 and m7 each improve on the preceding model. For model selection, AIC is most often used: by computing and comparing the AICc scores of the candidate models, we can choose the one that best fits the data.
# install.packages("AICcmodavg")
library(AICcmodavg)
aictab(cand.set = list(m2, m4, m5, m6, m7))
## 
## Model selection based on AICc:
## 
##       K     AICc Delta_AICc AICcWt Cum.Wt        LL
## Mod5 22 53833.93       0.00      1      1 -26894.86
## Mod4 21 53928.46      94.53      0      1 -26943.13
## Mod3 20 54359.46     525.54      0      1 -27159.64
## Mod2 19 54364.34     530.41      0      1 -27163.09
## Mod1 17 55183.62    1349.69      0      1 -27574.75

Interpretation:

  • Note that aictab() labels the candidates by their position in cand.set, so Mod1–Mod5 correspond to m2, m4, m5, m6 and m7 respectively; the top row of the table has the lowest AICc (53833.93) and carries essentially all of the Akaike weight. For the remainder of the analysis and to answer the research question I work with m5, which adds the two factors factor_like_math and factor_teacher to the control variables.

Interpretation of the model

summary(m5)
## 
## Call:
## lm(formula = success ~ sex + mother + father + country + factor_like_math + 
##     factor_teacher, data = data1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -283.342  -47.545    0.629   50.820  242.748 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                         375.676     57.313   6.555 6.17e-11 ***
## sexBoy                               -9.399      2.204  -4.265 2.04e-05 ***
## motherLower secondary               142.185     56.558   2.514 0.011971 *  
## motherUpper secondary               155.903     56.174   2.775 0.005536 ** 
## motherPost-secondary, non-tertiary  145.102     56.665   2.561 0.010477 *  
## motherShort-cycle tertiary          183.570     56.201   3.266 0.001097 ** 
## motherBachelor’s or equivalent      187.509     56.201   3.336 0.000855 ***
## motherPostgraduate degree           192.396     57.650   3.337 0.000852 ***
## motherDon’t know                    159.376     56.151   2.838 0.004554 ** 
## fatherLower secondary                10.543     35.983   0.293 0.769536    
## fatherUpper secondary                34.046     35.524   0.958 0.337902    
## fatherPost-secondary, non-tertiary   32.280     36.206   0.892 0.372668    
## fatherShort-cycle tertiary           50.560     35.717   1.416 0.156970    
## fatherBachelor’s or equivalent       73.476     35.498   2.070 0.038520 *  
## fatherPostgraduate degree           108.230     36.277   2.983 0.002865 ** 
## fatherDon’t know                     31.817     35.477   0.897 0.369853    
## countryNo                            -2.116     12.094  -0.175 0.861109    
## factor_like_math                    -33.347      1.204 -27.693  < 2e-16 ***
## factor_teacher                        3.143      1.199   2.621 0.008800 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 74.21 on 4726 degrees of freedom
## Multiple R-squared:  0.2976, Adjusted R-squared:  0.2949 
## F-statistic: 111.2 on 18 and 4726 DF,  p-value: < 2.2e-16

Interpretation:

  • Twelve coefficients in the model are significant (p < 0.05): gender, every level of mother’s education relative to the reference category “Some Primary or Lower secondary or did not go to school” (including “Don’t know”), father’s education at the “Bachelor’s or equivalent” and “Postgraduate degree” levels, and both factors, factor_like_math and factor_teacher.

  • The adjusted R-squared of 0.295 tells us that the model explains about 29.5% of the variance in achievement, which also means that roughly 70% of the variation is not explained by the selected predictors.

  • The residual standard error is 74.21 on 4726 degrees of freedom, which is moderate relative to the outcome’s standard deviation of about 88 points.

  • The F statistic is highly significant (F = 111.2 on 18 and 4726 df, p < 2.2e-16), which means that the explanatory variables taken together significantly explain student success; the probability of observing such a large F ratio if the null hypothesis were true is far below 0.1%.
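
The sjPlot package is loaded at the top but not used; as a sketch (not part of the original output), the chosen model could also be presented as a formatted regression table with it:

# A compact HTML regression table for the chosen model
tab_model(m5, show.ci = FALSE, dv.labels = "Mathematics achievement (success)")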

Coefficient <- c("sex (Boy)","mother (Lower secondary)", "mother (Upper secondary)", "mother (Post-secondary, non-tertiary)", "mother (Short-cycle tertiary)", "mother (Bachelor’s or equivalent)", "mother (Postgraduate degree)", "mother (Don’t know)", "father (Bachelor’s or equivalent)", "father (Postgraduate degree)", "factor_like_math", "factor_teacher")

Interpretation <- c("If a student is a boy, his predicted academic success is 9.399 points lower, with other variables unchanged.",
                    "If a student's mother has lower secondary education, his or her academic success is 142.185 points higher than for the reference category (some primary or lower secondary education, or no schooling), with other variables unchanged.",
                    "If the student's mother has upper secondary education, his or her academic success is 155.903 points higher than for the reference category, with other variables unchanged.",
                    "If a student's mother has post-secondary, non-tertiary education, his or her academic success is 145.102 points higher than for the reference category, with other variables unchanged.",
                    "If a student's mother has short-cycle tertiary education, his or her academic success is 183.570 points higher than for the reference category, with other variables unchanged.",
                    "If a student's mother has a bachelor's degree or equivalent, his or her academic success is 187.509 points higher than for the reference category, with other variables unchanged.",
                    "If the student's mother has a postgraduate degree, his or her academic success is 192.396 points higher than for the reference category, with other variables unchanged.",
                    "If a student does not know his or her mother's education, his or her academic success is 159.376 points higher than for the reference category, with other variables unchanged.",
                    "If a student's father has a bachelor's degree or equivalent, his or her academic success is 73.476 points higher than for the reference category, with other variables unchanged.",
                    "If a student's father has a postgraduate degree, his or her academic success is 108.230 points higher than for the reference category, with other variables unchanged.",
                    "Since the underlying scale is reversed (1 = agree a lot, 4 = disagree a lot), a one-point increase in this factor score lowers predicted achievement by 33.347 points, other variables unchanged. That is, the less a student likes mathematics, the lower his or her success.",
                    "On the same reversed scale, a one-point increase in this factor score raises predicted achievement by 3.143 points, other variables unchanged. That is, students who rate the interaction with the teacher less positively score slightly higher; this small, counterintuitive effect is discussed in the conclusions.")

Coef_tab <- data.frame(Coefficient, Interpretation)
Coef_tab %>% datatable(options = list(pageLength=6, scrollX='400px', scrollY='270px'))

Model Diagnostics

Assumption 1: Multicollinearity

car::vif(m5)
##                      GVIF Df GVIF^(1/(2*Df))
## sex              1.045837  1        1.022662
## mother           2.998922  7        1.081606
## father           2.952662  7        1.080405
## country          1.079551  1        1.039014
## factor_like_math 1.337994  1        1.156717
## factor_teacher   1.288816  1        1.135260

Interpretation:

  • All GVIF values are below 5, so it can be argued that there is no multicollinearity problem.

Assumption 2: Outliers

outlierTest(m5)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##       rstudent unadjusted p-value Bonferroni p
## 2325 -3.826543         0.00013163       0.6246
qqPlot(m5, main = "QQ Plot") # there are two outliers in lines 2325 and 3864

## [1] 2325 3864
data1 <- data1[-c(2325, 3864), ]
m5 = lm(success ~ sex + mother + father + country + factor_like_math + factor_teacher, data = data1)

Interpretation:

  • The QQ plot shows that the residuals lie reasonably close to the reference line. outlierTest() finds no studentized residuals with Bonferroni p < 0.05, but qqPlot() flags observations 2325 and 3864 as the most extreme; I remove them from the data and refit the model.

Assumption 3: Studentized residuals

library(MASS)
sresid <- studres(m5) 
hist(sresid, xlab   = "Residuals", freq = FALSE,
     col    = "lightblue",
     border = "black",
     breaks = 20, 
     main = "Distribution of Studentized Residuals")

shapiro.test(sresid)
## 
##  Shapiro-Wilk normality test
## 
## data:  sresid
## W = 0.99886, p-value = 0.002397

Interpretation:

  • The histogram shows that the studentized residuals are distributed approximately normally, which is a good sign for the model.

  • However, the Shapiro–Wilk normality test has a p-value below 0.05, so the null hypothesis that the residuals are normally distributed is formally rejected; with almost 5,000 observations this test is very sensitive to small departures from normality.

Assumption 4: Homoscedasticity

bptest(m5)
## 
##  studentized Breusch-Pagan test
## 
## data:  m5
## BP = 29.623, df = 18, p-value = 0.04128
ncvTest(m5)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 6.761915, Df = 1, p = 0.0093124

Interpretation:

  • I use two tests of homoscedasticity. Unfortunately, both p-values are below 0.05, which indicates a heteroscedasticity problem in the data (the error variance is not constant).
plot(m5, 3)

Interpretation:

  • This plot shows whether the residuals are spread equally across the range of fitted values. Ideally it would show a horizontal line with equally spread points; that is not the case here, so the variance is not constant.
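
Heteroscedasticity does not bias the coefficient estimates themselves, but it can distort their standard errors. One common follow-up, not applied in the original analysis and shown here only as a sketch, is to recompute the coefficient tests with heteroscedasticity-consistent standard errors from the sandwich package (lmtest is already loaded above):

library(sandwich)

# Coefficient tests for m5 with HC3 robust standard errors
coeftest(m5, vcov = vcovHC(m5, type = "HC3"))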

Assumption 5: Leverage values

Finally, leverage values are checked.

plot(m5, 4)

plot(m5, 5)

Interpretation:

  • In the plots above, the 3 most extreme points (observations 2622, 1282 and 1186) are highlighted as having standardized residuals of about 3 or more in absolute value, but they are not influential: all cases lie well within the Cook’s distance bands (the red dashed lines).
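
As a numeric complement to the plots (a sketch; the 4/n rule of thumb is my choice, not part of the original analysis):

cooks_d <- cooks.distance(m5)
# Observations exceeding the common 4/n threshold for Cook's distance
which(cooks_d > 4 / length(cooks_d))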

Conclusions

Before summarising the overall results, let me restate the research question: what metrics related to mathematics lessons influence the academic performance of Japanese students? The project led to the following conclusions:

  • The factor analysis produced 5 factors, of which only 2 entered the final regression model. The first reflects how much the student likes mathematics as a subject; the second reflects how the student evaluates the interaction with the teacher. The regression model also included the control variables gender, parental education and country of birth.

  • The regression model showed that students whose mothers have any education beyond “Some Primary or Lower secondary or did not go to school” (including those who do not know their mother’s education) have higher achievement than students whose mothers have at most primary education. The father’s education also has a positive effect, but only at the “Bachelor’s or equivalent” and “Postgraduate degree” levels. This may be because mothers are more often the parent who helps with schoolwork or follows the child’s progress; to draw firmer conclusions, the cultural context of Japanese education would need to be studied.

  • Students’ academic success is related to whether they like mathematics: students who like the subject also perform better in it, and vice versa.

  • The effect of the factor describing the student’s satisfaction with the interaction with the teacher during mathematics lessons remains unclear. The model showed a weak effect in the “wrong” direction: students who rate the lessons less positively score slightly higher. I suspect the problem is multicollinearity. Ksenia Alekseevna ran into the same issue in the class on June 7 and it could not be resolved then. Unfortunately, I also failed to solve it, but the behaviour was the same as in class: if the “love of mathematics” factor is removed, gender stops being significant, while the estimate for the lessons factor grows many times larger and becomes negative.