There are many studies devoted to the search for predictors of a person’s success in the educational process. A huge part of sociology is built on the study of such factors. However, in this “mini-study” as part of training in the use of a tool such as Correspondence Analysis, I had a desire to look at something less familiar, such as the relationship between what kind of lunch a student eats and whether he studied for an exam on the special test preparation course, and what the result he will get on the math test.
Students Performance in Exams dataset was taken from Kaggle platform (https://www.kaggle.com/datasets/spscientist/students-performance-in-exams). It contains many variables, but the following will be used in this work: - math.score (will be converted into categories) - lunch (standard & reduced) - test preparation course (none & completed)
Let’s divide the date into conditional categories according to the scores on the mathematical test. Looking at the histogram, you can see that the distribution is not so smooth, so you can see at what points the number of people clearly changes.
library(ggplot2)
library(plotly)
ggplot(data, aes(x=math.score)) + geom_histogram(bins=60) +
geom_vline(xintercept = 40, linetype = 2, color = "red") +
geom_vline(xintercept = 55, linetype = 2, color = "red") +
geom_vline(xintercept = 70, linetype = 2, color = "red") +
geom_vline(xintercept = 80, linetype = 2, color = "red") +
geom_vline(xintercept = 95, linetype = 2, color = "red")
So, after some prepartion we got such a contingency table
library(dplyr)
data$math.score_category = if_else(data$math.score < 40, "Math 0-40",
if_else(data$math.score >= 40 & data$math.score < 55, "Math 40-55",
if_else(data$math.score >= 55 & data$math.score < 70, "Math 55-70",
if_else(data$math.score >= 70 & data$math.score < 80, "Math 70-80",
if_else(data$math.score >= 80 & data$math.score < 95, "Math 80-95",
"Math 95-100")))))
# DF to CA format function
to_ca_df = function(datta, main_variable){
colnames_vec = colnames(datta)
index_main_var = which(colnames(datta) == main_variable)
fo = table(datta[,index_main_var]) %>% as.data.frame()
result_df = data.frame(first_one = fo[,1])
result_colnames_vec = c(colnames(datta)[index_main_var])
for(i in seq(1, length(colnames(datta)), 1)){
if(i != index_main_var){
df = table(datta[,index_main_var],datta[,i]) %>% as.data.frame()
for(k in seq(1, nrow(df)/length(unique(datta[,index_main_var])) , 1)){
new = df[(1+length(unique(datta[,index_main_var]))*(k-1)):(length(unique(datta[,index_main_var]))+length(unique(datta[,index_main_var]))*(k-1)),3]
result_colnames_vec = append(result_colnames_vec,as.character(df[(1+length(unique(datta[,index_main_var]))*(k-1)),2]))
result_df = cbind(result_df, new)
}
} else {
next
}
}
colnames(result_df) = result_colnames_vec
result_df
}
data2 = to_ca_df(data[,c(4,5,9)], main_variable = "math.score_category")
math_mtrx = data2 %>% select(-"math.score_category") %>% as.matrix()
row.names(math_mtrx) = data2$math.score_category
math_mtrx
## free/reduced standard completed none
## Math 0-40 33 7 6 34
## Math 40-55 100 81 46 135
## Math 55-70 135 235 130 240
## Math 70-80 65 151 88 128
## Math 80-95 21 148 73 96
## Math 95-100 1 23 15 9
chisq.test(math_mtrx)
##
## Pearson's Chi-squared test
##
## data: math_mtrx
## X-squared = 151.77, df = 15, p-value < 2.2e-16
P-value is less than 0.05, so columns and rows are statistically significantly associated
chisq.test(math_mtrx)$exp
## free/reduced standard completed none
## Math 0-40 14.200 25.800 14.320 25.680
## Math 40-55 64.255 116.745 64.798 116.202
## Math 55-70 131.350 238.650 132.460 237.540
## Math 70-80 76.680 139.320 77.328 138.672
## Math 80-95 59.995 109.005 60.502 108.498
## Math 95-100 8.520 15.480 8.592 15.408
Expected counts below 5 are not presented
chisq.test(math_mtrx)$stdres
## free/reduced standard completed none
## Math 0-40 5.6144873 -4.589419 -2.4765359 2.0335521
## Math 40-55 5.4331530 -4.441192 -2.8478513 2.3384496
## Math 55-70 0.4424249 -0.361649 -0.2972014 0.2440403
## Math 70-80 -1.6610236 1.357761 1.5126826 -1.2421056
## Math 80-95 -6.0895182 4.977721 1.9452865 -1.5973286
## Math 95-100 -2.8754458 2.350460 2.4421868 -2.0053472
According to Standard Residuals, mainly students who eat reduced lunch score lower on a math test and who eat standard are much better in tests. The same can be said about the completion of an additional preparatory course - those who complete have more points than those who do not complete
library(gplots)
dt2 <- as.table(math_mtrx)
balloonplot(
t(dt2),
main = "housetasks",
xlab = "",
ylab = "",
label = FALSE,
show.margins = FALSE
)
Contingency table looks pretty boring. Most of the students did not complete the preparatory course, most of the students have a standard lunch.
Judging by the preliminary analysis of the data, it is unlikely that it will be possible to find any interesting connections, so far everything looks too predictable
library(FactoMineR)
res.ca = CA(math_mtrx, graph = F)
The data contains 6 rows and 4 columns
If the data were random, the expected value of the eigenvalue for each axis would be 1/(nrow(math_mtrx)-1) = 1/5 = 20% in terms of rows.
Likewise, the average axis should account for 1/(ncol(math_mtrx)-1) = 1/3 = 33.33% in terms of the 4 columns.
library(factoextra)
res.ca$eig
## eigenvalue percentage of variance cumulative percentage of variance
## dim 1 7.464435e-02 9.836188e+01 98.36188
## dim 2 1.243126e-03 1.638118e+00 100.00000
## dim 3 5.176751e-33 6.821615e-30 100.00000
fviz_screeplot(res.ca) +
geom_hline(yintercept = 33.33, linetype = 2, color = "red")
Amazing. In this case the first dimension explain almost all of the variance - 98.3%. The remaining 1.7% is explained by Dim2. Dimension 3 is so useless that it explains only 6.8*10^-30 percentage of variance. This means that two-dimensial representation of rows and columns would be perfect in terms of variance.
By symmetric plot we only can make general conclusions.
fviz_ca_biplot(res.ca, repel = TRUE)
Okay, probably this symmetric plot is not so easy to interpret. We can
see that “95-100” category is really far away from other test scores
categories - absolute champion of contribution to the Dim2. It is not
possible to unambiguously associate this category of test scores with
any factors using Symmetric Plot.
“Math 70-80” relates to the completed test course, “Math 40-55” relates to free reduced lunch type. But I will give any deeper interpretation for now.
fviz_ca_row(res.ca, alpha.row = "cos2", repel = TRUE)
library(corrplot)
corrplot(res.ca$row$cos2, method = "number", is.corr = FALSE)
As it was expected, row items is perfectly represented not even by two dimensions, but by only Dim1. Least represented by Dim1 (0.92 and 0.93) row items are “Math 70-80” and “Math 95-100”, which was seen at the plot before.
corrplot(res.ca$row$contrib, is.corr = FALSE)
fviz_contrib(res.ca, choice = "row", axes = 1, top = 10)
fviz_contrib(res.ca, choice = "row", axes = 2, top = 10)
fviz_contrib(res.ca, choice = "row", axes = 1:2, top = 10)
fviz_ca_row(res.ca, col.row = "contrib",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE)
On the first axis, we see that there are two opposites - a excellent score “85-90” and bad scores from 0 to 55 points. It is important to note that “Math 95-100” is not so significant for the Dim1, despite the fact that it is the highest score.
It is interesting that test scores “80-95” and “95-100” make the main contribution to Dim2 despite the fact that they are so close in human terms. So, they are the variables that are most important in explaining the variability on Dim2.
It seems like test scores from 0 to 95 is explained in one way, but the scores from 95 to 100 is absolutely different story (It reminds me the difference between 0 to 8 and 8 to 10 scores in the HSE)
fviz_ca_col(res.ca, col.col = "cos2",
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE)
fviz_cos2(res.ca, choice = "col", axes = 1:2)
And again, as it was expected, all of the column items is represented
perfectly (with 1.00 scores of cos2) on the two-dimensional map.
“If the angle between two arrows is acute, then their is a strong association between the corresponding row and column.
To interpret the distance between rows and and a column you should perpendicularly project row points on the column arrow.”
fviz_ca_biplot(res.ca,
map ="rowprincipal", arrow = c(TRUE, TRUE),
repel = TRUE)
fviz_ca_biplot(res.ca,
map ="colprincipal", arrow = c(TRUE, TRUE),
repel = TRUE, labelsize =2)
By the assymetric plots it is clearly visible that there is a strong association between 1. Math Test scores of 40 to 55 and absence of completion of the test preparation course. That is, such scores are most often associated precisely with the fact that the student simply did not try hardly enough and did not pass the preliminary course. 2. Math Test scores of 0-40 and free reduced lunch consumption, as well as Math Test scores of 80-95 and standard lunch consumption. Of course, it is not about food, but probably about socio-economic position of the student.