The data in this analysis comes from a survey conducted at regional universities in the U.S. seeking to investigate which factors may impact the satisfaction level of undergraduate business students with their institution. The study population was defined to be all undergraduate business students at these two colleges. A total of 332 sets of student responses were collected. The survey consists of several multi-item questions corresponding to different factors which may influence a student’s satisfaction with their school. The aim of this analysis is to validate the internal consistency of the subscale pertaining to the students’ level of engagement in learning to assess whether principal component analysis would be appropriate for aggregation purposes. This evaluation will be based on the calculation and interpretion of its corresponding Cronbach alpha value and bootstrap confidence interval.
The data set is preprocessed and thus there are no true missing values to be handled. However, each observation is separated by a full row of missing values, presumably for ease of reading. These rows will be deleted to eliminate the risk of calculation issues due to missing values later in the analysis. Then, a new data frame be created which consists of only the columns corresponding to the questions in the learning engagement subscale.
survey <- na.omit(read.csv("https://pengdsci.github.io/STA490/w11/at-risk-survey-data.csv"))
engagement <- survey[, 4:24]
The subscale of interest consists of 21 questions relating to different forms or means of learning engagement and participation in which a student may partake, such as making a class presentation or working with other students on a class project. It is to be answered using a Likert-type scale with four possible responses indicating how frequently the student engages in each activity: 1 - Very often, 2 - Often, 3 - Sometimes, 4 - Never.
Most of the questions inquire about what would generally be considered a positive or productive activity. Thus, a lower numerical response (e.g., 1 for “Very often” as opposed to 4 for “Never”) can be seen as an indication of greater level of learning engagement by the student. However, question 4.5 asks how frequently a student comes to class without having completed their readings or assignments, while 4.21 asks how often they skip class entirely. Unlike the rest of the questions, a lower numerical response to these two questions would suggest that the student is less engaged in learning activities. Therefore, the scale for these two questions will be reversed so that each question’s scale follows the same order from more engagement to less engagement as the numerical value increases.
engagement.new <- engagement
engagement.new$q45 <- 5 - engagement$q45
engagement.new$q421 <- 5 - engagement$q421
As a preliminary step in assessing the internal consistency of this subscale, a correlation plot will be generated and interpreted as a general visual representation of the correlation among the variables, i.e., questions.
library(corrplot)
M=cor(engagement.new)
corrplot.mixed(M, lower.col = "royalblue3", upper = "ellipse", number.cex = .45, tl.cex = 0.55)
Overall, the plot does offer some support for the use of PCA as an aggregation tool, as the great majority of the pairs of questions appear to exhibit a weak to moderate positive correlation, as indicated by the blue, upward-sloping ellipses. There are also a few pairs for which the correlation is noticeably strong, indicated by darker blue ellipses which are thinner in shape.
Interestingly, questions 4.5 & 4.21 actually exhibit a negative correlation, albeit very weak in most cases, with almost all of the other questions despite reversing the order of the scale to match the others in theory. One possible explanation is that most students are generally not very likely to fail to do assignments or skip class frequently even if they rarely give presentations, discuss grades with their instructors, etc. Thus, many of the responses containing higher-numbered answers (closer to 4) indicating minimal engagement across most activities may still have lower-numbered answers (closer to 1) for questions 4.5 & 4.21. It may also be the case that students were simply less forthcoming about frequent failure to do their work or attend class than they were about infrequently engaging in activities that are not necessarily as essential or mandatory to academic success, perhaps out of fear of disciplinary action. Admittedly, neither of these explanations fully account for the other side of the negative relationship, i.e., answers to question 4.5 & 4.21 indicating frequent failure to complete assignments and attend class corresponding to answers to the rest of the questions which suggest frequent engagement in other learning activities. That said, most of these negative correlations were quite weak as noted previously, so it may not be particularly significant or impactful to the analysis as a whole.
Having detected ample evidence for correlation among the questions from a visual representation, we will now compute a Cronbach alpha value for the subscale, a numerical metric indicating the level of internal consistency. This metric will be computed for this set of responses, along with a 95% confidence interval for the value generated via bootstrapping.
cronbach.eng = as.numeric(alpha(engagement.new, check.keys = TRUE)$total[1])
kable(cronbach.eng, caption = "Cronbach Alpha")
| x |
|---|
| 0.8780093 |
set.seed(123)
num_bootstraps <- 1000
bootstrap_alphas <- numeric(num_bootstraps)
for (i in 1:num_bootstraps) {
boot_sample <- sample(engagement.new, replace = TRUE)
bootstrap_alphas[i] <- as.numeric(alpha(boot_sample, check.keys = TRUE)$total[1])
}
lci.025 = round(quantile(bootstrap_alphas, 0.025, type = 2),8)
uci.975 = round(quantile(bootstrap_alphas,0.975, type = 2 ),8)
bootstrap.ci = paste("[", round(lci.025,4),", ", round(uci.975,4),"]")
kable(bootstrap.ci, caption = "95% Bootstrap Confidence Interval for Cronbach Alpha")
| x |
|---|
| [ 0.8677 , 0.9189 ] |
Both the single value for Cronbach alpha calculated directly from the data (alpha = 0.878) as well as the 95% bootstrap confidence interval (0.8677, 0.9189) indicate very good internal consistency among the items in this subscale. It should be noted that these values were computed using the original scale for questions 4.5 & 4.21 as uniform positive correlation among the questions simplified the Cronbach alpha calculation. When using the reversed scales for these questions, the single value and lower limit of the confidence interval dropped slightly, but still fell into the range of values indicating a very good internal consistency. Therefore, it can be concluded that aggregation of this information via PCA is appropriate and advisable.
Having determined that principal component analysis is appropriate for aggregation purposes, a scree plot along with cumulative proportions of explained variance will be used to determine the optimal number of principal components.
My.plotnScree = function(mat, legend = TRUE, method ="components", main){
# mat = data matrix
# method = c("factors", "components"), default is "factors".
# main = title of the plot
ev <- eigen(cor(mat)) # get eigenvalues
ap <- parallel(subject=nrow(mat),var=ncol(mat), rep=5000,cent=.05)
nScree = nScree(x=ev$values, aparallel=ap$eigen$qevpea, model=method)
##
if (!inherits(nScree, "nScree"))
stop("Method is only for nScree objects")
if (nScree$Model == "components")
nkaiser = "Eigenvalues > mean: n = "
if (nScree$Model == "factors")
nkaiser = "Eigenvalues > zero: n = "
# axis labels
xlab = nScree$Model
ylab = "Eigenvalues"
##
par(col = 1, pch = 18)
par(mfrow = c(1, 1))
eig <- nScree$Analysis$Eigenvalues
k <- 1:length(eig)
plot(1:length(eig), eig, type="b", main = main,
xlab = xlab, ylab = ylab, ylim=c(0, 1.2*max(eig)))
#
nk <- length(eig)
noc <- nScree$Components$noc
vp.p <- lm(eig[c(noc + 1, nk)] ~ k[c(noc + 1, nk)])
x <- sum(c(1, 1) * coef(vp.p))
y <- sum(c(1, nk) * coef(vp.p))
par(col = 10)
lines(k[c(1, nk)], c(x, y))
par(col = 11, pch = 20)
lines(1:nk, nScree$Analysis$Par.Analysis, type = "b")
if (legend == TRUE) {
leg.txt <- c(paste(nkaiser, nScree$Components$nkaiser),
c(paste("Parallel Analysis: n = ", nScree$Components$nparallel)),
c(paste("Optimal Coordinates: n = ", nScree$Components$noc)),
c(paste("Acceleration Factor: n = ", nScree$Components$naf))
)
legend("topright", legend = leg.txt, pch = c(18, 20, NA, NA),
text.col = c(1, 3, 2, 4),
col = c(1, 3, 2, 4), bty="n", cex=0.7)
}
naf <- nScree$Components$naf
text(x = noc, y = eig[noc], label = " (OC)", cex = 0.7,
adj = c(0, 0), col = 2)
text(x = naf + 1, y = eig[naf + 1], label = " (AF)",
cex = 0.7, adj = c(0, 0), col = 4)
}
My.loadings.var <- function(mat, nfct, method="fa"){
# mat = data matrix
# nfct = number of factors or components
# method = c("fa", "pca"), default = is "fa".
if(method == "fa"){
f1 <- factanal(mat, factors = nfct, rotation = "varimax")
x <- loadings(f1)
vx <- colSums(x^2)
varSS = rbind('SS loadings' = vx,
'Proportion Var' = vx/nrow(x),
'Cumulative Var' = cumsum(vx/nrow(x)))
weight = f1$loadings[]
} else if (method == "pca"){
pca <- prcomp(mat, center = TRUE, scale = TRUE)
varSS = summary(pca)$importance[,1:nfct]
weight = pca$rotation[,1:nfct]
}
list(Loadings = weight, Prop.Var = varSS)
}
My.plotnScree(mat=engagement.new, legend = TRUE, method ="components",
main="Determination of Number of Components")
The scree plot suggests that it is optimal to retain the first four or five principal components, depending on the method used. Next, we will look at the cumulative proportions of variance explained by each of the first six principal components.
VarProp = My.loadings.var(mat=engagement.new, nfct=6, method="pca")$Prop.Var
# pca loadings
kable(round(VarProp,3),
caption="Cumulative and Proportion of Variances Explained by Each
Principal Component in the Learning Engagement Subscale")
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | |
|---|---|---|---|---|---|---|
| Standard deviation | 2.533 | 1.366 | 1.244 | 1.211 | 1.071 | 0.977 |
| Proportion of Variance | 0.306 | 0.089 | 0.074 | 0.070 | 0.055 | 0.045 |
| Cumulative Proportion | 0.306 | 0.394 | 0.468 | 0.538 | 0.593 | 0.638 |
Based on the above table, the first five principal components explain about 59.3% of the variance, while including the sixth PC only explains roughly an additional 4%. Therefore, the first five PCs will be retained and used as a learning engagement index. A histogram of the distribution of this index for the data can be generated.
pca <- prcomp(engagement.new, center = TRUE, scale = TRUE)
eng.idx = pca$x[,5]
# hist(sc.idx, breaks=10, main="Distribution of Learning Engagement Index")
##
hist(eng.idx,
main="Distribution of Learning Engagement Index",
breaks = seq(min(eng.idx), max(eng.idx), length=9),
xlab="Learning Engagement Index",
xlim=range(eng.idx),
border="goldenrod",
col="navy",
freq=FALSE
)
The distribution of this index appears to be approximately normal.
Finally,another correlation plot will be used to check that this new learning engagement is positively correlated with the individual items in the subscale.
M1=cor(cbind(eng.idx, engagement.new))
corrplot.mixed(M1, lower.col = "royalblue3", upper = "ellipse", number.cex = .45, tl.cex = 0.55)
Unfortunately, while there appears to be at least a weak positive correlation between the index and about half of the individual items, it also appears to be negatively correlated or essentially uncorrelated with the other half. This suggests that the current proposed index may not be the most effective for aggregating information on students’ learning engagement. An index consisting of only the first four PCs yields similar results. Referring back to the scree plot, the acceleration factor method dictates that only the first principal component should be retained for aggregation of the information. Another correlation plot can be generated using an index consisting of the first PC.
eng.idx1 = pca$x[,1]
M2=cor(cbind(eng.idx1, engagement.new))
corrplot.mixed(M2, lower.col = "royalblue3", upper = "ellipse", number.cex = .45, tl.cex = 0.55)
As can be seen in this new plot, an index consisting of only the first PC is moderately to strongly positively correlated with almost all of the individual items in the subscale (besides the anomalous 4th & 21st questions already discussed). Based on this observation, retaining only the first PC may actually be optimal for aggregation of the learning engagement information, despite it only explaining about 30.6% of the variance. More investigation is required to conclusively determine the optimal index.