The goal of this research project is to assess the level of student engagement and overall academic experience among undergraduate students enrolled at two business schools at two regional universities in the US. Using a comprehensive survey instrument, we aim to evaluate how students perceive their academic and social environment across multiple domains, including faculty interaction, resource availability, academic support, and peer engagement.
We will explore the relationship between key engagement indicators and demographic variables such as student classification (first-year vs. senior), gender, college affiliation, and residency status. The ultimate objective is to identify areas that support or hinder student success and satisfaction within the university environment.
The survey includes the following major sections:
Section 4: Engagement with Academic and Social Aspects of Campus Life
Section 5: Experiences with Faculty and Instructors
Section 6: Frequency of Specific Academic Experiences
Section 7: Frequency of Academic and Campus Support Usage
Section 8: Usefulness of Campus Services
Section 9: Interaction with Peers and Faculty
Section 10: Online and Campus Resources
Section 11–13: Demographic Characteristics
The purpose of our analysis is to uncover patterns in student responses and examine how these factors correlate with a student’s sense of academic belonging, satisfaction, and potential persistence to graduation. Findings from this study may inform faculty and/or professors to enhance student engagement and retention across academic programs.
survey = read.csv("https://raw.githubusercontent.com/TylerBattaglini/TBattaglini/refs/heads/main/at-risk-survey-data.csv", head = TRUE)
The code in this section prepares the survey data for analysis by handling missing values across multiple sections of the questionnaire. First, we use develop a function “my.mode” to compute the mode of any given survey item. Then, for each major category of survey responses (e.g., engagement, learning styles, writing load), the code isolates relevant columns from the original dataset and imputes missing values in each item by replacing them with the mode.
my.mode = function(dataset){
freq.tbl = table(dataset)
max.freq.id=which(freq.tbl==max(freq.tbl))
mode=names(freq.tbl[max.freq.id])
as.numeric(mode)
}
engagement = survey[, 4:24]
# Imputing with the mode in each survey item
for (i in 1:21) {
engagement[,i][is.na(engagement[,i])]=my.mode(engagement[,i])
}
LearnStyle = survey[, 25:30]
# Imputing with the mode in each survey item
for (i in 1:6) {
LearnStyle[,i][is.na(LearnStyle[,i])]=my.mode(LearnStyle[,i])
}
WriRead = survey[, 31:33]
# Imputing with the mode in each survey item
for (i in 1:3) {
WriRead[,i][is.na(WriRead[,i])]=my.mode(WriRead[,i])
}
Remed = survey[, 35:43]
# Imputing with the mode in each survey item
for (i in 1:9) {
Remed[,i][is.na(Remed[,i])]=my.mode(Remed[,i])
}
EngcnSupp = survey[, 44:50]
# Imputing with the mode in each survey item
for (i in 1:7) {
EngcnSupp[,i][is.na(EngcnSupp[,i])]=my.mode(EngcnSupp[,i])
}
GrownDev = survey[, 51:65]
# Imputing with the mode in each survey item
for (i in 1:15) {
GrownDev[,i][is.na(GrownDev[,i])]=my.mode(GrownDev[,i])
}
Resource = survey[, 66:98]
# Imputing with the mode in each survey item
for (i in 1:33) {
Resource[,i][is.na(Resource[,i])]=my.mode(Resource[,i])
}
Retention = survey[, 99:103]
# Imputing with the mode in each survey item
for (i in 1:5) {
Retention[,i][is.na(Retention[,i])]=my.mode(Retention[,i])
}
Pay = survey[, 104:109]
# Imputing with the mode in each survey item
for (i in 1:6) {
Pay[,i][is.na(Pay[,i])]=my.mode(Pay[,i])
}
In this section we get an initial correlation matrix from their we make a decision on whether to scale our values if we have a multitude of negatively correlated values or to compute our cronbachs alpha if our values are all positively correlated. After scaling and then getting our alpha value we then make a decision on whether this specific questionnaire has high, medium, or low internal consistency.
M=cor(engagement)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)
We see from this correlation matric that we have a multitude of negatively correlated values so we will scale our values into different sub sections. We chose 6 sections active learning, collaborative learning, tech use, faculty interaction, diverse perspectives, and negative engagement.
## Engagement subscales (items)
active_learning = cbind(q401 = engagement$q401, q402 = engagement$q402,
q403 = engagement$q403, q404 = engagement$q404,
q405 = engagement$q405)
collab_learning = cbind(q406 = engagement$q406, q407 = engagement$q407,
q408 = engagement$q408, q409 = engagement$q409)
tech_use = cbind(q410 = engagement$q410, q411 = engagement$q411)
faculty_interaction = cbind(q412 = engagement$q412, q413 = engagement$q413,
q414 = engagement$q414, q415 = engagement$q415,
q416 = engagement$q416, q417 = engagement$q417)
diverse_perspectives = cbind(q418 = engagement$q418, q419 = engagement$q419,
q420 = engagement$q420)
negative_engagement = cbind(q421 = engagement$q421) # or use `q421_rev` if reversed
# Combine all subscales for correlation matrix
engage_items = cbind(active_learning, collab_learning, tech_use,
faculty_interaction, diverse_perspectives, negative_engagement)
# Correlation matrix
M = cor(engage_items, use = "pairwise.complete.obs")
# Pretty plot
corrplot.mixed(M, lower.col = "purple", upper = "ellipse",
number.cex = 0.7, tl.cex = 0.7)
From our new correlation matrix we see a vast improvement but technology use and negative engagement are pulling down our correlation so we decide to remove them.
## Engagement subscales (items) - excluding negative engagement and tech use
active_learning = cbind(q401 = engagement$q401, q402 = engagement$q402,
q403 = engagement$q403, q404 = engagement$q404,
q405 = engagement$q405)
collab_learning = cbind(q406 = engagement$q406, q407 = engagement$q407,
q408 = engagement$q408, q409 = engagement$q409)
faculty_interaction = cbind(q412 = engagement$q412, q413 = engagement$q413,
q414 = engagement$q414, q415 = engagement$q415,
q416 = engagement$q416, q417 = engagement$q417)
diverse_perspectives = cbind(q418 = engagement$q418, q419 = engagement$q419,
q420 = engagement$q420)
# Combine only the desired subscales
engage_items = cbind(active_learning, collab_learning,
faculty_interaction, diverse_perspectives)
# Correlation matrix
M = cor(engage_items, use = "pairwise.complete.obs")
# Visualize correlation matrix
library(corrplot)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse",
number.cex = 0.7, tl.cex = 0.7)
We finally got a a positive correlation matrix so we can now go ahead and and calculate our cronbachs alpha to see if they have high internal consistency.
cronbach.eng = as.numeric(psych::alpha(engage_items)$total[1])
CI.eng = cronbach.alpha.CI(alpha = cronbach.eng, n = 664, items = 12, conf.level = 0.95)
CI.engagement = cbind(LCI = CI.eng[1], alpha = cronbach.eng, UCI = CI.eng[2])
row.names(CI.engagement) = ""
pander(CI.engagement, caption = "Confidence Interval of Cronbach's Alpha")
LCI | alpha | UCI |
---|---|---|
0.8105 | 0.8302 | 0.8487 |
With a cronbachs alpha of .7871 and a CI of (.7624, .8103) we can say that the student engagement questionair has a high level of internal consistency.
M=cor(LearnStyle)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)
We see from our correlation matrix of student learning styles questionnaire that we have high correlation between our questions.
cronbach.lea = as.numeric(psych::alpha(LearnStyle)$total[1])
CI.lea = cronbach.alpha.CI(alpha = cronbach.lea, n = 664, items = 6, conf.level = 0.95)
CI.LearnStyle = cbind(LCI = CI.lea[1], alpha = cronbach.lea, UCI = CI.lea[2])
row.names(CI.LearnStyle) = ""
pander(CI.LearnStyle, caption = "Confidence Interval of Cronbach's Alpha")
LCI | alpha | UCI |
---|---|---|
0.8325 | 0.8508 | 0.8677 |
Since we had a strong postitve correlation matrix we went ahead and did a cronbachs alpha calculation. We got an alpha level of .8508 and a CI of (.8325, .8677). These values are very high which means we have a high level of internal consitency.
M=cor(WriRead)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)
We see from our correlation matrix we have positive but slightly weak correlation. But since these are all positive we go ahead and perform a cronbachs alpha calculation.
cronbach.wri = as.numeric(psych::alpha(WriRead)$total[1])
CI.wri = cronbach.alpha.CI(alpha = cronbach.wri, n = 664, items = 3, conf.level = 0.95)
CI.WriRead = cbind(LCI = CI.wri[1], alpha = cronbach.wri, UCI = CI.wri[2])
row.names(CI.WriRead) = ""
pander(CI.WriRead, caption = "Confidence Interval of Cronbach's Alpha")
LCI | alpha | UCI |
---|---|---|
0.3965 | 0.4703 | 0.5365 |
As we have seen in the correlation matrix we did get a lower cronbachs alpha value of .4703 and CI of (.3965, .5365). With this value we cannot accept this questionnaire.
M=cor(Remed)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)
We see from this correlation matrix we have an all positive matrix with some values differing in strong and weak correlation. Since our values are positive we can go ahead and do a cronbachs alpha calculation.
cronbach.rem = as.numeric(psych::alpha(Remed)$total[1])
CI.rem = cronbach.alpha.CI(alpha = cronbach.rem, n = 664, items = 9, conf.level = 0.95)
CI.remedial = cbind(LCI = CI.rem[1], alpha = cronbach.rem, UCI = CI.rem[2])
row.names(CI.remedial) = ""
pander(CI.remedial, caption = "Confidence Interval of Cronbach's Alpha")
LCI | alpha | UCI |
---|---|---|
0.8347 | 0.8521 | 0.8684 |
Since we had a strong postitve correlation matrix we went ahead and did a cronbachs alpha calculation. We got an alpha level of .8521 and a CI of (.8347, .8684). These values are very high which means we have a high level of internal consitency.
M=cor(EngcnSupp)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)
We see from the correlation matric above we have all positive values but they seem to trend twoards the weaker side of positive. But we still proceed with a cronbachs calculation.
cronbach.enc = as.numeric(psych::alpha(EngcnSupp)$total[1])
CI.enc = cronbach.alpha.CI(alpha = cronbach.enc, n = 664, items = 7, conf.level = 0.95)
CI.EngcnSupp = cbind(LCI = CI.enc[1], alpha = cronbach.enc, UCI = CI.enc[2])
row.names(CI.EngcnSupp) = ""
pander(CI.EngcnSupp, caption = "Confidence Interval of Cronbach's Alpha")
LCI | alpha | UCI |
---|---|---|
0.7851 | 0.8083 | 0.8298 |
Unlike our previous assumption we got a higher cronbachs value than expected of .8083 and CI of (.7851, .8298). We can go ahead and say that we have high internal consistency between our questions.
M=cor(Resource)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)
We see from the correlation matrix above that we have differing almost equal levels of positive and negative values so we will have to make some subsections of our observations for this questionnair.
### FS = Frequency of Service Use
Resource_FS = cbind(Resource$q111.1, Resource$q112.1, Resource$q113.1,
Resource$q114.1, Resource$q115.1, Resource$q116.1,
Resource$q117.1, Resource$q118.1, Resource$q119.1,
Resource$q1110.1, Resource$q1111.1)
### SS = Satisfaction with Service
Resource_SS = cbind(Resource$q111.2, Resource$q112.2, Resource$q113.2,
Resource$q114.2, Resource$q115.2, Resource$q116.2,
Resource$q117.2, Resource$q118.2, Resource$q119.2,
Resource$q1110.2, Resource$q1111.2)
### IS = Importance of Service
Resource_IS = cbind(Resource$q111.3, Resource$q112.3, Resource$q113.3,
Resource$q114.3, Resource$q115.3, Resource$q116.3,
Resource$q117.3, Resource$q118.3, Resource$q119.3,
Resource$q1110.3, Resource$q1111.3)
# Correlation matrix for combined subscales
M = cor(cbind(Resource_FS, Resource_SS, Resource_IS))
corrplot.mixed(M, lower.col = "purple", upper = "ellipse",
number.cex = 0.7, tl.cex = 0.7)
With this matrix we separated it into the 3 different scales that were asked for each question. This did not help us at all in noticing any trends, again we see a lot of negative and positive correlation values. So we will try more subsections to see if anything changes there.
# Academic Support
FS_academic = cbind(Resource$q111.1, Resource$q114.1, Resource$q115.1, Resource$q1110.1)
SS_academic = cbind(Resource$q111.2, Resource$q114.2, Resource$q115.2, Resource$q1110.2)
IS_academic = cbind(Resource$q111.3, Resource$q114.3, Resource$q115.3, Resource$q1110.3)
# Career & Job Services
FS_career = cbind(Resource$q112.1, Resource$q113.1)
SS_career = cbind(Resource$q112.2, Resource$q113.2)
IS_career = cbind(Resource$q112.3, Resource$q113.3)
# Campus Life & Support
FS_life = cbind(Resource$q116.1, Resource$q117.1, Resource$q118.1, Resource$q119.1, Resource$q1111.1)
SS_life = cbind(Resource$q116.2, Resource$q117.2, Resource$q118.2, Resource$q119.2, Resource$q1111.2)
IS_life = cbind(Resource$q116.3, Resource$q117.3, Resource$q118.3, Resource$q119.3, Resource$q1111.3)
# Academic Support (combined use, satisfaction, importance)
Resource$Academic_Support = cbind(
Resource$q111.1, Resource$q114.1, Resource$q115.1, Resource$q1110.1,
Resource$q111.2, Resource$q114.2, Resource$q115.2, Resource$q1110.2,
Resource$q111.3, Resource$q114.3, Resource$q115.3, Resource$q1110.3
)
# Career & Job Services
Resource$Career_Services = rowMeans(cbind(
Resource$q112.1, Resource$q113.1,
Resource$q112.2, Resource$q113.2,
Resource$q112.3, Resource$q113.3))
# Campus Life & Support
Resource$Campus_Life = rowMeans(cbind(
Resource$q116.1, Resource$q117.1, Resource$q118.1, Resource$q119.1, Resource$q1111.1,
Resource$q116.2, Resource$q117.2, Resource$q118.2, Resource$q119.2, Resource$q1111.2,
Resource$q116.3, Resource$q117.3, Resource$q118.3, Resource$q119.3, Resource$q1111.3
))
M = cor(cbind(Resource$Academic_Support, Resource$Career_Services, Resource$Campus_Life))
corrplot.mixed(M, lower.col = "purple", upper = "ellipse",
number.cex = 0.7, tl.cex = 0.7)
This time we sectioned it off by Academic support, career services, and campus life. We see a little improvement in our matrix this time but again we see a lot of negatively correlated values. Since we cannot get these values out of the matrix we will not proceed with a cronbachs alpha calculation and assume that we cannot use this questionnaire.
M=cor(Retention)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)
We see that this correlation matrix looks almost perfect with every value is positively correlated and appears to be strong also. So we can go ahead and calculate our cronbachs alpha.
cronbach.ret = as.numeric(psych::alpha(Retention)$total[1])
CI.ret = cronbach.alpha.CI(alpha = cronbach.ret, n = 664, items = 5, conf.level = 0.95)
CI.Retention = cbind(LCI = CI.ret[1], alpha = cronbach.ret, UCI = CI.ret[2])
row.names(CI.Retention) = ""
pander(CI.Retention, caption = "Confidence Interval of Cronbach's Alpha")
LCI | alpha | UCI |
---|---|---|
0.8589 | 0.8747 | 0.8891 |
Like we had predicted we get a great cronbachs alpha of .8747 and a CI of (.8589, .8891). So we can say that this questionnair has high internal consistency.
M=cor(Pay)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)
This pay matrix has a mix of both negative and positively correlated values so we cannot automatically do a cronbachs alpha calculation so we go ahead and sub section our pay questionnari by oersonal assitance and outside assitance.
## Engagement subscales (items) - excluding negative engagement and tech use
personal_assistance = cbind(q4131 = Pay$q4131, q132 = Pay$q132)
outside_assistance = cbind(q133=Pay$q133, q134=Pay$q134, q135=Pay$q135, q136=Pay$q136)
# Combine only the desired subscales
pay_items = cbind(outside_assistance, personal_assistance )
M=cor(outside_assistance)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)
This new matrix of just outside assistance is a vast improvement of our previous matrix all the values are positively correlated and we can go ahead and perform a cronbacha alpha calculation.
cronbach.pay = as.numeric(psych::alpha(outside_assistance)$total[1])
CI.pay = cronbach.alpha.CI(alpha = cronbach.pay, n = 664, items = 4, conf.level = 0.95)
CI.Payment = cbind(LCI = CI.pay[1], alpha = cronbach.pay, UCI = CI.pay[2])
row.names(CI.Payment) = ""
pander(CI.Payment, caption = "Confidence Interval of Cronbach's Alpha")
LCI | alpha | UCI |
---|---|---|
0.5828 | 0.6309 | 0.6747 |
This calculation of a lower cronbach value of .6309 and CI of (.5828, .6747) makes sense because our correlation matrix had positive correlation but very low correlation. Since our value is at .63 we are questionable on whether to use this questionnaire.
This section presents the results of the principal component analysis. The significant principal components identified will be incorporated into the analytic dataset for use in regression modeling.
My.plotnScree(): Creates a scree plot using parallel analysis and other criteria (e.g., Kaiser criterion, optimal coordinates, acceleration factor) to help determine the number of significant components or factors to retain.
My.plotnScree = function(mat, legend = TRUE, method ="factors", main){
# mat = data matrix
# method = c("factors", "components"), default is "factors".
# main = title of the plot
ev <- eigen(cor(mat)) # get eigenvalues
ap <- parallel(subject=nrow(mat),var=ncol(mat), rep=5000,cent=.05)
nScree = nScree(x=ev$values, aparallel=ap$eigen$qevpea, model=method)
##
if (!inherits(nScree, "nScree"))
stop("Method is only for nScree objects")
if (nScree$Model == "components")
nkaiser = "Eigenvalues > mean: n = "
if (nScree$Model == "factors")
nkaiser = "Eigenvalues > zero: n = "
# axis labels
xlab = nScree$Model
ylab = "Eigenvalues"
##
par(col = 1, pch = 18)
par(mfrow = c(1, 1))
eig <- nScree$Analysis$Eigenvalues
k <- 1:length(eig)
plot(1:length(eig), eig, type="b", main = main,
xlab = xlab, ylab = ylab, ylim=c(0, 1.2*max(eig)))
#
nk <- length(eig)
noc <- nScree$Components$noc
vp.p <- lm(eig[c(noc + 1, nk)] ~ k[c(noc + 1, nk)])
x <- sum(c(1, 1) * coef(vp.p))
y <- sum(c(1, nk) * coef(vp.p))
par(col = 10)
lines(k[c(1, nk)], c(x, y))
par(col = 11, pch = 20)
lines(1:nk, nScree$Analysis$Par.Analysis, type = "b")
if (legend == TRUE) {
leg.txt <- c(paste(nkaiser, nScree$Components$nkaiser),
c(paste("Parallel Analysis: n = ", nScree$Components$nparallel)),
c(paste("Optimal Coordinates: n = ", nScree$Components$noc)),
c(paste("Acceleration Factor: n = ", nScree$Components$naf))
)
legend("topright", legend = leg.txt, pch = c(18, 20, NA, NA),
text.col = c(1, 3, 2, 4),
col = c(1, 3, 2, 4), bty="n", cex=0.7)
}
naf <- nScree$Components$naf
text(x = noc, y = eig[noc], label = " (OC)", cex = 0.7,
adj = c(0, 0), col = 2)
text(x = naf + 1, y = eig[naf + 1], label = " (AF)",
cex = 0.7, adj = c(0, 0), col = 4)
}
My.loadings.var(): Calculates and returns the factor or component loadings and the proportion of variance explained for a specified number of factors or components, using either factanal() for FA or prcomp() for PCA.
My.loadings.var <- function(mat, nfct, method="fa"){
# mat = data matrix
# nfct = number of factors or components
# method = c("fa", "pca"), default = is "fa".
if(method == "fa"){
f1 <- factanal(mat, factors = nfct, rotation = "varimax")
x <- loadings(f1)
vx <- colSums(x^2)
varSS = rbind('SS loadings' = vx,
'Proportion Var' = vx/nrow(x),
'Cumulative Var' = cumsum(vx/nrow(x)))
weight = f1$loadings[]
} else if (method == "pca"){
pca <- prcomp(mat, center = TRUE, scale = TRUE)
varSS = summary(pca)$importance[,1:nfct]
weight = pca$rotation[,1:nfct]
}
list(Loadings = weight, Prop.Var = varSS)
}
We first generate a plot to visually and statistically help us decide how many underlying dimensions are in our engage_items data. We do this before running PCA or using PCs in regression.
My.plotnScree(mat=engage_items, legend = TRUE, method ="components",
main="Determination of Number of Components\n Engagement (Positive)")
Our plot above shows that the appropriate amount of components will be around 2 so we go ahead and perform principle component analysis.
Loadings = My.loadings.var(mat=engage_items, nfct=2, method="pca")$Loadings
# pca loadings
kable(round(Loadings,3),
caption="Factor loadings of the first few PCAs and the cumulative
the proportion of variation explained by the corresponding PCAs in the
Engagement Questionaire Survey.")
PC1 | PC2 | |
---|---|---|
q412 | -0.359 | 0.039 |
q413 | -0.359 | -0.259 |
q414 | -0.364 | -0.271 |
q415 | -0.249 | -0.483 |
q416 | -0.240 | -0.390 |
q417 | -0.350 | 0.042 |
q418 | -0.377 | 0.155 |
q419 | -0.325 | 0.478 |
q420 | -0.346 | 0.468 |
VarProp = My.loadings.var(mat=engage_items, nfct=2, method="pca")$Prop.Var
# pca loadings
kable(round(VarProp,3),
caption="Cumulative and proportion of variances explained by each
the principal component in the engagement survey.")
PC1 | PC2 | |
---|---|---|
Standard deviation | 1.967 | 1.242 |
Proportion of Variance | 0.430 | 0.171 |
Cumulative Proportion | 0.430 | 0.601 |
We see when calculating our proportion of variance for PCA that we were incorrect in assuming to use two principle components we will only be using one because PCA 1 explain around 43% of the total variance. We see that our second component only explains around 17% of the total variance.
pca <- prcomp(engage_items, center = TRUE, scale = TRUE)
eng.idx = pca$x[,1]
# hist(sc.idx, breaks=10, main="Distribution of Self-compassion Index")
##
hist(eng.idx,
main="Distribution of Engagement Index",
breaks = seq(min(eng.idx), max(eng.idx), length=9),
xlab="Engagement Index",
xlim=range(eng.idx),
border="red",
col="lightblue",
freq=FALSE
)
We see here that our graph appears to be almost normal and could be
skewed to the right. Further analysis will be need on whether to do a
box cox transformation to fix the distributional issue in the regression
residuals.
My.plotnScree(mat=LearnStyle, legend = TRUE, method ="components",
main="Determination of Number of Components\n Learning Style (Positive)")
We see from the above plot that it suggests to use one principal component but we will go ahead and use two components for exploratory purposes.
Loadings = My.loadings.var(mat=engage_items, nfct=2, method="pca")$Loadings
# pca loadings
kable(round(Loadings,3),
caption="Factor loadings of the first few PCAs and the cumulative
the proportion of variation explained by the corresponding PCAs in the
Learning Style Questionaire Survey.")
PC1 | PC2 | |
---|---|---|
q412 | -0.359 | 0.039 |
q413 | -0.359 | -0.259 |
q414 | -0.364 | -0.271 |
q415 | -0.249 | -0.483 |
q416 | -0.240 | -0.390 |
q417 | -0.350 | 0.042 |
q418 | -0.377 | 0.155 |
q419 | -0.325 | 0.478 |
q420 | -0.346 | 0.468 |
VarProp = My.loadings.var(mat=LearnStyle, nfct=2, method="pca")$Prop.Var
# pca loadings
kable(round(VarProp,3),
caption="Cumulative and proportion of variances explained by each
the principal component in the Learning Style survey.")
PC1 | PC2 | |
---|---|---|
Standard deviation | 1.877 | 0.918 |
Proportion of Variance | 0.587 | 0.141 |
Cumulative Proportion | 0.587 | 0.727 |
As the plot had predicted we see from the above table that PCA one has a staggering amount of proportion of variance at around 59% and our second component only adds 14% so we will go ahead and only use the first component for further analysis.
pca <- prcomp(LearnStyle, center = TRUE, scale = TRUE)
lea.idx = pca$x[,1]
# hist(sc.idx, breaks=10, main="Distribution of Self-compassion Index")
##
hist(lea.idx,
main="Distribution of Learning Style Index",
breaks = seq(min(lea.idx), max(lea.idx), length=9),
xlab="Learning Style Index",
xlim=range(lea.idx),
border="red",
col="lightblue",
freq=FALSE
)
We see again from the plot above that our distribution may be normal but appears to be skewed to the right. We will need further analysis to see if we need to perform a box cox transformation to fix the distributional issue in the regression residuals.
My.plotnScree(mat=WriRead, legend = TRUE, method ="components",
main="Determination of Number of Components\n Writing and Reading (Positive)")
We see from the above plot that it suggests to use one principal component but we will go ahead and use two components for exploratory purposes.
Loadings = My.loadings.var(mat=WriRead, nfct=2, method="pca")$Loadings
# pca loadings
kable(round(Loadings,3),
caption="Factor loadings of the first few PCAs and the cumulative
the proportion of variation explained by the corresponding PCAs in the
Writing and Reading Questionaire Survey.")
PC1 | PC2 | |
---|---|---|
q61 | 0.676 | -0.196 |
q62 | 0.307 | 0.951 |
q63 | 0.670 | -0.237 |
VarProp = My.loadings.var(mat=WriRead, nfct=2, method="pca")$Prop.Var
# pca loadings
kable(round(VarProp,3),
caption="Cumulative and proportion of variances explained by each
the principal component in the writing and reading survey.")
PC1 | PC2 | |
---|---|---|
Standard deviation | 1.216 | 0.975 |
Proportion of Variance | 0.493 | 0.317 |
Cumulative Proportion | 0.493 | 0.810 |
We see form the above table of our proportion of variance that our initial assumption of only using one principle component was wrong because our first two principle components give us 30% or more of our proportion of variance so we will go ahead and use two principle components in our analysis.
pca <- prcomp(WriRead, center = TRUE, scale = TRUE)
wri.idx = pca$x[,1]
# hist(sc.idx, breaks=10, main="Distribution of Self-compassion Index")
##
hist(wri.idx,
main="Distribution of Writing and Reading Index",
breaks = seq(min(wri.idx), max(wri.idx), length=9),
xlab="Writing and Reading Index",
xlim=range(wri.idx),
border="red",
col="lightblue",
freq=FALSE
)
We see from the distribution plot it is approximately normal.
My.plotnScree(mat=Remed, legend = TRUE, method ="components",
main="Determination of Number of Components\n Remedial (Positive)")
We see from the above plot that it suggests to use 2 principle components in our component analysis.
Loadings = My.loadings.var(mat=Remed, nfct=2, method="pca")$Loadings
# pca loadings
kable(round(Loadings,3),
caption="Factor loadings of the first few PCAs and the cumulative
the proportion of variation explained by the corresponding PCAs in the
Remedial Questionaire Survey.")
PC1 | PC2 | |
---|---|---|
q81 | 0.127 | 0.625 |
q82 | 0.318 | -0.134 |
q83 | 0.394 | -0.266 |
q84 | 0.395 | -0.303 |
q85 | 0.363 | -0.359 |
q86 | 0.354 | 0.035 |
q87 | 0.316 | 0.287 |
q88 | 0.321 | 0.299 |
q89 | 0.334 | 0.356 |
VarProp = My.loadings.var(mat=Remed, nfct=2, method="pca")$Prop.Var
# pca loadings
kable(round(VarProp,3),
caption="Cumulative and proportion of variances explained by each
the principal component in the remedial survey.")
PC1 | PC2 | |
---|---|---|
Standard deviation | 2.067 | 1.075 |
Proportion of Variance | 0.475 | 0.128 |
Cumulative Proportion | 0.475 | 0.603 |
We see form the above principle component analysis that PCA 1 has 47.5% of proportion of variance compared to our second component which only has 12.8% because of the staggering difference we will only be using the first principle component.
pca <- prcomp(Remed, center = TRUE, scale = TRUE)
rem.idx = pca$x[,1]
# hist(sc.idx, breaks=10, main="Distribution of Self-compassion Index")
##
hist(rem.idx,
main="Distribution of Remedial Index",
breaks = seq(min(rem.idx), max(rem.idx), length=9),
xlab="Remedial Index",
xlim=range(rem.idx),
border="red",
col="lightblue",
freq=FALSE
)
We see from the distribution table above that it is skewed to the left. So we will have to perform a box cox transformation to fix the distributional issue in the regression residuals.
My.plotnScree(mat=EngcnSupp, legend = TRUE, method ="components",
main="Determination of Number of Components\n Encouragement and Support (Positive)")
We see from the above table that it suggests us to use the first two principle components so that is what we will do for our PCA.
Loadings = My.loadings.var(mat=EngcnSupp, nfct=2, method="pca")$Loadings
# pca loadings
kable(round(Loadings,3),
caption="Factor loadings of the first few PCAs and the cumulative
the proportion of variation explained by the corresponding PCAs in the
Encouragement and Support Questionaire Survey.")
PC1 | PC2 | |
---|---|---|
q91 | 0.297 | -0.378 |
q92 | 0.426 | -0.162 |
q93 | 0.416 | -0.178 |
q94 | 0.435 | 0.347 |
q95 | 0.430 | 0.331 |
q96 | 0.376 | 0.278 |
q97 | 0.203 | -0.701 |
VarProp = My.loadings.var(mat=EngcnSupp, nfct=2, method="pca")$Prop.Var
# pca loadings
kable(round(VarProp,3),
caption="Cumulative and proportion of variances explained by each
the principal component in the Encouragement and Support survey.")
PC1 | PC2 | |
---|---|---|
Standard deviation | 1.832 | 1.102 |
Proportion of Variance | 0.480 | 0.173 |
Cumulative Proportion | 0.480 | 0.653 |
We see from the above table that our first component explains 48% of the total variance while our second component only explains 17% of the total variance we will go ahead and only be using the first component in our analysis.
pca <- prcomp(EngcnSupp, center = TRUE, scale = TRUE)
enc.idx = pca$x[,1]
# hist(sc.idx, breaks=10, main="Distribution of Self-compassion Index")
##
hist(enc.idx,
main="Distribution of Encouragement and Support Index",
breaks = seq(min(enc.idx), max(enc.idx), length=9),
xlab="Encouragement and Support Index",
xlim=range(enc.idx),
border="red",
col="lightblue",
freq=FALSE
)
We see from the above table that our distribution table is approximately normal.
My.plotnScree(mat=Retention, legend = TRUE, method ="components",
main="Determination of Number of Components\n Retention (Positive)")
We see from the above table that it suggets for us to use one component but we will go ahead and use one component for exploratory purposes.
Loadings = My.loadings.var(mat=Retention, nfct=2, method="pca")$Loadings
# pca loadings
kable(round(Loadings,3),
caption="Factor loadings of the first few PCAs and the cumulative
the proportion of variation explained by the corresponding PCAs in the
Retention Questionaire Survey.")
PC1 | PC2 | |
---|---|---|
q121 | -0.458 | 0.332 |
q122 | -0.461 | 0.315 |
q123 | -0.439 | 0.312 |
q124 | -0.455 | -0.205 |
q125 | -0.422 | -0.807 |
VarProp = My.loadings.var(mat=Retention, nfct=2, method="pca")$Prop.Var
# pca loadings
kable(round(VarProp,3),
caption="Cumulative and proportion of variances explained by each
the principal component in the Retention survey.")
PC1 | PC2 | |
---|---|---|
Standard deviation | 1.835 | 0.720 |
Proportion of Variance | 0.673 | 0.104 |
Cumulative Proportion | 0.673 | 0.777 |
We see again that from the table above shows that the first PC explains 67% of our total variance while our second component only explains 10%. So from this we will only use the first component for our analysis like our initial plot suggested.
pca <- prcomp(Retention, center = TRUE, scale = TRUE)
ret.idx = pca$x[,1]
# hist(sc.idx, breaks=10, main="Distribution of Self-compassion Index")
##
hist(ret.idx,
main="Distribution of Retention Index",
breaks = seq(min(ret.idx), max(ret.idx), length=9),
xlab="Retention Index",
xlim=range(ret.idx),
border="red",
col="lightblue",
freq=FALSE
)
We see from the above plot that our plot is almost normal but it is skewed to the right so we might need to perform a box cox transformation and further analysis is needed.
My.plotnScree(mat=outside_assistance, legend = TRUE, method ="components",
main="Determination of Number of Components\n Pay (Positive)")
We see from thr plot above that it suggests to use on PC but for exploratory purposes we will go ahead and use two.
Loadings = My.loadings.var(mat=outside_assistance, nfct=2, method="pca")$Loadings
# pca loadings
kable(round(Loadings,3),
caption="Factor loadings of the first few PCAs and the cumulative
the proportion of variation explained by the corresponding PCAs in the
Pay Questionaire Survey.")
PC1 | PC2 | |
---|---|---|
q133 | -0.418 | 0.597 |
q134 | -0.586 | -0.332 |
q135 | -0.528 | -0.531 |
q136 | -0.451 | 0.501 |
VarProp = My.loadings.var(mat=outside_assistance, nfct=2, method="pca")$Prop.Var
# pca loadings
kable(round(VarProp,3),
caption="Cumulative and proportion of variances explained by each
the principal component in the pay survey.")
PC1 | PC2 | |
---|---|---|
Standard deviation | 1.391 | 0.988 |
Proportion of Variance | 0.484 | 0.244 |
Cumulative Proportion | 0.484 | 0.728 |
We see from the above table that both our components provide a staggering amount of information explains 48.4% and 24.4% of the proportion of variance. So unlike what our initial graph says we will use two principle components for our analysis.
pca <- prcomp(outside_assistance, center = TRUE, scale = TRUE)
pay.idx = pca$x[,1]
# hist(sc.idx, breaks=10, main="Distribution of Self-compassion Index")
##
hist(pay.idx,
main="Distribution of Pay Index",
breaks = seq(min(pay.idx), max(pay.idx), length=9),
xlab="Pay Index",
xlim=range(pay.idx),
border="red",
col="lightblue",
freq=FALSE
)
We see from the graph above that it is skewed to the right. So we will need to perform a box cox transformation to fix the distributional issue in the regression residuals.
How well does student engagement in inside and outside the classroom predict their overall satisfaction with their college experience?
Does the level of support and encouragement from the school and/or school community correlate with students reported development in areas such as critical thinking, communication, and job skills?
Understanding what drives student engagement is crucial for any universities aiming to improve student application rates and after college life. The first question addresses the idea that students who are more actively engaged in their learning—by participating in discussions, collaborating with peers, or interacting with instructors—may feel more connected and fulfilled in their academic experience. The survey data provides extensive detail on student engagement through 21 questions (Section 4), making it a strong candidate for giving us substantial and enough variable representing engagement. These can be used to assess how this engagement translates into satisfaction, especially since the dataset includes direct or measures of satisfaction such as retention and overall college experience.
The second question is from the assumption that support from the school academically, socially, and financially can enhance students personal and academic growth. Section 9 (“Encouragement and Support”) captures how much students feel the university emphasizes support in various fields. These responses can be linked to outcomes reported in Section 10 (“Growth and Development”), which includes self-assessments of skills such as critical thinking, communication, and job preparedness. This pairing can offer meaningful insights into how institutional investment in student support pays off in developmental outcomes.
Both questions align with the same goal of informing universities about how they could improve student satisfaction. The survey data is comprehensive, with clear question structure that can be grouped into meaningful scales. Many questions are structured with like, dislike,or not at all response options, making them applicable for correlation, regression, and factor analysis techniques. This ensures a solid analytical foundation for answering the proposed questions with statistics.