Introduction

The goal of this research project is to assess the level of student engagement and overall academic experience among undergraduate students enrolled in the business schools of two regional universities in the US. Using a comprehensive survey instrument, we aim to evaluate how students perceive their academic and social environment across multiple domains, including faculty interaction, resource availability, academic support, and peer engagement.

We will explore the relationship between key engagement indicators and demographic variables such as student classification (first-year vs. senior), gender, college affiliation, and residency status. The ultimate objective is to identify areas that support or hinder student success and satisfaction within the university environment.

The survey includes the following major sections:

Section 4: Engagement with Academic and Social Aspects of Campus Life

Section 5: Experiences with Faculty and Instructors

Section 6: Frequency of Specific Academic Experiences

Section 7: Frequency of Academic and Campus Support Usage

Section 8: Usefulness of Campus Services

Section 9: Interaction with Peers and Faculty

Section 10: Online and Campus Resources

Section 11–13: Demographic Characteristics

The purpose of our analysis is to uncover patterns in student responses and examine how these factors correlate with a student’s sense of academic belonging, satisfaction, and potential persistence to graduation. Findings from this study may inform efforts by faculty and administrators to enhance student engagement and retention across academic programs.

survey = read.csv("https://raw.githubusercontent.com/TylerBattaglini/TBattaglini/refs/heads/main/at-risk-survey-data.csv", head = TRUE)
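
The analysis below calls functions from several add-on packages. The original setup chunk is not shown in this report, so the loading block below is a sketch based on the functions used later (corrplot.mixed(), psych::alpha(), pander(), kable(), parallel(), and nScree()); the exact setup may differ.

# Packages assumed to be loaded for the analysis below (the original setup chunk is not shown).
library(corrplot)   # corrplot.mixed() correlation plots
library(psych)      # alpha() for Cronbach's alpha
library(pander)     # pander() formatted tables
library(knitr)      # kable() formatted tables
library(nFactors)   # parallel() and nScree() for scree-plot criteria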

Data Management and Analyzing Survey Instruments

The code in this section prepares the survey data for analysis by handling missing values across multiple sections of the questionnaire. First, we define a function, my.mode(), to compute the mode of any given survey item. Then, for each major category of survey responses (e.g., engagement, learning styles, writing load), the code isolates the relevant columns from the original dataset and imputes missing values in each item by replacing them with the mode.

my.mode = function(dataset){
  freq.tbl = table(dataset)
  max.freq.id=which(freq.tbl==max(freq.tbl))
  mode=names(freq.tbl[max.freq.id])
  as.numeric(mode)
}
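
Because the same mode imputation is repeated for every block of survey items below, it can also be wrapped in a small helper. The function name impute.mode is ours and does not appear in the original code; this is only a compact sketch of the same logic.

# Sketch: apply my.mode() imputation to every column of a data frame.
impute.mode = function(dat){
  for (j in seq_along(dat)) {
    dat[, j][is.na(dat[, j])] = my.mode(dat[, j])
  }
  dat
}
# Example (equivalent to the engagement block below): engagement = impute.mode(survey[, 4:24])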

Students’ Engagement in Learning

engagement = survey[, 4:24]
# Imputing with the mode in each survey item
for (i in 1:21) {
  engagement[,i][is.na(engagement[,i])]=my.mode(engagement[,i])
}

Student Learning Styles

LearnStyle = survey[, 25:30]
# Imputing with the mode in each survey item
for (i in 1:6) {
  LearnStyle[,i][is.na(LearnStyle[,i])]=my.mode(LearnStyle[,i])
}

Writing and Reading Load

WriRead = survey[, 31:33]
# Imputing with the mode in each survey item
for (i in 1:3) {
  WriRead[,i][is.na(WriRead[,i])]=my.mode(WriRead[,i])
}

Remedial Experience

Remed = survey[, 35:43]
# Imputing with the mode in each survey item
for (i in 1:9) {
  Remed[,i][is.na(Remed[,i])]=my.mode(Remed[,i])
}

Encouragement and Support

EngcnSupp = survey[, 44:50]
# Imputing with the mode in each survey item
for (i in 1:7) {
  EngcnSupp[,i][is.na(EngcnSupp[,i])]=my.mode(EngcnSupp[,i])
}

Growth and Development

GrownDev = survey[, 51:65]
# Imputing with the mode in each survey item
for (i in 1:15) {
  GrownDev[,i][is.na(GrownDev[,i])]=my.mode(GrownDev[,i])
}

Campus Resource Utilization (with three subsets of responses)

Resource = survey[, 66:98]
# Imputing with the mode in each survey item
for (i in 1:33) {
  Resource[,i][is.na(Resource[,i])]=my.mode(Resource[,i])
}

Retention

Retention = survey[, 99:103]
# Imputing with the mode in each survey item
for (i in 1:5) {
  Retention[,i][is.na(Retention[,i])]=my.mode(Retention[,i])
}

How Students Pay for College

Pay = survey[, 104:109]
# Imputing with the mode in each survey item
for (i in 1:6) {
  Pay[,i][is.na(Pay[,i])]=my.mode(Pay[,i])
}

Validity and Reliability Analyses

In this section we first compute a correlation matrix for each survey instrument. If the matrix contains many negatively correlated items, we split the instrument into subscales before proceeding; if all items are positively correlated, we compute Cronbach's alpha directly. After obtaining the alpha estimate and its confidence interval, we decide whether the instrument has high, medium, or low internal consistency.
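
The confidence intervals reported below come from a helper function cronbach.alpha.CI() defined in our setup code and not shown in this report. As a rough sketch, an F-based (Feldt-type) interval could be computed as follows; this is an assumption about the helper's implementation, not the original function.

# Sketch of an F-based (Feldt-type) confidence interval for Cronbach's alpha.
# alpha = estimated alpha, n = number of respondents, items = number of items.
alpha.CI.sketch = function(alpha, n, items, conf.level = 0.95){
  df1 = n - 1                      # respondent degrees of freedom
  df2 = (n - 1) * (items - 1)      # respondent-by-item degrees of freedom
  p = (1 - conf.level) / 2
  LCI = 1 - (1 - alpha) / qf(p, df1, df2)       # lower confidence limit
  UCI = 1 - (1 - alpha) / qf(1 - p, df1, df2)   # upper confidence limit
  c(LCI = LCI, UCI = UCI)
}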

Students’ Engagement in Learning Questionnaire

M=cor(engagement)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)

We see from this correlation matrix that we have many negatively correlated values, so we will split the items into subscales. We chose six subscales: active learning, collaborative learning, tech use, faculty interaction, diverse perspectives, and negative engagement.

## Engagement subscales (items)
active_learning = cbind(q401 = engagement$q401, q402 = engagement$q402,
                        q403 = engagement$q403, q404 = engagement$q404,
                        q405 = engagement$q405)

collab_learning = cbind(q406 = engagement$q406, q407 = engagement$q407,
                        q408 = engagement$q408, q409 = engagement$q409)

tech_use = cbind(q410 = engagement$q410, q411 = engagement$q411)

faculty_interaction = cbind(q412 = engagement$q412, q413 = engagement$q413,
                            q414 = engagement$q414, q415 = engagement$q415,
                            q416 = engagement$q416, q417 = engagement$q417)

diverse_perspectives = cbind(q418 = engagement$q418, q419 = engagement$q419,
                              q420 = engagement$q420)

negative_engagement = cbind(q421 = engagement$q421)  # or use `q421_rev` if reversed

# Combine all subscales for correlation matrix
engage_items = cbind(active_learning, collab_learning, tech_use, 
                     faculty_interaction, diverse_perspectives, negative_engagement)

# Correlation matrix
M = cor(engage_items, use = "pairwise.complete.obs")

# Pretty plot
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", 
               number.cex = 0.7, tl.cex = 0.7)

From our new correlation matrix we see a marked improvement, but the technology-use and negative-engagement items are weakly or negatively correlated with the rest and pull the overall correlations down, so we decide to remove them.

## Engagement subscales (items) - excluding negative engagement and tech use
active_learning = cbind(q401 = engagement$q401, q402 = engagement$q402,
                        q403 = engagement$q403, q404 = engagement$q404,
                        q405 = engagement$q405)

collab_learning = cbind(q406 = engagement$q406, q407 = engagement$q407,
                        q408 = engagement$q408, q409 = engagement$q409)

faculty_interaction = cbind(q412 = engagement$q412, q413 = engagement$q413,
                            q414 = engagement$q414, q415 = engagement$q415,
                            q416 = engagement$q416, q417 = engagement$q417)

diverse_perspectives = cbind(q418 = engagement$q418, q419 = engagement$q419,
                              q420 = engagement$q420)

# Combine only the desired subscales
engage_items = cbind(active_learning, collab_learning, 
                     faculty_interaction, diverse_perspectives)

# Correlation matrix
M = cor(engage_items, use = "pairwise.complete.obs")

# Visualize correlation matrix
library(corrplot)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", 
               number.cex = 0.7, tl.cex = 0.7)

We finally have an all-positive correlation matrix, so we can now calculate Cronbach's alpha to see whether the items have high internal consistency.

cronbach.eng = as.numeric(psych::alpha(engage_items)$total[1])

CI.eng = cronbach.alpha.CI(alpha = cronbach.eng, n = 664, items = 12, conf.level = 0.95)

CI.engagement = cbind(LCI = CI.eng[1], alpha = cronbach.eng, UCI = CI.eng[2])
row.names(CI.engagement) = ""

pander(CI.engagement, caption = "Confidence Interval of Cronbach's Alpha")
Confidence Interval of Cronbach’s Alpha
LCI alpha UCI
0.8105 0.8302 0.8487

With a Cronbach's alpha of 0.8302 and a 95% CI of (0.8105, 0.8487), we can say that the student engagement questionnaire has a high level of internal consistency.

Student Learning Styles

M=cor(LearnStyle)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)

The correlation matrix for the student learning styles questionnaire shows consistently high, positive correlations among the items.

cronbach.lea = as.numeric(psych::alpha(LearnStyle)$total[1])

CI.lea = cronbach.alpha.CI(alpha = cronbach.lea, n = 664, items = 6, conf.level = 0.95)

CI.LearnStyle = cbind(LCI = CI.lea[1], alpha = cronbach.lea, UCI = CI.lea[2])
row.names(CI.LearnStyle) = ""

pander(CI.LearnStyle, caption = "Confidence Interval of Cronbach's Alpha")
Confidence Interval of Cronbach’s Alpha
LCI alpha UCI
0.8325 0.8508 0.8677

Since we had a strongly positive correlation matrix, we went ahead with the Cronbach's alpha calculation. We obtained an alpha of 0.8508 with a 95% CI of (0.8325, 0.8677). These values are very high, which means we have a high level of internal consistency.

Writing and Reading Load Questionnaire

M=cor(WriRead)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)

We see from our correlation matrix that the correlations are positive but fairly weak. Since all values are positive, we go ahead and perform the Cronbach's alpha calculation.

cronbach.wri = as.numeric(psych::alpha(WriRead)$total[1])

CI.wri = cronbach.alpha.CI(alpha = cronbach.wri, n = 664, items = 3, conf.level = 0.95)

CI.WriRead = cbind(LCI = CI.wri[1], alpha = cronbach.wri, UCI = CI.wri[2])
row.names(CI.WriRead) = ""

pander(CI.WriRead, caption = "Confidence Interval of Cronbach's Alpha")
Confidence Interval of Cronbach’s Alpha
LCI alpha UCI
0.3965 0.4703 0.5365

Consistent with the weak correlations in the matrix, we obtained a lower Cronbach's alpha of 0.4703 with a 95% CI of (0.3965, 0.5365). With this value we cannot accept this questionnaire as internally consistent.

Remedial Experience Questionnaire

M=cor(Remed)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)

We see from this correlation matrix that all values are positive, with a mix of strong and weak correlations. Since the values are positive, we can go ahead with the Cronbach's alpha calculation.

cronbach.rem = as.numeric(psych::alpha(Remed)$total[1])

CI.rem = cronbach.alpha.CI(alpha = cronbach.rem, n = 664, items = 9, conf.level = 0.95)

CI.remedial = cbind(LCI = CI.rem[1], alpha = cronbach.rem, UCI = CI.rem[2])
row.names(CI.remedial) = ""

pander(CI.remedial, caption = "Confidence Interval of Cronbach's Alpha")
Confidence Interval of Cronbach’s Alpha
LCI alpha UCI
0.8347 0.8521 0.8684

Since we had a strongly positive correlation matrix, we went ahead with the Cronbach's alpha calculation. We obtained an alpha of 0.8521 with a 95% CI of (0.8347, 0.8684). These values are very high, which means we have a high level of internal consistency.

Encouragement and Support Questionnaire

M=cor(EngcnSupp)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)

We see from the correlation matrix above that all values are positive, though they trend toward the weaker side. We still proceed with the Cronbach's alpha calculation.

cronbach.enc = as.numeric(psych::alpha(EngcnSupp)$total[1])

CI.enc = cronbach.alpha.CI(alpha = cronbach.enc, n = 664, items = 7, conf.level = 0.95)

CI.EngcnSupp = cbind(LCI = CI.enc[1], alpha = cronbach.enc, UCI = CI.enc[2])
row.names(CI.EngcnSupp) = ""

pander(CI.EngcnSupp, caption = "Confidence Interval of Cronbach's Alpha")
Confidence Interval of Cronbach’s Alpha
LCI alpha UCI
0.7851 0.8083 0.8298

Contrary to our expectation, we obtained a fairly high Cronbach's alpha of 0.8083 with a 95% CI of (0.7851, 0.8298). We can therefore say that the items have high internal consistency.

Campus Resource Utilization (with three subsets of responses)

M=cor(Resource)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)

We see from the correlation matrix above a roughly even mix of positive and negative correlations, so we will need to create subsections of the items for this questionnaire.

### FS = Frequency of Service Use
Resource_FS = cbind(Resource$q111.1, Resource$q112.1, Resource$q113.1, 
                    Resource$q114.1, Resource$q115.1, Resource$q116.1, 
                    Resource$q117.1, Resource$q118.1, Resource$q119.1,
                    Resource$q1110.1, Resource$q1111.1)

### SS = Satisfaction with Service
Resource_SS = cbind(Resource$q111.2, Resource$q112.2, Resource$q113.2, 
                    Resource$q114.2, Resource$q115.2, Resource$q116.2, 
                    Resource$q117.2, Resource$q118.2, Resource$q119.2,
                    Resource$q1110.2, Resource$q1111.2)

### IS = Importance of Service
Resource_IS = cbind(Resource$q111.3, Resource$q112.3, Resource$q113.3, 
                    Resource$q114.3, Resource$q115.3, Resource$q116.3, 
                    Resource$q117.3, Resource$q118.3, Resource$q119.3,
                    Resource$q1110.3, Resource$q1111.3)



# Correlation matrix for combined subscales
M = cor(cbind(Resource_FS, Resource_SS, Resource_IS))
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", 
               number.cex = 0.7, tl.cex = 0.7)

Here we separated the items into the three response scales asked for each service. This did not reveal any clearer pattern; we still see many negative as well as positive correlations, so we will try other subsections to see if anything changes.

# Academic Support
FS_academic = cbind(Resource$q111.1, Resource$q114.1, Resource$q115.1, Resource$q1110.1)
SS_academic = cbind(Resource$q111.2, Resource$q114.2, Resource$q115.2, Resource$q1110.2)
IS_academic = cbind(Resource$q111.3, Resource$q114.3, Resource$q115.3, Resource$q1110.3)


# Career & Job Services
FS_career = cbind(Resource$q112.1, Resource$q113.1)
SS_career = cbind(Resource$q112.2, Resource$q113.2)
IS_career = cbind(Resource$q112.3, Resource$q113.3)


# Campus Life & Support
FS_life = cbind(Resource$q116.1, Resource$q117.1, Resource$q118.1, Resource$q119.1, Resource$q1111.1)
SS_life = cbind(Resource$q116.2, Resource$q117.2, Resource$q118.2, Resource$q119.2, Resource$q1111.2)
IS_life = cbind(Resource$q116.3, Resource$q117.3, Resource$q118.3, Resource$q119.3, Resource$q1111.3)



# Academic Support (combined use, satisfaction, importance)
Resource$Academic_Support = rowMeans(cbind(
  Resource$q111.1, Resource$q114.1, Resource$q115.1, Resource$q1110.1,
  Resource$q111.2, Resource$q114.2, Resource$q115.2, Resource$q1110.2,
  Resource$q111.3, Resource$q114.3, Resource$q115.3, Resource$q1110.3
))

# Career & Job Services
Resource$Career_Services = rowMeans(cbind(
  Resource$q112.1, Resource$q113.1,
  Resource$q112.2, Resource$q113.2,
  Resource$q112.3, Resource$q113.3))

# Campus Life & Support
Resource$Campus_Life = rowMeans(cbind(
  Resource$q116.1, Resource$q117.1, Resource$q118.1, Resource$q119.1, Resource$q1111.1,
  Resource$q116.2, Resource$q117.2, Resource$q118.2, Resource$q119.2, Resource$q1111.2,
  Resource$q116.3, Resource$q117.3, Resource$q118.3, Resource$q119.3, Resource$q1111.3
))

M = cor(cbind(Resource$Academic_Support, Resource$Career_Services, Resource$Campus_Life))
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", 
               number.cex = 0.7, tl.cex = 0.7)

This time we grouped the items into academic support, career services, and campus life composites. The matrix improves slightly, but many negative correlations remain. Since we cannot remove them, we will not proceed with a Cronbach's alpha calculation and will not use this questionnaire in further analysis.

Retention Questionnaire

M=cor(Retention)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)

This correlation matrix looks almost ideal: every value is positively correlated, and most correlations appear strong. We can therefore go ahead and calculate Cronbach's alpha.

cronbach.ret = as.numeric(psych::alpha(Retention)$total[1])

CI.ret = cronbach.alpha.CI(alpha = cronbach.ret, n = 664, items = 5, conf.level = 0.95)

CI.Retention = cbind(LCI = CI.ret[1], alpha = cronbach.ret, UCI = CI.ret[2])
row.names(CI.Retention) = ""

pander(CI.Retention, caption = "Confidence Interval of Cronbach's Alpha")
Confidence Interval of Cronbach’s Alpha
LCI alpha UCI
0.8589 0.8747 0.8891

As predicted, we obtain a high Cronbach's alpha of 0.8747 with a 95% CI of (0.8589, 0.8891), so this questionnaire has high internal consistency.

How Students Pay for College Questionnaire

M=cor(Pay)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)

This pay matrix has a mix of negatively and positively correlated values, so we cannot compute Cronbach's alpha directly. Instead, we subsection the pay questionnaire into personal assistance and outside assistance.

## Pay subscales (items) - personal assistance vs. outside assistance
personal_assistance = cbind(q131 = Pay$q131, q132 = Pay$q132)

outside_assistance = cbind(q133=Pay$q133, q134=Pay$q134, q135=Pay$q135, q136=Pay$q136)

# Combine only the desired subscales
pay_items = cbind(outside_assistance, personal_assistance )

M=cor(outside_assistance)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)

This new matrix for the outside-assistance items is a clear improvement over the previous one: all values are positively correlated, so we can go ahead and perform the Cronbach's alpha calculation.

cronbach.pay = as.numeric(psych::alpha(outside_assistance)$total[1])

CI.pay = cronbach.alpha.CI(alpha = cronbach.pay, n = 664, items = 4, conf.level = 0.95)

CI.Payment = cbind(LCI = CI.pay[1], alpha = cronbach.pay, UCI = CI.pay[2])
row.names(CI.Payment) = ""

pander(CI.Payment, caption = "Confidence Interval of Cronbach's Alpha")
Confidence Interval of Cronbach’s Alpha
LCI alpha UCI
0.5828 0.6309 0.6747

The lower Cronbach's alpha of 0.6309 with a 95% CI of (0.5828, 0.6747) is consistent with the weak, though positive, correlations in the matrix. With alpha at about 0.63, it is questionable whether this questionnaire should be used.

PCA

This section presents the results of the principal component analysis. The significant principal components identified will be incorporated into the analytic dataset for use in regression modeling.

My.plotnScree(): Creates a scree plot using parallel analysis and other criteria (e.g., Kaiser criterion, optimal coordinates, acceleration factor) to help determine the number of significant components or factors to retain.

My.plotnScree = function(mat, legend = TRUE, method ="factors", main){
    # mat = data matrix
    # method = c("factors", "components"), default is "factors".
    # main = title of the plot
    ev <- eigen(cor(mat))    # get eigenvalues
    ap <- parallel(subject=nrow(mat),var=ncol(mat), rep=5000,cent=.05)
    nScree = nScree(x=ev$values, aparallel=ap$eigen$qevpea, model=method)  
    ##
    if (!inherits(nScree, "nScree")) 
        stop("Method is only for nScree objects")
    if (nScree$Model == "components") 
        nkaiser = "Eigenvalues > mean: n = "
    if (nScree$Model == "factors") 
      nkaiser = "Eigenvalues > zero: n = "
    # axis labels
    xlab = nScree$Model
    ylab = "Eigenvalues"
    ##
    par(col = 1, pch = 18)
    par(mfrow = c(1, 1))
    eig <- nScree$Analysis$Eigenvalues
    k <- 1:length(eig)
    plot(1:length(eig), eig, type="b", main = main, 
        xlab = xlab, ylab = ylab, ylim=c(0, 1.2*max(eig)))
    #
    nk <- length(eig)
    noc <- nScree$Components$noc
    vp.p <- lm(eig[c(noc + 1, nk)] ~ k[c(noc + 1, nk)])
    x <- sum(c(1, 1) * coef(vp.p))
    y <- sum(c(1, nk) * coef(vp.p))
    par(col = 10)
    lines(k[c(1, nk)], c(x, y))
    par(col = 11, pch = 20)
    lines(1:nk, nScree$Analysis$Par.Analysis, type = "b")
    if (legend == TRUE) {
        leg.txt <- c(paste(nkaiser, nScree$Components$nkaiser), 
                   c(paste("Parallel Analysis: n = ", nScree$Components$nparallel)), 
                   c(paste("Optimal Coordinates: n = ", nScree$Components$noc)), 
                   c(paste("Acceleration Factor: n = ", nScree$Components$naf))
                   )
        legend("topright", legend = leg.txt, pch = c(18, 20, NA, NA), 
                           text.col = c(1, 3, 2, 4), 
                           col = c(1, 3, 2, 4), bty="n", cex=0.7)
    }
    naf <- nScree$Components$naf
    text(x = noc, y = eig[noc], label = " (OC)", cex = 0.7, 
        adj = c(0, 0), col = 2)
    text(x = naf + 1, y = eig[naf + 1], label = " (AF)", 
        cex = 0.7, adj = c(0, 0), col = 4)
}

My.loadings.var(): Calculates and returns the factor or component loadings and the proportion of variance explained for a specified number of factors or components, using either factanal() for FA or prcomp() for PCA.

My.loadings.var <- function(mat, nfct, method="fa"){
   # mat =  data matrix
   # nfct = number of factors or components
   # method = c("fa", "pca"), default = is "fa".
    if(method == "fa"){ 
     f1 <- factanal(mat, factors = nfct,  rotation = "varimax")
     x <- loadings(f1)
     vx <- colSums(x^2)
     varSS = rbind('SS loadings' = vx,
            'Proportion Var' = vx/nrow(x),
           'Cumulative Var' = cumsum(vx/nrow(x)))
     weight = f1$loadings[] 
   } else if (method == "pca"){
     pca <- prcomp(mat, center = TRUE, scale = TRUE)
     varSS = summary(pca)$importance[,1:nfct]
     weight = pca$rotation[,1:nfct]
  }
    list(Loadings = weight, Prop.Var = varSS)
}

Engagement Survey PCA

We first generate a plot to visually and statistically help us decide how many underlying dimensions are in our engage_items data. We do this before running PCA or using PCs in regression.

My.plotnScree(mat=engage_items, legend = TRUE, method ="components", 
              main="Determination of Number of Components\n Engagement (Positive)")

The plot above suggests that about two components are appropriate, so we go ahead and perform principal component analysis.

Loadings = My.loadings.var(mat=engage_items, nfct=2, method="pca")$Loadings
# pca loadings
kable(round(Loadings,3),
    caption="Factor loadings of the first few PCs and the cumulative
    proportion of variation explained by the corresponding PCs in the 
    Engagement Questionnaire.")
Factor loadings of the first few PCs and the cumulative proportion of variation explained by the corresponding PCs in the Engagement Questionnaire.
PC1 PC2
q412 -0.359 0.039
q413 -0.359 -0.259
q414 -0.364 -0.271
q415 -0.249 -0.483
q416 -0.240 -0.390
q417 -0.350 0.042
q418 -0.377 0.155
q419 -0.325 0.478
q420 -0.346 0.468
VarProp = My.loadings.var(mat=engage_items, nfct=2, method="pca")$Prop.Var
# proportion of variance explained by each PC
kable(round(VarProp,3),
    caption="Cumulative and individual proportions of variance explained by 
    each principal component in the engagement survey.")
Cumulative and individual proportions of variance explained by each principal component in the engagement survey.
PC1 PC2
Standard deviation 1.967 1.242
Proportion of Variance 0.430 0.171
Cumulative Proportion 0.430 0.601

The proportion-of-variance table shows that our assumption of two principal components should be revised: PC1 explains about 43% of the total variance, while the second component explains only about 17%, so we will use only the first component.

pca <- prcomp(engage_items, center = TRUE, scale = TRUE)
eng.idx = pca$x[,1]
##
hist(eng.idx,
main="Distribution of Engagement Index",
breaks = seq(min(eng.idx), max(eng.idx), length=9),
xlab="Engagement Index",
xlim=range(eng.idx),
border="red",
col="lightblue",
freq=FALSE
)

The histogram appears roughly normal with a possible right skew. Further analysis will be needed to decide whether a Box-Cox transformation is required to address distributional issues in the regression residuals.
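
As a preview of that check, a Box-Cox analysis could be run once the index is used in a regression model. The sketch below is illustrative only: the intercept-only model and the shift constant are our assumptions, and PC scores must be shifted to be positive before boxcox() can be applied.

library(MASS)   # boxcox()
# Shift the index so all values are positive, as required by the Box-Cox family.
eng.shift = eng.idx - min(eng.idx) + 1
bc = boxcox(lm(eng.shift ~ 1), lambda = seq(-2, 2, by = 0.1), plotit = FALSE)
lambda.hat = bc$x[which.max(bc$y)]   # lambda maximizing the profile log-likelihood
lambda.hat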

Student Learning Styles PCA

My.plotnScree(mat=LearnStyle, legend = TRUE, method ="components", 
              main="Determination of Number of Components\n Learning Style (Positive)")

The plot above suggests retaining one principal component, but we will go ahead and use two components for exploratory purposes.

Loadings = My.loadings.var(mat=LearnStyle, nfct=2, method="pca")$Loadings
# pca loadings
kable(round(Loadings,3),
    caption="Factor loadings of the first few PCs and the cumulative
    proportion of variation explained by the corresponding PCs in the 
    Learning Style Questionnaire.")
Factor loadings of the first few PCs and the cumulative proportion of variation explained by the corresponding PCs in the Learning Style Questionnaire.
VarProp = My.loadings.var(mat=LearnStyle, nfct=2, method="pca")$Prop.Var
# proportion of variance explained by each PC
kable(round(VarProp,3),
    caption="Cumulative and individual proportions of variance explained by 
    each principal component in the Learning Style survey.")
Cumulative and individual proportions of variance explained by each principal component in the Learning Style survey.
PC1 PC2
Standard deviation 1.877 0.918
Proportion of Variance 0.587 0.141
Cumulative Proportion 0.587 0.727

As the plot predicted, the table above shows that PC1 explains a large share of the variance, about 59%, while the second component adds only about 14%, so we will use only the first component for further analysis.

pca <- prcomp(LearnStyle, center = TRUE, scale = TRUE)
lea.idx = pca$x[,1]
##
hist(lea.idx,
main="Distribution of Learning Style Index",
breaks = seq(min(lea.idx), max(lea.idx), length=9),
xlab="Learning Style Index",
xlim=range(lea.idx),
border="red",
col="lightblue",
freq=FALSE
)

Again, the plot above suggests the distribution is roughly normal but somewhat right-skewed. Further analysis will determine whether a Box-Cox transformation is needed to address distributional issues in the regression residuals.

Writing and Reading PCA

My.plotnScree(mat=WriRead, legend = TRUE, method ="components", 
              main="Determination of Number of Components\n Writing and Reading (Positive)")

The plot above suggests using one principal component, but we will go ahead and use two components for exploratory purposes.

Loadings = My.loadings.var(mat=WriRead, nfct=2, method="pca")$Loadings
# pca loadings
kable(round(Loadings,3),
    caption="Factor loadings of the first few PCs and the cumulative
    proportion of variation explained by the corresponding PCs in the 
    Writing and Reading Questionnaire.")
Factor loadings of the first few PCs and the cumulative proportion of variation explained by the corresponding PCs in the Writing and Reading Questionnaire.
PC1 PC2
q61 0.676 -0.196
q62 0.307 0.951
q63 0.670 -0.237
VarProp = My.loadings.var(mat=WriRead, nfct=2, method="pca")$Prop.Var
# proportion of variance explained by each PC
kable(round(VarProp,3),
    caption="Cumulative and individual proportions of variance explained by 
    each principal component in the writing and reading survey.")
Cumulative and individual proportions of variance explained by each principal component in the writing and reading survey.
PC1 PC2
Standard deviation 1.216 0.975
Proportion of Variance 0.493 0.317
Cumulative Proportion 0.493 0.810

We see from the proportion-of-variance table above that our initial plan to use only one principal component should be revised: the first two components explain about 49% and 32% of the variance, respectively, so we will use two principal components in our analysis.

pca <- prcomp(WriRead, center = TRUE, scale = TRUE)
wri.idx = pca$x[,1]
##
hist(wri.idx,
main="Distribution of Writing and Reading Index",
breaks = seq(min(wri.idx), max(wri.idx), length=9),
xlab="Writing and Reading Index",
xlim=range(wri.idx),
border="red",
col="lightblue",
freq=FALSE
)

We see from the distribution plot that the index is approximately normal.

Remedial PCA

My.plotnScree(mat=Remed, legend = TRUE, method ="components", 
              main="Determination of Number of Components\n Remedial (Positive)")

The plot above suggests using two principal components in our component analysis.

Loadings = My.loadings.var(mat=Remed, nfct=2, method="pca")$Loadings
# pca loadings
kable(round(Loadings,3),
    caption="Factor loadings of the first few PCs and the cumulative
    proportion of variation explained by the corresponding PCs in the 
    Remedial Questionnaire.")
Factor loadings of the first few PCs and the cumulative proportion of variation explained by the corresponding PCs in the Remedial Questionnaire.
PC1 PC2
q81 0.127 0.625
q82 0.318 -0.134
q83 0.394 -0.266
q84 0.395 -0.303
q85 0.363 -0.359
q86 0.354 0.035
q87 0.316 0.287
q88 0.321 0.299
q89 0.334 0.356
VarProp = My.loadings.var(mat=Remed, nfct=2, method="pca")$Prop.Var
# proportion of variance explained by each PC
kable(round(VarProp,3),
    caption="Cumulative and individual proportions of variance explained by 
    each principal component in the remedial survey.")
Cumulative and individual proportions of variance explained by each principal component in the remedial survey.
PC1 PC2
Standard deviation 2.067 1.075
Proportion of Variance 0.475 0.128
Cumulative Proportion 0.475 0.603

The principal component analysis above shows that PC1 explains 47.5% of the variance compared with only 12.8% for the second component; because of this large difference, we will use only the first principal component.

pca <- prcomp(Remed, center = TRUE, scale = TRUE)
rem.idx = pca$x[,1]
##
hist(rem.idx,
main="Distribution of Remedial Index",
breaks = seq(min(rem.idx), max(rem.idx), length=9),
xlab="Remedial Index",
xlim=range(rem.idx),
border="red",
col="lightblue",
freq=FALSE
)

The histogram above shows that the index is skewed to the left, so we will likely need to perform a Box-Cox transformation to address distributional issues in the regression residuals.

Encouragement and Support PCA

My.plotnScree(mat=EngcnSupp, legend = TRUE, method ="components", 
              main="Determination of Number of Components\n Encouragement and Support (Positive)")

The plot above suggests using the first two principal components, so that is what we will do for our PCA.

Loadings = My.loadings.var(mat=EngcnSupp, nfct=2, method="pca")$Loadings
# pca loadings
kable(round(Loadings,3),
    caption="Factor loadings of the first few PCs and the cumulative
    proportion of variation explained by the corresponding PCs in the 
    Encouragement and Support Questionnaire.")
Factor loadings of the first few PCs and the cumulative proportion of variation explained by the corresponding PCs in the Encouragement and Support Questionnaire.
PC1 PC2
q91 0.297 -0.378
q92 0.426 -0.162
q93 0.416 -0.178
q94 0.435 0.347
q95 0.430 0.331
q96 0.376 0.278
q97 0.203 -0.701
VarProp = My.loadings.var(mat=EngcnSupp, nfct=2, method="pca")$Prop.Var
# proportion of variance explained by each PC
kable(round(VarProp,3),
    caption="Cumulative and individual proportions of variance explained by 
    each principal component in the Encouragement and Support survey.")
Cumulative and individual proportions of variance explained by each principal component in the Encouragement and Support survey.
PC1 PC2
Standard deviation 1.832 1.102
Proportion of Variance 0.480 0.173
Cumulative Proportion 0.480 0.653

The table above shows that the first component explains about 48% of the total variance while the second explains only about 17%, so we will use only the first component in our analysis.

pca <- prcomp(EngcnSupp, center = TRUE, scale = TRUE)
enc.idx = pca$x[,1]
##
hist(enc.idx,
main="Distribution of Encouragement and Support Index",
breaks = seq(min(enc.idx), max(enc.idx), length=9),
xlab="Encouragement and Support Index",
xlim=range(enc.idx),
border="red",
col="lightblue",
freq=FALSE
)

The histogram above shows an approximately normal distribution.

Retention PCA

My.plotnScree(mat=Retention, legend = TRUE, method ="components", 
              main="Determination of Number of Components\n Retention (Positive)")

The plot above suggests retaining one component, but we will go ahead and examine two components for exploratory purposes.

Loadings = My.loadings.var(mat=Retention, nfct=2, method="pca")$Loadings
# pca loadings
kable(round(Loadings,3),
    caption="Factor loadings of the first few PCs and the cumulative
    proportion of variation explained by the corresponding PCs in the 
    Retention Questionnaire.")
Factor loadings of the first few PCs and the cumulative proportion of variation explained by the corresponding PCs in the Retention Questionnaire.
PC1 PC2
q121 -0.458 0.332
q122 -0.461 0.315
q123 -0.439 0.312
q124 -0.455 -0.205
q125 -0.422 -0.807
VarProp = My.loadings.var(mat=Retention, nfct=2, method="pca")$Prop.Var
# proportion of variance explained by each PC
kable(round(VarProp,3),
    caption="Cumulative and individual proportions of variance explained by 
    each principal component in the Retention survey.")
Cumulative and individual proportions of variance explained by each principal component in the Retention survey.
PC1 PC2
Standard deviation 1.835 0.720
Proportion of Variance 0.673 0.104
Cumulative Proportion 0.673 0.777

The table above shows that the first PC explains about 67% of the total variance while the second explains only about 10%, so we will use only the first component in our analysis, as the initial plot suggested.

pca <- prcomp(Retention, center = TRUE, scale = TRUE)
ret.idx = pca$x[,1]
##
hist(ret.idx,
main="Distribution of Retention Index",
breaks = seq(min(ret.idx), max(ret.idx), length=9),
xlab="Retention Index",
xlim=range(ret.idx),
border="red",
col="lightblue",
freq=FALSE
)

The plot above is close to normal but skewed to the right, so we might need a Box-Cox transformation; further analysis is needed.

Pay PCA

My.plotnScree(mat=outside_assistance, legend = TRUE, method ="components", 
              main="Determination of Number of Components\n Pay (Positive)")

The plot above suggests using one PC, but for exploratory purposes we will go ahead and use two.

Loadings = My.loadings.var(mat=outside_assistance, nfct=2, method="pca")$Loadings
# pca loadings
kable(round(Loadings,3),
    caption="Factor loadings of the first few PCs and the cumulative
    proportion of variation explained by the corresponding PCs in the 
    Pay Questionnaire.")
Factor loadings of the first few PCs and the cumulative proportion of variation explained by the corresponding PCs in the Pay Questionnaire.
PC1 PC2
q133 -0.418 0.597
q134 -0.586 -0.332
q135 -0.528 -0.531
q136 -0.451 0.501
VarProp = My.loadings.var(mat=outside_assistance, nfct=2, method="pca")$Prop.Var
# proportion of variance explained by each PC
kable(round(VarProp,3),
    caption="Cumulative and individual proportions of variance explained by 
    each principal component in the pay survey.")
Cumulative and individual proportions of variance explained by each principal component in the pay survey.
PC1 PC2
Standard deviation 1.391 0.988
Proportion of Variance 0.484 0.244
Cumulative Proportion 0.484 0.728

The table above shows that both components contribute substantially, explaining 48.4% and 24.4% of the variance, respectively. So, unlike what the initial plot suggested, we will use two principal components in our analysis.

pca <- prcomp(outside_assistance, center = TRUE, scale = TRUE)
pay.idx = pca$x[,1]
##
hist(pay.idx,
main="Distribution of Pay Index",
breaks = seq(min(pay.idx), max(pay.idx), length=9),
xlab="Pay Index",
xlim=range(pay.idx),
border="red",
col="lightblue",
freq=FALSE
)

The graph above shows a right-skewed distribution, so we will likely need a Box-Cox transformation to address distributional issues in the regression residuals.
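
Before turning to the project questions, the retained indices can be collected into a single analytic data set, as anticipated in the PCA introduction. The data frame name analytic is ours; this is only a sketch of the assembly step.

# Sketch: combine the PCA-based indices into one analytic data set.
analytic = data.frame(eng.idx, lea.idx, wri.idx, rem.idx, enc.idx, ret.idx, pay.idx)
str(analytic)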

Project Questions

  1. How well does student engagement inside and outside the classroom predict students’ overall satisfaction with their college experience?

  2. Does the level of support and encouragement from the school and/or the school community correlate with students’ reported development in areas such as critical thinking, communication, and job skills?

Understanding what drives student engagement is crucial for universities aiming to improve recruitment and students’ outcomes after college. The first question addresses the idea that students who are more actively engaged in their learning (participating in discussions, collaborating with peers, or interacting with instructors) may feel more connected and fulfilled in their academic experience. The survey provides extensive detail on student engagement through 21 questions (Section 4), giving us a substantial set of variables representing engagement. These can be used to assess how engagement translates into satisfaction, especially since the dataset includes direct or proxy measures of satisfaction such as retention and overall college experience.
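
As an illustration of how Question 1 could be examined with the indices constructed above, the retention index can serve as a proxy for overall satisfaction. The model below is a sketch, not a final specification.

# Sketch: regress the retention/satisfaction index on the engagement index.
q1.fit = lm(ret.idx ~ eng.idx)
summary(q1.fit)          # slope and R-squared indicate predictive strength
plot(q1.fit, which = 1)  # residuals vs. fitted values, to check model assumptions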

The second question rests on the assumption that academic, social, and financial support from the school can enhance students’ personal and academic growth. Section 9 (“Encouragement and Support”) captures how much students feel the university emphasizes support in various areas. These responses can be linked to outcomes reported in Section 10 (“Growth and Development”), which includes self-assessments of skills such as critical thinking, communication, and job preparedness. This pairing can offer meaningful insight into how institutional investment in student support pays off in developmental outcomes.
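
A comparable sketch for Question 2 relates the encouragement-and-support index to a simple mean of the growth-and-development items. Using rowMeans(GrownDev) as a provisional growth score is our assumption for illustration, since those items were not reduced by PCA above.

# Sketch: relate perceived support to self-reported growth and development.
growth.score = rowMeans(GrownDev)       # provisional composite growth score
cor.test(enc.idx, growth.score)         # Pearson correlation with CI and p-value
q2.fit = lm(growth.score ~ enc.idx)
summary(q2.fit)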

Both questions align with the same goal of informing universities about how they could improve student satisfaction. The survey data are comprehensive, with a clear question structure that can be grouped into meaningful scales. Many questions use ordered response options (like, dislike, or not at all), making them suitable for correlation, regression, and factor analysis techniques. This provides a solid analytical foundation for answering the proposed questions statistically.