Student satisfaction analysis can be very useful to academic institutions to help them understand the student learning experience and identify areas for improvement. Student satisfaction is a multifaceted concept, which builds on many different overlapping experiences, making it difficult to measure.
The processed dataset was extracted from an online survey system based on a comprehensive survey that contains several subscales to measure students’ satisfaction and experience. Both single-item and multi-item survey questionnaires are used in the instrument with a total of 106 items.
The study population was defined to be all college business undergraduate students from two regional (north) universities. The sizes of the two samples taken from the two universities were 151 and 183 respectively and came from all grade levels. Students were invited to participate in the survey voluntarily. The survey was completely anonymous. The participation rate was unknown. A non-probability sampling plan was used in this survey.
The overall research question is to identify factors that can be used to make interventions and improve students’ outcomes and, hence, satisfaction.
setwd("C:/Users/Gianna/Documents/STA 490")
survey <- read.csv("at-risk-survey-data.csv", head = TRUE)
#county number of missing values
sapply(survey, function(x) sum(is.na(x)))
## q1 q2 q3 q41 q42 q43 q44 q45 q46 q47
## 332 332 332 332 332 332 332 332 332 332
## q48 q49 q410 q411 q412 q413 q414 q415 q416 q417
## 332 332 332 332 332 332 332 332 332 332
## q418 q419 q420 q421 q51 q52 q53 q54 q55 q56
## 332 332 332 332 332 332 332 332 332 332
## q61 q62 q63 q7 q81 q82 q83 q84 q85 q86
## 332 332 332 332 332 332 332 332 332 332
## q87 q88 q89 q91 q92 q93 q94 q95 q96 q97
## 332 332 332 332 332 332 332 332 332 332
## q101 q102 q103 q104 q105 q106 q107 q108 q109 q1010
## 332 332 332 332 332 332 332 332 332 332
## q1011 q1012 q1013 q1014 q1015 q111.1 q111.2 q111.3 q112.1 q112.2
## 332 332 332 332 332 332 332 332 332 332
## q112.3 q113.1 q113.2 q113.3 q114.1 q114.2 q114.3 q115.1 q115.2 q115.3
## 332 332 332 332 332 332 332 332 332 332
## q116.1 q116.2 q116.3 q117.1 q117.2 q117.3 q118.1 q118.2 q118.3 q119.1
## 332 332 332 332 332 332 332 332 332 332
## q119.2 q119.3 q1110.1 q1110.2 q1110.3 q1111.1 q1111.2 q1111.3 q121 q122
## 332 332 332 332 332 332 332 332 332 332
## q123 q124 q125 q131 q132 q133 q134 q135 q136 q14
## 332 332 332 332 332 332 332 332 332 332
## q15 q16 q17 q18 q19 q20 q21 q22 q23 q24
## 332 332 332 332 332 332 332 332 332 332
## q25
## 332
## DATA has a missing row after every input row. Looking at the csv file shows us that a space row is put in between each observation. This row can be removed since it does not represent an observation
survey_na <- survey[complete.cases(survey), ] #left with 332 observations
There are now no more missing values.
my.mode = function(dataset){
freq.tbl = table(dataset)
max.freq.id=which(freq.tbl==max(freq.tbl))
mode=names(freq.tbl[max.freq.id])
as.numeric(mode)
}
The following questions were selected to help us determine student satisfaction. Q17 or Q18 could be our response variables if we do linear regression later.
We already took care of all of the missing values so that is something we don’t need to worry about.
library(tidyverse)
satisfaction <- survey_na %>%
dplyr::select(q62,q92,q95,q96,q15,q17,q18)
We start with some correlation plots to see the relevance of the PCA procedure on the satisfaction data.
library(corrplot)
M=cor(satisfaction)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)
The shape of an ellipse represents the correlation. The skinnier the
ellipse, the higher the correlation. The direction reflects whether a
correlation is positive or negative. The off-diagonal direction implies
s positive correlation while the main diagonal direction implies a
negative association.
Since Likert scales of Q6.2 are in reverse order in the design. We transform back the usual order and create a new dataset using the same variable names.
satisfaction$q62 <- 6-satisfaction$q62
We will redue the correlation plots below. Now all of the questions have the correct correlations.
M=cor(satisfaction)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)
With the adjusted scores in the satisfaction instrument, we calculate one of the commonly used internal consistency reliability Cronbach alpha as follows.
Next, we find the Cronbach alpha and its 95% confidence interval.
cronbach.sc = as.numeric(alpha(satisfaction)$total[1])
CI.sc = cronbach.alpha.CI(alpha=cronbach.sc, n=332, items=7, conf.level = 0.95)
CI.comp = cbind(LCI = CI.sc[1], alpha = cronbach.sc, UCI =CI.sc[2])
row.names(CI.comp) = ""
kable(CI.comp, caption="Confodence Interval of Cranbach Alpha")
| LCI | alpha | UCI | |
|---|---|---|---|
| 0.5431435 | 0.610644 | 0.6714143 |
We can see that Cronbach’s alpha is 0.54 with a 95% confidence interval (0.61, 0.67) suggesting that the items in the satisfaction instrument have moderate internal consistency.
For convenience, we define the following two R functions and use them in the subsequent PCA.
My.plotnScree = function(mat, legend = TRUE, method ="factors", main){
# mat = data matrix
# method = c("factors", "components"), default is "factors".
# main = title of the plot
ev <- eigen(cor(mat)) # get eigenvalues
ap <- parallel(subject=nrow(mat),var=ncol(mat), rep=5000,cent=.05)
nScree = nScree(x=ev$values, aparallel=ap$eigen$qevpea, model=method)
##
if (!inherits(nScree, "nScree"))
stop("Method is only for nScree objects")
if (nScree$Model == "components")
nkaiser = "Eigenvalues > mean: n = "
if (nScree$Model == "factors")
nkaiser = "Eigenvalues > zero: n = "
# axis labels
xlab = nScree$Model
ylab = "Eigenvalues"
##
par(col = 1, pch = 18)
par(mfrow = c(1, 1))
eig <- nScree$Analysis$Eigenvalues
k <- 1:length(eig)
plot(1:length(eig), eig, type="b", main = main,
xlab = xlab, ylab = ylab, ylim=c(0, 1.2*max(eig)))
#
nk <- length(eig)
noc <- nScree$Components$noc
vp.p <- lm(eig[c(noc + 1, nk)] ~ k[c(noc + 1, nk)])
x <- sum(c(1, 1) * coef(vp.p))
y <- sum(c(1, nk) * coef(vp.p))
par(col = 10)
lines(k[c(1, nk)], c(x, y))
par(col = 11, pch = 20)
lines(1:nk, nScree$Analysis$Par.Analysis, type = "b")
if (legend == TRUE) {
leg.txt <- c(paste(nkaiser, nScree$Components$nkaiser),
c(paste("Parallel Analysis: n = ", nScree$Components$nparallel)),
c(paste("Optimal Coordinates: n = ", nScree$Components$noc)),
c(paste("Acceleration Factor: n = ", nScree$Components$naf))
)
legend("topright", legend = leg.txt, pch = c(18, 20, NA, NA),
text.col = c(1, 3, 2, 4),
col = c(1, 3, 2, 4), bty="n", cex=0.7)
}
naf <- nScree$Components$naf
text(x = noc, y = eig[noc], label = " (OC)", cex = 0.7,
adj = c(0, 0), col = 2)
text(x = naf + 1, y = eig[naf + 1], label = " (AF)",
cex = 0.7, adj = c(0, 0), col = 4)
}
# example
# My.plotnScree(mat=compassion, legend = TRUE, method ="factors",
# main = "Number of Factors to Retain")
My.loadings.var produces a list of two objects: factor loadings and proportion variance explained by each factor. There are no existing R functions that can be used to extract the proportion of variance from the output of factanal(). The function can also extract similar information from the output of a PCA but we need to specify the method in the argument.
My.loadings.var <- function(mat, nfct, method="fa"){
# mat = data matrix
# nfct = number of factors or components
# method = c("fa", "pca"), default = is "fa".
if(method == "fa"){
f1 <- factanal(mat, factors = nfct, rotation = "varimax")
x <- loadings(f1)
vx <- colSums(x^2)
varSS = rbind('SS loadings' = vx,
'Proportion Var' = vx/nrow(x),
'Cumulative Var' = cumsum(vx/nrow(x)))
weight = f1$loadings[]
} else if (method == "pca"){
pca <- prcomp(mat, center = TRUE, scale = TRUE)
varSS = summary(pca)$importance[,1:nfct]
weight = pca$rotation[,1:nfct]
}
list(Loadings = weight, Prop.Var = varSS)
}
# example
# My.loadings.var(mat, nfct=3, method="pca")
Using the scree plot below, we are going to select the first three PCs.
My.plotnScree(mat=satisfaction, legend = TRUE, method ="components",
main="Determination of Number of Components\n Satisfaction (Positive)")
Different methods of identification of the number of principal
components to be retained in exploratory analysis: Kaiser’s eigenvalue
rule, Raiche et al Monte Carlo simulation method (parallel analysis),
optimal coordinate (OC) index, and accelerate factor (AF) method.
The above Figure indicates that it is sufficient to retain the first principle component for the subsequent analysis. In the following, we will extract the first two PCAs. The PCA factor loadings and the proportion of variance explained by the retained PCAs are summarized in the following tables.
Loadings = My.loadings.var(mat=satisfaction, nfct=3, method="pca")$Loadings
#
# pca loadings
kable(round(Loadings,4),
caption="Factor loadings of the first few PCAs and the cumulative proportion
of variation explained by the corresponding PCAs in the satisfaction survey.")
| PC1 | PC2 | PC3 | |
|---|---|---|---|
| q62 | 0.1074 | 0.0420 | 0.9637 |
| q92 | 0.4753 | -0.0794 | -0.1088 |
| q95 | 0.4754 | -0.2788 | -0.0239 |
| q96 | 0.4244 | -0.2793 | 0.1322 |
| q15 | 0.0041 | -0.7764 | -0.0804 |
| q17 | 0.4174 | 0.3095 | -0.0450 |
| q18 | 0.4268 | 0.3711 | -0.1814 |
VarProp = My.loadings.var(mat=satisfaction, nfct=3, method="pca")$Prop.Var
# pca loadings
kable(round(VarProp,4),
caption="Cumulative and proportion of variances explained by each
the principal component in the satisfaction survey.")
| PC1 | PC2 | PC3 | |
|---|---|---|---|
| Standard deviation | 1.6134 | 1.0710 | 1.0007 |
| Proportion of Variance | 0.3719 | 0.1639 | 0.1431 |
| Cumulative Proportion | 0.3719 | 0.5358 | 0.6788 |
We also conduct the same analysis using EFA. The Scree type of test also suggests retaining a single factor. The proportion of total variation is lower than that of PCA. we decide to use the PCA method and extract the first three principal components for future analysis. Table 1 shows the factor loadings of the first three principal components. We can see that each of the original items contributes to the three PCAs evenly in terms of the magnitude. The first PCA counts about 37.2% of the total variation, the second PCA counts 16.4% of the total variation, and the third PCA counts 14.3% of the total variation.
Next, we extract the satisfaction index in the following code
pca <- prcomp(satisfaction, center = TRUE, scale = TRUE)
sc.idx = pca$x[,1]
# hist(sc.idx, breaks=10, main="Distribution of Satisfaction Index")
##
hist(sc.idx,
main="Distribution of Satisfaction Index",
breaks = seq(min(sc.idx), max(sc.idx), length=9),
xlab="Satisfaction Index",
xlim=range(sc.idx),
border="red",
col="lightblue",
freq=FALSE
)
Histogram of the first principle component extract from the satisfaction survey.