Introduction

Student satisfaction analysis can be very useful to academic institutions to help them understand the student learning experience and identify areas for improvement. Student satisfaction is a multifaceted concept, which builds on many different overlapping experiences, making it difficult to measure.

The processed dataset was extracted from an online survey system based on a comprehensive survey that contains several subscales to measure students’ satisfaction and experience. Both single-item and multi-item survey questionnaires are used in the instrument with a total of 106 items.

The study population was defined to be all college business undergraduate students from two regional (north) universities. The sizes of the two samples taken from the two universities were 151 and 183 respectively and came from all grade levels. Students were invited to participate in the survey voluntarily. The survey was completely anonymous. The participation rate was unknown. A non-probability sampling plan was used in this survey.

The overall research question is to identify factors that can be used to make interventions and improve students’ outcomes and, hence, satisfaction.

setwd("C:/Users/Gianna/Documents/STA 490")
survey <- read.csv("at-risk-survey-data.csv", head = TRUE)
#county number of missing values
sapply(survey, function(x) sum(is.na(x)))
##      q1      q2      q3     q41     q42     q43     q44     q45     q46     q47 
##     332     332     332     332     332     332     332     332     332     332 
##     q48     q49    q410    q411    q412    q413    q414    q415    q416    q417 
##     332     332     332     332     332     332     332     332     332     332 
##    q418    q419    q420    q421     q51     q52     q53     q54     q55     q56 
##     332     332     332     332     332     332     332     332     332     332 
##     q61     q62     q63      q7     q81     q82     q83     q84     q85     q86 
##     332     332     332     332     332     332     332     332     332     332 
##     q87     q88     q89     q91     q92     q93     q94     q95     q96     q97 
##     332     332     332     332     332     332     332     332     332     332 
##    q101    q102    q103    q104    q105    q106    q107    q108    q109   q1010 
##     332     332     332     332     332     332     332     332     332     332 
##   q1011   q1012   q1013   q1014   q1015  q111.1  q111.2  q111.3  q112.1  q112.2 
##     332     332     332     332     332     332     332     332     332     332 
##  q112.3  q113.1  q113.2  q113.3  q114.1  q114.2  q114.3  q115.1  q115.2  q115.3 
##     332     332     332     332     332     332     332     332     332     332 
##  q116.1  q116.2  q116.3  q117.1  q117.2  q117.3  q118.1  q118.2  q118.3  q119.1 
##     332     332     332     332     332     332     332     332     332     332 
##  q119.2  q119.3 q1110.1 q1110.2 q1110.3 q1111.1 q1111.2 q1111.3    q121    q122 
##     332     332     332     332     332     332     332     332     332     332 
##    q123    q124    q125    q131    q132    q133    q134    q135    q136     q14 
##     332     332     332     332     332     332     332     332     332     332 
##     q15     q16     q17     q18     q19     q20     q21     q22     q23     q24 
##     332     332     332     332     332     332     332     332     332     332 
##     q25 
##     332
## DATA has a missing row after every input row. Looking at the csv file shows us that a space row is put in between each observation. This row can be removed since it does not represent an observation
survey_na <- survey[complete.cases(survey), ] #left with 332 observations

There are now no more missing values.

my.mode = function(dataset){
  freq.tbl = table(dataset)
  max.freq.id=which(freq.tbl==max(freq.tbl))
  mode=names(freq.tbl[max.freq.id])
  as.numeric(mode)
}

Selecting necessary questions

The following questions were selected to help us determine student satisfaction. Q17 or Q18 could be our response variables if we do linear regression later.

We already took care of all of the missing values so that is something we don’t need to worry about.

library(tidyverse)
satisfaction <- survey_na %>% 
  dplyr::select(q62,q92,q95,q96,q15,q17,q18)

Validity and Reliability Analyses

We start with some correlation plots to see the relevance of the PCA procedure on the satisfaction data.

library(corrplot)
M=cor(satisfaction)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)

The shape of an ellipse represents the correlation. The skinnier the ellipse, the higher the correlation. The direction reflects whether a correlation is positive or negative. The off-diagonal direction implies s positive correlation while the main diagonal direction implies a negative association.

Since Likert scales of Q6.2 are in reverse order in the design. We transform back the usual order and create a new dataset using the same variable names.

satisfaction$q62 <- 6-satisfaction$q62

We will redue the correlation plots below. Now all of the questions have the correct correlations.

M=cor(satisfaction)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)

Intern Consistency

With the adjusted scores in the satisfaction instrument, we calculate one of the commonly used internal consistency reliability Cronbach alpha as follows.

Next, we find the Cronbach alpha and its 95% confidence interval.

cronbach.sc = as.numeric(alpha(satisfaction)$total[1])
CI.sc = cronbach.alpha.CI(alpha=cronbach.sc, n=332, items=7, conf.level = 0.95)
CI.comp = cbind(LCI = CI.sc[1], alpha = cronbach.sc, UCI =CI.sc[2])
row.names(CI.comp) = ""
kable(CI.comp, caption="Confodence Interval of Cranbach Alpha")
Confodence Interval of Cranbach Alpha
LCI alpha UCI
0.5431435 0.610644 0.6714143

We can see that Cronbach’s alpha is 0.54 with a 95% confidence interval (0.61, 0.67) suggesting that the items in the satisfaction instrument have moderate internal consistency.

PCA

For convenience, we define the following two R functions and use them in the subsequent PCA.

My.plotnScree = function(mat, legend = TRUE, method ="factors", main){
    # mat = data matrix
    # method = c("factors", "components"), default is "factors".
    # main = title of the plot
    ev <- eigen(cor(mat))    # get eigenvalues
    ap <- parallel(subject=nrow(mat),var=ncol(mat), rep=5000,cent=.05)
    nScree = nScree(x=ev$values, aparallel=ap$eigen$qevpea, model=method)  
    ##
    if (!inherits(nScree, "nScree")) 
        stop("Method is only for nScree objects")
    if (nScree$Model == "components") 
        nkaiser = "Eigenvalues > mean: n = "
    if (nScree$Model == "factors") 
      nkaiser = "Eigenvalues > zero: n = "
    # axis labels
    xlab = nScree$Model
    ylab = "Eigenvalues"
    ##
    par(col = 1, pch = 18)
    par(mfrow = c(1, 1))
    eig <- nScree$Analysis$Eigenvalues
    k <- 1:length(eig)
    plot(1:length(eig), eig, type="b", main = main, 
        xlab = xlab, ylab = ylab, ylim=c(0, 1.2*max(eig)))
    #
    nk <- length(eig)
    noc <- nScree$Components$noc
    vp.p <- lm(eig[c(noc + 1, nk)] ~ k[c(noc + 1, nk)])
    x <- sum(c(1, 1) * coef(vp.p))
    y <- sum(c(1, nk) * coef(vp.p))
    par(col = 10)
    lines(k[c(1, nk)], c(x, y))
    par(col = 11, pch = 20)
    lines(1:nk, nScree$Analysis$Par.Analysis, type = "b")
    if (legend == TRUE) {
        leg.txt <- c(paste(nkaiser, nScree$Components$nkaiser), 
                   c(paste("Parallel Analysis: n = ", nScree$Components$nparallel)), 
                   c(paste("Optimal Coordinates: n = ", nScree$Components$noc)), 
                   c(paste("Acceleration Factor: n = ", nScree$Components$naf))
                   )
        legend("topright", legend = leg.txt, pch = c(18, 20, NA, NA), 
                           text.col = c(1, 3, 2, 4), 
                           col = c(1, 3, 2, 4), bty="n", cex=0.7)
    }
    naf <- nScree$Components$naf
    text(x = noc, y = eig[noc], label = " (OC)", cex = 0.7, 
        adj = c(0, 0), col = 2)
    text(x = naf + 1, y = eig[naf + 1], label = " (AF)", 
        cex = 0.7, adj = c(0, 0), col = 4)
}
# example
# My.plotnScree(mat=compassion, legend = TRUE, method ="factors", 
#              main = "Number of Factors to Retain")

My.loadings.var produces a list of two objects: factor loadings and proportion variance explained by each factor. There are no existing R functions that can be used to extract the proportion of variance from the output of factanal(). The function can also extract similar information from the output of a PCA but we need to specify the method in the argument.

My.loadings.var <- function(mat, nfct, method="fa"){
   # mat =  data matrix
   # nfct = number of factors or components
   # method = c("fa", "pca"), default = is "fa".
    if(method == "fa"){ 
     f1 <- factanal(mat, factors = nfct,  rotation = "varimax")
     x <- loadings(f1)
     vx <- colSums(x^2)
     varSS = rbind('SS loadings' = vx,
            'Proportion Var' = vx/nrow(x),
           'Cumulative Var' = cumsum(vx/nrow(x)))
     weight = f1$loadings[] 
   } else if (method == "pca"){
     pca <- prcomp(mat, center = TRUE, scale = TRUE)
     varSS = summary(pca)$importance[,1:nfct]
     weight = pca$rotation[,1:nfct]
  }
    list(Loadings = weight, Prop.Var = varSS)
}
# example
# My.loadings.var(mat, nfct=3, method="pca")

Scree Plot

Using the scree plot below, we are going to select the first three PCs.

My.plotnScree(mat=satisfaction, legend = TRUE, method ="components", 
              main="Determination of Number of Components\n Satisfaction (Positive)")

Different methods of identification of the number of principal components to be retained in exploratory analysis: Kaiser’s eigenvalue rule, Raiche et al Monte Carlo simulation method (parallel analysis), optimal coordinate (OC) index, and accelerate factor (AF) method.

The above Figure indicates that it is sufficient to retain the first principle component for the subsequent analysis. In the following, we will extract the first two PCAs. The PCA factor loadings and the proportion of variance explained by the retained PCAs are summarized in the following tables.

Loadings = My.loadings.var(mat=satisfaction, nfct=3, method="pca")$Loadings
#
# pca loadings
kable(round(Loadings,4),
  caption="Factor loadings of the first few PCAs and the cumulative proportion
of variation explained by the corresponding PCAs in the satisfaction survey.")
Factor loadings of the first few PCAs and the cumulative proportion of variation explained by the corresponding PCAs in the satisfaction survey.
PC1 PC2 PC3
q62 0.1074 0.0420 0.9637
q92 0.4753 -0.0794 -0.1088
q95 0.4754 -0.2788 -0.0239
q96 0.4244 -0.2793 0.1322
q15 0.0041 -0.7764 -0.0804
q17 0.4174 0.3095 -0.0450
q18 0.4268 0.3711 -0.1814
VarProp = My.loadings.var(mat=satisfaction, nfct=3, method="pca")$Prop.Var
# pca loadings
kable(round(VarProp,4),
    caption="Cumulative and proportion of variances explained by each 
    the principal component in the satisfaction survey.")
Cumulative and proportion of variances explained by each the principal component in the satisfaction survey.
PC1 PC2 PC3
Standard deviation 1.6134 1.0710 1.0007
Proportion of Variance 0.3719 0.1639 0.1431
Cumulative Proportion 0.3719 0.5358 0.6788

We also conduct the same analysis using EFA. The Scree type of test also suggests retaining a single factor. The proportion of total variation is lower than that of PCA. we decide to use the PCA method and extract the first three principal components for future analysis. Table 1 shows the factor loadings of the first three principal components. We can see that each of the original items contributes to the three PCAs evenly in terms of the magnitude. The first PCA counts about 37.2% of the total variation, the second PCA counts 16.4% of the total variation, and the third PCA counts 14.3% of the total variation.

Next, we extract the satisfaction index in the following code

pca <- prcomp(satisfaction, center = TRUE, scale = TRUE)
sc.idx = pca$x[,1]
# hist(sc.idx, breaks=10, main="Distribution of Satisfaction Index")
##
hist(sc.idx,
main="Distribution of Satisfaction Index",
breaks = seq(min(sc.idx), max(sc.idx), length=9),
xlab="Satisfaction Index",
xlim=range(sc.idx),
border="red",
col="lightblue",
freq=FALSE
)

Histogram of the first principle component extract from the satisfaction survey.