1 Introduction

For this assignment, we will look at a data set from a survey on student experience and satisfaction. The survey asks students about their experience in college, including their academic background, their year level and credit status, and their engagement with class material both inside and outside of school. It also asks which campus resources students have used during their time in college and how satisfied they are with the college. Together, these responses describe who the students are and how they rate their college experience.

For this project, we will begin with some exploratory data analysis of the survey data. Then, we will assess the internal reliability of subscales in the survey and perform principal component analysis on two subsets: student experience and student satisfaction. Finally, we will consider some potential consulting questions that could be drawn from this survey data for future projects.

1.1 Data Description

Let’s read in the survey data set from GitHub. We will call the data set “survey”.

survey <- read.csv("https://raw.githubusercontent.com/JosieGallop/STA490/refs/heads/main/student-satisfaction-survey.csv")

The survey data set contains 332 observations of 121 variables. The variables represent the different questions within the survey.

2 Exploratory Data Analysis

We will perform some exploratory data analysis to gain a better understanding of the survey data set, and to further prepare the data for additional analysis related to reliability and validation of the survey data.

2.1 Dealing with the Missing Values

For the first step of our exploratory data analysis, we will check for missing values in the data set and fix any that we find. First, let’s count the missing values in each column.

colSums(is.na(survey))
     q1      q2      q3     q41     q42     q43     q44     q45     q46     q47 
      0       0       0       0       0       0       0       0       0       0 
    q48     q49    q410    q411    q412    q413    q414    q415    q416    q417 
      0       0       0       0       0       0       0       0       0       0 
   q418    q419    q420    q421     q51     q52     q53     q54     q55     q56 
      0       0       0       0       0       0       0       0       0       0 
    q61     q62     q63      q7     q81     q82     q83     q84     q85     q86 
      0       0       0       0       0       0       0       0       0       0 
    q87     q88     q89     q91     q92     q93     q94     q95     q96     q97 
      0       0       0       0       0       0       0       0       0       0 
   q101    q102    q103    q104    q105    q106    q107    q108    q109   q1010 
      0       0       0       0       0       0       0       0       0       0 
  q1011   q1012   q1013   q1014   q1015  q111.1  q111.2  q111.3  q112.1  q112.2 
      0       0       0       0       0       0       0       0       0       0 
 q112.3  q113.1  q113.2  q113.3  q114.1  q114.2  q114.3  q115.1  q115.2  q115.3 
      0       0       0       0       0       0       0       0       0       0 
 q116.1  q116.2  q116.3  q117.1  q117.2  q117.3  q118.1  q118.2  q118.3  q119.1 
      0       0       0       0       0       0       0       0       0       0 
 q119.2  q119.3 q1110.1 q1110.2 q1110.3 q1111.1 q1111.2 q1111.3    q121    q122 
      0       0       0       0       0       0       0       0       0       0 
   q123    q124    q125    q131    q132    q133    q134    q135    q136     q14 
      0       0       0       0       0       0       0       0       0       0 
    q15     q16     q17     q18     q19     q20     q21     q22     q23     q24 
      0       0       0       0       0       0       0       0       0       0 
    q25 
      0 

There are no missing values in any column of the survey data set, so we do not need to do any imputation.
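With 121 columns, the per-column output is long to scan. A more compact check could look like the sketch below. It uses a small made-up data frame called toy (not the real survey) so that it runs on its own; the same calls would apply to survey.

```r
# Toy data frame standing in for the survey (hypothetical values).
toy <- data.frame(q1 = c(1, 2, NA, 4),
                  q2 = c(3, 3, 2, 1),
                  q3 = c(NA, NA, 1, 2))

total_missing <- sum(is.na(toy))                     # total number of missing cells
cols_with_na  <- names(toy)[colSums(is.na(toy)) > 0] # only the affected columns
complete_rows <- sum(complete.cases(toy))            # rows with no missing values

total_missing
cols_with_na
complete_rows
```

For a data set with no missing values, total_missing would be 0 and cols_with_na would be empty.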

3 Splitting the Survey Data Set

Our survey has two main components: the student experience portion and the student satisfaction portion. We will split these apart for our further analysis of the survey data set.

3.1 Student Experience Portion

First, let’s separate the student experience portion of the survey data set. The student experience portion looks at the personal and academic background of students at the college. This information gives some background details about the student along with their experience at the college. The goal of this survey is to see how these experience factors may affect a student’s college satisfaction.

The student experience portion takes up the majority of the survey questionnaire, so this subset will contain most of the columns in the survey data set.

experience = survey[, 1:112]

Now we have created the experience subset of our survey data.

3.2 Student Satisfaction Portion

The second portion of the survey data set is the student satisfaction questions. These questions look at a student’s overall satisfaction and feelings towards their college. There are only two questions within the survey questionnaire which ask about student satisfaction, so this subset will be much smaller than the student experience subset.

satisfaction = survey[, 113:114]

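Now we have created the satisfaction subset of our survey data. Based on the column order shown in the missing-value output above, columns 113 and 114 are q17 and q18, and the remaining columns 115 through 121 (q19 to q25) are not part of either subset.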
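Selecting columns by position works, but it depends on the column order staying the same. As a sketch, the same split can be written with column names. The toy data frame below is made up and only borrows a few of the survey's column names.

```r
# Toy stand-in with a few of the survey's column names (values are made up).
toy <- data.frame(q15 = 1:3, q16 = 3:1,
                  q17 = c(1, 2, 1), q18 = c(2, 1, 3),
                  q19 = c(0, 1, 0))

# Satisfaction items selected by name instead of by position.
sat_cols <- c("q17", "q18")
satisfaction_toy <- toy[, sat_cols]

# Everything before the first satisfaction item, mirroring survey[, 1:112].
first_sat <- match("q17", names(toy))
experience_toy <- toy[, seq_len(first_sat - 1), drop = FALSE]

names(satisfaction_toy)
names(experience_toy)
```

Using names makes it easier to confirm that the satisfaction subset really holds q17 and q18.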

4 Reliability Analysis

Now that we have prepared the data set, we will perform internal reliability and validity assessments in order to further investigate the survey data set.

4.1 Student Experience Reliability Analysis

First, let’s look at the correlation plots for the student experience subset of our survey data set.

The experience portion of the survey is very large, so let’s take a smaller subset of it to make our correlation plot easier to interpret. We will look at questions 5, 6, and 7 for this subset, which are columns 25 through 34 (q51 to q56, q61 to q63, and q7). We will call this subset “experience.1”.

experience.1 = survey[, 25:34]

Now, let’s look at the correlation plot for this subset of the student experience portion of the survey.

library(corrplot)  # provides corrplot.mixed()
M = cor(experience.1)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)

We can see some moderate correlation among the student experience items, since some of the ellipses are noticeably long and narrow.
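Reading correlations from ellipse shapes is approximate. One way to back up the plot with numbers is to list each pair of items once, sorted by the size of the correlation. This is a minimal sketch on simulated data (the items a, b, and c are hypothetical); with the real data, toy would be replaced by experience.1.

```r
set.seed(1)
# Simulated items standing in for experience.1 (hypothetical data).
base <- rnorm(200)
toy <- data.frame(a = base + rnorm(200),
                  b = base + rnorm(200),
                  c = rnorm(200))

M <- cor(toy)
idx <- which(upper.tri(M), arr.ind = TRUE)   # each pair of items once
pairs <- data.frame(item1 = rownames(M)[idx[, "row"]],
                    item2 = colnames(M)[idx[, "col"]],
                    r = M[upper.tri(M)])
pairs <- pairs[order(-abs(pairs$r)), ]       # strongest correlation first
pairs
```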

4.2 Student Satisfaction Reliability Analysis

Next, let’s look at the correlation plots for the student satisfaction subset of the survey data set.

M=cor(satisfaction)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)

The satisfaction portion of the survey contains only two items, so the plot shows a single pairwise correlation and there is little to compare. The ellipse does appear to be noticeably stretched and long, indicating that some moderate correlation exists between the two items.

5 Cronbach Alpha Levels

Next, we will calculate the Cronbach alpha levels for each of our two subsets, along with their 95% confidence intervals. This will allow us to assess the reliability of the two subsets.

5.1 Student Experience

First, let’s calculate the Cronbach alpha level for the student experience portion of the survey data set.

library(psych)  # provides alpha()
cronbach.e = as.numeric(alpha(experience.1)$total[1])
Some items ( q61 q62 q63 q7 ) were negatively correlated with the first principal component and 
probably should be reversed.  
To do this, run the function again with the 'check.keys=TRUE' option
CI.e = cronbach.alpha.CI(alpha=cronbach.e, n=332, items=10, conf.level = 0.95)
CI.comp = cbind(LCI = CI.e[1], alpha = cronbach.e, UCI =CI.e[2])
row.names(CI.comp) = ""
pander(CI.comp, caption="Confidence Interval of Cronbach Alpha")
Confidence Interval of Cronbach Alpha
LCI alpha UCI
0.4298 0.5119 0.5866

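The Cronbach alpha level for the experience.1 subset of the student experience portion is 0.5119, 95% CI [0.4298, 0.5866]. This value is below the 0.70 level that is commonly used as a rule of thumb for acceptable reliability, which indicates weak to moderate reliability. The warning above also notes that q61, q62, q63, and q7 are negatively correlated with the first principal component and probably should be reversed, so alpha may be understated until those items are reverse-scored.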
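To see why reverse-scoring matters, here is a small sketch that computes Cronbach's alpha directly from its definition and compares the value before and after reversing one item. The data are simulated, and the 1 to 5 response scale is an assumption made only for this example; the survey's actual scales may differ.

```r
# Cronbach's alpha from its definition:
# alpha = k / (k - 1) * (1 - sum(item variances) / variance of the total score)
cronbach_alpha <- function(items) {
  items <- as.matrix(items)
  k <- ncol(items)
  item_var <- apply(items, 2, var)
  total_var <- var(rowSums(items))
  k / (k - 1) * (1 - sum(item_var) / total_var)
}

set.seed(42)
trait <- rnorm(300)
# Map a continuous score onto a hypothetical 1-5 response scale.
to_scale <- function(x) pmin(pmax(round(3 + x), 1), 5)
# item3 is worded in the opposite direction to item1 and item2.
toy <- data.frame(item1 = to_scale(trait + rnorm(300)),
                  item2 = to_scale(trait + rnorm(300)),
                  item3 = to_scale(-trait + rnorm(300)))

alpha_raw <- cronbach_alpha(toy)

# Reverse-score item3 on a 1-5 scale: new = (min + max) - old.
toy_rev <- transform(toy, item3 = 6 - item3)
alpha_rev <- cronbach_alpha(toy_rev)

c(raw = alpha_raw, reversed = alpha_rev)
```

In this simulation the reversed version gives the higher alpha, which is the pattern the warning from alpha() points to.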

5.2 Student Satisfaction

Next, let’s calculate the Cronbach alpha level for the student satisfaction portion of the survey data set.

cronbach.s = as.numeric(alpha(satisfaction)$total[1])
CI.s = cronbach.alpha.CI(alpha=cronbach.s, n=332, items=2, conf.level = 0.95)
CI.comp = cbind(LCI = CI.s[1], alpha = cronbach.s, UCI =CI.s[2])
row.names(CI.comp) = ""
pander(CI.comp, caption="Confidence Interval of Cronbach Alpha")
Confidence Interval of Cronbach Alpha
LCI alpha UCI
0.4854 0.5853 0.6658

The Cronbach alpha level for the student satisfaction portion of the survey data set is 0.5853, 95% CI [0.4854, 0.6658]. Once again, this value is below the commonly used 0.70 rule of thumb, which indicates weak to moderate reliability. This value is slightly higher than that of the student experience subset, but the two confidence intervals overlap, so the difference between the subsets is small.
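For readers who want to see where intervals like these can come from, the sketch below computes an F-distribution-based (Feldt-type) confidence interval in base R. The helper name feldt_ci is hypothetical, and whether cronbach.alpha.CI uses exactly this formula is an assumption; the results should land close to the two tables above.

```r
# F-distribution-based confidence interval for Cronbach's alpha (Feldt-type).
# alpha: estimate, n: number of respondents, k: number of items.
feldt_ci <- function(alpha, n, k, conf.level = 0.95) {
  a <- 1 - conf.level
  df1 <- n - 1
  df2 <- (n - 1) * (k - 1)
  lower <- 1 - (1 - alpha) * qf(1 - a / 2, df1, df2)
  upper <- 1 - (1 - alpha) * qf(a / 2, df1, df2)
  c(LCI = lower, alpha = alpha, UCI = upper)
}

# Alpha values, sample size, and item counts taken from the two subsections above.
ci_experience   <- feldt_ci(0.5119, n = 332, k = 10)
ci_satisfaction <- feldt_ci(0.5853, n = 332, k = 2)

round(rbind(experience = ci_experience, satisfaction = ci_satisfaction), 4)
```

Printing the two rows side by side also makes it easy to see that the intervals overlap.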

6 Principal Component Analysis

Now we will perform principal component analysis (PCA) for the survey data set. We will run PCA twice, once for the experience subset and once for the satisfaction subset of the student survey data.

We will define two helper functions for our principal component analysis.

This first function draws a scree plot with several criteria for choosing the number of components. It relies on parallel() and nScree() from the nFactors package.

My.plotnScree = function(mat, legend = TRUE, method ="factors", main){

    ev <- eigen(cor(mat))    
    ap <- parallel(subject=nrow(mat),var=ncol(mat), rep=5000,cent=.05)
    nScree = nScree(x=ev$values, aparallel=ap$eigen$qevpea, model=method)  
    
    if (!inherits(nScree, "nScree")) 
        stop("Method is only for nScree objects")
    if (nScree$Model == "components") 
        nkaiser = "Eigenvalues > mean: n = "
    if (nScree$Model == "factors") 
      nkaiser = "Eigenvalues > zero: n = "
    
    xlab = nScree$Model
    ylab = "Eigenvalues"
    
    par(col = 1, pch = 18)
    par(mfrow = c(1, 1))
    eig <- nScree$Analysis$Eigenvalues
    k <- 1:length(eig)
    plot(1:length(eig), eig, type="b", main = main, 
        xlab = xlab, ylab = ylab, ylim=c(0, 1.2*max(eig)))
    #
    nk <- length(eig)
    noc <- nScree$Components$noc
    vp.p <- lm(eig[c(noc + 1, nk)] ~ k[c(noc + 1, nk)])
    x <- sum(c(1, 1) * coef(vp.p))
    y <- sum(c(1, nk) * coef(vp.p))
    par(col = 10)
    lines(k[c(1, nk)], c(x, y))
    par(col = 11, pch = 20)
    lines(1:nk, nScree$Analysis$Par.Analysis, type = "b")
    if (legend == TRUE) {
        leg.txt <- c(paste(nkaiser, nScree$Components$nkaiser), 
                   c(paste("Parallel Analysis: n = ", nScree$Components$nparallel)), 
                   c(paste("Optimal Coordinates: n = ", nScree$Components$noc)), 
                   c(paste("Acceleration Factor: n = ", nScree$Components$naf))
                   )
        legend("topright", legend = leg.txt, pch = c(18, 20, NA, NA), 
                           text.col = c(1, 3, 2, 4), 
                           col = c(1, 3, 2, 4), bty="n", cex=0.7)
    }
    naf <- nScree$Components$naf
    text(x = noc, y = eig[noc], label = " (OC)", cex = 0.7, 
        adj = c(0, 0), col = 2)
    text(x = naf + 1, y = eig[naf + 1], label = " (AF)", 
        cex = 0.7, adj = c(0, 0), col = 4)
}

This next function returns the loadings and the variance explained for a chosen number of factors or components. With method = "fa" it fits a varimax-rotated factor analysis using factanal(); with method = "pca" it uses prcomp(). In both cases we can see the proportion of the total variation explained by each factor or principal component.

My.loadings.var <- function(mat, nfct, method="fa"){
    if(method == "fa"){ 
     f1 <- factanal(mat, factors = nfct,  rotation = "varimax")
     x <- loadings(f1)
     vx <- colSums(x^2)
     varSS = rbind('SS loadings' = vx,
            'Proportion Var' = vx/nrow(x),
           'Cumulative Var' = cumsum(vx/nrow(x)))
     weight = f1$loadings[] 
   } else if (method == "pca"){
     pca <- prcomp(mat, center = TRUE, scale = TRUE)
     varSS = summary(pca)$importance[,1:nfct]
     weight = pca$rotation[,1:nfct]
  }
    list(Loadings = weight, Prop.Var = varSS)
}

6.1 Student Experience PCA

First, we will perform PCA for the student experience subset of the survey data set.

To avoid having too many principal components, since the student experience portion of the survey is so large, we will use the experience.1 subset of the student experience survey questions which was created earlier during the reliability analysis.

Let’s calculate the principal components for the experience.1 data set.

experience.pca <- prcomp(experience.1, center = TRUE, scale = TRUE)

Now, we will find the factor loading of the PCA for the student experience survey questions.

kable(round(experience.pca$rotation, 2), caption="Factor Loadings of the PCA")
Factor Loadings of the PCA
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
q51 0.26 -0.09 -0.15 0.60 -0.57 0.29 0.18 -0.30 0.04 0.03
q52 0.41 -0.07 0.05 0.09 -0.16 -0.54 0.08 0.43 0.55 -0.09
q53 0.44 -0.09 -0.05 -0.01 -0.04 -0.32 0.07 0.10 -0.79 -0.24
q54 0.41 -0.02 -0.02 -0.28 0.18 -0.07 -0.18 -0.71 0.25 -0.35
q55 0.44 -0.10 -0.02 -0.21 0.10 0.05 -0.01 -0.07 -0.03 0.85
q56 0.39 -0.15 -0.08 -0.03 0.22 0.67 -0.23 0.44 0.07 -0.26
q61 -0.10 -0.66 -0.07 -0.04 0.30 0.03 0.66 -0.06 0.06 -0.07
q62 -0.01 -0.32 0.80 -0.27 -0.41 0.13 -0.07 -0.01 -0.04 -0.05
q63 -0.17 -0.60 -0.08 0.34 0.11 -0.22 -0.65 -0.05 -0.03 0.08
q7 -0.14 -0.22 -0.56 -0.56 -0.53 0.01 -0.09 0.08 0.03 -0.04

We can see that we have ten total principal components for the student experience data.

kable(summary(experience.pca)$importance, caption="The Importance of Each Principal Component")
The Importance of Each Principal Component
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
Standard deviation 1.903139 1.224847 0.9956294 0.9782884 0.9071515 0.7291144 0.7027027 0.6709049 0.5693246 0.5540031
Proportion of Variance 0.362190 0.150030 0.0991300 0.0957000 0.0822900 0.0531600 0.0493800 0.0450100 0.0324100 0.0306900
Cumulative Proportion 0.362190 0.512220 0.6113500 0.7070500 0.7893400 0.8425000 0.8918800 0.9369000 0.9693100 1.0000000

The first PC accounts for around 36.22% of the total variation. The first two PCs account for around 51.22% of the total variation. The first three PCs account for around 61.14% of the total variation.
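These percentages follow directly from the standard deviations in the table: each component's variance is its standard deviation squared, and its share is that variance divided by the total. The short sketch below redoes the arithmetic using the standard deviations copied from the table, so the results should match the table up to rounding.

```r
# Standard deviations copied from the importance table above.
sdev <- c(1.903139, 1.224847, 0.9956294, 0.9782884, 0.9071515,
          0.7291144, 0.7027027, 0.6709049, 0.5693246, 0.5540031)

eigenvalues <- sdev^2                        # variance of each component
prop_var <- eigenvalues / sum(eigenvalues)   # share of the total variance
cum_var <- cumsum(prop_var)                  # running total of the shares

round(rbind(prop_var, cum_var)[, 1:3], 4)
```

Because the ten items were standardized, the variances add up to about 10, one unit per item.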

6.1.1 Scree Plot

Let’s take a look at the scree plot for the student experience survey portion.

My.plotnScree(mat=experience.1, legend = TRUE, method ="components", 
              main="Determination of Number of Components\n Student Experience (Positive)")

As we can see, the scree plot shows all ten principal components from the analysis above. The elbow of the scree plot appears to occur at around two components, so two would be a reasonable number of components to retain for further analysis of the student experience survey data.
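The choice of two components can also be checked against the "Eigenvalues > mean" rule that appears in the scree plot legend. The sketch below applies that rule by hand to the eigenvalues implied by the importance table (standard deviations squared). It is only a cross-check of one criterion, not a replacement for the plot.

```r
# Eigenvalues = squared standard deviations from the importance table.
eigenvalues <- c(1.903139, 1.224847, 0.9956294, 0.9782884, 0.9071515,
                 0.7291144, 0.7027027, 0.6709049, 0.5693246, 0.5540031)^2

# Kaiser-style rule for components: keep eigenvalues above the mean eigenvalue.
n_kaiser <- sum(eigenvalues > mean(eigenvalues))
n_kaiser                   # 2 for these values

round(eigenvalues[1:3], 3) # the third eigenvalue sits just below the cutoff
```

The third eigenvalue is very close to the cutoff, so the choice between two and three components is a borderline call under this rule.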

6.1.2 Student Experience Distribution Histogram

Next, we can look at a histogram of the distribution of the student experience index, defined here as the first principal component score.

pca <- prcomp(experience.1, center = TRUE, scale = TRUE)
se.idx = pca$x[,1]
hist(se.idx,
main="Distribution of Student Experience Index",
breaks = seq(min(se.idx), max(se.idx), length=9),
xlab="Student Experience Index",
xlim=range(se.idx),
border="purple",
col="lightblue",
freq=FALSE
)

As we can see, the student experience index appears to be roughly symmetric and bell-shaped, with no obvious skew or outliers. A histogram alone cannot confirm normality, so this is only a visual impression.
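A histogram can be paired with a normal Q-Q plot and a Shapiro-Wilk test for a more formal look at the shape of the index. The sketch below runs on simulated data, so idx here is a hypothetical stand-in; with the real data it would be replaced by se.idx.

```r
set.seed(7)
# Simulated stand-in for the first principal component scores (hypothetical).
toy <- as.data.frame(matrix(rnorm(332 * 5), ncol = 5))
idx <- prcomp(toy, center = TRUE, scale. = TRUE)$x[, 1]

sw <- shapiro.test(idx)   # H0: the sample comes from a normal distribution
sw$p.value                # a small p-value would be evidence against normality

qqnorm(idx, main = "Q-Q Plot of Index (simulated)")
qqline(idx, col = "purple")  # points close to this line suggest normality
```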

6.2 Student Satisfaction PCA

Now, we will perform PCA again, this time for the student satisfaction subset of the survey data set.

satisfaction.pca <- prcomp(satisfaction, center = TRUE, scale = TRUE)

Now, we will find the factor loading of the PCA for the student satisfaction survey questions.

kable(round(satisfaction.pca$rotation, 2), caption="Factor Loadings of the PCA")
Factor Loadings of the PCA
PC1 PC2
q17 0.71 -0.71
q18 0.71 0.71

We can see that we have two total principal components for the student satisfaction data.

kable(summary(satisfaction.pca)$importance, caption="The Importance of Each Principal Component")
The Importance of Each Principal Component
PC1 PC2
Standard deviation 1.231695 0.6949298
Proportion of Variance 0.758540 0.2414600
Cumulative Proportion 0.758540 1.0000000

The first PC accounts for around 75.85% of the total variation. The first two PCs account for 100% of the total variation.
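With only two standardized items, the PCA has a simple closed form: the correlation matrix has eigenvalues 1 + r and 1 - r, where r is the correlation between the items, and every loading has absolute value 1/sqrt(2), or about 0.71. The sketch below works backwards from the PC1 standard deviation in the table to the implied correlation. The value of r is derived from that table, not taken from the survey data directly.

```r
sd_pc1 <- 1.231695              # PC1 standard deviation from the table above
r <- sd_pc1^2 - 1               # implied correlation between q17 and q18
R <- matrix(c(1, r, r, 1), nrow = 2)

e <- eigen(R)
r
sqrt(e$values)                  # standard deviations of PC1 and PC2
abs(e$vectors)                  # every entry equals 1 / sqrt(2)
```

This is why both items load equally on PC1, and why PC1 alone already explains (1 + r) / 2 of the variance.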

6.2.1 Student Satisfaction Distribution Histogram

Next, we can look at a histogram of the distribution of the student satisfaction index.

pca <- prcomp(satisfaction, center = TRUE, scale = TRUE)
sat.idx = pca$x[,1]
hist(sat.idx,
main="Distribution of Student Satisfaction Index",
breaks = seq(min(sat.idx), max(sat.idx), length=9),
xlab="Student Satisfaction Index",
xlim=range(sat.idx),
border="purple",
col="lightblue",
freq=FALSE
)

As we can see, the student satisfaction index does not appear to be normally distributed. Instead, the distribution appears to be noticeably skewed to the right, with some potential outliers present on the right tail of the graph.

7 Conclusion and Recommendations

Overall, the principal component analysis allowed us to further analyze and better understand the survey data set. One thing which stood out is how long the student experience portion of the survey is in comparison to the student satisfaction portion. In the survey, there appeared to be only two questions which directly related to student satisfaction.

These two student satisfaction questions were questions 17 and 18, with those being: “17. Would you recommend our School of Business to a friend or family member? 1 -Yes, 2 - No.” and “18. How would you evaluate your entire educational experience at our school? 1 -Excellent, 2 - Good, 3 - Fair, 4 - Poor”.

This student satisfaction portion of the survey was very short. As a result, we did not create a scree plot for it, because the scree plot approach used in this report needs three or more principal components and this subset has only two.

So, the main recommendation I would make for future projects is to expand the student satisfaction portion of the survey with more questions, in order to allow further analysis of student satisfaction.

8 Project Questions

Lastly, we will now look at some potential project questions based upon the analysis of the survey data set and the results that were found.

Some potential questions include:

  • Which factors of student experience showed the strongest influence on student satisfaction?

  • Did these particular factors with the strongest influence on satisfaction show stronger reliability when compared to the others?

  • Do students appear to be mostly satisfied with their school experience?

  • How can we analyze the relationship between the principal components in order to learn more about the relevant factors in relation to a student’s overall experience and satisfaction in college?

For this project, I would be particularly interested in looking into which student experience factors have the greatest significance or effect on a student’s satisfaction with their college experience. This would be helpful because it would give specific factors for college faculty or advisors to look out for in their students in order to help them have a more positive and satisfactory experience during their time in college.

I think it would be interesting to see if these particular factors with stronger influence on a student’s satisfaction have stronger reliability. It would be interesting to compare the reliability results of these particular factors to see if they appear stronger or weaker than the average. This would help to see if these findings truly are reliable or not.

Additionally, we could use the findings in the principal component analysis to further investigate the survey data set. We could look into the relationships between the various principal components in order to learn more about the relationships between various factors in student experience and satisfaction. It would also be interesting to identify which items had the highest loadings, and then to create models based upon these specific items in order to see their importance and influence on overall student experience and satisfaction.
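As one possible starting point for such a model, the sketch below fits a logistic regression of a yes/no recommendation on an experience index. Everything in it is simulated and hypothetical, including the variable names exp_index and recommend and the coding of the outcome as 0/1. It only illustrates the shape of the analysis, not a result from the survey.

```r
set.seed(11)
n <- 332
# Simulated stand-ins (hypothetical): an experience index and a 0/1 recommendation.
exp_index <- rnorm(n)
recommend <- rbinom(n, size = 1, prob = plogis(0.5 + 0.8 * exp_index))

# Logistic regression: does the experience index predict recommending the school?
fit <- glm(recommend ~ exp_index, family = binomial)

coef(summary(fit))            # estimate, standard error, z value, p-value
exp(coef(fit)["exp_index"])   # odds ratio per one-unit increase in the index
```

With the real data, q17 would first need to be recoded from 1/2 to a 0/1 outcome, and the index could come from the first principal component of the experience items.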

