For this report, I will be analyzing survey data about student satisfaction and engagement. The goal for this analysis is to measure validity in terms of work load.

survey_raw <- read.csv("at-risk-survey-data.csv", header = TRUE)
nrow(survey_raw)
## [1] 664

1 Cleaning and Extracting Data

1.1 Cleaning the Data

This data set originally has 332 observations. However, there are one row gaps between the observations in the original excel spreadsheet. Because of this, when loaded into R, the data frame has 664 observations. And where there would have been gaps, ‘responses’ are recorded as NA.

To start, we need to clean up the data. By that, i mean get rid of the gaps in the original excel spreadsheet.

survey <- survey_raw[-c(1, seq(3, nrow(survey_raw), by = 2)), ]
nrow(survey)
## [1] 332

1.2 Extracting the Subscale

Now, i will extract the questions concerning Growth and Development.

Growth <- survey %>% 
  dplyr::select(q101:q1015)

2 Analysis

2.1 Pairwise scatterplot

We will first look at a pairwise correlation scatter plot for each variable in “Growth”.

M=cor(Growth)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)

The smallest correlation coefficient is 0.39 (belonging to questions 10.6 and 10.8), and the largest if 0.79 (belonging to two pairs of questions: 10.14 and 10.15, and 10.11 and 10.12). All of these are positively correlated, and some are more strongly correlated than others. We will see how this works when Principal Component Analysis is done.

2.2 Cronbach Alpha

In the meanwhile, we will calculate Cronbach’s alpha to determine validity of these items.

cronbach.alpha(Growth, CI=TRUE)
## 
## Cronbach's alpha for the 'Growth' data-set
## 
## Items: 15
## Sample units: 332
## alpha: 0.947
## 
## Bootstrap 95% CI based on 1000 samples
##  2.5% 97.5% 
## 0.936 0.955

According to this table, alpha is 0.947, with a 95% bootstrap confidence interval of (0.936, 0.955).This suggests that there is a lot of internal consistency among the items. In other words, they share a lot in terms of variance.

2.3 Principal Component Analysis

Next, PCA will be done to see what we can aggregate out of this subscale. We will first look at a scree plot to see how much variance each principle component would explain.

pca_result <- prcomp(Growth)

eigenvalues <- pca_result$sdev^2

prop_variance <- eigenvalues / sum(eigenvalues)

plot(prop_variance, type = "b", xlab = "Principal Component", ylab = "Proportion of Variance Explained", 
     main = "Scree Plot for Growth and Development")

cumulative_prop_variance <- cumsum(prop_variance)
lines(cumulative_prop_variance, type = "b", col = "red")

legend("topright", legend = c("Proportion of Variance", "Cumulative Proportion"), col = c("black", "red"), 
       pch = 20, lty = 1)

Our first principal component explains over 50% of the total variance, so we’ll definitely keep that. The second explains over 20%, and the third explains just under 20%. So, we probably should just look at the first two.

Next, let’s look at our loadings.

loadings <- pca_result$rotation[, 1:4]

loadings_table <- data.frame(
  Row = rownames(loadings),
  PC1 = loadings[, 1],
  PC2 = loadings[, 2],
  PC3 = loadings[, 3],
  PC4 = loadings[, 4]
)

loadings_table[, -1] <- round(loadings_table[, -1], 3)

kable(loadings_table, caption = "Factor loadings of the first few PCAs and the cumulative proportion of variation explained by the corresponding PCAs in the Student Satisfaction survey.")
Factor loadings of the first few PCAs and the cumulative proportion of variation explained by the corresponding PCAs in the Student Satisfaction survey.
Row PC1 PC2 PC3 PC4
q101 q101 -0.213 0.315 0.073 0.172
q102 q102 -0.239 0.333 0.299 0.240
q103 q103 -0.248 0.261 -0.070 0.313
q104 q104 -0.256 0.176 -0.145 0.392
q105 q105 -0.233 0.286 -0.133 -0.012
q106 q106 -0.210 0.295 -0.075 -0.499
q107 q107 -0.213 0.264 -0.058 -0.460
q108 q108 -0.245 0.038 -0.191 0.055
q109 q109 -0.222 0.071 -0.129 -0.268
q1010 q1010 -0.295 -0.262 -0.258 -0.207
q1011 q1011 -0.302 -0.396 -0.222 0.056
q1012 q1012 -0.307 -0.314 -0.189 0.045
q1013 q1013 -0.282 -0.219 -0.073 0.227
q1014 q1014 -0.290 -0.198 0.525 -0.028
q1015 q1015 -0.284 -0.189 0.606 -0.160

Finally, we can quantify how much variance each principal component explains.

std_dev <- pca_result$sdev

prop_variance <- prop_variance

cumulative_prop_variance <- cumulative_prop_variance

table_data <- data.frame(
  PC = paste("PC", 1:length(std_dev)),
  Standard_Deviation = std_dev,
  Proportion_of_Variance = prop_variance,
  Cumulative_Proportion = cumulative_prop_variance
)

table_data[, -1] <- round(table_data[, -1], 3)

print(table_data)
##       PC Standard_Deviation Proportion_of_Variance Cumulative_Proportion
## 1   PC 1              2.724                  0.582                 0.582
## 2   PC 2              1.038                  0.085                 0.666
## 3   PC 3              0.852                  0.057                 0.723
## 4   PC 4              0.768                  0.046                 0.769
## 5   PC 5              0.674                  0.036                 0.805
## 6   PC 6              0.668                  0.035                 0.840
## 7   PC 7              0.609                  0.029                 0.869
## 8   PC 8              0.577                  0.026                 0.895
## 9   PC 9              0.502                  0.020                 0.915
## 10 PC 10              0.477                  0.018                 0.933
## 11 PC 11              0.462                  0.017                 0.949
## 12 PC 12              0.455                  0.016                 0.966
## 13 PC 13              0.431                  0.015                 0.980
## 14 PC 14              0.378                  0.011                 0.991
## 15 PC 15              0.333                  0.009                 1.000

PC1 has a standard deviation of 2.724 and explains almost 60% of the variance. So PC1 must be pretty important. PC2 has a standard deviation of 1.038. and explains 8.5% of the variance. PC2 will probably be important as well. PC3 and PC4 must also be important as well.

2.4 Summary and Conclusion

I analyzed data from a student satisfaction survey. I first extracted the subscale Growth and Development, and then plotted a pairwise scatterplot to check correlation. Some of the variables were more correlated than others. other. Next, Crombach Alpha was calculated to determine validity, and it was determined with an alpha value of 0.947 that there was a lot of consistency. Next, Principal Component Analysis was then performed to see if dimension reduction could be utilized. The first PC explained almost 60 percent, and the first four PCs explained almost 80% of the variance, so maybe dimension reduction could be utilized.