For this report, I will be analyzing survey data about student satisfaction and engagement. The goal for this analysis is to measure validity in terms of work load.
survey_raw <- read.csv("at-risk-survey-data.csv", header = TRUE)
nrow(survey_raw)
## [1] 664
This data set originally has 332 observations. However, there are one row gaps between the observations in the original excel spreadsheet. Because of this, when loaded into R, the data frame has 664 observations. And where there would have been gaps, ‘responses’ are recorded as NA.
To start, we need to clean up the data. By that, i mean get rid of the gaps in the original excel spreadsheet.
survey <- survey_raw[-c(1, seq(3, nrow(survey_raw), by = 2)), ]
nrow(survey)
## [1] 332
Now, i will extract the questions concerning Growth and Development.
Growth <- survey %>%
dplyr::select(q101:q1015)
We will first look at a pairwise correlation scatter plot for each variable in “Growth”.
M=cor(Growth)
corrplot.mixed(M, lower.col = "purple", upper = "ellipse", number.cex = .7, tl.cex = 0.7)
The smallest correlation coefficient is 0.39 (belonging to questions
10.6 and 10.8), and the largest if 0.79 (belonging to two pairs of
questions: 10.14 and 10.15, and 10.11 and 10.12). All of these are
positively correlated, and some are more strongly correlated than
others. We will see how this works when Principal Component Analysis is
done.
In the meanwhile, we will calculate Cronbach’s alpha to determine validity of these items.
cronbach.alpha(Growth, CI=TRUE)
##
## Cronbach's alpha for the 'Growth' data-set
##
## Items: 15
## Sample units: 332
## alpha: 0.947
##
## Bootstrap 95% CI based on 1000 samples
## 2.5% 97.5%
## 0.936 0.955
According to this table, alpha is 0.947, with a 95% bootstrap confidence interval of (0.936, 0.955).This suggests that there is a lot of internal consistency among the items. In other words, they share a lot in terms of variance.
Next, PCA will be done to see what we can aggregate out of this subscale. We will first look at a scree plot to see how much variance each principle component would explain.
pca_result <- prcomp(Growth)
eigenvalues <- pca_result$sdev^2
prop_variance <- eigenvalues / sum(eigenvalues)
plot(prop_variance, type = "b", xlab = "Principal Component", ylab = "Proportion of Variance Explained",
main = "Scree Plot for Growth and Development")
cumulative_prop_variance <- cumsum(prop_variance)
lines(cumulative_prop_variance, type = "b", col = "red")
legend("topright", legend = c("Proportion of Variance", "Cumulative Proportion"), col = c("black", "red"),
pch = 20, lty = 1)
Our first principal component explains over 50% of the total variance, so we’ll definitely keep that. The second explains over 20%, and the third explains just under 20%. So, we probably should just look at the first two.
Next, let’s look at our loadings.
loadings <- pca_result$rotation[, 1:4]
loadings_table <- data.frame(
Row = rownames(loadings),
PC1 = loadings[, 1],
PC2 = loadings[, 2],
PC3 = loadings[, 3],
PC4 = loadings[, 4]
)
loadings_table[, -1] <- round(loadings_table[, -1], 3)
kable(loadings_table, caption = "Factor loadings of the first few PCAs and the cumulative proportion of variation explained by the corresponding PCAs in the Student Satisfaction survey.")
| Row | PC1 | PC2 | PC3 | PC4 | |
|---|---|---|---|---|---|
| q101 | q101 | -0.213 | 0.315 | 0.073 | 0.172 |
| q102 | q102 | -0.239 | 0.333 | 0.299 | 0.240 |
| q103 | q103 | -0.248 | 0.261 | -0.070 | 0.313 |
| q104 | q104 | -0.256 | 0.176 | -0.145 | 0.392 |
| q105 | q105 | -0.233 | 0.286 | -0.133 | -0.012 |
| q106 | q106 | -0.210 | 0.295 | -0.075 | -0.499 |
| q107 | q107 | -0.213 | 0.264 | -0.058 | -0.460 |
| q108 | q108 | -0.245 | 0.038 | -0.191 | 0.055 |
| q109 | q109 | -0.222 | 0.071 | -0.129 | -0.268 |
| q1010 | q1010 | -0.295 | -0.262 | -0.258 | -0.207 |
| q1011 | q1011 | -0.302 | -0.396 | -0.222 | 0.056 |
| q1012 | q1012 | -0.307 | -0.314 | -0.189 | 0.045 |
| q1013 | q1013 | -0.282 | -0.219 | -0.073 | 0.227 |
| q1014 | q1014 | -0.290 | -0.198 | 0.525 | -0.028 |
| q1015 | q1015 | -0.284 | -0.189 | 0.606 | -0.160 |
Finally, we can quantify how much variance each principal component explains.
std_dev <- pca_result$sdev
prop_variance <- prop_variance
cumulative_prop_variance <- cumulative_prop_variance
table_data <- data.frame(
PC = paste("PC", 1:length(std_dev)),
Standard_Deviation = std_dev,
Proportion_of_Variance = prop_variance,
Cumulative_Proportion = cumulative_prop_variance
)
table_data[, -1] <- round(table_data[, -1], 3)
print(table_data)
## PC Standard_Deviation Proportion_of_Variance Cumulative_Proportion
## 1 PC 1 2.724 0.582 0.582
## 2 PC 2 1.038 0.085 0.666
## 3 PC 3 0.852 0.057 0.723
## 4 PC 4 0.768 0.046 0.769
## 5 PC 5 0.674 0.036 0.805
## 6 PC 6 0.668 0.035 0.840
## 7 PC 7 0.609 0.029 0.869
## 8 PC 8 0.577 0.026 0.895
## 9 PC 9 0.502 0.020 0.915
## 10 PC 10 0.477 0.018 0.933
## 11 PC 11 0.462 0.017 0.949
## 12 PC 12 0.455 0.016 0.966
## 13 PC 13 0.431 0.015 0.980
## 14 PC 14 0.378 0.011 0.991
## 15 PC 15 0.333 0.009 1.000
PC1 has a standard deviation of 2.724 and explains almost 60% of the variance. So PC1 must be pretty important. PC2 has a standard deviation of 1.038. and explains 8.5% of the variance. PC2 will probably be important as well. PC3 and PC4 must also be important as well.
I analyzed data from a student satisfaction survey. I first extracted the subscale Growth and Development, and then plotted a pairwise scatterplot to check correlation. Some of the variables were more correlated than others. other. Next, Crombach Alpha was calculated to determine validity, and it was determined with an alpha value of 0.947 that there was a lot of consistency. Next, Principal Component Analysis was then performed to see if dimension reduction could be utilized. The first PC explained almost 60 percent, and the first four PCs explained almost 80% of the variance, so maybe dimension reduction could be utilized.