Dataset

The dataset used in this analysis is the Student Performance Dataset, obtained from Kaggle.

  • Source: Kaggle
  • File used: student-por.csv (student performance in the Portuguese language course)

1. Libraries

The following packages are used in this analysis: psych for multivariate statistical methods, dplyr for data manipulation, and corrplot for correlation visualization.

library(psych)
library(dplyr)
library(corrplot)

2. Data Preparation

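The dataset used is student-por.csv. Columns are renamed by position to X1, X2, ..., and only numeric variables are retained for analysis. The retained variables keep their original position numbers, which is why the output below refers to X3, X7, X8, and so on.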

data <- read.csv("student-por.csv")
colnames(data) <- paste0("X", 1:ncol(data))

data_numeric <- data[sapply(data, is.numeric)]
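
Renaming every column to X1, X2, ... discards the original variable names, which makes the loadings harder to interpret later. The following is a minimal sketch, on a small made-up data frame (the column names school, age, and grade are illustrative assumptions, not the dataset's), of keeping a lookup table before renaming:

```r
# Toy data frame standing in for the real file; columns are hypothetical.
toy <- data.frame(
  school = c("A", "B", "A"),
  age    = c(15, 17, 16),
  grade  = c(12, 9, 14),
  stringsAsFactors = FALSE
)

# Record the original names before overwriting them.
name_map <- data.frame(
  code     = paste0("X", seq_len(ncol(toy))),
  original = colnames(toy),
  stringsAsFactors = FALSE
)
colnames(toy) <- name_map$code

# Keep only numeric columns, as in the step above.
toy_numeric <- toy[sapply(toy, is.numeric)]

# Translate the retained codes back to their original names.
retained <- name_map$original[match(colnames(toy_numeric), name_map$code)]
print(retained)
```

The same name_map idea could be applied to the real data so that each X code in later tables can be traced back to its variable.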

3. Assumption Checks

Before conducting PCA or Factor Analysis, three checks are performed: inspection of the correlation matrix, the KMO measure of sampling adequacy, and Bartlett’s test of sphericity.

3.1 Correlation Matrix

A correlation matrix is computed and visualized to examine the linear relationships among variables. Sufficient inter-variable correlation is a prerequisite for dimensionality reduction.

r <- cor(data_numeric)

corrplot(r,
         tl.col = "black",
         tl.srt = 45,
         tl.cex = 0.5,
         title  = "Correlation Matrix",
         mar    = c(0, 0, 1, 0))
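
A plot shows the overall pattern, but it can help to list the strongly correlated pairs explicitly. The sketch below does this in base R. It uses the built-in mtcars data as a stand-in for the student data, and the cutoff of 0.7 is an arbitrary illustrative choice.

```r
# Stand-in data: the built-in mtcars data set, not the student data.
r_demo <- cor(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])

# Look only at the upper triangle so each pair is reported once.
idx <- which(upper.tri(r_demo) & abs(r_demo) >= 0.7, arr.ind = TRUE)

strong_pairs <- data.frame(
  var1 = rownames(r_demo)[idx[, "row"]],
  var2 = colnames(r_demo)[idx[, "col"]],
  r    = round(r_demo[idx], 2)
)

# Strongest relationships first.
print(strong_pairs[order(-abs(strong_pairs$r)), ])
```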

3.2 Kaiser-Meyer-Olkin (KMO) Test

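The KMO measure assesses sampling adequacy (MSA) for each variable and for the variable set as a whole. Values closer to 1 indicate that the data are well suited to factor analysis, and 0.5 is the usual minimum acceptable value. Here the overall MSA is 0.73 and every individual item is above 0.5.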

KMO(r)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = r)
## Overall MSA =  0.73
## MSA for each item = 
##   X3   X7   X8  X13  X14  X15  X24  X25  X26  X27  X28  X29  X30  X31  X32  X33 
## 0.62 0.64 0.63 0.81 0.86 0.86 0.53 0.60 0.61 0.66 0.61 0.52 0.75 0.87 0.73 0.78
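
To make the overall MSA less of a black box, the sketch below computes it from a correlation matrix with base R only, using the standard definition: the sum of squared correlations divided by the sum of squared correlations plus squared partial correlations. The function name kmo_overall is hypothetical, and mtcars again stands in for the student data.

```r
# Overall KMO (MSA) from a correlation matrix, using base R only.
kmo_overall <- function(R) {
  # Rescaling the inverse gives the partial correlations with the sign
  # flipped; the sign does not matter because the values are squared.
  Q <- cov2cor(solve(R))
  diag(Q) <- 0
  diag(R) <- 0
  sum(R^2) / (sum(R^2) + sum(Q^2))
}

r_demo  <- cor(mtcars[, c("mpg", "disp", "hp", "wt", "qsec")])
kmo_val <- kmo_overall(r_demo)
print(round(kmo_val, 2))
```

Large partial correlations relative to the raw correlations pull the value toward 0, which is why a low MSA signals that the variables share little common variance.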

3.3 Bartlett’s Test of Sphericity

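Bartlett’s test examines whether the correlation matrix differs significantly from an identity matrix, that is, whether the variables are correlated at all. A p-value below 0.05 rejects the identity hypothesis and supports proceeding with factor analysis. The p-value printed as 0 below is a value too small for R to represent, not exactly zero.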

cortest.bartlett(r, n = nrow(data_numeric))
## $chisq
## [1] 3535.842
## 
## $p.value
## [1] 0
## 
## $df
## [1] 120
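
The statistic can also be reproduced by hand. The sketch below assumes the usual formula, chi-square = -(n - 1 - (2p + 5)/6) * log(det(R)) with p(p - 1)/2 degrees of freedom, and adds the log of the p-value so that very small values remain visible. The function name bartlett_sphericity is hypothetical and mtcars is a stand-in data set.

```r
# Bartlett's test of sphericity from a correlation matrix R and sample size n.
bartlett_sphericity <- function(R, n) {
  p     <- ncol(R)
  chisq <- -(n - 1 - (2 * p + 5) / 6) * log(det(R))
  df    <- p * (p - 1) / 2
  list(
    chisq   = chisq,
    df      = df,
    p.value = pchisq(chisq, df, lower.tail = FALSE),
    # Natural log of the p-value; stays finite when p.value underflows to 0.
    log.p   = pchisq(chisq, df, lower.tail = FALSE, log.p = TRUE)
  )
}

x_demo <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
res    <- bartlett_sphericity(cor(x_demo), n = nrow(x_demo))
print(res)
```

With 16 variables the formula gives 16 * 15 / 2 = 120 degrees of freedom, which agrees with the output above.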

4. Principal Component Analysis (PCA)

4.1 Eigenvalues and Eigenvectors

The data are standardized using scale() before analysis, so the covariance matrix computed below is identical to the correlation matrix. Eigenvalues represent the amount of variance explained by each component, while eigenvectors define the direction of each principal component.

scale_data <- scale(data_numeric)
r_cov      <- cov(scale_data)

pc <- eigen(r_cov)

cat("Eigenvalues:\n")
## Eigenvalues:
print(pc$values)
##  [1] 3.66522444 1.94781579 1.43169697 1.32197163 1.13753974 1.01059364
##  [7] 0.92486924 0.86029726 0.80786480 0.78054845 0.62896637 0.55434404
## [13] 0.36239871 0.31735807 0.17404805 0.07446281
cat("\nEigenvectors:\n")
## 
## Eigenvectors:
print(pc$vectors)
##              [,1]        [,2]        [,3]        [,4]        [,5]        [,6]
##  [1,]  0.14985339 -0.03072974 -0.16769887  0.25904502  0.65152307  0.20999128
##  [2,] -0.22778366 -0.34350437  0.46521149  0.05378557  0.09141139 -0.13246905
##  [3,] -0.20001281 -0.37039517  0.46487844  0.03856801  0.04284566 -0.11131647
##  [4,]  0.14250860  0.12769961 -0.35760847 -0.01091853 -0.22002103 -0.13026619
##  [5,] -0.19858804  0.10719291 -0.05903041  0.02697283  0.22305221 -0.14057688
##  [6,]  0.28726402  0.06175917  0.04979487  0.09275109  0.43967470  0.07358447
##  [7,] -0.05159628 -0.02533133 -0.04504691 -0.54908284  0.22422207  0.18629627
##  [8,]  0.10442511 -0.23602892 -0.10530609 -0.50187008  0.21684623 -0.28119650
##  [9,]  0.12115860 -0.40757729 -0.27824466 -0.22741374  0.15363838 -0.30955128
## [10,]  0.20654530 -0.42984408 -0.18983621  0.18829647 -0.19944469  0.12951925
## [11,]  0.19741902 -0.49027603 -0.20920886  0.12194532 -0.24291999  0.13310220
## [12,]  0.05303337 -0.10581936  0.14869974 -0.30715649 -0.06955755  0.77759489
## [13,]  0.11956444 -0.17069172  0.02006425  0.39664975  0.19295824  0.05887652
## [14,] -0.44885921 -0.09250038 -0.24653957  0.03434532  0.02076939  0.09753811
## [15,] -0.45850114 -0.09809644 -0.27185712  0.05818802  0.07982333  0.11375158
## [16,] -0.45243720 -0.07833531 -0.27919240  0.08611296  0.07270050  0.09567593
##               [,7]        [,8]        [,9]        [,10]        [,11]
##  [1,]  0.114015038 -0.16222683  0.13606380  0.193628830 -0.336115799
##  [2,]  0.051197390 -0.24946875  0.09292882 -0.006075780  0.001706861
##  [3,] -0.001394447 -0.29187735  0.10087873 -0.121636503 -0.048631536
##  [4,] -0.030180354 -0.72581397  0.12425189 -0.459310367 -0.053248357
##  [5,]  0.694200166 -0.08857760 -0.56600371 -0.142406257  0.138249125
##  [6,]  0.057176654 -0.05755585  0.39371306 -0.123686937  0.455292013
##  [7,] -0.346276697 -0.39512505 -0.34436116  0.387211263  0.226973064
##  [8,]  0.024955959  0.31101478  0.14104994 -0.366786318  0.285482662
##  [9,]  0.007049263  0.06374569 -0.06784500  0.006148621 -0.532870267
## [10,]  0.188057001 -0.09883508 -0.09018961  0.196375633  0.410075906
## [11,]  0.117368759  0.02284601  0.01145988  0.162399841  0.046834611
## [12,]  0.201494950  0.04944765 -0.01466885 -0.390958344 -0.216634708
## [13,] -0.521810725  0.11540732 -0.52412354 -0.433876239  0.068976773
## [14,] -0.020275687  0.06586853  0.12929409 -0.076215242  0.084678795
## [15,] -0.080247956  0.02449356  0.13433970 -0.027355273  0.079179844
## [16,] -0.093942377  0.01950597  0.10408048 -0.063543278  0.054472867
##             [,12]        [,13]        [,14]        [,15]         [,16]
##  [1,]  0.43729750 -0.064976291  0.054234948 -0.076697473  0.0149173422
##  [2,]  0.02837008  0.553545501  0.450636657  0.030514154  0.0121337007
##  [3,]  0.02941887 -0.551261195 -0.415771751 -0.028660109 -0.0015788903
##  [4,]  0.07390635  0.028154359  0.078661109 -0.018743882 -0.0158783820
##  [5,] -0.11253698 -0.093475518  0.059456701  0.025464342 -0.0157481237
##  [6,] -0.56295211  0.010095784 -0.040018619  0.030545617  0.0110342543
##  [7,] -0.06709189 -0.059781296  0.050993641 -0.030302950  0.0231102783
##  [8,]  0.44028392 -0.072061191  0.109237384  0.007686881  0.0037100047
##  [9,] -0.43023156  0.201269288 -0.227963872  0.018656344 -0.0029512550
## [10,]  0.22754179  0.342966810 -0.463482654 -0.003060498  0.0174839171
## [11,] -0.15134756 -0.443411341  0.567853105  0.017805781 -0.0002096884
## [12,] -0.05142337  0.102208766 -0.049819084  0.037794225 -0.0003304998
## [13,] -0.01790039  0.019556142  0.045925731 -0.039413469 -0.0212870145
## [14,] -0.09885720  0.033394199  0.002605489 -0.799282659  0.1840359090
## [15,] -0.02910992  0.008527162 -0.025007825  0.234189012 -0.7713986639
## [16,] -0.01849440 -0.017650374 -0.028241235  0.540463936  0.6072604228
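
Three properties of this decomposition are easy to confirm numerically. The sketch below uses mtcars as a stand-in for the student data and checks that the covariance of standardized data equals the correlation matrix, that the eigenvalues sum to the number of variables, and that the eigenvectors are orthonormal.

```r
# Stand-in data: five numeric columns of the built-in mtcars data set.
x_demo <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
z_demo <- scale(x_demo)

r_demo  <- cov(z_demo)
pc_demo <- eigen(r_demo)

# 1. Covariance of standardized data is the correlation matrix.
print(all.equal(r_demo, cor(x_demo)))

# 2. Eigenvalues sum to the number of variables (here 5).
print(sum(pc_demo$values))

# 3. Eigenvectors are orthonormal: V'V is the identity matrix.
print(round(crossprod(pc_demo$vectors), 10))
```

The second property explains why the 16 eigenvalues above add up to 16, so each one divided by 16 is a proportion of total variance.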

4.2 Proportion and Cumulative Variance

This table summarizes the variance explained by each principal component. Components are typically retained until the cumulative variance reaches 70–80%.

sumvar  <- sum(pc$values)
propvar <- (pc$values / sumvar) * 100

cumvar <- data.frame(
  eigen_value = pc$values,
  prop_var    = propvar
) %>% mutate(cum_var = cumsum(prop_var))

row.names(cumvar) <- paste0("PC", 1:length(pc$values))
print(cumvar)
##      eigen_value   prop_var   cum_var
## PC1   3.66522444 22.9076527  22.90765
## PC2   1.94781579 12.1738487  35.08150
## PC3   1.43169697  8.9481060  44.02961
## PC4   1.32197163  8.2623227  52.29193
## PC5   1.13753974  7.1096234  59.40155
## PC6   1.01059364  6.3162102  65.71776
## PC7   0.92486924  5.7804328  71.49820
## PC8   0.86029726  5.3768579  76.87505
## PC9   0.80786480  5.0491550  81.92421
## PC10  0.78054845  4.8784278  86.80264
## PC11  0.62896637  3.9310398  90.73368
## PC12  0.55434404  3.4646503  94.19833
## PC13  0.36239871  2.2649919  96.46332
## PC14  0.31735807  1.9834879  98.44681
## PC15  0.17404805  1.0878003  99.53461
## PC16  0.07446281  0.4653925 100.00000

4.3 Scree Plot

The scree plot visualizes eigenvalues across components. The red dashed line at eigenvalue = 1 marks Kaiser’s rule: components with eigenvalues above this threshold are generally retained.

plot(pc$values,
     type = "b",
     pch  = 19,
     xlab = "Principal Component",
     ylab = "Eigenvalue",
     main = "Scree Plot")

abline(h = 1, col = "red", lty = 2)
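
The two retention rules mentioned so far, Kaiser’s rule and the cumulative-variance threshold, can be applied programmatically. The sketch below reuses the eigenvalues printed in Section 4.1; the variable names are illustrative.

```r
# Eigenvalues copied from the output in Section 4.1.
ev <- c(3.66522444, 1.94781579, 1.43169697, 1.32197163, 1.13753974,
        1.01059364, 0.92486924, 0.86029726, 0.80786480, 0.78054845,
        0.62896637, 0.55434404, 0.36239871, 0.31735807, 0.17404805,
        0.07446281)

# Kaiser's rule: count eigenvalues greater than 1.
n_kaiser <- sum(ev > 1)

# Cumulative-variance rule: first component at which the running
# percentage of variance reaches the chosen threshold.
cum_pct <- cumsum(ev) / sum(ev) * 100
n_70 <- which(cum_pct >= 70)[1]
n_80 <- which(cum_pct >= 80)[1]

print(c(kaiser = n_kaiser, cum70 = n_70, cum80 = n_80))
```

The rules do not agree here: Kaiser’s rule keeps 6 components (65.7% of the variance), while reaching 70% takes 7 components and reaching 80% takes 9.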

4.4 PC Scores (Manual)

Principal component scores are obtained by projecting the standardized data onto the eigenvectors.

scores_manual <- as.matrix(scale_data) %*% pc$vectors
head(scores_manual)
##            [,1]       [,2]       [,3]        [,4]       [,5]        [,6]
## [1,]  1.3292614 -0.0618981  2.1573345 -0.01230047  0.9384844 -1.33662794
## [2,]  0.4836354  1.8017441 -0.3527618 -0.89723782  0.4278207  0.07146723
## [3,] -0.1229392  0.6573366 -0.8498326  0.28489355 -1.3347787  0.40738294
## [4,] -2.5683083  0.9456003  1.1440652  0.31392107 -0.8903418  0.68025211
## [5,] -1.1999696  0.2833878  1.2290959 -0.55949421 -0.4868174  0.89345203
## [6,] -1.2069326 -0.4911520  1.5212548 -1.05872144  0.2643133  0.77670310
##             [,7]        [,8]        [,9]       [,10]      [,11]      [,12]
## [1,] -0.02875205 -1.64788797 -0.38973340  0.05352091 -1.5768793  0.8647519
## [2,] -0.42351752  0.60905607 -0.84112406  1.19294478 -0.1631189  0.2311293
## [3,] -0.42436581  1.35763655 -0.91220979  0.43416981  1.3439901 -0.1895358
## [4,]  1.35191505  0.50231375 -0.08922503 -0.46951645  0.1793541 -0.7239119
## [5,]  0.39382003  0.24427036  0.27779824  0.05086577  0.1000340  0.1270744
## [6,] -0.55412130  0.07164701 -0.54306251 -0.47762282  0.7034832  0.4515991
##            [,13]       [,14]      [,15]      [,16]
## [1,]  0.04515699 -0.39254752 2.99452824 -0.7810214
## [2,] -0.03195025 -0.36373457 0.42907611 -0.1606329
## [3,] -0.29978654  0.08736565 0.01588474 -0.3469934
## [4,]  1.13316391  0.27980951 0.03707345 -0.1092970
## [5,] -0.30686446  0.09134754 0.50570986 -0.1869752
## [6,]  0.08444541  0.71473403 0.08572768  0.1557004
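
As a sanity check on the manual projection, the scores can be compared with those from base R’s prcomp(). The sketch below uses mtcars as a stand-in for the student data. Because the sign of each eigenvector is arbitrary, the comparison is made on absolute values.

```r
# Stand-in data: five numeric columns of the built-in mtcars data set.
x_demo <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
z_demo <- scale(x_demo)
e_demo <- eigen(cov(z_demo))

# Manual scores (as above) and scores from prcomp().
scores_eigen  <- z_demo %*% e_demo$vectors
scores_prcomp <- prcomp(x_demo, center = TRUE, scale. = TRUE)$x

# Columns agree up to sign.
print(all.equal(abs(unname(scores_eigen)), abs(unname(scores_prcomp))))

# The variance of each score column reproduces the eigenvalues.
print(round(apply(scores_eigen, 2, var), 4))
print(round(e_demo$values, 4))
```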

5. PCA using principal()

The principal() function from the psych package provides a streamlined approach to PCA. Here, 6 components are extracted without rotation, matching the six eigenvalues greater than 1 found in Section 4.

pc_psych <- principal(scale_data, nfactors = 6, rotate = "none")
print(pc_psych)
## Principal Components Analysis
## Call: principal(r = scale_data, nfactors = 6, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
##       PC1   PC2   PC3   PC4   PC5   PC6   h2   u2 com
## X3  -0.29  0.04  0.20 -0.30  0.69  0.21 0.74 0.26 2.2
## X7   0.44  0.48 -0.56 -0.06  0.10 -0.13 0.76 0.24 3.1
## X8   0.38  0.52 -0.56 -0.04  0.05 -0.11 0.74 0.26 2.9
## X13 -0.27 -0.18  0.43  0.01 -0.23 -0.13 0.36 0.64 3.0
## X14  0.38 -0.15  0.07 -0.03  0.24 -0.14 0.25 0.75 2.5
## X15 -0.55 -0.09 -0.06 -0.11  0.47  0.07 0.55 0.45 2.2
## X24  0.10  0.04  0.05  0.63  0.24  0.19 0.50 0.50 1.6
## X25 -0.20  0.33  0.13  0.58  0.23 -0.28 0.63 0.37 3.0
## X26 -0.23  0.57  0.33  0.26  0.16 -0.31 0.68 0.32 3.4
## X27 -0.40  0.60  0.23 -0.22 -0.21  0.13 0.68 0.32 2.8
## X28 -0.38  0.68  0.25 -0.14 -0.26  0.13 0.78 0.22 2.4
## X29 -0.10  0.15 -0.18  0.35 -0.07  0.78 0.81 0.19 1.7
## X30 -0.23  0.24 -0.02 -0.46  0.21  0.06 0.36 0.64 2.6
## X31  0.86  0.13  0.29 -0.04  0.02  0.10 0.85 0.15 1.3
## X32  0.88  0.14  0.33 -0.07  0.09  0.11 0.92 0.08 1.4
## X33  0.87  0.11  0.33 -0.10  0.08  0.10 0.90 0.10 1.4
## 
##                        PC1  PC2  PC3  PC4  PC5  PC6
## SS loadings           3.67 1.95 1.43 1.32 1.14 1.01
## Proportion Var        0.23 0.12 0.09 0.08 0.07 0.06
## Cumulative Var        0.23 0.35 0.44 0.52 0.59 0.66
## Proportion Explained  0.35 0.19 0.14 0.13 0.11 0.10
## Cumulative Proportion 0.35 0.53 0.67 0.80 0.90 1.00
## 
## Mean item complexity =  2.3
## Test of the hypothesis that 6 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.05 
##  with the empirical chi square  436.27  with prob <  1.3e-68 
## 
## Fit based upon off diagonal values = 0.94

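PC scores are recovered from the loadings matrix and eigenvalues as follows. The values match the manual scores in Section 4.4 except that PC1 to PC4 have the opposite sign; the sign of an eigenvector is arbitrary, so both versions describe the same components.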

L         <- as.matrix(pc_psych$loadings)
lambda_k  <- pc_psych$values[1:ncol(L)]
V         <- sweep(L, 2, sqrt(lambda_k), "/")
scores_PC <- scale_data %*% V
head(scores_PC)
##             PC1        PC2        PC3         PC4        PC5         PC6
## [1,] -1.3292614  0.0618981 -2.1573345  0.01230047  0.9384844 -1.33662794
## [2,] -0.4836354 -1.8017441  0.3527618  0.89723782  0.4278207  0.07146723
## [3,]  0.1229392 -0.6573366  0.8498326 -0.28489355 -1.3347787  0.40738294
## [4,]  2.5683083 -0.9456003 -1.1440652 -0.31392107 -0.8903418  0.68025211
## [5,]  1.1999696 -0.2833878 -1.2290959  0.55949421 -0.4868174  0.89345203
## [6,]  1.2069326  0.4911520 -1.5212548  1.05872144  0.2643133  0.77670310
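
The division by the square root of the eigenvalues follows from how unrotated loadings are defined. Writing V_k for the first k eigenvectors, Λ_k for the diagonal matrix of the first k eigenvalues, and Z for the standardized data (notation introduced here for illustration), the relationship is:

```latex
L = V_k \Lambda_k^{1/2}
\quad\Longrightarrow\quad
V_k = L \, \Lambda_k^{-1/2},
\qquad
\text{scores} = Z V_k = Z L \, \Lambda_k^{-1/2}
```

For example, the X31 entry of the first eigenvector, -0.449, multiplied by the square root of 3.665 gives about -0.86, which is the PC1 loading of X31 with its sign reversed.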

6. Factor Analysis (FA)

6.1 Without Rotation

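The unrotated solution serves as the baseline before any rotation method is applied. Because principal() extracts factors by the principal-component method, this output is identical to the PCA in Section 5; a common-factor model would instead be fitted with a function such as fa() from the psych package.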

fa_none <- principal(scale_data, nfactors = 6, rotate = "none")
print(fa_none)
## Principal Components Analysis
## Call: principal(r = scale_data, nfactors = 6, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
##       PC1   PC2   PC3   PC4   PC5   PC6   h2   u2 com
## X3  -0.29  0.04  0.20 -0.30  0.69  0.21 0.74 0.26 2.2
## X7   0.44  0.48 -0.56 -0.06  0.10 -0.13 0.76 0.24 3.1
## X8   0.38  0.52 -0.56 -0.04  0.05 -0.11 0.74 0.26 2.9
## X13 -0.27 -0.18  0.43  0.01 -0.23 -0.13 0.36 0.64 3.0
## X14  0.38 -0.15  0.07 -0.03  0.24 -0.14 0.25 0.75 2.5
## X15 -0.55 -0.09 -0.06 -0.11  0.47  0.07 0.55 0.45 2.2
## X24  0.10  0.04  0.05  0.63  0.24  0.19 0.50 0.50 1.6
## X25 -0.20  0.33  0.13  0.58  0.23 -0.28 0.63 0.37 3.0
## X26 -0.23  0.57  0.33  0.26  0.16 -0.31 0.68 0.32 3.4
## X27 -0.40  0.60  0.23 -0.22 -0.21  0.13 0.68 0.32 2.8
## X28 -0.38  0.68  0.25 -0.14 -0.26  0.13 0.78 0.22 2.4
## X29 -0.10  0.15 -0.18  0.35 -0.07  0.78 0.81 0.19 1.7
## X30 -0.23  0.24 -0.02 -0.46  0.21  0.06 0.36 0.64 2.6
## X31  0.86  0.13  0.29 -0.04  0.02  0.10 0.85 0.15 1.3
## X32  0.88  0.14  0.33 -0.07  0.09  0.11 0.92 0.08 1.4
## X33  0.87  0.11  0.33 -0.10  0.08  0.10 0.90 0.10 1.4
## 
##                        PC1  PC2  PC3  PC4  PC5  PC6
## SS loadings           3.67 1.95 1.43 1.32 1.14 1.01
## Proportion Var        0.23 0.12 0.09 0.08 0.07 0.06
## Cumulative Var        0.23 0.35 0.44 0.52 0.59 0.66
## Proportion Explained  0.35 0.19 0.14 0.13 0.11 0.10
## Cumulative Proportion 0.35 0.53 0.67 0.80 0.90 1.00
## 
## Mean item complexity =  2.3
## Test of the hypothesis that 6 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.05 
##  with the empirical chi square  436.27  with prob <  1.3e-68 
## 
## Fit based upon off diagonal values = 0.94

6.2 With Varimax Rotation

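Varimax is an orthogonal rotation that simplifies the factor structure by maximizing the variance of the squared loadings within each factor, pushing loadings toward either zero or large absolute values. The result is easier to interpret: mean item complexity falls from 2.3 to 1.5, while the communalities (h2) and the overall fit are unchanged.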

fa_varimax <- principal(scale_data, nfactors = 6, rotate = "varimax")
print(fa_varimax)
## Principal Components Analysis
## Call: principal(r = scale_data, nfactors = 6, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##       RC1   RC2   RC3   RC4   RC5   RC6   h2   u2 com
## X3   0.01  0.04 -0.09  0.06  0.85  0.03 0.74 0.26 1.0
## X7   0.15  0.02  0.85  0.03 -0.09 -0.07 0.76 0.24 1.1
## X8   0.11  0.08  0.84  0.04 -0.12 -0.04 0.74 0.26 1.1
## X13 -0.08  0.14 -0.54  0.05 -0.11 -0.16 0.36 0.64 1.5
## X14  0.34 -0.32  0.06  0.03  0.07 -0.16 0.25 0.75 2.6
## X15 -0.42 -0.04 -0.10  0.08  0.59  0.03 0.55 0.45 1.9
## X24  0.10 -0.24 -0.01  0.47 -0.04  0.47 0.50 0.50 2.6
## X25 -0.13  0.04  0.02  0.78 -0.02  0.05 0.63 0.37 1.1
## X26  0.01  0.40 -0.01  0.70  0.09 -0.15 0.68 0.32 1.7
## X27 -0.10  0.81 -0.03  0.07  0.11  0.01 0.68 0.32 1.1
## X28 -0.07  0.86 -0.01  0.14  0.04  0.05 0.78 0.22 1.1
## X29 -0.08  0.15  0.07 -0.06 -0.01  0.87 0.81 0.19 1.1
## X30 -0.08  0.31  0.13 -0.15  0.44 -0.15 0.36 0.64 2.7
## X31  0.90 -0.08  0.14 -0.03 -0.15  0.02 0.85 0.15 1.1
## X32  0.94 -0.09  0.14 -0.03 -0.08  0.02 0.92 0.08 1.1
## X33  0.93 -0.09  0.11 -0.05 -0.08 -0.02 0.90 0.10 1.1
## 
##                        RC1  RC2  RC3  RC4  RC5  RC6
## SS loadings           2.95 1.89 1.82 1.39 1.37 1.10
## Proportion Var        0.18 0.12 0.11 0.09 0.09 0.07
## Cumulative Var        0.18 0.30 0.42 0.50 0.59 0.66
## Proportion Explained  0.28 0.18 0.17 0.13 0.13 0.10
## Cumulative Proportion 0.28 0.46 0.63 0.77 0.90 1.00
## 
## Mean item complexity =  1.5
## Test of the hypothesis that 6 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.05 
##  with the empirical chi square  436.27  with prob <  1.3e-68 
## 
## Fit based upon off diagonal values = 0.94
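
The claim that rotation redistributes variance without changing how well each variable is explained can be checked directly. The sketch below uses base R’s varimax() from the stats package rather than the psych internals, with mtcars as a stand-in data set and k = 2 components chosen only for illustration.

```r
# Stand-in data: five numeric columns of the built-in mtcars data set.
x_demo <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
e_demo <- eigen(cor(x_demo))

k <- 2  # number of components kept; illustrative only

# Unrotated loadings: eigenvectors scaled by sqrt(eigenvalue).
L_unrot <- e_demo$vectors[, 1:k] %*% diag(sqrt(e_demo$values[1:k]))
rownames(L_unrot) <- colnames(x_demo)

rot   <- varimax(L_unrot)
L_rot <- unclass(rot$loadings)

# Communalities (row sums of squared loadings) are unchanged by rotation.
print(round(cbind(h2_unrotated = rowSums(L_unrot^2),
                  h2_rotated   = rowSums(L_rot^2)), 3))

# Total explained variance is the same; only its split across columns moves.
print(round(c(unrotated = sum(L_unrot^2), rotated = sum(L_rot^2)), 3))
```

This mirrors the output above, where the h2 and u2 columns are the same in Sections 6.1 and 6.2 while the SS loadings row changes.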

6.3 Factor Scores

Factor scores are computed using the regression method, which multiplies the standardized data by the inverse of the correlation matrix and then by the loadings matrix. The unrotated loadings (fa_none) are used here.

scores_FA <- scale_data %*% solve(cor(scale_data)) %*% as.matrix(fa_none$loadings)
head(scores_FA)
##              PC1         PC2        PC3        PC4        PC5         PC6
## [1,] -0.69432072  0.04435099 -1.8029843  0.0106982  0.8799212 -1.32960382
## [2,] -0.25262005 -1.29097894  0.2948194  0.7803627  0.4011239  0.07109167
## [3,]  0.06421553 -0.47099235  0.7102444 -0.2477830 -1.2514860  0.40524210
## [4,]  1.34151914 -0.67753800 -0.9561482 -0.2730294 -0.8347828  0.67667731
## [5,]  0.62678699 -0.20305197 -1.0272123  0.4866139 -0.4564391  0.88875685
## [6,]  0.63042399  0.35191843 -1.2713830  0.9208113  0.2478196  0.77262145
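
With unrotated principal-component loadings, the regression formula reduces to the PC scores divided by the square root of the corresponding eigenvalue. The sketch below checks this on mtcars as a stand-in data set (k = 2 is illustrative), then applies the same relation to the first PC1 values printed in Sections 5 and 6.3.

```r
# Stand-in data: five numeric columns of the built-in mtcars data set.
x_demo <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
z_demo <- scale(x_demo)
e_demo <- eigen(cor(x_demo))

k   <- 2
V_k <- e_demo$vectors[, 1:k]
lam <- e_demo$values[1:k]
L_k <- V_k %*% diag(sqrt(lam))   # unrotated loadings

scores_reg <- z_demo %*% solve(cor(x_demo)) %*% L_k  # regression method
scores_pc  <- z_demo %*% V_k                         # raw PC scores

# Regression scores equal PC scores scaled to unit variance.
scaled_pc <- sweep(scores_pc, 2, sqrt(lam), "/")
print(all.equal(unname(scores_reg), unname(scaled_pc)))

# Same relation with the document's numbers: PC1 score of row 1
# divided by the square root of the first eigenvalue.
check_val <- -1.3292614 / sqrt(3.66522444)
print(check_val)
```

The last value is approximately -0.694, in line with the first PC1 entry of scores_FA. The factor scores are therefore the same components as scores_PC, rescaled to unit variance.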

7. Descriptive Statistics

7.1 Summary Table

A descriptive statistics table is presented to provide an overview of the distribution of each numeric variable.

desc_table <- data.frame(
  Min    = sapply(data_numeric, min),
  Max    = sapply(data_numeric, max),
  Mean   = round(sapply(data_numeric, mean), 2),
  Median = sapply(data_numeric, median),
  SD     = round(sapply(data_numeric, sd), 2)
)
print(desc_table)
##     Min Max  Mean Median   SD
## X3   15  22 16.74     17 1.22
## X7    0   4  2.51      2 1.13
## X8    0   4  2.31      2 1.10
## X13   1   4  1.57      1 0.75
## X14   1   4  1.93      2 0.83
## X15   0   3  0.22      0 0.59
## X24   1   5  3.93      4 0.96
## X25   1   5  3.18      3 1.05
## X26   1   5  3.18      3 1.18
## X27   1   5  1.50      1 0.92
## X28   1   5  2.28      2 1.28
## X29   1   5  3.54      4 1.45
## X30   0  32  3.66      2 4.64
## X31   0  19 11.40     11 2.75
## X32   0  19 11.57     11 2.91
## X33   0  19 11.91     12 3.23

7.2 Boxplot

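The boxplot below illustrates the spread and potential outliers for each numeric variable in the dataset. Variables are plotted on their original scales, so X30 (range 0 to 32) and X31 to X33 (range 0 to 19) take up far more of the axis than the items measured on a 1-to-5 scale.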

par(mar = c(10, 4, 4, 2))
boxplot(data_numeric,
        main = "Boxplot of Numeric Variables",
        las  = 2,
        col  = "lightblue")