Principal Component Analysis (PCA) is a mathematical technique for explaining the information in a multivariate dataset with fewer variables and minimal loss of information. In other words, PCA is a transformation that reduces a dataset containing many interrelated variables to a smaller one while preserving as much of the original information as possible. The new variables obtained after the transformation are called the principal components of the original variables. The first principal component is the one that captures the largest variance, and the remaining components are ordered by decreasing variance.
Among the main advantages of PCA are its low sensitivity to noise, its reduced memory and storage requirements, and the efficiency of working in low-dimensional spaces. To summarize, PCA has three main objectives: to reduce the number of variables, to remove redundancy among correlated variables, and to make the structure underlying the data easier to interpret.
PCA rests on the spectral properties of the covariance or correlation matrix of the variables in the dataset. These matrices are symmetric and positive semi-definite, and their eigenvalues equal the variances of the principal components. These properties are what make PCA such an effective technique.
In other words, PCA amounts to finding the eigenvalues and eigenvectors of a dataset's covariance or correlation matrix. Before statistical software, this was very challenging and laborious; R packages have made the technique both easy to apply and highly understandable.
PCA generally consists of five basic steps: (1) standardize the data, (2) compute the covariance (or correlation) matrix, (3) compute its eigenvalues and eigenvectors, (4) select the components that retain the desired amount of variance, and (5) project the data onto the selected eigenvectors. We will follow exactly these steps below.
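Before turning to the real dataset, here is a minimal sketch of these five steps on R's built-in mtcars data (the 85% variance threshold is purely illustrative):
X <- scale(mtcars, center = TRUE, scale = TRUE)  # 1. standardize the data
S <- cov(X)                                      # 2. covariance matrix
e <- eigen(S)                                    # 3. eigenvalues and eigenvectors
pve <- e$values / sum(e$values)                  #    eigenvalues are the component variances
k <- which(cumsum(pve) >= 0.85)[1]               # 4. keep enough components for ~85% of variance
scores <- X %*% e$vectors[, 1:k]                 # 5. project the data onto the eigenvectors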
The dataset used in this analysis is the Turkiye Student Evaluation dataset from the UCI Machine Learning Repository, donated by Necla Gunduz (Department of Statistics, Faculty of Science, Gazi University, Ankara, Turkey) and Ernest Fokoue (Center for Quality and Applied Statistics, Rochester Institute of Technology, Rochester, NY, USA). Its attributes are as follows:
instr: Instructor’s identifier; values taken from {1,2,3}
class: Course code (descriptor); values taken from {1-13}
nb.repeat: Number of times the student is taking this course; values taken from {0,1,2,3,…}
attendance: Code of the level of attendance; values from {0, 1, 2, 3, 4}
difficulty: Level of difficulty of the course as perceived by the student; values taken from {1,2,3,4,5}
Q1: The semester course content, teaching method and evaluation system were provided at the start.
Q2: The course aims and objectives were clearly stated at the beginning of the period.
Q3: The course was worth the amount of credit assigned to it.
Q4: The course was taught according to the syllabus announced on the first day of class.
Q5: The class discussions, homework assignments, applications and studies were satisfactory.
Q6: The textbook and other course resources were sufficient and up to date.
Q7: The course allowed field work, applications, laboratory, discussion and other studies.
Q8: The quizzes, assignments, projects and exams contributed to helping the learning.
Q9: I greatly enjoyed the class and was eager to actively participate during the lectures.
Q10: My initial expectations about the course were met at the end of the period or year.
Q11: The course was relevant and beneficial to my professional development.
Q12: The course helped me look at life and the world with a new perspective.
Q13: The Instructor’s knowledge was relevant and up to date.
Q14: The Instructor came prepared for classes.
Q15: The Instructor taught in accordance with the announced lesson plan.
Q16: The Instructor was committed to the course and was understandable.
Q17: The Instructor arrived on time for classes.
Q18: The Instructor has a smooth and easy to follow delivery/speech.
Q19: The Instructor made effective use of class hours.
Q20: The Instructor explained the course and was eager to be helpful to students.
Q21: The Instructor demonstrated a positive approach to students.
Q22: The Instructor was open and respectful of the views of students about the course.
Q23: The Instructor encouraged participation in the course.
Q24: The Instructor gave relevant homework assignments/projects, and helped/guided students.
Q25: The Instructor responded to questions about the course inside and outside of the course.
Q26: The Instructor’s evaluation system (midterm and final questions, projects, assignments, etc.) effectively measured the course objectives.
Q27: The Instructor provided solutions to exams and discussed them with students.
Q28: The Instructor treated all students in a right and objective manner.
Q1-Q28 are all Likert-type, meaning that their values are taken from {1,2,3,4,5}.
Let’s load the relevant packages, read in the data, and look at its head.
library(clusterSim)   # for data.Normalization()
library(FactoMineR)   # PCA utilities
library(factoextra)   # visualization helpers for PCA
setwd("C:\\Users\\ozgrp\\Desktop\\UW\\USL\\Project02")  # adjust to your own path
data <- read.csv("turkiye-student-evaluation_R_Specific.csv", sep = ",", dec = ".", header = TRUE)
head(data)
## instr class nb.repeat attendance difficulty Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9
## 1 1 2 1 0 4 3 3 3 3 3 3 3 3 3
## 2 1 2 1 1 3 3 3 3 3 3 3 3 3 3
## 3 1 2 1 2 4 5 5 5 5 5 5 5 5 5
## 4 1 2 1 1 3 3 3 3 3 3 3 3 3 3
## 5 1 2 1 0 1 1 1 1 1 1 1 1 1 1
## 6 1 2 1 3 3 4 4 4 4 4 4 4 4 4
## Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27
## 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 3 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 6 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## Q28
## 1 3
## 2 3
## 3 5
## 4 3
## 5 1
## 6 4
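It is also worth noting the overall size of the data (the UCI version of this dataset contains 5,820 evaluations of 33 variables):
dim(data)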
Check whether the columns are numeric, since PCA requires all of its inputs to be numeric:
sapply(data, is.numeric)
## instr class nb.repeat attendance difficulty Q1
## TRUE TRUE TRUE TRUE TRUE TRUE
## Q2 Q3 Q4 Q5 Q6 Q7
## TRUE TRUE TRUE TRUE TRUE TRUE
## Q8 Q9 Q10 Q11 Q12 Q13
## TRUE TRUE TRUE TRUE TRUE TRUE
## Q14 Q15 Q16 Q17 Q18 Q19
## TRUE TRUE TRUE TRUE TRUE TRUE
## Q20 Q21 Q22 Q23 Q24 Q25
## TRUE TRUE TRUE TRUE TRUE TRUE
## Q26 Q27 Q28
## TRUE TRUE TRUE
As can be seen above, all of the variables are numeric. Now let’s check whether there are any missing values:
sort(colSums(is.na(data)))
## instr class nb.repeat attendance difficulty Q1
## 0 0 0 0 0 0
## Q2 Q3 Q4 Q5 Q6 Q7
## 0 0 0 0 0 0
## Q8 Q9 Q10 Q11 Q12 Q13
## 0 0 0 0 0 0
## Q14 Q15 Q16 Q17 Q18 Q19
## 0 0 0 0 0 0
## Q20 Q21 Q22 Q23 Q24 Q25
## 0 0 0 0 0 0
## Q26 Q27 Q28
## 0 0 0
The output above shows that none of our variables has any missing values, which is perfect.
Let’s scale our data (in other words, standardize it; later we will also let the PCA function do this for us on the unscaled data):
scaled.data <- scale(data, center = TRUE, scale = TRUE)
head(scaled.data)
## instr class nb.repeat attendance difficulty Q1
## 1 -2.067673 -1.430596 -0.4021395 -1.1360201 0.901784 0.05227373
## 2 -2.067673 -1.430596 -0.4021395 -0.4580425 0.160487 0.05227373
## 3 -2.067673 -1.430596 -0.4021395 0.2199350 0.901784 1.54361247
## 4 -2.067673 -1.430596 -0.4021395 -0.4580425 0.160487 0.05227373
## 5 -2.067673 -1.430596 -0.4021395 -1.1360201 -1.322107 -1.43906502
## 6 -2.067673 -1.430596 -0.4021395 0.8979125 0.160487 0.79794310
## Q2 Q3 Q4 Q5 Q6 Q7
## 1 -0.0574854 -0.1425485 -0.06420255 -0.08275434 -0.08384423 -0.05185145
## 2 -0.0574854 -0.1425485 -0.06420255 -0.08275434 -0.08384423 -0.05185145
## 3 1.4986310 1.4528982 1.49270921 1.48098025 1.47767069 1.51175190
## 4 -0.0574854 -0.1425485 -0.06420255 -0.08275434 -0.08384423 -0.05185145
## 5 -1.6136018 -1.7379952 -1.62111430 -1.64648893 -1.64535914 -1.61545480
## 6 0.7205728 0.6551749 0.71425333 0.69911295 0.69691323 0.72995022
## Q8 Q9 Q10 Q11 Q12 Q13
## 1 -0.03266461 -0.1308026 -0.07113699 -0.1419197 -0.02723829 -0.1920451
## 2 -0.03266461 -0.1308026 -0.07113699 -0.1419197 -0.02723829 -0.1920451
## 3 1.52559804 1.4453279 1.49711027 1.4019541 1.50442214 1.3899821
## 4 -0.03266461 -0.1308026 -0.07113699 -0.1419197 -0.02723829 -0.1920451
## 5 -1.59092727 -1.7069331 -1.63938425 -1.6857934 -1.55889873 -1.7740722
## 6 0.74646671 0.6572627 0.71298664 0.6300172 0.73859193 0.5989685
## Q14 Q15 Q16 Q17 Q18 Q19
## 1 -0.2317188 -0.2292556 -0.1316659 -0.3143545 -0.1738622 -0.2063033
## 2 -0.2317188 -0.2292556 -0.1316659 -0.3143545 -0.1738622 -0.2063033
## 3 1.3614334 1.3667584 1.4211113 1.2635180 1.3888836 1.3704338
## 4 -0.2317188 -0.2292556 -0.1316659 -0.3143545 -0.1738622 -0.2063033
## 5 -1.8248709 -1.8252696 -1.6844431 -1.8922270 -1.7366079 -1.7830404
## 6 0.5648573 0.5687514 0.6447227 0.4745818 0.6075107 0.5820652
## Q20 Q21 Q22 Q23 Q24 Q25
## 1 -0.2235153 -0.2420430 -0.2503440 -0.1586450 -0.1307605 -0.2485855
## 2 -0.2235153 -0.2420430 -0.2503440 -0.1586450 -0.1307605 -0.2485855
## 3 1.3428414 1.3327923 1.3264978 1.4129535 1.4367498 1.3421429
## 4 -0.2235153 -0.2420430 -0.2503440 -0.1586450 -0.1307605 -0.2485855
## 5 -1.7898720 -1.8168782 -1.8271858 -1.7302436 -1.6982708 -1.8393139
## 6 0.5596630 0.5453747 0.5380769 0.6271542 0.6529946 0.5467787
## Q26 Q27 Q28
## 1 -0.1748373 -0.1198347 -0.2409272
## 2 -0.1748373 -0.1198347 -0.2409272
## 3 1.3991044 1.4283069 1.3231510
## 4 -0.1748373 -0.1198347 -0.2409272
## 5 -1.7487791 -1.6679763 -1.8050053
## 6 0.6121335 0.6542361 0.5411119
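As a quick sanity check on the standardization (a minimal sketch), every column of the scaled data should now have mean 0 and standard deviation 1:
range(colMeans(scaled.data))      # all means are effectively zero
range(apply(scaled.data, 2, sd))  # all standard deviations are 1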
After this step we compute the covariance matrix of the scaled data and obtain its eigenvalues. Note that, because the data are standardized, this covariance matrix is also the correlation matrix of the original variables.
cov.data <- cov(scaled.data)
head(cov.data)
## instr class nb.repeat attendance difficulty
## instr 1.00000000 -0.03987106 0.11276336 -0.10723137 -0.05836793
## class -0.03987106 1.00000000 0.09152659 -0.01631225 -0.04489877
## nb.repeat 0.11276336 0.09152659 1.00000000 -0.07808589 0.11049303
## attendance -0.10723137 -0.01631225 -0.07808589 1.00000000 0.43679161
## difficulty -0.05836793 -0.04489877 0.11049303 0.43679161 1.00000000
## Q1 -0.12893147 -0.02954238 -0.02470843 0.10526616 0.05211965
## Q1 Q2 Q3 Q4 Q5
## instr -0.12893147 -0.12706996 -0.10894863 -0.11322187 -0.13560563
## class -0.02954238 -0.03327379 -0.02153407 -0.03016465 -0.03658396
## nb.repeat -0.02470843 -0.04170674 -0.03570381 -0.03361266 -0.03177018
## attendance 0.10526616 0.14925830 0.17839349 0.13810790 0.14974653
## difficulty 0.05211965 0.06503112 0.07145738 0.06217107 0.06418094
## Q1 1.00000000 0.86613798 0.76738148 0.84977255 0.80475673
## Q6 Q7 Q8 Q9 Q10
## instr -0.09831929 -0.12443411 -0.15010866 -0.11178933 -0.13004590
## class -0.04597201 -0.04676086 -0.03929564 -0.01842956 -0.03393786
## nb.repeat -0.02691810 -0.03094972 -0.02420414 -0.03734668 -0.02861191
## attendance 0.14370478 0.13747239 0.13282121 0.18229316 0.14693113
## difficulty 0.05274597 0.05005350 0.05169472 0.05502915 0.04288434
## Q1 0.76956071 0.79395720 0.79334652 0.73474404 0.79661214
## Q11 Q12 Q13 Q14 Q15
## instr -0.12953460 -0.12703632 -0.11145992 -0.10232819 -0.09980218
## class -0.02113611 -0.04486154 -0.04697085 -0.04389983 -0.03904068
## nb.repeat -0.03365828 -0.01589967 -0.04379275 -0.05154347 -0.03965889
## attendance 0.17889853 0.12957197 0.18647788 0.20225192 0.19584539
## difficulty 0.05896892 0.03637216 0.07949756 0.09249966 0.08945881
## Q1 0.71607569 0.76119667 0.71786244 0.69582240 0.69641049
## Q16 Q17 Q18 Q19 Q20
## instr -0.12706856 -0.08058263 -0.14854652 -0.11248496 -0.08683139
## class -0.03663522 -0.02895678 -0.02187378 -0.01872659 -0.03123429
## nb.repeat -0.02563978 -0.04952619 -0.03739157 -0.04556763 -0.04262435
## attendance 0.15307103 0.23147972 0.17917260 0.19069353 0.19516552
## difficulty 0.04971816 0.12252026 0.06852016 0.08001655 0.09105086
## Q1 0.73693754 0.61220212 0.70568166 0.69936994 0.68529916
## Q21 Q22 Q23 Q24 Q25
## instr -0.07810139 -0.08058617 -0.11888900 -0.12887991 -0.08356328
## class -0.02275796 -0.01655280 -0.02598300 -0.03671183 -0.02781614
## nb.repeat -0.04626193 -0.04546352 -0.04123302 -0.03361830 -0.04991824
## attendance 0.20480200 0.20773969 0.17781379 0.16354632 0.20443462
## difficulty 0.09562751 0.09954325 0.07531710 0.07260860 0.09968242
## Q1 0.67376961 0.67070160 0.72876989 0.73216686 0.67212031
## Q26 Q27 Q28
## instr -0.10349866 -0.10766351 -0.08167233
## class -0.02949070 -0.02257620 -0.03736421
## nb.repeat -0.03551841 -0.03245673 -0.04489990
## attendance 0.17269474 0.14468671 0.20015000
## difficulty 0.06445622 0.05936982 0.09087630
## Q1 0.69892434 0.70963896 0.65887343
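We can verify this equivalence directly (a minimal sketch):
all.equal(cov(scaled.data), cor(data), check.attributes = FALSE)
Now we extract the eigenvalues and eigenvectors of the covariance matrix: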
data.eigen <- eigen(cov.data)
head(data.eigen, n=1)
## $values
## [1] 23.10434548 1.49296567 1.22730132 1.12713508 1.03353577
## [6] 0.81544366 0.52661856 0.38672563 0.34874071 0.28732902
## [11] 0.25320004 0.20344215 0.18293627 0.17050801 0.14174905
## [16] 0.13802371 0.13634271 0.11874927 0.11619028 0.11382587
## [21] 0.10923216 0.10544191 0.10028425 0.09520722 0.09258663
## [26] 0.08455972 0.08422788 0.08007743 0.07731162 0.07071785
## [31] 0.06764076 0.05577729 0.05182704
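The proportion of variance explained by each component is its eigenvalue divided by the sum of all eigenvalues (a minimal sketch; the values match the prcomp() summary below):
pve <- data.eigen$values / sum(data.eigen$values)
round(pve[1], 4)          # first component explains ~70% of the variance
round(cumsum(pve)[5], 4)  # first five components: ~85%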
Once we have the covariance matrix we can compute the PCA itself. Although there are many functions for computing principal components, we will use prcomp() here, which is based on the singular value decomposition. The scale. parameter is intentionally set to TRUE to emphasize that it is not necessary to scale the data outside of the function.
data.pca <- prcomp(data, center = TRUE, scale. = TRUE)
summary(data.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 4.8067 1.22187 1.10784 1.06167 1.01663 0.90302
## Proportion of Variance 0.7001 0.04524 0.03719 0.03416 0.03132 0.02471
## Cumulative Proportion 0.7001 0.74537 0.78256 0.81672 0.84804 0.87275
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.72568 0.62187 0.59054 0.53603 0.50319 0.45105
## Proportion of Variance 0.01596 0.01172 0.01057 0.00871 0.00767 0.00616
## Cumulative Proportion 0.88871 0.90043 0.91099 0.91970 0.92737 0.93354
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 0.42771 0.41293 0.3765 0.37152 0.36925 0.3446
## Proportion of Variance 0.00554 0.00517 0.0043 0.00418 0.00413 0.0036
## Cumulative Proportion 0.93908 0.94425 0.9485 0.95273 0.95686 0.9605
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 0.34087 0.33738 0.33050 0.3247 0.31668 0.30856
## Proportion of Variance 0.00352 0.00345 0.00331 0.0032 0.00304 0.00289
## Cumulative Proportion 0.96398 0.96743 0.97074 0.9739 0.97697 0.97986
## PC25 PC26 PC27 PC28 PC29 PC30
## Standard deviation 0.30428 0.29079 0.29022 0.28298 0.27805 0.26593
## Proportion of Variance 0.00281 0.00256 0.00255 0.00243 0.00234 0.00214
## Cumulative Proportion 0.98266 0.98522 0.98778 0.99020 0.99255 0.99469
## PC31 PC32 PC33
## Standard deviation 0.26008 0.23617 0.22766
## Proportion of Variance 0.00205 0.00169 0.00157
## Cumulative Proportion 0.99674 0.99843 1.00000
As can be seen above, the first component alone explains 70 percent of the variance, and with only 5 of the 33 components we reach roughly 85%, which is a success. Let’s view our explanatory power on a scree plot and also list the eigenvalues:
fviz_eig(data.pca, addlabels = TRUE, ylim = c(0, 75))
get_eig(data.pca)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 23.10434548 70.0131681 70.01317
## Dim.2 1.49296567 4.5241384 74.53731
## Dim.3 1.22730132 3.7190949 78.25640
## Dim.4 1.12713508 3.4155608 81.67196
## Dim.5 1.03353577 3.1319266 84.80389
## Dim.6 0.81544366 2.4710414 87.27493
## Dim.7 0.52661856 1.5958138 88.87074
## Dim.8 0.38672563 1.1718959 90.04264
## Dim.9 0.34874071 1.0567900 91.09943
## Dim.10 0.28732902 0.8706940 91.97012
## Dim.11 0.25320004 0.7672729 92.73740
## Dim.12 0.20344215 0.6164914 93.35389
## Dim.13 0.18293627 0.5543523 93.90824
## Dim.14 0.17050801 0.5166910 94.42493
## Dim.15 0.14174905 0.4295426 94.85447
## Dim.16 0.13802371 0.4182537 95.27273
## Dim.17 0.13634271 0.4131597 95.68589
## Dim.18 0.11874927 0.3598463 96.04573
## Dim.19 0.11619028 0.3520918 96.39783
## Dim.20 0.11382587 0.3449269 96.74275
## Dim.21 0.10923216 0.3310065 97.07376
## Dim.22 0.10544191 0.3195209 97.39328
## Dim.23 0.10028425 0.3038917 97.69717
## Dim.24 0.09520722 0.2885067 97.98568
## Dim.25 0.09258663 0.2805655 98.26624
## Dim.26 0.08455972 0.2562416 98.52249
## Dim.27 0.08422788 0.2552360 98.77772
## Dim.28 0.08007743 0.2426589 99.02038
## Dim.29 0.07731162 0.2342776 99.25466
## Dim.30 0.07071785 0.2142965 99.46895
## Dim.31 0.06764076 0.2049720 99.67393
## Dim.32 0.05577729 0.1690221 99.84295
## Dim.33 0.05182704 0.1570516 100.00000
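A common heuristic for choosing the number of components, the Kaiser criterion, keeps only components whose eigenvalue exceeds 1 (the variance of a single standardized variable); here it agrees with our choice of five components (a minimal sketch):
eig <- get_eig(data.pca)
sum(eig$eigenvalue > 1)  # 5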
Now it is time to visualize our variables in two-dimensional space, where the first and second dimensions are the first and second principal components, respectively:
fviz_pca_var(data.pca, col.var="blue")
As seen above, attendance and difficulty contribute negatively to dimension 1, and all of the quality indicators contribute negatively to dimension 2.
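An optional variant (a sketch, assuming a recent factoextra version) colors each variable by its contribution to these two components, which makes the pattern easier to read:
fviz_pca_var(data.pca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))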
Since we decided to continue with the first 5 components, we recreate our dataset by multiplying the standardized data matrix (mx below) by the matrix of selected eigenvectors (ev below).
ev <- data.eigen$vectors[, 1:5]  # eigenvectors of the first five components
# type "n1" in clusterSim is standardization: (x - mean) / sd, applied by column
data.std <- data.Normalization(data, type = "n1", normalization = "column")
mx <- as.matrix(data.std)
data.pca.final <- mx %*% ev      # project the standardized data onto the eigenvectors
Let’s see what we get in the end:
head(data.pca.final,n = 10)
## [,1] [,2] [,3] [,4] [,5]
## 1 0.7422488 0.1097109 1.3141364 -1.204887 0.2701274
## 2 0.7274376 0.1471334 1.2810959 -1.411252 0.1691307
## 3 -7.5858460 -0.3641984 1.5195832 -1.093701 0.3410784
## 4 0.7274376 0.1471334 1.2810959 -1.411252 0.1691307
## 5 9.0540066 1.1133755 0.8256403 -1.913668 -0.1059166
## 6 -3.4647065 -0.5073088 1.5677471 -1.377159 0.2067092
## 7 -3.4085132 0.3276665 1.1998915 -1.334159 0.2025032
## 8 -7.5444640 0.5081995 1.1186872 -1.257066 0.2358757
## 9 -3.5503100 0.3651643 1.2743075 -1.316108 0.2049692
## 10 -3.5060886 -1.3797067 1.9686432 -1.213794 0.3119118
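As a sanity check (a minimal sketch), these scores should match the scores computed by prcomp() up to the sign of each column, since eigenvectors are only determined up to sign:
all.equal(abs(unname(data.pca.final)), abs(unname(data.pca$x[, 1:5])))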
Working with big data is fun, but when you are dealing with data without any labels or prior information at hand, you need unsupervised techniques to make things simpler yet still powerful. PCA helped us greatly reduce the dimensionality we started with, at the cost of giving up only about 15% of the variability in the data. The final dataset may be used in further clustering applications as well.
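For instance, the reduced scores can feed directly into k-means (a minimal sketch; the choice of 3 clusters is purely illustrative):
set.seed(123)                                           # for reproducibility
km <- kmeans(data.pca.final, centers = 3, nstart = 25)  # cluster the 5-component scores
table(km$cluster)                                       # cluster sizes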
Citation Request for Data Source:
If you publish material based on databases obtained from this repository, then, in your acknowledgements, please note the assistance you received by using this repository. This will help others to obtain the same data sets and replicate your experiments. We suggest the following pseudo-APA reference format for referring to this repository:
Gunduz, N. & Fokoue, E. (2013). UCI Machine Learning Repository [[Web Link]]. Irvine, CA: University of California, School of Information and Computer Science.
Here is a BibTeX citation as well:
@misc{GunduzFokoue:2013,
  author      = "Gunduz, N. and Fokoue, E.",
  year        = "2013",
  title       = "{UCI} Machine Learning Repository",
  url         = "[Web Link]",
  institution = "University of California, Irvine, School of Information and Computer Sciences"
}