Principal Component Analysis (PCA) is a mathematical technique for explaining the information in a multivariate dataset with fewer variables and minimal loss of information. In other words, PCA is a transformation that reduces a dataset containing many interrelated variables to a smaller one while preserving as much of the original information as possible. The new variables obtained after the transformation are called the principal components of the original variables. The first principal component is the one that captures the largest variance, and the remaining components are ordered by decreasing variance.
Among the main advantages of PCA are its low sensitivity to noise, its reduced memory and storage requirements, and the efficiency of working in low-dimensional spaces. To summarize, PCA has three main objectives: to reduce the number of variables, to remove redundancy among correlated variables, and to make the structure underlying the data easier to interpret.
PCA rests on the spectral properties of the covariance or correlation matrix of the variables in the dataset. These matrices are symmetric and positive semi-definite, and their eigenvalues equal the variances of the principal components. These properties are what make PCA such an effective technique.
In other words, PCA amounts to finding the eigenvalues and eigenvectors of a dataset's covariance or correlation matrix. Before statistical software, this was very challenging and laborious; R packages have made the technique both easy to apply and highly understandable.
PCA generally consists of five basic steps: (1) standardize the data, (2) compute the covariance (or correlation) matrix, (3) compute its eigenvalues and eigenvectors, (4) select the components that retain the desired amount of variance, and (5) project the data onto the selected eigenvectors. We will follow exactly these steps below.
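Before turning to the real dataset, here is a minimal sketch of these five steps on R's built-in mtcars data (the 85% variance threshold is purely illustrative):
X <- scale(mtcars, center = TRUE, scale = TRUE)  # 1. standardize the data
S <- cov(X)                                      # 2. covariance matrix
e <- eigen(S)                                    # 3. eigenvalues and eigenvectors
pve <- e$values / sum(e$values)                  #    eigenvalues are the component variances
k <- which(cumsum(pve) >= 0.85)[1]               # 4. keep enough components for ~85% of variance
scores <- X %*% e$vectors[, 1:k]                 # 5. project the data onto the eigenvectors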
The dataset used in this analysis is the Turkiye Student Evaluation dataset from the UCI Machine Learning Repository, donated by Necla Gunduz (Department of Statistics, Faculty of Science, Gazi University, Ankara, Turkey) and Ernest Fokoue (Center for Quality and Applied Statistics, Rochester Institute of Technology, Rochester, NY, USA). Its attributes are as follows:
instr: Instructor’s identifier; values taken from {1,2,3}
class: Course code (descriptor); values taken from {1-13}
nb.repeat: Number of times the student is taking this course; values taken from {0,1,2,3,…}
attendance: Code of the level of attendance; values from {0, 1, 2, 3, 4}
difficulty: Level of difficulty of the course as perceived by the student; values taken from {1,2,3,4,5}
Q1: The semester course content, teaching method and evaluation system were provided at the start.
Q2: The course aims and objectives were clearly stated at the beginning of the period.
Q3: The course was worth the amount of credit assigned to it.
Q4: The course was taught according to the syllabus announced on the first day of class.
Q5: The class discussions, homework assignments, applications and studies were satisfactory.
Q6: The textbook and other course resources were sufficient and up to date.
Q7: The course allowed field work, applications, laboratory, discussion and other studies.
Q8: The quizzes, assignments, projects and exams contributed to helping the learning.
Q9: I greatly enjoyed the class and was eager to actively participate during the lectures.
Q10: My initial expectations about the course were met at the end of the period or year.
Q11: The course was relevant and beneficial to my professional development.
Q12: The course helped me look at life and the world with a new perspective.
Q13: The Instructor’s knowledge was relevant and up to date.
Q14: The Instructor came prepared for classes.
Q15: The Instructor taught in accordance with the announced lesson plan.
Q16: The Instructor was committed to the course and was understandable.
Q17: The Instructor arrived on time for classes.
Q18: The Instructor has a smooth and easy to follow delivery/speech.
Q19: The Instructor made effective use of class hours.
Q20: The Instructor explained the course and was eager to be helpful to students.
Q21: The Instructor demonstrated a positive approach to students.
Q22: The Instructor was open and respectful of the views of students about the course.
Q23: The Instructor encouraged participation in the course.
Q24: The Instructor gave relevant homework assignments/projects, and helped/guided students.
Q25: The Instructor responded to questions about the course inside and outside of the course.
Q26: The Instructor’s evaluation system (midterm and final questions, projects, assignments, etc.) effectively measured the course objectives.
Q27: The Instructor provided solutions to exams and discussed them with students.
Q28: The Instructor treated all students in a right and objective manner.
Q1-Q28 are all Likert-type, meaning that their values are taken from {1,2,3,4,5}.
Let’s load the relevant packages, read in the data, and look at its head.
library(clusterSim)   # for data.Normalization()
library(FactoMineR)   # PCA utilities
library(factoextra)   # visualization helpers for PCA
setwd("C:\\Users\\ozgrp\\Desktop\\UW\\USL\\Project02")  # adjust to your own path
data <- read.csv("turkiye-student-evaluation_R_Specific.csv", sep = ",", dec = ".", header = TRUE)
head(data)
## instr class nb.repeat attendance difficulty Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9
## 1 1 2 1 0 4 3 3 3 3 3 3 3 3 3
## 2 1 2 1 1 3 3 3 3 3 3 3 3 3 3
## 3 1 2 1 2 4 5 5 5 5 5 5 5 5 5
## 4 1 2 1 1 3 3 3 3 3 3 3 3 3 3
## 5 1 2 1 0 1 1 1 1 1 1 1 1 1 1
## 6 1 2 1 3 3 4 4 4 4 4 4 4 4 4
## Q10 Q11 Q12 Q13 Q14 Q15 Q16 Q17 Q18 Q19 Q20 Q21 Q22 Q23 Q24 Q25 Q26 Q27
## 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 3 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
## 4 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 6 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## Q28
## 1 3
## 2 3
## 3 5
## 4 3
## 5 1
## 6 4
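It is also worth noting the overall size of the data (the UCI version of this dataset contains 5,820 evaluations of 33 variables):
dim(data)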
Check whether the columns are numeric, since PCA requires all of its inputs to be numeric:
sapply(data, is.numeric)
## instr class nb.repeat attendance difficulty Q1
## TRUE TRUE TRUE TRUE TRUE TRUE
## Q2 Q3 Q4 Q5 Q6 Q7
## TRUE TRUE TRUE TRUE TRUE TRUE
## Q8 Q9 Q10 Q11 Q12 Q13
## TRUE TRUE TRUE TRUE TRUE TRUE
## Q14 Q15 Q16 Q17 Q18 Q19
## TRUE TRUE TRUE TRUE TRUE TRUE
## Q20 Q21 Q22 Q23 Q24 Q25
## TRUE TRUE TRUE TRUE TRUE TRUE
## Q26 Q27 Q28
## TRUE TRUE TRUE
As can be seen above, all of the variables are numeric. Now let’s check whether there are any missing values:
sort(colSums(is.na(data)))
## instr class nb.repeat attendance difficulty Q1
## 0 0 0 0 0 0
## Q2 Q3 Q4 Q5 Q6 Q7
## 0 0 0 0 0 0
## Q8 Q9 Q10 Q11 Q12 Q13
## 0 0 0 0 0 0
## Q14 Q15 Q16 Q17 Q18 Q19
## 0 0 0 0 0 0
## Q20 Q21 Q22 Q23 Q24 Q25
## 0 0 0 0 0 0
## Q26 Q27 Q28
## 0 0 0
The output above shows that none of our variables has any missing values, which is perfect.
Let’s scale our data (in other words, standardize it; later we will also let the PCA function do this for us on the unscaled data):
scaled.data <- scale(data, center = TRUE, scale = TRUE)
head(scaled.data)
## instr class nb.repeat attendance difficulty Q1
## 1 -2.067673 -1.430596 -0.4021395 -1.1360201 0.901784 0.05227373
## 2 -2.067673 -1.430596 -0.4021395 -0.4580425 0.160487 0.05227373
## 3 -2.067673 -1.430596 -0.4021395 0.2199350 0.901784 1.54361247
## 4 -2.067673 -1.430596 -0.4021395 -0.4580425 0.160487 0.05227373
## 5 -2.067673 -1.430596 -0.4021395 -1.1360201 -1.322107 -1.43906502
## 6 -2.067673 -1.430596 -0.4021395 0.8979125 0.160487 0.79794310
## Q2 Q3 Q4 Q5 Q6 Q7
## 1 -0.0574854 -0.1425485 -0.06420255 -0.08275434 -0.08384423 -0.05185145
## 2 -0.0574854 -0.1425485 -0.06420255 -0.08275434 -0.08384423 -0.05185145
## 3 1.4986310 1.4528982 1.49270921 1.48098025 1.47767069 1.51175190
## 4 -0.0574854 -0.1425485 -0.06420255 -0.08275434 -0.08384423 -0.05185145
## 5 -1.6136018 -1.7379952 -1.62111430 -1.64648893 -1.64535914 -1.61545480
## 6 0.7205728 0.6551749 0.71425333 0.69911295 0.69691323 0.72995022
## Q8 Q9 Q10 Q11 Q12 Q13
## 1 -0.03266461 -0.1308026 -0.07113699 -0.1419197 -0.02723829 -0.1920451
## 2 -0.03266461 -0.1308026 -0.07113699 -0.1419197 -0.02723829 -0.1920451
## 3 1.52559804 1.4453279 1.49711027 1.4019541 1.50442214 1.3899821
## 4 -0.03266461 -0.1308026 -0.07113699 -0.1419197 -0.02723829 -0.1920451
## 5 -1.59092727 -1.7069331 -1.63938425 -1.6857934 -1.55889873 -1.7740722
## 6 0.74646671 0.6572627 0.71298664 0.6300172 0.73859193 0.5989685
## Q14 Q15 Q16 Q17 Q18 Q19
## 1 -0.2317188 -0.2292556 -0.1316659 -0.3143545 -0.1738622 -0.2063033
## 2 -0.2317188 -0.2292556 -0.1316659 -0.3143545 -0.1738622 -0.2063033
## 3 1.3614334 1.3667584 1.4211113 1.2635180 1.3888836 1.3704338
## 4 -0.2317188 -0.2292556 -0.1316659 -0.3143545 -0.1738622 -0.2063033
## 5 -1.8248709 -1.8252696 -1.6844431 -1.8922270 -1.7366079 -1.7830404
## 6 0.5648573 0.5687514 0.6447227 0.4745818 0.6075107 0.5820652
## Q20 Q21 Q22 Q23 Q24 Q25
## 1 -0.2235153 -0.2420430 -0.2503440 -0.1586450 -0.1307605 -0.2485855
## 2 -0.2235153 -0.2420430 -0.2503440 -0.1586450 -0.1307605 -0.2485855
## 3 1.3428414 1.3327923 1.3264978 1.4129535 1.4367498 1.3421429
## 4 -0.2235153 -0.2420430 -0.2503440 -0.1586450 -0.1307605 -0.2485855
## 5 -1.7898720 -1.8168782 -1.8271858 -1.7302436 -1.6982708 -1.8393139
## 6 0.5596630 0.5453747 0.5380769 0.6271542 0.6529946 0.5467787
## Q26 Q27 Q28
## 1 -0.1748373 -0.1198347 -0.2409272
## 2 -0.1748373 -0.1198347 -0.2409272
## 3 1.3991044 1.4283069 1.3231510
## 4 -0.1748373 -0.1198347 -0.2409272
## 5 -1.7487791 -1.6679763 -1.8050053
## 6 0.6121335 0.6542361 0.5411119
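As a quick sanity check on the standardization (a minimal sketch), every column of the scaled data should now have mean 0 and standard deviation 1:
range(colMeans(scaled.data))      # all means are effectively zero
range(apply(scaled.data, 2, sd))  # all standard deviations are 1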
After this step we compute the covariance matrix of the scaled data and obtain its eigenvalues. Note that, because the data are standardized, this covariance matrix is also the correlation matrix of the original variables.
cov.data <- cov(scaled.data)
head(cov.data)
## instr class nb.repeat attendance difficulty
## instr 1.00000000 -0.03987106 0.11276336 -0.10723137 -0.05836793
## class -0.03987106 1.00000000 0.09152659 -0.01631225 -0.04489877
## nb.repeat 0.11276336 0.09152659 1.00000000 -0.07808589 0.11049303
## attendance -0.10723137 -0.01631225 -0.07808589 1.00000000 0.43679161
## difficulty -0.05836793 -0.04489877 0.11049303 0.43679161 1.00000000
## Q1 -0.12893147 -0.02954238 -0.02470843 0.10526616 0.05211965
## Q1 Q2 Q3 Q4 Q5
## instr -0.12893147 -0.12706996 -0.10894863 -0.11322187 -0.13560563
## class -0.02954238 -0.03327379 -0.02153407 -0.03016465 -0.03658396
## nb.repeat -0.02470843 -0.04170674 -0.03570381 -0.03361266 -0.03177018
## attendance 0.10526616 0.14925830 0.17839349 0.13810790 0.14974653
## difficulty 0.05211965 0.06503112 0.07145738 0.06217107 0.06418094
## Q1 1.00000000 0.86613798 0.76738148 0.84977255 0.80475673
## Q6 Q7 Q8 Q9 Q10
## instr -0.09831929 -0.12443411 -0.15010866 -0.11178933 -0.13004590
## class -0.04597201 -0.04676086 -0.03929564 -0.01842956 -0.03393786
## nb.repeat -0.02691810 -0.03094972 -0.02420414 -0.03734668 -0.02861191
## attendance 0.14370478 0.13747239 0.13282121 0.18229316 0.14693113
## difficulty 0.05274597 0.05005350 0.05169472 0.05502915 0.04288434
## Q1 0.76956071 0.79395720 0.79334652 0.73474404 0.79661214
## Q11 Q12 Q13 Q14 Q15
## instr -0.12953460 -0.12703632 -0.11145992 -0.10232819 -0.09980218
## class -0.02113611 -0.04486154 -0.04697085 -0.04389983 -0.03904068
## nb.repeat -0.03365828 -0.01589967 -0.04379275 -0.05154347 -0.03965889
## attendance 0.17889853 0.12957197 0.18647788 0.20225192 0.19584539
## difficulty 0.05896892 0.03637216 0.07949756 0.09249966 0.08945881
## Q1 0.71607569 0.76119667 0.71786244 0.69582240 0.69641049
## Q16 Q17 Q18 Q19 Q20
## instr -0.12706856 -0.08058263 -0.14854652 -0.11248496 -0.08683139
## class -0.03663522 -0.02895678 -0.02187378 -0.01872659 -0.03123429
## nb.repeat -0.02563978 -0.04952619 -0.03739157 -0.04556763 -0.04262435
## attendance 0.15307103 0.23147972 0.17917260 0.19069353 0.19516552
## difficulty 0.04971816 0.12252026 0.06852016 0.08001655 0.09105086
## Q1 0.73693754 0.61220212 0.70568166 0.69936994 0.68529916
## Q21 Q22 Q23 Q24 Q25
## instr -0.07810139 -0.08058617 -0.11888900 -0.12887991 -0.08356328
## class -0.02275796 -0.01655280 -0.02598300 -0.03671183 -0.02781614
## nb.repeat -0.04626193 -0.04546352 -0.04123302 -0.03361830 -0.04991824
## attendance 0.20480200 0.20773969 0.17781379 0.16354632 0.20443462
## difficulty 0.09562751 0.09954325 0.07531710 0.07260860 0.09968242
## Q1 0.67376961 0.67070160 0.72876989 0.73216686 0.67212031
## Q26 Q27 Q28
## instr -0.10349866 -0.10766351 -0.08167233
## class -0.02949070 -0.02257620 -0.03736421
## nb.repeat -0.03551841 -0.03245673 -0.04489990
## attendance 0.17269474 0.14468671 0.20015000
## difficulty 0.06445622 0.05936982 0.09087630
## Q1 0.69892434 0.70963896 0.65887343
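We can verify this equivalence directly (a minimal sketch):
all.equal(cov(scaled.data), cor(data), check.attributes = FALSE)
Now we extract the eigenvalues and eigenvectors of the covariance matrix: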
data.eigen <- eigen(cov.data)
head(data.eigen, n=1)
## $values
## [1] 23.10434548 1.49296567 1.22730132 1.12713508 1.03353577
## [6] 0.81544366 0.52661856 0.38672563 0.34874071 0.28732902
## [11] 0.25320004 0.20344215 0.18293627 0.17050801 0.14174905
## [16] 0.13802371 0.13634271 0.11874927 0.11619028 0.11382587
## [21] 0.10923216 0.10544191 0.10028425 0.09520722 0.09258663
## [26] 0.08455972 0.08422788 0.08007743 0.07731162 0.07071785
## [31] 0.06764076 0.05577729 0.05182704
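The proportion of variance explained by each component is its eigenvalue divided by the sum of all eigenvalues (a minimal sketch; the values match the prcomp() summary below):
pve <- data.eigen$values / sum(data.eigen$values)
round(pve[1], 4)          # first component explains ~70% of the variance
round(cumsum(pve)[5], 4)  # first five components: ~85%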
Once we have the covariance matrix we can compute the PCA itself. Although there are many functions for computing principal components, we will use prcomp() here, which is based on the singular value decomposition. The scale. parameter is intentionally set to TRUE to emphasize that it is not necessary to scale the data outside of the function.
data.pca <- prcomp(data, center = TRUE, scale. = TRUE)
summary(data.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 4.8067 1.22187 1.10784 1.06167 1.01663 0.90302
## Proportion of Variance 0.7001 0.04524 0.03719 0.03416 0.03132 0.02471
## Cumulative Proportion 0.7001 0.74537 0.78256 0.81672 0.84804 0.87275
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.72568 0.62187 0.59054 0.53603 0.50319 0.45105
## Proportion of Variance 0.01596 0.01172 0.01057 0.00871 0.00767 0.00616
## Cumulative Proportion 0.88871 0.90043 0.91099 0.91970 0.92737 0.93354
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 0.42771 0.41293 0.3765 0.37152 0.36925 0.3446
## Proportion of Variance 0.00554 0.00517 0.0043 0.00418 0.00413 0.0036
## Cumulative Proportion 0.93908 0.94425 0.9485 0.95273 0.95686 0.9605
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 0.34087 0.33738 0.33050 0.3247 0.31668 0.30856
## Proportion of Variance 0.00352 0.00345 0.00331 0.0032 0.00304 0.00289
## Cumulative Proportion 0.96398 0.96743 0.97074 0.9739 0.97697 0.97986
## PC25 PC26 PC27 PC28 PC29 PC30
## Standard deviation 0.30428 0.29079 0.29022 0.28298 0.27805 0.26593
## Proportion of Variance 0.00281 0.00256 0.00255 0.00243 0.00234 0.00214
## Cumulative Proportion 0.98266 0.98522 0.98778 0.99020 0.99255 0.99469
## PC31 PC32 PC33
## Standard deviation 0.26008 0.23617 0.22766
## Proportion of Variance 0.00205 0.00169 0.00157
## Cumulative Proportion 0.99674 0.99843 1.00000
As can be seen above, the first component alone explains 70 percent of the variance, and with only 5 of the 33 components we reach roughly 85%, which is a success. Let’s view our explanatory power on a scree plot and also list the eigenvalues:
fviz_eig(data.pca, addlabels = TRUE, ylim = c(0, 75))
get_eig(data.pca)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 23.10434548 70.0131681 70.01317
## Dim.2 1.49296567 4.5241384 74.53731
## Dim.3 1.22730132 3.7190949 78.25640
## Dim.4 1.12713508 3.4155608 81.67196
## Dim.5 1.03353577 3.1319266 84.80389
## Dim.6 0.81544366 2.4710414 87.27493
## Dim.7 0.52661856 1.5958138 88.87074
## Dim.8 0.38672563 1.1718959 90.04264
## Dim.9 0.34874071 1.0567900 91.09943
## Dim.10 0.28732902 0.8706940 91.97012
## Dim.11 0.25320004 0.7672729 92.73740
## Dim.12 0.20344215 0.6164914 93.35389
## Dim.13 0.18293627 0.5543523 93.90824
## Dim.14 0.17050801 0.5166910 94.42493
## Dim.15 0.14174905 0.4295426 94.85447
## Dim.16 0.13802371 0.4182537 95.27273
## Dim.17 0.13634271 0.4131597 95.68589
## Dim.18 0.11874927 0.3598463 96.04573
## Dim.19 0.11619028 0.3520918 96.39783
## Dim.20 0.11382587 0.3449269 96.74275
## Dim.21 0.10923216 0.3310065 97.07376
## Dim.22 0.10544191 0.3195209 97.39328
## Dim.23 0.10028425 0.3038917 97.69717
## Dim.24 0.09520722 0.2885067 97.98568
## Dim.25 0.09258663 0.2805655 98.26624
## Dim.26 0.08455972 0.2562416 98.52249
## Dim.27 0.08422788 0.2552360 98.77772
## Dim.28 0.08007743 0.2426589 99.02038
## Dim.29 0.07731162 0.2342776 99.25466
## Dim.30 0.07071785 0.2142965 99.46895
## Dim.31 0.06764076 0.2049720 99.67393
## Dim.32 0.05577729 0.1690221 99.84295
## Dim.33 0.05182704 0.1570516 100.00000
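A common heuristic for choosing the number of components, the Kaiser criterion, keeps only components whose eigenvalue exceeds 1 (the variance of a single standardized variable); here it agrees with our choice of five components (a minimal sketch):
eig <- get_eig(data.pca)
sum(eig$eigenvalue > 1)  # 5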
Now it is time to visualize our variables in two-dimensional space, where the first and second dimensions are the first and second principal components, respectively:
fviz_pca_var(data.pca, col.var="blue")
As seen above, attendance and difficulty contribute negatively to dimension 1, and all of the quality indicators contribute negatively to dimension 2.
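An optional variant (a sketch, assuming a recent factoextra version) colors each variable by its contribution to these two components, which makes the pattern easier to read:
fviz_pca_var(data.pca, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"))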
Since we decided to continue with the first 5 components, we recreate our dataset by multiplying the standardized data matrix (mx below) by the matrix of selected eigenvectors (ev below).
ev <- data.eigen$vectors[, 1:5]  # eigenvectors of the first five components
# type "n1" in clusterSim is standardization: (x - mean) / sd, applied by column
data.std <- data.Normalization(data, type = "n1", normalization = "column")
mx <- as.matrix(data.std)
data.pca.final <- mx %*% ev      # project the standardized data onto the eigenvectors
Let’s see what we get in the end:
head(data.pca.final,n = 10)
## [,1] [,2] [,3] [,4] [,5]
## 1 0.7422488 0.1097109 1.3141364 -1.204887 0.2701274
## 2 0.7274376 0.1471334 1.2810959 -1.411252 0.1691307
## 3 -7.5858460 -0.3641984 1.5195832 -1.093701 0.3410784
## 4 0.7274376 0.1471334 1.2810959 -1.411252 0.1691307
## 5 9.0540066 1.1133755 0.8256403 -1.913668 -0.1059166
## 6 -3.4647065 -0.5073088 1.5677471 -1.377159 0.2067092
## 7 -3.4085132 0.3276665 1.1998915 -1.334159 0.2025032
## 8 -7.5444640 0.5081995 1.1186872 -1.257066 0.2358757
## 9 -3.5503100 0.3651643 1.2743075 -1.316108 0.2049692
## 10 -3.5060886 -1.3797067 1.9686432 -1.213794 0.3119118
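As a sanity check (a minimal sketch), these scores should match the scores computed by prcomp() up to the sign of each column, since eigenvectors are only determined up to sign:
all.equal(abs(unname(data.pca.final)), abs(unname(data.pca$x[, 1:5])))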
Working with big data is fun, but when you are dealing with data without any labels or prior information at hand, you need unsupervised techniques to make things simpler yet still powerful. PCA helped us greatly reduce the dimensionality we started with, at the cost of giving up only about 15% of the variability in the data. The final dataset may be used in further clustering applications as well.
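For instance, the reduced scores can feed directly into k-means (a minimal sketch; the choice of 3 clusters is purely illustrative):
set.seed(123)                                           # for reproducibility
km <- kmeans(data.pca.final, centers = 3, nstart = 25)  # cluster the 5-component scores
table(km$cluster)                                       # cluster sizes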
Citation Request for Data Source:
If you publish material based on databases obtained from this repository, then, in your acknowledgements, please note the assistance you received by using this repository. This will help others to obtain the same data sets and replicate your experiments. We suggest the following pseudo-APA reference format for referring to this repository:
Gunduz, N. & Fokoue, E. (2013). UCI Machine Learning Repository [[Web Link]]. Irvine, CA: University of California, School of Information and Computer Science.
Here is a BibTeX citation as well:
@misc{GunduzFokoue:2013,
  author      = "Gunduz, N. and Fokoue, E.",
  year        = "2013",
  title       = "{UCI} Machine Learning Repository",
  url         = "[Web Link]",
  institution = "University of California, Irvine, School of Information and Computer Sciences"
}