The diab dataset records attributes for women who have diabetes and women who do not; the Outcome column marks which group each woman belongs to.

# loading the libraries
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.1.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.1.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggplot2)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.3
## corrplot 0.92 loaded
# loading the dataset
diab <- read.table("D:\\Desktop 3 2022\\DSC14A R\\diab.txt", header = TRUE)
# previewing the first 6 rows
head(diab)
##   Pregnancies Glucose BloodPressure SkinThickness Insulin  BMI
## 1           6     148            72            35       0 33.6
## 2           1      85            66            29       0 26.6
## 3           8     183            64             0       0 23.3
## 4           1      89            66            23      94 28.1
## 5           0     137            40            35     168 43.1
## 6           5     116            74             0       0 25.6
##   DiabetesPedigreeFunction Age Outcome
## 1                    0.627  50       1
## 2                    0.351  31       0
## 3                    0.672  32       1
## 4                    0.167  21       0
## 5                    2.288  33       1
## 6                    0.201  30       0

Note: we could drop the target variable (Outcome) to make the dataset unsupervised. However, this demonstration is purely about interpreting PCA, so we'll skip that step.

The code below can be used to drop the target (Outcome) attribute if need be.

# diab1 <- subset(diab, select = -c(Outcome))
# getting the values of a single attribute
a <- diab$Glucose
head(a)
## [1] 148  85 183  89 137 116
# attach() lets us refer to columns by name without the diab$ prefix
attach(diab)
# getting the attribute's values
a <- Glucose
head(a)
## [1] 148  85 183  89 137 116
# summary statistics using summary()
summary(diab)
##   Pregnancies        Glucose      BloodPressure    SkinThickness  
##  Min.   : 0.000   Min.   :  0.0   Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 1.000   1st Qu.: 99.0   1st Qu.: 62.00   1st Qu.: 0.00  
##  Median : 3.000   Median :117.0   Median : 72.00   Median :23.00  
##  Mean   : 3.845   Mean   :120.9   Mean   : 69.11   Mean   :20.54  
##  3rd Qu.: 6.000   3rd Qu.:140.2   3rd Qu.: 80.00   3rd Qu.:32.00  
##  Max.   :17.000   Max.   :199.0   Max.   :122.00   Max.   :99.00  
##     Insulin           BMI        DiabetesPedigreeFunction      Age       
##  Min.   :  0.0   Min.   : 0.00   Min.   :0.0780           Min.   :21.00  
##  1st Qu.:  0.0   1st Qu.:27.30   1st Qu.:0.2437           1st Qu.:24.00  
##  Median : 30.5   Median :32.00   Median :0.3725           Median :29.00  
##  Mean   : 79.8   Mean   :31.99   Mean   :0.4719           Mean   :33.24  
##  3rd Qu.:127.2   3rd Qu.:36.60   3rd Qu.:0.6262           3rd Qu.:41.00  
##  Max.   :846.0   Max.   :67.10   Max.   :2.4200           Max.   :81.00  
##     Outcome     
##  Min.   :0.000  
##  1st Qu.:0.000  
##  Median :0.000  
##  Mean   :0.349  
##  3rd Qu.:1.000  
##  Max.   :1.000
# summary statistics using describe() from the psych package
library(psych)
## Warning: package 'psych' was built under R version 4.1.3
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describe(diab)
##                          vars   n   mean     sd median trimmed   mad   min
## Pregnancies                 1 768   3.85   3.37   3.00    3.46  2.97  0.00
## Glucose                     2 768 120.89  31.97 117.00  119.38 29.65  0.00
## BloodPressure               3 768  69.11  19.36  72.00   71.36 11.86  0.00
## SkinThickness               4 768  20.54  15.95  23.00   19.94 17.79  0.00
## Insulin                     5 768  79.80 115.24  30.50   56.75 45.22  0.00
## BMI                         6 768  31.99   7.88  32.00   31.96  6.82  0.00
## DiabetesPedigreeFunction    7 768   0.47   0.33   0.37    0.42  0.25  0.08
## Age                         8 768  33.24  11.76  29.00   31.54 10.38 21.00
## Outcome                     9 768   0.35   0.48   0.00    0.31  0.00  0.00
##                             max  range  skew kurtosis   se
## Pregnancies               17.00  17.00  0.90     0.14 0.12
## Glucose                  199.00 199.00  0.17     0.62 1.15
## BloodPressure            122.00 122.00 -1.84     5.12 0.70
## SkinThickness             99.00  99.00  0.11    -0.53 0.58
## Insulin                  846.00 846.00  2.26     7.13 4.16
## BMI                       67.10  67.10 -0.43     3.24 0.28
## DiabetesPedigreeFunction   2.42   2.34  1.91     5.53 0.01
## Age                       81.00  60.00  1.13     0.62 0.42
## Outcome                    1.00   1.00  0.63    -1.60 0.02
# checking the shape (dimensions)
dim(diab)
## [1] 768   9
# running the principal component analysis
# scale. = TRUE standardises each attribute to unit variance, so the PCA is based on the correlation matrix.
# scale. = FALSE leaves the attributes unscaled, so the PCA is based on the covariance matrix.
# Use the covariance matrix (scale. = FALSE) when all attributes are measured in the same units.
# Use the correlation matrix (scale. = TRUE) when the attributes are measured in different units.
diab.pca <- prcomp(diab, scale. = TRUE)
# listing the results available for the variables after running PCA
var <- get_pca_var(diab.pca)
var
## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description                                    
## 1 "$coord"   "Coordinates for the variables"                
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"                       
## 4 "$contrib" "contributions of the variables"

Observations:

coord >> coordinates of the variables on each PC.

cor >> correlations between the variables and the PCs.

cos2 >> how well each attribute is represented by each PC.

contrib >> how much each attribute contributes to each PC (as a percentage).
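
As a rough illustration of how these four quantities relate, the sketch below recomputes them by hand from a `prcomp` fit. It uses the built-in `USArrests` data rather than `diab` (an assumption, so that the snippet runs without the data file), and the object names `fit`, `coord`, `cos2` and `contrib` are illustrative. It assumes a scaled PCA, and as far as I understand it mirrors what `get_pca_var()` reports for a `prcomp` object.

```r
# Illustrative data: built-in USArrests, not the diab dataset
fit <- prcomp(USArrests, scale. = TRUE)

# coord: loadings multiplied by the standard deviation of each PC.
# With scale. = TRUE these are also the variable-PC correlations (cor).
coord <- sweep(fit$rotation, 2, fit$sdev, "*")

# cos2: squared coordinates, i.e. quality of representation on each PC
cos2 <- coord^2

# contrib: each variable's share (in %) of a PC's total cos2
contrib <- sweep(cos2, 2, colSums(cos2), "/") * 100

round(cos2, 3)
round(contrib, 1)
```

Across all PCs, each row of `cos2` sums to 1, and each column of `contrib` sums to 100.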

# getting the eigenvalues
eig.val <- get_eigenvalue(diab.pca)
eig.val
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  2.3525016        26.138907                    26.13891
## Dim.2  1.7743120        19.714578                    45.85348
## Dim.3  1.1202251        12.446946                    58.30043
## Dim.4  0.8819549         9.799499                    68.09993
## Dim.5  0.8446234         9.384705                    77.48463
## Dim.6  0.7348682         8.165203                    85.64984
## Dim.7  0.4884234         5.426927                    91.07676
## Dim.8  0.4181811         4.646457                    95.72322
## Dim.9  0.3849102         4.276780                   100.00000

Observation:

The first five PCs account for 77.48% of the total variance in the data.
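
To show where the columns of the eigenvalue table come from, here is a minimal sketch that rebuilds them from a `prcomp` fit. It again uses the built-in `USArrests` data as a stand-in for `diab` (an assumption), and the variable names simply copy the column names of the table above.

```r
# Illustrative data: built-in USArrests, not the diab dataset
fit <- prcomp(USArrests, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the PCs
eigenvalue <- fit$sdev^2

# Share of total variance per PC, and the running total
variance.percent <- 100 * eigenvalue / sum(eigenvalue)
cumulative.variance.percent <- cumsum(variance.percent)

data.frame(eigenvalue, variance.percent, cumulative.variance.percent)

# With scale. = TRUE the eigenvalues match those of the correlation matrix
all.equal(eigenvalue, eigen(cor(USArrests))$values)
```

With scaling, the eigenvalues sum to the number of attributes. That is why the nine eigenvalues in the table above sum to 9, and why Dim.1 explains 2.3525 / 9 = 26.14% of the variance.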

# plotting the scree plot (percentage of variance explained by each PC)
fviz_eig(diab.pca, addlabels = TRUE, ylim = c(0, 30))

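Observation: The plot shows the percentage of variance explained by each PC. Depending on how much variance we want to retain, we could stop at PC3, at which point the first three PCs together explain about 58% of the variance.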
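
The choice of where to stop can also be made explicit. The sketch below shows two common rules of thumb, not the only options: a cumulative-variance threshold and the Kaiser criterion (keep PCs with eigenvalue above 1). The `USArrests` data, the 80% `target` and the variable names are all assumptions chosen for illustration.

```r
# Illustrative data: built-in USArrests, not the diab dataset
fit <- prcomp(USArrests, scale. = TRUE)
eigenvalue <- fit$sdev^2
cum.percent <- 100 * cumsum(eigenvalue) / sum(eigenvalue)

# Rule 1: smallest number of PCs reaching a chosen share of variance
target <- 80  # hypothetical threshold, in percent
n.by.variance <- which(cum.percent >= target)[1]

# Rule 2 (Kaiser criterion): keep PCs with eigenvalue above 1,
# which only makes sense for a scaled (correlation-based) PCA
n.by.kaiser <- sum(eigenvalue > 1)

c(by.variance = n.by.variance, by.kaiser = n.by.kaiser)
```

Applied to the eigenvalue table above, the Kaiser criterion keeps three PCs (eigenvalues 2.35, 1.77 and 1.12), which agrees with stopping at PC3, whereas an 80% target would need six PCs (85.65%).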

# visualising the variables on the first two PCs; this plot is difficult to interpret on its own.
fviz_pca_var(diab.pca, col.var = "blue")

# plotting the contribution of each attribute to each PC
fviz_contrib(diab.pca, choice = "var", axes = 1)

fviz_contrib(diab.pca, choice = "var", axes = 2)

fviz_contrib(diab.pca, choice = "var", axes = 3)

fviz_contrib(diab.pca, choice = "var", axes = 4)

Observation:

Dim1: Glucose, Outcome, BMI and Insulin contribute the most to PC1.

Dim2: Age, Pregnancies, SkinThickness and Insulin contribute the most to PC2.

The other dimensions can be checked in the same way.

These contributions can also be checked using a correlation plot.

# checking how well each attribute is represented by each PC, using a correlation plot of cos2
corrplot(var$cos2, is.corr = FALSE)  # is.corr = F also works

Observation:

Dim1 is a good representation of Outcome, BMI and Glucose.

Dim2 is a good representation of Age, SkinThickness and Pregnancies.

Dim3 is a good representation of BloodPressure.

Dim4 is a good representation of DiabetesPedigreeFunction.

The attribute contributions can be plotted in the same way using the code below.

# attribute contribution
corrplot(var$contrib, is.corr = FALSE)

# checking the PC values (standard deviations and loadings)
# this printout is hard to read, which is why the visual representations above are usually preferred.
diab.pca
## Standard deviations (1, .., p=9):
## [1] 1.5337867 1.3320330 1.0584069 0.9391245 0.9190340 0.8572446 0.6988730
## [8] 0.6466693 0.6204113
## 
## Rotation (n x k) = (9 x 9):
##                                 PC1         PC2        PC3         PC4
## Pregnancies              -0.2159984  0.52744611 -0.1645123  0.16088274
## Glucose                  -0.4367568  0.09563451  0.3914174 -0.32712336
## BloodPressure            -0.3004554  0.04625339 -0.6297053  0.01105718
## SkinThickness            -0.3072920 -0.44839783 -0.2943260  0.07401586
## Insulin                  -0.3363324 -0.35534569  0.1416535 -0.09763514
## BMI                      -0.3973420 -0.21031643 -0.2519303 -0.16321892
## DiabetesPedigreeFunction -0.2376345 -0.17519467  0.2853247  0.87352658
## Age                      -0.2786556  0.53320237 -0.1261179  0.17111166
## Outcome                  -0.4156528  0.15476805  0.3946151 -0.18167421
##                                  PC5          PC6         PC7         PC8
## Pregnancies              -0.21176571  0.456773101 -0.07851795 -0.54064221
## Glucose                  -0.10877916 -0.382749448  0.20799965  0.02524536
## BloodPressure             0.05937792 -0.607867100 -0.32670833 -0.16280013
## SkinThickness            -0.17180950  0.397956778 -0.28199843  0.43491377
## Insulin                  -0.65037442  0.008730189  0.08221649 -0.26883758
## BMI                       0.52745778  0.241007894  0.57252374 -0.16646311
## DiabetesPedigreeFunction  0.15655457 -0.171628373  0.05719014 -0.09316407
## Age                      -0.20428857 -0.001054913  0.29942807  0.61489048
## Outcome                   0.37793701  0.171378976 -0.58037812  0.06827401
##                                  PC9
## Pregnancies              -0.26505155
## Glucose                  -0.58107579
## BloodPressure             0.06830645
## SkinThickness            -0.39477039
## Insulin                   0.47845101
## BMI                       0.12644026
## DiabetesPedigreeFunction -0.04961786
## Age                       0.28875418
## Outcome                   0.31756221

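Observations:

PC1: Glucose, BMI, Insulin and Outcome all load in the same direction (all loadings are negative), so they tend to move together: lower glucose, BMI and insulin go with a lower Outcome. The sign of a PC is arbitrary, so only the relative signs matter.

PC2: Age and Pregnancies load positively, while SkinThickness and Insulin load negatively. As age and number of pregnancies increase, skin thickness and insulin tend to decrease, i.e. the two groups are inversely related along this component.

PC3: there is a relationship between BloodPressure, Outcome and Glucose. Glucose and Outcome load positively and BloodPressure negatively, so glucose goes up as blood pressure comes down.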
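
Reading a full loadings printout by eye is error-prone, so one option is to list only the strongest loadings per PC. The sketch below does that with a small helper, `top_loadings()`, which is hypothetical and not part of any package. It also checks two facts that matter when interpreting loadings: the scores are the standardised data projected onto the loadings, and flipping the sign of a whole PC changes nothing of substance. The built-in `USArrests` data stands in for `diab` (an assumption).

```r
# Illustrative data: built-in USArrests, not the diab dataset
fit <- prcomp(USArrests, scale. = TRUE)

# Hypothetical helper: the n variables with the largest absolute loading on a PC
top_loadings <- function(fit, pc = 1, n = 3) {
  loadings <- fit$rotation[, pc]
  loadings[order(abs(loadings), decreasing = TRUE)][seq_len(n)]
}

top_loadings(fit, pc = 1)
top_loadings(fit, pc = 2)

# Scores are the standardised data projected onto the loadings
scores <- scale(USArrests) %*% fit$rotation
all.equal(scores, fit$x, check.attributes = FALSE)

# Flipping the sign of a PC's loadings and scores together
# leaves that PC's contribution to the data unchanged
flipped <- (-scores[, 1]) %o% (-fit$rotation[, 1])
original <- scores[, 1] %o% fit$rotation[, 1]
all.equal(flipped, original)
```

Because of this sign indeterminacy, a PC whose loadings are all negative (like PC1 above) carries the same information as one whose loadings are all positive.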