The diab dataset describes women with and without diabetes through the attributes listed below; the Outcome column is the target.
#calling the libraries
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.1.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.1.3
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(ggplot2)
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.1.3
## corrplot 0.92 loaded
# loading the dataset
diab <- read.table("D:\\Desktop 3 2022\\DSC14A R\\diab.txt", header = TRUE)
# previewing the first 6 rows
head(diab)
## Pregnancies Glucose BloodPressure SkinThickness Insulin BMI
## 1 6 148 72 35 0 33.6
## 2 1 85 66 29 0 26.6
## 3 8 183 64 0 0 23.3
## 4 1 89 66 23 94 28.1
## 5 0 137 40 35 168 43.1
## 6 5 116 74 0 0 25.6
## DiabetesPedigreeFunction Age Outcome
## 1 0.627 50 1
## 2 0.351 31 0
## 3 0.672 32 1
## 4 0.167 21 0
## 5 2.288 33 1
## 6 0.201 30 0
Note: we could drop the target variable (Outcome) to make this an unsupervised dataset. However, this demonstration is purely about interpreting PCA, so we'll skip that step.
The code below can be used to drop the target (Outcome) attribute if need be.
# diab1 <- subset(diab, select = -c(Outcome))
# getting an attribute's records/values
a <- diab$Glucose
head(a)
## [1] 148 85 183 89 137 116
# using the attach() function so that columns can be referenced by name
attach(diab)
# getting the attribute's values
a <- Glucose
head(a)
## [1] 148 85 183 89 137 116
# summary statistics using summary()
summary(diab)
## Pregnancies Glucose BloodPressure SkinThickness
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## Insulin BMI DiabetesPedigreeFunction Age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## Outcome
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
# summary statistics using describe() from the psych package
library(psych)
## Warning: package 'psych' was built under R version 4.1.3
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
describe(diab)
## vars n mean sd median trimmed mad min
## Pregnancies 1 768 3.85 3.37 3.00 3.46 2.97 0.00
## Glucose 2 768 120.89 31.97 117.00 119.38 29.65 0.00
## BloodPressure 3 768 69.11 19.36 72.00 71.36 11.86 0.00
## SkinThickness 4 768 20.54 15.95 23.00 19.94 17.79 0.00
## Insulin 5 768 79.80 115.24 30.50 56.75 45.22 0.00
## BMI 6 768 31.99 7.88 32.00 31.96 6.82 0.00
## DiabetesPedigreeFunction 7 768 0.47 0.33 0.37 0.42 0.25 0.08
## Age 8 768 33.24 11.76 29.00 31.54 10.38 21.00
## Outcome 9 768 0.35 0.48 0.00 0.31 0.00 0.00
## max range skew kurtosis se
## Pregnancies 17.00 17.00 0.90 0.14 0.12
## Glucose 199.00 199.00 0.17 0.62 1.15
## BloodPressure 122.00 122.00 -1.84 5.12 0.70
## SkinThickness 99.00 99.00 0.11 -0.53 0.58
## Insulin 846.00 846.00 2.26 7.13 4.16
## BMI 67.10 67.10 -0.43 3.24 0.28
## DiabetesPedigreeFunction 2.42 2.34 1.91 5.53 0.01
## Age 81.00 60.00 1.13 0.62 0.42
## Outcome 1.00 1.00 0.63 -1.60 0.02
#checking the shape(dimension)
dim(diab)
## [1] 768 9
# running the principal component analysis
# scale. = TRUE standardizes each attribute, so the PCA is computed from the correlation matrix.
# scale. = FALSE leaves the attributes unscaled, so the PCA is computed from the covariance matrix.
# Use the covariance matrix (scale. = FALSE) when all attributes are measured in the same unit.
# Use the correlation matrix (scale. = TRUE) when the attributes are measured in different units.
diab.pca <- prcomp(diab, scale. = TRUE)
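To make the link between the scale argument and the correlation/covariance matrix concrete, here is a minimal sketch. It assumes R's built-in USArrests data as a stand-in (not the diab data), so its numbers are unrelated to the results in this walkthrough.

```r
# Stand-in data: R's built-in USArrests (an assumption; not the diab dataset)
X <- USArrests

# PCA on standardized variables (centered and scaled)
pca_scaled <- prcomp(X, scale. = TRUE)
# PCA on centered but unscaled variables
pca_unscaled <- prcomp(X, scale. = FALSE)

# Eigenvalues of the correlation and covariance matrices
eig_cor <- eigen(cor(X))$values
eig_cov <- eigen(cov(X))$values

# The PC variances (sdev^2) should match the eigenvalues in each case
print(all.equal(pca_scaled$sdev^2, eig_cor))
print(all.equal(pca_unscaled$sdev^2, eig_cov))
```

With unscaled data, attributes measured on large scales dominate the leading components, which is why scaling is the safer choice when units differ.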
# checking which results we can extract for the variables after running the PCA
var <- get_pca_var(diab.pca)
var
## Principal Component Analysis Results for variables
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for the variables"
## 2 "$cor" "Correlations between variables and dimensions"
## 3 "$cos2" "Cos2 for the variables"
## 4 "$contrib" "contributions of the variables"
Observations:
coord >> coordinates of the variables on each PC.
cor >> correlations between the variables and each PC.
cos2 >> how well each attribute is represented by each PC.
contrib >> how much each attribute/variable contributes to each PC (in percent).
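The quantities listed above can be sketched by hand in base R. This is a rough illustration for a scaled prcomp() fit, not necessarily the exact code factoextra runs, and it again assumes the built-in USArrests data as a stand-in.

```r
# Stand-in data: R's built-in USArrests (an assumption; not the diab dataset)
X <- USArrests
pca <- prcomp(X, scale. = TRUE)

# coord: loadings multiplied by the standard deviation of each PC
coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

# cos2: squared coordinates (quality of representation on each PC)
cos2 <- coord^2

# contrib: share (%) of each variable in the total cos2 of a PC
contrib <- 100 * sweep(cos2, 2, colSums(cos2), `/`)

print(round(coord, 3))
print(round(cos2, 3))
print(round(contrib, 2))
```

For scaled data the coordinates equal the correlations between the variables and the PC scores, the cos2 values of a variable sum to 1 across all PCs, and the contributions within one PC sum to 100.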
#getting the Eigenvalues
eig.val=get_eigenvalue(diab.pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.3525016 26.138907 26.13891
## Dim.2 1.7743120 19.714578 45.85348
## Dim.3 1.1202251 12.446946 58.30043
## Dim.4 0.8819549 9.799499 68.09993
## Dim.5 0.8446234 9.384705 77.48463
## Dim.6 0.7348682 8.165203 85.64984
## Dim.7 0.4884234 5.426927 91.07676
## Dim.8 0.4181811 4.646457 95.72322
## Dim.9 0.3849102 4.276780 100.00000
Observation:
The first 5 PCs account for 77.48% of the total variance in the data.
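The percentages in the table follow directly from the eigenvalues. The sketch below recomputes them from the eigenvalues printed above; the 75% target is a hypothetical threshold chosen only for illustration.

```r
# Eigenvalues copied from the get_eigenvalue() output above
eig <- c(2.3525016, 1.7743120, 1.1202251, 0.8819549, 0.8446234,
         0.7348682, 0.4884234, 0.4181811, 0.3849102)

# Share of the total variance explained by each PC, and the running total
variance_percent <- 100 * eig / sum(eig)
cumulative_percent <- cumsum(variance_percent)

# Hypothetical target: keep enough PCs to explain at least 75% of the variance
target <- 75
n_pcs <- which(cumulative_percent >= target)[1]

print(round(cumulative_percent, 2))
print(n_pcs)
```

The eigenvalues sum to 9 because the PCA was run on 9 standardized attributes, each with variance 1.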
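# plotting the percentage of variance explained by each PC (scree plot)
fviz_eig(diab.pca, addlabels = TRUE, ylim = c(0, 30))
Observation: the plot shows the percentage of variance explained by each PC. How many PCs to keep depends on how much of the variance we want to retain.
For example, we can stop at PC3, at which point about 58% of the variance is explained.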
# Visualizing the variables on the first two PCs to check their contribution. This plot is difficult to interpret.
fviz_pca_var(diab.pca, col.var = "blue")
#Plotting the contribution of attributes in each PC
fviz_contrib(diab.pca,choice = "var", axes = 1)
fviz_contrib(diab.pca, choice = "var", axes = 2)
fviz_contrib(diab.pca, choice = "var", axes = 3)
fviz_contrib(diab.pca, choice = "var", axes = 4)
Observation:
Dim1: Glucose, Outcome, BMI, and Insulin contribute the most to PC1.
Dim2: Age, Pregnancies, SkinThickness, and Insulin contribute the most to PC2.
The other dimensions can be checked in the same way.
These contributions can also be checked with a correlation plot.
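The PC1 contributions can be reproduced by hand from the PC1 loadings that appear in the rotation matrix printed by diab.pca further down. This sketch assumes that the reference line drawn by fviz_contrib() marks the expected average contribution (100 divided by the number of variables).

```r
# PC1 loadings copied from the rotation matrix printed by diab.pca
pc1 <- c(Pregnancies = -0.2159984, Glucose = -0.4367568,
         BloodPressure = -0.3004554, SkinThickness = -0.3072920,
         Insulin = -0.3363324, BMI = -0.3973420,
         DiabetesPedigreeFunction = -0.2376345, Age = -0.2786556,
         Outcome = -0.4156528)

# Contribution (%) of each variable to PC1
contrib_pc1 <- 100 * pc1^2 / sum(pc1^2)

# Reference level if all variables contributed equally
expected <- 100 / length(pc1)

# Variables contributing more than the equal-share level, largest first
above <- sort(contrib_pc1[contrib_pc1 > expected], decreasing = TRUE)
print(round(above, 2))
```

Only Glucose, Outcome, BMI, and Insulin exceed the equal-share level of about 11.1%, which agrees with the observation for Dim1.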
# checking how well each attribute is represented on each PC using a corr plot
corrplot(var$cos2, is.corr = FALSE) # is.corr = F also works
Observation:
Dim1 is a good representation of Outcome, BMI, and Glucose.
Dim2 is a good representation of Age, SkinThickness, and Pregnancies.
Dim3 is a good representation of BloodPressure.
Dim4 is a good representation of DiabetesPedigreeFunction.
A similar picture can be obtained from the contributions using the code below.
# attribute contribution
corrplot(var$contrib, is.corr = FALSE)
# checking the PC values (standard deviations and loadings)
# this output is a bit messy and confusing to interpret, so people usually prefer the visual representations already shown above.
diab.pca
## Standard deviations (1, .., p=9):
## [1] 1.5337867 1.3320330 1.0584069 0.9391245 0.9190340 0.8572446 0.6988730
## [8] 0.6466693 0.6204113
##
## Rotation (n x k) = (9 x 9):
## PC1 PC2 PC3 PC4
## Pregnancies -0.2159984 0.52744611 -0.1645123 0.16088274
## Glucose -0.4367568 0.09563451 0.3914174 -0.32712336
## BloodPressure -0.3004554 0.04625339 -0.6297053 0.01105718
## SkinThickness -0.3072920 -0.44839783 -0.2943260 0.07401586
## Insulin -0.3363324 -0.35534569 0.1416535 -0.09763514
## BMI -0.3973420 -0.21031643 -0.2519303 -0.16321892
## DiabetesPedigreeFunction -0.2376345 -0.17519467 0.2853247 0.87352658
## Age -0.2786556 0.53320237 -0.1261179 0.17111166
## Outcome -0.4156528 0.15476805 0.3946151 -0.18167421
## PC5 PC6 PC7 PC8
## Pregnancies -0.21176571 0.456773101 -0.07851795 -0.54064221
## Glucose -0.10877916 -0.382749448 0.20799965 0.02524536
## BloodPressure 0.05937792 -0.607867100 -0.32670833 -0.16280013
## SkinThickness -0.17180950 0.397956778 -0.28199843 0.43491377
## Insulin -0.65037442 0.008730189 0.08221649 -0.26883758
## BMI 0.52745778 0.241007894 0.57252374 -0.16646311
## DiabetesPedigreeFunction 0.15655457 -0.171628373 0.05719014 -0.09316407
## Age -0.20428857 -0.001054913 0.29942807 0.61489048
## Outcome 0.37793701 0.171378976 -0.58037812 0.06827401
## PC9
## Pregnancies -0.26505155
## Glucose -0.58107579
## BloodPressure 0.06830645
## SkinThickness -0.39477039
## Insulin 0.47845101
## BMI 0.12644026
## DiabetesPedigreeFunction -0.04961786
## Age 0.28875418
## Outcome 0.31756221
Observations:
PC1: Glucose, BMI, Insulin, and Outcome all load in the same direction, so a decrease in glucose, BMI, and insulin goes with a decrease in the person's Outcome.
PC2: an increase in Age and Pregnancies goes with a decrease in SkinThickness and Insulin. This means these attributes are inversely correlated along this component.
PC3: there is a relationship between BloodPressure, Outcome, and Glucose, i.e. Glucose and Outcome go up when BloodPressure comes down.
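Reading sign patterns off the rotation matrix can be sketched as follows, using the PC3 column printed above. The 0.35 cutoff for calling a loading "large" is a hypothetical choice made only for this illustration.

```r
# PC3 loadings copied from the rotation matrix above
pc3 <- c(Pregnancies = -0.1645123, Glucose = 0.3914174,
         BloodPressure = -0.6297053, SkinThickness = -0.2943260,
         Insulin = 0.1416535, BMI = -0.2519303,
         DiabetesPedigreeFunction = 0.2853247, Age = -0.1261179,
         Outcome = 0.3946151)

# Hypothetical cutoff for calling a loading "large"
cutoff <- 0.35
large <- pc3[abs(pc3) > cutoff]

# Variables with opposite signs move in opposite directions along this PC
positive <- names(large[large > 0])
negative <- names(large[large < 0])
print(positive)
print(negative)
```

The overall sign of a principal component is arbitrary, so only the relative signs within a component carry meaning: here Glucose and Outcome sit on one side and BloodPressure on the other.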