In this article, I am going to analyse the US ARREST Dataset and implementing Prinipal component Analysis to reduce the number of features in input dataset.
df <- USArrests
head(df)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
The input dataset consists of total 50 records and each record has 4 attributes.
str(df)
## 'data.frame': 50 obs. of 4 variables:
## $ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
## $ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
## $ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
## $ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
The Assault and UrbanPop are integer values. The Murder and Rape are numerical values.
summary(df)
## Murder Assault UrbanPop Rape
## Min. : 0.800 Min. : 45.0 Min. :32.00 Min. : 7.30
## 1st Qu.: 4.075 1st Qu.:109.0 1st Qu.:54.50 1st Qu.:15.07
## Median : 7.250 Median :159.0 Median :66.00 Median :20.10
## Mean : 7.788 Mean :170.8 Mean :65.54 Mean :21.23
## 3rd Qu.:11.250 3rd Qu.:249.0 3rd Qu.:77.75 3rd Qu.:26.18
## Max. :17.400 Max. :337.0 Max. :91.00 Max. :46.00
plot_hist <- function(colu , name){
hist(colu ,
main = name ,
sub = paste ( paste('mean = ' ,mean(colu) , sep =' '),
paste('SD =' , round(sd(colu),2) , sep = ' '),
sep = "\n") ,
xlab = paste(" ")
)
}
par(mfrow=c(2,2))
plot_hist(df$Murder , name = "MURDERs")
plot_hist(df$Assault, name = "Assaults")
plot_hist(df$UrbanPop, name = "UrbanPops")
plot_hist(df$Rape, name = "Rapes")
cor_mat <- cor(df)
cor_mat
## Murder Assault UrbanPop Rape
## Murder 1.00000000 0.8018733 0.06957262 0.5635788
## Assault 0.80187331 1.0000000 0.25887170 0.6652412
## UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
## Rape 0.56357883 0.6652412 0.41134124 1.0000000
corrplot(cor_mat , is.corr = FALSE )
ggpairs(df)
Murder , Assault & Rape are highly co-related to each other. Only UrbanPop is not in co-relation with other variables. Because of the highly co-related input data, we can apply Principal component analysis and reduce the number of features.
pca <- prcomp(df , scale = TRUE)
pca
## Standard deviations (1, .., p=4):
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
##
## Rotation (n x k) = (4 x 4):
## PC1 PC2 PC3 PC4
## Murder -0.5358995 0.4181809 -0.3412327 0.64922780
## Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
## Rape -0.5434321 -0.1673186 0.8177779 0.08902432
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.5749 0.9949 0.59713 0.41645
## Proportion of Variance 0.6201 0.2474 0.08914 0.04336
## Cumulative Proportion 0.6201 0.8675 0.95664 1.00000
So, By the first two Principal components explains around 87% of the total variation in input data.
plot(pca)
get_eig(pca)
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 2.4802416 62.006039 62.00604
## Dim.2 0.9897652 24.744129 86.75017
## Dim.3 0.3565632 8.914080 95.66425
## Dim.4 0.1734301 4.335752 100.00000
colSums(get_eig(pca))
## eigenvalue variance.percent
## 4.0000 100.0000
## cumulative.variance.percent
## 344.4205
fviz_eig(pca , choice = "eigenvalue" , addlabels = TRUE )
Analyzing the co-relation between all the input variables to a set of principal components. Individual score of each record ( for all 50 records ) with respect to a given PCy.
biplot(pca , xpd = TRUE)
fviz_pca_var(pca, col.var = "black")
### Acess FIT with new PCA Dimensions :-
Getting all the rows in new PCA vector space.
str(pca)
## List of 5
## $ sdev : num [1:4] 1.575 0.995 0.597 0.416
## $ rotation: num [1:4, 1:4] -0.536 -0.583 -0.278 -0.543 0.418 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
## .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
## $ center : Named num [1:4] 7.79 170.76 65.54 21.23
## ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
## $ scale : Named num [1:4] 4.36 83.34 14.47 9.37
## ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
## $ x : num [1:50, 1:4] -0.976 -1.931 -1.745 0.14 -2.499 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
## .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
## - attr(*, "class")= chr "prcomp"
The Loadings of all the four Principal components are as follows.
pca$rotation
## PC1 PC2 PC3 PC4
## Murder -0.5358995 0.4181809 -0.3412327 0.64922780
## Assault -0.5831836 0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158 0.13387773
## Rape -0.5434321 -0.1673186 0.8177779 0.08902432
The data in new vector space is as follows
head(pca$x)
## PC1 PC2 PC3 PC4
## Alabama -0.9756604 1.1220012 -0.43980366 0.154696581
## Alaska -1.9305379 1.0624269 2.01950027 -0.434175454
## Arizona -1.7454429 -0.7384595 0.05423025 -0.826264240
## Arkansas 0.1399989 1.1085423 0.11342217 -0.180973554
## California -2.4986128 -1.5274267 0.59254100 -0.338559240
## Colorado -1.4993407 -0.9776297 1.08400162 0.001450164
row_sum <- rowSums(df)
first_pca <- pca$x[,1]
plot(row_sum , first_pca)
### Getting the cos2 values of all variables :-
var <- get_pca_var(pca)
var
## Principal Component Analysis Results for variables
## ===================================================
## Name Description
## 1 "$coord" "Coordinates for the variables"
## 2 "$cor" "Correlations between variables and dimensions"
## 3 "$cos2" "Cos2 for the variables"
## 4 "$contrib" "contributions of the variables"
var$cos2
## Dim.1 Dim.2 Dim.3 Dim.4
## Murder 0.7122962 0.1730854 0.04151814 0.073100217
## Assault 0.8435380 0.0349769 0.02563817 0.095846950
## UrbanPop 0.1919463 0.7539938 0.05095143 0.003108430
## Rape 0.7324611 0.0277090 0.23845544 0.001374491
fviz_cos2(pca, choice = "var" , axes = 1:2)
fviz_pca_var(pca,
col.var = "cos2",
repel = TRUE
)
As all the variables are near to circumference of circle, it means all variables are important in terms of dimensioanlity reduction.
Finally the feature reduced data frame is
final_df <- pca$x[,1:2]
head(final_df)
## PC1 PC2
## Alabama -0.9756604 1.1220012
## Alaska -1.9305379 1.0624269
## Arizona -1.7454429 -0.7384595
## Arkansas 0.1399989 1.1085423
## California -2.4986128 -1.5274267
## Colorado -1.4993407 -0.9776297
plot(final_df)