INTRODUCTION :-

In this article, I analyse the USArrests dataset and apply Principal Component Analysis (PCA) to reduce the number of features in the input data.

EDA of the input data :-

df <- USArrests
head(df)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7

The input dataset consists of 50 records in total, and each record has 4 attributes.

str(df)
## 'data.frame':    50 obs. of  4 variables:
##  $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
##  $ Assault : int  236 263 294 190 276 204 110 238 335 211 ...
##  $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
##  $ Rape    : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...

Assault and UrbanPop are stored as integers, while Murder and Rape are numeric (floating-point) values.

summary(df)
##      Murder          Assault         UrbanPop          Rape      
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00
# Helper: histogram of one column, with its mean and SD shown in the subtitle
plot_hist <- function(colu, name) {
  hist(colu,
       main = name,
       sub  = paste(paste("mean =", round(mean(colu), 2)),
                    paste("SD =", round(sd(colu), 2)),
                    sep = "\n"),
       xlab = "")
}

# Plot all four histograms in a 2 x 2 grid
par(mfrow = c(2, 2))
plot_hist(df$Murder,   name = "Murder")
plot_hist(df$Assault,  name = "Assault")
plot_hist(df$UrbanPop, name = "UrbanPop")
plot_hist(df$Rape,     name = "Rape")

CORRELATION BETWEEN THE VARIABLES :-

cor_mat <- cor(df)
cor_mat
##              Murder   Assault   UrbanPop      Rape
## Murder   1.00000000 0.8018733 0.06957262 0.5635788
## Assault  0.80187331 1.0000000 0.25887170 0.6652412
## UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
## Rape     0.56357883 0.6652412 0.41134124 1.0000000
library(corrplot)
corrplot(cor_mat)   # cor_mat is a correlation matrix, so the default is.corr = TRUE applies

library(GGally)
ggpairs(df)

Murder, Assault, and Rape are highly correlated with one another; only UrbanPop is weakly correlated with the other variables. Because the input variables are highly correlated, we can apply Principal Component Analysis to reduce the number of features.
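
Since PCA on standardized variables is equivalent to an eigendecomposition of this correlation matrix, the relationship can be sketched directly (a minimal illustration; the actual analysis below uses prcomp):

# Sketch: the eigenvalues of the correlation matrix are the variances of the
# principal components, and the eigenvectors are the loadings (up to sign)
eig <- eigen(cor_mat)
round(eig$values, 4)    # component variances (compare with pca$sdev^2 below)
round(eig$vectors, 4)   # loadings; columns may differ in sign from prcomp's rotation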

Fitting PCA to the input data (number of input features = 4)

pca <- prcomp(df, scale. = TRUE)   # standardize the variables before extracting components
pca
## Standard deviations (1, .., p=4):
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
## 
## Rotation (n x k) = (4 x 4):
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
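
A note on scale. = TRUE: the raw variables sit on very different scales (Assault runs into the hundreds while Murder stays below 18), so without standardization Assault would dominate the components. A quick check (sketch):

# Raw variances differ by orders of magnitude, which is why the variables
# are standardized before PCA
round(apply(df, 2, var), 1)
round(pca$scale, 2)   # the standard deviations used for scaling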

Getting the summary of the PCA and identifying the proportion of variance explained by each component

summary(pca)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.5749 0.9949 0.59713 0.41645
## Proportion of Variance 0.6201 0.2474 0.08914 0.04336
## Cumulative Proportion  0.6201 0.8675 0.95664 1.00000

The first two principal components together explain around 87% of the total variation in the input data.
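
These proportions can be recomputed directly from the component standard deviations (a quick sketch):

# Proportion of variance = sdev^2 / total variance (total = 4 standardized variables)
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
round(prop_var, 4)
round(cumsum(prop_var), 4)   # cumulative proportion, ~0.87 after two components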

Plotting the PCA (variances of the components)

plot(pca)

Analyzing in depth using the factoextra package :-

library(factoextra)
get_eig(pca)
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  2.4802416        62.006039                    62.00604
## Dim.2  0.9897652        24.744129                    86.75017
## Dim.3  0.3565632         8.914080                    95.66425
## Dim.4  0.1734301         4.335752                   100.00000
colSums(get_eig(pca))
##                  eigenvalue            variance.percent 
##                      4.0000                    100.0000 
## cumulative.variance.percent 
##                    344.4205
fviz_eig(pca , choice = "eigenvalue" , addlabels = TRUE )
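
Note that the eigenvalues reported by get_eig() are just the squared standard deviations from prcomp(), and they sum to 4, the number of standardized input variables (which is what the colSums() check above reflects in its first two columns). A quick verification:

# Eigenvalues are the squared component standard deviations
round(pca$sdev^2, 4)   # matches the eigenvalue column above
sum(pca$sdev^2)        # equals the number of scaled variables (4)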

BIPLOT :-

The biplot shows the correlation of each input variable with the principal components, together with the score of every individual record (all 50 states) on those components.

biplot(pca , xpd = TRUE)

fviz_pca_var(pca, col.var = "black")
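
factoextra also offers fviz_pca_biplot(), which overlays the variable arrows on the individual (state) scores in a single plot; an optional alternative to the base biplot() above:

fviz_pca_biplot(pca,
                repel   = TRUE,      # avoid overlapping state labels
                col.var = "red",     # variable arrows
                col.ind = "grey40")  # individual points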

### Accessing the fit in the new PCA dimensions :-

Getting all the rows in the new PCA vector space.

str(pca)
## List of 5
##  $ sdev    : num [1:4] 1.575 0.995 0.597 0.416
##  $ rotation: num [1:4, 1:4] -0.536 -0.583 -0.278 -0.543 0.418 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
##   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
##  $ center  : Named num [1:4] 7.79 170.76 65.54 21.23
##   ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
##  $ scale   : Named num [1:4] 4.36 83.34 14.47 9.37
##   ..- attr(*, "names")= chr [1:4] "Murder" "Assault" "UrbanPop" "Rape"
##  $ x       : num [1:50, 1:4] -0.976 -1.931 -1.745 0.14 -2.499 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:50] "Alabama" "Alaska" "Arizona" "Arkansas" ...
##   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
##  - attr(*, "class")= chr "prcomp"

The loadings of all four principal components are as follows.

pca$rotation
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
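
These loading vectors are orthonormal, which can be verified directly (a quick sketch):

# The rotation matrix is orthonormal: crossprod gives the 4 x 4 identity matrix
round(crossprod(pca$rotation), 10)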

The data in the new vector space are as follows:

head(pca$x)
##                   PC1        PC2         PC3          PC4
## Alabama    -0.9756604  1.1220012 -0.43980366  0.154696581
## Alaska     -1.9305379  1.0624269  2.01950027 -0.434175454
## Arizona    -1.7454429 -0.7384595  0.05423025 -0.826264240
## Arkansas    0.1399989  1.1085423  0.11342217 -0.180973554
## California -2.4986128 -1.5274267  0.59254100 -0.338559240
## Colorado   -1.4993407 -0.9776297  1.08400162  0.001450164
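
The scores in pca$x are simply the standardized data projected onto the loadings; a minimal sketch of that projection:

# Manual projection: standardize the data, then multiply by the loading matrix
scores_manual <- scale(df, center = pca$center, scale = pca$scale) %*% pca$rotation
all.equal(scores_manual, pca$x)   # TRUE
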
# Quick sanity check: PC1 scores plotted against the raw (unscaled) row sums
row_sum   <- rowSums(df)
first_pca <- pca$x[, 1]
plot(row_sum, first_pca)

### Getting the cos2 values of all variables :-

var <- get_pca_var(pca)
var
## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description                                    
## 1 "$coord"   "Coordinates for the variables"                
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"                       
## 4 "$contrib" "contributions of the variables"
var$cos2
##              Dim.1     Dim.2      Dim.3       Dim.4
## Murder   0.7122962 0.1730854 0.04151814 0.073100217
## Assault  0.8435380 0.0349769 0.02563817 0.095846950
## UrbanPop 0.1919463 0.7539938 0.05095143 0.003108430
## Rape     0.7324611 0.0277090 0.23845544 0.001374491
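
The cos2 values are simply the squared coordinates (equivalently, the squared correlations) of each variable on each dimension, and for standardized variables they sum to 1 across all four dimensions; a quick check:

# cos2 = squared coordinate of a variable on a dimension
round(var$coord^2, 4)   # matches var$cos2 above
rowSums(var$cos2)       # each row sums to 1 across the four dimensions
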
fviz_cos2(pca, choice = "var" , axes = 1:2)

fviz_pca_var(pca,
             col.var = "cos2",
             repel = TRUE
)

As all the variables lie close to the circumference of the correlation circle, every variable is well represented by the first two principal components, so little information is lost in the dimensionality reduction.
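
To see how strongly each variable drives the retained components, factoextra's contribution plot can also be used (an optional addition):

# Contribution of each variable to the first two principal components
fviz_contrib(pca, choice = "var", axes = 1:2)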

Finally, the feature-reduced dataset (the first two principal components) is:

final_df <- pca$x[,1:2]
head(final_df)
##                   PC1        PC2
## Alabama    -0.9756604  1.1220012
## Alaska     -1.9305379  1.0624269
## Arizona    -1.7454429 -0.7384595
## Arkansas    0.1399989  1.1085423
## California -2.4986128 -1.5274267
## Colorado   -1.4993407 -0.9776297
plot(final_df)
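
As a final check on how much information the two retained components preserve, the data can be approximately reconstructed from them (a sketch):

# Approximate reconstruction of the data from the first two principal components
recon_scaled <- pca$x[, 1:2] %*% t(pca$rotation[, 1:2])
recon <- sweep(recon_scaled, 2, pca$scale, `*`)    # undo the scaling
recon <- sweep(recon, 2, pca$center, `+`)          # undo the centering
head(round(recon, 1))   # compare with head(df)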