Introduction :-

In this report, I attempt dimensionality reduction on the USArrests data set.


Exploratory Data Analysis :-

Exploratory data analysis is an approach to analyzing data sets in order to summarize their main characteristics, often with visual methods.

Structure of given dataset :-

The given dataset has 50 records, and each record is observed on 4 attributes.

A sample of the dataset is as follows.
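A minimal sketch of how this sample can be produced in R (USArrests is a dataset built into base R):

head(USArrests)   # first six rows of the built-in USArrests data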

##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
  • Explanation of all the variables :-

    1. Murder: Murder arrests (per 100,000)

    2. Assault: Assault arrests (per 100,000)

    3. UrbanPop: Percent urban population

    4. Rape: Rape arrests (per 100,000)

The detailed structure is as follows.
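A sketch of how this structure can be inspected:

str(USArrests)    # compact display of the internal structure of the data frame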

## 'data.frame':    50 obs. of  4 variables:
##  $ Murder  : num  13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
##  $ Assault : int  236 263 294 190 276 204 110 238 335 211 ...
##  $ UrbanPop: int  58 48 80 50 91 78 77 72 80 60 ...
##  $ Rape    : num  21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...

The type of each attribute in the input is as follows.
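One way to obtain these types (a sketch using base R):

sapply(USArrests, class)   # class of each column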

##    Murder   Assault  UrbanPop      Rape 
## "numeric" "integer" "integer" "numeric"

For our analysis these types are perfectly fine. We can proceed further.


  • Dealing with NULL values :-

The number of null values in each column is as follows.
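A sketch of how these null counts can be computed:

colSums(is.na(USArrests))   # number of missing values per column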

##   Murder  Assault UrbanPop     Rape 
##        0        0        0        0

As there are no null values, we can proceed further.


  • Summary :-

The overall summary of all the attributes is as follows.
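A sketch of how this summary is obtained:

summary(USArrests)   # minimum, quartiles, mean and maximum for each column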

##      Murder          Assault         UrbanPop          Rape      
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00

The distribution of all continuous variables is as follows.
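A sketch of how these distributions could be plotted, assuming simple base-R histograms (the original figure is not reproduced here):

par(mfrow = c(2, 2))                          # 2 x 2 panel layout
for (v in names(USArrests)) {
  hist(USArrests[[v]], main = v, xlab = v)    # histogram of each variable
}
par(mfrow = c(1, 1))                          # reset layout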


The correlation between the continuous variables is as follows.
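A sketch of how the correlation matrix is computed:

cor(USArrests)   # pairwise Pearson correlations between the four variables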

##              Murder   Assault   UrbanPop      Rape
## Murder   1.00000000 0.8018733 0.06957262 0.5635788
## Assault  0.80187331 1.0000000 0.25887170 0.6652412
## UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
## Rape     0.56357883 0.6652412 0.41134124 1.0000000

___

Description of EDA :-

In our data set,

  • Most of the variables are strongly correlated with one another (Murder and Assault in particular, with a correlation of about 0.80). We can therefore apply a dimensionality reduction technique such as PCA and describe all 4 attributes with a smaller number of derived variables.

Principal Component Analysis (PCA) :-

When variables are correlated, fewer variables can explain almost the same amount of variation. PCA is used to extract the important information from multivariate data and to express this information as a set of a few new variables called principal components.


Fitting PCA :-

The output of the fitted PCA is shown below.
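A minimal sketch of how such a PCA could be fitted in R (the object name pca_fit is ours; centring and scaling are assumed, since the eigenvalues reported later sum to the number of variables):

pca_fit <- prcomp(USArrests, center = TRUE, scale. = TRUE)  # PCA on the standardized variables
pca_fit                                                     # prints standard deviations and rotation (loadings)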

## Standard deviations (1, .., p=4):
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
## 
## Rotation (n x k) = (4 x 4):
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432

We can observe that (each \(X\) below denotes the corresponding centred and scaled variable),

\(PC1 = -0.5358995*X_{murder} - 0.5831836*X_{Assault} - 0.2781909*X_{UrbanPop} - 0.5434321*X_{Rape}\)

\(PC2 = 0.4181809*X_{murder} + 0.1879856*X_{Assault} - 0.8728062*X_{UrbanPop} - 0.1673186*X_{Rape}\)

\(PC3 = -0.3412327*X_{murder} - 0.2681484*X_{Assault} - 0.3780158*X_{UrbanPop} + 0.8177779*X_{Rape}\)

\(PC4 = 0.64922780*X_{murder} - 0.74340748*X_{Assault} + 0.13387773*X_{UrbanPop} + 0.08902432*X_{Rape}\)
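As a quick check of these equations, the component scores can be recovered by multiplying the standardized data by the rotation matrix (a sketch; scores_check is our name):

scores_check <- scale(USArrests) %*% pca_fit$rotation   # standardized data times the loadings
all.equal(unname(scores_check), unname(pca_fit$x))      # should be TRUE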

Summary of PCA :-

The summary of the fitted PCA (importance of components) is as follows.
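A sketch of how this importance table is printed:

summary(pca_fit)   # standard deviation, proportion of variance and cumulative proportion per component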

## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.5749 0.9949 0.59713 0.41645
## Proportion of Variance 0.6201 0.2474 0.08914 0.04336
## Cumulative Proportion  0.6201 0.8675 0.95664 1.00000

We can observe that,

  • 62% of the variance in the input dataset is explained by \(PC_1\) alone.
  • 86.7% of the variance in the input dataset is explained by \(PC_1\) and \(PC_2\) together.

Eigen-Values of PCA :-

Apart from the proportion of variance explained, we can express the same information in terms of eigenvalues. The eigenvalues of the fitted principal components are as follows.
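A sketch of how an eigenvalue table like the one below can be built (the eigenvalues are simply the squared standard deviations from prcomp; the factoextra helper get_eigenvalue() produces an equivalent table, and the object name eig is ours):

eig <- pca_fit$sdev^2                                    # eigenvalues of the correlation matrix
data.frame(eigenvalue                  = eig,
           variance.percent            = 100 * eig / sum(eig),
           cumulative.variance.percent = cumsum(100 * eig / sum(eig)))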

##       eigenvalue variance.percent cumulative.variance.percent
## PC_1   2.4802416        62.006039                    62.00604
## PC_2   0.9897652        24.744129                    86.75017
## PC_3   0.3565632         8.914080                    95.66425
## PC_4   0.1734301         4.335752                   100.00000
## TOTAL  4.0000000       100.000000                   344.42046

We can observe that,

  • The eigenvalue of each principal component is proportional to the variance explained by it.

  • As with the variance explained, the eigenvalue is by far the largest for the first principal component.

  • As the total of the eigenvalues equals the number of input variables (4) and the total explained variance is 100%, all the principal components together explain all the variance in the input dataset.

Plot of Eigen-Values :-
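A sketch of how this eigenvalue (scree) plot could be drawn, assuming the factoextra package:

library(factoextra)
fviz_eig(pca_fit, addlabels = TRUE)   # percentage of explained variance per component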

This plot clearly shows that:

  • \(PC_1\) and \(PC_2\) explain most of the variance.

  • We can ignore \(PC_3\) and \(PC_4\).

\(Cos^{2}\) values :-

\(Cos^{2}\) (the squared cosine) shows the quality of representation, i.e. how important a principal component is for a given variable or observation.
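A sketch of how a \(cos^2\) plot for the variables could be produced, again assuming factoextra (the axes argument restricts it to the first two components):

fviz_cos2(pca_fit, choice = "var", axes = 1:2)   # quality of representation of each variable on PC1 and PC2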

From the above plot we can conclude that, on \(PC_1\) and \(PC_2\) together, all four input variables are represented almost equally well.

Bi-Plot :-

As we observed from the \(cos^2\) plot, all the input variables contribute almost equally. We can see the same in this bi-plot: all the arrows lie close to the circumference of the circle, which means all the input variables are important.
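A sketch of how such a correlation circle and bi-plot could be drawn, again assuming factoextra:

fviz_pca_var(pca_fit, col.var = "cos2", repel = TRUE)   # variable arrows inside the unit circle, coloured by cos2
fviz_pca_biplot(pca_fit, repel = TRUE)                  # observations and variable arrows together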

Updated Dataset :-

The dimensionally reduced dataset (the scores of each state on the first two principal components) for the USArrests data is as follows.
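A sketch of how this reduced dataset is obtained (the first two columns of the score matrix; the name reduced is ours):

reduced <- as.data.frame(pca_fit$x[, 1:2])   # scores of the 50 states on PC1 and PC2
reduced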

##                        PC1         PC2
## Alabama        -0.97566045  1.12200121
## Alaska         -1.93053788  1.06242692
## Arizona        -1.74544285 -0.73845954
## Arkansas        0.13999894  1.10854226
## California     -2.49861285 -1.52742672
## Colorado       -1.49934074 -0.97762966
## Connecticut     1.34499236 -1.07798362
## Delaware       -0.04722981 -0.32208890
## Florida        -2.98275967  0.03883425
## Georgia        -1.62280742  1.26608838
## Hawaii          0.90348448 -1.55467609
## Idaho           1.62331903  0.20885253
## Illinois       -1.36505197 -0.67498834
## Indiana         0.50038122 -0.15003926
## Iowa            2.23099579 -0.10300828
## Kansas          0.78887206 -0.26744941
## Kentucky        0.74331256  0.94880748
## Louisiana      -1.54909076  0.86230011
## Maine           2.37274014  0.37260865
## Maryland       -1.74564663  0.42335704
## Massachusetts   0.48128007 -1.45967706
## Michigan       -2.08725025 -0.15383500
## Minnesota       1.67566951 -0.62590670
## Mississippi    -0.98647919  2.36973712
## Missouri       -0.68978426 -0.26070794
## Montana         1.17353751  0.53147851
## Nebraska        1.25291625 -0.19200440
## Nevada         -2.84550542 -0.76780502
## New Hampshire   2.35995585 -0.01790055
## New Jersey     -0.17974128 -1.43493745
## New Mexico     -1.96012351  0.14141308
## New York       -1.66566662 -0.81491072
## North Carolina -1.11208808  2.20561081
## North Dakota    2.96215223  0.59309738
## Ohio            0.22369436 -0.73477837
## Oklahoma        0.30864928 -0.28496113
## Oregon         -0.05852787 -0.53596999
## Pennsylvania    0.87948680 -0.56536050
## Rhode Island    0.85509072 -1.47698328
## South Carolina -1.30744986  1.91397297
## South Dakota    1.96779669  0.81506822
## Tennessee      -0.98969377  0.85160534
## Texas          -1.34151838 -0.40833518
## Utah            0.54503180 -1.45671524
## Vermont         2.77325613  1.38819435
## Virginia        0.09536670  0.19772785
## Washington      0.21472339 -0.96037394
## West Virginia   2.08739306  1.41052627
## Wisconsin       2.05881199 -0.60512507
## Wyoming         0.62310061  0.31778662

————————————————————- THANK YOU ————————————————————-