Introduction :-
In this report, I attempt dimensionality reduction on the USArrests data set.
Exploratory Data Analysis :-
Exploratory data analysis is an approach to analyzing data sets that summarizes their main characteristics, often with visual methods.
Structure of given dataset :-
The given dataset has 50 records (one per US state), and each record is described by 4 attributes.
The first six rows of the dataset are as follows.
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
Explanation of all the variables :-
Murder: Murder arrests (per 100,000)
Assault: Assault arrests (per 100,000)
UrbanPop: Percent urban population
Rape: Rape arrests (per 100,000)
The detailed structure is as follows.
## 'data.frame': 50 obs. of 4 variables:
## $ Murder : num 13.2 10 8.1 8.8 9 7.9 3.3 5.9 15.4 17.4 ...
## $ Assault : int 236 263 294 190 276 204 110 238 335 211 ...
## $ UrbanPop: int 58 48 80 50 91 78 77 72 80 60 ...
## $ Rape : num 21.2 44.5 31 19.5 40.6 38.7 11.1 15.8 31.9 25.8 ...
In the input, the type of each attribute is as follows.
##    Murder   Assault  UrbanPop      Rape
## "numeric" "integer" "integer" "numeric"
PCA requires numeric variables, and all four attributes are numeric or integer, so the types are fine. We can proceed further.
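The structure checks above can be reproduced with a short base R sketch. It assumes the report's input is the `USArrests` data frame that ships with R's `datasets` package; the names `arrests` and `col_types` are illustrative only.

```r
# Assumption: the report's input is the built-in USArrests data frame
data(USArrests)
arrests <- USArrests                  # illustrative name

head(arrests)                         # first six rows
str(arrests)                          # number of observations and column types

# Class of each column, returned as a named character vector
col_types <- sapply(arrests, class)
print(col_types)
```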
- Dealing with missing (NA) values :-
The number of missing values in each column is as follows.
##   Murder  Assault UrbanPop     Rape
##        0        0        0        0
As there are no missing values, we can proceed further.
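A minimal sketch of the missing-value count, assuming the built-in `USArrests` data frame is the input; `na_counts` is an illustrative name.

```r
# Count missing (NA) values per column of the built-in USArrests data
na_counts <- colSums(is.na(USArrests))
print(na_counts)

# TRUE when no column contains a missing value
print(all(na_counts == 0))
```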
- Summary :-
The overall summary of all the attributes is as follows.
##      Murder          Assault         UrbanPop          Rape
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00
The distribution of all continuous variables is as follows.
The correlation between the continuous variables is as follows.
##              Murder   Assault   UrbanPop      Rape
## Murder   1.00000000 0.8018733 0.06957262 0.5635788
## Assault  0.80187331 1.0000000 0.25887170 0.6652412
## UrbanPop 0.06957262 0.2588717 1.00000000 0.4113412
## Rape     0.56357883 0.6652412 0.41134124 1.0000000
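The summary and correlation tables can be sketched in base R as below, again assuming the built-in `USArrests` data; `cor_mat` is an illustrative name and `cor()` is used with its default Pearson method.

```r
# Five-number summary plus mean for every column
summary(USArrests)

# Pearson correlation matrix of the four variables
cor_mat <- cor(USArrests)
print(cor_mat)

# Strongest and weakest pairwise correlation (off-diagonal entries only)
off_diag <- cor_mat[upper.tri(cor_mat)]
print(range(off_diag))
```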
___
Description of EDA :-
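In our data set,
- Murder, Assault and Rape are strongly correlated with one another (correlations between 0.56 and 0.80), while UrbanPop is only weakly to moderately correlated with them (0.07 to 0.41). Because of these correlations, we can apply a dimensionality reduction technique such as PCA and describe the 4 attributes with a smaller number of attributes.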
Principal Component Analysis (PCA) :-
When variables are correlated, fewer variables can explain almost the same amount of variation. PCA extracts the important information from multivariate data and expresses it as a small set of new variables called principal components.
Fitting PCA :-
The fitted PCA (component standard deviations and loadings) is as follows.
## Standard deviations (1, .., p=4):
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
##
## Rotation (n x k) = (4 x 4):
##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432
We can observe that,
The correlated attributes (Murder, Assault, UrbanPop & Rape) are re-expressed as mutually uncorrelated components (PC1, PC2, PC3 & PC4).
The new principal components can be explained by the following equations, where each \(X\) denotes the standardized (centered and scaled) version of the corresponding variable.
\(PC_1 = -0.5358995 \cdot X_{Murder} - 0.5831836 \cdot X_{Assault} - 0.2781909 \cdot X_{UrbanPop} - 0.5434321 \cdot X_{Rape}\)
\(PC_2 = 0.4181809 \cdot X_{Murder} + 0.1879856 \cdot X_{Assault} - 0.8728062 \cdot X_{UrbanPop} - 0.1673186 \cdot X_{Rape}\)
\(PC_3 = -0.3412327 \cdot X_{Murder} - 0.2681484 \cdot X_{Assault} - 0.3780158 \cdot X_{UrbanPop} + 0.8177779 \cdot X_{Rape}\)
\(PC_4 = 0.64922780 \cdot X_{Murder} - 0.74340748 \cdot X_{Assault} + 0.13387773 \cdot X_{UrbanPop} + 0.08902432 \cdot X_{Rape}\)
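As a hedged illustration of how such a fit and its equations could be obtained, the sketch below uses base R's `prcomp()` with `scale. = TRUE` on the built-in `USArrests` data. This is an assumption: the report does not show which call was used. The sign of a whole loading column may differ between R installations, which does not change the interpretation. The names `pca`, `z` and `scores_manual` are illustrative.

```r
# Assumption: PCA fitted on centered and scaled variables with base R's prcomp()
pca <- prcomp(USArrests, scale. = TRUE)   # center = TRUE is the default
print(pca)                                # standard deviations and loadings

# Apply the equations by hand: standardize each column, then multiply by the loadings
z <- scale(USArrests)                     # (x - mean) / sd, column by column
scores_manual <- z %*% pca$rotation

# The hand-computed scores should agree with the scores stored by prcomp()
print(all.equal(unname(scores_manual), unname(pca$x)))

# The components are mutually uncorrelated: off-diagonal correlations are ~0
print(round(cor(pca$x), 6))
```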
Summary PCA :-
The importance of each component of the fitted PCA is as follows.
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.5749 0.9949 0.59713 0.41645
## Proportion of Variance 0.6201 0.2474 0.08914 0.04336
## Cumulative Proportion  0.6201 0.8675 0.95664 1.00000
We can observe that,
- About 62 % of the variance in the input dataset is explained by \(PC_1\) alone.
- About 86.75 % of the variance in the input dataset is explained by \(PC_1\) & \(PC_2\) together.
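The proportions in the summary follow directly from the component standard deviations: each component's variance is its squared standard deviation, divided by the total. A minimal sketch, assuming a `prcomp()` fit on the scaled built-in `USArrests` data (illustrative names `var_explained`, `cum_explained`):

```r
# Assumption: PCA fitted with base R's prcomp() on scaled variables
pca <- prcomp(USArrests, scale. = TRUE)

# Variance of each component = squared standard deviation
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
names(var_explained) <- colnames(pca$rotation)

# Running total of the explained proportions
cum_explained <- cumsum(var_explained)

print(round(rbind(Proportion = var_explained, Cumulative = cum_explained), 4))
```

`summary(pca)` reports the same quantities in a single table.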
Eigen-Values PCA :-
Apart from the proportion of variance explained, we can express the same information in terms of eigenvalues. The eigenvalues for the fitted principal components are as follows.
##       eigenvalue variance.percent cumulative.variance.percent
## PC_1   2.4802416        62.006039                    62.00604
## PC_2   0.9897652        24.744129                    86.75017
## PC_3   0.3565632         8.914080                    95.66425
## PC_4   0.1734301         4.335752                   100.00000
## TOTAL  4.0000000       100.000000                   344.42046
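We can observe that,
The eigenvalue of each principal component is proportional to the variance explained by it.
As with the variance explained, the eigenvalue is highest for the first principal component.
Because the variables were standardized, the total of the eigenvalues equals the number of input variables (4), which corresponds to a total variance of 100 %.
Hence all the principal components together explain all the variance in the input dataset.
The TOTAL entry of the cumulative column (344.42) is only the sum of the cumulative percentages and has no interpretation.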
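The link between eigenvalues and variance can be checked with a small base R sketch. It assumes a `prcomp()` fit on the scaled built-in `USArrests` data, in which case the eigenvalues are the squared component standard deviations and also the eigenvalues of the correlation matrix. The names `eig`, `eig_cor` and `eig_table` are illustrative.

```r
# Assumption: PCA fitted with base R's prcomp() on scaled variables
pca <- prcomp(USArrests, scale. = TRUE)

# Eigenvalues from the fit, and directly from the correlation matrix
eig     <- pca$sdev^2
eig_cor <- eigen(cor(USArrests))$values
print(all.equal(eig, eig_cor))

# Table of eigenvalues with percentage and cumulative percentage of variance
eig_table <- data.frame(
  eigenvalue                  = eig,
  variance.percent            = 100 * eig / sum(eig),
  cumulative.variance.percent = 100 * cumsum(eig) / sum(eig),
  row.names                   = paste0("PC_", seq_along(eig))
)
print(eig_table)

# With standardized variables the eigenvalues sum to the number of variables
print(sum(eig))
```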
Plot of Eigen-Values:-
This plot shows that
\(PC_1\) and \(PC_2\) explain most of the variance.
We can drop \(PC_3\) and \(PC_4\), at the cost of about 13 % of the variance.
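A scree plot of this kind can be sketched with base R graphics as below. The 85 % cumulative-variance threshold used to pick the number of components is an assumption made for illustration and is not stated in the report; `eig`, `threshold` and `n_keep` are illustrative names.

```r
# Assumption: PCA fitted with base R's prcomp() on scaled variables
pca <- prcomp(USArrests, scale. = TRUE)
eig <- pca$sdev^2

# Scree plot: eigenvalue against component number
plot(eig, type = "b", xaxt = "n",
     xlab = "Principal component", ylab = "Eigenvalue", main = "Scree plot")
axis(1, at = seq_along(eig), labels = paste0("PC", seq_along(eig)))

# Illustrative rule (assumed threshold): keep the fewest components
# whose cumulative share of variance reaches 85 %
threshold <- 0.85
n_keep <- which(cumsum(eig) / sum(eig) >= threshold)[1]
print(n_keep)
```

With this threshold the rule keeps two components, consistent with dropping \(PC_3\) and \(PC_4\).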
\(Cos^{2}\) values :-
\(Cos^{2}\) (squared cosine) shows how well a given variable (or observation) is represented by a principal component; values close to 1 indicate a good representation.
From the \(cos^{2}\) plot we can conclude that, for \(PC_1\) & \(PC_2\), all four input variables contribute almost equally.
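For variables, \(cos^{2}\) can be computed by hand: with standardized data, the correlation between a variable and a component is the loading multiplied by the component's standard deviation, and \(cos^{2}\) is the square of that correlation. The sketch below assumes a `prcomp()` fit on the scaled built-in `USArrests` data; `var_coord`, `cos2` and `cos2_pc12` are illustrative names.

```r
# Assumption: PCA fitted with base R's prcomp() on scaled variables
pca <- prcomp(USArrests, scale. = TRUE)

# Variable coordinates: each loading column multiplied by its component's sd
var_coord <- sweep(pca$rotation, 2, pca$sdev, "*")

# cos2: squared coordinates; each row sums to 1 over all four components
cos2 <- var_coord^2
print(round(cos2, 3))

# Quality of representation of each variable on PC1 and PC2 together
cos2_pc12 <- rowSums(cos2[, 1:2])
print(round(cos2_pc12, 3))
```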
Bi-Plot :-
As we observed from the \(cos^2\) plot, all the input variables contribute almost equally. The bi-plot shows the same thing: all the arrows end near the circumference of the circle, which means that all input variables are well represented by \(PC_1\) and \(PC_2\) and are therefore important.
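A comparable plot can be drawn with base R's `biplot()`, and the arrow lengths on a correlation circle can be computed directly. This is a sketch under the assumption of a `prcomp()` fit on the scaled built-in `USArrests` data; the report's own plotting function is not shown, so the appearance may differ. `var_coord` and `arrow_len` are illustrative names.

```r
# Assumption: PCA fitted with base R's prcomp() on scaled variables
pca <- prcomp(USArrests, scale. = TRUE)

# Base R bi-plot of states (points) and variables (arrows) on PC1 and PC2
biplot(pca, scale = 0, cex = 0.6)

# Arrow length of each variable on the PC1-PC2 correlation circle:
# a length close to 1 means the variable is well represented in this plane
var_coord <- sweep(pca$rotation, 2, pca$sdev, "*")
arrow_len <- sqrt(rowSums(var_coord[, 1:2]^2))
print(round(arrow_len, 3))
```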
Updated Dataset :-
The dimensionally reduced dataset for the USArrests data, keeping only PC1 and PC2, is as follows.
##                        PC1         PC2
## Alabama        -0.97566045  1.12200121
## Alaska         -1.93053788  1.06242692
## Arizona        -1.74544285 -0.73845954
## Arkansas        0.13999894  1.10854226
## California     -2.49861285 -1.52742672
## Colorado       -1.49934074 -0.97762966
## Connecticut     1.34499236 -1.07798362
## Delaware       -0.04722981 -0.32208890
## Florida        -2.98275967  0.03883425
## Georgia        -1.62280742  1.26608838
## Hawaii          0.90348448 -1.55467609
## Idaho           1.62331903  0.20885253
## Illinois       -1.36505197 -0.67498834
## Indiana         0.50038122 -0.15003926
## Iowa            2.23099579 -0.10300828
## Kansas          0.78887206 -0.26744941
## Kentucky        0.74331256  0.94880748
## Louisiana      -1.54909076  0.86230011
## Maine           2.37274014  0.37260865
## Maryland       -1.74564663  0.42335704
## Massachusetts   0.48128007 -1.45967706
## Michigan       -2.08725025 -0.15383500
## Minnesota       1.67566951 -0.62590670
## Mississippi    -0.98647919  2.36973712
## Missouri       -0.68978426 -0.26070794
## Montana         1.17353751  0.53147851
## Nebraska        1.25291625 -0.19200440
## Nevada         -2.84550542 -0.76780502
## New Hampshire   2.35995585 -0.01790055
## New Jersey     -0.17974128 -1.43493745
## New Mexico     -1.96012351  0.14141308
## New York       -1.66566662 -0.81491072
## North Carolina -1.11208808  2.20561081
## North Dakota    2.96215223  0.59309738
## Ohio            0.22369436 -0.73477837
## Oklahoma        0.30864928 -0.28496113
## Oregon         -0.05852787 -0.53596999
## Pennsylvania    0.87948680 -0.56536050
## Rhode Island    0.85509072 -1.47698328
## South Carolina -1.30744986  1.91397297
## South Dakota    1.96779669  0.81506822
## Tennessee      -0.98969377  0.85160534
## Texas          -1.34151838 -0.40833518
## Utah            0.54503180 -1.45671524
## Vermont         2.77325613  1.38819435
## Virginia        0.09536670  0.19772785
## Washington      0.21472339 -0.96037394
## West Virginia   2.08739306  1.41052627
## Wisconsin       2.05881199 -0.60512507
## Wyoming         0.62310061  0.31778662
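The reduced dataset is simply the first two columns of the component scores. A minimal sketch, assuming a `prcomp()` fit on the scaled built-in `USArrests` data (`reduced` is an illustrative name; the signs of a whole column may be flipped on some R installations):

```r
# Assumption: PCA fitted with base R's prcomp() on scaled variables
pca <- prcomp(USArrests, scale. = TRUE)

# Keep only the scores on the first two principal components
reduced <- as.data.frame(pca$x[, 1:2])

print(dim(reduced))    # 50 states, 2 components
head(reduced)
```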
————————————————————- THANK YOU ————————————————————-