Median house value appears to be in dollars, but median income has been scaled to a 0-15 continuous scale. Note that longitude is expressed in negative terms, meaning west of Greenwich. Larger absolute values for longitude indicate geographic locations further west.
Step 1: Conduct PCA and FA on the eight predictors in the house data set using the training data. Confirm the same using the test data.
Step 2: Conduct regression analysis using the factors derived from step 1 as explanatory variables and median house value as dependent variable.
Solution Partial Solution - PCA## Import Data into R
houseprice <- read.csv("hsg_data.csv", header = TRUE)
dim(houseprice)
## [1] 20640 9
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 3.5.1
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, units
describe(houseprice)
## houseprice
##
## 9 Variables 20640 Observations
## ---------------------------------------------------------------------------
## medianhousevalue
## n missing distinct Info Mean Gmd .05 .10
## 20640 0 3842 1 206856 125934 66200 82300
## .25 .50 .75 .90 .95
## 119600 179700 264725 376600 489810
##
## lowest : 14999 17500 22500 25000 26600, highest: 498800 499000 499100 500000 500001
## ---------------------------------------------------------------------------
## medianincome
## n missing distinct Info Mean Gmd .05 .10
## 20640 0 1058 1 3.871 1.981 1.60 1.90
## .25 .50 .75 .90 .95
## 2.56 3.53 4.74 6.16 7.30
##
## lowest : 0.50 0.54 0.55 0.64 0.68, highest: 14.41 14.42 14.58 14.90 15.00
## ---------------------------------------------------------------------------
## housingmedianwage
## n missing distinct Info Mean Gmd .05 .10
## 20640 0 52 0.999 28.64 14.43 8 13
## .25 .50 .75 .90 .95
## 18 29 37 46 52
##
## lowest : 1 2 3 4 5, highest: 48 49 50 51 52
## ---------------------------------------------------------------------------
## totalrooms
## n missing distinct Info Mean Gmd .05 .10
## 20640 0 5926 1 2636 1914 621 941
## .25 .50 .75 .90 .95
## 1448 2127 3148 4652 6213
##
## lowest : 2 6 8 11 12, highest: 30450 32054 32627 37937 39320
## ---------------------------------------------------------------------------
## totalbedrooms
## n missing distinct Info Mean Gmd .05 .10
## 20640 0 1928 1 537.9 383.7 136 198
## .25 .50 .75 .90 .95
## 295 435 647 966 1276
##
## lowest : 1 2 3 4 5, highest: 5290 5419 5471 6210 6445
## ---------------------------------------------------------------------------
## population
## n missing distinct Info Mean Gmd .05 .10
## 20640 0 3888 1 1425 1014 348 510
## .25 .50 .75 .90 .95
## 787 1166 1725 2566 3288
##
## lowest : 3 5 6 8 9, highest: 15507 16122 16305 28566 35682
## ---------------------------------------------------------------------------
## households
## n missing distinct Info Mean Gmd .05 .10
## 20640 0 1815 1 499.5 351.1 125 184
## .25 .50 .75 .90 .95
## 280 409 605 890 1162
##
## lowest : 1 2 3 4 5, highest: 4930 5050 5189 5358 6082
## ---------------------------------------------------------------------------
## latitude
## n missing distinct Info Mean Gmd .05 .10
## 20640 0 862 1 35.63 2.35 32.82 33.63
## .25 .50 .75 .90 .95
## 33.93 34.26 37.71 38.48 38.96
##
## lowest : 32.54 32.55 32.56 32.57 32.58, highest: 41.84 41.86 41.88 41.92 41.95
## ---------------------------------------------------------------------------
## longitude
## n missing distinct Info Mean Gmd .05 .10
## 20640 0 844 1 -119.6 2.243 -122.5 -122.3
## .25 .50 .75 .90 .95
## -121.8 -118.5 -118.0 -117.2 -117.1
##
## lowest : -124.35 -124.30 -124.27 -124.26 -124.25
## highest: -114.56 -114.55 -114.49 -114.47 -114.31
## ---------------------------------------------------------------------------
# picking random sample of size = 500 from the population
set.seed(100)
houseprice1 <- houseprice[sample(nrow(houseprice),500),]
attach(houseprice1)
Exploratory Data analysis
#box plot
str(houseprice1)
## 'data.frame': 500 obs. of 9 variables:
## $ medianhousevalue : int 184500 500001 288000 57600 97800 146900 260300 217700 180200 218400 ...
## $ medianincome : num 4.56 10.8 5.96 2.04 1.58 3.34 5.28 5.38 3.26 5.58 ...
## $ housingmedianwage: int 13 44 21 39 20 15 29 30 22 35 ...
## $ totalrooms : int 3859 533 3107 1551 1963 1743 3795 3615 4255 1898 ...
## $ totalbedrooms : int 710 90 483 353 434 366 675 760 971 344 ...
## $ population : int 2283 291 1688 684 682 655 2494 2813 2901 1123 ...
## $ households : int 759 97 503 310 273 264 696 752 920 347 ...
## $ latitude : num 34.1 34.1 33.7 39.5 38.5 ...
## $ longitude : num -118 -118 -118 -122 -119 ...
par(mfrow=c(2,4))
boxplot(medianhousevalue,horizontal = TRUE, col="Red", main="Box plot for medianhousevalue")
boxplot(medianincome, horizontal = TRUE, col="Orange", main="Box plot for medianincome")
boxplot(housingmedianwage, horizontal = TRUE, col="Pink", main="Box plot for housingmedianwage")
boxplot(totalrooms, horizontal = TRUE, col="Yellow", main="Box plot for totalrooms")
#histogram plots
hist(medianhousevalue, col="Red", main="Histogram for medianhousevalue")
hist(medianincome, col="Orange", main="Histogram for medianincome")
hist(housingmedianwage, col="Pink", main="Histogram for housingmedianwage")
hist(totalrooms, col="Yellow", main="Histogram for totalrooms")
par(mfrow=c(2,3))
boxplot(totalbedrooms ,horizontal = TRUE, col="Cyan", main="Box plot for medianhousevalue")
boxplot(population, horizontal = TRUE, col="Blue", main="Box plot for population")
boxplot(households, horizontal = TRUE, col="Sky Blue", main="Box plot for households")
hist(totalbedrooms, col="Cyan", main="Histogram ")
hist(population, col="Blue", main="Histogram")
hist(households, col="Sky Blue", main="Histogram")
par(mfrow=c(2,2))
boxplot(latitude, horizontal = TRUE, col="Green", main="Box plot for latitude")
boxplot(longitude,horizontal = TRUE, col="Light Green", main="Box plot for longitude")
hist(latitude, col="Green", main="Histogram ")
hist(longitude, col="Light Green", main="Histogram")
Correlation Matrix
library(psych)
## Warning: package 'psych' was built under R version 3.5.1
##
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
##
## describe
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
cm <- cor(houseprice1)
cm
## medianhousevalue medianincome housingmedianwage
## medianhousevalue 1.00000000 0.718328707 0.06495472
## medianincome 0.71832871 1.000000000 -0.12165576
## housingmedianwage 0.06495472 -0.121655756 1.00000000
## totalrooms 0.22185165 0.270297806 -0.33787982
## totalbedrooms 0.10488492 0.052212522 -0.29339442
## population 0.03307099 0.052327756 -0.26577237
## households 0.12660708 0.086995289 -0.28608585
## latitude -0.11153518 -0.085402794 0.04536584
## longitude -0.08530861 -0.005500642 -0.13277903
## totalrooms totalbedrooms population households
## medianhousevalue 0.221851648 0.10488492 0.03307099 0.12660708
## medianincome 0.270297806 0.05221252 0.05232776 0.08699529
## housingmedianwage -0.337879823 -0.29339442 -0.26577237 -0.28608585
## totalrooms 1.000000000 0.93117967 0.83788677 0.92352714
## totalbedrooms 0.931179668 1.00000000 0.88165060 0.97805369
## population 0.837886766 0.88165060 1.00000000 0.91347647
## households 0.923527136 0.97805369 0.91347647 1.00000000
## latitude -0.035439617 -0.06815791 -0.12989397 -0.08347347
## longitude 0.005536102 0.02626701 0.08596126 0.02669034
## latitude longitude
## medianhousevalue -0.11153518 -0.085308606
## medianincome -0.08540279 -0.005500642
## housingmedianwage 0.04536584 -0.132779034
## totalrooms -0.03543962 0.005536102
## totalbedrooms -0.06815791 0.026267014
## population -0.12989397 0.085961257
## households -0.08347347 0.026690337
## latitude 1.00000000 -0.920003081
## longitude -0.92000308 1.000000000
corPlot(cm)
library(corrplot)
## Warning: package 'corrplot' was built under R version 3.5.1
## corrplot 0.84 loaded
#Correlation plot - method=circle, type = upper, order=hclust
corrplot(cm, method = "circle", type="upper", order="hclust")
Bartlett’s Test of Sphericity: Bartlett’s test of sphericity can be used to test the null hypothesis that the variables are uncorrelated in the population: in other words, the population correlation matrix is an identity matrix. If this hypothesis cannot be rejected, then the appropriateness of factor analysis should be questioned.
# Barlett Sphericity Test for checking the possibility of data dimension reduction
print(cortest.bartlett(cm, nrow(houseprice1)),digits = 8)
## $chisq
## [1] 5380.5283
##
## $p.value
## [1] 0
##
## $df
## [1] 36
KMO Test
library(psych)
KMO(cm)
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = cm)
## Overall MSA = 0.67
## MSA for each item =
## medianhousevalue medianincome housingmedianwage totalrooms
## 0.42 0.43 0.78 0.82
## totalbedrooms population households latitude
## 0.74 0.87 0.76 0.43
## longitude
## 0.43
Eigen Values and Eigen Vectors
# Finding out the Eigen Values and Eigen Vectors
library(car)
## Warning: package 'car' was built under R version 3.5.1
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
library(foreign)
library(MASS)
library(lattice)
library(nFactors)
## Warning: package 'nFactors' was built under R version 3.5.1
## Loading required package: boot
##
## Attaching package: 'boot'
## The following object is masked from 'package:car':
##
## logit
## The following object is masked from 'package:psych':
##
## logit
## The following object is masked from 'package:survival':
##
## aml
## The following object is masked from 'package:lattice':
##
## melanoma
##
## Attaching package: 'nFactors'
## The following object is masked from 'package:lattice':
##
## parallel
# get eigenvalues
ev = eigen(cor(houseprice1))
ev
## eigen() decomposition
## $values
## [1] 3.92821345 1.91614077 1.70724378 0.90924653 0.27431787 0.14828743
## [7] 0.05844476 0.04156899 0.01653642
##
## $vectors
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.10516442 0.01336725 -0.69309938 -0.127127486 0.65257185
## [2,] 0.11259600 0.04501260 -0.68421565 0.171614578 -0.65310627
## [3,] -0.20065780 -0.06413451 -0.06868599 -0.944372332 -0.22500668
## [4,] 0.48456444 -0.08029206 -0.04221375 -0.016691532 -0.14324974
## [5,] 0.48585475 -0.07007956 0.10350367 -0.113353592 0.12499485
## [6,] 0.46437554 -0.01791843 0.13519287 -0.132309890 -0.15685970
## [7,] 0.49021700 -0.06353206 0.08269046 -0.124329544 0.07673301
## [8,] -0.07913064 -0.69692363 0.02511393 0.128201939 -0.12747520
## [9,] 0.05406017 0.70161531 0.09221650 0.004562276 -0.11868919
## [,6] [,7] [,8] [,9]
## [1,] 0.09386979 0.22650679 0.07858731 0.008412420
## [2,] 0.02008865 -0.21429932 0.10782053 -0.056502796
## [3,] -0.08340157 0.03217845 0.02106761 0.001794502
## [4,] -0.46429804 0.43557946 -0.55964865 0.132893397
## [5,] -0.30468443 -0.27078184 0.27184565 -0.692670765
## [6,] 0.79564115 0.26737062 -0.04452377 -0.130889445
## [7,] -0.04808394 -0.37875074 0.32080735 0.691943583
## [8,] -0.09728469 0.45796707 0.50417833 0.036407267
## [9,] -0.17493599 0.46261186 0.48533777 0.045128072
EigenValues=ev$values
EigenValues
## [1] 3.92821345 1.91614077 1.70724378 0.90924653 0.27431787 0.14828743
## [7] 0.05844476 0.04156899 0.01653642
EigenVectors = ev$vectors
EigenVectors
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.10516442 0.01336725 -0.69309938 -0.127127486 0.65257185
## [2,] 0.11259600 0.04501260 -0.68421565 0.171614578 -0.65310627
## [3,] -0.20065780 -0.06413451 -0.06868599 -0.944372332 -0.22500668
## [4,] 0.48456444 -0.08029206 -0.04221375 -0.016691532 -0.14324974
## [5,] 0.48585475 -0.07007956 0.10350367 -0.113353592 0.12499485
## [6,] 0.46437554 -0.01791843 0.13519287 -0.132309890 -0.15685970
## [7,] 0.49021700 -0.06353206 0.08269046 -0.124329544 0.07673301
## [8,] -0.07913064 -0.69692363 0.02511393 0.128201939 -0.12747520
## [9,] 0.05406017 0.70161531 0.09221650 0.004562276 -0.11868919
## [,6] [,7] [,8] [,9]
## [1,] 0.09386979 0.22650679 0.07858731 0.008412420
## [2,] 0.02008865 -0.21429932 0.10782053 -0.056502796
## [3,] -0.08340157 0.03217845 0.02106761 0.001794502
## [4,] -0.46429804 0.43557946 -0.55964865 0.132893397
## [5,] -0.30468443 -0.27078184 0.27184565 -0.692670765
## [6,] 0.79564115 0.26737062 -0.04452377 -0.130889445
## [7,] -0.04808394 -0.37875074 0.32080735 0.691943583
## [8,] -0.09728469 0.45796707 0.50417833 0.036407267
## [9,] -0.17493599 0.46261186 0.48533777 0.045128072
# We will consider Components which are having eigenvalues > 1 unit i.e. PC1 - PC3.
Principal Component Analysis
# Getting the loadings and Communality
pc=principal(houseprice1, nfactors=4, rotate="none")
print(pc, digits = 3)
## Principal Components Analysis
## Call: principal(r = houseprice1, nfactors = 4, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PC1 PC2 PC3 PC4 h2 u2 com
## medianhousevalue 0.208 -0.019 0.906 0.121 0.879 0.1214 1.14
## medianincome 0.223 -0.062 0.894 -0.164 0.880 0.1203 1.21
## housingmedianwage -0.398 0.089 0.090 0.901 0.985 0.0150 1.42
## totalrooms 0.960 0.111 0.055 0.016 0.938 0.0620 1.03
## totalbedrooms 0.963 0.097 -0.135 0.108 0.967 0.0333 1.09
## population 0.920 0.025 -0.177 0.126 0.895 0.1052 1.11
## households 0.972 0.088 -0.108 0.119 0.977 0.0225 1.07
## latitude -0.157 0.965 -0.033 -0.122 0.971 0.0287 1.09
## longitude 0.107 -0.971 -0.120 -0.004 0.969 0.0307 1.06
##
## PC1 PC2 PC3 PC4
## SS loadings 3.928 1.916 1.707 0.909
## Proportion Var 0.436 0.213 0.190 0.101
## Cumulative Var 0.436 0.649 0.839 0.940
## Proportion Explained 0.464 0.226 0.202 0.107
## Cumulative Proportion 0.464 0.691 0.893 1.000
##
## Mean item complexity = 1.1
## Test of the hypothesis that 4 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.028
## with the empirical chi square 27.468 with prob < 0.000118
##
## Fit based upon off diagonal values = 0.996
Interpreting the variance:
# Interpreting the variance
part.pca<-EigenValues/sum(EigenValues)*100
part.pca
## [1] 43.6468161 21.2904530 18.9693753 10.1027393 3.0479763 1.6476381
## [7] 0.6493863 0.4618776 0.1837380
# Scree Plot
plot(EigenValues, type = "lines", xlab = "Principal Components", ylab="Eigen Values")
## Warning in plot.xy(xy, type, ...): plot type 'lines' will be truncated to
## first character
houseprice_prop_Varex<- EigenValues/sum(EigenValues)
houseprice_prop_Varex
## [1] 0.436468161 0.212904530 0.189693753 0.101027393 0.030479763 0.016476381
## [7] 0.006493863 0.004618776 0.001837380
par(mfrow=c(1,1))
plot(houseprice_prop_Varex, col="blue", xlab = "Principal Component", ylab = "Proportion of Variance Explained", type = "b")
plot(cumsum(houseprice_prop_Varex), col="red", xlab = "Principal Component", ylab = "Cumulative Proportion of Variance Explained", type = "b")
From the above variance analysis, we see that 4 components are able to explain 93.79% variance
Principal Components Scoring and Perceptual Map
# Principal Components Scoring and Perceptual Map
pc.cr<-princomp(houseprice1,cor=TRUE)
summary(pc.cr)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.9819721 1.3842474 1.3066154 0.9535442 0.52375363
## Proportion of Variance 0.4364682 0.2129045 0.1896938 0.1010274 0.03047976
## Cumulative Proportion 0.4364682 0.6493727 0.8390664 0.9400938 0.97057360
## Comp.6 Comp.7 Comp.8 Comp.9
## Standard deviation 0.38508106 0.241753521 0.203884738 0.12859401
## Proportion of Variance 0.01647638 0.006493863 0.004618776 0.00183738
## Cumulative Proportion 0.98704998 0.993543844 0.998162620 1.00000000
plot(pc.cr)
Bi Plot
# A biplot uses points to represent the scores of the observations on the principal components, and it uses vectors to represent the coefficients of the variables on the principal component
biplot(pc.cr, scale=0)
Visualization of PCA using another library
Visualize eigenvalues (scree plot). Show the percentage of variances explained by each principal component.
library(factoextra)
## Warning: package 'factoextra' was built under R version 3.5.1
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
# Visualization of Scree Plot
fviz_eig(pc.cr)
Graph of individuals. Individuals with a similar profile are grouped together
fviz_pca_ind(pc.cr,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
Graph of variables. Positive correlated variables point to the same side of the plot. Negative correlated variables point to opposite sides of the graph
fviz_pca_var(pc.cr,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
Biplot of individuals and variables
fviz_pca_biplot(pc.cr, repel = TRUE,
col.var = "#2E9FDF", # Variables color
col.ind = "#696969" # Individuals color
)
End of document