Houses data set, provides census information from all the block groups from the 1990 California census. The original data set had 20,640 records, of which 18,540 were randomly selected for a training data set, and 2100 held out for a test data set. Median house value is the response variable; the predictor variables are the following:

Median house value appears to be in dollars, but median income has been scaled to a 0-15 continuous scale. Note that longitude is expressed in negative terms, meaning west of Greenwich. Larger absolute values for longitude indicate geographic locations further west.

Step 1: Conduct PCA and FA on the eight predictors in the house data set using the training data. Confirm the same using the test data.

Step 2: Conduct regression analysis using the factors derived from step 1 as explanatory variables and median house value as dependent variable.

Solution Partial Solution - PCA
    1. Import dataset
    ## Import Data into R
    houseprice <- read.csv("hsg_data.csv", header = TRUE)
    dim(houseprice)
    ## [1] 20640     9
    library(Hmisc)
    ## Warning: package 'Hmisc' was built under R version 3.5.1
    ## Loading required package: lattice
    ## Loading required package: survival
    ## Loading required package: Formula
    ## Loading required package: ggplot2
    ## 
    ## Attaching package: 'Hmisc'
    ## The following objects are masked from 'package:base':
    ## 
    ##     format.pval, units
    describe(houseprice)
    ## houseprice 
    ## 
    ##  9  Variables      20640  Observations
    ## ---------------------------------------------------------------------------
    ## medianhousevalue 
    ##        n  missing distinct     Info     Mean      Gmd      .05      .10 
    ##    20640        0     3842        1   206856   125934    66200    82300 
    ##      .25      .50      .75      .90      .95 
    ##   119600   179700   264725   376600   489810 
    ## 
    ## lowest :  14999  17500  22500  25000  26600, highest: 498800 499000 499100 500000 500001
    ## ---------------------------------------------------------------------------
    ## medianincome 
    ##        n  missing distinct     Info     Mean      Gmd      .05      .10 
    ##    20640        0     1058        1    3.871    1.981     1.60     1.90 
    ##      .25      .50      .75      .90      .95 
    ##     2.56     3.53     4.74     6.16     7.30 
    ## 
    ## lowest :  0.50  0.54  0.55  0.64  0.68, highest: 14.41 14.42 14.58 14.90 15.00
    ## ---------------------------------------------------------------------------
    ## housingmedianwage 
    ##        n  missing distinct     Info     Mean      Gmd      .05      .10 
    ##    20640        0       52    0.999    28.64    14.43        8       13 
    ##      .25      .50      .75      .90      .95 
    ##       18       29       37       46       52 
    ## 
    ## lowest :  1  2  3  4  5, highest: 48 49 50 51 52
    ## ---------------------------------------------------------------------------
    ## totalrooms 
    ##        n  missing distinct     Info     Mean      Gmd      .05      .10 
    ##    20640        0     5926        1     2636     1914      621      941 
    ##      .25      .50      .75      .90      .95 
    ##     1448     2127     3148     4652     6213 
    ## 
    ## lowest :     2     6     8    11    12, highest: 30450 32054 32627 37937 39320
    ## ---------------------------------------------------------------------------
    ## totalbedrooms 
    ##        n  missing distinct     Info     Mean      Gmd      .05      .10 
    ##    20640        0     1928        1    537.9    383.7      136      198 
    ##      .25      .50      .75      .90      .95 
    ##      295      435      647      966     1276 
    ## 
    ## lowest :    1    2    3    4    5, highest: 5290 5419 5471 6210 6445
    ## ---------------------------------------------------------------------------
    ## population 
    ##        n  missing distinct     Info     Mean      Gmd      .05      .10 
    ##    20640        0     3888        1     1425     1014      348      510 
    ##      .25      .50      .75      .90      .95 
    ##      787     1166     1725     2566     3288 
    ## 
    ## lowest :     3     5     6     8     9, highest: 15507 16122 16305 28566 35682
    ## ---------------------------------------------------------------------------
    ## households 
    ##        n  missing distinct     Info     Mean      Gmd      .05      .10 
    ##    20640        0     1815        1    499.5    351.1      125      184 
    ##      .25      .50      .75      .90      .95 
    ##      280      409      605      890     1162 
    ## 
    ## lowest :    1    2    3    4    5, highest: 4930 5050 5189 5358 6082
    ## ---------------------------------------------------------------------------
    ## latitude 
    ##        n  missing distinct     Info     Mean      Gmd      .05      .10 
    ##    20640        0      862        1    35.63     2.35    32.82    33.63 
    ##      .25      .50      .75      .90      .95 
    ##    33.93    34.26    37.71    38.48    38.96 
    ## 
    ## lowest : 32.54 32.55 32.56 32.57 32.58, highest: 41.84 41.86 41.88 41.92 41.95
    ## ---------------------------------------------------------------------------
    ## longitude 
    ##        n  missing distinct     Info     Mean      Gmd      .05      .10 
    ##    20640        0      844        1   -119.6    2.243   -122.5   -122.3 
    ##      .25      .50      .75      .90      .95 
    ##   -121.8   -118.5   -118.0   -117.2   -117.1 
    ## 
    ## lowest : -124.35 -124.30 -124.27 -124.26 -124.25
    ## highest: -114.56 -114.55 -114.49 -114.47 -114.31
    ## ---------------------------------------------------------------------------
    # picking random sample of size = 500 from the population
    set.seed(100)
    houseprice1 <- houseprice[sample(nrow(houseprice),500),]
    attach(houseprice1)
  1. Exploratory Data analysis

    #box plot
    str(houseprice1)
    ## 'data.frame':    500 obs. of  9 variables:
    ##  $ medianhousevalue : int  184500 500001 288000 57600 97800 146900 260300 217700 180200 218400 ...
    ##  $ medianincome     : num  4.56 10.8 5.96 2.04 1.58 3.34 5.28 5.38 3.26 5.58 ...
    ##  $ housingmedianwage: int  13 44 21 39 20 15 29 30 22 35 ...
    ##  $ totalrooms       : int  3859 533 3107 1551 1963 1743 3795 3615 4255 1898 ...
    ##  $ totalbedrooms    : int  710 90 483 353 434 366 675 760 971 344 ...
    ##  $ population       : int  2283 291 1688 684 682 655 2494 2813 2901 1123 ...
    ##  $ households       : int  759 97 503 310 273 264 696 752 920 347 ...
    ##  $ latitude         : num  34.1 34.1 33.7 39.5 38.5 ...
    ##  $ longitude        : num  -118 -118 -118 -122 -119 ...
    par(mfrow=c(2,4))
    boxplot(medianhousevalue,horizontal = TRUE, col="Red", main="Box plot for medianhousevalue")
    boxplot(medianincome, horizontal = TRUE, col="Orange", main="Box plot for medianincome")
    boxplot(housingmedianwage, horizontal = TRUE, col="Pink", main="Box plot for housingmedianwage")
    boxplot(totalrooms, horizontal = TRUE, col="Yellow", main="Box plot for totalrooms")
    
    #histogram plots
    
    hist(medianhousevalue, col="Red", main="Histogram for medianhousevalue")
    hist(medianincome,  col="Orange", main="Histogram for medianincome")
    hist(housingmedianwage, col="Pink", main="Histogram for housingmedianwage")
    hist(totalrooms, col="Yellow", main="Histogram for totalrooms")

    par(mfrow=c(2,3))
    
    boxplot(totalbedrooms ,horizontal = TRUE, col="Cyan", main="Box plot for medianhousevalue")
    boxplot(population, horizontal = TRUE, col="Blue", main="Box plot for population")
    boxplot(households, horizontal = TRUE, col="Sky Blue", main="Box plot for households")
    
    hist(totalbedrooms, col="Cyan", main="Histogram ")
    hist(population, col="Blue", main="Histogram")
    hist(households, col="Sky Blue", main="Histogram")

    par(mfrow=c(2,2))
    boxplot(latitude, horizontal = TRUE, col="Green", main="Box plot for latitude")
    boxplot(longitude,horizontal = TRUE, col="Light Green", main="Box plot for longitude")
    
    hist(latitude, col="Green", main="Histogram ")
    hist(longitude, col="Light Green", main="Histogram")

  2. Correlation Matrix

    library(psych)
    ## Warning: package 'psych' was built under R version 3.5.1
    ## 
    ## Attaching package: 'psych'
    ## The following object is masked from 'package:Hmisc':
    ## 
    ##     describe
    ## The following objects are masked from 'package:ggplot2':
    ## 
    ##     %+%, alpha
    cm <- cor(houseprice1)
    cm
    ##                   medianhousevalue medianincome housingmedianwage
    ## medianhousevalue        1.00000000  0.718328707        0.06495472
    ## medianincome            0.71832871  1.000000000       -0.12165576
    ## housingmedianwage       0.06495472 -0.121655756        1.00000000
    ## totalrooms              0.22185165  0.270297806       -0.33787982
    ## totalbedrooms           0.10488492  0.052212522       -0.29339442
    ## population              0.03307099  0.052327756       -0.26577237
    ## households              0.12660708  0.086995289       -0.28608585
    ## latitude               -0.11153518 -0.085402794        0.04536584
    ## longitude              -0.08530861 -0.005500642       -0.13277903
    ##                     totalrooms totalbedrooms  population  households
    ## medianhousevalue   0.221851648    0.10488492  0.03307099  0.12660708
    ## medianincome       0.270297806    0.05221252  0.05232776  0.08699529
    ## housingmedianwage -0.337879823   -0.29339442 -0.26577237 -0.28608585
    ## totalrooms         1.000000000    0.93117967  0.83788677  0.92352714
    ## totalbedrooms      0.931179668    1.00000000  0.88165060  0.97805369
    ## population         0.837886766    0.88165060  1.00000000  0.91347647
    ## households         0.923527136    0.97805369  0.91347647  1.00000000
    ## latitude          -0.035439617   -0.06815791 -0.12989397 -0.08347347
    ## longitude          0.005536102    0.02626701  0.08596126  0.02669034
    ##                      latitude    longitude
    ## medianhousevalue  -0.11153518 -0.085308606
    ## medianincome      -0.08540279 -0.005500642
    ## housingmedianwage  0.04536584 -0.132779034
    ## totalrooms        -0.03543962  0.005536102
    ## totalbedrooms     -0.06815791  0.026267014
    ## population        -0.12989397  0.085961257
    ## households        -0.08347347  0.026690337
    ## latitude           1.00000000 -0.920003081
    ## longitude         -0.92000308  1.000000000
    corPlot(cm)

    library(corrplot)
    ## Warning: package 'corrplot' was built under R version 3.5.1
    ## corrplot 0.84 loaded
    #Correlation plot - method=circle, type = upper, order=hclust
    corrplot(cm, method = "circle", type="upper", order="hclust")

  3. Bartlett’s Test of Sphericity: Bartlett’s test of sphericity can be used to test the null hypothesis that the variables are uncorrelated in the population: in other words, the population correlation matrix is an identity matrix. If this hypothesis cannot be rejected, then the appropriateness of factor analysis should be questioned.

    # Barlett Sphericity Test for checking the possibility of data dimension reduction 
    print(cortest.bartlett(cm, nrow(houseprice1)),digits = 8)
    ## $chisq
    ## [1] 5380.5283
    ## 
    ## $p.value
    ## [1] 0
    ## 
    ## $df
    ## [1] 36
  4. KMO Test

    library(psych)
    KMO(cm)
    ## Kaiser-Meyer-Olkin factor adequacy
    ## Call: KMO(r = cm)
    ## Overall MSA =  0.67
    ## MSA for each item = 
    ##  medianhousevalue      medianincome housingmedianwage        totalrooms 
    ##              0.42              0.43              0.78              0.82 
    ##     totalbedrooms        population        households          latitude 
    ##              0.74              0.87              0.76              0.43 
    ##         longitude 
    ##              0.43
  5. Eigen Values and Eigen Vectors

    # Finding out the Eigen Values and Eigen Vectors
    library(car)
    ## Warning: package 'car' was built under R version 3.5.1
    ## Loading required package: carData
    ## 
    ## Attaching package: 'car'
    ## The following object is masked from 'package:psych':
    ## 
    ##     logit
    library(foreign)
    library(MASS)
    library(lattice)
    library(nFactors)
    ## Warning: package 'nFactors' was built under R version 3.5.1
    ## Loading required package: boot
    ## 
    ## Attaching package: 'boot'
    ## The following object is masked from 'package:car':
    ## 
    ##     logit
    ## The following object is masked from 'package:psych':
    ## 
    ##     logit
    ## The following object is masked from 'package:survival':
    ## 
    ##     aml
    ## The following object is masked from 'package:lattice':
    ## 
    ##     melanoma
    ## 
    ## Attaching package: 'nFactors'
    ## The following object is masked from 'package:lattice':
    ## 
    ##     parallel
    # get eigenvalues
    ev = eigen(cor(houseprice1)) 
    ev
    ## eigen() decomposition
    ## $values
    ## [1] 3.92821345 1.91614077 1.70724378 0.90924653 0.27431787 0.14828743
    ## [7] 0.05844476 0.04156899 0.01653642
    ## 
    ## $vectors
    ##              [,1]        [,2]        [,3]         [,4]        [,5]
    ##  [1,]  0.10516442  0.01336725 -0.69309938 -0.127127486  0.65257185
    ##  [2,]  0.11259600  0.04501260 -0.68421565  0.171614578 -0.65310627
    ##  [3,] -0.20065780 -0.06413451 -0.06868599 -0.944372332 -0.22500668
    ##  [4,]  0.48456444 -0.08029206 -0.04221375 -0.016691532 -0.14324974
    ##  [5,]  0.48585475 -0.07007956  0.10350367 -0.113353592  0.12499485
    ##  [6,]  0.46437554 -0.01791843  0.13519287 -0.132309890 -0.15685970
    ##  [7,]  0.49021700 -0.06353206  0.08269046 -0.124329544  0.07673301
    ##  [8,] -0.07913064 -0.69692363  0.02511393  0.128201939 -0.12747520
    ##  [9,]  0.05406017  0.70161531  0.09221650  0.004562276 -0.11868919
    ##              [,6]        [,7]        [,8]         [,9]
    ##  [1,]  0.09386979  0.22650679  0.07858731  0.008412420
    ##  [2,]  0.02008865 -0.21429932  0.10782053 -0.056502796
    ##  [3,] -0.08340157  0.03217845  0.02106761  0.001794502
    ##  [4,] -0.46429804  0.43557946 -0.55964865  0.132893397
    ##  [5,] -0.30468443 -0.27078184  0.27184565 -0.692670765
    ##  [6,]  0.79564115  0.26737062 -0.04452377 -0.130889445
    ##  [7,] -0.04808394 -0.37875074  0.32080735  0.691943583
    ##  [8,] -0.09728469  0.45796707  0.50417833  0.036407267
    ##  [9,] -0.17493599  0.46261186  0.48533777  0.045128072
    EigenValues=ev$values
    EigenValues
    ## [1] 3.92821345 1.91614077 1.70724378 0.90924653 0.27431787 0.14828743
    ## [7] 0.05844476 0.04156899 0.01653642
    EigenVectors = ev$vectors
    EigenVectors
    ##              [,1]        [,2]        [,3]         [,4]        [,5]
    ##  [1,]  0.10516442  0.01336725 -0.69309938 -0.127127486  0.65257185
    ##  [2,]  0.11259600  0.04501260 -0.68421565  0.171614578 -0.65310627
    ##  [3,] -0.20065780 -0.06413451 -0.06868599 -0.944372332 -0.22500668
    ##  [4,]  0.48456444 -0.08029206 -0.04221375 -0.016691532 -0.14324974
    ##  [5,]  0.48585475 -0.07007956  0.10350367 -0.113353592  0.12499485
    ##  [6,]  0.46437554 -0.01791843  0.13519287 -0.132309890 -0.15685970
    ##  [7,]  0.49021700 -0.06353206  0.08269046 -0.124329544  0.07673301
    ##  [8,] -0.07913064 -0.69692363  0.02511393  0.128201939 -0.12747520
    ##  [9,]  0.05406017  0.70161531  0.09221650  0.004562276 -0.11868919
    ##              [,6]        [,7]        [,8]         [,9]
    ##  [1,]  0.09386979  0.22650679  0.07858731  0.008412420
    ##  [2,]  0.02008865 -0.21429932  0.10782053 -0.056502796
    ##  [3,] -0.08340157  0.03217845  0.02106761  0.001794502
    ##  [4,] -0.46429804  0.43557946 -0.55964865  0.132893397
    ##  [5,] -0.30468443 -0.27078184  0.27184565 -0.692670765
    ##  [6,]  0.79564115  0.26737062 -0.04452377 -0.130889445
    ##  [7,] -0.04808394 -0.37875074  0.32080735  0.691943583
    ##  [8,] -0.09728469  0.45796707  0.50417833  0.036407267
    ##  [9,] -0.17493599  0.46261186  0.48533777  0.045128072
    # We will consider Components which are having eigenvalues > 1 unit i.e. PC1 - PC3.
  6. Principal Component Analysis

    # Getting the loadings and Communality
    pc=principal(houseprice1, nfactors=4, rotate="none")
    print(pc, digits = 3)
    ## Principal Components Analysis
    ## Call: principal(r = houseprice1, nfactors = 4, rotate = "none")
    ## Standardized loadings (pattern matrix) based upon correlation matrix
    ##                      PC1    PC2    PC3    PC4    h2     u2  com
    ## medianhousevalue   0.208 -0.019  0.906  0.121 0.879 0.1214 1.14
    ## medianincome       0.223 -0.062  0.894 -0.164 0.880 0.1203 1.21
    ## housingmedianwage -0.398  0.089  0.090  0.901 0.985 0.0150 1.42
    ## totalrooms         0.960  0.111  0.055  0.016 0.938 0.0620 1.03
    ## totalbedrooms      0.963  0.097 -0.135  0.108 0.967 0.0333 1.09
    ## population         0.920  0.025 -0.177  0.126 0.895 0.1052 1.11
    ## households         0.972  0.088 -0.108  0.119 0.977 0.0225 1.07
    ## latitude          -0.157  0.965 -0.033 -0.122 0.971 0.0287 1.09
    ## longitude          0.107 -0.971 -0.120 -0.004 0.969 0.0307 1.06
    ## 
    ##                         PC1   PC2   PC3   PC4
    ## SS loadings           3.928 1.916 1.707 0.909
    ## Proportion Var        0.436 0.213 0.190 0.101
    ## Cumulative Var        0.436 0.649 0.839 0.940
    ## Proportion Explained  0.464 0.226 0.202 0.107
    ## Cumulative Proportion 0.464 0.691 0.893 1.000
    ## 
    ## Mean item complexity =  1.1
    ## Test of the hypothesis that 4 components are sufficient.
    ## 
    ## The root mean square of the residuals (RMSR) is  0.028 
    ##  with the empirical chi square  27.468  with prob <  0.000118 
    ## 
    ## Fit based upon off diagonal values = 0.996

Interpreting the variance:

# Interpreting the variance

part.pca<-EigenValues/sum(EigenValues)*100
part.pca
## [1] 43.6468161 21.2904530 18.9693753 10.1027393  3.0479763  1.6476381
## [7]  0.6493863  0.4618776  0.1837380
# Scree Plot

plot(EigenValues, type = "lines", xlab = "Principal Components", ylab="Eigen Values")
## Warning in plot.xy(xy, type, ...): plot type 'lines' will be truncated to
## first character

houseprice_prop_Varex<- EigenValues/sum(EigenValues)
houseprice_prop_Varex
## [1] 0.436468161 0.212904530 0.189693753 0.101027393 0.030479763 0.016476381
## [7] 0.006493863 0.004618776 0.001837380
par(mfrow=c(1,1))
plot(houseprice_prop_Varex,   col="blue", xlab = "Principal Component", ylab = "Proportion of Variance Explained", type = "b")

plot(cumsum(houseprice_prop_Varex),  col="red", xlab = "Principal Component", ylab = "Cumulative Proportion of Variance Explained", type = "b")

From the above variance analysis, we see that 4 components are able to explain 93.79% variance

  • Principal Components Scoring and Perceptual Map

    # Principal Components Scoring and Perceptual Map
    
    
    pc.cr<-princomp(houseprice1,cor=TRUE)
    summary(pc.cr)
    ## Importance of components:
    ##                           Comp.1    Comp.2    Comp.3    Comp.4     Comp.5
    ## Standard deviation     1.9819721 1.3842474 1.3066154 0.9535442 0.52375363
    ## Proportion of Variance 0.4364682 0.2129045 0.1896938 0.1010274 0.03047976
    ## Cumulative Proportion  0.4364682 0.6493727 0.8390664 0.9400938 0.97057360
    ##                            Comp.6      Comp.7      Comp.8     Comp.9
    ## Standard deviation     0.38508106 0.241753521 0.203884738 0.12859401
    ## Proportion of Variance 0.01647638 0.006493863 0.004618776 0.00183738
    ## Cumulative Proportion  0.98704998 0.993543844 0.998162620 1.00000000
    plot(pc.cr)

  • Bi Plot

    #  A biplot uses points to represent the scores of the observations on the principal components, and it uses vectors to represent the coefficients of the variables on the principal component
    
    biplot(pc.cr, scale=0)

  • Visualization of PCA using another library

    • Visualize eigenvalues (scree plot). Show the percentage of variances explained by each principal component.

      library(factoextra)
      ## Warning: package 'factoextra' was built under R version 3.5.1
      ## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
      # Visualization of Scree Plot
      fviz_eig(pc.cr)
    • Graph of individuals. Individuals with a similar profile are grouped together

      fviz_pca_ind(pc.cr,
                   col.ind = "cos2", # Color by the quality of representation
                   gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
                   repel = TRUE     # Avoid text overlapping
                   )
    • Graph of variables. Positive correlated variables point to the same side of the plot. Negative correlated variables point to opposite sides of the graph

      fviz_pca_var(pc.cr,
                   col.var = "contrib", # Color by contributions to the PC
                   gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
                   repel = TRUE     # Avoid text overlapping
                   )

    • Biplot of individuals and variables

      fviz_pca_biplot(pc.cr, repel = TRUE,
                      col.var = "#2E9FDF", # Variables color
                      col.ind = "#696969"  # Individuals color
                      )

    End of document