Introduction

This project applies Principal Component Analysis (PCA) to reduce the dimensionality of a housing market dataset. The dataset is described on Kaggle as covering properties in Sydney and Melbourne, Australia, although the records themselves list cities in Washington State, USA (see the city, statezip and country columns in the output below). The goal is to identify the most significant factors influencing housing market trends while simplifying the dataset. PCA transforms the original variables into a smaller set of components that retain as much variance, and therefore as much of the original information, as possible. Dimension reduction will allow for clearer insights into the key drivers of property prices and characteristics.

Libraries

library(dplyr)
library(corrplot)
library(factoextra)
library(gridExtra)
library(cowplot)

Dataset exploration

The dataset was downloaded from Kaggle, where it was originally published for housing price prediction. Its description says that it contains detailed information about properties in Sydney and Melbourne, Australia. You can access the dataset here: https://www.kaggle.com/datasets/shree1992/housedata

Variables in the dataset:

Dataset Preprocessing and Cleaning

Firstly, let’s import our dataset and look at the type of each variable.

df_house_price <- read.csv("data.csv", sep = ",", dec = ".", header = TRUE)
summary(df_house_price)
##      date               price             bedrooms       bathrooms    
##  Length:4600        Min.   :       0   Min.   :0.000   Min.   :0.000  
##  Class :character   1st Qu.:  322875   1st Qu.:3.000   1st Qu.:1.750  
##  Mode  :character   Median :  460943   Median :3.000   Median :2.250  
##                     Mean   :  551963   Mean   :3.401   Mean   :2.161  
##                     3rd Qu.:  654962   3rd Qu.:4.000   3rd Qu.:2.500  
##                     Max.   :26590000   Max.   :9.000   Max.   :8.000  
##   sqft_living       sqft_lot           floors        waterfront      
##  Min.   :  370   Min.   :    638   Min.   :1.000   Min.   :0.000000  
##  1st Qu.: 1460   1st Qu.:   5001   1st Qu.:1.000   1st Qu.:0.000000  
##  Median : 1980   Median :   7683   Median :1.500   Median :0.000000  
##  Mean   : 2139   Mean   :  14852   Mean   :1.512   Mean   :0.007174  
##  3rd Qu.: 2620   3rd Qu.:  11001   3rd Qu.:2.000   3rd Qu.:0.000000  
##  Max.   :13540   Max.   :1074218   Max.   :3.500   Max.   :1.000000  
##       view          condition       sqft_above   sqft_basement   
##  Min.   :0.0000   Min.   :1.000   Min.   : 370   Min.   :   0.0  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:1190   1st Qu.:   0.0  
##  Median :0.0000   Median :3.000   Median :1590   Median :   0.0  
##  Mean   :0.2407   Mean   :3.452   Mean   :1827   Mean   : 312.1  
##  3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:2300   3rd Qu.: 610.0  
##  Max.   :4.0000   Max.   :5.000   Max.   :9410   Max.   :4820.0  
##     yr_built     yr_renovated       street              city          
##  Min.   :1900   Min.   :   0.0   Length:4600        Length:4600       
##  1st Qu.:1951   1st Qu.:   0.0   Class :character   Class :character  
##  Median :1976   Median :   0.0   Mode  :character   Mode  :character  
##  Mean   :1971   Mean   : 808.6                                        
##  3rd Qu.:1997   3rd Qu.:1999.0                                        
##  Max.   :2014   Max.   :2014.0                                        
##    statezip           country         
##  Length:4600        Length:4600       
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
## 
dim(df_house_price)
## [1] 4600   18
head(df_house_price,7)
##                  date   price bedrooms bathrooms sqft_living sqft_lot floors
## 1 2014-05-02 00:00:00  313000        3      1.50        1340     7912    1.5
## 2 2014-05-02 00:00:00 2384000        5      2.50        3650     9050    2.0
## 3 2014-05-02 00:00:00  342000        3      2.00        1930    11947    1.0
## 4 2014-05-02 00:00:00  420000        3      2.25        2000     8030    1.0
## 5 2014-05-02 00:00:00  550000        4      2.50        1940    10500    1.0
## 6 2014-05-02 00:00:00  490000        2      1.00         880     6380    1.0
## 7 2014-05-02 00:00:00  335000        2      2.00        1350     2560    1.0
##   waterfront view condition sqft_above sqft_basement yr_built yr_renovated
## 1          0    0         3       1340             0     1955         2005
## 2          0    4         5       3370           280     1921            0
## 3          0    0         4       1930             0     1966            0
## 4          0    0         4       1000          1000     1963            0
## 5          0    0         4       1140           800     1976         1992
## 6          0    0         3        880             0     1938         1994
## 7          0    0         3       1350             0     1976            0
##                     street      city statezip country
## 1     18810 Densmore Ave N Shoreline WA 98133     USA
## 2          709 W Blaine St   Seattle WA 98119     USA
## 3 26206-26214 143rd Ave SE      Kent WA 98042     USA
## 4          857 170th Pl NE  Bellevue WA 98008     USA
## 5        9105 170th Ave NE   Redmond WA 98052     USA
## 6           522 NE 88th St   Seattle WA 98115     USA
## 7        2616 174th Ave NE   Redmond WA 98052     USA
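The column types can also be listed directly instead of being read off the summary() output. The following is a minimal sketch on a three-row toy data frame built from the first rows printed above; the names toy, col_types and non_numeric are introduced here for illustration only.

```r
# Toy data frame with three rows copied from the head() output above
toy <- data.frame(
  price    = c(313000, 2384000, 342000),
  bedrooms = c(3, 5, 3),
  city     = c("Shoreline", "Seattle", "Kent"),
  stringsAsFactors = FALSE
)

# Class of every column, returned as a named character vector
col_types <- sapply(toy, class)

# Columns that are not numeric and so cannot be passed to prcomp() as they are
non_numeric <- names(toy)[!sapply(toy, is.numeric)]

print(col_types)
print(non_numeric)
```

Applied to df_house_price, the same two calls should flag the character columns that are handled one by one in the next section.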

Handling issues with dataset

After reviewing the dataset, I see that we have a few issues:

  1. Some properties have price equal to 0
  2. There are properties with 0 bedrooms or 0 bathrooms (to be verified—may not necessarily be incorrect)
  3. The maximum value for sqft_lot is 1074218, which seems unusually high
  4. Some observations have 0 for yr_renovated
  5. The variable street is not numerical
  6. The variable city is not numerical
  7. The variable statezip is not numerical
  8. The variable country is not numerical
  9. The variable date is not numerical

Let’s check them:

1) Some properties have price equal to 0

df_house_price[df_house_price$price == 0,]
##                     date price bedrooms bathrooms sqft_living sqft_lot floors
## 4355 2014-05-05 00:00:00     0        3      1.75        1490    10125    1.0
## 4357 2014-05-05 00:00:00     0        4      2.75        2600     5390    1.0
## 4358 2014-05-05 00:00:00     0        6      2.75        3200     9200    1.0
## 4359 2014-05-06 00:00:00     0        5      3.50        3480    36615    2.0
## 4362 2014-05-07 00:00:00     0        5      1.50        1500     7112    1.0
## 4363 2014-05-07 00:00:00     0        4      4.00        3680    18804    2.0
## 4375 2014-05-09 00:00:00     0        2      2.50        2200   188200    1.0
## 4377 2014-05-09 00:00:00     0        4      2.25        2170    10500    1.0
## 4383 2014-05-12 00:00:00     0        5      4.50        4630     6324    2.0
## 4384 2014-05-13 00:00:00     0        5      4.00        4430     9000    2.0
## 4386 2014-05-13 00:00:00     0        4      4.50        5030    11023    2.0
## 4387 2014-05-13 00:00:00     0        4      1.50        2180    22870    1.0
## 4390 2014-05-15 00:00:00     0        4      3.50        4210    10308    2.0
## 4395 2014-05-16 00:00:00     0        5      3.25        3690    12353    2.0
## 4406 2014-05-20 00:00:00     0        4      3.75        3300     4545    1.5
## 4409 2014-05-21 00:00:00     0        5      2.25        2880    11965    2.0
## 4412 2014-05-22 00:00:00     0        5      2.25        2000     7900    1.0
## 4413 2014-05-22 00:00:00     0        3      3.00        1860     7440    1.0
## 4414 2014-05-22 00:00:00     0        4      3.00        1990     6180    2.0
## 4421 2014-05-27 00:00:00     0        4      1.00        1360    13372    1.0
## 4443 2014-06-02 00:00:00     0        1      1.00         720     6000    1.0
## 4449 2014-06-03 00:00:00     0        5      2.75        2740     5616    1.5
## 4454 2014-06-03 00:00:00     0        3      1.00        1300     6710    1.0
## 4455 2014-06-03 00:00:00     0        5      2.50        2090     4698    2.0
## 4473 2014-06-09 00:00:00     0        4      3.75        4060    19290    2.0
## 4479 2014-06-11 00:00:00     0        5      2.75        2910    53898    1.0
## 4480 2014-06-11 00:00:00     0        5      2.00        1910     7200    1.0
## 4481 2014-06-11 00:00:00     0        3      2.50        2880    13500    1.0
## 4482 2014-06-11 00:00:00     0        5      2.75        3240     6863    2.0
## 4488 2014-06-12 00:00:00     0        4      1.00        2080     3500    1.5
## 4500 2014-06-17 00:00:00     0        5      3.75        3870     8225    2.0
## 4508 2014-06-18 00:00:00     0        4      1.50        2310    68824    2.0
## 4510 2014-06-18 00:00:00     0        6      3.00        3020    13783    2.0
## 4521 2014-06-20 00:00:00     0        4      2.50        1960    11600    1.0
## 4522 2014-06-20 00:00:00     0        4      1.00        1810     7500    1.0
## 4523 2014-06-22 00:00:00     0        2      2.25        1490     6770    1.5
## 4524 2014-06-23 00:00:00     0        3      4.50        5230    17826    2.0
## 4529 2014-06-24 00:00:00     0        4      5.00        4550    18641    1.0
## 4535 2014-06-24 00:00:00     0        3      2.75        1310     7300    1.0
## 4543 2014-06-25 00:00:00     0        5      3.50        2640     6895    2.0
## 4553 2014-06-26 00:00:00     0        4      2.00        2100     4857    2.0
## 4555 2014-06-27 00:00:00     0        2      1.00         810     8424    1.0
## 4556 2014-06-27 00:00:00     0        2      1.50        1520     8040    1.0
## 4559 2014-06-28 00:00:00     0        4      4.25        3500     8750    1.0
## 4564 2014-07-01 00:00:00     0        2      2.25        2130     4920    1.5
## 4568 2014-07-02 00:00:00     0        4      2.50        4080    18362    2.0
## 4575 2014-07-02 00:00:00     0        3      1.00        1520     9030    1.0
## 4576 2014-07-02 00:00:00     0        5      6.25        8020    21738    2.0
## 4589 2014-07-08 00:00:00     0        4      2.25        2890    18226    3.0
##      waterfront view condition sqft_above sqft_basement yr_built yr_renovated
## 4355          0    0         4       1490             0     1962            0
## 4357          0    0         4       1300          1300     1960         2001
## 4358          0    2         4       1600          1600     1953         1983
## 4359          0    0         4       2490           990     1983            0
## 4362          0    0         5        760           740     1920            0
## 4363          0    0         3       3680             0     1990         2009
## 4375          0    3         3       2200             0     2007            0
## 4377          0    2         4       1270           900     1960         2001
## 4383          0    0         3       3210          1420     2006            0
## 4384          0    0         3       4430             0     2013         1923
## 4386          0    2         3       3250          1780     2008            0
## 4387          0    0         4       1280           900     1954         1975
## 4390          0    0         3       4210             0     2006            0
## 4395          0    0         5       3690             0     1977            0
## 4406          0    4         3       2600           700     1926         1999
## 4409          0    0         4       2880             0     1990            0
## 4412          0    0         4       1300           700     1986            0
## 4413          0    0         5       1040           820     1954            0
## 4414          0    0         3       1990             0     1990         2009
## 4421          0    0         3       1360             0     1955         2005
## 4443          0    0         3        720             0     1940         1996
## 4449          0    0         5       1670          1070     1925            0
## 4454          0    0         4       1300             0     1952            0
## 4455          0    0         3       2090             0     1998         2006
## 4473          0    0         3       4060             0     2002            0
## 4479          0    0         5       1510          1400     1979            0
## 4480          0    0         4       1110           800     1951         1999
## 4481          0    4         5       1520          1360     1950            0
## 4482          0    0         3       3240             0     2013         1923
## 4488          0    0         5       1260           820     1926            0
## 4500          0    0         3       3870             0     1998         2006
## 4508          0    0         4       2310             0     1968            0
## 4510          0    0         3       3020             0     1952         2002
## 4521          0    0         5        980           980     1931            0
## 4522          0    0         2       1410           400     1959            0
## 4523          0    0         3       1490             0     1926         2003
## 4524          1    4         3       3740          1490     2005            0
## 4529          1    4         3       2600          1950     2002            0
## 4535          0    0         3       1310             0     1957         2000
## 4543          0    0         3       2640             0     2001            0
## 4553          0    0         3       2100             0     1965         1984
## 4555          0    0         4        810             0     1959            0
## 4556          0    0         5       1520             0     1951            0
## 4559          0    4         5       2140          1360     1951            0
## 4564          0    4         4       1530           600     1941         1998
## 4568          0    2         4       4080             0     1983            0
## 4575          0    0         3       1520             0     1956         2001
## 4576          0    0         3       8020             0     2001            0
## 4589          1    4         3       2890             0     1984            0
##                             street             city statezip country
## 4355               3911 S 328th St      Federal Way WA 98001     USA
## 4357               2120 31st Ave W          Seattle WA 98199     USA
## 4358       12271 Marine View Dr SW           Burien WA 98146     USA
## 4359              21809 SE 38th Pl         Issaquah WA 98075     USA
## 4362       14901-14999 12th Ave SW           Burien WA 98166     USA
## 4363        1223-1237 244th Ave NE        Sammamish WA 98074     USA
## 4375            39612 254th Ave SE         Enumclaw WA 98022     USA
## 4377               216 SW 183rd St    Normandy Park WA 98166     USA
## 4383           6925 Oakmont Ave SE       Snoqualmie WA 98065     USA
## 4384                9235 NE 5th St         Bellevue WA 98004     USA
## 4386             4140 Boulevard Pl    Mercer Island WA 98040     USA
## 4387     31603 E Lake Morton Dr SE             Kent WA 98042     USA
## 4390             2234 167th Ave SE         Bellevue WA 98008     USA
## 4395             19055 35th Ave NE Lake Forest Park WA 98155     USA
## 4406              3665 50th Ave NE          Seattle WA 98105     USA
## 4409             25437 163rd Pl SE        Covington WA 98042     USA
## 4412               3202 S 194th St           SeaTac WA 98188     USA
## 4413              10744 62nd Ave S          Seattle WA 98178     USA
## 4414             32706 20th Ave SW      Federal Way WA 98023     USA
## 4421              18423 61st Pl NE          Kenmore WA 98028     USA
## 4443          1236 S Cloverdale St          Seattle WA 98108     USA
## 4449               1013 NE 80th St          Seattle WA 98115     USA
## 4454              2760 72nd Ave SE    Mercer Island WA 98040     USA
## 4455             27622 237th Pl SE     Maple Valley WA 98038     USA
## 4473               21418 SE 5th Pl        Sammamish WA 98074     USA
## 4479            13505 208th Ave NE      Woodinville WA 98077     USA
## 4480        11620-11698 57th Ave S          Seattle WA 98178     USA
## 4481               9243 NE 20th St       Clyde Hill WA 98004     USA
## 4482     1301-1303 Monterey Ave NE           Renton WA 98056     USA
## 4488              6506 40th Ave SW          Seattle WA 98136     USA
## 4500          101-127 247th Ave SE        Sammamish WA 98074     USA
## 4508            29656 232nd Ave SE    Black Diamond WA 98010     USA
## 4510              4115 85th Ave SE    Mercer Island WA 98040     USA
## 4521                506 21st St SE           Auburn WA 98002     USA
## 4522        12231 Occidental Ave S          Seattle WA 98168     USA
## 4523               4921 28th Ave S          Seattle WA 98108     USA
## 4524             7455 W Mercer Way    Mercer Island WA 98040     USA
## 4529  425 E Lake Sammamish Pkwy SE        Sammamish WA 98074     USA
## 4535              16232 SE 10th St         Bellevue WA 98008     USA
## 4543               34529 SE Jay Ct       Snoqualmie WA 98065     USA
## 4553              4500 NE 171st St Lake Forest Park WA 98155     USA
## 4555        30401-30499 8th Ave SW      Federal Way WA 98023     USA
## 4556             11533 22nd Ave NE          Seattle WA 98125     USA
## 4559              12725 8th Ave NW          Seattle WA 98177     USA
## 4564              3428 60th Ave SW          Seattle WA 98116     USA
## 4568              2710 95th Ave NE       Clyde Hill WA 98004     USA
## 4575              2533 155th Pl SE         Bellevue WA 98007     USA
## 4576                2 Crescent Key         Bellevue WA 98006     USA
## 4589 3227-3399 Mountain View Ave N           Renton WA 98056     USA

Many of these are large properties that would be expected to sell for a high price. Since a price of 0 is most likely an error or a missing value, we will remove these observations:

df_house_price <- df_house_price[df_house_price$price != 0,]
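It can be useful to record how many rows a filter removes. In this report the data go from 4600 rows (the dim() output above) to 4551 rows (the str() output further below), so 49 zero-price rows are dropped. The sketch below shows the bookkeeping on a small toy data frame; the object names are assumptions made for illustration.

```r
# Toy data: two of the five rows have a price of 0
toy <- data.frame(
  price       = c(313000, 0, 420000, 0, 550000),
  sqft_living = c(1340, 2600, 2000, 3200, 1940)
)

n_before <- nrow(toy)            # rows before filtering
n_zero   <- sum(toy$price == 0)  # rows that will be dropped

# Keep only rows with a non-zero price (same filter as in the report)
toy_clean <- toy[toy$price != 0, ]
n_after <- nrow(toy_clean)

# The counts should reconcile exactly
stopifnot(n_after == n_before - n_zero)
```

Note that this filter assumes price contains no NA values; with NAs present, logical indexing would return rows of NA, and which(toy$price != 0) would be the safer index.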

2) There are properties with 0 bedrooms or 0 bathrooms (to be verified—may not necessarily be incorrect).

df_house_price[df_house_price$bedrooms == 0 | df_house_price$bathrooms == 0,]
##                     date   price bedrooms bathrooms sqft_living sqft_lot floors
## 2366 2014-06-12 00:00:00 1095000        0         0        3064     4764    3.5
## 3210 2014-06-24 00:00:00 1295648        0         0        4810    28008    2.0
##      waterfront view condition sqft_above sqft_basement yr_built yr_renovated
## 2366          0    2         3       3064             0     1990         2009
## 3210          0    0         3       4810             0     1990         2009
##                street    city statezip country
## 2366    814 E Howe St Seattle WA 98102     USA
## 3210 20418 NE 64th Pl Redmond WA 98053     USA

Only 2 properties have 0 bedrooms or 0 bathrooms, and both of them have 0 in both fields. Their remaining values look plausible, so we will keep these observations.

3) The maximum value for sqft_lot is 1,074,218, which seems unusually high.

head(df_house_price[order(-df_house_price$sqft_lot),],10)
##                     date     price bedrooms bathrooms sqft_living sqft_lot
## 1079 2014-05-21 00:00:00  542500.0        5      3.25        3010  1074218
## 2481 2014-06-13 00:00:00  849900.0        2      2.00        2280   641203
## 3488 2014-06-26 00:00:00  667000.0        3      1.75        3320   478288
## 376  2014-05-08 00:00:00  330000.0        2      2.00        1550   435600
## 880  2014-05-19 00:00:00  480000.0        4      3.50        3370   435600
## 1540 2014-05-29 00:00:00  302000.0        2      1.00         900   423838
## 3057 2014-06-23 00:00:00  230000.0        3      1.00        1530   389126
## 241  2014-05-07 00:00:00  630000.0        3      2.50        2680   327135
## 123  2014-05-05 00:00:00 2280000.0        7      8.00       13540   307752
## 4354 2014-05-05 00:00:00  117833.3        3      1.00        1340   306848
##      floors waterfront view condition sqft_above sqft_basement yr_built
## 1079    1.5          0    0         5       2010          1000     1931
## 2481    2.0          0    0         3       2280             0     1990
## 3488    1.5          0    3         4       2260          1060     1933
## 376     1.5          0    0         2       1550             0     1972
## 880     2.0          0    3         3       3370             0     2005
## 1540    1.0          0    2         5        900             0     1925
## 3057    1.5          0    0         4       1530             0     1919
## 241     2.0          0    0         3       2680             0     1995
## 123     3.0          0    4         3       9410          4130     1999
## 4354    1.0          0    0         3       1340             0     1953
##      yr_renovated                    street       city statezip country
## 1079            0  16200-16398 252nd Ave SE   Issaquah WA 98027     USA
## 2481         2009          9326 SW 216th St     Vashon WA 98070     USA
## 3488         1982        40201 292nd Ave SE   Enumclaw WA 98022     USA
## 376             0          36521 SE 94th St Snoqualmie WA 98065     USA
## 880             0      44250 SE Edgewick Rd North Bend WA 98045     USA
## 1540            0         18923 SE 416th St   Enumclaw WA 98022     USA
## 3057         1985  24727 SE Mud Mountain Rd   Enumclaw WA 98022     USA
## 241             0         25339 SE 248th St Ravensdale WA 98051     USA
## 123             0          26408 NE 70th St    Redmond WA 98053     USA
## 4354            0 17827 Mountain View Rd NE     Duvall WA 98019     USA

The highest value seems correct after reviewing the property details on Google Maps. The property appears to have a large surrounding lot, so the high sqft_lot value is likely accurate.
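Converting square feet to acres (1 acre = 43,560 square feet) makes these lot sizes easier to judge: the largest lot is roughly 24.7 acres, and the two lots recorded as 435,600 square feet are exactly 10 acres each, which is consistent with rural parcels. A minimal sketch of the conversion, using the four largest sqft_lot values from the output above (sqft_to_acres is a helper defined here, not a library function):

```r
SQFT_PER_ACRE <- 43560  # definition of the acre in square feet

# Helper defined for this example: convert square feet to acres
sqft_to_acres <- function(sqft) sqft / SQFT_PER_ACRE

# The four largest sqft_lot values printed above
largest_lots <- c(1074218, 641203, 478288, 435600)

acres <- round(sqft_to_acres(largest_lots), 2)
print(acres)
```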

4) Some observations have 0 for yr_renovated

head(df_house_price[df_house_price$yr_renovated==0,],10)
##                   date   price bedrooms bathrooms sqft_living sqft_lot floors
## 2  2014-05-02 00:00:00 2384000        5      2.50        3650     9050      2
## 3  2014-05-02 00:00:00  342000        3      2.00        1930    11947      1
## 4  2014-05-02 00:00:00  420000        3      2.25        2000     8030      1
## 7  2014-05-02 00:00:00  335000        2      2.00        1350     2560      1
## 8  2014-05-02 00:00:00  482000        4      2.50        2710    35868      2
## 9  2014-05-02 00:00:00  452500        3      2.50        2430    88426      1
## 13 2014-05-02 00:00:00  588500        3      1.75        2330    14892      1
## 16 2014-05-02 00:00:00  242500        3      1.50        1200     9720      1
## 17 2014-05-02 00:00:00  419000        3      1.50        1570     6700      1
## 18 2014-05-02 00:00:00  367500        4      3.00        3110     7231      2
##    waterfront view condition sqft_above sqft_basement yr_built yr_renovated
## 2           0    4         5       3370           280     1921            0
## 3           0    0         4       1930             0     1966            0
## 4           0    0         4       1000          1000     1963            0
## 7           0    0         3       1350             0     1976            0
## 8           0    0         3       2710             0     1989            0
## 9           0    0         4       1570           860     1985            0
## 13          0    0         3       1970           360     1980            0
## 16          0    0         4       1200             0     1965            0
## 17          0    0         4       1570             0     1956            0
## 18          0    0         3       3110             0     1997            0
##                      street         city statezip country
## 2           709 W Blaine St      Seattle WA 98119     USA
## 3  26206-26214 143rd Ave SE         Kent WA 98042     USA
## 4           857 170th Pl NE     Bellevue WA 98008     USA
## 7         2616 174th Ave NE      Redmond WA 98052     USA
## 8         23762 SE 253rd Pl Maple Valley WA 98038     USA
## 9   46611-46625 SE 129th St   North Bend WA 98045     USA
## 13         1833 220th Pl NE    Sammamish WA 98074     USA
## 16        14034 SE 201st St         Kent WA 98042     USA
## 17          15424 SE 9th St     Bellevue WA 98007     USA
## 18        11224 SE 306th Pl       Auburn WA 98092     USA
nrow(df_house_price[df_house_price$yr_renovated == 0, ])
## [1] 2706

There are 2706 rows with yr_renovated = 0, which most likely means that the property was never renovated. The column therefore mixes a placeholder value with real years, which would distort its variance and its correlations with the other variables in PCA, so I decided to remove it:

df_house_price$yr_renovated <- NULL  # drop by name; yr_renovated is column 14
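Dropping the column also discards the information about whether a property was renovated at all. One alternative, which is not the approach taken in this report, would be to recode the column as a 0/1 indicator before removing the raw year. The sketch below uses four rows from the head() output above; the column name renovated is an assumption introduced here.

```r
# Four rows (1, 2, 3 and 5) taken from the head() output above
toy <- data.frame(
  yr_built     = c(1955, 1921, 1966, 1976),
  yr_renovated = c(2005, 0, 0, 1992)
)

# 1 if a renovation year is recorded, 0 if the placeholder 0 is present
toy$renovated <- as.integer(toy$yr_renovated > 0)

# The raw year column is then removed by name
toy$yr_renovated <- NULL

print(toy)
```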

5) The variable street is not numerical

The street variable is a free-text address that cannot be used directly in PCA, so we remove it from the dataset:

df_house_price$street <- NULL  # drop by name rather than by position

6) The variable city is not numerical

The city variable may actually be useful for our analysis, so instead of dropping it we convert it to a binary indicator: 1 when the property is in an “expensive” city and 0 otherwise. I used AI to check where properties are more expensive, and this is the list: Medina, Clyde Hill, Bellevue, Mercer Island, Kirkland, Redmond, Sammamish, Issaquah.

top_cities <- c("Medina", "Clyde Hill", "Bellevue", "Mercer Island", 
                "Kirkland", "Redmond", "Sammamish", "Issaquah")
df_house_price$expensive_city <- ifelse(df_house_price$city %in% top_cities,1,0)
head(df_house_price[, c("city", "expensive_city")], 10)
##            city expensive_city
## 1     Shoreline              0
## 2       Seattle              0
## 3          Kent              0
## 4      Bellevue              1
## 5       Redmond              1
## 6       Seattle              0
## 7       Redmond              1
## 8  Maple Valley              0
## 9    North Bend              0
## 10      Seattle              0

This transformation works as expected. Now, we can safely remove the original city variable:

df_house_price$city <- NULL  # drop by name rather than by position
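The list of expensive cities could also be derived from the data rather than from an external source. The sketch below is one possible way to do this and is not what the report does: it ranks cities by median price and flags those above the overall median. The prices in the toy data frame are invented for illustration, and the cutoff rule is an assumption.

```r
# Invented toy prices, two sales per city
toy <- data.frame(
  city  = c("Seattle", "Seattle", "Bellevue", "Bellevue", "Kent", "Kent"),
  price = c(400000, 500000, 800000, 900000, 300000, 350000),
  stringsAsFactors = FALSE
)

# Median sale price per city
median_by_city <- aggregate(price ~ city, data = toy, FUN = median)

# Hypothetical rule: a city is "expensive" if its median exceeds the overall median
cutoff <- median(toy$price)
data_driven_top <- median_by_city$city[median_by_city$price > cutoff]

toy$expensive_city <- as.integer(toy$city %in% data_driven_top)
print(median_by_city)
print(toy)
```

Because price is itself one of the variables entering the PCA, an indicator built from price is not independent of it, which is worth keeping in mind before adopting a rule like this.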

7) The variable statezip is not numerical

Since statezip is not essential for further analysis, we will remove it from the dataset together with country and date in step 9 below.

8) The variable country is not numerical

unique(df_house_price$country)
## [1] "USA"

There is only one country listed, so this variable has zero variance and is redundant. It is removed in step 9 below.
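Constant columns like this one can be found for the whole data frame in a single pass, which matters for PCA because prcomp() cannot rescale a zero-variance column to unit variance when scale. = TRUE. A minimal sketch on toy data; the name constant_cols is an assumption made for this example.

```r
# Toy data with one constant column
toy <- data.frame(
  price   = c(313000, 2384000, 342000),
  city    = c("Shoreline", "Seattle", "Kent"),
  country = c("USA", "USA", "USA"),
  stringsAsFactors = FALSE
)

# TRUE for every column that holds a single distinct value
is_constant <- sapply(toy, function(x) length(unique(x)) == 1)
constant_cols <- names(toy)[is_constant]

print(constant_cols)
```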

9) The variable date is not numerical

The date variable isn’t necessary for our analysis, so we remove it here together with statezip and country:

df_house_price <- df_house_price %>%
  select(-date, -statezip, -country)

Summary of variables after changes

str(df_house_price)
## 'data.frame':    4551 obs. of  13 variables:
##  $ price         : num  313000 2384000 342000 420000 550000 ...
##  $ bedrooms      : num  3 5 3 3 4 2 2 4 3 4 ...
##  $ bathrooms     : num  1.5 2.5 2 2.25 2.5 1 2 2.5 2.5 2 ...
##  $ sqft_living   : int  1340 3650 1930 2000 1940 880 1350 2710 2430 1520 ...
##  $ sqft_lot      : int  7912 9050 11947 8030 10500 6380 2560 35868 88426 6200 ...
##  $ floors        : num  1.5 2 1 1 1 1 1 2 1 1.5 ...
##  $ waterfront    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ view          : int  0 4 0 0 0 0 0 0 0 0 ...
##  $ condition     : int  3 5 4 4 4 3 3 3 4 3 ...
##  $ sqft_above    : int  1340 3370 1930 1000 1140 880 1350 2710 1570 1520 ...
##  $ sqft_basement : int  0 280 0 1000 800 0 0 0 860 0 ...
##  $ yr_built      : int  1955 1921 1966 1963 1976 1938 1976 1989 1985 1945 ...
##  $ expensive_city: num  0 0 0 1 1 0 1 0 0 0 ...

Now every variable in the dataset is of numeric or integer type.

Missing values

summary(df_house_price)
##      price             bedrooms       bathrooms      sqft_living   
##  Min.   :    7800   Min.   :0.000   Min.   :0.000   Min.   :  370  
##  1st Qu.:  326264   1st Qu.:3.000   1st Qu.:1.750   1st Qu.: 1460  
##  Median :  465000   Median :3.000   Median :2.250   Median : 1970  
##  Mean   :  557906   Mean   :3.395   Mean   :2.155   Mean   : 2132  
##  3rd Qu.:  657500   3rd Qu.:4.000   3rd Qu.:2.500   3rd Qu.: 2610  
##  Max.   :26590000   Max.   :9.000   Max.   :8.000   Max.   :13540  
##     sqft_lot           floors        waterfront            view       
##  Min.   :    638   Min.   :1.000   Min.   :0.000000   Min.   :0.0000  
##  1st Qu.:   5000   1st Qu.:1.000   1st Qu.:0.000000   1st Qu.:0.0000  
##  Median :   7680   Median :1.500   Median :0.000000   Median :0.0000  
##  Mean   :  14835   Mean   :1.512   Mean   :0.006592   Mean   :0.2347  
##  3rd Qu.:  10978   3rd Qu.:2.000   3rd Qu.:0.000000   3rd Qu.:0.0000  
##  Max.   :1074218   Max.   :3.500   Max.   :1.000000   Max.   :4.0000  
##    condition       sqft_above   sqft_basement       yr_built   
##  Min.   :1.000   Min.   : 370   Min.   :   0.0   Min.   :1900  
##  1st Qu.:3.000   1st Qu.:1190   1st Qu.:   0.0   1st Qu.:1951  
##  Median :3.000   Median :1590   Median :   0.0   Median :1976  
##  Mean   :3.449   Mean   :1822   Mean   : 310.2   Mean   :1971  
##  3rd Qu.:4.000   3rd Qu.:2300   3rd Qu.: 600.0   3rd Qu.:1997  
##  Max.   :5.000   Max.   :9410   Max.   :4820.0   Max.   :2014  
##  expensive_city  
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.2553  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

The summary() output reports no NA counts for any column, so there are no missing values in our dataset.
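Missing values can also be counted explicitly rather than inferred from summary(). The sketch below uses a toy data frame with one NA inserted on purpose so that the counts are non-trivial.

```r
# Toy data with one deliberately missing price
toy <- data.frame(
  price    = c(313000, NA, 342000),
  bedrooms = c(3, 5, 3)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# TRUE if the data frame contains any missing value at all
any_missing <- anyNA(toy)

# Rows with no missing values, in case some had to be dropped
complete_rows <- toy[complete.cases(toy), ]

print(na_per_column)
```

For df_house_price, anyNA() would be expected to return FALSE, in line with the summary() output above.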

Correlation

cor_df <- cor(df_house_price)
corrplot(cor_df, type = "full", order = "hclust", tl.col = "black", tl.cex = 0.6, addCoef.col = "black", number.cex = 0.6)

We see high correlation (> 0.7) between:

  • sqft_living and bathrooms
  • sqft_above and sqft_living

Both correlations are expected. Larger houses generally have more bathrooms, which explains the link between sqft_living and bathrooms. The above-ground square footage is part of the total living area, which explains the link between sqft_above and sqft_living.
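Strongly correlated pairs can be extracted from the correlation matrix programmatically instead of being read off the plot. The sketch below does this for the first seven rows printed by head() earlier, and also checks whether sqft_living equals sqft_above + sqft_basement. That identity holds for these seven rows; whether it holds for every row is an assumption to verify on the full data. An exact linear dependency of this kind would be consistent with the near-zero standard deviation of the last component in the PCA output below.

```r
# First seven rows of the relevant columns, copied from the head() output
house <- data.frame(
  sqft_living   = c(1340, 3650, 1930, 2000, 1940, 880, 1350),
  sqft_above    = c(1340, 3370, 1930, 1000, 1140, 880, 1350),
  sqft_basement = c(0, 280, 0, 1000, 800, 0, 0),
  floors        = c(1.5, 2, 1, 1, 1, 1, 1)
)

cor_mat <- cor(house)

# Blank the diagonal and lower triangle so each pair appears only once
cor_mat[lower.tri(cor_mat, diag = TRUE)] <- NA

# Row and column positions of correlations above 0.7 in absolute value
high <- which(abs(cor_mat) > 0.7, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high]
)
print(high_pairs)

# Does total living area equal above-ground plus basement area?
exact_sum <- all(house$sqft_living == house$sqft_above + house$sqft_basement)
print(exact_sum)
```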

Principal Component Analysis (PCA)

Optimal number of components

df <- df_house_price #creating a copy of our df
df.pca1 <- prcomp(df, scale.=TRUE)
df.pca1
## Standard deviations (1, .., p=13):
##  [1] 2.002377e+00 1.429070e+00 1.107972e+00 1.001491e+00 9.817904e-01
##  [6] 9.259522e-01 8.494600e-01 7.847303e-01 7.369407e-01 6.354856e-01
## [11] 6.208298e-01 4.760509e-01 2.347354e-15
## 
## Rotation (n x k) = (13 x 13):
##                        PC1         PC2           PC3          PC4         PC5
## price           0.25134313 -0.20544526  1.279477e-01  0.079604574 -0.33822620
## bedrooms        0.32030372 -0.14587194 -3.309437e-01  0.118808781  0.22430130
## bathrooms       0.43441870  0.01768608 -7.891571e-02  0.103885947  0.14750297
## sqft_living     0.46003467 -0.14174728 -9.019708e-02 -0.040310482  0.06188388
## sqft_lot        0.10526724 -0.04549900 -3.557951e-02 -0.953920156  0.02656855
## floors          0.27217028  0.37233973  1.856393e-01  0.089574776  0.18388090
## waterfront      0.07127180 -0.20325247  6.823707e-01  0.011032768 -0.13306184
## view            0.15621882 -0.34791111  4.904376e-01 -0.005211432  0.13372690
## condition      -0.09399809 -0.40296868 -2.372956e-01 -0.046947230 -0.11973680
## sqft_above      0.43865089  0.11532660  5.327237e-05 -0.130339454 -0.01926456
## sqft_basement   0.14061738 -0.50660322 -1.867353e-01  0.157653852  0.16368086
## yr_built        0.24698767  0.42387089  4.728165e-02  0.057776934 -0.03404764
## expensive_city  0.19290521  0.03241068 -1.563359e-01  0.040110459 -0.83482940
##                         PC6         PC7          PC8         PC9        PC10
## price          -0.409235857 -0.63505138 -0.164220660  0.24157013  0.31438304
## bedrooms        0.052804931  0.25103765 -0.395328025 -0.30639381  0.55401443
## bathrooms       0.026415306  0.08416448  0.099113765  0.24665893 -0.07795297
## sqft_living     0.002868739 -0.02800623 -0.041449599 -0.06793876 -0.33275635
## sqft_lot        0.097503682 -0.04355247 -0.006381238  0.12297297  0.08072408
## floors         -0.359263123  0.09531496  0.189094782  0.08042993 -0.22364005
## waterfront      0.105865762  0.42227470 -0.472091360  0.23427704 -0.05140168
## view            0.097530053 -0.05185919  0.577724257 -0.43188080  0.24990574
## condition      -0.530379613  0.51430703  0.317011623  0.28067222  0.10503861
## sqft_above     -0.222704617  0.05625388 -0.095057158 -0.26490705 -0.22226250
## sqft_basement   0.417831478 -0.16199317  0.090041655  0.34936987 -0.27746522
## yr_built        0.312574683  0.10422732  0.270979274  0.45160532  0.45683656
## expensive_city  0.257971058  0.17045070  0.147319704 -0.21060153 -0.09839353
##                        PC11        PC12          PC13
## price           0.049460102  0.01194271  2.169661e-16
## bedrooms        0.264143151  0.09748506  1.512706e-15
## bathrooms       0.030648643 -0.82869107  2.667399e-15
## sqft_living    -0.290952708  0.24529503  7.014339e-01
## sqft_lot        0.203248087 -0.02280740 -2.955910e-17
## floors          0.640019679  0.26639802 -2.653631e-16
## waterfront      0.010938992  0.01051222 -5.093739e-17
## view            0.003584873 -0.03407131  2.620599e-16
## condition      -0.115697550  0.08014919  7.357615e-17
## sqft_above     -0.431978480  0.12596573 -6.269600e-01
## sqft_basement   0.196908093  0.27459161 -3.389862e-01
## yr_built       -0.292765290  0.26878065 -4.604561e-17
## expensive_city  0.258159150 -0.02758985  3.518910e-17
df.pca1$rotation
##                        PC1         PC2           PC3          PC4         PC5
## price           0.25134313 -0.20544526  1.279477e-01  0.079604574 -0.33822620
## bedrooms        0.32030372 -0.14587194 -3.309437e-01  0.118808781  0.22430130
## bathrooms       0.43441870  0.01768608 -7.891571e-02  0.103885947  0.14750297
## sqft_living     0.46003467 -0.14174728 -9.019708e-02 -0.040310482  0.06188388
## sqft_lot        0.10526724 -0.04549900 -3.557951e-02 -0.953920156  0.02656855
## floors          0.27217028  0.37233973  1.856393e-01  0.089574776  0.18388090
## waterfront      0.07127180 -0.20325247  6.823707e-01  0.011032768 -0.13306184
## view            0.15621882 -0.34791111  4.904376e-01 -0.005211432  0.13372690
## condition      -0.09399809 -0.40296868 -2.372956e-01 -0.046947230 -0.11973680
## sqft_above      0.43865089  0.11532660  5.327237e-05 -0.130339454 -0.01926456
## sqft_basement   0.14061738 -0.50660322 -1.867353e-01  0.157653852  0.16368086
## yr_built        0.24698767  0.42387089  4.728165e-02  0.057776934 -0.03404764
## expensive_city  0.19290521  0.03241068 -1.563359e-01  0.040110459 -0.83482940
##                         PC6         PC7          PC8         PC9        PC10
## price          -0.409235857 -0.63505138 -0.164220660  0.24157013  0.31438304
## bedrooms        0.052804931  0.25103765 -0.395328025 -0.30639381  0.55401443
## bathrooms       0.026415306  0.08416448  0.099113765  0.24665893 -0.07795297
## sqft_living     0.002868739 -0.02800623 -0.041449599 -0.06793876 -0.33275635
## sqft_lot        0.097503682 -0.04355247 -0.006381238  0.12297297  0.08072408
## floors         -0.359263123  0.09531496  0.189094782  0.08042993 -0.22364005
## waterfront      0.105865762  0.42227470 -0.472091360  0.23427704 -0.05140168
## view            0.097530053 -0.05185919  0.577724257 -0.43188080  0.24990574
## condition      -0.530379613  0.51430703  0.317011623  0.28067222  0.10503861
## sqft_above     -0.222704617  0.05625388 -0.095057158 -0.26490705 -0.22226250
## sqft_basement   0.417831478 -0.16199317  0.090041655  0.34936987 -0.27746522
## yr_built        0.312574683  0.10422732  0.270979274  0.45160532  0.45683656
## expensive_city  0.257971058  0.17045070  0.147319704 -0.21060153 -0.09839353
##                        PC11        PC12          PC13
## price           0.049460102  0.01194271  2.169661e-16
## bedrooms        0.264143151  0.09748506  1.512706e-15
## bathrooms       0.030648643 -0.82869107  2.667399e-15
## sqft_living    -0.290952708  0.24529503  7.014339e-01
## sqft_lot        0.203248087 -0.02280740 -2.955910e-17
## floors          0.640019679  0.26639802 -2.653631e-16
## waterfront      0.010938992  0.01051222 -5.093739e-17
## view            0.003584873 -0.03407131  2.620599e-16
## condition      -0.115697550  0.08014919  7.357615e-17
## sqft_above     -0.431978480  0.12596573 -6.269600e-01
## sqft_basement   0.196908093  0.27459161 -3.389862e-01
## yr_built       -0.292765290  0.26878065 -4.604561e-17
## expensive_city  0.258159150 -0.02758985  3.518910e-17
fviz_eig(df.pca1, addlabels = TRUE)

# Scree plot of eigenvalues with a dashed reference line at eigenvalue = 1
fviz_eig(df.pca1, choice = "eigenvalue", addlabels = TRUE, main = "Eigenvalues") +
  geom_hline(yintercept = 1, linetype = "dashed")

summary(df.pca1)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0024 1.4291 1.10797 1.00149 0.98179 0.92595 0.84946
## Proportion of Variance 0.3084 0.1571 0.09443 0.07715 0.07415 0.06595 0.05551
## Cumulative Proportion  0.3084 0.4655 0.55995 0.63710 0.71125 0.77720 0.83271
##                            PC8     PC9    PC10    PC11    PC12      PC13
## Standard deviation     0.78473 0.73694 0.63549 0.62083 0.47605 2.347e-15
## Proportion of Variance 0.04737 0.04178 0.03106 0.02965 0.01743 0.000e+00
## Cumulative Proportion  0.88008 0.92185 0.95292 0.98257 1.00000 1.000e+00
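
The last component, PC13, has a standard deviation of essentially zero (2.347e-15) and loads only on sqft_living, sqft_above, and sqft_basement. That pattern is what an exact linear relationship among those three variables would produce, for example if living area equals above-ground area plus basement area. The sketch below reproduces the effect on synthetic data; the simulated values, ranges, and object names (toy, toy_pca) are assumptions for illustration and are not taken from the housing dataset.

```r
# Synthetic illustration (made-up numbers, not the housing data)
set.seed(42)
n <- 500
sqft_above    <- runif(n, 500, 4000)
sqft_basement <- runif(n, 0, 1500)
sqft_living   <- sqft_above + sqft_basement  # assumed exact linear dependence

toy <- data.frame(sqft_living, sqft_above, sqft_basement)
toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# The last component's standard deviation is zero up to floating-point error
last_sd <- toy_pca$sdev[3]
print(toy_pca$sdev)

# Its loadings are proportional (up to sign) to
# (sd(sqft_living), -sd(sqft_above), -sd(sqft_basement))
print(toy_pca$rotation[, 3])
```

A component like this carries no variance, so dropping one of the three dependent variables before running PCA would remove it without losing information.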

The cumulative proportion of variance first exceeds the 70% threshold with five principal components. The fifth component has a standard deviation just below 1 (0.98, an eigenvalue of about 0.96), so the eigenvalue-greater-than-one rule alone would stop at four components. However, the first four components capture only 63.7% of the variance, which risks oversimplifying the data, whereas adding the fifth raises the cumulative variance to 71.13%. Five components were therefore selected.
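
The two selection rules discussed here can be checked with a few lines of R. This is a minimal sketch that assumes the standard deviations are copied by hand from the summary(df.pca1) output; the variable names (sdev, n_kaiser, n_components_70) are illustrative only.

```r
# Standard deviations copied from the summary(df.pca1) output (PC13 is ~0)
sdev <- c(2.0024, 1.4291, 1.10797, 1.00149, 0.98179, 0.92595, 0.84946,
          0.78473, 0.73694, 0.63549, 0.62083, 0.47605, 2.347e-15)

eigenvalues <- sdev^2                         # variance of each component
prop_var    <- eigenvalues / sum(eigenvalues) # proportion of variance
cum_var     <- cumsum(prop_var)               # cumulative proportion

n_kaiser        <- sum(eigenvalues > 1)       # eigenvalue-greater-than-one rule
n_components_70 <- which(cum_var >= 0.70)[1]  # first component reaching 70%

print(round(cum_var[1:5], 4))
print(c(kaiser = n_kaiser, threshold_70 = n_components_70))
```

With these values the eigenvalue rule keeps four components and the 70% threshold keeps five, which is the trade-off described above.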

PCA components analysis

fviz_pca_var(df.pca1, col.var="contrib")

df.pca1$rotation[,1:5]
##                        PC1         PC2           PC3          PC4         PC5
## price           0.25134313 -0.20544526  1.279477e-01  0.079604574 -0.33822620
## bedrooms        0.32030372 -0.14587194 -3.309437e-01  0.118808781  0.22430130
## bathrooms       0.43441870  0.01768608 -7.891571e-02  0.103885947  0.14750297
## sqft_living     0.46003467 -0.14174728 -9.019708e-02 -0.040310482  0.06188388
## sqft_lot        0.10526724 -0.04549900 -3.557951e-02 -0.953920156  0.02656855
## floors          0.27217028  0.37233973  1.856393e-01  0.089574776  0.18388090
## waterfront      0.07127180 -0.20325247  6.823707e-01  0.011032768 -0.13306184
## view            0.15621882 -0.34791111  4.904376e-01 -0.005211432  0.13372690
## condition      -0.09399809 -0.40296868 -2.372956e-01 -0.046947230 -0.11973680
## sqft_above      0.43865089  0.11532660  5.327237e-05 -0.130339454 -0.01926456
## sqft_basement   0.14061738 -0.50660322 -1.867353e-01  0.157653852  0.16368086
## yr_built        0.24698767  0.42387089  4.728165e-02  0.057776934 -0.03404764
## expensive_city  0.19290521  0.03241068 -1.563359e-01  0.040110459 -0.83482940
PC1 <- fviz_contrib(df.pca1, "var", axes=1)
PC2 <- fviz_contrib(df.pca1, "var", axes=2)
PC3 <- fviz_contrib(df.pca1, "var", axes=3)
PC4 <- fviz_contrib(df.pca1, "var", axes=4)
PC5 <- fviz_contrib(df.pca1, "var", axes=5)

plot_grid(PC1, PC2, PC3, PC4, PC5, ncol = 2)

PC1

The first principal component (PC1) is driven by house size and overall quality, with the largest contributions from living area (sqft_living), above-ground square footage (sqft_above), and number of bathrooms. It reflects the general livability and functionality of homes.
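
The contributions behind this reading can be recomputed from the PC1 loadings. The sketch below assumes that a variable's contribution to a component is its squared loading expressed in percent, which should match what fviz_contrib plots for a prcomp object; the loadings are copied from the rotation output, and the names pc1, contrib_pc1, and expected are illustrative.

```r
# PC1 loadings copied from df.pca1$rotation[, 1]
pc1 <- c(price = 0.25134313, bedrooms = 0.32030372, bathrooms = 0.43441870,
         sqft_living = 0.46003467, sqft_lot = 0.10526724, floors = 0.27217028,
         waterfront = 0.07127180, view = 0.15621882, condition = -0.09399809,
         sqft_above = 0.43865089, sqft_basement = 0.14061738,
         yr_built = 0.24698767, expensive_city = 0.19290521)

# Loadings form a unit vector, so squared loadings sum to 1;
# in percent they give each variable's contribution to PC1
contrib_pc1 <- 100 * pc1^2 / sum(pc1^2)

# Contribution each variable would have if all 13 contributed equally
expected <- 100 / length(pc1)

top_pc1 <- sort(contrib_pc1, decreasing = TRUE)
print(round(top_pc1, 2))
print(names(top_pc1)[top_pc1 > expected])
```

Variables above the equal-contribution level are sqft_living, sqft_above, bathrooms, and bedrooms, all of which describe the size of the house.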

PC2

PC2 captures structural and architectural features: it loads positively on year built and number of floors, and negatively on basement size and condition. This pattern points to trends in design and construction style, possibly because newer houses tend to have more floors and smaller basements, or none at all.

PC3

PC3 focuses on luxury and aesthetic appeal, dominated by waterfront properties and better views.

PC4

PC4 separates homes by land area. It is dominated by sqft_lot (loading of about -0.95), so homes with larger lots score lower on this component.

PC5

PC5 relates to location in expensive cities. The expensive_city variable has a strong negative loading (about -0.83), followed by price (about -0.34), so homes in “expensive” cities score lower on this component, which helps separate them from more affordable homes elsewhere.

Conclusion

The study simplified the housing market dataset by reducing its dimensions while preserving most of its information (variance). Based on the cumulative variance criterion, five principal components were retained. Together they explain about 71% of the total variance and summarize the dataset’s key characteristics: house size, construction style, luxury features, lot size, and location in expensive cities.
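
As a closing illustration, the reduced dataset consists of the component scores stored in the x element of a prcomp object. The sketch below uses R's built-in mtcars data rather than the housing data, so it runs on its own; the choice of k = 5 mirrors the number of components selected in this study, and the object names are illustrative.

```r
# Illustration on R's built-in mtcars data, not the housing dataset
toy_pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

k <- 5                       # number of components to keep
reduced <- toy_pca$x[, 1:k]  # component scores: one row per observation

# Scores are the standardized data projected onto the first k loading vectors
manual <- scale(mtcars) %*% toy_pca$rotation[, 1:k]

print(dim(reduced))
print(all.equal(unname(reduced), unname(manual)))
```

Applied to this study, the analogous object would be df.pca1$x[, 1:5], assuming df.pca1 was created with prcomp as the output suggests.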