This project applies Principal Component Analysis (PCA) to reduce the dimensionality of a housing market dataset. Its source describes it as covering properties in Sydney and Melbourne, Australia, although the records themselves list cities in Washington State, USA. The goal is to identify the most significant factors influencing housing market trends while simplifying the dataset. PCA transforms the original variables into a smaller set of uncorrelated components that retain as much variance as possible, that is, as much of the information in the original dataset as possible. Dimension reduction will allow for clearer insights into the key drivers of property prices and characteristics.
library(dplyr)
library(corrplot)
library(factoextra)
library(gridExtra)
library(cowplot)
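To make the idea of "retained variance" concrete, here is a minimal sketch on simulated data. The `toy` data frame and its column names are hypothetical and unrelated to the housing dataset; the sketch only uses base R's `prcomp()` and derives each component's share of variance from the component standard deviations.

```r
# Toy data (hypothetical): three variables driven by a common "size" factor
# plus one unrelated variable.
set.seed(1)
size <- rnorm(200)
toy <- data.frame(
  living = size + rnorm(200, sd = 0.3),
  above  = size + rnorm(200, sd = 0.3),
  baths  = size + rnorm(200, sd = 0.5),
  noise  = rnorm(200)
)

# Standardise every variable, then run PCA
toy_pca <- prcomp(toy, scale. = TRUE)

# Variance of each component is its squared standard deviation;
# dividing by the total gives the share of variance it carries
var_share <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
cum_share <- cumsum(var_share)
round(rbind(var_share, cum_share), 3)
```

Because the first three toy variables share one underlying factor, most of the variance should end up in the first component, which is the situation PCA is designed to exploit.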
The dataset was downloaded from Kaggle, where it was originally published for housing price prediction. Its description refers to properties in Sydney and Melbourne, Australia, but the street, city, statezip and country columns point to Washington State, USA. You can access the dataset here: https://www.kaggle.com/datasets/shree1992/housedata
Variables in the dataset:
Firstly, let’s import our dataset and look at the type of each variable.
df_house_price<-read.csv("data.csv", sep=",", dec=".", header=TRUE)
summary(df_house_price)
## date price bedrooms bathrooms
## Length:4600 Min. : 0 Min. :0.000 Min. :0.000
## Class :character 1st Qu.: 322875 1st Qu.:3.000 1st Qu.:1.750
## Mode :character Median : 460943 Median :3.000 Median :2.250
## Mean : 551963 Mean :3.401 Mean :2.161
## 3rd Qu.: 654962 3rd Qu.:4.000 3rd Qu.:2.500
## Max. :26590000 Max. :9.000 Max. :8.000
## sqft_living sqft_lot floors waterfront
## Min. : 370 Min. : 638 Min. :1.000 Min. :0.000000
## 1st Qu.: 1460 1st Qu.: 5001 1st Qu.:1.000 1st Qu.:0.000000
## Median : 1980 Median : 7683 Median :1.500 Median :0.000000
## Mean : 2139 Mean : 14852 Mean :1.512 Mean :0.007174
## 3rd Qu.: 2620 3rd Qu.: 11001 3rd Qu.:2.000 3rd Qu.:0.000000
## Max. :13540 Max. :1074218 Max. :3.500 Max. :1.000000
## view condition sqft_above sqft_basement
## Min. :0.0000 Min. :1.000 Min. : 370 Min. : 0.0
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:1190 1st Qu.: 0.0
## Median :0.0000 Median :3.000 Median :1590 Median : 0.0
## Mean :0.2407 Mean :3.452 Mean :1827 Mean : 312.1
## 3rd Qu.:0.0000 3rd Qu.:4.000 3rd Qu.:2300 3rd Qu.: 610.0
## Max. :4.0000 Max. :5.000 Max. :9410 Max. :4820.0
## yr_built yr_renovated street city
## Min. :1900 Min. : 0.0 Length:4600 Length:4600
## 1st Qu.:1951 1st Qu.: 0.0 Class :character Class :character
## Median :1976 Median : 0.0 Mode :character Mode :character
## Mean :1971 Mean : 808.6
## 3rd Qu.:1997 3rd Qu.:1999.0
## Max. :2014 Max. :2014.0
## statezip country
## Length:4600 Length:4600
## Class :character Class :character
## Mode :character Mode :character
##
##
##
dim(df_house_price)
## [1] 4600 18
head(df_house_price,7)
## date price bedrooms bathrooms sqft_living sqft_lot floors
## 1 2014-05-02 00:00:00 313000 3 1.50 1340 7912 1.5
## 2 2014-05-02 00:00:00 2384000 5 2.50 3650 9050 2.0
## 3 2014-05-02 00:00:00 342000 3 2.00 1930 11947 1.0
## 4 2014-05-02 00:00:00 420000 3 2.25 2000 8030 1.0
## 5 2014-05-02 00:00:00 550000 4 2.50 1940 10500 1.0
## 6 2014-05-02 00:00:00 490000 2 1.00 880 6380 1.0
## 7 2014-05-02 00:00:00 335000 2 2.00 1350 2560 1.0
## waterfront view condition sqft_above sqft_basement yr_built yr_renovated
## 1 0 0 3 1340 0 1955 2005
## 2 0 4 5 3370 280 1921 0
## 3 0 0 4 1930 0 1966 0
## 4 0 0 4 1000 1000 1963 0
## 5 0 0 4 1140 800 1976 1992
## 6 0 0 3 880 0 1938 1994
## 7 0 0 3 1350 0 1976 0
## street city statezip country
## 1 18810 Densmore Ave N Shoreline WA 98133 USA
## 2 709 W Blaine St Seattle WA 98119 USA
## 3 26206-26214 143rd Ave SE Kent WA 98042 USA
## 4 857 170th Pl NE Bellevue WA 98008 USA
## 5 9105 170th Ave NE Redmond WA 98052 USA
## 6 522 NE 88th St Seattle WA 98115 USA
## 7 2616 174th Ave NE Redmond WA 98052 USA
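After reviewing the dataset, I see a few potential issues: some properties have a price of 0, some have 0 bedrooms or bathrooms, the maximum sqft_lot is extremely large, and yr_renovated is 0 for many properties.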
Let’s check them:
df_house_price[df_house_price$price == 0,]
## date price bedrooms bathrooms sqft_living sqft_lot floors
## 4355 2014-05-05 00:00:00 0 3 1.75 1490 10125 1.0
## 4357 2014-05-05 00:00:00 0 4 2.75 2600 5390 1.0
## 4358 2014-05-05 00:00:00 0 6 2.75 3200 9200 1.0
## 4359 2014-05-06 00:00:00 0 5 3.50 3480 36615 2.0
## 4362 2014-05-07 00:00:00 0 5 1.50 1500 7112 1.0
## 4363 2014-05-07 00:00:00 0 4 4.00 3680 18804 2.0
## 4375 2014-05-09 00:00:00 0 2 2.50 2200 188200 1.0
## 4377 2014-05-09 00:00:00 0 4 2.25 2170 10500 1.0
## 4383 2014-05-12 00:00:00 0 5 4.50 4630 6324 2.0
## 4384 2014-05-13 00:00:00 0 5 4.00 4430 9000 2.0
## 4386 2014-05-13 00:00:00 0 4 4.50 5030 11023 2.0
## 4387 2014-05-13 00:00:00 0 4 1.50 2180 22870 1.0
## 4390 2014-05-15 00:00:00 0 4 3.50 4210 10308 2.0
## 4395 2014-05-16 00:00:00 0 5 3.25 3690 12353 2.0
## 4406 2014-05-20 00:00:00 0 4 3.75 3300 4545 1.5
## 4409 2014-05-21 00:00:00 0 5 2.25 2880 11965 2.0
## 4412 2014-05-22 00:00:00 0 5 2.25 2000 7900 1.0
## 4413 2014-05-22 00:00:00 0 3 3.00 1860 7440 1.0
## 4414 2014-05-22 00:00:00 0 4 3.00 1990 6180 2.0
## 4421 2014-05-27 00:00:00 0 4 1.00 1360 13372 1.0
## 4443 2014-06-02 00:00:00 0 1 1.00 720 6000 1.0
## 4449 2014-06-03 00:00:00 0 5 2.75 2740 5616 1.5
## 4454 2014-06-03 00:00:00 0 3 1.00 1300 6710 1.0
## 4455 2014-06-03 00:00:00 0 5 2.50 2090 4698 2.0
## 4473 2014-06-09 00:00:00 0 4 3.75 4060 19290 2.0
## 4479 2014-06-11 00:00:00 0 5 2.75 2910 53898 1.0
## 4480 2014-06-11 00:00:00 0 5 2.00 1910 7200 1.0
## 4481 2014-06-11 00:00:00 0 3 2.50 2880 13500 1.0
## 4482 2014-06-11 00:00:00 0 5 2.75 3240 6863 2.0
## 4488 2014-06-12 00:00:00 0 4 1.00 2080 3500 1.5
## 4500 2014-06-17 00:00:00 0 5 3.75 3870 8225 2.0
## 4508 2014-06-18 00:00:00 0 4 1.50 2310 68824 2.0
## 4510 2014-06-18 00:00:00 0 6 3.00 3020 13783 2.0
## 4521 2014-06-20 00:00:00 0 4 2.50 1960 11600 1.0
## 4522 2014-06-20 00:00:00 0 4 1.00 1810 7500 1.0
## 4523 2014-06-22 00:00:00 0 2 2.25 1490 6770 1.5
## 4524 2014-06-23 00:00:00 0 3 4.50 5230 17826 2.0
## 4529 2014-06-24 00:00:00 0 4 5.00 4550 18641 1.0
## 4535 2014-06-24 00:00:00 0 3 2.75 1310 7300 1.0
## 4543 2014-06-25 00:00:00 0 5 3.50 2640 6895 2.0
## 4553 2014-06-26 00:00:00 0 4 2.00 2100 4857 2.0
## 4555 2014-06-27 00:00:00 0 2 1.00 810 8424 1.0
## 4556 2014-06-27 00:00:00 0 2 1.50 1520 8040 1.0
## 4559 2014-06-28 00:00:00 0 4 4.25 3500 8750 1.0
## 4564 2014-07-01 00:00:00 0 2 2.25 2130 4920 1.5
## 4568 2014-07-02 00:00:00 0 4 2.50 4080 18362 2.0
## 4575 2014-07-02 00:00:00 0 3 1.00 1520 9030 1.0
## 4576 2014-07-02 00:00:00 0 5 6.25 8020 21738 2.0
## 4589 2014-07-08 00:00:00 0 4 2.25 2890 18226 3.0
## waterfront view condition sqft_above sqft_basement yr_built yr_renovated
## 4355 0 0 4 1490 0 1962 0
## 4357 0 0 4 1300 1300 1960 2001
## 4358 0 2 4 1600 1600 1953 1983
## 4359 0 0 4 2490 990 1983 0
## 4362 0 0 5 760 740 1920 0
## 4363 0 0 3 3680 0 1990 2009
## 4375 0 3 3 2200 0 2007 0
## 4377 0 2 4 1270 900 1960 2001
## 4383 0 0 3 3210 1420 2006 0
## 4384 0 0 3 4430 0 2013 1923
## 4386 0 2 3 3250 1780 2008 0
## 4387 0 0 4 1280 900 1954 1975
## 4390 0 0 3 4210 0 2006 0
## 4395 0 0 5 3690 0 1977 0
## 4406 0 4 3 2600 700 1926 1999
## 4409 0 0 4 2880 0 1990 0
## 4412 0 0 4 1300 700 1986 0
## 4413 0 0 5 1040 820 1954 0
## 4414 0 0 3 1990 0 1990 2009
## 4421 0 0 3 1360 0 1955 2005
## 4443 0 0 3 720 0 1940 1996
## 4449 0 0 5 1670 1070 1925 0
## 4454 0 0 4 1300 0 1952 0
## 4455 0 0 3 2090 0 1998 2006
## 4473 0 0 3 4060 0 2002 0
## 4479 0 0 5 1510 1400 1979 0
## 4480 0 0 4 1110 800 1951 1999
## 4481 0 4 5 1520 1360 1950 0
## 4482 0 0 3 3240 0 2013 1923
## 4488 0 0 5 1260 820 1926 0
## 4500 0 0 3 3870 0 1998 2006
## 4508 0 0 4 2310 0 1968 0
## 4510 0 0 3 3020 0 1952 2002
## 4521 0 0 5 980 980 1931 0
## 4522 0 0 2 1410 400 1959 0
## 4523 0 0 3 1490 0 1926 2003
## 4524 1 4 3 3740 1490 2005 0
## 4529 1 4 3 2600 1950 2002 0
## 4535 0 0 3 1310 0 1957 2000
## 4543 0 0 3 2640 0 2001 0
## 4553 0 0 3 2100 0 1965 1984
## 4555 0 0 4 810 0 1959 0
## 4556 0 0 5 1520 0 1951 0
## 4559 0 4 5 2140 1360 1951 0
## 4564 0 4 4 1530 600 1941 1998
## 4568 0 2 4 4080 0 1983 0
## 4575 0 0 3 1520 0 1956 2001
## 4576 0 0 3 8020 0 2001 0
## 4589 1 4 3 2890 0 1984 0
## street city statezip country
## 4355 3911 S 328th St Federal Way WA 98001 USA
## 4357 2120 31st Ave W Seattle WA 98199 USA
## 4358 12271 Marine View Dr SW Burien WA 98146 USA
## 4359 21809 SE 38th Pl Issaquah WA 98075 USA
## 4362 14901-14999 12th Ave SW Burien WA 98166 USA
## 4363 1223-1237 244th Ave NE Sammamish WA 98074 USA
## 4375 39612 254th Ave SE Enumclaw WA 98022 USA
## 4377 216 SW 183rd St Normandy Park WA 98166 USA
## 4383 6925 Oakmont Ave SE Snoqualmie WA 98065 USA
## 4384 9235 NE 5th St Bellevue WA 98004 USA
## 4386 4140 Boulevard Pl Mercer Island WA 98040 USA
## 4387 31603 E Lake Morton Dr SE Kent WA 98042 USA
## 4390 2234 167th Ave SE Bellevue WA 98008 USA
## 4395 19055 35th Ave NE Lake Forest Park WA 98155 USA
## 4406 3665 50th Ave NE Seattle WA 98105 USA
## 4409 25437 163rd Pl SE Covington WA 98042 USA
## 4412 3202 S 194th St SeaTac WA 98188 USA
## 4413 10744 62nd Ave S Seattle WA 98178 USA
## 4414 32706 20th Ave SW Federal Way WA 98023 USA
## 4421 18423 61st Pl NE Kenmore WA 98028 USA
## 4443 1236 S Cloverdale St Seattle WA 98108 USA
## 4449 1013 NE 80th St Seattle WA 98115 USA
## 4454 2760 72nd Ave SE Mercer Island WA 98040 USA
## 4455 27622 237th Pl SE Maple Valley WA 98038 USA
## 4473 21418 SE 5th Pl Sammamish WA 98074 USA
## 4479 13505 208th Ave NE Woodinville WA 98077 USA
## 4480 11620-11698 57th Ave S Seattle WA 98178 USA
## 4481 9243 NE 20th St Clyde Hill WA 98004 USA
## 4482 1301-1303 Monterey Ave NE Renton WA 98056 USA
## 4488 6506 40th Ave SW Seattle WA 98136 USA
## 4500 101-127 247th Ave SE Sammamish WA 98074 USA
## 4508 29656 232nd Ave SE Black Diamond WA 98010 USA
## 4510 4115 85th Ave SE Mercer Island WA 98040 USA
## 4521 506 21st St SE Auburn WA 98002 USA
## 4522 12231 Occidental Ave S Seattle WA 98168 USA
## 4523 4921 28th Ave S Seattle WA 98108 USA
## 4524 7455 W Mercer Way Mercer Island WA 98040 USA
## 4529 425 E Lake Sammamish Pkwy SE Sammamish WA 98074 USA
## 4535 16232 SE 10th St Bellevue WA 98008 USA
## 4543 34529 SE Jay Ct Snoqualmie WA 98065 USA
## 4553 4500 NE 171st St Lake Forest Park WA 98155 USA
## 4555 30401-30499 8th Ave SW Federal Way WA 98023 USA
## 4556 11533 22nd Ave NE Seattle WA 98125 USA
## 4559 12725 8th Ave NW Seattle WA 98177 USA
## 4564 3428 60th Ave SW Seattle WA 98116 USA
## 4568 2710 95th Ave NE Clyde Hill WA 98004 USA
## 4575 2533 155th Pl SE Bellevue WA 98007 USA
## 4576 2 Crescent Key Bellevue WA 98006 USA
## 4589 3227-3399 Mountain View Ave N Renton WA 98056 USA
Many of these are large properties that would be expected to sell for a high price. Since a price of 0 is most likely an error, we will remove these observations:
df_house_price <- df_house_price[df_house_price$price != 0,]
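One caveat with this kind of logical filter is worth sketching. The summary above shows no missing prices, so the line above behaves as intended here, but if a column did contain `NA`, logical subsetting would keep an all-`NA` row. The toy data frame below is hypothetical and only illustrates the difference between the two forms.

```r
# Hypothetical toy data with one zero price and one missing price
toy <- data.frame(price    = c(313000, 0, NA, 420000),
                  bedrooms = c(3, 4, 2, 3))

# How many rows would be dropped as zero-priced (ignoring NA)
n_zero <- sum(toy$price == 0, na.rm = TRUE)

# Logical subsetting: the NA comparison yields an extra all-NA row
kept_logical <- toy[toy$price != 0, ]

# which() keeps only the positions where the condition is TRUE
kept_which <- toy[which(toy$price != 0), ]

nrow(kept_logical)
nrow(kept_which)
```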
df_house_price[df_house_price$bedrooms == 0 | df_house_price$bathrooms == 0,]
## date price bedrooms bathrooms sqft_living sqft_lot floors
## 2366 2014-06-12 00:00:00 1095000 0 0 3064 4764 3.5
## 3210 2014-06-24 00:00:00 1295648 0 0 4810 28008 2.0
## waterfront view condition sqft_above sqft_basement yr_built yr_renovated
## 2366 0 2 3 3064 0 1990 2009
## 3210 0 0 3 4810 0 1990 2009
## street city statezip country
## 2366 814 E Howe St Seattle WA 98102 USA
## 3210 20418 NE 64th Pl Redmond WA 98053 USA
Only 2 properties have 0 bedrooms or 0 bathrooms. Their other values look plausible, so we will keep these observations.
head(df_house_price[order(-df_house_price$sqft_lot),],10)
## date price bedrooms bathrooms sqft_living sqft_lot
## 1079 2014-05-21 00:00:00 542500.0 5 3.25 3010 1074218
## 2481 2014-06-13 00:00:00 849900.0 2 2.00 2280 641203
## 3488 2014-06-26 00:00:00 667000.0 3 1.75 3320 478288
## 376 2014-05-08 00:00:00 330000.0 2 2.00 1550 435600
## 880 2014-05-19 00:00:00 480000.0 4 3.50 3370 435600
## 1540 2014-05-29 00:00:00 302000.0 2 1.00 900 423838
## 3057 2014-06-23 00:00:00 230000.0 3 1.00 1530 389126
## 241 2014-05-07 00:00:00 630000.0 3 2.50 2680 327135
## 123 2014-05-05 00:00:00 2280000.0 7 8.00 13540 307752
## 4354 2014-05-05 00:00:00 117833.3 3 1.00 1340 306848
## floors waterfront view condition sqft_above sqft_basement yr_built
## 1079 1.5 0 0 5 2010 1000 1931
## 2481 2.0 0 0 3 2280 0 1990
## 3488 1.5 0 3 4 2260 1060 1933
## 376 1.5 0 0 2 1550 0 1972
## 880 2.0 0 3 3 3370 0 2005
## 1540 1.0 0 2 5 900 0 1925
## 3057 1.5 0 0 4 1530 0 1919
## 241 2.0 0 0 3 2680 0 1995
## 123 3.0 0 4 3 9410 4130 1999
## 4354 1.0 0 0 3 1340 0 1953
## yr_renovated street city statezip country
## 1079 0 16200-16398 252nd Ave SE Issaquah WA 98027 USA
## 2481 2009 9326 SW 216th St Vashon WA 98070 USA
## 3488 1982 40201 292nd Ave SE Enumclaw WA 98022 USA
## 376 0 36521 SE 94th St Snoqualmie WA 98065 USA
## 880 0 44250 SE Edgewick Rd North Bend WA 98045 USA
## 1540 0 18923 SE 416th St Enumclaw WA 98022 USA
## 3057 1985 24727 SE Mud Mountain Rd Enumclaw WA 98022 USA
## 241 0 25339 SE 248th St Ravensdale WA 98051 USA
## 123 0 26408 NE 70th St Redmond WA 98053 USA
## 4354 0 17827 Mountain View Rd NE Duvall WA 98019 USA
The highest value seems correct after reviewing the property details on Google Maps: the property has a large surrounding lot, so the high sqft_lot value is likely accurate.
head(df_house_price[df_house_price$yr_renovated==0,],10)
## date price bedrooms bathrooms sqft_living sqft_lot floors
## 2 2014-05-02 00:00:00 2384000 5 2.50 3650 9050 2
## 3 2014-05-02 00:00:00 342000 3 2.00 1930 11947 1
## 4 2014-05-02 00:00:00 420000 3 2.25 2000 8030 1
## 7 2014-05-02 00:00:00 335000 2 2.00 1350 2560 1
## 8 2014-05-02 00:00:00 482000 4 2.50 2710 35868 2
## 9 2014-05-02 00:00:00 452500 3 2.50 2430 88426 1
## 13 2014-05-02 00:00:00 588500 3 1.75 2330 14892 1
## 16 2014-05-02 00:00:00 242500 3 1.50 1200 9720 1
## 17 2014-05-02 00:00:00 419000 3 1.50 1570 6700 1
## 18 2014-05-02 00:00:00 367500 4 3.00 3110 7231 2
## waterfront view condition sqft_above sqft_basement yr_built yr_renovated
## 2 0 4 5 3370 280 1921 0
## 3 0 0 4 1930 0 1966 0
## 4 0 0 4 1000 1000 1963 0
## 7 0 0 3 1350 0 1976 0
## 8 0 0 3 2710 0 1989 0
## 9 0 0 4 1570 860 1985 0
## 13 0 0 3 1970 360 1980 0
## 16 0 0 4 1200 0 1965 0
## 17 0 0 4 1570 0 1956 0
## 18 0 0 3 3110 0 1997 0
## street city statezip country
## 2 709 W Blaine St Seattle WA 98119 USA
## 3 26206-26214 143rd Ave SE Kent WA 98042 USA
## 4 857 170th Pl NE Bellevue WA 98008 USA
## 7 2616 174th Ave NE Redmond WA 98052 USA
## 8 23762 SE 253rd Pl Maple Valley WA 98038 USA
## 9 46611-46625 SE 129th St North Bend WA 98045 USA
## 13 1833 220th Pl NE Sammamish WA 98074 USA
## 16 14034 SE 201st St Kent WA 98042 USA
## 17 15424 SE 9th St Bellevue WA 98007 USA
## 18 11224 SE 306th Pl Auburn WA 98092 USA
nrow(df_house_price[df_house_price$yr_renovated == 0, ])
## [1] 2706
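There are 2706 rows with yr_renovated = 0, about 59% of the 4551 remaining observations. This most likely indicates that the properties were never renovated, so 0 acts as a placeholder rather than a real year. Because mixing this placeholder with actual years could distort the dimension reduction, I decided to remove the variable: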
df_house_price <- df_house_price[, -14]
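The reasoning behind dropping yr_renovated can be sketched on a few rows. The values below are taken from rows printed earlier (including one where the renovation year precedes the build year); the `renovated` flag is only a hypothetical alternative to dropping the column, not something this analysis uses.

```r
# A handful of (yr_built, yr_renovated) pairs as printed in the outputs above
toy <- data.frame(
  yr_built     = c(1955, 1921, 1966, 1976, 2013),
  yr_renovated = c(2005,    0,    0, 1992, 1923)
)

# Share of rows where 0 stands in for "never renovated"
share_zero <- mean(toy$yr_renovated == 0)

# Non-zero renovation years that fall before the build year look suspicious
suspicious <- toy$yr_renovated != 0 & toy$yr_renovated < toy$yr_built

# Hypothetical alternative: keep only a 0/1 indicator of any renovation
toy$renovated <- as.integer(toy$yr_renovated != 0)
toy
```

Both the placeholder zeros and the inconsistent years support treating the raw column with caution before feeding it to PCA.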
Since the street variable isn’t important for dimension reduction, we’ll remove it from the dataset:
df_house_price <- df_house_price[, -14]
The city variable may actually be useful for our analysis. Let’s convert it to a binary variable that is 1 when the property is in an “expensive” city and 0 otherwise. I used AI to check where properties are more expensive, and this is the list: Medina, Clyde Hill, Bellevue, Mercer Island, Kirkland, Redmond, Sammamish, Issaquah
top_cities <- c("Medina", "Clyde Hill", "Bellevue", "Mercer Island",
"Kirkland", "Redmond", "Sammamish", "Issaquah")
df_house_price$expensive_city <- ifelse(df_house_price$city %in% top_cities,1,0)
head(df_house_price[, c("city", "expensive_city")], 10)
## city expensive_city
## 1 Shoreline 0
## 2 Seattle 0
## 3 Kent 0
## 4 Bellevue 1
## 5 Redmond 1
## 6 Seattle 0
## 7 Redmond 1
## 8 Maple Valley 0
## 9 North Bend 0
## 10 Seattle 0
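Since the list of expensive cities came from an external source, one possible cross-check is to rank cities by median price in the data itself. The sketch below uses a hypothetical toy data frame with made-up prices; it only shows the mechanics with base R's `aggregate()`.

```r
# Hypothetical toy data: prices are invented for illustration only
toy <- data.frame(
  city  = c("Seattle", "Seattle", "Bellevue", "Bellevue", "Kent", "Kent"),
  price = c(500000, 600000, 800000, 900000, 300000, 350000)
)

# Median price per city, most expensive first
city_median <- aggregate(price ~ city, data = toy, FUN = median)
city_median <- city_median[order(-city_median$price), ]
city_median
```

Applied to the real data, the top of such a ranking could be compared with the list above before building the binary variable.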
This transformation works as expected. Now, we can safely remove the original city variable:
df_house_price <- df_house_price[, -14]
Since statezip is not essential for further analysis, we can remove it from the dataset.
unique(df_house_price$country)
## [1] "USA"
There is only one country listed, so this variable is redundant and can be removed.
The date variable isn’t necessary for our analysis either, so we remove it together with statezip and country:
df_house_price <- df_house_price %>%
  select(-date, -statezip, -country)  # drop by name rather than by position
Summary of variables after changes
str(df_house_price)
## 'data.frame': 4551 obs. of 13 variables:
## $ price : num 313000 2384000 342000 420000 550000 ...
## $ bedrooms : num 3 5 3 3 4 2 2 4 3 4 ...
## $ bathrooms : num 1.5 2.5 2 2.25 2.5 1 2 2.5 2.5 2 ...
## $ sqft_living : int 1340 3650 1930 2000 1940 880 1350 2710 2430 1520 ...
## $ sqft_lot : int 7912 9050 11947 8030 10500 6380 2560 35868 88426 6200 ...
## $ floors : num 1.5 2 1 1 1 1 1 2 1 1.5 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ view : int 0 4 0 0 0 0 0 0 0 0 ...
## $ condition : int 3 5 4 4 4 3 3 3 4 3 ...
## $ sqft_above : int 1340 3370 1930 1000 1140 880 1350 2710 1570 1520 ...
## $ sqft_basement : int 0 280 0 1000 800 0 0 0 860 0 ...
## $ yr_built : int 1955 1921 1966 1963 1976 1938 1976 1989 1985 1945 ...
## $ expensive_city: num 0 0 0 1 1 0 1 0 0 0 ...
Now every variable in the dataset is numeric (num or int).
summary(df_house_price)
## price bedrooms bathrooms sqft_living
## Min. : 7800 Min. :0.000 Min. :0.000 Min. : 370
## 1st Qu.: 326264 1st Qu.:3.000 1st Qu.:1.750 1st Qu.: 1460
## Median : 465000 Median :3.000 Median :2.250 Median : 1970
## Mean : 557906 Mean :3.395 Mean :2.155 Mean : 2132
## 3rd Qu.: 657500 3rd Qu.:4.000 3rd Qu.:2.500 3rd Qu.: 2610
## Max. :26590000 Max. :9.000 Max. :8.000 Max. :13540
## sqft_lot floors waterfront view
## Min. : 638 Min. :1.000 Min. :0.000000 Min. :0.0000
## 1st Qu.: 5000 1st Qu.:1.000 1st Qu.:0.000000 1st Qu.:0.0000
## Median : 7680 Median :1.500 Median :0.000000 Median :0.0000
## Mean : 14835 Mean :1.512 Mean :0.006592 Mean :0.2347
## 3rd Qu.: 10978 3rd Qu.:2.000 3rd Qu.:0.000000 3rd Qu.:0.0000
## Max. :1074218 Max. :3.500 Max. :1.000000 Max. :4.0000
## condition sqft_above sqft_basement yr_built
## Min. :1.000 Min. : 370 Min. : 0.0 Min. :1900
## 1st Qu.:3.000 1st Qu.:1190 1st Qu.: 0.0 1st Qu.:1951
## Median :3.000 Median :1590 Median : 0.0 Median :1976
## Mean :3.449 Mean :1822 Mean : 310.2 Mean :1971
## 3rd Qu.:4.000 3rd Qu.:2300 3rd Qu.: 600.0 3rd Qu.:1997
## Max. :5.000 Max. :9410 Max. :4820.0 Max. :2014
## expensive_city
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2553
## 3rd Qu.:1.0000
## Max. :1.0000
The summary reports no NA counts, so there are no missing values in our dataset.
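A more explicit check than reading the summary is to count missing values per column. The sketch below runs on a hypothetical toy data frame that deliberately contains `NA`s so the output is non-trivial; on the cleaned housing data every count would be expected to be zero.

```r
# Hypothetical toy data with two missing values
toy <- data.frame(
  price    = c(313000, NA, 420000),
  bedrooms = c(3, 5, NA),
  floors   = c(1.5, 2, 1)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Single TRUE/FALSE answer for the whole data frame
anyNA(toy)
```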
cor_df <- cor(df_house_price)
corrplot(cor_df, type = "full", order = "hclust", tl.col = "black", tl.cex = 0.6, addCoef.col = "black", number.cex = 0.6)
We see high correlations (> 0.7) between two pairs of variables: sqft_living and bathrooms, and sqft_living and sqft_above.
Both are logical and could have been expected. sqft_living and bathrooms are correlated because, in general, the larger the house, the more bathrooms it has. sqft_above and sqft_living are correlated because the above-ground area is typically included in the total living area.
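Reading pairs off a correlation plot can be complemented by listing them programmatically. This is a minimal sketch on simulated data (the `toy` columns are hypothetical stand-ins, not the real variables); it keeps only the upper triangle so that each pair appears once and the diagonal is ignored.

```r
# Simulated toy data: two strongly related columns and one unrelated column
set.seed(42)
size <- rnorm(300)
toy <- data.frame(
  sqft_living = size,
  sqft_above  = size + rnorm(300, sd = 0.4),
  condition   = rnorm(300)
)

cor_toy <- cor(toy)

# Positions in the upper triangle where |r| exceeds the 0.7 threshold
high <- which(abs(cor_toy) > 0.7 & upper.tri(cor_toy), arr.ind = TRUE)

# Turn the matrix positions into a readable table of variable pairs
data.frame(
  var1 = rownames(cor_toy)[high[, "row"]],
  var2 = colnames(cor_toy)[high[, "col"]],
  r    = round(cor_toy[high], 2)
)
```

The same pattern could be applied to `cor_df` from the analysis, with the threshold adjusted as needed.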
df <- df_house_price  # work on a copy of the cleaned data frame
df.pca1 <- prcomp(df, scale. = TRUE)  # scale. = TRUE standardises each variable first
df.pca1
## Standard deviations (1, .., p=13):
## [1] 2.002377e+00 1.429070e+00 1.107972e+00 1.001491e+00 9.817904e-01
## [6] 9.259522e-01 8.494600e-01 7.847303e-01 7.369407e-01 6.354856e-01
## [11] 6.208298e-01 4.760509e-01 2.347354e-15
##
## Rotation (n x k) = (13 x 13):
## PC1 PC2 PC3 PC4 PC5
## price 0.25134313 -0.20544526 1.279477e-01 0.079604574 -0.33822620
## bedrooms 0.32030372 -0.14587194 -3.309437e-01 0.118808781 0.22430130
## bathrooms 0.43441870 0.01768608 -7.891571e-02 0.103885947 0.14750297
## sqft_living 0.46003467 -0.14174728 -9.019708e-02 -0.040310482 0.06188388
## sqft_lot 0.10526724 -0.04549900 -3.557951e-02 -0.953920156 0.02656855
## floors 0.27217028 0.37233973 1.856393e-01 0.089574776 0.18388090
## waterfront 0.07127180 -0.20325247 6.823707e-01 0.011032768 -0.13306184
## view 0.15621882 -0.34791111 4.904376e-01 -0.005211432 0.13372690
## condition -0.09399809 -0.40296868 -2.372956e-01 -0.046947230 -0.11973680
## sqft_above 0.43865089 0.11532660 5.327237e-05 -0.130339454 -0.01926456
## sqft_basement 0.14061738 -0.50660322 -1.867353e-01 0.157653852 0.16368086
## yr_built 0.24698767 0.42387089 4.728165e-02 0.057776934 -0.03404764
## expensive_city 0.19290521 0.03241068 -1.563359e-01 0.040110459 -0.83482940
## PC6 PC7 PC8 PC9 PC10
## price -0.409235857 -0.63505138 -0.164220660 0.24157013 0.31438304
## bedrooms 0.052804931 0.25103765 -0.395328025 -0.30639381 0.55401443
## bathrooms 0.026415306 0.08416448 0.099113765 0.24665893 -0.07795297
## sqft_living 0.002868739 -0.02800623 -0.041449599 -0.06793876 -0.33275635
## sqft_lot 0.097503682 -0.04355247 -0.006381238 0.12297297 0.08072408
## floors -0.359263123 0.09531496 0.189094782 0.08042993 -0.22364005
## waterfront 0.105865762 0.42227470 -0.472091360 0.23427704 -0.05140168
## view 0.097530053 -0.05185919 0.577724257 -0.43188080 0.24990574
## condition -0.530379613 0.51430703 0.317011623 0.28067222 0.10503861
## sqft_above -0.222704617 0.05625388 -0.095057158 -0.26490705 -0.22226250
## sqft_basement 0.417831478 -0.16199317 0.090041655 0.34936987 -0.27746522
## yr_built 0.312574683 0.10422732 0.270979274 0.45160532 0.45683656
## expensive_city 0.257971058 0.17045070 0.147319704 -0.21060153 -0.09839353
## PC11 PC12 PC13
## price 0.049460102 0.01194271 2.169661e-16
## bedrooms 0.264143151 0.09748506 1.512706e-15
## bathrooms 0.030648643 -0.82869107 2.667399e-15
## sqft_living -0.290952708 0.24529503 7.014339e-01
## sqft_lot 0.203248087 -0.02280740 -2.955910e-17
## floors 0.640019679 0.26639802 -2.653631e-16
## waterfront 0.010938992 0.01051222 -5.093739e-17
## view 0.003584873 -0.03407131 2.620599e-16
## condition -0.115697550 0.08014919 7.357615e-17
## sqft_above -0.431978480 0.12596573 -6.269600e-01
## sqft_basement 0.196908093 0.27459161 -3.389862e-01
## yr_built -0.292765290 0.26878065 -4.604561e-17
## expensive_city 0.258159150 -0.02758985 3.518910e-17
df.pca1$rotation
## PC1 PC2 PC3 PC4 PC5
## price 0.25134313 -0.20544526 1.279477e-01 0.079604574 -0.33822620
## bedrooms 0.32030372 -0.14587194 -3.309437e-01 0.118808781 0.22430130
## bathrooms 0.43441870 0.01768608 -7.891571e-02 0.103885947 0.14750297
## sqft_living 0.46003467 -0.14174728 -9.019708e-02 -0.040310482 0.06188388
## sqft_lot 0.10526724 -0.04549900 -3.557951e-02 -0.953920156 0.02656855
## floors 0.27217028 0.37233973 1.856393e-01 0.089574776 0.18388090
## waterfront 0.07127180 -0.20325247 6.823707e-01 0.011032768 -0.13306184
## view 0.15621882 -0.34791111 4.904376e-01 -0.005211432 0.13372690
## condition -0.09399809 -0.40296868 -2.372956e-01 -0.046947230 -0.11973680
## sqft_above 0.43865089 0.11532660 5.327237e-05 -0.130339454 -0.01926456
## sqft_basement 0.14061738 -0.50660322 -1.867353e-01 0.157653852 0.16368086
## yr_built 0.24698767 0.42387089 4.728165e-02 0.057776934 -0.03404764
## expensive_city 0.19290521 0.03241068 -1.563359e-01 0.040110459 -0.83482940
## PC6 PC7 PC8 PC9 PC10
## price -0.409235857 -0.63505138 -0.164220660 0.24157013 0.31438304
## bedrooms 0.052804931 0.25103765 -0.395328025 -0.30639381 0.55401443
## bathrooms 0.026415306 0.08416448 0.099113765 0.24665893 -0.07795297
## sqft_living 0.002868739 -0.02800623 -0.041449599 -0.06793876 -0.33275635
## sqft_lot 0.097503682 -0.04355247 -0.006381238 0.12297297 0.08072408
## floors -0.359263123 0.09531496 0.189094782 0.08042993 -0.22364005
## waterfront 0.105865762 0.42227470 -0.472091360 0.23427704 -0.05140168
## view 0.097530053 -0.05185919 0.577724257 -0.43188080 0.24990574
## condition -0.530379613 0.51430703 0.317011623 0.28067222 0.10503861
## sqft_above -0.222704617 0.05625388 -0.095057158 -0.26490705 -0.22226250
## sqft_basement 0.417831478 -0.16199317 0.090041655 0.34936987 -0.27746522
## yr_built 0.312574683 0.10422732 0.270979274 0.45160532 0.45683656
## expensive_city 0.257971058 0.17045070 0.147319704 -0.21060153 -0.09839353
## PC11 PC12 PC13
## price 0.049460102 0.01194271 2.169661e-16
## bedrooms 0.264143151 0.09748506 1.512706e-15
## bathrooms 0.030648643 -0.82869107 2.667399e-15
## sqft_living -0.290952708 0.24529503 7.014339e-01
## sqft_lot 0.203248087 -0.02280740 -2.955910e-17
## floors 0.640019679 0.26639802 -2.653631e-16
## waterfront 0.010938992 0.01051222 -5.093739e-17
## view 0.003584873 -0.03407131 2.620599e-16
## condition -0.115697550 0.08014919 7.357615e-17
## sqft_above -0.431978480 0.12596573 -6.269600e-01
## sqft_basement 0.196908093 0.27459161 -3.389862e-01
## yr_built -0.292765290 0.26878065 -4.604561e-17
## expensive_city 0.258159150 -0.02758985 3.518910e-17
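The thirteenth component above has a standard deviation of roughly 2e-15 and loads only on sqft_living, sqft_above and sqft_basement. That pattern is what one would expect if sqft_living were exactly the sum of the other two, and the rows printed earlier are consistent with this. The sketch below reproduces the effect using the first six rows shown in the data preview; the assumption that the identity holds for every row of the full dataset is mine and would need to be checked.

```r
# sqft_above, sqft_basement and floors from the first six rows printed earlier
toy <- data.frame(
  sqft_above    = c(1340, 3370, 1930, 1000, 1140, 880),
  sqft_basement = c(0, 280, 0, 1000, 800, 0),
  floors        = c(1.5, 2, 1, 1, 1, 1)
)

# Living area as the exact sum of the two parts (matches the printed values)
toy$sqft_living <- toy$sqft_above + toy$sqft_basement

# With an exact linear dependence, the last component carries no variance
toy_pca <- prcomp(toy, scale. = TRUE)
toy_pca$sdev
```

On the full data, a check such as `all(df$sqft_living == df$sqft_above + df$sqft_basement)` would confirm or refute the dependence; if it holds, one of the three variables is redundant for PCA.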
fviz_eig(df.pca1, addlabels = TRUE)
fviz_eig(df.pca1, choice = "eigenvalue", addlabels = TRUE, main = "Eigenvalues") +
  geom_hline(yintercept = 1, linetype = "dashed")  # reference line at eigenvalue = 1
summary(df.pca1)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0024 1.4291 1.10797 1.00149 0.98179 0.92595 0.84946
## Proportion of Variance 0.3084 0.1571 0.09443 0.07715 0.07415 0.06595 0.05551
## Cumulative Proportion 0.3084 0.4655 0.55995 0.63710 0.71125 0.77720 0.83271
## PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 0.78473 0.73694 0.63549 0.62083 0.47605 2.347e-15
## Proportion of Variance 0.04737 0.04178 0.03106 0.02965 0.01743 0.000e+00
## Cumulative Proportion 0.88008 0.92185 0.95292 0.98257 1.00000 1.000e+00
The 70% threshold for cumulative explained variance is first exceeded with five principal components. The fifth component has a standard deviation just below 1 (0.98, so its eigenvalue of about 0.96 narrowly misses the eigenvalue-greater-than-1 rule), but retaining sufficient information matters more here. The first four components capture only 63.7% of the variance, which risks oversimplifying the data, whereas adding the fifth raises the cumulative share to 71.13%. Five components were therefore selected.
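The two selection rules discussed here can be recomputed directly from the standard deviations printed above. The sketch copies those values by hand; in the actual session the same vector is available as `df.pca1$sdev`. The variable names `n_by_threshold` and `n_by_kaiser` are illustrative.

```r
# Component standard deviations as printed in the prcomp output above
sdev <- c(2.002377, 1.429070, 1.107972, 1.001491, 0.9817904, 0.9259522,
          0.8494600, 0.7847303, 0.7369407, 0.6354856, 0.6208298, 0.4760509,
          2.347354e-15)

# Eigenvalues are the squared standard deviations
eigenvalues <- sdev^2

# Cumulative share of variance explained
cum_share <- cumsum(eigenvalues) / sum(eigenvalues)

# Rule 1: smallest number of components reaching 70% cumulative variance
n_by_threshold <- which(cum_share >= 0.70)[1]

# Rule 2: number of components with eigenvalue above 1
n_by_kaiser <- sum(eigenvalues > 1)

round(cum_share, 4)
c(threshold_70 = n_by_threshold, eigenvalue_gt_1 = n_by_kaiser)
```

With these numbers the 70% rule gives five components and the eigenvalue rule gives four, which is the trade-off described in the paragraph above.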
fviz_pca_var(df.pca1, col.var="contrib")
df.pca1$rotation[,1:5]
## PC1 PC2 PC3 PC4 PC5
## price 0.25134313 -0.20544526 1.279477e-01 0.079604574 -0.33822620
## bedrooms 0.32030372 -0.14587194 -3.309437e-01 0.118808781 0.22430130
## bathrooms 0.43441870 0.01768608 -7.891571e-02 0.103885947 0.14750297
## sqft_living 0.46003467 -0.14174728 -9.019708e-02 -0.040310482 0.06188388
## sqft_lot 0.10526724 -0.04549900 -3.557951e-02 -0.953920156 0.02656855
## floors 0.27217028 0.37233973 1.856393e-01 0.089574776 0.18388090
## waterfront 0.07127180 -0.20325247 6.823707e-01 0.011032768 -0.13306184
## view 0.15621882 -0.34791111 4.904376e-01 -0.005211432 0.13372690
## condition -0.09399809 -0.40296868 -2.372956e-01 -0.046947230 -0.11973680
## sqft_above 0.43865089 0.11532660 5.327237e-05 -0.130339454 -0.01926456
## sqft_basement 0.14061738 -0.50660322 -1.867353e-01 0.157653852 0.16368086
## yr_built 0.24698767 0.42387089 4.728165e-02 0.057776934 -0.03404764
## expensive_city 0.19290521 0.03241068 -1.563359e-01 0.040110459 -0.83482940
PC1 <- fviz_contrib(df.pca1, "var", axes=1)
PC2 <- fviz_contrib(df.pca1, "var", axes=2)
PC3 <- fviz_contrib(df.pca1, "var", axes=3)
PC4 <- fviz_contrib(df.pca1, "var", axes=4)
PC5 <- fviz_contrib(df.pca1, "var", axes=5)
plot_grid(PC1, PC2, PC3, PC4, PC5, ncol = 2)
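To relate the contribution plots to the loadings, the sketch below recomputes PC1 contributions by hand. It assumes, as I understand the `factoextra` definition for a `prcomp` result, that a variable's contribution to a component is its squared loading expressed as a percentage, and that the reference line in the plot is the uniform value 100 / (number of variables). The loadings are copied from the rotation matrix printed above.

```r
# PC1 loadings copied from the rotation matrix printed above
pc1 <- c(price = 0.25134313, bedrooms = 0.32030372, bathrooms = 0.43441870,
         sqft_living = 0.46003467, sqft_lot = 0.10526724, floors = 0.27217028,
         waterfront = 0.07127180, view = 0.15621882, condition = -0.09399809,
         sqft_above = 0.43865089, sqft_basement = 0.14061738,
         yr_built = 0.24698767, expensive_city = 0.19290521)

# Loadings have unit length, so squared loadings sum to 1 (100%)
contrib_pc1 <- 100 * pc1^2

# Contribution each variable would have if all contributed equally
expected <- 100 / length(pc1)

round(sort(contrib_pc1, decreasing = TRUE), 2)

# Variables contributing more than the uniform share
names(contrib_pc1)[contrib_pc1 > expected]
```

Under that assumption, the variables above the uniform share for PC1 are the size-related ones, which matches the interpretation given for that component.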
PC1
The first principal component (PC1) is driven mainly by house size, with the largest contributions from living area, above-ground square footage, and bathrooms. It reflects the general livability and functionality of homes.
PC2
PC2 captures structural and architectural features: it increases with the year built and the number of floors and decreases with basement size. It highlights trends in design and construction style, suggesting that newer houses in this dataset tend to have smaller basements or none at all.
PC3
PC3 focuses on luxury and aesthetic appeal, dominated by waterfront properties and better views.
PC4
PC4 highlights the difference between homes with smaller versus larger land areas.
PC5
PC5 relates to the cost of living in expensive cities. The expensive_city variable has by far the largest loading on this component (-0.83), followed by price (-0.34), so homes in “expensive” areas score lower on it. This helps to separate homes in expensive cities from more affordable ones elsewhere.
The study simplified the housing market dataset by reducing its dimensions while preserving most of the important information (variance). Based on the 70% cumulative-variance threshold, five principal components were retained; together they explain about 71% of the variance and summarize the dataset’s key characteristics.
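As a final illustration, the reduced dataset itself consists of the component scores stored in the `x` element of a `prcomp` result. The sketch below uses simulated data and a hypothetical `k = 2`; for this analysis the analogous step would be taking the first five columns of `df.pca1$x`.

```r
# Simulated toy data (hypothetical, unrelated to the housing dataset)
set.seed(7)
size <- rnorm(100)
toy <- data.frame(
  living = size + rnorm(100, sd = 0.3),
  above  = size + rnorm(100, sd = 0.3),
  baths  = size + rnorm(100, sd = 0.5),
  lot    = rnorm(100)
)

toy_pca <- prcomp(toy, scale. = TRUE)

# Keep the scores of the first k components as the reduced dataset
k <- 2  # hypothetical number of retained components
scores <- as.data.frame(toy_pca$x[, 1:k])

dim(scores)

# Component scores are uncorrelated with each other by construction
round(cor(scores), 10)
```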