Using either of the two unsupervised learning algorithms we’ve learned, we will produce a simple R Markdown document that demonstrates either clustering or dimensionality reduction on wholesale.csv, nyc.csv, or a dataset of our own.
We will explain our choice of parameters (how we choose k for k-means clustering, or how many dimensions we retain for PCA) based on the original data, and describe some business utility for the unsupervised model we’ve developed. (The R Markdown document should contain one or two visualizations.)
We’ll set up caching for this notebook, given how computationally expensive some of the code we will write can get.
knitr::opts_chunk$set(cache=TRUE)
options(scipen = 9999)
The libraries used in this analysis:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## Loading required package: ggplot2
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(FactoMineR)
In this analysis, we use the nyc dataset and look for its principal components.
nyc data:
property_data <- read.csv("data_input/nyc.csv", stringsAsFactors = FALSE, sep = ",")
describe(property_data)
## property_data
##
## 22 Variables 84548 Observations
## ---------------------------------------------------------------------------
## X
## n missing distinct Info Mean Gmd .05 .10
## 84548 0 26736 1 10344 8149 849 1695
## .25 .50 .75 .90 .95
## 4231 8942 15987 21167 23281
##
## lowest : 4 5 6 7 8, highest: 26735 26736 26737 26738 26739
## ---------------------------------------------------------------------------
## BOROUGH
## n missing distinct Info Mean Gmd
## 84548 0 5 0.934 2.999 1.424
##
## Value 1 2 3 4 5
## Frequency 18306 7049 24047 26736 8410
## Proportion 0.217 0.083 0.284 0.316 0.099
## ---------------------------------------------------------------------------
## NEIGHBORHOOD
## n missing distinct
## 84548 0 254
##
## lowest : AIRPORT LA GUARDIA ALPHABET CITY ANNADALE ARDEN HEIGHTS ARROCHAR
## highest: WOODHAVEN WOODLAWN WOODROW WOODSIDE WYCKOFF HEIGHTS
## ---------------------------------------------------------------------------
## BUILDING.CLASS.CATEGORY
## n missing distinct
## 84548 0 47
##
## lowest : 01 ONE FAMILY DWELLINGS 02 TWO FAMILY DWELLINGS 03 THREE FAMILY DWELLINGS 04 TAX CLASS 1 CONDOS 05 TAX CLASS 1 VACANT LAND
## highest: 45 CONDO HOTELS 46 CONDO STORE BUILDINGS 47 CONDO NON-BUSINESS STORAGE 48 CONDO TERRACES/GARDENS/CABANAS 49 CONDO WAREHOUSES/FACTORY/INDUS
## ---------------------------------------------------------------------------
## TAX.CLASS.AT.PRESENT
## n missing distinct
## 83810 738 10
##
## Value 1 1A 1B 1C 2 2A 2B 2C 3 4
## Frequency 38633 1444 1234 186 30919 2521 814 1915 4 6140
## Proportion 0.461 0.017 0.015 0.002 0.369 0.030 0.010 0.023 0.000 0.073
## ---------------------------------------------------------------------------
## BLOCK
## n missing distinct Info Mean Gmd .05 .10
## 84548 0 11566 1 4237 3872 276 633
## .25 .50 .75 .90 .95
## 1323 3311 6281 9151 11616
##
## lowest : 1 3 5 6 7, highest: 16315 16316 16317 16319 16322
## ---------------------------------------------------------------------------
## LOT
## n missing distinct Info Mean Gmd .05 .10
## 84548 0 2627 1 376.2 544.8 2 7
## .25 .50 .75 .90 .95
## 22 50 1001 1207 1403
##
## lowest : 1 2 3 4 5, highest: 9080 9081 9085 9099 9106
## ---------------------------------------------------------------------------
## BUILDING.CLASS.AT.PRESENT
## n missing distinct
## 83810 738 166
##
## lowest : A0 A1 A2 A3 A4, highest: Z0 Z2 Z3 Z7 Z9
## ---------------------------------------------------------------------------
## ADDRESS
## n missing distinct
## 84548 0 67563
##
## lowest : ****** 95TH STREET 1 12TH ST EXTENSION 1 5 AVENUE 1 5TH AVENUE, 23A 1 ASCAN AVE, 35
## highest: WOODROW ROAD WOODYCREST AVENUE WORTMAN AVENUE YORK AVENUE ZEREGA AVENUE
## ---------------------------------------------------------------------------
## APARTMENT.NUMBER
## n missing distinct
## 19052 65496 3988
##
## lowest : #4 #PHC ` 0 0.25, highest: W6B W8D WS2 Y1 Z
## ---------------------------------------------------------------------------
## ZIP.CODE
## n missing distinct Info Mean Gmd .05 .10
## 84548 0 186 1 10732 842 10011 10019
## .25 .50 .75 .90 .95
## 10305 11209 11357 11414 11427
##
## Value 0 10000 10100 10200 10300 10500 10800 11000 11100 11200
## Frequency 982 15806 2117 1 8350 6990 1 529 1954 23732
## Proportion 0.012 0.187 0.025 0.000 0.099 0.083 0.000 0.006 0.023 0.281
##
## Value 11400 11700
## Frequency 23079 1007
## Proportion 0.273 0.012
## ---------------------------------------------------------------------------
## RESIDENTIAL.UNITS
## n missing distinct Info Mean Gmd .05 .10
## 84548 0 176 0.899 2.025 2.833 0 0
## .25 .50 .75 .90 .95
## 0 1 2 3 4
##
## lowest : 0 1 2 3 4, highest: 889 894 948 1641 1844
## ---------------------------------------------------------------------------
## COMMERCIAL.UNITS
## n missing distinct Info Mean Gmd .05 .10
## 84548 0 55 0.171 0.1936 0.3788 0 0
## .25 .50 .75 .90 .95
## 0 0 0 0 1
##
## lowest : 0 1 2 3 4, highest: 254 318 422 436 2261
## ---------------------------------------------------------------------------
## TOTAL.UNITS
## n missing distinct Info Mean Gmd .05 .10
## 84548 0 192 0.887 2.249 3.072 0 0
## .25 .50 .75 .90 .95
## 1 1 2 3 4
##
## lowest : 0 1 2 3 4, highest: 902 955 1653 1866 2261
## ---------------------------------------------------------------------------
## LAND.SQUARE.FEET
## n missing distinct
## 84548 0 6062
##
## lowest : - 0 100 1000 10000, highest: 998 9980 999 9992 9996
## ---------------------------------------------------------------------------
## GROSS.SQUARE.FEET
## n missing distinct
## 84548 0 5691
##
## lowest : - 0 100 1000 10000, highest: 9975 998 999 9990 9992
## ---------------------------------------------------------------------------
## YEAR.BUILT
## n missing distinct Info Mean Gmd .05 .10
## 84548 0 158 0.998 1789 327.5 0 1899
## .25 .50 .75 .90 .95
## 1920 1940 1965 2006 2013
##
## Value 0 1120 1680 1800 1820 1840 1860 1880 1900 1920
## Frequency 6970 1 1 37 2 22 15 122 5882 24002
## Proportion 0.082 0.000 0.000 0.000 0.000 0.000 0.000 0.001 0.070 0.284
##
## Value 1940 1960 1980 2000 2020
## Frequency 10018 18118 5666 9017 4675
## Proportion 0.118 0.214 0.067 0.107 0.055
## ---------------------------------------------------------------------------
## TAX.CLASS.AT.TIME.OF.SALE
## n missing distinct Info Mean Gmd
## 84548 0 4 0.799 1.657 0.7752
##
## Value 1 2 3 4
## Frequency 41533 36726 4 6285
## Proportion 0.491 0.434 0.000 0.074
## ---------------------------------------------------------------------------
## BUILDING.CLASS.AT.TIME.OF.SALE
## n missing distinct
## 84548 0 166
##
## lowest : A0 A1 A2 A3 A4, highest: Z0 Z2 Z3 Z7 Z9
## ---------------------------------------------------------------------------
## SALE.PRICE
## n missing distinct
## 84548 0 10008
##
## lowest : - 0 1 10 100
## highest: 999988 99999 999990 999999 9999999
## ---------------------------------------------------------------------------
## SALE.DATE
## n missing distinct
## 84548 0 364
##
## lowest : 2016-09-01 00:00:00 2016-09-02 00:00:00 2016-09-03 00:00:00 2016-09-04 00:00:00 2016-09-05 00:00:00
## highest: 2017-08-27 00:00:00 2017-08-28 00:00:00 2017-08-29 00:00:00 2017-08-30 00:00:00 2017-08-31 00:00:00
## ---------------------------------------------------------------------------
##
## Variables with all observations missing:
##
## [1] EASE.MENT
glimpse(property_data)
## Observations: 84,548
## Variables: 22
## $ X <int> 4, 5, 6, 7, 8, 9, 10, 11, 12, 1...
## $ BOROUGH <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ NEIGHBORHOOD <chr> "ALPHABET CITY", "ALPHABET CITY...
## $ BUILDING.CLASS.CATEGORY <chr> "07 RENTALS - WALKUP APARTMENTS...
## $ TAX.CLASS.AT.PRESENT <chr> "2A", "2", "2", "2B", "2A", "2"...
## $ BLOCK <int> 392, 399, 399, 402, 404, 405, 4...
## $ LOT <int> 6, 26, 39, 21, 55, 16, 32, 18, ...
## $ EASE.MENT <lgl> NA, NA, NA, NA, NA, NA, NA, NA,...
## $ BUILDING.CLASS.AT.PRESENT <chr> "C2", "C7", "C7", "C4", "C2", "...
## $ ADDRESS <chr> "153 AVENUE B", "234 EAST 4TH ...
## $ APARTMENT.NUMBER <chr> " ", " ", " ", " ", " ", " ", "...
## $ ZIP.CODE <int> 10009, 10009, 10009, 10009, 100...
## $ RESIDENTIAL.UNITS <int> 5, 28, 16, 10, 6, 20, 8, 44, 15...
## $ COMMERCIAL.UNITS <int> 0, 3, 1, 0, 0, 0, 0, 2, 0, 0, 4...
## $ TOTAL.UNITS <int> 5, 31, 17, 10, 6, 20, 8, 46, 15...
## $ LAND.SQUARE.FEET <chr> "1633", "4616", "2212", "2272",...
## $ GROSS.SQUARE.FEET <chr> "6440", "18690", "7803", "6794"...
## $ YEAR.BUILT <int> 1900, 1900, 1900, 1913, 1900, 1...
## $ TAX.CLASS.AT.TIME.OF.SALE <int> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2...
## $ BUILDING.CLASS.AT.TIME.OF.SALE <chr> "C2", "C7", "C7", "C4", "C2", "...
## $ SALE.PRICE <chr> "6625000", " - ", " - ", "393...
## $ SALE.DATE <chr> "2017-07-19 00:00:00", "2016-12...
Variable explanations:
BOROUGH: A digit code for the borough the property is located in; in order, these are Manhattan (1), Bronx (2), Brooklyn (3), Queens (4), and Staten Island (5)
NEIGHBORHOOD: The neighborhood name
BUILDING.CLASS.CATEGORY: Category class of the property[^6]
TAX.CLASS.AT.PRESENT, TAX.CLASS.AT.TIME.OF.SALE: The property’s tax class at present and at the time of sale
BLOCK, LOT: The combination of borough, block, and lot forms a unique key for property in New York City
EASE.MENT: An easement is a right, such as a right of way, which allows an entity to make limited use of another’s real property
BUILDING.CLASS.AT.PRESENT, BUILDING.CLASS.AT.TIME.OF.SALE: The type of building at various points in time (for example, “A” signifies one-family homes, “O” signifies office buildings, and “R” signifies condominiums)[^6]
ADDRESS: Street address of the property
APARTMENT.NUMBER: Apartment number if applicable
ZIP.CODE: The property’s postal code
RESIDENTIAL.UNITS: The number of residential units at the listed property
COMMERCIAL.UNITS: The number of commercial units at the listed property
LAND.SQUARE.FEET: The land area of the property listed in square feet
GROSS.SQUARE.FEET: The total area of all the floors of a building as measured from the exterior surfaces of the outside walls of the building, including the land area and space within any building or structure on the property
YEAR.BUILT: Year the property was built
SALE.PRICE: Price paid for the property
SALE.DATE: Date the property sold
This dataset uses the financial definition of a building/building unit, for tax purposes. In case a single entity owns the building in question, a sale covers the value of the entire building. In case a building is owned piecemeal by its residents (a condominium), a sale refers to a single apartment (or group of apartments) owned by some individual.
We use dplyr to keep only the numerical (integer) variables, while keeping BOROUGH and TAX.CLASS.AT.TIME.OF.SALE as class (factor) variables in the data we want to analyse.
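library(dplyr)
# Non-numeric placeholders such as " -  " become NA here (with a coercion warning)
property <- property_data %>%
  mutate(LAND.SQUARE.FEET = as.integer(LAND.SQUARE.FEET),
         GROSS.SQUARE.FEET = as.integer(GROSS.SQUARE.FEET),
         SALE.PRICE = as.integer(SALE.PRICE)
  ) %>%
  select_if(is.integer) %>%                 # keep integer columns only
  select(-c(X, BLOCK, LOT, ZIP.CODE)) %>%   # drop identifier-like columns
  filter(complete.cases(.))                 # drop rows that contain any NA
# Treat borough and tax class as categorical variables
property$BOROUGH <- as.factor(property$BOROUGH)
property$TAX.CLASS.AT.TIME.OF.SALE <- as.factor(property$TAX.CLASS.AT.TIME.OF.SALE)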
glimpse(property)
## Observations: 48,243
## Variables: 9
## $ BOROUGH <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ RESIDENTIAL.UNITS <int> 5, 10, 6, 8, 24, 10, 24, 3, 4, 5, 0,...
## $ COMMERCIAL.UNITS <int> 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, ...
## $ TOTAL.UNITS <int> 5, 10, 6, 8, 24, 10, 24, 4, 5, 6, 1,...
## $ LAND.SQUARE.FEET <int> 1633, 2272, 2369, 1750, 4489, 3717, ...
## $ GROSS.SQUARE.FEET <int> 6440, 6794, 4615, 4226, 18523, 12350...
## $ YEAR.BUILT <int> 1900, 1913, 1900, 1920, 1920, 2009, ...
## $ TAX.CLASS.AT.TIME.OF.SALE <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 4, 1, ...
## $ SALE.PRICE <int> 6625000, 3936272, 8000000, 3192840, ...
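The row count drops from 84,548 to 48,243 because strings that are not numbers cannot be parsed as integers. A minimal sketch of that mechanism in base R, on a made-up three-row data frame (the `toy` object and its values are assumptions for illustration, not rows read from nyc.csv):

```r
# Hypothetical toy data mimicking the " -  " placeholder seen in SALE.PRICE
toy <- data.frame(
  SALE.PRICE = c("6625000", " -  ", "3936272"),
  TOTAL.UNITS = c(5L, 31L, 10L),
  stringsAsFactors = FALSE
)

# as.integer() returns NA (with a warning) for strings that are not numbers
toy$SALE.PRICE <- suppressWarnings(as.integer(toy$SALE.PRICE))

# complete.cases() flags rows without any NA; keep only those rows
toy_clean <- toy[complete.cases(toy), ]
print(toy_clean)
```

One caveat worth keeping in mind: as.integer() also returns NA for values above roughly 2.1 billion, so any sale price beyond that limit would be dropped by the same filter.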
We have to scale our data first (with scale.unit = TRUE inside PCA()) because our variables are measured in different units. Without scaling, the variance explained by the principal components would be dominated by the variables with the largest ranges.
(property_PCA <- PCA(property, quali.sup = c(1,8), scale.unit = TRUE, graph = FALSE))
## **Results for the Principal Component Analysis (PCA)**
## The analysis was performed on 48243 individuals, described by 9 variables
## *The results are available in the following objects:
##
## name
## 1 "$eig"
## 2 "$var"
## 3 "$var$coord"
## 4 "$var$cor"
## 5 "$var$cos2"
## 6 "$var$contrib"
## 7 "$ind"
## 8 "$ind$coord"
## 9 "$ind$cos2"
## 10 "$ind$contrib"
## 11 "$quali.sup"
## 12 "$quali.sup$coord"
## 13 "$quali.sup$v.test"
## 14 "$call"
## 15 "$call$centre"
## 16 "$call$ecart.type"
## 17 "$call$row.w"
## 18 "$call$col.w"
## description
## 1 "eigenvalues"
## 2 "results for the variables"
## 3 "coord. for the variables"
## 4 "correlations variables - dimensions"
## 5 "cos2 for the variables"
## 6 "contributions of the variables"
## 7 "results for the individuals"
## 8 "coord. for the individuals"
## 9 "cos2 for the individuals"
## 10 "contributions of the individuals"
## 11 "results for the supplementary categorical variables"
## 12 "coord. for the supplementary categories"
## 13 "v-test of the supplementary categories"
## 14 "summary statistics"
## 15 "mean of the variables"
## 16 "standard error of the variables"
## 17 "weights for the individuals"
## 18 "weights for the variables"
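To see why scale.unit = TRUE matters, here is a small sketch on simulated data. It uses base R's prcomp() rather than FactoMineR's PCA() so that it runs without extra packages; the variable names and distributions are made up for illustration.

```r
set.seed(1)
# Two independent toy variables on very different scales (assumed data)
toy <- data.frame(
  units = rnorm(200, mean = 3, sd = 1),
  price = rnorm(200, mean = 1e6, sd = 1000)
)

# Share of the total variance carried by the first principal component
pc1_share <- function(fit) fit$sdev[1]^2 / sum(fit$sdev^2)

unscaled <- pc1_share(prcomp(toy, scale. = FALSE))
scaled   <- pc1_share(prcomp(toy, scale. = TRUE))

cat("PC1 share without scaling:", round(unscaled, 4), "\n")
cat("PC1 share with scaling:   ", round(scaled, 4), "\n")
```

Without scaling, the first component is almost entirely the large-range variable; after scaling, the two independent variables share the variance roughly equally.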
summary(property_PCA)
##
## Call:
## PCA(X = property, scale.unit = TRUE, quali.sup = c(1, 8), graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 2.908 1.179 0.999 0.976 0.706 0.232
## % of var. 41.541 16.843 14.274 13.941 10.092 3.309
## Cumulative % of var. 41.541 58.384 72.658 86.598 96.690 100.000
## Dim.7
## Variance 0.000
## % of var. 0.000
## Cumulative % of var. 100.000
##
## Individuals (the 10 first)
## Dist Dim.1 ctr cos2 Dim.2
## 1 | 0.675 | 0.294 0.000 0.189 | -0.148
## 2 | 0.671 | 0.508 0.000 0.574 | -0.030
## 3 | 0.833 | 0.359 0.000 0.186 | -0.158
## 4 | 0.507 | 0.331 0.000 0.426 | -0.005
## 5 | 2.403 | 1.808 0.002 0.566 | -0.304
## 6 | 1.279 | 0.792 0.000 0.384 | -0.261
## 7 | 2.071 | 1.663 0.002 0.644 | -0.176
## 8 | 0.323 | 0.088 0.000 0.074 | 0.034
## 9 | 0.727 | 0.289 0.000 0.158 | -0.080
## 10 | 0.500 | 0.242 0.000 0.234 | 0.016
## ctr cos2 Dim.3 ctr cos2
## 1 0.000 0.048 | 0.198 0.000 0.086 |
## 2 0.000 0.002 | 0.195 0.000 0.085 |
## 3 0.000 0.036 | 0.206 0.000 0.061 |
## 4 0.000 0.000 | 0.209 0.000 0.170 |
## 5 0.000 0.016 | 0.279 0.000 0.014 |
## 6 0.000 0.042 | 0.447 0.000 0.122 |
## 7 0.000 0.007 | 0.263 0.000 0.016 |
## 8 0.000 0.011 | 0.198 0.000 0.375 |
## 9 0.000 0.012 | 0.204 0.000 0.079 |
## 10 0.000 0.001 | 0.206 0.000 0.169 |
##
## Variables
## Dim.1 ctr cos2 Dim.2 ctr cos2
## RESIDENTIAL.UNITS | 0.855 25.139 0.731 | -0.097 0.799 0.009
## COMMERCIAL.UNITS | 0.316 3.441 0.100 | 0.886 66.516 0.784
## TOTAL.UNITS | 0.887 27.055 0.787 | 0.387 12.720 0.150
## LAND.SQUARE.FEET | 0.640 14.104 0.410 | -0.285 6.884 0.081
## GROSS.SQUARE.FEET | 0.854 25.071 0.729 | -0.310 8.136 0.096
## YEAR.BUILT | 0.044 0.068 0.002 | -0.018 0.026 0.000
## SALE.PRICE | 0.386 5.124 0.149 | -0.241 4.919 0.058
## Dim.3 ctr cos2
## RESIDENTIAL.UNITS | -0.019 0.037 0.000 |
## COMMERCIAL.UNITS | 0.015 0.022 0.000 |
## TOTAL.UNITS | -0.009 0.008 0.000 |
## LAND.SQUARE.FEET | -0.075 0.556 0.006 |
## GROSS.SQUARE.FEET | -0.005 0.002 0.000 |
## YEAR.BUILT | 0.994 98.883 0.988 |
## SALE.PRICE | 0.070 0.492 0.005 |
##
## Supplementary categories
## Dist Dim.1 cos2 v.test Dim.2
## BOROUGH 1 | 2.469 | 1.981 0.644 37.226 | -0.352
## BOROUGH 2 | 0.200 | 0.048 0.057 2.535 | 0.004
## BOROUGH 3 | 0.150 | -0.076 0.253 -9.705 | 0.014
## BOROUGH 4 | 0.243 | -0.016 0.004 -1.125 | 0.013
## BOROUGH 5 | 0.339 | -0.066 0.037 -2.893 | -0.031
## TAX.CLASS.AT.TIME.OF.SALE 1 | 0.190 | -0.106 0.312 -19.555 | 0.000
## TAX.CLASS.AT.TIME.OF.SALE 2 | 0.329 | 0.170 0.265 12.710 | 0.023
## TAX.CLASS.AT.TIME.OF.SALE 3 | 3.947 | -0.383 0.009 -0.318 | 0.102
## TAX.CLASS.AT.TIME.OF.SALE 4 | 0.892 | 0.382 0.183 13.933 | -0.079
## cos2 v.test Dim.3 cos2 v.test
## BOROUGH 1 0.020 -10.398 | 0.290 0.014 9.279 |
## BOROUGH 2 0.000 0.360 | -0.183 0.838 -16.610 |
## BOROUGH 3 0.009 2.870 | -0.128 0.723 -27.997 |
## BOROUGH 4 0.003 1.393 | 0.235 0.934 28.209 |
## BOROUGH 5 0.008 -2.164 | 0.289 0.727 21.780 |
## TAX.CLASS.AT.TIME.OF.SALE 1 0.000 -0.002 | 0.147 0.597 46.181 |
## TAX.CLASS.AT.TIME.OF.SALE 2 0.005 2.741 | -0.224 0.460 -28.584 |
## TAX.CLASS.AT.TIME.OF.SALE 3 0.001 0.132 | -3.911 0.982 -5.533 |
## TAX.CLASS.AT.TIME.OF.SALE 4 0.008 -4.546 | -0.565 0.400 -35.125 |
From the summary, the first 4 principal components together retain 86.6% of the variance, so 4 components are enough to keep more than 80% of the variation in our data.
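The choice of 4 components can be reproduced from the eigenvalues alone. The sketch below copies the rounded eigenvalues from the summary output and finds the first component at which the cumulative percentage passes a cut-off; the 80% threshold is the one mentioned in the text, and the variable names are our own.

```r
# Eigenvalues copied (rounded) from the summary output
eigenvalues <- c(2.908, 1.179, 0.999, 0.976, 0.706, 0.232, 0.000)

pct_var <- eigenvalues / sum(eigenvalues) * 100
cum_pct <- cumsum(pct_var)

threshold <- 80  # assumed cut-off, matching the 80% mentioned in the text
n_keep <- which(cum_pct >= threshold)[1]

print(round(cum_pct, 1))
print(n_keep)
```

With the fitted object at hand, the same cumulative percentages are available in the third column of property_PCA$eig.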
glimpse(property_PCA)
## List of 6
## $ eig : num [1:7, 1:3] 2.908 1.179 0.999 0.976 0.706 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:7] "comp 1" "comp 2" "comp 3" "comp 4" ...
## .. ..$ : chr [1:3] "eigenvalue" "percentage of variance" "cumulative percentage of variance"
## $ var :List of 4
## ..$ coord : num [1:7, 1:5] 0.855 0.316 0.887 0.64 0.854 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ cor : num [1:7, 1:5] 0.855 0.316 0.887 0.64 0.854 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ cos2 : num [1:7, 1:5] 0.731 0.1 0.787 0.41 0.729 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ contrib: num [1:7, 1:5] 25.14 3.44 27.05 14.1 25.07 ...
## .. ..- attr(*, "dimnames")=List of 2
## $ ind :List of 4
## ..$ coord : num [1:48243, 1:5] 0.294 0.508 0.359 0.331 1.808 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ cos2 : num [1:48243, 1:5] 0.189 0.574 0.186 0.426 0.566 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ contrib: num [1:48243, 1:5] 0.0000614 0.0001841 0.000092 0.000078 0.0023293 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ dist : Named num [1:48243] 0.675 0.671 0.833 0.507 2.403 ...
## .. ..- attr(*, "names")= chr [1:48243] "1" "2" "3" "4" ...
## $ svd :List of 3
## ..$ vs: num [1:7] 1.705 1.086 1 0.988 0.841 ...
## ..$ U : num [1:48243, 1:5] 0.172 0.298 0.211 0.194 1.06 ...
## ..$ V : num [1:7, 1:5] 0.501 0.185 0.52 0.376 0.501 ...
## $ quali.sup:List of 5
## ..$ coord : num [1:9, 1:5] 1.9814 0.0476 -0.0756 -0.016 -0.0656 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ cos2 : num [1:9, 1:5] 0.6443 0.05679 0.253 0.00432 0.03733 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ v.test: num [1:9, 1:5] 37.23 2.54 -9.7 -1.12 -2.89 ...
## .. ..- attr(*, "dimnames")=List of 2
## ..$ dist : Named num [1:9] 2.469 0.2 0.15 0.243 0.339 ...
## .. ..- attr(*, "names")= chr [1:9] "BOROUGH 1" "BOROUGH 2" "BOROUGH 3" "BOROUGH 4" ...
## ..$ eta2 : num [1:2, 1:5] 0.029395 0.008823 0.0024 0.000513 0.036288 ...
## .. ..- attr(*, "dimnames")=List of 2
## $ call :List of 10
## ..$ row.w : num [1:48243] 0.0000207 0.0000207 0.0000207 0.0000207 0.0000207 ...
## ..$ col.w : num [1:7] 1 1 1 1 1 1 1
## ..$ scale.unit: logi TRUE
## ..$ ncp : num 5
## ..$ centre : num [1:7] 2.567 0.248 2.834 3356.5 3636.935 ...
## ..$ ecart.type: num [1:7] 17.5 11 20.7 31433.9 28579.9 ...
## ..$ X :'data.frame': 48243 obs. of 9 variables:
## .. ..$ BOROUGH : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ RESIDENTIAL.UNITS : int [1:48243] 5 10 6 8 24 10 24 3 4 5 ...
## .. ..$ COMMERCIAL.UNITS : int [1:48243] 0 0 0 0 0 0 0 1 1 1 ...
## .. ..$ TOTAL.UNITS : int [1:48243] 5 10 6 8 24 10 24 4 5 6 ...
## .. ..$ LAND.SQUARE.FEET : int [1:48243] 1633 2272 2369 1750 4489 3717 4131 1520 2201 1779 ...
## .. ..$ GROSS.SQUARE.FEET : int [1:48243] 6440 6794 4615 4226 18523 12350 16776 3360 5608 3713 ...
## .. ..$ YEAR.BUILT : int [1:48243] 1900 1913 1900 1920 1920 2009 1928 1910 1900 1910 ...
## .. ..$ TAX.CLASS.AT.TIME.OF.SALE: Factor w/ 4 levels "1","2","3","4": 2 2 2 2 2 2 2 2 2 2 ...
## .. ..$ SALE.PRICE : int [1:48243] 6625000 3936272 8000000 3192840 16232000 10350000 11900000 3300000 7215000 4750000 ...
## ..$ row.w.init: num [1:48243] 1 1 1 1 1 1 1 1 1 1 ...
## ..$ call : language PCA(X = property, scale.unit = TRUE, quali.sup = c(1, 8), graph = FALSE)
## ..$ quali.sup :List of 5
## .. ..$ quali.sup :'data.frame': 48243 obs. of 2 variables:
## .. ..$ modalite : int [1:2] 5 4
## .. ..$ nombre : num [1:9] 1005 7049 24047 11078 5064 ...
## .. ..$ barycentre:'data.frame': 9 obs. of 7 variables:
## .. ..$ numero : num [1:2] 1 8
## - attr(*, "class")= chr [1:2] "PCA" "list "
The eigenvalues ($eig), the results for the variables ($var), the means of the variables ($call$centre), and the standard deviations of the variables ($call$ecart.type, described as “standard error” in the output above):
property_PCA$eig
## eigenvalue percentage of variance
## comp 1 2.90787699968 41.5410999955
## comp 2 1.17898738072 16.8426768674
## comp 3 0.99918293523 14.2740419318
## comp 4 0.97584668055 13.9406668651
## comp 5 0.70644025663 10.0920036662
## comp 6 0.23163936792 3.3091338275
## comp 7 0.00002637926 0.0003768466
## cumulative percentage of variance
## comp 1 41.54110
## comp 2 58.38378
## comp 3 72.65782
## comp 4 86.59849
## comp 5 96.69049
## comp 6 99.99962
## comp 7 100.00000
property_PCA$var
## $coord
## Dim.1 Dim.2 Dim.3 Dim.4
## RESIDENTIAL.UNITS 0.85498392 -0.09703775 -0.019233785 -0.16704593
## COMMERCIAL.UNITS 0.31630325 0.88555744 0.014968969 0.14712653
## TOTAL.UNITS 0.88697377 0.38725903 -0.008856898 -0.06267080
## LAND.SQUARE.FEET 0.64040411 -0.28489754 -0.074548106 -0.38758659
## GROSS.SQUARE.FEET 0.85382986 -0.30971630 -0.004504338 0.08036685
## YEAR.BUILT 0.04435943 -0.01755000 0.993991182 -0.09605043
## SALE.PRICE 0.38600345 -0.24081261 0.070102083 0.86974744
## Dim.5
## RESIDENTIAL.UNITS -0.47077704
## COMMERCIAL.UNITS 0.30401284
## TOTAL.UNITS -0.23530494
## LAND.SQUARE.FEET 0.55477904
## GROSS.SQUARE.FEET 0.15132125
## YEAR.BUILT 0.02109446
## SALE.PRICE 0.07677326
##
## $cor
## Dim.1 Dim.2 Dim.3 Dim.4
## RESIDENTIAL.UNITS 0.85498392 -0.09703775 -0.019233785 -0.16704593
## COMMERCIAL.UNITS 0.31630325 0.88555744 0.014968969 0.14712653
## TOTAL.UNITS 0.88697377 0.38725903 -0.008856898 -0.06267080
## LAND.SQUARE.FEET 0.64040411 -0.28489754 -0.074548106 -0.38758659
## GROSS.SQUARE.FEET 0.85382986 -0.30971630 -0.004504338 0.08036685
## YEAR.BUILT 0.04435943 -0.01755000 0.993991182 -0.09605043
## SALE.PRICE 0.38600345 -0.24081261 0.070102083 0.86974744
## Dim.5
## RESIDENTIAL.UNITS -0.47077704
## COMMERCIAL.UNITS 0.30401284
## TOTAL.UNITS -0.23530494
## LAND.SQUARE.FEET 0.55477904
## GROSS.SQUARE.FEET 0.15132125
## YEAR.BUILT 0.02109446
## SALE.PRICE 0.07677326
##
## $cos2
## Dim.1 Dim.2 Dim.3 Dim.4
## RESIDENTIAL.UNITS 0.730997502 0.0094163244 0.00036993848 0.027904344
## COMMERCIAL.UNITS 0.100047744 0.7842119871 0.00022407004 0.021646216
## TOTAL.UNITS 0.786722475 0.1499695595 0.00007844464 0.003927629
## LAND.SQUARE.FEET 0.410117418 0.0811666068 0.00555742010 0.150223366
## GROSS.SQUARE.FEET 0.729025437 0.0959241882 0.00002028906 0.006458830
## YEAR.BUILT 0.001967759 0.0003080025 0.98801847081 0.009225685
## SALE.PRICE 0.148998665 0.0579907122 0.00491430210 0.756460610
## Dim.5
## RESIDENTIAL.UNITS 0.2216310242
## COMMERCIAL.UNITS 0.0924238079
## TOTAL.UNITS 0.0553684128
## LAND.SQUARE.FEET 0.3077797807
## GROSS.SQUARE.FEET 0.0228981219
## YEAR.BUILT 0.0004449763
## SALE.PRICE 0.0058941328
##
## $contrib
## Dim.1 Dim.2 Dim.3 Dim.4
## RESIDENTIAL.UNITS 25.13852898 0.79867898 0.037024099 2.8595008
## COMMERCIAL.UNITS 3.44057687 66.51572357 0.022425327 2.2181985
## TOTAL.UNITS 27.05487456 12.72020057 0.007850879 0.4024842
## LAND.SQUARE.FEET 14.10367145 6.88443389 0.556196458 15.3941566
## GROSS.SQUARE.FEET 25.07071093 8.13615055 0.002030565 0.6618694
## YEAR.BUILT 0.06766996 0.02612432 98.882640604 0.9454032
## SALE.PRICE 5.12396724 4.91868812 0.491832069 77.5183873
## Dim.5
## RESIDENTIAL.UNITS 31.37293240
## COMMERCIAL.UNITS 13.08303244
## TOTAL.UNITS 7.83766387
## LAND.SQUARE.FEET 43.56770127
## GROSS.SQUARE.FEET 3.24133877
## YEAR.BUILT 0.06298852
## SALE.PRICE 0.83434272
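The tables in $var are related to each other. As a sketch, using numbers copied from the output above: cos2 is the squared coordinate, and the contribution is the cos2 expressed as a percentage of the dimension's eigenvalue. The check below reproduces the printed values for RESIDENTIAL.UNITS on Dim.1 (the variable names in the snippet are our own).

```r
coord_dim1 <- 0.85498392    # $var$coord, RESIDENTIAL.UNITS, Dim.1
eig_dim1 <- 2.90787699968   # $eig, comp 1

cos2_dim1 <- coord_dim1^2
contrib_dim1 <- cos2_dim1 / eig_dim1 * 100

print(cos2_dim1)     # compare with 0.730997502 in $var$cos2
print(contrib_dim1)  # compare with 25.13852898 in $var$contrib
```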
property_PCA$call$centre
## [1] 2.5665900 0.2484506 2.8339655 3356.5001969
## [5] 3636.9349958 1827.7623075 1107495.5967083
property_PCA$call$ecart.type
## [1] 17.46548 10.98693 20.74990 31433.89188 28579.92478
## [6] 464.36073 8857712.41336
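These two vectors are what scale.unit = TRUE uses to standardise each variable: subtract the variable's centre, then divide by its ecart.type. A small sketch with the RESIDENTIAL.UNITS values printed above and the first observation shown by glimpse(property), which has 5 residential units:

```r
centre <- 2.5665900   # RESIDENTIAL.UNITS entry of $call$centre
spread <- 17.46548    # RESIDENTIAL.UNITS entry of $call$ecart.type

x <- 5                # RESIDENTIAL.UNITS of the first row in glimpse(property)
z <- (x - centre) / spread
print(z)
```

After this step every variable has mean 0 and unit spread, which is why SALE.PRICE (in the millions) no longer outweighs the unit counts.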
property_PCA$quali.sup
## $coord
## Dim.1 Dim.2 Dim.3
## BOROUGH 1 1.98143909 -0.352399007367 0.2895092
## BOROUGH 2 0.04758252 0.004300310523 -0.1827417
## BOROUGH 3 -0.07557940 0.014233585870 -0.1278099
## BOROUGH 4 -0.01599782 0.012617273994 0.2351416
## BOROUGH 5 -0.06557540 -0.031240341073 0.2894424
## TAX.CLASS.AT.TIME.OF.SALE 1 -0.10589099 -0.000007537488 0.1465885
## TAX.CLASS.AT.TIME.OF.SALE 2 0.16955791 0.023283324304 -0.2235252
## TAX.CLASS.AT.TIME.OF.SALE 3 -0.38342543 0.101643219101 -3.9105636
## TAX.CLASS.AT.TIME.OF.SALE 4 0.38207685 -0.079372880927 -0.5646255
## Dim.4 Dim.5
## BOROUGH 1 1.366165184 -0.30492111
## BOROUGH 2 -0.051049816 -0.03294969
## BOROUGH 3 0.002985195 -0.01579772
## BOROUGH 4 -0.045953693 0.03322987
## BOROUGH 5 -0.113715606 0.10870366
## TAX.CLASS.AT.TIME.OF.SALE 1 -0.054309713 0.01691500
## TAX.CLASS.AT.TIME.OF.SALE 2 0.034504129 -0.16691869
## TAX.CLASS.AT.TIME.OF.SALE 3 0.334317954 -0.09123121
## TAX.CLASS.AT.TIME.OF.SALE 4 0.374523938 0.41581442
##
## $cos2
## Dim.1 Dim.2 Dim.3
## BOROUGH 1 0.644304411 0.020379757187860 0.01375481
## BOROUGH 2 0.056789658 0.000463846214325 0.83762539
## BOROUGH 3 0.252995218 0.008972926352411 0.72349355
## BOROUGH 4 0.004324796 0.002690141222544 0.93433590
## BOROUGH 5 0.037328674 0.008472121804581 0.72725193
## TAX.CLASS.AT.TIME.OF.SALE 1 0.311650540 0.000000001579076 0.59724103
## TAX.CLASS.AT.TIME.OF.SALE 2 0.264894838 0.004994906545456 0.46035252
## TAX.CLASS.AT.TIME.OF.SALE 3 0.009438117 0.000663254705965 0.98175296
## TAX.CLASS.AT.TIME.OF.SALE 4 0.183309179 0.007910921253578 0.40031679
## Dim.4 Dim.5
## BOROUGH 1 0.3062923001 0.015258247
## BOROUGH 2 0.0653676321 0.027231845
## BOROUGH 3 0.0003946852 0.011053358
## BOROUGH 4 0.0356849294 0.018659556
## BOROUGH 5 0.1122537338 0.102576767
## TAX.CLASS.AT.TIME.OF.SALE 1 0.0819794271 0.007952322
## TAX.CLASS.AT.TIME.OF.SALE 2 0.0109693155 0.256712685
## TAX.CLASS.AT.TIME.OF.SALE 3 0.0071753472 0.000534331
## TAX.CLASS.AT.TIME.OF.SALE 4 0.1761334887 0.217111017
##
## $v.test
## Dim.1 Dim.2 Dim.3
## BOROUGH 1 37.2256802 -10.397526364 9.278758
## BOROUGH 2 2.5352386 0.359836044 -16.610192
## BOROUGH 3 -9.7048031 2.870324932 -27.997124
## BOROUGH 4 -1.1249910 1.393436620 28.208713
## BOROUGH 5 -2.8925185 -2.164135250 21.780248
## TAX.CLASS.AT.TIME.OF.SALE 1 -19.5551384 -0.002186059 46.181420
## TAX.CLASS.AT.TIME.OF.SALE 2 12.7102782 2.741040352 -28.584400
## TAX.CLASS.AT.TIME.OF.SALE 3 -0.3179892 0.132386388 -5.532690
## TAX.CLASS.AT.TIME.OF.SALE 4 13.9329450 -4.545668756 -35.125154
## Dim.4 Dim.5
## BOROUGH 1 44.3059804 -11.6225040
## BOROUGH 2 -4.6952943 -3.5618288
## BOROUGH 3 0.6616879 -4.1155511
## BOROUGH 4 -5.5783516 4.7409698
## BOROUGH 5 -8.6586959 9.7281367
## TAX.CLASS.AT.TIME.OF.SALE 1 -17.3131708 6.3375912
## TAX.CLASS.AT.TIME.OF.SALE 2 4.4648338 -25.3858827
## TAX.CLASS.AT.TIME.OF.SALE 3 0.4786173 -0.1535059
## TAX.CLASS.AT.TIME.OF.SALE 4 23.5759394 30.7639462
##
## $dist
## BOROUGH 1 BOROUGH 2
## 2.4685116 0.1996700
## BOROUGH 3 BOROUGH 4
## 0.1502613 0.2432641
## BOROUGH 5 TAX.CLASS.AT.TIME.OF.SALE 1
## 0.3394061 0.1896815
## TAX.CLASS.AT.TIME.OF.SALE 2 TAX.CLASS.AT.TIME.OF.SALE 3
## 0.3294438 3.9467375
## TAX.CLASS.AT.TIME.OF.SALE 4
## 0.8923981
##
## $eta2
## Dim.1 Dim.2 Dim.3 Dim.4
## BOROUGH 0.029394967 0.002400124 0.03628809 0.04212607
## TAX.CLASS.AT.TIME.OF.SALE 0.008823114 0.000513238 0.05143083 0.01301342
## Dim.5
## BOROUGH 0.005257136
## TAX.CLASS.AT.TIME.OF.SALE 0.028414011
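The glimpse of the PCA object lists a barycentre entry under $call$quali.sup, and as far as we understand FactoMineR, the coordinates of a supplementary category are the average position of the individuals in that category. The sketch below illustrates that idea with base R's prcomp() on simulated data; the groups, variables, and the 3-unit shift are assumptions for illustration only.

```r
set.seed(7)
# Hypothetical categorical variable with two equally sized groups
group <- factor(rep(c("a", "b"), each = 50))

# Two correlated toy variables; group "b" is shifted upwards
x1 <- rnorm(100) + ifelse(group == "b", 3, 0)
x2 <- x1 + rnorm(100, sd = 0.5)
toy <- data.frame(x1, x2)

# Coordinates of the individuals on the principal components
scores <- as.data.frame(prcomp(toy, scale. = TRUE)$x)

# Mean coordinate per category: the category barycentres
barycentres <- aggregate(scores, by = list(group = group), FUN = mean)
print(barycentres)
```

A category whose barycentre sits far from the origin on a dimension, as BOROUGH 1 does on Dim.1 above, is one whose members differ markedly from the average individual along that dimension.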
When using select, we include:
select = "contrib 5": label the 5 elements that have the highest contribution on the 2 dimensions of our plot
## Individuals Factor Map
plot.PCA(property_PCA,
cex = 0.6,
choix = c("ind"),
select = "contrib 5",
habillage = 8) # Tax class for legend
## Variable Factor Map
plot.PCA(property_PCA,
cex=0.6,
choix = c("var"))
With our variables factor map plotted (second plot), let’s use the dimdesc() function to identify the variables and categories that are most characteristic of each dimension obtained in the PCA. (This is consistent with the result from summary().)
Here we focus on the quantitative variables that “best describe” the first and second dimensions; note that the function automatically sorts them by correlation in descending order.
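As a rough sketch of what the quantitative part of this description measures: the correlation between each variable and the individuals' coordinates on a dimension, sorted in descending order. The example reproduces that idea with base R's prcomp() on simulated data (all names and data are assumptions); it does not call dimdesc() itself and skips the p-values.

```r
set.seed(42)
# Hypothetical latent "size" driving two of the three toy variables
size <- rnorm(300)
toy <- data.frame(
  units = size + rnorm(300, sd = 0.3),
  area  = size + rnorm(300, sd = 0.3),
  year  = rnorm(300)
)

pca <- prcomp(toy, scale. = TRUE)
dim1 <- pca$x[, 1]  # coordinates of the individuals on the first dimension

# Correlation of each variable with the first dimension, sorted
cors <- sapply(toy, function(v) cor(v, dim1))
cors_sorted <- sort(cors, decreasing = TRUE)
print(round(cors_sorted, 3))
```

The sign of a principal component is arbitrary, so only the magnitudes and the relative ordering are meaningful here: the two size-driven variables correlate strongly with the first dimension, while the unrelated one barely does.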
dimdesc(property_PCA)
## $Dim.1
## $Dim.1$quanti
## correlation p.value
## TOTAL.UNITS 0.88697377 0.000000000000000000000000000
## RESIDENTIAL.UNITS 0.85498392 0.000000000000000000000000000
## GROSS.SQUARE.FEET 0.85382986 0.000000000000000000000000000
## LAND.SQUARE.FEET 0.64040411 0.000000000000000000000000000
## SALE.PRICE 0.38600345 0.000000000000000000000000000
## COMMERCIAL.UNITS 0.31630325 0.000000000000000000000000000
## YEAR.BUILT 0.04435943 0.000000000000000000000188547
##
## $Dim.1$quali
## R2
## BOROUGH 0.029394967
## TAX.CLASS.AT.TIME.OF.SALE 0.008823114
## p.value
## BOROUGH 0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000213916
## TAX.CLASS.AT.TIME.OF.SALE 0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000024294661395084662660018860854146705015708451330674062836799611835927685182349483674513938722391525017081656021498561652619427484751253128718524302445455902281821974844141251854343014066901787618619004426259248844299883302641
##
## $Dim.1$category
## Estimate
## 1 1.6070653
## 4 0.3664973
## 2 0.1539783
## 2 -0.3267913
## 5 -0.4399492
## 3 -0.4499532
## 1 -0.1214706
## p.value
## 1 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001043815
## 4 0.0000000000000000000000000000000000000000000329153775615770020911642912732372638933177643462111757427108925700012128833096382648177450962799388765763259246006910974102765976567752659320831298828125000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## 2 0.0000000000000000000000000000000000004535404307008517481500053664971626594853625738998926919166963189718464064188430565909318098471742636279557814305007923394441604614257812500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## 2 0.0112356300103029615317096201465574267785996198654174804687500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## 5 0.0038206879643984691975744372172130169929005205631256103515625000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## 3 0.0000000000000000000002749672808466575869358360091487126560195843103841720985759927015079640000294602941721677780151367187500000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## 1 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000001748147462519148452359124632461778150748826580477113888583998314459509434395101347782169052558164076420359718310264061442021928818821925178510865763372309608994316562134031164193239453759913311311294132149537529363758636691272841
##
##
## $Dim.2
## $Dim.2$quanti
## correlation
## COMMERCIAL.UNITS 0.88555744
## TOTAL.UNITS 0.38725903
## YEAR.BUILT -0.01755000
## RESIDENTIAL.UNITS -0.09703775
## SALE.PRICE -0.24081261
## LAND.SQUARE.FEET -0.28489754
## GROSS.SQUARE.FEET -0.30971630
## p.value
## COMMERCIAL.UNITS 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## TOTAL.UNITS 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## YEAR.BUILT 0.00011576294932938061650259942148011305107502266764640808105468750000000000000000000000000000000000000000000
## RESIDENTIAL.UNITS 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002918995
## SALE.PRICE 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## LAND.SQUARE.FEET 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## GROSS.SQUARE.FEET 0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
##
## $Dim.2$quali
## R2 p.value
## BOROUGH 0.002400124 0.000000000000000000000003973015
## TAX.CLASS.AT.TIME.OF.SALE 0.000513238 0.000017292220738148899770395689
##
## $Dim.2$category
## Estimate p.value
## 3 0.08473122 0.004099491264680589323876613861
## 2 0.01189679 0.006123303495678600674723135455
## 5 0.03925729 0.030452522172288578466980979442
## 4 -0.09075941 0.000005465298646257913447316090
## 1 -0.28190137 0.000000000000000000000000239752
##
##
## $Dim.3
## $Dim.3$quanti
## correlation
## YEAR.BUILT 0.99399118
## SALE.PRICE 0.07010208
## COMMERCIAL.UNITS 0.01496897
## RESIDENTIAL.UNITS -0.01923378
## LAND.SQUARE.FEET -0.07454811
## p.value
## YEAR.BUILT 0.000000000000000000000000000000000000000000000000000000000000000000
## SALE.PRICE 0.000000000000000000000000000000000000000000000000000012779263457765
## COMMERCIAL.UNITS 0.001009281260546321607254882657400685275206342339515686035156250000
## RESIDENTIAL.UNITS 0.000023910481720145631851642820109304921061266213655471801757812500
## LAND.SQUARE.FEET 0.000000000000000000000000000000000000000000000000000000000002030628
##
## $Dim.3$quali
## R2 p.value
## BOROUGH 0.03628809 0
## TAX.CLASS.AT.TIME.OF.SALE 0.05143083 0
##
## $Dim.3$category
## Estimate
## 1 1.2846200
## 4 0.5734059
## 2 0.9145062
## 4 0.1344333
## 5 0.1887341
## 1 0.1888009
## 3 -2.7725321
## 2 -0.2834500
## 3 -0.2285182
## p.value
## 1 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## 4 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000009226337
## 2 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000003200863174319934831159695851198660266247852320594460240302195111672638963014795625248319663702820550
## 4 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000167037749568152415667071495338133056811324669913256487057695090182692141192490737002168992786056607706283
## 5 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011096322854539416200659894539489379598694373288985007149737575645869886495684904237520340918333770171176914928764463170639847323161363357279599016303429988552573965820012337472
## 1 0.0000000000000000000165147381655864597382188648610615061114022303989091368793366843004122301863390021026134490966796875000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## 3 0.0000000313935158681424085383343037559955579496318023302592337131500244140625000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
## 2 0.0000000000000000000000000000000000000000000000000000000000000396915802406432462492421110789460489473228378621684504727858589446452456021403627630265892588902191755044054875595762851899583728648927367902983960273421808195859483703316072933375835418701171875000000000000000000000000
## 3 0.0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000071029404444766175710980547551243671287715718013212165327325541538632974825706470415557429633311878021600959
In the following graphs, we see that PC1 and PC2 combined separate the tax classes in our dataset reasonably well, but do a poorer job of separating the boroughs:
plotellipses(property_PCA, keepvar = "quali.sup")
plotellipses(property_PCA, keepvar = 8,
ylim = c(-50, 50), xlim = c(-10, 60))
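As a rough numeric companion to the ellipses, we could compute, for each component, the share of variance in the scores that a grouping variable explains. This is the same kind of R2 reported in the `$quali` tables. The sketch below assumes the built-in `iris` data and base R's `prcomp()` as stand-ins for our data and `property_PCA`; `r2_by_group` is a hypothetical helper, not a FactoMineR function.

```r
# Stand-in data (assumption): iris, 4 numeric columns plus the Species factor.
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Hypothetical helper: R2 of a one-way ANOVA of component scores on a factor.
r2_by_group <- function(scores, group) {
  summary(lm(scores ~ group))$r.squared
}

# R2 of the grouping variable on each of the first two components.
r2 <- sapply(1:2, function(k) r2_by_group(pca$x[, k], iris$Species))
names(r2) <- c("PC1", "PC2")
print(round(r2, 3))
```

An R2 close to 1 means the groups sit far apart along that component, while a value close to 0 means their ellipses largely overlap along it.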
From this analysis, we can tell which tax class a property that we want to buy or sell in New York City is likely to fall into.