LDA CLASSIFICATION

Objective - to determine which variables are likely to classify a property transaction into the PROFIT or LOSS segment.

Dataset from https://www.edgeprop.sg/pullout/the-edge-property. I took the profit and loss property transactions covering 3 weeks of October and 2 weeks of September.

setwd("E:\\interim\\property")
data <- read.csv("profit_loss.csv", header = TRUE, stringsAsFactors = FALSE)

# Convert columns into the correct classes

data$DISTRICT <- as.factor(data$DISTRICT)
data$MAKE <- as.factor(data$MAKE)
data$AREA <- as.numeric(data$AREA)
data$SALE_PSF <- as.numeric(data$SALE_PSF)
data$PURCHASE_PSF <- as.numeric(data$PURCHASE_PSF)
data$PROFIT_LOSS <- as.numeric(data$PROFIT_LOSS)

head(data)

##             PROJECT DISTRICT AREA SALE_PSF PURCHASE_PSF HOLD_YEARS MAKE
## 1      Beverly Hill       10 3778     1747          709       14.4    P
## 2   Orchard Bel Air       10 3229     1208          483       12.2    P
## 3     The Waterside       15 2433     1554          848       17.6    P
## 4     Bayshore Park       16 3531      765          331       11.6    P
## 5     Newton Suites       11 1238     1793          798       12.4    P
## 6 Mount Faber Lodge        4 2594      964          524       22.2    P
##   PROFIT_LOSS ANNUALISED
## 1     3920000          6
## 2     2340000          8
## 3     1717000          4
## 4     1531112          7
## 5     1232000          7
## 6     1140000          3

str(data)

## 'data.frame':    100 obs. of  9 variables:
##  $ PROJECT     : chr  "Beverly Hill" "Orchard Bel Air" "The Waterside" "Bayshore Park" ...
##  $ DISTRICT    : Factor w/ 18 levels "1","3","4","5",..: 7 7 11 12 8 3 16 8 16 16 ...
##  $ AREA        : num  3778 3229 2433 3531 1238 ...
##  $ SALE_PSF    : num  1747 1208 1554 765 1793 ...
##  $ PURCHASE_PSF: num  709 483 848 331 798 524 637 803 924 301 ...
##  $ HOLD_YEARS  : num  14.4 12.2 17.6 11.6 12.4 22.2 11 15.3 8.2 16.3 ...
##  $ MAKE        : Factor w/ 2 levels "L","P": 2 2 2 2 2 2 2 2 2 2 ...
##  $ PROFIT_LOSS : num  3920000 2340000 1717000 1531112 1232000 ...
##  $ ANNUALISED  : num  6 8 4 7 7 3 8 5 6 6 ...

Some explanations of the variables listed

PROJECT - the name of the condo.

AREA - the floor area in square feet.

SALE_PSF - the sale price per square foot (psf) when sold in 2017.

PURCHASE_PSF - the purchase price psf when the property was bought some years earlier.

HOLD_YEARS - the number of years the property was held before being SOLD.

PROFIT_LOSS - the absolute profit or loss in dollars.

ANNUALISED - the profit or loss as an annualised % over the holding years. A negative value means a loss and a positive value means a profit.

MAKE - a factor that I added to indicate Profit (P) or Loss (L). It is the class label: the aim is to classify a property as a likely PROFIT or LOSS based on the other, independent variables.

Which variables are CONTROLLABLE by the BUYER?

I would say AREA, HOLD_YEARS and PURCHASE_PSF. Why? Because a property buyer can decide how long to HOLD a property, how BIG an area to buy, and at what price to PURCHASE at that point in time.

Which variables are useless in this analysis?

The condo name (PROJECT) is useless because it cannot be used to determine profit or loss. SALE_PSF could be a useful variable, but the seller usually goes with the market price and cannot dictate it. PROFIT_LOSS and ANNUALISED are DERIVED values that already encode the profit or loss outcome, so they are useless as predictors.
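As a small illustration of the selection described above, the sketch below keeps only the class label and the three buyer-controllable variables. The helper name `keep_buyer_columns` is my own and is not part of the original analysis; the column names are taken from the `str(data)` output.

```r
# Hypothetical helper: keep the class label (MAKE) plus the variables
# the buyer can control. Column names follow the str(data) output.
keep_buyer_columns <- function(df) {
  df[, c("MAKE", "AREA", "HOLD_YEARS", "PURCHASE_PSF"), drop = FALSE]
}

# Usage with the data frame loaded earlier:
# model_data <- keep_buyer_columns(data)
# str(model_data)
```

Dropping PROJECT, SALE_PSF, PROFIT_LOSS and ANNUALISED up front makes it harder to pass a derived column to the classifier by accident.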

Prepare Data

# Create subsets of the Profit and Loss transactions
profit <- subset(data, MAKE == "P")
loss <- subset(data, MAKE == "L")

Visualization of AREA against HOLDING YEARS

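# The x-axis spans the full range of holding years so that no transaction is clipped (some were held for more than 20 years)
plot(profit$HOLD_YEARS, profit$AREA, pch = 20, col = 2, xlim = range(data$HOLD_YEARS, na.rm = TRUE), ylim = c(500, 5000), main = "Holding Years vs Area", xlab = "Holding Years", ylab = "Area")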
points(loss$HOLD_YEARS, loss$AREA, col=3, pch=15)
legend("topright", legend =c("Profit", "Loss"), pch=c(20,15), col=c(2:3))

Can these 2 variables be used to classify transactions into PROFIT and LOSS?
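The code for this step is not shown, only the call echoed in the output below. A minimal sketch of how such a model could be fitted follows, assuming the `lda()` function from the MASS package. The helper names `fit_make_lda` and `confusion_table` are my own, and I pass the data frame through `data =` instead of writing `data$...` inside the formula, so that `predict()` can be given a data frame.

```r
library(MASS)  # provides lda()

# Fit a two-predictor LDA. `df` is assumed to have the columns used in this
# document: MAKE (factor with levels "L" and "P"), HOLD_YEARS and AREA.
fit_make_lda <- function(df) {
  lda(MAKE ~ HOLD_YEARS + AREA, data = df, na.action = na.omit)
}

# Cross-tabulate the actual MAKE against the class predicted by the model.
confusion_table <- function(fit, df) {
  predicted <- predict(fit, newdata = df)$class
  table(actual = df$MAKE, predicted = predicted)
}

# Usage with the data frame loaded earlier:
# fit <- fit_make_lda(data)
# fit
# confusion_table(fit, data)
```

Note that `confusion_table(fit, data)` scores the same rows that were used for fitting, so it measures in-sample fit only.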

## Call:
## lda(data$MAKE ~ data$HOLD_YEARS + data$AREA, na.action = "na.omit")
## 
## Prior probabilities of groups:
##   L   P 
## 0.5 0.5 
## 
## Group means:
##   data$HOLD_YEARS data$AREA
## L           6.666   1574.56
## P          14.394   1987.86
## 
## Coefficients of linear discriminants:
##                          LD1
## data$HOLD_YEARS 0.2877120804
## data$AREA       0.0002344837
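
To see how the coefficients above turn into a classification, here is a worked sketch that computes the LD1 score by hand from the printed numbers. It assumes the score is centred on the prior-weighted average of the two group means, and that with equal priors the cut-off is 0 (positive leans towards P, negative towards L). The function names are my own.

```r
# Values copied from the lda() output above.
coef_ld1    <- c(HOLD_YEARS = 0.2877120804, AREA = 0.0002344837)
mean_loss   <- c(HOLD_YEARS = 6.666,  AREA = 1574.56)
mean_profit <- c(HOLD_YEARS = 14.394, AREA = 1987.86)

# Equal priors (0.5 / 0.5), so the centre is the midpoint of the group means.
centre <- 0.5 * mean_loss + 0.5 * mean_profit

# LD1 score: centred predictors weighted by the discriminant coefficients.
ld1_score <- function(hold_years, area) {
  (hold_years - centre[["HOLD_YEARS"]]) * coef_ld1[["HOLD_YEARS"]] +
    (area - centre[["AREA"]]) * coef_ld1[["AREA"]]
}

# Assumed decision rule for equal priors: positive score -> "P", else "L".
classify_make <- function(hold_years, area) {
  ifelse(ld1_score(hold_years, area) > 0, "P", "L")
}

# First row of head(data): Beverly Hill, held 14.4 years, 3778 sq ft.
ld1_score(14.4, 3778)
classify_make(14.4, 3778)
```

The AREA coefficient looks tiny only because AREA is measured in square feet and runs into the thousands, so the raw coefficients should not be compared directly as a measure of importance.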

##    
##      L  P
##   L 48  2
##   P  8 42
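
The table above can be summarised as an accuracy figure. The sketch below re-enters the printed counts and assumes that rows are the actual classes and columns the predicted classes, which is consistent with each row summing to 50 and the equal priors of 0.5.

```r
# Counts copied from the table above; rows are assumed to be actual classes
# and columns predicted classes.
cm <- matrix(c(48, 8, 2, 42), nrow = 2,
             dimnames = list(actual = c("L", "P"), predicted = c("L", "P")))

# Share of all transactions classified correctly.
accuracy <- sum(diag(cm)) / sum(cm)

# Share of each actual class classified correctly.
class_recall <- diag(cm) / rowSums(cm)

accuracy
class_recall
```

Under that reading, 90 of the 100 transactions fall on the diagonal, with 48 of 50 losses and 42 of 50 profits classified correctly. The source does not show whether these predictions come from the data used to fit the model or from a hold-out set, so the figure may be optimistic.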

LIM KAH KHENG (jkklim@hotmail.com)

26 March, 2018