Objective - to determine which variables best classify a property sale into the PROFIT or LOSS segment.

Dataset from https://www.edgeprop.sg/pullout/the-edge-property. I took the property profit and loss transactions covering 2 weeks of September and 3 weeks of October.

## 'data.frame':    100 obs. of  9 variables:
##  $ PROJECT     : chr  "Beverly Hill" "Orchard Bel Air" "The Waterside" "Bayshore Park" ...
##  $ DISTRICT    : Factor w/ 18 levels "1","3","4","5",..: 7 7 11 12 8 3 16 8 16 16 ...
##  $ AREA        : num  3778 3229 2433 3531 1238 ...
##  $ SALE_PSF    : num  1747 1208 1554 765 1793 ...
##  $ PURCHASE_PSF: num  709 483 848 331 798 524 637 803 924 301 ...
##  $ HOLD_YEARS  : num  14.4 12.2 17.6 11.6 12.4 22.2 11 15.3 8.2 16.3 ...
##  $ MAKE        : Factor w/ 2 levels "L","P": 2 2 2 2 2 2 2 2 2 2 ...
##  $ PROFIT_LOSS : num  3920000 2340000 1717000 1531112 1232000 ...
##  $ ANNUALISED  : num  6 8 4 7 7 3 8 5 6 6 ...
##           PROJECT DISTRICT AREA SALE_PSF PURCHASE_PSF HOLD_YEARS MAKE
## 1    Beverly Hill       10 3778     1747          709       14.4    P
## 2 Orchard Bel Air       10 3229     1208          483       12.2    P
## 3   The Waterside       15 2433     1554          848       17.6    P
## 4   Bayshore Park       16 3531      765          331       11.6    P
## 5   Newton Suites       11 1238     1793          798       12.4    P
##   PROFIT_LOSS ANNUALISED
## 1     3920000          6
## 2     2340000          8
## 3     1717000          4
## 4     1531112          7
## 5     1232000          7

Some explanations of the variables listed:

PROJECT - name of the condo.

AREA - floor area in square feet.

SALE_PSF - price per square foot (psf) at which the unit was sold in 2017.

PURCHASE_PSF - price per square foot at which the unit was bought some years earlier.

HOLD_YEARS - number of years the property was held before it was sold.

MAKE - a factor that I added to mark each sale as Profit (P) or Loss (L). It is the class to be predicted from the other, independent variables.

# MAKE is the dependent categorical variable. The predictors must be numeric to use LDA.
# I removed PROJECT and DISTRICT, which I assume do not affect MAKE (profit/loss).
# Since MAKE already records the outcome (profit/loss), PROFIT_LOSS and ANNUALISED are not needed,
# which is why I select columns 3 to 7.
data1 <- subset(data, select =  c(3:7))
str(data1)
## 'data.frame':    100 obs. of  5 variables:
##  $ AREA        : num  3778 3229 2433 3531 1238 ...
##  $ SALE_PSF    : num  1747 1208 1554 765 1793 ...
##  $ PURCHASE_PSF: num  709 483 848 331 798 524 637 803 924 301 ...
##  $ HOLD_YEARS  : num  14.4 12.2 17.6 11.6 12.4 22.2 11 15.3 8.2 16.3 ...
##  $ MAKE        : Factor w/ 2 levels "L","P": 2 2 2 2 2 2 2 2 2 2 ...
# Understand model
library("MASS")
fit<-lda(MAKE ~ ., data=data1)
fit
## Call:
## lda(MAKE ~ ., data = data1)
## 
## Prior probabilities of groups:
##   L   P 
## 0.5 0.5 
## 
## Group means:
##      AREA SALE_PSF PURCHASE_PSF HOLD_YEARS
## L 1574.56  1497.56      1704.58      6.666
## P 1987.86  1408.86       759.70     14.394
## 
## Coefficients of linear discriminants:
##                        LD1
## AREA          0.0001890049
## SALE_PSF      0.0038187891
## PURCHASE_PSF -0.0043282253
## HOLD_YEARS    0.1545673822
plot(fit)

plot(fit) generates 2 histograms of the LD1 scores, one for group L(oss) and one for group P(rofit).

The two group means are well separated, which makes the groups easy to classify. If the two distributions overlapped, some misclassification would be unavoidable.
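To put a number on that separation, the group means and LD1 coefficients printed by lda() above can be combined by hand. This is only a sketch built from the rounded printout, not from the raw data; the variable names (mean_L, mean_P, ld1, contrib) are mine.

```r
# Values copied from the lda() printout above
mean_L <- c(AREA = 1574.56, SALE_PSF = 1497.56,
            PURCHASE_PSF = 1704.58, HOLD_YEARS = 6.666)
mean_P <- c(AREA = 1987.86, SALE_PSF = 1408.86,
            PURCHASE_PSF = 759.70, HOLD_YEARS = 14.394)
ld1    <- c(AREA = 0.0001890049, SALE_PSF = 0.0038187891,
            PURCHASE_PSF = -0.0043282253, HOLD_YEARS = 0.1545673822)

# Contribution of each variable to the gap between the group means on LD1
contrib <- (mean_P - mean_L) * ld1
print(round(contrib, 3))

# Total distance between the P and L group means along LD1
separation <- sum(contrib)
print(separation)  # about 5.02
```

MASS scales the discriminant so that the within-group spread is roughly 1, so a gap of about 5 units is large. Most of it comes from PURCHASE_PSF (about 4.09) and HOLD_YEARS (about 1.19); AREA adds very little, and SALE_PSF slightly reduces the gap because the loss group has the higher mean sale price.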

But which independent variables SEPARATE the classes without much overlap?

library(klaR)
## Warning: package 'klaR' was built under R version 3.4.4
partimat(MAKE ~ .,data=data1,method="lda")

Conclusions

Refer to the partition plot. The lower the apparent error rate (app. error rate) shown for each panel, the better the classification.
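As far as I understand it, the apparent error rate is a resubstitution error: the classifier is fitted on a pair of variables and then scored on the same rows. The sketch below reproduces that idea with MASS::lda. The data frame toy is synthetic, drawn loosely around the group means printed earlier, and is NOT the EdgeProp data; app_error is a hypothetical helper name.

```r
library(MASS)

# Synthetic stand-in for data1 (NOT the real transactions):
# 50 losses and 50 profits drawn around the printed group means.
set.seed(1)
n <- 50
toy <- data.frame(
  SALE_PSF     = c(rnorm(n, 1498, 400), rnorm(n, 1409, 400)),
  PURCHASE_PSF = c(rnorm(n, 1705, 400), rnorm(n, 760, 300)),
  HOLD_YEARS   = c(rnorm(n, 6.7, 3),    rnorm(n, 14.4, 4)),
  MAKE         = factor(rep(c("L", "P"), each = n))
)

# Fit LDA on the given predictors, predict the same rows,
# and return the share of rows that are misclassified.
app_error <- function(vars, data) {
  f   <- reformulate(vars, response = "MAKE")
  fit <- lda(f, data = data)
  mean(predict(fit, data)$class != data$MAKE)
}

# Every pair of predictors, as in the panels of the partition plot
pairs  <- combn(c("SALE_PSF", "PURCHASE_PSF", "HOLD_YEARS"), 2, simplify = FALSE)
errors <- sapply(pairs, app_error, data = toy)
names(errors) <- sapply(pairs, paste, collapse = " + ")
print(errors)
```

Because the same rows are used for fitting and scoring, this rate is optimistic; calling lda() with CV = TRUE gives leave-one-out predictions for a more honest estimate.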

Classification works well with the 2 independent variables SALE_PSF and PURCHASE_PSF. This is common sense, now shown statistically: a sale makes a profit when the selling price per square foot exceeds the purchase price per square foot.
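The five rows printed at the top make this easy to check: PROFIT_LOSS is very close to AREA multiplied by the psf difference. The sketch below uses only those five printed rows; the column names IMPLIED and REL_DIFF are mine.

```r
# First five rows as printed in the data preview above
first5 <- data.frame(
  PROJECT      = c("Beverly Hill", "Orchard Bel Air", "The Waterside",
                   "Bayshore Park", "Newton Suites"),
  AREA         = c(3778, 3229, 2433, 3531, 1238),
  SALE_PSF     = c(1747, 1208, 1554, 765, 1793),
  PURCHASE_PSF = c(709, 483, 848, 331, 798),
  PROFIT_LOSS  = c(3920000, 2340000, 1717000, 1531112, 1232000)
)

# Profit implied by area and the psf difference
first5$IMPLIED  <- first5$AREA * (first5$SALE_PSF - first5$PURCHASE_PSF)

# Relative gap between the implied and the reported profit
first5$REL_DIFF <- (first5$IMPLIED - first5$PROFIT_LOSS) / first5$PROFIT_LOSS

print(first5[, c("PROJECT", "PROFIT_LOSS", "IMPLIED", "REL_DIFF")])
```

For these five rows the implied and reported figures agree to within about 0.1%, so the sign of SALE_PSF minus PURCHASE_PSF essentially determines MAKE. I have not verified this for the remaining 95 rows.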

Another pair that classifies well is PURCHASE_PSF and HOLD_YEARS (useful when SALE_PSF is not available). This suggests that owners who hold a property for more years are more likely to sell at a profit, unless they have no holding power.

Can DISTRICT, a categorical independent variable which I left out earlier, be considered an influence on MAKE?

#Important categorical variable selection
district_tb <- table(data$DISTRICT, data$MAKE)
chisq.test(district_tb)
## Warning in chisq.test(district_tb): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  district_tb
## X-squared = 21.738, df = 17, p-value = 0.195

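The p-value (0.195) is above 0.05, so the test finds no significant association between DISTRICT and MAKE. This supports leaving DISTRICT out of the earlier model dataset, data1. Note the warning, though: with 100 sales spread over 18 districts, many cells have small expected counts, so the chi-squared approximation may be unreliable.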
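One way to deal with the "Chi-squared approximation may be incorrect" warning is to inspect the expected counts and then use a test that does not rely on the approximation. The sketch below shows the calls on a synthetic table; toy is randomly generated and is NOT the EdgeProp data, so its p-values say nothing about the real districts.

```r
# Synthetic stand-in (NOT the real data): 100 sales over up to 18 districts
set.seed(1)
toy <- data.frame(
  DISTRICT = factor(sample(1:18, 100, replace = TRUE)),
  MAKE     = factor(rep(c("L", "P"), each = 50))
)
tb <- table(toy$DISTRICT, toy$MAKE)

# Expected counts under independence; the approximation is usually
# considered doubtful when many of them are below 5.
expected <- suppressWarnings(chisq.test(tb))$expected
print(mean(expected < 5))  # share of cells with expected count below 5

# Monte Carlo p-value: simulates tables with the same margins
mc <- chisq.test(tb, simulate.p.value = TRUE, B = 10000)
print(mc)

# Fisher's exact test, also with a simulated p-value for a large table
fi <- fisher.test(tb, simulate.p.value = TRUE, B = 10000)
print(fi)
```

On the real data the same calls would take district_tb in place of tb. If the simulated p-value stays well above 0.05, the decision to drop DISTRICT rests on firmer ground.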