title: "houses"
author: "rg"
date: "2022-12-01"

1. Reading the Cleaned Melbourne Data for Hypothesis Testing and Validation

Reading the Melbourne data & importing required libraries

require(ggplot2)
## Loading required package: ggplot2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(sjmisc)
## 
## Attaching package: 'sjmisc'
## The following object is masked from 'package:tidyr':
## 
##     replace_na
library(corrplot)
## corrplot 0.92 loaded
library(fastDummies)
library(caret)
## Loading required package: lattice
library(BBmisc)
## 
## Attaching package: 'BBmisc'
## The following objects are masked from 'package:sjmisc':
## 
##     %nin%, seq_col, seq_row
## The following objects are masked from 'package:dplyr':
## 
##     coalesce, collapse
## The following object is masked from 'package:base':
## 
##     isFALSE
library(class) # k-nearest neighbour classification
library(C50)
library(MASS) # Needed to sample multivariate Gaussian distributions 
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(neuralnet) # The package for neural networks in R
## 
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
## 
##     compute
library(nnet)

housing.dataset <- read.csv("D:/Freelancer_questions/shivam/Melbourne_housing/melbourne_data.csv", header = TRUE)

str(housing.dataset)
## 'data.frame':    34857 obs. of  12 variables:
##  $ Date         : chr  "03-09-2016" "03-12-2016" "04-02-2016" "04-02-2016" ...
##  $ Type         : chr  "h" "h" "h" "u" ...
##  $ Price        : int  NA 1480000 1035000 NA 1465000 850000 1600000 NA NA NA ...
##  $ Landsize     : int  126 202 156 0 134 94 120 400 201 202 ...
##  $ BuildingArea : num  NA NA 79 NA 150 NA 142 220 NA NA ...
##  $ Rooms        : int  2 2 2 3 3 3 4 4 2 2 ...
##  $ Bathroom     : int  1 1 1 2 2 2 1 2 1 2 ...
##  $ Car          : int  1 1 0 1 0 1 2 2 2 1 ...
##  $ YearBuilt    : int  NA NA 1900 NA 1900 NA 2014 2006 1900 1900 ...
##  $ Distance     : chr  "2.5" "2.5" "2.5" "2.5" ...
##  $ Regionname   : chr  "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" ...
##  $ Propertycount: chr  "4019" "4019" "4019" "4019" ...
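
str() shows that Distance and Propertycount are stored as character, and that Price, BuildingArea and YearBuilt contain NAs. A quick missingness summary (a minimal sketch) shows where cleaning will be needed:

# Count NAs per column to decide what to impute or drop.
sapply(housing.dataset, function(x) sum(is.na(x)))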

Task A : Hypothesis Validation

Hypothesis 1: On average, houses closer to the city centre (smaller Distance) have higher prices, whereas those further inland have lower median values. This difference is substantial and suggests that Distance will play a large role in predicting house prices.

ggplot(housing.dataset, aes(x=as.numeric(Distance), y=Price)) + geom_point() +
ggtitle("\t\t\t Distance vs Price scatter plot") + xlab("Distance") +
ylab("Price")
## Warning: Removed 7610 rows containing missing values (`geom_point()`).

Hypothesis 2: On average, houses with more bathrooms have higher prices, whereas those with fewer bathrooms have lower median values.

ggplot(housing.dataset, aes(x=Bathroom, y=Price)) + geom_point() +
ggtitle("\t\t\t Bathroom vs Price scatter plot") + xlab("Number of bathrooms") +
ylab("Price")
## Warning: Removed 14057 rows containing missing values (`geom_point()`).

Hypothesis 3: On average, houses with more car spaces have higher prices, whereas those with fewer car spaces have lower median values.

ggplot(housing.dataset, aes(x=Car, y=Price)) + geom_point() +
ggtitle("\t\t\t Car vs Price scatter plot") + xlab("Number of car spaces") +
ylab("Price")
## Warning: Removed 14434 rows containing missing values (`geom_point()`).
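
To complement the scatter plots, the three hypothesized relationships can be quantified with correlations (a minimal sketch; Distance must first be converted from character to numeric, and NAs are handled pairwise):

# Correlations of Price with Distance, Bathroom and Car (sketch).
num <- housing.dataset
num$Distance <- as.numeric(num$Distance)
cor(num[, c("Price", "Distance", "Bathroom", "Car")],
    use = "pairwise.complete.obs")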

Task B : Prediction

Converting the categorical variables Type and Regionname

Create dummy variables, drop the columns not required for prediction, convert the remaining columns to numeric, and drop rows where Price is NA.

data <- dummy_cols(housing.dataset, 
                   select_columns = c("Type","Regionname"),remove_selected_columns = TRUE)

data <- data[, !(colnames(data) %in% c("Date","Propertycount","YearBuilt"))]

data <- data.frame(apply(data, 2, function(x) as.numeric(as.character(x))))
## Warning in FUN(newX[, i], ...): NAs introduced by coercion
data <- data %>% drop_na(Price)
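
At this point every remaining column should be numeric and Price should be NA-free; a quick assertion (sketch) confirms this before imputing the other columns:

# Sanity check (sketch): all columns numeric, no missing Price.
stopifnot(all(sapply(data, is.numeric)), !anyNA(data$Price))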

# Replace remaining NAs with the column median
data$Landsize[is.na(data$Landsize)] <- median(data$Landsize, na.rm = TRUE)

data$BuildingArea[is.na(data$BuildingArea)] <- median(data$BuildingArea, na.rm = TRUE)

data$Rooms[is.na(data$Rooms)] <- median(data$Rooms, na.rm = TRUE)

data$Bathroom[is.na(data$Bathroom)] <- median(data$Bathroom, na.rm = TRUE)

data$Car[is.na(data$Car)] <- median(data$Car, na.rm = TRUE)

data$Distance[is.na(data$Distance)] <- median(data$Distance, na.rm = TRUE)
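
The same imputation can be written more compactly with dplyr (an equivalent sketch of the steps above, not an additional transformation):

# Equivalent one-step median imputation (sketch).
data <- data %>%
  mutate(across(c(Landsize, BuildingArea, Rooms, Bathroom, Car, Distance),
                ~ ifelse(is.na(.x), median(.x, na.rm = TRUE), .x)))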

Divide the dataset into training and test data using a 75/25 split.

#split data 
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(417)

idx <- sample(nrow(data), nrow(data)* 0.75)

housing_train <- data[idx,]

housing_test <- data[ -idx,]
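
A quick check that the split has the intended proportions (sketch):

# Confirm the 75/25 split.
nrow(housing_train) / nrow(data)  # ~0.75
nrow(housing_test) / nrow(data)   # ~0.25
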
full_additive_model = lm(Price ~ ., data = housing_train)

summary(full_additive_model)$adj.r.squared
## [1] 0.5571159

The model has an adjusted R² of about 0.56, a moderate fit.

housing_test$Predicted_Price <- predict(full_additive_model, housing_test)
## Warning in predict.lm(full_additive_model, housing_test): prediction from a
## rank-deficient fit may be misleading
housing_test <- housing_test %>% drop_na(Price)
housing_test <- housing_test %>% drop_na(Predicted_Price)

The model's MAE and RMSE on the test data:

MAE(housing_test$Predicted_Price, housing_test$Price)
## [1] 284493.5
RMSE(housing_test$Predicted_Price, housing_test$Price)
## [1] 416837.9
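
For reference, these caret metrics match the standard definitions computed by hand (sketch):

# Manual equivalents of the metrics above.
err <- housing_test$Predicted_Price - housing_test$Price
mean(abs(err))     # MAE
sqrt(mean(err^2))  # RMSE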

Normalizing the data and predicting

preproc_data = normalize(data[,2:ncol(data)], method = "range", range = c(0, 1))
preproc_data$Price <- data$Price

set.seed(417)

idx <- sample(nrow(preproc_data), nrow(preproc_data)* 0.75)

housing_train_prec <- preproc_data[idx,]

housing_test_prec <- preproc_data[ -idx,]

full_additive_model_prec = lm(Price ~ ., data = housing_train_prec)

summary(full_additive_model_prec)$adj.r.squared
## [1] 0.5571159
housing_test_prec$Predicted_Price <- predict(full_additive_model_prec, housing_test_prec)
## Warning in predict.lm(full_additive_model_prec, housing_test_prec): prediction
## from a rank-deficient fit may be misleading
housing_test_prec <- housing_test_prec %>% drop_na(Price)
housing_test_prec <- housing_test_prec %>% drop_na(Predicted_Price)

MAE(housing_test_prec$Predicted_Price, housing_test_prec$Price)
## [1] 284493.5
RMSE(housing_test_prec$Predicted_Price, housing_test_prec$Price)
## [1] 416837.9

There is little difference in the prediction results after normalization. This is expected: a linear model with an intercept is invariant to linear rescaling of its predictors, so the fitted values, and therefore the MAE and RMSE, are unchanged.
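
This can be verified directly: because both splits use set.seed(417) on data frames with identical rows, the two prediction vectors should agree up to numerical tolerance (sketch):

# The raw and normalized models should produce matching predictions.
all.equal(unname(housing_test$Predicted_Price),
          unname(housing_test_prec$Predicted_Price))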

Task C : Prediction

Divide the data into 80/20

KNN / K-means: adding a category label to the observations

Based on feature exploration, four categories of houses were found to work well (see the elbow-plot sketch after the clustering code below):

data2 <- data

data2 <- data2[, !(colnames(data2) %in% c("Price"))]

data2<- data2 %>% drop_na()

km.res <- kmeans(data2, 4, nstart = 25)

data2$Target <- as.factor(km.res$cluster)
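
As referenced above, the choice of k = 4 can be sanity-checked with an elbow plot of the total within-cluster sum of squares (a sketch; the elbow location is a judgment call and the loop is slow on the full data):

# Hypothetical elbow check for k = 1..8, excluding the Target column.
wss <- sapply(1:8, function(k)
  kmeans(data2[, colnames(data2) != "Target"], k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")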

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(417)

idx <- sample(nrow(data2), nrow(data2)* 0.80)

housing_train_80 <- data2[idx,]

housing_test_20 <- data2[ -idx,]

unique(housing_train_80$Target)
## [1] 3 2 1 4
## Levels: 1 2 3 4
unique(housing_test_20$Target)
## [1] 3 2 1
## Levels: 1 2 3 4
# Exclude the Target column (19) from the feature matrices to avoid leakage
modelknn <- knn(train = housing_train_80[, -19], test = housing_test_20[, -19], cl = housing_train_80$Target, k = 21)

caret::confusionMatrix(housing_test_20$Target, modelknn)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4
##          1    0    0    1    0
##          2    0 1503    4    0
##          3    0    2 3940    0
##          4    0    0    0    0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9987          
##                  95% CI : (0.9974, 0.9995)
##     No Information Rate : 0.7239          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9968          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                       Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity                 NA   0.9987   0.9987       NA
## Specificity          0.9998165   0.9990   0.9987        1
## Pos Pred Value              NA   0.9973   0.9995       NA
## Neg Pred Value              NA   0.9995   0.9967       NA
## Prevalence           0.0000000   0.2761   0.7239        0
## Detection Rate       0.0000000   0.2758   0.7229        0
## Detection Prevalence 0.0001835   0.2765   0.7233        0
## Balanced Accuracy           NA   0.9988   0.9987       NA

The KNN model achieves very high accuracy on the test data, with strong per-class precision and recall. Near-perfect performance is expected here, since the cluster labels are themselves a deterministic function of the same features.

C5.0 Model Training & Prediction

c50 <- C5.0(housing_train_80[,-19], housing_train_80$Target)
c50
## 
## Call:
## C5.0.default(x = housing_train_80[, -19], y = housing_train_80$Target)
## 
## Classification Tree
## Number of samples: 21797 
## Number of predictors: 18 
## 
## Tree size: 9 
## 
## Non-standard options: attempt to group attributes
caret::confusionMatrix(housing_test_20$Target, predict(c50, newdata = housing_test_20[,-19]))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4
##          1    1    0    0    0
##          2    0 1507    0    0
##          3    0    1 3941    0
##          4    0    0    0    0
## 
## Overall Statistics
##                                     
##                Accuracy : 0.9998    
##                  95% CI : (0.999, 1)
##     No Information Rate : 0.7231    
##     P-Value [Acc > NIR] : < 2.2e-16 
##                                     
##                   Kappa : 0.9995    
##                                     
##  Mcnemar's Test P-Value : NA        
## 
## Statistics by Class:
## 
##                       Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity          1.0000000   0.9993   1.0000       NA
## Specificity          1.0000000   1.0000   0.9993        1
## Pos Pred Value       1.0000000   1.0000   0.9997       NA
## Neg Pred Value       1.0000000   0.9997   1.0000       NA
## Prevalence           0.0001835   0.2767   0.7231        0
## Detection Rate       0.0001835   0.2765   0.7231        0
## Detection Prevalence 0.0001835   0.2765   0.7233        0
## Balanced Accuracy    1.0000000   0.9997   0.9997       NA

The C5.0 decision tree also gives near-perfect predictions.
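
The fitted tree's splits can be inspected with summary(), and C50's usage-based variable importance shows which features drive the clusters (output omitted here):

# Inspect the tree and its most-used predictors.
summary(c50)
C5imp(c50, metric = "usage")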

ANN Model Training & Prediction

We use only a few features for the ANN, because training on the full feature set is very slow.

# size sets the number of hidden units (nnet has no 'hidden' argument)
nn <- nnet(Target ~ Landsize + BuildingArea + Rooms + Bathroom + Car, data = housing_train_80, size = 5, decay = 5e-4, maxit = 200)
## # weights:  54
## initial  value 34980.666995 
## iter  10 value 6368.356372
## iter  20 value 5113.239804
## iter  30 value 4143.094701
## iter  40 value 2560.510333
## iter  50 value 896.130659
## iter  60 value 271.642408
## iter  70 value 154.730789
## iter  80 value 121.184533
## iter  90 value 75.861625
## iter 100 value 46.469550
## iter 110 value 43.212627
## iter 120 value 34.329770
## iter 130 value 30.871260
## iter 140 value 27.917828
## iter 150 value 26.849672
## iter 160 value 22.810192
## iter 170 value 17.703523
## iter 180 value 16.089653
## iter 190 value 14.652899
## iter 200 value 13.770072
## final  value 13.770072 
## stopped after 200 iterations
test <- predict(nn, housing_test_20[, 1:5], type = "class")

unique(test)
## [1] "3" "2" "1"

The neural network is computationally intensive, but its accuracy is high and similar to the KNN model's.