title: "houses"
author: "rg"
date: "2022-12-01"

1. Reading the Cleaned Melbourne Data for Hypothesis Testing and Validation

Reading the Melbourne data & importing required libraries

require(ggplot2)
## Loading required package: ggplot2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(sjmisc)
## 
## Attaching package: 'sjmisc'
## The following object is masked from 'package:tidyr':
## 
##     replace_na
library(corrplot)
## corrplot 0.92 loaded
library(fastDummies)
library(caret)
## Loading required package: lattice
library(BBmisc)
## 
## Attaching package: 'BBmisc'
## The following objects are masked from 'package:sjmisc':
## 
##     %nin%, seq_col, seq_row
## The following objects are masked from 'package:dplyr':
## 
##     coalesce, collapse
## The following object is masked from 'package:base':
## 
##     isFALSE
library(class)
library(C50)
library(MASS) # Needed to sample multivariate Gaussian distributions 
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(neuralnet) # The package for neural networks in R
## 
## Attaching package: 'neuralnet'
## The following object is masked from 'package:dplyr':
## 
##     compute
housing.dataset <- read.csv("D:/Freelancer_questions/shivam/Melbourne_housing/melbourne_data.csv", header = TRUE)

str(housing.dataset)
## 'data.frame':    34857 obs. of  12 variables:
##  $ Date         : chr  "03-09-2016" "03-12-2016" "04-02-2016" "04-02-2016" ...
##  $ Type         : chr  "h" "h" "h" "u" ...
##  $ Price        : int  NA 1480000 1035000 NA 1465000 850000 1600000 NA NA NA ...
##  $ Landsize     : int  126 202 156 0 134 94 120 400 201 202 ...
##  $ BuildingArea : num  NA NA 79 NA 150 NA 142 220 NA NA ...
##  $ Rooms        : int  2 2 2 3 3 3 4 4 2 2 ...
##  $ Bathroom     : int  1 1 1 2 2 2 1 2 1 2 ...
##  $ Car          : int  1 1 0 1 0 1 2 2 2 1 ...
##  $ YearBuilt    : int  NA NA 1900 NA 1900 NA 2014 2006 1900 1900 ...
##  $ Distance     : chr  "2.5" "2.5" "2.5" "2.5" ...
##  $ Regionname   : chr  "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" "Northern Metropolitan" ...
##  $ Propertycount: chr  "4019" "4019" "4019" "4019" ...

Task A: Hypothesis Validation

Hypothesis 1: On average, houses nearer the city centre (smaller Distance) have higher median prices, whereas those further inland have lower median prices. This difference is substantial, suggesting that Distance will play a large role in predicting median house value.

Hypothesis 2: On average, houses with more bathrooms have higher median prices, whereas those with fewer bathrooms have lower median prices.

Hypothesis 3: On average, houses with more car spaces have higher median prices, whereas those with fewer car spaces have lower median prices. A quick check of all three hypotheses is sketched below.
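A minimal sketch to sanity-check the direction of each hypothesised relationship, assuming housing.dataset as loaded above (Distance is read in as character, so it is coerced to numeric first):

housing.dataset %>%
  mutate(Distance = as.numeric(Distance)) %>%
  summarise(cor_price_distance = cor(Price, Distance, use = "complete.obs"),
            cor_price_bathroom = cor(Price, Bathroom, use = "complete.obs"),
            cor_price_car      = cor(Price, Car, use = "complete.obs"))

A negative Price-Distance correlation and positive Price-Bathroom and Price-Car correlations would support Hypotheses 1 to 3 respectively.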

Task B: Prediction

Converting the categorical variables Type, Regionname and Propertycount

Create dummy variables, drop the columns not required for prediction (Date), convert the remaining columns to numeric, and drop rows where Price is NA.

data <- dummy_cols(housing.dataset, 
                   select_columns = c("Type","Regionname","Propertycount"),remove_selected_columns = TRUE)

data <- data[, !(colnames(data) %in% c("Date"))]

data <- data.frame(apply(data, 2, function(x) as.numeric(as.character(x))))
## Warning in FUN(newX[, i], ...): NAs introduced by coercion
data <- data %>% drop_na(Price)

Divide the dataset into training and test data using a 75/25 split.

#split data 
RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(417)

idx <- sample(nrow(data), nrow(data)* 0.75)

housing_train <- data[idx,]

housing_test <- data[ -idx,]
full_additive_model = lm(Price ~ ., data = housing_train)

summary(full_additive_model)$adj.r.squared
## [1] 0.7012419

The model has a decent adjusted R2 of about 70%.

housing_test$Predicted_Price <- predict(full_additive_model, housing_test)
## Warning in predict.lm(full_additive_model, housing_test): prediction from a
## rank-deficient fit may be misleading
housing_test <- housing_test %>% drop_na(Price)
housing_test <- housing_test %>% drop_na(Predicted_Price)

The model's MAE, RMSE and MSE

MAE(housing_test$Predicted_Price, housing_test$Price)
## [1] 235951.6
RMSE(housing_test$Predicted_Price, housing_test$Price)
## [1] 359541
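caret's helpers report MAE and RMSE directly; MSE is simply the square of the RMSE, as in this one-line sketch:

RMSE(housing_test$Predicted_Price, housing_test$Price)^2 # MSE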

Normalizing the data and predicting

preproc_data = normalize(data[,2:ncol(data)], method = "range", range = c(0, 1))
preproc_data$Price <- data$Price

set.seed(417)

idx <- sample(nrow(preproc_data), nrow(preproc_data)* 0.75)

housing_train_prec <- preproc_data[idx,]

housing_test_prec <- preproc_data[ -idx,]

full_additive_model_prec = lm(Price ~ ., data = housing_train_prec)

summary(full_additive_model_prec)$adj.r.squared
## [1] 0.7012419
housing_test_prec$Predicted_Price <- predict(full_additive_model_prec, housing_test_prec)
## Warning in predict.lm(full_additive_model_prec, housing_test_prec): prediction
## from a rank-deficient fit may be misleading
housing_test_prec <- housing_test_prec %>% drop_na(Price)
housing_test_prec <- housing_test_prec %>% drop_na(Predicted_Price)

MAE(housing_test_prec$Predicted_Price, housing_test_prec$Price)
## [1] 235951.6
RMSE(housing_test_prec$Predicted_Price, housing_test_prec$Price)
## [1] 359541

There is not much difference in the prediction results after normalization: a linear model's fit is unaffected by linear rescaling of the predictors, and outliers were already removed during the initial cleaning. A quick check is sketched below.
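Since both splits use the same seed and row indices, the stored predictions of the two models can be compared directly (a minimal sketch, assuming the objects above):

all.equal(housing_test$Predicted_Price, housing_test_prec$Predicted_Price)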

Task C: Prediction

Divide the data into an 80/20 train/test split.

KNN/k-means: using k-means to add a category label to the data

Based on k-means cluster exploration, four categories of houses were found to be optimal; an elbow-method check is sketched below.

data2 <- data

data2 <- data2 %>% drop_na()
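One common way to support the choice of four clusters is an elbow plot; a minimal sketch using the cleaned data2 (not part of the original analysis):

# total within-cluster sum of squares for k = 1..8; look for the bend
set.seed(417)
wss <- sapply(1:8, function(k) kmeans(data2, centers = k, nstart = 10)$tot.withinss)
plot(1:8, wss, type = "b", xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")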

km.res <- kmeans(data2, 4, nstart = 25)

data2$Target <- as.factor(km.res$cluster)

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(417)

idx <- sample(nrow(data2), nrow(data2)* 0.80)

housing_train_80 <- data2[idx,]

housing_test_20 <- data2[ -idx,]

unique(housing_train_80$Target)
## [1] 2 4 1 3
## Levels: 1 2 3 4
unique(housing_test_20$Target)
## [1] 2 4 1 3
## Levels: 1 2 3 4
feat <- names(housing_train_80) != "Target" # drop Target so the label is not leaked into KNN's distances
modelknn <- knn(train = housing_train_80[, feat], test = housing_test_20[, feat],
                cl = housing_train_80$Target, k = 21)

caret::confusionMatrix(housing_test_20$Target, modelknn)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   1   2   3   4
##          1 210   0   0   0
##          2   2 552   0   0
##          3   0   0  38   0
##          4   0   0   0 977
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9989          
##                  95% CI : (0.9959, 0.9999)
##     No Information Rate : 0.5492          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9981          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4
## Sensitivity            0.9906   1.0000  1.00000   1.0000
## Specificity            1.0000   0.9984  1.00000   1.0000
## Pos Pred Value         1.0000   0.9964  1.00000   1.0000
## Neg Pred Value         0.9987   1.0000  1.00000   1.0000
## Prevalence             0.1192   0.3103  0.02136   0.5492
## Detection Rate         0.1180   0.3103  0.02136   0.5492
## Detection Prevalence   0.1180   0.3114  0.02136   0.5492
## Balanced Accuracy      0.9953   0.9992  1.00000   1.0000

The KNN model achieves near-perfect accuracy on the test data, with strong per-class sensitivity, specificity and balanced accuracy. This is expected to a degree, since the target labels are k-means clusters derived from the same features; per-class precision, recall and F1 can be reported as sketched below.
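caret's confusionMatrix can report precision, recall and F1 for each class by setting mode = "everything":

caret::confusionMatrix(housing_test_20$Target, modelknn, mode = "everything")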

C5.0 Model Training & Prediction

c50 <- C5.0(housing_train_80[,-363], housing_train_80$Target) # column 363 is assumed to hold Target; dropping it by name would be more robust
c50
## 
## Call:
## C5.0.default(x = housing_train_80[, -363], y = housing_train_80$Target)
## 
## Classification Tree
## Number of samples: 7116 
## Number of predictors: 363 
## 
## Tree size: 4 
## 
## Non-standard options: attempt to group attributes

The C5.0 decision tree also separates the four clusters cleanly, needing a tree of only four leaves; its performance on the held-out 20% is sketched below.
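A minimal sketch (assuming the objects above) to evaluate the tree on the test set:

c50_pred <- predict(c50, housing_test_20[,-363])
caret::confusionMatrix(c50_pred, housing_test_20$Target)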

ANN Model Training & Prediction

We consider only a few features for the ANN, as training on all 363 predictors would be computationally expensive.

set.seed(333)
n <- neuralnet(Target~ Landsize + BuildingArea + Rooms + Bathroom + Car,
               data = housing_train_80,
               hidden = 5, # a single hidden layer with 5 units; adjust as needed
               err.fct = "ce",
               linear.output = FALSE)
## Warning: Algorithm did not converge in 1 of 1 repetition(s) within the stepmax.
summary(n)
##               Length Class      Mode    
## call              6  -none-     call    
## response      28464  -none-     logical 
## covariate     35580  -none-     numeric 
## model.list        2  -none-     list    
## err.fct           1  -none-     function
## act.fct           1  -none-     function
## linear.output     1  -none-     logical 
## data            364  data.frame list    
## exclude           0  -none-     NULL

The neural network is computationally intensive, and the warning above shows it did not converge within the default stepmax here, so its accuracy is not yet directly comparable with the KNN model. Retraining with a larger stepmax (or with min-max scaled inputs) and evaluating on the test set, as sketched below, would make that comparison fair.
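A hypothetical sketch of how test accuracy could be measured, assuming the network has been retrained to convergence:

# predict class scores on the test set and take the most probable class per row
pred_prob <- predict(n, housing_test_20)
pred_class <- factor(max.col(pred_prob), levels = levels(housing_test_20$Target))
mean(pred_class == housing_test_20$Target) # test accuracy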