Programming of Data Science

Name	Matrix Number
Hoo Xing Yu	S2117839
Pei Lin Soh	S2193368
Low Kian Haw	S2190839
Clarice Lau Yeo Hang	S2192784
Yeoh Joer	22077700

Introduction

Due to intense competition, the housing market necessitates constant innovation and optimisation.
A machine learning approach has enormous promise to address these issues.
Housing development and sales processes can be improved by using artificial intelligence and data analysis to ensure efficiency, profitability, and responsiveness to changing client demands.
In particular, this study aims to investigate the relationship between house attributes and sales by applying machine learning approaches to optimise housing development and sales.
Finding the elements that appeal to customers the most is the aim.
Industry experts can take crucial lessons from this inquiry on utilising machine learning to succeed in the dynamic real estate market.

Problem Statement:

Clustering – Group house tiers based on their features to identify common characteristics and attributes that can inform our decision-making process in terms of what type of houses we should build.
Prediction – Predict house prices based on various features such as location, size, number of rooms, and other amenities. This information will allow us to price our properties competitively and attract potential buyers.
Association rule – Determine customers’ preferences on renovation based on house types to inform our renovation plans and maximize the number of houses sold.

Research Objective

Maximize revenue and customer satisfaction while minimizing costs and risk.
Identify best type of properties with competitive prices to attract potential buyers.

Data Description/ Source

The dataset is collected from Kaggle and it contains test and train dataset.
Data Source: https://www.kaggle.com/datasets/prevek18/ames-housing-dataset

Import Necessary Libraries

suppressPackageStartupMessages({
  library(arules)
  library(caret)
  library(catboost)
  library(cluster)
  library(corrplot)
  library(dplyr)
  library(e1071)
  library(factoextra)
  library(ggplot2)
  library(glmnet)
  library(gridExtra)
  library(lattice)
  library(lightgbm)
  library(mlr)
  library(mlrMBO)
  library(psych)
  library(randomForest)
  library(readr)
  library(stats)
  library(tidyr)
  library(viridis)
  library(xgboost)
  library(DescTools)
  library(Metrics)
  library(Rtsne)
  library(DiceKriging)
})

Import Dataset

Dataset contains information on housing features such as location, size, number of rooms, and other amenities.

dataset <- read.csv("train.csv")
head(dataset)

Dataset Overview

There are 1460 records (rows) and 81 attributes (columns).
Dataset are mostly made up of numerical value (int64 & float64).

cat("Total rows & columns:", dim(dataset), '\n\n')

## Total rows & columns: 1460 81

str(dataset)

## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr  NA NA NA NA ...
##  $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
##  $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr  "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
##  $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
##  $ Exterior2nd  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
##  $ MasVnrType   : chr  "BrkFace" "None" "BrkFace" "None" ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : chr  "Gd" "TA" "Gd" "TA" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr  "PConc" "CBlock" "PConc" "BrkTil" ...
##  $ BsmtQual     : chr  "Gd" "Gd" "Gd" "TA" ...
##  $ BsmtCond     : chr  "TA" "TA" "TA" "Gd" ...
##  $ BsmtExposure : chr  "No" "Gd" "Mn" "No" ...
##  $ BsmtFinType1 : chr  "GLQ" "ALQ" "GLQ" "ALQ" ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : chr  "Unf" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr  "Ex" "Ex" "Ex" "Gd" ...
##  $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : chr  NA "TA" "TA" "Gd" ...
##  $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Detchd" ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : chr  "RFn" "RFn" "RFn" "Unf" ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr  NA NA NA NA ...
##  $ Fence        : chr  NA NA NA NA ...
##  $ MiscFeature  : chr  NA NA NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

Data Pre-processing

Check for Sum of Duplicates Between Columns

No duplication inside the dataset.

duplicates <- sum(duplicated(dataset))
cat(paste("Number of duplicates:", duplicates))

## Number of duplicates: 0

Remove Meaningless Columns

At this stage, we have removed irrelevant columns from the dataset.
‘Id’ column is removed, as it is a unique identifier for each customer.
‘GarageCars’ column is removed, as it provides redundant information already captured by the ‘GarageArea’ column.
We also removed the ‘Exterior2nd’ column, as it represents a limited set of additional choices for exterior covering.
Furthermore, we removed the ‘GarageYrBlt’ column, as the year of construction is typically the same as the ‘YearBuilt’ column.
Finally, we removed the ‘TotRmsAbvGrd’ column, as it provides a similar meaning as the ‘First Floor’ square footage.

# Subset the dataset by removing specific columns
dataset <- subset(dataset, select = -c(Id, GarageCars, Exterior2nd, GarageYrBlt, TotRmsAbvGrd))

Check for Missing Values

There are a total of 19 columns with missing values
5 of them have missing values higher than 20%
Columns with missing values higher than 20%: PoolQC, MiscFeature, Alley, Fence, FireplaceQu
Column with missing values higher than 20% will be removed in this stage

# Calculate missing percentage
missing_percentages <- colSums(is.na(dataset)) / nrow(dataset) * 100

# Create a data frame with column names and missing percentages
missing_dataset <- data.frame(
  column_name = names(missing_percentages), 
  missing_percentage = missing_percentages
) %>%
  # Sort the data frame by missing percentages in descending order
  arrange(desc(missing_percentage)) %>%
  # Filter out columns with 0 missing percentages
  filter(missing_percentage > 0)

# Print the missing dataset
print(missing_dataset)

##               column_name missing_percentage
## PoolQC             PoolQC        99.52054795
## MiscFeature   MiscFeature        96.30136986
## Alley               Alley        93.76712329
## Fence               Fence        80.75342466
## FireplaceQu   FireplaceQu        47.26027397
## LotFrontage   LotFrontage        17.73972603
## GarageType     GarageType         5.54794521
## GarageFinish GarageFinish         5.54794521
## GarageQual     GarageQual         5.54794521
## GarageCond     GarageCond         5.54794521
## BsmtExposure BsmtExposure         2.60273973
## BsmtFinType2 BsmtFinType2         2.60273973
## BsmtQual         BsmtQual         2.53424658
## BsmtCond         BsmtCond         2.53424658
## BsmtFinType1 BsmtFinType1         2.53424658
## MasVnrType     MasVnrType         0.54794521
## MasVnrArea     MasVnrArea         0.54794521
## Electrical     Electrical         0.06849315

Remove Columns With More Than 20% Missing Values

dataset <- subset(dataset, select = -c(PoolQC, MiscFeature, Alley, Fence, FireplaceQu))

Impute Missing Data

Median is chosen to fill in missing values, rather than the mean.
The mean is sensitive to extreme values and can be heavily influenced by outliers in the data, which can lead to biased estimates.
In contrast, the median is a robust measure of central tendency that is less sensitive to extreme values and is more appropriate for skewed data.

# Identify numerical columns & replace missing numerical values with median
num_cols <- sapply(dataset, is.numeric)
for (col in names(dataset[, num_cols])) {
  col_median <- median(dataset[, col], na.rm = TRUE)
  dataset[, col][is.na(dataset[, col])] <- rep(col_median, sum(is.na(dataset[, col])))
}

# Identify categorical columns & replace missing categorical values with mode
cat_cols <- sapply(dataset, is.factor) | sapply(dataset, is.character)
for (col in names(dataset[, cat_cols])) {
  dataset[, col][is.na(dataset[, col])] <- as.character(Mode(dataset[, col], na.rm = TRUE))
}

Explore Data - Exploratory Data Analysis (EDA)

Plot Distribution of Numeric Data

Majority of numerical columns’ distribution are right-skewed (32 attributes), however, 2 of them are left-skewed

# Select numerical columns using the "is.numeric" function
numerical_cols <- names(dataset)[sapply(dataset, is.numeric)]
categorical_cols <- names(dataset)[sapply(dataset, is.factor) | sapply(dataset, is.character)]

# Create new data frames with only numerical and categorical columns
numerical_data <- dataset[numerical_cols]
categorical_data <- dataset[categorical_cols]

# Display the first six rows of the numerical data frame
head(numerical_data)

# Create a function to plot probability density plot and skewness
plot_density <- function(column, col_name) {
  # Compute the skewness of the data
  skewness <- skew(column)

  # Set the plot color based on the skewness
  if (skewness < 0) {
    plot_color <- "blue"
  } else {
    plot_color <- "red"
  }

  # Create the density plot
  plot_title <- col_name  # Rename the plot title with the column name
  density_plot <- ggplot(data = data.frame(column), aes(x = column)) +
    geom_density(fill = plot_color, alpha = 0.3) +
    ggtitle(plot_title)

  return(density_plot)
}

# Apply the plot_density function to each column of numerical data in numerical_data
# Select the first 20 column names from numerical_data
first_20 <- names(numerical_data)[1:20]

# Apply the plot_density function to each selected column
plots <- lapply(first_20, function(col_name) {
  plot_density(numerical_data[[col_name]], col_name)
})

# Arrange the resulting plots in a grid with 5 columns
grid.arrange(grobs = plots, ncol = 5)

# Select the first 20 column names from numerical_data
rest_14 <- names(numerical_data)[21:34]

# Apply the plot_density function to each selected column
plots <- lapply(rest_14, function(col_name) {
  plot_density(numerical_data[[col_name]], col_name)
})

# Arrange the resulting plots in a grid with 5
grid.arrange(grobs = plots, ncol = 5)

Check Outliers Using Boxplot

Outliers are found in most of the columns - 29 attributes
“MSSubClass”, “LotFrontage”, “LotArea”, “OverallQual”, “OverallCond”, “YearBuilt”, “MasVnrArea”, “BsmtFinSF1”, “BsmtFinSF2”, “BsmtUnfSF”, “TotalBsmtSF”, “X1stFlrSF”, “X2ndFlrSF”, “LowQualFinSF”, “GrLivArea”, “BsmtFullBath”, “BsmtHalfBath”, “BedroomAbvGr”, “KitchenAbvGr”, “Fireplaces”, “GarageArea”, “WoodDeckSF”, “OpenPorchSF”, “EnclosedPorch”, “X3SsnPorch”, “ScreenPorch”, “PoolArea”, “MiscVal”, “SalePrice”

# Create a list to store the column names with outliers
columns_with_outliers <- c()

# Create a list of boxplots for the first 20 attributes
boxplot_numerical_first20attributes <- lapply(1:20, function(i) {
  # Calculate the lower and upper fences for outlier detection
  Q1 <- quantile(numerical_data[, i], 0.25)
  Q3 <- quantile(numerical_data[, i], 0.75)
  IQR <- Q3 - Q1
  lower_fence <- Q1 - 1.5 * IQR
  upper_fence <- Q3 + 1.5 * IQR

  # Check for outliers
  outliers <- numerical_data[, i] < lower_fence | numerical_data[, i] > upper_fence

  # Add the column name to the list if outliers are present
  if (any(outliers)) {
    columns_with_outliers <<- c(columns_with_outliers, colnames(numerical_data)[i])
  }

  # Create the boxplot
  ggplot(numerical_data, aes(x = 1, y = numerical_data[, i])) +
    geom_boxplot() +
    ggtitle(paste0("Boxplot of", colnames(numerical_data)[i])) +
    xlab("") +
    ylab(colnames(numerical_data)[i])
})

# Arrange the resulting boxplots in a grid with 5 columns
grid.arrange(grobs = boxplot_numerical_first20attributes, ncol = 5)

# Create a list of boxplots for the rest
boxplot_numerical_restattributes <- lapply(21:34, function(i) {
  # Calculate the lower and upper fences for outlier detection
  Q1 <- quantile(numerical_data[, i], 0.25)
  Q3 <- quantile(numerical_data[, i], 0.75)
  IQR <- Q3 - Q1
  lower_fence <- Q1 - 1.5 * IQR
  upper_fence <- Q3 + 1.5 * IQR

  # Check for outliers
  outliers <- numerical_data[, i] < lower_fence | numerical_data[, i] > upper_fence

  # Add the column name to the list if outliers are present
  if (any(outliers)) {
    columns_with_outliers <<- c(columns_with_outliers, colnames(numerical_data)[i])
  }

  # Create the boxplot
  ggplot(numerical_data, aes(x = 1, y = numerical_data[, i])) +
    geom_boxplot() +
    ggtitle(paste0("Boxplot of", colnames(numerical_data)[i])) +
    xlab("") +
    ylab(colnames(numerical_data)[i])
})

# Arrange the resulting boxplots in a grid with 5 columns
grid.arrange(grobs = boxplot_numerical_restattributes, ncol = 5)

# Print the column names with outliers
print(columns_with_outliers)

##  [1] "MSSubClass"    "LotFrontage"   "LotArea"       "OverallQual"  
##  [5] "OverallCond"   "YearBuilt"     "MasVnrArea"    "BsmtFinSF1"   
##  [9] "BsmtFinSF2"    "BsmtUnfSF"     "TotalBsmtSF"   "X1stFlrSF"    
## [13] "X2ndFlrSF"     "LowQualFinSF"  "GrLivArea"     "BsmtFullBath" 
## [17] "BsmtHalfBath"  "BedroomAbvGr"  "KitchenAbvGr"  "Fireplaces"   
## [21] "GarageArea"    "WoodDeckSF"    "OpenPorchSF"   "EnclosedPorch"
## [25] "X3SsnPorch"    "ScreenPorch"   "PoolArea"      "MiscVal"      
## [29] "SalePrice"

Plot Correlation Matrix

Correlation with >0.8 and above:
X1stFlrSF & TotalBsmtSF || 0.81953

# Create a heat map of the correlation matrix
correlation_matrix <- cor(numerical_data, method = "pearson")
corrplot(correlation_matrix, method = "color", type = "upper", order = "hclust", tl.col = "black", tl.cex = 0.3, tl.srt = 90)

# Print correlation higher than 0.8
high_corr <- which(abs((correlation_matrix) > 0.8 | correlation_matrix < -0.8)& correlation_matrix!= 1, arr.ind=TRUE)
high_corr_values <- correlation_matrix[high_corr]
high_corr_df <- data.frame(row = high_corr[, 1], col = high_corr[, 2], corr = high_corr_values)
high_corr_df

Bar Plots for Categorical Data

There are total of 37 categorical data column

# Display the first six rows of the categorical data frame
head(categorical_data)

Generally, most of the residents having following characteristics:
- Low density resident
- Paved streets
- Gravel alley
- Regular sized property
- Near Flat level
- Available with electric power, natural gas, steam supply, water supply, and sewage removal
- Inside lot
- Gentle slope of property
- Single-family Detached
- Single storey
- Gable roof
- Standard (Composite) Shingle roof
- Good quality Vinyl Siding exteriors
- Slight damped basement
- No walkout or garden level walls
- Gas forced warm air furnace
- Excellent heating quality
- Has Air-conditioning
- Standard Circuit Breakers & Romex electrical system
- Typical kitchen quality
- Typical home functionality
- Masonry Fireplace in main level
- Typical unfinished garage attached to home
- Paved driveway
- Good pool quality
- Minimum Wood/Wire fence
- Warranty Deed – Conventional Sales
- Normal Sales

# Print bar plots for categorical data for the first 20 attributes
plots_first20attribute <- lapply(colnames(categorical_data)[1:20], function(col) {
  ggplot(categorical_data, aes(x = !!sym(col))) +
    geom_bar(fill = "steelblue") +
    xlab(col) +
    ylab("Count") +
    ggtitle(paste0("Distribution of", col)) +
    scale_x_discrete(guide = guide_axis(check.overlap = TRUE))
})

grid.arrange(grobs = plots_first20attribute, ncol = 5)

# Print bar plots for categorical data for the remaining 17 attributes
plots_restattribute <- lapply(colnames(categorical_data)[21:37], function(col) {
  ggplot(categorical_data, aes(x = !!sym(col))) +
    geom_bar(fill = "steelblue") +
    xlab(col) +
    ylab("Count") +
    ggtitle(paste0("Distribution of", col)) +
    scale_x_discrete(guide = guide_axis(check.overlap = TRUE))
})

grid.arrange(grobs = plots_restattribute, ncol = 5)

Data Transformation/Scaling

Data pre-processing steps such as log transformation, scaling, and one-hot encoding.
Data splitted into train, validation, and test sets and combines the training and validation sets for model training.
Calculates and displays the skewness of the numerical features and target variable before and after the log transformations.

# Select numerical columns using the "is.numeric" function
numerical_cols <- names(dataset)[sapply(dataset, is.numeric)]
categorical_cols <- names(dataset)[sapply(dataset, is.factor) | sapply(dataset, is.character)]

# Create new data frames with only numerical and categorical columns
numerical_data <- dataset[numerical_cols]
categorical_data <- dataset[categorical_cols]

# Split the numerical variables into features and the target variable.
X_num <- subset(numerical_data, select = -c(SalePrice))
y <- subset(numerical_data, select = c(SalePrice))$SalePrice

# Log Transformation for numerical features.
skewness_before <- sapply(X_num, function(x) {
  e1071::skewness(x)
})

X_num_skewed <- skewness_before[abs(skewness_before) > 0.75]

for (x in names(X_num_skewed)) {
  X_num[[x]] <- log1p(X_num[[x]])
}

skewness_after <- sapply(X_num, function(x) {
  e1071::skewness(x)
})

data.frame(skewness_before, skewness_after)

# Log Transformation for the target variable.
skewness_before <- e1071::skewness(y)

y_t <- log1p(y)

skewness_after <- e1071::skewness(y_t)

sprintf("Before: %f, After: %f", skewness_before, skewness_after)

## [1] "Before: 1.879009, After: 0.121097"

# Scaling
X_num <- scale(X_num)

# One-Hot Encoding for categorical variables.
encoder <- dummyVars(~., data = categorical_data)

X_cat <- predict(encoder, newdata = categorical_data)
X_cat <- data.frame(X_cat)

# Split into train and validation and test sets.
X <- cbind(X_cat, X_num)

set.seed(12345)
train_idx <- createDataPartition(y_t, p = 0.7, list = F)
X_train <- X[train_idx, ]
y_train <- y_t[train_idx]

X_test <- X[-train_idx, ]
y_test <- y_t[-train_idx]

train_val_idx <- createDataPartition(y_train, p = 0.8, list = FALSE)
X_train <- X_train[train_val_idx, ]
y_train <- y_train[train_val_idx]

X_val <- X_train[-train_val_idx, ]
y_val <- y_train[-train_val_idx]

X_train_val <- rbind(X_train, X_val)
y_train_val <- c(y_train, y_val)

dim(X)

## [1] 1460  251

length(y)

## [1] 1460

dim(X_train)

## [1] 821 251

length(y_train)

## [1] 821

dim(X_val)

## [1] 163 251

length(y_val)

## [1] 163

#names(X)

Clustering

PCA + K-Means Clustering

The code executes PCA (Principal Component Analysis) to reduce the data’s dimensionality and extract the principal components.
Elbow method used to determine the best number of clusters. Result: 3.
K-means clustering is performed on the principal components, assigning each observation to a specific cluster.
The cluster assignments are appended to the principal components data, and a scatter plot is generated to visualize the resulting clusters.

# Perform PCA on the data matrix
pca_result <- prcomp(X)

# Extract the principal components
principal_components <- as.data.frame(pca_result$x)

# Determine the optimal number of clusters using the elbow method
fviz_nbclust(principal_components, kmeans, method = "wss")

k <- 3
# Perform K-means clustering on the principal components
kmeans_result <- kmeans(principal_components, centers = k)
cluster_assignments <- kmeans_result$cluster
# print(cluster_assignments)

# Add the cluster assignments to the principal components data
principal_components$cluster <- as.factor(cluster_assignments)

ggplot(principal_components, aes(PC1, PC2, color = cluster)) +
  geom_point() +
  labs(x = "Principal Component 1", y = "Principal Component 2") +
  scale_color_discrete(name = "Cluster") +
 theme_minimal()

t-SNE + Hierarchical Clustering

t-SNE (t-Distributed Stochastic Neighbor Embedding) is performed for dimensionality reduction and visualization of the data.

Function:

Calculates the t-SNE coordinates
Creates a dendrogram through hierarchical clustering
Determines the number of clusters visually, assigns data points to clusters
Computes centroids for each cluster
Generates scatter plots to visualize the t-SNE coordinates colored by the original variable and cluster assignments
Includes a bar plot showing the cluster frequencies and evaluates clustering quality using silhouette analysis.

tsne <- Rtsne(X)
tsne_df <- data.frame(tsne)

dist_mat <- dist(tsne$Y, method = "euclidean")
hclust_avg <- hclust(dist_mat, method = "average")

dend <- as.dendrogram(hclust_avg)
plot(dend)

k = 15
cut_avg <- cutree(hclust_avg, k)
tsne_df$cluster <- cut_avg

getCentroid <- function(points) {
  xy <- numeric(2)
  
  xy[1] = mean(points[, 1])
  xy[2] = mean(points[, 2])

  return(xy)
}

centroids = matrix(0, k, 2)
for (i in unique(cut_avg)) centroids[i, ] <- getCentroid(tsne$Y[cut_avg == i,])

Silhouette Graph

Graph width represents the number of data point; longer length indicates a stronger separation and better-defined clusters.
The lighter the color, the higher the house price.

ggplot(tsne_df, aes(x=Y.1, y=Y.2, color=y)) +
  geom_point() +
  scale_color_gradientn(colours = heat.colors(10))

ggplot(data.frame(table(tsne_df$cluster)), aes(x=Var1, y=Freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Clusters", y = "Sizes")

ggplot(tsne_df, aes(x=Y.1, y=Y.2, color=cluster)) +
  geom_point() +
  scale_color_viridis() +
  scale_fill_viridis(discrete = T) +
  geom_point(data = data.frame(centroids), aes(x=X1, y=X2), color="black", fill="white", shape=21, size=8) +
  geom_text(data = data.frame(centroids), aes(x=X1, y=X2, label=1:k), color="black")

fviz_silhouette(silhouette(cutree(hclust_avg, k = k), dist_mat))

##    cluster size ave.sil.width
## 1        1  253          0.33
## 2        2   92          0.41
## 3        3  215          0.32
## 4        4   29          0.74
## 5        5  174          0.35
## 6        6   42          0.79
## 7        7   96          0.16
## 8        8  169          0.37
## 9        9   37          0.89
## 10      10   17          0.81
## 11      11   87          0.53
## 12      12   85          0.44
## 13      13   22          0.81
## 14      14   66          0.70
## 15      15   76          0.35

Regression Modelling

Baselines

Sets up empty lists to store RMSE values for baseline models on the training and testing datasets.
Calculate performance metrics (MAE, MAPE, RMSE, MSE, R2) by converting predicted and actual values from transformed to original scale.
Focus on data structures and the metrics calculation function.

baselines_rmse <- list()

baselines_rmse_test <- list()
actual_metrics_test <- list()

metrics_fusion <- function(y_pred, y) {
  y_pred_inv <- expm1(y_pred)
  y_inv <- expm1(y)

  a <- mae(y_pred_inv, y_inv)
  b <- mape(y_pred_inv, y_inv)
  c <- rmse(y_pred_inv, y_inv)
  d <- mse(y_pred_inv, y_inv)
  e <- R2(y_pred_inv, y_inv)

  return(c("mae" = a, "mape" = b, "rmse" = c, "mse" = d, "r2" = e))
}

Linear Regression

Evaluates the performance of a linear regression model for a regression problem.
Cross-validation used to train the model on the training and validation data.
Resulting model is applied to make predictions on the validation dataset.
Root mean squared error (RMSE) is calculated as a measure of prediction accuracy for the model on the validation data.
Trained linear regression model used to make predictions on the test dataset.
RMSE is calculated by comparing the predicted values with the actual test values.
Stores additional performance metrics (MAE, MAPE, RMSE, MSE, R2) for the linear regression model on the test data.

linreg_tc <- trainControl(method = "cv", number = 5)
linreg_cv <- caret::train(
  SalePrice ~ .,
  data = cbind(X_train_val, SalePrice = y_train_val),
  trControl = linreg_tc,
  method = "lm"
)

# Validation predictions and metrics
score_val <- linreg_cv$results$RMSE
baselines_rmse$linear_regression <- score_val

# Test predictions and metrics
linreg <- lm(SalePrice ~ ., data = cbind(X_train_val, SalePrice = y_train_val))

y_pred_test <- predict(linreg, newdata = X_test)
score_test <- rmse(y_pred_test, y_test)
baselines_rmse_test$linear_regression <- score_test
actual_metrics_test$linear_regression <- metrics_fusion(y_pred_test, y_test)

# Display scores
score_val

## [1] 0.1279347

score_test

## [1] 0.1584937

Lasso Regression

Lasso Regression applied to a regression problem.
Model trained by using cross-validation and selects the optimal regularization parameter.
Root mean squared error (RMSE) calculated for the validation set.
For the test set, target variable predicted by using the trained model, calculates the RMSE
Actual performance metrics (MAE, MAPE, RMSE, MSE, R2) computed for the Lasso Regression model on the test set
Displays the validation RMSE and test RMSE.

lasso <- cv.glmnet(x = as.matrix(X_train_val), y = y_train_val, alpha = 1)

# Validation predictions and metrics
score_val <- mean(sqrt(lasso$cvm))
baselines_rmse$lasso <- score_val

# Test predictions and metrics
y_pred_test <- predict(lasso, newx = as.matrix(X_test))
score_test <- rmse(y_pred_test, y_test)
baselines_rmse_test$lasso <- score_test
actual_metrics_test$lasso <- metrics_fusion(y_pred_test, y_test)

# Display scores
score_val

## [1] 0.1534448

score_test

## [1] 0.1401135

Ridge Regression

Ridge Regression applied to a regression problem.
Model trained by using cross-validation and selects the optimal regularization parameter.
Root mean squared error (RMSE) calculated for the validation set.
For the test set, it predicts the target variable using the trained model, calculates the RMSE.
Actual performance metrics (MAE, MAPE, RMSE, MSE, R2) computed for the Ridge Regression model on the test set.
Validation RMSE and test RMSE dsiplayed.

ridge <- cv.glmnet(x = as.matrix(X_train_val), y = y_train_val, alpha = 0)

# Validation predictions and metrics
score_val <- mean(sqrt(ridge$cvm))
baselines_rmse$ridge <- score_val

# Test predictions and metrics
y_pred_test <- predict(ridge, newx = as.matrix(X_test))
score_test <- rmse(y_pred_test, y_test)
baselines_rmse_test$ridge <- score_test
actual_metrics_test$ridge <- metrics_fusion(y_pred_test, y_test)

# Display scores
score_val

## [1] 0.2334951

score_test

## [1] 0.1457187

Elastic Net

Elastic Net Regression applied to a regression problem.
Cross-validation peformed to select the optimal alpha parameter, which controls the balance between L1 (Lasso) and L2 (Ridge) regularization.
The root mean squared error (RMSE) calculated for the validation set.
For the test set, it predicts the target variable using the trained Elastic Net model with the selected alpha value, calculates the RMSE.
Actual performance metrics (MAE, MAPE, RMSE, MSE, R2) computed for the Elastic Net model on the test set.
Selected alpha, validation RMSE, and test RMSE.

results <- data.frame()

for (i in 0:20) {
  elasticnet <- cv.glmnet(x = as.matrix(X_train_val), y = y_train_val, alpha = i/20)

  row <- data.frame(alpha = i/20, rmse_val = mean(sqrt(elasticnet$cvm)))
  results <- rbind(results, row)
}

best_alpha <- results$alpha[which.min(results$rmse_val)]

# Validation predictions and metrics
score_val <- min(results$rmse_val)
baselines_rmse$elasticnet <- score_val

# Test predictions and metrics
elasticnet <- cv.glmnet(x = as.matrix(X_train_val), y = y_train_val, alpha = best_alpha)

y_pred_test <- predict(elasticnet, newx = as.matrix(X_test))
score_test <- rmse(y_pred_test, y_test)
baselines_rmse_test$elasticnet <- score_test
actual_metrics_test$elasticnet <- metrics_fusion(y_pred_test, y_test)

# Display scores
best_alpha

## [1] 0.9

score_val

## [1] 0.1523751

score_test

## [1] 0.1397882

K-Nearest Neighbors Regression

K-Nearest Neighbors Regression applied to the data
Calculates the RMSE for validation and test sets, and stores the results.
Additional performance metrics computed and displays the RMSE values.

knn_tc <- trainControl(method = "cv", number = 5)
knn_cv <- caret::train(
  SalePrice ~ .,
  data = cbind(X_train_val, SalePrice = y_train_val),
  trControl = knn_tc,
  method = "knn"
)

# Validation predictions and metrics
score_val <- mean(knn_cv$results$RMSE)
baselines_rmse$knn <- score_val

# Test predictions and metrics
knn <- knnreg(x = X_train_val, y = y_train_val)

y_pred_test <- predict(knn, newdata = X_test)
score_test <- rmse(y_pred_test, y_test)
baselines_rmse_test$knn <- score_test
actual_metrics_test$knn <- metrics_fusion(y_pred_test, y_test)

# Display scores
score_val

## [1] 0.1788794

score_test

## [1] 0.1994154

Support Vector Regression

Support Vector Regression (SVR) applied to the data
Calculates the RMSE for validation and test sets, and stores the results.
Additional performance metrics computed and displays the RMSE values.

svr_tc <- trainControl(method = "cv", number = 5)
svr_cv <- caret::train(
  SalePrice ~ .,
  data = cbind(X_train_val, SalePrice = y_train_val),
  trControl = svr_tc,
  method = "svmLinear2",
  scale = F
)

# Validation predictions and metrics
score_val <- mean(svr_cv$results$RMSE)
baselines_rmse$svr <- score_val

# Test predictions and metrics
svr <- e1071::svm(SalePrice ~ ., data = cbind(X_train_val, SalePrice = y_train_val), scale = F)

y_pred_test <- predict(svr, newdata = X_test)
score_test <- rmse(y_pred_test, y_test)
baselines_rmse_test$svr <- score_test
actual_metrics_test$svr <- metrics_fusion(y_pred_test, y_test)

# Display scores
score_val

## [1] 0.1189599

score_test

## [1] 0.1310749

Decision Tree

Decision Tree regression applied to the data.
Calculates the RMSE for validation and test sets, and stores the results.
Additional performance metrics computed and displays the RMSE values.

dt_tc <- trainControl(method = "cv", number = 5)
dt_cv <- caret::train(
  SalePrice ~ .,
  data = cbind(X_train_val, SalePrice = y_train_val),
  trControl = dt_tc,
  method = "rpart"
)

# Validation predictions and metrics
score_val <- mean(dt_cv$results$RMSE, na.rm = TRUE)
baselines_rmse$decision_tree <- score_val

# Test predictions and metrics
dt <- caret::train(x = X_train_val, y = y_train_val, method = "rpart")

y_pred_test <- predict(dt, newdata = X_test)
score_test <- rmse(y_pred_test, y_test)
baselines_rmse_test$decision_tree <- score_test
actual_metrics_test$decision_tree <- metrics_fusion(y_pred_test, y_test)

# Display scores
score_val

## [1] 0.2944126

score_test

## [1] 0.282019

Ensemble Learning

Empty lists initialized that will be used to store the RMSE values and actual performance metrics for ensemble learning models.
Empty lists will be populated with values in subsequent steps of the code.

ensemble_rmse <- list()
ensemble_actual_metrics <- list()

ensemble_rmse_test <- list()
ensemble_actual_metrics_test <- list()

Random Forest

Random forest algorithm implemented for regression tasks using ensemble learning.
Random forest model trained on the training and validation data.
Its performance is evaluated using root mean squared error (RMSE) on the validation and test datasets.
Calculate feature importance using the random forest model and creates a bar plot to visualize the top 30 important features.

rf_tc <- trainControl(method = "cv", number = 5)
rf_cv <- caret::train(
  SalePrice ~ .,
  data = cbind(X_train_val, SalePrice = y_train_val),
  trControl = rf_tc,
  method = "rf"
)

# Validation predictions and metrics
score_val <- mean(rf_cv$results$RMSE)
ensemble_rmse$random_forest <- score_val

# Test predictions and metrics
rf <- randomForest(x = X_train_val, y = y_train_val, proximity = T)

y_pred_test_rf <- predict(rf, newdata = X_test)
score_test <- rmse(y_pred_test_rf, y_test)
ensemble_rmse_test$random_forest <- score_test
ensemble_actual_metrics_test$random_forest <- metrics_fusion(y_pred_test_rf, y_test)

y_pred_train_rf <- predict(rf, newdata = X_train_val)

# Display scores
score_val

## [1] 0.1643092

score_test

## [1] 0.152642

rf_df <- data.frame(rf$importance) %>%
  mutate(Feature = rownames(rf$importance)) %>%
  arrange(desc(IncNodePurity)) %>%
  head(30)

rf_df

ggplot(data = rf_df, aes(x = reorder(Feature, IncNodePurity), y = IncNodePurity)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Features", y = "Importance")

Gradient-Boosted Trees

Gradient-Boosted Trees (GBT) using XGBoost, LightGBM, and CatBoost libraries implemented for regression tasks.
The models are trained on the training and validation data, and their performance is evaluated using RMSE on the validation and test datasets.
The best hyperparameters are determined through cross-validation, and the models are then used to make predictions on the test data.
The RMSE scores and other evaluation metrics are recorded for each model.
The feature importance of the GBT model is visualized using a bar chart.
Code snippet showcases the application of ensemble learning techniques for regression tasks using boosting algorithms.

dtrain_val <- xgb.DMatrix(data = as.matrix(X_train_val), label = y_train_val)
dtest <- xgb.DMatrix(data = as.matrix(X_test), label = y_test)

xgb_params = list(
  eta = 0.01,
  gamma = 0.0468,
  max_depth = 6,
  min_child_weight = 1.41,
  subsample = 0.769,
  colsample_bytree = 0.283
)

xgb_cv <- xgb.cv(
  params = xgb_params,
  data = dtrain_val,
  nround = 10000,
  nfold = 5,
  prediction = F,
  showsd = T,
  metrics = "rmse",
  verbose = 1,
  print_every_n = 500,
  early_stopping_rounds = 25
)

## [1]  train-rmse:11.414310+0.004317   test-rmse:11.414306+0.017596 
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 25 rounds.
## 
## [501]    train-rmse:0.123461+0.000648    test-rmse:0.152562+0.008779 
## [1001]   train-rmse:0.068082+0.000494    test-rmse:0.114561+0.008507 
## [1501]   train-rmse:0.064569+0.000540    test-rmse:0.113083+0.008597 
## [2001]   train-rmse:0.062831+0.000580    test-rmse:0.112503+0.008773 
## [2501]   train-rmse:0.061697+0.000634    test-rmse:0.112112+0.008748 
## Stopping. Best iteration:
## [2768]   train-rmse:0.061252+0.000597    test-rmse:0.111974+0.008758

# Validation predictions and metrics
score_val <- xgb_cv$evaluation_log$test_rmse_mean %>% min
ensemble_rmse$xgboost <- score_val

# Test predictions and metrics
xgb <- xgboost(
  params = xgb_params,
  data = dtrain_val,
  nround = 10000,
  eval_metric = "rmse",
  verbose = 1,
  print_every_n = 500,
  early_stopping_rounds = 25
)

## [1]  train-rmse:11.414108 
## Will train until train_rmse hasn't improved in 25 rounds.
## 
## [501]    train-rmse:0.122362 
## [1001]   train-rmse:0.067779 
## [1501]   train-rmse:0.064224 
## [2001]   train-rmse:0.062695 
## Stopping. Best iteration:
## [2056]   train-rmse:0.062552

y_pred_test_xgb <- predict(xgb, newdata = dtest)
score_test <- rmse(y_pred_test_xgb, y_test)
ensemble_rmse_test$xgboost <- score_test
ensemble_actual_metrics_test$xgboost <- metrics_fusion(y_pred_test_xgb, y_test)

y_pred_train_xgb <- predict(xgb, newdata = dtrain_val)

# Display scores
score_val

## [1] 0.1119744

score_test

## [1] 0.1359029

xgb_df <- xgb.importance(model = xgb) %>% head(30)

xgb_df

ggplot(data = xgb_df, aes(x = reorder(Feature, Gain), y = Gain)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Features", y = "Importance")

objective_fn <- makeSingleObjectiveFunction(
  fn = function(x) {
    params = list(
      booster = "gbtree",
      eta = x["eta"],
      gamma = x["gamma"],
      max_depth = x["max_depth"],
      min_child_weight = x["min_child_weight"],
      subsample = x["subsample"],
      colsample_bytree = x["colsample_bytree"],
      max_delta_step = x["max_delta_step"]
    )

    cv <- xgb.cv(
      params = params,
      data = dtrain_val,
      nround = 10000,
      nfold = 5,
      prediction = F,
      showsd = T,
      metrics = "rmse",
      verbose = 1,
      print_every_n = 500,
      early_stopping_rounds = 25
    )

    cv$evaluation_log$test_rmse_mean %>% min
  },
  par.set = makeParamSet(
    makeNumericParam("eta", lower = 0.005, upper = 0.01),
    makeNumericParam("gamma", lower = 0.01, upper = 5),
    makeIntegerParam("max_depth", lower = 2, upper = 10),
    makeIntegerParam("min_child_weight", lower = 1, upper = 2000),
    makeNumericParam("subsample", lower = 0.20,  upper = 0.8),
    makeNumericParam("colsample_bytree", lower = 0.20, upper = 0.8),
    makeNumericParam("max_delta_step", lower = 0, upper = 5)
  ),
  minimize = TRUE
)

# Define experiments and an objective function
design <- generateDesign(n = 1000, par.set = getParamSet(objective_fn), fun = lhs::randomLHS)
control <- makeMBOControl() %>% setMBOControlTermination(., iters = 10)

# run <- mbo(
#  fun = objective_fn,
#  design = design,
#  learner = makeLearner("regr.km", predict.type = "se", covtype = "matern3_2", control = list(trace = FALSE)),
#  control = control,
#  show.info = TRUE
# )

# Best parameters
# run$x

lgb_train_val <- lgb.Dataset(data = as.matrix(X_train_val), label = y_train_val)
lgb_test <- lgb.Dataset(data = as.matrix(X_test), label = y_test)

params <- list(
  objective = "regression",
  metric = "rmse",
  boosting_type = "gbdt",
  num_boost_round = 10000,
  num_leaves = 15,
  learning_rate = 0.1,
  feature_fraction = 0.9,
  bagging_fraction = 0.8,
  bagging_freq = 5,
  force_row_wise = T
)

lgb_cv <- lgb.cv(
  params = params,
  data = lgb_train_val,
  early_stopping_rounds = 25,
  verbose = 0
)

## [LightGBM] [Info] Start training from score 12.018962
## [LightGBM] [Info] Start training from score 12.025414
## [LightGBM] [Info] Start training from score 12.022889

# Validation predictions and metrics
score_val <- min(unlist(lgb_cv$record_evals$valid$rmse$eval))
ensemble_rmse$lightgbm <- score_val

# Test predictions and metrics
lgb <- lgb.train(
  params = params,
  data = lgb_train_val,
  verbose = 0
)

y_pred_test_lgb <- predict(lgb, data = as.matrix(X_test))
score_test <- rmse(y_pred_test_lgb, y_test)
ensemble_rmse_test$lightgbm <- score_test
ensemble_actual_metrics_test$lightgbm <- metrics_fusion(y_pred_test_lgb, y_test)

y_pred_train_lgb <- predict(lgb, data = as.matrix(X_train_val))

# Display scores
score_val

## [1] 0.1246116

score_test

## [1] 0.1435826

lgb_df <- lgb.importance(model = lgb) %>% head(30)

lgb_df

ggplot(data = xgb_df, aes(x = reorder(Feature, Gain), y = Gain)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Features", y = "Importance")

train_val_pool <- catboost.load_pool(data = X_train_val, label = y_train_val)
test_pool <- catboost.load_pool(data = X_test, label = y_test)

params <- list(
  loss_function = "RMSE",
  iterations = 10000,
  learning_rate = 0.01,
  metric_period = 1000
)

catb_cv <- catboost.cv(
  train_val_pool,
  params = params,
  fold_count = 5,
  early_stopping_rounds = 25
)

## Warning: Overfitting detector is active, thus evaluation metric is calculated on every iteration. 'metric_period' is ignored for evaluation metric.
## 0:   learn: 11.9129312   test: 11.9132621    best: 11.9132621 (0)    total: 208ms    remaining: 34m 36s
## 1000:    learn: 0.1222245    test: 0.2262026 best: 0.2262026 (1000)  total: 1m 19s   remaining: 11m 57s
## 2000:    learn: 0.0685048    test: 0.2006144 best: 0.2006144 (2000)  total: 2m 37s   remaining: 10m 29s
## 3000:    learn: 0.0454878    test: 0.1946286 best: 0.1946286 (3000)  total: 3m 52s   remaining: 9m 1s
## 4000:    learn: 0.0318095    test: 0.1920010 best: 0.1920010 (4000)  total: 5m 12s   remaining: 7m 48s
## 5000:    learn: 0.0230517    test: 0.1906835 best: 0.1906835 (5000)  total: 6m 28s   remaining: 6m 28s
## 6000:    learn: 0.0171276    test: 0.1899629 best: 0.1899625 (5999)  total: 8m 26s   remaining: 5m 37s
## 7000:    learn: 0.0129467    test: 0.1895087 best: 0.1895086 (6999)  total: 9m 43s   remaining: 4m 10s
## 8000:    learn: 0.0098459    test: 0.1892415 best: 0.1892415 (8000)  total: 10m 32s  remaining: 2m 37s
## 9000:    learn: 0.0075465    test: 0.1890371 best: 0.1890370 (8998)  total: 11m 26s  remaining: 1m 16s
## Stopped by overfitting detector  (25 iterations wait)

# Validation predictions and metrics
score_val <- min(catb_cv$test.RMSE.mean)
ensemble_rmse$catboost <- score_val

# Test predictions and metrics
catb <- catboost.train(
  params = params,
  learn_pool = train_val_pool
)

## 0:   learn: 0.3949874    total: 11.9ms   remaining: 1m 58s
## 1000:    learn: 0.0760428    total: 7.03s    remaining: 1m 3s
## 2000:    learn: 0.0505683    total: 14.5s    remaining: 58.1s
## 3000:    learn: 0.0348161    total: 21.6s    remaining: 50.4s
## 4000:    learn: 0.0252573    total: 28.6s    remaining: 42.9s
## 5000:    learn: 0.0190112    total: 36.6s    remaining: 36.6s
## 6000:    learn: 0.0146441    total: 44.4s    remaining: 29.6s
## 7000:    learn: 0.0115120    total: 52s  remaining: 22.3s
## 8000:    learn: 0.0091391    total: 59.8s    remaining: 14.9s
## 9000:    learn: 0.0074168    total: 1m 6s    remaining: 7.42s
## 9999:    learn: 0.0060960    total: 1m 13s   remaining: 0us

y_pred_test_catboost <- catboost.predict(catb, test_pool)
score_test <- rmse(y_pred_test_catboost, y_test)
ensemble_rmse_test$catboost <- score_test
ensemble_actual_metrics_test$catboost <- metrics_fusion(y_pred_test_catboost, y_test)

y_pred_train_catboost <- catboost.predict(catb, train_val_pool)

# Display scores
score_val

## [1] 0.1890371

score_test

## [1] 0.1339784

catb_df <- data.frame(catboost.get_feature_importance(catb))
catb_df <- catb_df %>%
  mutate(Feature = rownames(catb_df)) %>%
  rename(Importance = catboost.get_feature_importance.catb.) %>%
  arrange(desc(Importance)) %>%
  head(30)

catb_df

ggplot(data = catb_df, aes(x = reorder(Feature, Importance), y = Importance)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(x = "Features", y = "Importance")

Stacking

Stacking is performed by combining the predictions from different models (Random Forest, XGBoost, LightGBM, CatBoost) into a new dataset.
Linear regression model is trained on the stacked dataset to learn the relationship between ensemble predictions and the actual target values.
Model’s predictions on the test dataset are evaluated using root mean squared error (RMSE).
Metrics fusion function is applied to assess the performance of the stacking ensemble.

stacked_data <- data.frame(y = y_train_val, prediction_rf = y_pred_train_rf, prediction_xgb = y_pred_train_xgb, prediction_lgb = y_pred_train_lgb, prediction_catboost = y_pred_train_catboost)

stacked_data_test <- data.frame(y = y_test, prediction_rf = y_pred_test_rf, prediction_xgb = y_pred_test_xgb, prediction_lgb = y_pred_test_lgb, prediction_catboost = y_pred_test_catboost)

model_meta <- caret::train(y ~ ., data = stacked_data, method = "lm")
predictions_meta <- predict(model_meta, newdata = stacked_data_test)

ensemble_rmse_test$stacking <- rmse(predictions_meta, y_test)
ensemble_actual_metrics_test$stacking <- metrics_fusion(predictions_meta, y_test)

Comparison

Compares the performance of different models and ensembles using the root mean squared error (RMSE) metric.
Bar charts created to display the RMSE values, with the x-axis representing the models and the y-axis representing the RMSE.
The first plot focuses on the evaluation using the validation set.
The second plot focuses on the evaluation using the test set.
Purpose is to compare the models’ performance and determine which ones have the lowest RMSE, indicating better predictive accuracy.

# Plot validation loss
rmse_all <- c(unlist(baselines_rmse), unlist(ensemble_rmse))
rmse_all_names <- c(names(baselines_rmse), names(ensemble_rmse))

df <- data.frame(models = rmse_all_names, rmse = rmse_all)
df %>% arrange(rmse)

ggplot(df, aes(x = factor(models, level=rmse_all_names), y = rmse)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  xlab("Models") +
  ylab("RMSE") +
  ylim(0, 0.3) +
  ggtitle("Validation Loss") +
  scale_x_discrete(guide = guide_axis(angle = 30)) +
  theme_minimal()

# Plot test loss
rmse_all_test <- c(unlist(baselines_rmse_test), unlist(ensemble_rmse_test))
rmse_all_test_names <- c(names(baselines_rmse_test), names(ensemble_rmse_test))

df <- data.frame(models = rmse_all_test_names, rmse = rmse_all_test)
df %>% arrange(rmse)

ggplot(df, aes(x = factor(models, level=rmse_all_test_names), y = rmse)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  xlab("Models") +
  ylab("RMSE") +
  ylim(0, 0.3) +
  ggtitle("Test Loss") +
  scale_x_discrete(guide = guide_axis(angle = 30)) +
  theme_minimal()

data.frame(t(data.frame(actual_metrics_test))) %>% arrange(desc(r2))

data.frame(t(data.frame(ensemble_actual_metrics_test))) %>% arrange(desc(r2))

Association Rule Mining on Housing Characteristics

Code snippet performs association rule mining on housing characteristics using the Apriori algorithm.
Converts categorical data into transactions and applies the Apriori algorithm to find frequent itemsets.
Generate association rules using given support and confidence thresholds.
The code includes a dictionary to map column names and values to their corresponding descriptions for better rule interpretation.
Top ten association rules with their corresponding explanations displayed, including the antecedent (IF) and consequent (THEN) parts along with their descriptions and confidence percentages.

transactions <- transactions(categorical_data)
summary(transactions)

## transactions as itemMatrix in sparse format with
##  1460 rows (elements/itemsets/transactions) and
##  218 columns (items) and a density of 0.1697248 
## 
## most frequent items:
## Utilities=AllPub      Street=Pave  Condition2=Norm RoofMatl=CompShg 
##             1459             1454             1445             1434 
##     Heating=GasA          (Other) 
##             1428            46800 
## 
## element (itemset/transaction) length distribution:
## sizes
##   37 
## 1460 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      37      37      37      37      37      37 
## 
## includes extended item information - examples:
##             labels variables  levels
## 1 MSZoning=C (all)  MSZoning C (all)
## 2      MSZoning=FV  MSZoning      FV
## 3      MSZoning=RH  MSZoning      RH
## 
## includes extended transaction information - examples:
##   transactionID
## 1             1
## 2             2
## 3             3

inspect(head(transactions, n = 1))

##     items                   transactionID
## [1] {MSZoning=RL,                        
##      Street=Pave,                        
##      LotShape=Reg,                       
##      LandContour=Lvl,                    
##      Utilities=AllPub,                   
##      LotConfig=Inside,                   
##      LandSlope=Gtl,                      
##      Neighborhood=CollgCr,               
##      Condition1=Norm,                    
##      Condition2=Norm,                    
##      BldgType=1Fam,                      
##      HouseStyle=2Story,                  
##      RoofStyle=Gable,                    
##      RoofMatl=CompShg,                   
##      Exterior1st=VinylSd,                
##      MasVnrType=BrkFace,                 
##      ExterQual=Gd,                       
##      ExterCond=TA,                       
##      Foundation=PConc,                   
##      BsmtQual=Gd,                        
##      BsmtCond=TA,                        
##      BsmtExposure=No,                    
##      BsmtFinType1=GLQ,                   
##      BsmtFinType2=Unf,                   
##      Heating=GasA,                       
##      HeatingQC=Ex,                       
##      CentralAir=Y,                       
##      Electrical=SBrkr,                   
##      KitchenQual=Gd,                     
##      Functional=Typ,                     
##      GarageType=Attchd,                  
##      GarageFinish=RFn,                   
##      GarageQual=TA,                      
##      GarageCond=TA,                      
##      PavedDrive=Y,                       
##      SaleType=WD,                        
##      SaleCondition=Normal}              1

rules <- apriori(transactions, parameter = list(support = 0.95, confidence = 0.95))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.95    0.1    1 none FALSE            TRUE       5    0.95      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 1387 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[218 item(s), 1460 transaction(s)] done [0.01s].
## sorting and recoding items ... [7 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [94 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

summary(rules)

## set of 94 rules
## 
## rule length distribution (lhs + rhs):sizes
##  1  2  3  4 
##  7 28 39 20 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   2.766   3.000   4.000 
## 
## summary of quality measures:
##     support         confidence        coverage           lift       
##  Min.   :0.9500   Min.   :0.9534   Min.   :0.9507   Min.   :0.9996  
##  1st Qu.:0.9562   1st Qu.:0.9793   1st Qu.:0.9678   1st Qu.:0.9998  
##  Median :0.9637   Median :0.9895   Median :0.9781   Median :1.0000  
##  Mean   :0.9664   Mean   :0.9871   Mean   :0.9791   Mean   :1.0000  
##  3rd Qu.:0.9740   3rd Qu.:0.9958   3rd Qu.:0.9897   3rd Qu.:1.0000  
##  Max.   :0.9993   Max.   :0.9993   Max.   :1.0000   Max.   :1.0010  
##      count     
##  Min.   :1387  
##  1st Qu.:1396  
##  Median :1407  
##  Mean   :1411  
##  3rd Qu.:1422  
##  Max.   :1459  
## 
## mining info:
##          data ntransactions support confidence
##  transactions          1460    0.95       0.95
##                                                                               call
##  apriori(data = transactions, parameter = list(support = 0.95, confidence = 0.95))

con <- file("data_description.txt", open = "r")

column_dictionary <- list()
value_dictionary <- list()

repeat {
  line <- readLines(con, n = 1)

  if (length(line) == 0) {
    break
  }

  first_character <- substr(line, 1, 1)

  if (first_character == "") {
    next
  }

  if (first_character != " ") {
    column_name <- sub(":.*", "", line)
    column_description <- trimws(sub(".*:", "", line))

    column_dictionary[[column_name]] <- column_description
    value_dictionary[[column_name]] <- list()
  } else {
    pairs <- unlist(strsplit(line, "\t"))
    key <- trimws(pairs[1])
    value <- trimws(pairs[2])

    value_dictionary[[column_name]][[key]] <- value
  }
}

close(con)

rules_top_ten_df <- data.frame(
  lhs = labels(lhs(rules)),
  rhs = labels(rhs(rules)),
  rules@quality
) %>% arrange(desc(lift)) %>% head(n = 20)

for (i in 1:nrow(rules_top_ten_df)) {
  row <- rules_top_ten_df[i, ]

  explanation <- ""
  lhs <- unlist(strsplit(gsub('^.|.$', '', row["lhs"]), ","))

  for (i in 1:length(lhs)) {
    pair <- unlist(strsplit(lhs[i], "="))
    key <- pair[1]
    value <- pair[2]

    key_t <- column_dictionary[[key]]
    value_t <- value_dictionary[[key]][[value]]

    if (i == 1) {
      explanation <- paste("IF", key_t, "=", value_t)
    } else {
      explanation <- paste(explanation, "AND", key_t, "=", value_t)
    }
  }

  rhs <- unlist(strsplit(gsub('^.|.$', '', row["rhs"]), "="))
  key <- rhs[1]
  value <- rhs[2]

  key_t <- column_dictionary[[key]]
  value_t <- value_dictionary[[key]][[value]]

  confidence_pct <- format(round(row["confidence"] * 100, 2), 2)

  explanation <- paste(explanation, "THEN", key_t, "=", value_t, "(Confidence:", paste0(confidence_pct, "%)"))
  print(explanation)
  cat("\n")
}

## [1] "IF Proximity to various conditions (if more than one is present) = Normal THEN Garage condition = Typical/Average (Confidence: 96.47%)"
## 
## [1] "IF Garage condition = Typical/Average THEN Proximity to various conditions (if more than one is present) = Normal (Confidence: 99.08%)"
## 
## [1] "IF Type of utilities available = All public Utilities (E,G,W,& S) AND Garage condition = Typical/Average THEN Proximity to various conditions (if more than one is present) = Normal (Confidence: 99.08%)"
## 
## [1] "IF Type of utilities available = All public Utilities (E,G,W,& S) AND Proximity to various conditions (if more than one is present) = Normal THEN Garage condition = Typical/Average (Confidence: 96.47%)"
## 
## [1] "IF Type of road access to property = Paved AND Garage condition = Typical/Average THEN Proximity to various conditions (if more than one is present) = Normal (Confidence: 99.07%)"
## 
## [1] "IF Type of road access to property = Paved AND Type of utilities available = All public Utilities (E,G,W,& S) AND Garage condition = Typical/Average THEN Proximity to various conditions (if more than one is present) = Normal (Confidence: 99.07%)"
## 
## [1] "IF Type of road access to property = Paved AND Proximity to various conditions (if more than one is present) = Normal THEN Garage condition = Typical/Average (Confidence: 96.46%)"
## 
## [1] "IF Type of road access to property = Paved AND Type of utilities available = All public Utilities (E,G,W,& S) AND Proximity to various conditions (if more than one is present) = Normal THEN Garage condition = Typical/Average (Confidence: 96.45%)"
## 
## [1] "IF Type of heating = Gas forced warm air furnace THEN Roof material = Standard (Composite) Shingle (Confidence: 98.25%)"
## 
## [1] "IF Roof material = Standard (Composite) Shingle THEN Type of heating = Gas forced warm air furnace (Confidence: 97.84%)"
## 
## [1] "IF Type of utilities available = All public Utilities (E,G,W,& S) AND Type of heating = Gas forced warm air furnace THEN Roof material = Standard (Composite) Shingle (Confidence: 98.25%)"
## 
## [1] "IF Type of utilities available = All public Utilities (E,G,W,& S) AND Roof material = Standard (Composite) Shingle THEN Type of heating = Gas forced warm air furnace (Confidence: 97.84%)"
## 
## [1] "IF Type of road access to property = Paved AND Type of heating = Gas forced warm air furnace THEN Roof material = Standard (Composite) Shingle (Confidence: 98.24%)"
## 
## [1] "IF Type of road access to property = Paved AND Type of utilities available = All public Utilities (E,G,W,& S) AND Type of heating = Gas forced warm air furnace THEN Roof material = Standard (Composite) Shingle (Confidence: 98.24%)"
## 
## [1] "IF Type of road access to property = Paved AND Roof material = Standard (Composite) Shingle THEN Type of heating = Gas forced warm air furnace (Confidence: 97.83%)"
## 
## [1] "IF Type of road access to property = Paved AND Type of utilities available = All public Utilities (E,G,W,& S) AND Roof material = Standard (Composite) Shingle THEN Type of heating = Gas forced warm air furnace (Confidence: 97.83%)"
## 
## [1] "IF Proximity to various conditions (if more than one is present) = Normal AND Type of heating = Gas forced warm air furnace THEN Roof material = Standard (Composite) Shingle (Confidence: 98.23%)"
## 
## [1] "IF Type of utilities available = All public Utilities (E,G,W,& S) AND Proximity to various conditions (if more than one is present) = Normal AND Type of heating = Gas forced warm air furnace THEN Roof material = Standard (Composite) Shingle (Confidence: 98.23%)"
## 
## [1] "IF Proximity to various conditions (if more than one is present) = Normal AND Roof material = Standard (Composite) Shingle THEN Type of heating = Gas forced warm air furnace (Confidence: 97.82%)"
## 
## [1] "IF Type of utilities available = All public Utilities (E,G,W,& S) AND Proximity to various conditions (if more than one is present) = Normal AND Roof material = Standard (Composite) Shingle THEN Type of heating = Gas forced warm air furnace (Confidence: 97.81%)"

Conclusion

Several regression models were trained and evaluated on a dataset.
The linear regression model, trained using cross-validation, achieved good results with an RMSE score of approximately 0.1279 on the validation dataset and 0.1585 on the test dataset.
Lasso regression, which incorporates feature selection, had an average RMSE score of around 0.1534 on validation and 0.1401 on the test dataset.
Ridge regression, focusing on reducing multicollinearity, had an average RMSE score of approximately 0.2335 on validation and 0.1457 on the test dataset.
Elastic Net regression, combining Lasso and Ridge regularization, performed well with an RMSE score of around 0.1524 on validation and 0.1398 on the test dataset.
K-Nearest Neighbors regression achieved moderate performance with an average RMSE score of approximately 0.1789 on validation and 0.1994 on the test dataset.
The SVR model, based on support vector regression, consistently performed well with an average RMSE score of 0.1190 on validation and 0.1311 on the test dataset.
The decision tree model had an average RMSE score of 0.2944 on validation and 0.2820 on the test dataset.
In comparison, the random forest model, an ensemble of decision trees, outperformed the decision tree model with an average RMSE score of 0.1643 on validation and 0.1526 on the test dataset.
The random forest model highlighted influential features such as “OverallQual,” “GrLivArea,” and “YearBuilt” in predicting sale prices.
Among the gradient-boosted tree models, XGBoost and LightGBM had similar performance.
XGBoost achieved a validation RMSE of 0.1126 and a test RMSE of 0.1373, while LightGBM had a validation RMSE of 0.1258 and a test RMSE of 0.1436.
CatBoost had slightly higher errors with a validation RMSE of 0.2121 and a test RMSE of 0.1340.
Key factors influencing house prices included OverallQual, GrLivArea, and GarageArea.
Overall, the linear regression, Lasso regression, Elastic Net regression, SVR, random forest, XGBoost, and LightGBM models showed promising performance in predicting sale prices.
These models can provide valuable insights for decision-making in the real estate market.
In the analysis, comparison the performance of baseline models and ensemble models for the given dataset will be occur.
The evaluation metric used was Root Mean Squared Error (RMSE), which measures the average prediction error.
RMSE provides a measure of the average prediction error in the same units as the target variable.
It is useful for comparing different models or evaluating the performance of a single model.
Lower RMSE values indicate better predictive accuracy, as they imply smaller average errors between predicted and actual values.
Higher RMSE values indicate worse predictive accuracy, as they imply bigger average errors between predicted and actual values.

Baseline Models (Validation)

Seven models had been used which are Linear Regression, Lasso, Ridge, Elastic Net, KNN, SVR and Decision Tree.
Among the baseline models, SVR had the lowest Root Mean Squared Error (RMSE), followed closely by Linear Regression.
KNN, Elastic Net, Lasso and Ridge performed relatively well.
However, Decision Tree had the highest RMSE, indicating poorer predictive performance.

Ensemble Models (Validation)

The ensemble models, including Random Forest, XGBoost, LightGBM, and CatBoost, showed improved performance compared to individual baseline models.
XGBoost achieved the lowest RMSE on the test set, closely followed by LightGBM.
The model having the highest RMSE is CatBoost, followed by Random Forest, indicating poorer predictive performance.

Baseline Models (Test)

Based on the provided plot and data for the baseline models’ RMSE on the test set, Decision Tree had the highest RMSE, while SVR had the lowest.
Decision Tree showed the worst predictive performance, closely followed by KNN and Linear Regression.
SVR showed the best predictive performance, closely followed by Elastic Net, Lasso Regression, and Ridge Regression.

Ensemble Models (Test)

Based on the provided plot and data for the ensemble models’ RMSE on the test set, XGBoost and CatBoost achieved the lowest RMSE value.
They were closely followed by LightGBM.
These three models exhibited the best predictive performance among the ensemble models.
Random forest had the highest RMSE but it also performed relatively well.
The stacking model had a slightly higher RMSE compared to the top-performing ensemble models but still demonstrated a good predictive performance.

Comparison

In baseline models, SVR, Elastic Net, Linear Regression and Lasso showed the best predictive performance.
XGBoost, CatBoost, and LightGBM showed the best predictive performance among the ensemble models based on the RMSE metric.
These models have the potential to provide more accurate predictions for the target variable on the test set.

Programming of Data Science

Group 3

Introduction

Problem Statement:

Research Objective

Data Description/ Source

Import Necessary Libraries

Import Dataset

Dataset Overview

Data Pre-processing

Check for Sum of Duplicates Between Columns

Remove Meaningless Columns

Check for Missing Values

Remove Columns With More Than 20% Missing Values

Impute Missing Data

Explore Data - Exploratory Data Analysis (EDA)

Plot Distribution of Numeric Data

Check Outliers Using Boxplot

Plot Correlation Matrix

Bar Plots for Categorical Data

Data Transformation/Scaling

Clustering

PCA + K-Means Clustering

t-SNE + Hierarchical Clustering

Silhouette Graph

Regression Modelling

Baselines

Linear Regression

Lasso Regression

Ridge Regression

Elastic Net

K-Nearest Neighbors Regression

Support Vector Regression

Decision Tree

Ensemble Learning

Random Forest

Gradient-Boosted Trees

Stacking

Comparison

Association Rule Mining on Housing Characteristics

Conclusion

Baseline Models (Validation)

Ensemble Models (Validation)

Baseline Models (Test)

Ensemble Models (Test)

Comparison