| Name | Matric Number |
|---|---|
| Hoo Xing Yu | S2117839 |
| Pei Lin Soh | S2193368 |
| Low Kian Haw | S2190839 |
| Clarice Lau Yeo Hang | S2192784 |
| Yeoh Joer | 22077700 |
Clustering – Group houses into tiers based on their features to identify common characteristics that can inform our decisions about what types of houses to build.
Prediction – Predict house prices from features such as location, size, number of rooms, and other amenities. This will allow us to price our properties competitively and attract potential buyers.
Association rule – Determine customers’ preferences for renovation based on house types to inform our renovation plans and maximize the number of houses sold.
Maximize revenue and customer satisfaction while minimizing costs and risk.
Identify the best types of properties to offer at competitive prices to attract potential buyers.
The dataset was collected from Kaggle and comes as separate train and test sets.
Data Source: https://www.kaggle.com/datasets/prevek18/ames-housing-dataset
suppressPackageStartupMessages({
library(arules)
library(caret)
library(catboost)
library(cluster)
library(corrplot)
library(dplyr)
library(e1071)
library(factoextra)
library(ggplot2)
library(glmnet)
library(gridExtra)
library(lattice)
library(lightgbm)
library(mlr)
library(mlrMBO)
library(psych)
library(randomForest)
library(readr)
library(stats)
library(tidyr)
library(viridis)
library(xgboost)
library(DescTools)
library(Metrics)
library(Rtsne)
library(DiceKriging)
})
dataset <- read.csv("train.csv")
head(dataset)
There are 1460 records (rows) and 81 attributes (columns).
The dataset is mostly made up of numerical values (integer columns), with the remaining attributes stored as character strings.
cat("Total rows & columns:", dim(dataset), '\n\n')
## Total rows & columns: 1460 81
str(dataset)
## 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : chr NA "TA" "TA" "Gd" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr NA NA NA NA ...
## $ MiscFeature : chr NA NA NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
duplicates <- sum(duplicated(dataset))
cat(paste("Number of duplicates:", duplicates))
## Number of duplicates: 0
At this stage, we have removed irrelevant columns from the dataset.
The ‘Id’ column is removed, as it is merely a unique identifier for each record.
‘GarageCars’ column is removed, as it provides redundant information already captured by the ‘GarageArea’ column.
We also removed the ‘Exterior2nd’ column, as it represents a limited set of additional choices for exterior covering.
Furthermore, we removed the ‘GarageYrBlt’ column, as the year of construction is typically the same as the ‘YearBuilt’ column.
Finally, we removed the ‘TotRmsAbvGrd’ column, as the total room count conveys information largely redundant with the above-grade living area (‘GrLivArea’).
# Subset the dataset by removing specific columns
dataset <- subset(dataset, select = -c(Id, GarageCars, Exterior2nd, GarageYrBlt, TotRmsAbvGrd))
There are a total of 18 columns with missing values.
5 of them have more than 20% missing values: PoolQC, MiscFeature, Alley, Fence, FireplaceQu.
Columns with more than 20% missing values are removed at this stage.
# Calculate missing percentage
missing_percentages <- colSums(is.na(dataset)) / nrow(dataset) * 100
# Create a data frame with column names and missing percentages
missing_dataset <- data.frame(
column_name = names(missing_percentages),
missing_percentage = missing_percentages
) %>%
# Sort the data frame by missing percentages in descending order
arrange(desc(missing_percentage)) %>%
# Filter out columns with 0 missing percentages
filter(missing_percentage > 0)
# Print the missing dataset
print(missing_dataset)
## column_name missing_percentage
## PoolQC PoolQC 99.52054795
## MiscFeature MiscFeature 96.30136986
## Alley Alley 93.76712329
## Fence Fence 80.75342466
## FireplaceQu FireplaceQu 47.26027397
## LotFrontage LotFrontage 17.73972603
## GarageType GarageType 5.54794521
## GarageFinish GarageFinish 5.54794521
## GarageQual GarageQual 5.54794521
## GarageCond GarageCond 5.54794521
## BsmtExposure BsmtExposure 2.60273973
## BsmtFinType2 BsmtFinType2 2.60273973
## BsmtQual BsmtQual 2.53424658
## BsmtCond BsmtCond 2.53424658
## BsmtFinType1 BsmtFinType1 2.53424658
## MasVnrType MasVnrType 0.54794521
## MasVnrArea MasVnrArea 0.54794521
## Electrical Electrical 0.06849315
dataset <- subset(dataset, select = -c(PoolQC, MiscFeature, Alley, Fence, FireplaceQu))
The median is chosen to fill in missing numerical values, rather than the mean.
The mean is sensitive to extreme values and can be heavily influenced by outliers, which can lead to biased estimates.
In contrast, the median is a robust measure of central tendency that is less sensitive to extreme values and more appropriate for skewed data.
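As a quick illustration with toy numbers (not from the dataset), a single outlier pulls the mean far from the bulk of the data while the median barely moves:
# Toy example: one extreme value dominates the mean but not the median
x <- c(100, 120, 130, 140, 5000)
mean(x)   # 1098 -- pulled up by the outlier
median(x) # 130  -- unaffected by the outlier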
# Identify numerical columns & replace missing numerical values with median
num_cols <- sapply(dataset, is.numeric)
for (col in names(dataset[, num_cols])) {
col_median <- median(dataset[, col], na.rm = TRUE)
dataset[is.na(dataset[, col]), col] <- col_median
}
# Identify categorical columns & replace missing categorical values with mode
cat_cols <- sapply(dataset, is.factor) | sapply(dataset, is.character)
for (col in names(dataset[, cat_cols])) {
dataset[, col][is.na(dataset[, col])] <- as.character(Mode(dataset[, col], na.rm = TRUE))
}
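As a sanity check (an addition, not in the original run), we can confirm that the imputation left no missing values behind:
# Verify that no NA values remain after imputation
stopifnot(!anyNA(dataset))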
# Select numerical columns using the "is.numeric" function
numerical_cols <- names(dataset)[sapply(dataset, is.numeric)]
categorical_cols <- names(dataset)[sapply(dataset, is.factor) | sapply(dataset, is.character)]
# Create new data frames with only numerical and categorical columns
numerical_data <- dataset[numerical_cols]
categorical_data <- dataset[categorical_cols]
# Display the first six rows of the numerical data frame
head(numerical_data)
# Create a function to plot probability density plot and skewness
plot_density <- function(column, col_name) {
# Compute the skewness of the data
skewness <- skew(column)
# Set the plot color based on the skewness
if (skewness < 0) {
plot_color <- "blue"
} else {
plot_color <- "red"
}
# Create the density plot
plot_title <- col_name # Rename the plot title with the column name
density_plot <- ggplot(data = data.frame(column), aes(x = column)) +
geom_density(fill = plot_color, alpha = 0.3) +
ggtitle(plot_title)
return(density_plot)
}
# Apply the plot_density function to each column of numerical data in numerical_data
# Select the first 20 column names from numerical_data
first_20 <- names(numerical_data)[1:20]
# Apply the plot_density function to each selected column
plots <- lapply(first_20, function(col_name) {
plot_density(numerical_data[[col_name]], col_name)
})
# Arrange the resulting plots in a grid with 5 columns
grid.arrange(grobs = plots, ncol = 5)
# Select the remaining 14 column names from numerical_data
rest_14 <- names(numerical_data)[21:34]
# Apply the plot_density function to each selected column
plots <- lapply(rest_14, function(col_name) {
plot_density(numerical_data[[col_name]], col_name)
})
# Arrange the resulting plots in a grid with 5 columns
grid.arrange(grobs = plots, ncol = 5)
Outliers are found in most of the numerical columns (29 of the 34 attributes):
“MSSubClass”, “LotFrontage”, “LotArea”, “OverallQual”, “OverallCond”, “YearBuilt”, “MasVnrArea”, “BsmtFinSF1”, “BsmtFinSF2”, “BsmtUnfSF”, “TotalBsmtSF”, “X1stFlrSF”, “X2ndFlrSF”, “LowQualFinSF”, “GrLivArea”, “BsmtFullBath”, “BsmtHalfBath”, “BedroomAbvGr”, “KitchenAbvGr”, “Fireplaces”, “GarageArea”, “WoodDeckSF”, “OpenPorchSF”, “EnclosedPorch”, “X3SsnPorch”, “ScreenPorch”, “PoolArea”, “MiscVal”, “SalePrice”
# Create a list to store the column names with outliers
columns_with_outliers <- c()
# Create a list of boxplots for the first 20 attributes
boxplot_numerical_first20attributes <- lapply(1:20, function(i) {
# Calculate the lower and upper fences for outlier detection
Q1 <- quantile(numerical_data[, i], 0.25)
Q3 <- quantile(numerical_data[, i], 0.75)
IQR <- Q3 - Q1
lower_fence <- Q1 - 1.5 * IQR
upper_fence <- Q3 + 1.5 * IQR
# Check for outliers
outliers <- numerical_data[, i] < lower_fence | numerical_data[, i] > upper_fence
# Add the column name to the list if outliers are present
if (any(outliers)) {
columns_with_outliers <<- c(columns_with_outliers, colnames(numerical_data)[i])
}
# Create the boxplot
ggplot(numerical_data, aes(x = 1, y = numerical_data[, i])) +
geom_boxplot() +
ggtitle(paste0("Boxplot of", colnames(numerical_data)[i])) +
xlab("") +
ylab(colnames(numerical_data)[i])
})
# Arrange the resulting boxplots in a grid with 5 columns
grid.arrange(grobs = boxplot_numerical_first20attributes, ncol = 5)
# Create a list of boxplots for the rest
boxplot_numerical_restattributes <- lapply(21:34, function(i) {
# Calculate the lower and upper fences for outlier detection
Q1 <- quantile(numerical_data[, i], 0.25)
Q3 <- quantile(numerical_data[, i], 0.75)
IQR <- Q3 - Q1
lower_fence <- Q1 - 1.5 * IQR
upper_fence <- Q3 + 1.5 * IQR
# Check for outliers
outliers <- numerical_data[, i] < lower_fence | numerical_data[, i] > upper_fence
# Add the column name to the list if outliers are present
if (any(outliers)) {
columns_with_outliers <<- c(columns_with_outliers, colnames(numerical_data)[i])
}
# Create the boxplot
ggplot(numerical_data, aes(x = 1, y = numerical_data[, i])) +
geom_boxplot() +
ggtitle(paste0("Boxplot of", colnames(numerical_data)[i])) +
xlab("") +
ylab(colnames(numerical_data)[i])
})
# Arrange the resulting boxplots in a grid with 5 columns
grid.arrange(grobs = boxplot_numerical_restattributes, ncol = 5)
# Print the column names with outliers
print(columns_with_outliers)
## [1] "MSSubClass" "LotFrontage" "LotArea" "OverallQual"
## [5] "OverallCond" "YearBuilt" "MasVnrArea" "BsmtFinSF1"
## [9] "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF" "X1stFlrSF"
## [13] "X2ndFlrSF" "LowQualFinSF" "GrLivArea" "BsmtFullBath"
## [17] "BsmtHalfBath" "BedroomAbvGr" "KitchenAbvGr" "Fireplaces"
## [21] "GarageArea" "WoodDeckSF" "OpenPorchSF" "EnclosedPorch"
## [25] "X3SsnPorch" "ScreenPorch" "PoolArea" "MiscVal"
## [29] "SalePrice"
Only one pair of attributes has a correlation above 0.8:
X1stFlrSF & TotalBsmtSF || 0.81953
# Create a heat map of the correlation matrix
correlation_matrix <- cor(numerical_data, method = "pearson")
corrplot(correlation_matrix, method = "color", type = "upper", order = "hclust", tl.col = "black", tl.cex = 0.3, tl.srt = 90)
# Print correlation higher than 0.8
high_corr <- which(abs(correlation_matrix) > 0.8 & correlation_matrix != 1, arr.ind = TRUE)
high_corr_values <- correlation_matrix[high_corr]
high_corr_df <- data.frame(row = high_corr[, 1], col = high_corr[, 2], corr = high_corr_values)
high_corr_df
# Display the first six rows of the categorical data frame
head(categorical_data)
Generally, most of the houses have the following characteristics:
Low-density residential zoning
Paved streets
Regular lot shape
Near-flat land contour
All public utilities available (electric power, natural gas, water supply, and sewage removal)
Inside lot
Gentle slope of property
Single-family detached
Single storey
Gable roof
Standard (Composite) Shingle roof
Good-quality Vinyl Siding exterior
Typical basement condition (slight dampness allowed)
No walkout or garden-level basement walls
Gas forced warm air furnace
Excellent heating quality
Has central air-conditioning
Standard Circuit Breakers & Romex electrical system
Typical kitchen quality
Typical home functionality
Typical unfinished garage attached to the home
Paved driveway
Warranty Deed – Conventional Sale
Normal sale condition
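These characteristics correspond to the most frequent (modal) level of each categorical column; a minimal sketch to read them off directly:
# Most frequent level of each categorical column
sapply(categorical_data, function(x) names(which.max(table(x))))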
# Print bar plots for categorical data for the first 20 attributes
plots_first20attribute <- lapply(colnames(categorical_data)[1:20], function(col) {
ggplot(categorical_data, aes(x = !!sym(col))) +
geom_bar(fill = "steelblue") +
xlab(col) +
ylab("Count") +
ggtitle(paste0("Distribution of", col)) +
scale_x_discrete(guide = guide_axis(check.overlap = TRUE))
})
grid.arrange(grobs = plots_first20attribute, ncol = 5)
# Print bar plots for categorical data for the remaining 17 attributes
plots_restattribute <- lapply(colnames(categorical_data)[21:37], function(col) {
ggplot(categorical_data, aes(x = !!sym(col))) +
geom_bar(fill = "steelblue") +
xlab(col) +
ylab("Count") +
ggtitle(paste0("Distribution of", col)) +
scale_x_discrete(guide = guide_axis(check.overlap = TRUE))
})
grid.arrange(grobs = plots_restattribute, ncol = 5)
Data pre-processing steps include log transformation, scaling, and one-hot encoding.
The data is split into train, validation, and test sets, and the training and validation sets are combined for model training.
The skewness of the numerical features and the target variable is calculated and displayed before and after the log transformations.
# Select numerical columns using the "is.numeric" function
numerical_cols <- names(dataset)[sapply(dataset, is.numeric)]
categorical_cols <- names(dataset)[sapply(dataset, is.factor) | sapply(dataset, is.character)]
# Create new data frames with only numerical and categorical columns
numerical_data <- dataset[numerical_cols]
categorical_data <- dataset[categorical_cols]
# Split the numerical variables into features and the target variable.
X_num <- subset(numerical_data, select = -c(SalePrice))
y <- subset(numerical_data, select = c(SalePrice))$SalePrice
# Log Transformation for numerical features.
skewness_before <- sapply(X_num, function(x) {
e1071::skewness(x)
})
X_num_skewed <- skewness_before[abs(skewness_before) > 0.75]
for (x in names(X_num_skewed)) {
X_num[[x]] <- log1p(X_num[[x]])
}
skewness_after <- sapply(X_num, function(x) {
e1071::skewness(x)
})
data.frame(skewness_before, skewness_after)
# Log Transformation for the target variable.
skewness_before <- e1071::skewness(y)
y_t <- log1p(y)
skewness_after <- e1071::skewness(y_t)
sprintf("Before: %f, After: %f", skewness_before, skewness_after)
## [1] "Before: 1.879009, After: 0.121097"
# Scaling
X_num <- scale(X_num)
# One-Hot Encoding for categorical variables.
encoder <- dummyVars(~., data = categorical_data)
X_cat <- predict(encoder, newdata = categorical_data)
X_cat <- data.frame(X_cat)
# Split into train and validation and test sets.
X <- cbind(X_cat, X_num)
set.seed(12345)
train_idx <- createDataPartition(y_t, p = 0.7, list = F)
X_train <- X[train_idx, ]
y_train <- y_t[train_idx]
X_test <- X[-train_idx, ]
y_test <- y_t[-train_idx]
train_val_idx <- createDataPartition(y_train, p = 0.8, list = FALSE)
# Take the validation split before overwriting X_train / y_train
X_val <- X_train[-train_val_idx, ]
y_val <- y_train[-train_val_idx]
X_train <- X_train[train_val_idx, ]
y_train <- y_train[train_val_idx]
X_train_val <- rbind(X_train, X_val)
y_train_val <- c(y_train, y_val)
dim(X)
## [1] 1460 251
length(y)
## [1] 1460
dim(X_train)
## [1] 821 251
length(y_train)
## [1] 821
dim(X_val)
## [1] 163 251
length(y_val)
## [1] 163
#names(X)
The code executes PCA (Principal Component Analysis) to reduce the data’s dimensionality and extract the principal components.
The elbow method is used to determine the best number of clusters; the result is 3.
K-means clustering is performed on the principal components, assigning each observation to a specific cluster.
The cluster assignments are appended to the principal components data, and a scatter plot is generated to visualize the resulting clusters.
# Perform PCA on the data matrix
pca_result <- prcomp(X)
# Extract the principal components
principal_components <- as.data.frame(pca_result$x)
# Determine the optimal number of clusters using the elbow method
fviz_nbclust(principal_components, kmeans, method = "wss")
k <- 3
# Perform K-means clustering on the principal components
kmeans_result <- kmeans(principal_components, centers = k)
cluster_assignments <- kmeans_result$cluster
# print(cluster_assignments)
# Add the cluster assignments to the principal components data
principal_components$cluster <- as.factor(cluster_assignments)
ggplot(principal_components, aes(PC1, PC2, color = cluster)) +
geom_point() +
labs(x = "Principal Component 1", y = "Principal Component 2") +
scale_color_discrete(name = "Cluster") +
theme_minimal()
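Since only the first two components are plotted, it is worth checking how much variance they capture; a supplementary sketch (assuming factoextra, which is already loaded):
# Scree plot: proportion of variance explained by each principal component
fviz_eig(pca_result, addlabels = TRUE)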
This section:
Calculates the t-SNE coordinates
Creates a dendrogram through hierarchical clustering
Determines the number of clusters visually, assigns data points to clusters
Computes centroids for each cluster
Generates scatter plots to visualize the t-SNE coordinates colored by the original variable and cluster assignments
Includes a bar plot showing the cluster frequencies and evaluates clustering quality using silhouette analysis.
# Compute 2-D t-SNE coordinates for all observations
tsne <- Rtsne(X)
tsne_df <- data.frame(Y = tsne$Y) # columns Y.1 and Y.2
dist_mat <- dist(tsne$Y, method = "euclidean")
hclust_avg <- hclust(dist_mat, method = "average")
dend <- as.dendrogram(hclust_avg)
plot(dend)
k <- 15
cut_avg <- cutree(hclust_avg, k)
tsne_df$cluster <- cut_avg
# Centroid of a cluster = column means of its t-SNE coordinates
getCentroid <- function(points) colMeans(points)
centroids <- matrix(0, k, 2)
for (i in unique(cut_avg)) centroids[i, ] <- getCentroid(tsne$Y[cut_avg == i,])
In the first scatter plot below, the lighter the color, the higher the house price.
In the silhouette plot at the end, the width of each cluster’s band reflects its number of data points, and a larger silhouette width indicates stronger separation and better-defined clusters.
ggplot(tsne_df, aes(x=Y.1, y=Y.2, color=y)) +
geom_point() +
scale_color_gradientn(colours = heat.colors(10))
ggplot(data.frame(table(tsne_df$cluster)), aes(x=Var1, y=Freq)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(x = "Clusters", y = "Sizes")
ggplot(tsne_df, aes(x=Y.1, y=Y.2, color=cluster)) +
geom_point() +
scale_color_viridis() +
scale_fill_viridis(discrete = T) +
geom_point(data = data.frame(centroids), aes(x=X1, y=X2), color="black", fill="white", shape=21, size=8) +
geom_text(data = data.frame(centroids), aes(x=X1, y=X2, label=1:k), color="black")
fviz_silhouette(silhouette(cutree(hclust_avg, k = k), dist_mat))
## cluster size ave.sil.width
## 1 1 253 0.33
## 2 2 92 0.41
## 3 3 215 0.32
## 4 4 29 0.74
## 5 5 174 0.35
## 6 6 42 0.79
## 7 7 96 0.16
## 8 8 169 0.37
## 9 9 37 0.89
## 10 10 17 0.81
## 11 11 87 0.53
## 12 12 85 0.44
## 13 13 22 0.81
## 14 14 66 0.70
## 15 15 76 0.35
Empty lists are set up to store RMSE values for the baseline models on the validation and test datasets.
A helper function, metrics_fusion, converts predicted and actual values from the transformed (log) scale back to the original scale and calculates the performance metrics (MAE, MAPE, RMSE, MSE, R2).
baselines_rmse <- list()
baselines_rmse_test <- list()
actual_metrics_test <- list()
metrics_fusion <- function(y_pred, y) {
# Invert the log1p transformation back to the original price scale
y_pred_inv <- expm1(y_pred)
y_inv <- expm1(y)
# The Metrics package expects (actual, predicted)
a <- mae(y_inv, y_pred_inv)
b <- mape(y_inv, y_pred_inv)
c <- rmse(y_inv, y_pred_inv)
d <- mse(y_inv, y_pred_inv)
# caret::R2 expects (predicted, observed)
e <- R2(y_pred_inv, y_inv)
return(c("mae" = a, "mape" = b, "rmse" = c, "mse" = d, "r2" = e))
}
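Because expm1() is the exact inverse of log1p(), this puts all metrics on the original price scale; a quick round-trip check (illustrative):
# log1p/expm1 round-trip recovers the original price exactly
all.equal(expm1(log1p(208500)), 208500) # TRUE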
Evaluates the performance of a linear regression model for the regression problem.
Cross-validation is used to train the model on the combined training and validation data.
The cross-validated root mean squared error (RMSE) is taken as the measure of prediction accuracy on the validation data.
The trained linear regression model is then used to make predictions on the test dataset.
RMSE is calculated by comparing the predicted values with the actual test values.
Additional performance metrics (MAE, MAPE, RMSE, MSE, R2) are stored for the linear regression model on the test data.
linreg_tc <- trainControl(method = "cv", number = 5)
linreg_cv <- caret::train(
SalePrice ~ .,
data = cbind(X_train_val, SalePrice = y_train_val),
trControl = linreg_tc,
method = "lm"
)
# Validation predictions and metrics
score_val <- linreg_cv$results$RMSE
baselines_rmse$linear_regression <- score_val
# Test predictions and metrics
linreg <- lm(SalePrice ~ ., data = cbind(X_train_val, SalePrice = y_train_val))
y_pred_test <- predict(linreg, newdata = X_test)
score_test <- rmse(y_pred_test, y_test)
baselines_rmse_test$linear_regression <- score_test
actual_metrics_test$linear_regression <- metrics_fusion(y_pred_test, y_test)
# Display scores
score_val
## [1] 0.1279347
score_test
## [1] 0.1584937
Lasso Regression applied to the regression problem.
The model is trained using cross-validation, which selects the optimal regularization parameter.
The root mean squared error (RMSE) is calculated for the validation set.
For the test set, the target variable is predicted using the trained model and the RMSE is calculated.
Actual performance metrics (MAE, MAPE, RMSE, MSE, R2) are computed for the Lasso Regression model on the test set.
The validation RMSE and test RMSE are displayed.
lasso <- cv.glmnet(x = as.matrix(X_train_val), y = y_train_val, alpha = 1)
# Validation predictions and metrics
score_val <- mean(sqrt(lasso$cvm))
baselines_rmse$lasso <- score_val
# Test predictions and metrics
y_pred_test <- predict(lasso, newx = as.matrix(X_test))
score_test <- rmse(y_pred_test, y_test)
baselines_rmse_test$lasso <- score_test
actual_metrics_test$lasso <- metrics_fusion(y_pred_test, y_test)
# Display scores
score_val
## [1] 0.1534448
score_test
## [1] 0.1401135
Ridge Regression applied to the regression problem.
The model is trained using cross-validation, which selects the optimal regularization parameter.
The root mean squared error (RMSE) is calculated for the validation set.
For the test set, the target variable is predicted using the trained model and the RMSE is calculated.
Actual performance metrics (MAE, MAPE, RMSE, MSE, R2) are computed for the Ridge Regression model on the test set.
The validation RMSE and test RMSE are displayed.
ridge <- cv.glmnet(x = as.matrix(X_train_val), y = y_train_val, alpha = 0)
# Validation predictions and metrics
score_val <- mean(sqrt(ridge$cvm))
baselines_rmse$ridge <- score_val
# Test predictions and metrics
y_pred_test <- predict(ridge, newx = as.matrix(X_test))
score_test <- rmse(y_pred_test, y_test)
baselines_rmse_test$ridge <- score_test
actual_metrics_test$ridge <- metrics_fusion(y_pred_test, y_test)
# Display scores
score_val
## [1] 0.2334951
score_test
## [1] 0.1457187
Elastic Net Regression applied to the regression problem.
Cross-validation is performed to select the optimal alpha parameter, which controls the balance between L1 (Lasso) and L2 (Ridge) regularization.
The root mean squared error (RMSE) is calculated for the validation set.
For the test set, the target variable is predicted using the trained Elastic Net model with the selected alpha value, and the RMSE is calculated.
Actual performance metrics (MAE, MAPE, RMSE, MSE, R2) are computed for the Elastic Net model on the test set.
The selected alpha, validation RMSE, and test RMSE are displayed.
results <- data.frame()
for (i in 0:20) {
elasticnet <- cv.glmnet(x = as.matrix(X_train_val), y = y_train_val, alpha = i/20)
row <- data.frame(alpha = i/20, rmse_val = mean(sqrt(elasticnet$cvm)))
results <- rbind(results, row)
}
best_alpha <- results$alpha[which.min(results$rmse_val)]
# Validation predictions and metrics
score_val <- min(results$rmse_val)
baselines_rmse$elasticnet <- score_val
# Test predictions and metrics
elasticnet <- cv.glmnet(x = as.matrix(X_train_val), y = y_train_val, alpha = best_alpha)
y_pred_test <- predict(elasticnet, newx = as.matrix(X_test))
score_test <- rmse(y_pred_test, y_test)
baselines_rmse_test$elasticnet <- score_test
actual_metrics_test$elasticnet <- metrics_fusion(y_pred_test, y_test)
# Display scores
best_alpha
## [1] 0.9
score_val
## [1] 0.1523751
score_test
## [1] 0.1397882
K-Nearest Neighbors Regression applied to the data.
The RMSE is calculated for the validation and test sets, and the results are stored.
Additional performance metrics are computed and the RMSE values are displayed.
knn_tc <- trainControl(method = "cv", number = 5)
knn_cv <- caret::train(
SalePrice ~ .,
data = cbind(X_train_val, SalePrice = y_train_val),
trControl = knn_tc,
method = "knn"
)
# Validation predictions and metrics
score_val <- mean(knn_cv$results$RMSE)
baselines_rmse$knn <- score_val
# Test predictions and metrics
knn <- knnreg(x = X_train_val, y = y_train_val)
y_pred_test <- predict(knn, newdata = X_test)
score_test <- rmse(y_pred_test, y_test)
baselines_rmse_test$knn <- score_test
actual_metrics_test$knn <- metrics_fusion(y_pred_test, y_test)
# Display scores
score_val
## [1] 0.1788794
score_test
## [1] 0.1994154
Support Vector Regression (SVR) applied to the data.
The RMSE is calculated for the validation and test sets, and the results are stored.
Additional performance metrics are computed and the RMSE values are displayed.
svr_tc <- trainControl(method = "cv", number = 5)
svr_cv <- caret::train(
SalePrice ~ .,
data = cbind(X_train_val, SalePrice = y_train_val),
trControl = svr_tc,
method = "svmLinear2",
scale = F
)
# Validation predictions and metrics
score_val <- mean(svr_cv$results$RMSE)
baselines_rmse$svr <- score_val
# Test predictions and metrics
svr <- e1071::svm(SalePrice ~ ., data = cbind(X_train_val, SalePrice = y_train_val), scale = F)
y_pred_test <- predict(svr, newdata = X_test)
score_test <- rmse(y_pred_test, y_test)
baselines_rmse_test$svr <- score_test
actual_metrics_test$svr <- metrics_fusion(y_pred_test, y_test)
# Display scores
score_val
## [1] 0.1189599
score_test
## [1] 0.1310749
Decision Tree regression applied to the data.
The RMSE is calculated for the validation and test sets, and the results are stored.
Additional performance metrics are computed and the RMSE values are displayed.
dt_tc <- trainControl(method = "cv", number = 5)
dt_cv <- caret::train(
SalePrice ~ .,
data = cbind(X_train_val, SalePrice = y_train_val),
trControl = dt_tc,
method = "rpart"
)
# Validation predictions and metrics
score_val <- mean(dt_cv$results$RMSE, na.rm = TRUE)
baselines_rmse$decision_tree <- score_val
# Test predictions and metrics
dt <- caret::train(x = X_train_val, y = y_train_val, method = "rpart")
y_pred_test <- predict(dt, newdata = X_test)
score_test <- rmse(y_pred_test, y_test)
baselines_rmse_test$decision_tree <- score_test
actual_metrics_test$decision_tree <- metrics_fusion(y_pred_test, y_test)
# Display scores
score_val
## [1] 0.2944126
score_test
## [1] 0.282019
Empty lists are initialized to store the RMSE values and actual performance metrics for the ensemble learning models.
These lists are populated in the subsequent steps.
ensemble_rmse <- list()
ensemble_actual_metrics <- list()
ensemble_rmse_test <- list()
ensemble_actual_metrics_test <- list()
The random forest algorithm is implemented for the regression task using ensemble learning.
The random forest model is trained on the combined training and validation data.
Its performance is evaluated using root mean squared error (RMSE) on the validation and test datasets.
Feature importance is calculated from the random forest model, and a bar plot visualizes the top 30 important features.
rf_tc <- trainControl(method = "cv", number = 5)
rf_cv <- caret::train(
SalePrice ~ .,
data = cbind(X_train_val, SalePrice = y_train_val),
trControl = rf_tc,
method = "rf"
)
# Validation predictions and metrics
score_val <- mean(rf_cv$results$RMSE)
ensemble_rmse$random_forest <- score_val
# Test predictions and metrics
rf <- randomForest(x = X_train_val, y = y_train_val, proximity = T)
y_pred_test_rf <- predict(rf, newdata = X_test)
score_test <- rmse(y_pred_test_rf, y_test)
ensemble_rmse_test$random_forest <- score_test
ensemble_actual_metrics_test$random_forest <- metrics_fusion(y_pred_test_rf, y_test)
y_pred_train_rf <- predict(rf, newdata = X_train_val)
# Display scores
score_val
## [1] 0.1643092
score_test
## [1] 0.152642
rf_df <- data.frame(rf$importance) %>%
mutate(Feature = rownames(rf$importance)) %>%
arrange(desc(IncNodePurity)) %>%
head(30)
rf_df
ggplot(data = rf_df, aes(x = reorder(Feature, IncNodePurity), y = IncNodePurity)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(x = "Features", y = "Importance")
Gradient-Boosted Trees (GBT) are implemented for the regression task using the XGBoost, LightGBM, and CatBoost libraries.
The models are trained on the combined training and validation data, and their performance is evaluated using RMSE on the validation and test datasets.
Cross-validation provides the validation scores (a Bayesian-optimization routine for hyperparameter tuning is included but commented out), and the models are then used to make predictions on the test data.
The RMSE scores and other evaluation metrics are recorded for each model.
The feature importance of each GBT model is visualized using a bar chart.
dtrain_val <- xgb.DMatrix(data = as.matrix(X_train_val), label = y_train_val)
dtest <- xgb.DMatrix(data = as.matrix(X_test), label = y_test)
xgb_params = list(
eta = 0.01,
gamma = 0.0468,
max_depth = 6,
min_child_weight = 1.41,
subsample = 0.769,
colsample_bytree = 0.283
)
xgb_cv <- xgb.cv(
params = xgb_params,
data = dtrain_val,
nround = 10000,
nfold = 5,
prediction = F,
showsd = T,
metrics = "rmse",
verbose = 1,
print_every_n = 500,
early_stopping_rounds = 25
)
## [1] train-rmse:11.414310+0.004317 test-rmse:11.414306+0.017596
## Multiple eval metrics are present. Will use test_rmse for early stopping.
## Will train until test_rmse hasn't improved in 25 rounds.
##
## [501] train-rmse:0.123461+0.000648 test-rmse:0.152562+0.008779
## [1001] train-rmse:0.068082+0.000494 test-rmse:0.114561+0.008507
## [1501] train-rmse:0.064569+0.000540 test-rmse:0.113083+0.008597
## [2001] train-rmse:0.062831+0.000580 test-rmse:0.112503+0.008773
## [2501] train-rmse:0.061697+0.000634 test-rmse:0.112112+0.008748
## Stopping. Best iteration:
## [2768] train-rmse:0.061252+0.000597 test-rmse:0.111974+0.008758
# Validation predictions and metrics
score_val <- xgb_cv$evaluation_log$test_rmse_mean %>% min
ensemble_rmse$xgboost <- score_val
# Test predictions and metrics
xgb <- xgboost(
params = xgb_params,
data = dtrain_val,
nround = 10000,
eval_metric = "rmse",
verbose = 1,
print_every_n = 500,
early_stopping_rounds = 25
)
## [1] train-rmse:11.414108
## Will train until train_rmse hasn't improved in 25 rounds.
##
## [501] train-rmse:0.122362
## [1001] train-rmse:0.067779
## [1501] train-rmse:0.064224
## [2001] train-rmse:0.062695
## Stopping. Best iteration:
## [2056] train-rmse:0.062552
y_pred_test_xgb <- predict(xgb, newdata = dtest)
score_test <- rmse(y_pred_test_xgb, y_test)
ensemble_rmse_test$xgboost <- score_test
ensemble_actual_metrics_test$xgboost <- metrics_fusion(y_pred_test_xgb, y_test)
y_pred_train_xgb <- predict(xgb, newdata = dtrain_val)
# Display scores
score_val
## [1] 0.1119744
score_test
## [1] 0.1359029
xgb_df <- xgb.importance(model = xgb) %>% head(30)
xgb_df
ggplot(data = xgb_df, aes(x = reorder(Feature, Gain), y = Gain)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(x = "Features", y = "Importance")
objective_fn <- makeSingleObjectiveFunction(
fn = function(x) {
params = list(
booster = "gbtree",
eta = x["eta"],
gamma = x["gamma"],
max_depth = x["max_depth"],
min_child_weight = x["min_child_weight"],
subsample = x["subsample"],
colsample_bytree = x["colsample_bytree"],
max_delta_step = x["max_delta_step"]
)
cv <- xgb.cv(
params = params,
data = dtrain_val,
nround = 10000,
nfold = 5,
prediction = F,
showsd = T,
metrics = "rmse",
verbose = 1,
print_every_n = 500,
early_stopping_rounds = 25
)
cv$evaluation_log$test_rmse_mean %>% min
},
par.set = makeParamSet(
makeNumericParam("eta", lower = 0.005, upper = 0.01),
makeNumericParam("gamma", lower = 0.01, upper = 5),
makeIntegerParam("max_depth", lower = 2, upper = 10),
makeIntegerParam("min_child_weight", lower = 1, upper = 2000),
makeNumericParam("subsample", lower = 0.20, upper = 0.8),
makeNumericParam("colsample_bytree", lower = 0.20, upper = 0.8),
makeNumericParam("max_delta_step", lower = 0, upper = 5)
),
minimize = TRUE
)
# Define experiments and an objective function
design <- generateDesign(n = 1000, par.set = getParamSet(objective_fn), fun = lhs::randomLHS)
control <- makeMBOControl() %>% setMBOControlTermination(., iters = 10)
# run <- mbo(
# fun = objective_fn,
# design = design,
# learner = makeLearner("regr.km", predict.type = "se", covtype = "matern3_2", control = list(trace = FALSE)),
# control = control,
# show.info = TRUE
# )
# Best parameters
# run$x
lgb_train_val <- lgb.Dataset(data = as.matrix(X_train_val), label = y_train_val)
lgb_test <- lgb.Dataset(data = as.matrix(X_test), label = y_test)
params <- list(
objective = "regression",
metric = "rmse",
boosting_type = "gbdt",
num_boost_round = 10000,
num_leaves = 15,
learning_rate = 0.1,
feature_fraction = 0.9,
bagging_fraction = 0.8,
bagging_freq = 5,
force_row_wise = T
)
lgb_cv <- lgb.cv(
params = params,
data = lgb_train_val,
early_stopping_rounds = 25,
verbose = 0
)
## [LightGBM] [Info] Start training from score 12.018962
## [LightGBM] [Info] Start training from score 12.025414
## [LightGBM] [Info] Start training from score 12.022889
# Validation predictions and metrics
score_val <- min(unlist(lgb_cv$record_evals$valid$rmse$eval))
ensemble_rmse$lightgbm <- score_val
# Test predictions and metrics
lgb <- lgb.train(
params = params,
data = lgb_train_val,
verbose = 0
)
y_pred_test_lgb <- predict(lgb, data = as.matrix(X_test))
score_test <- rmse(y_pred_test_lgb, y_test)
ensemble_rmse_test$lightgbm <- score_test
ensemble_actual_metrics_test$lightgbm <- metrics_fusion(y_pred_test_lgb, y_test)
y_pred_train_lgb <- predict(lgb, data = as.matrix(X_train_val))
# Display scores
score_val
## [1] 0.1246116
score_test
## [1] 0.1435826
lgb_df <- lgb.importance(model = lgb) %>% head(30)
lgb_df
ggplot(data = lgb_df, aes(x = reorder(Feature, Gain), y = Gain)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(x = "Features", y = "Importance")
train_val_pool <- catboost.load_pool(data = X_train_val, label = y_train_val)
test_pool <- catboost.load_pool(data = X_test, label = y_test)
params <- list(
loss_function = "RMSE",
iterations = 10000,
learning_rate = 0.01,
metric_period = 1000
)
catb_cv <- catboost.cv(
train_val_pool,
params = params,
fold_count = 5,
early_stopping_rounds = 25
)
## Warning: Overfitting detector is active, thus evaluation metric is calculated on every iteration. 'metric_period' is ignored for evaluation metric.
## 0: learn: 11.9129312 test: 11.9132621 best: 11.9132621 (0) total: 208ms remaining: 34m 36s
## 1000: learn: 0.1222245 test: 0.2262026 best: 0.2262026 (1000) total: 1m 19s remaining: 11m 57s
## 2000: learn: 0.0685048 test: 0.2006144 best: 0.2006144 (2000) total: 2m 37s remaining: 10m 29s
## 3000: learn: 0.0454878 test: 0.1946286 best: 0.1946286 (3000) total: 3m 52s remaining: 9m 1s
## 4000: learn: 0.0318095 test: 0.1920010 best: 0.1920010 (4000) total: 5m 12s remaining: 7m 48s
## 5000: learn: 0.0230517 test: 0.1906835 best: 0.1906835 (5000) total: 6m 28s remaining: 6m 28s
## 6000: learn: 0.0171276 test: 0.1899629 best: 0.1899625 (5999) total: 8m 26s remaining: 5m 37s
## 7000: learn: 0.0129467 test: 0.1895087 best: 0.1895086 (6999) total: 9m 43s remaining: 4m 10s
## 8000: learn: 0.0098459 test: 0.1892415 best: 0.1892415 (8000) total: 10m 32s remaining: 2m 37s
## 9000: learn: 0.0075465 test: 0.1890371 best: 0.1890370 (8998) total: 11m 26s remaining: 1m 16s
## Stopped by overfitting detector (25 iterations wait)
# Validation predictions and metrics
score_val <- min(catb_cv$test.RMSE.mean)
ensemble_rmse$catboost <- score_val
# Test predictions and metrics
catb <- catboost.train(
params = params,
learn_pool = train_val_pool
)
## 0: learn: 0.3949874 total: 11.9ms remaining: 1m 58s
## 1000: learn: 0.0760428 total: 7.03s remaining: 1m 3s
## 2000: learn: 0.0505683 total: 14.5s remaining: 58.1s
## 3000: learn: 0.0348161 total: 21.6s remaining: 50.4s
## 4000: learn: 0.0252573 total: 28.6s remaining: 42.9s
## 5000: learn: 0.0190112 total: 36.6s remaining: 36.6s
## 6000: learn: 0.0146441 total: 44.4s remaining: 29.6s
## 7000: learn: 0.0115120 total: 52s remaining: 22.3s
## 8000: learn: 0.0091391 total: 59.8s remaining: 14.9s
## 9000: learn: 0.0074168 total: 1m 6s remaining: 7.42s
## 9999: learn: 0.0060960 total: 1m 13s remaining: 0us
y_pred_test_catboost <- catboost.predict(catb, test_pool)
score_test <- rmse(y_pred_test_catboost, y_test)
ensemble_rmse_test$catboost <- score_test
ensemble_actual_metrics_test$catboost <- metrics_fusion(y_pred_test_catboost, y_test)
y_pred_train_catboost <- catboost.predict(catb, train_val_pool)
# Display scores
score_val
## [1] 0.1890371
score_test
## [1] 0.1339784
catb_df <- data.frame(catboost.get_feature_importance(catb))
catb_df <- catb_df %>%
mutate(Feature = rownames(catb_df)) %>%
rename(Importance = catboost.get_feature_importance.catb.) %>%
arrange(desc(Importance)) %>%
head(30)
catb_df
ggplot(data = catb_df, aes(x = reorder(Feature, Importance), y = Importance)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(x = "Features", y = "Importance")
Stacking is performed by combining the predictions from different models (Random Forest, XGBoost, LightGBM, CatBoost) into a new dataset.
Linear regression model is trained on the stacked dataset to learn the relationship between ensemble predictions and the actual target values.
Model’s predictions on the test dataset are evaluated using root mean squared error (RMSE).
Metrics fusion function is applied to assess the performance of the stacking ensemble.
stacked_data <- data.frame(y = y_train_val, prediction_rf = y_pred_train_rf, prediction_xgb = y_pred_train_xgb, prediction_lgb = y_pred_train_lgb, prediction_catboost = y_pred_train_catboost)
stacked_data_test <- data.frame(y = y_test, prediction_rf = y_pred_test_rf, prediction_xgb = y_pred_test_xgb, prediction_lgb = y_pred_test_lgb, prediction_catboost = y_pred_test_catboost)
model_meta <- caret::train(y ~ ., data = stacked_data, method = "lm")
predictions_meta <- predict(model_meta, newdata = stacked_data_test)
ensemble_rmse_test$stacking <- rmse(predictions_meta, y_test)
ensemble_actual_metrics_test$stacking <- metrics_fusion(predictions_meta, y_test)
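For consistency with the earlier sections, the stacking ensemble’s test RMSE can be displayed as well (a small addition mirroring the pattern above):
# Display the stacking ensemble's test RMSE
ensemble_rmse_test$stacking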
Compares the performance of different models and ensembles using the root mean squared error (RMSE) metric.
Bar charts created to display the RMSE values, with the x-axis representing the models and the y-axis representing the RMSE.
The first plot focuses on the evaluation using the validation set.
The second plot focuses on the evaluation using the test set.
The purpose is to compare the models’ performance and determine which ones have the lowest RMSE, indicating better predictive accuracy.
# Plot validation loss
rmse_all <- c(unlist(baselines_rmse), unlist(ensemble_rmse))
rmse_all_names <- c(names(baselines_rmse), names(ensemble_rmse))
df <- data.frame(models = rmse_all_names, rmse = rmse_all)
df %>% arrange(rmse)
ggplot(df, aes(x = factor(models, level=rmse_all_names), y = rmse)) +
geom_bar(stat = "identity", fill = "steelblue") +
xlab("Models") +
ylab("RMSE") +
ylim(0, 0.3) +
ggtitle("Validation Loss") +
scale_x_discrete(guide = guide_axis(angle = 30)) +
theme_minimal()
# Plot test loss
rmse_all_test <- c(unlist(baselines_rmse_test), unlist(ensemble_rmse_test))
rmse_all_test_names <- c(names(baselines_rmse_test), names(ensemble_rmse_test))
df <- data.frame(models = rmse_all_test_names, rmse = rmse_all_test)
df %>% arrange(rmse)
ggplot(df, aes(x = factor(models, level=rmse_all_test_names), y = rmse)) +
geom_bar(stat = "identity", fill = "steelblue") +
xlab("Models") +
ylab("RMSE") +
ylim(0, 0.3) +
ggtitle("Test Loss") +
scale_x_discrete(guide = guide_axis(angle = 30)) +
theme_minimal()
data.frame(t(data.frame(actual_metrics_test))) %>% arrange(desc(r2))
data.frame(t(data.frame(ensemble_actual_metrics_test))) %>% arrange(desc(r2))
This code performs association rule mining on housing characteristics using the Apriori algorithm.
It converts the categorical data into transactions and applies the Apriori algorithm to find frequent itemsets.
Association rules are generated using the given support and confidence thresholds.
The code includes a dictionary that maps column names and values to their corresponding descriptions for better rule interpretation.
The top twenty association rules (by lift) are displayed with their explanations, including the antecedent (IF) and consequent (THEN) parts along with their descriptions and confidence percentages.
transactions <- transactions(categorical_data)
summary(transactions)
## transactions as itemMatrix in sparse format with
## 1460 rows (elements/itemsets/transactions) and
## 218 columns (items) and a density of 0.1697248
##
## most frequent items:
## Utilities=AllPub Street=Pave Condition2=Norm RoofMatl=CompShg
## 1459 1454 1445 1434
## Heating=GasA (Other)
## 1428 46800
##
## element (itemset/transaction) length distribution:
## sizes
## 37
## 1460
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 37 37 37 37 37 37
##
## includes extended item information - examples:
## labels variables levels
## 1 MSZoning=C (all) MSZoning C (all)
## 2 MSZoning=FV MSZoning FV
## 3 MSZoning=RH MSZoning RH
##
## includes extended transaction information - examples:
## transactionID
## 1 1
## 2 2
## 3 3
inspect(head(transactions, n = 1))
## items transactionID
## [1] {MSZoning=RL,
## Street=Pave,
## LotShape=Reg,
## LandContour=Lvl,
## Utilities=AllPub,
## LotConfig=Inside,
## LandSlope=Gtl,
## Neighborhood=CollgCr,
## Condition1=Norm,
## Condition2=Norm,
## BldgType=1Fam,
## HouseStyle=2Story,
## RoofStyle=Gable,
## RoofMatl=CompShg,
## Exterior1st=VinylSd,
## MasVnrType=BrkFace,
## ExterQual=Gd,
## ExterCond=TA,
## Foundation=PConc,
## BsmtQual=Gd,
## BsmtCond=TA,
## BsmtExposure=No,
## BsmtFinType1=GLQ,
## BsmtFinType2=Unf,
## Heating=GasA,
## HeatingQC=Ex,
## CentralAir=Y,
## Electrical=SBrkr,
## KitchenQual=Gd,
## Functional=Typ,
## GarageType=Attchd,
## GarageFinish=RFn,
## GarageQual=TA,
## GarageCond=TA,
## PavedDrive=Y,
## SaleType=WD,
## SaleCondition=Normal} 1
rules <- apriori(transactions, parameter = list(support = 0.95, confidence = 0.95))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.95 0.1 1 none FALSE TRUE 5 0.95 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1387
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[218 item(s), 1460 transaction(s)] done [0.01s].
## sorting and recoding items ... [7 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [94 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(rules)
## set of 94 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3 4
## 7 28 39 20
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 2.766 3.000 4.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.9500 Min. :0.9534 Min. :0.9507 Min. :0.9996
## 1st Qu.:0.9562 1st Qu.:0.9793 1st Qu.:0.9678 1st Qu.:0.9998
## Median :0.9637 Median :0.9895 Median :0.9781 Median :1.0000
## Mean :0.9664 Mean :0.9871 Mean :0.9791 Mean :1.0000
## 3rd Qu.:0.9740 3rd Qu.:0.9958 3rd Qu.:0.9897 3rd Qu.:1.0000
## Max. :0.9993 Max. :0.9993 Max. :1.0000 Max. :1.0010
## count
## Min. :1387
## 1st Qu.:1396
## Median :1407
## Mean :1411
## 3rd Qu.:1422
## Max. :1459
##
## mining info:
## data ntransactions support confidence
## transactions 1460 0.95 0.95
## call
## apriori(data = transactions, parameter = list(support = 0.95, confidence = 0.95))
# Parse data_description.txt to build lookup tables that map column
# names and coded values to their human-readable descriptions
con <- file("data_description.txt", open = "r")
column_dictionary <- list()
value_dictionary <- list()
repeat {
line <- readLines(con, n = 1)
if (length(line) == 0) {
break # end of file
}
first_character <- substr(line, 1, 1)
if (first_character == "") {
next # skip blank lines
}
if (first_character != " ") {
# Header line: "ColumnName: description"
column_name <- sub(":.*", "", line)
column_description <- trimws(sub(".*:", "", line))
column_dictionary[[column_name]] <- column_description
value_dictionary[[column_name]] <- list()
} else {
# Indented line: "code<TAB>description" for the current column
pairs <- unlist(strsplit(line, "\t"))
key <- trimws(pairs[1])
value <- trimws(pairs[2])
value_dictionary[[column_name]][[key]] <- value
}
}
close(con)
rules_top_df <- data.frame(
lhs = labels(lhs(rules)),
rhs = labels(rhs(rules)),
rules@quality
) %>% arrange(desc(lift)) %>% head(n = 20)
for (i in 1:nrow(rules_top_df)) {
row <- rules_top_df[i, ]
explanation <- ""
# Translate each antecedent (lhs) item into a readable clause
lhs <- unlist(strsplit(gsub('^.|.$', '', row["lhs"]), ","))
for (j in 1:length(lhs)) {
pair <- unlist(strsplit(lhs[j], "="))
key <- pair[1]
value <- pair[2]
key_t <- column_dictionary[[key]]
value_t <- value_dictionary[[key]][[value]]
if (j == 1) {
explanation <- paste("IF", key_t, "=", value_t)
} else {
explanation <- paste(explanation, "AND", key_t, "=", value_t)
}
}
# Translate the consequent (rhs) item and append the confidence
rhs <- unlist(strsplit(gsub('^.|.$', '', row["rhs"]), "="))
key <- rhs[1]
value <- rhs[2]
key_t <- column_dictionary[[key]]
value_t <- value_dictionary[[key]][[value]]
confidence_pct <- format(round(row["confidence"] * 100, 2), 2)
explanation <- paste(explanation, "THEN", key_t, "=", value_t, "(Confidence:", paste0(confidence_pct, "%)"))
print(explanation)
cat("\n")
}
## [1] "IF Proximity to various conditions (if more than one is present) = Normal THEN Garage condition = Typical/Average (Confidence: 96.47%)"
##
## [1] "IF Garage condition = Typical/Average THEN Proximity to various conditions (if more than one is present) = Normal (Confidence: 99.08%)"
##
## [1] "IF Type of utilities available = All public Utilities (E,G,W,& S) AND Garage condition = Typical/Average THEN Proximity to various conditions (if more than one is present) = Normal (Confidence: 99.08%)"
##
## [1] "IF Type of utilities available = All public Utilities (E,G,W,& S) AND Proximity to various conditions (if more than one is present) = Normal THEN Garage condition = Typical/Average (Confidence: 96.47%)"
##
## [1] "IF Type of road access to property = Paved AND Garage condition = Typical/Average THEN Proximity to various conditions (if more than one is present) = Normal (Confidence: 99.07%)"
##
## [1] "IF Type of road access to property = Paved AND Type of utilities available = All public Utilities (E,G,W,& S) AND Garage condition = Typical/Average THEN Proximity to various conditions (if more than one is present) = Normal (Confidence: 99.07%)"
##
## [1] "IF Type of road access to property = Paved AND Proximity to various conditions (if more than one is present) = Normal THEN Garage condition = Typical/Average (Confidence: 96.46%)"
##
## [1] "IF Type of road access to property = Paved AND Type of utilities available = All public Utilities (E,G,W,& S) AND Proximity to various conditions (if more than one is present) = Normal THEN Garage condition = Typical/Average (Confidence: 96.45%)"
##
## [1] "IF Type of heating = Gas forced warm air furnace THEN Roof material = Standard (Composite) Shingle (Confidence: 98.25%)"
##
## [1] "IF Roof material = Standard (Composite) Shingle THEN Type of heating = Gas forced warm air furnace (Confidence: 97.84%)"
##
## [1] "IF Type of utilities available = All public Utilities (E,G,W,& S) AND Type of heating = Gas forced warm air furnace THEN Roof material = Standard (Composite) Shingle (Confidence: 98.25%)"
##
## [1] "IF Type of utilities available = All public Utilities (E,G,W,& S) AND Roof material = Standard (Composite) Shingle THEN Type of heating = Gas forced warm air furnace (Confidence: 97.84%)"
##
## [1] "IF Type of road access to property = Paved AND Type of heating = Gas forced warm air furnace THEN Roof material = Standard (Composite) Shingle (Confidence: 98.24%)"
##
## [1] "IF Type of road access to property = Paved AND Type of utilities available = All public Utilities (E,G,W,& S) AND Type of heating = Gas forced warm air furnace THEN Roof material = Standard (Composite) Shingle (Confidence: 98.24%)"
##
## [1] "IF Type of road access to property = Paved AND Roof material = Standard (Composite) Shingle THEN Type of heating = Gas forced warm air furnace (Confidence: 97.83%)"
##
## [1] "IF Type of road access to property = Paved AND Type of utilities available = All public Utilities (E,G,W,& S) AND Roof material = Standard (Composite) Shingle THEN Type of heating = Gas forced warm air furnace (Confidence: 97.83%)"
##
## [1] "IF Proximity to various conditions (if more than one is present) = Normal AND Type of heating = Gas forced warm air furnace THEN Roof material = Standard (Composite) Shingle (Confidence: 98.23%)"
##
## [1] "IF Type of utilities available = All public Utilities (E,G,W,& S) AND Proximity to various conditions (if more than one is present) = Normal AND Type of heating = Gas forced warm air furnace THEN Roof material = Standard (Composite) Shingle (Confidence: 98.23%)"
##
## [1] "IF Proximity to various conditions (if more than one is present) = Normal AND Roof material = Standard (Composite) Shingle THEN Type of heating = Gas forced warm air furnace (Confidence: 97.82%)"
##
## [1] "IF Type of utilities available = All public Utilities (E,G,W,& S) AND Proximity to various conditions (if more than one is present) = Normal AND Roof material = Standard (Composite) Shingle THEN Type of heating = Gas forced warm air furnace (Confidence: 97.81%)"
Several regression models were trained and evaluated on a dataset.
The linear regression model, trained using cross-validation, achieved good results with an RMSE score of approximately 0.1279 on the validation dataset and 0.1585 on the test dataset.
Lasso regression, which incorporates feature selection, had an average RMSE score of around 0.1534 on validation and 0.1401 on the test dataset.
Ridge regression, focusing on reducing multicollinearity, had an average RMSE score of approximately 0.2335 on validation and 0.1457 on the test dataset.
Elastic Net regression, combining Lasso and Ridge regularization, performed well with an RMSE score of around 0.1524 on validation and 0.1398 on the test dataset.
K-Nearest Neighbors regression achieved moderate performance with an average RMSE score of approximately 0.1789 on validation and 0.1994 on the test dataset.
The SVR model, based on support vector regression, consistently performed well with an average RMSE score of 0.1190 on validation and 0.1311 on the test dataset.
The decision tree model had an average RMSE score of 0.2944 on validation and 0.2820 on the test dataset.
In comparison, the random forest model, an ensemble of decision trees, outperformed the decision tree model with an average RMSE score of 0.1643 on validation and 0.1526 on the test dataset.
The random forest model highlighted influential features such as “OverallQual,” “GrLivArea,” and “YearBuilt” in predicting sale prices.
Among the gradient-boosted tree models, XGBoost and LightGBM had similar performance.
XGBoost achieved a validation RMSE of 0.1120 and a test RMSE of 0.1359, while LightGBM had a validation RMSE of 0.1246 and a test RMSE of 0.1436.
CatBoost had a notably higher validation RMSE of 0.1890 but a competitive test RMSE of 0.1340.
Key factors influencing house prices included OverallQual, GrLivArea, and GarageArea.
Overall, the linear regression, Lasso regression, Elastic Net regression, SVR, random forest, XGBoost, and LightGBM models showed promising performance in predicting sale prices.
These models can provide valuable insights for decision-making in the real estate market.
In this analysis, the performance of baseline models and ensemble models is compared for the given dataset.
The evaluation metric used was Root Mean Squared Error (RMSE), which measures the average prediction error.
RMSE provides a measure of the average prediction error in the same units as the target variable.
It is useful for comparing different models or evaluating the performance of a single model.
Lower RMSE values indicate better predictive accuracy, as they imply smaller average errors between predicted and actual values; higher RMSE values indicate worse predictive accuracy.
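For reference, RMSE is the square root of the mean squared difference between actual and predicted values; a minimal sketch with toy numbers:
# RMSE computed by hand, matching Metrics::rmse()
actual    <- c(10, 20, 30)
predicted <- c(12, 18, 33)
sqrt(mean((actual - predicted)^2)) # 2.3805
rmse(actual, predicted)            # same value via the Metrics package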
Seven baseline models were used: Linear Regression, Lasso, Ridge, Elastic Net, KNN, SVR, and Decision Tree.
Among the baseline models, SVR had the lowest Root Mean Squared Error (RMSE), followed closely by Linear Regression.
KNN, Elastic Net, Lasso and Ridge performed relatively well.
However, Decision Tree had the highest RMSE, indicating poorer predictive performance.
The ensemble models, including Random Forest, XGBoost, LightGBM, and CatBoost, showed improved performance compared to the individual baseline models.
On the validation set, XGBoost achieved the lowest RMSE, closely followed by LightGBM.
CatBoost had the highest validation RMSE, followed by Random Forest, indicating poorer predictive performance there.
Based on the provided plot and data for the baseline models’ RMSE on the test set, Decision Tree had the highest RMSE, while SVR had the lowest.
Decision Tree showed the worst predictive performance, followed by KNN and then Linear Regression.
SVR showed the best predictive performance, closely followed by Elastic Net, Lasso Regression, and Ridge Regression.
Based on the provided plot and data for the ensemble models’ RMSE on the test set, CatBoost and XGBoost achieved the lowest RMSE values.
They were closely followed by LightGBM.
These three models exhibited the best predictive performance among the ensemble models.
Random Forest had the highest RMSE among the ensembles but still performed relatively well.
The stacking model had a slightly higher RMSE compared to the top-performing ensemble models but still demonstrated a good predictive performance.
In baseline models, SVR, Elastic Net, Linear Regression and Lasso showed the best predictive performance.
XGBoost, CatBoost, and LightGBM showed the best predictive performance among the ensemble models based on the RMSE metric.
These models have the potential to provide more accurate predictions for the target variable on the test set.