library(ggplot2) # plots and visualizations
library(readr) #read data
library(dplyr) #data manipulation
library(ggpubr) # For ggarrange function
library(tidyverse) #Comprehensive package for data manipulation and visualization.
library(reshape2) # transform the data) from wide format to long format
library(plotly) # interactive visualizations
library(ggalt) #geom_encircle function
library(knitr)
library(gridExtra)
library(viridis)
house_data2 <- read.csv("r1.csv",header = TRUE, sep = ",")
str(house_data2)
## 'data.frame': 26795 obs. of 31 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ bathrooms : num 3.25 1.75 2.25 3 2 2 1.75 1 2 3.5 ...
## $ bedrooms : int 4 3 4 2 4 3 4 2 3 6 ...
## $ condition : int 3 5 3 3 3 4 3 3 3 3 ...
## $ date : int 1401235207 1424736007 1402617607 1399248007 1427241607 1406678407 1406160007 1408320007 1408492807 1405814407 ...
## $ feat01 : num 0.43 0.955 0.728 0.321 0.961 ...
## $ feat02 : num 0.603 0.515 0.433 0.518 0.529 ...
## $ feat03 : num 0.695 0.223 0.795 0.236 0.583 ...
## $ feat04 : num 0.3106 0.806 0.306 0.0489 0.9978 ...
## $ feat05 : num 0.506 0.374 0.43 0.39 0.5 ...
## $ feat06 : num 0.4357 0.0234 0.7412 0.833 0.3574 ...
## $ feat07 : num 0.394 0.556 0.124 0.495 0.519 ...
## $ feat08 : num 0.68859 0.00761 0.17641 0.45371 0.55569 ...
## $ feat09 : num 0.2762 0.8218 0.0896 0.4757 0.5348 ...
## $ feat10 : num 0.538 0.854 0.46 0.692 0.134 ...
## $ floors : num 2 1.5 1 2 2 2 1 1 1 2 ...
## $ grade : int 10 8 7 8 7 7 6 6 7 9 ...
## $ lat : num 47.6 47.4 47.7 47.7 47.7 ...
## $ long : num -122 -122 -122 -122 -122 ...
## $ price : int 1399000 431500 439900 530600 751200 263500 248600 273700 245400 769000 ...
## $ sqft_above : int 2140 3440 1220 1110 2040 1380 2290 940 1390 2220 ...
## $ sqft_basement: int 840 0 790 160 600 0 0 0 0 190 ...
## $ sqft_living : int 2980 3440 2010 1270 2640 1380 2290 940 1390 2410 ...
## $ sqft_living15: int 2200 2160 1820 1290 1330 1260 1240 940 1390 1980 ...
## $ sqft_lot : int 7000 10428 7575 1175 5000 8536 7765 5000 8250 6000 ...
## $ sqft_lot15 : int 4800 8400 7831 1800 5000 8750 8215 5000 7875 6000 ...
## $ view : int 3 0 0 0 0 0 0 0 0 4 ...
## $ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
## $ yr_built : int 1900 1974 1974 2000 1949 1955 1936 1951 1987 1916 ...
## $ yr_renovated : int 2014 0 0 0 0 0 1953 0 0 1990 ...
## $ zipcode : int 98144 98058 98034 98103 98117 98148 98146 98126 98042 98136 ...
The dataset consists of 26.795 observations (rows) and 31 variables (columns). The variables include information such as id, date, price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade, sqft_above, sqft_basement, yr_built, yr_renovated, zipcode, lat, long, sqft_living15, sqft_lot15 and 10 artificial variables from feat01 to feat10.
The id variable appears to be a unique identifier for each observation. All variables structured as numeric/integer except the date variable which is currently stored as a character type. It might be useful to convert it to a date type for time-related analyses.However we will not make time-varient analyze for that project for now.
It’s important to note that while all variables may be structured as numeric, certain variables, despite their numeric representation, hold categorical significance. These categorical variables are essentially numerically coded to represent different categories or levels within the dataset. This nuance is crucial to consider when interpreting and analyzing the data.
summary(house_data2)
## id bathrooms bedrooms condition
## Min. : 1 Min. :0.000 Min. : 0.000 Min. :1.000
## 1st Qu.: 6700 1st Qu.:1.500 1st Qu.: 3.000 1st Qu.:3.000
## Median :13398 Median :2.250 Median : 3.000 Median :3.000
## Mean :13398 Mean :2.107 Mean : 3.372 Mean :3.419
## 3rd Qu.:20097 3rd Qu.:2.500 3rd Qu.: 4.000 3rd Qu.:4.000
## Max. :26795 Max. :8.000 Max. :33.000 Max. :5.000
## date feat01 feat02 feat03
## Min. :1.399e+09 Min. :0.0000095 Min. :0.0000 Min. :0.0000309
## 1st Qu.:1.406e+09 1st Qu.:0.2500121 1st Qu.:0.3866 1st Qu.:0.2487362
## Median :1.414e+09 Median :0.4994889 Median :0.4641 Median :0.5011274
## Mean :1.415e+09 Mean :0.5009999 Mean :0.4648 Mean :0.4997116
## 3rd Qu.:1.424e+09 3rd Qu.:0.7535160 3rd Qu.:0.5428 3rd Qu.:0.7498798
## Max. :1.433e+09 Max. :0.9999751 Max. :1.0000 Max. :0.9999813
## feat04 feat05 feat06 feat07
## Min. :0.0000137 Min. :0.0000 Min. :0.0000349 Min. :0.000044
## 1st Qu.:0.2460411 1st Qu.:0.3745 1st Qu.:0.2507906 1st Qu.:0.248796
## Median :0.4993561 Median :0.4523 Median :0.5014055 Median :0.502817
## Mean :0.4976852 Mean :0.4543 Mean :0.5014324 Mean :0.500995
## 3rd Qu.:0.7462034 3rd Qu.:0.5316 3rd Qu.:0.7520400 3rd Qu.:0.752308
## Max. :0.9999877 Max. :1.0000 Max. :0.9999267 Max. :0.999858
## feat08 feat09 feat10 floors
## Min. :0.0000912 Min. :0.000023 Min. :0.0000122 Min. :1.000
## 1st Qu.:0.2521830 1st Qu.:0.248318 1st Qu.:0.2459837 1st Qu.:1.000
## Median :0.5037222 Median :0.500146 Median :0.5015440 Median :1.000
## Mean :0.5012478 Mean :0.499521 Mean :0.4993034 Mean :1.484
## 3rd Qu.:0.7519803 3rd Qu.:0.750082 3rd Qu.:0.7510612 3rd Qu.:2.000
## Max. :0.9999338 Max. :0.999920 Max. :0.9999926 Max. :3.500
## grade lat long price
## Min. : 1.00 Min. :47.16 Min. :-122.5 Min. : 76100
## 1st Qu.: 7.00 1st Qu.:47.47 1st Qu.:-122.3 1st Qu.: 322800
## Median : 7.00 Median :47.57 Median :-122.2 Median : 451246
## Mean : 7.65 Mean :47.56 Mean :-122.2 Mean : 540920
## 3rd Qu.: 8.00 3rd Qu.:47.68 3rd Qu.:-122.1 3rd Qu.: 647000
## Max. :13.00 Max. :47.78 Max. :-121.3 Max. :7701000
## sqft_above sqft_basement sqft_living sqft_living15
## Min. : 290 Min. : 0.0 Min. : 290 Min. : 399
## 1st Qu.:1200 1st Qu.: 0.0 1st Qu.: 1430 1st Qu.:1490
## Median :1560 Median : 0.0 Median : 1910 Median :1840
## Mean :1785 Mean : 294.6 Mean : 2079 Mean :1987
## 3rd Qu.:2200 3rd Qu.: 570.0 3rd Qu.: 2540 3rd Qu.:2360
## Max. :9410 Max. :4820.0 Max. :13540 Max. :6210
## sqft_lot sqft_lot15 view waterfront
## Min. : 520 Min. : 651 Min. :0.0000 Min. :0.000000
## 1st Qu.: 5100 1st Qu.: 5120 1st Qu.:0.0000 1st Qu.:0.000000
## Median : 7667 Median : 7653 Median :0.0000 Median :0.000000
## Mean : 15342 Mean : 12914 Mean :0.2358 Mean :0.007837
## 3rd Qu.: 10800 3rd Qu.: 10139 3rd Qu.:0.0000 3rd Qu.:0.000000
## Max. :1651359 Max. :871200 Max. :4.0000 Max. :1.000000
## yr_built yr_renovated zipcode
## Min. :1900 Min. : 0.00 Min. :98001
## 1st Qu.:1951 1st Qu.: 0.00 1st Qu.:98033
## Median :1974 Median : 0.00 Median :98065
## Mean :1970 Mean : 86.56 Mean :98078
## 3rd Qu.:1995 3rd Qu.: 0.00 3rd Qu.:98117
## Max. :2015 Max. :2015.00 Max. :98199
id - The id variable represents a
unique identifier for each home sold.
date - The date variable, contains
information about the date of the house sale and spans from 2014, to
2015.
price: - The price variable is the
dependent variable and shows a wide range, with the
minimum house price at $76,100 and the maximum at $7,701,000. - The
median house price is $451,246, and the mean is $540,920.
bedrooms and bathrooms: - The variables related to the number of bedrooms and bathrooms (0.5 accounts for a room with a toilet but no shower) exhibit varying ranges and distributions. - The number of bedrooms ranges from 0 to 33, with a mean of approximately 3.37. - The number of bathrooms ranges from 0 to 8, with a mean of approximately 2.11.
sqft_living and sqft_lot: - These variables
represent the size of houses. - sqft_living reflects the
Square footage of the apartments interior living area, ranging from 290
to 13,540 square feet, with a mean of 2080. - sqft_lot
represents the lot size, ranging from 520 to 1,651,359 square feet, with
a mean of 15,107.
floors -The floors variable is
represents the levels of the houses. The majority of houses have 1 or
1.48 floors. - Notably, there seems to be a common occurrence of houses
with 1.5 floors, while the mean is approximately 1.494. - This suggests
that many houses have a split-level design or additional space on an
upper level, contributing to the fractional floor values.
waterfront:
- The waterfront variable is a dummy variable mostly 0 ,
represents the property has no waterfront view and 1 for with
waterfront. Mean is equal to 0.008 which shows that most of the houses
not waterfront.
view and condition:
- view represents the overall view rating (0 to 4) with a
mean of 0.23 -condition represents the overall condition
rating (1 to 5) with a mean of 3.42 for condition.
grade: - grade represents the overall
grade given to the housing unit and ranges from 1 to 13 where 1-3 falls
short of building construction and design, 7 has an average level of
construction and design, and 11-13 have a high quality level of
construction and design. and mean is equal to 7.65 and median is 7.
sqft_above, and sqft_basement: -
sqft_above and sqft_basement show the square
footage above ground and is below ground level¶ (in the basement),
respectively.
yr_built, yr_renovated: - Houses were built between
1900 and 2015 (yr_built), with the majority built in the
mid to late 20th century. - yr_renovated indicates the last
renovation year, with a mean of 86.56 and many zero values, suggesting
no renovations.
Geographical Information (lat, long, zipcode): -
lat and long provide latitude and longitude
information of house locations, respectively. - zipcode
represents the zip code of the house location.
sqft_living15 and sqft_lot15: -
sqft_living15 and sqft_lot15 indicate the
living room and lot size in 2015, reflecting potential renovations or
changes. (? some sources mention it differently and main source couldnt
find !!!)
These summary statistics provide an overview of the distribution and characteristics of each numeric variable in the dataset, with a specific focus on understanding the relationships with the dependent variable, ‘price.’
Summary statistics does not show any missing values but we approved with is.na() function.
# Check for missing values in the entire dataset
missing_values <- any(is.na(house_data2))
missing_values
## [1] FALSE
In organizing our variables by type, we enhance the precision of our analysis and visualization methods. This thoughtful categorization enables us to apply tailored techniques to each variable type, ensuring more insightful and nuanced exploration of the dataset.
# All variables
all_vars <- c( "id", "bathrooms", "bedrooms", "condition", "date", "feat01", "feat02", "feat03", "feat04", "feat05", "feat06", "feat07", "feat08", "feat09", "feat10", "floors", "grade", "lat", "long", "price", "sqft_above", "sqft_basement", "sqft_living", "sqft_living15", "sqft_lot", "sqft_lot15", "view", "waterfront", "yr_built", "yr_renovated", "zipcode" )
# Continuous Numeric Variables
cont_vars <- c("price", "sqft_living","sqft_living15","sqft_lot",
"sqft_lot15","sqft_above", "sqft_basement")
#Artificial Continuous Numeric Variables
art_vars <- c( "feat01", "feat02", "feat03", "feat04", "feat05", "feat06", "feat07", "feat08", "feat09", "feat10" )
# Discrete Numeric Variables
disc_vars <- c("bedrooms", "floors", "bathrooms")
# Categorical Variables
cat_vars <- c("waterfront", "view", "condition", "grade")
# Date Variables
date_vars <- c("date", "yr_built", "yr_renovated")
# Geographical Variables
geo_vars <- c("lat", "long", "zipcode")
Utilize various R packages (e.g., ggplot2, plotly) for data exploration. Conduct correlation analysis, distribution analysis.
# to display numeric values without scientific notation and with more digits
options(scipen = 999, digits = 9)
# Set up a layout grid
par(mfrow = c(4, 2), mar = c(4, 4, 2, 1)) # Adjust margins for better appearance
# Create histograms for numeric variables
for (cont in cont_vars) {
# Determine appropriate bin width based on the range and number of observations
bin_width <- (max(house_data2[[cont]]) - min(house_data2[[cont]])) /
sqrt(length(house_data2[[cont]]))
# Create histogram with scaled x-axis
hist(house_data2[[cont]], main = paste("Distribution of", cont), xlab = cont,
col = "skyblue", breaks = seq(min(house_data2[[cont]]),
max(house_data2[[cont]]) + bin_width, bin_width))
# Add smoother distribution line
density_curve <- density(house_data2[[cont]], bw = "nrd0")
lines(density_curve$x, density_curve$y * bin_width * length(house_data2[[cont]]),
col = "red", lwd = 2)
# Add normal distribution line
mu <- mean(house_data2[[cont]])
sigma <- sd(house_data2[[cont]])
x <- seq(min(house_data2[[cont]]), max(house_data2[[cont]]), length = 100)
y <- dnorm(x, mean = mu, sd = sigma) * bin_width * length(house_data2[[cont]])
lines(x, y, col = "blue", lwd = 2)
# Identify potential outliers using a boxplot
boxplot(house_data2[[cont]], main = paste("Boxplot of", cont), col = "lightblue",
border = "black", horizontal = TRUE)
}
# Reset the plotting layout
par(mfrow = c(1, 1))
The visual inspection of the plots suggests that distribution of the variables are skewed right/ non-normal distributions with a considerable number of outliers. Given the context of the dataset, where very luxurious or unique properties may contribute to these extreme values, it is justifiable to observe such outliers.
Instead of removing or transforming these outliers, a more suitable strategy might involve employing robust statistical methods to handle outliers problem for further analysis. Robust methods are designed to be less sensitive to extreme values, allowing for a more reliable analysis that acknowledges the presence of these high-end properties without disproportionately impacting the results.
# Set up a layout grid
par(mfrow = c(3, 2), mar = c(4, 4, 2, 1)) # Adjust margins for better appearance
# Create histograms for numeric variables
for (art in art_vars) {
# Determine appropriate bin width based on the range and number of observations
bin_width <- (max(house_data2[[art]]) - min(house_data2[[art]])) /
sqrt(length(house_data2[[art]]))
# Create histogram with scaled x-axis
hist(house_data2[[art]], main = paste("Distribution of", art), xlab = art,
col = "skyblue", breaks = seq(min(house_data2[[art]]),
max(house_data2[[art]]) + bin_width, bin_width))
# Add smoother distribution line
density_curve <- density(house_data2[[art]], bw = "nrd0")
lines(density_curve$x, density_curve$y * bin_width * length(house_data2[[art]]),
col = "red", lwd = 2)
# Add normal distribution line
mu <- mean(house_data2[[art]])
sigma <- sd(house_data2[[art]])
x <- seq(min(house_data2[[art]]), max(house_data2[[art]]), length = 100)
y <- dnorm(x, mean = mu, sd = sigma) * bin_width * length(house_data2[[art]])
lines(x, y, col = "blue", lwd = 2)
# Identify potential outliers using a boxplot
boxplot(house_data2[[art]], main = paste("Boxplot of", art), col = "lightblue",
border = "black", horizontal = TRUE)
}
# Reset the plotting layout
par(mfrow = c(1, 1))
We found that “feat02” and “feat05” follow a typical bell-shaped curve, suggesting a normal distribution and making them promising for our model. On the contrary, other artificially created variables have a uniform distribution, lacking distinct patterns. Uniformly distributed variables are less informative for predictions due to their even distribution, so we opted not to include them in our model.
# Visualize the Distribution
for (disc in disc_vars) {
# Convert discrete numeric variables to factors
house_data2[[disc]] <- as.factor(house_data2[[disc]])
# Create bar plots for discrete numeric variables
bar_plot <- ggplot(house_data2, aes(x = !!sym(disc), fill = !!sym(disc))) +
geom_bar(position = "dodge", fill = "skyblue", alpha = 0.7, width = 0.7) +
labs(title = paste("Distribution of", disc), x = disc, y = "Count") +
theme_minimal() +
geom_text(stat = "count",
aes(label = scales::percent(round(after_stat(count)/sum(after_stat(count)), 5))),
position = position_dodge(0.7), vjust = -0.3, size= 3) + # Add percentage
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels
# Display the plot
print(bar_plot)
}
Bedrooms :
The distribution of bedrooms in the dataset reveals a clear preference for houses with 3 bedrooms, constituting nearly half of the entries (45.5%). 4-bedroom homes follow closely with 31.8%, and 2 and 5-bedroom configurations are also prevalent, making up 12.8% and 7.4%. However, 0-bedroom and 1-bedroom houses have notably lower percentages with approximately 0.1% and 0.9%, respectively. The distribution is positively skewed, with a peak around 3 bedrooms.
Bathrooms Configuration:
The dataset showcases a diverse distribution of bathrooms. Houses with 2.5 bathrooms are most common, representing 24.9%. Additionally, 1 bathroom and 1.75 bathrooms are prevalent at 17.8% and 14.1%, respectively. The distribution exhibits multiple peaks, suggesting a variety of bathroom count configurations in the dataset.
In the United States, bathrooms are generally categorized as master bathroom, containing a varied shower and a tub that is adjoining to a master bedroom, a “full bathroom” (or “full bath”), containing four plumbing fixtures: bathtub/shower, or (separate shower), toilet, and sink; “half (1/2) bath” (or “powder room”) containing just a toilet and sink; and “3/4 bath” containing toilet, sink, and shower, although the terms vary from market to market. In some U.S. markets, a toilet, sink, and shower are considered a “full bath”. (wikipedia)
Floor Counts:
When considering the number of floors, Houses with 1 floor are predominant, making up 49.4% of the dataset. 2 floors houses follow closely at 38.1%, with 1.5 floors representing 8.8%. The distribution is skewed towards fewer floors, with a sharp decline for houses with more than 2 floors.
Note on 0 Values:
In the context of houses requiring bedrooms and bathrooms, the presence of 0 values in these categories may indicate missing or incomplete data. It’s uncommon for a house to have zero bedrooms or bathrooms. Investigating and addressing the reasons behind these zero values is crucial for ensuring the quality and accuracy of the dataset, as well as the reliability of any analyses conducted.
# Set up a layout grid
grid_layout <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)
# Create more advanced bar plots for categorical variables
plots <- list()
for (cat in cat_vars) {
# Convert categorical variables to factors
house_data2[[cat]] <- as.factor(house_data2[[cat]])
# Create bar plots for categorical variables
bar_plot <- ggplot(house_data2, aes(x = factor(!!sym(cat)), fill =
factor(!!sym(cat)))) + geom_bar() +geom_text(stat = "count",
aes(label = scales::percent(round(after_stat(count)/sum(after_stat(count)), 3))),
vjust = 0.2, size= 2.5) + # Add percentage labels
labs(title = paste("Distribution of", cat), x = cat, y = "Count") +
theme_minimal() +
theme(legend.position = "none")
# Add the plot to the list
plots[[cat]] <- bar_plot
}
# Arrange the plots in a 2x2 grid
grid.arrange(grobs = plots, layout_matrix = grid_layout)
# Reset the plotting layout
par(mfrow = c(1, 1))
Out of all observations, only 1 percent of houses are located on the waterfront.
Additionally, the majority of houses (90.2%) have a view score of 0. Among the remaining view scores, 4.5% have a score of 2, while scores of 1 and 4 each account for 1.5%. The remaining 2.4% of houses have a view score of 3. we observed that the ‘view’ variable predominantly contained 0 values, suggesting that many houses had not been viewed. In response, we decided to engineer a new feature named ‘viewed’ to capture this information more explicitly. The ‘viewed’ variable takes on a value of 1 if the house has been viewed and 0 otherwise.
# Create a new variable 'viewed' with value 1 if 'view' is not 0, and 0 otherwise
house_data2$viewed <- ifelse(house_data2$view != 0, 1, 0)
# Drop the original 'view' variable
house_data2 <- house_data2[, !names(house_data2) %in% c("view")]
The majority of houses in the dataset are in good to average condition. Approximately 91.7% of houses fall within Condition 3, indicating that a significant portion of the properties is well-maintained.Condition 4 homes represent 26.3%, suggesting a sizable proportion of houses are in better-than-average condition. Meanwhile, Condition 5 homes, which likely denote excellent condition, constitute 7.9% of the dataset.
The distribution of grades reflects a diverse range of housing quality. A significant portion of houses falls within Grade 7 (41.6%) and Grade 8 (28.1%), indicating properties with a higher level of construction and design. Grades 9 and 10 together contribute 17.3%, highlighting a considerable proportion of houses with superior construction and design quality.The dataset includes a limited number of houses with lower grades (1-6), with most grades in this range having negligible representation (close to 0%).The distribution is skewed towards higher grades, emphasizing the prevalence of houses with above-average construction and design quality in the dataset.
library(ggplot2)
# Count of observations where bedrooms is 0
count_bedrooms_0 <- nrow(house_data2[house_data2$bedrooms == 0, ])
# Count of observations where bathrooms is 0
count_bathrooms_0 <- nrow(house_data2[house_data2$bathrooms == 0, ])
# Create a data frame for plotting
count_data <- data.frame(Category = c("Bedrooms=0", "Bathrooms=0"),
Count = c(count_bedrooms_0, count_bathrooms_0))
# Create a bar plot
ggplot(count_data, aes(x = Category, y = Count, fill = Category)) +
geom_bar(stat = "identity") +
labs(title = "Count of Observations with Bedrooms=0 and Bathrooms=0",
x = "Category",
y = "Count") +
theme_minimal()
# Function to impute missing values using the median based on non-zero values
impute_nonzero <- function(var) {
non_zero_values <- as.numeric(var[var != 0])
if (length(non_zero_values) > 0) {
imputed_value <- median(non_zero_values)
var[var == 0] <- imputed_value
}
return(var)
}
# Convert the variables to numeric again
house_data2$bedrooms <- as.numeric(as.character(house_data2$bedrooms))
house_data2$bathrooms <- as.numeric(as.character(house_data2$bathrooms))
house_data2$floors <- as.numeric(as.character(house_data2$floors))
house_data2$waterfront <- as.numeric(as.character(house_data2$waterfront))
house_data2$condition <- as.numeric(as.character(house_data2$condition))
house_data2$grade <- as.numeric(as.character(house_data2$grade))
# Apply the imputation function to bedrooms and bathrooms
house_data2$bedrooms <- impute_nonzero(house_data2$bedrooms)
house_data2$bathrooms <- impute_nonzero(house_data2$bathrooms)
##redefine the variables for further stages
# Categorical Variables
cat_vars <- c("waterfront", "viewed", "condition", "grade")
# All variables
all_vars <- house_data2[, c("price", "bedrooms", "bathrooms", "sqft_living",
"sqft_lot","floors", "waterfront", "viewed", "condition",
"grade", "sqft_above", "sqft_basement", "yr_built",
"yr_renovated", "zipcode", "lat", "long", "sqft_living15",
"sqft_lot15")]
attach(house_data2)
# Create an empty list to store plots
plots_list <- list()
# Iterate through variables and create scatter plots
for (variable in cont_vars[-1]) {
# Create scatter plot with regression line
scatter_plot <- ggplot(house_data2, aes_string(x = variable, y = "price")) +
geom_point(color = "orange") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
geom_encircle(data = house_data2 %>% filter(price > 6000000),
color = "red", size = 2, expand = 0.05) +
geom_encircle(data = house_data2 %>% filter(bedrooms == 33),
color = "green", size = 2, expand = 0.05) +
labs(title = paste(variable, "vs. Price"), x = variable, y = "price") +
theme_minimal()
# Add the plot to the list
plots_list[[variable]] <- scatter_plot
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Arrange the plots in a grid
advanced_plots <- ggarrange(plotlist = plots_list, ncol = 2, nrow = 2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
# Display the arranged plots
print(advanced_plots)
## $`1`
##
## $`2`
##
## attr(,"class")
## [1] "list" "ggarrange"
We generate scatter plots for various continuous variables against housing prices, utilizing light green points and red regression lines for visualization. A red circle is incorporated to emphasize observations where housing prices exceed $6,000,000, indicating potential outliers. The scatter plots collectively underscore similarities in the distribution patterns of all continuous variables concerning price. This graphical exploration enhances the understanding of the correlation between each continuous variable and housing prices, with the filter for prices drawing attention to three potential outliers. Identifying and comprehending such outliers is crucial for robust data analysis, aiding in informed decisions regarding their impact on statistical models and subsequent analyses. Further investigation and domain knowledge are typically required to interpret these outliers within the dataset’s context.
# Create an empty list to store plots
plots_list <- list()
# Iterate through variables and create scatter plots
for (variable in disc_vars) {
# Create scatter plot with regression line
scatter_plot <- ggplot(house_data2, aes_string(x = variable, y = "price")) +
geom_jitter(width = .3, alpha = .3, color = "blue") + # Introduce a noise
geom_smooth(method = "lm", se = FALSE, color = "red") +
geom_encircle(data = house_data2 %>% filter(price > 6000000),
color = "red", size = 2, expand = 0.05) + # Add an encircling for high prices
geom_encircle(data = house_data2 %>% filter(bedrooms == 33),
color = "green", size = 2, expand = 0.05) + # Add an encirling for
#bedrooms == 33
labs(title = paste(variable, "vs. Price"), x = variable, y = "Price")
# Add the plot to the list
plots_list[[variable]] <- scatter_plot
}
# Arrange the plots in a grid
advanced_plots <- ggarrange(plotlist = plots_list, ncol = 2, nrow = 2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
# Display the arranged plots
print(advanced_plots)
In this code update, we continued our exploration of the dataset by examining the relationship between housing prices and discrete variables. We introduced noise for a more nuanced view and identified high-priced outliers, visualizing them with encircling shapes.
As an additional step, we focused on the unusual case where the number of bedrooms equals 33. We specifically circled these observations using a distinctive green color. This targeted analysis aims to spotlight and investigate unique patterns and outliers within the data, enhancing our understanding of their impact on housing prices. Our approach reflects an iterative process, adapting visualizations to reveal hidden insights in the dataset.
# Find indices where 'bedrooms' is 33
index_bedrooms_33 <- which(house_data2$bedrooms == 33)
# Replace the 'bedrooms' value of 33 with the median value of bedrooms
house_data2$bedrooms[index_bedrooms_33] <- median(house_data2$bedrooms, na.rm = TRUE)
# Other advanced visualizations for analysis ( hexbin)
p <- ggplot(house_data2, aes(x = sqft_living, y = price)) +
geom_hex(bins = 50, aes(fill = ..count..), color = "darkblue") +
geom_encircle(data = house_data2 %>% filter(bedrooms == 33),
color = "green", size = 2, expand = 0.05) +
geom_encircle(data = house_data2 %>% filter(price > 6000000), color = "red", size = 1, expand = 0.05,
linetype ="dashed") +
labs(title = 'Hexbin Plot: Housing Price vs. Square Footage of Living Space',
x = 'Square Footage of Living Space', y = 'Price') +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Reverse the color scale
p +
scale_fill_distiller(palette = "Blues", direction = 1) +
theme_bw() +
scale_x_continuous(labels = scales::comma) +
scale_y_continuous(labels = scales::dollar_format(scale = 0.001))
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
In this advanced visualization, we employed a hexbin plot to explore the relationship between housing price and square footage of living space. The hexbin plot effectively communicates the data density, revealing insights into the distribution of housing prices concerning living space. Notably, we observed a high density in areas where prices are lower than $2 million and square footage of living space is below 2,000 square feet. We continued the analysis by encircling specific data points of interest. The red dashed circle encompasses houses with prices exceeding $6,000,000. Additionally, we maintained the encircling of the green circle around properties with 33 bedrooms and the blue circle around those with a grade less than 3. These encirclings help to highlight and differentiate distinct subsets within the dataset, providing a nuanced understanding of the data distribution. The refined aesthetics, including color-coded circles and improved line types, enhance the visual appeal and interpretability of the plot. This visualization strategy builds upon the previous stages, offering a comprehensive exploration of the dataset’s intricate patterns and outliers.
house_data2[duplicated(house_data2), ]
## [1] id bathrooms bedrooms condition date
## [6] feat01 feat02 feat03 feat04 feat05
## [11] feat06 feat07 feat08 feat09 feat10
## [16] floors grade lat long price
## [21] sqft_above sqft_basement sqft_living sqft_living15 sqft_lot
## [26] sqft_lot15 waterfront yr_built yr_renovated zipcode
## [31] viewed
## <0 rows> (or 0-length row.names)
duplicates <- house_data2[duplicated(house_data2$id), ]
dim(duplicates)
## [1] 0 31
The dataset analysis provided two distinct findings regarding duplicates. First, a comprehensive scan across all columns of the dataset did not reveal any duplicate entries, suggesting the entire dataset is unique in its entirety. Secondly, focusing specifically on the ‘id’ column—a unique identifier for each home sold— and couldn’t find any duplication. However we recognized that id column is renumarized and the original dataset had the duplication on id which can cause also some issue on model and needs to underline it.
Considering the initial exploratory data analysis (EDA), it is evident that the dataset exhibits varying ranges among its features, necessitating the application of scaling techniques. Additionally, the presence of numerous positive outliers suggests the adoption of robust scaling methods and the utilization of robust models for subsequent production stages.Unlike standard methods, a robust scaler is less influenced by extreme values. In R, we harnessed the power of the robustbase package, specifically using the scale function. This function automatically applies robust scaling by centering and scaling the data. Setting the center and scale arguments to TRUE ensures the robust scaling methodology, making our normalization process resilient to outliers.
# Extract numeric columns
numeric_cols <- sapply(house_data2, is.numeric)
df_numeric <- house_data2[, numeric_cols, drop = FALSE]
# Apply robust scaling using scale
df_scaled <- as.data.frame(scale(df_numeric, center = TRUE, scale = TRUE))
# Conduct correlation analysis without scaling
cor_matrix <- cor(df_scaled[, numeric_cols])
# Create a heatmap for correlation values
melted_correlation <- melt(cor_matrix, na.rm = TRUE)
p1 <- ggplot(melted_correlation, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1, 1), space = "Lab",
name = "Correlation") +
geom_text(aes(label = ifelse(abs(value) > 0.3, round(value, 2), "")), vjust = 1,
size = 2) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
coord_fixed()
p1
Initially, we constructed a correlation matrix to discern relationships among variables in the dataset. To enhance interpretability, we applied a filter, selecting correlations with an absolute value greater than 0.5. This focused approach facilitates easier interpretation by highlighting strong correlated variables. The choice of the cutoff level for correlation analysis depends on the specific goals of the analysis and the nature of the data. Commonly used cutoff values for correlation coefficients between 0.5-0.7 for moderate correlation and above 0.7 for strong correlation. Eventhough we in the begging chooses 0.7 and then 0.5 cutoff , the results showed that The price variable stands out with a strong positive correlation of 0.70 with only the square footage of living space (sqft_living). While this suggests a notable linear relationship between these two factors,we recognized it may be beneficial to also consider variables with moderate correlations to price, as they could provide additional insights into determinants of house prices beyond just living space. So we decreased the cutoff level as 0.30 according that.
# Select correlations with "price" where abs(correlation) > 0.3
price_correlations <- round((cor_matrix[abs(cor_matrix[,"price"]) > 0.3, "price"]),2)
# Print the selected correlations
print(price_correlations)
## bathrooms bedrooms feat02 feat05 grade
## 0.52 0.31 0.41 0.37 0.66
## lat price sqft_above sqft_basement sqft_living
## 0.31 1.00 0.60 0.32 0.70
## sqft_living15 viewed
## 0.58 0.36
# Create a data frame for the selected correlations
correlation_data <- data.frame(Feature = names(price_correlations), Correlation = price_correlations)
# Create a bar plot
ggplot(correlation_data, aes(x = Feature, y = Correlation, fill = Feature)) +
geom_bar(stat = "identity", width = 0.7) +
labs(title = "Correlation with Price (abs > 0.3)",
y = "Correlation",
x = "Feature") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none") +
geom_text(aes(label = Correlation), vjust = -0.5, size = 3)
According to final results (with 0.3 cutoff):
The correlation analysis reveals significant relationships between the “price” variable and various features in the dataset. The strongest positive correlation is observed with the “grade” of the house (0.66), indicating that higher-grade houses tend to have higher prices. The size-related features, such as “sqft_living” (0.70) and “sqft_living15” (0.58), also exhibit strong positive correlations, suggesting that larger living spaces, both for the specific property and the surrounding area, contribute positively to house prices. Additionally, the number of “bathrooms” shows a moderate positive correlation (0.52), emphasizing their influence on property values. The “bedrooms” variable demonstrates a weaker positive correlation (0.31). Features like “feat02” (0.41) and “feat05” (0.37) also display positive correlations, though not as strong as others. Moreover, the “viewed” status has a moderate positive correlation (0.36) with house prices. These findings provide valuable insights into the factors driving housing prices, guiding further analysis and modeling decisions.
# Load necessary libraries
library(tidyverse)
library(caret)
library(tree)
library(randomForest)
library(rpart)
library(here)
library(gbm)
library(xgboost)
library(robustbase)
In the preprocessing step for ML prediction, we begin by removing unnecessary features from our dataset. We first drop the features that are not relevant or useful for our modeling purposes. In this case, we’re dropping features like “id”, “date”, “lat”, “long”, and “zipcode” from our dataset. While acknowledging the potential significance of spatial correlation and temporal analysis, we consciously choose to set aside these considerations for the current project. However, we note that exploring these aspects could be advantageous for future projects, providing avenues for deeper investigation and potentially enriching our predictive models.
# Drop specified features
df <- df_scaled[, !(names(df_scaled) %in% c("id","date", "lat", "long", "zipcode"))]
Next, we select only the relevant columns from the original dataset based on variables that are highly correlated with prices. These variables are determined through a correlation analysis.
# Select only the relevant columns from the original dataset
selected_vars <- c("price", "bathrooms", "bedrooms", "feat02", "feat05", "grade", "sqft_above",
"sqft_basement", "sqft_living", "sqft_living15", "viewed")
df2 <- df_scaled[, selected_vars, drop = FALSE]
After pre-processing the data, we split it into training and testing sets. This step is crucial for evaluating the performance of our machine learning models.
set.seed(123456789)
training_obs <- createDataPartition(df$price,
p = 0.7,
list = FALSE)
train.scaled <- df[training_obs,]
test.scaled <- df[-training_obs,]
set.seed(123456789)
training_obs2 <- createDataPartition(df2$price,
p = 0.7,
list = FALSE)
train.scaled2 <- df[training_obs2,]
test.scaled2 <- df[-training_obs2,]
These codes use the createDataPartition function from
the caret package to split the dataset into training and
testing sets. The training set contains 70% of the data, while the
testing set contains the remaining 30%.
Finally, we define the model formulas for our machine learning models. These formulas specify the relationship between the predictors (features) and the target variable (price).
We decided to utilize tree-based models for our analysis. While we aimed to incorporate all variables in our first model (model1), we also wanted to explore the insights gleaned from correlation analysis. Hence, we opted to create two distinct models for comparison.
model1: Includes all predictors except for the variables that were dropped during pre-processing. model2: Includes only the variables that were selected based on correlation analysis.
These model formulas will be used in subsequent steps for model training and evaluation.
# Define the model formula
model1 <- price ~ .
model2 <- price ~ bathrooms + bedrooms + feat02 + feat05 + grade +
sqft_above + sqft_basement + sqft_living + sqft_living15 + viewed
Regression trees offer a powerful framework for predicting house prices, allowing us to uncover intricate patterns in the data. By following a systematic approach and leveraging the capabilities of regression trees, we can build robust predictive models that drive informed decision-making in the real estate industry.
set.seed(123)
reg.tree <- tree(model1, train.scaled)
reg.tree
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 18758 18810.6000 0.00095086
## 2) grade < 0.725091 15109 4260.0800 -0.27780100
## 4) sqft_living < -0.0437263 10263 1747.0600 -0.42426700 *
## 5) sqft_living > -0.0437263 4846 1826.5800 0.03238950
## 10) yr_built < -0.53984 1087 587.4880 0.40996100 *
## 11) yr_built > -0.53984 3759 1039.3200 -0.07679390 *
## 3) grade > 0.725091 3649 8515.4200 1.15515000
## 6) sqft_living < 2.33753 3169 3199.3200 0.86468400
## 12) yr_built < -0.0262972 501 732.7710 1.81075000 *
## 13) yr_built > -0.0262972 2668 1933.9300 0.68703200
## 26) grade < 1.57815 1770 669.9660 0.42354100 *
## 27) grade > 1.57815 898 898.8670 1.20638000 *
## 7) sqft_living > 2.33753 480 3283.5900 3.07280000
## 14) sqft_living < 6.37701 469 2051.8300 2.84581000
## 28) feat05 < 1.126 274 597.0880 2.07160000 *
## 29) feat05 > 1.126 195 1059.7300 3.93368000
## 58) waterfront < 5.58118 177 762.9930 3.59607000 *
## 59) waterfront > 5.58118 18 78.1866 7.25347000 *
## 15) sqft_living > 6.37701 11 177.2690 12.75090000 *
summary(reg.tree)
##
## Regression tree:
## tree(formula = model1, data = train.scaled)
## Variables actually used in tree construction:
## [1] "grade" "sqft_living" "yr_built" "feat05" "waterfront"
## Number of terminal nodes: 10
## Residual mean deviance: 0.388895 = 7291.01 / 18748
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.0488800 -0.3624880 -0.0762008 0.0000000 0.2590970 7.0814200
sd(train.scaled$price) ^ 2 * (length(train.scaled$price) - 1)
## [1] 18810.5885
(train.scaled$price - mean(train.scaled$price)) ^ 2 %>% sum
## [1] 18810.5885
In constructing the decision tree, we utilized the tree() function which implements the binary recursive partitioning algorithm, considering all predictors specified in the model formula. The algorithm aims for maximum reduction in node impurity to generate splits and continues this process until nodes have a sufficient number of observations. Notably, the tree’s depth is constrained to 31 nodes. The summary of the tree estimation provides insights into the number of predictors eventually utilized in the model.
The regression tree analysis reveals that the primary predictors
influencing house prices are “sqft_living,” “yr_built,” “grade,”
“feat05,” and “waterfront.” The tree, consisting of 10
terminal nodes, delineates different conditions based on these factors.
For instance, when the grade is below a certain threshold and
sqft_living is also within a specific range, the predicted price tends
to be lower. On the other hand, higher grades, larger sqft_living areas,
and certain waterfront conditions contribute to higher predicted prices.
The residual mean deviance, a measure of model fit, is
0.39, indicating the average difference between predicted
and actual prices. Overall, the tree provides a simplified yet
informative representation of the relationships between key features and
house prices.
This indicates that the deviance at the root of the tree is equivalent to the corrected sum of squares of the target variable. The tree is providing a model that minimizes the sum of squared differences compared to the simple mean-based model.
In the next step, we can visualize the tree on the plot.
plot(reg.tree)
text(reg.tree, pretty = 0)
# Cross-validation to find optimal tree size
cv_result <- cv.tree(reg.tree)
best_tree_size <- cv_result$size[which.min(cv_result$dev)]
best_tree_size
## [1] 10
The optimal tree size determined by cross-validation is 10. thus the cross-validated optimal tree size is already equal to the number of terminal nodes, it suggests that the tree is at an appropriate level of complexity, and additional pruning may not be necessary.
However we still wanted to try the limit the size of the tree by using lecture notes and compare the results.
To choose appropriate values for mincut, minsize, and mindev parameters, we’ll consider the characteristics of the first model and aim to balance model complexity with predictive performance.
mincut: We’ll start with a value that allows for a reasonable number of splits in the tree without excessively increasing complexity. Let’s choose a moderate value of 20 for mincut.
minsize: Considering that the first model resulted in 10 terminal nodes, we’ll set minsize to ensure each terminal node contains a sufficient number of observations. Given the relatively balanced number of observations in each node, we’ll choose a value of 50 for minsize.
mindev: Since the residual mean deviance of the first model is 0.388895, indicating a good overall model fit, we’ll choose a conservative value for mindev to avoid unnecessary splits. Let’s set mindev to 0.01 to ensure that splits contribute significantly to reducing deviance.
set.seed(123456)
control1 <- tree.control(nobs = length(train.scaled$price),
mincut = 20,
minsize = 50,
mindev = 0.01)
reg.tree1 <- tree(formula = model1,
data = train.scaled,
control = control1)
reg.tree1
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 18758 18810.600 0.00095086
## 2) grade < 0.725091 15109 4260.080 -0.27780100
## 4) sqft_living < -0.0437263 10263 1747.060 -0.42426700 *
## 5) sqft_living > -0.0437263 4846 1826.580 0.03238950
## 10) yr_built < -0.53984 1087 587.488 0.40996100 *
## 11) yr_built > -0.53984 3759 1039.320 -0.07679390 *
## 3) grade > 0.725091 3649 8515.420 1.15515000
## 6) sqft_living < 2.33753 3169 3199.320 0.86468400
## 12) yr_built < -0.0262972 501 732.771 1.81075000 *
## 13) yr_built > -0.0262972 2668 1933.930 0.68703200
## 26) grade < 1.57815 1770 669.966 0.42354100 *
## 27) grade > 1.57815 898 898.867 1.20638000 *
## 7) sqft_living > 2.33753 480 3283.590 3.07280000
## 14) sqft_living < 5.7568 460 1861.750 2.79583000
## 28) feat05 < 1.126 271 572.775 2.06499000 *
## 29) feat05 > 1.126 189 936.667 3.84377000 *
## 15) sqft_living > 5.7568 20 574.955 9.44305000 *
summary(reg.tree1)
##
## Regression tree:
## tree(formula = model1, data = train.scaled, control = control1)
## Variables actually used in tree construction:
## [1] "grade" "sqft_living" "yr_built" "feat05"
## Number of terminal nodes: 9
## Residual mean deviance: 0.413882 = 7759.87 / 18749
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.7459500 -0.3624880 -0.0761098 0.0000000 0.2596630 8.1992300
plot(reg.tree1)
text(reg.tree1, pretty = 0)
The results from the regression tree analysis for
reg.tree1
indicate a reasonably balanced tree structure with 9 terminal nodes. The
tree splits based on predictor variables such as “grade,”
“sqft_living,” “yr_built,” and “feat05,” effectively partitioning
the dataset to minimize deviance. The residual mean deviance is low at
0.41, suggesting that the model fits the data well on
average. The distribution of residuals appears symmetrical around the
mean, indicating unbiased predictions. Overall, the chosen values for
mincut, minsize, and mindev seem to have produced a suitable regression
tree for the dataset. A hıgher residual mean deviance indicates a worse
fit of the model to the data when compare the fırst results. Therefore,
based on this comparison, the first model performs slightly better in
terms of model fit.
Here, our focus shifts to constructing and evaluating the next model, referred to as Model2. The objective is to deepen our understanding of how the selected predictors influence house prices. By analyzing Model2, we aim to gain insights into the specific variables that have the most significant impact on predicting house prices.
set.seed(123456)
reg.tree2 <- tree(model2, train.scaled2)
reg.tree2
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 18758 18810.600 0.00095086
## 2) grade < 0.725091 15109 4260.080 -0.27780100
## 4) sqft_living < -0.0437263 10263 1747.060 -0.42426700 *
## 5) sqft_living > -0.0437263 4846 1826.580 0.03238950 *
## 3) grade > 0.725091 3649 8515.420 1.15515000
## 6) sqft_living < 2.33753 3169 3199.320 0.86468400
## 12) sqft_living < 1.17057 1889 1154.160 0.56565200 *
## 13) sqft_living > 1.17057 1280 1626.960 1.30599000 *
## 7) sqft_living > 2.33753 480 3283.590 3.07280000
## 14) sqft_living < 6.37701 469 2051.830 2.84581000
## 28) feat05 < 1.126 274 597.088 2.07160000 *
## 29) feat05 > 1.126 195 1059.730 3.93368000 *
## 15) sqft_living > 6.37701 11 177.269 12.75090000 *
summary(reg.tree2)
##
## Regression tree:
## tree(formula = model2, data = train.scaled2)
## Variables actually used in tree construction:
## [1] "grade" "sqft_living" "feat05"
## Number of terminal nodes: 7
## Residual mean deviance: 0.436716 = 8188.85 / 18751
## Distribution of residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8.0488800 -0.3732420 -0.0889825 0.0000000 0.2632470 8.9464700
plot(reg.tree2)
text(reg.tree2, pretty = 0)
# Cross-validation to find optimal tree size
cv_result2 <- cv.tree(reg.tree2)
best_tree_size2 <- cv_result2$size[which.min(cv_result2$dev)]
best_tree_size2
## [1] 7
# Predictions on the training set
reg.tree.train.pred <- predict(reg.tree, newdata = train.scaled)
reg.tree1.train.pred <- predict(reg.tree1, newdata = train.scaled)
reg.tree2.train.pred <- predict(reg.tree2, newdata = train.scaled)
# Calculate MSE for training set
mse_reg.tree_train <- mean((reg.tree.train.pred - train.scaled$price) ^ 2)
mse_reg.tree1_train <- mean((reg.tree1.train.pred - train.scaled$price) ^ 2)
mse_reg.tree2_train <- mean((reg.tree2.train.pred - train.scaled2$price) ^ 2)
# Calculate Root Mean Squared Error (RMSE) for training set
rmse_tree_train <- sqrt(mean((reg.tree.train.pred - train.scaled$price) ^ 2))
rmse_tree1_train <- sqrt(mean((reg.tree1.train.pred - train.scaled$price) ^ 2))
rmse_tree2_train <- sqrt(mean((reg.tree2.train.pred - train.scaled2$price) ^ 2))
# Calculate Mean Absolute Error (MAE) for training set
mae_tree_train <- mean(abs(reg.tree.train.pred - train.scaled$price))
mae_tree1_train <- mean(abs(reg.tree1.train.pred - train.scaled$price))
mae_tree2_train <- mean(abs(reg.tree2.train.pred - train.scaled2$price))
# Create a data frame for the training set results
results_train_df <- data.frame(
Model = c("reg.tree", "reg.tree1", "reg.tree2"),
MSE = c(mse_reg.tree_train, mse_reg.tree1_train, mse_reg.tree2_train),
RMSE = c(rmse_tree_train, rmse_tree1_train, rmse_tree2_train),
MAE = c(mae_tree_train, mae_tree1_train, mae_tree2_train)
)
# Print the training set results as a table
kable(results_train_df, format = "markdown")
| Model | MSE | RMSE | MAE |
|---|---|---|---|
| reg.tree | 0.388688046 | 0.623448511 | 0.424154963 |
| reg.tree1 | 0.413683225 | 0.643182109 | 0.428327183 |
| reg.tree2 | 0.436552665 | 0.660721322 | 0.444317150 |
set.seed(123456)
reg.tree.pred <- predict(reg.tree, newdata = test.scaled)
reg.tree1.pred <- predict(reg.tree1, newdata = test.scaled)
reg.tree2.pred <- predict(reg.tree2, newdata = test.scaled2)
# Calculate MSE
mse_reg.tree <- mean((reg.tree.pred - test.scaled$price) ^ 2)
mse_reg.tree1 <- mean((reg.tree1.pred - test.scaled$price) ^ 2)
mse_reg.tree2 <- mean((reg.tree2.pred - test.scaled2$price) ^ 2)
# Calculate Root Mean Squared Error (RMSE)
rmse_tree <- sqrt(mean((reg.tree.pred - test.scaled$price) ^ 2))
rmse_tree1 <- sqrt(mean((reg.tree1.pred - test.scaled$price) ^ 2))
rmse_tree2 <- sqrt(mean((reg.tree2.pred - test.scaled2$price) ^ 2))
# Calculate Mean Absolute Error (MAE)
mae_tree <- mean(abs(reg.tree.pred - test.scaled$price))
mae_tree1 <- mean(abs(reg.tree1.pred - test.scaled$price))
mae_tree2 <- mean(abs(reg.tree2.pred - test.scaled2$price))
library(knitr)
# Create a data frame for the results
results_test_df <- data.frame(
Model = c("reg.tree", "reg.tree1", "reg.tree2"),
MSE = c(mse_reg.tree, mse_reg.tree1, mse_reg.tree2),
RMSE = c(rmse_tree, rmse_tree1, rmse_tree2),
MAE = c(mae_tree, mae_tree1, mae_tree2)
)
# Print the results as a table
kable(results_test_df, format = "markdown")
| Model | MSE | RMSE | MAE |
|---|---|---|---|
| reg.tree | 0.410356501 | 0.640590744 | 0.431282928 |
| reg.tree1 | 0.423276384 | 0.650596944 | 0.432994135 |
| reg.tree2 | 0.463238434 | 0.680616216 | 0.454251037 |
# Combine the training and test set results into a single data frame
results_combined_df <- merge(results_train_df, results_test_df, by = "Model")
# Melt the data frame for easier plotting
results_melted <- reshape2::melt(results_combined_df, id.vars = "Model")
# Plot MSE, RMSE, and MAE for both training and test sets
ggplot(results_melted, aes(x = Model, y = value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~ variable, scales = "free_y") +
labs(title = "Comparison of Model Performance on Training and Test Sets",
x = "Model", y = "Metric Value") +
theme_minimal() +
theme(legend.position = "top")
# Reshape the table to have columns for train and test results separately
reshaped_table <- results_melted %>%
tidyr::spread(variable, value)
reshaped_table
## Model MSE.x RMSE.x MAE.x MSE.y RMSE.y
## 1 reg.tree 0.388688046 0.623448511 0.424154963 0.410356501 0.640590744
## 2 reg.tree1 0.413683225 0.643182109 0.428327183 0.423276384 0.650596944
## 3 reg.tree2 0.436552665 0.660721322 0.444317150 0.463238434 0.680616216
## MAE.y
## 1 0.431282928
## 2 0.432994135
## 3 0.454251037
# Calculate the differences
differences_table <- reshaped_table %>%
mutate(reg.tree_MSE_differences = MSE.x - MSE.y,
reg.tree_RMSE_differences = RMSE.x - RMSE.y,
reg.tree_MAE_differences = MAE.x - MAE.y) %>%
select(Model, reg.tree_MSE_differences, reg.tree_RMSE_differences, reg.tree_MAE_differences)
# Print the differences table
print(differences_table)
## Model reg.tree_MSE_differences reg.tree_RMSE_differences
## 1 reg.tree -0.02166845478 -0.01714223249
## 2 reg.tree1 -0.00959315909 -0.00741483569
## 3 reg.tree2 -0.02668576882 -0.01989489451
## reg.tree_MAE_differences
## 1 -0.00712796500
## 2 -0.00466695220
## 3 -0.00993388698
# Reshape the table for plotting
reshaped_plot <- differences_table %>%
pivot_longer(cols = starts_with("reg.tree"), names_to = "Metric", values_to = "Difference")
# Create the line plot
line_plot <- ggplot(reshaped_plot, aes(x = Metric, y = Difference, group = Model, color = Model)) +
geom_line() +
geom_point() +
labs(title = "Assessing Overfitting",
x = "Metrics",
y = "Differences") +
theme_minimal()
# Print the line plot
print(line_plot)
Among the three models assessed, Model1 (reg.tree) demonstrates superior
performance with notably lower Mean Squared Error (MSE), Root Mean
Squared Error (RMSE), and Mean Absolute Error (MAE) values across both
training and test datasets, indicating robust predictive capabilities.
Moreover, upon examining the overfitting scenario, Model1 exhibits the
most favorable outcome. Consequently, considering its strong performance
and generalization potential, Model1 emerges as the preferred option for
predictive modeling in this scenario. Nevertheless, additional scrutiny
and experimentation may be warranted to validate this assertion
conclusively.
In this analysis, we employed the Gradient Boosting methodology for
regression using the gbm() function in R. The choice of
distribution for the target variable is crucial, and in our case, we
opted for a regression analysis, applying the normal distribution with
the parameter distribution = "gaussian". This setting
aligns with the assumption that the target variable follows a Gaussian
distribution.
n.trees - number of iterations, actually trees, which
will be growninteraction.depth - depth of the trees, that is the
degree of their complexity, defined as the number of splits in the tree
(counted from the first single node)shrinkage - learning rate parametern.minobsinnode – minimum number of observation in the
terminal node, required while creating the splitwe created a Gradient Boosting model (gbm_model) with the formula price ~ . and additional specifications such as the number of trees (n.trees = 100) and interaction depth (interaction.depth = 4) and specific values for hyperparameters are not provided, default values are used.
The default value for the shrinkage parameter, also known as the learning rate, is 0.001. The default value for the n.minobsinnode parameter, determining the minimum observations in a terminal node for a split, is 10.
set.seed(123456)
# Create a Gradient Boosting model
gbm_model <- gbm(model1, data = train.scaled, distribution = "gaussian", n.trees = 100, interaction.depth = 4)
# Predictions on the training set
gbm_pred_train <- predict(gbm_model, newdata = train.scaled, n.trees = 100)
# Calculate MSE for Gradient Boosting with tuned parameters on training set
mse_gbm_train <- mean((gbm_pred_train - train.scaled$price) ^ 2)
# Calculate MAE for Gradient Boosting with tuned parameters on training set
mae_gbm_train <- mean(abs(gbm_pred_train - train.scaled$price))
# Print the MSE and MAE values for training set
cat("MSE for Model1 GBM on Training Set:", mse_gbm_train, "\n")
## MSE for Model1 GBM on Training Set: 0.193684381
cat("MAE for Model1 GBM on Training Set:", mae_gbm_train, "\n")
## MAE for Model1 GBM on Training Set: 0.306711465
# Predictions on the test set
gbm_pred_test <- predict(gbm_model, newdata = test.scaled, n.trees = 100)
# Calculate MSE for Gradient Boosting
mse_gbm_test <- mean((gbm_pred_test - test.scaled$price) ^ 2)
# Calculate MAE for Gradient Boosting
mae_gbm_test <- mean(abs(gbm_pred_test - test.scaled$price))
# Print the MSE and MAE values for testing set
cat("MSE for Model1 GBM on Testing Set:", mse_gbm_test, "\n")
## MSE for Model1 GBM on Testing Set: 0.217774281
cat("MAE for Model1 GBM on Testing Set:", mae_gbm_test, "\n")
## MAE for Model1 GBM on Testing Set: 0.31438852
MSE is relatively low 0.19, suggesting that the initial
Gradient Boosting model is effective in capturing the patterns in the
training data. The MSE for the testing set is slightly higher than the
training set MSE with approximately 0.22. This metric
represents the average squared difference between the predicted and
actual values in our test set. In the context of regression models, a
lower MSE indicates better predictive performance. Therefore, the
obtained MSE of 0.277 suggests that the model provides a relatively
accurate prediction of the target variable, capturing the variance in
the data with a moderate level of precision. However, the model seems to
be performing well on the training set, but there is a slight increase
in error when applied to the testing set. This scenario commonly
suggests that the model is generalizing reasonably well but may be
experiencing some level of overfitting. Thus we decided to try tuning on
hyperparameters to improve the model, but first lets visualize our
results.
### Scatter plot for training set with prediction line
# Calculate absolute differences for training set
abs_diff_train <- abs(train.scaled$price - gbm_pred_train)
# Scatter plot for training set with prediction line
plot(train.scaled$price, gbm_pred_train, col = ifelse(abs_diff_train > 4, "black", "blue"), main = "Scatter Plot - Actual vs Predicted (Training Set)", xlab = "Actual Prices", ylab = "Predicted Prices")
# Highlight points with absolute differences over 4
points(train.scaled$price[abs_diff_train > 4], gbm_pred_train[abs_diff_train > 4], col = "black", pch = 16)
# Add a legend outside the plot
legend("bottomright", legend = c("Train Set", "AD > 4"), col = c("blue", "black"), pch = c(1, 16), inset = c(0, 0.05), xpd = TRUE)
# Add a diagonal line for perfect predictions
abline(0, 1, col = "green", lty = 2)
# Calculate absolute differences for training set
abs_diff_test <- abs(test.scaled$price - gbm_pred_test)
### Scatter plot for test set with prediction line
plot(test.scaled$price, gbm_pred_test, col = ifelse(abs_diff_test > 5, "black", "red"), main = "Scatter Plot - Actual vs Predicted (Test Set)", xlab = "Actual Prices", ylab = "Predicted Prices")
# Highlight points with absolute differences over 5
points(test.scaled$price[abs_diff_test > 4], gbm_pred_test[abs_diff_test > 4], col = "black", pch = 16)
# Add a legend outside the plot
legend("bottomright", legend = c("Test Set", "AD> 4"), col = c("red", "black"), pch = c(1, 16), inset = c(0, 0.05), xpd = TRUE)
# Add a diagonal line for perfect predictions
abline(0, 1, col = "green", lty = 2)
The green dashed diagonal line represents a scenario where predicted values exactly match the actual values. Deviations from this line indicate the model’s prediction errors. Black points are the observations that absolute differences between actual value and predicted value is over 4.
set.seed(123456)
# Create a Gradient Boosting model
gbm_model2 <- gbm(model2, data = train.scaled2, distribution = "gaussian", n.trees = 100, interaction.depth = 4)
# Predictions on the training set
gbm_pred_train2 <- predict(gbm_model2, newdata = train.scaled2, n.trees = 100)
# Calculate MSE for Gradient Boosting with tuned parameters on training set
mse_gbm_train2 <- mean((gbm_pred_train2 - train.scaled2$price) ^ 2)
# Print the MSE value for training set
cat("MSE for Model2 GBM on Training Set:", mse_gbm_train2, "\n")
## MSE for Model2 GBM on Training Set: 0.258698837
# Predictions on the test set
gbm_pred_test2 <- predict(gbm_model2, newdata = test.scaled2, n.trees = 100)
# Calculate MSE for Gradient Boosting
mse_gbm_test2 <- mean((gbm_pred_test2 - test.scaled2$price) ^ 2)
# Print the MSE value
cat("MSE for Model2 GBM on Testing Set:", mse_gbm_test2, "\n")
## MSE for Model2 GBM on Testing Set: 0.287677614
# Calculate MAE for Gradient Boosting with tuned parameters on training set
mae_gbm_train2 <- mean(abs(gbm_pred_train2 - train.scaled2$price))
# Print the MAE value for training set
cat("MAE for Model2 GBM on Training Set:", mae_gbm_train2, "\n")
## MAE for Model2 GBM on Training Set: 0.352250693
# Calculate MAE for Gradient Boosting on test set
mae_gbm_test2 <- mean(abs(gbm_pred_test2 - test.scaled2$price))
# Print the MAE value for test set
cat("MAE for Model2 GBM on Testing Set:", mae_gbm_test2, "\n")
## MAE for Model2 GBM on Testing Set: 0.362980382
Model2 GBM exhibits slightly higher MSE and MAE values on both the
training and testing sets compared to Model1 GBM. Specifically, Model2
GBM has an MSE of approximately 0.259 on the training set
and 0.288 on the testing set, along with MAE values of
about 0.352 and 0.363 on the training and
testing sets, respectively. These higher error rates suggest that Model2
GBM performs less effectively in capturing the dataset’s patterns
compared to Model1 GBM. Therefore, based on these metrics, Model1 GBM
appears to be the superior choice. Further analysis may be needed to
confirm this conclusion, but for now, continuing with Model1 for
subsequent analyses seems prudent.
In an effort to enhance the performance of our Gradient Boosting model, we performed hyperparameter tuning by increasing the number of trees (n.trees) to 200 and adjusting the interaction depth to 6. The aim was to find a configuration that better captures complex relationships in the data.
set.seed(123456)
# Create a Gradient Boosting model with different hyperparameters
gbm_model_tuned <- gbm(price ~ ., data = train.scaled, distribution = "gaussian", n.trees = 200, interaction.depth = 6)
# Predictions on the training set
gbm_pred_tuned_train <- predict(gbm_model_tuned, newdata = train.scaled, n.trees = 200)
# Calculate MSE for Gradient Boosting with tuned parameters on training set
mse_gbm_tuned_train <- mean((gbm_pred_tuned_train - train.scaled$price) ^ 2)
# Print the MSE value for training set
cat("MSE for Tuned GBM on Training Set:", mse_gbm_tuned_train, "\n")
## MSE for Tuned GBM on Training Set: 0.153130196
# Predictions on the test set
gbm_pred_tuned_test <- predict(gbm_model_tuned, newdata = test.scaled, n.trees = 200)
# Calculate MSE for Gradient Boosting with tuned parameters
mse_gbm_tuned_test <- mean((gbm_pred_tuned_test - test.scaled$price) ^ 2)
# Print the MSE value
cat("MSE for Tuned GBM on Test Set:", mse_gbm_tuned_test, "\n")
## MSE for Tuned GBM on Test Set: 0.203248099
# Calculate MAE for Tuned Gradient Boosting on training set
mae_gbm_tuned_train <- mean(abs(gbm_pred_tuned_train - train.scaled$price))
# Print the MAE value for training set
cat("MAE for Tuned GBM on Training Set:", mae_gbm_tuned_train, "\n")
## MAE for Tuned GBM on Training Set: 0.283985196
# Calculate MAE for Tuned Gradient Boosting on test set
mae_gbm_tuned_test <- mean(abs(gbm_pred_tuned_test - test.scaled$price))
# Print the MAE value for test set
cat("MAE for Tuned GBM on Test Set:", mae_gbm_tuned_test, "\n")
## MAE for Tuned GBM on Test Set: 0.305398644
The tuned hyperparameter values for the Gradient Boosting model, with
200 trees and an interaction depth of 6, resulted in a Training MSE of
0.15 and a Testing MSE of 0.20. When comparing
these values with the initial model , the tuned model exhibits improved
performance on both the training and testing sets.
However the overfitting difference increased slightly after tuning.
# Calculate overfitting differences
overfitting_difference_initial <- mse_gbm_train - mse_gbm_test
overfitting_difference_tuned <- mse_gbm_tuned_train - mse_gbm_tuned_test
# Print overfitting differences
cat("Overfitting Difference (Initial):", overfitting_difference_initial, "\n")
## Overfitting Difference (Initial): -0.0240899001
cat("Overfitting Difference (Tuned):", overfitting_difference_tuned, "\n")
## Overfitting Difference (Tuned): -0.0501179032
### Scatter plot for training set with prediction line
# Calculate absolute differences for training set
abs_diff_tuned1 <- abs(train.scaled$price - gbm_pred_tuned_train)
# Scatter plot for training set with prediction line
plot(train.scaled$price, gbm_pred_tuned_train, col = ifelse(abs_diff_tuned1 > 4, "black", "blue"), main = "Tuned Model - Actual vs Predicted (Training Set)", xlab = "Actual Prices", ylab = "Predicted Prices")
# Highlight points with absolute differences over 4
points(train.scaled$price[abs_diff_tuned1 > 4], gbm_pred_tuned_train[abs_diff_tuned1 > 4], col = "black", pch = 16)
# Add a legend outside the plot
legend("bottomright", legend = c("Train Set", "AD > 4"), col = c("blue", "black"), pch = c(1, 16), inset = c(0, 0.05), xpd = TRUE)
# Add a diagonal line for perfect predictions
abline(0, 1, col = "green", lty = 2)
####
# Calculate absolute differences for training set
abs_diff_tuned_test <- abs(test.scaled$price - gbm_pred_tuned_test)
#Scatter plot for test set with prediction line
plot(test.scaled$price, gbm_pred_tuned_test, col = ifelse(abs_diff_tuned_test > 5, "black", "red"), main = "Scatter Plot - Actual vs Predicted (Test Set)", xlab = "Actual Prices", ylab = "Predicted Prices")
# Highlight points with absolute differences over 4
points(test.scaled$price[abs_diff_tuned_test > 4], gbm_pred_tuned_test[abs_diff_tuned_test > 4], col = "black", pch = 16)
# Add a legend outside the plot
legend("bottomright", legend = c("Test Set", "AD> 4"), col = c("red", "black"), pch = c(1, 16), inset = c(0, 0.05), xpd = TRUE)
# Add a diagonal line for perfect predictions
abline(0, 1, col = "green", lty = 2)
In this analysis, we conducted hyperparameter tuning for a Gradient Boosting Machine (GBM) model using a grid search approach. The goal was to optimize the model’s performance by exploring various combinations of hyperparameters, including the number of trees, interaction depth, shrinkage, and minimum observations in a node. We utilized a cross-validation strategy with 5 folds to evaluate model performance, with the root mean squared error (RMSE) serving as the metric for optimization. The outcome of this process yielded the best set of hyperparameters, providing valuable insights for configuring the GBM model to achieve improved accuracy on the given dataset.
set.seed(123456)
# Define a hyperparameter grid
hyperparameters <- expand.grid(
n.trees = c(50, 100, 150, 200),
interaction.depth = c(3, 6, 9, 11),
shrinkage = c(0.01, 0.1, 0.2),
n.minobsinnode = c(5, 10, 20)
)
# Set up the control parameters for cross-validation
ctrl <- trainControl(method = "cv", number = 6)
# Perform grid search using train function
gbm_tune <- train(
price ~ .,
data = train.scaled,
method = "gbm",
trControl = ctrl,
tuneGrid = hyperparameters,
metric = "RMSE"
)
set.seed(123456)
# Print the best parameters
print(gbm_tune)
## Stochastic Gradient Boosting
##
## 18758 samples
## 25 predictor
##
## No pre-processing
## Resampling: Cross-Validated (6 fold)
## Summary of sample sizes: 15632, 15633, 15630, 15630, 15634, 15631, ...
## Resampling results across tuning parameters:
##
## shrinkage interaction.depth n.minobsinnode n.trees RMSE
## 0.01 3 5 50 0.825060169
## 0.01 3 5 100 0.722044719
## 0.01 3 5 150 0.658223096
## 0.01 3 5 200 0.613987236
## 0.01 3 10 50 0.824958596
## 0.01 3 10 100 0.722880198
## 0.01 3 10 150 0.660606765
## 0.01 3 10 200 0.616874157
## 0.01 3 20 50 0.825368587
## 0.01 3 20 100 0.723491067
## 0.01 3 20 150 0.660758312
## 0.01 3 20 200 0.617907073
## 0.01 6 5 50 0.795456135
## 0.01 6 5 100 0.675794291
## 0.01 6 5 150 0.604600638
## 0.01 6 5 200 0.559296963
## 0.01 6 10 50 0.795827358
## 0.01 6 10 100 0.677858937
## 0.01 6 10 150 0.606657242
## 0.01 6 10 200 0.561483159
## 0.01 6 20 50 0.796970105
## 0.01 6 20 100 0.680601332
## 0.01 6 20 150 0.610420647
## 0.01 6 20 200 0.566490678
## 0.01 9 5 50 0.780709360
## 0.01 9 5 100 0.654875107
## 0.01 9 5 150 0.581622842
## 0.01 9 5 200 0.537226349
## 0.01 9 10 50 0.781923418
## 0.01 9 10 100 0.657497582
## 0.01 9 10 150 0.584942641
## 0.01 9 10 200 0.541704362
## 0.01 9 20 50 0.784408000
## 0.01 9 20 100 0.661252175
## 0.01 9 20 150 0.589298209
## 0.01 9 20 200 0.545889592
## 0.01 11 5 50 0.772739596
## 0.01 11 5 100 0.643917831
## 0.01 11 5 150 0.569352939
## 0.01 11 5 200 0.526389070
## 0.01 11 10 50 0.776285420
## 0.01 11 10 100 0.648414279
## 0.01 11 10 150 0.574267173
## 0.01 11 10 200 0.531354155
## 0.01 11 20 50 0.777971478
## 0.01 11 20 100 0.651679165
## 0.01 11 20 150 0.579490360
## 0.01 11 20 200 0.536997015
## 0.10 3 5 50 0.505767157
## 0.10 3 5 100 0.475565356
## 0.10 3 5 150 0.466835189
## 0.10 3 5 200 0.464289983
## 0.10 3 10 50 0.509082688
## 0.10 3 10 100 0.479652763
## 0.10 3 10 150 0.470985843
## 0.10 3 10 200 0.468228839
## 0.10 3 20 50 0.511932757
## 0.10 3 20 100 0.483606958
## 0.10 3 20 150 0.476496144
## 0.10 3 20 200 0.473447898
## 0.10 6 5 50 0.476564247
## 0.10 6 5 100 0.461045908
## 0.10 6 5 150 0.456725420
## 0.10 6 5 200 0.455884999
## 0.10 6 10 50 0.485308246
## 0.10 6 10 100 0.467837316
## 0.10 6 10 150 0.464433071
## 0.10 6 10 200 0.462035882
## 0.10 6 20 50 0.487256507
## 0.10 6 20 100 0.471508040
## 0.10 6 20 150 0.467448784
## 0.10 6 20 200 0.466230354
## 0.10 9 5 50 0.466578390
## 0.10 9 5 100 0.453935233
## 0.10 9 5 150 0.452034842
## 0.10 9 5 200 0.451762186
## 0.10 9 10 50 0.475349513
## 0.10 9 10 100 0.462516086
## 0.10 9 10 150 0.458205677
## 0.10 9 10 200 0.455303734
## 0.10 9 20 50 0.480808367
## 0.10 9 20 100 0.467727842
## 0.10 9 20 150 0.463408383
## 0.10 9 20 200 0.460850251
## 0.10 11 5 50 0.465200600
## 0.10 11 5 100 0.455737841
## 0.10 11 5 150 0.454448797
## 0.10 11 5 200 0.452332556
## 0.10 11 10 50 0.471956956
## 0.10 11 10 100 0.460854924
## 0.10 11 10 150 0.457508710
## 0.10 11 10 200 0.455014704
## 0.10 11 20 50 0.472989796
## 0.10 11 20 100 0.461926063
## 0.10 11 20 150 0.459150027
## 0.10 11 20 200 0.457853661
## 0.20 3 5 50 0.479724117
## 0.20 3 5 100 0.471090713
## 0.20 3 5 150 0.468474357
## 0.20 3 5 200 0.466757134
## 0.20 3 10 50 0.486466193
## 0.20 3 10 100 0.476791076
## 0.20 3 10 150 0.476210418
## 0.20 3 10 200 0.473641595
## 0.20 3 20 50 0.488802351
## 0.20 3 20 100 0.478438968
## 0.20 3 20 150 0.475012575
## 0.20 3 20 200 0.474280313
## 0.20 6 5 50 0.472325542
## 0.20 6 5 100 0.469114908
## 0.20 6 5 150 0.467437292
## 0.20 6 5 200 0.465336786
## 0.20 6 10 50 0.475996423
## 0.20 6 10 100 0.470436469
## 0.20 6 10 150 0.470095992
## 0.20 6 10 200 0.467682988
## 0.20 6 20 50 0.474133241
## 0.20 6 20 100 0.468335200
## 0.20 6 20 150 0.467180546
## 0.20 6 20 200 0.465754402
## 0.20 9 5 50 0.468619887
## 0.20 9 5 100 0.467440525
## 0.20 9 5 150 0.469519396
## 0.20 9 5 200 0.469094517
## 0.20 9 10 50 0.473374399
## 0.20 9 10 100 0.468343034
## 0.20 9 10 150 0.467552054
## 0.20 9 10 200 0.468244593
## 0.20 9 20 50 0.472117461
## 0.20 9 20 100 0.468287212
## 0.20 9 20 150 0.468060986
## 0.20 9 20 200 0.466515064
## 0.20 11 5 50 0.467794301
## 0.20 11 5 100 0.466675445
## 0.20 11 5 150 0.466117679
## 0.20 11 5 200 0.465661595
## 0.20 11 10 50 0.468761414
## 0.20 11 10 100 0.465392158
## 0.20 11 10 150 0.465190246
## 0.20 11 10 200 0.465285013
## 0.20 11 20 50 0.465665671
## 0.20 11 20 100 0.460380496
## 0.20 11 20 150 0.461215967
## 0.20 11 20 200 0.460752465
## Rsquared MAE
## 0.599282108 0.521629952
## 0.634026186 0.461511235
## 0.664034024 0.428429602
## 0.690555389 0.406213657
## 0.598216371 0.520777789
## 0.629313492 0.460788998
## 0.659124552 0.428078835
## 0.686530000 0.405390055
## 0.602378161 0.520315686
## 0.628871895 0.460046859
## 0.659295913 0.426168817
## 0.684161161 0.403808764
## 0.667143881 0.506176104
## 0.699594027 0.436795882
## 0.720998455 0.398536582
## 0.739112262 0.373981489
## 0.662172767 0.505873951
## 0.693201353 0.435800200
## 0.716290062 0.397657915
## 0.735143752 0.372921699
## 0.659544199 0.504420289
## 0.689769632 0.434423335
## 0.712445476 0.395640266
## 0.730321808 0.371600449
## 0.695923032 0.498624454
## 0.723314779 0.424485625
## 0.741458776 0.384159593
## 0.755848727 0.359880027
## 0.688298961 0.497865815
## 0.716831591 0.423332021
## 0.735556388 0.382641943
## 0.748988720 0.358783884
## 0.681510672 0.496035225
## 0.711975640 0.421214700
## 0.730947638 0.380648473
## 0.745294125 0.356678563
## 0.711177195 0.494414720
## 0.735337836 0.417811284
## 0.752772163 0.376517481
## 0.763942082 0.353005480
## 0.706179609 0.493629871
## 0.730495400 0.416694421
## 0.747024161 0.375472190
## 0.758610781 0.351840394
## 0.694841389 0.491399774
## 0.722713255 0.414208634
## 0.739446807 0.372977698
## 0.751613069 0.349833776
## 0.752855823 0.342022633
## 0.776060932 0.320714022
## 0.783142969 0.315170989
## 0.785684921 0.312928920
## 0.750882803 0.339730115
## 0.772451712 0.319856521
## 0.779795199 0.315124034
## 0.782111959 0.313163986
## 0.746439909 0.339439814
## 0.768501743 0.320159824
## 0.774904482 0.316018209
## 0.777799710 0.314047433
## 0.776529720 0.322374247
## 0.788720381 0.311677999
## 0.792558360 0.308688930
## 0.793301713 0.307446006
## 0.768630707 0.323764998
## 0.782785418 0.313319279
## 0.785768148 0.310856276
## 0.787793324 0.309312183
## 0.766850589 0.322447702
## 0.779973227 0.313005937
## 0.783413075 0.310485941
## 0.784606976 0.309196035
## 0.784387374 0.316275910
## 0.794726834 0.307842539
## 0.796449549 0.305669470
## 0.796691904 0.304353384
## 0.777776725 0.316099817
## 0.787964422 0.308434325
## 0.791684954 0.306302378
## 0.794289162 0.304436369
## 0.771765653 0.316836949
## 0.783126659 0.309261605
## 0.787184848 0.307073255
## 0.789493702 0.306014878
## 0.786586234 0.313840390
## 0.793985518 0.306347393
## 0.795043778 0.304483877
## 0.797066705 0.302608372
## 0.780015325 0.314442954
## 0.789019676 0.308127760
## 0.791906775 0.305704923
## 0.794075297 0.304492149
## 0.779158412 0.312176309
## 0.788463807 0.306507957
## 0.791147150 0.305061589
## 0.791938483 0.304113673
## 0.771103942 0.321690877
## 0.779605014 0.315242438
## 0.781902085 0.312912958
## 0.783453513 0.311417946
## 0.764736201 0.323888373
## 0.773961907 0.318000823
## 0.774572328 0.316053175
## 0.776817124 0.314652682
## 0.763018554 0.322975153
## 0.773237686 0.316757884
## 0.776504353 0.315100981
## 0.777244610 0.313785095
## 0.777902487 0.316202641
## 0.781080859 0.312660555
## 0.782736064 0.310829441
## 0.784796892 0.309539458
## 0.774942643 0.317618661
## 0.780028908 0.314334404
## 0.780659573 0.312530801
## 0.783046481 0.310767301
## 0.777131946 0.314570403
## 0.782300411 0.311522526
## 0.783718309 0.310543077
## 0.785125480 0.309167387
## 0.782654886 0.312817558
## 0.784036271 0.309954239
## 0.782547090 0.309140510
## 0.782682727 0.308617531
## 0.777860400 0.312981459
## 0.782598223 0.310102043
## 0.783456918 0.308584128
## 0.782805766 0.308128066
## 0.778971901 0.313278967
## 0.782959912 0.310859997
## 0.783040201 0.310563101
## 0.784388807 0.309417834
## 0.783517044 0.312366763
## 0.784474399 0.309767710
## 0.785144485 0.308995986
## 0.785787357 0.308463426
## 0.782312817 0.310509522
## 0.785567406 0.307674079
## 0.785624856 0.306915391
## 0.785675597 0.306504600
## 0.784184665 0.310231371
## 0.789276352 0.307630302
## 0.788790832 0.306565416
## 0.789065423 0.306360695
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 200, interaction.depth =
## 9, shrinkage = 0.1 and n.minobsinnode = 5.
set.seed(123456)
# Create a Gradient Boosting model with the tuned hyperparameters
gbm_model_final <- gbm(
price ~ .,
data = train.scaled,
distribution = "gaussian",
n.trees = 200,
interaction.depth = 9,
shrinkage = 0.1,
n.minobsinnode = 5
)
# Predictions on the training set
gbm_pred_final_train <- predict(gbm_model_final, newdata = train.scaled, n.trees = 200)
# Calculate MSE for the final Gradient Boosting model on the training set
mse_gbm_final_train <- mean((gbm_pred_final_train - train.scaled$price) ^ 2)
# Print the MSE value for the training set
cat("MSE for Final GBM on Train:", mse_gbm_final_train, "\n")
## MSE for Final GBM on Train: 0.133801464
# Predictions on the test set
gbm_pred_final_test <- predict(gbm_model_final, newdata = test.scaled, n.trees = 200)
# Calculate MSE for the final Gradient Boosting model
mse_gbm_final_test <- mean((gbm_pred_final_test - test.scaled$price) ^ 2)
# Print the MSE value
cat("MSE for Final GBM on Test:", mse_gbm_final_test, "\n")
## MSE for Final GBM on Test: 0.206072014
# Calculate MAE for Tuned Gradient Boosting on training set
mae_gbm_final_train <- mean(abs(gbm_pred_final_train - train.scaled$price))
# Print the MAE value for training set
cat("MAE for Final GBM on Training Set:", mae_gbm_final_train, "\n")
## MAE for Final GBM on Training Set: 0.271696231
# Calculate MAE for Tuned Gradient Boosting on test set
mae_gbm_final_test <- mean(abs(gbm_pred_final_test - test.scaled$price))
# Print the MAE value for test set
cat("MAE for Final GBM on Test Set:", mae_gbm_final_test, "\n")
## MAE for Final GBM on Test Set: 0.304497559
# Calculate overfitting differences
overfitting_difference_initial <- mse_gbm_train - mse_gbm_test
overfitting_difference_initial2 <- mse_gbm_train2 - mse_gbm_test2
overfitting_difference_tuned <- mse_gbm_tuned_train - mse_gbm_tuned_test
overfitting_difference_final <- mse_gbm_final_train - mse_gbm_final_test
# Print overfitting differences
cat("Overfitting Difference (Initial):", overfitting_difference_initial, "\n")
## Overfitting Difference (Initial): -0.0240899001
cat("Overfitting Difference (Second):", overfitting_difference_initial2, "\n")
## Overfitting Difference (Second): -0.0289787765
cat("Overfitting Difference (Tuned):", overfitting_difference_tuned, "\n")
## Overfitting Difference (Tuned): -0.0501179032
cat("Overfitting Difference (Final):", overfitting_difference_final, "\n")
## Overfitting Difference (Final): -0.07227055
The comparison between the two Gradient Boosting models, one with hyperparameters tuned through a grid search and the other with manually specified hyperparameters, reveals slightly better performance in terms of MSE values for the training dataset. The model with grid search achieved an MSE of 0.13, whereas the manually tuned model had an MSE of 0.15. However, the performance on the test dataset remained relatively similar, with an MSE of 0.206 for the grid search model and 0.203 for the manually tuned model.
Although the difference in MSE is marginal, it indicates that the model with manually selected hyperparameters may have a slight edge in predictive accuracy. This result suggests that, in this specific case, a thoughtful manual selection of hyperparameters can be as effective as an exhaustive grid search, emphasizing the importance of domain knowledge and intuition in model tuning.
# Create a data frame for model results
results <- data.frame(
Model = c("Initial Model","Second Model", "Tuned Model", "Final Tuned Model"),
Training_MSE = c(mse_gbm_train,mse_gbm_train2, mse_gbm_tuned_train, mse_gbm_final_train),
Testing_MSE = c(mse_gbm_test, mse_gbm_test2, mse_gbm_tuned_test, mse_gbm_final_test),
Overfitting_Difference = c(overfitting_difference_initial, overfitting_difference_initial2, overfitting_difference_tuned, overfitting_difference_final)
)
# Print the results table
print(results)
## Model Training_MSE Testing_MSE Overfitting_Difference
## 1 Initial Model 0.193684381 0.217774281 -0.0240899001
## 2 Second Model 0.258698837 0.287677614 -0.0289787765
## 3 Tuned Model 0.153130196 0.203248099 -0.0501179032
## 4 Final Tuned Model 0.133801464 0.206072014 -0.0722705500
# Create a data frame for model results
results <- data.frame(
Model = c("Initial Model", "Second Model", "Tuned Model", "Final Tuned Model"),
Training_MSE = c(0.19, 0.26, 0.15, 0.13),
Testing_MSE = c(0.22, 0.29, 0.20, 0.21),
Overfitting_Difference = c(-0.024, -0.029, -0.050, -0.072)
)
# Set factor levels for the "Model" variable
results$Model <- factor(results$Model, levels = c("Initial Model", "Second Model", "Tuned Model", "Final Tuned Model"))
# Melt the data frame for better visualization
melted_results <- melt(results, id.vars = "Model", variable.name = "Metric")
# Create a line plot
ggplot(melted_results, aes(x = Model, y = value, group = Metric, color = Metric, shape = Metric, linetype = Metric)) +
geom_line() +
geom_point(size = 3) +
labs(title = "Model Comparison",
y = "MSE",
x = "Model",
color = "Metric") +
theme_minimal() +
theme(legend.position = "top")
Striking a balance between model complexity and generalization
performance is crucial when selecting the appropriate model. While the
“Final Tuned Model” demonstrates promising results in terms of MSE, its
higher overfitting difference suggests potential limitations in
generalizing to unseen data.
Therefore, depending on the specific requirements and constraints of your analysis or application, it may be prudent to consider an alternative model, such as the “Tuned Model,” which exhibits a lower overfitting difference while still maintaining competitive performance in terms of MSE.
In the context of the our current price prediction analysis, opting for the “Tuned Model” would be advisable. This decision ensures a more balanced trade-off between model complexity, overfitting, and predictive performance, thereby enhancing the model’s reliability in making accurate price predictions.
In this process we tune the Random Forest model using the best parameters obtained from the tuned Gradient Boosting model, potentially improving the performance of the Random Forest model.
The code snippet begins by extracting the best parameters from a tuned Gradient Boosting model. These parameters are then used to construct a hyperparameter grid for tuning a Random Forest model. Cross-validation control parameters are set up, and a grid search is performed to optimize the Random Forest model. Finally, the best parameters for the Random Forest model are printed. This process aims to leverage insights from the Gradient Boosting model to improve the performance of the Random Forest model.
set.seed(123456)
# Create a Random Forest model
rf_model <- randomForest(price ~ ., data = train.scaled,
ntree = 200,
mtry = 4,
nodesize = 9,
maxnodes = 20)
# Predictions on the training set
rf_pred_train <- predict(rf_model, newdata = train.scaled)
# Calculate MSE for Random Forest on training set
mse_rf_train <- mean((rf_pred_train - train.scaled$price) ^ 2)
# Print the MSE value for training set
cat("MSE for Random Forest on Training Set:", mse_rf_train, "\n")
## MSE for Random Forest on Training Set: 0.328845394
# Predictions on the test set
rf_pred_test <- predict(rf_model, newdata = test.scaled)
# Calculate MSE for Random Forest
mse_rf <- mean((rf_pred_test - test.scaled$price) ^ 2)
# Print the MSE value
cat("MSE for Random Forest on Test Set:", mse_rf, "\n")
## MSE for Random Forest on Test Set: 0.348988041
set.seed(123456)
# Create a Random Forest model
rf_model <- randomForest(price ~ ., data = train.scaled)
# Predictions on the training set
rf_pred_train <- predict(rf_model, newdata = train.scaled)
# Calculate MSE for Random Forest on training set
mse_rf_train <- mean((rf_pred_train - train.scaled$price) ^ 2)
# Print the MSE value for training set
cat("MSE for Random Forest on Training Set:", mse_rf_train, "\n")
## MSE for Random Forest on Training Set: 0.0350226282
# Predictions on the test set
rf_pred_test <- predict(rf_model, newdata = test.scaled)
# Calculate MSE for Random Forest
mse_rf <- mean((rf_pred_test - test.scaled$price) ^ 2)
# Print the MSE value
cat("MSE for Random Forest on Test Set:", mse_rf, "\n")
## MSE for Random Forest on Test Set: 0.201226508
The Random Forest model, when manually tuned with custom
hyperparameters, yielded an MSE of 0.33 on the training set
and 0.35 on the test set. However, the default parameter
setting resulted in significantly lower MSE values of 0.03
on the training set and 0.20 on the test set. Despite the
improved performance with default parameters, there was a noticeable
increase in overfitting, highlighting the importance of balancing model
complexity and generalization. Moreover, the custom-tuned model incurred
higher computational costs in terms of runtime. Therefore, further
exploration of alternative models may be necessary.
In conclusion, while Model1 (reg.tree) emerges as the preferred option among decision tree models, the final tuned gradient boosting model showcases the best predictive accuracy overall. However, model selection should consider not only performance metrics but also computational costs and overfitting concerns. Further exploration of alternative models and hyperparameter tuning strategies may be warranted to optimize performance effectively.
##Resources
R Studio and associated libraries
Kaggle dataset (provided under CC0: Public Domain). https://www.kaggle.com/datasets/shivachandel/kc-house-data/data
Shapefile for mapping https://gis-kingcounty.opendata.arcgis.com/