library(ggplot2) # plots and visualizations
library(readr) #read  data
library(dplyr) #data manipulation
library(ggpubr)  # For ggarrange function
library(tidyverse) #Comprehensive package for data manipulation and visualization.
library(reshape2) # transform the data) from wide format to long format
library(plotly) # interactive visualizations
library(ggalt) #geom_encircle function
library(knitr)
library(gridExtra)
library(viridis)

Data

Data Overview

house_data2 <- read.csv("r1.csv",header = TRUE, sep = ",")

Structure of Dataset

str(house_data2)
## 'data.frame':    26795 obs. of  31 variables:
##  $ id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ bathrooms    : num  3.25 1.75 2.25 3 2 2 1.75 1 2 3.5 ...
##  $ bedrooms     : int  4 3 4 2 4 3 4 2 3 6 ...
##  $ condition    : int  3 5 3 3 3 4 3 3 3 3 ...
##  $ date         : int  1401235207 1424736007 1402617607 1399248007 1427241607 1406678407 1406160007 1408320007 1408492807 1405814407 ...
##  $ feat01       : num  0.43 0.955 0.728 0.321 0.961 ...
##  $ feat02       : num  0.603 0.515 0.433 0.518 0.529 ...
##  $ feat03       : num  0.695 0.223 0.795 0.236 0.583 ...
##  $ feat04       : num  0.3106 0.806 0.306 0.0489 0.9978 ...
##  $ feat05       : num  0.506 0.374 0.43 0.39 0.5 ...
##  $ feat06       : num  0.4357 0.0234 0.7412 0.833 0.3574 ...
##  $ feat07       : num  0.394 0.556 0.124 0.495 0.519 ...
##  $ feat08       : num  0.68859 0.00761 0.17641 0.45371 0.55569 ...
##  $ feat09       : num  0.2762 0.8218 0.0896 0.4757 0.5348 ...
##  $ feat10       : num  0.538 0.854 0.46 0.692 0.134 ...
##  $ floors       : num  2 1.5 1 2 2 2 1 1 1 2 ...
##  $ grade        : int  10 8 7 8 7 7 6 6 7 9 ...
##  $ lat          : num  47.6 47.4 47.7 47.7 47.7 ...
##  $ long         : num  -122 -122 -122 -122 -122 ...
##  $ price        : int  1399000 431500 439900 530600 751200 263500 248600 273700 245400 769000 ...
##  $ sqft_above   : int  2140 3440 1220 1110 2040 1380 2290 940 1390 2220 ...
##  $ sqft_basement: int  840 0 790 160 600 0 0 0 0 190 ...
##  $ sqft_living  : int  2980 3440 2010 1270 2640 1380 2290 940 1390 2410 ...
##  $ sqft_living15: int  2200 2160 1820 1290 1330 1260 1240 940 1390 1980 ...
##  $ sqft_lot     : int  7000 10428 7575 1175 5000 8536 7765 5000 8250 6000 ...
##  $ sqft_lot15   : int  4800 8400 7831 1800 5000 8750 8215 5000 7875 6000 ...
##  $ view         : int  3 0 0 0 0 0 0 0 0 4 ...
##  $ waterfront   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ yr_built     : int  1900 1974 1974 2000 1949 1955 1936 1951 1987 1916 ...
##  $ yr_renovated : int  2014 0 0 0 0 0 1953 0 0 1990 ...
##  $ zipcode      : int  98144 98058 98034 98103 98117 98148 98146 98126 98042 98136 ...

The dataset consists of 26.795 observations (rows) and 31 variables (columns). The variables include information such as id, date, price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade, sqft_above, sqft_basement, yr_built, yr_renovated, zipcode, lat, long, sqft_living15, sqft_lot15 and 10 artificial variables from feat01 to feat10.

The id variable appears to be a unique identifier for each observation. All variables structured as numeric/integer except the date variable which is currently stored as a character type. It might be useful to convert it to a date type for time-related analyses.However we will not make time-varient analyze for that project for now.

It’s important to note that while all variables may be structured as numeric, certain variables, despite their numeric representation, hold categorical significance. These categorical variables are essentially numerically coded to represent different categories or levels within the dataset. This nuance is crucial to consider when interpreting and analyzing the data.

Summary Statistics

summary(house_data2)
##        id          bathrooms        bedrooms        condition    
##  Min.   :    1   Min.   :0.000   Min.   : 0.000   Min.   :1.000  
##  1st Qu.: 6700   1st Qu.:1.500   1st Qu.: 3.000   1st Qu.:3.000  
##  Median :13398   Median :2.250   Median : 3.000   Median :3.000  
##  Mean   :13398   Mean   :2.107   Mean   : 3.372   Mean   :3.419  
##  3rd Qu.:20097   3rd Qu.:2.500   3rd Qu.: 4.000   3rd Qu.:4.000  
##  Max.   :26795   Max.   :8.000   Max.   :33.000   Max.   :5.000  
##       date               feat01              feat02           feat03         
##  Min.   :1.399e+09   Min.   :0.0000095   Min.   :0.0000   Min.   :0.0000309  
##  1st Qu.:1.406e+09   1st Qu.:0.2500121   1st Qu.:0.3866   1st Qu.:0.2487362  
##  Median :1.414e+09   Median :0.4994889   Median :0.4641   Median :0.5011274  
##  Mean   :1.415e+09   Mean   :0.5009999   Mean   :0.4648   Mean   :0.4997116  
##  3rd Qu.:1.424e+09   3rd Qu.:0.7535160   3rd Qu.:0.5428   3rd Qu.:0.7498798  
##  Max.   :1.433e+09   Max.   :0.9999751   Max.   :1.0000   Max.   :0.9999813  
##      feat04              feat05           feat06              feat07        
##  Min.   :0.0000137   Min.   :0.0000   Min.   :0.0000349   Min.   :0.000044  
##  1st Qu.:0.2460411   1st Qu.:0.3745   1st Qu.:0.2507906   1st Qu.:0.248796  
##  Median :0.4993561   Median :0.4523   Median :0.5014055   Median :0.502817  
##  Mean   :0.4976852   Mean   :0.4543   Mean   :0.5014324   Mean   :0.500995  
##  3rd Qu.:0.7462034   3rd Qu.:0.5316   3rd Qu.:0.7520400   3rd Qu.:0.752308  
##  Max.   :0.9999877   Max.   :1.0000   Max.   :0.9999267   Max.   :0.999858  
##      feat08              feat09             feat10              floors     
##  Min.   :0.0000912   Min.   :0.000023   Min.   :0.0000122   Min.   :1.000  
##  1st Qu.:0.2521830   1st Qu.:0.248318   1st Qu.:0.2459837   1st Qu.:1.000  
##  Median :0.5037222   Median :0.500146   Median :0.5015440   Median :1.000  
##  Mean   :0.5012478   Mean   :0.499521   Mean   :0.4993034   Mean   :1.484  
##  3rd Qu.:0.7519803   3rd Qu.:0.750082   3rd Qu.:0.7510612   3rd Qu.:2.000  
##  Max.   :0.9999338   Max.   :0.999920   Max.   :0.9999926   Max.   :3.500  
##      grade            lat             long            price        
##  Min.   : 1.00   Min.   :47.16   Min.   :-122.5   Min.   :  76100  
##  1st Qu.: 7.00   1st Qu.:47.47   1st Qu.:-122.3   1st Qu.: 322800  
##  Median : 7.00   Median :47.57   Median :-122.2   Median : 451246  
##  Mean   : 7.65   Mean   :47.56   Mean   :-122.2   Mean   : 540920  
##  3rd Qu.: 8.00   3rd Qu.:47.68   3rd Qu.:-122.1   3rd Qu.: 647000  
##  Max.   :13.00   Max.   :47.78   Max.   :-121.3   Max.   :7701000  
##    sqft_above   sqft_basement     sqft_living    sqft_living15 
##  Min.   : 290   Min.   :   0.0   Min.   :  290   Min.   : 399  
##  1st Qu.:1200   1st Qu.:   0.0   1st Qu.: 1430   1st Qu.:1490  
##  Median :1560   Median :   0.0   Median : 1910   Median :1840  
##  Mean   :1785   Mean   : 294.6   Mean   : 2079   Mean   :1987  
##  3rd Qu.:2200   3rd Qu.: 570.0   3rd Qu.: 2540   3rd Qu.:2360  
##  Max.   :9410   Max.   :4820.0   Max.   :13540   Max.   :6210  
##     sqft_lot         sqft_lot15          view          waterfront      
##  Min.   :    520   Min.   :   651   Min.   :0.0000   Min.   :0.000000  
##  1st Qu.:   5100   1st Qu.:  5120   1st Qu.:0.0000   1st Qu.:0.000000  
##  Median :   7667   Median :  7653   Median :0.0000   Median :0.000000  
##  Mean   :  15342   Mean   : 12914   Mean   :0.2358   Mean   :0.007837  
##  3rd Qu.:  10800   3rd Qu.: 10139   3rd Qu.:0.0000   3rd Qu.:0.000000  
##  Max.   :1651359   Max.   :871200   Max.   :4.0000   Max.   :1.000000  
##     yr_built     yr_renovated        zipcode     
##  Min.   :1900   Min.   :   0.00   Min.   :98001  
##  1st Qu.:1951   1st Qu.:   0.00   1st Qu.:98033  
##  Median :1974   Median :   0.00   Median :98065  
##  Mean   :1970   Mean   :  86.56   Mean   :98078  
##  3rd Qu.:1995   3rd Qu.:   0.00   3rd Qu.:98117  
##  Max.   :2015   Max.   :2015.00   Max.   :98199

id - The id variable represents a unique identifier for each home sold.

date - The date variable, contains information about the date of the house sale and spans from 2014, to 2015.

price: - The price variable is the dependent variable and shows a wide range, with the minimum house price at $76,100 and the maximum at $7,701,000. - The median house price is $451,246, and the mean is $540,920.

bedrooms and bathrooms: - The variables related to the number of bedrooms and bathrooms (0.5 accounts for a room with a toilet but no shower) exhibit varying ranges and distributions. - The number of bedrooms ranges from 0 to 33, with a mean of approximately 3.37. - The number of bathrooms ranges from 0 to 8, with a mean of approximately 2.11.

sqft_living and sqft_lot: - These variables represent the size of houses. - sqft_living reflects the Square footage of the apartments interior living area, ranging from 290 to 13,540 square feet, with a mean of 2080. - sqft_lot represents the lot size, ranging from 520 to 1,651,359 square feet, with a mean of 15,107.

floors -The floors variable is represents the levels of the houses. The majority of houses have 1 or 1.48 floors. - Notably, there seems to be a common occurrence of houses with 1.5 floors, while the mean is approximately 1.494. - This suggests that many houses have a split-level design or additional space on an upper level, contributing to the fractional floor values.

waterfront:
- The waterfront variable is a dummy variable mostly 0 , represents the property has no waterfront view and 1 for with waterfront. Mean is equal to 0.008 which shows that most of the houses not waterfront.

view and condition:
- view represents the overall view rating (0 to 4) with a mean of 0.23 -condition represents the overall condition rating (1 to 5) with a mean of 3.42 for condition.

grade: - grade represents the overall grade given to the housing unit and ranges from 1 to 13 where 1-3 falls short of building construction and design, 7 has an average level of construction and design, and 11-13 have a high quality level of construction and design. and mean is equal to 7.65 and median is 7.

sqft_above, and sqft_basement: - sqft_above and sqft_basement show the square footage above ground and is below ground level¶ (in the basement), respectively.

yr_built, yr_renovated: - Houses were built between 1900 and 2015 (yr_built), with the majority built in the mid to late 20th century. - yr_renovated indicates the last renovation year, with a mean of 86.56 and many zero values, suggesting no renovations.

Geographical Information (lat, long, zipcode): - lat and long provide latitude and longitude information of house locations, respectively. - zipcode represents the zip code of the house location.

sqft_living15 and sqft_lot15: - sqft_living15 and sqft_lot15 indicate the living room and lot size in 2015, reflecting potential renovations or changes. (? some sources mention it differently and main source couldnt find !!!)

These summary statistics provide an overview of the distribution and characteristics of each numeric variable in the dataset, with a specific focus on understanding the relationships with the dependent variable, ‘price.’

Missing Values

Summary statistics does not show any missing values but we approved with is.na() function.

# Check for missing values in the entire dataset
missing_values <- any(is.na(house_data2))
missing_values
## [1] FALSE

Data Exploration

In organizing our variables by type, we enhance the precision of our analysis and visualization methods. This thoughtful categorization enables us to apply tailored techniques to each variable type, ensuring more insightful and nuanced exploration of the dataset.

# All variables
all_vars <- c( "id", "bathrooms", "bedrooms", "condition", "date", "feat01", "feat02", "feat03", "feat04", "feat05", "feat06", "feat07", "feat08", "feat09", "feat10", "floors", "grade", "lat", "long", "price", "sqft_above", "sqft_basement", "sqft_living", "sqft_living15", "sqft_lot", "sqft_lot15", "view", "waterfront", "yr_built", "yr_renovated", "zipcode" )

# Continuous Numeric Variables
cont_vars <- c("price", "sqft_living","sqft_living15","sqft_lot",
               "sqft_lot15","sqft_above", "sqft_basement")

#Artificial Continuous Numeric Variables
art_vars <- c( "feat01", "feat02", "feat03", "feat04", "feat05", "feat06", "feat07", "feat08", "feat09", "feat10" )

# Discrete Numeric Variables
disc_vars <- c("bedrooms", "floors", "bathrooms")


# Categorical Variables
cat_vars <- c("waterfront", "view", "condition", "grade")

# Date Variables
date_vars <- c("date", "yr_built", "yr_renovated")

# Geographical Variables
geo_vars <- c("lat", "long", "zipcode")

Exploratory Data Analysis (EDA)

Utilize various R packages (e.g., ggplot2, plotly) for data exploration. Conduct correlation analysis, distribution analysis.

#  to display numeric values without scientific notation and with more digits
options(scipen = 999, digits = 9)

Distribution of Contunious Numeric Variables

# Set up a layout grid
par(mfrow = c(4, 2), mar = c(4, 4, 2, 1))  # Adjust margins for better appearance

# Create histograms for  numeric variables
for (cont in cont_vars) {
  # Determine appropriate bin width based on the range and number of observations
  bin_width <- (max(house_data2[[cont]]) - min(house_data2[[cont]])) /
    sqrt(length(house_data2[[cont]]))
  
  # Create histogram with scaled x-axis
  hist(house_data2[[cont]], main = paste("Distribution of", cont), xlab = cont, 
       col = "skyblue", breaks = seq(min(house_data2[[cont]]),
                                     max(house_data2[[cont]]) + bin_width, bin_width))

  # Add smoother distribution line
    density_curve <- density(house_data2[[cont]], bw = "nrd0")
    lines(density_curve$x, density_curve$y * bin_width * length(house_data2[[cont]]),
          col = "red", lwd = 2)
    
  # Add normal distribution line
  mu <- mean(house_data2[[cont]])
  sigma <- sd(house_data2[[cont]])
  x <- seq(min(house_data2[[cont]]), max(house_data2[[cont]]), length = 100)
  y <- dnorm(x, mean = mu, sd = sigma) * bin_width * length(house_data2[[cont]])
  lines(x, y, col = "blue", lwd = 2)  
  
  # Identify potential outliers using a boxplot
  boxplot(house_data2[[cont]], main = paste("Boxplot of", cont), col = "lightblue",
          border = "black", horizontal = TRUE)
  }

# Reset the plotting layout
par(mfrow = c(1, 1))

The visual inspection of the plots suggests that distribution of the variables are skewed right/ non-normal distributions with a considerable number of outliers. Given the context of the dataset, where very luxurious or unique properties may contribute to these extreme values, it is justifiable to observe such outliers.

Instead of removing or transforming these outliers, a more suitable strategy might involve employing robust statistical methods to handle outliers problem for further analysis. Robust methods are designed to be less sensitive to extreme values, allowing for a more reliable analysis that acknowledges the presence of these high-end properties without disproportionately impacting the results.

Distribution of Artificial Contunious Numeric Variables

# Set up a layout grid
par(mfrow = c(3, 2), mar = c(4, 4, 2, 1))  # Adjust margins for better appearance

# Create histograms for  numeric variables
for (art in art_vars) {
  # Determine appropriate bin width based on the range and number of observations
  bin_width <- (max(house_data2[[art]]) - min(house_data2[[art]])) /
    sqrt(length(house_data2[[art]]))
  
  # Create histogram with scaled x-axis
  hist(house_data2[[art]], main = paste("Distribution of", art), xlab = art, 
       col = "skyblue", breaks = seq(min(house_data2[[art]]),
                                     max(house_data2[[art]]) + bin_width, bin_width))

  # Add smoother distribution line
    density_curve <- density(house_data2[[art]], bw = "nrd0")
    lines(density_curve$x, density_curve$y * bin_width * length(house_data2[[art]]),
          col = "red", lwd = 2)
    
  # Add normal distribution line
  mu <- mean(house_data2[[art]])
  sigma <- sd(house_data2[[art]])
  x <- seq(min(house_data2[[art]]), max(house_data2[[art]]), length = 100)
  y <- dnorm(x, mean = mu, sd = sigma) * bin_width * length(house_data2[[art]])
  lines(x, y, col = "blue", lwd = 2)  
  
  # Identify potential outliers using a boxplot
  boxplot(house_data2[[art]], main = paste("Boxplot of", art), col = "lightblue",
          border = "black", horizontal = TRUE)
  }

# Reset the plotting layout
par(mfrow = c(1, 1))

We found that “feat02” and “feat05” follow a typical bell-shaped curve, suggesting a normal distribution and making them promising for our model. On the contrary, other artificially created variables have a uniform distribution, lacking distinct patterns. Uniformly distributed variables are less informative for predictions due to their even distribution, so we opted not to include them in our model.

Distribution of Discrete Numeric Variables

# Visualize the Distribution 
for (disc in disc_vars) {
  # Convert discrete numeric variables to factors
  house_data2[[disc]] <- as.factor(house_data2[[disc]])
  
  # Create bar plots for discrete numeric variables
  bar_plot <- ggplot(house_data2, aes(x = !!sym(disc), fill = !!sym(disc))) +
    geom_bar(position = "dodge", fill = "skyblue", alpha = 0.7, width = 0.7) +
    labs(title = paste("Distribution of", disc), x = disc, y = "Count") +
    theme_minimal() +
    geom_text(stat = "count", 
              aes(label = scales::percent(round(after_stat(count)/sum(after_stat(count)), 5))),
              position = position_dodge(0.7), vjust = -0.3, size= 3) +  # Add percentage
    theme(axis.text.x = element_text(angle = 45, hjust = 1))  # Rotate x-axis labels
  
  # Display the plot
  print(bar_plot)
}

Bedrooms :

The distribution of bedrooms in the dataset reveals a clear preference for houses with 3 bedrooms, constituting nearly half of the entries (45.5%). 4-bedroom homes follow closely with 31.8%, and 2 and 5-bedroom configurations are also prevalent, making up 12.8% and 7.4%. However, 0-bedroom and 1-bedroom houses have notably lower percentages with approximately 0.1% and 0.9%, respectively. The distribution is positively skewed, with a peak around 3 bedrooms.

Bathrooms Configuration:

The dataset showcases a diverse distribution of bathrooms. Houses with 2.5 bathrooms are most common, representing 24.9%. Additionally, 1 bathroom and 1.75 bathrooms are prevalent at 17.8% and 14.1%, respectively. The distribution exhibits multiple peaks, suggesting a variety of bathroom count configurations in the dataset.

In the United States, bathrooms are generally categorized as master bathroom, containing a varied shower and a tub that is adjoining to a master bedroom, a “full bathroom” (or “full bath”), containing four plumbing fixtures: bathtub/shower, or (separate shower), toilet, and sink; “half (1/2) bath” (or “powder room”) containing just a toilet and sink; and “3/4 bath” containing toilet, sink, and shower, although the terms vary from market to market. In some U.S. markets, a toilet, sink, and shower are considered a “full bath”. (wikipedia)

Floor Counts:

When considering the number of floors, Houses with 1 floor are predominant, making up 49.4% of the dataset. 2 floors houses follow closely at 38.1%, with 1.5 floors representing 8.8%. The distribution is skewed towards fewer floors, with a sharp decline for houses with more than 2 floors.

Note on 0 Values:

In the context of houses requiring bedrooms and bathrooms, the presence of 0 values in these categories may indicate missing or incomplete data. It’s uncommon for a house to have zero bedrooms or bathrooms. Investigating and addressing the reasons behind these zero values is crucial for ensuring the quality and accuracy of the dataset, as well as the reliability of any analyses conducted.

Distribution of Categorical Variables

# Set up a layout grid
grid_layout <- matrix(c(1, 2, 3, 4), nrow = 2, byrow = TRUE)

# Create more advanced bar plots for categorical variables
plots <- list()
for (cat in cat_vars) {
  # Convert categorical variables to factors
  house_data2[[cat]] <- as.factor(house_data2[[cat]])

  # Create bar plots for categorical variables
  bar_plot <- ggplot(house_data2, aes(x = factor(!!sym(cat)), fill =
          factor(!!sym(cat)))) + geom_bar() +geom_text(stat = "count", 
    aes(label = scales::percent(round(after_stat(count)/sum(after_stat(count)), 3))),
    vjust = 0.2, size= 2.5) +  # Add percentage labels
    labs(title = paste("Distribution of", cat), x = cat, y = "Count") +
    theme_minimal() +
    theme(legend.position = "none")

  # Add the plot to the list
  plots[[cat]] <- bar_plot
}

# Arrange the plots in a 2x2 grid
grid.arrange(grobs = plots, layout_matrix = grid_layout)

# Reset the plotting layout
par(mfrow = c(1, 1))

Out of all observations, only 1 percent of houses are located on the waterfront.

Additionally, the majority of houses (90.2%) have a view score of 0. Among the remaining view scores, 4.5% have a score of 2, while scores of 1 and 4 each account for 1.5%. The remaining 2.4% of houses have a view score of 3. we observed that the ‘view’ variable predominantly contained 0 values, suggesting that many houses had not been viewed. In response, we decided to engineer a new feature named ‘viewed’ to capture this information more explicitly. The ‘viewed’ variable takes on a value of 1 if the house has been viewed and 0 otherwise.

# Create a new variable 'viewed' with value 1 if 'view' is not 0, and 0 otherwise
house_data2$viewed <- ifelse(house_data2$view != 0, 1, 0)

# Drop the original 'view' variable
house_data2 <- house_data2[, !names(house_data2) %in% c("view")]

The majority of houses in the dataset are in good to average condition. Approximately 91.7% of houses fall within Condition 3, indicating that a significant portion of the properties is well-maintained.Condition 4 homes represent 26.3%, suggesting a sizable proportion of houses are in better-than-average condition. Meanwhile, Condition 5 homes, which likely denote excellent condition, constitute 7.9% of the dataset.

The distribution of grades reflects a diverse range of housing quality. A significant portion of houses falls within Grade 7 (41.6%) and Grade 8 (28.1%), indicating properties with a higher level of construction and design. Grades 9 and 10 together contribute 17.3%, highlighting a considerable proportion of houses with superior construction and design quality.The dataset includes a limited number of houses with lower grades (1-6), with most grades in this range having negligible representation (close to 0%).The distribution is skewed towards higher grades, emphasizing the prevalence of houses with above-average construction and design quality in the dataset.

Handling Zero Values in Bedroom & Bathroom

library(ggplot2)

# Count of observations where bedrooms is 0
count_bedrooms_0 <- nrow(house_data2[house_data2$bedrooms == 0, ])

# Count of observations where bathrooms is 0
count_bathrooms_0 <- nrow(house_data2[house_data2$bathrooms == 0, ])

# Create a data frame for plotting
count_data <- data.frame(Category = c("Bedrooms=0", "Bathrooms=0"),
                         Count = c(count_bedrooms_0, count_bathrooms_0))

# Create a bar plot
ggplot(count_data, aes(x = Category, y = Count, fill = Category)) +
  geom_bar(stat = "identity") +
  labs(title = "Count of Observations with Bedrooms=0 and Bathrooms=0",
       x = "Category",
       y = "Count") +
  theme_minimal()

# Function to impute missing values using the median based on non-zero values
impute_nonzero <- function(var) {
  non_zero_values <- as.numeric(var[var != 0])
  if (length(non_zero_values) > 0) {
    imputed_value <- median(non_zero_values)
    var[var == 0] <- imputed_value
  }
  return(var)
}

# Convert the variables to numeric again
house_data2$bedrooms <- as.numeric(as.character(house_data2$bedrooms))
house_data2$bathrooms <- as.numeric(as.character(house_data2$bathrooms))
house_data2$floors <- as.numeric(as.character(house_data2$floors))
house_data2$waterfront <- as.numeric(as.character(house_data2$waterfront))
house_data2$condition <- as.numeric(as.character(house_data2$condition))
house_data2$grade <- as.numeric(as.character(house_data2$grade))

# Apply the imputation function to bedrooms and bathrooms
house_data2$bedrooms <- impute_nonzero(house_data2$bedrooms)
house_data2$bathrooms <- impute_nonzero(house_data2$bathrooms)
##redefine the variables for further stages
# Categorical Variables
cat_vars <- c("waterfront", "viewed", "condition", "grade")
 # All variables
all_vars <- house_data2[, c("price", "bedrooms", "bathrooms", "sqft_living",
                            "sqft_lot","floors", "waterfront", "viewed", "condition",
                            "grade", "sqft_above", "sqft_basement", "yr_built",
                            "yr_renovated", "zipcode", "lat", "long", "sqft_living15",
                            "sqft_lot15")]

Scatter plots

Continuous variables vs. “Price”

attach(house_data2)

# Create an empty list to store plots
plots_list <- list()

# Iterate through variables and create scatter plots
for (variable in cont_vars[-1]) {
  # Create scatter plot with regression line
  scatter_plot <- ggplot(house_data2, aes_string(x = variable, y = "price")) +
    geom_point(color = "orange") +
    geom_smooth(method = "lm", se = FALSE, color = "red") +
    geom_encircle(data = house_data2 %>% filter(price > 6000000),
                  color = "red", size = 2, expand = 0.05) +
     geom_encircle(data = house_data2 %>% filter(bedrooms == 33),
                  color = "green", size = 2, expand = 0.05) +
    labs(title = paste(variable, "vs. Price"), x = variable, y = "price") +
    theme_minimal()

  # Add the plot to the list
  plots_list[[variable]] <- scatter_plot
}
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
# Arrange the plots in a grid
advanced_plots <- ggarrange(plotlist = plots_list, ncol = 2, nrow = 2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
# Display the arranged plots
print(advanced_plots)
## $`1`

## 
## $`2`

## 
## attr(,"class")
## [1] "list"      "ggarrange"

We generate scatter plots for various continuous variables against housing prices, utilizing light green points and red regression lines for visualization. A red circle is incorporated to emphasize observations where housing prices exceed $6,000,000, indicating potential outliers. The scatter plots collectively underscore similarities in the distribution patterns of all continuous variables concerning price. This graphical exploration enhances the understanding of the correlation between each continuous variable and housing prices, with the filter for prices drawing attention to three potential outliers. Identifying and comprehending such outliers is crucial for robust data analysis, aiding in informed decisions regarding their impact on statistical models and subsequent analyses. Further investigation and domain knowledge are typically required to interpret these outliers within the dataset’s context.

Discreate variables vs “Price”

# Create an empty list to store plots
plots_list <- list()

# Iterate through variables and create scatter plots
for (variable in disc_vars) {
  # Create scatter plot with regression line
  scatter_plot <- ggplot(house_data2, aes_string(x = variable, y = "price")) +
    geom_jitter(width = .3, alpha = .3, color = "blue") + # Introduce a noise
    geom_smooth(method = "lm", se = FALSE, color = "red") +
    geom_encircle(data = house_data2 %>% filter(price > 6000000),
     color = "red", size = 2, expand = 0.05) + # Add an encircling  for high prices
    geom_encircle(data = house_data2 %>% filter(bedrooms == 33),
                  color = "green", size = 2, expand = 0.05) + # Add an  encirling for
    #bedrooms == 33
    labs(title = paste(variable, "vs. Price"), x = variable, y = "Price")

  # Add the plot to the list
  plots_list[[variable]] <- scatter_plot
}

# Arrange the plots in a grid
advanced_plots <- ggarrange(plotlist = plots_list, ncol = 2, nrow = 2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
# Display the arranged plots
print(advanced_plots)

In this code update, we continued our exploration of the dataset by examining the relationship between housing prices and discrete variables. We introduced noise for a more nuanced view and identified high-priced outliers, visualizing them with encircling shapes.

As an additional step, we focused on the unusual case where the number of bedrooms equals 33. We specifically circled these observations using a distinctive green color. This targeted analysis aims to spotlight and investigate unique patterns and outliers within the data, enhancing our understanding of their impact on housing prices. Our approach reflects an iterative process, adapting visualizations to reveal hidden insights in the dataset.

# Find indices where 'bedrooms' is 33
index_bedrooms_33 <- which(house_data2$bedrooms == 33)


# Replace the 'bedrooms' value of 33 with the median value of bedrooms
house_data2$bedrooms[index_bedrooms_33] <- median(house_data2$bedrooms, na.rm = TRUE)

Hexbin Visualization: Housing Prices and Subset Encircling

# Other advanced visualizations for analysis ( hexbin)
p <- ggplot(house_data2, aes(x = sqft_living, y = price)) +
  geom_hex(bins = 50, aes(fill = ..count..), color = "darkblue") +
      geom_encircle(data = house_data2 %>% filter(bedrooms == 33),
                  color = "green", size = 2, expand = 0.05) + 
    geom_encircle(data = house_data2 %>% filter(price > 6000000), color = "red", size =  1, expand = 0.05,
                linetype ="dashed") +
  labs(title = 'Hexbin Plot: Housing Price vs. Square Footage of Living Space',
       x = 'Square Footage of Living Space', y = 'Price')  + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Reverse the color scale
p + 
  scale_fill_distiller(palette = "Blues", direction = 1) +
  theme_bw() +
  scale_x_continuous(labels = scales::comma) +
  scale_y_continuous(labels = scales::dollar_format(scale = 0.001)) 
## Warning: The dot-dot notation (`..count..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(count)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

In this advanced visualization, we employed a hexbin plot to explore the relationship between housing price and square footage of living space. The hexbin plot effectively communicates the data density, revealing insights into the distribution of housing prices concerning living space. Notably, we observed a high density in areas where prices are lower than $2 million and square footage of living space is below 2,000 square feet. We continued the analysis by encircling specific data points of interest. The red dashed circle encompasses houses with prices exceeding $6,000,000. Additionally, we maintained the encircling of the green circle around properties with 33 bedrooms and the blue circle around those with a grade less than 3. These encirclings help to highlight and differentiate distinct subsets within the dataset, providing a nuanced understanding of the data distribution. The refined aesthetics, including color-coded circles and improved line types, enhance the visual appeal and interpretability of the plot. This visualization strategy builds upon the previous stages, offering a comprehensive exploration of the dataset’s intricate patterns and outliers.

Checking Dublicates

house_data2[duplicated(house_data2), ]
##  [1] id            bathrooms     bedrooms      condition     date         
##  [6] feat01        feat02        feat03        feat04        feat05       
## [11] feat06        feat07        feat08        feat09        feat10       
## [16] floors        grade         lat           long          price        
## [21] sqft_above    sqft_basement sqft_living   sqft_living15 sqft_lot     
## [26] sqft_lot15    waterfront    yr_built      yr_renovated  zipcode      
## [31] viewed       
## <0 rows> (or 0-length row.names)
duplicates <- house_data2[duplicated(house_data2$id), ]
dim(duplicates)
## [1]  0 31

The dataset analysis provided two distinct findings regarding duplicates. First, a comprehensive scan across all columns of the dataset did not reveal any duplicate entries, suggesting the entire dataset is unique in its entirety. Secondly, focusing specifically on the ‘id’ column—a unique identifier for each home sold— and couldn’t find any duplication. However we recognized that id column is renumarized and the original dataset had the duplication on id which can cause also some issue on model and needs to underline it.

Scaling the dataset

Considering the initial exploratory data analysis (EDA), it is evident that the dataset exhibits varying ranges among its features, necessitating the application of scaling techniques. Additionally, the presence of numerous positive outliers suggests the adoption of robust scaling methods and the utilization of robust models for subsequent production stages.Unlike standard methods, a robust scaler is less influenced by extreme values. In R, we harnessed the power of the robustbase package, specifically using the scale function. This function automatically applies robust scaling by centering and scaling the data. Setting the center and scale arguments to TRUE ensures the robust scaling methodology, making our normalization process resilient to outliers.

# Extract numeric columns
numeric_cols <- sapply(house_data2, is.numeric)
df_numeric <- house_data2[, numeric_cols, drop = FALSE]

# Apply robust scaling using scale
df_scaled <- as.data.frame(scale(df_numeric, center = TRUE, scale = TRUE))

Correlation Analysis

# Conduct correlation analysis without scaling
cor_matrix <- cor(df_scaled[, numeric_cols])


# Create a heatmap for correlation values
melted_correlation <- melt(cor_matrix, na.rm = TRUE)

p1 <- ggplot(melted_correlation, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
                       midpoint = 0, limit = c(-1, 1), space = "Lab", 
                       name = "Correlation") +
  geom_text(aes(label = ifelse(abs(value) > 0.3, round(value, 2), "")), vjust = 1,
            size = 2) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_fixed()

p1

Initially, we constructed a correlation matrix to discern relationships among variables in the dataset. To enhance interpretability, we applied a filter, selecting correlations with an absolute value greater than 0.5. This focused approach facilitates easier interpretation by highlighting strong correlated variables. The choice of the cutoff level for correlation analysis depends on the specific goals of the analysis and the nature of the data. Commonly used cutoff values for correlation coefficients between 0.5-0.7 for moderate correlation and above 0.7 for strong correlation. Eventhough we in the begging chooses 0.7 and then 0.5 cutoff , the results showed that The price variable stands out with a strong positive correlation of 0.70 with only the square footage of living space (sqft_living). While this suggests a notable linear relationship between these two factors,we recognized it may be beneficial to also consider variables with moderate correlations to price, as they could provide additional insights into determinants of house prices beyond just living space. So we decreased the cutoff level as 0.30 according that.

# Select correlations with "price" where abs(correlation) > 0.3
price_correlations <- round((cor_matrix[abs(cor_matrix[,"price"]) > 0.3, "price"]),2)

# Print the selected correlations
print(price_correlations)
##     bathrooms      bedrooms        feat02        feat05         grade 
##          0.52          0.31          0.41          0.37          0.66 
##           lat         price    sqft_above sqft_basement   sqft_living 
##          0.31          1.00          0.60          0.32          0.70 
## sqft_living15        viewed 
##          0.58          0.36
# Create a data frame for the selected correlations
correlation_data <- data.frame(Feature = names(price_correlations), Correlation = price_correlations)

# Create a bar plot
ggplot(correlation_data, aes(x = Feature, y = Correlation, fill = Feature)) +
  geom_bar(stat = "identity", width = 0.7) +
  labs(title = "Correlation with Price (abs > 0.3)",
       y = "Correlation",
       x = "Feature") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  geom_text(aes(label = Correlation), vjust = -0.5, size = 3)

According to final results (with 0.3 cutoff):

The correlation analysis reveals significant relationships between the “price” variable and various features in the dataset. The strongest positive correlation is observed with the “grade” of the house (0.66), indicating that higher-grade houses tend to have higher prices. The size-related features, such as “sqft_living” (0.70) and “sqft_living15” (0.58), also exhibit strong positive correlations, suggesting that larger living spaces, both for the specific property and the surrounding area, contribute positively to house prices. Additionally, the number of “bathrooms” shows a moderate positive correlation (0.52), emphasizing their influence on property values. The “bedrooms” variable demonstrates a weaker positive correlation (0.31). Features like “feat02” (0.41) and “feat05” (0.37) also display positive correlations, though not as strong as others. Moreover, the “viewed” status has a moderate positive correlation (0.36) with house prices. These findings provide valuable insights into the factors driving housing prices, guiding further analysis and modeling decisions.

Machine Learning

# Load necessary libraries
library(tidyverse)
library(caret)
library(tree)
library(randomForest)
library(rpart)
library(here)
library(gbm)
library(xgboost)
library(robustbase)

Preprocessing for ML models

In the preprocessing step for ML prediction, we begin by removing unnecessary features from our dataset. We first drop the features that are not relevant or useful for our modeling purposes. In this case, we’re dropping features like “id”, “date”, “lat”, “long”, and “zipcode” from our dataset. While acknowledging the potential significance of spatial correlation and temporal analysis, we consciously choose to set aside these considerations for the current project. However, we note that exploring these aspects could be advantageous for future projects, providing avenues for deeper investigation and potentially enriching our predictive models.

# Drop specified features
df <- df_scaled[, !(names(df_scaled) %in% c("id","date", "lat", "long", "zipcode"))]

Next, we select only the relevant columns from the original dataset based on variables that are highly correlated with prices. These variables are determined through a correlation analysis.

# Select only the relevant columns from the original dataset
selected_vars <- c("price", "bathrooms", "bedrooms", "feat02", "feat05", "grade", "sqft_above",
                   "sqft_basement", "sqft_living", "sqft_living15", "viewed")

df2 <- df_scaled[, selected_vars, drop = FALSE]

Split the data into training and testing sets

After pre-processing the data, we split it into training and testing sets. This step is crucial for evaluating the performance of our machine learning models.

set.seed(123456789)
training_obs <- createDataPartition(df$price, 
                                    p = 0.7, 
                                    list = FALSE)
train.scaled <- df[training_obs,]
test.scaled  <- df[-training_obs,]
set.seed(123456789)
training_obs2 <- createDataPartition(df2$price, 
                                    p = 0.7, 
                                    list = FALSE)
train.scaled2 <- df[training_obs2,]
test.scaled2  <- df[-training_obs2,]

These codes use the createDataPartition function from the caret package to split the dataset into training and testing sets. The training set contains 70% of the data, while the testing set contains the remaining 30%.

Building the Models

Finally, we define the model formulas for our machine learning models. These formulas specify the relationship between the predictors (features) and the target variable (price).

We decided to utilize tree-based models for our analysis. While we aimed to incorporate all variables in our first model (model1), we also wanted to explore the insights gleaned from correlation analysis. Hence, we opted to create two distinct models for comparison.

model1: Includes all predictors except for the variables that were dropped during pre-processing. model2: Includes only the variables that were selected based on correlation analysis.

These model formulas will be used in subsequent steps for model training and evaluation.

# Define the model formula
model1 <- price ~ .
model2 <- price ~ bathrooms + bedrooms + feat02 + feat05 + grade +
                  sqft_above + sqft_basement + sqft_living + sqft_living15 + viewed

Regression Trees

Regression trees offer a powerful framework for predicting house prices, allowing us to uncover intricate patterns in the data. By following a systematic approach and leveraging the capabilities of regression trees, we can build robust predictive models that drive informed decision-making in the real estate industry.

Model1

set.seed(123)
reg.tree <- tree(model1, train.scaled)
reg.tree
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 18758 18810.6000  0.00095086  
##    2) grade < 0.725091 15109  4260.0800 -0.27780100  
##      4) sqft_living < -0.0437263 10263  1747.0600 -0.42426700 *
##      5) sqft_living > -0.0437263 4846  1826.5800  0.03238950  
##       10) yr_built < -0.53984 1087   587.4880  0.40996100 *
##       11) yr_built > -0.53984 3759  1039.3200 -0.07679390 *
##    3) grade > 0.725091 3649  8515.4200  1.15515000  
##      6) sqft_living < 2.33753 3169  3199.3200  0.86468400  
##       12) yr_built < -0.0262972 501   732.7710  1.81075000 *
##       13) yr_built > -0.0262972 2668  1933.9300  0.68703200  
##         26) grade < 1.57815 1770   669.9660  0.42354100 *
##         27) grade > 1.57815 898   898.8670  1.20638000 *
##      7) sqft_living > 2.33753 480  3283.5900  3.07280000  
##       14) sqft_living < 6.37701 469  2051.8300  2.84581000  
##         28) feat05 < 1.126 274   597.0880  2.07160000 *
##         29) feat05 > 1.126 195  1059.7300  3.93368000  
##           58) waterfront < 5.58118 177   762.9930  3.59607000 *
##           59) waterfront > 5.58118 18    78.1866  7.25347000 *
##       15) sqft_living > 6.37701 11   177.2690 12.75090000 *
summary(reg.tree)
## 
## Regression tree:
## tree(formula = model1, data = train.scaled)
## Variables actually used in tree construction:
## [1] "grade"       "sqft_living" "yr_built"    "feat05"      "waterfront" 
## Number of terminal nodes:  10 
## Residual mean deviance:  0.388895 = 7291.01 / 18748 
## Distribution of residuals:
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -8.0488800 -0.3624880 -0.0762008  0.0000000  0.2590970  7.0814200
sd(train.scaled$price) ^ 2 * (length(train.scaled$price) - 1)
## [1] 18810.5885
(train.scaled$price - mean(train.scaled$price))  ^ 2 %>% sum
## [1] 18810.5885

In constructing the decision tree, we utilized the tree() function which implements the binary recursive partitioning algorithm, considering all predictors specified in the model formula. The algorithm aims for maximum reduction in node impurity to generate splits and continues this process until nodes have a sufficient number of observations. Notably, the tree’s depth is constrained to 31 nodes. The summary of the tree estimation provides insights into the number of predictors eventually utilized in the model.

The regression tree analysis reveals that the primary predictors influencing house prices are “sqft_living,” “yr_built,” “grade,” “feat05,” and “waterfront.” The tree, consisting of 10 terminal nodes, delineates different conditions based on these factors. For instance, when the grade is below a certain threshold and sqft_living is also within a specific range, the predicted price tends to be lower. On the other hand, higher grades, larger sqft_living areas, and certain waterfront conditions contribute to higher predicted prices. The residual mean deviance, a measure of model fit, is 0.39, indicating the average difference between predicted and actual prices. Overall, the tree provides a simplified yet informative representation of the relationships between key features and house prices.

This indicates that the deviance at the root of the tree is equivalent to the corrected sum of squares of the target variable. The tree is providing a model that minimizes the sum of squared differences compared to the simple mean-based model.

In the next step, we can visualize the tree on the plot.

plot(reg.tree)
text(reg.tree, pretty = 0)

Cross-validation
# Cross-validation to find optimal tree size
cv_result <- cv.tree(reg.tree)
best_tree_size <- cv_result$size[which.min(cv_result$dev)]
best_tree_size
## [1] 10

The optimal tree size determined by cross-validation is 10. thus the cross-validated optimal tree size is already equal to the number of terminal nodes, it suggests that the tree is at an appropriate level of complexity, and additional pruning may not be necessary.

However we still wanted to try the limit the size of the tree by using lecture notes and compare the results.

Model1.2 : Controlling Tree size

To choose appropriate values for mincut, minsize, and mindev parameters, we’ll consider the characteristics of the first model and aim to balance model complexity with predictive performance.

mincut: We’ll start with a value that allows for a reasonable number of splits in the tree without excessively increasing complexity. Let’s choose a moderate value of 20 for mincut.

minsize: Considering that the first model resulted in 10 terminal nodes, we’ll set minsize to ensure each terminal node contains a sufficient number of observations. Given the relatively balanced number of observations in each node, we’ll choose a value of 50 for minsize.

mindev: Since the residual mean deviance of the first model is 0.388895, indicating a good overall model fit, we’ll choose a conservative value for mindev to avoid unnecessary splits. Let’s set mindev to 0.01 to ensure that splits contribute significantly to reducing deviance.

set.seed(123456)

control1 <- tree.control(nobs    = length(train.scaled$price), 
                         mincut  = 20, 
                         minsize = 50, 
                         mindev  = 0.01)
reg.tree1 <- tree(formula = model1, 
                     data    = train.scaled,
                     control = control1)
reg.tree1
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 18758 18810.600  0.00095086  
##    2) grade < 0.725091 15109  4260.080 -0.27780100  
##      4) sqft_living < -0.0437263 10263  1747.060 -0.42426700 *
##      5) sqft_living > -0.0437263 4846  1826.580  0.03238950  
##       10) yr_built < -0.53984 1087   587.488  0.40996100 *
##       11) yr_built > -0.53984 3759  1039.320 -0.07679390 *
##    3) grade > 0.725091 3649  8515.420  1.15515000  
##      6) sqft_living < 2.33753 3169  3199.320  0.86468400  
##       12) yr_built < -0.0262972 501   732.771  1.81075000 *
##       13) yr_built > -0.0262972 2668  1933.930  0.68703200  
##         26) grade < 1.57815 1770   669.966  0.42354100 *
##         27) grade > 1.57815 898   898.867  1.20638000 *
##      7) sqft_living > 2.33753 480  3283.590  3.07280000  
##       14) sqft_living < 5.7568 460  1861.750  2.79583000  
##         28) feat05 < 1.126 271   572.775  2.06499000 *
##         29) feat05 > 1.126 189   936.667  3.84377000 *
##       15) sqft_living > 5.7568 20   574.955  9.44305000 *
summary(reg.tree1)
## 
## Regression tree:
## tree(formula = model1, data = train.scaled, control = control1)
## Variables actually used in tree construction:
## [1] "grade"       "sqft_living" "yr_built"    "feat05"     
## Number of terminal nodes:  9 
## Residual mean deviance:  0.413882 = 7759.87 / 18749 
## Distribution of residuals:
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -8.7459500 -0.3624880 -0.0761098  0.0000000  0.2596630  8.1992300
plot(reg.tree1)
text(reg.tree1, pretty = 0)

The results from the regression tree analysis for reg.tree1 indicate a reasonably balanced tree structure with 9 terminal nodes. The tree splits based on predictor variables such as “grade,” “sqft_living,” “yr_built,” and “feat05,” effectively partitioning the dataset to minimize deviance. The residual mean deviance is low at 0.41, suggesting that the model fits the data well on average. The distribution of residuals appears symmetrical around the mean, indicating unbiased predictions. Overall, the chosen values for mincut, minsize, and mindev seem to have produced a suitable regression tree for the dataset. A hıgher residual mean deviance indicates a worse fit of the model to the data when compare the fırst results. Therefore, based on this comparison, the first model performs slightly better in terms of model fit.

Model2

Here, our focus shifts to constructing and evaluating the next model, referred to as Model2. The objective is to deepen our understanding of how the selected predictors influence house prices. By analyzing Model2, we aim to gain insights into the specific variables that have the most significant impact on predicting house prices.

set.seed(123456)
reg.tree2 <- tree(model2, train.scaled2)
reg.tree2
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 18758 18810.600  0.00095086  
##    2) grade < 0.725091 15109  4260.080 -0.27780100  
##      4) sqft_living < -0.0437263 10263  1747.060 -0.42426700 *
##      5) sqft_living > -0.0437263 4846  1826.580  0.03238950 *
##    3) grade > 0.725091 3649  8515.420  1.15515000  
##      6) sqft_living < 2.33753 3169  3199.320  0.86468400  
##       12) sqft_living < 1.17057 1889  1154.160  0.56565200 *
##       13) sqft_living > 1.17057 1280  1626.960  1.30599000 *
##      7) sqft_living > 2.33753 480  3283.590  3.07280000  
##       14) sqft_living < 6.37701 469  2051.830  2.84581000  
##         28) feat05 < 1.126 274   597.088  2.07160000 *
##         29) feat05 > 1.126 195  1059.730  3.93368000 *
##       15) sqft_living > 6.37701 11   177.269 12.75090000 *
summary(reg.tree2)
## 
## Regression tree:
## tree(formula = model2, data = train.scaled2)
## Variables actually used in tree construction:
## [1] "grade"       "sqft_living" "feat05"     
## Number of terminal nodes:  7 
## Residual mean deviance:  0.436716 = 8188.85 / 18751 
## Distribution of residuals:
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -8.0488800 -0.3732420 -0.0889825  0.0000000  0.2632470  8.9464700
plot(reg.tree2)
text(reg.tree2, pretty = 0)

Cross-validation

# Cross-validation to find optimal tree size
cv_result2 <- cv.tree(reg.tree2)
best_tree_size2 <- cv_result2$size[which.min(cv_result2$dev)]
best_tree_size2
## [1] 7

Comparison

# Predictions on the training set
reg.tree.train.pred <- predict(reg.tree, newdata = train.scaled)
reg.tree1.train.pred <- predict(reg.tree1, newdata = train.scaled)
reg.tree2.train.pred <- predict(reg.tree2, newdata = train.scaled)

# Calculate MSE for training set
mse_reg.tree_train <- mean((reg.tree.train.pred - train.scaled$price) ^ 2)
mse_reg.tree1_train <- mean((reg.tree1.train.pred - train.scaled$price) ^ 2)
mse_reg.tree2_train <- mean((reg.tree2.train.pred - train.scaled2$price) ^ 2)

# Calculate Root Mean Squared Error (RMSE) for training set
rmse_tree_train <- sqrt(mean((reg.tree.train.pred - train.scaled$price) ^ 2))
rmse_tree1_train <- sqrt(mean((reg.tree1.train.pred - train.scaled$price) ^ 2))
rmse_tree2_train <- sqrt(mean((reg.tree2.train.pred - train.scaled2$price) ^ 2))

# Calculate Mean Absolute Error (MAE) for training set
mae_tree_train <- mean(abs(reg.tree.train.pred - train.scaled$price))
mae_tree1_train <- mean(abs(reg.tree1.train.pred - train.scaled$price))
mae_tree2_train <- mean(abs(reg.tree2.train.pred - train.scaled2$price))

# Create a data frame for the training set results
results_train_df <- data.frame(
  Model = c("reg.tree", "reg.tree1", "reg.tree2"),
  MSE = c(mse_reg.tree_train, mse_reg.tree1_train, mse_reg.tree2_train),
  RMSE = c(rmse_tree_train, rmse_tree1_train, rmse_tree2_train),
  MAE = c(mae_tree_train, mae_tree1_train, mae_tree2_train)
)

# Print the training set results as a table
kable(results_train_df, format = "markdown")
Model MSE RMSE MAE
reg.tree 0.388688046 0.623448511 0.424154963
reg.tree1 0.413683225 0.643182109 0.428327183
reg.tree2 0.436552665 0.660721322 0.444317150
set.seed(123456)

reg.tree.pred  <- predict(reg.tree,  newdata = test.scaled)
reg.tree1.pred <- predict(reg.tree1, newdata = test.scaled)
reg.tree2.pred <- predict(reg.tree2, newdata = test.scaled2)

# Calculate MSE 
mse_reg.tree <- mean((reg.tree.pred - test.scaled$price) ^ 2)
mse_reg.tree1 <- mean((reg.tree1.pred - test.scaled$price) ^ 2)
mse_reg.tree2 <- mean((reg.tree2.pred - test.scaled2$price) ^ 2)

# Calculate Root Mean Squared Error (RMSE)
rmse_tree <- sqrt(mean((reg.tree.pred - test.scaled$price) ^ 2))
rmse_tree1 <- sqrt(mean((reg.tree1.pred - test.scaled$price) ^ 2))
rmse_tree2 <- sqrt(mean((reg.tree2.pred - test.scaled2$price) ^ 2))

# Calculate Mean Absolute Error (MAE)
mae_tree <- mean(abs(reg.tree.pred - test.scaled$price))
mae_tree1 <- mean(abs(reg.tree1.pred - test.scaled$price))
mae_tree2 <- mean(abs(reg.tree2.pred - test.scaled2$price))
library(knitr)

# Create a data frame for the results
results_test_df <- data.frame(
  Model = c("reg.tree", "reg.tree1", "reg.tree2"),
  MSE = c(mse_reg.tree, mse_reg.tree1, mse_reg.tree2),
  RMSE = c(rmse_tree, rmse_tree1, rmse_tree2),
  MAE = c(mae_tree, mae_tree1, mae_tree2)
)

# Print the results as a table
kable(results_test_df, format = "markdown")
Model MSE RMSE MAE
reg.tree 0.410356501 0.640590744 0.431282928
reg.tree1 0.423276384 0.650596944 0.432994135
reg.tree2 0.463238434 0.680616216 0.454251037
# Combine the training and test set results into a single data frame
results_combined_df <- merge(results_train_df, results_test_df, by = "Model")

# Melt the data frame for easier plotting
results_melted <- reshape2::melt(results_combined_df, id.vars = "Model")

# Plot MSE, RMSE, and MAE for both training and test sets
ggplot(results_melted, aes(x = Model, y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ variable, scales = "free_y") +
  labs(title = "Comparison of Model Performance on Training and Test Sets",
       x = "Model", y = "Metric Value") +
  theme_minimal() +
  theme(legend.position = "top")

# Reshape the table to have columns for train and test results separately
reshaped_table <- results_melted %>%
  tidyr::spread(variable, value)

reshaped_table
##       Model       MSE.x      RMSE.x       MAE.x       MSE.y      RMSE.y
## 1  reg.tree 0.388688046 0.623448511 0.424154963 0.410356501 0.640590744
## 2 reg.tree1 0.413683225 0.643182109 0.428327183 0.423276384 0.650596944
## 3 reg.tree2 0.436552665 0.660721322 0.444317150 0.463238434 0.680616216
##         MAE.y
## 1 0.431282928
## 2 0.432994135
## 3 0.454251037
# Calculate the differences
differences_table <- reshaped_table %>%
  mutate(reg.tree_MSE_differences = MSE.x - MSE.y,
         reg.tree_RMSE_differences = RMSE.x - RMSE.y,
         reg.tree_MAE_differences = MAE.x - MAE.y) %>%
  select(Model, reg.tree_MSE_differences, reg.tree_RMSE_differences, reg.tree_MAE_differences)

# Print the differences table
print(differences_table)
##       Model reg.tree_MSE_differences reg.tree_RMSE_differences
## 1  reg.tree           -0.02166845478            -0.01714223249
## 2 reg.tree1           -0.00959315909            -0.00741483569
## 3 reg.tree2           -0.02668576882            -0.01989489451
##   reg.tree_MAE_differences
## 1           -0.00712796500
## 2           -0.00466695220
## 3           -0.00993388698
# Reshape the table for plotting
reshaped_plot <- differences_table %>%
  pivot_longer(cols = starts_with("reg.tree"), names_to = "Metric", values_to = "Difference")

# Create the line plot
line_plot <- ggplot(reshaped_plot, aes(x = Metric, y = Difference, group = Model, color = Model)) +
  geom_line() +
  geom_point() +
  labs(title = "Assessing Overfitting",
       x = "Metrics",
       y = "Differences") +
  theme_minimal()

# Print the line plot
print(line_plot)

Among the three models assessed, Model1 (reg.tree) demonstrates superior performance with notably lower Mean Squared Error (MSE), Root Mean Squared Error (RMSE), and Mean Absolute Error (MAE) values across both training and test datasets, indicating robust predictive capabilities. Moreover, upon examining the overfitting scenario, Model1 exhibits the most favorable outcome. Consequently, considering its strong performance and generalization potential, Model1 emerges as the preferred option for predictive modeling in this scenario. Nevertheless, additional scrutiny and experimentation may be warranted to validate this assertion conclusively.

Gradient Boosting:

In this analysis, we employed the Gradient Boosting methodology for regression using the gbm() function in R. The choice of distribution for the target variable is crucial, and in our case, we opted for a regression analysis, applying the normal distribution with the parameter distribution = "gaussian". This setting aligns with the assumption that the target variable follows a Gaussian distribution.

  1. n.trees - number of iterations, actually trees, which will be grown
  2. interaction.depth - depth of the trees, that is the degree of their complexity, defined as the number of splits in the tree (counted from the first single node)
  3. shrinkage - learning rate parameter
  4. n.minobsinnode – minimum number of observation in the terminal node, required while creating the split

we created a Gradient Boosting model (gbm_model) with the formula price ~ . and additional specifications such as the number of trees (n.trees = 100) and interaction depth (interaction.depth = 4) and specific values for hyperparameters are not provided, default values are used.

The default value for the shrinkage parameter, also known as the learning rate, is 0.001. The default value for the n.minobsinnode parameter, determining the minimum observations in a terminal node for a split, is 10.

Model1

set.seed(123456)

# Create a Gradient Boosting model
gbm_model <- gbm(model1, data = train.scaled, distribution = "gaussian", n.trees = 100, interaction.depth = 4)

# Predictions on the training set
gbm_pred_train <- predict(gbm_model, newdata = train.scaled, n.trees = 100)

# Calculate MSE for Gradient Boosting with tuned parameters on training set
mse_gbm_train <- mean((gbm_pred_train - train.scaled$price) ^ 2)

# Calculate MAE for Gradient Boosting with tuned parameters on training set
mae_gbm_train <- mean(abs(gbm_pred_train - train.scaled$price))

# Print the MSE and MAE values for training set
cat("MSE for Model1 GBM on Training Set:", mse_gbm_train, "\n")
## MSE for Model1 GBM on Training Set: 0.193684381
cat("MAE for Model1 GBM on Training Set:", mae_gbm_train, "\n")
## MAE for Model1 GBM on Training Set: 0.306711465
# Predictions on the test set
gbm_pred_test <- predict(gbm_model, newdata = test.scaled, n.trees = 100)

# Calculate MSE for Gradient Boosting
mse_gbm_test <- mean((gbm_pred_test - test.scaled$price) ^ 2)

# Calculate MAE for Gradient Boosting
mae_gbm_test <- mean(abs(gbm_pred_test - test.scaled$price))

# Print the MSE and MAE values for testing set
cat("MSE for Model1 GBM on Testing Set:", mse_gbm_test, "\n")
## MSE for Model1 GBM on Testing Set: 0.217774281
cat("MAE for Model1 GBM on Testing Set:", mae_gbm_test, "\n")
## MAE for Model1 GBM on Testing Set: 0.31438852

MSE is relatively low 0.19, suggesting that the initial Gradient Boosting model is effective in capturing the patterns in the training data. The MSE for the testing set is slightly higher than the training set MSE with approximately 0.22. This metric represents the average squared difference between the predicted and actual values in our test set. In the context of regression models, a lower MSE indicates better predictive performance. Therefore, the obtained MSE of 0.277 suggests that the model provides a relatively accurate prediction of the target variable, capturing the variance in the data with a moderate level of precision. However, the model seems to be performing well on the training set, but there is a slight increase in error when applied to the testing set. This scenario commonly suggests that the model is generalizing reasonably well but may be experiencing some level of overfitting. Thus we decided to try tuning on hyperparameters to improve the model, but first lets visualize our results.

### Scatter plot for training set with prediction line

# Calculate absolute differences for training set
abs_diff_train <- abs(train.scaled$price - gbm_pred_train)

# Scatter plot for training set with prediction line
plot(train.scaled$price, gbm_pred_train, col = ifelse(abs_diff_train > 4, "black", "blue"), main = "Scatter Plot - Actual vs Predicted (Training Set)", xlab = "Actual Prices", ylab = "Predicted Prices")

# Highlight points with absolute differences over 4
points(train.scaled$price[abs_diff_train > 4], gbm_pred_train[abs_diff_train > 4], col = "black", pch = 16)

# Add a legend outside the plot
legend("bottomright", legend = c("Train Set", "AD > 4"), col = c("blue", "black"), pch = c(1, 16), inset = c(0, 0.05), xpd = TRUE)

# Add a diagonal line for perfect predictions
abline(0, 1, col = "green", lty = 2)

# Calculate absolute differences for training set
abs_diff_test <- abs(test.scaled$price - gbm_pred_test)


### Scatter plot for test set with prediction line
plot(test.scaled$price, gbm_pred_test, col = ifelse(abs_diff_test > 5, "black", "red"), main = "Scatter Plot - Actual vs Predicted (Test Set)", xlab = "Actual Prices", ylab = "Predicted Prices")

# Highlight points with absolute differences over 5
points(test.scaled$price[abs_diff_test > 4], gbm_pred_test[abs_diff_test > 4], col = "black", pch = 16)

# Add a legend outside the plot
legend("bottomright", legend = c("Test Set", "AD> 4"), col = c("red", "black"), pch = c(1, 16), inset = c(0, 0.05), xpd = TRUE)

# Add a diagonal line for perfect predictions
abline(0, 1, col = "green", lty = 2)

The green dashed diagonal line represents a scenario where predicted values exactly match the actual values. Deviations from this line indicate the model’s prediction errors. Black points are the observations that absolute differences between actual value and predicted value is over 4.

Model2

set.seed(123456)

# Create a Gradient Boosting model
gbm_model2 <- gbm(model2, data = train.scaled2, distribution = "gaussian", n.trees = 100, interaction.depth = 4)


# Predictions on the training set
gbm_pred_train2 <- predict(gbm_model2, newdata = train.scaled2, n.trees = 100)

# Calculate MSE for Gradient Boosting with tuned parameters on training set
mse_gbm_train2 <- mean((gbm_pred_train2 - train.scaled2$price) ^ 2)

# Print the MSE value for training set
cat("MSE for Model2 GBM on Training Set:", mse_gbm_train2, "\n")
## MSE for Model2 GBM on Training Set: 0.258698837
# Predictions on the test set
gbm_pred_test2 <- predict(gbm_model2, newdata = test.scaled2, n.trees = 100)

# Calculate MSE for Gradient Boosting
mse_gbm_test2 <- mean((gbm_pred_test2 - test.scaled2$price) ^ 2)

# Print the MSE value
cat("MSE for Model2 GBM on Testing Set:", mse_gbm_test2, "\n")
## MSE for Model2 GBM on Testing Set: 0.287677614
# Calculate MAE for Gradient Boosting with tuned parameters on training set
mae_gbm_train2 <- mean(abs(gbm_pred_train2 - train.scaled2$price))

# Print the MAE value for training set
cat("MAE for Model2 GBM on Training Set:", mae_gbm_train2, "\n")
## MAE for Model2 GBM on Training Set: 0.352250693
# Calculate MAE for Gradient Boosting on test set
mae_gbm_test2 <- mean(abs(gbm_pred_test2 - test.scaled2$price))

# Print the MAE value for test set
cat("MAE for Model2 GBM on Testing Set:", mae_gbm_test2, "\n")
## MAE for Model2 GBM on Testing Set: 0.362980382

Model2 GBM exhibits slightly higher MSE and MAE values on both the training and testing sets compared to Model1 GBM. Specifically, Model2 GBM has an MSE of approximately 0.259 on the training set and 0.288 on the testing set, along with MAE values of about 0.352 and 0.363 on the training and testing sets, respectively. These higher error rates suggest that Model2 GBM performs less effectively in capturing the dataset’s patterns compared to Model1 GBM. Therefore, based on these metrics, Model1 GBM appears to be the superior choice. Further analysis may be needed to confirm this conclusion, but for now, continuing with Model1 for subsequent analyses seems prudent.

Hyperparameter Tuning for Gradient Boosting:

In an effort to enhance the performance of our Gradient Boosting model, we performed hyperparameter tuning by increasing the number of trees (n.trees) to 200 and adjusting the interaction depth to 6. The aim was to find a configuration that better captures complex relationships in the data.

set.seed(123456)

# Create a Gradient Boosting model with different hyperparameters
gbm_model_tuned <- gbm(price ~ ., data = train.scaled, distribution = "gaussian", n.trees = 200, interaction.depth = 6)

# Predictions on the training set
gbm_pred_tuned_train <- predict(gbm_model_tuned, newdata = train.scaled, n.trees = 200)

# Calculate MSE for Gradient Boosting with tuned parameters on training set
mse_gbm_tuned_train <- mean((gbm_pred_tuned_train - train.scaled$price) ^ 2)

# Print the MSE value for training set
cat("MSE for Tuned GBM on Training Set:", mse_gbm_tuned_train, "\n")
## MSE for Tuned GBM on Training Set: 0.153130196
# Predictions on the test set
gbm_pred_tuned_test <- predict(gbm_model_tuned, newdata = test.scaled, n.trees = 200)

# Calculate MSE for Gradient Boosting with tuned parameters
mse_gbm_tuned_test <- mean((gbm_pred_tuned_test - test.scaled$price) ^ 2)

# Print the MSE value
cat("MSE for Tuned GBM on Test Set:", mse_gbm_tuned_test, "\n")
## MSE for Tuned GBM on Test Set: 0.203248099
# Calculate MAE for Tuned Gradient Boosting on training set
mae_gbm_tuned_train <- mean(abs(gbm_pred_tuned_train - train.scaled$price))

# Print the MAE value for training set
cat("MAE for Tuned GBM on Training Set:", mae_gbm_tuned_train, "\n")
## MAE for Tuned GBM on Training Set: 0.283985196
# Calculate MAE for Tuned Gradient Boosting on test set
mae_gbm_tuned_test <- mean(abs(gbm_pred_tuned_test - test.scaled$price))

# Print the MAE value for test set
cat("MAE for Tuned GBM on Test Set:", mae_gbm_tuned_test, "\n")
## MAE for Tuned GBM on Test Set: 0.305398644

The tuned hyperparameter values for the Gradient Boosting model, with 200 trees and an interaction depth of 6, resulted in a Training MSE of 0.15 and a Testing MSE of 0.20. When comparing these values with the initial model , the tuned model exhibits improved performance on both the training and testing sets.

However the overfitting difference increased slightly after tuning.

# Calculate overfitting differences
overfitting_difference_initial <- mse_gbm_train - mse_gbm_test
overfitting_difference_tuned <- mse_gbm_tuned_train - mse_gbm_tuned_test

# Print overfitting differences
cat("Overfitting Difference (Initial):", overfitting_difference_initial, "\n")
## Overfitting Difference (Initial): -0.0240899001
cat("Overfitting Difference (Tuned):", overfitting_difference_tuned, "\n")
## Overfitting Difference (Tuned): -0.0501179032
### Scatter plot for training set with prediction line

# Calculate absolute differences for training set
abs_diff_tuned1 <- abs(train.scaled$price - gbm_pred_tuned_train)

# Scatter plot for training set with prediction line
plot(train.scaled$price, gbm_pred_tuned_train, col = ifelse(abs_diff_tuned1 > 4, "black", "blue"), main = "Tuned Model - Actual vs Predicted (Training Set)", xlab = "Actual Prices", ylab = "Predicted Prices")

# Highlight points with absolute differences over 4
points(train.scaled$price[abs_diff_tuned1 > 4], gbm_pred_tuned_train[abs_diff_tuned1 > 4], col = "black", pch = 16)

# Add a legend outside the plot
legend("bottomright", legend = c("Train Set", "AD > 4"), col = c("blue", "black"), pch = c(1, 16), inset = c(0, 0.05), xpd = TRUE)

# Add a diagonal line for perfect predictions
abline(0, 1, col = "green", lty = 2)

####
# Calculate absolute differences for training set
abs_diff_tuned_test <- abs(test.scaled$price - gbm_pred_tuned_test)

#Scatter plot for test set with prediction line
plot(test.scaled$price, gbm_pred_tuned_test, col = ifelse(abs_diff_tuned_test > 5, "black", "red"), main = "Scatter Plot - Actual vs Predicted (Test Set)", xlab = "Actual Prices", ylab = "Predicted Prices")

# Highlight points with absolute differences over 4
points(test.scaled$price[abs_diff_tuned_test > 4], gbm_pred_tuned_test[abs_diff_tuned_test > 4], col = "black", pch = 16)

# Add a legend outside the plot
legend("bottomright", legend = c("Test Set", "AD> 4"), col = c("red", "black"), pch = c(1, 16), inset = c(0, 0.05), xpd = TRUE)

# Add a diagonal line for perfect predictions
abline(0, 1, col = "green", lty = 2)

Random Forest

In this process we tune the Random Forest model using the best parameters obtained from the tuned Gradient Boosting model, potentially improving the performance of the Random Forest model.

The code snippet begins by extracting the best parameters from a tuned Gradient Boosting model. These parameters are then used to construct a hyperparameter grid for tuning a Random Forest model. Cross-validation control parameters are set up, and a grid search is performed to optimize the Random Forest model. Finally, the best parameters for the Random Forest model are printed. This process aims to leverage insights from the Gradient Boosting model to improve the performance of the Random Forest model.

set.seed(123456)

# Create a Random Forest model
rf_model <- randomForest(price ~ ., data = train.scaled,
                           ntree = 200,
  mtry = 4,
  nodesize = 9,
  maxnodes = 20)


# Predictions on the training set
rf_pred_train <- predict(rf_model, newdata = train.scaled)

# Calculate MSE for Random Forest on training set
mse_rf_train <- mean((rf_pred_train - train.scaled$price) ^ 2)

# Print the MSE value for training set
cat("MSE for Random Forest on Training Set:", mse_rf_train, "\n")
## MSE for Random Forest on Training Set: 0.328845394
# Predictions on the test set
rf_pred_test <- predict(rf_model, newdata = test.scaled)

# Calculate MSE for Random Forest
mse_rf <- mean((rf_pred_test - test.scaled$price) ^ 2)

# Print the MSE value
cat("MSE for Random Forest on Test Set:", mse_rf, "\n")
## MSE for Random Forest on Test Set: 0.348988041
set.seed(123456)

# Create a Random Forest model
rf_model <- randomForest(price ~ ., data = train.scaled)


# Predictions on the training set
rf_pred_train <- predict(rf_model, newdata = train.scaled)

# Calculate MSE for Random Forest on training set
mse_rf_train <- mean((rf_pred_train - train.scaled$price) ^ 2)

# Print the MSE value for training set
cat("MSE for Random Forest on Training Set:", mse_rf_train, "\n")
## MSE for Random Forest on Training Set: 0.0350226282
# Predictions on the test set
rf_pred_test <- predict(rf_model, newdata = test.scaled)

# Calculate MSE for Random Forest
mse_rf <- mean((rf_pred_test - test.scaled$price) ^ 2)

# Print the MSE value
cat("MSE for Random Forest on Test Set:", mse_rf, "\n")
## MSE for Random Forest on Test Set: 0.201226508

The Random Forest model, when manually tuned with custom hyperparameters, yielded an MSE of 0.33 on the training set and 0.35 on the test set. However, the default parameter setting resulted in significantly lower MSE values of 0.03 on the training set and 0.20 on the test set. Despite the improved performance with default parameters, there was a noticeable increase in overfitting, highlighting the importance of balancing model complexity and generalization. Moreover, the custom-tuned model incurred higher computational costs in terms of runtime. Therefore, further exploration of alternative models may be necessary.

Conclusion

In conclusion, while Model1 (reg.tree) emerges as the preferred option among decision tree models, the final tuned gradient boosting model showcases the best predictive accuracy overall. However, model selection should consider not only performance metrics but also computational costs and overfitting concerns. Further exploration of alternative models and hyperparameter tuning strategies may be warranted to optimize performance effectively.

##Resources

R Studio and associated libraries

Kaggle dataset (provided under CC0: Public Domain). https://www.kaggle.com/datasets/shivachandel/kc-house-data/data

Shapefile for mapping https://gis-kingcounty.opendata.arcgis.com/