Lab #4 - Tree-Method and Unsupervised Learning

Author
Affiliation

Tessa Reilly and Chanel Anderson

Denison University

Question 1

For this question, you will the ames data set that we have already worked in class. This data set contains information on 2,930 properties in Ames, Iowa, including columns related to: - house characteristics (bedrooms, garage, fireplace, pool, porch, etc.) - location (neighborhood) - lot information (zoning, shape, size, etc.) - ratings of condition and quality - sale price

The (data, documentation)) are available in the links above

  1. Before you dig in predicting house prices, we would like to know a couple a things about our data. To answer each subquestion, please create a nice plot !
  • How expensive are houses (min, max, average, median) ?
  • When were the houses built, including the oldest and newest?
  • When were houses sold?
  • Where are houses?
  • How big are houses?
Code
library(ggplot2)
library(dplyr)
library(corrplot)

properties <- read.csv("https://vincentarelbundock.github.io/Rdatasets/csv/modeldata/ames.csv")
summary(properties %>% select(Sale_Price, Year_Built, Year_Sold, Lot_Area))
# Subquestion 1: how expensive are houses?
ggplot(properties, aes(x = Sale_Price)) +
  geom_histogram(bins = 50, fill = "#046307", color = "#5C655A", alpha = 0.7) +
  scale_x_log10() +
  labs(
    title = "Housing Price",
    subtitle = "1. How expensive are houses?",
    x = "Logged Housing Price ($)",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5)
  )
# Subquestion 2: When were houses built? 
ggplot(properties, aes(x = Year_Built)) +
  geom_histogram(bins = 50, fill = "#002366", color = "black", alpha = 0.7) +
  labs(
    title = "Year Built",
    subtitle = "2. When were houses built?",
    x = "Year Built",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5)
  )

# Subquestion 3: When are houses sold? 

ggplot(properties, aes(x = Year_Sold)) +
  geom_bar(bins = 50, fill = "#005672", color = "black", alpha = 0.7) +
  labs(
    title = "Year Sold",
    subtitle = "3. When are houses sold? ",
    x = "Year Sold",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5)
  )


# Subquestion 4: Where are Houses?

properties %>%
  count(Neighborhood) %>%
  mutate(Neighborhood = gsub("_", " ", Neighborhood)) %>%
  ggplot(aes(x = reorder(Neighborhood, n), y = n)) +
  geom_col(fill = "#8E9AFE", color = "black", alpha = 0.7) +
  coord_flip() +
  labs(
    title = "Neighborhoods",
    subtitle = "Neighborhoods of houses sold in Ames, Iowa from 2006 to 2010",
    x = "Neighborhood",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 10, hjust = 0.5)
  )

# Subquestion 5: How big are the houses? 

ggplot(properties, aes(x = Lot_Area)) +
  geom_histogram(bins = 50, fill = "#6A5ACD", color = "black", alpha = 0.7) +
  scale_x_log10() +
  labs(
    title = "Lot Area",
    subtitle = "5: How big are houses?",
    x = " Logged Lot size (sq. ft)",
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5)
  )
Output
   Sale_Price       Year_Built     Year_Sold       Lot_Area     
 Min.   : 12789   Min.   :1872   Min.   :2006   Min.   :  1300  
 1st Qu.:129500   1st Qu.:1954   1st Qu.:2007   1st Qu.:  7440  
 Median :160000   Median :1973   Median :2008   Median :  9436  
 Mean   :180796   Mean   :1971   Mean   :2008   Mean   : 10148  
 3rd Qu.:213500   3rd Qu.:2001   3rd Qu.:2009   3rd Qu.: 11555  
 Max.   :755000   Max.   :2010   Max.   :2010   Max.   :215245  
Plot

Plot

Plot

Plot

Plot

Based on the plots and summary statistics, the sale price of houses is quite skewed with a median of 160,000 USD, a minimum of 12,789 USD and a maximum of 755,000 USD. For visual purposes, we logged the sale price to visualize the prices on the extreme ends that were otherwise minimized by the skewness. Although the age range of the houses varies significantly from the 1870s to 2010, the mean year for the year a house was built was 1971 revealing a leftward skew in building years. While houses were sold at a fairly uniform frequency there is a sharp decline in 2010, and a sizeable spike in 2007.The majority of houses were sold in the neighborhoods North Ames, College Creek and Old Town, and the least were old in Greens, Green Hills, and Landmark. Finally, the overall lot area of the houses had a similar leftward skew to the houses prices, thus we logged the Lot Area variable for visual purposes. Based on the summary statistics, the minimum lot area was 1300 sq. ft., the maximum was 215,245 sq. ft., and the mean was 10,148 sq. ft.

  1. Since our target variable is house prices, please provide a detail overview of the distribution of our target variables including stats and plots.
Code
# Summary statistics of key numeric variables by Neighborhood
properties %>%
  mutate(Neighborhood = gsub("_", " ", Neighborhood)) %>%
  group_by(Neighborhood) %>%
  summarise(
    count = n(),
    mean_Lot_Area = mean(Lot_Area, na.rm = TRUE),
    median_Lot_Area = median(Lot_Area, na.rm = TRUE),
    mean_Sale_Price = mean(Sale_Price, na.rm = TRUE),
    median_Sale_Price = median(Sale_Price, na.rm = TRUE),
    mean_Year_Built = mean(Year_Built, na.rm = TRUE),
    median_Year_Built = median(Year_Built, na.rm = TRUE),
    mean_Year_Sold = mean(Year_Sold, na.rm = TRUE),
    median_Year_Sold = median(Year_Sold, na.rm = TRUE)
  ) %>%
  arrange(desc(mean_Sale_Price))

#  Lot Area vs Sale Price
ggplot(properties, aes(x = Lot_Area, y = Sale_Price)) +
  geom_point(color = "#6A5ACD", alpha = 0.6) +
  geom_smooth(method = "lm", color = "black", se = TRUE, linewidth = 1) +
  scale_y_log10() +
  scale_x_log10() +
  labs(
    title = "Lot Area vs Sale Price",
    x = "Logged Lot Area (sq. ft.)",
    y = "Logged Sale Price ($)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5)
  )


# Neighborhood vs Sale Price (Top 10 by Median Price) since there are multiple
properties %>%
  mutate(Neighborhood = gsub("_", " ", Neighborhood)) %>%
  group_by(Neighborhood) %>%
  mutate(median_price = median(Sale_Price, na.rm = TRUE)) %>%
  ungroup() %>%
  filter(Neighborhood %in% names(sort(tapply(Sale_Price, Neighborhood, median, na.rm = TRUE), decreasing = TRUE))[1:10]) %>%
  ggplot(aes(x = reorder(Neighborhood, Sale_Price, median), y = Sale_Price)) +
  geom_boxplot(fill = "#8E9AFE", color = "black", alpha = 0.7) +
  scale_y_log10() +
  coord_flip() +
  labs(
    title = "Neighborhood vs Sale Price",
    subtitle = "Top 10 neighborhoods by median price",
    x = "Neighborhood",
    y = "Logged Sale Price ($)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5)
  )


# Year Built vs Sale Price (Smoothed Trend)
properties %>%
  group_by(Year_Built) %>%
  summarise(mean_price = mean(Sale_Price, na.rm = TRUE)) %>%
  ggplot(aes(x = Year_Built, y = mean_price)) +
  geom_point(color = "#002366", alpha = 0.6) +
  geom_smooth(color = "#002366", se = FALSE, linewidth = 1.2) +
  scale_y_log10() +
  labs(
    title = "Year Built vs Average Sale Price",
    subtitle = "How do average prices change over time?",
    x = "Year Built",
    y = "Mean Logged Sale Price ($)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5)
  )

# Sale_Price by Year_Sold
properties %>%
  ggplot(aes(x = factor(Year_Sold), y = Sale_Price)) +
  geom_boxplot(fill = "#005672", color = "black", alpha = 0.7) +
  scale_y_log10() +
  labs(
    title = "Sale Price by Year Sold",
    x = "Year Sold",
    y = "Logged Sale Price ($)"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5)
  )
Output
# A tibble: 28 × 10
   Neighborhood       count mean_Lot_Area median_Lot_Area mean_Sale_Price
   <chr>              <int>         <dbl>           <dbl>           <dbl>
 1 Northridge            71        12976.          11400          330319.
 2 Stone Brook           51        11383.           8118          324229.
 3 Northridge Heights   166        11217.          12092          322018.
 4 Green Hills            2         9001            9001          280000 
 5 Veenker               24        15482.          14512          248315.
 6 Timberland            72        18234.          11848          246600.
 7 Somerset             182         7611.           8282.         229707.
 8 Clear Creek           44        24842.          15588.         208662.
 9 Crawford             103        11650.          11275          207551.
10 College Creek        267        10000.           9675          201803.
# ℹ 18 more rows
# ℹ 5 more variables: median_Sale_Price <dbl>, mean_Year_Built <dbl>,
#   median_Year_Built <dbl>, mean_Year_Sold <dbl>, median_Year_Sold <dbl>
Plot

Plot

Plot

Plot

Sale Price and Lot Area: Based on the scatter plot of the Logged Lot Area and Logged Sale Price,there appears to be a general upwards trend meaning that a percentage increase in lot area would result in some percentage increase in sale price.

Sale Price and Neighborhood: Given the large number of neighborhoods, we selected the top 10 neighborhooods which had mean sale prices ranging from 201,803.43 USD to 330,319 USD while the bottom 10 (not plotted) had mean sale prices ranging from 95,756.49 USD to 136,751.15 USD. In some neighborhoods such as Stone Brook, Northridge, Somerset, and College Creek, there appears to be significant variability with a few outliers in sale price. Whereas other neighborhoods such as Green Hills and Greens have more rigidity in the variation of sale price.

Sale Price and Year Built: Based on the plot of the build year and average sale price, there is a general upwards trend, suggesting that newer houses are sold on average for higher values. However, it is worth noting that some outliers do exist during the late 1890s-late 1930s period, and there appears to be a slightly non-linear relationship.

Sale Price and Year Sold: Given that houses are sold over the short time frame of 2006-2010, we created a series of box plots to display the relationship between sale price and selling year. Despite the presence of outliers across all years, the mean sale price across all 5 years remains fairly constant.

  1. Provide a overview of the correlation between numerical features of the house (Think about to create a plot that shows all correlations at once). Include any information that you deem important
Code
library(GGally)
# Select numeric variables for correlation analysis
num_vars <- properties %>%
  select(Lot_Area, Total_Bsmt_SF, Gr_Liv_Area, 
         Garage_Area, Wood_Deck_SF, Enclosed_Porch, Pool_Area, 
         Misc_Val, Sale_Price)

# Split into 3 subsets (each including Sale_Price)
num_vars1 <- num_vars %>% select(Lot_Area,Garage_Area, Sale_Price)
num_vars2 <- num_vars %>% select(Total_Bsmt_SF, Gr_Liv_Area, Sale_Price)
num_vars3 <- num_vars %>% select(Pool_Area, Misc_Val, Sale_Price)

# Make correlation plots
ggpairs(num_vars1)
ggpairs(num_vars2)
ggpairs(num_vars3)
Plot

Plot

Plot

For our selection of numerical variables to include, we chose to eliminate some variables from the num_var subset that would cause potential endogeneity. For instance, the total basement variable was selected rather than first or second basement as the inclusion of all of these variables would lead to multicollinearity, and total basement area is more comprehensive. For readability, we created a subset of numerical variables, then plotted those variables against Sale_Price in subsets using the ggpairs function which gives a series of plots within and across variables, as well as their numerical correlations. From this tactic, we noticed that Lot_Area, Garage_Area, Total_Bsmt_SF, Gr_Liv_Area and Pool_Area all had statistically significant correlations to sale price at all levels while Misc_Val did not.

  1. Choose 5 categorical features, and create plots to understand the relationship between houses prices and those variables.
Code
# Utilities
med_utils <- properties %>%
  group_by(Utilities) %>%
  summarise(Med_Sale_Price = median(Sale_Price, na.rm = TRUE)) %>%
  arrange(desc(Med_Sale_Price))

ggplot(med_utils, aes(x = reorder(Utilities, -Med_Sale_Price), y = Med_Sale_Price)) +
  geom_col(fill = "#FF6F61", color = "black", alpha = 0.8) +
   scale_x_discrete(labels = c(
    "AllPub"  = "All Public Utilities",
    "NoSewr"  = "Electricity, Gas and Water",
    "NoSeWa" = "Electricity and Gas Only",
    "ELO"  = "Electricity Only")) + 
  labs(
    title = "Median Sale Price by Utilities",
    subtitle = "Comparing median home prices across utility types",
    x = "Utilities",
    y = "Sale Price"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1))

# Overall Condition
med_cond <- properties %>%
  group_by(Overall_Cond) %>%
  summarise(Med_Sale_Price = median(Sale_Price, na.rm = TRUE)) %>%
  arrange(desc(Med_Sale_Price))

ggplot(med_cond, aes(x = reorder(as.factor(Overall_Cond), -Med_Sale_Price), y = Med_Sale_Price)) +
  geom_col(fill = "#FFD700", color = "black", alpha = 0.8) +
  labs(
    title = "Median Sale Price by Overall Condition",
    subtitle = "Comparing median home prices across overall condition ratings",
    x = "Overall Condition",
    y = "Sale Price"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

# Central Air
med_air <- properties %>%
  group_by(Central_Air) %>%
  summarise(Med_Sale_Price = median(Sale_Price, na.rm = TRUE)) %>%
  arrange(desc(Med_Sale_Price))

ggplot(med_air, aes(x = reorder(Central_Air, -Med_Sale_Price), y = Med_Sale_Price)) +
  geom_col(fill = "#1E90FF", color = "black", alpha = 0.8) +
  scale_y_continuous(labels = NULL, breaks = NULL) +
   scale_x_discrete(labels = c(
    "Y"  = "Yes",
    "N"  = "No")) + 
  labs(
    title = "Median Sale Price by Central Air",
    subtitle = "Comparing median home prices across central air types",
    x = "Central Air",
    y = NULL
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

# Sale Condition
med_salecond <- properties %>%
  group_by(Sale_Condition) %>%
  summarise(Med_Sale_Price = median(Sale_Price, na.rm = TRUE)) %>%
  arrange(desc(Med_Sale_Price))

ggplot(med_salecond, aes(x = reorder(Sale_Condition, -Med_Sale_Price), y = Med_Sale_Price)) +
  geom_col(fill = "#32CD32", color = "black", alpha = 0.8) +
   scale_x_discrete(labels = c(
    "Abnorml"  = "Abnormal Sale",
    "AdjLand"  = "Adjoining Land Purchase",
    "Alloca" = "Allocation"))+
  labs(
    title = "Median Sale Price by Sale Condition",
    subtitle = "Comparing median home prices across sale conditions",
    x = "Sale Condition",
    y = "Sale Price"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

# MS Zoning
med_zoning <- properties %>%
  group_by(MS_Zoning) %>%
  summarise(Med_Sale_Price = median(Sale_Price, na.rm = TRUE)) %>%
  arrange(desc(Med_Sale_Price))

ggplot(med_zoning, aes(x = reorder(MS_Zoning, -Med_Sale_Price), y = Med_Sale_Price)) +
  geom_col(fill = "#FF69B4", color = "black", alpha = 0.8) +
  scale_x_discrete(labels = c(
    "A_agr"  = "Agriculture",
    "C_all"  = "Commercial",
    "Floating_Village_Residential" = "Floating Village Residential",
    "I_all"  = "Industrial",
    "Residential_High_Density" = "Residential High Density",
    "Residential_Low_Density" = "Residential Low Density",
    "Residential_Medium_Density" = "Residential Medium Density"
  )) +
  labs(
    title = "Median Sale Price by MS Zoning",
    subtitle = "Comparing median home prices across zoning categories",
    x = "MS Zoning",
    y = "Sale_Price"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 18, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 12, hjust = 0.5),
    axis.text.x = element_text(angle = 45, hjust = 1)
  )
Plot

Plot

Plot

Plot

Plot

For readability and interpretability, we created plots using the median price for each of the categorical variables selected: 1) Median Price by Utilities: Based on the bar graph, it appears that houses with all public utilities had higher median selling prices than houses with electricity and gas only, and houses with electricity, gas, and water. Notably, the majority of houses had all public utilities, so the visual may be slightly biased in that regard. 2) Median Price by Condition: Based on the graph, there appears to be a downward tend in median price vs condition suggesting that houses in worse condition sell for lower median prices, which makes sense in context. Interestingly, there is not much of a visible price difference in houses rated as “good”, “very good” and “above average” 3) Median Sale Price and Central Air: Houses with central air appear to have higher median selling prices compared to houses that do not have central air. Like the Condition variable, this trend makes sense in context as additional features like central air are often incorporated into the selling price of homes 4) Median Sale Price and Sale Condition: Partial sales have a significantly higher median sale price compared to the other sale conditions, which would hold true as houses sold in the “partial” category are often new houses that are not yet finished. Given that they are new, their value would be higher than older depreciated houses 5) Median Sale Price and MS Zoning: Houses in the agriculture zones had lower median selling prices whereas houses in the floating village residential category had the highest median selling prices. Given that argicultural houses are in rural landscapes, property values on average would be lower than the floating village residential houses in urban areas.

  1. Alright, after checking some of the variables, let’s predict house prices. Split the dataset into training and testing (60-40), and compare the following models using cross-validation (k=10).
  1. Best Sub Selection Linear Regression
  2. Lasso Linear Regression
  3. Regression Tree with Pruning
  4. Bagging
  5. Random Forest.

You should decide the criteria to choose the most accurate model, and show the main variables of the chosen models. Beware, that you might have to do some data wrangling before running model (Have fun!)

Code
library(leaps)

data1 <- sapply(train_data, is.numeric)
data2 <- train_data[, data1]

ctrl <- trainControl(method = "cv", number = 10)

# Best subsets 
best_sub <- train(
  Sale_Price ~ .,
  data = data2,
  method = "leapSeq",
  tuneGrid = data.frame(nvmax = 1:20),
  trControl = ctrl
)


best_sub
rmse_best <- best_sub$results$RMSE[best_sub$results$nvmax == best_sub$bestTune$nvmax]


summary(best_sub)

# LASSO

set.seed(123)

# Create numeric matrix
x <- model.matrix(Sale_Price ~ . - 1, data = train_data)
y <- train_data$Sale_Price

# 10-fold CV
kfolds <- trainControl(method = "cv", number = 10)

# Manually specify alpha and lambda grid
lasso_grid <- expand.grid(
  alpha = 1,                     # LASSO
  lambda = c(0.001, 0.01, 0.1, 1, 10, 100)  # short grid, works in all versions
)

# Train LASSO
lasso <- train(
  x = x,
  y = y,
  method = "glmnet",
  tuneGrid = lasso_grid,
  trControl = kfolds
)

lasso

rmse_lasso <- lasso$results$RMSE[lasso$results$lambda == lasso$bestTune$lambda & 
                                 lasso$results$alpha == lasso$bestTune$alpha]
Output
Linear Regression with Stepwise Selection 

1760 samples
  26 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 1584, 1584, 1584, 1584, 1584, 1584, ... 
Resampling results across tuning parameters:

  nvmax  RMSE      Rsquared   MAE     
   1     56559.10  0.4952271  38514.96
   2     47638.88  0.6492140  31771.85
   3     41820.59  0.7299140  27489.22
   4     41233.36  0.7379863  26696.66
   5     39692.84  0.7556706  25897.45
   6     39026.27  0.7641718  25340.90
   7     40119.83  0.7450615  26138.65
   8     40077.32  0.7320136  28090.57
   9     36941.78  0.7878053  23900.77
  10     38469.68  0.7652096  25619.19
  11     36350.51  0.7933476  23599.77
  12     36018.66  0.7969044  23214.54
  13     36188.07  0.7954723  23351.25
  14     36190.91  0.7964099  23394.81
  15     36207.47  0.7948326  23324.28
  16     35848.93  0.7989153  23080.84
  17     36082.64  0.7964556  23173.96
  18     35837.33  0.7989840  23179.86
  19     36004.34  0.7972432  23157.78
  20     35774.62  0.7996595  23076.71

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was nvmax = 20.
Subset selection object
26 Variables  (and intercept)
               Forced in Forced out
rownames           FALSE      FALSE
Lot_Frontage       FALSE      FALSE
Lot_Area           FALSE      FALSE
Year_Built         FALSE      FALSE
Year_Remod_Add     FALSE      FALSE
Mas_Vnr_Area       FALSE      FALSE
BsmtFin_SF_1       FALSE      FALSE
Bsmt_Unf_SF        FALSE      FALSE
Total_Bsmt_SF      FALSE      FALSE
First_Flr_SF       FALSE      FALSE
Second_Flr_SF      FALSE      FALSE
Gr_Liv_Area        FALSE      FALSE
Bsmt_Full_Bath     FALSE      FALSE
Bsmt_Half_Bath     FALSE      FALSE
Full_Bath          FALSE      FALSE
Half_Bath          FALSE      FALSE
Bedroom_AbvGr      FALSE      FALSE
TotRms_AbvGrd      FALSE      FALSE
Fireplaces         FALSE      FALSE
Garage_Cars        FALSE      FALSE
Garage_Area        FALSE      FALSE
Wood_Deck_SF       FALSE      FALSE
Mo_Sold            FALSE      FALSE
Year_Sold          FALSE      FALSE
Longitude          FALSE      FALSE
Latitude           FALSE      FALSE
1 subsets of each size up to 20
Selection Algorithm: 'sequential replacement'
          rownames Lot_Frontage Lot_Area Year_Built Year_Remod_Add Mas_Vnr_Area
1  ( 1 )  " "      " "          " "      " "        " "            " "         
2  ( 1 )  " "      " "          " "      "*"        " "            " "         
3  ( 1 )  " "      " "          " "      "*"        " "            " "         
4  ( 1 )  " "      " "          " "      "*"        " "            " "         
5  ( 1 )  " "      " "          " "      "*"        " "            " "         
6  ( 1 )  " "      " "          " "      "*"        "*"            " "         
7  ( 1 )  " "      " "          " "      "*"        "*"            "*"         
8  ( 1 )  " "      " "          " "      "*"        "*"            "*"         
9  ( 1 )  "*"      "*"          "*"      "*"        "*"            "*"         
10  ( 1 ) " "      " "          " "      "*"        "*"            "*"         
11  ( 1 ) " "      "*"          " "      "*"        "*"            "*"         
12  ( 1 ) " "      "*"          "*"      "*"        "*"            "*"         
13  ( 1 ) " "      "*"          "*"      "*"        "*"            "*"         
14  ( 1 ) " "      "*"          "*"      "*"        "*"            "*"         
15  ( 1 ) " "      "*"          "*"      "*"        "*"            "*"         
16  ( 1 ) " "      "*"          "*"      "*"        "*"            "*"         
17  ( 1 ) " "      "*"          "*"      "*"        "*"            "*"         
18  ( 1 ) " "      "*"          "*"      "*"        "*"            "*"         
19  ( 1 ) " "      "*"          "*"      "*"        "*"            "*"         
20  ( 1 ) " "      "*"          "*"      "*"        "*"            "*"         
          BsmtFin_SF_1 Bsmt_Unf_SF Total_Bsmt_SF First_Flr_SF Second_Flr_SF
1  ( 1 )  " "          " "         " "           " "          " "          
2  ( 1 )  " "          " "         " "           " "          " "          
3  ( 1 )  " "          " "         "*"           " "          " "          
4  ( 1 )  " "          " "         "*"           " "          " "          
5  ( 1 )  " "          " "         "*"           " "          " "          
6  ( 1 )  " "          " "         "*"           " "          " "          
7  ( 1 )  " "          " "         "*"           " "          " "          
8  ( 1 )  " "          "*"         "*"           " "          " "          
9  ( 1 )  "*"          "*"         "*"           " "          " "          
10  ( 1 ) " "          "*"         "*"           " "          " "          
11  ( 1 ) " "          "*"         "*"           " "          " "          
12  ( 1 ) " "          "*"         "*"           " "          " "          
13  ( 1 ) " "          "*"         "*"           " "          " "          
14  ( 1 ) " "          "*"         "*"           "*"          "*"          
15  ( 1 ) " "          "*"         "*"           "*"          "*"          
16  ( 1 ) " "          "*"         "*"           "*"          "*"          
17  ( 1 ) " "          "*"         "*"           "*"          "*"          
18  ( 1 ) " "          "*"         "*"           "*"          "*"          
19  ( 1 ) " "          "*"         "*"           "*"          "*"          
20  ( 1 ) " "          "*"         "*"           "*"          "*"          
          Gr_Liv_Area Bsmt_Full_Bath Bsmt_Half_Bath Full_Bath Half_Bath
1  ( 1 )  "*"         " "            " "            " "       " "      
2  ( 1 )  "*"         " "            " "            " "       " "      
3  ( 1 )  "*"         " "            " "            " "       " "      
4  ( 1 )  "*"         " "            " "            " "       " "      
5  ( 1 )  "*"         " "            " "            " "       " "      
6  ( 1 )  "*"         " "            " "            " "       " "      
7  ( 1 )  "*"         " "            " "            " "       " "      
8  ( 1 )  "*"         " "            " "            " "       " "      
9  ( 1 )  " "         " "            " "            " "       " "      
10  ( 1 ) "*"         " "            " "            " "       " "      
11  ( 1 ) "*"         " "            " "            " "       " "      
12  ( 1 ) "*"         " "            " "            " "       " "      
13  ( 1 ) "*"         "*"            " "            " "       " "      
14  ( 1 ) " "         "*"            " "            " "       " "      
15  ( 1 ) " "         "*"            " "            " "       " "      
16  ( 1 ) " "         "*"            " "            " "       " "      
17  ( 1 ) " "         "*"            "*"            " "       " "      
18  ( 1 ) " "         "*"            "*"            " "       " "      
19  ( 1 ) " "         "*"            "*"            "*"       " "      
20  ( 1 ) " "         "*"            "*"            "*"       " "      
          Bedroom_AbvGr TotRms_AbvGrd Fireplaces Garage_Cars Garage_Area
1  ( 1 )  " "           " "           " "        " "         " "        
2  ( 1 )  " "           " "           " "        " "         " "        
3  ( 1 )  " "           " "           " "        " "         " "        
4  ( 1 )  " "           " "           " "        " "         "*"        
5  ( 1 )  "*"           " "           " "        "*"         " "        
6  ( 1 )  " "           " "           "*"        " "         "*"        
7  ( 1 )  " "           " "           "*"        " "         "*"        
8  ( 1 )  " "           " "           "*"        " "         "*"        
9  ( 1 )  " "           " "           " "        " "         " "        
10  ( 1 ) "*"           " "           "*"        " "         "*"        
11  ( 1 ) "*"           " "           "*"        " "         "*"        
12  ( 1 ) "*"           " "           "*"        " "         "*"        
13  ( 1 ) "*"           " "           "*"        " "         "*"        
14  ( 1 ) "*"           " "           "*"        " "         "*"        
15  ( 1 ) "*"           "*"           "*"        " "         "*"        
16  ( 1 ) "*"           "*"           "*"        "*"         "*"        
17  ( 1 ) "*"           "*"           "*"        "*"         "*"        
18  ( 1 ) "*"           "*"           "*"        "*"         "*"        
19  ( 1 ) "*"           "*"           "*"        "*"         "*"        
20  ( 1 ) "*"           "*"           "*"        "*"         "*"        
          Wood_Deck_SF Mo_Sold Year_Sold Longitude Latitude
1  ( 1 )  " "          " "     " "       " "       " "     
2  ( 1 )  " "          " "     " "       " "       " "     
3  ( 1 )  " "          " "     " "       " "       " "     
4  ( 1 )  " "          " "     " "       " "       " "     
5  ( 1 )  " "          " "     " "       " "       " "     
6  ( 1 )  " "          " "     " "       " "       " "     
7  ( 1 )  " "          " "     " "       " "       " "     
8  ( 1 )  " "          " "     " "       " "       " "     
9  ( 1 )  " "          " "     " "       " "       " "     
10  ( 1 ) " "          " "     " "       " "       "*"     
11  ( 1 ) " "          " "     " "       " "       "*"     
12  ( 1 ) " "          " "     " "       " "       "*"     
13  ( 1 ) " "          " "     " "       " "       "*"     
14  ( 1 ) " "          " "     " "       " "       "*"     
15  ( 1 ) " "          " "     " "       " "       "*"     
16  ( 1 ) " "          " "     " "       " "       "*"     
17  ( 1 ) " "          " "     " "       " "       "*"     
18  ( 1 ) "*"          " "     " "       " "       "*"     
19  ( 1 ) "*"          " "     " "       " "       "*"     
20  ( 1 ) "*"          "*"     " "       " "       "*"     
glmnet 

1760 samples
 212 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 1583, 1584, 1584, 1584, 1584, 1584, ... 
Resampling results across tuning parameters:

  lambda  RMSE      Rsquared   MAE     
  1e-03   30185.44  0.8601485  18804.58
  1e-02   30185.44  0.8601485  18804.58
  1e-01   30185.44  0.8601485  18804.58
  1e+00   30185.44  0.8601485  18804.58
  1e+01   30159.59  0.8603776  18786.55
  1e+02   29795.76  0.8632971  18419.30

Tuning parameter 'alpha' was held constant at a value of 1
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 1 and lambda = 100.
Code
# TREE WITH PRUNING
tree_fit <- train(
  Sale_Price ~ .,
  data = train_data,
  method = "rpart",
  tuneLength = 10,
  trControl = kfolds
)

tree_fit
rmse_tree <- tree_fit$results$RMSE[tree_fit$results$cp == tree_fit$bestTune$cp]
Output
CART 

1760 samples
  53 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 1584, 1584, 1584, 1584, 1584, 1584, ... 
Resampling results across tuning parameters:

  cp          RMSE      Rsquared   MAE     
  0.01545238  44221.35  0.6831832  31068.34
  0.01692598  45864.79  0.6620878  32212.33
  0.01731402  46285.56  0.6573025  32287.91
  0.02150539  46731.88  0.6486357  32870.75
  0.02373662  46999.47  0.6442860  32963.89
  0.03074978  48111.18  0.6264282  33892.18
  0.04501678  51539.62  0.5700389  36538.57
  0.06902094  54475.98  0.5272571  38168.43
  0.11884697  59307.37  0.4285324  42921.00
  0.39785290  72891.77  0.3086824  53077.25

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was cp = 0.01545238.
Code
#BAGGING
bag <- train(
  Sale_Price ~ .,
  data = train_data,
  method = "treebag",
  trControl = kfolds
)

bag
rmse_bag <- bag$results$RMSE
Output
Bagged CART 

1760 samples
  53 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 1584, 1584, 1584, 1584, 1584, 1585, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  36066.84  0.7938594  24344.95
Code
# RANDOM FOREST
library(randomForest)
random_forest <- train(
  Sale_Price ~ ., 
  data = train_data,
  method = "rf",
  trControl = trainControl(method = "cv", number = 5),  # 5-fold to speed up
  ntree = 200
)
random_forest
rmse_rf <- random_forest$results$RMSE[random_forest$results$mtry == random_forest$bestTune$mtry]
Output
Random Forest 

1760 samples
  53 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 1408, 1408, 1408, 1409, 1407 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   MAE     
    2   45962.30  0.8029054  30054.16
  106   28255.91  0.8776451  17307.71
  211   28860.47  0.8711075  17566.15

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 106.
Code
# Combine RMSEs into a summary table
results_table <- data.frame(
  Model = c("Best Subset", "LASSO", "Tree (Pruned)", "Bagging", "Random Forest"),
  RMSE  = c(rmse_best, rmse_lasso, rmse_tree, rmse_bag, rmse_rf)
)

# View results
results_table
Output
          Model     RMSE
1   Best Subset 35774.62
2         LASSO 29795.76
3 Tree (Pruned) 44221.35
4       Bagging 36066.84
5 Random Forest 28255.91

Based on the models, the lasso regression has the lowest RMSE and highest R-squared value, making it the most viable choice for housing price prediction. Additionally, since lasso regression preserves the important variables and assigns coefficients to them, we had the added advantage of seeing the marginal effects of each explanatory variable on housing sale price.