#1. Data Loading and High Price Definition
In this section I load the Ames Housing dataset and define a binary
outcome called HighPrice. Homes in the top 5 percent of the
sale price distribution are coded as 1 and all other homes are coded as
0. This converts a dollar sale price into a simple high or not high
classification, which is easier to work with in probability models.
In practical terms, this creates a group of “luxury” homes at the top of the market and treats the rest as regular homes. A real estate analyst might be less interested in predicting the exact sale price and more interested in whether a home belongs in this upper segment.
library(tidyverse)  # provides read_csv(), %>%, the dplyr verbs, and ggplot2
housing <- read_csv("AmesHousing.csv")
#95th percentile cutoff
price_95 <- quantile(housing$SalePrice, 0.95, na.rm = TRUE)
price_95
## 95%
## 335000
#Create high price indicator and select relevant variables
housing_clean <- housing %>%
  mutate(
    HighPrice = ifelse(SalePrice > price_95, 1, 0)
  ) %>%
  select(
    HighPrice,
    SalePrice,
    `Gr Liv Area`,
    `Overall Qual`,
    `Year Built`,
    Neighborhood,
    `Central Air`,
    `Garage Type`,
    `Garage Cars`,
    `Full Bath`,
    `Bedroom AbvGr`
  ) %>%
  drop_na()
dim(housing_clean)
## [1] 2772 11
head(housing_clean)
## # A tibble: 6 × 11
## HighPrice SalePrice `Gr Liv Area` `Overall Qual` `Year Built` Neighborhood
## <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 0 215000 1656 6 1960 NAmes
## 2 0 105000 896 5 1961 NAmes
## 3 0 172000 1329 6 1958 NAmes
## 4 0 244000 2110 7 1968 NAmes
## 5 0 189900 1629 5 1997 Gilbert
## 6 0 195500 1604 6 1998 Gilbert
## # ℹ 5 more variables: `Central Air` <chr>, `Garage Type` <chr>,
## # `Garage Cars` <dbl>, `Full Bath` <dbl>, `Bedroom AbvGr` <dbl>
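The price_95 value, $335,000 here, is the threshold for being considered
high price. Because the indicator requires a sale price strictly above the
95th percentile, only about 5 percent of homes receive HighPrice = 1
(144 of the 2,772 homes that remain after rows with missing values are
dropped). All other homes fall in the much larger non high price group.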
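The mechanics of the cutoff can be checked on a toy vector. This is a minimal sketch with made-up prices rather than the Ames data; `toy_price`, `toy_cut`, and `toy_high` are illustrative names that do not appear in the analysis.

```r
# Made-up prices: 100 homes priced $1,000 to $100,000 in $1,000 steps
toy_price <- seq(1000, 100000, by = 1000)

# 95th percentile (the default quantile type interpolates between order statistics)
toy_cut <- quantile(toy_price, 0.95, na.rm = TRUE)

# Strict inequality: a home priced exactly at the cutoff is NOT coded as high price
toy_high <- ifelse(toy_price > toy_cut, 1, 0)

sum(toy_high)   # number of homes above the cutoff
mean(toy_high)  # share of homes above the cutoff
```

With ties at the cutoff, or after dropping rows, the share coded as 1 can drift slightly away from exactly 5 percent.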
#2. Creating a Dummy Variable for Central Air
Here I create a dummy variable for central air conditioning. The
original column is coded as "Y" or "N".
I convert it to a 0–1 variable so that its regression coefficient reads
directly as the difference between homes with and without central air.
This reflects a general idea in data preparation. Many important housing features are categorical. To use them in regression, we translate them into dummy variables that signal whether a feature is present.
housing_clean <- housing_clean %>%
  mutate(
    CentralAir_Dummy = ifelse(`Central Air` == "Y", 1, 0)
  )
glimpse(housing_clean)
## Rows: 2,772
## Columns: 12
## $ HighPrice <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,…
## $ SalePrice <dbl> 215000, 105000, 172000, 244000, 189900, 195500, 21350…
## $ `Gr Liv Area` <dbl> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616, …
## $ `Overall Qual` <dbl> 6, 5, 6, 7, 5, 6, 8, 8, 8, 7, 6, 6, 6, 7, 8, 8, 8, 9,…
## $ `Year Built` <dbl> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 1995,…
## $ Neighborhood <chr> "NAmes", "NAmes", "NAmes", "NAmes", "Gilbert", "Gilbe…
## $ `Central Air` <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"…
## $ `Garage Type` <chr> "Attchd", "Attchd", "Attchd", "Attchd", "Attchd", "At…
## $ `Garage Cars` <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3,…
## $ `Full Bath` <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, 1,…
## $ `Bedroom AbvGr` <dbl> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, 1,…
## $ CentralAir_Dummy <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
Central air is often viewed as a comfort and quality feature. It is reasonable to expect that homes with central air are more likely to be high price.
#3. Simple Dummy Regression: Central Air and High Price
This section estimates a regression where the only predictor is the
central air dummy. Since HighPrice is coded 0 or 1, the intercept is the
share of high price homes among those without central air, and the slope
is the difference in that share between homes with and without central
air.
model_air <- lm(HighPrice ~ CentralAir_Dummy, data = housing_clean)
summary(model_air)
##
## Call:
## lm(formula = HighPrice ~ CentralAir_Dummy, data = housing_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.05467 -0.05467 -0.05467 -0.05467 0.94533
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.850e-15 1.887e-02 0.000 1.00000
## CentralAir_Dummy 5.467e-02 1.936e-02 2.824 0.00478 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2217 on 2770 degrees of freedom
## Multiple R-squared: 0.002871, Adjusted R-squared: 0.002511
## F-statistic: 7.975 on 1 and 2770 DF, p-value: 0.004777
housing_clean %>%
  group_by(CentralAir_Dummy) %>%
  summarise(mean_high = mean(HighPrice), n = n())
## # A tibble: 2 × 3
## CentralAir_Dummy mean_high n
## <dbl> <dbl> <int>
## 1 0 0 138
## 2 1 0.0547 2634
The coefficient on CentralAir_Dummy is about 0.055. This
means that homes with central air have, on average, a 5.5 percentage
point higher probability of being high price compared with homes without
central air. The group means confirm this: none of the 138 homes without
central air is in the high price group, while about 5.5 percent of the
2,634 homes with central air are. The difference is statistically
significant (p = 0.005), but with an R-squared of about 0.003, central
air alone explains almost none of the variation in high price status.
In the housing market, this fits the idea that central air is a common but desirable feature. It is more prevalent among better homes, though it is not enough by itself to define a luxury property.
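The link between the regression coefficients and the group means holds for any regression of a 0–1 outcome on a single dummy. The sketch below shows it on made-up data (not the Ames sample); the data frame `toy` and its columns are illustrative.

```r
# Made-up data: 10 homes without central air (none high price),
# 40 homes with central air (4 high price)
toy <- data.frame(
  air  = c(rep(0, 10), rep(1, 40)),
  high = c(rep(0, 10), rep(c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0), 4))
)

fit <- lm(high ~ air, data = toy)
group_means <- tapply(toy$high, toy$air, mean)

coef(fit)                                        # intercept and slope
group_means                                      # share of high price homes by group
unname(group_means["1"] - group_means["0"])      # equals the slope
```

The intercept matches the share in the group coded 0, and the slope matches the gap between the two shares.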
#4. Neighborhood Effects for the Top Five Neighborhoods
Here I examine whether location helps explain which homes are high price. I first select the five most common neighborhoods in the data and then estimate a model with neighborhood indicators. One neighborhood becomes the reference group, and the other coefficients measure differences relative to that group.
Location is a major driver of housing value. Neighborhoods differ in amenities, school quality, lot sizes, and reputation, and these factors are usually reflected in prices.
top_neighborhoods <- housing_clean %>%
  count(Neighborhood, sort = TRUE) %>%
  head(5) %>%
  pull(Neighborhood)
top_neighborhoods
## [1] "NAmes" "CollgCr" "OldTown" "Somerst" "NridgHt"
housing_top_nbhd <- housing_clean %>%
  filter(Neighborhood %in% top_neighborhoods)
model_nbhd <- lm(HighPrice ~ factor(Neighborhood), data = housing_top_nbhd)
summary(model_nbhd)
##
## Call:
## lm(formula = HighPrice ~ factor(Neighborhood), data = housing_top_nbhd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39759 -0.01887 -0.00478 -0.00231 0.99769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.01887 0.01291 1.461 0.144
## factor(Neighborhood)NAmes -0.01656 0.01639 -1.010 0.313
## factor(Neighborhood)NridgHt 0.37872 0.02080 18.205 <2e-16 ***
## factor(Neighborhood)OldTown -0.01408 0.01944 -0.724 0.469
## factor(Neighborhood)Somerst 0.03058 0.02023 1.512 0.131
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2102 on 1250 degrees of freedom
## Multiple R-squared: 0.2796, Adjusted R-squared: 0.2773
## F-statistic: 121.3 on 4 and 1250 DF, p-value: < 2.2e-16
The neighborhood that does not appear as its own row in the output, here CollgCr, is the reference category (R uses the first factor level in alphabetical order). The intercept gives its average probability of being high price, about 1.9 percent. The other coefficients measure how much more or less likely homes in each of the other neighborhoods are to be high price.
In my results, NridgHt has by far the largest coefficient, 0.379, and it is highly significant. Homes there are about 38 percentage points more likely to be high price than homes in CollgCr, which puts roughly 40 percent of NridgHt homes in the high price group. The other three neighborhoods do not differ significantly from the reference. The R-squared is around 0.28, which means neighborhood alone explains a sizable share of variation in high price status. This matches the common observation that “location matters” in real estate.
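The choice of reference group changes how the coefficients read but not the fitted group shares. As a minimal sketch on made-up data (the counts below are invented, only the neighborhood labels are borrowed from the output), `relevel()` can set a different reference level:

```r
# Made-up data: three neighborhoods with different shares of high price homes
toy <- data.frame(
  nbhd = rep(c("CollgCr", "NAmes", "NridgHt"), each = 10),
  high = c(rep(0, 10),                 # CollgCr: 0 of 10
           c(1, rep(0, 9)),            # NAmes:   1 of 10
           c(rep(1, 4), rep(0, 6)))    # NridgHt: 4 of 10
)

# Default: the first level alphabetically (CollgCr) is the reference
fit_default <- lm(high ~ factor(nbhd), data = toy)

# relevel() makes NridgHt the reference instead
toy$nbhd_ref <- relevel(factor(toy$nbhd), ref = "NridgHt")
fit_releveled <- lm(high ~ nbhd_ref, data = toy)

coef(fit_default)    # intercept = CollgCr share; others are differences from it
coef(fit_releveled)  # intercept = NridgHt share; others are differences from it
```

Picking the reference group that is most natural for the comparison of interest makes the table easier to read, since every other coefficient is a gap relative to it.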
#5. Interaction Between Living Area and Neighborhood
Next I allow the effect of living area to differ across neighborhoods by including an interaction term between square footage and neighborhood. To keep the model simple, I restrict the data to the two most common neighborhoods among the top five, NAmes and CollgCr.
The idea is that adding space may increase value more in some neighborhoods than others. Large homes in premium locations often attract a stronger price premium than large homes in ordinary areas.
housing_two_nbhd <- housing_top_nbhd %>%
  filter(Neighborhood %in% top_neighborhoods[1:2])
model_interaction <- lm(
  HighPrice ~ `Gr Liv Area` * factor(Neighborhood),
  data = housing_two_nbhd
)
summary(model_interaction)
##
## Call:
## lm(formula = HighPrice ~ `Gr Liv Area` * factor(Neighborhood),
## data = housing_two_nbhd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.12148 -0.01588 -0.00230 0.00413 0.96639
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.244e-01 2.079e-02 -5.983 3.51e-09
## `Gr Liv Area` 9.552e-05 1.338e-05 7.141 2.35e-12
## factor(Neighborhood)NAmes 9.815e-02 2.561e-02 3.833 0.000138
## `Gr Liv Area`:factor(Neighborhood)NAmes -7.339e-05 1.739e-05 -4.221 2.76e-05
##
## (Intercept) ***
## `Gr Liv Area` ***
## factor(Neighborhood)NAmes ***
## `Gr Liv Area`:factor(Neighborhood)NAmes ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08878 on 694 degrees of freedom
## Multiple R-squared: 0.0804, Adjusted R-squared: 0.07643
## F-statistic: 20.23 on 3 and 694 DF, p-value: 1.417e-12
The coefficient on Gr Liv Area shows how square footage
affects high price probability in the reference neighborhood, CollgCr:
about 0.0096 percentage points per square foot, or roughly 1 percentage
point per 100 square feet. The interaction term shows how that slope
changes in NAmes. It is negative and significant, so the NAmes slope is
9.552e-05 - 7.339e-05 = 2.213e-05, less than a quarter of the CollgCr
slope. Square footage therefore raises the probability of being high
price in both places, but the effect of additional space is much weaker
in NAmes than in CollgCr.
This reflects a realistic pattern. In some areas, extra space commands a strong premium because buyers want larger homes and the surrounding properties are also large. In other areas, the payoff to additional square footage is smaller.
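The neighborhood-specific slopes follow from simple arithmetic on the reported coefficients: the slope in the non-reference neighborhood is the main effect plus the interaction term. The sketch below copies the two estimates from the output above; the 500 square foot increment is an arbitrary illustration.

```r
# Coefficients copied from the interaction model output above
b_area        <- 9.552e-05   # slope in the reference neighborhood (CollgCr)
b_interaction <- -7.339e-05  # change in slope for NAmes

slope_collgcr <- b_area
slope_names   <- b_area + b_interaction

# Change in probability for 500 extra square feet, in percentage points
effect_500 <- 100 * 500 * c(CollgCr = slope_collgcr, NAmes = slope_names)
effect_500
```

Under these estimates, the same 500 square feet is worth several times more in CollgCr than in NAmes in terms of high price probability.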
#6. Linear Probability Model (LPM) with Multiple Predictors
Now I estimate a Linear Probability Model that includes several important housing characteristics at once. The predictors are living area, overall quality, year built, central air, and garage capacity. The LPM uses ordinary least squares even though the outcome is binary.
Each coefficient can be interpreted as the change in the probability of being high price when the predictor increases by one unit, holding the others constant.
lpm_model <- lm(
  HighPrice ~ `Gr Liv Area` + `Overall Qual` + `Year Built` +
    CentralAir_Dummy + `Garage Cars`,
  data = housing_clean
)
summary(lpm_model)
##
## Call:
## lm(formula = HighPrice ~ `Gr Liv Area` + `Overall Qual` + `Year Built` +
## CentralAir_Dummy + `Garage Cars`, data = housing_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.57539 -0.09927 -0.03046 0.04582 0.88129
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.862e-01 3.246e-01 1.190 0.2343
## `Gr Liv Area` 8.617e-05 9.553e-06 9.021 < 2e-16 ***
## `Overall Qual` 4.659e-02 4.024e-03 11.579 < 2e-16 ***
## `Year Built` -4.050e-04 1.727e-04 -2.345 0.0191 *
## CentralAir_Dummy -4.010e-02 1.802e-02 -2.225 0.0261 *
## `Garage Cars` 4.524e-02 7.681e-03 5.891 4.31e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1935 on 2766 degrees of freedom
## Multiple R-squared: 0.2415, Adjusted R-squared: 0.2401
## F-statistic: 176.1 on 5 and 2766 DF, p-value: < 2.2e-16
housing_clean$lpm_pred <- predict(lpm_model)
The coefficients show clear patterns. Larger homes, higher overall quality, and more garage spaces all increase the probability of being high price. The effect of one extra square foot is small (about 0.009 percentage points), but it adds up to roughly 4.3 percentage points over 500 square feet and 8.6 over 1,000. Each additional point of quality raises the probability by about 4.7 percentage points and each additional garage space by about 4.5. Both are large relative to a base rate of about 5 percent, which confirms that quality is a central feature of luxury housing.
Interestingly, after controlling for size and quality, the central air coefficient turns negative (about -4 percentage points), and the year built coefficient is slightly negative as well. This suggests that once we account for structural features, central air by itself does not distinguish luxury homes. The overall R-squared is about 0.24, which means these simple structural predictors explain roughly one quarter of the variation in high price status.
#7. Boundary Violations in the LPM
A known limitation of the Linear Probability Model is that its predictions can fall outside the valid range of 0 to 1. Here I count how many predicted values are negative or greater than one.
below_zero <- sum(housing_clean$lpm_pred < 0)
above_one <- sum(housing_clean$lpm_pred > 1)
total_preds <- length(housing_clean$lpm_pred)
below_zero
## [1] 950
above_one
## [1] 0
total_preds
## [1] 2772
In my results, 950 predictions are negative and none exceed one, out of 2,772 observations. About 34 percent of the fitted values are therefore not valid probabilities.
Given the signs of the LPM coefficients, the most negative predictions go to small, low quality homes with little or no garage capacity. These homes are extremely unlikely to be high price, but a negative probability still has no interpretation. This illustrates why the LPM is often considered a rough approximation rather than a final probability model.
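To see how a negative fitted value arises, the reported LPM coefficients can be applied to a single home by hand. The home below is hypothetical (its characteristics are made up for illustration); the coefficients are copied from the LPM output in section 6.

```r
# Coefficients copied from the LPM output in section 6
b <- c(intercept = 3.862e-01, area = 8.617e-05, qual = 4.659e-02,
       year = -4.050e-04, air = -4.010e-02, garage = 4.524e-02)

# Hypothetical home (made-up values): 900 sq ft, quality 4, built 1950,
# central air, one-car garage
home <- c(intercept = 1, area = 900, qual = 4, year = 1950, air = 1, garage = 1)

# Linear prediction: sum of coefficient times characteristic
lpm_hat <- sum(b * home)
lpm_hat  # about -0.13: a negative "probability"
```

Nothing in ordinary least squares constrains this sum to lie between 0 and 1, which is why such values appear for homes far from the high price segment.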
#8. Logistic Regression Model
To address the limitations of the LPM, I estimate a logistic regression model using the same predictors. Logistic regression models the log odds of being high price and automatically keeps predicted probabilities between 0 and 1.
logit_model <- glm(
  HighPrice ~ `Gr Liv Area` + `Overall Qual` + `Year Built` +
    CentralAir_Dummy + `Garage Cars`,
  data = housing_clean,
  family = binomial
)
summary(logit_model)
##
## Call:
## glm(formula = HighPrice ~ `Gr Liv Area` + `Overall Qual` + `Year Built` +
## CentralAir_Dummy + `Garage Cars`, family = binomial, data = housing_clean)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.339e+01 6.931e+02 -0.106 0.9157
## `Gr Liv Area` 2.139e-03 3.219e-04 6.646 3.01e-11 ***
## `Overall Qual` 1.754e+00 1.829e-01 9.589 < 2e-16 ***
## `Year Built` 1.911e-02 1.076e-02 1.776 0.0758 .
## CentralAir_Dummy 1.157e+01 6.928e+02 0.017 0.9867
## `Garage Cars` 1.270e+00 3.006e-01 4.223 2.41e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1132.1 on 2771 degrees of freedom
## Residual deviance: 419.5 on 2766 degrees of freedom
## AIC: 431.5
##
## Number of Fisher Scoring iterations: 18
housing_clean$logit_pred <- predict(logit_model, type = "response")
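The logistic model confirms the importance of living area, overall quality, and garage spaces. The coefficients for these variables are positive and highly significant, indicating that larger, higher quality homes with more parking are much more likely to be high price. Year built is positive but significant only at the 10 percent level. Central air is not reliably estimated: the group means in section 3 show that no home without central air is high price, so the dummy quasi-completely separates the outcome. The maximum likelihood estimate then drifts toward infinity, which explains the very large coefficient, the standard error of about 693, and the 18 Fisher scoring iterations.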
Unlike the LPM, all predicted values from this model lie in the valid probability range.
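A predictor with an empty cell in its cross-tab against the outcome tends to produce a very large coefficient with an even larger standard error, as seen for central air. The sketch below reproduces the pattern on made-up data (not the Ames sample); checking the cross-tab first is one simple way to spot the issue.

```r
# Made-up data: no home without central air is high price (a zero cell)
toy <- data.frame(
  air  = c(rep(0, 10), rep(1, 40)),
  high = c(rep(0, 10), rep(c(0, 0, 0, 1), 10))
)

# The cross-tab shows the problem before any model is fitted
xt <- table(air = toy$air, high = toy$high)
xt

# glm() still returns estimates, but the air coefficient and its standard
# error are huge; suppressWarnings() guards against fitting warnings that
# separation can trigger
fit <- suppressWarnings(glm(high ~ air, data = toy, family = binomial))
coef(summary(fit))["air", c("Estimate", "Std. Error")]
```

When this happens, the fitted probabilities are still usable, but the coefficient on the separating variable should not be interpreted.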
#9. Comparing LPM and Logistic Predictions
This section compares the predicted probabilities from the LPM and the logistic model. Each point in the plot represents one home, with its LPM prediction on the x axis and its logistic prediction on the y axis.
plot_data <- housing_clean %>%
  select(lpm_pred, logit_pred)
ggplot(plot_data, aes(x = lpm_pred, y = logit_pred)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Comparison of LPM and Logistic Predicted Probabilities",
    x = "LPM predicted probability",
    y = "Logistic predicted probability"
  ) +
  theme_minimal()
The scatterplot shows that the models agree in the middle of the range but diverge at the extremes. The LPM produces negative predictions for many low value homes, while the logistic model gives them probabilities close to zero but still valid. For high value homes, the logistic predictions rise sharply toward one, while the LPM follows a straight line. This visual makes clear why logistic regression is usually preferred for probability modeling.
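The same contrast shows up on a small made-up data set, which makes it easy to inspect the fitted values directly. This is a sketch only; `toy`, `lpm_hat`, and `logit_hat` are illustrative names, and the data are invented.

```r
# Made-up data: the outcome becomes more likely as x grows, with some overlap
toy <- data.frame(
  x = 1:20,
  y = c(rep(0, 14), 1, 0, 1, 0, 1, 1)
)

lpm_fit   <- lm(y ~ x, data = toy)
logit_fit <- glm(y ~ x, data = toy, family = binomial)

lpm_hat   <- predict(lpm_fit)
logit_hat <- predict(logit_fit, type = "response")

sum(lpm_hat < 0)    # LPM fitted values below zero
range(logit_hat)    # logistic fitted values stay strictly inside (0, 1)
```

The straight line has to dip below zero at low values of x in order to fit the rise at high values, while the logistic curve flattens out instead.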
#10. Odds Ratios
Logistic regression coefficients are expressed in log odds, which are hard to interpret directly. Exponentiating the coefficients yields odds ratios, which describe how the odds of being high price change when a predictor increases by one unit.
odds_ratios <- exp(coef(logit_model))
odds_ratios
## (Intercept) `Gr Liv Area` `Overall Qual` `Year Built`
## 1.337081e-32 1.002142e+00 5.778461e+00 1.019296e+00
## CentralAir_Dummy `Garage Cars`
## 1.056400e+05 3.559760e+00
or_quality <- odds_ratios["`Overall Qual`"]
or_quality
## `Overall Qual`
## 5.778461
beta_area <- coef(logit_model)["`Gr Liv Area`"]
or_area_500 <- exp(beta_area * 500)
or_area_500
## `Gr Liv Area`
## 2.914302
The odds ratio for overall quality is about 5.8. This means a one point increase in quality multiplies the odds of being high price by almost six. The odds ratio for an extra 500 square feet is about 2.9, so adding that much space nearly triples the odds of being high price. These results reinforce the idea that quality and size are key features of luxury homes.
The odds ratio for garage spaces is also well above one (about 3.6), showing that more parking capacity is a strong signal of high price status. Year built has a more modest effect. The central air odds ratio is unstable and very large (about 105,640) because every high price home in this sample has central air. The data cannot pin down this coefficient, so the odds ratio should not be interpreted.
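An odds ratio multiplies the odds, not the probability, so the implied change in probability depends on the starting point. The sketch below converts the reported quality odds ratio into a probability change for a hypothetical home with a 5 percent baseline probability (the baseline is an assumed value chosen for illustration).

```r
# Odds ratio for overall quality, copied from the output above
or_quality <- 5.778461

# Hypothetical baseline probability of 5 percent (an assumed value)
p_base    <- 0.05
odds_base <- p_base / (1 - p_base)

# One extra quality point multiplies the odds, not the probability
odds_new <- odds_base * or_quality
p_new    <- odds_new / (1 + odds_new)

p_new  # about 0.23, not 0.05 * 5.8 = 0.29
```

At higher baseline probabilities the gap between "odds times six" and "probability times six" grows, since a probability cannot exceed one.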
#11. Marginal Effects of Living Area
While odds ratios focus on changes in odds, marginal effects translate coefficients into changes in actual probability. This section computes three types of marginal effects for living area: the average marginal effect, the marginal effect at the means, and the marginal effect when the predicted probability is 0.5.
housing_clean$logit_prob <- predict(logit_model, type = "response")
beta_area <- coef(logit_model)["`Gr Liv Area`"]
# Average marginal effect (AME)
housing_clean$me_area <- beta_area * housing_clean$logit_prob * (1 - housing_clean$logit_prob)
AME_area <- mean(housing_clean$me_area)
AME_area
## [1] 4.799145e-05
# Marginal effect at the means (MEM)
means_df <- data.frame(
  area = mean(housing_clean$`Gr Liv Area`),
  qual = mean(housing_clean$`Overall Qual`),
  year = mean(housing_clean$`Year Built`),
  air = mean(housing_clean$CentralAir_Dummy),
  garage = mean(housing_clean$`Garage Cars`)
)
names(means_df) <- c("Gr Liv Area", "Overall Qual", "Year Built", "CentralAir_Dummy", "Garage Cars")
p_means <- predict(logit_model, newdata = means_df, type = "response")
MEM_area <- beta_area * p_means * (1 - p_means)
MEM_area
## `Gr Liv Area`
## 5.628528e-07
# Marginal effect at p = 0.5
ME_at_50 <- beta_area * 0.5 * 0.5
ME_at_50
## `Gr Liv Area`
## 0.0005348151
The average marginal effect is about 4.8e-05. On average, an extra square foot increases the probability of being high price by about 0.005 percentage points. This effect is tiny on its own but becomes meaningful at scale: the same linear approximation gives about 2.4 percentage points for 500 additional square feet.
The marginal effect at the means is even smaller because the average home has a very low baseline probability of being high price. The logistic curve is nearly flat in that region, so an extra square foot hardly changes the probability. At a predicted probability of 0.5, the marginal effect is larger, reflecting the fact that the logistic curve is steepest in the middle. This shows how the impact of size depends on where a home lies in the distribution of predicted probabilities.
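All three quantities use the same formula, the coefficient times p(1 - p), evaluated at different probabilities. As a quick check, the sketch below compares that formula with the actual change in probability from one extra square foot, using the reported living area coefficient; the grid of probabilities is arbitrary and the function names are illustrative.

```r
# Living area coefficient copied from the logistic output above
beta_area <- 2.139e-03

# Analytic marginal effect of a logistic model at probability p
marginal_effect <- function(p, beta) beta * p * (1 - p)

# Finite-difference check: change in probability from one extra square foot,
# starting from the linear predictor that corresponds to probability p
finite_diff <- function(p, beta) plogis(qlogis(p) + beta) - p

p_grid <- c(0.01, 0.10, 0.50, 0.90)
me_table <- cbind(
  p        = p_grid,
  analytic = marginal_effect(p_grid, beta_area),
  numeric  = finite_diff(p_grid, beta_area)
)
me_table
```

The effect peaks at p = 0.5, where it equals one quarter of the coefficient, and shrinks toward zero at either end of the probability range.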
#12. Classification with Threshold 0.5
So far the models have produced probabilities. To make a discrete yes or no decision, we need a threshold. The common default is 0.5. Homes with predicted probability above 0.5 are classified as high price.
housing_clean$logit_class_50 <- ifelse(housing_clean$logit_pred > 0.5, 1, 0)
cm50 <- table(
  Predicted = housing_clean$logit_class_50,
  Actual = housing_clean$HighPrice
)
cm50
## Actual
## Predicted 0 1
## 0 2607 62
## 1 21 82
TP <- cm50["1", "1"]
FP <- cm50["1", "0"]
TN <- cm50["0", "0"]
FN <- cm50["0", "1"]
accuracy_50 <- (TP + TN) / sum(cm50)
precision_50 <- TP / (TP + FP)
recall_50 <- TP / (TP + FN)
accuracy_50
## [1] 0.9700577
precision_50
## [1] 0.7961165
recall_50
## [1] 0.5694444
With a 0.5 threshold, accuracy is about 97 percent and precision is close to 80 percent, but recall is only about 57 percent. The model correctly predicts most homes as not high price, and when it does predict high price it is usually correct, but it misses 62 of the 144 true high price homes. This happens because the high price group is rare, so many true high price homes receive predicted probabilities below 0.5.
In practice, this means the default threshold is too conservative for detecting high price homes.
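Indexing a confusion matrix by "1" assumes that both classes appear among the predictions. With a strict threshold and a rare outcome that may not hold, and the lookup would then fail. One defensive pattern, shown here on made-up values, is to fix the factor levels before calling `table()`.

```r
# Made-up predictions: none of them reaches the 0.5 threshold
toy_prob   <- c(0.05, 0.10, 0.30, 0.45, 0.20)
toy_actual <- c(0, 0, 1, 1, 0)
toy_class  <- ifelse(toy_prob > 0.5, 1, 0)

# Fixing the factor levels keeps both rows and both columns in the table,
# so indexing by "1" does not fail when a class is never predicted
cm <- table(
  Predicted = factor(toy_class,  levels = c(0, 1)),
  Actual    = factor(toy_actual, levels = c(0, 1))
)
cm

TP <- cm["1", "1"]
FN <- cm["0", "1"]
recall <- TP / (TP + FN)
recall
```

Precision is undefined (0 divided by 0) in this toy case, which is itself a useful signal that the threshold is too strict.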
#13. Classification with Threshold 0.2
To improve recall, I lower the threshold to 0.2. This labels more homes as high price and should catch more of the true high price homes.
housing_clean$logit_class_20 <- ifelse(housing_clean$logit_pred > 0.2, 1, 0)
cm20 <- table(
  Predicted = housing_clean$logit_class_20,
  Actual = housing_clean$HighPrice
)
cm20
## Actual
## Predicted 0 1
## 0 2546 22
## 1 82 122
TP20 <- cm20["1", "1"]
FP20 <- cm20["1", "0"]
TN20 <- cm20["0", "0"]
FN20 <- cm20["0", "1"]
accuracy_20 <- (TP20 + TN20) / sum(cm20)
precision_20 <- TP20 / (TP20 + FP20)
recall_20 <- TP20 / (TP20 + FN20)
accuracy_20
## [1] 0.962482
precision_20
## [1] 0.5980392
recall_20
## [1] 0.8472222
With a 0.2 threshold, accuracy remains high at about 96 percent. Precision falls to about 60 percent, but recall rises to about 85 percent. That means the model now captures most of the actual high price homes instead of missing many of them.
This trade-off makes sense in a real estate context. Treating a mid range home as high price mainly wastes some extra marketing effort, but missing a true high price home may cost a substantial commission. Lowering the threshold is a reasonable choice when false negatives are more costly than false positives.
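The two thresholds are easier to compare side by side. The sketch below recomputes the metrics from the counts in the two confusion matrices above; `classification_metrics()` is a hypothetical helper written for this illustration, not a function from any package.

```r
# Hypothetical helper (not part of the original analysis)
classification_metrics <- function(tp, fp, tn, fn) {
  c(
    accuracy  = (tp + tn) / (tp + fp + tn + fn),
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn)
  )
}

# Counts copied from the two confusion matrices above
comparison <- rbind(
  threshold_0.5 = classification_metrics(tp = 82,  fp = 21, tn = 2607, fn = 62),
  threshold_0.2 = classification_metrics(tp = 122, fp = 82, tn = 2546, fn = 22)
)
round(comparison, 3)
```

Accuracy barely moves between the two rows, which is why it is a poor guide for choosing a threshold when the positive class is rare.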
#14. Train–Test Split and Out-of-Sample Evaluation
To check whether the model generalizes to new data, I split the sample into a training set and a test set. The model is estimated on 75 percent of the data and then evaluated on the remaining 25 percent. This mimics applying the model to future homes that were not used in estimation.
set.seed(123)
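# seq_len() is safe for empty data; floor() makes the sample size an explicit integer
train_index <- sample(seq_len(nrow(housing_clean)), size = floor(0.75 * nrow(housing_clean)))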
train_data <- housing_clean[train_index, ]
test_data <- housing_clean[-train_index, ]
logit_train <- glm(
  HighPrice ~ `Gr Liv Area` + `Overall Qual` + `Year Built` +
    CentralAir_Dummy + `Garage Cars`,
  data = train_data,
  family = binomial
)
test_data$test_pred <- predict(logit_train, newdata = test_data, type = "response")
test_data$test_class_20 <- ifelse(test_data$test_pred > 0.2, 1, 0)
cm_test <- table(
  Predicted = test_data$test_class_20,
  Actual = test_data$HighPrice
)
cm_test
## Actual
## Predicted 0 1
## 0 640 4
## 1 23 26
TP_t <- cm_test["1", "1"]
FP_t <- cm_test["1", "0"]
TN_t <- cm_test["0", "0"]
FN_t <- cm_test["0", "1"]
accuracy_test <- (TP_t + TN_t) / sum(cm_test)
precision_test <- TP_t / (TP_t + FP_t)
recall_test <- TP_t / (TP_t + FN_t)
accuracy_test
## [1] 0.961039
precision_test
## [1] 0.5306122
recall_test
## [1] 0.8666667
The test set accuracy is about 96 percent, very close to the in sample value. Precision falls to about 53 percent, from about 60 percent in sample, while recall is about 87 percent. The test set contains only 30 high price homes, so these rates are noisy, but the model still identifies most high price homes on unseen data and does not lose much performance when moving from training to testing.
This suggests that the logistic model is not badly overfitting and should perform similarly on new listings with comparable characteristics.
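The split above is a simple random sample, so the number of high price homes that land in the test set is left to chance. One common alternative, not used in this analysis, is to sample within each class so that both sets keep the same share of high price homes. A minimal base R sketch on a made-up outcome vector:

```r
set.seed(123)

# Made-up outcome: 20 high price homes out of 400 (5 percent)
toy_y <- c(rep(1, 20), rep(0, 380))

# Sample 75 percent of the row indices within each class separately
idx_by_class <- split(seq_along(toy_y), toy_y)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = floor(0.75 * length(idx)))
}))
test_idx <- setdiff(seq_along(toy_y), train_idx)

mean(toy_y[train_idx])  # share of high price homes in the training set
mean(toy_y[test_idx])   # share of high price homes in the test set
```

Whether stratifying matters here is an empirical question; with roughly 144 high price homes in total, the simple random split may already be close enough.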
#15. Final Diagnostics: ROC and Precision–Recall Curves
As a final check, I plot a ROC curve and a precision–recall curve for the test set. These curves summarize model performance across all possible thresholds rather than focusing on a single cutoff.
library(pROC)   # roc()
library(PRROC)  # pr.curve()
roc_obj <- roc(test_data$HighPrice, test_data$test_pred)
plot(roc_obj, main = "ROC Curve for Logistic Model (Test Set)")
# In PRROC, scores.class0 holds the scores of the positive class
pr_obj <- pr.curve(
  scores.class0 = test_data$test_pred[test_data$HighPrice == 1],
  scores.class1 = test_data$test_pred[test_data$HighPrice == 0],
  curve = TRUE
)
plot(pr_obj, main = "Precision–Recall Curve (Test Set)")
The ROC curve rises well above the diagonal, indicating that the model has good ability to distinguish high price homes from others. On the precision–recall curve, the relevant baseline is the share of high price homes in the test set, about 4 percent (30 of 693), which is the precision a random classifier would achieve. The model stays far above that baseline: at the 0.2 threshold, for example, precision is about 53 percent at 87 percent recall. These plots confirm that the logistic model performs well for this rare event classification problem.
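The ROC curve is often summarized by the area under it (AUC), which equals the probability that a randomly chosen high price home receives a higher score than a randomly chosen other home. The sketch below computes it from ranks in base R on made-up scores; `auc_rank()` is a hypothetical helper for illustration, not the method used by the plotting packages.

```r
# Made-up scores and labels (1 = high price)
toy_score <- c(0.90, 0.80, 0.35, 0.60, 0.30, 0.20, 0.10)
toy_label <- c(1, 1, 1, 0, 0, 0, 0)

# AUC from the rank-sum (Mann-Whitney) identity
auc_rank <- function(score, label) {
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  r <- rank(score)  # average ranks handle ties
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

auc_rank(toy_score, toy_label)  # 11 of 12 positive-negative pairs ordered correctly
```

A value of 0.5 corresponds to the diagonal of the ROC plot, and 1 to a perfect ranking.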
#16. Overall Conclusions
This project showed how dummy variables, neighborhood effects, and probability models can be used to study the determinants of high price homes in the Ames housing market. The analysis began with simple dummy regressions and moved toward a full logistic model with multiple predictors, threshold tuning, and out-of-sample testing.
The main findings are that overall quality, living area, and garage capacity are strong predictors of whether a home belongs to the high price segment. Neighborhood also plays a major role when it is included. Logistic regression provides a better framework than the Linear Probability Model because it produces valid probabilities and captures the nonlinear pattern of how features affect the chance of being high price.
From a real world perspective, the model offers a useful screening tool. A real estate agent or analyst could use it to flag homes that are likely to qualify as high price and to decide where to concentrate marketing resources. The results also suggest that quality, living space, and parking capacity are the features most closely associated with the upper tier of the market, although these are associations within the sample rather than estimates of what a given renovation would return.