Assignment 2: Ames Housing Dummy Variable

#1. Data Loading and High Price Definition

In this section I load the Ames Housing dataset and define a binary outcome called HighPrice. Homes in the top 5 percent of the sale price distribution are coded as 1 and all other homes are coded as 0. This converts a dollar sale price into a simple high or not high classification, which is easier to work with in probability models.

In practical terms, this creates a group of “luxury” homes at the top of the market and treats the rest as regular homes. A real estate analyst might be less interested in predicting the exact sale price and more interested in whether a home belongs in this upper segment.

library(tidyverse)  # provides read_csv(), %>%, and the dplyr verbs used below

housing <- read_csv("AmesHousing.csv")

#95th percentile cutoff
price_95 <- quantile(housing$SalePrice, 0.95, na.rm = TRUE)
price_95

##    95% 
## 335000

#Create high price indicator and select relevant variables
housing_clean <- housing %>%
  mutate(
    HighPrice = ifelse(SalePrice > price_95, 1, 0)
  ) %>%
  select(
    HighPrice,
    SalePrice,
    `Gr Liv Area`,
    `Overall Qual`,
    `Year Built`,
    Neighborhood,
    `Central Air`,
    `Garage Type`,
    `Garage Cars`,
    `Full Bath`,
    `Bedroom AbvGr`
  ) %>%
  drop_na()

dim(housing_clean)

## [1] 2772   11

head(housing_clean)

## # A tibble: 6 × 11
##   HighPrice SalePrice `Gr Liv Area` `Overall Qual` `Year Built` Neighborhood
##       <dbl>     <dbl>         <dbl>          <dbl>        <dbl> <chr>       
## 1         0    215000          1656              6         1960 NAmes       
## 2         0    105000           896              5         1961 NAmes       
## 3         0    172000          1329              6         1958 NAmes       
## 4         0    244000          2110              7         1968 NAmes       
## 5         0    189900          1629              5         1997 Gilbert     
## 6         0    195500          1604              6         1998 Gilbert     
## # ℹ 5 more variables: `Central Air` <chr>, `Garage Type` <chr>,
## #   `Garage Cars` <dbl>, `Full Bath` <dbl>, `Bedroom AbvGr` <dbl>

The price_95 value, $335,000, is the threshold for being considered high price. Because the cutoff is the 95th percentile and the comparison is strict, only about 5 percent of homes receive HighPrice = 1. All other homes remain in the much larger non-high-price group.
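The cutoff logic can be checked on a handful of made-up prices. The sketch below uses a hypothetical vector `toy_prices` (invented for illustration, not taken from the Ames data) with the same `quantile()` call and strict `>` comparison as the code above.

```r
# Hypothetical sale prices, for illustration only (not Ames data)
toy_prices <- c(100, 120, 150, 180, 200, 220, 250, 300, 400, 900) * 1000

# 95th percentile using R's default interpolation between order statistics
toy_cutoff <- quantile(toy_prices, 0.95, names = FALSE)

# Strictly above the cutoff -> 1, otherwise 0
toy_high <- as.integer(toy_prices > toy_cutoff)

toy_cutoff
table(toy_high)
```

Because the comparison is strict, a home that sells exactly at the cutoff would be coded 0.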

#2. Creating a Dummy Variable for Central Air

Here I create a dummy variable for central air conditioning. The original column is coded as "Y" or "N". lm() can handle a character or factor column directly, but an explicit 0–1 indicator makes the coding transparent and the coefficient easy to read, so I convert the column to a 0–1 variable.

This reflects a general idea in data preparation. Many important housing features are categorical. To use them in regression, we translate them into dummy variables that signal whether a feature is present.

housing_clean <- housing_clean %>%
  mutate(
    CentralAir_Dummy = ifelse(`Central Air` == "Y", 1, 0)
  )

glimpse(housing_clean)

## Rows: 2,772
## Columns: 12
## $ HighPrice        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,…
## $ SalePrice        <dbl> 215000, 105000, 172000, 244000, 189900, 195500, 21350…
## $ `Gr Liv Area`    <dbl> 1656, 896, 1329, 2110, 1629, 1604, 1338, 1280, 1616, …
## $ `Overall Qual`   <dbl> 6, 5, 6, 7, 5, 6, 8, 8, 8, 7, 6, 6, 6, 7, 8, 8, 8, 9,…
## $ `Year Built`     <dbl> 1960, 1961, 1958, 1968, 1997, 1998, 2001, 1992, 1995,…
## $ Neighborhood     <chr> "NAmes", "NAmes", "NAmes", "NAmes", "Gilbert", "Gilbe…
## $ `Central Air`    <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"…
## $ `Garage Type`    <chr> "Attchd", "Attchd", "Attchd", "Attchd", "Attchd", "At…
## $ `Garage Cars`    <dbl> 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3,…
## $ `Full Bath`      <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 3, 2, 1,…
## $ `Bedroom AbvGr`  <dbl> 3, 2, 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 3, 2, 1, 4, 4, 1,…
## $ CentralAir_Dummy <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…

Central air is often viewed as a comfort and quality feature. It is reasonable to expect that homes with central air are more likely to be high price.

#3. Simple Dummy Regression: Central Air and High Price

This section estimates a regression where the only predictor is the central air dummy. Since HighPrice is coded 0 or 1, the intercept is the share of high price homes among those without central air, and the slope is the difference in that share between homes with and without central air.

model_air <- lm(HighPrice ~ CentralAir_Dummy, data = housing_clean)
summary(model_air)

## 
## Call:
## lm(formula = HighPrice ~ CentralAir_Dummy, data = housing_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.05467 -0.05467 -0.05467 -0.05467  0.94533 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)   
## (Intercept)      4.850e-15  1.887e-02   0.000  1.00000   
## CentralAir_Dummy 5.467e-02  1.936e-02   2.824  0.00478 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2217 on 2770 degrees of freedom
## Multiple R-squared:  0.002871,   Adjusted R-squared:  0.002511 
## F-statistic: 7.975 on 1 and 2770 DF,  p-value: 0.004777

housing_clean %>%
  group_by(CentralAir_Dummy) %>%
  summarise(mean_high = mean(HighPrice), n = n())

## # A tibble: 2 × 3
##   CentralAir_Dummy mean_high     n
##              <dbl>     <dbl> <int>
## 1                0    0        138
## 2                1    0.0547  2634

The coefficient on CentralAir_Dummy is about 0.055, so homes with central air are, on average, 5.5 percentage points more likely to be high price than homes without it. The group means confirm this: none of the 138 homes without central air is high price, which is why the intercept is numerically zero, compared with about 5.5 percent of the 2,634 homes with central air. The difference is statistically significant (p ≈ 0.005), but with an R-squared of about 0.003, central air alone explains very little of the variation in high price status.

In the housing market, this fits the idea that central air is a common but desirable feature. It is more prevalent among better homes, though it is not enough by itself to define a luxury property.
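The link between the dummy regression and the group means can be reproduced from the counts alone. This is a minimal sketch that rebuilds the two groups from the table above; the count of 144 high price homes with central air is an assumption implied by mean_high = 0.0547 and n = 2634, and the vectors `air` and `high` are reconstructions rather than the original data.

```r
# Counts from the group means table above:
# 138 homes without central air, none high price;
# 2,634 homes with central air, of which about 144 are high price (assumed from 0.0547 * 2634)
air  <- c(rep(0, 138), rep(1, 2634))
high <- c(rep(0, 138), rep(1, 144), rep(0, 2634 - 144))

# Regression of the 0/1 outcome on the 0/1 dummy
fit <- lm(high ~ air)

# Share of high price homes in each group
group_means <- tapply(high, air, mean)

coef(fit)
group_means
```

The intercept matches the mean for the group coded 0, and the slope matches the difference between the two group means.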

#4. Neighborhood Effects for the Top Five Neighborhoods

Here I examine whether location helps explain which homes are high price. I first select the five most common neighborhoods in the data and then estimate a model with neighborhood indicators. One neighborhood becomes the reference group, and the other coefficients measure differences relative to that group.

Location is a major driver of housing value. Neighborhoods differ in amenities, school quality, lot sizes, and reputation, and these factors are usually reflected in prices.

top_neighborhoods <- housing_clean %>%
  count(Neighborhood, sort = TRUE) %>%
  head(5) %>%
  pull(Neighborhood)

top_neighborhoods

## [1] "NAmes"   "CollgCr" "OldTown" "Somerst" "NridgHt"

housing_top_nbhd <- housing_clean %>%
  filter(Neighborhood %in% top_neighborhoods)

model_nbhd <- lm(HighPrice ~ factor(Neighborhood), data = housing_top_nbhd)
summary(model_nbhd)

## 
## Call:
## lm(formula = HighPrice ~ factor(Neighborhood), data = housing_top_nbhd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.39759 -0.01887 -0.00478 -0.00231  0.99769 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  0.01887    0.01291   1.461    0.144    
## factor(Neighborhood)NAmes   -0.01656    0.01639  -1.010    0.313    
## factor(Neighborhood)NridgHt  0.37872    0.02080  18.205   <2e-16 ***
## factor(Neighborhood)OldTown -0.01408    0.01944  -0.724    0.469    
## factor(Neighborhood)Somerst  0.03058    0.02023   1.512    0.131    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2102 on 1250 degrees of freedom
## Multiple R-squared:  0.2796, Adjusted R-squared:  0.2773 
## F-statistic: 121.3 on 4 and 1250 DF,  p-value: < 2.2e-16

The neighborhood that does not appear as its own row in the output, CollgCr, is the reference category. The intercept gives its average probability of being high price, about 1.9 percent. The other coefficients measure how much more or less likely homes in each other neighborhood are to be high price relative to CollgCr.

In my results, only NridgHt differs significantly from the reference. Its coefficient is about 0.38, so homes there are roughly 38 percentage points more likely to be high price than homes in CollgCr. The other three coefficients are small and not statistically significant. The R-squared is around 0.28, which means neighborhood alone explains a sizable share of variation in high price status. This matches the common observation that “location matters” in real estate.
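The choice of reference category follows from how R orders factor levels. The sketch below uses only the five neighborhood labels printed above. It shows that `factor()` sorts levels alphabetically, that the first level gets no dummy column, and that `relevel()` can select a different baseline. Picking NAmes as the alternative baseline is purely illustrative.

```r
# The five neighborhood labels printed above, in order of frequency
nbhd <- factor(c("NAmes", "CollgCr", "OldTown", "Somerst", "NridgHt"))

# factor() sorts the levels; the first level is the reference category
levels(nbhd)

# Dummy columns built for a model: there is no column for the reference level
colnames(model.matrix(~ nbhd))

# relevel() picks a different baseline, here NAmes (an illustrative choice)
nbhd_names_ref <- relevel(nbhd, ref = "NAmes")
levels(nbhd_names_ref)[1]
```

Changing the baseline changes every coefficient's comparison group but not the fitted values.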

#5. Interaction Between Living Area and Neighborhood

Next I allow the effect of living area to differ across neighborhoods by including an interaction term between square footage and neighborhood. To keep the model simple, I restrict the data to the two most common neighborhoods among the top five, NAmes and CollgCr.

The idea is that adding space may increase value more in some neighborhoods than others. Large homes in premium locations often attract a stronger price premium than large homes in ordinary areas.

housing_two_nbhd <- housing_top_nbhd %>%
  filter(Neighborhood %in% top_neighborhoods[1:2])

model_interaction <- lm(
  HighPrice ~ `Gr Liv Area` * factor(Neighborhood),
  data = housing_two_nbhd
)

summary(model_interaction)

## 
## Call:
## lm(formula = HighPrice ~ `Gr Liv Area` * factor(Neighborhood), 
##     data = housing_two_nbhd)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.12148 -0.01588 -0.00230  0.00413  0.96639 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)
## (Intercept)                             -1.244e-01  2.079e-02  -5.983 3.51e-09
## `Gr Liv Area`                            9.552e-05  1.338e-05   7.141 2.35e-12
## factor(Neighborhood)NAmes                9.815e-02  2.561e-02   3.833 0.000138
## `Gr Liv Area`:factor(Neighborhood)NAmes -7.339e-05  1.739e-05  -4.221 2.76e-05
##                                            
## (Intercept)                             ***
## `Gr Liv Area`                           ***
## factor(Neighborhood)NAmes               ***
## `Gr Liv Area`:factor(Neighborhood)NAmes ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08878 on 694 degrees of freedom
## Multiple R-squared:  0.0804, Adjusted R-squared:  0.07643 
## F-statistic: 20.23 on 3 and 694 DF,  p-value: 1.417e-12

The coefficient on Gr Liv Area shows how square footage relates to high price probability in the reference neighborhood, CollgCr. The interaction term shows how that slope changes in NAmes. In my results, the implied slope is positive in both neighborhoods (about 9.6e-05 per square foot in CollgCr and 9.6e-05 − 7.3e-05 ≈ 2.2e-05 in NAmes), but the interaction term is negative and significant. This means additional space is associated with a much smaller increase in high price probability in NAmes than in CollgCr.

This reflects a realistic pattern. In some areas, extra space commands a strong premium because buyers want larger homes and the surrounding properties are also large. In other areas, the payoff to additional square footage is smaller.
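The neighborhood-specific slopes implied by the interaction model can be worked out directly from the printed coefficients. This is a small arithmetic sketch; the 500 square foot step is an illustrative choice, not something estimated by the model.

```r
# Coefficients copied from the summary(model_interaction) output above
b_area        <- 9.552e-05   # slope in the reference neighborhood
b_interaction <- -7.339e-05  # change in slope for NAmes

slope_reference <- b_area
slope_names     <- b_area + b_interaction

# Change in predicted probability for 500 extra square feet, in percentage points
effect_500 <- 100 * 500 * c(reference = slope_reference, NAmes = slope_names)
effect_500
```

The slope for NAmes is the sum of the main effect and the interaction term, which is why a negative interaction shrinks the effect without reversing its sign here.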

#6. Linear Probability Model (LPM) with Multiple Predictors

Now I estimate a Linear Probability Model that includes several important housing characteristics at once. The predictors are living area, overall quality, year built, central air, and garage capacity. The LPM uses ordinary least squares even though the outcome is binary.

Each coefficient can be interpreted as the change in the probability of being high price when the predictor increases by one unit, holding the others constant.

lpm_model <- lm(
  HighPrice ~ `Gr Liv Area` + `Overall Qual` + `Year Built` +
    CentralAir_Dummy + `Garage Cars`,
  data = housing_clean
)

summary(lpm_model)

## 
## Call:
## lm(formula = HighPrice ~ `Gr Liv Area` + `Overall Qual` + `Year Built` + 
##     CentralAir_Dummy + `Garage Cars`, data = housing_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.57539 -0.09927 -0.03046  0.04582  0.88129 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.862e-01  3.246e-01   1.190   0.2343    
## `Gr Liv Area`     8.617e-05  9.553e-06   9.021  < 2e-16 ***
## `Overall Qual`    4.659e-02  4.024e-03  11.579  < 2e-16 ***
## `Year Built`     -4.050e-04  1.727e-04  -2.345   0.0191 *  
## CentralAir_Dummy -4.010e-02  1.802e-02  -2.225   0.0261 *  
## `Garage Cars`     4.524e-02  7.681e-03   5.891 4.31e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1935 on 2766 degrees of freedom
## Multiple R-squared:  0.2415, Adjusted R-squared:  0.2401 
## F-statistic: 176.1 on 5 and 2766 DF,  p-value: < 2.2e-16

housing_clean$lpm_pred <- predict(lpm_model)

The coefficients show clear patterns. Larger homes, higher overall quality, and more garage spaces are all associated with a higher probability of being high price. The effect of one extra square foot is small individually, but becomes meaningful over a 500 or 1,000 square foot increase (roughly 4 and 9 percentage points). Each additional point of quality raises the probability by about 4.7 percentage points, which is large relative to a base rate of about 5 percent and confirms that quality is a central feature of luxury housing.

Interestingly, after controlling for size and quality, the central air dummy becomes negative. This suggests that once we account for structural features, central air by itself does not distinguish luxury homes. The overall R-squared is about 0.24, which means these simple structural predictors explain roughly one quarter of the variation in high price status.
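Because LPM coefficients are changes in probability per unit, they can be rescaled to more natural step sizes by simple multiplication. The sketch below does this with the printed coefficients. The step sizes in `change` are illustrative assumptions.

```r
# Coefficients copied from the summary(lpm_model) output above
lpm_coef <- c(area = 8.617e-05, quality = 4.659e-02, garage = 4.524e-02)

# Size of each change being considered (the 500 sq ft step is an illustrative choice)
change <- c(area = 500, quality = 1, garage = 1)

# Implied change in P(HighPrice = 1), in percentage points
effect_pp <- 100 * lpm_coef * change
round(effect_pp, 2)
```

On this scale, 500 extra square feet, one quality point, and one garage space are all of similar magnitude.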

#7. Boundary Violations in the LPM

A known limitation of the Linear Probability Model is that its predictions can fall outside the valid range of 0 to 1. Here I count how many predicted values are negative or greater than one.

below_zero <- sum(housing_clean$lpm_pred < 0)
above_one  <- sum(housing_clean$lpm_pred > 1)
total_preds <- length(housing_clean$lpm_pred)

below_zero

## [1] 950

above_one

## [1] 0

total_preds

## [1] 2772

In my results, 950 of the 2,772 predictions are negative and none exceed one. About 34 percent of the fitted values are therefore not valid probabilities.

The homes with the most negative predictions tend to be small, older, low quality properties with limited garages. These homes are extremely unlikely to be high price, but a negative probability still has no interpretation. This illustrates why the LPM is often considered a rough approximation rather than a final probability model.
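A negative fitted value is easy to reproduce by hand from the printed LPM coefficients. The home described in `home` below is hypothetical, with values invented to resemble a small, older, lower quality property. It is not a row from the data.

```r
# Coefficients copied from the summary(lpm_model) output above
lpm_coef <- c(
  intercept = 3.862e-01, area = 8.617e-05, quality = 4.659e-02,
  year = -4.050e-04, air = -4.010e-02, garage = 4.524e-02
)

# Hypothetical home (values invented for illustration):
# 900 sq ft, quality 4, built 1950, central air, one-car garage
home <- c(intercept = 1, area = 900, quality = 4, year = 1950, air = 1, garage = 1)

# Fitted value is the sum of coefficient times characteristic
lpm_fitted <- sum(lpm_coef * home)
lpm_fitted  # below zero, so not a valid probability
```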

#8. Logistic Regression Model

To address the limitations of the LPM, I estimate a logistic regression model using the same predictors. Logistic regression models the log odds of being high price and automatically keeps predicted probabilities between 0 and 1.

logit_model <- glm(
  HighPrice ~ `Gr Liv Area` + `Overall Qual` + `Year Built` +
    CentralAir_Dummy + `Garage Cars`,
  data = housing_clean,
  family = binomial
)

summary(logit_model)

## 
## Call:
## glm(formula = HighPrice ~ `Gr Liv Area` + `Overall Qual` + `Year Built` + 
##     CentralAir_Dummy + `Garage Cars`, family = binomial, data = housing_clean)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -7.339e+01  6.931e+02  -0.106   0.9157    
## `Gr Liv Area`     2.139e-03  3.219e-04   6.646 3.01e-11 ***
## `Overall Qual`    1.754e+00  1.829e-01   9.589  < 2e-16 ***
## `Year Built`      1.911e-02  1.076e-02   1.776   0.0758 .  
## CentralAir_Dummy  1.157e+01  6.928e+02   0.017   0.9867    
## `Garage Cars`     1.270e+00  3.006e-01   4.223 2.41e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1132.1  on 2771  degrees of freedom
## Residual deviance:  419.5  on 2766  degrees of freedom
## AIC: 431.5
## 
## Number of Fisher Scoring iterations: 18

housing_clean$logit_pred <- predict(logit_model, type = "response")

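The logistic model confirms the importance of living area, overall quality, and garage spaces. The coefficients for these variables are positive and highly significant, indicating that larger, higher quality homes with more parking are much more likely to be high price. Year built has a smaller effect that is significant only at the 10 percent level. The central air coefficient is not reliably estimated: its standard error is about 693 and the fit needed 18 Fisher scoring iterations. This is the signature of quasi-complete separation. No home without central air is high price in this sample, so the likelihood keeps improving as the coefficient grows and no finite estimate is pinned down.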
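One way to check why the central air estimate is unstable is to cross-tabulate the dummy against the outcome and look for an empty cell. The sketch below rebuilds the table from the counts reported in section 3. The figure of 144 high price homes is an assumption implied by mean_high = 0.0547 and n = 2634, and the vectors are reconstructions rather than the original data.

```r
# Counts reconstructed from the group means table in section 3:
# 138 homes without central air, none high price;
# 2,634 with central air, of which about 144 are high price (assumed from 0.0547 * 2634)
air  <- c(rep(0, 138), rep(1, 2634))
high <- c(rep(0, 138), rep(1, 144), rep(0, 2634 - 144))

tab <- table(CentralAir = air, HighPrice = high)
tab

# An empty cell means the outcome is perfectly predicted for that group
any(tab == 0)
```

When a cell is empty, the logistic coefficient for that dummy drifts toward infinity, which is consistent with the very large standard error in the output above.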

Unlike the LPM, all predicted values from this model lie in the valid probability range.
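The bounded range comes from the inverse logit transformation, which maps any log odds value into the interval between 0 and 1. The sketch below applies base R's `plogis()` to a set of hypothetical log odds values chosen to span a wide range.

```r
# Hypothetical linear predictor values (log odds), from very low to very high
log_odds <- c(-20, -5, -1, 0, 1, 5, 20)

# Inverse logit: p = 1 / (1 + exp(-log_odds)); plogis() is the base R version
p <- plogis(log_odds)
round(p, 4)

# The mapping is bounded, unlike a straight line
range(p)
```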

#9. Comparing LPM and Logistic Predictions

This section compares the predicted probabilities from the LPM and the logistic model. Each point in the plot represents one home, with its LPM prediction on the x axis and its logistic prediction on the y axis.

plot_data <- housing_clean %>%
  select(lpm_pred, logit_pred)

ggplot(plot_data, aes(x = lpm_pred, y = logit_pred)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    title = "Comparison of LPM and Logistic Predicted Probabilities",
    x = "LPM predicted probability",
    y = "Logistic predicted probability"
  ) +
  theme_minimal()

The scatterplot shows that the models agree in the middle of the range but diverge at the extremes. The LPM produces negative predictions for many low value homes, while the logistic model gives them probabilities close to zero but still valid. For high value homes, the logistic predictions rise sharply toward one, while the LPM follows a straight line. This visual makes clear why logistic regression is usually preferred for probability modeling.

#10. Odds Ratios

Logistic regression coefficients are expressed in log odds, which are hard to interpret directly. Exponentiating the coefficients yields odds ratios, which describe how the odds of being high price change when a predictor increases by one unit.

odds_ratios <- exp(coef(logit_model))
odds_ratios

##      (Intercept)    `Gr Liv Area`   `Overall Qual`     `Year Built` 
##     1.337081e-32     1.002142e+00     5.778461e+00     1.019296e+00 
## CentralAir_Dummy    `Garage Cars` 
##     1.056400e+05     3.559760e+00

or_quality <- odds_ratios["`Overall Qual`"]
or_quality

## `Overall Qual` 
##       5.778461

beta_area <- coef(logit_model)["`Gr Liv Area`"]
or_area_500 <- exp(beta_area * 500)
or_area_500

## `Gr Liv Area` 
##      2.914302

The odds ratio for overall quality is about 5.8. This means a one point increase in quality multiplies the odds of being high price by almost six. The odds ratio for an extra 500 square feet is about 2.9, so adding that much space nearly triples the odds of being high price. These results reinforce the idea that quality and size are key features of luxury homes.

The odds ratio for garage spaces is also well above one (about 3.6 per additional space), showing that more parking capacity is a strong signal of high price status. Year built has a more modest effect, about a 2 percent increase in the odds per year. The central air odds ratio is unstable and implausibly large (about 100,000). It reflects the fact that every high price home in this sample has central air, so the model cannot pin down a finite coefficient and the estimate should not be interpreted.
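An odds ratio multiplies odds, not probabilities, so it helps to trace one example through both scales. The sketch below recomputes the 500 square foot odds ratio from the printed (rounded) coefficient and applies it to a hypothetical home with a starting probability of 0.05. That starting value is an assumption chosen for illustration.

```r
# Coefficient copied from the summary(logit_model) output above
beta_area <- 2.139e-03

# Odds ratio for +500 sq ft: exponentiate the scaled coefficient
or_500 <- exp(beta_area * 500)

# Same number, obtained by compounding the one-unit odds ratio 500 times
or_500_compound <- exp(beta_area)^500

# Effect on probability for a hypothetical home that starts at p = 0.05
p_start    <- 0.05
odds_start <- p_start / (1 - p_start)
odds_new   <- odds_start * or_500
p_new      <- odds_new / (1 + odds_new)

c(or_500 = or_500, p_new = p_new)
```

Nearly tripling the odds does not triple the probability, and the gap between the two grows as the starting probability rises.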

#11. Marginal Effects of Living Area

While odds ratios focus on changes in odds, marginal effects translate coefficients into changes in actual probability. This section computes three types of marginal effects for living area: the average marginal effect, the marginal effect at the means, and the marginal effect when the predicted probability is 0.5.

housing_clean$logit_prob <- predict(logit_model, type = "response")

beta_area <- coef(logit_model)["`Gr Liv Area`"]

# Average marginal effect (AME)
housing_clean$me_area <- beta_area * housing_clean$logit_prob * (1 - housing_clean$logit_prob)
AME_area <- mean(housing_clean$me_area)
AME_area

## [1] 4.799145e-05

# Marginal effect at the means (MEM)
means_df <- data.frame(
  `Gr Liv Area`    = mean(housing_clean$`Gr Liv Area`),
  `Overall Qual`   = mean(housing_clean$`Overall Qual`),
  `Year Built`     = mean(housing_clean$`Year Built`),
  CentralAir_Dummy = mean(housing_clean$CentralAir_Dummy),
  `Garage Cars`    = mean(housing_clean$`Garage Cars`),
  check.names = FALSE  # keep the spaces in the column names so predict() can match them
)

p_means <- predict(logit_model, newdata = means_df, type = "response")

MEM_area <- beta_area * p_means * (1 - p_means)
MEM_area

## `Gr Liv Area` 
##  5.628528e-07

# Marginal effect at p = 0.5
ME_at_50 <- beta_area * 0.5 * 0.5
ME_at_50

## `Gr Liv Area` 
##  0.0005348151

The average marginal effect is about 4.8e-05. On average, an extra square foot increases the probability of being high price by about 0.005 percentage points. This effect is tiny on its own but becomes meaningful for several hundred additional square feet.

The marginal effect at the means is even smaller because the average home has a very low baseline probability of being high price. The logistic curve is nearly flat in that region, so an extra square foot hardly changes the probability. At a predicted probability of 0.5, the marginal effect is larger, reflecting the fact that the logistic curve is steepest in the middle. This shows how the impact of size depends on where a home lies in the distribution of predicted probabilities.
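All three quantities above use the same expression, beta × p × (1 − p), evaluated at different probabilities. The sketch below checks that expression against a numerical derivative, using the printed coefficient and a set of hypothetical baseline probabilities chosen for illustration.

```r
# Coefficient copied from the summary(logit_model) output above
beta_area <- 2.139e-03

# Hypothetical baseline probabilities at which to evaluate the effect
p <- c(0.001, 0.05, 0.25, 0.5, 0.75, 0.95)

# Analytic marginal effect of one square foot: beta * p * (1 - p)
me_analytic <- beta_area * p * (1 - p)

# Numerical check: central difference of +/- 1 sq ft on the log-odds scale
eta <- qlogis(p)
me_numeric <- (plogis(eta + beta_area) - plogis(eta - beta_area)) / 2

data.frame(p, me_analytic, me_numeric)
```

The effect peaks at p = 0.5, where it equals beta / 4, and shrinks toward zero at either end of the probability range.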

#12. Classification with Threshold 0.5

So far the models have produced probabilities. To make a discrete yes or no decision, we need a threshold. The common default is 0.5. Homes with predicted probability above 0.5 are classified as high price.

housing_clean$logit_class_50 <- ifelse(housing_clean$logit_pred > 0.5, 1, 0)

cm50 <- table(
  Predicted = housing_clean$logit_class_50,
  Actual = housing_clean$HighPrice
)
cm50

##          Actual
## Predicted    0    1
##         0 2607   62
##         1   21   82

TP <- cm50["1", "1"]
FP <- cm50["1", "0"]
TN <- cm50["0", "0"]
FN <- cm50["0", "1"]

accuracy_50 <- (TP + TN) / sum(cm50)
precision_50 <- TP / (TP + FP)
recall_50 <- TP / (TP + FN)

accuracy_50

## [1] 0.9700577

precision_50

## [1] 0.7961165

recall_50

## [1] 0.5694444

With a 0.5 threshold, accuracy is about 97 percent and precision is close to 80 percent, but recall is only about 57 percent. The model correctly predicts most homes as not high price, and when it does predict high price it is usually correct, but it misses many true high price homes. This happens because the high price group is rare and most probabilities never reach 0.5.

In practice, this means the default threshold is too conservative for detecting high price homes.

#13. Classification with Threshold 0.2

To improve recall, I lower the threshold to 0.2. This labels more homes as high price and should catch more of the true high price homes.

housing_clean$logit_class_20 <- ifelse(housing_clean$logit_pred > 0.2, 1, 0)

cm20 <- table(
  Predicted = housing_clean$logit_class_20,
  Actual = housing_clean$HighPrice
)
cm20

##          Actual
## Predicted    0    1
##         0 2546   22
##         1   82  122

TP20 <- cm20["1", "1"]
FP20 <- cm20["1", "0"]
TN20 <- cm20["0", "0"]
FN20 <- cm20["0", "1"]

accuracy_20 <- (TP20 + TN20) / sum(cm20)
precision_20 <- TP20 / (TP20 + FP20)
recall_20 <- TP20 / (TP20 + FN20)

accuracy_20

## [1] 0.962482

precision_20

## [1] 0.5980392

recall_20

## [1] 0.8472222

With a 0.2 threshold, accuracy remains high at about 96 percent. Precision falls to about 60 percent, but recall rises to about 85 percent. That means the model now captures most of the actual high price homes instead of missing many of them.

This trade-off makes sense in a real estate context. Treating a mid range home as high price mainly wastes some extra marketing effort, but missing a true high price home may cost a substantial commission. Lowering the threshold is a reasonable choice when false negatives are more costly than false positives.
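The two thresholds can be compared side by side by computing the metrics from the confusion matrix counts. `class_metrics` below is a hypothetical helper written for this comparison and is not part of the original analysis. The counts are copied from the cm50 and cm20 tables above.

```r
# Accuracy, precision, and recall from confusion-matrix counts
# (class_metrics is a hypothetical helper, not part of the original analysis)
class_metrics <- function(tp, fp, tn, fn) {
  c(
    accuracy  = (tp + tn) / (tp + fp + tn + fn),
    precision = tp / (tp + fp),
    recall    = tp / (tp + fn)
  )
}

# Counts copied from the cm50 and cm20 tables above
metrics <- rbind(
  threshold_0.5 = class_metrics(tp = 82,  fp = 21, tn = 2607, fn = 62),
  threshold_0.2 = class_metrics(tp = 122, fp = 82, tn = 2546, fn = 22)
)
round(metrics, 3)
```

Lowering the threshold moves 40 homes from false negative to true positive at the cost of 61 additional false positives.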

#14. Train–Test Split and Out-of-Sample Evaluation

To check whether the model generalizes to new data, I split the sample into a training set and a test set. The model is estimated on 75 percent of the data and then evaluated on the remaining 25 percent. This mimics applying the model to future homes that were not used in estimation.

set.seed(123)

train_index <- sample(seq_len(nrow(housing_clean)), size = floor(0.75 * nrow(housing_clean)))

train_data <- housing_clean[train_index, ]
test_data  <- housing_clean[-train_index, ]

logit_train <- glm(
  HighPrice ~ `Gr Liv Area` + `Overall Qual` + `Year Built` +
    CentralAir_Dummy + `Garage Cars`,
  data = train_data,
  family = binomial
)

test_data$test_pred <- predict(logit_train, newdata = test_data, type = "response")
test_data$test_class_20 <- ifelse(test_data$test_pred > 0.2, 1, 0)

cm_test <- table(
  Predicted = test_data$test_class_20,
  Actual = test_data$HighPrice
)
cm_test

##          Actual
## Predicted   0   1
##         0 640   4
##         1  23  26

TP_t <- cm_test["1", "1"]
FP_t <- cm_test["1", "0"]
TN_t <- cm_test["0", "0"]
FN_t <- cm_test["0", "1"]

accuracy_test  <- (TP_t + TN_t) / sum(cm_test)
precision_test <- TP_t / (TP_t + FP_t)
recall_test    <- TP_t / (TP_t + FN_t)

accuracy_test

## [1] 0.961039

precision_test

## [1] 0.5306122

recall_test

## [1] 0.8666667

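The test set accuracy is about 96 percent, very close to the full-sample value at the same 0.2 threshold. Precision falls from about 60 percent to about 53 percent, while recall rises slightly from about 85 percent to about 87 percent. The test set contains only 30 high price homes, so differences of this size could easily come from the particular random split. The model still identifies most high price homes on unseen data and does not lose much performance when moving from training to testing.

This suggests that the logistic model is not badly overfitting and should perform similarly on new listings with comparable characteristics, although a single 75/25 split is only a rough check.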
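One way to gauge how much weight the test metrics can bear is to put an interval around them. The sketch below uses the counts from the test confusion matrix and base R's exact binomial interval from `binom.test()`. This step is not part of the original analysis, and treating the test homes as independent draws is an assumption.

```r
# Test-set recall from the confusion matrix above: 26 of 30 high price homes caught
ci_recall <- binom.test(x = 26, n = 30)$conf.int

# Test-set precision: 26 of the 49 homes flagged as high price were correct
ci_precision <- binom.test(x = 26, n = 49)$conf.int

# 95 percent intervals (lower bound, upper bound)
rbind(recall = ci_recall, precision = ci_precision)
```

With so few high price homes in the test set, both intervals are wide, which is a reason to read small differences between in-sample and test metrics cautiously.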

#15. Final Diagnostics: ROC and Precision–Recall Curves

As a final check, I plot a ROC curve and a precision–recall curve for the test set. These curves summarize model performance across all possible thresholds rather than focusing on a single cutoff.

library(pROC)   # roc()
library(PRROC)  # pr.curve()

roc_obj <- roc(test_data$HighPrice, test_data$test_pred)
plot(roc_obj, main = "ROC Curve for Logistic Model (Test Set)")

pr_obj <- pr.curve(
  scores.class0 = test_data$test_pred[test_data$HighPrice == 1],
  scores.class1 = test_data$test_pred[test_data$HighPrice == 0],
  curve = TRUE
)
plot(pr_obj, main = "Precision–Recall Curve (Test Set)")

The ROC curve rises well above the diagonal, indicating that the model distinguishes high price homes from others well. For the precision–recall curve, the relevant benchmark is the share of high price homes in the test set (30 of 693, about 4 percent), which is the precision a random classifier would achieve. The model stays far above that baseline even at high recall: at the 0.2 threshold, precision is about 53 percent at a recall of about 87 percent. These plots support the conclusion that the logistic model performs well for this rare event classification problem.
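The area under the ROC curve has a simple interpretation: it is the probability that a randomly chosen high price home receives a higher score than a randomly chosen other home. The sketch below computes it by direct pairwise comparison on hypothetical scores (not from the Ames model). It illustrates the definition and is not a description of how the pROC package computes its result.

```r
# Hypothetical predicted probabilities (not from the Ames model)
scores_pos <- c(0.90, 0.60, 0.30)        # homes that are truly high price
scores_neg <- c(0.50, 0.20, 0.10, 0.05)  # homes that are not

# Compare every positive with every negative; ties count as half
wins <- outer(scores_pos, scores_neg, ">")
ties <- outer(scores_pos, scores_neg, "==")
auc  <- mean(wins + 0.5 * ties)
auc

# Baseline for a precision-recall curve: the share of positives
pr_baseline <- length(scores_pos) / (length(scores_pos) + length(scores_neg))
pr_baseline
```

An AUC of 0.5 corresponds to the diagonal of the ROC plot, while the precision–recall baseline depends on how rare the positive class is.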

#16. Overall Conclusions

This project showed how dummy variables, neighborhood effects, and probability models can be used to study the determinants of high price homes in the Ames housing market. The analysis began with simple dummy regressions and moved toward a full logistic model with multiple predictors, threshold tuning, and out-of-sample testing.

The main findings are that overall quality, living area, and garage capacity are strong predictors of whether a home belongs to the high price segment. Neighborhood also plays a major role when it is included. Logistic regression provides a better framework than the Linear Probability Model because it produces valid probabilities and captures the nonlinear pattern of how features affect the chance of being high price.

From a real world perspective, the model offers a useful screening tool. A real estate agent or analyst could use it to flag homes that are likely to qualify as high price and to decide where to concentrate marketing resources. The results also suggest that quality, living space, and parking capacity are the features most closely associated with the upper tier of the market, although these are associations and not estimates of what a given renovation would return.

Gavin Shklanka

November 22, 2025