Introduction

This week I’m looking at group probabilities and unusual patterns in the Ames housing data. By grouping homes in different ways, I can figure out which combinations of features are rare and which are common. This is a simple form of anomaly detection: finding the homes that don’t fit the usual patterns. Knowing which features rarely occur together could help buyers find niches, show builders what’s missing in the market, or help me spot data issues.

Data Loading

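# Packages used throughout: dplyr (grouping), ggplot2 (plots), knitr (kable tables)
library(dplyr)
library(ggplot2)
library(knitr)

ames <- read.csv("ames.csv", stringsAsFactors = FALSE)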

cat("Dataset dimensions:", nrow(ames), "rows and", ncol(ames), "columns\n")
## Dataset dimensions: 2930 rows and 82 columns

Part 1: Group-By Analysis with Probability Investigation

Grouping 1: Building Type by Sale Price Analysis

Research Question: Which building types are rarest in the Ames market, and what does this tell us about housing development patterns?

# Group by building type and summarise sale prices
group1_df <- ames |>
  group_by(Bldg.Type) |>
  summarise(
    Count = n(),
    Median_Price = median(SalePrice),
    Mean_Price = mean(SalePrice),
    Std_Price = sd(SalePrice)
  ) |>
  arrange(Count) |>
  mutate(
    Probability = Count / sum(Count),
    Probability_Pct = sprintf("%.2f%%", Probability * 100),
    Rarity_Tag = if_else(Probability < 0.05, "RARE", "Common")
  )

# Displaying full table
kable(group1_df, 
      col.names = c("Building Type", "Count", "Median Price", "Mean Price", 
                    "Std Dev", "Probability", "Probability %", "Rarity"),
      caption = "Building Type Distribution and Price Statistics",
      format.args = list(big.mark = ","))
Building Type Distribution and Price Statistics
Building Type Count Median Price Mean Price Std Dev Probability Probability % Rarity
2fmCon 62 122,250 125,581.7 31,089.24 0.0211604 2.12% RARE
Twnhs 101 130,000 135,934.1 41,938.93 0.0344710 3.45% RARE
Duplex 109 136,905 139,808.9 39,498.97 0.0372014 3.72% RARE
TwnhsE 233 180,000 192,311.9 66,191.74 0.0795222 7.95% Common
1Fam 2,425 165,000 184,812.0 82,821.80 0.8276451 82.76% Common
# Identifying rarest groups
rarest_bldg <- group1_df |>
  filter(Rarity_Tag == "RARE")

cat("\n=== PROBABILITY ANALYSIS ===\n")
## 
## === PROBABILITY ANALYSIS ===
cat("If we randomly select a home from the dataset:\n")
## If we randomly select a home from the dataset:
for (i in seq_len(nrow(rarest_bldg))) {  # seq_len() is safe if no rows are rare
  cat(sprintf("- Probability of getting a %s: %.2f%% (%d out of %d homes)\n",
              rarest_bldg$Bldg.Type[i],
              rarest_bldg$Probability[i] * 100,
              rarest_bldg$Count[i],
              nrow(ames)))
}
## - Probability of getting a 2fmCon: 2.12% (62 out of 2930 homes)
## - Probability of getting a Twnhs: 3.45% (101 out of 2930 homes)
## - Probability of getting a Duplex: 3.72% (109 out of 2930 homes)

Insight: Two-family conversions, interior townhouse units, and duplexes are each rare in Ames, with only a 2-4% chance of randomly picking any one of them from the dataset. Townhouse end units (TwnhsE) are more common at 7.95%, so townhouses of both kinds make up about 11% of homes. This surprised me because townhouses are usually good starter homes, but Ames is clearly dominated by single-family houses (82.76%). My guess is that Ames developed as a typical Midwestern suburb where most buyers wanted their own yard and house, so few townhouses were built.
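
Because the dataset codes interior townhouse units (Twnhs) and end units (TwnhsE) separately, it can help to combine them with the addition rule for mutually exclusive categories. The sketch below uses only the counts copied from the table above; the variable names are my own.

```r
# Counts copied from the building-type table above
counts <- c("2fmCon" = 62, "Twnhs" = 101, "Duplex" = 109,
            "TwnhsE" = 233, "1Fam" = 2425)
total <- sum(counts)

# The categories are mutually exclusive, so their probabilities simply add
p_townhouse <- sum(counts[c("Twnhs", "TwnhsE")]) / total
p_not_1fam  <- 1 - counts[["1Fam"]] / total

cat(sprintf("P(any townhouse)     = %.2f%%\n", 100 * p_townhouse))
cat(sprintf("P(not single-family) = %.2f%%\n", 100 * p_not_1fam))
```

Each of the three "RARE" types is under 5% on its own, but the combined share of non-single-family homes is noticeably larger.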

Testable Hypothesis: Townhouses and multi-family dwellings are concentrated in specific neighborhoods near the university (Iowa State), where student and young professional demand justifies denser development. We can test this by cross-tabulating building type with neighborhood location.
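
A minimal sketch of the cross-tabulation this hypothesis calls for, written in base R. The rows below are made-up toy data with hypothetical neighborhood labels, not values from ames.csv; they only show the shape of the calculation.

```r
# Toy data for illustration only (hypothetical neighborhoods, made-up rows)
toy <- data.frame(
  Bldg.Type    = c("1Fam", "1Fam", "Twnhs", "TwnhsE", "1Fam", "Twnhs", "Duplex", "1Fam"),
  Neighborhood = c("Nbhd_A", "Nbhd_A", "Nbhd_B", "Nbhd_B", "Nbhd_C", "Nbhd_B", "Nbhd_A", "Nbhd_C")
)

# Counts of each building type within each neighborhood
xtab <- table(toy$Neighborhood, toy$Bldg.Type)

# Row proportions: the building-type mix inside each neighborhood
mix <- prop.table(xtab, margin = 1)
print(round(mix, 2))
```

With the real data the same two calls would be applied to ames$Neighborhood and ames$Bldg.Type. If the hypothesis holds, a few neighborhoods should show a much higher townhouse share than the rest.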

# Visualization with probability annotations
ggplot(group1_df, aes(x = reorder(Bldg.Type, -Count), y = Count, fill = Rarity_Tag)) +
  geom_col() +
  geom_text(aes(label = paste0(Count, "\n(", Probability_Pct, ")")), 
            vjust = -0.3, size = 3.5) +
  scale_fill_manual(values = c("RARE" = "#e74c3c", "Common" = "#3498db"),
                    name = "Classification") +
  labs(title = "Building Type Distribution with Selection Probabilities",
       subtitle = "Rare types (<5% probability) highlighted in red",
       x = "Building Type",
       y = "Number of Homes",
       caption = "1Fam = Single-Family, TwnhsE = Townhouse End Unit, Twnhs = Townhouse, \nDuplex = Duplex, 2fmCon = Two-Family Conversion") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5))


Grouping 2: Neighborhood Tier and Overall Quality Combinations

Research Question: Which quality-neighborhood combinations are rarest, and what does this reveal about market segmentation?

# Creating neighborhood price tiers
neighborhood_tiers <- ames |>
  group_by(Neighborhood) |>
  summarise(Median_Price = median(SalePrice)) |>
  mutate(Tier = case_when(
    Median_Price >= 200000 ~ "Premium",
    Median_Price >= 140000 ~ "Mid-Market",
    TRUE ~ "Value"
  ))

# Join and group by tier and quality
group2_df <- ames |>
  left_join(neighborhood_tiers |> select(Neighborhood, Tier), by = "Neighborhood") |>
  group_by(Tier, Overall.Qual) |>
  summarise(
    Count = n(),
    Avg_Price = mean(SalePrice),
    .groups = "drop"
  ) |>
  arrange(Count) |>
  mutate(
    Probability = Count / nrow(ames),
    Probability_Pct = sprintf("%.2f%%", Probability * 100),
    Rarity_Tag = if_else(Probability < 0.01, "EXTREMELY RARE", 
                        if_else(Probability < 0.03, "RARE", "Common"))
  )

# Displaying full table
kable(group2_df, 
      col.names = c("Neighborhood Tier", "Quality Rating", "Count", 
                    "Avg Price", "Probability", "Probability %", "Rarity"),
      caption = "Neighborhood Tier × Quality Rating Combinations",
      format.args = list(big.mark = ","))
Neighborhood Tier × Quality Rating Combinations
Neighborhood Tier Quality Rating Count Avg Price Probability Probability % Rarity
Mid-Market 1 1 81,500.00 0.0003413 0.03% EXTREMELY RARE
Mid-Market 9 1 377,500.00 0.0003413 0.03% EXTREMELY RARE
Value 9 1 320,000.00 0.0003413 0.03% EXTREMELY RARE
Value 1 3 37,800.00 0.0010239 0.10% EXTREMELY RARE
Premium 4 5 147,220.00 0.0017065 0.17% EXTREMELY RARE
Value 10 5 265,720.00 0.0017065 0.17% EXTREMELY RARE
Mid-Market 3 6 92,216.67 0.0020478 0.20% EXTREMELY RARE
Value 8 9 174,822.22 0.0030717 0.31% EXTREMELY RARE
Value 2 13 52,325.31 0.0044369 0.44% EXTREMELY RARE
Premium 10 26 485,697.58 0.0088737 0.89% EXTREMELY RARE
Value 3 34 81,592.32 0.0116041 1.16% RARE
Mid-Market 8 41 235,685.63 0.0139932 1.40% RARE
Mid-Market 4 48 126,056.04 0.0163823 1.64% RARE
Value 7 55 164,691.98 0.0187713 1.88% RARE
Premium 5 85 151,062.25 0.0290102 2.90% RARE
Premium 9 105 368,709.85 0.0358362 3.58% Common
Premium 6 124 185,563.01 0.0423208 4.23% Common
Value 4 173 99,877.70 0.0590444 5.90% Common
Value 6 220 136,692.04 0.0750853 7.51% Common
Mid-Market 7 254 201,163.34 0.0866894 8.67% Common
Premium 7 293 215,945.26 0.1000000 10.00% Common
Premium 8 300 278,610.82 0.1023891 10.24% Common
Mid-Market 5 342 140,269.19 0.1167235 11.67% Common
Mid-Market 6 388 169,065.29 0.1324232 13.24% Common
Value 5 398 126,528.82 0.1358362 13.58% Common
# Identifying extremely rare combinations
extremely_rare <- group2_df |>
  filter(Rarity_Tag == "EXTREMELY RARE")

cat("\n=== EXTREME RARITY ANALYSIS ===\n")
## 
## === EXTREME RARITY ANALYSIS ===
cat("Combinations with <1% probability:\n")
## Combinations with <1% probability:
for (i in seq_len(nrow(extremely_rare))) {  # seq_len() is safe if no rows qualify
  cat(sprintf("- %s neighborhood + Quality %d: %.3f%% probability (%d homes)\n",
              extremely_rare$Tier[i],
              extremely_rare$Overall.Qual[i],
              extremely_rare$Probability[i] * 100,
              extremely_rare$Count[i]))
}
## - Mid-Market neighborhood + Quality 1: 0.034% probability (1 homes)
## - Mid-Market neighborhood + Quality 9: 0.034% probability (1 homes)
## - Value neighborhood + Quality 9: 0.034% probability (1 homes)
## - Value neighborhood + Quality 1: 0.102% probability (3 homes)
## - Premium neighborhood + Quality 4: 0.171% probability (5 homes)
## - Value neighborhood + Quality 10: 0.171% probability (5 homes)
## - Mid-Market neighborhood + Quality 3: 0.205% probability (6 homes)
## - Value neighborhood + Quality 8: 0.307% probability (9 homes)
## - Value neighborhood + Quality 2: 0.444% probability (13 homes)
## - Premium neighborhood + Quality 10: 0.887% probability (26 homes)

Insight: Some combinations barely exist in the data. Premium neighborhoods almost never have low-quality homes: only 5 Premium homes are rated 4 (a 0.17% chance), and none are rated 1-3. That makes sense. If you’re in an expensive neighborhood, you probably can’t get away with having a run-down house. Either neighborhood standards keep quality high, or low-quality homes get renovated. On the flip side, Value neighborhoods rarely have top-quality homes: only 15 are rated 8 or higher (about 0.5%). I think this is because anyone building a really nice house is likely to build it in a nicer neighborhood where buyers will pay for it.
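
The probabilities in the table are joint probabilities (tier and quality together, out of all 2,930 homes). The claim about Premium neighborhoods is really a conditional one, so here is a small sketch of the difference using the Premium counts copied from the table above. The variable names are my own.

```r
# Premium-tier counts by quality rating, copied from the table above
premium <- c("4" = 5, "5" = 85, "6" = 124, "7" = 293,
             "8" = 300, "9" = 105, "10" = 26)
n_total <- 2930

# Joint probability: a random home is Premium AND rated 4
p_joint <- premium[["4"]] / n_total

# Conditional probability: rated 4 GIVEN the home is in a Premium neighborhood
p_cond <- premium[["4"]] / sum(premium)

cat(sprintf("P(Premium and Qual 4) = %.2f%%\n", 100 * p_joint))
cat(sprintf("P(Qual 4 | Premium)   = %.2f%%\n", 100 * p_cond))
```

The conditional figure is the one that answers "how unusual is this home for its neighborhood tier", and it stays well under 1% here.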

Testable Hypothesis: The rare combinations (premium-low quality, value-high quality) will show the largest price premiums/discounts relative to their tier averages, indicating they’re market anomalies. A quality-10 home in a value neighborhood might command a price far above the neighborhood median.
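
A rough first check of this hypothesis for the Value tier, using only the counts and average prices printed in the table above. This is a sketch under one assumption: the tier average is approximated as the count-weighted mean of the displayed cell averages, so it matches the true tier mean only up to rounding.

```r
# Value-tier rows copied from the tier-by-quality table above
value_tier <- data.frame(
  Overall.Qual = 1:10,
  Count        = c(3, 13, 34, 173, 398, 220, 55, 9, 1, 5),
  Avg_Price    = c(37800.00, 52325.31, 81592.32, 99877.70, 126528.82,
                   136692.04, 164691.98, 174822.22, 320000.00, 265720.00)
)

# Tier average price, weighting each cell average by its number of homes
tier_avg <- weighted.mean(value_tier$Avg_Price, value_tier$Count)

# Ratio > 1 means a premium over the tier average, < 1 means a discount
value_tier$Price_Ratio <- value_tier$Avg_Price / tier_avg

print(value_tier[order(-value_tier$Price_Ratio), ], row.names = FALSE)
```

The quality 9 and 10 cells come out at more than twice the tier average, but they rest on 1 and 5 homes, so this is suggestive rather than conclusive.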

# Heatmap visualization
ggplot(group2_df, aes(x = Overall.Qual, y = Tier, fill = Count)) +
  geom_tile(color = "white", linewidth = 1) +
  geom_text(aes(label = Count), color = "white", fontface = "bold", size = 4) +
  scale_fill_gradient(low = "#2c3e50", high = "#e74c3c", 
                      name = "Number\nof Homes",
                      trans = "log10") +
  scale_x_continuous(breaks = 1:10) +
  labs(title = "Rarity Heatmap: Neighborhood Tier × Quality Rating",
       subtitle = "Darker colors = rarer combinations (log scale)",
       x = "Overall Quality Rating (1-10)",
       y = "Neighborhood Tier") +
  theme_minimal() +
  theme(panel.grid = element_blank())


Grouping 3: Home Age Categories and Garage Capacity

Research Question: Which age-garage combinations are rarest, and what does this tell us about evolving homeowner preferences?

# Create age categories (age as of 2010) and summarise by garage capacity
group3_df <- ames |>
  mutate(
    Age_Category = cut(2010 - Year.Built,
                      breaks = c(-1, 10, 20, 40, 60, 200),
                      labels = c("0-10 yrs", "11-20 yrs", "21-40 yrs", 
                                "41-60 yrs", "60+ yrs"))
  ) |>
  group_by(Age_Category, Garage.Cars) |>
  summarise(
    Count = n(),
    Avg_Area = mean(Gr.Liv.Area),
    Avg_Price = mean(SalePrice),
    .groups = "drop"
  ) |>
  arrange(Count) |>
  mutate(
    Probability = Count / nrow(ames),
    Probability_Pct = sprintf("%.2f%%", Probability * 100),
    Rarity_Tag = if_else(Probability < 0.01, "EXTREMELY RARE",
                        if_else(Probability < 0.03, "RARE", "Common"))
  )

# Displaying full table
kable(group3_df, 
      col.names = c("Age Category", "Garage Capacity (Cars)", "Count",
                    "Avg Living Area", "Avg Price", "Probability", 
                    "Probability %", "Rarity"),
      caption = "Home Age × Garage Capacity Combinations",
      format.args = list(big.mark = ","))
Home Age × Garage Capacity Combinations
Age Category Garage Capacity (Cars) Count Avg Living Area Avg Price Probability Probability % Rarity
60+ yrs 5 1 1,072.000 126,500.00 0.0003413 0.03% EXTREMELY RARE
60+ yrs NA 1 1,828.000 150,909.00 0.0003413 0.03% EXTREMELY RARE
0-10 yrs 1 2 1,103.000 150,000.00 0.0006826 0.07% EXTREMELY RARE
21-40 yrs 4 2 1,856.000 207,750.00 0.0006826 0.07% EXTREMELY RARE
60+ yrs 4 2 2,038.000 202,489.50 0.0006826 0.07% EXTREMELY RARE
41-60 yrs 4 3 2,096.667 166,000.00 0.0010239 0.10% EXTREMELY RARE
0-10 yrs 4 4 2,787.750 317,125.00 0.0013652 0.14% EXTREMELY RARE
11-20 yrs 0 4 937.500 116,900.00 0.0013652 0.14% EXTREMELY RARE
11-20 yrs 4 5 1,459.600 214,600.00 0.0017065 0.17% EXTREMELY RARE
11-20 yrs 1 9 1,135.556 166,794.44 0.0030717 0.31% EXTREMELY RARE
21-40 yrs 3 9 1,757.111 204,933.33 0.0030717 0.31% EXTREMELY RARE
41-60 yrs 3 10 1,604.300 147,670.00 0.0034130 0.34% EXTREMELY RARE
0-10 yrs 0 12 1,109.083 135,979.17 0.0040956 0.41% EXTREMELY RARE
60+ yrs 3 19 2,067.684 195,889.47 0.0064846 0.65% EXTREMELY RARE
41-60 yrs 0 24 1,362.333 109,160.42 0.0081911 0.82% EXTREMELY RARE
21-40 yrs 0 26 1,037.192 109,048.08 0.0088737 0.89% EXTREMELY RARE
11-20 yrs 3 57 2,469.947 327,591.54 0.0194539 1.95% RARE
60+ yrs 0 91 1,264.418 98,050.36 0.0310580 3.11% Common
21-40 yrs 1 98 1,025.735 118,793.27 0.0334471 3.34% Common
60+ yrs 2 190 1,580.795 146,321.27 0.0648464 6.48% Common
11-20 yrs 2 259 1,666.950 204,079.43 0.0883959 8.84% Common
0-10 yrs 3 279 2,013.237 323,792.87 0.0952218 9.52% Common
41-60 yrs 2 319 1,445.871 161,544.82 0.1088737 10.89% Common
60+ yrs 1 328 1,292.988 122,218.40 0.1119454 11.19% Common
41-60 yrs 1 341 1,158.774 133,382.79 0.1163823 11.64% Common
21-40 yrs 2 349 1,482.444 174,802.32 0.1191126 11.91% Common
0-10 yrs 2 486 1,543.975 207,929.28 0.1658703 16.59% Common
# Identifying rare combinations
rare_age_garage <- group3_df |>
  filter(Rarity_Tag != "Common")

cat("\n=== RARE AGE-GARAGE COMBINATIONS ===\n")
## 
## === RARE AGE-GARAGE COMBINATIONS ===
for (i in seq_len(nrow(rare_age_garage))) {  # seq_len() is safe if no rows qualify
  cat(sprintf("- %s homes with %d-car garage: %.2f%% probability (%d homes)\n",
              rare_age_garage$Age_Category[i],
              rare_age_garage$Garage.Cars[i],
              rare_age_garage$Probability[i] * 100,
              rare_age_garage$Count[i]))
}
## - 60+ yrs homes with 5-car garage: 0.03% probability (1 homes)
## - 60+ yrs homes with NA-car garage: 0.03% probability (1 homes)
## - 0-10 yrs homes with 1-car garage: 0.07% probability (2 homes)
## - 21-40 yrs homes with 4-car garage: 0.07% probability (2 homes)
## - 60+ yrs homes with 4-car garage: 0.07% probability (2 homes)
## - 41-60 yrs homes with 4-car garage: 0.10% probability (3 homes)
## - 0-10 yrs homes with 4-car garage: 0.14% probability (4 homes)
## - 11-20 yrs homes with 0-car garage: 0.14% probability (4 homes)
## - 11-20 yrs homes with 4-car garage: 0.17% probability (5 homes)
## - 11-20 yrs homes with 1-car garage: 0.31% probability (9 homes)
## - 21-40 yrs homes with 3-car garage: 0.31% probability (9 homes)
## - 41-60 yrs homes with 3-car garage: 0.34% probability (10 homes)
## - 0-10 yrs homes with 0-car garage: 0.41% probability (12 homes)
## - 60+ yrs homes with 3-car garage: 0.65% probability (19 homes)
## - 41-60 yrs homes with 0-car garage: 0.82% probability (24 homes)
## - 21-40 yrs homes with 0-car garage: 0.89% probability (26 homes)
## - 11-20 yrs homes with 3-car garage: 1.95% probability (57 homes)

Insight: The data shows how much garages have become expected over time. Newer homes (built in the last 20 years) almost never have no garage or a 1-car garage: those four combinations together cover 27 homes, under 1% of the dataset. A 2-car garage is the norm for these homes, and 3-car garages are common. Old homes (60+ years) rarely have 3-car garages (19 homes, 0.65%), probably because most families had only one car when they were built. Garages for four cars are rare in every age group: each age category has between 2 and 5 such homes (0.07% to 0.17%), and the single rarest combination is a 60+ year old home with a 5-car garage. Even during the recent building boom, 4-car garages seem to have been limited to large custom houses.
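
To compare age groups fairly, it helps to look at the garage mix within each group rather than out of all homes. The sketch below does that for the newest and oldest categories, using counts copied from the table above (the one home with missing garage capacity is left out).

```r
# Counts copied from the age-by-garage table above; columns are 0 to 5 cars
garage_counts <- rbind(
  "0-10 yrs" = c(12, 2, 486, 279, 4, 0),
  "60+ yrs"  = c(91, 328, 190, 19, 2, 1)
)
colnames(garage_counts) <- paste0(0:5, "-car")

# Share of each garage size within an age group (each row sums to 1)
within_age <- prop.table(garage_counts, margin = 1)
print(round(100 * within_age, 1))

# Share of homes with at most a 1-car garage, by age group
small_garage <- rowSums(within_age[, c("0-car", "1-car")])
print(round(100 * small_garage, 1))
```

Under 2% of the newest homes have a 0- or 1-car garage, compared with roughly two thirds of the 60+ year old homes.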

Testable Hypothesis: Recent homes (0-10 years) with no garage or 1-car garage are priced significantly below their age cohort average, indicating garage capacity is now a critical value driver. We can test this by comparing price-per-square-foot across garage categories within recent construction.
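
As a rough preview of that comparison, the averages already in the table give an approximate price per square foot for homes aged 0-10 years. This is only a sketch: it divides the group's average price by its average living area, which is not the same as averaging each home's own price per square foot, and the 0-, 1-, and 4-car groups are tiny.

```r
# Rows for the "0-10 yrs" category, copied from the age-by-garage table above
recent <- data.frame(
  Garage.Cars = 0:4,
  Count       = c(12, 2, 486, 279, 4),
  Avg_Area    = c(1109.083, 1103.000, 1543.975, 2013.237, 2787.750),
  Avg_Price   = c(135979.17, 150000.00, 207929.28, 323792.87, 317125.00)
)

# Approximate price per square foot: ratio of group means (an assumption,
# not the mean of per-home ratios)
recent$Approx_PPSF <- recent$Avg_Price / recent$Avg_Area

print(recent, row.names = FALSE)
```

A proper test would compute SalePrice / Gr.Liv.Area for each home first and then compare the distributions across garage sizes.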

# Grouped (dodged) bar chart showing the distribution; the NA garage row is dropped
group3_df_viz <- group3_df |>
  filter(!is.na(Garage.Cars)) |>
  mutate(Garage_Label = paste0(Garage.Cars, "-car"))

ggplot(group3_df_viz, aes(x = Age_Category, y = Count, fill = Garage_Label)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = Count), position = position_dodge(width = 0.9),
            vjust = -0.3, size = 3) +
  scale_fill_brewer(palette = "Set2", name = "Garage\nCapacity") +
  labs(title = "Home Age vs. Garage Capacity: Shifting Automotive Culture",
       subtitle = "Newer homes increasingly feature multi-car garages",
       x = "Home Age Category",
       y = "Number of Homes") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "right")


Part 2: Categorical Variable Combinations Analysis

Combination Analysis: House Style × Roof Style

Research Question: Which architectural combinations exist or don’t exist in Ames, and what does this reveal about architectural constraints and preferences?

# Getting all unique combinations of House Style and Roof Style
actual_combinations <- ames |>
  select(House.Style, Roof.Style) |>
  distinct() |>
  arrange(House.Style, Roof.Style)

# Getting counts for each combination
combination_counts <- ames |>
  group_by(House.Style, Roof.Style) |>
  summarise(Count = n(), .groups = "drop") |>
  arrange(desc(Count))

# Creating all possible combinations
all_house_styles <- unique(ames$House.Style)
all_roof_styles <- unique(ames$Roof.Style)
all_possible <- expand.grid(
  House.Style = all_house_styles,
  Roof.Style = all_roof_styles,
  stringsAsFactors = FALSE
)

# Finding missing combinations
missing_combinations <- all_possible |>
  anti_join(actual_combinations, by = c("House.Style", "Roof.Style"))

cat("=== ARCHITECTURAL COMBINATION ANALYSIS ===\n\n")
## === ARCHITECTURAL COMBINATION ANALYSIS ===
cat("Total possible combinations:", nrow(all_possible), "\n")
## Total possible combinations: 48
cat("Actual combinations observed:", nrow(actual_combinations), "\n")
## Actual combinations observed: 28
cat("Missing combinations:", nrow(missing_combinations), "\n\n")
## Missing combinations: 20
if(nrow(missing_combinations) > 0) {
  cat("MISSING COMBINATIONS:\n")
  kable(missing_combinations,
        caption = "House Style × Roof Style Combinations NOT in Dataset")
} else {
  cat("All possible combinations are present in the dataset!\n")
}
## MISSING COMBINATIONS:
House Style × Roof Style Combinations NOT in Dataset
House.Style Roof.Style
1.5Unf Hip
1.5Fin Mansard
SFoyer Mansard
2.5Unf Mansard
1.5Unf Mansard
2.5Fin Mansard
SFoyer Gambrel
2.5Unf Gambrel
1.5Unf Gambrel
2.5Fin Gambrel
2Story Shed
SFoyer Shed
SLvl Shed
2.5Unf Shed
1.5Unf Shed
2.5Fin Shed
1.5Fin Flat
2.5Unf Flat
1.5Unf Flat
2.5Fin Flat
# Analyzing most and least common combinations
cat("\n=== MOST COMMON COMBINATIONS ===\n")
## 
## === MOST COMMON COMBINATIONS ===
kable(head(combination_counts, 10),
      col.names = c("House Style", "Roof Style", "Count"),
      caption = "Top 10 Most Common Architectural Combinations")
Top 10 Most Common Architectural Combinations
House Style Roof Style Count
1Story Gable 1053
2Story Gable 746
1Story Hip 407
1.5Fin Gable 300
2Story Hip 102
SLvl Gable 100
SFoyer Gable 77
SLvl Hip 24
1.5Unf Gable 19
2.5Unf Gable 19
cat("\n=== LEAST COMMON COMBINATIONS ===\n")
## 
## === LEAST COMMON COMBINATIONS ===
least_common <- combination_counts |>
  filter(Count <= 5) |>
  arrange(Count)

kable(least_common,
      col.names = c("House Style", "Roof Style", "Count"),
      caption = "Rarest Architectural Combinations (≤5 homes)")
Rarest Architectural Combinations (≤5 homes)
House Style Roof Style Count
1.5Fin Shed 1
1Story Gambrel 1
2.5Fin Hip 1
SLvl Gambrel 1
SLvl Mansard 1
SFoyer Flat 2
SLvl Flat 2
1Story Mansard 3
2Story Flat 3
1Story Shed 4
SFoyer Hip 4
1.5Fin Gambrel 5
2.5Unf Hip 5
# Calculating probabilities
combination_probs <- combination_counts |>
  mutate(
    Probability = Count / nrow(ames),
    Probability_Pct = sprintf("%.2f%%", Probability * 100)
  ) |>
  arrange(desc(Probability))

Why Certain Combinations Are Missing:

Twenty of the 48 possible combinations never appear in the dataset, probably because:

  1. Some pairings don’t suit each other structurally or stylistically: Mansard roofs, for example, never appear on 1.5-story, 2.5-story, or split-foyer homes, and only 3 one-story homes have one. A steep, old-fashioned Mansard roof would look out of place on a modern ranch house.

  2. Ames has pretty traditional architecture: Almost all homes here have Gable or Hip roofs (98% of them). The fancier roof styles like Gambrel or Mansard are super rare, probably just on a few custom or historical homes.

  3. Different eras have different styles: Certain house types like split-levels were popular in specific decades, and they came with the roof styles that were popular then. You wouldn’t see modern roof designs on old split-level homes.
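
The missing combinations discussed above can also be found directly from a contingency table, since any cell with a zero count is a combination that never occurs. This is an alternative sketch in base R on made-up toy rows, not the method used earlier in this report.

```r
# Toy data for illustration only (made-up rows, not from ames.csv)
toy <- data.frame(
  House.Style = c("1Story", "1Story", "2Story", "2Story", "SLvl"),
  Roof.Style  = c("Gable", "Hip", "Gable", "Gable", "Hip")
)

# table() counts every house-style / roof-style pair, including pairs with 0 homes
xtab <- as.data.frame(table(toy$House.Style, toy$Roof.Style),
                      stringsAsFactors = FALSE)
names(xtab) <- c("House.Style", "Roof.Style", "Count")

# Zero-count cells are the combinations that never appear
missing <- xtab[xtab$Count == 0, c("House.Style", "Roof.Style")]
print(missing, row.names = FALSE)
```

On the toy rows this flags 2Story + Hip and SLvl + Gable. One design note: this only considers styles that appear at least once in the data, which is the same definition of "possible" used in the analysis above.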

Most Common Combinations:

cat("\nThe three most common combinations are:\n")
## 
## The three most common combinations are:
top3 <- head(combination_probs, 3)
for(i in 1:3) {
  cat(sprintf("%d. %s + %s: %s probability (%d homes)\n",
              i,
              top3$House.Style[i],
              top3$Roof.Style[i],
              top3$Probability_Pct[i],
              top3$Count[i]))
}
## 1. 1Story + Gable: 35.94% probability (1053 homes)
## 2. 2Story + Gable: 25.46% probability (746 homes)
## 3. 1Story + Hip: 13.89% probability (407 homes)

These are the most popular combinations because Ames is a pretty traditional Midwestern town. The 1Story-Gable combo is basically your classic ranch house, and the 2Story-Gable is your classic colonial or traditional two-story. These are what got built during the big suburban boom after World War II, which is probably when a lot of Ames was developed.

Least Common Combinations:

The rarest combinations are probably either weird custom builds or old historical homes. Any house with a Shed or Flat roof is unusual here—these show up in less than 1% of homes. Shed roofs might be on some modern designer homes, and Flat roofs could be older commercial buildings that got converted to homes or some contemporary minimalist houses.

# Limiting visualization to top 6 house styles by frequency to avoid clutter
top_house_styles <- ames |>
  count(House.Style, sort = TRUE) |>
  head(6) |>
  pull(House.Style)

# Filtering for visualization
combo_viz_data <- combination_counts |>
  filter(House.Style %in% top_house_styles)

ggplot(combo_viz_data, aes(x = House.Style, y = Count, fill = Roof.Style)) +
  geom_col(position = "dodge") +
  scale_fill_brewer(palette = "Set3", name = "Roof Style") +
  labs(title = "Architectural Combinations: House Style × Roof Style",
       subtitle = "Limited to 6 most common house styles for clarity",
       x = "House Style",
       y = "Number of Homes",
       caption = "1Story = Ranch, 2Story = Two-story, 1.5Fin = 1.5 story finished, \nSLvl = Split level, SFoyer = Split foyer") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "right")

Insight from Visualization: The chart shows that Gable and Hip roofs are everywhere across all house types, but you see them in different amounts. Two-story homes really prefer Gable roofs even more than one-story homes do. I think this is because a Gable roof (the pointy triangle kind) looks better on a tall house—it emphasizes the height. Hip roofs (where all sides slope down) are flatter and work well on both styles but seem especially popular on ranch houses where you don’t want the house to look too tall.
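
To put numbers on the Gable preference, here is a small sketch using the Gable and Hip counts copied from the top-10 table above. It ignores the rarer roof styles, so the shares are out of Gable plus Hip only.

```r
# Gable and Hip counts for the two most common house styles (from the table above)
roofs <- rbind(
  "1Story" = c(Gable = 1053, Hip = 407),
  "2Story" = c(Gable = 746,  Hip = 102)
)

# Share of Gable roofs among Gable + Hip homes of each style
gable_share <- roofs[, "Gable"] / rowSums(roofs)
print(round(100 * gable_share, 1))
```

Roughly 88% of these two-story homes have Gable roofs versus about 72% of one-story homes, which matches what the chart shows.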


Conclusion

Looking at group probabilities and rare combinations this week taught me a lot about the Ames housing market:

  1. Non-single-family homes are uncommon: Two-family conversions, interior townhouse units, and duplexes each make up less than 5% of Ames homes, and all non-single-family types together are only about 17%. Ames is mostly single-family houses (about 83%).

  2. Quality and neighborhood go together: You almost never find a low-quality house in an expensive neighborhood (only a 0.17% chance of a Premium-tier home rated 4, and none rated lower). Nice neighborhoods stay nice, and builders mostly put high-quality houses in nice areas.

  3. Garages became essential over time: Modern homes without garages are super rare because everyone expects parking now. The shift from 1-car to 3-car garages shows how car ownership changed over the decades.

  4. Architecture is pretty traditional here: Most homes have standard Gable or Hip roofs (98% of them). The fancy or unusual combinations either don’t work structurally or just aren’t popular in the Midwest.

These findings could be useful in practice: developers could see that townhouses are uncommon and ask whether there is unmet demand, homebuyers can tell whether what they want is common or unusual (which affects how hard it will be to find), and appraisers can flag unusual homes that need special attention (like a quality-4 house in a Premium neighborhood).

Next Steps: I’d like to look at whether the rare combinations are priced differently—do unique features make homes more expensive or just harder to sell? Also, I’m curious if the rare combinations are becoming more or less common over time, which would show if trends are changing.