Ames Housing Data Dive Week 3: Group Probabilities and Anomaly Detection

Introduction

This week I’m looking at group probabilities and finding unusual patterns in the Ames housing data. By grouping homes in different ways, I can figure out which combinations are super rare and which are common. This is kind of like anomaly detection—finding the weird outliers that don’t fit the normal patterns. Understanding which home features rarely occur together could help buyers find niches, show builders what’s missing in the market, or just help me spot data issues.

Data Loading

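library(dplyr)    # group_by(), summarise(), mutate(), joins
library(ggplot2)  # charts
library(knitr)    # kable() tables

ames <- read.csv("ames.csv", stringsAsFactors = FALSE)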

cat("Dataset dimensions:", nrow(ames), "rows and", ncol(ames), "columns\n")

## Dataset dimensions: 2930 rows and 82 columns

Part 1: Group-By Analysis with Probability Investigation

Grouping 1: Building Type by Sale Price Analysis

Research Question: Which building types are rarest in the Ames market, and what does this tell us about housing development patterns?

# Grouping by building type and summarising sale prices
group1_df <- ames |>
  group_by(Bldg.Type) |>
  summarise(
    Count = n(),
    Median_Price = median(SalePrice),
    Mean_Price = mean(SalePrice),
    Std_Price = sd(SalePrice)
  ) |>
  arrange(Count) |>
  mutate(
    Probability = Count / sum(Count),
    Probability_Pct = sprintf("%.2f%%", Probability * 100),
    Rarity_Tag = if_else(Probability < 0.05, "RARE", "Common")
  )

# Displaying full table
kable(group1_df, 
      col.names = c("Building Type", "Count", "Median Price", "Mean Price", 
                    "Std Dev", "Probability", "Probability %", "Rarity"),
      caption = "Building Type Distribution and Price Statistics",
      format.args = list(big.mark = ","))

Building Type Distribution and Price Statistics
Building Type	Count	Median Price	Mean Price	Std Dev	Probability	Probability %	Rarity
2fmCon	62	122,250	125,581.7	31,089.24	0.0211604	2.12%	RARE
Twnhs	101	130,000	135,934.1	41,938.93	0.0344710	3.45%	RARE
Duplex	109	136,905	139,808.9	39,498.97	0.0372014	3.72%	RARE
TwnhsE	233	180,000	192,311.9	66,191.74	0.0795222	7.95%	Common
1Fam	2,425	165,000	184,812.0	82,821.80	0.8276451	82.76%	Common

# Identifying rarest groups
rarest_bldg <- group1_df |>
  filter(Rarity_Tag == "RARE")

cat("\n=== PROBABILITY ANALYSIS ===\n")

## 
## === PROBABILITY ANALYSIS ===

cat("If we randomly select a home from the dataset:\n")

## If we randomly select a home from the dataset:

for(i in 1:nrow(rarest_bldg)) {
  cat(sprintf("- Probability of getting a %s: %.2f%% (%d out of %d homes)\n",
              rarest_bldg$Bldg.Type[i],
              rarest_bldg$Probability[i] * 100,
              rarest_bldg$Count[i],
              nrow(ames)))
}

## - Probability of getting a 2fmCon: 2.12% (62 out of 2930 homes)
## - Probability of getting a Twnhs: 3.45% (101 out of 2930 homes)
## - Probability of getting a Duplex: 3.72% (109 out of 2930 homes)

Insight: Two-family conversions, interior townhouse units (Twnhs), and duplexes are each rare in Ames: you’d only have about a 2-4% chance of randomly picking any one of them from the dataset. Townhouse end units (TwnhsE) are more common at about 8%, so townhouses as a whole make up roughly 11% of homes (334 of 2,930). That still surprised me because townhouses are usually good starter homes, but Ames is clearly dominated by single-family houses (about 83%). My guess is that Ames developed as a typical Midwestern suburb where everyone wanted their own yard and house, so townhouses just weren’t built much.

Testable Hypothesis: Townhouses and multi-family dwellings are concentrated in specific neighborhoods near the university (Iowa State), where student and young professional demand justifies denser development. We can test this by cross-tabulating building type with neighborhood location.
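Here is a minimal sketch of how that cross-tabulation could look. It uses a tiny made-up data frame (`toy`) in place of `ames`, and the helper name `bldg_by_nbhd` is my own; only the column names `Bldg.Type` and `Neighborhood` come from the dataset. It sticks to base R so it runs without extra packages.

```r
# Toy stand-in for the ames data frame (rows are made up for illustration)
toy <- data.frame(
  Bldg.Type    = c("1Fam", "1Fam", "Twnhs", "Twnhs", "Duplex", "1Fam", "Twnhs", "1Fam"),
  Neighborhood = c("NAmes", "NAmes", "Blueste", "Blueste", "NAmes", "Gilbert", "Gilbert", "Gilbert"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: for each building type, the share of its homes
# that sits in each neighborhood (every row sums to 1)
bldg_by_nbhd <- function(df) {
  counts <- table(df$Bldg.Type, df$Neighborhood)
  prop.table(counts, margin = 1)
}

shares <- bldg_by_nbhd(toy)
print(round(shares, 2))
```

If the hypothesis holds, the rows for Twnhs, Duplex, and 2fmCon in the real data should pile up in a handful of neighborhoods, while the 1Fam row should be spread much more evenly.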

# Visualization with probability annotations
ggplot(group1_df, aes(x = reorder(Bldg.Type, -Count), y = Count, fill = Rarity_Tag)) +
  geom_col() +
  geom_text(aes(label = paste0(Count, "\n(", Probability_Pct, ")")), 
            vjust = -0.3, size = 3.5) +
  scale_fill_manual(values = c("RARE" = "#e74c3c", "Common" = "#3498db"),
                    name = "Classification") +
  labs(title = "Building Type Distribution with Selection Probabilities",
       subtitle = "Rare types (<5% probability) highlighted in red",
       x = "Building Type",
       y = "Number of Homes",
       caption = "1Fam = Single-Family, TwnhsE = Townhouse End Unit, Twnhs = Townhouse, \nDuplex = Duplex, 2fmCon = Two-Family Conversion") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0, hjust = 0.5))

Grouping 2: Neighborhood Tier and Overall Quality Combinations

Research Question: Which quality-neighborhood combinations are rarest, and what does this reveal about market segmentation?

# Creating neighborhood price tiers
neighborhood_tiers <- ames |>
  group_by(Neighborhood) |>
  summarise(Median_Price = median(SalePrice)) |>
  mutate(Tier = case_when(
    Median_Price >= 200000 ~ "Premium",
    Median_Price >= 140000 ~ "Mid-Market",
    TRUE ~ "Value"
  ))

# Join and group by tier and quality
group2_df <- ames |>
  left_join(neighborhood_tiers |> select(Neighborhood, Tier), by = "Neighborhood") |>
  group_by(Tier, Overall.Qual) |>
  summarise(
    Count = n(),
    Avg_Price = mean(SalePrice),
    .groups = "drop"
  ) |>
  arrange(Count) |>
  mutate(
    Probability = Count / nrow(ames),
    Probability_Pct = sprintf("%.2f%%", Probability * 100),
    Rarity_Tag = if_else(Probability < 0.01, "EXTREMELY RARE", 
                        if_else(Probability < 0.03, "RARE", "Common"))
  )

# Displaying full table
kable(group2_df, 
      col.names = c("Neighborhood Tier", "Quality Rating", "Count", 
                    "Avg Price", "Probability", "Probability %", "Rarity"),
      caption = "Neighborhood Tier × Quality Rating Combinations",
      format.args = list(big.mark = ","))

Neighborhood Tier × Quality Rating Combinations
Neighborhood Tier	Quality Rating	Count	Avg Price	Probability	Probability %	Rarity
Mid-Market	1	1	81,500.00	0.0003413	0.03%	EXTREMELY RARE
Mid-Market	9	1	377,500.00	0.0003413	0.03%	EXTREMELY RARE
Value	9	1	320,000.00	0.0003413	0.03%	EXTREMELY RARE
Value	1	3	37,800.00	0.0010239	0.10%	EXTREMELY RARE
Premium	4	5	147,220.00	0.0017065	0.17%	EXTREMELY RARE
Value	10	5	265,720.00	0.0017065	0.17%	EXTREMELY RARE
Mid-Market	3	6	92,216.67	0.0020478	0.20%	EXTREMELY RARE
Value	8	9	174,822.22	0.0030717	0.31%	EXTREMELY RARE
Value	2	13	52,325.31	0.0044369	0.44%	EXTREMELY RARE
Premium	10	26	485,697.58	0.0088737	0.89%	EXTREMELY RARE
Value	3	34	81,592.32	0.0116041	1.16%	RARE
Mid-Market	8	41	235,685.63	0.0139932	1.40%	RARE
Mid-Market	4	48	126,056.04	0.0163823	1.64%	RARE
Value	7	55	164,691.98	0.0187713	1.88%	RARE
Premium	5	85	151,062.25	0.0290102	2.90%	RARE
Premium	9	105	368,709.85	0.0358362	3.58%	Common
Premium	6	124	185,563.01	0.0423208	4.23%	Common
Value	4	173	99,877.70	0.0590444	5.90%	Common
Value	6	220	136,692.04	0.0750853	7.51%	Common
Mid-Market	7	254	201,163.34	0.0866894	8.67%	Common
Premium	7	293	215,945.26	0.1000000	10.00%	Common
Premium	8	300	278,610.82	0.1023891	10.24%	Common
Mid-Market	5	342	140,269.19	0.1167235	11.67%	Common
Mid-Market	6	388	169,065.29	0.1324232	13.24%	Common
Value	5	398	126,528.82	0.1358362	13.58%	Common

# Identifying extremely rare combinations
extremely_rare <- group2_df |>
  filter(Rarity_Tag == "EXTREMELY RARE")

cat("\n=== EXTREME RARITY ANALYSIS ===\n")

## 
## === EXTREME RARITY ANALYSIS ===

cat("Combinations with <1% probability:\n")

## Combinations with <1% probability:

for(i in 1:nrow(extremely_rare)) {
  cat(sprintf("- %s neighborhood + Quality %d: %.3f%% probability (%d homes)\n",
              extremely_rare$Tier[i],
              extremely_rare$Overall.Qual[i],
              extremely_rare$Probability[i] * 100,
              extremely_rare$Count[i]))
}

## - Mid-Market neighborhood + Quality 1: 0.034% probability (1 homes)
## - Mid-Market neighborhood + Quality 9: 0.034% probability (1 homes)
## - Value neighborhood + Quality 9: 0.034% probability (1 homes)
## - Value neighborhood + Quality 1: 0.102% probability (3 homes)
## - Premium neighborhood + Quality 4: 0.171% probability (5 homes)
## - Value neighborhood + Quality 10: 0.171% probability (5 homes)
## - Mid-Market neighborhood + Quality 3: 0.205% probability (6 homes)
## - Value neighborhood + Quality 8: 0.307% probability (9 homes)
## - Value neighborhood + Quality 2: 0.444% probability (13 homes)
## - Premium neighborhood + Quality 10: 0.887% probability (26 homes)

Insight: Some combinations basically don’t exist in the data. Premium neighborhoods almost never have low-quality homes: no premium-tier home is rated below 4, and only 5 are rated 4 (a 0.17% chance of drawing one at random from the whole dataset). That makes sense, because if you’re in an expensive neighborhood, you probably can’t get away with having a run-down house. Either the neighborhood standards keep quality high, or low-quality homes just get renovated. On the flip side, value neighborhoods rarely have excellent-quality homes either: only 15 value-tier homes are rated 8 or higher (about 0.5% of the dataset). I think this is because if you’re going to build a really nice house, you’re probably going to do it in a nicer neighborhood where people will pay for it.
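One thing worth keeping straight: the Probability column above is a joint probability (tier and quality together, out of all 2,930 homes). The question "how likely is a low-quality home given that it’s in a premium neighborhood" is a conditional probability, which divides by the tier total instead. From the table, the Premium rows add up to 938 homes, so a quality-4 home is 5/938, or about 0.53% of premium-tier homes. A small sketch of the difference, using made-up rows in `toy` rather than the real data:

```r
# Toy stand-in for the joined tier/quality data (rows are made up)
toy <- data.frame(
  Tier         = c("Premium", "Premium", "Premium", "Premium",
                   "Value", "Value", "Value", "Value", "Value"),
  Overall.Qual = c(8, 8, 7, 4, 5, 5, 4, 9, 5)
)

counts <- table(toy$Tier, toy$Overall.Qual)

# Joint probability: P(tier AND quality), cells sum to 1 over the whole table
joint <- prop.table(counts)

# Conditional probability: P(quality | tier), each tier's row sums to 1
cond <- prop.table(counts, margin = 1)

print(round(joint, 3))
print(round(cond, 3))
```

In the toy data the Premium/quality-4 cell is 1 of 9 homes overall but 1 of 4 premium homes, so the two numbers answer different questions.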

Testable Hypothesis: The rare combinations (premium-low quality, value-high quality) will show the largest price premiums/discounts relative to their tier averages, indicating they’re market anomalies. A quality-10 home in a value neighborhood might command a price far above the neighborhood median.
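A rough sketch of how this could be checked: compute each home’s price relative to the median of its tier, then look at that ratio for the rare combinations. The data frame `toy` and the column name `Price_vs_Tier` are made up for illustration; `Tier`, `Overall.Qual`, and `SalePrice` follow the names used above.

```r
# Toy stand-in: a few homes with tier, quality, and sale price (made-up values)
toy <- data.frame(
  Tier         = c("Value", "Value", "Value", "Premium", "Premium", "Premium"),
  Overall.Qual = c(5, 5, 10, 8, 8, 4),
  SalePrice    = c(100000, 120000, 260000, 300000, 320000, 150000)
)

# Median sale price of each home's tier, repeated per row
tier_median <- ave(toy$SalePrice, toy$Tier, FUN = median)

# Premium (+) or discount (-) relative to the tier median, as a fraction
toy$Price_vs_Tier <- toy$SalePrice / tier_median - 1

print(toy)
```

On the real data, the same column could be averaged within each tier-quality cell and compared between the rare and common cells.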

# Heatmap visualization
ggplot(group2_df, aes(x = Overall.Qual, y = Tier, fill = Count)) +
  geom_tile(color = "white", linewidth = 1) +
  geom_text(aes(label = Count), color = "white", fontface = "bold", size = 4) +
  scale_fill_gradient(low = "#2c3e50", high = "#e74c3c", 
                      name = "Number\nof Homes",
                      trans = "log10") +
  scale_x_continuous(breaks = 1:10) +
  labs(title = "Rarity Heatmap: Neighborhood Tier × Quality Rating",
       subtitle = "Darker colors = rarer combinations (log scale)",
       x = "Overall Quality Rating (1-10)",
       y = "Neighborhood Tier") +
  theme_minimal() +
  theme(panel.grid = element_blank())

Grouping 3: Home Age Categories and Garage Capacity

Research Question: Which age-garage combinations are rarest, and what does this tell us about evolving homeowner preferences?

# Creating age categories (relative to 2010) and analyzing garage capacity
group3_df <- ames |>
  mutate(
    Age_Category = cut(2010 - Year.Built,
                      breaks = c(-1, 10, 20, 40, 60, 200),
                      labels = c("0-10 yrs", "11-20 yrs", "21-40 yrs", 
                                "41-60 yrs", "60+ yrs"))
  ) |>
  group_by(Age_Category, Garage.Cars) |>
  summarise(
    Count = n(),
    Avg_Area = mean(Gr.Liv.Area),
    Avg_Price = mean(SalePrice),
    .groups = "drop"
  ) |>
  arrange(Count) |>
  mutate(
    Probability = Count / nrow(ames),
    Probability_Pct = sprintf("%.2f%%", Probability * 100),
    Rarity_Tag = if_else(Probability < 0.01, "EXTREMELY RARE",
                        if_else(Probability < 0.03, "RARE", "Common"))
  )

# Displaying full table
kable(group3_df, 
      col.names = c("Age Category", "Garage Capacity (Cars)", "Count",
                    "Avg Living Area", "Avg Price", "Probability", 
                    "Probability %", "Rarity"),
      caption = "Home Age × Garage Capacity Combinations",
      format.args = list(big.mark = ","))

Home Age × Garage Capacity Combinations
Age Category	Garage Capacity (Cars)	Count	Avg Living Area	Avg Price	Probability	Probability %	Rarity
60+ yrs	5	1	1,072.000	126,500.00	0.0003413	0.03%	EXTREMELY RARE
60+ yrs	NA	1	1,828.000	150,909.00	0.0003413	0.03%	EXTREMELY RARE
0-10 yrs	1	2	1,103.000	150,000.00	0.0006826	0.07%	EXTREMELY RARE
21-40 yrs	4	2	1,856.000	207,750.00	0.0006826	0.07%	EXTREMELY RARE
60+ yrs	4	2	2,038.000	202,489.50	0.0006826	0.07%	EXTREMELY RARE
41-60 yrs	4	3	2,096.667	166,000.00	0.0010239	0.10%	EXTREMELY RARE
0-10 yrs	4	4	2,787.750	317,125.00	0.0013652	0.14%	EXTREMELY RARE
11-20 yrs	0	4	937.500	116,900.00	0.0013652	0.14%	EXTREMELY RARE
11-20 yrs	4	5	1,459.600	214,600.00	0.0017065	0.17%	EXTREMELY RARE
11-20 yrs	1	9	1,135.556	166,794.44	0.0030717	0.31%	EXTREMELY RARE
21-40 yrs	3	9	1,757.111	204,933.33	0.0030717	0.31%	EXTREMELY RARE
41-60 yrs	3	10	1,604.300	147,670.00	0.0034130	0.34%	EXTREMELY RARE
0-10 yrs	0	12	1,109.083	135,979.17	0.0040956	0.41%	EXTREMELY RARE
60+ yrs	3	19	2,067.684	195,889.47	0.0064846	0.65%	EXTREMELY RARE
41-60 yrs	0	24	1,362.333	109,160.42	0.0081911	0.82%	EXTREMELY RARE
21-40 yrs	0	26	1,037.192	109,048.08	0.0088737	0.89%	EXTREMELY RARE
11-20 yrs	3	57	2,469.947	327,591.54	0.0194539	1.95%	RARE
60+ yrs	0	91	1,264.418	98,050.36	0.0310580	3.11%	Common
21-40 yrs	1	98	1,025.735	118,793.27	0.0334471	3.34%	Common
60+ yrs	2	190	1,580.795	146,321.27	0.0648464	6.48%	Common
11-20 yrs	2	259	1,666.950	204,079.43	0.0883959	8.84%	Common
0-10 yrs	3	279	2,013.237	323,792.87	0.0952218	9.52%	Common
41-60 yrs	2	319	1,445.871	161,544.82	0.1088737	10.89%	Common
60+ yrs	1	328	1,292.988	122,218.40	0.1119454	11.19%	Common
41-60 yrs	1	341	1,158.774	133,382.79	0.1163823	11.64%	Common
21-40 yrs	2	349	1,482.444	174,802.32	0.1191126	11.91%	Common
0-10 yrs	2	486	1,543.975	207,929.28	0.1658703	16.59%	Common

# Identifying rare combinations
rare_age_garage <- group3_df |>
  filter(Rarity_Tag != "Common")

cat("\n=== RARE AGE-GARAGE COMBINATIONS ===\n")

## 
## === RARE AGE-GARAGE COMBINATIONS ===

for(i in 1:nrow(rare_age_garage)) {
  cat(sprintf("- %s homes with %d-car garage: %.2f%% probability (%d homes)\n",
              rare_age_garage$Age_Category[i],
              rare_age_garage$Garage.Cars[i],
              rare_age_garage$Probability[i] * 100,
              rare_age_garage$Count[i]))
}

## - 60+ yrs homes with 5-car garage: 0.03% probability (1 homes)
## - 60+ yrs homes with NA-car garage: 0.03% probability (1 homes)
## - 0-10 yrs homes with 1-car garage: 0.07% probability (2 homes)
## - 21-40 yrs homes with 4-car garage: 0.07% probability (2 homes)
## - 60+ yrs homes with 4-car garage: 0.07% probability (2 homes)
## - 41-60 yrs homes with 4-car garage: 0.10% probability (3 homes)
## - 0-10 yrs homes with 4-car garage: 0.14% probability (4 homes)
## - 11-20 yrs homes with 0-car garage: 0.14% probability (4 homes)
## - 11-20 yrs homes with 4-car garage: 0.17% probability (5 homes)
## - 11-20 yrs homes with 1-car garage: 0.31% probability (9 homes)
## - 21-40 yrs homes with 3-car garage: 0.31% probability (9 homes)
## - 41-60 yrs homes with 3-car garage: 0.34% probability (10 homes)
## - 0-10 yrs homes with 0-car garage: 0.41% probability (12 homes)
## - 60+ yrs homes with 3-car garage: 0.65% probability (19 homes)
## - 41-60 yrs homes with 0-car garage: 0.82% probability (24 homes)
## - 21-40 yrs homes with 0-car garage: 0.89% probability (26 homes)
## - 11-20 yrs homes with 3-car garage: 1.95% probability (57 homes)

Insight: The data shows how much garages have become expected over time. New homes (built in the last 20 years) almost never have zero or just one car garage: those four combinations add up to only 27 homes, a bit under 1% of the dataset. Everyone expects at least a 2-car garage now, and 3-car garages are pretty common. But old homes (60+ years) rarely have 3-car garages (19 homes, 0.65%) because back then, most families only had one car, so big garages weren’t needed. The rarest combinations involve the biggest garages: no age group has more than 5 homes with a 4-car garage (0.07% to 0.17% each), and only one home, a 60+ year-old one, has a 5-car garage. Even among homes built in the 1990s and 2000s, 4-car garages were still just for really big custom houses. One 60+ year-old home also has a missing Garage.Cars value (the NA row), which is a data issue rather than a real category.

Testable Hypothesis: Recent homes (0-10 years) with no garage or 1-car garage are priced significantly below their age cohort average, indicating garage capacity is now a critical value driver. We can test this by comparing price-per-square-foot across garage categories within recent construction.
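As a starting point for that test, here is a small sketch that keeps only recent homes, computes price per square foot, and takes the median by garage capacity. The rows in `toy` are made up and `Price_Per_SqFt` is a name I’m introducing; the other column names match the dataset. Age is measured relative to 2010, the same convention used for Age_Category above.

```r
# Toy stand-in for ames (values are made up for illustration)
toy <- data.frame(
  Year.Built  = c(2005, 2006, 2003, 2008, 1950, 2004),
  Garage.Cars = c(0, 2, 2, 3, 1, 1),
  Gr.Liv.Area = c(1000, 1500, 1600, 2000, 1200, 1100),
  SalePrice   = c(120000, 210000, 208000, 320000, 110000, 143000)
)

# Keep homes that are at most 10 years old as of 2010
recent <- toy[2010 - toy$Year.Built <= 10, ]

# Price per square foot of above-ground living area
recent$Price_Per_SqFt <- recent$SalePrice / recent$Gr.Liv.Area

# Median price per square foot for each garage capacity
ppsf <- aggregate(Price_Per_SqFt ~ Garage.Cars, data = recent, FUN = median)
print(ppsf)
```

With only 14 recent homes in the real data having a 0- or 1-car garage, any difference would need to be read cautiously.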

# Grouped (dodged) bar chart showing distribution
group3_df_viz <- group3_df |>
  filter(!is.na(Garage.Cars)) |>
  mutate(Garage_Label = paste0(Garage.Cars, "-car"))

ggplot(group3_df_viz, aes(x = Age_Category, y = Count, fill = Garage_Label)) +
  geom_col(position = "dodge") +
  geom_text(aes(label = Count), position = position_dodge(width = 0.9),
            vjust = -0.3, size = 3) +
  scale_fill_brewer(palette = "Set2", name = "Garage\nCapacity") +
  labs(title = "Home Age vs. Garage Capacity: Shifting Automotive Culture",
       subtitle = "Newer homes increasingly feature multi-car garages",
       x = "Home Age Category",
       y = "Number of Homes") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "right")

Part 2: Categorical Variable Combinations Analysis

Combination Analysis: House Style × Roof Style

Research Question: Which architectural combinations exist or don’t exist in Ames, and what does this reveal about architectural constraints and preferences?

# Getting all unique combinations of House Style and Roof Style
actual_combinations <- ames |>
  select(House.Style, Roof.Style) |>
  distinct() |>
  arrange(House.Style, Roof.Style)

# Getting counts for each combination
combination_counts <- ames |>
  group_by(House.Style, Roof.Style) |>
  summarise(Count = n(), .groups = "drop") |>
  arrange(desc(Count))

# Creating all possible combinations
all_house_styles <- unique(ames$House.Style)
all_roof_styles <- unique(ames$Roof.Style)
all_possible <- expand.grid(
  House.Style = all_house_styles,
  Roof.Style = all_roof_styles,
  stringsAsFactors = FALSE
)

# Finding missing combinations
missing_combinations <- all_possible |>
  anti_join(actual_combinations, by = c("House.Style", "Roof.Style"))

cat("=== ARCHITECTURAL COMBINATION ANALYSIS ===\n\n")

## === ARCHITECTURAL COMBINATION ANALYSIS ===

cat("Total possible combinations:", nrow(all_possible), "\n")

## Total possible combinations: 48

cat("Actual combinations observed:", nrow(actual_combinations), "\n")

## Actual combinations observed: 28

cat("Missing combinations:", nrow(missing_combinations), "\n\n")

## Missing combinations: 20

if(nrow(missing_combinations) > 0) {
  cat("MISSING COMBINATIONS:\n")
  kable(missing_combinations,
        caption = "House Style × Roof Style Combinations NOT in Dataset")
} else {
  cat("All possible combinations are present in the dataset!\n")
}

## MISSING COMBINATIONS:

House Style × Roof Style Combinations NOT in Dataset
House.Style	Roof.Style
1.5Unf	Hip
1.5Fin	Mansard
SFoyer	Mansard
2.5Unf	Mansard
1.5Unf	Mansard
2.5Fin	Mansard
SFoyer	Gambrel
2.5Unf	Gambrel
1.5Unf	Gambrel
2.5Fin	Gambrel
2Story	Shed
SFoyer	Shed
SLvl	Shed
2.5Unf	Shed
1.5Unf	Shed
2.5Fin	Shed
1.5Fin	Flat
2.5Unf	Flat
1.5Unf	Flat
2.5Fin	Flat

# Analyzing most and least common combinations
cat("\n=== MOST COMMON COMBINATIONS ===\n")

## 
## === MOST COMMON COMBINATIONS ===

kable(head(combination_counts, 10),
      col.names = c("House Style", "Roof Style", "Count"),
      caption = "Top 10 Most Common Architectural Combinations")

Top 10 Most Common Architectural Combinations
House Style	Roof Style	Count
1Story	Gable	1053
2Story	Gable	746
1Story	Hip	407
1.5Fin	Gable	300
2Story	Hip	102
SLvl	Gable	100
SFoyer	Gable	77
SLvl	Hip	24
1.5Unf	Gable	19
2.5Unf	Gable	19

cat("\n=== LEAST COMMON COMBINATIONS ===\n")

## 
## === LEAST COMMON COMBINATIONS ===

least_common <- combination_counts |>
  filter(Count <= 5) |>
  arrange(Count)

kable(least_common,
      col.names = c("House Style", "Roof Style", "Count"),
      caption = "Rarest Architectural Combinations (≤5 homes)")

Rarest Architectural Combinations (≤5 homes)
House Style	Roof Style	Count
1.5Fin	Shed	1
1Story	Gambrel	1
2.5Fin	Hip	1
SLvl	Gambrel	1
SLvl	Mansard	1
SFoyer	Flat	2
SLvl	Flat	2
1Story	Mansard	3
2Story	Flat	3
1Story	Shed	4
SFoyer	Hip	4
1.5Fin	Gambrel	5
2.5Unf	Hip	5

# Calculating probabilities
combination_probs <- combination_counts |>
  mutate(
    Probability = Count / nrow(ames),
    Probability_Pct = sprintf("%.2f%%", Probability * 100)
  ) |>
  arrange(desc(Probability))

Why Certain Combinations Are Missing:

Twenty of the 48 possible combinations never appear in the data. That’s probably because:

Some pairings may not work well together: Some roof styles and house styles don’t match up well, either structurally or visually. I wouldn’t lean on this too hard, though, because even an odd-sounding pairing like 1Story + Mansard shows up 3 times.
Ames has pretty traditional architecture: Almost all homes here have Gable or Hip roofs (98% of them). That leaves very few Gambrel, Mansard, Shed, or Flat roofs to spread across eight house styles, and the three rarest house styles (1.5Unf, 2.5Unf, 2.5Fin) account for 13 of the 20 missing combinations. So some of these gaps may just be small-sample effects.
Different eras have different styles: Certain house types like split-levels were popular in specific decades, and they came with the roof styles that were popular then. You wouldn’t see modern roof designs on old split-level homes.
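One way to tell "doesn’t work together" apart from "too rare to show up" is to ask how many homes each combination would have if house style and roof style were independent: row total × column total / n. Using counts from the tables above, 1.5Unf has 19 homes and Shed roofs have 5, so the expected count is 19 × 5 / 2930, or about 0.03 homes. An empty cell there isn’t surprising. A minimal sketch on made-up data (`toy`), using base R:

```r
# Toy stand-in for the House.Style / Roof.Style columns (made-up rows)
toy <- data.frame(
  House.Style = c(rep("1Story", 6), rep("2Story", 3), "1.5Unf"),
  Roof.Style  = c(rep("Gable", 4), "Hip", "Hip", "Gable", "Gable", "Hip", "Gable"),
  stringsAsFactors = FALSE
)

# Observed counts, including zero cells for combinations that never occur
observed <- table(toy$House.Style, toy$Roof.Style)

# Expected counts if the two variables were independent:
# (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

print(observed)
print(round(expected, 2))
```

A missing combination with an expected count well above 1 would be the more interesting kind of gap, since chance alone doesn’t explain it.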

Most Common Combinations:

cat("\nThe three most common combinations are:\n")

## 
## The three most common combinations are:

top3 <- head(combination_probs, 3)
for(i in 1:3) {
  cat(sprintf("%d. %s + %s: %s probability (%d homes)\n",
              i,
              top3$House.Style[i],
              top3$Roof.Style[i],
              top3$Probability_Pct[i],
              top3$Count[i]))
}

## 1. 1Story + Gable: 35.94% probability (1053 homes)
## 2. 2Story + Gable: 25.46% probability (746 homes)
## 3. 1Story + Hip: 13.89% probability (407 homes)

These are the most popular combinations because Ames is a pretty traditional Midwestern town. The 1Story-Gable combo is basically your classic ranch house, and the 2Story-Gable is your classic colonial or traditional two-story. These are what got built during the big suburban boom after World War II, which is probably when a lot of Ames was developed.

Least Common Combinations:

The rarest combinations are probably either weird custom builds or old historical homes. Any house with a Shed or Flat roof is unusual here—these show up in less than 1% of homes. Shed roofs might be on some modern designer homes, and Flat roofs could be older commercial buildings that got converted to homes or some contemporary minimalist houses.

# Limiting visualization to top 6 house styles by frequency to avoid clutter
top_house_styles <- ames |>
  count(House.Style, sort = TRUE) |>
  head(6) |>
  pull(House.Style)

# Filtering for visualization
combo_viz_data <- combination_counts |>
  filter(House.Style %in% top_house_styles)

ggplot(combo_viz_data, aes(x = House.Style, y = Count, fill = Roof.Style)) +
  geom_col(position = "dodge") +
  scale_fill_brewer(palette = "Set3", name = "Roof Style") +
  labs(title = "Architectural Combinations: House Style × Roof Style",
       subtitle = "Limited to 6 most common house styles for clarity",
       x = "House Style",
       y = "Number of Homes",
       caption = "1Story = Ranch, 2Story = Two-story, 1.5Fin = 1.5 story finished, \nSLvl = Split level, SFoyer = Split foyer") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "right")

Insight from Visualization: The chart shows that Gable and Hip roofs appear across all six house styles shown, but in different proportions. Two-story homes favor Gable roofs even more than one-story homes do: based on the counts above, roughly 85% of two-story homes have a Gable roof versus about 71% of one-story homes. I think this is because a Gable roof (the pointy triangle kind) looks better on a tall house, since it emphasizes the height. Hip roofs (where all sides slope down) are flatter and work well on both styles but seem especially popular on ranch houses where you don’t want the house to look too tall.

Conclusion

Looking at group probabilities and rare combinations this week taught me a lot about the Ames housing market:

Non-single-family homes are rare: About 83% of Ames homes are single-family houses. Interior townhouse units, duplexes, and two-family conversions each make up less than 4% of homes, even though townhouses are usually good starter homes.
Quality and neighborhood go together: You basically can’t find a low-quality house in an expensive neighborhood (no premium-tier home is rated below 4, and homes rated 4 in premium neighborhoods are only 0.17% of the dataset). Nice neighborhoods stay nice, and builders mostly put high-quality houses in nice areas.
Garages became essential over time: Modern homes without garages are super rare because everyone expects parking now. The shift from 1-car to 3-car garages shows how car ownership changed over the decades.
Architecture is pretty traditional here: Most homes have standard Gable or Hip roofs (98% of them). The fancy or unusual combinations either don’t work structurally or just aren’t popular in the Midwest.

These findings could actually be useful for real people: developers could see that townhouses are uncommon and maybe there’s demand for them, homebuyers can understand if what they want is common or unusual (which affects how hard it’ll be to find), and appraisers can spot truly weird homes that need special attention (like a quality-4 house in a premium neighborhood).

Next Steps: I’d like to look at whether the rare combinations are priced differently—do unique features make homes more expensive or just harder to sell? Also, I’m curious if the rare combinations are becoming more or less common over time, which would show if trends are changing.