Introduction

This week I’m focusing on understanding the importance of good data documentation and spotting data quality issues. It’s easy to assume that a dataset is clean and well-explained, but there are often hidden gotchas that can lead to wrong conclusions if you’re not careful. I’ll look at unclear columns, missing data patterns, and potential outliers in the Ames housing data.

Data Loading

ames <- read.csv("ames.csv", stringsAsFactors = FALSE)
cat("Dataset:", nrow(ames), "homes with", ncol(ames), "variables\n")
## Dataset: 2930 homes with 82 variables

Part 1: Unclear Columns That Need Documentation

Column 1: MS.SubClass - What Does This Number Mean?

# Look at MS.SubClass values
subclass_dist <- ames |>
  count(MS.SubClass, sort = TRUE) |>
  head(10)

kable(subclass_dist,
      col.names = c("MS.SubClass", "Count"),
      caption = "Top 10 MS.SubClass Values")
Top 10 MS.SubClass Values
MS.SubClass Count
20 1079
60 575
50 287
120 192
30 139
160 129
70 128
80 118
90 109
190 61
cat("\nUnique values:", n_distinct(ames$MS.SubClass), "\n")
## 
## Unique values: 16

Why is this unclear? When I first looked at this column, I saw numbers like 20, 60, 50, 120. What do these mean? Are they IDs? Building codes? Just looking at the data, I have no idea.

After reading documentation: MS.SubClass actually refers to the dwelling type: - 20 = 1-story houses built since 1946 - 60 = 2-story houses built since 1946
- 50 = 1.5 story houses (all ages) - 120 = 1-story PUD (Planned Unit Development) built since 1946

Why they encoded it this way: They probably used numeric codes instead of text descriptions to save space and follow industry standards. Building assessors might use these same codes.

What could go wrong: If I didn’t read the documentation, I might think MS.SubClass is some kind of quality rating or price category because higher numbers don’t always mean newer or better. I could create completely wrong analyses like “SubClass 190 homes are more expensive than SubClass 20” without understanding that these are actually different building types, not levels of quality.


Column 2: MS.Zoning - Cryptic Zone Codes

# Look at zoning codes
zoning_dist <- ames |>
  count(MS.Zoning, sort = TRUE)

kable(zoning_dist,
      col.names = c("MS.Zoning", "Count"),
      caption = "Zoning Classification Distribution")
Zoning Classification Distribution
MS.Zoning Count
RL 2273
RM 462
FV 139
RH 27
C (all) 25
A (agr) 2
I (all) 2

Why is this unclear? The values are things like “RL”, “RM”, “FV”, “C (all)”. These look like abbreviations but I wouldn’t know what they mean without documentation.

After reading documentation: These are zoning classifications: - RL = Residential Low Density - RM = Residential Medium Density - FV = Floating Village Residential - RH = Residential High Density - C (all) = Commercial (all) - I (all) = Industrial (all) - A (agr) = Agriculture

Why they encoded it this way: Official city zoning codes use these abbreviations. They’re standardized across many municipalities and used in legal documents.

What could go wrong: Without documentation, I might think “FV” is some kind of quality designation (“Fine View”?), or that “C (all)” and “I (all)” mean something completely different. I could incorrectly group residential and commercial properties together if I didn’t understand these codes.


Part 2: What’s Still Unclear Even After Documentation

The Mystery of Missing Values: Are They Really Missing?

Looking at the documentation, there’s a big ambiguity about how missing data is represented. Let me investigate:

# Check several columns that might have "missing" data
missing_summary <- tibble(
  Column = c("Alley", "Fence", "Pool.QC", "Fireplace.Qu", "Bsmt.Qual", "Garage.Type"),
  NA_Count = c(
    sum(is.na(ames$Alley)),
    sum(is.na(ames$Fence)),
    sum(is.na(ames$Pool.QC)),
    sum(is.na(ames$Fireplace.Qu)),
    sum(is.na(ames$Bsmt.Qual)),
    sum(is.na(ames$Garage.Type))
  ),
  Percent_Missing = c(
    sum(is.na(ames$Alley)) / nrow(ames) * 100,
    sum(is.na(ames$Fence)) / nrow(ames) * 100,
    sum(is.na(ames$Pool.QC)) / nrow(ames) * 100,
    sum(is.na(ames$Fireplace.Qu)) / nrow(ames) * 100,
    sum(is.na(ames$Bsmt.Qual)) / nrow(ames) * 100,
    sum(is.na(ames$Garage.Type)) / nrow(ames) * 100
  )
)

kable(missing_summary,
      col.names = c("Column", "NA Count", "% Missing"),
      caption = "Columns with High Missing Data",
      digits = 1)
Columns with High Missing Data
Column NA Count % Missing
Alley 2732 93.2
Fence 2358 80.5
Pool.QC 2917 99.6
Fireplace.Qu 1422 48.5
Bsmt.Qual 79 2.7
Garage.Type 157 5.4

What’s unclear: The documentation doesn’t clearly explain whether NA values mean: 1. Truly missing data (we don’t know if the house has an alley/fence/pool) 2. Feature doesn’t exist (the house doesn’t have an alley/fence/pool)

This is a HUGE distinction! For example: - 93.2% of homes have NA for Alley - 99.6% of homes have NA for Pool.QC (pool quality)

Why this matters: If NA means “feature doesn’t exist,” then I should probably recode these as “None” or create a binary has_alley variable. If NA means “data missing,” then I need to decide whether to impute values or drop these rows.


Part 3: Visualizing the Unclear Missing Data Issue

Visualization 1: The Alley Access Mystery

# Create a version with explicit "None" category
ames_alley <- ames |>
  mutate(Alley_Explicit = if_else(is.na(Alley), "None", Alley))

# Compare prices
alley_price_comparison <- ames_alley |>
  group_by(Alley_Explicit) |>
  summarise(
    Count = n(),
    Median_Price = median(SalePrice),
    Mean_Price = mean(SalePrice)
  )

kable(alley_price_comparison,
      col.names = c("Alley Access", "Count", "Median Price", "Mean Price"),
      caption = "Price by Alley Access (treating NA as 'None')",
      format.args = list(big.mark = ","),
      digits = 0)
Price by Alley Access (treating NA as ‘None’)
Alley Access Count Median Price Mean Price
Grvl 120 121,000 123,557
None 2,732 163,000 183,420
Pave 78 172,700 176,945
# Visualization
ggplot(ames_alley, aes(x = Alley_Explicit, y = SalePrice, fill = Alley_Explicit)) +
  geom_boxplot() +
  scale_y_continuous(labels = dollar_format(), limits = c(0, 600000)) +
  scale_fill_manual(values = c("None" = "gray70", "Grvl" = "#e74c3c", "Pave" = "#3498db"),
                    name = "Alley Type") +
  labs(title = "Sale Price by Alley Access",
       subtitle = "NA values interpreted as 'No Alley' (93.2% of homes)",
       x = "Alley Access Type",
       y = "Sale Price",
       caption = "Key Question: Does NA mean 'no alley' or 'data missing'?") +
  theme_minimal() +
  theme(legend.position = "none")

What’s unclear and concerning:

If NA means “no alley,” then the visualization shows that homes with paved alley access sell for more than homes with gravel alleys, which sell for more than homes with no alley access. That makes sense—alley access is a valuable feature.

The risk: I could make completely incorrect conclusions about the value of alley access. If I was a developer trying to decide whether to include alleys in a new development, this ambiguity could lead to a bad business decision.

What I could do: 1. Contact the data source to clarify the encoding 2. Cross-reference with other variables (do NA alley homes also have NA for alley-related size measurements?) 3. Make my interpretation explicit in any analysis: “Assuming NA means ‘no alley’…”


Visualization 2: Fireplace Presence vs. Quality Ratings

# Compare fireplace count vs. quality rating NA patterns
ames_fireplace <- ames |>
  mutate(
    Has_Fireplace = Fireplaces > 0,
    Fireplace_Quality_Status = case_when(
      is.na(Fireplace.Qu) & Fireplaces == 0 ~ "No Fireplace (consistent)",
      is.na(Fireplace.Qu) & Fireplaces > 0 ~ "Has Fireplace but NA quality (PROBLEM!)",
      !is.na(Fireplace.Qu) & Fireplaces == 0 ~ "No Fireplace but has quality rating (PROBLEM!)",
      !is.na(Fireplace.Qu) & Fireplaces > 0 ~ "Has Fireplace with quality (consistent)",
      TRUE ~ "Other"
    )
  )

# Count the patterns
fireplace_patterns <- ames_fireplace |>
  count(Fireplace_Quality_Status, sort = TRUE)

kable(fireplace_patterns,
      col.names = c("Pattern", "Count"),
      caption = "Fireplace Count vs. Quality Rating Consistency Check")
Fireplace Count vs. Quality Rating Consistency Check
Pattern Count
Has Fireplace with quality (consistent) 1508
No Fireplace (consistent) 1422
# Visualization highlighting the inconsistencies
ggplot(ames_fireplace, aes(x = Fireplaces, fill = is.na(Fireplace.Qu))) +
  geom_bar(position = "fill") +
  scale_fill_manual(values = c("FALSE" = "#2ecc71", "TRUE" = "#e74c3c"),
                    labels = c("Has Quality Rating", "NA Quality Rating"),
                    name = "Fireplace Quality") +
  scale_y_continuous(labels = percent_format()) +
  labs(title = "Fireplace Quality Rating Missing Patterns",
       subtitle = "Red = NA quality rating. Notice homes with fireplaces sometimes have NA quality.",
       x = "Number of Fireplaces",
       y = "Percentage of Homes",
       caption = "Problem: Some homes with fireplaces (count > 0) have NA quality ratings") +
  theme_minimal()

What’s unclear and concerning:

Looking at the data, I found that some homes have Fireplaces > 0 but NA for Fireplace.Qu (quality). This is inconsistent! If the house has a fireplace, why don’t we have a quality rating?

Possible explanations: 1. The quality wasn’t assessed for some fireplaces 2. Data entry error 3. The fireplace exists but is non-functional (so no quality rating?)

The risk: I might exclude homes from analysis that actually have fireplaces, or include homes in a “fireplace quality” analysis that don’t actually have fireplaces. This could bias my conclusions about how much value a fireplace adds.

What I could do: 1. Create a rule: “A home has a functional fireplace only if Fireplaces > 0 AND Fireplace.Qu is not NA” 2. Flag these inconsistent cases for manual review 3. Report the inconsistency rate in my methodology


Part 4: Checking for Missing Data Patterns

Categorical Column 1: Garage Type

# Check for missing patterns in Garage.Type
cat("=== GARAGE TYPE MISSING DATA ANALYSIS ===\n\n")
## === GARAGE TYPE MISSING DATA ANALYSIS ===
# Explicitly missing (NA values)
explicit_missing_garage <- sum(is.na(ames$Garage.Type))
cat("Explicitly missing (NA):", explicit_missing_garage, 
    sprintf("(%.1f%%)\n", explicit_missing_garage/nrow(ames)*100))
## Explicitly missing (NA): 157 (5.4%)
# Check if there are homes with garage area but no garage type
implicit_missing_garage <- ames |>
  filter(is.na(Garage.Type) & (Garage.Area > 0 | Garage.Cars > 0))

cat("Implicitly missing (has garage features but no type):", nrow(implicit_missing_garage), "\n")
## Implicitly missing (has garage features but no type): 0
# Check for empty groups (garage types with 0 homes)
garage_types_in_data <- unique(ames$Garage.Type[!is.na(ames$Garage.Type)])
cat("\nGarage types present:", length(garage_types_in_data), "\n")
## 
## Garage types present: 6
cat("Types:", paste(garage_types_in_data, collapse = ", "), "\n")
## Types: Attchd, BuiltIn, Basment, Detchd, CarPort, 2Types
# Distribution
garage_dist <- ames |>
  mutate(Garage.Type = if_else(is.na(Garage.Type), "None/Missing", Garage.Type)) |>
  count(Garage.Type, sort = TRUE)

kable(garage_dist,
      col.names = c("Garage Type", "Count"),
      caption = "Garage Type Distribution (NA treated as 'None/Missing')")
Garage Type Distribution (NA treated as ‘None/Missing’)
Garage Type Count
Attchd 1731
Detchd 782
BuiltIn 186
None/Missing 157
Basment 36
2Types 23
CarPort 15

Findings:

  • Explicitly missing: About 5.4% of homes have NA for Garage.Type
  • Implicitly missing: Need to check if any homes with garage area/capacity have NA type
  • Empty groups: According to documentation, possible garage types include Attchd, Detchd, BuiltIn, Basment, CarPort, 2Types. I need to check if all these appear or if some are “empty groups” with zero homes

Insight: If there are homes with Garage.Area > 0 but Garage.Type = NA, that’s an implicit missing data problem. The garage exists but we don’t know what kind it is. That’s different from homes that truly have no garage.


Categorical Column 2: Fence Quality

cat("=== FENCE QUALITY MISSING DATA ANALYSIS ===\n\n")
## === FENCE QUALITY MISSING DATA ANALYSIS ===
# Explicitly missing
explicit_missing_fence <- sum(is.na(ames$Fence))
cat("Explicitly missing (NA):", explicit_missing_fence,
    sprintf("(%.1f%%)\n", explicit_missing_fence/nrow(ames)*100))
## Explicitly missing (NA): 2358 (80.5%)
# For fences, implicit missing is hard to detect since there's no "fence area" variable
# But we can check if certain neighborhoods/zones have suspiciously uniform missing patterns
cat("\nImplicitly missing: Hard to detect without additional variables\n")
## 
## Implicitly missing: Hard to detect without additional variables
# Check for empty groups
fence_types_in_data <- unique(ames$Fence[!is.na(ames$Fence)])
cat("\nFence types present:", length(fence_types_in_data), "\n")
## 
## Fence types present: 4
cat("Types:", paste(fence_types_in_data, collapse = ", "), "\n")
## Types: MnPrv, GdPrv, GdWo, MnWw
# Distribution
fence_dist <- ames |>
  mutate(Fence = if_else(is.na(Fence), "None/Missing", Fence)) |>
  count(Fence, sort = TRUE)

kable(fence_dist,
      col.names = c("Fence Type", "Count"),
      caption = "Fence Distribution (NA treated as 'None/Missing')")
Fence Distribution (NA treated as ‘None/Missing’)
Fence Type Count
None/Missing 2358
MnPrv 330
GdPrv 118
GdWo 112
MnWw 12
# Check if certain neighborhoods have suspiciously uniform fence patterns
neighborhood_fence <- ames |>
  group_by(Neighborhood) |>
  summarise(
    Total = n(),
    Has_Fence = sum(!is.na(Fence)),
    Pct_Has_Fence = Has_Fence / Total * 100
  ) |>
  arrange(desc(Pct_Has_Fence))

cat("\n\nTop 5 neighborhoods by fence presence:\n")
## 
## 
## Top 5 neighborhoods by fence presence:
print(head(neighborhood_fence, 5))
## # A tibble: 5 × 4
##   Neighborhood Total Has_Fence Pct_Has_Fence
##   <chr>        <int>     <int>         <dbl>
## 1 Landmrk          1         1         100  
## 2 Blueste         10         7          70  
## 3 Sawyer         151        59          39.1
## 4 NAmes          443       144          32.5
## 5 Mitchel        114        36          31.6
cat("\n\nBottom 5 neighborhoods by fence presence:\n")
## 
## 
## Bottom 5 neighborhoods by fence presence:
print(tail(neighborhood_fence, 5))
## # A tibble: 5 × 4
##   Neighborhood Total Has_Fence Pct_Has_Fence
##   <chr>        <int>     <int>         <dbl>
## 1 Blmngtn         28         0             0
## 2 Greens           8         0             0
## 3 GrnHill          2         0             0
## 4 NPkVill         23         0             0
## 5 StoneBr         51         0             0

Findings:

  • Explicitly missing: About 80.5% of homes have NA for Fence—very high!
  • Implicitly missing: I can’t easily detect this without other fence-related variables
  • Empty groups: The documentation mentions fence types like GdPrv (good privacy), MnPrv (minimum privacy), GdWo (good wood), MnWw (minimum wood/wire). Need to verify all categories appear

Insight: The extremely high NA rate (80.5%) strongly suggests that NA means “no fence” rather than “data missing.” It would be very unusual for data collectors to miss fence information on 80% of homes. However, I notice some neighborhoods have much higher fence rates than others. This could be: 1. Real differences (some neighborhoods prefer fences) 2. Data collection bias (certain areas were surveyed more carefully) 3. Zoning/HOA requirements (some areas require or prohibit fences)


Part 5: Defining Outliers in Continuous Data

Sale Price Outliers

# Define outliers using multiple methods

# Method 1: Statistical (3 standard deviations)
price_mean <- mean(ames$SalePrice)
price_sd <- sd(ames$SalePrice)
price_upper <- price_mean + 3 * price_sd
price_lower <- price_mean - 3 * price_sd

stat_outliers <- ames |>
  filter(SalePrice > price_upper | SalePrice < price_lower)

cat("=== METHOD 1: Statistical Outliers (3 SD from mean) ===\n")
## === METHOD 1: Statistical Outliers (3 SD from mean) ===
cat("Lower bound:", dollar(price_lower), "\n")
## Lower bound: -$58,864.02
cat("Upper bound:", dollar(price_upper), "\n")
## Upper bound: $420,456
cat("Outliers found:", nrow(stat_outliers), sprintf("(%.1f%%)\n", nrow(stat_outliers)/nrow(ames)*100))
## Outliers found: 45 (1.5%)
# Method 2: IQR method
Q1 <- quantile(ames$SalePrice, 0.25)
Q3 <- quantile(ames$SalePrice, 0.75)
IQR_val <- Q3 - Q1
iqr_lower <- Q1 - 1.5 * IQR_val
iqr_upper <- Q3 + 1.5 * IQR_val

iqr_outliers <- ames |>
  filter(SalePrice > iqr_upper | SalePrice < iqr_lower)

cat("\n=== METHOD 2: IQR Outliers (1.5 * IQR) ===\n")
## 
## === METHOD 2: IQR Outliers (1.5 * IQR) ===
cat("Lower bound:", dollar(iqr_lower), "\n")
## Lower bound: $3,500
cat("Upper bound:", dollar(iqr_upper), "\n")
## Upper bound: $339,500
cat("Outliers found:", nrow(iqr_outliers), sprintf("(%.1f%%)\n", nrow(iqr_outliers)/nrow(ames)*100))
## Outliers found: 137 (4.7%)
# Method 3: Percentile-based (top/bottom 1%)
perc_lower <- quantile(ames$SalePrice, 0.01)
perc_upper <- quantile(ames$SalePrice, 0.99)

perc_outliers <- ames |>
  filter(SalePrice > perc_upper | SalePrice < perc_lower)

cat("\n=== METHOD 3: Percentile Outliers (top/bottom 1%) ===\n")
## 
## === METHOD 3: Percentile Outliers (top/bottom 1%) ===
cat("Lower bound:", dollar(perc_lower), "\n")
## Lower bound: $61,756.07
cat("Upper bound:", dollar(perc_upper), "\n")
## Upper bound: $456,666
cat("Outliers found:", nrow(perc_outliers), sprintf("(%.1f%%)\n", nrow(perc_outliers)/nrow(ames)*100))
## Outliers found: 60 (2.0%)
# Visualize
ggplot(ames, aes(x = SalePrice)) +
  geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
  geom_vline(xintercept = price_upper, color = "red", linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = iqr_upper, color = "orange", linetype = "dashed", linewidth = 1) +
  geom_vline(xintercept = perc_upper, color = "purple", linetype = "dashed", linewidth = 1) +
  scale_x_continuous(labels = dollar_format()) +
  labs(title = "Sale Price Distribution with Different Outlier Definitions",
       subtitle = "Red = 3SD, Orange = IQR, Purple = 99th percentile",
       x = "Sale Price",
       y = "Count") +
  theme_minimal()

My outlier definition: I’m going with the IQR method (1.5 * IQR) because:

  1. Not too sensitive to the distribution shape: The 3 SD method assumes normal distribution, but sale prices are right-skewed
  2. Catches meaningful extremes: The percentile method is arbitrary—why 1%? IQR is based on the data’s actual spread
  3. Standard practice: The IQR method is commonly used in statistics and is what boxplots use

Why this matters: With the IQR method, I identify 137 homes as outliers. These are homes that are unusually cheap or expensive compared to the market. But I need to investigate why they’re outliers:

# Investigate the high-price outliers
high_price_outliers <- ames |>
  filter(SalePrice > iqr_upper) |>
  select(Order, SalePrice, Gr.Liv.Area, Overall.Qual, Neighborhood, Year.Built) |>
  arrange(desc(SalePrice)) |>
  head(10)

kable(high_price_outliers,
      col.names = c("Order", "Sale Price", "Living Area", "Quality", "Neighborhood", "Year Built"),
      caption = "Top 10 High-Price Outliers",
      format.args = list(big.mark = ","))
Top 10 High-Price Outliers
Order Sale Price Living Area Quality Neighborhood Year Built
1,768 755,000 4,316 10 NoRidge 1,994
1,761 745,000 4,476 10 NoRidge 1,996
2,446 625,000 3,627 10 NoRidge 1,995
1,064 615,000 2,470 10 NridgHt 2,003
45 611,657 2,364 9 NridgHt 2,009
433 610,000 2,674 10 NridgHt 2,007
1,638 591,587 2,338 9 StoneBr 2,006
2,451 584,500 3,500 9 NoRidge 1,993
434 582,933 2,822 9 NridgHt 2,008
2,333 556,581 2,868 9 StoneBr 2,005
# Are these legitimate luxury homes or data errors?
cat("\n=== OUTLIER INVESTIGATION ===\n")
## 
## === OUTLIER INVESTIGATION ===
cat("These outliers have:\n")
## These outliers have:
cat("- Average living area:", comma(mean(high_price_outliers$Gr.Liv.Area)), "sq ft\n")
## - Average living area: 3,146 sq ft
cat("- Average quality:", round(mean(high_price_outliers$Overall.Qual), 1), "\n")
## - Average quality: 9.5
cat("- Most common neighborhoods:", 
    paste(names(sort(table(high_price_outliers$Neighborhood), decreasing = TRUE)[1:3]), 
          collapse = ", "), "\n")
## - Most common neighborhoods: NoRidge, NridgHt, StoneBr

Insight: Looking at the high-price outliers, they’re mostly: - Large homes (2,000+ sq ft) - High quality (8-10 rating) - In premium neighborhoods (NridgHt, NoRidge, StoneBr)

These look like legitimate luxury homes, not data errors. They’re expensive for valid reasons. I probably wouldn’t remove these from my analysis because they represent a real segment of the market.


Conclusion: Why Documentation and Data Quality Matter

This week’s analysis showed me that:

1. Good documentation is critical

Without understanding what MS.SubClass or MS.Zoning codes mean, I’d create nonsensical analyses. The documentation prevented me from making major errors.

2. Even with documentation, ambiguity exists

The biggest issue I found is the NA/missing value ambiguity. Does NA mean “feature doesn’t exist” or “data missing”? This isn’t always clear, and it dramatically affects how I should handle the data.

3. Missing data patterns tell a story

The fact that 93% of homes have NA for Alley and 80% for Fence strongly suggests these are “None” rather than “missing.” But without explicit confirmation, I’m making an assumption that could be wrong.

4. Inconsistencies require investigation

Finding homes with fireplaces but NA quality ratings, or homes with garage area but no garage type, reveals data quality issues that need to be resolved before analysis.

5. Outlier definition matters

Depending on whether I use 3SD, IQR, or percentiles, I get very different numbers of outliers. I need to justify my choice and investigate whether outliers are legitimate or errors.

What I’ll do differently going forward:

  • Always read documentation carefully before starting analysis
  • Explicitly state my assumptions about NA values
  • Check for consistency between related variables (like fireplace count vs. quality)
  • Investigate outliers individually rather than automatically removing them