This week I’m focusing on understanding the importance of good data documentation and spotting data quality issues. It’s easy to assume that a dataset is clean and well-explained, but there are often hidden gotchas that can lead to wrong conclusions if you’re not careful. I’ll look at unclear columns, missing data patterns, and potential outliers in the Ames housing data.
ames <- read.csv("ames.csv", stringsAsFactors = FALSE)
cat("Dataset:", nrow(ames), "homes with", ncol(ames), "variables\n")
## Dataset: 2930 homes with 82 variables
# Look at MS.SubClass values
subclass_dist <- ames |>
count(MS.SubClass, sort = TRUE) |>
head(10)
kable(subclass_dist,
col.names = c("MS.SubClass", "Count"),
caption = "Top 10 MS.SubClass Values")
| MS.SubClass | Count |
|---|---|
| 20 | 1079 |
| 60 | 575 |
| 50 | 287 |
| 120 | 192 |
| 30 | 139 |
| 160 | 129 |
| 70 | 128 |
| 80 | 118 |
| 90 | 109 |
| 190 | 61 |
cat("\nUnique values:", n_distinct(ames$MS.SubClass), "\n")
##
## Unique values: 16
Why is this unclear? When I first looked at this column, I saw numbers like 20, 60, 50, 120. What do these mean? Are they IDs? Building codes? Just looking at the data, I have no idea.
After reading documentation: MS.SubClass actually
refers to the dwelling type: - 20 = 1-story houses built since 1946 - 60
= 2-story houses built since 1946
- 50 = 1.5 story houses (all ages) - 120 = 1-story PUD (Planned Unit
Development) built since 1946
Why they encoded it this way: They probably used numeric codes instead of text descriptions to save space and follow industry standards. Building assessors might use these same codes.
What could go wrong: If I didn’t read the documentation, I might think MS.SubClass is some kind of quality rating or price category because higher numbers don’t always mean newer or better. I could create completely wrong analyses like “SubClass 190 homes are more expensive than SubClass 20” without understanding that these are actually different building types, not levels of quality.
# Look at zoning codes
zoning_dist <- ames |>
count(MS.Zoning, sort = TRUE)
kable(zoning_dist,
col.names = c("MS.Zoning", "Count"),
caption = "Zoning Classification Distribution")
| MS.Zoning | Count |
|---|---|
| RL | 2273 |
| RM | 462 |
| FV | 139 |
| RH | 27 |
| C (all) | 25 |
| A (agr) | 2 |
| I (all) | 2 |
Why is this unclear? The values are things like “RL”, “RM”, “FV”, “C (all)”. These look like abbreviations but I wouldn’t know what they mean without documentation.
After reading documentation: These are zoning classifications: - RL = Residential Low Density - RM = Residential Medium Density - FV = Floating Village Residential - RH = Residential High Density - C (all) = Commercial (all) - I (all) = Industrial (all) - A (agr) = Agriculture
Why they encoded it this way: Official city zoning codes use these abbreviations. They’re standardized across many municipalities and used in legal documents.
What could go wrong: Without documentation, I might think “FV” is some kind of quality designation (“Fine View”?), or that “C (all)” and “I (all)” mean something completely different. I could incorrectly group residential and commercial properties together if I didn’t understand these codes. More importantly, I might not realize that RL homes are fundamentally different from RM homes in terms of lot size, density, and typical prices.
Looking at the documentation, there’s a big ambiguity about how missing data is represented. Let me investigate:
# Check several columns that might have "missing" data
missing_summary <- tibble(
Column = c("Alley", "Fence", "Pool.QC", "Fireplace.Qu", "Bsmt.Qual", "Garage.Type"),
NA_Count = c(
sum(is.na(ames$Alley)),
sum(is.na(ames$Fence)),
sum(is.na(ames$Pool.QC)),
sum(is.na(ames$Fireplace.Qu)),
sum(is.na(ames$Bsmt.Qual)),
sum(is.na(ames$Garage.Type))
),
Percent_Missing = c(
sum(is.na(ames$Alley)) / nrow(ames) * 100,
sum(is.na(ames$Fence)) / nrow(ames) * 100,
sum(is.na(ames$Pool.QC)) / nrow(ames) * 100,
sum(is.na(ames$Fireplace.Qu)) / nrow(ames) * 100,
sum(is.na(ames$Bsmt.Qual)) / nrow(ames) * 100,
sum(is.na(ames$Garage.Type)) / nrow(ames) * 100
)
)
kable(missing_summary,
col.names = c("Column", "NA Count", "% Missing"),
caption = "Columns with High Missing Data",
digits = 1)
| Column | NA Count | % Missing |
|---|---|---|
| Alley | 2732 | 93.2 |
| Fence | 2358 | 80.5 |
| Pool.QC | 2917 | 99.6 |
| Fireplace.Qu | 1422 | 48.5 |
| Bsmt.Qual | 79 | 2.7 |
| Garage.Type | 157 | 5.4 |
What’s unclear: The documentation doesn’t clearly explain whether NA values mean: 1. Truly missing data (we don’t know if the house has an alley/fence/pool) 2. Feature doesn’t exist (the house doesn’t have an alley/fence/pool)
This is a HUGE distinction! For example: - 93.2% of homes have NA for Alley - 99.6% of homes have NA for Pool.QC (pool quality)
It seems unlikely that we’re missing data for 93% of homes. More likely, 93% of homes simply don’t have alley access. But the documentation doesn’t explicitly state this encoding scheme.
Why this matters: If NA means “feature doesn’t exist,” then I should probably recode these as “None” or create a binary has_alley variable. If NA means “data missing,” then I need to decide whether to impute values or drop these rows.
# Create a version with explicit "None" category
ames_alley <- ames |>
mutate(Alley_Explicit = if_else(is.na(Alley), "None", Alley))
# Compare prices
alley_price_comparison <- ames_alley |>
group_by(Alley_Explicit) |>
summarise(
Count = n(),
Median_Price = median(SalePrice),
Mean_Price = mean(SalePrice)
)
kable(alley_price_comparison,
col.names = c("Alley Access", "Count", "Median Price", "Mean Price"),
caption = "Price by Alley Access (treating NA as 'None')",
format.args = list(big.mark = ","),
digits = 0)
| Alley Access | Count | Median Price | Mean Price |
|---|---|---|---|
| Grvl | 120 | 121,000 | 123,557 |
| None | 2,732 | 163,000 | 183,420 |
| Pave | 78 | 172,700 | 176,945 |
# Visualization
ggplot(ames_alley, aes(x = Alley_Explicit, y = SalePrice, fill = Alley_Explicit)) +
geom_boxplot() +
scale_y_continuous(labels = dollar_format(), limits = c(0, 600000)) +
scale_fill_manual(values = c("None" = "gray70", "Grvl" = "#e74c3c", "Pave" = "#3498db"),
name = "Alley Type") +
labs(title = "Sale Price by Alley Access",
subtitle = "NA values interpreted as 'No Alley' (93.2% of homes)",
x = "Alley Access Type",
y = "Sale Price",
caption = "Key Question: Does NA mean 'no alley' or 'data missing'?") +
theme_minimal() +
theme(legend.position = "none")
What’s unclear and concerning:
If NA means “no alley,” then the visualization shows something surprising: homes with NO alley access actually have HIGHER median prices ($163,000) than homes with gravel alleys ($121,000) or even paved alleys ($172,700). This is counterintuitive—you’d expect alley access to add value.
Possible explanations: 1. Homes without alleys might be in more desirable neighborhoods where alleys aren’t common 2. Alley access might be associated with older, smaller homes in less premium areas 3. Buyers might actually prefer NOT having an alley (more privacy, less traffic)
But if NA actually means “we don’t know,” then I’m comparing homes with known alley types (Grvl/Pave) against homes with unknown status (NA). In that case, I can’t conclude anything about alley value at all—I’m just seeing that homes with missing data happen to be more expensive, which could be due to where data was collected.
The risk: I could make completely incorrect conclusions about the value of alley access. The data suggests alley access might actually DECREASE value (or at least not add value), but this could just be because homes with alleys are in different types of neighborhoods. If I was a developer trying to decide whether to include alleys in a new development, this ambiguity could lead to a bad business decision. I might incorrectly think alleys hurt property value when really it’s just a neighborhood/location effect.
What I could do: 1. Contact the data source to clarify the encoding 2. Cross-reference with other variables (do NA alley homes also have NA for alley-related size measurements?) 3. Make my interpretation explicit in any analysis: “Assuming NA means ‘no alley’…”
# Compare fireplace count vs. quality rating NA patterns
ames_fireplace <- ames |>
mutate(
Has_Fireplace = Fireplaces > 0,
Fireplace_Quality_Status = case_when(
is.na(Fireplace.Qu) & Fireplaces == 0 ~ "No Fireplace, No Quality (consistent)",
is.na(Fireplace.Qu) & Fireplaces > 0 ~ "Has Fireplace but NA quality (PROBLEM!)",
!is.na(Fireplace.Qu) & Fireplaces == 0 ~ "No Fireplace but has quality (PROBLEM!)",
!is.na(Fireplace.Qu) & Fireplaces > 0 ~ "Has Fireplace with quality (consistent)",
TRUE ~ "Other"
)
)
# Count the patterns
fireplace_patterns <- ames_fireplace |>
count(Fireplace_Quality_Status, sort = TRUE)
kable(fireplace_patterns,
col.names = c("Pattern", "Count"),
caption = "Fireplace Count vs. Quality Rating Consistency Check")
| Pattern | Count |
|---|---|
| Has Fireplace with quality (consistent) | 1508 |
| No Fireplace, No Quality (consistent) | 1422 |
# Check for the actual inconsistency
inconsistent_cases <- ames_fireplace |>
filter(Fireplace_Quality_Status %in% c("Has Fireplace but NA quality (PROBLEM!)",
"No Fireplace but has quality (PROBLEM!)"))
cat("\n=== CONSISTENCY CHECK ===\n")
##
## === CONSISTENCY CHECK ===
cat("Homes with fireplaces but missing quality:",
sum(ames$Fireplaces > 0 & is.na(ames$Fireplace.Qu)), "\n")
## Homes with fireplaces but missing quality: 0
cat("Homes without fireplaces but has quality:",
sum(ames$Fireplaces == 0 & !is.na(ames$Fireplace.Qu)), "\n")
## Homes without fireplaces but has quality: 0
# Visualization showing the relationship
ggplot(ames_fireplace, aes(x = factor(Fireplaces), fill = is.na(Fireplace.Qu))) +
geom_bar(position = "stack") +
scale_fill_manual(values = c("FALSE" = "#2ecc71", "TRUE" = "#e74c3c"),
labels = c("Has Quality Rating", "NA Quality Rating"),
name = "Fireplace Quality") +
labs(title = "Fireplace Count vs. Quality Rating Presence",
subtitle = "Green = has quality rating, Red = NA quality rating",
x = "Number of Fireplaces",
y = "Number of Homes",
caption = "Finding: Homes with 0 fireplaces ALL have NA quality (consistent).\nHomes with 1+ fireplaces ALL have quality ratings (also consistent).") +
theme_minimal()
What I actually found:
Looking at the data more carefully, I discovered that the fireplace data is actually CONSISTENT, not inconsistent as I initially thought:
This is exactly what we’d expect! If there’s no fireplace, there’s nothing to rate, so NA makes sense. If there is a fireplace, it gets a quality rating.
Why this matters for the NA ambiguity:
This is a PERFECT example where NA clearly means “feature doesn’t exist” rather than “data missing.” The perfect consistency (100% of homes with fireplaces have quality ratings, 100% without fireplaces have NA) shows that the data collectors used NA intentionally to indicate “not applicable.”
This gives me more confidence that the same encoding is used for other features like Alley, Fence, and Pool.QC. When 93% of homes have NA for Alley, it’s almost certainly because 93% of homes don’t have alley access, not because we’re missing data on 93% of homes.
The risk I initially imagined doesn’t exist for fireplaces:
I thought there would be inconsistencies where some homes with fireplaces lacked quality ratings. That would be a data quality problem. But the data is actually clean and consistent here. The “risk” is only in interpretation—I need to understand that NA means “not applicable” not “missing.”
# Check for missing patterns in Garage.Type
cat("=== GARAGE TYPE MISSING DATA ANALYSIS ===\n\n")
## === GARAGE TYPE MISSING DATA ANALYSIS ===
# Explicitly missing (NA values)
explicit_missing_garage <- sum(is.na(ames$Garage.Type))
cat("Explicitly missing (NA):", explicit_missing_garage,
sprintf("(%.1f%%)\n", explicit_missing_garage/nrow(ames)*100))
## Explicitly missing (NA): 157 (5.4%)
# Check if there are homes with garage area but no garage type
implicit_missing_garage <- ames |>
filter(is.na(Garage.Type) & (Garage.Area > 0 | Garage.Cars > 0))
cat("Implicitly missing (has garage features but no type):", nrow(implicit_missing_garage), "\n")
## Implicitly missing (has garage features but no type): 0
# Check for empty groups (garage types with 0 homes)
garage_types_in_data <- unique(ames$Garage.Type[!is.na(ames$Garage.Type)])
cat("\nGarage types present:", length(garage_types_in_data), "\n")
##
## Garage types present: 6
cat("Types:", paste(garage_types_in_data, collapse = ", "), "\n")
## Types: Attchd, BuiltIn, Basment, Detchd, CarPort, 2Types
# Distribution
garage_dist <- ames |>
mutate(Garage.Type = if_else(is.na(Garage.Type), "None/Missing", Garage.Type)) |>
count(Garage.Type, sort = TRUE)
kable(garage_dist,
col.names = c("Garage Type", "Count"),
caption = "Garage Type Distribution (NA treated as 'None/Missing')")
| Garage Type | Count |
|---|---|
| Attchd | 1731 |
| Detchd | 782 |
| BuiltIn | 186 |
| None/Missing | 157 |
| Basment | 36 |
| 2Types | 23 |
| CarPort | 15 |
Findings:
Insight: If there are homes with Garage.Area > 0 but Garage.Type = NA, that’s an implicit missing data problem. The garage exists but we don’t know what kind it is. That’s different from homes that truly have no garage.
cat("=== FENCE QUALITY MISSING DATA ANALYSIS ===\n\n")
## === FENCE QUALITY MISSING DATA ANALYSIS ===
# Explicitly missing
explicit_missing_fence <- sum(is.na(ames$Fence))
cat("Explicitly missing (NA):", explicit_missing_fence,
sprintf("(%.1f%%)\n", explicit_missing_fence/nrow(ames)*100))
## Explicitly missing (NA): 2358 (80.5%)
# For fences, implicit missing is hard to detect since there's no "fence area" variable
# But we can check if certain neighborhoods/zones have suspiciously uniform missing patterns
cat("\nImplicitly missing: Hard to detect without additional variables\n")
##
## Implicitly missing: Hard to detect without additional variables
# Check for empty groups
fence_types_in_data <- unique(ames$Fence[!is.na(ames$Fence)])
cat("\nFence types present:", length(fence_types_in_data), "\n")
##
## Fence types present: 4
cat("Types:", paste(fence_types_in_data, collapse = ", "), "\n")
## Types: MnPrv, GdPrv, GdWo, MnWw
# Distribution
fence_dist <- ames |>
mutate(Fence = if_else(is.na(Fence), "None/Missing", Fence)) |>
count(Fence, sort = TRUE)
kable(fence_dist,
col.names = c("Fence Type", "Count"),
caption = "Fence Distribution (NA treated as 'None/Missing')")
| Fence Type | Count |
|---|---|
| None/Missing | 2358 |
| MnPrv | 330 |
| GdPrv | 118 |
| GdWo | 112 |
| MnWw | 12 |
# Check if certain neighborhoods have suspiciously uniform fence patterns
neighborhood_fence <- ames |>
group_by(Neighborhood) |>
summarise(
Total = n(),
Has_Fence = sum(!is.na(Fence)),
Pct_Has_Fence = Has_Fence / Total * 100
) |>
arrange(desc(Pct_Has_Fence))
cat("\n\nTop 5 neighborhoods by fence presence:\n")
##
##
## Top 5 neighborhoods by fence presence:
print(head(neighborhood_fence, 5))
## # A tibble: 5 × 4
## Neighborhood Total Has_Fence Pct_Has_Fence
## <chr> <int> <int> <dbl>
## 1 Landmrk 1 1 100
## 2 Blueste 10 7 70
## 3 Sawyer 151 59 39.1
## 4 NAmes 443 144 32.5
## 5 Mitchel 114 36 31.6
cat("\n\nBottom 5 neighborhoods by fence presence:\n")
##
##
## Bottom 5 neighborhoods by fence presence:
print(tail(neighborhood_fence, 5))
## # A tibble: 5 × 4
## Neighborhood Total Has_Fence Pct_Has_Fence
## <chr> <int> <int> <dbl>
## 1 Blmngtn 28 0 0
## 2 Greens 8 0 0
## 3 GrnHill 2 0 0
## 4 NPkVill 23 0 0
## 5 StoneBr 51 0 0
Findings:
Insight: The extremely high NA rate (80.5%) strongly suggests that NA means “no fence” rather than “data missing.” It would be very unusual for data collectors to miss fence information on 80% of homes. However, I notice some neighborhoods have much higher fence rates than others. This could be: 1. Real differences (some neighborhoods prefer fences) 2. Data collection bias (certain areas were surveyed more carefully) 3. Zoning/HOA requirements (some areas require or prohibit fences)
# Define outliers using multiple methods
# Method 1: Statistical (3 standard deviations)
price_mean <- mean(ames$SalePrice)
price_sd <- sd(ames$SalePrice)
price_upper <- price_mean + 3 * price_sd
price_lower <- price_mean - 3 * price_sd
stat_outliers <- ames |>
filter(SalePrice > price_upper | SalePrice < price_lower)
cat("=== METHOD 1: Statistical Outliers (3 SD from mean) ===\n")
## === METHOD 1: Statistical Outliers (3 SD from mean) ===
cat("Lower bound:", dollar(price_lower), "\n")
## Lower bound: -$58,864.02
cat("Upper bound:", dollar(price_upper), "\n")
## Upper bound: $420,456
cat("Outliers found:", nrow(stat_outliers), sprintf("(%.1f%%)\n", nrow(stat_outliers)/nrow(ames)*100))
## Outliers found: 45 (1.5%)
# Method 2: IQR method
Q1 <- quantile(ames$SalePrice, 0.25)
Q3 <- quantile(ames$SalePrice, 0.75)
IQR_val <- Q3 - Q1
iqr_lower <- Q1 - 1.5 * IQR_val
iqr_upper <- Q3 + 1.5 * IQR_val
iqr_outliers <- ames |>
filter(SalePrice > iqr_upper | SalePrice < iqr_lower)
cat("\n=== METHOD 2: IQR Outliers (1.5 * IQR) ===\n")
##
## === METHOD 2: IQR Outliers (1.5 * IQR) ===
cat("Lower bound:", dollar(iqr_lower), "\n")
## Lower bound: $3,500
cat("Upper bound:", dollar(iqr_upper), "\n")
## Upper bound: $339,500
cat("Outliers found:", nrow(iqr_outliers), sprintf("(%.1f%%)\n", nrow(iqr_outliers)/nrow(ames)*100))
## Outliers found: 137 (4.7%)
# Method 3: Percentile-based (top/bottom 1%)
perc_lower <- quantile(ames$SalePrice, 0.01)
perc_upper <- quantile(ames$SalePrice, 0.99)
perc_outliers <- ames |>
filter(SalePrice > perc_upper | SalePrice < perc_lower)
cat("\n=== METHOD 3: Percentile Outliers (top/bottom 1%) ===\n")
##
## === METHOD 3: Percentile Outliers (top/bottom 1%) ===
cat("Lower bound:", dollar(perc_lower), "\n")
## Lower bound: $61,756.07
cat("Upper bound:", dollar(perc_upper), "\n")
## Upper bound: $456,666
cat("Outliers found:", nrow(perc_outliers), sprintf("(%.1f%%)\n", nrow(perc_outliers)/nrow(ames)*100))
## Outliers found: 60 (2.0%)
# Visualize
ggplot(ames, aes(x = SalePrice)) +
geom_histogram(bins = 50, fill = "steelblue", alpha = 0.7) +
geom_vline(xintercept = price_upper, color = "red", linetype = "dashed", linewidth = 1) +
geom_vline(xintercept = iqr_upper, color = "orange", linetype = "dashed", linewidth = 1) +
geom_vline(xintercept = perc_upper, color = "purple", linetype = "dashed", linewidth = 1) +
scale_x_continuous(labels = dollar_format()) +
labs(title = "Sale Price Distribution with Different Outlier Definitions",
subtitle = "Red = 3SD, Orange = IQR, Purple = 99th percentile",
x = "Sale Price",
y = "Count") +
theme_minimal()
My outlier definition: I’m going with the IQR method (1.5 * IQR) because:
Why this matters: With the IQR method, I identify 137 homes as outliers. These are homes that are unusually cheap or expensive compared to the market. But I need to investigate why they’re outliers:
# Investigate the high-price outliers
high_price_outliers <- ames |>
filter(SalePrice > iqr_upper) |>
select(Order, SalePrice, Gr.Liv.Area, Overall.Qual, Neighborhood, Year.Built) |>
arrange(desc(SalePrice)) |>
head(10)
kable(high_price_outliers,
col.names = c("Order", "Sale Price", "Living Area", "Quality", "Neighborhood", "Year Built"),
caption = "Top 10 High-Price Outliers",
format.args = list(big.mark = ","))
| Order | Sale Price | Living Area | Quality | Neighborhood | Year Built |
|---|---|---|---|---|---|
| 1,768 | 755,000 | 4,316 | 10 | NoRidge | 1,994 |
| 1,761 | 745,000 | 4,476 | 10 | NoRidge | 1,996 |
| 2,446 | 625,000 | 3,627 | 10 | NoRidge | 1,995 |
| 1,064 | 615,000 | 2,470 | 10 | NridgHt | 2,003 |
| 45 | 611,657 | 2,364 | 9 | NridgHt | 2,009 |
| 433 | 610,000 | 2,674 | 10 | NridgHt | 2,007 |
| 1,638 | 591,587 | 2,338 | 9 | StoneBr | 2,006 |
| 2,451 | 584,500 | 3,500 | 9 | NoRidge | 1,993 |
| 434 | 582,933 | 2,822 | 9 | NridgHt | 2,008 |
| 2,333 | 556,581 | 2,868 | 9 | StoneBr | 2,005 |
# Are these legitimate luxury homes or data errors?
cat("\n=== OUTLIER INVESTIGATION ===\n")
##
## === OUTLIER INVESTIGATION ===
cat("These outliers have:\n")
## These outliers have:
cat("- Average living area:", comma(mean(high_price_outliers$Gr.Liv.Area)), "sq ft\n")
## - Average living area: 3,146 sq ft
cat("- Average quality:", round(mean(high_price_outliers$Overall.Qual), 1), "\n")
## - Average quality: 9.5
cat("- Most common neighborhoods:",
paste(names(sort(table(high_price_outliers$Neighborhood), decreasing = TRUE)[1:3]),
collapse = ", "), "\n")
## - Most common neighborhoods: NoRidge, NridgHt, StoneBr
Insight: Looking at the high-price outliers, they’re mostly: - Large homes (2,000+ sq ft) - High quality (8-10 rating) - In premium neighborhoods (NridgHt, NoRidge, StoneBr)
These look like legitimate luxury homes, not data errors. They’re expensive for valid reasons. I probably wouldn’t remove these from my analysis because they represent a real segment of the market.
However, I would investigate low-price outliers more carefully—homes selling for way below market might be distressed sales, data errors, or have severe problems not captured in other variables.
This week’s analysis showed me that:
Without understanding what MS.SubClass or MS.Zoning codes mean, I’d create nonsensical analyses. The documentation prevented me from making major errors.
The biggest issue I found is the NA/missing value ambiguity. Does NA mean “feature doesn’t exist” or “data missing”? This isn’t always clear, and it dramatically affects how I should handle the data.
The fact that 93% of homes have NA for Alley and 80% for Fence strongly suggests these are “None” rather than “missing.” But without explicit confirmation, I’m making an assumption that could be wrong.
Finding homes with fireplaces but NA quality ratings, or homes with garage area but no garage type, reveals data quality issues that need to be resolved before analysis.
Depending on whether I use 3SD, IQR, or percentiles, I get very different numbers of outliers. I need to justify my choice and investigate whether outliers are legitimate or errors.
What I’ll do differently going forward:
The biggest lesson: Data is messy, documentation is imperfect, and I need to be a detective, not just an analyst. Every dataset has quirks that need to be understood before drawing conclusions.