```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
Upon exploring the dataset and reading through its documentation, the following three columns were initially unclear:
MS SubClass: This column encodes the type of dwelling involved in the sale. Without the documentation, values like “020” or “090” are hard to decipher. For example, “020” stands for “1-STORY 1946 & NEWER ALL STYLES” and “090” stands for “DUPLEX - ALL STYLES AND AGES.”
Overall Qual: This column rates the overall material and finish of the house on a 1-10 scale. The direction of the scale was not obvious at first; the documentation confirms that higher is better, with a rating of 10 meaning “Very Excellent” quality.
Lot Shape: The values in this column are encoded as “Reg,” “IR1,” “IR2,” and “IR3,” which correspond to “Regular,” “Slightly Irregular,” “Moderately Irregular,” and “Irregular” lot shapes, respectively. Without the documentation, these abbreviations would be confusing.
Why did they encode the data this way?
Short codes keep data entry compact and consistent, and they make the
columns easy to filter and group in a large dataset. The cost is that the
codes are not self-explanatory: without the documentation, the values are
easy to misinterpret, which could lead to inaccurate analysis. For
example, read.csv loads MS SubClass as integers such as 20 and 60, so it
can be mistaken for a numeric variable when it is actually categorical.
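The point about categorical codes can be made concrete with a minimal sketch. It assumes a toy data frame named `toy` (hypothetical, not read from the Ames file) that uses only the codes and labels quoted above, and it assumes that treating Lot Shape as ordered from regular to irregular is appropriate for the analysis at hand.

```r
# Toy rows built from the codes quoted above; NOT taken from the real Ames file.
toy <- data.frame(
  MS.SubClass = c(20, 90, 20, 90),
  Lot.Shape   = c("Reg", "IR1", "IR3", "IR2"),
  stringsAsFactors = FALSE
)

# Dwelling type is a nominal category, so convert the integer codes to a
# factor and attach the labels given in the documentation.
toy$MS.SubClass <- factor(
  toy$MS.SubClass,
  levels = c(20, 90),
  labels = c("1-STORY 1946 & NEWER ALL STYLES", "DUPLEX - ALL STYLES AND AGES")
)

# Lot shape runs from regular to irregular (an assumption about how it will
# be used), so an ordered factor keeps that ordering explicit.
toy$Lot.Shape <- factor(
  toy$Lot.Shape,
  levels = c("Reg", "IR1", "IR2", "IR3"),
  ordered = TRUE
)

# Counts per category are meaningful; a mean of the raw codes would not be.
subclass_counts <- table(toy$MS.SubClass)
print(subclass_counts)

# Ordered factors support comparisons such as "more regular than IR2".
print(toy$Lot.Shape < "IR2")
```

Converting the codes right after loading means later summaries and models treat them as groups rather than as quantities.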
After reviewing the documentation, the column “Garage Finish” remained unclear.
Garage Finish values are categorized as:
“Fin” (Finished)
“RFn” (Rough Finished)
“Unf” (Unfinished)
However, the documentation does not explain what “Rough Finished” means. It is ambiguous whether this category refers to garages that are partially finished, of lower quality, or structurally sound but lacking cosmetic improvements.
Why is this important?
Misinterpreting the meaning of “Rough Finished” could lead to inaccurate
evaluations of property values, as garages often play a significant role
in a home’s overall appeal and price.
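To visually represent this issue, we create a boxplot of sale prices grouped by garage finish. The plot uses a small simulated sample rather than the full dataset, so it illustrates the ambiguity and does not estimate real price differences. In the plot, we highlight the “Rough Finished” (RFn) category and note the ambiguity.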
library(dplyr)
library(ggplot2)
library(tidyr)
ames <- read.csv('D:/Stats for DS/ames.csv', header = TRUE)
head(ames)
## Order PID MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street Alley
## 1 1 526301100 20 RL 141 31770 Pave <NA>
## 2 2 526350040 20 RH 80 11622 Pave <NA>
## 3 3 526351010 20 RL 81 14267 Pave <NA>
## 4 4 526353030 20 RL 93 11160 Pave <NA>
## 5 5 527105010 60 RL 74 13830 Pave <NA>
## 6 6 527105030 60 RL 78 9978 Pave <NA>
## Lot.Shape Land.Contour Utilities Lot.Config Land.Slope Neighborhood
## 1 IR1 Lvl AllPub Corner Gtl NAmes
## 2 Reg Lvl AllPub Inside Gtl NAmes
## 3 IR1 Lvl AllPub Corner Gtl NAmes
## 4 Reg Lvl AllPub Corner Gtl NAmes
## 5 IR1 Lvl AllPub Inside Gtl Gilbert
## 6 IR1 Lvl AllPub Inside Gtl Gilbert
## Condition.1 Condition.2 Bldg.Type House.Style Overall.Qual Overall.Cond
## 1 Norm Norm 1Fam 1Story 6 5
## 2 Feedr Norm 1Fam 1Story 5 6
## 3 Norm Norm 1Fam 1Story 6 6
## 4 Norm Norm 1Fam 1Story 7 5
## 5 Norm Norm 1Fam 2Story 5 5
## 6 Norm Norm 1Fam 2Story 6 6
## Year.Built Year.Remod.Add Roof.Style Roof.Matl Exterior.1st Exterior.2nd
## 1 1960 1960 Hip CompShg BrkFace Plywood
## 2 1961 1961 Gable CompShg VinylSd VinylSd
## 3 1958 1958 Hip CompShg Wd Sdng Wd Sdng
## 4 1968 1968 Hip CompShg BrkFace BrkFace
## 5 1997 1998 Gable CompShg VinylSd VinylSd
## 6 1998 1998 Gable CompShg VinylSd VinylSd
## Mas.Vnr.Type Mas.Vnr.Area Exter.Qual Exter.Cond Foundation Bsmt.Qual
## 1 Stone 112 TA TA CBlock TA
## 2 None 0 TA TA CBlock TA
## 3 BrkFace 108 TA TA CBlock TA
## 4 None 0 Gd TA CBlock TA
## 5 None 0 TA TA PConc Gd
## 6 BrkFace 20 TA TA PConc TA
## Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1 BsmtFin.SF.1 BsmtFin.Type.2
## 1 Gd Gd BLQ 639 Unf
## 2 TA No Rec 468 LwQ
## 3 TA No ALQ 923 Unf
## 4 TA No ALQ 1065 Unf
## 5 TA No GLQ 791 Unf
## 6 TA No GLQ 602 Unf
## BsmtFin.SF.2 Bsmt.Unf.SF Total.Bsmt.SF Heating Heating.QC Central.Air
## 1 0 441 1080 GasA Fa Y
## 2 144 270 882 GasA TA Y
## 3 0 406 1329 GasA TA Y
## 4 0 1045 2110 GasA Ex Y
## 5 0 137 928 GasA Gd Y
## 6 0 324 926 GasA Ex Y
## Electrical X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF Gr.Liv.Area Bsmt.Full.Bath
## 1 SBrkr 1656 0 0 1656 1
## 2 SBrkr 896 0 0 896 0
## 3 SBrkr 1329 0 0 1329 0
## 4 SBrkr 2110 0 0 2110 1
## 5 SBrkr 928 701 0 1629 0
## 6 SBrkr 926 678 0 1604 0
## Bsmt.Half.Bath Full.Bath Half.Bath Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual
## 1 0 1 0 3 1 TA
## 2 0 1 0 2 1 TA
## 3 0 1 1 3 1 Gd
## 4 0 2 1 3 1 Ex
## 5 0 2 1 3 1 TA
## 6 0 2 1 3 1 Gd
## TotRms.AbvGrd Functional Fireplaces Fireplace.Qu Garage.Type Garage.Yr.Blt
## 1 7 Typ 2 Gd Attchd 1960
## 2 5 Typ 0 <NA> Attchd 1961
## 3 6 Typ 0 <NA> Attchd 1958
## 4 8 Typ 2 TA Attchd 1968
## 5 6 Typ 1 TA Attchd 1997
## 6 7 Typ 1 Gd Attchd 1998
## Garage.Finish Garage.Cars Garage.Area Garage.Qual Garage.Cond Paved.Drive
## 1 Fin 2 528 TA TA P
## 2 Unf 1 730 TA TA Y
## 3 Unf 1 312 TA TA Y
## 4 Fin 2 522 TA TA Y
## 5 Fin 2 482 TA TA Y
## 6 Fin 2 470 TA TA Y
## Wood.Deck.SF Open.Porch.SF Enclosed.Porch X3Ssn.Porch Screen.Porch Pool.Area
## 1 210 62 0 0 0 0
## 2 140 0 0 0 120 0
## 3 393 36 0 0 0 0
## 4 0 0 0 0 0 0
## 5 212 34 0 0 0 0
## 6 360 36 0 0 0 0
## Pool.QC Fence Misc.Feature Misc.Val Mo.Sold Yr.Sold Sale.Type Sale.Condition
## 1 <NA> <NA> <NA> 0 5 2010 WD Normal
## 2 <NA> MnPrv <NA> 0 6 2010 WD Normal
## 3 <NA> <NA> Gar2 12500 6 2010 WD Normal
## 4 <NA> <NA> <NA> 0 4 2010 WD Normal
## 5 <NA> MnPrv <NA> 0 3 2010 WD Normal
## 6 <NA> <NA> <NA> 0 6 2010 WD Normal
## SalePrice
## 1 215000
## 2 105000
## 3 172000
## 4 244000
## 5 189900
## 6 195500
# Simulate a small sample in place of the real Ames columns (illustrative values only)
data <- data.frame(
  SalePrice = c(200000, 250000, 300000, 320000, 150000, 180000, 260000, 230000, 310000, 170000),
  GarageFinish = c('Fin', 'Fin', 'RFn', 'Unf', 'Fin', 'RFn', 'RFn', 'Unf', 'Fin', 'Unf')
)
# Boxplot of sale price by garage finish (ggplot2 is already loaded above)
ggplot(data, aes(x = GarageFinish, y = SalePrice, fill = GarageFinish)) +
  geom_boxplot() +
  labs(title = "Sale Price Distribution by Garage Finish",
       x = "Garage Finish", y = "Sale Price ($)") +
  # Categories sort alphabetically (Fin, RFn, Unf), so RFn sits at x = 2
  geom_vline(xintercept = 2, color = 'red', linetype = "dashed") +
  annotate("text", x = 2, y = 350000, label = "Unclear: RFn", color = "red", size = 4) +
  theme_minimal()
The RFn (Rough Finished) category is highlighted with a red dashed line and labeled “Unclear” because the documentation does not clarify its exact meaning. This could lead to uncertainty in interpreting its impact on sale price.
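As a numeric companion to the boxplot, the sketch below tabulates the number of sales and the median price per garage-finish category. It reuses the same simulated values as the plot, stored here in a data frame named `sim` (a name chosen for this example), so the numbers say nothing about real Ames sales.

```r
# Same simulated values as the boxplot; these are NOT real Ames sales.
sim <- data.frame(
  SalePrice = c(200000, 250000, 300000, 320000, 150000,
                180000, 260000, 230000, 310000, 170000),
  GarageFinish = c("Fin", "Fin", "RFn", "Unf", "Fin",
                   "RFn", "RFn", "Unf", "Fin", "Unf")
)

# Number of sales in each garage-finish category.
n_by_finish <- table(sim$GarageFinish)

# Median sale price within each category.
median_by_finish <- tapply(sim$SalePrice, sim$GarageFinish, median)

print(n_by_finish)
print(median_by_finish)
```

With these made-up values the medians come out to 225000 for Fin, 260000 for RFn, and 230000 for Unf. Running the same two calls on the real `Garage.Finish` and `SalePrice` columns would show where RFn actually falls between the other two categories.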
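Insight: Garage finish plausibly affects property value, but the plotted values are simulated, so the plot shows where the ambiguity sits rather than how large the effect is. The lack of clarity surrounding the “Rough Finished” category complicates any accurate pricing comparison.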
Significance: Misinterpretation of this category could lead to incorrect predictions in models estimating house prices, especially if “Garage Finish” is a key feature.
Further Questions:
Can more information be found to better understand what “Rough Finished” implies in practice?
Does this lack of clarity affect other variables in the dataset, such as quality ratings or overall property conditions?