Introduction to Housing Market Analysis with the Ames Housing Dataset

Photo by Tom Thain on Unsplash: https://storage.googleapis.com/kaggle-media/competitions/House%20Prices/kaggle_5407_media_housesbanner.png

The goal of this project is to visualize a few statistical relationships in Dean De Cock’s Ames Housing dataset for use in data science education. The sale price of a house is affected by many elements, including its location and nearby amenities, as well as its dimensions, configuration, and modernization, all of which converge to influence its marketability. The following are explanations of the features I selected from this dataset:

SalePrice: This is the target variable, and it represents the sale price of the property in dollars. In the Kaggle competition, the objective is to predict this variable.
OverallQual: Rates the overall material and finish quality of the house.
YearBuilt and YearRemodAdd: The original construction date and the remodel date of the property, respectively.
TotalBsmtSF: Total square feet of basement area.
1stFlrSF, GrLivArea: First-floor square footage and above-grade (ground) living area square footage, respectively.
FullBath: Number of full bathrooms above grade.
TotRmsAbvGrd: Total number of rooms above grade, not counting bathrooms.
GarageCars, GarageArea: Size of the garage, measured in car capacity and in square feet, respectively.

In this project, I used the MIT-licensed dataset “House Prices - Advanced Regression Techniques” from Kaggle. According to the website’s acknowledgements, the Ames Housing dataset was compiled by Dean De Cock. It is a comprehensive collection of residential property sales data from Ames, Iowa, covering 2006 to 2010. The full dataset contains 2930 observations with a wide range of characteristics (23 nominal, 23 ordinal, 14 discrete, and 20 continuous variables) to allow for in-depth regression analysis for educational purposes; the Kaggle training file used here, train.csv, contains 1460 rows and 81 columns. The data was sourced from the Ames City Assessor’s Office and refined for use in academic settings as an alternative to the Boston Housing dataset. I have put both the original data source reference and the Kaggle link at the bottom of the page.

I chose this topic and dataset because of the widespread relevance of housing prices, which are of particular concern to people looking for a stable living environment. I intend to use this data to investigate the factors that influence home prices. For me, the topic and dataset are significant because analyzing housing prices with this dataset allows me to uncover the numerous factors that influence property values, providing insights that could help me make informed decisions.

While exploring the data, the first thing that came to mind was to use a bar chart to see how housing sale prices are distributed. While doing so, I realized that I should group the sale price into ranges to make the chart tidier and to make it easy to see how many sales fall into each range. Let’s get started.

Load packages

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(plotly)

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

library(corrplot)

## corrplot 0.92 loaded

library(RColorBrewer)
library(reshape2)

## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths

library(scales)

## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

Read data from CSV file

train <- read_csv("train.csv")

## Rows: 1460 Columns: 81
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (38): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Categorize the Sale Price into different ranges

train <- train |>
  mutate(SalePrice_Range = cut(SalePrice,
                               breaks = c(-Inf, 100000, 150000, 200000, Inf),
                               labels = c("0-100k", "100k-150k", "150k-200k", "200k+")))
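A note on the interval boundaries: `cut()` builds right-closed intervals by default (`right = TRUE`), so a house sold for exactly $100,000 lands in "0-100k" rather than "100k-150k". The following is a minimal sketch on a few made-up prices; the `prices` vector is hypothetical and not taken from the dataset.

```r
# Hypothetical prices chosen to sit on and around the break points
prices <- c(99999, 100000, 100001, 150000, 200000, 200001)

ranges <- cut(prices,
              breaks = c(-Inf, 100000, 150000, 200000, Inf),
              labels = c("0-100k", "100k-150k", "150k-200k", "200k+"))

# cut() already returns a factor, with levels in break order
data.frame(prices, ranges)
```

If the opposite convention were wanted (a price of exactly 100,000 counted in the upper range), passing `right = FALSE` to `cut()` would switch to left-closed intervals.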

Calculate the counts and percentages

train_summary <- train |>
  group_by(SalePrice_Range) |>
  summarise(Count = n()) |>
  mutate(Percentage = Count / sum(Count) * 100)
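The summary table is not printed in this report, so here is a small sketch of the same count-and-percentage computation on toy data, to show what the result looks like. The `toy` data frame is hypothetical; `count()` with `.drop = FALSE` is used here as one possible variant that also keeps ranges with no sales.

```r
library(dplyr)

# Hypothetical toy data standing in for train$SalePrice_Range
toy <- data.frame(SalePrice_Range = factor(
  c("0-100k", "100k-150k", "100k-150k", "200k+"),
  levels = c("0-100k", "100k-150k", "150k-200k", "200k+")))

toy_summary <- toy |>
  count(SalePrice_Range, name = "Count", .drop = FALSE) |>  # keep empty ranges
  mutate(Percentage = Count / sum(Count) * 100)

toy_summary
```

On this toy input the "150k-200k" row appears with a count of 0, which `group_by()` followed by `summarise(n())` would omit by default.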

Convert the ‘SalePrice_Range’ column to a factor to ensure it’s treated as a categorical variable (cut() already returns a factor, so this step is only a safeguard)

train$SalePrice_Range <- as.factor(train$SalePrice_Range)

Preliminary Graph

p <- ggplot(train, aes(x = SalePrice_Range, fill = SalePrice_Range)) +
  geom_bar(stat = "count") +  # The height of the bar represents the count of each category
  geom_text(stat = "count", aes(label = after_stat(count), y = after_stat(count)), vjust = -0.3) +  # Add count labels above bars
  scale_fill_manual(values = c("0-100k" = "#d3c3f7", 
                               "100k-150k" = "#abc9ea", 
                               "150k-200k" = "#f3aba8", 
                               "200k+" = "#98daa7"), 
                    name = "Sales Price Range") +  # Rename legend title
  labs(title = "Distribution of Sale Prices", 
       subtitle = "Insights into Housing Values: Market Analysis Through Sale Prices.",
       x = "Sales Price Range", 
       y = "Count",
       caption = "Data source: The Ames Housing dataset by Dean De Cock") +  # Data source caption
  theme_minimal() +
  theme(plot.caption = element_text(hjust = 0.4),  # Place caption slightly left of center
        legend.position = "right",  # Move legend to right
        plot.caption.position = "plot",  # Position caption at the plot bottom
        plot.title = element_text(face = "bold", size = 14),  # Bold the title and increase the size
        plot.subtitle = element_text(size = 10))  # Adjust the size of the subtitle

# Print the plot
p

From the chart, we observe four distinct categories: “0-100k,” “100k-150k,” “150k-200k,” and “200k+”. The “100k-150k” range has the highest count of properties, closely followed by the “200k+” range, indicating a higher concentration of property sales within these price brackets. The “0-100k” range, by contrast, contains the fewest properties, suggesting a smaller market segment for properties valued below $100,000.

Now that we know the distribution of sale price ranges, what are the factors that cause the price to vary? To find out those factors, I want to use a correlation heatmap to do a more objective analysis.

Before plotting the correlation heatmap, the first step is to clean the data.

Data Cleaning

# Compute the correlation matrix for numerical columns
cor_matrix <- cor(train[sapply(train, is.numeric)], use = "pairwise.complete.obs")

# Select the top k variables most correlated with 'SalePrice'
k <- 10 # Number of variables you want to show in the heatmap

# Find the top k highest absolute correlations with SalePrice
# (k + 1 entries because SalePrice itself ranks first with a correlation of 1)
top_correlations <- sort(abs(cor_matrix[,"SalePrice"]), decreasing = TRUE)[1:(k+1)]
names_top_correlations <- names(top_correlations)

house <- train |> 
  select(all_of(names_top_correlations)) |>
  mutate(SalePrice_Range = cut(SalePrice,
                               breaks = c(-Inf, 100000, 150000, 200000, Inf),
                               labels = c("0-100k", "100k-150k", "150k-200k", "200k+")))

To clean the Ames Housing dataset, I concentrated solely on features relevant to a basic housing price analysis, excluding more complex or nuanced data that would require advanced treatment. I computed pairwise correlations between the numeric columns, dropped the extraneous columns, and kept only ‘SalePrice’ and the ten variables with the strongest absolute correlation with it. I then divided ‘SalePrice’ into discrete ranges, stored as a factor, to gain a better understanding of each price bracket. The resulting ‘house’ table is a streamlined version of the data that is easier to interpret and visualize in the later analysis stages.
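The selection step above can be sketched on a tiny, fully deterministic example. Everything here is hypothetical (the `toy` data frame and its column names, with `target` playing the role of ‘SalePrice’); the point is only to show how pairwise-complete correlations are ranked by absolute value and how the target's own self-correlation is dropped before ranking.

```r
# Hypothetical toy data: 'target' plays the role of SalePrice
x <- 1:10
toy <- data.frame(
  a = x,                                   # perfectly related to target
  b = c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9),    # strongly related
  c = rep(c(1, -1), 5),                    # weakly related
  target = 2 * x
)
toy$a[1:2] <- NA  # missing values are handled pair by pair

cor_toy <- cor(toy, use = "pairwise.complete.obs")

# Rank predictors by absolute correlation with the target,
# dropping the target's own self-correlation of 1 first
with_target <- cor_toy[, "target"]
with_target <- with_target[names(with_target) != "target"]
top2 <- names(sort(abs(with_target), decreasing = TRUE))[1:2]
top2
```

Dropping the target before ranking is an alternative to asking for `k + 1` entries; both approaches give the same set of predictors.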

Plotting the Correlation Heatmap

# Subset the correlation matrix to only include the top correlated variables
cor_matrix_subset <- cor_matrix[names_top_correlations, names_top_correlations]
# Convert the correlation matrix to a long format for plotting
cor_data <- melt(cor_matrix_subset)

# Create the heatmap using plotly (width and height are set in plot_ly())
p1 <- plot_ly(data = cor_data, x = ~Var2, y = ~Var1, z = ~value, type = "heatmap", colorscale = 'RdBu',
              zmin = -1, zmax = 1, width = 700, height = 500) |>
  layout(title = list(text = 'Top 10 Correlated Features with SalePrice', x = 0.5, y = 0.99),
         xaxis = list(title = '', tickangle = -90),
         yaxis = list(title = ''),
         annotations = list(
           x = 0.5, y = -0.35,  # Center the data source note below the plot
           text = 'Data source: The Ames Housing dataset by Dean De Cock',
           showarrow = FALSE,
           xref = 'paper', yref = 'paper',
           xanchor = 'center', yanchor = 'bottom',
           font = list(size = 10)
         ),
         margin = list(b = 80, l = 80))  # Adjust bottom and left margins

# Print the plot
p1

These are the variables that are most correlated with ‘SalePrice’ according to our heatmap. Here are my thoughts:

‘OverallQual’, ‘GrLivArea’ and ‘GarageCars’ are strongly correlated with ‘SalePrice’. Check!

‘GarageCars’ and ‘GarageArea’ are also strongly correlated with ‘SalePrice’, and with each other. This makes sense: the number of cars that can fit into the garage is determined by the size of the garage. ‘GarageCars’ and ‘GarageArea’ are like twins that carry nearly the same information, so we only need one of these variables in our analysis.

‘TotalBsmtSF’ and ‘1stFlrSF’ appear to be twins as well.

‘FullBath’?? Really?

‘TotRmsAbvGrd’ and ‘GrLivArea’ are twins once more. We can keep ‘GrLivArea’ because it has a higher correlation with ‘SalePrice’.
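Rather than spotting such "twin" pairs by eye on the heatmap, they can be listed programmatically. The sketch below is an illustration only: `high_pairs()` is a hypothetical helper, the 0.8 threshold is an arbitrary cut-off rather than an established rule, and the toy columns `size`, `cars` and `other` are made up.

```r
# Hypothetical helper: list variable pairs whose absolute correlation
# exceeds a threshold, looking only at the upper triangle so that each
# pair is reported once and the diagonal is skipped
high_pairs <- function(cor_mat, threshold = 0.8) {
  idx <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)
  data.frame(var1 = rownames(cor_mat)[idx[, "row"]],
             var2 = colnames(cor_mat)[idx[, "col"]],
             r    = cor_mat[idx])
}

# Made-up data in which 'size' and 'cars' move together
toy <- data.frame(size  = 1:10,
                  cars  = c(2, 1, 4, 3, 6, 5, 8, 7, 10, 9),
                  other = rep(c(1, -1), 5))

pairs_found <- high_pairs(cor(toy))
pairs_found
```

Applied to the report's data, the call would presumably be `high_pairs(cor_matrix_subset)`, after removing the ‘SalePrice’ row and column if only predictor pairs are of interest.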

Oh, and ‘YearBuilt’, which I had not picked out: it appears to be slightly correlated with ‘SalePrice’. To be honest, thinking about ‘YearBuilt’ makes me nervous because it makes me think we should do some time-series analysis to get this right.

Let’s do some more analysis on this data; I’d like to look at how living area is distributed within each price range using a boxplot.

# We need to ensure that SalePrice_Range is a factor
house$SalePrice_Range <- as.factor(house$SalePrice_Range)

# Define the colors for each SalePrice_Range
colors <- setNames(c("#abc9ea", "#f3aba8", "#98daa7", "#d3c3f7"), levels(house$SalePrice_Range))

# Create the Plotly boxplot
p2 <- plot_ly(data = house, y = ~GrLivArea, color = ~SalePrice_Range, colors = colors,
              width = 700, height = 500) |>  # Set the size here rather than in layout()
  add_boxplot() |>
  layout(title = list(text = 'Above grade (ground) living area square feet Distribution by Sale Price Range', x = 0.5, y = 0.98),
         xaxis = list(title = 'Sale Price Range'),
         yaxis = list(title = 'Above grade (ground) living area square feet'),
         annotations = list(
           x = 0.5, y = -0.18,  # Adjust y position to place it at the bottom
           text = 'Data source: The Ames Housing dataset by Dean De Cock',
           showarrow = FALSE,
           xref = 'paper', yref = 'paper',
           xanchor = 'center', yanchor = 'bottom',
           font = list(size = 10)
          ), 
         margin = list(b = 80, l = 80))  # Adjust bottom and left margins

# Display the plot
p2

The boxplot visualization, “Above grade (ground) living area square feet Distribution by Sale Price Range,” reveals several insights into the relationship between living area size and property sale price range.

We can see from the graph that as the sale price range increases, there is a general trend of increasing living area square footage. The compact boxplot with a lower median shows a relatively small living area in the lowest price range, “0-100k,” indicating that smaller living spaces are associated with lower-priced homes.

The price ranges “100k-150k” and “150k-200k” show a noticeable increase in living area square footage, with the median values rising and the box sizes indicating a wider distribution of living area sizes. This implies greater variability and a larger typical living area in these price ranges.

Properties in the “200k+” range not only have higher median living areas, but also a much broader spread, as evidenced by the box and whiskers’ extended range. This demonstrates a significant variation in living area sizes within this higher price category, which includes properties with significantly larger square footage.

Outliers are represented as points beyond the whiskers of the boxes in each price range category, indicating that there are exceptions to the general trends. These outliers may represent one-of-a-kind small or large properties within each price range that do not fit the overall pattern observed.
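To make the notion of "points beyond the whiskers" concrete, here is a small sketch of the usual 1.5 × IQR rule using base R's `boxplot.stats()`. The `area` values are hypothetical, and Plotly may compute quartiles slightly differently from base R, so the exact cut-offs in the interactive chart could differ a little.

```r
# Hypothetical living areas (sq ft) for one price range
area <- c(900, 1000, 1100, 1200, 1300, 1400, 1500, 3200)

stats <- boxplot.stats(area)  # coef = 1.5 by default

stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values lying beyond the whiskers
```

Here the hinges are 1050 and 1450, so the fences sit at 1050 - 600 = 450 and 1450 + 600 = 2050, and only the 3200 sq ft property is flagged.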

The findings of this statistical visualization suggest a relationship between sale price and living area square feet; however, the presence of outliers suggests that living area is not the only determinant of price. Other factors not depicted on this graph may also have a significant impact on the sale price, and a thorough analysis would necessitate investigating these additional variables. The visual analysis clearly shows that, while larger homes tend to be more expensive, there is significant diversity within each price segment, reflecting the complex nature of real estate valuation.

# Scatter plot with regression line
p3 <- ggplot(house, aes(x = GrLivArea, y = SalePrice)) +
  geom_point(aes(color = SalePrice_Range)) +  # Add points
  geom_smooth(method = "lm", se = FALSE) +  # Add linear regression line without standard error band
  labs(
    title = "Living Area vs. Sale Price Analysis", 
    subtitle = "Exploring the Impact of Above-Grade Living Area on Sale Prices in the Ames Housing Market",
    x = "Above grade (ground) living area square feet", 
    y = "Sale Price ($)",
    caption = "Data source: The Ames Housing dataset by Dean De Cock"
  ) +
  scale_y_continuous(labels = label_comma()) +  # Change y-axis labels to comma format
  theme_minimal() +
  theme(
    plot.caption = element_text(hjust = 0.5)  # Center caption horizontally
  )

# Print the plot
p3

## `geom_smooth()` using formula = 'y ~ x'

This scatter plot depicts the relationship between above-grade living area (in square feet) and sale price in the Ames housing market. Each dot represents a property, color-coded by sale price range, and the fitted line shows a positive trend in which larger houses command higher prices. Notably, the few properties larger than 4000 square feet stand apart from the rest as exceptions. These outliers may indicate that premium pricing is influenced by factors other than size. The scatter plot also looks like layers of color swatches stacked on top of one another; this visual stratification is expected, because the colors are derived from ranges of the sale price, which is also the variable on the y-axis.
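The line drawn by `geom_smooth(method = "lm")` is an ordinary least-squares fit, which can also be computed directly with `lm()` to read off the slope and the share of variance explained. The sketch below uses a hypothetical five-row `toy` data frame rather than the Ames data, so the numbers it produces say nothing about the real market.

```r
# Hypothetical toy data: area in sq ft, price in dollars
toy <- data.frame(area  = c(800, 1200, 1500, 2000, 2600),
                  price = c(95000, 140000, 160000, 215000, 260000))

fit <- lm(price ~ area, data = toy)

coef(fit)                       # intercept and dollars per extra square foot
r2 <- summary(fit)$r.squared    # share of price variance explained by area

# In a one-predictor model, R-squared equals the squared Pearson correlation
all.equal(r2, cor(toy$area, toy$price)^2)
```

On the report's data the analogous call would be `lm(SalePrice ~ GrLivArea, data = house)`; the identity in the last line ties the regression back to the correlation shown in the heatmap.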

Discussion:

About the data and visualization:

The richness of the dataset led to intuitive variable selection at first, but data exploration guided me to variables that were highly correlated with sale price for my visualizations. I started with a bar plot to identify common price ranges in the Ames dataset, then moved on to other graph types. The majority of homes exceeded $100,000, with less variation than expected. After creating a correlation heatmap, I discovered that ‘OverallQual,’ ‘GrLivArea,’ and ‘GarageCars’ were strongly related to sale price, which prompted a closer look at the impact of ‘GrLivArea’. Even after selecting ‘GrLivArea’ as the key variable, I experimented with other variables, the majority of which were also associated with sale price fluctuations.

About difficulties:

Creating a heatmap was initially difficult because it required conceptualizing the data in matrix form and calculating correlation coefficients. Fortunately, after investigating various packages, I discovered a solution. However, I ran into a stumbling block: despite numerous attempts, I was unable to add subtitles to the Plotly charts.
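For the subtitle problem, one commonly used workaround is to fold the subtitle into the title string itself, since Plotly titles accept a small set of HTML tags such as `<br>` and `<sup>`. The sketch below is a possible approach rather than the method used in this report; `title_with_subtitle()` is a hypothetical helper, and the demo data and title text are made up.

```r
library(plotly)

# Hypothetical helper: put the subtitle on a second, smaller line of the title
title_with_subtitle <- function(title, subtitle) {
  paste0(title, "<br><sup>", subtitle, "</sup>")
}

# Made-up points, only to have something to draw
p_demo <- plot_ly(x = c(1, 2, 3), y = c(2, 4, 3),
                  type = "scatter", mode = "markers") |>
  layout(title = list(text = title_with_subtitle(
    "Living Area by Sale Price Range",
    "Ames Housing dataset")))

p_demo
```

A larger top margin (for example `margin = list(t = 80)` in `layout()`) may be needed so that the two-line title does not overlap the plot area.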

Below, I’ve highlighted what I believe to be the most crucial references for further context.

Data Sources and References

The following are the most important references:

Ames Housing dataset:

Original: https://jse.amstat.org/v19n3/decock.pdf

Kaggle: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/data

Heatmap:

https://www.geeksforgeeks.org/how-to-create-correlation-heatmap-in-r/

An Introduction to corrplot Package:

https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html

Final Project

Mingzhuo Tian

2023-12-07