Photo by Tom Thain on Unsplash: https://storage.googleapis.com/kaggle-media/competitions/House%20Prices/kaggle_5407_media_housesbanner.png
The goal of this project is to visualize a few statistical relationships in Dean De Cock’s Ames Housing dataset for use in data science education. The sale price of a house is shaped by many elements, including its location and nearby amenities as well as its dimensions, configuration, and modernization, all of which converge to influence its marketability. The sections below introduce the dataset and explain the features I selected from it.
In this project, I used the MIT-licensed dataset “House Prices - Advanced Regression Techniques” from Kaggle. According to the website’s acknowledgements, the Ames Housing dataset was compiled by Dean De Cock. It is a comprehensive collection of residential property sales data in Ames, Iowa, from 2006 to 2010. It contains 2930 observations with a wide range of characteristics (23 nominal, 23 ordinal, 14 discrete, and 20 continuous variables) to allow for in-depth regression analysis for educational purposes. The dataset was sourced from the Ames City Assessor’s Office and refined for use in academic settings as an alternative to the Boston Housing dataset. I have put both the original data source reference and the Kaggle link at the bottom of the page.
I chose this topic and dataset because of the widespread relevance of housing prices, which are of particular concern to people looking for a stable living environment. I intend to use this data to investigate the factors that influence home prices. For me, the topic and dataset are significant because analyzing housing prices lets me uncover the many factors associated with property values, providing insights that could help me make informed decisions.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(corrplot)
## corrplot 0.92 loaded
library(RColorBrewer)
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(scales)
##
## Attaching package: 'scales'
##
## The following object is masked from 'package:purrr':
##
## discard
##
## The following object is masked from 'package:readr':
##
## col_factor
train <- read_csv("train.csv")
## Rows: 1460 Columns: 81
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (43): MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConf...
## dbl (38): Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, Ye...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
train <- train |>
  mutate(SalePrice_Range = cut(SalePrice,
                               breaks = c(-Inf, 100000, 150000, 200000, Inf),
                               labels = c("0-100k", "100k-150k", "150k-200k", "200k+")))
train_summary <- train |>
  group_by(SalePrice_Range) |>
  summarise(Count = n()) |>
  mutate(Percentage = Count / sum(Count) * 100)
train$SalePrice_Range <- as.factor(train$SalePrice_Range)
p <- ggplot(train, aes(x = SalePrice_Range, fill = SalePrice_Range)) +
  geom_bar(stat = "count") + # The height of the bar represents the count of each category
  geom_text(stat = 'count', aes(label = after_stat(count), y = after_stat(count)), vjust = -0.3) + # Add count labels above bars
  scale_fill_manual(values = c("0-100k" = "#d3c3f7",
                               "100k-150k" = "#abc9ea",
                               "150k-200k" = "#f3aba8",
                               "200k+" = "#98daa7"),
                    name = "Sales Price Range") + # Rename legend title
  labs(title = "Distribution of Sale Prices",
       subtitle = "Insights into Housing Values: Market Analysis Through Sale Prices.",
       x = "Sales Price Range",
       y = "Count",
       caption = "Data source: The Ames Housing dataset by Dean De Cock") + # Data source caption
  theme_minimal() +
  theme(plot.caption = element_text(hjust = 0.4), # Place caption slightly left of center
        legend.position = "right", # Move legend to right
        plot.caption.position = "plot", # Position caption at the plot bottom
        plot.title = element_text(face = "bold", size = 14), # Bold the title and increase the size
        plot.subtitle = element_text(size = 10)) # Adjust the size of the subtitle
# Print the plot
p
From the chart, we observe four price brackets: “0-100k,” “100k-150k,” “150k-200k,” and “200k+”. The “100k-150k” range has the highest count of properties, closely followed by the “200k+” range, indicating that sales are concentrated in these brackets. The “0-100k” range, by contrast, contains the fewest properties, suggesting a smaller market segment for homes sold at or below $100,000.
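The binning behind these brackets can be checked on a handful of values. The sketch below is a minimal base R illustration using made-up prices (not rows from `train.csv`); it shows that `cut()` closes each interval on the right by default, so a sale at exactly 100000 falls into “0-100k”, and it derives the count and percentage share of each bracket.

```r
# Toy sale prices (made-up values for illustration, not taken from train.csv)
prices <- c(85000, 100000, 100001, 149000, 150000, 175000, 200000, 250000)

# cut() uses right-closed intervals by default, so 100000 lands in "0-100k"
price_range <- cut(prices,
                   breaks = c(-Inf, 100000, 150000, 200000, Inf),
                   labels = c("0-100k", "100k-150k", "150k-200k", "200k+"))

# Count and percentage share of each bracket
counts <- table(price_range)
shares <- round(100 * prop.table(counts), 1)

print(data.frame(Range = names(counts),
                 Count = as.vector(counts),
                 Percentage = as.vector(shares)))
```

If the brackets should instead include their lower bound, `cut()` accepts `right = FALSE`.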
Before plotting a correlation heatmap, the first step is to clean the data.
# Compute the correlation matrix for numerical columns
cor_matrix <- cor(train[sapply(train, is.numeric)], use = "pairwise.complete.obs")
# Select the k variables most correlated with 'SalePrice'
k <- 10 # Number of variables to show in the heatmap
# Take k + 1 entries because SalePrice itself (correlation 1) ranks first
top_correlations <- sort(abs(cor_matrix[, "SalePrice"]), decreasing = TRUE)[1:(k + 1)]
names_top_correlations <- names(top_correlations)
# all_of() selects columns from a character vector without a tidyselect warning
house <- train |>
  select(all_of(names_top_correlations)) |>
  mutate(SalePrice_Range = cut(SalePrice,
                               breaks = c(-Inf, 100000, 150000, 200000, Inf),
                               labels = c("0-100k", "100k-150k", "150k-200k", "200k+")))
To clean the Ames Housing dataset, I concentrated solely on features relevant to a basic housing price analysis, excluding more complex or nuanced data that would require advanced treatment. I removed extraneous columns and kept only the numeric variables most strongly correlated with ‘SalePrice’. I divided ‘SalePrice’ into discrete ranges, stored as a factor, to gain a better understanding of each price bracket. Then, within these price categories, I computed the mean of critical housing attributes, streamlining the data for easy interpretation and visualization in later analysis stages.
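The per-bracket averaging mentioned above can be sketched as follows. This is a minimal base R illustration on a toy data frame with hypothetical values; the column names match the dataset, but the numbers do not come from it, and `aggregate()` is used here only to keep the example free of package dependencies (a `group_by()` plus `summarise()` pipeline would do the same job).

```r
# Toy data frame (hypothetical values for illustration only)
toy <- data.frame(
  SalePrice   = c(90000, 95000, 120000, 140000, 180000, 190000, 250000, 310000),
  GrLivArea   = c(800, 900, 1100, 1300, 1500, 1700, 2000, 2600),
  OverallQual = c(4, 5, 5, 6, 6, 7, 8, 9)
)

# Same brackets as in the analysis
toy$SalePrice_Range <- cut(toy$SalePrice,
                           breaks = c(-Inf, 100000, 150000, 200000, Inf),
                           labels = c("0-100k", "100k-150k", "150k-200k", "200k+"))

# Mean of each attribute within each price bracket
range_means <- aggregate(cbind(GrLivArea, OverallQual) ~ SalePrice_Range,
                         data = toy, FUN = mean)
print(range_means)
```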
# Subset the correlation matrix to only include the top correlated variables
cor_matrix_subset <- cor_matrix[names_top_correlations, names_top_correlations]
# Convert the correlation matrix to a long format for plotting
cor_data <- melt(cor_matrix_subset)
# Create the heatmap using plotly (width and height are set in plot_ly(), not layout())
p1 <- plot_ly(data = cor_data, x = ~Var2, y = ~Var1, z = ~value, type = "heatmap", colorscale = 'RdBu',
              zmin = -1, zmax = 1, width = 700, height = 500) |>
  layout(title = list(text = 'Top 10 Correlated Features with SalePrice', x = 0.5, y = 0.99),
         xaxis = list(title = '', tickangle = -90),
         yaxis = list(title = ''),
         annotations = list(
           x = 0.5, y = -0.35, # Center the caption horizontally, below the plot area
           text = 'Data source: The Ames Housing dataset by Dean De Cock',
           showarrow = FALSE,
           xref = 'paper', yref = 'paper',
           xanchor = 'center', yanchor = 'bottom',
           font = list(size = 10)
         ),
         margin = list(b = 80, l = 80)) # Adjust bottom and left margins
# Print the plot
p1
These are the variables that are most correlated with ‘SalePrice’ according to our heatmap. Here are my thoughts:
‘OverallQual’, ‘GrLivArea’ and ‘GarageCars’ are strongly correlated with ‘SalePrice’. Check!
‘GarageCars’ and ‘GarageArea’ are two other highly correlated variables. However, the number of cars that can fit into the garage is determined by the size of the garage. ‘GarageCars’ and ‘GarageArea’ are identical twins. You’ll never be able to tell them apart. As a result, we only need one of these variables in our analysis.
‘TotalBsmtSF’ and ‘1stFlrSF’ appear to be twins as well.
‘FullBath’?? Really?
‘TotRmsAbvGrd’ and ‘GrLivArea’ are twin brothers once more. We can keep ‘GrLivArea’ because it has a higher correlation with ‘SalePrice’.
Oh, and I did not pick ‘YearBuilt’, even though it appears to be slightly correlated with ‘SalePrice’. To be honest, thinking about ‘YearBuilt’ makes me nervous because it makes me think we should do some time-series analysis to get this right.
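One way to back up the “twins” observations is to compute the correlation for each suspected pair directly. The sketch below uses a hypothetical helper, `pair_correlations()`, and synthetic data in which garage capacity is derived from garage area; the values are made up, so only the pattern (a near-redundant pair versus an unrelated pair) carries over to the real dataset.

```r
# Hypothetical helper: Pearson correlation for each named pair of columns
pair_correlations <- function(data, pairs) {
  vapply(pairs, function(p) {
    cor(data[[p[1]]], data[[p[2]]], use = "pairwise.complete.obs")
  }, numeric(1))
}

# Synthetic stand-in for the housing data (values are made up)
set.seed(42)
n <- 200
garage_area <- runif(n, 200, 900)
toy <- data.frame(
  GarageArea = garage_area,
  GarageCars = round(garage_area / 300),  # capacity derived from area
  GrLivArea  = runif(n, 600, 3000)        # generated independently of the garage
)

pairs <- list(
  garage    = c("GarageCars", "GarageArea"),
  unrelated = c("GarageArea", "GrLivArea")
)

twin_check <- pair_correlations(toy, pairs)
print(round(twin_check, 2))
```

When a pair scores close to 1, keeping only the member with the higher correlation to ‘SalePrice’ avoids feeding nearly the same information into the analysis twice.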
# We need to ensure that SalePrice_Range is a factor
house$SalePrice_Range <- as.factor(house$SalePrice_Range)
# Define the colors for each SalePrice_Range (same mapping as the bar chart)
colors <- setNames(c("#d3c3f7", "#abc9ea", "#f3aba8", "#98daa7"), levels(house$SalePrice_Range))
# Create the Plotly boxplot (width and height are set in plot_ly(), not layout())
p2 <- plot_ly(data = house, y = ~GrLivArea, color = ~SalePrice_Range, colors = colors,
              width = 700, height = 500) |>
  add_boxplot() |>
  layout(title = list(text ='Above grade (ground) living area square feet Distribution by Sale Price Range', x = 0.5,y=0.98),
         xaxis = list(title = 'Sale Price Range'),
         yaxis = list(title = 'Above grade (ground) living area square feet'),
         annotations = list(
           x = 0.5, y = -0.18, # Adjust y position to place it at the bottom
           text = 'Data source: The Ames Housing dataset by Dean De Cock',
           showarrow = FALSE,
           xref = 'paper', yref = 'paper',
           xanchor = 'center', yanchor = 'bottom',
           font = list(size = 10)
         ),
         margin = list(b = 80, l = 80)) # Adjust bottom and left margins
# Display the plot
p2
The boxplot visualization, “Above grade (ground) living area square feet Distribution by Sale Price Range,” reveals several insights into the relationship between living area size and property sale price range.
We can see from the graph that as the sale price range increases, there is a general trend of increasing living area square footage. The compact boxplot with a lower median shows a relatively small living area in the lowest price range, “0-100k,” indicating that smaller living spaces are associated with lower-priced homes.
The price ranges “100k-150k” and “150k-200k” show a noticeable increase in living area square footage, with the median values rising and the box sizes indicating a wider distribution of living area sizes. This implies greater variability and a larger typical living area in these price ranges.
Properties in the “200k+” range not only have higher median living areas, but also a much broader spread, as evidenced by the box and whiskers’ extended range. This demonstrates a significant variation in living area sizes within this higher price category, which includes properties with significantly larger square footage.
Outliers are represented as points beyond the whiskers of the boxes in each price range category, indicating that there are exceptions to the general trends. These outliers may represent one-of-a-kind small or large properties within each price range that do not fit the overall pattern observed.
The findings of this statistical visualization suggest a relationship between sale price and living area square feet; however, the presence of outliers suggests that living area is not the only determinant of price. Other factors not depicted on this graph may also have a significant impact on the sale price, and a thorough analysis would necessitate investigating these additional variables. The visual analysis clearly shows that, while larger homes tend to be more expensive, there is significant diversity within each price segment, reflecting the complex nature of real estate valuation.
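To make the notion of “points beyond the whiskers” concrete, here is a minimal base R sketch of the usual 1.5 × IQR convention, applied to made-up living areas for a single price bracket. It relies on `boxplot.stats()`, which uses hinge-based quartiles; Plotly may compute quartiles slightly differently, so this illustrates the rule rather than reproducing the chart’s exact whisker positions.

```r
# Toy living areas for one price bracket (made-up values)
areas <- c(900, 1000, 1100, 1200, 1300, 1400, 1500, 1600, 3500)

# boxplot.stats() flags points more than 1.5 * IQR beyond the box edges
box_summary <- boxplot.stats(areas)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
print(box_summary$stats)

# Values that would be drawn as individual outlier points
print(box_summary$out)
```

Here the box spans 1100 to 1500, so the upper fence sits at 1500 + 1.5 × 400 = 2100, and the 3500 square foot home is flagged as an outlier.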
# Scatter plot with regression line
p3 <- ggplot(house, aes(x = GrLivArea, y = SalePrice)) +
  geom_point(aes(color = SalePrice_Range)) + # Add points
  geom_smooth(method = "lm", se = FALSE) + # Add linear regression line without standard error band
  labs(
    title = "Living Area vs. Sale Price Analysis",
    subtitle = "Exploring the Impact of Gross Living Area on Sale Prices in the Ames Housing Market",
    x = "Above grade (ground) living area square feet",
    y = "Sale Price ($)",
    caption = "Data source: The Ames Housing dataset by Dean De Cock"
  ) +
  scale_y_continuous(labels = label_comma()) + # Change y-axis labels to comma format
  theme_minimal() +
  theme(
    plot.caption = element_text(hjust = 0.5) # Center caption horizontally
  )
# Print the plot
p3
## `geom_smooth()` using formula = 'y ~ x'
This scatter plot depicts the relationship between above-grade living area (in square feet) and sale price in the Ames housing market. Each dot represents a property, color-coded by sale price range, and the fitted line shows a positive trend in which larger houses command higher prices. Notably, a few properties larger than 4000 square feet sit apart from the main cloud of points. These outliers may indicate that pricing is influenced by factors other than size. The scatter plot also looks like layers of color swatches stacked on top of one another; this visual stratification arises because the colors are defined by the sale price ranges plotted on the y-axis.
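The straight line drawn by `geom_smooth(method = "lm")` corresponds to an ordinary least squares fit of sale price on living area. The sketch below fits the same kind of model with `lm()` on synthetic data; the slope of 100 dollars per square foot and the noise level are assumptions chosen for illustration, not estimates from the Ames data.

```r
# Synthetic data: price rises by an assumed 100 dollars per square foot plus noise
set.seed(1)
toy <- data.frame(GrLivArea = seq(800, 3000, by = 100))
toy$SalePrice <- 20000 + 100 * toy$GrLivArea + rnorm(nrow(toy), sd = 15000)

# Ordinary least squares fit, the model behind geom_smooth(method = "lm")
fit <- lm(SalePrice ~ GrLivArea, data = toy)

# Estimated price change per extra square foot, and share of variance explained
slope <- coef(fit)[["GrLivArea"]]
r_squared <- summary(fit)$r.squared

print(round(slope, 1))
print(round(r_squared, 3))
```

Running `lm(SalePrice ~ GrLivArea, data = house)` on the real data would give the slope of the line shown in the plot.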
About the data and visualization:
The richness of the dataset led to intuitive variable selection at first, but data exploration guided me to variables that were highly correlated with sale price for my visualizations. I started with a bar plot to identify common price ranges in the Ames dataset, then moved on to other graph types. The majority of homes exceeded $100,000, with less variation than expected. After creating a correlation heatmap, I discovered that ‘OverallQual,’ ‘GrLivArea,’ and ‘GarageCars’ were strongly related to sale price, which prompted a closer look at how ‘GrLivArea’ relates to sale price. Even after selecting ‘GrLivArea’ as the key variable, I experimented with other variables, most of which were also associated with differences in sale price.
About difficulties:
Creating a heatmap was initially difficult because it required conceptualizing the data in matrix form and calculating correlation coefficients. Fortunately, after investigating various packages, I discovered a solution. However, I ran into a stumbling block: despite numerous attempts, I was unable to add subtitles to Plotly.
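For the subtitle problem, one commonly used workaround is to embed the subtitle in the title text itself, since Plotly titles accept a small subset of HTML tags such as `<br>` and `<sup>`. The sketch below is an assumption-laden illustration rather than an official subtitle feature: `title_with_subtitle()` is a hypothetical helper, and the data points are placeholders.

```r
library(plotly)

# Hypothetical helper: put a smaller second line under the main title
title_with_subtitle <- function(title, subtitle) {
  paste0(title, "<br><sup>", subtitle, "</sup>")
}

# Placeholder data; any plot_ly() figure can take the same title
fig <- plot_ly(x = c(1, 2, 3), y = c(2, 4, 8),
               type = "scatter", mode = "markers") |>
  layout(
    title = list(text = title_with_subtitle("Main title", "Smaller subtitle line")),
    margin = list(t = 80)  # extra top margin so both lines fit
  )

# Evaluate `fig` in an interactive session or R Markdown chunk to render it
```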
Below, I’ve highlighted what I believe to be the most crucial reference for further context.