library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(AmesHousing)
## Warning: package 'AmesHousing' was built under R version 4.4.3
This project focuses on helping real estate agents in Ames, Iowa, set more accurate home prices. Many agents currently rely on past experience or general rules of thumb to determine sale prices. This can cause problems: homes may be priced too high and remain unsold for a long time, or priced too low, resulting in lower proceeds for the seller. To address this, the analysis explores how the size of a house (measured by square footage, or GrLivArea) influences its sale price. Data from past home sales is used to build a simple model showing the relationship between the amount of living space and the price paid. Instead of predicting the sale price directly, the model predicts the log of the sale price, which improves model accuracy and interpretation. The initial model includes only one variable, square footage, making it easy to understand and apply. If a strong relationship is found, the results can support real estate agents in providing better price recommendations, leading to more confident and consistent pricing decisions. Square footage was chosen because it has a strong, direct relationship with sale price, is easy to measure, and is widely understood by real estate professionals.
This section looks at common problems in the model: whether the assumptions of linear regression hold, whether the data is presented clearly, and whether the results can be improved. It applies the techniques learned in the lab, namely transforming the response variable, checking model diagnostics, adding another predictor, and comparing how different versions of the model perform. These steps helped make the analysis clearer and more useful.
A log transformation makes the distribution of sale prices less right-skewed and closer to normal, which helps the linear model's assumptions hold and makes the results easier to interpret.
# Load the processed Ames housing data
df <- make_ames()
# Standardize column names to lowercase
df_small <- df |> rename_with(tolower)
# Create a new column for log-transformed sale price
df_small$log_sale_price <- log(df_small$sale_price)
# Preview the updated dataset to confirm the new column was added
head(df_small)
## # A tibble: 6 × 82
## ms_subclass ms_zoning lot_frontage lot_area street alley lot_shape
## <fct> <fct> <dbl> <int> <fct> <fct> <fct>
## 1 One_Story_1946_and_New… Resident… 141 31770 Pave No_A… Slightly…
## 2 One_Story_1946_and_New… Resident… 80 11622 Pave No_A… Regular
## 3 One_Story_1946_and_New… Resident… 81 14267 Pave No_A… Slightly…
## 4 One_Story_1946_and_New… Resident… 93 11160 Pave No_A… Regular
## 5 Two_Story_1946_and_New… Resident… 74 13830 Pave No_A… Slightly…
## 6 Two_Story_1946_and_New… Resident… 78 9978 Pave No_A… Slightly…
## # ℹ 75 more variables: land_contour <fct>, utilities <fct>, lot_config <fct>,
## # land_slope <fct>, neighborhood <fct>, condition_1 <fct>, condition_2 <fct>,
## # bldg_type <fct>, house_style <fct>, overall_qual <fct>, overall_cond <fct>,
## # year_built <int>, year_remod_add <int>, roof_style <fct>, roof_matl <fct>,
## # exterior_1st <fct>, exterior_2nd <fct>, mas_vnr_type <fct>,
## # mas_vnr_area <dbl>, exter_qual <fct>, exter_cond <fct>, foundation <fct>,
## # bsmt_qual <fct>, bsmt_cond <fct>, bsmt_exposure <fct>, …
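The claim that a log transformation reduces skew can be checked numerically. The sketch below defines a small `skewness()` helper (a hypothetical name, not part of the original analysis) and applies it to simulated right-skewed values that stand in for sale prices. The simulated data and its parameters are assumptions chosen only so the example runs on its own.

```r
# Sample skewness: the third standardized moment (0 for a symmetric distribution)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

set.seed(1)
# Hypothetical stand-in for right-skewed prices; this is NOT the Ames data
fake_price <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

skew_raw <- skewness(fake_price)       # clearly positive for right-skewed data
skew_log <- skewness(log(fake_price))  # much closer to zero after the transform
print(c(raw = skew_raw, log = skew_log))
```

In this analysis the same helper could be applied as `skewness(df_small$sale_price)` and `skewness(df_small$log_sale_price)`; a value closer to zero for the logged column would support the transformation.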
A scatter plot with a regression line helps visualize the trend between square footage and log(sale price). The plot below is used to check whether larger homes tend to sell for more and whether a straight line is a reasonable description of that trend.
plot(df_small$gr_liv_area, df_small$log_sale_price,
main = "Log(SalePrice) vs. Square Footage",
xlab = "Square Footage (GrLivArea)",
ylab = "Log(SalePrice)",
pch = 19, col = "blue")
abline(lm(log_sale_price ~ gr_liv_area, data = df_small), col = "red", lwd = 2)
The plot shows a clear upward trend between square footage and log-transformed sale price, confirming a positive relationship: larger houses tend to sell for more. The regression line follows the general direction of the data, suggesting that a linear model is a reasonable way to capture this relationship. This supports using square footage as a meaningful predictor of housing prices.
model <- lm(log_sale_price ~ gr_liv_area, data = df_small)
par(mfrow = c(2, 2))
plot(model)
The diagnostic plots check whether the assumptions of linear regression hold. The “Residuals vs Fitted” plot shows that the relationship between square footage and log sale price is mostly straight, which supports the use of a linear model. The Q-Q plot curves slightly at the ends, which means the errors are not perfectly normal but are still close. The “Scale-Location” plot shows no strong pattern, which supports the idea that the spread of errors is mostly even. The “Residuals vs Leverage” plot shows a few points that could have a stronger effect on the fit, but nothing too extreme.
Overall, these checks suggest that the model fits the data fairly well and is useful for understanding pricing patterns.
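The few influential points visible in the “Residuals vs Leverage” plot can also be listed by name. This is a minimal sketch, assuming a Cook's distance cutoff of 4/n, which is a common rule of thumb and not a threshold taken from the original analysis. `flag_influential()` is a hypothetical helper, and the built-in `mtcars` data is only a stand-in so the example runs on its own.

```r
# Return the row names of observations whose Cook's distance exceeds 4/n
flag_influential <- function(fit) {
  d <- cooks.distance(fit)   # one value per observation used in the fit
  cutoff <- 4 / length(d)    # assumed rule-of-thumb threshold
  names(d)[d > cutoff]
}

# Stand-in model on built-in data; this is NOT the Ames model
demo_fit <- lm(mpg ~ wt, data = mtcars)
influential_rows <- flag_influential(demo_fit)
print(influential_rows)
```

Calling `flag_influential(model)` on the model fitted above would return the flagged rows of `df_small`, which could then be inspected before deciding whether they deserve special treatment.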
model2 <- lm(log_sale_price ~ gr_liv_area + overall_qual, data = df_small)
summary(model2)
##
## Call:
## lm(formula = log_sale_price ~ gr_liv_area + overall_qual, data = df_small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.74756 -0.09895 0.01429 0.12208 0.77368
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.035e+01 9.913e-02 104.417 < 2e-16 ***
## gr_liv_area 2.755e-04 8.919e-06 30.886 < 2e-16 ***
## overall_qualPoor 2.537e-01 1.130e-01 2.245 0.0248 *
## overall_qualFair 6.471e-01 1.036e-01 6.244 4.89e-10 ***
## overall_qualBelow_Average 8.705e-01 9.971e-02 8.730 < 2e-16 ***
## overall_qualAverage 1.093e+00 9.910e-02 11.029 < 2e-16 ***
## overall_qualAbove_Average 1.220e+00 9.920e-02 12.295 < 2e-16 ***
## overall_qualGood 1.398e+00 9.938e-02 14.064 < 2e-16 ***
## overall_qualVery_Good 1.615e+00 9.976e-02 16.186 < 2e-16 ***
## overall_qualExcellent 1.869e+00 1.012e-01 18.465 < 2e-16 ***
## overall_qualVery_Excellent 1.826e+00 1.064e-01 17.154 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1976 on 2919 degrees of freedom
## Multiple R-squared: 0.7657, Adjusted R-squared: 0.7649
## F-statistic: 954.1 on 10 and 2919 DF, p-value: < 2.2e-16
summary(model)$r.squared # Model with only square footage
## [1] 0.4842244
summary(model2)$r.squared # Model with square footage + overall quality
## [1] 0.7657285
Adding a second predictor, overall quality, makes the model better at explaining house prices. R-squared rises from 0.484 to 0.766, so the model now explains about 77% of the variation in log sale price instead of about 48%. The adjusted R-squared (0.765) is almost identical, which indicates the gain is not simply a side effect of adding more terms. Both square footage and quality are statistically significant, and the quality coefficients rise from Poor through Excellent, showing that higher-quality homes usually sell for more at a given size. This makes the model more helpful for understanding what affects sale prices.
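Because the response is log(sale price), the coefficients translate into approximate percentage changes in price. The short calculation below is a sketch that copies three estimates from the `summary(model2)` output above and back-transforms them with `exp()`. The variable names are illustrative and not part of the original analysis.

```r
# Estimates copied from the summary(model2) output (log-price scale)
b_area    <- 2.755e-04  # gr_liv_area
b_good    <- 1.398      # overall_qualGood
b_average <- 1.093      # overall_qualAverage

# Percent change in predicted price for 100 extra square feet, quality held fixed
pct_per_100_sqft <- (exp(100 * b_area) - 1) * 100

# Percent difference between a Good and an Average home of the same size
pct_good_vs_average <- (exp(b_good - b_average) - 1) * 100

print(round(c(per_100_sqft = pct_per_100_sqft,
              good_vs_average = pct_good_vs_average), 1))
```

Under this model, an extra 100 square feet is associated with a predicted price about 2.8% higher, and a Good home is predicted to sell for roughly 36% more than an Average home of the same size.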
This model uses square footage and overall quality to predict house prices. While helpful, it leaves out many other important factors that affect price, such as location, condition, number of bathrooms, and nearby services. These missing pieces can reduce the model’s accuracy and make the results less fair or realistic for some homes. Another concern is bias in the data. If house prices in the past were affected by things like lower pricing in certain neighborhoods, the model might still reproduce those same patterns when making predictions today.
There is also a risk that people might rely too much on the model. If real estate agents or sellers treat the model output as the final word, they might ignore things the model does not measure, such as updated kitchens, large windows with scenic views, how many doors are in the house, or the overall feel of the space, like whether it feels welcoming or peaceful. Finally, this model treats price as just a number, but in reality, house value includes feelings, lifestyle needs, and personal judgment.
A model like this is a helpful tool, but it should not be used alone to make major financial decisions. These concerns align with ethical principles covered in Week 5, especially around bias, missing data, and how models should be used carefully.
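One way to keep the model output from being treated as the final word is to report how wide the uncertainty around a single prediction is. The sketch below is a rough approximation: it uses only the residual standard error from the `summary(model2)` output, assumes roughly normal residuals, and ignores uncertainty in the coefficients themselves. The variable names are illustrative.

```r
# Residual standard error copied from the summary(model2) output (log scale)
rse <- 0.1976

# Approximate 95% range on the log scale is +/- z * rse; exp() turns it into
# multiplicative factors on the price scale
z <- qnorm(0.975)
upper_factor <- exp(z * rse)
lower_factor <- exp(-z * rse)

print(round(c(lower = lower_factor, upper = upper_factor), 2))
```

The factors come out at about 0.68 and 1.47, so a home the model prices at $200,000 would have an approximate range of $136,000 to $295,000. A range this wide is a concrete reason to treat the prediction as a starting point for an agent's judgment, not a replacement for it.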
Lab 8 directly supported the steps taken in this analysis. It explained the importance of checking model assumptions such as linearity, normality, and equal spread of errors, which guided the use of diagnostic plots in Step 3. The lab also introduced log-transforming sale price to deal with skewed data, which was applied in Step 1. It demonstrated how visualizing relationships with scatter plots and regression lines can help spot patterns, as done in Step 2. Lab 8 also showed how to build models with multiple predictors and compare their fit, which supported adding overall quality in Step 4 and checking R-squared values. These examples helped shape the decisions made in this project and gave practical ways to improve and explain the model results.
This project explored how square footage and overall quality influence house prices using linear regression models. The final model provided useful insights and explained about 77% of the variation in log sale price, but it also had important limitations. Diagnostic checks suggested the model's assumptions were reasonably met, and visualizations helped communicate key findings. However, ethical and practical concerns, such as missing variables like neighborhood or home condition, are a reminder that models should support, not replace, human judgment. Overall, this analysis shows the value of data-driven tools for real estate, while also emphasizing the need for careful interpretation and continued improvement.