For this lab, you’ll be working with a group of other classmates, and each group will be assigned to critique a lab from a previous week.
Your group will have three goals:
First, choose a previous lab: Week 6.
A real estate firm in Ames, Iowa, wants to understand how housing prices vary across different building types and within similar home categories. By analyzing these trends, they aim to provide better pricing guidance and recommendations to clients who are buying or selling homes. The purpose of this study is to better gauge housing prices on the market and identify how prices fluctuate across different types of homes.
Customer or Audience: The audience includes real estate agents and homeowners who want to better understand pricing and market trends.
Problem Statement: Real estate agents need a clear understanding of how building type affects housing prices so they can accurately price homes and advise clients. Inaccurately priced homes tend to stay on the market longer or fail to sell, creating financial and operational challenges. To address this, the firm will analyze the Ames Housing Dataset to identify pricing patterns across different home types, with the goal of reducing pricing uncertainty and improving decision-making for both buyers and sellers.
Scope: The variables that can be used from the Ames Housing Dataset to address this problem include sale_price and bldg_type. Additional variables such as overall_qual, first_flr_sf, and year_built can also be included to better explain differences in housing prices. To analyze this, an ANOVA test will be used to determine if there are significant differences in average sale prices across different building types. A linear regression model will also be used to examine how building type and other features impact housing prices and to better understand relationships between variables.
Objective: The goal of this analysis is to determine whether building type significantly affects housing prices and to identify the key factors that influence price differences. Success will be achieved when pricing patterns across building types have been identified and the most influential variables affecting housing prices have been determined, allowing for more accurate pricing recommendations.
Load libraries
library(tidyverse)
library(AmesHousing)
Issue 1: Using log transformation to fix skewed data
ames <- make_ames()
ames <- ames |> rename_with(tolower)
ames <- ames |> mutate(log_price = log(sale_price))
ml <- aov(log_price ~ bldg_type, data = ames)
summary(ml)
##               Df Sum Sq Mean Sq F value Pr(>F)
## bldg_type      4   19.7    4.93   30.89 <2e-16 ***
## Residuals   2925  466.9    0.16
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(ames, aes(x = bldg_type, y = log_price)) +
  geom_boxplot()
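To check that the log transformation actually reduces skew, the sketch below compares the sample skewness of sale_price before and after taking logs. The `skewness()` helper is our own hypothetical function (a simple moment-based estimate), not something from tidyverse or AmesHousing, and the data are reloaded only so the snippet runs on its own.

```r
library(AmesHousing)

# Load the data and lowercase the column names, as in the lab code.
ames <- make_ames()
names(ames) <- tolower(names(ames))

# Moment-based sample skewness (hypothetical helper, not from a package).
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

skew_raw <- skewness(ames$sale_price)
skew_log <- skewness(log(ames$sale_price))

# Values near 0 suggest a roughly symmetric distribution.
round(c(raw = skew_raw, log = skew_log), 2)
```

If the log value is much closer to 0 than the raw value, that supports running the ANOVA on log_price rather than on sale_price.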
Issue 2: Multiple Linear Regression Model
model_multi <- lm(sale_price ~ bldg_type + overall_qual + first_flr_sf, data = ames)
summary(model_multi)
##
## Call:
## lm(formula = sale_price ~ bldg_type + overall_qual + first_flr_sf,
##     data = ames)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -415328  -21767   -2121   18148  291807
##
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)
## (Intercept)                  4192.718  19793.208   0.212  0.83226
## bldg_typeTwoFmCon          -11473.824   5111.105  -2.245  0.02485 *
## bldg_typeDuplex            -16726.811   3961.492  -4.222 2.49e-05 ***
## bldg_typeTwnhs             -23420.004   4120.805  -5.683 1.45e-08 ***
## bldg_typeTwnhsE            -23929.560   2755.176  -8.685  < 2e-16 ***
## overall_qualPoor            17286.241  22521.792   0.768  0.44283
## overall_qualFair            36172.361  20653.519   1.751  0.07998 .
## overall_qualBelow_Average   60180.688  19867.442   3.029  0.00247 **
## overall_qualAverage         79586.442  19741.832   4.031 5.69e-05 ***
## overall_qualAbove_Average  108545.238  19750.398   5.496 4.22e-08 ***
## overall_qualGood           146126.421  19767.315   7.392 1.88e-13 ***
## overall_qualVery_Good      198641.982  19849.771  10.007  < 2e-16 ***
## overall_qualExcellent      281492.678  20159.027  13.964  < 2e-16 ***
## overall_qualVery_Excellent 337087.996  21133.233  15.951  < 2e-16 ***
## first_flr_sf                   49.882      2.293  21.752  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39370 on 2915 degrees of freedom
## Multiple R-squared: 0.7582, Adjusted R-squared: 0.7571
## F-statistic: 653 on 14 and 2915 DF, p-value: < 2.2e-16
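Since Issue 1 moved the ANOVA to the log scale, one possible follow-up is to fit the same regression with log(sale_price) as the response. The sketch below is only an illustration of that idea, not part of the original lab: `model_log` and `pct_change` are names we made up, and the percent-change reading assumes the usual interpretation of coefficients in a log-response model.

```r
library(AmesHousing)

# Load the data and lowercase the column names, as in the lab code.
ames <- make_ames()
names(ames) <- tolower(names(ames))

# Same predictors as model_multi, but with log(sale_price) as the response.
model_log <- lm(log(sale_price) ~ bldg_type + overall_qual + first_flr_sf,
                data = ames)

# Approximate percent difference in price relative to the baseline level of
# each factor (or per additional square foot for first_flr_sf).
pct_change <- 100 * (exp(coef(model_log)[-1]) - 1)
round(pct_change, 1)
```

On this scale each building-type coefficient reads as a percent difference from the baseline building type, holding quality and first-floor area fixed, which may be easier to turn into pricing guidance than raw dollar amounts.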
Issue 3: Better Visualization
ggplot(ames, aes(x = first_flr_sf, y = sale_price, color = bldg_type)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
One concern is that the Ames Housing Dataset may not reflect current housing market conditions. The data cover home sales from 2006 to 2010, and housing prices have changed significantly since then due to factors such as inflation and shifts in demand following COVID-19. Trends from that period may not hold the same weight today, which could affect the accuracy and relevance of the analysis.
Another concern is potential bias in the data. The dataset only includes housing data from Ames, Iowa, which makes it useful for understanding that specific market but limits how well the results generalize to other cities. Applying these findings to decisions in different housing markets could lead to biased or misleading conclusions.
There are also important factors that may not be fully captured in the dataset. For example, external influences such as interest rates, economic conditions, neighborhood development, and buyer preferences can all impact housing prices but may not be directly measured. This limits the ability of the model to fully explain price variation.
Finally, different groups are affected by this analysis, including real estate agents, homeowners, and potential buyers. If the model provides inaccurate or outdated recommendations, it could lead to homes being overpriced or underpriced, which may negatively impact financial outcomes and decision-making for these groups.