My research question:
How do house features such as square footage, number of bedrooms, age of the house (the year it was built), bathroom count, and neighborhood quality affect the sale price of homes?
My Data-set:
The data-set I will be using is called “ames”. It contains 2930 observations and 82 variables. Ames is a city in Iowa, and the information in this data-set was gathered from the Ames Assessor’s Office, where it was used in computing the assessed values of residential properties that were sold. In this project, I will be looking at houses built after 1900, using the variables SalePrice (the continuous outcome), Year.Built, Neighborhood, Overall.Qual, and Gr.Liv.Area. I chose these variables because I found that location, size, and condition of the house are the top factors for the sale prices of homes (Nancy-Nash 2025).
Data-set link: https://www.openintro.org/data/index.php?data=ames
In this section, I will keep only the houses built after 1900, select the variables I will be using (SalePrice, Year.Built, Neighborhood, Overall.Qual, Gr.Liv.Area), and filter out rows with a missing SalePrice or Overall.Qual so I don’t run into errors when coding my plots. I will be performing a multiple linear regression because I will be analyzing how these factors affect the prices of houses in Ames, Iowa.
library(tidyverse)
library(ggplot2)
library(dplyr)
#Setting Working directory
setwd("C:/Users/Joanne G/OneDrive/Data101(Fall 2025)/Datasets")
#read ames.csv in here
ames_houses_df <- read.csv("ames.csv")
Clean the data-set and conduct exploratory data analysis (EDA) to better understand the data (2 functions minimum)
# EDA Data-set Chunk
#dimensions
dim(ames_houses_df)
## [1] 2930 82
#head
head(ames_houses_df)
## Order PID MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street Alley
## 1 1 526301100 20 RL 141 31770 Pave <NA>
## 2 2 526350040 20 RH 80 11622 Pave <NA>
## 3 3 526351010 20 RL 81 14267 Pave <NA>
## 4 4 526353030 20 RL 93 11160 Pave <NA>
## 5 5 527105010 60 RL 74 13830 Pave <NA>
## 6 6 527105030 60 RL 78 9978 Pave <NA>
## Lot.Shape Land.Contour Utilities Lot.Config Land.Slope Neighborhood
## 1 IR1 Lvl AllPub Corner Gtl NAmes
## 2 Reg Lvl AllPub Inside Gtl NAmes
## 3 IR1 Lvl AllPub Corner Gtl NAmes
## 4 Reg Lvl AllPub Corner Gtl NAmes
## 5 IR1 Lvl AllPub Inside Gtl Gilbert
## 6 IR1 Lvl AllPub Inside Gtl Gilbert
## Condition.1 Condition.2 Bldg.Type House.Style Overall.Qual Overall.Cond
## 1 Norm Norm 1Fam 1Story 6 5
## 2 Feedr Norm 1Fam 1Story 5 6
## 3 Norm Norm 1Fam 1Story 6 6
## 4 Norm Norm 1Fam 1Story 7 5
## 5 Norm Norm 1Fam 2Story 5 5
## 6 Norm Norm 1Fam 2Story 6 6
## Year.Built Year.Remod.Add Roof.Style Roof.Matl Exterior.1st Exterior.2nd
## 1 1960 1960 Hip CompShg BrkFace Plywood
## 2 1961 1961 Gable CompShg VinylSd VinylSd
## 3 1958 1958 Hip CompShg Wd Sdng Wd Sdng
## 4 1968 1968 Hip CompShg BrkFace BrkFace
## 5 1997 1998 Gable CompShg VinylSd VinylSd
## 6 1998 1998 Gable CompShg VinylSd VinylSd
## Mas.Vnr.Type Mas.Vnr.Area Exter.Qual Exter.Cond Foundation Bsmt.Qual
## 1 Stone 112 TA TA CBlock TA
## 2 None 0 TA TA CBlock TA
## 3 BrkFace 108 TA TA CBlock TA
## 4 None 0 Gd TA CBlock TA
## 5 None 0 TA TA PConc Gd
## 6 BrkFace 20 TA TA PConc TA
## Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1 BsmtFin.SF.1 BsmtFin.Type.2
## 1 Gd Gd BLQ 639 Unf
## 2 TA No Rec 468 LwQ
## 3 TA No ALQ 923 Unf
## 4 TA No ALQ 1065 Unf
## 5 TA No GLQ 791 Unf
## 6 TA No GLQ 602 Unf
## BsmtFin.SF.2 Bsmt.Unf.SF Total.Bsmt.SF Heating Heating.QC Central.Air
## 1 0 441 1080 GasA Fa Y
## 2 144 270 882 GasA TA Y
## 3 0 406 1329 GasA TA Y
## 4 0 1045 2110 GasA Ex Y
## 5 0 137 928 GasA Gd Y
## 6 0 324 926 GasA Ex Y
## Electrical X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF Gr.Liv.Area Bsmt.Full.Bath
## 1 SBrkr 1656 0 0 1656 1
## 2 SBrkr 896 0 0 896 0
## 3 SBrkr 1329 0 0 1329 0
## 4 SBrkr 2110 0 0 2110 1
## 5 SBrkr 928 701 0 1629 0
## 6 SBrkr 926 678 0 1604 0
## Bsmt.Half.Bath Full.Bath Half.Bath Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual
## 1 0 1 0 3 1 TA
## 2 0 1 0 2 1 TA
## 3 0 1 1 3 1 Gd
## 4 0 2 1 3 1 Ex
## 5 0 2 1 3 1 TA
## 6 0 2 1 3 1 Gd
## TotRms.AbvGrd Functional Fireplaces Fireplace.Qu Garage.Type Garage.Yr.Blt
## 1 7 Typ 2 Gd Attchd 1960
## 2 5 Typ 0 <NA> Attchd 1961
## 3 6 Typ 0 <NA> Attchd 1958
## 4 8 Typ 2 TA Attchd 1968
## 5 6 Typ 1 TA Attchd 1997
## 6 7 Typ 1 Gd Attchd 1998
## Garage.Finish Garage.Cars Garage.Area Garage.Qual Garage.Cond Paved.Drive
## 1 Fin 2 528 TA TA P
## 2 Unf 1 730 TA TA Y
## 3 Unf 1 312 TA TA Y
## 4 Fin 2 522 TA TA Y
## 5 Fin 2 482 TA TA Y
## 6 Fin 2 470 TA TA Y
## Wood.Deck.SF Open.Porch.SF Enclosed.Porch X3Ssn.Porch Screen.Porch Pool.Area
## 1 210 62 0 0 0 0
## 2 140 0 0 0 120 0
## 3 393 36 0 0 0 0
## 4 0 0 0 0 0 0
## 5 212 34 0 0 0 0
## 6 360 36 0 0 0 0
## Pool.QC Fence Misc.Feature Misc.Val Mo.Sold Yr.Sold Sale.Type Sale.Condition
## 1 <NA> <NA> <NA> 0 5 2010 WD Normal
## 2 <NA> MnPrv <NA> 0 6 2010 WD Normal
## 3 <NA> <NA> Gar2 12500 6 2010 WD Normal
## 4 <NA> <NA> <NA> 0 4 2010 WD Normal
## 5 <NA> MnPrv <NA> 0 3 2010 WD Normal
## 6 <NA> <NA> <NA> 0 6 2010 WD Normal
## SalePrice
## 1 215000
## 2 105000
## 3 172000
## 4 244000
## 5 189900
## 6 195500
summary(ames_houses_df)
## Order PID MS.SubClass MS.Zoning
## Min. : 1.0 Min. :5.263e+08 Min. : 20.00 Length:2930
## 1st Qu.: 733.2 1st Qu.:5.285e+08 1st Qu.: 20.00 Class :character
## Median :1465.5 Median :5.355e+08 Median : 50.00 Mode :character
## Mean :1465.5 Mean :7.145e+08 Mean : 57.39
## 3rd Qu.:2197.8 3rd Qu.:9.072e+08 3rd Qu.: 70.00
## Max. :2930.0 Max. :1.007e+09 Max. :190.00
##
## Lot.Frontage Lot.Area Street Alley
## Min. : 21.00 Min. : 1300 Length:2930 Length:2930
## 1st Qu.: 58.00 1st Qu.: 7440 Class :character Class :character
## Median : 68.00 Median : 9436 Mode :character Mode :character
## Mean : 69.22 Mean : 10148
## 3rd Qu.: 80.00 3rd Qu.: 11555
## Max. :313.00 Max. :215245
## NA's :490
## Lot.Shape Land.Contour Utilities Lot.Config
## Length:2930 Length:2930 Length:2930 Length:2930
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Land.Slope Neighborhood Condition.1 Condition.2
## Length:2930 Length:2930 Length:2930 Length:2930
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Bldg.Type House.Style Overall.Qual Overall.Cond
## Length:2930 Length:2930 Min. : 1.000 Min. :1.000
## Class :character Class :character 1st Qu.: 5.000 1st Qu.:5.000
## Mode :character Mode :character Median : 6.000 Median :5.000
## Mean : 6.095 Mean :5.563
## 3rd Qu.: 7.000 3rd Qu.:6.000
## Max. :10.000 Max. :9.000
##
## Year.Built Year.Remod.Add Roof.Style Roof.Matl
## Min. :1872 Min. :1950 Length:2930 Length:2930
## 1st Qu.:1954 1st Qu.:1965 Class :character Class :character
## Median :1973 Median :1993 Mode :character Mode :character
## Mean :1971 Mean :1984
## 3rd Qu.:2001 3rd Qu.:2004
## Max. :2010 Max. :2010
##
## Exterior.1st Exterior.2nd Mas.Vnr.Type Mas.Vnr.Area
## Length:2930 Length:2930 Length:2930 Min. : 0.0
## Class :character Class :character Class :character 1st Qu.: 0.0
## Mode :character Mode :character Mode :character Median : 0.0
## Mean : 101.9
## 3rd Qu.: 164.0
## Max. :1600.0
## NA's :23
## Exter.Qual Exter.Cond Foundation Bsmt.Qual
## Length:2930 Length:2930 Length:2930 Length:2930
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1 BsmtFin.SF.1
## Length:2930 Length:2930 Length:2930 Min. : 0.0
## Class :character Class :character Class :character 1st Qu.: 0.0
## Mode :character Mode :character Mode :character Median : 370.0
## Mean : 442.6
## 3rd Qu.: 734.0
## Max. :5644.0
## NA's :1
## BsmtFin.Type.2 BsmtFin.SF.2 Bsmt.Unf.SF Total.Bsmt.SF
## Length:2930 Min. : 0.00 Min. : 0.0 Min. : 0
## Class :character 1st Qu.: 0.00 1st Qu.: 219.0 1st Qu.: 793
## Mode :character Median : 0.00 Median : 466.0 Median : 990
## Mean : 49.72 Mean : 559.3 Mean :1052
## 3rd Qu.: 0.00 3rd Qu.: 802.0 3rd Qu.:1302
## Max. :1526.00 Max. :2336.0 Max. :6110
## NA's :1 NA's :1 NA's :1
## Heating Heating.QC Central.Air Electrical
## Length:2930 Length:2930 Length:2930 Length:2930
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF Gr.Liv.Area
## Min. : 334.0 Min. : 0.0 Min. : 0.000 Min. : 334
## 1st Qu.: 876.2 1st Qu.: 0.0 1st Qu.: 0.000 1st Qu.:1126
## Median :1084.0 Median : 0.0 Median : 0.000 Median :1442
## Mean :1159.6 Mean : 335.5 Mean : 4.677 Mean :1500
## 3rd Qu.:1384.0 3rd Qu.: 703.8 3rd Qu.: 0.000 3rd Qu.:1743
## Max. :5095.0 Max. :2065.0 Max. :1064.000 Max. :5642
##
## Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath
## Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :2.000 Median :0.0000
## Mean :0.4314 Mean :0.06113 Mean :1.567 Mean :0.3795
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :3.0000 Max. :2.00000 Max. :4.000 Max. :2.0000
## NA's :2 NA's :2
## Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd
## Min. :0.000 Min. :0.000 Length:2930 Min. : 2.000
## 1st Qu.:2.000 1st Qu.:1.000 Class :character 1st Qu.: 5.000
## Median :3.000 Median :1.000 Mode :character Median : 6.000
## Mean :2.854 Mean :1.044 Mean : 6.443
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :8.000 Max. :3.000 Max. :15.000
##
## Functional Fireplaces Fireplace.Qu Garage.Type
## Length:2930 Min. :0.0000 Length:2930 Length:2930
## Class :character 1st Qu.:0.0000 Class :character Class :character
## Mode :character Median :1.0000 Mode :character Mode :character
## Mean :0.5993
## 3rd Qu.:1.0000
## Max. :4.0000
##
## Garage.Yr.Blt Garage.Finish Garage.Cars Garage.Area
## Min. :1895 Length:2930 Min. :0.000 Min. : 0.0
## 1st Qu.:1960 Class :character 1st Qu.:1.000 1st Qu.: 320.0
## Median :1979 Mode :character Median :2.000 Median : 480.0
## Mean :1978 Mean :1.767 Mean : 472.8
## 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :2207 Max. :5.000 Max. :1488.0
## NA's :159 NA's :1 NA's :1
## Garage.Qual Garage.Cond Paved.Drive Wood.Deck.SF
## Length:2930 Length:2930 Length:2930 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 93.75
## 3rd Qu.: 168.00
## Max. :1424.00
##
## Open.Porch.SF Enclosed.Porch X3Ssn.Porch Screen.Porch
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0
## Median : 27.00 Median : 0.00 Median : 0.000 Median : 0
## Mean : 47.53 Mean : 23.01 Mean : 2.592 Mean : 16
## 3rd Qu.: 70.00 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0
## Max. :742.00 Max. :1012.00 Max. :508.000 Max. :576
##
## Pool.Area Pool.QC Fence Misc.Feature
## Min. : 0.000 Length:2930 Length:2930 Length:2930
## 1st Qu.: 0.000 Class :character Class :character Class :character
## Median : 0.000 Mode :character Mode :character Mode :character
## Mean : 2.243
## 3rd Qu.: 0.000
## Max. :800.000
##
## Misc.Val Mo.Sold Yr.Sold Sale.Type
## Min. : 0.00 Min. : 1.000 Min. :2006 Length:2930
## 1st Qu.: 0.00 1st Qu.: 4.000 1st Qu.:2007 Class :character
## Median : 0.00 Median : 6.000 Median :2008 Mode :character
## Mean : 50.63 Mean : 6.216 Mean :2008
## 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :17000.00 Max. :12.000 Max. :2010
##
## Sale.Condition SalePrice
## Length:2930 Min. : 12789
## Class :character 1st Qu.:129500
## Mode :character Median :160000
## Mean :180796
## 3rd Qu.:213500
## Max. :755000
##
Use a minimum of three dplyr functions (filter, select, mutate, summary, mean, max, etc.) to manipulate the data-set and prepare it for modeling.
houses_cleaned <- ames_houses_df |>
filter(Year.Built > 1900) |>
select(SalePrice, Year.Built, Neighborhood, Overall.Qual, Gr.Liv.Area) |>
filter(!is.na(SalePrice), !is.na(Overall.Qual))
summary(houses_cleaned)
## SalePrice Year.Built Neighborhood Overall.Qual
## Min. : 12789 Min. :1901 Length:2875 Min. : 1.000
## 1st Qu.:130000 1st Qu.:1955 Class :character 1st Qu.: 5.000
## Median :161900 Median :1974 Mode :character Median : 6.000
## Mean :181617 Mean :1973 Mean : 6.109
## 3rd Qu.:214000 3rd Qu.:2001 3rd Qu.: 7.000
## Max. :755000 Max. :2010 Max. :10.000
## Gr.Liv.Area
## Min. : 334
## 1st Qu.:1124
## Median :1440
## Mean :1494
## 3rd Qu.:1734
## Max. :5642
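The filter above only checks SalePrice and Overall.Qual for missing values. One quick way to confirm that no NAs remain in any of the selected columns is to count them per column. The sketch below uses a small made-up data frame called `toy` (an assumption, standing in for `houses_cleaned`) so that it runs on its own.

```r
# Made-up stand-in for houses_cleaned; the values are illustrative only.
toy <- data.frame(
  SalePrice    = c(215000, 105000, NA, 244000),
  Year.Built   = c(1960, 1961, 1958, 1968),
  Neighborhood = c("NAmes", "NAmes", NA, "Gilbert"),
  Overall.Qual = c(6, 5, 6, 7),
  Gr.Liv.Area  = c(1656, 896, 1329, NA),
  stringsAsFactors = FALSE
)

# Count the missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Keep only the rows that have no missing value in any column.
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```

With the real data, the same check would be `colSums(is.na(houses_cleaned))`; if every count is zero, no further NA filtering is needed.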
Clearly state your final model (use lm() or glm(family = binomial))
Final Model (using lm()):
multiple_reg_model <- lm(SalePrice ~ Year.Built + Neighborhood + Overall.Qual + Gr.Liv.Area, data = houses_cleaned)
Present the model summary with coefficients, standard errors, p-values, and confidence intervals
summary(multiple_reg_model)
##
## Call:
## lm(formula = SalePrice ~ Year.Built + Neighborhood + Overall.Qual +
## Gr.Liv.Area, data = houses_cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -340882 -16391 -276 14709 278128
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.305e+05 1.037e+05 -8.976 < 2e-16 ***
## Year.Built 4.509e+02 5.227e+01 8.626 < 2e-16 ***
## NeighborhoodBlueste -1.800e+04 1.308e+04 -1.376 0.168805
## NeighborhoodBrDale -3.133e+04 9.455e+03 -3.314 0.000931 ***
## NeighborhoodBrkSide 1.110e+04 8.349e+03 1.330 0.183784
## NeighborhoodClearCr 3.278e+04 8.810e+03 3.721 0.000202 ***
## NeighborhoodCollgCr 1.232e+04 7.043e+03 1.749 0.080400 .
## NeighborhoodCrawfor 3.610e+04 8.114e+03 4.449 8.94e-06 ***
## NeighborhoodEdwards 1.283e+03 7.599e+03 0.169 0.865973
## NeighborhoodGilbert -2.985e+03 7.267e+03 -0.411 0.681223
## NeighborhoodGreens 5.713e+03 1.429e+04 0.400 0.689368
## NeighborhoodGrnHill 9.242e+04 2.589e+04 3.570 0.000363 ***
## NeighborhoodIDOTRR 1.966e+03 8.588e+03 0.229 0.818967
## NeighborhoodLandmrk -2.662e+04 3.600e+04 -0.739 0.459698
## NeighborhoodMeadowV -1.164e+04 9.145e+03 -1.273 0.203186
## NeighborhoodMitchel 1.220e+04 7.602e+03 1.605 0.108624
## NeighborhoodNAmes 1.126e+04 7.289e+03 1.545 0.122401
## NeighborhoodNoRidge 6.020e+04 8.092e+03 7.439 1.33e-13 ***
## NeighborhoodNPkVill -1.618e+04 1.006e+04 -1.609 0.107808
## NeighborhoodNridgHt 7.150e+04 7.291e+03 9.806 < 2e-16 ***
## NeighborhoodNWAmes 4.506e+03 7.541e+03 0.597 0.550245
## NeighborhoodOldTown 3.756e+02 8.121e+03 0.046 0.963109
## NeighborhoodSawyer 1.313e+04 7.629e+03 1.720 0.085471 .
## NeighborhoodSawyerW -1.445e+03 7.466e+03 -0.194 0.846572
## NeighborhoodSomerst 1.655e+04 7.185e+03 2.303 0.021344 *
## NeighborhoodStoneBr 7.420e+04 8.395e+03 8.839 < 2e-16 ***
## NeighborhoodSWISU -7.436e+03 9.236e+03 -0.805 0.420817
## NeighborhoodTimber 3.539e+04 7.908e+03 4.475 7.92e-06 ***
## NeighborhoodVeenker 3.691e+04 9.933e+03 3.716 0.000206 ***
## Overall.Qual 1.987e+04 8.129e+02 24.447 < 2e-16 ***
## Gr.Liv.Area 5.778e+01 1.741e+00 33.181 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35360 on 2844 degrees of freedom
## Multiple R-squared: 0.806, Adjusted R-squared: 0.8039
## F-statistic: 393.7 on 30 and 2844 DF, p-value: < 2.2e-16
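The summary above reports estimates, standard errors, and p-values, but not confidence intervals. In base R these come from `confint()`. The sketch below is a minimal example on simulated stand-in data (`toy` and `toy_model` are made-up names and values, not the Ames data); in the real analysis the call would be `confint(multiple_reg_model)`.

```r
# Simulated stand-in data (made up); swap in houses_cleaned and
# multiple_reg_model for the real analysis.
set.seed(1)
n <- 300
toy <- data.frame(
  Year.Built   = sample(1901:2010, n, replace = TRUE),
  Neighborhood = sample(c("NAmes", "Gilbert", "StoneBr"), n, replace = TRUE),
  Overall.Qual = sample(1:10, n, replace = TRUE),
  Gr.Liv.Area  = round(runif(n, 400, 4000))
)
toy$SalePrice <- -900000 + 450 * toy$Year.Built + 20000 * toy$Overall.Qual +
  58 * toy$Gr.Liv.Area + rnorm(n, sd = 35000)

toy_model <- lm(SalePrice ~ Year.Built + Neighborhood + Overall.Qual + Gr.Liv.Area,
                data = toy)

# 95% confidence interval (lower and upper bound) for every coefficient.
ci <- confint(toy_model, level = 0.95)

# Put the estimates and their intervals side by side in one table.
coef_table <- cbind(Estimate = coef(toy_model), ci)
print(round(coef_table, 2))
```

An interval that does not contain zero corresponds to a coefficient that is significant at the 5% level in the summary table.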
Interpret the coefficients in the context of your research question (including odds ratios for logistic regression)
To interpret the coefficients, I will start with Gr.Liv.Area (above-ground living area). Holding Year.Built, Neighborhood, and Overall.Qual constant, each additional square foot of above-ground living area is associated with an estimated increase in sale price of about $57.78. Moving on to Overall.Qual (overall quality of the house): controlling for the other variables, each 1-unit increase in overall quality (e.g., from 5 → 6) is associated with a predicted increase in sale price of about $19,870. Next, for Year.Built (the year the house was built): holding all other variables constant, each additional year newer the home is predicts an increase in sale price of about $451.
Lastly, each Neighborhood coefficient represents the adjusted difference in predicted sale price between that neighborhood and the reference neighborhood, holding the other variables constant. The reference is the first neighborhood alphabetically, which is the one that does not appear in the coefficient table. For example, a home in NridgHt is predicted to sell for about $71,500 more than a comparable home in the reference neighborhood, while several neighborhoods (such as Edwards and OldTown) do not differ significantly from it.
As for the multiple R², the value 0.806 ≈ 0.81 means that about 81% of the variation in home sale prices is explained by the combination of the year the house was built, its neighborhood, its overall quality, and its above-ground living area. I also want to note that since this model is an lm() linear regression, I interpreted the raw changes in the outcome, not odds ratios.
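As a worked example of reading these coefficients, the sketch below plugs the estimates from the summary output into a few simple calculations. The coefficient values are copied (already rounded) from the table above, and the example house is hypothetical, so the results are approximate.

```r
# Coefficient estimates copied from the summary(multiple_reg_model) output.
b_intercept <- -9.305e+05
b_year      <- 4.509e+02   # Year.Built
b_qual      <- 1.987e+04   # Overall.Qual
b_area      <- 5.778e+01   # Gr.Liv.Area
b_nridght   <- 7.150e+04   # NeighborhoodNridgHt

# Predicted price change for 100 extra square feet, everything else fixed.
extra_100_sqft <- 100 * b_area
print(extra_100_sqft)      # about 5778

# Predicted price change for a house built 10 years later, everything else fixed.
ten_years_newer <- 10 * b_year
print(ten_years_newer)     # about 4509

# Predicted price for a hypothetical house in NridgHt:
# built in 2000, overall quality 7, 1500 square feet above ground.
predicted_price <- b_intercept + b_year * 2000 + b_qual * 7 +
  b_area * 1500 + b_nridght
print(predicted_price)     # about 268560
```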
Explicitly check and discuss the following assumptions:
Multiple linear regression assumes a linear relationship between each predictor and the outcome variable. After fitting the model, the Residuals vs. Fitted plot should show points scattered randomly around 0 without a pattern. If we observe curvature, funnel shapes, or clustering, it suggests the relationship may not be linear or that transformations may be required. In our model, we inspect the plot to evaluate whether the residuals deviate from linearity.
Independence means that the residuals for one home should not depend on any other home in the dataset. Since the Ames Housing dataset consists of individual, unrelated home sales, independence is generally reasonable.
Homoscedasticity means that the residuals have constant variance across all fitted values. This is assessed with both the Residuals vs. Fitted plot and the Scale-Location plot. If the residuals spread out as fitted values increase (fan shape), the assumption is violated. If they remain evenly scattered, the assumption is met.
Normality is tested by examining the Normal Q–Q Plot. If residuals follow the diagonal line closely, the assumption is satisfied. Moderate deviations at the tails are common in real datasets; severe S-shaped curves suggest non-normality.
Multicollinearity occurs when predictors are highly correlated with each other. You check this using VIF (Variance Inflation Factor).
VIF < 5 → acceptable
VIF > 10 → problematic
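The VIF for a numeric predictor can be computed by hand: regress that predictor on all the other predictors and take 1 / (1 - R²). The sketch below does this in base R on simulated stand-in data (`toy` and the helper `vif_one()` are made-up names for illustration). A common alternative is `vif()` from the car package, which also reports a generalized VIF for factors such as Neighborhood.

```r
# Simulated stand-in data (made up); replace `toy` with houses_cleaned
# for the real analysis.
set.seed(2)
n <- 300
toy <- data.frame(
  Year.Built   = sample(1901:2010, n, replace = TRUE),
  Neighborhood = sample(c("NAmes", "Gilbert", "StoneBr"), n, replace = TRUE),
  Overall.Qual = sample(1:10, n, replace = TRUE),
  Gr.Liv.Area  = round(runif(n, 400, 4000))
)

# Hypothetical helper: VIF for one numeric predictor, found by regressing
# it on the other predictors and computing 1 / (1 - R^2).
vif_one <- function(target, others, data) {
  f  <- reformulate(others, response = target)
  r2 <- summary(lm(f, data = data))$r.squared
  1 / (1 - r2)
}

numeric_predictors <- c("Year.Built", "Overall.Qual", "Gr.Liv.Area")
all_predictors     <- c(numeric_predictors, "Neighborhood")

# One VIF per numeric predictor, each adjusted for all the others.
vifs <- sapply(numeric_predictors, function(p) {
  vif_one(p, setdiff(all_predictors, p), toy)
})
print(round(vifs, 2))
```

Because the simulated predictors are generated independently, the values here should sit close to 1; the real Ames predictors are likely to be more correlated.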
Include diagnostic plots (residuals vs fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage) and interpret them
Residuals vs Fitted:
plot(multiple_reg_model, which = 1)
Interpretation:
In this plot, the residuals generally cluster near the horizontal line at zero, but there is some mild curvature and spreading as fitted SalePrice values increase. This indicates that the relationship between the predictors (Year Built, Neighborhood, Overall Quality, and Above-Ground Living Area) and SalePrice is mostly linear but may include nonlinear components, especially for very high-priced homes. The wider spread at high fitted values also means the model predicts expensive houses less consistently than lower-priced ones. This is common in housing data, where luxury homes vary much more widely in price. The presence of a few extreme observations (e.g., labels 1734, 2137, 1467) suggests outliers that may influence the regression fit. Overall, the linear model is reasonable but not perfect: linearity is mostly met.
Normal Q-Q:
plot(multiple_reg_model, which = 2)
## Warning: not plotting observations with leverage one:
## 2734
Interpretation:
In the Q-Q plot, the middle portion of the residuals aligns fairly well with the theoretical normal line, indicating that most residuals are approximately normal. However, the tails show clear deviations: the lower tail dips below the line, and the upper tail rises sharply above it. This indicates a heavy-tailed distribution, meaning there are more extreme residuals than a normal distribution would produce. These likely come from homes with unusually high or low sale prices relative to what the model predicts based on Year Built, Neighborhood, Overall Quality, and Living Area. To conclude, we see some violation of normality in the tails, which suggests that some model estimates, especially confidence intervals, may be slightly less reliable.
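Scale-Location: this diagnostic is drawn with the same `plot()` call using `which = 3`, so for the model in this report the call would be `plot(multiple_reg_model, which = 3)`. The sketch below is a minimal, self-contained version on simulated stand-in data (`toy` and `toy_model` are made-up names), where the noise is deliberately made larger for bigger houses.

```r
# Simulated stand-in data (made up), built so that the residual spread
# grows with living area.
set.seed(3)
n <- 300
toy <- data.frame(
  Year.Built   = sample(1901:2010, n, replace = TRUE),
  Overall.Qual = sample(1:10, n, replace = TRUE),
  Gr.Liv.Area  = round(runif(n, 400, 4000))
)
toy$SalePrice <- -900000 + 450 * toy$Year.Built + 20000 * toy$Overall.Qual +
  58 * toy$Gr.Liv.Area + rnorm(n, sd = 10 * toy$Gr.Liv.Area)

toy_model <- lm(SalePrice ~ Year.Built + Overall.Qual + Gr.Liv.Area, data = toy)

# Scale-Location plot: sqrt(|standardized residuals|) against fitted values.
plot(toy_model, which = 3)

# The same quantity computed by hand, as a rough numeric check.
sqrt_abs_std_resid <- sqrt(abs(rstandard(toy_model)))
trend <- cor(fitted(toy_model), sqrt_abs_std_resid)
print(round(trend, 2))
```

A roughly flat line in the plot supports constant variance, while an upward trend (or a clearly positive correlation in the numeric check) suggests that the residual spread grows with the fitted price.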
Residuals vs Leverage:
plot(multiple_reg_model, which = 5)
## Warning: not plotting observations with leverage one:
## 2734
Interpretation:
Lastly, in this Residuals vs Leverage plot, we can see that most points fall within acceptable leverage and Cook’s Distance regions, suggesting the majority of homes do not disproportionately affect the model. However, a few labeled points (e.g., 1467, 2212, 2838) have higher leverage or unusually large residuals. These homes are likely outliers or high-influence cases, such as extremely old or newly built homes, unusually large houses, or homes in atypical neighborhoods within Ames. While the model overall is not dominated by outliers, these influential cases should be investigated to ensure they reflect real properties and not data errors.
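The labeled points can also be listed numerically with `hatvalues()` and `cooks.distance()`. The sketch below uses simulated stand-in data (`toy` and `toy_model` are made-up names) and deliberately puts one house alone in its neighborhood. This is one possible cause of a "leverage one" warning like the one above, because a neighborhood dummy that applies to a single row fits that row exactly; whether that is what happened with observation 2734 is an assumption that would need checking in the real data.

```r
# Simulated stand-in data (made up); swap in multiple_reg_model for the
# real analysis.
set.seed(4)
n <- 300
toy <- data.frame(
  Year.Built   = sample(1901:2010, n, replace = TRUE),
  Neighborhood = sample(c("NAmes", "Gilbert", "StoneBr"), n, replace = TRUE),
  Overall.Qual = sample(1:10, n, replace = TRUE),
  Gr.Liv.Area  = round(runif(n, 400, 4000)),
  stringsAsFactors = FALSE
)
# Make the last row the only house in its neighborhood.
toy$Neighborhood[n] <- "Landmrk"
toy$SalePrice <- -900000 + 450 * toy$Year.Built + 20000 * toy$Overall.Qual +
  58 * toy$Gr.Liv.Area + rnorm(n, sd = 35000)

toy_model <- lm(SalePrice ~ Year.Built + Neighborhood + Overall.Qual + Gr.Liv.Area,
                data = toy)

lev <- hatvalues(toy_model)        # leverage of each observation
cd  <- cooks.distance(toy_model)   # influence of each observation

# Rows whose leverage is (numerically) one: the model fits them exactly.
leverage_one <- which(lev > 0.999)
print(leverage_one)

# Rows above a common rule-of-thumb cutoff for Cook's distance (4 / n).
cutoff <- 4 / nrow(toy)
influential <- which(cd > cutoff)
print(head(sort(cd[influential], decreasing = TRUE)))
```

The 4 / n cutoff is only a rule of thumb, so rows it flags are candidates for a closer look rather than automatic removals.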
Overall, about 81% of the variation in home sale prices is explained by the combination of the year the house was built, its neighborhood, its overall quality, and its above-ground living area. This matched my prediction that these factors would account for a large share of the variation in house prices (because that is what I had researched). I did notice one recurring pattern across the diagnostic plots: the model fits luxury houses less well, which suggests that extreme values may be acting as outliers, potentially skewing results and giving high-end properties a disproportionate influence on the model. This emphasizes the importance of carefully considering outliers in real estate data analysis, as they can exaggerate relationships and affect the interpretation of predictors. As future steps for this project, I would fit a separate model for luxury properties or include additional market indicators to better account for extreme values and improve predictive accuracy, ultimately providing a more nuanced understanding of housing price dynamics.
These are links I used to understand the housing market: