```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
Dataset Description: The dataset used for this project is the Ames Housing dataset, which contains detailed information on 2930 homes in Ames, Iowa. The dataset has 82 columns representing various aspects such as lot size, building features, and sale prices.
The dataset is available on GitHub, and detailed documentation of its variables is available at the second link below:
https://github.com/leontoddjohnson/datasets/blob/main/data/ames.csv
https://jse.amstat.org/v19n3/decock/DataDocumentation.txt
# Load the Ames dataset
ames <- read.csv('D:/Stats for DS/ames.csv', header = TRUE)
dim(ames)
## [1] 2930 82
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.4.2
library(dplyr)
colnames(ames)
## [1] "Order" "PID" "MS.SubClass" "MS.Zoning"
## [5] "Lot.Frontage" "Lot.Area" "Street" "Alley"
## [9] "Lot.Shape" "Land.Contour" "Utilities" "Lot.Config"
## [13] "Land.Slope" "Neighborhood" "Condition.1" "Condition.2"
## [17] "Bldg.Type" "House.Style" "Overall.Qual" "Overall.Cond"
## [21] "Year.Built" "Year.Remod.Add" "Roof.Style" "Roof.Matl"
## [25] "Exterior.1st" "Exterior.2nd" "Mas.Vnr.Type" "Mas.Vnr.Area"
## [29] "Exter.Qual" "Exter.Cond" "Foundation" "Bsmt.Qual"
## [33] "Bsmt.Cond" "Bsmt.Exposure" "BsmtFin.Type.1" "BsmtFin.SF.1"
## [37] "BsmtFin.Type.2" "BsmtFin.SF.2" "Bsmt.Unf.SF" "Total.Bsmt.SF"
## [41] "Heating" "Heating.QC" "Central.Air" "Electrical"
## [45] "X1st.Flr.SF" "X2nd.Flr.SF" "Low.Qual.Fin.SF" "Gr.Liv.Area"
## [49] "Bsmt.Full.Bath" "Bsmt.Half.Bath" "Full.Bath" "Half.Bath"
## [53] "Bedroom.AbvGr" "Kitchen.AbvGr" "Kitchen.Qual" "TotRms.AbvGrd"
## [57] "Functional" "Fireplaces" "Fireplace.Qu" "Garage.Type"
## [61] "Garage.Yr.Blt" "Garage.Finish" "Garage.Cars" "Garage.Area"
## [65] "Garage.Qual" "Garage.Cond" "Paved.Drive" "Wood.Deck.SF"
## [69] "Open.Porch.SF" "Enclosed.Porch" "X3Ssn.Porch" "Screen.Porch"
## [73] "Pool.Area" "Pool.QC" "Fence" "Misc.Feature"
## [77] "Misc.Val" "Mo.Sold" "Yr.Sold" "Sale.Type"
## [81] "Sale.Condition" "SalePrice"
# Summary statistics
summary_stats <- ames %>%
summarise(
Avg_SalePrice = mean(SalePrice, na.rm = TRUE),
Median_SalePrice = median(SalePrice, na.rm = TRUE),
Avg_LotArea = mean(Lot.Area, na.rm = TRUE),
Median_GrLivArea = median(Gr.Liv.Area, na.rm = TRUE)
)
print(summary_stats)
## Avg_SalePrice Median_SalePrice Avg_LotArea Median_GrLivArea
## 1 180796.1 160000 10147.92 1442
The average sale price of homes in the dataset is about $180,796. This is the mean price across all 2,930 sales recorded for Ames, Iowa.
The median sale price is $160,000, the middle value when sale prices are sorted. The median is less sensitive to extreme values than the mean, and the fact that it sits about $20,800 below the mean already hints at a right-skewed distribution.
The average lot area is 10,147.92 square feet, which gives a sense of the typical land size of homes in Ames.
The median above-ground living area is 1,442 square feet: half of the homes have more above-ground living space than this and half have less. Basement area is not included in this measure.
# Visualization: SalePrice Distribution
ggplot(ames, aes(x = SalePrice)) +
geom_histogram(binwidth = 10000, fill = "blue", color = "white") +
labs(title = "Distribution of Sale Prices", x = "Sale Price ($)", y = "Count") +
theme_minimal()
The distribution of Sale Price in the dataset is right-skewed. Most homes have sale prices concentrated around the median of $160,000, but there are several high-value properties that pull the mean upward to $180,796.1.
Skewness: The positive skewness indicates that there are more homes priced in the lower range, with fewer expensive homes creating a long tail on the right side.
Spread: The prices range from $12,789 (minimum) to $755,000 (maximum), showing a wide variability in the housing market.
Outliers: The presence of luxury or high-priced homes likely contributes to the skewness and outlier effect in the distribution.
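The skewness described above can be put into a number. The following is a minimal sketch in base R, using a small made-up vector of prices (not Ames values) and a hypothetical helper named `sample_skewness`; it computes the moment-based skewness coefficient and checks that a long right tail pulls the mean above the median.

```r
# Toy prices in dollars: illustrative values only, NOT taken from the Ames data.
toy_prices <- c(95000, 120000, 135000, 150000, 160000, 175000, 210000, 480000)

# Hypothetical helper: moment-based skewness, m3 / m2^(3/2).
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

toy_mean   <- mean(toy_prices)
toy_median <- median(toy_prices)
toy_skew   <- sample_skewness(toy_prices)

# A positive coefficient signals a long right tail; the mean then exceeds the median.
cat("mean:", toy_mean, " median:", toy_median,
    " skewness:", round(toy_skew, 2), "\n")
```

With the data loaded as in this report, `sample_skewness(ames$SalePrice)` would give the corresponding figure for the real sale prices.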
# Add a new column for price per square foot
ames <- ames %>%
mutate(Price_Per_SqFt = SalePrice / Gr.Liv.Area)
# Visualization: Price Per SqFt Distribution
ggplot(ames, aes(x = Price_Per_SqFt)) +
geom_histogram(binwidth = 10, fill = "purple", color = "white") +
labs(title = "Price Per Square Foot Distribution", x = "Price Per Sq. Ft. ($)", y = "Count") +
theme_minimal()
Range: The values of Price Per Sq. Ft. range from approximately $20 to $500, with most values concentrated in the $80 to $150 range.
Distribution: The distribution is right-skewed, indicating a higher frequency of homes priced at lower rates per square foot, with a few outliers at the upper end representing luxury or premium properties.
Mean and Median:
The mean Price Per Sq. Ft. summarizes the average price paid per square foot of above-ground living area across all homes.
Given the right skew and the high-end outliers, the median Price Per Sq. Ft. is expected to fall below the mean, though neither value is computed above.
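The mean-versus-median expectation above is easy to check directly. Here is a small sketch under stated assumptions: the data frame `toy_homes` is invented for illustration (its values are not from the Ames data), and only the column names mirror the report.

```r
# Toy data frame: illustrative values only, NOT taken from the Ames data.
toy_homes <- data.frame(
  SalePrice   = c(120000, 150000, 165000, 180000, 240000, 600000),
  Gr.Liv.Area = c(1000, 1300, 1400, 1500, 1800, 2400)
)

# Same derived column as in the report: dollars per above-ground square foot.
toy_homes$Price_Per_SqFt <- toy_homes$SalePrice / toy_homes$Gr.Liv.Area

ppsf_mean   <- mean(toy_homes$Price_Per_SqFt, na.rm = TRUE)
ppsf_median <- median(toy_homes$Price_Per_SqFt, na.rm = TRUE)
ppsf_range  <- range(toy_homes$Price_Per_SqFt, na.rm = TRUE)

# One expensive home is enough to lift the mean above the median.
cat("mean:", round(ppsf_mean, 2), " median:", round(ppsf_median, 2),
    " range:", round(ppsf_range, 2), "\n")
```

Replacing `toy_homes` with `ames` would report the actual mean, median, and range instead of reading them off the histogram.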
# Boxplot: SalePrice by Overall.Cond (overall condition rating)
ggplot(ames, aes(x = factor(Overall.Cond), y = SalePrice)) +
geom_boxplot(fill = "orange") +
labs(title = "Sale Price by Overall Condition", x = "Overall Condition (1-10)", y = "Sale Price ($)") +
theme_minimal()
General Trend:
The x-axis variable, Overall.Cond, rates the overall condition of the house rather than renovation quality. The box plot is read here as showing that homes with higher overall condition ratings (closer to 10) tend to achieve higher sale prices.
Lower overall condition ratings (1–4) correspond to a wide range of sale prices, with a tendency toward lower medians.
Spread of Sale Prices:
For homes in poor condition (ratings 1–3), the sale prices show a wider interquartile range, indicating greater variability. This could reflect the impact of other factors (e.g., location or lot size).
For homes in very good condition (ratings 8–10), the prices are tightly clustered at the higher end, suggesting premium homes with consistent market value.
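A box-plot reading like the one above can be backed by a table of medians and interquartile ranges per rating. The sketch below assumes a small invented data frame, `toy_cond` (its values are illustrative and say nothing about the real Ames pattern), and uses base R's `tapply()` to build the per-group summary.

```r
# Toy data: illustrative values only, NOT taken from the Ames data.
toy_cond <- data.frame(
  Overall.Cond = c(3, 3, 3, 3, 5, 5, 5, 5, 8, 8, 8, 8),
  SalePrice    = c(60000, 90000, 110000, 150000,
                   130000, 150000, 170000, 200000,
                   140000, 155000, 165000, 180000)
)

# Median, IQR, and group size for each condition rating.
cond_median <- tapply(toy_cond$SalePrice, toy_cond$Overall.Cond, median)
cond_iqr    <- tapply(toy_cond$SalePrice, toy_cond$Overall.Cond, IQR)
cond_n      <- tapply(toy_cond$SalePrice, toy_cond$Overall.Cond, length)

cond_summary <- data.frame(
  Overall.Cond = names(cond_median),
  median       = as.numeric(cond_median),
  IQR          = as.numeric(cond_iqr),
  n            = as.integer(cond_n)
)
print(cond_summary)
```

Including the group size `n` matters in practice: a rating with only a handful of homes can produce a box that looks tight or wide by chance.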
# Price per square foot again, stored as Price_per_SqFt (same values as Price_Per_SqFt above)
ames <- ames %>%
mutate(Price_per_SqFt = SalePrice / Gr.Liv.Area)
# Bar Chart: Price per Square Foot by Neighborhood
ggplot(ames, aes(x = Neighborhood, y = Price_per_SqFt)) +
geom_bar(stat = "summary", fun = "mean", fill = "skyblue", color = "black") +
theme_minimal() +
labs(title = "Average Price per Square Foot by Neighborhood",
x = "Neighborhood",
y = "Price per Square Foot") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Bar Chart: Price per Square Foot by Home Style
ggplot(ames, aes(x = House.Style, y = Price_per_SqFt)) +
geom_bar(stat = "summary", fun = "mean", fill = "lightgreen", color = "black") +
theme_minimal() +
labs(title = "Average Price per Square Foot by Home Style",
x = "Home Style",
y = "Price per Square Foot") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Hypothesis 1:
Null Hypothesis (H₀): There is no significant correlation between Above Ground Living Area and Sale Price.
Alternative Hypothesis (H₁): There is a significant correlation between Above Ground Living Area and Sale Price.
# Correlation Test
correlation <- cor.test(ames$Gr.Liv.Area, ames$SalePrice, method = "pearson")
print(correlation)
##
## Pearson's product-moment correlation
##
## data: ames$Gr.Liv.Area and ames$SalePrice
## t = 54.061, df = 2928, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.6881814 0.7244502
## sample estimates:
## cor
## 0.7067799
Correlation Coefficient: 0.7068 (approximately 0.71), which indicates a strong positive linear correlation. This means that as the above-ground living area increases, the sale price tends to increase as well.
t-value: 54.061, which is very large for a t distribution with 2,928 degrees of freedom, supporting that the correlation is statistically significant.
p-value: < 2.2e-16, which is much smaller than 0.05. If the true correlation were zero, a sample correlation this large would be extremely unlikely.
Because the p-value is so small, we reject the null hypothesis (H₀) in favor of the alternative hypothesis (H₁).
This means there is a significant positive correlation between above-ground living area and sale price. The 95% confidence interval for the correlation runs from about 0.688 to 0.724.
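The test statistic and confidence interval reported by `cor.test()` can be reproduced by hand from the correlation and the degrees of freedom. The sketch below copies `r` and `df` from the output above; the t statistic uses the usual formula for a Pearson correlation, and the interval uses the Fisher z-transformation.

```r
# Values copied from the cor.test() output above.
r  <- 0.7067799
df <- 2928
n  <- df + 2                     # number of homes

# t statistic for H0: true correlation = 0.
t_stat  <- r * sqrt(df) / sqrt(1 - r^2)
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# 95% confidence interval via the Fisher z-transformation.
z          <- atanh(r)
half_width <- qnorm(0.975) / sqrt(n - 3)
ci         <- tanh(z + c(-1, 1) * half_width)

cat("t =", round(t_stat, 3), "\n")
cat("95% CI:", round(ci, 4), "\n")
```

With a sample this large the interval is narrow, so the estimate of roughly 0.71 is precise as well as significant.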
# Visualization
library(ggplot2)
ggplot(ames, aes(x = Gr.Liv.Area, y = SalePrice)) +
geom_point(color = "blue", alpha = 0.6) +
geom_smooth(method = "lm", color = "red") +
labs(title = "Living Area vs Sale Price", x = "Above Ground Living Area (sq. ft.)", y = "Sale Price ($)")
## `geom_smooth()` using formula = 'y ~ x'
Hypothesis 2:
Null Hypothesis (H0): “The mean Sale Price is the same in every neighborhood.”
Alternative Hypothesis (H1): “The mean Sale Price differs for at least one neighborhood.”
# ANOVA Test
anova_result <- aov(SalePrice ~ Neighborhood, data = ames)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## Neighborhood 27 1.072e+13 3.969e+11 144.4 <2e-16 ***
## Residuals 2902 7.977e+12 2.749e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Degrees of Freedom (Df): The 27 indicates that there are 28 different neighborhoods (since degrees of freedom is one less than the number of categories), and we are comparing their means.
Sum of Squares: The 1.072e+13 is the between-neighborhood sum of squares, the part of the variation in sale prices explained by neighborhood. The residual sum of squares, 7.977e+12, is the variation within neighborhoods.
Mean Square: The 3.969e+11 is the between-neighborhood sum of squares divided by its 27 degrees of freedom.
F-value: The 144.4 is the ratio of the neighborhood mean square to the residual mean square (2.749e+09). It is very high, indicating that the variation in sale prices between neighborhoods is much greater than the variation within neighborhoods.
p-value: The < 2e-16 is extremely small, far below the common threshold of 0.05, indicating that the differences in mean sale prices among neighborhoods are statistically significant.
Significant Result: The p-value is less than 0.05, so we reject the null hypothesis. This means the mean sale price differs for at least one neighborhood. The ANOVA alone does not identify which neighborhoods differ; that requires a follow-up pairwise comparison.
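The quantities in the ANOVA table are linked by simple arithmetic, which the sketch below reproduces from the printed sums of squares and degrees of freedom. Because the inputs are the rounded values from the table, the results agree with the output only up to rounding. The last line computes eta squared, the share of total variation attributable to neighborhood; it is a derived quantity that the table does not print.

```r
# Values copied (rounded) from the summary(anova_result) table above.
ss_between <- 1.072e13; df_between <- 27
ss_within  <- 7.977e12; df_within  <- 2902

# Mean square = sum of squares / degrees of freedom.
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F = between-group mean square / within-group mean square.
f_stat  <- ms_between / ms_within
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# Eta squared: share of total variation explained by neighborhood.
eta_sq <- ss_between / (ss_between + ss_within)

cat("F =", round(f_stat, 1), " eta squared =", round(eta_sq, 3), "\n")
```

By this calculation neighborhood alone accounts for a little over half of the variation in sale price.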
# Visualization
ggplot(ames, aes(x = Neighborhood, y = SalePrice)) +
geom_boxplot(fill = "lightblue", alpha = 0.7) +
labs(title = "Sale Prices by Neighborhood", x = "Neighborhood", y = "Sale Price ($)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Linear Regression Model
linear_model <- lm(SalePrice ~ Gr.Liv.Area + Neighborhood, data = ames)
summary(linear_model)
##
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + Neighborhood, data = ames)
##
## Residuals:
## Min 1Q Median 3Q Max
## -303568 -19919 217 16372 282805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 88061.907 8042.070 10.950 < 2e-16 ***
## Gr.Liv.Area 77.301 1.748 44.232 < 2e-16 ***
## NeighborhoodBlueste -34118.000 14934.572 -2.284 0.022414 *
## NeighborhoodBrDale -68662.344 10660.209 -6.441 1.38e-10 ***
## NeighborhoodBrkSide -58765.365 8598.824 -6.834 1.00e-11 ***
## NeighborhoodClearCr -14242.811 9814.249 -1.451 0.146822
## NeighborhoodCollgCr -1910.192 8051.238 -0.237 0.812476
## NeighborhoodCrawfor -13685.117 8654.379 -1.581 0.113920
## NeighborhoodEdwards -60627.084 8192.988 -7.400 1.78e-13 ***
## NeighborhoodGilbert -22712.460 8291.048 -2.739 0.006193 **
## NeighborhoodGreens 16012.638 16251.088 0.985 0.324546
## NeighborhoodGrnHill 83832.496 29659.771 2.826 0.004739 **
## NeighborhoodIDOTRR -77475.954 8742.175 -8.862 < 2e-16 ***
## NeighborhoodLandmrk -53099.367 41240.502 -1.288 0.198004
## NeighborhoodMeadowV -74762.719 10167.471 -7.353 2.51e-13 ***
## NeighborhoodMitchel -28490.466 8548.060 -3.333 0.000870 ***
## NeighborhoodNAmes -42841.774 7898.891 -5.424 6.32e-08 ***
## NeighborhoodNoRidge 50501.484 9236.312 5.468 4.95e-08 ***
## NeighborhoodNPkVill -43520.335 11407.097 -3.815 0.000139 ***
## NeighborhoodNridgHt 83788.250 8331.999 10.056 < 2e-16 ***
## NeighborhoodNWAmes -30198.864 8451.515 -3.573 0.000358 ***
## NeighborhoodOldTown -74763.259 8094.429 -9.236 < 2e-16 ***
## NeighborhoodSawyer -42760.011 8346.978 -5.123 3.21e-07 ***
## NeighborhoodSawyerW -27987.644 8479.672 -3.301 0.000977 ***
## NeighborhoodSomerst 17590.309 8233.557 2.136 0.032729 *
## NeighborhoodStoneBr 85490.761 9578.619 8.925 < 2e-16 ***
## NeighborhoodSWISU -79491.619 9644.751 -8.242 2.54e-16 ***
## NeighborhoodTimber 25994.152 9041.400 2.875 0.004070 **
## NeighborhoodVeenker 19600.093 11295.715 1.735 0.082815 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40520 on 2901 degrees of freedom
## Multiple R-squared: 0.7452, Adjusted R-squared: 0.7427
## F-statistic: 302.9 on 28 and 2901 DF, p-value: < 2.2e-16
# Visualizing Residuals to Evaluate Model
library(ggplot2)
ggplot(linear_model, aes(.fitted, .resid)) +
geom_point() +
geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
labs(title = "Residual Plot", x = "Fitted Values", y = "Residuals")
Significance of Variables:
Above Ground Living Area: The coefficient for Above Ground Living Area is 77.301 with a t-value of 44.232, and the p-value is < 2e-16. Holding neighborhood fixed, each additional square foot of above-ground living area is associated with an increase of about $77 in sale price.
Neighborhood: The model also includes neighborhood as a factor. Each neighborhood coefficient is an offset relative to the reference level, the one neighborhood missing from the coefficient table (Blmngtn, which sorts first alphabetically). A significant coefficient therefore means the neighborhood differs from that baseline at the same living area:
Significant Neighborhoods: Bluestem (Blueste), Briardale (BrDale), Brookside (BrkSide), Edwards, Iowa DOT and Rail Road (IDOTRR), Mitchell (Mitchel), Old Town, Stone Brook (StoneBr), and others have coefficients with p-values < 0.05, some positive and some negative.
Insignificant Neighborhoods: Clear Creek (ClearCr), College Creek (CollgCr), Crawford (Crawfor), Greens, Landmark (Landmrk), and Veenker have p-values > 0.05, suggesting that, after accounting for living area, their prices are not distinguishable from the baseline.
Model Performance:
R-squared: The Multiple R-squared value is 0.7452, meaning 74.5% of the variation in sale prices can be explained by the model, which includes both Above Ground Living Area and the neighborhood factor.
Adjusted R-squared: The Adjusted R-squared is 0.7427, indicating a good fit, accounting for the number of predictors in the model.
F-statistic: The F-statistic is 302.9, with a p-value < 2.2e-16, which shows that the overall model is statistically significant and that at least one of the predictors (Living Area or Neighborhood) has a significant relationship with the sale price.
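The three fit statistics above are tied together by standard formulas. The sketch below recomputes the adjusted R-squared and the F-statistic from the reported Multiple R-squared, the sample size, and the number of predictors (1 for living area plus 27 neighborhood indicators). Since it starts from the rounded R-squared of 0.7452, it matches the printed values only up to rounding.

```r
# Values taken from the summary(linear_model) output above.
r_squared <- 0.7452
n         <- 2930        # homes
p         <- 28          # predictors: Gr.Liv.Area + 27 neighborhood indicators
df_resid  <- n - p - 1   # 2901, as printed in the output

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_resid

# Overall F statistic: explained vs. unexplained variation per degree of freedom.
f_stat <- (r_squared / p) / ((1 - r_squared) / df_resid)

cat("adjusted R-squared:", round(adj_r_squared, 4), "\n")
cat("F statistic:", round(f_stat, 1), "\n")
```

The small gap between R-squared and adjusted R-squared reflects that 28 predictors is few relative to 2,930 observations.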
The results from the regression analysis provide valuable insights for both home buyers and investors:
Home Size (Living Area): The size of the home (above-ground living area) is a key determinant of sale price. Within a given neighborhood, each additional square foot is associated with about $77 more in sale price, which should be a central consideration for both buyers and sellers.
Neighborhood Influence: The neighborhood in which a home is located also plays a significant role in determining its price. At the same living area, some neighborhoods sell well above the baseline neighborhood (e.g., StoneBr, about $85,500 higher), while others sell below it (e.g., Blueste and BrDale, about $34,100 and $68,700 lower). The varying impact of neighborhood emphasizes the importance of location in the housing market.
Market Insights: The strong association between home size and price, along with the neighborhood effect, can guide investors in making decisions about where to buy properties and what sizes to focus on. Buyers can also use this information to assess whether a property is fairly priced based on its size and location.
Predictive Power: The model can be used to predict home sale prices based on size and neighborhood, helping buyers and investors make more informed decisions.
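As a rough illustration of that predictive use, the sketch below plugs the fitted coefficients into the model equation by hand. The helper `predict_price` and the 1,500 square foot example home are hypothetical. Treating Blmngtn as the baseline with an offset of 0 is an inference from its absence in the coefficient table, not something the output states directly.

```r
# Coefficients copied from the summary(linear_model) output above.
intercept  <- 88061.907
slope_area <- 77.301

# Neighborhood offsets relative to the baseline level.
# Assumption: Blmngtn is the reference level (it is absent from the table).
nbhd_offset <- c(Blmngtn = 0, StoneBr = 85490.761, OldTown = -74763.259)

# Hypothetical helper: point prediction for one home.
predict_price <- function(area, neighborhood) {
  intercept + slope_area * area + nbhd_offset[[neighborhood]]
}

# Same hypothetical 1,500 sq. ft. home in three neighborhoods.
price_stonebr <- predict_price(1500, "StoneBr")
price_oldtown <- predict_price(1500, "OldTown")
price_base    <- predict_price(1500, "Blmngtn")

cat("StoneBr:", price_stonebr, " OldTown:", price_oldtown,
    " baseline:", price_base, "\n")
```

These are point predictions only. The residual standard error of 40,520 reported above means an individual sale can easily differ from the prediction by tens of thousands of dollars. With the fitted object available, `predict(linear_model, newdata = ...)` does the same calculation for any neighborhood.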
In conclusion, both the size of the property and its neighborhood significantly affect the sale price. Investors and buyers can use this knowledge to make data-driven decisions, choosing properties that align with their budget, investment goals, and preferred locations.