```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```

Analysis of Ames Housing Data

Dataset Description: The dataset used for this project is the Ames Housing dataset, which contains detailed information on 2930 homes in Ames, Iowa. The dataset has 82 columns representing various aspects such as lot size, building features, and sale prices.

The dataset is available on GitHub, and detailed documentation is available at the links below:

https://github.com/leontoddjohnson/datasets/blob/main/data/ames.csv

https://jse.amstat.org/v19n3/decock/DataDocumentation.txt

Loading the dataset

# Load the Ames dataset
ames <- read.csv('D:/Stats for DS/ames.csv', header = TRUE)

dim(ames)
## [1] 2930   82

Loading necessary libraries

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(ggrepel)
## Warning: package 'ggrepel' was built under R version 4.4.2
library(dplyr)

Descriptive Analysis of Key Variables

colnames(ames)
##  [1] "Order"           "PID"             "MS.SubClass"     "MS.Zoning"      
##  [5] "Lot.Frontage"    "Lot.Area"        "Street"          "Alley"          
##  [9] "Lot.Shape"       "Land.Contour"    "Utilities"       "Lot.Config"     
## [13] "Land.Slope"      "Neighborhood"    "Condition.1"     "Condition.2"    
## [17] "Bldg.Type"       "House.Style"     "Overall.Qual"    "Overall.Cond"   
## [21] "Year.Built"      "Year.Remod.Add"  "Roof.Style"      "Roof.Matl"      
## [25] "Exterior.1st"    "Exterior.2nd"    "Mas.Vnr.Type"    "Mas.Vnr.Area"   
## [29] "Exter.Qual"      "Exter.Cond"      "Foundation"      "Bsmt.Qual"      
## [33] "Bsmt.Cond"       "Bsmt.Exposure"   "BsmtFin.Type.1"  "BsmtFin.SF.1"   
## [37] "BsmtFin.Type.2"  "BsmtFin.SF.2"    "Bsmt.Unf.SF"     "Total.Bsmt.SF"  
## [41] "Heating"         "Heating.QC"      "Central.Air"     "Electrical"     
## [45] "X1st.Flr.SF"     "X2nd.Flr.SF"     "Low.Qual.Fin.SF" "Gr.Liv.Area"    
## [49] "Bsmt.Full.Bath"  "Bsmt.Half.Bath"  "Full.Bath"       "Half.Bath"      
## [53] "Bedroom.AbvGr"   "Kitchen.AbvGr"   "Kitchen.Qual"    "TotRms.AbvGrd"  
## [57] "Functional"      "Fireplaces"      "Fireplace.Qu"    "Garage.Type"    
## [61] "Garage.Yr.Blt"   "Garage.Finish"   "Garage.Cars"     "Garage.Area"    
## [65] "Garage.Qual"     "Garage.Cond"     "Paved.Drive"     "Wood.Deck.SF"   
## [69] "Open.Porch.SF"   "Enclosed.Porch"  "X3Ssn.Porch"     "Screen.Porch"   
## [73] "Pool.Area"       "Pool.QC"         "Fence"           "Misc.Feature"   
## [77] "Misc.Val"        "Mo.Sold"         "Yr.Sold"         "Sale.Type"      
## [81] "Sale.Condition"  "SalePrice"
# Summary statistics
summary_stats <- ames %>%
    summarise(
        Avg_SalePrice = mean(SalePrice, na.rm = TRUE),
        Median_SalePrice = median(SalePrice, na.rm = TRUE),
        Avg_LotArea = mean(Lot.Area, na.rm = TRUE),
        Median_GrLivArea = median(Gr.Liv.Area, na.rm = TRUE)
    )
print(summary_stats)
##   Avg_SalePrice Median_SalePrice Avg_LotArea Median_GrLivArea
## 1      180796.1           160000    10147.92             1442

The average sale price of homes in the dataset is $180,796.10. This is the mean price across all 2,930 sales recorded for Ames, Iowa.

The median sale price is $160,000.00, the middle value of home sale prices when arranged in order. The median is less sensitive to extreme outliers than the mean.

The average lot area of properties in the dataset is 10,147.92 square feet, providing insight into the typical land size of homes in Ames. Larger lot areas suggest more spacious properties.

The median above-ground living area is 1,442 square feet, indicating that half of the homes in the dataset have livable spaces larger than this, while the other half are smaller. This excludes any basement areas.

# Visualization: SalePrice Distribution
ggplot(ames, aes(x = SalePrice)) +
    geom_histogram(binwidth = 10000, fill = "blue", color = "white") +
    labs(title = "Distribution of Sale Prices", x = "Sale Price ($)", y = "Count") +
    theme_minimal()

The distribution of Sale Price in the dataset is right-skewed. Most homes have sale prices concentrated around the median of $160,000, but several high-value properties pull the mean upward to $180,796.10.

  • Skewness: The positive skewness indicates that there are more homes priced in the lower range, with fewer expensive homes creating a long tail on the right side.

  • Spread: The prices range from $12,789 (minimum) to $755,000 (maximum), showing a wide variability in the housing market.

  • Outliers: The presence of luxury or high-priced homes likely contributes to the skewness and outlier effect in the distribution.
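The skew described above can be quantified rather than read off the histogram. The sketch below computes a moment-based sample skewness in base R. The `sample_skewness` helper and the toy `prices` vector are illustrative assumptions, not part of the original analysis; applying the helper to `ames$SalePrice` would give the figure for the real data.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 1.5.
# Hypothetical helper, written for illustration only.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Toy prices (made up): mostly mid-range homes plus one expensive home
prices <- c(120000, 135000, 150000, 160000, 165000, 180000, 450000)

skew_value <- sample_skewness(prices)
mean_above_median <- mean(prices) > median(prices)

print(skew_value)         # positive for a right-skewed sample
print(mean_above_median)  # TRUE when the long right tail pulls the mean up
```

A positive value is consistent with a long right tail; a value near zero would indicate a roughly symmetric distribution.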

Price Per Square Foot Analysis

# Add a new column for price per square foot
ames <- ames %>%
    mutate(Price_Per_SqFt = SalePrice / Gr.Liv.Area)

# Visualization: Price Per SqFt Distribution
ggplot(ames, aes(x = Price_Per_SqFt)) +
    geom_histogram(binwidth = 10, fill = "purple", color = "white") +
    labs(title = "Price Per Square Foot Distribution", x = "Price Per Sq. Ft. ($)", y = "Count") +
    theme_minimal()

  • Range: The values of Price Per Sq. Ft. range from approximately $20 to $500, with most values concentrated in the $80 to $150 range.

  • Distribution: The distribution is right-skewed, indicating a higher frequency of homes priced at lower rates per square foot, with a few outliers at the upper end representing luxury or premium properties.

  • Mean and Median:

    • The mean Price Per Sq. Ft. summarizes the average price paid per square foot of above-ground living area across all homes.

    • The median Price Per Sq. Ft. is expected to be lower than the mean, because the few high outliers in a right-skewed distribution pull the mean upward.
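The mean and median discussed above can be computed directly instead of inferred from the histogram. The following is a minimal sketch on a made-up `toy` data frame that reuses the column names from the analysis; the values are invented for illustration and are not taken from the Ames data.

```r
# Toy stand-in for the ames data frame (values are made up)
toy <- data.frame(
  SalePrice   = c(105000, 172000, 189900, 215000, 538000),
  Gr.Liv.Area = c(896, 1329, 1629, 2110, 2402)
)

# Same derived column as in the analysis
toy$Price_Per_SqFt <- toy$SalePrice / toy$Gr.Liv.Area

ppsf_mean   <- mean(toy$Price_Per_SqFt, na.rm = TRUE)
ppsf_median <- median(toy$Price_Per_SqFt, na.rm = TRUE)
ppsf_range  <- range(toy$Price_Per_SqFt, na.rm = TRUE)

print(c(mean = ppsf_mean, median = ppsf_median))
print(ppsf_range)
```

Replacing `toy` with `ames` would report the actual mean, median, and range, which is a more reliable basis for the bullets above than visual inspection.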

Impact of Renovation/Condition on Sale Price

# Boxplot: SalePrice by Overall.Cond (overall condition rating)
ggplot(ames, aes(x = factor(Overall.Cond), y = SalePrice)) +
    geom_boxplot(fill = "orange") +
    labs(title = "Sale Price by Overall Condition", x = "Overall Condition (1-10)", y = "Sale Price ($)") +
    theme_minimal()

  • General Trend:

    • The box plot suggests that homes with higher overall condition ratings (closer to 10) tend to achieve higher sale prices. Note that Overall.Cond rates the overall condition of the house, not renovation quality specifically.

    • Lower condition ratings (1–4) correspond to a wide range of sale prices, with a tendency toward lower medians.

  • Spread of Sale Prices:

    • For homes in poor condition (ratings 1–3), the sale prices show a wider interquartile range, indicating greater variability. This could reflect the impact of other factors (e.g., location or lot size).

    • For homes in very good condition (ratings 8–10), the prices are tightly clustered at the higher end, suggesting premium homes with consistent market value.
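The medians and interquartile ranges read off the box plot can be tabulated per condition rating, which makes the trend claims easier to check. The sketch below uses base R on a made-up `toy` data frame; the values are arbitrary and say nothing about the real Ames pattern.

```r
# Toy stand-in for the ames data frame (values are made up)
toy <- data.frame(
  Overall.Cond = c(3, 3, 5, 5, 5, 7, 7, 9),
  SalePrice    = c(90000, 110000, 185000, 210000, 240000,
                   150000, 170000, 200000)
)

# Count, median, and IQR of sale price within each condition rating.
# tapply() orders groups by the sorted rating values.
cond_summary <- data.frame(
  Overall.Cond = sort(unique(toy$Overall.Cond)),
  n      = as.vector(tapply(toy$SalePrice, toy$Overall.Cond, length)),
  median = as.vector(tapply(toy$SalePrice, toy$Overall.Cond, median)),
  iqr    = as.vector(tapply(toy$SalePrice, toy$Overall.Cond, IQR))
)

print(cond_summary)
```

Including the group size `n` matters here: ratings with only a handful of homes can produce narrow or wide boxes for reasons unrelated to condition.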

Price per Square Foot by Neighborhood

# Calculate Price per Square Foot
# (same values as Price_Per_SqFt created earlier; note the different capitalization)
ames <- ames %>% 
  mutate(Price_per_SqFt = SalePrice / Gr.Liv.Area)

# Bar Chart: Price per Square Foot by Neighborhood
ggplot(ames, aes(x = Neighborhood, y = Price_per_SqFt)) +
  geom_bar(stat = "summary", fun = "mean", fill = "skyblue", color = "black") +
  theme_minimal() +
  labs(title = "Average Price per Square Foot by Neighborhood",
       x = "Neighborhood",
       y = "Price per Square Foot") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Price per Square Foot by Home Style

# Bar Chart: Price per Square Foot by Home Style
ggplot(ames, aes(x = House.Style, y = Price_per_SqFt)) +
  geom_bar(stat = "summary", fun = "mean", fill = "lightgreen", color = "black") +
  theme_minimal() +
  labs(title = "Average Price per Square Foot by Home Style",
       x = "Home Style",
       y = "Price per Square Foot") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Hypothesis Tests

Hypothesis 1:

Null Hypothesis (H₀): There is no significant correlation between Above Ground Living Area and Sale Price.

Alternative Hypothesis (H₁): There is a significant correlation between Above Ground Living Area and Sale Price.

Pearson’s Correlation Test

# Correlation Test
correlation <- cor.test(ames$Gr.Liv.Area, ames$SalePrice, method = "pearson")
print(correlation)
## 
##  Pearson's product-moment correlation
## 
## data:  ames$Gr.Liv.Area and ames$SalePrice
## t = 54.061, df = 2928, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6881814 0.7244502
## sample estimates:
##       cor 
## 0.7067799
  • Correlation Coefficient: 0.7068 (approximately 0.71), which indicates a strong positive correlation. This means that as the above-ground living area increases, the sale price tends to increase as well.

  • t-value: 54.061, which is very high, supporting that the correlation is statistically significant.

  • p-value: < 2.2e-16, which is much smaller than 0.05, meaning the correlation is highly significant (if the true correlation were zero, a result this extreme would be practically impossible to observe by chance).

  • Because the p-value is so small, we reject the null hypothesis (H₀) in favor of the alternative hypothesis (H₁).

  • This means there is a significant positive correlation between Above Ground Living Area and Sale Price.
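The t-value and confidence interval reported above follow from the correlation coefficient and the sample size alone. The sketch below reproduces them from the printed values, using the standard t formula for a Pearson correlation and the Fisher z-transform for the interval; the variable names are illustrative.

```r
r <- 0.7067799   # sample correlation printed by cor.test()
n <- 2930        # number of homes, so df = n - 2 = 2928

# t statistic for H0: true correlation equals 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transform
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(t_stat)
print(ci)
```

With close to 3,000 observations the standard error is small, which is why even a moderate correlation would produce a large t-value here; the strength of the relationship is carried by `r` itself, not by the t-value.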

Living Area vs Sale Price

# Visualization
library(ggplot2)
ggplot(ames, aes(x = Gr.Liv.Area, y = SalePrice)) +
  geom_point(color = "blue", alpha = 0.6) +
  geom_smooth(method = "lm", color = "red") +
  labs(title = "Living Area vs Sale Price", x = "Above Ground Living Area (sq. ft.)", y = "Sale Price ($)")
## `geom_smooth()` using formula = 'y ~ x'

Hypothesis 2:

Null Hypothesis (H₀): “The mean Sale Price is the same across all neighborhoods.”

Alternative Hypothesis (H₁): “The mean Sale Price differs for at least one neighborhood.”

ANOVA Test

# ANOVA Test
anova_result <- aov(SalePrice ~ Neighborhood, data = ames)
summary(anova_result)
##                Df    Sum Sq   Mean Sq F value Pr(>F)    
## Neighborhood   27 1.072e+13 3.969e+11   144.4 <2e-16 ***
## Residuals    2902 7.977e+12 2.749e+09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • Degrees of Freedom (Df): The 27 indicates that there are 28 different neighborhoods (since degrees of freedom is one less than the number of categories), and we are comparing their means.

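  • Sum of Squares: The 1.072e+13 is the between-group sum of squares, the variation in sale prices explained by differences between neighborhood means. The residual sum of squares (7.977e+12) is the variation within neighborhoods.

  • Mean Square: The 3.969e+11 is the between-group sum of squares divided by its degrees of freedom (27).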

  • F-value: The 144.4 is very high, indicating that the variation in sale prices between neighborhoods is much greater than the variation within neighborhoods.

  • p-value: The < 2e-16 is extremely small, far below the common threshold of 0.05, indicating that the difference in mean sale prices between neighborhoods is statistically significant.

  • Significant Result: The p-value is less than 0.05, so we reject the null hypothesis. This means at least one neighborhood has a mean sale price that differs significantly from the others; the test does not identify which neighborhoods differ.
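The quantities in the ANOVA table are linked by simple arithmetic, which the sketch below reproduces from the printed (rounded) values. It also derives eta squared, the share of total variation attributable to neighborhood; that statistic is not in the original output and is added here as an illustrative effect-size measure.

```r
# Values copied from the ANOVA table (rounded as printed)
ss_between <- 1.072e13; df_between <- 27
ss_within  <- 7.977e12; df_within  <- 2902

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of between-group to within-group mean squares
f_value <- ms_between / ms_within

# Upper-tail probability of the F distribution gives the p-value
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Eta squared: proportion of total variation explained by neighborhood
eta_sq <- ss_between / (ss_between + ss_within)

print(c(F = f_value, eta_sq = eta_sq))
```

Because the inputs are rounded, the recomputed F agrees with the table only to about three significant figures. To find out which specific neighborhoods differ, a follow-up such as `TukeyHSD(anova_result)` would be one option.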

Neighborhood Analysis

# Visualization
ggplot(ames, aes(x = Neighborhood, y = SalePrice)) +
  geom_boxplot(fill = "lightblue", alpha = 0.7) +
  labs(title = "Sale Prices by Neighborhood", x = "Neighborhood", y = "Sale Price ($)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Linear Regression Model
linear_model <- lm(SalePrice ~ Gr.Liv.Area + Neighborhood, data = ames)
summary(linear_model)
## 
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area + Neighborhood, data = ames)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -303568  -19919     217   16372  282805 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          88061.907   8042.070  10.950  < 2e-16 ***
## Gr.Liv.Area             77.301      1.748  44.232  < 2e-16 ***
## NeighborhoodBlueste -34118.000  14934.572  -2.284 0.022414 *  
## NeighborhoodBrDale  -68662.344  10660.209  -6.441 1.38e-10 ***
## NeighborhoodBrkSide -58765.365   8598.824  -6.834 1.00e-11 ***
## NeighborhoodClearCr -14242.811   9814.249  -1.451 0.146822    
## NeighborhoodCollgCr  -1910.192   8051.238  -0.237 0.812476    
## NeighborhoodCrawfor -13685.117   8654.379  -1.581 0.113920    
## NeighborhoodEdwards -60627.084   8192.988  -7.400 1.78e-13 ***
## NeighborhoodGilbert -22712.460   8291.048  -2.739 0.006193 ** 
## NeighborhoodGreens   16012.638  16251.088   0.985 0.324546    
## NeighborhoodGrnHill  83832.496  29659.771   2.826 0.004739 ** 
## NeighborhoodIDOTRR  -77475.954   8742.175  -8.862  < 2e-16 ***
## NeighborhoodLandmrk -53099.367  41240.502  -1.288 0.198004    
## NeighborhoodMeadowV -74762.719  10167.471  -7.353 2.51e-13 ***
## NeighborhoodMitchel -28490.466   8548.060  -3.333 0.000870 ***
## NeighborhoodNAmes   -42841.774   7898.891  -5.424 6.32e-08 ***
## NeighborhoodNoRidge  50501.484   9236.312   5.468 4.95e-08 ***
## NeighborhoodNPkVill -43520.335  11407.097  -3.815 0.000139 ***
## NeighborhoodNridgHt  83788.250   8331.999  10.056  < 2e-16 ***
## NeighborhoodNWAmes  -30198.864   8451.515  -3.573 0.000358 ***
## NeighborhoodOldTown -74763.259   8094.429  -9.236  < 2e-16 ***
## NeighborhoodSawyer  -42760.011   8346.978  -5.123 3.21e-07 ***
## NeighborhoodSawyerW -27987.644   8479.672  -3.301 0.000977 ***
## NeighborhoodSomerst  17590.309   8233.557   2.136 0.032729 *  
## NeighborhoodStoneBr  85490.761   9578.619   8.925  < 2e-16 ***
## NeighborhoodSWISU   -79491.619   9644.751  -8.242 2.54e-16 ***
## NeighborhoodTimber   25994.152   9041.400   2.875 0.004070 ** 
## NeighborhoodVeenker  19600.093  11295.715   1.735 0.082815 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40520 on 2901 degrees of freedom
## Multiple R-squared:  0.7452, Adjusted R-squared:  0.7427 
## F-statistic: 302.9 on 28 and 2901 DF,  p-value: < 2.2e-16
# Visualizing Residuals to Evaluate Model
library(ggplot2)
ggplot(linear_model, aes(.fitted, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Residual Plot", x = "Fitted Values", y = "Residuals")

  1. Significance of Variables:

    • Above Ground Living Area: The coefficient for Above Ground Living Area is 77.301 with a t-value of 44.232, and the p-value is < 2e-16. Holding neighborhood constant, each additional square foot of above-ground living area is associated with an increase of about $77.30 in sale price.

    • Neighborhood: The model also includes neighborhood as a factor. Each neighborhood coefficient is measured relative to the baseline neighborhood omitted from the coefficient table (Blmngtn). Some neighborhoods differ significantly from the baseline, while others do not:

      • Significant Neighborhoods: Neighborhoods such as Bluestem (Blueste), Briardale (BrDale), Brookside (BrkSide), Edwards, Iowa DOT and Rail Road (IDOTRR), Mitchell (Mitchel), Old Town, Stone Brook (StoneBr), and others show significant positive or negative effects on sale prices, with p-values < 0.05.

      • Insignificant Neighborhoods: Neighborhoods like Clear Creek (ClearCr), Greens, and Landmark (Landmrk) have p-values > 0.05, suggesting that, after accounting for living area, their sale prices do not differ significantly from the baseline.

  2. Model Performance:

    • R-squared: The Multiple R-squared value is 0.7452, meaning 74.5% of the variation in sale prices can be explained by the model, which includes both Above Ground Living Area and the neighborhood factor.

    • Adjusted R-squared: The Adjusted R-squared is 0.7427, indicating a good fit, accounting for the number of predictors in the model.

    • F-statistic: The F-statistic is 302.9, with a p-value < 2.2e-16, which shows that the overall model is statistically significant and that at least one of the predictors (Living Area or Neighborhood) has a significant relationship with the sale price.
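The model-fit statistics above are related to one another, and the adjusted R-squared and F-statistic can be recovered from the Multiple R-squared, the sample size, and the number of predictors. The sketch below does this with the printed values; the variable names are illustrative.

```r
r_squared <- 0.7452   # Multiple R-squared from summary(linear_model)
n <- 2930             # number of observations
p <- 28               # 1 slope for Gr.Liv.Area + 27 neighborhood dummies

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

# Overall F statistic on p and n - p - 1 degrees of freedom
f_stat <- (r_squared / p) / ((1 - r_squared) / (n - p - 1))

print(c(adj_r_squared = adj_r_squared, f_stat = f_stat))
```

The recomputed F differs slightly from the reported 302.9 because R-squared is rounded to four decimals. The small gap between R-squared and adjusted R-squared reflects that 28 predictors is few relative to 2,930 observations.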

Conclusion:

The results from the regression analysis provide valuable insights for both home buyers and investors:

  • Home Size (Living Area): The size of the home (above ground living area) is a key determinant of sale price. Larger homes tend to sell for significantly higher prices, which should be a central consideration for both buyers and sellers.

  • Neighborhood Influence: The neighborhood in which a home is located also plays a significant role in determining its price. After controlling for living area, some neighborhoods are associated with much higher prices (e.g., Stone Brook (StoneBr)), while others have lower prices (e.g., Bluestem, Briardale, Brookside). The varying impact of neighborhood emphasizes the importance of location in the housing market.

  • Market Insights: The strong correlation between home size and price, along with the neighborhood effect, can guide investors in making decisions about where to buy properties and what sizes to focus on. Buyers can also use this information to assess whether a property is fairly priced based on its size and location.

  • Predictive Power: The model can be used to predict home sale prices based on size and neighborhood, helping buyers and investors make more informed decisions.
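As a rough illustration of the predictive use described above, a prediction can be assembled by hand from the fitted coefficients: intercept, plus the living-area slope times the area, plus the neighborhood offset. The sketch below copies three coefficients from the regression output; `predict_price` is a hypothetical helper, and treating Blmngtn as the baseline is inferred from its absence from the coefficient table.

```r
# Coefficients copied from summary(linear_model)
intercept  <- 88061.907
slope_area <- 77.301

# Offsets relative to the baseline neighborhood (assumed to be Blmngtn,
# the level omitted from the coefficient table)
neighborhood_effect <- c(Blmngtn = 0,
                         NridgHt = 83788.250,
                         OldTown = -74763.259)

# Hypothetical helper: point prediction for one home
predict_price <- function(living_area, neighborhood) {
  unname(intercept + slope_area * living_area +
           neighborhood_effect[neighborhood])
}

print(predict_price(1500, "NridgHt"))
print(predict_price(1500, "OldTown"))
```

With the fitted model object available, `predict(linear_model, newdata = data.frame(Gr.Liv.Area = 1500, Neighborhood = "NridgHt"))` should return the same point estimate, and the residual standard error of about $40,520 gives a sense of how far individual sales can fall from such predictions.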

In conclusion, both the size of the property and its neighborhood are significantly associated with the sale price. Investors and buyers can use this knowledge to make data-driven decisions, choosing properties that align with their budget, investment goals, and preferred locations.