```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
# Load required libraries
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Read the dataset (replace with the correct path)
ames <- read.csv('D:/Stats for DS/ames.csv', header = TRUE)
Response Variable: The sale price of the house (SalePrice) is the variable of primary interest in this dataset, as it directly matters to both buyers and sellers in the real estate market.
Explanatory Variable: For the ANOVA test, we will use the Neighborhood column, which indicates the location of the house. We hypothesize that different neighborhoods may have different average sale prices due to factors such as proximity to services, quality of life, and community amenities.
Null Hypothesis (H₀): The mean SalePrice is the same across all neighborhoods.
Alternative Hypothesis (H₁): At least one neighborhood has a different mean SalePrice.
# Ensure Neighborhood has fewer than 10 categories, consolidate if needed
# Group less common neighborhoods into 'Other'
ames$Neighborhood <- ifelse(
  ames$Neighborhood %in% c("NAmes", "CollgCr", "OldTown", "Edwards", "Somerst"),
  as.character(ames$Neighborhood),  # as.character() guards against factor codes
  "Other"
)
# ANOVA test: SalePrice by Neighborhood
anova_result <- aov(SalePrice ~ Neighborhood, data = ames)
# Display the ANOVA result
summary(anova_result)
##                Df    Sum Sq   Mean Sq F value Pr(>F)
## Neighborhood    5 2.749e+12 5.498e+11   100.8 <2e-16 ***
## Residuals    2924 1.594e+13 5.453e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value (< 2e-16) is far below 0.05, we reject the null hypothesis, with F(5, 2924) = 100.8. There is sufficient evidence to conclude that mean SalePrice differs for at least one of the six neighborhood groups (the five named neighborhoods plus "Other").
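The figures in the ANOVA table can be cross-checked by hand. The sketch below uses only the values printed in the table to recompute the F statistic and its p-value. It also computes eta-squared as an effect size, which is an addition not present in the original analysis. The variable names are illustrative, and small discrepancies are expected because the printed sums of squares are rounded.

```r
# Values copied from the ANOVA table (rounded as printed)
ss_between <- 2.749e12   # Sum Sq, Neighborhood
ss_within  <- 1.594e13   # Sum Sq, Residuals
df_between <- 5
df_within  <- 2924

# F = between-group mean square / within-group mean square
f_stat <- (ss_between / df_between) / (ss_within / df_within)

# Upper-tail probability under the F distribution
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

# Eta-squared: share of total variation in SalePrice attributable to the grouping
eta_sq <- ss_between / (ss_between + ss_within)

round(f_stat, 1)
round(eta_sq, 3)
```

Eta-squared comes out near 0.147, so the neighborhood grouping accounts for roughly 15% of the variation in SalePrice. The difference in means is statistically clear, but much of the variation lies within groups.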
For potential buyers or sellers, this means that location matters significantly in real estate pricing. Understanding that certain neighborhoods are likely to command higher prices can guide strategic decisions about purchasing or listing a property. For example, if a buyer is considering homes in different neighborhoods, this finding emphasizes the need to research neighborhood amenities, schools, and overall desirability as these can affect property value.
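The ANOVA only shows that at least one group mean differs; it does not say which neighborhoods differ from which. A common follow-up is Tukey's HSD test, sketched here on simulated data because the Ames file is not bundled with this report. The group labels, means, and the `toy` data frame are hypothetical.

```r
set.seed(1)

# Simulated stand-in for the recoded data: three hypothetical groups
toy <- data.frame(
  Neighborhood = factor(rep(c("A", "B", "C"), each = 50)),
  SalePrice = c(rnorm(50, mean = 150000, sd = 20000),
                rnorm(50, mean = 200000, sd = 20000),
                rnorm(50, mean = 205000, sd = 20000))
)

toy_fit <- aov(SalePrice ~ Neighborhood, data = toy)

# Pairwise differences in group means with family-wise adjusted p-values
tukey <- TukeyHSD(toy_fit)
tukey$Neighborhood
```

On the report's data the equivalent call would presumably be `TukeyHSD(anova_result)`, ideally after converting Neighborhood to a factor with `factor()`.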
Explanatory Variable for Regression: We will use Gr.Liv.Area (Above Ground Living Area in square feet) for the linear regression model. It is expected to have a linear relationship with SalePrice.
To ensure a roughly linear relationship, we can visualize the data:
# Visualize the relationship between Gr.Liv.Area and SalePrice
ggplot(ames, aes(x = Gr.Liv.Area, y = SalePrice)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Relationship between Above Ground Living Area and Sale Price",
       x = "Above Ground Living Area (sq ft)",
       y = "Sale Price ($)")
## `geom_smooth()` using formula = 'y ~ x'
# Build a simple linear regression model: SalePrice ~ Gr.Liv.Area
linear_model <- lm(SalePrice ~ Gr.Liv.Area, data = ames)
# Display the model summary
summary(linear_model)
##
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area, data = ames)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -483467  -30219   -1966   22728  334323
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13289.634   3269.703   4.064 4.94e-05 ***
## Gr.Liv.Area   111.694      2.066  54.061  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56520 on 2928 degrees of freedom
## Multiple R-squared: 0.4995, Adjusted R-squared: 0.4994
## F-statistic: 2923 on 1 and 2928 DF, p-value: < 2.2e-16
Intercept (β₀):
Value: 13,289.634
Interpretation: If Gr.Liv.Area (above-ground living area) were 0 square feet, the model would predict a SalePrice of approximately $13,290. A house with no living area is not realistic, so the intercept serves mainly as a reference point for the fitted line rather than a meaningful price.
Gr.Liv.Area Coefficient (β₁):
Value: 111.694
Interpretation: Each additional square foot of above-ground living area is associated with an increase of approximately $111.69 in predicted sale price, on average. Because Gr.Liv.Area is the only predictor in this model, the estimate does not hold any other factors constant. The coefficient is positive and highly significant (p < 2e-16), suggesting that larger homes command higher prices.
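The two coefficients define the fitted line SalePrice = β₀ + β₁ × Gr.Liv.Area. As a worked illustration, the sketch below plugs the estimates from the summary into that line. The helper name `predict_price` and the example areas (1,500 and 1,700 sq ft) are hypothetical choices, not part of the original analysis.

```r
# Coefficients copied from the model summary
beta0 <- 13289.634   # intercept
beta1 <- 111.694     # slope for Gr.Liv.Area

# Hypothetical helper: fitted SalePrice for a given living area (sq ft)
predict_price <- function(area_sqft) {
  beta0 + beta1 * area_sqft
}

# Fitted price for a 1,500 sq ft home
price_1500 <- predict_price(1500)
price_1500

# Fitted difference between two homes 200 sq ft apart (200 * beta1)
price_gap <- predict_price(1700) - predict_price(1500)
price_gap
```

With the fitted model object available, `predict(linear_model, newdata = data.frame(Gr.Liv.Area = 1500))` should give the same fitted value up to rounding of the printed coefficients.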
The results suggest that above-ground living area is strongly associated with sale price, which is useful information for both buyers and sellers in the real estate market. For sellers, added living space tends to go with higher prices, although this model alone does not show that an extension would pay for itself. Buyers should consider living area when budgeting for a home purchase.
The R-squared value indicates the proportion of the variance in the response variable (SalePrice) explained by the explanatory variable (Gr.Liv.Area).
Interpretation of R-squared:
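A higher R-squared value (closer to 1) signifies a better fit for the model. Here the multiple R-squared is 0.4995, so Gr.Liv.Area alone explains about half of the variance in SalePrice.
Because roughly half of the variance remains unexplained, other variables (such as Neighborhood, as the ANOVA above indicates) likely also influence SalePrice.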
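The definition of R-squared can be checked numerically. The sketch below uses simulated data, since the Ames file is not bundled with this report. The variables `area` and `price` and their generating parameters are assumptions chosen only to resemble the setting. It computes R-squared as 1 - RSS/TSS and compares it with the value reported by `summary()`.

```r
set.seed(42)

# Simulated stand-in data; these numbers are not from the Ames file
area  <- runif(200, min = 500, max = 3000)
price <- 13000 + 110 * area + rnorm(200, sd = 55000)

fit <- lm(price ~ area)

rss <- sum(residuals(fit)^2)          # variation left unexplained by the model
tss <- sum((price - mean(price))^2)   # total variation in the response
r2_manual <- 1 - rss / tss

# Both comparisons should return TRUE
all.equal(r2_manual, summary(fit)$r.squared)
# In simple linear regression, R-squared equals the squared correlation
all.equal(r2_manual, cor(area, price)^2)
```

By the same identity, the reported R-squared of 0.4995 corresponds to a correlation of about 0.71 between Gr.Liv.Area and SalePrice.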
The linear regression model shows that larger houses (in terms of above-ground living area) tend to sell for higher prices.
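For Sellers: Additional above-ground living area is associated with higher sale prices, so renovations or extensions may enhance property value. The model does not account for the cost of such work.
For Buyers: Homes with larger living areas are likely to command higher prices, roughly $112 per additional square foot under this model. This should be factored into budgeting considerations. Whether more living space yields better returns in the future is not something this model can show, since it describes sale prices only.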