```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```

# Load required libraries
library(ggplot2)
library(dplyr)
# Read the dataset (replace with the correct path)
ames <- read.csv('D:/Stats for DS/ames.csv', header = TRUE)

1. Selecting the Response Variable

Response Variable: SalePrice, the sale price of the house, is the outcome of greatest practical interest in this dataset, as it directly affects both buyers and sellers in the real estate market.

2. Selecting the Categorical Explanatory Variable

Explanatory Variable: For the ANOVA test, we will use the Neighborhood column, which indicates the location of the house. We hypothesize that different neighborhoods may have different average sale prices due to factors such as proximity to services, quality of life, and community amenities.

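3. Formulating the Hypothesis

Null hypothesis (H0): The mean SalePrice is the same across all neighborhood groups.

Alternative hypothesis (H1): At least one neighborhood group has a mean SalePrice that differs from the others.

The test uses a significance level of 0.05.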

4. Performing ANOVA Test

# Ensure Neighborhood has fewer than 10 categories, consolidate if needed
# Group less common neighborhoods into 'Other'
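ames$Neighborhood <- ifelse(ames$Neighborhood %in% c("NAmes", "CollgCr", "OldTown", "Edwards", "Somerst"),
                            as.character(ames$Neighborhood), "Other")
# Store as a factor so aov() and post-hoc tools treat it as a grouping variable
ames$Neighborhood <- factor(ames$Neighborhood)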

# ANOVA test: SalePrice by Neighborhood
anova_result <- aov(SalePrice ~ Neighborhood, data = ames)

# Display the ANOVA result
summary(anova_result)
##                Df    Sum Sq   Mean Sq F value Pr(>F)    
## Neighborhood    5 2.749e+12 5.498e+11   100.8 <2e-16 ***
## Residuals    2924 1.594e+13 5.453e+09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

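Since the p-value (< 2e-16) is far below 0.05, we reject the null hypothesis of equal mean sale prices (F = 100.8 on 5 and 2924 degrees of freedom). There is sufficient evidence that the mean SalePrice differs for at least one of the six neighborhood groups (the five named neighborhoods plus "Other"). The ANOVA alone does not say which groups differ from which.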
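A significant F test is usually followed by a post-hoc comparison to see which groups differ. The sketch below uses Tukey's HSD from base R on a small simulated data frame; the `toy` data and its group means are hypothetical stand-ins, not values from the Ames data. On the real data the equivalent call would be `TukeyHSD(anova_result)`, assuming Neighborhood is stored as a factor.

```r
# Simulated stand-in for the ames data (hypothetical values, illustration only)
set.seed(42)
toy <- data.frame(
  Neighborhood = factor(rep(c("NAmes", "CollgCr", "Other"), each = 30)),
  SalePrice = c(rnorm(30, mean = 145000, sd = 20000),
                rnorm(30, mean = 200000, sd = 20000),
                rnorm(30, mean = 180000, sd = 20000))
)

# One-way ANOVA, then Tukey's HSD for all pairwise group comparisons
toy_aov <- aov(SalePrice ~ Neighborhood, data = toy)
tukey <- TukeyHSD(toy_aov, "Neighborhood", conf.level = 0.95)

# Each row is one pair of groups: difference in means, CI bounds, adjusted p-value
print(tukey)
```

Pairs whose confidence interval excludes zero are the ones whose mean prices differ at the chosen level.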

Implications for Buyers and Sellers:

For potential buyers or sellers, this means that location matters significantly in real estate pricing. Understanding that certain neighborhoods are likely to command higher prices can guide strategic decisions about purchasing or listing a property. For example, if a buyer is considering homes in different neighborhoods, this finding emphasizes the need to research neighborhood amenities, schools, and overall desirability as these can affect property value.
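Statistical significance does not by itself say how much location matters. One common effect-size measure is eta-squared, the share of total variation in SalePrice attributable to the grouping. The sketch below computes it from the sums of squares printed in the ANOVA table above; the variable names are illustrative.

```r
# Sums of squares and mean squares copied from the ANOVA output above
ss_neighborhood <- 2.749e12
ss_residuals    <- 1.594e13
ms_neighborhood <- 5.498e11
ms_residuals    <- 5.453e09

# Eta-squared: between-group variation as a share of total variation
eta_squared <- ss_neighborhood / (ss_neighborhood + ss_residuals)
round(eta_squared, 3)   # about 0.147

# Sanity check: the F statistic is the ratio of the two mean squares
f_value <- ms_neighborhood / ms_residuals
round(f_value, 1)       # about 100.8, matching the table
```

By this measure the six-level neighborhood grouping accounts for roughly 15% of the variation in sale price, so location is clearly relevant but leaves most of the variation to other factors.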

5. Selecting a Continuous Explanatory Variable

Explanatory Variable for Regression: We will use Gr.Liv.Area (Above Ground Living Area in square feet) for the linear regression model. It is expected to have a linear relationship with SalePrice.

Confirming Linear Relationship:

To ensure a roughly linear relationship, we can visualize the data:

# Visualize the relationship between Gr.Liv.Area and SalePrice
ggplot(ames, aes(x = Gr.Liv.Area, y = SalePrice)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Relationship between Above Ground Living Area and Sale Price",
       x = "Above Ground Living Area (sq ft)",
       y = "Sale Price ($)")
## `geom_smooth()` using formula = 'y ~ x'
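A scatter plot with a fitted line can hide curvature or unequal spread, so a residuals-versus-fitted plot is a common complementary check. The sketch below uses base R on simulated data; the `toy` data frame and the coefficients used to generate it are hypothetical. On the real data the same plot could be drawn from a model fitted as `lm(SalePrice ~ Gr.Liv.Area, data = ames)`.

```r
# Simulated stand-in for the ames data (hypothetical values, illustration only)
set.seed(1)
toy <- data.frame(Gr.Liv.Area = runif(200, min = 500, max = 3500))
toy$SalePrice <- 13000 + 110 * toy$Gr.Liv.Area + rnorm(200, sd = 50000)

# Fit the simple linear model and plot residuals against fitted values
toy_fit <- lm(SalePrice ~ Gr.Liv.Area, data = toy)
plot(fitted(toy_fit), resid(toy_fit),
     xlab = "Fitted sale price ($)", ylab = "Residual ($)",
     main = "Residuals vs fitted values")
abline(h = 0, lty = 2)  # residuals should scatter evenly around this line

# Pearson correlation as a numeric summary of linear association
r <- cor(toy$Gr.Liv.Area, toy$SalePrice)
```

A curved band or a funnel shape in this plot would suggest that a straight line, or constant error variance, is not a good description of the data.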

6. Building the Linear Regression Model

# Build a simple linear regression model: SalePrice ~ Gr.Liv.Area
linear_model <- lm(SalePrice ~ Gr.Liv.Area, data = ames)

# Display the model summary
summary(linear_model)
## 
## Call:
## lm(formula = SalePrice ~ Gr.Liv.Area, data = ames)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -483467  -30219   -1966   22728  334323 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13289.634   3269.703   4.064 4.94e-05 ***
## Gr.Liv.Area   111.694      2.066  54.061  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56520 on 2928 degrees of freedom
## Multiple R-squared:  0.4995, Adjusted R-squared:  0.4994 
## F-statistic:  2923 on 1 and 2928 DF,  p-value: < 2.2e-16

The slope estimate indicates that each additional square foot of above-ground living area is associated with an increase of about $111.69 in sale price, and the relationship is highly significant (p < 2e-16). Because the data are observational, this describes an association across houses rather than a guaranteed return on added space. For sellers, more living space tends to go with higher prices, while buyers should consider living area when budgeting for a home purchase.
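To make the fitted equation concrete, the sketch below plugs living areas into SalePrice = 13289.634 + 111.694 × Gr.Liv.Area, using the coefficients printed in the model summary. The helper name `predict_price` and the example areas are illustrative assumptions.

```r
# Coefficients copied from the model summary above
intercept <- 13289.634
slope     <- 111.694

# Hypothetical helper: point prediction of sale price from living area
predict_price <- function(area_sqft) intercept + slope * area_sqft

price_1500 <- predict_price(1500)                     # 180830.634
price_gap  <- predict_price(2000) - predict_price(1500)  # 55847 for 500 extra sq ft

# With the fitted model object, the same prediction plus an interval would be:
# predict(linear_model, newdata = data.frame(Gr.Liv.Area = 1500),
#         interval = "prediction")
```

The point prediction ignores uncertainty; with a residual standard error of 56,520, individual sale prices can sit far from the fitted line.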

7. Evaluating the Fit of the Model

Model Fit:

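The R-squared value indicates the proportion of the variance in the response variable (SalePrice) explained by the explanatory variable (Gr.Liv.Area). Here the Multiple R-squared is 0.4995, so living area alone accounts for roughly half of the variance in SalePrice. The residual standard error of 56,520 means that typical prediction errors are on the order of $56,500.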

  • Interpretation of R-squared:

    • A higher R-squared value (closer to 1) signifies a better fit for the model.

    • If R-squared is low, it suggests that other variables may also significantly influence SalePrice.
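For a simple regression with one predictor, R-squared can be recovered from the F statistic and its residual degrees of freedom, and its square root is the absolute correlation between the two variables. The sketch below checks the reported figures against each other; the variable names are illustrative.

```r
# Values copied from the model summary above
f_stat      <- 2923
df_model    <- 1
df_residual <- 2928
r2_reported <- 0.4995

# R-squared implied by the F statistic: (df1 * F) / (df1 * F + df2)
r2_from_f <- (df_model * f_stat) / (df_model * f_stat + df_residual)
round(r2_from_f, 3)   # about 0.500, agreeing with 0.4995 up to rounding of F

# With one predictor, |correlation| is the square root of R-squared
abs_r <- sqrt(r2_reported)
round(abs_r, 3)       # about 0.707

# Share of variance left unexplained by living area alone
unexplained <- 1 - r2_reported   # 0.5005
```

About half of the variance remains unexplained, which is consistent with other variables, such as the neighborhood grouping tested earlier, also influencing SalePrice.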

8. Conclusion for Regression Model

Interpretation:

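The linear regression model shows that larger houses (in terms of above-ground living area) tend to sell for higher prices: each additional square foot is associated with an estimated $111.69 increase in sale price, and living area explains about 50% of the variance in SalePrice.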

Recommendations:

  • For Sellers: Consider increasing the above-ground living area through renovations or extensions to enhance property value.

  • For Buyers: Homes with larger living areas are likely to command higher prices. This should be factored into budgeting considerations, as investing in homes with more living space can yield better returns in the future.