```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```

1. Dataset Description

The dataset used for this project is the Ames Housing dataset, which contains detailed information on 2,930 homes in Ames, Iowa. It has 82 columns covering aspects such as lot size, building features, and sale prices.

The dataset is hosted on GitHub, and detailed documentation is available separately. Both links are below:

https://github.com/leontoddjohnson/datasets/blob/main/data/ames.csv

https://jse.amstat.org/v19n3/decock/DataDocumentation.txt

# Load necessary libraries
library(ggplot2)
library(dplyr)
# Load the dataset
ames <- read.csv('D:/Stats for DS/ames.csv', header = TRUE)

2. Build at Least Two Pairs of Numeric Variables

# Create new variable for Price per Square Foot
ames <- ames %>%
  mutate(Price_per_SqFt = SalePrice / Gr.Liv.Area)

Now we have:

  1. SalePrice vs Gr.Liv.Area

  2. SalePrice vs Price per Square Foot

3. Plot Visualizations for Each Relationship

Plot 1: SalePrice vs Gr.Liv.Area

# Scatter plot for SalePrice vs Above Ground Living Area
ggplot(ames, aes(x = Gr.Liv.Area, y = SalePrice)) +
  geom_point(color = "blue", alpha = 0.6) +
  labs(title = "Sale Price vs Above Ground Living Area", 
       x = "Above Ground Living Area (sq ft)", 
       y = "Sale Price ($)") +
  theme_minimal()

Plot 2: SalePrice vs Price per Square Foot

# Scatter plot for SalePrice vs Price per Square Foot
ggplot(ames, aes(x = Price_per_SqFt, y = SalePrice)) +
  geom_point(color = "green", alpha = 0.6) +
  labs(title = "Sale Price vs Price per Square Foot", 
       x = "Price per Square Foot", 
       y = "Sale Price ($)") +
  theme_minimal()

4. Correlation Coefficients

Correlation between SalePrice and Gr.Liv.Area

# Calculate correlation for SalePrice and Gr.Liv.Area
cor(ames$SalePrice, ames$Gr.Liv.Area, use = "complete.obs")
## [1] 0.7067799

Correlation between SalePrice and Price_per_SqFt

# Calculate correlation for SalePrice and Price per Square Foot
cor(ames$SalePrice, ames$Price_per_SqFt, use = "complete.obs")
## [1] 0.614454
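
The coefficients above are point estimates. A minimal sketch of how one might attach a confidence interval and a p-value to a Pearson correlation, using `cor.test()` from base R, follows. The `area` and `price` vectors are simulated stand-ins with made-up parameters (assumptions for illustration, not the Ames values); with the real data one would pass `ames$SalePrice` and `ames$Gr.Liv.Area` instead.

```r
# Simulated stand-in data (NOT the Ames values): living area and a noisy linear price
set.seed(42)
area  <- rnorm(200, mean = 1500, sd = 500)
price <- 50000 + 100 * area + rnorm(200, mean = 0, sd = 40000)

# Pearson correlation with a test of H0: rho = 0 and a 95% confidence interval
ct <- cor.test(price, area, method = "pearson", conf.level = 0.95)

ct$estimate   # sample correlation, the same value cor(price, area) returns
ct$conf.int   # 95% confidence interval for the correlation
ct$p.value    # p-value for the null hypothesis of zero correlation
```

A narrow interval that excludes zero would support describing the relationship as statistically significant, which the point estimate alone cannot do.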

5. Confidence Intervals

Building a confidence interval for the SalePrice variable.

# Calculate the mean and confidence interval for SalePrice
mean_saleprice <- mean(ames$SalePrice, na.rm = TRUE)
sd_saleprice <- sd(ames$SalePrice, na.rm = TRUE)
n <- sum(!is.na(ames$SalePrice))  # count only the non-missing values used by mean() and sd()

# Calculate the 95% confidence interval
error_margin <- qt(0.975, df = n-1) * (sd_saleprice / sqrt(n))
ci_lower <- mean_saleprice - error_margin
ci_upper <- mean_saleprice + error_margin

cat("95% Confidence Interval for SalePrice: [", ci_lower, ", ", ci_upper, "]\n")
## 95% Confidence Interval for SalePrice: [ 177902.3 ,  183689.9 ]
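
As a cross-check on the manual calculation, base R's `t.test()` reports the same t-based interval for a mean. The sketch below uses a simulated vector `x` (an assumption for illustration, not the SalePrice column) and compares the two approaches.

```r
# Simulated stand-in for a price column (NOT the Ames data)
set.seed(1)
x <- rlnorm(500, meanlog = 12, sdlog = 0.4)

# Manual 95% t-interval, following the same formula as above
n      <- sum(!is.na(x))
margin <- qt(0.975, df = n - 1) * sd(x, na.rm = TRUE) / sqrt(n)
manual <- c(mean(x, na.rm = TRUE) - margin, mean(x, na.rm = TRUE) + margin)

# Built-in equivalent: a one-sample t.test returns the interval in $conf.int
builtin <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

all.equal(manual, builtin)  # the two intervals should agree up to floating-point error
```

With the real data, `t.test(ames$SalePrice)$conf.int` would be the one-line equivalent; `t.test()` drops missing values itself.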

6. Conclusions

1. Sale Price vs Above Ground Living Area (Gr.Liv.Area)

Visualization: The scatter plot for Sale Price vs Gr.Liv.Area shows a positive linear relationship. As the living area increases, the sale price tends to increase. However, there are a few outliers where smaller homes have higher sale prices, likely due to other features such as lot size, neighborhood, or home quality.

  • Outliers: A few homes with much larger living areas do not follow the trend and have lower sale prices than expected. These may need further investigation.

  • Correlation: The correlation coefficient between SalePrice and Gr.Liv.Area is 0.71, indicating a strong positive linear relationship.

  • Significance: Homes with larger living areas tend to command higher prices, which makes living area an important explanatory variable for home value.
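
One way to follow up on the outliers noted above is to fit a simple linear regression and flag points with large standardized residuals. The sketch below is illustrative only: the data are simulated, the two oversized, under-priced homes are hypothetical, and the cutoff of 3 is a common rule of thumb rather than anything prescribed by the dataset.

```r
# Simulated stand-in data (NOT the Ames values)
set.seed(7)
area  <- runif(300, min = 600, max = 3000)
price <- 40000 + 110 * area + rnorm(300, sd = 30000)

# Append two hypothetical very large, under-priced homes
area  <- c(area, 5000, 5600)
price <- c(price, 180000, 160000)

fit <- lm(price ~ area)   # simple linear regression of price on area
coef(fit)["area"]         # estimated price change per additional square foot

# Flag points whose standardized residual exceeds 3 in absolute value
flagged <- which(abs(rstandard(fit)) > 3)
data.frame(area = area[flagged], price = price[flagged])
```

With the real data, the equivalent fit would be `lm(SalePrice ~ Gr.Liv.Area, data = ames)`. Flagged rows are candidates for inspection, not automatic removal.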

2. Sale Price vs Price per Square Foot

Visualization: The scatter plot of Sale Price vs Price per Square Foot reveals a weaker relationship compared to living area. While homes with higher sale prices tend to have higher prices per square foot, the relationship is not as strong. This suggests that factors other than just square footage influence the total price, such as the quality of the construction or the neighborhood.

  • Outliers: Some very expensive homes have lower price per square foot, indicating that these homes may have other desirable attributes, such as location or lot size.

  • Correlation: The correlation between SalePrice and Price_per_SqFt is 0.61, a moderately strong positive relationship, though weaker than the 0.71 for living area. Because Price_per_SqFt is computed from SalePrice, part of this correlation is built in, and it does not fully explain the variability in sale price.

  • Significance: Price per square foot is an important factor but may not be as significant as the overall living area in determining the sale price.
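
Because `Price_per_SqFt` is defined as `SalePrice / Gr.Liv.Area`, some of its correlation with `SalePrice` arises by construction. The following sketch illustrates this with simulated data in which area and the per-square-foot rate are drawn independently; all names and parameters are assumptions for illustration, not the Ames values.

```r
# Simulated stand-in data (NOT the Ames values): area and rate are independent
set.seed(123)
area  <- runif(1000, min = 800, max = 3000)
rate  <- rnorm(1000, mean = 120, sd = 25)   # hypothetical price per square foot
price <- area * rate

cor(area, rate)           # near zero, since the two were generated independently
cor(price, price / area)  # clearly positive, because price / area is just `rate`
                          # and price was built as area * rate
```

This is why a correlation between a variable and a ratio derived from it should be read with caution: it does not by itself show that one influences the other.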

3. Confidence Interval for Sale Price

Confidence Interval: The calculated 95% confidence interval for the mean sale price is approximately [$177,902, $183,690]. We are 95% confident that the true population mean sale price of homes in Ames falls within this interval.

  • Significance: The confidence interval quantifies the uncertainty around the estimated mean sale price. It is narrow because the sample is large (2,930 homes); it describes the precision of the mean, not the range of individual home prices.

  • Further Questions:

    • Are there significant differences in sale prices across different neighborhoods?

    • How do additional features (e.g., quality of construction, age of the home) influence sale price beyond living area?