```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
```
Dataset Description: The dataset used for this project is the Ames Housing dataset, which contains detailed information on 2930 homes in Ames, Iowa. The dataset has 82 columns representing various aspects such as lot size, building features, and sale prices.
The dataset is hosted on GitHub (first link below), and detailed documentation is available at the second link:
https://github.com/leontoddjohnson/datasets/blob/main/data/ames.csv
https://jse.amstat.org/v19n3/decock/DataDocumentation.txt
# Load necessary libraries
library(ggplot2)
library(dplyr)
# Load the dataset
ames <- read.csv('D:/Stats for DS/ames.csv', header = TRUE)
Response Variable: SalePrice
Explanatory Variables: Gr.Liv.Area (above-ground living area), Lot.Area, and a newly created variable, Price_per_SqFt (price per square foot).
# Create new variable for Price per Square Foot
ames <- ames %>%
  mutate(Price_per_SqFt = SalePrice / Gr.Liv.Area)
This gives two pairs of variables to examine:
SalePrice vs Gr.Liv.Area
SalePrice vs Price per Square Foot
# Scatter plot for SalePrice vs Above Ground Living Area
ggplot(ames, aes(x = Gr.Liv.Area, y = SalePrice)) +
  geom_point(color = "blue", alpha = 0.6) +
  labs(title = "Sale Price vs Above Ground Living Area",
       x = "Above Ground Living Area (sq ft)",
       y = "Sale Price ($)") +
  theme_minimal()
# Scatter plot for SalePrice vs Price per Square Foot
ggplot(ames, aes(x = Price_per_SqFt, y = SalePrice)) +
  geom_point(color = "green", alpha = 0.6) +
  labs(title = "Sale Price vs Price per Square Foot",
       x = "Price per Square Foot",
       y = "Sale Price ($)") +
  theme_minimal()
# Calculate correlation for SalePrice and Gr.Liv.Area
cor(ames$SalePrice, ames$Gr.Liv.Area, use = "complete.obs")
## [1] 0.7067799
# Calculate correlation for SalePrice and Price per Square Foot
cor(ames$SalePrice, ames$Price_per_SqFt, use = "complete.obs")
## [1] 0.614454
Next, we build a 95% confidence interval for the mean of the SalePrice variable.
# Calculate the mean and confidence interval for SalePrice
mean_saleprice <- mean(ames$SalePrice, na.rm = TRUE)
sd_saleprice <- sd(ames$SalePrice, na.rm = TRUE)
n <- sum(!is.na(ames$SalePrice))  # count only non-missing prices, consistent with na.rm = TRUE above
# Calculate the 95% confidence interval
error_margin <- qt(0.975, df = n-1) * (sd_saleprice / sqrt(n))
ci_lower <- mean_saleprice - error_margin
ci_upper <- mean_saleprice + error_margin
cat("95% Confidence Interval for SalePrice: [", ci_lower, ", ", ci_upper, "]\n")
## 95% Confidence Interval for SalePrice: [ 177902.3 , 183689.9 ]
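The manual calculation above can be cross-checked against base R's `t.test()`, which reports the same t-based interval for a mean. The sketch below is only an illustration: `sale_price` is a simulated stand-in for `ames$SalePrice` (not the real data), so the numbers it prints will not match the interval reported above.

```r
# Simulated stand-in for ames$SalePrice (NOT the real data), with one missing value
set.seed(42)
sale_price <- c(rnorm(200, mean = 180000, sd = 80000), NA)

# Manual 95% interval, counting only non-missing observations
x <- sale_price[!is.na(sale_price)]
n <- length(x)
margin <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
manual_ci <- c(mean(x) - margin, mean(x) + margin)

# Built-in equivalent: t.test() drops NAs and returns the same interval
builtin_ci <- as.numeric(t.test(sale_price, conf.level = 0.95)$conf.int)

print(manual_ci)
print(builtin_ci)
print(all.equal(manual_ci, builtin_ci))
```

With the real data, replacing `sale_price` with `ames$SalePrice` should reproduce the interval computed by hand.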
Visualization: The scatter plot for Sale Price vs Gr.Liv.Area shows a positive linear relationship. As the living area increases, the sale price tends to increase. However, there are a few outliers where smaller homes have higher sale prices, likely due to other features such as lot size, neighborhood, or home quality.
Outliers: A few homes with much larger living areas do not follow the trend and have lower sale prices than expected. These may need further investigation.
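One possible way to follow up on these outliers is to fit a simple linear trend and flag large homes whose standardized residuals are strongly negative. The sketch below runs on a simulated data frame, `homes`, that only mimics the pattern described above (it is not the Ames data). The cutoffs of 4000 sq ft and -3 standard deviations are illustrative assumptions, not values taken from this analysis.

```r
# Simulated stand-in for the ames data frame (NOT the real data)
set.seed(1)
area <- round(runif(300, 500, 3500))
homes <- data.frame(
  Gr.Liv.Area = area,
  SalePrice   = 20000 + 110 * area + rnorm(300, sd = 25000)
)

# Append two very large, cheaply sold homes to mimic the pattern described above
homes <- rbind(
  homes,
  data.frame(Gr.Liv.Area = c(5000, 5600), SalePrice = c(180000, 160000))
)

# Fit a simple linear trend and compute standardized residuals
fit <- lm(SalePrice ~ Gr.Liv.Area, data = homes)
homes$std_resid <- rstandard(fit)

# Large homes priced far below the trend line (both thresholds are illustrative)
flagged <- subset(homes, Gr.Liv.Area > 4000 & std_resid < -3)
print(flagged)
```

On the real data, the same steps would use `ames` in place of `homes`. The flagged rows could then be inspected for sale conditions or other features that explain the low price.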
Correlation: The correlation coefficient between SalePrice and Gr.Liv.Area is about 0.71, indicating a strong positive linear relationship.
Significance: Homes with larger living areas tend to command higher prices, so living area is a practically important explanatory variable for home value.
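If a formal check of this relationship is wanted, base R's `cor.test()` gives a confidence interval and a p-value for the correlation. The sketch below is a minimal illustration on simulated data: the `homes` data frame is a stand-in with the same column names as `ames`, not the real dataset, so its output says nothing about the actual Ames correlation.

```r
# Simulated stand-in for the ames data frame (NOT the real data)
set.seed(7)
area <- runif(250, 600, 3500)
homes <- data.frame(
  Gr.Liv.Area = area,
  SalePrice   = 15000 + 105 * area + rnorm(250, sd = 55000)
)

# Pearson correlation test; the null hypothesis is that the true correlation is zero
result <- cor.test(homes$SalePrice, homes$Gr.Liv.Area, method = "pearson")

print(result$estimate)  # sample correlation r
print(result$conf.int)  # 95% confidence interval for the correlation
print(result$p.value)   # p-value for H0: correlation = 0
```

Running the same call on `ames$SalePrice` and `ames$Gr.Liv.Area` would show whether the observed correlation is statistically distinguishable from zero.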
Visualization: The scatter plot of Sale Price vs Price per Square Foot reveals a weaker relationship compared to living area. While homes with higher sale prices tend to have higher prices per square foot, the relationship is not as strong. This suggests that factors other than just square footage influence the total price, such as the quality of the construction or the neighborhood.
Outliers: Some very expensive homes have lower price per square foot, indicating that these homes may have other desirable attributes, such as location or lot size.
Correlation: The correlation between SalePrice and Price_per_SqFt is about 0.61, which is moderately strong but weaker than the correlation with living area. Because Price_per_SqFt is computed from SalePrice, part of this correlation is built in by construction, and it does not fully explain the variability in sale price.
Significance: Price per square foot is informative, but it is less strongly associated with sale price than overall living area is.
Confidence Interval: The calculated 95% confidence interval for the mean sale price is approximately [$177,902, $183,690]. We are 95% confident that the true population mean sale price of homes in Ames falls within this range.
Significance: The confidence interval quantifies the uncertainty around the estimated mean sale price. It describes the mean, not the spread of individual sale prices, so it should not be read as a range of typical home values.
Further Questions:
Are there significant differences in sale prices across different neighborhoods?
How do additional features (e.g., quality of construction, age of the home) influence sale price beyond size and lot area?
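As a starting point for these questions, one option is a one-way ANOVA across neighborhoods and a nested-model F-test for the added features. The sketch below uses a simulated data frame, `homes`, not the real Ames data. The column names `Neighborhood`, `Overall.Qual`, and `Year.Built` are assumptions about how the relevant columns appear after `read.csv()`, and the results carry no information about actual Ames prices.

```r
# Simulated stand-in for the ames data frame (NOT the real data).
# Column names Neighborhood, Overall.Qual and Year.Built are assumed.
set.seed(3)
n <- 300
homes <- data.frame(
  Neighborhood = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  Gr.Liv.Area  = runif(n, 600, 3500),
  Lot.Area     = runif(n, 3000, 20000),
  Overall.Qual = sample(1:10, n, replace = TRUE),
  Year.Built   = sample(1900:2010, n, replace = TRUE)
)
homes$SalePrice <- 80 * homes$Gr.Liv.Area + 2 * homes$Lot.Area +
  15000 * homes$Overall.Qual + 300 * (homes$Year.Built - 1900) +
  rnorm(n, sd = 30000)

# Question 1: do mean sale prices differ across neighborhoods? (one-way ANOVA)
neigh_fit <- aov(SalePrice ~ Neighborhood, data = homes)
print(summary(neigh_fit))

# Question 2: do quality and age add explanatory power beyond size and lot area?
size_fit <- lm(SalePrice ~ Gr.Liv.Area + Lot.Area, data = homes)
full_fit <- lm(SalePrice ~ Gr.Liv.Area + Lot.Area + Overall.Qual + Year.Built,
               data = homes)
model_cmp <- anova(size_fit, full_fit)  # F-test comparing the nested models
print(model_cmp)
```

A small p-value in the second table would suggest that quality and age explain variation in sale price that size and lot area alone do not.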