To begin, we load the necessary libraries and the dataset. We then examine the first few rows of the data to get an overview of its structure.
# Load the data
data <- read.csv("https://raw.githubusercontent.com/hawa1983/DATA605/main/train.csv")
# Display the first few rows of the data
head(data)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1 1 60 RL 65 8450 Pave <NA> Reg Lvl
## 2 2 20 RL 80 9600 Pave <NA> Reg Lvl
## 3 3 60 RL 68 11250 Pave <NA> IR1 Lvl
## 4 4 70 RL 60 9550 Pave <NA> IR1 Lvl
## 5 5 60 RL 84 14260 Pave <NA> IR1 Lvl
## 6 6 50 RL 85 14115 Pave <NA> IR1 Lvl
## Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 2 AllPub FR2 Gtl Veenker Feedr Norm 1Fam
## 3 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 4 AllPub Corner Gtl Crawfor Norm Norm 1Fam
## 5 AllPub FR2 Gtl NoRidge Norm Norm 1Fam
## 6 AllPub Inside Gtl Mitchel Norm Norm 1Fam
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1 2Story 7 5 2003 2003 Gable CompShg
## 2 1Story 6 8 1976 1976 Gable CompShg
## 3 2Story 7 5 2001 2002 Gable CompShg
## 4 2Story 7 5 1915 1970 Gable CompShg
## 5 2Story 8 5 2000 2000 Gable CompShg
## 6 1.5Fin 5 5 1993 1995 Gable CompShg
## Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1 VinylSd VinylSd BrkFace 196 Gd TA PConc
## 2 MetalSd MetalSd None 0 TA TA CBlock
## 3 VinylSd VinylSd BrkFace 162 Gd TA PConc
## 4 Wd Sdng Wd Shng None 0 TA TA BrkTil
## 5 VinylSd VinylSd BrkFace 350 Gd TA PConc
## 6 VinylSd VinylSd None 0 TA TA Wood
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1 Gd TA No GLQ 706 Unf
## 2 Gd TA Gd ALQ 978 Unf
## 3 Gd TA Mn GLQ 486 Unf
## 4 TA Gd No ALQ 216 Unf
## 5 Gd TA Av GLQ 655 Unf
## 6 Gd TA No GLQ 732 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 150 856 GasA Ex Y SBrkr
## 2 0 284 1262 GasA Ex Y SBrkr
## 3 0 434 920 GasA Ex Y SBrkr
## 4 0 540 756 GasA Gd Y SBrkr
## 5 0 490 1145 GasA Ex Y SBrkr
## 6 0 64 796 GasA Ex Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1 856 854 0 1710 1 0 2
## 2 1262 0 0 1262 0 1 2
## 3 920 866 0 1786 1 0 2
## 4 961 756 0 1717 1 0 1
## 5 1145 1053 0 2198 1 0 2
## 6 796 566 0 1362 1 0 1
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1 1 3 1 Gd 8 Typ
## 2 0 3 1 TA 6 Typ
## 3 1 3 1 Gd 6 Typ
## 4 0 3 1 Gd 7 Typ
## 5 1 4 1 Gd 9 Typ
## 6 1 1 1 TA 5 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1 0 <NA> Attchd 2003 RFn 2
## 2 1 TA Attchd 1976 RFn 2
## 3 1 TA Attchd 2001 RFn 2
## 4 1 Gd Detchd 1998 Unf 3
## 5 1 TA Attchd 2000 RFn 3
## 6 0 <NA> Attchd 1993 Unf 2
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1 548 TA TA Y 0 61
## 2 460 TA TA Y 298 0
## 3 608 TA TA Y 0 42
## 4 642 TA TA Y 0 35
## 5 836 TA TA Y 192 84
## 6 480 TA TA Y 40 30
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1 0 0 0 0 <NA> <NA> <NA>
## 2 0 0 0 0 <NA> <NA> <NA>
## 3 0 0 0 0 <NA> <NA> <NA>
## 4 272 0 0 0 <NA> <NA> <NA>
## 5 0 0 0 0 <NA> <NA> <NA>
## 6 0 320 0 0 <NA> MnPrv Shed
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 0 2 2008 WD Normal 208500
## 2 0 5 2007 WD Normal 181500
## 3 0 9 2008 WD Normal 223500
## 4 0 2 2006 WD Abnorml 140000
## 5 0 12 2008 WD Normal 250000
## 6 700 10 2009 WD Normal 143000
We select LotArea as our independent variable X and SalePrice as our dependent variable Y.
# Pick and define variables
X <- data$LotArea
Y <- data$SalePrice
Next, we calculate the 3rd quartile (Q3) for LotArea and the 2nd quartile (Q2, the median) for SalePrice.
# Calculate quartiles
lotarea_q3 <- quantile(X, 0.75)
saleprice_q2 <- quantile(Y, 0.50)
# Print the results with labels
cat("LotArea 3rd Quartile (Q3):", lotarea_q3, "\n")
## LotArea 3rd Quartile (Q3): 11601.5
cat("SalePrice Median (Q2):", saleprice_q2, "\n")
## SalePrice Median (Q2): 163000
We compute three probabilities, where x is the 3rd quartile of LotArea and y is the median of SalePrice: the conditional probability P(X>x | Y>y), the joint probability P(X>x and Y>y), and the conditional probability P(X<x | Y>y).
# Calculate the probabilities (the first and third are conditional on
# Y > y, so they are divided by the count of above-median sale prices)
P_X_greater_x_given_Y <- sum(X > lotarea_q3 & Y > saleprice_q2) / sum(Y > saleprice_q2)
P_X_greater_x_and_Y_greater_y <- sum(X > lotarea_q3 & Y > saleprice_q2) / length(Y)
P_X_less_x_given_Y <- sum(X < lotarea_q3 & Y > saleprice_q2) / sum(Y > saleprice_q2)
cat("P(X>x | Y>y):", P_X_greater_x_given_Y, "\n")
## P(X>x | Y>y): 0.3791209
cat("P(X>x and Y>y):", P_X_greater_x_and_Y_greater_y, "\n")
## P(X>x and Y>y): 0.1890411
cat("P(X<x | Y>y):", P_X_less_x_given_Y, "\n")
## P(X<x | Y>y): 0.6208791
P(X>x | Y>y) = 0.3791
The conditional probability that X exceeds x given that Y exceeds y is 0.3791: among properties whose sale price exceeds the median, approximately 37.91% have a lot area above the 3rd quartile. Since only 25% of all properties exceed the 3rd quartile, this elevated conditional probability indicates that larger lots tend to accompany higher prices.
P(X>x and Y>y) = 0.1890
The joint probability that X exceeds x and Y exceeds y is 0.1890: approximately 18.90% of all properties have both a lot area above the 3rd quartile and a sale price above the median. This is lower than the conditional probability above because it is computed over all properties rather than only those with above-median prices; by construction, at most 25% of properties can exceed the 3rd quartile of LotArea at all.
P(X<x | Y>y) = 0.6209
The conditional probability that X is below x given that Y exceeds y is 0.6209: among properties whose sale price exceeds the median, approximately 62.09% have a lot area below the 3rd quartile. This means that a significant number of properties with smaller lot areas still command higher sale prices, possibly due to other factors such as location, amenities, or house quality.
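These three quantities are tied together by the multiplication rule, P(X>x and Y>y) = P(X>x | Y>y) * P(Y>y): 0.3791 x 0.4986 ≈ 0.1890. A quick sanity check (a minimal sketch using the variables defined above):
# Verify the multiplication rule links the three probabilities
P_Y_greater_y <- mean(Y > saleprice_q2)
cat("P(X>x | Y>y) * P(Y>y):", P_X_greater_x_given_Y * P_Y_greater_y, "\n")
cat("P(X>x and Y>y):       ", P_X_greater_x_and_Y_greater_y, "\n")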
We create a contingency table for LotArea and SalePrice and perform a Chi-Square test to determine whether the variables are independent.
# Calculate quartiles
lotarea_q3 <- quantile(X, 0.75)
saleprice_q2 <- quantile(Y, 0.50)
# Create the contingency table
contingency_table <- table(cut(X, breaks=c(-Inf, lotarea_q3, Inf)),
cut(Y, breaks=c(-Inf, saleprice_q2, Inf)))
# Convert to matrix and add row and column totals
contingency_matrix <- addmargins(as.matrix(contingency_table))
# Rename rows and columns for clarity
rownames(contingency_matrix) <- c("<=3rd quartile", ">3rd quartile", "Total")
colnames(contingency_matrix) <- c("<=2nd quartile", ">2nd quartile", "Total")
# Create a new matrix with additional row and column for labels
new_matrix <- matrix("", nrow=nrow(contingency_matrix) + 1, ncol=ncol(contingency_matrix) + 1)
new_matrix[1, 2:ncol(new_matrix)] <- c("<=2nd quartile", ">2nd quartile", "Total")
new_matrix[2:nrow(new_matrix), 1] <- c("<=3rd quartile", ">3rd quartile", "Total")
new_matrix[2:nrow(new_matrix), 2:ncol(new_matrix)] <- contingency_matrix
new_matrix[1, 1] <- "B \\ A"
The contingency table is as follows:
kable(new_matrix, format = "html", caption = "Contingency Table of LotArea and SalePrice with Custom Labels") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
B \ A          | <=2nd quartile | >2nd quartile | Total
---------------|----------------|---------------|------
<=3rd quartile |            643 |           452 |  1095
>3rd quartile  |             89 |           276 |   365
Total          |            732 |           728 |  1460
Majority of Properties with Smaller Lot Areas
The majority of properties (1,095 out of 1,460) have a LotArea less than or equal to the 3rd quartile. Among these, 643 properties also have a SalePrice at or below the 2nd quartile, and 452 properties have a SalePrice above it. This indicates that while a large number of smaller-lot properties are less expensive, a significant number are also relatively more expensive.
Properties with Larger Lot Areas
There are fewer properties with a LotArea greater than the 3rd quartile (365 out of 1,460). Among these, 276 properties have a SalePrice above the 2nd quartile, and 89 have a SalePrice at or below it. This suggests that properties with larger lot areas are more likely to have higher sale prices, although some still sell for less.
Comparison Across Quartiles
Properties with LotArea <= 3rd quartile and SalePrice > 2nd quartile (452 properties) far outnumber properties with LotArea > 3rd quartile and SalePrice <= 2nd quartile (89 properties). Even among properties with smaller lot areas, a considerable number achieve higher sale prices, possibly due to factors other than lot size, such as location, house condition, or amenities. The row-wise proportions computed below make this contrast explicit.
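A compact way to see the contrast (a sketch using the contingency_table object built above) is to convert the counts to row-wise proportions: 452/1095 ≈ 41% of smaller-lot properties exceed the median price, versus 276/365 ≈ 76% of larger-lot properties.
# Row-wise conditional proportions: the share of each LotArea group
# falling above/below the SalePrice median
prop.table(contingency_table, margin = 1)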
We next test whether this split yields independent variables. Let A be the indicator that an observation is above the 3rd quartile for X (LotArea), and B the indicator that an observation is above the 2nd quartile for Y (SalePrice). We check whether P(A|B) = P(A)P(B) and run a Chi-Square test for association.
# Load necessary libraries
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:kableExtra':
##
## group_rows
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# Define A and B
A <- ifelse(X > lotarea_q3, 1, 0)
B <- ifelse(Y > saleprice_q2, 1, 0)
# Create the contingency table for A and B
contingency_table_AB <- table(A, B)
# Calculate probabilities
P_A <- mean(A)
P_B <- mean(B)
P_A_given_B <- mean(A[B == 1])
# Check whether P(A|B) equals P(A)P(B)
# (exact floating-point equality; see the numeric comparison below)
P_A_equal_PB <- P_A_given_B == P_A * P_B
# Create a data frame to display probabilities
prob_df <- data.frame(
Metric = c("P(A)", "P(B)", "P(A|B)", "P(A|B) == P(A)P(B)"),
Value = c(P_A, P_B, P_A_given_B, P_A_equal_PB)
)
kable(prob_df, format = "html", caption = "Probabilities for A and B") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
Metric             | Value
-------------------|----------
P(A)               | 0.2500000
P(B)               | 0.4986301
P(A|B)             | 0.3791209
P(A|B) == P(A)P(B) | 0.0000000
(The final row is the logical FALSE coerced to numeric 0 by the data frame: P(A|B) = 0.379 does not equal P(A)P(B) ≈ 0.125.)
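Rather than relying on an exact floating-point equality, it is more informative to compare the numbers directly (a small sketch using the quantities above): under independence, P(A|B) would equal P(A) = 0.25 and the joint probability would equal P(A)P(B) ≈ 0.1247; the observed P(A|B) = 0.3791 is far from both.
# Direct numeric comparison instead of exact equality
cat("P(A)P(B):", P_A * P_B, "\n")   # ~0.1247
cat("P(A|B): ", P_A_given_B, "\n")  # ~0.3791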
# Create a data frame for chi-square test results
# Perform Chi-Square test
chi_square_test_AB <- chisq.test(contingency_table_AB)
chi_square_df <- data.frame(
Statistic = round(chi_square_test_AB$statistic, 2),
`Degrees of Freedom` = chi_square_test_AB$parameter,
`p-value` = format.pval(chi_square_test_AB$p.value, digits = 3, scientific = TRUE)
)
kable(chi_square_df, format = "html", caption = "Chi-Square Test Results for A and B") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
          | Statistic | Degrees.of.Freedom | p.value
----------|-----------|--------------------|-----------
X-squared |    127.74 |                  1 | < 2.22e-16
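The test's expected counts under independence make the size of the departure concrete: for example, the (A = 1, B = 1) cell would be expected to hold about 365 x 728 / 1460 ≈ 182 properties, versus the 276 actually observed. A quick sketch:
# Expected cell counts under the null hypothesis of independence
chisq.test(contingency_table_AB)$expected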
### Conclusion: Independence of Variables A and B
Based on the analysis, splitting the training data into variables A and B does not make them independent. P(A|B) = 0.379 while P(A)P(B) ≈ 0.125, and the Chi-Square test rejects the null hypothesis of independence (X-squared = 127.74, p < 2.22e-16). There is strong evidence of a significant association between LotArea and SalePrice; therefore, A and B are dependent.
We provide descriptive statistics for LotArea and SalePrice. Additionally, we generate histograms, boxplots, and a scatterplot to visualize the data.
# Descriptive statistics
summary(data$LotArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
summary(data$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
library(ggplot2)
library(patchwork)
##
## Attaching package: 'patchwork'
## The following object is masked from 'package:MASS':
##
## area
# Histograms
p1 <- ggplot(data, aes(x=LotArea)) +
geom_histogram(bins=30, fill="blue", alpha=0.7) +
ggtitle("Histogram of LotArea")
p2 <- ggplot(data, aes(x=SalePrice)) +
geom_histogram(bins=30, fill="blue", alpha=0.7) +
ggtitle("Histogram of SalePrice")
# Combine all plots into a single layout
(p1 | p2)
library(ggplot2)
library(patchwork)
# Boxplots
p3 <- ggplot(data, aes(y=LotArea)) +
geom_boxplot() +
ggtitle("Boxplot of LotArea")
p4 <- ggplot(data, aes(y=SalePrice)) +
geom_boxplot() +
ggtitle("Boxplot of SalePrice")
# Combine all plots into a single layout
(p3 | p4)
library(ggplot2)
library(patchwork)
# Scatterplot
p5 <- ggplot(data, aes(x=LotArea, y=SalePrice)) +
geom_point() +
ggtitle("Scatterplot of LotArea vs SalePrice")
p5
We calculate the difference in means between LotArea and SalePrice and compute the 95% confidence interval for this difference.
# Calculate the difference in means and the standard error of the difference
mean_diff <- mean(X) - mean(Y)
se_diff <- sqrt(var(X)/length(X) + var(Y)/length(Y))
# Calculate the 95% confidence interval
t_value <- qt(0.975, df=length(X)-1)
ci_lower_diff <- mean_diff - t_value * se_diff
ci_upper_diff <- mean_diff + t_value * se_diff
# Print the results with labels
cat("Mean difference:", mean_diff, "\n")
## Mean difference: -170404.4
cat("95% CI lower bound:", ci_lower_diff, "\n")
## 95% CI lower bound: -174514.8
cat("95% CI upper bound:", ci_upper_diff, "\n")
## 95% CI upper bound: -166293.9
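As a cross-check (a sketch, not part of the original output), R's built-in Welch two-sample t-test computes the same interval with a different degrees-of-freedom adjustment, so its bounds may differ marginally from the manual calculation above:
# Cross-check the manual interval with Welch's two-sample t-test
t.test(X, Y)$conf.int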
We compute the correlation matrix for LotArea and SalePrice and perform a hypothesis test to determine whether the correlation is significantly different from zero.
# Compute the correlation matrix
correlation_matrix <- cor(data[, c("LotArea", "SalePrice")])
# Test the hypothesis that the correlation is 0
correlation_test <- cor.test(X, Y)
correlation_matrix
## LotArea SalePrice
## LotArea 1.0000000 0.2638434
## SalePrice 0.2638434 1.0000000
correlation_test
##
## Pearson's product-moment correlation
##
## data: X and Y
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2154574 0.3109369
## sample estimates:
## cor
## 0.2638434
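The reported t statistic can be reproduced by hand from the sample correlation via t = r * sqrt(n - 2) / sqrt(1 - r^2). A minimal sketch using the objects already defined:
# Reproduce the test statistic used by cor.test
r <- unname(correlation_test$estimate)
n <- length(X)
t_manual <- r * sqrt(n - 2) / sqrt(1 - r^2)
cat("Manual t statistic:", t_manual, "\n")  # ~10.445, matching the output above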
We invert the correlation matrix to obtain the precision matrix and perform matrix multiplications. We also conduct Principal Components Analysis (PCA) to understand the variance explained by each principal component.
# Invert the correlation matrix
precision_matrix <- solve(correlation_matrix)
# Matrix multiplications
correlation_times_precision <- correlation_matrix %*% precision_matrix
precision_times_correlation <- precision_matrix %*% correlation_matrix
# Perform PCA
pca <- prcomp(data[, c("LotArea", "SalePrice")], scale. = TRUE)
# Principal components and explained variance
principal_components <- pca$rotation
explained_variance <- pca$sdev^2 / sum(pca$sdev^2)
Interpreting the precision matrix:
- The diagonal entries for LotArea and SalePrice are identical (approximately 1.0748), as expected when inverting a 2x2 correlation matrix.
- The off-diagonal entries are approximately -0.284. Note that the partial correlation negates and rescales this entry (see the sketch after the output below), giving approximately +0.264; with only two variables, this coincides with the Pearson correlation.
# Print the results with headings
cat("Precision Matrix:\n")
## Precision Matrix:
print(precision_matrix)
## LotArea SalePrice
## LotArea 1.0748219 -0.2835846
## SalePrice -0.2835846 1.0748219
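Following up on the note above, here is a minimal sketch (using the precision_matrix object just printed) of the partial correlation it implies:
# Partial correlation: rho = -Omega[1,2] / sqrt(Omega[1,1] * Omega[2,2])
partial_cor <- -precision_matrix[1, 2] /
  sqrt(precision_matrix[1, 1] * precision_matrix[2, 2])
cat("Partial correlation:", partial_cor, "\n")  # ~0.264, matching cor(X, Y)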
Multiplying the correlation matrix by the precision matrix yields the identity matrix, confirming that the precision matrix was correctly calculated as the inverse of the correlation matrix:
cat("\nCorrelation Matrix times Precision Matrix:\n")
##
## Correlation Matrix times Precision Matrix:
print(correlation_times_precision)
## LotArea SalePrice
## LotArea 1 0
## SalePrice 0 1
Multiplying in the reverse order likewise yields the identity matrix, further confirming the correctness of the precision and correlation matrices:
cat("\nPrecision Matrix times Correlation Matrix:\n")
##
## Precision Matrix times Correlation Matrix:
print(precision_times_correlation)
## LotArea SalePrice
## LotArea 1 0
## SalePrice 0 1
Interpreting the principal components:
- LotArea and SalePrice contribute equally to the first principal component, with positive coefficients (approximately 0.7071). This suggests that the first principal component represents a combined measure of LotArea and SalePrice.
- LotArea and SalePrice contribute equally but with opposite signs to the second principal component. This suggests that the second principal component represents the contrast between LotArea and SalePrice.
cat("\nPrincipal Components:\n")
##
## Principal Components:
print(principal_components)
## PC1 PC2
## LotArea 0.7071068 0.7071068
## SalePrice 0.7071068 -0.7071068
Interpreting the explained variance:
- The first principal component explains approximately 63.2% of the total variance, meaning most of the joint variability in LotArea and SalePrice can be captured by this combined measure.
- The second principal component explains the remaining 36.8%, reflecting the contrast between LotArea and SalePrice.
cat("\nExplained Variance:\n")
##
## Explained Variance:
print(explained_variance)
## [1] 0.6319217 0.3680783
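As a consistency check (a short sketch, not part of the original analysis), the explained-variance shares for a standardized two-variable PCA follow directly from the correlation: the eigenvalues of a 2x2 correlation matrix are 1 + r and 1 - r.
# Explained-variance shares implied by the correlation alone
r <- correlation_matrix[1, 2]
c((1 + r) / 2, (1 - r) / 2)  # ~0.632 and ~0.368, matching the PCA output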
Overall Conclusion:
The analysis provides insight into the relationship between LotArea and SalePrice. The precision matrix implies a modest conditional association between the variables (partial correlation of approximately 0.264), while the PCA reveals that a combined measure of these variables captures most of the variance. The significant portion of variance explained by the first principal component suggests that these variables are closely related, but there is also distinct variability captured by the second principal component.
We fit an exponential distribution to the LotArea data and generate samples from the fitted distribution. We also compare the empirical and theoretical percentiles.
# Check if MASS package is installed; if not, install it
if (!requireNamespace("MASS", quietly = TRUE)) {
install.packages("MASS")
}
# Load the MASS package
library(MASS)
library(ggplot2)
library(dplyr)
library(tidyr)
# Set a seed for reproducibility
set.seed(42)
# Shift `LotArea` so its minimum is 1, ensuring strictly positive values
lotarea_shifted <- data$LotArea - min(data$LotArea) + 1
# Fit the exponential distribution
fit <- fitdistr(lotarea_shifted, "exponential")
lambda <- fit$estimate
# Generate 1000 samples from the fitted exponential distribution
samples <- rexp(1000, rate=lambda)
# Calculate percentiles
theoretical_percentiles <- qexp(c(0.05, 0.95), rate=lambda)
empirical_percentiles <- quantile(lotarea_shifted, probs=c(0.05, 0.95))
# Print results
cat("Estimated Lambda:", lambda, "\n")
## Estimated Lambda: 0.0001084854
cat("Theoretical Percentiles (5% and 95%):", theoretical_percentiles, "\n")
## Theoretical Percentiles (5% and 95%): 472.8128 27614.15
cat("Empirical Percentiles (5% and 95%):", empirical_percentiles, "\n")
## Empirical Percentiles (5% and 95%): 2012.7 16102.15
# Create data frames for plotting
sample_df <- data.frame(Value = samples, Type = "Exponential Samples")
lotarea_df <- data.frame(Value = lotarea_shifted, Type = "Shifted LotArea")
# Combine data frames
plot_data <- bind_rows(sample_df, lotarea_df)
# Plot histograms with facet wrap
ggplot(plot_data, aes(x = Value)) +
geom_histogram(bins = 30, fill = "blue", alpha = 0.7) +
facet_wrap(~Type, scales = "free") +
ggtitle("Histograms of Exponential Samples and Shifted LotArea") +
xlab("Value") +
ylab("Frequency")
Estimated Lambda
The estimated lambda for the exponential distribution is approximately 1.0849 x 10^-4. This value of lambda is the rate parameter of the exponential distribution. It indicates how quickly the values decay: a smaller lambda would mean a slower decay, whereas a larger lambda indicates a quicker decay.
Theoretical and Empirical Percentiles
Theoretical Percentiles (5% and 95%): 472.81 and 27,614.15 are the 5th and 95th percentiles of the theoretical exponential distribution based on the estimated lambda. They provide a range within which 90% of the data from the exponential distribution is expected to fall.
Empirical Percentiles (5% and 95%): 2,012.70 and 16,102.15 are the 5th and 95th percentiles of the empirical shifted LotArea data. They provide a range within which 90% of the shifted LotArea data falls.
Histograms
Exponential Samples Histogram: The left histogram shows the distribution of 1000 samples generated from the fitted exponential distribution. As expected for an exponential distribution, the histogram shows a rapid decay, with most values concentrated towards the lower end and fewer values as you move to the right. The shape confirms that the data follows an exponential decay, which is consistent with the nature of an exponential distribution.
Shifted LotArea Histogram: The right histogram shows the distribution of the shifted LotArea values. This histogram also shows a right-skewed distribution, with most values concentrated toward the lower end and fewer values extending to the right. Its shape is broadly similar to the exponential samples histogram, indicating that the shifted LotArea data might reasonably follow an exponential distribution.
Comparison and Analysis
Shape Comparison: Both histograms show a right-skewed distribution, with a high frequency of small values and a long tail of larger values. This similarity suggests that the exponential distribution might be a good fit for the shifted LotArea data.
Percentile Comparison: The theoretical percentiles from the exponential distribution (472.81 and 27,614.15) and the empirical percentiles from the shifted LotArea data (2,012.70 and 16,102.15) are not identical but span a similar range, further supporting the exponential model. The empirical 5th percentile is well above the theoretical 5th percentile, indicating that the data contain fewer very small lot areas than the fitted exponential distribution would predict.
Goodness of Fit: The comparison of histograms and percentiles suggests that the exponential distribution is a plausible, if imperfect, model for the shifted LotArea data. A more formal check follows below.
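One formal option (a sketch; note the caveats in the comments) is a one-sample Kolmogorov-Smirnov test against the fitted exponential:
# One-sample Kolmogorov-Smirnov test against the fitted exponential.
# Caveats: lambda was estimated from the same data, so the standard
# p-value is optimistic, and ties in LotArea will trigger a warning.
ks.test(lotarea_shifted, "pexp", rate = lambda)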
We build a simple linear regression model using LotArea to predict SalePrice. We then evaluate the model using the root mean squared error (RMSE).
# Split the data
set.seed(42)
train_indices <- sample(1:nrow(data), 0.8 * nrow(data))
train_data <- data[train_indices, ]
test_data <- data[-train_indices, ]
# Build the model
model <- lm(SalePrice ~ LotArea, data=train_data)
# Predictions
predictions <- predict(model, newdata=test_data)
# Evaluate the model
mse <- mean((test_data$SalePrice - predictions)^2)
rmse <- sqrt(mse)
# Print model summary and RMSE
cat("Model Summary:\n")
## Model Summary:
print(summary(model))
##
##
## Call:
## lm(formula = SalePrice ~ LotArea, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -251762 -48469 -16582 32100 550953
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.571e+05 3.246e+03 48.38 <2e-16 ***
## LotArea 2.182e+00 2.257e-01 9.67 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 75670 on 1166 degrees of freedom
## Multiple R-squared: 0.07425, Adjusted R-squared: 0.07345
## F-statistic: 93.51 on 1 and 1166 DF, p-value: < 2.2e-16
cat("RMSE:", rmse, "\n")
## RMSE: 80495.68
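To put the RMSE of 80,495.68 in context, a useful baseline (a sketch, not part of the original evaluation) is the error from simply predicting the training-set mean SalePrice for every test observation; given the model's R-squared of only about 0.07, its RMSE should sit only modestly below this baseline:
# Baseline: predict the training-set mean for every test observation
baseline_pred <- mean(train_data$SalePrice)
baseline_rmse <- sqrt(mean((test_data$SalePrice - baseline_pred)^2))
cat("Baseline RMSE:", baseline_rmse, "\n")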
# Check if MASS package is installed; if not, install it
if (!requireNamespace("MASS", quietly = TRUE)) {
install.packages("MASS")
}
# Load necessary packages
library(MASS)
library(ggplot2)
library(dplyr)
library(tidyr)
library(patchwork)
# Diagnostic plots
# Residuals vs Fitted
residuals_vs_fitted <- ggplot(model, aes(.fitted, .resid)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", se = FALSE, color = "red") +
ggtitle("Residuals vs Fitted") +
xlab("Fitted values") +
ylab("Residuals") +
theme_minimal()
# Q-Q plot
qq_plot <- ggplot(model, aes(sample = .stdresid)) +
stat_qq() +
stat_qq_line() +
ggtitle("Normal Q-Q") +
theme_minimal()
# Scale-Location plot
scale_location_plot <- ggplot(model, aes(.fitted, sqrt(abs(.stdresid)))) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", se = FALSE, color = "red") +
ggtitle("Scale-Location") +
xlab("Fitted values") +
ylab("Square Root of |Standardized Residuals|") +
theme_minimal()
# Residuals vs Leverage plot
residuals_vs_leverage <- ggplot(model, aes(.hat, .stdresid)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", se = FALSE, color = "red") +
ggtitle("Residuals vs Leverage") +
xlab("Leverage") +
ylab("Standardized Residuals") +
theme_minimal()
# Combine diagnostic plots
diagnostic_plots <- (residuals_vs_fitted | scale_location_plot) / (qq_plot | residuals_vs_leverage)
diagnostic_plots
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
The diagnostic plots assess the fit of the linear regression model. Here’s an interpretation of each plot:
Residuals vs Fitted. Purpose: To check the linearity assumption and identify any patterns. Interpretation: The residuals should be randomly scattered around the horizontal zero line. In the plot, the residuals exhibit a clear curved pattern, suggesting non-linearity. This indicates that the relationship between LotArea and SalePrice may not be purely linear.
Scale-Location. Purpose: To check the homoscedasticity assumption (constant variance of residuals). Interpretation: The residuals should be evenly spread along the range of fitted values. In the plot, the red line curves upwards, indicating that the variance of the residuals increases with the fitted values. This suggests heteroscedasticity: the variance of SalePrice is not constant across different levels of LotArea.
Normal Q-Q. Purpose: To check the normality of the residuals. Interpretation: The points should lie on the 45-degree reference line if the residuals are normally distributed. In the plot, the points deviate significantly from the line, especially at the tails, indicating that the residuals are not normally distributed. This violation can affect the reliability of hypothesis tests and confidence intervals.
Residuals vs Leverage. Purpose: To identify influential observations that have a large impact on the model. Interpretation: Observations that combine high leverage with large standardized residuals (equivalently, large Cook's distance) are influential. In the plot, there are points with high leverage and large residuals, suggesting the presence of influential observations. These observations can disproportionately affect the model fit and should be investigated further.
The diagnostic plots suggest that the linear regression model may not be appropriate for the data due to violations of key assumptions.
Next Steps:
- Consider transforming the predictor (LotArea) or the response variable (SalePrice) to achieve linearity and homoscedasticity (a log-log sketch follows below).
- Explore non-linear models, such as polynomial regression or generalized additive models (GAMs).
- Investigate and possibly remove or adjust for influential observations.
- Consider robust regression techniques that are less sensitive to violations of assumptions.
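As one illustration of the first suggestion (a sketch under the assumption that both variables are strictly positive, which holds here since the minimum LotArea is 1,300 and the minimum SalePrice is 34,900), a log-log specification can reduce the right skew and stabilize the variance:
# Log-log specification to reduce skew and stabilize variance
log_model <- lm(log(SalePrice) ~ log(LotArea), data = train_data)
summary(log_model)
# Compare out-of-sample error on the original dollar scale
log_predictions <- exp(predict(log_model, newdata = test_data))
cat("Log-log model RMSE:", sqrt(mean((test_data$SalePrice - log_predictions)^2)), "\n")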