Your final is due by the end of day on 19 May This project will show off your ability to understand the elements of the class. You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
Loading the dataset
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
df = read.csv("https://raw.githubusercontent.com/NooriSelina/Data605/main/train.csv")
head(df)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1 1 60 RL 65 8450 Pave <NA> Reg Lvl
## 2 2 20 RL 80 9600 Pave <NA> Reg Lvl
## 3 3 60 RL 68 11250 Pave <NA> IR1 Lvl
## 4 4 70 RL 60 9550 Pave <NA> IR1 Lvl
## 5 5 60 RL 84 14260 Pave <NA> IR1 Lvl
## 6 6 50 RL 85 14115 Pave <NA> IR1 Lvl
## Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 2 AllPub FR2 Gtl Veenker Feedr Norm 1Fam
## 3 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 4 AllPub Corner Gtl Crawfor Norm Norm 1Fam
## 5 AllPub FR2 Gtl NoRidge Norm Norm 1Fam
## 6 AllPub Inside Gtl Mitchel Norm Norm 1Fam
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1 2Story 7 5 2003 2003 Gable CompShg
## 2 1Story 6 8 1976 1976 Gable CompShg
## 3 2Story 7 5 2001 2002 Gable CompShg
## 4 2Story 7 5 1915 1970 Gable CompShg
## 5 2Story 8 5 2000 2000 Gable CompShg
## 6 1.5Fin 5 5 1993 1995 Gable CompShg
## Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1 VinylSd VinylSd BrkFace 196 Gd TA PConc
## 2 MetalSd MetalSd None 0 TA TA CBlock
## 3 VinylSd VinylSd BrkFace 162 Gd TA PConc
## 4 Wd Sdng Wd Shng None 0 TA TA BrkTil
## 5 VinylSd VinylSd BrkFace 350 Gd TA PConc
## 6 VinylSd VinylSd None 0 TA TA Wood
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1 Gd TA No GLQ 706 Unf
## 2 Gd TA Gd ALQ 978 Unf
## 3 Gd TA Mn GLQ 486 Unf
## 4 TA Gd No ALQ 216 Unf
## 5 Gd TA Av GLQ 655 Unf
## 6 Gd TA No GLQ 732 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 150 856 GasA Ex Y SBrkr
## 2 0 284 1262 GasA Ex Y SBrkr
## 3 0 434 920 GasA Ex Y SBrkr
## 4 0 540 756 GasA Gd Y SBrkr
## 5 0 490 1145 GasA Ex Y SBrkr
## 6 0 64 796 GasA Ex Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1 856 854 0 1710 1 0 2
## 2 1262 0 0 1262 0 1 2
## 3 920 866 0 1786 1 0 2
## 4 961 756 0 1717 1 0 1
## 5 1145 1053 0 2198 1 0 2
## 6 796 566 0 1362 1 0 1
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1 1 3 1 Gd 8 Typ
## 2 0 3 1 TA 6 Typ
## 3 1 3 1 Gd 6 Typ
## 4 0 3 1 Gd 7 Typ
## 5 1 4 1 Gd 9 Typ
## 6 1 1 1 TA 5 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1 0 <NA> Attchd 2003 RFn 2
## 2 1 TA Attchd 1976 RFn 2
## 3 1 TA Attchd 2001 RFn 2
## 4 1 Gd Detchd 1998 Unf 3
## 5 1 TA Attchd 2000 RFn 3
## 6 0 <NA> Attchd 1993 Unf 2
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1 548 TA TA Y 0 61
## 2 460 TA TA Y 298 0
## 3 608 TA TA Y 0 42
## 4 642 TA TA Y 0 35
## 5 836 TA TA Y 192 84
## 6 480 TA TA Y 40 30
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1 0 0 0 0 <NA> <NA> <NA>
## 2 0 0 0 0 <NA> <NA> <NA>
## 3 0 0 0 0 <NA> <NA> <NA>
## 4 272 0 0 0 <NA> <NA> <NA>
## 5 0 0 0 0 <NA> <NA> <NA>
## 6 0 320 0 0 <NA> MnPrv Shed
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 0 2 2008 WD Normal 208500
## 2 0 5 2007 WD Normal 181500
## 3 0 9 2008 WD Normal 223500
## 4 0 2 2006 WD Abnorml 140000
## 5 0 12 2008 WD Normal 250000
## 6 700 10 2009 WD Normal 143000
Pick one of the quanititative independent variables from the training data set (train.csv) , and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.
We have selected ‘LotArea’ as the independent variable for our analysis, due to its typical right skewness in real estate data. This characteristic makes LotArea an appropriate choice for examining skewed distributions. To visually confirm the skewness of ‘LotArea’, we’ll plot a histogram
ggplot(df, aes(x = LotArea)) +
geom_histogram(bins = 50, fill = "blue", color = "black") +
ggtitle("Histogram of LotArea") +
xlab("LotArea") +
ylab("Frequency")
We’ve chosen SalePrice as our dependent variable 𝑌 because it directly represents the house prices we’re interested in analyzing. To get a quick overview of SalePrice we have plotted a histogram once again.
summary(df$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
ggplot(df, aes(x = SalePrice)) +
geom_histogram(bins = 30, fill = "red", color = "black") +
ggtitle("Histogram of SalePrice") +
xlab("SalePrice") +
ylab("Frequency")
Defining the variables:
X = df$LotArea
Y = df$SalePrice
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3d quartile of the X variable, and the small letter “y” is estimated as the 2d quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.
# Summary of X (LotArea)
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
Determine the 3rd quartile of X (LotArea)
x_3q = quantile(X, 0.75)
Summary of Y (SalePrice)
summary(Y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
Determine the 2nd quartile (Median) of Y (SalePrice)
y_2q = median(Y)
Calculating P(X > x | Y > y)
X_1 = X > x_3q
Y_1 = Y > y_2q
n = sum(X_1 & Y_1)
d = sum(Y_1)
prob1 = n / d
Calculating P(X > x, Y > y)
n_joint = sum(X_1 & Y_1)
d_total = nrow(df)
prob2 = n_joint / d_total
Calculating P(X < x | Y > y)
X_2 = X < x_3q
n_cond = sum(X_2 & Y_1)
prob3 = n_cond / d
Creating a table of counts
table_counts <- matrix(0, nrow = 3, ncol = 3)
colnames(table_counts) <- c("<=2d quartile", ">2d quartile", "Total")
rownames(table_counts) <- c("<=3d quartile", ">3d quartile", "Total")
table_counts[1, 1] <- sum(X < x_3q & Y < y_2q)
table_counts[1, 2] <- sum(X < x_3q & Y > y_2q)
table_counts[1, 3] <- table_counts[1, 1] + table_counts[1, 2]
table_counts[2, 1] <- sum(X > x_3q & Y < y_2q)
table_counts[2, 2] <- sum(X > x_3q & Y > y_2q)
table_counts[2, 3] <- table_counts[2, 1] + table_counts[2, 2]
table_counts[3, 1] <- table_counts[1, 1] + table_counts[2, 1]
table_counts[3, 2] <- table_counts[1, 2] + table_counts[2, 2]
table_counts[3, 3] <- table_counts[1, 3] + table_counts[2, 3]
# Print the probabilities and the table
print(c(prob1, prob2, prob3))
## [1] 0.3791209 0.1890411 0.6208791
print(table_counts)
## <=2d quartile >2d quartile Total
## <=3d quartile 639 452 1091
## >3d quartile 89 276 365
## Total 728 728 1456
Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 3d quartile for X, and let B be the new variable counting those observations above the 2d quartile for Y. Does P(A|B)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
Defining the variables
A <- X > x_3q
B <- Y > y_2q
Checking conditional joint probabilities
# Calculate P(A|B)
P_A_given_B <- sum(A & B) / sum(B)
# Calculate P(A) and P(B)
P_A <- sum(A) / length(A)
P_B <- sum(B) / length(B)
# Calculate P(A) * P(B)
joint_probability <- P_A * P_B
# Print the probabilities
print(paste("P(A|B):", P_A_given_B))
## [1] "P(A|B): 0.379120879120879"
print(paste("P(A) * P(B):", joint_probability))
## [1] "P(A) * P(B): 0.124657534246575"
Chi square test
# Conduct a Chi-square test
c_test <- chisq.test(table(A, B))
# Print the Chi-square test result
print(c_test)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(A, B)
## X-squared = 127.74, df = 1, p-value < 2.2e-16
Discussion Splitting the training data into
observations above the 3rd quartile for LotArea (A) and the
upper half for SalePrice (B) does not make them
independent. The Chi-square test results show a very low p-value (<
0.0001) and a high statistic (127.74), which strongly indicates a
significant relationship between LotArea and
SalePrice. Thus, the variables are statistically
dependent.
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot of X and Y. Provide a 95% CI for the difference in the mean of the variables. Derive a correlation matrix for two of the quantitative variables you selected. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.
Provide univariate descriptive statistics and appropriate plots for the training data set.
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
summary(Y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
Creating a scatter plot
library(ggplot2)
# Scatterplot of LotArea and SalePrice
ggplot(df, aes(x = X, y = Y)) +
geom_point() +
labs(x = "LotArea", y = "SalePrice", title = "Scatter Plot of LotArea and SalePrice")
Calculating the confiedence interval
t_test <- t.test(X, Y, conf.level = 0.95)
print(t_test)
##
## Welch Two Sample t-test
##
## data: X and Y
## t = -81.321, df = 1505.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -174514.7 -166294.1
## sample estimates:
## mean of x mean of y
## 10516.83 180921.20
Selecting the variables for correlation matrix, and conducting the correlation matrix
df2 <- df |>
select(LotArea, SalePrice)
co_matrix <- cor(df2)
print(co_matrix)
## LotArea SalePrice
## LotArea 1.0000000 0.2638434
## SalePrice 0.2638434 1.0000000
Our hypothesis testing:
# Hypothesis test for correlation
co_test <- cor.test(df2$LotArea, df2$SalePrice, method = "pearson", conf.level = 0.99)
print(co_test)
##
## Pearson's product-moment correlation
##
## data: df2$LotArea and df2$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.2000196 0.3254375
## sample estimates:
## cor
## 0.2638434
Discussion The Pearson’s correlation test
results indicate a moderate positive correlation between
LotArea and SalePrice, with a correlation
coefficient of 0.2638434. The test is highly significant, with a t-value
of 10.445 and a p-value of less than 2.2e-16, rejecting the null
hypothesis that the true correlation is zero. The 99% confidence
interval for the correlation ranges from 0.2000196 to 0.3254375. This
suggests that, generally, larger lots tend to have higher prices, but
the relationship isn’t very strong. Other factors besides lot size are
also important in determining house prices.
Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct principle components analysis (research this!) and interpret. Discuss.
df2 <- df |> select(LotArea, SalePrice)
co_matrix <- cor(df2)
precision_matrix <- solve(co_matrix)
print("Precision Matrix:")
## [1] "Precision Matrix:"
print(precision_matrix)
## LotArea SalePrice
## LotArea 1.0748219 -0.2835846
## SalePrice -0.2835846 1.0748219
mult_cor_precision <- round(co_matrix %*% precision_matrix)
print("Correlation Matrix * Precision Matrix (Identity Matrix):")
## [1] "Correlation Matrix * Precision Matrix (Identity Matrix):"
print(mult_cor_precision)
## LotArea SalePrice
## LotArea 1 0
## SalePrice 0 1
mult_precision_cor <- round(precision_matrix %*% co_matrix)
print("Precision Matrix * Correlation Matrix (Identity Matrix):")
## [1] "Precision Matrix * Correlation Matrix (Identity Matrix):"
print(mult_precision_cor)
## LotArea SalePrice
## LotArea 1 0
## SalePrice 0 1
pca_result <- princomp(df2, cor = TRUE)
print("Loadings from PCA:")
## [1] "Loadings from PCA:"
print(pca_result$loadings)
##
## Loadings:
## Comp.1 Comp.2
## LotArea 0.707 0.707
## SalePrice 0.707 -0.707
##
## Comp.1 Comp.2
## SS loadings 1.0 1.0
## Proportion Var 0.5 0.5
## Cumulative Var 0.5 1.0
PC <- pca_result$scores
cor_PC <- cor(PC)
print("Correlation among Principal Components:")
## [1] "Correlation among Principal Components:"
print(cor_PC)
## Comp.1 Comp.2
## Comp.1 1.000000e+00 1.547207e-15
## Comp.2 1.547207e-15 1.000000e+00
# Summary of the PCA results to show importance of components
summary_pca <- summary(pca_result)
print("Summary of PCA:")
## [1] "Summary of PCA:"
print(summary_pca)
## Importance of components:
## Comp.1 Comp.2
## Standard deviation 1.1242079 0.8579957
## Proportion of Variance 0.6319217 0.3680783
## Cumulative Proportion 0.6319217 1.0000000
Discussion The precision matrix and the
resulting identity matrices confirm that the relationships between
LotArea and SalePrice have been accurately
isolated and normalized. The PCA revealed that the first component
explains about 63% of the variance, mainly capturing the combined
effects of lot size and sale price. The second component, explaining the
remaining 37%, offers a contrasting dimension. This separation into
independent components demonstrates that while LotArea and
SalePrice are related, they influence property values in
distinct ways.
Many times, it makes sense to fit a closed form distribution to data. For your variable that is skewed to the right, shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
Certainly! Let’s follow your detailed plan and execute each step
using R code. We’ll apply the calculations to LotArea,
which is the right-skewed variable from your dataset.
Adjusting Data and Fitting an Exponential Distribution
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# Adjust LotArea so the minimum value is above zero
X <- df$LotArea
X_shifted <- X + 1
# Fit an exponential distribution to the shifted data
exp_fit <- fitdistr(X_shifted, "exponential")
lambda <- exp_fit$estimate # Extract rate parameter λ
samples <- rexp(1000, rate = lambda)
Plotting Histograms for Comparison
par(mfrow = c(1, 2))
hist(X, main = "Original LotArea", xlab = "LotArea", col = "blue")
hist(samples, main = "Fitted Exponential Distribution", xlab = "Sample Values", col = "red")
Calculating Percentiles and Confidence Intervals
# Calculate empirical 5th and 95th percentiles for original data
data_percentiles <- quantile(X, probs = c(0.05, 0.95))
print("Original Data Percentiles:")
## [1] "Original Data Percentiles:"
print(data_percentiles)
## 5% 95%
## 3311.70 17401.15
# Calculate 95% confidence interval for the mean of the original data
mean_X <- mean(X)
std_X <- sd(X)
n <- length(X)
z <- qnorm(1 - 0.05/2)
lower_ci <- mean_X - z * (std_X / sqrt(n))
upper_ci <- mean_X + z * (std_X / sqrt(n))
print(paste("95% Confidence Interval for the Mean of Original Data: Lower =", lower_ci, ", Upper =", upper_ci))
## [1] "95% Confidence Interval for the Mean of Original Data: Lower = 10004.8430715642 , Upper = 11028.8130928194"
The percentiles for LotArea show that 5% of the lots are smaller than 3,311.7 square feet, and 95% are smaller than 17,401.15 square feet. This indicates that most of the lots are within this range, but there are some significantly larger properties.
The 95% confidence interval for the mean of LotArea ranges from 10,004.84 to 11,028.81 square feet. This means we are 95% confident that the average lot size of the properties falls within this range. This interval helps us understand the typical size of lots being considered, emphasizing that while there are outliers, the majority of lots have a fairly consistent size.
library(tidyverse)
library(MASS)
# Load training data
df_train <- read.csv("https://raw.githubusercontent.com/NooriSelina/Data605/main/train.csv")
# Load test data
df_test <- read.csv("https://raw.githubusercontent.com/NooriSelina/Data605/main/test.csv")
head(df_train)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1 1 60 RL 65 8450 Pave <NA> Reg Lvl
## 2 2 20 RL 80 9600 Pave <NA> Reg Lvl
## 3 3 60 RL 68 11250 Pave <NA> IR1 Lvl
## 4 4 70 RL 60 9550 Pave <NA> IR1 Lvl
## 5 5 60 RL 84 14260 Pave <NA> IR1 Lvl
## 6 6 50 RL 85 14115 Pave <NA> IR1 Lvl
## Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 2 AllPub FR2 Gtl Veenker Feedr Norm 1Fam
## 3 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 4 AllPub Corner Gtl Crawfor Norm Norm 1Fam
## 5 AllPub FR2 Gtl NoRidge Norm Norm 1Fam
## 6 AllPub Inside Gtl Mitchel Norm Norm 1Fam
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1 2Story 7 5 2003 2003 Gable CompShg
## 2 1Story 6 8 1976 1976 Gable CompShg
## 3 2Story 7 5 2001 2002 Gable CompShg
## 4 2Story 7 5 1915 1970 Gable CompShg
## 5 2Story 8 5 2000 2000 Gable CompShg
## 6 1.5Fin 5 5 1993 1995 Gable CompShg
## Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1 VinylSd VinylSd BrkFace 196 Gd TA PConc
## 2 MetalSd MetalSd None 0 TA TA CBlock
## 3 VinylSd VinylSd BrkFace 162 Gd TA PConc
## 4 Wd Sdng Wd Shng None 0 TA TA BrkTil
## 5 VinylSd VinylSd BrkFace 350 Gd TA PConc
## 6 VinylSd VinylSd None 0 TA TA Wood
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1 Gd TA No GLQ 706 Unf
## 2 Gd TA Gd ALQ 978 Unf
## 3 Gd TA Mn GLQ 486 Unf
## 4 TA Gd No ALQ 216 Unf
## 5 Gd TA Av GLQ 655 Unf
## 6 Gd TA No GLQ 732 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 150 856 GasA Ex Y SBrkr
## 2 0 284 1262 GasA Ex Y SBrkr
## 3 0 434 920 GasA Ex Y SBrkr
## 4 0 540 756 GasA Gd Y SBrkr
## 5 0 490 1145 GasA Ex Y SBrkr
## 6 0 64 796 GasA Ex Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1 856 854 0 1710 1 0 2
## 2 1262 0 0 1262 0 1 2
## 3 920 866 0 1786 1 0 2
## 4 961 756 0 1717 1 0 1
## 5 1145 1053 0 2198 1 0 2
## 6 796 566 0 1362 1 0 1
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1 1 3 1 Gd 8 Typ
## 2 0 3 1 TA 6 Typ
## 3 1 3 1 Gd 6 Typ
## 4 0 3 1 Gd 7 Typ
## 5 1 4 1 Gd 9 Typ
## 6 1 1 1 TA 5 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1 0 <NA> Attchd 2003 RFn 2
## 2 1 TA Attchd 1976 RFn 2
## 3 1 TA Attchd 2001 RFn 2
## 4 1 Gd Detchd 1998 Unf 3
## 5 1 TA Attchd 2000 RFn 3
## 6 0 <NA> Attchd 1993 Unf 2
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1 548 TA TA Y 0 61
## 2 460 TA TA Y 298 0
## 3 608 TA TA Y 0 42
## 4 642 TA TA Y 0 35
## 5 836 TA TA Y 192 84
## 6 480 TA TA Y 40 30
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1 0 0 0 0 <NA> <NA> <NA>
## 2 0 0 0 0 <NA> <NA> <NA>
## 3 0 0 0 0 <NA> <NA> <NA>
## 4 272 0 0 0 <NA> <NA> <NA>
## 5 0 0 0 0 <NA> <NA> <NA>
## 6 0 320 0 0 <NA> MnPrv Shed
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 0 2 2008 WD Normal 208500
## 2 0 5 2007 WD Normal 181500
## 3 0 9 2008 WD Normal 223500
## 4 0 2 2006 WD Abnorml 140000
## 5 0 12 2008 WD Normal 250000
## 6 700 10 2009 WD Normal 143000
First we will check for missing values and remove them if necessary
df[is.na(df)] <- 0
# If CentralAir is a factor or character representing yes/no, convert it to numeric (0 and 1)
if (is.character(df$CentralAir) || is.factor(df$CentralAir)) {
df$CentralAir <- as.numeric(df$CentralAir == "Y")
}
Select features for modeling
df_mm <- dplyr::select(df, SalePrice, LotArea, YearBuilt, TotalBsmtSF, CentralAir, GrLivArea, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, BsmtFullBath, TotRmsAbvGrd, Fireplaces, GarageArea, GarageCars)
head(df_mm)
## SalePrice LotArea YearBuilt TotalBsmtSF CentralAir GrLivArea FullBath
## 1 208500 8450 2003 856 1 1710 2
## 2 181500 9600 1976 1262 1 1262 2
## 3 223500 11250 2001 920 1 1786 2
## 4 140000 9550 1915 756 1 1717 1
## 5 250000 14260 2000 1145 1 2198 2
## 6 143000 14115 1993 796 1 1362 1
## HalfBath BedroomAbvGr KitchenAbvGr BsmtFullBath TotRmsAbvGrd Fireplaces
## 1 1 3 1 1 8 0
## 2 0 3 1 0 6 1
## 3 1 3 1 1 6 1
## 4 0 3 1 1 7 1
## 5 1 4 1 1 9 1
## 6 1 1 1 1 5 0
## GarageArea GarageCars
## 1 548 2
## 2 460 2
## 3 608 2
## 4 642 3
## 5 836 3
## 6 480 2
Building a multiple linear regression model
model <- lm(SalePrice ~ ., data = df_mm)
summary(model)
##
## Call:
## lm(formula = SalePrice ~ ., data = df_mm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -548095 -18314 -2128 16073 294903
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.570e+05 1.041e+05 -9.194 < 2e-16 ***
## LotArea 2.841e-01 1.115e-01 2.548 0.0109 *
## YearBuilt 5.061e+02 5.405e+01 9.363 < 2e-16 ***
## TotalBsmtSF 2.887e+01 3.346e+00 8.628 < 2e-16 ***
## CentralAir 1.355e+03 4.696e+03 0.289 0.7730
## GrLivArea 6.224e+01 4.812e+00 12.935 < 2e-16 ***
## FullBath 7.667e+03 3.087e+03 2.483 0.0131 *
## HalfBath -1.302e+03 2.701e+03 -0.482 0.6298
## BedroomAbvGr -1.472e+04 1.818e+03 -8.097 1.18e-15 ***
## KitchenAbvGr -4.622e+04 5.257e+03 -8.792 < 2e-16 ***
## BsmtFullBath 1.115e+04 2.193e+03 5.083 4.21e-07 ***
## TotRmsAbvGrd 7.518e+03 1.367e+03 5.499 4.51e-08 ***
## Fireplaces 7.649e+03 1.927e+03 3.969 7.58e-05 ***
## GarageArea 7.548e+00 1.084e+01 0.696 0.4862
## GarageCars 1.596e+04 3.170e+03 5.037 5.34e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39380 on 1445 degrees of freedom
## Multiple R-squared: 0.7566, Adjusted R-squared: 0.7542
## F-statistic: 320.8 on 14 and 1445 DF, p-value: < 2.2e-16
Checking model assumptions by plotting diagnostic plots
par(mfrow=c(2,2))
plot(model)
CentralAir: The p-value is 0.7730, indicating it is not statistically
significant at typical significance levels (e.g., 0.05), so it will be
dropped.
model_refined1 <- lm(SalePrice ~ LotArea + YearBuilt + TotalBsmtSF + GrLivArea +
FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + BsmtFullBath +
TotRmsAbvGrd + Fireplaces + GarageArea + GarageCars, data = df_mm)
summary(model_refined1)
##
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + TotalBsmtSF +
## GrLivArea + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr +
## BsmtFullBath + TotRmsAbvGrd + Fireplaces + GarageArea + GarageCars,
## data = df_mm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -548428 -18336 -2177 15914 294879
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.633e+05 1.017e+05 -9.470 < 2e-16 ***
## LotArea 2.842e-01 1.115e-01 2.549 0.0109 *
## YearBuilt 5.100e+02 5.229e+01 9.753 < 2e-16 ***
## TotalBsmtSF 2.891e+01 3.343e+00 8.646 < 2e-16 ***
## GrLivArea 6.220e+01 4.808e+00 12.936 < 2e-16 ***
## FullBath 7.608e+03 3.080e+03 2.470 0.0136 *
## HalfBath -1.285e+03 2.700e+03 -0.476 0.6341
## BedroomAbvGr -1.467e+04 1.810e+03 -8.107 1.09e-15 ***
## KitchenAbvGr -4.646e+04 5.184e+03 -8.962 < 2e-16 ***
## BsmtFullBath 1.116e+04 2.192e+03 5.089 4.08e-07 ***
## TotRmsAbvGrd 7.518e+03 1.367e+03 5.501 4.47e-08 ***
## Fireplaces 7.713e+03 1.914e+03 4.030 5.87e-05 ***
## GarageArea 7.681e+00 1.082e+01 0.710 0.4780
## GarageCars 1.595e+04 3.168e+03 5.034 5.41e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39370 on 1446 degrees of freedom
## Multiple R-squared: 0.7566, Adjusted R-squared: 0.7544
## F-statistic: 345.7 on 13 and 1446 DF, p-value: < 2.2e-16
HalfBath: The p-value is 0.6298, indicating it is not statistically significant, so it will be dropped.
model_refined2 <- lm(SalePrice ~ LotArea + YearBuilt + TotalBsmtSF + GrLivArea +
FullBath + BedroomAbvGr + KitchenAbvGr + BsmtFullBath +
TotRmsAbvGrd + Fireplaces + GarageArea + GarageCars, data = df_mm)
summary(model_refined2)
##
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + TotalBsmtSF +
## GrLivArea + FullBath + BedroomAbvGr + KitchenAbvGr + BsmtFullBath +
## TotRmsAbvGrd + Fireplaces + GarageArea + GarageCars, data = df_mm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -548702 -18199 -2289 15874 295458
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.442e+05 9.348e+04 -10.101 < 2e-16 ***
## LotArea 2.864e-01 1.114e-01 2.572 0.01021 *
## YearBuilt 5.001e+02 4.790e+01 10.441 < 2e-16 ***
## TotalBsmtSF 2.953e+01 3.076e+00 9.599 < 2e-16 ***
## GrLivArea 6.122e+01 4.348e+00 14.080 < 2e-16 ***
## FullBath 8.144e+03 2.866e+03 2.841 0.00455 **
## BedroomAbvGr -1.470e+04 1.809e+03 -8.126 9.45e-16 ***
## KitchenAbvGr -4.633e+04 5.175e+03 -8.952 < 2e-16 ***
## BsmtFullBath 1.119e+04 2.190e+03 5.110 3.65e-07 ***
## TotRmsAbvGrd 7.512e+03 1.366e+03 5.499 4.52e-08 ***
## Fireplaces 7.685e+03 1.913e+03 4.018 6.17e-05 ***
## GarageArea 8.158e+00 1.077e+01 0.757 0.44904
## GarageCars 1.584e+04 3.159e+03 5.014 5.99e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39360 on 1447 degrees of freedom
## Multiple R-squared: 0.7565, Adjusted R-squared: 0.7545
## F-statistic: 374.7 on 12 and 1447 DF, p-value: < 2.2e-16
GarageArea: The p-value is 0.4862, indicating it is not statistically significant, so it will be dropped
model_refined3 <- lm(SalePrice ~ LotArea + YearBuilt + TotalBsmtSF + GrLivArea +
FullBath + BedroomAbvGr + KitchenAbvGr + BsmtFullBath +
TotRmsAbvGrd + Fireplaces + GarageCars, data = df_mm)
summary(model_refined3)
##
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + TotalBsmtSF +
## GrLivArea + FullBath + BedroomAbvGr + KitchenAbvGr + BsmtFullBath +
## TotRmsAbvGrd + Fireplaces + GarageCars, data = df_mm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -544995 -18339 -2089 15905 294855
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.433e+05 9.346e+04 -10.093 < 2e-16 ***
## LotArea 2.895e-01 1.113e-01 2.602 0.00937 **
## YearBuilt 4.997e+02 4.789e+01 10.435 < 2e-16 ***
## TotalBsmtSF 2.993e+01 3.030e+00 9.876 < 2e-16 ***
## GrLivArea 6.168e+01 4.306e+00 14.323 < 2e-16 ***
## FullBath 7.971e+03 2.857e+03 2.790 0.00533 **
## BedroomAbvGr -1.476e+04 1.807e+03 -8.172 6.55e-16 ***
## KitchenAbvGr -4.645e+04 5.172e+03 -8.981 < 2e-16 ***
## BsmtFullBath 1.130e+04 2.186e+03 5.168 2.70e-07 ***
## TotRmsAbvGrd 7.480e+03 1.365e+03 5.478 5.06e-08 ***
## Fireplaces 7.521e+03 1.900e+03 3.958 7.91e-05 ***
## GarageCars 1.777e+04 1.860e+03 9.557 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 39350 on 1448 degrees of freedom
## Multiple R-squared: 0.7564, Adjusted R-squared: 0.7546
## F-statistic: 408.8 on 11 and 1448 DF, p-value: < 2.2e-16
We will run diagnostics for the refined model
par(mfrow=c(2,2))
plot(model_refined3)
Preparing the data for submission, we will load the test data and
prepare it. We will ensure the test data consists of the same variables,
while handling missing values. We will then make predictions based on
the refined model, and prepare the dataframe for submission.
df_test <- read.csv("https://raw.githubusercontent.com/NooriSelina/Data605/main/test.csv")
df_test <- df_test %>%
mutate(CentralAir = as.numeric(CentralAir == "Y")) %>% # Encode CentralAir
replace(is.na(.), 0) # Handle missing values
df_test_mm <- df_test %>%
dplyr::select(LotArea, YearBuilt, TotalBsmtSF, CentralAir, GrLivArea,
FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, BsmtFullBath,
TotRmsAbvGrd, Fireplaces, GarageArea, GarageCars)
predictions <- predict(model_refined3, newdata = df_test_mm)
submission <- data.frame(Id = df_test$Id, SalePrice = predictions)
write.csv(submission, "kaggle_submission.csv", row.names = FALSE)
print("Submission file created successfully.")
## [1] "Submission file created successfully."
Kaggle username: Noori Selina Public score: 0.19777
Thank you professor for a great semester :)