NSelina_Final.knit

Noori Selina

Data 605 Final

05-17-2024

Computational Mathematics

Your final is due by the end of day on 19 May This project will show off your ability to understand the elements of the class. You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

Loading the dataset

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(ggplot2)

df = read.csv("https://raw.githubusercontent.com/NooriSelina/Data605/main/train.csv")
head(df)

##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1  1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
## 2  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
## 3  3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
## 4  4         70       RL          60    9550   Pave  <NA>      IR1         Lvl
## 5  5         60       RL          84   14260   Pave  <NA>      IR1         Lvl
## 6  6         50       RL          85   14115   Pave  <NA>      IR1         Lvl
##   Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 2    AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
## 3    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 4    AllPub    Corner       Gtl      Crawfor       Norm       Norm     1Fam
## 5    AllPub       FR2       Gtl      NoRidge       Norm       Norm     1Fam
## 6    AllPub    Inside       Gtl      Mitchel       Norm       Norm     1Fam
##   HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1     2Story           7           5      2003         2003     Gable  CompShg
## 2     1Story           6           8      1976         1976     Gable  CompShg
## 3     2Story           7           5      2001         2002     Gable  CompShg
## 4     2Story           7           5      1915         1970     Gable  CompShg
## 5     2Story           8           5      2000         2000     Gable  CompShg
## 6     1.5Fin           5           5      1993         1995     Gable  CompShg
##   Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1     VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
## 2     MetalSd     MetalSd       None          0        TA        TA     CBlock
## 3     VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
## 4     Wd Sdng     Wd Shng       None          0        TA        TA     BrkTil
## 5     VinylSd     VinylSd    BrkFace        350        Gd        TA      PConc
## 6     VinylSd     VinylSd       None          0        TA        TA       Wood
##   BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1       Gd       TA           No          GLQ        706          Unf
## 2       Gd       TA           Gd          ALQ        978          Unf
## 3       Gd       TA           Mn          GLQ        486          Unf
## 4       TA       Gd           No          ALQ        216          Unf
## 5       Gd       TA           Av          GLQ        655          Unf
## 6       Gd       TA           No          GLQ        732          Unf
##   BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1          0       150         856    GasA        Ex          Y      SBrkr
## 2          0       284        1262    GasA        Ex          Y      SBrkr
## 3          0       434         920    GasA        Ex          Y      SBrkr
## 4          0       540         756    GasA        Gd          Y      SBrkr
## 5          0       490        1145    GasA        Ex          Y      SBrkr
## 6          0        64         796    GasA        Ex          Y      SBrkr
##   X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1       856       854            0      1710            1            0        2
## 2      1262         0            0      1262            0            1        2
## 3       920       866            0      1786            1            0        2
## 4       961       756            0      1717            1            0        1
## 5      1145      1053            0      2198            1            0        2
## 6       796       566            0      1362            1            0        1
##   HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1        1            3            1          Gd            8        Typ
## 2        0            3            1          TA            6        Typ
## 3        1            3            1          Gd            6        Typ
## 4        0            3            1          Gd            7        Typ
## 5        1            4            1          Gd            9        Typ
## 6        1            1            1          TA            5        Typ
##   Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1          0        <NA>     Attchd        2003          RFn          2
## 2          1          TA     Attchd        1976          RFn          2
## 3          1          TA     Attchd        2001          RFn          2
## 4          1          Gd     Detchd        1998          Unf          3
## 5          1          TA     Attchd        2000          RFn          3
## 6          0        <NA>     Attchd        1993          Unf          2
##   GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1        548         TA         TA          Y          0          61
## 2        460         TA         TA          Y        298           0
## 3        608         TA         TA          Y          0          42
## 4        642         TA         TA          Y          0          35
## 5        836         TA         TA          Y        192          84
## 6        480         TA         TA          Y         40          30
##   EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1             0          0           0        0   <NA>  <NA>        <NA>
## 2             0          0           0        0   <NA>  <NA>        <NA>
## 3             0          0           0        0   <NA>  <NA>        <NA>
## 4           272          0           0        0   <NA>  <NA>        <NA>
## 5             0          0           0        0   <NA>  <NA>        <NA>
## 6             0        320           0        0   <NA> MnPrv        Shed
##   MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1       0      2   2008       WD        Normal    208500
## 2       0      5   2007       WD        Normal    181500
## 3       0      9   2008       WD        Normal    223500
## 4       0      2   2006       WD       Abnorml    140000
## 5       0     12   2008       WD        Normal    250000
## 6     700     10   2009       WD        Normal    143000

Pick one of the quanititative independent variables from the training data set (train.csv) , and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.

We have selected ‘LotArea’ as the independent variable for our analysis, due to its typical right skewness in real estate data. This characteristic makes LotArea an appropriate choice for examining skewed distributions. To visually confirm the skewness of ‘LotArea’, we’ll plot a histogram

ggplot(df, aes(x = LotArea)) +
  geom_histogram(bins = 50, fill = "blue", color = "black") +
  ggtitle("Histogram of LotArea") +
  xlab("LotArea") +
  ylab("Frequency")

We’ve chosen SalePrice as our dependent variable 𝑌 because it directly represents the house prices we’re interested in analyzing. To get a quick overview of SalePrice we have plotted a histogram once again.

summary(df$SalePrice)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

ggplot(df, aes(x = SalePrice)) +
  geom_histogram(bins = 30, fill = "red", color = "black") +
  ggtitle("Histogram of SalePrice") +
  xlab("SalePrice") +
  ylab("Frequency")

Defining the variables:

X = df$LotArea
Y = df$SalePrice

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3d quartile of the X variable, and the small letter “y” is estimated as the 2d quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.

# Summary of X (LotArea)
summary(X)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7554    9478   10517   11602  215245

Determine the 3rd quartile of X (LotArea)

x_3q = quantile(X, 0.75)

Summary of Y (SalePrice)

summary(Y)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

Determine the 2nd quartile (Median) of Y (SalePrice)

y_2q = median(Y)

Calculating P(X > x | Y > y)

X_1 = X > x_3q
Y_1 = Y > y_2q

n = sum(X_1 & Y_1)
d = sum(Y_1)
prob1 = n / d

Calculating P(X > x, Y > y)

n_joint = sum(X_1 & Y_1)
d_total = nrow(df)
prob2 = n_joint / d_total

Calculating P(X < x | Y > y)

X_2 = X < x_3q

n_cond = sum(X_2 & Y_1)
prob3 = n_cond / d

Creating a table of counts

table_counts <- matrix(0, nrow = 3, ncol = 3)
colnames(table_counts) <- c("<=2d quartile", ">2d quartile", "Total")
rownames(table_counts) <- c("<=3d quartile", ">3d quartile", "Total")

table_counts[1, 1] <- sum(X < x_3q & Y < y_2q)  
table_counts[1, 2] <- sum(X < x_3q & Y > y_2q)  
table_counts[1, 3] <- table_counts[1, 1] + table_counts[1, 2]        
table_counts[2, 1] <- sum(X > x_3q & Y < y_2q)   
table_counts[2, 2] <- sum(X > x_3q & Y > y_2q)   
table_counts[2, 3] <- table_counts[2, 1] + table_counts[2, 2]               
table_counts[3, 1] <- table_counts[1, 1] + table_counts[2, 1]
table_counts[3, 2] <- table_counts[1, 2] + table_counts[2, 2]
table_counts[3, 3] <- table_counts[1, 3] + table_counts[2, 3]

# Print the probabilities and the table
print(c(prob1, prob2, prob3))

## [1] 0.3791209 0.1890411 0.6208791

print(table_counts)

##               <=2d quartile >2d quartile Total
## <=3d quartile           639          452  1091
## >3d quartile             89          276   365
## Total                   728          728  1456

Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 3d quartile for X, and let B be the new variable counting those observations above the 2d quartile for Y. Does P(A|B)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.

Defining the variables

A <- X > x_3q
B <- Y > y_2q

Checking conditional joint probabilities

# Calculate P(A|B)
P_A_given_B <- sum(A & B) / sum(B)

# Calculate P(A) and P(B)
P_A <- sum(A) / length(A)
P_B <- sum(B) / length(B)

# Calculate P(A) * P(B)
joint_probability <- P_A * P_B

# Print the probabilities
print(paste("P(A|B):", P_A_given_B))

## [1] "P(A|B): 0.379120879120879"

print(paste("P(A) * P(B):", joint_probability))

## [1] "P(A) * P(B): 0.124657534246575"

Chi square test

# Conduct a Chi-square test
c_test <- chisq.test(table(A, B))

# Print the Chi-square test result
print(c_test)

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(A, B)
## X-squared = 127.74, df = 1, p-value < 2.2e-16

Discussion Splitting the training data into observations above the 3rd quartile for LotArea (A) and the upper half for SalePrice (B) does not make them independent. The Chi-square test results show a very low p-value (< 0.0001) and a high statistic (127.74), which strongly indicates a significant relationship between LotArea and SalePrice. Thus, the variables are statistically dependent.

Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot of X and Y. Provide a 95% CI for the difference in the mean of the variables. Derive a correlation matrix for two of the quantitative variables you selected. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.

Provide univariate descriptive statistics and appropriate plots for the training data set.

summary(X)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7554    9478   10517   11602  215245

summary(Y)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

Creating a scatter plot

library(ggplot2)

# Scatterplot of LotArea and SalePrice
ggplot(df, aes(x = X, y = Y)) +
  geom_point() +
  labs(x = "LotArea", y = "SalePrice", title = "Scatter Plot of LotArea and SalePrice")

Calculating the confiedence interval

t_test <- t.test(X, Y, conf.level = 0.95)
print(t_test)

## 
##  Welch Two Sample t-test
## 
## data:  X and Y
## t = -81.321, df = 1505.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -174514.7 -166294.1
## sample estimates:
## mean of x mean of y 
##  10516.83 180921.20

Selecting the variables for correlation matrix, and conducting the correlation matrix

df2 <- df |> 
  select(LotArea, SalePrice)

co_matrix <- cor(df2)
print(co_matrix)

##             LotArea SalePrice
## LotArea   1.0000000 0.2638434
## SalePrice 0.2638434 1.0000000

Our hypothesis testing:

# Hypothesis test for correlation
co_test <- cor.test(df2$LotArea, df2$SalePrice, method = "pearson", conf.level = 0.99)
print(co_test)

## 
##  Pearson's product-moment correlation
## 
## data:  df2$LotArea and df2$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  0.2000196 0.3254375
## sample estimates:
##       cor 
## 0.2638434

Discussion The Pearson’s correlation test results indicate a moderate positive correlation between LotArea and SalePrice, with a correlation coefficient of 0.2638434. The test is highly significant, with a t-value of 10.445 and a p-value of less than 2.2e-16, rejecting the null hypothesis that the true correlation is zero. The 99% confidence interval for the correlation ranges from 0.2000196 to 0.3254375. This suggests that, generally, larger lots tend to have higher prices, but the relationship isn’t very strong. Other factors besides lot size are also important in determining house prices.

Linear Algebra and Correlation.

Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct principle components analysis (research this!) and interpret. Discuss.

df2 <- df |> select(LotArea, SalePrice)
co_matrix <- cor(df2)

precision_matrix <- solve(co_matrix)
print("Precision Matrix:")

## [1] "Precision Matrix:"

print(precision_matrix)

##              LotArea  SalePrice
## LotArea    1.0748219 -0.2835846
## SalePrice -0.2835846  1.0748219

mult_cor_precision <- round(co_matrix %*% precision_matrix)
print("Correlation Matrix * Precision Matrix (Identity Matrix):")

## [1] "Correlation Matrix * Precision Matrix (Identity Matrix):"

print(mult_cor_precision)

##           LotArea SalePrice
## LotArea         1         0
## SalePrice       0         1

mult_precision_cor <- round(precision_matrix %*% co_matrix)
print("Precision Matrix * Correlation Matrix (Identity Matrix):")

## [1] "Precision Matrix * Correlation Matrix (Identity Matrix):"

print(mult_precision_cor)

##           LotArea SalePrice
## LotArea         1         0
## SalePrice       0         1

pca_result <- princomp(df2, cor = TRUE)
print("Loadings from PCA:")

## [1] "Loadings from PCA:"

print(pca_result$loadings)

## 
## Loadings:
##           Comp.1 Comp.2
## LotArea    0.707  0.707
## SalePrice  0.707 -0.707
## 
##                Comp.1 Comp.2
## SS loadings       1.0    1.0
## Proportion Var    0.5    0.5
## Cumulative Var    0.5    1.0

PC <- pca_result$scores
cor_PC <- cor(PC)
print("Correlation among Principal Components:")

## [1] "Correlation among Principal Components:"

print(cor_PC)

##              Comp.1       Comp.2
## Comp.1 1.000000e+00 1.547207e-15
## Comp.2 1.547207e-15 1.000000e+00

# Summary of the PCA results to show importance of components
summary_pca <- summary(pca_result)
print("Summary of PCA:")

## [1] "Summary of PCA:"

print(summary_pca)

## Importance of components:
##                           Comp.1    Comp.2
## Standard deviation     1.1242079 0.8579957
## Proportion of Variance 0.6319217 0.3680783
## Cumulative Proportion  0.6319217 1.0000000

Discussion The precision matrix and the resulting identity matrices confirm that the relationships between LotArea and SalePrice have been accurately isolated and normalized. The PCA revealed that the first component explains about 63% of the variance, mainly capturing the combined effects of lot size and sale price. The second component, explaining the remaining 37%, offers a contrasting dimension. This separation into independent components demonstrates that while LotArea and SalePrice are related, they influence property values in distinct ways.

Calculus-Based Probability & Statistics.

Many times, it makes sense to fit a closed form distribution to data. For your variable that is skewed to the right, shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

Certainly! Let’s follow your detailed plan and execute each step using R code. We’ll apply the calculations to LotArea, which is the right-skewed variable from your dataset.

Adjusting Data and Fitting an Exponential Distribution

library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

# Adjust LotArea so the minimum value is above zero
X <- df$LotArea
X_shifted <- X + 1  

# Fit an exponential distribution to the shifted data
exp_fit <- fitdistr(X_shifted, "exponential")
lambda <- exp_fit$estimate  # Extract rate parameter λ

samples <- rexp(1000, rate = lambda)

Plotting Histograms for Comparison

par(mfrow = c(1, 2))
hist(X, main = "Original LotArea", xlab = "LotArea", col = "blue")
hist(samples, main = "Fitted Exponential Distribution", xlab = "Sample Values", col = "red")

Calculating Percentiles and Confidence Intervals

# Calculate empirical 5th and 95th percentiles for original data
data_percentiles <- quantile(X, probs = c(0.05, 0.95))
print("Original Data Percentiles:")

## [1] "Original Data Percentiles:"

print(data_percentiles)

##       5%      95% 
##  3311.70 17401.15

# Calculate 95% confidence interval for the mean of the original data
mean_X <- mean(X)
std_X <- sd(X)
n <- length(X)
z <- qnorm(1 - 0.05/2)
lower_ci <- mean_X - z * (std_X / sqrt(n))
upper_ci <- mean_X + z * (std_X / sqrt(n))
print(paste("95% Confidence Interval for the Mean of Original Data: Lower =", lower_ci, ", Upper =", upper_ci))

## [1] "95% Confidence Interval for the Mean of Original Data: Lower = 10004.8430715642 , Upper = 11028.8130928194"

The percentiles for LotArea show that 5% of the lots are smaller than 3,311.7 square feet, and 95% are smaller than 17,401.15 square feet. This indicates that most of the lots are within this range, but there are some significantly larger properties.

The 95% confidence interval for the mean of LotArea ranges from 10,004.84 to 11,028.81 square feet. This means we are 95% confident that the average lot size of the properties falls within this range. This interval helps us understand the typical size of lots being considered, emphasizing that while there are outliers, the majority of lots have a fairly consistent size.

Modeling

library(tidyverse)
library(MASS)

# Load training data
df_train <- read.csv("https://raw.githubusercontent.com/NooriSelina/Data605/main/train.csv")

# Load test data
df_test <- read.csv("https://raw.githubusercontent.com/NooriSelina/Data605/main/test.csv")

head(df_train)

##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1  1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
## 2  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
## 3  3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
## 4  4         70       RL          60    9550   Pave  <NA>      IR1         Lvl
## 5  5         60       RL          84   14260   Pave  <NA>      IR1         Lvl
## 6  6         50       RL          85   14115   Pave  <NA>      IR1         Lvl
##   Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 2    AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
## 3    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 4    AllPub    Corner       Gtl      Crawfor       Norm       Norm     1Fam
## 5    AllPub       FR2       Gtl      NoRidge       Norm       Norm     1Fam
## 6    AllPub    Inside       Gtl      Mitchel       Norm       Norm     1Fam
##   HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1     2Story           7           5      2003         2003     Gable  CompShg
## 2     1Story           6           8      1976         1976     Gable  CompShg
## 3     2Story           7           5      2001         2002     Gable  CompShg
## 4     2Story           7           5      1915         1970     Gable  CompShg
## 5     2Story           8           5      2000         2000     Gable  CompShg
## 6     1.5Fin           5           5      1993         1995     Gable  CompShg
##   Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1     VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
## 2     MetalSd     MetalSd       None          0        TA        TA     CBlock
## 3     VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
## 4     Wd Sdng     Wd Shng       None          0        TA        TA     BrkTil
## 5     VinylSd     VinylSd    BrkFace        350        Gd        TA      PConc
## 6     VinylSd     VinylSd       None          0        TA        TA       Wood
##   BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1       Gd       TA           No          GLQ        706          Unf
## 2       Gd       TA           Gd          ALQ        978          Unf
## 3       Gd       TA           Mn          GLQ        486          Unf
## 4       TA       Gd           No          ALQ        216          Unf
## 5       Gd       TA           Av          GLQ        655          Unf
## 6       Gd       TA           No          GLQ        732          Unf
##   BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1          0       150         856    GasA        Ex          Y      SBrkr
## 2          0       284        1262    GasA        Ex          Y      SBrkr
## 3          0       434         920    GasA        Ex          Y      SBrkr
## 4          0       540         756    GasA        Gd          Y      SBrkr
## 5          0       490        1145    GasA        Ex          Y      SBrkr
## 6          0        64         796    GasA        Ex          Y      SBrkr
##   X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1       856       854            0      1710            1            0        2
## 2      1262         0            0      1262            0            1        2
## 3       920       866            0      1786            1            0        2
## 4       961       756            0      1717            1            0        1
## 5      1145      1053            0      2198            1            0        2
## 6       796       566            0      1362            1            0        1
##   HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1        1            3            1          Gd            8        Typ
## 2        0            3            1          TA            6        Typ
## 3        1            3            1          Gd            6        Typ
## 4        0            3            1          Gd            7        Typ
## 5        1            4            1          Gd            9        Typ
## 6        1            1            1          TA            5        Typ
##   Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1          0        <NA>     Attchd        2003          RFn          2
## 2          1          TA     Attchd        1976          RFn          2
## 3          1          TA     Attchd        2001          RFn          2
## 4          1          Gd     Detchd        1998          Unf          3
## 5          1          TA     Attchd        2000          RFn          3
## 6          0        <NA>     Attchd        1993          Unf          2
##   GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1        548         TA         TA          Y          0          61
## 2        460         TA         TA          Y        298           0
## 3        608         TA         TA          Y          0          42
## 4        642         TA         TA          Y          0          35
## 5        836         TA         TA          Y        192          84
## 6        480         TA         TA          Y         40          30
##   EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1             0          0           0        0   <NA>  <NA>        <NA>
## 2             0          0           0        0   <NA>  <NA>        <NA>
## 3             0          0           0        0   <NA>  <NA>        <NA>
## 4           272          0           0        0   <NA>  <NA>        <NA>
## 5             0          0           0        0   <NA>  <NA>        <NA>
## 6             0        320           0        0   <NA> MnPrv        Shed
##   MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1       0      2   2008       WD        Normal    208500
## 2       0      5   2007       WD        Normal    181500
## 3       0      9   2008       WD        Normal    223500
## 4       0      2   2006       WD       Abnorml    140000
## 5       0     12   2008       WD        Normal    250000
## 6     700     10   2009       WD        Normal    143000

First we will check for missing values and remove them if necessary

df[is.na(df)] <- 0

# If CentralAir is a factor or character representing yes/no, convert it to numeric (0 and 1)
if (is.character(df$CentralAir) || is.factor(df$CentralAir)) {
    df$CentralAir <- as.numeric(df$CentralAir == "Y")
}

Select features for modeling

df_mm <- dplyr::select(df, SalePrice, LotArea, YearBuilt, TotalBsmtSF, CentralAir, GrLivArea, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, BsmtFullBath, TotRmsAbvGrd, Fireplaces, GarageArea, GarageCars)

head(df_mm)

##   SalePrice LotArea YearBuilt TotalBsmtSF CentralAir GrLivArea FullBath
## 1    208500    8450      2003         856          1      1710        2
## 2    181500    9600      1976        1262          1      1262        2
## 3    223500   11250      2001         920          1      1786        2
## 4    140000    9550      1915         756          1      1717        1
## 5    250000   14260      2000        1145          1      2198        2
## 6    143000   14115      1993         796          1      1362        1
##   HalfBath BedroomAbvGr KitchenAbvGr BsmtFullBath TotRmsAbvGrd Fireplaces
## 1        1            3            1            1            8          0
## 2        0            3            1            0            6          1
## 3        1            3            1            1            6          1
## 4        0            3            1            1            7          1
## 5        1            4            1            1            9          1
## 6        1            1            1            1            5          0
##   GarageArea GarageCars
## 1        548          2
## 2        460          2
## 3        608          2
## 4        642          3
## 5        836          3
## 6        480          2

Building a multiple linear regression model

model <- lm(SalePrice ~ ., data = df_mm)
summary(model)

## 
## Call:
## lm(formula = SalePrice ~ ., data = df_mm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -548095  -18314   -2128   16073  294903 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -9.570e+05  1.041e+05  -9.194  < 2e-16 ***
## LotArea       2.841e-01  1.115e-01   2.548   0.0109 *  
## YearBuilt     5.061e+02  5.405e+01   9.363  < 2e-16 ***
## TotalBsmtSF   2.887e+01  3.346e+00   8.628  < 2e-16 ***
## CentralAir    1.355e+03  4.696e+03   0.289   0.7730    
## GrLivArea     6.224e+01  4.812e+00  12.935  < 2e-16 ***
## FullBath      7.667e+03  3.087e+03   2.483   0.0131 *  
## HalfBath     -1.302e+03  2.701e+03  -0.482   0.6298    
## BedroomAbvGr -1.472e+04  1.818e+03  -8.097 1.18e-15 ***
## KitchenAbvGr -4.622e+04  5.257e+03  -8.792  < 2e-16 ***
## BsmtFullBath  1.115e+04  2.193e+03   5.083 4.21e-07 ***
## TotRmsAbvGrd  7.518e+03  1.367e+03   5.499 4.51e-08 ***
## Fireplaces    7.649e+03  1.927e+03   3.969 7.58e-05 ***
## GarageArea    7.548e+00  1.084e+01   0.696   0.4862    
## GarageCars    1.596e+04  3.170e+03   5.037 5.34e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39380 on 1445 degrees of freedom
## Multiple R-squared:  0.7566, Adjusted R-squared:  0.7542 
## F-statistic: 320.8 on 14 and 1445 DF,  p-value: < 2.2e-16

Checking model assumptions by plotting diagnostic plots

par(mfrow=c(2,2))
plot(model)

CentralAir: The p-value is 0.7730, indicating it is not statistically significant at typical significance levels (e.g., 0.05), so it will be dropped.

model_refined1 <- lm(SalePrice ~ LotArea + YearBuilt + TotalBsmtSF + GrLivArea + 
                     FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + BsmtFullBath + 
                     TotRmsAbvGrd + Fireplaces + GarageArea + GarageCars, data = df_mm)
summary(model_refined1)

## 
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + TotalBsmtSF + 
##     GrLivArea + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + 
##     BsmtFullBath + TotRmsAbvGrd + Fireplaces + GarageArea + GarageCars, 
##     data = df_mm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -548428  -18336   -2177   15914  294879 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -9.633e+05  1.017e+05  -9.470  < 2e-16 ***
## LotArea       2.842e-01  1.115e-01   2.549   0.0109 *  
## YearBuilt     5.100e+02  5.229e+01   9.753  < 2e-16 ***
## TotalBsmtSF   2.891e+01  3.343e+00   8.646  < 2e-16 ***
## GrLivArea     6.220e+01  4.808e+00  12.936  < 2e-16 ***
## FullBath      7.608e+03  3.080e+03   2.470   0.0136 *  
## HalfBath     -1.285e+03  2.700e+03  -0.476   0.6341    
## BedroomAbvGr -1.467e+04  1.810e+03  -8.107 1.09e-15 ***
## KitchenAbvGr -4.646e+04  5.184e+03  -8.962  < 2e-16 ***
## BsmtFullBath  1.116e+04  2.192e+03   5.089 4.08e-07 ***
## TotRmsAbvGrd  7.518e+03  1.367e+03   5.501 4.47e-08 ***
## Fireplaces    7.713e+03  1.914e+03   4.030 5.87e-05 ***
## GarageArea    7.681e+00  1.082e+01   0.710   0.4780    
## GarageCars    1.595e+04  3.168e+03   5.034 5.41e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39370 on 1446 degrees of freedom
## Multiple R-squared:  0.7566, Adjusted R-squared:  0.7544 
## F-statistic: 345.7 on 13 and 1446 DF,  p-value: < 2.2e-16

HalfBath: The p-value is 0.6298, indicating it is not statistically significant, so it will be dropped.

model_refined2 <- lm(SalePrice ~ LotArea + YearBuilt + TotalBsmtSF + GrLivArea + 
                     FullBath + BedroomAbvGr + KitchenAbvGr + BsmtFullBath + 
                     TotRmsAbvGrd + Fireplaces + GarageArea + GarageCars, data = df_mm)
summary(model_refined2)

## 
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + TotalBsmtSF + 
##     GrLivArea + FullBath + BedroomAbvGr + KitchenAbvGr + BsmtFullBath + 
##     TotRmsAbvGrd + Fireplaces + GarageArea + GarageCars, data = df_mm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -548702  -18199   -2289   15874  295458 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -9.442e+05  9.348e+04 -10.101  < 2e-16 ***
## LotArea       2.864e-01  1.114e-01   2.572  0.01021 *  
## YearBuilt     5.001e+02  4.790e+01  10.441  < 2e-16 ***
## TotalBsmtSF   2.953e+01  3.076e+00   9.599  < 2e-16 ***
## GrLivArea     6.122e+01  4.348e+00  14.080  < 2e-16 ***
## FullBath      8.144e+03  2.866e+03   2.841  0.00455 ** 
## BedroomAbvGr -1.470e+04  1.809e+03  -8.126 9.45e-16 ***
## KitchenAbvGr -4.633e+04  5.175e+03  -8.952  < 2e-16 ***
## BsmtFullBath  1.119e+04  2.190e+03   5.110 3.65e-07 ***
## TotRmsAbvGrd  7.512e+03  1.366e+03   5.499 4.52e-08 ***
## Fireplaces    7.685e+03  1.913e+03   4.018 6.17e-05 ***
## GarageArea    8.158e+00  1.077e+01   0.757  0.44904    
## GarageCars    1.584e+04  3.159e+03   5.014 5.99e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39360 on 1447 degrees of freedom
## Multiple R-squared:  0.7565, Adjusted R-squared:  0.7545 
## F-statistic: 374.7 on 12 and 1447 DF,  p-value: < 2.2e-16

GarageArea: The p-value is 0.4862, indicating it is not statistically significant, so it will be dropped

model_refined3 <- lm(SalePrice ~ LotArea + YearBuilt + TotalBsmtSF + GrLivArea + 
                     FullBath + BedroomAbvGr + KitchenAbvGr + BsmtFullBath + 
                     TotRmsAbvGrd + Fireplaces + GarageCars, data = df_mm)
summary(model_refined3)

## 
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + TotalBsmtSF + 
##     GrLivArea + FullBath + BedroomAbvGr + KitchenAbvGr + BsmtFullBath + 
##     TotRmsAbvGrd + Fireplaces + GarageCars, data = df_mm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -544995  -18339   -2089   15905  294855 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -9.433e+05  9.346e+04 -10.093  < 2e-16 ***
## LotArea       2.895e-01  1.113e-01   2.602  0.00937 ** 
## YearBuilt     4.997e+02  4.789e+01  10.435  < 2e-16 ***
## TotalBsmtSF   2.993e+01  3.030e+00   9.876  < 2e-16 ***
## GrLivArea     6.168e+01  4.306e+00  14.323  < 2e-16 ***
## FullBath      7.971e+03  2.857e+03   2.790  0.00533 ** 
## BedroomAbvGr -1.476e+04  1.807e+03  -8.172 6.55e-16 ***
## KitchenAbvGr -4.645e+04  5.172e+03  -8.981  < 2e-16 ***
## BsmtFullBath  1.130e+04  2.186e+03   5.168 2.70e-07 ***
## TotRmsAbvGrd  7.480e+03  1.365e+03   5.478 5.06e-08 ***
## Fireplaces    7.521e+03  1.900e+03   3.958 7.91e-05 ***
## GarageCars    1.777e+04  1.860e+03   9.557  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39350 on 1448 degrees of freedom
## Multiple R-squared:  0.7564, Adjusted R-squared:  0.7546 
## F-statistic: 408.8 on 11 and 1448 DF,  p-value: < 2.2e-16

We will run diagnostics for the refined model

par(mfrow=c(2,2))
plot(model_refined3)

Preparing the data for submission, we will load the test data and prepare it. We will ensure the test data consists of the same variables, while handling missing values. We will then make predictions based on the refined model, and prepare the dataframe for submission.

df_test <- read.csv("https://raw.githubusercontent.com/NooriSelina/Data605/main/test.csv")

df_test <- df_test %>%
  mutate(CentralAir = as.numeric(CentralAir == "Y")) %>%  # Encode CentralAir
  replace(is.na(.), 0)  # Handle missing values

df_test_mm <- df_test %>%
  dplyr::select(LotArea, YearBuilt, TotalBsmtSF, CentralAir, GrLivArea,
         FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, BsmtFullBath,
         TotRmsAbvGrd, Fireplaces, GarageArea, GarageCars)

predictions <- predict(model_refined3, newdata = df_test_mm)

submission <- data.frame(Id = df_test$Id, SalePrice = predictions)


write.csv(submission, "kaggle_submission.csv", row.names = FALSE)
print("Submission file created successfully.")

## [1] "Submission file created successfully."

Kaggle username and score

Kaggle username: Noori Selina Public score: 0.19777

Thank you professor for a great semester :)