Your final is due by the end of day on 19 May. This project will show off your ability to understand the elements of the class. You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition (https://www.kaggle.com/c/house-prices-advanced-regression-techniques). I want you to do the following.

Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.

library("tidyverse")
library("dplyr")
df = read.csv("https://raw.githubusercontent.com/tonyCUNY/tonyCUNY/main/train.csv")

Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right!

# TotalBsmtSF is the independent variable; its histogram below shows right skewness

ggplot(df, aes(x = TotalBsmtSF)) +
      geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

X = df$TotalBsmtSF

Pick the dependent variable and define it as Y.

# SalePrice is the dependent variable; its histogram below also shows right skewness

ggplot(df, aes(x = SalePrice)) +
      geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Y = df$SalePrice
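
As a quick numeric check of the right skew (a small sketch; the moment coefficient of skewness is positive for right-skewed data):

# Moment coefficient of skewness: mean cubed deviation over sd cubed
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3
skew(X)  # positive, confirming right skew
skew(Y)  # positive, confirming right skew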

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3rd quartile of the X variable, and the small letter “y” is estimated as the 2nd quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.

  a. P(X>x | Y>y)  b. P(X>x, Y>y)  c. P(X<x | Y>y)
summary(X)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   795.8   991.5  1057.4  1298.2  6110.0
#3rd quartile of X:
x_3q = summary(X)["3rd Qu."]
summary(Y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000
# The 2nd quartile of Y is the median, 163000
y_2q = summary(Y)["Median"]
  a. P(X>x | Y>y)
X_1 = X > x_3q
Y_1 = Y > y_2q

n = sum(X_1 & Y_1) 
d = sum(Y_1)

n/d
## [1] 0.4519231

Interpretation: given that a house sold above the median price, there is about a 45.2% chance that its total basement area exceeds the 3rd quartile.
  b. P(X>x, Y>y)
n = sum(X_1 & Y_1) 
d = nrow(df)

n/d
## [1] 0.2253425

Interpretation: about 22.5% of all houses have both a basement area above the 3rd quartile and a sale price above the median.
  c. P(X<x | Y>y)
X_2 = X < x_3q
Y_1 = Y > y_2q

n = sum(X_2 & Y_1) 
d = sum(Y_1)

n/d
## [1] 0.5480769

Interpretation: given that a house sold above the median price, there is about a 54.8% chance that its basement area is below the 3rd quartile (the complement of part a).
# Note: strict inequalities are used throughout, so the 4 observations with Y
# exactly equal to the median are excluded; the grand total is 1456, not 1460.
table_counts <- matrix(0, nrow = 3, ncol = 3)
colnames(table_counts) <- c("< 2nd quartile", "> 2nd quartile", "Total")
rownames(table_counts) <- c("< 3rd quartile", "> 3rd quartile", "Total")

table_counts[1, 1] <- sum(X < x_3q & Y < y_2q)
table_counts[1, 2] <- sum(X < x_3q & Y > y_2q)
table_counts[1, 3] <- table_counts[1, 1] + table_counts[1, 2]
table_counts[2, 1] <- sum(X > x_3q & Y < y_2q)
table_counts[2, 2] <- sum(X > x_3q & Y > y_2q)
table_counts[2, 3] <- table_counts[2, 1] + table_counts[2, 2]
table_counts[3, 1] <- table_counts[1, 1] + table_counts[2, 1]
table_counts[3, 2] <- table_counts[1, 2] + table_counts[2, 2]
table_counts[3, 3] <- table_counts[1, 3] + table_counts[2, 3]

print(table_counts)
##                < 2nd quartile > 2nd quartile Total
## < 3rd quartile            693            399  1092
## > 3rd quartile             35            329   364
## Total                     728            728  1456
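A more compact alternative for the counts (a sketch using base R's table() and addmargins(); note that here the tied observations fall in the FALSE row/column, so the grand total is the full 1460):

addmargins(table(X > x_3q, Y > y_2q, dnn = c("X > 3rd Qu.", "Y > 2nd Qu.")))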

Does splitting the training data in this fashion make them independent?

No, splitting the training data in this fashion does not make them independent: P(X>x | Y>y) ≈ 0.45, which is far from the unconditional P(X>x) ≈ 0.25, so knowing that Y is above its median changes the probability that X is above its 3rd quartile.

Let A be the new variable counting those observations above the 3rd quartile for X, and let B be the new variable counting those observations above the 2nd quartile for Y. Does P(A|B)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.

The results below show that P(A|B) = 0.452 while P(A)P(B) = 0.125, so P(A|B) ≠ P(A)P(B) and A and B are not independent.

A = X > x_3q
B = Y > y_2q

# Find P(A|B)
P_A_given_B <- sum(A * B) / sum(B)
P_A_given_B
## [1] 0.4519231
# Find P(A) * P(B)
P_A <- sum(A) / length(A)
P_B <- sum(B) / length(B)
P_A * P_B
## [1] 0.1246575
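The joint probability can also be checked directly; it matches part b above and likewise differs from P(A)P(B):

# P(A and B): 0.2253, versus P(A)P(B) = 0.1247
P_AB <- sum(A & B) / length(A)
P_AB
## [1] 0.2253425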

Evaluate by running a Chi Square test for association.

c_test <- chisq.test(table(A, B))
print(c_test)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(A, B)
## X-squared = 313.61, df = 1, p-value < 2.2e-16
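For reference, chisq.test stores the expected cell counts under independence; comparing them with the observed table shows how far the data deviate (a quick sketch, output omitted):

# Expected counts if A and B were independent
c_test$expected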

The chi-square test evaluates the null hypothesis

H0: there is no association between the two categorical variables.

We reject H0 when the p-value is at most 0.05. Since the p-value here (< 2.2e-16) is far below that threshold, we conclude that A and B are statistically dependent.

Descriptive and Inferential Statistics.

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot of X and Y. Provide a 95% CI for the difference in the mean of the variables. Derive a correlation matrix for two of the quantitative variables you selected. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.

Provide univariate descriptive statistics and appropriate plots for the training data set.

summary(X)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   795.8   991.5  1057.4  1298.2  6110.0
summary(Y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000
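The five-number summaries above can be rounded out with spread measures (a small sketch; the histograms shown earlier serve as the univariate plots):

# Standard deviations of the two variables
sd(X)
sd(Y)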

Provide a scatterplot of X and Y.

# X represent TotalBsmtSF 
# Y represent SalePrice

ggplot(df, aes(x = X, y = Y)) +
  geom_point() +  
  labs(x = "TotalBsmtSF", y = "SalePrice", title = "Scatter Plot of TotalBsmtSF and SalePrice")

Provide a 95% CI for the difference in the mean of the variables.

The 95% confidence interval for the difference in means is (-183942.2, -175785.3). Since the two variables are measured on very different scales (square feet vs. dollars), the interval is unsurprisingly far from zero.

t.test(X, Y, conf.level = 0.95)
## 
##  Welch Two Sample t-test
## 
## data:  X and Y
## t = -86.509, df = 1459.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -183942.2 -175785.3
## sample estimates:
##  mean of x  mean of y 
##   1057.429 180921.196

Derive a correlation matrix for two of the quantitative variables you selected.

df2 = df |> 
        select(TotalBsmtSF, SalePrice)

co_matrix = cor(df2)

co_matrix
##             TotalBsmtSF SalePrice
## TotalBsmtSF   1.0000000 0.6135806
## SalePrice     0.6135806 1.0000000

Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.

Since the p-value (< 2.2e-16) is less than the chosen significance level of 0.01, we reject the null hypothesis that the correlation is 0. With 99% confidence the true correlation lies between 0.570 and 0.654: larger basements are associated with higher sale prices.

co_test <- cor.test(df2$TotalBsmtSF, df2$SalePrice, method = "pearson", conf.level = 0.99)
print(co_test)
## 
##  Pearson's product-moment correlation
## 
## data:  df2$TotalBsmtSF and df2$SalePrice
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  0.5697562 0.6539251
## sample estimates:
##       cor 
## 0.6135806

Linear Algebra and Correlation.

Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct principal components analysis (research this!) and interpret. Discuss.

Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)

precision_matrix = solve(co_matrix)
precision_matrix
##             TotalBsmtSF  SalePrice
## TotalBsmtSF   1.6038006 -0.9840609
## SalePrice    -0.9840609  1.6038006
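As a sanity check, for a 2x2 correlation matrix the diagonal entries of the precision matrix equal the variance inflation factor 1/(1 - r^2):

# VIF identity: should reproduce the diagonal value, ~1.6038
r <- co_matrix[1, 2]
1 / (1 - r^2)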

Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.

# Multiply the correlation matrix by the precision matrix

round(co_matrix %*% precision_matrix)
##             TotalBsmtSF SalePrice
## TotalBsmtSF           1         0
## SalePrice             0         1
# Multiply the precision matrix by the correlation matrix

round(precision_matrix %*% co_matrix)
##             TotalBsmtSF SalePrice
## TotalBsmtSF           1         0
## SalePrice             0         1

Conduct principal components analysis (research this!) and interpret. Discuss.

# PCA needs correlated inputs; the off-diagonal correlation here is 0.61.
# Note mean(cor(df2)) averages in the diagonal 1s, which inflates the value.
mean(cor(df2))
## [1] 0.8067903

There are two ways to evaluate this PCA:

  1. Check whether the PCs capture the essence of the original variables.

The component loadings tell us how much each variable contributes to a component. Here Component 1 loads entirely on SalePrice and Component 2 entirely on TotalBsmtSF. This happens because princomp() was run on the unscaled data, where SalePrice's variance (in squared dollars) dwarfs TotalBsmtSF's (in squared square feet), so the first component simply tracks SalePrice; a standardized alternative is sketched after the output below.

  2. Check whether the PCs are independent.

The correlation among the PC scores is essentially zero, so the PCs are uncorrelated, as PCA guarantees.

pca_result <- princomp(df2)
pca_result$loadings
## 
## Loadings:
##             Comp.1 Comp.2
## TotalBsmtSF         1.000
## SalePrice    1.000       
## 
##                Comp.1 Comp.2
## SS loadings       1.0    1.0
## Proportion Var    0.5    0.5
## Cumulative Var    0.5    1.0
PC = pca_result$scores
cor(PC)
##              Comp.1       Comp.2
## Comp.1 1.000000e+00 3.352131e-15
## Comp.2 3.352131e-15 1.000000e+00
summary(pca_result)
## Importance of components:
##                              Comp.1       Comp.2
## Standard deviation     79415.747781 3.462952e+02
## Proportion of Variance     0.999981 1.901391e-05
## Cumulative Proportion      0.999981 1.000000e+00
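Because the raw variables have vastly different variances, PC1 above is essentially SalePrice alone. A scale-free check runs PCA on the correlation matrix instead (a sketch; for two standardized variables the loadings are exactly ±1/√2 ≈ 0.707 and the component variances are 1 + r and 1 − r, so Comp.1 would explain (1 + 0.614)/2 ≈ 81% of the variance):

# PCA on standardized variables (correlation matrix)
pca_cor <- princomp(df2, cor = TRUE)
pca_cor$loadings
summary(pca_cor)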

Calculus-Based Probability & Statistics.

Many times, it makes sense to fit a closed form distribution to data. For your variable that is skewed to the right, shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

For your variable that is skewed to the right, shift it so that the minimum value is above zero.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
X = df$TotalBsmtSF
summary(X)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   795.8   991.5  1057.4  1298.2  6110.0
# Adding a constant value to all observations so that the minimum value is above zero.
X2 = X + 1
summary(X2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   796.8   992.5  1058.4  1299.2  6111.0

Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).

# Fit an exponential distribution. The exponential MLE is rate = 1/mean, and
# dexp(0) is finite, so fitting on the unshifted X runs; fitting the shifted
# X2 instead would change the rate only marginally (1/1058.4 vs 1/1057.4).
exp_pdf = fitdistr(X, "exponential")
exp_pdf
##        rate    
##   9.456896e-04 
##  (2.474983e-05)

Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)).

# Optimal lambda value
lambda = exp_pdf$estimate
lambda
##         rate 
## 0.0009456896
# Samples from exponential distribution
samples <- rexp(1000, rate = lambda)

Plot a histogram and compare it with a histogram of your original variable.

The simulated sample shows the smooth, monotonically decreasing right-skewed shape of an exponential distribution (its mode is at zero), while the original variable is right-skewed but has its bulk concentrated around 800 to 1300 square feet.

# Plot histograms
par(mfrow = c(1, 2))
hist(X, main = "Original Variable", xlab = "Value")
hist(samples, main = "Exponential Distribution", xlab = "Value")
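A direct shape comparison overlays the fitted density on the data (a sketch using the shifted X2 and the fitted rate):

# Histogram on the density scale with the fitted exponential curve
hist(X2, freq = FALSE, main = "X2 with fitted exponential", xlab = "Value")
curve(dexp(x, rate = lambda), add = TRUE, col = "red")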

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality.

# Empirical 5th and 95th percentiles of the 1000 simulated draws
# (these approximate the CDF-based values; see the qexp sketch below)
quantile(samples, probs=c(0.05, 0.95))
##         5%        95% 
##   58.95148 3203.88076
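The CDF-based percentiles come directly from the quantile function of the fitted exponential (a sketch; with the fitted rate these evaluate to roughly 54.2 and 3167.8, close to the simulated values above):

# Theoretical 5th and 95th percentiles from the exponential CDF
qexp(c(0.05, 0.95), rate = lambda)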
# 95% confidence interval for the mean of X, assuming normality
mean_x <- mean(X)
std_x <- sd(X)
n <- length(X)
z <- qnorm(1 - 0.05/2)

lower_ci <- mean_x - z * (std_x / sqrt(n))
upper_ci <- mean_x + z * (std_x / sqrt(n))

lower_ci
## [1] 1034.926
upper_ci
## [1] 1079.933

Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

# Empirical 5th and 95th percentiles of the original data
quantile(X, probs=c(0.05, 0.95))
##     5%    95% 
##  519.3 1753.0

The exponential percentiles (roughly 54 and 3170 from the CDF, or 59 and 3204 from the simulated draws) are far from the empirical 5th and 95th percentiles (519.3 and 1753.0). The exponential function therefore does not describe the variability and shape of the original data, implying it is not a good-fitting model here.

Because the exponential density forces its mode to zero while the selected variable is concentrated well above zero, percentiles calculated from the exponential probability density function (PDF) inaccurately represent the tails of the distribution.

Modeling

Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

# dplyr::select is called explicitly because MASS (loaded above) masks select()
df_mm <- dplyr::select(df, SalePrice, LotArea, YearBuilt, TotalBsmtSF, CentralAir, GrLivArea, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, BsmtFullBath, TotRmsAbvGrd, Fireplaces, GarageArea, GarageCars)

head(df_mm)
##   SalePrice LotArea YearBuilt TotalBsmtSF CentralAir GrLivArea FullBath
## 1    208500    8450      2003         856          Y      1710        2
## 2    181500    9600      1976        1262          Y      1262        2
## 3    223500   11250      2001         920          Y      1786        2
## 4    140000    9550      1915         756          Y      1717        1
## 5    250000   14260      2000        1145          Y      2198        2
## 6    143000   14115      1993         796          Y      1362        1
##   HalfBath BedroomAbvGr KitchenAbvGr BsmtFullBath TotRmsAbvGrd Fireplaces
## 1        1            3            1            1            8          0
## 2        0            3            1            0            6          1
## 3        1            3            1            1            6          1
## 4        0            3            1            1            7          1
## 5        1            4            1            1            9          1
## 6        1            1            1            1            5          0
##   GarageArea GarageCars
## 1        548          2
## 2        460          2
## 3        608          2
## 4        642          3
## 5        836          3
## 6        480          2

m_full <- lm(SalePrice ~ LotArea + YearBuilt + TotalBsmtSF + CentralAir + GrLivArea + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + BsmtFullBath + TotRmsAbvGrd + Fireplaces + GarageArea + GarageCars, data = df_mm)
summary(m_full)
## 
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + TotalBsmtSF + 
##     CentralAir + GrLivArea + FullBath + HalfBath + BedroomAbvGr + 
##     KitchenAbvGr + BsmtFullBath + TotRmsAbvGrd + Fireplaces + 
##     GarageArea + GarageCars, data = df_mm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -548095  -18314   -2128   16073  294903 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -9.570e+05  1.041e+05  -9.194  < 2e-16 ***
## LotArea       2.841e-01  1.115e-01   2.548   0.0109 *  
## YearBuilt     5.061e+02  5.405e+01   9.363  < 2e-16 ***
## TotalBsmtSF   2.887e+01  3.346e+00   8.628  < 2e-16 ***
## CentralAirY   1.355e+03  4.696e+03   0.289   0.7730    
## GrLivArea     6.224e+01  4.812e+00  12.935  < 2e-16 ***
## FullBath      7.667e+03  3.087e+03   2.483   0.0131 *  
## HalfBath     -1.302e+03  2.701e+03  -0.482   0.6298    
## BedroomAbvGr -1.472e+04  1.818e+03  -8.097 1.18e-15 ***
## KitchenAbvGr -4.622e+04  5.257e+03  -8.792  < 2e-16 ***
## BsmtFullBath  1.115e+04  2.193e+03   5.083 4.21e-07 ***
## TotRmsAbvGrd  7.518e+03  1.367e+03   5.499 4.51e-08 ***
## Fireplaces    7.649e+03  1.927e+03   3.969 7.58e-05 ***
## GarageArea    7.548e+00  1.084e+01   0.696   0.4862    
## GarageCars    1.596e+04  3.170e+03   5.037 5.34e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39380 on 1445 degrees of freedom
## Multiple R-squared:  0.7566, Adjusted R-squared:  0.7542 
## F-statistic: 320.8 on 14 and 1445 DF,  p-value: < 2.2e-16

Drop the variable with the highest p-value and re-fit the model.

# drop CentralAir (coefficient CentralAirY) since it has the highest p-value (0.7730)

m_full_drop1 <- lm(SalePrice ~ LotArea + YearBuilt + TotalBsmtSF + GrLivArea + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + BsmtFullBath + TotRmsAbvGrd + Fireplaces + GarageArea + GarageCars, data = df_mm)
summary(m_full_drop1)
## 
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + TotalBsmtSF + 
##     GrLivArea + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + 
##     BsmtFullBath + TotRmsAbvGrd + Fireplaces + GarageArea + GarageCars, 
##     data = df_mm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -548428  -18336   -2177   15914  294879 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -9.633e+05  1.017e+05  -9.470  < 2e-16 ***
## LotArea       2.842e-01  1.115e-01   2.549   0.0109 *  
## YearBuilt     5.100e+02  5.229e+01   9.753  < 2e-16 ***
## TotalBsmtSF   2.891e+01  3.343e+00   8.646  < 2e-16 ***
## GrLivArea     6.220e+01  4.808e+00  12.936  < 2e-16 ***
## FullBath      7.608e+03  3.080e+03   2.470   0.0136 *  
## HalfBath     -1.285e+03  2.700e+03  -0.476   0.6341    
## BedroomAbvGr -1.467e+04  1.810e+03  -8.107 1.09e-15 ***
## KitchenAbvGr -4.646e+04  5.184e+03  -8.962  < 2e-16 ***
## BsmtFullBath  1.116e+04  2.192e+03   5.089 4.08e-07 ***
## TotRmsAbvGrd  7.518e+03  1.367e+03   5.501 4.47e-08 ***
## Fireplaces    7.713e+03  1.914e+03   4.030 5.87e-05 ***
## GarageArea    7.681e+00  1.082e+01   0.710   0.4780    
## GarageCars    1.595e+04  3.168e+03   5.034 5.41e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39370 on 1446 degrees of freedom
## Multiple R-squared:  0.7566, Adjusted R-squared:  0.7544 
## F-statistic: 345.7 on 13 and 1446 DF,  p-value: < 2.2e-16
# drop HalfBath since it has p-value of 0.6341        

m_full_drop2 <- lm(SalePrice ~ LotArea + YearBuilt + TotalBsmtSF + GrLivArea + FullBath + BedroomAbvGr + KitchenAbvGr + BsmtFullBath + TotRmsAbvGrd + Fireplaces + GarageArea + GarageCars, data = df_mm)
summary(m_full_drop2)
## 
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + TotalBsmtSF + 
##     GrLivArea + FullBath + BedroomAbvGr + KitchenAbvGr + BsmtFullBath + 
##     TotRmsAbvGrd + Fireplaces + GarageArea + GarageCars, data = df_mm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -548702  -18199   -2289   15874  295458 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -9.442e+05  9.348e+04 -10.101  < 2e-16 ***
## LotArea       2.864e-01  1.114e-01   2.572  0.01021 *  
## YearBuilt     5.001e+02  4.790e+01  10.441  < 2e-16 ***
## TotalBsmtSF   2.953e+01  3.076e+00   9.599  < 2e-16 ***
## GrLivArea     6.122e+01  4.348e+00  14.080  < 2e-16 ***
## FullBath      8.144e+03  2.866e+03   2.841  0.00455 ** 
## BedroomAbvGr -1.470e+04  1.809e+03  -8.126 9.45e-16 ***
## KitchenAbvGr -4.633e+04  5.175e+03  -8.952  < 2e-16 ***
## BsmtFullBath  1.119e+04  2.190e+03   5.110 3.65e-07 ***
## TotRmsAbvGrd  7.512e+03  1.366e+03   5.499 4.52e-08 ***
## Fireplaces    7.685e+03  1.913e+03   4.018 6.17e-05 ***
## GarageArea    8.158e+00  1.077e+01   0.757  0.44904    
## GarageCars    1.584e+04  3.159e+03   5.014 5.99e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39360 on 1447 degrees of freedom
## Multiple R-squared:  0.7565, Adjusted R-squared:  0.7545 
## F-statistic: 374.7 on 12 and 1447 DF,  p-value: < 2.2e-16
# drop GarageArea since it has p-value of 0.44904            

m_full_drop3 <- lm(SalePrice ~ LotArea + YearBuilt + TotalBsmtSF + GrLivArea + FullBath + BedroomAbvGr + KitchenAbvGr + BsmtFullBath + TotRmsAbvGrd + Fireplaces + GarageCars, data = df_mm)
summary(m_full_drop3)
## 
## Call:
## lm(formula = SalePrice ~ LotArea + YearBuilt + TotalBsmtSF + 
##     GrLivArea + FullBath + BedroomAbvGr + KitchenAbvGr + BsmtFullBath + 
##     TotRmsAbvGrd + Fireplaces + GarageCars, data = df_mm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -544995  -18339   -2089   15905  294855 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -9.433e+05  9.346e+04 -10.093  < 2e-16 ***
## LotArea       2.895e-01  1.113e-01   2.602  0.00937 ** 
## YearBuilt     4.997e+02  4.789e+01  10.435  < 2e-16 ***
## TotalBsmtSF   2.993e+01  3.030e+00   9.876  < 2e-16 ***
## GrLivArea     6.168e+01  4.306e+00  14.323  < 2e-16 ***
## FullBath      7.971e+03  2.857e+03   2.790  0.00533 ** 
## BedroomAbvGr -1.476e+04  1.807e+03  -8.172 6.55e-16 ***
## KitchenAbvGr -4.645e+04  5.172e+03  -8.981  < 2e-16 ***
## BsmtFullBath  1.130e+04  2.186e+03   5.168 2.70e-07 ***
## TotRmsAbvGrd  7.480e+03  1.365e+03   5.478 5.06e-08 ***
## Fireplaces    7.521e+03  1.900e+03   3.958 7.91e-05 ***
## GarageCars    1.777e+04  1.860e+03   9.557  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 39350 on 1448 degrees of freedom
## Multiple R-squared:  0.7564, Adjusted R-squared:  0.7546 
## F-statistic: 408.8 on 11 and 1448 DF,  p-value: < 2.2e-16
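As a cross-check on the manual backward elimination above, step() automates the search (a sketch; note it minimizes AIC rather than dropping by p-value, so it may keep a slightly different set of predictors):

# Automated backward elimination by AIC
m_step <- step(m_full, direction = "backward", trace = FALSE)
summary(m_step)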

Verify that the conditions for this model are reasonable using diagnostic plots.

  1. Check for outliers. There are no extreme observations that cause concern, and the distribution of the residuals is nearly normal (seen in both the histogram and the normal Q-Q plot).

  2. The variability of the residuals is nearly constant: the Residuals vs. Fitted plot shows a fairly uniform spread of residuals across fitted values.

# Diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
plot(m_full_drop3)

hist(m_full_drop3$residuals)

Perform predictions using the test.csv.

My Kaggle user name is TLeung1234. My public score is 0.1977.

test = read.csv("https://raw.githubusercontent.com/tonyCUNY/tonyCUNY/main/test.csv")

# Keep only numeric columns; the final model uses numeric predictors only
test <- select_if(test, is.numeric)

# Simple imputation: replace missing predictor values with 0
test[is.na(test)] <- 0

m_predict = predict(m_full_drop3, test)

# Assemble the submission file with the required Id and SalePrice columns
kaggle <- as.data.frame(cbind(test$Id, m_predict))
colnames(kaggle) <- c("Id", "SalePrice")

write.csv(kaggle, file = "data605_CL_submission.csv", quote=FALSE, row.names=FALSE)
knitr::include_graphics('score.png')
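Since the leaderboard metric is root mean squared log error, a rough in-sample check is possible before submitting (a sketch; fitted values are floored at 1 to guard against logging a non-positive prediction):

# Approximate in-sample RMSLE of the final model
pred_train <- predict(m_full_drop3, df_mm)
sqrt(mean((log(pmax(pred_train, 1)) - log(df_mm$SalePrice))^2))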