Your final is due by the end of day on 19 May. This project will show off your ability to understand the elements of the class. You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition (Kaggle Link). I want you to do the following.
# load libraries
library(tidyverse)
library(reactable)
library(DescTools)
library(reshape2)
library(MASS)
library(devtools)
library(ggbiplot)
library(zoo)
# load data
training_data <- read.csv("https://raw.githubusercontent.com/eddiexunyc/DATA605_FINAL/main/Resources/train.csv")
test_data <- read.csv("https://raw.githubusercontent.com/eddiexunyc/DATA605_FINAL/main/Resources/test.csv")
# view train data
reactable(training_data)
Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.
For the quantitative independent variable, the lot size in square feet (LotArea) will be defined as the variable X, and for the dependent variable, the sale price (SalePrice) will be defined as the variable Y.
# histogram
variable_x_histgram <- ggplot(training_data,aes(x = LotArea)) +
geom_histogram(colour = 4, fill = "white") +
theme_minimal()
# set the Lot Area as x and Sale Price as y
variable_x <- training_data$LotArea
variable_y <- training_data$SalePrice
# histogram plot
variable_x_histgram
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the third quartile of the X variable, and the small letter “y” is estimated as the second quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.
variable_x_summary <- summary(variable_x)
variable_y_summary <- summary(variable_y)
# 3rd quartile of x and 2nd quartile of y
x_prob_quartile <- round(variable_x_summary["3rd Qu."])
y_prob_quartile <- round(variable_y_summary["Median"])
# print
variable_x_summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
variable_y_summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
Based on both summaries, the third quartile of the variable X is 11602 and the second quartile or median of the variable Y is 163000.
# calculation on less than 3rd quartile
both_less_qrt <- sum(variable_x < x_prob_quartile & variable_y < y_prob_quartile)
y_greater_qrt <- sum(variable_x < x_prob_quartile & variable_y > y_prob_quartile)
# calculation on more than 3rd quartile
both_great_qrt <- sum(variable_x > x_prob_quartile & variable_y > y_prob_quartile)
y_less_qrt <- sum(variable_x > x_prob_quartile & variable_y < y_prob_quartile)
# calculation on total
less_3qrt_total <- both_less_qrt + y_greater_qrt
great_3qrt_total <- both_great_qrt + y_less_qrt
less_2qrt_total <- both_less_qrt + y_less_qrt
great_2qrt_total <- both_great_qrt + y_greater_qrt
grand_total <- less_3qrt_total + great_3qrt_total
# input values into the table matrix
table_matrix <- matrix(c(both_less_qrt, y_greater_qrt, less_3qrt_total,
y_less_qrt, both_great_qrt, great_3qrt_total,
less_2qrt_total, great_2qrt_total, grand_total),
nrow = 3, ncol = 3, byrow = TRUE)
colnames(table_matrix) <- c("less than 2nd quartile", "greater than 2nd quartile", "Total")
rownames(table_matrix) <- c("less than 3rd quartile", "greater than 3rd quartile", "Total")
table_matrix
##                           less than 2nd quartile greater than 2nd quartile Total
## less than 3rd quartile                        639                       452  1091
## greater than 3rd quartile                      89                       276   365
## Total                                         728                       728  1456
Based on the table, there are 1456 observations in total (ties with the quartile cut-offs are excluded). The counts below and above the 2nd quartile of Y are exactly the same, 728 each, as expected for a median split. There are far more observations below the 3rd quartile of X (1091) than above it (365). Using the formula for conditional probability, each requested probability can be computed from these counts, as shown below.
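With small x the third quartile of X, small y the median of Y, and 1,460 training rows, the three probabilities computed by the functions below work out to the following from the count table:
\[P(X > x \mid Y > y) = \frac{276}{728} \approx 0.379 \\ P(X > x,\ Y > y) = \frac{276}{1460} \approx 0.189 \\ P(X < x \mid Y > y) = \frac{452}{728} \approx 0.621\]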
# probability a: P(X > x | Y > y)
prob_a_func <- function(x, y, xp, yp){
  result <- sum(x > xp & y > yp) / sum(y > yp)
  return(result)
}
# probability b: P(X > x, Y > y)
prob_b_func <- function(x, y, xp, yp){
  result <- sum(x > xp & y > yp) / length(x)
  return(result)
}
# probability c: P(X < x | Y > y)
prob_c_func <- function(x, y, xp, yp){
  result <- sum(x < xp & y > yp) / sum(y > yp)
  return(result)
}
prob_a_value <- prob_a_func(variable_x, variable_y, x_prob_quartile, y_prob_quartile)
prob_a_value
## [1] 0.3791209
prob_b_value <- prob_b_func(variable_x, variable_y, x_prob_quartile, y_prob_quartile)
prob_b_value
## [1] 0.1890411
prob_c_value <- prob_c_func(variable_x, variable_y, x_prob_quartile, y_prob_quartile)
prob_c_value
## [1] 0.6208791
Probability (a) means that, given a house sold above the median price, there is roughly a 38% chance that its lot exceeds the third quartile of lot area. Probability (b) is the joint probability that a house is both above the third quartile in lot area and above the median in price, about 19% of the training data. Probability (c) is the complement of (a): given a sale above the median price, about 62% of the lots fall below the third quartile.
Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 3rd quartile for X, and let B be the new variable counting those observations above the 2nd quartile for Y. Does P(A|B)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
\[P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(A \mid B) = P(A) \;\text{ if and only if } A \text{ and } B \text{ are independent.}\]
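A quick numerical check, using the counts already computed for the table above (here A is the event that LotArea exceeds its third quartile and B is the event that SalePrice exceeds its median):
# compare P(A|B) with P(A) and with P(A)P(B), using the table counts
p_A_given_B <- both_great_qrt / great_2qrt_total   # 276 / 728  ~ 0.379
p_A <- great_3qrt_total / grand_total              # 365 / 1456 ~ 0.251
p_B <- great_2qrt_total / grand_total              # 728 / 1456 = 0.5
c(p_A_given_B = p_A_given_B, p_A = p_A, p_A_times_p_B = p_A * p_B)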
The check gives P(A|B) ≈ 0.379, while P(A) ≈ 0.251 and P(A)P(B) ≈ 0.125, so P(A|B) ≠ P(A)P(B): splitting the training data in this fashion does not make the events independent. A Chi-Square test will be used to evaluate whether the association is statistically significant. The hypotheses are:
\[H_{0}: \text{Sale Price and Lot Size are independent} \\ H_{1}: \text{Sale Price and Lot Size are dependent}\]
chi_result <- chisq.test(variable_x, variable_y)
chi_result
##
## Pearson's Chi-squared test
##
## data: variable_x and variable_y
## X-squared = 735095, df = 709664, p-value < 2.2e-16
Using a significance level of 0.05 for the chi-square test, the p-value (< 2.2e-16) is far below that threshold, so the null hypothesis of independence is rejected. This agrees with the numerical check above: Sale Price and Lot Size are not independent.
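Note that chisq.test() applied to the raw vectors treats every distinct LotArea and SalePrice value as its own category, which is why the degrees of freedom are so large. A more conventional check of the same question is a Chi-Square test of association on the 2x2 table of counts built earlier; a minimal sketch:
# chi-square test of association on the quartile split (rows: X vs its 3rd quartile, columns: Y vs its median)
split_table <- matrix(c(both_less_qrt, y_greater_qrt,
                        y_less_qrt, both_great_qrt),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(LotArea = c("below 3rd Qu.", "above 3rd Qu."),
                                      SalePrice = c("below median", "above median")))
chisq.test(split_table)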
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatter plot of X and Y. Provide a 95% CI for the difference in the mean of the variables. Derive a correlation matrix for two of the quantitative variables you selected. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.
ggplot(training_data, aes(x = variable_x, y = variable_y, color = variable_y)) +
geom_point() +
geom_smooth(method=lm) +
theme_minimal() +
xlab("Lot size in square feet ") +
ylab("Sale Price") +
ggtitle("Scatter plot between Lot size in square feet & Sale Price")
As shown in the plot, the relationship between lot size and sale price appears weak: the fitted line slopes only gently upward, and a handful of extreme lot-size outliers stretch the x-axis and obscure the trend.
# 95% confidence interval for the difference in means of variable X and variable Y
MeanDiffCI(variable_x, variable_y, conf.level = 0.95)
## meandiff lwr.ci upr.ci
## -170404.4 -174514.7 -166294.1
Given the 95% confidence interval for the difference in the mean of the variables, the lower bound is -174515 and the upper bound is -166294. The interval lies far below zero simply because the mean Sale Price is much larger than the mean Lot Area, so the difference (mean of X minus mean of Y) is strongly negative.
# new dataset with variable x and y
corr_var <- cbind(variable_x, variable_y) %>%
as.matrix()
# create a cormat and melted cormat
housing_cormat <- round(cor(corr_var), 2)
melted_housing_cormat <- melt(housing_cormat)
# create a heat map
ggplot(data = melted_housing_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile(color = "white",
lwd = 1.5,
linetype = 1) +
geom_text(aes(label = round(value, 2)), color = "white", size = 4) +
theme_minimal()
The heat map shows only a weak correlation (about 0.26) between the two variables. Returning to the earlier question of association, a formal test is needed to decide whether the true correlation is zero:
\[H_{0}: \rho = 0 \;\text{(Sale Price and Lot Size are uncorrelated)} \\ H_{1}: \rho \neq 0 \;\text{(Sale Price and Lot Size are correlated)}\]
corr_test <- cor.test(training_data$LotArea, training_data$SalePrice, method = "pearson", conf.level = 0.99)
corr_test
##
## Pearson's product-moment correlation
##
## data: training_data$LotArea and training_data$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.2000196 0.3254375
## sample estimates:
## cor
## 0.2638434
Using the Pearson method to measure the strength of the linear relationship between LotArea and SalePrice, the p-value is less than 2.2e-16, far below the 0.01 significance level, and the 99% confidence interval (0.20, 0.33) excludes zero. The null hypothesis of zero correlation is therefore rejected: LotArea and SalePrice are significantly, though only weakly, correlated, with a correlation coefficient of about 0.264.
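As a sanity check, the reported t statistic can be reproduced from the sample correlation with the usual formula:
\[t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}} = \frac{0.2638\sqrt{1458}}{\sqrt{1-0.2638^{2}}} \approx 10.4\]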
Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct principal components analysis (research this!) and interpret. Discuss.
# inverted correlation matrix
precision_corr <- solve(housing_cormat)
melted_precision_cormat <- melt(precision_corr)
# create a heat map
ggplot(data = melted_precision_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile(color = "white",
lwd = 1.5,
linetype = 1) +
geom_text(aes(label = round(value, 2)), color = "white", size = 4) +
theme_minimal()
# multiplication of correlation matrix and precision matrix
corr_prec1_matrix <- housing_cormat %*% precision_corr
melted_corr_prec1 <- melt(corr_prec1_matrix)
# create a heat map
ggplot(data = melted_corr_prec1, aes(x=Var1, y=Var2, fill=value)) +
geom_tile(color = "white",
lwd = 1.5,
linetype = 1) +
geom_text(aes(label = round(value, 3)), color = "white", size = 4) +
theme_minimal()
# multiplication of precision matrix and correlation matrix
corr_prec2_matrix <- precision_corr %*% housing_cormat
melted_corr_prec2 <- melt(corr_prec2_matrix)
# create a heat map
ggplot(data = melted_corr_prec2, aes(x=Var1, y=Var2, fill=value)) +
geom_tile(color = "white",
lwd = 1.5,
linetype = 1) +
geom_text(aes(label = round(value, 3)), color = "white", size = 4) +
theme_minimal()
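Both products should recover the 2x2 identity matrix (any departure is only floating-point error, since the precision matrix is the exact inverse of the rounded correlation matrix); a quick check, as a sketch:
# confirm that R %*% R^-1 and R^-1 %*% R are (numerically) the identity matrix
round(corr_prec1_matrix, 3)
round(corr_prec2_matrix, 3)
all.equal(corr_prec1_matrix, diag(2), check.attributes = FALSE)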
The correlation between the two variables is weak, and principal component analysis (PCA) will be used to examine how the variance is shared between them. PCA is a dimensionality-reduction technique that projects the data onto a small number of orthogonal principal components, ordered by the amount of variance each one captures; it is often used to simplify predictive models. An R-bloggers post (R-blogger Link) will be used as a reference.
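Because only two standardized variables are involved, the PCA can also be done by hand from the eigendecomposition of the correlation matrix: the eigenvalues are 1 + r and 1 - r, and their square roots are the component standard deviations. A minimal sketch (housing_cormat was rounded to two decimals, so the numbers will only approximately match the prcomp() output below):
# PCA by hand: eigendecomposition of the 2x2 correlation matrix
corr_eigen <- eigen(housing_cormat)
corr_eigen$values                            # variances of PC1 and PC2 (1 + r and 1 - r)
sqrt(corr_eigen$values)                      # standard deviations of the components
corr_eigen$values / sum(corr_eigen$values)   # proportion of variance explained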
# principle components analysis
pca_var <- prcomp(corr_var, center = TRUE, scale. = TRUE)
summary(pca_var)
## Importance of components:
## PC1 PC2
## Standard deviation 1.1242 0.8580
## Proportion of Variance 0.6319 0.3681
## Cumulative Proportion 0.6319 1.0000
The PCA summary shows that PC1 captures about 63% of the variability (more than half) and PC2 captures the remaining 37%.
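These proportions follow directly from the component standard deviations reported above:
\[\frac{1.1242^{2}}{1.1242^{2}+0.8580^{2}} = \frac{1.2638}{2.0000} \approx 0.632 \\ \frac{0.8580^{2}}{1.1242^{2}+0.8580^{2}} = \frac{0.7362}{2.0000} \approx 0.368\]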
pca_plot1 <- ggscreeplot(pca_var) +
theme_minimal()
pca_plot1
The scree plot shows that PC1 alone accounts for most of the variance.
pca_plot2 <- ggbiplot(pca_var, obs.scale = 1, var.scale = 1,
var.factor = 1.2,
ellipse = TRUE, ellipse.level = 0.5, ellipse.alpha = 0.1,
circle = TRUE, varname.size = 3, varname.color = "lightblue") +
theme(legend.direction = 'horizontal', legend.position = 'top') +
theme_minimal()
pca_plot2
The PCA biplot shows how variables X and Y load on the principal components. The two arrows sit fairly far apart (closer to a right angle than to overlapping), indicating that the variables are only weakly correlated.
Many times, it makes sense to fit a closed form distribution to data. For your variable that is skewed to the right, shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See MASS). Find the optimal value for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
# run fitdistr function and extract the lambda value
exp_variable_x <- fitdistr(variable_x, "exponential")
lambda_x <- exp_variable_x$estimate
# 1000 samples from the exponential distribution
sample_size <- 1000
sample_x <- rexp(sample_size, rate = lambda_x) %>%
as.data.frame()
colnames(sample_x) <- c("sample_value")
# exponential histogram
exp_variable_x_histogram <- ggplot(sample_x,aes(x = sample_value)) +
geom_histogram(colour = 4, fill = "white") +
theme_minimal()
#plot between the original and exponential
par(mfrow=c(1,2))
variable_x_histgram + ggtitle("Original Distribution")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
exp_variable_x_histogram + ggtitle("Exponential Distribution")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
cdf_quantile <- quantile(sample_x$sample_value, probs=c(0.05, 0.95))
cdf_quantile
## 5% 95%
## 533.6617 29977.3458
Using the 1,000 samples drawn from the fitted exponential distribution, the simulated 5th and 95th percentiles are about 533.66 and 29,977.35 respectively (these values vary from run to run because no random seed is set).
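The percentiles implied directly by the fitted exponential CDF can also be computed with qexp(), using the fitted rate (for an exponential distribution the maximum-likelihood estimate is 1/mean(LotArea), about 1/10517); a minimal sketch, which should give roughly 540 and 31,500:
# theoretical 5th and 95th percentiles from the fitted exponential CDF
qexp(c(0.05, 0.95), rate = lambda_x)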
# 95% CI based on the empirical data
mean_var_x <- mean(variable_x)
std_var_x <- sd(variable_x)
size_var_x <- length(variable_x)
z <- qnorm(1 - 0.05/2)
# identify the upper and lower bound
lower_ci_var_x <- mean_var_x - z * (std_var_x / sqrt(size_var_x))
upper_ci_var_x <- mean_var_x + z * (std_var_x / sqrt(size_var_x))
# quantile on variable x
empirical_quantile <- quantile(variable_x, probs=c(0.05, 0.95))
round(lower_ci_var_x)
## [1] 10005
round(upper_ci_var_x)
## [1] 11029
empirical_quantile
## 5% 95%
## 3311.70 17401.15
Based on the empirical data, the 95% confidence interval for the mean lot area runs from 10005 to 11029, and the empirical 5th and 95th percentiles are 3311.7 and 17401.15 respectively. Compared with the exponential percentiles above, the empirical ones are much less spread out: the fitted exponential puts far more weight both near zero and in the far right tail, so it is a poor description of the observed lot sizes.
Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
# linear regression model on training data
training_lm <- lm (variable_y ~ variable_x, data = training_data)
summary(training_lm)
##
## Call:
## lm(formula = variable_y ~ variable_x, data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -275668 -48169 -17725 31248 553356
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.588e+05 2.915e+03 54.49 <2e-16 ***
## variable_x 2.100e+00 2.011e-01 10.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76650 on 1458 degrees of freedom
## Multiple R-squared: 0.06961, Adjusted R-squared: 0.06898
## F-statistic: 109.1 on 1 and 1458 DF, p-value: < 2.2e-16
# histogram of the residuals from the linear model
lm_histogram <- ggplot(training_lm, aes(x = .resid)) +
geom_histogram(colour = 4, fill = "white") +
xlab("Residual on Linear Regression") +
theme_minimal()
par(mfrow=c(2,2))
plot(training_lm)
lm_histogram
The simple one-variable model explains only about 7% of the variance in Sale Price (adjusted R-squared 0.069), so a multiple regression on all of the numeric predictors is fit next.
# pull in all columns with numeric values
training_data_num <- training_data %>%
dplyr::select(where(is.numeric), -Id)
# summary
summary(training_data_num)
## MSSubClass LotFrontage LotArea OverallQual
## Min. : 20.0 Min. : 21.00 Min. : 1300 Min. : 1.000
## 1st Qu.: 20.0 1st Qu.: 59.00 1st Qu.: 7554 1st Qu.: 5.000
## Median : 50.0 Median : 69.00 Median : 9478 Median : 6.000
## Mean : 56.9 Mean : 70.05 Mean : 10517 Mean : 6.099
## 3rd Qu.: 70.0 3rd Qu.: 80.00 3rd Qu.: 11602 3rd Qu.: 7.000
## Max. :190.0 Max. :313.00 Max. :215245 Max. :10.000
## NA's :259
## OverallCond YearBuilt YearRemodAdd MasVnrArea
## Min. :1.000 Min. :1872 Min. :1950 Min. : 0.0
## 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967 1st Qu.: 0.0
## Median :5.000 Median :1973 Median :1994 Median : 0.0
## Mean :5.575 Mean :1971 Mean :1985 Mean : 103.7
## 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004 3rd Qu.: 166.0
## Max. :9.000 Max. :2010 Max. :2010 Max. :1600.0
## NA's :8
## BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8
## Median : 383.5 Median : 0.00 Median : 477.5 Median : 991.5
## Mean : 443.6 Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 712.2 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :5644.0 Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## Min. : 334 Min. : 0 Min. : 0.000 Min. : 334
## 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130
## Median :1087 Median : 0 Median : 0.000 Median :1464
## Mean :1163 Mean : 347 Mean : 5.845 Mean :1515
## 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777
## Max. :4692 Max. :2065 Max. :572.000 Max. :5642
##
## BsmtFullBath BsmtHalfBath FullBath HalfBath
## Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :2.000 Median :0.0000
## Mean :0.4253 Mean :0.05753 Mean :1.565 Mean :0.3829
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :3.0000 Max. :2.00000 Max. :3.000 Max. :2.0000
##
## BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces
## Min. :0.000 Min. :0.000 Min. : 2.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 5.000 1st Qu.:0.000
## Median :3.000 Median :1.000 Median : 6.000 Median :1.000
## Mean :2.866 Mean :1.047 Mean : 6.518 Mean :0.613
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000 3rd Qu.:1.000
## Max. :8.000 Max. :3.000 Max. :14.000 Max. :3.000
##
## GarageYrBlt GarageCars GarageArea WoodDeckSF
## Min. :1900 Min. :0.000 Min. : 0.0 Min. : 0.00
## 1st Qu.:1961 1st Qu.:1.000 1st Qu.: 334.5 1st Qu.: 0.00
## Median :1980 Median :2.000 Median : 480.0 Median : 0.00
## Mean :1979 Mean :1.767 Mean : 473.0 Mean : 94.24
## 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0 3rd Qu.:168.00
## Max. :2010 Max. :4.000 Max. :1418.0 Max. :857.00
## NA's :81
## OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 25.00 Median : 0.00 Median : 0.00 Median : 0.00
## Mean : 46.66 Mean : 21.95 Mean : 3.41 Mean : 15.06
## 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :547.00 Max. :552.00 Max. :508.00 Max. :480.00
##
## PoolArea MiscVal MoSold YrSold
## Min. : 0.000 Min. : 0.00 Min. : 1.000 Min. :2006
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 5.000 1st Qu.:2007
## Median : 0.000 Median : 0.00 Median : 6.000 Median :2008
## Mean : 2.759 Mean : 43.49 Mean : 6.322 Mean :2008
## 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :738.000 Max. :15500.00 Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
Pulling in only the columns with numeric values gives a better view of the training set through appropriate plots; box plots and histograms are used here to compare the variables.
##### Box Plot Part 1
# boxplot
par(mfrow=c(2,3))
boxplot(training_data_num$MSSubClass, xlab = "The building class")
boxplot(training_data_num$LotFrontage, xlab = "Linear feet of street connected to property")
boxplot(training_data_num$LotArea, xlab = "Lot size in square feet")
boxplot(training_data_num$OverallQual, xlab = "Overall material and finish quality")
boxplot(training_data_num$OverallCond, xlab = "Overall condition rating")
boxplot(training_data_num$MasVnrArea, xlab = "Masonry veneer area in square feet")
# box plot part 2
par(mfrow=c(2,3))
boxplot(training_data_num$BsmtFinSF1, xlab = "Type 1 finished square feet")
boxplot(training_data_num$BsmtFinSF2, xlab = "Type 2 finished square feet")
boxplot(training_data_num$BsmtUnfSF, xlab = "Unfinished square feet of basement area")
boxplot(training_data_num$TotalBsmtSF, xlab = "Total square feet of basement area")
boxplot(training_data_num$X1stFlrSF, xlab = "First Floor square feet")
boxplot(training_data_num$X2ndFlrSF, xlab = "Second Floor square feet")
# box plot part 3
par(mfrow=c(2,3))
boxplot(training_data_num$LowQualFinSF, xlab = "Low quality finished square feet (all floors)")
boxplot(training_data_num$GrLivArea, xlab = "Above grade (ground) living area square feet")
boxplot(training_data_num$BsmtFullBath, xlab = "Basement full bathrooms")
boxplot(training_data_num$BsmtHalfBath, xlab = "Basement half bathrooms")
boxplot(training_data_num$FullBath, xlab = "Full bathrooms above grade")
boxplot(training_data_num$HalfBath, xlab = "Half baths above grade")
# boxplot part 4
par(mfrow=c(2,3))
boxplot(training_data_num$BedroomAbvGr, xlab = "Number of bedrooms above basement level")
boxplot(training_data_num$KitchenAbvGr, xlab = "Number of kitchens")
boxplot(training_data_num$TotRmsAbvGrd, xlab = "Total rooms above grade (does not include bathrooms)")
boxplot(training_data_num$Fireplaces, xlab = "Number of fireplaces")
boxplot(training_data_num$GarageCars, xlab = "Size of garage in car capacity")
boxplot(training_data_num$GarageArea, xlab = "Size of garage in square feet")
# boxplot part 5
par(mfrow=c(2,3))
boxplot(training_data_num$WoodDeckSF, xlab = "Wood deck area in square feet")
boxplot(training_data_num$OpenPorchSF, xlab = "Open porch area in square feet")
boxplot(training_data_num$EnclosedPorch, xlab = "Enclosed porch area in square feet")
boxplot(training_data_num$X3SsnPorch, xlab = "Three season porch area in square feet")
boxplot(training_data_num$ScreenPorch, xlab = "Screen porch area in square feet")
boxplot(training_data_num$PoolArea, xlab = "Pool area in square feet")
# boxplot part 6
par(mfrow=c(1,2))
boxplot(training_data_num$MiscVal, xlab = "$ Value of miscellaneous feature")
boxplot(training_data_num$SalePrice, xlab = "Sale Price")
# histogram
par(mfrow=c(2,3))
hist(training_data_num$YearBuilt, border=F , col=rgb(0.2,0.2,0.8,0.7), xlab = "Original construction date")
hist(training_data_num$YearRemodAdd, border=F , col=rgb(0.2,0.2,0.8,0.7), xlab = "Remodel date")
hist(training_data_num$GarageYrBlt, border=F , col=rgb(0.2,0.2,0.8,0.7), xlab = "Year garage was built")
hist(training_data_num$MoSold, border=F , col=rgb(0.2,0.2,0.8,0.7), xlab = "Month Sold")
hist(training_data_num$YrSold, border=F , col=rgb(0.2,0.2,0.8,0.7), xlab = "Year Sold")
# replace NA value in training data with the mean of the column
training_data_revised <- training_data_num %>%
mutate_all(~ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))
# multiple linear regression with numeric values
training_lm2 <- lm(SalePrice ~ LotFrontage + OverallQual + OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath + BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageYrBlt + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch + PoolArea + MiscVal + MoSold + YrSold + TotalBsmtSF + GrLivArea , data = training_data_revised)
# summary
summary(training_lm2)
##
## Call:
## lm(formula = SalePrice ~ LotFrontage + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath +
## BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr +
## TotRmsAbvGrd + Fireplaces + GarageYrBlt + GarageCars + GarageArea +
## WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch +
## PoolArea + MiscVal + MoSold + YrSold + TotalBsmtSF + GrLivArea,
## data = training_data_revised)
##
## Residuals:
## Min 1Q Median 3Q Max
## -496229 -16828 -2074 13798 301576
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.031e+05 1.441e+06 0.210 0.833446
## LotFrontage 9.182e+01 4.926e+01 1.864 0.062541 .
## OverallQual 1.611e+04 1.202e+03 13.403 < 2e-16 ***
## OverallCond 4.903e+03 1.053e+03 4.658 3.48e-06 ***
## YearBuilt 2.500e+02 6.888e+01 3.630 0.000293 ***
## YearRemodAdd 1.389e+02 7.000e+01 1.985 0.047382 *
## MasVnrArea 2.905e+01 6.065e+00 4.790 1.84e-06 ***
## BsmtFinSF1 2.152e+01 4.753e+00 4.528 6.46e-06 ***
## BsmtFinSF2 1.175e+01 7.177e+00 1.638 0.101703
## BsmtUnfSF 1.204e+01 4.266e+00 2.823 0.004829 **
## X1stFlrSF 5.115e+01 5.911e+00 8.654 < 2e-16 ***
## X2ndFlrSF 4.289e+01 4.946e+00 8.670 < 2e-16 ***
## LowQualFinSF 1.355e+01 2.030e+01 0.667 0.504573
## BsmtFullBath 8.644e+03 2.651e+03 3.261 0.001138 **
## BsmtHalfBath 1.473e+03 4.163e+03 0.354 0.723477
## FullBath 3.327e+03 2.888e+03 1.152 0.249439
## HalfBath -1.736e+03 2.717e+03 -0.639 0.522846
## BedroomAbvGr -8.949e+03 1.728e+03 -5.180 2.53e-07 ***
## KitchenAbvGr -2.542e+04 4.962e+03 -5.123 3.42e-07 ***
## TotRmsAbvGrd 5.751e+03 1.255e+03 4.582 5.01e-06 ***
## Fireplaces 4.583e+03 1.800e+03 2.546 0.010985 *
## GarageYrBlt 9.097e+01 7.092e+01 1.283 0.199774
## GarageCars 1.112e+04 2.934e+03 3.790 0.000157 ***
## GarageArea 4.260e-01 1.014e+01 0.042 0.966483
## WoodDeckSF 2.726e+01 8.133e+00 3.351 0.000826 ***
## OpenPorchSF -1.364e+00 1.549e+01 -0.088 0.929816
## EnclosedPorch 9.234e+00 1.718e+01 0.537 0.591125
## ScreenPorch 5.372e+01 1.753e+01 3.065 0.002220 **
## PoolArea -4.127e+01 2.425e+01 -1.702 0.088986 .
## MiscVal -9.142e-03 1.891e+00 -0.005 0.996142
## MoSold 2.421e+01 3.515e+02 0.069 0.945083
## YrSold -6.559e+02 7.163e+02 -0.916 0.359987
## TotalBsmtSF NA NA NA NA
## GrLivArea NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35470 on 1428 degrees of freedom
## Multiple R-squared: 0.8049, Adjusted R-squared: 0.8006
## F-statistic: 190 on 31 and 1428 DF, p-value: < 2.2e-16
The multiple regression model accounts for an estimated 80% of the variance (adjusted R-squared ≈ 0.80). Two coefficients are NA because of singularities (TotalBsmtSF and GrLivArea are exact sums of other basement and floor-area predictors), and several others have p-values above the 0.05 significance level, so those variables are removed and the model is refit.
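As an alternative to pruning predictors by hand, AIC-based backward elimination with step() could be applied; a minimal sketch, starting from the full model training_lm2 with the two aliased predictors removed first:
# refit without the aliased predictors, then backward-eliminate by AIC
training_lm_full <- update(training_lm2, . ~ . - TotalBsmtSF - GrLivArea)
training_lm_step <- step(training_lm_full, direction = "backward", trace = FALSE)
summary(training_lm_step)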
# multiple linear regression with revised variables
training_lm3 <- lm(SalePrice ~ OverallQual + OverallCond + YearBuilt + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + BsmtFullBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageYrBlt + GarageCars + WoodDeckSF + ScreenPorch, data = training_data_revised)
# summary
summary(training_lm3)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + OverallCond + YearBuilt +
## MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + X1stFlrSF +
## X2ndFlrSF + BsmtFullBath + BedroomAbvGr + KitchenAbvGr +
## TotRmsAbvGrd + Fireplaces + GarageYrBlt + GarageCars + WoodDeckSF +
## ScreenPorch, data = training_data_revised)
##
## Residuals:
## Min 1Q Median 3Q Max
## -503454 -17135 -2007 13724 283619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.934e+05 1.074e+05 -8.322 < 2e-16 ***
## OverallQual 1.666e+04 1.174e+03 14.192 < 2e-16 ***
## OverallCond 5.670e+03 9.423e+02 6.017 2.25e-09 ***
## YearBuilt 2.736e+02 5.782e+01 4.732 2.44e-06 ***
## MasVnrArea 2.739e+01 5.977e+00 4.583 4.98e-06 ***
## BsmtFinSF1 2.076e+01 4.633e+00 4.480 8.05e-06 ***
## BsmtFinSF2 1.083e+01 7.080e+00 1.529 0.126431
## BsmtUnfSF 1.207e+01 4.233e+00 2.851 0.004420 **
## X1stFlrSF 5.470e+01 5.549e+00 9.858 < 2e-16 ***
## X2ndFlrSF 4.321e+01 4.015e+00 10.762 < 2e-16 ***
## BsmtFullBath 8.412e+03 2.491e+03 3.377 0.000753 ***
## BedroomAbvGr -8.695e+03 1.673e+03 -5.198 2.30e-07 ***
## KitchenAbvGr -2.561e+04 4.818e+03 -5.315 1.24e-07 ***
## TotRmsAbvGrd 6.138e+03 1.232e+03 4.984 6.99e-07 ***
## Fireplaces 4.283e+03 1.778e+03 2.408 0.016149 *
## GarageYrBlt 1.443e+02 6.506e+01 2.217 0.026757 *
## GarageCars 1.176e+04 1.721e+03 6.831 1.24e-11 ***
## WoodDeckSF 2.551e+01 8.053e+00 3.168 0.001566 **
## ScreenPorch 4.809e+01 1.728e+01 2.783 0.005457 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35480 on 1441 degrees of freedom
## Multiple R-squared: 0.8029, Adjusted R-squared: 0.8005
## F-statistic: 326.2 on 18 and 1441 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(training_lm3)
With this model, the test data will be used to predict the Sale Price. The test data is first wrangled in the same way: NA values are replaced with the column mean and only the columns with numeric values are kept.
# data wrangling on test data: keep the numeric columns, then replace NAs with the column mean
test_data_revised <- test_data %>%
dplyr::select(where(is.numeric)) %>%
mutate_all(~ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))
# sale price prediction
kaggle_submission <- test_data_revised %>%
mutate(predict_value = predict(training_lm3, newdata = test_data_revised)) %>%
dplyr::select(c('Id', predict_value))
summary(kaggle_submission)
## Id predict_value
## Min. :1461 Min. :-10852
## 1st Qu.:1826 1st Qu.:128312
## Median :2190 Median :168620
## Mean :2190 Mean :178260
## 3rd Qu.:2554 3rd Qu.:222134
## Max. :2919 Max. :664371
# export csv
kaggle_submission <- rename(kaggle_submission, SalePrice=predict_value)
write.csv(kaggle_submission, "Resources/kaggle_submission.csv", row.names = FALSE)
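The public leaderboard scores submissions on the root mean squared error between the logarithm of the predicted and the logarithm of the observed sale price. A rough in-sample approximation of that metric on the training data is sketched below; because the summary above shows a few negative predictions, predictions are floored at 1 before taking logs:
# approximate the Kaggle metric (RMSE of log prices) in-sample on the training data
train_pred <- pmax(predict(training_lm3, newdata = training_data_revised), 1)
sqrt(mean((log(train_pred) - log(training_data_revised$SalePrice))^2))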
My Kaggle username is eddiexunyc and the lowest public score I was able to get is 0.33918.