Your final is due by the end of day on 19 May. This project will show off your ability to understand the elements of the class. You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition (Kaggle Link). I want you to do the following.
# load libraries
library(tidyverse)
library(reactable)
library(DescTools)
library(reshape2)
library(MASS)
library(devtools)
library(ggbiplot)
library(zoo)
# load data
training_data <- read.csv("https://raw.githubusercontent.com/eddiexunyc/DATA605_FINAL/main/Resources/train.csv")
test_data <- read.csv("https://raw.githubusercontent.com/eddiexunyc/DATA605_FINAL/main/Resources/test.csv")
# view train data
reactable(training_data)
Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.
For the quantitative independent variable, the lot size in square feet (LotArea) will be defined as the variable X, and for the dependent variable, the sale price (SalePrice) will be defined as the variable Y.
# histogram
variable_x_histgram <- ggplot(training_data,aes(x = LotArea)) +
geom_histogram(colour = 4, fill = "white") +
theme_minimal()
# set the Lot Area as x and Sale Price as y
variable_x <- training_data$LotArea
variable_y <- training_data$SalePrice
# histogram plot
variable_x_histgram
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the third quartile of the X variable, and the small letter “y” is estimated as the second quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.
variable_x_summary <- summary(variable_x)
variable_y_summary <- summary(variable_y)
# 3rd quartile of x and 2nd quartile of y
x_prob_quartile <- round(variable_x_summary["3rd Qu."])
y_prob_quartile <- round(variable_y_summary["Median"])
# print
variable_x_summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
variable_y_summary
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
Based on both summaries, the third quartile of the variable X is 11602 and the second quartile or median of the variable Y is 163000.
# calculation on less than 3rd quartile
both_less_qrt <- sum(variable_x < x_prob_quartile & variable_y < y_prob_quartile)
y_greater_qrt <- sum(variable_x < x_prob_quartile & variable_y > y_prob_quartile)
# calculation on more than 3rd quartile
both_great_qrt <- sum(variable_x > x_prob_quartile & variable_y > y_prob_quartile)
y_less_qrt <- sum(variable_x > x_prob_quartile & variable_y < y_prob_quartile)
# calculation on total
less_3qrt_total <- both_less_qrt + y_greater_qrt
great_3qrt_total <- both_great_qrt + y_less_qrt
less_2qrt_total <- both_less_qrt + y_less_qrt
great_2qrt_total <- both_great_qrt + y_greater_qrt
grand_total <- less_3qrt_total + great_3qrt_total
# input values into the table matrix
table_matrix <- matrix(c(both_less_qrt, y_greater_qrt, less_3qrt_total,
y_less_qrt, both_great_qrt, great_3qrt_total,
less_2qrt_total, great_2qrt_total, grand_total),
nrow = 3, ncol = 3, byrow = TRUE)
colnames(table_matrix) <- c("less than 2nd quartile", "greater than 2nd quartile", "Total")
rownames(table_matrix) <- c("less than 3rd quartile", "greater than 3rd quartile", "Total")
table_matrix
##                           less than 2nd quartile greater than 2nd quartile Total
## less than 3rd quartile                        639                       452  1091
## greater than 3rd quartile                      89                       276   365
## Total                                         728                       728  1456
Based on the table, there are 1456 observations in total (ties with the quartile cut-offs are excluded). The counts below and above the 2nd quartile of Y are exactly the same, 728 each, as expected for a median split. There are far more observations below the 3rd quartile of X (1091) than above it (365). Using the formula for conditional probability, each requested probability can be computed from these counts, as shown below.
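With small x the third quartile of X, small y the median of Y, and 1,460 training rows, the three probabilities computed by the functions below work out to the following from the count table:
\[P(X > x \mid Y > y) = \frac{276}{728} \approx 0.379 \\ P(X > x,\ Y > y) = \frac{276}{1460} \approx 0.189 \\ P(X < x \mid Y > y) = \frac{452}{728} \approx 0.621\]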
# probability a: P(X > x | Y > y)
prob_a_func <- function(x, y, xp, yp){
  result <- sum(x > xp & y > yp) / sum(y > yp)
  return(result)
}
# probability b: P(X > x, Y > y)
prob_b_func <- function(x, y, xp, yp){
  result <- sum(x > xp & y > yp) / length(x)
  return(result)
}
# probability c: P(X < x | Y > y)
prob_c_func <- function(x, y, xp, yp){
  result <- sum(x < xp & y > yp) / sum(y > yp)
  return(result)
}
prob_a_value <- prob_a_func(variable_x, variable_y, x_prob_quartile, y_prob_quartile)
prob_a_value
## [1] 0.3791209
prob_b_value <- prob_b_func(variable_x, variable_y, x_prob_quartile, y_prob_quartile)
prob_b_value
## [1] 0.1890411
prob_c_value <- prob_c_func(variable_x, variable_y, x_prob_quartile, y_prob_quartile)
prob_c_value
## [1] 0.6208791
Probability (a) means that, given a house sold above the median price, there is roughly a 38% chance that its lot exceeds the third quartile of lot area. Probability (b) is the joint probability that a house is both above the third quartile in lot area and above the median in price, about 19% of the training data. Probability (c) is the complement of (a): given a sale above the median price, about 62% of the lots fall below the third quartile.
Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 3rd quartile for X, and let B be the new variable counting those observations above the 2nd quartile for Y. Does P(A|B)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
\[P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(A \mid B) = P(A) \;\text{ if and only if } A \text{ and } B \text{ are independent.}\]
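A quick numerical check, using the counts already computed for the table above (here A is the event that LotArea exceeds its third quartile and B is the event that SalePrice exceeds its median):
# compare P(A|B) with P(A) and with P(A)P(B), using the table counts
p_A_given_B <- both_great_qrt / great_2qrt_total   # 276 / 728  ~ 0.379
p_A <- great_3qrt_total / grand_total              # 365 / 1456 ~ 0.251
p_B <- great_2qrt_total / grand_total              # 728 / 1456 = 0.5
c(p_A_given_B = p_A_given_B, p_A = p_A, p_A_times_p_B = p_A * p_B)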
The check gives P(A|B) ≈ 0.379, while P(A) ≈ 0.251 and P(A)P(B) ≈ 0.125, so P(A|B) ≠ P(A)P(B): splitting the training data in this fashion does not make the events independent. A Chi-Square test will be used to evaluate whether the association is statistically significant. The hypotheses are:
\[H_{0}: \text{Sale Price and Lot Size are independent} \\ H_{1}: \text{Sale Price and Lot Size are dependent}\]
chi_result <- chisq.test(variable_x, variable_y)
chi_result
##
## Pearson's Chi-squared test
##
## data: variable_x and variable_y
## X-squared = 735095, df = 709664, p-value < 2.2e-16
Using a significance level of 0.05 for the chi-square test, the p-value (< 2.2e-16) is far below that threshold, so the null hypothesis of independence is rejected. This agrees with the numerical check above: Sale Price and Lot Size are not independent.
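Note that chisq.test() applied to the raw vectors treats every distinct LotArea and SalePrice value as its own category, which is why the degrees of freedom are so large. A more conventional check of the same question is a Chi-Square test of association on the 2x2 table of counts built earlier; a minimal sketch:
# chi-square test of association on the quartile split (rows: X vs its 3rd quartile, columns: Y vs its median)
split_table <- matrix(c(both_less_qrt, y_greater_qrt,
                        y_less_qrt, both_great_qrt),
                      nrow = 2, byrow = TRUE,
                      dimnames = list(LotArea = c("below 3rd Qu.", "above 3rd Qu."),
                                      SalePrice = c("below median", "above median")))
chisq.test(split_table)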
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatter plot of X and Y. Provide a 95% CI for the difference in the mean of the variables. Derive a correlation matrix for two of the quantitative variables you selected. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.
ggplot(training_data, aes(x = variable_x, y = variable_y, color = variable_y)) +
geom_point() +
geom_smooth(method=lm) +
theme_minimal() +
xlab("Lot size in square feet ") +
ylab("Sale Price") +
ggtitle("Scatter plot between Lot size in square feet & Sale Price")
As shown in the plot, the relationship between lot size and sale price appears weak: the fitted line slopes only gently upward, and a handful of extreme lot-size outliers stretch the x-axis and obscure the trend.
# 95% confidence interval for the difference in means of variable X and variable Y
MeanDiffCI(variable_x, variable_y, conf.level = 0.95)
## meandiff lwr.ci upr.ci
## -170404.4 -174514.7 -166294.1
Given the 95% confidence interval for the difference in the mean of the variables, the lower bound is -174515 and the upper bound is -166294. The interval lies far below zero simply because the mean Sale Price is much larger than the mean Lot Area, so the difference (mean of X minus mean of Y) is strongly negative.
# new dataset with variable x and y
corr_var <- cbind(variable_x, variable_y) %>%
as.matrix()
# create a cormat and melted cormat
housing_cormat <- round(cor(corr_var), 2)
melted_housing_cormat <- melt(housing_cormat)
# create a heat map
ggplot(data = melted_housing_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile(color = "white",
lwd = 1.5,
linetype = 1) +
geom_text(aes(label = round(value, 2)), color = "white", size = 4) +
theme_minimal()
The heat map shows only a weak correlation (about 0.26) between the two variables. Returning to the earlier question of association, a formal test is needed to decide whether the true correlation is zero:
\[H_{0}: \rho = 0 \;\text{(Sale Price and Lot Size are uncorrelated)} \\ H_{1}: \rho \neq 0 \;\text{(Sale Price and Lot Size are correlated)}\]
corr_test <- cor.test(training_data$LotArea, training_data$SalePrice, method = "pearson", conf.level = 0.99)
corr_test
##
## Pearson's product-moment correlation
##
## data: training_data$LotArea and training_data$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.2000196 0.3254375
## sample estimates:
## cor
## 0.2638434
Using the Pearson method to measure the strength of the linear relationship between LotArea and SalePrice, the p-value is less than 2.2e-16, far below the 0.01 significance level, and the 99% confidence interval (0.20, 0.33) excludes zero. The null hypothesis of zero correlation is therefore rejected: LotArea and SalePrice are significantly, though only weakly, correlated, with a correlation coefficient of about 0.264.
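As a sanity check, the reported t statistic can be reproduced from the sample correlation with the usual formula:
\[t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}} = \frac{0.2638\sqrt{1458}}{\sqrt{1-0.2638^{2}}} \approx 10.4\]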
Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct principal components analysis (research this!) and interpret. Discuss.
# inverted correlation matrix
precision_corr <- solve(housing_cormat)
melted_precision_cormat <- melt(precision_corr)
# create a heat map
ggplot(data = melted_precision_cormat, aes(x=Var1, y=Var2, fill=value)) +
geom_tile(color = "white",
lwd = 1.5,
linetype = 1) +
geom_text(aes(label = round(value, 2)), color = "white", size = 4) +
theme_minimal()
# multiplication of correlation matrix and precision matrix
corr_prec1_matrix <- housing_cormat %*% precision_corr
melted_corr_prec1 <- melt(corr_prec1_matrix)
# create a heat map
ggplot(data = melted_corr_prec1, aes(x=Var1, y=Var2, fill=value)) +
geom_tile(color = "white",
lwd = 1.5,
linetype = 1) +
geom_text(aes(label = round(value, 3)), color = "white", size = 4) +
theme_minimal()
# multiplication of precision matrix and correlation matrix
corr_prec2_matrix <- precision_corr %*% housing_cormat
melted_corr_prec2 <- melt(corr_prec2_matrix)
# create a heat map
ggplot(data = melted_corr_prec2, aes(x=Var1, y=Var2, fill=value)) +
geom_tile(color = "white",
lwd = 1.5,
linetype = 1) +
geom_text(aes(label = round(value, 3)), color = "white", size = 4) +
theme_minimal()
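Both products should recover the 2x2 identity matrix (any departure is only floating-point error, since the precision matrix is the exact inverse of the rounded correlation matrix); a quick check, as a sketch:
# confirm that R %*% R^-1 and R^-1 %*% R are (numerically) the identity matrix
round(corr_prec1_matrix, 3)
round(corr_prec2_matrix, 3)
all.equal(corr_prec1_matrix, diag(2), check.attributes = FALSE)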
The correlation between the two variables is weak, and principal component analysis (PCA) will be used to examine how the variance is shared between them. PCA is a dimensionality-reduction technique that projects the data onto a small number of orthogonal principal components, ordered by the amount of variance each one captures; it is often used to simplify predictive models. An R-bloggers post (R-blogger Link) will be used as a reference.
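Because only two standardized variables are involved, the PCA can also be done by hand from the eigendecomposition of the correlation matrix: the eigenvalues are 1 + r and 1 - r, and their square roots are the component standard deviations. A minimal sketch (housing_cormat was rounded to two decimals, so the numbers will only approximately match the prcomp() output below):
# PCA by hand: eigendecomposition of the 2x2 correlation matrix
corr_eigen <- eigen(housing_cormat)
corr_eigen$values                            # variances of PC1 and PC2 (1 + r and 1 - r)
sqrt(corr_eigen$values)                      # standard deviations of the components
corr_eigen$values / sum(corr_eigen$values)   # proportion of variance explained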
# principle components analysis
pca_var <- prcomp(corr_var, center = TRUE, scale. = TRUE)
summary(pca_var)
## Importance of components:
## PC1 PC2
## Standard deviation 1.1242 0.8580
## Proportion of Variance 0.6319 0.3681
## Cumulative Proportion 0.6319 1.0000
The PCA summary shows that PC1 captures about 63% of the variability (more than half) and PC2 captures the remaining 37%.
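These proportions follow directly from the component standard deviations reported above:
\[\frac{1.1242^{2}}{1.1242^{2}+0.8580^{2}} = \frac{1.2638}{2.0000} \approx 0.632 \\ \frac{0.8580^{2}}{1.1242^{2}+0.8580^{2}} = \frac{0.7362}{2.0000} \approx 0.368\]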
pca_plot1 <- ggscreeplot(pca_var) +
theme_minimal()
pca_plot1
The scree plot shows that PC1 alone accounts for most of the variance.
pca_plot2 <- ggbiplot(pca_var, obs.scale = 1, var.scale = 1,
var.factor = 1.2,
ellipse = TRUE, ellipse.level = 0.5, ellipse.alpha = 0.1,
circle = TRUE, varname.size = 3, varname.color = "lightblue") +
theme(legend.direction = 'horizontal', legend.position = 'top') +
theme_minimal()
pca_plot2
The PCA biplot shows how variables X and Y load on the principal components. The two arrows sit fairly far apart (closer to a right angle than to overlapping), indicating that the variables are only weakly correlated.
Many times, it makes sense to fit a closed form distribution to data. For your variable that is skewed to the right, shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See MASS). Find the optimal value for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
# run fitdistr function and extract the lambda value
exp_variable_x <- fitdistr(variable_x, "exponential")
lambda_x <- exp_variable_x$estimate
# 1000 samples from the exponential distribution
sample_size <- 1000
sample_x <- rexp(sample_size, rate = lambda_x) %>%
as.data.frame()
colnames(sample_x) <- c("sample_value")
# exponential histogram
exp_variable_x_histogram <- ggplot(sample_x,aes(x = sample_value)) +
geom_histogram(colour = 4, fill = "white") +
theme_minimal()
#plot between the original and exponential
par(mfrow=c(1,2))
variable_x_histgram + ggtitle("Original Distribution")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
exp_variable_x_histogram + ggtitle("Exponential Distribution")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
cdf_quantile <- quantile(sample_x$sample_value, probs=c(0.05, 0.95))
cdf_quantile
## 5% 95%
## 533.6617 29977.3458
Using the 1,000 samples drawn from the fitted exponential distribution, the simulated 5th and 95th percentiles are about 533.66 and 29,977.35 respectively (these values vary from run to run because no random seed is set).
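The percentiles implied directly by the fitted exponential CDF can also be computed with qexp(), using the fitted rate (for an exponential distribution the maximum-likelihood estimate is 1/mean(LotArea), about 1/10517); a minimal sketch, which should give roughly 540 and 31,500:
# theoretical 5th and 95th percentiles from the fitted exponential CDF
qexp(c(0.05, 0.95), rate = lambda_x)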
# 95% CI based on the empirical data
mean_var_x <- mean(variable_x)
std_var_x <- sd(variable_x)
size_var_x <- length(variable_x)
z <- qnorm(1 - 0.05/2)
# identify the upper and lower bound
lower_ci_var_x <- mean_var_x - z * (std_var_x / sqrt(size_var_x))
upper_ci_var_x <- mean_var_x + z * (std_var_x / sqrt(size_var_x))
# quantile on variable x
empirical_quantile <- quantile(variable_x, probs=c(0.05, 0.95))
round(lower_ci_var_x)
## [1] 10005
round(upper_ci_var_x)
## [1] 11029
empirical_quantile
## 5% 95%
## 3311.70 17401.15
Based on the empirical data, the 95% confidence interval for the mean lot area runs from 10005 to 11029, and the empirical 5th and 95th percentiles are 3311.7 and 17401.15 respectively. Compared with the exponential percentiles above, the empirical ones are much less spread out: the fitted exponential puts far more weight both near zero and in the far right tail, so it is a poor description of the observed lot sizes.
Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
# linear regression model on training data
training_lm <- lm (variable_y ~ variable_x, data = training_data)
summary(training_lm)
##
## Call:
## lm(formula = variable_y ~ variable_x, data = training_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -275668 -48169 -17725 31248 553356
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.588e+05 2.915e+03 54.49 <2e-16 ***
## variable_x 2.100e+00 2.011e-01 10.45 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76650 on 1458 degrees of freedom
## Multiple R-squared: 0.06961, Adjusted R-squared: 0.06898
## F-statistic: 109.1 on 1 and 1458 DF, p-value: < 2.2e-16
# histogram of the residuals from the linear model
lm_histogram <- ggplot(training_lm, aes(x = .resid)) +
geom_histogram(colour = 4, fill = "white") +
xlab("Residual on Linear Regression") +
theme_minimal()
par(mfrow=c(2,2))
plot(training_lm)
lm_histogram
The simple one-variable model explains only about 7% of the variance in Sale Price (adjusted R-squared 0.069), so a multiple regression on all of the numeric predictors is fit next.
# pull in all columns with numeric values
training_data_num <- training_data %>%
dplyr::select(where(is.numeric), -Id)
# summary
summary(training_data_num)
## MSSubClass LotFrontage LotArea OverallQual
## Min. : 20.0 Min. : 21.00 Min. : 1300 Min. : 1.000
## 1st Qu.: 20.0 1st Qu.: 59.00 1st Qu.: 7554 1st Qu.: 5.000
## Median : 50.0 Median : 69.00 Median : 9478 Median : 6.000
## Mean : 56.9 Mean : 70.05 Mean : 10517 Mean : 6.099
## 3rd Qu.: 70.0 3rd Qu.: 80.00 3rd Qu.: 11602 3rd Qu.: 7.000
## Max. :190.0 Max. :313.00 Max. :215245 Max. :10.000
## NA's :259
## OverallCond YearBuilt YearRemodAdd MasVnrArea
## Min. :1.000 Min. :1872 Min. :1950 Min. : 0.0
## 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967 1st Qu.: 0.0
## Median :5.000 Median :1973 Median :1994 Median : 0.0
## Mean :5.575 Mean :1971 Mean :1985 Mean : 103.7
## 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004 3rd Qu.: 166.0
## Max. :9.000 Max. :2010 Max. :2010 Max. :1600.0
## NA's :8
## BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8
## Median : 383.5 Median : 0.00 Median : 477.5 Median : 991.5
## Mean : 443.6 Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 712.2 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :5644.0 Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## Min. : 334 Min. : 0 Min. : 0.000 Min. : 334
## 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130
## Median :1087 Median : 0 Median : 0.000 Median :1464
## Mean :1163 Mean : 347 Mean : 5.845 Mean :1515
## 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777
## Max. :4692 Max. :2065 Max. :572.000 Max. :5642
##
## BsmtFullBath BsmtHalfBath FullBath HalfBath
## Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :2.000 Median :0.0000
## Mean :0.4253 Mean :0.05753 Mean :1.565 Mean :0.3829
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :3.0000 Max. :2.00000 Max. :3.000 Max. :2.0000
##
## BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces
## Min. :0.000 Min. :0.000 Min. : 2.000 Min. :0.000
## 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 5.000 1st Qu.:0.000
## Median :3.000 Median :1.000 Median : 6.000 Median :1.000
## Mean :2.866 Mean :1.047 Mean : 6.518 Mean :0.613
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000 3rd Qu.:1.000
## Max. :8.000 Max. :3.000 Max. :14.000 Max. :3.000
##
## GarageYrBlt GarageCars GarageArea WoodDeckSF
## Min. :1900 Min. :0.000 Min. : 0.0 Min. : 0.00
## 1st Qu.:1961 1st Qu.:1.000 1st Qu.: 334.5 1st Qu.: 0.00
## Median :1980 Median :2.000 Median : 480.0 Median : 0.00
## Mean :1979 Mean :1.767 Mean : 473.0 Mean : 94.24
## 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0 3rd Qu.:168.00
## Max. :2010 Max. :4.000 Max. :1418.0 Max. :857.00
## NA's :81
## OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 25.00 Median : 0.00 Median : 0.00 Median : 0.00
## Mean : 46.66 Mean : 21.95 Mean : 3.41 Mean : 15.06
## 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :547.00 Max. :552.00 Max. :508.00 Max. :480.00
##
## PoolArea MiscVal MoSold YrSold
## Min. : 0.000 Min. : 0.00 Min. : 1.000 Min. :2006
## 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 5.000 1st Qu.:2007
## Median : 0.000 Median : 0.00 Median : 6.000 Median :2008
## Mean : 2.759 Mean : 43.49 Mean : 6.322 Mean :2008
## 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :738.000 Max. :15500.00 Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
Pulling in only the columns with numeric values gives a better view of the training set through appropriate plots; box plots and histograms are used here to compare the variables.
##### Box Plot Part 1
# boxplot
par(mfrow=c(2,3))
boxplot(training_data_num$MSSubClass, xlab = "The building class")
boxplot(training_data_num$LotFrontage, xlab = "Linear feet of street connected to property")
boxplot(training_data_num$LotArea, xlab = "Lot size in square feet")
boxplot(training_data_num$OverallQual, xlab = "Overall material and finish quality")
boxplot(training_data_num$OverallCond, xlab = "Overall condition rating")
boxplot(training_data_num$MasVnrArea, xlab = "Masonry veneer area in square feet")
# box plot part 2
par(mfrow=c(2,3))
boxplot(training_data_num$BsmtFinSF1, xlab = "Type 1 finished square feet")
boxplot(training_data_num$BsmtFinSF2, xlab = "Type 2 finished square feet")
boxplot(training_data_num$BsmtUnfSF, xlab = "Unfinished square feet of basement area")
boxplot(training_data_num$TotalBsmtSF, xlab = "Total square feet of basement area")
boxplot(training_data_num$X1stFlrSF, xlab = "First Floor square feet")
boxplot(training_data_num$X2ndFlrSF, xlab = "Second Floor square feet")
# box plot part 3
par(mfrow=c(2,3))
boxplot(training_data_num$LowQualFinSF, xlab = "Low quality finished square feet (all floors)")
boxplot(training_data_num$GrLivArea, xlab = "Above grade (ground) living area square feet")
boxplot(training_data_num$BsmtFullBath, xlab = "Basement full bathrooms")
boxplot(training_data_num$BsmtHalfBath, xlab = "Basement half bathrooms")
boxplot(training_data_num$FullBath, xlab = "Full bathrooms above grade")
boxplot(training_data_num$HalfBath, xlab = "Half baths above grade")
# boxplot part 4
par(mfrow=c(2,3))
boxplot(training_data_num$BedroomAbvGr, xlab = "Number of bedrooms above basement level")
boxplot(training_data_num$KitchenAbvGr, xlab = "Number of kitchens")
boxplot(training_data_num$TotRmsAbvGrd, xlab = "Total rooms above grade (does not include bathrooms)")
boxplot(training_data_num$Fireplaces, xlab = "Number of fireplaces")
boxplot(training_data_num$GarageCars, xlab = "Size of garage in car capacity")
boxplot(training_data_num$GarageArea, xlab = "Size of garage in square feet")
# boxplot part 5
par(mfrow=c(2,3))
boxplot(training_data_num$WoodDeckSF, xlab = "Wood deck area in square feet")
boxplot(training_data_num$OpenPorchSF, xlab = "Open porch area in square feet")
boxplot(training_data_num$EnclosedPorch, xlab = "Enclosed porch area in square feet")
boxplot(training_data_num$X3SsnPorch, xlab = "Three season porch area in square feet")
boxplot(training_data_num$ScreenPorch, xlab = "Screen porch area in square feet")
boxplot(training_data_num$PoolArea, xlab = "Pool area in square feet")
# boxplot part 6
par(mfrow=c(1,2))
boxplot(training_data_num$MiscVal, xlab = "$ Value of miscellaneous feature")
boxplot(training_data_num$SalePrice, xlab = "Sale Price")
# histogram
par(mfrow=c(2,3))
hist(training_data_num$YearBuilt, border=F , col=rgb(0.2,0.2,0.8,0.7), xlab = "Original construction date")
hist(training_data_num$YearRemodAdd, border=F , col=rgb(0.2,0.2,0.8,0.7), xlab = "Remodel date")
hist(training_data_num$GarageYrBlt, border=F , col=rgb(0.2,0.2,0.8,0.7), xlab = "Year garage was built")
hist(training_data_num$MoSold, border=F , col=rgb(0.2,0.2,0.8,0.7), xlab = "Month Sold")
hist(training_data_num$YrSold, border=F , col=rgb(0.2,0.2,0.8,0.7), xlab = "Year Sold")
# replace NA value in training data with the mean of the column
training_data_revised <- training_data_num %>%
mutate_all(~ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))
# multiple linear regression with numeric values
training_lm2 <- lm(SalePrice ~ LotFrontage + OverallQual + OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath + BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageYrBlt + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch + PoolArea + MiscVal + MoSold + YrSold + TotalBsmtSF + GrLivArea , data = training_data_revised)
# summary
summary(training_lm2)
##
## Call:
## lm(formula = SalePrice ~ LotFrontage + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + BsmtFullBath +
## BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr +
## TotRmsAbvGrd + Fireplaces + GarageYrBlt + GarageCars + GarageArea +
## WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch +
## PoolArea + MiscVal + MoSold + YrSold + TotalBsmtSF + GrLivArea,
## data = training_data_revised)
##
## Residuals:
## Min 1Q Median 3Q Max
## -496229 -16828 -2074 13798 301576
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.031e+05 1.441e+06 0.210 0.833446
## LotFrontage 9.182e+01 4.926e+01 1.864 0.062541 .
## OverallQual 1.611e+04 1.202e+03 13.403 < 2e-16 ***
## OverallCond 4.903e+03 1.053e+03 4.658 3.48e-06 ***
## YearBuilt 2.500e+02 6.888e+01 3.630 0.000293 ***
## YearRemodAdd 1.389e+02 7.000e+01 1.985 0.047382 *
## MasVnrArea 2.905e+01 6.065e+00 4.790 1.84e-06 ***
## BsmtFinSF1 2.152e+01 4.753e+00 4.528 6.46e-06 ***
## BsmtFinSF2 1.175e+01 7.177e+00 1.638 0.101703
## BsmtUnfSF 1.204e+01 4.266e+00 2.823 0.004829 **
## X1stFlrSF 5.115e+01 5.911e+00 8.654 < 2e-16 ***
## X2ndFlrSF 4.289e+01 4.946e+00 8.670 < 2e-16 ***
## LowQualFinSF 1.355e+01 2.030e+01 0.667 0.504573
## BsmtFullBath 8.644e+03 2.651e+03 3.261 0.001138 **
## BsmtHalfBath 1.473e+03 4.163e+03 0.354 0.723477
## FullBath 3.327e+03 2.888e+03 1.152 0.249439
## HalfBath -1.736e+03 2.717e+03 -0.639 0.522846
## BedroomAbvGr -8.949e+03 1.728e+03 -5.180 2.53e-07 ***
## KitchenAbvGr -2.542e+04 4.962e+03 -5.123 3.42e-07 ***
## TotRmsAbvGrd 5.751e+03 1.255e+03 4.582 5.01e-06 ***
## Fireplaces 4.583e+03 1.800e+03 2.546 0.010985 *
## GarageYrBlt 9.097e+01 7.092e+01 1.283 0.199774
## GarageCars 1.112e+04 2.934e+03 3.790 0.000157 ***
## GarageArea 4.260e-01 1.014e+01 0.042 0.966483
## WoodDeckSF 2.726e+01 8.133e+00 3.351 0.000826 ***
## OpenPorchSF -1.364e+00 1.549e+01 -0.088 0.929816
## EnclosedPorch 9.234e+00 1.718e+01 0.537 0.591125
## ScreenPorch 5.372e+01 1.753e+01 3.065 0.002220 **
## PoolArea -4.127e+01 2.425e+01 -1.702 0.088986 .
## MiscVal -9.142e-03 1.891e+00 -0.005 0.996142
## MoSold 2.421e+01 3.515e+02 0.069 0.945083
## YrSold -6.559e+02 7.163e+02 -0.916 0.359987
## TotalBsmtSF NA NA NA NA
## GrLivArea NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35470 on 1428 degrees of freedom
## Multiple R-squared: 0.8049, Adjusted R-squared: 0.8006
## F-statistic: 190 on 31 and 1428 DF, p-value: < 2.2e-16
The multiple regression model accounts for an estimated 80% of the variance (adjusted R-squared ≈ 0.80). Two coefficients are NA because of singularities (TotalBsmtSF and GrLivArea are exact sums of other basement and floor-area predictors), and several others have p-values above the 0.05 significance level, so those variables are removed and the model is refit.
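As an alternative to pruning predictors by hand, AIC-based backward elimination with step() could be applied; a minimal sketch, starting from the full model training_lm2 with the two aliased predictors removed first:
# refit without the aliased predictors, then backward-eliminate by AIC
training_lm_full <- update(training_lm2, . ~ . - TotalBsmtSF - GrLivArea)
training_lm_step <- step(training_lm_full, direction = "backward", trace = FALSE)
summary(training_lm_step)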
# multiple linear regression with revised variables
training_lm3 <- lm(SalePrice ~ OverallQual + OverallCond + YearBuilt + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + BsmtFullBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageYrBlt + GarageCars + WoodDeckSF + ScreenPorch, data = training_data_revised)
# summary
summary(training_lm3)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + OverallCond + YearBuilt +
## MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + X1stFlrSF +
## X2ndFlrSF + BsmtFullBath + BedroomAbvGr + KitchenAbvGr +
## TotRmsAbvGrd + Fireplaces + GarageYrBlt + GarageCars + WoodDeckSF +
## ScreenPorch, data = training_data_revised)
##
## Residuals:
## Min 1Q Median 3Q Max
## -503454 -17135 -2007 13724 283619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.934e+05 1.074e+05 -8.322 < 2e-16 ***
## OverallQual 1.666e+04 1.174e+03 14.192 < 2e-16 ***
## OverallCond 5.670e+03 9.423e+02 6.017 2.25e-09 ***
## YearBuilt 2.736e+02 5.782e+01 4.732 2.44e-06 ***
## MasVnrArea 2.739e+01 5.977e+00 4.583 4.98e-06 ***
## BsmtFinSF1 2.076e+01 4.633e+00 4.480 8.05e-06 ***
## BsmtFinSF2 1.083e+01 7.080e+00 1.529 0.126431
## BsmtUnfSF 1.207e+01 4.233e+00 2.851 0.004420 **
## X1stFlrSF 5.470e+01 5.549e+00 9.858 < 2e-16 ***
## X2ndFlrSF 4.321e+01 4.015e+00 10.762 < 2e-16 ***
## BsmtFullBath 8.412e+03 2.491e+03 3.377 0.000753 ***
## BedroomAbvGr -8.695e+03 1.673e+03 -5.198 2.30e-07 ***
## KitchenAbvGr -2.561e+04 4.818e+03 -5.315 1.24e-07 ***
## TotRmsAbvGrd 6.138e+03 1.232e+03 4.984 6.99e-07 ***
## Fireplaces 4.283e+03 1.778e+03 2.408 0.016149 *
## GarageYrBlt 1.443e+02 6.506e+01 2.217 0.026757 *
## GarageCars 1.176e+04 1.721e+03 6.831 1.24e-11 ***
## WoodDeckSF 2.551e+01 8.053e+00 3.168 0.001566 **
## ScreenPorch 4.809e+01 1.728e+01 2.783 0.005457 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35480 on 1441 degrees of freedom
## Multiple R-squared: 0.8029, Adjusted R-squared: 0.8005
## F-statistic: 326.2 on 18 and 1441 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(training_lm3)
With this model, the test data will be used to predict the Sale Price. The test data is first wrangled in the same way: NA values are replaced with the column mean and only the columns with numeric values are kept.
# data wrangling on test data: keep the numeric columns, then replace NAs with the column mean
test_data_revised <- test_data %>%
dplyr::select(where(is.numeric)) %>%
mutate_all(~ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x))
# sale price prediction
kaggle_submission <- test_data_revised %>%
mutate(predict_value = predict(training_lm3, newdata = test_data_revised)) %>%
dplyr::select(c('Id', predict_value))
summary(kaggle_submission)
## Id predict_value
## Min. :1461 Min. :-10852
## 1st Qu.:1826 1st Qu.:128312
## Median :2190 Median :168620
## Mean :2190 Mean :178260
## 3rd Qu.:2554 3rd Qu.:222134
## Max. :2919 Max. :664371
# export csv
kaggle_submission <- rename(kaggle_submission, SalePrice=predict_value)
write.csv(kaggle_submission, "Resources/kaggle_submission.csv", row.names = FALSE)
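The public leaderboard scores submissions on the root mean squared error between the logarithm of the predicted and the logarithm of the observed sale price. A rough in-sample approximation of that metric on the training data is sketched below; because the summary above shows a few negative predictions, predictions are floored at 1 before taking logs:
# approximate the Kaggle metric (RMSE of log prices) in-sample on the training data
train_pred <- pmax(predict(training_lm3, newdata = training_data_revised), 1)
sqrt(mean((log(train_pred) - log(training_data_revised$SalePrice))^2))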
My Kaggle username is eddiexunyc and the lowest public score I was able to get is 0.33918.