CUNY 605 Computational Mathematics Final Project
House Prices: Advanced Regression Techniques competition (https://www.kaggle.com/c/house-prices-advanced-regression-techniques).
Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Pick SalePrice as the dependent variable, and define it as Y for the following analysis.
url <- 'https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20605/Final%20Project/train.csv'
data <- read.csv(url, header = TRUE, stringsAsFactors = FALSE)
y <- data$SalePrice
head(data)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl CollgCr Norm
## 2 Lvl AllPub FR2 Gtl Veenker Feedr
## 3 Lvl AllPub Inside Gtl CollgCr Norm
## 4 Lvl AllPub Corner Gtl Crawfor Norm
## 5 Lvl AllPub FR2 Gtl NoRidge Norm
## 6 Lvl AllPub Inside Gtl Mitchel Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 2Story 7 5 2003
## 2 Norm 1Fam 1Story 6 8 1976
## 3 Norm 1Fam 2Story 7 5 2001
## 4 Norm 1Fam 2Story 7 5 1915
## 5 Norm 1Fam 2Story 8 5 2000
## 6 Norm 1Fam 1.5Fin 5 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 2003 Gable CompShg VinylSd VinylSd BrkFace
## 2 1976 Gable CompShg MetalSd MetalSd None
## 3 2002 Gable CompShg VinylSd VinylSd BrkFace
## 4 1970 Gable CompShg Wd Sdng Wd Shng None
## 5 2000 Gable CompShg VinylSd VinylSd BrkFace
## 6 1995 Gable CompShg VinylSd VinylSd None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 196 Gd TA PConc Gd TA No
## 2 0 TA TA CBlock Gd TA Gd
## 3 162 Gd TA PConc Gd TA Mn
## 4 0 TA TA BrkTil TA Gd No
## 5 350 Gd TA PConc Gd TA Av
## 6 0 TA TA Wood Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 GLQ 706 Unf 0 150 856
## 2 ALQ 978 Unf 0 284 1262
## 3 GLQ 486 Unf 0 434 920
## 4 ALQ 216 Unf 0 540 756
## 5 GLQ 655 Unf 0 490 1145
## 6 GLQ 732 Unf 0 64 796
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 GasA Ex Y SBrkr 856 854 0
## 2 GasA Ex Y SBrkr 1262 0 0
## 3 GasA Ex Y SBrkr 920 866 0
## 4 GasA Gd Y SBrkr 961 756 0
## 5 GasA Ex Y SBrkr 1145 1053 0
## 6 GasA Ex Y SBrkr 796 566 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1710 1 0 2 1 3
## 2 1262 0 1 2 0 3
## 3 1786 1 0 2 1 3
## 4 1717 1 0 1 0 3
## 5 2198 1 0 2 1 4
## 6 1362 1 0 1 1 1
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 Gd 8 Typ 0 <NA>
## 2 1 TA 6 Typ 1 TA
## 3 1 Gd 6 Typ 1 TA
## 4 1 Gd 7 Typ 1 Gd
## 5 1 Gd 9 Typ 1 TA
## 6 1 TA 5 Typ 0 <NA>
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Attchd 2003 RFn 2 548 TA
## 2 Attchd 1976 RFn 2 460 TA
## 3 Attchd 2001 RFn 2 608 TA
## 4 Detchd 1998 Unf 3 642 TA
## 5 Attchd 2000 RFn 3 836 TA
## 6 Attchd 1993 Unf 2 480 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 TA Y 0 61 0 0
## 2 TA Y 298 0 0 0
## 3 TA Y 0 42 0 0
## 4 TA Y 0 35 272 0
## 5 TA Y 192 84 0 0
## 6 TA Y 40 30 0 320
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 0 0 <NA> <NA> <NA> 0 2 2008
## 2 0 0 <NA> <NA> <NA> 0 5 2007
## 3 0 0 <NA> <NA> <NA> 0 9 2008
## 4 0 0 <NA> <NA> <NA> 0 2 2006
## 5 0 0 <NA> <NA> <NA> 0 12 2008
## 6 0 0 <NA> MnPrv Shed 700 10 2009
## SaleType SaleCondition SalePrice
## 1 WD Normal 208500
## 2 WD Normal 181500
## 3 WD Normal 223500
## 4 WD Abnorml 140000
## 5 WD Normal 250000
## 6 WD Normal 143000
Let’s choose ‘GrLivArea’ as the independent variable.
x <- data$GrLivArea
head(x)
## [1] 1710 1262 1786 1717 2198 1362
Probability
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 1st quartile of the X variable, and the small letter “y” is estimated as the 2nd quartile of the Y variable. Interpret the meaning of all probabilities.
Define Quartiles:
The first quartile, denoted by \(Q_1\) , is the median of the lower half of the data set. This means that about 25% of the numbers in the data set lie below \(Q_1\) and about 75% lie above \(Q_1\).
The third quartile, denoted by \(Q_3\) , is the median of the upper half of the data set. This means that about 75% of the numbers in the data set lie below \(Q_3\) and about 25% lie above \(Q_3\).
The second quartile, denoted by \(Q_2\) is the median of the data. This means that about 50% of the numbers in the data set lie below \(Q_2\) and about 50% lie above \(Q_2\).
Reference: http://web.mnstate.edu/peil/MDEV102/U4/S36/S363.html
\[a. P(X > x | Y > y)\]
In other words, this asks: what is the probability that a home’s ‘GrLivArea’ is larger than the first quartile of X (1130 square feet), given that its ‘SalePrice’ is larger than the second quartile of Y ($163,000)?
describe_x <- quantile(x)
describe_y <- quantile(y)
# For GrLivArea
Q1_x <- as.integer(describe_x[2])
Q2_x <- as.integer(describe_x[3])
Q3_x <- as.integer(describe_x[4])
total_x <- length(x)
# For SalePrice
Q1_y <- as.integer(describe_y[2])
Q2_y <- as.integer(describe_y[3])
Q3_y <- as.integer(describe_y[4])
total_y <- length(y)
For reference, the total number of homes in this data set:
total_x
## [1] 1460
Let’s calculate \(P(Y > y)\) first. By the definition of the median, \(P(Y > y) \approx 0.5\) (not exactly 0.5 here, since homes priced exactly at the median are excluded). Let’s see how many houses are priced above the median.
# We already know that the 2nd quartile is 50% of the points. We need to find out how many houses are greater than median.
total_house_price_grQ2 <- sum(y > Q2_y)
print(paste("How many homes are there more than the median? ", total_house_price_grQ2))## [1] "How many homes are there more than the median? 728"
Out of these homes where the Sale Price is greater than $163,000, how many have Ground Living Area greater than 1130 square feet (Q1)?
# Start necessary packages
if (!require(tidyr)) install.packages("tidyr")
if (!require(dplyr)) install.packages("dplyr")
library(tidyr)
library(dplyr)
select_homes <- data %>% select(GrLivArea, SalePrice) %>% filter(SalePrice > Q2_y)
GrLivArea_GrQ1 <- select_homes %>% filter(GrLivArea > Q1_x)
head(GrLivArea_GrQ1)
## GrLivArea SalePrice
## 1 1710 208500
## 2 1262 181500
## 3 1786 223500
## 4 2198 250000
## 5 1694 307000
## 6 2090 200000
print(paste0("How many homes have GrLivArea > Q1 given SalePrice > Q2? ", nrow(GrLivArea_GrQ1)))## [1] "How many homes have GrLivArea > Q1 given SalePrice > Q2? 720"
print(paste0("P(X > x | Y > y): ", nrow(GrLivArea_GrQ1)/nrow(select_homes)))## [1] "P(X > x | Y > y): 0.989010989010989"
# We divide by nrow(select_homes) because Y > y is a given condition (a conditional probability, not an AND event)
\[b. P(X > x , Y > y)\]
In other words, what is the probability of having a home with more than 1130 square feet (\(Q1_x\)) AND a sale price greater than $163,000 (\(Q2_y\))?
# Similar approach to part a, but notice that the filter now requires both conditions (a joint event rather than a conditional one).
select_homes2 <- data %>% select(GrLivArea, SalePrice) %>% filter(SalePrice > Q2_y & GrLivArea > Q1_x)
head(select_homes2)
## GrLivArea SalePrice
## 1 1710 208500
## 2 1262 181500
## 3 1786 223500
## 4 2198 250000
## 5 1694 307000
## 6 2090 200000
print(paste0("How many homes have GrLivArea > Q1 AND SalePrice > Q2? ", nrow(select_homes2)))## [1] "How many homes have GrLivArea > Q1 AND SalePrice > Q2? 720"
print(paste0("P(X > x & Y > y): ", nrow(select_homes2)/nrow(data)))## [1] "P(X > x & Y > y): 0.493150684931507"
\[c. P(X < x | Y > y)\]
Given that homes are greater than $163000 (\(Q2_y\)), what is the probability that homes are less than 1130 square feet (\(Q1_x\))?
select_homes3 <- data %>% select(GrLivArea, SalePrice) %>% filter(SalePrice > Q2_y)
GrLivArea_LessQ1 <- select_homes3 %>% filter(GrLivArea < Q1_x)
head(GrLivArea_LessQ1)
## GrLivArea SalePrice
## 1 964 163500
## 2 988 180000
## 3 1110 165500
## 4 1126 175000
## 5 999 178400
## 6 1088 170000
print(paste0("How many homes have GrLivArea < Q1 given SalePrice > Q2? ", nrow(GrLivArea_LessQ1)))## [1] "How many homes have GrLivArea < Q1 given SalePrice > Q2? 8"
print(paste0("P(X < x | Y > y): ", nrow(GrLivArea_LessQ1)/nrow(select_homes3)))## [1] "P(X < x | Y > y): 0.010989010989011"
Does splitting the training data in this fashion make them independent? In other words, does \(P(X|Y) = P(X) P(Y)\)?
As we noted in the example above, \(P(X|Y)\) does NOT equal \(P(X)P(Y)\). Independence would require \(P(X > x, Y > y) = P(X > x)\,P(Y > y)\), but the joint probability (about 0.493) is far larger than the product (about \(0.75 \times 0.499 \approx 0.374\)). Please see the check below. This makes complete sense, as we expect Ground Living Area to increase with home prices. In the next section, we will make some plots and quantify the correlation.
# Does P(X > x, Y > y) equal P(X > x) * P(Y > y)?
p_joint <- nrow(select_homes2)/nrow(data)
p_x <- sum(x > Q1_x)/total_x
p_y <- sum(y > Q2_y)/total_y
p_joint == p_x * p_y
## [1] FALSE
Check mathematically, and then evaluate by running a Chi Square test for association.
A chi-square test for independence compares two variables in a contingency table to see if they are related. In a more general sense, it tests whether the distributions of categorical variables differ from one another.
A very small chi-square test statistic means that the observed counts are close to the counts expected under independence; in other words, there is no evidence of a relationship.
A very large chi-square test statistic means that the observed counts deviate substantially from those expected under independence; in other words, there is evidence of a relationship.
\[\chi^2 = \Sigma \frac{(O_i - E_i)^2}{E_i}\] A chi-square statistic is one way to show a relationship between two categorical variables. In statistics, there are two types of variables: numerical (countable) variables and non-numerical (categorical) variables. The chi-squared statistic is a single number that tells you how much difference exists between your observed counts and the counts you would expect if there was no relationship at all in the population.
Reference: http://www.statisticshowto.com/probability-and-statistics/chi-square/
This of course would be difficult to perform by hand, so we will do it in R.
# Reference: http://www.r-tutor.com/elementary-statistics/goodness-fit/chi-squared-test-independence
if (!require(MASS)) install.packages("MASS")
library(MASS)
tbl <- table(data$GrLivArea, data$SalePrice)
chisq.test(tbl)
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 589730, df = 569320, p-value < 2.2e-16
As we expected, the p-value is far below 0.05; therefore, we reject the null hypothesis and conclude that these two variables are dependent. (One caveat: with nearly every raw value unique, most expected cell counts in this table are tiny, so the chi-square approximation on the raw values is shaky; binning the variables first, as sketched below, is more defensible.)
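Here is a minimal sketch of that binned version (an illustrative addition, not part of the original submission): cutting each variable at the quartile cutoffs computed earlier yields a 2x2 table with large expected counts, testing the same independence hypothesis.
# Bin at the quartile cutoffs and test independence on the 2x2 table
above_q1_x <- factor(x > Q1_x, labels = c("GrLivArea <= Q1", "GrLivArea > Q1"))
above_q2_y <- factor(y > Q2_y, labels = c("SalePrice <= Q2", "SalePrice > Q2"))
chisq.test(table(above_q1_x, above_q2_y))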
Descriptive and Inferential Statistics
Provide univariate descriptive statistics and appropriate plots for both variables.
if (!require(ggplot2)) install.packages("ggplot2")
if (!require(psych)) install.packages("psych")
library(ggplot2)
library(psych)
print("Summary descriptive statistics of 'GrLivArea")## [1] "Summary descriptive statistics of 'GrLivArea"
summary(x)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1130 1464 1515 1777 5642
describe(x)
## vars n mean sd median trimmed mad min max range skew
## X1 1 1460 1515.46 525.48 1464 1467.67 483.33 334 5642 5308 1.36
## kurtosis se
## X1 4.86 13.75
ggplot(data, aes(x = GrLivArea)) + geom_histogram(binwidth = 100) + labs(title = "Ground Living Area Histogram", x = "Square Footage", y = "Count")
ggplot(data, aes(x = GrLivArea)) + geom_density(fill = "green") + geom_vline(xintercept=c(1130, 1464, 1777), col = c("red", "yellow","blue")) + xlab("Ground Living Area") + labs(title="Ground Living Area Density Plot")
print("Summary descriptive statistics of 'SalePrice'")
## [1] "Summary descriptive statistics of 'SalePrice'"
summary(y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
describe(y)
## vars n mean sd median trimmed mad min max range
## X1 1 1460 180921.2 79442.5 163000 170783.3 56338.8 34900 755000 720100
## skew kurtosis se
## X1 1.88 6.5 2079.11
ggplot(data, aes(x = SalePrice)) + geom_histogram(binwidth = 10000) + labs(title = "Sale Price Histogram", x = "Price", y = "Count")
ggplot(data, aes(x = SalePrice)) + geom_density(fill = "green") + geom_vline(xintercept=c(129975, 163000, 214000), col = c("red", "yellow","blue")) + xlab("Sale Price") + labs(title="Sale Price Density Plot")
In the above plots, red is 25% (Q1), yellow is 50% (median or Q2), and blue is 75% (Q3).
As you can see in the histograms and density plots, both ‘GrLivArea’ and ‘SalePrice’ are right skewed (positively skewed). The output above also lists the mean, median, skew, kurtosis, standard deviation, and standard error.
Provide a scatterplot of X and Y. Transform both variables simultaneously using Box-Cox transformations.
ggplot(data, aes(x = GrLivArea, y = SalePrice)) + geom_point(color = "blue") + labs(title="Ground Living Area vs. Sale Price", x = "Ground Living Area", y = "Sale Price in Dollars")
Box-Cox Transformation
Heteroscedasticity undermines a model’s standard errors and significance tests. A Box-Cox transformation is a way to transform a non-normal dependent variable into an approximately normal shape. Normality is an important assumption for many statistical techniques; if your data aren’t normal, applying a Box-Cox transformation lets you run a broader range of tests.
At the core of the Box-Cox transformation is an exponent, lambda (\(\lambda\)), which typically varies from -5 to 5. All values of \(\lambda\) are considered and the optimal value for your data is selected; the “optimal value” is the one that results in the best approximation of a normal distribution curve.
The transformation of Y has the form:
\[y(\lambda) = \begin{cases} \frac{y^{\lambda}-1}{\lambda}, & \text{if } \lambda \neq 0 \\ log(y), & \text{if } \lambda = 0\\ \end{cases}\]
This transformation only works for positive data. However, Box and Cox did propose a second formula that can be used for negative y-values:
\[y(\lambda) = \begin{cases} \frac{(y+\lambda_2)^{\lambda_1}-1}{\lambda_1}, & \text{if } \lambda_1 \neq 0 \\ log(y+\lambda_2), & \text{if } \lambda_1 = 0\\ \end{cases}\]
We can use R and the command boxcox(object,…) to help create the transformations.
Reference: http://blog.minitab.com/blog/applying-statistics-in-quality-projects/how-could-you-benefit-from-a-box-cox-transformation, http://www.statisticshowto.com/box-cox-transformation/
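To make the piecewise definition concrete, here is a minimal helper function (a sketch; the analysis below applies the \(\lambda \neq 0\) branch inline rather than using such a helper):
# Box-Cox transform for a single lambda, following the first definition above
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1)/lambda
}
box_cox(c(100, 200, 400), lambda = 0.1)  # lambda != 0 branch
box_cox(c(100, 200, 400), lambda = 0)    # reduces to log(y)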
if (!require(MASS)) install.packages("MASS")
library(MASS)
home_lm <- lm(y ~ x)
summary(home_lm)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -462999 -29800 -1124 21957 339832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18569.026 4480.755 4.144 3.61e-05 ***
## x 107.130 2.794 38.348 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56070 on 1458 degrees of freedom
## Multiple R-squared: 0.5021, Adjusted R-squared: 0.5018
## F-statistic: 1471 on 1 and 1458 DF, p-value: < 2.2e-16
# See the Q-Q plot and the residuals plot
plot(fitted(home_lm), resid(home_lm))
abline(h=0, lty = 2, col = "darkorange", lwd = 2)
hist(resid(home_lm))
qqnorm(resid(home_lm))
qqline(resid(home_lm))
This is the untransformed data. As you can see, the residuals deviate substantially from the line in the Normal Q-Q plot, and the residuals-vs-fitted plot shows heteroscedasticity. Let’s see what the Box-Cox transformation brings us.
# Reference: https://daviddalpiaz.github.io/appliedstats/transformations.html
bc_home <- boxcox(home_lm)
boxcox(home_lm, plotit = TRUE, lambda = seq(-0.2, 0.5, by = 0.05))
Let’s take the optimal \(\lambda\) as 0.1 (rounded to make the calculation easier), transform the data, and look at the new regression model fit on the transformed data.
# Reference: www.ics.uci.edu/~jutts/st108/Transformations_in_R.doc
home_lambda <- 0.1
SalePrice_lambda <- (y^(home_lambda)-1)/home_lambda
NewModel <- lm(SalePrice_lambda ~ x)
NewModel
##
## Call:
## lm(formula = SalePrice_lambda ~ x)
##
## Coefficients:
## (Intercept) x
## 20.592934 0.001791
summary(NewModel)
##
## Call:
## lm(formula = SalePrice_lambda ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5554 -0.4882 0.0835 0.5251 3.1168
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.059e+01 7.564e-02 272.25 <2e-16 ***
## x 1.791e-03 4.716e-05 37.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9466 on 1458 degrees of freedom
## Multiple R-squared: 0.4974, Adjusted R-squared: 0.4971
## F-statistic: 1443 on 1 and 1458 DF, p-value: < 2.2e-16
plot(fitted(NewModel), resid(NewModel))
abline(h=0, lty = 2, col = "darkorange", lwd = 2)
hist(resid(NewModel))
qqnorm(resid(NewModel))
qqline(resid(NewModel))
While it’s not perfect, the transformation has certainly improved the model, as seen in the Q-Q plot, the residual histogram, and the residuals-vs-fitted plot.
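One practical note: fitted values from NewModel live on the transformed scale. To report them in dollars, we can invert the Box-Cox transform (a quick sketch using the home_lambda chosen above):
# Invert z = (y^lambda - 1)/lambda, i.e. y = (lambda*z + 1)^(1/lambda)
fitted_dollars <- (home_lambda * fitted(NewModel) + 1)^(1/home_lambda)
head(round(fitted_dollars))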
Linear Algebra and Correlation
Using at least three untransformed variables, build a correlation matrix. Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
A correlation matrix
Let’s talk about a correlation test first. A correlation test is used to evaluate the association between two or more variables. There are different methods to perform correlation analysis.
Pearson correlation (r) measures the linear dependence between two variables (x and y). It is a parametric correlation test because it depends on the distribution of the data; strictly speaking, it should be used only when x and y come from normal distributions. The plot of y = f(x) is then called the linear regression line.
Kendall’s tau and Spearman’s rho are rank-based correlation coefficients (non-parametric).
The most commonly used method is the Pearson correlation method.
Pearson Correlation Formula
\[r = \frac{\Sigma(x-m_x)(y-m_y)}{\sqrt{\Sigma(x-m_x)^2 \Sigma(y-m_y)^2}}\]
where \(m_x\) and \(m_y\) are the means of x and y variables.
The p-value (significance level) of the correlation can be determined:
By using the correlation coefficient table for the degrees of freedom \(df = n - 2\), where n is the number of observations in the x and y variables.
or by calculating the t-value as follows:
\[t = \frac{r}{\sqrt{1-r^2}}\sqrt{n-2}\] The corresponding p-value can be determined using the t distribution table for \(df = n - 2\) in: http://www.sthda.com/english/wiki/t-distribution-table
A correlation matrix is used to investigate the dependence between multiple variables at the same time. The result is a table containing the correlation coefficient between each pair of variables.
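Before building the matrix, we can sanity-check the Pearson formula above against R’s built-in cor() using our X and Y (a quick sketch):
# Pearson's r straight from the formula, then via cor()
m_x <- mean(x); m_y <- mean(y)
r_manual <- sum((x - m_x) * (y - m_y)) / sqrt(sum((x - m_x)^2) * sum((y - m_y)^2))
c(manual = r_manual, builtin = cor(x, y))
# And the corresponding t-statistic with df = n - 2
n_obs <- length(x)
r_manual / sqrt(1 - r_manual^2) * sqrt(n_obs - 2)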
Let’s choose three untransformed variables at random (in addition to our ‘GrLivArea’) and create a dataframe.
# Let's choose 'LotArea', 'YearBuilt', 'OverallQual', and 'GrLivArea'
LotArea <- data$LotArea
YearBuilt <- data$YearBuilt
OverallQual <- data$OverallQual
GrLivArea <- data$GrLivArea
# And of course, the dependent variable, 'SalePrice'
SalePrice <- data$SalePrice
# Building the dataframe
home_data <- data.frame(LotArea, YearBuilt, OverallQual, GrLivArea, SalePrice)
head(home_data)
## LotArea YearBuilt OverallQual GrLivArea SalePrice
## 1 8450 2003 7 1710 208500
## 2 9600 1976 6 1262 181500
## 3 11250 2001 7 1786 223500
## 4 9550 1915 7 1717 140000
## 5 14260 2000 8 2198 250000
## 6 14115 1993 5 1362 143000
Let’s build a correlation matrix.
if(!require(ggpubr)) install.packages("ggpubr")
if(!require(corrplot)) install.packages("corrplot")
library(ggpubr)
library(corrplot)
home_cor <- cor(home_data, method = "pearson", use = "complete.obs")
round(home_cor,2)
## LotArea YearBuilt OverallQual GrLivArea SalePrice
## LotArea 1.00 0.01 0.11 0.26 0.26
## YearBuilt 0.01 1.00 0.57 0.20 0.52
## OverallQual 0.11 0.57 1.00 0.59 0.79
## GrLivArea 0.26 0.20 0.59 1.00 0.71
## SalePrice 0.26 0.52 0.79 0.71 1.00
corrplot(home_cor, method = "circle")
The above is the correlation matrix (home_cor) with four independent variables and one dependent variable. For fun, we can also look at the corresponding p-values for the correlation matrix.
if (!require(Hmisc)) install.packages("Hmisc")
library(Hmisc)
cor2 <- rcorr(as.matrix(home_data))
cor2
## LotArea YearBuilt OverallQual GrLivArea SalePrice
## LotArea 1.00 0.01 0.11 0.26 0.26
## YearBuilt 0.01 1.00 0.57 0.20 0.52
## OverallQual 0.11 0.57 1.00 0.59 0.79
## GrLivArea 0.26 0.20 0.59 1.00 0.71
## SalePrice 0.26 0.52 0.79 0.71 1.00
##
## n= 1460
##
##
## P
## LotArea YearBuilt OverallQual GrLivArea SalePrice
## LotArea 0.587 0.000 0.000 0.000
## YearBuilt 0.587 0.000 0.000 0.000
## OverallQual 0.000 0.000 0.000 0.000
## GrLivArea 0.000 0.000 0.000 0.000
## SalePrice 0.000 0.000 0.000 0.000
The p-values for each variable’s correlation with ‘SalePrice’ are all essentially zero (well below 0.05), indicating statistically significant associations; the lone non-significant pair in the table is LotArea vs. YearBuilt (p = 0.587).
Let’s create the inversion of the correlation matrix.
# Using the solve function to find the inverse
home_cor_inv <- solve(home_cor)
home_cor_inv
## LotArea YearBuilt OverallQual GrLivArea SalePrice
## LotArea 1.13181609 0.09259515 0.2856985 -0.1598675 -0.4597361
## YearBuilt 0.09259515 1.70273460 -0.7504489 0.6214791 -0.7615900
## OverallQual 0.28569848 -0.75044891 3.1118940 -0.5053939 -1.7862883
## GrLivArea -0.15986749 0.62147910 -0.5053939 2.2801980 -1.4988366
## SalePrice -0.45973614 -0.76159002 -1.7862883 -1.4988366 3.9945652
Reference: http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r, http://www.sthda.com/english/wiki/correlation-matrix-a-quick-start-guide-to-analyze-format-and-visualize-a-correlation-matrix-using-r-software
So what is a precision matrix? It is the inverse of a covariance matrix.
Now, what is a covariance matrix? A covariance matrix, like many matrices used in statistics, is symmetric. That means that the table has the same headings across the top as it does along the side.
The simplest example, and a cousin of a covariance matrix, is a correlation matrix. It’s a table in which each variable is listed in both the column headings and row headings, and each cell of the table (i.e. matrix) is the correlation between the variables that make up the column and row headings. Please note above for an example.
All correlations on the diagonal equal 1, because they’re the correlation of each variable with itself.
A Covariance Matrix is very similar. There are really two differences between it and the Correlation Matrix.
\[\begin{bmatrix} Var_1 & Cov_{1,2} & Cov_{1,3} \\ Cov_{1,2} & Var_2 & Cov_{2,3} \\ Cov_{1,3} & Cov_{2,3} & Var_3 \end{bmatrix}\]
First, covariance is just an unstandardized version of correlation. To compute a correlation, we divide the covariance by the standard deviations of both variables to remove the units of measurement; so a covariance is just a correlation measured in the units of the original variables.
Second, the diagonal variables are the variances of each variable. A covariance of a variable with itself is simply the variance.
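To see this standardization concretely (a quick sketch with our X and Y):
# Covariance divided by both standard deviations equals the correlation
cov(x, y) / (sd(x) * sd(y))
cor(x, y)  # identical value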
Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
# Let's confirm that this is truly the inverse
# Remember: A %*% A(inv) == I and A(inv) %*% A == I
test <- home_cor %*% home_cor_inv
round(test,2)
## LotArea YearBuilt OverallQual GrLivArea SalePrice
## LotArea 1 0 0 0 0
## YearBuilt 0 1 0 0 0
## OverallQual 0 0 1 0 0
## GrLivArea 0 0 0 1 0
## SalePrice 0 0 0 0 1
test2 <- home_cor_inv %*% home_cor
round(test2,2)
## LotArea YearBuilt OverallQual GrLivArea SalePrice
## LotArea 1 0 0 0 0
## YearBuilt 0 1 0 0 0
## OverallQual 0 0 1 0 0
## GrLivArea 0 0 0 1 0
## SalePrice 0 0 0 0 1
It does appear to produce an identity matrix in both directions, so home_cor_inv is indeed the inverse of home_cor. home_cor_inv is known as the precision matrix, and it contains variance inflation factors on the diagonal.
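To see the variance-inflation-factor connection concretely: each diagonal entry equals \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing variable j on all the other variables in the matrix. A quick sketch for GrLivArea (the two numbers should agree):
# Precision-matrix diagonal vs. 1/(1 - R^2) from a regression on the other variables
r2 <- summary(lm(GrLivArea ~ ., data = home_data))$r.squared
c(from_precision = home_cor_inv["GrLivArea", "GrLivArea"], from_regression = 1/(1 - r2))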
Reference: https://www.statlect.com/glossary/precision-matrix, http://www.theanalysisfactor.com/covariance-matrices/
Calculus-Based Probability & Statistics
Many times, it makes sense to fit a closed form distribution to data. For your non-transformed independent variable, location shift (if necessary) it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit a density function of your choice. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html.)
We will go back and use GrLivArea as our independent variable. Let’s check the minimum value and ensure that this value is above zero.
print(paste0("Minimum value of 'GrLivArea' greater than 0? ", min(GrLivArea) > 0))## [1] "Minimum value of 'GrLivArea' greater than 0? TRUE"
Before we try to fit a density function, let’s revisit the distribution that GrLivArea takes on.
# Reference: https://stackoverflow.com/questions/6967664/ggplot2-histogram-with-normal-curve
ggplot(data, aes(x=GrLivArea)) + geom_histogram(fill="green", color = "black", binwidth = 500) + labs(x="Ground Living Area", title = "Histogram: Ground Living Area") + stat_function(fun = function(x,mean,sd, n, bw){
dnorm(x = x, mean = mean, sd = sd) * n * bw}, args = c(mean = mean(GrLivArea), sd = sd(GrLivArea), n = 1000, bw = 500))Let’s see if can use fitdistr to fit a normal density function.
fit <- fitdistr(GrLivArea, densfun="normal") # Assuming a normal distribution
fit
## mean sd
## 1515.46370 525.30039
## ( 13.74774) ( 9.72112)
hist(GrLivArea, pch=20, breaks=25, prob=TRUE, main="")
curve(dnorm(x, fit$estimate[1], fit$estimate[2]), col="red", lwd=2, add=T)
The normal curve fits the data reasonably well, but not perfectly. Let’s see if we can find a better-fitting distribution.
# Using fitdistrplus
if (!require(fitdistrplus)) install.packages("fitdistrplus")
library(fitdistrplus)
plotdist(GrLivArea, histo = TRUE, demp = TRUE)
descdist(GrLivArea, discrete = FALSE, boot = 500)
## summary statistics
## ------
## min: 334 max: 5642
## median: 1464
## mean: 1515.464
## estimated sd: 525.4804
## estimated skewness: 1.36656
## estimated kurtosis: 7.895121
The Cullen and Frey graph produced by descdist also includes nonparametric bootstrap values (boot = 500) of the skewness and kurtosis.
Let’s try to fit a distribution. We will try the Weibull, gamma, and log-normal distributions.
fit_w <- fitdist(GrLivArea, "weibull")
fit_g <- fitdist(GrLivArea, "gamma")
fit_ln <- fitdist(GrLivArea, "lnorm")
summary(fit_ln)
## Fitting of the distribution ' lnorm ' by maximum likelihood
## Parameters :
## estimate Std. Error
## meanlog 7.2677744 0.008726424
## sdlog 0.3334362 0.006170264
## Loglikelihood: -11079.08 AIC: 22162.15 BIC: 22172.73
## Correlation matrix:
## meanlog sdlog
## meanlog 1 0
## sdlog 0 1
par(mfrow=c(2,2))
plot.legend <- c("Weibull", "lognormal", "gamma")
denscomp(list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
cdfcomp (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
qqcomp (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
ppcomp (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
Reference: http://www.di.fc.ul.pt/~jpn/r/distributions/fitting.html
It appears that the log-normal distribution is a better fit than the simple normal distribution. (We will ignore the gamma distribution for now.)
The log-normal distribution (aka Galton) is a probability distribution with a normally distributed logarithm. A random variable is lognormally distributed if its logarithm is normally distributed. Skewed distributions (which our data has) with low mean values, large variance, and all-positive values often fit this type of distribution. Values must be positive as \(log(x)\) exists only for positive values of x.
print(paste0("Mean: ",mean(GrLivArea)))## [1] "Mean: 1515.46369863014"
print(paste0("Variance: ", sd(GrLivArea) ** 2))## [1] "Variance: 276129.633362596"
print(paste0("All values are positive? ", min(GrLivArea) > 0))## [1] "All values are positive? TRUE"
As you can see, our data is certainly a candidate for a logarithmic transformation.
Let’s try again, and redo the fitdistr function with a log-normal distribution.
fit <- fitdistr(GrLivArea, densfun="log-normal")
fit
## meanlog sdlog
## 7.267774383 0.333436175
## (0.008726424) (0.006170513)
GrLivArea_log <- log(GrLivArea)
hist(GrLivArea_log, pch=20, breaks=25, prob=TRUE, main="")
curve(dnorm(x, fit$estimate[1], fit$estimate[2]), col="red", lwd=2, add=T)
As you can see, the fitted log-normal distribution matches the (log-transformed) data quite nicely.
Reference: http://www.statisticshowto.com/lognormal-distribution/
Find the optimal value of the parameters for this distribution, and then take 1000 samples from this distribution (e.g. \(rexp(1000,\lambda)\) for an exponential). Plot a histogram and compare it with a histogram of your non-transformed original variable.
We already noted the optimal parameter values for the log-normal distribution above; here they are again.
fit
## meanlog sdlog
## 7.267774383 0.333436175
## (0.008726424) (0.006170513)
The meanlog is ~7.27 and the sdlog is ~0.33.
Let’s draw a thousand bootstrap samples from the data (log-transformed and untransformed), plot each in a histogram, and compare.
bootstrap_log <- sample(GrLivArea_log, size = 1000, replace = TRUE, prob = NULL)
bootstrap_normal <- sample(GrLivArea, size = 1000, replace = TRUE, prob = NULL)
par(mfrow=c(1,2))
plot.legend <- c("normal", "lognormal")
hist(bootstrap_normal)
hist(bootstrap_log)
As you can see, the histogram of the log-transformed samples looks far closer to a normal distribution than the histogram of the untransformed data.
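For completeness, the assignment’s phrasing suggests sampling from the fitted distribution itself (analogous to rexp(1000, \(\lambda\)) for an exponential). Here is a minimal sketch using the meanlog/sdlog estimates above; the simulated draws should reproduce the right-skewed shape of the observed data.
set.seed(605)  # arbitrary seed, for reproducibility only
sim_lnorm <- rlnorm(1000, meanlog = fit$estimate[1], sdlog = fit$estimate[2])
par(mfrow=c(1,2))
hist(GrLivArea, prob = TRUE, main = "Observed GrLivArea")
hist(sim_lnorm, prob = TRUE, main = "Simulated log-normal draws")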
Modeling
Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your kaggle.com user name and score.
Reference: Linear Regression Using R: An Introduction to Data Modeling
Below is my submission to the Kaggle.com competition.
Username: jcp9010 or JoelPark
Score: 0.14033
This dataset gives us an opportunity to look at a large set of homes in Ames, Iowa, and to see whether we can fit a regression model that appropriately predicts the SalePrice of a home.
Let’s start by loading the packages and the data set.
# Load the packages
library(Hmisc)
library(ggpubr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(MASS)
library(corrplot)
library(psych)
library(fitdistrplus)
# Download the dataset
url <- 'https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20605/Final%20Project/train.csv'
#data <- read.csv(url, header = TRUE, stringsAsFactors = FALSE)
# Show the head of the dataset
head(data)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl CollgCr Norm
## 2 Lvl AllPub FR2 Gtl Veenker Feedr
## 3 Lvl AllPub Inside Gtl CollgCr Norm
## 4 Lvl AllPub Corner Gtl Crawfor Norm
## 5 Lvl AllPub FR2 Gtl NoRidge Norm
## 6 Lvl AllPub Inside Gtl Mitchel Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 2Story 7 5 2003
## 2 Norm 1Fam 1Story 6 8 1976
## 3 Norm 1Fam 2Story 7 5 2001
## 4 Norm 1Fam 2Story 7 5 1915
## 5 Norm 1Fam 2Story 8 5 2000
## 6 Norm 1Fam 1.5Fin 5 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 2003 Gable CompShg VinylSd VinylSd BrkFace
## 2 1976 Gable CompShg MetalSd MetalSd None
## 3 2002 Gable CompShg VinylSd VinylSd BrkFace
## 4 1970 Gable CompShg Wd Sdng Wd Shng None
## 5 2000 Gable CompShg VinylSd VinylSd BrkFace
## 6 1995 Gable CompShg VinylSd VinylSd None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 196 Gd TA PConc Gd TA No
## 2 0 TA TA CBlock Gd TA Gd
## 3 162 Gd TA PConc Gd TA Mn
## 4 0 TA TA BrkTil TA Gd No
## 5 350 Gd TA PConc Gd TA Av
## 6 0 TA TA Wood Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 GLQ 706 Unf 0 150 856
## 2 ALQ 978 Unf 0 284 1262
## 3 GLQ 486 Unf 0 434 920
## 4 ALQ 216 Unf 0 540 756
## 5 GLQ 655 Unf 0 490 1145
## 6 GLQ 732 Unf 0 64 796
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 GasA Ex Y SBrkr 856 854 0
## 2 GasA Ex Y SBrkr 1262 0 0
## 3 GasA Ex Y SBrkr 920 866 0
## 4 GasA Gd Y SBrkr 961 756 0
## 5 GasA Ex Y SBrkr 1145 1053 0
## 6 GasA Ex Y SBrkr 796 566 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1710 1 0 2 1 3
## 2 1262 0 1 2 0 3
## 3 1786 1 0 2 1 3
## 4 1717 1 0 1 0 3
## 5 2198 1 0 2 1 4
## 6 1362 1 0 1 1 1
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 Gd 8 Typ 0 <NA>
## 2 1 TA 6 Typ 1 TA
## 3 1 Gd 6 Typ 1 TA
## 4 1 Gd 7 Typ 1 Gd
## 5 1 Gd 9 Typ 1 TA
## 6 1 TA 5 Typ 0 <NA>
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Attchd 2003 RFn 2 548 TA
## 2 Attchd 1976 RFn 2 460 TA
## 3 Attchd 2001 RFn 2 608 TA
## 4 Detchd 1998 Unf 3 642 TA
## 5 Attchd 2000 RFn 3 836 TA
## 6 Attchd 1993 Unf 2 480 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 TA Y 0 61 0 0
## 2 TA Y 298 0 0 0
## 3 TA Y 0 42 0 0
## 4 TA Y 0 35 272 0
## 5 TA Y 192 84 0 0
## 6 TA Y 40 30 0 320
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 0 0 <NA> <NA> <NA> 0 2 2008
## 2 0 0 <NA> <NA> <NA> 0 5 2007
## 3 0 0 <NA> <NA> <NA> 0 9 2008
## 4 0 0 <NA> <NA> <NA> 0 2 2006
## 5 0 0 <NA> <NA> <NA> 0 12 2008
## 6 0 0 <NA> MnPrv Shed 700 10 2009
## SaleType SaleCondition SalePrice
## 1 WD Normal 208500
## 2 WD Normal 181500
## 3 WD Normal 223500
## 4 WD Abnorml 140000
## 5 WD Normal 250000
## 6 WD Normal 143000
Let’s look closer at the statistics of SalePrice.
summary(data$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
ggplot(data, aes(x=SalePrice)) + geom_histogram(fill="lightblue", col="black",binwidth=10000) + labs(x="Price of a Home", y = "Frequency", title="Home Prices in Ames, Iowa")describe(data$SalePrice)## data$SalePrice
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 663 1 180921 81086 88000 106475
## .25 .50 .75 .90 .95
## 129975 163000 214000 278000 326100
##
## lowest : 34900 35311 37900 39300 40000, highest: 582933 611657 625000 745000 755000
The distribution of the data certainly has a positive skew (which is probably not all that surprising for home prices in general). However, we will need to return to this and apply a transformation that will help normalize the data.
Before we start looking at the correlations of the independent variables to the dependent variables (SalePrice), let’s take a look at the missing data, as that can significantly alter the regression model.
print("Missing Data Points: ")## [1] "Missing Data Points: "
missing_list <- sapply(data, function(x) sum(is.na(x)))
missing_list <- sort(missing_list, decreasing = TRUE)
missing_list <- missing_list[missing_list != 0]
missing_list
## PoolQC MiscFeature Alley Fence FireplaceQu
## 1453 1406 1369 1179 690
## LotFrontage GarageType GarageYrBlt GarageFinish GarageQual
## 259 81 81 81 81
## GarageCond BsmtExposure BsmtFinType2 BsmtQual BsmtCond
## 81 38 38 37 37
## BsmtFinType1 MasVnrType MasVnrArea Electrical
## 37 8 8 1
# Plotting the bar plots of missing data
NAMES <- names(missing_list)
NUM <- as.vector(missing_list)
missing_list_df <- data.frame(NAMES, NUM)
ggplot(missing_list_df, aes(x=NAMES, y=NUM)) + geom_bar(stat="identity", fill="steelblue") + coord_flip() + theme_minimal() + labs(x = "Independent Variables", y = "Number of Data Points Missing", title = "BarPlot: Missing Data Points")
# Some variables are missing quite a lot of data. Let's see what percentage of each variable is missing.
print("Missing Data Points by Percentage: ")## [1] "Missing Data Points by Percentage: "
missing_list_percent <- missing_list/nrow(data)
missing_list_percent## PoolQC MiscFeature Alley Fence FireplaceQu
## 0.9952054795 0.9630136986 0.9376712329 0.8075342466 0.4726027397
## LotFrontage GarageType GarageYrBlt GarageFinish GarageQual
## 0.1773972603 0.0554794521 0.0554794521 0.0554794521 0.0554794521
## GarageCond BsmtExposure BsmtFinType2 BsmtQual BsmtCond
## 0.0554794521 0.0260273973 0.0260273973 0.0253424658 0.0253424658
## BsmtFinType1 MasVnrType MasVnrArea Electrical
## 0.0253424658 0.0054794521 0.0054794521 0.0006849315
Interestingly, ‘PoolQC’ and ‘MiscFeature’ are missing over 99% and 96% of their values, respectively! Let’s make an arbitrary cutoff: we will exclude variables that are missing more than 5% of their values.
drop_list <- names(missing_list_percent[missing_list_percent > 0.05])
updated_data <- data %>% dplyr::select(-one_of(drop_list))
head(updated_data)
## Id MSSubClass MSZoning LotArea Street LotShape LandContour Utilities
## 1 1 60 RL 8450 Pave Reg Lvl AllPub
## 2 2 20 RL 9600 Pave Reg Lvl AllPub
## 3 3 60 RL 11250 Pave IR1 Lvl AllPub
## 4 4 70 RL 9550 Pave IR1 Lvl AllPub
## 5 5 60 RL 14260 Pave IR1 Lvl AllPub
## 6 6 50 RL 14115 Pave IR1 Lvl AllPub
## LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 Inside Gtl CollgCr Norm Norm 1Fam
## 2 FR2 Gtl Veenker Feedr Norm 1Fam
## 3 Inside Gtl CollgCr Norm Norm 1Fam
## 4 Corner Gtl Crawfor Norm Norm 1Fam
## 5 FR2 Gtl NoRidge Norm Norm 1Fam
## 6 Inside Gtl Mitchel Norm Norm 1Fam
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle
## 1 2Story 7 5 2003 2003 Gable
## 2 1Story 6 8 1976 1976 Gable
## 3 2Story 7 5 2001 2002 Gable
## 4 2Story 7 5 1915 1970 Gable
## 5 2Story 8 5 2000 2000 Gable
## 6 1.5Fin 5 5 1993 1995 Gable
## RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual
## 1 CompShg VinylSd VinylSd BrkFace 196 Gd
## 2 CompShg MetalSd MetalSd None 0 TA
## 3 CompShg VinylSd VinylSd BrkFace 162 Gd
## 4 CompShg Wd Sdng Wd Shng None 0 TA
## 5 CompShg VinylSd VinylSd BrkFace 350 Gd
## 6 CompShg VinylSd VinylSd None 0 TA
## ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1
## 1 TA PConc Gd TA No GLQ
## 2 TA CBlock Gd TA Gd ALQ
## 3 TA PConc Gd TA Mn GLQ
## 4 TA BrkTil TA Gd No ALQ
## 5 TA PConc Gd TA Av GLQ
## 6 TA Wood Gd TA No GLQ
## BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## 1 706 Unf 0 150 856 GasA
## 2 978 Unf 0 284 1262 GasA
## 3 486 Unf 0 434 920 GasA
## 4 216 Unf 0 540 756 GasA
## 5 655 Unf 0 490 1145 GasA
## 6 732 Unf 0 64 796 GasA
## HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 Ex Y SBrkr 856 854 0
## 2 Ex Y SBrkr 1262 0 0
## 3 Ex Y SBrkr 920 866 0
## 4 Gd Y SBrkr 961 756 0
## 5 Ex Y SBrkr 1145 1053 0
## 6 Ex Y SBrkr 796 566 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1710 1 0 2 1 3
## 2 1262 0 1 2 0 3
## 3 1786 1 0 2 1 3
## 4 1717 1 0 1 0 3
## 5 2198 1 0 2 1 4
## 6 1362 1 0 1 1 1
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces GarageCars
## 1 1 Gd 8 Typ 0 2
## 2 1 TA 6 Typ 1 2
## 3 1 Gd 6 Typ 1 2
## 4 1 Gd 7 Typ 1 3
## 5 1 Gd 9 Typ 1 3
## 6 1 TA 5 Typ 0 2
## GarageArea PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 548 Y 0 61 0 0
## 2 460 Y 298 0 0 0
## 3 608 Y 0 42 0 0
## 4 642 Y 0 35 272 0
## 5 836 Y 192 84 0 0
## 6 480 Y 40 30 0 320
## ScreenPorch PoolArea MiscVal MoSold YrSold SaleType SaleCondition
## 1 0 0 0 2 2008 WD Normal
## 2 0 0 0 5 2007 WD Normal
## 3 0 0 0 9 2008 WD Normal
## 4 0 0 0 2 2006 WD Abnorml
## 5 0 0 0 12 2008 WD Normal
## 6 0 0 700 10 2009 WD Normal
## SalePrice
## 1 208500
## 2 181500
## 3 223500
## 4 140000
## 5 250000
## 6 143000
print(paste0("Number of Columns in New Updated Data Frame: ", ncol(updated_data)))## [1] "Number of Columns in New Updated Data Frame: 70"
Now that we have eliminated the independent variables with more than 5% missing values, let’s find the correlations among the remaining variables. Since the remaining NA counts are small relative to the entire data set, we will simply drop the NA/NULL values. We will use the Pearson correlation, restricted to the variables that are numeric.
print("Correlation Matrix for Home Variables (Numeric): ")## [1] "Correlation Matrix for Home Variables (Numeric): "
updated_home_cor <- cor(updated_data[sapply(updated_data, is.numeric)], method = "pearson", use = "complete.obs")
round(updated_home_cor,2)
## Id MSSubClass LotArea OverallQual OverallCond YearBuilt
## Id 1.00 0.01 -0.03 -0.03 0.01 -0.02
## MSSubClass 0.01 1.00 -0.14 0.03 -0.06 0.03
## LotArea -0.03 -0.14 1.00 0.11 0.00 0.02
## OverallQual -0.03 0.03 0.11 1.00 -0.09 0.57
## OverallCond 0.01 -0.06 0.00 -0.09 1.00 -0.38
## YearBuilt -0.02 0.03 0.02 0.57 -0.38 1.00
## YearRemodAdd -0.02 0.04 0.02 0.55 0.08 0.59
## MasVnrArea -0.05 0.02 0.10 0.41 -0.13 0.32
## BsmtFinSF1 -0.01 -0.07 0.21 0.24 -0.04 0.25
## BsmtFinSF2 -0.01 -0.07 0.11 -0.06 0.04 -0.05
## BsmtUnfSF -0.01 -0.14 0.00 0.31 -0.14 0.15
## TotalBsmtSF -0.02 -0.24 0.26 0.54 -0.17 0.39
## X1stFlrSF 0.01 -0.25 0.30 0.48 -0.14 0.28
## X2ndFlrSF 0.01 0.31 0.05 0.30 0.03 0.01
## LowQualFinSF -0.04 0.05 0.00 -0.03 0.03 -0.18
## GrLivArea 0.01 0.08 0.26 0.59 -0.08 0.20
## BsmtFullBath 0.00 0.00 0.16 0.11 -0.05 0.19
## BsmtHalfBath -0.02 0.00 0.05 -0.04 0.12 -0.04
## FullBath 0.01 0.14 0.12 0.55 -0.19 0.47
## HalfBath 0.01 0.18 0.02 0.27 -0.06 0.24
## BedroomAbvGr 0.04 -0.02 0.12 0.11 0.01 -0.07
## KitchenAbvGr 0.00 0.29 -0.02 -0.18 -0.08 -0.17
## TotRmsAbvGrd 0.03 0.04 0.19 0.43 -0.06 0.10
## Fireplaces -0.02 -0.04 0.27 0.40 -0.02 0.15
## GarageCars 0.01 -0.04 0.15 0.60 -0.18 0.54
## GarageArea 0.02 -0.10 0.18 0.56 -0.15 0.48
## WoodDeckSF -0.03 -0.01 0.17 0.24 0.00 0.23
## OpenPorchSF -0.01 -0.01 0.09 0.30 -0.03 0.19
## EnclosedPorch 0.00 -0.01 -0.02 -0.11 0.07 -0.39
## X3SsnPorch -0.05 -0.04 0.02 0.03 0.03 0.03
## ScreenPorch 0.00 -0.03 0.04 0.07 0.05 -0.05
## PoolArea 0.06 0.01 0.08 0.07 0.00 0.01
## MiscVal -0.01 -0.01 0.04 -0.03 0.07 -0.03
## MoSold 0.02 -0.01 0.00 0.07 0.00 0.01
## YrSold 0.00 -0.02 -0.01 -0.03 0.04 -0.01
## SalePrice -0.03 -0.08 0.26 0.79 -0.08 0.52
## YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## Id -0.02 -0.05 -0.01 -0.01 -0.01
## MSSubClass 0.04 0.02 -0.07 -0.07 -0.14
## LotArea 0.02 0.10 0.21 0.11 0.00
## OverallQual 0.55 0.41 0.24 -0.06 0.31
## OverallCond 0.08 -0.13 -0.04 0.04 -0.14
## YearBuilt 0.59 0.32 0.25 -0.05 0.15
## YearRemodAdd 1.00 0.18 0.13 -0.07 0.18
## MasVnrArea 0.18 1.00 0.26 -0.07 0.11
## BsmtFinSF1 0.13 0.26 1.00 -0.05 -0.50
## BsmtFinSF2 -0.07 -0.07 -0.05 1.00 -0.21
## BsmtUnfSF 0.18 0.11 -0.50 -0.21 1.00
## TotalBsmtSF 0.29 0.36 0.52 0.11 0.42
## X1stFlrSF 0.24 0.34 0.44 0.10 0.32
## X2ndFlrSF 0.14 0.17 -0.14 -0.10 0.01
## LowQualFinSF -0.06 -0.07 -0.06 0.01 0.03
## GrLivArea 0.29 0.39 0.21 -0.01 0.24
## BsmtFullBath 0.12 0.09 0.65 0.16 -0.42
## BsmtHalfBath -0.01 0.03 0.07 0.07 -0.10
## FullBath 0.44 0.28 0.06 -0.08 0.29
## HalfBath 0.18 0.20 0.00 -0.03 -0.04
## BedroomAbvGr -0.04 0.10 -0.11 -0.02 0.17
## KitchenAbvGr -0.15 -0.04 -0.09 -0.04 0.03
## TotRmsAbvGrd 0.19 0.28 0.04 -0.04 0.25
## Fireplaces 0.11 0.25 0.26 0.05 0.05
## GarageCars 0.42 0.36 0.22 -0.04 0.21
## GarageArea 0.37 0.37 0.30 -0.02 0.18
## WoodDeckSF 0.21 0.16 0.21 0.07 0.00
## OpenPorchSF 0.22 0.13 0.11 0.00 0.13
## EnclosedPorch -0.19 -0.11 -0.11 0.04 0.00
## X3SsnPorch 0.05 0.02 0.03 -0.03 0.02
## ScreenPorch -0.04 0.06 0.06 0.09 -0.01
## PoolArea 0.01 0.01 0.14 0.04 -0.04
## MiscVal -0.01 -0.03 0.00 0.00 -0.02
## MoSold 0.02 -0.01 -0.02 -0.01 0.03
## YrSold 0.04 -0.01 0.02 0.03 -0.04
## SalePrice 0.51 0.48 0.38 -0.01 0.22
## TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## Id -0.02 0.01 0.01 -0.04 0.01
## MSSubClass -0.24 -0.25 0.31 0.05 0.08
## LotArea 0.26 0.30 0.05 0.00 0.26
## OverallQual 0.54 0.48 0.30 -0.03 0.59
## OverallCond -0.17 -0.14 0.03 0.03 -0.08
## YearBuilt 0.39 0.28 0.01 -0.18 0.20
## YearRemodAdd 0.29 0.24 0.14 -0.06 0.29
## MasVnrArea 0.36 0.34 0.17 -0.07 0.39
## BsmtFinSF1 0.52 0.44 -0.14 -0.06 0.21
## BsmtFinSF2 0.11 0.10 -0.10 0.01 -0.01
## BsmtUnfSF 0.42 0.32 0.01 0.03 0.24
## TotalBsmtSF 1.00 0.82 -0.17 -0.03 0.45
## X1stFlrSF 0.82 1.00 -0.20 -0.01 0.57
## X2ndFlrSF -0.17 -0.20 1.00 0.06 0.69
## LowQualFinSF -0.03 -0.01 0.06 1.00 0.14
## GrLivArea 0.45 0.57 0.69 0.14 1.00
## BsmtFullBath 0.31 0.24 -0.17 -0.05 0.03
## BsmtHalfBath 0.00 0.00 -0.02 -0.01 -0.02
## FullBath 0.32 0.38 0.42 0.00 0.63
## HalfBath -0.05 -0.12 0.61 -0.03 0.42
## BedroomAbvGr 0.05 0.13 0.50 0.11 0.52
## KitchenAbvGr -0.08 0.06 0.06 0.01 0.10
## TotRmsAbvGrd 0.29 0.41 0.62 0.13 0.83
## Fireplaces 0.34 0.41 0.19 -0.02 0.46
## GarageCars 0.43 0.44 0.19 -0.09 0.47
## GarageArea 0.49 0.49 0.14 -0.07 0.47
## WoodDeckSF 0.23 0.24 0.09 -0.03 0.25
## OpenPorchSF 0.24 0.21 0.21 0.02 0.33
## EnclosedPorch -0.10 -0.07 0.06 0.06 0.01
## X3SsnPorch 0.04 0.06 -0.02 0.00 0.02
## ScreenPorch 0.09 0.09 0.04 0.03 0.10
## PoolArea 0.13 0.13 0.08 0.06 0.17
## MiscVal -0.02 -0.02 0.02 0.00 0.00
## MoSold 0.01 0.03 0.04 -0.02 0.05
## YrSold -0.01 -0.01 -0.03 -0.03 -0.04
## SalePrice 0.61 0.61 0.32 -0.03 0.71
## BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Id 0.00 -0.02 0.01 0.01 0.04
## MSSubClass 0.00 0.00 0.14 0.18 -0.02
## LotArea 0.16 0.05 0.12 0.02 0.12
## OverallQual 0.11 -0.04 0.55 0.27 0.11
## OverallCond -0.05 0.12 -0.19 -0.06 0.01
## YearBuilt 0.19 -0.04 0.47 0.24 -0.07
## YearRemodAdd 0.12 -0.01 0.44 0.18 -0.04
## MasVnrArea 0.09 0.03 0.28 0.20 0.10
## BsmtFinSF1 0.65 0.07 0.06 0.00 -0.11
## BsmtFinSF2 0.16 0.07 -0.08 -0.03 -0.02
## BsmtUnfSF -0.42 -0.10 0.29 -0.04 0.17
## TotalBsmtSF 0.31 0.00 0.32 -0.05 0.05
## X1stFlrSF 0.24 0.00 0.38 -0.12 0.13
## X2ndFlrSF -0.17 -0.02 0.42 0.61 0.50
## LowQualFinSF -0.05 -0.01 0.00 -0.03 0.11
## GrLivArea 0.03 -0.02 0.63 0.42 0.52
## BsmtFullBath 1.00 -0.15 -0.07 -0.03 -0.15
## BsmtHalfBath -0.15 1.00 -0.05 -0.01 0.05
## FullBath -0.07 -0.05 1.00 0.14 0.36
## HalfBath -0.03 -0.01 0.14 1.00 0.23
## BedroomAbvGr -0.15 0.05 0.36 0.23 1.00
## KitchenAbvGr -0.04 -0.04 0.13 -0.07 0.20
## TotRmsAbvGrd -0.05 -0.02 0.55 0.35 0.68
## Fireplaces 0.14 0.03 0.24 0.20 0.10
## GarageCars 0.13 -0.02 0.47 0.22 0.09
## GarageArea 0.18 -0.02 0.41 0.16 0.07
## WoodDeckSF 0.18 0.04 0.19 0.11 0.05
## OpenPorchSF 0.06 -0.02 0.26 0.20 0.10
## EnclosedPorch -0.05 -0.01 -0.12 -0.09 0.04
## X3SsnPorch 0.00 0.03 0.04 0.00 -0.02
## ScreenPorch 0.02 0.03 -0.01 0.07 0.04
## PoolArea 0.07 0.02 0.05 0.02 0.07
## MiscVal -0.02 -0.01 -0.01 0.00 0.01
## MoSold -0.02 0.03 0.06 -0.01 0.05
## YrSold 0.07 -0.05 -0.02 -0.01 -0.04
## SalePrice 0.23 -0.02 0.56 0.28 0.17
## KitchenAbvGr TotRmsAbvGrd Fireplaces GarageCars GarageArea
## Id 0.00 0.03 -0.02 0.01 0.02
## MSSubClass 0.29 0.04 -0.04 -0.04 -0.10
## LotArea -0.02 0.19 0.27 0.15 0.18
## OverallQual -0.18 0.43 0.40 0.60 0.56
## OverallCond -0.08 -0.06 -0.02 -0.18 -0.15
## YearBuilt -0.17 0.10 0.15 0.54 0.48
## YearRemodAdd -0.15 0.19 0.11 0.42 0.37
## MasVnrArea -0.04 0.28 0.25 0.36 0.37
## BsmtFinSF1 -0.09 0.04 0.26 0.22 0.30
## BsmtFinSF2 -0.04 -0.04 0.05 -0.04 -0.02
## BsmtUnfSF 0.03 0.25 0.05 0.21 0.18
## TotalBsmtSF -0.08 0.29 0.34 0.43 0.49
## X1stFlrSF 0.06 0.41 0.41 0.44 0.49
## X2ndFlrSF 0.06 0.62 0.19 0.19 0.14
## LowQualFinSF 0.01 0.13 -0.02 -0.09 -0.07
## GrLivArea 0.10 0.83 0.46 0.47 0.47
## BsmtFullBath -0.04 -0.05 0.14 0.13 0.18
## BsmtHalfBath -0.04 -0.02 0.03 -0.02 -0.02
## FullBath 0.13 0.55 0.24 0.47 0.41
## HalfBath -0.07 0.35 0.20 0.22 0.16
## BedroomAbvGr 0.20 0.68 0.10 0.09 0.07
## KitchenAbvGr 1.00 0.25 -0.13 -0.05 -0.06
## TotRmsAbvGrd 0.25 1.00 0.32 0.36 0.34
## Fireplaces -0.13 0.32 1.00 0.30 0.27
## GarageCars -0.05 0.36 0.30 1.00 0.88
## GarageArea -0.06 0.34 0.27 0.88 1.00
## WoodDeckSF -0.09 0.17 0.20 0.23 0.23
## OpenPorchSF -0.07 0.24 0.17 0.21 0.24
## EnclosedPorch 0.03 0.00 -0.03 -0.15 -0.12
## X3SsnPorch -0.02 -0.01 0.01 0.04 0.04
## ScreenPorch -0.05 0.06 0.19 0.05 0.05
## PoolArea -0.01 0.08 0.10 0.02 0.06
## MiscVal 0.06 0.02 0.00 -0.04 -0.03
## MoSold 0.03 0.04 0.05 0.04 0.03
## YrSold 0.03 -0.03 -0.02 -0.04 -0.03
## SalePrice -0.14 0.54 0.47 0.64 0.62
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch
## Id -0.03 -0.01 0.00 -0.05 0.00
## MSSubClass -0.01 -0.01 -0.01 -0.04 -0.03
## LotArea 0.17 0.09 -0.02 0.02 0.04
## OverallQual 0.24 0.30 -0.11 0.03 0.07
## OverallCond 0.00 -0.03 0.07 0.03 0.05
## YearBuilt 0.23 0.19 -0.39 0.03 -0.05
## YearRemodAdd 0.21 0.22 -0.19 0.05 -0.04
## MasVnrArea 0.16 0.13 -0.11 0.02 0.06
## BsmtFinSF1 0.21 0.11 -0.11 0.03 0.06
## BsmtFinSF2 0.07 0.00 0.04 -0.03 0.09
## BsmtUnfSF 0.00 0.13 0.00 0.02 -0.01
## TotalBsmtSF 0.23 0.24 -0.10 0.04 0.09
## X1stFlrSF 0.24 0.21 -0.07 0.06 0.09
## X2ndFlrSF 0.09 0.21 0.06 -0.02 0.04
## LowQualFinSF -0.03 0.02 0.06 0.00 0.03
## GrLivArea 0.25 0.33 0.01 0.02 0.10
## BsmtFullBath 0.18 0.06 -0.05 0.00 0.02
## BsmtHalfBath 0.04 -0.02 -0.01 0.03 0.03
## FullBath 0.19 0.26 -0.12 0.04 -0.01
## HalfBath 0.11 0.20 -0.09 0.00 0.07
## BedroomAbvGr 0.05 0.10 0.04 -0.02 0.04
## KitchenAbvGr -0.09 -0.07 0.03 -0.02 -0.05
## TotRmsAbvGrd 0.17 0.24 0.00 -0.01 0.06
## Fireplaces 0.20 0.17 -0.03 0.01 0.19
## GarageCars 0.23 0.21 -0.15 0.04 0.05
## GarageArea 0.23 0.24 -0.12 0.04 0.05
## WoodDeckSF 1.00 0.06 -0.13 -0.03 -0.07
## OpenPorchSF 0.06 1.00 -0.09 -0.01 0.08
## EnclosedPorch -0.13 -0.09 1.00 -0.04 -0.08
## X3SsnPorch -0.03 -0.01 -0.04 1.00 -0.03
## ScreenPorch -0.07 0.08 -0.08 -0.03 1.00
## PoolArea 0.07 0.06 0.05 -0.01 0.05
## MiscVal -0.01 -0.02 0.02 0.00 0.03
## MoSold 0.02 0.07 -0.03 0.03 0.02
## YrSold 0.02 -0.06 -0.01 0.02 0.01
## SalePrice 0.32 0.31 -0.13 0.05 0.11
## PoolArea MiscVal MoSold YrSold SalePrice
## Id 0.06 -0.01 0.02 0.00 -0.03
## MSSubClass 0.01 -0.01 -0.01 -0.02 -0.08
## LotArea 0.08 0.04 0.00 -0.01 0.26
## OverallQual 0.07 -0.03 0.07 -0.03 0.79
## OverallCond 0.00 0.07 0.00 0.04 -0.08
## YearBuilt 0.01 -0.03 0.01 -0.01 0.52
## YearRemodAdd 0.01 -0.01 0.02 0.04 0.51
## MasVnrArea 0.01 -0.03 -0.01 -0.01 0.48
## BsmtFinSF1 0.14 0.00 -0.02 0.02 0.38
## BsmtFinSF2 0.04 0.00 -0.01 0.03 -0.01
## BsmtUnfSF -0.04 -0.02 0.03 -0.04 0.22
## TotalBsmtSF 0.13 -0.02 0.01 -0.01 0.61
## X1stFlrSF 0.13 -0.02 0.03 -0.01 0.61
## X2ndFlrSF 0.08 0.02 0.04 -0.03 0.32
## LowQualFinSF 0.06 0.00 -0.02 -0.03 -0.03
## GrLivArea 0.17 0.00 0.05 -0.04 0.71
## BsmtFullBath 0.07 -0.02 -0.02 0.07 0.23
## BsmtHalfBath 0.02 -0.01 0.03 -0.05 -0.02
## FullBath 0.05 -0.01 0.06 -0.02 0.56
## HalfBath 0.02 0.00 -0.01 -0.01 0.28
## BedroomAbvGr 0.07 0.01 0.05 -0.04 0.17
## KitchenAbvGr -0.01 0.06 0.03 0.03 -0.14
## TotRmsAbvGrd 0.08 0.02 0.04 -0.03 0.54
## Fireplaces 0.10 0.00 0.05 -0.02 0.47
## GarageCars 0.02 -0.04 0.04 -0.04 0.64
## GarageArea 0.06 -0.03 0.03 -0.03 0.62
## WoodDeckSF 0.07 -0.01 0.02 0.02 0.32
## OpenPorchSF 0.06 -0.02 0.07 -0.06 0.31
## EnclosedPorch 0.05 0.02 -0.03 -0.01 -0.13
## X3SsnPorch -0.01 0.00 0.03 0.02 0.05
## ScreenPorch 0.05 0.03 0.02 0.01 0.11
## PoolArea 1.00 0.03 -0.03 -0.06 0.09
## MiscVal 0.03 1.00 -0.01 0.00 -0.02
## MoSold -0.03 -0.01 1.00 -0.15 0.05
## YrSold -0.06 0.00 -0.15 1.00 -0.03
## SalePrice 0.09 -0.02 0.05 -0.03 1.00
corrplot(updated_home_cor, method = "circle")
The above correlation plot is a little cluttered. Let’s take a closer look at each independent variable’s correlation with SalePrice.
updated_home_cor_df <- as.data.frame(updated_home_cor)
SalePrice_cor <- updated_home_cor_df$SalePrice
SalePrice_cor_names <- colnames(updated_home_cor_df)
SalePrice_cor_df <- data.frame(SalePrice_cor_names, SalePrice_cor)
SalePrice_cor_df <- SalePrice_cor_df[order(SalePrice_cor_df$SalePrice_cor, decreasing = TRUE),]
SalePrice_cor_df
## SalePrice_cor_names SalePrice_cor
## 36 SalePrice 1.00000000
## 4 OverallQual 0.78999714
## 16 GrLivArea 0.71007968
## 25 GarageCars 0.63968574
## 26 GarageArea 0.62249171
## 12 TotalBsmtSF 0.61297093
## 13 X1stFlrSF 0.60684938
## 19 FullBath 0.56249126
## 23 TotRmsAbvGrd 0.53631085
## 6 YearBuilt 0.52289616
## 7 YearRemodAdd 0.50715781
## 8 MasVnrArea 0.47749305
## 24 Fireplaces 0.46893028
## 9 BsmtFinSF1 0.38397695
## 27 WoodDeckSF 0.32465013
## 14 X2ndFlrSF 0.32270976
## 28 OpenPorchSF 0.31126782
## 20 HalfBath 0.28204017
## 3 LotArea 0.26467400
## 17 BsmtFullBath 0.22502749
## 11 BsmtUnfSF 0.21573973
## 21 BedroomAbvGr 0.17193419
## 31 ScreenPorch 0.11304390
## 32 PoolArea 0.09310862
## 30 X3SsnPorch 0.04524719
## 34 MoSold 0.04513617
## 10 BsmtFinSF2 -0.01031642
## 18 BsmtHalfBath -0.01599287
## 33 MiscVal -0.02095080
## 15 LowQualFinSF -0.02526275
## 1 Id -0.02534261
## 35 YrSold -0.02618020
## 5 OverallCond -0.07629370
## 2 MSSubClass -0.08281336
## 29 EnclosedPorch -0.12877809
## 22 KitchenAbvGr -0.13741923
So it appears that the top ten correlated variables (excluding SalePrice) are: OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, X1stFlrSF, FullBath, TotRmsAbvGrd, YearBuilt, and YearRemodAdd.
The above items seem to make sense. The overall quality of the home, the size of the home, the year it was built, and so forth seem like they would correlate with the price of a home. However, as you may have noticed, some independent variables likely correlate with each other; for instance, GarageCars and GarageArea: the more cars a garage holds, the larger its area is likely to be. When we build our linear regression model, we may be able to peel off some of these variables; a quick collinearity check follows below.
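As a rough collinearity screen (a sketch, not part of the original submission), we can reuse the precision-matrix identity from the earlier section to compute VIFs among these top ten predictors; values well above roughly 5 are commonly taken to flag collinearity.
# VIFs via the diagonal of the inverted correlation matrix
top_vars <- c("OverallQual", "GrLivArea", "GarageCars", "GarageArea", "TotalBsmtSF",
              "X1stFlrSF", "FullBath", "TotRmsAbvGrd", "YearBuilt", "YearRemodAdd")
vifs <- diag(solve(cor(updated_data[, top_vars], use = "complete.obs")))
round(sort(vifs, decreasing = TRUE), 2)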
Before we start building the model, let’s go back to SalePrice. As we saw, its histogram has quite a positive skew. All of the data points are positive (homes really should not sell for negative dollars), the mean is relatively low, and the variance is large.
print(paste0("SalePrice Mean: ", round(mean(updated_data$SalePrice),2)))## [1] "SalePrice Mean: 180921.2"
print(paste0("SalePrice Variance: ", round(sd(updated_data$SalePrice) ** 2)))## [1] "SalePrice Variance: 6311111264"
print(paste0("Are all data points of Sale Price positive? ", min(updated_data$SalePrice) > 0))## [1] "Are all data points of Sale Price positive? TRUE"
Therefore, this data may be a good fit for a logarithmic transformation.
fit <- fitdistr(updated_data$SalePrice, densfun="log-normal")
fit
## meanlog sdlog
## 12.024050901 0.399315046
## ( 0.010450552) ( 0.007389656)
SalePrice_log <- log(updated_data$SalePrice)
bootstrap_log <- sample(SalePrice_log, size = 1000, replace = TRUE, prob = NULL)
bootstrap_normal <- sample(data$SalePrice, size = 1000, replace = TRUE, prob = NULL)
par(mfrow=c(1,2))
plot.legend <- c("normal", "lognormal")
hist(bootstrap_normal, col ="lightblue")
hist(bootstrap_log, col = "lightgreen")
With the log transformation, the SalePrice data appears much closer to normal. Let’s add this new column, SalePrice_log, to the dataset.
data_log <- cbind(updated_data, SalePrice_log)
cols.dont.want <- "SalePrice"
data_log <- data_log[, ! names(data_log) %in% cols.dont.want, drop = TRUE]
# Let us remove any rows with NAs in their datapoints.
data_log <- na.omit(data_log)
Now that we have a better general idea of what is going on with the data, let’s partition it into training and testing sets. For this prediction, we will use only the numeric columns (not the categorical data columns).
data_log <- data_log[sapply(data_log, is.numeric)]
rows <- nrow(data_log)
f <- .6 # f is the arbitrary % of data used for training
upper_bound <- floor(f * rows)
permuted_data <- data_log[sample(rows),]
train.dat <- permuted_data[1:upper_bound,]
test.dat <- permuted_data[(upper_bound+1):rows,]
Now that we have our training and testing data (with SalePrice_log as our dependent variable), we can now build the linear regression model.
f <- reformulate(setdiff(colnames(train.dat), "SalePrice_log"), response = "SalePrice_log")
home_log_lm <- lm(f, data = train.dat)
summary(home_log_lm)
##
## Call:
## lm(formula = f, data = train.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.42541 -0.07200 0.00441 0.08253 0.64685
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.576e+01 8.667e+00 1.818 0.069456 .
## Id -1.326e-05 1.312e-05 -1.011 0.312446
## MSSubClass -8.490e-04 1.557e-04 -5.454 6.53e-08 ***
## LotArea 1.848e-06 5.492e-07 3.366 0.000800 ***
## OverallQual 9.451e-02 7.351e-03 12.856 < 2e-16 ***
## OverallCond 4.775e-02 6.337e-03 7.535 1.31e-13 ***
## YearBuilt 2.809e-03 3.735e-04 7.520 1.45e-13 ***
## YearRemodAdd 9.184e-04 4.007e-04 2.292 0.022163 *
## MasVnrArea 3.425e-05 3.516e-05 0.974 0.330327
## BsmtFinSF1 -1.200e-05 3.491e-05 -0.344 0.731140
## BsmtFinSF2 2.522e-06 4.779e-05 0.053 0.957933
## BsmtUnfSF -4.094e-05 3.424e-05 -1.196 0.232105
## TotalBsmtSF NA NA NA NA
## X1stFlrSF 2.022e-04 4.047e-05 4.997 7.13e-07 ***
## X2ndFlrSF 1.089e-04 3.019e-05 3.606 0.000329 ***
## LowQualFinSF 1.248e-04 1.450e-04 0.861 0.389621
## GrLivArea NA NA NA NA
## BsmtFullBath 5.553e-02 1.598e-02 3.475 0.000539 ***
## BsmtHalfBath -6.942e-03 2.534e-02 -0.274 0.784172
## FullBath 5.207e-02 1.695e-02 3.071 0.002202 **
## HalfBath 2.221e-02 1.630e-02 1.362 0.173443
## BedroomAbvGr -7.450e-04 1.021e-02 -0.073 0.941824
## KitchenAbvGr 1.783e-02 3.535e-02 0.505 0.614011
## TotRmsAbvGrd 1.849e-02 7.498e-03 2.466 0.013855 *
## Fireplaces 5.221e-02 1.077e-02 4.847 1.50e-06 ***
## GarageCars 8.871e-02 1.738e-02 5.103 4.16e-07 ***
## GarageArea -3.760e-05 5.791e-05 -0.649 0.516359
## WoodDeckSF 1.466e-04 4.727e-05 3.101 0.001995 **
## OpenPorchSF 5.318e-05 9.626e-05 0.552 0.580790
## EnclosedPorch 1.513e-04 1.098e-04 1.378 0.168567
## X3SsnPorch 3.025e-04 1.960e-04 1.544 0.123016
## ScreenPorch 3.883e-04 1.103e-04 3.521 0.000454 ***
## PoolArea -5.639e-04 1.534e-04 -3.675 0.000253 ***
## MiscVal -4.539e-05 4.055e-05 -1.119 0.263349
## MoSold 6.982e-06 2.103e-03 0.003 0.997352
## YrSold -6.270e-03 4.316e-03 -1.453 0.146662
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1595 on 813 degrees of freedom
## Multiple R-squared: 0.8475, Adjusted R-squared: 0.8413
## F-statistic: 136.9 on 33 and 813 DF, p-value: < 2.2e-16
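The two NA rows in the coefficient table are the “singularities” lm() warns about: in this dataset TotalBsmtSF and GrLivArea are exact sums of other predictors, so they carry no additional information and are dropped. A quick check (a sketch; both lines should print TRUE if the identities hold in this data):
with(train.dat, all(TotalBsmtSF == BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF))
with(train.dat, all(GrLivArea == X1stFlrSF + X2ndFlrSF + LowQualFinSF))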
So we have a model with an adjusted R-squared of about 0.84 and a large, highly significant F-statistic (the exact values will vary with the random train/test split). However, quite a few of the individual predictors do not appear to be statistically significant. Let’s see if we can trim them by working backwards with a stepwise selection.
Following the paper “Variable selection with stepwise and best subset approaches” by Zhongheng Zhang, we will use the stepAIC() function from the MASS package.
step <- stepAIC(home_log_lm, trace = FALSE)
print("Final Model")## [1] "Final Model"
final_model_df <- step$model
final_colnames <- colnames(final_model_df)
head(final_model_df)## SalePrice_log MSSubClass LotArea OverallQual OverallCond YearBuilt
## 936 11.28853 30 5825 4 5 1926
## 547 12.25486 50 8737 6 7 1923
## 1157 12.10016 80 9350 5 8 1965
## 386 12.16525 120 3182 8 5 2004
## 152 12.82773 20 13891 8 5 2007
## 111 11.82701 50 9525 6 4 1954
## YearRemodAdd BsmtUnfSF X1stFlrSF X2ndFlrSF BsmtFullBath FullBath
## 936 1953 600 747 0 0 1
## 547 1950 765 915 720 0 1
## 1157 1999 586 1265 0 0 2
## 386 2005 1232 1269 0 0 2
## 152 2008 310 1710 0 1 2
## 111 1972 550 1216 639 0 2
## HalfBath TotRmsAbvGrd Fireplaces GarageCars WoodDeckSF X3SsnPorch
## 936 0 5 0 2 0 0
## 547 1 6 1 2 0 144
## 1157 0 6 1 2 0 96
## 386 0 6 1 2 146 0
## 152 0 6 1 3 0 0
## 111 0 7 0 1 182 0
## ScreenPorch PoolArea YrSold
## 936 0 0 2006
## 547 0 0 2007
## 1157 0 0 2008
## 386 144 0 2010
## 152 0 0 2008
## 111 0 0 2006
print(paste0("How many columns are there? ", ncol(final_model_df)))## [1] "How many columns are there? 21"
print("What are the columns? ") ## [1] "What are the columns? "
print(final_colnames)## [1] "SalePrice_log" "MSSubClass" "LotArea" "OverallQual"
## [5] "OverallCond" "YearBuilt" "YearRemodAdd" "BsmtUnfSF"
## [9] "X1stFlrSF" "X2ndFlrSF" "BsmtFullBath" "FullBath"
## [13] "HalfBath" "TotRmsAbvGrd" "Fireplaces" "GarageCars"
## [17] "WoodDeckSF" "X3SsnPorch" "ScreenPorch" "PoolArea"
## [21] "YrSold"
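Since stepAIC() selects by Akaike’s information criterion, \(AIC = 2k - 2\ln L\), we can confirm that the reduced model scores no worse than the full one (a sketch; stepAIC() returns the refitted lm object, so AIC() applies to it directly):
AIC(home_log_lm)  # full model
AIC(step)         # reduced model; should be lower (better)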
With this stepwise elimination, most of the ten variables most strongly correlated with SalePrice were retained in the final model (though interestingly, not all of them made it). Let’s refit the linear regression on just these selected variables.
final_lm <- lm(SalePrice_log ~ . , data = final_model_df)
summary(final_lm)##
## Call:
## lm(formula = SalePrice_log ~ ., data = final_model_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.47201 -0.07474 0.00468 0.08297 0.66575
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.631e+01 8.432e+00 1.935 0.053381 .
## MSSubClass -8.223e-04 1.407e-04 -5.844 7.35e-09 ***
## LotArea 1.746e-06 5.374e-07 3.249 0.001205 **
## OverallQual 9.564e-02 6.801e-03 14.063 < 2e-16 ***
## OverallCond 4.606e-02 6.148e-03 7.492 1.74e-13 ***
## YearBuilt 2.610e-03 3.336e-04 7.824 1.56e-14 ***
## YearRemodAdd 9.656e-04 3.867e-04 2.497 0.012713 *
## BsmtUnfSF -3.026e-05 1.737e-05 -1.742 0.081834 .
## X1stFlrSF 1.983e-04 2.690e-05 7.372 4.09e-13 ***
## X2ndFlrSF 1.101e-04 2.827e-05 3.893 0.000107 ***
## BsmtFullBath 5.654e-02 1.433e-02 3.946 8.61e-05 ***
## FullBath 5.552e-02 1.634e-02 3.398 0.000710 ***
## HalfBath 2.539e-02 1.599e-02 1.588 0.112737
## TotRmsAbvGrd 1.896e-02 6.321e-03 2.999 0.002786 **
## Fireplaces 5.121e-02 1.039e-02 4.927 1.01e-06 ***
## GarageCars 7.962e-02 1.025e-02 7.770 2.32e-14 ***
## WoodDeckSF 1.403e-04 4.576e-05 3.065 0.002245 **
## X3SsnPorch 2.962e-04 1.945e-04 1.523 0.128119
## ScreenPorch 3.691e-04 1.086e-04 3.398 0.000711 ***
## PoolArea -5.792e-04 1.511e-04 -3.834 0.000135 ***
## YrSold -6.399e-03 4.195e-03 -1.525 0.127570
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1589 on 826 degrees of freedom
## Multiple R-squared: 0.8463, Adjusted R-squared: 0.8426
## F-statistic: 227.5 on 20 and 826 DF, p-value: < 2.2e-16
layout(matrix(c(1,2,3,4),2,2))
plot(final_lm)
So this is the final model. Note that its adjusted R-squared (0.8426) is essentially the same as the full model’s (0.8413) despite using far fewer predictors. The ‘Residuals vs Fitted’ plot appears fairly homoscedastic, and the Q-Q plot follows the Q-Q line except at the tails.
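That visual impression of constant variance can be complemented with a formal test (a sketch; it requires the lmtest package, which is not used elsewhere in this project). The Breusch-Pagan test’s null hypothesis is homoscedastic errors, so a large p-value is reassuring:
library(lmtest)
bptest(final_lm)  # H0: constant error variance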
Let’s see how this model compares with the testing dataset.
# Building the prediction
predicted.dat <- predict(final_lm, newdata = test.dat)
delta <- predicted.dat - test.dat$SalePrice_log
# Performing a T-test
t.test(delta, conf.level = 0.95)##
## One Sample t-test
##
## data: delta
## t = -1.5175, df = 564, p-value = 0.1297
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.019686741 0.002525584
## sample estimates:
## mean of x
## -0.008580579
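The t-test above speaks to bias; as a complementary accuracy measure we can compute the test-set RMSE on the log scale (a sketch; this is the same scale Kaggle uses to score this competition):
sqrt(mean(delta^2))  # root mean squared error of the log-price predictions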
# Plotting delta
plot(delta)
The scatter of delta around zero suggests the model predicts well: there are a few outliers, but most of the errors sit close to 0. Let’s look at the predictions against the actual values, and then at the residuals.
# Plotting prediction vs. actual
plot(predicted.dat, test.dat$SalePrice_log, xlab = "Predicted Log Price", ylab = "Actual Log Price", main = "Predicted vs. Actual (Log Prices)")
# Residuals
hist(resid(final_lm), col = "red", xlab = "Residuals", main = "Histogram of Residuals")
qqnorm(resid(final_lm))
qqline(resid(final_lm))
plot(fitted(final_lm), resid(final_lm), main = "Fitted vs. Residuals")
abline(h=0, col="red")
Overall, the model looks promising. We do need to remember that these predictions are on the log scale, not in actual dollars, and account for that when we make our final predictions.
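One refinement worth noting (a sketch, not applied to the submission below): exponentiating a log-scale prediction recovers the conditional median of the dollar price rather than the mean, which slightly understates average prices. Duan’s smearing estimator is a standard correction:
# Duan's smearing factor: the average of the exponentiated residuals
# (hypothetical refinement; not used for the Kaggle submission)
smear <- mean(exp(resid(final_lm)))
# exp(predicted.dat) * smear   # mean-unbiased dollar-scale predictions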
References: Zhang Z., “Variable selection with stepwise and best subset approaches” (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4842399/pdf/atm-04-07-136.pdf); https://www.statmethods.net/stats/regression.html
Let’s see if we can use this model to predict on the test dataset provided by Kaggle.com.
test_url <- url('https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20605/Final%20Project/test.csv')
test_df <- read.csv(test_url, header = TRUE, stringsAsFactors = FALSE)
predicted_prices <- predict(final_lm, test_df)
print("Predicted Log Prices")## [1] "Predicted Log Prices"
head(predicted_prices)## 1 2 3 4 5 6
## 11.68407 11.89091 12.02269 12.19141 12.11989 12.07794
We have the predicted log prices. Let’s exponentiate them (raise \(e\) to each prediction) to convert back to dollars.
# In dollars: exp() is vectorized, so it converts every prediction at once
predicted_prices <- exp(predicted_prices)
head(predicted_prices)## 1 2 3 4 5 6
## 118666.2 145934.3 166490.5 197089.6 183485.3 175947.9
Before we create the .csv file, we should check if there are any NA values in the predictions.
print(paste0("Are there NA values in the predictions? ", any(is.na(predicted_prices))))## [1] "Are there NA values in the predictions? TRUE"
print(paste0("How many NA values are there? ", sum(is.na(predicted_prices))))## [1] "How many NA values are there? 3"
Here we take the simpler route for the missing values: replace the NA predictions with the mean of the predicted prices so that we can still submit to Kaggle.
predicted_prices[is.na(predicted_prices)] <- mean(predicted_prices, na.rm = TRUE)
Now that we have our price predictions in dollars, let’s create a CSV file to submit to Kaggle.com. Kaggle requires the format Id, SalePrice.
Id <- test_df$Id
SalePrice <- predicted_prices
submission <- data.frame(Id, SalePrice)
head(submission)## Id SalePrice
## 1 1461 118666.2
## 2 1462 145934.3
## 3 1463 166490.5
## 4 1464 197089.6
## 5 1465 183485.3
## 6 1466 175947.9
Outputting a .csv file can be done via:
# write.csv(submission, file = "submission.csv", row.names=FALSE)
# This is commented out so that it doesn't output a new .csv file on every run
This submission received a Kaggle score of 0.14033.
This concludes my final project on linear regression and housing prices.