CUNY 605 Computational Mathematics Final Project
House Prices: Advanced Regression Techniques competition (https://www.kaggle.com/c/house-prices-advanced-regression-techniques).
Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Pick SalePrice as the dependent variable, and define it as Y for the following analysis.
url <- 'https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20605/Final%20Project/train.csv'
data <- read.csv(url, header = TRUE, stringsAsFactors = FALSE)
y <- data$SalePrice
head(data)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl CollgCr Norm
## 2 Lvl AllPub FR2 Gtl Veenker Feedr
## 3 Lvl AllPub Inside Gtl CollgCr Norm
## 4 Lvl AllPub Corner Gtl Crawfor Norm
## 5 Lvl AllPub FR2 Gtl NoRidge Norm
## 6 Lvl AllPub Inside Gtl Mitchel Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 2Story 7 5 2003
## 2 Norm 1Fam 1Story 6 8 1976
## 3 Norm 1Fam 2Story 7 5 2001
## 4 Norm 1Fam 2Story 7 5 1915
## 5 Norm 1Fam 2Story 8 5 2000
## 6 Norm 1Fam 1.5Fin 5 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 2003 Gable CompShg VinylSd VinylSd BrkFace
## 2 1976 Gable CompShg MetalSd MetalSd None
## 3 2002 Gable CompShg VinylSd VinylSd BrkFace
## 4 1970 Gable CompShg Wd Sdng Wd Shng None
## 5 2000 Gable CompShg VinylSd VinylSd BrkFace
## 6 1995 Gable CompShg VinylSd VinylSd None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 196 Gd TA PConc Gd TA No
## 2 0 TA TA CBlock Gd TA Gd
## 3 162 Gd TA PConc Gd TA Mn
## 4 0 TA TA BrkTil TA Gd No
## 5 350 Gd TA PConc Gd TA Av
## 6 0 TA TA Wood Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 GLQ 706 Unf 0 150 856
## 2 ALQ 978 Unf 0 284 1262
## 3 GLQ 486 Unf 0 434 920
## 4 ALQ 216 Unf 0 540 756
## 5 GLQ 655 Unf 0 490 1145
## 6 GLQ 732 Unf 0 64 796
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 GasA Ex Y SBrkr 856 854 0
## 2 GasA Ex Y SBrkr 1262 0 0
## 3 GasA Ex Y SBrkr 920 866 0
## 4 GasA Gd Y SBrkr 961 756 0
## 5 GasA Ex Y SBrkr 1145 1053 0
## 6 GasA Ex Y SBrkr 796 566 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1710 1 0 2 1 3
## 2 1262 0 1 2 0 3
## 3 1786 1 0 2 1 3
## 4 1717 1 0 1 0 3
## 5 2198 1 0 2 1 4
## 6 1362 1 0 1 1 1
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 Gd 8 Typ 0 <NA>
## 2 1 TA 6 Typ 1 TA
## 3 1 Gd 6 Typ 1 TA
## 4 1 Gd 7 Typ 1 Gd
## 5 1 Gd 9 Typ 1 TA
## 6 1 TA 5 Typ 0 <NA>
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Attchd 2003 RFn 2 548 TA
## 2 Attchd 1976 RFn 2 460 TA
## 3 Attchd 2001 RFn 2 608 TA
## 4 Detchd 1998 Unf 3 642 TA
## 5 Attchd 2000 RFn 3 836 TA
## 6 Attchd 1993 Unf 2 480 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 TA Y 0 61 0 0
## 2 TA Y 298 0 0 0
## 3 TA Y 0 42 0 0
## 4 TA Y 0 35 272 0
## 5 TA Y 192 84 0 0
## 6 TA Y 40 30 0 320
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 0 0 <NA> <NA> <NA> 0 2 2008
## 2 0 0 <NA> <NA> <NA> 0 5 2007
## 3 0 0 <NA> <NA> <NA> 0 9 2008
## 4 0 0 <NA> <NA> <NA> 0 2 2006
## 5 0 0 <NA> <NA> <NA> 0 12 2008
## 6 0 0 <NA> MnPrv Shed 700 10 2009
## SaleType SaleCondition SalePrice
## 1 WD Normal 208500
## 2 WD Normal 181500
## 3 WD Normal 223500
## 4 WD Abnorml 140000
## 5 WD Normal 250000
## 6 WD Normal 143000
Let’s choose ‘GrLivArea’ as the independent variable.
x <- data$GrLivArea
head(x)
## [1] 1710 1262 1786 1717 2198 1362
Probability
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 1st quartile of the X variable, and the small letter “y” is estimated as the 2nd quartile of the Y variable. Interpret the meaning of all probabilities.
Define Quartiles:
The first quartile, denoted by \(Q_1\) , is the median of the lower half of the data set. This means that about 25% of the numbers in the data set lie below \(Q_1\) and about 75% lie above \(Q_1\).
The third quartile, denoted by \(Q_3\) , is the median of the upper half of the data set. This means that about 75% of the numbers in the data set lie below \(Q_3\) and about 25% lie above \(Q_3\).
The second quartile, denoted by \(Q_2\) is the median of the data. This means that about 50% of the numbers in the data set lie below \(Q_2\) and about 50% lie above \(Q_2\).
Reference: http://web.mnstate.edu/peil/MDEV102/U4/S36/S363.html
\[a. P(X > x | Y > y)\]
In other words, this asks: what is the probability that a home’s ‘GrLivArea’ is larger than the first quartile of X (1130 square feet), given that its ‘SalePrice’ is larger than the second quartile of Y ($163,000)?
describe_x <- quantile(x)
describe_y <- quantile(y)
# For GrLivArea
Q1_x <- as.integer(describe_x[2])
Q2_x <- as.integer(describe_x[3])
Q3_x <- as.integer(describe_x[4])
total_x <- length(x)
# For SalePrice
Q1_y <- as.integer(describe_y[2])
Q2_y <- as.integer(describe_y[3])
Q3_y <- as.integer(describe_y[4])
total_y <- length(y)
For reference, the total number of homes in this data set:
total_x
## [1] 1460
Let’s calculate \(P(Y > y)\) first. By the definition of the median, \(P(Y > y) \approx 0.5\) (not exactly 0.5 here, since homes priced exactly at the median are excluded). Let’s see how many houses are priced above the median.
# We already know that the 2nd quartile is 50% of the points. We need to find out how many houses are greater than median.
total_house_price_grQ2 <- sum(y > Q2_y)
print(paste("How many homes are there more than the median? ", total_house_price_grQ2))## [1] "How many homes are there more than the median? 728"
Out of these homes where the Sale Price is greater than $163,000, how many have Ground Living Area greater than 1130 square feet (Q1)?
# Start necessary packages
if (!require(tidyr)) install.packages("tidyr")
if (!require(dplyr)) install.packages("dplyr")
library(tidyr)
library(dplyr)
select_homes <- data %>% select(GrLivArea, SalePrice) %>% filter(SalePrice > Q2_y)
GrLivArea_GrQ1 <- select_homes %>% filter(GrLivArea > Q1_x)
head(GrLivArea_GrQ1)
## GrLivArea SalePrice
## 1 1710 208500
## 2 1262 181500
## 3 1786 223500
## 4 2198 250000
## 5 1694 307000
## 6 2090 200000
print(paste0("How many homes have GrLivArea > Q1 given SalePrice > Q2? ", nrow(GrLivArea_GrQ1)))## [1] "How many homes have GrLivArea > Q1 given SalePrice > Q2? 720"
print(paste0("P(X > x | Y > y): ", nrow(GrLivArea_GrQ1)/nrow(select_homes)))## [1] "P(X > x | Y > y): 0.989010989010989"
# We divide by nrow(select_homes) because Y > y is a given condition (a conditional probability, not an AND event)
\[b. P(X > x , Y > y)\]
In other words, what is the probability of having a home with more than 1130 square feet (\(Q1_x\)) AND a sale price greater than $163,000 (\(Q2_y\))?
# Similar approach to part a, but notice that the filter now requires both conditions (a joint event rather than a conditional one).
select_homes2 <- data %>% select(GrLivArea, SalePrice) %>% filter(SalePrice > Q2_y & GrLivArea > Q1_x)
head(select_homes2)
## GrLivArea SalePrice
## 1 1710 208500
## 2 1262 181500
## 3 1786 223500
## 4 2198 250000
## 5 1694 307000
## 6 2090 200000
print(paste0("How many homes have GrLivArea > Q1 AND SalePrice > Q2? ", nrow(select_homes2)))## [1] "How many homes have GrLivArea > Q1 AND SalePrice > Q2? 720"
print(paste0("P(X > x & Y > y): ", nrow(select_homes2)/nrow(data)))## [1] "P(X > x & Y > y): 0.493150684931507"
\[c. P(X < x | Y > y)\]
Given that homes are greater than $163000 (\(Q2_y\)), what is the probability that homes are less than 1130 square feet (\(Q1_x\))?
select_homes3 <- data %>% select(GrLivArea, SalePrice) %>% filter(SalePrice > Q2_y)
GrLivArea_LessQ1 <- select_homes3 %>% filter(GrLivArea < Q1_x)
head(GrLivArea_LessQ1)
## GrLivArea SalePrice
## 1 964 163500
## 2 988 180000
## 3 1110 165500
## 4 1126 175000
## 5 999 178400
## 6 1088 170000
print(paste0("How many homes have GrLivArea < Q1 given SalePrice > Q2? ", nrow(GrLivArea_LessQ1)))## [1] "How many homes have GrLivArea < Q1 given SalePrice > Q2? 8"
print(paste0("P(X < x | Y > y): ", nrow(GrLivArea_LessQ1)/nrow(select_homes3)))## [1] "P(X < x | Y > y): 0.010989010989011"
Does splitting the training data in this fashion make them independent? In other words, does \(P(X|Y) = P(X) P(Y)\)?
As we noted in the example above, \(P(X|Y)\) does NOT equal \(P(X)P(Y)\). Independence would require \(P(X > x, Y > y) = P(X > x)\,P(Y > y)\), but the joint probability (about 0.493) is far larger than the product (about \(0.75 \times 0.499 \approx 0.374\)). Please see the check below. This makes complete sense, as we expect Ground Living Area to increase with home prices. In the next section, we will make some plots and quantify the correlation.
# Does P(X > x, Y > y) equal P(X > x) * P(Y > y)?
p_joint <- nrow(select_homes2)/nrow(data)
p_x <- sum(x > Q1_x)/total_x
p_y <- sum(y > Q2_y)/total_y
p_joint == p_x * p_y
## [1] FALSE
Check mathematically, and then evaluate by running a Chi Square test for association.
A chi-square test for independence compares two variables in a contingency table to see if they are related. In a more general sense, it tests whether the distributions of categorical variables differ from one another.
A very small chi-square test statistic means that the observed counts are close to the counts expected under independence; in other words, there is no evidence of a relationship.
A very large chi-square test statistic means that the observed counts deviate substantially from those expected under independence; in other words, there is evidence of a relationship.
\[\chi^2 = \Sigma \frac{(O_i - E_i)^2}{E_i}\] A chi-square statistic is one way to show a relationship between two categorical variables. In statistics, there are two types of variables: numerical (countable) variables and non-numerical (categorical) variables. The chi-squared statistic is a single number that tells you how much difference exists between your observed counts and the counts you would expect if there was no relationship at all in the population.
Reference: http://www.statisticshowto.com/probability-and-statistics/chi-square/
This of course would be difficult to perform by hand, so we will do it in R.
# Reference: http://www.r-tutor.com/elementary-statistics/goodness-fit/chi-squared-test-independence
if (!require(MASS)) install.packages("MASS")
library(MASS)
tbl <- table(data$GrLivArea, data$SalePrice)
chisq.test(tbl)
##
## Pearson's Chi-squared test
##
## data: tbl
## X-squared = 589730, df = 569320, p-value < 2.2e-16
As we expected, the p-value is far below 0.05; therefore, we reject the null hypothesis and conclude that these two variables are dependent. (One caveat: with nearly every raw value unique, most expected cell counts in this table are tiny, so the chi-square approximation on the raw values is shaky; binning the variables first, as sketched below, is more defensible.)
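Here is a minimal sketch of that binned version (an illustrative addition, not part of the original submission): cutting each variable at the quartile cutoffs computed earlier yields a 2x2 table with large expected counts, testing the same independence hypothesis.
# Bin at the quartile cutoffs and test independence on the 2x2 table
above_q1_x <- factor(x > Q1_x, labels = c("GrLivArea <= Q1", "GrLivArea > Q1"))
above_q2_y <- factor(y > Q2_y, labels = c("SalePrice <= Q2", "SalePrice > Q2"))
chisq.test(table(above_q1_x, above_q2_y))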
Descriptive and Inferential Statistics
Provide univariate descriptive statistics and appropriate plots for both variables.
if (!require(ggplot2)) install.packages("ggplot2")
if (!require(psych)) install.packages("psych")
library(ggplot2)
library(psych)
print("Summary descriptive statistics of 'GrLivArea")## [1] "Summary descriptive statistics of 'GrLivArea"
summary(x)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1130 1464 1515 1777 5642
describe(x)
## vars n mean sd median trimmed mad min max range skew
## X1 1 1460 1515.46 525.48 1464 1467.67 483.33 334 5642 5308 1.36
## kurtosis se
## X1 4.86 13.75
ggplot(data, aes(x = GrLivArea)) + geom_histogram(binwidth = 100) + labs(title = "Ground Living Area Histogram", x = "Square Footage", y = "Count")
ggplot(data, aes(x = GrLivArea)) + geom_density(fill = "green") + geom_vline(xintercept=c(1130, 1464, 1777), col = c("red", "yellow","blue")) + xlab("Ground Living Area") + labs(title="Ground Living Area Density Plot")
print("Summary descriptive statistics of 'SalePrice'")
## [1] "Summary descriptive statistics of 'SalePrice'"
summary(y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
describe(y)
## vars n mean sd median trimmed mad min max range
## X1 1 1460 180921.2 79442.5 163000 170783.3 56338.8 34900 755000 720100
## skew kurtosis se
## X1 1.88 6.5 2079.11
ggplot(data, aes(x = SalePrice)) + geom_histogram(binwidth = 10000) + labs(title = "Sale Price Histogram", x = "Price", y = "Count")
ggplot(data, aes(x = SalePrice)) + geom_density(fill = "green") + geom_vline(xintercept=c(129975, 163000, 214000), col = c("red", "yellow","blue")) + xlab("Sale Price") + labs(title="Sale Price Density Plot")
In the above plots, red is 25% (Q1), yellow is 50% (median or Q2), and blue is 75% (Q3).
As you can see in the histograms and density plots, both ‘GrLivArea’ and ‘SalePrice’ are right skewed (positively skewed). The output above also lists the mean, median, skew, kurtosis, standard deviation, and standard error.
Provide a scatterplot of X and Y. Transform both variables simultaneously using Box-Cox transformations.
ggplot(data, aes(x = GrLivArea, y = SalePrice)) + geom_point(color = "blue") + labs(title="Ground Living Area vs. Sale Price", x = "Ground Living Area", y = "Sale Price in Dollars")
Box-Cox Transformation
Heteroscedasticity undermines a model’s standard errors and significance tests. A Box-Cox transformation is a way to transform a non-normal dependent variable into an approximately normal shape. Normality is an important assumption for many statistical techniques; if your data aren’t normal, applying a Box-Cox transformation lets you run a broader range of tests.
At the core of the Box-Cox transformation is an exponent, lambda (\(\lambda\)), which typically varies from -5 to 5. All values of \(\lambda\) are considered and the optimal value for your data is selected; the “optimal value” is the one that results in the best approximation of a normal distribution curve.
The transformation of Y has the form:
\[y(\lambda) = \begin{cases} \frac{y^{\lambda}-1}{\lambda}, & \text{if } \lambda \neq 0 \\ log(y), & \text{if } \lambda = 0\\ \end{cases}\]
This transformation only works for positive data. However, Box and Cox did propose a second formula that can be used for negative y-values:
\[y(\lambda) = \begin{cases} \frac{(y+\lambda_2)^{\lambda_1}-1}{\lambda_1}, & \text{if } \lambda_1 \neq 0 \\ log(y+\lambda_2), & \text{if } \lambda_1 = 0\\ \end{cases}\]
We can use R and the command boxcox(object,…) to help create the transformations.
Reference: http://blog.minitab.com/blog/applying-statistics-in-quality-projects/how-could-you-benefit-from-a-box-cox-transformation, http://www.statisticshowto.com/box-cox-transformation/
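To make the piecewise definition concrete, here is a minimal helper function (a sketch; the analysis below applies the \(\lambda \neq 0\) branch inline rather than using such a helper):
# Box-Cox transform for a single lambda, following the first definition above
box_cox <- function(y, lambda) {
  if (lambda == 0) log(y) else (y^lambda - 1)/lambda
}
box_cox(c(100, 200, 400), lambda = 0.1)  # lambda != 0 branch
box_cox(c(100, 200, 400), lambda = 0)    # reduces to log(y)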
if (!require(MASS)) install.packages("MASS")
library(MASS)
home_lm <- lm(y ~ x)
summary(home_lm)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -462999 -29800 -1124 21957 339832
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18569.026 4480.755 4.144 3.61e-05 ***
## x 107.130 2.794 38.348 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 56070 on 1458 degrees of freedom
## Multiple R-squared: 0.5021, Adjusted R-squared: 0.5018
## F-statistic: 1471 on 1 and 1458 DF, p-value: < 2.2e-16
# See the Q-Q plot and the residuals plot
plot(fitted(home_lm), resid(home_lm))
abline(h=0, lty = 2, col = "darkorange", lwd = 2)
hist(resid(home_lm))
qqnorm(resid(home_lm))
qqline(resid(home_lm))
This is the untransformed data. As you can see, the residuals deviate substantially from the line in the Normal Q-Q plot, and the residuals-vs-fitted plot shows heteroscedasticity. Let’s see what the Box-Cox transformation brings us.
# Reference: https://daviddalpiaz.github.io/appliedstats/transformations.html
bc_home <- boxcox(home_lm)
boxcox(home_lm, plotit = TRUE, lambda = seq(-0.2, 0.5, by = 0.05))
Let’s take the optimal \(\lambda\) as 0.1 (rounded to make the calculation easier), transform the data, and look at the new regression model fit on the transformed data.
# Reference: www.ics.uci.edu/~jutts/st108/Transformations_in_R.doc
home_lambda <- 0.1
SalePrice_lambda <- (y^(home_lambda)-1)/home_lambda
NewModel <- lm(SalePrice_lambda ~ x)
NewModel
##
## Call:
## lm(formula = SalePrice_lambda ~ x)
##
## Coefficients:
## (Intercept) x
## 20.592934 0.001791
summary(NewModel)
##
## Call:
## lm(formula = SalePrice_lambda ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5554 -0.4882 0.0835 0.5251 3.1168
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.059e+01 7.564e-02 272.25 <2e-16 ***
## x 1.791e-03 4.716e-05 37.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9466 on 1458 degrees of freedom
## Multiple R-squared: 0.4974, Adjusted R-squared: 0.4971
## F-statistic: 1443 on 1 and 1458 DF, p-value: < 2.2e-16
plot(fitted(NewModel), resid(NewModel))
abline(h=0, lty = 2, col = "darkorange", lwd = 2)
hist(resid(NewModel))
qqnorm(resid(NewModel))
qqline(resid(NewModel))
While it’s not perfect, the transformation has certainly improved the model, as seen in the Q-Q plot, the residual histogram, and the residuals-vs-fitted plot.
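One practical note: fitted values from NewModel live on the transformed scale. To report them in dollars, we can invert the Box-Cox transform (a quick sketch using the home_lambda chosen above):
# Invert z = (y^lambda - 1)/lambda, i.e. y = (lambda*z + 1)^(1/lambda)
fitted_dollars <- (home_lambda * fitted(NewModel) + 1)^(1/home_lambda)
head(round(fitted_dollars))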
Linear Algebra and Correlation
Using at least three untransformed variables, build a correlation matrix. Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
A correlation matrix
Let’s talk about a correlation test first. A correlation test is used to evaluate the association between two or more variables. There are different methods to perform correlation analysis.
Pearson correlation (r) measures the linear dependence between two variables (x and y). It is a parametric correlation test because it depends on the distribution of the data; strictly speaking, it should be used only when x and y come from normal distributions. The plot of y = f(x) is then called the linear regression line.
Kendall’s tau and Spearman’s rho are rank-based correlation coefficients (non-parametric).
The most commonly used method is the Pearson correlation method.
Pearson Correlation Formula
\[r = \frac{\Sigma(x-m_x)(y-m_y)}{\sqrt{\Sigma(x-m_x)^2 \Sigma(y-m_y)^2}}\]
where \(m_x\) and \(m_y\) are the means of x and y variables.
The p-value (significance level) of the correlation can be determined:
By using the correlation coefficient table for the degrees of freedom \(df = n - 2\), where n is the number of observations in the x and y variables.
or by calculating the t-value as follows:
\[t = \frac{r}{\sqrt{1-r^2}}\sqrt{n-2}\] The corresponding p-value can be determined using the t distribution table for \(df = n - 2\) in: http://www.sthda.com/english/wiki/t-distribution-table
A correlation matrix is used to investigate the dependence between multiple variables at the same time. The result is a table containing the correlation coefficient between each pair of variables.
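Before building the matrix, we can sanity-check the Pearson formula above against R’s built-in cor() using our X and Y (a quick sketch):
# Pearson's r straight from the formula, then via cor()
m_x <- mean(x); m_y <- mean(y)
r_manual <- sum((x - m_x) * (y - m_y)) / sqrt(sum((x - m_x)^2) * sum((y - m_y)^2))
c(manual = r_manual, builtin = cor(x, y))
# And the corresponding t-statistic with df = n - 2
n_obs <- length(x)
r_manual / sqrt(1 - r_manual^2) * sqrt(n_obs - 2)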
Let’s choose three untransformed variables at random (in addition to our ‘GrLivArea’) and create a dataframe.
# Let's choose 'LotArea', 'YearBuilt', 'OverallQual', and 'GrLivArea'
LotArea <- data$LotArea
YearBuilt <- data$YearBuilt
OverallQual <- data$OverallQual
GrLivArea <- data$GrLivArea
# And of course, the dependent variable, 'SalePrice'
SalePrice <- data$SalePrice
# Building the dataframe
home_data <- data.frame(LotArea, YearBuilt, OverallQual, GrLivArea, SalePrice)
head(home_data)
## LotArea YearBuilt OverallQual GrLivArea SalePrice
## 1 8450 2003 7 1710 208500
## 2 9600 1976 6 1262 181500
## 3 11250 2001 7 1786 223500
## 4 9550 1915 7 1717 140000
## 5 14260 2000 8 2198 250000
## 6 14115 1993 5 1362 143000
Let’s build a correlation matrix.
if(!require(ggpubr)) install.packages("ggpubr")
if(!require(corrplot)) install.packages("corrplot")
library(ggpubr)
library(corrplot)
home_cor <- cor(home_data, method = "pearson", use = "complete.obs")
round(home_cor,2)
## LotArea YearBuilt OverallQual GrLivArea SalePrice
## LotArea 1.00 0.01 0.11 0.26 0.26
## YearBuilt 0.01 1.00 0.57 0.20 0.52
## OverallQual 0.11 0.57 1.00 0.59 0.79
## GrLivArea 0.26 0.20 0.59 1.00 0.71
## SalePrice 0.26 0.52 0.79 0.71 1.00
corrplot(home_cor, method = "circle")
The above is the correlation matrix (home_cor) with four independent variables and one dependent variable. For fun, we can also look at the corresponding p-values for the correlation matrix.
if (!require(Hmisc)) install.packages("Hmisc")
library(Hmisc)
cor2 <- rcorr(as.matrix(home_data))
cor2
## LotArea YearBuilt OverallQual GrLivArea SalePrice
## LotArea 1.00 0.01 0.11 0.26 0.26
## YearBuilt 0.01 1.00 0.57 0.20 0.52
## OverallQual 0.11 0.57 1.00 0.59 0.79
## GrLivArea 0.26 0.20 0.59 1.00 0.71
## SalePrice 0.26 0.52 0.79 0.71 1.00
##
## n= 1460
##
##
## P
## LotArea YearBuilt OverallQual GrLivArea SalePrice
## LotArea 0.587 0.000 0.000 0.000
## YearBuilt 0.587 0.000 0.000 0.000
## OverallQual 0.000 0.000 0.000 0.000
## GrLivArea 0.000 0.000 0.000 0.000
## SalePrice 0.000 0.000 0.000 0.000
The p-values for each variable’s correlation with ‘SalePrice’ are all essentially zero (well below 0.05), indicating statistically significant associations; the lone non-significant pair in the table is LotArea vs. YearBuilt (p = 0.587).
Let’s create the inversion of the correlation matrix.
# Using the solve function to find the inverse
home_cor_inv <- solve(home_cor)
home_cor_inv
## LotArea YearBuilt OverallQual GrLivArea SalePrice
## LotArea 1.13181609 0.09259515 0.2856985 -0.1598675 -0.4597361
## YearBuilt 0.09259515 1.70273460 -0.7504489 0.6214791 -0.7615900
## OverallQual 0.28569848 -0.75044891 3.1118940 -0.5053939 -1.7862883
## GrLivArea -0.15986749 0.62147910 -0.5053939 2.2801980 -1.4988366
## SalePrice -0.45973614 -0.76159002 -1.7862883 -1.4988366 3.9945652
Reference: http://www.sthda.com/english/wiki/correlation-test-between-two-variables-in-r, http://www.sthda.com/english/wiki/correlation-matrix-a-quick-start-guide-to-analyze-format-and-visualize-a-correlation-matrix-using-r-software
So what is a precision matrix? It is the inverse of a covariance matrix.
Now, what is a covariance matrix? A covariance matrix, like many matrices used in statistics, is symmetric. That means that the table has the same headings across the top as it does along the side.
The simplest example, and a cousin of a covariance matrix, is a correlation matrix. It’s a table in which each variable is listed in both the column headings and row headings, and each cell of the table (i.e. matrix) is the correlation between the variables that make up the column and row headings. Please note above for an example.
All correlations on the diagonal equal 1, because they’re the correlation of each variable with itself.
A Covariance Matrix is very similar. There are really two differences between it and the Correlation Matrix.
\[\begin{bmatrix} Var_1 & Cov_{1,2} & Cov_{1,3} \\ Cov_{1,2} & Var_2 & Cov_{2,3} \\ Cov_{1,3} & Cov_{2,3} & Var_3 \end{bmatrix}\]
First, covariance is just an unstandardized version of correlation. To compute a correlation, we divide the covariance by the standard deviations of both variables to remove the units of measurement; so a covariance is just a correlation measured in the units of the original variables.
Second, the diagonal variables are the variances of each variable. A covariance of a variable with itself is simply the variance.
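To see this standardization concretely (a quick sketch with our X and Y):
# Covariance divided by both standard deviations equals the correlation
cov(x, y) / (sd(x) * sd(y))
cor(x, y)  # identical value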
Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
# Let's confirm that this is truly the inverse
# Remember: A %*% A(inv) == I and A(inv) %*% A == I
test <- home_cor %*% home_cor_inv
round(test,2)
## LotArea YearBuilt OverallQual GrLivArea SalePrice
## LotArea 1 0 0 0 0
## YearBuilt 0 1 0 0 0
## OverallQual 0 0 1 0 0
## GrLivArea 0 0 0 1 0
## SalePrice 0 0 0 0 1
test2 <- home_cor_inv %*% home_cor
round(test2,2)
## LotArea YearBuilt OverallQual GrLivArea SalePrice
## LotArea 1 0 0 0 0
## YearBuilt 0 1 0 0 0
## OverallQual 0 0 1 0 0
## GrLivArea 0 0 0 1 0
## SalePrice 0 0 0 0 1
It does appear to produce an identity matrix in both directions, so home_cor_inv is indeed the inverse of home_cor. home_cor_inv is known as the precision matrix, and it contains variance inflation factors on the diagonal.
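To see the variance-inflation-factor connection concretely: each diagonal entry equals \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing variable j on all the other variables in the matrix. A quick sketch for GrLivArea (the two numbers should agree):
# Precision-matrix diagonal vs. 1/(1 - R^2) from a regression on the other variables
r2 <- summary(lm(GrLivArea ~ ., data = home_data))$r.squared
c(from_precision = home_cor_inv["GrLivArea", "GrLivArea"], from_regression = 1/(1 - r2))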
Reference: https://www.statlect.com/glossary/precision-matrix, http://www.theanalysisfactor.com/covariance-matrices/
Calculus-Based Probability & Statistics
Many times, it makes sense to fit a closed form distribution to data. For your non-transformed independent variable, location shift (if necessary) it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit a density function of your choice. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html.)
We will go back and use GrLivArea as our independent variable. Let’s check the minimum value and ensure that this value is above zero.
print(paste0("Minimum value of 'GrLivArea' greater than 0? ", min(GrLivArea) > 0))## [1] "Minimum value of 'GrLivArea' greater than 0? TRUE"
Before we try to fit a density function, let’s revisit the distribution that GrLivArea takes on.
# Reference: https://stackoverflow.com/questions/6967664/ggplot2-histogram-with-normal-curve
ggplot(data, aes(x=GrLivArea)) + geom_histogram(fill="green", color = "black", binwidth = 500) + labs(x="Ground Living Area", title = "Histogram: Ground Living Area") + stat_function(fun = function(x,mean,sd, n, bw){
dnorm(x = x, mean = mean, sd = sd) * n * bw}, args = c(mean = mean(GrLivArea), sd = sd(GrLivArea), n = 1000, bw = 500))Let’s see if can use fitdistr to fit a normal density function.
fit <- fitdistr(GrLivArea, densfun="normal") # Assuming a normal distribution
fit
## mean sd
## 1515.46370 525.30039
## ( 13.74774) ( 9.72112)
hist(GrLivArea, pch=20, breaks=25, prob=TRUE, main="")
curve(dnorm(x, fit$estimate[1], fit$estimate[2]), col="red", lwd=2, add=T)
The normal curve fits the data reasonably well, but not perfectly. Let’s see if we can find a better-fitting distribution.
# Using fitdistrplus
if (!require(fitdistrplus)) install.packages("fitdistrplus")
library(fitdistrplus)
plotdist(GrLivArea, histo = TRUE, demp = TRUE)
descdist(GrLivArea, discrete = FALSE, boot = 500)
## summary statistics
## ------
## min: 334 max: 5642
## median: 1464
## mean: 1515.464
## estimated sd: 525.4804
## estimated skewness: 1.36656
## estimated kurtosis: 7.895121
The Cullen and Frey graph produced by descdist also includes nonparametric bootstrap values (boot = 500) of the skewness and kurtosis.
Let’s try to fit a distribution. We will try the Weibull, gamma, and log-normal distributions.
fit_w <- fitdist(GrLivArea, "weibull")
fit_g <- fitdist(GrLivArea, "gamma")
fit_ln <- fitdist(GrLivArea, "lnorm")
summary(fit_ln)
## Fitting of the distribution ' lnorm ' by maximum likelihood
## Parameters :
## estimate Std. Error
## meanlog 7.2677744 0.008726424
## sdlog 0.3334362 0.006170264
## Loglikelihood: -11079.08 AIC: 22162.15 BIC: 22172.73
## Correlation matrix:
## meanlog sdlog
## meanlog 1 0
## sdlog 0 1
par(mfrow=c(2,2))
plot.legend <- c("Weibull", "lognormal", "gamma")
denscomp(list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
cdfcomp (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
qqcomp (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
ppcomp (list(fit_w, fit_g, fit_ln), legendtext = plot.legend)
Reference: http://www.di.fc.ul.pt/~jpn/r/distributions/fitting.html
It appears that the log-normal distribution is a better fit than the simple normal distribution. (We will ignore the gamma distribution for now.)
The log-normal distribution (aka Galton) is a probability distribution with a normally distributed logarithm. A random variable is lognormally distributed if its logarithm is normally distributed. Skewed distributions (which our data has) with low mean values, large variance, and all-positive values often fit this type of distribution. Values must be positive as \(log(x)\) exists only for positive values of x.
print(paste0("Mean: ",mean(GrLivArea)))## [1] "Mean: 1515.46369863014"
print(paste0("Variance: ", sd(GrLivArea) ** 2))## [1] "Variance: 276129.633362596"
print(paste0("All values are positive? ", min(GrLivArea) > 0))## [1] "All values are positive? TRUE"
As you can see, our data is certainly a candidate for a logarithmic transformation.
Let’s try again, and redo the fitdistr function with a log-normal distribution.
fit <- fitdistr(GrLivArea, densfun="log-normal")
fit
## meanlog sdlog
## 7.267774383 0.333436175
## (0.008726424) (0.006170513)
GrLivArea_log <- log(GrLivArea)
hist(GrLivArea_log, pch=20, breaks=25, prob=TRUE, main="")
curve(dnorm(x, fit$estimate[1], fit$estimate[2]), col="red", lwd=2, add=T)
As you can see, the fitted log-normal distribution matches the (log-transformed) data quite nicely.
Reference: http://www.statisticshowto.com/lognormal-distribution/
Find the optimal value of the parameters for this distribution, and then take 1000 samples from this distribution (e.g. \(rexp(1000,\lambda)\) for an exponential). Plot a histogram and compare it with a histogram of your non-transformed original variable.
We already noted the optimal parameter values for the log-normal distribution above; here they are again.
fit
## meanlog sdlog
## 7.267774383 0.333436175
## (0.008726424) (0.006170513)
The meanlog is ~7.27 and the sdlog is ~0.33.
Let’s draw a thousand bootstrap samples from the data (log-transformed and untransformed), plot each in a histogram, and compare.
bootstrap_log <- sample(GrLivArea_log, size = 1000, replace = TRUE, prob = NULL)
bootstrap_normal <- sample(GrLivArea, size = 1000, replace = TRUE, prob = NULL)
par(mfrow=c(1,2))
plot.legend <- c("normal", "lognormal")
hist(bootstrap_normal)
hist(bootstrap_log)
As you can see, the histogram of the log-transformed samples looks far closer to a normal distribution than the histogram of the untransformed data.
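For completeness, the assignment’s phrasing suggests sampling from the fitted distribution itself (analogous to rexp(1000, \(\lambda\)) for an exponential). Here is a minimal sketch using the meanlog/sdlog estimates above; the simulated draws should reproduce the right-skewed shape of the observed data.
set.seed(605)  # arbitrary seed, for reproducibility only
sim_lnorm <- rlnorm(1000, meanlog = fit$estimate[1], sdlog = fit$estimate[2])
par(mfrow=c(1,2))
hist(GrLivArea, prob = TRUE, main = "Observed GrLivArea")
hist(sim_lnorm, prob = TRUE, main = "Simulated log-normal draws")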
Modeling
Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your kaggle.com user name and score.
Reference: Linear Regression Using R: An Introduction to Data Modeling
Below is my submission to the Kaggle.com competition.
Username: jcp9010 or JoelPark
Score: 0.14033
This dataset gives us an opportunity to look at a large set of homes in Ames, Iowa, and to see whether we can fit a regression model that appropriately predicts the SalePrice of a home.
Let’s start by loading the packages and the data set.
# Load the packages
library(Hmisc)
library(ggpubr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(MASS)
library(corrplot)
library(psych)
library(fitdistrplus)
# Download the dataset
url <- 'https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20605/Final%20Project/train.csv'
#data <- read.csv(url, header = TRUE, stringsAsFactors = FALSE)
# Show the head of the dataset
head(data)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 1 Lvl AllPub Inside Gtl CollgCr Norm
## 2 Lvl AllPub FR2 Gtl Veenker Feedr
## 3 Lvl AllPub Inside Gtl CollgCr Norm
## 4 Lvl AllPub Corner Gtl Crawfor Norm
## 5 Lvl AllPub FR2 Gtl NoRidge Norm
## 6 Lvl AllPub Inside Gtl Mitchel Norm
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1 Norm 1Fam 2Story 7 5 2003
## 2 Norm 1Fam 1Story 6 8 1976
## 3 Norm 1Fam 2Story 7 5 2001
## 4 Norm 1Fam 2Story 7 5 1915
## 5 Norm 1Fam 2Story 8 5 2000
## 6 Norm 1Fam 1.5Fin 5 5 1993
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 1 2003 Gable CompShg VinylSd VinylSd BrkFace
## 2 1976 Gable CompShg MetalSd MetalSd None
## 3 2002 Gable CompShg VinylSd VinylSd BrkFace
## 4 1970 Gable CompShg Wd Sdng Wd Shng None
## 5 2000 Gable CompShg VinylSd VinylSd BrkFace
## 6 1995 Gable CompShg VinylSd VinylSd None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 1 196 Gd TA PConc Gd TA No
## 2 0 TA TA CBlock Gd TA Gd
## 3 162 Gd TA PConc Gd TA Mn
## 4 0 TA TA BrkTil TA Gd No
## 5 350 Gd TA PConc Gd TA Av
## 6 0 TA TA Wood Gd TA No
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 GLQ 706 Unf 0 150 856
## 2 ALQ 978 Unf 0 284 1262
## 3 GLQ 486 Unf 0 434 920
## 4 ALQ 216 Unf 0 540 756
## 5 GLQ 655 Unf 0 490 1145
## 6 GLQ 732 Unf 0 64 796
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 GasA Ex Y SBrkr 856 854 0
## 2 GasA Ex Y SBrkr 1262 0 0
## 3 GasA Ex Y SBrkr 920 866 0
## 4 GasA Gd Y SBrkr 961 756 0
## 5 GasA Ex Y SBrkr 1145 1053 0
## 6 GasA Ex Y SBrkr 796 566 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1710 1 0 2 1 3
## 2 1262 0 1 2 0 3
## 3 1786 1 0 2 1 3
## 4 1717 1 0 1 0 3
## 5 2198 1 0 2 1 4
## 6 1362 1 0 1 1 1
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 1 1 Gd 8 Typ 0 <NA>
## 2 1 TA 6 Typ 1 TA
## 3 1 Gd 6 Typ 1 TA
## 4 1 Gd 7 Typ 1 Gd
## 5 1 Gd 9 Typ 1 TA
## 6 1 TA 5 Typ 0 <NA>
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 1 Attchd 2003 RFn 2 548 TA
## 2 Attchd 1976 RFn 2 460 TA
## 3 Attchd 2001 RFn 2 608 TA
## 4 Detchd 1998 Unf 3 642 TA
## 5 Attchd 2000 RFn 3 836 TA
## 6 Attchd 1993 Unf 2 480 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 TA Y 0 61 0 0
## 2 TA Y 298 0 0 0
## 3 TA Y 0 42 0 0
## 4 TA Y 0 35 272 0
## 5 TA Y 192 84 0 0
## 6 TA Y 40 30 0 320
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 1 0 0 <NA> <NA> <NA> 0 2 2008
## 2 0 0 <NA> <NA> <NA> 0 5 2007
## 3 0 0 <NA> <NA> <NA> 0 9 2008
## 4 0 0 <NA> <NA> <NA> 0 2 2006
## 5 0 0 <NA> <NA> <NA> 0 12 2008
## 6 0 0 <NA> MnPrv Shed 700 10 2009
## SaleType SaleCondition SalePrice
## 1 WD Normal 208500
## 2 WD Normal 181500
## 3 WD Normal 223500
## 4 WD Abnorml 140000
## 5 WD Normal 250000
## 6 WD Normal 143000
Let’s look closer at the statistics of SalePrice.
summary(data$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
ggplot(data, aes(x=SalePrice)) + geom_histogram(fill="lightblue", col="black",binwidth=10000) + labs(x="Price of a Home", y = "Frequency", title="Home Prices in Ames, Iowa")describe(data$SalePrice)## data$SalePrice
## n missing distinct Info Mean Gmd .05 .10
## 1460 0 663 1 180921 81086 88000 106475
## .25 .50 .75 .90 .95
## 129975 163000 214000 278000 326100
##
## lowest : 34900 35311 37900 39300 40000, highest: 582933 611657 625000 745000 755000
The distribution of the data certainly has a positive skew (which is probably not all that surprising for home prices in general). However, we will need to return to this and apply a transformation that will help normalize the data.
Before we start looking at the correlations of the independent variables to the dependent variables (SalePrice), let’s take a look at the missing data, as that can significantly alter the regression model.
print("Missing Data Points: ")## [1] "Missing Data Points: "
missing_list <- sapply(data, function(x) sum(is.na(x)))
missing_list <- sort(missing_list, decreasing = TRUE)
missing_list <- missing_list[missing_list != 0]
missing_list
## PoolQC MiscFeature Alley Fence FireplaceQu
## 1453 1406 1369 1179 690
## LotFrontage GarageType GarageYrBlt GarageFinish GarageQual
## 259 81 81 81 81
## GarageCond BsmtExposure BsmtFinType2 BsmtQual BsmtCond
## 81 38 38 37 37
## BsmtFinType1 MasVnrType MasVnrArea Electrical
## 37 8 8 1
# Plotting the bar plots of missing data
NAMES <- names(missing_list)
NUM <- as.vector(missing_list)
missing_list_df <- data.frame(NAMES, NUM)
ggplot(missing_list_df, aes(x=NAMES, y=NUM)) + geom_bar(stat="identity", fill="steelblue") + coord_flip() + theme_minimal() + labs(x = "Independent Variables", y = "Number of Data Points Missing", title = "BarPlot: Missing Data Points")
# Some variables are missing quite a lot of data. Let's see what percentage of each variable is missing.
print("Missing Data Points by Percentage: ")## [1] "Missing Data Points by Percentage: "
missing_list_percent <- missing_list/nrow(data)
missing_list_percent## PoolQC MiscFeature Alley Fence FireplaceQu
## 0.9952054795 0.9630136986 0.9376712329 0.8075342466 0.4726027397
## LotFrontage GarageType GarageYrBlt GarageFinish GarageQual
## 0.1773972603 0.0554794521 0.0554794521 0.0554794521 0.0554794521
## GarageCond BsmtExposure BsmtFinType2 BsmtQual BsmtCond
## 0.0554794521 0.0260273973 0.0260273973 0.0253424658 0.0253424658
## BsmtFinType1 MasVnrType MasVnrArea Electrical
## 0.0253424658 0.0054794521 0.0054794521 0.0006849315
Interestingly, ‘PoolQC’ and ‘MiscFeature’ are missing over 99% and 96% of their values, respectively! Let’s make an arbitrary cutoff: we will exclude variables that are missing more than 5% of their values.
drop_list <- names(missing_list_percent[missing_list_percent > 0.05])
updated_data <- data %>% dplyr::select(-one_of(drop_list))
head(updated_data)
## Id MSSubClass MSZoning LotArea Street LotShape LandContour Utilities
## 1 1 60 RL 8450 Pave Reg Lvl AllPub
## 2 2 20 RL 9600 Pave Reg Lvl AllPub
## 3 3 60 RL 11250 Pave IR1 Lvl AllPub
## 4 4 70 RL 9550 Pave IR1 Lvl AllPub
## 5 5 60 RL 14260 Pave IR1 Lvl AllPub
## 6 6 50 RL 14115 Pave IR1 Lvl AllPub
## LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 Inside Gtl CollgCr Norm Norm 1Fam
## 2 FR2 Gtl Veenker Feedr Norm 1Fam
## 3 Inside Gtl CollgCr Norm Norm 1Fam
## 4 Corner Gtl Crawfor Norm Norm 1Fam
## 5 FR2 Gtl NoRidge Norm Norm 1Fam
## 6 Inside Gtl Mitchel Norm Norm 1Fam
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle
## 1 2Story 7 5 2003 2003 Gable
## 2 1Story 6 8 1976 1976 Gable
## 3 2Story 7 5 2001 2002 Gable
## 4 2Story 7 5 1915 1970 Gable
## 5 2Story 8 5 2000 2000 Gable
## 6 1.5Fin 5 5 1993 1995 Gable
## RoofMatl Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual
## 1 CompShg VinylSd VinylSd BrkFace 196 Gd
## 2 CompShg MetalSd MetalSd None 0 TA
## 3 CompShg VinylSd VinylSd BrkFace 162 Gd
## 4 CompShg Wd Sdng Wd Shng None 0 TA
## 5 CompShg VinylSd VinylSd BrkFace 350 Gd
## 6 CompShg VinylSd VinylSd None 0 TA
## ExterCond Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1
## 1 TA PConc Gd TA No GLQ
## 2 TA CBlock Gd TA Gd ALQ
## 3 TA PConc Gd TA Mn GLQ
## 4 TA BrkTil TA Gd No ALQ
## 5 TA PConc Gd TA Av GLQ
## 6 TA Wood Gd TA No GLQ
## BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## 1 706 Unf 0 150 856 GasA
## 2 978 Unf 0 284 1262 GasA
## 3 486 Unf 0 434 920 GasA
## 4 216 Unf 0 540 756 GasA
## 5 655 Unf 0 490 1145 GasA
## 6 732 Unf 0 64 796 GasA
## HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 1 Ex Y SBrkr 856 854 0
## 2 Ex Y SBrkr 1262 0 0
## 3 Ex Y SBrkr 920 866 0
## 4 Gd Y SBrkr 961 756 0
## 5 Ex Y SBrkr 1145 1053 0
## 6 Ex Y SBrkr 796 566 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 1 1710 1 0 2 1 3
## 2 1262 0 1 2 0 3
## 3 1786 1 0 2 1 3
## 4 1717 1 0 1 0 3
## 5 2198 1 0 2 1 4
## 6 1362 1 0 1 1 1
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces GarageCars
## 1 1 Gd 8 Typ 0 2
## 2 1 TA 6 Typ 1 2
## 3 1 Gd 6 Typ 1 2
## 4 1 Gd 7 Typ 1 3
## 5 1 Gd 9 Typ 1 3
## 6 1 TA 5 Typ 0 2
## GarageArea PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 548 Y 0 61 0 0
## 2 460 Y 298 0 0 0
## 3 608 Y 0 42 0 0
## 4 642 Y 0 35 272 0
## 5 836 Y 192 84 0 0
## 6 480 Y 40 30 0 320
## ScreenPorch PoolArea MiscVal MoSold YrSold SaleType SaleCondition
## 1 0 0 0 2 2008 WD Normal
## 2 0 0 0 5 2007 WD Normal
## 3 0 0 0 9 2008 WD Normal
## 4 0 0 0 2 2006 WD Abnorml
## 5 0 0 0 12 2008 WD Normal
## 6 0 0 700 10 2009 WD Normal
## SalePrice
## 1 208500
## 2 181500
## 3 223500
## 4 140000
## 5 250000
## 6 143000
print(paste0("Number of Columns in New Updated Data Frame: ", ncol(updated_data)))## [1] "Number of Columns in New Updated Data Frame: 70"
Now that we have eliminated the independent variables with more than 5% missing values, let’s find the correlations among the remaining variables. Since the remaining NA counts are small relative to the entire data set, we will simply drop the NA/NULL values. We will use the Pearson correlation, restricted to the variables that are numeric.
print("Correlation Matrix for Home Variables (Numeric): ")## [1] "Correlation Matrix for Home Variables (Numeric): "
updated_home_cor <- cor(updated_data[sapply(updated_data, is.numeric)], method = "pearson", use = "complete.obs")
round(updated_home_cor,2)
## Id MSSubClass LotArea OverallQual OverallCond YearBuilt
## Id 1.00 0.01 -0.03 -0.03 0.01 -0.02
## MSSubClass 0.01 1.00 -0.14 0.03 -0.06 0.03
## LotArea -0.03 -0.14 1.00 0.11 0.00 0.02
## OverallQual -0.03 0.03 0.11 1.00 -0.09 0.57
## OverallCond 0.01 -0.06 0.00 -0.09 1.00 -0.38
## YearBuilt -0.02 0.03 0.02 0.57 -0.38 1.00
## YearRemodAdd -0.02 0.04 0.02 0.55 0.08 0.59
## MasVnrArea -0.05 0.02 0.10 0.41 -0.13 0.32
## BsmtFinSF1 -0.01 -0.07 0.21 0.24 -0.04 0.25
## BsmtFinSF2 -0.01 -0.07 0.11 -0.06 0.04 -0.05
## BsmtUnfSF -0.01 -0.14 0.00 0.31 -0.14 0.15
## TotalBsmtSF -0.02 -0.24 0.26 0.54 -0.17 0.39
## X1stFlrSF 0.01 -0.25 0.30 0.48 -0.14 0.28
## X2ndFlrSF 0.01 0.31 0.05 0.30 0.03 0.01
## LowQualFinSF -0.04 0.05 0.00 -0.03 0.03 -0.18
## GrLivArea 0.01 0.08 0.26 0.59 -0.08 0.20
## BsmtFullBath 0.00 0.00 0.16 0.11 -0.05 0.19
## BsmtHalfBath -0.02 0.00 0.05 -0.04 0.12 -0.04
## FullBath 0.01 0.14 0.12 0.55 -0.19 0.47
## HalfBath 0.01 0.18 0.02 0.27 -0.06 0.24
## BedroomAbvGr 0.04 -0.02 0.12 0.11 0.01 -0.07
## KitchenAbvGr 0.00 0.29 -0.02 -0.18 -0.08 -0.17
## TotRmsAbvGrd 0.03 0.04 0.19 0.43 -0.06 0.10
## Fireplaces -0.02 -0.04 0.27 0.40 -0.02 0.15
## GarageCars 0.01 -0.04 0.15 0.60 -0.18 0.54
## GarageArea 0.02 -0.10 0.18 0.56 -0.15 0.48
## WoodDeckSF -0.03 -0.01 0.17 0.24 0.00 0.23
## OpenPorchSF -0.01 -0.01 0.09 0.30 -0.03 0.19
## EnclosedPorch 0.00 -0.01 -0.02 -0.11 0.07 -0.39
## X3SsnPorch -0.05 -0.04 0.02 0.03 0.03 0.03
## ScreenPorch 0.00 -0.03 0.04 0.07 0.05 -0.05
## PoolArea 0.06 0.01 0.08 0.07 0.00 0.01
## MiscVal -0.01 -0.01 0.04 -0.03 0.07 -0.03
## MoSold 0.02 -0.01 0.00 0.07 0.00 0.01
## YrSold 0.00 -0.02 -0.01 -0.03 0.04 -0.01
## SalePrice -0.03 -0.08 0.26 0.79 -0.08 0.52
## YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## Id -0.02 -0.05 -0.01 -0.01 -0.01
## MSSubClass 0.04 0.02 -0.07 -0.07 -0.14
## LotArea 0.02 0.10 0.21 0.11 0.00
## OverallQual 0.55 0.41 0.24 -0.06 0.31
## OverallCond 0.08 -0.13 -0.04 0.04 -0.14
## YearBuilt 0.59 0.32 0.25 -0.05 0.15
## YearRemodAdd 1.00 0.18 0.13 -0.07 0.18
## MasVnrArea 0.18 1.00 0.26 -0.07 0.11
## BsmtFinSF1 0.13 0.26 1.00 -0.05 -0.50
## BsmtFinSF2 -0.07 -0.07 -0.05 1.00 -0.21
## BsmtUnfSF 0.18 0.11 -0.50 -0.21 1.00
## TotalBsmtSF 0.29 0.36 0.52 0.11 0.42
## X1stFlrSF 0.24 0.34 0.44 0.10 0.32
## X2ndFlrSF 0.14 0.17 -0.14 -0.10 0.01
## LowQualFinSF -0.06 -0.07 -0.06 0.01 0.03
## GrLivArea 0.29 0.39 0.21 -0.01 0.24
## BsmtFullBath 0.12 0.09 0.65 0.16 -0.42
## BsmtHalfBath -0.01 0.03 0.07 0.07 -0.10
## FullBath 0.44 0.28 0.06 -0.08 0.29
## HalfBath 0.18 0.20 0.00 -0.03 -0.04
## BedroomAbvGr -0.04 0.10 -0.11 -0.02 0.17
## KitchenAbvGr -0.15 -0.04 -0.09 -0.04 0.03
## TotRmsAbvGrd 0.19 0.28 0.04 -0.04 0.25
## Fireplaces 0.11 0.25 0.26 0.05 0.05
## GarageCars 0.42 0.36 0.22 -0.04 0.21
## GarageArea 0.37 0.37 0.30 -0.02 0.18
## WoodDeckSF 0.21 0.16 0.21 0.07 0.00
## OpenPorchSF 0.22 0.13 0.11 0.00 0.13
## EnclosedPorch -0.19 -0.11 -0.11 0.04 0.00
## X3SsnPorch 0.05 0.02 0.03 -0.03 0.02
## ScreenPorch -0.04 0.06 0.06 0.09 -0.01
## PoolArea 0.01 0.01 0.14 0.04 -0.04
## MiscVal -0.01 -0.03 0.00 0.00 -0.02
## MoSold 0.02 -0.01 -0.02 -0.01 0.03
## YrSold 0.04 -0.01 0.02 0.03 -0.04
## SalePrice 0.51 0.48 0.38 -0.01 0.22
## TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## Id -0.02 0.01 0.01 -0.04 0.01
## MSSubClass -0.24 -0.25 0.31 0.05 0.08
## LotArea 0.26 0.30 0.05 0.00 0.26
## OverallQual 0.54 0.48 0.30 -0.03 0.59
## OverallCond -0.17 -0.14 0.03 0.03 -0.08
## YearBuilt 0.39 0.28 0.01 -0.18 0.20
## YearRemodAdd 0.29 0.24 0.14 -0.06 0.29
## MasVnrArea 0.36 0.34 0.17 -0.07 0.39
## BsmtFinSF1 0.52 0.44 -0.14 -0.06 0.21
## BsmtFinSF2 0.11 0.10 -0.10 0.01 -0.01
## BsmtUnfSF 0.42 0.32 0.01 0.03 0.24
## TotalBsmtSF 1.00 0.82 -0.17 -0.03 0.45
## X1stFlrSF 0.82 1.00 -0.20 -0.01 0.57
## X2ndFlrSF -0.17 -0.20 1.00 0.06 0.69
## LowQualFinSF -0.03 -0.01 0.06 1.00 0.14
## GrLivArea 0.45 0.57 0.69 0.14 1.00
## BsmtFullBath 0.31 0.24 -0.17 -0.05 0.03
## BsmtHalfBath 0.00 0.00 -0.02 -0.01 -0.02
## FullBath 0.32 0.38 0.42 0.00 0.63
## HalfBath -0.05 -0.12 0.61 -0.03 0.42
## BedroomAbvGr 0.05 0.13 0.50 0.11 0.52
## KitchenAbvGr -0.08 0.06 0.06 0.01 0.10
## TotRmsAbvGrd 0.29 0.41 0.62 0.13 0.83
## Fireplaces 0.34 0.41 0.19 -0.02 0.46
## GarageCars 0.43 0.44 0.19 -0.09 0.47
## GarageArea 0.49 0.49 0.14 -0.07 0.47
## WoodDeckSF 0.23 0.24 0.09 -0.03 0.25
## OpenPorchSF 0.24 0.21 0.21 0.02 0.33
## EnclosedPorch -0.10 -0.07 0.06 0.06 0.01
## X3SsnPorch 0.04 0.06 -0.02 0.00 0.02
## ScreenPorch 0.09 0.09 0.04 0.03 0.10
## PoolArea 0.13 0.13 0.08 0.06 0.17
## MiscVal -0.02 -0.02 0.02 0.00 0.00
## MoSold 0.01 0.03 0.04 -0.02 0.05
## YrSold -0.01 -0.01 -0.03 -0.03 -0.04
## SalePrice 0.61 0.61 0.32 -0.03 0.71
## BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Id 0.00 -0.02 0.01 0.01 0.04
## MSSubClass 0.00 0.00 0.14 0.18 -0.02
## LotArea 0.16 0.05 0.12 0.02 0.12
## OverallQual 0.11 -0.04 0.55 0.27 0.11
## OverallCond -0.05 0.12 -0.19 -0.06 0.01
## YearBuilt 0.19 -0.04 0.47 0.24 -0.07
## YearRemodAdd 0.12 -0.01 0.44 0.18 -0.04
## MasVnrArea 0.09 0.03 0.28 0.20 0.10
## BsmtFinSF1 0.65 0.07 0.06 0.00 -0.11
## BsmtFinSF2 0.16 0.07 -0.08 -0.03 -0.02
## BsmtUnfSF -0.42 -0.10 0.29 -0.04 0.17
## TotalBsmtSF 0.31 0.00 0.32 -0.05 0.05
## X1stFlrSF 0.24 0.00 0.38 -0.12 0.13
## X2ndFlrSF -0.17 -0.02 0.42 0.61 0.50
## LowQualFinSF -0.05 -0.01 0.00 -0.03 0.11
## GrLivArea 0.03 -0.02 0.63 0.42 0.52
## BsmtFullBath 1.00 -0.15 -0.07 -0.03 -0.15
## BsmtHalfBath -0.15 1.00 -0.05 -0.01 0.05
## FullBath -0.07 -0.05 1.00 0.14 0.36
## HalfBath -0.03 -0.01 0.14 1.00 0.23
## BedroomAbvGr -0.15 0.05 0.36 0.23 1.00
## KitchenAbvGr -0.04 -0.04 0.13 -0.07 0.20
## TotRmsAbvGrd -0.05 -0.02 0.55 0.35 0.68
## Fireplaces 0.14 0.03 0.24 0.20 0.10
## GarageCars 0.13 -0.02 0.47 0.22 0.09
## GarageArea 0.18 -0.02 0.41 0.16 0.07
## WoodDeckSF 0.18 0.04 0.19 0.11 0.05
## OpenPorchSF 0.06 -0.02 0.26 0.20 0.10
## EnclosedPorch -0.05 -0.01 -0.12 -0.09 0.04
## X3SsnPorch 0.00 0.03 0.04 0.00 -0.02
## ScreenPorch 0.02 0.03 -0.01 0.07 0.04
## PoolArea 0.07 0.02 0.05 0.02 0.07
## MiscVal -0.02 -0.01 -0.01 0.00 0.01
## MoSold -0.02 0.03 0.06 -0.01 0.05
## YrSold 0.07 -0.05 -0.02 -0.01 -0.04
## SalePrice 0.23 -0.02 0.56 0.28 0.17
## KitchenAbvGr TotRmsAbvGrd Fireplaces GarageCars GarageArea
## Id 0.00 0.03 -0.02 0.01 0.02
## MSSubClass 0.29 0.04 -0.04 -0.04 -0.10
## LotArea -0.02 0.19 0.27 0.15 0.18
## OverallQual -0.18 0.43 0.40 0.60 0.56
## OverallCond -0.08 -0.06 -0.02 -0.18 -0.15
## YearBuilt -0.17 0.10 0.15 0.54 0.48
## YearRemodAdd -0.15 0.19 0.11 0.42 0.37
## MasVnrArea -0.04 0.28 0.25 0.36 0.37
## BsmtFinSF1 -0.09 0.04 0.26 0.22 0.30
## BsmtFinSF2 -0.04 -0.04 0.05 -0.04 -0.02
## BsmtUnfSF 0.03 0.25 0.05 0.21 0.18
## TotalBsmtSF -0.08 0.29 0.34 0.43 0.49
## X1stFlrSF 0.06 0.41 0.41 0.44 0.49
## X2ndFlrSF 0.06 0.62 0.19 0.19 0.14
## LowQualFinSF 0.01 0.13 -0.02 -0.09 -0.07
## GrLivArea 0.10 0.83 0.46 0.47 0.47
## BsmtFullBath -0.04 -0.05 0.14 0.13 0.18
## BsmtHalfBath -0.04 -0.02 0.03 -0.02 -0.02
## FullBath 0.13 0.55 0.24 0.47 0.41
## HalfBath -0.07 0.35 0.20 0.22 0.16
## BedroomAbvGr 0.20 0.68 0.10 0.09 0.07
## KitchenAbvGr 1.00 0.25 -0.13 -0.05 -0.06
## TotRmsAbvGrd 0.25 1.00 0.32 0.36 0.34
## Fireplaces -0.13 0.32 1.00 0.30 0.27
## GarageCars -0.05 0.36 0.30 1.00 0.88
## GarageArea -0.06 0.34 0.27 0.88 1.00
## WoodDeckSF -0.09 0.17 0.20 0.23 0.23
## OpenPorchSF -0.07 0.24 0.17 0.21 0.24
## EnclosedPorch 0.03 0.00 -0.03 -0.15 -0.12
## X3SsnPorch -0.02 -0.01 0.01 0.04 0.04
## ScreenPorch -0.05 0.06 0.19 0.05 0.05
## PoolArea -0.01 0.08 0.10 0.02 0.06
## MiscVal 0.06 0.02 0.00 -0.04 -0.03
## MoSold 0.03 0.04 0.05 0.04 0.03
## YrSold 0.03 -0.03 -0.02 -0.04 -0.03
## SalePrice -0.14 0.54 0.47 0.64 0.62
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch
## Id -0.03 -0.01 0.00 -0.05 0.00
## MSSubClass -0.01 -0.01 -0.01 -0.04 -0.03
## LotArea 0.17 0.09 -0.02 0.02 0.04
## OverallQual 0.24 0.30 -0.11 0.03 0.07
## OverallCond 0.00 -0.03 0.07 0.03 0.05
## YearBuilt 0.23 0.19 -0.39 0.03 -0.05
## YearRemodAdd 0.21 0.22 -0.19 0.05 -0.04
## MasVnrArea 0.16 0.13 -0.11 0.02 0.06
## BsmtFinSF1 0.21 0.11 -0.11 0.03 0.06
## BsmtFinSF2 0.07 0.00 0.04 -0.03 0.09
## BsmtUnfSF 0.00 0.13 0.00 0.02 -0.01
## TotalBsmtSF 0.23 0.24 -0.10 0.04 0.09
## X1stFlrSF 0.24 0.21 -0.07 0.06 0.09
## X2ndFlrSF 0.09 0.21 0.06 -0.02 0.04
## LowQualFinSF -0.03 0.02 0.06 0.00 0.03
## GrLivArea 0.25 0.33 0.01 0.02 0.10
## BsmtFullBath 0.18 0.06 -0.05 0.00 0.02
## BsmtHalfBath 0.04 -0.02 -0.01 0.03 0.03
## FullBath 0.19 0.26 -0.12 0.04 -0.01
## HalfBath 0.11 0.20 -0.09 0.00 0.07
## BedroomAbvGr 0.05 0.10 0.04 -0.02 0.04
## KitchenAbvGr -0.09 -0.07 0.03 -0.02 -0.05
## TotRmsAbvGrd 0.17 0.24 0.00 -0.01 0.06
## Fireplaces 0.20 0.17 -0.03 0.01 0.19
## GarageCars 0.23 0.21 -0.15 0.04 0.05
## GarageArea 0.23 0.24 -0.12 0.04 0.05
## WoodDeckSF 1.00 0.06 -0.13 -0.03 -0.07
## OpenPorchSF 0.06 1.00 -0.09 -0.01 0.08
## EnclosedPorch -0.13 -0.09 1.00 -0.04 -0.08
## X3SsnPorch -0.03 -0.01 -0.04 1.00 -0.03
## ScreenPorch -0.07 0.08 -0.08 -0.03 1.00
## PoolArea 0.07 0.06 0.05 -0.01 0.05
## MiscVal -0.01 -0.02 0.02 0.00 0.03
## MoSold 0.02 0.07 -0.03 0.03 0.02
## YrSold 0.02 -0.06 -0.01 0.02 0.01
## SalePrice 0.32 0.31 -0.13 0.05 0.11
## PoolArea MiscVal MoSold YrSold SalePrice
## Id 0.06 -0.01 0.02 0.00 -0.03
## MSSubClass 0.01 -0.01 -0.01 -0.02 -0.08
## LotArea 0.08 0.04 0.00 -0.01 0.26
## OverallQual 0.07 -0.03 0.07 -0.03 0.79
## OverallCond 0.00 0.07 0.00 0.04 -0.08
## YearBuilt 0.01 -0.03 0.01 -0.01 0.52
## YearRemodAdd 0.01 -0.01 0.02 0.04 0.51
## MasVnrArea 0.01 -0.03 -0.01 -0.01 0.48
## BsmtFinSF1 0.14 0.00 -0.02 0.02 0.38
## BsmtFinSF2 0.04 0.00 -0.01 0.03 -0.01
## BsmtUnfSF -0.04 -0.02 0.03 -0.04 0.22
## TotalBsmtSF 0.13 -0.02 0.01 -0.01 0.61
## X1stFlrSF 0.13 -0.02 0.03 -0.01 0.61
## X2ndFlrSF 0.08 0.02 0.04 -0.03 0.32
## LowQualFinSF 0.06 0.00 -0.02 -0.03 -0.03
## GrLivArea 0.17 0.00 0.05 -0.04 0.71
## BsmtFullBath 0.07 -0.02 -0.02 0.07 0.23
## BsmtHalfBath 0.02 -0.01 0.03 -0.05 -0.02
## FullBath 0.05 -0.01 0.06 -0.02 0.56
## HalfBath 0.02 0.00 -0.01 -0.01 0.28
## BedroomAbvGr 0.07 0.01 0.05 -0.04 0.17
## KitchenAbvGr -0.01 0.06 0.03 0.03 -0.14
## TotRmsAbvGrd 0.08 0.02 0.04 -0.03 0.54
## Fireplaces 0.10 0.00 0.05 -0.02 0.47
## GarageCars 0.02 -0.04 0.04 -0.04 0.64
## GarageArea 0.06 -0.03 0.03 -0.03 0.62
## WoodDeckSF 0.07 -0.01 0.02 0.02 0.32
## OpenPorchSF 0.06 -0.02 0.07 -0.06 0.31
## EnclosedPorch 0.05 0.02 -0.03 -0.01 -0.13
## X3SsnPorch -0.01 0.00 0.03 0.02 0.05
## ScreenPorch 0.05 0.03 0.02 0.01 0.11
## PoolArea 1.00 0.03 -0.03 -0.06 0.09
## MiscVal 0.03 1.00 -0.01 0.00 -0.02
## MoSold -0.03 -0.01 1.00 -0.15 0.05
## YrSold -0.06 0.00 -0.15 1.00 -0.03
## SalePrice 0.09 -0.02 0.05 -0.03 1.00
corrplot(updated_home_cor, method = "circle")
The above correlation plot is a little cluttered. Let’s take a closer look at each independent variable’s correlation with SalePrice.
updated_home_cor_df <- as.data.frame(updated_home_cor)
SalePrice_cor <- updated_home_cor_df$SalePrice
SalePrice_cor_names <- colnames(updated_home_cor_df)
SalePrice_cor_df <- data.frame(SalePrice_cor_names, SalePrice_cor)
SalePrice_cor_df <- SalePrice_cor_df[order(SalePrice_cor_df$SalePrice_cor, decreasing = TRUE),]
SalePrice_cor_df
## SalePrice_cor_names SalePrice_cor
## 36 SalePrice 1.00000000
## 4 OverallQual 0.78999714
## 16 GrLivArea 0.71007968
## 25 GarageCars 0.63968574
## 26 GarageArea 0.62249171
## 12 TotalBsmtSF 0.61297093
## 13 X1stFlrSF 0.60684938
## 19 FullBath 0.56249126
## 23 TotRmsAbvGrd 0.53631085
## 6 YearBuilt 0.52289616
## 7 YearRemodAdd 0.50715781
## 8 MasVnrArea 0.47749305
## 24 Fireplaces 0.46893028
## 9 BsmtFinSF1 0.38397695
## 27 WoodDeckSF 0.32465013
## 14 X2ndFlrSF 0.32270976
## 28 OpenPorchSF 0.31126782
## 20 HalfBath 0.28204017
## 3 LotArea 0.26467400
## 17 BsmtFullBath 0.22502749
## 11 BsmtUnfSF 0.21573973
## 21 BedroomAbvGr 0.17193419
## 31 ScreenPorch 0.11304390
## 32 PoolArea 0.09310862
## 30 X3SsnPorch 0.04524719
## 34 MoSold 0.04513617
## 10 BsmtFinSF2 -0.01031642
## 18 BsmtHalfBath -0.01599287
## 33 MiscVal -0.02095080
## 15 LowQualFinSF -0.02526275
## 1 Id -0.02534261
## 35 YrSold -0.02618020
## 5 OverallCond -0.07629370
## 2 MSSubClass -0.08281336
## 29 EnclosedPorch -0.12877809
## 22 KitchenAbvGr -0.13741923
So it appears that the top ten correlated variables (excluding SalePrice) are: OverallQual, GrLivArea, GarageCars, GarageArea, TotalBsmtSF, X1stFlrSF, FullBath, TotRmsAbvGrd, YearBuilt, and YearRemodAdd.
The above items seem to make sense. The overall quality of the home, the size of the home, the year it was built, and so forth seem like they would correlate with the price of a home. However, as you may have noticed, some independent variables likely correlate with each other; for instance, GarageCars and GarageArea: the more cars a garage holds, the larger its area is likely to be. When we build our linear regression model, we may be able to peel off some of these variables; a quick collinearity check follows below.
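As a rough collinearity screen (a sketch, not part of the original submission), we can reuse the precision-matrix identity from the earlier section to compute VIFs among these top ten predictors; values well above roughly 5 are commonly taken to flag collinearity.
# VIFs via the diagonal of the inverted correlation matrix
top_vars <- c("OverallQual", "GrLivArea", "GarageCars", "GarageArea", "TotalBsmtSF",
              "X1stFlrSF", "FullBath", "TotRmsAbvGrd", "YearBuilt", "YearRemodAdd")
vifs <- diag(solve(cor(updated_data[, top_vars], use = "complete.obs")))
round(sort(vifs, decreasing = TRUE), 2)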
Before we start building the model, let’s go back to SalePrice. As we saw, its histogram has quite a positive skew. All of the data points are positive (homes really should not sell for negative dollars), the mean is relatively low, and the variance is large.
print(paste0("SalePrice Mean: ", round(mean(updated_data$SalePrice),2)))## [1] "SalePrice Mean: 180921.2"
print(paste0("SalePrice Variance: ", round(sd(updated_data$SalePrice) ** 2)))## [1] "SalePrice Variance: 6311111264"
print(paste0("Are all data points of Sale Price positive? ", min(updated_data$SalePrice) > 0))## [1] "Are all data points of Sale Price positive? TRUE"
Therefore, this data may be a good fit for a logarithmic transformation.
fit <- fitdistr(updated_data$SalePrice, densfun="log-normal")
fit
## meanlog sdlog
## 12.024050901 0.399315046
## ( 0.010450552) ( 0.007389656)
SalePrice_log <- log(updated_data$SalePrice)
bootstrap_log <- sample(SalePrice_log, size = 1000, replace = TRUE, prob = NULL)
bootstrap_normal <- sample(data$SalePrice, size = 1000, replace = TRUE, prob = NULL)
par(mfrow=c(1,2))
plot.legend <- c("normal", "lognormal")
hist(bootstrap_normal, col ="lightblue")
hist(bootstrap_log, col = "lightgreen")
With the log transformation, the SalePrice data appears much closer to normal. Let’s add this new column, SalePrice_log, to the dataset.
data_log <- cbind(updated_data, SalePrice_log)
cols.dont.want <- "SalePrice"
data_log <- data_log[, ! names(data_log) %in% cols.dont.want, drop = TRUE]
# Let us remove any rows with NAs in their datapoints.
data_log <- na.omit(data_log)
Now that we have a better general idea of what is going on with the data, let’s partition it into training and testing sets. For this prediction, we will use only the numeric columns (not the categorical data columns).
data_log <- data_log[sapply(data_log, is.numeric)]
rows <- nrow(data_log)
f <- .6 # f is the arbitrary % of data used for training
upper_bound <- floor(f * rows)
permuted_data <- data_log[sample(rows),]
train.dat <- permuted_data[1:upper_bound,]
test.dat <- permuted_data[(upper_bound+1):rows,]
Now that we have our training and testing data (with SalePrice_log as our dependent variable), we can now build the linear regression model.
f <- reformulate(setdiff(colnames(train.dat), "SalePrice_log"), response = "SalePrice_log")
home_log_lm <- lm(f, data = train.dat)
summary(home_log_lm)
##
## Call:
## lm(formula = f, data = train.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.42541 -0.07200 0.00441 0.08253 0.64685
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.576e+01 8.667e+00 1.818 0.069456 .
## Id -1.326e-05 1.312e-05 -1.011 0.312446
## MSSubClass -8.490e-04 1.557e-04 -5.454 6.53e-08 ***
## LotArea 1.848e-06 5.492e-07 3.366 0.000800 ***
## OverallQual 9.451e-02 7.351e-03 12.856 < 2e-16 ***
## OverallCond 4.775e-02 6.337e-03 7.535 1.31e-13 ***
## YearBuilt 2.809e-03 3.735e-04 7.520 1.45e-13 ***
## YearRemodAdd 9.184e-04 4.007e-04 2.292 0.022163 *
## MasVnrArea 3.425e-05 3.516e-05 0.974 0.330327
## BsmtFinSF1 -1.200e-05 3.491e-05 -0.344 0.731140
## BsmtFinSF2 2.522e-06 4.779e-05 0.053 0.957933
## BsmtUnfSF -4.094e-05 3.424e-05 -1.196 0.232105
## TotalBsmtSF NA NA NA NA
## X1stFlrSF 2.022e-04 4.047e-05 4.997 7.13e-07 ***
## X2ndFlrSF 1.089e-04 3.019e-05 3.606 0.000329 ***
## LowQualFinSF 1.248e-04 1.450e-04 0.861 0.389621
## GrLivArea NA NA NA NA
## BsmtFullBath 5.553e-02 1.598e-02 3.475 0.000539 ***
## BsmtHalfBath -6.942e-03 2.534e-02 -0.274 0.784172
## FullBath 5.207e-02 1.695e-02 3.071 0.002202 **
## HalfBath 2.221e-02 1.630e-02 1.362 0.173443
## BedroomAbvGr -7.450e-04 1.021e-02 -0.073 0.941824
## KitchenAbvGr 1.783e-02 3.535e-02 0.505 0.614011
## TotRmsAbvGrd 1.849e-02 7.498e-03 2.466 0.013855 *
## Fireplaces 5.221e-02 1.077e-02 4.847 1.50e-06 ***
## GarageCars 8.871e-02 1.738e-02 5.103 4.16e-07 ***
## GarageArea -3.760e-05 5.791e-05 -0.649 0.516359
## WoodDeckSF 1.466e-04 4.727e-05 3.101 0.001995 **
## OpenPorchSF 5.318e-05 9.626e-05 0.552 0.580790
## EnclosedPorch 1.513e-04 1.098e-04 1.378 0.168567
## X3SsnPorch 3.025e-04 1.960e-04 1.544 0.123016
## ScreenPorch 3.883e-04 1.103e-04 3.521 0.000454 ***
## PoolArea -5.639e-04 1.534e-04 -3.675 0.000253 ***
## MiscVal -4.539e-05 4.055e-05 -1.119 0.263349
## MoSold 6.982e-06 2.103e-03 0.003 0.997352
## YrSold -6.270e-03 4.316e-03 -1.453 0.146662
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1595 on 813 degrees of freedom
## Multiple R-squared: 0.8475, Adjusted R-squared: 0.8413
## F-statistic: 136.9 on 33 and 813 DF, p-value: < 2.2e-16
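The two NA rows in the coefficient table are the “singularities” lm() warns about: in this dataset TotalBsmtSF and GrLivArea are exact sums of other predictors, so they carry no additional information and are dropped. A quick check (a sketch; both lines should print TRUE if the identities hold in this data):
with(train.dat, all(TotalBsmtSF == BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF))
with(train.dat, all(GrLivArea == X1stFlrSF + X2ndFlrSF + LowQualFinSF))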
So we have a model with an adjusted R-squared of about 0.84 and a large, highly significant F-statistic (the exact values will vary with the random train/test split). However, quite a few of the individual predictors do not appear to be statistically significant. Let’s see if we can trim them by working backwards with a stepwise selection.
Following the paper “Variable selection with stepwise and best subset approaches” by Zhongheng Zhang, we will use the stepAIC() function from the MASS package.
step <- stepAIC(home_log_lm, trace = FALSE)
print("Final Model")## [1] "Final Model"
final_model_df <- step$model
final_colnames <- colnames(final_model_df)
head(final_model_df)## SalePrice_log MSSubClass LotArea OverallQual OverallCond YearBuilt
## 936 11.28853 30 5825 4 5 1926
## 547 12.25486 50 8737 6 7 1923
## 1157 12.10016 80 9350 5 8 1965
## 386 12.16525 120 3182 8 5 2004
## 152 12.82773 20 13891 8 5 2007
## 111 11.82701 50 9525 6 4 1954
## YearRemodAdd BsmtUnfSF X1stFlrSF X2ndFlrSF BsmtFullBath FullBath
## 936 1953 600 747 0 0 1
## 547 1950 765 915 720 0 1
## 1157 1999 586 1265 0 0 2
## 386 2005 1232 1269 0 0 2
## 152 2008 310 1710 0 1 2
## 111 1972 550 1216 639 0 2
## HalfBath TotRmsAbvGrd Fireplaces GarageCars WoodDeckSF X3SsnPorch
## 936 0 5 0 2 0 0
## 547 1 6 1 2 0 144
## 1157 0 6 1 2 0 96
## 386 0 6 1 2 146 0
## 152 0 6 1 3 0 0
## 111 0 7 0 1 182 0
## ScreenPorch PoolArea YrSold
## 936 0 0 2006
## 547 0 0 2007
## 1157 0 0 2008
## 386 144 0 2010
## 152 0 0 2008
## 111 0 0 2006
print(paste0("How many columns are there? ", ncol(final_model_df)))## [1] "How many columns are there? 21"
print("What are the columns? ") ## [1] "What are the columns? "
print(final_colnames)## [1] "SalePrice_log" "MSSubClass" "LotArea" "OverallQual"
## [5] "OverallCond" "YearBuilt" "YearRemodAdd" "BsmtUnfSF"
## [9] "X1stFlrSF" "X2ndFlrSF" "BsmtFullBath" "FullBath"
## [13] "HalfBath" "TotRmsAbvGrd" "Fireplaces" "GarageCars"
## [17] "WoodDeckSF" "X3SsnPorch" "ScreenPorch" "PoolArea"
## [21] "YrSold"
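Since stepAIC() selects by Akaike’s information criterion, \(AIC = 2k - 2\ln L\), we can confirm that the reduced model scores no worse than the full one (a sketch; stepAIC() returns the refitted lm object, so AIC() applies to it directly):
AIC(home_log_lm)  # full model
AIC(step)         # reduced model; should be lower (better)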
With this stepwise elimination, most of the ten variables most strongly correlated with SalePrice were retained in the final model (though interestingly, not all of them made it). Let’s refit the linear regression on just these selected variables.
final_lm <- lm(SalePrice_log ~ . , data = final_model_df)
summary(final_lm)##
## Call:
## lm(formula = SalePrice_log ~ ., data = final_model_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.47201 -0.07474 0.00468 0.08297 0.66575
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.631e+01 8.432e+00 1.935 0.053381 .
## MSSubClass -8.223e-04 1.407e-04 -5.844 7.35e-09 ***
## LotArea 1.746e-06 5.374e-07 3.249 0.001205 **
## OverallQual 9.564e-02 6.801e-03 14.063 < 2e-16 ***
## OverallCond 4.606e-02 6.148e-03 7.492 1.74e-13 ***
## YearBuilt 2.610e-03 3.336e-04 7.824 1.56e-14 ***
## YearRemodAdd 9.656e-04 3.867e-04 2.497 0.012713 *
## BsmtUnfSF -3.026e-05 1.737e-05 -1.742 0.081834 .
## X1stFlrSF 1.983e-04 2.690e-05 7.372 4.09e-13 ***
## X2ndFlrSF 1.101e-04 2.827e-05 3.893 0.000107 ***
## BsmtFullBath 5.654e-02 1.433e-02 3.946 8.61e-05 ***
## FullBath 5.552e-02 1.634e-02 3.398 0.000710 ***
## HalfBath 2.539e-02 1.599e-02 1.588 0.112737
## TotRmsAbvGrd 1.896e-02 6.321e-03 2.999 0.002786 **
## Fireplaces 5.121e-02 1.039e-02 4.927 1.01e-06 ***
## GarageCars 7.962e-02 1.025e-02 7.770 2.32e-14 ***
## WoodDeckSF 1.403e-04 4.576e-05 3.065 0.002245 **
## X3SsnPorch 2.962e-04 1.945e-04 1.523 0.128119
## ScreenPorch 3.691e-04 1.086e-04 3.398 0.000711 ***
## PoolArea -5.792e-04 1.511e-04 -3.834 0.000135 ***
## YrSold -6.399e-03 4.195e-03 -1.525 0.127570
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1589 on 826 degrees of freedom
## Multiple R-squared: 0.8463, Adjusted R-squared: 0.8426
## F-statistic: 227.5 on 20 and 826 DF, p-value: < 2.2e-16
layout(matrix(c(1,2,3,4),2,2))
plot(final_lm)
So this is the final model. Note that its adjusted R-squared (0.8426) is essentially the same as the full model’s (0.8413) despite using far fewer predictors. The ‘Residuals vs Fitted’ plot appears fairly homoscedastic, and the Q-Q plot follows the Q-Q line except at the tails.
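That visual impression of constant variance can be complemented with a formal test (a sketch; it requires the lmtest package, which is not used elsewhere in this project). The Breusch-Pagan test’s null hypothesis is homoscedastic errors, so a large p-value is reassuring:
library(lmtest)
bptest(final_lm)  # H0: constant error variance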
Let’s see how this model compares with the testing dataset.
# Building the prediction
predicted.dat <- predict(final_lm, newdata = test.dat)
delta <- predicted.dat - test.dat$SalePrice_log
# Performing a T-test
t.test(delta, conf.level = 0.95)##
## One Sample t-test
##
## data: delta
## t = -1.5175, df = 564, p-value = 0.1297
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.019686741 0.002525584
## sample estimates:
## mean of x
## -0.008580579
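The t-test above speaks to bias; as a complementary accuracy measure we can compute the test-set RMSE on the log scale (a sketch; this is the same scale Kaggle uses to score this competition):
sqrt(mean(delta^2))  # root mean squared error of the log-price predictions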
# Plotting delta
plot(delta)
The scatter of delta around zero suggests the model predicts well: there are a few outliers, but most of the errors sit close to 0. Let’s look at the predictions against the actual values, and then at the residuals.
# Plotting prediction vs. actual
plot(predicted.dat, test.dat$SalePrice_log, xlab = "Predicted Log Price", ylab = "Actual Log Price", main = "Predicted vs. Actual (Log Prices)")
# Residuals
hist(resid(final_lm), col = "red", xlab = "Residuals", main = "Histogram of Residuals")
qqnorm(resid(final_lm))
qqline(resid(final_lm))
plot(fitted(final_lm), resid(final_lm), main = "Fitted vs. Residuals")
abline(h=0, col="red")
Overall, the model looks promising. We do need to remember that these predictions are on the log scale, not in actual dollars, and account for that when we make our final predictions.
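One refinement worth noting (a sketch, not applied to the submission below): exponentiating a log-scale prediction recovers the conditional median of the dollar price rather than the mean, which slightly understates average prices. Duan’s smearing estimator is a standard correction:
# Duan's smearing factor: the average of the exponentiated residuals
# (hypothetical refinement; not used for the Kaggle submission)
smear <- mean(exp(resid(final_lm)))
# exp(predicted.dat) * smear   # mean-unbiased dollar-scale predictions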
References: Zhang Z., “Variable selection with stepwise and best subset approaches” (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4842399/pdf/atm-04-07-136.pdf); https://www.statmethods.net/stats/regression.html
Let’s see if we can use this model to predict on the test dataset provided by Kaggle.com.
test_url <- url('https://raw.githubusercontent.com/jcp9010/MSDA/master/CUNY%20605/Final%20Project/test.csv')
test_df <- read.csv(test_url, header = TRUE, stringsAsFactors = FALSE)
predicted_prices <- predict(final_lm, test_df)
print("Predicted Log Prices")## [1] "Predicted Log Prices"
head(predicted_prices)## 1 2 3 4 5 6
## 11.68407 11.89091 12.02269 12.19141 12.11989 12.07794
We have the predicted log prices. Let’s exponentiate them (raise \(e\) to each prediction) to convert back to dollars.
# In dollars: exp() is vectorized, so it converts every prediction at once
predicted_prices <- exp(predicted_prices)
head(predicted_prices)## 1 2 3 4 5 6
## 118666.2 145934.3 166490.5 197089.6 183485.3 175947.9
Before we create the .csv file, we should check if there are any NA values in the predictions.
print(paste0("Are there NA values in the predictions? ", any(is.na(predicted_prices))))## [1] "Are there NA values in the predictions? TRUE"
print(paste0("How many NA values are there? ", sum(is.na(predicted_prices))))## [1] "How many NA values are there? 3"
Here we take the simpler route for the missing values: replace the NA predictions with the mean of the predicted prices so that we can still submit to Kaggle.
predicted_prices[is.na(predicted_prices)] <- mean(predicted_prices, na.rm = TRUE)
Now that we have our price predictions in dollars, let’s create a CSV file to submit to Kaggle.com. Kaggle requires the format Id, SalePrice.
Id <- test_df$Id
SalePrice <- predicted_prices
submission <- data.frame(Id, SalePrice)
head(submission)## Id SalePrice
## 1 1461 118666.2
## 2 1462 145934.3
## 3 1463 166490.5
## 4 1464 197089.6
## 5 1465 183485.3
## 6 1466 175947.9
Outputting a .csv file can be done via:
# write.csv(submission, file = "submission.csv", row.names=FALSE)
# This is commented out so that it doesn't output a new .csv file on every run
This submission received a Kaggle score of 0.14033.
This concludes my final project on linear regression and housing prices.