Read Data


Our data is publicly available as the Ames Housing Data, but we’ve collected it from the Kaggle competition House Prices Advanced Regression Techniques to use in part 3 of the final project. Here we’re downloading it from our GitHub account.

The target variable is SalePrice; there are 79 predictors and one ID column, with 1,460 rows in our training data. We don’t need the test data until part 3 of the final project.

datalocation_train <- 'https://raw.githubusercontent.com/pkofy/DATA605/main/Final%20Project/train.csv'
train <- read.csv(file=datalocation_train)
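
As a quick sanity check (not part of the original output), dim() should confirm the row and column counts described above:

# Confirm the number of rows and columns in the training data
dim(train)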



Descriptive and Inferential Statistics


Assignment description:

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?


Descriptive Statistics

Univariate descriptive statistics describe a single variable in a data set. We’ll use summary() and sd() to get a variety of statistics for each variable.

summary(train$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000
summary(train$LotArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7554    9478   10517   11602  215245
summary(train$OverallQual)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   5.000   6.000   6.099   7.000  10.000
sd(train$SalePrice)
## [1] 79442.5
sd(train$LotArea)
## [1] 9981.265
sd(train$OverallQual)
## [1] 1.382997
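
If we wanted these statistics for several variables at once, a compact sketch is to apply sd() (or summary()) across a set of columns with sapply():

# Standard deviations for several numeric predictors in one call
sapply(train[c("SalePrice", "LotArea", "OverallQual")], sd)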


Appropriate Plots

There are multiple plots to choose from to describe a variable in our dataset. As we’ll see below, when we take the log of SalePrice the histogram shifts into a more normally distributed curve.

Code Summary
+ Histogram and DotChart
+ Density Plot
+ Box Plot
+ Box Plot Transformed
+ Histogram Transformed


Histogram and DotChart

We’ve selected a combination of histogram and dot chart to describe SalePrice from different angles.

hist(train$SalePrice)

dotchart(train$SalePrice)


Density Plot

We’ve also used a density plot to describe OverallQual.

plot(density(train$OverallQual))


Box Plot

When we run a box plot on GrLivArea, it’s hard to read, with a significant number of values falling outside the interquartile range.

boxplot(train$GrLivArea)


Box Plot Transformed

But when we generate a box plot on the log of GrLivArea it looks like a normally distributed variable.

It looks like running box plots on every variable to identify ones that should be non-linearly transformed was an untapped strategy for the Kaggle competition; a sketch of that idea follows the plot below.

boxplot(log(train$GrLivArea))
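
A sketch of that idea, assuming we restrict ourselves to strictly positive numeric columns so the log is defined: loop over the predictors and draw each box plot next to its log-transformed version.

# Sketch: compare each strictly positive numeric predictor to its log transform
# (produces one pair of box plots per variable)
num_vars <- names(train)[sapply(train, is.numeric)]
pos_vars <- num_vars[sapply(train[num_vars], function(x) all(x > 0, na.rm = TRUE))]
op <- par(mfrow = c(1, 2))   # two panels: original and log-transformed
for (v in pos_vars) {
  boxplot(train[[v]], main = v)
  boxplot(log(train[[v]]), main = paste("log", v))
}
par(op)                      # restore plotting settings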


Histogram Transformed

Compare these plots of log-transformed SalePrice to the originals in the “Histogram and DotChart” section to see how much more normally distributed SalePrice is now.

hist(log(train$SalePrice))

dotchart(log(train$SalePrice))


Scatterplot Matrix

A scatterplot matrix lets us visualize the relationships between multiple variables in our data. Here we take a subset of the variables we ultimately used in our final model for the Kaggle contest submission, removing RoofMatl because it is non-numeric. We’re primarily interested in the top row, where SalePrice is on the y-axis and our various predictors are on the x-axis.

pairs(train[c("SalePrice", "LotArea", "OverallQual", "OverallCond", "YearBuilt", "GrLivArea")], gap=.25)


Correlation Matrix

A correlation matrix is the numeric counterpart of our scatterplot matrix, displaying the correlations between the variables. The more linear a scatterplot, the closer the corresponding cell in the correlation matrix is to 1 (or -1), with weaker correlation being closer to 0. I ran the correlation matrix for all of the variables because it’s instructive, but let’s look at the two most and least correlated with SalePrice: OverallQual and OverallCond, respectively.

cor(train[c("SalePrice", "LotArea", "OverallQual", "OverallCond", "YearBuilt", "GrLivArea")])
##               SalePrice     LotArea OverallQual OverallCond   YearBuilt
## SalePrice    1.00000000  0.26384335  0.79098160 -0.07785589  0.52289733
## LotArea      0.26384335  1.00000000  0.10580574 -0.00563627  0.01422765
## OverallQual  0.79098160  0.10580574  1.00000000 -0.09193234  0.57232277
## OverallCond -0.07785589 -0.00563627 -0.09193234  1.00000000 -0.37598320
## YearBuilt    0.52289733  0.01422765  0.57232277 -0.37598320  1.00000000
## GrLivArea    0.70862448  0.26311617  0.59300743 -0.07968587  0.19900971
##               GrLivArea
## SalePrice    0.70862448
## LotArea      0.26311617
## OverallQual  0.59300743
## OverallCond -0.07968587
## YearBuilt    0.19900971
## GrLivArea    1.00000000
correlation_matrix <- cor(train[c("SalePrice", "OverallQual", "OverallCond")])
correlation_matrix
##               SalePrice OverallQual OverallCond
## SalePrice    1.00000000  0.79098160 -0.07785589
## OverallQual  0.79098160  1.00000000 -0.09193234
## OverallCond -0.07785589 -0.09193234  1.00000000


Hypothesis Testing

We’re going to test, with 80% confidence, whether our correlations are zero.

If the p-value is less than our threshold (usually 5%), then we can reject the null hypothesis that the tested correlation is zero, and our conclusion is that there is some meaningful correlation.

If the p-value is higher than our threshold, we can’t reject the null hypothesis that the correlation is zero, and our conclusion is that there is “not enough evidence to suggest a significant correlation”.

Looking at the output below, neither 80% confidence interval contains zero and both p-values are well below 5%, so both correlations are statistically significant. Even the weakest correlation is still found to be meaningful. I wonder if that’s because, for less valuable properties, sellers put more effort into fixing them up before sale? Otherwise, why would Overall Condition be negatively correlated with Sale Price?

cor.test(train$SalePrice, train$OverallQual, conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train$SalePrice and train$OverallQual
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.7780752 0.8032204
## sample estimates:
##       cor 
## 0.7909816
cor.test(train$SalePrice, train$OverallCond, conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train$SalePrice and train$OverallCond
## t = -2.9819, df = 1458, p-value = 0.002912
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  -0.1111272 -0.0444103
## sample estimates:
##         cor 
## -0.07785589


Analysis

We learned a lot from this exercise. One lesson is that by using histograms, dot charts, and box plots we can identify variables that should perhaps be non-linearly transformed before regression modeling. Another is that scatterplot and correlation matrices are helpful but not definitive when deciding whether there is correlation.

If I were to do my regression model in Part 3 over again, I would spend more time assessing whether variables can be non-linearly transformed to be more normally distributed before regression modeling, and use scatterplot matrices only to get a sense of the data, not to look for variables to exclude. (While the scatterplot matrices did help identify variables with lots of missing values, we have other methods for doing the same thing.)

We’ve already identified an additional variable that we should have transformed for the model by taking the log of GrLivArea.


Family-wise Error

Initially I assumed family-wise error was when a variable in your dataset partially encodes other variables in your dataset. For example, the number of bathrooms could partially encode the total square footage of the house and so be partially redundant.

However, family-wise error is actually the increased chance of making a Type 1 (false positive) error when you run multiple hypothesis tests at once.

Basically, any single test has some chance of concluding that there is correlation between the variables when there isn’t. If you do one test you can, within your level of confidence, be assured that the chance of a Type 1 error (false positive) is small; however, if you run a batch of hypothesis tests, the chance that at least one of the correlations identified is spurious grows with the number of tests in the batch.

I’m not worried about it in our case because we ran only a couple of tests here and have already corroborated these variables as having predictive value for the sale price; if we were concerned, a standard correction is sketched below. However, I am worried about redundancy/high correlation/multicollinearity among the predictors used in my ultimate model. A step for future analysis would be additional triage: refit the final model minus each predictor in turn and compare those models. My metaphor is, if I can build a stool with four legs, why do I need five legs? (But the floor is not level!)
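
If we did want to guard against family-wise error, a standard approach (a sketch, not something we ran for the project) is to collect the pairwise p-values and adjust them with p.adjust(), using a Bonferroni or Holm correction:

# Sketch: pairwise correlation tests among the three variables,
# with a Holm adjustment to control family-wise error
vars <- c("SalePrice", "OverallQual", "OverallCond")
var_pairs <- combn(vars, 2, simplify = FALSE)
pvals <- sapply(var_pairs, function(p) cor.test(train[[p[1]]], train[[p[2]]])$p.value)
names(pvals) <- sapply(var_pairs, paste, collapse = " vs ")
p.adjust(pvals, method = "holm")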




Linear Algebra and Correlation


Assignment description:

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.


Matrix Inversion

We can solve for the inverse of our correlation matrix directly in R with solve().


Correlation Matrix

Here is a reprint of our earlier correlation matrix:

correlation_matrix
##               SalePrice OverallQual OverallCond
## SalePrice    1.00000000  0.79098160 -0.07785589
## OverallQual  0.79098160  1.00000000 -0.09193234
## OverallCond -0.07785589 -0.09193234  1.00000000


Precision Matrix

Here is our precision matrix, obtained by solving for the inverse of the correlation matrix. I read a little about precision matrices, variance inflation factors, regression coefficients, and collinearity; it’s something to dig into later.

# Here is the calculation of our precision matrix, the inverse of our correlation matrix
precision_matrix <- solve(correlation_matrix)
precision_matrix
##               SalePrice OverallQual OverallCond
## SalePrice    2.67150050 -2.11183483  0.01384614
## OverallQual -2.11183483  2.67793985  0.08177049
## OverallCond  0.01384614  0.08177049  1.00859536
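
Since the assignment notes that the diagonal of the precision matrix contains the variance inflation factors, we can pull those out directly (a small addition for reference):

# Variance inflation factors: the diagonal of the precision matrix
diag(precision_matrix)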


CorMX x PreMX

Here we multiply the correlation matrix by the precision matrix. The order of matrix multiplication matters, so I assume we have the right conventional order here; however, in this case the two matrices are inverses of each other, so order doesn’t matter and we get the identity matrix. While the matrix below isn’t exactly the identity matrix, if you were to round the elements to the nearest quadrillionth it would be, so we’re close enough.

correlation_matrix %*% precision_matrix
##                 SalePrice   OverallQual  OverallCond
## SalePrice    1.000000e+00 -2.137437e-16 1.287709e-18
## OverallQual  1.097375e-16  1.000000e+00 4.762427e-18
## OverallCond -1.734723e-17  4.163336e-17 1.000000e+00


PreMX x CorMX

Here we multiply the precision matrix by the correlation matrix. Note again that this is the identity matrix, with tiny rounding differences of less than a quadrillionth each.

precision_matrix %*% correlation_matrix
##                 SalePrice  OverallQual   OverallCond
## SalePrice    1.000000e+00 1.094186e-16 -1.387779e-17
## OverallQual -2.137437e-16 1.000000e+00  4.163336e-17
## OverallCond  1.287709e-18 4.762427e-18  1.000000e+00


LU Decomposition

I have a note! From one of the very first weeks of class where the professor said to remember how to do LU Decomposition and that it could be done easily in R with the pracma library.

LU decomposition factors an nxn matrix into two triangular matrices: a (L)ower triangular matrix, with all zeros above the diagonal, and an (U)pper triangular matrix, with all zeros below the diagonal, such that multiplying the (L)ower by the (U)pper gives back the original nxn matrix.

To do it by hand, perform Gaussian elimination on the nxn matrix until it becomes the (U)pper triangular matrix, recording the multiplier used at each elimination step; placing those multipliers below the diagonal of an identity matrix of size n (which keeps ones on its diagonal) gives the (L)ower triangular matrix.

I didn’t find any reasons for performing LU decomposition on a correlation matrix; however, there’s a lot of information online about performing LU decomposition on precision matrices. Most of the reasons involve computational efficiency/stability when using precision matrices to determine conditional dependencies between the variables in your model or to select variables to include in your model. These topics were outside the scope of our class, but this will stay as a placeholder for me to learn more about how precision matrices relate to LASSO and ridge regression in the future.

We can perform LU Decomposition simply in R using the pracma library (Thank you, Doctor Larry):

library("pracma")
LU <- lu(precision_matrix)
LU$L
##                SalePrice OverallQual OverallCond
## SalePrice    1.000000000  0.00000000           0
## OverallQual -0.790505124  1.00000000           0
## OverallCond  0.005182906  0.09193234           1
LU$U
##             SalePrice OverallQual OverallCond
## SalePrice    2.671501   -2.111835  0.01384614
## OverallQual  0.000000    1.008524  0.09271594
## OverallCond  0.000000    0.000000  1.00000000
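
As a quick check of the factorization (an extra step, not in the assignment), multiplying L by U should reproduce the precision matrix up to floating-point error:

# Verify the decomposition: L %*% U should equal the precision matrix
max(abs(LU$L %*% LU$U - precision_matrix))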



Calculus-Based Probability & Statistics


Assignment description:

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.


Find Skewness

Looking at our variables under consideration we know that SalePrice is skewed right (long tail to the right) from the original histogram we generated. Using the skewness function from the moments package we can calculate skewness for each of our variables (positive skewness is skewed right). LotArea has the highest right-skewness but we know that’s primarily due to one giant lot so we’re going to proceed with SalePrice.

#install.packages("moments")
library("moments")
skewness(train$SalePrice)
## [1] 1.880941
skewness(train$LotArea)
## [1] 12.19514
skewness(train$OverallQual)
## [1] 0.216721
skewness(train$OverallCond)
## [1] 0.6923552
skewness(train$YearBuilt)
## [1] -0.6128307
skewness(train$GrLivArea)
## [1] 1.365156


Shift Above Zero

Since this is sale price data it should never be negative, so we don’t need to shift it so the minimum value is above zero, but let’s do a quick range check to be sure. Yep, the lowest Sale Price is $34,900.

range(train$SalePrice)
## [1]  34900 755000


MASS package

Here we load the MASS package and fit an exponential probability density to our data.

library(MASS)

sp_expfit <- fitdistr(train$SalePrice, "exponential")
sp_expfit
##        rate    
##   5.527268e-06 
##  (1.446552e-07)


Generated Data with Same Lambda

Here we generate data with the exponential distribution using the lambda identified when we fit an exponential distribution to our actual data.

lambda <- sp_expfit$estimate
set.seed(713)
gendata <- rexp(1000, rate = lambda)
lambda
##         rate 
## 5.527268e-06
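
As a sanity check (an aside), the maximum-likelihood estimate of the exponential rate is simply one over the sample mean, so the fitted lambda can be confirmed directly:

# The exponential MLE of the rate is 1 / mean(x); this should match sp_expfit
1 / mean(train$SalePrice)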


Histograms

Here we compare our actual data to data generated from an exponential distribution using the lambda fit to our actual data.

It looks like the exponential distribution is a reasonable proxy for the right tail. Maybe we could split the data on the median, perform this analysis on the right tail, perform it a second time on the mirrored left tail, and then combine the two.

However, remember that lambda is usually interpreted as a rate of events per interval of time, so here I assume each data point is an event and the interval is increasing dollars… (already an indication that the exponential might not be the right distribution to fit to our data).

hist(train$SalePrice, main = "Original Data", col = "yellowgreen")

hist(gendata, main = "Generated Data", col = "aliceblue")


Percentiles

Here we compare the 5th and 95th percentiles of the fitted exponential distribution with the 5th and 95th percentiles of our empirical data, along with a confidence interval from the empirical data, and draw conclusions.


5th & 95th Theoretical

Here we find the theoretical 5th and 95th percentiles of the fitted exponential distribution.

qexp(0.05, rate = lambda)
## [1] 9280.044
qexp(0.95, rate = lambda)
## [1] 541991.5
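
For reference (an addition), the exponential CDF is \(F(x) = 1 - e^{-\lambda x}\), so the same percentiles can be recovered by inverting it by hand, which should agree with qexp():

# Invert the exponential CDF by hand: x_p = -log(1 - p) / lambda
-log(1 - 0.05) / lambda
-log(1 - 0.95) / lambda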


Confidence Interval

Instructions: Also generate a 95% confidence interval from the empirical data, assuming normality.

I interpret this question a bit differently. Normally a confidence interval would be for something like the mean: given the sample we have, what are the highest and lowest values the true population mean could plausibly take, at a given level of confidence.

I assumed this question would relate to the 5th and 95th percentiles: we could calculate a confidence interval for the true population’s 5th percentile at a 95% level of confidence, do the same for the 95th percentile, and then we’d have two confidence intervals, one for each of the 5th and 95th percentiles.

The value in this approach is that we could say whether the 5th and 95th percentiles of our generated data fall inside the confidence intervals we would expect for the true population and, if not, conclude at the 95% confidence level that the generated data is not a good fit for our actual/empirical data.

I read a detailed explanation of how to do that on Stack Exchange, stackexchange.com/how-to-obtain-a-confidence-interval-for-a-percentile, and it (binomial counting of whether each observation falls above or below the critical value) seems outside the scope of what we could be expected to accomplish after this course. But maybe we can approximate the confidence intervals for the 5th and 95th percentiles using the formula for the confidence interval of the mean:

\(CI = \bar{X} \pm Z(std/\sqrt{n})\)

# Inputs to calculate the two confidence intervals
sp_percentile_05 <- quantile(train$SalePrice, 0.05, names=FALSE)
sp_percentile_95 <- quantile(train$SalePrice, 0.95, names=FALSE)
Z <- qnorm(0.975)
std <- sd(train$SalePrice)
n_sq <- sqrt(length(train$SalePrice))

# lower and upper bounds of the two confidence intervals
CI_05_Lower <- sp_percentile_05 - Z*(std/n_sq)
CI_05_Upper <- sp_percentile_05 + Z*(std/n_sq)
CI_95_Lower <- sp_percentile_95 - Z*(std/n_sq)
CI_95_Upper <- sp_percentile_95 + Z*(std/n_sq)

CI_05 <- c(CI_05_Lower, CI_05_Upper)
CI_95 <- c(CI_95_Lower, CI_95_Upper)

CI_05
## [1] 83925.03 92074.97
CI_95
## [1] 322025 330175
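
For comparison, the more literal reading of the instructions, a single 95% confidence interval for the mean of SalePrice assuming normality, would look like this sketch:

# Conventional 95% confidence interval for the mean of SalePrice, assuming normality
mean(train$SalePrice) + c(-1, 1) * qnorm(0.975) * sd(train$SalePrice) / sqrt(length(train$SalePrice))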


5th & 95th Empirical

Here we find the 5th and 95th percentiles of our actual data.

quantile(train$SalePrice, 0.05, names=FALSE)
## [1] 88000
quantile(train$SalePrice, 0.95, names=FALSE)
## [1] 326100


Discuss

The generated data’s 5th and 95th percentiles are not within our confidence intervals (and wouldn’t be even at a much wider confidence level). It looks like the exponential distribution is not the right distribution to fit to our data.

We’ve already taken the log of our sale price data and shown that it looks normally distributed. If we were to repeat this exercise with log SalePrice and the normal distribution, we might find confidence intervals that contain the theoretical 5th and 95th percentiles, which would give us confidence that our data, after a log transformation, follows a normal distribution. A sketch of that follow-up is below.
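
A sketch of that follow-up, using the MASS package already loaded above: fit a normal distribution to log(SalePrice) with fitdistr() and compare its theoretical 5th and 95th percentiles to the empirical ones on the log scale.

# Sketch: fit a normal to log(SalePrice), then compare theoretical and
# empirical 5th/95th percentiles on the log scale
log_sp <- log(train$SalePrice)
norm_fit <- fitdistr(log_sp, "normal")
qnorm(c(0.05, 0.95), mean = norm_fit$estimate["mean"], sd = norm_fit$estimate["sd"])
quantile(log_sp, c(0.05, 0.95))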