Our data is publicly available as the Ames Housing Data, but we’ve collected it from the Kaggle competition House Prices: Advanced Regression Techniques to use in part 3 of the final project. Here we’re downloading it from our GitHub account.
The target value is SalePrice, there are 79 predictors and one ID column, and there are 1460 rows in our train data. We don’t need the test data until part 3 of the final project.
# Download the training data from our GitHub copy of the Kaggle train.csv
datalocation_train <- 'https://raw.githubusercontent.com/pkofy/DATA605/main/Final%20Project/train.csv'
train <- read.csv(file=datalocation_train)
Assignment description:
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
Univariate descriptive statistics describe a single variable in a data set. We’ll use summary() and sd() to get a variety of statistics for each variable.
summary(train$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
summary(train$LotArea)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
summary(train$OverallQual)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 6.000 6.099 7.000 10.000
sd(train$SalePrice)
## [1] 79442.5
sd(train$LotArea)
## [1] 9981.265
sd(train$OverallQual)
## [1] 1.382997
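The same statistic can also be pulled for several columns at once; here’s a compact sketch using sapply() on the three variables above.
# Compact alternative: apply sd() across several numeric columns in one call
sapply(train[c("SalePrice", "LotArea", "OverallQual")], sd)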
There are multiple plots to choose from to describe a variable in our dataset. We’ll also see that when we take the log of SalePrice, the histogram shifts into a more normally distributed curve.
Code Summary
+ Histogram and DotChart
+ Density Plot
+ Box Plot
+ Box Plot Transformed
+ Histogram Transformed
We’ve selected a combination of histogram and dot chart to describe SalePrice from different angles.
hist(train$SalePrice)
dotchart(train$SalePrice)
We’ve also used a density plot to describe OverallQual.
plot(density(train$OverallQual))
When we run a box plot on GrLivArea, it’s hard to read, with a significant number of values outside of the interquartile range.
boxplot(train$GrLivArea)
But when we generate a box plot on the log of GrLivArea, it looks like a normally distributed variable.
It looks like running box plots on every variable to identify ones that should be non-linearly transformed was an untapped strategy for the Kaggle competition; a minimal sketch of that idea follows the plot below.
boxplot(log(train$GrLivArea))
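Here is a minimal sketch of that screening idea, with GrLivArea and LotArea chosen purely for illustration; in practice the loop would run over all numeric predictors.
# Sketch: compare raw vs. log-transformed box plots for candidate numeric columns
par(mfrow = c(1, 2))
for (col in c("GrLivArea", "LotArea")) {
  x <- train[[col]]
  x <- x[!is.na(x) & x > 0]   # log() needs strictly positive values
  boxplot(x, main = col)
  boxplot(log(x), main = paste("log", col))
}
par(mfrow = c(1, 1))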
Compare these plots of log-transformed SalePrice to the originals in the “Histogram and DotChart” section to see how much more normally distributed SalePrice is now.
hist(log(train$SalePrice))
dotchart(log(train$SalePrice))
A scatterplot matrix lets us visualize the relationships between multiple variables in our data. Here we take a subset of the variables we ultimately used in our final model for the Kaggle Contest submission. We remove RoofMatl because it is non-numeric. We’re primarily interested in the top row, because those are the charts where SalePrice is on the y-axis and our various predictors are on the x-axis.
pairs(train[c("SalePrice", "LotArea", "OverallQual", "OverallCond", "YearBuilt", "GrLivArea")], gap=.25)
A correlation matrix is like the numeric version of our scatterplot matrix, displaying the correlations between the multiple variables. The more linear the scatterplot, the closer the corresponding cell in the correlation matrix is to 1 (or -1), with less correlation being closer to 0. I ran the correlation matrix for all of them because it’s instructive, but let’s look at the variables most and least correlated with SalePrice: OverallQual and OverallCond, respectively.
cor(train[c("SalePrice", "LotArea", "OverallQual", "OverallCond", "YearBuilt", "GrLivArea")])
## SalePrice LotArea OverallQual OverallCond YearBuilt
## SalePrice 1.00000000 0.26384335 0.79098160 -0.07785589 0.52289733
## LotArea 0.26384335 1.00000000 0.10580574 -0.00563627 0.01422765
## OverallQual 0.79098160 0.10580574 1.00000000 -0.09193234 0.57232277
## OverallCond -0.07785589 -0.00563627 -0.09193234 1.00000000 -0.37598320
## YearBuilt 0.52289733 0.01422765 0.57232277 -0.37598320 1.00000000
## GrLivArea 0.70862448 0.26311617 0.59300743 -0.07968587 0.19900971
## GrLivArea
## SalePrice 0.70862448
## LotArea 0.26311617
## OverallQual 0.59300743
## OverallCond -0.07968587
## YearBuilt 0.19900971
## GrLivArea 1.00000000
correlation_matrix <- cor(train[c("SalePrice", "OverallQual", "OverallCond")])
correlation_matrix
## SalePrice OverallQual OverallCond
## SalePrice 1.00000000 0.79098160 -0.07785589
## OverallQual 0.79098160 1.00000000 -0.09193234
## OverallCond -0.07785589 -0.09193234 1.00000000
We’re going to test, with 80% confidence, whether our correlations are zero.
If the p-value is less than our threshold (usually 5%), then we can reject the null hypothesis that the tested correlation is zero, and our conclusion is that there is some meaningful correlation.
If the p-value is higher than our threshold, we can’t reject the null hypothesis that the correlation is zero, and our conclusion is that there is “not enough evidence to suggest a significant correlation”.
Looking at the output below, it’s interesting that neither 80% confidence interval contains zero, so both correlations are statistically significant. Even the variable with the lowest correlation is still found to be relevant. I wonder if it’s because, for less valuable properties, sellers put more effort into fixing them up before sale? Otherwise, why would Overall Condition be negatively correlated with Sale Price?
cor.test(train$SalePrice, train$OverallQual, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train$SalePrice and train$OverallQual
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.7780752 0.8032204
## sample estimates:
## cor
## 0.7909816
cor.test(train$SalePrice, train$OverallCond, conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train$SalePrice and train$OverallCond
## t = -2.9819, df = 1458, p-value = 0.002912
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.1111272 -0.0444103
## sample estimates:
## cor
## -0.07785589
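The pieces discussed above can also be pulled out of the cor.test object programmatically, which is handy when testing many pairs; a quick sketch:
# Extract the estimate, p-value, and 80% confidence interval from the htest object
ct <- cor.test(train$SalePrice, train$OverallCond, conf.level = 0.80)
c(estimate = unname(ct$estimate), p.value = ct$p.value)
ct$conf.int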
We learned a lot from this exercise. One lesson is that by using histograms, dot charts, and box plots we can identify variables that maybe should be non-linearly transformed before regression modeling. Another is that scatterplot and correlation matrices are helpful but not definitive when identifying whether there is correlation.
If I were to do my regression model in Part 3 over again, I would spend more time assessing whether the variables can be non-linearly transformed to be more normally distributed before regression modeling, and only use scatterplot matrices to get a sense of the data, not to look for variables to exclude. (While the scatterplot matrices did help identify variables with lots of missing values, we have other methods to do the same thing.)
We’ve already identified an additional variable that we should have transformed for the model by taking the log of GrLivArea.
Initially I assumed family-wise error was when a variable in your dataset partially encodes other variables in your dataset, like how the number of bathrooms could partially encode the total square footage of the house and so be partially redundant.
However, family-wise error is actually when you do multiple hypothesis tests at once and increase the chance of making a Type 1, or false positive, error.
Basically, you always have a chance of concluding that there is correlation between the variables when there isn’t. If you do one test you can, within your level of confidence, be assured that the chance of making a Type 1 error, or false positive, is small. However, if you run a batch of hypothesis tests, the chance that at least one of the meaningful correlations identified is wrong becomes higher the more tests there are in your batch, as the quick calculation below shows.
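A back-of-the-envelope calculation makes the point: at a 5% significance level, the chance of at least one false positive grows rapidly with the number of independent tests (15 is the number of pairwise tests among our six variables).
# Familywise Type I error rate for m independent tests at alpha = 0.05
alpha <- 0.05
m <- c(1, 3, 10, 15)
1 - (1 - alpha)^m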
I’m not worried about it in our case because we have already corroborated these variables as having predictive value for the sale price. However, I am worried about redundancy/high correlation/multicollinearity in the predictors used in my ultimate model. A step for future analysis would be additional triage where we repeat our final model minus each one of the predictors in turn and compare those models. My metaphor is: if I can build a stool with four legs, why do I need five legs? (But the floor is not level!)
Assignment description:
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
We can solve for the matrix that’s the inverse of our correlation matrix in R directly.
Here is a reprint of our earlier correlation matrix:
correlation_matrix
## SalePrice OverallQual OverallCond
## SalePrice 1.00000000 0.79098160 -0.07785589
## OverallQual 0.79098160 1.00000000 -0.09193234
## OverallCond -0.07785589 -0.09193234 1.00000000
Here is our precision matrix, obtained by solving for the inverse of the correlation matrix. I read a little about variance inflation factors, regression coefficients, and collinearity, and it’s something to dig into later.
# Here is the calculation of our precision matrix, the inverse of our correlation matrix
precision_matrix <- solve(correlation_matrix)
precision_matrix
## SalePrice OverallQual OverallCond
## SalePrice 2.67150050 -2.11183483 0.01384614
## OverallQual -2.11183483 2.67793985 0.08177049
## OverallCond 0.01384614 0.08177049 1.00859536
Here we multiply the correlation matrix by the precision matrix. The order of matrix multiplication matters, so I assume we have the right conventional order here; however, in this case the two matrices are inverses of each other, so order doesn’t matter and we get the identity matrix either way. While the matrix below isn’t exactly the identity matrix, if you were to round the elements to the nearest quadrillionth it would be, so we’re close enough.
correlation_matrix %*% precision_matrix
## SalePrice OverallQual OverallCond
## SalePrice 1.000000e+00 -2.137437e-16 1.287709e-18
## OverallQual 1.097375e-16 1.000000e+00 4.762427e-18
## OverallCond -1.734723e-17 4.163336e-17 1.000000e+00
Here we multiply the precision matrix by the correlation matrix. Note again, this is the identity matrix with tiny tiny rounding differences less than a quadrillionth each.
precision_matrix %*% correlation_matrix
## SalePrice OverallQual OverallCond
## SalePrice 1.000000e+00 1.094186e-16 -1.387779e-17
## OverallQual -2.137437e-16 1.000000e+00 4.163336e-17
## OverallCond 1.287709e-18 4.762427e-18 1.000000e+00
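Rounding makes that point explicit; both products collapse to the identity matrix once the floating-point dust is rounded away.
# Round away floating-point error to recover the exact identity matrix
round(correlation_matrix %*% precision_matrix, 12)
round(precision_matrix %*% correlation_matrix, 12)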
I have a note! From one of the very first weeks of class, the professor said to remember how to do LU Decomposition and that it could be done easily in R with the pracma library.
LU decomposition is when you turn an nxn matrix into two triangular matrices, a (L)ower triangular matrix and an (U)pper triangular matrix. The (L)ower triangular matrix has all zeros above the diagonal and the (U)pper triangular matrix has all zeros below the diagonal line, such that when you multiply the (L)ower by the (U)pper you get the original nxn matrix.
To do it by hand, you run Gaussian elimination on your nxn matrix until it becomes the (U)pper triangular matrix, and as you go you record each elimination multiplier below the diagonal of an identity matrix of size n; when the elimination is complete, that recorded matrix is the (L)ower triangular matrix.
I didn’t find any reasons for performing LU Decomposition on a correlation matrix; however, there’s a lot of information online about performing LU Decomposition on precision matrices. Most of the reasons were computational efficiency/stability in using precision matrices to determine conditional dependencies between variables in your model, or in using precision matrices to select variables to include in your model. These topics were outside of the scope of our class but will stay as a placeholder for me to learn more about how to use precision matrices with LASSO and ridge regression in the future.
We can perform LU Decomposition simply in R using the pracma library (Thank you, Doctor Larry):
library("pracma")
LU <- lu(precision_matrix)
LU$L
## SalePrice OverallQual OverallCond
## SalePrice 1.000000000 0.00000000 0
## OverallQual -0.790505124 1.00000000 0
## OverallCond 0.005182906 0.09193234 1
LU$U
## SalePrice OverallQual OverallCond
## SalePrice 2.671501 -2.111835 0.01384614
## OverallQual 0.000000 1.008524 0.09271594
## OverallCond 0.000000 0.000000 1.00000000
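As a sanity check, multiplying L by U should recover the precision matrix up to floating-point rounding.
# Largest absolute difference between L %*% U and the original precision matrix
max(abs(LU$L %*% LU$U - precision_matrix))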
Assignment description:
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
Looking at our variables under consideration we know that SalePrice is skewed right (long tail to the right) from the original histogram we generated. Using the skewness function from the moments package we can calculate skewness for each of our variables (positive skewness is skewed right). LotArea has the highest right-skewness but we know that’s primarily due to one giant lot so we’re going to proceed with SalePrice.
#install.packages("moments")
library("moments")
skewness(train$SalePrice)
## [1] 1.880941
skewness(train$LotArea)
## [1] 12.19514
skewness(train$OverallQual)
## [1] 0.216721
skewness(train$OverallCond)
## [1] 0.6923552
skewness(train$YearBuilt)
## [1] -0.6128307
skewness(train$GrLivArea)
## [1] 1.365156
Since this is sale price data it should never be negative, so we don’t need to shift it so the minimum value is above zero, but let’s do a quick range check to be sure. Yep, the lowest Sale Price is $34,900.
range(train$SalePrice)
## [1] 34900 755000
Here we load the MASS package and fit an exponential probability density to our data.
library(MASS)
sp_expfit <- fitdistr(train$SalePrice, "exponential")
sp_expfit
## rate
## 5.527268e-06
## (1.446552e-07)
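As a quick sanity check, the maximum-likelihood rate for an exponential distribution is simply one over the sample mean, so this should reproduce the rate reported by fitdistr.
# Closed-form exponential MLE: rate = 1 / sample mean (approximately 5.53e-06 here)
1 / mean(train$SalePrice)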
Here we generate data with the exponential distribution using the lambda identified when we fit an exponential distribution to our actual data.
lambda <- sp_expfit$estimate
set.seed(713)
gendata <- rexp(1000, rate = lambda)
lambda
## rate
## 5.527268e-06
Here we compare our actual data to data generated from an exponential distribution with the fitted lambda.
It looks like the exponential distribution is a reasonable proxy for the right tail. Maybe we could split the data on the median, perform this analysis on the right tail, perform it a second time on the flipped left tail, and then combine the two.
However, remember that lambda is the rate of events per interval, so I assume that each data point is an event and the interval is increasing dollars… (already an indication the exponential might not be the distribution to fit to our data).
hist(train$SalePrice, main = "Original Data", col = "yellowgreen")
hist(gendata, main = "Generated Data", col = "aliceblue")
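An alternative way to make the comparison, sketched here as an addition to the two histograms above, is to overlay the fitted exponential density on a density-scaled histogram of SalePrice.
# Overlay the fitted exponential density on the original data
hist(train$SalePrice, freq = FALSE, main = "SalePrice with fitted exponential", col = "yellowgreen")
curve(dexp(x, rate = lambda), add = TRUE, lwd = 2)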
Here we’re going to compare the 5th and 95th percentiles of the fitted exponential distribution with the 5th and 95th percentiles of our empirical data, along with a confidence interval from our empirical data, to draw conclusions from.
Here we find the 5th and 95th percentiles of the fitted exponential distribution.
qexp(0.05, rate = lambda)
## [1] 9280.044
qexp(0.95, rate = lambda)
## [1] 541991.5
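These are the same numbers we get by inverting the exponential CDF by hand: F(x) = 1 - exp(-lambda*x) gives the quantile x_p = -log(1 - p)/lambda.
# Closed-form quantiles from the inverted CDF (should match qexp above)
-log(1 - 0.05) / lambda
-log(1 - 0.95) / lambda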
Instructions: Also generate a 95% confidence interval from the empirical data, assuming normality.
I interpret this question differently. Normally a confidence interval would say: if you calculate the mean of your sample, what are the highest and lowest values the true population mean could be, given a certain level of confidence?
I assumed this question would relate to the 5th and 95th percentiles. So we could calculate a confidence interval for the 5th percentile of the true population, given a 95% level of confidence, do the same for the 95th percentile, and then we’d have two confidence intervals, one for each of the 5th and 95th percentiles.
The value in this approach is we could say whether the 5th and 95th percentiles of our fitted distribution are or are not in the confidence intervals we would expect from our true population; if not, then, at a 95% confidence level, the fitted distribution is not a good fit for our actual/empirical data.
I read a detailed explanation of how to do that on this Stack Exchange post, stackexchange.com/how-to-obtain-a-confidence-interval-for-a-percentile, and it (binomial counting of whether each observation is above or below our critical value) seems outside of the scope of what we could be expected to accomplish after this course; a rough sketch of that order-statistic approach appears after the normal-approximation intervals below. But maybe we can approximate the confidence intervals for the 5th and 95th percentiles using the formula for the confidence interval of the mean:
\(CI = \bar{X} \pm Z(std/\sqrt{n})\)
# Inputs to calculate the two confidence intervals
sp_percentile_05 <- quantile(train$SalePrice, 0.05, names=FALSE)
sp_percentile_95 <- quantile(train$SalePrice, 0.95, names=FALSE)
Z <- qnorm(0.975)
std <- sd(train$SalePrice)
n_sq <- sqrt(length(train$SalePrice))
# lower and upper bounds of the two confidence intervals
CI_05_Lower <- sp_percentile_05 - Z*(std/n_sq)
CI_05_Upper <- sp_percentile_05 + Z*(std/n_sq)
CI_95_Lower <- sp_percentile_95 - Z*(std/n_sq)
CI_95_Upper <- sp_percentile_95 + Z*(std/n_sq)
CI_05 <- c(CI_05_Lower, CI_05_Upper)
CI_95 <- c(CI_95_Lower, CI_95_Upper)
CI_05
## [1] 83925.03 92074.97
CI_95
## [1] 322025 330175
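For comparison, here is a rough sketch of the order-statistic approach from that Stack Exchange discussion, as my own approximation: the rank of the pth sample quantile is approximately Binomial(n, p), so binomial quantiles pick out order statistics that bracket it.
# Approximate nonparametric CI for the pth percentile using bracketing order statistics
quantile_ci <- function(x, p, conf = 0.95) {
  x <- sort(x)
  n <- length(x)
  lo <- max(qbinom((1 - conf) / 2, n, p), 1)
  hi <- min(qbinom(1 - (1 - conf) / 2, n, p) + 1, n)
  c(lower = x[lo], upper = x[hi])
}
quantile_ci(train$SalePrice, 0.05)
quantile_ci(train$SalePrice, 0.95)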
Here we find the 5th and 95th percentiles of our actual data.
quantile(train$SalePrice, 0.05, names=FALSE)
## [1] 88000
quantile(train$SalePrice, 0.95, names=FALSE)
## [1] 326100
The fitted exponential distribution’s 5th and 95th percentiles are not within our confidence intervals. It looks like the exponential distribution is not the right distribution to fit to our data.
We’ve already taken the log of our sale price data and shown that it looks normally distributed. If we were to repeat this exercise with log SalePrice and the normal distribution, we might have calculated confidence intervals that contained the fitted distribution’s 5th and 95th percentiles, and we would have confidence that our data, with log transformation, follows a normal distribution.