library(tidyverse)
library(scales)
library(GGally)
library(knitr)
library(matrixcalc)
library(MASS)
library(ldsr)
library(EnvStats)
set.seed(87)
palette <- c("#49c6e5", "#255f85", "#fae1df", "#ff3864", "#130303")
# Helper: render a numeric matrix as a LaTeX bmatrix string
to_LaTeX <- function(A){
rows <- apply(A, MARGIN=1, paste, collapse = " & ")
matrix_string <- paste(rows, collapse = " \\\\ ")
return(paste("\\begin{bmatrix}", matrix_string, "\\end{bmatrix}"))
}
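For example, the call below renders a \(2 \times 2\) identity matrix as a bmatrix string; we use this helper to typeset the matrices later in the report.
# Example: typeset a 2 x 2 identity matrix as a LaTeX bmatrix string
to_LaTeX(diag(2))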
Using R, generate a random variable \(X\) that has \(10,000\) random Gamma pdf values. A Gamma pdf is completely described by \(n\) (a size parameter) and lambda (\(\lambda\), a shape parameter). Choose any \(n\) greater than \(3\) and an expected value (\(\lambda\)) between \(2\) and \(10\) (you choose).
alpha <- 10
lambda <- 6
X <- rgamma(n = 10000, shape = alpha, rate = lambda)
Then generate \(10,000\) observations from the sum of \(n\) exponential pdfs with rate/shape parameter (\(\lambda\)). The \(n\) and \(\lambda\) must be the same as in the previous case. (e.g., mysum = rexp(10000, \(\lambda\)) + rexp(10000, \(\lambda\)) + …)
Y <- rexp(n = 10000, rate = lambda) + rexp(n = 10000, rate = lambda) +
    rexp(n = 10000, rate = lambda) + rexp(n = 10000, rate = lambda) +
    rexp(n = 10000, rate = lambda) + rexp(n = 10000, rate = lambda) +
    rexp(n = 10000, rate = lambda) + rexp(n = 10000, rate = lambda) +
    rexp(n = 10000, rate = lambda) + rexp(n = 10000, rate = lambda)
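An equivalent, more compact construction is sketched below (not run here; Y_alt is our own helper name, and because the random draws occur in a different order it would not reproduce the exact empirical values reported later, only the same distribution).
# Sum alpha = 10 independent Exp(lambda) draws per observation
# Y_alt <- rowSums(replicate(alpha, rexp(10000, rate = lambda)))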
Then generate \(10,000\) observations from a single exponential pdf with rate/shape parameter (\(\lambda\)).
Z <- rexp(n = 10000, rate = lambda)
The Gamma distribution is quite common in data science. For example, it is used to model failures for multiple processes when each of those processes has the same failure rate. The exponential is used for constant failure rates, service times, etc.
(5 Points). Calculate the empirical expected values (means) and variances of all three pdfs.
mean_X <- mean(X)
mean_Y <- mean(Y)
mean_Z <- mean(Z)
var_X <- var(X)
var_Y <- var(Y)
var_Z <- var(Z)
Mean(X): \(1.6767234\), Var(X): \(0.2879548\)
Mean(Y): \(1.6579397\), Var(Y): \(0.2734734\)
Mean(Z): \(0.1616728\), Var(Z): \(0.0252837\)
Using calculus, calculate the expected value and variance of the Gamma pdf (X).
\(M_X(t) = (1 - \frac{t}{\lambda})^{-\alpha}\)
\(u = (1 - \frac{t}{\lambda})\)
\(M^{\prime}_X(t) = (-\alpha)(u)^{-\alpha-1}(-\frac{1}{\lambda})\)
\(M^{\prime}_X(t) = \frac{\alpha}{\lambda}(u)^{-\alpha-1}\)
\(M^{\prime}_X(t) = \frac{\alpha}{\lambda}(1 - \frac{t}{\lambda})^{-\alpha-1}\)
\(M^{\prime}_X(0) = \frac{\alpha}{\lambda}(1 - \frac{0}{\lambda})^{-\alpha-1} = \frac{\alpha}{\lambda}\)
Mean(X) = \(\frac{\alpha}{\lambda} = \frac{10}{6} \approx 1.67\)
\(M^{\prime\prime}_X(t) = \frac{\alpha}{\lambda}(-\alpha-1)(u)^{-\alpha-2}(-\frac{1}{\lambda}) = \frac{\alpha(\alpha+1)}{\lambda^2}(1 - \frac{t}{\lambda})^{-\alpha-2}\)
\(M^{\prime\prime}_X(0) = \frac{\alpha(\alpha+1)}{\lambda^2}(1 - \frac{0}{\lambda})^{-\alpha-2} = \frac{\alpha(\alpha+1)}{\lambda^2}\)
Var(X) = \(M^{\prime\prime}_X(0) - [M^{\prime}_X(0)]^2 = \frac{\alpha(\alpha+1)}{\lambda^2} - \frac{\alpha^2}{\lambda^2} = \frac{\alpha}{\lambda^2} = \frac{10}{36} \approx 0.28\)
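As a quick sanity check (a small sketch using the simulated values from above), the closed-form Gamma moments can be compared directly to the empirical ones:
# Theoretical vs. empirical mean and variance of X
c(theoretical_mean_X = alpha / lambda, empirical_mean_X = mean(X))
c(theoretical_var_X = alpha / lambda^2, empirical_var_X = var(X))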
Using the moment generating function for exponentials, calculate the expected value of the single exponential (Z) and the sum of exponentials (Y).
\(M_Z(t) = \frac{\lambda}{\lambda - t}\)
\(u = \lambda - t\)
\(u^\prime = -1\)
\(M_Z(t) = \lambda\frac{1}{u}\)
\(M_Z^{\prime}(t) = -\lambda\frac{u^\prime}{u^2} = -\lambda\frac{(-1)}{(\lambda-t)^2} = \frac{\lambda}{(\lambda-t)^2}\)
\(M_Z^{\prime}(0) = \frac{\lambda}{(\lambda-0)^2} = \frac{\lambda}{\lambda^2} = \frac{1}{\lambda}\)
Mean(Z) = \(\frac{1}{\lambda} = \frac{1}{6} \approx 0.167\)
\(M_Y(t) = (\frac{\lambda}{\lambda - t})^\alpha\)
\(u = \frac{\lambda}{\lambda - t}\)
\(u^\prime = \frac{\lambda}{(\lambda-t)^2}\)
\(M_Y^{\prime}(t) = \alpha(\frac{\lambda}{\lambda - t})^{\alpha-1}\times \frac{\lambda}{(\lambda-t)^2} = \frac{\alpha\lambda(\frac{\lambda}{\lambda - t})^{\alpha-1}}{(\lambda-t)^2} = \frac{\alpha(\frac{\lambda}{\lambda - t})^{\alpha}}{\lambda-t}\)
\(M_Y^{\prime}(0) = \frac{\alpha(\frac{\lambda}{\lambda - 0})^{\alpha}}{\lambda - 0} = \frac{\alpha}{\lambda}\)
Mean(Y) = \(\frac{\alpha}{\lambda} = \frac{10}{6} \approx 1.67\)
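A similar quick check (a sketch) compares these two closed-form means with the simulations:
# Theoretical vs. empirical means of Z and Y
c(theoretical_mean_Z = 1 / lambda, empirical_mean_Z = mean(Z))
c(theoretical_mean_Y = alpha / lambda, empirical_mean_Y = mean(Y))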
Probability. For pdf Z (the exponential), calculate empirically probabilities a through c (below). Then evaluate through calculus whether the memoryless property holds.
Note: In the probability statements below, we take \(\lambda\) to mean the value we defined at the beginning of Problem 1 (\(\lambda = 6\)); when we evaluate the probabilities numerically, we substitute its reciprocal, the mean \(\frac{1}{\lambda} = \frac{1}{6}\). With a \(\lambda\) of 6, we can imagine a machine that fails on average 6 times per year, so the expected waiting time to a failure is \(\frac{1}{6}\) of a year. That waiting time is the mean of the distribution; the failure rate itself is \(\lambda\).
a. \(P(Z > \lambda | Z > \frac{\lambda}{2})\)
We take this to mean: what is the probability that we have more than 6 failures this year given that we have already had 3 failures?
\(P(Z \leq \lambda | Z > \frac{\lambda}{2}) = \frac{P(Z \leq \lambda \cap Z > \frac{\lambda}{2})}{P(Z > \frac{\lambda}{2})} = \frac{P(\frac{\lambda}{2} < Z \leq \lambda)}{P(Z > \frac{\lambda}{2})} = \frac{P(\frac{6}{2} < Z \leq 6)}{P(Z > \frac{6}{2})} = \frac{P(3 < Z \leq 6)}{P(Z > 3)}\)
\(= \frac{e^{-3\lambda} - e^{-6\lambda}}{e^{-3\lambda}} = 1 - e^{-3\lambda}\)
\(P(Z > \lambda | Z > \frac{\lambda}{2}) = 1 - P(Z \leq \lambda | Z > \frac{\lambda}{2}) = 1 - (1 - e^{-3\lambda}) = e^{-3\lambda} = e^{-\frac{3}{6}} \approx 0.607\)
The probability that we will have more than 6 failures given that we have already had 3 failures is the same as the probability that we will have more than 3 failures: \(0.607\).
b. \(P(Z > 2\lambda | Z > \lambda)\)
We take this to mean: what is the probability that we have more than 12 failures this year given that we have already had 6 failures?
\(P(Z \leq 2\lambda | Z > \lambda) = \frac{P(Z \leq 2\lambda \cap Z > \lambda)}{P(Z > \lambda)} = \frac{P(\lambda < Z \leq 2\lambda)}{P(Z > \lambda)} = \frac{P(6 < Z \leq 2(6))}{P(Z > 6)} = \frac{P(6 < Z \leq 12)}{P(Z > 6)}\)
\(= \frac{e^{-6\lambda} - e^{-12\lambda}}{e^{-6\lambda}} = 1 - e^{-6\lambda}\)
\(P(Z > 2\lambda | Z > \lambda) = 1 - (1 - e^{-6\lambda}) = e^{-6\lambda} = e^{-\frac{6}{6}} \approx 0.368\)
The probability that we will have more than 12 failures given that we have already had 6 failures is the same as the probability that we will have more than 6 failures: \(0.368\).
c. \(P(Z > 3\lambda | Z > \lambda)\)
We take this to mean: what is the probability that we have more than 18 failures this year given that we have already had 6 failures?
\(P(Z \leq 3\lambda | Z > \lambda) = \frac{P(Z \leq 3\lambda \cap Z > \lambda)}{P(Z > \lambda)} = \frac{P(\lambda < Z \leq 3\lambda)}{P(Z > \lambda)} = \frac{P(6 < Z \leq 3(6))}{P(Z > 6)} = \frac{P(6 < Z \leq 18)}{P(Z > 6)}\)
\(= \frac{e^{-6\lambda} - e^{-18\lambda}}{e^{-6\lambda}} = 1 - e^{-12\lambda}\)
\(P(Z > 3\lambda | Z > \lambda) = 1 - (1 - e^{-12\lambda}) = e^{-12\lambda} = e^{-\frac{12}{6}} \approx 0.135\)
The probability that we will have more than 18 failures given that we have already had 6 failures is the same as the probability that we will have more than 12 failures: \(0.135\).
In every case, the conditional probability equals the unconditional probability of surviving the same additional time, so the memoryless property holds.
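As an empirical spot-check of part (b) using the simulated Z (a sketch; mu is a helper name we introduce for the mean \(\frac{1}{\lambda}\)), the conditional and unconditional probabilities should both be close to \(e^{-1}\):
# P(Z > 2*mu | Z > mu) vs. P(Z > mu), both approximately exp(-1)
mu <- 1 / lambda
c(conditional = sum(Z > 2 * mu) / sum(Z > mu),
  unconditional = mean(Z > mu),
  theoretical = exp(-1))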
(5 Points). Loosely investigate whether \(P(YZ) = P(Y) P(Z)\) by building a table with quartiles and evaluating the marginal and joint probabilities.
q <- c(0.25, 0.25, 0.25, .25)
qY <- matrix(q, nrow = 4, ncol = 4, byrow = FALSE)
qZ <- matrix(q, nrow = 4, ncol = 4, byrow = TRUE)
A <- as.data.frame(matrix(qY * qZ, nrow=4, ncol=4))
c <- c("1st Quartile Y", "2nd Quartile Y", "3rd Quartile Y", "4th Quartile Y")
colnames(A) <- c
A <- A %>%
mutate(Sum = rowSums(.[1:ncol(A)]))
A <- rbind(A, colSums(A))
r <- c("1st Quartile Z", "2nd Quartile Z", "3rd Quartile Z", "4th Quartile Z",
"Sum")
rownames(A) <- r
kable(A, format="simple")
|  | 1st Quartile Y | 2nd Quartile Y | 3rd Quartile Y | 4th Quartile Y | Sum |
|---|---|---|---|---|---|
| 1st Quartile Z | 0.0625 | 0.0625 | 0.0625 | 0.0625 | 0.25 |
| 2nd Quartile Z | 0.0625 | 0.0625 | 0.0625 | 0.0625 | 0.25 |
| 3rd Quartile Z | 0.0625 | 0.0625 | 0.0625 | 0.0625 | 0.25 |
| 4th Quartile Z | 0.0625 | 0.0625 | 0.0625 | 0.0625 | 0.25 |
| Sum | 0.2500 | 0.2500 | 0.2500 | 0.2500 | 1.00 |
The joint probability of observing a value in any given quartile of the Y distribution and a value in any given quartile of the Z distribution is \(P(Y, Z) = P(Y | Z) \times P(Z) = 0.25 \times 0.25 = 0.0625\).
The marginal probability of observing a value in any given quartile of the Y distribution is \(0.25\), regardless of Z; similarly, the marginal probability of observing a value in any given quartile of the Z distribution is \(0.25\), regardless of Y.
\(P(YZ) = P(Y) P(Z)\)
(5 Points). Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Our understanding is that neither test is designed for continuous numerical data, so we binned the numerical variables into factors based on their quartiles. This practice may be discouraged based on what we've read, but it seemed like the best option.
df <- as.data.frame(cbind(Y, Z))
df <- df %>%
mutate(Y = ntile(.$Y, 4), Z = ntile(.$Z, 4))
cols <- c("YQuantile", "ZQuantile")
colnames(df) <- cols
df$YQuantile <- as.factor(df$YQuantile)
df$ZQuantile <- as.factor(df$ZQuantile)
contingency <- xtabs(~ ZQuantile + YQuantile, data = df)
contingency
## YQuantile
## ZQuantile 1 2 3 4
## 1 636 637 617 610
## 2 616 624 621 639
## 3 623 617 644 616
## 4 625 622 618 635
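For comparison with the theoretical quartile table above, the empirical joint proportions can be computed from this contingency table (a sketch; each cell should be close to \(0.0625\)):
# Empirical joint quartile proportions
round(prop.table(contingency), 4)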
Fisher's Exact Test computes an exact p-value and is typically reserved for small samples or tables with small expected cell counts, while the Chi-Square test relies on a large-sample approximation. Because we have a large sample (10,000 observations), the Chi-Square test is more appropriate, so we perform it first. For both tests, the null hypothesis is that the variables are independent; the alternative hypothesis is that they are not independent.
test <- chisq.test(contingency)
test
##
## Pearson's Chi-squared test
##
## data: contingency
## X-squared = 2.4256, df = 9, p-value = 0.9828
The p-value for the Chi-Square test is very large: 0.9828. So we fail to reject the null hypothesis that the variables are independent.
Next we perform the Fisher’s Exact test.
test2 <- fisher.test(contingency, simulate.p.value = TRUE)
test2
##
## Fisher's Exact Test for Count Data with simulated p-value (based on
## 2000 replicates)
##
## data: contingency
## p-value = 0.981
## alternative hypothesis: two.sided
The p-value for Fisher's Exact Test is 0.981, almost the same very large value. So we again fail to reject the null hypothesis that the variables are independent.
Provide univariate descriptive statistics and appropriate plots for the training data set.
Since there are so many variables, we first reduce them to only the ones we are interested in describing and visualizing. We print a summary of those.
my_url <- "https://raw.githubusercontent.com/geedoubledee/data605_final_project/main/train.csv"
train_df <- read.csv(my_url)
cols <- c("YearBuilt", "BedroomAbvGr", "GrLivArea", "FullBath", "BsmtFullBath",
"SalePrice", "OverallQual", "OverallCond")
train_df_sub <- train_df[, colnames(train_df) %in% cols]
train_df_sub <- train_df_sub %>%
mutate(AllFullBath = BsmtFullBath + FullBath)
summary(train_df_sub)
## OverallQual OverallCond YearBuilt GrLivArea
## Min. : 1.000 Min. :1.000 Min. :1872 Min. : 334
## 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1130
## Median : 6.000 Median :5.000 Median :1973 Median :1464
## Mean : 6.099 Mean :5.575 Mean :1971 Mean :1515
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:1777
## Max. :10.000 Max. :9.000 Max. :2010 Max. :5642
## BsmtFullBath FullBath BedroomAbvGr SalePrice
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. : 34900
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:129975
## Median :0.0000 Median :2.000 Median :3.000 Median :163000
## Mean :0.4253 Mean :1.565 Mean :2.866 Mean :180921
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:214000
## Max. :3.0000 Max. :3.000 Max. :8.000 Max. :755000
## AllFullBath
## Min. :0.00
## 1st Qu.:1.00
## Median :2.00
## Mean :1.99
## 3rd Qu.:2.00
## Max. :6.00
There are 1,460 observations in the training dataset. Each observation is a sold home. The median sale price for a home in this dataset is $163,000.00. Below, we plot the distribution of sale price in thousands of dollars.
ggplot(data = train_df_sub, aes(x = SalePrice / 1000)) +
geom_histogram(binwidth=25, color = palette[2], fill = palette[1]) +
scale_x_continuous(breaks = seq(25, 775, by = 50)) +
scale_y_continuous(breaks = seq(0, 275, by = 25)) +
labs(x = "Sale Price in Thousands of $", y = "Frequency",
title = "Distribution of Sale Price in Thousands of $") +
geom_vline(aes(xintercept = mean(SalePrice / 1000)), color = palette[4],
linewidth = 0.5) +
annotate("text", x = mean(train_df_sub$SalePrice / 1000) + 35,
y=275, label="Mean", color = palette[4]) +
geom_vline(aes(xintercept = mean(SalePrice / 1000) + 3 * sd(SalePrice / 1000)),
color = palette[4], linewidth = 0.5, linetype = "dashed") +
annotate("text", x = mean(
train_df_sub$SalePrice / 1000) + 3 * sd(train_df_sub$SalePrice / 1000) +
35, y=275, label="+3 SD", color = palette[4]) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = palette[5]))
The distribution is unimodal and right-skewed. Several homes that are significantly more expensive than the median sale price are pulling the mean up. Homes near a sale price of $425,000.00 and above are outliers, as they are more than 3 standard deviations above the mean.
Next we look at the distribution of above ground living area, measured in square feet.
ggplot(data = train_df_sub, aes(x = GrLivArea)) +
geom_histogram(binwidth=250, color = palette[2], fill = palette[1]) +
scale_x_continuous(breaks = seq(0, 6000, by = 500)) +
scale_y_continuous(breaks = seq(0, 300, by = 25)) +
labs(x = "Above Ground Living Area (Sq. Ft.)", y = "Frequency",
title = "Distribution of Above Ground Living Area (Sq. Ft.)") +
geom_vline(aes(xintercept = mean(GrLivArea)), color = palette[4],
linewidth = 0.5) +
annotate("text", x = mean(train_df_sub$GrLivArea) + 350, y=275,
label="Mean", color = palette[4]) +
geom_vline(aes(xintercept = mean(GrLivArea) + 3 * sd(GrLivArea)),
color = palette[4], linewidth = 0.5, linetype = "dashed") +
annotate("text", x = mean(train_df_sub$GrLivArea) + 3 *
sd(train_df_sub$GrLivArea) + 325,
y=275, label="+3 SD", color = palette[4]) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = palette[5]))
The distribution is unimodal and right-skewed. The median above ground living area is 1,464 sq. ft. Some homes with significantly larger amounts of above ground living area are pulling the mean up. Homes with nearly 3,250 sq. ft. of above ground living area and above are outliers, as they are more than 3 standard deviations above the mean.
Next we look at the distribution of bedrooms. Basement-level bedrooms are not included.
ggplot(data = train_df_sub, aes(x = BedroomAbvGr)) +
geom_histogram(binwidth=1, color = palette[2], fill = palette[1]) +
scale_x_continuous(breaks = seq(0, 8, by = 1)) +
scale_y_continuous(breaks = seq(0, 800, by = 100)) +
labs(x = "Bedrooms (Above Ground Only)", y = "Frequency",
title = "Distribution of Bedrooms (Above Ground Only)") +
geom_vline(aes(xintercept = mean(BedroomAbvGr)), color = palette[4],
linewidth = 0.5) +
annotate("text", x = mean(train_df_sub$BedroomAbvGr) + 1.1, y=800,
label="Mean", color = palette[4]) +
geom_vline(aes(xintercept = mean(BedroomAbvGr) - 3 * sd(BedroomAbvGr)),
color = palette[4], linewidth = 0.5, linetype = "dashed") +
annotate("text", x = mean(train_df_sub$BedroomAbvGr) - 3 *
sd(train_df_sub$BedroomAbvGr) + 0.5,
y=800, label="-3 SD", color = palette[4]) +
geom_vline(aes(xintercept = mean(BedroomAbvGr) + 3 * sd(BedroomAbvGr)),
color = palette[4], linewidth = 0.5, linetype = "dashed") +
annotate("text", x = mean(train_df_sub$BedroomAbvGr) + 3 *
sd(train_df_sub$BedroomAbvGr) + 0.5,
y=800, label="+3 SD", color = palette[4]) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = palette[5]))
The distribution is unimodal and nearly normal. The mean number of bedrooms is 2.9. There are some homes with 0 or 6+ bedrooms above ground, which are outliers since they are more than 3 standard deviations away from the mean.
Lastly we look at the distribution of full bathrooms, including basement-level full bathrooms.
ggplot(data = train_df_sub, aes(x = AllFullBath)) +
geom_histogram(binwidth=1, color = palette[2], fill = palette[1]) +
scale_x_continuous(breaks = seq(0, 6, by = 1)) +
scale_y_continuous(breaks = seq(0, 700, by = 50)) +
labs(x = "All Full Bathrooms", y = "Frequency",
title = "Distribution of All Full Bathrooms") +
geom_vline(aes(xintercept = mean(AllFullBath)), color = palette[4],
linewidth = 0.5) +
annotate("text", x = mean(train_df_sub$AllFullBath) + 1.1, y=700,
label="Mean", color = palette[4]) +
geom_vline(aes(xintercept = mean(AllFullBath) + 3 * sd(AllFullBath)),
color = palette[4], linewidth = 0.5, linetype = "dashed") +
annotate("text", x = mean(train_df_sub$AllFullBath) + 3 *
sd(train_df_sub$AllFullBath) + 0.5,
y=700, label="+3 SD", color = palette[4]) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = palette[5]))
The distribution is unimodal and nearly normal. The mean number of bathrooms is 2.0. There is only one outlier that is more than 3 standard deviations above the mean, and it is a home with 6 baths.
Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
reorder <- c("SalePrice", "OverallQual", "OverallCond", "YearBuilt", "GrLivArea", "BedroomAbvGr", "BsmtFullBath", "FullBath", "AllFullBath")
train_df_sub <- train_df_sub[, reorder]
ggpairs(train_df_sub, columns=c(1, 2, 5))
Derive a correlation matrix for any three quantitative variables in the dataset.
cols <- c("GrLivArea", "BedroomAbvGr", "FullBath")
X <- as.matrix(train_df_sub[, colnames(train_df_sub) %in% cols])
M <- cor(X)
M
## GrLivArea BedroomAbvGr FullBath
## GrLivArea 1.0000000 0.5212695 0.6300116
## BedroomAbvGr 0.5212695 1.0000000 0.3632520
## FullBath 0.6300116 0.3632520 1.0000000
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.
\(\alpha = 0.2\)
Null Hypothesis 1: \(H_0: \rho = 0\)
Alternate Hypothesis 1: \(H_a: \rho \neq 0\)
cor.test(X[, "GrLivArea"], X[, "BedroomAbvGr"], conf.level=0.8)
##
## Pearson's product-moment correlation
##
## data: X[, "GrLivArea"] and X[, "BedroomAbvGr"]
## t = 23.323, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4963921 0.5452915
## sample estimates:
## cor
## 0.5212695
We are \(80\%\) confident that \(\rho\), the population correlation coefficient for “GrLivArea” and “BedroomAbvGr,” is between \(0.496\) and \(0.545\). We reject null hypothesis 1 in favor of the alternate hypothesis 1.
Null Hypothesis 2: \(H_0: \rho = 0\)
Alternate Hypothesis 2: \(H_a: \rho \neq 0\)
cor.test(X[, "GrLivArea"], X[, "FullBath"], conf.level=0.8)
##
## Pearson's product-moment correlation
##
## data: X[, "GrLivArea"] and X[, "FullBath"]
## t = 30.977, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6093339 0.6498331
## sample estimates:
## cor
## 0.6300116
We are \(80\%\) confident that \(\rho\), the population correlation coefficient for “GrLivArea” and “FullBath,” is between \(0.609\) and \(0.650\). We reject null hypothesis 2 in favor of the alternate hypothesis 2.
Null Hypothesis 3: \(H_0: \rho = 0\)
Alternate Hypothesis 3: \(H_a: \rho \neq 0\)
cor.test(X[, "BedroomAbvGr"], X[, "FullBath"], conf.level=0.8)
##
## Pearson's product-moment correlation
##
## data: X[, "BedroomAbvGr"] and X[, "FullBath"]
## t = 14.887, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.3337593 0.3920342
## sample estimates:
## cor
## 0.363252
We are \(80\%\) confident that \(\rho\), the population correlation coefficient for “BedroomAbvGr” and “FullBath,” is between \(0.334\) and \(0.392\). We reject null hypothesis 3 in favor of the alternate hypothesis 3.
Discuss the meaning of your analysis.
All three of the correlation coefficients we tested are statistically significant, so there is a significant linear relationship between each pair of variables. This means we could model the linear relationship between each pair of variables in the population by using linear regression.
Would you be worried about familywise error? Why or why not?
a <- 0.2
m <- 3
FWER <- 1 - ((1 - a)^m)
Because we’ve conducted three hypothesis tests, our confidence level is low, and our alpha is therefore high, yes, I would be concerned about familywise error. The familywise error rate (FWER) for these three hypothesis tests, meaning the probability of us committing a type I error (rejecting the null hypothesis when it is actually true), is 48.8%. That is very high for three tests.
We can reduce the FWER using the Holm-Bonferonni formula. Our p-value for each pair of coefficients was the same very low value: 0.00000000000000022. So we will leave the order of the tests in the order we did them:
\(H_1 = 0.00000000000000022\)
\(H_2 = 0.00000000000000022\)
\(H_3 = 0.00000000000000022\)
Then we will sequentially confirm we should reject the null hypothesis for each test:
rank = 1
HB1 <- a / (m - rank + 1)
rank = 2
HB2 <- a / (m - rank + 1)
rank = 3
HB3 <- a / (m - rank + 1)
\(p_1 < 2.2 \times 10^{-16} < 0.0666667\)
\(p_2 < 2.2 \times 10^{-16} < 0.1\)
\(p_3 < 2.2 \times 10^{-16} < 0.2\)
So we were correct to reject all three null hypotheses after controlling for FWER.
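Equivalently (a sketch), base R's p.adjust() performs the same Holm adjustment; since the exact p-values are only reported as being below 2.2e-16, we use that bound here purely for illustration.
# Holm-adjusted p-values compared against the familywise alpha of 0.2
pvals <- c(2.2e-16, 2.2e-16, 2.2e-16)
p.adjust(pvals, method = "holm") < a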
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
M_inv <- solve(M)
print_M_inv <- to_LaTeX(round(M_inv, 3))
\(M^{-1} = \begin{bmatrix} 1.982 & -0.668 & -1.006 \\ -0.668 & 1.377 & -0.08 \\ -1.006 & -0.08 & 1.663 \end{bmatrix}\)
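As noted in the prompt, the variance inflation factors sit on the diagonal of this precision matrix; a quick sketch to read them off:
# Variance inflation factors for GrLivArea, BedroomAbvGr, and FullBath
round(diag(M_inv), 3)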
Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
M_M_inv <- M %*% M_inv
print_M_M_inv <- to_LaTeX(round(M_M_inv, 3))
\(M M^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\)
M_inv_M <- M_inv %*% M
print_M_inv_M <- to_LaTeX(round(M_inv_M, 3))
\(M^{-1} M = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\)
Conduct LU decomposition on the matrix.
triangles <- lu.decomposition(M)
print_M <- to_LaTeX(round(M, 3))
L <- triangles$L
print_L <- to_LaTeX(round(L, 3))
U <- triangles$U
print_U <- to_LaTeX(round(U, 3))
\(M = \begin{bmatrix} 1 & 0.521 & 0.63 \\ 0.521 & 1 & 0.363 \\ 0.63 & 0.363 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0.521 & 1 & 0 \\ 0.63 & 0.048 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0.521 & 0.63 \\ 0 & 0.728 & 0.035 \\ 0 & 0 & 0.601 \end{bmatrix} = LU\)
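As a sanity check (a sketch), multiplying the factors back together should reproduce the correlation matrix:
# L %*% U should equal M up to floating-point error (dimname differences ignored)
all.equal(L %*% U, M, check.attributes = FALSE)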
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See this link).
ggplot(data = train_df, aes(x = LotArea / 1000)) +
geom_histogram(binwidth=5, color = palette[2], fill = palette[1]) +
scale_x_continuous(breaks = seq(0, 220, by = 10)) +
labs(x = "Lot Area in 1,000s", y = "Frequency",
title = "Distribution of Lot Area") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = palette[5]))
Lot Area is right-skewed, and its minimum value is already above zero, so no shift is needed; we will use that variable.
lot_area_exp_fit <- fitdistr(train_df$LotArea, densfun = "exponential")
Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))).
optimal_lambda <- lot_area_exp_fit$estimate
optimal_dist <- as.data.frame(rexp(1000, optimal_lambda))
cols <- "optimal_lot_area"
colnames(optimal_dist) <- cols
Plot a histogram and compare it with a histogram of your original variable.
ggplot(data = optimal_dist, aes(x = optimal_lot_area / 1000)) +
geom_histogram(binwidth=5, color = palette[2], fill = palette[1]) +
scale_x_continuous(breaks = seq(0, 220, by = 10)) +
labs(x = "Optimal Lot Area in 1,000s", y = "Frequency",
title = "Distribution of Optimal Lot Area") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = palette[5]))
Sampling from the fitted exponential changes the shape of the data considerably: the simulated values are concentrated near zero and decay smoothly, without the pronounced peak of the original Lot Area distribution or its extreme upper tail.
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
percentiles <- quantile(optimal_dist$optimal_lot_area, c(0.05, .95))
percentiles
## 5% 95%
## 624.5333 30905.3320
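The same percentiles can also be read directly from the fitted exponential's quantile function (the inverse of its CDF) rather than estimated from the 1,000 simulated draws (a sketch):
# Theoretical 5th and 95th percentiles of the fitted exponential
qexp(c(0.05, 0.95), rate = optimal_lambda)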
Also generate a 95% confidence interval from the empirical data, assuming normality.
mean_LotArea <- mean(train_df$LotArea)
sd_LotArea <- sd(train_df$LotArea)
se_LotArea <- sd_LotArea / sqrt(length(train_df$LotArea))
ci_actual_data <- c(mean_LotArea - 1.96 * se_LotArea, mean_LotArea + 1.96 * se_LotArea)
names(ci_actual_data) <- c("LL", "UL")
ci_actual_data
## LL UL
## 10004.83 11028.82
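As a cross-check (a sketch), essentially the same interval can be obtained with a t-based interval; with 1,460 observations the t and z critical values are nearly identical:
# 95% t-based confidence interval for the mean of LotArea
t.test(train_df$LotArea, conf.level = 0.95)$conf.int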
Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
percentiles2 <- quantile(train_df$LotArea, c(0.05, .95))
percentiles2
## 5% 95%
## 3311.70 17401.15
The empirical 5th and 95th percentiles (about 3,312 and 17,401) are much closer together than those of the fitted exponential (about 625 and 30,905), so the bulk of the actual Lot Area data is more concentrated than an exponential with the same mean would imply. At the same time, the distance between the maximum value and the 95th percentile is much larger in the empirical data than in the simulated data, so the empirical distribution has a heavier upper tail (it is more skewed) than the exponential fit.
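To attach a rough number to the skewness comparison, a simple moment-based helper can be used (a sketch; skew_fn is our own helper, not from any loaded package):
# Third standardized moment as a rough skewness measure
skew_fn <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(original = skew_fn(train_df$LotArea),
  simulated = skew_fn(optimal_dist$optimal_lot_area))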
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis.
Since there are a lot of variables, we first use our knowledge of what might influence sale prices most to reduce the number of variables under consideration. We convert the single character variable to factor.
cols <- c("LotArea", "Neighborhood", "OverallQual", "GrLivArea", "TotalBsmtSF",
"FullBath", "HalfBath", "BedroomAbvGr", "Fireplaces", "GarageCars",
"SalePrice")
train_df_model_sub <- train_df[, colnames(train_df) %in% cols]
train_df_model_sub$Neighborhood <- as.factor(train_df_model_sub$Neighborhood)
reorder <- c("SalePrice", "LotArea", "Neighborhood", "OverallQual", "GrLivArea",
"TotalBsmtSF", "FullBath", "HalfBath", "BedroomAbvGr", "Fireplaces",
"GarageCars")
train_df_model_sub <- train_df_model_sub[, reorder]
We look at pairwise scatterplots to see if there appear to be linear relationships between our dependent variable and any of our possible predictors.
pairs(train_df_model_sub, gap=0.5)
Overall quality, above ground square footage, and total basement square footage appear to have the strongest positive linear relationships with sale price, just eyeballing these.
We set our p-value threshold for predictor inclusion at p = 0.05 and perform backward elimination to decide what variables will come out of the model.
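For reference, an AIC-based automated alternative is sketched below (not run in this report; we proceed with manual p-value-based elimination instead):
# Automated backward elimination by AIC (sketch, not run)
# step(lm(SalePrice ~ ., data = train_df_model_sub), direction = "backward")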
train_df_model_sub_lm_full <- lm(SalePrice ~ Neighborhood + LotArea +
OverallQual + GrLivArea + TotalBsmtSF + FullBath + HalfBath + BedroomAbvGr +
Fireplaces + GarageCars, data = train_df_model_sub)
summary(train_df_model_sub_lm_full)
##
## Call:
## lm(formula = SalePrice ~ Neighborhood + LotArea + OverallQual +
## GrLivArea + TotalBsmtSF + FullBath + HalfBath + BedroomAbvGr +
## Fireplaces + GarageCars, data = train_df_model_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -413081 -14993 158 14206 268958
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.750e+04 1.132e+04 -4.195 2.89e-05 ***
## NeighborhoodBlueste -1.464e+04 2.593e+04 -0.565 0.572421
## NeighborhoodBrDale -1.234e+04 1.250e+04 -0.987 0.323961
## NeighborhoodBrkSide 9.207e+03 1.012e+04 0.910 0.362945
## NeighborhoodClearCr 2.253e+04 1.124e+04 2.005 0.045176 *
## NeighborhoodCollgCr 2.276e+04 9.043e+03 2.517 0.011952 *
## NeighborhoodCrawfor 2.971e+04 1.009e+04 2.945 0.003285 **
## NeighborhoodEdwards 2.686e+03 9.657e+03 0.278 0.780916
## NeighborhoodGilbert 1.280e+04 9.544e+03 1.341 0.179997
## NeighborhoodIDOTRR -4.914e+03 1.076e+04 -0.457 0.647972
## NeighborhoodMeadowV 5.853e+03 1.238e+04 0.473 0.636404
## NeighborhoodMitchel 1.064e+04 1.005e+04 1.059 0.289941
## NeighborhoodNAmes 9.396e+03 9.233e+03 1.018 0.309023
## NeighborhoodNoRidge 7.191e+04 1.035e+04 6.949 5.58e-12 ***
## NeighborhoodNPkVill -8.876e+03 1.438e+04 -0.617 0.537075
## NeighborhoodNridgHt 7.111e+04 9.388e+03 7.575 6.45e-14 ***
## NeighborhoodNWAmes 4.590e+03 9.599e+03 0.478 0.632628
## NeighborhoodOldTown -9.287e+03 9.558e+03 -0.972 0.331377
## NeighborhoodSawyer 1.251e+04 9.857e+03 1.269 0.204649
## NeighborhoodSawyerW 1.379e+04 9.737e+03 1.416 0.157001
## NeighborhoodSomerst 2.923e+04 9.321e+03 3.136 0.001748 **
## NeighborhoodStoneBr 7.317e+04 1.096e+04 6.674 3.54e-11 ***
## NeighborhoodSWISU -3.423e+03 1.158e+04 -0.295 0.767670
## NeighborhoodTimber 2.683e+04 1.036e+04 2.589 0.009723 **
## NeighborhoodVeenker 4.375e+04 1.354e+04 3.231 0.001261 **
## LotArea 5.017e-01 1.050e-01 4.780 1.94e-06 ***
## OverallQual 1.610e+04 1.160e+03 13.881 < 2e-16 ***
## GrLivArea 4.579e+01 3.709e+00 12.346 < 2e-16 ***
## TotalBsmtSF 2.133e+01 2.921e+00 7.303 4.67e-13 ***
## FullBath 3.253e+03 2.719e+03 1.196 0.231741
## HalfBath 2.741e+03 2.417e+03 1.134 0.256898
## BedroomAbvGr -5.594e+03 1.484e+03 -3.768 0.000171 ***
## Fireplaces 6.245e+03 1.768e+03 3.532 0.000426 ***
## GarageCars 1.247e+04 1.683e+03 7.412 2.13e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34510 on 1426 degrees of freedom
## Multiple R-squared: 0.8156, Adjusted R-squared: 0.8113
## F-statistic: 191.1 on 33 and 1426 DF, p-value: < 2.2e-16
Looking at the summary for our model, we see that the residuals have very long tails (minimum -413,081, maximum 268,958) even though their median is near 0. We also see that HalfBath has the highest p-value, so it is the first variable we will remove from the model.
train_df_model_sub_lm_update <- update(
train_df_model_sub_lm_full, .~. - HalfBath, data = train_df_model_sub)
summary(train_df_model_sub_lm_update)
##
## Call:
## lm(formula = SalePrice ~ Neighborhood + LotArea + OverallQual +
## GrLivArea + TotalBsmtSF + FullBath + BedroomAbvGr + Fireplaces +
## GarageCars, data = train_df_model_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -412639 -14765 80 14066 267678
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.699e+04 1.131e+04 -4.153 3.47e-05 ***
## NeighborhoodBlueste -1.328e+04 2.591e+04 -0.513 0.608334
## NeighborhoodBrDale -1.082e+04 1.243e+04 -0.870 0.384191
## NeighborhoodBrkSide 8.378e+03 1.009e+04 0.830 0.406563
## NeighborhoodClearCr 2.226e+04 1.124e+04 1.981 0.047800 *
## NeighborhoodCollgCr 2.315e+04 9.038e+03 2.562 0.010522 *
## NeighborhoodCrawfor 2.920e+04 1.008e+04 2.897 0.003829 **
## NeighborhoodEdwards 2.232e+03 9.649e+03 0.231 0.817092
## NeighborhoodGilbert 1.396e+04 9.489e+03 1.472 0.141352
## NeighborhoodIDOTRR -5.548e+03 1.075e+04 -0.516 0.605724
## NeighborhoodMeadowV 6.174e+03 1.238e+04 0.499 0.617941
## NeighborhoodMitchel 1.041e+04 1.005e+04 1.036 0.300423
## NeighborhoodNAmes 9.172e+03 9.231e+03 0.994 0.320601
## NeighborhoodNoRidge 7.239e+04 1.034e+04 6.999 3.94e-12 ***
## NeighborhoodNPkVill -7.565e+03 1.433e+04 -0.528 0.597695
## NeighborhoodNridgHt 7.141e+04 9.385e+03 7.610 4.97e-14 ***
## NeighborhoodNWAmes 5.245e+03 9.583e+03 0.547 0.584246
## NeighborhoodOldTown -1.022e+04 9.524e+03 -1.073 0.283508
## NeighborhoodSawyer 1.248e+04 9.858e+03 1.266 0.205722
## NeighborhoodSawyerW 1.411e+04 9.733e+03 1.450 0.147350
## NeighborhoodSomerst 3.009e+04 9.291e+03 3.239 0.001227 **
## NeighborhoodStoneBr 7.331e+04 1.096e+04 6.687 3.26e-11 ***
## NeighborhoodSWISU -4.323e+03 1.156e+04 -0.374 0.708434
## NeighborhoodTimber 2.706e+04 1.036e+04 2.612 0.009107 **
## NeighborhoodVeenker 4.474e+04 1.351e+04 3.311 0.000953 ***
## LotArea 4.948e-01 1.048e-01 4.722 2.57e-06 ***
## OverallQual 1.617e+04 1.159e+03 13.960 < 2e-16 ***
## GrLivArea 4.771e+01 3.300e+00 14.461 < 2e-16 ***
## TotalBsmtSF 2.020e+01 2.748e+00 7.353 3.26e-13 ***
## FullBath 2.146e+03 2.538e+03 0.846 0.397905
## BedroomAbvGr -5.571e+03 1.484e+03 -3.753 0.000182 ***
## Fireplaces 6.265e+03 1.768e+03 3.543 0.000409 ***
## GarageCars 1.252e+04 1.683e+03 7.438 1.75e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34510 on 1427 degrees of freedom
## Multiple R-squared: 0.8154, Adjusted R-squared: 0.8113
## F-statistic: 197 on 32 and 1427 DF, p-value: < 2.2e-16
Now the variable with the highest p-value is FullBath, so we remove that from the model as well.
train_df_model_sub_lm_update <- update(
train_df_model_sub_lm_update, .~. - FullBath, data = train_df_model_sub)
summary(train_df_model_sub_lm_update)
##
## Call:
## lm(formula = SalePrice ~ Neighborhood + LotArea + OverallQual +
## GrLivArea + TotalBsmtSF + BedroomAbvGr + Fireplaces + GarageCars,
## data = train_df_model_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -415192 -14946 181 13820 267765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.507e+04 1.108e+04 -4.067 5.03e-05 ***
## NeighborhoodBlueste -1.415e+04 2.588e+04 -0.547 0.584643
## NeighborhoodBrDale -1.216e+04 1.233e+04 -0.986 0.324238
## NeighborhoodBrkSide 6.868e+03 9.932e+03 0.692 0.489323
## NeighborhoodClearCr 2.126e+04 1.117e+04 1.903 0.057307 .
## NeighborhoodCollgCr 2.260e+04 9.013e+03 2.507 0.012283 *
## NeighborhoodCrawfor 2.793e+04 9.967e+03 2.802 0.005144 **
## NeighborhoodEdwards 1.074e+03 9.551e+03 0.112 0.910505
## NeighborhoodGilbert 1.372e+04 9.484e+03 1.447 0.148251
## NeighborhoodIDOTRR -7.086e+03 1.059e+04 -0.669 0.503530
## NeighborhoodMeadowV 4.904e+03 1.228e+04 0.399 0.689800
## NeighborhoodMitchel 9.467e+03 9.986e+03 0.948 0.343299
## NeighborhoodNAmes 7.715e+03 9.068e+03 0.851 0.395035
## NeighborhoodNoRidge 7.141e+04 1.028e+04 6.949 5.57e-12 ***
## NeighborhoodNPkVill -7.249e+03 1.432e+04 -0.506 0.612901
## NeighborhoodNridgHt 7.105e+04 9.374e+03 7.579 6.23e-14 ***
## NeighborhoodNWAmes 4.700e+03 9.560e+03 0.492 0.623079
## NeighborhoodOldTown -1.159e+04 9.383e+03 -1.235 0.216949
## NeighborhoodSawyer 1.110e+04 9.720e+03 1.142 0.253761
## NeighborhoodSawyerW 1.345e+04 9.701e+03 1.386 0.165946
## NeighborhoodSomerst 2.991e+04 9.287e+03 3.221 0.001307 **
## NeighborhoodStoneBr 7.280e+04 1.095e+04 6.651 4.13e-11 ***
## NeighborhoodSWISU -5.494e+03 1.147e+04 -0.479 0.632124
## NeighborhoodTimber 2.658e+04 1.035e+04 2.569 0.010288 *
## NeighborhoodVeenker 4.353e+04 1.344e+04 3.240 0.001223 **
## LotArea 4.934e-01 1.048e-01 4.709 2.73e-06 ***
## OverallQual 1.622e+04 1.157e+03 14.015 < 2e-16 ***
## GrLivArea 4.871e+01 3.082e+00 15.802 < 2e-16 ***
## TotalBsmtSF 2.016e+01 2.747e+00 7.338 3.62e-13 ***
## BedroomAbvGr -5.364e+03 1.464e+03 -3.664 0.000257 ***
## Fireplaces 6.137e+03 1.762e+03 3.484 0.000510 ***
## GarageCars 1.259e+04 1.680e+03 7.494 1.16e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34510 on 1428 degrees of freedom
## Multiple R-squared: 0.8153, Adjusted R-squared: 0.8113
## F-statistic: 203.4 on 31 and 1428 DF, p-value: < 2.2e-16
Our Adjusted R-squared value is 0.8113, which is high and indicates these predictor variables explain about 81.13% of the variation in our dependent variable.
par(mfrow=c(2,2))
plot(train_df_model_sub_lm_update)
We can see from the plot of residuals vs. fitted values that the residuals are not evenly scattered around 0, and the Q-Q plot shows that they deviate from the normal reference line, especially in the tails. We apply a Box-Cox transformation to the response to address this.
lm_bc <- MASS::boxcox(train_df_model_sub_lm_update)
print(lm_lambda <- lm_bc$x[which.max(lm_bc$y)])
## [1] 0.1414141
train_df_model_sub_lm_new <- lm(((SalePrice^lm_lambda-1)/lm_lambda) ~ Neighborhood +
LotArea + OverallQual + GrLivArea + TotalBsmtSF + BedroomAbvGr + Fireplaces +
GarageCars, data = train_df_model_sub)
summary(train_df_model_sub_lm_new)
##
## Call:
## lm(formula = ((SalePrice^lm_lambda - 1)/lm_lambda) ~ Neighborhood +
## LotArea + OverallQual + GrLivArea + TotalBsmtSF + BedroomAbvGr +
## Fireplaces + GarageCars, data = train_df_model_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0318 -0.3880 0.0343 0.4699 2.8428
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.524e+01 2.749e-01 91.822 < 2e-16 ***
## NeighborhoodBlueste -7.116e-01 6.421e-01 -1.108 0.267923
## NeighborhoodBrDale -1.099e+00 3.059e-01 -3.592 0.000340 ***
## NeighborhoodBrkSide -3.737e-01 2.464e-01 -1.517 0.129507
## NeighborhoodClearCr 6.881e-01 2.772e-01 2.482 0.013170 *
## NeighborhoodCollgCr 5.203e-01 2.236e-01 2.327 0.020098 *
## NeighborhoodCrawfor 6.371e-01 2.473e-01 2.577 0.010080 *
## NeighborhoodEdwards -4.472e-01 2.369e-01 -1.887 0.059321 .
## NeighborhoodGilbert 3.233e-01 2.353e-01 1.374 0.169647
## NeighborhoodIDOTRR -1.210e+00 2.627e-01 -4.608 4.44e-06 ***
## NeighborhoodMeadowV -7.495e-01 3.047e-01 -2.460 0.014026 *
## NeighborhoodMitchel 9.684e-02 2.477e-01 0.391 0.695913
## NeighborhoodNAmes -2.163e-02 2.250e-01 -0.096 0.923420
## NeighborhoodNoRidge 1.059e+00 2.549e-01 4.154 3.46e-05 ***
## NeighborhoodNPkVill -4.656e-01 3.554e-01 -1.310 0.190285
## NeighborhoodNridgHt 1.171e+00 2.325e-01 5.037 5.33e-07 ***
## NeighborhoodNWAmes 6.583e-02 2.372e-01 0.278 0.781375
## NeighborhoodOldTown -8.480e-01 2.328e-01 -3.643 0.000279 ***
## NeighborhoodSawyer 2.570e-02 2.411e-01 0.107 0.915130
## NeighborhoodSawyerW 2.127e-01 2.406e-01 0.884 0.376988
## NeighborhoodSomerst 6.796e-01 2.304e-01 2.950 0.003234 **
## NeighborhoodStoneBr 1.275e+00 2.715e-01 4.695 2.93e-06 ***
## NeighborhoodSWISU -4.167e-01 2.846e-01 -1.464 0.143473
## NeighborhoodTimber 5.540e-01 2.567e-01 2.158 0.031072 *
## NeighborhoodVeenker 1.069e+00 3.333e-01 3.208 0.001369 **
## LotArea 9.197e-06 2.599e-06 3.538 0.000416 ***
## OverallQual 5.149e-01 2.871e-02 17.937 < 2e-16 ***
## GrLivArea 1.086e-03 7.647e-05 14.209 < 2e-16 ***
## TotalBsmtSF 5.263e-04 6.815e-05 7.723 2.13e-14 ***
## BedroomAbvGr -3.442e-03 3.632e-02 -0.095 0.924502
## Fireplaces 2.394e-01 4.371e-02 5.477 5.11e-08 ***
## GarageCars 4.363e-01 4.168e-02 10.469 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.856 on 1428 degrees of freedom
## Multiple R-squared: 0.852, Adjusted R-squared: 0.8488
## F-statistic: 265.3 on 31 and 1428 DF, p-value: < 2.2e-16
The Box-Cox transformation increased the adjusted R-squared to 0.8488, and the residuals remain centered near 0 on the transformed scale. BedroomAbvGr is no longer statistically significant in the new model, so we will remove it.
train_df_model_sub_lm_new <- update(
train_df_model_sub_lm_new, .~. - BedroomAbvGr, data = train_df_model_sub)
summary(train_df_model_sub_lm_new)
##
## Call:
## lm(formula = ((SalePrice^lm_lambda - 1)/lm_lambda) ~ Neighborhood +
## LotArea + OverallQual + GrLivArea + TotalBsmtSF + Fireplaces +
## GarageCars, data = train_df_model_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0216 -0.3872 0.0348 0.4714 2.8448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.524e+01 2.713e-01 93.014 < 2e-16 ***
## NeighborhoodBlueste -7.132e-01 6.416e-01 -1.111 0.266543
## NeighborhoodBrDale -1.100e+00 3.052e-01 -3.606 0.000322 ***
## NeighborhoodBrkSide -3.759e-01 2.453e-01 -1.532 0.125628
## NeighborhoodClearCr 6.863e-01 2.764e-01 2.483 0.013160 *
## NeighborhoodCollgCr 5.177e-01 2.218e-01 2.334 0.019746 *
## NeighborhoodCrawfor 6.350e-01 2.462e-01 2.579 0.010013 *
## NeighborhoodEdwards -4.497e-01 2.353e-01 -1.911 0.056142 .
## NeighborhoodGilbert 3.204e-01 2.332e-01 1.374 0.169713
## NeighborhoodIDOTRR -1.212e+00 2.619e-01 -4.629 4.01e-06 ***
## NeighborhoodMeadowV -7.512e-01 3.041e-01 -2.471 0.013606 *
## NeighborhoodMitchel 9.404e-02 2.459e-01 0.382 0.702172
## NeighborhoodNAmes -2.489e-02 2.222e-01 -0.112 0.910861
## NeighborhoodNoRidge 1.057e+00 2.541e-01 4.160 3.38e-05 ***
## NeighborhoodNPkVill -4.681e-01 3.543e-01 -1.321 0.186588
## NeighborhoodNridgHt 1.170e+00 2.318e-01 5.045 5.11e-07 ***
## NeighborhoodNWAmes 6.244e-02 2.344e-01 0.266 0.789950
## NeighborhoodOldTown -8.498e-01 2.319e-01 -3.665 0.000257 ***
## NeighborhoodSawyer 2.225e-02 2.383e-01 0.093 0.925620
## NeighborhoodSawyerW 2.103e-01 2.392e-01 0.879 0.379580
## NeighborhoodSomerst 6.777e-01 2.295e-01 2.953 0.003198 **
## NeighborhoodStoneBr 1.275e+00 2.714e-01 4.696 2.91e-06 ***
## NeighborhoodSWISU -4.205e-01 2.817e-01 -1.493 0.135684
## NeighborhoodTimber 5.514e-01 2.552e-01 2.161 0.030869 *
## NeighborhoodVeenker 1.068e+00 3.331e-01 3.207 0.001370 **
## LotArea 9.198e-06 2.598e-06 3.540 0.000413 ***
## OverallQual 5.151e-01 2.860e-02 18.013 < 2e-16 ***
## GrLivArea 1.082e-03 6.260e-05 17.289 < 2e-16 ***
## TotalBsmtSF 5.271e-04 6.752e-05 7.808 1.12e-14 ***
## Fireplaces 2.399e-01 4.333e-02 5.537 3.65e-08 ***
## GarageCars 4.366e-01 4.160e-02 10.494 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8557 on 1429 degrees of freedom
## Multiple R-squared: 0.852, Adjusted R-squared: 0.8489
## F-statistic: 274.3 on 30 and 1429 DF, p-value: < 2.2e-16
We plot the residuals vs. the fitted values and look at a qq-plot again.
par(mfrow=c(2,2))
plot(train_df_model_sub_lm_new)
The residuals follow the normal distribution line much more closely now, though the lower tail is still problematic.
Finally, we generate predictions on the Kaggle test set, which the model never saw during training, and back-transform them to the original sale-price scale.
my_url2 <- "https://raw.githubusercontent.com/geedoubledee/data605_final_project/main/test.csv"
test_df <- read.csv(my_url2)
test_df$Neighborhood <- as.factor(test_df$Neighborhood)
test_df$SalePrice <- predict(train_df_model_sub_lm_new, test_df)
test_df$SalePrice <- inv_boxcox(test_df$SalePrice, lm_lambda)
submission_cols <- c("Id", "SalePrice")
submission <- test_df[, colnames(test_df) %in% submission_cols]
write.csv(submission, "submission.csv", row.names = FALSE)
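For reference, the inverse Box-Cox transform applied above is simply \(x = (\lambda y + 1)^{1/\lambda}\); a minimal manual sketch is shown below (inv_bc_manual is our own helper, assuming this is what inv_boxcox computes for a nonzero \(\lambda\)):
# Manual inverse of the Box-Cox transform y = (x^lambda - 1) / lambda
inv_bc_manual <- function(y, lambda) (lambda * y + 1)^(1 / lambda)
# e.g., inv_bc_manual(predict(train_df_model_sub_lm_new, test_df), lm_lambda)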
Report your Kaggle.com user name and score. Provide a screen snapshot of your score with your name identifiable.
Username: https://www.kaggle.com/glendavis
Score: 0.43507
Screenshot: