library(tidyverse)
library(scales)
library(GGally)
library(knitr)
library(matrixcalc)
library(MASS)
library(ldsr)
library(EnvStats)
set.seed(87)
palette <- c("#49c6e5", "#255f85", "#fae1df", "#ff3864", "#130303")
# Helper: render a numeric matrix as a LaTeX bmatrix string
to_LaTeX <- function(A){
rows <- apply(A, MARGIN=1, paste, collapse = " & ")
matrix_string <- paste(rows, collapse = " \\\\ ")
return(paste("\\begin{bmatrix}", matrix_string, "\\end{bmatrix}"))
}
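For example, the call below renders a \(2 \times 2\) identity matrix as a bmatrix string; we use this helper to typeset the matrices later in the report.
# Example: typeset a 2 x 2 identity matrix as a LaTeX bmatrix string
to_LaTeX(diag(2))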
Using R, generate a random variable \(X\) that has \(10,000\) random Gamma pdf values. A Gamma pdf is completely described by \(n\) (a size parameter) and lambda (\(\lambda\), a shape parameter). Choose any \(n\) greater than \(3\) and an expected value (\(\lambda\)) between \(2\) and \(10\) (you choose).
alpha <- 10
lambda <- 6
X <- rgamma(n = 10000, shape = alpha, rate = lambda)
Then generate \(10,000\) observations from the sum of \(n\) exponential pdfs with rate/shape parameter (\(\lambda\)). The \(n\) and \(\lambda\) must be the same as in the previous case. (e.g., mysum = rexp(10000, \(\lambda\)) + rexp(10000, \(\lambda\)) + …)
Y <- rexp(n = 10000, rate = lambda) + rexp(n = 10000, rate = lambda) +
    rexp(n = 10000, rate = lambda) + rexp(n = 10000, rate = lambda) +
    rexp(n = 10000, rate = lambda) + rexp(n = 10000, rate = lambda) +
    rexp(n = 10000, rate = lambda) + rexp(n = 10000, rate = lambda) +
    rexp(n = 10000, rate = lambda) + rexp(n = 10000, rate = lambda)
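An equivalent, more compact construction is sketched below (not run here; Y_alt is our own helper name, and because the random draws occur in a different order it would not reproduce the exact empirical values reported later, only the same distribution).
# Sum alpha = 10 independent Exp(lambda) draws per observation
# Y_alt <- rowSums(replicate(alpha, rexp(10000, rate = lambda)))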
Then generate \(10,000\) observations from a single exponential pdf with rate/shape parameter (\(\lambda\)).
Z <- rexp(n = 10000, rate = lambda)
The Gamma distribution is quite common in data science. For example, it is used to model failures for multiple processes when each of those processes has the same failure rate. The exponential is used for constant failure rates, service times, etc.
(5 Points). Calculate the empirical expected values (means) and variances of all three pdfs.
mean_X <- mean(X)
mean_Y <- mean(Y)
mean_Z <- mean(Z)
var_X <- var(X)
var_Y <- var(Y)
var_Z <- var(Z)
Mean(X): \(1.6767234\), Var(X): \(0.2879548\)
Mean(Y): \(1.6579397\), Var(Y): \(0.2734734\)
Mean(Z): \(0.1616728\), Var(Z): \(0.0252837\)
Using calculus, calculate the expected value and variance of the Gamma pdf (X).
\(M_X(t) = (1 - \frac{t}{\lambda})^{-\alpha}\)
\(u = (1 - \frac{t}{\lambda})\)
\(M^{\prime}_X(t) = (-\alpha)(u)^{-\alpha-1}(-\frac{1}{\lambda})\)
\(M^{\prime}_X(t) = \frac{\alpha}{\lambda}(u)^{-\alpha-1}\)
\(M^{\prime}_X(t) = \frac{\alpha}{\lambda}(1 - \frac{t}{\lambda})^{-\alpha-1}\)
\(M^{\prime}_X(0) = \frac{\alpha}{\lambda}(1 - \frac{0}{\lambda})^{-\alpha-1} = \frac{\alpha}{\lambda}\)
Mean(X) = \(\frac{\alpha}{\lambda} = \frac{10}{6} \approx 1.67\)
\(M^{\prime\prime}_X(t) = \frac{\alpha}{\lambda}(-\alpha-1)(u)^{-\alpha-2}(-\frac{1}{\lambda}) = \frac{\alpha(\alpha+1)}{\lambda^2}(1 - \frac{t}{\lambda})^{-\alpha-2}\)
\(M^{\prime\prime}_X(0) = \frac{\alpha(\alpha+1)}{\lambda^2}(1 - \frac{0}{\lambda})^{-\alpha-2} = \frac{\alpha(\alpha+1)}{\lambda^2}\)
Var(X) = \(M^{\prime\prime}_X(0) - [M^{\prime}_X(0)]^2 = \frac{\alpha(\alpha+1)}{\lambda^2} - \frac{\alpha^2}{\lambda^2} = \frac{\alpha}{\lambda^2} = \frac{10}{36} \approx 0.28\)
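As a quick sanity check (a small sketch using the simulated values from above), the closed-form Gamma moments can be compared directly to the empirical ones:
# Theoretical vs. empirical mean and variance of X
c(theoretical_mean_X = alpha / lambda, empirical_mean_X = mean(X))
c(theoretical_var_X = alpha / lambda^2, empirical_var_X = var(X))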
Using the moment generating function for exponentials, calculate the expected value of the single exponential (Z) and the sum of exponentials (Y).
\(M_Z(t) = \frac{\lambda}{\lambda - t}\)
\(u = \lambda - t\)
\(u^\prime = -1\)
\(M_Z(t) = \lambda\frac{1}{u}\)
\(M_Z^{\prime}(t) = -\lambda\frac{u^\prime}{u^2} = -\lambda\frac{(-1)}{(\lambda-t)^2} = \frac{\lambda}{(\lambda-t)^2}\)
\(M_Z^{\prime}(0) = \frac{\lambda}{(\lambda-0)^2} = \frac{\lambda}{\lambda^2} = \frac{1}{\lambda}\)
Mean(Z) = \(\frac{1}{\lambda} = \frac{1}{6} \approx 0.167\)
\(M_Y(t) = (\frac{\lambda}{\lambda - t})^\alpha\)
\(u = \frac{\lambda}{\lambda - t}\)
\(u^\prime = \frac{\lambda}{(\lambda-t)^2}\)
\(M_Y^{\prime}(t) = \alpha(\frac{\lambda}{\lambda - t})^{\alpha-1}\times \frac{\lambda}{(\lambda-t)^2} = \frac{\alpha\lambda(\frac{\lambda}{\lambda - t})^{\alpha-1}}{(\lambda-t)^2} = \frac{\alpha(\frac{\lambda}{\lambda - t})^{\alpha}}{\lambda-t}\)
\(M_Y^{\prime}(0) = \frac{\alpha(\frac{\lambda}{\lambda - 0})^{\alpha}}{\lambda - 0} = \frac{\alpha}{\lambda}\)
Mean(Y) = \(\frac{\alpha}{\lambda} = \frac{10}{6} \approx 1.67\)
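A similar quick check (a sketch) compares these two closed-form means with the simulations:
# Theoretical vs. empirical means of Z and Y
c(theoretical_mean_Z = 1 / lambda, empirical_mean_Z = mean(Z))
c(theoretical_mean_Y = alpha / lambda, empirical_mean_Y = mean(Y))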
Probability. For pdf Z (the exponential), calculate empirically probabilities a through c (below). Then evaluate through calculus whether the memoryless property holds.
Note: In the probability statements below, we take \(\lambda\) to mean the value we defined at the beginning of Problem 1 (\(\lambda = 6\)); when we evaluate the probabilities numerically, we substitute its reciprocal, the mean \(\frac{1}{\lambda} = \frac{1}{6}\). With a \(\lambda\) of 6, we can imagine a machine that fails on average 6 times per year, so the expected waiting time to a failure is \(\frac{1}{6}\) of a year. That waiting time is the mean of the distribution; the failure rate itself is \(\lambda\).
a. \(P(Z > \lambda | Z > \frac{\lambda}{2})\)
We take this to mean: what is the probability that we have more than 6 failures this year given that we have already had 3 failures?
\(P(Z \leq \lambda | Z > \frac{\lambda}{2}) = \frac{P(Z \leq \lambda \cap Z > \frac{\lambda}{2})}{P(Z > \frac{\lambda}{2})} = \frac{P(\frac{\lambda}{2} < Z \leq \lambda)}{P(Z > \frac{\lambda}{2})} = \frac{P(\frac{6}{2} < Z \leq 6)}{P(Z > \frac{6}{2})} = \frac{P(3 < Z \leq 6)}{P(Z > 3)}\)
\(= \frac{e^{-3\lambda} - e^{-6\lambda}}{e^{-3\lambda}} = 1 - e^{-3\lambda}\)
\(P(Z > \lambda | Z > \frac{\lambda}{2}) = 1 - P(Z \leq \lambda | Z > \frac{\lambda}{2}) = 1 - (1 - e^{-3\lambda}) = e^{-3\lambda} = e^{-\frac{3}{6}} \approx 0.607\)
The probability that we will have more than 6 failures given that we have already had 3 failures is the same as the probability that we will have more than 3 failures: \(0.607\).
b. \(P(Z > 2\lambda | Z > \lambda)\)
We take this to mean: what is the probability that we have more than 12 failures this year given that we have already had 6 failures?
\(P(Z \leq 2\lambda | Z > \lambda) = \frac{P(Z \leq 2\lambda \cap Z > \lambda)}{P(Z > \lambda)} = \frac{P(\lambda < Z \leq 2\lambda)}{P(Z > \lambda)} = \frac{P(6 < Z \leq 2(6))}{P(Z > 6)} = \frac{P(6 < Z \leq 12)}{P(Z > 6)}\)
\(= \frac{e^{-6\lambda} - e^{-12\lambda}}{e^{-6\lambda}} = 1 - e^{-6\lambda}\)
\(P(Z > 2\lambda | Z > \lambda) = 1 - (1 - e^{-6\lambda}) = e^{-6\lambda} = e^{-\frac{6}{6}} \approx 0.368\)
The probability that we will have more than 12 failures given that we have already had 6 failures is the same as the probability that we will have more than 6 failures: \(0.368\).
c. \(P(Z > 3\lambda | Z > \lambda)\)
We take this to mean: what is the probability that we have more than 18 failures this year given that we have already had 6 failures?
\(P(Z \leq 3\lambda | Z > \lambda) = \frac{P(Z \leq 3\lambda \cap Z > \lambda)}{P(Z > \lambda)} = \frac{P(\lambda < Z \leq 3\lambda)}{P(Z > \lambda)} = \frac{P(6 < Z \leq 3(6))}{P(Z > 6)} = \frac{P(6 < Z \leq 18)}{P(Z > 6)}\)
\(= \frac{e^{-6\lambda} - e^{-18\lambda}}{e^{-6\lambda}} = 1 - e^{-12\lambda}\)
\(P(Z > 3\lambda | Z > \lambda) = 1 - (1 - e^{-12\lambda}) = e^{-12\lambda} = e^{-\frac{12}{6}} \approx 0.135\)
The probability that we will have more than 18 failures given that we have already had 6 failures is the same as the probability that we will have more than 12 failures: \(0.135\).
In every case, the conditional probability equals the unconditional probability of surviving the same additional time, so the memoryless property holds.
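As an empirical spot-check of part (b) using the simulated Z (a sketch; mu is a helper name we introduce for the mean \(\frac{1}{\lambda}\)), the conditional and unconditional probabilities should both be close to \(e^{-1}\):
# P(Z > 2*mu | Z > mu) vs. P(Z > mu), both approximately exp(-1)
mu <- 1 / lambda
c(conditional = sum(Z > 2 * mu) / sum(Z > mu),
  unconditional = mean(Z > mu),
  theoretical = exp(-1))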
(5 Points). Loosely investigate whether \(P(YZ) = P(Y) P(Z)\) by building a table with quartiles and evaluating the marginal and joint probabilities.
q <- c(0.25, 0.25, 0.25, .25)
qY <- matrix(q, nrow = 4, ncol = 4, byrow = FALSE)
qZ <- matrix(q, nrow = 4, ncol = 4, byrow = TRUE)
A <- as.data.frame(matrix(qY * qZ, nrow=4, ncol=4))
c <- c("1st Quartile Y", "2nd Quartile Y", "3rd Quartile Y", "4th Quartile Y")
colnames(A) <- c
A <- A %>%
mutate(Sum = rowSums(.[1:ncol(A)]))
A <- rbind(A, colSums(A))
r <- c("1st Quartile Z", "2nd Quartile Z", "3rd Quartile Z", "4th Quartile Z",
"Sum")
rownames(A) <- r
kable(A, format="simple")
|  | 1st Quartile Y | 2nd Quartile Y | 3rd Quartile Y | 4th Quartile Y | Sum |
|---|---|---|---|---|---|
| 1st Quartile Z | 0.0625 | 0.0625 | 0.0625 | 0.0625 | 0.25 |
| 2nd Quartile Z | 0.0625 | 0.0625 | 0.0625 | 0.0625 | 0.25 |
| 3rd Quartile Z | 0.0625 | 0.0625 | 0.0625 | 0.0625 | 0.25 |
| 4th Quartile Z | 0.0625 | 0.0625 | 0.0625 | 0.0625 | 0.25 |
| Sum | 0.2500 | 0.2500 | 0.2500 | 0.2500 | 1.00 |
The joint probability of observing a value in any given quartile of the Y distribution and a value in any given quartile of the Z distribution is \(P(Y, Z) = P(Y | Z) \times P(Z) = 0.25 \times 0.25 = 0.0625\).
The marginal probability of observing a value in any given quartile of the Y distribution is \(0.25\), regardless of Z; similarly, the marginal probability of observing a value in any given quartile of the Z distribution is \(0.25\), regardless of Y.
\(P(YZ) = P(Y) P(Z)\)
(5 Points). Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Our understanding is that neither test is designed for continuous numerical data, so we binned the numerical variables into factors based on their quartiles. This practice may be discouraged based on what we've read, but it seemed like the best option.
df <- as.data.frame(cbind(Y, Z))
df <- df %>%
mutate(Y = ntile(.$Y, 4), Z = ntile(.$Z, 4))
cols <- c("YQuantile", "ZQuantile")
colnames(df) <- cols
df$YQuantile <- as.factor(df$YQuantile)
df$ZQuantile <- as.factor(df$ZQuantile)
contingency <- xtabs(~ ZQuantile + YQuantile, data = df)
contingency
## YQuantile
## ZQuantile 1 2 3 4
## 1 636 637 617 610
## 2 616 624 621 639
## 3 623 617 644 616
## 4 625 622 618 635
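For comparison with the theoretical quartile table above, the empirical joint proportions can be computed from this contingency table (a sketch; each cell should be close to \(0.0625\)):
# Empirical joint quartile proportions
round(prop.table(contingency), 4)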
Fisher's Exact Test computes an exact p-value and is typically reserved for small samples or tables with small expected cell counts, while the Chi-Square test relies on a large-sample approximation. Because we have a large sample (10,000 observations), the Chi-Square test is more appropriate, so we perform it first. For both tests, the null hypothesis is that the variables are independent; the alternative hypothesis is that they are not independent.
test <- chisq.test(contingency)
test
##
## Pearson's Chi-squared test
##
## data: contingency
## X-squared = 2.4256, df = 9, p-value = 0.9828
The p-value for the Chi-Square test is very large: 0.9828. So we fail to reject the null hypothesis that the variables are independent.
Next we perform the Fisher’s Exact test.
test2 <- fisher.test(contingency, simulate.p.value = TRUE)
test2
##
## Fisher's Exact Test for Count Data with simulated p-value (based on
## 2000 replicates)
##
## data: contingency
## p-value = 0.981
## alternative hypothesis: two.sided
The p-value for Fisher's Exact Test is 0.981, almost the same very large value. So we again fail to reject the null hypothesis that the variables are independent.
Provide univariate descriptive statistics and appropriate plots for the training data set.
Since there are so many variables, we first reduce them to only the ones we are interested in describing and visualizing. We print a summary of those.
my_url <- "https://raw.githubusercontent.com/geedoubledee/data605_final_project/main/train.csv"
train_df <- read.csv(my_url)
cols <- c("YearBuilt", "BedroomAbvGr", "GrLivArea", "FullBath", "BsmtFullBath",
"SalePrice", "OverallQual", "OverallCond")
train_df_sub <- train_df[, colnames(train_df) %in% cols]
train_df_sub <- train_df_sub %>%
mutate(AllFullBath = BsmtFullBath + FullBath)
summary(train_df_sub)
## OverallQual OverallCond YearBuilt GrLivArea
## Min. : 1.000 Min. :1.000 Min. :1872 Min. : 334
## 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1130
## Median : 6.000 Median :5.000 Median :1973 Median :1464
## Mean : 6.099 Mean :5.575 Mean :1971 Mean :1515
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:1777
## Max. :10.000 Max. :9.000 Max. :2010 Max. :5642
## BsmtFullBath FullBath BedroomAbvGr SalePrice
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. : 34900
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:129975
## Median :0.0000 Median :2.000 Median :3.000 Median :163000
## Mean :0.4253 Mean :1.565 Mean :2.866 Mean :180921
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.:214000
## Max. :3.0000 Max. :3.000 Max. :8.000 Max. :755000
## AllFullBath
## Min. :0.00
## 1st Qu.:1.00
## Median :2.00
## Mean :1.99
## 3rd Qu.:2.00
## Max. :6.00
There are 1,460 observations in the training dataset. Each observation is a sold home. The median sale price for a home in this dataset is $163,000.00. Below, we plot the distribution of sale price in thousands of dollars.
ggplot(data = train_df_sub, aes(x = SalePrice / 1000)) +
geom_histogram(binwidth=25, color = palette[2], fill = palette[1]) +
scale_x_continuous(breaks = seq(25, 775, by = 50)) +
scale_y_continuous(breaks = seq(0, 275, by = 25)) +
labs(x = "Sale Price in Thousands of $", y = "Frequency",
title = "Distribution of Sale Price in Thousands of $") +
geom_vline(aes(xintercept = mean(SalePrice / 1000)), color = palette[4],
linewidth = 0.5) +
annotate("text", x = mean(train_df_sub$SalePrice / 1000) + 35,
y=275, label="Mean", color = palette[4]) +
geom_vline(aes(xintercept = mean(SalePrice / 1000) + 3 * sd(SalePrice / 1000)),
color = palette[4], linewidth = 0.5, linetype = "dashed") +
annotate("text", x = mean(
train_df_sub$SalePrice / 1000) + 3 * sd(train_df_sub$SalePrice / 1000) +
35, y=275, label="+3 SD", color = palette[4]) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = palette[5]))
The distribution is unimodal and right-skewed. Several homes that are significantly more expensive than the median sale price are pulling the mean up. Homes near a sale price of $425,000.00 and above are outliers, as they are more than 3 standard deviations above the mean.
Next we look at the distribution of above ground living area, measured in square feet.
ggplot(data = train_df_sub, aes(x = GrLivArea)) +
geom_histogram(binwidth=250, color = palette[2], fill = palette[1]) +
scale_x_continuous(breaks = seq(0, 6000, by = 500)) +
scale_y_continuous(breaks = seq(0, 300, by = 25)) +
labs(x = "Above Ground Living Area (Sq. Ft.)", y = "Frequency",
title = "Distribution of Above Ground Living Area (Sq. Ft.)") +
geom_vline(aes(xintercept = mean(GrLivArea)), color = palette[4],
linewidth = 0.5) +
annotate("text", x = mean(train_df_sub$GrLivArea) + 350, y=275,
label="Mean", color = palette[4]) +
geom_vline(aes(xintercept = mean(GrLivArea) + 3 * sd(GrLivArea)),
color = palette[4], linewidth = 0.5, linetype = "dashed") +
annotate("text", x = mean(train_df_sub$GrLivArea) + 3 *
sd(train_df_sub$GrLivArea) + 325,
y=275, label="+3 SD", color = palette[4]) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = palette[5]))
The distribution is unimodal and right-skewed. The median above ground living area is 1,464 sq. ft. Some homes with significantly larger amounts of above ground living area are pulling the mean up. Homes with nearly 3,250 sq. ft. of above ground living area and above are outliers, as they are more than 3 standard deviations above the mean.
Next we look at the distribution of bedrooms. Basement-level bedrooms are not included.
ggplot(data = train_df_sub, aes(x = BedroomAbvGr)) +
geom_histogram(binwidth=1, color = palette[2], fill = palette[1]) +
scale_x_continuous(breaks = seq(0, 8, by = 1)) +
scale_y_continuous(breaks = seq(0, 800, by = 100)) +
labs(x = "Bedrooms (Above Ground Only)", y = "Frequency",
title = "Distribution of Bedrooms (Above Ground Only)") +
geom_vline(aes(xintercept = mean(BedroomAbvGr)), color = palette[4],
linewidth = 0.5) +
annotate("text", x = mean(train_df_sub$BedroomAbvGr) + 1.1, y=800,
label="Mean", color = palette[4]) +
geom_vline(aes(xintercept = mean(BedroomAbvGr) - 3 * sd(BedroomAbvGr)),
color = palette[4], linewidth = 0.5, linetype = "dashed") +
annotate("text", x = mean(train_df_sub$BedroomAbvGr) - 3 *
sd(train_df_sub$BedroomAbvGr) + 0.5,
y=800, label="-3 SD", color = palette[4]) +
geom_vline(aes(xintercept = mean(BedroomAbvGr) + 3 * sd(BedroomAbvGr)),
color = palette[4], linewidth = 0.5, linetype = "dashed") +
annotate("text", x = mean(train_df_sub$BedroomAbvGr) + 3 *
sd(train_df_sub$BedroomAbvGr) + 0.5,
y=800, label="+3 SD", color = palette[4]) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = palette[5]))
The distribution is unimodal and nearly normal. The mean number of bedrooms is 2.9. There are some homes with 0 or 6+ bedrooms above ground, which are outliers since they are more than 3 standard deviations away from the mean.
Lastly we look at the distribution of full bathrooms, including basement-level full bathrooms.
ggplot(data = train_df_sub, aes(x = AllFullBath)) +
geom_histogram(binwidth=1, color = palette[2], fill = palette[1]) +
scale_x_continuous(breaks = seq(0, 6, by = 1)) +
scale_y_continuous(breaks = seq(0, 700, by = 50)) +
labs(x = "All Full Bathrooms", y = "Frequency",
title = "Distribution of All Full Bathrooms") +
geom_vline(aes(xintercept = mean(AllFullBath)), color = palette[4],
linewidth = 0.5) +
annotate("text", x = mean(train_df_sub$AllFullBath) + 1.1, y=700,
label="Mean", color = palette[4]) +
geom_vline(aes(xintercept = mean(AllFullBath) + 3 * sd(AllFullBath)),
color = palette[4], linewidth = 0.5, linetype = "dashed") +
annotate("text", x = mean(train_df_sub$AllFullBath) + 3 *
sd(train_df_sub$AllFullBath) + 0.5,
y=700, label="+3 SD", color = palette[4]) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = palette[5]))
The distribution is unimodal and nearly normal. The mean number of bathrooms is 2.0. There is only one outlier that is more than 3 standard deviations above the mean, and it is a home with 6 baths.
Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
reorder <- c("SalePrice", "OverallQual", "OverallCond", "YearBuilt", "GrLivArea", "BedroomAbvGr", "BsmtFullBath", "FullBath", "AllFullBath")
train_df_sub <- train_df_sub[, reorder]
ggpairs(train_df_sub, columns=c(1, 2, 5))
Derive a correlation matrix for any three quantitative variables in the dataset.
cols <- c("GrLivArea", "BedroomAbvGr", "FullBath")
X <- as.matrix(train_df_sub[, colnames(train_df_sub) %in% cols])
M <- cor(X)
M
## GrLivArea BedroomAbvGr FullBath
## GrLivArea 1.0000000 0.5212695 0.6300116
## BedroomAbvGr 0.5212695 1.0000000 0.3632520
## FullBath 0.6300116 0.3632520 1.0000000
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.
\(\alpha = 0.2\)
Null Hypothesis 1: \(H_0: \rho = 0\)
Alternate Hypothesis 1: \(H_a: \rho \neq 0\)
cor.test(X[, "GrLivArea"], X[, "BedroomAbvGr"], conf.level=0.8)
##
## Pearson's product-moment correlation
##
## data: X[, "GrLivArea"] and X[, "BedroomAbvGr"]
## t = 23.323, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4963921 0.5452915
## sample estimates:
## cor
## 0.5212695
We are \(80\%\) confident that \(\rho\), the population correlation coefficient for “GrLivArea” and “BedroomAbvGr,” is between \(0.496\) and \(0.545\). We reject null hypothesis 1 in favor of the alternate hypothesis 1.
Null Hypothesis 2: \(H_0: \rho = 0\)
Alternate Hypothesis 2: \(H_a: \rho \neq 0\)
cor.test(X[, "GrLivArea"], X[, "FullBath"], conf.level=0.8)
##
## Pearson's product-moment correlation
##
## data: X[, "GrLivArea"] and X[, "FullBath"]
## t = 30.977, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6093339 0.6498331
## sample estimates:
## cor
## 0.6300116
We are \(80\%\) confident that \(\rho\), the population correlation coefficient for “GrLivArea” and “FullBath,” is between \(0.609\) and \(0.650\). We reject null hypothesis 2 in favor of the alternate hypothesis 2.
Null Hypothesis 3: \(H_0: \rho = 0\)
Alternate Hypothesis 3: \(H_a: \rho \neq 0\)
cor.test(X[, "BedroomAbvGr"], X[, "FullBath"], conf.level=0.8)
##
## Pearson's product-moment correlation
##
## data: X[, "BedroomAbvGr"] and X[, "FullBath"]
## t = 14.887, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.3337593 0.3920342
## sample estimates:
## cor
## 0.363252
We are \(80\%\) confident that \(\rho\), the population correlation coefficient for “BedroomAbvGr” and “FullBath,” is between \(0.334\) and \(0.392\). We reject null hypothesis 3 in favor of the alternate hypothesis 3.
Discuss the meaning of your analysis.
All three of the correlation coefficients we tested are statistically significant, so there is a significant linear relationship between each pair of variables. This means we could model the linear relationship between each pair of variables in the population by using linear regression.
Would you be worried about familywise error? Why or why not?
a <- 0.2
m <- 3
FWER <- 1 - ((1 - a)^m)
Because we’ve conducted three hypothesis tests, our confidence level is low, and our alpha is therefore high, yes, I would be concerned about familywise error. The familywise error rate (FWER) for these three hypothesis tests, meaning the probability of us committing a type I error (rejecting the null hypothesis when it is actually true), is 48.8%. That is very high for three tests.
We can reduce the FWER using the Holm-Bonferonni formula. Our p-value for each pair of coefficients was the same very low value: 0.00000000000000022. So we will leave the order of the tests in the order we did them:
\(H_1 = 0.00000000000000022\)
\(H_2 = 0.00000000000000022\)
\(H_3 = 0.00000000000000022\)
Then we will sequentially confirm we should reject the null hypothesis for each test:
rank = 1
HB1 <- a / (m - rank + 1)
rank = 2
HB2 <- a / (m - rank + 1)
rank = 3
HB3 <- a / (m - rank + 1)
\(p_1 < 2.2 \times 10^{-16} < 0.0666667\)
\(p_2 < 2.2 \times 10^{-16} < 0.1\)
\(p_3 < 2.2 \times 10^{-16} < 0.2\)
So we were correct to reject all three null hypotheses after controlling for FWER.
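Equivalently (a sketch), base R's p.adjust() performs the same Holm adjustment; since the exact p-values are only reported as being below 2.2e-16, we use that bound here purely for illustration.
# Holm-adjusted p-values compared against the familywise alpha of 0.2
pvals <- c(2.2e-16, 2.2e-16, 2.2e-16)
p.adjust(pvals, method = "holm") < a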
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
M_inv <- solve(M)
print_M_inv <- to_LaTeX(round(M_inv, 3))
\(M^{-1} = \begin{bmatrix} 1.982 & -0.668 & -1.006 \\ -0.668 & 1.377 & -0.08 \\ -1.006 & -0.08 & 1.663 \end{bmatrix}\)
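As noted in the prompt, the variance inflation factors sit on the diagonal of this precision matrix; a quick sketch to read them off:
# Variance inflation factors for GrLivArea, BedroomAbvGr, and FullBath
round(diag(M_inv), 3)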
Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
M_M_inv <- M %*% M_inv
print_M_M_inv <- to_LaTeX(round(M_M_inv, 3))
\(M M^{-1} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\)
M_inv_M <- M_inv %*% M
print_M_inv_M <- to_LaTeX(round(M_inv_M, 3))
\(M^{-1} M = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\)
Conduct LU decomposition on the matrix.
triangles <- lu.decomposition(M)
print_M <- to_LaTeX(round(M, 3))
L <- triangles$L
print_L <- to_LaTeX(round(L, 3))
U <- triangles$U
print_U <- to_LaTeX(round(U, 3))
\(M = \begin{bmatrix} 1 & 0.521 & 0.63 \\ 0.521 & 1 & 0.363 \\ 0.63 & 0.363 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0.521 & 1 & 0 \\ 0.63 & 0.048 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0.521 & 0.63 \\ 0 & 0.728 & 0.035 \\ 0 & 0 & 0.601 \end{bmatrix} = LU\)
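As a sanity check (a sketch), multiplying the factors back together should reproduce the correlation matrix:
# L %*% U should equal M up to floating-point error (dimname differences ignored)
all.equal(L %*% U, M, check.attributes = FALSE)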
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See this link).
ggplot(data = train_df, aes(x = LotArea / 1000)) +
geom_histogram(binwidth=5, color = palette[2], fill = palette[1]) +
scale_x_continuous(breaks = seq(0, 220, by = 10)) +
labs(x = "Lot Area in 1,000s", y = "Frequency",
title = "Distribution of Lot Area") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = palette[5]))
Lot Area is right-skewed, and its minimum value is already above zero, so no shift is needed; we will use that variable.
lot_area_exp_fit <- fitdistr(train_df$LotArea, densfun = "exponential")
Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))).
optimal_lambda <- lot_area_exp_fit$estimate
optimal_dist <- as.data.frame(rexp(1000, optimal_lambda))
cols <- "optimal_lot_area"
colnames(optimal_dist) <- cols
Plot a histogram and compare it with a histogram of your original variable.
ggplot(data = optimal_dist, aes(x = optimal_lot_area / 1000)) +
geom_histogram(binwidth=5, color = palette[2], fill = palette[1]) +
scale_x_continuous(breaks = seq(0, 220, by = 10)) +
labs(x = "Optimal Lot Area in 1,000s", y = "Frequency",
title = "Distribution of Optimal Lot Area") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = palette[5]))
Sampling from the fitted exponential changes the shape of the data considerably: the simulated values are concentrated near zero and decay smoothly, without the pronounced peak of the original Lot Area distribution or its extreme upper tail.
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
percentiles <- quantile(optimal_dist$optimal_lot_area, c(0.05, .95))
percentiles
## 5% 95%
## 624.5333 30905.3320
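The same percentiles can also be read directly from the fitted exponential's quantile function (the inverse of its CDF) rather than estimated from the 1,000 simulated draws (a sketch):
# Theoretical 5th and 95th percentiles of the fitted exponential
qexp(c(0.05, 0.95), rate = optimal_lambda)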
Also generate a 95% confidence interval from the empirical data, assuming normality.
mean_LotArea <- mean(train_df$LotArea)
sd_LotArea <- sd(train_df$LotArea)
se_LotArea <- sd_LotArea / sqrt(length(train_df$LotArea))
ci_actual_data <- c(mean_LotArea - 1.96 * se_LotArea, mean_LotArea + 1.96 * se_LotArea)
names(ci_actual_data) <- c("LL", "UL")
ci_actual_data
## LL UL
## 10004.83 11028.82
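As a cross-check (a sketch), essentially the same interval can be obtained with a t-based interval; with 1,460 observations the t and z critical values are nearly identical:
# 95% t-based confidence interval for the mean of LotArea
t.test(train_df$LotArea, conf.level = 0.95)$conf.int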
Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
percentiles2 <- quantile(train_df$LotArea, c(0.05, .95))
percentiles2
## 5% 95%
## 3311.70 17401.15
The empirical 5th and 95th percentiles (about 3,312 and 17,401) are much closer together than those of the fitted exponential (about 625 and 30,905), so the bulk of the actual Lot Area data is more concentrated than an exponential with the same mean would imply. At the same time, the distance between the maximum value and the 95th percentile is much larger in the empirical data than in the simulated data, so the empirical distribution has a heavier upper tail (it is more skewed) than the exponential fit.
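To attach a rough number to the skewness comparison, a simple moment-based helper can be used (a sketch; skew_fn is our own helper, not from any loaded package):
# Third standardized moment as a rough skewness measure
skew_fn <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(original = skew_fn(train_df$LotArea),
  simulated = skew_fn(optimal_dist$optimal_lot_area))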
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis.
Since there are a lot of variables, we first use our knowledge of what might influence sale prices most to reduce the number of variables under consideration. We convert the single character variable to factor.
cols <- c("LotArea", "Neighborhood", "OverallQual", "GrLivArea", "TotalBsmtSF",
"FullBath", "HalfBath", "BedroomAbvGr", "Fireplaces", "GarageCars",
"SalePrice")
train_df_model_sub <- train_df[, colnames(train_df) %in% cols]
train_df_model_sub$Neighborhood <- as.factor(train_df_model_sub$Neighborhood)
reorder <- c("SalePrice", "LotArea", "Neighborhood", "OverallQual", "GrLivArea",
"TotalBsmtSF", "FullBath", "HalfBath", "BedroomAbvGr", "Fireplaces",
"GarageCars")
train_df_model_sub <- train_df_model_sub[, reorder]
We look at pairwise scatterplots to see if there appear to be linear relationships between our dependent variable and any of our possible predictors.
pairs(train_df_model_sub, gap=0.5)
Overall quality, above ground square footage, and total basement square footage appear to have the strongest positive linear relationships with sale price, just eyeballing these.
We set our p-value threshold for predictor inclusion at p = 0.05 and perform backward elimination to decide what variables will come out of the model.
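For reference, an AIC-based automated alternative is sketched below (not run in this report; we proceed with manual p-value-based elimination instead):
# Automated backward elimination by AIC (sketch, not run)
# step(lm(SalePrice ~ ., data = train_df_model_sub), direction = "backward")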
train_df_model_sub_lm_full <- lm(SalePrice ~ Neighborhood + LotArea +
OverallQual + GrLivArea + TotalBsmtSF + FullBath + HalfBath + BedroomAbvGr +
Fireplaces + GarageCars, data = train_df_model_sub)
summary(train_df_model_sub_lm_full)
##
## Call:
## lm(formula = SalePrice ~ Neighborhood + LotArea + OverallQual +
## GrLivArea + TotalBsmtSF + FullBath + HalfBath + BedroomAbvGr +
## Fireplaces + GarageCars, data = train_df_model_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -413081 -14993 158 14206 268958
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.750e+04 1.132e+04 -4.195 2.89e-05 ***
## NeighborhoodBlueste -1.464e+04 2.593e+04 -0.565 0.572421
## NeighborhoodBrDale -1.234e+04 1.250e+04 -0.987 0.323961
## NeighborhoodBrkSide 9.207e+03 1.012e+04 0.910 0.362945
## NeighborhoodClearCr 2.253e+04 1.124e+04 2.005 0.045176 *
## NeighborhoodCollgCr 2.276e+04 9.043e+03 2.517 0.011952 *
## NeighborhoodCrawfor 2.971e+04 1.009e+04 2.945 0.003285 **
## NeighborhoodEdwards 2.686e+03 9.657e+03 0.278 0.780916
## NeighborhoodGilbert 1.280e+04 9.544e+03 1.341 0.179997
## NeighborhoodIDOTRR -4.914e+03 1.076e+04 -0.457 0.647972
## NeighborhoodMeadowV 5.853e+03 1.238e+04 0.473 0.636404
## NeighborhoodMitchel 1.064e+04 1.005e+04 1.059 0.289941
## NeighborhoodNAmes 9.396e+03 9.233e+03 1.018 0.309023
## NeighborhoodNoRidge 7.191e+04 1.035e+04 6.949 5.58e-12 ***
## NeighborhoodNPkVill -8.876e+03 1.438e+04 -0.617 0.537075
## NeighborhoodNridgHt 7.111e+04 9.388e+03 7.575 6.45e-14 ***
## NeighborhoodNWAmes 4.590e+03 9.599e+03 0.478 0.632628
## NeighborhoodOldTown -9.287e+03 9.558e+03 -0.972 0.331377
## NeighborhoodSawyer 1.251e+04 9.857e+03 1.269 0.204649
## NeighborhoodSawyerW 1.379e+04 9.737e+03 1.416 0.157001
## NeighborhoodSomerst 2.923e+04 9.321e+03 3.136 0.001748 **
## NeighborhoodStoneBr 7.317e+04 1.096e+04 6.674 3.54e-11 ***
## NeighborhoodSWISU -3.423e+03 1.158e+04 -0.295 0.767670
## NeighborhoodTimber 2.683e+04 1.036e+04 2.589 0.009723 **
## NeighborhoodVeenker 4.375e+04 1.354e+04 3.231 0.001261 **
## LotArea 5.017e-01 1.050e-01 4.780 1.94e-06 ***
## OverallQual 1.610e+04 1.160e+03 13.881 < 2e-16 ***
## GrLivArea 4.579e+01 3.709e+00 12.346 < 2e-16 ***
## TotalBsmtSF 2.133e+01 2.921e+00 7.303 4.67e-13 ***
## FullBath 3.253e+03 2.719e+03 1.196 0.231741
## HalfBath 2.741e+03 2.417e+03 1.134 0.256898
## BedroomAbvGr -5.594e+03 1.484e+03 -3.768 0.000171 ***
## Fireplaces 6.245e+03 1.768e+03 3.532 0.000426 ***
## GarageCars 1.247e+04 1.683e+03 7.412 2.13e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34510 on 1426 degrees of freedom
## Multiple R-squared: 0.8156, Adjusted R-squared: 0.8113
## F-statistic: 191.1 on 33 and 1426 DF, p-value: < 2.2e-16
Looking at the summary for our model, we see that the residuals have very long tails (minimum -413,081, maximum 268,958) even though their median is near 0. We also see that HalfBath has the highest p-value, so it is the first variable we will remove from the model.
train_df_model_sub_lm_update <- update(
train_df_model_sub_lm_full, .~. - HalfBath, data = train_df_model_sub)
summary(train_df_model_sub_lm_update)
##
## Call:
## lm(formula = SalePrice ~ Neighborhood + LotArea + OverallQual +
## GrLivArea + TotalBsmtSF + FullBath + BedroomAbvGr + Fireplaces +
## GarageCars, data = train_df_model_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -412639 -14765 80 14066 267678
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.699e+04 1.131e+04 -4.153 3.47e-05 ***
## NeighborhoodBlueste -1.328e+04 2.591e+04 -0.513 0.608334
## NeighborhoodBrDale -1.082e+04 1.243e+04 -0.870 0.384191
## NeighborhoodBrkSide 8.378e+03 1.009e+04 0.830 0.406563
## NeighborhoodClearCr 2.226e+04 1.124e+04 1.981 0.047800 *
## NeighborhoodCollgCr 2.315e+04 9.038e+03 2.562 0.010522 *
## NeighborhoodCrawfor 2.920e+04 1.008e+04 2.897 0.003829 **
## NeighborhoodEdwards 2.232e+03 9.649e+03 0.231 0.817092
## NeighborhoodGilbert 1.396e+04 9.489e+03 1.472 0.141352
## NeighborhoodIDOTRR -5.548e+03 1.075e+04 -0.516 0.605724
## NeighborhoodMeadowV 6.174e+03 1.238e+04 0.499 0.617941
## NeighborhoodMitchel 1.041e+04 1.005e+04 1.036 0.300423
## NeighborhoodNAmes 9.172e+03 9.231e+03 0.994 0.320601
## NeighborhoodNoRidge 7.239e+04 1.034e+04 6.999 3.94e-12 ***
## NeighborhoodNPkVill -7.565e+03 1.433e+04 -0.528 0.597695
## NeighborhoodNridgHt 7.141e+04 9.385e+03 7.610 4.97e-14 ***
## NeighborhoodNWAmes 5.245e+03 9.583e+03 0.547 0.584246
## NeighborhoodOldTown -1.022e+04 9.524e+03 -1.073 0.283508
## NeighborhoodSawyer 1.248e+04 9.858e+03 1.266 0.205722
## NeighborhoodSawyerW 1.411e+04 9.733e+03 1.450 0.147350
## NeighborhoodSomerst 3.009e+04 9.291e+03 3.239 0.001227 **
## NeighborhoodStoneBr 7.331e+04 1.096e+04 6.687 3.26e-11 ***
## NeighborhoodSWISU -4.323e+03 1.156e+04 -0.374 0.708434
## NeighborhoodTimber 2.706e+04 1.036e+04 2.612 0.009107 **
## NeighborhoodVeenker 4.474e+04 1.351e+04 3.311 0.000953 ***
## LotArea 4.948e-01 1.048e-01 4.722 2.57e-06 ***
## OverallQual 1.617e+04 1.159e+03 13.960 < 2e-16 ***
## GrLivArea 4.771e+01 3.300e+00 14.461 < 2e-16 ***
## TotalBsmtSF 2.020e+01 2.748e+00 7.353 3.26e-13 ***
## FullBath 2.146e+03 2.538e+03 0.846 0.397905
## BedroomAbvGr -5.571e+03 1.484e+03 -3.753 0.000182 ***
## Fireplaces 6.265e+03 1.768e+03 3.543 0.000409 ***
## GarageCars 1.252e+04 1.683e+03 7.438 1.75e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34510 on 1427 degrees of freedom
## Multiple R-squared: 0.8154, Adjusted R-squared: 0.8113
## F-statistic: 197 on 32 and 1427 DF, p-value: < 2.2e-16
Now the variable with the highest p-value is FullBath, so we remove that from the model as well.
train_df_model_sub_lm_update <- update(
train_df_model_sub_lm_update, .~. - FullBath, data = train_df_model_sub)
summary(train_df_model_sub_lm_update)
##
## Call:
## lm(formula = SalePrice ~ Neighborhood + LotArea + OverallQual +
## GrLivArea + TotalBsmtSF + BedroomAbvGr + Fireplaces + GarageCars,
## data = train_df_model_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -415192 -14946 181 13820 267765
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.507e+04 1.108e+04 -4.067 5.03e-05 ***
## NeighborhoodBlueste -1.415e+04 2.588e+04 -0.547 0.584643
## NeighborhoodBrDale -1.216e+04 1.233e+04 -0.986 0.324238
## NeighborhoodBrkSide 6.868e+03 9.932e+03 0.692 0.489323
## NeighborhoodClearCr 2.126e+04 1.117e+04 1.903 0.057307 .
## NeighborhoodCollgCr 2.260e+04 9.013e+03 2.507 0.012283 *
## NeighborhoodCrawfor 2.793e+04 9.967e+03 2.802 0.005144 **
## NeighborhoodEdwards 1.074e+03 9.551e+03 0.112 0.910505
## NeighborhoodGilbert 1.372e+04 9.484e+03 1.447 0.148251
## NeighborhoodIDOTRR -7.086e+03 1.059e+04 -0.669 0.503530
## NeighborhoodMeadowV 4.904e+03 1.228e+04 0.399 0.689800
## NeighborhoodMitchel 9.467e+03 9.986e+03 0.948 0.343299
## NeighborhoodNAmes 7.715e+03 9.068e+03 0.851 0.395035
## NeighborhoodNoRidge 7.141e+04 1.028e+04 6.949 5.57e-12 ***
## NeighborhoodNPkVill -7.249e+03 1.432e+04 -0.506 0.612901
## NeighborhoodNridgHt 7.105e+04 9.374e+03 7.579 6.23e-14 ***
## NeighborhoodNWAmes 4.700e+03 9.560e+03 0.492 0.623079
## NeighborhoodOldTown -1.159e+04 9.383e+03 -1.235 0.216949
## NeighborhoodSawyer 1.110e+04 9.720e+03 1.142 0.253761
## NeighborhoodSawyerW 1.345e+04 9.701e+03 1.386 0.165946
## NeighborhoodSomerst 2.991e+04 9.287e+03 3.221 0.001307 **
## NeighborhoodStoneBr 7.280e+04 1.095e+04 6.651 4.13e-11 ***
## NeighborhoodSWISU -5.494e+03 1.147e+04 -0.479 0.632124
## NeighborhoodTimber 2.658e+04 1.035e+04 2.569 0.010288 *
## NeighborhoodVeenker 4.353e+04 1.344e+04 3.240 0.001223 **
## LotArea 4.934e-01 1.048e-01 4.709 2.73e-06 ***
## OverallQual 1.622e+04 1.157e+03 14.015 < 2e-16 ***
## GrLivArea 4.871e+01 3.082e+00 15.802 < 2e-16 ***
## TotalBsmtSF 2.016e+01 2.747e+00 7.338 3.62e-13 ***
## BedroomAbvGr -5.364e+03 1.464e+03 -3.664 0.000257 ***
## Fireplaces 6.137e+03 1.762e+03 3.484 0.000510 ***
## GarageCars 1.259e+04 1.680e+03 7.494 1.16e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34510 on 1428 degrees of freedom
## Multiple R-squared: 0.8153, Adjusted R-squared: 0.8113
## F-statistic: 203.4 on 31 and 1428 DF, p-value: < 2.2e-16
Our Adjusted R-squared value is 0.8113, which is high and indicates these predictor variables explain about 81.13% of the variation in our dependent variable.
par(mfrow=c(2,2))
plot(train_df_model_sub_lm_update)
We can see from the plot of residuals vs. fitted values that the residuals are not evenly scattered around 0, and the Q-Q plot shows that they deviate from the normal reference line, especially in the tails. We apply a Box-Cox transformation to the response to address this.
lm_bc <- MASS::boxcox(train_df_model_sub_lm_update)
print(lm_lambda <- lm_bc$x[which.max(lm_bc$y)])
## [1] 0.1414141
train_df_model_sub_lm_new <- lm(((SalePrice^lm_lambda-1)/lm_lambda) ~ Neighborhood +
LotArea + OverallQual + GrLivArea + TotalBsmtSF + BedroomAbvGr + Fireplaces +
GarageCars, data = train_df_model_sub)
summary(train_df_model_sub_lm_new)
##
## Call:
## lm(formula = ((SalePrice^lm_lambda - 1)/lm_lambda) ~ Neighborhood +
## LotArea + OverallQual + GrLivArea + TotalBsmtSF + BedroomAbvGr +
## Fireplaces + GarageCars, data = train_df_model_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0318 -0.3880 0.0343 0.4699 2.8428
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.524e+01 2.749e-01 91.822 < 2e-16 ***
## NeighborhoodBlueste -7.116e-01 6.421e-01 -1.108 0.267923
## NeighborhoodBrDale -1.099e+00 3.059e-01 -3.592 0.000340 ***
## NeighborhoodBrkSide -3.737e-01 2.464e-01 -1.517 0.129507
## NeighborhoodClearCr 6.881e-01 2.772e-01 2.482 0.013170 *
## NeighborhoodCollgCr 5.203e-01 2.236e-01 2.327 0.020098 *
## NeighborhoodCrawfor 6.371e-01 2.473e-01 2.577 0.010080 *
## NeighborhoodEdwards -4.472e-01 2.369e-01 -1.887 0.059321 .
## NeighborhoodGilbert 3.233e-01 2.353e-01 1.374 0.169647
## NeighborhoodIDOTRR -1.210e+00 2.627e-01 -4.608 4.44e-06 ***
## NeighborhoodMeadowV -7.495e-01 3.047e-01 -2.460 0.014026 *
## NeighborhoodMitchel 9.684e-02 2.477e-01 0.391 0.695913
## NeighborhoodNAmes -2.163e-02 2.250e-01 -0.096 0.923420
## NeighborhoodNoRidge 1.059e+00 2.549e-01 4.154 3.46e-05 ***
## NeighborhoodNPkVill -4.656e-01 3.554e-01 -1.310 0.190285
## NeighborhoodNridgHt 1.171e+00 2.325e-01 5.037 5.33e-07 ***
## NeighborhoodNWAmes 6.583e-02 2.372e-01 0.278 0.781375
## NeighborhoodOldTown -8.480e-01 2.328e-01 -3.643 0.000279 ***
## NeighborhoodSawyer 2.570e-02 2.411e-01 0.107 0.915130
## NeighborhoodSawyerW 2.127e-01 2.406e-01 0.884 0.376988
## NeighborhoodSomerst 6.796e-01 2.304e-01 2.950 0.003234 **
## NeighborhoodStoneBr 1.275e+00 2.715e-01 4.695 2.93e-06 ***
## NeighborhoodSWISU -4.167e-01 2.846e-01 -1.464 0.143473
## NeighborhoodTimber 5.540e-01 2.567e-01 2.158 0.031072 *
## NeighborhoodVeenker 1.069e+00 3.333e-01 3.208 0.001369 **
## LotArea 9.197e-06 2.599e-06 3.538 0.000416 ***
## OverallQual 5.149e-01 2.871e-02 17.937 < 2e-16 ***
## GrLivArea 1.086e-03 7.647e-05 14.209 < 2e-16 ***
## TotalBsmtSF 5.263e-04 6.815e-05 7.723 2.13e-14 ***
## BedroomAbvGr -3.442e-03 3.632e-02 -0.095 0.924502
## Fireplaces 2.394e-01 4.371e-02 5.477 5.11e-08 ***
## GarageCars 4.363e-01 4.168e-02 10.469 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.856 on 1428 degrees of freedom
## Multiple R-squared: 0.852, Adjusted R-squared: 0.8488
## F-statistic: 265.3 on 31 and 1428 DF, p-value: < 2.2e-16
The Box-Cox transformation increased the adjusted R-squared to 0.8488, and the residuals remain centered near 0 on the transformed scale. BedroomAbvGr is no longer statistically significant in the new model, so we will remove it.
train_df_model_sub_lm_new <- update(
train_df_model_sub_lm_new, .~. - BedroomAbvGr, data = train_df_model_sub)
summary(train_df_model_sub_lm_new)
##
## Call:
## lm(formula = ((SalePrice^lm_lambda - 1)/lm_lambda) ~ Neighborhood +
## LotArea + OverallQual + GrLivArea + TotalBsmtSF + Fireplaces +
## GarageCars, data = train_df_model_sub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.0216 -0.3872 0.0348 0.4714 2.8448
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.524e+01 2.713e-01 93.014 < 2e-16 ***
## NeighborhoodBlueste -7.132e-01 6.416e-01 -1.111 0.266543
## NeighborhoodBrDale -1.100e+00 3.052e-01 -3.606 0.000322 ***
## NeighborhoodBrkSide -3.759e-01 2.453e-01 -1.532 0.125628
## NeighborhoodClearCr 6.863e-01 2.764e-01 2.483 0.013160 *
## NeighborhoodCollgCr 5.177e-01 2.218e-01 2.334 0.019746 *
## NeighborhoodCrawfor 6.350e-01 2.462e-01 2.579 0.010013 *
## NeighborhoodEdwards -4.497e-01 2.353e-01 -1.911 0.056142 .
## NeighborhoodGilbert 3.204e-01 2.332e-01 1.374 0.169713
## NeighborhoodIDOTRR -1.212e+00 2.619e-01 -4.629 4.01e-06 ***
## NeighborhoodMeadowV -7.512e-01 3.041e-01 -2.471 0.013606 *
## NeighborhoodMitchel 9.404e-02 2.459e-01 0.382 0.702172
## NeighborhoodNAmes -2.489e-02 2.222e-01 -0.112 0.910861
## NeighborhoodNoRidge 1.057e+00 2.541e-01 4.160 3.38e-05 ***
## NeighborhoodNPkVill -4.681e-01 3.543e-01 -1.321 0.186588
## NeighborhoodNridgHt 1.170e+00 2.318e-01 5.045 5.11e-07 ***
## NeighborhoodNWAmes 6.244e-02 2.344e-01 0.266 0.789950
## NeighborhoodOldTown -8.498e-01 2.319e-01 -3.665 0.000257 ***
## NeighborhoodSawyer 2.225e-02 2.383e-01 0.093 0.925620
## NeighborhoodSawyerW 2.103e-01 2.392e-01 0.879 0.379580
## NeighborhoodSomerst 6.777e-01 2.295e-01 2.953 0.003198 **
## NeighborhoodStoneBr 1.275e+00 2.714e-01 4.696 2.91e-06 ***
## NeighborhoodSWISU -4.205e-01 2.817e-01 -1.493 0.135684
## NeighborhoodTimber 5.514e-01 2.552e-01 2.161 0.030869 *
## NeighborhoodVeenker 1.068e+00 3.331e-01 3.207 0.001370 **
## LotArea 9.198e-06 2.598e-06 3.540 0.000413 ***
## OverallQual 5.151e-01 2.860e-02 18.013 < 2e-16 ***
## GrLivArea 1.082e-03 6.260e-05 17.289 < 2e-16 ***
## TotalBsmtSF 5.271e-04 6.752e-05 7.808 1.12e-14 ***
## Fireplaces 2.399e-01 4.333e-02 5.537 3.65e-08 ***
## GarageCars 4.366e-01 4.160e-02 10.494 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8557 on 1429 degrees of freedom
## Multiple R-squared: 0.852, Adjusted R-squared: 0.8489
## F-statistic: 274.3 on 30 and 1429 DF, p-value: < 2.2e-16
We plot the residuals vs. the fitted values and look at a qq-plot again.
par(mfrow=c(2,2))
plot(train_df_model_sub_lm_new)
The residuals follow the normal distribution line much more closely now, though the lower tail is still problematic.
Finally, we generate predictions on the Kaggle test set, which the model never saw during training, and back-transform them to the original sale-price scale.
my_url2 <- "https://raw.githubusercontent.com/geedoubledee/data605_final_project/main/test.csv"
test_df <- read.csv(my_url2)
test_df$Neighborhood <- as.factor(test_df$Neighborhood)
test_df$SalePrice <- predict(train_df_model_sub_lm_new, test_df)
test_df$SalePrice <- inv_boxcox(test_df$SalePrice, lm_lambda)
submission_cols <- c("Id", "SalePrice")
submission <- test_df[, colnames(test_df) %in% submission_cols]
write.csv(submission, "submission.csv", row.names = FALSE)
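For reference, the inverse Box-Cox transform applied above is simply \(x = (\lambda y + 1)^{1/\lambda}\); a minimal manual sketch is shown below (inv_bc_manual is our own helper, assuming this is what inv_boxcox computes for a nonzero \(\lambda\)):
# Manual inverse of the Box-Cox transform y = (x^lambda - 1) / lambda
inv_bc_manual <- function(y, lambda) (lambda * y + 1)^(1 / lambda)
# e.g., inv_bc_manual(predict(train_df_model_sub_lm_new, test_df), lm_lambda)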
Report your Kaggle.com user name and score. Provide a screen snapshot of your score with your name identifiable.
Username: https://www.kaggle.com/glendavis
Score: 0.43507
Screenshot: