library(knitr)   # for table of counts
library(kableExtra)
library(tidyverse)   # for %>% operator
library(moments)  # skewness and kurtosis
library(hrbrthemes) # themes
library(DescTools) # mode
library(scales)  # remove sci notation
library(corrplot)
library(factoextra)   # pca scree plot
library(MASS)  # fitdistr
train.df <- read.csv("train.csv")
# head(train.df)
hist(train.df$WoodDeckSF)

X <- train.df$WoodDeckSF
Y <- train.df$SalePrice

The independent variable is WoodDeckSF, which is wood deck area in square feet.

Probability

X_Q3 <- quantile(X, 0.75) # 3rd quartile of the variable X
Y_Q2 <- quantile(Y, 0.50) # 2nd quartile of the variable Y

a <- length(which(X <= X_Q3 & Y <= Y_Q2))
b <- length(which(X <= X_Q3 & Y > Y_Q2))
c <- a + b
d <- length(which(X > X_Q3 & Y <= Y_Q2))
e <- length(which(X > X_Q3 & Y > Y_Q2))
f <- d + e
g <- a + d
h <- b + e
i <- g + h   # also c + f
data <- data.frame(
  "X/Y" = c("X <= Q3", "X > Q3", "Total"),
  "Y <= Q2" = c(a, d, g),
  "Y > Q2" = c(b, e, h),
  "Total" = c(c, f, i),
  check.names=F
)

# Create a table using kable
kable(data, caption = "Table of Counts", format = "html") %>%
  kable_styling()
Table of Counts
X/Y Y <= Q2 Y > Q2 Total
X <= Q3 624 480 1104
X > Q3 108 248 356
Total 732 728 1460
  1. \(P(X>x | Y>y) = 248/728 = 0.3407\)

This represents: Given the wood deck area is above the 3rd quartile, what is the probability that the sales price is above the 2nd quartile?

  1. \(P(X>x, Y>y) = 248/1460 = 0.1699\)

This represents the probability that the events occur together. The wood deck area is above the 3rd quartile and the sales price is above the 2nd quartile.

  1. \(P(X<x | Y>y) = (480-28)/728 = 0.62\)

The number of records where x equals the 3rd quartile is 28, so we can subtract 28 from the number of records where x is less than or equal to the 3rd quartile and y is greater than the 2nd quartile. This probability represents when the wood deck area is strictly less than Q3 given that the sales price is strictly greater than the 2nd quartile.

Chi square test for association

Let A be the event \(X > Q3\). Let B be the event \(Y >Q2\).

\[\begin{align*} P(A|B) &= P(X > Q3 | Y > Q2) \\ &= 248/728 \\ &= 0.3407 \end{align*}\]

\[\begin{align*} P(A) \cdot P(B) &= P(X>Q3) \cdot P(Y>Q2) \\ &= \frac{356}{1460} \cdot \frac{728}{1460} \\ &= 0.1216 \end{align*}\]

probA <- length(which(X > X_Q3))/nrow(train.df)
probB <- length(which(Y > Y_Q2))/nrow(train.df)
prob_condAB <- length(which(X > X_Q3 & Y > Y_Q2))/length(which(Y > Y_Q2))

Since \(P(A|B) \neq P(A) \cdot P(B)\), the events are not independent. Verify with the chi-square test.

The condition for applying the chi-square test for independence is that all of the expected counts for the data are greater than five.

Check the expected counts:

data2 <- data.frame(
  "Combinations" = c("X<=Q3, Y<=Q2", "X>Q3, Y>Q2", "X>Q3, Y<=Q2", "X<=Q3, Y>Q2"),
  "Expected Counts" = c("1104*732/1460=553", "356*728/1460=178", "356*732/1460=179", "1104*728/1460=551"),
  ">5?" = c("Yes", "Yes", "Yes", "Yes"),
  check.names=F
)

kable(data2, caption = "Checking Expected Counts", format = "html") %>%
  kable_styling()
Checking Expected Counts
Combinations Expected Counts >5?
X<=Q3, Y<=Q2 1104*732/1460=553 Yes
X>Q3, Y>Q2 356*728/1460=178 Yes
X>Q3, Y<=Q2 356*732/1460=179 Yes
X<=Q3, Y>Q2 1104*728/1460=551 Yes

Calculate the test statistic

  • \(H_0\): The events \(X > Q3\) and \(Y >Q2\) are unrelated (independent).
  • \(H_A\): The events \(X > Q3\) and \(Y >Q2\) are not independent.

\[\begin{align*} \sum \frac{(O-E)^2}{E} &= \frac{(624-553)^2}{553} + \frac{(108-179)^2}{179} + \frac{(480-551)^2}{551} + \frac{(248-178)^2}{178} \\ &= 9.116 + 28.162 + 8.149 + 27.528 \\ &= 72.955 \end{align*}\]

The degrees of freedom is given by \(df = (r-1)(c-1)\).

\(df = (2-1)(2-1) = 1\)

Choose to be 0.05. \(X^2_{0.05, 1} = 3.841\)

Since the test statistic is larger, reject the null hypothesis of independence. Cannot assume that there is no relationship between the wood deck areas above the 3rd quartile and sales price above the 2nd quartile.


Descriptive and Inferential Statistics

Univariate Descriptive Statistics

For the all the numeric variables in the training set:

exclude_cols <- c("YrSold", "GarageYrBlt", "YearRemodAdd", "YearBuilt", "Id", "SalePrice")
train_numeric <- dplyr::select_if(train.df, is.numeric) %>% dplyr::select(-all_of(exclude_cols))
summary(train_numeric)
##    MSSubClass     LotFrontage        LotArea        OverallQual    
##  Min.   : 20.0   Min.   : 21.00   Min.   :  1300   Min.   : 1.000  
##  1st Qu.: 20.0   1st Qu.: 59.00   1st Qu.:  7554   1st Qu.: 5.000  
##  Median : 50.0   Median : 69.00   Median :  9478   Median : 6.000  
##  Mean   : 56.9   Mean   : 70.05   Mean   : 10517   Mean   : 6.099  
##  3rd Qu.: 70.0   3rd Qu.: 80.00   3rd Qu.: 11602   3rd Qu.: 7.000  
##  Max.   :190.0   Max.   :313.00   Max.   :215245   Max.   :10.000  
##                  NA's   :259                                       
##   OverallCond      MasVnrArea       BsmtFinSF1       BsmtFinSF2     
##  Min.   :1.000   Min.   :   0.0   Min.   :   0.0   Min.   :   0.00  
##  1st Qu.:5.000   1st Qu.:   0.0   1st Qu.:   0.0   1st Qu.:   0.00  
##  Median :5.000   Median :   0.0   Median : 383.5   Median :   0.00  
##  Mean   :5.575   Mean   : 103.7   Mean   : 443.6   Mean   :  46.55  
##  3rd Qu.:6.000   3rd Qu.: 166.0   3rd Qu.: 712.2   3rd Qu.:   0.00  
##  Max.   :9.000   Max.   :1600.0   Max.   :5644.0   Max.   :1474.00  
##                  NA's   :8                                          
##    BsmtUnfSF       TotalBsmtSF       X1stFlrSF      X2ndFlrSF   
##  Min.   :   0.0   Min.   :   0.0   Min.   : 334   Min.   :   0  
##  1st Qu.: 223.0   1st Qu.: 795.8   1st Qu.: 882   1st Qu.:   0  
##  Median : 477.5   Median : 991.5   Median :1087   Median :   0  
##  Mean   : 567.2   Mean   :1057.4   Mean   :1163   Mean   : 347  
##  3rd Qu.: 808.0   3rd Qu.:1298.2   3rd Qu.:1391   3rd Qu.: 728  
##  Max.   :2336.0   Max.   :6110.0   Max.   :4692   Max.   :2065  
##                                                                 
##   LowQualFinSF       GrLivArea     BsmtFullBath     BsmtHalfBath    
##  Min.   :  0.000   Min.   : 334   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :  0.000   Median :1464   Median :0.0000   Median :0.00000  
##  Mean   :  5.845   Mean   :1515   Mean   :0.4253   Mean   :0.05753  
##  3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   :572.000   Max.   :5642   Max.   :3.0000   Max.   :2.00000  
##                                                                     
##     FullBath        HalfBath       BedroomAbvGr    KitchenAbvGr  
##  Min.   :0.000   Min.   :0.0000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000  
##  Median :2.000   Median :0.0000   Median :3.000   Median :1.000  
##  Mean   :1.565   Mean   :0.3829   Mean   :2.866   Mean   :1.047  
##  3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000  
##  Max.   :3.000   Max.   :2.0000   Max.   :8.000   Max.   :3.000  
##                                                                  
##   TotRmsAbvGrd      Fireplaces      GarageCars      GarageArea    
##  Min.   : 2.000   Min.   :0.000   Min.   :0.000   Min.   :   0.0  
##  1st Qu.: 5.000   1st Qu.:0.000   1st Qu.:1.000   1st Qu.: 334.5  
##  Median : 6.000   Median :1.000   Median :2.000   Median : 480.0  
##  Mean   : 6.518   Mean   :0.613   Mean   :1.767   Mean   : 473.0  
##  3rd Qu.: 7.000   3rd Qu.:1.000   3rd Qu.:2.000   3rd Qu.: 576.0  
##  Max.   :14.000   Max.   :3.000   Max.   :4.000   Max.   :1418.0  
##                                                                   
##    WoodDeckSF      OpenPorchSF     EnclosedPorch      X3SsnPorch    
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
##  Median :  0.00   Median : 25.00   Median :  0.00   Median :  0.00  
##  Mean   : 94.24   Mean   : 46.66   Mean   : 21.95   Mean   :  3.41  
##  3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00   3rd Qu.:  0.00  
##  Max.   :857.00   Max.   :547.00   Max.   :552.00   Max.   :508.00  
##                                                                     
##   ScreenPorch        PoolArea          MiscVal             MoSold      
##  Min.   :  0.00   Min.   :  0.000   Min.   :    0.00   Min.   : 1.000  
##  1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.:    0.00   1st Qu.: 5.000  
##  Median :  0.00   Median :  0.000   Median :    0.00   Median : 6.000  
##  Mean   : 15.06   Mean   :  2.759   Mean   :   43.49   Mean   : 6.322  
##  3rd Qu.:  0.00   3rd Qu.:  0.000   3rd Qu.:    0.00   3rd Qu.: 8.000  
##  Max.   :480.00   Max.   :738.000   Max.   :15500.00   Max.   :12.000  
## 

For the chosen quantitative variables, Wood Deck Area and Sale Price,

ggplot() + 
  geom_boxplot(aes(y = X), fill="#69b3a2", color="black", alpha=0.8) + 
  scale_x_discrete( ) +
  labs(title = "Box Plot of Wood Deck Area",
       y = "Square Feet") +
  theme_ipsum() + 
  coord_flip()

ggplot(data=data.frame(X)) +
  geom_density(aes(x=X), fill="#69b3a2", color="#e9ecef", alpha=0.8) +
  ggtitle("Density Plot of Wood Deck Area") +
  theme_ipsum()

ggplot(data=data.frame(X), aes(x=X)) + 
  geom_histogram(fill="#69b3a2", color="#e9ecef", alpha=0.8) + 
  ggtitle("Histogram of Wood Deck Area") +
  theme_ipsum()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot() + 
  geom_boxplot(aes(y = Y), fill="#ffd500", color="black", alpha=0.8) + 
  scale_x_discrete( ) +
  labs(title = "Box Plot of Sale Price",
       y = "Dollars") +
  theme_ipsum() + 
  coord_flip()

ggplot(data=data.frame(X)) +
  geom_density(aes(x=X), fill="#ffd500", color="#e9ecef", alpha=0.8) +
  ggtitle("Density Plot of Sale Price") +
  theme_ipsum()

ggplot(data=data.frame(X), aes(x=X)) + 
  geom_histogram(fill="#ffd500", color="#e9ecef", alpha=0.8) + 
  ggtitle("Histogram of Sale Price") +
  theme_ipsum()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Wood Deck Area

  • Measures of Central Tendency:
    • Mean: 94.24
    • Median: 0
    • Mode: 0
  • Quartiles:
    • Q1: 0
    • Q2: 0
    • Q3: 168
    • Q4: 857
  • Measures of Dispersion or Variability:
    • Variance: 15709.81
    • Standard deviation: 125.3388
    • Range: 857
    • Interquartile range (IQR): 168
  • Measures of Shape or Distribution:
    • Skewness: 1.5398
    • Kurtosis: 5.9786

A skewness value of 1.5398 suggests moderate skewness to the right. There is a tendency for the data to have higher values (outliers) on the right side of the distribution. This is visible in the box plot above.

A kurtosis value of 5.9786 suggests that the distribution has a heavy tail and a sharper peak than a normal distribution.

Sale Price

  • Measures of Central Tendency:
    • Mean: 180921.2
    • Median: 163000
    • Mode: 140000
  • Quartiles:
    • Q1: 129975
    • Q2: 163000
    • Q3: 214000
    • Q4: 755000
  • Measures of Dispersion or Variability:
    • Variance: 6311111264
    • Standard deviation: 79442.5
    • Range: 720100
    • Interquartile range (IQR): 84025
  • Measures of Shape or Distribution:
    • Skewness: 1.8809
    • Kurtosis: 9.5098

A skewness value of 1.8809 suggests moderate skewness to the right.

A kurtosis value of 9.5098 suggests that the distribution has a heavy tail and a sharper peak than a normal distribution.

Scatter Plot of Wood Deck Area and Sale Price

ggplot(data=train.df, aes(x=WoodDeckSF, y=SalePrice)) + geom_point() + 
  ggtitle("Sale Price vs Wood Deck Area") +
  scale_y_continuous(labels = scales::comma) + theme_ipsum()

95% CI for the difference in the mean of Wood Deck Area and Sale Price

t.test(X, Y, paired = TRUE)
## 
##  Paired t-test
## 
## data:  X and Y
## t = -87.018, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -184903.2 -176750.7
## sample estimates:
## mean difference 
##         -180827

Correlation Matrix

M = cor(data.frame(X,Y))
corrplot(M, type = "upper", tl.col = "black", tl.srt = 45)

### Hypothesis Test

  • 99% confidence interval

\(H_0:\) The correlation between Wood Deck Area and Sale Price is 0.

\(H_A:\) The correlation between Wood Deck Area and Sale Price is not 0.

cor.test(X,Y, conf.level=0.99)
## 
##  Pearson's product-moment correlation
## 
## data:  X and Y
## t = 13.096, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  0.2627778 0.3834121
## sample estimates:
##       cor 
## 0.3244134

The confidence interval is \((0.263, 0.383)\) , which means that we can say with 99% confidence that the true correlation between Wood Deck Area and Sale Price falls within this interval. The small p-value indicates strong evidence against the null hypothesis, suggesting that the observed correlation of \(0.324\) is unlikely to occur by chance alone.

Linear Algebra and Correlation

data <- data.frame(WoodDeckSF=train_numeric$WoodDeckSF, SalePrice=Y, GarageArea=train_numeric$GarageArea)
corr_matrix <- cor(data)
corrplot(corr_matrix, type = "upper", tl.col = "black", tl.srt = 45)

precision_mat <- solve(corr_matrix)
precision_mat
##             WoodDeckSF  SalePrice  GarageArea
## WoodDeckSF  1.11865091 -0.3373326 -0.04101942
## SalePrice  -0.33733260  1.7374927 -1.00742033
## GarageArea -0.04101942 -1.0074203  1.63727319
corr_matrix %*% precision_mat %*% corr_matrix
##            WoodDeckSF SalePrice GarageArea
## WoodDeckSF  1.0000000 0.3244134  0.2246663
## SalePrice   0.3244134 1.0000000  0.6234314
## GarageArea  0.2246663 0.6234314  1.0000000

Principal Component Analysis (PCA)

  • Reduced dimensionality can help reduce effects of overfitting and the curse of dimensionality, especially in datasets with a large number of variables. Here we have 32 numeric variables.
  • PCA can make interpreting models easier.
house.pca <- prcomp(~. ,data=train_numeric, scale.=T)
fviz_eig(house.pca)

* A common rule of thumb is to retain the principal components before the elbow bend, which appears to be at 3. However, less than 50% of the variance is captured in the data according to the scree plot. We will choose 6 principal components instead.

summary(house.pca)
## Importance of components:
##                           PC1     PC2     PC3     PC4    PC5     PC6     PC7
## Standard deviation     2.5689 1.74664 1.44279 1.34361 1.2107 1.10469 1.07217
## Proportion of Variance 0.2062 0.09534 0.06505 0.05642 0.0458 0.03814 0.03592
## Cumulative Proportion  0.2062 0.30156 0.36661 0.42303 0.4688 0.50697 0.54289
##                            PC8     PC9    PC10    PC11   PC12    PC13   PC14
## Standard deviation     1.04171 1.03809 1.00844 0.99509 0.9814 0.96411 0.9466
## Proportion of Variance 0.03391 0.03368 0.03178 0.03094 0.0301 0.02905 0.0280
## Cumulative Proportion  0.57680 0.61048 0.64226 0.67320 0.7033 0.73234 0.7603
##                           PC15    PC16    PC17   PC18    PC19    PC20    PC21
## Standard deviation     0.92006 0.91079 0.87151 0.8579 0.81152 0.79750 0.77821
## Proportion of Variance 0.02645 0.02592 0.02374 0.0230 0.02058 0.01988 0.01893
## Cumulative Proportion  0.78680 0.81272 0.83646 0.8595 0.88003 0.89991 0.91884
##                           PC22    PC23    PC24    PC25   PC26    PC27    PC28
## Standard deviation     0.73423 0.65270 0.61820 0.56307 0.5122 0.50166 0.43485
## Proportion of Variance 0.01685 0.01331 0.01194 0.00991 0.0082 0.00786 0.00591
## Cumulative Proportion  0.93568 0.94900 0.96094 0.97085 0.9790 0.98691 0.99282
##                           PC29    PC30      PC31      PC32
## Standard deviation     0.36738 0.30810 1.191e-15 9.369e-16
## Proportion of Variance 0.00422 0.00297 0.000e+00 0.000e+00
## Cumulative Proportion  0.99703 1.00000 1.000e+00 1.000e+00
  • Over 50% of the variance in the data is captured with 6 principal components. This can be seen in the Cumulative Proportion section of the output above at PC6. You can also sum the numbers in the Proportion of Variance section until you reach over 50%.
sum(0.2062, 0.09534, 0.06505, 0.05642, 0.0458, 0.03814)
## [1] 0.50695
  • The Standard deviation section indicates the amount of variance explained by each principal component. Larger standard deviations indicate principal components that capture more variability in the data.

  • If you wanted to, you could grab these six principal components and use any or all of them in a model.

pca_predictors <- house.pca$x[, 1:6]

What is x?

  • Each row of the x matrix corresponds to an observation (sample) from the original dataset.
  • Each column of the x matrix corresponds to a principal component.
  • The value at the intersection of a row and column represents the score or coordinate of that observation along that principal component.
  • The x matrix provides a lower-dimensional representation of the original dataset. Each observation is represented by a set of scores along the principal components, rather than by the original variables.
  • We can use x to help prevent problems that may arise as a result of the curse of dimensionality (ex: overfitting).

Calculus-Based Probability & Statistics

X1 <- X+1
exp_dist <- MASS::fitdistr(x=X1, densfun="exponential")
lambda <- exp_dist$estimate
lambda
##       rate 
## 0.01049929

\(\lambda = 0.01049929\)

exp_sample = rexp(1000, rate=lambda)

ggplot() + 
  geom_histogram(data=data.frame(exp_sample), aes(x=exp_sample, fill="y"), color="#e9ecef", alpha=0.5) + 
  geom_histogram(data=data.frame(X), aes(x=X, fill="b"), color="#e9ecef", alpha=0.5) + 
    scale_colour_manual(name ="exp_sample", values = c("y" = "yellow", "b" = "blue"), labels=c("b" = "exp_sample", "y" = "WoodDeckSq")) +
  scale_fill_manual(name ="Data", values = c("y" = "yellow", "b" = "blue"), labels=c("b" = "exp_sample", "y" = "WoodDeckSq")) +
  ggtitle("Histogram: WoodDeckSq vs Exp Dist") +
  xlab("")+
  theme_ipsum()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

* The exponential distribution has a larger spread. There are also more occurrences of zero. * Both are skewed to the right. * The exponential distribution is centered farther to the left.

5th percentile using CDF

qexp(0.05, rate = lambda)
## [1] 4.885405

95th percentile using CDF

qexp(0.95, rate = lambda)
## [1] 285.3271

95% confidence interval of the empirical data

  • Assume normality
res <- t.test(data$WoodDeckSF, conf.level=0.95)
res$conf.int
## [1]  87.80998 100.67906
## attr(,"conf.level")
## [1] 0.95
  • Empirical 5th percentile and 95th percentile of the data
quantile(x=data$WoodDeckSF, probs=c(0.05, 0.95))
##  5% 95% 
##   0 335
  • The 5th and 95th percentiles for the exponential distribution are about 5 and 285, respectively.
  • For the empirical data they are about 0 and 335.
  • In an exponential distribution, the probability of observing exactly zero is theoretically zero. It takes on non-negative values so it makes sense that 0 is not included in the 5th percentile despite 0 occurring so frequently in the WoodDeckSF variable. This suggests that an exponential distribution is not a good fit for this variable.

Modeling

m <- lm(Y ~ ., data=train_numeric)
summary(m)
## 
## Call:
## lm(formula = Y ~ ., data = train_numeric)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -442775  -17713   -2277   15084  321710 
## 
## Coefficients: (2 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -7.038e+04  1.160e+04  -6.066 1.77e-09 ***
## MSSubClass    -1.862e+02  3.273e+01  -5.689 1.62e-08 ***
## LotFrontage   -7.440e+01  5.906e+01  -1.260 0.208025    
## LotArea        4.824e-01  1.572e-01   3.069 0.002201 ** 
## OverallQual    2.060e+04  1.320e+03  15.604  < 2e-16 ***
## OverallCond    3.015e+03  1.061e+03   2.840 0.004584 ** 
## MasVnrArea     3.726e+01  6.934e+00   5.373 9.33e-08 ***
## BsmtFinSF1     1.950e+01  5.571e+00   3.500 0.000483 ***
## BsmtFinSF2     7.756e+00  8.550e+00   0.907 0.364525    
## BsmtUnfSF      8.865e+00  4.989e+00   1.777 0.075838 .  
## TotalBsmtSF           NA         NA      NA       NA    
## X1stFlrSF      4.077e+01  6.936e+00   5.878 5.42e-09 ***
## X2ndFlrSF      3.566e+01  5.577e+00   6.394 2.33e-10 ***
## LowQualFinSF   1.900e+01  2.188e+01   0.868 0.385378    
## GrLivArea             NA         NA      NA       NA    
## BsmtFullBath   1.237e+04  3.034e+03   4.078 4.85e-05 ***
## BsmtHalfBath   4.025e+03  4.898e+03   0.822 0.411393    
## FullBath       1.634e+04  2.989e+03   5.467 5.59e-08 ***
## HalfBath       6.197e+03  2.999e+03   2.066 0.039050 *  
## BedroomAbvGr  -1.080e+04  1.957e+03  -5.521 4.16e-08 ***
## KitchenAbvGr  -1.782e+04  5.828e+03  -3.057 0.002285 ** 
## TotRmsAbvGrd   4.883e+03  1.444e+03   3.382 0.000744 ***
## Fireplaces     3.043e+03  2.132e+03   1.428 0.153637    
## GarageCars     1.355e+04  3.302e+03   4.103 4.36e-05 ***
## GarageArea    -4.063e-01  1.148e+01  -0.035 0.971776    
## WoodDeckSF     2.877e+01  9.827e+00   2.927 0.003486 ** 
## OpenPorchSF    4.016e+00  1.822e+01   0.220 0.825548    
## EnclosedPorch -2.421e+01  1.868e+01  -1.296 0.195335    
## X3SsnPorch     2.616e+01  3.775e+01   0.693 0.488494    
## ScreenPorch    4.599e+01  2.035e+01   2.261 0.023965 *  
## PoolArea      -5.827e+01  2.971e+01  -1.962 0.050057 .  
## MiscVal       -4.045e-01  5.845e+00  -0.069 0.944843    
## MoSold        -2.892e+02  4.057e+02  -0.713 0.476169    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37130 on 1164 degrees of freedom
##   (265 observations deleted due to missingness)
## Multiple R-squared:  0.8057, Adjusted R-squared:  0.8007 
## F-statistic: 160.9 on 30 and 1164 DF,  p-value: < 2.2e-16

We will remove the TotalBsmtSF and GrLivArea variables because coefficients could not be calculated for them due to singularities.

m1 <- lm(Y ~ ., data=train_numeric %>% dplyr::select(-c(TotalBsmtSF, GrLivArea)))
summary(m1)
## 
## Call:
## lm(formula = Y ~ ., data = train_numeric %>% dplyr::select(-c(TotalBsmtSF, 
##     GrLivArea)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -442775  -17713   -2277   15084  321710 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -7.038e+04  1.160e+04  -6.066 1.77e-09 ***
## MSSubClass    -1.862e+02  3.273e+01  -5.689 1.62e-08 ***
## LotFrontage   -7.440e+01  5.906e+01  -1.260 0.208025    
## LotArea        4.824e-01  1.572e-01   3.069 0.002201 ** 
## OverallQual    2.060e+04  1.320e+03  15.604  < 2e-16 ***
## OverallCond    3.015e+03  1.061e+03   2.840 0.004584 ** 
## MasVnrArea     3.726e+01  6.934e+00   5.373 9.33e-08 ***
## BsmtFinSF1     1.950e+01  5.571e+00   3.500 0.000483 ***
## BsmtFinSF2     7.756e+00  8.550e+00   0.907 0.364525    
## BsmtUnfSF      8.865e+00  4.989e+00   1.777 0.075838 .  
## X1stFlrSF      4.077e+01  6.936e+00   5.878 5.42e-09 ***
## X2ndFlrSF      3.566e+01  5.577e+00   6.394 2.33e-10 ***
## LowQualFinSF   1.900e+01  2.188e+01   0.868 0.385378    
## BsmtFullBath   1.237e+04  3.034e+03   4.078 4.85e-05 ***
## BsmtHalfBath   4.025e+03  4.898e+03   0.822 0.411393    
## FullBath       1.634e+04  2.989e+03   5.467 5.59e-08 ***
## HalfBath       6.197e+03  2.999e+03   2.066 0.039050 *  
## BedroomAbvGr  -1.080e+04  1.957e+03  -5.521 4.16e-08 ***
## KitchenAbvGr  -1.782e+04  5.828e+03  -3.057 0.002285 ** 
## TotRmsAbvGrd   4.883e+03  1.444e+03   3.382 0.000744 ***
## Fireplaces     3.043e+03  2.132e+03   1.428 0.153637    
## GarageCars     1.355e+04  3.302e+03   4.103 4.36e-05 ***
## GarageArea    -4.063e-01  1.148e+01  -0.035 0.971776    
## WoodDeckSF     2.877e+01  9.827e+00   2.927 0.003486 ** 
## OpenPorchSF    4.016e+00  1.822e+01   0.220 0.825548    
## EnclosedPorch -2.421e+01  1.868e+01  -1.296 0.195335    
## X3SsnPorch     2.616e+01  3.775e+01   0.693 0.488494    
## ScreenPorch    4.599e+01  2.035e+01   2.261 0.023965 *  
## PoolArea      -5.827e+01  2.971e+01  -1.962 0.050057 .  
## MiscVal       -4.045e-01  5.845e+00  -0.069 0.944843    
## MoSold        -2.892e+02  4.057e+02  -0.713 0.476169    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37130 on 1164 degrees of freedom
##   (265 observations deleted due to missingness)
## Multiple R-squared:  0.8057, Adjusted R-squared:  0.8007 
## F-statistic: 160.9 on 30 and 1164 DF,  p-value: < 2.2e-16

Here are the most significant variables in this model:

names(coef(m1)[summary(m1)$coef[, "Pr(>|t|)"] < 0.05])[-1]
##  [1] "MSSubClass"   "LotArea"      "OverallQual"  "OverallCond"  "MasVnrArea"  
##  [6] "BsmtFinSF1"   "X1stFlrSF"    "X2ndFlrSF"    "BsmtFullBath" "FullBath"    
## [11] "HalfBath"     "BedroomAbvGr" "KitchenAbvGr" "TotRmsAbvGrd" "GarageCars"  
## [16] "WoodDeckSF"   "ScreenPorch"

Simplifying the model to include fewer variables:

m2 <- lm(Y ~ ., data=train_numeric %>% dplyr::select(names(coef(m1)[summary(m1)$coef[, "Pr(>|t|)"] < 0.05])[-1]))
summary(m2) 
## 
## Call:
## lm(formula = Y ~ ., data = train_numeric %>% dplyr::select(names(coef(m1)[summary(m1)$coef[, 
##     "Pr(>|t|)"] < 0.05])[-1]))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -462356  -16562   -2390   14903  289054 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.354e+04  9.526e+03  -7.720 2.17e-14 ***
## MSSubClass   -1.655e+02  2.633e+01  -6.285 4.35e-10 ***
## LotArea       3.911e-01  1.007e-01   3.883 0.000108 ***
## OverallQual   2.122e+04  1.078e+03  19.688  < 2e-16 ***
## OverallCond   2.897e+03  8.817e+02   3.286 0.001042 ** 
## MasVnrArea    3.462e+01  5.954e+00   5.814 7.50e-09 ***
## BsmtFinSF1    1.209e+01  3.003e+00   4.025 6.00e-05 ***
## X1stFlrSF     4.875e+01  4.636e+00  10.516  < 2e-16 ***
## X2ndFlrSF     3.598e+01  4.546e+00   7.914 4.95e-15 ***
## BsmtFullBath  1.110e+04  2.423e+03   4.580 5.06e-06 ***
## FullBath      1.384e+04  2.531e+03   5.468 5.36e-08 ***
## HalfBath      4.894e+03  2.505e+03   1.954 0.050880 .  
## BedroomAbvGr -1.072e+04  1.677e+03  -6.395 2.17e-10 ***
## KitchenAbvGr -2.005e+04  5.144e+03  -3.898 0.000101 ***
## TotRmsAbvGrd  4.996e+03  1.228e+03   4.068 5.00e-05 ***
## GarageCars    1.326e+04  1.671e+03   7.932 4.31e-15 ***
## WoodDeckSF    3.017e+01  7.989e+00   3.777 0.000165 ***
## ScreenPorch   4.653e+01  1.699e+01   2.739 0.006230 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35370 on 1434 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.8033, Adjusted R-squared:  0.801 
## F-statistic: 344.5 on 17 and 1434 DF,  p-value: < 2.2e-16

Create test model results

test.df <- read.csv("test.csv")

# get the numeric columns of the test set.
exclude_cols <- c("YrSold", "GarageYrBlt", "YearRemodAdd", "YearBuilt", "Id")
test_numeric <- dplyr::select_if(test.df, is.numeric) %>% dplyr::select(-all_of(exclude_cols))

# Predict house sale price
X_pred <- test_numeric
y_pred <- predict(m2, X_pred)
med <- median(y_pred, na.rm = TRUE)
y_pred[is.na(y_pred)] <- med

res <- data.frame(Id=test.df$Id, SalePrice=y_pred)

#write.csv(res, "my_predictions.csv", row.names=FALSE)