House Prices: Advanced Regression Techniques competition

https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.

train <- read.csv("house-prices-advanced-regression-techniques/train.csv")
print(str(train))
## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr  NA NA NA NA ...
##  $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
##  $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr  "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
##  $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
##  $ Exterior2nd  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
##  $ MasVnrType   : chr  "BrkFace" "None" "BrkFace" "None" ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : chr  "Gd" "TA" "Gd" "TA" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr  "PConc" "CBlock" "PConc" "BrkTil" ...
##  $ BsmtQual     : chr  "Gd" "Gd" "Gd" "TA" ...
##  $ BsmtCond     : chr  "TA" "TA" "TA" "Gd" ...
##  $ BsmtExposure : chr  "No" "Gd" "Mn" "No" ...
##  $ BsmtFinType1 : chr  "GLQ" "ALQ" "GLQ" "ALQ" ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : chr  "Unf" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr  "Ex" "Ex" "Ex" "Gd" ...
##  $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : chr  NA "TA" "TA" "Gd" ...
##  $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Detchd" ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : chr  "RFn" "RFn" "RFn" "Unf" ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr  NA NA NA NA ...
##  $ Fence        : chr  NA NA NA NA ...
##  $ MiscFeature  : chr  NA NA NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
## NULL

Since I have to pick a quantitative independent and dependent variable, I printed out all the train column types and sample data to help with my decision.
The independent variable (X) I choose was GarageArea because I thought that would be a good indicator for the houses’ values. I would assume a bigger garage (one that can fit more cars) indicts a bigger or fancier house. Having a garage in general can be seen as a luxury itself.
My dependent variable (Y) will be SalePrice as it is ‘the target variable that you’re trying to predict’ as the Kaggle page says and it will be the perfect variable to test my X against.
Below is a plot to show that my X variable is skewed to the right. I did see some other variables are more skewed to the right but I figured GarageArea was skewed enough and I was interested to use it.

library(tidyverse)
ggplot(data = train, aes(x = GarageArea)) +
  geom_histogram()

X = train$GarageArea
Y = train$SalePrice

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3d quartile of the X variable, and the small letter “y” is estimated as the 2d quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.

summary(X)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   334.5   480.0   473.0   576.0  1418.0
x = summary(X)["3rd Qu."]
summary(Y)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000
y = summary(Y)["Median"]

Above is the summaries of the variables X and Y where we can get the min, 1st Q, 2nd Q/Median, Mean, 3rd Q, and Max. Then I set our x and y to their respective integers.

  1. P(X>x | Y>y)
prob_a = sum(X>x & Y>y) / sum(Y>y)
cat("P(X>x | Y>y):", prob_a, "\n")
## P(X>x | Y>y): 0.4065934
sum(X>x)#num of houses where the garagearea is bigger than 3rd q - 576 sq ft
## [1] 352
sum(Y>y)#num of houses where the saleprice is bigger than 2nd q - $163,000
## [1] 728
sum(X>x & Y>y)#num of houses that fit both conditions
## [1] 296

Around 40.66% of the houses with a SalePrice higher than the 2nd Q ($163,000) have a GarageArea bigger than the 3rd Q (576 sq ft).

  1. P(X>x, Y>y)
prob_b = sum(X>x & Y>y) / nrow(train)
cat("P(X>x, Y>y):", prob_b)
## P(X>x, Y>y): 0.2027397

Around 20.27% of the houses that have a SalePrice higher than the 2nd Q ($163,000) and a GarageArea bigger than the 3rd Q (576 sq ft) make up the whole dataset. So the probability of a house from our dataset meeting both conditions is 20.27%.

  1. P(X<x | Y>y)
prob_c = sum(X<x & Y>y) / sum(Y>y)
cat("P(X<x | Y>y):", prob_c)
## P(X<x | Y>y): 0.5755495

Around 57.55% of the houses with a SalePrice higher than the 2nd Q ($163,000) have a GarageArea smaller than the 3rd Q (576 sq ft).

prob_table = matrix(nrow = 3, ncol = 3)
prob_table <- addmargins(table(X>x, Y>y), FUN = sum)
## Margins computed over dimensions
## in the following order:
## 1: 
## 2:
colnames(prob_table) <- c("<=2d quartile", ">2d quartile", "Total")
rownames(prob_table) <- c("<=3d quartile", ">3d quartile", "Total") 

prob_table
##                
##                 <=2d quartile >2d quartile Total
##   <=3d quartile           676          432  1108
##   >3d quartile             56          296   352
##   Total                   732          728  1460

Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 3d quartile for X, and let B be the new variable counting those observations above the 2d quartile for Y.
Does P(A|B)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.

A = X > x
B = Y > y

#Does P(A|B)=P(A)P(B)?
A_B = sum(A*B) / sum(B) 
AB = (sum(A)/nrow(train)) * (sum(B)/nrow(train))
cat("P(A|B)=P(A)P(B) =>", A_B, "=?", AB, " => NO")
## P(A|B)=P(A)P(B) => 0.4065934 =? 0.1202177  => NO
#Chi Square test
chisq.test(table(A, B))
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(A, B)
## X-squared = 215.56, df = 1, p-value < 2.2e-16

Since P(A|B) does NOT equal P(A)P(B), splitting the data did not make them independent. This was then confirmed by the Chi Square test with the p-value being lower than 0.05.

Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot of X and Y. Provide a 95% CI for the difference in the mean of the variables. Derive a correlation matrix for two of the quantitative variables you selected. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.

summary(train['GarageArea'])
##    GarageArea    
##  Min.   :   0.0  
##  1st Qu.: 334.5  
##  Median : 480.0  
##  Mean   : 473.0  
##  3rd Qu.: 576.0  
##  Max.   :1418.0
ggplot(data = train, aes(x = X)) +
  geom_histogram() + 
  labs(x ='GarageArea (sq ft)', title = 'Histogram of GarageArea values') + 
  theme(plot.title = element_text(hjust = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(train['SalePrice'])
##    SalePrice     
##  Min.   : 34900  
##  1st Qu.:129975  
##  Median :163000  
##  Mean   :180921  
##  3rd Qu.:214000  
##  Max.   :755000
ggplot(data = train, aes(x = Y)) +
  geom_histogram() + 
  labs(x ='SalePrices ($)', title = 'Histogram of SalePrice values') +
  theme(plot.title = element_text(hjust = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#scatterplot of X and Y
ggplot(train, aes(x = X, y = Y)) +
  geom_point() +  
  labs(x = "GarageArea", y = "SalePrice", title = "GarageArea vs. SalePrice") +
  theme(plot.title = element_text(hjust = 0.5))

#95% CI for the difference in the mean of the variables
CI_95 = t.test(X, Y, conf.level = 0.95)
print(CI_95)
## 
##  Welch Two Sample t-test
## 
## data:  X and Y
## t = -86.791, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -184526.6 -176369.8
## sample estimates:
##   mean of x   mean of y 
##    472.9801 180921.1959
#Correlation matrix for two of the quantitative variables you selected. 
cor_matrix = cor(train[, c('GarageArea','SalePrice')])
print(cor_matrix)
##            GarageArea SalePrice
## GarageArea  1.0000000 0.6234314
## SalePrice   0.6234314 1.0000000
#Test the correlation between X and Y with a 99% CI
cor_test <- cor.test(train$GarageArea, train$SalePrice, method = "pearson", conf.level = 0.99)
print(cor_test)
## 
##  Pearson's product-moment correlation
## 
## data:  train$GarageArea and train$SalePrice
## t = 30.446, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  0.5804338 0.6629623
## sample estimates:
##       cor 
## 0.6234314

We can see from the correlation matrix that there is a correlation between GarageArea and SalePrice because the values we received for GarageArea~SalePrice and SalePrice~GarageArea are the same (0.6234314). Additionally, the P-Value of the correlation between X and Y with a 99% CI is 2.2e-16 which is lower than 0.05. Having a lower P-Value than 0.05 is a good thing as then we have enough evidence/data to rule out the null hypothesis.

Linear Algebra and Correlation

Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct principle components analysis (research this!) and interpret. Discuss.

prec_matrix <- solve(cor_matrix)
print(prec_matrix)
##            GarageArea SalePrice
## GarageArea   1.635769 -1.019790
## SalePrice   -1.019790  1.635769
print(cor_matrix%*%prec_matrix)
##            GarageArea SalePrice
## GarageArea          1         0
## SalePrice           0         1
print(prec_matrix%*%cor_matrix)
##            GarageArea SalePrice
## GarageArea          1         0
## SalePrice           0         1

Above is the precision matrix nad the multiplication of the two in both orders (correlation matrix by precision matrix and precision matrix by correlation matrix). Since we are dealing with precision matrix/inverse of the correlation matrix, both orders bring back the identify matrix.

A principle components analysis is a procedure/method that is used to reduced large data sets by summarizing its content to from an easier to analyze smaller data set.

corr_var <- cbind(X, Y) %>%
  as.matrix()
pca_var <- prcomp(corr_var, center = TRUE, scale. = TRUE)
print(summary(pca_var))
## Importance of components:
##                           PC1    PC2
## Standard deviation     1.2741 0.6137
## Proportion of Variance 0.8117 0.1883
## Cumulative Proportion  0.8117 1.0000

Above is the command prcomp() to create the PCA. As seen in the summary, there was two PCAs created with PCA1 representing 81.17% of the original data set and the PCA2 representing the latter portion, 18.83%. I tried to see how to get it to split more even or why it split like this, but I’m guess 81.17% of the data set were close knit so they were grouped together. The weird split would then indicate the fact that our data set is skewed to the right and that is why PCA1 represents so much.

Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. For your variable that is skewed to the right, shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

#shift it so that the minimum value is above zero
print(summary(X))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   334.5   480.0   473.0   576.0  1418.0
#my variable X's min is 0.0 so I am going to add 1 to each X so there is no 0's
X_1 = X + 1
print(summary(X_1))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   335.5   481.0   474.0   577.0  1419.0
#load mass
library(MASS)

#run fitdistr - exponential probability density function
epd = fitdistr(X_1, "exponential")
print(epd)
##        rate    
##   2.109793e-03 
##  (5.521581e-05)
#Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). 
lambda = epd$estimate
cat('The optimal value of λ:', lambda, '\n')
## The optimal value of λ: 0.002109793
samples <- rexp(1000, lambda) %>%
  as.data.frame()
colnames(samples) <- c("values")

#Plot a histogram and compare it with a histogram of your original variable. 
ggplot(train, aes(x = GarageArea)) +
  geom_histogram() + 
  ggtitle("Original Histogram")

ggplot(samples, aes(x = values)) +
  geom_histogram() + 
  ggtitle("Exponential Histogram")

For the last part, the exponential distribution histogram is more skewed to the right than the original distribution histogram. I assume this is due to using the optimal/minimum impacted value of λ.

#Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). 
cat('The 5th percentiles of the exponential pdf:', quantile(samples$values, 0.05), '%\n')
## The 5th percentiles of the exponential pdf: 24.31353 %
cat('The 95th percentiles of the exponential pdf:', quantile(samples$values, 0.95), '%\n')
## The 95th percentiles of the exponential pdf: 1428.6 %
#Generate a 95% confidence interval from the empirical data, assuming normality. 
print(t.test(X, conf.level = 0.95))
## 
##  One Sample t-test
## 
## data:  X
## t = 84.528, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  462.0040 483.9563
## sample estimates:
## mean of x 
##  472.9801
#Provide the empirical 5th percentile and 95th percentile of the data.
cat('The 5th percentiles of the data:', quantile(X, 0.05), '%\n')
## The 5th percentiles of the data: 0 %
cat('The 95th percentiles of the data:', quantile(X, 0.95), '%\n')
## The 95th percentiles of the data: 850.1 %

The 5th percentile and 95th percentile of each data do not match at all. I did not understand at first, but the exponential data was skewed more to the right and the values were closer so that is why the percentiles are very close together as oppose to the real data.

Modeling

Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

train_clean <- train
nrow(train_clean)
## [1] 1460
sum(apply(is.na(train_clean[, -1]), 1, any))
## [1] 1460
#find out how many NA values in each column
colSums(is.na(train_clean))
##            Id    MSSubClass      MSZoning   LotFrontage       LotArea 
##             0             0             0           259             0 
##        Street         Alley      LotShape   LandContour     Utilities 
##             0          1369             0             0             0 
##     LotConfig     LandSlope  Neighborhood    Condition1    Condition2 
##             0             0             0             0             0 
##      BldgType    HouseStyle   OverallQual   OverallCond     YearBuilt 
##             0             0             0             0             0 
##  YearRemodAdd     RoofStyle      RoofMatl   Exterior1st   Exterior2nd 
##             0             0             0             0             0 
##    MasVnrType    MasVnrArea     ExterQual     ExterCond    Foundation 
##             8             8             0             0             0 
##      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1    BsmtFinSF1 
##            37            37            38            37             0 
##  BsmtFinType2    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF       Heating 
##            38             0             0             0             0 
##     HeatingQC    CentralAir    Electrical     X1stFlrSF     X2ndFlrSF 
##             0             0             1             0             0 
##  LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath 
##             0             0             0             0             0 
##      HalfBath  BedroomAbvGr  KitchenAbvGr   KitchenQual  TotRmsAbvGrd 
##             0             0             0             0             0 
##    Functional    Fireplaces   FireplaceQu    GarageType   GarageYrBlt 
##             0             0           690            81            81 
##  GarageFinish    GarageCars    GarageArea    GarageQual    GarageCond 
##            81             0             0            81            81 
##    PavedDrive    WoodDeckSF   OpenPorchSF EnclosedPorch    X3SsnPorch 
##             0             0             0             0             0 
##   ScreenPorch      PoolArea        PoolQC         Fence   MiscFeature 
##             0             0          1453          1179          1406 
##       MiscVal        MoSold        YrSold      SaleType SaleCondition 
##             0             0             0             0             0 
##     SalePrice 
##             0
#get rid of mostly NA columns
train_clean = train_clean[, !(colnames(train_clean) %in% c("Alley", "FireplaceQu", "PoolQC", "Fence", "MiscFeature"))]
#fix the other null values - replace NA with median of column
train_clean = train_clean %>% mutate(across(c(LotFrontage, BsmtQual, BsmtCond, GarageType, GarageYrBlt, GarageFinish, GarageQual, GarageCond), ~replace_na(., median(., na.rm=TRUE))))
colSums(is.na(train_clean))
##            Id    MSSubClass      MSZoning   LotFrontage       LotArea 
##             0             0             0             0             0 
##        Street      LotShape   LandContour     Utilities     LotConfig 
##             0             0             0             0             0 
##     LandSlope  Neighborhood    Condition1    Condition2      BldgType 
##             0             0             0             0             0 
##    HouseStyle   OverallQual   OverallCond     YearBuilt  YearRemodAdd 
##             0             0             0             0             0 
##     RoofStyle      RoofMatl   Exterior1st   Exterior2nd    MasVnrType 
##             0             0             0             0             8 
##    MasVnrArea     ExterQual     ExterCond    Foundation      BsmtQual 
##             8             0             0             0             0 
##      BsmtCond  BsmtExposure  BsmtFinType1    BsmtFinSF1  BsmtFinType2 
##             0            38            37             0            38 
##    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF       Heating     HeatingQC 
##             0             0             0             0             0 
##    CentralAir    Electrical     X1stFlrSF     X2ndFlrSF  LowQualFinSF 
##             0             1             0             0             0 
##     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath      HalfBath 
##             0             0             0             0             0 
##  BedroomAbvGr  KitchenAbvGr   KitchenQual  TotRmsAbvGrd    Functional 
##             0             0             0             0             0 
##    Fireplaces    GarageType   GarageYrBlt  GarageFinish    GarageCars 
##             0             0             0             0             0 
##    GarageArea    GarageQual    GarageCond    PavedDrive    WoodDeckSF 
##             0             0             0             0             0 
##   OpenPorchSF EnclosedPorch    X3SsnPorch   ScreenPorch      PoolArea 
##             0             0             0             0             0 
##       MiscVal        MoSold        YrSold      SaleType SaleCondition 
##             0             0             0             0             0 
##     SalePrice 
##             0
sum(is.na(train_clean))
## [1] 130
sum(apply(is.na(train_clean[, -1]), 1, any))
## [1] 48

I will be building a linear regression model for this part.
First, I need to clean up the train data to try to get the model to be as accurate as possible:
- find the number of NA values in each column
- get rid of columns that only have/have more than ~50% NA values (PoolQC, Fence, MiscFeature)
- replace NA values in most of the remainder columns with NA values with the median of that column

We can see the amount of rows with null values dropped from 1460 (all) to 48.

For the linear regression, I want to set more than one column as my x to help with the accuracy.

#linear regression 
train_lm <- lm(SalePrice ~ LotFrontage + LotArea + OverallQual + OverallCond +  YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + GrLivArea + BsmtFullBath + BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageYrBlt + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch + X3SsnPorch + ScreenPorch + PoolArea + MiscVal + MoSold + YrSold, data = train_clean)
summary(train_lm)
## 
## Call:
## lm(formula = SalePrice ~ LotFrontage + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + 
##     LowQualFinSF + GrLivArea + BsmtFullBath + BsmtHalfBath + 
##     FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + 
##     Fireplaces + GarageYrBlt + GarageCars + GarageArea + WoodDeckSF + 
##     OpenPorchSF + EnclosedPorch + X3SsnPorch + ScreenPorch + 
##     PoolArea + MiscVal + MoSold + YrSold, data = train_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -496749  -16493   -1780   14047  308293 
## 
## Coefficients: (2 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.249e+04  1.434e+06  -0.016 0.987488    
## LotFrontage    4.523e+01  4.985e+01   0.907 0.364415    
## LotArea        4.555e-01  1.035e-01   4.403 1.15e-05 ***
## OverallQual    1.631e+04  1.201e+03  13.576  < 2e-16 ***
## OverallCond    4.915e+03  1.046e+03   4.698 2.88e-06 ***
## YearBuilt      2.670e+02  6.832e+01   3.908 9.73e-05 ***
## YearRemodAdd   1.424e+02  6.947e+01   2.050 0.040521 *  
## MasVnrArea     3.008e+01  6.023e+00   4.994 6.64e-07 ***
## BsmtFinSF1     2.048e+01  4.722e+00   4.338 1.54e-05 ***
## BsmtFinSF2     9.824e+00  7.145e+00   1.375 0.169370    
## BsmtUnfSF      1.150e+01  4.237e+00   2.714 0.006727 ** 
## TotalBsmtSF           NA         NA      NA       NA    
## X1stFlrSF      4.934e+01  5.892e+00   8.374  < 2e-16 ***
## X2ndFlrSF      4.221e+01  4.922e+00   8.576  < 2e-16 ***
## LowQualFinSF   1.374e+01  2.014e+01   0.682 0.495214    
## GrLivArea             NA         NA      NA       NA    
## BsmtFullBath   8.030e+03  2.637e+03   3.045 0.002371 ** 
## BsmtHalfBath   6.371e+02  4.137e+03   0.154 0.877616    
## FullBath       2.668e+03  2.877e+03   0.927 0.353942    
## HalfBath      -2.061e+03  2.705e+03  -0.762 0.446194    
## BedroomAbvGr  -8.794e+03  1.717e+03  -5.123 3.43e-07 ***
## KitchenAbvGr  -2.420e+04  4.949e+03  -4.890 1.12e-06 ***
## TotRmsAbvGrd   5.849e+03  1.248e+03   4.688 3.03e-06 ***
## Fireplaces     3.903e+03  1.808e+03   2.159 0.031050 *  
## GarageYrBlt    1.091e+02  6.980e+01   1.562 0.118445    
## GarageCars     1.098e+04  2.913e+03   3.769 0.000171 ***
## GarageArea    -5.318e-01  1.005e+01  -0.053 0.957817    
## WoodDeckSF     2.354e+01  8.135e+00   2.894 0.003861 ** 
## OpenPorchSF   -3.698e+00  1.543e+01  -0.240 0.810626    
## EnclosedPorch  1.459e+01  1.712e+01   0.852 0.394164    
## X3SsnPorch     2.377e+01  3.179e+01   0.748 0.454732    
## ScreenPorch    5.644e+01  1.741e+01   3.242 0.001217 ** 
## PoolArea      -3.808e+01  2.407e+01  -1.582 0.113879    
## MiscVal       -4.212e-01  1.878e+00  -0.224 0.822511    
## MoSold         1.456e+01  3.504e+02   0.042 0.966858    
## YrSold        -5.313e+02  7.123e+02  -0.746 0.455828    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35180 on 1418 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.8075, Adjusted R-squared:  0.8031 
## F-statistic: 180.3 on 33 and 1418 DF,  p-value: < 2.2e-16
#We then get rid of any high P-value (> 0.05) means that the coefficient is likely not to equal zero/we aren't getting enough from this variable to keep in the model
train_lm <- lm(SalePrice ~  LotArea + OverallQual + OverallCond +  YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtUnfSF + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + GrLivArea + BsmtFullBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageCars + WoodDeckSF + ScreenPorch, data = train_clean)
summary(train_lm)
## 
## Call:
## lm(formula = SalePrice ~ LotArea + OverallQual + OverallCond + 
##     YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtUnfSF + 
##     TotalBsmtSF + X1stFlrSF + X2ndFlrSF + GrLivArea + BsmtFullBath + 
##     BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + 
##     GarageCars + WoodDeckSF + ScreenPorch, data = train_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -507128  -16679   -1695   13846  292939 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.027e+06  1.178e+05  -8.716  < 2e-16 ***
## LotArea       4.700e-01  1.011e-01   4.648 3.67e-06 ***
## OverallQual   1.667e+04  1.182e+03  14.102  < 2e-16 ***
## OverallCond   4.524e+03  1.019e+03   4.439 9.72e-06 ***
## YearBuilt     3.035e+02  5.317e+01   5.709 1.38e-08 ***
## YearRemodAdd  1.848e+02  6.575e+01   2.810  0.00502 ** 
## MasVnrArea    3.004e+01  5.965e+00   5.037 5.33e-07 ***
## BsmtFinSF1    1.113e+01  6.112e+00   1.822  0.06872 .  
## BsmtUnfSF     2.719e+00  6.192e+00   0.439  0.66070    
## TotalBsmtSF   8.502e+00  7.046e+00   1.207  0.22778    
## X1stFlrSF     3.221e+01  2.025e+01   1.591  0.11186    
## X2ndFlrSF     2.256e+01  1.987e+01   1.135  0.25645    
## GrLivArea     1.874e+01  1.976e+01   0.948  0.34305    
## BsmtFullBath  7.597e+03  2.482e+03   3.061  0.00225 ** 
## BedroomAbvGr -8.512e+03  1.672e+03  -5.091 4.03e-07 ***
## KitchenAbvGr -2.347e+04  4.818e+03  -4.872 1.23e-06 ***
## TotRmsAbvGrd  5.971e+03  1.234e+03   4.837 1.46e-06 ***
## Fireplaces    3.350e+03  1.772e+03   1.891  0.05886 .  
## GarageCars    1.147e+04  1.707e+03   6.722 2.59e-11 ***
## WoodDeckSF    2.280e+01  7.996e+00   2.851  0.00442 ** 
## ScreenPorch   5.011e+01  1.713e+01   2.925  0.00350 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35150 on 1431 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.8062, Adjusted R-squared:  0.8035 
## F-statistic: 297.6 on 20 and 1431 DF,  p-value: < 2.2e-16
#Round 2 of getting rid of any high P-value (> 0.05) as five more popped up
train_lm <- lm(SalePrice ~  LotArea + OverallQual + OverallCond +  YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFullBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageCars + WoodDeckSF + ScreenPorch, data = train_clean)
summary(train_lm)
## 
## Call:
## lm(formula = SalePrice ~ LotArea + OverallQual + OverallCond + 
##     YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFullBath + 
##     BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + 
##     GarageCars + WoodDeckSF + ScreenPorch, data = train_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -401316  -19636   -2489   15990  363309 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.015e+06  1.234e+05  -8.222 4.42e-16 ***
## LotArea       6.805e-01  1.048e-01   6.495 1.14e-10 ***
## OverallQual   2.199e+04  1.148e+03  19.161  < 2e-16 ***
## OverallCond   2.515e+03  1.056e+03   2.381 0.017412 *  
## YearBuilt     1.921e+02  5.439e+01   3.531 0.000427 ***
## YearRemodAdd  2.837e+02  6.860e+01   4.136 3.74e-05 ***
## MasVnrArea    4.087e+01  6.215e+00   6.576 6.74e-11 ***
## BsmtFinSF1    2.092e+01  3.018e+00   6.932 6.24e-12 ***
## BsmtFullBath  4.602e+03  2.517e+03   1.828 0.067743 .  
## BedroomAbvGr -6.677e+03  1.741e+03  -3.834 0.000131 ***
## KitchenAbvGr -2.175e+04  4.999e+03  -4.352 1.44e-05 ***
## TotRmsAbvGrd  1.402e+04  1.058e+03  13.246  < 2e-16 ***
## Fireplaces    7.779e+03  1.817e+03   4.282 1.98e-05 ***
## GarageCars    1.418e+04  1.780e+03   7.961 3.43e-15 ***
## WoodDeckSF    3.260e+01  8.385e+00   3.888 0.000106 ***
## ScreenPorch   6.005e+01  1.799e+01   3.338 0.000867 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37100 on 1436 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.7833, Adjusted R-squared:  0.781 
## F-statistic:   346 on 15 and 1436 DF,  p-value: < 2.2e-16
#Round 3 of getting rid of any high P-value (> 0.05) as one more popped up
train_lm <- lm(SalePrice ~  LotArea + OverallQual +  YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFullBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageCars + WoodDeckSF + ScreenPorch, data = train_clean)
summary(train_lm)
## 
## Call:
## lm(formula = SalePrice ~ LotArea + OverallQual + YearBuilt + 
##     YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFullBath + BedroomAbvGr + 
##     KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageCars + WoodDeckSF + 
##     ScreenPorch, data = train_clean)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -403444  -19404   -2122   16413  361821 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -9.971e+05  1.234e+05  -8.080 1.36e-15 ***
## LotArea       6.779e-01  1.049e-01   6.460 1.43e-10 ***
## OverallQual   2.218e+04  1.147e+03  19.342  < 2e-16 ***
## YearBuilt     1.278e+02  4.730e+01   2.702 0.006973 ** 
## YearRemodAdd  3.460e+02  6.351e+01   5.449 5.96e-08 ***
## MasVnrArea    4.064e+01  6.224e+00   6.529 9.17e-11 ***
## BsmtFinSF1    2.147e+01  3.014e+00   7.125 1.64e-12 ***
## BsmtFullBath  4.355e+03  2.519e+03   1.729 0.084041 .  
## BedroomAbvGr -6.293e+03  1.737e+03  -3.624 0.000301 ***
## KitchenAbvGr -2.281e+04  4.987e+03  -4.575 5.17e-06 ***
## TotRmsAbvGrd  1.372e+04  1.052e+03  13.034  < 2e-16 ***
## Fireplaces    7.827e+03  1.820e+03   4.301 1.81e-05 ***
## GarageCars    1.404e+04  1.782e+03   7.876 6.63e-15 ***
## WoodDeckSF    3.378e+01  8.384e+00   4.030 5.88e-05 ***
## ScreenPorch   6.169e+01  1.801e+01   3.426 0.000631 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37160 on 1437 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.7824, Adjusted R-squared:  0.7803 
## F-statistic: 369.1 on 14 and 1437 DF,  p-value: < 2.2e-16
#P-values are all good now
ggplot(train_lm,aes(x = train_lm$residuals)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plot(train_lm)

For the plots above the linear regression model does not look too bad as the red lines through out the plots are somewhat straight/linear. The histogram of train_lm shows the data is no longer skewed to the right and it pretty even which is good for our predictions.

Now we need to set up our submission for Kaggle using the test.csv.

test <- read.csv("house-prices-advanced-regression-techniques/test.csv")

predictions <- as.data.frame(cbind(test$Id, predict(train_lm, test) ))
colnames(predictions) <- c("Id", "SalePrice")

print(head(predictions))
##     Id SalePrice
## 1 1461  120277.8
## 2 1462  165560.6
## 3 1463  170434.4
## 4 1464  205606.0
## 5 1465  203415.5
## 6 1466  182999.9
#When I submitted the first time, I did not check for NA values and got an error as my score
sum(apply(is.na(predictions), 1, any))
## [1] 18
predictions = predictions %>% mutate(across(c(SalePrice), ~replace_na(., median(., na.rm=TRUE))))
sum(apply(is.na(predictions), 1, any))
## [1] 0
write.csv(predictions, "House_Prices_sub_Chai.csv", row.names = FALSE)

When submitting to Kaggle, I received 0.67283 for my linear regression model at first, but then played around with my variables for X to get a lower score of 0.37710.
At first I tried to add in any numeric/int variable from the dataset I thought would help, but there was not enough variables left for a more accurate prediction so I threw in almost all of them in for later submissions. The last two submissions differ based on my deletion of more too high P-Values.
You can check with my Kaggle username: aripenguin.

kaggle_sub
kaggle_sub