https://www.kaggle.com/c/house-prices-advanced-regression-techniques
Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.
train <- read.csv("house-prices-advanced-regression-techniques/train.csv")
print(str(train))
## 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : chr NA "TA" "TA" "Gd" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr NA NA NA NA ...
## $ MiscFeature : chr NA NA NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
## NULL
Since I have to pick a quantitative independent and dependent
variable, I printed out all the train column types and sample data to
help with my decision.
The independent variable (X) I choose was GarageArea because I thought
that would be a good indicator for the houses’ values. I would assume a
bigger garage (one that can fit more cars) indicts a bigger or fancier
house. Having a garage in general can be seen as a luxury itself.
My dependent variable (Y) will be SalePrice as it is ‘the target
variable that you’re trying to predict’ as the Kaggle page says and it
will be the perfect variable to test my X against.
Below is a plot to show that my X variable is skewed to the right. I did
see some other variables are more skewed to the right but I figured
GarageArea was skewed enough and I was interested to use it.
library(tidyverse)
ggplot(data = train, aes(x = GarageArea)) +
geom_histogram()
X = train$GarageArea
Y = train$SalePrice
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3d quartile of the X variable, and the small letter “y” is estimated as the 2d quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 334.5 480.0 473.0 576.0 1418.0
x = summary(X)["3rd Qu."]
summary(Y)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
y = summary(Y)["Median"]
Above is the summaries of the variables X and Y where we can get the min, 1st Q, 2nd Q/Median, Mean, 3rd Q, and Max. Then I set our x and y to their respective integers.
prob_a = sum(X>x & Y>y) / sum(Y>y)
cat("P(X>x | Y>y):", prob_a, "\n")
## P(X>x | Y>y): 0.4065934
sum(X>x)#num of houses where the garagearea is bigger than 3rd q - 576 sq ft
## [1] 352
sum(Y>y)#num of houses where the saleprice is bigger than 2nd q - $163,000
## [1] 728
sum(X>x & Y>y)#num of houses that fit both conditions
## [1] 296
Around 40.66% of the houses with a SalePrice higher than the 2nd Q ($163,000) have a GarageArea bigger than the 3rd Q (576 sq ft).
prob_b = sum(X>x & Y>y) / nrow(train)
cat("P(X>x, Y>y):", prob_b)
## P(X>x, Y>y): 0.2027397
Around 20.27% of the houses that have a SalePrice higher than the 2nd Q ($163,000) and a GarageArea bigger than the 3rd Q (576 sq ft) make up the whole dataset. So the probability of a house from our dataset meeting both conditions is 20.27%.
prob_c = sum(X<x & Y>y) / sum(Y>y)
cat("P(X<x | Y>y):", prob_c)
## P(X<x | Y>y): 0.5755495
Around 57.55% of the houses with a SalePrice higher than the 2nd Q ($163,000) have a GarageArea smaller than the 3rd Q (576 sq ft).
prob_table = matrix(nrow = 3, ncol = 3)
prob_table <- addmargins(table(X>x, Y>y), FUN = sum)
## Margins computed over dimensions
## in the following order:
## 1:
## 2:
colnames(prob_table) <- c("<=2d quartile", ">2d quartile", "Total")
rownames(prob_table) <- c("<=3d quartile", ">3d quartile", "Total")
prob_table
##
## <=2d quartile >2d quartile Total
## <=3d quartile 676 432 1108
## >3d quartile 56 296 352
## Total 732 728 1460
Does splitting the training data in this fashion make them
independent? Let A be the new variable counting those observations above
the 3d quartile for X, and let B be the new variable counting those
observations above the 2d quartile for Y.
Does P(A|B)=P(A)P(B)? Check mathematically, and then evaluate by running
a Chi Square test for association.
A = X > x
B = Y > y
#Does P(A|B)=P(A)P(B)?
A_B = sum(A*B) / sum(B)
AB = (sum(A)/nrow(train)) * (sum(B)/nrow(train))
cat("P(A|B)=P(A)P(B) =>", A_B, "=?", AB, " => NO")
## P(A|B)=P(A)P(B) => 0.4065934 =? 0.1202177 => NO
#Chi Square test
chisq.test(table(A, B))
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table(A, B)
## X-squared = 215.56, df = 1, p-value < 2.2e-16
Since P(A|B) does NOT equal P(A)P(B), splitting the data did not make them independent. This was then confirmed by the Chi Square test with the p-value being lower than 0.05.
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot of X and Y. Provide a 95% CI for the difference in the mean of the variables. Derive a correlation matrix for two of the quantitative variables you selected. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.
summary(train['GarageArea'])
## GarageArea
## Min. : 0.0
## 1st Qu.: 334.5
## Median : 480.0
## Mean : 473.0
## 3rd Qu.: 576.0
## Max. :1418.0
ggplot(data = train, aes(x = X)) +
geom_histogram() +
labs(x ='GarageArea (sq ft)', title = 'Histogram of GarageArea values') +
theme(plot.title = element_text(hjust = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(train['SalePrice'])
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
ggplot(data = train, aes(x = Y)) +
geom_histogram() +
labs(x ='SalePrices ($)', title = 'Histogram of SalePrice values') +
theme(plot.title = element_text(hjust = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#scatterplot of X and Y
ggplot(train, aes(x = X, y = Y)) +
geom_point() +
labs(x = "GarageArea", y = "SalePrice", title = "GarageArea vs. SalePrice") +
theme(plot.title = element_text(hjust = 0.5))
#95% CI for the difference in the mean of the variables
CI_95 = t.test(X, Y, conf.level = 0.95)
print(CI_95)
##
## Welch Two Sample t-test
##
## data: X and Y
## t = -86.791, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -184526.6 -176369.8
## sample estimates:
## mean of x mean of y
## 472.9801 180921.1959
#Correlation matrix for two of the quantitative variables you selected.
cor_matrix = cor(train[, c('GarageArea','SalePrice')])
print(cor_matrix)
## GarageArea SalePrice
## GarageArea 1.0000000 0.6234314
## SalePrice 0.6234314 1.0000000
#Test the correlation between X and Y with a 99% CI
cor_test <- cor.test(train$GarageArea, train$SalePrice, method = "pearson", conf.level = 0.99)
print(cor_test)
##
## Pearson's product-moment correlation
##
## data: train$GarageArea and train$SalePrice
## t = 30.446, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
## 0.5804338 0.6629623
## sample estimates:
## cor
## 0.6234314
We can see from the correlation matrix that there is a correlation between GarageArea and SalePrice because the values we received for GarageArea~SalePrice and SalePrice~GarageArea are the same (0.6234314). Additionally, the P-Value of the correlation between X and Y with a 99% CI is 2.2e-16 which is lower than 0.05. Having a lower P-Value than 0.05 is a good thing as then we have enough evidence/data to rule out the null hypothesis.
Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct principle components analysis (research this!) and interpret. Discuss.
prec_matrix <- solve(cor_matrix)
print(prec_matrix)
## GarageArea SalePrice
## GarageArea 1.635769 -1.019790
## SalePrice -1.019790 1.635769
print(cor_matrix%*%prec_matrix)
## GarageArea SalePrice
## GarageArea 1 0
## SalePrice 0 1
print(prec_matrix%*%cor_matrix)
## GarageArea SalePrice
## GarageArea 1 0
## SalePrice 0 1
Above is the precision matrix nad the multiplication of the two in both orders (correlation matrix by precision matrix and precision matrix by correlation matrix). Since we are dealing with precision matrix/inverse of the correlation matrix, both orders bring back the identify matrix.
A principle components analysis is a procedure/method that is used to reduced large data sets by summarizing its content to from an easier to analyze smaller data set.
corr_var <- cbind(X, Y) %>%
as.matrix()
pca_var <- prcomp(corr_var, center = TRUE, scale. = TRUE)
print(summary(pca_var))
## Importance of components:
## PC1 PC2
## Standard deviation 1.2741 0.6137
## Proportion of Variance 0.8117 0.1883
## Cumulative Proportion 0.8117 1.0000
Above is the command prcomp() to create the PCA. As seen in the summary, there was two PCAs created with PCA1 representing 81.17% of the original data set and the PCA2 representing the latter portion, 18.83%. I tried to see how to get it to split more even or why it split like this, but I’m guess 81.17% of the data set were close knit so they were grouped together. The weird split would then indicate the fact that our data set is skewed to the right and that is why PCA1 represents so much.
Many times, it makes sense to fit a closed form distribution to data. For your variable that is skewed to the right, shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
#shift it so that the minimum value is above zero
print(summary(X))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 334.5 480.0 473.0 576.0 1418.0
#my variable X's min is 0.0 so I am going to add 1 to each X so there is no 0's
X_1 = X + 1
print(summary(X_1))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 335.5 481.0 474.0 577.0 1419.0
#load mass
library(MASS)
#run fitdistr - exponential probability density function
epd = fitdistr(X_1, "exponential")
print(epd)
## rate
## 2.109793e-03
## (5.521581e-05)
#Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)).
lambda = epd$estimate
cat('The optimal value of λ:', lambda, '\n')
## The optimal value of λ: 0.002109793
samples <- rexp(1000, lambda) %>%
as.data.frame()
colnames(samples) <- c("values")
#Plot a histogram and compare it with a histogram of your original variable.
ggplot(train, aes(x = GarageArea)) +
geom_histogram() +
ggtitle("Original Histogram")
ggplot(samples, aes(x = values)) +
geom_histogram() +
ggtitle("Exponential Histogram")
For the last part, the exponential distribution histogram is more skewed to the right than the original distribution histogram. I assume this is due to using the optimal/minimum impacted value of λ.
#Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
cat('The 5th percentiles of the exponential pdf:', quantile(samples$values, 0.05), '%\n')
## The 5th percentiles of the exponential pdf: 24.31353 %
cat('The 95th percentiles of the exponential pdf:', quantile(samples$values, 0.95), '%\n')
## The 95th percentiles of the exponential pdf: 1428.6 %
#Generate a 95% confidence interval from the empirical data, assuming normality.
print(t.test(X, conf.level = 0.95))
##
## One Sample t-test
##
## data: X
## t = 84.528, df = 1459, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 462.0040 483.9563
## sample estimates:
## mean of x
## 472.9801
#Provide the empirical 5th percentile and 95th percentile of the data.
cat('The 5th percentiles of the data:', quantile(X, 0.05), '%\n')
## The 5th percentiles of the data: 0 %
cat('The 95th percentiles of the data:', quantile(X, 0.95), '%\n')
## The 95th percentiles of the data: 850.1 %
The 5th percentile and 95th percentile of each data do not match at all. I did not understand at first, but the exponential data was skewed more to the right and the values were closer so that is why the percentiles are very close together as oppose to the real data.
Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
train_clean <- train
nrow(train_clean)
## [1] 1460
sum(apply(is.na(train_clean[, -1]), 1, any))
## [1] 1460
#find out how many NA values in each column
colSums(is.na(train_clean))
## Id MSSubClass MSZoning LotFrontage LotArea
## 0 0 0 259 0
## Street Alley LotShape LandContour Utilities
## 0 1369 0 0 0
## LotConfig LandSlope Neighborhood Condition1 Condition2
## 0 0 0 0 0
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 0 0 0 0 0
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## 0 0 0 0 0
## MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 8 8 0 0 0
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1
## 37 37 38 37 0
## BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## 38 0 0 0 0
## HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF
## 0 0 1 0 0
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 0 0 0 0 0
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## 0 0 0 0 0
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## 0 0 690 81 81
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## 81 0 0 81 81
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 0 0 0 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## 0 0 1453 1179 1406
## MiscVal MoSold YrSold SaleType SaleCondition
## 0 0 0 0 0
## SalePrice
## 0
#get rid of mostly NA columns
train_clean = train_clean[, !(colnames(train_clean) %in% c("Alley", "FireplaceQu", "PoolQC", "Fence", "MiscFeature"))]
#fix the other null values - replace NA with median of column
train_clean = train_clean %>% mutate(across(c(LotFrontage, BsmtQual, BsmtCond, GarageType, GarageYrBlt, GarageFinish, GarageQual, GarageCond), ~replace_na(., median(., na.rm=TRUE))))
colSums(is.na(train_clean))
## Id MSSubClass MSZoning LotFrontage LotArea
## 0 0 0 0 0
## Street LotShape LandContour Utilities LotConfig
## 0 0 0 0 0
## LandSlope Neighborhood Condition1 Condition2 BldgType
## 0 0 0 0 0
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd
## 0 0 0 0 0
## RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 0 0 0 0 8
## MasVnrArea ExterQual ExterCond Foundation BsmtQual
## 8 0 0 0 0
## BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 0 38 37 0 38
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC
## 0 0 0 0 0
## CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 0 1 0 0 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath
## 0 0 0 0 0
## BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 0 0 0 0 0
## Fireplaces GarageType GarageYrBlt GarageFinish GarageCars
## 0 0 0 0 0
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF
## 0 0 0 0 0
## OpenPorchSF EnclosedPorch X3SsnPorch ScreenPorch PoolArea
## 0 0 0 0 0
## MiscVal MoSold YrSold SaleType SaleCondition
## 0 0 0 0 0
## SalePrice
## 0
sum(is.na(train_clean))
## [1] 130
sum(apply(is.na(train_clean[, -1]), 1, any))
## [1] 48
I will be building a linear regression model for this part.
First, I need to clean up the train data to try to get the model to be
as accurate as possible:
- find the number of NA values in each column
- get rid of columns that only have/have more than ~50% NA values
(PoolQC, Fence, MiscFeature)
- replace NA values in most of the remainder columns with NA values with
the median of that column
We can see the amount of rows with null values dropped from 1460 (all) to 48.
For the linear regression, I want to set more than one column as my x to help with the accuracy.
#linear regression
train_lm <- lm(SalePrice ~ LotFrontage + LotArea + OverallQual + OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + GrLivArea + BsmtFullBath + BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageYrBlt + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch + X3SsnPorch + ScreenPorch + PoolArea + MiscVal + MoSold + YrSold, data = train_clean)
summary(train_lm)
##
## Call:
## lm(formula = SalePrice ~ LotFrontage + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + TotalBsmtSF + X1stFlrSF + X2ndFlrSF +
## LowQualFinSF + GrLivArea + BsmtFullBath + BsmtHalfBath +
## FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd +
## Fireplaces + GarageYrBlt + GarageCars + GarageArea + WoodDeckSF +
## OpenPorchSF + EnclosedPorch + X3SsnPorch + ScreenPorch +
## PoolArea + MiscVal + MoSold + YrSold, data = train_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -496749 -16493 -1780 14047 308293
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.249e+04 1.434e+06 -0.016 0.987488
## LotFrontage 4.523e+01 4.985e+01 0.907 0.364415
## LotArea 4.555e-01 1.035e-01 4.403 1.15e-05 ***
## OverallQual 1.631e+04 1.201e+03 13.576 < 2e-16 ***
## OverallCond 4.915e+03 1.046e+03 4.698 2.88e-06 ***
## YearBuilt 2.670e+02 6.832e+01 3.908 9.73e-05 ***
## YearRemodAdd 1.424e+02 6.947e+01 2.050 0.040521 *
## MasVnrArea 3.008e+01 6.023e+00 4.994 6.64e-07 ***
## BsmtFinSF1 2.048e+01 4.722e+00 4.338 1.54e-05 ***
## BsmtFinSF2 9.824e+00 7.145e+00 1.375 0.169370
## BsmtUnfSF 1.150e+01 4.237e+00 2.714 0.006727 **
## TotalBsmtSF NA NA NA NA
## X1stFlrSF 4.934e+01 5.892e+00 8.374 < 2e-16 ***
## X2ndFlrSF 4.221e+01 4.922e+00 8.576 < 2e-16 ***
## LowQualFinSF 1.374e+01 2.014e+01 0.682 0.495214
## GrLivArea NA NA NA NA
## BsmtFullBath 8.030e+03 2.637e+03 3.045 0.002371 **
## BsmtHalfBath 6.371e+02 4.137e+03 0.154 0.877616
## FullBath 2.668e+03 2.877e+03 0.927 0.353942
## HalfBath -2.061e+03 2.705e+03 -0.762 0.446194
## BedroomAbvGr -8.794e+03 1.717e+03 -5.123 3.43e-07 ***
## KitchenAbvGr -2.420e+04 4.949e+03 -4.890 1.12e-06 ***
## TotRmsAbvGrd 5.849e+03 1.248e+03 4.688 3.03e-06 ***
## Fireplaces 3.903e+03 1.808e+03 2.159 0.031050 *
## GarageYrBlt 1.091e+02 6.980e+01 1.562 0.118445
## GarageCars 1.098e+04 2.913e+03 3.769 0.000171 ***
## GarageArea -5.318e-01 1.005e+01 -0.053 0.957817
## WoodDeckSF 2.354e+01 8.135e+00 2.894 0.003861 **
## OpenPorchSF -3.698e+00 1.543e+01 -0.240 0.810626
## EnclosedPorch 1.459e+01 1.712e+01 0.852 0.394164
## X3SsnPorch 2.377e+01 3.179e+01 0.748 0.454732
## ScreenPorch 5.644e+01 1.741e+01 3.242 0.001217 **
## PoolArea -3.808e+01 2.407e+01 -1.582 0.113879
## MiscVal -4.212e-01 1.878e+00 -0.224 0.822511
## MoSold 1.456e+01 3.504e+02 0.042 0.966858
## YrSold -5.313e+02 7.123e+02 -0.746 0.455828
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35180 on 1418 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.8075, Adjusted R-squared: 0.8031
## F-statistic: 180.3 on 33 and 1418 DF, p-value: < 2.2e-16
#We then get rid of any high P-value (> 0.05) means that the coefficient is likely not to equal zero/we aren't getting enough from this variable to keep in the model
train_lm <- lm(SalePrice ~ LotArea + OverallQual + OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtUnfSF + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + GrLivArea + BsmtFullBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageCars + WoodDeckSF + ScreenPorch, data = train_clean)
summary(train_lm)
##
## Call:
## lm(formula = SalePrice ~ LotArea + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtUnfSF +
## TotalBsmtSF + X1stFlrSF + X2ndFlrSF + GrLivArea + BsmtFullBath +
## BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces +
## GarageCars + WoodDeckSF + ScreenPorch, data = train_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -507128 -16679 -1695 13846 292939
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.027e+06 1.178e+05 -8.716 < 2e-16 ***
## LotArea 4.700e-01 1.011e-01 4.648 3.67e-06 ***
## OverallQual 1.667e+04 1.182e+03 14.102 < 2e-16 ***
## OverallCond 4.524e+03 1.019e+03 4.439 9.72e-06 ***
## YearBuilt 3.035e+02 5.317e+01 5.709 1.38e-08 ***
## YearRemodAdd 1.848e+02 6.575e+01 2.810 0.00502 **
## MasVnrArea 3.004e+01 5.965e+00 5.037 5.33e-07 ***
## BsmtFinSF1 1.113e+01 6.112e+00 1.822 0.06872 .
## BsmtUnfSF 2.719e+00 6.192e+00 0.439 0.66070
## TotalBsmtSF 8.502e+00 7.046e+00 1.207 0.22778
## X1stFlrSF 3.221e+01 2.025e+01 1.591 0.11186
## X2ndFlrSF 2.256e+01 1.987e+01 1.135 0.25645
## GrLivArea 1.874e+01 1.976e+01 0.948 0.34305
## BsmtFullBath 7.597e+03 2.482e+03 3.061 0.00225 **
## BedroomAbvGr -8.512e+03 1.672e+03 -5.091 4.03e-07 ***
## KitchenAbvGr -2.347e+04 4.818e+03 -4.872 1.23e-06 ***
## TotRmsAbvGrd 5.971e+03 1.234e+03 4.837 1.46e-06 ***
## Fireplaces 3.350e+03 1.772e+03 1.891 0.05886 .
## GarageCars 1.147e+04 1.707e+03 6.722 2.59e-11 ***
## WoodDeckSF 2.280e+01 7.996e+00 2.851 0.00442 **
## ScreenPorch 5.011e+01 1.713e+01 2.925 0.00350 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35150 on 1431 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.8062, Adjusted R-squared: 0.8035
## F-statistic: 297.6 on 20 and 1431 DF, p-value: < 2.2e-16
#Round 2 of getting rid of any high P-value (> 0.05) as five more popped up
train_lm <- lm(SalePrice ~ LotArea + OverallQual + OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFullBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageCars + WoodDeckSF + ScreenPorch, data = train_clean)
summary(train_lm)
##
## Call:
## lm(formula = SalePrice ~ LotArea + OverallQual + OverallCond +
## YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFullBath +
## BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces +
## GarageCars + WoodDeckSF + ScreenPorch, data = train_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -401316 -19636 -2489 15990 363309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.015e+06 1.234e+05 -8.222 4.42e-16 ***
## LotArea 6.805e-01 1.048e-01 6.495 1.14e-10 ***
## OverallQual 2.199e+04 1.148e+03 19.161 < 2e-16 ***
## OverallCond 2.515e+03 1.056e+03 2.381 0.017412 *
## YearBuilt 1.921e+02 5.439e+01 3.531 0.000427 ***
## YearRemodAdd 2.837e+02 6.860e+01 4.136 3.74e-05 ***
## MasVnrArea 4.087e+01 6.215e+00 6.576 6.74e-11 ***
## BsmtFinSF1 2.092e+01 3.018e+00 6.932 6.24e-12 ***
## BsmtFullBath 4.602e+03 2.517e+03 1.828 0.067743 .
## BedroomAbvGr -6.677e+03 1.741e+03 -3.834 0.000131 ***
## KitchenAbvGr -2.175e+04 4.999e+03 -4.352 1.44e-05 ***
## TotRmsAbvGrd 1.402e+04 1.058e+03 13.246 < 2e-16 ***
## Fireplaces 7.779e+03 1.817e+03 4.282 1.98e-05 ***
## GarageCars 1.418e+04 1.780e+03 7.961 3.43e-15 ***
## WoodDeckSF 3.260e+01 8.385e+00 3.888 0.000106 ***
## ScreenPorch 6.005e+01 1.799e+01 3.338 0.000867 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37100 on 1436 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.7833, Adjusted R-squared: 0.781
## F-statistic: 346 on 15 and 1436 DF, p-value: < 2.2e-16
#Round 3 of getting rid of any high P-value (> 0.05) as one more popped up
train_lm <- lm(SalePrice ~ LotArea + OverallQual + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFullBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageCars + WoodDeckSF + ScreenPorch, data = train_clean)
summary(train_lm)
##
## Call:
## lm(formula = SalePrice ~ LotArea + OverallQual + YearBuilt +
## YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFullBath + BedroomAbvGr +
## KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageCars + WoodDeckSF +
## ScreenPorch, data = train_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -403444 -19404 -2122 16413 361821
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.971e+05 1.234e+05 -8.080 1.36e-15 ***
## LotArea 6.779e-01 1.049e-01 6.460 1.43e-10 ***
## OverallQual 2.218e+04 1.147e+03 19.342 < 2e-16 ***
## YearBuilt 1.278e+02 4.730e+01 2.702 0.006973 **
## YearRemodAdd 3.460e+02 6.351e+01 5.449 5.96e-08 ***
## MasVnrArea 4.064e+01 6.224e+00 6.529 9.17e-11 ***
## BsmtFinSF1 2.147e+01 3.014e+00 7.125 1.64e-12 ***
## BsmtFullBath 4.355e+03 2.519e+03 1.729 0.084041 .
## BedroomAbvGr -6.293e+03 1.737e+03 -3.624 0.000301 ***
## KitchenAbvGr -2.281e+04 4.987e+03 -4.575 5.17e-06 ***
## TotRmsAbvGrd 1.372e+04 1.052e+03 13.034 < 2e-16 ***
## Fireplaces 7.827e+03 1.820e+03 4.301 1.81e-05 ***
## GarageCars 1.404e+04 1.782e+03 7.876 6.63e-15 ***
## WoodDeckSF 3.378e+01 8.384e+00 4.030 5.88e-05 ***
## ScreenPorch 6.169e+01 1.801e+01 3.426 0.000631 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37160 on 1437 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.7824, Adjusted R-squared: 0.7803
## F-statistic: 369.1 on 14 and 1437 DF, p-value: < 2.2e-16
#P-values are all good now
ggplot(train_lm,aes(x = train_lm$residuals)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
plot(train_lm)
For the plots above the linear regression model does not look too bad as the red lines through out the plots are somewhat straight/linear. The histogram of train_lm shows the data is no longer skewed to the right and it pretty even which is good for our predictions.
Now we need to set up our submission for Kaggle using the test.csv.
test <- read.csv("house-prices-advanced-regression-techniques/test.csv")
predictions <- as.data.frame(cbind(test$Id, predict(train_lm, test) ))
colnames(predictions) <- c("Id", "SalePrice")
print(head(predictions))
## Id SalePrice
## 1 1461 120277.8
## 2 1462 165560.6
## 3 1463 170434.4
## 4 1464 205606.0
## 5 1465 203415.5
## 6 1466 182999.9
#When I submitted the first time, I did not check for NA values and got an error as my score
sum(apply(is.na(predictions), 1, any))
## [1] 18
predictions = predictions %>% mutate(across(c(SalePrice), ~replace_na(., median(., na.rm=TRUE))))
sum(apply(is.na(predictions), 1, any))
## [1] 0
write.csv(predictions, "House_Prices_sub_Chai.csv", row.names = FALSE)
When submitting to Kaggle, I received 0.67283 for my linear
regression model at first, but then played around with my variables for
X to get a lower score of 0.37710.
At first I tried to add in any numeric/int variable from the dataset I
thought would help, but there was not enough variables left for a more
accurate prediction so I threw in almost all of them in for later
submissions. The last two submissions differ based on my deletion of
more too high P-Values.
You can check with my Kaggle username: aripenguin.