Final Exam
Problem 1
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of μ=σ=(N+1)/2.
5 points. Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
a. P(X>x | X>y)
b. P(X>x, Y>y)
c. P(X<x | X>y)
set.seed(100)
N <- 15
#10,000 random uniform numbers
X <- runif(10000,min=1,max=N)
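#10,000 random normal numbers
#note: sd is not set here, so rnorm() uses its default sd = 1 rather than the σ=(N+1)/2 stated in the problem; the results below reflect sd = 1
Y <- rnorm(10000,mean=(N+1)/2)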
#median of X (10,000 random uniform numbers)
x <- median(X)
#summary returns the quartile values; the 2nd value returned is the 1st quartile
y <- summary(Y)[2]
df <- data.frame(X,Y)
a P(X>x | X>y)
This reads as the probability X is greater than x, the median, given X is greater than y, the first quartile value.
By the definition of conditional probability, this equals the probability that X is greater than both x and y, divided by the probability that X is greater than y. \(P(X>x|X>y)=\frac{P(X>x \cap X>y)}{P(X>y)}\)
Here we see that 92% of the values are greater than the median, given that they are greater than the first quartile value, y.
## [1] 0.9191176
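The code behind the value above is not shown, so the following is only a minimal base-R sketch of how such a conditional probability could be estimated. The helper `cond_prob` is a hypothetical name introduced for illustration, and `quantile(Y, 0.25)` is assumed to stand in for `summary(Y)[2]`.

```r
# Hypothetical helper: estimate P(A | B) from two logical vectors
cond_prob <- function(a, b) {
  sum(a & b) / sum(b)   # count of "A and B" divided by count of "B"
}

set.seed(100)
N <- 15
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean = (N + 1) / 2)   # sd left at its default of 1, as in the code above
x <- median(X)
y <- quantile(Y, 0.25)                  # assumed equivalent to summary(Y)[2]

p_a <- cond_prob(X > x, X > y)   # P(X > x | X > y)
p_c <- cond_prob(X < x, X > y)   # P(X < x | X > y)
p_a + p_c                        # the two conditional probabilities are complements
```

Because both probabilities condition on the same event, they should sum to 1, which gives a quick sanity check on parts a and c.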
b P(X>x, Y>y)
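Here we’re looking for the joint probability that X is greater than the median, x, and Y is greater than the first quartile value, y. Because X and Y were generated independently, this joint probability factors into the product of the two marginal probabilities. \(P(X>x, Y>y) = P(X>x)P(Y>y)\)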
Xgreaterx <- nrow(df %>% filter(X > x))/10000
Ygreatery <- nrow(df %>% filter(Y > y))/10000
Xgreaterx*Ygreatery
## [1] 0.375
c P(X<x | X>y)
This reads as the probability that X is less than x given that X is greater than y. Again, by the definition of conditional probability: \(P(X<x|X>y)=\frac{P(X<x \cap X>y)}{P(X>y)}\)
About 8% of the values of X are less than the median, given that X is greater than the first quartile value, y. This is the complement of part a: 1 - 0.9191176 = 0.0808824.
top <- nrow(df %>% filter(X < x & X > y))/10000
#the denominator must condition on X > y (not Y > y)
Xgreatery <- nrow(df %>% filter(X > y))/10000
top/Xgreatery
## [1] 0.08088235
Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
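By looking at the table below we see that P(X>x and Y>y) = 0.3755.
Also using the table we see P(X>x)*P(Y>y) is 0.375. \(= P(X>x)*P(Y>y)\) \(= 0.5 * 0.75\) \(= 0.375\)
The joint probability (0.3755) and the product of the marginal probabilities (0.375) are nearly equal, which is consistent with X and Y being independent.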
Xgreaterx <- nrow(df %>% filter(X > x))/10000
Xlessx <- nrow(df %>% filter(X < x))/10000
Ygreatery <- nrow(df %>% filter(Y > y))/10000
Ylessy <- nrow(df %>% filter(Y < y))/10000
XlessxAndYgreatery <- nrow(df %>% filter(X < x,Y > y))/10000
XlessxAndYlessy <- nrow(df %>% filter(X < x,Y < y))/10000
XgreaterxAndYgreatery <- nrow(df %>% filter(X > x,Y > y))/10000
XgreaterxAndYlessy <-nrow(df %>% filter(X > x,Y < y))/10000
table <- matrix(c(XlessxAndYlessy,XgreaterxAndYlessy,Ylessy,XlessxAndYgreatery,XgreaterxAndYgreatery,Ygreatery,Xlessx,Xgreaterx,1),nrow=3)
colnames(table) = c('Y<y','Y>y','')
rownames(table) = c('X<x','X>x','')
table
##        Y<y    Y>y
## X<x 0.1255 0.3745 0.5
## X>x 0.1245 0.3755 0.5
##     0.2500 0.7500 1.0
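The same joint and marginal table can also be built without counting each cell by hand. The sketch below uses base R's `table()`, `prop.table()` and `addmargins()`; the object names are illustrative, and `quantile(Y, 0.25)` is assumed to match the first quartile used above.

```r
set.seed(100)
N <- 15
X <- runif(10000, min = 1, max = N)
Y <- rnorm(10000, mean = (N + 1) / 2)
x <- median(X)
y <- quantile(Y, 0.25)

# Cross-tabulate the two events and convert counts to proportions
counts <- table(X_gt_x = X > x, Y_gt_y = Y > y)
joint  <- prop.table(counts)

# addmargins() appends the marginal probabilities as a "Sum" row and column
addmargins(joint)

# Compare the joint probability with the product of the marginals
p_joint   <- joint["TRUE", "TRUE"]
p_product <- mean(X > x) * mean(Y > y)
c(joint = p_joint, product = p_product)
```

If X and Y are independent, the two numbers in the last line should differ only by sampling noise.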
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Both tests are used to determine if two categorical variables have an association. Fisher’s test is exact and is typically used with small sample sizes, while the Chi-Square test is an approximation that works well with larger sample sizes. Since we have a large sample size, we should use the Chi-Square test.
The two tests give nearly identical p-values: 0.8354 for Fisher’s test and 0.8353 for the Chi-Square test. Since these values are greater than 0.05, we cannot reject the null hypothesis that there is no association between the two categorical variables Y and X.
##
## Fisher's Exact Test for Count Data
##
## data: sub
## p-value = 0.8354
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9222661 1.1076494
## sample estimates:
## odds ratio
## 1.010724
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: sub
## X-squared = 0.0432, df = 1, p-value = 0.8353
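The calls that produced the output above are not shown. As a sketch, both tests can be run on a 2x2 table of counts. Here the counts are recovered from the proportions in the table above multiplied by 10,000, and the object is named `sub` only on the assumption that it matches the `data: sub` label in the output.

```r
# Counts recovered from the table of proportions above (proportion x 10,000)
sub <- matrix(c(1255, 1245, 3745, 3755), nrow = 2,
              dimnames = list(c("X<x", "X>x"), c("Y<y", "Y>y")))

fisher_res <- fisher.test(sub)   # exact test for a 2x2 table
chisq_res  <- chisq.test(sub)    # Yates' continuity correction is the default for 2x2 tables

fisher_res$p.value
chisq_res$p.value
chisq_res$expected               # expected counts under independence
```

The expected counts are all far above 5, which is the usual rule of thumb for the Chi-Square approximation to be reliable.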
Problem 2
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques.
## Parsed with column specification:
## cols(
## .default = col_character(),
## Id = col_double(),
## MSSubClass = col_double(),
## LotFrontage = col_double(),
## LotArea = col_double(),
## OverallQual = col_double(),
## OverallCond = col_double(),
## YearBuilt = col_double(),
## YearRemodAdd = col_double(),
## MasVnrArea = col_double(),
## BsmtFinSF1 = col_double(),
## BsmtFinSF2 = col_double(),
## BsmtUnfSF = col_double(),
## TotalBsmtSF = col_double(),
## `1stFlrSF` = col_double(),
## `2ndFlrSF` = col_double(),
## LowQualFinSF = col_double(),
## GrLivArea = col_double(),
## BsmtFullBath = col_double(),
## BsmtHalfBath = col_double(),
## FullBath = col_double()
## # ... with 18 more columns
## )
## See spec(...) for full column specifications.
5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 Length:1460 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 Class :character 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 Mode :character Median : 69.00
## Mean : 730.5 Mean : 56.9 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape
## Min. : 1300 Length:1460 Length:1460 Length:1460
## 1st Qu.: 7554 Class :character Class :character Class :character
## Median : 9478 Mode :character Mode :character Mode :character
## Mean : 10517
## 3rd Qu.: 11602
## Max. :215245
##
## LandContour Utilities LotConfig LandSlope
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Neighborhood Condition1 Condition2 BldgType
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## HouseStyle OverallQual OverallCond YearBuilt
## Length:1460 Min. : 1.000 Min. :1.000 Min. :1872
## Class :character 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Mode :character Median : 6.000 Median :5.000 Median :1973
## Mean : 6.099 Mean :5.575 Mean :1971
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## Max. :10.000 Max. :9.000 Max. :2010
##
## YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1950 Length:1460 Length:1460 Length:1460
## 1st Qu.:1967 Class :character Class :character Class :character
## Median :1994 Mode :character Mode :character Mode :character
## Mean :1985
## 3rd Qu.:2004
## Max. :2010
##
## Exterior2nd MasVnrType MasVnrArea ExterQual
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 0.0 Mode :character
## Mean : 103.7
## 3rd Qu.: 166.0
## Max. :1600.0
## NA's :8
## ExterCond Foundation BsmtQual BsmtCond
## Length:1460 Length:1460 Length:1460 Length:1460
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Length:1460 Length:1460 Min. : 0.0 Length:1460
## Class :character Class :character 1st Qu.: 0.0 Class :character
## Mode :character Mode :character Median : 383.5 Mode :character
## Mean : 443.6
## 3rd Qu.: 712.2
## Max. :5644.0
##
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Length:1460
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 Class :character
## Median : 0.00 Median : 477.5 Median : 991.5 Mode :character
## Mean : 46.55 Mean : 567.2 Mean :1057.4
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2
## Max. :1474.00 Max. :2336.0 Max. :6110.0
##
## HeatingQC CentralAir Electrical 1stFlrSF
## Length:1460 Length:1460 Length:1460 Min. : 334
## Class :character Class :character Class :character 1st Qu.: 882
## Mode :character Mode :character Mode :character Median :1087
## Mean :1163
## 3rd Qu.:1391
## Max. :4692
##
## 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath
## Min. : 0 Min. : 0.000 Min. : 334 Min. :0.0000
## 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130 1st Qu.:0.0000
## Median : 0 Median : 0.000 Median :1464 Median :0.0000
## Mean : 347 Mean : 5.845 Mean :1515 Mean :0.4253
## 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777 3rd Qu.:1.0000
## Max. :2065 Max. :572.000 Max. :5642 Max. :3.0000
##
## BsmtHalfBath FullBath HalfBath BedroomAbvGr
## Min. :0.00000 Min. :0.000 Min. :0.0000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:2.000
## Median :0.00000 Median :2.000 Median :0.0000 Median :3.000
## Mean :0.05753 Mean :1.565 Mean :0.3829 Mean :2.866
## 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :2.00000 Max. :3.000 Max. :2.0000 Max. :8.000
##
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Length:1460 Min. : 2.000 Length:1460
## 1st Qu.:1.000 Class :character 1st Qu.: 5.000 Class :character
## Median :1.000 Mode :character Median : 6.000 Mode :character
## Mean :1.047 Mean : 6.518
## 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :3.000 Max. :14.000
##
## Fireplaces FireplaceQu GarageType GarageYrBlt
## Min. :0.000 Length:1460 Length:1460 Min. :1900
## 1st Qu.:0.000 Class :character Class :character 1st Qu.:1961
## Median :1.000 Mode :character Mode :character Median :1980
## Mean :0.613 Mean :1979
## 3rd Qu.:1.000 3rd Qu.:2002
## Max. :3.000 Max. :2010
## NA's :81
## GarageFinish GarageCars GarageArea GarageQual
## Length:1460 Min. :0.000 Min. : 0.0 Length:1460
## Class :character 1st Qu.:1.000 1st Qu.: 334.5 Class :character
## Mode :character Median :2.000 Median : 480.0 Mode :character
## Mean :1.767 Mean : 473.0
## 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :4.000 Max. :1418.0
##
## GarageCond PavedDrive WoodDeckSF OpenPorchSF
## Length:1460 Length:1460 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 0.00 1st Qu.: 0.00
## Mode :character Mode :character Median : 0.00 Median : 25.00
## Mean : 94.24 Mean : 46.66
## 3rd Qu.:168.00 3rd Qu.: 68.00
## Max. :857.00 Max. :547.00
##
## EnclosedPorch 3SsnPorch ScreenPorch PoolArea
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.00 Median : 0.00 Median : 0.000
## Mean : 21.95 Mean : 3.41 Mean : 15.06 Mean : 2.759
## 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :552.00 Max. :508.00 Max. :480.00 Max. :738.000
##
## PoolQC Fence MiscFeature MiscVal
## Length:1460 Length:1460 Length:1460 Min. : 0.00
## Class :character Class :character Class :character 1st Qu.: 0.00
## Mode :character Mode :character Mode :character Median : 0.00
## Mean : 43.49
## 3rd Qu.: 0.00
## Max. :15500.00
##
## MoSold YrSold SaleType SaleCondition
## Min. : 1.000 Min. :2006 Length:1460 Length:1460
## 1st Qu.: 5.000 1st Qu.:2007 Class :character Class :character
## Median : 6.000 Median :2008 Mode :character Mode :character
## Mean : 6.322 Mean :2008
## 3rd Qu.: 8.000 3rd Qu.:2009
## Max. :12.000 Max. :2010
##
## SalePrice
## Min. : 34900
## 1st Qu.:129975
## Median :163000
## Mean :180921
## 3rd Qu.:214000
## Max. :755000
##
- There appears to be a positive relationship between gross living area and sale price. The scatter has a definite cone shape, with the spread of sale price widening as living area increases.
- Year built and sale price may have a weak positive relationship.
- Lot area and sale price don’t appear to have a strong relationship.
- All other independent variables appear to have weak relationships with one another.
Let’s view the sales price alone:
* The housing sale price is right skewed, which makes sense. There are some outlier mansions that are probably very expensive.
Using the Shapiro test we see the p-value is under 0.05. This tells us that we have to reject the null hypothesis, which is that the SalePrice is normally distributed.
Using powerTransform, we see that the optimal transformation raises SalePrice to a power of approximately -0.07692374. Even with this transformation, SalePrice still does not pass the Shapiro test.
For the purposes of our analysis, we will move forward using the untransformed SalePrice.
##
## Shapiro-Wilk normality test
##
## data: train$SalePrice
## W = 0.86967, p-value < 2.2e-16
## train$SalePrice
## -0.07692374
##
## Shapiro-Wilk normality test
##
## data: train$SalePrice_trans
## W = 0.99153, p-value = 1.905e-07
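An estimated power close to zero is usually read as pointing toward a log transform, since the Box-Cox family defines the case λ = 0 as the logarithm. The sketch below illustrates this on simulated right-skewed data, not the Kaggle data; `box_cox` is a hypothetical helper, and the original transformation may have been applied differently.

```r
set.seed(1)
# Simulated right-skewed "prices" (illustrative only)
price <- rlnorm(1000, meanlog = 12, sdlog = 0.4)

# Box-Cox transform for a given lambda; lambda = 0 is defined as log()
box_cox <- function(v, lambda) {
  if (abs(lambda) < 1e-8) log(v) else (v^lambda - 1) / lambda
}

p_raw <- shapiro.test(price)$p.value               # skewed data
p_log <- shapiro.test(box_cox(price, 0))$p.value   # log-transformed data
c(raw = p_raw, log = p_log)
```

With large samples the Shapiro-Wilk test rejects even for small departures from normality, so a low p-value after transformation does not by itself mean the transformation was unhelpful.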
Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
train_subset <- select(train,"YearBuilt","GrLivArea","LotArea")
cor <- cor(train_subset)
corrplot(cor,type='upper',order="hclust")
Since our confidence level is 80%, we use a significance level of 0.2: if the p-value is less than 0.2, we reject the null hypothesis that the correlation between the two variables is 0. From our correlation tests we see:
* YearBuilt and Gross Living Area are correlated, with a p-value of 1.66e-14
* YearBuilt and LotArea are not significantly correlated, with a p-value of 0.587
* Gross Living Area and Lot Area are correlated, with a p-value below 2.2e-16
Our correlation plot above supports this.
I would be slightly worried about familywise error. The familywise error rate is the probability of making at least one Type 1 error across a family of tests. A Type 1 error is when we reject the null hypothesis when the null hypothesis is actually true. With three tests at a significance level of 0.2, and assuming the tests are independent, the probability of at least one Type 1 error is 1 - (0.8)^3, which equals 0.488. As the number of tests increases, a Type 1 error becomes more likely to occur. There are ways to control for this, like the Bonferroni correction, but I won’t go into that here.
##
## Pearson's product-moment correlation
##
## data: train$YearBuilt and train$GrLivArea
## t = 7.754, df = 1458, p-value = 1.66e-14
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.1665605 0.2310283
## sample estimates:
## cor
## 0.1990097
##
## Pearson's product-moment correlation
##
## data: train$YearBuilt and train$LotArea
## t = 0.54332, df = 1458, p-value = 0.587
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.01934322 0.04776648
## sample estimates:
## cor
## 0.01422765
##
## Pearson's product-moment correlation
##
## data: train$GrLivArea and train$LotArea
## t = 10.414, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2315997 0.2940809
## sample estimates:
## cor
## 0.2631162
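The calls behind the output above are not shown. The sketch below shows how an 80% interval can be requested from `cor.test()`, using simulated stand-in data, and how the familywise error rate and a Bonferroni adjustment could be computed for the three reported p-values. The independence assumption in the familywise formula is a simplification, since the three tests share variables.

```r
set.seed(42)
# Simulated stand-ins for two quantitative columns (not the Kaggle data)
a <- rnorm(200)
b <- 0.3 * a + rnorm(200)

ct <- cor.test(a, b, conf.level = 0.80)   # Pearson test of H0: correlation = 0
ct$conf.int                               # 80% confidence interval

# Familywise error rate for m independent tests at level alpha
alpha <- 0.20
m <- 3
fwer <- 1 - (1 - alpha)^m

# Bonferroni adjustment of the three p-values reported above
# (the third is reported only as "< 2.2e-16", so 2.2e-16 is an upper bound)
p_raw <- c(YearBuilt_GrLivArea = 1.66e-14,
           YearBuilt_LotArea   = 0.587,
           GrLivArea_LotArea   = 2.2e-16)
p_adj <- p.adjust(p_raw, method = "bonferroni")
p_adj < alpha
```

The two very small p-values stay far below 0.2 after adjustment, so the conclusions here would not change under a Bonferroni correction.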
5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
## YearBuilt GrLivArea LotArea
## YearBuilt 1.04293483 -0.2187973 0.04273059
## GrLivArea -0.21879727 1.1202809 -0.29165104
## LotArea 0.04273059 -0.2916510 1.07613015
## [,1] [,2] [,3]
## [1,] 1.000000e+00 0.000000e+00 0
## [2,] 8.673617e-18 1.000000e+00 0
## [3,] -1.734723e-18 -2.708339e-35 1
## [,1] [,2] [,3]
## [1,] 1 -1.561251e-17 -6.938894e-18
## [2,] 0 1.000000e+00 5.551115e-17
## [3,] 0 0.000000e+00 1.000000e+00
\(B = L*U\)
## YearBuilt GrLivArea LotArea
## YearBuilt 1.000000e+00 -1.561251e-17 -6.938894e-18
## GrLivArea 8.673617e-18 1.000000e+00 5.551115e-17
## LotArea -1.734723e-18 0.000000e+00 1.000000e+00
## [,1] [,2] [,3]
## [1,] 1.000000e+00 -1.561251e-17 -6.938894e-18
## [2,] 8.673617e-18 1.000000e+00 5.551115e-17
## [3,] -1.734723e-18 0.000000e+00 1.000000e+00
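The code for this section is not shown, so the following is a sketch only. It rebuilds the correlation matrix from the pairwise estimates reported earlier, inverts it with `solve()`, and applies a simple Doolittle LU decomposition without pivoting. `lu_decompose` is a hypothetical helper, and it is applied here to the correlation matrix itself for illustration.

```r
# Correlation matrix rebuilt from the pairwise estimates reported above
vars <- c("YearBuilt", "GrLivArea", "LotArea")
R <- matrix(c(1,          0.1990097, 0.01422765,
              0.1990097,  1,         0.2631162,
              0.01422765, 0.2631162, 1),
            nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

P <- solve(R)   # precision matrix; its diagonal holds the variance inflation factors
diag(P)

# Both products should equal the identity up to floating-point error
round(R %*% P, 10)
round(P %*% R, 10)

# Hypothetical helper: Doolittle LU decomposition without pivoting
lu_decompose <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (j in seq_len(n - 1)) {
    for (i in (j + 1):n) {
      L[i, j] <- U[i, j] / U[j, j]          # multiplier that eliminates U[i, j]
      U[i, ] <- U[i, ] - L[i, j] * U[j, ]   # row operation
    }
  }
  list(L = L, U = U)
}

lu <- lu_decompose(R)
all.equal(lu$L %*% lu$U, R, check.attributes = FALSE)
```

Skipping pivoting is acceptable here because a correlation matrix of non-collinear variables is positive definite; a general-purpose routine would need row exchanges.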
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
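#Getting mean and sd from original train data; qnorm() below returns the 5th and 95th percentiles of a normal distribution with this mean and sd (the empirical percentiles would instead come from quantile(train$GrLivArea, c(0.05, 0.95)))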
mean = mean(train$GrLivArea)
sd = sd(train$GrLivArea)
paste0("Empirical data 5th and 95th percentiles are:",round(qnorm(0.05,mean=mean,sd=sd),2),"and ",round(qnorm(0.95,mean=mean,sd=sd),2))
## [1] "Empirical data 5th and 95th percentiles are:651.13and 2379.8"
fit <- fitdistr(train$GrLivArea,densfun='exponential')
#Optimal value of lambda for exponential distribution
lambda = fit$estimate
#Make 1000 samples from this exponential distribution
fit_data = rexp(1000,lambda)
qplot(fit_data,geom='histogram',bins=20)+ggtitle("Exponential Gross Living Area Histogram")
## [1] "Exponential data 5th and 95th percentiles are:76.85and 4388.83"
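The three quantities the problem asks for can be kept separate. The sketch below uses simulated stand-in data, not the Kaggle data. It takes the exponential percentiles from the CDF with `qexp()`, builds a normal-theory 95% confidence interval for the mean (one reading of the problem statement), and reads the empirical percentiles with `quantile()`.

```r
set.seed(7)
# Simulated positive, right-skewed stand-in for the living-area variable
v <- rgamma(1460, shape = 8, rate = 8 / 1500)

# For an exponential distribution the maximum-likelihood rate is 1 / mean,
# which is the estimate MASS::fitdistr(v, "exponential") reports
lambda <- 1 / mean(v)

# 5th and 95th percentiles from the exponential CDF, F(q) = 1 - exp(-lambda * q)
exp_pct <- qexp(c(0.05, 0.95), rate = lambda)

# 95% confidence interval for the mean, assuming normality
ci_mean <- mean(v) + c(-1, 1) * qnorm(0.975) * sd(v) / sqrt(length(v))

# Empirical 5th and 95th percentiles, read directly from the data
emp_pct <- quantile(v, c(0.05, 0.95))

exp_pct
ci_mean
emp_pct
```

A large gap between the exponential percentiles and the empirical ones would suggest that the exponential distribution is a poor fit for the variable.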
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
Let’s start with the backward elimination method of removing variables from our multiple regression model. We’ll begin with a subset of the original columns, since there are too many columns to remove manually one at a time. There are also a lot of columns that likely will not make much sense or are far too specific, like Street name.
lm <- lm(SalePrice ~ LotFrontage + LotArea + OverallQual + OverallCond + YearBuilt + RoofStyle + ExterQual + Foundation + BsmtQual + TotalBsmtSF + Heating + CentralAir + GrLivArea + BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + KitchenQual + TotRmsAbvGrd + Functional + Fireplaces + GarageType + GarageArea + PavedDrive + WoodDeckSF + PoolArea + Fence + MiscVal + SaleType,data=train)
summary(lm)
##
## Call:
## lm(formula = SalePrice ~ LotFrontage + LotArea + OverallQual +
## OverallCond + YearBuilt + RoofStyle + ExterQual + Foundation +
## BsmtQual + TotalBsmtSF + Heating + CentralAir + GrLivArea +
## BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr +
## KitchenQual + TotRmsAbvGrd + Functional + Fireplaces + GarageType +
## GarageArea + PavedDrive + WoodDeckSF + PoolArea + Fence +
## MiscVal + SaleType, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -95079 -9461 737 10029 95079
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.428e+05 2.913e+05 -2.893 0.004358 **
## LotFrontage -1.084e+02 1.415e+02 -0.766 0.445003
## LotArea 1.830e+00 8.233e-01 2.223 0.027653 *
## OverallQual 1.076e+04 2.998e+03 3.590 0.000442 ***
## OverallCond 9.744e+03 1.663e+03 5.859 2.65e-08 ***
## YearBuilt 4.453e+02 1.454e+02 3.062 0.002584 **
## RoofStyleGable 1.341e+04 2.399e+04 0.559 0.576958
## RoofStyleGambrel 1.943e+04 2.654e+04 0.732 0.465059
## RoofStyleHip 1.009e+04 2.425e+04 0.416 0.677745
## RoofStyleMansard 1.768e+04 3.497e+04 0.505 0.613921
## ExterQualFa -2.140e+04 4.984e+04 -0.429 0.668286
## ExterQualGd -2.509e+04 3.935e+04 -0.638 0.524598
## ExterQualTA -3.197e+04 4.057e+04 -0.788 0.431953
## FoundationCBlock -1.051e+03 7.166e+03 -0.147 0.883624
## FoundationPConc 5.351e+03 7.229e+03 0.740 0.460266
## FoundationStone 1.228e+04 2.637e+04 0.466 0.642101
## FoundationWood -2.842e+04 2.742e+04 -1.036 0.301564
## BsmtQualFa -1.027e+05 2.391e+04 -4.294 3.06e-05 ***
## BsmtQualGd -1.039e+05 1.972e+04 -5.267 4.51e-07 ***
## BsmtQualTA -9.966e+04 2.029e+04 -4.912 2.25e-06 ***
## TotalBsmtSF 3.086e+01 8.277e+00 3.728 0.000269 ***
## HeatingGasW 7.804e+03 1.804e+04 0.433 0.665851
## HeatingGrav 2.048e+04 2.135e+04 0.959 0.338890
## CentralAirY 1.629e+04 9.392e+03 1.735 0.084789 .
## GrLivArea 6.104e+01 1.021e+01 5.979 1.46e-08 ***
## BsmtFullBath 4.223e+03 3.876e+03 1.090 0.277555
## FullBath 1.053e+04 5.565e+03 1.892 0.060266 .
## HalfBath 4.735e+03 5.005e+03 0.946 0.345587
## BedroomAbvGr -5.813e+03 3.790e+03 -1.534 0.127114
## KitchenAbvGr -6.020e+03 1.244e+04 -0.484 0.629138
## KitchenQualFa -7.253e+04 1.762e+04 -4.116 6.21e-05 ***
## KitchenQualGd -7.011e+04 1.273e+04 -5.506 1.47e-07 ***
## KitchenQualTA -6.786e+04 1.343e+04 -5.054 1.19e-06 ***
## TotRmsAbvGrd -5.247e+03 2.786e+03 -1.883 0.061519 .
## FunctionalMin1 5.810e+02 1.756e+04 0.033 0.973656
## FunctionalMin2 -5.148e+03 1.917e+04 -0.269 0.788582
## FunctionalMod -1.085e+04 2.674e+04 -0.406 0.685528
## FunctionalTyp 5.184e+03 1.636e+04 0.317 0.751712
## Fireplaces 2.064e+02 3.342e+03 0.062 0.950841
## GarageTypeAttchd 5.646e+04 2.937e+04 1.922 0.056398 .
## GarageTypeBasment 3.549e+04 3.192e+04 1.112 0.267985
## GarageTypeBuiltIn 6.155e+04 3.343e+04 1.841 0.067486 .
## GarageTypeDetchd 5.631e+04 2.963e+04 1.901 0.059198 .
## GarageArea 1.489e+01 1.362e+01 1.093 0.276013
## PavedDriveP -2.804e+03 1.258e+04 -0.223 0.823969
## PavedDriveY -5.633e+01 9.315e+03 -0.006 0.995182
## WoodDeckSF -1.487e+00 1.408e+01 -0.106 0.916014
## PoolArea 4.490e+01 2.230e+01 2.013 0.045771 *
## FenceGdWo 8.897e+03 6.203e+03 1.434 0.153473
## FenceMnPrv 9.178e+03 4.823e+03 1.903 0.058883 .
## FenceMnWw 5.153e+03 9.700e+03 0.531 0.595993
## MiscVal -1.647e+00 7.664e+00 -0.215 0.830066
## SaleTypeConLI -4.891e+02 2.688e+04 -0.018 0.985507
## SaleTypeCWD -1.900e+03 2.271e+04 -0.084 0.933412
## SaleTypeWD 1.572e+04 9.797e+03 1.604 0.110644
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22720 on 157 degrees of freedom
## (1248 observations deleted due to missingness)
## Multiple R-squared: 0.904, Adjusted R-squared: 0.871
## F-statistic: 27.39 on 54 and 157 DF, p-value: < 2.2e-16
Next, let’s remove variables with high p-values that have little impact on our model:
* Removing SaleType, MiscVal, PavedDrive, Fireplaces and Functional
* From our first try, our R-squared was over 90%, but this may be due to overfitting: the model was fit on only 212 rows, because 1248 observations were dropped due to missing values
lm2 <- lm(SalePrice ~ LotFrontage + LotArea + OverallQual + OverallCond + YearBuilt + RoofStyle + ExterQual + Foundation + BsmtQual + TotalBsmtSF + Heating + CentralAir + GrLivArea + BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + KitchenQual + TotRmsAbvGrd + GarageType + GarageArea + WoodDeckSF + PoolArea + Fence,data=train)
summary(lm2)
##
## Call:
## lm(formula = SalePrice ~ LotFrontage + LotArea + OverallQual +
## OverallCond + YearBuilt + RoofStyle + ExterQual + Foundation +
## BsmtQual + TotalBsmtSF + Heating + CentralAir + GrLivArea +
## BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr +
## KitchenQual + TotRmsAbvGrd + GarageType + GarageArea + WoodDeckSF +
## PoolArea + Fence, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96850 -9435 554 10089 96850
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.204e+05 2.814e+05 -2.916 0.004035 **
## LotFrontage -1.097e+02 1.342e+02 -0.817 0.415051
## LotArea 1.960e+00 7.888e-01 2.485 0.013931 *
## OverallQual 1.218e+04 2.722e+03 4.475 1.40e-05 ***
## OverallCond 9.270e+03 1.597e+03 5.805 3.14e-08 ***
## YearBuilt 4.424e+02 1.392e+02 3.178 0.001767 **
## RoofStyleGable 1.260e+04 2.356e+04 0.535 0.593493
## RoofStyleGambrel 2.145e+04 2.604e+04 0.824 0.411105
## RoofStyleHip 8.933e+03 2.381e+04 0.375 0.707955
## RoofStyleMansard 1.333e+04 3.332e+04 0.400 0.689668
## ExterQualFa -2.617e+04 4.486e+04 -0.583 0.560457
## ExterQualGd -2.938e+04 3.402e+04 -0.864 0.388942
## ExterQualTA -3.657e+04 3.548e+04 -1.031 0.304174
## FoundationCBlock -3.594e+02 6.928e+03 -0.052 0.958684
## FoundationPConc 7.545e+03 6.912e+03 1.091 0.276633
## FoundationStone 1.183e+04 2.583e+04 0.458 0.647584
## FoundationWood -2.748e+04 2.608e+04 -1.054 0.293491
## BsmtQualFa -1.049e+05 2.304e+04 -4.553 1.01e-05 ***
## BsmtQualGd -1.063e+05 1.921e+04 -5.535 1.18e-07 ***
## BsmtQualTA -1.010e+05 1.982e+04 -5.094 9.35e-07 ***
## TotalBsmtSF 2.776e+01 7.666e+00 3.622 0.000387 ***
## HeatingGasW 3.551e+03 1.495e+04 0.237 0.812594
## HeatingGrav 2.391e+04 2.043e+04 1.171 0.243436
## CentralAirY 1.941e+04 8.832e+03 2.197 0.029386 *
## GrLivArea 5.573e+01 9.033e+00 6.170 4.94e-09 ***
## BsmtFullBath 5.010e+03 3.722e+03 1.346 0.180043
## FullBath 1.090e+04 5.353e+03 2.037 0.043209 *
## HalfBath 6.421e+03 4.706e+03 1.365 0.174225
## BedroomAbvGr -5.745e+03 3.538e+03 -1.624 0.106288
## KitchenAbvGr -3.196e+03 1.142e+04 -0.280 0.779979
## KitchenQualFa -7.609e+04 1.712e+04 -4.446 1.59e-05 ***
## KitchenQualGd -7.479e+04 1.196e+04 -6.253 3.22e-09 ***
## KitchenQualTA -7.278e+04 1.241e+04 -5.863 2.35e-08 ***
## TotRmsAbvGrd -4.805e+03 2.524e+03 -1.904 0.058673 .
## GarageTypeAttchd 6.234e+04 2.849e+04 2.188 0.030022 *
## GarageTypeBasment 4.099e+04 3.082e+04 1.330 0.185222
## GarageTypeBuiltIn 6.886e+04 3.235e+04 2.128 0.034780 *
## GarageTypeDetchd 6.264e+04 2.864e+04 2.187 0.030103 *
## GarageArea 1.550e+01 1.283e+01 1.208 0.228751
## WoodDeckSF 3.180e-01 1.325e+01 0.024 0.980883
## PoolArea 4.832e+01 2.091e+01 2.311 0.022050 *
## FenceGdWo 9.777e+03 5.834e+03 1.676 0.095654 .
## FenceMnPrv 1.016e+04 4.492e+03 2.262 0.024974 *
## FenceMnWw 7.974e+03 9.320e+03 0.856 0.393413
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22390 on 168 degrees of freedom
## (1248 observations deleted due to missingness)
## Multiple R-squared: 0.9003, Adjusted R-squared: 0.8748
## F-statistic: 35.3 on 43 and 168 DF, p-value: < 2.2e-16
On our second try, we have an R-squared value of 90.03%. There are still variables to remove:
* Remove WoodDeckSF and KitchenAbvGr
lm3 <- lm(SalePrice ~ LotFrontage + LotArea + OverallQual + OverallCond + YearBuilt + RoofStyle + ExterQual + Foundation + BsmtQual + TotalBsmtSF + Heating + CentralAir + GrLivArea + BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + KitchenQual + TotRmsAbvGrd + GarageType + GarageArea + PoolArea + Fence,data=train)
summary(lm3)
##
## Call:
## lm(formula = SalePrice ~ LotFrontage + LotArea + OverallQual +
## OverallCond + YearBuilt + RoofStyle + ExterQual + Foundation +
## BsmtQual + TotalBsmtSF + Heating + CentralAir + GrLivArea +
## BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + KitchenQual +
## TotRmsAbvGrd + GarageType + GarageArea + PoolArea + Fence,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96992 -9500 484 10252 96992
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.326e+05 2.670e+05 -3.118 0.002138 **
## LotFrontage -1.168e+02 1.277e+02 -0.914 0.361853
## LotArea 2.002e+00 7.669e-01 2.610 0.009850 **
## OverallQual 1.228e+04 2.682e+03 4.580 8.97e-06 ***
## OverallCond 9.262e+03 1.588e+03 5.834 2.67e-08 ***
## YearBuilt 4.465e+02 1.332e+02 3.351 0.000994 ***
## RoofStyleGable 1.278e+04 2.341e+04 0.546 0.585771
## RoofStyleGambrel 2.190e+04 2.584e+04 0.847 0.397934
## RoofStyleHip 9.062e+03 2.364e+04 0.383 0.701881
## RoofStyleMansard 1.327e+04 3.306e+04 0.401 0.688568
## ExterQualFa -2.887e+04 4.355e+04 -0.663 0.508231
## ExterQualGd -2.959e+04 3.370e+04 -0.878 0.381106
## ExterQualTA -3.666e+04 3.512e+04 -1.044 0.298091
## FoundationCBlock -5.236e+02 6.815e+03 -0.077 0.938852
## FoundationPConc 7.260e+03 6.780e+03 1.071 0.285797
## FoundationStone 1.170e+04 2.566e+04 0.456 0.648852
## FoundationWood -2.725e+04 2.576e+04 -1.058 0.291537
## BsmtQualFa -1.043e+05 2.269e+04 -4.598 8.30e-06 ***
## BsmtQualGd -1.065e+05 1.903e+04 -5.593 8.76e-08 ***
## BsmtQualTA -1.010e+05 1.968e+04 -5.134 7.68e-07 ***
## TotalBsmtSF 2.807e+01 7.487e+00 3.749 0.000243 ***
## HeatingGasW 3.577e+03 1.478e+04 0.242 0.809002
## HeatingGrav 2.447e+04 2.017e+04 1.213 0.226767
## CentralAirY 1.981e+04 8.630e+03 2.296 0.022918 *
## GrLivArea 5.565e+01 8.867e+00 6.276 2.79e-09 ***
## BsmtFullBath 4.952e+03 3.616e+03 1.369 0.172673
## FullBath 1.086e+04 5.313e+03 2.044 0.042469 *
## HalfBath 6.545e+03 4.658e+03 1.405 0.161806
## BedroomAbvGr -5.351e+03 3.185e+03 -1.680 0.094746 .
## KitchenQualFa -7.652e+04 1.687e+04 -4.536 1.08e-05 ***
## KitchenQualGd -7.495e+04 1.188e+04 -6.308 2.36e-09 ***
## KitchenQualTA -7.305e+04 1.230e+04 -5.937 1.59e-08 ***
## TotRmsAbvGrd -5.037e+03 2.357e+03 -2.137 0.033995 *
## GarageTypeAttchd 6.306e+04 2.818e+04 2.238 0.026529 *
## GarageTypeBasment 4.167e+04 3.053e+04 1.365 0.174120
## GarageTypeBuiltIn 6.969e+04 3.204e+04 2.175 0.030995 *
## GarageTypeDetchd 6.338e+04 2.832e+04 2.238 0.026551 *
## GarageArea 1.552e+01 1.274e+01 1.218 0.224934
## PoolArea 4.821e+01 2.078e+01 2.320 0.021533 *
## FenceGdWo 9.878e+03 5.787e+03 1.707 0.089682 .
## FenceMnPrv 1.013e+04 4.465e+03 2.269 0.024526 *
## FenceMnWw 8.284e+03 9.156e+03 0.905 0.366910
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22260 on 170 degrees of freedom
## (1248 observations deleted due to missingness)
## Multiple R-squared: 0.9003, Adjusted R-squared: 0.8762
## F-statistic: 37.44 on 41 and 170 DF, p-value: < 2.2e-16
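The manual elimination steps above retype the full formula each time. As a sketch of an alternative, `update()` can drop terms from an existing fit and `drop1()` can test each term as a whole. The example uses the built-in `mtcars` data rather than the Kaggle data, and the chosen predictors are purely illustrative.

```r
# Illustration on the built-in mtcars data (not the Kaggle data)
fit1 <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# drop1() reports an F-test for removing each term; for factor predictors this
# tests all dummy levels together rather than one level at a time
drop1(fit1, test = "F")

# Remove one term and refit, keeping the rest of the formula unchanged
fit2 <- update(fit1, . ~ . - drat)

# Compare adjusted R-squared before and after the removal
c(full = summary(fit1)$adj.r.squared, reduced = summary(fit2)$adj.r.squared)

# Number of rows actually used; by default lm() drops rows with missing
# values in any model variable
nobs(fit1)
```

Checking `nobs()` after each refit is useful with data like these, where removing a predictor that has many missing values can change how many rows the model is fit on.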
Now let’s remove a few more variables to see if we can make our model better with fewer variables:
* Remove RoofStyle, LotFrontage, GarageArea and BsmtFullBath
## # A tibble: 6 x 82
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 1 60 RL 65 8450 Pave <NA> Reg
## 2 2 20 RL 80 9600 Pave <NA> Reg
## 3 3 60 RL 68 11250 Pave <NA> IR1
## 4 4 70 RL 60 9550 Pave <NA> IR1
## 5 5 60 RL 84 14260 Pave <NA> IR1
## 6 6 50 RL 85 14115 Pave <NA> IR1
## # … with 74 more variables: LandContour <chr>, Utilities <chr>,
## # LotConfig <chr>, LandSlope <chr>, Neighborhood <chr>, Condition1 <chr>,
## # Condition2 <chr>, BldgType <chr>, HouseStyle <chr>, OverallQual <dbl>,
## # OverallCond <dbl>, YearBuilt <dbl>, YearRemodAdd <dbl>, RoofStyle <chr>,
## # RoofMatl <chr>, Exterior1st <chr>, Exterior2nd <chr>, MasVnrType <chr>,
## # MasVnrArea <dbl>, ExterQual <chr>, ExterCond <chr>, Foundation <chr>,
## # BsmtQual <chr>, BsmtCond <chr>, BsmtExposure <chr>, BsmtFinType1 <chr>,
## # BsmtFinSF1 <dbl>, BsmtFinType2 <chr>, BsmtFinSF2 <dbl>, BsmtUnfSF <dbl>,
## # TotalBsmtSF <dbl>, Heating <chr>, HeatingQC <chr>, CentralAir <chr>,
## # Electrical <chr>, `1stFlrSF` <dbl>, `2ndFlrSF` <dbl>, LowQualFinSF <dbl>,
## # GrLivArea <dbl>, BsmtFullBath <dbl>, BsmtHalfBath <dbl>, FullBath <dbl>,
## # HalfBath <dbl>, BedroomAbvGr <dbl>, KitchenAbvGr <dbl>, KitchenQual <chr>,
## # TotRmsAbvGrd <dbl>, Functional <chr>, Fireplaces <dbl>, FireplaceQu <chr>,
## # GarageType <chr>, GarageYrBlt <dbl>, GarageFinish <chr>, GarageCars <dbl>,
## # GarageArea <dbl>, GarageQual <chr>, GarageCond <chr>, PavedDrive <chr>,
## # WoodDeckSF <dbl>, OpenPorchSF <dbl>, EnclosedPorch <dbl>,
## # `3SsnPorch` <dbl>, ScreenPorch <dbl>, PoolArea <dbl>, PoolQC <chr>,
## # Fence <chr>, MiscFeature <chr>, MiscVal <dbl>, MoSold <dbl>, YrSold <dbl>,
## # SaleType <chr>, SaleCondition <chr>, SalePrice <dbl>, SalePrice_trans <dbl>
lm4 <- lm(SalePrice ~ LotArea + OverallQual + OverallCond + YearBuilt + ExterQual + Foundation + BsmtQual + TotalBsmtSF + Heating + CentralAir + GrLivArea + FullBath + HalfBath + BedroomAbvGr + KitchenQual + TotRmsAbvGrd + GarageType + PoolArea + Fence,data=train)
summary(lm4)
##
## Call:
## lm(formula = SalePrice ~ LotArea + OverallQual + OverallCond +
## YearBuilt + ExterQual + Foundation + BsmtQual + TotalBsmtSF +
## Heating + CentralAir + GrLivArea + FullBath + HalfBath +
## BedroomAbvGr + KitchenQual + TotRmsAbvGrd + GarageType +
## PoolArea + Fence, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -94882 -9795 591 9575 94882
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.153e+05 2.337e+05 -3.917 0.000119 ***
## LotArea 2.235e+00 5.498e-01 4.065 6.63e-05 ***
## OverallQual 1.119e+04 2.116e+03 5.291 2.87e-07 ***
## OverallCond 9.132e+03 1.365e+03 6.689 1.74e-10 ***
## YearBuilt 4.992e+02 1.161e+02 4.300 2.54e-05 ***
## ExterQualFa -3.571e+04 4.040e+04 -0.884 0.377589
## ExterQualGd -2.928e+04 3.167e+04 -0.924 0.356238
## ExterQualTA -3.970e+04 3.252e+04 -1.221 0.223456
## FoundationCBlock -9.281e+02 6.125e+03 -0.152 0.879693
## FoundationPConc 5.674e+03 6.210e+03 0.914 0.361883
## FoundationStone 1.243e+03 2.381e+04 0.052 0.958412
## FoundationWood -7.242e+03 1.757e+04 -0.412 0.680620
## BsmtQualFa -1.042e+05 2.086e+04 -4.995 1.18e-06 ***
## BsmtQualGd -1.022e+05 1.755e+04 -5.824 1.96e-08 ***
## BsmtQualTA -9.890e+04 1.811e+04 -5.461 1.25e-07 ***
## TotalBsmtSF 2.468e+01 5.971e+00 4.134 5.02e-05 ***
## HeatingGasW -4.668e+03 1.369e+04 -0.341 0.733408
## HeatingGrav 2.488e+04 1.897e+04 1.312 0.190925
## CentralAirY 2.281e+04 8.123e+03 2.808 0.005419 **
## GrLivArea 6.010e+01 7.522e+00 7.990 6.86e-14 ***
## FullBath 8.313e+03 4.480e+03 1.856 0.064821 .
## HalfBath 4.058e+03 3.759e+03 1.080 0.281489
## BedroomAbvGr -4.663e+03 2.729e+03 -1.709 0.088821 .
## KitchenQualFa -7.016e+04 1.575e+04 -4.454 1.33e-05 ***
## KitchenQualGd -6.605e+04 1.066e+04 -6.197 2.70e-09 ***
## KitchenQualTA -6.612e+04 1.105e+04 -5.983 8.51e-09 ***
## TotRmsAbvGrd -4.705e+03 2.087e+03 -2.254 0.025131 *
## GarageTypeAttchd 4.825e+04 2.541e+04 1.899 0.058863 .
## GarageTypeBasment 3.791e+04 2.682e+04 1.414 0.158861
## GarageTypeBuiltIn 6.033e+04 2.753e+04 2.192 0.029424 *
## GarageTypeDetchd 4.988e+04 2.563e+04 1.946 0.052859 .
## PoolArea 4.907e+01 1.685e+01 2.912 0.003951 **
## FenceGdWo 1.040e+04 4.812e+03 2.161 0.031785 *
## FenceMnPrv 1.032e+04 3.789e+03 2.724 0.006954 **
## FenceMnWw 6.273e+03 7.974e+03 0.787 0.432267
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21410 on 226 degrees of freedom
## (1199 observations deleted due to missingness)
## Multiple R-squared: 0.8952, Adjusted R-squared: 0.8794
## F-statistic: 56.75 on 34 and 226 DF, p-value: < 2.2e-16
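Each refit in this section retypes the full formula by hand. As a minimal sketch of an alternative, `update()` can drop terms from an existing fit. The example below uses the built-in `mtcars` data as a stand-in, so the variable names are illustrative assumptions and not part of the housing model.

```r
# Stand-in model on the built-in mtcars data (NOT the housing data).
fit_full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Drop two terms without retyping the rest of the formula.
fit_reduced <- update(fit_full, . ~ . - qsec - drat)

# Compare adjusted R-squared before and after the removal.
print(c(full    = summary(fit_full)$adj.r.squared,
        reduced = summary(fit_reduced)$adj.r.squared))

# Partial F-test for the dropped terms (valid here because both
# models are fit on exactly the same rows).
print(anova(fit_reduced, fit_full))
```

The partial F-test is only meaningful when both fits use the same observations, which is worth checking whenever a removed variable contains missing values.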
Let’s remove:
* ExterQual, Foundation, Heating, HalfBath
lm5 <- lm(SalePrice ~ LotArea + OverallQual + OverallCond + YearBuilt + BsmtQual + TotalBsmtSF + CentralAir + GrLivArea + FullBath + BedroomAbvGr + KitchenQual + TotRmsAbvGrd + GarageType + PoolArea + Fence,data=train)
summary(lm5)
##
## Call:
## lm(formula = SalePrice ~ LotArea + OverallQual + OverallCond +
## YearBuilt + BsmtQual + TotalBsmtSF + CentralAir + GrLivArea +
## FullBath + BedroomAbvGr + KitchenQual + TotRmsAbvGrd + GarageType +
## PoolArea + Fence, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -93208 -7907 277 10562 93208
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.007e+06 1.946e+05 -5.177 4.83e-07 ***
## LotArea 2.239e+00 5.367e-01 4.173 4.24e-05 ***
## OverallQual 1.201e+04 2.037e+03 5.895 1.29e-08 ***
## OverallCond 9.119e+03 1.320e+03 6.909 4.51e-11 ***
## YearBuilt 5.323e+02 9.577e+01 5.558 7.34e-08 ***
## BsmtQualFa -1.111e+05 2.060e+04 -5.392 1.69e-07 ***
## BsmtQualGd -1.099e+05 1.735e+04 -6.333 1.20e-09 ***
## BsmtQualTA -1.077e+05 1.774e+04 -6.073 4.98e-09 ***
## TotalBsmtSF 2.430e+01 5.503e+00 4.415 1.54e-05 ***
## CentralAirY 1.881e+04 7.282e+03 2.582 0.01041 *
## GrLivArea 6.309e+01 6.458e+00 9.769 < 2e-16 ***
## FullBath 6.801e+03 4.204e+03 1.618 0.10708
## BedroomAbvGr -4.440e+03 2.665e+03 -1.666 0.09695 .
## KitchenQualFa -6.541e+04 1.530e+04 -4.275 2.77e-05 ***
## KitchenQualGd -6.514e+04 1.050e+04 -6.207 2.41e-09 ***
## KitchenQualTA -6.871e+04 1.082e+04 -6.352 1.08e-09 ***
## TotRmsAbvGrd -4.695e+03 2.066e+03 -2.273 0.02395 *
## GarageTypeAttchd 4.664e+04 2.519e+04 1.852 0.06529 .
## GarageTypeBasment 3.263e+04 2.652e+04 1.231 0.21971
## GarageTypeBuiltIn 5.851e+04 2.712e+04 2.157 0.03200 *
## GarageTypeDetchd 4.829e+04 2.526e+04 1.912 0.05708 .
## PoolArea 4.548e+01 1.671e+01 2.723 0.00696 **
## FenceGdWo 8.979e+03 4.689e+03 1.915 0.05669 .
## FenceMnPrv 8.914e+03 3.749e+03 2.378 0.01820 *
## FenceMnWw 3.747e+03 7.952e+03 0.471 0.63791
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21570 on 236 degrees of freedom
## (1199 observations deleted due to missingness)
## Multiple R-squared: 0.8889, Adjusted R-squared: 0.8776
## F-statistic: 78.65 on 24 and 236 DF, p-value: < 2.2e-16
Let’s remove:
* FullBath, BedroomAbvGr, Fence
lm6 <- lm(SalePrice ~ LotArea + OverallQual + OverallCond + YearBuilt + BsmtQual + TotalBsmtSF + CentralAir + GrLivArea + KitchenQual + TotRmsAbvGrd + GarageType + PoolArea,data=train)
summary(lm6)
##
## Call:
## lm(formula = SalePrice ~ LotArea + OverallQual + OverallCond +
## YearBuilt + BsmtQual + TotalBsmtSF + CentralAir + GrLivArea +
## KitchenQual + TotRmsAbvGrd + GarageType + PoolArea, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -513298 -15903 -1034 13068 255688
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.094e+05 1.332e+05 -6.075 1.61e-09 ***
## LotArea 6.933e-01 1.001e-01 6.928 6.62e-12 ***
## OverallQual 1.531e+04 1.292e+03 11.846 < 2e-16 ***
## OverallCond 7.487e+03 1.061e+03 7.058 2.71e-12 ***
## YearBuilt 4.083e+02 6.610e+01 6.177 8.67e-10 ***
## BsmtQualFa -4.502e+04 8.608e+03 -5.230 1.97e-07 ***
## BsmtQualGd -4.154e+04 4.278e+03 -9.710 < 2e-16 ***
## BsmtQualTA -4.573e+04 5.300e+03 -8.629 < 2e-16 ***
## TotalBsmtSF 2.170e+01 3.245e+00 6.688 3.32e-11 ***
## CentralAirY 6.904e+03 5.272e+03 1.310 0.1906
## GrLivArea 5.233e+01 4.060e+00 12.890 < 2e-16 ***
## KitchenQualFa -4.131e+04 9.227e+03 -4.477 8.22e-06 ***
## KitchenQualGd -3.546e+04 4.631e+03 -7.657 3.66e-14 ***
## KitchenQualTA -4.399e+04 5.185e+03 -8.484 < 2e-16 ***
## TotRmsAbvGrd 3.033e+02 1.107e+03 0.274 0.7841
## GarageTypeAttchd 1.147e+04 1.469e+04 0.781 0.4349
## GarageTypeBasment 7.367e+03 1.670e+04 0.441 0.6592
## GarageTypeBuiltIn 1.213e+04 1.525e+04 0.795 0.4265
## GarageTypeCarPort 2.254e+03 1.978e+04 0.114 0.9093
## GarageTypeDetchd 7.481e+03 1.472e+04 0.508 0.6114
## PoolArea -4.544e+01 2.361e+01 -1.924 0.0546 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35230 on 1328 degrees of freedom
## (111 observations deleted due to missingness)
## Multiple R-squared: 0.8041, Adjusted R-squared: 0.8012
## F-statistic: 272.6 on 20 and 1328 DF, p-value: < 2.2e-16
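The lm5 summary reports 1199 observations deleted due to missingness, while the lm6 summary reports only 111, so the two models are fit on very different sets of rows. This suggests (an inference, not something the output states directly) that one of the dropped variables, most plausibly Fence, is missing for most houses. A minimal sketch of how one might check this, using a made-up toy data frame in place of `train`:

```r
# Toy data: 'fence' is mostly missing, mimicking a sparsely recorded column.
# All names and values here are hypothetical.
set.seed(1)
toy <- data.frame(
  price = rnorm(100, mean = 200, sd = 20),
  area  = rnorm(100, mean = 1500, sd = 300),
  fence = c(rep(c("GdPrv", "MnPrv"), 10), rep(NA, 80))
)

# Count missing values per column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# lm() drops every row that has an NA in any variable used by the formula.
fit_with_fence    <- lm(price ~ area + fence, data = toy)
fit_without_fence <- lm(price ~ area, data = toy)

# nobs() reports how many rows each fit actually used.
print(c(with_fence    = nobs(fit_with_fence),
        without_fence = nobs(fit_without_fence)))
```

Running `colSums(is.na(train))` on the real data would show which columns drive the row loss. R-squared values from fits on different rows are not directly comparable.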
Let’s remove:
* GarageType and TotRmsAbvGrd
lm7 <- lm(SalePrice ~ LotArea + OverallQual + OverallCond + YearBuilt + BsmtQual + TotalBsmtSF + CentralAir + GrLivArea + KitchenQual + PoolArea,data=train)
summary(lm7)
##
## Call:
## lm(formula = SalePrice ~ LotArea + OverallQual + OverallCond +
## YearBuilt + BsmtQual + TotalBsmtSF + CentralAir + GrLivArea +
## KitchenQual + PoolArea, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -515379 -15222 -342 12610 256221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.947e+05 1.150e+05 -7.784 1.36e-14 ***
## LotArea 7.297e-01 9.795e-02 7.450 1.62e-13 ***
## OverallQual 1.492e+04 1.200e+03 12.433 < 2e-16 ***
## OverallCond 7.026e+03 9.902e+02 7.096 2.04e-12 ***
## YearBuilt 4.599e+02 5.755e+01 7.991 2.77e-15 ***
## BsmtQualFa -4.361e+04 8.201e+03 -5.317 1.22e-07 ***
## BsmtQualGd -4.288e+04 4.197e+03 -10.216 < 2e-16 ***
## BsmtQualTA -4.651e+04 5.117e+03 -9.090 < 2e-16 ***
## TotalBsmtSF 2.188e+01 2.976e+00 7.352 3.30e-13 ***
## CentralAirY 6.325e+03 4.629e+03 1.366 0.1721
## GrLivArea 5.259e+01 2.431e+00 21.634 < 2e-16 ***
## KitchenQualFa -3.949e+04 8.029e+03 -4.919 9.72e-07 ***
## KitchenQualGd -3.393e+04 4.448e+03 -7.628 4.36e-14 ***
## KitchenQualTA -4.354e+04 4.956e+03 -8.785 < 2e-16 ***
## PoolArea -4.371e+01 2.320e+01 -1.884 0.0598 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34890 on 1408 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.8088, Adjusted R-squared: 0.8069
## F-statistic: 425.4 on 14 and 1408 DF, p-value: < 2.2e-16
Let’s remove:
* PoolArea and CentralAir
lm8 <- lm(SalePrice ~ LotArea + OverallQual + OverallCond + YearBuilt + BsmtQual + TotalBsmtSF + GrLivArea + KitchenQual,data=train)
summary(lm8)
##
## Call:
## lm(formula = SalePrice ~ LotArea + OverallQual + OverallCond +
## YearBuilt + BsmtQual + TotalBsmtSF + GrLivArea + KitchenQual,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -532988 -15276 -408 12959 240519
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.461e+05 1.085e+05 -8.720 < 2e-16 ***
## LotArea 7.320e-01 9.798e-02 7.471 1.39e-13 ***
## OverallQual 1.504e+04 1.200e+03 12.535 < 2e-16 ***
## OverallCond 7.360e+03 9.571e+02 7.691 2.73e-14 ***
## YearBuilt 4.879e+02 5.381e+01 9.067 < 2e-16 ***
## BsmtQualFa -4.310e+04 8.193e+03 -5.260 1.66e-07 ***
## BsmtQualGd -4.275e+04 4.200e+03 -10.179 < 2e-16 ***
## BsmtQualTA -4.610e+04 5.100e+03 -9.038 < 2e-16 ***
## TotalBsmtSF 2.155e+01 2.971e+00 7.254 6.64e-13 ***
## GrLivArea 5.206e+01 2.413e+00 21.569 < 2e-16 ***
## KitchenQualFa -4.061e+04 7.988e+03 -5.084 4.20e-07 ***
## KitchenQualGd -3.382e+04 4.452e+03 -7.597 5.50e-14 ***
## KitchenQualTA -4.318e+04 4.957e+03 -8.709 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34930 on 1410 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.8081, Adjusted R-squared: 0.8064
## F-statistic: 494.7 on 12 and 1410 DF, p-value: < 2.2e-16
Let’s remove:
* BsmtQual and KitchenQual
lm9 <- lm(SalePrice ~ LotArea + OverallQual + OverallCond + YearBuilt + TotalBsmtSF + GrLivArea,data=train)
summary(lm9)
##
## Call:
## lm(formula = SalePrice ~ LotArea + OverallQual + OverallCond +
## YearBuilt + TotalBsmtSF + GrLivArea, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -529020 -18456 -2001 14210 275815
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.188e+06 8.907e+04 -13.341 < 2e-16 ***
## LotArea 6.806e-01 1.060e-01 6.421 1.83e-10 ***
## OverallQual 2.132e+04 1.158e+03 18.419 < 2e-16 ***
## OverallCond 6.540e+03 9.907e+02 6.601 5.69e-11 ***
## YearBuilt 5.490e+02 4.555e+01 12.055 < 2e-16 ***
## TotalBsmtSF 2.950e+01 2.862e+00 10.310 < 2e-16 ***
## GrLivArea 5.407e+01 2.543e+00 21.267 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38050 on 1453 degrees of freedom
## Multiple R-squared: 0.7716, Adjusted R-squared: 0.7706
## F-statistic: 818 on 6 and 1453 DF, p-value: < 2.2e-16
This leaves us with the following model:
\(SalePrice = -1{,}188{,}000 + 0.68 \times LotArea + 21{,}320 \times OverallQual + 6{,}540 \times OverallCond + 549 \times YearBuilt + 29.5 \times TotalBsmtSF + 54.07 \times GrLivArea\)
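This model still has a fairly high R-squared of 0.7716 (down from 0.8081 for lm8), and it uses far fewer variables than our original model, which reduces the risk of overfitting. Because none of its six predictors has missing values in the training data, it is also fit on all 1,460 observations.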
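To make the fitted equation concrete, here is a small worked example that plugs one house into the coefficients printed in the lm9 summary above. The house values are hypothetical and chosen only for illustration, and the coefficients are the rounded values from the printed output, not the full-precision estimates stored in `lm9`.

```r
# Coefficients copied from the lm9 summary output (rounded as printed).
coefs <- c(
  "(Intercept)" = -1.188e+06,
  LotArea       = 6.806e-01,
  OverallQual   = 2.132e+04,
  OverallCond   = 6.540e+03,
  YearBuilt     = 5.490e+02,
  TotalBsmtSF   = 2.950e+01,
  GrLivArea     = 5.407e+01
)

# Hypothetical house (illustrative values, not taken from the report).
house <- c(
  LotArea     = 8450,
  OverallQual = 7,
  OverallCond = 5,
  YearBuilt   = 2003,
  TotalBsmtSF = 856,
  GrLivArea   = 1710
)

# Intercept plus the sum of coefficient * value for each predictor.
predicted_price <- coefs[["(Intercept)"]] + sum(coefs[names(house)] * house)
print(round(predicted_price))
```

With these inputs the result is about 217,050. The YearBuilt term (roughly 1.1 million) nearly cancels the intercept, so rounding the intercept to four significant figures can shift the result by several hundred dollars. In practice `predict(lm9, newdata)` avoids this by using the unrounded estimates.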
Analyzing Residuals And QQPlot
The residuals are roughly normally distributed and concentrated around 0.
There appear to be a few outliers in both tails (the residuals range from -529,020 to 275,815), but the Q-Q plot otherwise looks reasonable.
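The plotting code is not shown in this report. A minimal sketch of how such residual checks are commonly produced in base R follows; it uses a stand-in model on the built-in `mtcars` data, which is an assumption made so the example runs on its own. Substituting `lm9` for `fit` would give the housing version.

```r
# Stand-in model on the built-in mtcars data (NOT the housing model).
fit <- lm(mpg ~ wt + hp, data = mtcars)
res <- residuals(fit)

# Histogram of residuals: look for a roughly symmetric shape centred on 0.
hist(res, breaks = 10, main = "Histogram of residuals", xlab = "Residual")

# Normal Q-Q plot: points close to the reference line suggest
# approximately normal residuals; departures at the ends indicate heavy tails.
qqnorm(res)
qqline(res)

# Numeric companion to the plots.
print(summary(res))
```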
Testing our Model
Now we will use our model, lm9, to predict values using the test data. These are the predictions that will be submitted to Kaggle.
## Parsed with column specification:
## cols(
## .default = col_character(),
## Id = col_double(),
## MSSubClass = col_double(),
## LotFrontage = col_double(),
## LotArea = col_double(),
## OverallQual = col_double(),
## OverallCond = col_double(),
## YearBuilt = col_double(),
## YearRemodAdd = col_double(),
## MasVnrArea = col_double(),
## BsmtFinSF1 = col_double(),
## BsmtFinSF2 = col_double(),
## BsmtUnfSF = col_double(),
## TotalBsmtSF = col_double(),
## `1stFlrSF` = col_double(),
## `2ndFlrSF` = col_double(),
## LowQualFinSF = col_double(),
## GrLivArea = col_double(),
## BsmtFullBath = col_double(),
## BsmtHalfBath = col_double(),
## FullBath = col_double()
## # ... with 17 more columns
## )
## See spec(...) for full column specifications.
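# Predict sale prices for the test set with lm9 and pair them with the Id column
y_pred <- cbind(test['Id'],predict(lm9,test))
# Replace any NA entries (e.g., predictions for rows with a missing predictor) with 0
y_pred[is.na(y_pred)] <- 0
colnames(y_pred) <- c('Id','SalePrice')
# Write the submission file with Id and SalePrice columns
write.csv(y_pred,file='HousingPredictions_DTeran.csv',row.names = FALSE)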
results <- read_csv('HousingPredictions_DTeran.csv')
## Parsed with column specification:
## cols(
## Id = col_double(),
## SalePrice = col_double()
## )
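The submission code replaces missing predictions with 0, which is far from any plausible sale price. One possible alternative, sketched here under the assumption that the NA predictions come from missing predictor values in the test set, is to fill those predictors with the training median before predicting. The data and column names below are toy stand-ins, not the housing data.

```r
# Toy training and test data (hypothetical values).
set.seed(42)
train_toy <- data.frame(area = runif(50, min = 800, max = 2500))
train_toy$price <- 50000 + 100 * train_toy$area + rnorm(50, sd = 5000)
test_toy <- data.frame(Id = 1:4, area = c(1000, NA, 1800, 2200))

fit <- lm(price ~ area, data = train_toy)

# predict() returns NA for any row with a missing predictor.
raw_pred <- predict(fit, test_toy)

# Fill the missing predictor with the training median, then predict again.
test_filled <- test_toy
test_filled$area[is.na(test_filled$area)] <- median(train_toy$area)
filled_pred <- predict(fit, test_filled)

submission <- data.frame(Id = test_toy$Id, SalePrice = filled_pred)
print(submission)
```

Median imputation is only one simple choice; whether it improves the Kaggle score for this model has not been tested here.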
My Kaggle team name is Devin Teran and my score was 0.91739.
Link to YouTube recording: https://youtu.be/M3MiqqWOt00