Problem 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \[\frac{(N+1)}{2}\]
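The generation step can be sketched as follows. This is a minimal sketch, not the report's own code: the seed and variable names are illustrative, and the statement above gives only the mean of Y, so the standard deviation is an assumption. It is set equal to the mean, which is consistent with the first quartile of about 16.48 reported for N = 100 further down.

```r
set.seed(605)                       # arbitrary seed, only for reproducibility
N <- 100                            # any N >= 6; 100 matches the output shown later
n <- 10000

# X: 10,000 uniform draws between 1 and N
X <- runif(n, min = 1, max = N)

# Y: 10,000 normal draws with mean (N + 1) / 2.
# Assumption: the standard deviation is also (N + 1) / 2.
mu <- (N + 1) / 2
Y <- rnorm(n, mean = mu, sd = mu)

x <- median(X)                      # "x" is the median of X
y <- unname(quantile(Y, 0.25))      # "y" is the 1st quartile of Y

c(x = x, y = y)
```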

Probability

5 points. Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

  • establish “x” and “y”
  • prep all of the probabilities for later use in a, b, c
## [1] "N: 100 , Median X: 50.69 1st Quartile Y16.48, X<x: 0.5, X > x: 0.5, Y > y: 0.75, Y<y: 0.25, X > y: 0.839 , X<y: 0.161"
a. \[P(X>x | X>y)\]
  • Conditional probability - scenario (a) considers the probability that X > x occurs given that X > y has already occurred
  • find the joint probability P(X > x and X > y) and divide it by the marginal probability P(X > y)
## [1] "P(X>x | X>y) = 0.84"
b. \[P(X>x, Y>y)\]
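  • Joint probability that X > x and Y > y both occur; because X and Y are generated independently, it equals the product of the marginal probabilities, 0.5 × 0.75 = 0.375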
## [1] "P(X>x, Y>y) = 0.38"
c. \[P(X<x | X>y)\]
  • Conditional probability - scenario (c) considers the probability that X < x occurs given that X > y has already occurred
  • find the joint probability P(X < x and X > y) and divide it by the marginal probability P(X > y)
## [1] "P(X<x | X>y)\t = 0.84"
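The three probabilities can be estimated directly from the simulated vectors, as in the sketch below. The helper `cond_prob` is a hypothetical name and the standard deviation of Y is an assumption. Because the events in (a) and (c) both involve X, they are not independent, so the joint probability is counted directly rather than formed as a product of marginals. Under that reading, the marginals printed earlier (P(X > x) = 0.5 and P(X > y) = 0.839, with x > y) would give roughly 0.5 / 0.839 ≈ 0.60 for (a) and (0.839 − 0.5) / 0.839 ≈ 0.40 for (c).

```r
set.seed(605)                                  # arbitrary seed
N <- 100
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N + 1) / 2, (N + 1) / 2)    # sd is an assumption
x <- median(X)
y <- unname(quantile(Y, 0.25))

# Hypothetical helper: P(A | B) = P(A and B) / P(B) for logical vectors A and B
cond_prob <- function(A, B) mean(A & B) / mean(B)

p_a <- cond_prob(X > x, X > y)     # a. P(X > x | X > y)
p_b <- mean(X > x & Y > y)         # b. P(X > x, Y > y), a joint probability
p_c <- cond_prob(X < x, X > y)     # c. P(X < x | X > y)

round(c(a = p_a, b = p_b, c = p_c), 3)
```

Since (a) and (c) condition on the same event and X > x and X < x are complementary, the two conditional probabilities should add up to 1.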

Investigate marginal and joint probabilities

5 points. Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

  • Create a Joint Probability Table
## [1] "Joint Probability Matrix"
Values        Y > 1st Quartile   Y < 1st Quartile   X Totals
X > Median    0.375              0.125              0.5
X < Median    0.375              0.125              0.5
Y Totals      0.75               0.25               1
## [1] "Joint Prob, P(X>x and Y>y) 0.375"
## [1] "Marginal Prob, P(X>x)P(Y>y)) 0.375"
## [1] "Based on the below table, the joint probability is equal to the marginal probability"
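A table like the one above can be built from the simulated data with `table()` and `prop.table()`. The following is a sketch under the same assumptions as before (illustrative seed and names, assumed standard deviation for Y); it compares each joint cell with the product of its row and column marginals.

```r
set.seed(605)                                  # arbitrary seed
N <- 100
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N + 1) / 2, (N + 1) / 2)    # sd is an assumption
x <- median(X)
y <- unname(quantile(Y, 0.25))

# Label each observation by the side of the cut-off it falls on
x_side <- factor(X > x, levels = c(TRUE, FALSE), labels = c("X > x", "X < x"))
y_side <- factor(Y > y, levels = c(TRUE, FALSE), labels = c("Y > y", "Y < y"))

counts <- table(x_side, y_side)      # 2 x 2 table of counts
joint  <- prop.table(counts)         # joint probabilities
addmargins(joint)                    # adds the row and column totals

# Under independence each joint cell equals the product of its marginals
product <- outer(rowSums(joint), colSums(joint))
round(c(joint   = joint["X > x", "Y > y"],
        product = product["X > x", "Y > y"]), 3)
```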

Fisher’s Exact Test and the Chi Square Test

5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

Chi Square

  • https://www.tutorialspoint.com/r/r_chi_square_tests.htm
  • Ho: Variable A and Variable B are independent.
  • Ha: Variable A and Variable B are not independent.
  • The test returns a chi-squared statistic of 0 and a p-value of 1, well above 0.05, so there is no evidence against independence
  • We fail to reject the null hypothesis: Variable A and Variable B are independent.
##       [,1]  [,2]
## [1,] 0.375 0.125
## [2,] 0.375 0.125
## 
##  Pearson's Chi-squared test
## 
## data:  chi.table
## X-squared = 0, df = 1, p-value = 1

Fisher Test

## 
##  Fisher's Exact Test for Count Data
## 
## data:  chi.table
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##    0 Inf
## sample estimates:
## odds ratio 
##          0
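Both tests are designed for tables of counts rather than proportions, so a sketch on the 2 x 2 count table may be useful here. Fisher's Exact Test computes an exact p-value from the hypergeometric distribution and is usually preferred when cell counts are small; the Chi Square Test relies on a large-sample approximation, which is reasonable with 10,000 observations. The data generation below repeats the earlier assumptions (illustrative seed, assumed standard deviation for Y).

```r
set.seed(605)                                  # arbitrary seed
N <- 100
X <- runif(10000, 1, N)
Y <- rnorm(10000, (N + 1) / 2, (N + 1) / 2)    # sd is an assumption
x <- median(X)
y <- unname(quantile(Y, 0.25))

# 2 x 2 table of counts (not proportions)
counts <- table(X > x, Y > y)

chi <- chisq.test(counts, correct = FALSE)     # large-sample approximation
fis <- fisher.test(counts)                     # exact test, conditional on the margins

round(c(chi_square_p = chi$p.value, fisher_p = fis$p.value), 4)
```

With cell counts in the thousands the two p-values are expected to be close to each other.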

Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

Read in the Data

Descriptive and Inferential Statistics

5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

Provide univariate descriptive statistics

Dataset Shape & Column Classes
  • Dimensions of train.data (dim)
  • Column types (sapply)
## [1] "Dimensions of the raw train.data c(1460, 81)"
Variable Type for each Column
Column Type
Id integer
MSSubClass integer
MSZoning character
LotFrontage integer
LotArea integer
Street character
Alley character
LotShape character
LandContour character
Utilities character
LotConfig character
LandSlope character
Neighborhood character
Condition1 character
Condition2 character
BldgType character
HouseStyle character
OverallQual integer
OverallCond integer
YearBuilt integer
YearRemodAdd integer
RoofStyle character
RoofMatl character
Exterior1st character
Exterior2nd character
MasVnrType character
MasVnrArea integer
ExterQual character
ExterCond character
Foundation character
BsmtQual character
BsmtCond character
BsmtExposure character
BsmtFinType1 character
BsmtFinSF1 integer
BsmtFinType2 character
BsmtFinSF2 integer
BsmtUnfSF integer
TotalBsmtSF integer
Heating character
HeatingQC character
CentralAir character
Electrical character
X1stFlrSF integer
X2ndFlrSF integer
LowQualFinSF integer
GrLivArea integer
BsmtFullBath integer
BsmtHalfBath integer
FullBath integer
HalfBath integer
BedroomAbvGr integer
KitchenAbvGr integer
KitchenQual character
TotRmsAbvGrd integer
Functional character
Fireplaces integer
FireplaceQu character
GarageType character
GarageYrBlt integer
GarageFinish character
GarageCars integer
GarageArea integer
GarageQual character
GarageCond character
PavedDrive character
WoodDeckSF integer
OpenPorchSF integer
EnclosedPorch integer
X3SsnPorch integer
ScreenPorch integer
PoolArea integer
PoolQC character
Fence character
MiscFeature character
MiscVal integer
MoSold integer
YrSold integer
SaleType character
SaleCondition character
SalePrice integer
Full summary statistics of train.data
  • find unique values per column
  • find # null values per column
  • get the summary statistics of the train dataset
  • merge all the information together
  • filter for complete cases on the ‘Min.’ column
Variable.Name 1st Qu. 3rd Qu. Max. Mean Min. Unique.Values NA.COUNTS
Id 365.8 1095.2 1460.0 730.5 1.0 1460 0
LotArea 7554 11602 215245 10517 1300 1073 0
GrLivArea 1130 1777 5642 1515 334 861 0
BsmtUnfSF 223.0 808.0 2336.0 567.2 0.0 780 0
X1stFlrSF 882 1391 4692 1163 334 753 0
TotalBsmtSF 795.8 1298.2 6110.0 1057.4 0.0 721 0
SalePrice 129975 214000 755000 180921 34900 663 0
BsmtFinSF1 0.0 712.2 5644.0 443.6 0.0 637 0
GarageArea 334.5 576.0 1418.0 473.0 0.0 441 0
X2ndFlrSF 0 728 2065 347 0 417 0
MasVnrArea 0.0 166.0 1600.0 103.7 0.0 328 8
WoodDeckSF 0.00 168.00 857.00 94.24 0.00 274 0
OpenPorchSF 0.00 68.00 547.00 46.66 0.00 202 0
BsmtFinSF2 0.00 0.00 1474.00 46.55 0.00 144 0
EnclosedPorch 0.00 0.00 552.00 21.95 0.00 120 0
YearBuilt 1954 2000 2010 1971 1872 112 0
LotFrontage 59.00 80.00 313.00 70.05 21.00 111 259
GarageYrBlt 1961 2002 2010 1979 1900 98 81
ScreenPorch 0.00 0.00 480.00 15.06 0.00 76 0
YearRemodAdd 1967 2004 2010 1985 1950 61 0
LowQualFinSF 0.000 0.000 572.000 5.845 0.000 24 0
MiscVal 0.00 0.00 15500.00 43.49 0.00 21 0
X3SsnPorch 0.00 0.00 508.00 3.41 0.00 20 0
MSSubClass 20.0 70.0 190.0 56.9 20.0 15 0
MoSold 5.000 8.000 12.000 6.322 1.000 12 0
TotRmsAbvGrd 5.000 7.000 14.000 6.518 2.000 12 0
OverallQual 5.000 7.000 10.000 6.099 1.000 10 0
OverallCond 5.000 6.000 9.000 5.575 1.000 9 0
BedroomAbvGr 2.000 3.000 8.000 2.866 0.000 8 0
PoolArea 0.000 0.000 738.000 2.759 0.000 8 0
GarageCars 1.000 2.000 4.000 1.767 0.000 5 0
YrSold 2007 2009 2010 2008 2006 5 0
BsmtFullBath 0.0000 1.0000 3.0000 0.4253 0.0000 4 0
Fireplaces 0.000 1.000 3.000 0.613 0.000 4 0
FullBath 1.000 2.000 3.000 1.565 0.000 4 0
KitchenAbvGr 1.000 1.000 3.000 1.047 0.000 4 0
BsmtHalfBath 0.00000 0.00000 2.00000 0.05753 0.00000 3 0
HalfBath 0.0000 1.0000 2.0000 0.3829 0.0000 3 0
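The steps listed before the table (unique values, null counts, summary statistics, merge) can be sketched as below. The built-in `airquality` data frame stands in for train.data, and the helper `stat` is a hypothetical name; the column layout follows the table above.

```r
df <- airquality                               # stand-in for train.data
num_cols <- names(df)[sapply(df, is.numeric)]

# Hypothetical helper: apply f to every numeric column, drop the names
stat <- function(f) unname(sapply(df[num_cols], f))

summary_tbl <- data.frame(
  Variable.Name = num_cols,
  `1st Qu.`     = stat(function(v) quantile(v, 0.25, na.rm = TRUE)),
  `3rd Qu.`     = stat(function(v) quantile(v, 0.75, na.rm = TRUE)),
  Max.          = stat(function(v) max(v, na.rm = TRUE)),
  Mean          = stat(function(v) mean(v, na.rm = TRUE)),
  Min.          = stat(function(v) min(v, na.rm = TRUE)),
  Unique.Values = stat(function(v) length(unique(v))),   # NA counts as a value here
  NA.COUNTS     = stat(function(v) sum(is.na(v))),
  check.names   = FALSE
)

# Sort by number of unique values, largest first, as in the table above
summary_tbl <- summary_tbl[order(-summary_tbl$Unique.Values), ]
summary_tbl
```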
Box Plot

Variable Density Graphs

Scatterplot with Logarithmic Curve

Correlation Matrix (3 Variables)
  • General Correlation Matrix
  • Filtered for 3 variables
  • “GrLivArea”
  • “LotArea”
  • “LotFrontage”

Correlation Matrix (3 Variables)

Correlation Analysis and Hypothesis Testing
  • The p-value is below 2.2e-16 for all three pairs, so in each case we reject the null hypothesis that the correlation is 0

  • In the train data, the 80% confidence interval for the correlation between GrLivArea & LotArea is 0.2315997 to 0.2940809

  • In the train data, the 80% confidence interval for the correlation between GrLivArea & LotFrontage is 0.3713236 to 0.4333466

  • In the train data, the 80% confidence interval for the correlation between LotArea & LotFrontage is 0.3953198 to 0.4559147

## 
##  Pearson's product-moment correlation
## 
## data:  arg_1 and arg_2
## t = 10.414, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2315997 0.2940809
## sample estimates:
##       cor 
## 0.2631162
## 
##  Pearson's product-moment correlation
## 
## data:  arg_1 and arg_2
## t = 15.238, df = 1199, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.3713236 0.4333466
## sample estimates:
##       cor 
## 0.4027974
## 
##  Pearson's product-moment correlation
## 
## data:  arg_1 and arg_2
## t = 16.309, df = 1199, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.3953198 0.4559147
## sample estimates:
##      cor 
## 0.426095
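Output of this form comes from `cor.test()` with `conf.level = 0.80`. The sketch below runs the three pairwise tests on a stand-in data set (three columns of the built-in `mtcars`, an assumption made only so the example runs on its own) and adds a Bonferroni adjustment as one way to address familywise error. With three tests at the 5% level, the chance of at least one false rejection can reach about 1 − 0.95³ ≈ 0.14; with p-values as small as those above, the adjustment would not change the conclusions.

```r
df <- mtcars[, c("mpg", "wt", "hp")]           # stand-in for three train.data columns
pairs <- combn(names(df), 2, simplify = FALSE) # all pairwise combinations

results <- do.call(rbind, lapply(pairs, function(p) {
  ct <- cor.test(df[[p[1]]], df[[p[2]]], conf.level = 0.80)
  data.frame(var1    = p[1],
             var2    = p[2],
             cor     = unname(ct$estimate),
             lower   = ct$conf.int[1],         # 80% interval, lower bound
             upper   = ct$conf.int[2],         # 80% interval, upper bound
             p.value = ct$p.value)
}))

# Bonferroni correction for the three simultaneous tests
results$p.bonferroni <- p.adjust(results$p.value, method = "bonferroni")
results
```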

Linear Algebra and Correlation

5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

Inverted Matrix
  • Multiplying the sample correlation matrix by the inverse matrix -or- the inverse and the sample correlation matrix provides an identity matrix (diagonal of 1s)
  • Use the lu.decomposition() function to get the “L” & “U” for both the sample and inverse matrices
Sample 3 Variable Correlation Matrix
GrLivArea LotArea LotFrontage
GrLivArea 1.0000000 0.2631162 0.4027974
LotArea 0.2631162 1.0000000 0.4260950
LotFrontage 0.4027974 0.4260950 1.0000000
Inverted Correlation Matrix
GrLivArea LotArea LotFrontage
GrLivArea 1.2084186 -0.1350780 -0.4291918
LotArea -0.1350780 1.2369313 -0.4726412
LotFrontage -0.4291918 -0.4726412 1.3742674
sample.cor %*% inv.matrix
GrLivArea LotArea LotFrontage
GrLivArea 1 0 0
LotArea 0 1 0
LotFrontage 0 0 1
inv.matrix %*% sample.cor
GrLivArea LotArea LotFrontage
GrLivArea 1 0 0
LotArea 0 1 0
LotFrontage 0 0 1
LU decomposition on the matrix
## [1] "LU decomposition of the 3 variable sample matrix"
## $L
##           [,1]      [,2] [,3]
## [1,] 1.0000000 0.0000000    0
## [2,] 0.2631162 1.0000000    0
## [3,] 0.4027974 0.3439223    1
## 
## $U
##      [,1]      [,2]      [,3]
## [1,]    1 0.2631162 0.4027974
## [2,]    0 0.9307699 0.3201125
## [3,]    0 0.0000000 0.7276604
## [1] "LU decomposition of the 3 variable inverted sample matrix"
## $L
##            [,1]      [,2] [,3]
## [1,]  1.0000000  0.000000    0
## [2,] -0.1117808  1.000000    0
## [3,] -0.3551682 -0.426095    1
## 
## $U
##              [,1]      [,2]       [,3]
## [1,] 1.208419e+00 -0.135078 -0.4291918
## [2,] 0.000000e+00  1.221832 -0.5206166
## [3,] 5.551115e-17  0.000000  1.0000000
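The matrix steps can be reproduced from the correlation values printed above. This sketch uses `solve()` for the inverse and a small hand-written Doolittle routine, `lu_doolittle` (a hypothetical helper written for illustration, without pivoting), in place of the lu.decomposition() function used in the report.

```r
vars <- c("GrLivArea", "LotArea", "LotFrontage")
sample.cor <- matrix(c(1.0000000, 0.2631162, 0.4027974,
                       0.2631162, 1.0000000, 0.4260950,
                       0.4027974, 0.4260950, 1.0000000),
                     nrow = 3, byrow = TRUE, dimnames = list(vars, vars))

inv.matrix <- solve(sample.cor)          # precision matrix
round(sample.cor %*% inv.matrix, 10)     # identity matrix
round(inv.matrix %*% sample.cor, 10)     # identity matrix

# Hypothetical helper: Doolittle LU decomposition (unit diagonal in L, no pivoting)
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- matrix(0, n, n)
  for (i in seq_len(n)) {
    k <- seq_len(i - 1)                  # indices already processed (empty when i = 1)
    for (j in i:n) {
      U[i, j] <- A[i, j] - sum(L[i, k] * U[k, j])
    }
    if (i < n) {
      for (j in (i + 1):n) {
        L[j, i] <- (A[j, i] - sum(L[j, k] * U[k, i])) / U[i, i]
      }
    }
  }
  list(L = L, U = U)
}

lu <- lu_doolittle(sample.cor)
lu$L
lu$U
max(abs(lu$L %*% lu$U - sample.cor))     # reconstruction error, should be near 0
```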

Calculus-Based Probability & Statistics

5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

  • Based on the variable density graphs above, the variable “GrLivArea” is right-skewed.
  • fitdistr https://www.rdocumentation.org/packages/MASS/versions/7.3-51.4/topics/fitdistr, fit an exponential distribution through the “densfun” argument
  • Samples from the exponential distribution are drawn with the rexp() function
  • The 5th-to-95th percentile range of the fitted exponential distribution is much wider than the range obtained under the normal assumption, because the exponential fit allows for a longer positive tail, whereas the normal assumption is more condensed.

## [1] "5th Percentile empirical data() 77.73"
## [1] "95th Percentile empirical data() 4539.92"
## [1] "5th Percentile normal data() 848"
## [1] "95th Percentile normal data() 2466.1"
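The fitting workflow can be sketched as below. Since train.csv is not available here, a synthetic right-skewed vector with a mean near 1515 stands in for GrLivArea (an assumption, as are the seed and variable names), and the normal-theory interval is read as a confidence interval for the mean. For an exponential distribution the CDF is F(q) = 1 − exp(−λq), so the p-th percentile is −ln(1 − p) / λ; with a mean of about 1515, −ln(0.95) × 1515 ≈ 77.7, which agrees with the 77.73 printed above.

```r
library(MASS)

set.seed(605)                                        # arbitrary seed
# Synthetic stand-in for GrLivArea: right-skewed, strictly positive, mean near 1515
gr_liv_area <- rgamma(1460, shape = 8, rate = 8 / 1515)

fit    <- fitdistr(gr_liv_area, densfun = "exponential")
lambda <- unname(fit$estimate["rate"])               # optimal rate, equal to 1 / mean
sim    <- rexp(1000, rate = lambda)                  # 1000 draws from the fitted model

if (interactive()) {                                 # only draw when run interactively
  par(mfrow = c(1, 2))
  hist(gr_liv_area, main = "Original variable")
  hist(sim, main = "Exponential samples")
}

# 5th and 95th percentiles from the exponential CDF
exp_pct <- qexp(c(0.05, 0.95), rate = lambda)

# 95% confidence interval for the mean, assuming normality
n <- length(gr_liv_area)
norm_ci <- mean(gr_liv_area) + qnorm(c(0.025, 0.975)) * sd(gr_liv_area) / sqrt(n)

# Empirical 5th and 95th percentiles
emp_pct <- quantile(gr_liv_area, c(0.05, 0.95))

list(lambda = lambda, exponential = exp_pct, normal_ci = norm_ci, empirical = emp_pct)
```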

Modeling

10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

DATA PREPARATION
  • apply all functions to both the train/test data for consistency. The clean datasets will be clean.train & clean.test
  • the two datasets will be combined to make the cleaning process easier
  • Try to eliminate all variables with minimal impact
  • Set the clean.test value to zero
  • fill in the null value columns. For integers, use the median value. For characters, fill with “None”
  • Convert all the character columns to factors
## [1] "All Starter Columns Dimensions for Reference 82"
nearZeroVar Removal
  • Try to eliminate all variables with minimal impact between the two datasets
  • this dropped 21 columns with near-zero variance (82 columns before, 61 after)
## [1] "Number Columns Post nearZeroVar 61"
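The heading suggests a near-zero-variance filter such as nearZeroVar() from the caret package; that is an assumption, since the call itself is not shown. The base-R sketch below approximates the idea with a hypothetical helper, `near_zero_var`: a column is flagged when its most common value dominates the second most common one and it has few distinct values. The thresholds (a 95/5 frequency ratio and 10% unique values) are illustrative defaults.

```r
# Hypothetical helper: names of columns with zero or near-zero variance
near_zero_var <- function(df, freq_cut = 95 / 5, unique_cut = 10) {
  flag <- sapply(df, function(col) {
    counts <- sort(table(col), decreasing = TRUE)
    if (length(counts) <= 1) return(TRUE)              # a single value: zero variance
    freq_ratio <- counts[[1]] / counts[[2]]            # most common vs. second most common
    pct_unique <- 100 * length(counts) / length(col)   # share of distinct values
    freq_ratio > freq_cut && pct_unique <= unique_cut
  })
  names(df)[flag]
}

toy <- data.frame(
  constant = rep(1, 100),                     # zero variance
  rare     = c(rep("Y", 98), rep("N", 2)),    # 98:2 split, near-zero variance
  varied   = seq_len(100)                     # informative column
)

drop_cols <- near_zero_var(toy)
drop_cols                                     # "constant" "rare"
toy_reduced <- toy[, setdiff(names(toy), drop_cols), drop = FALSE]
```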
Null Removal/General Cleaning
  • identify all the nulls values in the combined train/test dataframe
  • impute values with the “remove.nulls” function
  • The “remove.nulls” function fills numerical columns with the median value & fills character values with the most common value in the column. The character columns are then converted to factor columns
  • immediately remove the columns with high NA counts from the combined dataset: ‘PoolQC’ (2,909 nulls), ‘MiscFeature’ (2,814 nulls), ‘Alley’ (2,721 nulls), ‘Fence’ (2,348 nulls), ‘FireplaceQu’ (1,420 nulls), as these columns are majority null values

    NA.COUNTS Variable.Name
    6 486 LotFrontage
    7 159 GarageYrBlt
    8 159 GarageFinish
    9 157 GarageType
    10 82 BsmtExposure
    11 81 BsmtQual
    12 79 BsmtFinType1
    13 24 MasVnrType
    14 23 MasVnrArea
    15 4 MSZoning
    16 2 BsmtFullBath
    17 2 BsmtHalfBath
    18 1 Exterior1st
    19 1 Exterior2nd
    20 1 BsmtFinSF1
    21 1 BsmtUnfSF
    22 1 TotalBsmtSF
    23 1 Electrical
    24 1 KitchenQual
    25 1 GarageCars
    26 1 GarageArea
    27 1 SaleType
## [1] "Columns Post Extreme Null Removal 56"
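The imputation described for “remove.nulls” (median for numeric columns, most common value for character columns, then conversion to factors) might look like the sketch below. The function name `impute_nulls` and the toy data are hypothetical; the report's own implementation is not shown.

```r
# Hypothetical re-implementation of the imputation rule described above
impute_nulls <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    if (is.numeric(df[[col]])) {
      # numeric columns: fill with the median of the observed values
      df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
    } else {
      # character columns: fill with the most common value, then convert to factor
      values <- as.character(df[[col]])
      values[missing] <- names(which.max(table(values)))
      df[[col]] <- factor(values)
    }
  }
  df
}

toy <- data.frame(
  LotFrontage = c(60, NA, 80, 70),
  MSZoning    = c("RL", "RL", NA, "RM"),
  stringsAsFactors = FALSE
)

clean <- impute_nulls(toy)
clean
```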
BUILD MODELS
  • split the combined data to clean.train & clean.test
Model 1: All Variables SalePrice:
  • The first model ignores collinearity and includes all variables.
  • Removed “Exterior2ndCBlock” “BsmtFinType1None” “GarageFinishNone” “GarageQualNone” “GarageCondNone” because their coefficients were returning NA and causing issues with collinearity testing.
## 
## Call:
## lm(formula = SalePrice ~ ., data = clean.train %>% dplyr::select(-c(Exterior2nd, 
##     BsmtFinType1, GarageFinish)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -325440  -11730    -107   11314  234260 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           7.852e+05  1.236e+06   0.635 0.525386    
## Id                   -1.204e+00  1.863e+00  -0.646 0.518114    
## MSSubClass           -9.681e+01  9.643e+01  -1.004 0.315576    
## MSZoningFV            3.068e+04  1.435e+04   2.138 0.032693 *  
## MSZoningRH            2.717e+04  1.416e+04   1.919 0.055191 .  
## MSZoningRL            2.574e+04  1.202e+04   2.142 0.032399 *  
## MSZoningRM            2.273e+04  1.126e+04   2.019 0.043688 *  
## LotFrontage          -1.286e+02  4.936e+01  -2.605 0.009301 ** 
## LotArea               3.725e-01  9.816e-02   3.794 0.000155 ***
## LotShapeIR2           9.824e+03  5.001e+03   1.965 0.049679 *  
## LotShapeIR3          -2.484e+04  1.039e+04  -2.390 0.016979 *  
## LotShapeReg           1.668e+03  1.910e+03   0.873 0.382718    
## LotConfigCulDSac      7.635e+03  3.937e+03   1.939 0.052657 .  
## LotConfigFR2         -1.001e+04  4.855e+03  -2.062 0.039428 *  
## LotConfigFR3         -1.781e+04  1.515e+04  -1.176 0.239717    
## LotConfigInside      -1.245e+03  2.123e+03  -0.586 0.557740    
## NeighborhoodBlueste  -2.535e+03  2.285e+04  -0.111 0.911654    
## NeighborhoodBrDale    1.020e+04  1.315e+04   0.775 0.438212    
## NeighborhoodBrkSide  -2.873e+03  1.106e+04  -0.260 0.795196    
## NeighborhoodClearCr  -9.427e+03  1.077e+04  -0.875 0.381653    
## NeighborhoodCollgCr  -4.367e+03  8.664e+03  -0.504 0.614311    
## NeighborhoodCrawfor   1.457e+04  1.010e+04   1.442 0.149599    
## NeighborhoodEdwards  -2.283e+04  9.552e+03  -2.390 0.016967 *  
## NeighborhoodGilbert  -7.715e+03  9.293e+03  -0.830 0.406545    
## NeighborhoodIDOTRR   -9.908e+03  1.271e+04  -0.780 0.435712    
## NeighborhoodMeadowV   1.549e+03  1.342e+04   0.115 0.908143    
## NeighborhoodMitchel  -1.472e+04  9.739e+03  -1.511 0.130907    
## NeighborhoodNAmes    -1.313e+04  9.279e+03  -1.416 0.157129    
## NeighborhoodNoRidge   4.906e+04  1.008e+04   4.867 1.27e-06 ***
## NeighborhoodNPkVill   1.138e+04  1.326e+04   0.858 0.390826    
## NeighborhoodNridgHt   3.210e+04  8.913e+03   3.601 0.000329 ***
## NeighborhoodNWAmes   -1.144e+04  9.545e+03  -1.199 0.230891    
## NeighborhoodOldTown  -1.707e+04  1.139e+04  -1.499 0.134135    
## NeighborhoodSawyer   -8.574e+03  9.716e+03  -0.882 0.377692    
## NeighborhoodSawyerW   1.724e+03  9.327e+03   0.185 0.853428    
## NeighborhoodSomerst   1.003e+04  1.080e+04   0.929 0.353095    
## NeighborhoodStoneBr   4.883e+04  9.879e+03   4.943 8.69e-07 ***
## NeighborhoodSWISU    -1.689e+04  1.151e+04  -1.468 0.142415    
## NeighborhoodTimber   -7.122e+03  9.637e+03  -0.739 0.460015    
## NeighborhoodVeenker   6.905e+03  1.257e+04   0.549 0.582906    
## Condition1Feedr      -5.308e+03  5.751e+03  -0.923 0.356208    
## Condition1Norm        7.120e+03  4.763e+03   1.495 0.135187    
## Condition1PosA        5.132e+03  1.171e+04   0.438 0.661363    
## Condition1PosN       -8.834e+03  8.340e+03  -1.059 0.289691    
## Condition1RRAe       -2.042e+04  1.096e+04  -1.863 0.062708 .  
## Condition1RRAn        5.812e+03  7.786e+03   0.746 0.455505    
## Condition1RRNe       -3.154e+03  2.130e+04  -0.148 0.882332    
## Condition1RRNn        1.283e+02  1.474e+04   0.009 0.993057    
## BldgType2fmCon       -4.154e+02  1.422e+04  -0.029 0.976699    
## BldgTypeDuplex       -1.270e+04  7.111e+03  -1.786 0.074392 .  
## BldgTypeTwnhs        -2.587e+04  1.170e+04  -2.210 0.027244 *  
## BldgTypeTwnhsE       -2.092e+04  1.050e+04  -1.991 0.046668 *  
## HouseStyle1.5Unf      1.442e+04  9.024e+03   1.597 0.110422    
## HouseStyle1Story      1.972e+04  4.978e+03   3.963 7.82e-05 ***
## HouseStyle2.5Fin     -1.784e+04  1.405e+04  -1.269 0.204528    
## HouseStyle2.5Unf     -1.020e+04  1.025e+04  -0.996 0.319677    
## HouseStyle2Story     -1.048e+04  4.025e+03  -2.603 0.009338 ** 
## HouseStyleSFoyer      1.201e+04  7.398e+03   1.624 0.104651    
## HouseStyleSLvl        1.081e+04  6.390e+03   1.691 0.091024 .  
## OverallQual           9.020e+03  1.167e+03   7.732 2.10e-14 ***
## OverallCond           5.487e+03  9.967e+02   5.506 4.42e-08 ***
## YearBuilt             1.296e+02  8.620e+01   1.504 0.132888    
## YearRemodAdd          3.619e+01  6.515e+01   0.555 0.578680    
## RoofStyleGable        2.835e+03  9.566e+03   0.296 0.767008    
## RoofStyleGambrel      8.482e+03  1.340e+04   0.633 0.526909    
## RoofStyleHip          5.286e+03  9.767e+03   0.541 0.588439    
## RoofStyleMansard      1.072e+04  1.491e+04   0.719 0.472413    
## RoofStyleShed         8.407e+03  2.287e+04   0.368 0.713172    
## Exterior1stAsphShn   -4.446e+03  3.071e+04  -0.145 0.884924    
## Exterior1stBrkComm   -9.432e+03  2.281e+04  -0.414 0.679304    
## Exterior1stBrkFace    1.651e+04  8.425e+03   1.959 0.050304 .  
## Exterior1stCBlock     1.107e+04  3.211e+04   0.345 0.730342    
## Exterior1stCemntBd   -4.659e+02  8.792e+03  -0.053 0.957750    
## Exterior1stHdBoard   -5.398e+03  7.689e+03  -0.702 0.482797    
## Exterior1stImStucc   -2.435e+04  2.993e+04  -0.814 0.416075    
## Exterior1stMetalSd    2.257e+02  7.434e+03   0.030 0.975781    
## Exterior1stPlywood   -3.322e+03  8.086e+03  -0.411 0.681302    
## Exterior1stStone      3.570e+03  2.252e+04   0.159 0.874074    
## Exterior1stStucco    -1.377e+04  9.442e+03  -1.458 0.145071    
## Exterior1stVinylSd   -2.060e+02  7.494e+03  -0.027 0.978069    
## Exterior1stWd Sdng   -2.715e+03  7.372e+03  -0.368 0.712705    
## Exterior1stWdShing   -4.008e+03  9.235e+03  -0.434 0.664403    
## MasVnrTypeBrkFace     1.129e+04  7.972e+03   1.417 0.156839    
## MasVnrTypeNone        1.326e+04  8.025e+03   1.652 0.098785 .  
## MasVnrTypeStone       1.341e+04  8.484e+03   1.581 0.114054    
## MasVnrArea            1.236e+01  6.854e+00   1.804 0.071494 .  
## ExterQualFa          -1.700e+04  1.202e+04  -1.415 0.157440    
## ExterQualGd          -1.212e+04  5.721e+03  -2.119 0.034268 *  
## ExterQualTA          -1.431e+04  6.314e+03  -2.266 0.023630 *  
## ExterCondFa          -1.781e+04  1.883e+04  -0.946 0.344555    
## ExterCondGd          -2.151e+04  1.758e+04  -1.224 0.221289    
## ExterCondPo          -3.301e+04  3.505e+04  -0.942 0.346519    
## ExterCondTA          -2.030e+04  1.756e+04  -1.156 0.247892    
## FoundationCBlock      6.522e+03  3.718e+03   1.754 0.079607 .  
## FoundationPConc       6.411e+03  4.050e+03   1.583 0.113699    
## FoundationSlab       -3.416e+03  8.600e+03  -0.397 0.691236    
## FoundationStone       7.674e+03  1.290e+04   0.595 0.552052    
## FoundationWood       -8.403e+03  1.764e+04  -0.476 0.633919    
## BsmtQualFa           -2.507e+04  7.412e+03  -3.383 0.000738 ***
## BsmtQualGd           -2.471e+04  3.983e+03  -6.204 7.36e-10 ***
## BsmtQualTA           -2.382e+04  4.896e+03  -4.865 1.29e-06 ***
## BsmtExposureGd        1.915e+04  3.515e+03   5.447 6.11e-08 ***
## BsmtExposureMn       -2.623e+03  3.621e+03  -0.724 0.468965    
## BsmtExposureNo       -8.524e+03  2.610e+03  -3.266 0.001120 ** 
## BsmtFinSF1            6.354e+00  5.264e+00   1.207 0.227614    
## BsmtUnfSF             4.160e-01  5.485e+00   0.076 0.939548    
## TotalBsmtSF           1.913e+00  6.484e+00   0.295 0.768011    
## HeatingQCFa          -1.008e+03  5.078e+03  -0.198 0.842717    
## HeatingQCGd          -3.934e+03  2.478e+03  -1.587 0.112646    
## HeatingQCPo          -6.254e+03  3.237e+04  -0.193 0.846832    
## HeatingQCTA          -3.220e+03  2.440e+03  -1.319 0.187237    
## CentralAirY           9.548e+02  4.329e+03   0.221 0.825488    
## ElectricalFuseF       1.475e+03  6.825e+03   0.216 0.828910    
## ElectricalFuseP      -1.339e+03  2.027e+04  -0.066 0.947346    
## ElectricalMix        -9.648e+03  3.090e+04  -0.312 0.754918    
## ElectricalSBrkr      -1.201e+03  3.452e+03  -0.348 0.727933    
## X1stFlrSF            -1.632e+01  2.145e+01  -0.761 0.446804    
## X2ndFlrSF             1.312e+01  2.059e+01   0.637 0.523934    
## GrLivArea             5.313e+01  2.099e+01   2.531 0.011488 *  
## BsmtFullBath          8.490e+03  2.246e+03   3.781 0.000164 ***
## BsmtHalfBath          3.777e+03  3.527e+03   1.071 0.284372    
## FullBath              7.689e+03  2.578e+03   2.982 0.002916 ** 
## HalfBath              4.275e+03  2.445e+03   1.749 0.080602 .  
## BedroomAbvGr         -2.105e+03  1.594e+03  -1.321 0.186762    
## KitchenQualFa        -2.744e+04  7.278e+03  -3.770 0.000170 ***
## KitchenQualGd        -2.875e+04  4.144e+03  -6.937 6.29e-12 ***
## KitchenQualTA        -2.954e+04  4.668e+03  -6.328 3.41e-10 ***
## TotRmsAbvGrd          1.884e+03  1.088e+03   1.731 0.083601 .  
## Fireplaces            4.615e+03  1.590e+03   2.902 0.003768 ** 
## GarageTypeAttchd      2.390e+04  1.279e+04   1.869 0.061829 .  
## GarageTypeBasment     2.708e+04  1.469e+04   1.843 0.065515 .  
## GarageTypeBuiltIn     1.846e+04  1.332e+04   1.386 0.166031    
## GarageTypeCarPort     1.996e+04  1.639e+04   1.218 0.223551    
## GarageTypeDetchd      2.291e+04  1.266e+04   1.810 0.070585 .  
## GarageYrBlt           5.123e+01  6.271e+01   0.817 0.414053    
## GarageCars            1.185e+04  2.574e+03   4.603 4.56e-06 ***
## GarageArea           -7.653e+00  9.097e+00  -0.841 0.400350    
## PavedDriveP          -9.404e+02  6.436e+03  -0.146 0.883842    
## PavedDriveY           2.286e+03  3.938e+03   0.580 0.561719    
## WoodDeckSF            1.360e+01  6.803e+00   1.998 0.045889 *  
## MoSold               -4.212e+02  2.928e+02  -1.439 0.150530    
## YrSold               -6.068e+02  6.074e+02  -0.999 0.317915    
## SaleTypeCon           2.461e+04  2.156e+04   1.141 0.253931    
## SaleTypeConLD         1.572e+04  1.160e+04   1.356 0.175415    
## SaleTypeConLI         1.116e+04  1.404e+04   0.794 0.427078    
## SaleTypeConLw        -3.278e+03  1.434e+04  -0.229 0.819258    
## SaleTypeCWD           1.595e+04  1.559e+04   1.023 0.306689    
## SaleTypeNew           3.341e+04  1.837e+04   1.819 0.069178 .  
## SaleTypeOth           1.403e+04  1.766e+04   0.794 0.427169    
## SaleTypeWD           -3.837e+02  5.046e+03  -0.076 0.939395    
## SaleConditionAdjLand  1.984e+04  1.681e+04   1.180 0.238160    
## SaleConditionAlloca   1.841e+03  9.968e+03   0.185 0.853526    
## SaleConditionFamily   1.154e+03  7.362e+03   0.157 0.875505    
## SaleConditionNormal   6.541e+03  3.412e+03   1.917 0.055449 .  
## SaleConditionPartial -1.804e+04  1.768e+04  -1.021 0.307600    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28160 on 1305 degrees of freedom
## Multiple R-squared:  0.8876, Adjusted R-squared:  0.8743 
## F-statistic:  66.9 on 154 and 1305 DF,  p-value: < 2.2e-16
Multi-collinearity Test (Model.1)
  • In Model.1 there are collinearity issues (GVIF^(1/(2*Df)) above 5) with MSSubClass (5.5), X1stFlrSF (11.2), X2ndFlrSF (12.2), GrLivArea (15.0)
  • These variables will be removed in the next model
##                       GVIF Df GVIF^(1/(2*Df))
## Id                1.134866  1        1.065301
## MSSubClass       30.599778  1        5.531707
## MSZoning         47.073460  4        1.618442
## LotFrontage       2.176681  1        1.475358
## LotArea           1.765511  1        1.328725
## LotShape          2.403450  3        1.157371
## LotConfig         2.221529  4        1.104922
## Neighborhood  98310.333590 24        1.270611
## Condition1        4.602658  8        1.100115
## BldgType        139.806280  4        1.854346
## HouseStyle      280.975519  7        1.495908
## OverallQual       4.787143  1        2.187954
## OverallCond       2.262404  1        1.504129
## YearBuilt        12.467703  1        3.530963
## YearRemodAdd      3.327361  1        1.824106
## RoofStyle         3.528657  5        1.134386
## Exterior1st      56.816311 14        1.155207
## MasVnrType        4.995470  3        1.307463
## MasVnrArea        2.822300  1        1.679970
## ExterQual        12.967330  3        1.532763
## ExterCond         2.511153  4        1.121977
## Foundation       19.183677  5        1.343672
## BsmtQual         10.687809  3        1.484162
## BsmtExposure      2.761571  3        1.184475
## BsmtFinSF1       10.602217  1        3.256105
## BsmtUnfSF        10.802945  1        3.286783
## TotalBsmtSF      14.882510  1        3.857786
## HeatingQC         4.366549  4        1.202312
## CentralAir        2.098587  1        1.448650
## Electrical        3.303583  4        1.161109
## X1stFlrSF       126.440297  1       11.244567
## X2ndFlrSF       148.555002  1       12.188314
## GrLivArea       223.790092  1       14.959615
## BsmtFullBath      2.497740  1        1.580424
## BsmtHalfBath      1.304203  1        1.142017
## FullBath          3.710723  1        1.926324
## HalfBath          2.780867  1        1.667593
## BedroomAbvGr      3.108065  1        1.762970
## KitchenQual       8.168256  3        1.419128
## TotRmsAbvGrd      5.753167  1        2.398576
## Fireplaces        1.933016  1        1.390329
## GarageType        6.962789  5        1.214167
## GarageYrBlt       4.163934  1        2.040572
## GarageCars        6.804504  1        2.608544
## GarageArea        6.957391  1        2.637687
## PavedDrive        1.950208  2        1.181735
## WoodDeckSF        1.337436  1        1.156476
## MoSold            1.152343  1        1.073472
## YrSold            1.196749  1        1.093960
## SaleType        117.611721  8        1.347110
## SaleCondition   118.967277  5        1.612660
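The layout of the table above (GVIF, Df, GVIF^(1/(2*Df))) resembles the output of vif() from the car package, although the call is not shown. For a term with one degree of freedom, GVIF is the ordinary variance inflation factor and the last column is its square root (for MSSubClass, √30.60 ≈ 5.53). The base-R sketch below, on stand-in columns from `mtcars`, shows two equivalent ways to get ordinary VIFs for numeric predictors: from the diagonal of the inverted correlation matrix, and from the R-squared of regressing each predictor on the others.

```r
predictors <- mtcars[, c("disp", "hp", "wt", "drat")]   # stand-in numeric predictors

# 1) Diagonal of the inverse correlation (precision) matrix
vif_precision <- diag(solve(cor(predictors)))

# 2) 1 / (1 - R^2) from regressing each predictor on the remaining ones
vif_regression <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v), data = predictors))$r.squared
  1 / (1 - r2)
})

round(rbind(precision = vif_precision, regression = vif_regression), 3)
```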
Plotted Residuals- Model.1

Predicted Sales Prices- Model.1
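A prediction step of the kind named in this heading could be sketched as follows. The data are synthetic (an assumption made so the example runs without the Kaggle files), the object names `toy.train`, `toy.test` and `toy.model` are hypothetical, and only two predictors are used; the two-column Id / SalePrice layout follows the competition's submission format.

```r
set.seed(605)                                  # arbitrary seed

# Hypothetical helper: synthetic rows with two predictors named after train.data columns
make_toy <- function(n, first_id) {
  data.frame(Id          = first_id + seq_len(n) - 1,
             OverallQual = sample(1:10, n, replace = TRUE),
             LotArea     = round(runif(n, 1300, 20000)))
}

toy.train <- make_toy(200, first_id = 1)
toy.train$SalePrice <- 50000 + 13000 * toy.train$OverallQual +
  0.5 * toy.train$LotArea + rnorm(200, sd = 20000)
toy.test <- make_toy(50, first_id = 201)

# Fit on the training rows, then predict the held-out rows
toy.model  <- lm(SalePrice ~ OverallQual + LotArea, data = toy.train)
submission <- data.frame(Id        = toy.test$Id,
                         SalePrice = unname(predict(toy.model, newdata = toy.test)))

head(submission)
# write.csv(submission, "submission.csv", row.names = FALSE)   # file name is illustrative
```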

Model.2 (Significant Variables ONLY)
  • Variables that are statistically significant at the 5% level (p < 0.05) in Model.1 will be maintained in Model.2
  • Variables with collinearity will be removed in Model.2 (MSSubClass (5.5), X1stFlrSF (11.2), X2ndFlrSF (12.2), GrLivArea (15.0))
  • Model.2 Variables: MSZoning, LotFrontage, LotArea, LotShape, LotConfig, Neighborhood, BldgType, HouseStyle, OverallQual, OverallCond, Exterior2nd, ExterQual, BsmtQual, BsmtExposure, BsmtFullBath, FullBath, HalfBath, KitchenQual, Fireplaces, GarageType, GarageCars, WoodDeckSF, SaleCondition
## 
## Call:
## lm(formula = SalePrice ~ ., data = clean.train %>% dplyr::select(MSZoning, 
##     LotFrontage, LotArea, LotShape, LotConfig, Neighborhood, 
##     BldgType, HouseStyle, OverallQual, OverallCond, ExterQual, 
##     BsmtQual, BsmtExposure, Exterior2nd, BsmtFullBath, FullBath, 
##     HalfBath, KitchenQual, Fireplaces, GarageType, GarageCars, 
##     WoodDeckSF, SaleCondition, SalePrice))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -248212  -14995    -606   13312  267801 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.310e+04  2.555e+04   1.686 0.091929 .  
## MSZoningFV            2.330e+04  1.483e+04   1.571 0.116365    
## MSZoningRH            2.521e+04  1.490e+04   1.692 0.090904 .  
## MSZoningRL            1.835e+04  1.249e+04   1.470 0.141836    
## MSZoningRM            1.627e+04  1.161e+04   1.401 0.161456    
## LotFrontage          -5.784e+00  5.074e+01  -0.114 0.909256    
## LotArea               4.781e-01  1.043e-01   4.582 5.02e-06 ***
## LotShapeIR2           8.629e+03  5.344e+03   1.615 0.106620    
## LotShapeIR3          -2.102e+04  1.070e+04  -1.965 0.049624 *  
## LotShapeReg           7.717e+02  2.047e+03   0.377 0.706227    
## LotConfigCulDSac      9.516e+03  4.165e+03   2.285 0.022471 *  
## LotConfigFR2         -1.619e+04  5.197e+03  -3.115 0.001879 ** 
## LotConfigFR3         -2.453e+04  1.603e+04  -1.531 0.126124    
## LotConfigInside      -1.641e+03  2.259e+03  -0.726 0.467845    
## NeighborhoodBlueste  -1.578e+04  2.428e+04  -0.650 0.515747    
## NeighborhoodBrDale   -1.051e+03  1.384e+04  -0.076 0.939491    
## NeighborhoodBrkSide  -6.446e+03  1.126e+04  -0.572 0.567232    
## NeighborhoodClearCr  -6.812e+03  1.116e+04  -0.610 0.541809    
## NeighborhoodCollgCr  -7.046e+03  9.064e+03  -0.777 0.437075    
## NeighborhoodCrawfor   2.248e+04  1.044e+04   2.153 0.031509 *  
## NeighborhoodEdwards  -1.923e+04  9.933e+03  -1.936 0.053106 .  
## NeighborhoodGilbert  -2.156e+04  9.671e+03  -2.229 0.025984 *  
## NeighborhoodIDOTRR   -1.525e+04  1.293e+04  -1.179 0.238461    
## NeighborhoodMeadowV  -3.178e+02  1.389e+04  -0.023 0.981752    
## NeighborhoodMitchel  -1.550e+04  1.022e+04  -1.517 0.129610    
## NeighborhoodNAmes    -8.859e+03  9.639e+03  -0.919 0.358209    
## NeighborhoodNoRidge   6.564e+04  1.032e+04   6.362 2.71e-10 ***
## NeighborhoodNPkVill  -3.902e+03  1.592e+04  -0.245 0.806480    
## NeighborhoodNridgHt   2.630e+04  9.210e+03   2.855 0.004363 ** 
## NeighborhoodNWAmes   -1.318e+04  9.950e+03  -1.324 0.185564    
## NeighborhoodOldTown  -1.751e+04  1.152e+04  -1.520 0.128749    
## NeighborhoodSawyer   -1.066e+04  1.017e+04  -1.049 0.294496    
## NeighborhoodSawyerW  -4.199e+03  9.762e+03  -0.430 0.667183    
## NeighborhoodSomerst   1.641e+03  1.101e+04   0.149 0.881550    
## NeighborhoodStoneBr   4.782e+04  1.043e+04   4.584 4.99e-06 ***
## NeighborhoodSWISU    -1.793e+04  1.184e+04  -1.515 0.130072    
## NeighborhoodTimber   -1.251e+04  1.023e+04  -1.223 0.221597    
## NeighborhoodVeenker   8.756e+03  1.319e+04   0.664 0.506866    
## BldgType2fmCon       -9.621e+03  6.171e+03  -1.559 0.119182    
## BldgTypeDuplex       -1.347e+04  5.258e+03  -2.561 0.010540 *  
## BldgTypeTwnhs        -3.453e+04  7.012e+03  -4.925 9.46e-07 ***
## BldgTypeTwnhsE       -3.222e+04  4.633e+03  -6.956 5.43e-12 ***
## HouseStyle1.5Unf     -1.620e+04  8.919e+03  -1.817 0.069483 .  
## HouseStyle1Story      7.243e+01  3.433e+03   0.021 0.983170    
## HouseStyle2.5Fin      2.979e+04  1.188e+04   2.507 0.012276 *  
## HouseStyle2.5Unf     -3.748e+03  1.018e+04  -0.368 0.712885    
## HouseStyle2Story     -3.963e+03  3.518e+03  -1.126 0.260251    
## HouseStyleSFoyer     -1.633e+04  6.970e+03  -2.343 0.019288 *  
## HouseStyleSLvl       -1.182e+04  5.409e+03  -2.186 0.028965 *  
## OverallQual           1.318e+04  1.173e+03  11.232  < 2e-16 ***
## OverallCond           4.562e+03  8.885e+02   5.135 3.24e-07 ***
## ExterQualFa          -2.018e+04  1.221e+04  -1.654 0.098452 .  
## ExterQualGd          -1.889e+04  6.017e+03  -3.139 0.001730 ** 
## ExterQualTA          -2.333e+04  6.636e+03  -3.515 0.000454 ***
## BsmtQualFa           -4.011e+04  7.510e+03  -5.340 1.09e-07 ***
## BsmtQualGd           -3.194e+04  4.203e+03  -7.598 5.56e-14 ***
## BsmtQualTA           -3.222e+04  5.016e+03  -6.423 1.84e-10 ***
## BsmtExposureGd        1.963e+04  3.726e+03   5.268 1.60e-07 ***
## BsmtExposureMn       -3.217e+03  3.871e+03  -0.831 0.406110    
## BsmtExposureNo       -9.152e+03  2.789e+03  -3.282 0.001057 ** 
## Exterior2ndAsphShn    3.176e+03  1.958e+04   0.162 0.871184    
## Exterior2ndBrk Cmn   -1.180e+04  1.707e+04  -0.691 0.489370    
## Exterior2ndBrkFace    9.101e+03  9.814e+03   0.927 0.353913    
## Exterior2ndCBlock     8.213e+03  3.376e+04   0.243 0.807857    
## Exterior2ndCmentBd   -3.168e+03  9.098e+03  -0.348 0.727737    
## Exterior2ndHdBoard   -8.118e+03  7.744e+03  -1.048 0.294721    
## Exterior2ndImStucc    1.720e+04  1.260e+04   1.365 0.172347    
## Exterior2ndMetalSd   -4.801e+03  7.538e+03  -0.637 0.524293    
## Exterior2ndOther     -1.592e+04  3.281e+04  -0.485 0.627453    
## Exterior2ndPlywood   -8.120e+03  7.937e+03  -1.023 0.306451    
## Exterior2ndStone     -2.758e+04  1.617e+04  -1.706 0.088269 .  
## Exterior2ndStucco    -1.327e+04  9.625e+03  -1.379 0.168063    
## Exterior2ndVinylSd   -4.169e+03  7.588e+03  -0.549 0.582869    
## Exterior2ndWd Sdng   -7.870e+03  7.556e+03  -1.041 0.297850    
## Exterior2ndWd Shng   -2.069e+04  8.809e+03  -2.348 0.019000 *  
## BsmtFullBath          1.208e+04  1.833e+03   6.591 6.23e-11 ***
## FullBath              2.636e+04  2.353e+03  11.206  < 2e-16 ***
## HalfBath              1.362e+04  2.464e+03   5.526 3.92e-08 ***
## KitchenQualFa        -3.654e+04  7.507e+03  -4.867 1.27e-06 ***
## KitchenQualGd        -3.290e+04  4.428e+03  -7.431 1.89e-13 ***
## KitchenQualTA        -3.501e+04  4.935e+03  -7.094 2.08e-12 ***
## Fireplaces            9.711e+03  1.603e+03   6.059 1.77e-09 ***
## GarageTypeAttchd      2.587e+04  1.349e+04   1.918 0.055367 .  
## GarageTypeBasment     2.871e+04  1.551e+04   1.851 0.064360 .  
## GarageTypeBuiltIn     3.408e+04  1.396e+04   2.441 0.014785 *  
## GarageTypeCarPort     8.277e+03  1.703e+04   0.486 0.627087    
## GarageTypeDetchd      2.263e+04  1.337e+04   1.692 0.090819 .  
## GarageCars            1.362e+04  1.704e+03   7.990 2.85e-15 ***
## WoodDeckSF            2.172e+01  7.277e+00   2.985 0.002887 ** 
## SaleConditionAdjLand  8.243e+03  1.652e+04   0.499 0.618002    
## SaleConditionAlloca   8.191e+03  1.032e+04   0.794 0.427569    
## SaleConditionFamily  -3.102e+02  7.779e+03  -0.040 0.968201    
## SaleConditionNormal   3.894e+03  3.380e+03   1.152 0.249403    
## SaleConditionPartial  1.295e+04  4.779e+03   2.709 0.006826 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30820 on 1366 degrees of freedom
## Multiple R-squared:  0.8591, Adjusted R-squared:  0.8495 
## F-statistic: 89.55 on 93 and 1366 DF,  p-value: < 2.2e-16
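The adjusted R-squared and F-statistic in the summary above follow directly from the multiple R-squared and the degrees of freedom. The sketch below recomputes both as a sanity check. It uses only the rounded figures printed in the summary, and the variable names are illustrative, so expect small rounding differences from the reported values.

```r
# Values copied from the Model.2 summary output (rounded as printed)
r2       <- 0.8591   # Multiple R-squared
p        <- 93       # numerator degrees of freedom (number of coefficients, excluding intercept)
df_resid <- 1366     # residual degrees of freedom

# Number of observations implied by the degrees of freedom
n <- df_resid + p + 1

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

# Overall F-statistic: explained variance per predictor over residual variance per df
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

print(c(n = n, adj_r2 = round(adj_r2, 4), f_stat = round(f_stat, 2)))
```

Both recomputed values should land close to the reported 0.8495 and 89.55; any gap in the last digit comes from starting with a four-decimal R-squared.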
Multi-collinearity Test (Model.2)
  • In Model.2 the collinearity looks fine: every adjusted GVIF value, GVIF^(1/(2*Df)), is below 4 (the largest is OverallQual at about 2.01)
##                       GVIF Df GVIF^(1/(2*Df))
## MSZoning         32.569809  4        1.545617
## LotFrontage       1.920814  1        1.385934
## LotArea           1.665909  1        1.290701
## LotShape          1.945047  3        1.117262
## LotConfig         1.746427  4        1.072183
## Neighborhood  10660.912419 24        1.213144
## BldgType          7.163336  4        1.279056
## HouseStyle        9.279987  7        1.172494
## OverallQual       4.042981  1        2.010717
## OverallCond       1.501641  1        1.225415
## ExterQual         9.121737  3        1.445483
## BsmtQual          7.227337  3        1.390475
## BsmtExposure      2.273881  3        1.146730
## Exterior2nd      26.239573 15        1.115061
## BsmtFullBath      1.389493  1        1.178768
## FullBath          2.580369  1        1.606353
## HalfBath          2.358272  1        1.535667
## KitchenQual       5.836024  3        1.341795
## Fireplaces        1.639886  1        1.280580
## GarageType        3.766809  5        1.141819
## GarageCars        2.491990  1        1.578604
## WoodDeckSF        1.277969  1        1.130473
## SaleCondition     2.386563  5        1.090881
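The third column of the table above is the raw GVIF scaled by its degrees of freedom, which makes multi-level factors such as Neighborhood comparable with single-coefficient terms. A minimal sketch that recomputes it for a few rows copied from the table; the cutoff of 4 is the one used in this report, not a universal rule.

```r
# GVIF and Df values copied from the table above (a subset, for illustration)
gvif <- c(MSZoning = 32.569809, Neighborhood = 10660.912419,
          OverallQual = 4.042981, Exterior2nd = 26.239573)
df   <- c(MSZoning = 4, Neighborhood = 24,
          OverallQual = 1, Exterior2nd = 15)

# Adjusted GVIF: GVIF^(1/(2*Df)), comparable across terms with different Df
adj_gvif <- gvif^(1 / (2 * df))
print(round(adj_gvif, 6))

# Check against the cutoff used in this report
cutoff <- 4
all_below_cutoff <- all(adj_gvif < cutoff)
print(all_below_cutoff)
```

Note how Neighborhood's raw GVIF is above 10,000 yet its adjusted value is close to 1.2, which is why the adjusted column is the one to read for factors with many levels.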
Plotted Residuals - Model.2

Predicted Sales Prices - Model.2

Model Selection: Model.2 Plotted Predicted Values (Test Data)

Of the two models, Model.2 has the larger F-statistic (89.55), no apparent multicollinearity issues, and an adjusted R-squared (0.8495) that is only slightly smaller than Model.1's (0.8743). The predicted values for Model.2 also appear to give the tighter fit (see the box plots above).

  • Kaggle.com user name: MeaghanB, score: 0.16778

Kaggle Submission

MB Summary:
  • For this dataset, the null values in the columns proved especially challenging. I imputed the median for numerical columns and the most common value for character columns. In the future, I would like to try additional imputation methods and test their impact on the modeling results.
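The imputation approach described in the summary can be sketched in base R as follows. This is an illustration under stated assumptions, not the code used for the submission: the helper name `impute_simple` and the toy data frame are hypothetical, and only the column names are borrowed from the model above.

```r
# Hypothetical helper: median for numeric columns, most common value otherwise
impute_simple <- function(df) {
  for (col in names(df)) {
    missing <- is.na(df[[col]])
    # Skip columns with nothing to fill, or nothing to learn from
    if (!any(missing) || all(missing)) next
    if (is.numeric(df[[col]])) {
      df[[col]][missing] <- median(df[[col]], na.rm = TRUE)
    } else {
      counts <- table(df[[col]])            # table() ignores NA by default
      df[[col]][missing] <- names(counts)[which.max(counts)]
    }
  }
  df
}

# Toy data with one missing numeric and one missing character value
toy <- data.frame(
  LotFrontage = c(60, NA, 80, 70),
  GarageType  = c("Attchd", "Detchd", NA, "Attchd"),
  stringsAsFactors = FALSE
)

filled <- impute_simple(toy)
print(filled)
```

Keeping the rule in one function makes it straightforward to swap in another method later and compare the resulting model fits.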