Data 605 Final

Problem 1.

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of mu =sigma=(N+1)/2

#set.seed function ensures results are reproducible
set.seed(100)

#create variables
N <- 8
mu <- (N + 1)/2
devi <- mu

Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

X <- runif(10000, min=1, max=6)
Y <- rnorm(10000, mean= mu, sd=devi)


#Graph histograms to show distribution
plot_histogram(X)

plot_histogram(Y)

#create little x and y

# Assume the small letter "x" is estimated as the median of the X variable
x <- median(X)

# small letter "y" is estimated as the 1st quartile of the Y variable
y <- quantile(Y, 0.25)

1a.P(X>x | X>y) Reference: https://stats.stackexchange.com/questions/391756/using-conditional-probability-to-calculate-sentiment-score-probability

#Means conditional probability  -> P(X>x and X>y)/P(X>y) --> P(AB)/P(B)
#Find the determine cases where B (...or the second case in our situation) is true and given B where A is also true. 
sum(X>x & X>y) / sum(X>y)
## [1] 0.5444251

1b.P(X>x, Y>y)

#Both are independent events
#Formula used P( A and B) = P(A)P(B)//Finding the intersection of A and B, both occur simulateneously
#Length of both X and Y are 10000

sum(X>x & Y>y)/length(X)
## [1] 0.3755

1c.P(X<x | X>y)

sum(X<x & X>y)/sum(X>y)
## [1] 0.4555749
  1. Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities
#Start the creation of the table
X_grt_x_Y_grt_y <- sum(X>x & Y>y)


X_lsr_x_Y_lsr_y <- sum(X<x & Y<y)


X_grt_x_Y_lsr_y <- sum(X>x & Y<y)


X_lsr_x_Y_grt_y <- sum(X<x & Y>y)


table <- matrix(c(X_grt_x_Y_grt_y, X_grt_x_Y_lsr_y, X_lsr_x_Y_grt_y, X_lsr_x_Y_lsr_y), nrow=2, ncol=2)
colnames(table) <- c('X > x', ' X < x')
rownames(table) <- c('Y > y', 'Y < y')
print(table)
##       X > x  X < x
## Y > y  3755   3745
## Y < y  1245   1255
#P(X>x and Y>y)
a <- 3755 / 10000

#P(X>x)P(Y>y)
b <- sum(X>x & Y>y)/10000

Are_Equal <- a == b
Are_Equal
## [1] TRUE

P(X>x and Y >y) does equal the probability of P(X>x)P(Y>y)

  1. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
#Run Fisher test
fisher.test(table)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table
## p-value = 0.8354
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.9222661 1.1076494
## sample estimates:
## odds ratio 
##   1.010724
#Chi Square Test
chisq.test(table)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table
## X-squared = 0.0432, df = 1, p-value = 0.8353

The p-values for both tests are very large (significantly above the 0.05), therefore we fail to reject the null-hypothesis. These events are indeed indepdent.

The chi-squared test is an appropriate approximation when the sample is large, while the Fisher exact test runs a precise procedure for small-sized samples. (reference: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5426219/#:~:text=The%20chi%2Dsquared%20test%20applies,especially%20for%20small%2Dsized%20samples.) Since the case is large, I would say the chi-test is most appropriate.

Problem 2.

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

Preliminary

#Load Data

#Train
train <- read.csv("train.csv", header = TRUE, sep=',')

#Test
test <- read.csv("test.csv", header = TRUE, sep=',')

2a.

Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

Descriptive Statistics

##View the first five rows of the dataset
head(train)
##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1  1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
## 2  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
## 3  3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
## 4  4         70       RL          60    9550   Pave  <NA>      IR1         Lvl
## 5  5         60       RL          84   14260   Pave  <NA>      IR1         Lvl
## 6  6         50       RL          85   14115   Pave  <NA>      IR1         Lvl
##   Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 2    AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
## 3    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 4    AllPub    Corner       Gtl      Crawfor       Norm       Norm     1Fam
## 5    AllPub       FR2       Gtl      NoRidge       Norm       Norm     1Fam
## 6    AllPub    Inside       Gtl      Mitchel       Norm       Norm     1Fam
##   HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1     2Story           7           5      2003         2003     Gable  CompShg
## 2     1Story           6           8      1976         1976     Gable  CompShg
## 3     2Story           7           5      2001         2002     Gable  CompShg
## 4     2Story           7           5      1915         1970     Gable  CompShg
## 5     2Story           8           5      2000         2000     Gable  CompShg
## 6     1.5Fin           5           5      1993         1995     Gable  CompShg
##   Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1     VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
## 2     MetalSd     MetalSd       None          0        TA        TA     CBlock
## 3     VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
## 4     Wd Sdng     Wd Shng       None          0        TA        TA     BrkTil
## 5     VinylSd     VinylSd    BrkFace        350        Gd        TA      PConc
## 6     VinylSd     VinylSd       None          0        TA        TA       Wood
##   BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1       Gd       TA           No          GLQ        706          Unf
## 2       Gd       TA           Gd          ALQ        978          Unf
## 3       Gd       TA           Mn          GLQ        486          Unf
## 4       TA       Gd           No          ALQ        216          Unf
## 5       Gd       TA           Av          GLQ        655          Unf
## 6       Gd       TA           No          GLQ        732          Unf
##   BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1          0       150         856    GasA        Ex          Y      SBrkr
## 2          0       284        1262    GasA        Ex          Y      SBrkr
## 3          0       434         920    GasA        Ex          Y      SBrkr
## 4          0       540         756    GasA        Gd          Y      SBrkr
## 5          0       490        1145    GasA        Ex          Y      SBrkr
## 6          0        64         796    GasA        Ex          Y      SBrkr
##   X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1       856       854            0      1710            1            0        2
## 2      1262         0            0      1262            0            1        2
## 3       920       866            0      1786            1            0        2
## 4       961       756            0      1717            1            0        1
## 5      1145      1053            0      2198            1            0        2
## 6       796       566            0      1362            1            0        1
##   HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1        1            3            1          Gd            8        Typ
## 2        0            3            1          TA            6        Typ
## 3        1            3            1          Gd            6        Typ
## 4        0            3            1          Gd            7        Typ
## 5        1            4            1          Gd            9        Typ
## 6        1            1            1          TA            5        Typ
##   Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1          0        <NA>     Attchd        2003          RFn          2
## 2          1          TA     Attchd        1976          RFn          2
## 3          1          TA     Attchd        2001          RFn          2
## 4          1          Gd     Detchd        1998          Unf          3
## 5          1          TA     Attchd        2000          RFn          3
## 6          0        <NA>     Attchd        1993          Unf          2
##   GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1        548         TA         TA          Y          0          61
## 2        460         TA         TA          Y        298           0
## 3        608         TA         TA          Y          0          42
## 4        642         TA         TA          Y          0          35
## 5        836         TA         TA          Y        192          84
## 6        480         TA         TA          Y         40          30
##   EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1             0          0           0        0   <NA>  <NA>        <NA>
## 2             0          0           0        0   <NA>  <NA>        <NA>
## 3             0          0           0        0   <NA>  <NA>        <NA>
## 4           272          0           0        0   <NA>  <NA>        <NA>
## 5             0          0           0        0   <NA>  <NA>        <NA>
## 6             0        320           0        0   <NA> MnPrv        Shed
##   MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1       0      2   2008       WD        Normal    208500
## 2       0      5   2007       WD        Normal    181500
## 3       0      9   2008       WD        Normal    223500
## 4       0      2   2006       WD       Abnorml    140000
## 5       0     12   2008       WD        Normal    250000
## 6     700     10   2009       WD        Normal    143000
#Find out total rows and columns
dim(train)
## [1] 1460   81
#Lets view the data types of the variables 
glimpse(train)
## Observations: 1,460
## Variables: 81
## $ Id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ MSSubClass    <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60, 20, ...
## $ MSZoning      <fct> RL, RL, RL, RL, RL, RL, RL, RL, RM, RL, RL, RL, RL, R...
## $ LotFrontage   <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 9...
## $ LotArea       <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, ...
## $ Street        <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave,...
## $ Alley         <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ LotShape      <fct> Reg, Reg, IR1, IR1, IR1, IR1, Reg, IR1, Reg, Reg, Reg...
## $ LandContour   <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl...
## $ Utilities     <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, AllPu...
## $ LotConfig     <fct> Inside, FR2, Inside, Corner, FR2, Inside, Inside, Cor...
## $ LandSlope     <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl...
## $ Neighborhood  <fct> CollgCr, Veenker, CollgCr, Crawfor, NoRidge, Mitchel,...
## $ Condition1    <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, PosN, Arte...
## $ Condition2    <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm,...
## $ BldgType      <fct> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam,...
## $ HouseStyle    <fct> 2Story, 1Story, 2Story, 2Story, 2Story, 1.5Fin, 1Stor...
## $ OverallQual   <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, 6, 4,...
## $ OverallCond   <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, 7, 5,...
## $ YearBuilt     <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931,...
## $ YearRemodAdd  <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, 1950,...
## $ RoofStyle     <fct> Gable, Gable, Gable, Gable, Gable, Gable, Gable, Gabl...
## $ RoofMatl      <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompShg,...
## $ Exterior1st   <fct> VinylSd, MetalSd, VinylSd, Wd Sdng, VinylSd, VinylSd,...
## $ Exterior2nd   <fct> VinylSd, MetalSd, VinylSd, Wd Shng, VinylSd, VinylSd,...
## $ MasVnrType    <fct> BrkFace, None, BrkFace, None, BrkFace, None, Stone, S...
## $ MasVnrArea    <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, 0, 30...
## $ ExterQual     <fct> Gd, TA, Gd, TA, Gd, TA, Gd, TA, TA, TA, TA, Ex, TA, G...
## $ ExterCond     <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, T...
## $ Foundation    <fct> PConc, CBlock, PConc, BrkTil, PConc, Wood, PConc, CBl...
## $ BsmtQual      <fct> Gd, Gd, Gd, TA, Gd, Gd, Ex, Gd, TA, TA, TA, Ex, TA, G...
## $ BsmtCond      <fct> TA, TA, TA, Gd, TA, TA, TA, TA, TA, TA, TA, TA, TA, T...
## $ BsmtExposure  <fct> No, Gd, Mn, No, Av, No, Av, Mn, No, No, No, No, No, A...
## $ BsmtFinType1  <fct> GLQ, ALQ, GLQ, ALQ, GLQ, GLQ, GLQ, ALQ, Unf, GLQ, Rec...
## $ BsmtFinSF1    <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851, 906,...
## $ BsmtFinType2  <fct> Unf, Unf, Unf, Unf, Unf, Unf, Unf, BLQ, Unf, Unf, Unf...
## $ BsmtFinSF2    <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ BsmtUnfSF     <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140, 134,...
## $ TotalBsmtSF   <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952, 991,...
## $ Heating       <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA,...
## $ HeatingQC     <fct> Ex, Ex, Ex, Gd, Ex, Ex, Ex, Ex, Gd, Ex, Ex, Ex, TA, E...
## $ CentralAir    <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y,...
## $ Electrical    <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrk...
## $ X1stFlrSF     <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022, 107...
## $ X2ndFlrSF     <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, 1142,...
## $ LowQualFinSF  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ GrLivArea     <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774,...
## $ BsmtFullBath  <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0,...
## $ BsmtHalfBath  <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ FullBath      <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, 1, 2,...
## $ HalfBath      <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ BedroomAbvGr  <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, 2, 2,...
## $ KitchenAbvGr  <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2,...
## $ KitchenQual   <fct> Gd, TA, Gd, Gd, Gd, TA, Gd, TA, TA, TA, TA, Ex, TA, G...
## $ TotRmsAbvGrd  <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5, 5, 6...
## $ Functional    <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Min1, Typ, Ty...
## $ Fireplaces    <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, 1, 0,...
## $ FireplaceQu   <fct> NA, TA, TA, Gd, TA, NA, Gd, TA, TA, TA, NA, Gd, NA, G...
## $ GarageType    <fct> Attchd, Attchd, Attchd, Detchd, Attchd, Attchd, Attch...
## $ GarageYrBlt   <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, 1931,...
## $ GarageFinish  <fct> RFn, RFn, RFn, Unf, RFn, Unf, RFn, RFn, Unf, RFn, Unf...
## $ GarageCars    <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2,...
## $ GarageArea    <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205, 384...
## $ GarageQual    <fct> TA, TA, TA, TA, TA, TA, TA, TA, Fa, Gd, TA, TA, TA, T...
## $ GarageCond    <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, T...
## $ PavedDrive    <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y,...
## $ WoodDeckSF    <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, 140, ...
## $ OpenPorchSF   <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, 33, 2...
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, 176, ...
## $ X3SsnPorch    <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ ScreenPorch   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0, 0, ...
## $ PoolArea      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PoolQC        <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Fence         <fct> NA, NA, NA, NA, NA, MnPrv, NA, NA, NA, NA, NA, NA, NA...
## $ MiscFeature   <fct> NA, NA, NA, NA, NA, Shed, NA, Shed, NA, NA, NA, NA, N...
## $ MiscVal       <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0, 0, 7...
## $ MoSold        <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, 7, 3,...
## $ YrSold        <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, 2008,...
## $ SaleType      <fct> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, New, WD, ...
## $ SaleCondition <fct> Normal, Normal, Normal, Abnorml, Normal, Normal, Norm...
## $ SalePrice     <int> 208500, 181500, 223500, 140000, 250000, 143000, 30700...
#Overall statistics
summary(train)
##        Id           MSSubClass       MSZoning     LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   C (all):  10   Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   FV     :  65   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   RH     :  16   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9   RL     :1151   Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0   RM     : 218   3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                  Max.   :313.00  
##                                                  NA's   :259     
##     LotArea        Street      Alley      LotShape  LandContour  Utilities   
##  Min.   :  1300   Grvl:   6   Grvl:  50   IR1:484   Bnk:  63    AllPub:1459  
##  1st Qu.:  7554   Pave:1454   Pave:  41   IR2: 41   HLS:  50    NoSeWa:   1  
##  Median :  9478               NA's:1369   IR3: 10   Low:  36                 
##  Mean   : 10517                           Reg:925   Lvl:1311                 
##  3rd Qu.: 11602                                                              
##  Max.   :215245                                                              
##                                                                              
##    LotConfig    LandSlope   Neighborhood   Condition1     Condition2  
##  Corner : 263   Gtl:1382   NAmes  :225   Norm   :1260   Norm   :1445  
##  CulDSac:  94   Mod:  65   CollgCr:150   Feedr  :  81   Feedr  :   6  
##  FR2    :  47   Sev:  13   OldTown:113   Artery :  48   Artery :   2  
##  FR3    :   4              Edwards:100   RRAn   :  26   PosN   :   2  
##  Inside :1052              Somerst: 86   PosN   :  19   RRNn   :   2  
##                            Gilbert: 79   RRAe   :  11   PosA   :   1  
##                            (Other):707   (Other):  15   (Other):   2  
##    BldgType      HouseStyle   OverallQual      OverallCond      YearBuilt   
##  1Fam  :1220   1Story :726   Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  2fmCon:  31   2Story :445   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
##  Duplex:  52   1.5Fin :154   Median : 6.000   Median :5.000   Median :1973  
##  Twnhs :  43   SLvl   : 65   Mean   : 6.099   Mean   :5.575   Mean   :1971  
##  TwnhsE: 114   SFoyer : 37   3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000  
##                1.5Unf : 14   Max.   :10.000   Max.   :9.000   Max.   :2010  
##                (Other): 19                                                  
##   YearRemodAdd    RoofStyle       RoofMatl     Exterior1st   Exterior2nd 
##  Min.   :1950   Flat   :  13   CompShg:1434   VinylSd:515   VinylSd:504  
##  1st Qu.:1967   Gable  :1141   Tar&Grv:  11   HdBoard:222   MetalSd:214  
##  Median :1994   Gambrel:  11   WdShngl:   6   MetalSd:220   HdBoard:207  
##  Mean   :1985   Hip    : 286   WdShake:   5   Wd Sdng:206   Wd Sdng:197  
##  3rd Qu.:2004   Mansard:   7   ClyTile:   1   Plywood:108   Plywood:142  
##  Max.   :2010   Shed   :   2   Membran:   1   CemntBd: 61   CmentBd: 60  
##                                (Other):   2   (Other):128   (Other):136  
##    MasVnrType    MasVnrArea     ExterQual ExterCond  Foundation  BsmtQual  
##  BrkCmn : 15   Min.   :   0.0   Ex: 52    Ex:   3   BrkTil:146   Ex  :121  
##  BrkFace:445   1st Qu.:   0.0   Fa: 14    Fa:  28   CBlock:634   Fa  : 35  
##  None   :864   Median :   0.0   Gd:488    Gd: 146   PConc :647   Gd  :618  
##  Stone  :128   Mean   : 103.7   TA:906    Po:   1   Slab  : 24   TA  :649  
##  NA's   :  8   3rd Qu.: 166.0             TA:1282   Stone :  6   NA's: 37  
##                Max.   :1600.0                       Wood  :  3             
##                NA's   :8                                                   
##  BsmtCond    BsmtExposure BsmtFinType1   BsmtFinSF1     BsmtFinType2
##  Fa  :  45   Av  :221     ALQ :220     Min.   :   0.0   ALQ :  19   
##  Gd  :  65   Gd  :134     BLQ :148     1st Qu.:   0.0   BLQ :  33   
##  Po  :   2   Mn  :114     GLQ :418     Median : 383.5   GLQ :  14   
##  TA  :1311   No  :953     LwQ : 74     Mean   : 443.6   LwQ :  46   
##  NA's:  37   NA's: 38     Rec :133     3rd Qu.: 712.2   Rec :  54   
##                           Unf :430     Max.   :5644.0   Unf :1256   
##                           NA's: 37                      NA's:  38   
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF      Heating     HeatingQC
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Floor:   1   Ex:741   
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   GasA :1428   Fa: 49   
##  Median :   0.00   Median : 477.5   Median : 991.5   GasW :  18   Gd:241   
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4   Grav :   7   Po:  1   
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2   OthW :   2   TA:428   
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0   Wall :   4            
##                                                                            
##  CentralAir Electrical     X1stFlrSF      X2ndFlrSF     LowQualFinSF    
##  N:  95     FuseA:  94   Min.   : 334   Min.   :   0   Min.   :  0.000  
##  Y:1365     FuseF:  27   1st Qu.: 882   1st Qu.:   0   1st Qu.:  0.000  
##             FuseP:   3   Median :1087   Median :   0   Median :  0.000  
##             Mix  :   1   Mean   :1163   Mean   : 347   Mean   :  5.845  
##             SBrkr:1334   3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:  0.000  
##             NA's :   1   Max.   :4692   Max.   :2065   Max.   :572.000  
##                                                                         
##    GrLivArea     BsmtFullBath     BsmtHalfBath        FullBath    
##  Min.   : 334   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:1130   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000  
##  Median :1464   Median :0.0000   Median :0.00000   Median :2.000  
##  Mean   :1515   Mean   :0.4253   Mean   :0.05753   Mean   :1.565  
##  3rd Qu.:1777   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000  
##  Max.   :5642   Max.   :3.0000   Max.   :2.00000   Max.   :3.000  
##                                                                   
##     HalfBath       BedroomAbvGr    KitchenAbvGr   KitchenQual  TotRmsAbvGrd   
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Ex:100      Min.   : 2.000  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000   Fa: 39      1st Qu.: 5.000  
##  Median :0.0000   Median :3.000   Median :1.000   Gd:586      Median : 6.000  
##  Mean   :0.3829   Mean   :2.866   Mean   :1.047   TA:735      Mean   : 6.518  
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000               3rd Qu.: 7.000  
##  Max.   :2.0000   Max.   :8.000   Max.   :3.000               Max.   :14.000  
##                                                                               
##  Functional    Fireplaces    FireplaceQu   GarageType   GarageYrBlt  
##  Maj1:  14   Min.   :0.000   Ex  : 24    2Types :  6   Min.   :1900  
##  Maj2:   5   1st Qu.:0.000   Fa  : 33    Attchd :870   1st Qu.:1961  
##  Min1:  31   Median :1.000   Gd  :380    Basment: 19   Median :1980  
##  Min2:  34   Mean   :0.613   Po  : 20    BuiltIn: 88   Mean   :1979  
##  Mod :  15   3rd Qu.:1.000   TA  :313    CarPort:  9   3rd Qu.:2002  
##  Sev :   1   Max.   :3.000   NA's:690    Detchd :387   Max.   :2010  
##  Typ :1360                               NA's   : 81   NA's   :81    
##  GarageFinish   GarageCars      GarageArea     GarageQual  GarageCond 
##  Fin :352     Min.   :0.000   Min.   :   0.0   Ex  :   3   Ex  :   2  
##  RFn :422     1st Qu.:1.000   1st Qu.: 334.5   Fa  :  48   Fa  :  35  
##  Unf :605     Median :2.000   Median : 480.0   Gd  :  14   Gd  :   9  
##  NA's: 81     Mean   :1.767   Mean   : 473.0   Po  :   3   Po  :   7  
##               3rd Qu.:2.000   3rd Qu.: 576.0   TA  :1311   TA  :1326  
##               Max.   :4.000   Max.   :1418.0   NA's:  81   NA's:  81  
##                                                                       
##  PavedDrive   WoodDeckSF      OpenPorchSF     EnclosedPorch      X3SsnPorch    
##  N:  90     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  P:  30     1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
##  Y:1340     Median :  0.00   Median : 25.00   Median :  0.00   Median :  0.00  
##             Mean   : 94.24   Mean   : 46.66   Mean   : 21.95   Mean   :  3.41  
##             3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00   3rd Qu.:  0.00  
##             Max.   :857.00   Max.   :547.00   Max.   :552.00   Max.   :508.00  
##                                                                                
##   ScreenPorch        PoolArea        PoolQC       Fence      MiscFeature
##  Min.   :  0.00   Min.   :  0.000   Ex  :   2   GdPrv:  59   Gar2:   2  
##  1st Qu.:  0.00   1st Qu.:  0.000   Fa  :   2   GdWo :  54   Othr:   2  
##  Median :  0.00   Median :  0.000   Gd  :   3   MnPrv: 157   Shed:  49  
##  Mean   : 15.06   Mean   :  2.759   NA's:1453   MnWw :  11   TenC:   1  
##  3rd Qu.:  0.00   3rd Qu.:  0.000               NA's :1179   NA's:1406  
##  Max.   :480.00   Max.   :738.000                                       
##                                                                         
##     MiscVal             MoSold           YrSold        SaleType   
##  Min.   :    0.00   Min.   : 1.000   Min.   :2006   WD     :1267  
##  1st Qu.:    0.00   1st Qu.: 5.000   1st Qu.:2007   New    : 122  
##  Median :    0.00   Median : 6.000   Median :2008   COD    :  43  
##  Mean   :   43.49   Mean   : 6.322   Mean   :2008   ConLD  :   9  
##  3rd Qu.:    0.00   3rd Qu.: 8.000   3rd Qu.:2009   ConLI  :   5  
##  Max.   :15500.00   Max.   :12.000   Max.   :2010   ConLw  :   5  
##                                                     (Other):   9  
##  SaleCondition    SalePrice     
##  Abnorml: 101   Min.   : 34900  
##  AdjLand:   4   1st Qu.:129975  
##  Alloca :  12   Median :163000  
##  Family :  20   Mean   :180921  
##  Normal :1198   3rd Qu.:214000  
##  Partial: 125   Max.   :755000  
## 
#View how the variables are skewed
plot_histogram(train)

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. The 2 independent variables: X1stFlrSF : First floor square feet GrLivArea: Above ground living area

Dependent variable: SalePrice: sale price of the house

plot(train$X1stFlrSF, train$SalePrice,
     main = "First Floor Sq Ft vs Sale Price",
     xlab = "Sq Ft",
     ylab = 'Sale Price')

plot(train$GrLivArea, train$SalePrice,
     main = "Above Ground Living Area vs Sale Price",
     xlab = "Sq Ft",
     ylab = 'Sale Price')

Derive a correlation matrix for any three quantitative variables in the dataset

corr_matrix <- cor(train[c("OverallQual", "GrLivArea", "SalePrice")])
corr_matrix
##             OverallQual GrLivArea SalePrice
## OverallQual   1.0000000 0.5930074 0.7909816
## GrLivArea     0.5930074 1.0000000 0.7086245
## SalePrice     0.7909816 0.7086245 1.0000000

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval

cor.test(train$OverallQual, train$SalePrice, conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  train$OverallQual and train$SalePrice
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.7780752 0.8032204
## sample estimates:
##       cor 
## 0.7909816
cor.test(train$OverallQual, train$GrLivArea, conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  train$OverallQual and train$GrLivArea
## t = 28.121, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5708061 0.6143422
## sample estimates:
##       cor 
## 0.5930074
cor.test(train$GrLivArea, train$SalePrice, conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  train$GrLivArea and train$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6915087 0.7249450
## sample estimates:
##       cor 
## 0.7086245

The three tests show a p-value that is much lower than the significant p-valuce of 0.05, meaning the null hypothesis can be rejected and the true correlation is not 0 for the 3 chosen variables.

2b. Linear Algebra and Correlation

Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix

#correlation matrix
corr_matrix
##             OverallQual GrLivArea SalePrice
## OverallQual   1.0000000 0.5930074 0.7909816
## GrLivArea     0.5930074 1.0000000 0.7086245
## SalePrice     0.7909816 0.7086245 1.0000000
#invert the matrix
inverted_corr_matrix = solve(corr_matrix)
inverted_corr_matrix
##             OverallQual  GrLivArea SalePrice
## OverallQual   2.6865350 -0.1753704 -2.000728
## GrLivArea    -0.1753704  2.0200794 -1.292763
## SalePrice    -2.0007280 -1.2927630  3.498623
#multiply correlation matrix by precision matrix
corr_matrix %*% inverted_corr_matrix
##             OverallQual GrLivArea SalePrice
## OverallQual           1         0         0
## GrLivArea             0         1         0
## SalePrice             0         0         1
#multiply the inverted matrix by the original correlation matrix
inverted_corr_matrix %*% corr_matrix
##              OverallQual     GrLivArea     SalePrice
## OverallQual 1.000000e+00 -4.440892e-16 -8.881784e-16
## GrLivArea   0.000000e+00  1.000000e+00  0.000000e+00
## SalePrice   4.440892e-16  0.000000e+00  1.000000e+00

Conduct LU decomposition on the matrix

library(Matrix)
## Warning: package 'Matrix' was built under R version 3.6.3
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
LU <- lu(corr_matrix)

lower <- expand(LU)$L
lower
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
##      [,1]      [,2]      [,3]     
## [1,] 1.0000000         .         .
## [2,] 0.5930074 1.0000000         .
## [3,] 0.7909816 0.3695063 1.0000000
upper <- expand(LU)$U
upper
## 3 x 3 Matrix of class "dtrMatrix"
##      [,1]      [,2]      [,3]     
## [1,] 1.0000000 0.5930074 0.7909816
## [2,]         . 0.6483422 0.2395665
## [3,]         .         . 0.2858268
lower %*% upper
## 3 x 3 Matrix of class "dgeMatrix"
##           [,1]      [,2]      [,3]
## [1,] 1.0000000 0.5930074 0.7909816
## [2,] 0.5930074 1.0000000 0.7086245
## [3,] 0.7909816 0.7086245 1.0000000

2c. Calculus-Based Probability & Statistics

Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of lambda for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, lambda). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

Based on the plot of histograms, there are number of variables that are skewed to the right. For the purposes of this problem, I am choosing Unfinished square feet of basement area (BsmtUnfSF.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
plot_histogram(train$BsmtUnfSF)

summary(train$BsmtUnfSF)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   223.0   477.5   567.2   808.0  2336.0

The variable needs to be shifted since the minimum is zero.

shifted_var <- train$BsmtUnfSF + 0.5
summary(shifted_var)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.5   223.5   478.0   567.7   808.5  2336.5

Then load the MASS package and run fitdistr to fit an exponential probability density function

fitted_var <- fitdistr(shifted_var, "exponential")


#Calculate lambda
lambda <- fitted_var$estimate
lambda
##        rate 
## 0.001761368

Find the optimal value of lambda for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, lambda))

exponential <- rexp(1000, lambda)


# Plot a histogram and compare it with a histogram of your original variable
par(mfrow = c(1,2))
plot_histogram(exponential)

plot_histogram(train$TotalBsmtSF)

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data

#Find the 5th and 95th percentile
qexp(c(0.05,0.95), rate = lambda)
## [1]   29.12128 1700.79827
#Generate a 95% confidence interval
CI(train$BsmtUnfSF, 0.95)
##    upper     mean    lower 
## 589.9246 567.2404 544.5562
#Find empirical 5th and 95th percentile
quantile(train$BsmtUnfSF, c(0.05, 0.95))
##   5%  95% 
##    0 1468

The exponential model is more skewed than the empirical.

2d. Model

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

Let’s do some mathematical investigating and see the correlation/p-values between all variables and Sale Price. But, first, we need to do some data cleaning and remove non-numeric variables for the analysis.

numeric_train_data <- select_if(train, is.numeric)

#Make sure all the variables are data type integers
glimpse(numeric_train_data)
## Observations: 1,460
## Variables: 38
## $ Id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ MSSubClass    <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60, 20, ...
## $ LotFrontage   <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 9...
## $ LotArea       <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, ...
## $ OverallQual   <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, 6, 4,...
## $ OverallCond   <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, 7, 5,...
## $ YearBuilt     <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931,...
## $ YearRemodAdd  <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, 1950,...
## $ MasVnrArea    <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, 0, 30...
## $ BsmtFinSF1    <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851, 906,...
## $ BsmtFinSF2    <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ BsmtUnfSF     <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140, 134,...
## $ TotalBsmtSF   <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952, 991,...
## $ X1stFlrSF     <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022, 107...
## $ X2ndFlrSF     <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, 1142,...
## $ LowQualFinSF  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ GrLivArea     <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774,...
## $ BsmtFullBath  <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0,...
## $ BsmtHalfBath  <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ FullBath      <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, 1, 2,...
## $ HalfBath      <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ BedroomAbvGr  <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, 2, 2,...
## $ KitchenAbvGr  <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2,...
## $ TotRmsAbvGrd  <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5, 5, 6...
## $ Fireplaces    <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, 1, 0,...
## $ GarageYrBlt   <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, 1931,...
## $ GarageCars    <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2,...
## $ GarageArea    <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205, 384...
## $ WoodDeckSF    <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, 140, ...
## $ OpenPorchSF   <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, 33, 2...
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, 176, ...
## $ X3SsnPorch    <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ ScreenPorch   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0, 0, ...
## $ PoolArea      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MiscVal       <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0, 0, 7...
## $ MoSold        <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, 7, 3,...
## $ YrSold        <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, 2008,...
## $ SalePrice     <int> 208500, 181500, 223500, 140000, 250000, 143000, 30700...
#Check for NA values
apply(numeric_train_data, 2, function(x) any(is.na(x)))
##            Id    MSSubClass   LotFrontage       LotArea   OverallQual 
##         FALSE         FALSE          TRUE         FALSE         FALSE 
##   OverallCond     YearBuilt  YearRemodAdd    MasVnrArea    BsmtFinSF1 
##         FALSE         FALSE         FALSE          TRUE         FALSE 
##    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF     X1stFlrSF     X2ndFlrSF 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##  LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##      HalfBath  BedroomAbvGr  KitchenAbvGr  TotRmsAbvGrd    Fireplaces 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##   GarageYrBlt    GarageCars    GarageArea    WoodDeckSF   OpenPorchSF 
##          TRUE         FALSE         FALSE         FALSE         FALSE 
## EnclosedPorch    X3SsnPorch   ScreenPorch      PoolArea       MiscVal 
##         FALSE         FALSE         FALSE         FALSE         FALSE 
##        MoSold        YrSold     SalePrice 
##         FALSE         FALSE         FALSE
#We see that there are missing values
#The columns are LotFrontage, GarageYrBlt, MasVnrArea

res <- cor(numeric_train_data[-1], numeric_train_data$SalePrice)
res
##                      [,1]
## MSSubClass    -0.08428414
## LotFrontage            NA
## LotArea        0.26384335
## OverallQual    0.79098160
## OverallCond   -0.07785589
## YearBuilt      0.52289733
## YearRemodAdd   0.50710097
## MasVnrArea             NA
## BsmtFinSF1     0.38641981
## BsmtFinSF2    -0.01137812
## BsmtUnfSF      0.21447911
## TotalBsmtSF    0.61358055
## X1stFlrSF      0.60585218
## X2ndFlrSF      0.31933380
## LowQualFinSF  -0.02560613
## GrLivArea      0.70862448
## BsmtFullBath   0.22712223
## BsmtHalfBath  -0.01684415
## FullBath       0.56066376
## HalfBath       0.28410768
## BedroomAbvGr   0.16821315
## KitchenAbvGr  -0.13590737
## TotRmsAbvGrd   0.53372316
## Fireplaces     0.46692884
## GarageYrBlt            NA
## GarageCars     0.64040920
## GarageArea     0.62343144
## WoodDeckSF     0.32441344
## OpenPorchSF    0.31585623
## EnclosedPorch -0.12857796
## X3SsnPorch     0.04458367
## ScreenPorch    0.11144657
## PoolArea       0.09240355
## MiscVal       -0.02118958
## MoSold         0.04643225
## YrSold        -0.02892259
## SalePrice      1.00000000
#The NA values in those forces cor to return NA when doing the correlation test
#I could drop the NA values, replace values with mean or median
#reference: https://www.guru99.com/r-replace-missing-values.html
#I'm going to drop since the data has 1460 observations

#numeric_train_drop <- numeric_train_data %>% na.omit()
#dim(numeric_train_drop)

numeric_train_data[is.na(numeric_train_data)] <- 0

numeric_train_drop <- numeric_train_data
#The new dataset is 1121 observations of 38 variables

Let’s look at the correlation values with the new dataset with no NA values.

res_new <- cor(numeric_train_drop[-1], numeric_train_drop$SalePrice)
res_new
##                      [,1]
## MSSubClass    -0.08428414
## LotFrontage    0.20962394
## LotArea        0.26384335
## OverallQual    0.79098160
## OverallCond   -0.07785589
## YearBuilt      0.52289733
## YearRemodAdd   0.50710097
## MasVnrArea     0.47261450
## BsmtFinSF1     0.38641981
## BsmtFinSF2    -0.01137812
## BsmtUnfSF      0.21447911
## TotalBsmtSF    0.61358055
## X1stFlrSF      0.60585218
## X2ndFlrSF      0.31933380
## LowQualFinSF  -0.02560613
## GrLivArea      0.70862448
## BsmtFullBath   0.22712223
## BsmtHalfBath  -0.01684415
## FullBath       0.56066376
## HalfBath       0.28410768
## BedroomAbvGr   0.16821315
## KitchenAbvGr  -0.13590737
## TotRmsAbvGrd   0.53372316
## Fireplaces     0.46692884
## GarageYrBlt    0.26136644
## GarageCars     0.64040920
## GarageArea     0.62343144
## WoodDeckSF     0.32441344
## OpenPorchSF    0.31585623
## EnclosedPorch -0.12857796
## X3SsnPorch     0.04458367
## ScreenPorch    0.11144657
## PoolArea       0.09240355
## MiscVal       -0.02118958
## MoSold         0.04643225
## YrSold        -0.02892259
## SalePrice      1.00000000

Now, we will create a linear model with multiple variables with relatively high correlation.

high_cor_var <- numeric_train_drop %>%dplyr:: select(OverallQual, TotalBsmtSF, X1stFlrSF, GrLivArea, GarageCars, GarageArea, FullBath, SalePrice)

high_cor_lm <- lm(SalePrice ~., data=high_cor_var)
summary(high_cor_lm)
## 
## Call:
## lm(formula = SalePrice ~ ., data = high_cor_var)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -471605  -19887   -1264   16874  288458 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.029e+05  4.923e+03 -20.901  < 2e-16 ***
## OverallQual  2.386e+04  1.108e+03  21.546  < 2e-16 ***
## TotalBsmtSF  2.453e+01  4.325e+00   5.671 1.71e-08 ***
## X1stFlrSF    1.112e+01  5.035e+00   2.208   0.0274 *  
## GrLivArea    4.243e+01  2.940e+00  14.432  < 2e-16 ***
## GarageCars   1.421e+04  3.067e+03   4.632 3.94e-06 ***
## GarageArea   1.630e+01  1.054e+01   1.547   0.1220    
## FullBath     1.457e+03  2.529e+03   0.576   0.5646    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38850 on 1452 degrees of freedom
## Multiple R-squared:  0.762,  Adjusted R-squared:  0.7608 
## F-statistic:   664 on 7 and 1452 DF,  p-value: < 2.2e-16

StepWise Regression Model ref: http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/154-stepwise-regression-essentials-in-r/

#set up cross-validation (resampling procedure), K refers to the number of groups that a given data sample is split into
train.control <- trainControl(method ='cv', number=10)

step.model <- train(SalePrice ~., data = numeric_train_drop,
                    method = "leapSeq",
                    tuneGrid = data.frame(nvmax = 5:15),
                    trControl = train.control
                    )
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 2
## linear dependencies found
## Reordering variables and trying again:
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 2
## linear dependencies found
## Reordering variables and trying again:
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 2
## linear dependencies found
## Reordering variables and trying again:
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 2
## linear dependencies found
## Reordering variables and trying again:
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 2
## linear dependencies found
## Reordering variables and trying again:
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 2
## linear dependencies found
## Reordering variables and trying again:
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 2
## linear dependencies found
## Reordering variables and trying again:
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 2
## linear dependencies found
## Reordering variables and trying again:
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 2
## linear dependencies found
## Reordering variables and trying again:
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 2
## linear dependencies found
## Reordering variables and trying again:
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 2
## linear dependencies found
## Reordering variables and trying again:
#RMSE and MAE are different metrics that measure prediction error. You want low RMSE and MAE values
#High Rsquared results the better
step.model$results
##    nvmax     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      5 45216.17 0.6835491 31011.85 7999.331 0.08567085 2691.569
## 2      6 45329.53 0.6824841 30814.14 7681.631 0.08023287 2428.390
## 3      7 43062.11 0.7131724 28570.66 9221.972 0.09550397 3904.753
## 4      8 42532.38 0.7213604 28094.29 7708.265 0.07766905 2976.683
## 5      9 41834.80 0.7311900 27431.57 7061.471 0.07809127 2530.220
## 6     10 39718.36 0.7550711 26033.30 8023.905 0.08703431 3405.627
## 7     11 39089.63 0.7650416 25196.05 8151.882 0.08353799 2535.014
## 8     12 38763.44 0.7691174 24942.23 8025.952 0.07979375 2596.117
## 9     13 38587.10 0.7718841 24693.20 8081.352 0.07991216 2410.730
## 10    14 38668.21 0.7712736 24743.74 7591.561 0.07797126 2155.504
## 11    15 38550.59 0.7743463 24229.60 9614.871 0.09893680 2607.831
#Best Tune shows which model is best, and the model with 15 variables is the best.
step.model$bestTune
##    nvmax
## 11    15
summary(step.model$finalModel)
## Subset selection object
## 37 Variables  (and intercept)
##               Forced in Forced out
## Id                FALSE      FALSE
## MSSubClass        FALSE      FALSE
## LotFrontage       FALSE      FALSE
## LotArea           FALSE      FALSE
## OverallQual       FALSE      FALSE
## OverallCond       FALSE      FALSE
## YearBuilt         FALSE      FALSE
## YearRemodAdd      FALSE      FALSE
## MasVnrArea        FALSE      FALSE
## BsmtFinSF1        FALSE      FALSE
## BsmtFinSF2        FALSE      FALSE
## BsmtUnfSF         FALSE      FALSE
## X1stFlrSF         FALSE      FALSE
## X2ndFlrSF         FALSE      FALSE
## LowQualFinSF      FALSE      FALSE
## BsmtFullBath      FALSE      FALSE
## BsmtHalfBath      FALSE      FALSE
## FullBath          FALSE      FALSE
## HalfBath          FALSE      FALSE
## BedroomAbvGr      FALSE      FALSE
## KitchenAbvGr      FALSE      FALSE
## TotRmsAbvGrd      FALSE      FALSE
## Fireplaces        FALSE      FALSE
## GarageYrBlt       FALSE      FALSE
## GarageCars        FALSE      FALSE
## GarageArea        FALSE      FALSE
## WoodDeckSF        FALSE      FALSE
## OpenPorchSF       FALSE      FALSE
## EnclosedPorch     FALSE      FALSE
## X3SsnPorch        FALSE      FALSE
## ScreenPorch       FALSE      FALSE
## PoolArea          FALSE      FALSE
## MiscVal           FALSE      FALSE
## MoSold            FALSE      FALSE
## YrSold            FALSE      FALSE
## TotalBsmtSF       FALSE      FALSE
## GrLivArea         FALSE      FALSE
## 1 subsets of each size up to 16
## Selection Algorithm: 'sequential replacement'
##           Id  MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt
## 1  ( 1 )  " " " "        " "         " "     "*"         " "         " "      
## 2  ( 1 )  " " " "        " "         " "     "*"         " "         " "      
## 3  ( 1 )  " " " "        " "         " "     "*"         " "         " "      
## 4  ( 1 )  " " " "        " "         " "     "*"         " "         " "      
## 5  ( 1 )  " " "*"        " "         " "     "*"         " "         " "      
## 6  ( 1 )  " " "*"        " "         " "     "*"         " "         "*"      
## 7  ( 1 )  " " "*"        " "         " "     "*"         " "         "*"      
## 8  ( 1 )  " " "*"        " "         " "     "*"         "*"         "*"      
## 9  ( 1 )  " " "*"        " "         " "     "*"         "*"         "*"      
## 10  ( 1 ) " " "*"        " "         "*"     "*"         "*"         "*"      
## 11  ( 1 ) " " "*"        " "         "*"     "*"         "*"         "*"      
## 12  ( 1 ) " " "*"        " "         "*"     "*"         "*"         "*"      
## 13  ( 1 ) " " "*"        " "         "*"     "*"         "*"         "*"      
## 14  ( 1 ) " " "*"        " "         "*"     "*"         "*"         "*"      
## 15  ( 1 ) "*" "*"        "*"         "*"     "*"         "*"         "*"      
## 16  ( 1 ) " " "*"        " "         "*"     "*"         "*"         "*"      
##           YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1  ( 1 )  " "          " "        " "        " "        " "       " "        
## 2  ( 1 )  " "          " "        " "        " "        " "       " "        
## 3  ( 1 )  " "          " "        "*"        " "        " "       " "        
## 4  ( 1 )  " "          " "        "*"        " "        " "       " "        
## 5  ( 1 )  " "          " "        "*"        " "        " "       " "        
## 6  ( 1 )  " "          " "        "*"        " "        " "       " "        
## 7  ( 1 )  " "          " "        "*"        " "        " "       " "        
## 8  ( 1 )  " "          " "        "*"        " "        " "       " "        
## 9  ( 1 )  " "          " "        "*"        " "        " "       " "        
## 10  ( 1 ) " "          " "        "*"        " "        " "       " "        
## 11  ( 1 ) " "          "*"        "*"        " "        " "       " "        
## 12  ( 1 ) " "          "*"        "*"        " "        " "       " "        
## 13  ( 1 ) " "          "*"        "*"        " "        " "       " "        
## 14  ( 1 ) " "          "*"        "*"        " "        " "       " "        
## 15  ( 1 ) "*"          "*"        "*"        "*"        "*"       " "        
## 16  ( 1 ) " "          "*"        "*"        " "        " "       " "        
##           X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
## 1  ( 1 )  " "       " "       " "          " "       " "          " "         
## 2  ( 1 )  " "       " "       " "          "*"       " "          " "         
## 3  ( 1 )  " "       " "       " "          "*"       " "          " "         
## 4  ( 1 )  " "       " "       " "          "*"       " "          " "         
## 5  ( 1 )  " "       " "       " "          "*"       " "          " "         
## 6  ( 1 )  " "       " "       " "          "*"       " "          " "         
## 7  ( 1 )  " "       " "       " "          "*"       " "          " "         
## 8  ( 1 )  " "       " "       " "          "*"       " "          " "         
## 9  ( 1 )  " "       " "       " "          "*"       " "          " "         
## 10  ( 1 ) " "       " "       " "          "*"       " "          " "         
## 11  ( 1 ) " "       " "       " "          "*"       " "          " "         
## 12  ( 1 ) " "       " "       " "          "*"       "*"          " "         
## 13  ( 1 ) " "       " "       " "          "*"       "*"          " "         
## 14  ( 1 ) " "       " "       " "          "*"       "*"          " "         
## 15  ( 1 ) "*"       "*"       "*"          " "       " "          " "         
## 16  ( 1 ) " "       " "       " "          "*"       "*"          " "         
##           FullBath HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces
## 1  ( 1 )  " "      " "      " "          " "          " "          " "       
## 2  ( 1 )  " "      " "      " "          " "          " "          " "       
## 3  ( 1 )  " "      " "      " "          " "          " "          " "       
## 4  ( 1 )  " "      " "      " "          " "          " "          " "       
## 5  ( 1 )  " "      " "      " "          " "          " "          " "       
## 6  ( 1 )  " "      " "      " "          " "          " "          " "       
## 7  ( 1 )  " "      " "      "*"          " "          " "          " "       
## 8  ( 1 )  " "      " "      "*"          " "          " "          " "       
## 9  ( 1 )  " "      " "      "*"          " "          " "          " "       
## 10  ( 1 ) " "      " "      "*"          " "          " "          " "       
## 11  ( 1 ) " "      " "      "*"          " "          " "          " "       
## 12  ( 1 ) " "      " "      "*"          " "          " "          " "       
## 13  ( 1 ) " "      " "      "*"          " "          "*"          " "       
## 14  ( 1 ) " "      " "      "*"          " "          " "          " "       
## 15  ( 1 ) " "      " "      " "          " "          " "          " "       
## 16  ( 1 ) " "      " "      "*"          "*"          "*"          " "       
##           GarageYrBlt GarageCars GarageArea WoodDeckSF OpenPorchSF
## 1  ( 1 )  " "         " "        " "        " "        " "        
## 2  ( 1 )  " "         " "        " "        " "        " "        
## 3  ( 1 )  " "         " "        " "        " "        " "        
## 4  ( 1 )  " "         "*"        " "        " "        " "        
## 5  ( 1 )  " "         "*"        " "        " "        " "        
## 6  ( 1 )  " "         "*"        " "        " "        " "        
## 7  ( 1 )  " "         "*"        " "        " "        " "        
## 8  ( 1 )  " "         "*"        " "        " "        " "        
## 9  ( 1 )  "*"         "*"        " "        " "        " "        
## 10  ( 1 ) "*"         "*"        " "        " "        " "        
## 11  ( 1 ) "*"         "*"        " "        " "        " "        
## 12  ( 1 ) "*"         "*"        " "        " "        " "        
## 13  ( 1 ) "*"         "*"        " "        " "        " "        
## 14  ( 1 ) "*"         "*"        " "        "*"        " "        
## 15  ( 1 ) " "         " "        " "        " "        " "        
## 16  ( 1 ) "*"         "*"        " "        "*"        " "        
##           EnclosedPorch X3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold
## 1  ( 1 )  " "           " "        " "         " "      " "     " "    " "   
## 2  ( 1 )  " "           " "        " "         " "      " "     " "    " "   
## 3  ( 1 )  " "           " "        " "         " "      " "     " "    " "   
## 4  ( 1 )  " "           " "        " "         " "      " "     " "    " "   
## 5  ( 1 )  " "           " "        " "         " "      " "     " "    " "   
## 6  ( 1 )  " "           " "        " "         " "      " "     " "    " "   
## 7  ( 1 )  " "           " "        " "         " "      " "     " "    " "   
## 8  ( 1 )  " "           " "        " "         " "      " "     " "    " "   
## 9  ( 1 )  " "           " "        " "         " "      " "     " "    " "   
## 10  ( 1 ) " "           " "        " "         " "      " "     " "    " "   
## 11  ( 1 ) " "           " "        " "         " "      " "     " "    " "   
## 12  ( 1 ) " "           " "        " "         " "      " "     " "    " "   
## 13  ( 1 ) " "           " "        " "         " "      " "     " "    " "   
## 14  ( 1 ) " "           " "        "*"         " "      " "     " "    " "   
## 15  ( 1 ) " "           " "        " "         " "      " "     " "    " "   
## 16  ( 1 ) " "           " "        "*"         " "      " "     " "    " "
coef(step.model$finalModel, 15)
##   (Intercept)            Id    MSSubClass   LotFrontage       LotArea 
## -1.186637e+06  1.093822e+00 -2.006921e+02  2.693398e+01  7.321731e-01 
##   OverallQual   OverallCond     YearBuilt  YearRemodAdd    MasVnrArea 
##  2.438843e+04  2.615994e+03  2.579998e+02  3.141734e+02  4.562553e+01 
##    BsmtFinSF1    BsmtFinSF2     BsmtUnfSF     X2ndFlrSF  LowQualFinSF 
##  5.501056e+01  4.555287e+01  4.062282e+01  4.525051e+01  3.276732e+01 
##  BsmtFullBath 
##  7.625289e+03

After testing manual backward regression, automated stepwise regession with the caret package, and using the MASS package’s stepwise regression. It seems like the MASS package stepwise regression offers the best model.

Let’s test it.

# Train the model
step2.model <- train(SalePrice ~., data = numeric_train_drop,
                    method = "lmStepAIC", 
                    trControl = train.control,
                    trace = FALSE
                    )
# Model accuracy
step2.model$results
##   parameter     RMSE  Rsquared      MAE   RMSESD RsquaredSD    MAESD
## 1      none 36025.28 0.8002378 21778.04 12432.68  0.1140011 2212.563
# Final model coefficients
step2.model$finalModel
## 
## Call:
## lm(formula = .outcome ~ MSSubClass + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtUnfSF + X1stFlrSF + X2ndFlrSF + BsmtFullBath + FullBath + 
##     BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + 
##     GarageYrBlt + GarageCars + WoodDeckSF + ScreenPorch, data = dat)
## 
## Coefficients:
##  (Intercept)    MSSubClass       LotArea   OverallQual   OverallCond  
##   -8.801e+05    -1.684e+02     4.079e-01     1.760e+04     4.999e+03  
##    YearBuilt  YearRemodAdd    MasVnrArea    BsmtFinSF1     BsmtUnfSF  
##    3.064e+02     1.193e+02     2.856e+01     1.559e+01     5.869e+00  
##    X1stFlrSF     X2ndFlrSF  BsmtFullBath      FullBath  BedroomAbvGr  
##    4.893e+01     4.550e+01     8.974e+03     3.845e+03    -1.019e+04  
## KitchenAbvGr  TotRmsAbvGrd    Fireplaces   GarageYrBlt    GarageCars  
##   -1.707e+04     5.104e+03     3.706e+03    -1.452e+01     1.690e+04  
##   WoodDeckSF   ScreenPorch  
##    2.532e+01     5.572e+01
# Summary of the model
summary(step2.model$finalModel)
## 
## Call:
## lm(formula = .outcome ~ MSSubClass + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtUnfSF + X1stFlrSF + X2ndFlrSF + BsmtFullBath + FullBath + 
##     BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + 
##     GarageYrBlt + GarageCars + WoodDeckSF + ScreenPorch, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -478532  -16123   -2234   14242  284394 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -8.801e+05  1.233e+05  -7.136 1.52e-12 ***
## MSSubClass   -1.684e+02  2.562e+01  -6.575 6.80e-11 ***
## LotArea       4.078e-01  9.891e-02   4.123 3.95e-05 ***
## OverallQual   1.760e+04  1.159e+03  15.189  < 2e-16 ***
## OverallCond   4.999e+03  1.004e+03   4.978 7.20e-07 ***
## YearBuilt     3.064e+02  5.292e+01   5.790 8.64e-09 ***
## YearRemodAdd  1.193e+02  6.541e+01   1.825 0.068259 .  
## MasVnrArea    2.856e+01  5.842e+00   4.888 1.13e-06 ***
## BsmtFinSF1    1.559e+01  3.872e+00   4.026 5.98e-05 ***
## BsmtUnfSF     5.869e+00  3.599e+00   1.631 0.103189    
## X1stFlrSF     4.893e+01  5.187e+00   9.433  < 2e-16 ***
## X2ndFlrSF     4.550e+01  4.177e+00  10.893  < 2e-16 ***
## BsmtFullBath  8.974e+03  2.403e+03   3.735 0.000195 ***
## FullBath      3.845e+03  2.568e+03   1.497 0.134512    
## BedroomAbvGr -1.019e+04  1.659e+03  -6.143 1.04e-09 ***
## KitchenAbvGr -1.707e+04  5.083e+03  -3.358 0.000807 ***
## TotRmsAbvGrd  5.104e+03  1.198e+03   4.260 2.18e-05 ***
## Fireplaces    3.706e+03  1.724e+03   2.150 0.031719 *  
## GarageYrBlt  -1.452e+01  2.628e+00  -5.528 3.85e-08 ***
## GarageCars    1.690e+04  2.048e+03   8.253 3.44e-16 ***
## WoodDeckSF    2.532e+01  7.793e+00   3.249 0.001183 ** 
## ScreenPorch   5.572e+01  1.670e+01   3.337 0.000870 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34380 on 1438 degrees of freedom
## Multiple R-squared:  0.8154, Adjusted R-squared:  0.8127 
## F-statistic: 302.5 on 21 and 1438 DF,  p-value: < 2.2e-16

I used step2.model$modelinfo to get the linear model equation.

MASS_sr <- lm(formula = SalePrice ~ MSSubClass + LotFrontage + LotArea + 
    OverallQual + OverallCond + YearBuilt + MasVnrArea + BsmtFinSF1 + 
    X1stFlrSF + X2ndFlrSF + BsmtFullBath + FullBath + BedroomAbvGr + 
    KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageCars + WoodDeckSF + 
    ScreenPorch + PoolArea, data = numeric_train_drop)

summary(MASS_sr)
## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + LotArea + 
##     OverallQual + OverallCond + YearBuilt + MasVnrArea + BsmtFinSF1 + 
##     X1stFlrSF + X2ndFlrSF + BsmtFullBath + FullBath + BedroomAbvGr + 
##     KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageCars + WoodDeckSF + 
##     ScreenPorch + PoolArea, data = numeric_train_drop)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -470219  -16254   -2088   13143  298726 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.633e+05  9.472e+04  -8.058 1.61e-15 ***
## MSSubClass   -1.630e+02  2.638e+01  -6.179 8.39e-10 ***
## LotFrontage   2.931e+01  2.853e+01   1.028 0.304313    
## LotArea       4.057e-01  1.002e-01   4.048 5.43e-05 ***
## OverallQual   1.851e+04  1.134e+03  16.333  < 2e-16 ***
## OverallCond   5.144e+03  9.226e+02   5.575 2.95e-08 ***
## YearBuilt     3.527e+02  4.807e+01   7.337 3.64e-13 ***
## MasVnrArea    2.990e+01  5.873e+00   5.092 4.02e-07 ***
## BsmtFinSF1    1.063e+01  2.976e+00   3.573 0.000365 ***
## X1stFlrSF     5.449e+01  4.822e+00  11.299  < 2e-16 ***
## X2ndFlrSF     4.605e+01  4.255e+00  10.823  < 2e-16 ***
## BsmtFullBath  9.508e+03  2.391e+03   3.976 7.35e-05 ***
## FullBath      5.818e+03  2.569e+03   2.265 0.023666 *  
## BedroomAbvGr -1.086e+04  1.656e+03  -6.559 7.54e-11 ***
## KitchenAbvGr -1.562e+04  5.127e+03  -3.046 0.002359 ** 
## TotRmsAbvGrd  5.364e+03  1.215e+03   4.415 1.09e-05 ***
## Fireplaces    2.685e+03  1.737e+03   1.546 0.122215    
## GarageCars    1.025e+04  1.699e+03   6.032 2.06e-09 ***
## WoodDeckSF    2.606e+01  7.919e+00   3.291 0.001022 ** 
## ScreenPorch   5.244e+01  1.691e+01   3.101 0.001964 ** 
## PoolArea     -3.131e+01  2.354e+01  -1.330 0.183733    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 34830 on 1439 degrees of freedom
## Multiple R-squared:  0.8104, Adjusted R-squared:  0.8077 
## F-statistic: 307.5 on 20 and 1439 DF,  p-value: < 2.2e-16

The residuals look normally distributed, looks okay to proceed to test on test data.

hist(MASS_sr$residuals, xlab = "Residuals", ylab = "", breaks = 200)

qqnorm(MASS_sr$residuals)
qqline(MASS_sr$residuals)

Test The Model

test[is.na(test)] <- 0
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated

## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated

## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated

## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated

## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated

## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated

## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated

## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated

## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated

## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated

## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated

## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated

## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated

## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated

## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated

## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated

## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated

## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated

## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated

## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated

## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated

## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
results <- predict(MASS_sr, test)

prediction_frame <- data.frame(cbind(Id = test$Id, SalePrice = results))

write.csv(prediction_frame, file= 'GAbreu_kaggle.csv', row.names = FALSE)


#Kaggle Score 

knitr::include_graphics('Kaggle Score.jpg')