Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of mu =sigma=(N+1)/2
#set.seed function ensures results are reproducible
set.seed(100)
#create variables
N <- 8
mu <- (N + 1)/2
devi <- mu
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
X <- runif(10000, min=1, max=6)
Y <- rnorm(10000, mean= mu, sd=devi)
#Graph histograms to show distribution
plot_histogram(X)
plot_histogram(Y)
#create little x and y
# Assume the small letter "x" is estimated as the median of the X variable
x <- median(X)
# small letter "y" is estimated as the 1st quartile of the Y variable
y <- quantile(Y, 0.25)
1a.P(X>x | X>y) Reference: https://stats.stackexchange.com/questions/391756/using-conditional-probability-to-calculate-sentiment-score-probability
#Means conditional probability -> P(X>x and X>y)/P(X>y) --> P(AB)/P(B)
#Find the determine cases where B (...or the second case in our situation) is true and given B where A is also true.
sum(X>x & X>y) / sum(X>y)
## [1] 0.5444251
1b.P(X>x, Y>y)
#Both are independent events
#Formula used P( A and B) = P(A)P(B)//Finding the intersection of A and B, both occur simulateneously
#Length of both X and Y are 10000
sum(X>x & Y>y)/length(X)
## [1] 0.3755
1c.P(X<x | X>y)
sum(X<x & X>y)/sum(X>y)
## [1] 0.4555749
#Start the creation of the table
X_grt_x_Y_grt_y <- sum(X>x & Y>y)
X_lsr_x_Y_lsr_y <- sum(X<x & Y<y)
X_grt_x_Y_lsr_y <- sum(X>x & Y<y)
X_lsr_x_Y_grt_y <- sum(X<x & Y>y)
table <- matrix(c(X_grt_x_Y_grt_y, X_grt_x_Y_lsr_y, X_lsr_x_Y_grt_y, X_lsr_x_Y_lsr_y), nrow=2, ncol=2)
colnames(table) <- c('X > x', ' X < x')
rownames(table) <- c('Y > y', 'Y < y')
print(table)
## X > x X < x
## Y > y 3755 3745
## Y < y 1245 1255
#P(X>x and Y>y)
a <- 3755 / 10000
#P(X>x)P(Y>y)
b <- sum(X>x & Y>y)/10000
Are_Equal <- a == b
Are_Equal
## [1] TRUE
P(X>x and Y >y) does equal the probability of P(X>x)P(Y>y)
#Run Fisher test
fisher.test(table)
##
## Fisher's Exact Test for Count Data
##
## data: table
## p-value = 0.8354
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9222661 1.1076494
## sample estimates:
## odds ratio
## 1.010724
#Chi Square Test
chisq.test(table)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table
## X-squared = 0.0432, df = 1, p-value = 0.8353
The p-values for both tests are very large (significantly above the 0.05), therefore we fail to reject the null-hypothesis. These events are indeed indepdent.
The chi-squared test is an appropriate approximation when the sample is large, while the Fisher exact test runs a precise procedure for small-sized samples. (reference: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5426219/#:~:text=The%20chi%2Dsquared%20test%20applies,especially%20for%20small%2Dsized%20samples.) Since the case is large, I would say the chi-test is most appropriate.
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
Preliminary
#Load Data
#Train
train <- read.csv("train.csv", header = TRUE, sep=',')
#Test
test <- read.csv("test.csv", header = TRUE, sep=',')
2a.
Descriptive and Inferential Statistics
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
Descriptive Statistics
##View the first five rows of the dataset
head(train)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1 1 60 RL 65 8450 Pave <NA> Reg Lvl
## 2 2 20 RL 80 9600 Pave <NA> Reg Lvl
## 3 3 60 RL 68 11250 Pave <NA> IR1 Lvl
## 4 4 70 RL 60 9550 Pave <NA> IR1 Lvl
## 5 5 60 RL 84 14260 Pave <NA> IR1 Lvl
## 6 6 50 RL 85 14115 Pave <NA> IR1 Lvl
## Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 2 AllPub FR2 Gtl Veenker Feedr Norm 1Fam
## 3 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 4 AllPub Corner Gtl Crawfor Norm Norm 1Fam
## 5 AllPub FR2 Gtl NoRidge Norm Norm 1Fam
## 6 AllPub Inside Gtl Mitchel Norm Norm 1Fam
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1 2Story 7 5 2003 2003 Gable CompShg
## 2 1Story 6 8 1976 1976 Gable CompShg
## 3 2Story 7 5 2001 2002 Gable CompShg
## 4 2Story 7 5 1915 1970 Gable CompShg
## 5 2Story 8 5 2000 2000 Gable CompShg
## 6 1.5Fin 5 5 1993 1995 Gable CompShg
## Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1 VinylSd VinylSd BrkFace 196 Gd TA PConc
## 2 MetalSd MetalSd None 0 TA TA CBlock
## 3 VinylSd VinylSd BrkFace 162 Gd TA PConc
## 4 Wd Sdng Wd Shng None 0 TA TA BrkTil
## 5 VinylSd VinylSd BrkFace 350 Gd TA PConc
## 6 VinylSd VinylSd None 0 TA TA Wood
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1 Gd TA No GLQ 706 Unf
## 2 Gd TA Gd ALQ 978 Unf
## 3 Gd TA Mn GLQ 486 Unf
## 4 TA Gd No ALQ 216 Unf
## 5 Gd TA Av GLQ 655 Unf
## 6 Gd TA No GLQ 732 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 150 856 GasA Ex Y SBrkr
## 2 0 284 1262 GasA Ex Y SBrkr
## 3 0 434 920 GasA Ex Y SBrkr
## 4 0 540 756 GasA Gd Y SBrkr
## 5 0 490 1145 GasA Ex Y SBrkr
## 6 0 64 796 GasA Ex Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1 856 854 0 1710 1 0 2
## 2 1262 0 0 1262 0 1 2
## 3 920 866 0 1786 1 0 2
## 4 961 756 0 1717 1 0 1
## 5 1145 1053 0 2198 1 0 2
## 6 796 566 0 1362 1 0 1
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1 1 3 1 Gd 8 Typ
## 2 0 3 1 TA 6 Typ
## 3 1 3 1 Gd 6 Typ
## 4 0 3 1 Gd 7 Typ
## 5 1 4 1 Gd 9 Typ
## 6 1 1 1 TA 5 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1 0 <NA> Attchd 2003 RFn 2
## 2 1 TA Attchd 1976 RFn 2
## 3 1 TA Attchd 2001 RFn 2
## 4 1 Gd Detchd 1998 Unf 3
## 5 1 TA Attchd 2000 RFn 3
## 6 0 <NA> Attchd 1993 Unf 2
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1 548 TA TA Y 0 61
## 2 460 TA TA Y 298 0
## 3 608 TA TA Y 0 42
## 4 642 TA TA Y 0 35
## 5 836 TA TA Y 192 84
## 6 480 TA TA Y 40 30
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1 0 0 0 0 <NA> <NA> <NA>
## 2 0 0 0 0 <NA> <NA> <NA>
## 3 0 0 0 0 <NA> <NA> <NA>
## 4 272 0 0 0 <NA> <NA> <NA>
## 5 0 0 0 0 <NA> <NA> <NA>
## 6 0 320 0 0 <NA> MnPrv Shed
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 0 2 2008 WD Normal 208500
## 2 0 5 2007 WD Normal 181500
## 3 0 9 2008 WD Normal 223500
## 4 0 2 2006 WD Abnorml 140000
## 5 0 12 2008 WD Normal 250000
## 6 700 10 2009 WD Normal 143000
#Find out total rows and columns
dim(train)
## [1] 1460 81
#Lets view the data types of the variables
glimpse(train)
## Observations: 1,460
## Variables: 81
## $ Id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ MSSubClass <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60, 20, ...
## $ MSZoning <fct> RL, RL, RL, RL, RL, RL, RL, RL, RM, RL, RL, RL, RL, R...
## $ LotFrontage <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 9...
## $ LotArea <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, ...
## $ Street <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave,...
## $ Alley <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ LotShape <fct> Reg, Reg, IR1, IR1, IR1, IR1, Reg, IR1, Reg, Reg, Reg...
## $ LandContour <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl...
## $ Utilities <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, AllPu...
## $ LotConfig <fct> Inside, FR2, Inside, Corner, FR2, Inside, Inside, Cor...
## $ LandSlope <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl...
## $ Neighborhood <fct> CollgCr, Veenker, CollgCr, Crawfor, NoRidge, Mitchel,...
## $ Condition1 <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, PosN, Arte...
## $ Condition2 <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm,...
## $ BldgType <fct> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam,...
## $ HouseStyle <fct> 2Story, 1Story, 2Story, 2Story, 2Story, 1.5Fin, 1Stor...
## $ OverallQual <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, 6, 4,...
## $ OverallCond <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, 7, 5,...
## $ YearBuilt <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931,...
## $ YearRemodAdd <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, 1950,...
## $ RoofStyle <fct> Gable, Gable, Gable, Gable, Gable, Gable, Gable, Gabl...
## $ RoofMatl <fct> CompShg, CompShg, CompShg, CompShg, CompShg, CompShg,...
## $ Exterior1st <fct> VinylSd, MetalSd, VinylSd, Wd Sdng, VinylSd, VinylSd,...
## $ Exterior2nd <fct> VinylSd, MetalSd, VinylSd, Wd Shng, VinylSd, VinylSd,...
## $ MasVnrType <fct> BrkFace, None, BrkFace, None, BrkFace, None, Stone, S...
## $ MasVnrArea <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, 0, 30...
## $ ExterQual <fct> Gd, TA, Gd, TA, Gd, TA, Gd, TA, TA, TA, TA, Ex, TA, G...
## $ ExterCond <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, T...
## $ Foundation <fct> PConc, CBlock, PConc, BrkTil, PConc, Wood, PConc, CBl...
## $ BsmtQual <fct> Gd, Gd, Gd, TA, Gd, Gd, Ex, Gd, TA, TA, TA, Ex, TA, G...
## $ BsmtCond <fct> TA, TA, TA, Gd, TA, TA, TA, TA, TA, TA, TA, TA, TA, T...
## $ BsmtExposure <fct> No, Gd, Mn, No, Av, No, Av, Mn, No, No, No, No, No, A...
## $ BsmtFinType1 <fct> GLQ, ALQ, GLQ, ALQ, GLQ, GLQ, GLQ, ALQ, Unf, GLQ, Rec...
## $ BsmtFinSF1 <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851, 906,...
## $ BsmtFinType2 <fct> Unf, Unf, Unf, Unf, Unf, Unf, Unf, BLQ, Unf, Unf, Unf...
## $ BsmtFinSF2 <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ BsmtUnfSF <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140, 134,...
## $ TotalBsmtSF <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952, 991,...
## $ Heating <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA,...
## $ HeatingQC <fct> Ex, Ex, Ex, Gd, Ex, Ex, Ex, Ex, Gd, Ex, Ex, Ex, TA, E...
## $ CentralAir <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y,...
## $ Electrical <fct> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrk...
## $ X1stFlrSF <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022, 107...
## $ X2ndFlrSF <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, 1142,...
## $ LowQualFinSF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ GrLivArea <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774,...
## $ BsmtFullBath <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0,...
## $ BsmtHalfBath <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ FullBath <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, 1, 2,...
## $ HalfBath <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ BedroomAbvGr <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, 2, 2,...
## $ KitchenAbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2,...
## $ KitchenQual <fct> Gd, TA, Gd, Gd, Gd, TA, Gd, TA, TA, TA, TA, Ex, TA, G...
## $ TotRmsAbvGrd <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5, 5, 6...
## $ Functional <fct> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Min1, Typ, Ty...
## $ Fireplaces <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, 1, 0,...
## $ FireplaceQu <fct> NA, TA, TA, Gd, TA, NA, Gd, TA, TA, TA, NA, Gd, NA, G...
## $ GarageType <fct> Attchd, Attchd, Attchd, Detchd, Attchd, Attchd, Attch...
## $ GarageYrBlt <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, 1931,...
## $ GarageFinish <fct> RFn, RFn, RFn, Unf, RFn, Unf, RFn, RFn, Unf, RFn, Unf...
## $ GarageCars <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2,...
## $ GarageArea <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205, 384...
## $ GarageQual <fct> TA, TA, TA, TA, TA, TA, TA, TA, Fa, Gd, TA, TA, TA, T...
## $ GarageCond <fct> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, T...
## $ PavedDrive <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y,...
## $ WoodDeckSF <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, 140, ...
## $ OpenPorchSF <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, 33, 2...
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, 176, ...
## $ X3SsnPorch <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ ScreenPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0, 0, ...
## $ PoolArea <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ PoolQC <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Fence <fct> NA, NA, NA, NA, NA, MnPrv, NA, NA, NA, NA, NA, NA, NA...
## $ MiscFeature <fct> NA, NA, NA, NA, NA, Shed, NA, Shed, NA, NA, NA, NA, N...
## $ MiscVal <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0, 0, 7...
## $ MoSold <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, 7, 3,...
## $ YrSold <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, 2008,...
## $ SaleType <fct> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, New, WD, ...
## $ SaleCondition <fct> Normal, Normal, Normal, Abnorml, Normal, Normal, Norm...
## $ SalePrice <int> 208500, 181500, 223500, 140000, 250000, 143000, 30700...
#Overall statistics
summary(train)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 C (all): 10 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 FV : 65 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 RH : 16 Median : 69.00
## Mean : 730.5 Mean : 56.9 RL :1151 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 RM : 218 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape LandContour Utilities
## Min. : 1300 Grvl: 6 Grvl: 50 IR1:484 Bnk: 63 AllPub:1459
## 1st Qu.: 7554 Pave:1454 Pave: 41 IR2: 41 HLS: 50 NoSeWa: 1
## Median : 9478 NA's:1369 IR3: 10 Low: 36
## Mean : 10517 Reg:925 Lvl:1311
## 3rd Qu.: 11602
## Max. :215245
##
## LotConfig LandSlope Neighborhood Condition1 Condition2
## Corner : 263 Gtl:1382 NAmes :225 Norm :1260 Norm :1445
## CulDSac: 94 Mod: 65 CollgCr:150 Feedr : 81 Feedr : 6
## FR2 : 47 Sev: 13 OldTown:113 Artery : 48 Artery : 2
## FR3 : 4 Edwards:100 RRAn : 26 PosN : 2
## Inside :1052 Somerst: 86 PosN : 19 RRNn : 2
## Gilbert: 79 RRAe : 11 PosA : 1
## (Other):707 (Other): 15 (Other): 2
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1Fam :1220 1Story :726 Min. : 1.000 Min. :1.000 Min. :1872
## 2fmCon: 31 2Story :445 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Duplex: 52 1.5Fin :154 Median : 6.000 Median :5.000 Median :1973
## Twnhs : 43 SLvl : 65 Mean : 6.099 Mean :5.575 Mean :1971
## TwnhsE: 114 SFoyer : 37 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## 1.5Unf : 14 Max. :10.000 Max. :9.000 Max. :2010
## (Other): 19
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## Min. :1950 Flat : 13 CompShg:1434 VinylSd:515 VinylSd:504
## 1st Qu.:1967 Gable :1141 Tar&Grv: 11 HdBoard:222 MetalSd:214
## Median :1994 Gambrel: 11 WdShngl: 6 MetalSd:220 HdBoard:207
## Mean :1985 Hip : 286 WdShake: 5 Wd Sdng:206 Wd Sdng:197
## 3rd Qu.:2004 Mansard: 7 ClyTile: 1 Plywood:108 Plywood:142
## Max. :2010 Shed : 2 Membran: 1 CemntBd: 61 CmentBd: 60
## (Other): 2 (Other):128 (Other):136
## MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual
## BrkCmn : 15 Min. : 0.0 Ex: 52 Ex: 3 BrkTil:146 Ex :121
## BrkFace:445 1st Qu.: 0.0 Fa: 14 Fa: 28 CBlock:634 Fa : 35
## None :864 Median : 0.0 Gd:488 Gd: 146 PConc :647 Gd :618
## Stone :128 Mean : 103.7 TA:906 Po: 1 Slab : 24 TA :649
## NA's : 8 3rd Qu.: 166.0 TA:1282 Stone : 6 NA's: 37
## Max. :1600.0 Wood : 3
## NA's :8
## BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Fa : 45 Av :221 ALQ :220 Min. : 0.0 ALQ : 19
## Gd : 65 Gd :134 BLQ :148 1st Qu.: 0.0 BLQ : 33
## Po : 2 Mn :114 GLQ :418 Median : 383.5 GLQ : 14
## TA :1311 No :953 LwQ : 74 Mean : 443.6 LwQ : 46
## NA's: 37 NA's: 38 Rec :133 3rd Qu.: 712.2 Rec : 54
## Unf :430 Max. :5644.0 Unf :1256
## NA's: 37 NA's: 38
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Floor: 1 Ex:741
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 GasA :1428 Fa: 49
## Median : 0.00 Median : 477.5 Median : 991.5 GasW : 18 Gd:241
## Mean : 46.55 Mean : 567.2 Mean :1057.4 Grav : 7 Po: 1
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2 OthW : 2 TA:428
## Max. :1474.00 Max. :2336.0 Max. :6110.0 Wall : 4
##
## CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## N: 95 FuseA: 94 Min. : 334 Min. : 0 Min. : 0.000
## Y:1365 FuseF: 27 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000
## FuseP: 3 Median :1087 Median : 0 Median : 0.000
## Mix : 1 Mean :1163 Mean : 347 Mean : 5.845
## SBrkr:1334 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000
## NA's : 1 Max. :4692 Max. :2065 Max. :572.000
##
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median :1464 Median :0.0000 Median :0.00000 Median :2.000
## Mean :1515 Mean :0.4253 Mean :0.05753 Mean :1.565
## 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :5642 Max. :3.0000 Max. :2.00000 Max. :3.000
##
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## Min. :0.0000 Min. :0.000 Min. :0.000 Ex:100 Min. : 2.000
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 Fa: 39 1st Qu.: 5.000
## Median :0.0000 Median :3.000 Median :1.000 Gd:586 Median : 6.000
## Mean :0.3829 Mean :2.866 Mean :1.047 TA:735 Mean : 6.518
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :2.0000 Max. :8.000 Max. :3.000 Max. :14.000
##
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## Maj1: 14 Min. :0.000 Ex : 24 2Types : 6 Min. :1900
## Maj2: 5 1st Qu.:0.000 Fa : 33 Attchd :870 1st Qu.:1961
## Min1: 31 Median :1.000 Gd :380 Basment: 19 Median :1980
## Min2: 34 Mean :0.613 Po : 20 BuiltIn: 88 Mean :1979
## Mod : 15 3rd Qu.:1.000 TA :313 CarPort: 9 3rd Qu.:2002
## Sev : 1 Max. :3.000 NA's:690 Detchd :387 Max. :2010
## Typ :1360 NA's : 81 NA's :81
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## Fin :352 Min. :0.000 Min. : 0.0 Ex : 3 Ex : 2
## RFn :422 1st Qu.:1.000 1st Qu.: 334.5 Fa : 48 Fa : 35
## Unf :605 Median :2.000 Median : 480.0 Gd : 14 Gd : 9
## NA's: 81 Mean :1.767 Mean : 473.0 Po : 3 Po : 7
## 3rd Qu.:2.000 3rd Qu.: 576.0 TA :1311 TA :1326
## Max. :4.000 Max. :1418.0 NA's: 81 NA's: 81
##
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## N: 90 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## P: 30 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Y:1340 Median : 0.00 Median : 25.00 Median : 0.00 Median : 0.00
## Mean : 94.24 Mean : 46.66 Mean : 21.95 Mean : 3.41
## 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :857.00 Max. :547.00 Max. :552.00 Max. :508.00
##
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## Min. : 0.00 Min. : 0.000 Ex : 2 GdPrv: 59 Gar2: 2
## 1st Qu.: 0.00 1st Qu.: 0.000 Fa : 2 GdWo : 54 Othr: 2
## Median : 0.00 Median : 0.000 Gd : 3 MnPrv: 157 Shed: 49
## Mean : 15.06 Mean : 2.759 NA's:1453 MnWw : 11 TenC: 1
## 3rd Qu.: 0.00 3rd Qu.: 0.000 NA's :1179 NA's:1406
## Max. :480.00 Max. :738.000
##
## MiscVal MoSold YrSold SaleType
## Min. : 0.00 Min. : 1.000 Min. :2006 WD :1267
## 1st Qu.: 0.00 1st Qu.: 5.000 1st Qu.:2007 New : 122
## Median : 0.00 Median : 6.000 Median :2008 COD : 43
## Mean : 43.49 Mean : 6.322 Mean :2008 ConLD : 9
## 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009 ConLI : 5
## Max. :15500.00 Max. :12.000 Max. :2010 ConLw : 5
## (Other): 9
## SaleCondition SalePrice
## Abnorml: 101 Min. : 34900
## AdjLand: 4 1st Qu.:129975
## Alloca : 12 Median :163000
## Family : 20 Mean :180921
## Normal :1198 3rd Qu.:214000
## Partial: 125 Max. :755000
##
#View how the variables are skewed
plot_histogram(train)
Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. The 2 independent variables: X1stFlrSF : First floor square feet GrLivArea: Above ground living area
Dependent variable: SalePrice: sale price of the house
plot(train$X1stFlrSF, train$SalePrice,
main = "First Floor Sq Ft vs Sale Price",
xlab = "Sq Ft",
ylab = 'Sale Price')
plot(train$GrLivArea, train$SalePrice,
main = "Above Ground Living Area vs Sale Price",
xlab = "Sq Ft",
ylab = 'Sale Price')
Derive a correlation matrix for any three quantitative variables in the dataset
corr_matrix <- cor(train[c("OverallQual", "GrLivArea", "SalePrice")])
corr_matrix
## OverallQual GrLivArea SalePrice
## OverallQual 1.0000000 0.5930074 0.7909816
## GrLivArea 0.5930074 1.0000000 0.7086245
## SalePrice 0.7909816 0.7086245 1.0000000
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval
cor.test(train$OverallQual, train$SalePrice, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train$OverallQual and train$SalePrice
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.7780752 0.8032204
## sample estimates:
## cor
## 0.7909816
cor.test(train$OverallQual, train$GrLivArea, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train$OverallQual and train$GrLivArea
## t = 28.121, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5708061 0.6143422
## sample estimates:
## cor
## 0.5930074
cor.test(train$GrLivArea, train$SalePrice, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train$GrLivArea and train$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
The three tests show a p-value that is much lower than the significant p-valuce of 0.05, meaning the null hypothesis can be rejected and the true correlation is not 0 for the 3 chosen variables.
2b. Linear Algebra and Correlation
Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix
#correlation matrix
corr_matrix
## OverallQual GrLivArea SalePrice
## OverallQual 1.0000000 0.5930074 0.7909816
## GrLivArea 0.5930074 1.0000000 0.7086245
## SalePrice 0.7909816 0.7086245 1.0000000
#invert the matrix
inverted_corr_matrix = solve(corr_matrix)
inverted_corr_matrix
## OverallQual GrLivArea SalePrice
## OverallQual 2.6865350 -0.1753704 -2.000728
## GrLivArea -0.1753704 2.0200794 -1.292763
## SalePrice -2.0007280 -1.2927630 3.498623
#multiply correlation matrix by precision matrix
corr_matrix %*% inverted_corr_matrix
## OverallQual GrLivArea SalePrice
## OverallQual 1 0 0
## GrLivArea 0 1 0
## SalePrice 0 0 1
#multiply the inverted matrix by the original correlation matrix
inverted_corr_matrix %*% corr_matrix
## OverallQual GrLivArea SalePrice
## OverallQual 1.000000e+00 -4.440892e-16 -8.881784e-16
## GrLivArea 0.000000e+00 1.000000e+00 0.000000e+00
## SalePrice 4.440892e-16 0.000000e+00 1.000000e+00
Conduct LU decomposition on the matrix
library(Matrix)
## Warning: package 'Matrix' was built under R version 3.6.3
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
LU <- lu(corr_matrix)
lower <- expand(LU)$L
lower
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
## [,1] [,2] [,3]
## [1,] 1.0000000 . .
## [2,] 0.5930074 1.0000000 .
## [3,] 0.7909816 0.3695063 1.0000000
upper <- expand(LU)$U
upper
## 3 x 3 Matrix of class "dtrMatrix"
## [,1] [,2] [,3]
## [1,] 1.0000000 0.5930074 0.7909816
## [2,] . 0.6483422 0.2395665
## [3,] . . 0.2858268
lower %*% upper
## 3 x 3 Matrix of class "dgeMatrix"
## [,1] [,2] [,3]
## [1,] 1.0000000 0.5930074 0.7909816
## [2,] 0.5930074 1.0000000 0.7086245
## [3,] 0.7909816 0.7086245 1.0000000
2c. Calculus-Based Probability & Statistics
Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of lambda for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, lambda). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
Based on the plot of histograms, there are number of variables that are skewed to the right. For the purposes of this problem, I am choosing Unfinished square feet of basement area (BsmtUnfSF.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
plot_histogram(train$BsmtUnfSF)
summary(train$BsmtUnfSF)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 223.0 477.5 567.2 808.0 2336.0
The variable needs to be shifted since the minimum is zero.
shifted_var <- train$BsmtUnfSF + 0.5
summary(shifted_var)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5 223.5 478.0 567.7 808.5 2336.5
Then load the MASS package and run fitdistr to fit an exponential probability density function
fitted_var <- fitdistr(shifted_var, "exponential")
#Calculate lambda
lambda <- fitted_var$estimate
lambda
## rate
## 0.001761368
Find the optimal value of lambda for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, lambda))
exponential <- rexp(1000, lambda)
# Plot a histogram and compare it with a histogram of your original variable
par(mfrow = c(1,2))
plot_histogram(exponential)
plot_histogram(train$TotalBsmtSF)
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data
#Find the 5th and 95th percentile
qexp(c(0.05,0.95), rate = lambda)
## [1] 29.12128 1700.79827
#Generate a 95% confidence interval
CI(train$BsmtUnfSF, 0.95)
## upper mean lower
## 589.9246 567.2404 544.5562
#Find empirical 5th and 95th percentile
quantile(train$BsmtUnfSF, c(0.05, 0.95))
## 5% 95%
## 0 1468
The exponential model is more skewed than the empirical.
2d. Model
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
Let’s do some mathematical investigating and see the correlation/p-values between all variables and Sale Price. But, first, we need to do some data cleaning and remove non-numeric variables for the analysis.
numeric_train_data <- select_if(train, is.numeric)
#Make sure all the variables are data type integers
glimpse(numeric_train_data)
## Observations: 1,460
## Variables: 38
## $ Id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16...
## $ MSSubClass <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60, 20, ...
## $ LotFrontage <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 9...
## $ LotArea <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, ...
## $ OverallQual <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, 6, 4,...
## $ OverallCond <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, 7, 5,...
## $ YearBuilt <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931,...
## $ YearRemodAdd <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, 1950,...
## $ MasVnrArea <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, 0, 30...
## $ BsmtFinSF1 <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851, 906,...
## $ BsmtFinSF2 <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ BsmtUnfSF <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140, 134,...
## $ TotalBsmtSF <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952, 991,...
## $ X1stFlrSF <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022, 107...
## $ X2ndFlrSF <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, 1142,...
## $ LowQualFinSF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ GrLivArea <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774,...
## $ BsmtFullBath <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0,...
## $ BsmtHalfBath <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ FullBath <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, 1, 2,...
## $ HalfBath <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,...
## $ BedroomAbvGr <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, 2, 2,...
## $ KitchenAbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2,...
## $ TotRmsAbvGrd <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5, 5, 6...
## $ Fireplaces <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, 1, 0,...
## $ GarageYrBlt <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, 1931,...
## $ GarageCars <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2,...
## $ GarageArea <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205, 384...
## $ WoodDeckSF <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, 140, ...
## $ OpenPorchSF <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, 33, 2...
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, 176, ...
## $ X3SsnPorch <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ ScreenPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0, 0, ...
## $ PoolArea <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ MiscVal <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0, 0, 7...
## $ MoSold <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, 7, 3,...
## $ YrSold <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, 2008,...
## $ SalePrice <int> 208500, 181500, 223500, 140000, 250000, 143000, 30700...
#Check for NA values
apply(numeric_train_data, 2, function(x) any(is.na(x)))
## Id MSSubClass LotFrontage LotArea OverallQual
## FALSE FALSE TRUE FALSE FALSE
## OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1
## FALSE FALSE FALSE TRUE FALSE
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF X1stFlrSF X2ndFlrSF
## FALSE FALSE FALSE FALSE FALSE
## LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## FALSE FALSE FALSE FALSE FALSE
## HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces
## FALSE FALSE FALSE FALSE FALSE
## GarageYrBlt GarageCars GarageArea WoodDeckSF OpenPorchSF
## TRUE FALSE FALSE FALSE FALSE
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea MiscVal
## FALSE FALSE FALSE FALSE FALSE
## MoSold YrSold SalePrice
## FALSE FALSE FALSE
#We see that there are missing values
#The columns are LotFrontage, GarageYrBlt, MasVnrArea
res <- cor(numeric_train_data[-1], numeric_train_data$SalePrice)
res
## [,1]
## MSSubClass -0.08428414
## LotFrontage NA
## LotArea 0.26384335
## OverallQual 0.79098160
## OverallCond -0.07785589
## YearBuilt 0.52289733
## YearRemodAdd 0.50710097
## MasVnrArea NA
## BsmtFinSF1 0.38641981
## BsmtFinSF2 -0.01137812
## BsmtUnfSF 0.21447911
## TotalBsmtSF 0.61358055
## X1stFlrSF 0.60585218
## X2ndFlrSF 0.31933380
## LowQualFinSF -0.02560613
## GrLivArea 0.70862448
## BsmtFullBath 0.22712223
## BsmtHalfBath -0.01684415
## FullBath 0.56066376
## HalfBath 0.28410768
## BedroomAbvGr 0.16821315
## KitchenAbvGr -0.13590737
## TotRmsAbvGrd 0.53372316
## Fireplaces 0.46692884
## GarageYrBlt NA
## GarageCars 0.64040920
## GarageArea 0.62343144
## WoodDeckSF 0.32441344
## OpenPorchSF 0.31585623
## EnclosedPorch -0.12857796
## X3SsnPorch 0.04458367
## ScreenPorch 0.11144657
## PoolArea 0.09240355
## MiscVal -0.02118958
## MoSold 0.04643225
## YrSold -0.02892259
## SalePrice 1.00000000
#The NA values in those forces cor to return NA when doing the correlation test
#I could drop the NA values, replace values with mean or median
#reference: https://www.guru99.com/r-replace-missing-values.html
#I'm going to drop since the data has 1460 observations
#numeric_train_drop <- numeric_train_data %>% na.omit()
#dim(numeric_train_drop)
numeric_train_data[is.na(numeric_train_data)] <- 0
numeric_train_drop <- numeric_train_data
#The new dataset is 1121 observations of 38 variables
Let’s look at the correlation values with the new dataset with no NA values.
res_new <- cor(numeric_train_drop[-1], numeric_train_drop$SalePrice)
res_new
## [,1]
## MSSubClass -0.08428414
## LotFrontage 0.20962394
## LotArea 0.26384335
## OverallQual 0.79098160
## OverallCond -0.07785589
## YearBuilt 0.52289733
## YearRemodAdd 0.50710097
## MasVnrArea 0.47261450
## BsmtFinSF1 0.38641981
## BsmtFinSF2 -0.01137812
## BsmtUnfSF 0.21447911
## TotalBsmtSF 0.61358055
## X1stFlrSF 0.60585218
## X2ndFlrSF 0.31933380
## LowQualFinSF -0.02560613
## GrLivArea 0.70862448
## BsmtFullBath 0.22712223
## BsmtHalfBath -0.01684415
## FullBath 0.56066376
## HalfBath 0.28410768
## BedroomAbvGr 0.16821315
## KitchenAbvGr -0.13590737
## TotRmsAbvGrd 0.53372316
## Fireplaces 0.46692884
## GarageYrBlt 0.26136644
## GarageCars 0.64040920
## GarageArea 0.62343144
## WoodDeckSF 0.32441344
## OpenPorchSF 0.31585623
## EnclosedPorch -0.12857796
## X3SsnPorch 0.04458367
## ScreenPorch 0.11144657
## PoolArea 0.09240355
## MiscVal -0.02118958
## MoSold 0.04643225
## YrSold -0.02892259
## SalePrice 1.00000000
Now, we will create a linear model with multiple variables with relatively high correlation.
high_cor_var <- numeric_train_drop %>%dplyr:: select(OverallQual, TotalBsmtSF, X1stFlrSF, GrLivArea, GarageCars, GarageArea, FullBath, SalePrice)
high_cor_lm <- lm(SalePrice ~., data=high_cor_var)
summary(high_cor_lm)
##
## Call:
## lm(formula = SalePrice ~ ., data = high_cor_var)
##
## Residuals:
## Min 1Q Median 3Q Max
## -471605 -19887 -1264 16874 288458
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.029e+05 4.923e+03 -20.901 < 2e-16 ***
## OverallQual 2.386e+04 1.108e+03 21.546 < 2e-16 ***
## TotalBsmtSF 2.453e+01 4.325e+00 5.671 1.71e-08 ***
## X1stFlrSF 1.112e+01 5.035e+00 2.208 0.0274 *
## GrLivArea 4.243e+01 2.940e+00 14.432 < 2e-16 ***
## GarageCars 1.421e+04 3.067e+03 4.632 3.94e-06 ***
## GarageArea 1.630e+01 1.054e+01 1.547 0.1220
## FullBath 1.457e+03 2.529e+03 0.576 0.5646
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38850 on 1452 degrees of freedom
## Multiple R-squared: 0.762, Adjusted R-squared: 0.7608
## F-statistic: 664 on 7 and 1452 DF, p-value: < 2.2e-16
StepWise Regression Model ref: http://www.sthda.com/english/articles/37-model-selection-essentials-in-r/154-stepwise-regression-essentials-in-r/
#set up cross-validation (resampling procedure), K refers to the number of groups that a given data sample is split into
train.control <- trainControl(method ='cv', number=10)
step.model <- train(SalePrice ~., data = numeric_train_drop,
method = "leapSeq",
tuneGrid = data.frame(nvmax = 5:15),
trControl = train.control
)
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 2
## linear dependencies found
## Reordering variables and trying again:
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 2
## linear dependencies found
## Reordering variables and trying again:
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 2
## linear dependencies found
## Reordering variables and trying again:
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 2
## linear dependencies found
## Reordering variables and trying again:
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 2
## linear dependencies found
## Reordering variables and trying again:
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 2
## linear dependencies found
## Reordering variables and trying again:
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 2
## linear dependencies found
## Reordering variables and trying again:
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 2
## linear dependencies found
## Reordering variables and trying again:
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 2
## linear dependencies found
## Reordering variables and trying again:
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 2
## linear dependencies found
## Reordering variables and trying again:
## Warning in leaps.setup(x, y, wt = weights, nbest = nbest, nvmax = nvmax, : 2
## linear dependencies found
## Reordering variables and trying again:
#RMSE and MAE are different metrics that measure prediction error. You want low RMSE and MAE values
#High Rsquared results the better
step.model$results
## nvmax RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 5 45216.17 0.6835491 31011.85 7999.331 0.08567085 2691.569
## 2 6 45329.53 0.6824841 30814.14 7681.631 0.08023287 2428.390
## 3 7 43062.11 0.7131724 28570.66 9221.972 0.09550397 3904.753
## 4 8 42532.38 0.7213604 28094.29 7708.265 0.07766905 2976.683
## 5 9 41834.80 0.7311900 27431.57 7061.471 0.07809127 2530.220
## 6 10 39718.36 0.7550711 26033.30 8023.905 0.08703431 3405.627
## 7 11 39089.63 0.7650416 25196.05 8151.882 0.08353799 2535.014
## 8 12 38763.44 0.7691174 24942.23 8025.952 0.07979375 2596.117
## 9 13 38587.10 0.7718841 24693.20 8081.352 0.07991216 2410.730
## 10 14 38668.21 0.7712736 24743.74 7591.561 0.07797126 2155.504
## 11 15 38550.59 0.7743463 24229.60 9614.871 0.09893680 2607.831
#Best Tune shows which model is best, and the model with 15 variables is the best.
step.model$bestTune
## nvmax
## 11 15
summary(step.model$finalModel)
## Subset selection object
## 37 Variables (and intercept)
## Forced in Forced out
## Id FALSE FALSE
## MSSubClass FALSE FALSE
## LotFrontage FALSE FALSE
## LotArea FALSE FALSE
## OverallQual FALSE FALSE
## OverallCond FALSE FALSE
## YearBuilt FALSE FALSE
## YearRemodAdd FALSE FALSE
## MasVnrArea FALSE FALSE
## BsmtFinSF1 FALSE FALSE
## BsmtFinSF2 FALSE FALSE
## BsmtUnfSF FALSE FALSE
## X1stFlrSF FALSE FALSE
## X2ndFlrSF FALSE FALSE
## LowQualFinSF FALSE FALSE
## BsmtFullBath FALSE FALSE
## BsmtHalfBath FALSE FALSE
## FullBath FALSE FALSE
## HalfBath FALSE FALSE
## BedroomAbvGr FALSE FALSE
## KitchenAbvGr FALSE FALSE
## TotRmsAbvGrd FALSE FALSE
## Fireplaces FALSE FALSE
## GarageYrBlt FALSE FALSE
## GarageCars FALSE FALSE
## GarageArea FALSE FALSE
## WoodDeckSF FALSE FALSE
## OpenPorchSF FALSE FALSE
## EnclosedPorch FALSE FALSE
## X3SsnPorch FALSE FALSE
## ScreenPorch FALSE FALSE
## PoolArea FALSE FALSE
## MiscVal FALSE FALSE
## MoSold FALSE FALSE
## YrSold FALSE FALSE
## TotalBsmtSF FALSE FALSE
## GrLivArea FALSE FALSE
## 1 subsets of each size up to 16
## Selection Algorithm: 'sequential replacement'
## Id MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt
## 1 ( 1 ) " " " " " " " " "*" " " " "
## 2 ( 1 ) " " " " " " " " "*" " " " "
## 3 ( 1 ) " " " " " " " " "*" " " " "
## 4 ( 1 ) " " " " " " " " "*" " " " "
## 5 ( 1 ) " " "*" " " " " "*" " " " "
## 6 ( 1 ) " " "*" " " " " "*" " " "*"
## 7 ( 1 ) " " "*" " " " " "*" " " "*"
## 8 ( 1 ) " " "*" " " " " "*" "*" "*"
## 9 ( 1 ) " " "*" " " " " "*" "*" "*"
## 10 ( 1 ) " " "*" " " "*" "*" "*" "*"
## 11 ( 1 ) " " "*" " " "*" "*" "*" "*"
## 12 ( 1 ) " " "*" " " "*" "*" "*" "*"
## 13 ( 1 ) " " "*" " " "*" "*" "*" "*"
## 14 ( 1 ) " " "*" " " "*" "*" "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 16 ( 1 ) " " "*" " " "*" "*" "*" "*"
## YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " "*" " " " " " "
## 4 ( 1 ) " " " " "*" " " " " " "
## 5 ( 1 ) " " " " "*" " " " " " "
## 6 ( 1 ) " " " " "*" " " " " " "
## 7 ( 1 ) " " " " "*" " " " " " "
## 8 ( 1 ) " " " " "*" " " " " " "
## 9 ( 1 ) " " " " "*" " " " " " "
## 10 ( 1 ) " " " " "*" " " " " " "
## 11 ( 1 ) " " "*" "*" " " " " " "
## 12 ( 1 ) " " "*" "*" " " " " " "
## 13 ( 1 ) " " "*" "*" " " " " " "
## 14 ( 1 ) " " "*" "*" " " " " " "
## 15 ( 1 ) "*" "*" "*" "*" "*" " "
## 16 ( 1 ) " " "*" "*" " " " " " "
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " "*" " " " "
## 3 ( 1 ) " " " " " " "*" " " " "
## 4 ( 1 ) " " " " " " "*" " " " "
## 5 ( 1 ) " " " " " " "*" " " " "
## 6 ( 1 ) " " " " " " "*" " " " "
## 7 ( 1 ) " " " " " " "*" " " " "
## 8 ( 1 ) " " " " " " "*" " " " "
## 9 ( 1 ) " " " " " " "*" " " " "
## 10 ( 1 ) " " " " " " "*" " " " "
## 11 ( 1 ) " " " " " " "*" " " " "
## 12 ( 1 ) " " " " " " "*" "*" " "
## 13 ( 1 ) " " " " " " "*" "*" " "
## 14 ( 1 ) " " " " " " "*" "*" " "
## 15 ( 1 ) "*" "*" "*" " " " " " "
## 16 ( 1 ) " " " " " " "*" "*" " "
## FullBath HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces
## 1 ( 1 ) " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " "
## 7 ( 1 ) " " " " "*" " " " " " "
## 8 ( 1 ) " " " " "*" " " " " " "
## 9 ( 1 ) " " " " "*" " " " " " "
## 10 ( 1 ) " " " " "*" " " " " " "
## 11 ( 1 ) " " " " "*" " " " " " "
## 12 ( 1 ) " " " " "*" " " " " " "
## 13 ( 1 ) " " " " "*" " " "*" " "
## 14 ( 1 ) " " " " "*" " " " " " "
## 15 ( 1 ) " " " " " " " " " " " "
## 16 ( 1 ) " " " " "*" "*" "*" " "
## GarageYrBlt GarageCars GarageArea WoodDeckSF OpenPorchSF
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " "
## 5 ( 1 ) " " "*" " " " " " "
## 6 ( 1 ) " " "*" " " " " " "
## 7 ( 1 ) " " "*" " " " " " "
## 8 ( 1 ) " " "*" " " " " " "
## 9 ( 1 ) "*" "*" " " " " " "
## 10 ( 1 ) "*" "*" " " " " " "
## 11 ( 1 ) "*" "*" " " " " " "
## 12 ( 1 ) "*" "*" " " " " " "
## 13 ( 1 ) "*" "*" " " " " " "
## 14 ( 1 ) "*" "*" " " "*" " "
## 15 ( 1 ) " " " " " " " " " "
## 16 ( 1 ) "*" "*" " " "*" " "
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " "
## 4 ( 1 ) " " " " " " " " " " " " " "
## 5 ( 1 ) " " " " " " " " " " " " " "
## 6 ( 1 ) " " " " " " " " " " " " " "
## 7 ( 1 ) " " " " " " " " " " " " " "
## 8 ( 1 ) " " " " " " " " " " " " " "
## 9 ( 1 ) " " " " " " " " " " " " " "
## 10 ( 1 ) " " " " " " " " " " " " " "
## 11 ( 1 ) " " " " " " " " " " " " " "
## 12 ( 1 ) " " " " " " " " " " " " " "
## 13 ( 1 ) " " " " " " " " " " " " " "
## 14 ( 1 ) " " " " "*" " " " " " " " "
## 15 ( 1 ) " " " " " " " " " " " " " "
## 16 ( 1 ) " " " " "*" " " " " " " " "
coef(step.model$finalModel, 15)
## (Intercept) Id MSSubClass LotFrontage LotArea
## -1.186637e+06 1.093822e+00 -2.006921e+02 2.693398e+01 7.321731e-01
## OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea
## 2.438843e+04 2.615994e+03 2.579998e+02 3.141734e+02 4.562553e+01
## BsmtFinSF1 BsmtFinSF2 BsmtUnfSF X2ndFlrSF LowQualFinSF
## 5.501056e+01 4.555287e+01 4.062282e+01 4.525051e+01 3.276732e+01
## BsmtFullBath
## 7.625289e+03
After testing manual backward regression, automated stepwise regession with the caret package, and using the MASS package’s stepwise regression. It seems like the MASS package stepwise regression offers the best model.
Let’s test it.
# Train the model
step2.model <- train(SalePrice ~., data = numeric_train_drop,
method = "lmStepAIC",
trControl = train.control,
trace = FALSE
)
# Model accuracy
step2.model$results
## parameter RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 none 36025.28 0.8002378 21778.04 12432.68 0.1140011 2212.563
# Final model coefficients
step2.model$finalModel
##
## Call:
## lm(formula = .outcome ~ MSSubClass + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + BsmtFullBath + FullBath +
## BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces +
## GarageYrBlt + GarageCars + WoodDeckSF + ScreenPorch, data = dat)
##
## Coefficients:
## (Intercept) MSSubClass LotArea OverallQual OverallCond
## -8.801e+05 -1.684e+02 4.079e-01 1.760e+04 4.999e+03
## YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtUnfSF
## 3.064e+02 1.193e+02 2.856e+01 1.559e+01 5.869e+00
## X1stFlrSF X2ndFlrSF BsmtFullBath FullBath BedroomAbvGr
## 4.893e+01 4.550e+01 8.974e+03 3.845e+03 -1.019e+04
## KitchenAbvGr TotRmsAbvGrd Fireplaces GarageYrBlt GarageCars
## -1.707e+04 5.104e+03 3.706e+03 -1.452e+01 1.690e+04
## WoodDeckSF ScreenPorch
## 2.532e+01 5.572e+01
# Summary of the model
summary(step2.model$finalModel)
##
## Call:
## lm(formula = .outcome ~ MSSubClass + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtUnfSF + X1stFlrSF + X2ndFlrSF + BsmtFullBath + FullBath +
## BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces +
## GarageYrBlt + GarageCars + WoodDeckSF + ScreenPorch, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -478532 -16123 -2234 14242 284394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.801e+05 1.233e+05 -7.136 1.52e-12 ***
## MSSubClass -1.684e+02 2.562e+01 -6.575 6.80e-11 ***
## LotArea 4.078e-01 9.891e-02 4.123 3.95e-05 ***
## OverallQual 1.760e+04 1.159e+03 15.189 < 2e-16 ***
## OverallCond 4.999e+03 1.004e+03 4.978 7.20e-07 ***
## YearBuilt 3.064e+02 5.292e+01 5.790 8.64e-09 ***
## YearRemodAdd 1.193e+02 6.541e+01 1.825 0.068259 .
## MasVnrArea 2.856e+01 5.842e+00 4.888 1.13e-06 ***
## BsmtFinSF1 1.559e+01 3.872e+00 4.026 5.98e-05 ***
## BsmtUnfSF 5.869e+00 3.599e+00 1.631 0.103189
## X1stFlrSF 4.893e+01 5.187e+00 9.433 < 2e-16 ***
## X2ndFlrSF 4.550e+01 4.177e+00 10.893 < 2e-16 ***
## BsmtFullBath 8.974e+03 2.403e+03 3.735 0.000195 ***
## FullBath 3.845e+03 2.568e+03 1.497 0.134512
## BedroomAbvGr -1.019e+04 1.659e+03 -6.143 1.04e-09 ***
## KitchenAbvGr -1.707e+04 5.083e+03 -3.358 0.000807 ***
## TotRmsAbvGrd 5.104e+03 1.198e+03 4.260 2.18e-05 ***
## Fireplaces 3.706e+03 1.724e+03 2.150 0.031719 *
## GarageYrBlt -1.452e+01 2.628e+00 -5.528 3.85e-08 ***
## GarageCars 1.690e+04 2.048e+03 8.253 3.44e-16 ***
## WoodDeckSF 2.532e+01 7.793e+00 3.249 0.001183 **
## ScreenPorch 5.572e+01 1.670e+01 3.337 0.000870 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34380 on 1438 degrees of freedom
## Multiple R-squared: 0.8154, Adjusted R-squared: 0.8127
## F-statistic: 302.5 on 21 and 1438 DF, p-value: < 2.2e-16
I used step2.model$modelinfo to get the linear model equation.
MASS_sr <- lm(formula = SalePrice ~ MSSubClass + LotFrontage + LotArea +
OverallQual + OverallCond + YearBuilt + MasVnrArea + BsmtFinSF1 +
X1stFlrSF + X2ndFlrSF + BsmtFullBath + FullBath + BedroomAbvGr +
KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageCars + WoodDeckSF +
ScreenPorch + PoolArea, data = numeric_train_drop)
summary(MASS_sr)
##
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + LotArea +
## OverallQual + OverallCond + YearBuilt + MasVnrArea + BsmtFinSF1 +
## X1stFlrSF + X2ndFlrSF + BsmtFullBath + FullBath + BedroomAbvGr +
## KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageCars + WoodDeckSF +
## ScreenPorch + PoolArea, data = numeric_train_drop)
##
## Residuals:
## Min 1Q Median 3Q Max
## -470219 -16254 -2088 13143 298726
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.633e+05 9.472e+04 -8.058 1.61e-15 ***
## MSSubClass -1.630e+02 2.638e+01 -6.179 8.39e-10 ***
## LotFrontage 2.931e+01 2.853e+01 1.028 0.304313
## LotArea 4.057e-01 1.002e-01 4.048 5.43e-05 ***
## OverallQual 1.851e+04 1.134e+03 16.333 < 2e-16 ***
## OverallCond 5.144e+03 9.226e+02 5.575 2.95e-08 ***
## YearBuilt 3.527e+02 4.807e+01 7.337 3.64e-13 ***
## MasVnrArea 2.990e+01 5.873e+00 5.092 4.02e-07 ***
## BsmtFinSF1 1.063e+01 2.976e+00 3.573 0.000365 ***
## X1stFlrSF 5.449e+01 4.822e+00 11.299 < 2e-16 ***
## X2ndFlrSF 4.605e+01 4.255e+00 10.823 < 2e-16 ***
## BsmtFullBath 9.508e+03 2.391e+03 3.976 7.35e-05 ***
## FullBath 5.818e+03 2.569e+03 2.265 0.023666 *
## BedroomAbvGr -1.086e+04 1.656e+03 -6.559 7.54e-11 ***
## KitchenAbvGr -1.562e+04 5.127e+03 -3.046 0.002359 **
## TotRmsAbvGrd 5.364e+03 1.215e+03 4.415 1.09e-05 ***
## Fireplaces 2.685e+03 1.737e+03 1.546 0.122215
## GarageCars 1.025e+04 1.699e+03 6.032 2.06e-09 ***
## WoodDeckSF 2.606e+01 7.919e+00 3.291 0.001022 **
## ScreenPorch 5.244e+01 1.691e+01 3.101 0.001964 **
## PoolArea -3.131e+01 2.354e+01 -1.330 0.183733
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34830 on 1439 degrees of freedom
## Multiple R-squared: 0.8104, Adjusted R-squared: 0.8077
## F-statistic: 307.5 on 20 and 1439 DF, p-value: < 2.2e-16
The residuals look normally distributed, looks okay to proceed to test on test data.
hist(MASS_sr$residuals, xlab = "Residuals", ylab = "", breaks = 200)
qqnorm(MASS_sr$residuals)
qqline(MASS_sr$residuals)
test[is.na(test)] <- 0
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
## Warning in `[<-.factor`(`*tmp*`, thisvar, value = 0): invalid factor level, NA
## generated
results <- predict(MASS_sr, test)
prediction_frame <- data.frame(cbind(Id = test$Id, SalePrice = results))
write.csv(prediction_frame, file= 'GAbreu_kaggle.csv', row.names = FALSE)
#Kaggle Score
knitr::include_graphics('Kaggle Score.jpg')