Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
5 points
\(P(X>x | X>y) = 0.58\)
X_gr_x <- X[which(X>x)]
#Subset of X above its median x; roughly 50% of the observations
Y_gr_y <- Y[which(Y>y)]
#Subset of Y above its 1st quartile y; roughly 75% of the observations
#Product of the two marginal proportions; this equals P(X>x, Y>y) only when X and Y are independent
P_b <- length(X_gr_x)/length(X) * length(Y_gr_y)/length(Y)
\(P(X>x , Y>y) = 0.38\)
\(P(X<x | X>y) = 0.42\)
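As a cross-check, the three probabilities can be estimated directly from counts rather than from products of marginal proportions. The sketch below is illustrative only: the uniform and normal draws are assumed stand-ins for X and Y (their actual definitions are not shown in this section), with x taken as the median of X and y as the 1st quartile of Y.

```r
set.seed(42)
n <- 10000
# Assumption: illustrative stand-ins; the original X and Y are defined elsewhere
X <- runif(n, min = 1, max = 10)
Y <- rnorm(n, mean = 5.5, sd = 5.5)

x <- median(X)
y <- quantile(Y, 0.25, names = FALSE)

# a. P(X > x | X > y): keep observations with X > y, then take the share with X > x
P_a <- mean(X[X > y] > x)
# b. P(X > x, Y > y): share of observations satisfying both conditions at once
P_b <- mean(X > x & Y > y)
# c. P(X < x | X > y): complement of (a) within the same conditioning set
P_c <- mean(X[X > y] < x)

round(c(a = P_a, b = P_b, c = P_c), 3)
```

Counting the joint event in (b) makes no independence assumption, so comparing it with the product of the marginals is a genuine check.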
5 points Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
library(kableExtra)
#calculate missing options
X_le_x <- X[which(X<=x)]
Y_le_y <- Y[which(Y<=y)]
X_gr_x_and_Y_gr_y <- length(X_gr_x)/n * length(Y_gr_y)/n
X_gr_x_and_Y_le_y <- length(X_gr_x)/n * length(Y_le_y)/n
X_le_x_and_Y_gr_y <- length(X_le_x)/n * length(Y_gr_y)/n
X_le_x_and_Y_le_y <- length(X_le_x)/n * length(Y_le_y)/n
Tot_X_gr_x <- X_gr_x_and_Y_gr_y + X_gr_x_and_Y_le_y
Tot_X_le_x <- X_le_x_and_Y_gr_y + X_le_x_and_Y_le_y
Tot_Y_gr_y <- X_gr_x_and_Y_gr_y + X_le_x_and_Y_gr_y
Tot_Y_le_y <- X_gr_x_and_Y_le_y + X_le_x_and_Y_le_y
d <- matrix(c(X_gr_x_and_Y_gr_y,X_gr_x_and_Y_le_y,Tot_X_gr_x,X_le_x_and_Y_gr_y,X_le_x_and_Y_le_y,Tot_X_le_x, Tot_Y_gr_y,Tot_Y_le_y,Tot_X_gr_x+Tot_X_le_x), ncol = 3, byrow=TRUE)
colnames(d) <- c("Y>y", "Y\u2264y","Total")
rownames(d) <- c("X>x", "X\u2264x","Total")
| | Y>y | Y≤y | Total |
|---|---|---|---|
| X>x | 0.375 | 0.125 | 0.5 |
| X≤x | 0.375 | 0.125 | 0.5 |
| Total | 0.750 | 0.250 | 1.0 |
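From the table we see that P(X>x)P(Y>y) = 0.5 × 0.75 = 0.375.
From part b we see that \(P(X>x, Y>y)\) is also 0.375, so the joint probability equals the product of the marginals, which is consistent with X and Y being independent.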
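Because the cells of the table above are built by multiplying marginal proportions, they agree with P(X>x)P(Y>y) by construction. A possible alternative, sketched below, tabulates the observed joint frequencies and compares them with the outer product of the marginals. The simulated X and Y are assumptions used only to keep the example self-contained.

```r
set.seed(1)
n <- 10000
X <- runif(n, 1, 10)            # assumption: illustrative data
Y <- rnorm(n, 5.5, 5.5)         # assumption: illustrative data
x <- median(X)
y <- quantile(Y, 0.25, names = FALSE)

# Observed joint proportions from cross-tabulated indicator variables
joint <- prop.table(table(X_gt_x = X > x, Y_gt_y = Y > y))

# Joint proportions expected under independence: outer product of the marginals
expected <- outer(rowSums(joint), colSums(joint))

round(addmargins(joint), 3)     # joint cells with marginal totals
round(joint - expected, 4)      # values near 0 are consistent with independence
```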
5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
Count table:
##
## Fisher's Exact Test for Count Data
##
## data: round(c_t)
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9125 1.0959
## sample estimates:
## odds ratio
## 1
##
## Pearson's Chi-squared test
##
## data: c_t
## X-squared = 0, df = 1, p-value = 1
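With such a high p-value, we fail to reject the null hypothesis of independence, so the data are consistent with these variables being independent (as would be expected since they were generated independently). Fisher’s Exact Test computes an exact p-value and is typically preferred when cell counts are small, whereas the Chi Square Test relies on a large-sample approximation. As we have a fairly large sample size, Chi Square is the more appropriate choice, though at large cell values we would expect the two to yield similar results (as we see they do).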
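For reference, a minimal sketch of how both tests can be run on a 2x2 count table in base R. The counts below are hypothetical, chosen to be proportional to the 0.375/0.125 cells of the probability table; they are not the counts used in this report.

```r
# Hypothetical 2x2 count table (not the report's actual c_t)
counts <- matrix(c(375, 125,
                   375, 125),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("X>x", "X<=x"), c("Y>y", "Y<=y")))

fisher_res <- fisher.test(counts)   # exact p-value based on the hypergeometric distribution
chisq_res  <- chisq.test(counts)    # large-sample approximation (continuity-corrected for 2x2)

c(fisher = fisher_res$p.value, chisq = chisq_res$p.value)

# Expected counts under independence; the chi-square approximation is usually
# considered adequate when these are all at least 5
chisq_res$expected
```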
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques. I want you to do the following.
5 points. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
## Id MSSubClass MSZoning LotFrontage
## Min. : 1 Min. : 20.0 C (all): 10 Min. : 21
## 1st Qu.: 366 1st Qu.: 20.0 FV : 65 1st Qu.: 59
## Median : 730 Median : 50.0 RH : 16 Median : 69
## Mean : 730 Mean : 56.9 RL :1151 Mean : 70
## 3rd Qu.:1095 3rd Qu.: 70.0 RM : 218 3rd Qu.: 80
## Max. :1460 Max. :190.0 Max. :313
## NA's :259
## LotArea Street Alley LotShape LandContour
## Min. : 1300 Grvl: 6 Grvl: 50 IR1:484 Bnk: 63
## 1st Qu.: 7554 Pave:1454 Pave: 41 IR2: 41 HLS: 50
## Median : 9478 NA's:1369 IR3: 10 Low: 36
## Mean : 10517 Reg:925 Lvl:1311
## 3rd Qu.: 11602
## Max. :215245
##
## Utilities LotConfig LandSlope Neighborhood Condition1
## AllPub:1459 Corner : 263 Gtl:1382 NAmes :225 Norm :1260
## NoSeWa: 1 CulDSac: 94 Mod: 65 CollgCr:150 Feedr : 81
## FR2 : 47 Sev: 13 OldTown:113 Artery : 48
## FR3 : 4 Edwards:100 RRAn : 26
## Inside :1052 Somerst: 86 PosN : 19
## Gilbert: 79 RRAe : 11
## (Other):707 (Other): 15
## Condition2 BldgType HouseStyle OverallQual OverallCond
## Norm :1445 1Fam :1220 1Story :726 Min. : 1.0 Min. :1.00
## Feedr : 6 2fmCon: 31 2Story :445 1st Qu.: 5.0 1st Qu.:5.00
## Artery : 2 Duplex: 52 1.5Fin :154 Median : 6.0 Median :5.00
## PosN : 2 Twnhs : 43 SLvl : 65 Mean : 6.1 Mean :5.58
## RRNn : 2 TwnhsE: 114 SFoyer : 37 3rd Qu.: 7.0 3rd Qu.:6.00
## PosA : 1 1.5Unf : 14 Max. :10.0 Max. :9.00
## (Other): 2 (Other): 19
## YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1872 Min. :1950 Flat : 13 CompShg:1434 VinylSd:515
## 1st Qu.:1954 1st Qu.:1967 Gable :1141 Tar&Grv: 11 HdBoard:222
## Median :1973 Median :1994 Gambrel: 11 WdShngl: 6 MetalSd:220
## Mean :1971 Mean :1985 Hip : 286 WdShake: 5 Wd Sdng:206
## 3rd Qu.:2000 3rd Qu.:2004 Mansard: 7 ClyTile: 1 Plywood:108
## Max. :2010 Max. :2010 Shed : 2 Membran: 1 CemntBd: 61
## (Other): 2 (Other):128
## Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond
## VinylSd:504 BrkCmn : 15 Min. : 0 Ex: 52 Ex: 3
## MetalSd:214 BrkFace:445 1st Qu.: 0 Fa: 14 Fa: 28
## HdBoard:207 None :864 Median : 0 Gd:488 Gd: 146
## Wd Sdng:197 Stone :128 Mean : 104 TA:906 Po: 1
## Plywood:142 NA's : 8 3rd Qu.: 166 TA:1282
## CmentBd: 60 Max. :1600
## (Other):136 NA's :8
## Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1
## BrkTil:146 Ex :121 Fa : 45 Av :221 ALQ :220
## CBlock:634 Fa : 35 Gd : 65 Gd :134 BLQ :148
## PConc :647 Gd :618 Po : 2 Mn :114 GLQ :418
## Slab : 24 TA :649 TA :1311 No :953 LwQ : 74
## Stone : 6 NA's: 37 NA's: 37 NA's: 38 Rec :133
## Wood : 3 Unf :430
## NA's: 37
## BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF
## Min. : 0 ALQ : 19 Min. : 0.0 Min. : 0
## 1st Qu.: 0 BLQ : 33 1st Qu.: 0.0 1st Qu.: 223
## Median : 384 GLQ : 14 Median : 0.0 Median : 478
## Mean : 444 LwQ : 46 Mean : 46.5 Mean : 567
## 3rd Qu.: 712 Rec : 54 3rd Qu.: 0.0 3rd Qu.: 808
## Max. :5644 Unf :1256 Max. :1474.0 Max. :2336
## NA's: 38
## TotalBsmtSF Heating HeatingQC CentralAir Electrical
## Min. : 0 Floor: 1 Ex:741 N: 95 FuseA: 94
## 1st Qu.: 796 GasA :1428 Fa: 49 Y:1365 FuseF: 27
## Median : 992 GasW : 18 Gd:241 FuseP: 3
## Mean :1057 Grav : 7 Po: 1 Mix : 1
## 3rd Qu.:1298 OthW : 2 TA:428 SBrkr:1334
## Max. :6110 Wall : 4 NA's : 1
##
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## Min. : 334 Min. : 0 Min. : 0.0 Min. : 334
## 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.0 1st Qu.:1130
## Median :1087 Median : 0 Median : 0.0 Median :1464
## Mean :1163 Mean : 347 Mean : 5.8 Mean :1515
## 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.0 3rd Qu.:1777
## Max. :4692 Max. :2065 Max. :572.0 Max. :5642
##
## BsmtFullBath BsmtHalfBath FullBath HalfBath
## Min. :0.000 Min. :0.0000 Min. :0.00 Min. :0.000
## 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:1.00 1st Qu.:0.000
## Median :0.000 Median :0.0000 Median :2.00 Median :0.000
## Mean :0.425 Mean :0.0575 Mean :1.57 Mean :0.383
## 3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.:2.00 3rd Qu.:1.000
## Max. :3.000 Max. :2.0000 Max. :3.00 Max. :2.000
##
## BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.00 Min. :0.00 Ex:100 Min. : 2.00 Maj1: 14
## 1st Qu.:2.00 1st Qu.:1.00 Fa: 39 1st Qu.: 5.00 Maj2: 5
## Median :3.00 Median :1.00 Gd:586 Median : 6.00 Min1: 31
## Mean :2.87 Mean :1.05 TA:735 Mean : 6.52 Min2: 34
## 3rd Qu.:3.00 3rd Qu.:1.00 3rd Qu.: 7.00 Mod : 15
## Max. :8.00 Max. :3.00 Max. :14.00 Sev : 1
## Typ :1360
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish
## Min. :0.000 Ex : 24 2Types : 6 Min. :1900 Fin :352
## 1st Qu.:0.000 Fa : 33 Attchd :870 1st Qu.:1961 RFn :422
## Median :1.000 Gd :380 Basment: 19 Median :1980 Unf :605
## Mean :0.613 Po : 20 BuiltIn: 88 Mean :1978 NA's: 81
## 3rd Qu.:1.000 TA :313 CarPort: 9 3rd Qu.:2002
## Max. :3.000 NA's:690 Detchd :387 Max. :2010
## NA's : 81 NA's :81
## GarageCars GarageArea GarageQual GarageCond PavedDrive
## Min. :0.00 Min. : 0 Ex : 3 Ex : 2 N: 90
## 1st Qu.:1.00 1st Qu.: 334 Fa : 48 Fa : 35 P: 30
## Median :2.00 Median : 480 Gd : 14 Gd : 9 Y:1340
## Mean :1.77 Mean : 473 Po : 3 Po : 7
## 3rd Qu.:2.00 3rd Qu.: 576 TA :1311 TA :1326
## Max. :4.00 Max. :1418 NA's: 81 NA's: 81
##
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## Min. : 0.0 Min. : 0.0 Min. : 0 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0 1st Qu.: 0.0
## Median : 0.0 Median : 25.0 Median : 0 Median : 0.0
## Mean : 94.2 Mean : 46.7 Mean : 22 Mean : 3.4
## 3rd Qu.:168.0 3rd Qu.: 68.0 3rd Qu.: 0 3rd Qu.: 0.0
## Max. :857.0 Max. :547.0 Max. :552 Max. :508.0
##
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## Min. : 0.0 Min. : 0.0 Ex : 2 GdPrv: 59 Gar2: 2
## 1st Qu.: 0.0 1st Qu.: 0.0 Fa : 2 GdWo : 54 Othr: 2
## Median : 0.0 Median : 0.0 Gd : 3 MnPrv: 157 Shed: 49
## Mean : 15.1 Mean : 2.8 NA's:1453 MnWw : 11 TenC: 1
## 3rd Qu.: 0.0 3rd Qu.: 0.0 NA's :1179 NA's:1406
## Max. :480.0 Max. :738.0
##
## MiscVal MoSold YrSold SaleType
## Min. : 0 Min. : 1.00 Min. :2006 WD :1267
## 1st Qu.: 0 1st Qu.: 5.00 1st Qu.:2007 New : 122
## Median : 0 Median : 6.00 Median :2008 COD : 43
## Mean : 43 Mean : 6.32 Mean :2008 ConLD : 9
## 3rd Qu.: 0 3rd Qu.: 8.00 3rd Qu.:2009 ConLI : 5
## Max. :15500 Max. :12.00 Max. :2010 ConLw : 5
## (Other): 9
## SaleCondition SalePrice
## Abnorml: 101 Min. : 34900
## AdjLand: 4 1st Qu.:129975
## Alloca : 12 Median :163000
## Family : 20 Mean :180921
## Normal :1198 3rd Qu.:214000
## Partial: 125 Max. :755000
##
As we see, the data contains 80 variables, including the ID of the house.
Let’s look at some of these variables in more detail:
The categorical variable ‘Neighborhood’ provides an interesting snapshot of where more expensive homes can be found. Northridge, Northridge Heights and Stone Brook in particular stand out as the areas with the most expensive homes.
## Loading required package: ggplot2
The variable ‘OverallCond’ rates the overall condition of the house on a scale of 1 to 10, with 1 being “Very Poor” and 10 being “Very Excellent”. Below is a histogram depicting the distribution of this variable within the training dataset along with a line indicating the mean value.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 5.00 5.00 5.58 6.00 9.00
The variable ‘X1stFlrSF’ indicates the square footage of the first floor. It is right-skewed, with a mean of 1163 sq ft.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 882 1087 1163 1391 4692
The ‘TotRmsAbvGrd’ variable indicates how many total rooms the house has above grade (excluding bathrooms).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 5.00 6.00 6.52 7.00 14.00
Let us also take a look at the distribution of final Sales Price (our dependent variable for this dataset).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
To view the relationships between these three independent variables (and a couple of other ones) and our dependent variable, let’s take a look at the correlation matrix for them:
The correlation matrix for these variables is then:
## Loading required package: ggcorrplot
## Warning: package 'ggcorrplot' was built under R version 3.6.3
cor_m <- cor(P2_train[,c("OverallCond","X1stFlrSF", "X2ndFlrSF", "BedroomAbvGr", "TotRmsAbvGrd", "SalePrice")])
cor_m
## OverallCond X1stFlrSF X2ndFlrSF BedroomAbvGr TotRmsAbvGrd
## OverallCond 1.00000 -0.1442 0.02894 0.01298 -0.05758
## X1stFlrSF -0.14420 1.0000 -0.20265 0.12740 0.40952
## X2ndFlrSF 0.02894 -0.2026 1.00000 0.50290 0.61642
## BedroomAbvGr 0.01298 0.1274 0.50290 1.00000 0.67662
## TotRmsAbvGrd -0.05758 0.4095 0.61642 0.67662 1.00000
## SalePrice -0.07786 0.6059 0.31933 0.16821 0.53372
## SalePrice
## OverallCond -0.07786
## X1stFlrSF 0.60585
## X2ndFlrSF 0.31933
## BedroomAbvGr 0.16821
## TotRmsAbvGrd 0.53372
## SalePrice 1.00000
Correlation testing between selected pairs to test the hypothesis: “The correlation between each pairwise set of variables is 0”.
To start, the plot above shows that most of the correlations between pairs are not in fact 0. But let us check each pairing.
##
## Pearson's product-moment correlation
##
## data: BedroomAbvGr and OverallCond
## t = 0.5, df = 1458, p-value = 0.6
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.02059 0.04652
## sample estimates:
## cor
## 0.01298
##
## Pearson's product-moment correlation
##
## data: BedroomAbvGr and X1stFlrSF
## t = 4.9, df = 1458, p-value = 0.000001
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.09424 0.16028
## sample estimates:
## cor
## 0.1274
##
## Pearson's product-moment correlation
##
## data: BedroomAbvGr and X2ndFlrSF
## t = 22, df = 1458, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4774 0.5276
## sample estimates:
## cor
## 0.5029
##
## Pearson's product-moment correlation
##
## data: BedroomAbvGr and TotRmsAbvGrd
## t = 35, df = 1458, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6580 0.6944
## sample estimates:
## cor
## 0.6766
##
## Pearson's product-moment correlation
##
## data: BedroomAbvGr and SalePrice
## t = 6.5, df = 1458, p-value = 0.0000000001
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.1354 0.2006
## sample estimates:
## cor
## 0.1682
##
## Pearson's product-moment correlation
##
## data: OverallCond and X1stFlrSF
## t = -5.6, df = 1458, p-value = 0.00000003
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.1769 -0.1112
## sample estimates:
## cor
## -0.1442
##
## Pearson's product-moment correlation
##
## data: OverallCond and X2ndFlrSF
## t = 1.1, df = 1458, p-value = 0.3
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.004624 0.062443
## sample estimates:
## cor
## 0.02894
##
## Pearson's product-moment correlation
##
## data: OverallCond and TotRmsAbvGrd
## t = -2.2, df = 1458, p-value = 0.03
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.09097 -0.02407
## sample estimates:
## cor
## -0.05758
##
## Pearson's product-moment correlation
##
## data: OverallCond and SalePrice
## t = -3, df = 1458, p-value = 0.003
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.11113 -0.04441
## sample estimates:
## cor
## -0.07786
##
## Pearson's product-moment correlation
##
## data: X1stFlrSF and X2ndFlrSF
## t = -7.9, df = 1458, p-value = 0.000000000000005
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.2346 -0.1702
## sample estimates:
## cor
## -0.2026
##
## Pearson's product-moment correlation
##
## data: X1stFlrSF and TotRmsAbvGrd
## t = 17, df = 1458, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.3812 0.4371
## sample estimates:
## cor
## 0.4095
##
## Pearson's product-moment correlation
##
## data: X1stFlrSF and SalePrice
## t = 29, df = 1458, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5842 0.6267
## sample estimates:
## cor
## 0.6059
##
## Pearson's product-moment correlation
##
## data: X2ndFlrSF and TotRmsAbvGrd
## t = 30, df = 1458, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5952 0.6368
## sample estimates:
## cor
## 0.6164
##
## Pearson's product-moment correlation
##
## data: X2ndFlrSF and SalePrice
## t = 13, df = 1458, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2889 0.3492
## sample estimates:
## cor
## 0.3193
##
## Pearson's product-moment correlation
##
## data: TotRmsAbvGrd and SalePrice
## t = 24, df = 1458, p-value <0.0000000000000002
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.5093 0.5573
## sample estimates:
## cor
## 0.5337
Thirteen of the fifteen tests reject the null hypothesis of zero correlation at the 20% significance level implied by the 80% confidence intervals. This makes sense, since each variable was collected from the same homes and they are related in some way. The two exceptions are BedroomAbvGr with OverallCond (p-value = 0.6) and OverallCond with X2ndFlrSF (p-value = 0.3), whose 80% confidence intervals contain 0, so for these pairs we fail to reject the null hypothesis. Because fifteen tests are run at such a lenient significance level, the chance of at least one false rejection across the family is high, and not all of the variables necessarily tie in to the rest, so I would be wary of family-wise error.
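To put a rough number on the family-wise concern, the sketch below computes the chance of at least one false rejection across 15 tests run at the 20% level, which comes to roughly 96%. It assumes the tests are independent, which is only an approximation here since the pairs share variables. It also applies a Bonferroni adjustment to a few of the rounded p-values printed above.

```r
alpha <- 0.20                 # significance level matching the 80% intervals
m     <- choose(6, 2)         # 15 pairwise tests among 6 variables

# Probability of at least one false rejection if every null were true
# (assumes independent tests, which is only an approximation here)
fwer <- 1 - (1 - alpha)^m

# Bonferroni: per-test level that keeps the family-wise rate at or below alpha
alpha_bonf <- alpha / m

# A few p-values as printed in the test output (rounded)
p_vals <- c(Bed_OverallCond = 0.6, OverallCond_X2nd = 0.3,
            OverallCond_TotRms = 0.03, OverallCond_SalePrice = 0.003)
p_adj <- p.adjust(p_vals, method = "bonferroni", n = m)

round(c(fwer = fwer, alpha_bonf = alpha_bonf), 4)
p_adj
```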
## OverallCond X1stFlrSF X2ndFlrSF BedroomAbvGr TotRmsAbvGrd
## OverallCond 1.02389 0.1724 0.02925 -0.07101 0.04221
## X1stFlrSF 0.17239 3.4116 2.47397 -0.11909 -1.85229
## X2ndFlrSF 0.02925 2.4740 3.45389 -0.36048 -2.15556
## BedroomAbvGr -0.07101 -0.1191 -0.36048 2.09611 -1.48249
## TotRmsAbvGrd 0.04221 -1.8523 -2.15556 -1.48249 4.18325
## SalePrice -0.04465 -1.8349 -1.38842 0.62039 -0.16948
## SalePrice
## OverallCond -0.04465
## X1stFlrSF -1.83487
## X2ndFlrSF -1.38842
## BedroomAbvGr 0.62039
## TotRmsAbvGrd -0.16948
## SalePrice 2.53765
## OverallCond X1stFlrSF X2ndFlrSF BedroomAbvGr TotRmsAbvGrd
## OverallCond 1 0 0 0 0
## X1stFlrSF 0 1 0 0 0
## X2ndFlrSF 0 0 1 0 0
## BedroomAbvGr 0 0 0 1 0
## TotRmsAbvGrd 0 0 0 0 1
## SalePrice 0 0 0 0 0
## SalePrice
## OverallCond 0
## X1stFlrSF 0
## X2ndFlrSF 0
## BedroomAbvGr 0
## TotRmsAbvGrd 0
## SalePrice 1
## OverallCond X1stFlrSF X2ndFlrSF BedroomAbvGr TotRmsAbvGrd
## OverallCond 1 0 0 0 0
## X1stFlrSF 0 1 0 0 0
## X2ndFlrSF 0 0 1 0 0
## BedroomAbvGr 0 0 0 1 0
## TotRmsAbvGrd 0 0 0 0 1
## SalePrice 0 0 0 0 0
## SalePrice
## OverallCond 0
## X1stFlrSF 0
## X2ndFlrSF 0
## BedroomAbvGr 0
## TotRmsAbvGrd 0
## SalePrice 1
Both products yield the identity matrix, as expected from the product of two matrices which are the inverse of one another. Below is the LU decomposition of our correlation matrix:
## Loading required package: matrixcalc
## $L
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.00000 0.0000 0.0000 0.0000 0.00000 0
## [2,] -0.14420 1.0000 0.0000 0.0000 0.00000 0
## [3,] 0.02894 -0.2027 1.0000 0.0000 0.00000 0
## [4,] 0.01298 0.1320 0.5514 1.0000 0.00000 0
## [5,] -0.05758 0.4097 0.7294 0.3454 1.00000 0
## [6,] -0.07786 0.6073 0.4610 -0.2214 0.06679 1
##
## $U
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 -0.1442 0.02894 0.012980060094550580768 -0.05758 -0.07786
## [2,] 0 0.9792 -0.19847 0.129272510195122647403 0.40121 0.59463
## [3,] 0 0.0000 0.95893 0.528726854416190938935 0.69941 0.44211
## [4,] 0 0.0000 0.00000 0.691241584946385878574 0.23877 -0.15304
## [5,] 0 0.0000 0.00000 -0.000000000000000027756 0.23970 0.01601
## [6,] 0 0.0000 0.00000 0.000000000000000001854 0.00000 0.39407
#To confirm, let us multiply the L and U matrices and see if we get our original matrix.
dec_mat$L %*% dec_mat$U
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1.00000 -0.1442 0.02894 0.01298 -0.05758 -0.07786
## [2,] -0.14420 1.0000 -0.20265 0.12740 0.40952 0.60585
## [3,] 0.02894 -0.2026 1.00000 0.50290 0.61642 0.31933
## [4,] 0.01298 0.1274 0.50290 1.00000 0.67662 0.16821
## [5,] -0.05758 0.4095 0.61642 0.67662 1.00000 0.53372
## [6,] -0.07786 0.6059 0.31933 0.16821 0.53372 1.00000
## OverallCond X1stFlrSF X2ndFlrSF BedroomAbvGr TotRmsAbvGrd
## OverallCond 1.00000 -0.1442 0.02894 0.01298 -0.05758
## X1stFlrSF -0.14420 1.0000 -0.20265 0.12740 0.40952
## X2ndFlrSF 0.02894 -0.2026 1.00000 0.50290 0.61642
## BedroomAbvGr 0.01298 0.1274 0.50290 1.00000 0.67662
## TotRmsAbvGrd -0.05758 0.4095 0.61642 0.67662 1.00000
## SalePrice -0.07786 0.6059 0.31933 0.16821 0.53372
## SalePrice
## OverallCond -0.07786
## X1stFlrSF 0.60585
## X2ndFlrSF 0.31933
## BedroomAbvGr 0.16821
## TotRmsAbvGrd 0.53372
## SalePrice 1.00000
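The inverse and LU checks above can be reproduced in base R on a smaller example. The sketch below uses the rounded 3 x 3 block of correlations for X1stFlrSF, X2ndFlrSF and SalePrice. It swaps in a hand-written Doolittle factorisation without pivoting (the hypothetical helper `lu_doolittle`) rather than the matrixcalc routine, so it illustrates the idea and does not reproduce the exact code used here.

```r
# Correlations copied (rounded) from the matrix above
A <- matrix(c( 1.0000, -0.2026, 0.6059,
              -0.2026,  1.0000, 0.3193,
               0.6059,  0.3193, 1.0000), nrow = 3, byrow = TRUE)

A_inv <- solve(A)                       # inverse (precision) matrix
I3 <- diag(3)
max(abs(A %*% A_inv - I3))              # close to 0, up to floating-point error
max(abs(A_inv %*% A - I3))

# Doolittle LU factorisation without pivoting (hypothetical helper;
# adequate for this small, well-conditioned example)
lu_doolittle <- function(M) {
  n <- nrow(M)
  L <- diag(n)
  U <- M
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]          # multiplier stored in L
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]   # eliminate entry below the pivot
    }
  }
  list(L = L, U = U)
}

dec <- lu_doolittle(A)
max(abs(dec$L %*% dec$U - A))           # close to 0 confirms A = LU
```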
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
As we saw above, the First Floor Square Footage variable [X1stFlrSF] is right-skewed so it works for this question.
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.6.3
## rate
## 0.00086012
## (0.00002251)
The calculated \(\lambda\) for the X1stFlrSF variable is approximately 0.00086.
exp_dist <- rexp(1000,rate = lambda)
hist(P2_train$X1stFlrSF, main = "Histogram of 1st Floor Sq Ft", xlab = "X1stFlrSF", col = 'red')
The \(5^{th}\) and \(95^{th}\) percentiles using the CDF of our exponential distribution are:
## [1] 59.63 3482.92
The central 95% interval (2.5th to 97.5th percentiles) based on our fitted exponential distribution is:
## [1] 29.44
## [1] 4289
Using the empirical data, we see that the 5th and 95th percentiles were in fact:
## 5% 95%
## 673 1831
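The fitted exponential places its 5th and 95th percentiles (59.63 and 3482.92) far outside the empirical ones (673 and 1831), which suggests the exponential is a poor description of this variable. The sketch below shows one way the three quantities in the question could be computed side by side, including a normal-theory 95% confidence interval for the mean. It runs on a synthetic right-skewed sample (an assumption; the real values come from `P2_train$X1stFlrSF`) and uses the closed-form maximum-likelihood rate, 1 / mean, for the exponential.

```r
set.seed(605)
# Assumption: synthetic right-skewed stand-in for P2_train$X1stFlrSF
sqft <- rgamma(1460, shape = 9, rate = 9 / 1163)

# For an exponential model the maximum-likelihood rate is 1 / sample mean
lambda_hat <- 1 / mean(sqft)

# 5th and 95th percentiles from the fitted exponential (inverse CDF)
exp_pct <- qexp(c(0.05, 0.95), rate = lambda_hat)

# Normal-theory 95% confidence interval for the mean
n <- length(sqft)
ci_mean <- mean(sqft) + c(-1, 1) * qnorm(0.975) * sd(sqft) / sqrt(n)

# Empirical 5th and 95th percentiles
emp_pct <- quantile(sqft, c(0.05, 0.95))

list(lambda = lambda_hat, exponential = exp_pct,
     normal_ci = ci_mean, empirical = emp_pct)
```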
10 points. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
The first step is to select the variables used within the model. There are a total of 80 variables, so it is doubtful that they will all make it into our final model. Let us begin by taking a subset of these 80 variables and cleaning them up by removing ‘NA’s and assigning numerical values where appropriate. We start by assigning linear numerical values to the variables where quality or condition was evaluated either on an “Excellent to Poor” scale or, in the case of the type of damage, on a sliding scale of severity. In the remaining cases, an ‘NA’ appears to correspond to a lack of the feature in question, so it can be assigned a value of 0.
library(plyr)
#Note: "Kitchen" and "KitchnQual" match no column name (the columns are KitchenAbvGr and KitchenQual), so they are silently ignored by the %in% filter
want <- c("LotArea","Neighborhood","OverallQual","OverallCond", "YearBuilt","ExterQual","BsmtQual","BsmtCond","TotalBsmtSF","HeatingQC","CentralAir","X1stFlrSF","X2ndFlrSF","FullBath","BsmtFullBath","BedroomAbvGr","Kitchen","KitchnQual","TotRmsAbvGrd","Functional","GarageArea","PoolArea","WoodDeckSF","MiscVal")
Base <- P2_train[,(names(P2_train) %in% want)]
Model <- data.frame(lapply(Base, function(x) {mapvalues(x, c("Ex","Gd","TA","Fa","Po","Typ","Min1","Min2","Mod","Maj1","Maj2","Sev","Sal"), c(5,4,3,2,1,7,6,5,4,3,2,1,0))}))
Model[,-2] <- sapply(Model[,-2],as.numeric)
Model[is.na(Model)] <- 0
#Create 2 additional variables: the total number of full baths and the combined square
#footage of the first 2 floors.
Model$TotFullBath <- Model$FullBath+Model$BsmtFullBath
Model$TotSF <- Model$X1stFlrSF + Model$X2ndFlrSF
drop <- c("FullBath","BsmtFullBath", "X1stFlrSF", "X2ndFlrSF")
Model <- Model[,!(names(Model) %in% drop)]
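One point that may be worth verifying in the recoding above: if the quality columns are stored as factors, `as.numeric()` returns the internal level codes rather than the recoded scores. That could produce an unexpected ordering, and it might be one explanation for a negative ExterQual correlation with SalePrice. The sketch below illustrates the behaviour on a hypothetical column. It uses a base-R lookup instead of `mapvalues()`, and whether the report's columns were factors is an assumption.

```r
# Hypothetical quality column stored as a factor
qual <- factor(c("Ex", "Gd", "TA", "Fa", "Gd"))

# Recode the labels to scores (levels are alphabetical: Ex, Fa, Gd, TA)
scores <- c(Ex = 5, Gd = 4, TA = 3, Fa = 2, Po = 1)
levels(qual) <- as.character(scores[levels(qual)])

codes  <- as.numeric(qual)                 # 1 3 4 2 3  (level codes, not scores)
values <- as.numeric(as.character(qual))   # 5 4 3 2 4  (the recoded scores)

rbind(codes, values)
```

Converting through `as.character()` first is the usual way to recover the numeric labels of a factor.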
Let us now center and scale (standardize) the variables to reduce the impact of variables measured on much larger scales (square footage versus number of baths, etc.). Finally, we will add the sale price back to our model data frame.
#Took approach from https://www.gastonsanchez.com/visually-enforced/how-to/2014/01/15/Center-data-in-R/
center_scale <- function(x){
scale(x, scale = TRUE)
}
Model[,-2] <- center_scale(Model[,-2])
Model$SalePrice <- P2_train$SalePrice
In this case I will start by creating a correlation matrix and then proceed to add variables using forward selection (using the correlation scores as a guide of what variable to add next).
## LotArea OverallQual OverallCond YearBuilt ExterQual
## 0.26384 0.79098 -0.07786 0.52290 -0.63688
## BsmtQual BsmtCond TotalBsmtSF HeatingQC CentralAir
## -0.43888 0.14737 0.61358 -0.40018 0.25133
## BedroomAbvGr TotRmsAbvGrd Functional GarageArea WoodDeckSF
## 0.16821 0.53372 0.11533 0.62343 0.32441
## PoolArea MiscVal TotFullBath TotSF SalePrice
## 0.09240 -0.02119 0.58293 0.71688 1.00000
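Since some of the correlations above are negative, ranking them by absolute value makes the strongest candidates easier to spot. A minimal sketch of that ranking follows, using a small synthetic data frame as an assumed stand-in for the numeric columns of `Model`.

```r
set.seed(7)
# Assumption: small synthetic data frame standing in for the numeric columns of Model
df <- data.frame(a = rnorm(200), b = rnorm(200), c = rnorm(200))
df$SalePrice <- 3 * df$a - 2 * df$b + rnorm(200)

# Correlation of every predictor with the response
cors <- cor(df)[, "SalePrice"]
cors <- cors[names(cors) != "SalePrice"]

# Rank by absolute value so that strong negative correlations are not overlooked
ranked <- sort(abs(cors), decreasing = TRUE)
names(ranked)
```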
To start our model, let’s select the 5 variables with the highest absolute correlation with SalePrice: ‘OverallQual’, ‘TotSF’, ‘ExterQual’, ‘GarageArea’ and ‘TotalBsmtSF’. We will also include the ‘Neighborhood’ variable, for which we could not calculate a correlation since it is not numeric; lm() encodes it as a set of dummy variables.
multi <- lm(SalePrice~OverallQual + TotSF + ExterQual + GarageArea + TotalBsmtSF + Neighborhood, data = Model)
summary(multi)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + TotSF + ExterQual + GarageArea +
## TotalBsmtSF + Neighborhood, data = Model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -428443 -15434 -233 13721 266435
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 168251 8542 19.70 < 0.0000000000000002 ***
## OverallQual 21051 1612 13.06 < 0.0000000000000002 ***
## TotSF 23929 1303 18.36 < 0.0000000000000002 ***
## ExterQual -9422 1315 -7.16 0.00000000000124 ***
## GarageArea 8346 1229 6.79 0.00000000001612 ***
## TotalBsmtSF 9013 1204 7.49 0.00000000000012 ***
## NeighborhoodBlueste -8382 26023 -0.32 0.7474
## NeighborhoodBrDale -17309 12304 -1.41 0.1597
## NeighborhoodBrkSide 3546 9855 0.36 0.7191
## NeighborhoodClearCr 35018 10890 3.22 0.0013 **
## NeighborhoodCollgCr 14041 8917 1.57 0.1156
## NeighborhoodCrawfor 31256 9914 3.15 0.0017 **
## NeighborhoodEdwards -4093 9411 -0.43 0.6637
## NeighborhoodGilbert 14458 9410 1.54 0.1246
## NeighborhoodIDOTRR -14731 10462 -1.41 0.1593
## NeighborhoodMeadowV -1306 12243 -0.11 0.9151
## NeighborhoodMitchel 9380 9958 0.94 0.3464
## NeighborhoodNAmes 5605 8982 0.62 0.5327
## NeighborhoodNoRidge 66639 10272 6.49 0.00000000011987 ***
## NeighborhoodNPkVill -2172 14417 -0.15 0.8803
## NeighborhoodNridgHt 57602 9439 6.10 0.00000000134249 ***
## NeighborhoodNWAmes 6591 9541 0.69 0.4898
## NeighborhoodOldTown -14068 9325 -1.51 0.1316
## NeighborhoodSawyer 9145 9627 0.95 0.3423
## NeighborhoodSawyerW 8788 9649 0.91 0.3626
## NeighborhoodSomerst 18911 9258 2.04 0.0413 *
## NeighborhoodStoneBr 66493 10955 6.07 0.00000000164078 ***
## NeighborhoodSWISU -8002 11252 -0.71 0.4771
## NeighborhoodTimber 30859 10172 3.03 0.0025 **
## NeighborhoodVeenker 47655 13440 3.55 0.0004 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34700 on 1430 degrees of freedom
## Multiple R-squared: 0.813, Adjusted R-squared: 0.81
## F-statistic: 215 on 29 and 1430 DF, p-value: <0.0000000000000002
We can immediately see that there are very low p-values for each of the five numeric variables we selected, while only some of the Neighborhood levels are significant. Let us proceed by adding additional variables.
##
## Call:
## lm(formula = SalePrice ~ OverallQual + TotSF + ExterQual + GarageArea +
## TotalBsmtSF + Neighborhood + TotFullBath + YearBuilt + TotRmsAbvGrd,
## data = Model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -413816 -14605 -135 13873 268557
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 165258.19 8661.32 19.08 < 0.0000000000000002 ***
## OverallQual 20543.19 1618.67 12.69 < 0.0000000000000002 ***
## TotSF 21065.09 2043.60 10.31 < 0.0000000000000002 ***
## ExterQual -8955.01 1314.11 -6.81 0.000000000014 ***
## GarageArea 7684.06 1228.05 6.26 0.000000000517 ***
## TotalBsmtSF 7785.07 1225.67 6.35 0.000000000286 ***
## NeighborhoodBlueste -7748.00 25793.30 -0.30 0.76392
## NeighborhoodBrDale -12134.96 12263.68 -0.99 0.32258
## NeighborhoodBrkSide 14238.58 10636.12 1.34 0.18088
## NeighborhoodClearCr 37661.27 10994.93 3.43 0.00063 ***
## NeighborhoodCollgCr 12474.08 8844.62 1.41 0.15865
## NeighborhoodCrawfor 39568.55 10496.92 3.77 0.00017 ***
## NeighborhoodEdwards -7.89 9669.07 0.00 0.99935
## NeighborhoodGilbert 11900.28 9320.52 1.28 0.20189
## NeighborhoodIDOTRR -2954.47 11259.31 -0.26 0.79305
## NeighborhoodMeadowV 412.26 12221.86 0.03 0.97310
## NeighborhoodMitchel 9065.21 9929.72 0.91 0.36143
## NeighborhoodNAmes 11269.47 9206.83 1.22 0.22114
## NeighborhoodNoRidge 66653.90 10254.97 6.50 0.000000000111 ***
## NeighborhoodNPkVill -2326.23 14354.11 -0.16 0.87128
## NeighborhoodNridgHt 56253.81 9356.18 6.01 0.000000002317 ***
## NeighborhoodNWAmes 8039.21 9564.92 0.84 0.40078
## NeighborhoodOldTown -2632.33 10401.86 -0.25 0.80026
## NeighborhoodSawyer 13311.34 9747.38 1.37 0.17227
## NeighborhoodSawyerW 7450.96 9603.94 0.78 0.43798
## NeighborhoodSomerst 17189.53 9182.57 1.87 0.06141 .
## NeighborhoodStoneBr 64982.35 10908.20 5.96 0.000000003227 ***
## NeighborhoodSWISU -312.28 12010.40 -0.03 0.97926
## NeighborhoodTimber 28967.29 10118.40 2.86 0.00426 **
## NeighborhoodVeenker 48913.76 13407.82 3.65 0.00027 ***
## TotFullBath 6294.02 1206.94 5.21 0.000000211009 ***
## YearBuilt 3702.22 2088.85 1.77 0.07655 .
## TotRmsAbvGrd 1606.95 1659.68 0.97 0.33309
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34300 on 1427 degrees of freedom
## Multiple R-squared: 0.818, Adjusted R-squared: 0.814
## F-statistic: 200 on 32 and 1427 DF, p-value: <0.0000000000000002
multi <- update(multi, .~. + BsmtQual + LotArea + BedroomAbvGr + CentralAir + Functional + BsmtCond - YearBuilt, data = Model)
summary(multi)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + TotSF + ExterQual + GarageArea +
## TotalBsmtSF + Neighborhood + TotFullBath + TotRmsAbvGrd +
## BsmtQual + LotArea + BedroomAbvGr + CentralAir + Functional +
## BsmtCond, data = Model)
##
## Residuals:
## Min 1Q Median 3Q Max
## -445285 -15028 502 13021 259444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 164852 8281 19.91 < 0.0000000000000002 ***
## OverallQual 18411 1599 11.51 < 0.0000000000000002 ***
## TotSF 20899 2008 10.41 < 0.0000000000000002 ***
## ExterQual -8301 1299 -6.39 0.000000000223 ***
## GarageArea 6112 1190 5.14 0.000000316995 ***
## TotalBsmtSF 7171 1262 5.68 0.000000016309 ***
## NeighborhoodBlueste -7352 24834 -0.30 0.76726
## NeighborhoodBrDale -4894 11815 -0.41 0.67878
## NeighborhoodBrkSide 15760 9546 1.65 0.09897 .
## NeighborhoodClearCr 28431 10752 2.64 0.00828 **
## NeighborhoodCollgCr 14162 8668 1.63 0.10252
## NeighborhoodCrawfor 37124 9546 3.89 0.00011 ***
## NeighborhoodEdwards 1051 9122 0.12 0.90829
## NeighborhoodGilbert 11600 9125 1.27 0.20387
## NeighborhoodIDOTRR -166 10167 -0.02 0.98697
## NeighborhoodMeadowV 144 11797 0.01 0.99023
## NeighborhoodMitchel 7452 9643 0.77 0.43975
## NeighborhoodNAmes 13742 8742 1.57 0.11619
## NeighborhoodNoRidge 67702 9990 6.78 0.000000000018 ***
## NeighborhoodNPkVill -1453 13794 -0.11 0.91612
## NeighborhoodNridgHt 50967 9112 5.59 0.000000026660 ***
## NeighborhoodNWAmes 10430 9249 1.13 0.25964
## NeighborhoodOldTown -3688 9018 -0.41 0.68264
## NeighborhoodSawyer 15947 9363 1.70 0.08875 .
## NeighborhoodSawyerW 7453 9337 0.80 0.42488
## NeighborhoodSomerst 19542 8964 2.18 0.02941 *
## NeighborhoodStoneBr 62714 10503 5.97 0.000000002969 ***
## NeighborhoodSWISU 6801 10960 0.62 0.53500
## NeighborhoodTimber 21094 9966 2.12 0.03446 *
## NeighborhoodVeenker 43672 12883 3.39 0.00072 ***
## TotFullBath 5808 1152 5.04 0.000000525708 ***
## TotRmsAbvGrd 5125 1849 2.77 0.00564 **
## BsmtQual -7959 1223 -6.51 0.000000000105 ***
## LotArea 5352 995 5.38 0.000000086881 ***
## BedroomAbvGr -4496 1352 -3.32 0.00091 ***
## CentralAir 3910 986 3.97 0.000076691803 ***
## Functional 4207 910 4.62 0.000004109705 ***
## BsmtCond 3663 1064 3.44 0.00059 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33000 on 1422 degrees of freedom
## Multiple R-squared: 0.832, Adjusted R-squared: 0.828
## F-statistic: 190 on 37 and 1422 DF, p-value: <0.0000000000000002
Using this model, let us set up our prediction:
TEST <- P2_test[, (names(P2_test) %in% want)]
Prediction <- data.frame(lapply(TEST, function(x) {mapvalues(x, c("Ex","Gd","TA","Fa","Po","Typ","Min1","Min2","Mod","Maj1","Maj2","Sev","Sal"), c(5,4,3,2,1,7,6,5,4,3,2,1,0))}))
Prediction[,-2] <- sapply(Prediction[,-2],as.numeric)
Prediction[is.na(Prediction)] <- 0
Prediction$TotFullBath <- Prediction$FullBath + Prediction$BsmtFullBath
Prediction$TotSF <- Prediction$X1stFlrSF + Prediction$X2ndFlrSF
Prediction[,-2] <- center_scale(Prediction[,-2])
Prediction$Predict <- predict(multi,Prediction,se.fit = FALSE)
Final <- data.frame(P2_test$Id,round(Prediction$Predict,2))
names(Final) <- c("ID", "SalePrice")
write.csv(Final,file = "DATA605_Final_kaggle.csv", quote = FALSE, row.names = FALSE)
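The recoding step above turns quality labels into ordered integers before they are fed to the linear model. A minimal base-R sketch of the same idea on a toy vector is shown here; `quality_map`, `toy_qual`, and `toy_score` are illustrative names and made-up values, not part of the original analysis, and only the five quality labels are covered (the `Functional` scale works the same way).

```r
# Hypothetical lookup table: quality label -> ordinal score (same scale as above)
quality_map <- c(Ex = 5, Gd = 4, TA = 3, Fa = 2, Po = 1)

# Toy column standing in for something like ExterQual; NA marks a missing rating
toy_qual <- c("Gd", "TA", "Ex", NA, "Fa")

# Look each label up by name; unname() drops the label names from the result
toy_score <- unname(quality_map[toy_qual])

# Missing ratings come back as NA and are set to 0, mirroring the step above
toy_score[is.na(toy_score)] <- 0

print(toy_score)  # 4 3 5 0 2
```

A named lookup vector keeps the label-to-score mapping visible in one place, which makes it easier to check that the training and test data are encoded on the same scale.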
keep <- c("MSSubClass","LotArea", "Neighborhood", "HouseStyle", "OverallQual", "OverallCond", "YearBuilt", "YearRemodAdd", "ExterQual", "BsmtQual", "TotalBsmtSF", "X1stFlrSF", "X2ndFlrSF", "GrLivArea", "BsmtFullBath", "FullBath", "HalfBath", "BedroomAbvGr", "TotRmsAbvGrd", "GarageCars", "GarageArea", "Functional")
Forest <- P2_train[, (names(P2_train) %in% keep)]
Frst_Test <- P2_test[, (names(P2_test) %in% keep)]
summary(Frst_Test)
## MSSubClass LotArea Neighborhood HouseStyle
## Min. : 20.0 Min. : 1470 NAmes :218 1.5Fin:160
## 1st Qu.: 20.0 1st Qu.: 7391 OldTown:126 1.5Unf: 5
## Median : 50.0 Median : 9399 CollgCr:117 1Story:745
## Mean : 57.4 Mean : 9819 Somerst: 96 2.5Unf: 13
## 3rd Qu.: 70.0 3rd Qu.:11518 Edwards: 94 2Story:427
## Max. :190.0 Max. :56600 NridgHt: 89 SFoyer: 46
## (Other):719 SLvl : 63
## OverallQual OverallCond YearBuilt YearRemodAdd ExterQual
## Min. : 1.00 Min. :1.00 Min. :1879 Min. :1950 Ex: 55
## 1st Qu.: 5.00 1st Qu.:5.00 1st Qu.:1953 1st Qu.:1963 Fa: 21
## Median : 6.00 Median :5.00 Median :1973 Median :1992 Gd:491
## Mean : 6.08 Mean :5.55 Mean :1971 Mean :1984 TA:892
## 3rd Qu.: 7.00 3rd Qu.:6.00 3rd Qu.:2001 3rd Qu.:2004
## Max. :10.00 Max. :9.00 Max. :2010 Max. :2010
##
## BsmtQual TotalBsmtSF X1stFlrSF X2ndFlrSF GrLivArea
## Ex :137 Min. : 0 Min. : 407 Min. : 0 Min. : 407
## Fa : 53 1st Qu.: 784 1st Qu.: 874 1st Qu.: 0 1st Qu.:1118
## Gd :591 Median : 988 Median :1079 Median : 0 Median :1432
## TA :634 Mean :1046 Mean :1157 Mean : 326 Mean :1486
## NA's: 44 3rd Qu.:1305 3rd Qu.:1382 3rd Qu.: 676 3rd Qu.:1721
## Max. :5095 Max. :5095 Max. :1862 Max. :5095
## NA's :1
## BsmtFullBath FullBath HalfBath BedroomAbvGr
## Min. :0.000 Min. :0.00 Min. :0.000 Min. :0.00
## 1st Qu.:0.000 1st Qu.:1.00 1st Qu.:0.000 1st Qu.:2.00
## Median :0.000 Median :2.00 Median :0.000 Median :3.00
## Mean :0.434 Mean :1.57 Mean :0.378 Mean :2.85
## 3rd Qu.:1.000 3rd Qu.:2.00 3rd Qu.:1.000 3rd Qu.:3.00
## Max. :3.000 Max. :4.00 Max. :2.000 Max. :6.00
## NA's :2
## TotRmsAbvGrd Functional GarageCars GarageArea
## Min. : 3.00 Typ :1357 Min. :0.00 Min. : 0
## 1st Qu.: 5.00 Min2 : 36 1st Qu.:1.00 1st Qu.: 318
## Median : 6.00 Min1 : 34 Median :2.00 Median : 480
## Mean : 6.38 Mod : 20 Mean :1.77 Mean : 473
## 3rd Qu.: 7.00 Maj1 : 5 3rd Qu.:2.00 3rd Qu.: 576
## Max. :15.00 (Other): 5 Max. :5.00 Max. :1488
## NA's : 2 NA's :1 NA's :1
The data needs cleaning: the summary of the test data shows NAs in columns that had none in the training data, so we must impute them before running the prediction. In addition, the factor levels of some variables differ between the test and training data sets. To account for this, we assign the factor levels of the training data set to the test data set.
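As a rough illustration of how the columns needing imputation could be located, the sketch below counts missing values per column on a made-up data frame. `toy_test` and its values are hypothetical stand-ins for the test set, and the zero fill mirrors the choice made for the basement and garage columns in this analysis.

```r
# Toy data frame standing in for the test set (values are made up)
toy_test <- data.frame(
  TotalBsmtSF = c(856L, NA, 920L),
  GarageCars  = c(2L, 1L, NA),
  Functional  = factor(c("Typ", NA, "Min1")),
  LotArea     = c(8450L, 9600L, 11250L)
)

# Count missing values per column and keep only the columns that have any
na_counts <- colSums(is.na(toy_test))
print(na_counts[na_counts > 0])

# Fill a numeric gap with 0, read here as "no basement"
toy_test$TotalBsmtSF[is.na(toy_test$TotalBsmtSF)] <- 0L
```

Running the same count on the training data makes it easy to confirm which columns are incomplete only in the test set.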
## Warning: package 'forcats' was built under R version 3.6.3
## Loading required package: tidyr
Forest <- transform(
  Forest,
  OverallQual = as.factor(OverallQual),
  OverallCond = as.factor(OverallCond),
  BsmtQual = fct_explicit_na(BsmtQual, na_level = "None")
)
Frst_Test <- transform(
  Frst_Test,
  OverallQual = as.factor(OverallQual),
  OverallCond = as.factor(OverallCond),
  BsmtQual = fct_explicit_na(BsmtQual, na_level = "None")
)
Frst_Test$TotalBsmtSF[is.na(Frst_Test$TotalBsmtSF)] <- 0L
Frst_Test$BsmtFullBath[is.na(Frst_Test$BsmtFullBath)] <- 0L
Frst_Test$Functional[is.na(Frst_Test$Functional)] <- "Typ"
Frst_Test$GarageCars[is.na(Frst_Test$GarageCars)] <- 0L
Frst_Test$GarageArea[is.na(Frst_Test$GarageArea)] <- 0L
levels(Frst_Test$Neighborhood) <- levels(Forest$Neighborhood)
levels(Frst_Test$HouseStyle) <- levels(Forest$HouseStyle)
levels(Frst_Test$OverallQual) <- levels(Forest$OverallQual)
levels(Frst_Test$OverallCond) <- levels(Forest$OverallCond)
levels(Frst_Test$ExterQual) <- levels(Forest$ExterQual)
levels(Frst_Test$BsmtQual) <- levels(Forest$BsmtQual)
levels(Frst_Test$Functional) <- levels(Forest$Functional)
summary(Frst_Test)
## MSSubClass LotArea Neighborhood HouseStyle OverallQual
## Min. : 20.0 Min. : 1470 NAmes :218 1Story :745 5 :428
## 1st Qu.: 20.0 1st Qu.: 7391 OldTown:126 2.5Unf :427 6 :357
## Median : 50.0 Median : 9399 CollgCr:117 1.5Fin :160 7 :281
## Mean : 57.4 Mean : 9819 Somerst: 96 SFoyer : 63 8 :174
## 3rd Qu.: 70.0 3rd Qu.:11518 Edwards: 94 2Story : 46 4 :110
## Max. :190.0 Max. :56600 NridgHt: 89 2.5Fin : 13 9 : 64
## (Other):719 (Other): 5 (Other): 45
## OverallCond YearBuilt YearRemodAdd ExterQual BsmtQual
## 5 :824 Min. :1879 Min. :1950 Ex: 55 Ex :137
## 6 :279 1st Qu.:1953 1st Qu.:1963 Fa: 21 Fa : 53
## 7 :185 Median :1973 Median :1992 Gd:491 Gd :591
## 8 : 72 Mean :1971 Mean :1984 TA:892 TA :634
## 4 : 44 3rd Qu.:2001 3rd Qu.:2004 None: 44
## 3 : 25 Max. :2010 Max. :2010
## (Other): 30
## TotalBsmtSF X1stFlrSF X2ndFlrSF GrLivArea
## Min. : 0 Min. : 407 Min. : 0 Min. : 407
## 1st Qu.: 784 1st Qu.: 874 1st Qu.: 0 1st Qu.:1118
## Median : 988 Median :1079 Median : 0 Median :1432
## Mean :1045 Mean :1157 Mean : 326 Mean :1486
## 3rd Qu.:1304 3rd Qu.:1382 3rd Qu.: 676 3rd Qu.:1721
## Max. :5095 Max. :5095 Max. :1862 Max. :5095
##
## BsmtFullBath FullBath HalfBath BedroomAbvGr
## Min. :0.000 Min. :0.00 Min. :0.000 Min. :0.00
## 1st Qu.:0.000 1st Qu.:1.00 1st Qu.:0.000 1st Qu.:2.00
## Median :0.000 Median :2.00 Median :0.000 Median :3.00
## Mean :0.434 Mean :1.57 Mean :0.378 Mean :2.85
## 3rd Qu.:1.000 3rd Qu.:2.00 3rd Qu.:1.000 3rd Qu.:3.00
## Max. :3.000 Max. :4.00 Max. :2.000 Max. :6.00
##
## TotRmsAbvGrd Functional GarageCars GarageArea
## Min. : 3.00 Maj1: 5 Min. :0.00 Min. : 0
## 1st Qu.: 5.00 Maj2: 4 1st Qu.:1.00 1st Qu.: 318
## Median : 6.00 Min1: 34 Median :2.00 Median : 480
## Mean : 6.38 Min2: 36 Mean :1.76 Mean : 472
## 3rd Qu.: 7.00 Mod : 20 3rd Qu.:2.00 3rd Qu.: 576
## Max. :15.00 Sev : 1 Max. :5.00 Max. :1488
## Typ :1359
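One thing worth checking in the two summaries: the `HouseStyle` counts appear under different labels after the level assignment (for example, 427 homes listed as `2Story` before are listed as `2.5Unf` afterwards). This is consistent with how `levels(x) <- value` works in R: it renames levels by position rather than matching them by label, so labels can shift when the two level sets differ. The sketch below shows the effect on made-up toy factors, together with a label-based alternative using `factor(..., levels = ...)`. All names and values here are illustrative and are not taken from the original data.

```r
# Toy factors (made-up values): the test factor lacks the "2.5Fin" level
train_lv <- c("1Story", "2.5Fin", "2Story", "SLvl")
train_style <- factor(c("1Story", "2.5Fin", "2Story", "SLvl"), levels = train_lv)
test_style  <- factor(c("1Story", "2Story", "2Story", "SLvl"),
                      levels = c("1Story", "2Story", "SLvl"))

# Positional assignment: the i-th old level is renamed to the i-th new level
shifted <- test_style
levels(shifted) <- levels(train_style)
print(as.character(shifted))   # "1Story" "2.5Fin" "2.5Fin" "2Story"

# Label-based alignment: values keep their labels, unseen levels are added
aligned <- factor(as.character(test_style), levels = levels(train_style))
print(as.character(aligned))   # "1Story" "2Story" "2Story" "SLvl"
```

With the label-based form, a test value whose label is absent from the training levels becomes `NA`, which makes the mismatch visible instead of silently relabelling it.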
## Loading required package: randomForest
## Warning: package 'randomForest' was built under R version 3.6.3
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
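My username on Kaggle.com is ‘mishakollontai’.
Upon submission of the multiple linear regression model, I received a score of 0.17969.
The Random Forest model received a score of 0.15722, about 12.5% lower than the linear model's. Since the score measures prediction error, lower is better, so the Random Forest is the more accurate of the two models.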
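For context on how to read these scores: assuming this is Kaggle's House Prices competition, submissions are evaluated by the root mean squared error between the logarithm of the predicted price and the logarithm of the observed price. A minimal sketch of that metric on made-up prices follows; `rmsle` is a hypothetical helper name and the toy values are not from the competition data.

```r
# Hypothetical helper: RMSE computed on log prices
rmsle <- function(predicted, actual) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

# Made-up sale prices and predictions, for illustration only
toy_actual <- c(200000, 150000, 320000)
toy_pred   <- c(210000, 140000, 300000)

print(rmsle(toy_pred, toy_actual))    # small positive error
print(rmsle(toy_actual, toy_actual))  # 0 for a perfect prediction
```

Because the error is taken on the log scale, a given percentage miss counts about the same for an inexpensive house as for an expensive one.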