Using R, generate a random variable \(X\) that has 10,000 random uniform numbers from 1 to \(N\), where \(N\) can be any number of your choosing greater than or equal to 6. Then generate a random variable \(Y\) that has 10,000 random normal numbers with a mean of \(\mu = (N+1)/2\) and a standard deviation of \(\sigma = (N+1)/2\).
The code below generates the random variables \(X\) and \(Y\). I set a seed to get consistent answers every time I run the code.
set.seed(100)
X = runif(10000,1,6)
Y = rnorm(10000,(6+1)/2,(6+1)/2)
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
\(\Large P(X>x | X > y)\)
The way that this probability is interpreted is as follows:
The probability that random variable X is greater than the median of X given that random variable X is greater than the 1st quartile of Y.
The code below divides the number of X values that are greater than the median of X (and greater than the first quartile of Y) by the number of X values that are greater than the first quartile of Y.
The resulting probability is \(0.5127\).
sum(X > median(X) & X > quantile(Y,0.25))/sum(X > quantile(Y,0.25))
## [1] 0.5127153
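As a sanity check, the same probability can be computed from the theoretical distributions, since X is uniform on (1, 6) and Y is normal with mean and standard deviation 3.5. The median of X (3.5) exceeds the first quartile of Y (about 1.14), so \(P(X>x \mid X>y) = P(X>x)/P(X>y)\):
# Theoretical check: P(X > 3.5) / P(X > Q1 of Y) for X ~ Unif(1,6), Y ~ N(3.5, 3.5)
y_theo = qnorm(0.25, mean = 3.5, sd = 3.5)          # true 1st quartile of Y, ~1.14
(1 - punif(3.5, 1, 6))/(1 - punif(y_theo, 1, 6))    # ~0.514, close to 0.5127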
\(\Large P(X>x, Y>y)\)
The way that this probability is interpreted is as follows:
The probability that random variable X is greater than the median of X and that random variable Y is greater than the first quartile of Y.
Given that these are two independent random variables, I can multiply \(P(X>x)\) with \(P(Y>y)\) to get \(P(X>x, Y>y)\).
By definition, half of the values in a distribution lie above its median, so \(P(X>x) = 0.5\); similarly, 75% of the values lie above the first quartile, so \(P(Y>y) = 0.75\). Multiplying these two probabilities together gives \(0.5 \times 0.75 = 0.375\).
The code below computes these probabilities explicitly from the simulated distributions themselves.
sum(X > median(X))/length(X) * sum(Y > quantile(Y,0.25))/length(Y)
## [1] 0.375
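The joint probability can also be estimated directly from the simulated pairs by counting both events together:
# Direct empirical estimate of P(X > x, Y > y):
mean(X > median(X) & Y > quantile(Y, 0.25))   # ~0.3755, close to 0.375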
\(\Large P(X<x|X>y)\)
The way that this probability is interpreted is as follows:
The probability that a random variable X is less than the median of X given that random variable X is greater than the first quartile of Y.
The code below finds the number of observations of X that are less than the median of X and greater than the first quartile of Y, and then divides that count by the number of observations of X that are greater than the first quartile of Y. The probability is \(0.4873\), which is the complement of part (a): \(1 - 0.5127 = 0.4873\).
sum(X < median(X) & X > quantile(Y,0.25))/sum(X > quantile(Y,0.25))
## [1] 0.4872847
Investigate whether \(P(X>x \text{ and } Y>y) = P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities.
The code below builds a table of the marginal and joint probabilities. The first few lines count the observations in each joint cell. The next line combines the joint counts and marginal totals into one table. The remaining lines rename the rows and columns and display the probabilities.
one = sum(X > median(X) & Y > quantile(Y,0.25))
two = sum(X < median(X) & Y > quantile(Y,0.25))
three = sum(X > median(X) & Y < quantile(Y,0.25))
four = sum(X < median(X) & Y < quantile(Y,0.25))
matrix_1d = matrix(c(one,two,one+two,three,four,three+four,one+three,two+four,one+two+three+four),byrow = F,nrow=3)
rownames(matrix_1d) = c("X > median","X < median","margin")
colnames(matrix_1d) = c("Y > q1","Y < q1", "margin")
matrix_1d/10000
## Y > q1 Y < q1 margin
## X > median 0.3755 0.1245 0.5
## X < median 0.3745 0.1255 0.5
## margin 0.7500 0.2500 1.0
According to the table, \(P(X>x \text{ and } Y>y) = 0.3755\), while \(P(X>x)P(Y>y) = 0.5 \times 0.75 = 0.375\).
\(0.3755\) is very close to \(0.375\), which suggests that \(P(X>x \text{ and } Y>y) = P(X>x)P(Y>y)\). In other words, there is not enough evidence to conclude that the two probabilities are not equal.
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
matrix_1e = matrix(c(one,two,three,four),byrow = F,nrow = 2)
rownames(matrix_1e) = c("X > median","X < median")
colnames(matrix_1e) = c("Y > q1","Y < q1")
matrix_1e
## Y > q1 Y < q1
## X > median 3755 1245
## X < median 3745 1255
fisher.test(matrix_1e)
##
## Fisher's Exact Test for Count Data
##
## data: matrix_1e
## p-value = 0.8354
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.9222661 1.1076494
## sample estimates:
## odds ratio
## 1.010724
chisq.test(matrix_1e)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: matrix_1e
## X-squared = 0.0432, df = 1, p-value = 0.8353
According to both tests, the p-values are approximately 0.835, so the null hypothesis of independence is not rejected. There is not enough evidence to conclude that the observed proportions differ from what would be expected if the events were independent.
The difference between the two tests is as follows:
Fisher's exact test determines whether the proportions of one event differ depending on the outcome of another event. It computes the p-value exactly, using the hypergeometric distribution and conditioning on the table margins. Because this involves many factorial calculations, it can become computationally expensive when the number of observations is large, although with R the computation here is not an issue.
The chi-squared test compares the observed cell counts to the counts expected if the events were independent. A chi-squared statistic is calculated from the observed and expected values and compared to the chi-squared distribution to find an approximate p-value; the approximation is accurate when the expected cell counts are large.
Since the sample size here is so large, the exact computation of Fisher's test offers no advantage, while the chi-squared approximation is excellent. The chi-squared test is therefore the more appropriate choice for this table.
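To make the comparison concrete, the expected counts under independence can be extracted from the chi-squared test object:
# Expected counts under independence: (row total * column total) / grand total
chisq.test(matrix_1e)$expected   # 3750 and 1250 in each row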
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. Do the following:
wd = "C:/Users/super_000/Desktop/Data 605 Math/Final/house-prices-advanced-regression-techniques"
setwd(wd)
train = read.csv("train.csv",stringsAsFactors = F)
test = read.csv("test.csv",stringsAsFactors = F)
Provide univariate descriptive statistics and appropriate plots for the training data set.
Given that there are 81 variables in the training data set, plotting every single one would be tedious. Instead, I provide summary statistics and frequency counts for each variable. In the code below, an lapply call iterates over each column and returns a frequency table for categorical (character) variables and a numeric summary (minimum, quartiles, mean, maximum) for numeric variables.
descriptive_statistics = lapply(1:ncol(train), function(x){
  if(class(train[,x]) == "character"){
    return(table(train[,x]))      # frequency table for categorical variables
  }
  else{
    return(summary(train[,x]))    # numeric summary for quantitative variables
  }
})
names(descriptive_statistics) = colnames(train)
descriptive_statistics
## $Id
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 365.8 730.5 730.5 1095.2 1460.0
##
## $MSSubClass
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.0 20.0 50.0 56.9 70.0 190.0
##
## $MSZoning
##
## C (all) FV RH RL RM
## 10 65 16 1151 218
##
## $LotFrontage
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 21.00 59.00 69.00 70.05 80.00 313.00 259
##
## $LotArea
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
##
## $Street
##
## Grvl Pave
## 6 1454
##
## $Alley
##
## Grvl Pave
## 50 41
##
## $LotShape
##
## IR1 IR2 IR3 Reg
## 484 41 10 925
##
## $LandContour
##
## Bnk HLS Low Lvl
## 63 50 36 1311
##
## $Utilities
##
## AllPub NoSeWa
## 1459 1
##
## $LotConfig
##
## Corner CulDSac FR2 FR3 Inside
## 263 94 47 4 1052
##
## $LandSlope
##
## Gtl Mod Sev
## 1382 65 13
##
## $Neighborhood
##
## Blmngtn Blueste BrDale BrkSide ClearCr CollgCr Crawfor Edwards Gilbert
## 17 2 16 58 28 150 51 100 79
## IDOTRR MeadowV Mitchel NAmes NoRidge NPkVill NridgHt NWAmes OldTown
## 37 17 49 225 41 9 77 73 113
## Sawyer SawyerW Somerst StoneBr SWISU Timber Veenker
## 74 59 86 25 25 38 11
##
## $Condition1
##
## Artery Feedr Norm PosA PosN RRAe RRAn RRNe RRNn
## 48 81 1260 8 19 11 26 2 5
##
## $Condition2
##
## Artery Feedr Norm PosA PosN RRAe RRAn RRNn
## 2 6 1445 1 2 1 1 2
##
## $BldgType
##
## 1Fam 2fmCon Duplex Twnhs TwnhsE
## 1220 31 52 43 114
##
## $HouseStyle
##
## 1.5Fin 1.5Unf 1Story 2.5Fin 2.5Unf 2Story SFoyer SLvl
## 154 14 726 8 11 445 37 65
##
## $OverallQual
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 6.000 6.099 7.000 10.000
##
## $OverallCond
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 5.000 5.575 6.000 9.000
##
## $YearBuilt
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1872 1954 1973 1971 2000 2010
##
## $YearRemodAdd
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1950 1967 1994 1985 2004 2010
##
## $RoofStyle
##
## Flat Gable Gambrel Hip Mansard Shed
## 13 1141 11 286 7 2
##
## $RoofMatl
##
## ClyTile CompShg Membran Metal Roll Tar&Grv WdShake WdShngl
## 1 1434 1 1 1 11 5 6
##
## $Exterior1st
##
## AsbShng AsphShn BrkComm BrkFace CBlock CemntBd HdBoard ImStucc MetalSd
## 20 1 2 50 1 61 222 1 220
## Plywood Stone Stucco VinylSd Wd Sdng WdShing
## 108 2 25 515 206 26
##
## $Exterior2nd
##
## AsbShng AsphShn Brk Cmn BrkFace CBlock CmentBd HdBoard ImStucc MetalSd
## 20 3 7 25 1 60 207 10 214
## Other Plywood Stone Stucco VinylSd Wd Sdng Wd Shng
## 1 142 5 26 504 197 38
##
## $MasVnrType
##
## BrkCmn BrkFace None Stone
## 15 445 864 128
##
## $MasVnrArea
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 0.0 0.0 103.7 166.0 1600.0 8
##
## $ExterQual
##
## Ex Fa Gd TA
## 52 14 488 906
##
## $ExterCond
##
## Ex Fa Gd Po TA
## 3 28 146 1 1282
##
## $Foundation
##
## BrkTil CBlock PConc Slab Stone Wood
## 146 634 647 24 6 3
##
## $BsmtQual
##
## Ex Fa Gd TA
## 121 35 618 649
##
## $BsmtCond
##
## Fa Gd Po TA
## 45 65 2 1311
##
## $BsmtExposure
##
## Av Gd Mn No
## 221 134 114 953
##
## $BsmtFinType1
##
## ALQ BLQ GLQ LwQ Rec Unf
## 220 148 418 74 133 430
##
## $BsmtFinSF1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 0.0 383.5 443.6 712.2 5644.0
##
## $BsmtFinType2
##
## ALQ BLQ GLQ LwQ Rec Unf
## 19 33 14 46 54 1256
##
## $BsmtFinSF2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 46.55 0.00 1474.00
##
## $BsmtUnfSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 223.0 477.5 567.2 808.0 2336.0
##
## $TotalBsmtSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 795.8 991.5 1057.4 1298.2 6110.0
##
## $Heating
##
## Floor GasA GasW Grav OthW Wall
## 1 1428 18 7 2 4
##
## $HeatingQC
##
## Ex Fa Gd Po TA
## 741 49 241 1 428
##
## $CentralAir
##
## N Y
## 95 1365
##
## $Electrical
##
## FuseA FuseF FuseP Mix SBrkr
## 94 27 3 1 1334
##
## $X1stFlrSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 882 1087 1163 1391 4692
##
## $X2ndFlrSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 347 728 2065
##
## $LowQualFinSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 5.845 0.000 572.000
##
## $GrLivArea
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1130 1464 1515 1777 5642
##
## $BsmtFullBath
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.4253 1.0000 3.0000
##
## $BsmtHalfBath
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05753 0.00000 2.00000
##
## $FullBath
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 1.565 2.000 3.000
##
## $HalfBath
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.3829 1.0000 2.0000
##
## $BedroomAbvGr
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.000 3.000 2.866 3.000 8.000
##
## $KitchenAbvGr
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 1.047 1.000 3.000
##
## $KitchenQual
##
## Ex Fa Gd TA
## 100 39 586 735
##
## $TotRmsAbvGrd
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 5.000 6.000 6.518 7.000 14.000
##
## $Functional
##
## Maj1 Maj2 Min1 Min2 Mod Sev Typ
## 14 5 31 34 15 1 1360
##
## $Fireplaces
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 0.613 1.000 3.000
##
## $FireplaceQu
##
## Ex Fa Gd Po TA
## 24 33 380 20 313
##
## $GarageType
##
## 2Types Attchd Basment BuiltIn CarPort Detchd
## 6 870 19 88 9 387
##
## $GarageYrBlt
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1900 1961 1980 1979 2002 2010 81
##
## $GarageFinish
##
## Fin RFn Unf
## 352 422 605
##
## $GarageCars
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 1.767 2.000 4.000
##
## $GarageArea
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 334.5 480.0 473.0 576.0 1418.0
##
## $GarageQual
##
## Ex Fa Gd Po TA
## 3 48 14 3 1311
##
## $GarageCond
##
## Ex Fa Gd Po TA
## 2 35 9 7 1326
##
## $PavedDrive
##
## N P Y
## 90 30 1340
##
## $WoodDeckSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 94.24 168.00 857.00
##
## $OpenPorchSF
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 25.00 46.66 68.00 547.00
##
## $EnclosedPorch
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 21.95 0.00 552.00
##
## $X3SsnPorch
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 3.41 0.00 508.00
##
## $ScreenPorch
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 15.06 0.00 480.00
##
## $PoolArea
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 2.759 0.000 738.000
##
## $PoolQC
##
## Ex Fa Gd
## 2 2 3
##
## $Fence
##
## GdPrv GdWo MnPrv MnWw
## 59 54 157 11
##
## $MiscFeature
##
## Gar2 Othr Shed TenC
## 2 2 49 1
##
## $MiscVal
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 0.00 43.49 0.00 15500.00
##
## $MoSold
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.000 6.000 6.322 8.000 12.000
##
## $YrSold
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2006 2007 2008 2008 2009 2010
##
## $SaleType
##
## COD Con ConLD ConLI ConLw CWD New Oth WD
## 43 2 9 5 5 4 122 3 1267
##
## $SaleCondition
##
## Abnorml AdjLand Alloca Family Normal Partial
## 101 4 12 20 1198 125
##
## $SalePrice
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 129975 163000 180921 214000 755000
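The summaries above could be supplemented with a few quick plots; a minimal sketch, with the variables chosen purely for illustration:
# Histograms for a few key numeric variables (illustrative choice):
par(mfrow = c(1, 3))
hist(train$LotArea, main = "LotArea")
hist(train$GrLivArea, main = "GrLivArea")
hist(train$SalePrice, main = "SalePrice")
par(mfrow = c(1, 1))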
Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.
The code below creates a correlogram of MSSubClass, LotFrontage, and SalePrice, with a scatterplot matrix in the lower triangle and shaded correlation panels in the upper triangle. There appears to be a moderate positive correlation between LotFrontage and SalePrice.
library(corrgram)
## Registered S3 method overwritten by 'seriation':
## method from
## reorder.hclust gclus
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
corrgram_df = train %>%
dplyr::select(MSSubClass,LotFrontage,SalePrice)
corrgram(corrgram_df, lower.panel = panel.pts,upper.panel = panel.shade)
Derive a correlation matrix for any three quantitative variables in the dataset.
The code below creates a correlation matrix between LotArea, OverallQual, and OverallCond.
cormat_df = train %>%
dplyr::select(LotArea,OverallQual,OverallCond)
cormat = cor(cormat_df)
cormat
## LotArea OverallQual OverallCond
## LotArea 1.00000000 0.10580574 -0.00563627
## OverallQual 0.10580574 1.00000000 -0.09193234
## OverallCond -0.00563627 -0.09193234 1.00000000
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.
cor.test(train$LotArea,train$SalePrice,method="pearson",conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train$LotArea and train$SalePrice
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.2323391 0.2947946
## sample estimates:
## cor
## 0.2638434
Given that the p-value is nearly zero for the correlation between Lot Area and Sale Price, we have enough evidence to reject the null hypothesis that the correlation is zero; with 80% confidence, the true correlation lies between 0.232 and 0.295.
cor.test(train$OverallQual,train$SalePrice,method="pearson",conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train$OverallQual and train$SalePrice
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.7780752 0.8032204
## sample estimates:
## cor
## 0.7909816
Given that the p-value is nearly zero for the correlation between Overall Quality and Sale Price, we have enough evidence to reject the null hypothesis and conclude with 80% confidence that the true correlation is between 0.778 and 0.803.
cor.test(train$OverallCond,train$SalePrice,method="pearson",conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train$OverallCond and train$SalePrice
## t = -2.9819, df = 1458, p-value = 0.002912
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.1111272 -0.0444103
## sample estimates:
## cor
## -0.07785589
The p-value here (0.0029) is larger than in the previous two tests but still well below common significance levels, so we have enough evidence to reject the null hypothesis; with 80% confidence, the true correlation between Overall Condition and Sale Price lies between -0.111 and -0.044.
Discuss the meaning of your analysis.
Through this analysis, I was able to uncover a few things. In the first part of the problem, I looked at the descriptive statistics for each variable: for numeric variables, a summary consisting of the minimum, 1st quartile, median, mean, 3rd quartile, and maximum; for categorical variables, a count of each category.
After looking at those summary statistics, I made a scatterplot matrix using the corrgram library that plotted MSSubClass and LotFrontage against SalePrice. The correlogram displays the pairwise correlations between the variables, with blue indicating a positive correlation and red indicating a negative correlation; the darker the color, the stronger the correlation.
From there, I looked at three other variables: Lot Area, Overall Condition, and Overall Quality. I used R to calculate a correlation matrix between these three variables and found that none of the three variables had strong correlations with each other given that their correlations were close to zero.
Finally, I tested the pairwise correlations between the three variables and Sale Price. All of the correlations were statistically significant, since none of the confidence intervals included zero. However, in terms of practical significance, the correlations for Lot Area and Overall Condition are so low that they may not be worth including in the final model.
Would you be worried about familywise error? Why or why not?
1 - ((1 - 0.2)^3)
## [1] 0.488
At a confidence level of 80%, the chance of a familywise error is high even for a small number of tests. With \(\alpha = 0.2\), a single test already has a 20% chance of a type I error, and performing three tests raises the chance of at least one type I error to \(1 - 0.8^3 = 48.8\%\). So yes, I would be worried. Familywise error would be less of a problem at a stricter (smaller) significance level, e.g. \(\alpha = 0.0001\); at the current level, it is very likely to occur.
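If the familywise error rate is a concern, one standard remedy is to adjust the p-values, for example with a Bonferroni correction; a minimal sketch using the p-values reported above:
# Bonferroni adjustment multiplies each p-value by the number of tests (capped at 1):
pvals = c(2.2e-16, 2.2e-16, 0.002912)   # p-values (upper bounds) from the tests above
p.adjust(pvals, method = "bonferroni")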
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
cormat_inv = solve(cormat)
cormat_inv
## LotArea OverallQual OverallCond
## LotArea 1.011338860 -0.10738903 -0.004172346
## OverallQual -0.107389032 1.01992670 0.093158977
## OverallCond -0.004172346 0.09315898 1.008540807
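As a check on the note that the diagonal contains variance inflation factors, each diagonal entry should equal \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing variable \(j\) on the other two:
# VIF check for LotArea: the precision matrix diagonal should equal 1/(1 - R^2).
r2 = summary(lm(LotArea ~ OverallQual + OverallCond, data = cormat_df))$r.squared
1/(1 - r2)   # ~1.0113, matching cormat_inv[1,1]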
Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.
round(cormat %*% cormat_inv,5)
## LotArea OverallQual OverallCond
## LotArea 1 0 0
## OverallQual 0 1 0
## OverallCond 0 0 1
round(cormat_inv %*% cormat,5)
## LotArea OverallQual OverallCond
## LotArea 1 0 0
## OverallQual 0 1 0
## OverallCond 0 0 1
Conduct LU decomposition on the matrix.
# Performs one step of Gaussian elimination on column `colnum`, recording the
# multipliers in L and the partially reduced matrix in U.
row_reducer = function(A, L, colnum, original){
  if(A[colnum, colnum] == 0){
    # Zero pivot: accept the factorization only if L %*% A already
    # reproduces the original matrix.
    if(identical(L %*% A, original)){
      return(list(L = L, U = A, A = original))
    }
    else{
      return("No Solution")
    }
  }
  else{
    pivot = A[colnum, colnum]
    rowx = A[colnum, ]
    for(x in seq(colnum + 1, nrow(A))){
      L[x, colnum] = A[x, colnum]/pivot                   # store the multiplier in L
      A[x, ] = ((-A[x, colnum]/pivot) * rowx) + A[x, ]    # eliminate the entry below the pivot
    }
  }
  return(list(L = L, U = A, A = original))
}
LU = function(A){
  original = A
  print("Initial Matrix")
  print(A)
  L = diag(nrow(A))   # start L as the identity matrix
  if(A[1, 1] == 0){
    return("No Solution")
  }
  else{
    output = row_reducer(A, L, 1, original)
    if(ncol(A) == 2){
      return(output)
    }
    else{
      # Eliminate the remaining columns one at a time.
      for(x in seq(2, ncol(A) - 1)){
        if(output[[1]][1] == "No Solution"){
          return("No Solution")
        }
        else{
          output = row_reducer(output$U, output$L, x, original)
        }
      }
    }
    return(output)
  }
}
LU(cormat)
## [1] "Initial Matrix"
## LotArea OverallQual OverallCond
## LotArea 1.00000000 0.10580574 -0.00563627
## OverallQual 0.10580574 1.00000000 -0.09193234
## OverallCond -0.00563627 -0.09193234 1.00000000
## $L
## [,1] [,2] [,3]
## [1,] 1.00000000 0.00000000 0
## [2,] 0.10580574 1.00000000 0
## [3,] -0.00563627 -0.09237006 1
##
## $U
## LotArea OverallQual OverallCond
## LotArea 1 0.1058057 -0.00563627
## OverallQual 0 0.9888051 -0.09133599
## OverallCond 0 0.0000000 0.99153152
##
## $A
## LotArea OverallQual OverallCond
## LotArea 1.00000000 0.10580574 -0.00563627
## OverallQual 0.10580574 1.00000000 -0.09193234
## OverallCond -0.00563627 -0.09193234 1.00000000
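A quick check confirms that the factorization reproduces the original matrix:
# L %*% U should recover the original correlation matrix:
decomp = LU(cormat)
all.equal(decomp$L %*% decomp$U, cormat, check.attributes = FALSE)   # TRUE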
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.
I chose LotArea, which the histogram below shows is clearly skewed to the right; its minimum value (1,300) is already above zero, so no shift is needed.
hist(train$LotArea)
Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
LotArea_exp = fitdistr(train$LotArea,densfun = "exponential")
LotArea_exp
## rate
## 9.508570e-05
## (2.488507e-06)
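As a sanity check, the maximum-likelihood estimate of an exponential rate is simply the reciprocal of the sample mean:
# The MLE of the exponential rate parameter is 1 / sample mean:
1/mean(train$LotArea)   # ~9.509e-05, matching the fitdistr estimate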
Find the optimal value of lambda for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000,lambda)).
lambda = LotArea_exp$estimate
samples = rexp(1000,lambda)
Plot a histogram and compare it with a histogram of your original variable.
The histograms look quite different. The exponential samples decline steadily from zero with a long, gradually thinning tail, while the real data are concentrated in a narrow range near the low end and drop off very sharply as lot area increases.
hist(samples, breaks = 50)
hist(train$LotArea, breaks = 50)
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).
quantile(samples,probs = c(0.05,0.95))
## 5% 95%
## 533.3104 30457.0813
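Strictly speaking, the percentiles from the fitted CDF come directly from qexp rather than from the 1,000 samples; they are close to the sampled values above:
# Theoretical 5th and 95th percentiles from the fitted exponential CDF:
qexp(c(0.05, 0.95), rate = lambda)   # roughly 539 and 31506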
Also generate a 95% confidence interval from the empirical data, assuming normality.
LA = train$LotArea
normal_dist = rnorm(length(LA),mean(LA),sd(LA))
quantile(normal_dist, probs = c(0.05,0.95))
## 5% 95%
## -6037.193 27165.050
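Note that the 5th and 95th percentiles bound the central 90% of the distribution; a 95% interval under the normality assumption can instead be computed directly as the mean plus or minus 1.96 standard deviations:
# Central 95% interval assuming LotArea is normally distributed:
mean(LA) + c(-1.96, 1.96)*sd(LA)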
Finally, provide the empirical 5th percentile and 95th percentile of the data.
quantile(LA,probs = c(0.05,0.95))
## 5% 95%
## 3311.70 17401.15
Discuss.
The original data have a very clear right skew, so it didn't make sense to assume normality. The fitted exponential distribution was much closer to the true distribution than the normal distribution. Also, the normal distribution's lower bound was negative, while the real data have a minimum value of 1,300. However, the normal distribution's upper bound was closer to the empirical 95th percentile than the exponential's was. Overall, the exponential distribution is a better fit for LotArea, albeit with a more extreme upper tail than the normal distribution.
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis.
The first step in building my model was cleaning the data. Given that there are over 80 variables (some ordinal, some categorical, and some with missing elements), it made sense to recode many of them so they fit properly in a linear model. The code below uses the sqldf library, which allows SQL queries to be run against R data frames; recoding factors and ordinal variables is much easier in SQL than in base R.
The query below recodes the ordinal variables to numeric scales and converts binary variables to ones and zeros instead of their factor labels.
library(mice)
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:corrgram':
##
## panel.fill
##
## Attaching package: 'mice'
## The following object is masked from 'package:tidyr':
##
## complete
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(sqldf)
## Loading required package: gsubfn
## Loading required package: proto
## Loading required package: RSQLite
#detach("package:RMySQL", unload=TRUE)
query = "select Id
, MSSubClass
, MSZoning
, LotFrontage
, LotArea
, case when Street = 'Pave' then 1 else 0 end as StreetPaved
, Alley
, LotShape
, LandContour
, case when Utilities = 'AllPub' then 1 else 0 end as UtilitiesAll
, LotConfig
, case when LandSlope = 'Gtl' then 1
when LandSlope = 'Mod' then 2
when LandSlope = 'Sev' then 3
end as LandSlope
, Neighborhood
, Condition1
, Condition2
, BldgType
, HouseStyle
, OverallQual
, OverallCond
, YearBuilt
, YearRemodAdd
, RoofStyle
, RoofMatl
, Exterior1st
, Exterior2nd
, MasVnrType
, MasVnrArea
, case when ExterQual = 'Ex' then 5
when ExterQual = 'Gd' then 4
when ExterQual = 'TA' then 3
when ExterQual = 'Fa' then 2
when ExterQual = 'Po' then 1
end as ExterQual
, case when ExterCond = 'Ex' then 5
when ExterCond = 'Gd' then 4
when ExterCond = 'TA' then 3
when ExterCond = 'Fa' then 2
when ExterCond = 'Po' then 1
end as ExterCond
, Foundation
, case when BsmtQual = 'Ex' then 5
when BsmtQual = 'Gd' then 4
when BsmtQual = 'TA' then 3
when BsmtQual = 'Fa' then 2
when BsmtQual = 'Po' then 1
when BsmtQual = 'NA' then 0
end as BsmtQual
, case when BsmtCond = 'Ex' then 5
when BsmtCond = 'Gd' then 4
when BsmtCond = 'TA' then 3
when BsmtCond = 'Fa' then 2
when BsmtCond = 'Po' then 1
when BsmtCond = 'NA' then 0
end as BsmtCond
, case when BsmtExposure = 'Gd' then 4
when BsmtExposure = 'Av' then 3
when BsmtExposure = 'Mn' then 2
when BsmtExposure = 'No' then 1
when BsmtExposure = 'NA' then 0
end as BsmtExposure
, case when BsmtFinType1 = 'GLQ' then 6
when BsmtFinType1 = 'ALQ' then 5
when BsmtFinType1 = 'BLQ' then 4
when BsmtFinType1 = 'Rec' then 3
when BsmtFinType1 = 'LwQ' then 2
when BsmtFinType1 = 'Unf' then 1
when BsmtFinType1 = 'NA' then 0
end as BsmtFinType1
, BsmtFinSF1
, case when BsmtFinType2 = 'GLQ' then 6
when BsmtFinType2 = 'ALQ' then 5
when BsmtFinType2 = 'BLQ' then 4
when BsmtFinType2 = 'Rec' then 3
when BsmtFinType2 = 'LwQ' then 2
when BsmtFinType2 = 'Unf' then 1
when BsmtFinType2 = 'NA' then 0
end as BsmtFinType2
, BsmtFinSF2
, BsmtUnfSF
, TotalBsmtSF
, Heating
, case when HeatingQC = 'Ex' then 5
when HeatingQC = 'Gd' then 4
when HeatingQC = 'TA' then 3
when HeatingQC = 'Fa' then 2
when HeatingQC = 'Po' then 1
end as HeatingQC
, CentralAir
, Electrical
, X1stFlrSF
, X2ndFlrSF
, LowQualFinSF
, GrLivArea
, BsmtFullBath
, BsmtHalfBath
, FullBath
, HalfBath
, BedroomAbvGr
, KitchenAbvGr
, case when KitchenQual = 'Ex' then 5
when KitchenQual = 'Gd' then 4
when KitchenQual = 'TA' then 3
when KitchenQual = 'Fa' then 2
when KitchenQual = 'Po' then 1
end as KitchenQual
, TotRmsAbvGrd
, case when Functional = 'Typ' then 8
when Functional = 'Min1' then 7
when Functional = 'Min2' then 6
when Functional = 'Mod' then 5
when Functional = 'Maj1' then 4
when Functional = 'Maj2' then 3
when Functional = 'Sev' then 2
when Functional = 'Sal' then 1
end as Functional
, Fireplaces
, case when FireplaceQu = 'Ex' then 5
when FireplaceQu = 'Gd' then 4
when FireplaceQu = 'TA' then 3
when FireplaceQu = 'Fa' then 2
when FireplaceQu = 'Po' then 1
when FireplaceQu = 'NA' then 0
end as FireplaceQu
, GarageType
, GarageYrBlt
, case when GarageFinish = 'Fin' then 3
when GarageFinish = 'RFn' then 2
when GarageFinish = 'Unf' then 1
when GarageFinish = 'NA' then 0
end as GarageFinish
, GarageCars
, GarageArea
, case when GarageQual = 'Ex' then 5
when GarageQual = 'Gd' then 4
when GarageQual = 'TA' then 3
when GarageQual = 'Fa' then 2
when GarageQual = 'Po' then 1
when GarageQual = 'NA' then 0
end as GarageQual
, case when GarageCond = 'Ex' then 5
when GarageCond = 'Gd' then 4
when GarageCond = 'TA' then 3
when GarageCond = 'Fa' then 2
when GarageCond = 'Po' then 1
when GarageCond = 'NA' then 0
end as GarageCond
, PavedDrive
, WoodDeckSF
, OpenPorchSF
, EnclosedPorch
, X3SsnPorch
, ScreenPorch
, PoolArea
, PoolQC
, case when Fence = 'GdPrv' then 4
when Fence = 'MnPrv' then 3
when Fence = 'GdWo' then 2
when Fence = 'MnWw' then 1
when Fence = 'NA' then 0
end as Fence
, MiscFeature
, MiscVal
, MoSold
, YrSold
, SaleType
, SaleCondition
, SalePrice
from train"
cleaned_train = sqldf::sqldf(query)
The next step after cleaning the data was to impute the missing values. Since rows with missing values would otherwise be dropped when fitting the model, I imputed them using the mice package: predictive mean matching for the numeric missing values, and a value of 0 for all categorical missing values.
mice_impute_train = mice(cleaned_train,m=5,maxit = 5,method = 'pmm')
##
## iter imp variable
## 1 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 1 2 LotFrontage* MasVnrArea* GarageYrBlt*
## 1 3 LotFrontage* MasVnrArea* GarageYrBlt*
## 1 4 LotFrontage* MasVnrArea* GarageYrBlt*
## 1 5 LotFrontage* MasVnrArea* GarageYrBlt*
## 2 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 2 2 LotFrontage* MasVnrArea* GarageYrBlt*
## 2 3 LotFrontage* MasVnrArea* GarageYrBlt*
## 2 4 LotFrontage* MasVnrArea* GarageYrBlt*
## 2 5 LotFrontage* MasVnrArea* GarageYrBlt*
## 3 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 3 2 LotFrontage* MasVnrArea* GarageYrBlt*
## 3 3 LotFrontage* MasVnrArea* GarageYrBlt*
## 3 4 LotFrontage* MasVnrArea* GarageYrBlt*
## 3 5 LotFrontage* MasVnrArea* GarageYrBlt*
## 4 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 4 2 LotFrontage* MasVnrArea* GarageYrBlt*
## 4 3 LotFrontage* MasVnrArea* GarageYrBlt*
## 4 4 LotFrontage* MasVnrArea* GarageYrBlt*
## 4 5 LotFrontage* MasVnrArea* GarageYrBlt*
## 5 1 LotFrontage* MasVnrArea* GarageYrBlt*
## 5 2 LotFrontage* MasVnrArea* GarageYrBlt*
## 5 3 LotFrontage* MasVnrArea* GarageYrBlt*
## 5 4 LotFrontage* MasVnrArea* GarageYrBlt*
## 5 5 LotFrontage* MasVnrArea* GarageYrBlt*
## * Please inspect the loggedEvents
## Warning: Number of logged events: 191
new_train = complete(mice_impute_train,2)
new_train[is.na(new_train)] = 0
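A quick check confirms that no missing values remain before modeling:
# After imputation and zero-filling, there should be no NAs left:
sum(is.na(new_train))   # 0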
From what I know about houses, the factors that affect price the most are size and location. Generally speaking, larger houses are almost always more expensive than smaller houses in the same neighborhood. For my model, I chose variables that relate to the physical size of the house, such as Lot Area, Garage Area, and Total Rooms Above Grade, as different measures of how big the house is. To account for location, I added the Neighborhood variable. While most neighborhood coefficients are not statistically significant, the expensive neighborhoods are, and they can raise predicted home values substantially.
clean_model = lm(SalePrice~
BsmtExposure+
BsmtFinSF1+
BsmtUnfSF+
Fireplaces+
GarageArea+
KitchenAbvGr+
LandContour+
LotArea+
MasVnrArea+
Neighborhood+
OverallCond+
OverallQual+
TotRmsAbvGrd
,data = new_train)
summary(clean_model)
##
## Call:
## lm(formula = SalePrice ~ BsmtExposure + BsmtFinSF1 + BsmtUnfSF +
## Fireplaces + GarageArea + KitchenAbvGr + LandContour + LotArea +
## MasVnrArea + Neighborhood + OverallCond + OverallQual + TotRmsAbvGrd,
## data = new_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -391127 -17758 -923 14516 318071
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.877e+04 1.467e+04 -4.007 6.47e-05 ***
## BsmtExposure1 -1.962e+04 6.471e+03 -3.032 0.002473 **
## BsmtExposure2 -1.424e+04 7.121e+03 -1.999 0.045769 *
## BsmtExposure3 -1.221e+04 6.895e+03 -1.771 0.076820 .
## BsmtExposure4 9.992e+03 7.425e+03 1.346 0.178599
## BsmtFinSF1 3.091e+01 3.106e+00 9.949 < 2e-16 ***
## BsmtUnfSF 1.668e+01 3.054e+00 5.461 5.59e-08 ***
## Fireplaces 7.948e+03 1.709e+03 4.652 3.60e-06 ***
## GarageArea 4.013e+01 5.592e+00 7.176 1.16e-12 ***
## KitchenAbvGr -2.744e+04 4.760e+03 -5.764 1.01e-08 ***
## LandContourHLS 2.188e+04 6.807e+03 3.215 0.001335 **
## LandContourLow 1.214e+04 7.964e+03 1.525 0.127596
## LandContourLvl 1.610e+04 4.633e+03 3.474 0.000528 ***
## LotArea 4.344e-01 1.065e-01 4.078 4.79e-05 ***
## MasVnrArea 1.895e+01 6.068e+00 3.123 0.001826 **
## NeighborhoodBlueste -4.364e+03 2.531e+04 -0.172 0.863126
## NeighborhoodBrDale -2.890e+04 1.224e+04 -2.361 0.018364 *
## NeighborhoodBrkSide -3.122e+03 9.697e+03 -0.322 0.747536
## NeighborhoodClearCr 8.176e+03 1.112e+04 0.735 0.462476
## NeighborhoodCollgCr 1.213e+04 8.770e+03 1.383 0.166992
## NeighborhoodCrawfor 1.981e+04 9.956e+03 1.990 0.046821 *
## NeighborhoodEdwards -1.330e+04 9.261e+03 -1.436 0.151287
## NeighborhoodGilbert 9.944e+03 9.204e+03 1.080 0.280179
## NeighborhoodIDOTRR -1.864e+04 1.028e+04 -1.813 0.070006 .
## NeighborhoodMeadowV -6.889e+03 1.196e+04 -0.576 0.564597
## NeighborhoodMitchel -3.573e+03 9.778e+03 -0.365 0.714882
## NeighborhoodNAmes -9.337e+03 8.879e+03 -1.052 0.293182
## NeighborhoodNoRidge 7.325e+04 1.014e+04 7.222 8.35e-13 ***
## NeighborhoodNPkVill -1.122e+04 1.404e+04 -0.799 0.424198
## NeighborhoodNridgHt 5.787e+04 9.295e+03 6.226 6.28e-10 ***
## NeighborhoodNWAmes -6.987e+03 9.406e+03 -0.743 0.457686
## NeighborhoodOldTown -1.429e+04 9.236e+03 -1.547 0.121971
## NeighborhoodSawyer -4.932e+03 9.517e+03 -0.518 0.604355
## NeighborhoodSawyerW 9.438e+03 9.524e+03 0.991 0.321857
## NeighborhoodSomerst 2.963e+04 9.135e+03 3.244 0.001207 **
## NeighborhoodStoneBr 7.005e+04 1.085e+04 6.455 1.48e-10 ***
## NeighborhoodSWISU -5.893e+03 1.113e+04 -0.530 0.596447
## NeighborhoodTimber 9.310e+03 1.020e+04 0.913 0.361511
## NeighborhoodVeenker 2.749e+04 1.327e+04 2.071 0.038505 *
## OverallCond 5.296e+03 9.017e+02 5.874 5.31e-09 ***
## OverallQual 1.648e+04 1.140e+03 14.457 < 2e-16 ***
## TotRmsAbvGrd 1.197e+04 7.180e+02 16.668 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33570 on 1418 degrees of freedom
## Multiple R-squared: 0.8265, Adjusted R-squared: 0.8215
## F-statistic: 164.7 on 41 and 1418 DF, p-value: < 2.2e-16
As we can see from the model output, the adjusted R-squared is quite good at 0.8215. The residuals are nearly centered at zero, which is a good sign that there is not much bias in the model. Looking at the residual plot below, we see a fairly tight, random spread around the zero line. There do not appear to be significant changes in variance across the data, nor any discernible pattern in the residuals. The only concern is a few outliers, but overall the residual plot looks very good.
plot(clean_model$residuals,main = "Residual Plot")
abline(h=0,col = "red")
Looking at the Q-Q plot, we can see that the points follow the line very closely except at the tails. This indicates that the residuals are approximately normally distributed, which is a good sign. At the extremes of housing prices, other factors beyond location and house size likely contribute to the price.
qqnorm(clean_model$residuals)
qqline(clean_model$residuals)
Overall, the model appears to be a good fit. The residuals do not exhibit any abnormal behavior that would indicate a non-linear model would be a better fit.
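Base R's built-in diagnostics produce the standard panels in one call, showing the same information more compactly:
# Residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage:
par(mfrow = c(2,2))
plot(clean_model)
par(mfrow = c(1,1))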
Now that the variables have been chosen, we can use cross-validation to estimate how well the model predicts new observations. I loaded the caret library, which has many useful machine learning functions, including utilities for cross-validating a model.
library(caret)
## Loading required package: ggplot2
Using 20-fold cross-validation, the model achieves a cross-validated R-squared of about 0.824, nearly matching the training fit, which suggests the model is not badly overfit. Time to make some predictions.
train_control_20 = trainControl(method = "cv", number = 20)
lm_cv20 = train(SalePrice~
BsmtExposure+
BsmtFinSF1+
BsmtUnfSF+
Fireplaces+
GarageArea+
KitchenAbvGr+
LandContour+
LotArea+
MasVnrArea+
Neighborhood+
OverallCond+
OverallQual+
TotRmsAbvGrd
,new_train,method = "lm", trControl = train_control_20, na.action = na.pass)
print(lm_cv20)
## Linear Regression
##
## 1460 samples
## 13 predictor
##
## No pre-processing
## Resampling: Cross-Validated (20 fold)
## Summary of sample sizes: 1388, 1386, 1388, 1388, 1386, 1388, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 33831.5 0.8236922 22395.01
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
To make my predictions, I ran the same cleaning and imputation scripts on the testing data that I used for the training data.
query = "select Id
, MSSubClass
, MSZoning
, LotFrontage
, LotArea
, case when Street = 'Pave' then 1 else 0 end as StreetPaved
, Alley
, LotShape
, LandContour
, case when Utilities = 'AllPub' then 1 else 0 end as UtilitiesAll
, LotConfig
, case when LandSlope = 'Gtl' then 1
when LandSlope = 'Mod' then 2
when LandSlope = 'Sev' then 3
end as LandSlope
, Neighborhood
, Condition1
, Condition2
, BldgType
, HouseStyle
, OverallQual
, OverallCond
, YearBuilt
, YearRemodAdd
, RoofStyle
, RoofMatl
, Exterior1st
, Exterior2nd
, MasVnrType
, MasVnrArea
, case when ExterQual = 'Ex' then 5
when ExterQual = 'Gd' then 4
when ExterQual = 'TA' then 3
when ExterQual = 'Fa' then 2
when ExterQual = 'Po' then 1
end as ExterQual
, case when ExterCond = 'Ex' then 5
when ExterCond = 'Gd' then 4
when ExterCond = 'TA' then 3
when ExterCond = 'Fa' then 2
when ExterCond = 'Po' then 1
end as ExterCond
, Foundation
, case when BsmtQual = 'Ex' then 5
when BsmtQual = 'Gd' then 4
when BsmtQual = 'TA' then 3
when BsmtQual = 'Fa' then 2
when BsmtQual = 'Po' then 1
when BsmtQual = 'NA' then 0
end as BsmtQual
, case when BsmtCond = 'Ex' then 5
when BsmtCond = 'Gd' then 4
when BsmtCond = 'TA' then 3
when BsmtCond = 'Fa' then 2
when BsmtCond = 'Po' then 1
when BsmtCond = 'NA' then 0
end as BsmtCond
, case when BsmtExposure = 'Gd' then 4
when BsmtExposure = 'Av' then 3
when BsmtExposure = 'Mn' then 2
when BsmtExposure = 'No' then 1
when BsmtExposure = 'NA' then 0
end as BsmtExposure
, case when BsmtFinType1 = 'GLQ' then 6
when BsmtFinType1 = 'ALQ' then 5
when BsmtFinType1 = 'BLQ' then 4
when BsmtFinType1 = 'Rec' then 3
when BsmtFinType1 = 'LwQ' then 2
when BsmtFinType1 = 'Unf' then 1
when BsmtFinType1 = 'NA' then 0
end as BsmtFinType1
, BsmtFinSF1
, case when BsmtFinType2 = 'GLQ' then 6
when BsmtFinType2 = 'ALQ' then 5
when BsmtFinType2 = 'BLQ' then 4
when BsmtFinType2 = 'Rec' then 3
when BsmtFinType2 = 'LwQ' then 2
when BsmtFinType2 = 'Unf' then 1
when BsmtFinType2 = 'NA' then 0
end as BsmtFinType2
, BsmtFinSF2
, BsmtUnfSF
, TotalBsmtSF
, Heating
, case when HeatingQC = 'Ex' then 5
when HeatingQC = 'Gd' then 4
when HeatingQC = 'TA' then 3
when HeatingQC = 'Fa' then 2
when HeatingQC = 'Po' then 1
end as HeatingQC
, CentralAir
, Electrical
, X1stFlrSF
, X2ndFlrSF
, LowQualFinSF
, GrLivArea
, BsmtFullBath
, BsmtHalfBath
, FullBath
, HalfBath
, BedroomAbvGr
, KitchenAbvGr
, case when KitchenQual = 'Ex' then 5
when KitchenQual = 'Gd' then 4
when KitchenQual = 'TA' then 3
when KitchenQual = 'Fa' then 2
when KitchenQual = 'Po' then 1
end as KitchenQual
, TotRmsAbvGrd
, case when Functional = 'Typ' then 8
when Functional = 'Min1' then 7
when Functional = 'Min2' then 6
when Functional = 'Mod' then 5
when Functional = 'Maj1' then 4
when Functional = 'Maj2' then 3
when Functional = 'Sev' then 2
when Functional = 'Sal' then 1
end as Functional
, Fireplaces
, case when FireplaceQu = 'Ex' then 5
when FireplaceQu = 'Gd' then 4
when FireplaceQu = 'TA' then 3
when FireplaceQu = 'Fa' then 2
when FireplaceQu = 'Po' then 1
when FireplaceQu = 'NA' then 0
end as FireplaceQu
, GarageType
, GarageYrBlt
, case when GarageFinish = 'Fin' then 3
when GarageFinish = 'RFn' then 2
when GarageFinish = 'Unf' then 1
when GarageFinish = 'NA' then 0
end as GarageFinish
, GarageCars
, GarageArea
, case when GarageQual = 'Ex' then 5
when GarageQual = 'Gd' then 4
when GarageQual = 'TA' then 3
when GarageQual = 'Fa' then 2
when GarageQual = 'Po' then 1
when GarageQual = 'NA' then 0
end as GarageQual
, case when GarageCond = 'Ex' then 5
when GarageCond = 'Gd' then 4
when GarageCond = 'TA' then 3
when GarageCond = 'Fa' then 2
when GarageCond = 'Po' then 1
when GarageCond = 'NA' then 0
end as GarageCond
, PavedDrive
, WoodDeckSF
, OpenPorchSF
, EnclosedPorch
, X3SsnPorch
, ScreenPorch
, PoolArea
, PoolQC
, case when Fence = 'GdPrv' then 4
when Fence = 'MnPrv' then 3
when Fence = 'GdWo' then 2
when Fence = 'MnWw' then 1
when Fence = 'NA' then 0
end as Fence
, MiscFeature
, MiscVal
, MoSold
, YrSold
, SaleType
, SaleCondition
from test"
cleaned_test = sqldf::sqldf(query)
mice_impute = mice(cleaned_test,m=5,maxit = 5,method = 'pmm')
##
## iter imp variable
## 1 1 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 1 2 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 1 3 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 1 4 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 1 5 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 2 1 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 2 2 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 2 3 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 2 4 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 2 5 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 3 1 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 3 2 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 3 3 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 3 4 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 3 5 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 4 1 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 4 2 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 4 3 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 4 4 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 4 5 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 5 1 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 5 2 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 5 3 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 5 4 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## 5 5 LotFrontage* MasVnrArea* BsmtFinSF1* BsmtFinSF2* BsmtUnfSF* TotalBsmtSF* BsmtFullBath* BsmtHalfBath* GarageYrBlt* GarageCars* GarageArea*
## * Please inspect the loggedEvents
## Warning: Number of logged events: 591
new_test = complete(mice_impute,2)
new_test[is.na(new_test)] = 0
The code below makes predictions on the test set using my 20-fold cross-validated linear model. The output can be seen below.
predictions = predict(lm_cv20,new_test)
output_df = data.frame(Id = cleaned_test$Id, SalePrice = predictions)
head(output_df)
## Id SalePrice
## 1 1461 128234.7
## 2 1462 159431.6
## 3 1463 160906.0
## 4 1464 190147.8
## 5 1465 251799.0
## 6 1466 171994.6
#write.csv(output_df,"Kaggle submission.csv",row.names = F)
Kaggle username: AuChan
Score: 0.21546