Problem 1.
Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with mean and standard deviation \(\mu=\sigma=(N+1)/2\).
set.seed(605)
N <- 10
n <- 10000
mu <- (N + 1)/2
sigma <- (N + 1)/2
X <- runif(n, 1, N)
Y <- rnorm(n, mu, sigma)
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
x <- median(X)
y <- quantile(Y, 0.25)
a \(P(X>x|X>y)\)
Given X > y, the probability that X > x is
sum(X>y & X>x)/sum(X>y)
## [1] 0.5521201
b \(P(X>x,\;Y>y)\)
The probability that X > x and Y > y both hold is
sum(X>x & Y>y)/length(X)
## [1] 0.3738
c \(P(X<x|X>y)\)
Given X > y, the probability that X < x is
sum(X>y & X<x)/sum(X>y)
## [1] 0.4478799
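As a cross-check, the simulated probabilities can be compared with theoretical values. The sketch below assumes the same N = 10 as above and replaces the sample median and sample first quartile with their population counterparts; the names `x_theo`, `y_theo` and `p_a` to `p_c` are illustrative only.

```r
N <- 10
mu <- (N + 1) / 2
sigma <- (N + 1) / 2

# Population counterparts of the sample median of X and first quartile of Y
x_theo <- qunif(0.5, min = 1, max = N)        # median of Uniform(1, N)
y_theo <- qnorm(0.25, mean = mu, sd = sigma)  # first quartile of Normal(mu, sigma)

# Upper-tail probabilities of X at the two thresholds
p_X_gt_x <- punif(x_theo, 1, N, lower.tail = FALSE)
p_X_gt_y <- punif(y_theo, 1, N, lower.tail = FALSE)

# Because x_theo > y_theo, the event {X > x and X > y} is just {X > x}
p_a <- p_X_gt_x / p_X_gt_y   # P(X > x | X > y)
p_b <- p_X_gt_x * 0.75       # P(X > x, Y > y), assuming X and Y are independent
p_c <- 1 - p_a               # P(X < x | X > y)

round(c(a = p_a, b = p_b, c = p_c), 4)
```

The theoretical values (roughly 0.548, 0.375 and 0.452) are close to the simulated ones, as expected with 10,000 draws.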
Investigate whether \(P(X>x\;and\;Y>y)=P(X>x)P(Y>y)\) by building a table and evaluating the marginal and joint probabilities
table_num <- matrix(
c(
sum(X>x & Y>y),
sum(X<=x & Y>y),
sum(Y > y),
sum(X>x & Y<=y),
sum(X<=x & Y<=y),
sum(Y <= y),
sum(X > x),
sum(X <= x),
length(X)
), nrow = 3, ncol = 3, byrow = TRUE,
dimnames = list(c('Y > y','not Y > y', 'Total'), c('X > x','not X > x', 'Total'))
)
table_num
##           X > x not X > x Total
## Y > y      3738      3762  7500
## not Y > y  1262      1238  2500
## Total      5000      5000 10000
table_prob <- table_num/length(X)
table_prob
##            X > x not X > x Total
## Y > y     0.3738    0.3762  0.75
## not Y > y 0.1262    0.1238  0.25
## Total     0.5000    0.5000  1.00
So \(P(X > x)*P(Y > y) =\)
0.75*0.5
## [1] 0.375
This is very close to the observed \(P(X>x\;and\;Y>y)=0.3738\). The small difference is consistent with sampling variation, so the two events appear to be independent.
Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
fisher.test(table_num[1:2,1:2])
##
## Fisher's Exact Test for Count Data
##
## data: table_num[1:2, 1:2]
## p-value = 0.5953
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
## 0.8894241 1.0682094
## sample estimates:
## odds ratio
## 0.9747199
chisq.test(table_num[1:2,1:2])
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table_num[1:2, 1:2]
## X-squared = 0.28213, df = 1, p-value = 0.5953
Fisher’s Exact Test and the Chi Square Test give essentially the same large p-value (0.5953). We therefore fail to reject the null hypothesis that the two events are independent. The Chi Square Test relies on a large-sample approximation, while Fisher’s Exact Test computes the p-value exactly and so also works for small samples. Fisher’s Exact Test is therefore always appropriate; with cell counts as large as the ones here, the Chi Square approximation is equally reliable, which is why the two results are nearly the same.
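Whether the Chi Square approximation can be trusted is usually judged from the expected cell counts (a common rule of thumb asks for at least 5 in every cell). The following sketch rebuilds the 2x2 table from the counts reported above and inspects those expected counts; the object names `tab` and `res` are illustrative.

```r
# 2x2 table of counts taken from the output above (filled column by column)
tab <- matrix(c(3738, 1262, 3762, 1238), nrow = 2,
              dimnames = list(c("Y > y", "not Y > y"),
                              c("X > x", "not X > x")))

# chisq.test() applies Yates' continuity correction to 2x2 tables by default
res <- chisq.test(tab)

res$expected              # expected counts under independence
min(res$expected) >= 5    # rule-of-thumb check for the approximation
res$statistic             # should agree with the X-squared value reported above
```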
Problem 2
Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
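# stringsAsFactors = TRUE keeps categorical columns as factors (the default before R 4.0),
# which the summary() counts and the plot() calls below rely on
train <- read.csv("https://raw.githubusercontent.com/ezaccountz/datasets/main/train.csv", stringsAsFactors = TRUE)
test <- read.csv("https://raw.githubusercontent.com/ezaccountz/datasets/main/test.csv", stringsAsFactors = TRUE)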
Variable descriptions:
Variable summary:
summary(train)
## Id MSSubClass MSZoning LotFrontage
## Min. : 1.0 Min. : 20.0 C (all): 10 Min. : 21.00
## 1st Qu.: 365.8 1st Qu.: 20.0 FV : 65 1st Qu.: 59.00
## Median : 730.5 Median : 50.0 RH : 16 Median : 69.00
## Mean : 730.5 Mean : 56.9 RL :1151 Mean : 70.05
## 3rd Qu.:1095.2 3rd Qu.: 70.0 RM : 218 3rd Qu.: 80.00
## Max. :1460.0 Max. :190.0 Max. :313.00
## NA's :259
## LotArea Street Alley LotShape LandContour Utilities
## Min. : 1300 Grvl: 6 Grvl: 50 IR1:484 Bnk: 63 AllPub:1459
## 1st Qu.: 7554 Pave:1454 Pave: 41 IR2: 41 HLS: 50 NoSeWa: 1
## Median : 9478 NA's:1369 IR3: 10 Low: 36
## Mean : 10517 Reg:925 Lvl:1311
## 3rd Qu.: 11602
## Max. :215245
##
## LotConfig LandSlope Neighborhood Condition1 Condition2
## Corner : 263 Gtl:1382 NAmes :225 Norm :1260 Norm :1445
## CulDSac: 94 Mod: 65 CollgCr:150 Feedr : 81 Feedr : 6
## FR2 : 47 Sev: 13 OldTown:113 Artery : 48 Artery : 2
## FR3 : 4 Edwards:100 RRAn : 26 PosN : 2
## Inside :1052 Somerst: 86 PosN : 19 RRNn : 2
## Gilbert: 79 RRAe : 11 PosA : 1
## (Other):707 (Other): 15 (Other): 2
## BldgType HouseStyle OverallQual OverallCond YearBuilt
## 1Fam :1220 1Story :726 Min. : 1.000 Min. :1.000 Min. :1872
## 2fmCon: 31 2Story :445 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954
## Duplex: 52 1.5Fin :154 Median : 6.000 Median :5.000 Median :1973
## Twnhs : 43 SLvl : 65 Mean : 6.099 Mean :5.575 Mean :1971
## TwnhsE: 114 SFoyer : 37 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000
## 1.5Unf : 14 Max. :10.000 Max. :9.000 Max. :2010
## (Other): 19
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd
## Min. :1950 Flat : 13 CompShg:1434 VinylSd:515 VinylSd:504
## 1st Qu.:1967 Gable :1141 Tar&Grv: 11 HdBoard:222 MetalSd:214
## Median :1994 Gambrel: 11 WdShngl: 6 MetalSd:220 HdBoard:207
## Mean :1985 Hip : 286 WdShake: 5 Wd Sdng:206 Wd Sdng:197
## 3rd Qu.:2004 Mansard: 7 ClyTile: 1 Plywood:108 Plywood:142
## Max. :2010 Shed : 2 Membran: 1 CemntBd: 61 CmentBd: 60
## (Other): 2 (Other):128 (Other):136
## MasVnrType MasVnrArea ExterQual ExterCond Foundation BsmtQual
## BrkCmn : 15 Min. : 0.0 Ex: 52 Ex: 3 BrkTil:146 Ex :121
## BrkFace:445 1st Qu.: 0.0 Fa: 14 Fa: 28 CBlock:634 Fa : 35
## None :864 Median : 0.0 Gd:488 Gd: 146 PConc :647 Gd :618
## Stone :128 Mean : 103.7 TA:906 Po: 1 Slab : 24 TA :649
## NA's : 8 3rd Qu.: 166.0 TA:1282 Stone : 6 NA's: 37
## Max. :1600.0 Wood : 3
## NA's :8
## BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## Fa : 45 Av :221 ALQ :220 Min. : 0.0 ALQ : 19
## Gd : 65 Gd :134 BLQ :148 1st Qu.: 0.0 BLQ : 33
## Po : 2 Mn :114 GLQ :418 Median : 383.5 GLQ : 14
## TA :1311 No :953 LwQ : 74 Mean : 443.6 LwQ : 46
## NA's: 37 NA's: 38 Rec :133 3rd Qu.: 712.2 Rec : 54
## Unf :430 Max. :5644.0 Unf :1256
## NA's: 37 NA's: 38
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Floor: 1 Ex:741
## 1st Qu.: 0.00 1st Qu.: 223.0 1st Qu.: 795.8 GasA :1428 Fa: 49
## Median : 0.00 Median : 477.5 Median : 991.5 GasW : 18 Gd:241
## Mean : 46.55 Mean : 567.2 Mean :1057.4 Grav : 7 Po: 1
## 3rd Qu.: 0.00 3rd Qu.: 808.0 3rd Qu.:1298.2 OthW : 2 TA:428
## Max. :1474.00 Max. :2336.0 Max. :6110.0 Wall : 4
##
## CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## N: 95 FuseA: 94 Min. : 334 Min. : 0 Min. : 0.000
## Y:1365 FuseF: 27 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000
## FuseP: 3 Median :1087 Median : 0 Median : 0.000
## Mix : 1 Mean :1163 Mean : 347 Mean : 5.845
## SBrkr:1334 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000
## NA's : 1 Max. :4692 Max. :2065 Max. :572.000
##
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median :1464 Median :0.0000 Median :0.00000 Median :2.000
## Mean :1515 Mean :0.4253 Mean :0.05753 Mean :1.565
## 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :5642 Max. :3.0000 Max. :2.00000 Max. :3.000
##
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd
## Min. :0.0000 Min. :0.000 Min. :0.000 Ex:100 Min. : 2.000
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 Fa: 39 1st Qu.: 5.000
## Median :0.0000 Median :3.000 Median :1.000 Gd:586 Median : 6.000
## Mean :0.3829 Mean :2.866 Mean :1.047 TA:735 Mean : 6.518
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :2.0000 Max. :8.000 Max. :3.000 Max. :14.000
##
## Functional Fireplaces FireplaceQu GarageType GarageYrBlt
## Maj1: 14 Min. :0.000 Ex : 24 2Types : 6 Min. :1900
## Maj2: 5 1st Qu.:0.000 Fa : 33 Attchd :870 1st Qu.:1961
## Min1: 31 Median :1.000 Gd :380 Basment: 19 Median :1980
## Min2: 34 Mean :0.613 Po : 20 BuiltIn: 88 Mean :1979
## Mod : 15 3rd Qu.:1.000 TA :313 CarPort: 9 3rd Qu.:2002
## Sev : 1 Max. :3.000 NA's:690 Detchd :387 Max. :2010
## Typ :1360 NA's : 81 NA's :81
## GarageFinish GarageCars GarageArea GarageQual GarageCond
## Fin :352 Min. :0.000 Min. : 0.0 Ex : 3 Ex : 2
## RFn :422 1st Qu.:1.000 1st Qu.: 334.5 Fa : 48 Fa : 35
## Unf :605 Median :2.000 Median : 480.0 Gd : 14 Gd : 9
## NA's: 81 Mean :1.767 Mean : 473.0 Po : 3 Po : 7
## 3rd Qu.:2.000 3rd Qu.: 576.0 TA :1311 TA :1326
## Max. :4.000 Max. :1418.0 NA's: 81 NA's: 81
##
## PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## N: 90 Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## P: 30 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Y:1340 Median : 0.00 Median : 25.00 Median : 0.00 Median : 0.00
## Mean : 94.24 Mean : 46.66 Mean : 21.95 Mean : 3.41
## 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :857.00 Max. :547.00 Max. :552.00 Max. :508.00
##
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## Min. : 0.00 Min. : 0.000 Ex : 2 GdPrv: 59 Gar2: 2
## 1st Qu.: 0.00 1st Qu.: 0.000 Fa : 2 GdWo : 54 Othr: 2
## Median : 0.00 Median : 0.000 Gd : 3 MnPrv: 157 Shed: 49
## Mean : 15.06 Mean : 2.759 NA's:1453 MnWw : 11 TenC: 1
## 3rd Qu.: 0.00 3rd Qu.: 0.000 NA's :1179 NA's:1406
## Max. :480.00 Max. :738.000
##
## MiscVal MoSold YrSold SaleType
## Min. : 0.00 Min. : 1.000 Min. :2006 WD :1267
## 1st Qu.: 0.00 1st Qu.: 5.000 1st Qu.:2007 New : 122
## Median : 0.00 Median : 6.000 Median :2008 COD : 43
## Mean : 43.49 Mean : 6.322 Mean :2008 ConLD : 9
## 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009 ConLI : 5
## Max. :15500.00 Max. :12.000 Max. :2010 ConLw : 5
## (Other): 9
## SaleCondition SalePrice
## Abnorml: 101 Min. : 34900
## AdjLand: 4 1st Qu.:129975
## Alloca : 12 Median :163000
## Family : 20 Mean :180921
## Normal :1198 3rd Qu.:214000
## Partial: 125 Max. :755000
##
Plots of some key variables
par(mfrow=c(3,2))
hist(train$GrLivArea)
hist(train$GarageArea)
hist(train$SalePrice)
hist(train$BedroomAbvGr)
plot(train$Neighborhood)
plot(train$SaleType)
par(mfrow=c(1,1))
scatterplot matrix
train_select <- train[,c("GrLivArea","GarageArea","SalePrice")]
pairs(train_select)
correlation matrix
cor_matrix <- cor(train_select)
cor_matrix
##            GrLivArea GarageArea SalePrice
## GrLivArea  1.0000000  0.4689975 0.7086245
## GarageArea 0.4689975  1.0000000 0.6234314
## SalePrice  0.7086245  0.6234314 1.0000000
cor.test(train_select$GrLivArea, train_select$GarageArea, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train_select$GrLivArea and train_select$GarageArea
## t = 20.276, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.4423993 0.4947713
## sample estimates:
## cor
## 0.4689975
cor.test(train_select$GrLivArea, train_select$SalePrice, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train_select$GrLivArea and train_select$SalePrice
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
cor.test(train_select$GarageArea, train_select$SalePrice, conf.level = 0.8)
##
## Pearson's product-moment correlation
##
## data: train_select$GarageArea and train_select$SalePrice
## t = 30.446, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6024756 0.6435283
## sample estimates:
## cor
## 0.6234314
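All three tests reject the null hypothesis of zero correlation: every p-value is below 2.2e-16 and none of the 80% confidence intervals contains 0. GrLivArea and GarageArea are both strongly and positively correlated with SalePrice (0.71 and 0.62) and moderately correlated with each other (0.47), so both are useful for comparing the prices of different houses. Familywise error is a legitimate concern in general: with three tests at the 80% confidence level, the chance of at least one false rejection could be as high as 1 - 0.8^3 ≈ 0.49 if all null hypotheses were true and the tests were independent. Here, however, the p-values are so small that they would remain significant after a Bonferroni correction, so we are not worried about familywise error in this analysis.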
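The familywise error reasoning for the three pairwise tests can be sketched numerically. The per-test level of 0.20 follows from the 80% confidence level; since the exact p-values are only reported as "< 2.2e-16", the sketch uses that bound as a stand-in, and the independence of the tests is an assumption made only to get a simple formula.

```r
alpha <- 0.20   # per-test significance level implied by an 80% confidence level
m <- 3          # number of pairwise correlation tests

# Chance of at least one false rejection if all nulls were true
# and the tests were independent (an assumption)
fwer <- 1 - (1 - alpha)^m
fwer

# Stand-in p-values: the reported upper bound, not the exact values
p_upper <- rep(2.2e-16, m)

# Bonferroni adjustment multiplies each p-value by m (capped at 1)
p_bonf <- p.adjust(p_upper, method = "bonferroni")
p_bonf < alpha
```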
Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
inv_matrix = solve(cor_matrix)
inv_matrix
## GrLivArea GarageArea SalePrice
## GrLivArea 2.01353305 -0.08964955 -1.3709485
## GarageArea -0.08964955 1.63976057 -0.9587504
## SalePrice -1.37094845 -0.95875043 2.5692028
round(cor_matrix %*% inv_matrix,6)
## GrLivArea GarageArea SalePrice
## GrLivArea 1 0 0
## GarageArea 0 1 0
## SalePrice 0 0 1
round(inv_matrix %*% cor_matrix,6)
## GrLivArea GarageArea SalePrice
## GrLivArea 1 0 0
## GarageArea 0 1 0
## SalePrice 0 0 1
if(!require(matrixcalc)){
install.packages("matrixcalc")
}
library(matrixcalc)
lu.decomposition(inv_matrix)
## $L
## [,1] [,2] [,3]
## [1,] 1.00000000 0.0000000 0
## [2,] -0.04452351 1.0000000 0
## [3,] -0.68086712 -0.6234314 1
##
## $U
## [,1] [,2] [,3]
## [1,] 2.013533 -0.08964955 -1.370948
## [2,] 0.000000 1.63576906 -1.019790
## [3,] 0.000000 0.00000000 1.000000
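To see what the decomposition does, the factors can be rebuilt by hand and multiplied back together. The sketch below is a minimal Doolittle elimination in base R, not the `matrixcalc` implementation; `lu_doolittle` is a hypothetical helper, it does no pivoting, and it assumes no zero pivot is encountered. The correlation values are copied from the output above.

```r
# Doolittle LU without pivoting: A = L %*% U, with unit diagonal in L
lu_doolittle <- function(A) {
  n <- nrow(A)
  L <- diag(n)
  U <- A
  for (k in seq_len(n - 1)) {
    for (i in (k + 1):n) {
      L[i, k] <- U[i, k] / U[k, k]         # multiplier that eliminates U[i, k]
      U[i, ] <- U[i, ] - L[i, k] * U[k, ]  # row operation on U
    }
  }
  list(L = L, U = U)
}

# Correlation matrix as reported above
R <- matrix(c(1.0000000, 0.4689975, 0.7086245,
              0.4689975, 1.0000000, 0.6234314,
              0.7086245, 0.6234314, 1.0000000),
            nrow = 3, byrow = TRUE)
P <- solve(R)   # precision matrix

dec <- lu_doolittle(P)
max(abs(dec$L %*% dec$U - P))   # should be numerically zero
```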
Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
if(!require(MASS)){
install.packages("MASS")
}
if(!require(Rmisc)){
install.packages("Rmisc")
}
library(MASS)
library("Rmisc")
Select variable GrLivArea
hist(train_select$GrLivArea)
optimal value of \(\lambda\)
mlh <- fitdistr(train_select$GrLivArea, densfun="exponential")
lambda = mlh$estimate
lambda
## rate
## 0.000659864
par(mfrow=c(1,2))
hist(rexp(1000, lambda), main="optimal")
hist(train_select$GrLivArea,main="Original")
5th percentile using the CDF of an exponential distribution
qexp(0.05, rate = lambda)
## [1] 77.73313
95th percentile using the CDF of an exponential distribution
qexp(0.95, rate = lambda)
## [1] 4539.924
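The same percentiles follow directly from the exponential CDF, \(F(q)=1-e^{-\lambda q}\), which inverts to \(q=-\ln(1-p)/\lambda\). The sketch below assumes the rate equals one over the sample mean (the maximum-likelihood estimate for an exponential), using the mean of 1515.464 reported for GrLivArea; `lambda_hat` and `exp_quantile` are illustrative names.

```r
# MLE of the exponential rate is 1 / sample mean; mean taken from the output
lambda_hat <- 1 / 1515.464

# Invert F(q) = 1 - exp(-lambda * q) to get the quantile for probability p
exp_quantile <- function(p, rate) -log(1 - p) / rate

q05 <- exp_quantile(0.05, lambda_hat)
q95 <- exp_quantile(0.95, lambda_hat)

# Both lines should agree with each other
c(q05, q95)
qexp(c(0.05, 0.95), rate = lambda_hat)
```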
95% confidence interval from the empirical data
if(!require(Rmisc)){
install.packages("Rmisc")
}
library(Rmisc)
CI(train_select$GrLivArea, ci = 0.95)
## upper mean lower
## 1542.440 1515.464 1488.487
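Note that this is a confidence interval for the mean, not an interval expected to contain 95% of the observations. A minimal sketch of the calculation, assuming `CI()` uses the usual t-based formula (mean ± t × standard error), is shown below on hypothetical simulated data rather than on the housing data.

```r
set.seed(1)
# Hypothetical right-skewed stand-in data, not the Kaggle variable
x <- rexp(200, rate = 1 / 1500) + 300

n <- length(x)
se <- sd(x) / sqrt(n)             # standard error of the mean
t_crit <- qt(0.975, df = n - 1)   # two-sided 95% critical value

ci_manual <- c(lower = mean(x) - t_crit * se,
               mean  = mean(x),
               upper = mean(x) + t_crit * se)
ci_manual

# Base R reports the same interval
ci_ttest <- t.test(x, conf.level = 0.95)$conf.int
ci_ttest
```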
empirical 5th percentile and 95th percentile of the data
quantile(train_select$GrLivArea, 0.05)
## 5%
## 848
quantile(train_select$GrLivArea, 0.95)
## 95%
## 2466.1
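The results show that the exponential distribution fits GrLivArea poorly. The fitted exponential puts its 5th and 95th percentiles at about 78 and 4540, whereas the empirical 5th and 95th percentiles are 848 and 2466.1. The exponential pdf is monotonically decreasing, while the histogram of the data rises to a peak and then falls. Note also that the 95% confidence interval (1488.5 to 1542.4) is an interval for the mean, not for the spread of the data, which is why it is much narrower than the percentile range. We should select a different distribution to fit the data.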
Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
Feature Selection:
First, let’s check the missing values of our data
colSums(is.na(train))[apply(train, 2, function(x) any(is.na(x)))]
## LotFrontage Alley MasVnrType MasVnrArea BsmtQual BsmtCond
## 259 1369 8 8 37 37
## BsmtExposure BsmtFinType1 BsmtFinType2 Electrical FireplaceQu GarageType
## 38 37 38 1 690 81
## GarageYrBlt GarageFinish GarageQual GarageCond PoolQC Fence
## 81 81 81 81 1453 1179
## MiscFeature
## 1406
colSums(is.na(test))[apply(test, 2, function(x) any(is.na(x)))]
## MSZoning LotFrontage Alley Utilities Exterior1st Exterior2nd
## 4 227 1352 2 1 1
## MasVnrType MasVnrArea BsmtQual BsmtCond BsmtExposure BsmtFinType1
## 16 15 44 45 44 42
## BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF BsmtFullBath
## 1 42 1 1 1 2
## BsmtHalfBath KitchenQual Functional FireplaceQu GarageType GarageYrBlt
## 2 1 2 730 76 78
## GarageFinish GarageCars GarageArea GarageQual GarageCond PoolQC
## 78 1 1 78 78 1456
## Fence MiscFeature SaleType
## 1169 1408 1
Variables with a considerable number of NA values are LotFrontage, Alley, MasVnrType, MasVnrArea, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, FireplaceQu, GarageType, GarageYrBlt, GarageFinish, GarageQual, GarageCond, PoolQC, Fence and MiscFeature. As they cannot be applied generally to every house, we exclude these variables from our analysis. In fact, some of them are correlated with other variables that may be included in our model.
In order to reduce the complexity of our model, we would exclude from our analysis the following types of variables:
we would focus on the following variables:
We start building our model with all selected variables:
reg = lm(SalePrice ~ GrLivArea + GarageArea + TotalBsmtSF + LotArea + FullBath
+ HalfBath + BedroomAbvGr + MSZoning + Neighborhood + OverallQual
+ OverallCond + SaleType + SaleCondition + Heating + CentralAir
+ Electrical + MiscVal, data = train)
summary(reg)
##
## Call:
## lm(formula = SalePrice ~ GrLivArea + GarageArea + TotalBsmtSF +
## LotArea + FullBath + HalfBath + BedroomAbvGr + MSZoning +
## Neighborhood + OverallQual + OverallCond + SaleType + SaleCondition +
## Heating + CentralAir + Electrical + MiscVal, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -457031 -14233 -889 12384 263733
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.338e+04 3.872e+04 -2.412 0.01599 *
## GrLivArea 4.957e+01 3.644e+00 13.606 < 2e-16 ***
## GarageArea 3.460e+01 5.822e+00 5.944 3.51e-09 ***
## TotalBsmtSF 2.163e+01 2.995e+00 7.223 8.30e-13 ***
## LotArea 4.858e-01 1.035e-01 4.691 2.98e-06 ***
## FullBath 3.949e+03 2.709e+03 1.458 0.14510
## HalfBath 3.151e+03 2.424e+03 1.300 0.19373
## BedroomAbvGr -6.452e+03 1.487e+03 -4.338 1.54e-05 ***
## MSZoningFV 1.078e+04 1.616e+04 0.667 0.50459
## MSZoningRH 1.764e+04 1.611e+04 1.094 0.27396
## MSZoningRL 2.456e+04 1.348e+04 1.822 0.06872 .
## MSZoningRM 1.547e+04 1.275e+04 1.213 0.22517
## NeighborhoodBlueste -3.734e+03 2.602e+04 -0.144 0.88590
## NeighborhoodBrDale -9.654e+03 1.309e+04 -0.738 0.46081
## NeighborhoodBrkSide 3.792e+03 1.032e+04 0.367 0.71340
## NeighborhoodClearCr 2.057e+04 1.109e+04 1.855 0.06383 .
## NeighborhoodCollgCr 1.612e+04 8.858e+03 1.819 0.06906 .
## NeighborhoodCrawfor 2.253e+04 1.002e+04 2.248 0.02476 *
## NeighborhoodEdwards -5.850e+03 9.478e+03 -0.617 0.53720
## NeighborhoodGilbert 1.227e+04 9.363e+03 1.310 0.19034
## NeighborhoodIDOTRR -7.724e+01 1.206e+04 -0.006 0.99489
## NeighborhoodMeadowV 4.746e+03 1.298e+04 0.366 0.71462
## NeighborhoodMitchel 6.006e+03 9.907e+03 0.606 0.54445
## NeighborhoodNAmes 1.947e+03 9.120e+03 0.214 0.83095
## NeighborhoodNoRidge 7.186e+04 1.023e+04 7.024 3.35e-12 ***
## NeighborhoodNPkVill -7.771e+03 1.415e+04 -0.549 0.58293
## NeighborhoodNridgHt 6.347e+04 9.254e+03 6.859 1.04e-11 ***
## NeighborhoodNWAmes 4.981e+02 9.534e+03 0.052 0.95834
## NeighborhoodOldTown -1.177e+04 1.032e+04 -1.141 0.25421
## NeighborhoodSawyer 4.403e+03 9.714e+03 0.453 0.65039
## NeighborhoodSawyerW 1.179e+04 9.612e+03 1.227 0.22012
## NeighborhoodSomerst 2.867e+04 1.130e+04 2.536 0.01131 *
## NeighborhoodStoneBr 6.754e+04 1.072e+04 6.299 4.00e-10 ***
## NeighborhoodSWISU -1.195e+04 1.145e+04 -1.043 0.29706
## NeighborhoodTimber 2.505e+04 1.023e+04 2.448 0.01449 *
## NeighborhoodVeenker 3.495e+04 1.354e+04 2.580 0.00998 **
## OverallQual 1.539e+04 1.174e+03 13.107 < 2e-16 ***
## OverallCond 5.995e+03 9.522e+02 6.296 4.09e-10 ***
## SaleTypeCon 4.686e+04 2.527e+04 1.854 0.06390 .
## SaleTypeConLD 1.521e+04 1.300e+04 1.170 0.24222
## SaleTypeConLI 2.161e+04 1.623e+04 1.331 0.18325
## SaleTypeConLw 6.304e+03 1.651e+04 0.382 0.70260
## SaleTypeCWD 2.451e+04 1.814e+04 1.351 0.17698
## SaleTypeNew 4.791e+04 2.125e+04 2.255 0.02428 *
## SaleTypeOth 3.834e+04 2.053e+04 1.868 0.06196 .
## SaleTypeWD 9.278e+03 5.724e+03 1.621 0.10528
## SaleConditionAdjLand 2.442e+04 1.858e+04 1.314 0.18897
## SaleConditionAlloca 1.040e+04 1.068e+04 0.974 0.33037
## SaleConditionFamily -2.812e+03 8.602e+03 -0.327 0.74377
## SaleConditionNormal 6.128e+03 3.909e+03 1.568 0.11712
## SaleConditionPartial -1.143e+04 2.055e+04 -0.556 0.57798
## HeatingGasA -1.665e+04 3.453e+04 -0.482 0.62967
## HeatingGasW -1.906e+04 3.530e+04 -0.540 0.58931
## HeatingGrav -1.687e+04 3.667e+04 -0.460 0.64557
## HeatingOthW -5.797e+04 4.214e+04 -1.376 0.16917
## HeatingWall 1.810e+03 3.853e+04 0.047 0.96253
## CentralAirY 4.224e+03 4.836e+03 0.873 0.38261
## ElectricalFuseF -2.532e+02 7.780e+03 -0.033 0.97404
## ElectricalFuseP -1.512e+04 2.121e+04 -0.713 0.47618
## ElectricalMix -2.209e+04 3.495e+04 -0.632 0.52754
## ElectricalSBrkr 9.412e+01 3.973e+03 0.024 0.98110
## MiscVal -6.809e-01 1.799e+00 -0.378 0.70518
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33800 on 1397 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.8267, Adjusted R-squared: 0.8191
## F-statistic: 109.2 on 61 and 1397 DF, p-value: < 2.2e-16
We then improve our model using backward elimination. The final result is as follows:
reg = lm(SalePrice ~ GrLivArea + GarageArea + TotalBsmtSF + LotArea + BedroomAbvGr
+ Neighborhood + OverallQual + OverallCond + SaleType, data = train)
summary(reg)
##
## Call:
## lm(formula = SalePrice ~ GrLivArea + GarageArea + TotalBsmtSF +
## LotArea + BedroomAbvGr + Neighborhood + OverallQual + OverallCond +
## SaleType, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -460124 -14042 -497 12815 263196
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.863e+04 1.242e+04 -6.329 3.30e-10 ***
## GrLivArea 5.234e+01 2.978e+00 17.577 < 2e-16 ***
## GarageArea 3.611e+01 5.703e+00 6.333 3.23e-10 ***
## TotalBsmtSF 2.050e+01 2.752e+00 7.447 1.65e-13 ***
## LotArea 4.949e-01 1.022e-01 4.843 1.42e-06 ***
## BedroomAbvGr -5.732e+03 1.437e+03 -3.990 6.94e-05 ***
## NeighborhoodBlueste -1.041e+04 2.554e+04 -0.407 0.68379
## NeighborhoodBrDale -1.945e+04 1.204e+04 -1.616 0.10641
## NeighborhoodBrkSide -4.986e+03 9.764e+03 -0.511 0.60972
## NeighborhoodClearCr 1.749e+04 1.099e+04 1.591 0.11181
## NeighborhoodCollgCr 1.520e+04 8.830e+03 1.721 0.08548 .
## NeighborhoodCrawfor 1.837e+04 9.863e+03 1.863 0.06267 .
## NeighborhoodEdwards -8.775e+03 9.314e+03 -0.942 0.34628
## NeighborhoodGilbert 1.311e+04 9.298e+03 1.410 0.15886
## NeighborhoodIDOTRR -1.852e+04 1.035e+04 -1.790 0.07368 .
## NeighborhoodMeadowV -5.623e+03 1.199e+04 -0.469 0.63925
## NeighborhoodMitchel 3.038e+03 9.832e+03 0.309 0.75738
## NeighborhoodNAmes -9.146e+02 8.949e+03 -0.102 0.91861
## NeighborhoodNoRidge 7.078e+04 1.015e+04 6.974 4.70e-12 ***
## NeighborhoodNPkVill -5.966e+03 1.409e+04 -0.423 0.67213
## NeighborhoodNridgHt 6.322e+04 9.234e+03 6.847 1.12e-11 ***
## NeighborhoodNWAmes -3.527e+02 9.482e+03 -0.037 0.97033
## NeighborhoodOldTown -2.553e+04 9.271e+03 -2.754 0.00596 **
## NeighborhoodSawyer 1.502e+03 9.578e+03 0.157 0.87539
## NeighborhoodSawyerW 1.081e+04 9.536e+03 1.134 0.25712
## NeighborhoodSomerst 1.869e+04 9.112e+03 2.052 0.04038 *
## NeighborhoodStoneBr 6.727e+04 1.071e+04 6.283 4.42e-10 ***
## NeighborhoodSWISU -1.787e+04 1.120e+04 -1.596 0.11073
## NeighborhoodTimber 2.443e+04 1.017e+04 2.402 0.01642 *
## NeighborhoodVeenker 3.442e+04 1.345e+04 2.559 0.01061 *
## OverallQual 1.547e+04 1.142e+03 13.549 < 2e-16 ***
## OverallCond 6.392e+03 8.992e+02 7.108 1.86e-12 ***
## SaleTypeCon 4.985e+04 2.517e+04 1.980 0.04784 *
## SaleTypeConLD 1.344e+04 1.254e+04 1.072 0.28381
## SaleTypeConLI 2.175e+04 1.615e+04 1.347 0.17818
## SaleTypeConLw 8.427e+03 1.614e+04 0.522 0.60178
## SaleTypeCWD 2.265e+04 1.787e+04 1.267 0.20527
## SaleTypeNew 3.467e+04 6.444e+03 5.380 8.69e-08 ***
## SaleTypeOth 3.752e+04 2.043e+04 1.836 0.06656 .
## SaleTypeWD 1.280e+04 5.382e+03 2.378 0.01753 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33830 on 1420 degrees of freedom
## Multiple R-squared: 0.8235, Adjusted R-squared: 0.8187
## F-statistic: 169.9 on 39 and 1420 DF, p-value: < 2.2e-16
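The elimination steps themselves are not shown above. One way to automate backward elimination in base R is `step()`, sketched here on the built-in `mtcars` data so that it runs without the housing files. This is an assumption about the procedure: `step()` drops terms by AIC rather than by p-value, so it may not reproduce the manual selection used for the model above.

```r
# Start from the full model and let step() drop terms one at a time
full <- lm(mpg ~ ., data = mtcars)

# direction = "backward" only removes terms; trace = 0 silences the log
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)                      # terms that survived elimination
c(full = AIC(full), reduced = AIC(reduced))
```

Applied to the housing model, the same call would start from the full `lm()` fit instead of `full`.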
Let’s check the normality of the residuals of the final model
hist(reg$residuals, breaks = 50)
plot(fitted(reg),resid(reg))
qqnorm(resid(reg))
qqline(resid(reg))
The distribution of the residuals is approximately normal, so it is appropriate to predict the sale price using our model.
Now let’s do our prediction using the test data. As we found above, there are some missing data in MSZoning, SaleType, GarageArea and TotalBsmtSF. We are going to replace the NA of the categorical variables by their modes and replace the NA of the numeric variables by 0.
Mode <- function(x, na.rm = FALSE) {
if(na.rm){
x = x[!is.na(x)]
}
ux <- unique(x)
return(ux[which.max(tabulate(match(x, ux)))])
}
test_modified <- test
# na.rm = TRUE ensures the mode is taken over the observed values only
test_modified$MSZoning[is.na(test_modified$MSZoning)] <- Mode(test_modified$MSZoning, na.rm = TRUE)
test_modified$SaleType[is.na(test_modified$SaleType)] <- Mode(test_modified$SaleType, na.rm = TRUE)
test_modified$GarageArea[is.na(test_modified$GarageArea)] <- 0
test_modified$TotalBsmtSF[is.na(test_modified$TotalBsmtSF)] <- 0
Predict the sale price
pred <- predict(reg, test_modified)
Save the result in a csv file for submission
result <- data.frame(Id = test$Id, SalePrice = pred)
write.csv(result,"result.csv", row.names = FALSE)
Final result of the submission
user name: euclidzhang
score: 0.20894
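For context on the score: assuming this is the Kaggle House Prices competition, submissions are evaluated by the root-mean-squared error between the logarithm of the predicted price and the logarithm of the observed price, so lower is better. A minimal sketch of that metric follows; `rmsle` is a hypothetical helper and the numbers are made up for illustration.

```r
# Root-mean-squared error on the log scale (assumed competition metric)
rmsle <- function(actual, predicted) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

# Made-up prices purely for illustration
actual    <- c(200000, 150000, 320000)
predicted <- c(210000, 140000, 300000)

rmsle(actual, predicted)
```

Because the metric works on the log scale, it penalizes relative rather than absolute errors, and it is undefined for non-positive predictions.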