library(dplyr)
library(MASS)
- Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right!
- Pick the dependent variable and define it as Y.
Let X = LotArea and Y = SalePrice.
data <- read.csv('train.csv')
house <- data[c('LotArea', 'SalePrice')]
X <- house$LotArea
Y <- house$SalePrice
hist(X, breaks=50, main='Histogram of Lot Area in Square Feet')
summary(house)##    LotArea         SalePrice
## Min. : 1300 Min. : 34900
## 1st Qu.: 7554 1st Qu.:129975
## Median : 9478 Median :163000
## Mean : 10517 Mean :180921
## 3rd Qu.: 11602 3rd Qu.:214000
## Max. :215245 Max. :755000
From the histogram and summary, you can see that LotArea is highly skewed to the right.
Probability
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 1st quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.
a. \(P(X>x | Y>y)\)
Interpretation: This is the conditional probability that LotArea is greater than its 1st quartile, given that SalePrice is greater than its 1st quartile. We select from the data all rows where SalePrice is greater than its 1st quartile, and then, from that subset, select the rows where LotArea is greater than its 1st quartile. The probability is the ratio of the two row counts.
# 1st quartile of X and Y
(x_1stQ <- summary(X)[2])## 1st Qu.
## 7553.5
(y_1stQ <- summary(Y)[2])## 1st Qu.
## 129975
# Select all rows Y>y, then from which select all rows X>x
condition <- subset(house, SalePrice > y_1stQ)
event <- subset(condition, LotArea > x_1stQ)
# Checking:
min(condition$SalePrice) > y_1stQ## 1st Qu.
## TRUE
min(event$LotArea) > x_1stQ## 1st Qu.
## TRUE
# Solve probability:
(prob_a <- nrow(event) / nrow(condition))## [1] 0.8200913
b. \(P(X>x, Y>y)\)
Interpretation: This is the joint probability that LotArea and SalePrice are both greater than their respective 1st quartiles. We select from the house data all rows where this occurs. The probability is that count divided by the total number of rows.
# Select all rows X>x and Y>y:
joint <- subset(house, LotArea > x_1stQ & SalePrice > y_1stQ)
# Checking:
nrow(subset(joint, LotArea < x_1stQ)) == 0## [1] TRUE
nrow(subset(joint, SalePrice < y_1stQ)) == 0## [1] TRUE
# Solve probability:
(prob_b <- nrow(joint) / nrow(house))## [1] 0.6150685
c. \(P(X<x | Y>y)\)
Interpretation: This is the conditional probability that LotArea is less than its 1st quartile, given that SalePrice is greater than its 1st quartile. We select from the data all rows where SalePrice is greater than its 1st quartile, and then, from that subset, select the rows where LotArea is less than its 1st quartile. The probability is the ratio of the two row counts.
# Select all rows Y>y, then from which select all rows X<x
condition <- subset(house, SalePrice > y_1stQ)
event <- subset(condition, LotArea < x_1stQ)
# Checking:
min(condition$SalePrice) > y_1stQ## 1st Qu.
## TRUE
max(event$LotArea) < x_1stQ## 1st Qu.
## TRUE
# Solve probability:
(prob_c <- nrow(event) / nrow(condition))## [1] 0.1799087
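As a quick consistency check (a sketch using the objects already defined above): since no LotArea value equals the quartile cutoff 7553.5, the events in (a) and (c) are complementary and their probabilities should sum to 1; also, (a) should equal the joint probability in (b) divided by P(Y>y) = 0.75.
# Consistency checks on probabilities a, b and c
prob_a + prob_c                                          # should equal 1
prob_b / (sum(house$SalePrice > y_1stQ) / nrow(house))   # should equal prob_a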
Analysis
Compute the joint counts for the four cells of the table:
# X<=x, Y<=y
joint <- subset(house, LotArea <= x_1stQ & SalePrice <= y_1stQ)
row1_col1 <- nrow(joint)
# X>x, Y<=y
joint <- subset(house, LotArea > x_1stQ & SalePrice <= y_1stQ)
row1_col2 <- nrow(joint)
# X<=x, Y>y
joint <- subset(house, LotArea <= x_1stQ & SalePrice > y_1stQ)
row2_col1 <- nrow(joint)
# X>x, Y>y
joint <- subset(house, LotArea > x_1stQ & SalePrice > y_1stQ)
row2_col2 <- nrow(joint)
Create the contingency table:
observ_count <- data.frame(c(row1_col1, row2_col1), c(row1_col2, row2_col2))
# Compute row and column totals
observ_count[3,] = observ_count[1,] + observ_count[2,]
observ_count[,3] = observ_count[,1] + observ_count[,2]
names(observ_count) <- c('X<=x', 'X>x', 'Total')
rownames(observ_count) <- c('Y<=y', 'Y>y', 'Total')
# Calculate the probabilities by dividing by the total number of rows:
prob <- observ_count / nrow(house)
round(prob, 4)## X<=x X>x Total
## Y<=y 0.1151 0.1349 0.25
## Y>y 0.1349 0.6151 0.75
## Total 0.2500 0.7500 1.00
Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 1st quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y. Does P(AB)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
From the contingency table above, P(A) = 0.75, P(B) = 0.75, and P(AB) = 0.6151, which is not equal to P(A)P(B) = 0.75 × 0.75 = 0.5625. Thus, mathematically, they are not independent.
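The same check can be scripted directly in R; below is a minimal sketch using the quartile cutoffs defined earlier.
# Compare P(AB) with P(A)P(B) using indicator means
p_A <- mean(house$LotArea > x_1stQ)     # P(A) = 0.75
p_B <- mean(house$SalePrice > y_1stQ)   # P(B) = 0.75
p_AB <- mean(house$LotArea > x_1stQ & house$SalePrice > y_1stQ)
c(P_AB = p_AB, PA_times_PB = p_A * p_B) # 0.6151 vs 0.5625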
The null and alternative hypotheses for the Chi Square test are:
\(H_0\): The variables X and Y are independent.
\(H_a\): They are not independent.
The observed-counts table below is used for the Chi Square test:
observ_count## X<=x X>x Total
## Y<=y 168 197 365
## Y>y 197 898 1095
## Total 365 1095 1460
Below is the table for expected counts, calculated as \(\frac { row\quad total\quad \times \quad column\quad total }{ overall\quad total }\):
row1_col1 <- observ_count[1,3] * observ_count[3,1] / observ_count[3,3]
row1_col2 <- observ_count[1,3] * observ_count[3,2] / observ_count[3,3]
row2_col1 <- observ_count[2,3] * observ_count[3,1] / observ_count[3,3]
row2_col2 <- observ_count[2,3] * observ_count[3,2] / observ_count[3,3]
expec_count <- data.frame(c(row1_col1, row2_col1), c(row1_col2, row2_col2))
names(expec_count) <- c('X<=x', 'X>x')
rownames(expec_count) <- c('Y<=y', 'Y>y')
round(expec_count, 2)## X<=x X>x
## Y<=y 91.25 273.75
## Y>y 273.75 821.25
Calculate \({ \chi }^{ 2 }=\sum { \frac { { (observed-expected) }^{ 2 } }{ expected } }\)
(cell_cal <- (observ_count[1:2, 1:2] - expec_count) ^2 / expec_count) ## X<=x X>x
## Y<=y 64.55411 21.518037
## Y>y 21.51804 7.172679
chi_square <- sum(cell_cal)
chi_square## [1] 114.7629
Using R to calculate the p-value, with DF = (2-1)(2-1) = 1.
p <- pchisq(chi_square, 1, lower.tail=F)
p## [1] 8.869425e-27
This value is much less than 0.05. Therefore the null hypothesis can be rejected: X and Y are not independent.
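As a sanity check, the same test can be run with R's built-in chisq.test (a sketch; correct = FALSE turns off Yates' continuity correction so the statistic matches the manual calculation above).
# Built-in chi-square test on the 2x2 observed counts (excluding the Total row and column)
chisq.test(as.matrix(observ_count[1:2, 1:2]), correct = FALSE)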
Descriptive and Inferential Statistics
Provide univariate descriptive statistics and appropriate plots for the training data set.
Below are the summary statistics for each variable in the training data set. If the variable is numeric, the summary shows its minimum, maximum, median, and mean values, as well as its 1st and 3rd quartiles. If the variable is categorical, it shows the counts in each category.
summary(data[2:dim(data)[2]])## MSSubClass MSZoning LotFrontage LotArea
## Min. : 20.0 C (all): 10 Min. : 21.00 Min. : 1300
## 1st Qu.: 20.0 FV : 65 1st Qu.: 59.00 1st Qu.: 7554
## Median : 50.0 RH : 16 Median : 69.00 Median : 9478
## Mean : 56.9 RL :1151 Mean : 70.05 Mean : 10517
## 3rd Qu.: 70.0 RM : 218 3rd Qu.: 80.00 3rd Qu.: 11602
## Max. :190.0 Max. :313.00 Max. :215245
## NA's :259
## Street Alley LotShape LandContour Utilities
## Grvl: 6 Grvl: 50 IR1:484 Bnk: 63 AllPub:1459
## Pave:1454 Pave: 41 IR2: 41 HLS: 50 NoSeWa: 1
## NA's:1369 IR3: 10 Low: 36
## Reg:925 Lvl:1311
##
##
##
## LotConfig LandSlope Neighborhood Condition1 Condition2
## Corner : 263 Gtl:1382 NAmes :225 Norm :1260 Norm :1445
## CulDSac: 94 Mod: 65 CollgCr:150 Feedr : 81 Feedr : 6
## FR2 : 47 Sev: 13 OldTown:113 Artery : 48 Artery : 2
## FR3 : 4 Edwards:100 RRAn : 26 PosN : 2
## Inside :1052 Somerst: 86 PosN : 19 RRNn : 2
## Gilbert: 79 RRAe : 11 PosA : 1
## (Other):707 (Other): 15 (Other): 2
## BldgType HouseStyle OverallQual OverallCond
## 1Fam :1220 1Story :726 Min. : 1.000 Min. :1.000
## 2fmCon: 31 2Story :445 1st Qu.: 5.000 1st Qu.:5.000
## Duplex: 52 1.5Fin :154 Median : 6.000 Median :5.000
## Twnhs : 43 SLvl : 65 Mean : 6.099 Mean :5.575
## TwnhsE: 114 SFoyer : 37 3rd Qu.: 7.000 3rd Qu.:6.000
## 1.5Unf : 14 Max. :10.000 Max. :9.000
## (Other): 19
## YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1872 Min. :1950 Flat : 13 CompShg:1434 VinylSd:515
## 1st Qu.:1954 1st Qu.:1967 Gable :1141 Tar&Grv: 11 HdBoard:222
## Median :1973 Median :1994 Gambrel: 11 WdShngl: 6 MetalSd:220
## Mean :1971 Mean :1985 Hip : 286 WdShake: 5 Wd Sdng:206
## 3rd Qu.:2000 3rd Qu.:2004 Mansard: 7 ClyTile: 1 Plywood:108
## Max. :2010 Max. :2010 Shed : 2 Membran: 1 CemntBd: 61
## (Other): 2 (Other):128
## Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond
## VinylSd:504 BrkCmn : 15 Min. : 0.0 Ex: 52 Ex: 3
## MetalSd:214 BrkFace:445 1st Qu.: 0.0 Fa: 14 Fa: 28
## HdBoard:207 None :864 Median : 0.0 Gd:488 Gd: 146
## Wd Sdng:197 Stone :128 Mean : 103.7 TA:906 Po: 1
## Plywood:142 NA's : 8 3rd Qu.: 166.0 TA:1282
## CmentBd: 60 Max. :1600.0
## (Other):136 NA's :8
## Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1
## BrkTil:146 Ex :121 Fa : 45 Av :221 ALQ :220
## CBlock:634 Fa : 35 Gd : 65 Gd :134 BLQ :148
## PConc :647 Gd :618 Po : 2 Mn :114 GLQ :418
## Slab : 24 TA :649 TA :1311 No :953 LwQ : 74
## Stone : 6 NA's: 37 NA's: 37 NA's: 38 Rec :133
## Wood : 3 Unf :430
## NA's: 37
## BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF
## Min. : 0.0 ALQ : 19 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.0 BLQ : 33 1st Qu.: 0.00 1st Qu.: 223.0
## Median : 383.5 GLQ : 14 Median : 0.00 Median : 477.5
## Mean : 443.6 LwQ : 46 Mean : 46.55 Mean : 567.2
## 3rd Qu.: 712.2 Rec : 54 3rd Qu.: 0.00 3rd Qu.: 808.0
## Max. :5644.0 Unf :1256 Max. :1474.00 Max. :2336.0
## NA's: 38
## TotalBsmtSF Heating HeatingQC CentralAir Electrical
## Min. : 0.0 Floor: 1 Ex:741 N: 95 FuseA: 94
## 1st Qu.: 795.8 GasA :1428 Fa: 49 Y:1365 FuseF: 27
## Median : 991.5 GasW : 18 Gd:241 FuseP: 3
## Mean :1057.4 Grav : 7 Po: 1 Mix : 1
## 3rd Qu.:1298.2 OthW : 2 TA:428 SBrkr:1334
## Max. :6110.0 Wall : 4 NA's : 1
##
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## Min. : 334 Min. : 0 Min. : 0.000 Min. : 334
## 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130
## Median :1087 Median : 0 Median : 0.000 Median :1464
## Mean :1163 Mean : 347 Mean : 5.845 Mean :1515
## 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777
## Max. :4692 Max. :2065 Max. :572.000 Max. :5642
##
## BsmtFullBath BsmtHalfBath FullBath HalfBath
## Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :2.000 Median :0.0000
## Mean :0.4253 Mean :0.05753 Mean :1.565 Mean :0.3829
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :3.0000 Max. :2.00000 Max. :3.000 Max. :2.0000
##
## BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Min. :0.000 Ex:100 Min. : 2.000 Maj1: 14
## 1st Qu.:2.000 1st Qu.:1.000 Fa: 39 1st Qu.: 5.000 Maj2: 5
## Median :3.000 Median :1.000 Gd:586 Median : 6.000 Min1: 31
## Mean :2.866 Mean :1.047 TA:735 Mean : 6.518 Min2: 34
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000 Mod : 15
## Max. :8.000 Max. :3.000 Max. :14.000 Sev : 1
## Typ :1360
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish
## Min. :0.000 Ex : 24 2Types : 6 Min. :1900 Fin :352
## 1st Qu.:0.000 Fa : 33 Attchd :870 1st Qu.:1961 RFn :422
## Median :1.000 Gd :380 Basment: 19 Median :1980 Unf :605
## Mean :0.613 Po : 20 BuiltIn: 88 Mean :1979 NA's: 81
## 3rd Qu.:1.000 TA :313 CarPort: 9 3rd Qu.:2002
## Max. :3.000 NA's:690 Detchd :387 Max. :2010
## NA's : 81 NA's :81
## GarageCars GarageArea GarageQual GarageCond PavedDrive
## Min. :0.000 Min. : 0.0 Ex : 3 Ex : 2 N: 90
## 1st Qu.:1.000 1st Qu.: 334.5 Fa : 48 Fa : 35 P: 30
## Median :2.000 Median : 480.0 Gd : 14 Gd : 9 Y:1340
## Mean :1.767 Mean : 473.0 Po : 3 Po : 7
## 3rd Qu.:2.000 3rd Qu.: 576.0 TA :1311 TA :1326
## Max. :4.000 Max. :1418.0 NA's: 81 NA's: 81
##
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 25.00 Median : 0.00 Median : 0.00
## Mean : 94.24 Mean : 46.66 Mean : 21.95 Mean : 3.41
## 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :857.00 Max. :547.00 Max. :552.00 Max. :508.00
##
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## Min. : 0.00 Min. : 0.000 Ex : 2 GdPrv: 59 Gar2: 2
## 1st Qu.: 0.00 1st Qu.: 0.000 Fa : 2 GdWo : 54 Othr: 2
## Median : 0.00 Median : 0.000 Gd : 3 MnPrv: 157 Shed: 49
## Mean : 15.06 Mean : 2.759 NA's:1453 MnWw : 11 TenC: 1
## 3rd Qu.: 0.00 3rd Qu.: 0.000 NA's :1179 NA's:1406
## Max. :480.00 Max. :738.000
##
## MiscVal MoSold YrSold SaleType
## Min. : 0.00 Min. : 1.000 Min. :2006 WD :1267
## 1st Qu.: 0.00 1st Qu.: 5.000 1st Qu.:2007 New : 122
## Median : 0.00 Median : 6.000 Median :2008 COD : 43
## Mean : 43.49 Mean : 6.322 Mean :2008 ConLD : 9
## 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009 ConLI : 5
## Max. :15500.00 Max. :12.000 Max. :2010 ConLw : 5
## (Other): 9
## SaleCondition SalePrice
## Abnorml: 101 Min. : 34900
## AdjLand: 4 1st Qu.:129975
## Alloca : 12 Median :163000
## Family : 20 Mean :180921
## Normal :1198 3rd Qu.:214000
## Partial: 125 Max. :755000
##
Below are plots for each variable in the training data set. If the variable is numeric, a histogram is plotted; if the variable is categorical, the frequency of each category is plotted as a bar chart.
par(mfrow=c(2,3))
for (var in names(data)[2:dim(data)[2]]){
if (is.numeric(data[,var])){
hist(data[,var], main=var, xlab='')
}else{
plot(data[,var], main=var)
}
}
Lastly, boxplots are shown below for each numeric variable:
par(mfrow=c(2,4))
for (var in names(data)[2:dim(data)[2]]){
if (is.numeric(data[,var])){
boxplot(data[,var], main=var, xlab='')
}
}
Provide a scatterplot of X and Y.
plot(x=X/1000, y=Y/1000, xlab='Lot Area in Thousands Square Feet', ylab='Sale Price in Thousands USD', main='Lot Area vs Sale Price')
Derive a correlation matrix for any THREE quantitative variables in the dataset.
correlations <- cor(data[, c('X1stFlrSF', 'LotArea', 'SalePrice')])
round(correlations, 4)## X1stFlrSF LotArea SalePrice
## X1stFlrSF 1.0000 0.2995 0.6059
## LotArea 0.2995 1.0000 0.2638
## SalePrice 0.6059 0.2638 1.0000
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide a 92% confidence interval.
\(H_0\): The correlation coefficient is 0. In other words, any non-zero correlation observed in the sample is due to chance.
\(H_a\): The correlation coefficient is not 0.
The test is a two-tailed t-test at the 92% confidence level, so \(\alpha = 1 - 0.92 = 0.08\), with 0.04 in each tail.
The test statistic is calculated as \(t=\frac { r\sqrt { n-2 } }{ \sqrt { 1-{ r }^{ 2 } } }\)
Where r = correlation coefficient, and n = number of observations = 1460.
The correlation between X1stFlrSF and LotArea is 0.2994746.
Using R's built-in pt function to find the p-value, with df = n - 2:
r <- correlations[1,2]
n <- nrow(data)
t <- r*sqrt(n-2) / sqrt(1-r^2)
p <- pt(t, n-2, lower.tail=F) * 2
p## [1] 1.234238e-31
This p-value is far smaller than \(\alpha=0.08\). Therefore the null hypothesis can be rejected: the correlation coefficient is not zero.
The correlation between X1stFlrSF and SalePrice is 0.6058522.
r <- correlations[1,3]
t <- r*sqrt(n-2) / sqrt(1-r^2)
p <- pt(t, n-2, lower.tail=F) * 2
p## [1] 5.394711e-147
This p-value is far smaller than \(\alpha=0.08\). Therefore the null hypothesis can be rejected: the correlation coefficient is not zero, meaning the variables are not independent.
The correlation between LotArea and SalePrice is 0.2638434.
r <- correlations[2,3]
t <- r*sqrt(n-2) / sqrt(1-r^2)
p <- pt(t, n-2, lower.tail=F) * 2
p## [1] 1.123139e-24
This p-value is far smaller than \(\alpha=0.08\). Therefore the null hypothesis can be rejected: the correlation coefficient is not zero.
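The assignment also asks for 92% confidence intervals for each correlation. A minimal sketch using R's cor.test with conf.level = 0.92 (its interval is based on Fisher's z transformation) would be:
# 92% confidence intervals for the three pairwise correlations
cor.test(data$X1stFlrSF, data$LotArea, conf.level = 0.92)$conf.int
cor.test(data$X1stFlrSF, data$SalePrice, conf.level = 0.92)$conf.int
cor.test(data$LotArea, data$SalePrice, conf.level = 0.92)$conf.int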
Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
The analysis above suggests that the correlations of all three pairs are non-zero, i.e. they are all significant. However, I am worried about familywise error, because I have just run three statistical tests in parallel, which invites the multiple comparison problem. Also called the “look-elsewhere effect”, the problem arises when many tests are run in search of significance: some results may appear significant purely because many tests were tried, i.e. by chance. A Type I error may then result, where a null hypothesis that is actually true is rejected simply because significance was found by chance.
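One common guard against familywise error is to adjust the p-values for multiple comparisons, for example with a Bonferroni correction. Below is a sketch using R's p.adjust on the three p-values reported above; here even the adjusted values remain far below any reasonable \(\alpha\), so the conclusions would not change.
# Bonferroni adjustment of the three correlation-test p-values reported above
p_values <- c(1.234238e-31, 5.394711e-147, 1.123139e-24)
p.adjust(p_values, method = 'bonferroni')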
Linear Algebra and Correlation
Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
precision <- solve(correlations)
precision## X1stFlrSF LotArea SalePrice
## X1stFlrSF 1.6340150 -0.2452191 -0.9252722
## LotArea -0.2452191 1.1116223 -0.1447277
## SalePrice -0.9252722 -0.1447277 1.5987636
correlations %*% precision## X1stFlrSF LotArea SalePrice
## X1stFlrSF 1.000000e+00 0 0
## LotArea -5.551115e-17 1 0
## SalePrice -1.110223e-16 0 1
precision %*% correlations## X1stFlrSF LotArea SalePrice
## X1stFlrSF 1.000000e+00 -2.775558e-17 -1.110223e-16
## LotArea -2.775558e-17 1.000000e+00 -2.775558e-17
## SalePrice 0.000000e+00 0.000000e+00 1.000000e+00
The function below was written in Assignment 2 to conduct LU decomposition:
lu_decomp <- function(A) {
rows <- nrow(A)
L <- diag(rows)
U <- A
# Row operation to get row echelon form
for (i in 1:(rows-1)){
# Pivot check: if the pivot is zero, LU factorization without row permutation fails
if (U[i,i] == 0){
print("### Cannot be factorized without permuting rows. ###")
break
}
for(j in (i+1):rows) {
coefficient <- - U[j,i] / U[i,i]
# Update U by updating the rows using the coefficient
U[j,] <- U[j,] + coefficient * U[i,]
# Update L by taking the opposite of coefficient
L[j,i] <- -coefficient
}
print(paste('U in Step', i, ":"))
print(U)
print(paste('L in Step', i, ":"))
print(L)
print("-----------------------------")
}
return(list(L, U))
}
result <- lu_decomp(precision)## [1] "U in Step 1 :"
## X1stFlrSF LotArea SalePrice
## X1stFlrSF 1.634015 -0.2452191 -0.9252722
## LotArea 0.000000 1.0748219 -0.2835846
## SalePrice 0.000000 -0.2835846 1.0748219
## [1] "L in Step 1 :"
## [,1] [,2] [,3]
## [1,] 1.0000000 0 0
## [2,] -0.1500715 1 0
## [3,] -0.5662568 0 1
## [1] "-----------------------------"
## [1] "U in Step 2 :"
## X1stFlrSF LotArea SalePrice
## X1stFlrSF 1.634015 -0.2452191 -0.9252722
## LotArea 0.000000 1.0748219 -0.2835846
## SalePrice 0.000000 0.0000000 1.0000000
## [1] "L in Step 2 :"
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0
## [2,] -0.1500715 1.0000000 0
## [3,] -0.5662568 -0.2638434 1
## [1] "-----------------------------"
L <- result[[1]]
U <- result[[2]]
L %*% U## X1stFlrSF LotArea SalePrice
## [1,] 1.6340150 -0.2452191 -0.9252722
## [2,] -0.2452191 1.1116223 -0.1447277
## [3,] -0.9252722 -0.1447277 1.5987636
So the L and U are:
L## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0
## [2,] -0.1500715 1.0000000 0
## [3,] -0.5662568 -0.2638434 1
U## X1stFlrSF LotArea SalePrice
## X1stFlrSF 1.634015 -0.2452191 -0.9252722
## LotArea 0.000000 1.0748219 -0.2835846
## SalePrice 0.000000 0.0000000 1.0000000
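As a cross-check (a sketch assuming the Matrix package is available), the lu() function from Matrix performs an LU decomposition with partial pivoting; because it may permute rows, its factors can differ from the hand-rolled ones above by a permutation matrix.
# Cross-check with the Matrix package's LU decomposition
library(Matrix)
decomp <- expand(lu(Matrix(precision)))
decomp$L   # lower-triangular factor
decomp$U   # upper-triangular factor
decomp$P   # permutation matrix; applying it to L %*% U recovers the precision matrix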
Calculus-Based Probability & Statistics
Many times, it makes sense to fit a closed form distribution to data. For the first variable that you selected which is skewed to the right, shift it so that the minimum value is above zero as necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
The first variable selected was the LotArea.
summary(X)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
The minimum value for this variable is already above zero.
fitting <- fitdistr(X, 'exponential')
rate <- fitting$estimate
rate## rate
## 9.50857e-05
The optimal value of \(\lambda\) for this distribution is \(9.50857 \times 10^{-5}\).
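The maximum-likelihood estimate of an exponential rate is simply the reciprocal of the sample mean, so the fitted value can be verified directly:
# The exponential rate MLE equals 1 / mean(X)
1 / mean(X)   # should match fitting$estimate, about 9.50857e-05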
set.seed(1)
Xt <- rexp(1000, rate)
hist(X, breaks = 100, main='Histogram of Original Data')
hist(Xt, breaks = 50, main='Histogram of Exponential Distribution')
The two histograms are only similar in the sense that they are both unimodal and skewed to the right. They are otherwise very different. The original distribution is somewhat bell-shaped, with a long tail trailing to the right. The exponential sample, on the other hand, peaks immediately near zero - there is no resemblance of a bell shape - and then decreases gradually to the right.
The cumulative distribution function for the exponential distribution is:
\(P(x)=1-{ e }^{ -\lambda x }\)
Solving x for a given P:
\(x=\frac { ln(1-p) }{ -\lambda }\)
R function qexp can also be used to check.
For 5th percentile:
P <- 0.05
log(1-P)/-rate## rate
## 539.4428
qexp(P, rate)## [1] 539.4428
For 95th percentile:
P <- 0.95
log(1-P)/-rate## rate
## 31505.6
qexp(P, rate)## [1] 31505.6
Assuming normality for the empirical data (original data), and using Z = 1.96 for 95% confidence interval, the formula is: \(\bar { x } \pm Z\cdot \frac { \sigma }{ \sqrt { n } }\)
Z <- 1.96
x_bar <- mean(X)
std <- sd(X)
n <- length(X)
upper <- x_bar + Z * std / sqrt(n)
lower <- x_bar - Z * std / sqrt(n)
lower## [1] 10004.83
upper## [1] 11028.82
The 95% confidence interval is (10004.83, 11028.82).
The 5th and 95th percentiles of the empirical data, using R's quantile function, are:
quantile(X, c(0.05, 0.95))## 5% 95%
## 3311.70 17401.15
The actual 5th and 95th percentiles observed in data are:
X.sorted <- sort(X)
X.sorted[round(0.05*length(X))]## [1] 3230
X.sorted[round(0.95*length(X))]## [1] 17400
They are very close to the values calculated by the quantile function.
The variable chosen is LotArea. Its empirical 95th percentile is 17,400, meaning that 95% of the 1460 lots are under 17,400 sq ft. Notice the large disparity between the empirical percentiles and those of the fitted exponential distribution; this suggests the exponential distribution is not a good fit for this data. The 95% confidence interval built above is a confidence interval for the mean: if we repeatedly drew samples of the same size (1460) from the houses in this area and built an interval this way each time, about 95% of those intervals would contain the true mean lot area; the interval from this sample is (10004.83, 11028.82). However, for data this skewed, the median is a much better descriptor than the mean.
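As an aside (a sketch only, not part of the original analysis), MASS::fitdistr can also fit a log-normal distribution, which usually tracks a right-skewed variable like LotArea more closely than the exponential; comparing its 5th and 95th percentiles with the empirical ones makes that concrete.
# Fit a log-normal distribution and compare its percentiles with the empirical data
fit_ln <- fitdistr(X, 'lognormal')
qlnorm(c(0.05, 0.95), meanlog = fit_ln$estimate['meanlog'], sdlog = fit_ln$estimate['sdlog'])
quantile(X, c(0.05, 0.95))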
Modeling
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
I will build a multiple regression model to predict the SalePrice of a house, using only the numeric variables in the data set as the explanatory variables.
First I find out which columns are numeric:
data <- read.csv('train.csv')
numeric_cols <- sapply(data, is.numeric)
Next, I calculate the correlations between all the numeric variables, order the correlation matrix by the SalePrice column, and extract that column. The result is a vector of the numeric variables' correlations with SalePrice, listed in decreasing order.
correlations <- as.data.frame(cor(data[numeric_cols]))
correlations <- correlations[order(correlations$SalePrice, decreasing=T),]
corr_saleprice <- correlations$SalePrice
names(corr_saleprice) <- rownames(correlations)
corr_saleprice## SalePrice OverallQual GrLivArea GarageCars GarageArea
## 1.00000000 0.79098160 0.70862448 0.64040920 0.62343144
## TotalBsmtSF X1stFlrSF FullBath TotRmsAbvGrd YearBuilt
## 0.61358055 0.60585218 0.56066376 0.53372316 0.52289733
## YearRemodAdd Fireplaces BsmtFinSF1 WoodDeckSF X2ndFlrSF
## 0.50710097 0.46692884 0.38641981 0.32441344 0.31933380
## OpenPorchSF HalfBath LotArea BsmtFullBath BsmtUnfSF
## 0.31585623 0.28410768 0.26384335 0.22712223 0.21447911
## BedroomAbvGr ScreenPorch PoolArea MoSold X3SsnPorch
## 0.16821315 0.11144657 0.09240355 0.04643225 0.04458367
## BsmtFinSF2 BsmtHalfBath MiscVal Id LowQualFinSF
## -0.01137812 -0.01684415 -0.02118958 -0.02191672 -0.02560613
## YrSold OverallCond MSSubClass EnclosedPorch KitchenAbvGr
## -0.02892259 -0.07785589 -0.08428414 -0.12857796 -0.13590737
## LotFrontage MasVnrArea GarageYrBlt
## NA NA NA
I will build the first model using all of the variables in this list whose correlation with SalePrice is higher than 0.2 (an admittedly arbitrary cutoff). The last variable that satisfies this is BsmtUnfSF, so I select the variables from OverallQual through BsmtUnfSF.
Notice that OverallQual is actually an ordinal variable: its values rate the overall quality of the house from 1 to 10. I will handle it by including polynomial terms, i.e. OverallQual^2 + OverallQual^3 + … + OverallQual^9 in addition to OverallQual itself.
col_names <- c()
for (i in 2:9){
new_col <- data$OverallQual ^ i
data <- cbind(data, new_col)
col_name <- paste('OverallQual', i, sep='.')
names(data)[dim(data)[2]] <- col_name
col_names <- c(col_names, col_name)
}
col_names## [1] "OverallQual.2" "OverallQual.3" "OverallQual.4" "OverallQual.5"
## [5] "OverallQual.6" "OverallQual.7" "OverallQual.8" "OverallQual.9"
I now select the column names for the explanatory variables, and create the linear model using the lm function:
Y <- 'SalePrice'
X <- names(corr_saleprice)[2:which(names(corr_saleprice)=='BsmtUnfSF')]
X <- c(X, col_names)
X## [1] "OverallQual" "GrLivArea" "GarageCars" "GarageArea"
## [5] "TotalBsmtSF" "X1stFlrSF" "FullBath" "TotRmsAbvGrd"
## [9] "YearBuilt" "YearRemodAdd" "Fireplaces" "BsmtFinSF1"
## [13] "WoodDeckSF" "X2ndFlrSF" "OpenPorchSF" "HalfBath"
## [17] "LotArea" "BsmtFullBath" "BsmtUnfSF" "OverallQual.2"
## [21] "OverallQual.3" "OverallQual.4" "OverallQual.5" "OverallQual.6"
## [25] "OverallQual.7" "OverallQual.8" "OverallQual.9"
formula_str <- paste(X, collapse = '+')
formula_str <- paste(Y, formula_str, sep = '~')
formula <- as.formula(formula_str)
formula## SalePrice ~ OverallQual + GrLivArea + GarageCars + GarageArea +
## TotalBsmtSF + X1stFlrSF + FullBath + TotRmsAbvGrd + YearBuilt +
## YearRemodAdd + Fireplaces + BsmtFinSF1 + WoodDeckSF + X2ndFlrSF +
## OpenPorchSF + HalfBath + LotArea + BsmtFullBath + BsmtUnfSF +
## OverallQual.2 + OverallQual.3 + OverallQual.4 + OverallQual.5 +
## OverallQual.6 + OverallQual.7 + OverallQual.8 + OverallQual.9
model <- lm(formula, data=data)
Below is the model summary for the first model:
summary(model)##
## Call:
## lm(formula = formula, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -508858 -14066 -352 12821 252370
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.204e+06 1.572e+06 -0.766 0.4437
## OverallQual 2.676e+05 4.025e+06 0.066 0.9470
## GrLivArea 1.168e+01 1.869e+01 0.625 0.5323
## GarageCars 1.201e+04 2.741e+03 4.382 1.26e-05 ***
## GarageArea -6.221e-01 9.326e+00 -0.067 0.9468
## TotalBsmtSF 1.878e+01 6.711e+00 2.799 0.0052 **
## X1stFlrSF 2.831e+01 1.907e+01 1.484 0.1379
## FullBath 2.276e+03 2.658e+03 0.856 0.3919
## TotRmsAbvGrd 6.078e+02 1.020e+03 0.596 0.5514
## YearBuilt 2.289e+02 4.903e+01 4.668 3.33e-06 ***
## YearRemodAdd 3.392e+02 5.706e+01 5.944 3.48e-09 ***
## Fireplaces 8.928e+03 1.671e+03 5.343 1.06e-07 ***
## BsmtFinSF1 2.270e+00 5.793e+00 0.392 0.6953
## WoodDeckSF 1.670e+01 7.535e+00 2.217 0.0268 *
## X2ndFlrSF 2.967e+01 1.889e+01 1.570 0.1165
## OpenPorchSF -5.733e+00 1.444e+01 -0.397 0.6913
## HalfBath 1.262e+03 2.552e+03 0.495 0.6210
## LotArea 5.054e-01 9.596e-02 5.267 1.60e-07 ***
## BsmtFullBath 5.499e+03 2.364e+03 2.326 0.0202 *
## BsmtUnfSF -9.130e+00 5.885e+00 -1.551 0.1210
## OverallQual.2 -2.314e+05 4.113e+06 -0.056 0.9551
## OverallQual.3 8.590e+04 2.255e+06 0.038 0.9696
## OverallQual.4 -1.106e+04 7.437e+05 -0.015 0.9881
## OverallQual.5 -1.629e+03 1.548e+05 -0.011 0.9916
## OverallQual.6 7.455e+02 2.049e+04 0.036 0.9710
## OverallQual.7 -1.036e+02 1.674e+03 -0.062 0.9506
## OverallQual.8 6.696e+00 7.690e+01 0.087 0.9306
## OverallQual.9 -1.704e-01 1.520e+00 -0.112 0.9107
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33350 on 1432 degrees of freedom
## Multiple R-squared: 0.827, Adjusted R-squared: 0.8238
## F-statistic: 253.6 on 27 and 1432 DF, p-value: < 2.2e-16
Next, I perform backward elimination to remove variables. The variable with the maximum p-value is removed from the model one at a time, as long as removing it increases the adjusted R-squared of the new model and its p-value is higher than 0.05.
Two wrapper functions are written to perform the backward elimination automatically.
linear_model <- function(exp_var_vec, dep_var, data){
# This function creates a linear model with the given variables and data
# and returns the model, the model's adjusted R-squared, the model's
# explanatory variable with the maximum Pr(>|t|), and the corresponding
# Pr(>|t|).
# Take the exp_var string vector and the dep_var to create a formula
formula_str <- paste(exp_var_vec, collapse = '+')
formula_str <- paste(dep_var, formula_str, sep = '~')
formula <- as.formula(formula_str)
# Construct a linear model
model <- lm(formula, data=data)
model_summary <- summary(model)
# Retrieve the coefficient matrix
coefficients <- model_summary$coefficients
# Find out which variable has max Pr(>|t|), excluding (intercept)
max_p <- max(coefficients[2:nrow(coefficients),4])
variable <- names(which(coefficients[,4] == max_p))
# Save result in a list and return the list
result <- list(model, model_summary$adj.r.squared, variable, max_p, formula)
names(result) <- c('Model.Obj', 'Model.Adj.r.sq', 'Var.with.Max.p', 'Max.p', 'Formula')
return(result)
}
backward_search <- function(exp_var_vec, dep_var, data){
# This function takes a data set and a vector of explanatory variables and performs
# backward step-wise elimination for variable selection. The function eliminates
# the variable based on its Pr(>|t|) value. If this value is greater than
# 0.05, the function will eliminate the variable as long as it will increase
# the adjusted R-squared.
# Initialize model
backward_model <- linear_model(exp_var_vec, dep_var, data)
backward_ars <- backward_model$Model.Adj.r.sq
count <- 0
current_ars <- 0
current_maxp <- 1
remove_vec <- c()
ars_vec <- c()
# Loop to eliminate as long as the new model has increased adjusted R-square
# and the current model has an explanatory variable with Pr(>|t|) greater
# than 0.05.
while((backward_ars >= current_ars) & (current_maxp > 0.05)) {
count <- count + 1
current_model <- backward_model
current_ars <- backward_ars
current_maxp <- backward_model$Max.p
# Remove the variable with the max Pr(>|t|) from the formula
remove <- current_model$Var.with.Max.p
exp_var_vec <- exp_var_vec[-which(exp_var_vec == remove)]
# Create the new model
backward_model <- linear_model(exp_var_vec, dep_var, data)
backward_ars <- backward_model$Model.Adj.r.sq
# Save the removed variables and the resulting new adj r squared
remove_vec[count] <- remove
ars_vec[count] <- backward_ars
}
result <- list(current_model, remove_vec[1:(length(remove_vec)-1)],
ars_vec[1:(length(remove_vec)-1)], current_model$Formula)
names(result) <- c('Final.Model', 'Variables.Removed', 'Adj.R.Squares', 'Formula')
return(result)
}
I can now see what the final model is like:
result <- backward_search(X, Y, data)
show_table <- data.frame(result[[2]], result[[3]])
names(show_table) <- c('Variables.Removed', 'Resulting.Adj.R.Squares')
show_table## Variables.Removed Resulting.Adj.R.Squares
## 1 OverallQual.5 0.8238917
## 2 GarageArea 0.8240140
## 3 OverallQual 0.8241192
## 4 BsmtFinSF1 0.8242236
## 5 OpenPorchSF 0.8243242
## 6 HalfBath 0.8244220
## 7 GrLivArea 0.8245004
## 8 FullBath 0.8245556
## 9 OverallQual.7 0.8245930
## 10 OverallQual.6 0.8247144
## 11 OverallQual.2 0.8248182
## 12 TotRmsAbvGrd 0.8248535
The table above lists the variables removed sequentially and the resulting adjusted R-squared after each removal. The final model is:
model <- result[[1]]$Model.Obj
model.summary <- summary(model)
model.summary##
## Call:
## lm(formula = formula, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -513168 -14236 -194 12896 249569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.144e+06 1.122e+05 -10.188 < 2e-16 ***
## GarageCars 1.185e+04 1.602e+03 7.397 2.36e-13 ***
## TotalBsmtSF 1.987e+01 4.264e+00 4.659 3.47e-06 ***
## X1stFlrSF 4.297e+01 4.371e+00 9.830 < 2e-16 ***
## YearBuilt 2.453e+02 4.282e+01 5.729 1.22e-08 ***
## YearRemodAdd 3.434e+02 5.621e+01 6.108 1.29e-09 ***
## Fireplaces 8.793e+03 1.632e+03 5.388 8.32e-08 ***
## WoodDeckSF 1.670e+01 7.486e+00 2.231 0.025820 *
## X2ndFlrSF 4.512e+01 2.536e+00 17.790 < 2e-16 ***
## LotArea 5.077e-01 9.532e-02 5.327 1.16e-07 ***
## BsmtFullBath 5.425e+03 2.332e+03 2.326 0.020136 *
## BsmtUnfSF -1.056e+01 2.917e+00 -3.620 0.000304 ***
## OverallQual.3 7.846e+02 2.777e+02 2.825 0.004792 **
## OverallQual.4 -1.352e+02 4.622e+01 -2.926 0.003491 **
## OverallQual.8 4.243e-02 8.883e-03 4.776 1.97e-06 ***
## OverallQual.9 -3.498e-03 6.993e-04 -5.002 6.37e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33250 on 1444 degrees of freedom
## Multiple R-squared: 0.8267, Adjusted R-squared: 0.8249
## F-statistic: 459.1 on 15 and 1444 DF, p-value: < 2.2e-16
The final model’s adjusted R-squared is 0.8248535. All of the explanatory variables turn out to be significant.
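As an aside (a sketch only, not the procedure used here), R's built-in step() performs backward elimination automatically, although it minimizes AIC rather than applying the p-value / adjusted-R-squared rule above, so its final variable set may differ.
# AIC-based backward elimination starting from the full first model
step_model <- step(lm(formula, data = data), direction = 'backward', trace = 0)
summary(step_model)$adj.r.squared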
Below, I plot the Predicted Sale Price against the Actual Sale Price, and overlay the line y = x to see how the predictions perform.
plot(fitted(model), data$SalePrice, xlab='Predicted Sale Price', ylab='Actual Sale Price',
main='Predicted Sale Price vs Actual Sale Price')
abline(0, 1)
As you can see, the model did a good job for house prices under $450,000, except for two major errors where the actual sale price is low but our prediction is very high. These are the two points in the lower right corner.
We can examine these two points, and their variables:
subset(data.frame(fitted(model), data$SalePrice), fitted.model. > 500000 & data.SalePrice < 200000)## fitted.model. data.SalePrice
## 524 545941.2 184750
## 1299 673168.2 160000
data[c(524, 1299), c(X, 'SalePrice')]## OverallQual GrLivArea GarageCars GarageArea TotalBsmtSF X1stFlrSF
## 524 10 4676 3 884 3138 3138
## 1299 10 5642 2 1418 6110 4692
## FullBath TotRmsAbvGrd YearBuilt YearRemodAdd Fireplaces BsmtFinSF1
## 524 3 11 2007 2008 1 2260
## 1299 2 12 2008 2008 3 5644
## WoodDeckSF X2ndFlrSF OpenPorchSF HalfBath LotArea BsmtFullBath
## 524 208 1538 406 1 40094 1
## 1299 214 950 292 1 63887 2
## BsmtUnfSF OverallQual.2 OverallQual.3 OverallQual.4 OverallQual.5
## 524 878 100 1000 10000 1e+05
## 1299 466 100 1000 10000 1e+05
## OverallQual.6 OverallQual.7 OverallQual.8 OverallQual.9 SalePrice
## 524 1e+06 1e+07 1e+08 1e+09 184750
## 1299 1e+06 1e+07 1e+08 1e+09 160000
As you can see, these are very good houses, with 10/10 OverallQual, recently built, and with large areas. However, each sold for a very low price of below $190,000, while our model predicts they could be worth over $500,000. Checking the SaleCondition reveals part of the reason why:
data[c(524, 1299), 'SaleCondition']## [1] Partial Partial
## Levels: Abnorml AdjLand Alloca Family Normal Partial
summary(data$SaleCondition)## Abnorml AdjLand Alloca Family Normal Partial
## 101 4 12 20 1198 125
Their SaleCondition is “Partial”, and this category is just 8.56% of the data. Categorical variables were not included in our model, therefore this effect was not captured.
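To illustrate how such a categorical effect could be captured (a hypothetical sketch, not part of the submitted model), SaleCondition can simply be appended to the final formula; lm expands the factor into dummy variables automatically.
# Hypothetical extension: add SaleCondition to the final model's formula
formula_sc <- update(result[[1]]$Formula, . ~ . + SaleCondition)
model_sc <- lm(formula_sc, data = data)
summary(model_sc)$adj.r.squared   # compare with the current model's adjusted R-squared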
Next I plot the fitted values vs residuals.
plot(fitted(model), resid(model), xlab='Fitted Values', ylab='Residuals',
main='Fitted Values vs Residuals')
Again, the two outliers stand out. Apart from them, most of the residuals fall fairly uniformly around zero, with some deviation for fitted values above 400,000.
Next I look at the Q-Q plot and histogram for the residuals to check residual normality.
qqnorm(resid(model))
qqline(resid(model))
hist(resid(model), breaks=100)
Besides the two outliers, most points again fall along the Q-Q line, and the histogram of residuals is by and large normal.
Below, I prepared the test set, and made the SalePrice prediction. The prediction was submitted to Kaggle.com.
test <- read.csv('test.csv')
for (i in 2:9){
new_col <- test$OverallQual ^ i
test <- cbind(test, new_col)
col_name <- paste('OverallQual', i, sep='.')
names(test)[dim(test)[2]] <- col_name
col_names <- c(col_names, col_name)
}
prediction <- predict(model, test)
prediction[is.na(prediction)] <- mean(prediction,na.rm=T)
id <- seq(1461, 1460+1459)
prediction <- data.frame(id, prediction)
names(prediction) <- c('Id', 'SalePrice')
# Write this data frame to csv:
# write.csv(prediction, file='prediction.csv', row.names=F)
The Kaggle.com score for this model is 0.16131. My Kaggle user name is Jun Yan.
There are many ways to improve this model. Thus far I have used only a few of the numeric variables in the data set; the many categorical variables could also be used. I have not plotted histograms or boxplots of the variables used in this model; doing so might reveal that some of them need a transformation, or that nonlinear terms derived from them could be added to improve the model.
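One concrete candidate (a sketch, assuming the same numeric predictors are kept): SalePrice is itself right-skewed and the Kaggle metric for this competition is computed on log prices, so fitting the model on log(SalePrice) and exponentiating the predictions is a common improvement.
# Hypothetical refit on the log scale, reusing the final model's formula
log_formula <- update(result[[1]]$Formula, log(SalePrice) ~ .)
log_model <- lm(log_formula, data = data)
log_pred <- exp(predict(log_model, test))   # back-transform to dollars; NAs would need the same handling as above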