library(dplyr)
library(MASS)
- Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right!
- Pick the dependent variable and define it as Y.
Let X = LotArea and Y = SalePrice.
data <- read.csv('train.csv')
house <- data[c('LotArea', 'SalePrice')]
X <- house$LotArea
Y <- house$SalePrice
hist(X, breaks=50, main='Histogram of Lot Area in Square Feet')
summary(house)##    LotArea         SalePrice
## Min. : 1300 Min. : 34900
## 1st Qu.: 7554 1st Qu.:129975
## Median : 9478 Median :163000
## Mean : 10517 Mean :180921
## 3rd Qu.: 11602 3rd Qu.:214000
## Max. :215245 Max. :755000
From the histogram and summary, you can see that LotArea is highly skewed to the right.
Probability
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 1st quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.
a. \(P(X>x | Y>y)\)
Interpretation: This is the conditional probability that LotArea is greater than its 1st quartile, given that SalePrice is greater than its 1st quartile. We select from the data all rows where SalePrice is greater than its 1st quartile, and then, from that subset, select the rows where LotArea is greater than its 1st quartile. The probability is the ratio of the two row counts.
# 1st quartile of X and Y
(x_1stQ <- summary(X)[2])## 1st Qu.
## 7553.5
(y_1stQ <- summary(Y)[2])## 1st Qu.
## 129975
# Select all rows Y>y, then from which select all rows X>x
condition <- subset(house, SalePrice > y_1stQ)
event <- subset(condition, LotArea > x_1stQ)
# Checking:
min(condition$SalePrice) > y_1stQ## 1st Qu.
## TRUE
min(event$LotArea) > x_1stQ## 1st Qu.
## TRUE
# Solve probability:
(prob_a <- nrow(event) / nrow(condition))## [1] 0.8200913
b. \(P(X>x, Y>y)\)
Interpretation: This is the joint probability that LotArea and SalePrice are both greater than their respective 1st quartiles. We select from the house data all rows where this occurs. The probability is that count divided by the total number of rows.
# Select all rows X>x and Y>y:
joint <- subset(house, LotArea > x_1stQ & SalePrice > y_1stQ)
# Checking:
nrow(subset(joint, LotArea < x_1stQ)) == 0## [1] TRUE
nrow(subset(joint, SalePrice < y_1stQ)) == 0## [1] TRUE
# Solve probability:
(prob_b <- nrow(joint) / nrow(house))## [1] 0.6150685
c. \(P(X<x | Y>y)\)
Interpretation: This is the conditional probability that LotArea is less than its 1st quartile, given that SalePrice is greater than its 1st quartile. We select from the data all rows where SalePrice is greater than its 1st quartile, and then, from that subset, select the rows where LotArea is less than its 1st quartile. The probability is the ratio of the two row counts.
# Select all rows Y>y, then from which select all rows X<x
condition <- subset(house, SalePrice > y_1stQ)
event <- subset(condition, LotArea < x_1stQ)
# Checking:
min(condition$SalePrice) > y_1stQ## 1st Qu.
## TRUE
max(event$LotArea) < x_1stQ## 1st Qu.
## TRUE
# Solve probability:
(prob_c <- nrow(event) / nrow(condition))## [1] 0.1799087
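As a quick consistency check (a sketch using the objects already defined above): since no LotArea value equals the quartile cutoff 7553.5, the events in (a) and (c) are complementary and their probabilities should sum to 1; also, (a) should equal the joint probability in (b) divided by P(Y>y) = 0.75.
# Consistency checks on probabilities a, b and c
prob_a + prob_c                                          # should equal 1
prob_b / (sum(house$SalePrice > y_1stQ) / nrow(house))   # should equal prob_a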
Analysis
Compute the joint counts for the four cells of the table:
# X<=x, Y<=y
joint <- subset(house, LotArea <= x_1stQ & SalePrice <= y_1stQ)
row1_col1 <- nrow(joint)
# X>x, Y<=y
joint <- subset(house, LotArea > x_1stQ & SalePrice <= y_1stQ)
row1_col2 <- nrow(joint)
# X<=x, Y>y
joint <- subset(house, LotArea <= x_1stQ & SalePrice > y_1stQ)
row2_col1 <- nrow(joint)
# X>x, Y>y
joint <- subset(house, LotArea > x_1stQ & SalePrice > y_1stQ)
row2_col2 <- nrow(joint)
Create the contingency table:
observ_count <- data.frame(c(row1_col1, row2_col1), c(row1_col2, row2_col2))
# Compute row and column totals
observ_count[3,] = observ_count[1,] + observ_count[2,]
observ_count[,3] = observ_count[,1] + observ_count[,2]
names(observ_count) <- c('X<=x', 'X>x', 'Total')
rownames(observ_count) <- c('Y<=y', 'Y>y', 'Total')
# Calculate the probabilities by dividing by the total number of rows:
prob <- observ_count / nrow(house)
round(prob, 4)## X<=x X>x Total
## Y<=y 0.1151 0.1349 0.25
## Y>y 0.1349 0.6151 0.75
## Total 0.2500 0.7500 1.00
Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 1st quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y. Does P(AB)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
From the contingency table above, P(A) = 0.75, P(B) = 0.75, and P(AB) = 0.6151, which is not equal to P(A)P(B) = 0.75 × 0.75 = 0.5625. Thus, mathematically, they are not independent.
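The same check can be scripted directly in R; below is a minimal sketch using the quartile cutoffs defined earlier.
# Compare P(AB) with P(A)P(B) using indicator means
p_A <- mean(house$LotArea > x_1stQ)     # P(A) = 0.75
p_B <- mean(house$SalePrice > y_1stQ)   # P(B) = 0.75
p_AB <- mean(house$LotArea > x_1stQ & house$SalePrice > y_1stQ)
c(P_AB = p_AB, PA_times_PB = p_A * p_B) # 0.6151 vs 0.5625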
The null and alternative hypotheses for the Chi Square test are:
\(H_0\): The variables X and Y are independent.
\(H_a\): They are not independent.
The observed-counts table below is used for the Chi Square test:
observ_count## X<=x X>x Total
## Y<=y 168 197 365
## Y>y 197 898 1095
## Total 365 1095 1460
Below is the table for expected counts, calculated as \(\frac { row\quad total\quad \times \quad column\quad total }{ overall\quad total }\):
row1_col1 <- observ_count[1,3] * observ_count[3,1] / observ_count[3,3]
row1_col2 <- observ_count[1,3] * observ_count[3,2] / observ_count[3,3]
row2_col1 <- observ_count[2,3] * observ_count[3,1] / observ_count[3,3]
row2_col2 <- observ_count[2,3] * observ_count[3,2] / observ_count[3,3]
expec_count <- data.frame(c(row1_col1, row2_col1), c(row1_col2, row2_col2))
names(expec_count) <- c('X<=x', 'X>x')
rownames(expec_count) <- c('Y<=y', 'Y>y')
round(expec_count, 2)## X<=x X>x
## Y<=y 91.25 273.75
## Y>y 273.75 821.25
Calculate \({ \chi }^{ 2 }=\sum { \frac { { (observed-expected) }^{ 2 } }{ expected } }\)
(cell_cal <- (observ_count[1:2, 1:2] - expec_count) ^2 / expec_count) ## X<=x X>x
## Y<=y 64.55411 21.518037
## Y>y 21.51804 7.172679
chi_square <- sum(cell_cal)
chi_square## [1] 114.7629
Using R to calculate the p-value, with DF = (2-1)(2-1) = 1.
p <- pchisq(chi_square, 1, lower.tail=F)
p## [1] 8.869425e-27
This value is much less than 0.05. Therefore the null hypothesis can be rejected: X and Y are not independent.
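As a sanity check, the same test can be run with R's built-in chisq.test (a sketch; correct = FALSE turns off Yates' continuity correction so the statistic matches the manual calculation above).
# Built-in chi-square test on the 2x2 observed counts (excluding the Total row and column)
chisq.test(as.matrix(observ_count[1:2, 1:2]), correct = FALSE)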
Descriptive and Inferential Statistics
Provide univariate descriptive statistics and appropriate plots for the training data set.
Below are the summary statistics for each variable in the training data set. If the variable is numeric, the summary shows its minimum, maximum, median, and mean values, as well as its 1st and 3rd quartiles. If the variable is categorical, it shows the counts in each category.
summary(data[2:dim(data)[2]])## MSSubClass MSZoning LotFrontage LotArea
## Min. : 20.0 C (all): 10 Min. : 21.00 Min. : 1300
## 1st Qu.: 20.0 FV : 65 1st Qu.: 59.00 1st Qu.: 7554
## Median : 50.0 RH : 16 Median : 69.00 Median : 9478
## Mean : 56.9 RL :1151 Mean : 70.05 Mean : 10517
## 3rd Qu.: 70.0 RM : 218 3rd Qu.: 80.00 3rd Qu.: 11602
## Max. :190.0 Max. :313.00 Max. :215245
## NA's :259
## Street Alley LotShape LandContour Utilities
## Grvl: 6 Grvl: 50 IR1:484 Bnk: 63 AllPub:1459
## Pave:1454 Pave: 41 IR2: 41 HLS: 50 NoSeWa: 1
## NA's:1369 IR3: 10 Low: 36
## Reg:925 Lvl:1311
##
##
##
## LotConfig LandSlope Neighborhood Condition1 Condition2
## Corner : 263 Gtl:1382 NAmes :225 Norm :1260 Norm :1445
## CulDSac: 94 Mod: 65 CollgCr:150 Feedr : 81 Feedr : 6
## FR2 : 47 Sev: 13 OldTown:113 Artery : 48 Artery : 2
## FR3 : 4 Edwards:100 RRAn : 26 PosN : 2
## Inside :1052 Somerst: 86 PosN : 19 RRNn : 2
## Gilbert: 79 RRAe : 11 PosA : 1
## (Other):707 (Other): 15 (Other): 2
## BldgType HouseStyle OverallQual OverallCond
## 1Fam :1220 1Story :726 Min. : 1.000 Min. :1.000
## 2fmCon: 31 2Story :445 1st Qu.: 5.000 1st Qu.:5.000
## Duplex: 52 1.5Fin :154 Median : 6.000 Median :5.000
## Twnhs : 43 SLvl : 65 Mean : 6.099 Mean :5.575
## TwnhsE: 114 SFoyer : 37 3rd Qu.: 7.000 3rd Qu.:6.000
## 1.5Unf : 14 Max. :10.000 Max. :9.000
## (Other): 19
## YearBuilt YearRemodAdd RoofStyle RoofMatl Exterior1st
## Min. :1872 Min. :1950 Flat : 13 CompShg:1434 VinylSd:515
## 1st Qu.:1954 1st Qu.:1967 Gable :1141 Tar&Grv: 11 HdBoard:222
## Median :1973 Median :1994 Gambrel: 11 WdShngl: 6 MetalSd:220
## Mean :1971 Mean :1985 Hip : 286 WdShake: 5 Wd Sdng:206
## 3rd Qu.:2000 3rd Qu.:2004 Mansard: 7 ClyTile: 1 Plywood:108
## Max. :2010 Max. :2010 Shed : 2 Membran: 1 CemntBd: 61
## (Other): 2 (Other):128
## Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond
## VinylSd:504 BrkCmn : 15 Min. : 0.0 Ex: 52 Ex: 3
## MetalSd:214 BrkFace:445 1st Qu.: 0.0 Fa: 14 Fa: 28
## HdBoard:207 None :864 Median : 0.0 Gd:488 Gd: 146
## Wd Sdng:197 Stone :128 Mean : 103.7 TA:906 Po: 1
## Plywood:142 NA's : 8 3rd Qu.: 166.0 TA:1282
## CmentBd: 60 Max. :1600.0
## (Other):136 NA's :8
## Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1
## BrkTil:146 Ex :121 Fa : 45 Av :221 ALQ :220
## CBlock:634 Fa : 35 Gd : 65 Gd :134 BLQ :148
## PConc :647 Gd :618 Po : 2 Mn :114 GLQ :418
## Slab : 24 TA :649 TA :1311 No :953 LwQ : 74
## Stone : 6 NA's: 37 NA's: 37 NA's: 38 Rec :133
## Wood : 3 Unf :430
## NA's: 37
## BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF
## Min. : 0.0 ALQ : 19 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.0 BLQ : 33 1st Qu.: 0.00 1st Qu.: 223.0
## Median : 383.5 GLQ : 14 Median : 0.00 Median : 477.5
## Mean : 443.6 LwQ : 46 Mean : 46.55 Mean : 567.2
## 3rd Qu.: 712.2 Rec : 54 3rd Qu.: 0.00 3rd Qu.: 808.0
## Max. :5644.0 Unf :1256 Max. :1474.00 Max. :2336.0
## NA's: 38
## TotalBsmtSF Heating HeatingQC CentralAir Electrical
## Min. : 0.0 Floor: 1 Ex:741 N: 95 FuseA: 94
## 1st Qu.: 795.8 GasA :1428 Fa: 49 Y:1365 FuseF: 27
## Median : 991.5 GasW : 18 Gd:241 FuseP: 3
## Mean :1057.4 Grav : 7 Po: 1 Mix : 1
## 3rd Qu.:1298.2 OthW : 2 TA:428 SBrkr:1334
## Max. :6110.0 Wall : 4 NA's : 1
##
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## Min. : 334 Min. : 0 Min. : 0.000 Min. : 334
## 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000 1st Qu.:1130
## Median :1087 Median : 0 Median : 0.000 Median :1464
## Mean :1163 Mean : 347 Mean : 5.845 Mean :1515
## 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000 3rd Qu.:1777
## Max. :4692 Max. :2065 Max. :572.000 Max. :5642
##
## BsmtFullBath BsmtHalfBath FullBath HalfBath
## Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000
## Median :0.0000 Median :0.00000 Median :2.000 Median :0.0000
## Mean :0.4253 Mean :0.05753 Mean :1.565 Mean :0.3829
## 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000
## Max. :3.0000 Max. :2.00000 Max. :3.000 Max. :2.0000
##
## BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## Min. :0.000 Min. :0.000 Ex:100 Min. : 2.000 Maj1: 14
## 1st Qu.:2.000 1st Qu.:1.000 Fa: 39 1st Qu.: 5.000 Maj2: 5
## Median :3.000 Median :1.000 Gd:586 Median : 6.000 Min1: 31
## Mean :2.866 Mean :1.047 TA:735 Mean : 6.518 Min2: 34
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000 Mod : 15
## Max. :8.000 Max. :3.000 Max. :14.000 Sev : 1
## Typ :1360
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish
## Min. :0.000 Ex : 24 2Types : 6 Min. :1900 Fin :352
## 1st Qu.:0.000 Fa : 33 Attchd :870 1st Qu.:1961 RFn :422
## Median :1.000 Gd :380 Basment: 19 Median :1980 Unf :605
## Mean :0.613 Po : 20 BuiltIn: 88 Mean :1979 NA's: 81
## 3rd Qu.:1.000 TA :313 CarPort: 9 3rd Qu.:2002
## Max. :3.000 NA's:690 Detchd :387 Max. :2010
## NA's : 81 NA's :81
## GarageCars GarageArea GarageQual GarageCond PavedDrive
## Min. :0.000 Min. : 0.0 Ex : 3 Ex : 2 N: 90
## 1st Qu.:1.000 1st Qu.: 334.5 Fa : 48 Fa : 35 P: 30
## Median :2.000 Median : 480.0 Gd : 14 Gd : 9 Y:1340
## Mean :1.767 Mean : 473.0 Po : 3 Po : 7
## 3rd Qu.:2.000 3rd Qu.: 576.0 TA :1311 TA :1326
## Max. :4.000 Max. :1418.0 NA's: 81 NA's: 81
##
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 25.00 Median : 0.00 Median : 0.00
## Mean : 94.24 Mean : 46.66 Mean : 21.95 Mean : 3.41
## 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :857.00 Max. :547.00 Max. :552.00 Max. :508.00
##
## ScreenPorch PoolArea PoolQC Fence MiscFeature
## Min. : 0.00 Min. : 0.000 Ex : 2 GdPrv: 59 Gar2: 2
## 1st Qu.: 0.00 1st Qu.: 0.000 Fa : 2 GdWo : 54 Othr: 2
## Median : 0.00 Median : 0.000 Gd : 3 MnPrv: 157 Shed: 49
## Mean : 15.06 Mean : 2.759 NA's:1453 MnWw : 11 TenC: 1
## 3rd Qu.: 0.00 3rd Qu.: 0.000 NA's :1179 NA's:1406
## Max. :480.00 Max. :738.000
##
## MiscVal MoSold YrSold SaleType
## Min. : 0.00 Min. : 1.000 Min. :2006 WD :1267
## 1st Qu.: 0.00 1st Qu.: 5.000 1st Qu.:2007 New : 122
## Median : 0.00 Median : 6.000 Median :2008 COD : 43
## Mean : 43.49 Mean : 6.322 Mean :2008 ConLD : 9
## 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009 ConLI : 5
## Max. :15500.00 Max. :12.000 Max. :2010 ConLw : 5
## (Other): 9
## SaleCondition SalePrice
## Abnorml: 101 Min. : 34900
## AdjLand: 4 1st Qu.:129975
## Alloca : 12 Median :163000
## Family : 20 Mean :180921
## Normal :1198 3rd Qu.:214000
## Partial: 125 Max. :755000
##
Below are plots for each variable in the training data set. If the variable is numeric, a histogram is plotted; if the variable is categorical, the frequency of each category is plotted as a bar chart.
par(mfrow=c(2,3))
for (var in names(data)[2:dim(data)[2]]){
if (is.numeric(data[,var])){
hist(data[,var], main=var, xlab='')
}else{
plot(data[,var], main=var)
}
}
Lastly, boxplots are shown below for each numeric variable:
par(mfrow=c(2,4))
for (var in names(data)[2:dim(data)[2]]){
if (is.numeric(data[,var])){
boxplot(data[,var], main=var, xlab='')
}
}
Provide a scatterplot of X and Y.
plot(x=X/1000, y=Y/1000, xlab='Lot Area in Thousands Square Feet', ylab='Sale Price in Thousands USD', main='Lot Area vs Sale Price')
Derive a correlation matrix for any THREE quantitative variables in the dataset.
correlations <- cor(data[, c('X1stFlrSF', 'LotArea', 'SalePrice')])
round(correlations, 4)## X1stFlrSF LotArea SalePrice
## X1stFlrSF 1.0000 0.2995 0.6059
## LotArea 0.2995 1.0000 0.2638
## SalePrice 0.6059 0.2638 1.0000
Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide a 92% confidence interval.
\(H_0\): The correlation coefficient is 0. In other words, any non-zero correlation observed in the sample is due to chance.
\(H_a\): The correlation coefficient is not 0.
The test is a two-tailed t-test at the 92% confidence level, so \(\alpha = 1 - 0.92 = 0.08\), with 0.04 in each tail.
The test statistic is calculated as \(t=\frac { r\sqrt { n-2 } }{ \sqrt { 1-{ r }^{ 2 } } }\)
Where r = correlation coefficient, and n = number of observations = 1460.
The correlation between X1stFlrSF and LotArea is 0.2994746.
Using R's built-in pt function to find the p-value, with df = n - 2:
r <- correlations[1,2]
n <- nrow(data)
t <- r*sqrt(n-2) / sqrt(1-r^2)
p <- pt(t, n-2, lower.tail=F) * 2
p## [1] 1.234238e-31
This p-value is far smaller than \(\alpha=0.08\). Therefore the null hypothesis can be rejected: the correlation coefficient is not zero.
The correlation between X1stFlrSF and SalePrice is 0.6058522.
r <- correlations[1,3]
t <- r*sqrt(n-2) / sqrt(1-r^2)
p <- pt(t, n-2, lower.tail=F) * 2
p## [1] 5.394711e-147
This p-value is far smaller than \(\alpha=0.08\). Therefore the null hypothesis can be rejected: the correlation coefficient is not zero, meaning the variables are not independent.
The correlation between LotArea and SalePrice is 0.2638434.
r <- correlations[2,3]
t <- r*sqrt(n-2) / sqrt(1-r^2)
p <- pt(t, n-2, lower.tail=F) * 2
p## [1] 1.123139e-24
This p-value is far smaller than \(\alpha=0.08\). Therefore the null hypothesis can be rejected: the correlation coefficient is not zero.
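The assignment also asks for 92% confidence intervals for each correlation. A minimal sketch using R's cor.test with conf.level = 0.92 (its interval is based on Fisher's z transformation) would be:
# 92% confidence intervals for the three pairwise correlations
cor.test(data$X1stFlrSF, data$LotArea, conf.level = 0.92)$conf.int
cor.test(data$X1stFlrSF, data$SalePrice, conf.level = 0.92)$conf.int
cor.test(data$LotArea, data$SalePrice, conf.level = 0.92)$conf.int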
Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
The analysis above suggests that the correlations of all three pairs are non-zero, i.e. they are all significant. However, I am worried about familywise error, because I have just run three statistical tests in parallel, which invites the multiple comparison problem. Also called the “look-elsewhere effect”, the problem arises when many tests are run in search of significance: some results may appear significant purely because many tests were tried, i.e. by chance. A Type I error may then result, where a null hypothesis that is actually true is rejected simply because significance was found by chance.
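One common guard against familywise error is to adjust the p-values for multiple comparisons, for example with a Bonferroni correction. Below is a sketch using R's p.adjust on the three p-values reported above; here even the adjusted values remain far below any reasonable \(\alpha\), so the conclusions would not change.
# Bonferroni adjustment of the three correlation-test p-values reported above
p_values <- c(1.234238e-31, 5.394711e-147, 1.123139e-24)
p.adjust(p_values, method = 'bonferroni')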
Linear Algebra and Correlation
Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
precision <- solve(correlations)
precision## X1stFlrSF LotArea SalePrice
## X1stFlrSF 1.6340150 -0.2452191 -0.9252722
## LotArea -0.2452191 1.1116223 -0.1447277
## SalePrice -0.9252722 -0.1447277 1.5987636
correlations %*% precision## X1stFlrSF LotArea SalePrice
## X1stFlrSF 1.000000e+00 0 0
## LotArea -5.551115e-17 1 0
## SalePrice -1.110223e-16 0 1
precision %*% correlations## X1stFlrSF LotArea SalePrice
## X1stFlrSF 1.000000e+00 -2.775558e-17 -1.110223e-16
## LotArea -2.775558e-17 1.000000e+00 -2.775558e-17
## SalePrice 0.000000e+00 0.000000e+00 1.000000e+00
The function below was written in Assignment 2 to conduct LU decomposition:
lu_decomp <- function(A) {
rows <- nrow(A)
L <- diag(rows)
U <- A
# Row operation to get row echelon form
for (i in 1:(rows-1)){
# Pivot check: if the pivot is zero, LU factorization without row permutation fails
if (U[i,i] == 0){
print("### Cannot be factorized without permuting rows. ###")
break
}
for(j in (i+1):rows) {
coefficient <- - U[j,i] / U[i,i]
# Update U by updating the rows using the coefficient
U[j,] <- U[j,] + coefficient * U[i,]
# Update L by taking the opposite of coefficient
L[j,i] <- -coefficient
}
print(paste('U in Step', i, ":"))
print(U)
print(paste('L in Step', i, ":"))
print(L)
print("-----------------------------")
}
return(list(L, U))
}
result <- lu_decomp(precision)## [1] "U in Step 1 :"
## X1stFlrSF LotArea SalePrice
## X1stFlrSF 1.634015 -0.2452191 -0.9252722
## LotArea 0.000000 1.0748219 -0.2835846
## SalePrice 0.000000 -0.2835846 1.0748219
## [1] "L in Step 1 :"
## [,1] [,2] [,3]
## [1,] 1.0000000 0 0
## [2,] -0.1500715 1 0
## [3,] -0.5662568 0 1
## [1] "-----------------------------"
## [1] "U in Step 2 :"
## X1stFlrSF LotArea SalePrice
## X1stFlrSF 1.634015 -0.2452191 -0.9252722
## LotArea 0.000000 1.0748219 -0.2835846
## SalePrice 0.000000 0.0000000 1.0000000
## [1] "L in Step 2 :"
## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0
## [2,] -0.1500715 1.0000000 0
## [3,] -0.5662568 -0.2638434 1
## [1] "-----------------------------"
L <- result[[1]]
U <- result[[2]]
L %*% U## X1stFlrSF LotArea SalePrice
## [1,] 1.6340150 -0.2452191 -0.9252722
## [2,] -0.2452191 1.1116223 -0.1447277
## [3,] -0.9252722 -0.1447277 1.5987636
So the L and U are:
L## [,1] [,2] [,3]
## [1,] 1.0000000 0.0000000 0
## [2,] -0.1500715 1.0000000 0
## [3,] -0.5662568 -0.2638434 1
U## X1stFlrSF LotArea SalePrice
## X1stFlrSF 1.634015 -0.2452191 -0.9252722
## LotArea 0.000000 1.0748219 -0.2835846
## SalePrice 0.000000 0.0000000 1.0000000
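As a cross-check (a sketch assuming the Matrix package is available), the lu() function from Matrix performs an LU decomposition with partial pivoting; because it may permute rows, its factors can differ from the hand-rolled ones above by a permutation matrix.
# Cross-check with the Matrix package's LU decomposition
library(Matrix)
decomp <- expand(lu(Matrix(precision)))
decomp$L   # lower-triangular factor
decomp$U   # upper-triangular factor
decomp$P   # permutation matrix; applying it to L %*% U recovers the precision matrix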
Calculus-Based Probability & Statistics
Many times, it makes sense to fit a closed form distribution to data. For the first variable that you selected which is skewed to the right, shift it so that the minimum value is above zero as necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
The first variable selected was the LotArea.
summary(X)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
The minimum value for this variable is already above zero.
fitting <- fitdistr(X, 'exponential')
rate <- fitting$estimate
rate## rate
## 9.50857e-05
The optimal value of \(\lambda\) for this distribution is \(9.50857 \times 10^{-5}\).
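The maximum-likelihood estimate of an exponential rate is simply the reciprocal of the sample mean, so the fitted value can be verified directly:
# The exponential rate MLE equals 1 / mean(X)
1 / mean(X)   # should match fitting$estimate, about 9.50857e-05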
set.seed(1)
Xt <- rexp(1000, rate)
hist(X, breaks = 100, main='Histogram of Original Data')
hist(Xt, breaks = 50, main='Histogram of Exponential Distribution')
The two histograms are only similar in the sense that they are both unimodal and skewed to the right. They are otherwise very different. The original distribution is somewhat bell-shaped, with a long tail trailing to the right. The exponential sample, on the other hand, peaks immediately near zero - there is no resemblance of a bell shape - and then decreases gradually to the right.
The cumulative distribution function for the exponential distribution is:
\(P(x)=1-{ e }^{ -\lambda x }\)
Solving x for a given P:
\(x=\frac { ln(1-p) }{ -\lambda }\)
R function qexp can also be used to check.
For 5th percentile:
P <- 0.05
log(1-P)/-rate## rate
## 539.4428
qexp(P, rate)## [1] 539.4428
For 95th percentile:
P <- 0.95
log(1-P)/-rate## rate
## 31505.6
qexp(P, rate)## [1] 31505.6
Assuming normality for the empirical data (original data), and using Z = 1.96 for 95% confidence interval, the formula is: \(\bar { x } \pm Z\cdot \frac { \sigma }{ \sqrt { n } }\)
Z <- 1.96
x_bar <- mean(X)
std <- sd(X)
n <- length(X)
upper <- x_bar + Z * std / sqrt(n)
lower <- x_bar - Z * std / sqrt(n)
lower## [1] 10004.83
upper## [1] 11028.82
The 95% confidence interval is (10004.83, 11028.82).
The 5th and 95th percentiles of the empirical data, using R's quantile function, are:
quantile(X, c(0.05, 0.95))## 5% 95%
## 3311.70 17401.15
The actual 5th and 95th percentiles observed in data are:
X.sorted <- sort(X)
X.sorted[round(0.05*length(X))]## [1] 3230
X.sorted[round(0.95*length(X))]## [1] 17400
They are very close to the values calculated by the quantile function.
The variable chosen is LotArea. Its empirical 95th percentile is 17,400, meaning that 95% of the 1460 lots are under 17,400 sq ft. Notice the large disparity between the empirical percentiles and those of the fitted exponential distribution; this suggests the exponential distribution is not a good fit for this data. The 95% confidence interval built above is a confidence interval for the mean: if we repeatedly drew samples of the same size (1460) from the houses in this area and built an interval this way each time, about 95% of those intervals would contain the true mean lot area; the interval from this sample is (10004.83, 11028.82). However, for data this skewed, the median is a much better descriptor than the mean.
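As an aside (a sketch only, not part of the original analysis), MASS::fitdistr can also fit a log-normal distribution, which usually tracks a right-skewed variable like LotArea more closely than the exponential; comparing its 5th and 95th percentiles with the empirical ones makes that concrete.
# Fit a log-normal distribution and compare its percentiles with the empirical data
fit_ln <- fitdistr(X, 'lognormal')
qlnorm(c(0.05, 0.95), meanlog = fit_ln$estimate['meanlog'], sdlog = fit_ln$estimate['sdlog'])
quantile(X, c(0.05, 0.95))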
Modeling
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
I will build a multiple regression model to predict the SalePrice of a house, using only the numeric variables in the data set as the explanatory variables.
First I find out which columns are numeric:
data <- read.csv('train.csv')
numeric_cols <- sapply(data, is.numeric)
Next, I calculate the correlations between all the numeric variables, order the correlation matrix by the SalePrice column, and extract that column. The result is a vector of the numeric variables' correlations with SalePrice, listed in decreasing order.
correlations <- as.data.frame(cor(data[numeric_cols]))
correlations <- correlations[order(correlations$SalePrice, decreasing=T),]
corr_saleprice <- correlations$SalePrice
names(corr_saleprice) <- rownames(correlations)
corr_saleprice## SalePrice OverallQual GrLivArea GarageCars GarageArea
## 1.00000000 0.79098160 0.70862448 0.64040920 0.62343144
## TotalBsmtSF X1stFlrSF FullBath TotRmsAbvGrd YearBuilt
## 0.61358055 0.60585218 0.56066376 0.53372316 0.52289733
## YearRemodAdd Fireplaces BsmtFinSF1 WoodDeckSF X2ndFlrSF
## 0.50710097 0.46692884 0.38641981 0.32441344 0.31933380
## OpenPorchSF HalfBath LotArea BsmtFullBath BsmtUnfSF
## 0.31585623 0.28410768 0.26384335 0.22712223 0.21447911
## BedroomAbvGr ScreenPorch PoolArea MoSold X3SsnPorch
## 0.16821315 0.11144657 0.09240355 0.04643225 0.04458367
## BsmtFinSF2 BsmtHalfBath MiscVal Id LowQualFinSF
## -0.01137812 -0.01684415 -0.02118958 -0.02191672 -0.02560613
## YrSold OverallCond MSSubClass EnclosedPorch KitchenAbvGr
## -0.02892259 -0.07785589 -0.08428414 -0.12857796 -0.13590737
## LotFrontage MasVnrArea GarageYrBlt
## NA NA NA
I will build the first model using all of the variables in this list whose correlation with SalePrice is higher than 0.2 (an admittedly arbitrary cutoff). The last variable that satisfies this is BsmtUnfSF, so I select the variables from OverallQual through BsmtUnfSF.
Notice that OverallQual is actually an ordinal variable: its values rate the overall quality of the house from 1 to 10. I will handle it by including polynomial terms, i.e. OverallQual^2 + OverallQual^3 + … + OverallQual^9 in addition to OverallQual itself.
col_names <- c()
for (i in 2:9){
new_col <- data$OverallQual ^ i
data <- cbind(data, new_col)
col_name <- paste('OverallQual', i, sep='.')
names(data)[dim(data)[2]] <- col_name
col_names <- c(col_names, col_name)
}
col_names## [1] "OverallQual.2" "OverallQual.3" "OverallQual.4" "OverallQual.5"
## [5] "OverallQual.6" "OverallQual.7" "OverallQual.8" "OverallQual.9"
I now select the column names for the explanatory variables, and create the linear model using the lm function:
Y <- 'SalePrice'
X <- names(corr_saleprice)[2:which(names(corr_saleprice)=='BsmtUnfSF')]
X <- c(X, col_names)
X## [1] "OverallQual" "GrLivArea" "GarageCars" "GarageArea"
## [5] "TotalBsmtSF" "X1stFlrSF" "FullBath" "TotRmsAbvGrd"
## [9] "YearBuilt" "YearRemodAdd" "Fireplaces" "BsmtFinSF1"
## [13] "WoodDeckSF" "X2ndFlrSF" "OpenPorchSF" "HalfBath"
## [17] "LotArea" "BsmtFullBath" "BsmtUnfSF" "OverallQual.2"
## [21] "OverallQual.3" "OverallQual.4" "OverallQual.5" "OverallQual.6"
## [25] "OverallQual.7" "OverallQual.8" "OverallQual.9"
formula_str <- paste(X, collapse = '+')
formula_str <- paste(Y, formula_str, sep = '~')
formula <- as.formula(formula_str)
formula## SalePrice ~ OverallQual + GrLivArea + GarageCars + GarageArea +
## TotalBsmtSF + X1stFlrSF + FullBath + TotRmsAbvGrd + YearBuilt +
## YearRemodAdd + Fireplaces + BsmtFinSF1 + WoodDeckSF + X2ndFlrSF +
## OpenPorchSF + HalfBath + LotArea + BsmtFullBath + BsmtUnfSF +
## OverallQual.2 + OverallQual.3 + OverallQual.4 + OverallQual.5 +
## OverallQual.6 + OverallQual.7 + OverallQual.8 + OverallQual.9
model <- lm(formula, data=data)
Below is the model summary for the first model:
summary(model)##
## Call:
## lm(formula = formula, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -508858 -14066 -352 12821 252370
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.204e+06 1.572e+06 -0.766 0.4437
## OverallQual 2.676e+05 4.025e+06 0.066 0.9470
## GrLivArea 1.168e+01 1.869e+01 0.625 0.5323
## GarageCars 1.201e+04 2.741e+03 4.382 1.26e-05 ***
## GarageArea -6.221e-01 9.326e+00 -0.067 0.9468
## TotalBsmtSF 1.878e+01 6.711e+00 2.799 0.0052 **
## X1stFlrSF 2.831e+01 1.907e+01 1.484 0.1379
## FullBath 2.276e+03 2.658e+03 0.856 0.3919
## TotRmsAbvGrd 6.078e+02 1.020e+03 0.596 0.5514
## YearBuilt 2.289e+02 4.903e+01 4.668 3.33e-06 ***
## YearRemodAdd 3.392e+02 5.706e+01 5.944 3.48e-09 ***
## Fireplaces 8.928e+03 1.671e+03 5.343 1.06e-07 ***
## BsmtFinSF1 2.270e+00 5.793e+00 0.392 0.6953
## WoodDeckSF 1.670e+01 7.535e+00 2.217 0.0268 *
## X2ndFlrSF 2.967e+01 1.889e+01 1.570 0.1165
## OpenPorchSF -5.733e+00 1.444e+01 -0.397 0.6913
## HalfBath 1.262e+03 2.552e+03 0.495 0.6210
## LotArea 5.054e-01 9.596e-02 5.267 1.60e-07 ***
## BsmtFullBath 5.499e+03 2.364e+03 2.326 0.0202 *
## BsmtUnfSF -9.130e+00 5.885e+00 -1.551 0.1210
## OverallQual.2 -2.314e+05 4.113e+06 -0.056 0.9551
## OverallQual.3 8.590e+04 2.255e+06 0.038 0.9696
## OverallQual.4 -1.106e+04 7.437e+05 -0.015 0.9881
## OverallQual.5 -1.629e+03 1.548e+05 -0.011 0.9916
## OverallQual.6 7.455e+02 2.049e+04 0.036 0.9710
## OverallQual.7 -1.036e+02 1.674e+03 -0.062 0.9506
## OverallQual.8 6.696e+00 7.690e+01 0.087 0.9306
## OverallQual.9 -1.704e-01 1.520e+00 -0.112 0.9107
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33350 on 1432 degrees of freedom
## Multiple R-squared: 0.827, Adjusted R-squared: 0.8238
## F-statistic: 253.6 on 27 and 1432 DF, p-value: < 2.2e-16
Next, I perform backward elimination to remove variables. The variable with the maximum p-value is removed from the model one at a time, as long as removing it increases the adjusted R-squared of the new model and its p-value is higher than 0.05.
Two wrapper functions are written to perform the backward elimination automatically.
linear_model <- function(exp_var_vec, dep_var, data){
# This function creates a linear model with the given variables and data
# and returns the model, the model's adjusted R-squared, the model's
# explanatory variable with the maximum Pr(>|t|), and the corresponding
# Pr(>|t|).
# Take the exp_var string vector and the dep_var to create a formula
formula_str <- paste(exp_var_vec, collapse = '+')
formula_str <- paste(dep_var, formula_str, sep = '~')
formula <- as.formula(formula_str)
# Construct a linear model
model <- lm(formula, data=data)
model_summary <- summary(model)
# Retrieve the coefficient matrix
coefficients <- model_summary$coefficients
# Find out which variable has max Pr(>|t|), excluding (intercept)
max_p <- max(coefficients[2:nrow(coefficients),4])
variable <- names(which(coefficients[,4] == max_p))
# Save result in a list and return the list
result <- list(model, model_summary$adj.r.squared, variable, max_p, formula)
names(result) <- c('Model.Obj', 'Model.Adj.r.sq', 'Var.with.Max.p', 'Max.p', 'Formula')
return(result)
}
backward_search <- function(exp_var_vec, dep_var, data){
# This function takes a data set and a vector of explanatory variables and performs
# backward step-wise elimination for variable selection. The function eliminates
# the variable based on its Pr(>|t|) value. If this value is greater than
# 0.05, the function will eliminate the variable as long as it will increase
# the adjusted R-squared.
# Initialize model
backward_model <- linear_model(exp_var_vec, dep_var, data)
backward_ars <- backward_model$Model.Adj.r.sq
count <- 0
current_ars <- 0
current_maxp <- 1
remove_vec <- c()
ars_vec <- c()
# Loop to eliminate as long as the new model has increased adjusted R-square
# and the current model has an explanatory variable with Pr(>|t|) greater
# than 0.05.
while((backward_ars >= current_ars) & (current_maxp > 0.05)) {
count <- count + 1
current_model <- backward_model
current_ars <- backward_ars
current_maxp <- backward_model$Max.p
# Remove the variable with the max Pr(>|t|) from the formula
remove <- current_model$Var.with.Max.p
exp_var_vec <- exp_var_vec[-which(exp_var_vec == remove)]
# Create the new model
backward_model <- linear_model(exp_var_vec, dep_var, data)
backward_ars <- backward_model$Model.Adj.r.sq
# Save the removed variables and the resulting new adj r squared
remove_vec[count] <- remove
ars_vec[count] <- backward_ars
}
result <- list(current_model, remove_vec[1:(length(remove_vec)-1)],
ars_vec[1:(length(remove_vec)-1)], current_model$Formula)
names(result) <- c('Final.Model', 'Variables.Removed', 'Adj.R.Squares', 'Formula')
return(result)
}
I can now see what the final model is like:
result <- backward_search(X, Y, data)
show_table <- data.frame(result[[2]], result[[3]])
names(show_table) <- c('Variables.Removed', 'Resulting.Adj.R.Squares')
show_table## Variables.Removed Resulting.Adj.R.Squares
## 1 OverallQual.5 0.8238917
## 2 GarageArea 0.8240140
## 3 OverallQual 0.8241192
## 4 BsmtFinSF1 0.8242236
## 5 OpenPorchSF 0.8243242
## 6 HalfBath 0.8244220
## 7 GrLivArea 0.8245004
## 8 FullBath 0.8245556
## 9 OverallQual.7 0.8245930
## 10 OverallQual.6 0.8247144
## 11 OverallQual.2 0.8248182
## 12 TotRmsAbvGrd 0.8248535
The table above lists the variables removed sequentially and the resulting adjusted R-squared after each removal. The final model is:
model <- result[[1]]$Model.Obj
model.summary <- summary(model)
model.summary##
## Call:
## lm(formula = formula, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -513168 -14236 -194 12896 249569
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.144e+06 1.122e+05 -10.188 < 2e-16 ***
## GarageCars 1.185e+04 1.602e+03 7.397 2.36e-13 ***
## TotalBsmtSF 1.987e+01 4.264e+00 4.659 3.47e-06 ***
## X1stFlrSF 4.297e+01 4.371e+00 9.830 < 2e-16 ***
## YearBuilt 2.453e+02 4.282e+01 5.729 1.22e-08 ***
## YearRemodAdd 3.434e+02 5.621e+01 6.108 1.29e-09 ***
## Fireplaces 8.793e+03 1.632e+03 5.388 8.32e-08 ***
## WoodDeckSF 1.670e+01 7.486e+00 2.231 0.025820 *
## X2ndFlrSF 4.512e+01 2.536e+00 17.790 < 2e-16 ***
## LotArea 5.077e-01 9.532e-02 5.327 1.16e-07 ***
## BsmtFullBath 5.425e+03 2.332e+03 2.326 0.020136 *
## BsmtUnfSF -1.056e+01 2.917e+00 -3.620 0.000304 ***
## OverallQual.3 7.846e+02 2.777e+02 2.825 0.004792 **
## OverallQual.4 -1.352e+02 4.622e+01 -2.926 0.003491 **
## OverallQual.8 4.243e-02 8.883e-03 4.776 1.97e-06 ***
## OverallQual.9 -3.498e-03 6.993e-04 -5.002 6.37e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33250 on 1444 degrees of freedom
## Multiple R-squared: 0.8267, Adjusted R-squared: 0.8249
## F-statistic: 459.1 on 15 and 1444 DF, p-value: < 2.2e-16
The final model’s adjusted R-squared is 0.8248535. All of the explanatory variables turn out to be significant.
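As an aside (a sketch only, not the procedure used here), R's built-in step() performs backward elimination automatically, although it minimizes AIC rather than applying the p-value / adjusted-R-squared rule above, so its final variable set may differ.
# AIC-based backward elimination starting from the full first model
step_model <- step(lm(formula, data = data), direction = 'backward', trace = 0)
summary(step_model)$adj.r.squared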
Below, I plot the Predicted Sale Price against the Actual Sale Price, and overlay the line y = x to see how the predictions perform.
plot(fitted(model), data$SalePrice, xlab='Predicted Sale Price', ylab='Actual Sale Price',
main='Predicted Sale Price vs Actual Sale Price')
abline(0, 1)
As you can see, the model did a good job for house prices under $450,000, except for two major errors where the actual sale price is low but our prediction is very high. These are the two points in the lower right corner.
We can examine these two points, and their variables:
subset(data.frame(fitted(model), data$SalePrice), fitted.model. > 500000 & data.SalePrice < 200000)## fitted.model. data.SalePrice
## 524 545941.2 184750
## 1299 673168.2 160000
data[c(524, 1299), c(X, 'SalePrice')]## OverallQual GrLivArea GarageCars GarageArea TotalBsmtSF X1stFlrSF
## 524 10 4676 3 884 3138 3138
## 1299 10 5642 2 1418 6110 4692
## FullBath TotRmsAbvGrd YearBuilt YearRemodAdd Fireplaces BsmtFinSF1
## 524 3 11 2007 2008 1 2260
## 1299 2 12 2008 2008 3 5644
## WoodDeckSF X2ndFlrSF OpenPorchSF HalfBath LotArea BsmtFullBath
## 524 208 1538 406 1 40094 1
## 1299 214 950 292 1 63887 2
## BsmtUnfSF OverallQual.2 OverallQual.3 OverallQual.4 OverallQual.5
## 524 878 100 1000 10000 1e+05
## 1299 466 100 1000 10000 1e+05
## OverallQual.6 OverallQual.7 OverallQual.8 OverallQual.9 SalePrice
## 524 1e+06 1e+07 1e+08 1e+09 184750
## 1299 1e+06 1e+07 1e+08 1e+09 160000
As you can see, these are very good houses, with 10/10 OverallQual, recently built, and with large areas. However, each sold for a very low price of below $190,000, while our model predicts they could be worth over $500,000. Checking the SaleCondition reveals part of the reason why:
data[c(524, 1299), 'SaleCondition']## [1] Partial Partial
## Levels: Abnorml AdjLand Alloca Family Normal Partial
summary(data$SaleCondition)## Abnorml AdjLand Alloca Family Normal Partial
## 101 4 12 20 1198 125
Their SaleCondition is “Partial”, and this category is just 8.56% of the data. Categorical variables were not included in our model, therefore this effect was not captured.
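To illustrate how such a categorical effect could be captured (a hypothetical sketch, not part of the submitted model), SaleCondition can simply be appended to the final formula; lm expands the factor into dummy variables automatically.
# Hypothetical extension: add SaleCondition to the final model's formula
formula_sc <- update(result[[1]]$Formula, . ~ . + SaleCondition)
model_sc <- lm(formula_sc, data = data)
summary(model_sc)$adj.r.squared   # compare with the current model's adjusted R-squared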
Next I plot the fitted values vs residuals.
plot(fitted(model), resid(model), xlab='Fitted Values', ylab='Residuals',
main='Fitted Values vs Residuals')
Again, the two outliers stand out. Apart from them, most of the residuals fall fairly uniformly around zero, with some deviation for fitted values above 400,000.
Next I look at the Q-Q plot and histogram for the residuals to check residual normality.
qqnorm(resid(model))
qqline(resid(model))
hist(resid(model), breaks=100)
Besides the two outliers, most points again fall along the Q-Q line, and the histogram of residuals is by and large normal.
Below, I prepared the test set, and made the SalePrice prediction. The prediction was submitted to Kaggle.com.
test <- read.csv('test.csv')
for (i in 2:9){
new_col <- test$OverallQual ^ i
test <- cbind(test, new_col)
col_name <- paste('OverallQual', i, sep='.')
names(test)[dim(test)[2]] <- col_name
col_names <- c(col_names, col_name)
}
prediction <- predict(model, test)
prediction[is.na(prediction)] <- mean(prediction,na.rm=T)
id <- seq(1461, 1460+1459)
prediction <- data.frame(id, prediction)
names(prediction) <- c('Id', 'SalePrice')
# Write this data frame to csv:
# write.csv(prediction, file='prediction.csv', row.names=F)
The Kaggle.com score for this model is 0.16131. My Kaggle user name is Jun Yan.
There are many ways to improve this model. Thus far I have used only a few of the numeric variables in the data set; the many categorical variables could also be used. I have not plotted histograms or boxplots of the variables used in this model; doing so might reveal that some of them need a transformation, or that nonlinear terms derived from them could be added to improve the model.
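One concrete candidate (a sketch, assuming the same numeric predictors are kept): SalePrice is itself right-skewed and the Kaggle metric for this competition is computed on log prices, so fitting the model on log(SalePrice) and exponentiating the predictions is a common improvement.
# Hypothetical refit on the log scale, reusing the final model's formula
log_formula <- update(result[[1]]$Formula, log(SalePrice) ~ .)
log_model <- lm(log_formula, data = data)
log_pred <- exp(predict(log_model, test))   # back-transform to dollars; NAs would need the same handling as above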