Data 605 Final Project Week 16

Alexander Ng

12/06/2019

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(matrixcalc)

Problem 1: Random Variables

Section 1: Calculating probabilities

I choose \(N=7\) as the upper endpoint of the uniform \(Unif[1,N]\) distribution and generate the dataframe of random draws below.

num_trials = 10000
N = 7

df = data.frame( Unif = runif(num_trials, min=1, max=N), 
                 Normal = rnorm(num_trials ,  mean=(N+1)/2, sd=(N+1)/2 ) )

plot(df$Unif, df$Normal)

(x = median(df$Unif) )
## [1] 4.018455
(y =  quantile(df$Normal, probs = 0.25) )
##      25% 
## 1.354278
df %>% mutate( gtx = Unif > x, gty = Unif > y , Ygty = Normal > y, Xltx = Unif < x ) -> df2

The median of the Uniform variates is \(x =\) 4.0184553 and the 1st quartile of the Normal variates is \(y =\) 1.3542784 as shown above.

We also augment the random draws with the boolean variables \(gtx, gty, Ygty, Xltx\), which will assist us in tabulating the conditional probabilities.

In the following section, we build 3 contingency tables to empirically calculate the probability of all events related to \[ \begin{align} X > x \text{ and } X > y & \implies \text{ table } ta \\ X > x \text{ and } Y > y & \implies \text{ table } tb \\ X < x \text{ and } X > y & \implies \text{ table } tc\\ \end{align} \]

There are four possible combinations in each contingency table below, labelled with \(gtx\) on the rows and \(gty\) on the columns. \(gtx\) is True or False depending on whether \(X > x\), and \(gty\) is True or False depending on whether \(X > y\).

In the R chunk below, we display the first few rows of the raw dataframe.

kable(df2[1:5,], align=c('cccccc'))
Unif Normal gtx gty Ygty Xltx
5.012288 8.2796293 TRUE TRUE TRUE FALSE
6.082176 -0.6646849 TRUE TRUE FALSE FALSE
5.589171 1.3663078 TRUE TRUE TRUE FALSE
5.823520 -2.5489740 TRUE TRUE FALSE FALSE
3.901070 3.5493741 FALSE TRUE TRUE TRUE

The various probabilities can be estimated from contingency tables built on these indicator columns.

a. \(P(X > x \lvert X > y)\)

We generate the contingency table associated with these two events \(X > x\) and \(X > y\) using variables \(gtx, gty\) and display the results below.

ta = table(df2$gtx, df2$gty, dnn = c("gtx" , "gty"))
print(ta)
##        gty
## gtx     FALSE TRUE
##   FALSE   590 4410
##   TRUE      0 5000
(prob_a = ta[2,2] / ( sum(ta[,2]) ) )
## [1] 0.5313496

The desired conditional probability is the ratio of cells in the second column: \[0.5313496 = \frac{ 5000}{ 4410 + 5000}\]

b. \(P( X > x , Y > y)\)

We generate the contingency table associated with these two events \(X > x\) and \(Y > y\) using variables \(gtx, Ygty\) and display the results below.

tb = table(df2$gtx, df2$Ygty, dnn = c("gtx", "Ygty"))
print(tb)
##        Ygty
## gtx     FALSE TRUE
##   FALSE  1280 3720
##   TRUE   1220 3780
(prob_b = tb[2,2] / ( sum(tb) ) )
## [1] 0.378

The desired joint probability is cell (2,2) divided by the number of trials, \(10^4\). \[0.378 = \frac{ 3780}{ 10000}\]

c. \(P( X < x \lvert X > y)\)

We generate the contingency table associated with these two events \(X < x\) and \(X > y\) using variables \(Xltx, gty\) and display the results below.

tc = table(df2$Xltx, df2$gty, dnn = c("Xltx", "gty"))
print(tc)
##        gty
## Xltx    FALSE TRUE
##   FALSE     0 5000
##   TRUE    590 4410
(prob_c = tc[2,2] / sum(tc[,2]))
## [1] 0.4686504

The desired conditional probability is cell (2,2) divided by the total of column 2:

\[0.4686504 = \frac{ 4410}{ 9410}\]

Section 2: Marginal and Joint Probabilities

We re-use the contingency table \(tb\) calculated in the previous section to compare whether

\[ \underbrace{P(X>x \text{ and } Y > y)}_{U} \;\stackrel{?}{=}\; \underbrace{P(X>x)}_{V} \cdot \underbrace{P(Y>y)}_{W} \]

print(tb)
##        Ygty
## gtx     FALSE TRUE
##   FALSE  1280 3720
##   TRUE   1220 3780
(U = tb[2,2]/sum(tb) )
## [1] 0.378
(V = sum(tb[2,]) / sum(tb))
## [1] 0.5
(W = sum(tb[,2]) / sum(tb))
## [1] 0.75
(V*W)
## [1] 0.375

In the above, we see that the left hand side (joint) probability is \(U=\) 0.378 whereas the product of the marginal probabilities is \(VW=\) 0.375.

I conclude that the two sides of the equation are not exactly equal, but nearly so. Although the two series come from a random number generator and should be essentially independent, the observed counts will almost never satisfy the identity exactly because of sampling variation.
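For reference, the theoretical values implied by the definitions of \(x\) and \(y\) (a check not included in the original code) are

\[ P(X > x) = \frac{1}{2}, \qquad P(Y > y) = \frac{3}{4}, \qquad P(X>x)\,P(Y>y) = \frac{3}{8} = 0.375 \]

since \(x\) is the sample median of \(X\) and \(y\) is the first quartile of \(Y\). The simulated joint probability \(U = 0.378\) differs from this product only by sampling variation.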

Section 3: Fisher’s Exact Test and Chi-Squared Test

We now check for independence at the 99% confidence level using the Fisher Exact test.

(ftest = fisher.test(tb, conf.level = 0.99) )
## 
##  Fisher's Exact Test for Count Data
## 
## data:  tb
## p-value = 0.173
## alternative hypothesis: true odds ratio is not equal to 1
## 99 percent confidence interval:
##  0.9455133 1.2021293
## sample estimates:
## odds ratio 
##    1.06607

We now see the p-value 0.1730183 is quite large and conclude that we cannot reject the null hypothesis that the two events are independent.

(chitest = chisq.test( tb ) )
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tb
## X-squared = 1.8565, df = 1, p-value = 0.173
(p_value2 = chitest$p.value)
## [1] 0.1730254

The Chi-squared test implies we cannot reject the null hypothesis that the events are independent.

Thus, both Fisher's exact test and the chi-squared test of independence fail to reject the null hypothesis that the events are independent. This is consistent with our ex-ante expectation that data produced by the random number generators should be independent.

What is the difference between the two?

Fisher’s test is an exact test which directly computes the probability of the observed table using the hypergeometric distribution. The chi-squared test relies on an approximation which is valid when the sample size is large. The usual guidance is to use Fisher’s test on small data sets and the chi-squared test on larger ones.

According to McDonald’s Handbook of Biological Statistics, Fisher’s exact test should be used when the total sample size or the expected cell counts are small. In this case, the sample size is large (10000), so Fisher’s test is not necessarily the appropriate choice.

By contrast, the chi-squared test can be used when the sample size is large provided that the expected count of each cell is large enough. One rule of thumb is \(n p_{i} \geq 5\) for every cell \(i\).

Which is most appropriate?

The smallest marginal probability is \(p_i = 0.25\) and there are 10000 trials, so the smallest expected cell count is \(0.5 \times 0.25 \times 10000 = 1250\) and the rule of thumb is easily satisfied. Because the sample size and the expected cell counts are large for our experiment, the chi-squared test is preferred.
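As a quick check (a minimal sketch, not part of the original analysis), chisq.test returns the expected cell counts it uses, and we can confirm they all exceed 5:

# Expected cell counts under the independence hypothesis
(expected_counts = chisq.test(tb)$expected)

# Rule of thumb: every expected count should be at least 5
all( expected_counts >= 5 )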

Problem 2 - Kaggle Home Price Competition

train = read_csv("house-prices-advanced-regression-techniques/train.csv")

Descriptive and Inferential Statistics

(dim(train) )
## [1] 1460   81
(attributes = sort(colnames(train)) )
##  [1] "1stFlrSF"      "2ndFlrSF"      "3SsnPorch"     "Alley"        
##  [5] "BedroomAbvGr"  "BldgType"      "BsmtCond"      "BsmtExposure" 
##  [9] "BsmtFinSF1"    "BsmtFinSF2"    "BsmtFinType1"  "BsmtFinType2" 
## [13] "BsmtFullBath"  "BsmtHalfBath"  "BsmtQual"      "BsmtUnfSF"    
## [17] "CentralAir"    "Condition1"    "Condition2"    "Electrical"   
## [21] "EnclosedPorch" "ExterCond"     "Exterior1st"   "Exterior2nd"  
## [25] "ExterQual"     "Fence"         "FireplaceQu"   "Fireplaces"   
## [29] "Foundation"    "FullBath"      "Functional"    "GarageArea"   
## [33] "GarageCars"    "GarageCond"    "GarageFinish"  "GarageQual"   
## [37] "GarageType"    "GarageYrBlt"   "GrLivArea"     "HalfBath"     
## [41] "Heating"       "HeatingQC"     "HouseStyle"    "Id"           
## [45] "KitchenAbvGr"  "KitchenQual"   "LandContour"   "LandSlope"    
## [49] "LotArea"       "LotConfig"     "LotFrontage"   "LotShape"     
## [53] "LowQualFinSF"  "MasVnrArea"    "MasVnrType"    "MiscFeature"  
## [57] "MiscVal"       "MoSold"        "MSSubClass"    "MSZoning"     
## [61] "Neighborhood"  "OpenPorchSF"   "OverallCond"   "OverallQual"  
## [65] "PavedDrive"    "PoolArea"      "PoolQC"        "RoofMatl"     
## [69] "RoofStyle"     "SaleCondition" "SalePrice"     "SaleType"     
## [73] "ScreenPorch"   "Street"        "TotalBsmtSF"   "TotRmsAbvGrd" 
## [77] "Utilities"     "WoodDeckSF"    "YearBuilt"     "YearRemodAdd" 
## [81] "YrSold"
train %>% summarize_all( funs( n_distinct(.) ) ) %>% gather() %>% arrange(desc(value)) %>% kable()
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
key value
Id 1460
LotArea 1073
GrLivArea 861
BsmtUnfSF 780
1stFlrSF 753
TotalBsmtSF 721
SalePrice 663
BsmtFinSF1 637
GarageArea 441
2ndFlrSF 417
MasVnrArea 328
WoodDeckSF 274
OpenPorchSF 202
BsmtFinSF2 144
EnclosedPorch 120
YearBuilt 112
LotFrontage 111
GarageYrBlt 98
ScreenPorch 76
YearRemodAdd 61
Neighborhood 25
LowQualFinSF 24
MiscVal 21
3SsnPorch 20
Exterior2nd 16
MSSubClass 15
Exterior1st 15
TotRmsAbvGrd 12
MoSold 12
OverallQual 10
Condition1 9
OverallCond 9
SaleType 9
Condition2 8
HouseStyle 8
RoofMatl 8
BedroomAbvGr 8
PoolArea 8
BsmtFinType1 7
BsmtFinType2 7
Functional 7
GarageType 7
RoofStyle 6
Foundation 6
Heating 6
Electrical 6
FireplaceQu 6
GarageQual 6
GarageCond 6
SaleCondition 6
MSZoning 5
LotConfig 5
BldgType 5
MasVnrType 5
ExterCond 5
BsmtQual 5
BsmtCond 5
BsmtExposure 5
HeatingQC 5
GarageCars 5
Fence 5
MiscFeature 5
YrSold 5
LotShape 4
LandContour 4
ExterQual 4
BsmtFullBath 4
FullBath 4
KitchenAbvGr 4
KitchenQual 4
Fireplaces 4
GarageFinish 4
PoolQC 4
Alley 3
LandSlope 3
BsmtHalfBath 3
HalfBath 3
PavedDrive 3
Street 2
Utilities 2
CentralAir 2
num_train = ( train %>% select_if( is.numeric ) )

num_train %>% gather() %>% ggplot(aes(value)) + facet_wrap( ~ key, scales = "free" ) + 
     geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We immediately see there are 38 numeric variables, whose histograms are displayed in the panel above.

Certain variables, despite having numeric datatypes, exhibit discrete distributions. Others appear to be continuous. We focus on the continuous variables and collect them in the list cont_vars below.

cont_vars = c("1stFlrSF", "BsmtFinSF1", "BsmtUnfSF", "GarageArea" , "GrLivArea" , "LotFrontage" ,
              "LotArea" , "SalePrice" , "TotalBsmtSF" )

train %>% select( cont_vars) %>% gather() %>% ggplot(aes(value)) + facet_wrap( ~key, scales = "free") + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 259 rows containing non-finite values (stat_bin).

The panel of histograms above shows that the continuous variables are right skewed. In the plots below, we do exploratory analysis of this rich data set.

Obviously, we cannot do justice to all the variables, so we use common sense to focus on relevant attributes. The scatter plots below show that GrLivArea and LotFrontage are important explanatory variables. They are positively related to sale price (as we would expect); however, the variance grows with square footage, which suggests that other variables become more relevant as the size of the house increases.

ggplot(train, aes(x=GrLivArea,y=SalePrice) ) + geom_point()

ggplot(train, aes(LotFrontage, SalePrice) ) + geom_point()
## Warning: Removed 259 rows containing missing values (geom_point).

In the plots below, we show box plots of the distribution of SalePrice grouped by neighborhood. We know intuitively that location is a key driver of real estate values. The box plot is sorted by the median home price in each neighborhood. We observe significant variation across the neighborhood medians and conclude that this factor should be included in a home pricing model.

Lastly, we consider the impact of bedrooms on the home price. While there is a trend in the box plot of the median sale price for 2, 3 and 4 bedrooms, the pattern breaks down at 0 and at 5, 6 and 8 bedrooms. A confounding variable could be the type of property that has a given number of bedrooms.

ggplot(train, aes(x = reorder(Neighborhood, SalePrice, FUN=median), y=SalePrice) ) + geom_boxplot() + coord_flip()

ggplot(train, aes(x = factor(BedroomAbvGr), y=SalePrice) ) + geom_boxplot() + coord_flip()

qtrain = train[,c("1stFlrSF", "SalePrice", "GrLivArea") ]

plot(qtrain)

(qcor = cor( qtrain ) )
##            1stFlrSF SalePrice GrLivArea
## 1stFlrSF  1.0000000 0.6058522 0.5660240
## SalePrice 0.6058522 1.0000000 0.7086245
## GrLivArea 0.5660240 0.7086245 1.0000000
summary(qtrain)
##     1stFlrSF      SalePrice        GrLivArea   
##  Min.   : 334   Min.   : 34900   Min.   : 334  
##  1st Qu.: 882   1st Qu.:129975   1st Qu.:1130  
##  Median :1087   Median :163000   Median :1464  
##  Mean   :1163   Mean   :180921   Mean   :1515  
##  3rd Qu.:1391   3rd Qu.:214000   3rd Qu.:1777  
##  Max.   :4692   Max.   :755000   Max.   :5642

Discussion

Note the interesting property that the GrLivArea vs. 1stFlrSF scatterplot has a deterministic structure: many points lie exactly on the 45-degree line. That is, for some houses, the first floor square footage is the entire above-ground living area. These are probably sales of ranch houses or apartments without a second floor.
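A quick way to verify this claim (a small check not run in the original) is to count the rows where the two columns coincide exactly:

# Number of houses whose entire above-ground living area is on the first floor
sum( train$GrLivArea == train$`1stFlrSF` )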

Note that SalePrice shows a nearly linear upper bound in the scatterplot against GrLivArea. This suggests buyers are unwilling to pay above a certain limit for a home with a given amount of above-ground living space, and that limit appears to be linear in the square footage.
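One way to visualize this upper envelope, assuming the quantreg package is available (it is not used elsewhere in this report), is to fit a line through a high conditional quantile of SalePrice:

library(quantreg)

# Fit the 99th conditional percentile of SalePrice as a linear function of GrLivArea
upper_fit = rq( SalePrice ~ GrLivArea, tau = 0.99, data = train )

# Overlay the fitted upper envelope on the scatter plot
plot( train$GrLivArea, train$SalePrice )
abline( coef(upper_fit)[1], coef(upper_fit)[2], col = "red" )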

Hypothesis Testing

Now we test the hypothesis that the correlation between each pair of these variables is zero and provide an 80% confidence interval. We use the cor.test function with the method set to pearson.

(qcortest1 = cor.test(qtrain$`1stFlrSF`, qtrain$SalePrice, method="pearson", conf.level = 0.80 ) )
## 
##  Pearson's product-moment correlation
## 
## data:  qtrain$`1stFlrSF` and qtrain$SalePrice
## t = 29.078, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5841687 0.6266715
## sample estimates:
##       cor 
## 0.6058522
(qcortest2 = cor.test(qtrain$`1stFlrSF`, qtrain$GrLivArea, method="pearson", conf.level = 0.80 ) )
## 
##  Pearson's product-moment correlation
## 
## data:  qtrain$`1stFlrSF` and qtrain$GrLivArea
## t = 26.217, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5427732 0.5884078
## sample estimates:
##      cor 
## 0.566024
(qcortest3 = cor.test(qtrain$SalePrice, qtrain$GrLivArea, method="pearson", conf.level = 0.80 ) )
## 
##  Pearson's product-moment correlation
## 
## data:  qtrain$SalePrice and qtrain$GrLivArea
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6915087 0.7249450
## sample estimates:
##       cor 
## 0.7086245

These results show that we reject the null hypothesis that each of the 3 correlations is zero. We have also constructed an 80% confidence interval for each pairwise correlation.

For example, between 1stFlrSF (1st floor square footage) and SalePrice, the point estimate of the correlation is 60.58% with an 80% confidence interval of 58.41% to 62.66%. For the variables 1stFlrSF and GrLivArea, the point estimate is 56.60% with a confidence interval of 54.27% to 58.84%. For the variables SalePrice and GrLivArea, the point estimate is 70.86% with a confidence interval of 69.15% to 72.49%.
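For reference, cor.test constructs these intervals from Fisher's z-transformation of the sample correlation \(r\) with \(n\) observations:

\[ z = \operatorname{atanh}(r), \qquad CI_{80\%} = \tanh\!\left( z \pm \frac{z_{0.90}}{\sqrt{n-3}} \right), \qquad z_{0.90} \approx 1.2816 \]

For the first pair, \(r = 0.6059\) and \(n = 1460\) give \(\tanh(0.7023 \pm 0.0336) \approx (0.584,\ 0.627)\), matching the reported interval.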

Discussion

It is not statistically valid to test more than one correlation coefficient separately on related data series, because the correlations must jointly satisfy the positive semidefinite requirement of a correlation matrix. In a 3x3 correlation matrix, if \(r\) and \(\rho\) are positive correlations between variables BC and AB respectively, then the correlation \(c\) between AC must satisfy the lower bound:

\[c \geq \rho r - \sqrt{ 1 - \rho^2 + \rho^2 r^2 - r^2}\]

A standalone hypothesis test on \(c\) being equal to zero would assign non-zero confidence to \(c\) being less than the above lower bound. For example, if \(r = .95, \rho = .95\) then the lower bound on \(c\) is 0.805.

r = .95
rho = .95

(lower_bound = r * rho - sqrt( 1 - rho*rho + rho * rho * r * r - r*r ) )
## [1] 0.805

Please refer to [stackexchange.com](https://quant.stackexchange.com/questions/529/how-to-quickly-estimate-a-lower-bound-on-correlation-for-a-large-number-of-stock) for more details of the lower bound.

Thus far, I have not found any statistical test for the joint validity of correlations for multiple variables.
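As a sanity check (a small sketch using the matrixcalc package loaded at the top of this report), we can confirm that the observed 3x3 correlation matrix does satisfy the positive semidefinite requirement:

# Eigenvalues of a valid correlation matrix must all be non-negative
eigen(qcor)$values

# matrixcalc provides a direct test of positive semidefiniteness
is.positive.semi.definite( qcor )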

Linear Algebra and Correlation

Now we compute the inverse of the 3x3 correlation matrix previously obtained.

(precision_mat = solve(qcor) )
##             1stFlrSF SalePrice  GrLivArea
## 1stFlrSF   1.6795240 -0.690746 -0.4611713
## SalePrice -0.6907460  2.292718 -1.2336974
## GrLivArea -0.4611713 -1.233697  2.1352622

Now we right-multiply \(C\) (qcor) by \(P\) (precision_mat) and compare with the left multiplication of \(C\) by \(P\).

(right_multiply = qcor %*% precision_mat )
##                1stFlrSF SalePrice GrLivArea
## 1stFlrSF   1.000000e+00         0         0
## SalePrice  5.551115e-17         1         0
## GrLivArea -5.551115e-17         0         1
(left_multiply = precision_mat %*% qcor )
##                1stFlrSF    SalePrice     GrLivArea
## 1stFlrSF   1.000000e+00 1.665335e-16  0.000000e+00
## SalePrice -2.220446e-16 1.000000e+00 -2.220446e-16
## GrLivArea  0.000000e+00 2.220446e-16  1.000000e+00
# Exact element-wise comparison: returns FALSE if any entry differs, even by round-off
matrices_equal = function(a, b){
  is.matrix(a) && is.matrix(b) && all( dim(a) == dim(b) ) && all( a == b )
}

matrices_equal(right_multiply, left_multiply)
## [1] FALSE

As the matrices_equal call above shows, the two products are not exactly equal even though the correlations are well behaved and the precision matrix has small variance inflation factors on its diagonal. Both products equal the identity up to floating-point round-off (off-diagonal entries on the order of \(10^{-16}\)), but the exact element-wise comparison detects those tiny differences.
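A tolerance-based comparison, which is the appropriate way to compare floating-point matrices (a minimal sketch not in the original), confirms the two products agree to machine precision:

# Compare up to numerical tolerance instead of exact equality
all.equal( right_multiply, left_multiply, check.attributes = FALSE )

# Largest absolute discrepancy between the two products
max( abs( right_multiply - left_multiply ) )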

Lastly, we perform an LU decomposition of the correlation matrix. Notice that the LU decomposition does not change the fact that the correlation matrix and its inverse do not give the identity matrix exactly when multiplied together in either order.

(lu = lu.decomposition(qcor) )
## $L
##           [,1]      [,2] [,3]
## [1,] 1.0000000 0.0000000    0
## [2,] 0.6058522 1.0000000    0
## [3,] 0.5660240 0.5777733    1
## 
## $U
##      [,1]          [,2]      [,3]
## [1,]    1  6.058522e-01 0.5660240
## [2,]    0  6.329431e-01 0.3656976
## [3,]    0 -5.551115e-17 0.4683266
(Linv = solve(lu$L) )
##            [,1]       [,2] [,3]
## [1,]  1.0000000  0.0000000    0
## [2,] -0.6058522  1.0000000    0
## [3,] -0.2159788 -0.5777733    1
(Uinv = solve(lu$U) )
##      [,1]          [,2]       [,3]
## [1,]    1 -9.571985e-01 -0.4611713
## [2,]    0  1.579921e+00 -1.2336974
## [3,]    0  1.872694e-16  2.1352622
(qInv = Uinv %*% Linv )
##            [,1]      [,2]       [,3]
## [1,]  1.6795240 -0.690746 -0.4611713
## [2,] -0.6907460  2.292718 -1.2336974
## [3,] -0.4611713 -1.233697  2.1352622
(qInv %*% qcor )
##          1stFlrSF    SalePrice    GrLivArea
## [1,] 1.000000e+00 5.551115e-17 0.000000e+00
## [2,] 1.110223e-16 1.000000e+00 2.220446e-16
## [3,] 0.000000e+00 4.440892e-16 1.000000e+00
matrices_equal(qInv %*% qcor, qcor %*% qInv )
## [1] FALSE

Calculus-based Probability and Statistics

We will use the variable GrLivArea. The histogram and density plot show the right skew of the distribution very clearly. Notice that GrLivArea is strictly positive, so we do not need to shift the data before fitting an exponential distribution.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
summary(train$GrLivArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1130    1464    1515    1777    5642
ggplot(train, aes(GrLivArea)) + 
  geom_histogram( aes(y=..density..), fill="lightblue") + 
  geom_density(color="red")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

(fit1 = fitdistr(train$GrLivArea, densfun="exponential"))
##        rate    
##   6.598640e-04 
##  (1.726943e-05)
(lambda = fit1$estimate)
##        rate 
## 0.000659864
V = rexp(1000, lambda)
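As a check (not part of the original code), the maximum-likelihood estimate of an exponential rate is the reciprocal of the sample mean, which agrees with the fitted rate above:

# MLE of the exponential rate parameter is 1 / sample mean
1 / mean( train$GrLivArea )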

Now we plot the overlay of the two density plots. To make the comparison easier, we draw the same number of exponential samples (1460) as there are rows in the original data set.

library(reshape2)
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
set.seed(101)

V2 = rexp(1460, lambda)

overlayData = data.frame( empirical = train$GrLivArea, functional = V2)

data = melt(overlayData)
## No id variables; using all as measure variables
head(data)
##    variable value
## 1 empirical  1710
## 2 empirical  1262
## 3 empirical  1786
## 4 empirical  1717
## 5 empirical  2198
## 6 empirical  1362
ggplot( data, aes(x=value, fill=variable) ) + geom_density(alpha=0.25)

In the above plot, we see the empirical data is much more concentrated than the exponential fit. The exponential distribution fitted with \(\lambda = \frac{1}{\mu_{GrLivArea}}\) has too much variance.
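To quantify this (a small check, not in the original code): for an exponential distribution the standard deviation equals the mean, \(1/\lambda \approx 1515\), while the empirical standard deviation of GrLivArea is only about 525.

# Exponential sd equals its mean (1/lambda); the empirical sd is far smaller
1 / lambda
sd( train$GrLivArea )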

We now calculate the 5th and 95th percentiles of the fitted exponential distribution.

(pct5 = qexp( 0.05, rate=lambda) )
## [1] 77.73313
(pct95 = qexp( 0.95, rate= lambda ))
## [1] 4539.924

Now we calculate the 95% confidence interval for our estimate of the mean from the empirical data assuming normality. Note that \(z=1.96\) is the z-score for the 95 percent confidence level.

(mu = mean(train$GrLivArea) )
## [1] 1515.464
(sd = sd( train$GrLivArea) )
## [1] 525.4804
(lower5 = mu - 1.96 * sd / sqrt(1460) )
## [1] 1488.509
(upper95 = mu + 1.96 * sd / sqrt(1460) )
## [1] 1542.419
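Equivalently (a sketch not part of the original calculation), a t-based interval gives essentially the same result since \(n = 1460\) is large:

# t-based 95% confidence interval for the mean of GrLivArea
t.test( train$GrLivArea, conf.level = 0.95 )$conf.int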

Lastly, we calculate the empirical 5th and 95th percentiles.

(empirical_quantiles = quantile(train$GrLivArea, c( 0.05, 0.95 )) )
##     5%    95% 
##  848.0 2466.1

The 5th and 95th percentiles of the exponential and empirical distributions are comparable quantities. The 95% confidence interval of the mean is not; it measures something else: the precision of our point estimate of the mean under the assumption of normality.

Consistent with our previous comment that the exponential distribution has too much variance to adequately model the GrLivArea data, we see that the theoretical 5-95 interval is much wider than the empirical interval. The theoretical interval is \([77.73, 4539.92]\) whereas the empirical interval is \([848, 2466]\). Logically, people cannot live in a house with 77 square feet of living space.

Modeling

The following model was uploaded to Kaggle for the competition. It uses 6 explanatory variables in the regression.

SalePrice is predicted from the overall quality score, above-ground living area, lot square footage, year of sale, neighborhood, and number of above-ground bedrooms. These factors capture several important considerations: the amount of living space is reflected by GrLivArea, LotArea and the number of bedrooms; the neighborhood proxies for the attractiveness of the location; and YrSold accounts for year-over-year market fluctuations. OverallQual reflects the quality of materials and construction, which in turn proxies for the affluence of the owners.

Overall, these factors explain about 79% of the variation in housing prices as measured by the adjusted R-squared. I think that is a reasonably good model given the limited number of variables used. The diagnostic plots of the model are shown below. The residuals vs. fitted plot shows a reasonably good fit, but the Q-Q plot shows that houses at the extreme upper and lower ends of the price range are outliers.

The model has two influential observations, 524 and 1299, which affect the lower end of the prices; they also stand out in the Cook's distance plot. We could omit those outliers for a better fit, but they should first be explored in more detail. A quick sensitivity check appears after the diagnostic plots below.

mod3 = lm( SalePrice ~ OverallQual + GrLivArea + LotArea + YrSold + 
             as.factor(Neighborhood)  + as.factor(BedroomAbvGr) ,
           data=train)

summary(mod3)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + GrLivArea + LotArea + 
##     YrSold + as.factor(Neighborhood) + as.factor(BedroomAbvGr), 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -366418  -16675      56   15233  265058 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     2.633e+05  1.444e+06   0.182 0.855406    
## OverallQual                     2.023e+04  1.161e+03  17.421  < 2e-16 ***
## GrLivArea                       6.088e+01  3.089e+00  19.708  < 2e-16 ***
## LotArea                         6.906e-01  1.074e-01   6.430 1.74e-10 ***
## YrSold                         -1.397e+02  7.193e+02  -0.194 0.846023    
## as.factor(Neighborhood)Blueste -2.497e+04  2.703e+04  -0.924 0.355646    
## as.factor(Neighborhood)BrDale  -3.671e+04  1.275e+04  -2.880 0.004040 ** 
## as.factor(Neighborhood)BrkSide -1.058e+04  1.031e+04  -1.027 0.304813    
## as.factor(Neighborhood)ClearCr  1.131e+04  1.167e+04   0.969 0.332843    
## as.factor(Neighborhood)CollgCr  1.267e+04  9.486e+03   1.336 0.181723    
## as.factor(Neighborhood)Crawfor  1.372e+04  1.037e+04   1.323 0.185934    
## as.factor(Neighborhood)Edwards -1.619e+04  9.916e+03  -1.633 0.102784    
## as.factor(Neighborhood)Gilbert  3.684e+02  9.960e+03   0.037 0.970502    
## as.factor(Neighborhood)IDOTRR  -2.665e+04  1.097e+04  -2.430 0.015216 *  
## as.factor(Neighborhood)MeadowV -1.534e+04  1.285e+04  -1.193 0.233044    
## as.factor(Neighborhood)Mitchel  2.464e+03  1.046e+04   0.236 0.813729    
## as.factor(Neighborhood)NAmes   -1.751e+03  9.501e+03  -0.184 0.853831    
## as.factor(Neighborhood)NoRidge  6.360e+04  1.076e+04   5.908 4.33e-09 ***
## as.factor(Neighborhood)NPkVill -1.211e+04  1.502e+04  -0.806 0.420259    
## as.factor(Neighborhood)NridgHt  7.059e+04  9.817e+03   7.191 1.04e-12 ***
## as.factor(Neighborhood)NWAmes  -2.881e+03  1.007e+04  -0.286 0.774830    
## as.factor(Neighborhood)OldTown -2.914e+04  9.731e+03  -2.994 0.002797 ** 
## as.factor(Neighborhood)Sawyer   7.499e+02  1.026e+04   0.073 0.941747    
## as.factor(Neighborhood)SawyerW  2.372e+03  1.016e+04   0.233 0.815534    
## as.factor(Neighborhood)Somerst  1.967e+04  9.704e+03   2.027 0.042863 *  
## as.factor(Neighborhood)StoneBr  6.563e+04  1.144e+04   5.736 1.18e-08 ***
## as.factor(Neighborhood)SWISU   -2.646e+04  1.188e+04  -2.226 0.026139 *  
## as.factor(Neighborhood)Timber   2.171e+04  1.085e+04   2.001 0.045606 *  
## as.factor(Neighborhood)Veenker  3.816e+04  1.416e+04   2.695 0.007130 ** 
## as.factor(BedroomAbvGr)1       -1.072e+04  1.584e+04  -0.677 0.498639    
## as.factor(BedroomAbvGr)2       -2.488e+04  1.513e+04  -1.645 0.100264    
## as.factor(BedroomAbvGr)3       -3.003e+04  1.507e+04  -1.993 0.046450 *  
## as.factor(BedroomAbvGr)4       -3.645e+04  1.535e+04  -2.374 0.017729 *  
## as.factor(BedroomAbvGr)5       -6.442e+04  1.726e+04  -3.731 0.000198 ***
## as.factor(BedroomAbvGr)6       -6.028e+04  2.054e+04  -2.934 0.003396 ** 
## as.factor(BedroomAbvGr)8       -9.199e+04  4.026e+04  -2.285 0.022463 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36050 on 1424 degrees of freedom
## Multiple R-squared:  0.7991, Adjusted R-squared:  0.7941 
## F-statistic: 161.8 on 35 and 1424 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mod3)
## Warning: not plotting observations with leverage one:
##   636

## Warning: not plotting observations with leverage one:
##   636
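As noted in the discussion of influential observations, a quick sensitivity check (a sketch, not part of the submitted model) is to refit the same model without observations 524 and 1299 and compare the adjusted R-squared:

# Refit excluding the two influential observations flagged in the diagnostics
mod3_trim = update( mod3, data = train[ -c(524, 1299), ] )

# Compare adjusted R-squared with and without the outliers
summary(mod3)$adj.r.squared
summary(mod3_trim)$adj.r.squared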