DATA 605 - Final

Your final is due by the end of the last week of class. You should post your solutions to your GitHub account or RPubs. You are also expected to make a short presentation via YouTube and post that recording to the board. This project will show off your ability to understand the elements of the class.

library(tidyverse)
library(matrixcalc)
library(scales)
library(psych)
library(MASS)
library(kableExtra)
library(GGally)
library(fastDummies)

Problem 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of μ=σ=(N+1)/2.

set.seed(605)
N <- 9
n <- 10000
X <- runif(n, 1, N)
mean <- (N+1)/2
sd <- (N+1)/2
Y <- rnorm(n, mean=mean, sd=sd)

plot(X, Y, pch=16, col=rgb(0, 0, 1, 0.3), main='Joint Probability')

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

Recall for independent events, \(P(X \ and \ Y) = P(X)P(Y)\) and for dependent events \(P(X \ and \ Y) = P(X)P(Y \ | \ X)\)

x <- median(X)
x

## [1] 5.03

y <- quantile(Y)[2]
y

##  25% 
## 1.65

a. P(X>x | X>y)

\(P(X>x \ | \ X>y) = \frac{P(X>x \ and \ X>y)}{P(X>y)}\) where the numerator simplifies to \(P(X>x)\)

p_Xgy <- length(X[X>y])/n
p_Xgx <- length(X[X>y & X>x])/n

p_Xgx / p_Xgy

## [1] 0.547

b. P(X>x, Y>y)

This is a joint probability.

p_Xgx <- length(X[X>x])/n
p_Ygy <- length(Y[Y>y])/n
p_Xgx * p_Ygy

## [1] 0.375

c. P(X<x | X>y)

Similarly to part a, we are looking for \(P(X<x \ | \ X>y) = \frac{P(X<x \ and \ X>y)}{P(X>y)} = \frac{P(y<X<x)}{P(X>y)}\).

p_Xgy <- length(X[X>y])/n
p_Xxy <- length(X[X>y & X<x])/n

p_Xxy / p_Xgy

## [1] 0.453

Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

m <- matrix(c(length(X[X<x])/n*length(X[Y<y])/n,
                  length(X[X>x])/n*length(X[Y<y])/n,
                  length(X[X<x])/n*length(X[Y>y])/n,
                  length(X[X>x])/n*length(X[Y>y])/n),
                  ncol=2,byrow=TRUE)

colnames(m) <- c("P(X<x)","P(X>x)")
rownames(m) <- c("P(Y<y)","P(Y>y)")

showtable(m)


	P(X<x)	P(X>x)
P(Y<y)	0.125	0.125
P(Y>y)	0.375	0.375

Marginal probability of X:

margin.table(m,2)

## P(X<x) P(X>x) 
##    0.5    0.5

Marginal probability of Y:

margin.table(m,1)

## P(Y<y) P(Y>y) 
##   0.25   0.75

If we compare the product of the two marginal probabilities P(X>x)P(Y>y) = 0.5*0.75 = 0.375 we get the same result as the joint probability in part b (also shown in bottom right of table above). The statement holds.

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

We see that from both test results below, we obtain a very high p-value (1) so we conclude that independence holds. Fisher’s test is exact in its accuracy, and is usually used for small sample sizes summarized in 2x2 contingency tables. The Chi-squared test is used for large sample sizes of arbitraty dimensions. Since we are working with a 2x2 contingency table, Fisher’s test is appriopriate, but the chi-square might be preffered because of the large sample size. Given that the random variables were generated independently, it should not be surprising that both results hold.

fisher.test(m*n)

## 
##  Fisher's Exact Test for Count Data
## 
## data:  m * n
## p-value = 1
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.912 1.096
## sample estimates:
## odds ratio 
##          1

chisq.test(m*n)

## 
##  Pearson's Chi-squared test
## 
## data:  m * n
## X-squared = 0, df = 1, p-value = 1

Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques.

Descriptive and Inferential Statistics

Load data set:

train <- read_csv('https://raw.githubusercontent.com/maelillien/data/master/kaggle_house_data/train.csv', col_names = TRUE)

Provide univariate descriptive statistics and appropriate plots for the training data set.

summary(train)

##        Id         MSSubClass      MSZoning          LotFrontage 
##  Min.   :   1   Min.   : 20.0   Length:1460        Min.   : 21  
##  1st Qu.: 366   1st Qu.: 20.0   Class :character   1st Qu.: 59  
##  Median : 730   Median : 50.0   Mode  :character   Median : 69  
##  Mean   : 730   Mean   : 56.9                      Mean   : 70  
##  3rd Qu.:1095   3rd Qu.: 70.0                      3rd Qu.: 80  
##  Max.   :1460   Max.   :190.0                      Max.   :313  
##                                                    NA's   :259  
##     LotArea          Street             Alley             LotShape        
##  Min.   :  1300   Length:1460        Length:1460        Length:1460       
##  1st Qu.:  7554   Class :character   Class :character   Class :character  
##  Median :  9478   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10517                                                           
##  3rd Qu.: 11602                                                           
##  Max.   :215245                                                           
##                                                                           
##  LandContour         Utilities          LotConfig        
##  Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   LandSlope         Neighborhood        Condition1       
##  Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   Condition2          BldgType          HouseStyle         OverallQual  
##  Length:1460        Length:1460        Length:1460        Min.   : 1.0  
##  Class :character   Class :character   Class :character   1st Qu.: 5.0  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.0  
##                                                           Mean   : 6.1  
##                                                           3rd Qu.: 7.0  
##                                                           Max.   :10.0  
##                                                                         
##   OverallCond     YearBuilt     YearRemodAdd   RoofStyle        
##  Min.   :1.00   Min.   :1872   Min.   :1950   Length:1460       
##  1st Qu.:5.00   1st Qu.:1954   1st Qu.:1967   Class :character  
##  Median :5.00   Median :1973   Median :1994   Mode  :character  
##  Mean   :5.58   Mean   :1971   Mean   :1985                     
##  3rd Qu.:6.00   3rd Qu.:2000   3rd Qu.:2004                     
##  Max.   :9.00   Max.   :2010   Max.   :2010                     
##                                                                 
##    RoofMatl         Exterior1st        Exterior2nd       
##  Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   MasVnrType          MasVnrArea    ExterQual          ExterCond        
##  Length:1460        Min.   :   0   Length:1460        Length:1460       
##  Class :character   1st Qu.:   0   Class :character   Class :character  
##  Mode  :character   Median :   0   Mode  :character   Mode  :character  
##                     Mean   : 104                                        
##                     3rd Qu.: 166                                        
##                     Max.   :1600                                        
##                     NA's   :8                                           
##   Foundation          BsmtQual           BsmtCond        
##  Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  BsmtExposure       BsmtFinType1         BsmtFinSF1   BsmtFinType2      
##  Length:1460        Length:1460        Min.   :   0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0   Class :character  
##  Mode  :character   Mode  :character   Median : 384   Mode  :character  
##                                        Mean   : 444                     
##                                        3rd Qu.: 712                     
##                                        Max.   :5644                     
##                                                                         
##    BsmtFinSF2     BsmtUnfSF     TotalBsmtSF     Heating         
##  Min.   :   0   Min.   :   0   Min.   :   0   Length:1460       
##  1st Qu.:   0   1st Qu.: 223   1st Qu.: 796   Class :character  
##  Median :   0   Median : 478   Median : 992   Mode  :character  
##  Mean   :  47   Mean   : 567   Mean   :1057                     
##  3rd Qu.:   0   3rd Qu.: 808   3rd Qu.:1298                     
##  Max.   :1474   Max.   :2336   Max.   :6110                     
##                                                                 
##   HeatingQC          CentralAir         Electrical           1stFlrSF   
##  Length:1460        Length:1460        Length:1460        Min.   : 334  
##  Class :character   Class :character   Class :character   1st Qu.: 882  
##  Mode  :character   Mode  :character   Mode  :character   Median :1087  
##                                                           Mean   :1163  
##                                                           3rd Qu.:1391  
##                                                           Max.   :4692  
##                                                                         
##     2ndFlrSF     LowQualFinSF   GrLivArea     BsmtFullBath  
##  Min.   :   0   Min.   :  0   Min.   : 334   Min.   :0.000  
##  1st Qu.:   0   1st Qu.:  0   1st Qu.:1130   1st Qu.:0.000  
##  Median :   0   Median :  0   Median :1464   Median :0.000  
##  Mean   : 347   Mean   :  6   Mean   :1515   Mean   :0.425  
##  3rd Qu.: 728   3rd Qu.:  0   3rd Qu.:1777   3rd Qu.:1.000  
##  Max.   :2065   Max.   :572   Max.   :5642   Max.   :3.000  
##                                                             
##   BsmtHalfBath      FullBath       HalfBath      BedroomAbvGr 
##  Min.   :0.000   Min.   :0.00   Min.   :0.000   Min.   :0.00  
##  1st Qu.:0.000   1st Qu.:1.00   1st Qu.:0.000   1st Qu.:2.00  
##  Median :0.000   Median :2.00   Median :0.000   Median :3.00  
##  Mean   :0.058   Mean   :1.56   Mean   :0.383   Mean   :2.87  
##  3rd Qu.:0.000   3rd Qu.:2.00   3rd Qu.:1.000   3rd Qu.:3.00  
##  Max.   :2.000   Max.   :3.00   Max.   :2.000   Max.   :8.00  
##                                                               
##   KitchenAbvGr  KitchenQual         TotRmsAbvGrd    Functional       
##  Min.   :0.00   Length:1460        Min.   : 2.00   Length:1460       
##  1st Qu.:1.00   Class :character   1st Qu.: 5.00   Class :character  
##  Median :1.00   Mode  :character   Median : 6.00   Mode  :character  
##  Mean   :1.05                      Mean   : 6.52                     
##  3rd Qu.:1.00                      3rd Qu.: 7.00                     
##  Max.   :3.00                      Max.   :14.00                     
##                                                                      
##    Fireplaces    FireplaceQu         GarageType         GarageYrBlt  
##  Min.   :0.000   Length:1460        Length:1460        Min.   :1900  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.:1961  
##  Median :1.000   Mode  :character   Mode  :character   Median :1980  
##  Mean   :0.613                                         Mean   :1979  
##  3rd Qu.:1.000                                         3rd Qu.:2002  
##  Max.   :3.000                                         Max.   :2010  
##                                                        NA's   :81    
##  GarageFinish         GarageCars     GarageArea    GarageQual       
##  Length:1460        Min.   :0.00   Min.   :   0   Length:1460       
##  Class :character   1st Qu.:1.00   1st Qu.: 334   Class :character  
##  Mode  :character   Median :2.00   Median : 480   Mode  :character  
##                     Mean   :1.77   Mean   : 473                     
##                     3rd Qu.:2.00   3rd Qu.: 576                     
##                     Max.   :4.00   Max.   :1418                     
##                                                                     
##   GarageCond         PavedDrive          WoodDeckSF   OpenPorchSF 
##  Length:1460        Length:1460        Min.   :  0   Min.   :  0  
##  Class :character   Class :character   1st Qu.:  0   1st Qu.:  0  
##  Mode  :character   Mode  :character   Median :  0   Median : 25  
##                                        Mean   : 94   Mean   : 47  
##                                        3rd Qu.:168   3rd Qu.: 68  
##                                        Max.   :857   Max.   :547  
##                                                                   
##  EnclosedPorch   3SsnPorch    ScreenPorch     PoolArea  
##  Min.   :  0   Min.   :  0   Min.   :  0   Min.   :  0  
##  1st Qu.:  0   1st Qu.:  0   1st Qu.:  0   1st Qu.:  0  
##  Median :  0   Median :  0   Median :  0   Median :  0  
##  Mean   : 22   Mean   :  3   Mean   : 15   Mean   :  3  
##  3rd Qu.:  0   3rd Qu.:  0   3rd Qu.:  0   3rd Qu.:  0  
##  Max.   :552   Max.   :508   Max.   :480   Max.   :738  
##                                                         
##     PoolQC             Fence           MiscFeature           MiscVal     
##  Length:1460        Length:1460        Length:1460        Min.   :    0  
##  Class :character   Class :character   Class :character   1st Qu.:    0  
##  Mode  :character   Mode  :character   Mode  :character   Median :    0  
##                                                           Mean   :   43  
##                                                           3rd Qu.:    0  
##                                                           Max.   :15500  
##                                                                          
##      MoSold          YrSold       SaleType         SaleCondition     
##  Min.   : 1.00   Min.   :2006   Length:1460        Length:1460       
##  1st Qu.: 5.00   1st Qu.:2007   Class :character   Class :character  
##  Median : 6.00   Median :2008   Mode  :character   Mode  :character  
##  Mean   : 6.32   Mean   :2008                                        
##  3rd Qu.: 8.00   3rd Qu.:2009                                        
##  Max.   :12.00   Max.   :2010                                        
##                                                                      
##    SalePrice     
##  Min.   : 34900  
##  1st Qu.:129975  
##  Median :163000  
##  Mean   :180921  
##  3rd Qu.:214000  
##  Max.   :755000  
##

Let’s take a closer look at the dependent variable: SalePrice. We learn from the information below that the variable is skewed to the right with an average SalePrice of 180,921 and a standard deviation of 79,442.

summary(train$SalePrice)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

describe(train$SalePrice)

##    vars    n   mean    sd median trimmed   mad   min    max  range skew
## X1    1 1460 180921 79442 163000  170783 56339 34900 755000 720100 1.88
##    kurtosis   se
## X1      6.5 2079

ggplot(train, aes(x=SalePrice)) +
  geom_histogram(color="black", fill="white") +
  scale_x_continuous(breaks=seq(0,max(train$SalePrice),100000), labels=function(x) format(x, big.mark = ",",scientific = FALSE)) +
  ggtitle('Dependent Variable: SalePrice')

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.

pairs(train[c('SalePrice','LotArea','GrLivArea','GarageArea')])

Derive a correlation matrix for any three quantitative variables in the dataset.

cor_matrix <- cor(train[c('LotArea','GrLivArea','GarageArea')])


	LotArea	GrLivArea	GarageArea
LotArea	1.000	0.263	0.180
GrLivArea	0.263	1.000	0.469
GarageArea	0.180	0.469	1.000

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

LotArea vs LivArea

cor.test(train$LotArea, train$GrLivArea, method = "pearson", conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  train$LotArea and train$GrLivArea
## t = 10, df = 1458, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.232 0.294
## sample estimates:
##   cor 
## 0.263

LotArea vs GarageArea

cor.test(train$LotArea, train$GarageArea, method = "pearson", conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  train$LotArea and train$GarageArea
## t = 7, df = 1458, p-value = 4e-12
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.148 0.213
## sample estimates:
##  cor 
## 0.18

LivArea vs GarageArea

cor.test(train$GrLivArea, train$GarageArea, method = "pearson", conf.level = 0.8)

## 
##  Pearson's product-moment correlation
## 
## data:  train$GrLivArea and train$GarageArea
## t = 20, df = 1458, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.442 0.495
## sample estimates:
##   cor 
## 0.469

The analysis above concludes that the tested variables are not independent of each other and that there are statistically significant correlations between the variables, although the correlations are not strong. I would be concerned with familywise errors as the three variables are connected: LotArea is in part composed of GarageArea (if there is one) and GrLivingArea.

Linear Algebra and Correlation.

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)

if(det(cor_matrix)>0) # Verify that the correlation matrix is invertible
  pres_matrix <- solve(cor_matrix)
pres_matrix

##            LotArea GrLivArea GarageArea
## LotArea     1.0792    -0.247    -0.0789
## GrLivArea  -0.2470     1.339    -0.5832
## GarageArea -0.0789    -0.583     1.2877

Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

round(cor_matrix %*% pres_matrix,0)

##            LotArea GrLivArea GarageArea
## LotArea          1         0          0
## GrLivArea        0         1          0
## GarageArea       0         0          1

round(pres_matrix %*% cor_matrix,0)

##            LotArea GrLivArea GarageArea
## LotArea          1         0          0
## GrLivArea        0         1          0
## GarageArea       0         0          1

lu_decomp <- lu.decomposition(as.matrix(cor_matrix))
L <- lu_decomp$L
L

##       [,1]  [,2] [,3]
## [1,] 1.000 0.000    0
## [2,] 0.263 1.000    0
## [3,] 0.180 0.453    1

U <- lu_decomp$U
U

##      [,1]  [,2]  [,3]
## [1,]    1 0.263 0.180
## [2,]    0 0.931 0.422
## [3,]    0 0.000 0.777

L %*% U

##       [,1]  [,2]  [,3]
## [1,] 1.000 0.263 0.180
## [2,] 0.263 1.000 0.469
## [3,] 0.180 0.469 1.000

Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data.

Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.

The GrLivArea variable was selected. We can see from the summary that the min value is not 0, so there is no need to shift.

gr_liv_area <- train$GrLivArea
summary(gr_liv_area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1130    1464    1515    1777    5642

Then load the MASS package and run fitdistr to fit an exponential probability density function.

Fit an exponential distribution to the data.

set.seed(605)
fit <- fitdistr(gr_liv_area, "exponential")

Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)).

lambda <- fit$estimate
lambda

##    rate 
## 0.00066

Sample the exponential distribution generated from our rate estimate.

sim_exp_dist <- rexp(1000, rate = lambda)

Plot a histogram and compare it with a histogram of your original variable.

df_1 <- data.frame(obs = gr_liv_area)
df_1$data<- 'empirical'
df_2 <- data.frame(obs = sim_exp_dist)
df_2$data <- 'fit'
df <- rbind(df_1, df_2)

ggplot(data=df, aes(obs, fill=data)) + 
  geom_density(alpha = 0.4) +
  ggtitle('Comparing Empirical and Fit Data')

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

qexp(c(.05,0.95), rate=fit$estimate)

## [1]   77.7 4539.9

Also generate a 95% confidence interval from the empirical data, assuming normality.

mu <- mean(gr_liv_area)
sd <- sd(gr_liv_area)
n <- length(gr_liv_area)
me <- qnorm(.95)*sd/sqrt(n)
ci <- c(mu-me,mu+me)
ci

## [1] 1493 1538

Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

quantile(gr_liv_area, c(0.05,0.95))

##   5%  95% 
##  848 2466

The comparative histogram and the mismatched percentiles shows that the exponential distribution is not a good fit for our data. However, we note that the mean of the exponential distribution which is the reciprocal of the rate we estimated above, does fit within the confidence interval for the mean. Aalso note that we usually refer to lambda as the mean of the exponential distribution and 1/lambda as the rate.

1/lambda

## rate 
## 1515

Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

Data Exploration

colnames(train)

##  [1] "Id"            "MSSubClass"    "MSZoning"      "LotFrontage"  
##  [5] "LotArea"       "Street"        "Alley"         "LotShape"     
##  [9] "LandContour"   "Utilities"     "LotConfig"     "LandSlope"    
## [13] "Neighborhood"  "Condition1"    "Condition2"    "BldgType"     
## [17] "HouseStyle"    "OverallQual"   "OverallCond"   "YearBuilt"    
## [21] "YearRemodAdd"  "RoofStyle"     "RoofMatl"      "Exterior1st"  
## [25] "Exterior2nd"   "MasVnrType"    "MasVnrArea"    "ExterQual"    
## [29] "ExterCond"     "Foundation"    "BsmtQual"      "BsmtCond"     
## [33] "BsmtExposure"  "BsmtFinType1"  "BsmtFinSF1"    "BsmtFinType2" 
## [37] "BsmtFinSF2"    "BsmtUnfSF"     "TotalBsmtSF"   "Heating"      
## [41] "HeatingQC"     "CentralAir"    "Electrical"    "1stFlrSF"     
## [45] "2ndFlrSF"      "LowQualFinSF"  "GrLivArea"     "BsmtFullBath" 
## [49] "BsmtHalfBath"  "FullBath"      "HalfBath"      "BedroomAbvGr" 
## [53] "KitchenAbvGr"  "KitchenQual"   "TotRmsAbvGrd"  "Functional"   
## [57] "Fireplaces"    "FireplaceQu"   "GarageType"    "GarageYrBlt"  
## [61] "GarageFinish"  "GarageCars"    "GarageArea"    "GarageQual"   
## [65] "GarageCond"    "PavedDrive"    "WoodDeckSF"    "OpenPorchSF"  
## [69] "EnclosedPorch" "3SsnPorch"     "ScreenPorch"   "PoolArea"     
## [73] "PoolQC"        "Fence"         "MiscFeature"   "MiscVal"      
## [77] "MoSold"        "YrSold"        "SaleType"      "SaleCondition"
## [81] "SalePrice"

Below we identify the features with missing values. Studying the variables names gives us idea of how we might deal with the NA values. For example, NA values for LotFrontage simply mean that the home is within the permiter of the property line and NA values can be replaced by zero. The feature Alley could be changed to a binary variable (alley, no alley). However, these are probably not the most important features so we decide to drop all of them. In general, we can remove the feature entirely if more than 15% of its observations are NA.

NAs <- train %>% 
  summarise_all(funs(sum(is.na(.)))) %>% 
  gather(key='Feature', value='num_NA') %>% 
  mutate(prop = num_NA/dim(train)[1]) %>%
  filter(num_NA>0) %>%
  arrange(desc(prop))

Features with NAs
Feature	num_NA	prop
PoolQC	1453	0.995
MiscFeature	1406	0.963
Alley	1369	0.938
Fence	1179	0.808
FireplaceQu	690	0.473
LotFrontage	259	0.177
GarageType	81	0.055
GarageYrBlt	81	0.055
GarageFinish	81	0.055
GarageQual	81	0.055
GarageCond	81	0.055
BsmtExposure	38	0.026
BsmtFinType2	38	0.026
BsmtQual	37	0.025
BsmtCond	37	0.025
BsmtFinType1	37	0.025
MasVnrType	8	0.005
MasVnrArea	8	0.005
Electrical	1	0.001

Studying the correlation matrix and histogram for quantitative values helps us select features of interest. From the most correlated variables, we select the following: OverallQual, TotalBsmtSF, GrLivArea, GarageArea, GarageCars. We also select LotArea and FullBath, which are more moderately correlated but intuitively have some importance.

ggcorr(train)

pairs.panels(train[c('SalePrice','LotArea','OverallQual','OverallCond','GrLivArea','FullBath', 'GarageArea')], 
             method = "pearson", hist.col = "#00AFBB",density = TRUE, ellipses = TRUE)

We proceed with the selection of interesting features and include a qualitative variable ’ExterQual that may bear some weight:

selected_features <- c('Id','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','GarageCars','LotArea','FullBath', 'ExterQual')
train_slim <- train %>% dplyr::select(selected_features)
train_slim$SalePrice <- train$SalePrice

Studying the scatterplot matrix of the selected variables one more time, we can identify interesting outliers in the LotArea, GrLivArea and TotalBsmtAreaSF.

pairs(train_slim[,c(-1,-9)])

ggplot(train_slim, aes(LotArea, SalePrice)) +
  geom_point(alpha=0.2) +
  ggtitle('SalePrice vs LotArea')

Upon closer look at LotArea, the outliers are significant and can perhaps be associated with much larger lots than differ from the typical home market. Given that the correlation between LotArea and SalePrice was actually quite low (0.26) we throw out this variable altogether and update our selected_features vector for future uses.

train_slim <- train_slim %>% dplyr::select(-LotArea)
selected_features <- c('Id','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','GarageCars','FullBath','ExterQual')

ggplot(train_slim, aes(GrLivArea, SalePrice)) +
  geom_point(alpha=0.2) +
  ggtitle('SalePrice vs GrLivArea')

The picture is clearer for GrLivArea, but the two rightmost point do not make sense as the observations appear to have too low a price for such area. We will throw out those two points by refering to their index.

ggplot(train_slim, aes(TotalBsmtSF, SalePrice)) +
  geom_point(alpha=0.2) +
  ggtitle('SalePrice vs TotalBsmtSF')

There is one significant outlier nearly double the area of the closer observation. We remove this value as well (in fact it is the same Id=1299 that we decided to remove previously). Notice as well that there are a number of observations with TotalBsmtSF = 0. This is also the case for the GarageArea variable. We create two binary categorical variables to account for that:

train_slim <- train_slim %>% filter(Id != 1299)
train_slim <- train_slim %>% filter(Id != 524)
train_slim <- train_slim %>% mutate(HasBasement = ifelse(TotalBsmtSF > 0, 1, 0))
train_slim <- train_slim %>% mutate(HasGarage = ifelse(GarageArea > 0, 1, 0))

The box plot below reveals that there are quite a few outliers when studying the ExterQual variable. However, we will limit our outlier elimination to the few observations above.

ggplot(train, aes(x=ExterQual, y=SalePrice)) +
  geom_boxplot(outlier.colour="red", outlier.shape=1,
                outlier.size=4) +
  scale_y_continuous(breaks=seq(0,max(train$SalePrice),100000), labels=function(x) format(x, big.mark = ",",scientific = FALSE)) +
  ggtitle('SalePrice vs ExterQual')

Generate dummy variables for the ExterQual feature.

train_slim <- fastDummies::dummy_cols(train_slim, select_columns = "ExterQual")
train_slim <- train_slim %>% dplyr::select(-ExterQual)
train_slim <- train_slim %>% mutate_at(c(10,11,12,13), function(x) as.numeric(x))

Modeling

Train the model with the selected features and update it using backwards elimination until we are left with the most statistically significant features.

selected.lm <- lm(SalePrice ~ ., data=train_slim[,-1])
modified.lm <- update(selected.lm, .~. - ExterQual_TA, data=train_slim[-1])
modified.lm <- update(modified.lm, .~. - ExterQual_Fa, data=train_slim[-1])
modified.lm <- update(modified.lm, .~. - FullBath, data=train_slim[-1])
modified.lm <- update(modified.lm, .~. - HasGarage, data=train_slim[-1])
modified.lm <- update(modified.lm, .~. - GarageCars, data=train_slim[-1])
summary(modified.lm)

## 
## Call:
## lm(formula = SalePrice ~ OverallQual + TotalBsmtSF + GrLivArea + 
##     GarageArea + HasBasement + ExterQual_Ex + ExterQual_Gd, data = train_slim[-1])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -144863  -18366      77   16840  256435 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -53194.19    6830.36   -7.79  1.3e-14 ***
## OverallQual   16917.21    1059.46   15.97  < 2e-16 ***
## TotalBsmtSF      44.54       2.81   15.82  < 2e-16 ***
## GrLivArea        53.64       2.16   24.86  < 2e-16 ***
## GarageArea       45.94       5.19    8.86  < 2e-16 ***
## HasBasement  -27162.97    6138.79   -4.42  1.0e-05 ***
## ExterQual_Ex  78836.42    5816.29   13.55  < 2e-16 ***
## ExterQual_Gd  15580.21    2471.59    6.30  3.8e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 32800 on 1450 degrees of freedom
## Multiple R-squared:  0.83,   Adjusted R-squared:  0.83 
## F-statistic: 1.01e+03 on 7 and 1450 DF,  p-value: <2e-16

The model results are decent, with an adjusted R-squared of 83%. However, we must verify the assumptions for linear regression by studying the residuals.

Residual Analysis & Assumptions

While the residuals are mostly evenly distributed between the 1st and 3rd quartiles, there is significant skew as revealed by the min and max values. Visually, we can identify a outliers outliers influencial in the residual which makes us cautious about accepting this model.

summary(selected.lm$residuals)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -142358  -17494     -51       0   17367  254937

plot(fitted(selected.lm),resid(selected.lm))

When testing for normality, both the significant deviation at the tail ends of the plot below and the result of the Shapiro-Wilk test (H0: normal, HA: not normal) causes us to further question the validity of this model. The density plot comparison below clearly shows the deviation from normal.

shapiro.test(selected.lm$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  selected.lm$residuals
## W = 0.9, p-value <2e-16

qqnorm(resid(selected.lm))
qqline(resid(selected.lm))

res <- selected.lm$residuals
df_1 <- data.frame(residuals = res)
df_1$data<- 'model'
df_2 <- data.frame(residuals = rnorm(length(res),mean(res),sd(res)))
df_2$data <- 'normal'
df <- rbind(df_1, df_2)

ggplot(data=df, aes(residuals, fill=data)) + 
  geom_density(alpha = 0.4) +
  ggtitle('Residual Comparison')

Despite a potentially invalid model, we try to predict the SalePrice of the test data using the model. As was done for the training data, we read the data, extract the selected features from the test data, deal with any missing values and add the categorical variables.

Prediction

test <- read_csv('https://raw.githubusercontent.com/maelillien/data/master/kaggle_house_data/test.csv', col_names = TRUE)

test_slim <- test %>% dplyr::select(selected_features)
test_slim[is.na(test_slim)] <- 0
test_slim <- test_slim %>% mutate(HasGarage = ifelse(GarageArea > 0, 1, 0))
test_slim <- test_slim %>% mutate(HasBasement = ifelse(TotalBsmtSF > 0, 1, 0))
test_slim <- fastDummies::dummy_cols(test_slim, select_columns = "ExterQual")
test_slim <- test_slim %>% dplyr::select(-ExterQual)

pred <- predict(selected.lm, newdata=test_slim[,-1])

Prepare the output for submission:

submission <- data.frame(Id = test_slim$Id, SalePrice = pred)
write.csv(submission, file = "submission.csv", row.names = FALSE)

Results and Conclusion

My Kaggle username is maelillien and my best prediction score is 0.21558 which is not great compared to the leaderboard, but it is a 795 spot improvement from my initial attempt which had a score of 0.93893. The current model score is 0.22516 which is slightly worse than the best attempt but this current model shows more feature engineering. The first attempt did not include the qualitative variables.

While the model yields decent prediction results, we saw that the underlying assumptions of the linear model are not fully met. To correct that and improve the model, there are a number of methods to consider:

Eliminate more outliers
Cross validation with hold out data
Transformations, including log which could help normalize some quantitative variables.
Preprocessing with centering and scaling
Incorporating more qualitative features
Use alternate regression methods

DATA 605 - Final

Mael Illien

5/11/2020

Problem 1

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of μ=σ=(N+1)/2.

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

a. P(X>x | X>y)

b. P(X>x, Y>y)

c. P(X<x | X>y)

Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

Problem 2

Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for the training data set.

Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.

Derive a correlation matrix for any three quantitative variables in the dataset.

Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

LotArea vs LivArea

LotArea vs GarageArea

LivArea vs GarageArea

Linear Algebra and Correlation.

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)

Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

Calculus-Based Probability & Statistics

Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary.

Then load the MASS package and run fitdistr to fit an exponential probability density function.

Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)).

Plot a histogram and compare it with a histogram of your original variable.

Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF).

Also generate a 95% confidence interval from the empirical data, assuming normality.

Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

Data Exploration

Modeling

Residual Analysis & Assumptions

Prediction

Results and Conclusion