Problem 1.

Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with a mean of \(\mu = \sigma = (N+1)/2\).

Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

5 points a. P(X>x | X>y) b. P(X>x, Y>y) c. P(Xy)

Probabilities
X <- runif(10000, min = 1, max = 6)
Y <- rnorm(10000, mean = ((6+1)/2), sd = ((6+1)/2))

x = mean(X)
y = quantile(Y,.25)

# P(X>x)
Xgrx <- sum(X>x)/length(X)
# P(X<x)
Xlsx <- sum(X<x)/length(X)
# P(X>y)
Xgry <- sum(X>y)/length(X)
# P(Y>y)
Ygry <- sum(Y>y)/length(Y)

\[ Conditional \space Probability \space P(B|A) = \frac{P(A \space and \space B)}{P(A)} \]

\[ a. P(X>x | X>y) \]

\[ P(Xgrx|Xgry) = \frac{Xgry,Xgrx}{Xgry}\]

Xgrx_pipe_Xgry <- sum((X>y) & (X>x))/length(X)
a <- Xgrx_pipe_Xgry/Xgry
a
## [1] 0.5129095

\[ b. P(X>x, Y>y) \]

\[ P(Xgrx,Ygry) = P(Xgrx \space and \space Ygry)\]

Xgrx_and_Ygry <- sum((X>y) & (X>x))/length(X)
b <- Xgrx_and_Ygry
b
## [1] 0.5026

\[ c. P(X<x | X>y) \]

\[ P(Xlsx|Xgry) = \frac{Xgry,Xlsx}{Xgry}\]

Xlsx_pipe_Xgry <- sum((X>y) & (X<x))/length(X)
c <- Xlsx_pipe_Xgry/Xgry
c
## [1] 0.4870905

5 points. Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.

# P(X>x)
Xgrx <- sum(X>x)/length(X)
# P(Y>y)
Ygry <- sum(Y>y)/length(Y)

Xgrx_and_Ygry1 <- sum((X>x) & (Y>y))/length(X)
Xgrx_and_Ygry1
## [1] 0.3809
Xgrx_and_Ygry2 <- Xgrx*Ygry
Xgrx_and_Ygry2
## [1] 0.37695
identical(round(Xgrx_and_Ygry1,2), round(Xgrx_and_Ygry2,2))
## [1] TRUE

\[ P(X>x \space and \space Y>y)\space != \space P(X>x)P(Y>y) \]

Probabilities (rounded) are identical, therefore it can be concluded that that (X>x) and (Y>y) are independent

5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

$ H_0: $ (X>x) and (Y>y) are independent

$ H_A: $ (X>x) and (Y>y) are dependent

ChiSquare Test
cont_tbl <- table((X>x), (Y>y), dnn=list("X>x","Y>y"))
cont_tbl
##        Y>y
## X>x     FALSE TRUE
##   FALSE  1283 3691
##   TRUE   1217 3809
chisq.test(cont_tbl, correct = FALSE)
## 
##  Pearson's Chi-squared test
## 
## data:  cont_tbl
## X-squared = 3.3286, df = 1, p-value = 0.06808

p-value is considerable larger than the pre-defined significance level (0.05) so the Null Hypothesis cannot be rejected, therefore they are independent

Fisher’s Exact Test
cont_tbl <- table((X>x), (Y>y), dnn=list("X>x","Y>y"))
fisher.test(cont_tbl, conf.int = TRUE, conf.level = 0.99)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  cont_tbl
## p-value = 0.07163
## alternative hypothesis: true odds ratio is not equal to 1
## 99 percent confidence interval:
##  0.9648502 1.2267838
## sample estimates:
## odds ratio 
##   1.087925

p-value is considerable larger than the pre-defined significance level (0.05), very similar to the ChiSquare Test result, so the Null Hypothesis cannot be rejected, therefore they are independent

Problem 2.

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

Descriptive and Inferential Statistics
data <- read.csv('C:/DATA/HHP/Personal/Degrees/Ms. Data Science (CUNY)/2nd Semester/605 - Foundations of Computational Math/Assignments/Final Project/house-prices-advanced-regression-techniques/train.csv', header = TRUE)

nrow(data) # Total observations
## [1] 1460
ncol(data) # Total features
## [1] 81
# Subsetting the data set into the potential most relevant features

head(data[c(5, 19, 21, 47, 50, 62, 81)])
##   LotArea OverallCond YearRemodAdd GrLivArea FullBath GarageCars SalePrice
## 1    8450           5         2003      1710        2          2    208500
## 2    9600           8         1976      1262        2          2    181500
## 3   11250           5         2002      1786        2          2    223500
## 4    9550           5         1970      1717        1          3    140000
## 5   14260           5         2000      2198        2          3    250000
## 6   14115           5         1995      1362        1          2    143000
summary(data[c(5, 19, 21, 47, 50, 62, 81)])
##     LotArea        OverallCond     YearRemodAdd    GrLivArea   
##  Min.   :  1300   Min.   :1.000   Min.   :1950   Min.   : 334  
##  1st Qu.:  7554   1st Qu.:5.000   1st Qu.:1967   1st Qu.:1130  
##  Median :  9478   Median :5.000   Median :1994   Median :1464  
##  Mean   : 10517   Mean   :5.575   Mean   :1985   Mean   :1515  
##  3rd Qu.: 11602   3rd Qu.:6.000   3rd Qu.:2004   3rd Qu.:1777  
##  Max.   :215245   Max.   :9.000   Max.   :2010   Max.   :5642  
##     FullBath       GarageCars      SalePrice     
##  Min.   :0.000   Min.   :0.000   Min.   : 34900  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:129975  
##  Median :2.000   Median :2.000   Median :163000  
##  Mean   :1.565   Mean   :1.767   Mean   :180921  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:214000  
##  Max.   :3.000   Max.   :4.000   Max.   :755000
hist(data$LotArea)

#hist(data$FullBath)
hist(data$GrLivArea)

hist(data$OverallCond)

hist(data$YearRemodAdd)

hist(data$GarageCars)

# Scatterplot Matrix
pairs(data[c(5, 19, 21, 47, 50, 62, 81)])

Y <- data[81]
X <- data[c(2,5,18:21,35, 37:39,44:53,55,57,62:63,67:72, 76:78)] #just numerical features

# Correlation Matrix
corr_features <- data.frame(cor(X,Y))
corr_features
##                 SalePrice
## MSSubClass    -0.08428414
## LotArea        0.26384335
## OverallQual    0.79098160
## OverallCond   -0.07785589
## YearBuilt      0.52289733
## YearRemodAdd   0.50710097
## BsmtFinSF1     0.38641981
## BsmtFinSF2    -0.01137812
## BsmtUnfSF      0.21447911
## TotalBsmtSF    0.61358055
## X1stFlrSF      0.60585218
## X2ndFlrSF      0.31933380
## LowQualFinSF  -0.02560613
## GrLivArea      0.70862448
## BsmtFullBath   0.22712223
## BsmtHalfBath  -0.01684415
## FullBath       0.56066376
## HalfBath       0.28410768
## BedroomAbvGr   0.16821315
## KitchenAbvGr  -0.13590737
## TotRmsAbvGrd   0.53372316
## Fireplaces     0.46692884
## GarageCars     0.64040920
## GarageArea     0.62343144
## WoodDeckSF     0.32441344
## OpenPorchSF    0.31585623
## EnclosedPorch -0.12857796
## X3SsnPorch     0.04458367
## ScreenPorch    0.11144657
## PoolArea       0.09240355
## MiscVal       -0.02118958
## MoSold         0.04643225
## YrSold        -0.02892259
# Top 5 features correlated to SalePrice target variable
ordering <- data.frame(order(corr_features$SalePrice, decreasing = T))
row_n <- rownames(corr_features)
row_n[ordering[1:5,]]
## [1] "OverallQual" "GrLivArea"   "GarageCars"  "GarageArea"  "TotalBsmtSF"

Hypothesis testing will be done on the top 2 features correlated with SalesPrice: OverallQual and GrLivArea, At 80% confidence interval

$ H_0: $ Pairwise correlation = 0 (SalePrice & OverallQual/GrLivArea) $ H_A: $ Pairwise correlation != 0 (SalePrice & OverallQual/GrLivArea)

Y <- data$SalePrice
X1 <- data$OverallQual
X2 <- data$GrLivArea

# Variables normal distribution
#qqnorm(X1)
#qqline(X1)
#qqnorm(X2)
#qqline(X2)

X1_CorrTest <- cor.test(X1,Y, method = "pearson", conf.level=.8)
X1_CorrTest
## 
##  Pearson's product-moment correlation
## 
## data:  X1 and Y
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.7780752 0.8032204
## sample estimates:
##       cor 
## 0.7909816
X2_CorrTest <- cor.test(X2,Y, method = "pearson", conf.level=.8)
X2_CorrTest
## 
##  Pearson's product-moment correlation
## 
## data:  X2 and Y
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6915087 0.7249450
## sample estimates:
##       cor 
## 0.7086245

Both OverallQual/GrLivArea and SalePrice are significatly possitively correlated with coefficients 0.79/0.70. p-values are extremely low (2.2e-16) which indicates that the null hypothesis can be rejected

Linear Algebra and Correlation

5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

Y <- data$SalePrice
X <- data$GrLivArea

# Correlation Matrix
XY <- data.frame(X,Y)
corr_matrix <- cor(XY)
corr_matrix
##           X         Y
## X 1.0000000 0.7086245
## Y 0.7086245 1.0000000
#Precision Matrix
prec_matrix <- solve(corr_matrix)
prec_matrix
##           X         Y
## X  2.008632 -1.423366
## Y -1.423366  2.008632
#Matrices multiplication

corr_matrix %*% prec_matrix
##   X Y
## X 1 0
## Y 0 1
prec_matrix %*% corr_matrix
##   X Y
## X 1 0
## Y 0 1

Both produce the Identity Matrix \[ AA_{-1} = A_{-1}A = I \]

# LU Decomposition

#install.packages("matrixcalc")
library(matrixcalc)
LU <- lu.decomposition(corr_matrix)
L <- LU$L
U <- LU$U
L
##           [,1] [,2]
## [1,] 1.0000000    0
## [2,] 0.7086245    1
U
##      [,1]      [,2]
## [1,]    1 0.7086245
## [2,]    0 0.4978513
L%*%U
##           [,1]      [,2]
## [1,] 1.0000000 0.7086245
## [2,] 0.7086245 1.0000000
corr_matrix
##           X         Y
## X 1.0000000 0.7086245
## Y 0.7086245 1.0000000
Calculus-Based Probability & Statistics

5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function.
(See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).
Find the optimal value of ??? for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, ???)). Plot a histogram and compare it with a histogram of your original variable.
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

hist(data$LotArea)

min(data$LotArea) #min is above zero
## [1] 1300
library(MASS)
X <- data$LotArea
edf <- fitdistr(X, densfun="exponential")

#lambda represents the increase in the living area
lambda <- edf$estimate

exp_sample <- rexp(1000, lambda)
hist(exp_sample, col='dark green', main="Exponential Fit of X")

summary(exp_sample)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    21.55  3263.62  7734.02 10819.04 15364.62 94470.00
hist(X, col='blue', main="X")

summary(X)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7554    9478   10517   11602  215245
#CDF
# 5th and 95th percentiles
qexp(c(.05, .95), rate = lambda)
## [1]   539.4428 31505.6013
# 95% confidence interval from the empirical data, assuming normality
qnorm(c(.025, .975), mean=mean(X), sd=sd(X))
## [1] -9046.092 30079.748
# empirical 5th percentile and 95th percentile
quantile(X, c(.05, .95))
##       5%      95% 
##  3311.70 17401.15
Modeling

10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

#Loading data sets

train <- read.csv('C:/DATA/HHP/Personal/Degrees/Ms. Data Science (CUNY)/2nd Semester/605 - Foundations of Computational Math/Assignments/Final Project/house-prices-advanced-regression-techniques/train.csv', header = TRUE)

test <- read.csv('C:/DATA/HHP/Personal/Degrees/Ms. Data Science (CUNY)/2nd Semester/605 - Foundations of Computational Math/Assignments/Final Project/house-prices-advanced-regression-techniques/test.csv', header = TRUE)

#install.packages("dplyr")
library(dplyr)

#identify nature of the features, numeric and categorical
num <- select_if(train, is.numeric)

# Focus on numeric features

X_tmp <- train[c(colnames(num))]

#explore statistics for the model features, identify NAs
summary(X_tmp)
##        Id           MSSubClass     LotFrontage        LotArea      
##  Min.   :   1.0   Min.   : 20.0   Min.   : 21.00   Min.   :  1300  
##  1st Qu.: 365.8   1st Qu.: 20.0   1st Qu.: 59.00   1st Qu.:  7554  
##  Median : 730.5   Median : 50.0   Median : 69.00   Median :  9478  
##  Mean   : 730.5   Mean   : 56.9   Mean   : 70.05   Mean   : 10517  
##  3rd Qu.:1095.2   3rd Qu.: 70.0   3rd Qu.: 80.00   3rd Qu.: 11602  
##  Max.   :1460.0   Max.   :190.0   Max.   :313.00   Max.   :215245  
##                                   NA's   :259                      
##   OverallQual      OverallCond      YearBuilt     YearRemodAdd 
##  Min.   : 1.000   Min.   :1.000   Min.   :1872   Min.   :1950  
##  1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954   1st Qu.:1967  
##  Median : 6.000   Median :5.000   Median :1973   Median :1994  
##  Mean   : 6.099   Mean   :5.575   Mean   :1971   Mean   :1985  
##  3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000   3rd Qu.:2004  
##  Max.   :10.000   Max.   :9.000   Max.   :2010   Max.   :2010  
##                                                                
##    MasVnrArea       BsmtFinSF1       BsmtFinSF2        BsmtUnfSF     
##  Min.   :   0.0   Min.   :   0.0   Min.   :   0.00   Min.   :   0.0  
##  1st Qu.:   0.0   1st Qu.:   0.0   1st Qu.:   0.00   1st Qu.: 223.0  
##  Median :   0.0   Median : 383.5   Median :   0.00   Median : 477.5  
##  Mean   : 103.7   Mean   : 443.6   Mean   :  46.55   Mean   : 567.2  
##  3rd Qu.: 166.0   3rd Qu.: 712.2   3rd Qu.:   0.00   3rd Qu.: 808.0  
##  Max.   :1600.0   Max.   :5644.0   Max.   :1474.00   Max.   :2336.0  
##  NA's   :8                                                           
##   TotalBsmtSF       X1stFlrSF      X2ndFlrSF     LowQualFinSF    
##  Min.   :   0.0   Min.   : 334   Min.   :   0   Min.   :  0.000  
##  1st Qu.: 795.8   1st Qu.: 882   1st Qu.:   0   1st Qu.:  0.000  
##  Median : 991.5   Median :1087   Median :   0   Median :  0.000  
##  Mean   :1057.4   Mean   :1163   Mean   : 347   Mean   :  5.845  
##  3rd Qu.:1298.2   3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:  0.000  
##  Max.   :6110.0   Max.   :4692   Max.   :2065   Max.   :572.000  
##                                                                  
##    GrLivArea     BsmtFullBath     BsmtHalfBath        FullBath    
##  Min.   : 334   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:1130   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000  
##  Median :1464   Median :0.0000   Median :0.00000   Median :2.000  
##  Mean   :1515   Mean   :0.4253   Mean   :0.05753   Mean   :1.565  
##  3rd Qu.:1777   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000  
##  Max.   :5642   Max.   :3.0000   Max.   :2.00000   Max.   :3.000  
##                                                                   
##     HalfBath       BedroomAbvGr    KitchenAbvGr    TotRmsAbvGrd   
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Min.   : 2.000  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.: 5.000  
##  Median :0.0000   Median :3.000   Median :1.000   Median : 6.000  
##  Mean   :0.3829   Mean   :2.866   Mean   :1.047   Mean   : 6.518  
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.: 7.000  
##  Max.   :2.0000   Max.   :8.000   Max.   :3.000   Max.   :14.000  
##                                                                   
##    Fireplaces     GarageYrBlt     GarageCars      GarageArea    
##  Min.   :0.000   Min.   :1900   Min.   :0.000   Min.   :   0.0  
##  1st Qu.:0.000   1st Qu.:1961   1st Qu.:1.000   1st Qu.: 334.5  
##  Median :1.000   Median :1980   Median :2.000   Median : 480.0  
##  Mean   :0.613   Mean   :1979   Mean   :1.767   Mean   : 473.0  
##  3rd Qu.:1.000   3rd Qu.:2002   3rd Qu.:2.000   3rd Qu.: 576.0  
##  Max.   :3.000   Max.   :2010   Max.   :4.000   Max.   :1418.0  
##                  NA's   :81                                     
##    WoodDeckSF      OpenPorchSF     EnclosedPorch      X3SsnPorch    
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00  
##  Median :  0.00   Median : 25.00   Median :  0.00   Median :  0.00  
##  Mean   : 94.24   Mean   : 46.66   Mean   : 21.95   Mean   :  3.41  
##  3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00   3rd Qu.:  0.00  
##  Max.   :857.00   Max.   :547.00   Max.   :552.00   Max.   :508.00  
##                                                                     
##   ScreenPorch        PoolArea          MiscVal             MoSold      
##  Min.   :  0.00   Min.   :  0.000   Min.   :    0.00   Min.   : 1.000  
##  1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.:    0.00   1st Qu.: 5.000  
##  Median :  0.00   Median :  0.000   Median :    0.00   Median : 6.000  
##  Mean   : 15.06   Mean   :  2.759   Mean   :   43.49   Mean   : 6.322  
##  3rd Qu.:  0.00   3rd Qu.:  0.000   3rd Qu.:    0.00   3rd Qu.: 8.000  
##  Max.   :480.00   Max.   :738.000   Max.   :15500.00   Max.   :12.000  
##                                                                        
##      YrSold       SalePrice     
##  Min.   :2006   Min.   : 34900  
##  1st Qu.:2007   1st Qu.:129975  
##  Median :2008   Median :163000  
##  Mean   :2008   Mean   :180921  
##  3rd Qu.:2009   3rd Qu.:214000  
##  Max.   :2010   Max.   :755000  
## 
#eliminate records with NAs (especifically columns: LotFrontage, MasVnrArea, GarageYrBlt)
X <- na.omit(X_tmp)
Y <- X[38] # target column (SalePrice)

# Identifying features with highest correlation to SalePrice - Top 15

corr_features <- data.frame(cor(X,Y))
ordering <- data.frame(order(corr_features$SalePrice, decreasing = T))
row_n <- rownames(corr_features)
top15 <- row_n[ordering[1:16,]]

X_top15 <- train[c(top15)]

#Fitting Model - 1st Try

lm_model1 <- lm(SalePrice ~., data=X_top15)
summary(lm_model1)
## 
## Call:
## lm(formula = SalePrice ~ ., data = X_top15)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -511286  -18388   -1831   15823  276929 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.066e+06  1.627e+05  -6.553 8.65e-11 ***
## OverallQual   2.093e+04  1.476e+03  14.176  < 2e-16 ***
## GrLivArea     3.587e+01  5.150e+00   6.964 5.66e-12 ***
## GarageCars    1.533e+04  3.596e+03   4.263 2.19e-05 ***
## GarageArea    1.388e+01  1.263e+01   1.099 0.271793    
## TotalBsmtSF   1.107e+01  5.373e+00   2.061 0.039523 *  
## X1stFlrSF     3.338e+00  6.089e+00   0.548 0.583658    
## FullBath     -1.321e+03  3.249e+03  -0.407 0.684429    
## TotRmsAbvGrd  2.261e+03  1.358e+03   1.664 0.096308 .  
## YearBuilt     1.639e+02  7.682e+01   2.134 0.033064 *  
## YearRemodAdd  4.074e+02  7.898e+01   5.159 2.95e-07 ***
## GarageYrBlt  -7.564e+01  9.385e+01  -0.806 0.420441    
## MasVnrArea    2.897e+01  7.273e+00   3.984 7.23e-05 ***
## Fireplaces    8.333e+03  2.230e+03   3.737 0.000196 ***
## BsmtFinSF1    1.995e+01  3.098e+00   6.438 1.80e-10 ***
## LotFrontage   3.673e+01  5.680e+01   0.647 0.517986    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38880 on 1105 degrees of freedom
##   (339 observations deleted due to missingness)
## Multiple R-squared:  0.7835, Adjusted R-squared:  0.7806 
## F-statistic: 266.6 on 15 and 1105 DF,  p-value: < 2.2e-16
#discarding features that do not contribute to model accuracy/performance (Revised total of 9 features)

top_top9 <- top15[-c(5,7:9,12,16)]
X_top_top9 <- train[c(top_top9)]

#Fitting Model - 2nd Try

lm_model2 <- lm(SalePrice ~., data=X_top_top9)
summary(lm_model2)
## 
## Call:
## lm(formula = SalePrice ~ ., data = X_top_top9)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -521279  -17668   -1949   14962  272264 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.113e+06  1.182e+05  -9.409  < 2e-16 ***
## OverallQual   1.808e+04  1.167e+03  15.485  < 2e-16 ***
## GrLivArea     4.522e+01  2.576e+00  17.557  < 2e-16 ***
## GarageCars    1.329e+04  1.748e+03   7.602 5.23e-14 ***
## TotalBsmtSF   1.773e+01  3.047e+00   5.818 7.32e-09 ***
## YearBuilt     1.655e+02  4.685e+01   3.532 0.000425 ***
## YearRemodAdd  3.677e+02  6.184e+01   5.946 3.44e-09 ***
## MasVnrArea    2.933e+01  6.157e+00   4.764 2.09e-06 ***
## Fireplaces    8.593e+03  1.750e+03   4.911 1.01e-06 ***
## BsmtFinSF1    1.830e+01  2.524e+00   7.248 6.85e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36580 on 1442 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.7885, Adjusted R-squared:  0.7872 
## F-statistic: 597.4 on 9 and 1442 DF,  p-value: < 2.2e-16
AIC(lm_model2)
## [1] 34645.26
# Prediction Error Rate

pred_err_rate <- sigma(lm_model2)/mean(X_top_top9$SalePrice)
pred_err_rate
## [1] 0.2021614
#prediction <- predict(lm_model2)

#plot(lm_model)

# Residuals Analysis

plot(lm_model2$fitted.values, lm_model2$residuals, xlab='Fitted Values', ylab='Residuals')
abline(0,0, col="blue")

qqnorm(lm_model2$residuals)
qqline(lm_model2$residuals, col="red")

#Identifying features with significant outliers

summary(X_top_top9[,2:9])
##   OverallQual       GrLivArea      GarageCars     TotalBsmtSF    
##  Min.   : 1.000   Min.   : 334   Min.   :0.000   Min.   :   0.0  
##  1st Qu.: 5.000   1st Qu.:1130   1st Qu.:1.000   1st Qu.: 795.8  
##  Median : 6.000   Median :1464   Median :2.000   Median : 991.5  
##  Mean   : 6.099   Mean   :1515   Mean   :1.767   Mean   :1057.4  
##  3rd Qu.: 7.000   3rd Qu.:1777   3rd Qu.:2.000   3rd Qu.:1298.2  
##  Max.   :10.000   Max.   :5642   Max.   :4.000   Max.   :6110.0  
##                                                                  
##    YearBuilt     YearRemodAdd    MasVnrArea       Fireplaces   
##  Min.   :1872   Min.   :1950   Min.   :   0.0   Min.   :0.000  
##  1st Qu.:1954   1st Qu.:1967   1st Qu.:   0.0   1st Qu.:0.000  
##  Median :1973   Median :1994   Median :   0.0   Median :1.000  
##  Mean   :1971   Mean   :1985   Mean   : 103.7   Mean   :0.613  
##  3rd Qu.:2000   3rd Qu.:2004   3rd Qu.: 166.0   3rd Qu.:1.000  
##  Max.   :2010   Max.   :2010   Max.   :1600.0   Max.   :3.000  
##                                NA's   :8
boxplot(X_top_top9[,c(3,5,8)])

#calculating Cook's Distance - Influence of outliers in the predicted outcome.

cooksd <- cooks.distance(lm_model2)

# influential row numbers - cook's distance greater than 4 times the mean

influential_rows <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd))])

X_top_top9[influential_rows, ]
##      SalePrice OverallQual GrLivArea GarageCars TotalBsmtSF YearBuilt
## 179     501837           9      2234          3        2216      2008
## 186     475000          10      3608          3        1107      1892
## 441     555000          10      2402          3        3094      2008
## 524     184750          10      4676          3        3138      2007
## 692     755000          10      4316          3        2444      1994
## 770     538000           8      3279          3        1650      2003
## 804     582933           9      2822          3        1734      2008
## 899     611657           9      2364          3        2330      2009
## 1047    556581           9      2868          3        1992      2005
## 1170    625000          10      3627          3        1930      1995
## 1183    745000          10      4476          3        2396      1996
## 1299    160000          10      5642          2        6110      2008
##      YearRemodAdd MasVnrArea Fireplaces BsmtFinSF1
## 179          2009        748          1       1904
## 186          1993          0          2          0
## 441          2008        200          2       1767
## 524          2008        762          1       2260
## 692          1995       1170          2       1455
## 770          2003        603          1       1416
## 804          2009        424          1          0
## 899          2010        760          2       2188
## 1047         2006        208          1        240
## 1170         1996       1378          1       1387
## 1183         1996          0          2       2096
## 1299         2008        796          3       5644
# Discarding influential rows (outliers)
X_top_top9 <- X_top_top9[-influential_rows, ]

#Fitting Model - 3rd Try

lm_model3 <- lm(SalePrice ~., data=X_top_top9)
summary(lm_model3)
## 
## Call:
## lm(formula = SalePrice ~ ., data = X_top_top9)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -130007  -16368    -948   14862  146380 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.206e+06  8.965e+04 -13.453  < 2e-16 ***
## OverallQual   1.570e+04  8.894e+02  17.658  < 2e-16 ***
## GrLivArea     4.831e+01  2.051e+00  23.552  < 2e-16 ***
## GarageCars    9.866e+03  1.337e+03   7.381 2.65e-13 ***
## TotalBsmtSF   2.607e+01  2.368e+00  11.009  < 2e-16 ***
## YearBuilt     1.958e+02  3.566e+01   5.491 4.73e-08 ***
## YearRemodAdd  3.873e+02  4.679e+01   8.278 2.84e-16 ***
## MasVnrArea    2.054e+01  4.807e+00   4.273 2.06e-05 ***
## Fireplaces    7.631e+03  1.333e+03   5.726 1.25e-08 ***
## BsmtFinSF1    2.406e+01  1.971e+00  12.209  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27640 on 1430 degrees of freedom
##   (8 observations deleted due to missingness)
## Multiple R-squared:  0.8507, Adjusted R-squared:  0.8497 
## F-statistic: 905.1 on 9 and 1430 DF,  p-value: < 2.2e-16
AIC(lm_model3)
## [1] 33552.47
# Prediction Error Rate

pred_err_rate <- sigma(lm_model3)/mean(X_top_top9$SalePrice)
pred_err_rate
## [1] 0.1552224
# Residuals Analysis

plot(lm_model3$fitted.values, lm_model3$residuals, xlab='Fitted Values', ylab='Residuals')
abline(0,0, col="blue")

qqnorm(lm_model3$residuals)
qqline(lm_model3$residuals, col="red")

Good Adjusted R_squared for the model = 0.8497, so, it explains over 85% of variability in the data. AIC = 33552.47 (smallest of the 3 models tried)

Prediction
# Dealing with NAs in the test set

library(dplyr)

num <- select_if(test, is.numeric)
catg <- select_if(test, is.factor)

test[c(colnames(num))][is.na(test[c(colnames(num))])] <- 0

# Predicting SalePrice using Model3

SalePrice_Pred <- predict(lm_model3, newdata=test)

results <- data.frame(test$Id, SalePrice_Pred)
colnames(results) <- c("Id", "SalePrice")
head(results)
##     Id SalePrice
## 1 1461  103439.7
## 2 1462  163133.6
## 3 1463  186701.7
## 4 1464  197205.7
## 5 1465  202492.8
## 6 1466  177996.5
# Write results to a CSV file for Kaggle Submission

# write.csv(results, file = 'C:/DATA/HHP/Personal/Degrees/Ms. Data Science (CUNY)/2nd Semester/605 - Foundations of Computational Math/Assignments/Final Project/house-prices-advanced-regression-techniques/houseprices_prediction_humbertohp.csv', row.names = FALSE)
# Focus on categorical features (extra)

catg <- select_if(train, is.factor)
X_catg <- data.frame(train[c(colnames(catg))], stringsAsFactors = TRUE)
summary(X_catg)
##     MSZoning     Street      Alley      LotShape  LandContour
##  C (all):  10   Grvl:   6   Grvl:  50   IR1:484   Bnk:  63   
##  FV     :  65   Pave:1454   Pave:  41   IR2: 41   HLS:  50   
##  RH     :  16               NA's:1369   IR3: 10   Low:  36   
##  RL     :1151                           Reg:925   Lvl:1311   
##  RM     : 218                                                
##                                                              
##                                                              
##   Utilities      LotConfig    LandSlope   Neighborhood   Condition1  
##  AllPub:1459   Corner : 263   Gtl:1382   NAmes  :225   Norm   :1260  
##  NoSeWa:   1   CulDSac:  94   Mod:  65   CollgCr:150   Feedr  :  81  
##                FR2    :  47   Sev:  13   OldTown:113   Artery :  48  
##                FR3    :   4              Edwards:100   RRAn   :  26  
##                Inside :1052              Somerst: 86   PosN   :  19  
##                                          Gilbert: 79   RRAe   :  11  
##                                          (Other):707   (Other):  15  
##    Condition2     BldgType      HouseStyle    RoofStyle       RoofMatl   
##  Norm   :1445   1Fam  :1220   1Story :726   Flat   :  13   CompShg:1434  
##  Feedr  :   6   2fmCon:  31   2Story :445   Gable  :1141   Tar&Grv:  11  
##  Artery :   2   Duplex:  52   1.5Fin :154   Gambrel:  11   WdShngl:   6  
##  PosN   :   2   Twnhs :  43   SLvl   : 65   Hip    : 286   WdShake:   5  
##  RRNn   :   2   TwnhsE: 114   SFoyer : 37   Mansard:   7   ClyTile:   1  
##  PosA   :   1                 1.5Unf : 14   Shed   :   2   Membran:   1  
##  (Other):   2                 (Other): 19                  (Other):   2  
##   Exterior1st   Exterior2nd    MasVnrType  ExterQual ExterCond
##  VinylSd:515   VinylSd:504   BrkCmn : 15   Ex: 52    Ex:   3  
##  HdBoard:222   MetalSd:214   BrkFace:445   Fa: 14    Fa:  28  
##  MetalSd:220   HdBoard:207   None   :864   Gd:488    Gd: 146  
##  Wd Sdng:206   Wd Sdng:197   Stone  :128   TA:906    Po:   1  
##  Plywood:108   Plywood:142   NA's   :  8             TA:1282  
##  CemntBd: 61   CmentBd: 60                                    
##  (Other):128   (Other):136                                    
##   Foundation  BsmtQual   BsmtCond    BsmtExposure BsmtFinType1
##  BrkTil:146   Ex  :121   Fa  :  45   Av  :221     ALQ :220    
##  CBlock:634   Fa  : 35   Gd  :  65   Gd  :134     BLQ :148    
##  PConc :647   Gd  :618   Po  :   2   Mn  :114     GLQ :418    
##  Slab  : 24   TA  :649   TA  :1311   No  :953     LwQ : 74    
##  Stone :  6   NA's: 37   NA's:  37   NA's: 38     Rec :133    
##  Wood  :  3                                       Unf :430    
##                                                   NA's: 37    
##  BsmtFinType2  Heating     HeatingQC CentralAir Electrical   KitchenQual
##  ALQ :  19    Floor:   1   Ex:741    N:  95     FuseA:  94   Ex:100     
##  BLQ :  33    GasA :1428   Fa: 49    Y:1365     FuseF:  27   Fa: 39     
##  GLQ :  14    GasW :  18   Gd:241               FuseP:   3   Gd:586     
##  LwQ :  46    Grav :   7   Po:  1               Mix  :   1   TA:735     
##  Rec :  54    OthW :   2   TA:428               SBrkr:1334              
##  Unf :1256    Wall :   4                        NA's :   1              
##  NA's:  38                                                              
##  Functional  FireplaceQu   GarageType  GarageFinish GarageQual 
##  Maj1:  14   Ex  : 24    2Types :  6   Fin :352     Ex  :   3  
##  Maj2:   5   Fa  : 33    Attchd :870   RFn :422     Fa  :  48  
##  Min1:  31   Gd  :380    Basment: 19   Unf :605     Gd  :  14  
##  Min2:  34   Po  : 20    BuiltIn: 88   NA's: 81     Po  :   3  
##  Mod :  15   TA  :313    CarPort:  9                TA  :1311  
##  Sev :   1   NA's:690    Detchd :387                NA's:  81  
##  Typ :1360               NA's   : 81                           
##  GarageCond  PavedDrive  PoolQC       Fence      MiscFeature
##  Ex  :   2   N:  90     Ex  :   2   GdPrv:  59   Gar2:   2  
##  Fa  :  35   P:  30     Fa  :   2   GdWo :  54   Othr:   2  
##  Gd  :   9   Y:1340     Gd  :   3   MnPrv: 157   Shed:  49  
##  Po  :   7              NA's:1453   MnWw :  11   TenC:   1  
##  TA  :1326                          NA's :1179   NA's:1406  
##  NA's:  81                                                  
##                                                             
##     SaleType    SaleCondition 
##  WD     :1267   Abnorml: 101  
##  New    : 122   AdjLand:   4  
##  COD    :  43   Alloca :  12  
##  ConLD  :   9   Family :  20  
##  ConLI  :   5   Normal :1198  
##  ConLw  :   5   Partial: 125  
##  (Other):   9
# Identifying categorical features with highest potential correlation to SalePrice - Top 10 (Factor Value distributions)

top_catg_10 <- c("LotConfig", "Neighborhood", "BldgType", "HouseStyle", "Exterior1st", "ExterQual", "ExterCond", "BsmtQual", "HeatingQC", "Foundation", "SalePrice")

X_top_catg_10 <- train[c(top_catg_10)]

#Fitting Model - Categorical only

lm_modelcat <- lm(SalePrice ~., data=X_top_catg_10)
summary(lm_modelcat)
## 
## Call:
## lm(formula = SalePrice ~ ., data = X_top_catg_10)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -175763  -21022   -1552   16984  350059 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          399899.2    30046.2  13.309  < 2e-16 ***
## LotConfigCulDSac      12557.6     5166.3   2.431 0.015199 *  
## LotConfigFR2         -22873.0     6738.7  -3.394 0.000708 ***
## LotConfigFR3          -7350.5    20953.2  -0.351 0.725788    
## LotConfigInside       -2879.5     2916.6  -0.987 0.323692    
## NeighborhoodBlueste  -20116.1    31505.8  -0.638 0.523264    
## NeighborhoodBrDale   -33533.5    16762.6  -2.000 0.045647 *  
## NeighborhoodBrkSide  -60746.3    13483.6  -4.505 7.20e-06 ***
## NeighborhoodClearCr  -12839.3    14203.7  -0.904 0.366189    
## NeighborhoodCollgCr  -39100.0    11454.7  -3.413 0.000660 ***
## NeighborhoodCrawfor  -14503.8    13069.0  -1.110 0.267287    
## NeighborhoodEdwards  -69759.4    12367.0  -5.641 2.06e-08 ***
## NeighborhoodGilbert  -51425.7    12286.7  -4.185 3.03e-05 ***
## NeighborhoodIDOTRR   -88827.8    14090.4  -6.304 3.91e-10 ***
## NeighborhoodMeadowV  -66819.8    16867.3  -3.962 7.84e-05 ***
## NeighborhoodMitchel  -54820.8    12926.9  -4.241 2.38e-05 ***
## NeighborhoodNAmes    -47135.3    12213.1  -3.859 0.000119 ***
## NeighborhoodNoRidge   65211.0    13112.6   4.973 7.43e-07 ***
## NeighborhoodNPkVill  -18122.1    18352.4  -0.987 0.323597    
## NeighborhoodNridgHt   24822.7    11905.0   2.085 0.037251 *  
## NeighborhoodNWAmes   -25784.6    12734.4  -2.025 0.043085 *  
## NeighborhoodOldTown  -71008.3    12812.6  -5.542 3.59e-08 ***
## NeighborhoodSawyer   -54931.9    12794.3  -4.293 1.88e-05 ***
## NeighborhoodSawyerW  -38185.8    12417.1  -3.075 0.002145 ** 
## NeighborhoodSomerst  -20887.3    11651.0  -1.793 0.073239 .  
## NeighborhoodStoneBr   52008.0    13369.8   3.890 0.000105 ***
## NeighborhoodSWISU    -71171.0    14840.9  -4.796 1.80e-06 ***
## NeighborhoodTimber   -14208.6    13050.2  -1.089 0.276450    
## NeighborhoodVeenker    1757.4    16900.7   0.104 0.917196    
## BldgType2fmCon         2817.1     8251.4   0.341 0.732851    
## BldgTypeDuplex         2011.5     7216.3   0.279 0.780482    
## BldgTypeTwnhs        -63753.1     8359.2  -7.627 4.52e-14 ***
## BldgTypeTwnhsE       -55062.4     5269.3 -10.450  < 2e-16 ***
## HouseStyle1.5Unf     -25776.7    12129.8  -2.125 0.033762 *  
## HouseStyle1Story     -14013.6     4303.5  -3.256 0.001157 ** 
## HouseStyle2.5Fin      65552.9    15563.0   4.212 2.70e-05 ***
## HouseStyle2.5Unf      19651.5    13425.0   1.464 0.143483    
## HouseStyle2Story       4779.1     4581.3   1.043 0.297062    
## HouseStyleSFoyer     -26048.3     8580.1  -3.036 0.002444 ** 
## HouseStyleSLvl        -5465.2     6628.2  -0.825 0.409783    
## Exterior1stBrkComm   -20284.0    43273.7  -0.469 0.639334    
## Exterior1stBrkFace    54617.1    11857.9   4.606 4.49e-06 ***
## Exterior1stCBlock     -8469.0    44613.8  -0.190 0.849472    
## Exterior1stCemntBd    33420.8    12294.6   2.718 0.006645 ** 
## Exterior1stHdBoard     8984.8    10751.1   0.836 0.403466    
## Exterior1stImStucc   -37304.5    42691.9  -0.874 0.382378    
## Exterior1stMetalSd     8069.3    10407.6   0.775 0.438279    
## Exterior1stPlywood    29189.7    11254.1   2.594 0.009598 ** 
## Exterior1stStone      71988.9    31012.0   2.321 0.020417 *  
## Exterior1stStucco     29580.2    13244.7   2.233 0.025689 *  
## Exterior1stVinylSd     8526.5    10570.8   0.807 0.420035    
## Exterior1stWd Sdng    14708.0    10381.6   1.417 0.156788    
## Exterior1stWdShing     2468.1    13202.5   0.187 0.851734    
## ExterQualFa         -105551.1    15652.9  -6.743 2.29e-11 ***
## ExterQualGd          -60920.5     7375.1  -8.260 3.43e-16 ***
## ExterQualTA          -86231.6     8001.9 -10.776  < 2e-16 ***
## ExterCondFa          -61407.9    26228.3  -2.341 0.019362 *  
## ExterCondGd          -36767.6    24459.4  -1.503 0.133019    
## ExterCondPo         -100960.3    48004.3  -2.103 0.035637 *  
## ExterCondTA          -43810.2    24307.9  -1.802 0.071721 .  
## BsmtQualFa           -92260.8     9600.1  -9.610  < 2e-16 ***
## BsmtQualGd           -62135.6     5233.8 -11.872  < 2e-16 ***
## BsmtQualTA           -81134.9     6232.9 -13.017  < 2e-16 ***
## HeatingQCFa          -24386.6     6959.8  -3.504 0.000473 ***
## HeatingQCGd           -8995.0     3467.3  -2.594 0.009582 ** 
## HeatingQCPo            -166.8    43065.9  -0.004 0.996910    
## HeatingQCTA          -12047.3     3318.9  -3.630 0.000294 ***
## FoundationCBlock       7859.0     4939.8   1.591 0.111850    
## FoundationPConc       15074.2     5360.3   2.812 0.004991 ** 
## FoundationStone       12548.5    17506.9   0.717 0.473638    
## FoundationWood         2549.8    24452.5   0.104 0.916964    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40740 on 1352 degrees of freedom
##   (37 observations deleted due to missingness)
## Multiple R-squared:  0.7496, Adjusted R-squared:  0.7367 
## F-statistic: 57.83 on 70 and 1352 DF,  p-value: < 2.2e-16

Avg model with Adjusted R_squared for the model = 0.7367, so, it explains over 74% of variability in the data.

Kaggle Submission

The model used for submission was Model3 created only with continuous features. For simplicity purposes and to demonstrate several modeling techniques: 1) to identify and include the most statistically significant features using ranking correlation, and 2) to identify, validate and deal with outliers affecting model fit and performance using Cook’s distance. The best model would have been a combination of the most significant numeric and categorical features, due to timing constraints, I will try that as a separate project shortly.