Using R, generate a random variable X that has 10,000 random uniform numbers from 1 to N, where N can be any number of your choosing greater than or equal to 6. Then generate a random variable Y that has 10,000 random normal numbers with mean \(\mu = (N+1)/2\) and standard deviation \(\sigma = (N+1)/2\).
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
5 points a. P(X>x | X>y) b. P(X>x, Y>y) c. P(X<x | X>y)
N <- 6 # chosen N (>= 6)
X <- runif(10000, min = 1, max = N) # 10,000 uniform draws on [1, N]
Y <- rnorm(10000, mean = (N+1)/2, sd = (N+1)/2) # 10,000 normal draws
x <- median(X) # x: the median of X (for this uniform sample, close to the mean)
y <- quantile(Y, .25) # y: the 1st quartile of Y
# P(X>x)
Xgrx <- sum(X>x)/length(X)
# P(X<x)
Xlsx <- sum(X<x)/length(X)
# P(X>y)
Xgry <- sum(X>y)/length(X)
# P(Y>y)
Ygry <- sum(Y>y)/length(Y)
Conditional probability: \[ P(B|A) = \frac{P(A \space and \space B)}{P(A)} \]
\[ a. P(X>x | X>y) \]
\[ P(X>x | X>y) = \frac{P(X>x \space and \space X>y)}{P(X>y)} \]
Xgrx_pipe_Xgry <- sum((X>y) & (X>x))/length(X)
a <- Xgrx_pipe_Xgry/Xgry
a
## [1] 0.5129095
Interpretation: given that X already exceeds y (the 1st quartile of Y), there is about a 51% chance that X also exceeds its own median, only slightly above the unconditional 50%, because the conditioning event X>y holds for nearly all of the uniform draws.
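An equivalent direct check (a sketch): restrict to the draws where X > y and take the proportion of those that also exceed x.
# Proportion of X-values above x among those already above y
mean(X[X > y] > x)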
\[ b. P(X>x, Y>y) \]
\[ P(X>x, Y>y) = P(X>x \space and \space Y>y) \]
Xgrx_and_Ygry <- sum((X>x) & (Y>y))/length(X)
b <- Xgrx_and_Ygry
b
## [1] 0.3809
Interpretation: about 38% of the simulated (X, Y) pairs have X above its median and, simultaneously, Y above its 1st quartile; since X and Y are generated independently, this is close to 0.5 × 0.75 = 0.375.
\[ c. P(X<x | X>y) \]
\[ P(X<x | X>y) = \frac{P(X<x \space and \space X>y)}{P(X>y)} \]
Xlsx_pipe_Xgry <- sum((X>y) & (X<x))/length(X)
c <- Xlsx_pipe_Xgry/Xgry
c
## [1] 0.4870905
Interpretation: given that X exceeds y, there is about a 49% chance that X still falls below its median; together with (a), the two conditional probabilities account for all draws with X>y.
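A quick sanity check: the events in (a) and (c) are complementary once we condition on X>y, so their probabilities should sum to 1.
# The conditional probabilities from (a) and (c) are complementary given X>y
a + c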
5 points. Investigate whether P(X>x and Y>y)=P(X>x)P(Y>y) by building a table and evaluating the marginal and joint probabilities.
# P(X>x)
Xgrx <- sum(X>x)/length(X)
# P(Y>y)
Ygry <- sum(Y>y)/length(Y)
Xgrx_and_Ygry1 <- sum((X>x) & (Y>y))/length(X)
Xgrx_and_Ygry1
## [1] 0.3809
Xgrx_and_Ygry2 <- Xgrx*Ygry
Xgrx_and_Ygry2
## [1] 0.37695
identical(round(Xgrx_and_Ygry1,2), round(Xgrx_and_Ygry2,2))
## [1] TRUE
\[ P(X>x \space and \space Y>y) \approx P(X>x)P(Y>y) \]
The (rounded) probabilities are essentially identical, so the simulation is consistent with (X>x) and (Y>y) being independent.
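For completeness, a minimal sketch of the table the question asks for, with joint probabilities in the cells and marginal probabilities on the borders (assumes X, Y, x, y as defined above):
# Joint probabilities of the four (X>x, Y>y) outcomes, with marginal sums added
prob_tbl <- addmargins(prop.table(table(X > x, Y > y, dnn = list("X>x", "Y>y"))))
round(prob_tbl, 4)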
5 points. Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?
\(H_0\): (X>x) and (Y>y) are independent
\(H_A\): (X>x) and (Y>y) are dependent
cont_tbl <- table((X>x), (Y>y), dnn=list("X>x","Y>y"))
cont_tbl
## Y>y
## X>x FALSE TRUE
## FALSE 1283 3691
## TRUE 1217 3809
chisq.test(cont_tbl, correct = FALSE)
##
## Pearson's Chi-squared test
##
## data: cont_tbl
## X-squared = 3.3286, df = 1, p-value = 0.06808
The p-value (0.06808) is considerably larger than the pre-defined significance level of 0.05, so the null hypothesis cannot be rejected; the result is consistent with (X>x) and (Y>y) being independent.
fisher.test(cont_tbl, conf.int = TRUE, conf.level = 0.99)
##
## Fisher's Exact Test for Count Data
##
## data: cont_tbl
## p-value = 0.07163
## alternative hypothesis: true odds ratio is not equal to 1
## 99 percent confidence interval:
## 0.9648502 1.2267838
## sample estimates:
## odds ratio
## 1.087925
The p-value (0.07163) is again considerably larger than the 0.05 significance level and very close to the Chi-Square result, so the null hypothesis cannot be rejected; the result is consistent with independence. The difference between the two tests: Fisher's Exact Test computes the exact probability of the observed table (conditioning on the row and column totals), while the Chi-Square Test relies on a large-sample approximation to the chi-squared distribution. Fisher's test is preferred for small samples or sparse tables; with 10,000 observations and large cell counts, the Chi-Square Test is entirely appropriate here, and the two tests agree.
You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.
5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
data <- read.csv('C:/DATA/HHP/Personal/Degrees/Ms. Data Science (CUNY)/2nd Semester/605 - Foundations of Computational Math/Assignments/Final Project/house-prices-advanced-regression-techniques/train.csv', header = TRUE)
nrow(data) # Total observations
## [1] 1460
ncol(data) # Total features
## [1] 81
# Subsetting the data set to the potentially most relevant features
head(data[c(5, 19, 21, 47, 50, 62, 81)])
## LotArea OverallCond YearRemodAdd GrLivArea FullBath GarageCars SalePrice
## 1 8450 5 2003 1710 2 2 208500
## 2 9600 8 1976 1262 2 2 181500
## 3 11250 5 2002 1786 2 2 223500
## 4 9550 5 1970 1717 1 3 140000
## 5 14260 5 2000 2198 2 3 250000
## 6 14115 5 1995 1362 1 2 143000
summary(data[c(5, 19, 21, 47, 50, 62, 81)])
## LotArea OverallCond YearRemodAdd GrLivArea
## Min. : 1300 Min. :1.000 Min. :1950 Min. : 334
## 1st Qu.: 7554 1st Qu.:5.000 1st Qu.:1967 1st Qu.:1130
## Median : 9478 Median :5.000 Median :1994 Median :1464
## Mean : 10517 Mean :5.575 Mean :1985 Mean :1515
## 3rd Qu.: 11602 3rd Qu.:6.000 3rd Qu.:2004 3rd Qu.:1777
## Max. :215245 Max. :9.000 Max. :2010 Max. :5642
## FullBath GarageCars SalePrice
## Min. :0.000 Min. :0.000 Min. : 34900
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:129975
## Median :2.000 Median :2.000 Median :163000
## Mean :1.565 Mean :1.767 Mean :180921
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:214000
## Max. :3.000 Max. :4.000 Max. :755000
hist(data$LotArea)
#hist(data$FullBath)
hist(data$GrLivArea)
hist(data$OverallCond)
hist(data$YearRemodAdd)
hist(data$GarageCars)
# Scatterplot Matrix
pairs(data[c(5, 19, 21, 47, 50, 62, 81)])
Y <- data[81]
X <- data[c(2,5,18:21,35, 37:39,44:53,55,57,62:63,67:72, 76:78)] #just numerical features
# Correlation Matrix
corr_features <- data.frame(cor(X,Y))
corr_features
## SalePrice
## MSSubClass -0.08428414
## LotArea 0.26384335
## OverallQual 0.79098160
## OverallCond -0.07785589
## YearBuilt 0.52289733
## YearRemodAdd 0.50710097
## BsmtFinSF1 0.38641981
## BsmtFinSF2 -0.01137812
## BsmtUnfSF 0.21447911
## TotalBsmtSF 0.61358055
## X1stFlrSF 0.60585218
## X2ndFlrSF 0.31933380
## LowQualFinSF -0.02560613
## GrLivArea 0.70862448
## BsmtFullBath 0.22712223
## BsmtHalfBath -0.01684415
## FullBath 0.56066376
## HalfBath 0.28410768
## BedroomAbvGr 0.16821315
## KitchenAbvGr -0.13590737
## TotRmsAbvGrd 0.53372316
## Fireplaces 0.46692884
## GarageCars 0.64040920
## GarageArea 0.62343144
## WoodDeckSF 0.32441344
## OpenPorchSF 0.31585623
## EnclosedPorch -0.12857796
## X3SsnPorch 0.04458367
## ScreenPorch 0.11144657
## PoolArea 0.09240355
## MiscVal -0.02118958
## MoSold 0.04643225
## YrSold -0.02892259
# Top 5 features correlated to SalePrice target variable
ordering <- data.frame(order(corr_features$SalePrice, decreasing = T))
row_n <- rownames(corr_features)
row_n[ordering[1:5,]]
## [1] "OverallQual" "GrLivArea" "GarageCars" "GarageArea" "TotalBsmtSF"
Hypothesis testing will be done on the two features most strongly correlated with SalePrice, OverallQual and GrLivArea, using an 80% confidence interval.
\(H_0\): pairwise correlation = 0 (SalePrice & OverallQual / GrLivArea)
\(H_A\): pairwise correlation \(\neq\) 0 (SalePrice & OverallQual / GrLivArea)
Y <- data$SalePrice
X1 <- data$OverallQual
X2 <- data$GrLivArea
# Variables normal distribution
#qqnorm(X1)
#qqline(X1)
#qqnorm(X2)
#qqline(X2)
X1_CorrTest <- cor.test(X1,Y, method = "pearson", conf.level=.8)
X1_CorrTest
##
## Pearson's product-moment correlation
##
## data: X1 and Y
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.7780752 0.8032204
## sample estimates:
## cor
## 0.7909816
X2_CorrTest <- cor.test(X2,Y, method = "pearson", conf.level=.8)
X2_CorrTest
##
## Pearson's product-moment correlation
##
## data: X2 and Y
## t = 38.348, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.6915087 0.7249450
## sample estimates:
## cor
## 0.7086245
Both OverallQual and GrLivArea are significantly positively correlated with SalePrice, with coefficients of 0.79 and 0.71 respectively. The p-values are extremely small (< 2.2e-16), so the null hypothesis of zero correlation is rejected for both pairs, and the 80% confidence intervals (0.778 to 0.803 and 0.692 to 0.725) lie well away from zero. Familywise error is not a major concern here: only a few tests are performed and the p-values are many orders of magnitude below any reasonable threshold, so even a conservative Bonferroni correction would not change the conclusions.
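To make the familywise-error point concrete, a hedged sketch of a Bonferroni adjustment applied to the pairwise tests among these three variables (variable names as above; the adjustment leaves the conclusions unchanged because the raw p-values are already minuscule):
# Pairwise correlation tests among three variables with a Bonferroni adjustment
vars <- data[c("OverallQual", "GrLivArea", "SalePrice")]
pairs_lst <- combn(names(vars), 2, simplify = FALSE)
pvals <- sapply(pairs_lst, function(p) cor.test(vars[[p[1]]], vars[[p[2]]], conf.level = 0.8)$p.value)
data.frame(pair = sapply(pairs_lst, paste, collapse = " vs "),
           raw_p = pvals,
           bonferroni_p = p.adjust(pvals, method = "bonferroni"))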
5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
Y <- data$SalePrice
X <- data$GrLivArea
# Correlation Matrix
XY <- data.frame(X,Y)
corr_matrix <- cor(XY)
corr_matrix
## X Y
## X 1.0000000 0.7086245
## Y 0.7086245 1.0000000
#Precision Matrix
prec_matrix <- solve(corr_matrix)
prec_matrix
## X Y
## X 2.008632 -1.423366
## Y -1.423366 2.008632
#Matrices multiplication
corr_matrix %*% prec_matrix
## X Y
## X 1 0
## Y 0 1
prec_matrix %*% corr_matrix
## X Y
## X 1 0
## Y 0 1
Both products yield the identity matrix: \[ A A^{-1} = A^{-1} A = I \]
# LU Decomposition
#install.packages("matrixcalc")
library(matrixcalc)
LU <- lu.decomposition(corr_matrix)
L <- LU$L
U <- LU$U
L
## [,1] [,2]
## [1,] 1.0000000 0
## [2,] 0.7086245 1
U
## [,1] [,2]
## [1,] 1 0.7086245
## [2,] 0 0.4978513
L%*%U
## [,1] [,2]
## [1,] 1.0000000 0.7086245
## [2,] 0.7086245 1.0000000
corr_matrix
## X Y
## X 1.0000000 0.7086245
## Y 0.7086245 1.0000000
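A quick programmatic check (a sketch) that the LU factors multiply back to the correlation matrix, ignoring the dimnames that lu.decomposition drops:
# L %*% U should reproduce corr_matrix up to numerical tolerance
all.equal(L %*% U, corr_matrix, check.attributes = FALSE)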
5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function.
(See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ).
Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable.
Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
hist(data$LotArea)
min(data$LotArea) #min is above zero
## [1] 1300
library(MASS)
X <- data$LotArea
edf <- fitdistr(X, densfun="exponential")
# lambda: estimated rate parameter of the exponential fit (1 / sample mean of LotArea)
lambda <- edf$estimate
exp_sample <- rexp(1000, lambda)
hist(exp_sample, col='dark green', main="Exponential Fit of X")
summary(exp_sample)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.55 3263.62 7734.02 10819.04 15364.62 94470.00
hist(X, col='blue', main="X")
summary(X)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1300 7554 9478 10517 11602 215245
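A single overlaid plot (a sketch) makes the comparison more direct: the fitted exponential density drawn over the empirical distribution of LotArea.
# Histogram of LotArea on the density scale with the fitted exponential overlaid
hist(X, breaks = 50, freq = FALSE, col = "blue", main = "LotArea vs. fitted exponential")
curve(dexp(x, rate = lambda), add = TRUE, col = "dark green", lwd = 2)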
# 5th and 95th percentiles of the fitted exponential via the quantile function (inverse CDF)
qexp(c(.05, .95), rate = lambda)
## [1] 539.4428 31505.6013
# 95% interval from the empirical data, assuming normality (mean +/- 1.96 * sd)
qnorm(c(.025, .975), mean=mean(X), sd=sd(X))
## [1] -9046.092 30079.748
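The interval above describes the spread of individual observations; if the intent was a confidence interval for the mean, a sketch under the same normality assumption:
# 95% confidence interval for the mean of LotArea (mean +/- z * sd / sqrt(n))
n <- length(X)
mean(X) + c(-1, 1) * qnorm(0.975) * sd(X) / sqrt(n)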
# empirical 5th percentile and 95th percentile
quantile(X, c(.05, .95))
## 5% 95%
## 3311.70 17401.15
Discussion: the exponential fit is a poor match for LotArea. The empirical 5th and 95th percentiles (3,312 and 17,401) are much tighter than the exponential ones (539 and 31,506), because the data are concentrated around roughly 7,500 to 11,600 square feet with a long right tail rather than decaying monotonically from zero. The normal assumption is even less appropriate: the lower bound of the normal interval (-9,046) is negative, which is impossible for a lot area, confirming that the variable is strongly right-skewed rather than normal.
10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
#Loading data sets
train <- read.csv('C:/DATA/HHP/Personal/Degrees/Ms. Data Science (CUNY)/2nd Semester/605 - Foundations of Computational Math/Assignments/Final Project/house-prices-advanced-regression-techniques/train.csv', header = TRUE)
test <- read.csv('C:/DATA/HHP/Personal/Degrees/Ms. Data Science (CUNY)/2nd Semester/605 - Foundations of Computational Math/Assignments/Final Project/house-prices-advanced-regression-techniques/test.csv', header = TRUE)
#install.packages("dplyr")
library(dplyr)
# Identify the numeric features (categorical features are handled separately below)
num <- select_if(train, is.numeric)
# Focus on numeric features
X_tmp <- train[c(colnames(num))]
#explore statistics for the model features, identify NAs
summary(X_tmp)
## Id MSSubClass LotFrontage LotArea
## Min. : 1.0 Min. : 20.0 Min. : 21.00 Min. : 1300
## 1st Qu.: 365.8 1st Qu.: 20.0 1st Qu.: 59.00 1st Qu.: 7554
## Median : 730.5 Median : 50.0 Median : 69.00 Median : 9478
## Mean : 730.5 Mean : 56.9 Mean : 70.05 Mean : 10517
## 3rd Qu.:1095.2 3rd Qu.: 70.0 3rd Qu.: 80.00 3rd Qu.: 11602
## Max. :1460.0 Max. :190.0 Max. :313.00 Max. :215245
## NA's :259
## OverallQual OverallCond YearBuilt YearRemodAdd
## Min. : 1.000 Min. :1.000 Min. :1872 Min. :1950
## 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1954 1st Qu.:1967
## Median : 6.000 Median :5.000 Median :1973 Median :1994
## Mean : 6.099 Mean :5.575 Mean :1971 Mean :1985
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2000 3rd Qu.:2004
## Max. :10.000 Max. :9.000 Max. :2010 Max. :2010
##
## MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## Min. : 0.0 Min. : 0.0 Min. : 0.00 Min. : 0.0
## 1st Qu.: 0.0 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 223.0
## Median : 0.0 Median : 383.5 Median : 0.00 Median : 477.5
## Mean : 103.7 Mean : 443.6 Mean : 46.55 Mean : 567.2
## 3rd Qu.: 166.0 3rd Qu.: 712.2 3rd Qu.: 0.00 3rd Qu.: 808.0
## Max. :1600.0 Max. :5644.0 Max. :1474.00 Max. :2336.0
## NA's :8
## TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF
## Min. : 0.0 Min. : 334 Min. : 0 Min. : 0.000
## 1st Qu.: 795.8 1st Qu.: 882 1st Qu.: 0 1st Qu.: 0.000
## Median : 991.5 Median :1087 Median : 0 Median : 0.000
## Mean :1057.4 Mean :1163 Mean : 347 Mean : 5.845
## 3rd Qu.:1298.2 3rd Qu.:1391 3rd Qu.: 728 3rd Qu.: 0.000
## Max. :6110.0 Max. :4692 Max. :2065 Max. :572.000
##
## GrLivArea BsmtFullBath BsmtHalfBath FullBath
## Min. : 334 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.:1130 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median :1464 Median :0.0000 Median :0.00000 Median :2.000
## Mean :1515 Mean :0.4253 Mean :0.05753 Mean :1.565
## 3rd Qu.:1777 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :5642 Max. :3.0000 Max. :2.00000 Max. :3.000
##
## HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd
## Min. :0.0000 Min. :0.000 Min. :0.000 Min. : 2.000
## 1st Qu.:0.0000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 5.000
## Median :0.0000 Median :3.000 Median :1.000 Median : 6.000
## Mean :0.3829 Mean :2.866 Mean :1.047 Mean : 6.518
## 3rd Qu.:1.0000 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000
## Max. :2.0000 Max. :8.000 Max. :3.000 Max. :14.000
##
## Fireplaces GarageYrBlt GarageCars GarageArea
## Min. :0.000 Min. :1900 Min. :0.000 Min. : 0.0
## 1st Qu.:0.000 1st Qu.:1961 1st Qu.:1.000 1st Qu.: 334.5
## Median :1.000 Median :1980 Median :2.000 Median : 480.0
## Mean :0.613 Mean :1979 Mean :1.767 Mean : 473.0
## 3rd Qu.:1.000 3rd Qu.:2002 3rd Qu.:2.000 3rd Qu.: 576.0
## Max. :3.000 Max. :2010 Max. :4.000 Max. :1418.0
## NA's :81
## WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median : 0.00 Median : 25.00 Median : 0.00 Median : 0.00
## Mean : 94.24 Mean : 46.66 Mean : 21.95 Mean : 3.41
## 3rd Qu.:168.00 3rd Qu.: 68.00 3rd Qu.: 0.00 3rd Qu.: 0.00
## Max. :857.00 Max. :547.00 Max. :552.00 Max. :508.00
##
## ScreenPorch PoolArea MiscVal MoSold
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 1.000
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 5.000
## Median : 0.00 Median : 0.000 Median : 0.00 Median : 6.000
## Mean : 15.06 Mean : 2.759 Mean : 43.49 Mean : 6.322
## 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 8.000
## Max. :480.00 Max. :738.000 Max. :15500.00 Max. :12.000
##
## YrSold SalePrice
## Min. :2006 Min. : 34900
## 1st Qu.:2007 1st Qu.:129975
## Median :2008 Median :163000
## Mean :2008 Mean :180921
## 3rd Qu.:2009 3rd Qu.:214000
## Max. :2010 Max. :755000
##
# Eliminate records with NAs (specifically from columns LotFrontage, MasVnrArea, and GarageYrBlt)
X <- na.omit(X_tmp)
Y <- X[38] # target column (SalePrice)
# Identifying the 15 features most correlated with SalePrice (SalePrice itself ranks first with correlation 1, so 16 rows are kept)
corr_features <- data.frame(cor(X,Y))
ordering <- data.frame(order(corr_features$SalePrice, decreasing = T))
row_n <- rownames(corr_features)
top15 <- row_n[ordering[1:16,]]
X_top15 <- train[c(top15)]
#Fitting Model - 1st Try
lm_model1 <- lm(SalePrice ~., data=X_top15)
summary(lm_model1)
##
## Call:
## lm(formula = SalePrice ~ ., data = X_top15)
##
## Residuals:
## Min 1Q Median 3Q Max
## -511286 -18388 -1831 15823 276929
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.066e+06 1.627e+05 -6.553 8.65e-11 ***
## OverallQual 2.093e+04 1.476e+03 14.176 < 2e-16 ***
## GrLivArea 3.587e+01 5.150e+00 6.964 5.66e-12 ***
## GarageCars 1.533e+04 3.596e+03 4.263 2.19e-05 ***
## GarageArea 1.388e+01 1.263e+01 1.099 0.271793
## TotalBsmtSF 1.107e+01 5.373e+00 2.061 0.039523 *
## X1stFlrSF 3.338e+00 6.089e+00 0.548 0.583658
## FullBath -1.321e+03 3.249e+03 -0.407 0.684429
## TotRmsAbvGrd 2.261e+03 1.358e+03 1.664 0.096308 .
## YearBuilt 1.639e+02 7.682e+01 2.134 0.033064 *
## YearRemodAdd 4.074e+02 7.898e+01 5.159 2.95e-07 ***
## GarageYrBlt -7.564e+01 9.385e+01 -0.806 0.420441
## MasVnrArea 2.897e+01 7.273e+00 3.984 7.23e-05 ***
## Fireplaces 8.333e+03 2.230e+03 3.737 0.000196 ***
## BsmtFinSF1 1.995e+01 3.098e+00 6.438 1.80e-10 ***
## LotFrontage 3.673e+01 5.680e+01 0.647 0.517986
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38880 on 1105 degrees of freedom
## (339 observations deleted due to missingness)
## Multiple R-squared: 0.7835, Adjusted R-squared: 0.7806
## F-statistic: 266.6 on 15 and 1105 DF, p-value: < 2.2e-16
# Discarding the features whose coefficients are not statistically significant in the first fit (9 predictors remain)
top_top9 <- top15[-c(5,7:9,12,16)]
X_top_top9 <- train[c(top_top9)]
#Fitting Model - 2nd Try
lm_model2 <- lm(SalePrice ~., data=X_top_top9)
summary(lm_model2)
##
## Call:
## lm(formula = SalePrice ~ ., data = X_top_top9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -521279 -17668 -1949 14962 272264
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.113e+06 1.182e+05 -9.409 < 2e-16 ***
## OverallQual 1.808e+04 1.167e+03 15.485 < 2e-16 ***
## GrLivArea 4.522e+01 2.576e+00 17.557 < 2e-16 ***
## GarageCars 1.329e+04 1.748e+03 7.602 5.23e-14 ***
## TotalBsmtSF 1.773e+01 3.047e+00 5.818 7.32e-09 ***
## YearBuilt 1.655e+02 4.685e+01 3.532 0.000425 ***
## YearRemodAdd 3.677e+02 6.184e+01 5.946 3.44e-09 ***
## MasVnrArea 2.933e+01 6.157e+00 4.764 2.09e-06 ***
## Fireplaces 8.593e+03 1.750e+03 4.911 1.01e-06 ***
## BsmtFinSF1 1.830e+01 2.524e+00 7.248 6.85e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36580 on 1442 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.7885, Adjusted R-squared: 0.7872
## F-statistic: 597.4 on 9 and 1442 DF, p-value: < 2.2e-16
AIC(lm_model2)
## [1] 34645.26
# Prediction error rate: residual standard error relative to the mean SalePrice
pred_err_rate <- sigma(lm_model2)/mean(X_top_top9$SalePrice)
pred_err_rate
## [1] 0.2021614
#prediction <- predict(lm_model2)
#plot(lm_model2)
# Residuals Analysis
plot(lm_model2$fitted.values, lm_model2$residuals, xlab='Fitted Values', ylab='Residuals')
abline(0,0, col="blue")
qqnorm(lm_model2$residuals)
qqline(lm_model2$residuals, col="red")
#Identifying features with significant outliers
summary(X_top_top9[,2:9])
## OverallQual GrLivArea GarageCars TotalBsmtSF
## Min. : 1.000 Min. : 334 Min. :0.000 Min. : 0.0
## 1st Qu.: 5.000 1st Qu.:1130 1st Qu.:1.000 1st Qu.: 795.8
## Median : 6.000 Median :1464 Median :2.000 Median : 991.5
## Mean : 6.099 Mean :1515 Mean :1.767 Mean :1057.4
## 3rd Qu.: 7.000 3rd Qu.:1777 3rd Qu.:2.000 3rd Qu.:1298.2
## Max. :10.000 Max. :5642 Max. :4.000 Max. :6110.0
##
## YearBuilt YearRemodAdd MasVnrArea Fireplaces
## Min. :1872 Min. :1950 Min. : 0.0 Min. :0.000
## 1st Qu.:1954 1st Qu.:1967 1st Qu.: 0.0 1st Qu.:0.000
## Median :1973 Median :1994 Median : 0.0 Median :1.000
## Mean :1971 Mean :1985 Mean : 103.7 Mean :0.613
## 3rd Qu.:2000 3rd Qu.:2004 3rd Qu.: 166.0 3rd Qu.:1.000
## Max. :2010 Max. :2010 Max. :1600.0 Max. :3.000
## NA's :8
boxplot(X_top_top9[,c(3,5,8)])
# Calculating Cook's distance: the influence of each observation on the fitted model
cooksd <- cooks.distance(lm_model2)
# influential row numbers - cook's distance greater than 4 times the mean
influential_rows <- as.numeric(names(cooksd)[(cooksd > 4*mean(cooksd))])
X_top_top9[influential_rows, ]
## SalePrice OverallQual GrLivArea GarageCars TotalBsmtSF YearBuilt
## 179 501837 9 2234 3 2216 2008
## 186 475000 10 3608 3 1107 1892
## 441 555000 10 2402 3 3094 2008
## 524 184750 10 4676 3 3138 2007
## 692 755000 10 4316 3 2444 1994
## 770 538000 8 3279 3 1650 2003
## 804 582933 9 2822 3 1734 2008
## 899 611657 9 2364 3 2330 2009
## 1047 556581 9 2868 3 1992 2005
## 1170 625000 10 3627 3 1930 1995
## 1183 745000 10 4476 3 2396 1996
## 1299 160000 10 5642 2 6110 2008
## YearRemodAdd MasVnrArea Fireplaces BsmtFinSF1
## 179 2009 748 1 1904
## 186 1993 0 2 0
## 441 2008 200 2 1767
## 524 2008 762 1 2260
## 692 1995 1170 2 1455
## 770 2003 603 1 1416
## 804 2009 424 1 0
## 899 2010 760 2 2188
## 1047 2006 208 1 240
## 1170 1996 1378 1 1387
## 1183 1996 0 2 2096
## 1299 2008 796 3 5644
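A quick visual of the influence measure (a sketch), with the cutoff used above marked:
# Cook's distance for each observation in lm_model2, with the 4 * mean cutoff line
plot(cooksd, type = "h", ylab = "Cook's distance")
abline(h = 4 * mean(cooksd), col = "red", lty = 2)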
# Discarding influential rows (outliers)
X_top_top9 <- X_top_top9[-influential_rows, ]
#Fitting Model - 3rd Try
lm_model3 <- lm(SalePrice ~., data=X_top_top9)
summary(lm_model3)
##
## Call:
## lm(formula = SalePrice ~ ., data = X_top_top9)
##
## Residuals:
## Min 1Q Median 3Q Max
## -130007 -16368 -948 14862 146380
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.206e+06 8.965e+04 -13.453 < 2e-16 ***
## OverallQual 1.570e+04 8.894e+02 17.658 < 2e-16 ***
## GrLivArea 4.831e+01 2.051e+00 23.552 < 2e-16 ***
## GarageCars 9.866e+03 1.337e+03 7.381 2.65e-13 ***
## TotalBsmtSF 2.607e+01 2.368e+00 11.009 < 2e-16 ***
## YearBuilt 1.958e+02 3.566e+01 5.491 4.73e-08 ***
## YearRemodAdd 3.873e+02 4.679e+01 8.278 2.84e-16 ***
## MasVnrArea 2.054e+01 4.807e+00 4.273 2.06e-05 ***
## Fireplaces 7.631e+03 1.333e+03 5.726 1.25e-08 ***
## BsmtFinSF1 2.406e+01 1.971e+00 12.209 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27640 on 1430 degrees of freedom
## (8 observations deleted due to missingness)
## Multiple R-squared: 0.8507, Adjusted R-squared: 0.8497
## F-statistic: 905.1 on 9 and 1430 DF, p-value: < 2.2e-16
AIC(lm_model3)
## [1] 33552.47
# Prediction error rate: residual standard error relative to the mean SalePrice
pred_err_rate <- sigma(lm_model3)/mean(X_top_top9$SalePrice)
pred_err_rate
## [1] 0.1552224
# Residuals Analysis
plot(lm_model3$fitted.values, lm_model3$residuals, xlab='Fitted Values', ylab='Residuals')
abline(0,0, col="blue")
qqnorm(lm_model3$residuals)
qqline(lm_model3$residuals, col="red")
The model has a good adjusted R-squared of 0.8497, so it explains about 85% of the variability in the data. Its AIC of 33552.47 is the smallest of the three models tried.
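As an additional check on out-of-sample performance (not part of the submission), a sketch of a simple 80/20 holdout evaluation of the same specification; the seed and split ratio are arbitrary choices:
# 80/20 train/validation split on the cleaned data; RMSE on the held-out part
set.seed(605)
idx <- sample(nrow(X_top_top9), floor(0.8 * nrow(X_top_top9)))
fit <- lm(SalePrice ~ ., data = X_top_top9[idx, ])
pred <- predict(fit, newdata = X_top_top9[-idx, ])
sqrt(mean((X_top_top9$SalePrice[-idx] - pred)^2, na.rm = TRUE))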
# Dealing with NAs in the test set
library(dplyr)
num <- select_if(test, is.numeric)
catg <- select_if(test, is.factor)
test[c(colnames(num))][is.na(test[c(colnames(num))])] <- 0
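Zero-filling is a blunt choice; a sketch of an alternative, which would replace the zero-fill line above rather than follow it, is to impute each numeric test column with the corresponding training-set median:
# Alternative to the zero-fill above: median imputation from the training data
for (col in colnames(num)) {
  miss <- is.na(test[[col]])
  if (any(miss)) test[[col]][miss] <- median(train[[col]], na.rm = TRUE)
}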
# Predicting SalePrice using Model3
SalePrice_Pred <- predict(lm_model3, newdata=test)
results <- data.frame(test$Id, SalePrice_Pred)
colnames(results) <- c("Id", "SalePrice")
head(results)
## Id SalePrice
## 1 1461 103439.7
## 2 1462 163133.6
## 3 1463 186701.7
## 4 1464 197205.7
## 5 1465 202492.8
## 6 1466 177996.5
# Write results to a CSV file for Kaggle Submission
# write.csv(results, file = 'C:/DATA/HHP/Personal/Degrees/Ms. Data Science (CUNY)/2nd Semester/605 - Foundations of Computational Math/Assignments/Final Project/house-prices-advanced-regression-techniques/houseprices_prediction_humbertohp.csv', row.names = FALSE)
# Focus on categorical features (extra)
catg <- select_if(train, is.factor)
X_catg <- data.frame(train[c(colnames(catg))], stringsAsFactors = TRUE)
summary(X_catg)
## MSZoning Street Alley LotShape LandContour
## C (all): 10 Grvl: 6 Grvl: 50 IR1:484 Bnk: 63
## FV : 65 Pave:1454 Pave: 41 IR2: 41 HLS: 50
## RH : 16 NA's:1369 IR3: 10 Low: 36
## RL :1151 Reg:925 Lvl:1311
## RM : 218
##
##
## Utilities LotConfig LandSlope Neighborhood Condition1
## AllPub:1459 Corner : 263 Gtl:1382 NAmes :225 Norm :1260
## NoSeWa: 1 CulDSac: 94 Mod: 65 CollgCr:150 Feedr : 81
## FR2 : 47 Sev: 13 OldTown:113 Artery : 48
## FR3 : 4 Edwards:100 RRAn : 26
## Inside :1052 Somerst: 86 PosN : 19
## Gilbert: 79 RRAe : 11
## (Other):707 (Other): 15
## Condition2 BldgType HouseStyle RoofStyle RoofMatl
## Norm :1445 1Fam :1220 1Story :726 Flat : 13 CompShg:1434
## Feedr : 6 2fmCon: 31 2Story :445 Gable :1141 Tar&Grv: 11
## Artery : 2 Duplex: 52 1.5Fin :154 Gambrel: 11 WdShngl: 6
## PosN : 2 Twnhs : 43 SLvl : 65 Hip : 286 WdShake: 5
## RRNn : 2 TwnhsE: 114 SFoyer : 37 Mansard: 7 ClyTile: 1
## PosA : 1 1.5Unf : 14 Shed : 2 Membran: 1
## (Other): 2 (Other): 19 (Other): 2
## Exterior1st Exterior2nd MasVnrType ExterQual ExterCond
## VinylSd:515 VinylSd:504 BrkCmn : 15 Ex: 52 Ex: 3
## HdBoard:222 MetalSd:214 BrkFace:445 Fa: 14 Fa: 28
## MetalSd:220 HdBoard:207 None :864 Gd:488 Gd: 146
## Wd Sdng:206 Wd Sdng:197 Stone :128 TA:906 Po: 1
## Plywood:108 Plywood:142 NA's : 8 TA:1282
## CemntBd: 61 CmentBd: 60
## (Other):128 (Other):136
## Foundation BsmtQual BsmtCond BsmtExposure BsmtFinType1
## BrkTil:146 Ex :121 Fa : 45 Av :221 ALQ :220
## CBlock:634 Fa : 35 Gd : 65 Gd :134 BLQ :148
## PConc :647 Gd :618 Po : 2 Mn :114 GLQ :418
## Slab : 24 TA :649 TA :1311 No :953 LwQ : 74
## Stone : 6 NA's: 37 NA's: 37 NA's: 38 Rec :133
## Wood : 3 Unf :430
## NA's: 37
## BsmtFinType2 Heating HeatingQC CentralAir Electrical KitchenQual
## ALQ : 19 Floor: 1 Ex:741 N: 95 FuseA: 94 Ex:100
## BLQ : 33 GasA :1428 Fa: 49 Y:1365 FuseF: 27 Fa: 39
## GLQ : 14 GasW : 18 Gd:241 FuseP: 3 Gd:586
## LwQ : 46 Grav : 7 Po: 1 Mix : 1 TA:735
## Rec : 54 OthW : 2 TA:428 SBrkr:1334
## Unf :1256 Wall : 4 NA's : 1
## NA's: 38
## Functional FireplaceQu GarageType GarageFinish GarageQual
## Maj1: 14 Ex : 24 2Types : 6 Fin :352 Ex : 3
## Maj2: 5 Fa : 33 Attchd :870 RFn :422 Fa : 48
## Min1: 31 Gd :380 Basment: 19 Unf :605 Gd : 14
## Min2: 34 Po : 20 BuiltIn: 88 NA's: 81 Po : 3
## Mod : 15 TA :313 CarPort: 9 TA :1311
## Sev : 1 NA's:690 Detchd :387 NA's: 81
## Typ :1360 NA's : 81
## GarageCond PavedDrive PoolQC Fence MiscFeature
## Ex : 2 N: 90 Ex : 2 GdPrv: 59 Gar2: 2
## Fa : 35 P: 30 Fa : 2 GdWo : 54 Othr: 2
## Gd : 9 Y:1340 Gd : 3 MnPrv: 157 Shed: 49
## Po : 7 NA's:1453 MnWw : 11 TenC: 1
## TA :1326 NA's :1179 NA's:1406
## NA's: 81
##
## SaleType SaleCondition
## WD :1267 Abnorml: 101
## New : 122 AdjLand: 4
## COD : 43 Alloca : 12
## ConLD : 9 Family : 20
## ConLI : 5 Normal :1198
## ConLw : 5 Partial: 125
## (Other): 9
# Selecting 10 categorical features with the highest potential association with SalePrice (based on their factor-level distributions)
top_catg_10 <- c("LotConfig", "Neighborhood", "BldgType", "HouseStyle", "Exterior1st", "ExterQual", "ExterCond", "BsmtQual", "HeatingQC", "Foundation", "SalePrice")
X_top_catg_10 <- train[c(top_catg_10)]
#Fitting Model - Categorical only
lm_modelcat <- lm(SalePrice ~., data=X_top_catg_10)
summary(lm_modelcat)
##
## Call:
## lm(formula = SalePrice ~ ., data = X_top_catg_10)
##
## Residuals:
## Min 1Q Median 3Q Max
## -175763 -21022 -1552 16984 350059
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 399899.2 30046.2 13.309 < 2e-16 ***
## LotConfigCulDSac 12557.6 5166.3 2.431 0.015199 *
## LotConfigFR2 -22873.0 6738.7 -3.394 0.000708 ***
## LotConfigFR3 -7350.5 20953.2 -0.351 0.725788
## LotConfigInside -2879.5 2916.6 -0.987 0.323692
## NeighborhoodBlueste -20116.1 31505.8 -0.638 0.523264
## NeighborhoodBrDale -33533.5 16762.6 -2.000 0.045647 *
## NeighborhoodBrkSide -60746.3 13483.6 -4.505 7.20e-06 ***
## NeighborhoodClearCr -12839.3 14203.7 -0.904 0.366189
## NeighborhoodCollgCr -39100.0 11454.7 -3.413 0.000660 ***
## NeighborhoodCrawfor -14503.8 13069.0 -1.110 0.267287
## NeighborhoodEdwards -69759.4 12367.0 -5.641 2.06e-08 ***
## NeighborhoodGilbert -51425.7 12286.7 -4.185 3.03e-05 ***
## NeighborhoodIDOTRR -88827.8 14090.4 -6.304 3.91e-10 ***
## NeighborhoodMeadowV -66819.8 16867.3 -3.962 7.84e-05 ***
## NeighborhoodMitchel -54820.8 12926.9 -4.241 2.38e-05 ***
## NeighborhoodNAmes -47135.3 12213.1 -3.859 0.000119 ***
## NeighborhoodNoRidge 65211.0 13112.6 4.973 7.43e-07 ***
## NeighborhoodNPkVill -18122.1 18352.4 -0.987 0.323597
## NeighborhoodNridgHt 24822.7 11905.0 2.085 0.037251 *
## NeighborhoodNWAmes -25784.6 12734.4 -2.025 0.043085 *
## NeighborhoodOldTown -71008.3 12812.6 -5.542 3.59e-08 ***
## NeighborhoodSawyer -54931.9 12794.3 -4.293 1.88e-05 ***
## NeighborhoodSawyerW -38185.8 12417.1 -3.075 0.002145 **
## NeighborhoodSomerst -20887.3 11651.0 -1.793 0.073239 .
## NeighborhoodStoneBr 52008.0 13369.8 3.890 0.000105 ***
## NeighborhoodSWISU -71171.0 14840.9 -4.796 1.80e-06 ***
## NeighborhoodTimber -14208.6 13050.2 -1.089 0.276450
## NeighborhoodVeenker 1757.4 16900.7 0.104 0.917196
## BldgType2fmCon 2817.1 8251.4 0.341 0.732851
## BldgTypeDuplex 2011.5 7216.3 0.279 0.780482
## BldgTypeTwnhs -63753.1 8359.2 -7.627 4.52e-14 ***
## BldgTypeTwnhsE -55062.4 5269.3 -10.450 < 2e-16 ***
## HouseStyle1.5Unf -25776.7 12129.8 -2.125 0.033762 *
## HouseStyle1Story -14013.6 4303.5 -3.256 0.001157 **
## HouseStyle2.5Fin 65552.9 15563.0 4.212 2.70e-05 ***
## HouseStyle2.5Unf 19651.5 13425.0 1.464 0.143483
## HouseStyle2Story 4779.1 4581.3 1.043 0.297062
## HouseStyleSFoyer -26048.3 8580.1 -3.036 0.002444 **
## HouseStyleSLvl -5465.2 6628.2 -0.825 0.409783
## Exterior1stBrkComm -20284.0 43273.7 -0.469 0.639334
## Exterior1stBrkFace 54617.1 11857.9 4.606 4.49e-06 ***
## Exterior1stCBlock -8469.0 44613.8 -0.190 0.849472
## Exterior1stCemntBd 33420.8 12294.6 2.718 0.006645 **
## Exterior1stHdBoard 8984.8 10751.1 0.836 0.403466
## Exterior1stImStucc -37304.5 42691.9 -0.874 0.382378
## Exterior1stMetalSd 8069.3 10407.6 0.775 0.438279
## Exterior1stPlywood 29189.7 11254.1 2.594 0.009598 **
## Exterior1stStone 71988.9 31012.0 2.321 0.020417 *
## Exterior1stStucco 29580.2 13244.7 2.233 0.025689 *
## Exterior1stVinylSd 8526.5 10570.8 0.807 0.420035
## Exterior1stWd Sdng 14708.0 10381.6 1.417 0.156788
## Exterior1stWdShing 2468.1 13202.5 0.187 0.851734
## ExterQualFa -105551.1 15652.9 -6.743 2.29e-11 ***
## ExterQualGd -60920.5 7375.1 -8.260 3.43e-16 ***
## ExterQualTA -86231.6 8001.9 -10.776 < 2e-16 ***
## ExterCondFa -61407.9 26228.3 -2.341 0.019362 *
## ExterCondGd -36767.6 24459.4 -1.503 0.133019
## ExterCondPo -100960.3 48004.3 -2.103 0.035637 *
## ExterCondTA -43810.2 24307.9 -1.802 0.071721 .
## BsmtQualFa -92260.8 9600.1 -9.610 < 2e-16 ***
## BsmtQualGd -62135.6 5233.8 -11.872 < 2e-16 ***
## BsmtQualTA -81134.9 6232.9 -13.017 < 2e-16 ***
## HeatingQCFa -24386.6 6959.8 -3.504 0.000473 ***
## HeatingQCGd -8995.0 3467.3 -2.594 0.009582 **
## HeatingQCPo -166.8 43065.9 -0.004 0.996910
## HeatingQCTA -12047.3 3318.9 -3.630 0.000294 ***
## FoundationCBlock 7859.0 4939.8 1.591 0.111850
## FoundationPConc 15074.2 5360.3 2.812 0.004991 **
## FoundationStone 12548.5 17506.9 0.717 0.473638
## FoundationWood 2549.8 24452.5 0.104 0.916964
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40740 on 1352 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.7496, Adjusted R-squared: 0.7367
## F-statistic: 57.83 on 70 and 1352 DF, p-value: < 2.2e-16
This categorical-only model is noticeably weaker, with an adjusted R-squared of 0.7367; it explains roughly 74% of the variability in the data.
The model used for the submission was Model 3, built only from continuous features. This was done for simplicity and to demonstrate two modeling techniques: 1) identifying and including the most relevant features by ranking their correlation with SalePrice, and 2) identifying, validating, and removing outliers that affect model fit and performance using Cook's distance. A better model would likely combine the most significant numeric and categorical features; due to time constraints, that is left as a separate follow-up project.
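As a pointer for that follow-up, a minimal sketch of such a combined fit; the categorical columns chosen here are illustrative assumptions, not a tuned selection.
# Sketch only: the nine numeric predictors retained above plus a few categorical features
combined <- cbind(train[top_top9], train[c("Neighborhood", "ExterQual", "BsmtQual")])
lm_combined <- lm(SalePrice ~ ., data = combined)
summary(lm_combined)$adj.r.squared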