Problem 1

Using R, generate a random variable \(X\) that has 10,000 random uniform numbers from 1 to \(N\), where \(N\) can be any number of your choosing greater than or equal to 6. Then generate a random variable \(Y\) that has 10,000 random normal numbers with a mean of \(\mu=\sigma=(N+1)/2\)

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the median of the \(X\) variable, and the small letter “y” is estimated as the 1st quartile of the \(Y\) variable.

set.seed(87)
X <- runif(10000, min=1, max=6)
Y <- rnorm(10000, mean=7/2, sd=7/2)

par(mfrow = c(1, 2))
hist(X)
hist(Y)

x <- median(X)
y <-quantile(Y,0.25)

x_probability_greater <- sum(X > x) / length(X)
x_probability_leq <- 1 - x_probability_greater
y_probability_greater <- sum(Y > y) / length(Y)
y_probability_leq <- 1 - y_probability_greater

x_count_greater <- sum(X > x)
x_count_leq <- length(X) - x_count_greater
y_count_greater <- sum(Y > y)  
y_count_leq <- length(Y) - y_count_greater

\[P(A \space and \space B) = P(A) * P(B|A)\]

Interpret

Interpret the meaning of all probabilities

\(a. P(X>x | Y>y)\)

a_probability <- 
  (x_probability_greater * y_probability_greater) / y_probability_greater
a_probability
## [1] 0.5

The probability of > x given > y 0.5

\(b. P(X>x, Y>y)\)

b_probability <- 
  (x_probability_greater) * 
  ((x_probability_greater * y_probability_greater) / x_probability_greater)
b_probability
## [1] 0.375

The probability of > x and > y 0.375

\(c. P(X<x | Y>y)\)

c_probability <- 
  (x_probability_leq * y_probability_greater) / y_probability_greater
c_probability
## [1] 0.5

The probability of < x given > y 0.5

Investigate

Investigate whether \(P(X>x \space and \space Y>y)=P(X>x) * P(Y>y)\) by building a table and evaluating the marginal and joint probabilities

#P(X>x and Y>y)
b_probability
## [1] 0.375
#P(X>x)
x_probability_greater
## [1] 0.5
#PY>y)
y_probability_greater
## [1] 0.75
Are_Probabilities_Equal <- (b_probability == (x_probability_greater * y_probability_greater))
Are_Probabilities_Equal
## [1] TRUE

Since the probabilities are equal, X and Y are independent events

Fisher’s Exact and Chi Square tests

Check to see if independence holds by using Fishers Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

#table of count
t <- cbind(X,Y)

table_of_count <- matrix(c(
  nrow(subset(t, X > x & Y > y)),
  nrow(subset(t, X <= x & Y > y)),
  
  nrow(subset(t, X > x & Y < y)),
  nrow(subset(t, X <= x & Y <= y))),
  
  nrow=2)

colnames(table_of_count) <- c(">1st quantile","<=1st quantile")
rownames(table_of_count) <- c(">mean","<=mean")
  
table_of_count
##        >1st quantile <=1st quantile
## >mean           3714           1286
## <=mean          3786           1214
#Fishers Exact Test
fisher.test(table_of_count)
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table_of_count
## p-value = 0.1011
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.8449983 1.0149039
## sample estimates:
## odds ratio 
##  0.9260742
#Chi Square Test
chisq.test(table_of_count)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_of_count
## X-squared = 2.6885, df = 1, p-value = 0.1011

Since the p-values are greater than 0.05, it suggests no association between X and Y.

Problem 2

library(tidyr)
library(dplyr)
library(corrplot)
train <- read.csv('https://raw.githubusercontent.com/L-Velasco/DATA605_SP19/master/Final/train.csv', header=TRUE, sep=",", stringsAsFactors = FALSE)

Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval.Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

Training Data Set Statistics

dim(train)
## [1] 1460   81
summary(train)
##        Id           MSSubClass      MSZoning          LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   Length:1460        Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   Class :character   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   Mode  :character   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9                      Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0                      3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                      Max.   :313.00  
##                                                      NA's   :259     
##     LotArea          Street             Alley             LotShape        
##  Min.   :  1300   Length:1460        Length:1460        Length:1460       
##  1st Qu.:  7554   Class :character   Class :character   Class :character  
##  Median :  9478   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10517                                                           
##  3rd Qu.: 11602                                                           
##  Max.   :215245                                                           
##                                                                           
##  LandContour         Utilities          LotConfig        
##  Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   LandSlope         Neighborhood        Condition1       
##  Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   Condition2          BldgType          HouseStyle         OverallQual    
##  Length:1460        Length:1460        Length:1460        Min.   : 1.000  
##  Class :character   Class :character   Class :character   1st Qu.: 5.000  
##  Mode  :character   Mode  :character   Mode  :character   Median : 6.000  
##                                                           Mean   : 6.099  
##                                                           3rd Qu.: 7.000  
##                                                           Max.   :10.000  
##                                                                           
##   OverallCond      YearBuilt     YearRemodAdd   RoofStyle        
##  Min.   :1.000   Min.   :1872   Min.   :1950   Length:1460       
##  1st Qu.:5.000   1st Qu.:1954   1st Qu.:1967   Class :character  
##  Median :5.000   Median :1973   Median :1994   Mode  :character  
##  Mean   :5.575   Mean   :1971   Mean   :1985                     
##  3rd Qu.:6.000   3rd Qu.:2000   3rd Qu.:2004                     
##  Max.   :9.000   Max.   :2010   Max.   :2010                     
##                                                                  
##    RoofMatl         Exterior1st        Exterior2nd       
##  Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##   MasVnrType          MasVnrArea      ExterQual          ExterCond        
##  Length:1460        Min.   :   0.0   Length:1460        Length:1460       
##  Class :character   1st Qu.:   0.0   Class :character   Class :character  
##  Mode  :character   Median :   0.0   Mode  :character   Mode  :character  
##                     Mean   : 103.7                                        
##                     3rd Qu.: 166.0                                        
##                     Max.   :1600.0                                        
##                     NA's   :8                                             
##   Foundation          BsmtQual           BsmtCond        
##  Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##  BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median : 383.5   Mode  :character  
##                                        Mean   : 443.6                     
##                                        3rd Qu.: 712.2                     
##                                        Max.   :5644.0                     
##                                                                           
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF       Heating         
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Length:1460       
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   Class :character  
##  Median :   0.00   Median : 477.5   Median : 991.5   Mode  :character  
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4                     
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2                     
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0                     
##                                                                        
##   HeatingQC          CentralAir         Electrical          X1stFlrSF   
##  Length:1460        Length:1460        Length:1460        Min.   : 334  
##  Class :character   Class :character   Class :character   1st Qu.: 882  
##  Mode  :character   Mode  :character   Mode  :character   Median :1087  
##                                                           Mean   :1163  
##                                                           3rd Qu.:1391  
##                                                           Max.   :4692  
##                                                                         
##    X2ndFlrSF     LowQualFinSF       GrLivArea     BsmtFullBath   
##  Min.   :   0   Min.   :  0.000   Min.   : 334   Min.   :0.0000  
##  1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000  
##  Median :   0   Median :  0.000   Median :1464   Median :0.0000  
##  Mean   : 347   Mean   :  5.845   Mean   :1515   Mean   :0.4253  
##  3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000  
##  Max.   :2065   Max.   :572.000   Max.   :5642   Max.   :3.0000  
##                                                                  
##   BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr  
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.00000   Median :2.000   Median :0.0000   Median :3.000  
##  Mean   :0.05753   Mean   :1.565   Mean   :0.3829   Mean   :2.866  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :3.000   Max.   :2.0000   Max.   :8.000  
##                                                                    
##   KitchenAbvGr   KitchenQual         TotRmsAbvGrd     Functional       
##  Min.   :0.000   Length:1460        Min.   : 2.000   Length:1460       
##  1st Qu.:1.000   Class :character   1st Qu.: 5.000   Class :character  
##  Median :1.000   Mode  :character   Median : 6.000   Mode  :character  
##  Mean   :1.047                      Mean   : 6.518                     
##  3rd Qu.:1.000                      3rd Qu.: 7.000                     
##  Max.   :3.000                      Max.   :14.000                     
##                                                                        
##    Fireplaces    FireplaceQu         GarageType         GarageYrBlt  
##  Min.   :0.000   Length:1460        Length:1460        Min.   :1900  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.:1961  
##  Median :1.000   Mode  :character   Mode  :character   Median :1980  
##  Mean   :0.613                                         Mean   :1979  
##  3rd Qu.:1.000                                         3rd Qu.:2002  
##  Max.   :3.000                                         Max.   :2010  
##                                                        NA's   :81    
##  GarageFinish         GarageCars      GarageArea      GarageQual       
##  Length:1460        Min.   :0.000   Min.   :   0.0   Length:1460       
##  Class :character   1st Qu.:1.000   1st Qu.: 334.5   Class :character  
##  Mode  :character   Median :2.000   Median : 480.0   Mode  :character  
##                     Mean   :1.767   Mean   : 473.0                     
##                     3rd Qu.:2.000   3rd Qu.: 576.0                     
##                     Max.   :4.000   Max.   :1418.0                     
##                                                                        
##   GarageCond         PavedDrive          WoodDeckSF      OpenPorchSF    
##  Length:1460        Length:1460        Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Median :  0.00   Median : 25.00  
##                                        Mean   : 94.24   Mean   : 46.66  
##                                        3rd Qu.:168.00   3rd Qu.: 68.00  
##                                        Max.   :857.00   Max.   :547.00  
##                                                                         
##  EnclosedPorch      X3SsnPorch      ScreenPorch        PoolArea      
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.00   Median :  0.00   Median :  0.000  
##  Mean   : 21.95   Mean   :  3.41   Mean   : 15.06   Mean   :  2.759  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :552.00   Max.   :508.00   Max.   :480.00   Max.   :738.000  
##                                                                      
##     PoolQC             Fence           MiscFeature       
##  Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##                                                          
##     MiscVal             MoSold           YrSold       SaleType        
##  Min.   :    0.00   Min.   : 1.000   Min.   :2006   Length:1460       
##  1st Qu.:    0.00   1st Qu.: 5.000   1st Qu.:2007   Class :character  
##  Median :    0.00   Median : 6.000   Median :2008   Mode  :character  
##  Mean   :   43.49   Mean   : 6.322   Mean   :2008                     
##  3rd Qu.:    0.00   3rd Qu.: 8.000   3rd Qu.:2009                     
##  Max.   :15500.00   Max.   :12.000   Max.   :2010                     
##                                                                       
##  SaleCondition        SalePrice     
##  Length:1460        Min.   : 34900  
##  Class :character   1st Qu.:129975  
##  Mode  :character   Median :163000  
##                     Mean   :180921  
##                     3rd Qu.:214000  
##                     Max.   :755000  
## 

Scatterplot of 2 independent and dependent variables

Independent Variables:

    1. LotArea : Lot size in square feet
    1. GrLivArea : Above grade (ground) living area square feet

Dependent Variable:

    1. SalePrice : Sale Price of the House
plot(train$LotArea, train$SalePrice,
     main = "LotArea vs SalePrice",
     xlab = "LotArea",
     ylab = "SalePrice")

plot(train$GrLivArea, train$SalePrice,
     main = "GrLivArea vs SalePrice",
     xlab = "OverallQuad",
     ylab = "SalePrice")

Correlation matrix for 3 quantitative variables

Quantitative variables

    1. OverallQual : Rates the overall material and finish of the house
    1. GarageArea : Size of garage in square feet
    1. SalePrice : Sale Price of the House
cor_matrix <- cor(train[c("OverallQual","GarageArea","SalePrice")])
cor_matrix
##             OverallQual GarageArea SalePrice
## OverallQual   1.0000000  0.5620218 0.7909816
## GarageArea    0.5620218  1.0000000 0.6234314
## SalePrice     0.7909816  0.6234314 1.0000000

Hypotheses

  • \(H_0\) = There is 0 correlation between each pairwise variables
  • \(H_A\) = There is correlation between each pairwise variables
cor.test(train$OverallQual, train$SalePrice, conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train$OverallQual and train$SalePrice
## t = 49.364, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.7780752 0.8032204
## sample estimates:
##       cor 
## 0.7909816
cor.test(train$GarageArea, train$SalePrice, conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train$GarageArea and train$SalePrice
## t = 30.446, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.6024756 0.6435283
## sample estimates:
##       cor 
## 0.6234314
cor.test(train$OverallQual, train$GarageArea, conf.level = 0.80)
## 
##  Pearson's product-moment correlation
## 
## data:  train$OverallQual and train$GarageArea
## t = 25.946, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5386198 0.5845573
## sample estimates:
##       cor 
## 0.5620218

Since the p-values are less than 0.05 for each pairwise variables, we can reject that null hypohethis that there is no correlation between them. Looking at the correlation matrix, there is relatively strong correlation between them.

Linear Algebra and Correlation

Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

Invert correlation matrix

cor_matrix_inv <- solve(cor_matrix);cor_matrix_inv
##             OverallQual GarageArea SalePrice
## OverallQual   2.7278952 -0.3074414 -1.966046
## GarageArea   -0.3074414  1.6704186 -0.798211
## SalePrice    -1.9660463 -0.7982110  3.052736

Multiply matrices

cor_matrix %*% cor_matrix_inv
##               OverallQual GarageArea SalePrice
## OverallQual  1.000000e+00          0         0
## GarageArea  -2.220446e-16          1         0
## SalePrice   -4.440892e-16          0         1
cor_matrix_inv %*% cor_matrix
##             OverallQual GarageArea     SalePrice
## OverallQual           1          0 -2.220446e-16
## GarageArea            0          1  0.000000e+00
## SalePrice             0          0  1.000000e+00

Conduct LU decomposition

library(Matrix)

lu_matrix <- expand(lu(cor_matrix_inv))
lu_matrix$L;lu_matrix$U
## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
##      [,1]       [,2]       [,3]      
## [1,]  1.0000000          .          .
## [2,] -0.1127028  1.0000000          .
## [3,] -0.7207191 -0.6234314  1.0000000
## 3 x 3 Matrix of class "dtrMatrix"
##      [,1]       [,2]       [,3]      
## [1,]  2.7278952 -0.3074414 -1.9660463
## [2,]          .  1.6357691 -1.0197899
## [3,]          .          .  1.0000000
lu_matrix$L %*% lu_matrix$U
## 3 x 3 Matrix of class "dgeMatrix"
##            [,1]       [,2]      [,3]
## [1,]  2.7278952 -0.3074414 -1.966046
## [2,] -0.3074414  1.6704186 -0.798211
## [3,] -1.9660463 -0.7982110  3.052736

Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000,\(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

Select and shift a skewed variable

summary(train$TotalBsmtSF)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   795.8   991.5  1057.4  1298.2  6110.0
hist(train$TotalBsmtSF)

s <- train$TotalBsmtSF + 1
summary(s)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   796.8   992.5  1058.4  1299.2  6111.0
hist(s)

Exponential PDF and optimal value for lambda

library(MASS)
fit <- fitdistr(s, "exponential")
lambda <- fit$estimate
lambda
##         rate 
## 0.0009447961
s_samples <- rexp(1000, lambda)

Histograms

par(mfrow = c(1, 2))
hist(s_samples, main="Exponential")
hist(train$TotalBsmtSF, main="Empirical")

Percentiles and Confidence Interval

library(Rmisc)

#exponential
pct <- c(0.05, 0.95)
qexp(pct, rate = lambda)
## [1]   54.29033 3170.77127
#empirical confidence interval
CI(train$TotalBsmtSF, 0.95) #t.test(train$TotalBsmtSF)$conf.int[1:2]
##    upper     mean    lower 
## 1079.951 1057.429 1034.908
#empirical
quantile(train$TotalBsmtSF, pct)
##     5%    95% 
##  519.3 1753.0

Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

Stepwise Regression

train_numeric <- select_if(train, is.numeric)
train_numeric_complete <- train_numeric[complete.cases(train_numeric),]

dim(train_numeric)
## [1] 1460   38
dim(train_numeric_complete)
## [1] 1121   38
id <- c(1:37)
for (val in id) {
  qnt <- quantile(train_numeric_complete[,val], probs=c(.25, .75), na.rm = T)
  caps <- quantile(train_numeric_complete[,val], probs=c(.05, .95), na.rm = T)
  H <- 1.5 * IQR(train_numeric_complete[,val], na.rm = T)
  train_numeric_complete[,val][train_numeric_complete[,val] < (qnt[1] - H)] <- caps[1]
  train_numeric_complete[,val][train_numeric_complete[,val] > (qnt[2] + H)] <- caps[2]
}

# Fit the full model 
full.model <- lm(log(SalePrice) ~., data = train_numeric_complete)

# Stepwise regression model
step.model <- stepAIC(full.model, direction = "both", 
                      trace = FALSE)

summary(step.model)
## 
## Call:
## lm(formula = log(SalePrice) ~ MSSubClass + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + BsmtFinSF1 + BsmtUnfSF + 
##     TotalBsmtSF + X2ndFlrSF + GrLivArea + FullBath + HalfBath + 
##     Fireplaces + GarageYrBlt + GarageCars + OpenPorchSF + EnclosedPorch + 
##     ScreenPorch + YrSold, data = train_numeric_complete)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.09569 -0.06036  0.00431  0.07190  0.43614 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.146e+01  6.049e+00   1.895  0.05840 .  
## MSSubClass    -3.634e-04  1.279e-04  -2.841  0.00458 ** 
## LotArea        1.024e-05  1.635e-06   6.263 5.42e-10 ***
## OverallQual    8.139e-02  5.118e-03  15.904  < 2e-16 ***
## OverallCond    5.967e-02  5.319e-03  11.218  < 2e-16 ***
## YearBuilt      2.375e-03  3.037e-04   7.818 1.25e-14 ***
## YearRemodAdd   1.283e-03  3.032e-04   4.232 2.50e-05 ***
## BsmtFinSF1     6.268e-05  2.336e-05   2.683  0.00741 ** 
## BsmtUnfSF     -5.016e-05  2.297e-05  -2.184  0.02920 *  
## TotalBsmtSF    2.332e-04  3.301e-05   7.065 2.84e-12 ***
## X2ndFlrSF      5.032e-05  2.360e-05   2.132  0.03319 *  
## GrLivArea      1.971e-04  2.378e-05   8.289 3.29e-16 ***
## FullBath       2.450e-02  1.223e-02   2.004  0.04532 *  
## HalfBath       2.735e-02  1.199e-02   2.280  0.02277 *  
## Fireplaces     4.049e-02  7.929e-03   5.107 3.86e-07 ***
## GarageYrBlt    7.980e-04  3.104e-04   2.571  0.01028 *  
## GarageCars     4.491e-02  9.378e-03   4.789 1.90e-06 ***
## OpenPorchSF    1.653e-04  8.805e-05   1.878  0.06067 .  
## EnclosedPorch  1.868e-04  7.238e-05   2.580  0.01000 *  
## ScreenPorch    2.590e-04  9.148e-05   2.831  0.00472 ** 
## YrSold        -4.919e-03  3.014e-03  -1.632  0.10293    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1336 on 1100 degrees of freedom
## Multiple R-squared:  0.8886, Adjusted R-squared:  0.8866 
## F-statistic: 438.7 on 20 and 1100 DF,  p-value: < 2.2e-16

Diagnostic Plots

It appears that we have a decent model passing linearity assumptions

layout(matrix(c(1,2,3,4),2,2)) # optional 4 graphs/page 
plot(step.model)

Predict SalePrice

test <- read.csv('https://raw.githubusercontent.com/L-Velasco/DATA605_SP19/master/Final/test.csv', header=TRUE, sep=",", stringsAsFactors = FALSE)

dim(test)
## [1] 1459   80
testSalePrice <- predict(step.model, test, type = "response")

test$SalePrice <- exp(testSalePrice)

test[is.na(test)] <- mean(test$SalePrice, na.rm = TRUE)

submitTest <- subset(test, select = c("Id", "SalePrice"))

# Export
 # write.csv(submitTest,file="LV_submit.csv")

Kaggle Submission

Kaggle

Kaggle