Data605: Final

Final Exam

Probability Density 1: X~Gamma. Using R, generate a random variable X that has 10,000 random Gamma pdf values. A Gamma pdf is completely describe by n (a size parameter) and lambda (λ , a shape parameter). Choose any n greater 3 and an expected value (λ) between 2 and 10 (you choose).

*Note: Assuming in the above prompt that the n parameter is supposed to refer to the shape (alpha) and the lambda is referencing the rate (beta)

Probability Density 2: Y~Sum of Exponentials. Then generate 10,000 observations from the sum of n exponential pdfs with rate/shape parameter (λ). The n and λ must be the same as in the previous case. (e.g., mysum=rexp(10000,λ)+rexp(10000,λ)+..)

Probability Density 3: Z~ Exponential. Then generate 10,000 observations from a single exponential pdf with rate/shape parameter (λ).

set.seed(45)
lambda <- 6
alpha <- 4
n <-10000
X <- rgamma(n,alpha,lambda)
Y <- rexp(n,lambda)+rexp(n,lambda)+rexp(n,lambda)+rexp(n,lambda)
Z<- rexp(n,lambda)

1a: Calculate the empirical expected value (means) and variances of all three pdfs.

sprintf('Gamma Distribution: The mean of the X random variables is: %f, and the variance is: %f',mean(X),var(X))

## [1] "Gamma Distribution: The mean of the X random variables is: 0.664633, and the variance is: 0.114240"

#alpha/lambda,alpha/lambda^2

sprintf('Sum of Exponential Distribution: The mean of the X random variables is: %f, and the variance is: %f',mean(Y),var(Y))

## [1] "Sum of Exponential Distribution: The mean of the X random variables is: 0.666293, and the variance is: 0.109779"

#alpha/lambda,alpha/lambda^2
sprintf('Exponential Distribution: The mean of the X random variables is: %f, and the variance is: %f',mean(Z),var(Z))

## [1] "Exponential Distribution: The mean of the X random variables is: 0.166306, and the variance is: 0.027380"

#1/lambda,2/lambda^2

1b. Using calculus, calculate the expected value and variance of the Gamma pdf (X).

Mean of Gamma Distribution Proof

$$ \[\begin{aligned} \int_{0}^\infty x * fx(x) dx \\ \int_{0}^{\infty} x * \frac{\lambda e^{-\lambda x} \lambda x^{\alpha -1}}{(\alpha -1)!} dx \\ \int_{0}^{\infty} \frac{ e^{-\lambda x} \lambda x^{(\alpha+1) -1}}{(\alpha -1)!} dx \\ \frac{(\alpha +1 -1)!}{(\alpha - 1)! \lambda} \int_{0}^{\infty} \frac{ \lambda e^{-\lambda x} \lambda x^{(\alpha+1) -1}}{(\alpha +1-1)!} dx\\ \frac{(\alpha)!}{(\alpha - 1)! \lambda} \int_{0}^{\infty} \frac{ \lambda e^{-\lambda x} \lambda x^{\alpha}}{(\alpha)!} dx \\ \Gamma(\alpha +1 , \lambda) = \int f(x) dx = 1 \\ \frac{\alpha}{\lambda} \end{aligned}\]

sprintf('Expected Value approximation: %f, which is very close to the empirical calculation',alpha/lambda)

## [1] "Expected Value approximation: 0.666667, which is very close to the empirical calculation"

Variance of Gamma Distribution Proof

$$ \[\begin{aligned} Var(x)=E(x^2) - (E(x))^2 \int_{0}^\infty x^2 * fx(x) dx \\ \int_{0}^{\infty} x^2 * \frac{\lambda e^{-\lambda x} \lambda x^{\alpha -1}}{(\alpha -1)!} dx \\ \int_{0}^{\infty} \frac{ e^{-\lambda x} \lambda^{\alpha-1} x^{(\alpha+2) -1}}{(\alpha -1)!} dx \\ \frac{(\alpha +2 -1)!}{(\alpha - 1)! \lambda^2} \int_{0}^{\infty} \frac{ e^{-\lambda x} \lambda^{\alpha-1} \lambda^2 x^{(\alpha+2) -1}}{(\alpha +2-1)!} dx\\ \frac{(\alpha+2-1)!}{(\alpha - 1)! \lambda^2} \int_{0}^{\infty} \Gamma(\alpha+2,\lambda) dx \\ \frac{(\alpha+2-1)!}{(\alpha - 1)! \lambda^2} * 1 = \frac{(\alpha+1 )\alpha}{ \lambda^2} Var(x) = \frac{(\alpha+1 )\alpha}{ \lambda^2} - (\frac{\alpha}{\lambda})^2 \\ Var(x) = \frac{\alpha^2+\alpha - \alpha^2}{ \lambda^2} = \frac{\alpha}{\lambda^2} \end{aligned}\]

sprintf('Variance approximation: %f, which is very close to the empirical calculation',alpha/lambda^2)

## [1] "Variance approximation: 0.111111, which is very close to the empirical calculation"

Using the moment generating function for exponentials, calculate the expected value of the single exponential (Z) and the sum of exponentials (Y)

$$ \[\begin{aligned} M(t) = E[e^{tX}] \\ \int_{0}^{\infty} e^{tx}\lambda e^{-\lambda x} dx \\ \lambda \int_{0}^{\infty} e^{t-\lambda x} dx \\ \frac{\lambda}{t- \lambda } \left. e^{(t-\lambda x)} \right|_{0}^{\infty} \\ \frac{\lambda}{\lambda - t} \\ E(X^n) = M^n_X(0) = \frac{d^n M_x}{dt^n}(0) \\ dt = \frac{\lambda}{(\lambda - t)^2} \end{aligned}\]

Taking the first moment of the function using the derivative calculated above for the first moment:

sprintf('The expected value of the first moment is: %f (very close approximation of other methods)',lambda/(lambda - 0)^2)

## [1] "The expected value of the first moment is: 0.166667 (very close approximation of other methods)"

Using this source as a reference for the sum of exponential moment generating function:

$M_{x_1+x_2}= \frac{\lambda_1 \lambda_2}{(\lambda_1 -t)(\lambda_2 -t)}$

In this case the lambda is the same value across the different random variables so we can take the nth power (alpha times) the moment function

$$ \[\begin{aligned} (\frac{\lambda}{\lambda - t})^\alpha = (\frac{6}{6 - t})^4 \\ M'(t) = \frac{4*6^4}{(6-t)^5} \end{aligned}\]

sprintf('The expected value of the first moment is: %f (very close approximation of other methods)',(4*lambda^4)/(lambda - 0)^5)

## [1] "The expected value of the first moment is: 0.666667 (very close approximation of other methods)"

1c-e. Probability. For pdf Z (the exponential), calculate empirically probabilities a through c. Then evaluate through calculus whether the memoryless property holds. a. P(Z> λ| Z> λ/2) b. P(Z> 2λ | Z> λ) b. P(Z>3λ | Z> λ)

$$ \[\begin{aligned} P(Z>\lambda|Z>\lambda/2) = \frac{P(Z\leq \lambda \cap Z>\lambda/2 )}{P( Z>\lambda/2)} = \frac{P(\lambda/2 < Z \leq \lambda)}{P( Z>\lambda/2)} \\ \frac{e^{\frac{\lambda}{2}}-e^\lambda}{e^{\frac{\lambda}{2}}} \end{aligned}\]

Empirical vs CDF of Exponential

(exp(-lambda/2) - exp(-lambda))/exp(-lambda/2)

## [1] 0.9502129

1-exp(-lambda/2)

## [1] 0.9502129

(exp(-lambda) - exp(-2*lambda))/exp(-lambda)

## [1] 0.9975212

1-exp(-lambda)

## [1] 0.9975212

(exp(-lambda) - exp(-3*lambda))/exp(-lambda)

## [1] 0.9999939

1-exp(-2*lambda)

## [1] 0.9999939

Loosely investigate whether P(YZ) = P(Y) P(Z) by building a table with quartiles and evaluating the marginal and joint probabilities.

y_quant <- quantile(Y,probs=c(0.25,0.5,0.75,1))
z_quant <- quantile(Z,probs=c(0.25,0.5,0.75,1))

comb_prob <- as.matrix(z_quant) %*% t(as.matrix(y_quant))
prob_tbl <- rbind(cbind(comb_prob/sum(comb_prob),c(t(t(rowSums(comb_prob/sum(comb_prob)))),1)),colSums(comb_prob/sum(comb_prob)))

## Warning in cbind(comb_prob/sum(comb_prob), c(t(t(rowSums(comb_prob/
## sum(comb_prob)))), : number of rows of result is not a multiple of vector length
## (arg 2)

## Warning in rbind(cbind(comb_prob/sum(comb_prob), c(t(t(rowSums(comb_prob/
## sum(comb_prob)))), : number of columns of result is not a multiple of vector
## length (arg 2)

#validation
for (x in 1:4){
    print(sum(prob_tbl[x,1:4])==prob_tbl[x,5])
    print(sum(prob_tbl[1:4,x])==prob_tbl[5,x])
}

## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] TRUE

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

ind_test <- comb_prob
fisher.test(ind_test)

## Warning in fisher.test(ind_test): 'x' has been rounded to integer: Mean relative
## difference: 0.2727586

## 
##  Fisher's Exact Test for Count Data
## 
## data:  ind_test
## p-value = 1
## alternative hypothesis: two.sided

chisq.test(ind_test)

## Warning in chisq.test(ind_test): Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  ind_test
## X-squared = 7.6258e-32, df = 9, p-value = 1

frmt_mat <- cbind(t(t(y_quant)),t(t(z_quant)))
fisher.test(frmt_mat)

## Warning in fisher.test(frmt_mat): 'x' has been rounded to integer: Mean relative
## difference: 0.3324803

## 
##  Fisher's Exact Test for Count Data
## 
## data:  frmt_mat
## p-value = 1
## alternative hypothesis: two.sided

chisq.test(frmt_mat)

## Warning in chisq.test(frmt_mat): Chi-squared approximation may be incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  frmt_mat
## X-squared = 0.32992, df = 3, p-value = 0.9543

Both tests indicate based on the p-value that you fail to reject the null hypothesis and the two distributions appear to be independent on both the quantiles merged together and the joint probability table. Both were used as the requested test’s are designed to compare two variables against one another for independence to see if a slightly alternative input would impact the results. Fisher’s Exact Test determines relationships that exist between two categorical variables that are not explained by randomness; however, the initial test was designed to assess 2 x 2 contingency tables and can often be computationally expensive with larger matrices. The Chi Square Test is also used for this same purpose and does not have the same potential performance constraints which is why it is more appropriate in this case to evaluate the question on independence.

Kaggle Housing Price Predictions

Input Data

train_df <- read.csv('https://raw.githubusercontent.com/jforster19/Data605/main/House_Prices_Prediction_train.csv')
head(train_df)

##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1  1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
## 2  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
## 3  3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
## 4  4         70       RL          60    9550   Pave  <NA>      IR1         Lvl
## 5  5         60       RL          84   14260   Pave  <NA>      IR1         Lvl
## 6  6         50       RL          85   14115   Pave  <NA>      IR1         Lvl
##   Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 2    AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
## 3    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 4    AllPub    Corner       Gtl      Crawfor       Norm       Norm     1Fam
## 5    AllPub       FR2       Gtl      NoRidge       Norm       Norm     1Fam
## 6    AllPub    Inside       Gtl      Mitchel       Norm       Norm     1Fam
##   HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1     2Story           7           5      2003         2003     Gable  CompShg
## 2     1Story           6           8      1976         1976     Gable  CompShg
## 3     2Story           7           5      2001         2002     Gable  CompShg
## 4     2Story           7           5      1915         1970     Gable  CompShg
## 5     2Story           8           5      2000         2000     Gable  CompShg
## 6     1.5Fin           5           5      1993         1995     Gable  CompShg
##   Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1     VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
## 2     MetalSd     MetalSd       None          0        TA        TA     CBlock
## 3     VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
## 4     Wd Sdng     Wd Shng       None          0        TA        TA     BrkTil
## 5     VinylSd     VinylSd    BrkFace        350        Gd        TA      PConc
## 6     VinylSd     VinylSd       None          0        TA        TA       Wood
##   BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1       Gd       TA           No          GLQ        706          Unf
## 2       Gd       TA           Gd          ALQ        978          Unf
## 3       Gd       TA           Mn          GLQ        486          Unf
## 4       TA       Gd           No          ALQ        216          Unf
## 5       Gd       TA           Av          GLQ        655          Unf
## 6       Gd       TA           No          GLQ        732          Unf
##   BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1          0       150         856    GasA        Ex          Y      SBrkr
## 2          0       284        1262    GasA        Ex          Y      SBrkr
## 3          0       434         920    GasA        Ex          Y      SBrkr
## 4          0       540         756    GasA        Gd          Y      SBrkr
## 5          0       490        1145    GasA        Ex          Y      SBrkr
## 6          0        64         796    GasA        Ex          Y      SBrkr
##   X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1       856       854            0      1710            1            0        2
## 2      1262         0            0      1262            0            1        2
## 3       920       866            0      1786            1            0        2
## 4       961       756            0      1717            1            0        1
## 5      1145      1053            0      2198            1            0        2
## 6       796       566            0      1362            1            0        1
##   HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1        1            3            1          Gd            8        Typ
## 2        0            3            1          TA            6        Typ
## 3        1            3            1          Gd            6        Typ
## 4        0            3            1          Gd            7        Typ
## 5        1            4            1          Gd            9        Typ
## 6        1            1            1          TA            5        Typ
##   Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1          0        <NA>     Attchd        2003          RFn          2
## 2          1          TA     Attchd        1976          RFn          2
## 3          1          TA     Attchd        2001          RFn          2
## 4          1          Gd     Detchd        1998          Unf          3
## 5          1          TA     Attchd        2000          RFn          3
## 6          0        <NA>     Attchd        1993          Unf          2
##   GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1        548         TA         TA          Y          0          61
## 2        460         TA         TA          Y        298           0
## 3        608         TA         TA          Y          0          42
## 4        642         TA         TA          Y          0          35
## 5        836         TA         TA          Y        192          84
## 6        480         TA         TA          Y         40          30
##   EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1             0          0           0        0   <NA>  <NA>        <NA>
## 2             0          0           0        0   <NA>  <NA>        <NA>
## 3             0          0           0        0   <NA>  <NA>        <NA>
## 4           272          0           0        0   <NA>  <NA>        <NA>
## 5             0          0           0        0   <NA>  <NA>        <NA>
## 6             0        320           0        0   <NA> MnPrv        Shed
##   MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1       0      2   2008       WD        Normal    208500
## 2       0      5   2007       WD        Normal    181500
## 3       0      9   2008       WD        Normal    223500
## 4       0      2   2006       WD       Abnorml    140000
## 5       0     12   2008       WD        Normal    250000
## 6     700     10   2009       WD        Normal    143000

Review generic summary of data columns

summary(train_df)

##        Id           MSSubClass      MSZoning          LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   Length:1460        Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   Class :character   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   Mode  :character   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9                      Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0                      3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                      Max.   :313.00  
##                                                      NA's   :259     
##     LotArea          Street             Alley             LotShape        
##  Min.   :  1300   Length:1460        Length:1460        Length:1460       
##  1st Qu.:  7554   Class :character   Class :character   Class :character  
##  Median :  9478   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10517                                                           
##  3rd Qu.: 11602                                                           
##  Max.   :215245                                                           
##                                                                           
##  LandContour         Utilities          LotConfig          LandSlope        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Neighborhood        Condition1         Condition2          BldgType        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   HouseStyle         OverallQual      OverallCond      YearBuilt   
##  Length:1460        Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  Class :character   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
##  Mode  :character   Median : 6.000   Median :5.000   Median :1973  
##                     Mean   : 6.099   Mean   :5.575   Mean   :1971  
##                     3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000  
##                     Max.   :10.000   Max.   :9.000   Max.   :2010  
##                                                                    
##   YearRemodAdd   RoofStyle           RoofMatl         Exterior1st       
##  Min.   :1950   Length:1460        Length:1460        Length:1460       
##  1st Qu.:1967   Class :character   Class :character   Class :character  
##  Median :1994   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1985                                                           
##  3rd Qu.:2004                                                           
##  Max.   :2010                                                           
##                                                                         
##  Exterior2nd         MasVnrType          MasVnrArea      ExterQual        
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median :   0.0   Mode  :character  
##                                        Mean   : 103.7                     
##                                        3rd Qu.: 166.0                     
##                                        Max.   :1600.0                     
##                                        NA's   :8                          
##   ExterCond          Foundation          BsmtQual           BsmtCond        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median : 383.5   Mode  :character  
##                                        Mean   : 443.6                     
##                                        3rd Qu.: 712.2                     
##                                        Max.   :5644.0                     
##                                                                           
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF       Heating         
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Length:1460       
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   Class :character  
##  Median :   0.00   Median : 477.5   Median : 991.5   Mode  :character  
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4                     
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2                     
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0                     
##                                                                        
##   HeatingQC          CentralAir         Electrical          X1stFlrSF   
##  Length:1460        Length:1460        Length:1460        Min.   : 334  
##  Class :character   Class :character   Class :character   1st Qu.: 882  
##  Mode  :character   Mode  :character   Mode  :character   Median :1087  
##                                                           Mean   :1163  
##                                                           3rd Qu.:1391  
##                                                           Max.   :4692  
##                                                                         
##    X2ndFlrSF     LowQualFinSF       GrLivArea     BsmtFullBath   
##  Min.   :   0   Min.   :  0.000   Min.   : 334   Min.   :0.0000  
##  1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000  
##  Median :   0   Median :  0.000   Median :1464   Median :0.0000  
##  Mean   : 347   Mean   :  5.845   Mean   :1515   Mean   :0.4253  
##  3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000  
##  Max.   :2065   Max.   :572.000   Max.   :5642   Max.   :3.0000  
##                                                                  
##   BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr  
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.00000   Median :2.000   Median :0.0000   Median :3.000  
##  Mean   :0.05753   Mean   :1.565   Mean   :0.3829   Mean   :2.866  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :3.000   Max.   :2.0000   Max.   :8.000  
##                                                                    
##   KitchenAbvGr   KitchenQual         TotRmsAbvGrd     Functional       
##  Min.   :0.000   Length:1460        Min.   : 2.000   Length:1460       
##  1st Qu.:1.000   Class :character   1st Qu.: 5.000   Class :character  
##  Median :1.000   Mode  :character   Median : 6.000   Mode  :character  
##  Mean   :1.047                      Mean   : 6.518                     
##  3rd Qu.:1.000                      3rd Qu.: 7.000                     
##  Max.   :3.000                      Max.   :14.000                     
##                                                                        
##    Fireplaces    FireplaceQu         GarageType         GarageYrBlt  
##  Min.   :0.000   Length:1460        Length:1460        Min.   :1900  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.:1961  
##  Median :1.000   Mode  :character   Mode  :character   Median :1980  
##  Mean   :0.613                                         Mean   :1979  
##  3rd Qu.:1.000                                         3rd Qu.:2002  
##  Max.   :3.000                                         Max.   :2010  
##                                                        NA's   :81    
##  GarageFinish         GarageCars      GarageArea      GarageQual       
##  Length:1460        Min.   :0.000   Min.   :   0.0   Length:1460       
##  Class :character   1st Qu.:1.000   1st Qu.: 334.5   Class :character  
##  Mode  :character   Median :2.000   Median : 480.0   Mode  :character  
##                     Mean   :1.767   Mean   : 473.0                     
##                     3rd Qu.:2.000   3rd Qu.: 576.0                     
##                     Max.   :4.000   Max.   :1418.0                     
##                                                                        
##   GarageCond         PavedDrive          WoodDeckSF      OpenPorchSF    
##  Length:1460        Length:1460        Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Median :  0.00   Median : 25.00  
##                                        Mean   : 94.24   Mean   : 46.66  
##                                        3rd Qu.:168.00   3rd Qu.: 68.00  
##                                        Max.   :857.00   Max.   :547.00  
##                                                                         
##  EnclosedPorch      X3SsnPorch      ScreenPorch        PoolArea      
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.00   Median :  0.00   Median :  0.000  
##  Mean   : 21.95   Mean   :  3.41   Mean   : 15.06   Mean   :  2.759  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :552.00   Max.   :508.00   Max.   :480.00   Max.   :738.000  
##                                                                      
##     PoolQC             Fence           MiscFeature           MiscVal        
##  Length:1460        Length:1460        Length:1460        Min.   :    0.00  
##  Class :character   Class :character   Class :character   1st Qu.:    0.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :    0.00  
##                                                           Mean   :   43.49  
##                                                           3rd Qu.:    0.00  
##                                                           Max.   :15500.00  
##                                                                             
##      MoSold           YrSold       SaleType         SaleCondition     
##  Min.   : 1.000   Min.   :2006   Length:1460        Length:1460       
##  1st Qu.: 5.000   1st Qu.:2007   Class :character   Class :character  
##  Median : 6.000   Median :2008   Mode  :character   Mode  :character  
##  Mean   : 6.322   Mean   :2008                                        
##  3rd Qu.: 8.000   3rd Qu.:2009                                        
##  Max.   :12.000   Max.   :2010                                        
##                                                                       
##    SalePrice     
##  Min.   : 34900  
##  1st Qu.:129975  
##  Median :163000  
##  Mean   :180921  
##  3rd Qu.:214000  
##  Max.   :755000  
##

Review bedroom distribution

hist(train_df$BedroomAbvGr)

Review Bathroom distribution

hist(train_df$FullBath)

ggplot(train_df,aes(x=YearBuilt)) +
    geom_bar()

ggplot(train_df,aes(x=BedroomAbvGr,y=SalePrice)) +
    geom_point() +
    labs(title='Comparing Sales Price of Homes Against Number of Bedrooms') +
    xlab('Bedrooms Above Ground')

ggplot(train_df,aes(x=GrLivArea,y=SalePrice))+
    geom_point()

ggplot(train_df,aes(x=OverallQual,y=SalePrice))+
    geom_point()

ggplot(train_df,aes(x=PoolQC))+
    geom_bar() +
    geom_text(stat='count', aes(label=..count..),vjust=-0.25)

ggplot(train_df,aes(MiscFeature))+
    geom_bar() +
    labs(title='Review of Frequency of Miscellaneous Features of Property')

ggplot(train_df,aes(SaleType))+
    geom_bar() +
    labs(title='Review of Sale Type Frequency') +
    geom_text(stat='count', aes(label=..count..),vjust=-0.25)

train_df <- train_df |> mutate(age=2022-YearBuilt)

ggplot(train_df,aes(x=age,y=SalePrice))+
    geom_point() +
    geom_smooth()

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

corr_df <- train_df |> dplyr::select(c(age,BedroomAbvGr,X1stFlrSF))
cor_mat <- cor(corr_df,method='pearson')

pairs(corr_df)

Correlation Hypothesis Test

first_cor_test <- cor.test(corr_df$age,corr_df$X1stFlrSF,conf.level = 0.8)
first_cor_test

## 
##  Pearson's product-moment correlation
## 
## data:  corr_df$age and corr_df$X1stFlrSF
## t = -11.223, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  -0.3125892 -0.2507977
## sample estimates:
##        cor 
## -0.2819859

sec_cor_test <- cor.test(corr_df$age,corr_df$BedroomAbvGr,conf.level = 0.8)
sec_cor_test

## 
##  Pearson's product-moment correlation
## 
## data:  corr_df$age and corr_df$BedroomAbvGr
## t = 2.7045, df = 1458, p-value = 0.006921
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.03717773 0.10396633
## sample estimates:
##        cor 
## 0.07065122

third_cor_test <- cor.test(corr_df$BedroomAbvGr,corr_df$X1stFlrSF,conf.level = 0.8)
third_cor_test

## 
##  Pearson's product-moment correlation
## 
## data:  corr_df$BedroomAbvGr and corr_df$X1stFlrSF
## t = 4.9046, df = 1458, p-value = 1.041e-06
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.09424207 0.16027708
## sample estimates:
##       cor 
## 0.1274007

The correlation test results indicate to reject the null hypothesis given a P-Value below 0.05 for each pairwise comparison. The confidence interval for age vs 1st floor sq ft is (-0.3125892, -0.2507977) , age vs bedroooms above ground is ( 0.03717773, 0.10396633) and bedrooms above ground floor vs 1st floor sq ft (0.09424207,0.16027708) also substantiates that there is a correlation (i.e. that the correlation is not 0) given that 80% of samples taken to compare these two variables would not be expected to contain zero and the true correlation average is not zero. Family wise error, type 1 error rate, does not appear to be that worrying because the p-values provided in each test are far below the alpha threshold which should indicate there is less of a chance that the interpretation is incorrect. Also, in this case there aren’t that many risks to rejecting the null hypothesis as the visual eye test seems to indicate some type of linear relationship irrespective of their strength.

Determine the precision matrix

if (det(cor_mat)==0){
   print('Matrix cannot be inverted') 
}else{
    prec_mat <- solve(cor_mat)
}

Running LU Decomposition

decomb_mat <- cor_mat %*% prec_mat %*% cor_mat
decomb_mat

##                      age BedroomAbvGr  X1stFlrSF
## age           1.00000000   0.07065122 -0.2819859
## BedroomAbvGr  0.07065122   1.00000000  0.1274007
## X1stFlrSF    -0.28198586   0.12740075  1.0000000

str(lum <- lu(decomb_mat))

## Formal class 'denseLU' [package "Matrix"] with 4 slots
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:3] "age" "BedroomAbvGr" "X1stFlrSF"
##   .. ..$ : chr [1:3] "age" "BedroomAbvGr" "X1stFlrSF"
##   ..@ x       : num [1:9] 1 0.0707 -0.282 0.0707 0.995 ...
##   ..@ perm    : int [1:3] 1 2 3
##   ..@ Dim     : int [1:2] 3 3

elu <- expand(lum)

Lower Matrix from LU Decomposition

elu$L

## 3 x 3 Matrix of class "dtrMatrix" (unitriangular)
##      [,1]        [,2]        [,3]       
## [1,]  1.00000000           .           .
## [2,]  0.07065122  1.00000000           .
## [3,] -0.28198586  0.14806246  1.00000000

Upper Matrix from LU Decomposition

elu$U

## 3 x 3 Matrix of class "dtrMatrix"
##      [,1]        [,2]        [,3]       
## [1,]  1.00000000  0.07065122 -0.28198586
## [2,]           .  0.99500841  0.14732339
## [3,]           .           .  0.89867091

Reviewing Right skewed continuous variables

ggplot(train_df,aes(x=OpenPorchSF)) + #GarageArea WoodDeckSF
    geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Transforming the skewed distribution

sum(is.na(train_df$OpenPorchSF))

## [1] 0

fit_exp <- MASS::fitdistr(train_df$OpenPorchSF,'exponential')
fit_exp$estimate

##       rate 
## 0.02143151

exp_hist_data <- rep(sample(rexp(1000,fit_exp$estimate),100),1000)
par(mfrow=c(1,1))

Reviewing Original and Transformed Distributions

par(mfrow=c(1,2))
hist(train_df$OpenPorchSF)

hist(exp_hist_data)

The shapes of distributions overall do look fairly similar although there is less right skewness from the transformed exponential and higher concentrations of values closer to zero.

Using exponential to review 5th and 95th Percentiles

qexp(0.05,fit_exp$estimate)

## [1] 2.393359

qexp(0.95,fit_exp$estimate)

## [1] 139.7817

Alternative Methods of Obtaining Confidence Intervals

len_exp <- length(train_df$OpenPorchSF)
sd_exp <- sd(train_df$OpenPorchSF)
std_err_exp <- sd_exp/sqrt(len_exp)
error_rate <- 0.05
t_var <- qt(p=error_rate/2,df=len_exp-1,lower.tail=F)
margin_error <- t_var * std_err_exp

paste0('Manually calculated 95% confidence interval: (',t_var-margin_error,",",t_var+margin_error,")")

## [1] "Manually calculated 95% confidence interval: (-1.4398070004854,5.36298953704289)"

#Empirical Percentiles
quantile(train_df$OpenPorchSF,c(0.05,0.95))

##     5%    95% 
##   0.00 175.05

The alternative methods of estimating the 5th and 95th percentiles were not that substantially different, but the reason the confidence interval includes zero is likely because of the large concentration of 0 values where houses did not have porches in the data set. The fit distribution function also determined a lambda that was fairly close to zero so it is not that surprising that the mean value would estimate zero. When comparing the exponential 95th percentile against the original data there is some variance as the exponential transformation slightly minimized the effects of the skew leading to a smaller value as compared to the input data source.

Multiple Linear Regression Analysis

Reviewing relationships of Data

quality <- c('Ex','Gd','TA','Fa','Po')
kitchen_rating <- c(0.9,0.75,0.5,0.25,0.1)
kitch_df <- data.frame(quality,kitchen_rating)
train_df <- train_df |>
    mutate(new_home= ifelse(SaleType=='New',1,0),
           court_officer=ifelse(SaleType=='COD',1,0),
           has_porch=ifelse(OpenPorchSF==0 & EnclosedPorch==0,0,1),
           single_fam = ifelse(BldgType=='1Fam',1,0),
           twoFmCon = ifelse(BldgType=='2fmCon',1,0),
           house_unf = ifelse(str_detect(HouseStyle, 'Unf'),1,0),
           central_air = ifelse(CentralAir=='Yes',1,0),
           functional_num = ifelse(Functional=='Typ',1,ifelse(str_detect(Functional,'Min'),0.5,ifelse(Functional=='Mod',0,ifelse(str_detect(Functional,'Major'),-0.5,-1)))),
           kitchen_rating = ifelse(KitchenQual=='Ex',0.9,ifelse(KitchenQual=='Gd',0.75,ifelse(KitchenQual=='TA',0.5,ifelse(KitchenQual=='Fa',0.25,0.1))))
           )


lm.out <- lm(SalePrice ~ BedroomAbvGr +age^2 +GrLivArea+ X2ndFlrSF+ GarageArea+ OverallCond + OverallQual + LotArea+ new_home + court_officer +single_fam  +TotalBsmtSF +LowQualFinSF + kitchen_rating + functional_num,data=train_df)

summary(lm.out)

## 
## Call:
## lm(formula = SalePrice ~ BedroomAbvGr + age^2 + GrLivArea + X2ndFlrSF + 
##     GarageArea + OverallCond + OverallQual + LotArea + new_home + 
##     court_officer + single_fam + TotalBsmtSF + LowQualFinSF + 
##     kitchen_rating + functional_num, data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -541573  -17542   -1634   13515  277980 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -9.158e+04  8.987e+03 -10.191  < 2e-16 ***
## BedroomAbvGr   -8.318e+03  1.475e+03  -5.639 2.05e-08 ***
## age            -3.887e+02  4.759e+01  -8.167 6.81e-16 ***
## GrLivArea       7.243e+01  4.854e+00  14.921  < 2e-16 ***
## X2ndFlrSF      -1.365e+01  4.731e+00  -2.886 0.003965 ** 
## GarageArea      2.779e+01  5.915e+00   4.698 2.88e-06 ***
## OverallCond     4.532e+03  9.690e+02   4.677 3.18e-06 ***
## OverallQual     1.620e+04  1.184e+03  13.688  < 2e-16 ***
## LotArea         5.495e-01  1.008e-01   5.452 5.87e-08 ***
## new_home        1.795e+04  3.708e+03   4.841 1.43e-06 ***
## court_officer  -1.621e+04  5.581e+03  -2.905 0.003727 ** 
## single_fam      1.619e+04  2.662e+03   6.081 1.52e-09 ***
## TotalBsmtSF     1.342e+01  4.030e+00   3.329 0.000893 ***
## LowQualFinSF   -4.268e+01  2.023e+01  -2.110 0.035038 *  
## kitchen_rating  4.387e+04  8.933e+03   4.911 1.01e-06 ***
## functional_num  1.308e+04  3.650e+03   3.584 0.000350 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35620 on 1444 degrees of freedom
## Multiple R-squared:  0.801,  Adjusted R-squared:  0.7989 
## F-statistic: 387.5 on 15 and 1444 DF,  p-value: < 2.2e-16

When reviewing the model summary detail there was one initial item that might be slightly concerning and cause issues for the model. The residual spread which can be further reviewed by the diagnostic plots does not seem to be centered at zero although the remainder of the residuals show somewhat even spread at the interquartile range. The F-statistic confirms that the there is some significance to this linear model and the R-Squared adjusted seems to indicate that just under 80% of the variability in sales price is explained by the independent variables.

subset_df <- train_df |> dplyr::select(c(SalePrice,BedroomAbvGr,age,GrLivArea,X2ndFlrSF,GarageArea,OverallCond,OverallQual,LotArea,TotalBsmtSF,LowQualFinSF,kitchen_rating))
ResourceSelection::kdepairs(subset_df[,c(1:10)])

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

## Warning in par(usr): argument 1 does not name a graphical parameter

vif(lm.out)

##   BedroomAbvGr            age      GrLivArea      X2ndFlrSF     GarageArea 
##       1.664722       2.375276       7.481039       4.902881       1.838649 
##    OverallCond    OverallQual        LotArea       new_home  court_officer 
##       1.336816       3.080926       1.164042       1.211424       1.024345 
##     single_fam    TotalBsmtSF   LowQualFinSF kitchen_rating functional_num 
##       1.120262       3.594789       1.112440       2.117113       1.113141

Best practice indicates that any Variance Inflation factor over 5 is likely subject to multicollinearity and therefore the model will be subsequently revised to account for this issue.

lm2.out <- lm(SalePrice ~ BedroomAbvGr +age^2 + X2ndFlrSF+ GarageArea+ OverallCond + OverallQual + LotArea+ new_home + court_officer +single_fam  +TotalBsmtSF +LowQualFinSF + kitchen_rating + functional_num,data=train_df)

summary(lm2.out)

## 
## Call:
## lm(formula = SalePrice ~ BedroomAbvGr + age^2 + X2ndFlrSF + GarageArea + 
##     OverallCond + OverallQual + LotArea + new_home + court_officer + 
##     single_fam + TotalBsmtSF + LowQualFinSF + kitchen_rating + 
##     functional_num, data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -514051  -19267   -2830   12713  314852 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -9.239e+04  9.651e+03  -9.573  < 2e-16 ***
## BedroomAbvGr   -1.560e+03  1.508e+03  -1.035 0.300984    
## age            -2.464e+02  5.007e+01  -4.921 9.58e-07 ***
## X2ndFlrSF       4.091e+01  3.223e+00  12.694  < 2e-16 ***
## GarageArea      4.815e+01  6.181e+00   7.791 1.26e-14 ***
## OverallCond     3.794e+03  1.039e+03   3.651 0.000271 ***
## OverallQual     1.975e+04  1.245e+03  15.861  < 2e-16 ***
## LotArea         8.042e-01  1.067e-01   7.538 8.41e-14 ***
## new_home        1.854e+04  3.982e+03   4.656 3.53e-06 ***
## court_officer  -1.096e+04  5.982e+03  -1.833 0.067058 .  
## single_fam      1.273e+04  2.848e+03   4.469 8.47e-06 ***
## TotalBsmtSF     5.172e+01  3.337e+00  15.496  < 2e-16 ***
## LowQualFinSF    2.196e+01  2.122e+01   1.035 0.300972    
## kitchen_rating  5.610e+04  9.553e+03   5.872 5.32e-09 ***
## functional_num  1.831e+03  3.836e+03   0.477 0.633139    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38260 on 1445 degrees of freedom
## Multiple R-squared:  0.7703, Adjusted R-squared:  0.7681 
## F-statistic: 346.2 on 14 and 1445 DF,  p-value: < 2.2e-16

Given that certain variables are not considered significant after this change it is necessary to test the impact to the model using backward testing.

Revised Model

lm3.out <- lm(SalePrice ~ age^2 + X2ndFlrSF+ GarageArea+ OverallCond + OverallQual + LotArea+ new_home  +single_fam  +TotalBsmtSF + kitchen_rating,data=train_df)

summary(lm3.out)

## 
## Call:
## lm(formula = SalePrice ~ age^2 + X2ndFlrSF + GarageArea + OverallCond + 
##     OverallQual + LotArea + new_home + single_fam + TotalBsmtSF + 
##     kitchen_rating, data = train_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -509332  -19364   -2599   13022  315897 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -9.647e+04  8.385e+03 -11.506  < 2e-16 ***
## age            -2.449e+02  4.882e+01  -5.017 5.88e-07 ***
## X2ndFlrSF       3.942e+01  2.733e+00  14.425  < 2e-16 ***
## GarageArea      4.752e+01  6.170e+00   7.703 2.45e-14 ***
## OverallCond     3.894e+03  1.022e+03   3.812 0.000144 ***
## OverallQual     1.999e+04  1.236e+03  16.168  < 2e-16 ***
## LotArea         8.051e-01  1.067e-01   7.546 7.87e-14 ***
## new_home        1.910e+04  3.975e+03   4.807 1.69e-06 ***
## single_fam      1.242e+04  2.822e+03   4.401 1.16e-05 ***
## TotalBsmtSF     5.073e+01  3.238e+00  15.669  < 2e-16 ***
## kitchen_rating  5.787e+04  9.485e+03   6.101 1.35e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38280 on 1449 degrees of freedom
## Multiple R-squared:  0.7694, Adjusted R-squared:  0.7678 
## F-statistic: 483.6 on 10 and 1449 DF,  p-value: < 2.2e-16

The residuals remains somewhat distant from averaging zero, but the adjusted r-squared hasn’t changed dramatically with the substitutions from the earlier version of the model and all of the p-values of the dependent variables are significant.

Reviewing Diagnostic Plots

par(mfrow=c(2,2))
plot(lm3.out)

The residuals are still somewhat far from zero and there is somewhat of pattern in the distribution across the fitted values and that is a potential concern of a violation of this assumption. The normal QQ plot follows a normal approximation for the majority of values; however, the third quantile appear to diverge more substantially

Confirming VIF

vif(lm3.out)

##            age      X2ndFlrSF     GarageArea    OverallCond    OverallQual 
##       2.164644       1.417250       1.732802       1.287010       2.911483 
##        LotArea       new_home     single_fam    TotalBsmtSF kitchen_rating 
##       1.129041       1.205520       1.089813       2.009204       2.066997

The variance factors have lowered across the board after revising the model and are no longer of concern.

test_df <- read.csv('https://raw.githubusercontent.com/jforster19/Data605/main/House_Prices_Prediction_test.csv')
dim(test_df)

## [1] 1459   80

test_df <- test_df |>
    mutate(age=2022-YearBuilt, new_home= ifelse(is.na(SaleType),0,ifelse(SaleType=='New',1,0)),
           GarageArea = ifelse(is.na(GarageArea),0,GarageArea),
           TotalBsmtSF = ifelse(is.na(TotalBsmtSF),0,TotalBsmtSF),
           single_fam = ifelse(BldgType=='1Fam',1,0),
           twoFmCon = ifelse(BldgType=='2fmCon',1,0),
           house_unf = ifelse(str_detect(HouseStyle, 'Unf'),1,0),
           central_air = ifelse(CentralAir=='Yes',1,ifelse(is.na(CentralAir),0,0)),
           functional_num = ifelse(is.na(Functional),0,ifelse(Functional=='Typ',1,ifelse(str_detect(Functional,'Min'),0.5,ifelse(Functional=='Mod',0,ifelse(str_detect(Functional,'Major'),-0.5,-1))))),
           kitchen_rating = ifelse(is.na(KitchenQual),0.5,ifelse(KitchenQual=='Ex',0.9,ifelse(KitchenQual=='Gd',0.75,ifelse(KitchenQual=='TA',0.5,ifelse(KitchenQual=='Fa',0.25,0.1))))))
           
predictions_lm <- cbind(test_df$Id,predict(lm3.out,test_df,interval='prediction'))

Reviewing Null Values

sum(is.na(test_df))

## [1] 6998

sapply(test_df, function(x) sum(is.na(x)))

##             Id     MSSubClass       MSZoning    LotFrontage        LotArea 
##              0              0              4            227              0 
##         Street          Alley       LotShape    LandContour      Utilities 
##              0           1352              0              0              2 
##      LotConfig      LandSlope   Neighborhood     Condition1     Condition2 
##              0              0              0              0              0 
##       BldgType     HouseStyle    OverallQual    OverallCond      YearBuilt 
##              0              0              0              0              0 
##   YearRemodAdd      RoofStyle       RoofMatl    Exterior1st    Exterior2nd 
##              0              0              0              1              1 
##     MasVnrType     MasVnrArea      ExterQual      ExterCond     Foundation 
##             16             15              0              0              0 
##       BsmtQual       BsmtCond   BsmtExposure   BsmtFinType1     BsmtFinSF1 
##             44             45             44             42              1 
##   BsmtFinType2     BsmtFinSF2      BsmtUnfSF    TotalBsmtSF        Heating 
##             42              1              1              0              0 
##      HeatingQC     CentralAir     Electrical      X1stFlrSF      X2ndFlrSF 
##              0              0              0              0              0 
##   LowQualFinSF      GrLivArea   BsmtFullBath   BsmtHalfBath       FullBath 
##              0              0              2              2              0 
##       HalfBath   BedroomAbvGr   KitchenAbvGr    KitchenQual   TotRmsAbvGrd 
##              0              0              0              1              0 
##     Functional     Fireplaces    FireplaceQu     GarageType    GarageYrBlt 
##              2              0            730             76             78 
##   GarageFinish     GarageCars     GarageArea     GarageQual     GarageCond 
##             78              1              0             78             78 
##     PavedDrive     WoodDeckSF    OpenPorchSF  EnclosedPorch     X3SsnPorch 
##              0              0              0              0              0 
##    ScreenPorch       PoolArea         PoolQC          Fence    MiscFeature 
##              0              0           1456           1169           1408 
##        MiscVal         MoSold         YrSold       SaleType  SaleCondition 
##              0              0              0              1              0 
##            age       new_home     single_fam       twoFmCon      house_unf 
##              0              0              0              0              0 
##    central_air functional_num kitchen_rating 
##              0              0              0

Modifying End Output

head(predictions_lm)

##             fit       lwr      upr
## 1 1461 142049.0  66771.36 217326.6
## 2 1462 180713.1 105478.15 255948.0
## 3 1463 166932.9  91638.32 242227.5
## 4 1464 200849.3 125622.15 276076.4
## 5 1465 211985.6 136609.29 287361.9
## 6 1466 180022.3 104764.47 255280.2

colnames(predictions_lm) <- c('Id','SalePrice','lwr','upr')
output_df <- as_tibble(predictions_lm )|> dplyr::select(c('Id','SalePrice'))
sapply(output_df, function(x) sum(is.na(x)))

##        Id SalePrice 
##         0         0

Output Data for Submission

head(output_df)

## # A tibble: 6 × 2
##      Id SalePrice
##   <dbl>     <dbl>
## 1  1461   142049.
## 2  1462   180713.
## 3  1463   166933.
## 4  1464   200849.
## 5  1465   211986.
## 6  1466   180022.

write_delim(output_df,'house_price_predictions.csv',delim=',')