FinalData605

library(dplyr)  
library(glue)
library(ggplot2)
library(scales)
library(corrplot)
library(matrixcalc)
library(tidyr)
library(purrr)
library(MASS)
library(Amelia)
library(randomForest)
library(caret)
library(e1071)
library(rpart)
library(readr)
library(Metrics)
library(forecast) 
library(mice)

Probability Density 1: X~Gamma. Using R, generate a random variable X that has 10,000 random Gamma pdf values. A Gamma pdf is completely describe by n (a size parameter) and lambda (λ , a shape parameter). Choose any n greater 3 and an expected value (λ) between 2 and 10 (you choose).

n = 4
lambda = 3
observations = 10000

X = rgamma(observations , n, lambda) 
head(X)

## [1] 1.3449226 1.6232740 0.6383956 1.7484828 3.4030763 0.9232465

Probability Density 2: Y~Sum of Exponentials. Then generate 10,000 observations from the sum of n exponential pdfs with rate/shape parameter (λ). The n and λ must be the same as in the previous case. (e.g., mysum=rexp(10000,λ)+rexp(10000,λ)+..)

Y = rexp(observations, lambda) + rexp(observations, lambda) + rexp(observations, lambda) + rexp(observations, lambda) 
head(Y)

## [1] 2.2679843 0.4105526 0.8449030 1.8832580 2.0976125 1.6004484

Probability Density 3: Z~ Exponential. Then generate 10,000 observations from a single exponential pdf with rate/shape parameter (λ).

Z = rexp(observations, lambda) 
head(Z)

## [1] 0.3498248 0.0651057 0.1445587 0.1936039 0.8894497 0.3311824

1a. Calculate the empirical expected value (means) and variances of all three pdfs.

PD1eev = mean(X)
PD1eev

## [1] 1.338276

PD1var = var(X)
PD1var

## [1] 0.4550475

PD2eev = mean(Y)
PD2eev

## [1] 1.337444

PD2var = var(Y)
PD2var

## [1] 0.4442312

PD3eev = mean(Z)
PD3eev

## [1] 0.336711

PD3var = var(Z)
PD3var

## [1] 0.1101664

glue('The empirical expected value of X is {round(PD1eev, 4)} and the variance of X is {round(PD1var, 4)}.\n The empirical expected value of Y is {round(PD2eev, 4)} and the variance of Y is {round(PD2var, 4)}.\n The empirical expected value of Z is {round(PD3eev, 4)} and the variance of Z is {round(PD3var, 4)}.')

## The empirical expected value of X is 1.3383 and the variance of X is 0.455.
## The empirical expected value of Y is 1.3374 and the variance of Y is 0.4442.
## The empirical expected value of Z is 0.3367 and the variance of Z is 0.1102.

Using calculus, calculate the expected value and variance of the Gamma pdf (X). Using the moment generating function for exponentials, calculate the expected value of the single exponential (Z) and the sum of exponentials (Y)

integrand <- function(x) x * dgamma(x, n, lambda)
expected_value_X <- integrate(integrand, lower = 0, upper = Inf)$value
expected_value_X

## [1] 1.333333

integrand <- function(x) x^2 * dgamma(x, n, lambda)
variance_X <- integrate(integrand, lower = 0, upper = Inf)$value - expected_value_X^2
variance_X

## [1] 0.4444444

Z The moment generating function of an exponential distribution is:

MX(t)=λ/λ−t

The first derivative of which, with λ=3 is:

M′X(t)=3/(3−t)^2

When we evaluate at t=0 we get:

M′X(0)= .33

Y The moment generating function of a sum of exponential distributions is:

MX(t)=(1−(t/λ))^−α

The first derivative of which, with α=4 and λ=3is:

M′X(t)=4/3(1−(t/3)^−5

When we evaluate at t=0 we get:

M′X(0)= 4/3

1c-e. Probability. For pdf Z (the exponential), calculate empirically probabilities a through c. Then evaluate through calculus whether the memoryless property holds.

a. P(Z> λ| Z> λ/2) b. P(Z> 2λ | Z> λ) cb. P(Z>3λ | Z> λ)

Baye’s Law P(A|B) = P(B|A) * P(A) / P(B)

prob_a = sum(Z > lambda/2 & Z > lambda) * sum(Z > lambda) / sum(Z > lambda/2)
prob_b = sum(Z > lambda & Z > 2*lambda)* sum(Z > 2*lambda) / sum(Z > lambda)
prob_c = sum(Z > lambda & Z > 3*lambda)* sum(Z > 3*lambda) / sum(Z > lambda)

glue('For pdf Z (the exponential), the empiricall probabilities of a = {round(prob_a, 4)} // b = {round(prob_b, 4)} // c = {round(prob_c, 4)}')

## For pdf Z (the exponential), the empiricall probabilities of a = 0.0364 // b = 0 // c = 0

The memoryless property states that a random variable’s past does not influence its future behavior. Therefore, all three probabilities hold the memoryless property.
Loosely investigate whether P(YZ) = P(Y) P(Z) by building a table with quartiles and evaluating the marginal and joint probabilities. With a little help from this website ’https://www.tutorialspoint.com/create-a-quartile-column-for-each-value-in-an-r-data-frame-column'->

bin_Y = cut(Y, breaks = quantile(Y), include.lowest = TRUE)
levels(bin_Y) = c("1st Quartile Y", "2d Quartile Y", "3d Quartile Y", "4th Quartile Y")
bin_Z = cut(Z, breaks = quantile(Z), include.lowest = TRUE)
levels(bin_Z) = c("1st Quartile Z", "2nd Quartile Z", "3rd Quartile Z", "4th Quartile Z")

combined_instances = paste(sample(bin_Z, 10000, replace = TRUE), sample(bin_Y, 10000, replace = TRUE))
instance_counts_matrix <- matrix(table(combined_instances), ncol = 4)
instance_counts <- prop.table(instance_counts_matrix)

df <- as.data.frame(instance_counts) %>%
  mutate(Sum = rowSums(across(where(is.numeric)))) %>%
  bind_rows(summarise(.,across(where(is.numeric), sum)))

colnames(df) <- c("1st Quartile Y", "2nd Quartile Y", "3rd Quartile Y", "4th Quartile Y", "Sum")
rownames(df) <- c("1st Quartile Z", "2nd Quartile Z", "3rd Quartile Z", "4th Quartile Z", "Sum")
df

##                1st Quartile Y 2nd Quartile Y 3rd Quartile Y 4th Quartile Y
## 1st Quartile Z         0.0607         0.0625         0.0610         0.0619
## 2nd Quartile Z         0.0590         0.0630         0.0601         0.0640
## 3rd Quartile Z         0.0671         0.0620         0.0639         0.0633
## 4th Quartile Z         0.0657         0.0635         0.0621         0.0602
## Sum                    0.2525         0.2510         0.2471         0.2494
##                   Sum
## 1st Quartile Z 0.2461
## 2nd Quartile Z 0.2461
## 3rd Quartile Z 0.2563
## 4th Quartile Z 0.2515
## Sum            1.0000

Check to see if independence holds by using Fisher’s Exact Test and the Chi Square Test. What is the difference between the two? Which is most appropriate?

Fisher’s exact test is used when the sample size is small and the data do not follow a multinomial distribution, while the chi-square test is used when the sample size is large and the data follow a multinomial distribution. Fisher’s exact test is more computationally intensive than the chi-square test, but it provides exact probabilities of the observed data, while the chi-square test provides an approximate probability. The Chi Square Test is the more appropriate test in this case.

fisher.test(instance_counts_matrix, simulate.p.value = TRUE, B = 10000)

## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  10000 replicates)
## 
## data:  instance_counts_matrix
## p-value = 0.6263
## alternative hypothesis: two.sided

Since the p-value is greater than the usual significance level of 0.05, we cannot reject the null hypothesis and conclude that there is no significant evidence of an association between the two categorical variables being compared. This means that any observed association between the variables could be due to chance, and further investigation may be needed to determine if there is a true relationship between the variables.

chisq.test(instance_counts_matrix)

## 
##  Pearson's Chi-squared test
## 
## data:  instance_counts_matrix
## X-squared = 7.1378, df = 9, p-value = 0.6228

This also suggests that there is no significant evidence of an association between the two categorical variables being compared.

Problem 2 You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following. 5 points. Descriptive and Inferential Statistics. Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any three quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide an 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

train = read.csv("train_ames_data.csv", stringsAsFactors = F)
test = read.csv("test_ames_data.csv", stringsAsFactors = F)
head(train)

##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1  1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
## 2  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
## 3  3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
## 4  4         70       RL          60    9550   Pave  <NA>      IR1         Lvl
## 5  5         60       RL          84   14260   Pave  <NA>      IR1         Lvl
## 6  6         50       RL          85   14115   Pave  <NA>      IR1         Lvl
##   Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 2    AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
## 3    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 4    AllPub    Corner       Gtl      Crawfor       Norm       Norm     1Fam
## 5    AllPub       FR2       Gtl      NoRidge       Norm       Norm     1Fam
## 6    AllPub    Inside       Gtl      Mitchel       Norm       Norm     1Fam
##   HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1     2Story           7           5      2003         2003     Gable  CompShg
## 2     1Story           6           8      1976         1976     Gable  CompShg
## 3     2Story           7           5      2001         2002     Gable  CompShg
## 4     2Story           7           5      1915         1970     Gable  CompShg
## 5     2Story           8           5      2000         2000     Gable  CompShg
## 6     1.5Fin           5           5      1993         1995     Gable  CompShg
##   Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1     VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
## 2     MetalSd     MetalSd       None          0        TA        TA     CBlock
## 3     VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
## 4     Wd Sdng     Wd Shng       None          0        TA        TA     BrkTil
## 5     VinylSd     VinylSd    BrkFace        350        Gd        TA      PConc
## 6     VinylSd     VinylSd       None          0        TA        TA       Wood
##   BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1       Gd       TA           No          GLQ        706          Unf
## 2       Gd       TA           Gd          ALQ        978          Unf
## 3       Gd       TA           Mn          GLQ        486          Unf
## 4       TA       Gd           No          ALQ        216          Unf
## 5       Gd       TA           Av          GLQ        655          Unf
## 6       Gd       TA           No          GLQ        732          Unf
##   BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1          0       150         856    GasA        Ex          Y      SBrkr
## 2          0       284        1262    GasA        Ex          Y      SBrkr
## 3          0       434         920    GasA        Ex          Y      SBrkr
## 4          0       540         756    GasA        Gd          Y      SBrkr
## 5          0       490        1145    GasA        Ex          Y      SBrkr
## 6          0        64         796    GasA        Ex          Y      SBrkr
##   X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1       856       854            0      1710            1            0        2
## 2      1262         0            0      1262            0            1        2
## 3       920       866            0      1786            1            0        2
## 4       961       756            0      1717            1            0        1
## 5      1145      1053            0      2198            1            0        2
## 6       796       566            0      1362            1            0        1
##   HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1        1            3            1          Gd            8        Typ
## 2        0            3            1          TA            6        Typ
## 3        1            3            1          Gd            6        Typ
## 4        0            3            1          Gd            7        Typ
## 5        1            4            1          Gd            9        Typ
## 6        1            1            1          TA            5        Typ
##   Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1          0        <NA>     Attchd        2003          RFn          2
## 2          1          TA     Attchd        1976          RFn          2
## 3          1          TA     Attchd        2001          RFn          2
## 4          1          Gd     Detchd        1998          Unf          3
## 5          1          TA     Attchd        2000          RFn          3
## 6          0        <NA>     Attchd        1993          Unf          2
##   GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1        548         TA         TA          Y          0          61
## 2        460         TA         TA          Y        298           0
## 3        608         TA         TA          Y          0          42
## 4        642         TA         TA          Y          0          35
## 5        836         TA         TA          Y        192          84
## 6        480         TA         TA          Y         40          30
##   EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1             0          0           0        0   <NA>  <NA>        <NA>
## 2             0          0           0        0   <NA>  <NA>        <NA>
## 3             0          0           0        0   <NA>  <NA>        <NA>
## 4           272          0           0        0   <NA>  <NA>        <NA>
## 5             0          0           0        0   <NA>  <NA>        <NA>
## 6             0        320           0        0   <NA> MnPrv        Shed
##   MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1       0      2   2008       WD        Normal    208500
## 2       0      5   2007       WD        Normal    181500
## 3       0      9   2008       WD        Normal    223500
## 4       0      2   2006       WD       Abnorml    140000
## 5       0     12   2008       WD        Normal    250000
## 6     700     10   2009       WD        Normal    143000

summary(train)

##        Id           MSSubClass      MSZoning          LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   Length:1460        Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   Class :character   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   Mode  :character   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9                      Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0                      3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                      Max.   :313.00  
##                                                      NA's   :259     
##     LotArea          Street             Alley             LotShape        
##  Min.   :  1300   Length:1460        Length:1460        Length:1460       
##  1st Qu.:  7554   Class :character   Class :character   Class :character  
##  Median :  9478   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10517                                                           
##  3rd Qu.: 11602                                                           
##  Max.   :215245                                                           
##                                                                           
##  LandContour         Utilities          LotConfig          LandSlope        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Neighborhood        Condition1         Condition2          BldgType        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   HouseStyle         OverallQual      OverallCond      YearBuilt   
##  Length:1460        Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  Class :character   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
##  Mode  :character   Median : 6.000   Median :5.000   Median :1973  
##                     Mean   : 6.099   Mean   :5.575   Mean   :1971  
##                     3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000  
##                     Max.   :10.000   Max.   :9.000   Max.   :2010  
##                                                                    
##   YearRemodAdd   RoofStyle           RoofMatl         Exterior1st       
##  Min.   :1950   Length:1460        Length:1460        Length:1460       
##  1st Qu.:1967   Class :character   Class :character   Class :character  
##  Median :1994   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1985                                                           
##  3rd Qu.:2004                                                           
##  Max.   :2010                                                           
##                                                                         
##  Exterior2nd         MasVnrType          MasVnrArea      ExterQual        
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median :   0.0   Mode  :character  
##                                        Mean   : 103.7                     
##                                        3rd Qu.: 166.0                     
##                                        Max.   :1600.0                     
##                                        NA's   :8                          
##   ExterCond          Foundation          BsmtQual           BsmtCond        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median : 383.5   Mode  :character  
##                                        Mean   : 443.6                     
##                                        3rd Qu.: 712.2                     
##                                        Max.   :5644.0                     
##                                                                           
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF       Heating         
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Length:1460       
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   Class :character  
##  Median :   0.00   Median : 477.5   Median : 991.5   Mode  :character  
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4                     
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2                     
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0                     
##                                                                        
##   HeatingQC          CentralAir         Electrical          X1stFlrSF   
##  Length:1460        Length:1460        Length:1460        Min.   : 334  
##  Class :character   Class :character   Class :character   1st Qu.: 882  
##  Mode  :character   Mode  :character   Mode  :character   Median :1087  
##                                                           Mean   :1163  
##                                                           3rd Qu.:1391  
##                                                           Max.   :4692  
##                                                                         
##    X2ndFlrSF     LowQualFinSF       GrLivArea     BsmtFullBath   
##  Min.   :   0   Min.   :  0.000   Min.   : 334   Min.   :0.0000  
##  1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000  
##  Median :   0   Median :  0.000   Median :1464   Median :0.0000  
##  Mean   : 347   Mean   :  5.845   Mean   :1515   Mean   :0.4253  
##  3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000  
##  Max.   :2065   Max.   :572.000   Max.   :5642   Max.   :3.0000  
##                                                                  
##   BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr  
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.00000   Median :2.000   Median :0.0000   Median :3.000  
##  Mean   :0.05753   Mean   :1.565   Mean   :0.3829   Mean   :2.866  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :3.000   Max.   :2.0000   Max.   :8.000  
##                                                                    
##   KitchenAbvGr   KitchenQual         TotRmsAbvGrd     Functional       
##  Min.   :0.000   Length:1460        Min.   : 2.000   Length:1460       
##  1st Qu.:1.000   Class :character   1st Qu.: 5.000   Class :character  
##  Median :1.000   Mode  :character   Median : 6.000   Mode  :character  
##  Mean   :1.047                      Mean   : 6.518                     
##  3rd Qu.:1.000                      3rd Qu.: 7.000                     
##  Max.   :3.000                      Max.   :14.000                     
##                                                                        
##    Fireplaces    FireplaceQu         GarageType         GarageYrBlt  
##  Min.   :0.000   Length:1460        Length:1460        Min.   :1900  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.:1961  
##  Median :1.000   Mode  :character   Mode  :character   Median :1980  
##  Mean   :0.613                                         Mean   :1979  
##  3rd Qu.:1.000                                         3rd Qu.:2002  
##  Max.   :3.000                                         Max.   :2010  
##                                                        NA's   :81    
##  GarageFinish         GarageCars      GarageArea      GarageQual       
##  Length:1460        Min.   :0.000   Min.   :   0.0   Length:1460       
##  Class :character   1st Qu.:1.000   1st Qu.: 334.5   Class :character  
##  Mode  :character   Median :2.000   Median : 480.0   Mode  :character  
##                     Mean   :1.767   Mean   : 473.0                     
##                     3rd Qu.:2.000   3rd Qu.: 576.0                     
##                     Max.   :4.000   Max.   :1418.0                     
##                                                                        
##   GarageCond         PavedDrive          WoodDeckSF      OpenPorchSF    
##  Length:1460        Length:1460        Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Median :  0.00   Median : 25.00  
##                                        Mean   : 94.24   Mean   : 46.66  
##                                        3rd Qu.:168.00   3rd Qu.: 68.00  
##                                        Max.   :857.00   Max.   :547.00  
##                                                                         
##  EnclosedPorch      X3SsnPorch      ScreenPorch        PoolArea      
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.00   Median :  0.00   Median :  0.000  
##  Mean   : 21.95   Mean   :  3.41   Mean   : 15.06   Mean   :  2.759  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :552.00   Max.   :508.00   Max.   :480.00   Max.   :738.000  
##                                                                      
##     PoolQC             Fence           MiscFeature           MiscVal        
##  Length:1460        Length:1460        Length:1460        Min.   :    0.00  
##  Class :character   Class :character   Class :character   1st Qu.:    0.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :    0.00  
##                                                           Mean   :   43.49  
##                                                           3rd Qu.:    0.00  
##                                                           Max.   :15500.00  
##                                                                             
##      MoSold           YrSold       SaleType         SaleCondition     
##  Min.   : 1.000   Min.   :2006   Length:1460        Length:1460       
##  1st Qu.: 5.000   1st Qu.:2007   Class :character   Class :character  
##  Median : 6.000   Median :2008   Mode  :character   Mode  :character  
##  Mean   : 6.322   Mean   :2008                                        
##  3rd Qu.: 8.000   3rd Qu.:2009                                        
##  Max.   :12.000   Max.   :2010                                        
##                                                                       
##    SalePrice     
##  Min.   : 34900  
##  1st Qu.:129975  
##  Median :163000  
##  Mean   :180921  
##  3rd Qu.:214000  
##  Max.   :755000  
##

dim(train)

## [1] 1460   81

dim(test)

## [1] 1459   80

test_labels = test$Id
test$Id = NULL
train$Id = NULL
test$SalePrice = NA
fulldf = rbind(train, test)
dim(fulldf)

## [1] 2919   80

ggplot(data=fulldf[!is.na(fulldf$SalePrice),], aes(x=SalePrice)) +
        geom_histogram(fill="blue", binwidth = 10000) +
        scale_x_continuous(breaks= seq(0, 800000, by=100000), labels = comma)

summary(fulldf$SalePrice)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   34900  129975  163000  180921  214000  755000    1459

numeric = which(sapply(fulldf, is.numeric))

all_num <- fulldf[, numeric]
cor_num <- cor(all_num, use="pairwise.complete.obs")

cor_sorted <- as.matrix(sort(cor_num[,'SalePrice'], decreasing = TRUE))
CorHigh <- names(which(apply(cor_sorted, 1, function(x) abs(x)>0.5)))
cor_num <- cor_num[CorHigh, CorHigh]

corrplot.mixed(cor_num, tl.col="black", tl.pos = "lt",  number.cex = 0.7)

par(mfrow = c(1,3))
plot(train$OverallQual, train$SalePrice,
     xlab = 'Overall Quality', ylab = 'Sale Price')
plot(train$GrLivArea,  train$SalePrice,
     xlab = 'Ground Living Area', ylab = 'Sale Price')
plot(train$GarageCars,    train$SalePrice,
     xlab = 'Garage Cars', ylab = 'Sale Price')

cor_test1 <- cor.test(formula = ~ OverallQual + GarageCars,
                      data = all_num,
                      method = "pearson",
                      conf.level = 0.80)
cor_test2 <- cor.test(formula = ~ OverallQual + GrLivArea,
                      data = all_num,
                      method = "pearson",
                      conf.level = 0.80)
cor_test3 <- cor.test(formula = ~ GarageCars + GrLivArea,
                      data = all_num,
                      method = "pearson",
                      conf.level = 0.80)

cor_test1

## 
##  Pearson's product-moment correlation
## 
## data:  OverallQual and GarageCars
## t = 40.579, df = 2916, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5853570 0.6156978
## sample estimates:
##       cor 
## 0.6007437

cor_test2

## 
##  Pearson's product-moment correlation
## 
## data:  OverallQual and GrLivArea
## t = 37.97, df = 2917, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.5590270 0.5907918
## sample estimates:
##       cor 
## 0.5751261

cor_test3

## 
##  Pearson's product-moment correlation
## 
## data:  GarageCars and GrLivArea
## t = 30.348, df = 2916, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.4716853 0.5077613
## sample estimates:
##      cor 
## 0.489933

All three tests suggest strong evidence of a statistically significant correlation between the two variables. A p-value less than 0.05 is commonly used to indicate statistical significance, but a p-value of less than 2.2e-16 is even stronger evidence of significance.

I would not be worried about familywise error because:

To calculate the FWER, you can use the following formula:

FWER = 1 - (1 - alpha)^m

FWER = 1 - (1 - .05)^3
FWER

## [1] 0.142625

The probability of a type I error is just over 14%, which is low considering only 3 tests were performed.

5 points. Linear Algebra and Correlation. Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

cm <- dplyr::select(all_num, OverallQual, FullBath, GrLivArea)
cor_matrix <- round(cor(cm),digits = 2)
cor_matrix

##             OverallQual FullBath GrLivArea
## OverallQual        1.00     0.53      0.58
## FullBath           0.53     1.00      0.63
## GrLivArea          0.58     0.63      1.00

precision_matrix <- solve(cor_matrix)
precision_matrix

##             OverallQual   FullBath  GrLivArea
## OverallQual   1.6163527 -0.4411402 -0.6595663
## FullBath     -0.4411402  1.7784972 -0.8645919
## GrLivArea    -0.6595663 -0.8645919  1.9272413

cor_pre <- cor_matrix %*% precision_matrix

cor_pre

##              OverallQual      FullBath     GrLivArea
## OverallQual 1.000000e+00 -4.315958e-17 -3.706001e-17
## FullBath    8.699900e-17  1.000000e+00 -1.624157e-16
## GrLivArea   1.110223e-16  0.000000e+00  1.000000e+00

pre_cor <- precision_matrix %*% cor_matrix

pre_cor

##               OverallQual      FullBath     GrLivArea
## OverallQual  1.000000e+00  1.425102e-16  1.110223e-16
## FullBath    -4.315958e-17  1.000000e+00 -1.110223e-16
## GrLivArea   -3.706001e-17 -1.624157e-16  1.000000e+00

lu_cor_pre = matrixcalc::lu.decomposition(cor_pre)

L1 <- lu_cor_pre$L

U1 <- lu_cor_pre$U

print(L1)

##              [,1]         [,2] [,3]
## [1,] 1.000000e+00 0.000000e+00    0
## [2,] 8.699900e-17 1.000000e+00    0
## [3,] 1.110223e-16 4.791676e-33    1

print(U1)

##      [,1]          [,2]          [,3]
## [1,]    1 -4.315958e-17 -3.706001e-17
## [2,]    0  1.000000e+00 -1.624157e-16
## [3,]    0  0.000000e+00  1.000000e+00

print( L1 %*% U1)

##              [,1]          [,2]          [,3]
## [1,] 1.000000e+00 -4.315958e-17 -3.706001e-17
## [2,] 8.699900e-17  1.000000e+00 -1.624157e-16
## [3,] 1.110223e-16  1.522449e-49  1.000000e+00

print(cor_pre)

##              OverallQual      FullBath     GrLivArea
## OverallQual 1.000000e+00 -4.315958e-17 -3.706001e-17
## FullBath    8.699900e-17  1.000000e+00 -1.624157e-16
## GrLivArea   1.110223e-16  0.000000e+00  1.000000e+00

lu_pre_cor = matrixcalc::lu.decomposition(pre_cor)

L2 <- lu_pre_cor$L

U2 <- lu_pre_cor$U

print(L2)

##               [,1]          [,2] [,3]
## [1,]  1.000000e+00  0.000000e+00    0
## [2,] -4.315958e-17  1.000000e+00    0
## [3,] -3.706001e-17 -1.624157e-16    1

print( L2 %*% U2)

##               [,1]          [,2]          [,3]
## [1,]  1.000000e+00  1.425102e-16  1.110223e-16
## [2,] -4.315958e-17  1.000000e+00 -1.110223e-16
## [3,] -3.706001e-17 -1.624157e-16  1.000000e+00

print(pre_cor)

##               OverallQual      FullBath     GrLivArea
## OverallQual  1.000000e+00  1.425102e-16  1.110223e-16
## FullBath    -4.315958e-17  1.000000e+00 -1.110223e-16
## GrLivArea   -3.706001e-17 -1.624157e-16  1.000000e+00

5 points. Calculus-Based Probability & Statistics. Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

train %>%
  dplyr::select(names(all_num)) %>%
  gather() %>%                  
  ggplot(aes(value)) + 
  facet_wrap(~ key, scales = "free") +
  geom_histogram()

The first one that jumped out at me as being skewed to the right was the BsmtUnfSF (Unfinished square feet of basement area). Let’s take a closer look.

ggplot(train, aes(BsmtUnfSF)) + 
  geom_histogram(bins = 30, fill = "steelblue", color = "white") + 
  labs(title = "Histogram of Basement Unfinished Square Feet",
       x = "Basement Unfinished Square Feet",
       y = "Count")

summary(train$BsmtUnfSF)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   223.0   477.5   567.2   808.0  2336.0

fitdistr(train$BsmtUnfSF, "exponential")

##        rate    
##   1.762921e-03 
##  (4.613775e-05)

expo <- rexp(1000, 0.00176292094)
expdf <- data.frame(
    type = c(rep("Original", 1460), rep("Exponential",1000)),
    value = c(train$BsmtUnfSF, expo)
)

ggplot(expdf, aes(x=value, fill=type)) +
    geom_histogram(binwidth = 50)

qexp(.05, rate= 0.00176292094)

## [1] 29.09563

qexp(.95, rate= 0.00176292094)

## [1] 1699.3

sigma <- sd(train$BsmtUnfSF)
n <- length(train$BsmtUnfSF)
ci <- qnorm(0.95)

me <- ci * (sigma/sqrt(n)) 

(mean(train$BsmtUnfSF) + me)

## [1] 586.2618

(mean(train$BsmtUnfSF) - me)

## [1] 548.219

quantile(train$BsmtUnfSF, 0.05)

## 5% 
##  0

quantile(train$BsmtUnfSF, 0.95)

##  95% 
## 1468

10 points. Modeling. Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score. Provide a screen snapshot of your score with your name identifiable.

traindf = read.csv("train_ames_data.csv", stringsAsFactors = F)
testdf = read.csv("test_ames_data.csv", stringsAsFactors = F)

How many columns have text data?

sum(sapply(traindf[,1:81], typeof) == "character")

## [1] 43

How many columns have numerical data?

sum(sapply(traindf[,1:81], typeof) == "integer")

## [1] 38

Test has no ‘Saleslprice’ variable so let’s make it:

testdf$SalePrice<-rep(NA,1459)
ames<-bind_rows(traindf,testdf)

Remove Outliers:

traindf <- traindf[traindf$GrLivArea<=4000,]

How many NAs per column?

colSums(sapply(traindf, is.na))

##            Id    MSSubClass      MSZoning   LotFrontage       LotArea 
##             0             0             0           259             0 
##        Street         Alley      LotShape   LandContour     Utilities 
##             0          1365             0             0             0 
##     LotConfig     LandSlope  Neighborhood    Condition1    Condition2 
##             0             0             0             0             0 
##      BldgType    HouseStyle   OverallQual   OverallCond     YearBuilt 
##             0             0             0             0             0 
##  YearRemodAdd     RoofStyle      RoofMatl   Exterior1st   Exterior2nd 
##             0             0             0             0             0 
##    MasVnrType    MasVnrArea     ExterQual     ExterCond    Foundation 
##             8             8             0             0             0 
##      BsmtQual      BsmtCond  BsmtExposure  BsmtFinType1    BsmtFinSF1 
##            37            37            38            37             0 
##  BsmtFinType2    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF       Heating 
##            38             0             0             0             0 
##     HeatingQC    CentralAir    Electrical     X1stFlrSF     X2ndFlrSF 
##             0             0             1             0             0 
##  LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath 
##             0             0             0             0             0 
##      HalfBath  BedroomAbvGr  KitchenAbvGr   KitchenQual  TotRmsAbvGrd 
##             0             0             0             0             0 
##    Functional    Fireplaces   FireplaceQu    GarageType   GarageYrBlt 
##             0             0           690            81            81 
##  GarageFinish    GarageCars    GarageArea    GarageQual    GarageCond 
##            81             0             0            81            81 
##    PavedDrive    WoodDeckSF   OpenPorchSF EnclosedPorch    X3SsnPorch 
##             0             0             0             0             0 
##   ScreenPorch      PoolArea        PoolQC         Fence   MiscFeature 
##             0             0          1451          1176          1402 
##       MiscVal        MoSold        YrSold      SaleType SaleCondition 
##             0             0             0             0             0 
##     SalePrice 
##             0

Let’s take all the missing data:

NAs <- sapply(traindf,function(x) sum(is.na(x)))
NAdf <- data.frame(index = names(traindf),Missing_Values=NAs)
NAdf[NAdf$Missing_Values > 0,]

##                     index Missing_Values
## LotFrontage   LotFrontage            259
## Alley               Alley           1365
## MasVnrType     MasVnrType              8
## MasVnrArea     MasVnrArea              8
## BsmtQual         BsmtQual             37
## BsmtCond         BsmtCond             37
## BsmtExposure BsmtExposure             38
## BsmtFinType1 BsmtFinType1             37
## BsmtFinType2 BsmtFinType2             38
## Electrical     Electrical              1
## FireplaceQu   FireplaceQu            690
## GarageType     GarageType             81
## GarageYrBlt   GarageYrBlt             81
## GarageFinish GarageFinish             81
## GarageQual     GarageQual             81
## GarageCond     GarageCond             81
## PoolQC             PoolQC           1451
## Fence               Fence           1176
## MiscFeature   MiscFeature           1402

Creating new column for data after combining both to ensure we know if its test or train data:

testdf$SalePrice <- NA
traindf$isTrain <- 1
testdf$isTrain <- 0
ames <- rbind(traindf,testdf)

19 Columns must be ‘fixed’

MasVnrArea

ames$MasVnrArea[which(is.na(ames$MasVnrArea))] <- mean(ames$MasVnrArea,na.rm=T)

Alley

ames$Alley1 <- as.character(ames$Alley)
ames$Alley1[which(is.na(ames$Alley))] <- "None"
ames$Alley <- as.factor(ames$Alley1)
ames <- subset(ames,select = -Alley1)

MasVnrType1

ames$MasVnrType1 <- as.character(ames$MasVnrType)
ames$MasVnrType1[which(is.na(ames$MasVnrType))] <- "None"
ames$MasVnrType <- as.factor(ames$MasVnrType1)
ames <- subset(ames,select = -MasVnrType1)

LotFrontage

ames$LotFrontage[which(is.na(ames$LotFrontage))] <- median(ames$LotFrontage, na.rm = T)

FireplaceQu1

ames$FireplaceQu1 <- as.character(ames$FireplaceQu)
ames$FireplaceQu1[which(is.na(ames$FireplaceQu))] <- "None"
ames$FireplaceQu <- as.factor(ames$FireplaceQu1)
ames <- subset(ames, select = -FireplaceQu1)

PoolQC1

ames$PoolQC1 <- as.character(ames$PoolQC)
ames$PoolQC1[which(is.na(ames$PoolQC))] <- "None"
ames$PoolQC <- as.factor(ames$PoolQC1)
ames <- subset(ames, select = -PoolQC1)

Fence1

ames$Fence1 <- as.character(ames$Fence)
ames$Fence1[which(is.na(ames$Fence))] <- "None"
ames$Fence <- as.factor(ames$Fence1)
ames <- subset(ames, select = -Fence1)

MiscFeature1

ames$MiscFeature1 <- as.character(ames$MiscFeature)
ames$MiscFeature1[which(is.na(ames$MiscFeature))] <- "None"
ames$MiscFeature <- as.factor(ames$MiscFeature1)
ames <- subset(ames,select = -MiscFeature1)

GarageType1

ames$GarageType1 <- as.character(ames$GarageType)
ames$GarageType1[which(is.na(ames$GarageType))] <- "None"
ames$GarageType <- as.factor(ames$GarageType1)
ames <- subset(ames, select = -GarageType1)

GarageYrBlt

ames$GarageYrBlt[which(is.na(ames$GarageYrBlt))] <- 0

GarageFinish1

ames$GarageFinish1 <- as.character(ames$GarageFinish)
ames$GarageFinish1[which(is.na(ames$GarageFinish))] <- "None"
ames$GarageFinish <- as.factor(ames$GarageFinish1)
ames <- subset(ames,select = -GarageFinish1)

GarageQual1

ames$GarageQual1 <- as.character(ames$GarageQual)
ames$GarageQual1[which(is.na(ames$GarageQual))] <- "None"
ames$GarageQual <- as.factor(ames$GarageQual1)
ames <- subset(ames, select = -GarageQual1)

GarageCond1

ames$GarageCond1 <- as.character(ames$GarageCond)
ames$GarageCond1[which(is.na(ames$GarageCond))] <- "None"
ames$GarageCond <- as.factor(ames$GarageCond1)
ames <- subset(ames,select = -GarageCond1)

BsmtQual1

ames$BsmtQual1 <- as.character(ames$BsmtQual)
ames$BsmtQual1[which(is.na(ames$BsmtQual))] <- "None"
ames$BsmtQual <- as.factor(ames$BsmtQual1)
ames <- subset(ames, select = -BsmtQual1)

BsmtCond1

ames$BsmtCond1 <- as.character(ames$BsmtCond)
ames$BsmtCond1[which(is.na(ames$BsmtCond))] <- "None"
ames$BsmtCond <- as.factor(ames$BsmtCond1)
ames <- subset(ames, select = -BsmtCond1)

BsmtExposure1

ames$BsmtExposure1 <- as.character(ames$BsmtExposure)
ames$BsmtExposure1[which(is.na(ames$BsmtExposure))] <- "None"
ames$BsmtExposure <- as.factor(ames$BsmtExposure1)
ames <- subset(ames, select = -BsmtExposure1)

BsmtFinType11

ames$BsmtFinType11 <- as.character(ames$BsmtFinType1)
ames$BsmtFinType11[which(is.na(ames$BsmtFinType1))] <- "None"
ames$BsmtFinType1 <- as.factor(ames$BsmtFinType11)
ames <- subset(ames,select = -BsmtFinType11)

BsmtFinType21

ames$BsmtFinType21 <- as.character(ames$BsmtFinType2)
ames$BsmtFinType21[which(is.na(ames$BsmtFinType2))] <- "None"
ames$BsmtFinType2 <- as.factor(ames$BsmtFinType21)
ames <- subset(ames, select = -BsmtFinType21)

Electrical1

ames$Electrical1 <- as.character(ames$Electrical)
ames$Electrical1[which(is.na(ames$Electrical))] <- "None"
ames$Electrical <- as.factor(ames$Electrical1)
ames <- subset(ames, select = -Electrical1)

Factorize these columns:

ames$MSZoning <- factor(ames$MSZoning)
ames$Street <- factor(ames$Street)
ames$LotShape <- factor(ames$LotShape)
ames$LandContour <- factor(ames$LandContour)
ames$Utilities <- factor(ames$Utilities)
ames$LotConfig <- factor(ames$LotConfig)
ames$LandSlope <- factor(ames$LandSlope)
ames$Neighborhood <- factor(ames$Neighborhood)
ames$Condition1 <- factor(ames$Condition1)
ames$Condition2 <- factor(ames$Condition2)
ames$BldgType <- factor(ames$BldgType)
ames$HouseStyle <- factor(ames$HouseStyle)
ames$RoofStyle <- factor(ames$RoofStyle)
ames$RoofMatl <- factor(ames$RoofMatl)
ames$Exterior1st <- factor(ames$Exterior1st)
ames$Exterior2nd <- factor(ames$Exterior2nd)
ames$ExterQual <- factor(ames$ExterQual)
ames$ExterCond <- factor(ames$ExterCond)
ames$Foundation <- factor(ames$Foundation)
ames$Heating <- factor(ames$Heating)
ames$HeatingQC <- factor(ames$HeatingQC)
ames$CentralAir <- factor(ames$CentralAir)
ames$KitchenQual <- factor(ames$KitchenQual)
ames$Functional <- factor(ames$Functional)
ames$PavedDrive <- factor(ames$PavedDrive)
ames$SaleType <- factor(ames$SaleType)
ames$SaleCondition <- factor(ames$SaleCondition)

i <- 1
na <- 1
for (i in 1:length(ames)){
    na[i] <- ifelse(sum(is.na(ames[i]))>0, sum(is.na(ames[i])), 0)
    if (na[i]>0) { cat(names(ames[i]), '=', na[i],'
')}
}

## MSZoning = 4 
## Utilities = 2 
## Exterior1st = 1 
## Exterior2nd = 1 
## BsmtFinSF1 = 1 
## BsmtFinSF2 = 1 
## BsmtUnfSF = 1 
## TotalBsmtSF = 1 
## BsmtFullBath = 2 
## BsmtHalfBath = 2 
## KitchenQual = 1 
## Functional = 2 
## GarageCars = 1 
## GarageArea = 1 
## SaleType = 1 
## SalePrice = 1459

mice_mod <- mice(ames[, c("MSZoning", "Utilities", "Exterior1st", "Exterior2nd", "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", "TotalBsmtSF", "BsmtFullBath", "BsmtHalfBath", "KitchenQual", "Functional", "GarageCars", "GarageArea", "SaleType")], method='pmm')

## 
##  iter imp variable
##   1   1  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   1   2  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   1   3  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   1   4  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   1   5  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   2   1  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   2   2  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   2   3  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   2   4  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   2   5  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   3   1  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   3   2  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   3   3  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   3   4  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   3   5  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   4   1  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   4   2  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   4   3  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   4   4  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   4   5  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   5   1  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   5   2  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   5   3  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   5   4  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType
##   5   5  MSZoning  Utilities  Exterior1st  Exterior2nd  BsmtFinSF1  BsmtFinSF2  BsmtUnfSF  TotalBsmtSF  BsmtFullBath  BsmtHalfBath  KitchenQual  Functional  GarageCars  GarageArea  SaleType

mice_complete <- complete(mice_mod)

ames$MSZoning = mice_complete$MSZoning
ames$LotFrontage = mice_complete$LotFrontage
ames$Utilities = mice_complete$Utilities
ames$Exterior1st = mice_complete$Exterior1st
ames$Exterior2nd = mice_complete$Exterior2nd
ames$BsmtFinSF1 = mice_complete$BsmtFinSF1
ames$BsmtFinSF2 = mice_complete$BsmtFinSF2
ames$BsmtUnfSF = mice_complete$BsmtUnfSF
ames$TotalBsmtSF = mice_complete$TotalBsmtSF
ames$BsmtFullBath = mice_complete$BsmtFullBath
ames$BsmtHalfBath = mice_complete$BsmtHalfBath
ames$KitchenQual = mice_complete$KitchenQual
ames$Functional = mice_complete$Functional
ames$GarageCars = mice_complete$GarageCars
ames$GarageArea = mice_complete$GarageArea
ames$SaleType = mice_complete$SaleType

i <- 1
na <- 1
for (i in 1:length(ames)){
    na[i] <- ifelse(sum(is.na(ames[i]))>0, sum(is.na(ames[i])), 0)
    if (na[i]>0) { cat(names(ames[i]), '=', na[i],'
')}
}

## SalePrice = 1459

train <- ames[ames$isTrain==1,]
test <- ames[ames$isTrain==0,]

train_indices <- createDataPartition(train$SalePrice, p = 0.8, list = FALSE)
training <- train[train_indices, ]
validation <- train[-train_indices, ]

test <- subset(test, select = -SalePrice)

forest <- randomForest(SalePrice ~ ., data=training, ntree = 400)

SalePrice <- predict(forest, newdata=test)

Id<-test$Id
to_csv <- data.frame(Id, SalePrice)
write_csv(to_csv, '605Final_4.csv')

Just checking if it worked:

testwithid = read.csv("605Final_4.csv", stringsAsFactors = F)
dim(testwithid)

## [1] 1459    2

I submitted 3 times. The first time something didn’t run correctly so I did it again and got a 0.14616 and that was with 400 trees so I thought I would try with 4000 trees because why not? With 4000 trees, the model had a generally improved performance (0.14602) but it also increased the computational cost of training and prediction and was definitely not worth it.
KaggleHouse Prices - Advanced Regression Techniques Score Lets take a look at how the model performed besides our Kaggle Score:

varImpPlot(forest, sort=TRUE, n.var=15,
           type=NULL, class=NULL, scale=TRUE, 
           main= 'Dotchart of variable importance')

This makes perfect sense. The most important predictors in a model are usually the variables most correlated with the independent variable.

My Model:

pred1 <- predict(forest, validation ,type = "response")
residuals <- validation$SalePrice - pred1
forest_pred <- data.frame("Predicted" = pred1, "Actual" = validation$SalePrice, "Residual" = residuals)
accuracy(pred1, validation$SalePrice)

##                 ME     RMSE      MAE       MPE     MAPE
## Test set -1741.781 23504.11 15490.34 -3.605621 9.740062

plot(pred1, validation$SalePrice, main = "Predicted vs. Actual SalePrice") 
abline(0,1)

set.seed(123)
train_idx <- sample(nrow(train), 0.7 * nrow(train))
train_data <- train[train_idx, ]
test_data <- train[-train_idx, ]
target_col <- "SalePrice"
baseline_pred <- rep(mean(train_data[[target_col]]), nrow(test_data))

Baseline Model:

accuracy(baseline_pred, test_data$SalePrice)

##                 ME     RMSE      MAE       MPE     MAPE
## Test set -870.1055 78685.91 57195.77 -18.89121 37.33238

The regression model that I created showed an improvement over the baseline model that was established, as demonstrated above. However, despite the improvement over the baseline model, there is still potential for enhancement in the regression model. One way to increase the performance of the model could be to include some features and not others. Additionally, the model’s performance could be further optimized through hyperparameter tuning, which involves adjusting the model’s settings to achieve better performance, such as through cross-validation techniques. While the current regression model has shown an improvement over the baseline, there remains potential for further optimization and refinement of the model’s features and settings to achieve even better performance.

FinalData605

Mathew Katz

2023-05-10