Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Pick SalePrice as the dependent variable, and define it as Y for the next analysis.
I am choosing the total square feet of basement area, TotalBsmtSF, as my independent variable.
house <- read.csv(train_url, na.strings = "NA") %>%
dplyr::select(TotalBsmtSF, SalePrice) %>%
filter(!is.na(TotalBsmtSF), !is.na(SalePrice))
Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 4th quartile of the X variable, and the small letter “y” is estimated as the 2nd quartile of the Y variable. Interpret the meaning of all probabilities.
First we will calculate the marginal and joint probabilities using the data.
xCentile <- 1
yCentile <- .5
#Create categorical df from data
cat.df <- house %>%
select(TotalBsmtSF, SalePrice) %>%
transmute(X_G_x = ifelse(TotalBsmtSF > quantile(TotalBsmtSF, xCentile), "X > x", "X <= x"),
Y_G_y = ifelse(SalePrice > quantile(SalePrice, yCentile), "Y > y", "Y <= y"))
#Calculate Marginal Probabilities
#P(X > x)
XGx <- sum(house$TotalBsmtSF > quantile(house$TotalBsmtSF,xCentile))/nrow(house)
#P(X <= x)
XLEx <- sum(house$TotalBsmtSF <= quantile(house$TotalBsmtSF,xCentile))/nrow(house)
#P(Y>y)
YGy <- sum(house$SalePrice > quantile(house$SalePrice,yCentile))/nrow(house)
#P(Y <= y)
YLEy <- sum(house$SalePrice <= quantile(house$SalePrice,yCentile))/nrow(house)
#Calculate joint probabilities
XGx_YGy <- sum(cat.df$X_G_x =="X > x" & cat.df$Y_G_y =="Y > y")/nrow(cat.df)
XLEx_YGy <- sum(cat.df$X_G_x =="X <= x" & cat.df$Y_G_y =="Y > y")/nrow(cat.df)
The marginal and joint probabilities for the needed events are as follows:
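For reference, they can be printed directly from the quantities computed above (the vector names are display labels only):
#Print the marginal and joint probabilities for reference
c(P_XGx = XGx, P_XLEx = XLEx, P_YGy = YGy, P_YLEy = YLEy,
P_XGx_YGy = XGx_YGy, P_XLEx_YGy = XLEx_YGy)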
a. \(\space P(X>x|Y>y)\)
This is a conditional probability. It reads as: what is the probability that TotalBsmtSF is greater than its 100th percentile, given that SalePrice is greater than its 50th percentile? More specifically, using the values defined above: what is the probability that TotalBsmtSF is greater than 6110 SF, given that SalePrice is greater than $163,000?
We will use the definition of conditional probability, \(P(X>x|Y>y) = \frac{P(X>x,\, Y>y)}{P(Y>y)}\). Using the values computed above,
\(\therefore P(X>x|Y>y) = \frac{0}{0.4986301} = 0\)
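As a direct check using the quantities computed in the code above:
#P(X > x | Y > y) = P(X > x, Y > y) / P(Y > y)
XGx_YGy / YGy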
b. \(\space P(X>x, Y>y)\)
This is a joint probability problem. It reads as: what is the probability that TotalBsmtSF is greater than its 100th percentile and SalePrice is greater than its 50th percentile? More specifically, using the values defined above: what is the probability that TotalBsmtSF is greater than 6110 SF and SalePrice is greater than $163,000?
This joint probability, \(P(X>x,\, Y>y)\), was calculated directly from the data above.
\(\therefore P(X>x,Y>y) = 0\)
c. \(\space P(X\leq x | Y>y)\)
This is a conditional probability problem. It reads as: what is the probability that TotalBsmtSF is less than or equal to its 100th percentile, given that SalePrice is greater than its 50th percentile? More specifically, using the values defined above: what is the probability that TotalBsmtSF is less than or equal to 6110 SF, given that SalePrice is greater than $163,000?
We will again use the definition of conditional probability, \(P(X\leq x|Y>y) = \frac{P(X\leq x,\, Y>y)}{P(Y>y)}\). Using the values computed above,
\(\therefore P(X\leq x|Y>y) = \frac{0.4986301}{0.4986301} = 1\)
Does splitting the training data in this fashion make them independent? In other words, does P(X|Y)=P(X)? Check mathematically, and then evaluate by running a Chi Square test for association.
Independence requires each joint probability to equal the product of the corresponding marginal probabilities, i.e. \(P(X,Y)=P(X)P(Y)\). We will test this below for the two cases of interest, \(P(X>x,\, Y>y)\) and \(P(X \leq x,\, Y>y)\).
#P(X>x, Y>y) = P(X>x)P(Y>y)?
identical(XGx_YGy, XGx*YGy)
## [1] TRUE
#P(X<=x, Y>y) = P(X<=x)P(Y>y)?
identical(XLEx_YGy, XLEx*YGy)
## [1] TRUE
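Note that identical() compares floating-point numbers exactly; it returns TRUE here only because X > x never occurs, so both sides reduce to exactly 0 and exactly P(Y > y), respectively. In general a tolerance-based comparison is safer, for example:
#Tolerance-based comparison, more robust to floating-point rounding
isTRUE(all.equal(XGx_YGy, XGx * YGy))
isTRUE(all.equal(XLEx_YGy, XLEx * YGy))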
Mathematically, this holds. We can also evaluate it with a Chi-Square test for association: using a contingency table and hypothesis testing, we can determine at a chosen significance level (let's use 0.05) whether the bins are independent.
If the p-value is greater than our significance level, we fail to reject the null hypothesis. The contingency table does not contain a row for \(X > x\) because those counts are all zero.
#Create and print contingency Table
(cont.table <- table(cat.df$X_G_x, cat.df$Y_G_y))
##
## Y <= y Y > y
## X <= x 732 728
chisq.test(cont.table)
##
## Chi-squared test for given probabilities
##
## data: cont.table
## X-squared = 0.010959, df = 1, p-value = 0.9166
The p-value is much greater than our defined significance level of 0.05, so we fail to reject the null hypothesis and conclude that the variables are independent under this split.
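One caveat: because the \(X > x\) row is empty, R collapsed the table to a single row and ran a goodness-of-fit test on the \(Y\) counts (hence "Chi-squared test for given probabilities") rather than a true test of association. The empty row is a direct consequence of choosing x as the 4th quartile, i.e. the maximum: no observation can exceed the maximum, so this quick check returns zero.
#x is the 100th percentile (the maximum) of TotalBsmtSF, so nothing can exceed it
sum(house$TotalBsmtSF > quantile(house$TotalBsmtSF, 1))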
Provide univariate descriptive statistics and appropriate plots for both variables. Provide a scatterplot of X and Y.
Since there are zeros in TotalBsmtSF that would cause problems in the transform, those rows will be dropped; the following analysis therefore applies only to houses that have a basement. Basic descriptive statistics for TotalBsmtSF and SalePrice are:
house <- house %>%
filter(TotalBsmtSF > 0)
#Basic stats for SalePrice
psych::describe(house$SalePrice)
## vars n mean sd median trimmed mad min max
## X1 1 1423 182878.3 79387.53 165000 172684.8 56190.54 34900 755000
## range skew kurtosis se
## X1 720100 1.89 6.54 2104.5
#Basic stats for TotalBsmtSF
psych::describe(house$TotalBsmtSF)
## vars n mean sd median trimmed mad min max range skew
## X1 1 1423 1084.92 409.41 1004 1049.66 341 105 6110 6005 2.17
## kurtosis se
## X1 17.12 10.85
Their histogram and QQ plots are:
x <- "Total Basement SF"
y <- "Sale Price ($)"
#Plots for dependent variable - SalePrice
hist(house$SalePrice, main = paste0("Histogram of ", y), xlab = y)
qqnorm(house$SalePrice)
qqline(house$SalePrice)
#Plots for independent variable - TotalBsmtSF
hist(house$TotalBsmtSF, main = paste0("Histogram of ", x), xlab = x)
qqnorm(house$TotalBsmtSF)
qqline(house$TotalBsmtSF)
It is obvious that neither variable is normally distributed; both are heavily skewed to the right. Since they are right-skewed, we should be able to use power or log-type transforms to produce more normal distributions better suited for regression analysis.
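As a quick preview (a plain log transform of SalePrice, shown only for illustration; the formal Box-Cox fit follows in the next section):
#Preview: a log transform substantially reduces the right skew of SalePrice
hist(log(house$SalePrice), main = "Histogram of log(Sale Price)", xlab = "log(Sale Price)")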
A scatter plot of the two variables, SalePrice ~ TotalBsmtSF, is below.
house.fit <- lm(house$SalePrice ~ house$TotalBsmtSF)
plot(house$SalePrice ~ house$TotalBsmtSF, main = paste0(y, " ~ ", x), xlab = x, ylab = y)
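Since house.fit was created above but not yet drawn, we can overlay its fitted line on the scatter plot:
#Overlay the least-squares line from house.fit
abline(house.fit)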
Transform both variables simultaneously using Box-Cox transformations.
Using Box-Cox transformations on both variables simultaneously, we get the following output.
summary(bc <- powerTransform(cbind(house$TotalBsmtSF, house$SalePrice), family = "bcPower"))
## bcPower Transformations to Multinormality
## Est.Power Std.Err. Wald Lower Bound Wald Upper Bound
## Y1 0.2214 0.0418 0.1395 0.3034
## Y2 -0.0096 0.0434 -0.0948 0.0755
##
## Likelihood ratio tests about transformation parameters
## LRT df pval
## LR test, lambda = (0 0) 29.03340560 2 4.959936e-07
## LR test, lambda = (1 1) 825.98217784 2 0.000000e+00
## LR test, lambda = (0.22 0) 0.04932042 2 9.756414e-01
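For reference, powerTransform() estimates \(\lambda\) in the Box-Cox (bcPower) family
\[
\psi(x,\lambda) =
\begin{cases}
\dfrac{x^{\lambda}-1}{\lambda}, & \lambda \neq 0 \\
\log(x), & \lambda = 0
\end{cases}
\]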
The rounded power of 0 for SalePrice (Y2) therefore corresponds to a log transform, while TotalBsmtSF (Y1) is transformed with a power of roughly 0.22. Let us look at the distributions of the variables after the power transforms.
house.new <- house %>%
transmute(TotalBsmtSF_trans = bcPower(TotalBsmtSF, bc$roundlam[1]),
SalePrice_trans = bcPower(SalePrice, bc$roundlam[2]))
house.new.fit <- lm(house.new$SalePrice_trans ~ house.new$TotalBsmtSF_trans)
x <- paste0("(", x, ")", "^", round(bc$roundlam[1], 4))
y <- paste0("log(", y, ")")
hist(house.new$TotalBsmtSF_trans, main = paste0("Histogram of ", x), xlab = x)
hist(house.new$SalePrice_trans, main = paste0("Histogram of ", y), xlab = y)
Both distributions are now approximately normal. Let us take a look at the new scatter plot.
plot(house.new$SalePrice_trans ~ house.new$TotalBsmtSF_trans,
main = paste0(y, " ~ ",x), xlab = x, ylab = y)
abline(house.new.fit)
Using the transformed variables, run a correlation analysis and interpret. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis
First, we will establish a hypothesis test:
\(H_0 \rightarrow \beta_1 = 0\): The slope is equal to zero, i.e. there is no linear relationship between TotalBsmtSF and SalePrice.
\(H_A \rightarrow \beta_1 \neq 0\): The slope is not equal to zero, i.e. there is a linear relationship between TotalBsmtSF and SalePrice.
In simple linear regression, testing that the slope is zero is equivalent to testing that the correlation is zero. We will treat a p-value < 0.05 as statistically significant and a p-value < 0.01 as statistically highly significant. The summary of the linear model and its p-value is as follows:
house.cor <- cor(house.new)
summary(house.new.fit)
##
## Call:
## lm(formula = house.new$SalePrice_trans ~ house.new$TotalBsmtSF_trans)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.51713 -0.20119 -0.04082 0.23716 1.03920
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.656980 0.081958 117.83 <2e-16 ***
## house.new$TotalBsmtSF_trans 0.144437 0.004949 29.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3115 on 1421 degrees of freedom
## Multiple R-squared: 0.3748, Adjusted R-squared: 0.3744
## F-statistic: 851.9 on 1 and 1421 DF, p-value: < 2.2e-16
The resulting p-value is far below 0.001, so the result is highly statistically significant and we reject \(H_0\) in favor of \(H_A\).
Looking at the \(R^2\) value of 0.3748 and the corresponding Pearson correlation coefficient of 0.6122091, we conclude that there is a fairly strong positive correlation between TotalBsmtSF and SalePrice and that the variability of SalePrice is moderately explained by TotalBsmtSF.
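The question also asks us to test directly that the correlation is zero and to provide a 99% confidence interval. A minimal sketch using cor.test(), which in simple linear regression is equivalent to the slope test above:
#Pearson correlation test with a 99% confidence interval
cor.test(house.new$TotalBsmtSF_trans, house.new$SalePrice_trans,
method = "pearson", conf.level = 0.99)
Given the t-statistic above, the resulting interval will exclude 0, consistent with rejecting \(H_0\).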
Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix
Assuming our square correlation matrix is invertible, the process outlined above should produce the identity matrix, since \(AA^{-1} = A^{-1}A = I\).
#Invert Correlation Matrix
(house.cor.prec <- solve(house.cor))
## TotalBsmtSF_trans SalePrice_trans
## TotalBsmtSF_trans 1.5994881 -0.9792211
## SalePrice_trans -0.9792211 1.5994881
#A*A^-1
(AAn1 <- house.cor%*%house.cor.prec)
## TotalBsmtSF_trans SalePrice_trans
## TotalBsmtSF_trans 1.000000e+00 0
## SalePrice_trans -1.110223e-16 1
#A^-1*A
(An1A <- house.cor.prec%*%house.cor)
## TotalBsmtSF_trans SalePrice_trans
## TotalBsmtSF_trans 1 -1.110223e-16
## SalePrice_trans 0 1.000000e+00
We can see that this process does produce the identity matrix, once the extremely small off-diagonal values are dismissed as floating-point rounding error.
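As a side note on the prompt's mention of variance inflation factors: for a 2 x 2 correlation matrix, the diagonal of the precision matrix is simply \(\frac{1}{1-r^2}\), which matches the 1.5994881 on the diagonal above:
#For two variables, the precision-matrix diagonal equals 1 / (1 - r^2)
1 / (1 - house.cor[1, 2]^2)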
Many times, it makes sense to fit a closed form distribution to data. For your non-transformed independent variable, location shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit a density function of your choice. Find the optimal value of the parameters for this distribution, and then take 1000 samples from this distribution (e.g., rexp(1000, ) for an exponential). Plot a histogram and compare it with a histogram of your non-transformed original variable.
Using the fitdistrplus package (which extends the fitdistr() function from the MASS package), we will explore possible fits.
#Since zeros were dropped previously, the data is reloaded
#Shift all values by 1 and transform to vector
dist.var <- read.csv(train_url, na.strings = "NA") %>%
transmute(TotalBsmtSF = TotalBsmtSF + 1)
dist.var <- as.vector(dist.var$TotalBsmtSF)
plotdist(dist.var, histo = TRUE, demp = TRUE)
From the empirical density above, our distribution is right skewed and appears to be an exponential-type distribution. The Cullen and Frey graph below (produced with the descdist function) is a good way to rule out some distributions based on their skewness and kurtosis; the bootstrapped values come from random samples (with replacement) of the data. Combining the Cullen and Frey graph with the empirical plots above, we narrow the candidate fits down to the Weibull, gamma, and logistic distributions available in the fitdistrplus package.
descdist(dist.var, boot = 500)
## summary statistics
## ------
## min: 1 max: 6111
## median: 992.5
## mean: 1058.429
## estimated sd: 438.7053
## estimated skewness: 1.524255
## estimated kurtosis: 16.25048
These three candidate distributions are fit and compared using four classical goodness-of-fit plots, the most important being the density and CDF comparisons.
fw <- fitdist(dist.var, "weibull")
fg <- fitdist(dist.var, "gamma", method = "mme")
fl <- fitdist(dist.var, "logis")
par(mfrow = c(2, 2))
plot.legend <- c("Weibull", "gamma", "logistic")
denscomp(list(fw, fg, fl), legendtext = plot.legend)
qqcomp(list(fw, fg, fl), legendtext = plot.legend)
cdfcomp(list(fw, fg, fl), legendtext = plot.legend)
ppcomp(list(fw, fg, fl), legendtext = plot.legend)
The gamma distribution is chosen to fit the data due to its good fit to both the density and the CDF.
The relevant parameter estimates for the fit are below. We will randomly sample from this gamma distribution using those estimates. We will exclude the single large outlier in dist.var from the x-axis limits to make the histograms easier to compare.
fg
## Fitting of the distribution ' gamma ' by matching moments
## Parameters:
## estimate
## shape 5.824727137
## rate 0.005503179
#Distribution Sample
gamma.samp <- rgamma(1000, shape = fg$estimate[1], rate = fg$estimate[2])
#Plots both hists side by side
par(mfrow = c(1, 2))
hist(dist.var, main = "Histogram of Total Basement SF", xlab = "Total Basement SF",
xlim = c(0, max(dist.var[dist.var < max(dist.var)])))
hist(gamma.samp, breaks = 5, main = "Histogram of gamma.samp", xlab = "gamma.samp")
From the histograms above, the gamma distribution appears to be a good fit for this data, i.e. the shape and range roughly match the original data set.
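For a more quantitative comparison than the visual one, fitdistrplus also provides gofstat(), which reports goodness-of-fit statistics (Kolmogorov-Smirnov, Cramer-von Mises, Anderson-Darling) and AIC/BIC for each candidate fit; a quick sketch:
#Goodness-of-fit statistics for the three candidate distributions
gofstat(list(fw, fg, fl), fitnames = c("Weibull", "gamma", "logistic"))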
Read in the data again and divide it into factor and numeric variables.
train <- read.csv(train_url, na.strings = "NA")
test <- read.csv(test_url, na.strings = "NA")
#Numerics that should be categories - done by inspection
cat_vars <- c("MSSubClass", "MoSold")
#Merge with obvious categories
cat_vars <- c(cat_vars, names(train)[which(sapply(train, is.factor))])
num_vars <- names(train)[!(names(train) %in% cat_vars) &
(names(train) != "SalePrice") &
(names(train) != "Id")]
#Subset non-numeric to factor dataframe
#Train
train_cat <- subset(train, select = names(train) %in% cat_vars)
train_cat <- as.data.frame(lapply(train_cat, factor))
#Test
test_cat <- subset(test, select = names(test) %in% cat_vars)
test_cat <- as.data.frame(lapply(test_cat, factor))
#Subset numerics
#Train
train_num <- subset(train, select = names(train) %in% num_vars)
train_num <- as.data.frame(lapply(train_num, as.numeric))
#Test
test_num <- subset(test, select = names(test) %in% num_vars)
test_num <- as.data.frame(lapply(test_num, as.numeric))
We will combine the test and training data to carry out the same operations. We will look at the distribution of the numeric variables.
#Combine test and train
train_num$isTrain <- TRUE
test_num$isTrain <- FALSE
combined_num <- rbind(train_num, test_num)
train_cat$isTrain <- TRUE
test_cat$isTrain <- FALSE
combined_cat <- rbind(train_cat, test_cat)
combined_num[1:7] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~key, scales = "free", ncol = 3) +
geom_density()
## Warning: Removed 509 rows containing non-finite values (stat_density).
combined_num[8:14] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~key, scales = "free", ncol = 3) +
geom_density()
## Warning: Removed 4 rows containing non-finite values (stat_density).
combined_num[15:21] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~key, scales = "free", ncol = 3) +
geom_density()
## Warning: Removed 4 rows containing non-finite values (stat_density).
combined_num[22:28] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~key, scales = "free", ncol = 3) +
geom_density()
## Warning: Removed 161 rows containing non-finite values (stat_density).
combined_num[29:35] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~key, scales = "free", ncol = 3) +
geom_density()
They all seem skewed. We will use power transforms down the line to try to normalize them.
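To put numbers behind the visual impression, we can compute the sample skewness of each numeric variable (a quick sketch using psych::skew(); the psych package was already used for describe() above):
#Sample skewness of each numeric predictor (isTrain flag excluded)
sort(sapply(combined_num[, names(combined_num) != "isTrain"],
psych::skew, na.rm = TRUE), decreasing = TRUE)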
Let’s look at the missing values in the numeric and categorical variables.
#Look at NA values in numeric
colSums(sapply(combined_num, is.na))
## LotFrontage LotArea OverallQual OverallCond YearBuilt
## 486 0 0 0 0
## YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## 0 23 1 1 1
## TotalBsmtSF X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea
## 1 0 0 0 0
## BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 2 2 0 0 0
## KitchenAbvGr TotRmsAbvGrd Fireplaces GarageYrBlt GarageCars
## 0 0 0 159 1
## GarageArea WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 1 0 0 0 0
## ScreenPorch PoolArea MiscVal YrSold isTrain
## 0 0 0 0 0
colSums(sapply(combined_cat, is.na))
## MSSubClass MSZoning Street Alley LotShape
## 0 4 0 2721 0
## LandContour Utilities LotConfig LandSlope Neighborhood
## 0 2 0 0 0
## Condition1 Condition2 BldgType HouseStyle RoofStyle
## 0 0 0 0 0
## RoofMatl Exterior1st Exterior2nd MasVnrType ExterQual
## 0 1 1 24 0
## ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 0 0 81 82 82
## BsmtFinType1 BsmtFinType2 Heating HeatingQC CentralAir
## 79 80 0 0 0
## Electrical KitchenQual Functional FireplaceQu GarageType
## 1 1 2 1420 157
## GarageFinish GarageQual GarageCond PavedDrive PoolQC
## 159 159 159 0 2909
## Fence MiscFeature MoSold SaleType SaleCondition
## 2348 2814 0 1 0
## isTrain
## 0
We will force the NAs to the level “None” in order to proceed with the regression.
#Change na to "None" for categories
for(j in 1:ncol(combined_cat)) {
if(any(is.na(combined_cat[,j]))) {
levels(combined_cat[,j]) <- c(levels(combined_cat[,j]), "None")
combined_cat[is.na(combined_cat[,j]),j] <- "None"
}
}
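A quick sanity check that the replacement worked and no NAs remain in the categorical data:
#Should return zero after the loop above
sum(is.na(combined_cat))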
We will use Box-Cox transforms to bring the numeric variables closer to normality.
# Transform numerics by power transform
combined_num_trans <- combined_num
#Replace NAs with (column mean + 1) and shift everything else by +1 so all values are strictly positive
for (i in 1:(ncol(combined_num_trans)-1)) {
combined_num_trans[,i] <- ifelse(
is.na(combined_num_trans[,i]), mean(combined_num_trans[,i], na.rm = TRUE)+1, combined_num_trans[,i] + 1)
bc <- powerTransform(combined_num_trans[,i] ~ 1, family = "bcPower")
combined_num_trans[,i] <- as.data.frame(bcPower(combined_num_trans[,i], bc$roundlam))
}
## Warning in estimateTransform.default(X, Y, weights, family, start,
## method, : Convergence failure: return code = 52
#Density plots of the transformed numeric variables
combined_num_trans[1:7] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~key, scales = "free", ncol = 3) +
geom_density()
combined_num_trans[8:14] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~key, scales = "free", ncol = 3) +
geom_density()
combined_num_trans[15:21] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~key, scales = "free", ncol = 3) +
geom_density()
combined_num_trans[22:28] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~key, scales = "free", ncol = 3) +
geom_density()
combined_num_trans[29:35] %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~key, scales = "free", ncol = 3) +
geom_density()
There seems to be a slight improvement. We will combine our restructured numeric and categorical data into their respective training and testing data frames.
train_transf <- cbind(
combined_num_trans[combined_num_trans$isTrain == TRUE,] %>% select(-isTrain),
combined_cat[combined_cat$isTrain == TRUE,] %>% select(-isTrain),
y = log(train$SalePrice))
test_transf <- cbind(
combined_num_trans[combined_num_trans$isTrain == FALSE,] %>% select(-isTrain),
combined_cat[combined_cat$isTrain == FALSE,]) %>% select(-isTrain)
Due to the number of variables present, we will use leaps's regsubsets() to select the variables for our model. We would prefer the exhaustive approach, but its run time is prohibitively slow, so we will use backward selection instead and then plot the residuals from the resulting model.
#Find relevant variables in model
#Need to transform name due to factor appending
names(train_transf) <- paste0(colnames(train_transf), "__")
train_regsub <- regsubsets(
y__ ~ .,
data = train_transf,
method = "backward",
nvmax = NULL)
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 17 linear dependencies found
## Reordering variables and trying again:
## Warning in rval$lopt[] <- rval$vorder[rval$lopt]: number of items to
## replace is not a multiple of replacement length
train.sum <- summary(train_regsub)
#Find best variables and use those
best <- train.sum$which[which.max(-train.sum$bic),]
best <- unique(gsub("_.*", "", names(best[best]), perl = TRUE ))
colnames(train_transf) <- gsub("__", "", colnames(train_transf), perl = TRUE)
best_train <- train_transf[,colnames(train_transf) %in% best] %>%
mutate(y = log(train$SalePrice))
#Build Model
summary(train.lm <- lm(y ~., best_train))
##
## Call:
## lm(formula = y ~ ., data = best_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.69960 -0.04473 0.00242 0.05334 0.58197
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.063e+00 2.655e-01 19.067 < 2e-16 ***
## LotArea 2.610e-02 3.015e-03 8.660 < 2e-16 ***
## OverallQual 6.332e-02 7.103e-03 8.915 < 2e-16 ***
## OverallCond 1.065e-01 9.178e-03 11.607 < 2e-16 ***
## YearBuilt 8.197e-73 1.186e-73 6.912 7.44e-12 ***
## YearRemodAdd 8.239e-122 3.025e-122 2.724 0.006539 **
## BsmtFinSF1 4.120e-03 5.781e-04 7.126 1.69e-12 ***
## TotalBsmtSF 5.013e-04 8.846e-05 5.667 1.78e-08 ***
## GrLivArea 4.255e-01 2.087e-02 20.386 < 2e-16 ***
## BsmtFullBath 5.994e-02 1.780e-02 3.367 0.000782 ***
## KitchenAbvGr -2.829e-01 1.811e-01 -1.562 0.118500
## Fireplaces 4.350e-02 1.161e-02 3.746 0.000187 ***
## GarageCars 3.496e-02 5.318e-03 6.574 7.04e-11 ***
## WoodDeckSF 4.043e-03 1.525e-03 2.651 0.008120 **
## ScreenPorch 8.327e-02 2.379e-02 3.500 0.000480 ***
## PoolArea 6.196e+00 1.652e+00 3.750 0.000184 ***
## MSSubClass30 -3.252e-02 1.967e-02 -1.653 0.098529 .
## MSSubClass40 -4.979e-02 5.646e-02 -0.882 0.377951
## MSSubClass45 3.119e-02 3.531e-02 0.883 0.377176
## MSSubClass50 -2.120e-02 1.633e-02 -1.299 0.194301
## MSSubClass60 -7.056e-03 1.411e-02 -0.500 0.617039
## MSSubClass70 -7.391e-03 2.294e-02 -0.322 0.747326
## MSSubClass75 -2.691e-02 3.475e-02 -0.774 0.438803
## MSSubClass80 -3.025e-03 1.690e-02 -0.179 0.857974
## MSSubClass85 1.965e-02 2.540e-02 0.773 0.439394
## MSSubClass90 -4.780e-02 2.341e-02 -2.042 0.041377 *
## MSSubClass120 -1.602e-02 1.716e-02 -0.933 0.350868
## MSSubClass160 -7.185e-02 2.578e-02 -2.787 0.005397 **
## MSSubClass180 2.218e-02 4.511e-02 0.492 0.623017
## MSSubClass190 -5.556e-02 2.675e-02 -2.077 0.037995 *
## MSZoningFV 4.622e-01 5.087e-02 9.087 < 2e-16 ***
## MSZoningRH 4.224e-01 5.068e-02 8.336 < 2e-16 ***
## MSZoningRL 4.185e-01 4.277e-02 9.785 < 2e-16 ***
## MSZoningRM 3.698e-01 3.990e-02 9.266 < 2e-16 ***
## LandSlopeMod 1.734e-02 1.491e-02 1.162 0.245287
## LandSlopeSev -8.642e-02 3.805e-02 -2.271 0.023310 *
## NeighborhoodBlueste 3.642e-02 8.389e-02 0.434 0.664238
## NeighborhoodBrDale 3.736e-02 4.878e-02 0.766 0.443909
## NeighborhoodBrkSide 5.205e-02 3.965e-02 1.313 0.189513
## NeighborhoodClearCr 1.245e-02 3.889e-02 0.320 0.748937
## NeighborhoodCollgCr -1.378e-02 3.117e-02 -0.442 0.658614
## NeighborhoodCrawfor 1.064e-01 3.651e-02 2.914 0.003629 **
## NeighborhoodEdwards -7.139e-02 3.442e-02 -2.074 0.038244 *
## NeighborhoodGilbert -2.651e-02 3.295e-02 -0.805 0.421174
## NeighborhoodIDOTRR 2.121e-03 4.548e-02 0.047 0.962819
## NeighborhoodMeadowV -7.239e-02 5.129e-02 -1.412 0.158332
## NeighborhoodMitchel -6.454e-02 3.494e-02 -1.847 0.064935 .
## NeighborhoodNAmes -2.984e-02 3.340e-02 -0.893 0.371884
## NeighborhoodNoRidge 9.331e-02 3.521e-02 2.650 0.008139 **
## NeighborhoodNPkVill 4.198e-02 4.768e-02 0.880 0.378830
## NeighborhoodNridgHt 6.430e-02 3.114e-02 2.065 0.039146 *
## NeighborhoodNWAmes -3.909e-02 3.442e-02 -1.136 0.256306
## NeighborhoodOldTown -9.601e-03 4.028e-02 -0.238 0.811631
## NeighborhoodSawyer -2.090e-02 3.504e-02 -0.596 0.550970
## NeighborhoodSawyerW -9.378e-03 3.382e-02 -0.277 0.781631
## NeighborhoodSomerst 2.476e-02 3.812e-02 0.650 0.516114
## NeighborhoodStoneBr 1.328e-01 3.518e-02 3.776 0.000167 ***
## NeighborhoodSWISU 1.445e-02 4.069e-02 0.355 0.722591
## NeighborhoodTimber -2.723e-02 3.508e-02 -0.776 0.437751
## NeighborhoodVeenker 4.515e-03 4.504e-02 0.100 0.920180
## Condition1Feedr 4.209e-02 2.135e-02 1.971 0.048910 *
## Condition1Norm 8.347e-02 1.773e-02 4.707 2.78e-06 ***
## Condition1PosA 5.200e-02 4.285e-02 1.214 0.225136
## Condition1PosN 9.456e-02 3.170e-02 2.983 0.002908 **
## Condition1RRAe -4.739e-02 3.955e-02 -1.198 0.231084
## Condition1RRAn 3.504e-02 2.923e-02 1.199 0.230910
## Condition1RRNe 4.014e-02 7.720e-02 0.520 0.603211
## Condition1RRNn 6.397e-02 5.235e-02 1.222 0.221918
## Condition2Feedr 1.022e-01 9.700e-02 1.053 0.292407
## Condition2Norm 9.972e-02 8.288e-02 1.203 0.229142
## Condition2PosA 4.482e-01 1.363e-01 3.289 0.001031 **
## Condition2PosN -6.091e-01 1.159e-01 -5.257 1.71e-07 ***
## Condition2RRAe 2.879e-02 1.349e-01 0.213 0.830983
## Condition2RRAn 1.195e-02 1.357e-01 0.088 0.929829
## Condition2RRNn 1.569e-01 1.137e-01 1.380 0.167832
## RoofMatlCompShg 1.876e+00 1.272e-01 14.745 < 2e-16 ***
## RoofMatlMembran 2.102e+00 1.725e-01 12.186 < 2e-16 ***
## RoofMatlMetal 2.026e+00 1.706e-01 11.872 < 2e-16 ***
## RoofMatlRoll 1.869e+00 1.674e-01 11.165 < 2e-16 ***
## RoofMatlTar&Grv 1.875e+00 1.324e-01 14.162 < 2e-16 ***
## RoofMatlWdShake 1.962e+00 1.389e-01 14.125 < 2e-16 ***
## RoofMatlWdShngl 1.995e+00 1.332e-01 14.973 < 2e-16 ***
## Exterior1stAsphShn -9.205e-03 1.096e-01 -0.084 0.933101
## Exterior1stBrkComm -1.935e-01 8.481e-02 -2.282 0.022661 *
## Exterior1stBrkFace 8.983e-02 3.091e-02 2.906 0.003723 **
## Exterior1stCBlock -2.993e-02 1.094e-01 -0.274 0.784488
## Exterior1stCemntBd 2.696e-02 3.224e-02 0.836 0.403056
## Exterior1stHdBoard 1.676e-03 2.804e-02 0.060 0.952334
## Exterior1stImStucc -1.154e-02 1.079e-01 -0.107 0.914852
## Exterior1stMetalSd 2.623e-02 2.733e-02 0.960 0.337411
## Exterior1stPlywood 5.355e-03 2.955e-02 0.181 0.856229
## Exterior1stStone 4.097e-03 8.448e-02 0.048 0.961334
## Exterior1stStucco 2.883e-02 3.454e-02 0.835 0.403971
## Exterior1stVinylSd 1.488e-02 2.755e-02 0.540 0.589308
## Exterior1stWd Sdng -4.169e-03 2.720e-02 -0.153 0.878227
## Exterior1stWdShing 1.131e-02 3.430e-02 0.330 0.741562
## BsmtQualFa -3.685e-02 2.649e-02 -1.391 0.164439
## BsmtQualGd -5.649e-02 1.378e-02 -4.101 4.37e-05 ***
## BsmtQualTA -5.876e-02 1.711e-02 -3.435 0.000612 ***
## BsmtQualNone 5.767e-02 1.089e-01 0.529 0.596559
## BsmtExposureGd 4.804e-02 1.280e-02 3.754 0.000182 ***
## BsmtExposureMn 3.462e-03 1.312e-02 0.264 0.791966
## BsmtExposureNo -1.188e-02 9.604e-03 -1.237 0.216452
## BsmtExposureNone -8.834e-02 1.035e-01 -0.853 0.393544
## HeatingGasA 9.538e-02 1.087e-01 0.877 0.380454
## HeatingGasW 1.783e-01 1.115e-01 1.599 0.110037
## HeatingGrav -3.215e-02 1.179e-01 -0.273 0.785101
## HeatingOthW 2.900e-02 1.348e-01 0.215 0.829709
## HeatingWall 1.624e-01 1.208e-01 1.345 0.178878
## HeatingQCFa -3.229e-02 1.979e-02 -1.632 0.103006
## HeatingQCGd -2.040e-02 9.023e-03 -2.261 0.023906 *
## HeatingQCPo -3.376e-02 1.166e-01 -0.289 0.772322
## HeatingQCTA -3.155e-02 8.863e-03 -3.559 0.000385 ***
## CentralAirY 4.952e-02 1.590e-02 3.115 0.001877 **
## KitchenQualFa -6.779e-02 2.566e-02 -2.642 0.008345 **
## KitchenQualGd -7.674e-02 1.417e-02 -5.415 7.27e-08 ***
## KitchenQualTA -7.852e-02 1.621e-02 -4.843 1.43e-06 ***
## FunctionalMaj2 -2.271e-01 5.867e-02 -3.871 0.000114 ***
## FunctionalMin1 3.329e-03 3.584e-02 0.093 0.926009
## FunctionalMin2 6.031e-03 3.543e-02 0.170 0.864863
## FunctionalMod -6.523e-02 4.216e-02 -1.547 0.122115
## FunctionalSev -3.144e-01 1.158e-01 -2.715 0.006704 **
## FunctionalTyp 5.845e-02 3.073e-02 1.902 0.057362 .
## GarageQualFa -2.348e-01 6.542e-02 -3.590 0.000343 ***
## GarageQualGd -1.397e-01 7.050e-02 -1.982 0.047699 *
## GarageQualPo -1.480e-01 9.249e-02 -1.600 0.109731
## GarageQualTA -1.941e-01 6.340e-02 -3.061 0.002249 **
## GarageQualNone -2.193e-01 6.548e-02 -3.349 0.000835 ***
## SaleTypeCon 7.874e-02 7.790e-02 1.011 0.312288
## SaleTypeConLD 1.133e-01 4.170e-02 2.716 0.006696 **
## SaleTypeConLI -2.553e-02 5.046e-02 -0.506 0.612945
## SaleTypeConLw 8.498e-03 5.184e-02 0.164 0.869806
## SaleTypeCWD 5.153e-02 5.643e-02 0.913 0.361338
## SaleTypeNew 1.506e-01 6.735e-02 2.235 0.025552 *
## SaleTypeOth 7.756e-02 6.382e-02 1.215 0.224466
## SaleTypeWD -1.066e-02 1.807e-02 -0.590 0.555210
## SaleConditionAdjLand 1.037e-01 6.051e-02 1.714 0.086798 .
## SaleConditionAlloca 4.961e-02 3.575e-02 1.388 0.165390
## SaleConditionFamily 5.446e-03 2.670e-02 0.204 0.838427
## SaleConditionNormal 6.027e-02 1.248e-02 4.831 1.52e-06 ***
## SaleConditionPartial -5.972e-02 6.503e-02 -0.918 0.358603
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1026 on 1319 degrees of freedom
## Multiple R-squared: 0.9404, Adjusted R-squared: 0.9341
## F-statistic: 148.7 on 140 and 1319 DF, p-value: < 2.2e-16
plot(train.lm$residuals ~ train.lm$fitted.values)
abline(h = 0) #Residuals appear randomly scattered about zero with roughly constant variance (possible overfit given the high R-squared?)
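As an additional diagnostic (a quick sketch), a normal Q-Q plot of the residuals helps check the normality assumption behind the coefficient p-values:
#Normal Q-Q plot of the model residuals
qqnorm(train.lm$residuals)
qqline(train.lm$residuals)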
Our residuals look pretty good, but we have a high \(R^2\), which may mean we have overfit our model. Let us use the model to predict SalePrice on the test data. We have to force several factor levels in the test data onto levels used in the model; this is done by inspection.
test_df <- test_transf[,names(test_transf) %in% names(best_train)]
#Abort attempt to unionize factors -> yields NAs in results
#union new test levels with model levels
# for (k in 1:ncol(test_df)) {
# if(is.factor(best_train[,k])) {
# if(!all(levels(test_df[,k]) %in% levels(best_train))) {
# train.lm$xlevels[[colnames(test_df)[k]]] <- union(
# train.lm$xlevels[[colnames(test_df)[k]]],
# levels(test_df[,k])
# )
# }
# }
# }
#Got errors on new factor (150) in MSSubClass in test -> add to train
# levels(train_transf$MSSubClass) <- c(levels(train_transf$MSSubClass), "150")
# Did not work. 150 is a 1 1/2 Planned unit dwelling (PUD). 1 1/2 corresponds to
# Ceiling height (<= 7'?) not meeting code
#-> Assume it matches code -> force to 160 (2 story PUD dwelling)
test_df$MSSubClass[test_df$MSSubClass == "150"] <- "160"
#Same with MSZoning -> Force to most common value, RL
test_df$MSZoning[test_df$MSZoning == "None"] <- "RL"
#Same for Exterior1st, Change to most common -> VinylSd
test_df$Exterior1st[test_df$Exterior1st == "None"] <- "VinylSd"
#Same for KitchenQual, Change from None to TA; Typical/Avg
test_df$KitchenQual[test_df$KitchenQual == "None"] <- "TA"
#Same for Functional, Change from None to Typical
test_df$Functional[test_df$Functional == "None"] <- "Typ"
#Same for SaleType, Change to Other;
test_df$SaleType[test_df$SaleType == "None"] <- "Oth"
#Get predicted SalePrice
test_SalePrice <- exp(predict(train.lm,
newdata = test_df,
type = "response"))
#Compare statistics of predicted relative to train SalePrice to see if same ranges.
summary(train$SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34900 130000 163000 180900 214000 755000
summary(test_SalePrice)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 37350 126700 158500 179100 210000 811800
The predicted values seem to be within the range one would expect. Next we create a submission .csv for Kaggle.
#Submit csv
test_submit <- data.frame(Id = seq(1461, 1461+length(test_SalePrice)-1, 1),
SalePrice = test_SalePrice)
write.table(test_submit, file = "LByrne_Kaggle_Submit.csv",
sep = ",", row.names = FALSE)
#Kaggle Score of 0.12614 :(
The Kaggle score received for this method was 0.12614, which places it around the top 40th percentile. Other methods were tried, but they led to similar scores. They are outlined below, carried out using examples found on the web, and are shown as proof of effort but not discussed.
#*******************
#Try randomForest
train_forest <- randomForest(y ~ ., data = train_transf,
importance = TRUE)
test_forest_pred <- exp(predict(train_forest, test_transf))
test_forest_submit <- data.frame(Id = seq(1461, 1461+length(test_forest_pred)-1, 1),
SalePrice = test_forest_pred)
write.table(test_submit, file = "LByrne_Forest_Kaggle_Submit.csv",
sep = ",", row.names = FALSE)
#------------------
#Try party package
train_party.fit <- cforest(y ~ ., data = train_transf,
controls = cforest_unbiased(ntree = 2000))
test_party_pred <- exp(predict(train_party.fit, test_transf, OOB = TRUE, type = "response"))
test_party_submit <- data.frame(Id = seq(1461, 1461+length(test_party_pred)-1, 1),
SalePrice = test_party_pred)
write.table(test_submit, file = "LByrne_Party_Kaggle_Submit.csv",
sep = ",", row.names = FALSE)