Problem 1

Pick one of the quantitative independent variables (Xi) from the data set below, and define that variable as X. Also, pick one of the dependent variables (Yi) below, and define that as Y.

Probability. Calculate as a minimum the probabilities a through c below. Assume the small letter “x” is estimated as the 3rd quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

Problem 1.1 Probability

Variable Selections & Calculation

\[\text{Let } X = X4,\ Y = Y1\]

x <- as.numeric(quantile(data1$X4, 0.75))

y <- as.numeric(quantile(data1$Y1, 0.25))

paste0("x = ", x)
## [1] "x = 12.525"
paste0("y = ", y)
## [1] "y = 18.55"

1.1 a. P(X>x | Y>y)

This is the probability that X is greater than its 3rd quartile, given that Y is greater than its 1st quartile.

\[\begin{align} P(X>x | Y>y) &= \frac{P(Y > y \space and \space X > x)}{P(Y>y)} \\ \\ &= \frac{P(Y > 18.55 \space and \space X > 12.525)}{P(Y>18.55)}\\ \\ &= \frac{0.2}{0.75} \\ \\ &= \frac{4}{15} \end{align}\]

Solution in R P(Y > y and X > x)

pYANDX <- nrow( subset( data1, Y1 > y & X4 > x) ) / nrow(data1)
pYANDX
## [1] 0.2

P(Y > y)

pY <- nrow( subset( data1, data1$Y1 > y ) ) / nrow(data1) 
pY
## [1] 0.75

P(X>x | Y>y)

pYANDX/pY
## [1] 0.2666667

1.1 b. P(X>x, Y>y)

This is the probability that an entry in X is greater than its 3rd quartile and the corresponding entry in Y is greater than its 1st quartile.

\[\begin{align} P(X>x , Y>y) &= P(X > x | Y > y)P(Y > y) \\ \\ &= \frac{4}{15}*\frac{3}{4} \\ \\ &= \frac{1}{5} \end{align}\]

Solution in R P(X > x, Y > y)

nrow( subset( data1, data1$Y1 > y & data1$X4 > x) ) / nrow(data1) 
## [1] 0.2

1.1 c. \(P(X<x|Y>y)\)

This is the probability that X is less than its 3rd quartile, given that Y is greater than its 1st quartile.

\[\begin{align} P(X<x | Y>y) &= \frac{P(Y > y \space and \space X < x)}{P(Y>y)} \\ \\ &= \frac{P(Y > 18.55 \space and \space X < 12.525)}{P(Y>18.55)}\\ \\ &= \frac{0.55}{0.75} \\ \\ &= \frac{11}{15} \end{align}\]

Solution in R P(Y > y and X < x)

pYANDX <- nrow( subset( data1, Y1 > y & X4 < x) ) / nrow(data1)
pYANDX
## [1] 0.55

P(Y > y)

pY <- nrow( subset( data1, data1$Y1 > y ) ) / nrow(data1) 
pY
## [1] 0.75

P(X<x | Y>y)

pYANDX/pY
## [1] 0.7333333

Problem 1.2 Contingency Table

In addition, make a table of counts as shown below.

#Create Table
q1Table <- data.frame("x/y" = c("<=1st quartile", ">1st quartile", "Total"),
                      "<=3rd quartile" = c("","", ""),
                      ">3rd quartile" = c("","", ""),
                      "Total" = c("","", ""), stringsAsFactors = FALSE)

# Calculate first quartiles
xQ1 <- as.numeric(quantile(data1$X4, 0.25))

yQ1 <- as.numeric(quantile(data1$Y1, 0.25))

names(q1Table) <- c("x/y","<=1st quartile", ">1st quartile", "Total")

# Calculate counts
q1Table$`<=1st quartile`[1] <- nrow( subset( data1,  X4 <= xQ1 & Y1 <= yQ1 ) )
q1Table$`<=1st quartile`[2] <- nrow( subset( data1,  X4 > xQ1 & Y1 <= yQ1 ) )

q1Table$`>1st quartile`[1] <- nrow( subset( data1,  X4 <= xQ1 & Y1 > yQ1 ) )
q1Table$`>1st quartile`[2] <- nrow( subset( data1,  X4 > xQ1 & Y1 > yQ1 ) )

# Calculate totals
q1Table$`<=1st quartile`[3] <- as.integer(q1Table$`<=1st quartile`[1]) + as.integer(q1Table$`<=1st quartile`[2])
q1Table$`>1st quartile`[3] <- as.integer(q1Table$`>1st quartile`[1]) + as.integer(q1Table$`>1st quartile`[2])
q1Table$Total[1] <- as.integer(q1Table$`<=1st quartile`[1]) + as.integer(q1Table$`>1st quartile`[1])
q1Table$Total[2] <- as.integer(q1Table$`<=1st quartile`[2]) + as.integer(q1Table$`>1st quartile`[2])
q1Table$Total[3] <- as.integer(q1Table$`<=1st quartile`[3]) + as.integer(q1Table$`>1st quartile`[3])

# Print table
kable(q1Table) %>%
  kable_styling(full_width = F, bootstrap_options = "bordered") %>%
  column_spec(1, bold = T, border_right = T)
x/y              <=1st quartile   >1st quartile   Total
<=1st quartile                1               4       5
>1st quartile                 4              11      15
Total                         5              15      20
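
As an aside, the same table of counts can be built more compactly with base R's table() and addmargins(); a sketch, reusing xQ1 and yQ1 from above:

# Cross-tabulate the two quartile indicators and add marginal totals
xy_counts <- table(
  x = ifelse(data1$X4 <= xQ1, "<=1st quartile", ">1st quartile"),
  y = ifelse(data1$Y1 <= yQ1, "<=1st quartile", ">1st quartile")
)
addmargins(xy_counts)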

Problem 1.3 Independence

Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 1st quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y. Does P(AB)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.

1.3.1 Does P(AB)=P(A)P(B)?

paste0("P(AB) = ", nrow( subset( data1,  X4 > xQ1 & Y1 > yQ1 ) )/nrow(data1)) 
## [1] "P(AB) = 0.55"
paste0("P(A) = ", nrow( subset( data1,  X4 > xQ1) )/nrow(data1))
## [1] "P(A) = 0.75"
paste0("P(B) = ", nrow( subset( data1, Y1 > yQ1 ) )/nrow(data1))
## [1] "P(B) = 0.75"
paste0("P(A)P(B) = ", (nrow( subset( data1, X4 > xQ1 ) )/nrow(data1))*(nrow( subset( data1, Y1 > yQ1 ) )/nrow(data1)))
## [1] "P(A)P(B) = 0.5625"

\[P(AB) = 0.55 \ne P(A)P(B) = 0.5625\]

\(P(AB)\) is not exactly equal to \(P(A)P(B)\), so splitting the data in this manner does not, strictly speaking, make them independent, although the two values are quite close.

1.3.2 Chi Square test for association.

# Create A & B based on criteria and perform test
A <- subset( data1,  X4 > xQ1 )
B <- subset( data1,  Y1 > yQ1 )

chisq.test(A,B)
## 
##  Pearson's Chi-squared test
## 
## data:  A
## X-squared = 110.45, df = 98, p-value = 0.1838

The p-value of the chi-square test is not statistically significant; therefore, we cannot reject the null hypothesis that A and B are independent.
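
Note that chisq.test(A, B) above is given two data-frame subsets, so R treats the first one as a contingency table of raw values rather than testing the association between the two binary splits, which is why the degrees of freedom are so large. A sketch of the arguably more conventional 2x2 test of association, reusing xQ1 and yQ1 from above:

# Indicator variables for the two splits, cross-tabulated into a 2x2 table
A_ind <- data1$X4 > xQ1   # TRUE when X is above its 1st quartile
B_ind <- data1$Y1 > yQ1   # TRUE when Y is above its 1st quartile
chisq.test(table(A_ind, B_ind))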

Problem 2

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

2.1 Descriptive and Inferential Statistics.

2.1.1 Descriptive Statistics

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable.

# load the training data (the Id column is just an index and is ignored in the analysis)
training_data <- read.csv("data/train.csv", header = TRUE, stringsAsFactors = FALSE)
#Summarize data and save to CSV
summary <- describe(training_data )
summary$na <- nrow(training_data) - summary$n
summary$na_pct <- summary$na/nrow(training_data)*100
summary <- dplyr::select(summary, vars, n, mean, sd, median, min, max, na, na_pct )
names(summary) <- c("Variables", "Cases", "Mean", "SD", "Median", "Min", "Max", "NAs", "NA_Pct")
datatable(round(summary,2), options = list(filter = FALSE, pageLength = 81, dom = 'tip'), filter = 'none')

2.1.2 Box Plots of Numerical Variables

long_training_data <- dplyr::select_if(training_data, is.numeric) %>%
  gather(key = "key", value = "value")

ggplot(long_training_data, aes(factor(key), value)) + 
  geom_boxplot() + 
  facet_wrap(~key, scale="free", ncol = 4) + 
  labs(x="", y="")

2.1.3 Histograms of Categorical Variables

training_data_categorical <- dplyr::select_if(training_data,is.character)
long_categorical_data <- gather(training_data_categorical)

ggplot(long_categorical_data, aes(value)) + 
  geom_bar() + 
  facet_wrap(~key, scales = 'free_x', ncol = 4) +
  theme(axis.text.x = element_text(angle = 90))

2.1.4 Histograms of Numeric Variables

training_data_numeric <- dplyr::select_if(training_data,is.numeric)
long_numeric_data <- gather(training_data_numeric)

ggplot(long_numeric_data, aes(value)) + 
  geom_histogram(bins = 10) + 
  facet_wrap(~key, scales = 'free', ncol = 4) +
  theme(axis.text.x = element_text(angle = 90))

2.1.5 Scatter Plot Matrix

ggpairs(dplyr::select(training_data, SalePrice, LotArea, LotFrontage, BedroomAbvGr, KitchenAbvGr), upper = "blank", diag = "blank")

2.1.6 Correlation

Correlation Matrix

corrData <- dplyr::select(training_data, SalePrice, LotArea, BedroomAbvGr)


corrMatrix <- round( cor(corrData, method = "pearson", use = "complete.obs"), 4 )
corrMatrix
##              SalePrice LotArea BedroomAbvGr
## SalePrice       1.0000  0.2638       0.1682
## LotArea         0.2638  1.0000       0.1197
## BedroomAbvGr    0.1682  0.1197       1.0000

80% Confidence Intervals: SalePrice & LotArea

cor.test(corrData$SalePrice,corrData$LotArea, conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  corrData$SalePrice and corrData$LotArea
## t = 10.445, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.2323391 0.2947946
## sample estimates:
##       cor 
## 0.2638434

80% Confidence Intervals: SalePrice & BedroomAbvGr

cor.test(corrData$SalePrice,corrData$BedroomAbvGr, conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  corrData$SalePrice and corrData$BedroomAbvGr
## t = 6.5159, df = 1458, p-value = 9.927e-11
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.1354160 0.2006421
## sample estimates:
##       cor 
## 0.1682132

80% Confidence Intervals: BedroomAbvGr & LotArea

cor.test(corrData$BedroomAbvGr,corrData$LotArea, conf.level = 0.8)
## 
##  Pearson's product-moment correlation
## 
## data:  corrData$BedroomAbvGr and corrData$LotArea
## t = 4.6033, df = 1458, p-value = 4.52e-06
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
##  0.08647564 0.15263840
## sample estimates:
##       cor 
## 0.1196899

All three correlations above are statistically significant at the 80% confidence level: none of the intervals contains zero, and the p-values are very small. The familywise error rate is the probability of making one or more Type I errors when performing multiple hypothesis tests. I would be somewhat concerned about it here because we are running three tests at a relatively permissive significance level (alpha = 0.20), so the chance of at least one false positive across the family is noticeably higher than for a single test. The Bonferroni method, which divides the significance level by the number of tests, could be used to control for this.
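
A back-of-the-envelope illustration (a sketch, treating the three tests as independent):

alpha <- 0.20        # per-test significance level implied by the 80% intervals
m <- 3               # number of correlation tests performed
1 - (1 - alpha)^m    # approximate familywise error rate
alpha / m            # Bonferroni-adjusted per-test significance level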

2.2 Linear Algebra and Correlation

Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

2.2.1 Matrix Inversion

invCorrMatrix <- solve(corrMatrix)
invCorrMatrix
##               SalePrice     LotArea BedroomAbvGr
## SalePrice     1.0971260 -0.27121886  -0.15207170
## LotArea      -0.2712189  1.08158397  -0.08384659
## BedroomAbvGr -0.1520717 -0.08384659   1.03561490

2.2.2 Correlation Matrix X Precision Matrix

corrMatrix %*% invCorrMatrix 
##                  SalePrice      LotArea BedroomAbvGr
## SalePrice     1.000000e+00 2.428613e-17            0
## LotArea      -2.081668e-17 1.000000e+00            0
## BedroomAbvGr -2.775558e-17 0.000000e+00            1

2.2.3 Precision Matrix X Correlation Matrix

 invCorrMatrix %*% corrMatrix  
##                 SalePrice       LotArea  BedroomAbvGr
## SalePrice    1.000000e+00 -2.081668e-17 -2.775558e-17
## LotArea      2.428613e-17  1.000000e+00  0.000000e+00
## BedroomAbvGr 0.000000e+00  0.000000e+00  1.000000e+00

2.2.4 LU Decomposition of Correlation Matrix

I decided to use my solution function from Assignment 2 to perform the decomposition.

 decompose_A <- function(A){
  # get the size of the matrix A
  rows = dim(A)[1]
  cols = dim(A)[2]
  
  # initialize L & U
  U <- A
  L <- diag(rows)
  
  for(col in 1:(cols - 1)){ # no need to process the last column
    for(row in (col + 1):rows){
        if((col) > rows){break}
        if(U[row,col] == 0){next}
        coeff <- ( (-1*U[row,col])/U[col,col] ) # calculate factor to change element to 0
        U[row,] <- ( coeff * U[col,] ) + U[row,] # perform row operation 3
        L[row,col] <- -1*coeff # use the opposite sign for the same location in L
    }
  }
  
  return( list(L,U) )
 }

LU <- decompose_A(corrMatrix)

Extract L

LU[[1]]
##        [,1]      [,2] [,3]
## [1,] 1.0000 0.0000000    0
## [2,] 0.2638 1.0000000    0
## [3,] 0.1682 0.0809631    1

Extract U

LU[[2]]
##              SalePrice   LotArea BedroomAbvGr
## SalePrice            1 0.2638000   0.16820000
## LotArea              0 0.9304096   0.07532884
## BedroomAbvGr         0 0.0000000   0.96560990

Does A = LU?

LU[[1]] %*% LU[[2]] == corrMatrix
##      SalePrice LotArea BedroomAbvGr
## [1,]      TRUE    TRUE         TRUE
## [2,]      TRUE    TRUE         TRUE
## [3,]      TRUE    TRUE         TRUE
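
Exact == comparisons on floating-point matrices can be fragile (they happen to hold here); a tolerance-based check is a safer sketch:

# Tolerance-based comparison instead of exact equality
all.equal(LU[[1]] %*% LU[[2]], corrMatrix, check.attributes = FALSE)

# Size of the reconstruction error (should be at or near machine precision)
norm(LU[[1]] %*% LU[[2]] - corrMatrix, type = "F")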

2.3 Calc-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., \(rexp(1000,\lambda)\)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

I selected the LotFrontage variable because it appears to be right-skewed based on its histogram in Problem 2.1.4 above, and its minimum value is 21 based on the summary table in Problem 2.1.1.

2.3.1 Fit Exponential PDF & Compare Plot

# fit PDF
expLotFrontage <- fitdistr(na.omit(training_data$LotFrontage), densfun = "exponential")

#extract lambda
lambda <- expLotFrontage$estimate

#sample PDF
sampleExp <- rexp(1000,lambda)

originalPlot <- ggplot(training_data, aes(x = LotFrontage)) +
  geom_histogram(fill = "green") +
  labs(x = "LotFrontage (ft)", y = "Frequency")

pdfPlot <- ggplot(data.frame(X = sampleExp), aes(x = X)) +
  geom_histogram(fill = "blue") +
  labs(x = "Sample from Exponential PDF", y = "Frequency")

grid.arrange(originalPlot,pdfPlot,ncol=2,nrow=1)

2.3.2 \(5^{th}\) and \(95^{th}\) Percentiles (Exponential DF)

quantile(sampleExp, probs=c(0.05, 0.95))
##         5%        95% 
##   4.227185 208.969174
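
The values above are sample quantiles of the 1,000 simulated draws and will vary from run to run; the exact 5th and 95th percentiles of the fitted distribution follow directly from the exponential CDF, using the lambda estimated above:

# Theoretical percentiles from the fitted exponential distribution
qexp(c(0.05, 0.95), rate = lambda)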

2.3.3 95% Confidence Interval (Observed Data)

ci <- t.test(training_data$LotFrontage, conf.level=0.95)

paste0("The lower bound is ", ci[["conf.int"]][1], ", the upper bound is ", ci[["conf.int"]][2],".")
## [1] "The lower bound is 68.6751299039821, the upper bound is 71.4247868320712."

2.3.4 \(5^{th}\) and \(95^{th}\) Percentiles (Observed Data)

quantile(training_data$LotFrontage, probs=c(0.05, 0.95), na.rm = TRUE)
##  5% 95% 
##  34 107

Both histograms show right skew, but the sample from the exponential distribution is far more spread out than the observed data. Its 5th percentile (about 4.2) is well below the observed 5th percentile (34), and its 95th percentile (about 209) is well above the observed 95th percentile (107). The exponential distribution captures the right skew of LotFrontage, but overall it fits the data only roughly.
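
One way to make the comparison more direct is to overlay the fitted exponential density on the observed histogram; a sketch, reusing lambda from above:

# Observed LotFrontage histogram (as a density) with the fitted exponential curve
ggplot(data.frame(LotFrontage = na.omit(training_data$LotFrontage)),
       aes(x = LotFrontage)) +
  geom_histogram(aes(y = ..density..), bins = 30, fill = "green") +
  stat_function(fun = dexp, args = list(rate = lambda), color = "blue") +
  labs(y = "Density")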

2.4 Modeling

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

I decided to start my model with 15 independent variables I deemed important in the context of house prices. These variables along with their descriptions are below.

Variable       Description
GrLivArea      Above grade (ground) living area square feet
MSZoning       The general zoning classification
LotFrontage    Linear feet of street connected to property
LotArea        Lot size in square feet
Street         Type of road access
Utilities      Type of utilities available
OverallQual    Overall material and finish quality
OverallCond    Overall condition rating
TotalBsmtSF    Total square feet of basement area
TotRmsAbvGrd   Total rooms above grade (does not include bathrooms)
BedroomAbvGr   Number of bedrooms above basement level
Functional     Home functionality rating
GarageArea     Size of garage in square feet
PoolArea       Pool area in square feet
MiscVal        Dollar value of miscellaneous feature

2.4.1 Plot: Target Transformation

ggplot( training_data, aes(x=SalePrice)) + 
  labs(title = "Histogram of SalePrice" ) + 
  geom_histogram() 

The plot above shows that our target value is right-skewed. This seems like a good candidate for a log transformation.

ggplot( training_data, aes(x=log(SalePrice))) + 
  labs(title = "Histogram of log(SalePrice)" ) + 
  geom_histogram() 

The log transformation makes a clear difference: the distribution of log(SalePrice) is far more symmetric.

2.4.2 Plot: Target Vs. Predictors

model_data <- dplyr::select(training_data, GrLivArea, MSZoning, LotFrontage, LotArea, Street, Utilities,
                            OverallQual, OverallCond, TotalBsmtSF, TotRmsAbvGrd, BedroomAbvGr, Functional,
                            GarageArea, PoolArea, MiscVal, SalePrice)
model_data$logSalePrice <- log(model_data$SalePrice)

gather( model_data , "VARIABLE", "VALUE", -logSalePrice) %>%
ggplot( aes(x=VALUE, y=logSalePrice)) + 
  geom_point() + facet_wrap(~VARIABLE, scale = "free", ncol = 4) + 
  labs(x="", y="Sale Price", title = "Log Sale Price vs. Predictors" ) + 
theme( axis.text.x = element_text(angle = 90)) + 
  scale_x_discrete(labels = NULL)

Based on the plots above, it seems that the predictors Utilities, PoolArea, Street, and MiscVal would need to be binned or transformed, so I will remove these variables from the model. Aside from LotFrontage, missing data is not an issue for the remaining predictors, so we will proceed with building our model.

2.4.3 Backward Stepwise Selection
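
The code that produced the trace below is not shown; it was presumably a call along the following lines (a sketch; note that, as the trace shows, the full model fed to step() included SalePrice itself, which is revisited in the final model):

# Full model on the retained predictors (Utilities, PoolArea, Street, MiscVal dropped),
# then backward elimination by AIC
full_model <- lm(logSalePrice ~ . - Utilities - PoolArea - Street - MiscVal,
                 data = model_data)
step(full_model, direction = "backward")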

## Start:  AIC=-5410.48
## logSalePrice ~ GrLivArea + MSZoning + LotFrontage + LotArea + 
##     OverallQual + OverallCond + TotalBsmtSF + TotRmsAbvGrd + 
##     BedroomAbvGr + Functional + GarageArea + SalePrice
## 
##                Df Sum of Sq    RSS     AIC
## - TotRmsAbvGrd  1    0.0000 12.841 -5412.5
## - LotArea       1    0.0001 12.841 -5412.5
## <none>                      12.841 -5410.5
## - GrLivArea     1    0.0304 12.871 -5409.6
## - LotFrontage   1    0.0533 12.894 -5407.5
## - TotalBsmtSF   1    0.1287 12.969 -5400.5
## - Functional    5    0.2971 13.138 -5393.0
## - BedroomAbvGr  1    0.2277 13.069 -5391.4
## - OverallCond   1    0.6587 13.500 -5352.4
## - GarageArea    1    0.9156 13.756 -5329.8
## - OverallQual   1    1.6480 14.489 -5267.5
## - MSZoning      4    2.9687 15.810 -5168.7
## - SalePrice     1   21.3359 34.177 -4236.8
## 
## Step:  AIC=-5412.48
## logSalePrice ~ GrLivArea + MSZoning + LotFrontage + LotArea + 
##     OverallQual + OverallCond + TotalBsmtSF + BedroomAbvGr + 
##     Functional + GarageArea + SalePrice
## 
##                Df Sum of Sq    RSS     AIC
## - LotArea       1    0.0001 12.841 -5414.5
## <none>                      12.841 -5412.5
## - GrLivArea     1    0.0464 12.887 -5410.1
## - LotFrontage   1    0.0532 12.894 -5409.5
## - TotalBsmtSF   1    0.1305 12.971 -5402.3
## - Functional    5    0.2976 13.138 -5395.0
## - BedroomAbvGr  1    0.2917 13.133 -5387.5
## - OverallCond   1    0.6607 13.502 -5354.2
## - GarageArea    1    0.9168 13.758 -5331.7
## - OverallQual   1    1.6483 14.489 -5269.4
## - MSZoning      4    2.9795 15.820 -5169.9
## - SalePrice     1   21.4923 34.333 -4233.3
## 
## Step:  AIC=-5414.47
## logSalePrice ~ GrLivArea + MSZoning + LotFrontage + OverallQual + 
##     OverallCond + TotalBsmtSF + BedroomAbvGr + Functional + GarageArea + 
##     SalePrice
## 
##                Df Sum of Sq    RSS     AIC
## <none>                      12.841 -5414.5
## - GrLivArea     1    0.0464 12.887 -5412.1
## - LotFrontage   1    0.0588 12.900 -5411.0
## - TotalBsmtSF   1    0.1310 12.972 -5404.3
## - Functional    5    0.2985 13.139 -5396.9
## - BedroomAbvGr  1    0.2923 13.133 -5389.4
## - OverallCond   1    0.6607 13.502 -5356.2
## - GarageArea    1    0.9176 13.758 -5333.6
## - OverallQual   1    1.6712 14.512 -5269.5
## - MSZoning      4    2.9802 15.821 -5171.8
## - SalePrice     1   21.7349 34.576 -4226.9
## 
## Call:
## lm(formula = logSalePrice ~ GrLivArea + MSZoning + LotFrontage + 
##     OverallQual + OverallCond + TotalBsmtSF + BedroomAbvGr + 
##     Functional + GarageArea + SalePrice, data = model_data)
## 
## Coefficients:
##    (Intercept)       GrLivArea      MSZoningFV      MSZoningRH  
##      1.044e+01       2.193e-05       4.328e-01       3.236e-01  
##     MSZoningRL      MSZoningRM     LotFrontage     OverallQual  
##      3.949e-01       2.850e-01      -3.487e-04       4.726e-02  
##    OverallCond     TotalBsmtSF    BedroomAbvGr  FunctionalMaj2  
##      2.272e-02       3.236e-05       2.510e-02      -2.312e-01  
## FunctionalMin1  FunctionalMin2   FunctionalMod   FunctionalTyp  
##     -2.560e-02      -3.063e-02      -1.027e-01      -2.374e-02  
##     GarageArea       SalePrice  
##      1.710e-04       3.392e-06

2.4.4 Final Model

The backward search above retained SalePrice itself as a predictor of logSalePrice, which is target leakage, so it is excluded here; the final model keeps a smaller set of numeric predictors.

houseModel <- lm(formula = logSalePrice ~ GrLivArea  + 
    LotArea + OverallQual + OverallCond + TotalBsmtSF +  
    GarageArea, data = model_data) 

summary(houseModel)
## 
## Call:
## lm(formula = logSalePrice ~ GrLivArea + LotArea + OverallQual + 
##     OverallCond + TotalBsmtSF + GarageArea, data = model_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.47725 -0.07374  0.01461  0.10421  0.52127 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.038e+01  3.369e-02 308.197  < 2e-16 ***
## GrLivArea   1.865e-04  1.178e-05  15.840  < 2e-16 ***
## LotArea     2.814e-06  5.031e-07   5.594 2.65e-08 ***
## OverallQual 1.410e-01  4.863e-03  28.986  < 2e-16 ***
## OverallCond 2.911e-02  4.327e-03   6.728 2.47e-11 ***
## TotalBsmtSF 1.271e-04  1.374e-05   9.247  < 2e-16 ***
## GarageArea  3.609e-04  2.820e-05  12.801  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1803 on 1453 degrees of freedom
## Multiple R-squared:  0.7971, Adjusted R-squared:  0.7963 
## F-statistic: 951.4 on 6 and 1453 DF,  p-value: < 2.2e-16

\[log(SalePrice) = 10.38 + 0.000187 * GrLivArea + 0.0000028 * LotArea + 0.141 * OverallQual + \\ 0.0291 * OverallCond + 0.000127 * TotalBsmtSF + 0.000361 * GarageArea\]
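
As a quick sanity check, the fitted equation can be applied to a single hypothetical house (the values below are made up purely for illustration):

# Hypothetical house, values chosen only for illustration
new_house <- data.frame(GrLivArea = 1500, LotArea = 9000, OverallQual = 6,
                        OverallCond = 5, TotalBsmtSF = 1000, GarageArea = 480)
exp(predict(houseModel, newdata = new_house))   # predicted SalePrice in dollars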

2.4.5 Model Diagnostics

To assess whether this model is reliable, we will check for linearity, nearly normal residuals, constant variability (homoscedasticity), and issues with multicollinearity.

Linearity

The final predictors showed reasonably linear relationships with the target. This assumption is met, although some transformations would have made the relationships more clearly linear.

Nearly Normal Residuals

Investigating the Q-Q plot below, we can see some evidence of skewness: a couple of extreme residuals deviate strongly from the line, while the rest show only small deviations. I would say that the assumption of nearly normal residuals has been met, but the outliers should be addressed.

ggplot( houseModel, aes(sample = houseModel$residuals) ) +
  stat_qq() +
  stat_qq_line() +
  labs(x="Theoretical Quantiles", y="Sample Quantiles") + 
  ggtitle("Normal Q-Q Plot")

Constant Variability and Independence

The plot of the residuals vs. the fitted values below shows two outliers that may be influential. Because of these two points, I would say the constant variability assumption is not fully met.

ggplot(aes(x = houseModel$fitted.values, y = houseModel$residuals ), data = houseModel) +
  geom_point(shape=1) +
  labs(x="Fitted Values", y="Residuals") + 
  ggtitle("Residuals vs. Fitted Values") + 
  geom_abline(intercept = 0, 
              slope = 0, 
              color = "blue", 
              size = 1.0,
              linetype="dashed")

Multicollinearity

Multicollinearity is not a huge problem. There is a strong positive correlation between the OverallCond and OverallQual variables.

model_data$sqrt_SalePrice <- sqrt(model_data$SalePrice)
corrplot( round( cor(dplyr::select(model_data,logSalePrice, GrLivArea, LotArea, OverallQual, OverallCond, TotalBsmtSF, GarageArea), method = "pearson", use = "complete.obs"), 2 ), 
          method="circle", 
          type="upper",
          tl.col="black")
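
Variance inflation factors give a more direct numeric check of multicollinearity; a sketch, assuming the car package is available:

# VIFs for the final model (values near 1 indicate little multicollinearity)
library(car)
vif(houseModel)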

2.4.6 Predictions

Kaggle Score (RMSE): 0.17298

Kaggle Username: A.Gilharry

# Predict SalePrice for the Kaggle test set
test_data <- read.csv("data/test.csv", header = TRUE, stringsAsFactors = FALSE)
test_data <- dplyr::select(test_data, Id, GrLivArea, LotArea, OverallQual, OverallCond, TotalBsmtSF, GarageArea)

predicts <- predict(houseModel, newdata = test_data, interval="prediction")
# back-transform from log(SalePrice) to SalePrice
test_data$SalePrice <- round(exp(predicts[,1]))
# a few test rows have missing predictors and therefore NA predictions;
# fill those with the median predicted SalePrice
test_data$SalePrice[is.na(test_data$SalePrice)] <- median(test_data$SalePrice, na.rm = TRUE)
predictions <- dplyr::select(test_data, Id, SalePrice)
write.csv(predictions, "data/predictions.csv", row.names = FALSE)

datatable(predictions, options = list(filter = FALSE, pageLength = 1416, dom = 'tip'), filter = 'none')