Your final is due by the end of day on 12/20/2017. You should post your solutions to your GitHub account. This project will show off your ability to understand the elements of the class.

You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition (https://www.kaggle.com/c/house-prices-advanced-regression-techniques). I want you to do the following.

Outline

Load Data and Packages

I downloaded the datasets from Kaggle and saved them to my GitHub repository for the final exam (https://github.com/Randles-CUNY/data602_final_exam); both train.csv and test.csv are read directly from that repository in the code that follows.

In the code below, the train.csv data is read into R and loaded into a data frame called hp_raw_df. In addition, the following packages are installed and loaded:

  • ggplot2 - for plotting
  • MASS - for boxcox function and fitdistr function
  • caret - for preProcess function
install.packages("ggplot2", repos='https://mirrors.nics.utk.edu/cran/')
library(ggplot2)
install.packages("MASS", repos='https://mirrors.nics.utk.edu/cran/')
library(MASS)
install.packages("caret", repos='https://mirrors.nics.utk.edu/cran/')
library(caret)
hp_raw_df <- read.csv("https://raw.githubusercontent.com/Randles-CUNY/data602_final_exam/master/train.csv", colClasses = c("integer", "integer", "character", "integer", "integer", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "integer", "integer", "integer", "integer", "character", "character", "character", "character", "character", "integer", "character", "character", "character", "character", "character", "character", "character", "integer", "character", "integer", "integer", "integer", "character", "character", "character", "character", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "character", "integer", "character", "integer", "character", "character", "integer", "character", "integer", "integer", "character", "character", "character", "integer", "integer", "integer", "integer", "integer", "integer", "character", "character", "character", "integer", "integer", "integer", "character", "character", "integer"), header = TRUE, sep = ",", stringsAsFactors = FALSE)


Probability

Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as \(X\). Pick SalePrice as the dependent variable, and define it as \(Y\) for the next analysis.

Probability. Calculate as a minimum the below probabilities \(a\) through \(c\). Assume the small letter \(x\) is estimated as the 1st quartile of the \(X\) variable, and the small letter \(y\) is estimated as the 2nd quartile of the \(Y\) variable. Interpret the meaning of all probabilities.


I chose LotArea (lot size in square feet) as the independent variable \(X\). There are no NA values for LotArea or SalePrice. A quick look at histograms of both shows unimodal distributions with pronounced positive skew and some extreme right-tail outliers:

# histograms for LotArea and SalePrice
ggplot(hp_raw_df, aes(LotArea)) + geom_histogram(binwidth = 500, color="royalblue2") + scale_x_continuous(labels = scales::comma)

ggplot(hp_raw_df, aes(SalePrice)) + geom_histogram(binwidth = 2000, color="maroon") + scale_x_continuous(labels = scales::comma)
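
As a numeric check on that visual impression, a simple moment-based skewness estimate needs no extra packages; a quick sketch (skew is a new helper function; the preProcess output later in this document reports sample skewness of roughly 12.2 for LotArea and 1.88 for SalePrice):

# moment-based sample skewness (sketch)
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3
c(LotArea = skew(hp_raw_df$LotArea), SalePrice = skew(hp_raw_df$SalePrice))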

# set x and y for problems a thru c
x <- hp_raw_df$LotArea
y <- hp_raw_df$SalePrice
# set the 1st quartile threshold for x variable
x_1q <- quantile(x, probs = 0.25)
# set the 2nd quartile threshold for y variable
y_2q <- quantile(y, probs = 0.5)
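
For reference, the two cut-offs can be printed directly; they match the 1st-quartile LotArea and median SalePrice shown in the summary output later in this document (7,554 square feet and $163,000):

# inspect the quartile cut-offs used in parts a through c
c(x_1q = unname(x_1q), y_2q = unname(y_2q))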


\(a. P(X>x | Y>y)\)

This is the probability that the LotArea (\(X\)) is greater than the 1st quartile cut-off for LotArea given that the SalePrice (\(Y\)) is greater than the 2nd quartile cut-off for SalePrice.

# computed as P(X > x and Y > y) divided by P(Y > y)
# Probability P(X > x and Y > y)
p1 <- nrow(subset(hp_raw_df, hp_raw_df$LotArea > x_1q & hp_raw_df$SalePrice > y_2q)) / nrow(hp_raw_df)
# Probability P(Y > y)
p2 <- nrow(subset(hp_raw_df, hp_raw_df$SalePrice > y_2q)) / nrow(hp_raw_df)
# Probability P(X > x and Y > y) divided by P(Y > y)
p1 / p2
## [1] 0.8653846
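
The same value can be recovered in one line from the x and y vectors defined above, since each probability is just the proportion of rows satisfying the condition; an equivalent sketch:

# equivalent computation using logical vectors
mean(x > x_1q & y > y_2q) / mean(y > y_2q)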


\(b. P(X>x \; \& \; Y>y)\)

This is the probability that the LotArea (\(X\)) is greater than the 1st quartile cut-off for LotArea and the SalePrice (\(Y\)) is greater than the 2nd quartile cut-off for SalePrice.

# proportion of rows where both X > x and Y > y
p <- nrow(subset(hp_raw_df, hp_raw_df$LotArea > x_1q & hp_raw_df$SalePrice > y_2q)) / nrow(hp_raw_df)
p
## [1] 0.4315068


\(c. P(X<x | Y>y)\)

This is the probability that the LotArea (\(X\)) is less than the 1st quartile cut-off for LotArea given that the SalePrice (\(Y\)) is greater than the 2nd quartile cut-off for SalePrice.

# computed as P(X < x and Y > y) divided by P(Y > y)
# Probability P(X < x and Y > y)
p1 <- nrow(subset(hp_raw_df, hp_raw_df$LotArea < x_1q & hp_raw_df$SalePrice > y_2q)) / nrow(hp_raw_df)
# Probability P(Y > y)
p2 <- nrow(subset(hp_raw_df, hp_raw_df$SalePrice > y_2q)) / nrow(hp_raw_df)
# Probability P(X < x and Y > y) divided by P(Y > y)
p1 / p2
## [1] 0.1346154


Does splitting the training data in this fashion make them independent? In other words, does \(P(XY) = P(X)P(Y)\)? Check mathematically, and then evaluate by running a Chi Square test for association. You might have to research this.

# joint probability P(X > x and Y > y), as computed in part b
pxy <- nrow(subset(hp_raw_df, hp_raw_df$LotArea > x_1q & hp_raw_df$SalePrice > y_2q)) / nrow(hp_raw_df)
pxy
## [1] 0.4315068
# marginal probability P(X > x)
px <- nrow(subset(hp_raw_df, hp_raw_df$LotArea > x_1q)) / nrow(hp_raw_df)
# marginal probability P(Y > y)
py <- nrow(subset(hp_raw_df, hp_raw_df$SalePrice > y_2q)) / nrow(hp_raw_df)
# P(X)P(Y)
px * py
## [1] 0.3739726
# check if P(XY) = P(X)P(Y) - can see above that is does not
pxy == px * py
## [1] FALSE

As you can see above, splitting the data in this fashion does not make \(X\) and \(Y\) independent; subsetting the observations at these cut-offs does nothing to change whether one event's occurrence affects the probability of the other. Below are two Chi-square tests for association. The first uses the raw values and produces the warning "Chi-squared approximation may be incorrect": the chi-square test is designed for categorical variables, so applying it to two numeric variables yields a very large contingency table in which most expected cell counts are tiny, making the chi-squared approximation to the p-value unreliable.
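
A common rule of thumb is that expected cell counts should be at least 5, which can be checked directly on the raw table; a quick sketch (raw_tbl and exp_counts are new helper names) showing that, with 1,460 observations spread across hundreds of thousands of cells, essentially every expected count falls below 5:

# share of expected cell counts below the rule-of-thumb minimum of 5
raw_tbl <- table(hp_raw_df$LotArea, hp_raw_df$SalePrice)
exp_counts <- suppressWarnings(chisq.test(raw_tbl))$expected
mean(exp_counts < 5)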

The second Chi-square test bins both LotArea and SalePrice into deciles (probs = 0:10/10 yields ten bins, though the new columns retain the names la_quintile and sp_quintile), which shrinks the contingency table to 10 x 10 and no longer triggers the warning. In both cases the test returns extremely small p-values, rejecting independence.

# Chi square test using actual values
chisqtbl <- table(hp_raw_df$LotArea, hp_raw_df$SalePrice)
chisq.test(chisqtbl)
## Warning in chisq.test(chisqtbl): Chi-squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  chisqtbl
## X-squared = 735090, df = 709660, p-value < 2.2e-16
# using decile bins (probs = 0:10/10 yields ten bins; the quintile column names are kept as-is)
# add column la_quintile for LotArea decile bin
hp_raw_df <- within(hp_raw_df, la_quintile <- as.integer(cut(hp_raw_df$LotArea, quantile(hp_raw_df$LotArea, probs=0:10/10), include.lowest=TRUE)))
# add column sp_quintile for SalePrice decile bin
hp_raw_df <- within(hp_raw_df, sp_quintile <- as.integer(cut(hp_raw_df$SalePrice, quantile(hp_raw_df$SalePrice, probs=0:10/10), include.lowest=TRUE)))
# Chi square test using decile bin values
chisqtbl <- table(hp_raw_df$la_quintile, hp_raw_df$sp_quintile)
chisq.test(chisqtbl)
## 
##  Pearson's Chi-squared test
## 
## data:  chisqtbl
## X-squared = 465.71, df = 81, p-value < 2.2e-16
# no warning


Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for both variables. Provide a scatterplot of \(X\) and \(Y\). Transform both variables simultaneously using Box-Cox transformations. You might have to research this.


Here are summary descriptive statistics for LotArea and SalePrice to supplement the histograms from the prior section. Both also suggest positive skew, with each mean well above its median:

summary(hp_raw_df$LotArea)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7554    9478   10517   11602  215245
summary(hp_raw_df$SalePrice)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000

Here is a scatterplot of LotArea and SalePrice:

ggplot(hp_raw_df, aes(x = LotArea, y = SalePrice)) + geom_point(color='royalblue2') + labs(title = "Lot Area vs. Sale Price", x = "Lot Area", y = "Sale Price") + scale_y_continuous(labels = scales::comma)

It certainly looks like the variables are correlated, though the scatterplot is stretched by some outlying LotArea values. Below is another scatterplot which excludes LotArea values in the top (10th) decile bin:

hp_raw_df_no <- subset(hp_raw_df, hp_raw_df$la_quintile < 10)
ggplot(hp_raw_df_no, aes(x = LotArea, y = SalePrice)) + geom_point(color='maroon') + labs(title = "Lot Area vs. Sale Price", x = "Lot Area", y = "Sale Price") + scale_y_continuous(labels = scales::comma)

Box-Cox transformation:

# create subset of LotArea/SalePrice and use preProcess to estimate desired lambda
hp_raw_df_subset <- subset(hp_raw_df, select = c("LotArea", "SalePrice"))
bcl <- preProcess(hp_raw_df_subset, method = "BoxCox")
bcl$bc
## $LotArea
## Box-Cox Transformation
## 
## 1460 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1300    7554    9478   10517   11602  215245 
## 
## Largest/Smallest: 166 
## Sample Skewness: 12.2 
## 
## Estimated Lambda: 0 
## With fudge factor, Lambda = 0 will be used for transformations
## 
## 
## $SalePrice
## Box-Cox Transformation
## 
## 1460 data points used to estimate Lambda
## 
## Input data summary:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   34900  129975  163000  180921  214000  755000 
## 
## Largest/Smallest: 21.6 
## Sample Skewness: 1.88 
## 
## Estimated Lambda: -0.1 
## With fudge factor, Lambda = 0 will be used for transformations
# the preProcess estimates confirm boxcox's default lambda range of -2 to 2 is sufficient
# x was set to LotArea and y was set to SalePrice previously in Probability section
# profile the Box-Cox log-likelihood over lambda for the model y ~ x
bc <- boxcox(y ~ x)

The Box-Cox plot profiles the log-likelihood over \(\lambda\); the maximizing \(\lambda\) sits near 0, in line with the values preProcess suggested (0 for LotArea and, after the fudge factor, 0 for SalePrice), so a log transformation is appropriate for both variables.
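
For completeness, the maximizing \(\lambda\) can be extracted from the boxcox return value, and caret's predict() can apply the preProcess-estimated transformations to both columns at once; a small sketch using the objects above (lambda_hat and hp_bc_df are new names):

# lambda that maximizes the Box-Cox profile log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
# apply the preProcess-estimated (lambda = 0, i.e. log) transformations to both variables
hp_bc_df <- predict(bcl, hp_raw_df_subset)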


Linear Algebra and Correlation

Using at least three untransformed variables, build a correlation matrix. Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix.


Below I build a correlation matrix which includes the following untransformed variables:

  • OverallCond: Rates the overall condition of the house
  • TotalBsmtSF: Total square feet of basement area
  • X1stFlrSF: First Floor square feet
  • TotRmsAbvGrd: Total rooms above grade
  • SalePrice: Sale price in dollars
# create correlation matrix
cm <- cor(hp_raw_df[c("OverallCond","TotalBsmtSF", "X1stFlrSF", "TotRmsAbvGrd", "SalePrice")])
cm
##              OverallCond TotalBsmtSF  X1stFlrSF TotRmsAbvGrd   SalePrice
## OverallCond   1.00000000  -0.1710975 -0.1442028  -0.05758317 -0.07785589
## TotalBsmtSF  -0.17109751   1.0000000  0.8195300   0.28557256  0.61358055
## X1stFlrSF    -0.14420278   0.8195300  1.0000000   0.40951598  0.60585218
## TotRmsAbvGrd -0.05758317   0.2855726  0.4095160   1.00000000  0.53372316
## SalePrice    -0.07785589   0.6135806  0.6058522   0.53372316  1.00000000

Then, I invert the correlation matrix to create the precision matrix:

# create inverted matrix
im <- solve(cm)
im
##              OverallCond TotalBsmtSF   X1stFlrSF TotRmsAbvGrd   SalePrice
## OverallCond   1.03236875   0.1949891  0.01587285   0.03265497 -0.06631087
## TotalBsmtSF   0.19498915   3.4652916 -2.49079866   0.50753499 -0.87288182
## X1stFlrSF     0.01587285  -2.4907987  3.42044620  -0.55770215 -0.24508484
## TotRmsAbvGrd  0.03265497   0.5075350 -0.55770215   1.49599448 -0.76943305
## SalePrice    -0.06631087  -0.8728818 -0.24508484  -0.76943305  2.08957004

Lastly, I multiply the correlation matrix by the precision matrix and the precision matrix by the correlation matrix (rounding to 15 digits to sweep away floating-point noise at the extremes).

round(cm %*% im, 15)
##              OverallCond TotalBsmtSF X1stFlrSF TotRmsAbvGrd SalePrice
## OverallCond            1           0         0            0         0
## TotalBsmtSF            0           1         0            0         0
## X1stFlrSF              0           0         1            0         0
## TotRmsAbvGrd           0           0         0            1         0
## SalePrice              0           0         0            0         1
round(im %*% cm, 15)
##              OverallCond TotalBsmtSF X1stFlrSF TotRmsAbvGrd SalePrice
## OverallCond            1           0         0            0         0
## TotalBsmtSF            0           1         0            0         0
## X1stFlrSF              0           0         1            0         0
## TotRmsAbvGrd           0           0         0            1         0
## SalePrice              0           0         0            0         1

The result in both cases is the identity matrix.
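
Because exact floating-point equality is brittle, a tolerance-aware comparison is a safer confirmation, and the diagonal of the precision matrix gives the variance inflation factors mentioned in the prompt; a quick sketch:

# identity check up to floating-point tolerance
all.equal(cm %*% im, diag(nrow(cm)), check.attributes = FALSE)
# variance inflation factors sit on the diagonal of the precision matrix
diag(im)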


Calculus Based Probability and Statistics

Many times, it makes sense to fit a closed form distribution to data. For your non-transformed independent variable, location shift (if necessary) it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit a density function of your choice. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html). Find the optimal value of the parameters for this distribution, and then take 1000 samples from this distribution (e.g., rexp(1000, \(\lambda\)) for an exponential). Plot a histogram and compare it with a histogram of your non-transformed original variable.

Because of the positive skew, I chose to fit a log-normal distribution to LotArea. LotArea is strictly positive, so no location shift was needed:

# fit a log-normal distribution to LotArea
fd <- fitdistr(hp_raw_df$LotArea, "lognormal")
# optimal values of the meanlog and sdlog parameters
fd$estimate
##   meanlog     sdlog 
## 9.1108382 0.5172708
# 1000 samples on the log scale: a normal draw with the fitted meanlog and sdlog
r <- rnorm(1000, fd$estimate[1], fd$estimate[2])
# Plot a histogram using 1000 samples
hist(r, freq = FALSE, breaks = 50, col = "royalblue2", main = "Histogram of 1000 Samples", xlab="Log of Lot Area (Square Feet)")

# Plot a histogram of the log of the original variable
hist(log(hp_raw_df$LotArea), breaks = 50, col = "maroon", freq = FALSE, main = "Histogram of Log of Lot Area", xlab="Log of Lot Area (Square Feet)")

The two histograms are quite similar, though the sampled distribution is somewhat more concentrated around its center than the actual log of LotArea.
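
The comparison can also be made on the original scale by overlaying the fitted log-normal density on the untransformed LotArea histogram; a sketch using the parameters estimated above:

# histogram of raw LotArea with the fitted log-normal density overlaid
hist(hp_raw_df$LotArea, freq = FALSE, breaks = 100, col = "maroon", main = "Lot Area with Fitted Log-normal Density", xlab = "Square Feet")
curve(dlnorm(x, meanlog = fd$estimate[1], sdlog = fd$estimate[2]), add = TRUE, col = "royalblue2", lwd = 2)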


Modeling

Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

For my linear regression model, I decided to start with the following predictor variables. I chose them because their data are complete and clean and/or not dominated by a single default value:

  • logLotArea (log of LotArea): Log transformation of Lot size in square feet
  • OverallQual: Rates the overall material and finish of the house
  • OverallCond: Rates the overall condition of the house
  • YearBuilt: Original construction date
  • TotalBsmtSF: Total square feet of basement area
  • X1stFlrSF: First Floor square feet
  • GrLivArea: Above grade (ground) living area square feet
  • BsmtFullBath: Basement full bathrooms
  • BsmtHalfBath: Basement half bathrooms
  • FullBath: Full bathrooms above grade
  • HalfBath: Half baths above grade
  • GarageCars: Size of garage in car capacity
  • GarageArea: Size of garage in square feet
  • StreetScore: Gravel = 0, Paved = 1
  • CentralAirScore: Does not have A/C = 0, does have A/C = 1
  • BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)
  • KitchenAbvGr: Kitchens above grade
  • TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

I used the log of LotArea (logLotArea) based on the skew exhibited in analyses from the prior sections. Similarly, I am using the log of SalePrice (logSalePrice) as my dependent variable. The StreetScore and CentralAirScore are dummy variables derived from categorical values.

In the code below, I create the log transformations and the dummy variables, and then generate pairwise comparisons (six variables at a time) of the selected variables vs. logSalePrice:

# create logLotArea and logSalePrice
hp_raw_df$logLotArea <- log(hp_raw_df$LotArea)
hp_raw_df$logSalePrice <- log(hp_raw_df$SalePrice)
# add central air dummy variable (0 = no Central Air, 1 = Central Air)
hp_raw_df$CentralAirScore <- ifelse(hp_raw_df$CentralAir == "N", 0, 1)
# add street dummy variable (0 = gravel, 1 = paved)
hp_raw_df$StreetScore <- ifelse(hp_raw_df$Street == "Grvl", 0, 1)
# first pairwise comparison
pairs(subset(hp_raw_df, select = c("logLotArea", "OverallQual", "OverallCond", "YearBuilt", "TotalBsmtSF", "X1stFlrSF", "logSalePrice")), gap = 0.5)

# second pairwise comparison
pairs(subset(hp_raw_df, select = c("GrLivArea", "BsmtFullBath", "BsmtHalfBath", "FullBath", "HalfBath", "GarageCars", "logSalePrice")), gap = 0.5)

# third pairwise comparison
pairs(subset(hp_raw_df, select = c("GarageArea", "StreetScore", "CentralAirScore", "BedroomAbvGr", "KitchenAbvGr", "TotRmsAbvGrd", "logSalePrice")), gap = 0.5)

Looking at the pairwise plots, I did not readily identify other variables that needed transformation.
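
As a numeric companion to the pairwise plots, the correlation of each continuous candidate with logSalePrice can be tabulated directly; a sketch over columns created above (num_preds is a new helper name):

# correlations of selected numeric predictors with logSalePrice, strongest first
num_preds <- c("logLotArea", "OverallQual", "OverallCond", "YearBuilt", "TotalBsmtSF", "X1stFlrSF", "GrLivArea", "GarageCars", "GarageArea", "TotRmsAbvGrd")
sort(sapply(hp_raw_df[num_preds], cor, y = hp_raw_df$logSalePrice), decreasing = TRUE)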

Next, I created a linear model with all 18 variables and then whittled down the predictors using backward elimination:

#initial model
sp.lm <- lm(logSalePrice ~ logLotArea + OverallQual + OverallCond + YearBuilt + TotalBsmtSF + X1stFlrSF + GrLivArea + BsmtFullBath + BsmtHalfBath + FullBath + HalfBath + GarageCars + GarageArea + StreetScore + CentralAirScore + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd, data = hp_raw_df)
# summary of model
summary(sp.lm)
## 
## Call:
## lm(formula = logSalePrice ~ logLotArea + OverallQual + OverallCond + 
##     YearBuilt + TotalBsmtSF + X1stFlrSF + GrLivArea + BsmtFullBath + 
##     BsmtHalfBath + FullBath + HalfBath + GarageCars + GarageArea + 
##     StreetScore + CentralAirScore + BedroomAbvGr + KitchenAbvGr + 
##     TotRmsAbvGrd, data = hp_raw_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.05389 -0.06401  0.00159  0.07343  0.48632 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.022e+00  4.587e-01   8.770  < 2e-16 ***
## logLotArea       1.055e-01  9.004e-03  11.715  < 2e-16 ***
## OverallQual      9.115e-02  4.826e-03  18.887  < 2e-16 ***
## OverallCond      5.100e-02  4.059e-03  12.564  < 2e-16 ***
## YearBuilt        2.726e-03  2.275e-04  11.983  < 2e-16 ***
## TotalBsmtSF      6.115e-05  1.693e-05   3.613 0.000313 ***
## X1stFlrSF        4.556e-05  2.168e-05   2.101 0.035796 *  
## GrLivArea        1.637e-04  1.943e-05   8.427  < 2e-16 ***
## BsmtFullBath     7.397e-02  8.282e-03   8.931  < 2e-16 ***
## BsmtHalfBath     2.738e-02  1.659e-02   1.651 0.099029 .  
## FullBath         5.128e-02  1.170e-02   4.383 1.26e-05 ***
## HalfBath         3.084e-02  1.102e-02   2.800 0.005186 ** 
## GarageCars       8.008e-02  1.186e-02   6.753 2.10e-11 ***
## GarageArea      -3.712e-05  4.046e-05  -0.917 0.359113    
## StreetScore      1.894e-01  6.121e-02   3.094 0.002014 ** 
## CentralAirScore  7.932e-02  1.821e-02   4.357 1.41e-05 ***
## BedroomAbvGr    -1.771e-02  6.993e-03  -2.532 0.011446 *  
## KitchenAbvGr    -9.933e-02  2.034e-02  -4.884 1.16e-06 ***
## TotRmsAbvGrd     1.586e-02  5.135e-03   3.088 0.002052 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1469 on 1441 degrees of freedom
## Multiple R-squared:  0.8665, Adjusted R-squared:  0.8648 
## F-statistic: 519.6 on 18 and 1441 DF,  p-value: < 2.2e-16
# iteration 2 with GarageArea taken out
sp.lm <- update(sp.lm, .~. - GarageArea, data = hp_raw_df)
# summary of model
summary(sp.lm)
## 
## Call:
## lm(formula = logSalePrice ~ logLotArea + OverallQual + OverallCond + 
##     YearBuilt + TotalBsmtSF + X1stFlrSF + GrLivArea + BsmtFullBath + 
##     BsmtHalfBath + FullBath + HalfBath + GarageCars + StreetScore + 
##     CentralAirScore + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd, 
##     data = hp_raw_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.07147 -0.06504  0.00230  0.07338  0.48900 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      4.044e+00  4.580e-01   8.830  < 2e-16 ***
## logLotArea       1.047e-01  8.967e-03  11.681  < 2e-16 ***
## OverallQual      9.125e-02  4.825e-03  18.913  < 2e-16 ***
## OverallCond      5.086e-02  4.056e-03  12.539  < 2e-16 ***
## YearBuilt        2.716e-03  2.272e-04  11.953  < 2e-16 ***
## TotalBsmtSF      6.003e-05  1.688e-05   3.556 0.000389 ***
## X1stFlrSF        4.541e-05  2.168e-05   2.094 0.036395 *  
## GrLivArea        1.614e-04  1.927e-05   8.379  < 2e-16 ***
## BsmtFullBath     7.370e-02  8.276e-03   8.905  < 2e-16 ***
## BsmtHalfBath     2.770e-02  1.658e-02   1.671 0.094989 .  
## FullBath         5.227e-02  1.165e-02   4.487 7.79e-06 ***
## HalfBath         3.165e-02  1.098e-02   2.882 0.004006 ** 
## GarageCars       7.142e-02  7.178e-03   9.950  < 2e-16 ***
## StreetScore      1.927e-01  6.110e-02   3.153 0.001648 ** 
## CentralAirScore  7.892e-02  1.820e-02   4.336 1.55e-05 ***
## BedroomAbvGr    -1.740e-02  6.985e-03  -2.491 0.012853 *  
## KitchenAbvGr    -9.916e-02  2.034e-02  -4.876 1.20e-06 ***
## TotRmsAbvGrd     1.603e-02  5.131e-03   3.125 0.001816 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1469 on 1442 degrees of freedom
## Multiple R-squared:  0.8664, Adjusted R-squared:  0.8648 
## F-statistic: 550.1 on 17 and 1442 DF,  p-value: < 2.2e-16
# iteration 3 with BsmtHalfBath taken out
sp.lm <- update(sp.lm, .~. - BsmtHalfBath, data = hp_raw_df)
# summary of model
summary(sp.lm)
## 
## Call:
## lm(formula = logSalePrice ~ logLotArea + OverallQual + OverallCond + 
##     YearBuilt + TotalBsmtSF + X1stFlrSF + GrLivArea + BsmtFullBath + 
##     FullBath + HalfBath + GarageCars + StreetScore + CentralAirScore + 
##     BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd, data = hp_raw_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.07607 -0.06417  0.00143  0.07380  0.51672 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.998e+00  4.574e-01   8.739  < 2e-16 ***
## logLotArea       1.054e-01  8.963e-03  11.762  < 2e-16 ***
## OverallQual      9.091e-02  4.823e-03  18.848  < 2e-16 ***
## OverallCond      5.161e-02  4.033e-03  12.797  < 2e-16 ***
## YearBuilt        2.736e-03  2.271e-04  12.051  < 2e-16 ***
## TotalBsmtSF      6.092e-05  1.688e-05   3.608 0.000319 ***
## X1stFlrSF        4.595e-05  2.169e-05   2.118 0.034327 *  
## GrLivArea        1.621e-04  1.927e-05   8.412  < 2e-16 ***
## BsmtFullBath     7.130e-02  8.156e-03   8.742  < 2e-16 ***
## FullBath         5.102e-02  1.163e-02   4.386 1.24e-05 ***
## HalfBath         3.153e-02  1.099e-02   2.870 0.004162 ** 
## GarageCars       7.159e-02  7.182e-03   9.968  < 2e-16 ***
## StreetScore      1.939e-01  6.113e-02   3.172 0.001544 ** 
## CentralAirScore  7.929e-02  1.821e-02   4.355 1.43e-05 ***
## BedroomAbvGr    -1.666e-02  6.975e-03  -2.389 0.017016 *  
## KitchenAbvGr    -9.943e-02  2.035e-02  -4.886 1.14e-06 ***
## TotRmsAbvGrd     1.560e-02  5.127e-03   3.042 0.002389 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1469 on 1443 degrees of freedom
## Multiple R-squared:  0.8662, Adjusted R-squared:  0.8647 
## F-statistic: 583.6 on 16 and 1443 DF,  p-value: < 2.2e-16
# now all predictors have p values < 0.05
sp.lm$coefficients
##     (Intercept)      logLotArea     OverallQual     OverallCond 
##    3.997640e+00    1.054207e-01    9.091124e-02    5.161414e-02 
##       YearBuilt     TotalBsmtSF       X1stFlrSF       GrLivArea 
##    2.736430e-03    6.091755e-05    4.595251e-05    1.621374e-04 
##    BsmtFullBath        FullBath        HalfBath      GarageCars 
##    7.129873e-02    5.102093e-02    3.153424e-02    7.158917e-02 
##     StreetScore CentralAirScore    BedroomAbvGr    KitchenAbvGr 
##    1.939375e-01    7.929492e-02   -1.666465e-02   -9.942732e-02 
##    TotRmsAbvGrd 
##    1.560008e-02
# build function for regression model (hand-coded from the fitted coefficients)
sp.lm_equation <- function(logLotArea, OverallQual, OverallCond, YearBuilt,
                           TotalBsmtSF, X1stFlrSF, GrLivArea, BsmtFullBath,
                           FullBath, HalfBath, GarageCars, StreetScore,
                           CentralAirScore, BedroomAbvGr, KitchenAbvGr, TotRmsAbvGrd) {
  3.99764 +
    (0.1054207 * logLotArea) + (0.09091124 * OverallQual) +
    (0.05161414 * OverallCond) + (0.00273643 * YearBuilt) +
    (0.00006091755 * TotalBsmtSF) + (0.00004595251 * X1stFlrSF) +
    (0.0001621374 * GrLivArea) + (0.07129873 * BsmtFullBath) +
    (0.05102093 * FullBath) + (0.03153424 * HalfBath) +
    (0.07158917 * GarageCars) + (0.1939375 * StreetScore) +
    (0.07929492 * CentralAirScore) + (-0.01666465 * BedroomAbvGr) +
    (-0.09942732 * KitchenAbvGr) + (0.01560008 * TotRmsAbvGrd)
}

The linear regression equation is as follows:

logSalePrice = 3.99764 + (0.1054207 * logLotArea) + (0.09091124 * OverallQual) + (0.05161414 * OverallCond) + (0.00273643 * YearBuilt) + (0.00006091755 * TotalBsmtSF) + (0.00004595251 * X1stFlrSF) + (0.0001621374 * GrLivArea) + (0.07129873 * BsmtFullBath) + (0.05102093 * FullBath) + (0.03153424 * HalfBath) + (0.07158917 * GarageCars) + (0.1939375 * StreetScore) + (0.07929492 * CentralAirScore) + (-0.01666465 * BedroomAbvGr) + (-0.09942732 * KitchenAbvGr) + (0.01560008 * TotRmsAbvGrd)

Having constructed a model where all p-values are below 0.05, I looked at the residual plot and Q-Q plot:

# residual plot
plot(fitted(sp.lm), resid(sp.lm), col = "royalblue2")

# q-q plot
qqnorm(sp.lm$residuals, col = "maroon")
qqline(sp.lm$residuals)

The residual plot is acceptable, but the Q-Q plot is heavy-tailed. This is not a surprise: at each stage of the backward elimination, the residuals, though fairly centered in terms of mean and quartiles, were skewed toward the minimum. The residual plot reflects this as well: there are two large outliers, and slightly more points fall below zero than above it.
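
The rows behind those two large outliers can be indexed directly from the residual vector; a quick sketch:

# row indices of the largest-magnitude residuals seen in the plots
head(order(abs(resid(sp.lm)), decreasing = TRUE), 5)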

Next, I read in the test data and used the linear regression model to predict logSalePrice, which was then transformed back to SalePrice:

# load data from csv
hp_test_df <- read.csv("https://raw.githubusercontent.com/Randles-CUNY/data602_final_exam/master/test.csv", colClasses = c("integer", "integer", "character", "integer", "integer", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "character", "integer", "integer", "integer", "integer", "character", "character", "character", "character", "character", "integer", "character", "character", "character", "character", "character", "character", "character", "integer", "character", "integer", "integer", "integer", "character", "character", "character", "character", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "integer", "character", "integer", "character", "integer", "character", "character", "integer", "character", "integer", "integer", "character", "character", "character", "integer", "integer", "integer", "integer", "integer", "integer", "character", "character", "character", "integer", "integer", "integer", "character", "character"), header = TRUE, sep = ",", stringsAsFactors = FALSE)
# add central air dummy variable (0 = no Central Air, 1 = Central Air)
hp_test_df$CentralAirScore <- ifelse(hp_test_df$CentralAir == "N", 0, 1)
# add street dummy variable (0 = gravel, 1 = paved)
hp_test_df$StreetScore <- ifelse(hp_test_df$Street == "Grvl", 0, 1)
# add logLotArea
hp_test_df$logLotArea <- log(hp_test_df$LotArea)
# predict logSalePrice using the test data and linear regression model
sp.lm_predicted <- sp.lm_equation(hp_test_df$logLotArea, hp_test_df$OverallQual, hp_test_df$OverallCond, hp_test_df$YearBuilt, hp_test_df$TotalBsmtSF, hp_test_df$X1stFlrSF, hp_test_df$GrLivArea, hp_test_df$BsmtFullBath, hp_test_df$FullBath, hp_test_df$HalfBath, hp_test_df$GarageCars, hp_test_df$StreetScore, hp_test_df$CentralAirScore, hp_test_df$BedroomAbvGr, hp_test_df$KitchenAbvGr, hp_test_df$TotRmsAbvGrd)
# compute mean
sp.lm_predicted_mean <- mean(sp.lm_predicted, na.rm = TRUE)
# replace NA predictions (rows with missing predictor values) with the mean
sp.lm_predicted <- ifelse(is.na(sp.lm_predicted), sp.lm_predicted_mean, sp.lm_predicted)
# transform logSalePrice prediction back to SalePrice
sp.lm_predicted <- exp(sp.lm_predicted)
# put Id and the transformed SalePrice into a submission data frame
sp.lm_predicted_df <- data.frame(Id = seq(from = 1461, to = 2919, by = 1), SalePrice = sp.lm_predicted)
# export to .csv for submission (row.names = FALSE keeps only the Id and SalePrice columns)
write.csv(sp.lm_predicted_df, file = "C:/Users/Lelan/Documents/Education/CUNY/DATA605/Spring_2017/Final_Exam/submission_LMR.csv", row.names = FALSE)
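
As an aside, the same predictions could be generated with R's built-in predict() rather than a hand-transcribed equation; a sketch assuming the hp_test_df built above (predict.lm defaults to na.pass, so rows with missing predictors come back NA and are handled as before; pred_log and pred_price are new names):

# equivalent prediction directly from the fitted model object
pred_log <- predict(sp.lm, newdata = hp_test_df)
pred_log[is.na(pred_log)] <- mean(pred_log, na.rm = TRUE)
pred_price <- exp(pred_log)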

The data frame was exported as “submission_LMR.csv” and loaded to my GitHub repo (https://github.com/Randles-CUNY/data602_final_exam/blob/master/submission_LMR.csv) as well as Kaggle.com. My Kaggle.com user name is “Leland R” and my score is 0.14116.
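
For context, Kaggle scores this competition by RMSE on the log of SalePrice; the in-sample analogue, which is typically optimistic relative to the leaderboard, can be read from the model residuals:

# in-sample RMSE on the log scale (the 0.14116 leaderboard score is the out-of-sample counterpart)
sqrt(mean(resid(sp.lm)^2))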