1. Pick one of the quantitative independent variables (\(X_{i}\)) from the data set below, and define that variable as X. Also, pick one of the dependent variables (\(Y_{i}\)) below, and define that as Y.

For this final, I’m using X4 and Y4 for my X and Y.

q1_X <- c(9.3, 12.4, 19.9, 6.9, -1.0, 10.6, 6.4, 10.6, 1.2, 7.7, 15.5, 6.9, 13.7, 3.7, 4.4, 11.5, 4.2, 13.9, 12.9, 1.2)
q1_Y <- c(20.2, 18.6, 22.6, 11.4, 23.6, 24.0, 26.0, 26.8, 19.7, 22.7, 16.8, 20.2, 21.7, 20.9, 26.9, 16.3, 19.9, 15.5, 26.5, 21.7)

Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3rd quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

The quantile function provides the minimum, maximum, and quartiles for a set of numbers, where 0% is the minimum, 25% is the first quartile, 50% is the second quartile, 75% is the third quartile, and 100% is the maximum.
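Because the indexing in the next chunk depends on that ordering, printing the quantile vectors makes the [[4]] and [[2]] selections below easier to follow:

quantile(q1_X) # named vector: 0%, 25%, 50%, 75%, 100%; element 4 is the 3rd quartile
quantile(q1_Y) # element 2 is the 1st quartile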

q1_x <- quantile(q1_X)[[4]]
q1_y <- quantile(q1_Y)[[2]]

a. \(P(X>x|Y>y)\)

\(P(A|B) = \frac{P(B\text{ and }A)}{P(B)}\)

\(\frac{P(Y>y\text{ and }X>x)}{P(Y>y)}\)

q1_Y_gt_y <- sum(q1_Y > q1_y)/length(q1_Y)
q1a_B_n_A <- sum(q1_Y > q1_y & q1_X > q1_x)/length(q1_X)
q1a <- q1a_B_n_A/q1_Y_gt_y

\(P(X>x|Y>y) =\) 0.2.

b. \(P(X>x, Y>y)\)

\(P(A, B) = P(A\text{ and }B)\)

\(P(X>x\text{ and }Y>y)\)

q1b <- sum(q1_X > q1_x & q1_Y > q1_y)/length(q1_X)

\(P(X>x, Y>y) =\) 0.15.

c. \(P(X<x|Y>y)\)

\(P(A|B) = \frac{P(B\text{ and }A)}{P(B)}\)

\(\frac{P(Y>y\text{ and }X<x)}{P(Y>y)}\)

q1c_B_n_A <- sum(q1_Y > q1_y & q1_X < q1_x)/length(q1_X)
q1c <- q1c_B_n_A/q1_Y_gt_y

\(P(X<x|Y>y) =\) 0.8.

In addition, make a table of counts as shown below.

q1_df <- data.frame("x/y"= c("<= 1st", "> 1st", "Total"), 
                    "<= 3rd" = c(sum(q1_X<=q1_x & q1_Y<=q1_y), sum(q1_X<=q1_x & q1_Y>q1_y), sum(q1_X<=q1_x & q1_Y<=q1_y) + sum(q1_X<=q1_x & q1_Y>q1_y)),
                    "> 3rd" = c(sum(q1_X>q1_x & q1_Y<=q1_y), sum(q1_X>q1_x & q1_Y>q1_y), sum(q1_X>q1_x & q1_Y<=q1_y) + sum(q1_X>q1_x & q1_Y>q1_y)),
                    "Total" = c(sum(q1_X<=q1_x & q1_Y<=q1_y) + sum(q1_X>q1_x & q1_Y<=q1_y), sum(q1_X<=q1_x & q1_Y>q1_y) + sum(q1_X>q1_x & q1_Y>q1_y), sum(q1_X<=q1_x & q1_Y<=q1_y) + sum(q1_X>q1_x & q1_Y<=q1_y) + sum(q1_X<=q1_x & q1_Y>q1_y) + sum(q1_X>q1_x & q1_Y>q1_y))
                    )
knitr::kable(q1_df, col.names=c("x/y", "<= 3rd", "> 3rd", "Total"))
|x/y    | <= 3rd| > 3rd| Total|
|:------|------:|-----:|-----:|
|<= 1st |      3|     2|     5|
|> 1st  |     12|     3|    15|
|Total  |     15|     5|    20|
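For reference, an equivalent and more compact way to build the same table of counts is sketched below (addmargins labels the margin row and column "Sum" rather than "Total"):

# Sketch: same counts via table() and addmargins()
q1_counts <- addmargins(table(y = ifelse(q1_Y <= q1_y, "<= 1st", "> 1st"),
                              x = ifelse(q1_X <= q1_x, "<= 3rd", "> 3rd")))
knitr::kable(q1_counts)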

Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 1st quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y. Does P(AB)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.

q1_x <- quantile(q1_X)[[2]]
q1_y <- quantile(q1_Y)[[2]]
q1_AB <- sum(q1_X>q1_x & q1_Y>q1_y)/length(q1_X)

\(P(AB) =\) 0.5.

q1_AxB <- sum(q1_X>q1_x)/length(q1_X) * sum(q1_Y>q1_y)/length(q1_Y)

\(P(A)\times P(B) =\) 0.5625.

\(P(AB) \neq P(A)\times P(B)\)

chisq.test(x=q1_X, y=q1_Y)
## 
##  Pearson's Chi-squared test
## 
## data:  q1_X and q1_Y
## X-squared = 290, df = 272, p-value = 0.2166

Although the empirical proportions differ slightly (\(P(AB) \neq P(A)\times P(B)\)), the p-value (0.2166) is greater than the significance level (0.05), so we fail to reject the null hypothesis of independence: the Chi Square test finds no statistically significant association between the two variables, which is consistent with treating them as independent.
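Since chisq.test above was run on the raw values (treating every distinct value as its own category), a test on the 2 x 2 table of the A/B indicators is arguably closer to the spirit of the question. A minimal sketch, using the 1st-quartile cutoffs defined above:

# Sketch: chi-square test of association on the 2 x 2 indicator table
# (A = X above its 1st quartile, B = Y above its 1st quartile)
q1_indicator_table <- table(A = q1_X > q1_x, B = q1_Y > q1_y)
chisq.test(q1_indicator_table)

With only 20 observations the expected counts are small, so R may warn that the chi-squared approximation is inexact; fisher.test(q1_indicator_table) is a common alternative in that case.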

2. You are to register for Kaggle.com (free) and compete in the House Prices: Advanced Regression Techniques competition. https://www.kaggle.com/c/house-prices-advanced-regression-techniques . I want you to do the following.

Descriptive and Inferential Statistics.

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any THREE quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide a 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

train_dim <- dim(train)
train_col_types <- table(sapply(train, class))

The training set contains 1460 observations across 81 variables. Of the variables, 43 are character strings and 38 are integers. Overall, about 5.89% of the values in the data set are missing (94.11% are present).

|Variable     | Missing Values (%)|
|:------------|------------------:|
|PoolQC       |         99.5205479|
|MiscFeature  |         96.3013699|
|Alley        |         93.7671233|
|Fence        |         80.7534247|
|FireplaceQu  |         47.2602740|
|LotFrontage  |         17.7397260|
|GarageType   |          5.5479452|
|GarageYrBlt  |          5.5479452|
|GarageFinish |          5.5479452|
|GarageQual   |          5.5479452|
|GarageCond   |          5.5479452|
|BsmtExposure |          2.6027397|
|BsmtFinType2 |          2.6027397|
|BsmtQual     |          2.5342466|
|BsmtCond     |          2.5342466|
|BsmtFinType1 |          2.5342466|
|MasVnrType   |          0.5479452|
|MasVnrArea   |          0.5479452|
|Electrical   |          0.0684932|

The above table lists the variables with missing data, sorted by the percentage of values missing from each variable.
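The percentages above could be produced along the following lines (a sketch, not necessarily the exact code used; missing_pct is an illustrative name), which also yields the overall share of missing cells quoted earlier:

# Sketch: percent of missing values per column, largest first,
# keeping only the columns that have at least one NA
missing_pct <- sort(colMeans(is.na(train)) * 100, decreasing = TRUE)
missing_pct <- missing_pct[missing_pct > 0]
knitr::kable(data.frame(Variable = names(missing_pct),
                        "Missing Values %" = unname(missing_pct), check.names = FALSE))
# Overall percent of missing cells in the training set
mean(is.na(train)) * 100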

We’ll examine three variables in particular, all of which are fully populated: the sale price of each house, the area of each lot, and the year each house was built.

ggplot(train, aes(x=SalePrice)) + 
  geom_histogram(bins=100, color="darkblue", fill="lightblue") + 
  xlab("Sale Price") + 
  ggtitle("Houses Sold by Sale Price") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10))

The sale price of each house is the dependent variable, as it depends on many factors and changing any number of them can affect the value of a house. As can be seen above, the majority of the houses sold in the $100,000 to $300,000 range, with a minimum price of $34,900 and a maximum price of $755,000. On average, homes sold for $180,921.2, and the median home sale price was $163,000.

ggplot(train, aes(x=LotArea)) + 
  geom_histogram(bins=100, color="darkblue", fill="lightblue") + 
  xlab("Area of Each Lot (sqft)") + 
  ggtitle("Houses Sold by Lot Area (sqft)") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10))

The area of each house’s lot is an independent variable, as it depends on few if any other factors. As can be seen above, the majority of the lots were in the 0 to 20,000 ft² range, with a minimum area of 1,300 ft² and a maximum area of 215,245 ft². On average, homes sold were on a 10,516.83 ft² lot, and the median lot area was 9,478.5 ft². To get a better graphical picture of the typical lot, the x-axis can be limited to 40,000 ft² or less.

ggplot(train, aes(x=LotArea)) + 
  geom_histogram(bins=100, color="darkblue", fill="lightblue") + 
  xlab("Area of Each Lot (sqft)") + 
  ggtitle("Houses Sold by Lot Area (sqft)") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10), limits=c(0, 40000))

With this new visual representation of the area of each lot in ft², it can be seen that the majority of the houses sold had a lot area between 5,000 and 15,000 ft².

ggplot(train, aes(x=YearBuilt)) + 
  geom_histogram(bins=100, color="darkblue", fill="lightblue") + 
  xlab("Each House's Year Built") + 
  ggtitle("Houses Sold by Year Built") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10), limits=c(1870,2020))

The year each house was built is an independent variable, as it relies on few if any other factors. As can be seen above, the majority of the houses were built between 1950 and 2010, with the earliest built in 1872 and the latest built in 2010. On average, homes were built in 1971 and the median year built was 1973.

pairs(train[c("SalePrice", "LotArea", "YearBuilt")])

Because the year each house was built behaves more like a categorical variable, the square footage of the first floor of each home was chosen instead for the correlation matrix.

corr <- round(cor(train[c("SalePrice", "LotArea", "X1stFlrSF")]), 3)
ggcorrplot(corr, method="square", lab=TRUE)

As can be seen above, there is a positive correlation among all three variables, with the strongest correlation existing between the price the house sold for and the square footage of its first floor.

cor_pairs <- list("SalePrice x X1stFlrSF" = cor.test(train$SalePrice, train$X1stFlrSF, method = "pearson", conf.level = 0.80),
                  "SalePrice x LotArea" = cor.test(train$SalePrice, train$LotArea, method = "pearson", conf.level = 0.80),
                  "X1stFlrSF x LotArea" = cor.test(train$X1stFlrSF, train$LotArea, method = "pearson", conf.level = 0.80))
confidence_interval <- data.frame("Variables" = names(cor_pairs),
                                  "80% Confidence Lower" = unname(sapply(cor_pairs, function(ct) ct$conf.int[1])),
                                  "80% Confidence Upper" = unname(sapply(cor_pairs, function(ct) ct$conf.int[2])),
                                  "p-Value" = unname(sapply(cor_pairs, function(ct) format(ct$p.value, scientific = TRUE))))
knitr::kable(confidence_interval, col.names = c("Variables", "80% Confidence Interval Lower", "80% Confidence Interval Upper", "p-Value"), align = "c")
|       Variables       | 80% Confidence Interval Lower | 80% Confidence Interval Upper |    p-Value    |
|:---------------------:|:-----------------------------:|:-----------------------------:|:-------------:|
| SalePrice x X1stFlrSF |           0.5841687           |           0.6266715           | 5.394711e-147 |
|  SalePrice x LotArea  |           0.2323391           |           0.2947946           | 1.123139e-24  |
|  X1stFlrSF x LotArea  |           0.2686127           |           0.3297222           | 1.234238e-31  |

Based on the above, each 80% confidence interval gives a range that we are 80% confident contains the true correlation for that pair, and the point estimates from the correlation matrix created previously do fall within these ranges. The intervals indicate a stronger correlation between the sale price of the houses and the first floor’s square footage, and a weaker (though still positive) correlation when the area of the house’s lot is paired with either the sale price or the first floor’s square footage. Because the p-values are all far below any reasonable significance level, the null hypothesis - that the correlation between each pairwise set of variables is 0 - can be safely rejected for every pair. Familywise error is not a serious concern here: even after adjusting for the three simultaneous tests (for example, a Bonferroni threshold of 0.05/3), the p-values remain far below the cutoff.
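To make the familywise-error argument concrete, the three p-values can be adjusted for multiple testing directly. A small sketch using a Bonferroni correction (the adjusted values remain effectively zero):

# Sketch: Bonferroni adjustment of the three pairwise correlation p-values
pairwise_p <- c(cor.test(train$SalePrice, train$X1stFlrSF)$p.value,
                cor.test(train$SalePrice, train$LotArea)$p.value,
                cor.test(train$X1stFlrSF, train$LotArea)$p.value)
p.adjust(pairwise_p, method = "bonferroni")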

Linear Algebra and Correlation.

Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

precision_corr <- solve(corr)
knitr::kable(precision_corr, caption="Precision Matrix")
Precision Matrix

|          |  SalePrice|    LotArea|  X1stFlrSF|
|:---------|----------:|----------:|----------:|
|SalePrice |  1.5994012| -0.1454427| -0.9257498|
|LotArea   | -0.1454427|  1.1114042| -0.2441715|
|X1stFlrSF | -0.9257498| -0.2441715|  1.6340117|
corr_x_prec <- corr %*% precision_corr
knitr::kable(corr_x_prec, caption="Correlation Matrix multiplied by the Precision Matrix")
Correlation Matrix multiplied by the Precision Matrix

|          | SalePrice| LotArea| X1stFlrSF|
|:---------|---------:|-------:|---------:|
|SalePrice |         1|       0|         0|
|LotArea   |         0|       1|         0|
|X1stFlrSF |         0|       0|         1|
prec_x_corr <- precision_corr %*% corr
knitr::kable(prec_x_corr, caption="Precision Matrix multiplied by the Correlation Matrix")
Precision Matrix multiplied by the Correlation Matrix

|          | SalePrice| LotArea| X1stFlrSF|
|:---------|---------:|-------:|---------:|
|SalePrice |         1|       0|         0|
|LotArea   |         0|       1|         0|
|X1stFlrSF |         0|       0|         1|
L <- as.matrix(expand(lu(precision_corr))[["L"]])
rownames(L) <- c("SalePrice", "LotArea", "X1stFlrSF")
colnames(L) <- c("SalePrice", "LotArea", "X1stFlrSF")
knitr::kable(L, caption="Lower Triangular Decomposition", align="c")
Lower Triangular Decomposition

|          | SalePrice  | LotArea | X1stFlrSF |
|:---------|:----------:|:-------:|:---------:|
|SalePrice | 1.0000000  |  0.000  |     0     |
|LotArea   | -0.0909357 |  1.000  |     0     |
|X1stFlrSF | -0.5788102 | -0.299  |     1     |
U <- as.matrix(expand(lu(precision_corr))[["U"]])
rownames(U) <- c("SalePrice", "LotArea", "X1stFlrSF")
colnames(U) <- c("SalePrice", "LotArea", "X1stFlrSF")
knitr::kable(U, caption="Upper Triangular Decomposition", align="c")
Upper Triangular Decomposition

|          | SalePrice |  LotArea   | X1stFlrSF  |
|:---------|:---------:|:----------:|:----------:|
|SalePrice | 1.599401  | -0.1454427 | -0.9257498 |
|LotArea   | 0.000000  | 1.0981782  | -0.3283553 |
|X1stFlrSF | 0.000000  | 0.0000000  | 1.0000000  |
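As a quick sanity check (a sketch; this assumes lu() did not apply a row permutation, i.e. the P factor returned by expand() is the identity), multiplying the triangular factors back together should recover the precision matrix:

# Sketch: L %*% U should reproduce the precision matrix (differences ~ 0)
round(L %*% U - precision_corr, 10)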

Calculus-Based Probability & Statistics.

Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

In order to determine which variables are skewed to the right, each one first needs to be viewed as a histogram. 100 bins were used for each variable to show as much detail as possible. Below are histograms of all variables containing numeric values - this includes some variables that are really qualitative codes.

train_integers <- train[sapply(train, is.numeric)]
train_hist_plots <- lapply(names(train_integers), function(col)
  ggplot(train_integers, aes(x = .data[[col]])) + geom_histogram(bins=100, color="darkblue", fill="lightblue") + labs(x = col))
grid.arrange(grobs=train_hist_plots[2:10])

grid.arrange(grobs=train_hist_plots[11:19])

grid.arrange(grobs=train_hist_plots[20:28])

grid.arrange(grobs=train_hist_plots[29:38])

From this selection, the sale price of each house will be used as the right-skewed variable. As its minimum value ($34,900) is already above zero, the data does not need to be shifted.

library(MASS)
saleprice_fitdistr <- fitdistr(train_integers$SalePrice, densfun="exponential")
lambda <- saleprice_fitdistr$estimate
saleprice_samples <- data.frame(table(round(rexp(1000, lambda))))
names(saleprice_samples) <- c("Samples", "Frequency")
saleprice_samples$Samples <- as.numeric(levels(saleprice_samples$Samples))[saleprice_samples$Samples]
saleprice_samples$Frequency <- as.numeric(levels(saleprice_samples$Frequency))[saleprice_samples$Frequency]
ggplot(saleprice_samples, aes(x=Samples)) + 
  geom_histogram(bins=100, color="darkblue", fill="lightblue") + 
  xlab("Sample Sale Price") + 
  ggtitle("Sample Houses Sold by Sale Price") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10))

The sample offers a more severe right skew of sale prices than the actual data; the observations have a definite left tail that grows steadily into a peak, whereas the sample’s peak is found right at the beginning of the range. In a broader comparison, 70.2% of the observed homes sold for less than $200,000, while 64.5% of the sample values fall below $200,000. For a clearer comparison, both histograms can be overlaid.
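The two percentages quoted above can be computed roughly as follows (a sketch; the sample figure will vary from run to run because the rexp draws are random):

# Share of observed homes selling for under $200,000
mean(train$SalePrice < 200000) * 100
# Share of the 1,000 exponential draws under $200,000, using the frequency table built earlier
sum(saleprice_samples$Frequency[saleprice_samples$Samples < 200000]) / 1000 * 100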

ggplot() +
  geom_histogram(data=saleprice_samples, aes(x=Samples), bins=100, color="maroon", fill="pink", alpha=0.3) +
  geom_histogram(data=train_integers, aes(x=SalePrice), bins=100, color="darkblue", fill="lightblue", alpha=0.3) +
  labs(x="Houses Sold") +
  ggtitle("Houses Sold: Blue, Actual; Red, Sample") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 20), limits=c(0,1310000)) +
  theme(axis.text.x = element_text(angle=45))

The above shows the sample sale prices in red and the observed sale prices in blue. What is important to remember is that there are 1,000 samples, and 1,460 actual observations of houses sold; when considered as a percentage, it is clear that the majority of the samples are below $200,000 in price. Additionally, the distribution is arguably more normal in the observations.

percentiles <- quantile(saleprice_samples$Samples, c(0.05, 0.95))

The 5th percentile of the samples is $9,512.7, and the 95th percentile of the samples is $564,488.8.
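The prompt also asks for the 5th and 95th percentiles from the exponential CDF itself; these can be read directly off the fitted distribution with the inverse CDF (a sketch, equivalent to -log(1 - p) / lambda):

# Theoretical 5th and 95th percentiles of the fitted exponential distribution
qexp(c(0.05, 0.95), rate = lambda)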

saleprice_confidence_interval <- c(t.test(train$SalePrice, conf.level = 0.95)$conf.int[1], t.test(train$SalePrice, conf.level = 0.95)$conf.int[2])
saleprice_percentiles <- quantile(train$SalePrice, c(0.05, 0.95))
samples_confidence_interval <- c(t.test(saleprice_samples$Samples, conf.level = 0.95)$conf.int[1], t.test(saleprice_samples$Samples, conf.level = 0.95)$conf.int[2])

The empirical data has a 95% confidence interval for the mean of [$176,842.8, $184,999.6], a 5th percentile of $88,000, and a 95th percentile of $326,100.

The mean of the samples ($186,345) does not fall within the confidence interval of the empirical data ([$176,842.8, $184,999.6]), while the mean of the empirical data ($180,921) does fall within the much wider confidence interval of the samples ([$175,068.6, $197,621.5]). Together with the mismatch in shape noted above, this suggests the fitted exponential distribution does not represent the empirical data particularly well, and the generated sample may not be reflective of it.

Modeling.

Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

The following adjusted R2 values are obtained when each variable is used on its own in a simple linear model predicting SalePrice.

summaries <- lapply(train, function(x) summary(lm(SalePrice~x, data=train)))[-81]
r2 <- lapply(summaries, function(x) x$adj.r.squared)
knitr::kable(sort(unlist(r2), decreasing=TRUE), col.names = c("R-Squared"))
|Variable      |  R-Squared|
|:-------------|----------:|
|OverallQual   |  0.6253951|
|Neighborhood  |  0.5379749|
|GrLivArea     |  0.5018072|
|ExterQual     |  0.4763110|
|KitchenQual   |  0.4554790|
|BsmtQual      |  0.4526012|
|GarageCars    |  0.4097194|
|GarageArea    |  0.3882475|
|TotalBsmtSF   |  0.3760534|
|X1stFlrSF     |  0.3666228|
|FullBath      |  0.3138736|
|TotRmsAbvGrd  |  0.2843699|
|Alley         |  0.2774686|
|YearBuilt     |  0.2729233|
|GarageFinish  |  0.2662114|
|YearRemodAdd  |  0.2566419|
|Foundation    |  0.2538112|
|GarageYrBlt   |  0.2359932|
|MasVnrArea    |  0.2274672|
|Fireplaces    |  0.2174862|
|GarageType    |  0.2037493|
|HeatingQC     |  0.1932888|
|BsmtFinType1  |  0.1897486|
|MasVnrType    |  0.1862109|
|PoolQC        |  0.1729771|
|BsmtFinSF1    |  0.1487368|
|Exterior2nd   |  0.1450400|
|Exterior1st   |  0.1445647|
|SaleType      |  0.1325310|
|SaleCondition |  0.1325246|
|BsmtExposure  |  0.1285758|
|LotFrontage   |  0.1230318|
|FireplaceQu   |  0.1085047|
|MSZoning      |  0.1051062|
|WoodDeckSF    |  0.1046304|
|X2ndFlrSF     |  0.1013581|
|OpenPorchSF   |  0.0991477|
|HouseStyle    |  0.0819078|
|HalfBath      |  0.0800867|
|LotShape      |  0.0744726|
|LotArea       |  0.0689752|
|CentralAir    |  0.0625233|
|MiscFeature   |  0.0614813|
|Electrical    |  0.0570862|
|RoofStyle     |  0.0544562|
|PavedDrive    |  0.0532419|
|BsmtFullBath  |  0.0509340|
|BsmtUnfSF     |  0.0453470|
|Fence         |  0.0405849|
|BldgType      |  0.0318798|
|BedroomAbvGr  |  0.0276292|
|Condition1    |  0.0272971|
|BsmtCond      |  0.0267552|
|RoofMatl      |  0.0267436|
|GarageQual    |  0.0242739|
|GarageCond    |  0.0241931|
|LandContour   |  0.0237868|
|ExterCond     |  0.0209334|
|LotConfig     |  0.0183280|
|KitchenAbvGr  |  0.0177976|
|EnclosedPorch |  0.0158578|
|Functional    |  0.0124191|
|ScreenPorch   |  0.0117430|
|Heating       |  0.0110480|
|PoolArea      |  0.0078584|
|MSSubClass    |  0.0064228|
|BsmtFinType2  |  0.0059547|
|OverallCond   |  0.0053798|
|Condition2    |  0.0051259|
|MoSold        |  0.0014716|
|LandSlope     |  0.0013126|
|X3SsnPorch    |  0.0013032|
|Street        |  0.0009992|
|YrSold        |  0.0001512|
|LowQualFinSF  | -0.0000297|
|Id            | -0.0002052|
|MiscVal       | -0.0002366|
|BsmtHalfBath  | -0.0004020|
|Utilities     | -0.0004808|
|BsmtFinSF2    | -0.0005563|

As can be observed, the three most influential individual variables in determining a home’s sale price were the overall quality of the house, the neighborhood it is located in, and the above-ground living area in square feet.

model_1 <- lm(SalePrice ~ OverallQual+Neighborhood+GrLivArea, data=train)
summary(model_1)
## 
## Call:
## lm(formula = SalePrice ~ OverallQual + Neighborhood + GrLivArea, 
##     data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -310337  -17569    -444   16022  275856 
## 
## Coefficients:
##                       Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)         -34829.240  11541.232  -3.018             0.002591 ** 
## OverallQual          20951.425   1162.274  18.026 < 0.0000000000000002 ***
## NeighborhoodBlueste -30752.876  27697.270  -1.110             0.267047    
## NeighborhoodBrDale  -43358.881  12978.523  -3.341             0.000857 ***
## NeighborhoodBrkSide -13025.453  10450.445  -1.246             0.212821    
## NeighborhoodClearCr  24575.635  11569.626   2.124             0.033828 *  
## NeighborhoodCollgCr  11414.309   9496.582   1.202             0.229586    
## NeighborhoodCrawfor  14444.250  10502.215   1.375             0.169237    
## NeighborhoodEdwards -17842.951   9985.734  -1.787             0.074174 .  
## NeighborhoodGilbert   -892.880   9954.350  -0.090             0.928540    
## NeighborhoodIDOTRR  -28178.987  11135.584  -2.531             0.011495 *  
## NeighborhoodMeadowV -19099.020  12999.351  -1.469             0.141990    
## NeighborhoodMitchel   2030.612  10555.017   0.192             0.847469    
## NeighborhoodNAmes    -4430.099   9517.291  -0.465             0.641659    
## NeighborhoodNoRidge  64642.989  10939.818   5.909     0.00000000429492 ***
## NeighborhoodNPkVill -17807.184  15304.007  -1.164             0.244795    
## NeighborhoodNridgHt  71587.840   9994.775   7.163     0.00000000000126 ***
## NeighborhoodNWAmes   -4720.658  10079.515  -0.468             0.639611    
## NeighborhoodOldTown -32080.877   9863.373  -3.253             0.001170 ** 
## NeighborhoodSawyer   -1219.381  10211.495  -0.119             0.904965    
## NeighborhoodSawyerW    303.099  10264.232   0.030             0.976446    
## NeighborhoodSomerst  17766.964   9829.881   1.807             0.070903 .  
## NeighborhoodStoneBr  69954.472  11689.508   5.984     0.00000000274014 ***
## NeighborhoodSWISU   -36640.152  11923.592  -3.073             0.002160 ** 
## NeighborhoodTimber   29905.811  10829.233   2.762             0.005826 ** 
## NeighborhoodVeenker  47106.893  14337.857   3.285             0.001043 ** 
## GrLivArea               55.565      2.499  22.237 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37010 on 1433 degrees of freedom
## Multiple R-squared:  0.7868, Adjusted R-squared:  0.783 
## F-statistic: 203.5 on 26 and 1433 DF,  p-value: < 0.00000000000000022

Based on the adjusted R2, this model isn’t a bad choice. Due to the length of these models’ output, only the adjusted R2 will be shown going forward.

pred <- predict(model_1, test, type = "response")
pred <- data.frame(Id=seq(1461,2919,1), SalePrice=pred)
pred_test <- pred[order(pred["SalePrice"]),][1:10,]
rownames(pred_test) <- NULL
pred_test
##      Id SalePrice
## 1  2099 -2392.516
## 2  2217 -1328.023
## 3  1788  8275.868
## 4  1601  8899.453
## 5  1815  8998.206
## 6  2872 16901.780
## 7  1537 21222.396
## 8  2786 23281.891
## 9  1916 23901.868
## 10 2579 25846.626

This first model predicts negative sale prices for some houses, so it may not be the best model available.

ggplot() + 
  geom_histogram(data=train, aes(x=SalePrice), bins=100, color="darkblue", fill="lightblue", alpha=0.3) + 
  geom_histogram(data=pred, aes(x=SalePrice), bins=100, color="darkgreen", fill="green", alpha=0.3) +
  xlab("Sale Price") + 
  ggtitle("Houses Sold by Sale Price, Model 1: Blue, Actual; Green, Predicted") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10))

Despite this, the predictions mostly line up with the training set’s distribution of sale prices - but how does a multiplicative (interaction) model using the same variables compare?

model_2 <- lm(SalePrice ~ OverallQual*Neighborhood*GrLivArea, data=train)
# summary(model_2)
paste("Adjusted r-squared: ", summary(model_2)$adj.r.squared)
## [1] "Adjusted r-squared:  0.863226334511322"

Judging by the adjusted R2 alone, the multiplicative model appears to be the better of the two.

pred <- predict(model_2, test, type = "response")
pred <- data.frame(Id=seq(1461,2919,1), SalePrice=pred)
pred_test <- pred[order(pred["SalePrice"]),][1:10,]
rownames(pred_test) <- NULL
pred_test
##      Id SalePrice
## 1  2872  22692.47
## 2  2217  32189.03
## 3  1601  36946.28
## 4  1916  48372.07
## 5  2905  48942.27
## 6  2579  49853.19
## 7  1586  55664.33
## 8  2892  57862.25
## 9  1848  59933.63
## 10 1823  60852.98

This second model produces no negative predictions, which is a point in its favor.

ggplot() + 
  geom_histogram(data=train, aes(x=SalePrice), bins=100, color="darkblue", fill="lightblue", alpha=0.3) + 
  geom_histogram(data=pred, aes(x=SalePrice), bins=100, color="darkgreen", fill="green", alpha=0.3) +
  xlab("Sale Price") + 
  ggtitle("Houses Sold by Sale Price, Model 2: Blue, Actual; Green, Predicted") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10))

However, its histogram matches the training data less closely than the previous additive model’s did, so it might not be the right choice.

Adding variables to a linear model can only increase - never decrease - the multiple R2; the adjusted R2, by contrast, penalizes extra terms and only rises when a new variable improves the fit by more than chance would predict. Keeping that in mind, adding the exterior and kitchen quality of the house might improve the results.

model_3 <- lm(SalePrice ~ OverallQual+Neighborhood+GrLivArea+ExterQual+KitchenQual, data=train)
# summary(model_3)
paste("Adjusted r-squared: ", summary(model_3)$adj.r.squared)
## [1] "Adjusted r-squared:  0.809381027150161"

Adding the two quality variables produced an adjusted R2 that is larger than the first additive model’s, though still below that of the multiplicative model.

pred <- predict(model_3, test, type = "response")
pred <- data.frame(Id=seq(1461,2919,1), SalePrice=pred)
pred_test <- pred[order(pred["SalePrice"]),][1:10,]
rownames(pred_test) <- NULL
pred_test
##      Id SalePrice
## 1  2217  14167.88
## 2  2099  17147.35
## 3  1601  27029.62
## 4  1815  27625.55
## 5  2872  33200.05
## 6  2786  36735.46
## 7  1788  36816.97
## 8  1537  38870.45
## 9  1916  40261.52
## 10 2579  42619.15

The predicted values are all greater than 0, which is a good sign for the linear model’s validity.

ggplot() + 
  geom_histogram(data=train, aes(x=SalePrice), bins=100, color="darkblue", fill="lightblue", alpha=0.3) + 
  geom_histogram(data=pred, aes(x=SalePrice), bins=100, color="darkgreen", fill="green", alpha=0.3) +
  xlab("Sale Price") + 
  ggtitle("Houses Sold by Sale Price, Model 3: Blue, Actual; Green, Predicted") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10))

This third model has a histogram that appears very similar to the training set’s.

model_4 <- lm(SalePrice ~ OverallQual*Neighborhood*GrLivArea*ExterQual*KitchenQual, data=train)
# summary(model_4)
paste("Adjusted r-squared: ", summary(model_4)$adj.r.squared)
## [1] "Adjusted r-squared:  0.884002939992005"

Lastly, just for completeness, the model above is a multiplicative multiple regression on the same five variables used in the third model; it has the highest adjusted R2 of the four models.

pred <- predict(model_4, test, type = "response")
pred <- data.frame(Id=seq(1461,2919,1), SalePrice=pred)
pred_test <- pred[order(pred["SalePrice"]),][1:10,]
rownames(pred_test) <- NULL
pred_test
##      Id  SalePrice
## 1  2264 -3761702.4
## 2  2503 -2903825.4
## 3  2618 -2875683.2
## 4  2690 -2109016.0
## 5  2279 -1965067.2
## 6  2529 -1354496.5
## 7  2284 -1048762.1
## 8  1947  -944189.3
## 9  2193  -903461.6
## 10 2883  -781362.1

There is an issue with this model: it produces 41 negative predicted sale prices. Nobody in their right mind would pay someone to take their home. My instinct says this model is no good, and the histogram overlay below shows the same.
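The count of negative predictions quoted above can be checked with a one-liner (a sketch, run against the model 4 predictions):

# Number of houses with a negative predicted sale price under model 4
sum(pred$SalePrice < 0, na.rm = TRUE)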

ggplot() + 
  geom_histogram(data=train, aes(x=SalePrice), bins=100, color="darkblue", fill="lightblue", alpha=0.3) + 
  geom_histogram(data=pred, aes(x=SalePrice), bins=100, color="darkgreen", fill="green", alpha=0.3) +
  xlab("Sale Price") + 
  ggtitle("Houses Sold by Sale Price, Model 4: Blue, Actual; Green, Predicted") +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10))

Increasing the number of variables increased the adjusted R2, but that does not necessarily mean the model is more accurate. To see how these models hold up, the resulting predictions were submitted to Kaggle. As a note, every model has an overall F-test p-value below 0.00000000000000022.
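Each set of predictions was presumably written out in the two-column format Kaggle expects before uploading; a sketch (the file name is illustrative):

# Sketch: write a submission file with the Id and predicted SalePrice columns
write.csv(pred, "model_submission.csv", row.names = FALSE)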

My Kaggle user name is gbartomeo.

results <- data.frame(ModelNum = c(1, 2, 3, 4),
                      Score = c(0.44655, 0.19727, 0.17660, 2.02452),
                      R2 = c(summary(model_1)$adj.r.squared, summary(model_2)$adj.r.squared, summary(model_3)$adj.r.squared, summary(model_4)$adj.r.squared)
                      )
rownames(results) <- NULL
knitr::kable(results, col.names=c("Model Number", "Score", "Adjusted R2"), align="c")
| Model Number |  Score  | Adjusted R2 |
|:------------:|:-------:|:-----------:|
|      1       | 0.44655 |  0.7829810  |
|      2       | 0.19727 |  0.8632263  |
|      3       | 0.17660 |  0.8093810  |
|      4       | 2.02452 |  0.8840029  |

My best score (0.1766; the closer to 0 the better) came from the third model, the additive model using five variables - the overall quality of the home, the neighborhood it is located in, the above-ground living area, the exterior quality of the home, and the quality of the home’s kitchen. The results imply that a higher adjusted R2 on the training data has little to no effect on the score received for the predictions; the heavily interacted fourth model had the highest adjusted R2 yet scored worst, which points to overfitting.