For this final, I’m using X4 and Y4 for my X and Y.
q1_X <- c(9.3, 12.4, 19.9, 6.9, -1.0, 10.6, 6.4, 10.6, 1.2, 7.7, 15.5, 6.9, 13.7, 3.7, 4.4, 11.5, 4.2, 13.9, 12.9, 1.2)
q1_Y <- c(20.2, 18.6, 22.6, 11.4, 23.6, 24.0, 26.0, 26.8, 19.7, 22.7, 16.8, 20.2, 21.7, 20.9, 26.9, 16.3, 19.9, 15.5, 26.5, 21.7)
Probability. Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3rd quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.
The quantile function provides the minimum, maximum, and quartiles for a set of numbers, where 0% is the minimum, 25% is the first quartile, 50% is the second quartile, 75% is the third quartile, and 100% is the maximum.
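For reference, printing the full quantile vector shows the values being indexed just below; for q1_X these work out to roughly -1.0 (0%), 4.35 (25%), 8.5 (50%), 12.525 (75%), and 19.9 (100%).
quantile(q1_X) # named vector; element 2 is the 1st quartile, element 4 is the 3rd quartile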
q1_x <- quantile(q1_X)[[4]]
q1_y <- quantile(q1_Y)[[2]]
\(P(A|B) = \frac{P(B\text{ and }A)}{P(B)}\)
\(\frac{P(Y>y\text{ and }X>x)}{P(Y>y)}\)
q1_Y_gt_y <- sum(q1_Y > q1_y)/length(q1_Y)
q1a_B_n_A <- sum(q1_Y > q1_y & q1_X > q1_x)/length(q1_X)
q1a <- q1a_B_n_A/q1_Y_gt_y
\(P(X>x|Y>y) =\) 0.2. In other words, given that Y is above its 1st quartile, there is a 20% chance that X is above its 3rd quartile.
\(P(A, B) = P(A\text{ and }B)\)
\(P(X>x\text{ and }Y>y)\)
q1b <- sum(q1_X > q1_x & q1_Y > q1_y)/length(q1_X)
\(P(X>x, Y>y) =\) 0.15. In other words, 15% of the observations have X above its 3rd quartile and Y above its 1st quartile at the same time.
\(P(A|B) = \frac{P(B\text{ and }A)}{P(B)}\)
\(\frac{P(Y>y\text{ and }X<x)}{P(Y>y)}\)
q1c_B_n_A <- sum(q1_Y > q1_y & q1_X < q1_x)/length(q1_X)
q1c <- q1c_B_n_A/q1_Y_gt_y
\(P(X<x|Y>y) =\) 0.8. In other words, given that Y is above its 1st quartile, there is an 80% chance that X is below its 3rd quartile.
In addition, make a table of counts as shown below.
# Counts split at the 1st quartile of Y (rows) and the 3rd quartile of X (columns)
loX_loY <- sum(q1_X <= q1_x & q1_Y <= q1_y); hiX_loY <- sum(q1_X > q1_x & q1_Y <= q1_y)
loX_hiY <- sum(q1_X <= q1_x & q1_Y > q1_y); hiX_hiY <- sum(q1_X > q1_x & q1_Y > q1_y)
q1_df <- data.frame("Y/X" = c("<= 1st quartile of Y", "> 1st quartile of Y", "Total"),
                    "<= 3rd quartile of X" = c(loX_loY, loX_hiY, loX_loY + loX_hiY),
                    "> 3rd quartile of X" = c(hiX_loY, hiX_hiY, hiX_loY + hiX_hiY),
                    "Total" = c(loX_loY + hiX_loY, loX_hiY + hiX_hiY, length(q1_X)),
                    check.names = FALSE)
knitr::kable(q1_df, col.names=c("Y/X", "<= 3rd quartile of X", "> 3rd quartile of X", "Total"))
| Y/X | <= 3rd quartile of X | > 3rd quartile of X | Total |
|---|---|---|---|
| <= 1st quartile of Y | 3 | 2 | 5 |
| > 1st quartile of Y | 12 | 3 | 15 |
| Total | 15 | 5 | 20 |
Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 1st quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y. Does P(AB)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.
q1_x <- quantile(q1_X)[[2]]
q1_y <- quantile(q1_Y)[[2]]
q1_AB <- sum(q1_X>q1_x & q1_Y>q1_y)/length(q1_X)
\(P(AB) =\) 0.5.
q1_AxB <- sum(q1_X>q1_x)/length(q1_X) * sum(q1_Y>q1_y)/length(q1_Y)
\(P(A)\times P(B) =\) 0.5625.
\(P(AB) \neq P(A)\times P(B)\)
chisq.test(x=q1_X, y=q1_Y)
##
## Pearson's Chi-squared test
##
## data: q1_X and q1_Y
## X-squared = 290, df = 272, p-value = 0.2166
Although the sample proportions are not exactly equal (0.5 versus 0.5625), the p-value (0.2166) is greater than the significance level (0.05), so we fail to reject the null hypothesis of independence; the test provides no evidence of an association between the variables, and the difference is small enough to be attributable to chance.
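Note that chisq.test() applied to the raw X and Y vectors treats every distinct value as its own category (hence the 272 degrees of freedom above). A test closer to the spirit of the question would use the 2 x 2 table of the dichotomized indicators A and B; a minimal sketch, reusing the quartile cutoffs defined just above:
chisq.test(table(q1_X > q1_x, q1_Y > q1_y)) # association between the two above-quartile indicators
With only 20 observations the expected cell counts are small, so chisq.test() will likely warn that the chi-squared approximation may be inaccurate; fisher.test() on the same table is an exact alternative.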
Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot matrix for at least two of the independent variables and the dependent variable. Derive a correlation matrix for any THREE quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide a 80% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?
train_dim <- dim(train)
train_col_types <- table(sapply(train, class))
The training set contains 1460 observations across 81 variables. Of the variables, 43 are character (string) and 38 are integer. About 5.89% of the individual values are missing, while 94.11% are present.
| Variable | Missing Values % |
|---|---|
| PoolQC | 99.5205479 |
| MiscFeature | 96.3013699 |
| Alley | 93.7671233 |
| Fence | 80.7534247 |
| FireplaceQu | 47.2602740 |
| LotFrontage | 17.7397260 |
| GarageType | 5.5479452 |
| GarageYrBlt | 5.5479452 |
| GarageFinish | 5.5479452 |
| GarageQual | 5.5479452 |
| GarageCond | 5.5479452 |
| BsmtExposure | 2.6027397 |
| BsmtFinType2 | 2.6027397 |
| BsmtQual | 2.5342466 |
| BsmtCond | 2.5342466 |
| BsmtFinType1 | 2.5342466 |
| MasVnrType | 0.5479452 |
| MasVnrArea | 0.5479452 |
| Electrical | 0.0684932 |
The table above lists the variables with missing data, sorted by the percentage of values missing for each variable.
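For reference, a table like the one above can be generated along these lines (a sketch, assuming the train data frame is already loaded):
missing_pct <- sort(colMeans(is.na(train)) * 100, decreasing = TRUE) # percent of values missing per variable
missing_pct <- missing_pct[missing_pct > 0] # keep only variables with at least one missing value
knitr::kable(data.frame(Variable = names(missing_pct), "Missing Values %" = unname(missing_pct), check.names = FALSE))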
We’ll examine three variables in particular, all of which are fully populated: the sale price of each house, the area of the lots, and the year the houses were built.
ggplot(train, aes(x=SalePrice)) +
geom_histogram(bins=100, color="darkblue", fill="lightblue") +
xlab("Sale Price") +
ggtitle("Houses Sold by Sale Price") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10))
The sales price of each house is a dependent variable, as it relies on many factors and changing any number of them can affect the value of a house. As can be seen above, the majority of the houses were in the $100,000 to $300,000 range, with a minimum price of $34,900 and a maximum price of $755,000. On average, homes sold for $180,921.2 and the median home sale price was $163,000.
ggplot(train, aes(x=LotArea)) +
geom_histogram(bins=100, color="darkblue", fill="lightblue") +
xlab("Area of Each Lot (sqft)") +
ggtitle("Houses Sold by Lot Area (sqft)") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10))
The area of each house’s lot is an independent variable, as it relies on few if any other factors. As can be seen above, the majority of the houses were in the 0 to 20,000 ft² range, with a minimum area of 1,300 ft² and a maximum area of 215,245 ft². On average, homes sold were on a 10,516.83 ft² lot and the median home lot area was 9,478.5 ft². To get a better graphical sense of the lot areas, the x-axis can be limited to 40,000 ft² or less.
ggplot(train, aes(x=LotArea)) +
geom_histogram(bins=100, color="darkblue", fill="lightblue") +
xlab("Area of Each Lot (sqft)") +
ggtitle("Houses Sold by Lot Area (sqft)") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10), limits=c(0, 40000))
With this new visual representation of the area of each lot in ft², it can be seen that the majority of the houses sold had a lot area between 5,000 and 15,000 ft².
ggplot(train, aes(x=YearBuilt)) +
geom_histogram(bins=100, color="darkblue", fill="lightblue") +
xlab("Each House's Year Built") +
ggtitle("Houses Sold by Year Built") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10), limits=c(1870,2020))
The year each house was built is an independent variable, as it relies on few if any other factors. As can be seen above, the majority of the houses were built between 1950 and 2010, with the earliest built in 1872 and the latest built in 2010. On average, homes were built in 1971 and the median year built was 1973.
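The minimums, maximums, medians, and means quoted above can all be pulled in a single call; a quick sketch:
summary(train[c("SalePrice", "LotArea", "YearBuilt")]) # univariate descriptive statistics for the three variables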
pairs(train[c("SalePrice", "LotArea", "YearBuilt")])
Because the year each house was built is effectively a categorical variable, the square footage of the first floor of each home was chosen instead for the correlation matrix.
corr <- round(cor(train[c("SalePrice", "LotArea", "X1stFlrSF")]), 3)
ggcorrplot(corr, method="square", lab=TRUE)
As can be seen above, there is at least a modest positive correlation between each pair of variables, with the strongest correlation existing between the price the house sold for and the square footage of the first floor of the house.
# Run each pairwise test once, then extract the 80% confidence interval and p-value
pairwise_ci <- function(a, b) {
  ct <- cor.test(a, b, method = "pearson", conf.level = 0.80)
  data.frame(Lower = ct$conf.int[1], Upper = ct$conf.int[2],
             p = format(ct$p.value, scientific = TRUE))
}
confidence_interval <- cbind(Variables = c("SalePrice x X1stFlrSF", "SalePrice x LotArea", "X1stFlrSF x LotArea"),
                             rbind(pairwise_ci(train$SalePrice, train$X1stFlrSF),
                                   pairwise_ci(train$SalePrice, train$LotArea),
                                   pairwise_ci(train$X1stFlrSF, train$LotArea)))
knitr::kable(confidence_interval, col.names = c("Variables", "80% Confidence Interval Lower", "80% Confidence Interval Upper", "p-Value"), align = "c")
| Variables | 80% Confidence Interval Lower | 80% Confidence Interval Upper | p-Value |
|---|---|---|---|
| SalePrice x X1stFlrSF | 0.5841687 | 0.6266715 | 5.394711e-147 |
| SalePrice x LotArea | 0.2323391 | 0.2947946 | 1.123139e-24 |
| X1stFlrSF x LotArea | 0.2686127 | 0.3297222 | 1.234238e-31 |
Based on the above, each interval is a range that we can be 80% confident contains the true pairwise correlation, and each is consistent with the point estimates from the correlation matrix created previously. The intervals indicate a stronger correlation between the sale price of the houses and the first floor’s square footage, and a weaker (though still positive) correlation when the area of the house’s lot is paired with either the sale price or the first floor’s square footage. Since none of the intervals comes anywhere near zero and the p-values are tiny, the null hypothesis - that the correlation between each pairwise set of variables is 0 - can be safely rejected. Familywise error is not a major concern here: only three tests were run, and the p-values are infinitesimal, falling far below 0.001 (1e-03).
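As a rough check on that last claim: with three tests each run at \(\alpha = 0.05\), the familywise error rate could be as high as \(1 - (1 - 0.05)^3 \approx 0.143\) if no adjustment were made, and a Bonferroni correction would compare each p-value to \(\alpha / 3 \approx 0.0167\). The observed p-values (on the order of \(10^{-24}\) and smaller) clear even that adjusted threshold by many orders of magnitude, so the conclusions are unchanged.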
Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.
precision_corr <- solve(corr)
knitr::kable(precision_corr, caption="Precision Matrix")
| SalePrice | LotArea | X1stFlrSF | |
|---|---|---|---|
| SalePrice | 1.5994012 | -0.1454427 | -0.9257498 |
| LotArea | -0.1454427 | 1.1114042 | -0.2441715 |
| X1stFlrSF | -0.9257498 | -0.2441715 | 1.6340117 |
corr_x_prec <- corr %*% precision_corr
knitr::kable(corr_x_prec, caption="Correlation Matrix multiplied by the Precision Matrix")
| SalePrice | LotArea | X1stFlrSF | |
|---|---|---|---|
| SalePrice | 1 | 0 | 0 |
| LotArea | 0 | 1 | 0 |
| X1stFlrSF | 0 | 0 | 1 |
prec_x_corr <- precision_corr %*% corr
knitr::kable(prec_x_corr, caption="Precision Matrix multiplied by the Correlation Matrix")
| SalePrice | LotArea | X1stFlrSF | |
|---|---|---|---|
| SalePrice | 1 | 0 | 0 |
| LotArea | 0 | 1 | 0 |
| X1stFlrSF | 0 | 0 | 1 |
# lu() and expand() come from the Matrix package (assumed to be loaded in the setup chunk)
lu_decomp <- expand(lu(precision_corr))
L <- as.matrix(lu_decomp[["L"]])
rownames(L) <- c("SalePrice", "LotArea", "X1stFlrSF")
colnames(L) <- c("SalePrice", "LotArea", "X1stFlrSF")
knitr::kable(L, caption="Lower Triangular Decomposition", align="c")
| SalePrice | LotArea | X1stFlrSF | |
|---|---|---|---|
| SalePrice | 1.0000000 | 0.000 | 0 |
| LotArea | -0.0909357 | 1.000 | 0 |
| X1stFlrSF | -0.5788102 | -0.299 | 1 |
U <- as.matrix(lu_decomp[["U"]]) # upper-triangular factor from the same decomposition
rownames(U) <- c("SalePrice", "LotArea", "X1stFlrSF")
colnames(U) <- c("SalePrice", "LotArea", "X1stFlrSF")
knitr::kable(U, caption="Upper Triangular Decomposition", align="c")
| SalePrice | LotArea | X1stFlrSF | |
|---|---|---|---|
| SalePrice | 1.599401 | -0.1454427 | -0.9257498 |
| LotArea | 0.000000 | 1.0981782 | -0.3283553 |
| X1stFlrSF | 0.000000 | 0.0000000 | 1.0000000 |
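As a quick sanity check (a sketch), the two factors should multiply back to the precision matrix. expand() also returns a permutation matrix P with precision_corr = P %*% L %*% U; here P appears to be the identity, so L %*% U alone should recover the precision matrix up to floating-point error.
max(abs(L %*% U - precision_corr)) # expected to be effectively zero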
Many times, it makes sense to fit a closed form distribution to data. Select a variable in the Kaggle.com training dataset that is skewed to the right, shift it so that the minimum value is absolutely above zero if necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.
To determine which variable is skewed to the right, each one first needs to be viewed as a histogram. 100 bins were used for each variable to show as much detail as possible. Below are histograms of all variables containing numeric values - this includes some qualitative data that happens to be coded numerically.
train_integers <- train[sapply(train, is.numeric)]
# One histogram per numeric column, labelled with the column name
train_hist_plots <- lapply(names(train_integers), function(col)
  ggplot(train_integers, aes(x = .data[[col]])) + geom_histogram(bins=100, color="darkblue", fill="lightblue") + labs(x = col))
grid.arrange(grobs=train_hist_plots[2:10])
grid.arrange(grobs=train_hist_plots[11:19])
grid.arrange(grobs=train_hist_plots[20:28])
grid.arrange(grobs=train_hist_plots[29:38])
From this selection, the sale price of each house will be used as the right-skewed variable. Since the minimum value ($34,900) is already well above zero, the data does not need to be shifted.
library(MASS)
saleprice_fitdistr <- fitdistr(train_integers$SalePrice, densfun="exponential")
lambda <- saleprice_fitdistr$estimate
saleprice_samples <- data.frame(table(round(rexp(1000, lambda))))
names(saleprice_samples) <- c("Samples", "Frequency")
saleprice_samples$Samples <- as.numeric(levels(saleprice_samples$Samples))[saleprice_samples$Samples]
saleprice_samples$Frequency <- as.numeric(levels(saleprice_samples$Frequency))[saleprice_samples$Frequency]
ggplot(saleprice_samples, aes(x=Samples)) +
geom_histogram(bins=100, color="darkblue", fill="lightblue") +
xlab("Sample Sale Price") +
ggtitle("Sample Houses Sold by Sale Price") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10))
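Before comparing the histograms, note that the exponential fit has a closed form: the maximum-likelihood estimate is \(\hat{\lambda} = 1/\bar{x} = 1/180{,}921.2 \approx 5.53 \times 10^{-6}\), so the fitted distribution has a mean (and standard deviation) equal to the sample mean of roughly $180,921.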
The sample offers a more severe right skew of sale prices than the actual data: the observed prices rise steadily to a peak before tapering off in a long right tail, whereas the sample’s peak sits right at the beginning of the range. In a broader comparison, about 70.2% of the observed homes sold for less than $200,000, compared with 64.5% of the sample homes. For a clearer picture, both histograms can be overlaid.
ggplot() +
geom_histogram(data=saleprice_samples, aes(x=Samples), bins=100, color="maroon", fill="pink", alpha=0.3) +
geom_histogram(data=train_integers, aes(x=SalePrice), bins=100, color="darkblue", fill="lightblue", alpha=0.3) +
labs(x="Houses Sold") +
ggtitle("Houses Sold: Blue, Actual; Red, Sample") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 20), limits=c(0,1310000)) +
theme(axis.text.x = element_text(angle=45))
The above shows the sample sale prices in red and the observed sale prices in blue. What is important to remember is that there are 1,000 samples, and 1,460 actual observations of houses sold; when considered as a percentage, it is clear that the majority of the samples are below $200,000 in price. Additionally, the distribution is arguably more normal in the observations.
percentiles <- quantile(saleprice_samples$Samples, c(0.05, 0.95))
The 5th percentile of the samples is $9,512.7, and the 95th percentile of the samples is $564,488.8.
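The question also asks for the 5th and 95th percentiles from the exponential CDF itself, which can be computed directly from the fitted rate rather than from the random sample; a minimal sketch:
qexp(c(0.05, 0.95), rate = lambda) # theoretical percentiles via the inverse CDF
Analytically these are \(-\ln(0.95)/\lambda \approx 0.051/\lambda\) and \(-\ln(0.05)/\lambda \approx 3.0/\lambda\), or roughly $9,300 and $542,000 with \(\lambda \approx 5.53 \times 10^{-6}\) - broadly consistent with the sample-based estimates above.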
saleprice_confidence_interval <- c(t.test(train$SalePrice, conf.level = 0.95)$conf.int[1], t.test(train$SalePrice, conf.level = 0.95)$conf.int[2])
saleprice_percentiles <- quantile(train$SalePrice, c(0.05, 0.95))
samples_confidence_interval <- c(t.test(saleprice_samples$Samples, conf.level = 0.95)$conf.int[1], t.test(saleprice_samples$Samples, conf.level = 0.95)$conf.int[2])
The empirical data has a 95% confidence interval for the mean of [$176,842.8, $184,999.6], a 5th percentile of $88,000, and a 95th percentile of $326,100.
The mean of the samples ($186,345) does not fall within the confidence interval of the empirical data ([$176,842.8, $184,999.6]), although the mean of the empirical data ($180,921) does fall within the much wider confidence interval of the samples ([$175,068.6, $197,621.5]). If the two sets were treated as draws from the same distribution, this mismatch would suggest a Type I error; more plainly, the exponential sample is far more spread out than the empirical data, so the generated sample may not be reflective of it.
Build some type of multiple regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.
The following adjusted R² values are obtained when each variable is used on its own in a linear model predicting SalePrice.
summaries <- lapply(train, function(x) summary(lm(SalePrice~x, data=train)))[-81]
r2 <- lapply(summaries, function(x) x$adj.r.squared)
knitr::kable(sort(unlist(r2), decreasing=TRUE), col.names = c("R-Squared"))
| R-Squared | |
|---|---|
| OverallQual | 0.6253951 |
| Neighborhood | 0.5379749 |
| GrLivArea | 0.5018072 |
| ExterQual | 0.4763110 |
| KitchenQual | 0.4554790 |
| BsmtQual | 0.4526012 |
| GarageCars | 0.4097194 |
| GarageArea | 0.3882475 |
| TotalBsmtSF | 0.3760534 |
| X1stFlrSF | 0.3666228 |
| FullBath | 0.3138736 |
| TotRmsAbvGrd | 0.2843699 |
| Alley | 0.2774686 |
| YearBuilt | 0.2729233 |
| GarageFinish | 0.2662114 |
| YearRemodAdd | 0.2566419 |
| Foundation | 0.2538112 |
| GarageYrBlt | 0.2359932 |
| MasVnrArea | 0.2274672 |
| Fireplaces | 0.2174862 |
| GarageType | 0.2037493 |
| HeatingQC | 0.1932888 |
| BsmtFinType1 | 0.1897486 |
| MasVnrType | 0.1862109 |
| PoolQC | 0.1729771 |
| BsmtFinSF1 | 0.1487368 |
| Exterior2nd | 0.1450400 |
| Exterior1st | 0.1445647 |
| SaleType | 0.1325310 |
| SaleCondition | 0.1325246 |
| BsmtExposure | 0.1285758 |
| LotFrontage | 0.1230318 |
| FireplaceQu | 0.1085047 |
| MSZoning | 0.1051062 |
| WoodDeckSF | 0.1046304 |
| X2ndFlrSF | 0.1013581 |
| OpenPorchSF | 0.0991477 |
| HouseStyle | 0.0819078 |
| HalfBath | 0.0800867 |
| LotShape | 0.0744726 |
| LotArea | 0.0689752 |
| CentralAir | 0.0625233 |
| MiscFeature | 0.0614813 |
| Electrical | 0.0570862 |
| RoofStyle | 0.0544562 |
| PavedDrive | 0.0532419 |
| BsmtFullBath | 0.0509340 |
| BsmtUnfSF | 0.0453470 |
| Fence | 0.0405849 |
| BldgType | 0.0318798 |
| BedroomAbvGr | 0.0276292 |
| Condition1 | 0.0272971 |
| BsmtCond | 0.0267552 |
| RoofMatl | 0.0267436 |
| GarageQual | 0.0242739 |
| GarageCond | 0.0241931 |
| LandContour | 0.0237868 |
| ExterCond | 0.0209334 |
| LotConfig | 0.0183280 |
| KitchenAbvGr | 0.0177976 |
| EnclosedPorch | 0.0158578 |
| Functional | 0.0124191 |
| ScreenPorch | 0.0117430 |
| Heating | 0.0110480 |
| PoolArea | 0.0078584 |
| MSSubClass | 0.0064228 |
| BsmtFinType2 | 0.0059547 |
| OverallCond | 0.0053798 |
| Condition2 | 0.0051259 |
| MoSold | 0.0014716 |
| LandSlope | 0.0013126 |
| X3SsnPorch | 0.0013032 |
| Street | 0.0009992 |
| YrSold | 0.0001512 |
| LowQualFinSF | -0.0000297 |
| Id | -0.0002052 |
| MiscVal | -0.0002366 |
| BsmtHalfBath | -0.0004020 |
| Utilities | -0.0004808 |
| BsmtFinSF2 | -0.0005563 |
As can be observed, the three most influential individual variables in determining the sale price were the overall quality of the house, the neighborhood it is located in, and the above-ground living area in square feet.
model_1 <- lm(SalePrice ~ OverallQual+Neighborhood+GrLivArea, data=train)
summary(model_1)
##
## Call:
## lm(formula = SalePrice ~ OverallQual + Neighborhood + GrLivArea,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -310337 -17569 -444 16022 275856
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -34829.240 11541.232 -3.018 0.002591 **
## OverallQual 20951.425 1162.274 18.026 < 0.0000000000000002 ***
## NeighborhoodBlueste -30752.876 27697.270 -1.110 0.267047
## NeighborhoodBrDale -43358.881 12978.523 -3.341 0.000857 ***
## NeighborhoodBrkSide -13025.453 10450.445 -1.246 0.212821
## NeighborhoodClearCr 24575.635 11569.626 2.124 0.033828 *
## NeighborhoodCollgCr 11414.309 9496.582 1.202 0.229586
## NeighborhoodCrawfor 14444.250 10502.215 1.375 0.169237
## NeighborhoodEdwards -17842.951 9985.734 -1.787 0.074174 .
## NeighborhoodGilbert -892.880 9954.350 -0.090 0.928540
## NeighborhoodIDOTRR -28178.987 11135.584 -2.531 0.011495 *
## NeighborhoodMeadowV -19099.020 12999.351 -1.469 0.141990
## NeighborhoodMitchel 2030.612 10555.017 0.192 0.847469
## NeighborhoodNAmes -4430.099 9517.291 -0.465 0.641659
## NeighborhoodNoRidge 64642.989 10939.818 5.909 0.00000000429492 ***
## NeighborhoodNPkVill -17807.184 15304.007 -1.164 0.244795
## NeighborhoodNridgHt 71587.840 9994.775 7.163 0.00000000000126 ***
## NeighborhoodNWAmes -4720.658 10079.515 -0.468 0.639611
## NeighborhoodOldTown -32080.877 9863.373 -3.253 0.001170 **
## NeighborhoodSawyer -1219.381 10211.495 -0.119 0.904965
## NeighborhoodSawyerW 303.099 10264.232 0.030 0.976446
## NeighborhoodSomerst 17766.964 9829.881 1.807 0.070903 .
## NeighborhoodStoneBr 69954.472 11689.508 5.984 0.00000000274014 ***
## NeighborhoodSWISU -36640.152 11923.592 -3.073 0.002160 **
## NeighborhoodTimber 29905.811 10829.233 2.762 0.005826 **
## NeighborhoodVeenker 47106.893 14337.857 3.285 0.001043 **
## GrLivArea 55.565 2.499 22.237 < 0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 37010 on 1433 degrees of freedom
## Multiple R-squared: 0.7868, Adjusted R-squared: 0.783
## F-statistic: 203.5 on 26 and 1433 DF, p-value: < 0.00000000000000022
Based on the adjusted R², this model isn’t a bad choice. Due to the length of output of these models, going forward only the adjusted R² will be shown.
pred <- predict(model_1, test, type = "response")
pred <- data.frame(Id=seq(1461,2919,1), SalePrice=pred)
pred_test <- pred[order(pred["SalePrice"]),][1:10,]
rownames(pred_test) <- NULL
pred_test
## Id SalePrice
## 1 2099 -2392.516
## 2 2217 -1328.023
## 3 1788 8275.868
## 4 1601 8899.453
## 5 1815 8998.206
## 6 2872 16901.780
## 7 1537 21222.396
## 8 2786 23281.891
## 9 1916 23901.868
## 10 2579 25846.626
This first model predicts negative sale prices for some houses, so it may not be the best model available.
ggplot() +
geom_histogram(data=train, aes(x=SalePrice), bins=100, color="darkblue", fill="lightblue", alpha=0.3) +
geom_histogram(data=pred, aes(x=SalePrice), bins=100, color="darkgreen", fill="green", alpha=0.3) +
xlab("Sale Price") +
ggtitle("Houses Sold by Sale Price, Model 1: Blue, Actual; Green, Predicted") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10))
Despite this, the predictions mostly line up with the training set’s distribution of sale prices - but how does a multiplicative (interaction) model using the same variables compare?
model_2 <- lm(SalePrice ~ OverallQual*Neighborhood*GrLivArea, data=train)
# summary(model_2)
paste("Adjusted r-squared: ", summary(model_2)$adj.r.squared)
## [1] "Adjusted r-squared: 0.863226334511322"
Judged by its adjusted R² alone, the multiplicative model looks clearly better.
pred <- predict(model_2, test, type = "response")
pred <- data.frame(Id=seq(1461,2919,1), SalePrice=pred)
pred_test <- pred[order(pred["SalePrice"]),][1:10,]
rownames(pred_test) <- NULL
pred_test
## Id SalePrice
## 1 2872 22692.47
## 2 2217 32189.03
## 3 1601 36946.28
## 4 1916 48372.07
## 5 2905 48942.27
## 6 2579 49853.19
## 7 1586 55664.33
## 8 2892 57862.25
## 9 1848 59933.63
## 10 1823 60852.98
This second model has absolutely no negative values, which is a good thing.
ggplot() +
geom_histogram(data=train, aes(x=SalePrice), bins=100, color="darkblue", fill="lightblue", alpha=0.3) +
geom_histogram(data=pred, aes(x=SalePrice), bins=100, color="darkgreen", fill="green", alpha=0.3) +
xlab("Sale Price") +
ggtitle("Houses Sold by Sale Price, Model 2: Blue, Actual; Green, Predicted") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10))
The histograms match less closely than they did for the previous additive model, so the multiplicative model might not be the right choice.
Adding variables to a linear model never decreases the multiple R²; the adjusted R², by contrast, penalizes each added term and only increases when the new variables add real explanatory power. Keeping that in mind, adding the exterior and kitchen quality of the house might improve the results.
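For reference, the adjustment is \(R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}\), where \(n\) is the number of observations and \(p\) the number of predictors; the \(n - p - 1\) denominator is what penalizes a model for carrying extra terms.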
model_3 <- lm(SalePrice ~ OverallQual+Neighborhood+GrLivArea+ExterQual+KitchenQual, data=train)
# summary(model_3)
paste("Adjusted r-squared: ", summary(model_3)$adj.r.squared)
## [1] "Adjusted r-squared: 0.809381027150161"
Adding the two quality variables produced an adjusted R² larger than that of the first additive model, though still below the multiplicative model’s.
pred <- predict(model_3, test, type = "response")
pred <- data.frame(Id=seq(1461,2919,1), SalePrice=pred)
pred_test <- pred[order(pred["SalePrice"]),][1:10,]
rownames(pred_test) <- NULL
pred_test
## Id SalePrice
## 1 2217 14167.88
## 2 2099 17147.35
## 3 1601 27029.62
## 4 1815 27625.55
## 5 2872 33200.05
## 6 2786 36735.46
## 7 1788 36816.97
## 8 1537 38870.45
## 9 1916 40261.52
## 10 2579 42619.15
The predicted values are all greater than 0, which is a good sign for the linear model’s validity.
ggplot() +
geom_histogram(data=train, aes(x=SalePrice), bins=100, color="darkblue", fill="lightblue", alpha=0.3) +
geom_histogram(data=pred, aes(x=SalePrice), bins=100, color="darkgreen", fill="green", alpha=0.3) +
xlab("Sale Price") +
ggtitle("Houses Sold by Sale Price, Model 3: Blue, Actual; Green, Predicted") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10))
This third model has a histogram that appears very similar to the training set’s.
model_4 <- lm(SalePrice ~ OverallQual*Neighborhood*GrLivArea*ExterQual*KitchenQual, data=train)
# summary(model_4)
paste("Adjusted r-squared: ", summary(model_4)$adj.r.squared)
## [1] "Adjusted r-squared: 0.884002939992005"
Lastly, for completeness, this fourth model is the multiplicative version of the five-variable model used in the third; its adjusted R² is the highest of the four.
pred <- predict(model_4, test, type = "response")
pred <- data.frame(Id=seq(1461,2919,1), SalePrice=pred)
pred_test <- pred[order(pred["SalePrice"]),][1:10,]
rownames(pred_test) <- NULL
pred_test
## Id SalePrice
## 1 2264 -3761702.4
## 2 2503 -2903825.4
## 3 2618 -2875683.2
## 4 2690 -2109016.0
## 5 2279 -1965067.2
## 6 2529 -1354496.5
## 7 2284 -1048762.1
## 8 1947 -944189.3
## 9 2193 -903461.6
## 10 2883 -781362.1
There is an issue with this model: 41 of its predicted sale prices are negative. Nobody in their right mind would pay someone to take their home. My instinct says this model is no good, and the histogram overlay shows the same.
ggplot() +
geom_histogram(data=train, aes(x=SalePrice), bins=100, color="darkblue", fill="lightblue", alpha=0.3) +
geom_histogram(data=pred, aes(x=SalePrice), bins=100, color="darkgreen", fill="green", alpha=0.3) +
xlab("Sale Price") +
ggtitle("Houses Sold by Sale Price, Model 4: Blue, Actual; Green, Predicted") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 10))
Increasing the number of variables increased the adjusted R², but that does not necessarily mean the model is more accurate. In order to see how these models hold up, the resulting predictions needed to be submitted to Kaggle. As a note, all models have an overall F-test p-value less than 0.00000000000000022.
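A sketch of how each prediction set could be written out in the two-column format Kaggle expects (the file name is illustrative, and any NA predictions would need to be filled in before submission):
# pred is whichever model's prediction data frame (Id, SalePrice) is being submitted
write.csv(pred, "model_predictions.csv", row.names = FALSE)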
My Kaggle user name is gbartomeo.
results <- data.frame(ModelNum = c(1, 2, 3, 4),
Score = c(0.44655, 0.19727, 0.17660, 2.02452),
R2 = c(summary(model_1)$adj.r.squared, summary(model_2)$adj.r.squared, summary(model_3)$adj.r.squared, summary(model_4)$adj.r.squared)
)
rownames(results) <- NULL
knitr::kable(results, col.names=c("Model Number", "Score", "Adjusted R²"), align="c")
| Model Number | Score | Adjusted R² |
|---|---|---|
| 1 | 0.44655 | 0.7829810 |
| 2 | 0.19727 | 0.8632263 |
| 3 | 0.17660 | 0.8093810 |
| 4 | 2.02452 | 0.8840029 |
My best score (0.1766; the closer to 0 the better) came from the third model, the additive model using five variables - the overall quality of the home, the neighborhood it is located in, the above-ground living area, the exterior quality of the home, and the quality of the home’s kitchen. The results suggest that a higher adjusted R² on the training data does not guarantee a better score for the predictions produced.