First, let us load the data and necessary packages:
load("~/Desktop/R Programming/Statistics_Coursera/Capstone/ames_train.Rdata")
library(MASS)
library(dplyr)
library(ggplot2)
library(ggthemes)
library(statsr)
library(BAS)
library(kableExtra)
Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.
ames_train %>%
  dplyr::select(Year.Built) %>%
  mutate(age = max(Year.Built) - Year.Built) %>%
  ggplot(aes(x = age)) +
  geom_histogram(bins = 30) +
  theme_fivethirtyeight() +
  labs(title = "Distribution of the ages of the houses",
       y = "Number of houses",
       x = "Age of houses")
The distribution of house ages is right-skewed and multimodal: most houses in the data set are relatively new, with progressively fewer houses at older ages.
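A quick numeric check supports this reading (a sketch, not part of the original output): for a right-skewed distribution, the mean age should exceed the median age.

ames_train %>%
  mutate(age = max(Year.Built) - Year.Built) %>%
  summarise(median_age = median(age),
            mean_age = mean(age))   # mean > median is consistent with right skew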
The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.
location_price <- ames_train %>%
  dplyr::select(Neighborhood, price)
price_summary <- location_price %>%
  group_by(Neighborhood) %>%
  summarise(min_price = min(price),
            max_price = max(price),
            mean_price = mean(price),
            median_price = median(price),
            IQR_price = IQR(price),
            sd_price = sd(price))
location_price %>%
  ggplot(aes(x = reorder(Neighborhood, price, FUN = median), y = price)) +
  geom_boxplot() +
  theme_fivethirtyeight() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Distribution of Home Prices by Neighborhood",
       y = "Home Price",
       x = "Neighborhood")
ex_neighborhood <- price_summary[which.max(price_summary$median_price), ]
leastex_neighborhood <- price_summary[which.min(price_summary$median_price), ]
het_neighborhood <- price_summary[which.max(price_summary$sd_price), ]
ex_neighborhood %>%
  kbl(caption = "<b>Most expensive neighborhood</b>") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Neighborhood | min_price | max_price | mean_price | median_price | IQR_price | sd_price |
|---|---|---|---|---|---|---|
| StoneBr | 180000 | 591587 | 339316 | 340691.5 | 151358 | 123459.1 |
leastex_neighborhood %>%
  kbl(caption = "<b>Least expensive neighborhood</b>") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Neighborhood | min_price | max_price | mean_price | median_price | IQR_price | sd_price |
|---|---|---|---|---|---|---|
| MeadowV | 73000 | 129500 | 92946.88 | 85750 | 20150 | 18939.78 |
het_neighborhood %>%
  kbl(caption = "<b>Most heterogeneous neighborhood</b>") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Neighborhood | min_price | max_price | mean_price | median_price | IQR_price | sd_price |
|---|---|---|---|---|---|---|
| StoneBr | 180000 | 591587 | 339316 | 340691.5 | 151358 | 123459.1 |
The boxplot compares home prices by neighborhood, ordered by median price; the median is the most appropriate measure of a typical price because it is robust to the extreme sales visible in several neighborhoods, and the standard deviation is used to measure variation.
By median price, StoneBr is the most expensive neighborhood (median 340,691.50 USD) and MeadowV is the least expensive (median 85,750 USD). StoneBr is also the most heterogeneous, with the largest standard deviation of home prices (123,459.10 USD).
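Because the standard deviation is sensitive to a few extreme sales, a robust cross-check is to rank neighborhoods by IQR instead; a minimal sketch using the price_summary table built above:

price_summary %>%
  arrange(desc(IQR_price)) %>%          # largest spread first
  dplyr::select(Neighborhood, IQR_price, sd_price) %>%
  head(3)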
Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.
sum_na <- colSums(is.na(ames_train))
head(sort(sum_na, decreasing = TRUE))
## Pool.QC Misc.Feature Alley Fence Fireplace.Qu Lot.Frontage
## 997 971 933 798 491 167
summary(ames_train$Pool.QC)
## Ex Fa Gd TA NA's
## 1 1 1 0 997
Pool.QC, which records pool quality, has the largest number of missing values (997 of 1,000 observations). The summary above shows that only three homes have a rated pool; an NA here does not mean the quality is unknown but that the house simply has no pool, so the large number of missing values makes sense.
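These structural NAs could be made explicit rather than left as missing. A minimal sketch, assuming the forcats package is available (it is not loaded in the original analysis), working on a copy of the data:

library(forcats)
# Recode the structurally missing pool-quality values as an explicit "No Pool" level
ames_clean <- ames_train %>%
  mutate(Pool.QC = fct_explicit_na(Pool.QC, na_level = "No Pool"))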
We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.
q4_lm <- lm(log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr,
            data = ames_train)
model_AIC <- stepAIC(q4_lm, k = 2)
## Start: AIC=-2545.77
## log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add +
## Bedroom.AbvGr
##
## Df Sum of Sq RSS AIC
## <none> 77.322 -2545.8
## - Land.Slope 2 1.4281 78.750 -2531.5
## - Bedroom.AbvGr 1 5.0628 82.385 -2484.3
## - Lot.Area 1 6.7292 84.051 -2464.3
## - Year.Remod.Add 1 11.9642 89.286 -2403.9
## - Year.Built 1 19.8546 97.177 -2319.2
summary(q4_lm)
##
## Call:
## lm(formula = log(price) ~ Lot.Area + Land.Slope + Year.Built +
## Year.Remod.Add + Bedroom.AbvGr, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0878 -0.1651 -0.0211 0.1657 0.9945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.371e+01 8.574e-01 -15.996 < 2e-16 ***
## Lot.Area 1.028e-05 1.106e-06 9.296 < 2e-16 ***
## Land.SlopeMod 1.384e-01 4.991e-02 2.773 0.00565 **
## Land.SlopeSev -4.567e-01 1.514e-01 -3.016 0.00263 **
## Year.Built 6.049e-03 3.788e-04 15.968 < 2e-16 ***
## Year.Remod.Add 6.778e-03 5.468e-04 12.395 < 2e-16 ***
## Bedroom.AbvGr 8.686e-02 1.077e-02 8.063 2.12e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.279 on 993 degrees of freedom
## Multiple R-squared: 0.5625, Adjusted R-squared: 0.5598
## F-statistic: 212.8 on 6 and 993 DF, p-value: < 2.2e-16
A multiple linear regression model is fit, and backward stepwise selection with AIC (via stepAIC()) is used for model selection: starting from the full model, the procedure repeatedly drops the predictor whose removal most reduces the AIC, \(AIC = n\ln(RSS/n) + 2k\) (up to an additive constant) for a linear model with \(k\) estimated parameters, and stops when no single removal lowers the AIC further. The stepwise table shows that dropping any of the five candidate predictors would increase the AIC, so the full model is retained. The summary confirms that all retained variables are statistically significant, with \(Adj.R^2 = 0.5598\) and a residual standard error of 0.279.
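As a quick sanity check (a sketch, not in the original output), extractAIC() — the function stepAIC() uses internally — reproduces the AIC that heads the stepwise table:

extractAIC(q4_lm)   # returns c(edf, AIC); the AIC should match the -2545.77 above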
Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?
# Index of the home with the largest squared residual (equivalently, the largest |residual|)
row <- which.max(abs(resid(q4_lm)))
dplyr::select(ames_train[row, ], price, Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr)
## # A tibble: 1 x 6
## price Lot.Area Land.Slope Year.Built Year.Remod.Add Bedroom.AbvGr
## <int> <int> <fct> <int> <int> <int>
## 1 12789 9656 Gtl 1923 1970 2
paste("The predicted price for the home is", format(exp(predict(q4_lm, ames_train[428, ])), big.mark = ","))
## [1] "The predicted price for the home is 103,176.2"
ames_train[row, ]
## # A tibble: 1 x 81
## PID area price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street Alley
## <int> <int> <int> <int> <fct> <int> <int> <fct> <fct>
## 1 9.02e8 832 12789 30 RM 68 9656 Pave <NA>
## # … with 72 more variables: Lot.Shape <fct>, Land.Contour <fct>,
## # Utilities <fct>, Lot.Config <fct>, Land.Slope <fct>, Neighborhood <fct>,
## # Condition.1 <fct>, Condition.2 <fct>, Bldg.Type <fct>, House.Style <fct>,
## # Overall.Qual <int>, Overall.Cond <int>, Year.Built <int>,
## # Year.Remod.Add <int>, Roof.Style <fct>, Roof.Matl <fct>,
## # Exterior.1st <fct>, Exterior.2nd <fct>, Mas.Vnr.Type <fct>,
## # Mas.Vnr.Area <int>, Exter.Qual <fct>, Exter.Cond <fct>, Foundation <fct>,
## # Bsmt.Qual <fct>, Bsmt.Cond <fct>, Bsmt.Exposure <fct>,
## # BsmtFin.Type.1 <fct>, BsmtFin.SF.1 <int>, BsmtFin.Type.2 <fct>,
## # BsmtFin.SF.2 <int>, Bsmt.Unf.SF <int>, Total.Bsmt.SF <int>, Heating <fct>,
## # Heating.QC <fct>, Central.Air <fct>, Electrical <fct>, X1st.Flr.SF <int>,
## # X2nd.Flr.SF <int>, Low.Qual.Fin.SF <int>, Bsmt.Full.Bath <int>,
## # Bsmt.Half.Bath <int>, Full.Bath <int>, Half.Bath <int>,
## # Bedroom.AbvGr <int>, Kitchen.AbvGr <int>, Kitchen.Qual <fct>,
## # TotRms.AbvGrd <int>, Functional <fct>, Fireplaces <int>,
## # Fireplace.Qu <fct>, Garage.Type <fct>, Garage.Yr.Blt <int>,
## # Garage.Finish <fct>, Garage.Cars <int>, Garage.Area <int>,
## # Garage.Qual <fct>, Garage.Cond <fct>, Paved.Drive <fct>,
## # Wood.Deck.SF <int>, Open.Porch.SF <int>, Enclosed.Porch <int>,
## # X3Ssn.Porch <int>, Screen.Porch <int>, Pool.Area <int>, Pool.QC <fct>,
## # Fence <fct>, Misc.Feature <fct>, Misc.Val <int>, Mo.Sold <int>,
## # Yr.Sold <int>, Sale.Type <fct>, Sale.Condition <fct>
The home with PID 902207130 (row 428) has the largest squared residual: the model predicts a price of about 103,176 USD, but the actual sale price is only 12,789 USD. Several factors explain why it stands out: its sale condition is Abnormal (which typically signals a foreclosure or other non-arm's-length sale), and the house is small (832 square feet of living area) and old (built in 1923). Of these price-depressing factors, only age is among the model's predictors, so the model badly overpredicts the price.
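As a quick check (a sketch, not in the original output), this home's residual on the log scale should equal the minimum residual reported by summary(q4_lm):

resid(q4_lm)[row]       # should be about -2.0878, the Min residual in summary(q4_lm)
log(12789 / 103176.2)   # the same quantity computed from the two prices above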
Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?
ANS. After replacing Lot.Area with log(Lot.Area), the stepwise procedure arrives at a model with the same set of predictors. The fit improves, with \(Adj.R^2\) rising to 0.6032 (from 0.5598) and the residual standard error falling to 0.2649 (from 0.279), although Land.Slope is now only marginally useful: dropping it would raise the AIC only slightly, and the Land.SlopeSev coefficient is no longer significant.
q5_lm <- lm(log(price) ~ log(Lot.Area) + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr,
            data = ames_train)
model_AIC <- stepAIC(q5_lm , k = 2)
## Start: AIC=-2649.5
## log(price) ~ log(Lot.Area) + Land.Slope + Year.Built + Year.Remod.Add +
## Bedroom.AbvGr
##
## Df Sum of Sq RSS AIC
## <none> 69.704 -2649.5
## - Land.Slope 2 0.4429 70.147 -2647.2
## - Bedroom.AbvGr 1 2.2002 71.904 -2620.4
## - Year.Remod.Add 1 11.8182 81.522 -2494.9
## - log(Lot.Area) 1 14.3474 84.051 -2464.3
## - Year.Built 1 19.4083 89.112 -2405.9
summary(q5_lm)
##
## Call:
## lm(formula = log(price) ~ log(Lot.Area) + Land.Slope + Year.Built +
## Year.Remod.Add + Bedroom.AbvGr, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.14050 -0.15650 -0.01561 0.15350 0.90854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.553e+01 8.197e-01 -18.947 < 2e-16 ***
## log(Lot.Area) 2.442e-01 1.708e-02 14.297 < 2e-16 ***
## Land.SlopeMod 1.151e-01 4.734e-02 2.431 0.0152 *
## Land.SlopeSev -6.554e-02 1.222e-01 -0.536 0.5917
## Year.Built 5.981e-03 3.597e-04 16.628 < 2e-16 ***
## Year.Remod.Add 6.734e-03 5.190e-04 12.975 < 2e-16 ***
## Bedroom.AbvGr 5.909e-02 1.055e-02 5.599 2.8e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2649 on 993 degrees of freedom
## Multiple R-squared: 0.6056, Adjusted R-squared: 0.6032
## F-statistic: 254.1 on 6 and 993 DF, p-value: < 2.2e-16
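The two fitted models can also be compared head-to-head (a sketch, not in the original output; AIC() uses a different additive constant than extractAIC()/stepAIC(), but the model ranking is the same):

AIC(q4_lm, q5_lm)   # the log(Lot.Area) model should have the lower AIC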
Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide a written support that includes a quantitative justification for your answer in the first part of question 7.
# Build data frames of predicted vs. actual prices for both models,
# back-transforming the predictions from the log scale to dollars
true_values <- ames_train$price
pred_log <- data.frame(pred = exp(predict(q5_lm, newdata = ames_train)),
                       true.value = true_values)
pred_nolog <- data.frame(pred = exp(predict(q4_lm, newdata = ames_train)),
                         true.value = true_values)
p1 <- ggplot(pred_log, aes(x = pred, y = true.value)) +
  geom_point() +
  theme_fivethirtyeight() +
  labs(title = "Model with log transformation", x = "Predicted price", y = "Actual price") +
  geom_smooth(method = "lm", linetype = "dashed", se = FALSE) +
  geom_abline(intercept = 0, slope = 1)
p2 <- ggplot(pred_nolog, aes(x = pred, y = true.value)) +
  geom_point() +
  theme_fivethirtyeight() +
  labs(title = "Model without log transformation", x = "Predicted price", y = "Actual price") +
  geom_smooth(method = "lm", linetype = "dashed", se = FALSE) +
  geom_abline(intercept = 0, slope = 1)
p1
## `geom_smooth()` using formula 'y ~ x'
p2
## `geom_smooth()` using formula 'y ~ x'
In the log-transformed model, the points scatter evenly around the identity line across the whole price range, while the untransformed model shows an uneven, nonrandom pattern, with predictions fanning out at higher prices.
This non-constant spread of residuals in the model without the log transformation indicates heteroscedasticity, which violates the constant-variance assumption of linear regression and weakens the model's predictive ability.
It is therefore better to log-transform Lot.Area: the log model has a higher \(Adj.R^2\) (0.6032 vs. 0.5598) and a lower residual standard error (0.2649 vs. 0.279), and its more even residual scatter indicates that the constant-variance assumption is satisfied.
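A more direct diagnostic for the constant-variance assumption (a sketch, not part of the original analysis) is to plot residuals against fitted values for both models; the fan shape should be visible only in the untransformed model:

par(mfrow = c(1, 2))
plot(fitted(q4_lm), resid(q4_lm),
     main = "Lot.Area", xlab = "Fitted log(price)", ylab = "Residuals")
abline(h = 0, lty = 2)
plot(fitted(q5_lm), resid(q5_lm),
     main = "log(Lot.Area)", xlab = "Fitted log(price)", ylab = "Residuals")
abline(h = 0, lty = 2)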