First, let us load the data and necessary packages:
load("ames_train.Rdata")
library(MASS)
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.3
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.3
library(devtools)
## Warning: package 'devtools' was built under R version 3.3.3
library(statsr)
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.3.3
library(tidyr)
## Warning: package 'tidyr' was built under R version 3.3.3
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.3.3
Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.
# type your code for Question 1 here, and Knit
## House age in years, computed from the construction year
house_age <- year(today()) - ames_train$Year.Built
ggplot(data = ames_train, aes(x = year(today()) - Year.Built)) +
  geom_histogram(bins = 30, fill = "blue", colour = "white") +
  labs(title = "House counts by age", x = "House age", y = "House count") +
  ## Mark the median age in red
  geom_vline(xintercept = median(house_age), colour = "red") +
  annotate("text", x = median(house_age) - 4, y = c(100, -10),
           label = c("Median", median(house_age)),
           colour = "red", angle = c(90, 0)) +
  ## Mark the mean age in green
  geom_vline(xintercept = mean(house_age), colour = "#41c42f") +
  annotate("text", x = mean(house_age) + 4, y = c(100, -10),
           label = c("Mean", round(mean(house_age), 1)),
           colour = "#41c42f", angle = c(-90, 0)) +
  theme(plot.title = element_text(hjust = 0.5),
        title = element_text(hjust = 0.5, colour = "red", size = rel(2)),
        axis.title = element_text(size = rel(1), colour = "blue", hjust = .5),
        legend.position = "none",
        panel.background = element_rect(fill = "grey", colour = "black"),
        panel.grid = element_line(color = "blue"))
The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.
# type your code for Question 2 here, and Knit
## Plotting a box plot of the price distribution (in $100k) by neighborhood
ames_train %>%
  ggplot(aes(x = Neighborhood, y = price / 10^5, fill = Neighborhood)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90))
## Calculating centre and spread statistics by neighborhood, stored in a data frame
ames_stats <- ames_train %>%
  group_by(Neighborhood) %>%
  summarise(Min = min(price, na.rm = TRUE),
            Mean = mean(price, na.rm = TRUE),
            Median = median(price, na.rm = TRUE),
            IQR = IQR(price, na.rm = TRUE),
            Max = max(price, na.rm = TRUE),
            Range = Max - Min) %>%
  arrange(desc(Mean))
## Neighborhoods with the extreme summary statistics, stored in a data frame
ames_summary <- data.frame(filter(ames_stats, Mean == max(Mean))$Neighborhood,
                           filter(ames_stats, Mean == min(Mean))$Neighborhood,
                           filter(ames_stats, IQR == max(IQR))$Neighborhood)
## Formatting the data frame
colnames(ames_summary) <- c("Most Expensive", "Least Expensive", "Most Heterogeneous")
rownames(ames_summary) <- c("Neighborhood")
## Printing out the data frame
ames_summary
##              Most Expensive Least Expensive Most Heterogeneous
## Neighborhood        StoneBr         MeadowV            StoneBr
The summary statistics above show that StoneBr has the highest mean and median prices among all neighborhoods; StoneBr is therefore the most expensive neighborhood. Based on the IQR, StoneBr also has the most dispersed house prices, which makes it the most heterogeneous neighborhood as well.
MeadowV, in contrast, has the lowest mean and median house prices and is therefore the least expensive neighborhood.
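Since the mean is sensitive to outliers, a small cross-check (a sketch) ranks the neighborhoods directly on the robust statistics already stored in ames_stats:
## Cross-check using the median (centre) and IQR (spread)
ames_stats %>% arrange(desc(Median)) %>% slice(1) %>% select(Neighborhood, Median)  # most expensive
ames_stats %>% arrange(Median) %>% slice(1) %>% select(Neighborhood, Median)        # least expensive
ames_stats %>% arrange(desc(IQR)) %>% slice(1) %>% select(Neighborhood, IQR)        # most heterogeneous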
Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.
# type your code for Question 3 here, and Knit
## Count the missing values in every variable
missing_val <- ames_train %>% summarise_all(funs(sum(is.na(.))))
## Select the column with the maximum number of missing values
missing_val[, colSums(missing_val) == max(missing_val)]
## # A tibble: 1 x 1
## Pool.QC
## <int>
## 1 997
The variable with the largest number of missing values is Pool.QC, which means most of the houses do not have a swimming pool; only 3 houses in this data set do. This makes sense because owning a swimming pool is generally expensive: it requires not only the space but also the construction cost and, more importantly, running costs that last for the lifetime of the pool.
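As a quick cross-check (a sketch, assuming Pool.QC is recorded only when a pool exists), the count of homes with Pool.Area > 0 should match the number of non-missing Pool.QC values:
## Cross-check: homes with a pool vs. homes with a recorded Pool.QC
ames_train %>%
  summarise(with_pool = sum(Pool.Area > 0, na.rm = TRUE),
            pool_qc_recorded = sum(!is.na(Pool.QC)))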
We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.
## Select the variables with at least one missing value
names(missing_val[, colSums(missing_val) >= 1])
## [1] "Lot.Frontage" "Alley" "Mas.Vnr.Area" "Bsmt.Qual"
## [5] "Bsmt.Cond" "Bsmt.Exposure" "BsmtFin.Type.1" "BsmtFin.SF.1"
## [9] "BsmtFin.Type.2" "BsmtFin.SF.2" "Bsmt.Unf.SF" "Total.Bsmt.SF"
## [13] "Bsmt.Full.Bath" "Bsmt.Half.Bath" "Fireplace.Qu" "Garage.Type"
## [17] "Garage.Yr.Blt" "Garage.Finish" "Garage.Cars" "Garage.Area"
## [21] "Garage.Qual" "Garage.Cond" "Pool.QC" "Fence"
## [25] "Misc.Feature"
Listed above are all variables in this dataset with NA values. Based on the data dictionary and the previous investigations, we can explain the missingness and decide how to deal with it.
Based on this analysis, we have created a function \(prepare\_data()\) that treats all NA values as described above and makes the updated dataset \(data\) available for our model selection. In the appropriate section below, we describe the model selection method used.
# type your code for Question 4 here, and Knit
# Frequentist approach to model selection
## Backward selection starting with the full model
#---------------------------------------Data preparation function--------------------------------------#
# This function prepares the ames data for analysis
prepare_data <- function(data = ames_train){
  # Drop constant variables: a single unique value carries no information for modeling
  var_to_use <- data %>% summarise_all(funs(length(unique(.))))
  var_to_use <- names(data.frame(var_to_use)[, var_to_use > 1])
  # Keep only the useful variables for the model
  data <- data %>% select(one_of(var_to_use))
  # Deal with NA values
  ## Separate numerical from categorical variables
  ames_var_types <- split(names(data), sapply(data, function(x) paste(class(x), collapse = " ")))
  ## Numerical variables: replace NA values with 0
  ames_train_int <- data %>% select(one_of(ames_var_types$integer))
  ames_train_int[is.na(ames_train_int)] <- 0
  ## Categorical variables: replace NA values and empty strings with the level "None"
  ames_train_fac <- data %>% select(one_of(ames_var_types$factor))
  ames_train_fac <- sapply(ames_train_fac, as.character)
  ames_train_fac[is.na(ames_train_fac)] <- "None"
  ames_train_fac[ames_train_fac == ""] <- "None"
  ames_train_fac <- data.frame(ames_train_fac)
  ## Merge the numerical and categorical parts back together
  data <- cbind(ames_train_int, ames_train_fac)
  return(na.omit(data))
}
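# A minimal sanity check (a sketch): the prepared dataset should keep its rows
# and contain no remaining NA values.
data_check <- prepare_data(ames_train)
c(rows = nrow(data_check), remaining_NAs = sum(is.na(data_check)))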
#---------------------------------------Model selection function--------------------------------------#
# This function performs an automatic model selection and yields the best model for the chosen criterion
model_selection <- function(data = data, response_variable = "price", criteria = c("pvalue", "adj-rsquared", "bic", "aic"), significance = 0.05){
  count <- 0
  ##############################################################
  # Building the linear model based on the dataframe indicated #
  ##############################################################
  ## Start from all variable names except the response
  full_var_list <- names(data)[names(data) != response_variable]
  while(count == 0){
    ## Collapse the current variable list into the right-hand side of the formula
    next_var_list <- paste(full_var_list, collapse = "+")
    ## Write the linear model formula with the current variable list
    formula_lm <- as.formula(paste0("log(", response_variable, ") ~ ", next_var_list))
    data_lm <- lm(formula = formula_lm, data = data)
    ##############################################################
    # Starting the model selection based on the method selected  #
    ##############################################################
    if(tolower(criteria) %in% "pvalue"){
      ## Keep coefficients with p-values below the preset significance level (default 0.05)
      model_pvalue <- data.frame(summary(data_lm)$coef[summary(data_lm)$coef[, 4] <= significance, ])
      ## Match the surviving coefficient rows back to variable names in the original data frame;
      ## factor levels are appended to their variable name, hence the partial (grep) matching
      next_lm_var <- sapply(names(data), function(x) grep(x, row.names(model_pvalue), value = TRUE))
      ## Remove variables with no match: grep() returns character(0) (length 0), which differs
      ## from NULL, so we filter on length == 0L
      next_lm_var <- next_lm_var[!sapply(next_lm_var, function(x) length(x) == 0L)]
      ## Update the variable list to be used for the next linear model
      next_var_list <- names(next_lm_var)[names(next_lm_var) != response_variable]
      ## Stop once the variable list no longer shrinks
      if(length(next_var_list) != length(full_var_list)){
        full_var_list <- next_var_list
      }else{count <- 1}
    }else if(tolower(criteria) %in% "adj-rsquared"){
      ## Adjusted R-squared selection would require fitting every combination of variables
      ## (e.g. via combn()); it is not implemented here
      print(paste(criteria, "is not yet supported"))
      ## Exit the while loop; there is nothing iterative to do for this criterion
      count <- 1
    ## Computing the BIC model selection option if selected
    } else if (tolower(criteria) %in% "bic"){
      n <- nrow(na.omit(data))
      ## Backward elimination with penalty k = log(n), i.e. the BIC penalty
      bic_lm <- stepAIC(data_lm, direction = "backward", k = log(n), trace = 0)
      data_lm <- bic_lm
      ## Setting count to 1 to exit the while loop
      count <- 1
    ## Computing the AIC model selection option if selected
    } else if (tolower(criteria) %in% "aic"){
      ## Backward elimination with the AIC penalty k = 2 (the stepAIC default)
      aic_lm <- stepAIC(data_lm, direction = "backward", k = 2, trace = 0)
      data_lm <- aic_lm
      ## Setting count to 1 to exit the while loop
      count <- 1
    ## Exit if an unsupported selection method is requested
    } else{
      print("Wrong option. Supported options are AIC, BIC or pvalue")
      ## Setting count to 1 to exit the while loop
      count <- 1
    }
  }
  return(data_lm)
}
data <- prepare_data(data=ames_train)
# final_model_Pvalue <- model_selection(data = data, response_variable = "price", criteria = "pvalue", significance = 0.05)
final_model_BIC <- model_selection(data = data, response_variable = "price", criteria = "BIC")
# final_model_AIC <- model_selection(data = data, response_variable = "price", criteria = "AIC")
#print("Final model based on P-Value")
#summary(final_model_Pvalue)
print("Final model based on BIC")
## [1] "Final model based on BIC"
summary(final_model_BIC)
##
## Call:
## lm(formula = log(price) ~ area + Lot.Area + Overall.Qual + Overall.Cond +
## Year.Built + Year.Remod.Add + BsmtFin.SF.1 + BsmtFin.SF.2 +
## Bsmt.Unf.SF + Bedroom.AbvGr + Fireplaces + Garage.Cars +
## MS.Zoning + Condition.2 + Bldg.Type + Exter.Qual + Exter.Cond +
## Central.Air + Sale.Condition, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.29410 -0.06298 -0.00026 0.06037 0.88123
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.789e+00 6.579e-01 5.759 1.14e-08 ***
## area 2.846e-04 1.603e-05 17.752 < 2e-16 ***
## Lot.Area 1.612e-06 4.660e-07 3.460 0.000564 ***
## Overall.Qual 7.868e-02 5.594e-03 14.066 < 2e-16 ***
## Overall.Cond 5.196e-02 5.120e-03 10.149 < 2e-16 ***
## Year.Built 2.203e-03 2.788e-04 7.902 7.48e-15 ***
## Year.Remod.Add 1.061e-03 3.111e-04 3.411 0.000675 ***
## BsmtFin.SF.1 2.146e-04 1.463e-05 14.669 < 2e-16 ***
## BsmtFin.SF.2 1.732e-04 2.743e-05 6.315 4.13e-10 ***
## Bsmt.Unf.SF 1.081e-04 1.517e-05 7.122 2.09e-12 ***
## Bedroom.AbvGr -1.993e-02 7.477e-03 -2.666 0.007813 **
## Fireplaces 2.649e-02 7.959e-03 3.328 0.000907 ***
## Garage.Cars 3.673e-02 8.013e-03 4.584 5.16e-06 ***
## MS.ZoningFV 3.426e-01 5.061e-02 6.770 2.24e-11 ***
## MS.ZoningI (all) 3.180e-01 1.433e-01 2.219 0.026688 *
## MS.ZoningRH 2.814e-01 6.743e-02 4.172 3.29e-05 ***
## MS.ZoningRL 3.138e-01 4.696e-02 6.683 3.95e-11 ***
## MS.ZoningRM 2.364e-01 4.664e-02 5.068 4.83e-07 ***
## Condition.2Feedr -9.684e-02 1.091e-01 -0.887 0.375158
## Condition.2Norm -1.817e-03 9.465e-02 -0.019 0.984689
## Condition.2PosA 7.855e-02 1.629e-01 0.482 0.629736
## Condition.2PosN -1.006e+00 1.356e-01 -7.419 2.59e-13 ***
## Condition.2RRNn -6.852e-02 1.628e-01 -0.421 0.673885
## Bldg.Type2fmCon 4.260e-02 3.201e-02 1.331 0.183533
## Bldg.TypeDuplex -7.264e-02 2.531e-02 -2.870 0.004199 **
## Bldg.TypeTwnhs -1.367e-01 2.387e-02 -5.726 1.38e-08 ***
## Bldg.TypeTwnhsE -3.960e-02 1.766e-02 -2.243 0.025136 *
## Exter.QualFa -5.329e-03 5.419e-02 -0.098 0.921685
## Exter.QualGd -8.948e-02 2.558e-02 -3.498 0.000491 ***
## Exter.QualTA -1.266e-01 2.992e-02 -4.233 2.52e-05 ***
## Exter.CondFa -1.209e-01 7.521e-02 -1.608 0.108264
## Exter.CondGd 1.976e-02 6.673e-02 0.296 0.767240
## Exter.CondTA 4.762e-02 6.620e-02 0.719 0.472122
## Central.AirY 8.643e-02 2.297e-02 3.762 0.000179 ***
## Sale.ConditionAdjLand 1.264e-01 9.730e-02 1.299 0.194106
## Sale.ConditionAlloca 1.841e-01 6.837e-02 2.693 0.007208 **
## Sale.ConditionFamily -3.146e-02 3.620e-02 -0.869 0.384956
## Sale.ConditionNormal 8.407e-02 1.777e-02 4.731 2.56e-06 ***
## Sale.ConditionPartial 1.250e-01 2.434e-02 5.137 3.38e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1296 on 961 degrees of freedom
## Multiple R-squared: 0.9086, Adjusted R-squared: 0.905
## F-statistic: 251.5 on 38 and 961 DF, p-value: < 2.2e-16
#print("Final model based on AIC")
#summary(final_model_AIC)
Verifying the conditions for multiple linear regression:
par(mfrow = c(2, 3))
## Checking nearly normal residuals
hist(final_model_BIC$residuals, breaks = 100, xlab = "Residuals", main = "Residual histogram - Nearly normal")
plot(density(final_model_BIC$residuals), main = "Density plot - Nearly normal")
qqnorm(final_model_BIC$residuals)
qqline(final_model_BIC$residuals)
## Checking independence of residuals (residuals vs. collection order)
plot(final_model_BIC$residuals, ylab = "Residuals", main = "Residuals vs. order - Independence")
## Checking linearity and constant variability (residuals vs. fitted values)
plot(final_model_BIC$residuals ~ final_model_BIC$fitted.values, xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs. fitted - Linearity")
plot(abs(final_model_BIC$residuals) ~ final_model_BIC$fitted.values, xlab = "Fitted values", ylab = "|Residuals|", main = "Constant variability")
Now that these conditions are verified, we can proceed with the analysis of the model.
The model selection method we use in this case is BIC backward elimination. This is a Bayesian-flavoured way of selecting a model, based on the Bayesian Information Criterion. The BIC is a penalized likelihood, \(BIC = -2\log(\text{likelihood}) + k\log(n)\), where \(k\) is the number of parameters in the linear model and \(n\) the number of observations. For linear regression the BIC can also be expressed in terms of \(R^2\), up to a constant, as \(BIC = n\log(1 - R^2) + k\log(n)\). Written this way, we can see that if two models end up with the same \(R^2\) value, the model with fewer variables will be selected, since it is penalized less by the \(k\log(n)\) term. The linear model with the smaller BIC is the better model. The BIC can be motivated by a prior probability distribution over the candidate models; here we implicitly start from a uniform prior, giving every model the same chance of being selected.
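As a minimal sketch of the criterion itself (the predictors area, Lot.Area and Overall.Qual are illustrative picks from the prepared dataset), the BIC can be computed by hand from the log-likelihood and checked against R's built-in BIC():
## Computing the BIC by hand and comparing two nested models
m_small <- lm(log(price) ~ area + Lot.Area, data = data)
m_large <- lm(log(price) ~ area + Lot.Area + Overall.Qual, data = data)
n <- nrow(data)
k <- attr(logLik(m_small), "df")  # parameter count, including the error variance
c(manual = -2 * as.numeric(logLik(m_small)) + k * log(n),
  builtin = BIC(m_small), larger_model = BIC(m_large))
## The model with the smaller BIC is preferred.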
The code above defines two functions: prepare_data(), which deals with NA values and prepares the data for analysis, and model_selection(), which was built to perform various model selection methods so it can be reused throughout this module if required. The goodness of fit of the final model in this case is \(R^2 = 0.9086\), which is quite strong.
Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?
# type your code for Question 5 here, and Knit
## Squared residuals; their names carry the row numbers of the original data
resid_squared <- final_model_BIC$residuals^2
## Retrieve the row of ames_train with the largest squared residual
ames_train[as.numeric(names(which.max(resid_squared))), ]
## # A tibble: 1 x 81
## PID area price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
## <int> <int> <int> <int> <fctr> <int> <int> <fctr>
## 1 902207130 832 12789 30 RM 68 9656 Pave
## # ... with 73 more variables: Alley <fctr>, Lot.Shape <fctr>,
## # Land.Contour <fctr>, Utilities <fctr>, Lot.Config <fctr>,
## # Land.Slope <fctr>, Neighborhood <fctr>, Condition.1 <fctr>,
## # Condition.2 <fctr>, Bldg.Type <fctr>, House.Style <fctr>,
## # Overall.Qual <int>, Overall.Cond <int>, Year.Built <int>,
## # Year.Remod.Add <int>, Roof.Style <fctr>, Roof.Matl <fctr>,
## # Exterior.1st <fctr>, Exterior.2nd <fctr>, Mas.Vnr.Type <fctr>,
## # Mas.Vnr.Area <int>, Exter.Qual <fctr>, Exter.Cond <fctr>,
## # Foundation <fctr>, Bsmt.Qual <fctr>, Bsmt.Cond <fctr>,
## # Bsmt.Exposure <fctr>, BsmtFin.Type.1 <fctr>, BsmtFin.SF.1 <int>,
## # BsmtFin.Type.2 <fctr>, BsmtFin.SF.2 <int>, Bsmt.Unf.SF <int>,
## # Total.Bsmt.SF <int>, Heating <fctr>, Heating.QC <fctr>,
## # Central.Air <fctr>, Electrical <fctr>, X1st.Flr.SF <int>,
## # X2nd.Flr.SF <int>, Low.Qual.Fin.SF <int>, Bsmt.Full.Bath <int>,
## # Bsmt.Half.Bath <int>, Full.Bath <int>, Half.Bath <int>,
## # Bedroom.AbvGr <int>, Kitchen.AbvGr <int>, Kitchen.Qual <fctr>,
## # TotRms.AbvGrd <int>, Functional <fctr>, Fireplaces <int>,
## # Fireplace.Qu <fctr>, Garage.Type <fctr>, Garage.Yr.Blt <int>,
## # Garage.Finish <fctr>, Garage.Cars <int>, Garage.Area <int>,
## # Garage.Qual <fctr>, Garage.Cond <fctr>, Paved.Drive <fctr>,
## # Wood.Deck.SF <int>, Open.Porch.SF <int>, Enclosed.Porch <int>,
## # X3Ssn.Porch <int>, Screen.Porch <int>, Pool.Area <int>,
## # Pool.QC <fctr>, Fence <fctr>, Misc.Feature <fctr>, Misc.Val <int>,
## # Mo.Sold <int>, Yr.Sold <int>, Sale.Type <fctr>, Sale.Condition <fctr>
anova(final_model_BIC)
## Analysis of Variance Table
##
## Response: log(price)
## Df Sum Sq Mean Sq F value Pr(>F)
## area 1 90.300 90.300 5374.9608 < 2.2e-16 ***
## Lot.Area 1 0.588 0.588 35.0039 4.572e-09 ***
## Overall.Qual 1 44.250 44.250 2633.9263 < 2.2e-16 ***
## Overall.Cond 1 0.706 0.706 42.0093 1.450e-10 ***
## Year.Built 1 10.663 10.663 634.6829 < 2.2e-16 ***
## Year.Remod.Add 1 0.088 0.088 5.2486 0.0221798 *
## BsmtFin.SF.1 1 4.328 4.328 257.6252 < 2.2e-16 ***
## BsmtFin.SF.2 1 0.426 0.426 25.3526 5.700e-07 ***
## Bsmt.Unf.SF 1 1.476 1.476 87.8472 < 2.2e-16 ***
## Bedroom.AbvGr 1 0.071 0.071 4.2192 0.0402402 *
## Fireplaces 1 0.452 0.452 26.9303 2.573e-07 ***
## Garage.Cars 1 0.732 0.732 43.5575 6.795e-11 ***
## MS.Zoning 5 2.010 0.402 23.9290 < 2.2e-16 ***
## Condition.2 5 1.719 0.344 20.4685 < 2.2e-16 ***
## Bldg.Type 4 0.890 0.223 13.2484 1.642e-10 ***
## Exter.Qual 3 0.342 0.114 6.7768 0.0001596 ***
## Exter.Cond 3 0.574 0.191 11.3819 2.441e-07 ***
## Central.Air 1 0.232 0.232 13.8067 0.0002143 ***
## Sale.Condition 5 0.736 0.147 8.7576 3.896e-08 ***
## Residuals 961 16.145 0.017
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
max(ames_train$Yr.Sold)
## [1] 2010
The house with the largest squared residual is the one with PID 902207130, on row 428 of the ames_train data frame, located in the Old Town neighborhood.
According to the analysis, this house has the lowest sale price ($12,789) among all houses in this dataset. From the record above, its small living area (832 sq ft) is consistent with a low price, yet the sale price still sits far below what the model predicts from the home's characteristics, which is what produces the large squared residual.
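A small sketch (assuming, as in the lookup above, that the residual names map back to ames_train row numbers) pulls the fields most likely to explain the outlier:
## Key fields for the highest-squared-residual home
idx <- as.numeric(names(which.max(resid_squared)))
ames_train[idx, c("PID", "Neighborhood", "price", "area", "Year.Built",
                  "Overall.Qual", "Overall.Cond", "Sale.Condition")]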
Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?
# type your code for Question 6 here, and Knit
data$Lot.Area <- log(data$Lot.Area)
final_model_BIC_log <- model_selection(data = data, response_variable = "price", criteria = "BIC")
final_model_BIC_log$call
## lm(formula = log(price) ~ area + Lot.Area + Overall.Qual + Overall.Cond +
## Year.Built + BsmtFin.SF.1 + BsmtFin.SF.2 + Bsmt.Unf.SF +
## Bedroom.AbvGr + Fireplaces + Garage.Cars + MS.Zoning + Condition.2 +
## Exter.Qual + Exter.Cond + Heating.QC + Central.Air + Sale.Condition,
## data = data)
Log transforming the Lot.Area variable in the prepared dataset and reapplying the same model selection criterion (BIC), we find that the final model differs from the one obtained without the transformation: two variables, Year.Remod.Add and Bldg.Type, are dropped from the final BIC model, while a new variable, Heating.QC, enters as significant.
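A minimal sketch to verify this change in the predictor sets programmatically:
## Compare the predictor sets of the two BIC-selected models
vars <- attr(terms(final_model_BIC), "term.labels")
vars_log <- attr(terms(final_model_BIC_log), "term.labels")
setdiff(vars, vars_log)  # predictors dropped after the log transform
setdiff(vars_log, vars)  # predictors added after the log transform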
Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide a written support that includes a quantitative justification for your answer in the first part of question 7.
# type your code for Question 7 here, and Knit
## Fitted log prices from both models (predict.lm on the training data)
BPM_predict_log <- predict(final_model_BIC_log)
BPM_predict <- predict(final_model_BIC)
actual_fitted_price <- data.frame(BPM_predict, BPM_predict_log, log(data$price))
names(actual_fitted_price) <- c("BPM_predict", "BPM_predict_log", "log_price")
## Getting the three points with the highest absolute residuals in each model
out_col <- as.numeric(names(head(sort(abs(final_model_BIC$residuals), decreasing = TRUE), 3)))
out_log_col <- as.numeric(names(head(sort(abs(final_model_BIC_log$residuals), decreasing = TRUE), 3)))
outliers <- data.frame(x = log(data$price[out_col]), y = BPM_predict[out_col])
outliers_log <- data.frame(x = log(data$price[out_log_col]), y = BPM_predict_log[out_log_col])
p1 <- ggplot(data = actual_fitted_price, aes(y = BPM_predict, x = log_price)) +
  geom_point(colour = "blue", shape = 16) +
  geom_smooth(method = "lm", col = "red") +
  labs(title = "Actual vs fitted prices without transformation", x = "Actual price", y = "Fitted prices") +
  theme(title = element_text(color = "blue"), plot.title = element_text(hjust = 0.5, size = 8)) +
  annotate("text", x = 11.5, y = 14, label = "Adjusted_R^2 == 0.905", parse = TRUE) +
  annotate("text", x = outliers$x + 0.1, y = outliers$y, label = c("1", "2", "3"), colour = "red")
p2 <- ggplot(data = actual_fitted_price, aes(y = BPM_predict_log, x = log_price)) +
  geom_point(colour = "blue", shape = 16) +
  geom_smooth(method = "lm", col = "red") +
  labs(title = "Actual vs fitted prices with Lot.Area log transformation", x = "Actual price", y = "Fitted prices") +
  theme(title = element_text(color = "blue"), plot.title = element_text(hjust = 0.5, size = 8)) +
  annotate("text", x = 11.5, y = 14, label = "Adjusted_R^2 == 0.9068", parse = TRUE) +
  annotate("text", x = outliers_log$x + 0.1, y = outliers_log$y, label = c("1", "2", "3"), colour = "red")
grid.arrange(p1, p2, ncol=2, nrow=1)
par(mfrow=c(1,2))
qqnorm(final_model_BIC$residuals)
qqline(final_model_BIC$residuals)
qqnorm(final_model_BIC_log$residuals)
qqline(final_model_BIC_log$residuals)
## Second reason why the model with log(Lot.Area) is preferable
summary(final_model_BIC_log)
##
## Call:
## lm(formula = log(price) ~ area + Lot.Area + Overall.Qual + Overall.Cond +
## Year.Built + BsmtFin.SF.1 + BsmtFin.SF.2 + Bsmt.Unf.SF +
## Bedroom.AbvGr + Fireplaces + Garage.Cars + MS.Zoning + Condition.2 +
## Exter.Qual + Exter.Cond + Heating.QC + Central.Air + Sale.Condition,
## data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.27958 -0.05825 0.00000 0.06013 0.89316
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.958e+00 5.422e-01 9.144 < 2e-16 ***
## area 2.700e-04 1.575e-05 17.145 < 2e-16 ***
## Lot.Area 7.847e-02 1.029e-02 7.624 5.89e-14 ***
## Overall.Qual 8.119e-02 5.519e-03 14.710 < 2e-16 ***
## Overall.Cond 5.646e-02 4.816e-03 11.723 < 2e-16 ***
## Year.Built 2.346e-03 2.522e-04 9.303 < 2e-16 ***
## BsmtFin.SF.1 2.064e-04 1.443e-05 14.304 < 2e-16 ***
## BsmtFin.SF.2 1.675e-04 2.701e-05 6.200 8.39e-10 ***
## Bsmt.Unf.SF 9.963e-05 1.503e-05 6.631 5.57e-11 ***
## Bedroom.AbvGr -1.878e-02 7.081e-03 -2.652 0.008128 **
## Fireplaces 2.636e-02 7.808e-03 3.375 0.000766 ***
## Garage.Cars 3.057e-02 7.969e-03 3.836 0.000133 ***
## MS.ZoningFV 3.261e-01 5.022e-02 6.492 1.35e-10 ***
## MS.ZoningI (all) 2.049e-01 1.417e-01 1.446 0.148435
## MS.ZoningRH 2.659e-01 6.650e-02 3.998 6.88e-05 ***
## MS.ZoningRL 2.892e-01 4.662e-02 6.204 8.18e-10 ***
## MS.ZoningRM 2.279e-01 4.628e-02 4.925 9.92e-07 ***
## Condition.2Feedr -1.032e-01 1.067e-01 -0.968 0.333448
## Condition.2Norm -1.374e-03 9.228e-02 -0.015 0.988125
## Condition.2PosA 7.188e-02 1.604e-01 0.448 0.654227
## Condition.2PosN -9.904e-01 1.331e-01 -7.440 2.22e-13 ***
## Condition.2RRNn -4.010e-02 1.621e-01 -0.247 0.804691
## Exter.QualFa -1.277e-03 5.328e-02 -0.024 0.980890
## Exter.QualGd -8.628e-02 2.530e-02 -3.410 0.000675 ***
## Exter.QualTA -1.210e-01 2.963e-02 -4.083 4.81e-05 ***
## Exter.CondFa -1.137e-01 7.448e-02 -1.527 0.127208
## Exter.CondGd 1.732e-02 6.608e-02 0.262 0.793321
## Exter.CondTA 4.435e-02 6.554e-02 0.677 0.498830
## Heating.QCFa -9.905e-02 3.109e-02 -3.186 0.001489 **
## Heating.QCGd -1.964e-02 1.284e-02 -1.529 0.126600
## Heating.QCPo -1.148e-01 1.312e-01 -0.875 0.381696
## Heating.QCTA -6.296e-02 1.146e-02 -5.493 5.05e-08 ***
## Central.AirY 7.017e-02 2.265e-02 3.098 0.002006 **
## Sale.ConditionAdjLand 8.991e-02 9.523e-02 0.944 0.345351
## Sale.ConditionAlloca 1.778e-01 6.727e-02 2.642 0.008365 **
## Sale.ConditionFamily -3.413e-02 3.586e-02 -0.952 0.341508
## Sale.ConditionNormal 8.916e-02 1.764e-02 5.054 5.18e-07 ***
## Sale.ConditionPartial 1.359e-01 2.404e-02 5.651 2.10e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1284 on 962 degrees of freedom
## Multiple R-squared: 0.9103, Adjusted R-squared: 0.9068
## F-statistic: 263.8 on 37 and 962 DF, p-value: < 2.2e-16
In the side-by-side plots of actual versus fitted prices, the points cluster more tightly around the regression line in the panel for the model built with the log transformation of Lot.Area, while in the panel without the transformation the points are scattered slightly further from the line. We note, however, that the three points furthest from the fitted line, annotated (1, 2, 3), appear largely unaffected by the log transformation of Lot.Area.
Additionally, the summary of the new linear model with the log transformation of Lot.Area yields a slightly higher adjusted R-squared (0.9068) than the model without the transformation (0.905). The difference is small, but together with the tighter clustering around the line it is enough to justify that log transforming Lot.Area is the better choice.
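As a compact quantitative summary (a sketch; both models have the same response, log(price), so their BIC values are directly comparable):
## Side-by-side fit statistics for the two final models
data.frame(model = c("Lot.Area", "log(Lot.Area)"),
           adj_r2 = c(summary(final_model_BIC)$adj.r.squared,
                      summary(final_model_BIC_log)$adj.r.squared),
           BIC = c(BIC(final_model_BIC), BIC(final_model_BIC_log)))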