First, let us load the data and necessary packages:

load("ames_train.Rdata")
library(MASS)
library(dplyr)
library(ggplot2)
library(devtools)
library(statsr)
library(lubridate)
library(tidyr)
library(gridExtra)

1

Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.

# type your code for Question 1 here, and Knit

# Compute house age once so every plot layer can reuse it
house_age <- year(today()) - ames_train$Year.Built

ggplot(data = ames_train, aes(x = house_age)) +
  geom_histogram(bins = 30, fill = "blue", colour = "white") +
  labs(title = "House counts by age", x = "House age", y = "House count") +
  geom_vline(xintercept = median(house_age), colour = "red") +
  annotate("text", x = median(house_age) - 4, y = c(100, -10),
           label = c("Median", median(house_age)),
           colour = "red", angle = c(90, 0)) +
  geom_vline(xintercept = mean(house_age), colour = "#41c42f") +
  annotate("text", x = mean(house_age) + 4, y = c(100, -10),
           label = c("Mean", round(mean(house_age), 1)),
           colour = "#41c42f", angle = c(-90, 0)) +
  theme(plot.title = element_text(hjust = 0.5),
        title = element_text(hjust = 0.5, colour = "red", size = rel(2)),
        axis.title = element_text(size = rel(1), colour = "blue", hjust = .5),
        legend.position = "none",
        panel.background = element_rect(fill = "grey", colour = "black"),
        panel.grid = element_line(color = "blue"))


  1. The distribution of house ages plotted above is strongly right skewed.
  2. There are three peaks, so the distribution is multimodal. Most of the houses in this dataset (about 140) are 10-15 years old; a second group (about 80) is 55-60 years old; and, towards the right of the distribution, a third group (about 37) is 90-95 years old. This may indicate real-estate booms during those periods.
  3. The distribution also shows that more than 50% of the houses were built less than 45 years ago, since the median age falls below 45.
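
As a quick numeric check of the last claim, here is a minimal sketch reusing the house_age vector computed in the plotting code above:

# Median age and the proportion of houses younger than 45 years
median(house_age)
mean(house_age < 45)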

2

The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.

# type your code for Question 2 here, and Knit

## Box plot of the price distribution (price in units of $100k) by neighborhood
ames_train %>%
  ggplot(aes(x = Neighborhood, y = price / 10^5, fill = Neighborhood)) +
  geom_boxplot() +
  theme(axis.text.x = element_text(angle = 90))

## Central tendency and spread statistics grouped by Neighborhood, stored in a dataframe
ames_stats <- ames_train %>%
  group_by(Neighborhood) %>%
  summarise(Min = min(price, na.rm = TRUE),
            Mean = mean(price, na.rm = TRUE),
            Median = median(price, na.rm = TRUE),
            IQR = IQR(price, na.rm = TRUE),
            Max = max(price, na.rm = TRUE),
            Range = Max - Min) %>%
  arrange(desc(Mean))
## Neighborhoods with max mean, min mean, and max IQR, stored in a dataframe
ames_summary <- data.frame(filter(ames_stats, Mean == max(Mean))$Neighborhood,
                           filter(ames_stats, Mean == min(Mean))$Neighborhood,
                           filter(ames_stats, IQR == max(IQR))$Neighborhood)

## Formatting the dataframe
colnames(ames_summary) <- c("Most Expensive", "Least Expensive", "Most Heterogeneous")
rownames(ames_summary) <- c("Neighborhood")

## printing out the dataframe
ames_summary
##              Most Expensive Least Expensive Most Heterogeneous
## Neighborhood        StoneBr         MeadowV           StoneBr

The summary statistics produced above show that the StoneBr neighborhood has the highest mean and median price among all neighborhoods; StoneBr is therefore the most expensive neighborhood. Based on the IQR, StoneBr also has the most dispersed house prices, making it the most heterogeneous neighborhood as well.

MeadowV, by contrast, has the lowest mean and median house price and is therefore the least expensive neighborhood.
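
The actual values of the chosen summary statistics can be read off the ames_stats dataframe computed above; a minimal sketch:

## Report the chosen summary statistics for the flagged neighborhoods
ames_stats %>%
  filter(Neighborhood %in% c("StoneBr", "MeadowV")) %>%
  select(Neighborhood, Mean, Median, IQR)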


3

Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.

# type your code for Question 3 here, and Knit

## Count the number of missing values in every variable
missing_val <- ames_train %>% summarise_all(funs(sum(is.na(.))))

## select the column with the maximum number of missing values
missing_val[, colSums(missing_val) == max(missing_val)]
## # A tibble: 1 x 1
##   Pool.QC
##     <int>
## 1     997

The variable with the largest number of missing values is Pool.QC, which means most houses do not have a swimming pool; only 3 houses in this dataset have one. This makes sense because owning a pool is expensive: it requires lot space and a substantial construction outlay, and, more importantly, it carries running costs for the lifetime of the pool.
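
A quick tabulation supports this; a minimal sketch (Pool.QC is NA whenever a house has no pool):

## Count houses with and without a pool
table(ames_train$Pool.QC, useNA = "ifany")
sum(!is.na(ames_train$Pool.QC))  # the 3 houses that actually have a pool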


4

We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.

## select the variables with at least one missing value
names(missing_val[, colSums(missing_val)>=1])
##  [1] "Lot.Frontage"   "Alley"          "Mas.Vnr.Area"   "Bsmt.Qual"     
##  [5] "Bsmt.Cond"      "Bsmt.Exposure"  "BsmtFin.Type.1" "BsmtFin.SF.1"  
##  [9] "BsmtFin.Type.2" "BsmtFin.SF.2"   "Bsmt.Unf.SF"    "Total.Bsmt.SF" 
## [13] "Bsmt.Full.Bath" "Bsmt.Half.Bath" "Fireplace.Qu"   "Garage.Type"   
## [17] "Garage.Yr.Blt"  "Garage.Finish"  "Garage.Cars"    "Garage.Area"   
## [21] "Garage.Qual"    "Garage.Cond"    "Pool.QC"        "Fence"         
## [25] "Misc.Feature"

Above are listed all variables within this dataset that contain NA values. Based on the data dictionary and the previous investigations, we can explain the missing values and decide how to deal with them.

  • Lot.Frontage and Mas.Vnr.Area are integers, and their NAs mean that there is no lot frontage or masonry veneer area. We can set these NA values to zero (0).
  • For Alley, Pool.QC, Fence and Misc.Feature, an NA means the feature is absent. We can set all NAs to the value “None”.
  • 21 houses have no basement, so the NAs in all variables containing “Bsmt” will be set to the value “None”.
  • 47 houses have no garage, hence the “Garage” NAs can be set to the value “None”.
  • Based on the above analysis, we have created a function \(prepare\_data()\) that treats all NA values as described above and makes the updated dataset \(data\) available for model selection. Below the code, we describe the model selection method used.

    # type your code for Question 4 here, and Knit
    #Frequentist approach of model selection
    ##Backward selection starting with full model
    
    #---------------------------------------Preparing the data function--------------------------------------#
    
    # this function will prepare the data from ames for analysis
    prepare_data <- function(data=ames_train){
     
      
      # identify variables with more than one unique value; constant variables carry no information for modeling
      var_to_use <- data %>% summarise_all(funs(length(unique(.))))
      var_to_use <- names(data.frame(var_to_use)[,var_to_use>1])
      
      # keep only the useful variables for the model
      data <- data %>% select(var_to_use)
      
      # Deal with NA values 
      
      ## Separate numerical from categorical variables
      ames_var_types <- split(names(data), sapply(data, function(x) paste(class(x), collapse = " ")))
      
      ## Deal with numerical variables NA values
    
      ames_train_int <- data %>% select(ames_var_types$integer)
      ames_train_int[is.na(ames_train_int)] <- 0
      
      ## Dealing with categorical variables NA values
      ames_train_fac <- data %>% select(ames_var_types$factor)
      ames_train_fac <- sapply(ames_train_fac, as.character)
      ames_train_fac[is.na(ames_train_fac)] <- c("None")
      ames_train_fac[ames_train_fac==""] <- c("None")
      ames_train_fac <- data.frame(ames_train_fac)
      
      ## Merging both numerical and categoricals
      data <- cbind(ames_train_int, ames_train_fac)
      
      return(na.omit(data))
    }
    
    
    
    
    #---------------------------------------Model selection function--------------------------------------#
    
    # This function will do the automatic model selection and yield the best model based on selection criteria
    model_selection <- function(data=data, response_variable="price", criteria=c("pvalue", "adj-rsquare", "BIC", "AIC"), significance=0.05){
    
      ## reduce criteria to a single value (match.arg picks the first by default),
      ## so the if/else tests below compare against one string
      criteria <- match.arg(criteria)
    
      count <- 0
      
    ##############################################################
    # Building the linear model based on the dataframe indicated #
    ##############################################################  
    
      ## Looping along the variable names and creating the next variable list for the lm
     
       full_var_list <- names(data)[names(data)!=response_variable]
      
      
      while(count==0){
        
      ## Collapse the variable names into a single "var1+var2+..." string
      next_var_list <- paste(full_var_list, collapse = "+")
    
    # Build the linear model formula from the current variable list
      
      formula_lm <- as.formula(paste0("log(", response_variable, ") ~ ", next_var_list))
      
      data_lm <- lm(formula = formula_lm, data = data)
    
      ##############################################################
      # Starting the model selection based on the method selected  #
      ##############################################################    
    
          if(tolower(criteria) %in% "pvalue"){
        
              ## Keep coefficients whose p-value is at or below the significance level (default 0.05)
              model_pvalue <- data.frame(summary(data_lm)$coef[summary(data_lm)$coef[,4] <=significance,])
              
              ## Keep rows of the returned dataframe whose names partially match variable names in the original dataframe
              next_lm_var <- sapply(names(data), function(x) grep(x, row.names(model_pvalue), value=TRUE))
              
              ## Remove all variables with no match: grep returns character(0)
              ## (not NULL) for non-matches, so we filter on length == 0L
              
              next_lm_var <- next_lm_var[!sapply(next_lm_var, function(x) length(x)==0L)]
              
              ## Updating the next list of variable to be used for linear modeling
              next_var_list <- names(next_lm_var)[names(next_lm_var)!=response_variable]
        
              ## Checking if previous and current variable list are the same
              if(length(next_var_list)!=length(full_var_list)){
          
              full_var_list <- next_var_list
          
              }else{count <- 1}
        
        
          }else if(tolower(criteria) %in% "adj-rsquare"){
            
            ## Exhaustive adjusted R-squared search is not implemented yet
            print(paste(criteria, "is not yet supported"))
            
            ## setting count to 1 to exit the while loop
            count <- 1
      
                  
            ## Computing BIC model selection option if selected
          } else if (tolower(criteria) %in% "bic"){
            
            n <- nrow(na.omit(data))
            bic_lm <- stepAIC(data_lm, direction = "backward", k=log(n), trace = 0)
            data_lm <- bic_lm
            
            ## setting count to 1 to exit the while loop
            count <- 1
            
            
            ## Computing AIC model selection option if selected
          } else if (tolower(criteria) %in% "aic"){
            
            ## AIC uses a constant penalty of k = 2 per parameter (the stepAIC default)
            aic_lm <- stepAIC(data_lm, direction = "backward", k = 2, trace = 0)
            data_lm <- aic_lm
            
            ## setting count to 1 to exit the while loop
            count <- 1
    
                    
            ## Exit if related model selection is selected
          } else{
              
            print("Wrong Option. Supported options are AIC, BIC or Pvalue")
            
            ## setting count to 1 to exit the while loop
            count <- 1
    
            
            }
      
      }
      
      return(data_lm)
      
    }
    
    
    data <- prepare_data(data=ames_train)
    
    
    # final_model_Pvalue <- model_selection(data = data, response_variable = "price", criteria = "pvalue", significance = 0.05)
    
    final_model_BIC <- model_selection(data = data, response_variable = "price", criteria = "BIC")
    
    # final_model_AIC <- model_selection(data = data, response_variable = "price", criteria = "AIC")
    
    
    #print("Final model based on P-Value")
    #summary(final_model_Pvalue)
    
    print("Final model based on BIC")
    ## [1] "Final model based on BIC"
    summary(final_model_BIC)
    ## 
    ## Call:
    ## lm(formula = log(price) ~ area + Lot.Area + Overall.Qual + Overall.Cond + 
    ##     Year.Built + Year.Remod.Add + BsmtFin.SF.1 + BsmtFin.SF.2 + 
    ##     Bsmt.Unf.SF + Bedroom.AbvGr + Fireplaces + Garage.Cars + 
    ##     MS.Zoning + Condition.2 + Bldg.Type + Exter.Qual + Exter.Cond + 
    ##     Central.Air + Sale.Condition, data = data)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1.29410 -0.06298 -0.00026  0.06037  0.88123 
    ## 
    ## Coefficients:
    ##                         Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)            3.789e+00  6.579e-01   5.759 1.14e-08 ***
    ## area                   2.846e-04  1.603e-05  17.752  < 2e-16 ***
    ## Lot.Area               1.612e-06  4.660e-07   3.460 0.000564 ***
    ## Overall.Qual           7.868e-02  5.594e-03  14.066  < 2e-16 ***
    ## Overall.Cond           5.196e-02  5.120e-03  10.149  < 2e-16 ***
    ## Year.Built             2.203e-03  2.788e-04   7.902 7.48e-15 ***
    ## Year.Remod.Add         1.061e-03  3.111e-04   3.411 0.000675 ***
    ## BsmtFin.SF.1           2.146e-04  1.463e-05  14.669  < 2e-16 ***
    ## BsmtFin.SF.2           1.732e-04  2.743e-05   6.315 4.13e-10 ***
    ## Bsmt.Unf.SF            1.081e-04  1.517e-05   7.122 2.09e-12 ***
    ## Bedroom.AbvGr         -1.993e-02  7.477e-03  -2.666 0.007813 ** 
    ## Fireplaces             2.649e-02  7.959e-03   3.328 0.000907 ***
    ## Garage.Cars            3.673e-02  8.013e-03   4.584 5.16e-06 ***
    ## MS.ZoningFV            3.426e-01  5.061e-02   6.770 2.24e-11 ***
    ## MS.ZoningI (all)       3.180e-01  1.433e-01   2.219 0.026688 *  
    ## MS.ZoningRH            2.814e-01  6.743e-02   4.172 3.29e-05 ***
    ## MS.ZoningRL            3.138e-01  4.696e-02   6.683 3.95e-11 ***
    ## MS.ZoningRM            2.364e-01  4.664e-02   5.068 4.83e-07 ***
    ## Condition.2Feedr      -9.684e-02  1.091e-01  -0.887 0.375158    
    ## Condition.2Norm       -1.817e-03  9.465e-02  -0.019 0.984689    
    ## Condition.2PosA        7.855e-02  1.629e-01   0.482 0.629736    
    ## Condition.2PosN       -1.006e+00  1.356e-01  -7.419 2.59e-13 ***
    ## Condition.2RRNn       -6.852e-02  1.628e-01  -0.421 0.673885    
    ## Bldg.Type2fmCon        4.260e-02  3.201e-02   1.331 0.183533    
    ## Bldg.TypeDuplex       -7.264e-02  2.531e-02  -2.870 0.004199 ** 
    ## Bldg.TypeTwnhs        -1.367e-01  2.387e-02  -5.726 1.38e-08 ***
    ## Bldg.TypeTwnhsE       -3.960e-02  1.766e-02  -2.243 0.025136 *  
    ## Exter.QualFa          -5.329e-03  5.419e-02  -0.098 0.921685    
    ## Exter.QualGd          -8.948e-02  2.558e-02  -3.498 0.000491 ***
    ## Exter.QualTA          -1.266e-01  2.992e-02  -4.233 2.52e-05 ***
    ## Exter.CondFa          -1.209e-01  7.521e-02  -1.608 0.108264    
    ## Exter.CondGd           1.976e-02  6.673e-02   0.296 0.767240    
    ## Exter.CondTA           4.762e-02  6.620e-02   0.719 0.472122    
    ## Central.AirY           8.643e-02  2.297e-02   3.762 0.000179 ***
    ## Sale.ConditionAdjLand  1.264e-01  9.730e-02   1.299 0.194106    
    ## Sale.ConditionAlloca   1.841e-01  6.837e-02   2.693 0.007208 ** 
    ## Sale.ConditionFamily  -3.146e-02  3.620e-02  -0.869 0.384956    
    ## Sale.ConditionNormal   8.407e-02  1.777e-02   4.731 2.56e-06 ***
    ## Sale.ConditionPartial  1.250e-01  2.434e-02   5.137 3.38e-07 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.1296 on 961 degrees of freedom
    ## Multiple R-squared:  0.9086, Adjusted R-squared:  0.905 
    ## F-statistic: 251.5 on 38 and 961 DF,  p-value: < 2.2e-16
    #print("Final model based on AIC")
    #summary(final_model_AIC)

    Verifying the conditions for multiple linear regression

    par(mfrow=c(2,3))
    ## Checking the nearly normal residuals condition
    hist(final_model_BIC$residuals, breaks = 100, xlab = "Residuals", main = "Residual histogram - Nearly normal")
    plot(density(final_model_BIC$residuals),  main = "Density plot - Nearly normal")
    qqnorm(final_model_BIC$residuals)
    qqline(final_model_BIC$residuals)
    
    ## Checking independence: residuals plotted in observation order
    plot(final_model_BIC$residuals, ylab = "Residuals", main = "Residuals vs order - Independence")
    
    ## Checking linearity and constant variability against the fitted values
    plot(final_model_BIC$residuals ~ final_model_BIC$fitted.values, xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs fitted - Linearity")
    plot(abs(final_model_BIC$residuals) ~ final_model_BIC$fitted.values, xlab = "Fitted values", ylab = "|Residuals|", main = "Constant variability")

    The diagrams above check the conditions for multiple linear regression before we proceed. From these plots we can see that:
  • Linearity is satisfied: the residuals scatter randomly around the zero line when plotted against the fitted values.
  • The nearly normal residuals condition is satisfied: the histogram and density plots are centered at 0, and the normal Q-Q plot provides additional evidence of near normality.
  • The plot of residuals in observation order shows no pattern, so the independence condition is satisfied.
  • The residuals-vs-fitted and |residuals|-vs-fitted plots show a roughly uniform spread around zero with no fan shape, so the constant variability condition is satisfied.
  • With these conditions verified, we can proceed with the model analysis.


    The model selection method we use here is backward selection based on BIC. The Bayesian Information Criterion is a penalised-likelihood criterion, \(BIC = -2\log(\text{likelihood}) + k\log(n)\), where \(k\) is the number of parameters in the linear model and \(n\) the number of observations. For linear regression, the BIC can also be expressed in terms of \(R^2\), up to an additive constant, as \(BIC = n\log(1-R^2) + k\log(n)\). Written this way, we can see that if two models end up with the same \(R^2\) value, the model with fewer variables is selected, because it is penalised less by the \(k\log(n)\) term. The model with the smallest BIC is the preferred model. BIC is motivated by Bayesian model comparison under a prior probability distribution over models; here we implicitly start from a uniform prior, giving each candidate model the same prior probability.
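
    To make the criterion concrete, here is a minimal sketch using base R's BIC() and the stepAIC() call with a \(\log(n)\) penalty that our model_selection() function relies on. The two nested candidate models are hypothetical illustrations fit on the prepared dataset \(data\):

    ## BIC compares penalised fits: with comparable likelihoods, the
    ## k*log(n) penalty favours the model with fewer parameters
    m_small <- lm(log(price) ~ area + Overall.Qual, data = data)
    m_large <- lm(log(price) ~ area + Overall.Qual + Lot.Area + Bedroom.AbvGr, data = data)
    BIC(m_small)
    BIC(m_large)   # the model with the smaller BIC is preferred
    
    ## stepAIC with k = log(n) performs the same penalised search backwards
    n <- nrow(data)
    stepAIC(m_large, direction = "backward", k = log(n), trace = 0)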

    The code above defines two functions: prepare_data(), which deals with NA values and prepares the data for analysis, and model_selection(), which was built to perform various model selection methods for reuse later in this module if required. The goodness of fit of the final model is \(R^2 = 0.9086\), which is quite strong.


    5

    Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?

    # type your code for Question 5 here, and Knit
    
    ## Squared residuals; the residual names are the row labels in ames_train
    resid_squared <- final_model_BIC$residuals^2
    ames_train[as.numeric(names(which.max(resid_squared))), ]
    ## # A tibble: 1 x 81
    ##         PID  area price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
    ##       <int> <int> <int>       <int>    <fctr>        <int>    <int> <fctr>
    ## 1 902207130   832 12789          30        RM           68     9656   Pave
    ## # ... with 73 more variables: Alley <fctr>, Lot.Shape <fctr>,
    ## #   Land.Contour <fctr>, Utilities <fctr>, Lot.Config <fctr>,
    ## #   Land.Slope <fctr>, Neighborhood <fctr>, Condition.1 <fctr>,
    ## #   Condition.2 <fctr>, Bldg.Type <fctr>, House.Style <fctr>,
    ## #   Overall.Qual <int>, Overall.Cond <int>, Year.Built <int>,
    ## #   Year.Remod.Add <int>, Roof.Style <fctr>, Roof.Matl <fctr>,
    ## #   Exterior.1st <fctr>, Exterior.2nd <fctr>, Mas.Vnr.Type <fctr>,
    ## #   Mas.Vnr.Area <int>, Exter.Qual <fctr>, Exter.Cond <fctr>,
    ## #   Foundation <fctr>, Bsmt.Qual <fctr>, Bsmt.Cond <fctr>,
    ## #   Bsmt.Exposure <fctr>, BsmtFin.Type.1 <fctr>, BsmtFin.SF.1 <int>,
    ## #   BsmtFin.Type.2 <fctr>, BsmtFin.SF.2 <int>, Bsmt.Unf.SF <int>,
    ## #   Total.Bsmt.SF <int>, Heating <fctr>, Heating.QC <fctr>,
    ## #   Central.Air <fctr>, Electrical <fctr>, X1st.Flr.SF <int>,
    ## #   X2nd.Flr.SF <int>, Low.Qual.Fin.SF <int>, Bsmt.Full.Bath <int>,
    ## #   Bsmt.Half.Bath <int>, Full.Bath <int>, Half.Bath <int>,
    ## #   Bedroom.AbvGr <int>, Kitchen.AbvGr <int>, Kitchen.Qual <fctr>,
    ## #   TotRms.AbvGrd <int>, Functional <fctr>, Fireplaces <int>,
    ## #   Fireplace.Qu <fctr>, Garage.Type <fctr>, Garage.Yr.Blt <int>,
    ## #   Garage.Finish <fctr>, Garage.Cars <int>, Garage.Area <int>,
    ## #   Garage.Qual <fctr>, Garage.Cond <fctr>, Paved.Drive <fctr>,
    ## #   Wood.Deck.SF <int>, Open.Porch.SF <int>, Enclosed.Porch <int>,
    ## #   X3Ssn.Porch <int>, Screen.Porch <int>, Pool.Area <int>,
    ## #   Pool.QC <fctr>, Fence <fctr>, Misc.Feature <fctr>, Misc.Val <int>,
    ## #   Mo.Sold <int>, Yr.Sold <int>, Sale.Type <fctr>, Sale.Condition <fctr>
    anova(final_model_BIC)
    ## Analysis of Variance Table
    ## 
    ## Response: log(price)
    ##                 Df Sum Sq Mean Sq   F value    Pr(>F)    
    ## area             1 90.300  90.300 5374.9608 < 2.2e-16 ***
    ## Lot.Area         1  0.588   0.588   35.0039 4.572e-09 ***
    ## Overall.Qual     1 44.250  44.250 2633.9263 < 2.2e-16 ***
    ## Overall.Cond     1  0.706   0.706   42.0093 1.450e-10 ***
    ## Year.Built       1 10.663  10.663  634.6829 < 2.2e-16 ***
    ## Year.Remod.Add   1  0.088   0.088    5.2486 0.0221798 *  
    ## BsmtFin.SF.1     1  4.328   4.328  257.6252 < 2.2e-16 ***
    ## BsmtFin.SF.2     1  0.426   0.426   25.3526 5.700e-07 ***
    ## Bsmt.Unf.SF      1  1.476   1.476   87.8472 < 2.2e-16 ***
    ## Bedroom.AbvGr    1  0.071   0.071    4.2192 0.0402402 *  
    ## Fireplaces       1  0.452   0.452   26.9303 2.573e-07 ***
    ## Garage.Cars      1  0.732   0.732   43.5575 6.795e-11 ***
    ## MS.Zoning        5  2.010   0.402   23.9290 < 2.2e-16 ***
    ## Condition.2      5  1.719   0.344   20.4685 < 2.2e-16 ***
    ## Bldg.Type        4  0.890   0.223   13.2484 1.642e-10 ***
    ## Exter.Qual       3  0.342   0.114    6.7768 0.0001596 ***
    ## Exter.Cond       3  0.574   0.191   11.3819 2.441e-07 ***
    ## Central.Air      1  0.232   0.232   13.8067 0.0002143 ***
    ## Sale.Condition   5  0.736   0.147    8.7576 3.896e-08 ***
    ## Residuals      961 16.145   0.017                        
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    max(ames_train$Yr.Sold)
    ## [1] 2010

    The house with the largest squared residual is the house with PID 902207130, in row 428 of the ames_train dataframe, located in the OldTown neighborhood.

    According to the analysis, this house has the lowest sale price of all houses in this dataset. This low sale price is explained by several factors:

  • The small area of the house: it sits in the bottom 25% of houses by area, and area alone accounts for about 51% of the variance in log price, as per the formula \(anova(final\_model\_BIC)[1,2]/sum(anova(final\_model\_BIC)[2])\) (see the sketch below).
  • The house has a very low overall quality rating of 2; Overall.Qual accounts for about 25% of the variance in log price in this model, as per the formula \(anova(final\_model\_BIC)[3,2]/sum(anova(final\_model\_BIC)[2])\).
  • Year.Built also affects the price: this house was built in 1923 and is among the oldest houses in the data set. This variable contributes about 6% of the variance in log price in our model, as per \(anova(final\_model\_BIC)[5,2]/sum(anova(final\_model\_BIC)[2])\).
  • The Exter.Cond and Exter.Qual ratings of the house also contribute negatively to its price.
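
    The percentages quoted above come from the ANOVA sums of squares; the sketch below computes each term's share of the total sum of squares of log(price):

    ## Share of variance in log(price) attributable to each model term
    aov_tab <- anova(final_model_BIC)
    round(setNames(aov_tab[, "Sum Sq"] / sum(aov_tab[, "Sum Sq"]),
                   rownames(aov_tab)), 4)   # area ~ 0.51, Overall.Qual ~ 0.25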

    6

    Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?

    # type your code for Question 6 here, and Knit
    
    data$Lot.Area <- log(data$Lot.Area)
    
    final_model_BIC_log <- model_selection(data = data, response_variable = "price", criteria = "BIC")
    
    final_model_BIC_log$call
    ## lm(formula = log(price) ~ area + Lot.Area + Overall.Qual + Overall.Cond + 
    ##     Year.Built + BsmtFin.SF.1 + BsmtFin.SF.2 + Bsmt.Unf.SF + 
    ##     Bedroom.AbvGr + Fireplaces + Garage.Cars + MS.Zoning + Condition.2 + 
    ##     Exter.Qual + Exter.Cond + Heating.QC + Central.Air + Sale.Condition, 
    ##     data = data)

    Log transforming the Lot.Area variable in the prepared dataset and reapplying the same model selection criterion (BIC), we find that the final model differs from the one obtained without the log transformation. Two variables, Year.Remod.Add and Bldg.Type, are removed from the final BIC model, while a new variable, Heating.QC, enters as significant.
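
    The difference in predictor sets can also be confirmed programmatically; a minimal sketch comparing the term labels of the two fitted models:

    ## Terms dropped by, and added to, the log(Lot.Area) model
    old_terms <- attr(terms(final_model_BIC), "term.labels")
    new_terms <- attr(terms(final_model_BIC_log), "term.labels")
    setdiff(old_terms, new_terms)   # expected: Year.Remod.Add, Bldg.Type
    setdiff(new_terms, old_terms)   # expected: Heating.QC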


    7

    Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide a written support that includes a quantitative justification for your answer in the first part of question 7.

    # type your code for Question 7 here, and Knit
    
    ## For lm objects predict() returns the fitted values; the estimator
    ## argument is silently ignored here (it matters only for BAS model objects)
    BPM_predict_log <- predict(final_model_BIC_log, estimator="BPM")
    
    BPM_predict <- predict(final_model_BIC, estimator="BPM")
    
    actual_fitted_price <- data.frame(BPM_predict, BPM_predict_log, log(data$price))
    names(actual_fitted_price) <- c("BPM_predict", "BPM_predict_log", "log_price")
    
    ## Identify the three houses with the largest absolute residuals in the
    ## untransformed model; the same three houses are annotated on both plots
    ## so their behaviour can be compared across the two models
    out_col <- as.numeric(names(head(sort(abs(final_model_BIC$residuals), decreasing = TRUE), 3)))
    
    outliers <- data.frame(x=log(data$price[out_col]), y=BPM_predict[out_col])
    outliers_log <- data.frame(x=log(data$price[out_col]), y=BPM_predict_log[out_col])
    
    p1 <- ggplot(data = actual_fitted_price, aes(y=BPM_predict, x=log_price))+
      geom_point(colour="blue", shape=16)+
      geom_smooth(method = "lm", col="red")+
      labs(title="Actual vs Fitted prices without transformation", x="Actual price", y="Fitted prices")+
      theme(title = element_text(color = "blue"), plot.title = element_text(hjust = 0.5, size = 8))+
      annotate("text", x=11.5, y=14, label = "Adjusted_R^2 == 0.905", parse = TRUE)+
      annotate("text", x=outliers$x+0.1, y=outliers$y, label = c("1", "2", "3"), colour="red", parse = TRUE)
    
    p2 <- ggplot(data = actual_fitted_price, aes(y=BPM_predict_log, x=log_price))+
      geom_point(colour="blue", shape=16)+
      geom_smooth(method = "lm", col="red")+
      labs(title="Actual vs Fitted prices with Lot.Area log transformation", x="Actual price", y="Fitted prices")+
      theme(title = element_text(color = "blue"), plot.title = element_text(hjust = 0.5, size = 8))+
      annotate("text", x=11.5, y=14, label = "Adjusted_R^2 == 0.9068", parse = TRUE)+
      annotate("text", x=outliers_log$x+0.1, y=outliers_log$y, label = c("1", "2", "3"), colour="red", parse = TRUE)
    
    grid.arrange(p1, p2, ncol=2, nrow=1)

    par(mfrow=c(1,2))
    qqnorm(final_model_BIC$residuals, main = "Q-Q: Lot.Area model")
    qqline(final_model_BIC$residuals)
    
    qqnorm(final_model_BIC_log$residuals, main = "Q-Q: log(Lot.Area) model")
    qqline(final_model_BIC_log$residuals)

    ## Second reason why the model with Log of Lot.Area is preferable
    
    summary(final_model_BIC_log)
    ## 
    ## Call:
    ## lm(formula = log(price) ~ area + Lot.Area + Overall.Qual + Overall.Cond + 
    ##     Year.Built + BsmtFin.SF.1 + BsmtFin.SF.2 + Bsmt.Unf.SF + 
    ##     Bedroom.AbvGr + Fireplaces + Garage.Cars + MS.Zoning + Condition.2 + 
    ##     Exter.Qual + Exter.Cond + Heating.QC + Central.Air + Sale.Condition, 
    ##     data = data)
    ## 
    ## Residuals:
    ##      Min       1Q   Median       3Q      Max 
    ## -1.27958 -0.05825  0.00000  0.06013  0.89316 
    ## 
    ## Coefficients:
    ##                         Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)            4.958e+00  5.422e-01   9.144  < 2e-16 ***
    ## area                   2.700e-04  1.575e-05  17.145  < 2e-16 ***
    ## Lot.Area               7.847e-02  1.029e-02   7.624 5.89e-14 ***
    ## Overall.Qual           8.119e-02  5.519e-03  14.710  < 2e-16 ***
    ## Overall.Cond           5.646e-02  4.816e-03  11.723  < 2e-16 ***
    ## Year.Built             2.346e-03  2.522e-04   9.303  < 2e-16 ***
    ## BsmtFin.SF.1           2.064e-04  1.443e-05  14.304  < 2e-16 ***
    ## BsmtFin.SF.2           1.675e-04  2.701e-05   6.200 8.39e-10 ***
    ## Bsmt.Unf.SF            9.963e-05  1.503e-05   6.631 5.57e-11 ***
    ## Bedroom.AbvGr         -1.878e-02  7.081e-03  -2.652 0.008128 ** 
    ## Fireplaces             2.636e-02  7.808e-03   3.375 0.000766 ***
    ## Garage.Cars            3.057e-02  7.969e-03   3.836 0.000133 ***
    ## MS.ZoningFV            3.261e-01  5.022e-02   6.492 1.35e-10 ***
    ## MS.ZoningI (all)       2.049e-01  1.417e-01   1.446 0.148435    
    ## MS.ZoningRH            2.659e-01  6.650e-02   3.998 6.88e-05 ***
    ## MS.ZoningRL            2.892e-01  4.662e-02   6.204 8.18e-10 ***
    ## MS.ZoningRM            2.279e-01  4.628e-02   4.925 9.92e-07 ***
    ## Condition.2Feedr      -1.032e-01  1.067e-01  -0.968 0.333448    
    ## Condition.2Norm       -1.374e-03  9.228e-02  -0.015 0.988125    
    ## Condition.2PosA        7.188e-02  1.604e-01   0.448 0.654227    
    ## Condition.2PosN       -9.904e-01  1.331e-01  -7.440 2.22e-13 ***
    ## Condition.2RRNn       -4.010e-02  1.621e-01  -0.247 0.804691    
    ## Exter.QualFa          -1.277e-03  5.328e-02  -0.024 0.980890    
    ## Exter.QualGd          -8.628e-02  2.530e-02  -3.410 0.000675 ***
    ## Exter.QualTA          -1.210e-01  2.963e-02  -4.083 4.81e-05 ***
    ## Exter.CondFa          -1.137e-01  7.448e-02  -1.527 0.127208    
    ## Exter.CondGd           1.732e-02  6.608e-02   0.262 0.793321    
    ## Exter.CondTA           4.435e-02  6.554e-02   0.677 0.498830    
    ## Heating.QCFa          -9.905e-02  3.109e-02  -3.186 0.001489 ** 
    ## Heating.QCGd          -1.964e-02  1.284e-02  -1.529 0.126600    
    ## Heating.QCPo          -1.148e-01  1.312e-01  -0.875 0.381696    
    ## Heating.QCTA          -6.296e-02  1.146e-02  -5.493 5.05e-08 ***
    ## Central.AirY           7.017e-02  2.265e-02   3.098 0.002006 ** 
    ## Sale.ConditionAdjLand  8.991e-02  9.523e-02   0.944 0.345351    
    ## Sale.ConditionAlloca   1.778e-01  6.727e-02   2.642 0.008365 ** 
    ## Sale.ConditionFamily  -3.413e-02  3.586e-02  -0.952 0.341508    
    ## Sale.ConditionNormal   8.916e-02  1.764e-02   5.054 5.18e-07 ***
    ## Sale.ConditionPartial  1.359e-01  2.404e-02   5.651 2.10e-08 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 0.1284 on 962 degrees of freedom
    ## Multiple R-squared:  0.9103, Adjusted R-squared:  0.9068 
    ## F-statistic: 263.8 on 37 and 962 DF,  p-value: < 2.2e-16

    In the side-by-side plots of actual versus fitted prices, the points cluster more tightly around the regression line in the plot for the model built with a log transformation of Lot.Area, while in the plot without the transformation the points are slightly more scattered away from the regression line. We note, however, that the 3 points furthest from the fitted line, annotated (1, 2, 3), seem largely unaffected by the log transformation of Lot.Area.

    Additionally, the summary of the new linear model with the log transformation of Lot.Area yields a slightly higher adjusted R-squared (0.9068) than the model without the transformation (0.905). The difference is small, but together with the tighter clustering of the residuals it supports log transforming Lot.Area.
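
    As an additional quantitative check, the sketch below compares the in-sample root-mean-squared residuals of the two models on the log(price) scale; consistent with the residual standard errors reported above (0.1296 vs 0.1284), the log(Lot.Area) model should come out slightly smaller:

    ## RMSE of residuals on the log(price) scale
    sqrt(mean(final_model_BIC$residuals^2))      # Lot.Area model
    sqrt(mean(final_model_BIC_log$residuals^2))  # log(Lot.Area) model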

