First, let us load the data and necessary packages:
load("ames_train.Rdata")
library(MASS)
library(dplyr)
library(ggplot2)
library(lubridate)
library(BAS)
library(broom)
library(gridExtra)
Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.
# Age of Houses compared to current date, default bins is 30 #
ames_train <- as_tibble(ames_train)
# Add a new column called House_Age #
ames_train_Q1 <- ames_train %>%
  mutate(House_Age = as.numeric(year(today()) - Year.Built))
summary(ames_train_Q1$House_Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 18.0 44.0 46.8 64.0 147.0
# Plot histogram #
ggplot(data = ames_train_Q1, aes(House_Age)) +
  geom_histogram(bins = 30) +
  xlab("Years since Built") +
  ylab("Number of Houses") +
  ggtitle("Distribution of the Age of the Houses") +
  geom_vline(xintercept = quantile(ames_train_Q1$House_Age, 0.90),
             linetype = "dashed", color = "blue") +
  annotate("text", x = 1, y = quantile(ames_train_Q1$House_Age, 0.90),
           label = paste("90th Percentile is",
                         round(quantile(ames_train_Q1$House_Age, 0.90), digits = 0),
                         "years"),
           angle = 90, vjust = 29) +
  theme_classic() +
  theme(plot.title = element_text(size = 10))
The median age of the houses is 44 years and the mean is about 47 years (46.8 in the summary above). Roughly 50% of the houses are between 18 and 64 years old, as indicated by the first and third quartiles.
The distribution of house ages is right skewed, and the 90th percentile is 94 years, marked by the vertical blue dashed line in the graph.
The distribution is also bimodal, with more newer houses than older ones.
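A kernel density view makes the bimodality easier to see; a small supplementary sketch (not part of the original output), using the same data frame:
# Density view of house age; the two bumps correspond to the
# newer and older construction clusters described above
ggplot(ames_train_Q1, aes(House_Age)) +
  geom_density() +
  xlab("Years since Built") +
  ggtitle("Density of House Age") +
  theme_classic()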
The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.
# Location based statistics on house price in relation to neighborhoods #
ggplot(data = ames_train, aes(price)) + geom_histogram(bins = 30) + theme_classic() +
  xlab("Price of Houses") +
  ggtitle("Distribution of house prices irrespective of neighborhoods")
# Median Price per Neighborhood #
ames_train_Location <- ames_train %>%
  group_by(Neighborhood) %>%
  summarise(Price_Neighborhood = median(price)) %>%
  arrange(Price_Neighborhood)
# One point per neighborhood, so geom_point is used rather than geom_jitter #
ggplot(data = ames_train_Location, aes(Neighborhood, Price_Neighborhood)) +
  geom_point(aes(color = Neighborhood, size = Price_Neighborhood)) +
  theme(axis.text.x = element_text(angle = 90)) +
  theme(legend.position = "none") +
  ggtitle("Median Price in relation to Neighborhoods in Ames, Iowa") +
  xlab("Neighborhoods in Ames, Iowa") +
  ylab("Median Price") +
  theme(plot.title = element_text(size = 10)) +
  theme(axis.title = element_text(size = 10))
# Price Variation per Neighborhood #
ames_train_Location_Var <- ames_train %>%
  group_by(Neighborhood) %>%
  summarise(Price_Var_Neighborhood = sd(price)) %>%
  arrange(Price_Var_Neighborhood)
ggplot(data = ames_train_Location_Var, aes(Neighborhood, Price_Var_Neighborhood)) +
  geom_point(aes(color = Neighborhood, size = Price_Var_Neighborhood)) +
  theme(axis.text.x = element_text(angle = 90)) +
  theme(legend.position = "none") +
  ggtitle("Price Variation in relation to Neighborhoods in Ames, Iowa") +
  xlab("Neighborhoods in Ames, Iowa") +
  ylab("Price Variation (standard deviation)") +
  theme(plot.title = element_text(size = 10)) +
  theme(axis.title = element_text(size = 10))
# Most and Least Expensive Neighborhoods #
ames_train_Location[which.max(ames_train_Location$Price_Neighborhood),]
## # A tibble: 1 x 2
## Neighborhood Price_Neighborhood
## <fct> <dbl>
## 1 StoneBr 340692.
ames_train_Location[which.min(ames_train_Location$Price_Neighborhood),]
## # A tibble: 1 x 2
## Neighborhood Price_Neighborhood
## <fct> <dbl>
## 1 MeadowV 85750
# Neighborhoods with most and least Variation in price #
ames_train_Location_Var[which.max(ames_train_Location_Var$Price_Var_Neighborhood),]
## # A tibble: 1 x 2
## Neighborhood Price_Var_Neighborhood
## <fct> <dbl>
## 1 StoneBr 123459.
ames_train_Location_Var[which.min(ames_train_Location_Var$Price_Var_Neighborhood),]
## # A tibble: 1 x 2
## Neighborhood Price_Var_Neighborhood
## <fct> <dbl>
## 1 Blueste 10381.
Without any data transformation (logarithmic or square root), the distribution of house prices pooled across all neighborhoods is right skewed, as the histogram makes clear. The median is therefore a better summary statistic than the mean for reflecting the typical price in a neighborhood, and the median price per neighborhood is the more useful figure to communicate to someone looking for a house.
The first scatterplot of median price against neighborhood identifies the most and least expensive neighborhoods. The second scatterplot uses the standard deviation as the statistic for price variation within each neighborhood.
StoneBr (Stone Brook) is the most expensive neighborhood, with a median price of about $340,692, and it also has the most variation in price, with a standard deviation of about $123,459. MeadowV (Meadow Village) is the least expensive, with a median price of $85,750, and Blueste (Bluestem) has the least variation in the price of its houses, with a standard deviation of about $10,381.
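To see concretely why the median is preferable here, the two statistics can be compared per neighborhood; a quick sketch (the Skew_Gap column name is just illustrative):
# Mean versus median price per neighborhood; in right-skewed
# neighborhoods the mean is pulled above the median
ames_train %>%
  group_by(Neighborhood) %>%
  summarise(Mean_Price = mean(price),
            Median_Price = median(price)) %>%
  mutate(Skew_Gap = Mean_Price - Median_Price) %>%
  arrange(desc(Skew_Gap))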
Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.
# Count NA's per column of the data frame #
na_count <- sapply(ames_train, function(x) sum(is.na(x)))
# Sort the vector in descending order and subset the first value #
na_count_desc <- sort(na_count, decreasing = TRUE)
na_count_desc[1]
## Pool.QC
## 997
The variable Pool.QC (Pool Quality) has the most missing values: 997 of the 1,000 homes. Few houses are expected to have their own swimming pool, and Pool.QC is missing whenever there is no pool to rate, so it makes sense that this variable has so many missing values.
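This reading can be verified against Pool.Area (a quick check, on the assumption that Pool.QC is NA exactly when a house has no pool):
# Cross-tabulate missingness of Pool.QC against pool presence;
# the NA cases should coincide with Pool.Area == 0
with(ames_train, table(PoolQC_missing = is.na(Pool.QC),
                       Has_Pool = Pool.Area > 0))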
We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.
# Log transform of price and check of its distribution #
ames_train_Q4 <- ames_train %>%
  mutate(LogPrice = log(price))
ggplot(data = ames_train_Q4, aes(LogPrice)) + geom_histogram(bins = 30) + theme_classic()
qqnorm(ames_train_Q4$LogPrice); qqline(ames_train_Q4$LogPrice)
ames_train_Reg <- ames_train_Q4 %>%
  dplyr::select(LogPrice, Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr)
# Bayesian Model Averaging #
model_lm_bma <- bas.lm(LogPrice ~ ., data = na.omit(ames_train_Reg),
                       prior = "BIC", modelprior = uniform(), method = "BAS")
model_lm_bma
##
## Call:
## bas.lm(formula = LogPrice ~ ., data = na.omit(ames_train_Reg),
## prior = "BIC", modelprior = uniform(), method = "BAS")
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept Lot.Area Land.SlopeMod Land.SlopeSev
## 1.000 1.000 0.904 0.904
## Year.Built Year.Remod.Add Bedroom.AbvGr
## 1.000 1.000 1.000
summary(model_lm_bma)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.0000000 1.0000 1.0000000 1.000000e+00
## Lot.Area 1.0000000 1.0000 1.0000000 1.000000e+00
## Land.SlopeMod 0.9039993 1.0000 0.0000000 1.000000e+00
## Land.SlopeSev 0.9039993 1.0000 0.0000000 1.000000e+00
## Year.Built 1.0000000 1.0000 1.0000000 1.000000e+00
## Year.Remod.Add 1.0000000 1.0000 1.0000000 1.000000e+00
## Bedroom.AbvGr 1.0000000 1.0000 1.0000000 0.000000e+00
## BF NA 1.0000 0.1061955 5.347284e-13
## PostProbs NA 0.9040 0.0960000 0.000000e+00
## R2 NA 0.5625 0.5544000 5.338000e-01
## dim NA 7.0000 5.0000000 6.000000e+00
## logmarg NA -2198.1673 -2200.4098178 -2.226424e+03
## model 4 model 5
## Intercept 1.000000e+00 1.000000e+00
## Lot.Area 1.000000e+00 0.000000e+00
## Land.SlopeMod 0.000000e+00 1.000000e+00
## Land.SlopeSev 0.000000e+00 1.000000e+00
## Year.Built 1.000000e+00 1.000000e+00
## Year.Remod.Add 1.000000e+00 1.000000e+00
## Bedroom.AbvGr 0.000000e+00 1.000000e+00
## BF 6.841724e-14 2.396619e-17
## PostProbs 0.000000e+00 0.000000e+00
## R2 5.254000e-01 5.244000e-01
## dim 4.000000e+00 6.000000e+00
## logmarg -2.228480e+03 -2.236437e+03
image(model_lm_bma, rotate = F)
coef_model_lm_bma <- coefficients(model_lm_bma)
plot(coef_model_lm_bma, subset = c(1:5), ask = FALSE)
confint(coef_model_lm_bma)
## 2.5% 97.5% beta
## Intercept 1.199992e+01 1.203464e+01 1.201847e+01
## Lot.Area 7.855965e-06 1.247550e-05 1.012662e-05
## Land.SlopeMod 0.000000e+00 2.176493e-01 1.251455e-01
## Land.SlopeSev -7.020709e-01 0.000000e+00 -4.128938e-01
## Year.Built 5.271695e-03 6.761622e-03 6.046237e-03
## Year.Remod.Add 5.644047e-03 7.796233e-03 6.788766e-03
## Bedroom.AbvGr 6.421949e-02 1.065750e-01 8.687950e-02
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
plot(model_lm_bma)
After the natural log transformation of the house price, the histogram and the normal Q-Q plot indicate that the transformed response is approximately normally distributed.
Uncertainty about the correct model specification can be high: the core challenge is choosing the right variables from the multitude that could be included in the model. A more comprehensive approach to addressing model uncertainty is Bayesian model averaging (BMA), which allows us to assess the robustness of results to alternative specifications by calculating posterior distributions over both coefficients and models.
Here BMA is run with the "BIC" prior and the "BAS" method (sampling without replacement). The model rank matrix shows that Model 1, which includes all the predictors, has the highest log posterior odds. The marginal posterior inclusion probabilities of all the predictors are 1 or very close to 1, indicating that they are all important predictors. Re-running the analysis with method = "MCMC" yields the same conclusion.
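As an illustration of how BMA is used downstream, fitted values can be obtained either by averaging over all models or from the single best model; a minimal sketch using the predict method of the BAS package (estimator = "BMA" averages over models, "HPM" uses the highest probability model):
# Model-averaged fitted values versus the highest probability
# model, both on the log-price scale
pred_bma <- predict(model_lm_bma, estimator = "BMA")
pred_hpm <- predict(model_lm_bma, estimator = "HPM")
head(cbind(BMA = pred_bma$fit, HPM = pred_hpm$fit))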
# Multiple Regression Model #
fit <- lm(LogPrice ~ ., data = na.omit(ames_train_Reg))
fit_final <- step(fit, direction = "both", trace = FALSE)
summary(fit_final)
##
## Call:
## lm(formula = LogPrice ~ Lot.Area + Land.Slope + Year.Built +
## Year.Remod.Add + Bedroom.AbvGr, data = na.omit(ames_train_Reg))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0878 -0.1651 -0.0211 0.1657 0.9945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.371e+01 8.574e-01 -15.996 < 2e-16 ***
## Lot.Area 1.028e-05 1.106e-06 9.296 < 2e-16 ***
## Land.SlopeMod 1.384e-01 4.991e-02 2.773 0.00565 **
## Land.SlopeSev -4.567e-01 1.514e-01 -3.016 0.00263 **
## Year.Built 6.049e-03 3.788e-04 15.968 < 2e-16 ***
## Year.Remod.Add 6.778e-03 5.468e-04 12.395 < 2e-16 ***
## Bedroom.AbvGr 8.686e-02 1.077e-02 8.063 2.12e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.279 on 993 degrees of freedom
## Multiple R-squared: 0.5625, Adjusted R-squared: 0.5598
## F-statistic: 212.8 on 6 and 993 DF, p-value: < 2.2e-16
confint(fit_final)
## 2.5 % 97.5 %
## (Intercept) -1.539676e+01 -1.203189e+01
## Lot.Area 8.108705e-06 1.244810e-05
## Land.SlopeMod 4.048525e-02 2.363856e-01
## Land.SlopeSev -7.539047e-01 -1.595778e-01
## Year.Built 5.305779e-03 6.792573e-03
## Year.Remod.Add 5.705098e-03 7.851227e-03
## Bedroom.AbvGr 6.572328e-02 1.080024e-01
tidy(fit_final)
## term estimate std.error statistic p.value
## 1 (Intercept) -1.371433e+01 8.573560e-01 -15.996069 2.040298e-51
## 2 Lot.Area 1.027840e-05 1.105658e-06 9.296181 8.978626e-20
## 3 Land.SlopeMod 1.384354e-01 4.991459e-02 2.773446 5.650510e-03
## 4 Land.SlopeSev -4.567413e-01 1.514320e-01 -3.016148 2.625261e-03
## 5 Year.Built 6.049176e-03 3.788288e-04 15.968100 2.915437e-51
## 6 Year.Remod.Add 6.778162e-03 5.468247e-04 12.395495 6.534675e-33
## 7 Bedroom.AbvGr 8.686282e-02 1.077253e-02 8.063363 2.124195e-15
# Model Diagnostic Plots #
ggplot(data = fit_final, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals") +
  geom_smooth()
ggplot(data = fit_final, aes(.resid)) +
  geom_histogram(binwidth = 0.1) +
  labs(x = "Residuals", y = "Count")
Next we run the multiple regression of LogPrice against all the predictors using the lm function, with stepwise selection via step() as a frequentist cross-check.
The adjusted R-squared is ~56%, which indicates a fairly weak model; other variables not included here could well be important. As this question is limited to the selected variables, studying the relevance of other predictors is out of scope.
Diagnostic plots of fit_final indicate that the residuals are approximately normally distributed (histogram of residuals), and the fitted-versus-residuals plot shows a random scatter around zero, supporting the linearity and constant-variance assumptions.
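A normal Q-Q plot of the residuals (a quick supplementary check, not shown in the original output) supports the same reading:
# Normal Q-Q plot of the residuals from the final fit
qqnorm(resid(fit_final))
qqline(resid(fit_final))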
Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?
# House with highest "squared" residual #
fit_final_aug <- augment(fit_final)
fit_final_aug_sq <- fit_final_aug %>%
mutate(SqResid = .resid^2)
max_resid <- which.max(fit_final_aug_sq$SqResid)
max_resid
## [1] 428
ames_train[max_resid,]
## # A tibble: 1 x 81
## PID area price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
## <int> <int> <int> <int> <fct> <int> <int> <fct>
## 1 9.02e8 832 12789 30 RM 68 9656 Pave
## # ... with 73 more variables: Alley <fct>, Lot.Shape <fct>,
## # Land.Contour <fct>, Utilities <fct>, Lot.Config <fct>,
## # Land.Slope <fct>, Neighborhood <fct>, Condition.1 <fct>,
## # Condition.2 <fct>, Bldg.Type <fct>, House.Style <fct>,
## # Overall.Qual <int>, Overall.Cond <int>, Year.Built <int>,
## # Year.Remod.Add <int>, Roof.Style <fct>, Roof.Matl <fct>,
## # Exterior.1st <fct>, Exterior.2nd <fct>, Mas.Vnr.Type <fct>,
## # Mas.Vnr.Area <int>, Exter.Qual <fct>, Exter.Cond <fct>,
## # Foundation <fct>, Bsmt.Qual <fct>, Bsmt.Cond <fct>,
## # Bsmt.Exposure <fct>, BsmtFin.Type.1 <fct>, BsmtFin.SF.1 <int>,
## # BsmtFin.Type.2 <fct>, BsmtFin.SF.2 <int>, Bsmt.Unf.SF <int>,
## # Total.Bsmt.SF <int>, Heating <fct>, Heating.QC <fct>,
## # Central.Air <fct>, Electrical <fct>, X1st.Flr.SF <int>,
## # X2nd.Flr.SF <int>, Low.Qual.Fin.SF <int>, Bsmt.Full.Bath <int>,
## # Bsmt.Half.Bath <int>, Full.Bath <int>, Half.Bath <int>,
## # Bedroom.AbvGr <int>, Kitchen.AbvGr <int>, Kitchen.Qual <fct>,
## # TotRms.AbvGrd <int>, Functional <fct>, Fireplaces <int>,
## # Fireplace.Qu <fct>, Garage.Type <fct>, Garage.Yr.Blt <int>,
## # Garage.Finish <fct>, Garage.Cars <int>, Garage.Area <int>,
## # Garage.Qual <fct>, Garage.Cond <fct>, Paved.Drive <fct>,
## # Wood.Deck.SF <int>, Open.Porch.SF <int>, Enclosed.Porch <int>,
## # X3Ssn.Porch <int>, Screen.Porch <int>, Pool.Area <int>, Pool.QC <fct>,
## # Fence <fct>, Misc.Feature <fct>, Misc.Val <int>, Mo.Sold <int>,
## # Yr.Sold <int>, Sale.Type <fct>, Sale.Condition <fct>
The house in row 428 has the largest squared residual. Its overall quality and condition are poor, it was built in 1923, and it sold for only $12,789 (visible in the output above), an extremely low price. In the previous week's data analysis, overall quality and condition were strongly correlated with the price of a house. Moreover, Sale.Condition for this house is recorded as Abnormal, and hence its sale price could very well be an outlier.
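The fields behind this interpretation can be pulled directly; a small sketch, with column names as listed in the data set output above:
# Inspect price, quality, condition, construction year, and sale
# condition for the house with the largest squared residual
ames_train[max_resid, c("price", "Overall.Qual", "Overall.Cond",
                        "Year.Built", "Sale.Condition")]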
Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?
# Multiple Regression using Log(Lot.Area) instead of Lot.Area as the predictor #
ames_train_Q6 <- ames_train %>%
  mutate(LogPrice = log(price),
         LogLot.Area = log(Lot.Area))
ames_train_Reg <- ames_train_Q6 %>%
  dplyr::select(LogPrice, LogLot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr)
# Bayesian Model Averaging #
model_lm_bma <- bas.lm(LogPrice ~ ., data = na.omit(ames_train_Reg),
                       prior = "BIC", modelprior = uniform(), method = "BAS")
model_lm_bma
##
## Call:
## bas.lm(formula = LogPrice ~ ., data = na.omit(ames_train_Reg),
## prior = "BIC", modelprior = uniform(), method = "BAS")
##
##
## Marginal Posterior Inclusion Probabilities:
## Intercept LogLot.Area Land.SlopeMod Land.SlopeSev
## 1.00000 1.00000 0.02319 0.02319
## Year.Built Year.Remod.Add Bedroom.AbvGr
## 1.00000 1.00000 0.99999
summary(model_lm_bma)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.000000e+00 1.000000e+00
## LogLot.Area 1.00000000 1.0000 1.000000e+00 1.000000e+00
## Land.SlopeMod 0.02318949 0.0000 1.000000e+00 0.000000e+00
## Land.SlopeSev 0.02318949 0.0000 1.000000e+00 0.000000e+00
## Year.Built 1.00000000 1.0000 1.000000e+00 1.000000e+00
## Year.Remod.Add 1.00000000 1.0000 1.000000e+00 1.000000e+00
## Bedroom.AbvGr 0.99998607 1.0000 1.000000e+00 0.000000e+00
## BF NA 1.0000 2.374021e-02 1.412793e-05
## PostProbs NA 0.9768 2.320000e-02 0.000000e+00
## R2 NA 0.6031 6.056000e-01 5.913000e-01
## dim NA 5.0000 7.000000e+00 4.000000e+00
## logmarg NA -2142.5652 -2.146306e+03 -2.153733e+03
## model 4 model 5
## Intercept 1.000000e+00 1.000000e+00
## LogLot.Area 1.000000e+00 1.000000e+00
## Land.SlopeMod 1.000000e+00 0.000000e+00
## Land.SlopeSev 1.000000e+00 0.000000e+00
## Year.Built 1.000000e+00 1.000000e+00
## Year.Remod.Add 1.000000e+00 0.000000e+00
## Bedroom.AbvGr 0.000000e+00 1.000000e+00
## BF 1.340374e-07 2.253065e-33
## PostProbs 0.000000e+00 0.000000e+00
## R2 5.931000e-01 5.355000e-01
## dim 6.000000e+00 4.000000e+00
## logmarg -2.158390e+03 -2.217738e+03
image(model_lm_bma, rotate = F)
coef_model_lm_bma <- coefficients(model_lm_bma)
plot(coef_model_lm_bma, subset = c(1:5), ask = FALSE)
confint(coef_model_lm_bma)
## 2.5% 97.5% beta
## Intercept 12.001940519 12.035311799 12.018470556
## LogLot.Area 0.214518274 0.280258943 0.247011466
## Land.SlopeMod 0.000000000 0.000000000 0.002668553
## Land.SlopeSev 0.000000000 0.000000000 -0.001519870
## Year.Built 0.005253984 0.006686369 0.005963898
## Year.Remod.Add 0.005760024 0.007825381 0.006764414
## Bedroom.AbvGr 0.036512582 0.078377518 0.057299200
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"
plot(model_lm_bma)
# Multiple Regression #
fit_logLot.Area <- lm(LogPrice ~ . - Land.Slope, data = na.omit(ames_train_Reg))
summary(fit_logLot.Area)
##
## Call:
## lm(formula = LogPrice ~ . - Land.Slope, data = na.omit(ames_train_Reg))
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.14609 -0.15825 -0.01477 0.15354 1.01578
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.557e+01 8.213e-01 -18.964 < 2e-16 ***
## LogLot.Area 2.471e-01 1.654e-02 14.935 < 2e-16 ***
## Year.Built 5.964e-03 3.604e-04 16.547 < 2e-16 ***
## Year.Remod.Add 6.765e-03 5.197e-04 13.017 < 2e-16 ***
## Bedroom.AbvGr 5.726e-02 1.054e-02 5.434 6.94e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2655 on 995 degrees of freedom
## Multiple R-squared: 0.6031, Adjusted R-squared: 0.6015
## F-statistic: 377.9 on 4 and 995 DF, p-value: < 2.2e-16
fit_logLot.Area_aug <- augment(fit_logLot.Area)
From the model rank image and the log posterior odds it is clear that Land.Slope is no longer an important predictor: its posterior inclusion probability drops to about 0.02. Transforming Lot.Area to its natural log therefore trims the set of predictors further.
The adjusted R-squared with Lot.Area was ~56%; with log(Lot.Area), and without the Land.Slope variable, it rises to ~60%. The new model is thus more parsimonious than the previous one and also fits better.
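For a direct numeric comparison, the adjusted R-squared values of the two selected fits can be placed side by side (a quick sketch using the fit objects from above):
# Adjusted R-squared of the Lot.Area and log(Lot.Area) models
c(Lot.Area    = summary(fit_final)$adj.r.squared,
  logLot.Area = summary(fit_logLot.Area)$adj.r.squared)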
Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide a written support that includes a quantitative justification for your answer in the first part of question 7.
# Evaluate the log transform of the Lot.Area variable #
ggplot(data = ames_train, aes(Lot.Area)) + geom_histogram()
ggplot(data = ames_train, aes(log(Lot.Area))) + geom_histogram()
# Graphs of fitted values of models with and without log transformation of Lot.Area #
g1 <- ggplot(data = fit_final_aug, aes(x = LogPrice, y = .fitted)) + geom_point() +
  xlab("True log(price)") + ylab("Fitted log(price)") +
  ggtitle("Model with Lot.Area") +
  theme_classic() + stat_smooth(method = "lm")
g2 <- ggplot(data = fit_logLot.Area_aug, aes(x = LogPrice, y = .fitted)) + geom_point() +
  xlab("True log(price)") + ylab("Fitted log(price)") +
  ggtitle("Model with log(Lot.Area)") +
  theme_classic() + stat_smooth(method = "lm")
grid.arrange(g1, g2, ncol = 2)
The histogram of Lot.Area clearly shows that it is right skewed rather than normally distributed, while the histogram of its log transformation is approximately normal. It is therefore better to log transform Lot.Area.
The log transformation should be applied to Lot.Area: the residuals shrink, the constant-variance condition is better satisfied, and the adjusted R-squared increases to ~60% from ~56%.
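One way to quantify the improvement is to compare the residual spread of the two fits directly (a small sketch; both residual vectors are on the log-price scale):
# Residual standard deviation of the two models; the smaller
# value indicates tighter in-sample predictions
c(Lot.Area    = sd(resid(fit_final)),
  logLot.Area = sd(resid(fit_logLot.Area)))
Both comparisons point the same way: the model with log(Lot.Area) is the better behaved of the two.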