Introduction:

      This project report is to walk through all the steps that was used to build a linear model for prices of house sales in Washington DC, US using dataset dtrain and use this linear model to predict the prices of house of dataset dtest.

      In this report, I will introduce the steps which helped me derive my final linear model including general dataset description such as clarifications of some variables and preprocess dataset, model building, model selection, model checking and transformation.

1. Dataset Description

1.1 Clarifications on variables in the dataset(dtrain)

      The following table is a snapshot of how the dataset looks like.

# Load data
load("final.Rdata")
head(dtrain)
##   bathrm hf_bathrm          heat ac rooms bedrm  ayb yr_rmdl  eyb stories
## 1      2         1    Forced Air  Y     6     2 1870    2012 1980       2
## 2      3         1 Hot Water Rad  Y    12     7 1883      NA 1983       3
## 3      2         0 Hot Water Rad  Y    11     4 1920      NA 1960       2
## 4      3         0    Forced Air  Y     6     3 1910    1990 1960       2
## 5      1         1 Hot Water Rad  N     4     2 1920      NA 1964       2
## 6      3         1    Forced Air  Y     7     4 2014      NA 2015      NA
##              saledate   price  gba         style         grade      extwall
## 1 2013-02-20 00:00:00  874000 1443       2 Story Above Average Common Brick
## 2 2018-01-22 00:00:00 2246100 4725 2.5 Story Fin      Superior        Stone
## 3 2000-08-02 00:00:00  131600 2066       2 Story  Good Quality Common Brick
## 4 2006-07-06 00:00:00  671500 2258       2 Story  Good Quality       Stucco
## 5 2016-11-16 00:00:00  846300 1344       2 Story       Average  Wood Siding
## 6 2014-09-11 00:00:00  766900 2816       2 Story  Good Quality Common Brick
##   kitchens fireplaces landarea
## 1        1          2     1600
## 2        1          5     5127
## 3        1          1     1161
## 4        1          0     5532
## 5        1          1     1840
## 6        1          1     5565

      Most of the variables in dtrain can be self-explained by their name, however, it is necessary to explain some variables which is hard to tell the meaning by looking at their name.

      * ayb: the earliest time the main portion of the building was built
      * yr_rmdl: year structure was remodelled
      * eyb: the year an improvement was built more recent than actual year built
      * gba: gross building area in square feet

      After having a general understanding of all the variables in dtrain, we can start to preprocess the datas (ie. missing value and any possible transformation)

1.2 Preprocess data

1.2.1 Missing value

# Missing Value
dtrain$yr_rmdl[is.na(dtrain$yr_rmdl)] <- round(dtrain$ayb[is.na(dtrain$yr_rmdl)] +
                                                 (mean(dtrain$yr_rmdl, na.rm = TRUE)
                                                  - mean(dtrain$ayb)))

dtrain$stories[is.na(dtrain$stories)] <- median(dtrain$stories, na.rm =TRUE)
# Correlation between yr_rmdl and ayb
cor(dtrain$yr_rmdl, dtrain$ayb)
## [1] 0.8186744

      There are two variables in total that have missing value in the dataset, hence, we need to find a valid way to handle it.
      * yr_rmdl: The most common way to handle a missing value is to replace it with mean, median or mode. In this case, I replaced the missing data (NA) in yr_rmdl with the sum of the mean of difference between yr_rmdl and ayb and ayb and round yr_rmdl to the closest integer. From the output, we can see that yr_rmdl and ayb has a strong positive correlation, therefore, by simply replacing the missing value by its median could not be a good approach. Hence, I decided to use a linear model which contains ayb to predict yr_rmdl (ie. the sum of ayb and the mean difference between ayb and yr_rmdl). According to the definition of ayb, we know that ayb is always less than or equal to yr_rmdl, thus addition will work for replacing missing data in yr_rmdl.
      * stories: The way to handle missing data in stories is to replace them with median of known stories. Because of the definition of stories, replacing the missing data with the median of known stories can be a better approach instead of using the mean.

1.2.2 Any possible transformation

# Transformation of Categorical Variable and Saledate
dtrain$heat <- factor(dtrain$heat)
dtrain$ac <- factor(dtrain$ac)
saleMonth <- substr(dtrain$saledate,1,4)
dtrain$saledate <- as.integer(saleMonth)
dtrain$style <- factor(dtrain$style)
dtrain$grade <- factor(dtrain$grade)
dtrain$extwall <- factor(dtrain$extwall)

      * price: I made change to the response variate price to price to lambda (calculated in the transformation step)
      * saledate: Extract the year from original saledate using substring and convert string type to numeric type.
      * other variates: Convert heat, ac, style, grade and extwall to categorical variables.

2. Model Building

      A few potential linear models will be built in the following few subsections by different methods and will be compared and chose a best linear model in Section 3.

2.1 Use Correlation test

# Use Correlation test to add any numerical predictors which has a linear relationship with price into the model
cor(dtrain[c("price", "bathrm", "hf_bathrm", "rooms", "bedrm", "ayb", "yr_rmdl", "eyb", "stories", "saledate", "gba", "kitchens", "fireplaces", "landarea")])
##                  price     bathrm  hf_bathrm      rooms     bedrm         ayb
## price       1.00000000 0.53201366 0.18517412 0.36836444 0.4669599 -0.06812407
## bathrm      0.53201366 1.00000000 0.04891745 0.45569726 0.6177090  0.17297846
## hf_bathrm   0.18517412 0.04891745 1.00000000 0.19268416 0.1511272  0.15848815
## rooms       0.36836444 0.45569726 0.19268416 1.00000000 0.6446478  0.09301388
## bedrm       0.46695990 0.61770896 0.15112721 0.64464781 1.0000000  0.10240592
## ayb        -0.06812407 0.17297846 0.15848815 0.09301388 0.1024059  1.00000000
## yr_rmdl     0.10899565 0.34957633 0.18748963 0.17009515 0.2355928  0.81867441
## eyb         0.23270733 0.42186646 0.21740675 0.25863139 0.3209646  0.79086871
## stories     0.24321614 0.17053073 0.24296814 0.27536611 0.2357411 -0.03424906
## saledate    0.67242944 0.32569405 0.04805270 0.10022692 0.2353410  0.02852529
## gba         0.49482898 0.51322358 0.30150128 0.57172393 0.5824250  0.11379789
## kitchens    0.14529565 0.11993694 0.05392731 0.10913602 0.1628645 -0.06555062
## fireplaces  0.20414081 0.05531792 0.13781265 0.09370245 0.1026266 -0.02815414
## landarea    0.14668340 0.11824493 0.07949904 0.22155801 0.1924927 -0.05923628
##                yr_rmdl         eyb     stories    saledate        gba
## price       0.10899565  0.23270733  0.24321614  0.67242944 0.49482898
## bathrm      0.34957633  0.42186646  0.17053073  0.32569405 0.51322358
## hf_bathrm   0.18748963  0.21740675  0.24296814  0.04805270 0.30150128
## rooms       0.17009515  0.25863139  0.27536611  0.10022692 0.57172393
## bedrm       0.23559275  0.32096463  0.23574114  0.23534096 0.58242499
## ayb         0.81867441  0.79086871 -0.03424906  0.02852529 0.11379789
## yr_rmdl     1.00000000  0.82589709  0.03895837  0.18490531 0.19546114
## eyb         0.82589709  1.00000000  0.11026960  0.21595230 0.32910171
## stories     0.03895837  0.11026960  1.00000000  0.07669758 0.42486375
## saledate    0.18490531  0.21595230  0.07669758  1.00000000 0.10589945
## gba         0.19546114  0.32910171  0.42486375  0.10589945 1.00000000
## kitchens   -0.01164842  0.02404333  0.03831707  0.07221087 0.08492081
## fireplaces -0.11563713 -0.09022458  0.10627605 -0.06308372 0.20893033
## landarea   -0.09103139 -0.02225866 -0.01487142 -0.04707365 0.33626049
##               kitchens  fireplaces    landarea
## price       0.14529565  0.20414081  0.14668340
## bathrm      0.11993694  0.05531792  0.11824493
## hf_bathrm   0.05392731  0.13781265  0.07949904
## rooms       0.10913602  0.09370245  0.22155801
## bedrm       0.16286447  0.10262664  0.19249274
## ayb        -0.06555062 -0.02815414 -0.05923628
## yr_rmdl    -0.01164842 -0.11563713 -0.09103139
## eyb         0.02404333 -0.09022458 -0.02225866
## stories     0.03831707  0.10627605 -0.01487142
## saledate    0.07221087 -0.06308372 -0.04707365
## gba         0.08492081  0.20893033  0.33626049
## kitchens    1.00000000  0.03355654  0.02134784
## fireplaces  0.03355654  1.00000000  0.10849148
## landarea    0.02134784  0.10849148  1.00000000


    From the above correlation test table, it is clearly that saledate has the highest correlation with price which means among all the variables, saledate and price has the strongest linear relationship. Since for most of the variables, they do not have a relatively high correlation with price, so we can set \(|correlation| \geq 3.5\) as a cut-pff point.
    Variables that satisfy this conditions are: bathrm, rooms, bedrm, saledate, gba
      After we decided numberical predictors that can be put into the linear model, then we need to consider categorical variables: heat, ac, style, grade, extwall and we will use F-test to analyze these variables.
    Before conduct a F-test to test if each categorical variables will have a significantly different on the current linear model, we should analyze these five categorical variables separately. For heat and style, I do not believe that various heating type and style will affect this linear model and air conditioning, grade and extwall may have some effect on current linear model since air conditioning, grade and extwall will directly affect the cost when building the house and hence it will affect the house prices.
    After we did some predictions about five categorcial variables affecting current linear model, a F-test can be conducted to analyze if a categorical variable can be added into linear model.

Conducting F-Test

    A F-test is conducted to analyze if a new linear model with three additional categorical variables is significantly better than the current model.

    We have the null hypothesis \(H_0\): Two models do not significantly differ and alternative hypothesis \(H_a\): Model with three additional categorical variables will be significantly better.

# Add three categorical variables that we believe will affect current model
fullModel <- lm(price~bathrm + rooms + bedrm + saledate + gba + ac + grade + extwall, data = dtrain)
# Current linear model that we derived from above
restrictedModel <- lm(price~bathrm + rooms + bedrm + saledate + gba, data = dtrain)
# Null Hypothesis will be H0: Two models do not significantly differ
# Alternative Hypothesis will be Ha: Model with three additional categorical variables will be significantly better
anova(restrictedModel, fullModel)
## Analysis of Variance Table
## 
## Model 1: price ~ bathrm + rooms + bedrm + saledate + gba
## Model 2: price ~ bathrm + rooms + bedrm + saledate + gba + ac + grade + 
##     extwall
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   1297 1.9864e+13                                   
## 2   1270 1.5204e+13 27 4.6598e+12 14.416 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    From the output, we have that p-value is < 2.2e-16 which is less than significance level (0.05), therefore, we can conclude that we reject \(H_0\). Hence, the new linear model with three additional categorical variables will be significantly better.

    We have tested three categorical variables and we can now test a new linear model with heat.
    We have the null hypothesis \(H_0\): Two models do not significantly differ and alternative hypothesis \(H_a\): Model with heat will be significantly better.

# Add three categorical variables that we believe will affect current model
fullModel <- lm(price~bathrm + rooms + bedrm + saledate + gba + ac + grade + extwall + heat, data = dtrain)
# Current linear model that we derived from above
restrictedModel <- lm(price~bathrm + rooms + bedrm + saledate + gba + ac + grade + extwall, data = dtrain)
# Null Hypothesis will be H0: Two models do not significantly differ
# Alternative Hypothesis will be Ha: Model with heat variables will be significantly better
anova(restrictedModel, fullModel)
## Analysis of Variance Table
## 
## Model 1: price ~ bathrm + rooms + bedrm + saledate + gba + ac + grade + 
##     extwall
## Model 2: price ~ bathrm + rooms + bedrm + saledate + gba + ac + grade + 
##     extwall + heat
##   Res.Df        RSS Df Sum of Sq      F Pr(>F)
## 1   1270 1.5204e+13                           
## 2   1262 1.5071e+13  8 1.334e+11 1.3964 0.1934

    From the output, we have that p-value is 0.1934 which is greater than significance level (0.05), therefore, we can conclude that we do not reject \(H_0\). Hence, the two models do not have significantly different.

    Lastly, we can test the model with style. We have the null hypothesis \(H_0\): Two models do not significantly differ and alternative hypothesis \(H_a\): Model with style will be significantly better.

# Add three categorical variables that we believe will affect current model
fullModel <- lm(price~bathrm + rooms + bedrm + saledate + gba + ac + grade + extwall + style, data = dtrain)
# Current linear model that we derived from above
restrictedModel <- lm(price~bathrm + rooms + bedrm + saledate + gba + ac + grade + extwall, data = dtrain)
# Null Hypothesis will be H0: Two models do not significantly differ
# Alternative Hypothesis will be Ha: Model with style variables will be significantly better
anova(restrictedModel, fullModel)
## Analysis of Variance Table
## 
## Model 1: price ~ bathrm + rooms + bedrm + saledate + gba + ac + grade + 
##     extwall
## Model 2: price ~ bathrm + rooms + bedrm + saledate + gba + ac + grade + 
##     extwall + style
##   Res.Df        RSS Df Sum of Sq      F    Pr(>F)    
## 1   1270 1.5204e+13                                  
## 2   1259 1.4765e+13 11 4.388e+11 3.4014 0.0001152 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    From the output, we have that p-value is 0.0001152 which is less than significance level (0.05), therefore, we can conclude that we reject \(H_0\). Hence, model with style will be significantly better.

    By conducting F-test, we conclude that one of our potential model that can be moved on to do model selection is \(price \sim bathrm + rooms + bedrm + saledate + gba + ac + grade + extwall + style\).

2.2 Automated Stepwise Regression

    Stepwise Regression is a combination of backward and forward method with p-value.

fit$call
## lm(formula = price ~ saledate + gba + grade + ayb + bathrm + 
##     fireplaces + extwall + eyb + hf_bathrm + rooms + landarea + 
##     kitchens, data = dtrain)

    One of our potential model \(price \sim saledate + gba + grade + ayb + bathrm + fireplaces + extwall + eyb + hf_bathrm + rooms + landarea + kitchens\).

2.3 Automated Forward Selection

    Forward Selection is a method of building up by adding one variable at time with p-value.

fit$call
## lm(formula = price ~ saledate + gba + grade + ayb + bathrm + 
##     fireplaces + extwall + eyb + hf_bathrm + rooms + landarea + 
##     kitchens, data = dtrain)

    One of our potential model \(price \sim saledate + gba + grade + ayb + bathrm + fireplaces + extwall + eyb + hf_bathrm + rooms + landarea + kitchens\) which we found that it is the same linear model that derived from stepwise regression.

Automated Backward

    Backward elimination is a method of building down by subtracting one variable at time with p-value.

fit$call
## lm(formula = price ~ bathrm + hf_bathrm + rooms + ayb + eyb + 
##     stories + saledate + gba + style + grade + extwall + kitchens + 
##     fireplaces + landarea, data = dtrain)

    One of our potential model \(price \sim bathrm + hf_bathrm + rooms + ayb + eyb + stories + saledate + gba + style + grade + extwall + kitchens + fireplaces + landarea\).



    After we have derived multiple potential linear model from various method, we can now select the best model from them.

3. Model Selection

    Recall that we have derived three potential models from previous section:
      * \(price \sim bathrm + rooms + bedrm + saledate + gba + ac + grade + extwall + style\)
      * \(price \sim saledate + gba + grade + ayb + bathrm + fireplaces + extwall + eyb + hf_bathrm + rooms + landarea + kitchens\)
      * \(price \sim bathrm + hf_bathrm + rooms + ayb + eyb + stories + saledate + gba + style + grade + extwall + kitchens + fireplaces + landarea\)
We will do model selection on three linear models by comparing adjusted R square, AIC, BIC and R square if needed.

# Assign to each model
m1 <- lm(price ~ bathrm + rooms + bedrm + saledate + gba + ac + grade + extwall + style, data=dtrain)
m2 <- lm(price ~ saledate + gba + grade + ayb + bathrm + fireplaces + extwall + eyb + hf_bathrm + rooms + landarea + kitchens, data = dtrain)
m3 <- lm(price ~ bathrm + hf_bathrm + rooms + ayb + eyb + stories + saledate + gba + style + grade + extwall + kitchens + fireplaces + landarea, data = dtrain)

    After we assign each linear model, we then can compare adjusted R square, AIC, BIC and R square and our measure is higher R square and adjusted R square and smaller AIC and BIC. If a potential linear model has a higher R square but a lower adjusted R square, we will not consider it as the best model (since the more complex the model is, the larger the values of \(R^{2}\) is).

# R square
summary(m1)$r.squared
## [1] 0.7425109
summary(m2)$r.squared
## [1] 0.7677312
summary(m3)$r.squared
## [1] 0.7721672
# Adjusted R square
summary(m1)$adj.r.squared
## [1] 0.7337166
summary(m2)$adj.r.squared
## [1] 0.7611264
summary(m3)$adj.r.squared
## [1] 0.7634463
# AIC
AIC(m1)
## [1] 33953.34
AIC(m2)
## [1] 33805.02
AIC(m3)
## [1] 33803.89
# BIC
BIC(m1)
## [1] 34186.1
BIC(m2)
## [1] 34001.57
BIC(m3)
## [1] 34062.52

    From the output, we can conclude that model 3 which is the model that derived from backward elimination is the best since it has the largest R square and adjusted R square and smallest AIC and second smallest BIC.
    After we have found the final model that we could perform model checking on, we need to consider possible interaction that we could add to this model to make it better.
    There are two ways to test if the interaction is better to be added to the model.
    1. Perform F-test like we did in Section 2
    2. Compare R square, adjusted R square, AIC and BIC like we did in above
    For here, I would use F-test and I will show an example about how to use F-test to test if the interaction is good to be added and then repeat the steps until you have tried every interaction.     We have the null hypothesis \(H_0\): Two models do not significantly differ and alternative hypothesis \(H_a\): Model with this interaction will be significantly better.

# Add one interaction to the current model
fullModel <- lm(price ~ bathrm + hf_bathrm + rooms + ayb + eyb + stories + saledate + gba + style + grade + extwall + kitchens + fireplaces + landarea + bathrm:hf_bathrm, data = dtrain)
# Current linear model that we derived from above
restrictedModel <- lm(price~bathrm + hf_bathrm + rooms + ayb + eyb + stories + saledate + gba + style + grade + extwall + kitchens + fireplaces + landarea, data = dtrain)
# Null Hypothesis will be H0: Two models do not significantly differ
# Alternative Hypothesis will be Ha: Model with this interaction variables will be significantly better
anova(restrictedModel, fullModel)
## Analysis of Variance Table
## 
## Model 1: price ~ bathrm + hf_bathrm + rooms + ayb + eyb + stories + saledate + 
##     gba + style + grade + extwall + kitchens + fireplaces + landarea
## Model 2: price ~ bathrm + hf_bathrm + rooms + ayb + eyb + stories + saledate + 
##     gba + style + grade + extwall + kitchens + fireplaces + landarea + 
##     bathrm:hf_bathrm
##   Res.Df        RSS Df  Sum of Sq      F Pr(>F)
## 1   1254 1.3065e+13                            
## 2   1253 1.3039e+13  1 2.5203e+10 2.4219 0.1199

    From the output, we have that p-value is 0.1199 which is larger than significance level (0.05), therefore, we can conclude that we do not reject \(H_0\). Hence, we will keep the current model and keep testing interaction term on it.
    Our final model with interaction terms would be \(price\sim bathrm + hfbathrm + rooms + ayb + eyb + stories + saledate + gba + style + grade + extwall + kitchens + fireplaces + landarea + bathrm:saledate + bathrm:fireplaces + hfbathrm:saledate + rooms:ayb + rooms:eyb + ayb:eyb + eyb:gba + stories:saledate + saledate:gba + saledate:grade\). We now have the last step to check if this model is a valid model.

4. Model Checking

4.1 Check model Assumption

fit <- lm(formula = price ~ bathrm + hf_bathrm + rooms + ayb + eyb + stories + saledate + gba + style + grade + extwall + kitchens + fireplaces + landarea + bathrm:saledate + bathrm:fireplaces + hf_bathrm:saledate + rooms:ayb + rooms:eyb + ayb:eyb + eyb:gba + stories:saledate + saledate:gba + saledate:grade, data = dtrain)

# Residual vs fitted
plot(fitted(fit), rstudent(fit))
abline(h = c(-2, 0, 2), lty = 2)

# QQ Plot
qqnorm(rstudent(fit))
qqline(rstudent(fit))

# Cook's Distance
plot(fit, which=5)

    From Residual vs fitted plot, we can see that constant variance assumption is violated, thus we need to perform a transformation on the current model.

4.2 Transformation

library(MASS)
Box <- boxcox(fit, lambda = seq(-1, 1, 1/20))

# Get accurate value for lambda (The following code is from Rcompanion.org)
Cox = data.frame(Box$x, Box$y)     
Cox2 = Cox[with(Cox, order(-Cox$Box.y)),] 

Cox2[1,]
##        Box.x     Box.y
## 74 0.4747475 -2899.731
lambda = Cox2[1, "Box.x"] 
lambda
## [1] 0.4747475
# Transform model
fit2 <- lm(formula = price ^ 0.4747475 ~ bathrm + hf_bathrm + rooms + ayb + eyb + stories + saledate + gba + style + grade + extwall + kitchens + fireplaces + landarea + bathrm:saledate + bathrm:fireplaces + hf_bathrm:saledate + rooms:ayb + rooms:eyb + ayb:eyb + eyb:gba + stories:saledate + saledate:gba + saledate:grade, data = dtrain)

    After we get the model from transformation, we need to check the model assumption again to ensure that it is a valid linear regression model. If this model still does not satisfy the assumption ofr linear regression model, we have to repeat the same step as shown in transformation until we get a linear model that satisfy all the assumptions (constant variance, mean of zero, independence and normality, we don’t need to check for independence in this case).

# Residual vs fitted
plot(fitted(fit2), rstudent(fit2), xlab = "fittedValue", ylab = "Residuals")
abline(h = c(-3, -2, 0, 2, 3), lty = 2)

# QQ Plot
qqnorm(rstudent(fit2))
qqline(rstudent(fit2))

# Cook's Distance
plot(fit2, which=5)

    From Residual vs fitted plot, we can see that assumption of mean equals to zero is satisified since the residuals lie within a horizontal band of zero and there is no specific pattern of those data on the plot. In addition, most of the data falls into [-2, 2] which is approximately 95% confidence interval. From the plot, we can see that data on the plot has an approximate constant variance which satisified the constant variance assumtpion.

    From QQ plot, we can see that data on the plot follows a straight line and it satisfied the assumption of normality.

    Hence, we can conclude that this is a valid linear regression model and \(price^{0.4747475} \sim bathrm + hfbathrm + rooms + ayb + eyb + stories + saledate + gba + style + grade + extwall + kitchens + fireplaces + landarea + bathrm:saledate + bathrm:fireplaces + hfbathrm:saledate + rooms:ayb + rooms:eyb + ayb:eyb + eyb:gba + stories:saledate + saledate:gba + saledate:grade\) can be used as a final model.

    After we perform the model checking for all the assumptions, we could aslo check for data to identify any outliers and influential points and can be further discussed if these points need to be deleted.

4.3 Data Checking

4.3.1 Check for Outliers

# Cook's Distance
plot(fit2, which=5)
abline(h = c(-2.5, 2.5), lty = 2)

# Residual vs fitted
plot(fitted(fit2), rstudent(fit2))
abline(h = c(-3, 3), lty = 2)
text(fitted(fit2), rstudent(fit2))

We can visually identify the ouliers. From Residual vs fitted plot, observation 127, 429 and 1274 are outliers. We deeply digged into these observations to see if there’s any error.

# Outliers found in residuals vs fitted plot
dtrain[127,]
##     bathrm hf_bathrm       heat ac rooms bedrm  ayb yr_rmdl  eyb stories
## 127      3         1 Forced Air  Y     8     4 2007    2076 2010       2
##     saledate  price  gba   style         grade     extwall kitchens fireplaces
## 127     2007 619600 2016 2 Story Above Average Wood Siding        1          0
##     landarea
## 127     3450
dtrain[429,]
##     bathrm hf_bathrm      heat ac rooms bedrm  ayb yr_rmdl  eyb stories
## 429      1         1 Warm Cool  Y     6     3 1923    1992 1957       2
##     saledate price  gba   style         grade      extwall kitchens fireplaces
## 429     2000 32100 1266 2 Story Above Average Vinyl Siding        1          1
##     landarea
## 429     3450
dtrain[1274,]
##      bathrm hf_bathrm       heat ac rooms bedrm  ayb yr_rmdl  eyb stories
## 1274      3         0 Forced Air  Y     6     3 2009    2078 2012       2
##      saledate price  gba   style         grade      extwall kitchens fireplaces
## 1274     2007 31300 1749 2 Story Above Average Brick/Siding        1          0
##      landarea
## 1274     3500
# Outliers found in Cook's Distance plot
dtrain[151,]
##     bathrm hf_bathrm       heat ac rooms bedrm  ayb yr_rmdl  eyb stories
## 151      1         0 Forced Air  Y     5     2 1925    1994 1943     1.5
##     saledate  price gba           style   grade      extwall kitchens
## 151     2004 242800 750 1.5 Story Unfin Average Metal Siding        1
##     fireplaces landarea
## 151          0     5164
dtrain[820,]
##     bathrm hf_bathrm       heat ac rooms bedrm  ayb yr_rmdl  eyb stories
## 820      4         1 Forced Air  Y     8     4 2007    2076 2011       3
##     saledate  price  gba   style        grade extwall kitchens fireplaces
## 820     2018 736900 2460 3 Story Good Quality   Stone        1          1
##     landarea
## 820     2611
dtrain[565,]
##     bathrm hf_bathrm      heat ac rooms bedrm  ayb yr_rmdl  eyb stories
## 565      4         1 Warm Cool  Y     8     4 1936    2006 1967       2
##     saledate  price  gba   style         grade     extwall kitchens fireplaces
## 565     1998 194400 2635 2 Story Above Average Brick/Stone        1          2
##     landarea
## 565     7338

From observation 127, we can see that ayb is larger than 1986 which violates the definition of ayb and the price is relatively high compred to other prices. For observation 429, the price seems relatively low compared to other prices. Similar to observation 1274, the price seems relatively low compared to other prices.

From Cook’s Distance Plot, we can find some outliers in x direction (ie. 820, 151, 565). For observation 151, it seems to be that yr_rmdl is relatively higher than other observations but it is reasonable. For observation 820, it seems to be that eyb is relatively higher than other observations but it is reasonable for this case. Similiar to observation 151, observation 565 has a relatively high yr_rmdl which means this house is remodelled in recent years and it is also reasonable.

In this model, it is not necessary to remove any of the outliers because as explained above, all the potential outliers seems to be reasonable and can be explained in this case.

4.3.2 Influential Points

plot(fit2, which=4)
abline(h = 1, lty = 2)

From Cook’s distance plot, we know that obervation 151 is an influential point. As we explained above, observation 151 is a reasonable data which do not need to be deleted. Therefore, similiar to outliers, no influential points need to be deleted.

Conclusion

This model is not the same model that I derived from the prediction part because of the selection between potential linear models that derived from model building process. In this report, I chose the model derived from backward elimination (since this model has better measure than the stepwise regression in this report) and in the prediction part, I chose the model from stepwise regression instead. It is possible that I made a mistake when comparing adjusted R square, AIC and BIC when I am doing the prediction. However, since all the measures is very closed between the two models, it can be concluded that there will be some possible minor difference and inaccuracy for the stepwise regression model.

The final model that we derived from all the steps that have shown above is
\(y = (\beta_0 + \beta_1*bathrm + \beta_2*hfbathrm + \beta_3*rooms + \beta_4*ayb + \beta_5*eyb + \beta_6*stories + \beta_7*saledate + \beta_8*gba + \beta_9*style + \beta_{10}*grade + \beta_{11}*extwall + \beta_{12}*kitchens +\beta_{13}*fireplaces + \beta_{14}*landarea + \beta_{15}*bathrm:saledate + \beta_{16}*bathrm:fireplaces + \beta_{17}*hfbathrm:saledate + \beta_{18}*rooms:ayb + \beta_{19}*rooms:eyb + \beta_{20}*ayb:eyb + \beta_{21}*eyb:gba + \beta_{22}*stories:saledate + \beta_{23}*saledate:gba + \beta_{24}*saledate:grade + \epsilon_i) ^{1/0.4747475}\). Despite all the predictors in the models, there will be parameters \(\beta_i\) invloved in the model as well. Each parameter will have different interpretations.  * \(\beta_0\) represents the average house price in DC without any affect of the predictors in the model, but this average house price can be affected by some error terms which do not defined in this model such as the location of the house. * \(\beta_i\) where \(1\leq i \leq14\) represents the average amount of increase (or decrease) when the \(i^{th}\) predictor increases (or decreases) by 1 unit when holding other predictors constant. * \(\beta_i\) where \(15\leq i \leq24\) represents the average amount of increase (or decrease) when the \(i^{th}\) predictor increases (or decreases) by 1 and it depends on the predictor which has interaction with unit when holding other predictors constant. For example, take bathrm:saledate as an example, the amount of house price that bathrm will affect depends on the saledate. Recall that we find all the correlation between numerical predictors and the reponse variable (price) and among all the numerical predictors, saledate has the strongest linear relationship with price which means that the saledate will affect significantlt to the price of house sale, hence people who want to sale a good price on their house could analyze the saledate such as making a histogram of this predictor and to see which year has the highest house price. In this case, we are investigating the year of saledate, however, if the seller wants to do further investigation, they could extract the month if possible and analyze the data on the month. In addition, bathrm has the second largest correlation with price, and hence, the seller would know that the more bathroom that a house has, the higher price tend to be. By looking at the correlation test table, both the seller and the buyer would have a sense about what type of house will have a higher price.