How do living-measure predictors affect the median housing price in Boston?

Section 0 - Synopsis

I use Kaggle-sourced data to select and interpret the significant factors, out of 13 original predictors, that are associated with median house prices in Boston.

The results show that:

The average number of rooms per dwelling, accessibility to radial highways, the B index (a function of the proportion of Black residents by town), and whether the tract bounds the Charles River positively predict the log of the median value of owner-occupied homes in Boston.
The per capita crime rate by town, nitric oxides concentration, full-value property-tax rate per $10,000, the log of the weighted distances to five Boston employment centres, pupil-teacher ratio by town, the square root of the percentage of lower-status population, and the interaction between crime rate and the B index negatively predict the log of the median value of owner-occupied homes in Boston.
No significant relationship was found between the median value of owner-occupied homes in Boston and the proportion of residential land zoned for lots over 25,000 sq. ft., the proportion of non-retail business acres per town, or the proportion of owner-occupied units built prior to 1940.

Section 1 - Introduction and Summary of Results

Background

Housing prices in Boston have remained high, with a median listing price of $2M in downtown Boston. The city is home to many prestigious universities, which attract many people to settle there.

This study investigates the median house value in Boston, which can help prospective buyers and real estate agents make investment decisions.

Objective and Data Description

The aim of the study is to understand the factors associated with house prices in Boston. Cross-sectional data were sourced from Kaggle, a public data platform. I examined 506 observations to build and analyse several statistical models using multiple linear regression to achieve the research objective.

The Kaggle dataset provides thirteen predictors (ZN, INDUS, CHAS, etc.) and one response variable, MEDV; detailed descriptions are given below. Note that CHAS and RAD are categorical variables, while the other variables are continuous. The Charles River has historically suffered from industrial contamination, and excess nutrients cause algal blooms that can be toxic to animals and people during the summer months. The NOX variable is therefore included as a measure of pollution.

The cross-sectional data include the following variables:

CRIM: Per capita crime rate by town (Continuous)
ZN: Proportion of residential land zoned for lots over 25,000 sq. ft. (Continuous)
INDUS: Proportion of non-retail business acres per town (Continuous)
CHAS: Charles River dummy variable, already coded in the dataset: 1 if the tract bounds the river, 0 otherwise (Categorical, 2 levels)
NOX: Nitric oxides concentration, parts per 10 million (Continuous)
RM: Average number of rooms per dwelling (Continuous)
AGE: Proportion of owner-occupied units built prior to 1940 (Continuous)
DIS: Weighted distances to five Boston employment centres (Continuous)
RAD: Index of accessibility to radial highways (Categorical, 9 levels: 1-8 and 24)
TAX: Full-value property-tax rate per $10,000 (Continuous)
PTRATIO: Pupil-teacher ratio by town (Continuous)
B: The result of the equation B = 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town (Continuous)
LSTAT: Percentage of lower-status population (Continuous)
MEDV: Median value of owner-occupied homes in $1000s (Continuous)

Load packages

suppressMessages(library('ggplot2'))
suppressMessages(library('car'))
suppressMessages(library("visdat"))
suppressMessages(library('leaps'))
suppressMessages(library('ISLR'))
suppressMessages(library("lmtest"))
suppressMessages(library("sandwich"))
suppressMessages(library("MASS"))
suppressMessages(library('glmnet'))
suppressMessages(library('magrittr'))
suppressMessages(library('dplyr'))
suppressMessages(library('rpart'))
suppressMessages(library('huxtable'))
suppressMessages(library('lme4'))
suppressMessages(library('patchwork'))

Proposed model

The final model that I propose for explaining the relationship between MEDV and the other predictors is: \[ \log(MEDV) = 4.597 - 5.297\times10^{-3}\,CRIM + 0.0704\,CHAS - 0.7876\,NOX + 0.07756\,RM + 0.02597\,new\_RAD_{moderate} + 0.09352\,new\_RAD_{remote} + 0.3468\,new\_RAD_{very\ remote} - 0.2287\,\log(DIS) - 5.79\times10^{-4}\,TAX - 0.03956\,PTRATIO + 8.1\times10^{-4}\,B - 0.2123\,\sqrt{LSTAT} - 4.083\times10^{-5}\,CRIM\times B. \] This model has an adjusted R-squared of 0.8224, which is relatively high. This means that the proposed model can explain 82.24% of the variation in log(MEDV).

Analysis & Discussion

To complete statistical analysis and reach the final model, which helps to answer the study objective, I undertook the following steps:

1. Exploratory Data Analysis (EDA): EDA was carried out to investigate the dataset, using scatterplots of the variables and descriptive statistics to guide the subsequent analysis.
2. Variable Selection: Several variable selection methods were employed to determine the important predictor variables, including forward and backward selection, stepwise selection, best subsets and lasso.
3. Analysis of Five Candidate Models: Having independently created and assessed five candidate models, I chose the final model that would best satisfy the objective of this study based on my judgement.
4. Final Model Diagnostics & Derivation: The residual plot and Normal QQ plot were used to check whether the underlying regression assumptions were met. VIF was used to investigate multicollinearity. Additionally, the added variable plot and Cook’s distance plot were used to identify influential points and determine whether removing any point would improve estimation accuracy. Further analysis using transformations and interactions was performed to arrive at the final model.
5. Final Model Interpretation: The final model’s parameters were interpreted to verify that they make practical sense and can be used to draw conclusions from the model.
6. Limitations & Recommendations: Limitations of the final model were assessed, and areas for further improvement were highlighted.

Section 2 - EXPLORATORY DATA ANALYSIS

Summary

After using the vis_miss function to confirm that there are no missing data, exploratory data analysis was conducted to identify potential problems in the initial dataset.

The first figure below shows that the response MEDV is positively skewed. In the subsequent analysis I therefore use the log transformation of MEDV instead of the initial data (see the second figure for comparison). After this transformation, all observations lie within the fences drawn on the second figure. I therefore keep all values in the preliminary model; the influential point analysis in Section 5 investigates whether outliers influence the regression coefficients and predictive accuracy.

Importing raw data

boston_data <- read.csv("boston.xls", header = T)

head(boston_data)

CRIM	ZN	INDUS	NOX	RM	AGE	DIS	RAD	TAX	PTRATIO	B	LSTAT	MEDV
0.00632	18	2.31	0.538	6.58	65.2	4.09	1	296	15.3	397	4.98	24
0.0273	0	7.07	0.469	6.42	78.9	4.97	2	242	17.8	397	9.14	21.6
0.0273	0	7.07	0.469	7.18	61.1	4.97	2	242	17.8	393	4.03	34.7
0.0324	0	2.18	0.458	7	45.8	6.06	3	222	18.7	395	2.94	33.4
0.069	0	2.18	0.458	7.15	54.2	6.06	3	222	18.7	397	5.33	36.2
0.0299	0	2.18	0.458	6.43	58.7	6.06	3	222	18.7	394	5.21	28.7

View(boston_data) #getting an overview of the dataset
names(boston_data) #checking the variable names

##  [1] "CRIM"    "ZN"      "INDUS"   "CHAS"    "NOX"     "RM"      "AGE"    
##  [8] "DIS"     "RAD"     "TAX"     "PTRATIO" "B"       "LSTAT"   "MEDV"

str(boston_data)

## 'data.frame':    506 obs. of  14 variables:
##  $ CRIM   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ ZN     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ INDUS  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ CHAS   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ NOX    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ RM     : num  6.58 6.42 7.18 7 7.15 ...
##  $ AGE    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ DIS    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ RAD    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ TAX    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ PTRATIO: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ B      : num  397 397 393 395 397 ...
##  $ LSTAT  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ MEDV   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

Checking for missing values

any(is.na(boston_data))

## [1] FALSE

#no missing values

Study of Median housing price’s skewness

summary(boston_data$MEDV)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    5.00   17.02   21.20   22.53   25.00   50.00

#median < mean, which indicates a right-skewed MEDV with some extreme positive values
#The data are mainly right-skewed, with some extreme values around 50.
#Thus I will consider logging the response variable.
par(mfrow=c(1,2))

quantile(boston_data$MEDV, probs=c(0.75))+IQR(boston_data$MEDV)*1.5

##     75% 
## 36.9625

quantile(boston_data$MEDV, probs=c(0.25))-IQR(boston_data$MEDV)*1.5

##    25% 
## 5.0625

hist(boston_data$MEDV,breaks=100,col='blue',xlab='MEDV',main='MEDV histogram') 
abline(v = 36.9625, col="red", lwd=3, lty=2)
abline(v = 5.0625, col="green", lwd=3, lty=2)

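#log-transformation of MEDV
hist(log(boston_data$MEDV),breaks=100,col='blue',xlab='log(MEDV)',main='log-MEDV histogram') 
#NOTE: the fences below use the IQR of the untransformed MEDV, so they are wider than fences based on IQR(log(MEDV))
quantile(log(boston_data$MEDV),probs=c(0.75))+IQR(boston_data$MEDV)*1.5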

##      75% 
## 15.18138

abline(v = 15.18138, col="red", lwd=3, lty=2)
quantile(log(boston_data$MEDV),probs=c(0.25))-IQR(boston_data$MEDV)*1.5

##      25% 
## -9.12782

abline(v = -9.12782, col="green", lwd=3, lty=2)
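The fences drawn above add and subtract 1.5 times the IQR of the untransformed MEDV. A minimal sketch of computing Tukey fences directly on the log scale is given below. The helper name `tukey_fences` is hypothetical, and the copy of the data shipped as `MASS::Boston` (lowercase column names) is assumed to match the Kaggle file.

```r
library(MASS)  # provides the Boston data frame (assumed equivalent to boston_data)

# Hypothetical helper: Tukey fences Q1 - k*IQR and Q3 + k*IQR for a numeric vector
tukey_fences <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - k * iqr, upper = q[2] + k * iqr)
}

log_medv <- log(Boston$medv)
fences <- tukey_fences(log_medv)
fences

# Number of observations falling outside the log-scale fences
sum(log_medv < fences["lower"] | log_medv > fences["upper"])
```

Computing the fences from the transformed values keeps them on the same scale as the histogram they are drawn on, so the count in the last line is directly comparable with the plot.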

Study of continuous predictors’ distribution

By viewing the scatter plots of MEDV against each continuous variable individually, I identified the rough shape of each relationship. For example, the plot of MEDV against RM shows a positive linear relationship but also some high-leverage points (see the influential point analysis in Section 5). Moreover, the plot of MEDV against LSTAT suggests a square-root transformation (see Transformations in Section 4).

#MEDV against CRIM
p1 = ggplot(boston_data,aes(x = CRIM, y = MEDV)) +geom_point(size=1)+geom_smooth()+labs(y = "Median housing price in Boston", x = "Crime rate")+theme(axis.title=element_text(size=5), axis.text=element_text(size=5))
#MEDV against ZN
p2 = ggplot(boston_data,aes(x = ZN, y = MEDV)) +geom_point(size=1)+geom_smooth()+labs(y = "Median housing price in Boston", x = "Proportion of residential land")+theme(axis.title=element_text(size=5), axis.text=element_text(size=5))
#MEDV against INDUS
p3 = ggplot(boston_data,aes(x = INDUS, y = MEDV)) +geom_point(size=1)+geom_smooth()+labs(y = "Median housing price in Boston", x = "Proportion of non-retail business")+theme(axis.title=element_text(size=5), axis.text=element_text(size=5))
#MEDV against NOX
p4 = ggplot(boston_data,aes(x = NOX, y = MEDV)) +geom_point(size=1)+geom_smooth()+labs(y = "Median housing price in Boston", x = "Nitric oxides concentration")+theme(axis.title=element_text(size=5), axis.text=element_text(size=5))
#MEDV against RM
p5 = ggplot(boston_data,aes(x = RM, y = MEDV)) +geom_point(size=1)+geom_smooth()+labs(y = "Median housing price in Boston", x = "Average number of rooms")+theme(axis.title=element_text(size=5), axis.text=element_text(size=5))
#MEDV against AGE
p6 = ggplot(boston_data,aes(x = AGE, y = MEDV)) +geom_point(size=1)+geom_smooth()+labs(y = "Median housing price in Boston", x = "Proportion of units built prior to 1940")+theme(axis.title=element_text(size=5), axis.text=element_text(size=5))
#MEDV against DIS
p7 = ggplot(boston_data,aes(x = DIS, y = MEDV)) +geom_point(size=1)+geom_smooth()+labs(y = "Median housing price in Boston", x = "Distances to employment centres")+theme(axis.title=element_text(size=5), axis.text=element_text(size=5))
#MEDV against TAX
p8 = ggplot(boston_data,aes(x = TAX, y = MEDV)) +geom_point(size=1)+geom_smooth()+labs(y = "Median housing price in Boston", x = "Property-tax rate")+theme(axis.title=element_text(size=5), axis.text=element_text(size=5))
#MEDV against PTRATIO
p9 = ggplot(boston_data,aes(x = PTRATIO, y = MEDV)) +geom_point(size=1)+geom_smooth()+labs(y = "Median housing price in Boston", x = "Pupil-teacher ratio")+theme(axis.title=element_text(size=5), axis.text=element_text(size=5))
#MEDV against B
p10 = ggplot(boston_data,aes(x = B, y = MEDV)) +geom_point(size=1)+geom_smooth()+labs(y = "Median housing price in Boston", x = "B index")+theme(axis.title=element_text(size=5), axis.text=element_text(size=5))
#MEDV against LSTAT
p11 = ggplot(boston_data,aes(x = LSTAT, y = MEDV)) +geom_point(size=1)+geom_smooth()+labs(y = "Median housing price in Boston", x = "Percentage of lower status")+theme(axis.title=element_text(size=5), axis.text=element_text(size=5))

#example of interaction of CRIM and B
p12 = ggplot(boston_data, aes(x = CRIM, y = MEDV, color = B)) + geom_point(size=1)+geom_smooth()+labs(y = "Median housing price in Boston", x = "Crime rate (coloured by B)")+theme(axis.title=element_text(size=5), axis.text=element_text(size=5))

p1+p2+p3+p4+p5+p6+p7+p8+p9+p10+p11+p12

To examine how B affects MEDV at a given crime rate, I plotted MEDV against CRIM with the points coloured by B. The plot suggests an interaction effect between CRIM and B (see Interactions in Section 4), but this may be driven by a few influential points, so further analysis with statistical tests is needed.

Study of categorical predictors’ distribution

#MEDV against RAD
p13 = ggplot(data=boston_data)+geom_boxplot(aes(x=as.factor(RAD), y=MEDV))+labs(title = "Boxplots of categorical factors", y = "Median housing price in Boston", x = "Distance to radial highways")
#MEDV against CHAS
p14 = ggplot(data=boston_data)+geom_boxplot(aes(x=as.factor(CHAS), y=MEDV))+labs(y = "Median housing price in Boston", x = "Whether the house bounds river")

p13+p14

Classify factor new_RAD

The boxplot of RAD suggests that neighbouring index values have similar MEDV distributions. I therefore regroup the values into close, moderate, remote and very remote, corresponding to 1-3, 4-6, 7-8 and 24 respectively.

boston_data$new_RAD<- factor(boston_data$RAD)
levels(boston_data$new_RAD)[levels(boston_data$new_RAD)%in%
                                c("1","2","3")]<-"close"
levels(boston_data$new_RAD)[levels(boston_data$new_RAD)%in%
                                c("4","5","6")]<-"moderate"
levels(boston_data$new_RAD)[levels(boston_data$new_RAD)%in%
                                c("7","8")]<-"remote"
levels(boston_data$new_RAD)[levels(boston_data$new_RAD)%in%
                                c("24")]<-"very remote"
summary(boston_data$new_RAD)

##       close    moderate      remote very remote 
##          82         251          41         132

boston_data<- subset(boston_data, select = -RAD )

#boxplot for MEDV against new_RAD
p15 = ggplot(data=boston_data)+geom_boxplot(aes(x=new_RAD, y=MEDV))+labs(title = "Comparison of classified and unclassified highway distances predictors", y = "Median housing price in Boston", x = "Classified distance to radial highways")
p13+p15
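The regrouping of RAD above could also be written more compactly with `cut()`. This is only a sketch of an equivalent recoding; the helper name `recode_rad` is hypothetical and is not part of the original analysis.

```r
# Hypothetical helper: bin the RAD index into the four groups used above.
# cut() uses right-closed intervals by default: (0,3], (3,6], (6,8], (8,24]
recode_rad <- function(rad) {
  cut(rad,
      breaks = c(0, 3, 6, 8, 24),
      labels = c("close", "moderate", "remote", "very remote"))
}

recode_rad(c(1, 3, 4, 6, 7, 8, 24))
```

Applied to the RAD column, `table(recode_rad(...))` gives the group sizes in one step, and the break points make the grouping rule visible at a glance.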

Model selection between linear and non-linear model

Before designing the analysis steps of the project, a rough analysis was carried out to determine whether the dataset is suitable for a linear regression model. The dataset was split into a 70% training set and a 30% test set. A linear regression and a regression tree were each fitted to the training set, and their out-of-sample performance on the test set was compared using root mean squared error (RMSE).

dt = sort(sample(nrow(boston_data), nrow(boston_data)*.7))
train<-boston_data[dt,]
test<-boston_data[-dt,]

lm.fit = lm(MEDV~., data=train)
tree.fit = rpart(MEDV~., data=train)

#linear regression model
lm.predict = predict(lm.fit, newdata=test) 
lm.rmse = sqrt(mean((lm.predict-test$MEDV)^2)) 
lm.rmse

## [1] 4.269851

#regression tree model
tree.predict = predict(tree.fit, newdata=test) 
tree.rmse = sqrt(mean((tree.predict - test$MEDV)^2)) 
tree.rmse

## [1] 4.042355

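#Lower values of RMSE indicate better out-of-sample fit. In this split the tree (4.04) is slightly lower than the linear model (4.27),
#but the split is random (no seed is set) and the gap is small, so I keep the linear regression model for its interpretable coefficients.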
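Because a single random 70/30 split can favour either model, the comparison can be repeated over many splits. The sketch below assumes the `MASS::Boston` copy of the data (lowercase column names, RAD kept numeric) stands in for `boston_data`; the helper `compare_once`, the seed and the number of repetitions are arbitrary illustrative choices.

```r
library(MASS)   # Boston data (assumed equivalent to boston_data)
library(rpart)  # regression trees

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# Hypothetical helper: one random 70/30 split, returning the test RMSE of both models
compare_once <- function(data) {
  idx <- sample(nrow(data), floor(0.7 * nrow(data)))
  train <- data[idx, ]
  test <- data[-idx, ]
  lm_fit <- lm(medv ~ ., data = train)
  tree_fit <- rpart(medv ~ ., data = train)
  c(lm = rmse(predict(lm_fit, newdata = test), test$medv),
    tree = rmse(predict(tree_fit, newdata = test), test$medv))
}

set.seed(1)  # arbitrary seed, for reproducibility only
results <- replicate(50, compare_once(Boston))
rowMeans(results)  # average test RMSE per model across the 50 splits
```

Averaging over splits gives a more stable comparison than one draw, and fixing the seed makes the reported numbers reproducible.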

Section 3 - Variable selections

I started with an initial model that included all variables other than MEDV as predictors (13 predictors). I then applied several variable selection methods and tested the individual significance of each predictor with t-tests, examining the p-values to decide which predictors to keep (only significant variables were retained, to avoid overfitting).

Variable selection results

The variable selection methods produced consistent results: remove AGE and INDUS. These two predictors are therefore dropped, and the baseline model is defined as follows (where RAD and CHAS are categorical variables; RAD enters the fitted model as the regrouped factor new_RAD):

\[ MEDV = \beta_0 + \beta_1 CRIM + \beta_2 ZN + \beta_3 CHAS + \beta_4 NOX + \beta_5 RM + \beta_6 DIS + \beta_7 RAD + \beta_8 TAX + \beta_9 PTRATIO + \beta_{10} B + \beta_{11} LSTAT. \]

full = lm(MEDV ~CRIM+ZN+CHAS+NOX+RM+DIS+new_RAD+TAX+PTRATIO+B+LSTAT+AGE+INDUS, data=boston_data)
vif(full)

##             GVIF Df GVIF^(1/(2*Df))
## CRIM    1.808806  1        1.344918
## ZN      2.343798  1        1.530947
## CHAS    1.079439  1        1.038961
## NOX     4.422840  1        2.103055
## RM      1.995487  1        1.412617
## DIS     4.054222  1        2.013510
## new_RAD 9.933908  3        1.466178
## TAX     9.279950  1        3.046301
## PTRATIO 1.846460  1        1.358845
## B       1.349626  1        1.161734
## LSTAT   2.944861  1        1.716060
## AGE     3.115940  1        1.765202
## INDUS   3.915472  1        1.978755

#By observing the results from forward selection, backward selection and best subset, I can proceed by dropping: AGE and INDUS

The VIF check shows no major issue with multicollinearity, as every GVIF value lies below 10, although new_RAD (9.93) and TAX (9.28) come close to that threshold.
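The forward and backward selection runs referred to in this section are not shown in the text. A minimal sketch of how they could be carried out with base R's `step()` follows. It assumes AIC as the selection criterion and uses the `MASS::Boston` copy of the data with RAD and CHAS left numeric, so it illustrates the procedure rather than reproducing the original run.

```r
library(MASS)  # Boston data (assumed equivalent to boston_data)

full_fit <- lm(medv ~ ., data = Boston)
null_fit <- lm(medv ~ 1, data = Boston)

# Backward elimination from the full model, using AIC as the criterion
backward_fit <- step(full_fit, direction = "backward", trace = 0)

# Forward selection from the intercept-only model, bounded above by the full model
forward_fit <- step(null_fit, scope = formula(full_fit),
                    direction = "forward", trace = 0)

# Predictors dropped by each procedure
setdiff(names(coef(full_fit)), names(coef(backward_fit)))
setdiff(names(coef(full_fit)), names(coef(forward_fit)))
```

Comparing the two `setdiff()` results shows whether the procedures agree on which predictors to drop, which is the consistency check described above.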

Section 4 - Final Model Derivation

The final model was derived after performing the variable selection.

Baseline model

Transformations were applied to ensure a linear relationship between the predictors and the response variable, to make the residuals approximately normally distributed, and to correct for heteroscedasticity and correlation among errors. They can also be applied to a continuous variable that is very skewed. By looking at the shape of the plot of each predictor against the dependent variable, I identified potential transformations and selected the one that improved the adjusted R-squared the most. If no typical regression pattern was visible, I applied common transformations (i.e. log, square root, square, exponential, reciprocal, negative exponential) to the variable and calculated the correlation between the response and each transformed predictor. The transformations with the highest and second-highest correlation with the response were then tried in the model to see whether they improved R-squared and should be included.

I used prior knowledge of the predictors to identify potential interaction terms (an interaction between two predictors is present if the effect of changing one predictor on the response differs across values of the other predictor). An interaction term was incorporated into the model if it improved the adjusted R-squared without causing serious multicollinearity.

The detailed code for the model derivation can be found in the appendix.

Transformations

Transformations for regression model

Transformations were applied to the variables to ensure the model satisfies the normal linear model assumptions and can explain a large fraction of the variation in the dependent variable(MEDV).

I further analyse the transformations applied in the final model. These transformations all improve the adjusted and multiple R-squared and ensure there is a linear relationship between the predictors and the dependent variable.

The log transformation of DIS was selected among several candidate transformations (e.g., DIS^2, sqrt(DIS)) because log(DIS) has the largest absolute correlation with the transformed response log(MEDV).

Among the candidate transformations of LSTAT, sqrt(LSTAT) has the strongest correlation with log(MEDV), and it substantially increases the multiple and adjusted R-squared, which supports choosing the sqrt transformation.

The sqrt transformation makes the relationship between LSTAT and the transformed dependent variable log(MEDV) more linear.
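The correlation-based screening of transformations described above can be sketched as follows. The list of candidate transformations and the helper `transform_cor` are illustrative assumptions, and `MASS::Boston` (lowercase column names) is assumed to match the Kaggle data.

```r
library(MASS)  # Boston data (assumed equivalent to boston_data)

# Candidate transformations to screen (an illustrative, non-exhaustive list)
candidate_transforms <- list(
  identity = identity,
  log = log,
  sqrt = sqrt,
  square = function(x) x^2,
  reciprocal = function(x) 1 / x
)

# Hypothetical helper: absolute correlation of each transformed predictor with y
transform_cor <- function(x, y) {
  sapply(candidate_transforms, function(f) abs(cor(f(x), y)))
}

sort(transform_cor(Boston$lstat, log(Boston$medv)), decreasing = TRUE)
sort(transform_cor(Boston$dis, log(Boston$medv)), decreasing = TRUE)
```

The ranking is only a screening device: a transformation that tops the list still has to be checked in the full model, since marginal correlation ignores the other predictors.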

After applying transformations, I discovered that the variable ZN has an extremely large p-value for the t-test, implying that this variable is not significant in explaining the relationship. Therefore I dropped this predictor. There is a slight increase in the value of multiple and adjusted R-squared after dropping ZN.

Interactions

Interactions for regression model

After transforming the variables to ensure linearity and the removal of ZN, I tried different interaction terms based on the knowledge of the predictors. I then decided on which interactions to include in the final model by assessing the increase in adjusted R-squared, the increase in multiple R-squared and the value of VIF.

An interaction term is incorporated if it leads to a clear increase in adjusted and multiple R-squared while all VIF values stay below 10, meaning that the model with the interaction term explains a larger proportion of the variation without causing major multicollinearity issues.

Only the interaction between CRIM and B is suitable, as all other interactions attempted either decrease the value of adjusted and multiple R-squared or lead to serious issues with multicollinearity.

The most promising interaction terms attempted after transformations and the removal of ZN are further analysed below.

The interaction term between CRIM and B was added to the final model.
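The check applied to each candidate interaction can be sketched as a nested model comparison. For brevity the example below uses a reduced set of predictors rather than the full model, and relies on `MASS::Boston`, where the B variable is named `black`; both are assumptions made for illustration.

```r
library(MASS)  # Boston data; the B variable is called `black` in this copy

# Reduced illustrative model without and with the CRIM x B interaction
base_fit <- lm(log(medv) ~ crim + black + sqrt(lstat), data = Boston)
inter_fit <- update(base_fit, . ~ . + crim:black)

# Change in adjusted R-squared from adding the interaction
summary(inter_fit)$adj.r.squared - summary(base_fit)$adj.r.squared

# Partial F-test of the interaction term (nested model comparison)
anova(base_fit, inter_fit)
```

The F-test complements the adjusted R-squared comparison by indicating whether the improvement is larger than would be expected by chance.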

Final model

I selected the model with the highest adjusted R-squared (R-squared measures the fraction of the sample variation in the dependent variable, MEDV, explained by the predictors), low VIF values, and no violation of the model assumptions.

#The best model from the above process:
final_adjusted= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+CRIM*B), data=boston_data)

summary(final_adjusted)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + CRIM * B), data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72555 -0.09626 -0.00847  0.09247  0.71429 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.701e+00  2.192e-01  21.447  < 2e-16 ***
## CRIM               -5.998e-03  2.176e-03  -2.756 0.006066 ** 
## CHAS                9.163e-02  3.253e-02   2.817 0.005050 ** 
## NOX                -8.419e-01  1.446e-01  -5.821 1.06e-08 ***
## RM                  7.700e-02  1.590e-02   4.842 1.72e-06 ***
## log(DIS)           -2.418e-01  2.812e-02  -8.596  < 2e-16 ***
## new_RADmoderate     2.597e-02  2.520e-02   1.031 0.303173    
## new_RADremote       8.782e-02  3.546e-02   2.477 0.013592 *  
## new_RADvery remote  3.078e-01  5.997e-02   5.133 4.11e-07 ***
## TAX                -5.623e-04  1.320e-04  -4.259 2.46e-05 ***
## PTRATIO            -3.751e-02  4.653e-03  -8.062 5.77e-15 ***
## B                   6.679e-04  1.364e-04   4.897 1.32e-06 ***
## sqrt(LSTAT)        -2.267e-01  1.367e-02 -16.577  < 2e-16 ***
## CRIM:B             -2.302e-05  6.853e-06  -3.360 0.000841 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1798 on 492 degrees of freedom
## Multiple R-squared:  0.8116, Adjusted R-squared:  0.8066 
## F-statistic:   163 on 13 and 492 DF,  p-value: < 2.2e-16

#Multiple R-squared:  0.8116,   Adjusted R-squared:  0.8066

The final model is as follows: \[ \log(MEDV) = \beta_0 + \beta_1 CRIM + \beta_2 CHAS + \beta_3 NOX + \beta_4 RM + \beta_5 new\_RAD + \beta_6 \log(DIS) + \beta_7 TAX + \beta_8 PTRATIO + \beta_9 B + \beta_{10} \sqrt{LSTAT} + \beta_{11} CRIM \times B. \]

Section 5 - Diagnostics

Check normality model assumptions

In this section, I aim to analyse the proposed final model in detail, including steps taken to reach the model and assessments of the model.

To assess the suitability of the model, I first check the proposed final model against the six normal linear model assumptions.

#Fitted plot vs residual plot
ggplot(final_adjusted, aes(x=.fitted,y=.stdresid))+
  geom_point()+
  geom_hline(yintercept=2, col="red", linetype="dashed")+
  geom_hline(yintercept=-2, col="red", linetype="dashed")+
  labs(y="standardized residuals")+
  labs(x="fitted values")

1. Partial residual plot (check for A1: linearity).

I have already transformed variables and added interaction terms to achieve linearity.

2. Residual plot (check for A4 & A5: constant variance & independence)

The residual plot is used to check whether the residuals are consistent with the assumptions on the random error. The residuals show no obvious pattern or trend and lie symmetrically around zero within a roughly constant band, so the zero-mean, independence and homoscedasticity assumptions appear to be satisfied.
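A visual check of constant variance can be supplemented with a formal test. The sketch below implements the studentized (Koenker) form of the Breusch-Pagan test by hand in base R. The helper `bp_test` is hypothetical, and the model is a reduced illustrative one fitted to `MASS::Boston`, not the final model.

```r
library(MASS)  # Boston data (assumed equivalent to boston_data)

fit <- lm(log(medv) ~ crim + rm + sqrt(lstat), data = Boston)

# Hypothetical helper: studentized Breusch-Pagan statistic LM = n * R^2,
# where R^2 comes from regressing the squared residuals on the model's regressors
bp_test <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]  # regressors without the intercept column
  u2 <- resid(model)^2
  aux <- lm(u2 ~ X)
  stat <- length(u2) * summary(aux)$r.squared
  df <- ncol(X)
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

bp_test(fit)
```

A small p-value would point to heteroscedasticity, in which case heteroscedasticity-robust standard errors are a common remedy.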

#QQ-plot
res.1 <- resid(final_adjusted)
qqnorm(res.1)
qqline(res.1, col = "steelblue", lwd = 2)

3. Normal QQ plot (check for A6: Normality)

I used the Normal QQ plot to check whether the residuals follow a normal distribution; if they do, the points should form a straight line. The plot shows that most points align with the reference line, but heavy tails are visible at both the left and right ends. This indicates that the model works well for most observations but not for extreme values. Overall, the model is acceptable.

Influential Point Analysis

To locate the influential points, I used the Cook’s distance plot. The graph shows that the most influential point is datapoint 381, which has a Cook’s distance of about 1.3.

#Cook’s distance plot
plot(final_adjusted, which=4)

After removing datapoint 381, the range of Cook’s distance shrinks considerably, so the remaining points can be examined in more detail.

#Cook’s distance graph after the removal of datapoint 381
final_ad_remov <- lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+CRIM*B), data = boston_data[-381,]) 
#model after removal of datapoint 381
summary(final_ad_remov)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + CRIM * B), data = boston_data[-381, 
##     ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.69582 -0.09517 -0.00712  0.08957  0.71301 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.738e+00  2.148e-01  22.054  < 2e-16 ***
## CRIM               -5.718e-03  2.132e-03  -2.682  0.00757 ** 
## CHAS                8.698e-02  3.188e-02   2.728  0.00660 ** 
## NOX                -8.618e-01  1.417e-01  -6.080 2.42e-09 ***
## RM                  7.019e-02  1.564e-02   4.487 9.01e-06 ***
## log(DIS)           -2.523e-01  2.764e-02  -9.129  < 2e-16 ***
## new_RADmoderate     2.619e-02  2.468e-02   1.061  0.28923    
## new_RADremote       9.474e-02  3.476e-02   2.725  0.00665 ** 
## new_RADvery remote  3.638e-01  5.995e-02   6.069 2.58e-09 ***
## TAX                -5.757e-04  1.293e-04  -4.451 1.06e-05 ***
## PTRATIO            -3.950e-02  4.577e-03  -8.631  < 2e-16 ***
## B                   8.132e-04  1.372e-04   5.929 5.74e-09 ***
## sqrt(LSTAT)        -2.212e-01  1.344e-02 -16.453  < 2e-16 ***
## CRIM:B             -4.039e-05  7.672e-06  -5.264 2.11e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1761 on 491 degrees of freedom
## Multiple R-squared:  0.8186, Adjusted R-squared:  0.8138 
## F-statistic: 170.4 on 13 and 491 DF,  p-value: < 2.2e-16

plot(final_ad_remov, which = 4)

#Cook’s distance plot after removal of datapoint 381 with threshold
data_remov = boston_data[-381,]
threshold = 4/nrow(data_remov) #4/n with n = 505 after removing datapoint 381
ggplot(aes(seq_along(cooks.distance(final_ad_remov)),
cooks.distance(final_ad_remov)), data=data_remov) +
geom_col()+
geom_hline(yintercept = threshold, linetype=5 , color='red') +
xlab("Observation number")+
ylab("Cook's Distance")

The Cook’s distance plot of the new model after removing datapoint 381 is shown with the threshold 4/n (approx. 0.008) as the red dashed line. More than 20 datapoints have a Cook’s distance above the threshold, which suggests that the dataset contains many influential points, so further analysis is needed to identify them.
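The points above the 4/n line can also be listed programmatically instead of being read off the plot. The sketch below uses a reduced illustrative model on `MASS::Boston` (an assumption; it is not the final model), so the flagged indices will differ from those in the analysis above.

```r
library(MASS)  # Boston data (assumed equivalent to boston_data)

fit <- lm(log(medv) ~ crim + rm + sqrt(lstat), data = Boston)

cd <- cooks.distance(fit)
threshold <- 4 / nobs(fit)          # 4/n rule of thumb

flagged <- which(cd > threshold)    # row positions above the threshold
length(flagged)
head(sort(cd[flagged], decreasing = TRUE), 5)  # five largest Cook's distances
```

Sorting the flagged values makes it easy to see whether a few points dominate or whether many observations sit just above the cut-off.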

#outlier analysis - standardised residual plot
threshold_outlier = 3
plot(rstandard(final_adjusted), xlab = "Observation number", ylab =
'Standardised Residual')
abline(h = threshold_outlier, col = 'red', lty = 5)
abline(h = -threshold_outlier, col = 'red', lty = 5)
identify(rstandard(final_adjusted)) #interactive labelling; returns integer(0) when no points are clicked

## integer(0)

This is a standardised residual plot with the threshold for identifying outliers set at ±3, shown by the two red lines. Of the 506 observations, approximately 10 are considered outliers according to this graph, which is about 2% of the whole dataset. The most significant data points are labelled with their indices.

#high leverage points analysis - hatvalues plot
plot(hatvalues(final_adjusted), xlab = "Observation number", ylab =
'Hatvalues')
threshold_lev=22/506
abline(h = threshold_lev, col = 'red', lty = 5)
identify(hatvalues(final_adjusted)) #interactive labelling; returns integer(0) when no points are clicked

## integer(0)

The hat values plot of all data points is used to identify high leverage points. The threshold is set at 2p/n = 22/506 (approx. 0.043). As before, the most significant points are labelled with their indices.
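Since `identify()` only labels points in an interactive session, the outliers and high leverage points can be extracted non-interactively. The sketch below again uses a reduced illustrative model on `MASS::Boston` (an assumption), and takes p as the number of estimated coefficients including the intercept, which is one common convention and may differ from the p used above.

```r
library(MASS)  # Boston data (assumed equivalent to boston_data)

fit <- lm(log(medv) ~ crim + rm + sqrt(lstat), data = Boston)

# Outliers: |standardised residual| > 3
outliers <- which(abs(rstandard(fit)) > 3)

# High leverage: hat value > 2p/n, with p = number of estimated coefficients
p <- length(coef(fit))
n <- nobs(fit)
high_leverage <- which(hatvalues(fit) > 2 * p / n)

outliers
high_leverage
intersect(names(outliers), names(high_leverage))  # points that are both
```

Points that appear in both sets are the natural candidates for closer inspection, since an outlier with high leverage can shift the fitted coefficients noticeably.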

Model after removing influential points from influential points analysis

final_model_remov_influential = lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+CRIM*B), data = boston_data[-c(381,373,372),])

huxreg('Model 1' = final_adjusted, 'Model after 3 influential points removed' = final_model_remov_influential, statistics = c('N.obs' = 'nobs', 'R squared' = 'r.squared', 'P value' = 'p.value', 'F statistics' = 'statistic'))

	Model 1	Model after 3 influential points removed
(Intercept)	4.701 ***	4.597 ***
	(0.219)	(0.210)
CRIM	-0.006 **	-0.005 *
	(0.002)	(0.002)
CHAS	0.092 **	0.070 *
	(0.033)	(0.031)
NOX	-0.842 ***	-0.788 ***
	(0.145)	(0.138)
RM	0.077 ***	0.078 ***
	(0.016)	(0.015)
log(DIS)	-0.242 ***	-0.229 ***
	(0.028)	(0.027)
new_RADmoderate	0.026	0.026
	(0.025)	(0.024)
new_RADremote	0.088 *	0.094 **
	(0.035)	(0.034)
new_RADvery remote	0.308 ***	0.347 ***
	(0.060)	(0.058)
TAX	-0.001 ***	-0.001 ***
	(0.000)	(0.000)
PTRATIO	-0.038 ***	-0.040 ***
	(0.005)	(0.004)
B	0.001 ***	0.001 ***
	(0.000)	(0.000)
sqrt(LSTAT)	-0.227 ***	-0.212 ***
	(0.014)	(0.013)
CRIM:B	-0.000 ***	-0.000 ***
	(0.000)	(0.000)
N.obs	506	503
R squared	0.812	0.827
P value	0.000	0.000
F statistics	163.013	179.846
*** p < 0.001; ** p < 0.01; * p < 0.05.

According to the three preceding analyses, several data points are considered influential: 372, 373 and 381. A new model with all three points removed is constructed. The R-squared improves from 0.812 to 0.827.

Comparison of QQ-plots between the initial model and the model with influential points removed

par(mfrow=c(1,2))

res.1 <- resid(final_adjusted)
qqnorm(res.1)
qqline(res.1, col = "steelblue", lwd = 2)

res.2 <- resid(final_model_remov_influential)
qqnorm(res.2)
qqline(res.2, col = "steelblue", lwd = 2)

The plots show that the residuals are closer to a normal distribution after the three influential points are removed.

Section 6 - Final Model Predictors’ Interpretation

summary(final_model_remov_influential)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + CRIM * B), data = boston_data[-c(381, 
##     373, 372), ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.68621 -0.08875 -0.00694  0.09305  0.68033 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.597e+00  2.097e-01  21.924  < 2e-16 ***
## CRIM               -5.297e-03  2.068e-03  -2.561  0.01074 *  
## CHAS                7.040e-02  3.131e-02   2.249  0.02498 *  
## NOX                -7.876e-01  1.380e-01  -5.707 1.99e-08 ***
## RM                  7.756e-02  1.523e-02   5.094 5.01e-07 ***
## log(DIS)           -2.287e-01  2.710e-02  -8.440 3.61e-16 ***
## new_RADmoderate     2.597e-02  2.393e-02   1.085  0.27831    
## new_RADremote       9.352e-02  3.370e-02   2.775  0.00573 ** 
## new_RADvery remote  3.468e-01  5.819e-02   5.960 4.84e-09 ***
## TAX                -5.790e-04  1.254e-04  -4.619 4.94e-06 ***
## PTRATIO            -3.956e-02  4.437e-03  -8.915  < 2e-16 ***
## B                   8.100e-04  1.330e-04   6.092 2.26e-09 ***
## sqrt(LSTAT)        -2.123e-01  1.313e-02 -16.169  < 2e-16 ***
## CRIM:B             -4.083e-05  7.438e-06  -5.490 6.46e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1707 on 489 degrees of freedom
## Multiple R-squared:  0.827,  Adjusted R-squared:  0.8224 
## F-statistic: 179.8 on 13 and 489 DF,  p-value: < 2.2e-16

After completing the diagnostics and the steps above, the final model is as follows (all coefficient estimates are statistically significant at the 5% level except new_RADmoderate):

\[ \log(MEDV) = 4.597-5.297\times10^{-3}\,CRIM+0.0704\,CHAS-0.7876\,NOX+0.07756\,RM+0.02597\,new\,RAD\ moderate+0.09352\,new\,RAD\ remote+0.3468\,new\,RAD\ very\ remote-0.2287\,\log(DIS)-5.79\times10^{-4}\,TAX-0.03956\,PTRATIO+8.1\times10^{-4}\,B-0.2123\,\sqrt{LSTAT}-4.083\times10^{-5}\,CRIM\times B. \]

The final model coefficients suggest that the explanatory terms RM, new_RAD, B, and CHAS positively predict log(MEDV), while the other terms negatively predict log(MEDV), which represents house prices in Boston. Most of the estimated coefficients seem plausible against real-life expectations, although many are relatively small.

B, CRIM, B*CRIM

Given that there is an interaction term between B and CRIM, the two variables and the interaction are analysed together. The coefficients of B and CRIM are:

\[ \begin{aligned} \beta_{B} &= 8.1\times10^{-4}-4.083\times10^{-5}\,CRIM \\ \beta_{CRIM} &= -5.297\times10^{-3}-4.083\times10^{-5}\,B \end{aligned} \]

Neither coefficient is large in magnitude. The negative coefficient of CRIM (it is negative for every non-negative value of B) is easy to interpret: a higher crime rate in an area is likely to discourage people from buying houses there and hence lead to lower market prices.

Whether B has a positive effect on house prices depends on the value of CRIM, and its association with house prices is weaker than that of CRIM.
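The dependence on CRIM can be made concrete using only the fitted estimates. As a worked check, writing the effective coefficient of B as its main effect plus the interaction term, the sign is positive when:

\[ 8.1\times10^{-4}-4.083\times10^{-5}\,CRIM > 0 \iff CRIM < \frac{8.1\times10^{-4}}{4.083\times10^{-5}} \approx 19.8 \]

Under this model, B is therefore positively associated with log(MEDV) only where the per capita crime rate is below roughly 19.8, and negatively associated above that value.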

NOX & log(DIS)

NOX, which measures air pollution (the lower the better), and log(DIS), which measures the distance of the houses to employment centres (the lower the closer), are the two variables with the strongest impact on housing prices. Both are negatively associated with housing prices, with NOX having the largest coefficient in magnitude (-0.7876). This is in line with general intuition: houses in locations with better air quality usually command higher prices, and proximity to employment centres brings convenience and hence higher prices.

sqrt(LSTAT) & PTRATIO

sqrt(LSTAT) and PTRATIO measure the impact of the makeup of the population on local housing prices. Both are negatively associated with housing prices, meaning that the lower the pupil-teacher ratio (that is, the more teachers per pupil), or the lower the percentage of the population with lower social status, the higher the local housing prices. This corresponds with the general intuition that people prefer houses in areas with more teachers per pupil and a population of higher social status.

CHAS & RAD

CHAS and RAD are two categorical variables that measure the impact of the geographical location of a house on its price. The coefficients indicate that both CHAS and new_RAD positively predict housing prices, and that accessibility to radial highways (up to 0.3468 for the "very remote" level) has a greater impact on price than whether the house bounds the river (0.0704).

RM & TAX

Finally, consider the predictors RM and TAX. TAX has the smallest main-effect coefficient in magnitude in the final model. The number of rooms per dwelling positively predicts housing prices, whereas the property-tax rate negatively predicts them. This again matches intuition: the greater the number of rooms, the higher the house price.

Section 7 - Conclusion

Limitations & recommendations

From the above analysis, I verified the linear relationship between log(MEDV) and the predictors used, and the model does not appear to violate the assumptions of a linear regression model. However, the model has some limitations, and some recommendations are provided for further improvement.

The first limitation comes from the relatively small sample. The dataset consists of only 506 observations, which may be too few to generalise the conclusions to housing prices at large. Using a larger dataset could improve the performance of the model.

Furthermore, from the influential points analysis, I argue that many influential points deviate from the model when interpreted practically. A larger dataset and more careful interpretation would help ensure that the regression model captures the general trends.

Lastly, all coefficients in the final model are relatively small, which makes the model hard to interpret. One reason for this may be the log-transformation of the variable MEDV. Although it makes the coefficients less directly interpretable, this log-transformation is essential to reduce the skewness of the data.
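One way to make log-scale coefficients easier to read is to convert them to approximate percentage changes in MEDV. A minimal sketch, using a few estimates copied by hand from the final model summary (the vector name and the choice of terms are illustrative assumptions, not part of the original analysis):

```r
# Estimates copied from summary(final_model_remov_influential) in Section 6.
beta <- c(CHAS = 7.040e-02, RM = 7.756e-02,
          PTRATIO = -3.956e-02, TAX = -5.790e-04)

# With log(MEDV) as the response, a one-unit increase in a predictor multiplies
# MEDV by exp(beta), i.e. changes it by 100 * (exp(beta) - 1) percent,
# holding the other terms fixed.
pct_change <- 100 * (exp(beta) - 1)
round(pct_change, 2)
```

Read this way, a coefficient such as 0.0704 for CHAS corresponds to a change of roughly 7% in the median home value, which is easier to communicate than a difference in log units.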

Conclusion

To conclude, despite certain limitations of the multiple linear regression model, the final model is acceptable. CRIM, NOX, RM, new_RAD, log(DIS), TAX, PTRATIO, B, sqrt(LSTAT), CHAS and the interaction between CRIM and B are useful predictors of log(MEDV), the log of the median value of owner-occupied homes in $1000’s. A higher number of rooms per dwelling (RM), better access to radial highways (new_RAD), a higher proportion of blacks by town (B), and a location bounding the river (CHAS) are associated with a higher median value of homes. Among all terms, NOX has the largest coefficient in magnitude, while TAX has the smallest main-effect coefficient. The model satisfies the regression assumptions, and three influential points were removed to arrive at the final model. An F-test of the overall significance of the model gives a large F-statistic with an extremely low p-value, which suggests that the overall model is significant. The final model has satisfactory adjusted and multiple R-squared values (0.8224 and 0.827 respectively), implying that the predictors in the final model explain a large share of the sample variation in the dependent variable, log(MEDV). Hence, the final model is useful in predicting future housing prices.
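As a consistency check on the reported fit statistics, the F-statistic and adjusted R-squared can be recovered from the multiple R-squared, using the standard formulas with n = 503 observations and k = 13 estimated slope terms (values taken from the final model summary, with R-squared rounded to 0.827):

\[ F = \frac{R^2/k}{(1-R^2)/(n-k-1)} = \frac{0.827/13}{0.173/489} \approx 179.8, \qquad \bar{R}^2 = 1-(1-R^2)\frac{n-1}{n-k-1} = 1-0.173\times\frac{502}{489} \approx 0.8224 \]

Both values agree with the summary output up to rounding.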

Appendix

Variable selection results

Section 3 - Variable selections: Back to the main body

Forward selection

null = lm(MEDV ~ 1, boston_data) 
full = lm(MEDV ~ CRIM + ZN +INDUS+CHAS+NOX+ RM+AGE+DIS+new_RAD+TAX+PTRATIO+B+LSTAT, data=boston_data)

selection_1 = step(null, scope = list(lower = null, upper = full), 
direction = "forward")

## Start:  AIC=2246.51
## MEDV ~ 1
## 
##           Df Sum of Sq   RSS    AIC
## + LSTAT    1   23243.9 19472 1851.0
## + RM       1   20654.4 22062 1914.2
## + PTRATIO  1   11014.3 31702 2097.6
## + INDUS    1    9995.2 32721 2113.6
## + TAX      1    9377.3 33339 2123.1
## + new_RAD  3    8286.4 34430 2143.4
## + NOX      1    7800.1 34916 2146.5
## + CRIM     1    6440.8 36276 2165.8
## + AGE      1    6069.8 36647 2171.0
## + ZN       1    5549.7 37167 2178.1
## + B        1    4749.9 37966 2188.9
## + DIS      1    2668.2 40048 2215.9
## + CHAS     1    1312.1 41404 2232.7
## <none>                 42716 2246.5
## 
## Step:  AIC=1851.01
## MEDV ~ LSTAT
## 
##           Df Sum of Sq   RSS    AIC
## + RM       1    4033.1 15439 1735.6
## + PTRATIO  1    2670.1 16802 1778.4
## + CHAS     1     786.3 18686 1832.2
## + DIS      1     772.4 18700 1832.5
## + AGE      1     304.3 19168 1845.0
## + TAX      1     274.4 19198 1845.8
## + B        1     198.3 19274 1847.8
## + ZN       1     160.3 19312 1848.8
## + new_RAD  3     301.2 19171 1849.1
## + CRIM     1     146.9 19325 1849.2
## + INDUS    1      98.7 19374 1850.4
## <none>                 19472 1851.0
## + NOX      1       4.8 19468 1852.9
## 
## Step:  AIC=1735.58
## MEDV ~ LSTAT + RM
## 
##           Df Sum of Sq   RSS    AIC
## + PTRATIO  1   1711.32 13728 1678.1
## + CHAS     1    548.53 14891 1719.3
## + B        1    512.31 14927 1720.5
## + TAX      1    425.16 15014 1723.5
## + DIS      1    351.15 15088 1725.9
## + CRIM     1    311.42 15128 1727.3
## + new_RAD  3    246.78 15192 1733.4
## + INDUS    1     61.09 15378 1735.6
## <none>                 15439 1735.6
## + ZN       1     56.56 15383 1735.7
## + AGE      1     20.18 15419 1736.9
## + NOX      1     14.90 15424 1737.1
## 
## Step:  AIC=1678.13
## MEDV ~ LSTAT + RM + PTRATIO
## 
##           Df Sum of Sq   RSS    AIC
## + DIS      1    499.08 13229 1661.4
## + B        1    389.68 13338 1665.6
## + CHAS     1    377.96 13350 1666.0
## + CRIM     1    122.52 13606 1675.6
## + AGE      1     66.24 13662 1677.7
## <none>                 13728 1678.1
## + TAX      1     44.36 13684 1678.5
## + NOX      1     24.81 13703 1679.2
## + ZN       1     14.96 13713 1679.6
## + INDUS    1      0.83 13727 1680.1
## + new_RAD  3     84.05 13644 1681.0
## 
## Step:  AIC=1661.39
## MEDV ~ LSTAT + RM + PTRATIO + DIS
## 
##           Df Sum of Sq   RSS    AIC
## + NOX      1    759.56 12469 1633.5
## + B        1    502.64 12726 1643.8
## + CHAS     1    267.43 12962 1653.1
## + INDUS    1    242.65 12986 1654.0
## + TAX      1    240.34 12989 1654.1
## + CRIM     1    233.54 12995 1654.4
## + ZN       1    144.81 13084 1657.8
## + new_RAD  3    225.35 13004 1658.7
## + AGE      1     61.36 13168 1661.0
## <none>                 13229 1661.4
## 
## Step:  AIC=1633.47
## MEDV ~ LSTAT + RM + PTRATIO + DIS + NOX
## 
##           Df Sum of Sq   RSS    AIC
## + CHAS     1    328.27 12141 1622.0
## + B        1    311.83 12158 1622.7
## + ZN       1    151.71 12318 1629.3
## + CRIM     1    141.43 12328 1629.7
## + new_RAD  3    167.44 12302 1632.6
## <none>                 12469 1633.5
## + INDUS    1     17.10 12452 1634.8
## + TAX      1     10.50 12459 1635.0
## + AGE      1      0.25 12469 1635.5
## 
## Step:  AIC=1621.97
## MEDV ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS
## 
##           Df Sum of Sq   RSS    AIC
## + B        1   272.837 11868 1612.5
## + ZN       1   164.406 11977 1617.1
## + CRIM     1   116.330 12025 1619.1
## + new_RAD  3   152.030 11989 1621.6
## <none>                 12141 1622.0
## + INDUS    1    26.274 12115 1622.9
## + TAX      1     4.187 12137 1623.8
## + AGE      1     2.331 12139 1623.9
## 
## Step:  AIC=1612.47
## MEDV ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS + B
## 
##           Df Sum of Sq   RSS    AIC
## + ZN       1   189.936 11678 1606.3
## + new_RAD  3   225.891 11642 1608.8
## + CRIM     1    55.633 11813 1612.1
## <none>                 11868 1612.5
## + INDUS    1    15.584 11853 1613.8
## + AGE      1     9.446 11859 1614.1
## + TAX      1     2.703 11866 1614.4
## 
## Step:  AIC=1606.31
## MEDV ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS + B + ZN
## 
##           Df Sum of Sq   RSS    AIC
## + new_RAD  3   207.567 11471 1603.2
## + CRIM     1    94.712 11584 1604.2
## <none>                 11678 1606.3
## + INDUS    1    16.048 11662 1607.6
## + TAX      1     3.952 11674 1608.1
## + AGE      1     1.491 11677 1608.2
## 
## Step:  AIC=1603.23
## MEDV ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS + B + ZN + new_RAD
## 
##         Df Sum of Sq   RSS    AIC
## + CRIM   1   224.355 11246 1595.2
## + TAX    1   174.085 11297 1597.5
## <none>               11471 1603.2
## + INDUS  1    14.925 11456 1604.6
## + AGE    1     1.101 11470 1605.2
## 
## Step:  AIC=1595.24
## MEDV ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS + B + ZN + new_RAD + 
##     CRIM
## 
##         Df Sum of Sq   RSS    AIC
## + TAX    1   188.925 11058 1588.7
## <none>               11246 1595.2
## + INDUS  1    22.355 11224 1596.2
## + AGE    1     0.990 11245 1597.2
## 
## Step:  AIC=1588.67
## MEDV ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS + B + ZN + new_RAD + 
##     CRIM + TAX
## 
##         Df Sum of Sq   RSS    AIC
## <none>               11058 1588.7
## + INDUS  1   0.62120 11057 1590.6
## + AGE    1   0.10679 11057 1590.7

summary(selection_1)

## 
## Call:
## lm(formula = MEDV ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS + 
##     B + ZN + new_RAD + CRIM + TAX, data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2469  -2.7629  -0.5163   1.6661  26.0713 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         38.347866   5.208555   7.362 7.65e-13 ***
## LSTAT               -0.527107   0.047530 -11.090  < 2e-16 ***
## RM                   3.680534   0.414504   8.879  < 2e-16 ***
## PTRATIO             -0.980885   0.130776  -7.501 2.99e-13 ***
## DIS                 -1.533070   0.188212  -8.145 3.14e-15 ***
## NOX                -17.030627   3.553599  -4.793 2.18e-06 ***
## CHAS                 2.664364   0.857423   3.107 0.001996 ** 
## B                    0.009209   0.002678   3.439 0.000634 ***
## ZN                   0.045853   0.013701   3.347 0.000880 ***
## new_RADmoderate      0.068717   0.665212   0.103 0.917766    
## new_RADremote        2.199264   0.946407   2.324 0.020542 *  
## new_RADvery remote   5.431363   1.567453   3.465 0.000576 ***
## CRIM                -0.107490   0.032949  -3.262 0.001182 ** 
## TAX                 -0.010140   0.003497  -2.899 0.003907 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.741 on 492 degrees of freedom
## Multiple R-squared:  0.7411, Adjusted R-squared:  0.7343 
## F-statistic: 108.4 on 13 and 492 DF,  p-value: < 2.2e-16

Forward selection does not add AGE or INDUS, so these two factors are excluded from the model.
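The AIC values printed in the step() trace can be reproduced from the RSS column. A minimal sketch, assuming step() uses the extractAIC() form n*log(RSS/n) + 2*k for linear models (k being the number of estimated coefficients), and taking n = 506 together with the rounded RSS values from the trace; the helper name step_aic is hypothetical:

```r
# AIC in the form reported by the step() trace: n * log(RSS / n) + 2 * k
step_aic <- function(rss, n, k) n * log(rss / n) + 2 * k

n <- 506                                           # observations in boston_data
aic_null  <- step_aic(rss = 42716, n = n, k = 1)   # MEDV ~ 1
aic_lstat <- step_aic(rss = 19472, n = n, k = 2)   # MEDV ~ LSTAT

round(c(null = aic_null, lstat = aic_lstat), 1)
```

These should land close to the 2246.5 and 1851.0 shown at the first two steps, up to rounding of the RSS. Each step adds the term with the lowest resulting AIC and stops once no addition lowers it.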

Backward Stepwise Selection

null_1 = lm(MEDV~1, data = boston_data)
full_1 = lm(MEDV~CRIM + ZN +INDUS+CHAS+NOX+ RM+AGE+DIS+new_RAD+TAX+PTRATIO+B+LSTAT, data = boston_data)

selection_2 = step(full_1, data = boston_data, direction = 'backward')

## Start:  AIC=1592.63
## MEDV ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + DIS + new_RAD + 
##     TAX + PTRATIO + B + LSTAT
## 
##           Df Sum of Sq   RSS    AIC
## - AGE      1      0.11 11057 1590.6
## - INDUS    1      0.62 11057 1590.7
## <none>                 11057 1592.6
## - TAX      1    166.53 11223 1598.2
## - CHAS     1    212.37 11269 1600.3
## - CRIM     1    237.47 11294 1601.4
## - ZN       1    247.72 11304 1601.8
## - B        1    265.92 11323 1602.7
## - new_RAD  3    501.22 11558 1609.1
## - NOX      1    449.54 11506 1610.8
## - PTRATIO  1   1238.94 12296 1644.4
## - DIS      1   1293.62 12350 1646.6
## - RM       1   1703.43 12760 1663.1
## - LSTAT    1   2424.69 13481 1691.0
## 
## Step:  AIC=1590.64
## MEDV ~ CRIM + ZN + INDUS + CHAS + NOX + RM + DIS + new_RAD + 
##     TAX + PTRATIO + B + LSTAT
## 
##           Df Sum of Sq   RSS    AIC
## - INDUS    1      0.62 11058 1588.7
## <none>                 11057 1590.6
## - TAX      1    167.19 11224 1596.2
## - CHAS     1    212.39 11269 1598.3
## - CRIM     1    237.54 11294 1599.4
## - ZN       1    251.74 11309 1600.0
## - B        1    266.27 11323 1600.7
## - new_RAD  3    503.11 11560 1607.2
## - NOX      1    488.32 11545 1610.5
## - PTRATIO  1   1248.78 12306 1642.8
## - DIS      1   1410.65 12468 1649.4
## - RM       1   1763.41 12820 1663.5
## - LSTAT    1   2751.71 13808 1701.1
## 
## Step:  AIC=1588.67
## MEDV ~ CRIM + ZN + CHAS + NOX + RM + DIS + new_RAD + TAX + PTRATIO + 
##     B + LSTAT
## 
##           Df Sum of Sq   RSS    AIC
## <none>                 11058 1588.7
## - TAX      1    188.92 11246 1595.2
## - CHAS     1    217.01 11274 1596.5
## - CRIM     1    239.19 11297 1597.5
## - ZN       1    251.73 11309 1598.1
## - B        1    265.74 11323 1598.7
## - new_RAD  3    524.83 11582 1606.1
## - NOX      1    516.20 11574 1609.8
## - PTRATIO  1   1264.37 12322 1641.5
## - DIS      1   1491.14 12549 1650.7
## - RM       1   1771.96 12829 1661.9
## - LSTAT    1   2764.03 13822 1699.6

summary(selection_2)

## 
## Call:
## lm(formula = MEDV ~ CRIM + ZN + CHAS + NOX + RM + DIS + new_RAD + 
##     TAX + PTRATIO + B + LSTAT, data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2469  -2.7629  -0.5163   1.6661  26.0713 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         38.347866   5.208555   7.362 7.65e-13 ***
## CRIM                -0.107490   0.032949  -3.262 0.001182 ** 
## ZN                   0.045853   0.013701   3.347 0.000880 ***
## CHAS                 2.664364   0.857423   3.107 0.001996 ** 
## NOX                -17.030627   3.553599  -4.793 2.18e-06 ***
## RM                   3.680534   0.414504   8.879  < 2e-16 ***
## DIS                 -1.533070   0.188212  -8.145 3.14e-15 ***
## new_RADmoderate      0.068717   0.665212   0.103 0.917766    
## new_RADremote        2.199264   0.946407   2.324 0.020542 *  
## new_RADvery remote   5.431363   1.567453   3.465 0.000576 ***
## TAX                 -0.010140   0.003497  -2.899 0.003907 ** 
## PTRATIO             -0.980885   0.130776  -7.501 2.99e-13 ***
## B                    0.009209   0.002678   3.439 0.000634 ***
## LSTAT               -0.527107   0.047530 -11.090  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.741 on 492 degrees of freedom
## Multiple R-squared:  0.7411, Adjusted R-squared:  0.7343 
## F-statistic: 108.4 on 13 and 492 DF,  p-value: < 2.2e-16

Backward selection drops AGE and INDUS, so these two factors are removed from the model.

Stepwise selection

null_3 = lm(MEDV ~ 1, boston_data)  
full_3 = lm(MEDV ~ CRIM + ZN +INDUS+CHAS+NOX+ RM+AGE+DIS+new_RAD+TAX+PTRATIO+B+LSTAT, data=boston_data)

selection_3 = step(null_3, scope = list(upper = full_3), data = 
boston_data, direction="both")

## Start:  AIC=2246.51
## MEDV ~ 1
## 
##           Df Sum of Sq   RSS    AIC
## + LSTAT    1   23243.9 19472 1851.0
## + RM       1   20654.4 22062 1914.2
## + PTRATIO  1   11014.3 31702 2097.6
## + INDUS    1    9995.2 32721 2113.6
## + TAX      1    9377.3 33339 2123.1
## + new_RAD  3    8286.4 34430 2143.4
## + NOX      1    7800.1 34916 2146.5
## + CRIM     1    6440.8 36276 2165.8
## + AGE      1    6069.8 36647 2171.0
## + ZN       1    5549.7 37167 2178.1
## + B        1    4749.9 37966 2188.9
## + DIS      1    2668.2 40048 2215.9
## + CHAS     1    1312.1 41404 2232.7
## <none>                 42716 2246.5
## 
## Step:  AIC=1851.01
## MEDV ~ LSTAT
## 
##           Df Sum of Sq   RSS    AIC
## + RM       1    4033.1 15439 1735.6
## + PTRATIO  1    2670.1 16802 1778.4
## + CHAS     1     786.3 18686 1832.2
## + DIS      1     772.4 18700 1832.5
## + AGE      1     304.3 19168 1845.0
## + TAX      1     274.4 19198 1845.8
## + B        1     198.3 19274 1847.8
## + ZN       1     160.3 19312 1848.8
## + new_RAD  3     301.2 19171 1849.1
## + CRIM     1     146.9 19325 1849.2
## + INDUS    1      98.7 19374 1850.4
## <none>                 19472 1851.0
## + NOX      1       4.8 19468 1852.9
## - LSTAT    1   23243.9 42716 2246.5
## 
## Step:  AIC=1735.58
## MEDV ~ LSTAT + RM
## 
##           Df Sum of Sq   RSS    AIC
## + PTRATIO  1    1711.3 13728 1678.1
## + CHAS     1     548.5 14891 1719.3
## + B        1     512.3 14927 1720.5
## + TAX      1     425.2 15014 1723.5
## + DIS      1     351.2 15088 1725.9
## + CRIM     1     311.4 15128 1727.3
## + new_RAD  3     246.8 15192 1733.4
## + INDUS    1      61.1 15378 1735.6
## <none>                 15439 1735.6
## + ZN       1      56.6 15383 1735.7
## + AGE      1      20.2 15419 1736.9
## + NOX      1      14.9 15424 1737.1
## - RM       1    4033.1 19472 1851.0
## - LSTAT    1    6622.6 22062 1914.2
## 
## Step:  AIC=1678.13
## MEDV ~ LSTAT + RM + PTRATIO
## 
##           Df Sum of Sq   RSS    AIC
## + DIS      1     499.1 13229 1661.4
## + B        1     389.7 13338 1665.6
## + CHAS     1     378.0 13350 1666.0
## + CRIM     1     122.5 13606 1675.6
## + AGE      1      66.2 13662 1677.7
## <none>                 13728 1678.1
## + TAX      1      44.4 13684 1678.5
## + NOX      1      24.8 13703 1679.2
## + ZN       1      15.0 13713 1679.6
## + INDUS    1       0.8 13727 1680.1
## + new_RAD  3      84.0 13644 1681.0
## - PTRATIO  1    1711.3 15439 1735.6
## - RM       1    3074.3 16802 1778.4
## - LSTAT    1    5013.6 18742 1833.7
## 
## Step:  AIC=1661.39
## MEDV ~ LSTAT + RM + PTRATIO + DIS
## 
##           Df Sum of Sq   RSS    AIC
## + NOX      1     759.6 12469 1633.5
## + B        1     502.6 12726 1643.8
## + CHAS     1     267.4 12962 1653.1
## + INDUS    1     242.6 12986 1654.0
## + TAX      1     240.3 12989 1654.1
## + CRIM     1     233.5 12995 1654.4
## + ZN       1     144.8 13084 1657.8
## + new_RAD  3     225.4 13004 1658.7
## + AGE      1      61.4 13168 1661.0
## <none>                 13229 1661.4
## - DIS      1     499.1 13728 1678.1
## - PTRATIO  1    1859.3 15088 1725.9
## - RM       1    2622.6 15852 1750.9
## - LSTAT    1    5349.2 18578 1831.2
## 
## Step:  AIC=1633.47
## MEDV ~ LSTAT + RM + PTRATIO + DIS + NOX
## 
##           Df Sum of Sq   RSS    AIC
## + CHAS     1     328.3 12141 1622.0
## + B        1     311.8 12158 1622.7
## + ZN       1     151.7 12318 1629.3
## + CRIM     1     141.4 12328 1629.7
## + new_RAD  3     167.4 12302 1632.6
## <none>                 12469 1633.5
## + INDUS    1      17.1 12452 1634.8
## + TAX      1      10.5 12459 1635.0
## + AGE      1       0.2 12469 1635.5
## - NOX      1     759.6 13229 1661.4
## - DIS      1    1233.8 13703 1679.2
## - PTRATIO  1    2116.5 14586 1710.8
## - RM       1    2546.2 15016 1725.5
## - LSTAT    1    3664.3 16134 1761.8
## 
## Step:  AIC=1621.97
## MEDV ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS
## 
##           Df Sum of Sq   RSS    AIC
## + B        1     272.8 11868 1612.5
## + ZN       1     164.4 11977 1617.1
## + CRIM     1     116.3 12025 1619.1
## + new_RAD  3     152.0 11989 1621.6
## <none>                 12141 1622.0
## + INDUS    1      26.3 12115 1622.9
## + TAX      1       4.2 12137 1623.8
## + AGE      1       2.3 12139 1623.9
## - CHAS     1     328.3 12469 1633.5
## - NOX      1     820.4 12962 1653.1
## - DIS      1    1146.8 13288 1665.6
## - PTRATIO  1    1924.9 14066 1694.4
## - RM       1    2480.7 14622 1714.0
## - LSTAT    1    3509.3 15650 1748.5
## 
## Step:  AIC=1612.47
## MEDV ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS + B
## 
##           Df Sum of Sq   RSS    AIC
## + ZN       1    189.94 11678 1606.3
## + new_RAD  3    225.89 11642 1608.8
## + CRIM     1     55.63 11813 1612.1
## <none>                 11868 1612.5
## + INDUS    1     15.58 11853 1613.8
## + AGE      1      9.45 11859 1614.1
## + TAX      1      2.70 11866 1614.4
## - B        1    272.84 12141 1622.0
## - CHAS     1    289.27 12158 1622.7
## - NOX      1    626.85 12495 1636.5
## - DIS      1   1103.33 12972 1655.5
## - PTRATIO  1   1804.30 13672 1682.1
## - RM       1   2658.21 14526 1712.7
## - LSTAT    1   2991.55 14860 1724.2
## 
## Step:  AIC=1606.31
## MEDV ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS + B + ZN
## 
##           Df Sum of Sq   RSS    AIC
## + new_RAD  3    207.57 11471 1603.2
## + CRIM     1     94.71 11584 1604.2
## <none>                 11678 1606.3
## + INDUS    1     16.05 11662 1607.6
## + TAX      1      3.95 11674 1608.1
## + AGE      1      1.49 11677 1608.2
## - ZN       1    189.94 11868 1612.5
## - B        1    298.37 11977 1617.1
## - CHAS     1    300.42 11979 1617.2
## - NOX      1    627.62 12306 1630.8
## - DIS      1   1276.45 12955 1656.8
## - PTRATIO  1   1364.63 13043 1660.2
## - RM       1   2384.55 14063 1698.3
## - LSTAT    1   3052.50 14731 1721.8
## 
## Step:  AIC=1603.23
## MEDV ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS + B + ZN + new_RAD
## 
##           Df Sum of Sq   RSS    AIC
## + CRIM     1     224.4 11246 1595.2
## + TAX      1     174.1 11297 1597.5
## <none>                 11471 1603.2
## + INDUS    1      14.9 11456 1604.6
## + AGE      1       1.1 11470 1605.2
## - new_RAD  3     207.6 11678 1606.3
## - ZN       1     171.6 11642 1608.8
## - CHAS     1     277.8 11748 1613.3
## - B        1     345.7 11816 1616.3
## - NOX      1     622.3 12093 1628.0
## - DIS      1    1321.3 12792 1656.4
## - PTRATIO  1    1342.6 12813 1657.2
## - RM       1    1899.9 13371 1678.8
## - LSTAT    1    3164.2 14635 1724.5
## 
## Step:  AIC=1595.24
## MEDV ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS + B + ZN + new_RAD + 
##     CRIM
## 
##           Df Sum of Sq   RSS    AIC
## + TAX      1    188.92 11058 1588.7
## <none>                 11246 1595.2
## + INDUS    1     22.36 11224 1596.2
## + AGE      1      0.99 11245 1597.2
## - ZN       1    198.12 11444 1602.1
## - CRIM     1    224.35 11471 1603.2
## - new_RAD  3    337.21 11584 1604.2
## - CHAS     1    253.10 11500 1604.5
## - B        1    282.33 11529 1605.8
## - NOX      1    680.57 11927 1623.0
## - PTRATIO  1   1395.19 12642 1652.4
## - DIS      1   1426.21 12673 1653.7
## - RM       1   1871.76 13118 1671.1
## - LSTAT    1   2829.96 14076 1706.8
## 
## Step:  AIC=1588.67
## MEDV ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS + B + ZN + new_RAD + 
##     CRIM + TAX
## 
##           Df Sum of Sq   RSS    AIC
## <none>                 11058 1588.7
## + INDUS    1      0.62 11057 1590.6
## + AGE      1      0.11 11057 1590.7
## - TAX      1    188.92 11246 1595.2
## - CHAS     1    217.01 11274 1596.5
## - CRIM     1    239.19 11297 1597.5
## - ZN       1    251.73 11309 1598.1
## - B        1    265.74 11323 1598.7
## - new_RAD  3    524.83 11582 1606.1
## - NOX      1    516.20 11574 1609.8
## - PTRATIO  1   1264.37 12322 1641.5
## - DIS      1   1491.14 12549 1650.7
## - RM       1   1771.96 12829 1661.9
## - LSTAT    1   2764.03 13822 1699.6

summary(selection_3)

## 
## Call:
## lm(formula = MEDV ~ LSTAT + RM + PTRATIO + DIS + NOX + CHAS + 
##     B + ZN + new_RAD + CRIM + TAX, data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2469  -2.7629  -0.5163   1.6661  26.0713 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         38.347866   5.208555   7.362 7.65e-13 ***
## LSTAT               -0.527107   0.047530 -11.090  < 2e-16 ***
## RM                   3.680534   0.414504   8.879  < 2e-16 ***
## PTRATIO             -0.980885   0.130776  -7.501 2.99e-13 ***
## DIS                 -1.533070   0.188212  -8.145 3.14e-15 ***
## NOX                -17.030627   3.553599  -4.793 2.18e-06 ***
## CHAS                 2.664364   0.857423   3.107 0.001996 ** 
## B                    0.009209   0.002678   3.439 0.000634 ***
## ZN                   0.045853   0.013701   3.347 0.000880 ***
## new_RADmoderate      0.068717   0.665212   0.103 0.917766    
## new_RADremote        2.199264   0.946407   2.324 0.020542 *  
## new_RADvery remote   5.431363   1.567453   3.465 0.000576 ***
## CRIM                -0.107490   0.032949  -3.262 0.001182 ** 
## TAX                 -0.010140   0.003497  -2.899 0.003907 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.741 on 492 degrees of freedom
## Multiple R-squared:  0.7411, Adjusted R-squared:  0.7343 
## F-statistic: 108.4 on 13 and 492 DF,  p-value: < 2.2e-16

Stepwise selection also excludes AGE and INDUS from the model.

Best subsets

mod_bs <- regsubsets(MEDV ~ ., nvmax = 13, data = boston_data)
par(mfrow = c(1, 2))
plot(mod_bs, scale = "adjr2")
plot(mod_bs, scale = "bic")

coef(mod_bs, scale = "bic", id = 13)

##        (Intercept)               CRIM                 ZN              INDUS 
##       38.427347375       -0.107201143        0.045986604        0.009827964 
##               CHAS                NOX                 RM                DIS 
##        2.655088053      -17.154383325        3.683491146       -1.526164658 
##                TAX            PTRATIO                  B              LSTAT 
##       -0.010262270       -0.984138663        0.009219127       -0.527974326 
##      new_RADremote new_RADvery remote 
##        2.165028256        5.382599389

mod_summary <- summary(mod_bs)
par(mfrow = c(1, 2))
plot(mod_summary$adjr2, xlab = "Number of Variables",
     ylab = "adjusted RSq", type = "l")  

#plot adjusted R^2 vs the number of variables
best_model <- which.max(mod_summary$adjr2)
points(best_model, mod_summary$adjr2[best_model],col = "red", cex = 2, pch = 19)
plot(mod_summary$bic, xlab = "Number of Variables",
     ylab = "BIC", type = "l")
best_model <- which.min(mod_summary$bic)
points(best_model, mod_summary$bic[best_model],
       col = "red", cex = 2, pch = 19)

Baseline model

Section 4 - Final Model Derivation: Back to the main body

reduced = lm(MEDV ~CRIM+ZN+CHAS+NOX+RM+DIS+new_RAD+TAX+PTRATIO+B+LSTAT, data=boston_data)
vif(reduced)

##             GVIF Df GVIF^(1/(2*Df))
## CRIM    1.804800  1        1.343428
## ZN      2.294214  1        1.514666
## CHAS    1.065708  1        1.032331
## NOX     3.810111  1        1.951951
## RM      1.905884  1        1.380538
## DIS     3.529346  1        1.878655
## new_RAD 9.238935  3        1.448562
## TAX     7.806239  1        2.793965
## PTRATIO 1.801146  1        1.342068
## B       1.343201  1        1.158966
## LSTAT   2.588639  1        1.608925

#The (G)VIF of every variable is below 10, so multicollinearity does not appear to be a problem
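For single-coefficient terms, the VIF reported above equals 1/(1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. A minimal base-R sketch of that definition, assuming the built-in mtcars data as a stand-in for boston_data; the helper vif_manual is hypothetical and does not handle multi-level factors such as new_RAD, for which car::vif() reports a generalised VIF:

```r
# Manual VIF for numeric predictors: regress each one on the others.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    # R-squared of predictor v explained by the remaining predictors
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

# Stand-in example: three correlated numeric columns from mtcars.
vifs <- vif_manual(mtcars, c("wt", "hp", "disp"))
round(vifs, 2)
```

A value near 1 means a predictor is nearly uncorrelated with the others, while large values signal that its coefficient is estimated imprecisely.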

summary(full)

## 
## Call:
## lm(formula = MEDV ~ CRIM + ZN + INDUS + CHAS + NOX + RM + AGE + 
##     DIS + new_RAD + TAX + PTRATIO + B + LSTAT, data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2626  -2.7486  -0.5016   1.6632  26.0990 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.837e+01  5.246e+00   7.314 1.07e-12 ***
## CRIM               -1.072e-01  3.305e-02  -3.244 0.001259 ** 
## ZN                  4.597e-02  1.388e-02   3.313 0.000990 ***
## INDUS               1.014e-02  6.097e-02   0.166 0.867930    
## CHAS                2.653e+00  8.647e-01   3.068 0.002276 ** 
## NOX                -1.712e+01  3.836e+00  -4.463 1.00e-05 ***
## RM                  3.693e+00  4.250e-01   8.689  < 2e-16 ***
## AGE                -9.263e-04  1.326e-02  -0.070 0.944315    
## DIS                -1.530e+00  2.021e-01  -7.572 1.85e-13 ***
## new_RADmoderate     7.164e-02  6.677e-01   0.107 0.914596    
## new_RADremote       2.218e+00  9.542e-01   2.324 0.020520 *  
## new_RADvery remote  5.477e+00  1.612e+00   3.397 0.000736 ***
## TAX                -1.038e-02  3.821e-03  -2.717 0.006828 ** 
## PTRATIO            -9.831e-01  1.327e-01  -7.410 5.58e-13 ***
## B                   9.234e-03  2.690e-03   3.433 0.000648 ***
## LSTAT              -5.266e-01  5.080e-02 -10.366  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.75 on 490 degrees of freedom
## Multiple R-squared:  0.7412, Adjusted R-squared:  0.7332 
## F-statistic: 93.54 on 15 and 490 DF,  p-value: < 2.2e-16

summary(reduced)

## 
## Call:
## lm(formula = MEDV ~ CRIM + ZN + CHAS + NOX + RM + DIS + new_RAD + 
##     TAX + PTRATIO + B + LSTAT, data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2469  -2.7629  -0.5163   1.6661  26.0713 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         38.347866   5.208555   7.362 7.65e-13 ***
## CRIM                -0.107490   0.032949  -3.262 0.001182 ** 
## ZN                   0.045853   0.013701   3.347 0.000880 ***
## CHAS                 2.664364   0.857423   3.107 0.001996 ** 
## NOX                -17.030627   3.553599  -4.793 2.18e-06 ***
## RM                   3.680534   0.414504   8.879  < 2e-16 ***
## DIS                 -1.533070   0.188212  -8.145 3.14e-15 ***
## new_RADmoderate      0.068717   0.665212   0.103 0.917766    
## new_RADremote        2.199264   0.946407   2.324 0.020542 *  
## new_RADvery remote   5.431363   1.567453   3.465 0.000576 ***
## TAX                 -0.010140   0.003497  -2.899 0.003907 ** 
## PTRATIO             -0.980885   0.130776  -7.501 2.99e-13 ***
## B                    0.009209   0.002678   3.439 0.000634 ***
## LSTAT               -0.527107   0.047530 -11.090  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.741 on 492 degrees of freedom
## Multiple R-squared:  0.7411, Adjusted R-squared:  0.7343 
## F-statistic: 108.4 on 13 and 492 DF,  p-value: < 2.2e-16

#The reduced model has a higher adjusted R-squared (0.7343 vs 0.7332), while its multiple R-squared is essentially unchanged (0.7411 vs 0.7412), so AGE and INDUS are removed from the model.
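Because the reduced model is nested in the full one, the comparison can also be made with a partial F-test. The sketch below uses simulated data and hypothetical object names (`d`, `full_demo`, `reduced_demo`), not the Boston data; it only illustrates the mechanics with base R's anova(). It also shows why multiple R-squared alone cannot favour the smaller model: adding predictors never lowers it.

```r
# Simulated stand-in data: y depends on x1 and x2 only; noise1/noise2 are irrelevant.
set.seed(42)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise1 = rnorm(n), noise2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - 1.5 * d$x2 + rnorm(n)

full_demo <- lm(y ~ x1 + x2 + noise1 + noise2, data = d)
reduced_demo <- lm(y ~ x1 + x2, data = d)

# Partial F-test: H0 is that the coefficients of the dropped predictors are zero.
cmp <- anova(reduced_demo, full_demo)
print(cmp)

# Multiple R-squared can only go up with more predictors; adjusted R-squared can go down.
r2 <- c(full = summary(full_demo)$r.squared, reduced = summary(reduced_demo)$r.squared)
adj_r2 <- c(full = summary(full_demo)$adj.r.squared, reduced = summary(reduced_demo)$adj.r.squared)
print(round(rbind(r2, adj_r2), 4))
```

A large p-value in the second row of the anova table would support dropping the extra predictors.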

#Investigating further the variables
#try treating the two categorical variables as factors; the first is new_RAD
reduced.1 = lm(MEDV ~ CRIM+ZN+CHAS+NOX+RM+DIS+as.factor(new_RAD)+TAX+PTRATIO+B+LSTAT, data=boston_data)
vif(reduced.1)

##                        GVIF Df GVIF^(1/(2*Df))
## CRIM               1.804800  1        1.343428
## ZN                 2.294214  1        1.514666
## CHAS               1.065708  1        1.032331
## NOX                3.810111  1        1.951951
## RM                 1.905884  1        1.380538
## DIS                3.529346  1        1.878655
## as.factor(new_RAD) 9.238935  3        1.448562
## TAX                7.806239  1        2.793965
## PTRATIO            1.801146  1        1.342068
## B                  1.343201  1        1.158966
## LSTAT              2.588639  1        1.608925

#there is no problem of multicollinearity

summary(reduced.1)

## 
## Call:
## lm(formula = MEDV ~ CRIM + ZN + CHAS + NOX + RM + DIS + as.factor(new_RAD) + 
##     TAX + PTRATIO + B + LSTAT, data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2469  -2.7629  -0.5163   1.6661  26.0713 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    38.347866   5.208555   7.362 7.65e-13 ***
## CRIM                           -0.107490   0.032949  -3.262 0.001182 ** 
## ZN                              0.045853   0.013701   3.347 0.000880 ***
## CHAS                            2.664364   0.857423   3.107 0.001996 ** 
## NOX                           -17.030627   3.553599  -4.793 2.18e-06 ***
## RM                              3.680534   0.414504   8.879  < 2e-16 ***
## DIS                            -1.533070   0.188212  -8.145 3.14e-15 ***
## as.factor(new_RAD)moderate      0.068717   0.665212   0.103 0.917766    
## as.factor(new_RAD)remote        2.199264   0.946407   2.324 0.020542 *  
## as.factor(new_RAD)very remote   5.431363   1.567453   3.465 0.000576 ***
## TAX                            -0.010140   0.003497  -2.899 0.003907 ** 
## PTRATIO                        -0.980885   0.130776  -7.501 2.99e-13 ***
## B                               0.009209   0.002678   3.439 0.000634 ***
## LSTAT                          -0.527107   0.047530 -11.090  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.741 on 492 degrees of freedom
## Multiple R-squared:  0.7411, Adjusted R-squared:  0.7343 
## F-statistic: 108.4 on 13 and 492 DF,  p-value: < 2.2e-16

summary(reduced)

## 
## Call:
## lm(formula = MEDV ~ CRIM + ZN + CHAS + NOX + RM + DIS + new_RAD + 
##     TAX + PTRATIO + B + LSTAT, data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2469  -2.7629  -0.5163   1.6661  26.0713 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         38.347866   5.208555   7.362 7.65e-13 ***
## CRIM                -0.107490   0.032949  -3.262 0.001182 ** 
## ZN                   0.045853   0.013701   3.347 0.000880 ***
## CHAS                 2.664364   0.857423   3.107 0.001996 ** 
## NOX                -17.030627   3.553599  -4.793 2.18e-06 ***
## RM                   3.680534   0.414504   8.879  < 2e-16 ***
## DIS                 -1.533070   0.188212  -8.145 3.14e-15 ***
## new_RADmoderate      0.068717   0.665212   0.103 0.917766    
## new_RADremote        2.199264   0.946407   2.324 0.020542 *  
## new_RADvery remote   5.431363   1.567453   3.465 0.000576 ***
## TAX                 -0.010140   0.003497  -2.899 0.003907 ** 
## PTRATIO             -0.980885   0.130776  -7.501 2.99e-13 ***
## B                    0.009209   0.002678   3.439 0.000634 ***
## LSTAT               -0.527107   0.047530 -11.090  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.741 on 492 degrees of freedom
## Multiple R-squared:  0.7411, Adjusted R-squared:  0.7343 
## F-statistic: 108.4 on 13 and 492 DF,  p-value: < 2.2e-16

#reduced.1 has the same multiple R-squared and adjusted R-squared as reduced: new_RAD is already treated as categorical, so as.factor() does not change the fit
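The identical output is expected whenever the variable is already a factor. A minimal sketch with simulated data (the data frame `d` and the level names are made up for illustration):

```r
# Simulated stand-in data with a predictor that is already a factor.
set.seed(7)
n <- 120
d <- data.frame(
  x = rnorm(n),
  grp = factor(sample(c("close", "moderate", "remote"), n, replace = TRUE))
)
d$y <- 1 + 0.5 * d$x + ifelse(d$grp == "remote", 1, 0) + rnorm(n)

fit_a <- lm(y ~ x + grp, data = d)
fit_b <- lm(y ~ x + as.factor(grp), data = d)

# Only the coefficient labels differ; the estimates are the same.
same_fit <- isTRUE(all.equal(unname(coef(fit_a)), unname(coef(fit_b))))
print(same_fit)
```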

#then treat CHAS as a factor (note that the formula below also leaves out TAX)
reduced.2 = lm(MEDV ~ CRIM+ZN+as.factor(CHAS)+NOX+RM+DIS+new_RAD+PTRATIO+B+LSTAT, data=boston_data)
vif(reduced.2)

##                     GVIF Df GVIF^(1/(2*Df))
## CRIM            1.802438  1        1.342549
## ZN              2.248973  1        1.499658
## as.factor(CHAS) 1.058578  1        1.028872
## NOX             3.650308  1        1.910578
## RM              1.894885  1        1.376548
## DIS             3.512905  1        1.874274
## new_RAD         3.483392  3        1.231215
## PTRATIO         1.778103  1        1.333455
## B               1.341492  1        1.158228
## LSTAT           2.584086  1        1.607509

#all predictors show VIF<5, so move to the next step

summary(reduced.2)

## 
## Call:
## lm(formula = MEDV ~ CRIM + ZN + as.factor(CHAS) + NOX + RM + 
##     DIS + new_RAD + PTRATIO + B + LSTAT, data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.6509  -2.8511  -0.5333   1.6919  26.0337 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         36.719302   5.216929   7.038 6.55e-12 ***
## CRIM                -0.104034   0.033173  -3.136 0.001815 ** 
## ZN                   0.040275   0.013666   2.947 0.003361 ** 
## as.factor(CHAS)1     2.867705   0.860944   3.331 0.000931 ***
## NOX                -19.140673   3.504307  -5.462 7.48e-08 ***
## RM                   3.771829   0.416399   9.058  < 2e-16 ***
## DIS                 -1.495825   0.189179  -7.907 1.74e-14 ***
## new_RADmoderate     -0.538947   0.636056  -0.847 0.397224    
## new_RADremote        1.708480   0.938114   1.821 0.069184 .  
## new_RADvery remote   1.936770   1.009569   1.918 0.055636 .  
## PTRATIO             -1.023772   0.130909  -7.820 3.22e-14 ***
## B                    0.009486   0.002696   3.518 0.000475 ***
## LSTAT               -0.532887   0.047844 -11.138  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.776 on 493 degrees of freedom
## Multiple R-squared:  0.7367, Adjusted R-squared:  0.7303 
## F-statistic:   115 on 12 and 493 DF,  p-value: < 2.2e-16

summary(reduced)

## 
## Call:
## lm(formula = MEDV ~ CRIM + ZN + CHAS + NOX + RM + DIS + new_RAD + 
##     TAX + PTRATIO + B + LSTAT, data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2469  -2.7629  -0.5163   1.6661  26.0713 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         38.347866   5.208555   7.362 7.65e-13 ***
## CRIM                -0.107490   0.032949  -3.262 0.001182 ** 
## ZN                   0.045853   0.013701   3.347 0.000880 ***
## CHAS                 2.664364   0.857423   3.107 0.001996 ** 
## NOX                -17.030627   3.553599  -4.793 2.18e-06 ***
## RM                   3.680534   0.414504   8.879  < 2e-16 ***
## DIS                 -1.533070   0.188212  -8.145 3.14e-15 ***
## new_RADmoderate      0.068717   0.665212   0.103 0.917766    
## new_RADremote        2.199264   0.946407   2.324 0.020542 *  
## new_RADvery remote   5.431363   1.567453   3.465 0.000576 ***
## TAX                 -0.010140   0.003497  -2.899 0.003907 ** 
## PTRATIO             -0.980885   0.130776  -7.501 2.99e-13 ***
## B                    0.009209   0.002678   3.439 0.000634 ***
## LSTAT               -0.527107   0.047530 -11.090  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.741 on 492 degrees of freedom
## Multiple R-squared:  0.7411, Adjusted R-squared:  0.7343 
## F-statistic: 108.4 on 13 and 492 DF,  p-value: < 2.2e-16

#reduced still has the best multiple R^2 and adjusted R^2, so both variables are kept in the model in their initial form.
#Note that reduced.2 also omits TAX, so its lower R^2 cannot be attributed to as.factor(CHAS) alone.
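To separate the two effects, the sketch below uses simulated data (hypothetical names `d`, `river`, `tax_like`) to show that wrapping a 0/1 numeric indicator in as.factor() leaves the fit unchanged, whereas dropping another predictor at the same time lowers R-squared.

```r
# Simulated stand-in data; river is a 0/1 indicator stored as a number.
set.seed(13)
n <- 200
d <- data.frame(
  x = rnorm(n),
  tax_like = rnorm(n),
  river = rbinom(n, 1, 0.3)
)
d$y <- 2 + d$x - 0.7 * d$tax_like + 1.5 * d$river + rnorm(n)

fit_numeric <- lm(y ~ x + tax_like + river, data = d)
fit_factor <- lm(y ~ x + tax_like + as.factor(river), data = d)
fit_dropped <- lm(y ~ x + as.factor(river), data = d)  # tax_like left out

r2 <- c(
  numeric = summary(fit_numeric)$r.squared,
  factor = summary(fit_factor)$r.squared,
  dropped = summary(fit_dropped)$r.squared
)
print(round(r2, 4))
```

The first two values coincide; only the third, where a predictor is missing, differs.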

Transformations for regression model

Transformations:

hist(boston_data$MEDV)

#MEDV is positively skewed, so use the log transformation

hist(log(boston_data$MEDV))

#this helps with the skewness of MEDV and therefore we use log(MEDV) instead
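The visual check with hist() can be complemented by a numeric one. The following sketch assumes a simple moment-based skewness (the helper `skewness` is hypothetical, written here in base R) and uses simulated right-skewed values rather than MEDV itself.

```r
# Hypothetical helper: moment-based sample skewness (0 for a symmetric sample).
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Simulated right-skewed stand-in for a price variable.
set.seed(3)
price <- rlnorm(500, meanlog = 3, sdlog = 0.4)

skew_raw <- skewness(price)
skew_log <- skewness(log(price))
print(round(c(raw = skew_raw, log = skew_log), 3))
```

A value closer to zero after the transformation supports what the second histogram suggests.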

adjusted = lm(log(MEDV) ~ CRIM+ZN+CHAS+NOX+RM+DIS+new_RAD+TAX+PTRATIO+B+LSTAT, data=boston_data)
summary(adjusted)

## 
## Call:
## lm(formula = log(MEDV) ~ CRIM + ZN + CHAS + NOX + RM + DIS + 
##     new_RAD + TAX + PTRATIO + B + LSTAT, data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73172 -0.09931 -0.01551  0.09265  0.86072 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.1598316  0.2090973  19.894  < 2e-16 ***
## CRIM               -0.0102589  0.0013227  -7.756 5.09e-14 ***
## ZN                  0.0010576  0.0005500   1.923 0.055082 .  
## CHAS                0.1033747  0.0344212   3.003 0.002807 ** 
## NOX                -0.7107207  0.1426591  -4.982 8.73e-07 ***
## RM                  0.0869690  0.0166403   5.226 2.56e-07 ***
## DIS                -0.0529920  0.0075558  -7.013 7.72e-12 ***
## new_RADmoderate     0.0095484  0.0267049   0.358 0.720833    
## new_RADremote       0.0889242  0.0379935   2.341 0.019656 *  
## new_RADvery remote  0.2517866  0.0629254   4.001 7.27e-05 ***
## TAX                -0.0004943  0.0001404  -3.520 0.000471 ***
## PTRATIO            -0.0388306  0.0052500  -7.396 6.08e-13 ***
## B                   0.0004096  0.0001075   3.810 0.000156 ***
## LSTAT              -0.0287647  0.0019081 -15.075  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1903 on 492 degrees of freedom
## Multiple R-squared:  0.7888, Adjusted R-squared:  0.7832 
## F-statistic: 141.3 on 13 and 492 DF,  p-value: < 2.2e-16

summary(reduced)

## 
## Call:
## lm(formula = MEDV ~ CRIM + ZN + CHAS + NOX + RM + DIS + new_RAD + 
##     TAX + PTRATIO + B + LSTAT, data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2469  -2.7629  -0.5163   1.6661  26.0713 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         38.347866   5.208555   7.362 7.65e-13 ***
## CRIM                -0.107490   0.032949  -3.262 0.001182 ** 
## ZN                   0.045853   0.013701   3.347 0.000880 ***
## CHAS                 2.664364   0.857423   3.107 0.001996 ** 
## NOX                -17.030627   3.553599  -4.793 2.18e-06 ***
## RM                   3.680534   0.414504   8.879  < 2e-16 ***
## DIS                 -1.533070   0.188212  -8.145 3.14e-15 ***
## new_RADmoderate      0.068717   0.665212   0.103 0.917766    
## new_RADremote        2.199264   0.946407   2.324 0.020542 *  
## new_RADvery remote   5.431363   1.567453   3.465 0.000576 ***
## TAX                 -0.010140   0.003497  -2.899 0.003907 ** 
## PTRATIO             -0.980885   0.130776  -7.501 2.99e-13 ***
## B                    0.009209   0.002678   3.439 0.000634 ***
## LSTAT               -0.527107   0.047530 -11.090  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.741 on 492 degrees of freedom
## Multiple R-squared:  0.7411, Adjusted R-squared:  0.7343 
## F-statistic: 108.4 on 13 and 492 DF,  p-value: < 2.2e-16

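#The log transformation of MEDV increases both adjusted R squared and multiple R squared (measured on the log scale, so not strictly comparable with the MEDV-scale values), therefore this model is retained.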
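Since R-squared for log(MEDV) and for MEDV describe variance on different scales, one way to put the two models on a common footing is to back-transform the fitted values and compute R-squared against the original response. This is a sketch under that assumption, using simulated data and a hypothetical helper `r2_original_scale`; it is not a step taken in the original analysis.

```r
# Simulated stand-in data where the response is log-linear in x.
set.seed(11)
n <- 400
x <- runif(n, 0, 3)
y <- exp(1 + 0.6 * x + rnorm(n, sd = 0.3))

fit_raw <- lm(y ~ x)
fit_log <- lm(log(y) ~ x)

# Hypothetical helper: R-squared of predictions against the untransformed response.
r2_original_scale <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

r2_raw <- r2_original_scale(y, fitted(fit_raw))
r2_log_back <- r2_original_scale(y, exp(fitted(fit_log)))  # naive back-transform
print(round(c(raw = r2_raw, log_back_transformed = r2_log_back), 4))
```

Note that exp() of a fitted log value estimates a median rather than a mean, so this comparison is only approximate.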

#the multiple R squared for adjusted model is 0.7888, the adjusted R squared is 0.7832

cor(log(boston_data$MEDV), boston_data$CRIM)

## [1] -0.5279464

cor(log(boston_data$MEDV),exp(boston_data$CRIM))

## [1] -0.07548637

cor(log(boston_data$MEDV),sqrt(boston_data$CRIM)) #Highest correlation

## [1] -0.606972

cor(log(boston_data$MEDV),log(boston_data$CRIM)) #Second Highest correlation

## [1] -0.5672422

cor(log(boston_data$MEDV),boston_data$CRIM^2)

## [1] -0.3130519

adjusted.1 = lm(log(MEDV) ~ sqrt(CRIM)+ZN+CHAS+NOX+RM+DIS+new_RAD+TAX+PTRATIO+B+LSTAT, data=boston_data)
summary(adjusted.1)

## 
## Call:
## lm(formula = log(MEDV) ~ sqrt(CRIM) + ZN + CHAS + NOX + RM + 
##     DIS + new_RAD + TAX + PTRATIO + B + LSTAT, data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.70484 -0.10461 -0.01129  0.09140  0.87914 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.1718200  0.2110375  19.768  < 2e-16 ***
## sqrt(CRIM)         -0.0909959  0.0126755  -7.179 2.61e-12 ***
## ZN                  0.0009483  0.0005536   1.713 0.087356 .  
## CHAS                0.1027789  0.0347048   2.962 0.003209 ** 
## NOX                -0.5864647  0.1435362  -4.086 5.13e-05 ***
## RM                  0.0865503  0.0167739   5.160 3.59e-07 ***
## DIS                -0.0538801  0.0076382  -7.054 5.93e-12 ***
## new_RADmoderate     0.0207832  0.0269798   0.770 0.441478    
## new_RADremote       0.1089318  0.0385315   2.827 0.004889 ** 
## new_RADvery remote  0.3609118  0.0693162   5.207 2.83e-07 ***
## TAX                -0.0004952  0.0001415  -3.499 0.000509 ***
## PTRATIO            -0.0406062  0.0053103  -7.647 1.09e-13 ***
## B                   0.0003762  0.0001091   3.448 0.000613 ***
## LSTAT              -0.0279282  0.0019499 -14.323  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1918 on 492 degrees of freedom
## Multiple R-squared:  0.7855, Adjusted R-squared:  0.7798 
## F-statistic: 138.6 on 13 and 492 DF,  p-value: < 2.2e-16

#choose not to use sqrt(CRIM) as this leads to a decrease in adjusted R squared and multiple R squared.
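The screening step used here (correlate the response with several transformations of a predictor, then refit with the strongest one) can be wrapped in a small helper. The function `screen_transforms` below is a hypothetical convenience written for illustration and run on simulated data. As the sqrt(CRIM) result shows, the strongest marginal correlation does not guarantee a better adjusted R squared once the other predictors are in the model, so the refit is still needed.

```r
# Hypothetical helper: correlation of y with several transformations of x.
# Transformations that produce non-finite values are reported as NA.
screen_transforms <- function(y, x) {
  candidates <- list(
    identity = identity,
    exp = exp,
    sqrt = sqrt,
    log = log,
    square = function(v) v^2
  )
  sapply(candidates, function(f) {
    tx <- suppressWarnings(f(x))
    if (all(is.finite(tx))) cor(y, tx) else NA_real_
  })
}

# Simulated stand-in data.
set.seed(9)
x <- runif(300, 0.1, 10)
y <- 3 - 0.8 * sqrt(x) + rnorm(300, sd = 0.2)

res <- screen_transforms(y, x)
print(round(res, 4))

# Transformation with the largest absolute correlation.
best <- names(which.max(abs(res)))
print(best)
```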

cor(log(boston_data$MEDV), boston_data$ZN) #Second Highest correlation

## [1] 0.3633445

cor(log(boston_data$MEDV),exp(boston_data$ZN))

## [1] 0.04761745

cor(log(boston_data$MEDV),sqrt(boston_data$ZN)) #Highest correlation

## [1] 0.3939155

cor(log(boston_data$MEDV),log(boston_data$ZN))

## [1] NaN
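The NaN arises because ZN contains zeros and log(0) is -Inf, which cor() cannot handle. The sketch below reproduces the effect on a few made-up values (`zn_like` and `medv_like` are illustrative, not taken from the data) and shows log1p(), that is log(1 + x), as one possible alternative that stays finite at zero. The original analysis does not use log1p; it is mentioned only as an option.

```r
# Made-up illustrative values; several zeros, as in a zoning-proportion variable.
zn_like <- c(0, 0, 12.5, 25, 0, 80, 20, 0)
medv_like <- c(18, 21, 24, 30, 17, 35, 26, 19)

log_zn <- log(zn_like)  # log(0) is -Inf
r_log <- suppressWarnings(cor(log(medv_like), log_zn))

# log1p(x) = log(1 + x) is finite at x = 0.
r_log1p <- cor(log(medv_like), log1p(zn_like))

print(c(log = r_log, log1p = r_log1p))
```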

cor(log(boston_data$MEDV),boston_data$ZN^2)

## [1] 0.3053218

adjusted.2 = lm(log(MEDV) ~ CRIM+sqrt(ZN)+CHAS+NOX+RM+DIS+new_RAD+TAX+PTRATIO+B+LSTAT, data=boston_data)
summary(adjusted.2)

## 
## Call:
## lm(formula = log(MEDV) ~ CRIM + sqrt(ZN) + CHAS + NOX + RM + 
##     DIS + new_RAD + TAX + PTRATIO + B + LSTAT, data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73099 -0.09852 -0.01643  0.09409  0.86291 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.1522970  0.2100350  19.770  < 2e-16 ***
## CRIM               -0.0102040  0.0013246  -7.703 7.35e-14 ***
## sqrt(ZN)            0.0065656  0.0046287   1.418 0.156687    
## CHAS                0.1045711  0.0344840   3.032 0.002554 ** 
## NOX                -0.7167580  0.1430220  -5.012 7.55e-07 ***
## RM                  0.0883039  0.0166642   5.299 1.76e-07 ***
## DIS                -0.0513894  0.0076901  -6.683 6.38e-11 ***
## new_RADmoderate     0.0057419  0.0267149   0.215 0.829908    
## new_RADremote       0.0785193  0.0375491   2.091 0.037031 *  
## new_RADvery remote  0.2437829  0.0631116   3.863 0.000127 ***
## TAX                -0.0004705  0.0001396  -3.370 0.000810 ***
## PTRATIO            -0.0391798  0.0054059  -7.248 1.65e-12 ***
## B                   0.0004079  0.0001077   3.786 0.000172 ***
## LSTAT              -0.0288079  0.0019119 -15.067  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1906 on 492 degrees of freedom
## Multiple R-squared:  0.7881, Adjusted R-squared:  0.7825 
## F-statistic: 140.7 on 13 and 492 DF,  p-value: < 2.2e-16

#choose not to use sqrt(ZN) as this leads to a decrease in adjusted R squared and multiple R squared.

cor(log(boston_data$MEDV), boston_data$NOX)

## [1] -0.5106003

cor(log(boston_data$MEDV),exp(boston_data$NOX))

## [1] -0.5021635

cor(log(boston_data$MEDV),sqrt(boston_data$NOX)) #Second Highest correlation

## [1] -0.5140763

cor(log(boston_data$MEDV),log(boston_data$NOX)) #Highest correlation

## [1] -0.5152507

cor(log(boston_data$MEDV),boston_data$NOX^2)

## [1] -0.4964561

adjusted.3 = lm(log(MEDV) ~ CRIM+ZN+CHAS+log(NOX)+RM+DIS+new_RAD+TAX+PTRATIO+B+LSTAT, data=boston_data)
summary(adjusted.3)

## 
## Call:
## lm(formula = log(MEDV) ~ CRIM + ZN + CHAS + log(NOX) + RM + DIS + 
##     new_RAD + TAX + PTRATIO + B + LSTAT, data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73412 -0.09895 -0.01712  0.09375  0.86923 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.4772847  0.1800232  19.316  < 2e-16 ***
## CRIM               -0.0102459  0.0013261  -7.727 6.26e-14 ***
## ZN                  0.0009234  0.0005555   1.662 0.097123 .  
## CHAS                0.1000661  0.0344648   2.903 0.003857 ** 
## log(NOX)           -0.4252803  0.0899760  -4.727 2.98e-06 ***
## RM                  0.0889603  0.0166638   5.339 1.43e-07 ***
## DIS                -0.0548735  0.0078618  -6.980 9.61e-12 ***
## new_RADmoderate     0.0128303  0.0268575   0.478 0.633063    
## new_RADremote       0.0910324  0.0381021   2.389 0.017262 *  
## new_RADvery remote  0.2565488  0.0632131   4.058 5.74e-05 ***
## TAX                -0.0005046  0.0001406  -3.589 0.000365 ***
## PTRATIO            -0.0375010  0.0052070  -7.202 2.24e-12 ***
## B                   0.0004160  0.0001077   3.862 0.000128 ***
## LSTAT              -0.0287973  0.0019156 -15.033  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1908 on 492 degrees of freedom
## Multiple R-squared:  0.7878, Adjusted R-squared:  0.7822 
## F-statistic: 140.5 on 13 and 492 DF,  p-value: < 2.2e-16

#choose not to use log(NOX) as this leads to a decrease in adjusted R squared and multiple R squared.

cor(log(boston_data$MEDV), boston_data$RM) #Second Highest correlation

## [1] 0.6320212

cor(log(boston_data$MEDV),exp(boston_data$RM))

## [1] 0.5567847

cor(log(boston_data$MEDV),sqrt(boston_data$RM))

## [1] 0.6227985

cor(log(boston_data$MEDV),log(boston_data$RM))

## [1] 0.6104374

cor(log(boston_data$MEDV),boston_data$RM^2) #Highest correlation

## [1] 0.6421332

adjusted.4 = lm(log(MEDV) ~ CRIM+ZN+CHAS+NOX+RM^2+DIS+new_RAD+TAX+PTRATIO+B+LSTAT, data=boston_data)
summary(adjusted.4)

## 
## Call:
## lm(formula = log(MEDV) ~ CRIM + ZN + CHAS + NOX + RM^2 + DIS + 
##     new_RAD + TAX + PTRATIO + B + LSTAT, data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73172 -0.09931 -0.01551  0.09265  0.86072 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.1598316  0.2090973  19.894  < 2e-16 ***
## CRIM               -0.0102589  0.0013227  -7.756 5.09e-14 ***
## ZN                  0.0010576  0.0005500   1.923 0.055082 .  
## CHAS                0.1033747  0.0344212   3.003 0.002807 ** 
## NOX                -0.7107207  0.1426591  -4.982 8.73e-07 ***
## RM                  0.0869690  0.0166403   5.226 2.56e-07 ***
## DIS                -0.0529920  0.0075558  -7.013 7.72e-12 ***
## new_RADmoderate     0.0095484  0.0267049   0.358 0.720833    
## new_RADremote       0.0889242  0.0379935   2.341 0.019656 *  
## new_RADvery remote  0.2517866  0.0629254   4.001 7.27e-05 ***
## TAX                -0.0004943  0.0001404  -3.520 0.000471 ***
## PTRATIO            -0.0388306  0.0052500  -7.396 6.08e-13 ***
## B                   0.0004096  0.0001075   3.810 0.000156 ***
## LSTAT              -0.0287647  0.0019081 -15.075  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1903 on 492 degrees of freedom
## Multiple R-squared:  0.7888, Adjusted R-squared:  0.7832 
## F-statistic: 141.3 on 13 and 492 DF,  p-value: < 2.2e-16

#In an R formula, RM^2 is parsed as RM (^ is a formula operator, not arithmetic), so this fit is identical to the adjusted model; I(RM^2) would be needed to test the squared term. The initial variable (RM) is kept.
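The identical output above follows from how R reads formulas: inside a formula, `^` denotes crossing of terms, so `x^2` collapses to `x`. Wrapping the expression in I() makes R evaluate it arithmetically. A minimal sketch on simulated data (the variable `rm` here is a stand-in, not the RM column):

```r
# Simulated stand-in data with a genuinely quadratic relationship.
set.seed(5)
n <- 150
d <- data.frame(rm = runif(n, 4, 8))
d$y <- 0.1 * d$rm^2 + rnorm(n, sd = 0.2)

fit_plain <- lm(y ~ rm, data = d)
fit_caret <- lm(y ~ rm^2, data = d)      # formula operator: same model as y ~ rm
fit_square <- lm(y ~ I(rm^2), data = d)  # arithmetic square of rm

same_as_plain <- isTRUE(all.equal(unname(coef(fit_plain)), unname(coef(fit_caret))))
print(same_as_plain)
print(names(coef(fit_caret)))
print(names(coef(fit_square)))
```

The coefficient names make the difference visible: the second model reports a term called `rm`, the third a term called `I(rm^2)`.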

cor(log(boston_data$MEDV), boston_data$DIS)

## [1] 0.3427803

cor(log(boston_data$MEDV),exp(boston_data$DIS))

## [1] 0.05389699

cor(log(boston_data$MEDV),sqrt(boston_data$DIS)) #Second Highest correlation

## [1] 0.3773333

cor(log(boston_data$MEDV),log(boston_data$DIS)) #Highest correlation

## [1] 0.4057211

cor(log(boston_data$MEDV),boston_data$DIS^2)

## [1] 0.2688372

adjusted.5 = lm(log(MEDV) ~ CRIM+ZN+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+LSTAT, data=boston_data)
summary(adjusted.5)

## 
## Call:
## lm(formula = log(MEDV) ~ CRIM + ZN + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + LSTAT, data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.75609 -0.09415 -0.00768  0.09544  0.81787 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.4583360  0.2146110  20.774  < 2e-16 ***
## CRIM               -0.0115581  0.0013159  -8.783  < 2e-16 ***
## ZN                  0.0006624  0.0004987   1.328 0.184747    
## CHAS                0.1006799  0.0337018   2.987 0.002954 ** 
## NOX                -0.9655928  0.1498398  -6.444 2.77e-10 ***
## RM                  0.0894800  0.0161368   5.545 4.80e-08 ***
## log(DIS)           -0.2679685  0.0314768  -8.513  < 2e-16 ***
## new_RADmoderate     0.0236005  0.0262176   0.900 0.368466    
## new_RADremote       0.0952942  0.0371774   2.563 0.010666 *  
## new_RADvery remote  0.2969762  0.0620014   4.790 2.21e-06 ***
## TAX                -0.0006100  0.0001388  -4.394 1.36e-05 ***
## PTRATIO            -0.0387065  0.0051397  -7.531 2.43e-13 ***
## B                   0.0003772  0.0001054   3.578 0.000381 ***
## LSTAT              -0.0294804  0.0018744 -15.728  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1863 on 492 degrees of freedom
## Multiple R-squared:  0.7975, Adjusted R-squared:  0.7922 
## F-statistic: 149.1 on 13 and 492 DF,  p-value: < 2.2e-16

#choose to use log(DIS) as this leads to an increase in adjusted R squared and multiple R squared.
#the multiple R squared for new adjusted model is 0.7975, the adjusted R squared is 0.7922.
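With log(MEDV) as the response, the coefficients have a percentage interpretation. The sketch below takes two estimates from the adjusted.5 summary above and converts them: a log-log coefficient (log(DIS)) acts as an elasticity, while a log-level coefficient (CHAS) gives a multiplicative change per unit. These are associations with the other predictors held fixed, not causal effects.

```r
# Coefficients copied from the adjusted.5 summary above.
b_log_dis <- -0.2679685
b_chas <- 0.1006799

# Log-log term: multiplying DIS by k multiplies the fitted MEDV by k^b.
pct_change_dis_10 <- (1.10^b_log_dis - 1) * 100

# Log-level term: moving CHAS from 0 to 1 multiplies the fitted MEDV by exp(b).
pct_change_chas <- (exp(b_chas) - 1) * 100

print(round(c(dis_plus_10pct = pct_change_dis_10, chas = pct_change_chas), 2))
```

Under this model, a 10% larger DIS goes with a fitted MEDV roughly 2.5% lower, and tracts bounding the river have a fitted MEDV roughly 10.6% higher.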

cor(log(boston_data$MEDV), boston_data$TAX) #Highest correlation

## [1] -0.5614657

cor(log(boston_data$MEDV),exp(boston_data$TAX))

## [1] NaN

cor(log(boston_data$MEDV),sqrt(boston_data$TAX)) #Second Highest correlation

## [1] -0.5610001

cor(log(boston_data$MEDV),log(boston_data$TAX))

## [1] -0.5571838

cor(log(boston_data$MEDV),boston_data$TAX^2)

## [1] -0.5564381

#as the initial form (TAX) has the highest correlation, no transformed regression needs to be checked

cor(log(boston_data$MEDV), boston_data$PTRATIO) #Second Highest correlation

## [1] -0.5017286

cor(log(boston_data$MEDV),exp(boston_data$PTRATIO))

## [1] -0.3957575

cor(log(boston_data$MEDV),sqrt(boston_data$PTRATIO))

## [1] -0.4974489

cor(log(boston_data$MEDV),log(boston_data$PTRATIO))

## [1] -0.492654

cor(log(boston_data$MEDV),boston_data$PTRATIO^2) #Highest correlation

## [1] -0.508736

adjusted.6 = lm(log(MEDV) ~ CRIM+ZN+CHAS+NOX+RM+DIS+new_RAD+TAX+PTRATIO^2+B+LSTAT, data=boston_data)
summary(adjusted.6)

## 
## Call:
## lm(formula = log(MEDV) ~ CRIM + ZN + CHAS + NOX + RM + DIS + 
##     new_RAD + TAX + PTRATIO^2 + B + LSTAT, data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73172 -0.09931 -0.01551  0.09265  0.86072 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.1598316  0.2090973  19.894  < 2e-16 ***
## CRIM               -0.0102589  0.0013227  -7.756 5.09e-14 ***
## ZN                  0.0010576  0.0005500   1.923 0.055082 .  
## CHAS                0.1033747  0.0344212   3.003 0.002807 ** 
## NOX                -0.7107207  0.1426591  -4.982 8.73e-07 ***
## RM                  0.0869690  0.0166403   5.226 2.56e-07 ***
## DIS                -0.0529920  0.0075558  -7.013 7.72e-12 ***
## new_RADmoderate     0.0095484  0.0267049   0.358 0.720833    
## new_RADremote       0.0889242  0.0379935   2.341 0.019656 *  
## new_RADvery remote  0.2517866  0.0629254   4.001 7.27e-05 ***
## TAX                -0.0004943  0.0001404  -3.520 0.000471 ***
## PTRATIO            -0.0388306  0.0052500  -7.396 6.08e-13 ***
## B                   0.0004096  0.0001075   3.810 0.000156 ***
## LSTAT              -0.0287647  0.0019081 -15.075  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1903 on 492 degrees of freedom
## Multiple R-squared:  0.7888, Adjusted R-squared:  0.7832 
## F-statistic: 141.3 on 13 and 492 DF,  p-value: < 2.2e-16

#In an R formula, PTRATIO^2 is parsed as PTRATIO, so this fit is identical to the adjusted model; I(PTRATIO^2) would be needed to test the squared term. The initial variable (PTRATIO) is kept.

cor(log(boston_data$MEDV), boston_data$B) #Highest correlation

## [1] 0.4023818

cor(log(boston_data$MEDV),exp(boston_data$B))

## [1] -0.06383093

cor(log(boston_data$MEDV),sqrt(boston_data$B))

## [1] 0.3908571

cor(log(boston_data$MEDV),log(boston_data$B))

## [1] 0.3434196

cor(log(boston_data$MEDV),boston_data$B^2) #Second Highest correlation

## [1] 0.4023335

#as the initial form (B) has the highest correlation, no transformed regression needs to be checked

cor(log(boston_data$MEDV), boston_data$LSTAT)

## [1] -0.8050341

cor(log(boston_data$MEDV),exp(boston_data$LSTAT))

## [1] -0.08879723

cor(log(boston_data$MEDV),sqrt(boston_data$LSTAT)) #Highest correlation

## [1] -0.8250238

cor(log(boston_data$MEDV),log(boston_data$LSTAT)) #Second Highest correlation

## [1] -0.82296

cor(log(boston_data$MEDV),boston_data$LSTAT^2)

## [1] -0.7236813

adjusted.7 = lm(log(MEDV) ~ CRIM+ZN+CHAS+NOX+RM+DIS+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT), data=boston_data)
summary(adjusted.7)

## 
## Call:
## lm(formula = log(MEDV) ~ CRIM + ZN + CHAS + NOX + RM + DIS + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT), data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.70904 -0.10012 -0.01165  0.09717  0.80827 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.6167751  0.2112535  21.854  < 2e-16 ***
## CRIM               -0.0107875  0.0012779  -8.442 3.53e-16 ***
## ZN                  0.0005469  0.0005346   1.023 0.306821    
## CHAS                0.0965853  0.0334307   2.889 0.004034 ** 
## NOX                -0.6493495  0.1389431  -4.673 3.83e-06 ***
## RM                  0.0682807  0.0165045   4.137 4.14e-05 ***
## DIS                -0.0501376  0.0073187  -6.851 2.20e-11 ***
## new_RADmoderate     0.0130072  0.0259149   0.502 0.615949    
## new_RADremote       0.0835698  0.0368844   2.266 0.023902 *  
## new_RADvery remote  0.2495852  0.0610890   4.086 5.13e-05 ***
## TAX                -0.0004736  0.0001363  -3.473 0.000559 ***
## PTRATIO            -0.0364401  0.0051083  -7.133 3.52e-12 ***
## B                   0.0003886  0.0001044   3.721 0.000221 ***
## sqrt(LSTAT)        -0.2299616  0.0139649 -16.467  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1848 on 492 degrees of freedom
## Multiple R-squared:  0.801,  Adjusted R-squared:  0.7957 
## F-statistic: 152.3 on 13 and 492 DF,  p-value: < 2.2e-16

#choose to use sqrt(LSTAT) as this leads to an increase in adjusted R squared and multiple R squared.
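Models that differ only in how one predictor is transformed are not nested, so an F-test does not apply; besides adjusted R squared, an information criterion such as AIC can be used when the response is the same. A minimal sketch on simulated data (names `fit_lin` and `fit_sqrt` are illustrative), using base R's AIC():

```r
# Simulated stand-in data where the response depends on sqrt(x).
set.seed(21)
n <- 250
x <- runif(n, 1, 35)
y <- 4 - 0.25 * sqrt(x) + rnorm(n, sd = 0.15)

fit_lin <- lm(y ~ x)
fit_sqrt <- lm(y ~ sqrt(x))

# Same response and same number of parameters; lower AIC indicates the better fit.
cmp <- AIC(fit_lin, fit_sqrt)
print(cmp)
```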

#treat variable as factor
adjusted.8= lm(log(MEDV) ~ CRIM+ZN+as.factor(CHAS)+NOX+RM+DIS+new_RAD+TAX+PTRATIO+B+LSTAT, data=boston_data)
summary(adjusted.8)

## 
## Call:
## lm(formula = log(MEDV) ~ CRIM + ZN + as.factor(CHAS) + NOX + 
##     RM + DIS + new_RAD + TAX + PTRATIO + B + LSTAT, data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73172 -0.09931 -0.01551  0.09265  0.86072 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.1598316  0.2090973  19.894  < 2e-16 ***
## CRIM               -0.0102589  0.0013227  -7.756 5.09e-14 ***
## ZN                  0.0010576  0.0005500   1.923 0.055082 .  
## as.factor(CHAS)1    0.1033747  0.0344212   3.003 0.002807 ** 
## NOX                -0.7107207  0.1426591  -4.982 8.73e-07 ***
## RM                  0.0869690  0.0166403   5.226 2.56e-07 ***
## DIS                -0.0529920  0.0075558  -7.013 7.72e-12 ***
## new_RADmoderate     0.0095484  0.0267049   0.358 0.720833    
## new_RADremote       0.0889242  0.0379935   2.341 0.019656 *  
## new_RADvery remote  0.2517866  0.0629254   4.001 7.27e-05 ***
## TAX                -0.0004943  0.0001404  -3.520 0.000471 ***
## PTRATIO            -0.0388306  0.0052500  -7.396 6.08e-13 ***
## B                   0.0004096  0.0001075   3.810 0.000156 ***
## LSTAT              -0.0287647  0.0019081 -15.075  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1903 on 492 degrees of freedom
## Multiple R-squared:  0.7888, Adjusted R-squared:  0.7832 
## F-statistic: 141.3 on 13 and 492 DF,  p-value: < 2.2e-16

#the adjusted R squared and multiple R squared are the same as for the adjusted model (a 0/1 variable gives the same fit whether numeric or factor), so the initial variable (CHAS) is kept.

adjusted.9= lm(log(MEDV) ~ CRIM+ZN+CHAS+NOX+RM+DIS+as.factor(new_RAD)+TAX+PTRATIO+B+LSTAT, data=boston_data)
summary(adjusted.9)

## 
## Call:
## lm(formula = log(MEDV) ~ CRIM + ZN + CHAS + NOX + RM + DIS + 
##     as.factor(new_RAD) + TAX + PTRATIO + B + LSTAT, data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73172 -0.09931 -0.01551  0.09265  0.86072 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    4.1598316  0.2090973  19.894  < 2e-16 ***
## CRIM                          -0.0102589  0.0013227  -7.756 5.09e-14 ***
## ZN                             0.0010576  0.0005500   1.923 0.055082 .  
## CHAS                           0.1033747  0.0344212   3.003 0.002807 ** 
## NOX                           -0.7107207  0.1426591  -4.982 8.73e-07 ***
## RM                             0.0869690  0.0166403   5.226 2.56e-07 ***
## DIS                           -0.0529920  0.0075558  -7.013 7.72e-12 ***
## as.factor(new_RAD)moderate     0.0095484  0.0267049   0.358 0.720833    
## as.factor(new_RAD)remote       0.0889242  0.0379935   2.341 0.019656 *  
## as.factor(new_RAD)very remote  0.2517866  0.0629254   4.001 7.27e-05 ***
## TAX                           -0.0004943  0.0001404  -3.520 0.000471 ***
## PTRATIO                       -0.0388306  0.0052500  -7.396 6.08e-13 ***
## B                              0.0004096  0.0001075   3.810 0.000156 ***
## LSTAT                         -0.0287647  0.0019081 -15.075  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1903 on 492 degrees of freedom
## Multiple R-squared:  0.7888, Adjusted R-squared:  0.7832 
## F-statistic: 141.3 on 13 and 492 DF,  p-value: < 2.2e-16

#The multiple and adjusted R-squared are identical to those of the adjusted model, so I keep the original variable new_RAD without as.factor().

#comparing to adjusted model:
#adjusted = lm(log(MEDV) ~ CRIM+ZN+CHAS+NOX+RM+DIS+new_RAD+TAX+PTRATIO+B+LSTAT, data=boston_data).
#Multiple R-squared: 0.7888, Adjusted R-squared: 0.7832
#new_adjusted_1: adjusted but with log(DIS). 
#Multiple R-squared: 0.7975, Adjusted R-squared: 0.7922
#new_adjusted_2: adjusted but with sqrt(LSTAT). 
#Multiple R-squared: 0.801, Adjusted R-squared: 0.7957
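The side-by-side comparison above can also be collected programmatically. The following is a minimal sketch, assuming the built-in `mtcars` data as a stand-in for `boston_data`; the model names and formulas are hypothetical and only illustrate the pattern of comparing one transformation at a time.

```r
# Stand-in data: mtcars ships with base R; the formulas are illustrative only.
models <- list(
  base    = lm(log(mpg) ~ wt + hp, data = mtcars),
  log_wt  = lm(log(mpg) ~ log(wt) + hp, data = mtcars),
  sqrt_hp = lm(log(mpg) ~ wt + sqrt(hp), data = mtcars)
)

# Pull both R-squared measures out of each summary into one table.
fit_table <- t(sapply(models, function(m) {
  s <- summary(m)
  c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared)
}))
print(round(fit_table, 4))
```

Each row of `fit_table` is one candidate model, so the best transformation can be read off a single table instead of several summaries.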

#Combining transformations from new_adjusted_1, new_adjusted_2 and new_adjusted_3:
new_adjusted= lm(log(MEDV) ~ CRIM+ZN+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT), data=boston_data)
summary(new_adjusted)

## 
## Call:
## lm(formula = log(MEDV) ~ CRIM + ZN + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT), data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73232 -0.09361 -0.00846  0.09728  0.71851 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.8674362  0.2166728  22.464  < 2e-16 ***
## CRIM               -0.0119960  0.0012794  -9.376  < 2e-16 ***
## ZN                  0.0001275  0.0004875   0.262 0.793795    
## CHAS                0.0947031  0.0328914   2.879 0.004160 ** 
## NOX                -0.8758379  0.1469406  -5.960 4.80e-09 ***
## RM                  0.0726707  0.0160787   4.520 7.77e-06 ***
## log(DIS)           -0.2457086  0.0305622  -8.040 6.76e-15 ***
## new_RADmoderate     0.0260755  0.0255715   1.020 0.308366    
## new_RADremote       0.0882572  0.0362684   2.433 0.015311 *  
## new_RADvery remote  0.2909149  0.0604938   4.809 2.02e-06 ***
## TAX                -0.0005788  0.0001355  -4.272 2.33e-05 ***
## PTRATIO            -0.0364292  0.0050254  -7.249 1.64e-12 ***
## B                   0.0003625  0.0001029   3.523 0.000467 ***
## sqrt(LSTAT)        -0.2320838  0.0137511 -16.877  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1818 on 492 degrees of freedom
## Multiple R-squared:  0.8073, Adjusted R-squared:  0.8022 
## F-statistic: 158.5 on 13 and 492 DF,  p-value: < 2.2e-16

#Multiple R-squared: 0.8073, Adjusted R-squared: 0.8022
#Both values are higher than for any of the models compared above, so I keep the combined transformations.
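Because the response is log(MEDV) (natural log, the default of R's `log`), a coefficient acts multiplicatively on MEDV. As a worked example using the CHAS estimate 0.0947 from the summary above, and holding the other predictors fixed:

$$
\frac{\widehat{\mathrm{MEDV}}\,(\mathrm{CHAS}=1)}{\widehat{\mathrm{MEDV}}\,(\mathrm{CHAS}=0)} = e^{0.0947} \approx 1.099
$$

That is, tracts bounding the river have a fitted median value about 9.9% higher. For log(DIS) the coefficient reads as an approximate elasticity: a 1% increase in DIS goes with roughly a 0.25% decrease in fitted MEDV.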

#ZN is not significant in this model (p = 0.794), so I remove it before analysing the interactions.
new_adjusted.1= lm(log(MEDV) ~ CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT), data=boston_data)
summary(new_adjusted.1)

## 
## Call:
## lm(formula = log(MEDV) ~ CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT), data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73167 -0.09490 -0.00898  0.09813  0.71951 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.8733362  0.2152914  22.636  < 2e-16 ***
## CRIM               -0.0119624  0.0012718  -9.406  < 2e-16 ***
## CHAS                0.0946272  0.0328591   2.880 0.004153 ** 
## NOX                -0.8805766  0.1456813  -6.045 2.96e-09 ***
## RM                  0.0729780  0.0160206   4.555 6.60e-06 ***
## log(DIS)           -0.2427818  0.0284126  -8.545  < 2e-16 ***
## new_RADmoderate     0.0255210  0.0254593   1.002 0.316631    
## new_RADremote       0.0868335  0.0358236   2.424 0.015713 *  
## new_RADvery remote  0.2901145  0.0603592   4.806 2.04e-06 ***
## TAX                -0.0005727  0.0001333  -4.295 2.10e-05 ***
## PTRATIO            -0.0368933  0.0046972  -7.854 2.53e-14 ***
## B                   0.0003627  0.0001028   3.529 0.000457 ***
## sqrt(LSTAT)        -0.2323090  0.0137111 -16.943  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1816 on 493 degrees of freedom
## Multiple R-squared:  0.8073, Adjusted R-squared:  0.8026 
## F-statistic: 172.1 on 12 and 493 DF,  p-value: < 2.2e-16

#Multiple R-squared:  0.8073    Adjusted R-squared:  0.8026 
#The multiple R-squared is unchanged and the adjusted R-squared increases slightly (0.8022 to 0.8026), so dropping ZN does not hurt the fit and gives a simpler model.
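The small rise in adjusted R-squared follows directly from its definition, which penalises the number of estimated coefficients p (excluding the intercept) for n observations:

$$
\bar{R}^2 = 1 - (1 - R^2)\,\frac{n-1}{n-p-1}
$$

With n = 506 and R² = 0.8073 in both fits, the two summaries above give

$$
p = 13:\; 1 - 0.1927 \cdot \frac{505}{492} \approx 0.8022,
\qquad
p = 12:\; 1 - 0.1927 \cdot \frac{505}{493} \approx 0.8026 .
$$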

new_adjusted.2= lm(log(MEDV) ~ CRIM+CHAS+NOX+RM+log(DIS)+as.factor(new_RAD)+TAX+PTRATIO+B+sqrt(LSTAT), data=boston_data)
summary(new_adjusted.2)

## 
## Call:
## lm(formula = log(MEDV) ~ CRIM + CHAS + NOX + RM + log(DIS) + 
##     as.factor(new_RAD) + TAX + PTRATIO + B + sqrt(LSTAT), data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73167 -0.09490 -0.00898  0.09813  0.71951 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    4.8733362  0.2152914  22.636  < 2e-16 ***
## CRIM                          -0.0119624  0.0012718  -9.406  < 2e-16 ***
## CHAS                           0.0946272  0.0328591   2.880 0.004153 ** 
## NOX                           -0.8805766  0.1456813  -6.045 2.96e-09 ***
## RM                             0.0729780  0.0160206   4.555 6.60e-06 ***
## log(DIS)                      -0.2427818  0.0284126  -8.545  < 2e-16 ***
## as.factor(new_RAD)moderate     0.0255210  0.0254593   1.002 0.316631    
## as.factor(new_RAD)remote       0.0868335  0.0358236   2.424 0.015713 *  
## as.factor(new_RAD)very remote  0.2901145  0.0603592   4.806 2.04e-06 ***
## TAX                           -0.0005727  0.0001333  -4.295 2.10e-05 ***
## PTRATIO                       -0.0368933  0.0046972  -7.854 2.53e-14 ***
## B                              0.0003627  0.0001028   3.529 0.000457 ***
## sqrt(LSTAT)                   -0.2323090  0.0137111 -16.943  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1816 on 493 degrees of freedom
## Multiple R-squared:  0.8073, Adjusted R-squared:  0.8026 
## F-statistic: 172.1 on 12 and 493 DF,  p-value: < 2.2e-16

#Multiple R-squared:  0.8073,   Adjusted R-squared:  0.8026 
#Wrapping new_RAD in as.factor() gives an identical fit, so new_RAD is used as is in the further analysis.

#Next, compare CHAS with as.factor(CHAS):
new_adjusted.3= lm(log(MEDV) ~ CRIM+as.factor(CHAS)+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT), data=boston_data)
summary(new_adjusted.3)

## 
## Call:
## lm(formula = log(MEDV) ~ CRIM + as.factor(CHAS) + NOX + RM + 
##     log(DIS) + new_RAD + TAX + PTRATIO + B + sqrt(LSTAT), data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73167 -0.09490 -0.00898  0.09813  0.71951 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.8733362  0.2152914  22.636  < 2e-16 ***
## CRIM               -0.0119624  0.0012718  -9.406  < 2e-16 ***
## as.factor(CHAS)1    0.0946272  0.0328591   2.880 0.004153 ** 
## NOX                -0.8805766  0.1456813  -6.045 2.96e-09 ***
## RM                  0.0729780  0.0160206   4.555 6.60e-06 ***
## log(DIS)           -0.2427818  0.0284126  -8.545  < 2e-16 ***
## new_RADmoderate     0.0255210  0.0254593   1.002 0.316631    
## new_RADremote       0.0868335  0.0358236   2.424 0.015713 *  
## new_RADvery remote  0.2901145  0.0603592   4.806 2.04e-06 ***
## TAX                -0.0005727  0.0001333  -4.295 2.10e-05 ***
## PTRATIO            -0.0368933  0.0046972  -7.854 2.53e-14 ***
## B                   0.0003627  0.0001028   3.529 0.000457 ***
## sqrt(LSTAT)        -0.2323090  0.0137111 -16.943  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1816 on 493 degrees of freedom
## Multiple R-squared:  0.8073, Adjusted R-squared:  0.8026 
## F-statistic: 172.1 on 12 and 493 DF,  p-value: < 2.2e-16

#Multiple R-squared:  0.8073,   Adjusted R-squared:  0.8026 
#The fit is again identical, so I use CHAS as is.
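The identical summaries for CHAS and as.factor(CHAS) are expected: a 0/1 indicator and a two-level factor produce the same design matrix column. A minimal sketch on simulated data (the data frame `d` and the columns `river`, `rooms` and `price` are hypothetical stand-ins, not the Boston data):

```r
set.seed(1)
# Simulated stand-in data: d, river, rooms and price are hypothetical names.
d <- data.frame(river = rbinom(100, 1, 0.3), rooms = rnorm(100, 6, 1))
d$price <- exp(3 + 0.1 * d$river + 0.08 * d$rooms + rnorm(100, sd = 0.2))

fit_numeric <- lm(log(price) ~ river + rooms, data = d)
fit_factor  <- lm(log(price) ~ as.factor(river) + rooms, data = d)

# Same design matrix apart from the column name, so the fits coincide.
same_fit <- isTRUE(all.equal(unname(coef(fit_numeric)), unname(coef(fit_factor))))
print(same_fit)
```

The same reasoning would apply to new_RAD if it is already stored as a factor or character column, which the coefficient names in the output suggest but the section does not state.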

#The model then becomes:
new_adjusted=lm(log(MEDV) ~ CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT), data=boston_data)

Interactions for regression model


vif(new_adjusted)

##                 GVIF Df GVIF^(1/(2*Df))
## CRIM        1.831963  1        1.353500
## CHAS        1.066333  1        1.032634
## NOX         4.362573  1        2.088677
## RM          1.939677  1        1.392723
## log(DIS)    3.597609  1        1.896737
## new_RAD     9.017633  3        1.442720
## TAX         7.730457  1        2.780370
## PTRATIO     1.583104  1        1.258215
## B           1.348178  1        1.161111
## sqrt(LSTAT) 2.804554  1        1.674680

#All predictors have a (G)VIF below 10, so there is no multicollinearity problem.
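For a single-coefficient term, the VIF reported by `vif()` is 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all other predictors. The sketch below reproduces that definition in base R on simulated predictors (`x1`, `x2`, `x3` are hypothetical); it does not call the `car` package. For a multi-level term such as new_RAD, `car` reports a generalised VIF, and its GVIF^(1/(2*Df)) column is the dimension-adjusted value intended for comparison across terms.

```r
set.seed(2)
# Simulated stand-in predictors; x2 is built to correlate with x1.
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing x_j on the rest.
vif_by_hand <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})
print(round(vif_by_hand, 3))
```

The correlated pair `x1` and `x2` gets inflated values, while the independent `x3` stays close to 1.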

new_adjusted.lm= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT))^2, data=boston_data)
anova(new_adjusted.lm)

Term                   Df   Sum Sq     Mean Sq    F value    Pr(>F)
CRIM                   1    23.5       23.5       1.22e+03   7.6e-128
CHAS                   1    1.41       1.41       72.9       2.35e-16
NOX                    1    9.58       9.58       497        8.62e-74
RM                     1    17.1       17.1       887        8.91e-107
log(DIS)               1    0.906      0.906      46.9       2.52e-11
new_RAD                3    0.355      0.118      6.12       0.000437
TAX                    1    1.08       1.08       56.1       3.84e-13
PTRATIO                1    3.42       3.42       177        3.98e-34
B                      1    1.26       1.26       65.3       6.49e-15
sqrt(LSTAT)            1    9.47       9.47       491        3.29e-73
CRIM:CHAS              1    0.288      0.288      14.9       0.00013
CRIM:NOX               1    1.18       1.18       61         4.31e-14
CRIM:RM                1    0.156      0.156      8.09       0.00467
CRIM:log(DIS)          1    0.000313   0.000313   0.0162     0.899
CRIM:new_RAD           3    0.00553    0.00184    0.0955     0.962
CRIM:TAX               1    0.0427     0.0427     2.21       0.138
CRIM:PTRATIO           1    0.00948    0.00948    0.491      0.484
CRIM:B                 1    0.113      0.113      5.86       0.0159
CRIM:sqrt(LSTAT)       1    0.114      0.114      5.93       0.0153
CHAS:NOX               1    0.397      0.397      20.6       7.52e-06
CHAS:RM                1    0.179      0.179      9.29       0.00244
CHAS:log(DIS)          1    0.000384   0.000384   0.0199     0.888
CHAS:new_RAD           3    0.0373     0.0124     0.644      0.587
CHAS:TAX               1    0.0121     0.0121     0.627      0.429
CHAS:PTRATIO           1    0.00205    0.00205    0.107      0.744
CHAS:B                 1    8.04e-05   8.04e-05   0.00417    0.949
CHAS:sqrt(LSTAT)       1    0.00752    0.00752    0.39       0.533
NOX:RM                 1    0.345      0.345      17.9       2.89e-05
NOX:log(DIS)           1    0.172      0.172      8.89       0.00303
NOX:new_RAD            3    0.309      0.103      5.34       0.00128
NOX:TAX                1    0.0875     0.0875     4.53       0.0338
NOX:PTRATIO            1    0.0959     0.0959     4.97       0.0263
NOX:B                  1    0.00108    0.00108    0.0558     0.813
NOX:sqrt(LSTAT)        1    0.71       0.71       36.8       2.87e-09
RM:log(DIS)            1    0.26       0.26       13.5       0.000274
RM:new_RAD             3    0.429      0.143      7.41       7.55e-05
RM:TAX                 1    0.0011     0.0011     0.0569     0.812
RM:PTRATIO             1    0.211      0.211      10.9       0.00102
RM:B                   1    0.121      0.121      6.25       0.0128
RM:sqrt(LSTAT)         1    0.182      0.182      9.44       0.00226
log(DIS):new_RAD       3    0.237      0.0791     4.1        0.00692
log(DIS):TAX           1    0.205      0.205      10.6       0.0012
log(DIS):PTRATIO       1    6.28e-05   6.28e-05   0.00325    0.955
log(DIS):B             1    0.0104     0.0104     0.539      0.463
log(DIS):sqrt(LSTAT)   1    0.252      0.252      13.1       0.000334
new_RAD:TAX            2    0.241      0.121      6.25       0.00211
new_RAD:PTRATIO        2    0.0153     0.00766    0.397      0.672
new_RAD:B              3    0.00979    0.00326    0.169      0.917
new_RAD:sqrt(LSTAT)    3    1.02       0.341      17.7       7.75e-11
TAX:PTRATIO            1    0.0885     0.0885     4.59       0.0327
TAX:B                  1    0.294      0.294      15.2       0.00011
TAX:sqrt(LSTAT)        1    0.0233     0.0233     1.21       0.273
PTRATIO:B              1    0.00381    0.00381    0.197      0.657
PTRATIO:sqrt(LSTAT)    1    0.0252     0.0252     1.31       0.253
B:sqrt(LSTAT)          1    0.0349     0.0349     1.81       0.18
Residuals              432  8.33       0.0193

summary(new_adjusted.lm)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT))^2, data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.54021 -0.06537 -0.00522  0.06633  0.63969 
## 
## Coefficients: (2 not defined because of singularities)
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     6.823e+00  3.159e+00   2.160 0.031347 *  
## CRIM                           -4.328e-01  4.209e-01  -1.028 0.304434    
## CHAS                            6.513e-01  9.237e-01   0.705 0.481168    
## NOX                            -5.875e+00  3.247e+00  -1.810 0.071059 .  
## RM                              2.031e-01  2.461e-01   0.825 0.409733    
## log(DIS)                       -2.521e+00  6.217e-01  -4.055 5.95e-05 ***
## new_RADmoderate                -4.135e-01  1.048e+00  -0.395 0.693225    
## new_RADremote                  -4.662e-03  2.313e+00  -0.002 0.998393    
## new_RADvery remote              4.584e+00  2.267e+00   2.022 0.043835 *  
## TAX                            -1.493e-02  5.256e-03  -2.841 0.004713 ** 
## PTRATIO                         8.012e-02  1.220e-01   0.657 0.511799    
## B                              -2.730e-03  3.538e-03  -0.771 0.440863    
## sqrt(LSTAT)                     7.351e-01  2.732e-01   2.690 0.007415 ** 
## CRIM:CHAS                       1.075e-01  2.849e-02   3.774 0.000183 ***
## CRIM:NOX                       -1.836e-01  4.584e-02  -4.006 7.28e-05 ***
## CRIM:RM                         1.079e-02  2.315e-03   4.661 4.21e-06 ***
## CRIM:log(DIS)                  -8.818e-03  8.239e-03  -1.070 0.285125    
## CRIM:new_RADmoderate            1.317e-01  3.136e-01   0.420 0.674809    
## CRIM:new_RADremote              1.325e-01  4.175e-01   0.317 0.751130    
## CRIM:new_RADvery remote        -2.550e-01  3.845e-01  -0.663 0.507514    
## CRIM:TAX                        1.537e-03  4.888e-04   3.145 0.001778 ** 
## CRIM:PTRATIO                   -1.568e-02  1.183e-02  -1.326 0.185642    
## CRIM:B                         -2.291e-05  8.410e-06  -2.724 0.006706 ** 
## CRIM:sqrt(LSTAT)                8.661e-03  2.972e-03   2.914 0.003751 ** 
## CHAS:NOX                       -6.486e-01  6.217e-01  -1.043 0.297404    
## CHAS:RM                        -1.691e-01  5.924e-02  -2.855 0.004504 ** 
## CHAS:log(DIS)                   1.786e-01  1.718e-01   1.039 0.299171    
## CHAS:new_RADmoderate            6.900e-02  1.196e-01   0.577 0.564275    
## CHAS:new_RADremote              3.297e-02  1.540e-01   0.214 0.830582    
## CHAS:new_RADvery remote        -4.867e-01  5.120e-01  -0.951 0.342342    
## CHAS:TAX                        6.076e-04  1.447e-03   0.420 0.674678    
## CHAS:PTRATIO                    2.055e-02  3.095e-02   0.664 0.507164    
## CHAS:B                          4.687e-04  6.916e-04   0.678 0.498321    
## CHAS:sqrt(LSTAT)               -7.577e-02  5.554e-02  -1.364 0.173215    
## NOX:RM                          4.408e-01  2.547e-01   1.731 0.084222 .  
## NOX:log(DIS)                    7.314e-01  3.573e-01   2.047 0.041280 *  
## NOX:new_RADmoderate             1.940e+00  7.514e-01   2.581 0.010173 *  
## NOX:new_RADremote               1.114e+00  2.439e+00   0.457 0.648027    
## NOX:new_RADvery remote          6.902e-01  1.551e+00   0.445 0.656418    
## NOX:TAX                         6.264e-03  3.544e-03   1.768 0.077830 .  
## NOX:PTRATIO                    -4.594e-02  1.077e-01  -0.427 0.669859    
## NOX:B                          -1.399e-03  1.598e-03  -0.876 0.381569    
## NOX:sqrt(LSTAT)                -1.610e-01  2.062e-01  -0.781 0.435479    
## RM:log(DIS)                     2.010e-01  5.113e-02   3.931 9.87e-05 ***
## RM:new_RADmoderate             -3.928e-02  5.786e-02  -0.679 0.497599    
## RM:new_RADremote               -7.971e-02  7.029e-02  -1.134 0.257406    
## RM:new_RADvery remote          -1.879e-01  1.516e-01  -1.239 0.215897    
## RM:TAX                          1.201e-05  3.626e-04   0.033 0.973591    
## RM:PTRATIO                     -1.603e-02  9.013e-03  -1.779 0.075921 .  
## RM:B                            6.834e-05  1.553e-04   0.440 0.660022    
## RM:sqrt(LSTAT)                 -6.832e-02  1.494e-02  -4.572 6.31e-06 ***
## log(DIS):new_RADmoderate        1.626e-02  9.750e-02   0.167 0.867634    
## log(DIS):new_RADremote         -2.135e-01  2.339e-01  -0.913 0.361891    
## log(DIS):new_RADvery remote    -5.666e-01  2.606e-01  -2.174 0.030251 *  
## log(DIS):TAX                    1.429e-03  5.442e-04   2.626 0.008956 ** 
## log(DIS):PTRATIO                1.140e-02  1.788e-02   0.637 0.524143    
## log(DIS):B                      1.271e-04  5.551e-04   0.229 0.818971    
## log(DIS):sqrt(LSTAT)            2.147e-02  4.117e-02   0.521 0.602370    
## new_RADmoderate:TAX             1.201e-03  4.296e-04   2.797 0.005395 ** 
## new_RADremote:TAX               2.065e-03  1.617e-03   1.277 0.202158    
## new_RADvery remote:TAX                 NA         NA      NA       NA    
## new_RADmoderate:PTRATIO        -1.255e-02  1.483e-02  -0.846 0.397932    
## new_RADremote:PTRATIO           1.664e-02  4.592e-02   0.362 0.717252    
## new_RADvery remote:PTRATIO             NA         NA      NA       NA    
## new_RADmoderate:B              -8.819e-04  1.714e-03  -0.514 0.607235    
## new_RADremote:B                -1.466e-03  3.287e-03  -0.446 0.655681    
## new_RADvery remote:B           -5.008e-03  2.715e-03  -1.844 0.065831 .  
## new_RADmoderate:sqrt(LSTAT)     4.232e-03  4.836e-02   0.088 0.930302    
## new_RADremote:sqrt(LSTAT)       3.475e-02  7.153e-02   0.486 0.627341    
## new_RADvery remote:sqrt(LSTAT) -1.248e-01  1.343e-01  -0.929 0.353202    
## TAX:PTRATIO                     1.889e-04  8.503e-05   2.221 0.026840 *  
## TAX:B                           1.528e-05  5.994e-06   2.549 0.011161 *  
## TAX:sqrt(LSTAT)                -4.042e-04  3.279e-04  -1.233 0.218369    
## PTRATIO:B                      -2.213e-05  1.119e-04  -0.198 0.843419    
## PTRATIO:sqrt(LSTAT)            -9.094e-03  7.565e-03  -1.202 0.229964    
## B:sqrt(LSTAT)                  -2.433e-04  1.810e-04  -1.344 0.179590    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1389 on 432 degrees of freedom
## Multiple R-squared:  0.9012, Adjusted R-squared:  0.8845 
## F-statistic: 53.99 on 73 and 432 DF,  p-value: < 2.2e-16
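The two NA rows (new_RADvery remote:TAX and new_RADvery remote:PTRATIO) are coefficients that `lm()` could not estimate because their columns are exact linear combinations of other columns. One plausible cause, which this section does not confirm, is that TAX and PTRATIO each take a single value within the "very remote" group. The sketch below reproduces that situation on simulated data (`g`, `x` and `y` are hypothetical names):

```r
set.seed(3)
# Simulated stand-in: x varies inside group "a" but is constant inside "b".
g <- factor(rep(c("a", "b"), each = 30))
x <- c(rnorm(30, 10, 2), rep(5, 30))
y <- 1 + 0.5 * x + rnorm(60)

fit <- lm(y ~ g * x)

# Coefficients that lm() could not estimate are returned as NA.
aliased <- names(which(is.na(coef(fit))))
print(aliased)
```

Within group "b" the interaction column equals 5 times the group dummy, so it carries no separate information and is dropped, which mirrors the "not defined because of singularities" note in the summary.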

#Looking only at the 0.05 level in the ANOVA table above, the significant interactions are: CRIM:CHAS, CRIM:NOX, CRIM:RM, CRIM:B, CRIM:sqrt(LSTAT), CHAS:NOX, CHAS:RM, NOX:RM, NOX:log(DIS), NOX:new_RAD, NOX:TAX, NOX:PTRATIO, NOX:sqrt(LSTAT), RM:log(DIS), RM:new_RAD, RM:PTRATIO, RM:B, RM:sqrt(LSTAT), log(DIS):new_RAD, log(DIS):TAX, log(DIS):sqrt(LSTAT), new_RAD:TAX, new_RAD:sqrt(LSTAT), TAX:PTRATIO, TAX:B
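The list of candidate interactions can be extracted from the ANOVA table rather than read off by eye. This is a minimal sketch, assuming `mtcars` as a stand-in for `boston_data` and an illustrative formula. Note that `anova()` on a single `lm` fit uses sequential sums of squares, so each p-value depends on the order of the terms and can differ from the t-test in `summary()`; in the output above, CHAS:NOX has p = 7.52e-06 in the ANOVA table but p = 0.297 in the coefficient table.

```r
# Stand-in model on mtcars; the formula is illustrative only.
fit <- lm(log(mpg) ~ (wt + hp + qsec)^2, data = mtcars)
tab <- anova(fit)

# Keep interaction rows (names containing ":") with Pr(>F) below 0.05.
is_interaction <- grepl(":", rownames(tab), fixed = TRUE)
p_values <- tab[["Pr(>F)"]]
significant <- rownames(tab)[is_interaction & !is.na(p_values) & p_values < 0.05]
print(significant)
```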

summary(new_adjusted)

## 
## Call:
## lm(formula = log(MEDV) ~ CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT), data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73167 -0.09490 -0.00898  0.09813  0.71951 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.8733362  0.2152914  22.636  < 2e-16 ***
## CRIM               -0.0119624  0.0012718  -9.406  < 2e-16 ***
## CHAS                0.0946272  0.0328591   2.880 0.004153 ** 
## NOX                -0.8805766  0.1456813  -6.045 2.96e-09 ***
## RM                  0.0729780  0.0160206   4.555 6.60e-06 ***
## log(DIS)           -0.2427818  0.0284126  -8.545  < 2e-16 ***
## new_RADmoderate     0.0255210  0.0254593   1.002 0.316631    
## new_RADremote       0.0868335  0.0358236   2.424 0.015713 *  
## new_RADvery remote  0.2901145  0.0603592   4.806 2.04e-06 ***
## TAX                -0.0005727  0.0001333  -4.295 2.10e-05 ***
## PTRATIO            -0.0368933  0.0046972  -7.854 2.53e-14 ***
## B                   0.0003627  0.0001028   3.529 0.000457 ***
## sqrt(LSTAT)        -0.2323090  0.0137111 -16.943  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1816 on 493 degrees of freedom
## Multiple R-squared:  0.8073, Adjusted R-squared:  0.8026 
## F-statistic: 172.1 on 12 and 493 DF,  p-value: < 2.2e-16

#Multiple R-squared of new_adjusted model: 0.8073, Adjusted R-squared: 0.8026

#CHAS*CRIM:
interaction.1= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+CHAS*CRIM), data=boston_data)
summary(interaction.1)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + CHAS * CRIM), 
##     data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72272 -0.09683 -0.00960  0.09552  0.72683 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.9015941  0.2138068  22.925  < 2e-16 ***
## CRIM               -0.0118116  0.0012628  -9.354  < 2e-16 ***
## CHAS                0.0236349  0.0403950   0.585 0.558753    
## NOX                -0.9398141  0.1458980  -6.442 2.82e-10 ***
## RM                  0.0752964  0.0159135   4.732 2.92e-06 ***
## log(DIS)           -0.2442258  0.0281930  -8.663  < 2e-16 ***
## new_RADmoderate     0.0295099  0.0252943   1.167 0.243913    
## new_RADremote       0.0933788  0.0356094   2.622 0.009004 ** 
## new_RADvery remote  0.2914260  0.0598855   4.866 1.53e-06 ***
## TAX                -0.0006041  0.0001327  -4.552 6.70e-06 ***
## PTRATIO            -0.0380070  0.0046752  -8.129 3.53e-15 ***
## B                   0.0003459  0.0001021   3.387 0.000763 ***
## sqrt(LSTAT)        -0.2242410  0.0138706 -16.167  < 2e-16 ***
## CRIM:CHAS           0.0392307  0.0131814   2.976 0.003062 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1802 on 492 degrees of freedom
## Multiple R-squared:  0.8107, Adjusted R-squared:  0.8057 
## F-statistic:   162 on 13 and 492 DF,  p-value: < 2.2e-16

#This leads to an increase in Multiple R-squared and Adjusted R-squared.
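With the CRIM:CHAS term in the model, the fitted effect of CRIM on log(MEDV) depends on CHAS. Using the estimates from the interaction.1 summary above:

$$
\frac{\partial \log(\mathrm{MEDV})}{\partial\,\mathrm{CRIM}}
= -0.0118 + 0.0392\,\mathrm{CHAS}
= \begin{cases}
-0.0118, & \mathrm{CHAS} = 0 \\
\phantom{-}0.0274, & \mathrm{CHAS} = 1
\end{cases}
$$

The CHAS main effect (0.0236) is now the river effect at CRIM = 0 only, which is why it is no longer significant on its own in this fit.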

vif(interaction.1)

##                 GVIF Df GVIF^(1/(2*Df))
## CRIM        1.834917  1        1.354591
## CHAS        1.637210  1        1.279535
## NOX         4.445303  1        2.108389
## RM          1.944336  1        1.394394
## log(DIS)    3.598675  1        1.897017
## new_RAD     9.068234  3        1.444066
## TAX         7.779508  1        2.789177
## PTRATIO     1.593311  1        1.262264
## B           1.352292  1        1.162881
## sqrt(LSTAT) 2.915928  1        1.707609
## CRIM:CHAS   1.729420  1        1.315074

#The (G)VIF of every term is below 10, so there is no multicollinearity problem.

#CRIM:NOX:
interaction.2= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+CRIM*NOX), data=boston_data)
summary(interaction.2)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + CRIM * NOX), 
##     data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.70348 -0.09181 -0.00577  0.08928  0.70415 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.3980638  0.2227170  19.747  < 2e-16 ***
## CRIM                0.0945185  0.0178377   5.299 1.76e-07 ***
## CHAS                0.0985334  0.0317639   3.102  0.00203 ** 
## NOX                -0.3929359  0.1626808  -2.415  0.01608 *  
## RM                  0.0843603  0.0155998   5.408 9.97e-08 ***
## log(DIS)           -0.1793498  0.0294350  -6.093 2.24e-09 ***
## new_RADmoderate     0.0099116  0.0247435   0.401  0.68891    
## new_RADremote       0.0731898  0.0346973   2.109  0.03542 *  
## new_RADvery remote  0.2530258  0.0586636   4.313 1.95e-05 ***
## TAX                -0.0006042  0.0001290  -4.685 3.63e-06 ***
## PTRATIO            -0.0363985  0.0045405  -8.016 7.98e-15 ***
## B                   0.0004790  0.0001012   4.732 2.92e-06 ***
## sqrt(LSTAT)        -0.2221490  0.0133597 -16.628  < 2e-16 ***
## CRIM:NOX           -0.1572708  0.0262834  -5.984 4.20e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1755 on 492 degrees of freedom
## Multiple R-squared:  0.8203, Adjusted R-squared:  0.8156 
## F-statistic: 172.8 on 13 and 492 DF,  p-value: < 2.2e-16

#This leads to an increase in Multiple R-squared and Adjusted R-squared.

vif(interaction.2)

##                   GVIF Df GVIF^(1/(2*Df))
## CRIM        385.825226  1       19.642434
## CHAS          1.066783  1        1.032852
## NOX           5.824166  1        2.413331
## RM            1.968954  1        1.403194
## log(DIS)      4.133779  1        2.033170
## new_RAD       9.146155  3        1.446127
## TAX           7.743353  1        2.782688
## PTRATIO       1.583629  1        1.258423
## B             1.399741  1        1.183107
## sqrt(LSTAT)   2.850601  1        1.688372
## CRIM:NOX    382.496578  1       19.557520

#The VIFs of CRIM and CRIM:NOX jump to about 386 and 382, far above 10, so adding this interaction introduces severe multicollinearity.
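A large VIF on a product term often arises because the raw product is strongly correlated with its own components when their means are far from zero. Centring the predictors before forming the product is a common remedy; it is not what this analysis does, and it changes the meaning of the main-effect coefficients. The sketch below illustrates the effect on simulated predictors (`a` and `b` are hypothetical names, and `vif_product` is a helper defined here, not a library function):

```r
set.seed(4)
# Simulated stand-in predictors with means far from zero (hypothetical names).
n <- 300
a <- rnorm(n, mean = 10, sd = 1)
b <- rnorm(n, mean = 5, sd = 1)

# VIF of the product term: 1 / (1 - R^2) from regressing it on its components.
vif_product <- function(u, v) {
  1 / (1 - summary(lm(I(u * v) ~ u + v))$r.squared)
}

vif_raw      <- vif_product(a, b)
vif_centered <- vif_product(a - mean(a), b - mean(b))
print(c(raw = vif_raw, centered = vif_centered))
```

Whether centring would bring the CRIM:NOX term below the threshold used here would have to be checked on the actual data.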

#CRIM:RM:
interaction.3= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+CRIM*RM), data=boston_data)
summary(interaction.3)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + CRIM * RM), data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72452 -0.09186 -0.00663  0.09393  0.71893 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.7560043  0.2324932  20.457  < 2e-16 ***
## CRIM               -0.0013897  0.0080465  -0.173 0.862953    
## CHAS                0.0955921  0.0328414   2.911 0.003769 ** 
## NOX                -0.8530958  0.1470253  -5.802 1.17e-08 ***
## RM                  0.0834559  0.0178399   4.678 3.75e-06 ***
## log(DIS)           -0.2373741  0.0286798  -8.277 1.20e-15 ***
## new_RADmoderate     0.0265430  0.0254510   1.043 0.297505    
## new_RADremote       0.0830372  0.0359091   2.312 0.021166 *  
## new_RADvery remote  0.2830133  0.0605477   4.674 3.82e-06 ***
## TAX                -0.0005611  0.0001335  -4.203 3.14e-05 ***
## PTRATIO            -0.0358276  0.0047614  -7.525 2.54e-13 ***
## B                   0.0003683  0.0001028   3.583 0.000374 ***
## sqrt(LSTAT)        -0.2310504  0.0137330 -16.824  < 2e-16 ***
## CRIM:RM            -0.0017535  0.0013178  -1.331 0.183920    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1815 on 492 degrees of freedom
## Multiple R-squared:  0.8079, Adjusted R-squared:  0.8029 
## F-statistic: 159.2 on 13 and 492 DF,  p-value: < 2.2e-16

#Multiple R-squared and Adjusted R-squared increase only marginally (0.8079 and 0.8029 versus 0.8073 and 0.8026), and the CRIM:RM term is not significant (p = 0.184).

vif(interaction.3)

##                  GVIF Df GVIF^(1/(2*Df))
## CRIM        73.448490  1        8.570209
## CHAS         1.066853  1        1.032886
## NOX          4.450382  1        2.109593
## RM           2.408999  1        1.552095
## log(DIS)     3.671326  1        1.916070
## new_RAD      9.273408  3        1.449461
## TAX          7.763655  1        2.786334
## PTRATIO      1.629199  1        1.276401
## B            1.350416  1        1.162074
## sqrt(LSTAT)  2.817921  1        1.678666
## CRIM:RM     70.547312  1        8.399245

#The VIFs of CRIM and CRIM:RM rise to about 73 and 71, far above 10, so adding this interaction introduces a multicollinearity problem.

#CRIM:B:
interaction.4= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+CRIM*B), data=boston_data)
summary(interaction.4)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + CRIM * B), data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72555 -0.09626 -0.00847  0.09247  0.71429 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.701e+00  2.192e-01  21.447  < 2e-16 ***
## CRIM               -5.998e-03  2.176e-03  -2.756 0.006066 ** 
## CHAS                9.163e-02  3.253e-02   2.817 0.005050 ** 
## NOX                -8.419e-01  1.446e-01  -5.821 1.06e-08 ***
## RM                  7.700e-02  1.590e-02   4.842 1.72e-06 ***
## log(DIS)           -2.418e-01  2.812e-02  -8.596  < 2e-16 ***
## new_RADmoderate     2.597e-02  2.520e-02   1.031 0.303173    
## new_RADremote       8.782e-02  3.546e-02   2.477 0.013592 *  
## new_RADvery remote  3.078e-01  5.997e-02   5.133 4.11e-07 ***
## TAX                -5.623e-04  1.320e-04  -4.259 2.46e-05 ***
## PTRATIO            -3.751e-02  4.653e-03  -8.062 5.77e-15 ***
## B                   6.679e-04  1.364e-04   4.897 1.32e-06 ***
## sqrt(LSTAT)        -2.267e-01  1.367e-02 -16.577  < 2e-16 ***
## CRIM:B             -2.302e-05  6.853e-06  -3.360 0.000841 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1798 on 492 degrees of freedom
## Multiple R-squared:  0.8116, Adjusted R-squared:  0.8066 
## F-statistic:   163 on 13 and 492 DF,  p-value: < 2.2e-16

#This leads to an increase in Multiple R-squared and Adjusted R-squared.

vif(interaction.4)

##                 GVIF Df GVIF^(1/(2*Df))
## CRIM        5.475901  1        2.340064
## CHAS        1.067134  1        1.033022
## NOX         4.390376  1        2.095322
## RM          1.950714  1        1.396680
## log(DIS)    3.598036  1        1.896849
## new_RAD     9.119347  3        1.445420
## TAX         7.734761  1        2.781144
## PTRATIO     1.585547  1        1.259185
## B           2.422611  1        1.556474
## sqrt(LSTAT) 2.847447  1        1.687438
## CRIM:B      5.201599  1        2.280701

#The (G)VIF of every term is below 10, so there is no multicollinearity problem.

#CRIM:sqrt(LSTAT):
interaction.5= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+CRIM*sqrt(LSTAT)), data=boston_data)
summary(interaction.5)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + CRIM * sqrt(LSTAT)), 
##     data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73876 -0.09402 -0.00674  0.09703  0.81154 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.8576787  0.2127499  22.833  < 2e-16 ***
## CRIM                0.0096815  0.0061198   1.582 0.114293    
## CHAS                0.0873850  0.0325262   2.687 0.007463 ** 
## NOX                -0.9173439  0.1442908  -6.358 4.68e-10 ***
## RM                  0.0737331  0.0158296   4.658 4.12e-06 ***
## log(DIS)           -0.2397105  0.0280842  -8.535  < 2e-16 ***
## new_RADmoderate     0.0231938  0.0251618   0.922 0.357092    
## new_RADremote       0.0901019  0.0354049   2.545 0.011235 *  
## new_RADvery remote  0.2914276  0.0596354   4.887 1.39e-06 ***
## TAX                -0.0006065  0.0001321  -4.592 5.58e-06 ***
## PTRATIO            -0.0387138  0.0046681  -8.293 1.06e-15 ***
## B                   0.0003849  0.0001017   3.783 0.000174 ***
## sqrt(LSTAT)        -0.2130673  0.0145554 -14.638  < 2e-16 ***
## CRIM:sqrt(LSTAT)   -0.0047327  0.0013097  -3.614 0.000333 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1794 on 492 degrees of freedom
## Multiple R-squared:  0.8122, Adjusted R-squared:  0.8073 
## F-statistic: 163.7 on 13 and 492 DF,  p-value: < 2.2e-16

#This leads to an increase in Multiple R-squared and Adjusted R-squared.

vif(interaction.5)

##                       GVIF Df GVIF^(1/(2*Df))
## CRIM             43.456956  1        6.592189
## CHAS              1.070396  1        1.034600
## NOX               4.384372  1        2.093889
## RM                1.940015  1        1.392844
## log(DIS)          3.600907  1        1.897606
## new_RAD           9.045874  3        1.443472
## TAX               7.769295  1        2.787346
## PTRATIO           1.601759  1        1.265606
## B                 1.353121  1        1.163237
## sqrt(LSTAT)       3.237860  1        1.799405
## CRIM:sqrt(LSTAT) 43.350966  1        6.584145

#The GVIFs of CRIM and CRIM:sqrt(LSTAT) jump above 40, so the multicollinearity issue remains unresolved.
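
Part of this inflation is structural: a product term is almost a linear combination of its two parent variables when those variables have means far from zero. One common remedy, not used in this report, is to centre the predictors before forming the product. The sketch below uses simulated data and a hand-rolled VIF (the diagonal of the inverse correlation matrix) so that it runs without the car package; x1, x2 and vif_manual are hypothetical names.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n, mean = 5, sd = 1)     # simulated predictor with a non-zero mean
x2 <- rnorm(n, mean = 3, sd = 0.5)   # second simulated predictor

# VIF of every column of a numeric matrix:
# the diagonal of the inverse of the correlation matrix
vif_manual <- function(X) diag(solve(cor(X)))

# Raw product term
raw <- cbind(x1, x2, x1x2 = x1 * x2)

# Centre each predictor first, then form the product
x1c <- x1 - mean(x1)
x2c <- x2 - mean(x2)
centered <- cbind(x1c, x2c, x1cx2c = x1c * x2c)

vif_manual(raw)        # VIFs well above 10
vif_manual(centered)   # VIFs close to 1
```

Centring leaves the fitted values and the interaction coefficient unchanged; only the main-effect coefficients change meaning (they become slopes at the mean of the other variable). For a heavily skewed predictor the reduction may be smaller than in this simulated case.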

#CHAS:NOX:
interaction.6= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+CHAS*NOX), data=boston_data)
summary(interaction.6)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + CHAS * NOX), 
##     data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73225 -0.09537 -0.00863  0.09803  0.72279 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.8591990  0.2166511  22.429  < 2e-16 ***
## CRIM               -0.0119647  0.0012726  -9.402  < 2e-16 ***
## CHAS                0.1803998  0.1433327   1.259 0.208769    
## NOX                -0.8480239  0.1550913  -5.468 7.26e-08 ***
## RM                  0.0722029  0.0160802   4.490 8.87e-06 ***
## log(DIS)           -0.2399259  0.0288075  -8.329 8.18e-16 ***
## new_RADmoderate     0.0243362  0.0255482   0.953 0.341280    
## new_RADremote       0.0854565  0.0359161   2.379 0.017724 *  
## new_RADvery remote  0.2857220  0.0608184   4.698 3.41e-06 ***
## TAX                -0.0005644  0.0001341  -4.209 3.05e-05 ***
## PTRATIO            -0.0368667  0.0047004  -7.843 2.75e-14 ***
## B                   0.0003632  0.0001029   3.531 0.000453 ***
## sqrt(LSTAT)        -0.2336540  0.0138931 -16.818  < 2e-16 ***
## CHAS:NOX           -0.1457946  0.2371373  -0.615 0.538964    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1817 on 492 degrees of freedom
## Multiple R-squared:  0.8074, Adjusted R-squared:  0.8023 
## F-statistic: 158.7 on 13 and 492 DF,  p-value: < 2.2e-16

#The Multiple R-squared and Adjusted R-squared remain essentially the same, and the CHAS:NOX term is not significant (p = 0.539).
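
A useful check when reading these comparisons: Multiple R-squared can never fall when a term is added, and when a single coefficient is added the Adjusted R-squared rises exactly when that coefficient has |t| > 1. The CHAS:NOX term above has t = -0.615, which is why the Adjusted R-squared does not improve. The sketch below illustrates the rule on simulated data; all variable names are hypothetical.

```r
set.seed(42)
n <- 200
d <- data.frame(x = rnorm(n), z = rnorm(n))
d$y <- 1 + 0.5 * d$x + 0.3 * d$z + rnorm(n)   # no true interaction

base <- lm(y ~ x + z, data = d)
full <- update(base, . ~ . + x:z)             # add one interaction term

t_int    <- coef(summary(full))["x:z", "t value"]
adj_gain <- summary(full)$adj.r.squared - summary(base)$adj.r.squared

c(t_value = t_int, adj_r2_gain = adj_gain)

# The two conditions always agree for a single added coefficient
(abs(t_int) > 1) == (adj_gain > 0)
```

An increase in Adjusted R-squared is therefore a much weaker criterion than significance at the 5% level, which requires |t| of about 1.96.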

vif(interaction.6)

##                  GVIF Df GVIF^(1/(2*Df))
## CRIM         1.831979  1        1.353506
## CHAS        20.263983  1        4.501553
## NOX          4.938117  1        2.222187
## RM           1.951674  1        1.397023
## log(DIS)     3.693647  1        1.921886
## new_RAD      9.148684  3        1.446194
## TAX          7.809604  1        2.794567
## PTRATIO      1.583238  1        1.258268
## B            1.348265  1        1.161148
## sqrt(LSTAT)  2.875859  1        1.695836
## CHAS:NOX    20.745527  1        4.554726

#CHAS and CHAS:NOX both have GVIF > 10 (about 20), so the multicollinearity issue remains unresolved.

#CHAS:RM:
interaction.7= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+CHAS*RM), data=boston_data)
summary(interaction.7)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + CHAS * RM), data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73335 -0.09401 -0.00800  0.09754  0.72563 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.8470951  0.2165015  22.388  < 2e-16 ***
## CRIM               -0.0119311  0.0012718  -9.382  < 2e-16 ***
## CHAS                0.3727033  0.2499767   1.491 0.136615    
## NOX                -0.8907841  0.1459268  -6.104 2.10e-09 ***
## RM                  0.0775053  0.0165167   4.693 3.50e-06 ***
## log(DIS)           -0.2443822  0.0284409  -8.593  < 2e-16 ***
## new_RADmoderate     0.0252031  0.0254542   0.990 0.322595    
## new_RADremote       0.0849068  0.0358553   2.368 0.018268 *  
## new_RADvery remote  0.2887681  0.0603553   4.784 2.27e-06 ***
## TAX                -0.0005632  0.0001336  -4.217 2.95e-05 ***
## PTRATIO            -0.0366165  0.0047025  -7.787 4.10e-14 ***
## B                   0.0003627  0.0001028   3.529 0.000456 ***
## sqrt(LSTAT)        -0.2331912  0.0137301 -16.984  < 2e-16 ***
## CHAS:RM            -0.0427380  0.0380862  -1.122 0.262350    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1816 on 492 degrees of freedom
## Multiple R-squared:  0.8077, Adjusted R-squared:  0.8027 
## F-statistic:   159 on 13 and 492 DF,  p-value: < 2.2e-16

#This leads to an increase in Multiple R-squared and Adjusted R-squared.

vif(interaction.7)

##                  GVIF Df GVIF^(1/(2*Df))
## CRIM         1.832848  1        1.353827
## CHAS        61.746126  1        7.857870
## NOX          4.379590  1        2.092747
## RM           2.062755  1        1.436229
## log(DIS)     3.606679  1        1.899126
## new_RAD      9.041741  3        1.443362
## TAX          7.761417  1        2.785932
## PTRATIO      1.587472  1        1.259949
## B            1.348178  1        1.161111
## sqrt(LSTAT)  2.813779  1        1.677432
## CHAS:RM     62.072834  1        7.878631

#CHAS and CHAS:RM both have GVIF > 10 (about 62), so the multicollinearity issue remains unresolved.

#NOX:log(DIS):
interaction.8= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+NOX*log(DIS)), data=boston_data)
summary(interaction.8)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + NOX * log(DIS)), 
##     data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71852 -0.09357 -0.01196  0.09495  0.73233 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.0303308  0.2447807  20.550  < 2e-16 ***
## CRIM               -0.0115148  0.0013137  -8.765  < 2e-16 ***
## CHAS                0.0958219  0.0328442   2.917 0.003690 ** 
## NOX                -1.0848767  0.2104558  -5.155 3.68e-07 ***
## RM                  0.0691593  0.0162577   4.254 2.51e-05 ***
## log(DIS)           -0.3674205  0.0969786  -3.789 0.000170 ***
## new_RADmoderate     0.0178333  0.0260736   0.684 0.494322    
## new_RADremote       0.0827548  0.0359227   2.304 0.021656 *  
## new_RADvery remote  0.2736225  0.0615454   4.446 1.08e-05 ***
## TAX                -0.0005347  0.0001362  -3.926 9.88e-05 ***
## PTRATIO            -0.0392678  0.0050148  -7.830 3.01e-14 ***
## B                   0.0003712  0.0001029   3.607 0.000341 ***
## sqrt(LSTAT)        -0.2328528  0.0137059 -16.989  < 2e-16 ***
## NOX:log(DIS)        0.2673241  0.1988870   1.344 0.179535    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1815 on 492 degrees of freedom
## Multiple R-squared:  0.808,  Adjusted R-squared:  0.8029 
## F-statistic: 159.2 on 13 and 492 DF,  p-value: < 2.2e-16

#This leads to an increase in Multiple R-squared and Adjusted R-squared.
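
Because the response is log(MEDV), the estimates in the summaries are on the log scale. For a predictor that is not part of the interaction, a one-unit increase multiplies the predicted MEDV by exp(coefficient). The sketch below converts two estimates copied from summary(interaction.8) into approximate percentage changes; it is an interpretation aid only, not part of the original analysis.

```r
# Estimates copied from the summary(interaction.8) output above
b <- c(CHAS = 0.0958219, PTRATIO = -0.0392678)

# Percentage change in predicted MEDV per one-unit increase,
# holding the other predictors fixed
pct_change <- 100 * (exp(b) - 1)
round(pct_change, 2)
```

Under this model, tracts bounding the river are associated with roughly 10% higher median values, and each additional pupil per teacher with roughly 3.9% lower values.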

vif(interaction.8)

##                   GVIF Df GVIF^(1/(2*Df))
## CRIM          1.957811  1        1.399218
## CHAS          1.067114  1        1.033012
## NOX           9.119413  1        3.019837
## RM            2.000778  1        1.414489
## log(DIS)     41.981191  1        6.479289
## new_RAD       9.552825  3        1.456650
## TAX           8.079162  1        2.842387
## PTRATIO       1.807404  1        1.344397
## B             1.353241  1        1.163289
## sqrt(LSTAT)   2.807000  1        1.675410
## NOX:log(DIS) 23.096566  1        4.805889

#log(DIS) and NOX:log(DIS) have GVIF > 10 (about 42 and 23), so the multicollinearity issue remains unresolved.

#NOX:new_RAD:
interaction.9= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+NOX*new_RAD), data=boston_data)
summary(interaction.9)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + NOX * new_RAD), 
##     data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71860 -0.09946 -0.00810  0.09770  0.67599 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5.1351210  0.2960146  17.348  < 2e-16 ***
## CRIM                   -0.0118248  0.0012564  -9.411  < 2e-16 ***
## CHAS                    0.0960286  0.0325585   2.949  0.00334 ** 
## NOX                    -1.7192153  0.4351777  -3.951 8.94e-05 ***
## RM                      0.0792327  0.0159017   4.983 8.71e-07 ***
## log(DIS)               -0.2348582  0.0315689  -7.440 4.56e-13 ***
## new_RADmoderate        -0.5187777  0.1953623  -2.655  0.00818 ** 
## new_RADremote          -0.2237869  0.4509516  -0.496  0.61994    
## new_RADvery remote      0.5361498  0.2644225   2.028  0.04314 *  
## TAX                    -0.0007462  0.0001400  -5.329 1.51e-07 ***
## PTRATIO                -0.0309727  0.0048310  -6.411 3.40e-10 ***
## B                       0.0003988  0.0001013   3.938 9.40e-05 ***
## sqrt(LSTAT)            -0.2320318  0.0134636 -17.234  < 2e-16 ***
## NOX:new_RADmoderate     1.1591334  0.4184472   2.770  0.00582 ** 
## NOX:new_RADremote       0.6767016  0.9575216   0.707  0.48008    
## NOX:new_RADvery remote -0.0065723  0.4934351  -0.013  0.98938    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1782 on 490 degrees of freedom
## Multiple R-squared:  0.8156, Adjusted R-squared:   0.81 
## F-statistic: 144.5 on 15 and 490 DF,  p-value: < 2.2e-16

#This leads to an increase in Multiple R-squared and Adjusted R-squared.
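
When the interaction involves a factor with several levels, as NOX:new_RAD does, it adds one coefficient per non-reference level, and the individual t-tests depend on which level is the reference. A joint partial F-test of all the interaction coefficients is usually the more appropriate check. The sketch below shows the idea with anova() on simulated data; the factor g and its level labels are hypothetical stand-ins for a four-level factor such as new_RAD.

```r
set.seed(7)
n <- 300
d <- data.frame(
  x = rnorm(n),
  g = factor(sample(c("a", "b", "c", "d"), n, replace = TRUE))
)
d$y <- 2 - 0.8 * d$x + rnorm(n, sd = 0.5)   # same slope in every group

additive <- lm(y ~ x + g, data = d)   # main effects only
interact <- lm(y ~ x * g, data = d)   # adds the three x:g coefficients

# Partial F-test: do the three interaction coefficients jointly improve the fit?
cmp <- anova(additive, interact)
cmp
```

The same call applied to the report's models would compare the fit without the interaction against interaction.9 in a single 3-degree-of-freedom test.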

vif(interaction.9)

##                     GVIF Df GVIF^(1/(2*Df))
## CRIM        1.857633e+00  1        1.362950
## CHAS        1.087670e+00  1        1.042914
## NOX         4.044394e+01  1        6.359555
## RM          1.985389e+00  1        1.409038
## log(DIS)    4.614205e+00  1        2.148070
## new_RAD     1.876881e+06  3       11.106387
## TAX         8.859859e+00  1        2.976551
## PTRATIO     1.739761e+00  1        1.319000
## B           1.359476e+00  1        1.165966
## sqrt(LSTAT) 2.809465e+00  1        1.676146
## NOX:new_RAD 2.892397e+06  3       11.936481

#NOX, new_RAD and NOX:new_RAD all have GVIF > 10 (adjusted GVIF above 11 for new_RAD and NOX:new_RAD), so the multicollinearity issue remains unresolved.
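
For terms with more than one degree of freedom, the raw GVIF is not comparable with an ordinary VIF; the third column of the vif() output, GVIF^(1/(2*Df)), is the comparable quantity and plays the role of the square root of a VIF. The sketch below reproduces that column from the values printed above and squares it so that the usual "greater than 10" rule of thumb can be applied on one scale.

```r
# GVIF and Df values copied from the vif(interaction.9) output above
gvif <- c(NOX = 4.044394e+01, new_RAD = 1.876881e+06, "NOX:new_RAD" = 2.892397e+06)
df   <- c(NOX = 1,            new_RAD = 3,            "NOX:new_RAD" = 3)

adj <- gvif^(1 / (2 * df))   # reproduces the GVIF^(1/(2*Df)) column
adj

adj^2        # on the same scale as an ordinary VIF
adj^2 > 10   # rule-of-thumb check applied to the adjusted values
```

Here all three terms still exceed the threshold after adjustment, so the conclusion for this model does not depend on which column is read.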

#NOX:TAX:
interaction.10= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+NOX*TAX), data=boston_data)
summary(interaction.10)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + NOX * TAX), data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71465 -0.09442 -0.00777  0.09172  0.70302 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.1476485  0.3076288  13.483  < 2e-16 ***
## CRIM               -0.0115758  0.0012650  -9.151  < 2e-16 ***
## CHAS                0.1067105  0.0327491   3.258 0.001198 ** 
## NOX                 0.3344514  0.3983479   0.840 0.401543    
## RM                  0.0758638  0.0158896   4.774 2.38e-06 ***
## log(DIS)           -0.2094639  0.0299224  -7.000 8.42e-12 ***
## new_RADmoderate     0.0115260  0.0255725   0.451 0.652390    
## new_RADremote       0.0748598  0.0356642   2.099 0.036324 *  
## new_RADvery remote  0.3595868  0.0634319   5.669 2.45e-08 ***
## TAX                 0.0008825  0.0004639   1.902 0.057702 .  
## PTRATIO            -0.0352462  0.0046788  -7.533 2.39e-13 ***
## B                   0.0003810  0.0001019   3.737 0.000208 ***
## sqrt(LSTAT)        -0.2324325  0.0135782 -17.118  < 2e-16 ***
## NOX:TAX            -0.0027103  0.0008283  -3.272 0.001142 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1799 on 492 degrees of freedom
## Multiple R-squared:  0.8114, Adjusted R-squared:  0.8064 
## F-statistic: 162.8 on 13 and 492 DF,  p-value: < 2.2e-16

#This leads to an increase in Multiple R-squared and Adjusted R-squared.

vif(interaction.10)

##                   GVIF Df GVIF^(1/(2*Df))
## CRIM          1.848089  1        1.359444
## CHAS          1.080064  1        1.039261
## NOX          33.260518  1        5.767193
## RM            1.945670  1        1.394873
## log(DIS)      4.068704  1        2.017103
## new_RAD      12.227409  3        1.517827
## TAX          95.422549  1        9.768447
## PTRATIO       1.601641  1        1.265560
## B             1.352252  1        1.162864
## sqrt(LSTAT)   2.804576  1        1.674687
## NOX:TAX     204.899558  1       14.314313

#NOX, new_RAD, TAX and NOX:TAX all have GVIF > 10 (NOX:TAX is about 205), so the multicollinearity issue remains unresolved.

#NOX:PTRATIO:
interaction.11= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+NOX*PTRATIO), data=boston_data)
summary(interaction.11)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + NOX * PTRATIO), 
##     data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73083 -0.09428 -0.00580  0.10095  0.70087 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.9249198  0.4435761   8.848  < 2e-16 ***
## CRIM               -0.0119838  0.0012655  -9.470  < 2e-16 ***
## CHAS                0.0957723  0.0326983   2.929 0.003559 ** 
## NOX                 0.6715319  0.6519799   1.030 0.303521    
## RM                  0.0803877  0.0162268   4.954 1.00e-06 ***
## log(DIS)           -0.2453322  0.0282899  -8.672  < 2e-16 ***
## new_RADmoderate     0.0206640  0.0254101   0.813 0.416485    
## new_RADremote       0.0741722  0.0360198   2.059 0.040000 *  
## new_RADvery remote  0.3216248  0.0614286   5.236 2.44e-07 ***
## TAX                -0.0005258  0.0001341  -3.922 0.000100 ***
## PTRATIO             0.0160715  0.0221894   0.724 0.469234    
## B                   0.0003889  0.0001028   3.781 0.000175 ***
## sqrt(LSTAT)        -0.2280123  0.0137557 -16.576  < 2e-16 ***
## NOX:PTRATIO        -0.0954384  0.0390865  -2.442 0.014969 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1807 on 492 degrees of freedom
## Multiple R-squared:  0.8096, Adjusted R-squared:  0.8045 
## F-statistic: 160.9 on 13 and 492 DF,  p-value: < 2.2e-16

#This leads to an increase in Multiple R-squared and Adjusted R-squared.

vif(interaction.11)

##                   GVIF Df GVIF^(1/(2*Df))
## CRIM          1.832050  1        1.353533
## CHAS          1.066552  1        1.032740
## NOX          88.257699  1        9.394557
## RM            2.009973  1        1.417735
## log(DIS)      3.602520  1        1.898031
## new_RAD      10.226009  3        1.473277
## TAX           7.892864  1        2.809424
## PTRATIO      35.683675  1        5.973581
## B             1.362984  1        1.167469
## sqrt(LSTAT)   2.851215  1        1.688554
## NOX:PTRATIO 162.033654  1       12.729244

#NOX, new_RAD, PTRATIO and NOX:PTRATIO all have GVIF > 10 (NOX:PTRATIO is about 162), so the multicollinearity issue remains unresolved.

#NOX:sqrt(LSTAT):
interaction.12= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+NOX*sqrt(LSTAT)), data=boston_data)
summary(interaction.12)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + NOX * sqrt(LSTAT)), 
##     data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71935 -0.09223 -0.00593  0.09509  0.71998 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.2574061  0.2729919  15.595  < 2e-16 ***
## CRIM               -0.0111179  0.0012783  -8.697  < 2e-16 ***
## CHAS                0.0820060  0.0326563   2.511 0.012352 *  
## NOX                 0.2460755  0.3444741   0.714 0.475349    
## RM                  0.0765739  0.0158612   4.828 1.85e-06 ***
## log(DIS)           -0.2207778  0.0287317  -7.684 8.41e-14 ***
## new_RADmoderate     0.0154538  0.0253110   0.611 0.541773    
## new_RADremote       0.0818213  0.0354242   2.310 0.021315 *  
## new_RADvery remote  0.2775009  0.0597429   4.645 4.37e-06 ***
## TAX                -0.0005512  0.0001319  -4.179 3.46e-05 ***
## PTRATIO            -0.0399253  0.0047170  -8.464 2.98e-16 ***
## B                   0.0003525  0.0001016   3.469 0.000568 ***
## sqrt(LSTAT)        -0.0624856  0.0490798  -1.273 0.203568    
## NOX:sqrt(LSTAT)    -0.2957696  0.0821577  -3.600 0.000350 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1795 on 492 degrees of freedom
## Multiple R-squared:  0.8122, Adjusted R-squared:  0.8072 
## F-statistic: 163.7 on 13 and 492 DF,  p-value: < 2.2e-16

#This leads to an increase in Multiple R-squared and Adjusted R-squared.

vif(interaction.12)

##                      GVIF Df GVIF^(1/(2*Df))
## CRIM             1.895807  1        1.376883
## CHAS             1.078766  1        1.038637
## NOX             24.983791  1        4.998379
## RM               1.947400  1        1.395493
## log(DIS)         3.768137  1        1.941169
## new_RAD          9.132513  3        1.445767
## TAX              7.746437  1        2.783242
## PTRATIO          1.635235  1        1.278763
## B                1.349230  1        1.161564
## sqrt(LSTAT)     36.807118  1        6.066887
## NOX:sqrt(LSTAT) 81.946265  1        9.052418

#NOX, sqrt(LSTAT) and NOX:sqrt(LSTAT) all have GVIF > 10 (about 25, 37 and 82), so the multicollinearity issue remains unresolved.
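
The one-interaction-at-a-time screening in this section repeats the same three steps for every pair of predictors, so it can be written as a loop. The sketch below is one possible way to do that on simulated data with base R only; the data frame d, the predictors a, b, c and the result table screen are hypothetical, and the p-value comes from a partial F-test against the base model.

```r
set.seed(3)
n <- 250
d <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n))
d$y <- 1 + d$a - 0.5 * d$b + 0.4 * d$a * d$b + rnorm(n)   # true a:b interaction

base  <- lm(y ~ a + b + c, data = d)
pairs <- combn(c("a", "b", "c"), 2, simplify = FALSE)     # all predictor pairs

# Fit base model + one interaction per pair and collect summary statistics
screen <- do.call(rbind, lapply(pairs, function(p) {
  term <- paste(p, collapse = ":")
  fit  <- update(base, as.formula(paste(". ~ . +", term)))
  data.frame(
    term    = term,
    adj_r2  = summary(fit)$adj.r.squared,
    p_value = anova(base, fit)[2, "Pr(>F)"]
  )
}))
screen
```

Screening many interactions this way raises the chance of a spurious significant term, so the resulting p-values are best read as a ranking rather than as formal tests.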

#RM:log(DIS):
interaction.13= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+RM*log(DIS)), data=boston_data)
summary(interaction.13)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + RM * log(DIS)), 
##     data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71996 -0.09444 -0.00320  0.08983  0.72624 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.2015977  0.2340631  22.223  < 2e-16 ***
## CRIM               -0.0123774  0.0012645  -9.789  < 2e-16 ***
## CHAS                0.0984454  0.0325353   3.026 0.002610 ** 
## NOX                -0.8610297  0.1442748  -5.968 4.60e-09 ***
## RM                  0.0051314  0.0255503   0.201 0.840911    
## log(DIS)           -0.7130140  0.1416922  -5.032 6.81e-07 ***
## new_RADmoderate     0.0380426  0.0254633   1.494 0.135812    
## new_RADremote       0.0851710  0.0354527   2.402 0.016658 *  
## new_RADvery remote  0.3013804  0.0598212   5.038 6.62e-07 ***
## TAX                -0.0006139  0.0001325  -4.633 4.61e-06 ***
## PTRATIO            -0.0335949  0.0047491  -7.074 5.21e-12 ***
## B                   0.0003558  0.0001017   3.497 0.000513 ***
## sqrt(LSTAT)        -0.2256417  0.0137100 -16.458  < 2e-16 ***
## RM:log(DIS)         0.0758636  0.0224050   3.386 0.000766 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1797 on 492 degrees of freedom
## Multiple R-squared:  0.8116, Adjusted R-squared:  0.8067 
## F-statistic: 163.1 on 13 and 492 DF,  p-value: < 2.2e-16

#This leads to an increase in Multiple R-squared and Adjusted R-squared.

vif(interaction.13)

##                   GVIF Df GVIF^(1/(2*Df))
## CRIM          1.849336  1        1.359903
## CHAS          1.067615  1        1.033255
## NOX           4.369568  1        2.090351
## RM            5.038337  1        2.244624
## log(DIS)     91.370861  1        9.558811
## new_RAD       9.310020  3        1.450413
## TAX           7.796252  1        2.792177
## PTRATIO       1.652636  1        1.285549
## B             1.348721  1        1.161344
## sqrt(LSTAT)   2.863623  1        1.692224
## RM:log(DIS) 103.639982  1       10.180372

#log(DIS) and RM:log(DIS) have GVIF > 10 (about 91 and 104), so the multicollinearity issue remains unresolved.

#RM:new_RAD:
interaction.14= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+RM*new_RAD), data=boston_data)
summary(interaction.14)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + RM * new_RAD), 
##     data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.70293 -0.09001 -0.00637  0.08990  0.76321 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            4.1255696  0.3002766  13.739  < 2e-16 ***
## CRIM                  -0.0126827  0.0012574 -10.087  < 2e-16 ***
## CHAS                   0.1090767  0.0325098   3.355 0.000855 ***
## NOX                   -0.7600974  0.1463359  -5.194 3.02e-07 ***
## RM                     0.1512485  0.0340093   4.447 1.08e-05 ***
## log(DIS)              -0.2113307  0.0288761  -7.319 1.03e-12 ***
## new_RADmoderate        0.2457122  0.2419048   1.016 0.310255    
## new_RADremote          0.1354211  0.3238656   0.418 0.676028    
## new_RADvery remote     1.2242213  0.2587849   4.731 2.93e-06 ***
## TAX                   -0.0005379  0.0001333  -4.034 6.35e-05 ***
## PTRATIO               -0.0321098  0.0048151  -6.669 6.98e-11 ***
## B                      0.0003463  0.0001009   3.431 0.000652 ***
## sqrt(LSTAT)           -0.2172001  0.0138477 -15.685  < 2e-16 ***
## RM:new_RADmoderate    -0.0331890  0.0369242  -0.899 0.369180    
## RM:new_RADremote      -0.0110694  0.0479246  -0.231 0.817431    
## RM:new_RADvery remote -0.1545744  0.0395399  -3.909 0.000106 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1782 on 490 degrees of freedom
## Multiple R-squared:  0.8157, Adjusted R-squared:   0.81 
## F-statistic: 144.5 on 15 and 490 DF,  p-value: < 2.2e-16

#This leads to an increase in Multiple R-squared and Adjusted R-squared.

vif(interaction.14)

##                     GVIF Df GVIF^(1/(2*Df))
## CRIM        1.860873e+00  1        1.364138
## CHAS        1.084738e+00  1        1.041508
## NOX         4.574576e+00  1        2.138826
## RM          9.084088e+00  1        3.013982
## log(DIS)    3.861760e+00  1        1.965136
## new_RAD     9.217468e+05  3        9.865110
## TAX         8.032896e+00  1        2.834236
## PTRATIO     1.728819e+00  1        1.314845
## B           1.350544e+00  1        1.162129
## sqrt(LSTAT) 2.972926e+00  1        1.724218
## RM:new_RAD  8.976907e+05  3        9.821725

#new_RAD and RM:new_RAD have GVIF far above 10 (about 9e+05; adjusted GVIF about 9.8), so the multicollinearity issue remains unresolved.

#RM:PTRATIO:
interaction.15= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+RM*PTRATIO), data=boston_data)
summary(interaction.15)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + RM * PTRATIO), 
##     data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71006 -0.09052 -0.01112  0.09311  0.74138 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.3608536  0.6324809   3.733 0.000212 ***
## CRIM               -0.0121933  0.0012519  -9.740  < 2e-16 ***
## CHAS                0.1037331  0.0323861   3.203 0.001448 ** 
## NOX                -0.7323953  0.1475139  -4.965 9.49e-07 ***
## RM                  0.4409426  0.0886962   4.971 9.20e-07 ***
## log(DIS)           -0.2168549  0.0286101  -7.580 1.74e-13 ***
## new_RADmoderate     0.0144046  0.0251755   0.572 0.567470    
## new_RADremote       0.0854915  0.0352308   2.427 0.015599 *  
## new_RADvery remote  0.2565127  0.0598908   4.283 2.22e-05 ***
## TAX                -0.0005532  0.0001312  -4.216 2.96e-05 ***
## PTRATIO             0.0973478  0.0321770   3.025 0.002613 ** 
## B                   0.0003387  0.0001012   3.345 0.000886 ***
## sqrt(LSTAT)        -0.2316877  0.0134845 -17.182  < 2e-16 ***
## RM:PTRATIO         -0.0205936  0.0048850  -4.216 2.96e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1786 on 492 degrees of freedom
## Multiple R-squared:  0.814,  Adjusted R-squared:  0.8091 
## F-statistic: 165.6 on 13 and 492 DF,  p-value: < 2.2e-16

#This leads to an increase in Multiple R-squared and Adjusted R-squared.

vif(interaction.15)

##                  GVIF Df GVIF^(1/(2*Df))
## CRIM         1.835476  1        1.354797
## CHAS         1.071097  1        1.034938
## NOX          4.625191  1        2.150626
## RM          61.476901  1        7.840721
## log(DIS)     3.771910  1        1.942141
## new_RAD      9.217404  3        1.447998
## TAX          7.740096  1        2.782103
## PTRATIO     76.815395  1        8.764439
## B            1.352467  1        1.162956
## sqrt(LSTAT)  2.804889  1        1.674780
## RM:PTRATIO  86.539923  1        9.302684

#RM, PTRATIO and RM:PTRATIO all have GVIF > 10 (about 61, 77 and 87), so the multicollinearity issue remains unresolved.

#RM:B:
interaction.16= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+RM*B), data=boston_data)
summary(interaction.16)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + RM * B), data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.74374 -0.09063 -0.00402  0.08652  0.72981 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.3704872  0.3578312  17.803  < 2e-16 ***
## CRIM               -0.0131583  0.0012613 -10.432  < 2e-16 ***
## CHAS                0.0948915  0.0320351   2.962 0.003203 ** 
## NOX                -0.8304761  0.1423588  -5.834 9.84e-09 ***
## RM                 -0.1993771  0.0549854  -3.626 0.000318 ***
## log(DIS)           -0.2264576  0.0278797  -8.123 3.70e-15 ***
## new_RADmoderate     0.0296851  0.0248340   1.195 0.232529    
## new_RADremote       0.0788437  0.0349595   2.255 0.024554 *  
## new_RADvery remote  0.2967931  0.0588598   5.042 6.48e-07 ***
## TAX                -0.0005612  0.0001300  -4.317 1.91e-05 ***
## PTRATIO            -0.0342032  0.0046089  -7.421 5.14e-13 ***
## B                  -0.0044739  0.0009416  -4.751 2.65e-06 ***
## sqrt(LSTAT)        -0.2173449  0.0136776 -15.891  < 2e-16 ***
## RM:B                0.0007948  0.0001538   5.166 3.48e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1771 on 492 degrees of freedom
## Multiple R-squared:  0.8172, Adjusted R-squared:  0.8123 
## F-statistic: 169.2 on 13 and 492 DF,  p-value: < 2.2e-16

#This leads to an increase in Multiple R-squared and Adjusted R-squared.

vif(interaction.16)

##                   GVIF Df GVIF^(1/(2*Df))
## CRIM          1.895815  1        1.376886
## CHAS          1.066335  1        1.032635
## NOX           4.382913  1        2.093541
## RM           24.039574  1        4.903017
## log(DIS)      3.644426  1        1.909038
## new_RAD       9.071085  3        1.444142
## TAX           7.732712  1        2.780775
## PTRATIO       1.603573  1        1.266323
## B           119.015144  1       10.909406
## sqrt(LSTAT)   2.936248  1        1.713548
## RM:B        158.437429  1       12.587193

#RM, B and RM:B all have GVIF > 10 (about 24, 119 and 158), so the multicollinearity issue remains unresolved.

#RM:sqrt(LSTAT):
interaction.17= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+RM*sqrt(LSTAT)), data=boston_data)
summary(interaction.17)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + RM * sqrt(LSTAT)), 
##     data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.69231 -0.09005 -0.00281  0.09551  0.73564 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.8650760  0.3063168  12.618  < 2e-16 ***
## CRIM               -0.0127795  0.0012601 -10.142  < 2e-16 ***
## CHAS                0.0865029  0.0322731   2.680  0.00760 ** 
## NOX                -0.8056648  0.1438120  -5.602 3.53e-08 ***
## RM                  0.2151603  0.0350187   6.144 1.66e-09 ***
## log(DIS)           -0.2214094  0.0282574  -7.835 2.91e-14 ***
## new_RADmoderate     0.0329386  0.0250202   1.316  0.18863    
## new_RADremote       0.0758219  0.0352142   2.153  0.03179 *  
## new_RADvery remote  0.3034497  0.0592645   5.120 4.39e-07 ***
## TAX                -0.0005849  0.0001308  -4.473 9.60e-06 ***
## PTRATIO            -0.0322262  0.0047195  -6.828 2.54e-11 ***
## B                   0.0003082  0.0001015   3.036  0.00252 ** 
## sqrt(LSTAT)         0.0428306  0.0620376   0.690  0.49027    
## RM:sqrt(LSTAT)     -0.0455031  0.0100160  -4.543 6.98e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1781 on 492 degrees of freedom
## Multiple R-squared:  0.815,  Adjusted R-squared:  0.8101 
## F-statistic: 166.7 on 13 and 492 DF,  p-value: < 2.2e-16

#This leads to an increase in Multiple R-squared and Adjusted R-squared.

vif(interaction.17)

##                     GVIF Df GVIF^(1/(2*Df))
## CRIM            1.870063  1        1.367503
## CHAS            1.069617  1        1.034223
## NOX             4.420690  1        2.102544
## RM              9.636930  1        3.104341
## log(DIS)        3.700168  1        1.923582
## new_RAD         9.182586  3        1.447085
## TAX             7.733730  1        2.780959
## PTRATIO         1.661843  1        1.289125
## B               1.367252  1        1.169295
## sqrt(LSTAT)    59.702327  1        7.726728
## RM:sqrt(LSTAT) 39.628960  1        6.295154

#sqrt(LSTAT) and RM:sqrt(LSTAT) have GVIF > 10 (about 60 and 40), so the multicollinearity issue remains unresolved.

#log(DIS):new_RAD:
interaction.18= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+log(DIS)*new_RAD), data=boston_data)
summary(interaction.18)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + log(DIS) * new_RAD), 
##     data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71250 -0.09335 -0.00563  0.09318  0.74510 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  4.8836816  0.2279562  21.424  < 2e-16 ***
## CRIM                        -0.0113154  0.0013541  -8.357 6.71e-16 ***
## CHAS                         0.0952615  0.0329103   2.895 0.003966 ** 
## NOX                         -1.0104209  0.1539608  -6.563 1.35e-10 ***
## RM                           0.0699710  0.0162125   4.316 1.92e-05 ***
## log(DIS)                    -0.1846557  0.0493303  -3.743 0.000203 ***
## new_RADmoderate              0.2279489  0.0932395   2.445 0.014846 *  
## new_RADremote                0.1340489  0.1401878   0.956 0.339438    
## new_RADvery remote           0.4143235  0.1175493   3.525 0.000464 ***
## TAX                         -0.0007484  0.0001494  -5.009 7.67e-07 ***
## PTRATIO                     -0.0353894  0.0048139  -7.352 8.28e-13 ***
## B                            0.0003700  0.0001024   3.613 0.000334 ***
## sqrt(LSTAT)                 -0.2331934  0.0136533 -17.080  < 2e-16 ***
## log(DIS):new_RADmoderate    -0.1300062  0.0576571  -2.255 0.024585 *  
## log(DIS):new_RADremote      -0.0269458  0.0860126  -0.313 0.754203    
## log(DIS):new_RADvery remote  0.0164055  0.0778137   0.211 0.833107    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1807 on 490 degrees of freedom
## Multiple R-squared:  0.8103, Adjusted R-squared:  0.8045 
## F-statistic: 139.5 on 15 and 490 DF,  p-value: < 2.2e-16

#This leads to an increase in Multiple R-squared and Adjusted R-squared.

vif(interaction.18)

##                         GVIF Df GVIF^(1/(2*Df))
## CRIM                2.097297  1        1.448205
## CHAS                1.080273  1        1.039362
## NOX                 4.920882  1        2.218306
## RM                  2.006150  1        1.416386
## log(DIS)           10.952393  1        3.309440
## new_RAD          4115.061662  3        4.003096
## TAX                 9.804345  1        3.131189
## PTRATIO             1.679207  1        1.295842
## B                   1.351480  1        1.162532
## sqrt(LSTAT)         2.808552  1        1.675873
## log(DIS):new_RAD 2486.415614  3        3.680688

#As log(DIS), new_RAD, and log(DIS):new_RAD have VIF > 10, the multicollinearity issue remains unsolved.
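
#For terms with more than one degree of freedom (such as new_RAD), a common rule of thumb is to square the GVIF^(1/(2*Df)) column, which equals GVIF^(1/Df), before comparing it with the usual VIF cut-off of 10. A minimal sketch with a subset of the values copied from the vif(interaction.18) output; vif_scale is a hypothetical name, not part of the original analysis.

```r
# GVIF and Df values copied from the vif(interaction.18) output
gvif <- c(CRIM = 2.097297, `log(DIS)` = 10.952393, new_RAD = 4115.061662,
          TAX = 9.804345, `log(DIS):new_RAD` = 2486.415614)
df   <- c(CRIM = 1, `log(DIS)` = 1, new_RAD = 3, TAX = 1, `log(DIS):new_RAD` = 3)

# GVIF^(1/Df) is the square of the last column printed by vif();
# it puts multi-df terms on the same scale as an ordinary VIF
vif_scale <- gvif^(1 / df)
round(vif_scale, 2)

# Terms that still exceed the cut-off of 10 on this scale
names(vif_scale)[vif_scale > 10]
```

#On this scale new_RAD and log(DIS):new_RAD are far smaller than their raw GVIF, but they still exceed 10, so the conclusion for this model does not change.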

#log(DIS):TAX:
interaction.19= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+log(DIS)*TAX), data=boston_data)
summary(interaction.19)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + log(DIS) * TAX), 
##     data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71829 -0.09315 -0.01128  0.09943  0.74121 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.9904988  0.2333174  21.389  < 2e-16 ***
## CRIM               -0.0114579  0.0013291  -8.621  < 2e-16 ***
## CHAS                0.0981946  0.0329511   2.980 0.003025 ** 
## NOX                -0.9061092  0.1469038  -6.168 1.44e-09 ***
## RM                  0.0703763  0.0161345   4.362 1.57e-05 ***
## log(DIS)           -0.3185044  0.0648909  -4.908 1.25e-06 ***
## new_RADmoderate     0.0252324  0.0254426   0.992 0.321814    
## new_RADremote       0.0870171  0.0357990   2.431 0.015425 *  
## new_RADvery remote  0.3055221  0.0614748   4.970 9.26e-07 ***
## TAX                -0.0007932  0.0002159  -3.674 0.000265 ***
## PTRATIO            -0.0369007  0.0046940  -7.861 2.42e-14 ***
## B                   0.0003676  0.0001028   3.576 0.000383 ***
## sqrt(LSTAT)        -0.2322935  0.0137016 -16.954  < 2e-16 ***
## log(DIS):TAX        0.0001998  0.0001540   1.298 0.194985    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1815 on 492 degrees of freedom
## Multiple R-squared:  0.8079, Adjusted R-squared:  0.8028 
## F-statistic: 159.2 on 13 and 492 DF,  p-value: < 2.2e-16

#This leads to an increase in Multiple R-squared and Adjusted R-squared.

vif(interaction.19)

##                   GVIF Df GVIF^(1/(2*Df))
## CRIM          2.003427  1        1.415425
## CHAS          1.073806  1        1.036246
## NOX           4.442253  1        2.107665
## RM            1.970095  1        1.403601
## log(DIS)     18.791557  1        4.334923
## new_RAD       9.590199  3        1.457599
## TAX          20.296312  1        4.505143
## PTRATIO       1.583106  1        1.258215
## B             1.349989  1        1.161890
## sqrt(LSTAT)   2.804556  1        1.674681
## log(DIS):TAX 10.411511  1        3.226687

#As log(DIS), TAX, and log(DIS):TAX have VIF > 10, the multicollinearity issue remains unsolved.

#log(DIS):sqrt(LSTAT):
interaction.20= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+log(DIS)*sqrt(LSTAT)), data=boston_data)
summary(interaction.20)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + log(DIS) * sqrt(LSTAT)), 
##     data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71646 -0.09048 -0.00617  0.09670  0.78305 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           5.0744802  0.2203869  23.025  < 2e-16 ***
## CRIM                 -0.0109477  0.0012900  -8.487 2.52e-16 ***
## CHAS                  0.0810933  0.0327129   2.479 0.013511 *  
## NOX                  -0.8695312  0.1440617  -6.036 3.12e-09 ***
## RM                    0.0737389  0.0158402   4.655 4.17e-06 ***
## log(DIS)             -0.4225137  0.0582838  -7.249 1.63e-12 ***
## new_RADmoderate       0.0169800  0.0252870   0.671 0.502222    
## new_RADremote         0.0826314  0.0354370   2.332 0.020115 *  
## new_RADvery remote    0.2731391  0.0598686   4.562 6.40e-06 ***
## TAX                  -0.0005204  0.0001327  -3.923 0.000100 ***
## PTRATIO              -0.0385958  0.0046690  -8.266 1.29e-15 ***
## B                     0.0003728  0.0001017   3.667 0.000272 ***
## sqrt(LSTAT)          -0.2915078  0.0216027 -13.494  < 2e-16 ***
## log(DIS):sqrt(LSTAT)  0.0575094  0.0163404   3.519 0.000473 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1796 on 492 degrees of freedom
## Multiple R-squared:  0.812,  Adjusted R-squared:  0.807 
## F-statistic: 163.5 on 13 and 492 DF,  p-value: < 2.2e-16

#This leads to an increase in Multiple R-squared and Adjusted R-squared.

vif(interaction.20)

##                           GVIF Df GVIF^(1/(2*Df))
## CRIM                  1.928298  1        1.388632
## CHAS                  1.081274  1        1.039843
## NOX                   4.364644  1        2.089173
## RM                    1.940038  1        1.392853
## log(DIS)             15.488371  1        3.935527
## new_RAD               9.112027  3        1.445226
## TAX                   7.828863  1        2.798011
## PTRATIO               1.600281  1        1.265022
## B                     1.349251  1        1.161573
## sqrt(LSTAT)           7.122808  1        2.668859
## log(DIS):sqrt(LSTAT) 10.161118  1        3.187651

#Although all VIFs are below 20, log(DIS) (15.49) and log(DIS):sqrt(LSTAT) (10.16) both exceed 10. An interaction cannot stay in the model if one of its main effects is removed, so this interaction is not considered in the following analysis.

#new_RAD:TAX:
interaction.21= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+new_RAD*TAX), data=boston_data)
summary(interaction.21)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + new_RAD * TAX), 
##     data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73054 -0.09176 -0.00949  0.09806  0.72296 
## 
## Coefficients: (1 not defined because of singularities)
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             4.9493618  0.2289273  21.620  < 2e-16 ***
## CRIM                   -0.0118813  0.0012781  -9.296  < 2e-16 ***
## CHAS                    0.0944604  0.0329295   2.869 0.004301 ** 
## NOX                    -0.8765972  0.1461144  -5.999 3.85e-09 ***
## RM                      0.0735070  0.0160432   4.582 5.85e-06 ***
## log(DIS)               -0.2354814  0.0303447  -7.760 4.96e-14 ***
## new_RADmoderate        -0.0765835  0.1131304  -0.677 0.498757    
## new_RADremote          -0.1217131  0.2984874  -0.408 0.683623    
## new_RADvery remote      0.4260355  0.1544575   2.758 0.006028 ** 
## TAX                    -0.0008927  0.0003580  -2.494 0.012967 *  
## PTRATIO                -0.0373808  0.0047425  -7.882 2.09e-14 ***
## B                       0.0003650  0.0001029   3.546 0.000428 ***
## sqrt(LSTAT)            -0.2326583  0.0137315 -16.943  < 2e-16 ***
## new_RADmoderate:TAX     0.0003799  0.0004075   0.932 0.351642    
## new_RADremote:TAX       0.0007313  0.0009945   0.735 0.462511    
## new_RADvery remote:TAX         NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1818 on 491 degrees of freedom
## Multiple R-squared:  0.8077, Adjusted R-squared:  0.8022 
## F-statistic: 147.3 on 14 and 491 DF,  p-value: < 2.2e-16

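#Adding a term cannot lower Multiple R-squared (0.8077 here); Adjusted R-squared (0.8022) falls because neither estimable new_RAD:TAX term is significant. The new_RADvery remote:TAX coefficient is NA because of a singularity.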
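
#One plausible reason for the NA coefficient in summary(interaction.21), stated as an assumption rather than something the output confirms: if TAX takes a single value within the "very remote" group, the interaction column is a constant multiple of that group's dummy and lm() cannot estimate it. A minimal sketch on simulated data (g, x, and y are hypothetical stand-ins for new_RAD, TAX, and log(MEDV)):

```r
set.seed(1)
n <- 90
g <- factor(rep(c("a", "b", "c"), each = 30))

# Continuous predictor that varies in groups "a" and "b" ...
x <- rnorm(n, mean = 300, sd = 50)
# ... but is constant within group "c"
x[g == "c"] <- 666

y <- 1 + 0.01 * x + rnorm(n)

# Main effects plus the g:x interaction
fit <- lm(y ~ g * x)

# The gc:x column equals 666 * (dummy for group "c"), so it is aliased
coef(fit)
```

#The last coefficient (gc:x) is reported as NA, mirroring the "1 not defined because of singularities" note.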

#new_RAD:sqrt(LSTAT):
interaction.22= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+new_RAD*sqrt(LSTAT)), data=boston_data)
summary(interaction.22)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + new_RAD * sqrt(LSTAT)), 
##     data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.70152 -0.09214 -0.00717  0.09197  0.86472 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     4.981e+00  2.259e-01  22.056  < 2e-16 ***
## CRIM                           -9.957e-03  1.252e-03  -7.953 1.27e-14 ***
## CHAS                            6.098e-02  3.224e-02   1.892  0.05913 .  
## NOX                            -1.062e+00  1.422e-01  -7.474 3.61e-13 ***
## RM                              8.542e-02  1.566e-02   5.455 7.78e-08 ***
## log(DIS)                       -2.479e-01  2.726e-02  -9.093  < 2e-16 ***
## new_RADmoderate                -1.041e-01  8.418e-02  -1.237  0.21680    
## new_RADremote                   1.483e-01  1.363e-01   1.088  0.27702    
## new_RADvery remote              8.878e-01  1.214e-01   7.313 1.07e-12 ***
## TAX                            -7.149e-04  1.318e-04  -5.423 9.20e-08 ***
## PTRATIO                        -4.248e-02  4.564e-03  -9.307  < 2e-16 ***
## B                               2.969e-04  9.865e-05   3.009  0.00275 ** 
## sqrt(LSTAT)                    -2.094e-01  2.626e-02  -7.973 1.10e-14 ***
## new_RADmoderate:sqrt(LSTAT)     4.479e-02  2.802e-02   1.599  0.11048    
## new_RADremote:sqrt(LSTAT)      -1.823e-02  4.766e-02  -0.383  0.70222    
## new_RADvery remote:sqrt(LSTAT) -1.297e-01  3.152e-02  -4.115 4.55e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1735 on 490 degrees of freedom
## Multiple R-squared:  0.8252, Adjusted R-squared:  0.8199 
## F-statistic: 154.2 on 15 and 490 DF,  p-value: < 2.2e-16

#This leads to an increase in Multiple R-squared and Adjusted R-squared.

vif(interaction.22)

##                             GVIF Df GVIF^(1/(2*Df))
## CRIM                    1.945955  1        1.394975
## CHAS                    1.125084  1        1.060700
## NOX                     4.552923  1        2.133758
## RM                      2.030924  1        1.425105
## log(DIS)                3.631075  1        1.905538
## new_RAD             10118.894494  3        4.650741
## TAX                     8.281818  1        2.877815
## PTRATIO                 1.638506  1        1.280042
## B                       1.361115  1        1.166668
## sqrt(LSTAT)            11.277516  1        3.358201
## new_RAD:sqrt(LSTAT) 14230.554441  3        4.922703

#As new_RAD, sqrt(LSTAT), and new_RAD:sqrt(LSTAT) have VIF > 10, the multicollinearity issue remains unsolved.

#TAX:PTRATIO:
interaction.23= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+TAX*PTRATIO), data=boston_data)
summary(interaction.23)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + TAX * PTRATIO), 
##     data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72591 -0.09330 -0.00964  0.09424  0.71970 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.497e+00  4.417e-01  12.446  < 2e-16 ***
## CRIM               -1.184e-02  1.272e-03  -9.313  < 2e-16 ***
## CHAS                9.619e-02  3.282e-02   2.931 0.003539 ** 
## NOX                -8.078e-01  1.522e-01  -5.306 1.69e-07 ***
## RM                  6.626e-02  1.653e-02   4.010 7.03e-05 ***
## log(DIS)           -2.339e-01  2.890e-02  -8.092 4.62e-15 ***
## new_RADmoderate     2.392e-02  2.544e-02   0.940 0.347533    
## new_RADremote       8.989e-02  3.582e-02   2.510 0.012395 *  
## new_RADvery remote  2.371e-01  6.861e-02   3.456 0.000596 ***
## TAX                -2.421e-03  1.151e-03  -2.104 0.035900 *  
## PTRATIO            -6.957e-02  2.074e-02  -3.354 0.000859 ***
## B                   3.449e-04  1.032e-04   3.342 0.000896 ***
## sqrt(LSTAT)        -2.367e-01  1.395e-02 -16.961  < 2e-16 ***
## TAX:PTRATIO         9.867e-05  6.102e-05   1.617 0.106531    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1813 on 492 degrees of freedom
## Multiple R-squared:  0.8083, Adjusted R-squared:  0.8032 
## F-statistic: 159.6 on 13 and 492 DF,  p-value: < 2.2e-16

#Multiple R-squared (0.8083) and Adjusted R-squared (0.8032) do not improve meaningfully, and the TAX:PTRATIO term is not significant (p = 0.107).

vif(interaction.23)

##                   GVIF Df GVIF^(1/(2*Df))
## CRIM          1.838031  1        1.355740
## CHAS          1.067254  1        1.033080
## NOX           4.779813  1        2.186278
## RM            2.070581  1        1.438952
## log(DIS)      3.733922  1        1.932336
## new_RAD      12.982701  3        1.533066
## TAX         577.582246  1       24.032941
## PTRATIO      30.976361  1        5.565641
## B             1.363689  1        1.167771
## sqrt(LSTAT)   2.914628  1        1.707228
## TAX:PTRATIO 796.574900  1       28.223659

#As TAX, PTRATIO, and TAX:PTRATIO all have VIF far above 10, the multicollinearity issue remains unsolved.

#TAX:B:
interaction.24= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+TAX*B), data=boston_data)
summary(interaction.24)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + TAX * B), data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73437 -0.09405 -0.00820  0.09501  0.72435 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5.009e+00  3.277e-01  15.286  < 2e-16 ***
## CRIM               -1.194e-02  1.273e-03  -9.382  < 2e-16 ***
## CHAS                9.409e-02  3.290e-02   2.860  0.00442 ** 
## NOX                -8.997e-01  1.499e-01  -6.002 3.79e-09 ***
## RM                  7.365e-02  1.608e-02   4.581 5.89e-06 ***
## log(DIS)           -2.438e-01  2.850e-02  -8.556  < 2e-16 ***
## new_RADmoderate     2.549e-02  2.548e-02   1.001  0.31749    
## new_RADremote       8.647e-02  3.586e-02   2.412  0.01624 *  
## new_RADvery remote  2.965e-01  6.150e-02   4.820 1.91e-06 ***
## TAX                -7.868e-04  4.124e-04  -1.908  0.05700 .  
## PTRATIO            -3.676e-02  4.707e-03  -7.809 3.51e-14 ***
## B                   2.573e-05  6.228e-04   0.041  0.96707    
## sqrt(LSTAT)        -2.321e-01  1.372e-02 -16.915  < 2e-16 ***
## TAX:B               5.408e-07  9.858e-07   0.549  0.58350    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1818 on 492 degrees of freedom
## Multiple R-squared:  0.8074, Adjusted R-squared:  0.8023 
## F-statistic: 158.6 on 13 and 492 DF,  p-value: < 2.2e-16

#Adding a term cannot lower Multiple R-squared (0.8074 here); Adjusted R-squared (0.8023) falls, and the TAX:B term is not significant (p = 0.584).
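
#A minimal sketch of how Adjusted R-squared follows from Multiple R-squared, assuming n = 506 observations and p = 13 estimated slopes, as in the summaries of interaction.23 and interaction.24. adj_r2 is a hypothetical helper, not part of the original analysis.

```r
# Adjusted R-squared penalises R-squared for the number of slopes p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# Multiple R-squared values copied from the two summaries
round(adj_r2(0.8083, n = 506, p = 13), 4)  # interaction.23
round(adj_r2(0.8074, n = 506, p = 13), 4)  # interaction.24
```

#Because of the penalty, an added term that explains very little can lower Adjusted R-squared even though Multiple R-squared itself never falls when a term is added.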

vif(interaction.24)

##                  GVIF Df GVIF^(1/(2*Df))
## CRIM         1.833184  1        1.353951
## CHAS         1.067293  1        1.033099
## NOX          4.612707  1        2.147721
## RM           1.950849  1        1.396728
## log(DIS)     3.614006  1        1.901054
## new_RAD      9.551073  3        1.456606
## TAX         73.859247  1        8.594140
## PTRATIO      1.587552  1        1.259981
## B           49.417352  1        7.029748
## sqrt(LSTAT)  2.805955  1        1.675098
## TAX:B       62.033028  1        7.876105

#As TAX, B, and TAX:B all have VIF > 10, the multicollinearity issue remains unsolved.

#There are two possible interactions: CHAS*CRIM and CRIM*B
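
#The screening so far repeats one pattern per candidate: add a single interaction to the base model, then read off both R-squared values. A minimal sketch of that loop, using the built-in mtcars data as a stand-in because boston_data and new_RAD are not reproduced here; the variables and candidate terms are illustrative only.

```r
# Base model without interactions (stand-in for the Boston model)
base_fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Candidate interaction terms to add one at a time
candidates <- c("wt:hp", "wt:qsec", "hp:qsec")

results <- do.call(rbind, lapply(candidates, function(term) {
  # Refit the base model with one extra interaction term
  fit <- update(base_fit, as.formula(paste(". ~ . +", term)))
  s <- summary(fit)
  data.frame(term = term,
             r_squared = s$r.squared,
             adj_r_squared = s$adj.r.squared)
}))

results
```

#Each row can then be compared with summary(base_fit); a VIF check would still be run separately for every candidate that looks promising.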

interaction_1= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+CHAS*CRIM), data=boston_data)
summary(interaction_1)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + CHAS * CRIM), 
##     data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72272 -0.09683 -0.00960  0.09552  0.72683 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.9015941  0.2138068  22.925  < 2e-16 ***
## CRIM               -0.0118116  0.0012628  -9.354  < 2e-16 ***
## CHAS                0.0236349  0.0403950   0.585 0.558753    
## NOX                -0.9398141  0.1458980  -6.442 2.82e-10 ***
## RM                  0.0752964  0.0159135   4.732 2.92e-06 ***
## log(DIS)           -0.2442258  0.0281930  -8.663  < 2e-16 ***
## new_RADmoderate     0.0295099  0.0252943   1.167 0.243913    
## new_RADremote       0.0933788  0.0356094   2.622 0.009004 ** 
## new_RADvery remote  0.2914260  0.0598855   4.866 1.53e-06 ***
## TAX                -0.0006041  0.0001327  -4.552 6.70e-06 ***
## PTRATIO            -0.0380070  0.0046752  -8.129 3.53e-15 ***
## B                   0.0003459  0.0001021   3.387 0.000763 ***
## sqrt(LSTAT)        -0.2242410  0.0138706 -16.167  < 2e-16 ***
## CRIM:CHAS           0.0392307  0.0131814   2.976 0.003062 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1802 on 492 degrees of freedom
## Multiple R-squared:  0.8107, Adjusted R-squared:  0.8057 
## F-statistic:   162 on 13 and 492 DF,  p-value: < 2.2e-16

#Multiple R-squared:  0.8107,   Adjusted R-squared:  0.8057 

interaction_2= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+CRIM*B), data=boston_data)
summary(interaction_2)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + CRIM * B), data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72555 -0.09626 -0.00847  0.09247  0.71429 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.701e+00  2.192e-01  21.447  < 2e-16 ***
## CRIM               -5.998e-03  2.176e-03  -2.756 0.006066 ** 
## CHAS                9.163e-02  3.253e-02   2.817 0.005050 ** 
## NOX                -8.419e-01  1.446e-01  -5.821 1.06e-08 ***
## RM                  7.700e-02  1.590e-02   4.842 1.72e-06 ***
## log(DIS)           -2.418e-01  2.812e-02  -8.596  < 2e-16 ***
## new_RADmoderate     2.597e-02  2.520e-02   1.031 0.303173    
## new_RADremote       8.782e-02  3.546e-02   2.477 0.013592 *  
## new_RADvery remote  3.078e-01  5.997e-02   5.133 4.11e-07 ***
## TAX                -5.623e-04  1.320e-04  -4.259 2.46e-05 ***
## PTRATIO            -3.751e-02  4.653e-03  -8.062 5.77e-15 ***
## B                   6.679e-04  1.364e-04   4.897 1.32e-06 ***
## sqrt(LSTAT)        -2.267e-01  1.367e-02 -16.577  < 2e-16 ***
## CRIM:B             -2.302e-05  6.853e-06  -3.360 0.000841 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1798 on 492 degrees of freedom
## Multiple R-squared:  0.8116, Adjusted R-squared:  0.8066 
## F-statistic:   163 on 13 and 492 DF,  p-value: < 2.2e-16

#Multiple R-squared:  0.8116,   Adjusted R-squared:  0.8066 

interaction_3= lm(log(MEDV) ~ (CRIM+CHAS+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+CRIM*B+CHAS*CRIM), data=boston_data)
summary(interaction_3)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + CHAS + NOX + RM + log(DIS) + 
##     new_RAD + TAX + PTRATIO + B + sqrt(LSTAT) + CRIM * B + CHAS * 
##     CRIM), data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71671 -0.09712 -0.00703  0.08924  0.73098 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.730e+00  2.177e-01  21.731  < 2e-16 ***
## CRIM               -5.886e-03  2.159e-03  -2.726 0.006639 ** 
## CHAS                2.123e-02  3.998e-02   0.531 0.595663    
## NOX                -9.009e-01  1.449e-01  -6.219 1.07e-09 ***
## RM                  7.927e-02  1.579e-02   5.019 7.27e-07 ***
## log(DIS)           -2.432e-01  2.790e-02  -8.715  < 2e-16 ***
## new_RADmoderate     2.993e-02  2.503e-02   1.195 0.232486    
## new_RADremote       9.430e-02  3.524e-02   2.676 0.007703 ** 
## new_RADvery remote  3.090e-01  5.950e-02   5.194 3.02e-07 ***
## TAX                -5.934e-04  1.314e-04  -4.517 7.86e-06 ***
## PTRATIO            -3.861e-02  4.630e-03  -8.338 7.66e-16 ***
## B                   6.493e-04  1.354e-04   4.794 2.17e-06 ***
## sqrt(LSTAT)        -2.187e-01  1.383e-02 -15.819  < 2e-16 ***
## CRIM:B             -2.288e-05  6.799e-06  -3.365 0.000825 ***
## CRIM:CHAS           3.891e-02  1.305e-02   2.983 0.002997 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1783 on 491 degrees of freedom
## Multiple R-squared:  0.8149, Adjusted R-squared:  0.8097 
## F-statistic: 154.4 on 14 and 491 DF,  p-value: < 2.2e-16

#Multiple R-squared:  0.8149,   Adjusted R-squared:  0.8097 

vif(interaction_3)

##                 GVIF Df GVIF^(1/(2*Df))
## CRIM        5.477549  1        2.340416
## CHAS        1.637732  1        1.279739
## NOX         4.473804  1        2.115137
## RM          1.955271  1        1.398310
## log(DIS)    3.599112  1        1.897133
## new_RAD     9.170833  3        1.446777
## TAX         7.784025  1        2.789986
## PTRATIO     1.595683  1        1.263204
## B           2.427741  1        1.558121
## sqrt(LSTAT) 2.957832  1        1.719835
## CRIM:B      5.201869  1        2.280761
## CRIM:CHAS   1.729510  1        1.315108

#All predictors have VIF < 10, so there is no problem of multicollinearity.
#Adding the two interactions creates a better model.
#However, the p-value for CHAS is high (0.596), implying that it is not significant as a predictor in this model. CHAS was therefore removed, which also meant removing the CRIM:CHAS interaction, as an interaction requires all of its lower-order terms.

#After removing CHAS and the CRIM:CHAS interaction term:
interaction_4= lm(log(MEDV) ~ (CRIM+NOX+RM+log(DIS)+new_RAD+TAX+PTRATIO+B+sqrt(LSTAT)+CRIM*B), data=boston_data)
summary(interaction_4)

## 
## Call:
## lm(formula = log(MEDV) ~ (CRIM + NOX + RM + log(DIS) + new_RAD + 
##     TAX + PTRATIO + B + sqrt(LSTAT) + CRIM * B), data = boston_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.73380 -0.09027 -0.00887  0.09540  0.72157 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         4.716e+00  2.206e-01  21.375  < 2e-16 ***
## CRIM               -6.069e-03  2.191e-03  -2.770 0.005821 ** 
## NOX                -8.130e-01  1.453e-01  -5.596 3.65e-08 ***
## RM                  7.808e-02  1.601e-02   4.878 1.45e-06 ***
## log(DIS)           -2.474e-01  2.825e-02  -8.757  < 2e-16 ***
## new_RADmoderate     3.004e-02  2.533e-02   1.186 0.236328    
## new_RADremote       9.712e-02  3.555e-02   2.732 0.006520 ** 
## new_RADvery remote  3.227e-01  6.016e-02   5.364 1.26e-07 ***
## TAX                -5.949e-04  1.324e-04  -4.493 8.77e-06 ***
## PTRATIO            -3.856e-02  4.670e-03  -8.256 1.39e-15 ***
## B                   6.883e-04  1.371e-04   5.019 7.27e-07 ***
## sqrt(LSTAT)        -2.284e-01  1.376e-02 -16.600  < 2e-16 ***
## CRIM:B             -2.355e-05  6.898e-06  -3.414 0.000692 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.181 on 493 degrees of freedom
## Multiple R-squared:  0.8085, Adjusted R-squared:  0.8039 
## F-statistic: 173.5 on 12 and 493 DF,  p-value: < 2.2e-16

#This model has a lower Adjusted R-squared (0.8039) than interaction_2 (0.8066), the model with CHAS and the CRIM*B interaction term, and CHAS itself is significant in interaction_2 (p = 0.005).

vif(interaction_2)

##                 GVIF Df GVIF^(1/(2*Df))
## CRIM        5.475901  1        2.340064
## CHAS        1.067134  1        1.033022
## NOX         4.390376  1        2.095322
## RM          1.950714  1        1.396680
## log(DIS)    3.598036  1        1.896849
## new_RAD     9.119347  3        1.445420
## TAX         7.734761  1        2.781144
## PTRATIO     1.585547  1        1.259185
## B           2.422611  1        1.556474
## sqrt(LSTAT) 2.847447  1        1.687438
## CRIM:B      5.201599  1        2.280701

#All predictors have VIF < 10, so there is no problem of multicollinearity.
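
#With the CRIM:B interaction in the model, the effect of CRIM on log(MEDV) is no longer a single number: it equals the CRIM coefficient plus the CRIM:B coefficient times B. A minimal sketch using the estimates from summary(interaction_2); the B values are illustrative assumptions, and crim_effect is a hypothetical helper.

```r
# Coefficients copied from summary(interaction_2)
b_crim   <- -5.998e-03   # CRIM
b_crim_b <- -2.302e-05   # CRIM:B

# Change in log(MEDV) per unit of CRIM at a given B,
# holding the other predictors fixed
crim_effect <- function(B) b_crim + b_crim_b * B

# Illustrative B values, not taken from the source
B_values <- c(0, 200, 396.9)
data.frame(B = B_values, d_logMEDV_per_CRIM = crim_effect(B_values))
```

#Under these estimates the CRIM slope is negative at every B shown and becomes more negative as B increases.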