I swear that I worked solely alone on this take-home exam and within the time frame that was specified.

Name of student: Andrea Valtorta

Problem: Real estate sales.

A city tax assessor was interested in predicting residential home sales prices in a midwestern city (in the US) as a function of various characteristics of the home and surrounding property. Data on 518 arms-length transactions were obtained for home sales during the year 2002. The data have already been randomly split into a training dataset and a validation dataset. You will use the training dataset to build a model and do model diagnostics. The validation dataset will be used to assess the predictive skill of the models you build. The training and validation data sets are in the files HousesTraining.txt and HousesValidation.txt, respectively. They both include the following variables (note that the variable names are included in the data file):

Col. Variable name Description
1 Price Sales price of residence (dollars)
2 SqF Finished area of residence (square feet)
3 Bedr Total number of bedrooms in residence
4 Bathr Total number of bathrooms in residence
5 AC Air conditioning installed (Yes or No)
6 Garage Number of cars that garage will hold
7 Pool Swimming pool in residence (Yes or No)
8 Year Year property was originally constructed
9 Quality Quality index of construction (Low, Medium, High)
10 Lot Lot size (square feet)

1. EDA and model preparation

Your goal here is to identify the general form of a regression model, i.e. identify whether transformations are needed of the response and/or predictors.

(a)

Fit a first-order regression model to the “raw” training data. That is, regress sales price on all 9 predictors without any transformations. Provide the standard diagnostics plot (i.e. the plot(Fit)). Are any model assumptions violated? Explain.

house<-read.table("HousesTraining.txt", header = T)
str(house)
## 'data.frame':    318 obs. of  10 variables:
##  $ Price  : int  250000 205500 275500 229900 150000 190000 559000 535000 527000 169900 ...
##  $ SqF    : int  1780 1638 2196 2216 1597 2812 2791 3381 3232 1502 ...
##  $ Bedr   : int  4 4 4 3 2 7 3 5 5 2 ...
##  $ Bathr  : int  3 2 3 2 1 5 4 4 5 2 ...
##  $ AC     : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 2 2 ...
##  $ Garage : int  2 2 2 2 1 2 3 3 2 2 ...
##  $ Pool   : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ Year   : int  1980 1963 1968 1972 1955 1966 1992 1988 1984 1956 ...
##  $ Quality: Factor w/ 3 levels "High","Low","Medium": 3 3 3 3 3 2 1 1 3 3 ...
##  $ Lot    : int  21345 17342 21786 18639 22112 56639 30595 23172 21445 28958 ...
house_fit<-lm(Price~., data=house)
par(mfrow=c(2,2))
plot(house_fit)

##head(model.matrix(~ Quality,data = house))

house$Quality<-factor(house$Quality, levels(house$Quality)[c(2,3,1)]) ## reorder the levels for 'Quality'

Model assumptions:

  • Error terms are independent [Acceptable] The data are individual arm's-length sales with no time or spatial ordering given, so there is no obvious source of dependence among the error terms.

  • The error terms have constant variance [Violated] The plots of residuals and standardized residuals against the fitted values clearly show that the variance is not constant: the spread of the residuals increases as the fitted values (i.e. the predicted Price) increase. This is also clear from the scale-location plot of the absolute standardized residuals, whose shape resembles a megaphone (an optional formal test is sketched after this list).

  • The error terms are normally distributed [Violated] The qq-plot shows that the distribution of the standardized residuals is fairly symmetric (slightly right-skewed) but heavy-tailed, with more probability in the tails than a normal distribution. We would therefore say that the residuals are not normally distributed.

  • Regression function E(Y|x) is linear [Acceptable] The plot of residuals against fitted values shows no clear systematic pattern, so we can accept the assumption that the regression function is linear.
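
As an optional formal check of the constant-variance assumption (a minimal sketch, assuming the lmtest package is available; it is not used elsewhere in this report), the Breusch-Pagan test can be run on the raw-scale fit:

library(lmtest)
bptest(house_fit) ## Breusch-Pagan test of constant error variance; a small p-value supports the megaphone pattern seen in the plots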

(b)

Use the Box-Cox procedure to find an appropriate power transformation of the response. What transformation of price is suggested?

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
boxcox(house_fit)

The value of lambda suggested by the Box-Cox procedure is close to zero. For clarity's sake, we can pick lambda = 0 without sacrificing much in terms of the effectiveness of the transformation; lambda = 0 corresponds to the log transformation of Price.
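
To read the suggested power off numerically rather than from the plot, one can extract the lambda that maximizes the profile log-likelihood (a minimal sketch reusing the house_fit object above):

bc<-boxcox(house_fit, plotit = FALSE) ## lambda grid (x) and profile log-likelihood (y)
bc$x[which.max(bc$y)]                 ## expected to be close to 0, i.e. the log transformation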

(c)

Fit the first-order regression model again but this time with a log-transformed response. Provide the standard diagnostics plots. Is the fit better? Explain.

house<-house %>% mutate(Price_log=log(Price))
house.L<-house[,-1] ## Create a new data frame that keeps the log-transformed price and drops the raw Price.
house_fit_log<-lm(Price_log~.,data = house.L) 
par(mfrow=c(2,2))
plot(house_fit_log)

After the log transformation of Price the fit improved. The log transformation is also referred to as a "variance-stabilizing transformation". From the plots generated by plot() we can see that the following assumptions now appear satisfied:

  • Error terms have constant variance In the plot of standardized residuals against fitted values the points are evenly scattered around 0 and their spread is roughly constant, indicating constant variance.

  • Error terms are normally distributed In the qq-plot the semi-studentized residuals lie close to their expected values under the normality assumption.

(d)

To determine whether transformations of predictors are needed, provide an Added variable plot for all the quantitative variables. Use log-price as your response. What functional form (e.g. linear term, quadratic term etc.) do these plots suggest for the quantitative predictors?

library(car)
house_qty<-house[, sapply(house, class) != "factor" & names(house) != "Price"]
house_fit_qty<-lm(Price_log~.,data = house_qty)
avPlots(house_fit_qty)

With the quantitative predictors only, the added-variable plots suggest a first-order regression model: none of the plots show a curved relationship between the partial residuals of log-price and the given predictor, so linear terms appear adequate.
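
As a complementary check of the functional form (a sketch; crPlots() comes from the car package already loaded above), component + residual plots make curvature in an individual predictor's effect easy to spot:

crPlots(house_fit_qty) ## component + residual (partial residual) plot for each quantitative predictor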

2. Model selection

Using the log-transformed response and first-order predictors you will now investigate whether it is reasonable to reduce the number of explanatory variables.

(a)

Use step-wise regression and the AIC criterion to select your model. Start with the full model and use "both" directions in the step() function. State the estimated regression function for the selected model.

#str(house.L)
names(house.L)
##  [1] "SqF"       "Bedr"      "Bathr"     "AC"        "Garage"    "Pool"     
##  [7] "Year"      "Quality"   "Lot"       "Price_log"
fullformula<-formula(Price_log~.,data = house.L)
house_fit_log<-lm(fullformula,data = house.L)

sel<-step(house_fit_log, scope=fullformula, direction="both", test="F")
## Start:  AIC=-1072.55
## Price_log ~ SqF + Bedr + Bathr + AC + Garage + Pool + Year + 
##     Quality + Lot
## 
##           Df Sum of Sq    RSS      AIC F value    Pr(>F)    
## - Pool     1   0.00082 10.177 -1074.52  0.0249  0.874775    
## - Bedr     1   0.00970 10.186 -1074.24  0.2925  0.588985    
## <none>                 10.177 -1072.55                      
## - Bathr    1   0.24765 10.424 -1066.90  7.4708  0.006634 ** 
## - AC       1   0.25155 10.428 -1066.78  7.5887  0.006224 ** 
## - Garage   1   0.31748 10.494 -1064.78  9.5776  0.002151 ** 
## - Year     1   0.51562 10.692 -1058.83 15.5550 9.934e-05 ***
## - Lot      1   1.05587 11.232 -1043.15 31.8530 3.783e-08 ***
## - Quality  2   1.69618 11.873 -1027.52 25.5848 5.286e-11 ***
## - SqF      1   3.07260 13.249  -990.64 92.6928 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=-1074.52
## Price_log ~ SqF + Bedr + Bathr + AC + Garage + Year + Quality + 
##     Lot
## 
##           Df Sum of Sq    RSS      AIC F value    Pr(>F)    
## - Bedr     1   0.00961 10.187 -1076.22  0.2909  0.590005    
## <none>                 10.177 -1074.52                      
## + Pool     1   0.00082 10.177 -1072.55  0.0249  0.874775    
## - Bathr    1   0.24899 10.426 -1068.83  7.5353  0.006405 ** 
## - AC       1   0.25094 10.428 -1068.77  7.5944  0.006204 ** 
## - Garage   1   0.31709 10.494 -1066.76  9.5961  0.002130 ** 
## - Year     1   0.52056 10.698 -1060.66 15.7540 8.982e-05 ***
## - Lot      1   1.06881 11.246 -1044.76 32.3458 2.999e-08 ***
## - Quality  2   1.69723 11.875 -1029.47 25.6819 4.836e-11 ***
## - SqF      1   3.07692 13.254  -992.52 93.1176 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=-1076.22
## Price_log ~ SqF + Bathr + AC + Garage + Year + Quality + Lot
## 
##           Df Sum of Sq    RSS      AIC F value    Pr(>F)    
## <none>                 10.187 -1076.22                      
## + Bedr     1    0.0096 10.177 -1074.52  0.2909  0.590005    
## + Pool     1    0.0007 10.186 -1074.24  0.0224  0.881107    
## - Bathr    1    0.2438 10.431 -1070.70  7.3940  0.006914 ** 
## - AC       1    0.2463 10.433 -1070.62  7.4698  0.006636 ** 
## - Garage   1    0.3111 10.498 -1068.65  9.4380  0.002314 ** 
## - Year     1    0.5308 10.718 -1062.07 16.1010 7.540e-05 ***
## - Lot      1    1.0612 11.248 -1046.71 32.1896 3.217e-08 ***
## - Quality  2    1.8286 12.015 -1027.72 27.7330 8.367e-12 ***
## - SqF      1    3.2121 13.399  -991.06 97.4323 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
sel
## 
## Call:
## lm(formula = Price_log ~ SqF + Bathr + AC + Garage + Year + Quality + 
##     Lot, data = house.L)
## 
## Coefficients:
##   (Intercept)            SqF          Bathr          ACYes         Garage  
##     4.891e+00      2.458e-04      4.542e-02      8.973e-02      6.927e-02  
##          Year  QualityMedium    QualityHigh            Lot  
##     3.273e-03      7.782e-02      3.696e-01      4.892e-06
v<-c("","x1","x2","x3","x4","x5","x6","x7","x8")

right<-paste(formatC(sel$coefficients,flag = "+",format = "f",digits = 6),v, collapse = " ", sep = "")
left<-c("log(Y)")

##Regression function
paste(left,right,sep = " = ")
## [1] "log(Y) = +4.890939 +0.000246x1 +0.045416x2 +0.089735x3 +0.069270x4 +0.003273x5 +0.077821x6 +0.369623x7 +0.000005x8"
##Legend
paste(names(sel$coefficients)[-1],v[-1], sep =  "=")
## [1] "SqF=x1"           "Bathr=x2"         "ACYes=x3"         "Garage=x4"       
## [5] "Year=x5"          "QualityMedium=x6" "QualityHigh=x7"   "Lot=x8"

Swimming pool (Pool) and number of bedrooms (Bedr) have been excluded from the model. The selected model has two qualitative predictors: AC (2 levels: Yes/No) and Quality (3 levels: Low, Medium, High).
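
As a quick numerical check of the selection (a sketch; extractAIC() is what step() uses internally, so its values match the trace above):

extractAIC(house_fit_log) ## full model: edf and AIC (about -1072.6)
extractAIC(sel)           ## selected model: edf and AIC (about -1076.2)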

(b)

Interpret the estimated regression coefficients (except the intercept) of your selected model.

Since the model was fitted to a log-transformed response (log(Price)), the coefficients are on the log scale and are not directly interpretable as dollar effects. Exponentiating a coefficient (the inverse of the log) gives a multiplicative effect on price: a one-unit increase in a predictor multiplies the estimated price by exp(b), i.e. changes it by 100*(exp(b) - 1)%. A prediction-based sanity check is sketched after the interpretations below.

exp(sel$coefficients)
##   (Intercept)           SqF         Bathr         ACYes        Garage 
##    133.078436      1.000246      1.046463      1.093884      1.071726 
##          Year QualityMedium   QualityHigh           Lot 
##      1.003278      1.080929      1.447189      1.000005
##coefficients expressed as percentage increase
formatC((exp(sel$coefficients)[-1]-1)*100,format = "f")
##           SqF         Bathr         ACYes        Garage          Year 
##      "0.0246"      "4.6463"      "9.3884"      "7.1726"      "0.3278" 
## QualityMedium   QualityHigh           Lot 
##      "8.0929"     "44.7189"      "0.0005"

The interpretation we can give is the following:

SqF: for every one-square-foot increase in SqF, with the other predictors held constant, the estimated price of the house increases by 0.0246%.

Bathr: for every additional bathroom (Bathr), with the other predictors held constant, the estimated price of the house increases by 4.6463%.

ACYes: the presence of air conditioning increases the estimated price of the house by 9.3884% compared to the same house (other predictors held constant) without AC. The AC indicator shifts the intercept of the regression function.

Garage: for every additional garage space (Garage), with the other predictors held constant, the estimated price of the house increases by 7.1726%.

Year: for every additional year of construction (Year), with the other predictors held constant, the estimated price of the house increases by 0.3278%.

QualityMedium: a medium quality index of construction increases the estimated price of the house by 8.0929% compared to the same house (other predictors held constant) with a low quality index. The quality index shifts the intercept of the regression function.

QualityHigh: a high quality index of construction increases the estimated price of the house by 44.7189% compared to the same house (other predictors held constant) with a low quality index. The quality index shifts the intercept of the regression function.

Lot: for every one-square-foot increase in Lot, with the other predictors held constant, the estimated price of the house increases by 0.0005%.
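
The prediction-based sanity check mentioned above (a sketch with hypothetical predictor values; only Quality differs between the two rows): predicting the same house with Quality = Low versus Quality = High should change the estimated price by a factor of about exp(0.3696), i.e. roughly 1.447 or +44.7%.

new_low<-data.frame(SqF=2000, Bathr=2, AC="Yes", Garage=2, Year=1980, Quality="Low", Lot=20000) ## hypothetical house
new_high<-new_low
new_high$Quality<-"High" ## identical house, high construction quality
exp(predict(sel, newdata=new_high))/exp(predict(sel, newdata=new_low)) ## expected to be about 1.447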

3. Model Diagnostics

Do more extensive model diagnostics on the model you picked in the model building section.

(a)

Plot the “traditional” residuals (i.e. the semi-studentized residuals) against fitted values and each of the predictors. Also provide a normal probability plot (qqplot). Are model assumptions satisfied? Explain.

#Semi-Studendized Residuals
ss_res<-summary(sel)$residuals/summary(sel)$sigma

# Residuals versus each of the predictors 
par(mfrow=c(3,3), mar=c(4,4,1,1))
for(k in 2:length(names(sel$model))){
  plot(sel$model[,k],ss_res, xlab=names(sel$model)[k],ylab = "Semi-Stud Res")
} 
plot(sel$fitted.values,ss_res, xlab="Fitted values", ylab = "Semi-Stud Res")
qqnorm(ss_res)
qqline(ss_res)

The model assumptions appear satisfied:

  • Regression function E(Y|x) is linear: there are no curvilinear patterns in the residual plots.

  • The error terms have constant variance: the spread of the residuals is roughly constant across each predictor and across the fitted values.

  • The error terms are normally distributed: in the normal probability plot the residuals lie close to their expected values under normality.

(b)

Obtain the studentized deleted residuals (ti) and plot them against the fitted values. Do these tell a different story than the semi-studentized residuals?

## studentized deleted residuals
stdel_res<-rstudent(sel)
n<-nrow(sel$model)
p<-ncol(sel$model)

plot(sel$fitted.values,stdel_res, xlab="Fitted values",ylab="Stud Del. Residuals",ylim=c(-4,4))
## Test for outliers:
## Test for outliers, with Bonferroni correction (alpha=0.05):
abline(h=c(-1,1)*qt(1-0.05/(2*n), n-p-1), col='red', lty=2)

Studentized deleted residuals (ti, also called externally studentized residuals) are used to spot outlying Y observations. Here none of them exceed the Bonferroni-corrected bounds (red dashed lines), so they tell essentially the same story as the semi-studentized residuals: no outlying responses are flagged. A count of the exceedances is sketched below.
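
To make the visual check explicit (a sketch reusing the n, p and stdel_res objects above), we can count how many studentized deleted residuals exceed the Bonferroni-corrected critical value:

t_crit<-qt(1-0.05/(2*n), n-p-1) ## same cut-off as the dashed lines in the plot
sum(abs(stdel_res)>t_crit)      ## expected to be 0, i.e. no flagged outliers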

(c)

Identify influential observations by calculating Cook’s distance and DFFITS for each observation and plot them (against the case number). Are there any influential observations?

DFFITS <- dffits(sel)
#DFFITS[abs(DFFITS)>1]
DFFITS[abs(DFFITS)>2*sqrt(p/n)] ## Influential observations according DFFITS
##          6          9         13         18         22         29         38 
## -1.0633231  0.3930272 -0.3664835 -0.3434260 -0.4653264 -0.3854986  0.3319163 
##         40         41         44         48         51         52         55 
##  0.3715225  0.4635233  0.4257972  0.9970987 -0.9621862 -1.0611054  0.8797025 
##         63         68         69         76         78         79         91 
## -0.4077340 -0.4481755 -0.3570566  0.4129874  0.4196129  0.5313962  1.3031542 
##        102        117        170        211        308        312 
## -0.4452269  0.7954620 -0.4362531 -0.3191877  0.4795258  0.6243101
length(DFFITS[abs(DFFITS)>2*sqrt(p/n)]) ## Number of influential observations according DFFITS
## [1] 27
DFFITS_test<-abs(DFFITS)>2*sqrt(p/n)


## DFFITS influential cases
plot(DFFITS, xlab='i', col='blue') ## DFFITS for each observation and plotted against the case number
## cut-offs:
abline(h=c(-1,1), col='red')
abline(h=c(-1,1)*2*sqrt(p/n), col='red', lty=2)
points(names(DFFITS[abs(DFFITS)>2*sqrt(p/n)]),DFFITS[abs(DFFITS)>2*sqrt(p/n)], pch=20, col="red")
text(names(DFFITS[abs(DFFITS)>2*sqrt(p/n)]), DFFITS[abs(DFFITS)>2*sqrt(p/n)], labels=names(DFFITS[abs(DFFITS)>2*sqrt(p/n)]), cex= 0.7, pos=2)

CooksD <- cooks.distance(sel)
plot(CooksD, xlab='i', col='blue', ylim=c(0,1))## cut-offs:
abline(h=qf(0.5, p, n-p), col='red')
abline(h=qf(0.2, p, n-p), col='red', lty=2)
abline(h=qf(0.1, p, n-p), col='red', lty=3)

#Influential observations according Cook's distance.
CooksD[CooksD>qf(0.5, p, n-p)]
## named numeric(0)
CooksD[CooksD>qf(0.2, p, n-p)]
## named numeric(0)
CooksD[CooksD>qf(0.1, p, n-p)]
## named numeric(0)

According to the large-n DFFITS rule of thumb (|DFFITS| > 2*sqrt(p/n)) there are 27 potentially influential observations, while according to Cook's distance there are none. If we also take into account the studentized deleted residuals with the Bonferroni correction (a conservative approach), the plot does not flag any outliers. We therefore conclude that there are no observations influential enough to warrant removal.
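
As a cross-check (a sketch using only base R), influence.measures() flags cases by several criteria at once (DFBETAS, DFFITS, covariance ratio, Cook's distance and hat values):

inf<-influence.measures(sel)
summary(inf) ## table of cases flagged by at least one criterion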

4. Model Validation.

If any influential observations were identified in part 3, use the model fits where they have been removed.

(a)

Predict the validation data set and obtain the mean squared prediction error (MSPE) and compare to the estimated MSE from the training data. What does this comparison tell you about the adequacy of the model?

sel$call$formula
## Price_log ~ SqF + Bathr + AC + Garage + Year + Quality + Lot
sel_formula<-formula(Price_log ~ SqF + Bathr + AC + Garage + Year + Quality + Lot)

house_val<-read.table("HousesValidation.txt", header = T)
house_val$Quality<-factor(house_val$Quality, levels(house_val$Quality)[c(2,3,1)]) ## reorder the levels for 'Quality'
#str(house_val)

house_val<-house_val%>% mutate(Price_log=log(Price))
house_val.L<-house_val[,-1] ## Create a new data frame that keeps the log-transformed price and drops the raw Price.

Pred<- predict(sel, newdata=house_val.L)
MSPE<- sum( (house_val.L$Price_log-Pred)^2 )/nrow(house_val.L)
MSPE
## [1] 0.02992804
MSE_mb<-summary(sel)$sigma^2
MSE_mb
## [1] 0.0329675
((MSPE - MSE_mb)/MSE_mb)*100
## [1] -9.219564

MSPE and MSE_mb (the MSE obtained from the model-building/training data set) are very similar; the MSPE is in fact about 9% smaller. This indicates that the training MSE is not seriously biased and that the selected model should give sensible uncertainty estimates for prediction.

(b)

Fit the selected model to the validation data set and compare estimated regression coefficients (by comparing confidence intervals), estimated MSE and R2. How do these compare?

sel_val<-lm(sel_formula,data = house_val.L)

CI<-confint(sel)[-1,] %>% exp()## Intercept excluded
CI.v<-confint(sel_val)[-1,] %>% exp()## Intercept excluded

cbind(CI,CI.v) ## Side by side: the training and validation intervals overlap.
##                  2.5 %   97.5 %     2.5 %   97.5 %
## SqF           1.000197 1.000295 1.0002442 1.000364
## Bathr         1.012631 1.081425 0.9979525 1.073996
## ACYes         1.025449 1.166886 0.9379784 1.094226
## Garage        1.025216 1.120345 0.9807466 1.092376
## Year          1.001669 1.004889 1.0024106 1.006095
## QualityMedium 1.016295 1.149674 1.0155880 1.182459
## QualityHigh   1.300675 1.610208 1.2675163 1.651307
## Lot           1.000003 1.000007 1.0000034 1.000009
summary(sel)$sigma^2 ##MSE obtained from the model-building data set
## [1] 0.0329675
MSE_val<-summary(sel_val)$sigma^2 ##MSE obtained from the validation data set
MSE_val
## [1] 0.02912889
(MSE_mb - MSE_val) ## small difference.
## [1] 0.003838613
summary(sel)$adj.r.squared
## [1] 0.8103407
summary(sel_val)$adj.r.squared ## Adjusted R-squared values are reasonably similar.
## [1] 0.8549441

Comparing the confidence intervals, estimated MSE, and adjusted R-squared obtained from the training and validation data sets we see that: the confidence intervals overlap; the estimated MSEs are similar; and the adjusted R-squared values are similar as well.

These three points validate the selected model: the regression coefficients are stable and reasonable, the regression function is plausible, and the inference can be generalized.

As a final step, the selected formula [Price_log ~ SqF + Bathr + AC + Garage + Year + Quality + Lot] can be refit to the combined data set (training + validation data) to obtain the final model; a sketch of this follows.
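
A sketch of that final step (hypothetical object names; it relies on house.L and house_val.L having the same column layout, which they do by construction above):

house_all<-rbind(house.L, house_val.L)       ## combined training + validation data
final_fit<-lm(sel_formula, data = house_all) ## refit the selected formula
summary(final_fit)$coefficients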

5. Shrinkage methods

Using all 9 predictors you started with and the log-transformed response you will now compare the performance of lasso, ridge regression and the selected model.

(a)

Fit a ridge regression model on the training data set, with lambda chosen by cross-validation. Use it to predict the validation data set and obtain the mean squared prediction error (MSPE).

library(glmnet)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 3.0-2
#sum(is.na(house.L)) ## no NA values; no observations were removed in part 3.

X_tr<- model.matrix(Price_log~., data=house.L)
Y_tr<- house.L$Price_log
n_tr<- nrow(X_tr)

X_val<-model.matrix(Price_log~., data=house_val.L)
Y_val<-house_val.L$Price_log
grid<- 10^seq(10,-2,length=100) ## lambda

house_fit_ridge<- glmnet(X_tr,Y_tr,alpha=0, lambda=grid) #ridge: alpha=0 

# Plot the parameter estimates for each lambda
#MyColors <- rainbow(20)
# Plot the parameter estimates for each lambda (log scale)
#matplot(log(house_fit_ridge$lambda), t(house_fit_ridge$beta), type='l', col=MyColors,lty=1, xlab="log(lambda)", ylab="Standardized coefficients")
#legend("bottomright", legend=rownames(house_fit_ridge$beta), col=MyColors, lty=1, cex=0.4)

house_fit_ridgeCV <- cv.glmnet(X_tr,Y_tr,alpha=0, nfolds=10) #ridge: alpha=0 
best.lambdaRR <- house_fit_ridgeCV$lambda.min

house_fit_ridgeBEST<-glmnet(X_tr,Y_tr,alpha=0, lambda=best.lambdaRR)
pred_house_ridge <- predict(house_fit_ridgeBEST, newx=X_val)
MSPE.Ridge<-mean((pred_house_ridge - Y_val)^2 )
MSPE.Ridge
## [1] 0.03128462
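
An equivalent shortcut (a sketch): predict directly from the cv.glmnet object at lambda.min instead of refitting glmnet at a single lambda. Because the cross-validation folds are random, the chosen lambda (and hence the MSPE) can vary slightly between runs, but the value should be close to MSPE.Ridge.

best.lambdaRR ## lambda chosen by 10-fold cross-validation
pred_ridge_cv<-predict(house_fit_ridgeCV, newx=X_val, s="lambda.min")
mean((pred_ridge_cv - Y_val)^2) ## should be close to MSPE.Ridge above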

(b)

Fit a lasso model on the training data set, with lambda chosen by cross-validation. Use it to predict the validation data set and obtain the mean squared prediction error (MSPE).

house_fit_lasso <- glmnet(X_tr,Y_tr,alpha=1, lambda=grid) #lasso: alpha=1 

house_fit_lassoCV <- cv.glmnet(X_tr,Y_tr,alpha=1, nfolds=10)
best.lambdaRR <- house_fit_lassoCV$lambda.min

house_fit_lassoBEST<-glmnet(X_tr,Y_tr,alpha=1, lambda=best.lambdaRR)
pred_house_lasso <- predict(house_fit_lassoBEST, newx=X_val)
MSPE.lasso<-mean((pred_house_lasso - Y_val)^2 )
MSPE.lasso
## [1] 0.03018182
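
To see how the chosen penalty relates to the cross-validation error curve (a sketch; plot() on a cv.glmnet object marks lambda.min and lambda.1se with vertical dashed lines):

plot(house_fit_lassoCV) ## CV mean-squared error versus log(lambda)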

(c)

Compare the MSPE you got from lasso, ridge regression and least squares (part 4a). Which method performs best?

MSPEcheck<-data.frame(MSPE.Ridge,MSPE.lasso,MSPE)
MSPEcheck
##   MSPE.Ridge MSPE.lasso       MSPE
## 1 0.03128462 0.03018182 0.02992804
paste(paste(names(which.min(MSPEcheck)), round(min(MSPEcheck),4), sep=" = "),"performs best", sep=" , ")
## [1] "MSPE = 0.0299 , performs best"

Least squares (the model selected by step-wise regression in part 2) performs best, as its MSPE on the validation data is the lowest, although all three MSPEs are quite close.

(d)

The lasso method essentially performs variable selection since it can set parameter estimates to zero. Display the estimated regression coefficients from the lasso fit. Did the lasso method select the same variables as your step-wise selection procedure and are the estimated regression coefficients similar?

house_fit_lasso <- glmnet(X_tr,Y_tr,alpha=1, lambda=grid) #lasso: alpha=1 

house_fit_lassoCV <- cv.glmnet(X_tr,Y_tr,alpha=1, nfolds=10)
best.lambdaRR <- house_fit_lassoCV$lambda.min

house_fit_lassoBEST<-glmnet(X_tr,Y_tr,alpha=1, lambda=best.lambdaRR)
pred_house_lasso <- predict(house_fit_lassoBEST, newx=X_val)
MSPE.lasso<-mean((pred_house_lasso - Y_val)^2 )

house_fit_lassoBEST$beta
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                         s0
## (Intercept)   .           
## SqF           2.463396e-04
## Bedr          .           
## Bathr         4.812382e-02
## ACYes         8.564219e-02
## Garage        6.891286e-02
## PoolYes       .           
## Year          3.250866e-03
## QualityMedium 6.119326e-02
## QualityHigh   3.435227e-01
## Lot           4.574523e-06
sel$coefficients %>% as.matrix()
##                       [,1]
## (Intercept)   4.890939e+00
## SqF           2.458203e-04
## Bathr         4.541570e-02
## ACYes         8.973489e-02
## Garage        6.927009e-02
## Year          3.272646e-03
## QualityMedium 7.782085e-02
## QualityHigh   3.696233e-01
## Lot           4.892484e-06

The lasso method selects the same predictors as the step-wise selection procedure (Bedr and Pool are dropped). For the predictors kept in both models the estimated coefficients are very similar, with most of the lasso estimates slightly shrunk toward zero, as expected from the L1 penalty.
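
A compact side-by-side view of this comparison (a sketch; common is just a hypothetical helper name):

common<-names(sel$coefficients)[-1] ## predictors kept by the step-wise model (intercept dropped)
cbind(lasso=house_fit_lassoBEST$beta[common,1], ols=sel$coefficients[common])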