A Project Report on Multivariate Regression Analysis
Gurtej Khanooja
March 2017
To predict the energy consumption (Heating Load) by creating the best Multivariate Linear Regression Model.
Understanding the interactions among the various regressor variables.
The dataset is borrowed from the UC Irvine Machine Learning Repository. It is an energy efficiency dataset with 768 observations. There are a total of 8 attributes and 1 response variable in the dataset. The attributes collectively determine the energy consumption (the response variable), and each factor contributes to the energy consumption in its own way.
Following are the attributes and the dependent variable in the data set:
List of independent variables (as described on the UCI repository page for this dataset): X1: Relative Compactness, X2: Surface Area, X3: Wall Area, X4: Roof Area, X5: Overall Height, X6: Orientation, X7: Glazing Area, X8: Glazing Area Distribution.
Dependent Variable: Y1: Heating Load.
Our goal is to understand and study the regression model derived from this dataset. We will apply the statistical methods learned in class to obtain the best regression model from the data. The objective is to create a model with the minimum number of variables without compromising the accuracy of prediction. Moreover, we will also study how the different regressor variables depend on each other through covariance analysis.
data <- read.csv("/Users/Gurtej/Documents/NEU/Course Material/Semester 2/SME/SME - Project 1/ENB2012_data.csv", header=TRUE)
head (data)
## X1 X2 X3 X4 X5 X6 X7 X8 Y1
## 1 0.98 514.5 294.0 110.25 7 2 0 0 15.55
## 2 0.98 514.5 294.0 110.25 7 3 0 0 15.55
## 3 0.98 514.5 294.0 110.25 7 4 0 0 15.55
## 4 0.98 514.5 294.0 110.25 7 5 0 0 15.55
## 5 0.90 563.5 318.5 122.50 7 2 0 0 20.84
## 6 0.90 563.5 318.5 122.50 7 3 0 0 21.46
attach (data)
dplyr::tbl_df(data)
## # A tibble: 768 × 9
## X1 X2 X3 X4 X5 X6 X7 X8 Y1
## <dbl> <dbl> <dbl> <dbl> <dbl> <int> <dbl> <int> <dbl>
## 1 0.98 514.5 294.0 110.25 7 2 0 0 15.55
## 2 0.98 514.5 294.0 110.25 7 3 0 0 15.55
## 3 0.98 514.5 294.0 110.25 7 4 0 0 15.55
## 4 0.98 514.5 294.0 110.25 7 5 0 0 15.55
## 5 0.90 563.5 318.5 122.50 7 2 0 0 20.84
## 6 0.90 563.5 318.5 122.50 7 3 0 0 21.46
## 7 0.90 563.5 318.5 122.50 7 4 0 0 20.71
## 8 0.90 563.5 318.5 122.50 7 5 0 0 19.68
## 9 0.86 588.0 294.0 147.00 7 2 0 0 19.50
## 10 0.86 588.0 294.0 147.00 7 3 0 0 19.95
## # ... with 758 more rows
\[ \pagebreak \]
model <- lm(Y1 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8)
summary(model)
##
## Call:
## lm(formula = Y1 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8965 -1.3196 -0.0252 1.3532 7.7052
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.014521 19.033607 4.414 1.16e-05 ***
## X1 -64.773991 10.289445 -6.295 5.19e-10 ***
## X2 -0.087290 0.017075 -5.112 4.04e-07 ***
## X3 0.060813 0.006648 9.148 < 2e-16 ***
## X4 NA NA NA NA
## X5 4.169939 0.337990 12.337 < 2e-16 ***
## X6 -0.023328 0.094705 -0.246 0.80550
## X7 19.932680 0.813986 24.488 < 2e-16 ***
## X8 0.203772 0.069918 2.914 0.00367 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.934 on 760 degrees of freedom
## Multiple R-squared: 0.9162, Adjusted R-squared: 0.9154
## F-statistic: 1187 on 7 and 760 DF, p-value: < 2.2e-16
From the above linear fit for the multivariate problem, the best-fit equation is as follows:
Y1 = 84.014521 -64.773991X1 - 0.087290X2 +0.060813X3 + 4.169939X5 - 0.023328X6 + 19.932680X7 + 0.203772X8
An interesting thing to note in the above model is that the variable X4 is not defined because of singularity, which means that a linear combination of the other variables already accounts for the contribution of X4. So even if we form the model without including X4, we get exactly the same number of estimated coefficients.
The R-squared for the above model is 0.9162, which means 91.62% of the variation in Y1 is explained by the regressor variables.
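As a quick check of which coefficient is aliased, R's built-in alias() function reports the linear dependency directly; a minimal sketch, assuming the model object fitted above is still in the workspace:
# Report which coefficient is aliased and the linear dependency behind it
# (uses the `model` object fitted above)
alias(model)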
model_quad <- lm(Y1 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + I(X1^2) + I(X2^2) + I(X3^2) + I(X4^2) + I(X5^2) + I(X6^2) + I(X7^2) + I(X8^2))
summary(model_quad)
##
## Call:
## lm(formula = Y1 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + I(X1^2) +
## I(X2^2) + I(X3^2) + I(X4^2) + I(X5^2) + I(X6^2) + I(X7^2) +
## I(X8^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2880 -1.7838 -0.0623 1.7228 6.3421
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.643e+03 1.255e+02 -13.093 < 2e-16 ***
## X1 -1.816e+03 2.955e+02 -6.144 1.30e-09 ***
## X2 8.891e+00 9.184e-01 9.682 < 2e-16 ***
## X3 -4.146e+00 6.640e-01 -6.244 7.12e-10 ***
## X4 NA NA NA NA
## X5 -4.365e+01 8.542e+00 -5.110 4.09e-07 ***
## X6 6.827e-02 6.335e-01 0.108 0.914201
## X7 3.213e+01 3.536e+00 9.086 < 2e-16 ***
## X8 1.140e+00 2.778e-01 4.103 4.52e-05 ***
## I(X1^2) 1.585e+03 1.905e+02 8.320 4.11e-16 ***
## I(X2^2) -3.074e-03 2.418e-04 -12.710 < 2e-16 ***
## I(X3^2) 3.466e-04 8.487e-05 4.083 4.91e-05 ***
## I(X4^2) -2.846e-02 4.506e-03 -6.316 4.60e-10 ***
## I(X5^2) NA NA NA NA
## I(X6^2) -1.309e-02 8.977e-02 -0.146 0.884134
## I(X7^2) -2.792e+01 7.195e+00 -3.880 0.000114 ***
## I(X8^2) -1.824e-01 4.697e-02 -3.884 0.000112 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.488 on 753 degrees of freedom
## Multiple R-squared: 0.9403, Adjusted R-squared: 0.9392
## F-statistic: 847.6 on 14 and 753 DF, p-value: < 2.2e-16
From the above quadratic fit for the multivariate problem, the best-fit equation is as follows:
Y1 = -1.643e+03 - 1.816e+03X1 + 8.891e+00X2 - 4.146e+00X3 - 4.365e+01X5 + 6.827e-02X6 + 3.213e+01X7 + 1.140e+00X8 + 1.585e+03X1^2 -3.074e-03X2^2 + 3.466e-04X3^2 - 2.846e-02X4^2 -1.309e-02X6^2 -2.792e+01X7^2 -1.824e-01X8^2
We see from the summaries of the above two models that R-squared and adjusted R-squared increase as we increase the degree of the fit, signalling a better fit to the data. Increasing the degree of the fitted model is useful only up to a point; beyond that, the downsides of overfitting can dominate and give undesired predictions.
How can we tell whether a model fits the data? If the model is correct, the residual mean square should be an unbiased estimate of the error variance. If the model is not complex enough to fit the data, or simply takes the wrong form, the residual mean square will overestimate the error variance.
ts <- (summary(model)$sigma^2)*760 #test statistic for 760 degree of freedom
1-pchisq(ts,760)
## [1] 0
model_fit<- lm(Y1 ~ factor(X1)+ factor(X2)+factor(X3)+factor(X4)+factor(X5)+factor(X6)+factor(X7))
anova(model, model_fit)
## Analysis of Variance Table
##
## Model 1: Y1 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8
## Model 2: Y1 ~ factor(X1) + factor(X2) + factor(X3) + factor(X4) + factor(X5) +
## factor(X6) + factor(X7)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 760 6543.8
## 2 750 794.7 10 5749.1 542.58 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The low p-value indicates a lack of fit. The reason is that the pure-error standard deviation is substantially less than the regression standard error of 2.934. We might investigate models other than a straight line to get a more in-depth picture.
For the factor model above, the R-squared is 98.96%. So even this saturated model does not attain an R-squared of 100%. For these data the difference is small, but in other cases it can be substantial. In such cases one should realize that the maximum attainable R-squared may be well below 100%, and expectations about what a good R-squared value is should be adjusted accordingly.
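Both quantities can be pulled directly from the saturated factor model; a minimal sketch, assuming the model_fit object fitted above:
# Pure-error standard deviation and R-squared of the saturated factor model
# (uses the `model_fit` object fitted above)
summary(model_fit)$sigma       # pure-error standard deviation; compare with 2.934 from MODEL 1
summary(model_fit)$r.squared   # reported above as 98.96%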
These methods are good for detecting lack of fit, but if the null hypothesis is accepted, we cannot conclude that we have the true model. After all, it may be that we just did not have enough data to detect the inadequacies of the model. All we can say is that the model is not contradicted by the data. When there are no replicates, it may be possible to group the responses for similar x but this is not straightforward. \[ \pagebreak \]
After fitting a regression model it is important to determine whether all the necessary model assumptions are valid before performing inference. If there are any violations, subsequent inferential procedures may be invalid resulting in faulty conclusions. Therefore, it is crucial to perform appropriate model diagnostics.
Model diagnostic procedures involve both graphical methods and formal statistical tests. These procedures allow us to explore whether the assumptions of the regression model are valid and decide whether we can trust subsequent inference results.
pairs(~X1+X2+X3+X4+X5+X6+X7+X8)
The above scatter-plot matrix shows the relationships between the different variables in the dataset, taken two at a time, and gives a graphical view of how the variables are related to each other.
The leverage of an observation measures its ability to move the regression model all by itself by simply moving in the y-direction. The leverage measures the amount by which the predicted value would change if the observation was shifted one unit in the y- direction.
The leverage always takes values between 0 and 1. A point with zero leverage has no effect on the regression model. If a point has leverage equal to 1 the line must follow the point perfectly.
lev = hat(model.matrix(model))
plot(lev)
There are some points high up in the graph, which are the higher-leverage points. Most of the points in the above graph occupy a very small range (approximately 0.017), so there is no truly high-leverage point in this dataset. Still, let us observe the points with leverage greater than 0.022.
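For reference, a common rule of thumb flags observations whose leverage exceeds 2p/n, where p is the number of estimated coefficients and n the number of observations; the sketch below computes that cutoff for this dataset, which turns out to be of the same order as the 0.022 threshold used next.
# Rule-of-thumb leverage cutoff 2p/n
p <- sum(!is.na(coef(model)))   # 8 non-aliased coefficients, including the intercept
n <- nrow(data)                 # 768 observations
2 * p / n                       # roughly 0.021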
data[lev>0.022,]
## X1 X2 X3 X4 X5 X6 X7 X8 Y1
## 21 0.76 661.5 416.5 122.5 7 2 0 0 24.77
## 24 0.76 661.5 416.5 122.5 7 5 0 0 23.93
There is nothing particularly unusual about the observations above, but we can note that even though these points carry only a minute amount of extra leverage, there are just two such points in the whole dataset.
We can use residuals to study whether:
* The regression function is nonlinear.
* The error terms have nonconstant variance.
* The error terms are not independent.
* There are outliers.
* The error terms are not normally distributed.
We can check for violations of these assumptions by making plots of the residuals, including:
* Plots of the residuals against the explanatory variables or fitted values.
* Histograms of the residuals.
* Normal probability plots of the residuals.
par(mar=c(1,1,1,1))
par(mfrow=c(4,2))
plot(X1, model$residuals,xlab = "X1", ylab = "Residual" )
plot(X2, model$residuals,xlab = "X2", ylab = "Residual")
plot(X3, model$residuals,xlab = "X3", ylab = "Residual")
plot(X4, model$residuals,xlab = "X4", ylab = "Residual")
plot(X5, model$residuals,xlab = "X5", ylab = "Residual")
plot(X6, model$residuals,xlab = "X6", ylab = "Residual")
plot(X7, model$residuals,xlab = "X7", ylab = "Residual")
hist(model$res)
From the plots above we see that the residuals range from about -10 to 5 across the different regressor variables. From the histogram of the residuals we see that most residuals are concentrated near the mean, with an almost normal-looking curve that thins out at the extremities. \[ \pagebreak \]
NOTE: All the calculations below are for the linear multivariate model, i.e. MODEL 1.
confint(model)
## 2.5 % 97.5 %
## (Intercept) 46.64983259 121.37920978
## X1 -84.97310070 -44.57488228
## X2 -0.12081099 -0.05376956
## X3 0.04776285 0.07386383
## X4 NA NA
## X5 3.50643397 4.83344366
## X6 -0.20924198 0.16258573
## X7 18.33475226 21.53060810
## X8 0.06651684 0.34102670
The table above shows the 95% confidence intervals for all the coefficients in the best-fit line. The column labelled 2.5% gives the lower boundary of each interval and the column labelled 97.5% gives the upper boundary.
Both a confidence interval and a prediction interval will be calculated for X1=0.64, X2=808.5, X3=367.5, X4=220.5, X5=3.5, X6=5, X7=0.40, X8=5.
The true value of Y1 for this case is 16.64. We will observe how accurately our full model can predict Y1 for this case.
datapredict <- data.frame(X1=0.64, X2=808.5, X3=367.5, X4= 220.5, X5= 3.5, X6=5, X7= 0.40, X8= 5)
predict(model, datapredict, interval='confidence')
## Warning in predict.lm(model, datapredict, interval = "confidence"):
## prediction from a rank-deficient fit may be misleading
## fit lwr upr
## 1 17.80396 16.81991 18.788
The above output gives the 95 percent confidence interval for the given values of the regressor variables. The fitted value is 17.80396. The lower and upper bounds of the 95% confidence interval are 16.81991 and 18.788 respectively.
datapredict <- data.frame(X1=0.64, X2=808.5, X3=367.5, X4= 220.5, X5= 3.5, X6=5, X7= 0.40, X8= 5)
predict(model, datapredict, interval='prediction')
## Warning in predict.lm(model, datapredict, interval = "prediction"):
## prediction from a rank-deficient fit may be misleading
## fit lwr upr
## 1 17.80396 11.96018 23.64774
The above output gives the 95 percent prediction interval for the given values of the regressor variables. The fitted value is 17.80396, while the true value is 16.64. The lower and upper bounds of the 95% prediction interval are 11.96018 and 23.64774 respectively.
The deviation from the true value is 1.16 (17.80 - 16.64), giving about 93.5 percent accuracy in prediction for this case.
NOTE: One interesting thing to note here is that in both the confidence and prediction interval cases we get the same fitted value, but the gap between the upper and lower bounds is larger for the prediction interval than for the confidence interval. This is because the prediction interval accounts for the uncertainty in estimating the mean response as well as the scatter of individual observations.
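The difference in width can also be seen from the standard textbook formulas for the two intervals at a new point x0: the prediction interval carries an extra 1 under the square root for the variance of a new observation.
\[ \hat{y}_0 \pm t_{\alpha/2,\, n-p}\; s\sqrt{x_0'(X'X)^{-1}x_0} \quad \text{(confidence interval for the mean response)} \]
\[ \hat{y}_0 \pm t_{\alpha/2,\, n-p}\; s\sqrt{1 + x_0'(X'X)^{-1}x_0} \quad \text{(prediction interval for a new observation)} \]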
vcov(model)
## (Intercept) X1 X2 X3
## (Intercept) 362.27818722 -1.928790e+02 -3.218025e-01 8.464890e-02
## X1 -192.87904822 1.058727e+02 1.678198e-01 -3.901285e-02
## X2 -0.32180255 1.678198e-01 2.915722e-04 -8.630359e-05
## X3 0.08464890 -3.901285e-02 -8.630359e-05 4.419501e-05
## X5 -4.85895252 2.223678e+00 4.799333e-03 -2.084161e-03
## X6 -0.03139142 3.869393e-15 5.209532e-18 -4.445512e-19
## X7 -0.12120240 3.816931e-14 3.991331e-17 4.707609e-18
## X8 -0.01090822 3.002871e-15 3.515831e-18 1.634146e-19
## X5 X6 X7 X8
## (Intercept) -4.858953e+00 -3.139142e-02 -1.212024e-01 -1.090822e-02
## X1 2.223678e+00 3.869393e-15 3.816931e-14 3.002871e-15
## X2 4.799333e-03 5.209532e-18 3.991331e-17 3.515831e-18
## X3 -2.084161e-03 -4.445512e-19 4.707609e-18 1.634146e-19
## X5 1.142372e-01 2.362747e-17 -2.713966e-16 -5.539011e-18
## X6 2.362747e-17 8.968978e-03 4.917679e-17 8.732390e-18
## X7 -2.713966e-16 4.917679e-17 6.625731e-01 -1.212024e-02
## X8 -5.539011e-18 8.732390e-18 -1.212024e-02 4.888497e-03
The above output is the variance-covariance matrix of the estimated coefficients. The off-diagonal elements show the covariances between the respective coefficient estimates. For example, the largest positive off-diagonal entry, 2.223678e+00, corresponds to X1 and X5, reflecting a strong positive covariance between those two estimates. The largest negative entry, -1.928790e+02, corresponds to the intercept and X1, indicating a strong inverse relationship between them.
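Because these covariances are on very different scales, it can be easier to judge the strength of the relationships after rescaling to correlations; a minimal sketch using the built-in cov2cor(), assuming the model object from above:
# Rescale the coefficient covariance matrix to correlations for easier comparison
round(cov2cor(vcov(model)), 3)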
library(car)
## Warning: package 'car' was built under R version 3.3.2
vif(lm( Y1 ~ X1 + X2 + X3 + X5 + X6 + X7 + X8))
## X1 X2 X3 X5 X6 X7
## 105.524054 201.531134 7.492984 31.205474 1.000000 1.047508
## X8
## 1.047508
We use the variance inflation factor (VIF) to assess multicollinearity in the model. As seen above, there is no VIF coefficient corresponding to X4 because it is an aliased variable, so it is excluded from the model used for the VIF calculation.
According to most authors, VIF >= 5 is a sign of caution and VIF >= 10 is a clear sign of multicollinearity. From the above results we can see that X1, X2 and X5 have very high VIF values, suggesting a high degree of multicollinearity between those variables and the other variables in the model.
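The VIF of a regressor is 1/(1 - R_j^2), where R_j^2 comes from regressing that variable on the remaining regressors. The sketch below reproduces the VIF of X1 by hand (assuming the attached data columns used above); it should agree with the value of about 105.5 reported by vif().
# VIF of X1 from first principles: regress X1 on the other regressors, then apply 1/(1 - R^2)
r2_x1 <- summary(lm(X1 ~ X2 + X3 + X5 + X6 + X7 + X8))$r.squared
1 / (1 - r2_x1)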
Following is the formula for the unbiased estimator of the error variance:
\[s^2 = SSE/(n-k-1)\]
unbiased_estimator <- sum(residuals(model)^2)/(768-7-1)
unbiased_estimator
## [1] 8.610219
The output of the above R script is 8.610219, which is the value of s^2, the unbiased estimator of the error variance for the above model.
\[ \pagebreak \]
From the summary of the fitted model, the following are the t-values and corresponding p-values for the variables:
summary(model)
##
## Call:
## lm(formula = Y1 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8965 -1.3196 -0.0252 1.3532 7.7052
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 84.014521 19.033607 4.414 1.16e-05 ***
## X1 -64.773991 10.289445 -6.295 5.19e-10 ***
## X2 -0.087290 0.017075 -5.112 4.04e-07 ***
## X3 0.060813 0.006648 9.148 < 2e-16 ***
## X4 NA NA NA NA
## X5 4.169939 0.337990 12.337 < 2e-16 ***
## X6 -0.023328 0.094705 -0.246 0.80550
## X7 19.932680 0.813986 24.488 < 2e-16 ***
## X8 0.203772 0.069918 2.914 0.00367 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.934 on 760 degrees of freedom
## Multiple R-squared: 0.9162, Adjusted R-squared: 0.9154
## F-statistic: 1187 on 7 and 760 DF, p-value: < 2.2e-16
We will use the above values to test the significance of the individual variables using their t-values and p-values, and modify the model by removing any variable found to be insignificant.
Let us set up a common hypothesis test for every variable in the model:
\[ Ho: \beta_j = 0 \] \[ H1: \beta_j \not= 0 \]
We will assume a significance level of \[\alpha = 0.05\]
We see from the table above that the only variable whose p-value is greater than 0.05 is X6. We will now form a new model by removing X6 as well as X4 (which does not affect the model in any case) and see how much R-squared and adjusted R-squared are affected.
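The same screening can be done programmatically by pulling the coefficient table out of the model summary and flagging p-values above 0.05; a minimal sketch, assuming the model object fitted above:
# Extract the coefficient table and list variables whose p-value exceeds 0.05
coefs <- summary(model)$coefficients   # the aliased X4 row is dropped automatically
pvals <- coefs[, "Pr(>|t|)"]
names(pvals)[pvals > 0.05]             # expected to flag X6 only, per the summary above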
model3 <- lm(Y1~ X1+ X2+ X3+ X5+ X7+ X8)
summary(model3)
##
## Call:
## lm(formula = Y1 ~ X1 + X2 + X3 + X5 + X7 + X8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9315 -1.3189 -0.0263 1.3586 7.7169
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.932873 19.018972 4.413 1.17e-05 ***
## X1 -64.773991 10.283093 -6.299 5.06e-10 ***
## X2 -0.087290 0.017065 -5.115 3.97e-07 ***
## X3 0.060813 0.006644 9.153 < 2e-16 ***
## X5 4.169939 0.337781 12.345 < 2e-16 ***
## X7 19.932680 0.813483 24.503 < 2e-16 ***
## X8 0.203772 0.069875 2.916 0.00365 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.933 on 761 degrees of freedom
## Multiple R-squared: 0.9162, Adjusted R-squared: 0.9155
## F-statistic: 1387 on 6 and 761 DF, p-value: < 2.2e-16
From the summary of the above model, we see that the R-squared value is 0.9162 and the adjusted R-squared value is 0.9155. The R-squared and adjusted R-squared values for the original model (MODEL 1) were 0.9162 and 0.9154. Therefore, the model with fewer variables is in fact at least as good as the original model with more variables.
Apart from this, the p-values of the variables in the new model are still less than 0.05, signifying that the remaining variables are an essential part of the model according to the t-test.
\[ Ho: \beta_1 = \beta_2 = \dots = \beta_8 = 0 \] \[ H1: \text{at least one } \beta_j \neq 0 \]
From the summary of MODEL 1, the F-statistic is 1187 with a very small p-value. We therefore reject Ho in favour of the alternative hypothesis and conclude that the regression is significant, i.e. the model has real predictive value.
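As a sanity check, the reported F-statistic can be reconstructed from R-squared alone using F = (R^2/k) / ((1 - R^2)/(n - k - 1)); a minimal sketch with k = 7 non-aliased regressors and n = 768, assuming the model object from above:
# Reconstruct the overall F-statistic from R-squared
r2 <- summary(model)$r.squared
k <- 7; n <- 768
(r2 / k) / ((1 - r2) / (n - k - 1))   # should be close to the reported 1187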
Partial F-tests are performed to see whether a subset of the regressors still contributes to the model, and to what extent, once the remaining variables have already been taken into account.
Since there are 8 regressor variables, there are 2^8 = 256 different models that could be formed by setting different subsets of coefficients to zero. Out of those 256 cases we will select 12 and see how the predictive capabilities are affected compared to the initial base model (MODEL 1).
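What anova() computes in each case is the partial F-statistic F = [(RSS_reduced - RSS_full)/q] / [RSS_full/(n - p)], where q is the number of coefficients dropped. The sketch below reproduces the first comparison by hand (reduced_by_hand is an illustrative name; it assumes the model object and the attached data columns from above).
# Partial F-statistic by hand for dropping X1 and X8 from the full model
reduced_by_hand <- lm(Y1 ~ X2 + X3 + X4 + X5 + X6 + X7)
rss_full <- sum(residuals(model)^2)
rss_reduced <- sum(residuals(reduced_by_hand)^2)
q <- df.residual(reduced_by_hand) - df.residual(model)   # number of coefficients dropped
((rss_reduced - rss_full) / q) / (rss_full / df.residual(model))   # compare with the first anova() output below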
model1_reduced <- lm(Y1 ~ X2+ X3+ X4 +X5+ X6+ X7)
anova(model1_reduced, model)
## Analysis of Variance Table
##
## Model 1: Y1 ~ X2 + X3 + X4 + X5 + X6 + X7
## Model 2: Y1 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 762 6958.1
## 2 760 6543.8 2 414.35 24.062 7.373e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[ Ho: \beta_1 = \beta_8 = 0 \] \[ H1: \text{at least one of } \beta_1, \beta_8 \neq 0 \]
Here we are asking whether X1 and X8 are still significant given all the other variables in the regression model. Since the p-value is very small, we conclude that these variables are still significant and we reject Ho. The predicted values would differ in their absence.
model2_reduced <- lm(Y1 ~ X1+ X3+ X4 +X5+ X7+ X8)
anova(model2_reduced, model)
## Analysis of Variance Table
##
## Model 1: Y1 ~ X1 + X3 + X4 + X5 + X7 + X8
## Model 2: Y1 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 761 6544.3
## 2 760 6543.8 1 0.52243 0.0607 0.8055
\[ Ho: \beta_2 = \beta_6 = 0 \] \[ H1: \text{at least one of } \beta_2, \beta_6 \neq 0 \]
Here we are asking whether X2 and X6 are still significant given all the other variables in the regression model. Since the p-value is 0.8055, we conclude that these variables are insignificant and we fail to reject Ho. The predicted values would not differ much from the full model even if X2 and X6 were absent.
model3_reduced <- lm(Y1 ~ X1+ X2+ X3+ X6+ X7+ X8)
anova(model3_reduced, model)
## Analysis of Variance Table
##
## Model 1: Y1 ~ X1 + X2 + X3 + X6 + X7 + X8
## Model 2: Y1 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 761 7854.4
## 2 760 6543.8 1 1310.6 152.21 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[ Ho: \beta_4 = \beta_5 = 0 \] \[ H1: \text{at least one of } \beta_4, \beta_5 \neq 0 \]
Here we are asking whether X4 and X5 are still significant given all the other variables in the regression model (since X4 is aliased, this effectively tests X5). Since the p-value is very small, we conclude that the variables are still significant and we reject Ho. The predicted values would differ in their absence.
model4_reduced <- lm(Y1 ~ X1+ X2+ X3 +X5+ X6 + X8)
anova(model4_reduced, model)
## Analysis of Variance Table
##
## Model 1: Y1 ~ X1 + X2 + X3 + X5 + X6 + X8
## Model 2: Y1 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 761 11706.9
## 2 760 6543.8 1 5163.1 599.65 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[ Ho: \beta_4 = \beta_7 = 0 \] \[ H1: \text{at least one of } \beta_4, \beta_7 \neq 0 \]
Here we are asking whether X4 and X7 are still significant given all the other variables in the regression model. Since the p-value is very small, we conclude that the variables are still significant and we reject Ho. The predicted values would differ in their absence.
model5_reduced <- lm(Y1 ~ X1+ X2+ X4+ X6+ X8)
anova(model5_reduced, model)
## Analysis of Variance Table
##
## Model 1: Y1 ~ X1 + X2 + X4 + X6 + X8
## Model 2: Y1 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 762 13017.5
## 2 760 6543.8 2 6473.7 375.93 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[ Ho: \beta_3 = \beta_5 = \beta_7 = 0 \] \[ H1: \text{at least one of } \beta_3, \beta_5, \beta_7 \neq 0 \]
Here we are asking whether X3, X5 and X7 are still significant given all the other variables in the regression model. Since the p-value is very small, we conclude that the variables are still significant and we reject Ho. The predicted values would differ in their absence.
model6_reduced <- lm(Y1 ~ X1+ X2+ X3 +X4+ X5)
anova(model6_reduced, model)
## Analysis of Variance Table
##
## Model 1: Y1 ~ X1 + X2 + X3 + X4 + X5
## Model 2: Y1 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 763 12303.5
## 2 760 6543.8 3 5759.7 222.98 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[ Ho: \beta_6 = \beta_7 = \beta_8 = 0 \] \[ H1: \text{at least one of } \beta_6, \beta_7, \beta_8 \neq 0 \]
Here we are asking whether X6, X7 and X8 are still significant given all the other variables in the regression model. Since the p-value is very small, we conclude that the variables are still significant and we reject Ho. The predicted values would differ in their absence.
model7_reduced <- lm(Y1 ~ X1+ X3 +X4+ X5+ X6+ X8)
anova(model7_reduced, model)
## Analysis of Variance Table
##
## Model 1: Y1 ~ X1 + X3 + X4 + X5 + X6 + X8
## Model 2: Y1 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 761 11706.9
## 2 760 6543.8 1 5163.1 599.65 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[ Ho: \beta_2 = \beta_7 = 0 \] \[ H1: \text{at least one of } \beta_2, \beta_7 \neq 0 \]
Here we are asking whether X2 and X7 are still significant given all the other variables in the regression model. Since the p-value is very small, we conclude that the variables are still significant and we reject Ho. The predicted values would differ in their absence.
model8_reduced <- lm(Y1 ~ X1+ X2+ X3+ X5+ X6+ X7)
anova(model8_reduced, model)
## Analysis of Variance Table
##
## Model 1: Y1 ~ X1 + X2 + X3 + X5 + X6 + X7
## Model 2: Y1 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 761 6616.9
## 2 760 6543.8 1 73.135 8.494 0.003668 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[ Ho: \beta_4 = \beta_8 = 0 \] \[ H1: \text{at least one of } \beta_4, \beta_8 \neq 0 \]
Here we are asking whether X4 and X8 are still significant given all the other variables in the regression model. Since the p-value is small (0.003668), we conclude that the variables are still significant and we reject Ho. The predicted values would differ in their absence.
model9_reduced <- lm(Y1 ~ X3 +X4+ X5+ X6+ X7+ X8)
anova(model9_reduced, model)
## Analysis of Variance Table
##
## Model 1: Y1 ~ X3 + X4 + X5 + X6 + X7 + X8
## Model 2: Y1 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 761 6885.0
## 2 760 6543.8 1 341.22 39.629 5.187e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[ Ho: \beta_1 = \beta_2 = 0 \] \[ H1: \text{at least one of } \beta_1, \beta_2 \neq 0 \]
Here we are asking whether X1 and X2 are still significant given all the other variables in the regression model. Since the p-value is very small, we conclude that the variables are still significant and we reject Ho. The predicted values would differ in their absence.
model10_reduced <- lm(Y1 ~ X1+ X2+ X5+ X6+ X7+ X8)
anova(model10_reduced, model)
## Analysis of Variance Table
##
## Model 1: Y1 ~ X1 + X2 + X5 + X6 + X7 + X8
## Model 2: Y1 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 761 7264.3
## 2 760 6543.8 1 720.51 83.68 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[ Ho: \beta_3 = \beta_4 = 0 \] \[ H1: \text{at least one of } \beta_3, \beta_4 \neq 0 \]
Here we are asking whether X3 and X4 are still significant given all the other variables in the regression model. Since the p-value is very small, we conclude that the variables are still significant and we reject Ho. The predicted values would differ in their absence.
model11_reduced <- lm(Y1 ~ X1+ X2+ X3+ X4+ X7+ X8)
anova(model11_reduced, model)
## Analysis of Variance Table
##
## Model 1: Y1 ~ X1 + X2 + X3 + X4 + X7 + X8
## Model 2: Y1 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 762 7854.9
## 2 760 6543.8 2 1311.1 76.137 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[ Ho: \beta_5 = \beta_6 = 0 \] \[ H1: \text{at least one of } \beta_5, \beta_6 \neq 0 \]
Here we are asking whether X5 and X6 are still significant given all the other variables in the regression model. Since the p-value is very small, we conclude that the variables are still significant and we reject Ho. The predicted values would differ in their absence.
model12_reduced <- lm(Y1 ~ X1+ X3+ X4+ X8)
anova(model12_reduced, model)
## Analysis of Variance Table
##
## Model 1: Y1 ~ X1 + X3 + X4 + X8
## Model 2: Y1 ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 763 13018.0
## 2 760 6543.8 3 6474.2 250.64 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
\[ Ho: \beta_2 = \beta_5 = \beta_6 = \beta_7 = 0 \] \[ H1: \text{at least one of } \beta_2, \beta_5, \beta_6, \beta_7 \neq 0 \]
Here we are asking whether X2, X5, X6 and X7 are still significant given all the other variables in the regression model. Since the p-value is very small, we conclude that the variables are still significant and we reject Ho. The predicted values would differ in their absence.
From all the cases above, the only one where we fail to reject Ho is case 2. According to it, even if we remove X2 and X6 from the model, the results will not differ significantly from the base model. To demonstrate this, we take a set of values for X1, X2, ..., X8 and compare the predictions from the case 2 model and the base model with the original value in the dataset.
We pick the values X1=0.64, X2=808.5, X3=367.5, X4=220.5, X5=3.5, X6=5, X7=0.40, X8=5.
datapredict <- data.frame(X1=0.64, X2=808.5, X3=367.5, X4= 220.5, X5= 3.5, X6=5, X7= 0.40, X8= 5)
datapredict_Case2 <- data.frame(X1=0.64, X3=367.5, X4= 220.5, X5= 3.5, X7= 0.40, X8= 5)
predict(model, datapredict)
## Warning in predict.lm(model, datapredict): prediction from a rank-deficient
## fit may be misleading
## 1
## 17.80396
predict(model2_reduced, datapredict_Case2)
## 1
## 17.83895
From the above we observe that the prediction from the base model was 17.80396, the prediction from the case 2 model was 17.83895, and the original value from the dataset was 16.64.
This shows that a model that drops the contributions of X2 and X6 can give results essentially equivalent to the base model.
summary(model2_reduced)
##
## Call:
## lm(formula = Y1 ~ X1 + X3 + X4 + X5 + X7 + X8)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9315 -1.3189 -0.0263 1.3586 7.7169
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 83.93287 19.01897 4.413 1.17e-05 ***
## X1 -64.77399 10.28309 -6.299 5.06e-10 ***
## X3 -0.02648 0.01277 -2.074 0.03841 *
## X4 -0.17458 0.03413 -5.115 3.97e-07 ***
## X5 4.16994 0.33778 12.345 < 2e-16 ***
## X7 19.93268 0.81348 24.503 < 2e-16 ***
## X8 0.20377 0.06987 2.916 0.00365 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.933 on 761 degrees of freedom
## Multiple R-squared: 0.9162, Adjusted R-squared: 0.9155
## F-statistic: 1387 on 6 and 761 DF, p-value: < 2.2e-16
Therefore, the new model from the partial F-test analysis can be rewritten as:
Y1 = 83.93287 -64.7739X1 -0.02648X3 -0.17458X4 +4.16994X5 +19.93268X7 +0.20377X8
\[ \pagebreak \]
library(leaps)
## Warning: package 'leaps' was built under R version 3.3.2
forward <- regsubsets(Y1~., data=data, nvmax = 8, method = 'forward')
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 1 linear dependencies found
## Reordering variables and trying again:
## Warning in rval$lopt[] <- rval$vorder[rval$lopt]: number of items to
## replace is not a multiple of replacement length
summary(forward)
## Subset selection object
## Call: regsubsets.formula(Y1 ~ ., data = data, nvmax = 8, method = "forward")
## 8 Variables (and intercept)
## Forced in Forced out
## X1 FALSE FALSE
## X2 FALSE FALSE
## X3 FALSE FALSE
## X5 FALSE FALSE
## X6 FALSE FALSE
## X7 FALSE FALSE
## X8 FALSE FALSE
## X4 FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: forward
## X1 X2 X3 X4 X5 X6 X7 X8
## 1 ( 1 ) " " " " " " " " "*" " " " " " "
## 2 ( 1 ) " " " " " " " " "*" " " "*" " "
## 3 ( 1 ) " " " " "*" " " "*" " " "*" " "
## 4 ( 1 ) "*" " " "*" " " "*" " " "*" " "
## 5 ( 1 ) "*" "*" "*" " " "*" " " "*" " "
## 6 ( 1 ) "*" "*" "*" " " "*" " " "*" "*"
## 7 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*"
“*" denotes inclusion of variable in the model
Following are the models suggested by forward selection method:
backward <- regsubsets(Y1~., data=data, nvmax = 8, method = 'backward')
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 1 linear dependencies found
## Reordering variables and trying again:
## Warning in rval$lopt[] <- rval$vorder[rval$lopt]: number of items to
## replace is not a multiple of replacement length
summary(backward)
## Subset selection object
## Call: regsubsets.formula(Y1 ~ ., data = data, nvmax = 8, method = "backward")
## 8 Variables (and intercept)
## Forced in Forced out
## X1 FALSE FALSE
## X2 FALSE FALSE
## X3 FALSE FALSE
## X5 FALSE FALSE
## X6 FALSE FALSE
## X7 FALSE FALSE
## X8 FALSE FALSE
## X4 FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: backward
## X1 X2 X3 X4 X5 X6 X7 X8
## 1 ( 1 ) " " " " " " " " "*" " " " " " "
## 2 ( 1 ) " " " " " " " " "*" " " "*" " "
## 3 ( 1 ) " " " " "*" " " "*" " " "*" " "
## 4 ( 1 ) "*" " " "*" " " "*" " " "*" " "
## 5 ( 1 ) "*" "*" "*" " " "*" " " "*" " "
## 6 ( 1 ) "*" "*" "*" " " "*" " " "*" "*"
## 7 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*"
From the above we observe that forward selection and backward elimination suggest the same models in this case.
Let us also compare the coefficient values obtained from forward selection and backward elimination for two model sizes.
coef(forward, 7)
## (Intercept) X1 X2 X3 X6
## 261.37819945 -145.94368872 -0.26247768 0.13689032 -0.02332813
## X7 X8 X4
## 19.93268018 0.20377177 0.00000000
coef(backward, 7)
## (Intercept) X1 X2 X3 X6
## 261.37819945 -145.94368872 -0.26247768 0.13689032 -0.02332813
## X7 X8 X4
## 19.93268018 0.20377177 0.00000000
We thus observe that the same coefficients are obtained for the variables in both cases.
coef(forward, 4)
## (Intercept) X1 X3 X6 X8
## -78.36440358 71.17234220 0.14055608 -0.02332813 0.56839397
coef(backward, 4)
## (Intercept) X1 X3 X6 X8
## -78.36440358 71.17234220 0.14055608 -0.02332813 0.56839397
Here we fitted the best 4-variable model derived from forward selection and backward elimination. In both cases the same variables were chosen, and the coefficient values for the best 4-variable fitted model can be read off the output above.
Comparing R-squared, Residual Sum of Squares, Adjusted R-squared, Cp and BIC
We will compare the models on the basis of these criteria and, taking all of the scores into consideration, select the optimum regression model that gives good predictions with the minimum number of variables.
fit <- regsubsets(Y1~., data=data, nvmax=8)
## Warning in leaps.setup(x, y, wt = wt, nbest = nbest, nvmax = nvmax,
## force.in = force.in, : 1 linear dependencies found
## Reordering variables and trying again:
summary(fit)
## Subset selection object
## Call: regsubsets.formula(Y1 ~ ., data = data, nvmax = 8)
## 8 Variables (and intercept)
## Forced in Forced out
## X1 FALSE FALSE
## X2 FALSE FALSE
## X3 FALSE FALSE
## X5 FALSE FALSE
## X6 FALSE FALSE
## X7 FALSE FALSE
## X8 FALSE FALSE
## X4 FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: exhaustive
## X1 X2 X3 X4 X5 X6 X7 X8
## 1 ( 1 ) " " " " " " " " "*" " " " " " "
## 2 ( 1 ) " " " " " " " " "*" " " "*" " "
## 3 ( 1 ) " " " " "*" " " "*" " " "*" " "
## 4 ( 1 ) "*" " " " " "*" "*" " " "*" " "
## 5 ( 1 ) "*" " " " " "*" "*" " " "*" "*"
## 6 ( 1 ) "*" "*" "*" " " "*" " " "*" "*"
## 7 ( 1 ) "*" "*" "*" " " "*" "*" "*" "*"
The best model of each size, from one variable up to seven variables, is given by the summary above. We will now compare R-squared, Residual Sum of Squares, Adjusted R-squared, Cp and BIC for these 7 models and select the one with the optimum values, giving proper reasoning for the choice. The best model of each size is selected using the "exhaustive" search method.
Residual sum of squares for the 7 models, from the single-variable model to the 7-variable model:
summary(fit)$rss
## [1] 16313.989 10627.943 7038.364 6654.418 6581.283 6544.289 6543.766
R-squared for the 7 models, from the single-variable model to the 7-variable model:
summary(fit)$rsq
## [1] 0.7910869 0.8639011 0.9098684 0.9147851 0.9157217 0.9161954 0.9162021
Adjusted R-squared for the 7 models, from the single-variable model to the 7-variable model:
summary(fit)$adjr2
## [1] 0.7908142 0.8635453 0.9095145 0.9143384 0.9151686 0.9155346 0.9154303
Cp for the 7 models, from the single-variable model to the 7-variable model:
summary(fit)$cp
## [1] 1128.231082 470.716485 56.367557 13.834344 7.351511 5.060596
## [7] 7.000000
BIC for the 7 models, from the single-variable model to the 7-variable model:
summary(fit)$bic
## [1] -1189.275 -1511.747 -1821.605 -1858.042 -1859.885 -1857.571 -1850.988
Since the optimum model will be based on the intersection of all the selection criteria (RSS, R-squared, Adjusted R-squared, Cp and BIC), it is a good idea to plot all of them and identify the best model from the plots.
par(mfrow=c(3,2))
plot(summary(fit)$rss ,xlab="Number of Variables ",ylab="RSS", type="l")
plot(summary(fit)$rsq ,xlab="Number of Variables ",ylab="Rsq", type="l")
plot(summary(fit)$adjr2 ,xlab="Number of Variables ",ylab="adjr2", type="l")
plot(summary(fit)$cp ,xlab="Number of Variables ",ylab="cp", type="l")
plot(summary(fit)$bic ,xlab="Number of Variables ",ylab="bic", type="l")
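Rather than reading the optima off the curves alone, we can also ask directly which model size each criterion prefers; a minimal sketch, assuming the fit object from the regsubsets() call above:
# Model size preferred by each criterion
which.max(summary(fit)$adjr2)   # size with the largest adjusted R-squared
which.min(summary(fit)$cp)      # size with the smallest Mallows' Cp
which.min(summary(fit)$bic)     # size with the smallest BIC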
We should aim to minimize BIC, RSS and Cp and to maximize R-squared and adjusted R-squared to find the best model. Looking at the plots, most of the curves start to flatten when the number of variables reaches 4 and are almost constant from 5 to 7 variables.
This means that we can go with a 4-variable model to get good predictions, and to be on the safer side we can choose a 5-variable model. Increasing the number of variables above 5 yields little or no further improvement.
P.S. Whenever we speak of a 4-variable, 5-variable or, in general, n-variable model, we are referring to the best-fit model found in the above investigation.
The best-fit 4-variable model from the above investigation included the variables X1, X3, X5 and X7, and the 5-variable model included X1, X2, X3, X5 and X7.
The selection criteria for the 4-variable model are:
RSS: 6654.418, R-squared: 0.9147851, Adjusted R-squared: 0.9143384, Cp: 13.834344, BIC: -1858.042
The selection criteria for the 5-variable model are:
RSS: 6581.283, R-squared: 0.9157217, Adjusted R-squared: 0.9151686, Cp: 7.351511, BIC: -1859.885
Fit_4Variable <- lm(Y1 ~ X1+ X3+ X5+ X7)
coef(Fit_4Variable)
## (Intercept) X1 X3 X5 X7
## -11.95301879 -14.53244055 0.03497595 5.60675320 20.43789945
Fit_5Variable <- lm(Y1 ~ X1+ X2+ X3+ X5+ X7)
coef(Fit_5Variable)
## (Intercept) X1 X2 X3 X5
## 84.38757009 -64.77399149 -0.08729027 0.06081334 4.16993881
## X7
## 20.43789945
From the above, the equation for the 4-variable model would be:
Y1 = -11.95301879 -14.53244055X1 + 0.03497595X3 + 5.60675320X5 + 20.43789945X7
The equation for the 5-variable model:
Y1 = 84.38757009 -64.77399149X1 -0.08729027X2 +0.06081334X3 +4.16993881X5 +20.43789945X7
So far we have used the exhaustive search algorithm to find the best 1-variable to 7-variable models, and we used different criteria such as R-squared, adjusted R-squared, RSS, BIC and Cp to compare them. After a graphical analysis of these criteria we concluded that models with 4 or 5 variables would serve our purpose.
Above we have derived the equations for the 4-variable and 5-variable multiple regression models. The values of the selection criteria in the two cases differ very little. We can be quite confident of getting good results with the 4-variable equation, but to be on the safer side, and for a little more accuracy, we can go with the 5-variable model as well.
Finally, we will analyze which models perform best using the PRESS statistic. To do this we take the 12 models on which we applied the partial F-tests and see how the results change when the PRESS statistic is used to judge model performance. The lower the PRESS statistic, the better the model's predictive performance.
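For a linear model, PRESS is the sum of squared leave-one-out prediction errors and can be computed from the ordinary residuals and leverages as PRESS = sum over i of (e_i / (1 - h_i))^2. The sketch below defines a small helper (press_stat is an illustrative name, not part of the MPV package) so the values reported next can be reproduced without the package.
# Leave-one-out PRESS statistic from residuals and hat values
press_stat <- function(fit) sum((residuals(fit) / (1 - hatvalues(fit)))^2)
press_stat(model)   # should agree with PRESS(model) reported further below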
library(MPV)
##
## Attaching package: 'MPV'
## The following object is masked from 'package:datasets':
##
## stackloss
model1_reduced <- lm(Y1 ~ X2+ X3+ X4 +X5+ X6+ X7)
PRESS(model1_reduced)
## [1] 7074.147
model2_reduced <- lm(Y1 ~ X1+ X3+ X4 +X5+ X7+ X8)
PRESS(model2_reduced)
## [1] 6660.489
model3_reduced <- lm(Y1 ~ X1+ X2+ X3+ X6+ X7+ X8)
PRESS(model3_reduced)
## [1] 7996.401
model4_reduced <- lm(Y1 ~ X1+ X2+ X3 +X5+ X6 + X8)
PRESS(model4_reduced)
## [1] 11939.6
model5_reduced <- lm(Y1 ~ X1+ X2+ X4+ X6+ X8)
PRESS(model5_reduced)
## [1] 13237.56
model6_reduced <- lm(Y1 ~ X1+ X2+ X3 +X4+ X5)
PRESS(model6_reduced)
## [1] 12470.86
model7_reduced <- lm(Y1 ~ X1+ X3 +X4+ X5+ X6+ X8)
PRESS(model7_reduced)
## [1] 11939.6
model8_reduced <- lm(Y1 ~ X1+ X2+ X3+ X5+ X6+ X7)
PRESS(model8_reduced)
## [1] 6733.614
model9_reduced <- lm(Y1 ~ X3 +X4+ X5+ X6+ X7+ X8)
PRESS(model9_reduced)
## [1] 7019.094
model10_reduced <- lm(Y1 ~ X1+ X2+ X5+ X6+ X7+ X8)
PRESS(model10_reduced)
## [1] 7391.987
model11_reduced <- lm(Y1 ~ X1+ X2+ X3+ X4+ X7+ X8)
PRESS(model11_reduced)
## [1] 7976.199
model12_reduced <- lm(Y1 ~ X1+ X3+ X4+ X8)
PRESS(model12_reduced)
## [1] 13203.73
Here we use MODEL 1, i.e. the full model, for calculating the PRESS statistic.
PRESS(model)
## [1] 6677.437
model14_reduced <- lm(Y1 ~ X1+ X4+ X5+ X7)
PRESS(model14_reduced)
## [1] 6747.529
model15_reduced <- lm(Y1 ~ X1+ X4+ X5+ X7+ X8)
PRESS(model15_reduced)
## [1] 6691.799
model16_reduced <- lm(Y1 ~ X1+ X2+ X3+ X5+ X7+ X8)
PRESS(model16_reduced)
## [1] 6660.489
model17_reduced <- lm(Y1 ~ X1+ X2+ X3+ X5+ X6+ X7+ X8)
PRESS(model17_reduced)
## [1] 6677.437
From the PRESS statistics for the different cases, we see that the lowest values are around 6600 while the highest extend up to about 13000. Models with a PRESS statistic in the 6600-7000 range are quite effective.
Another interesting thing to note is that all the models selected through the procedural analysis have fairly good PRESS statistics, which validates the procedural analysis done earlier.
Keeping in mind that the optimum range of the PRESS statistic here is roughly 6600-6800, the results above suggest it is still a good idea to go with the 4-variable or 5-variable model obtained from the procedural analysis, as they offer a good mix of accuracy and the minimum number of variables required for good predictions.
\[ \pagebreak \]
We have performed a rigorous multivariate regression analysis on the selected dataset. To summarize, the following are the final models we arrived at:
The first is the 4-variable model:
Y1 = -11.95301879 -14.53244055X1 + 0.03497595X3 + 5.60675320X5 + 20.43789945X7
The second is the 5-variable model:
Y1 = 84.38757009 -64.77399149X1 -0.08729027X2 +0.06081334X3 +4.16993881X5 +20.43789945X7