library(xlsx)
data <- read.xlsx("./data/All Greens Franchise.xls",1)
str(data)
## 'data.frame': 27 obs. of 6 variables:
## $ X1: num 231 156 10 519 437 487 299 195 20 68 ...
## $ X2: num 3 2.2 0.5 5.5 4.4 ...
## $ X3: num 294 232 149 600 567 571 512 347 212 102 ...
## $ X4: num 8.2 6.9 3 12 10.6 ...
## $ X5: num 8.2 4.1 4.3 16.1 14.1 ...
## $ X6: num 11 12 15 1 5 4 10 12 15 8 ...
library(GGally)
ggpairs(data)
We notice a very strong correlation between the dependent variable X1 and each of the independent variables.
Also, there could be a problem of MULTICOLLINEARITY, as the independent variables are highly correlated with each other. It makes little sense to use all of these highly correlated variables in the regression model: doing so leads to unstable estimates because the variances of the regression coefficients are inflated.
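As a quick numeric check of what the ggpairs panel shows graphically, we could also print the correlation matrix of the predictors (a minimal sketch; output not shown here):
# pairwise correlations among the independent variables X2-X6
round(cor(data[,-1]), 2)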
First, we fit the full model as follows:
fit1 <- lm(X1 ~ ., data= data)
summary(fit1)
##
## Call:
## lm(formula = X1 ~ ., data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.338 -9.699 -4.496 4.040 41.139
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -18.85941 30.15023 -0.626 0.538372
## X2 16.20157 3.54444 4.571 0.000166 ***
## X3 0.17464 0.05761 3.032 0.006347 **
## X4 11.52627 2.53210 4.552 0.000174 ***
## X5 13.58031 1.77046 7.671 1.61e-07 ***
## X6 -5.31097 1.70543 -3.114 0.005249 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.65 on 21 degrees of freedom
## Multiple R-squared: 0.9932, Adjusted R-squared: 0.9916
## F-statistic: 611.6 on 5 and 21 DF, p-value: < 2.2e-16
All the variables are significant and the Multiple R-squared/Adjusted R-squared is above 99%. But we still have to check whether there is a multicollinearity problem in the model.
library(car)
vif(fit1)
## X2 X3 X4 X5 X6
## 4.240914 10.122480 7.624391 6.912318 5.818768
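As a reminder, the VIF of a predictor is 1/(1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. As a quick sketch, we could reproduce the value for X3 by hand:
# R^2 from regressing X3 on the remaining predictors
r2_x3 <- summary(lm(X3 ~ X2 + X4 + X5 + X6, data = data))$r.squared
1 / (1 - r2_x3)  # should match vif(fit1) for X3, roughly 10.1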
The VIF value for X3 is greater than 10, so let’s drop it from the model and build the model again.
fit2 <- lm(X1 ~ .-X3, data= data)
summary(fit2)
##
## Call:
## lm(formula = X1 ~ . - X3, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.422 -12.858 -6.477 16.160 45.255
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.460 34.411 -1.147 0.2638
## X2 20.444 3.815 5.359 2.22e-05 ***
## X4 16.966 2.093 8.107 4.73e-08 ***
## X5 15.673 1.910 8.206 3.86e-08 ***
## X6 -4.043 1.937 -2.088 0.0486 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.68 on 22 degrees of freedom
## Multiple R-squared: 0.9902, Adjusted R-squared: 0.9884
## F-statistic: 555.4 on 4 and 22 DF, p-value: < 2.2e-16
vif(fit2)
## X2 X4 X5 X6
## 3.579850 3.795323 5.861520 5.468943
We are good now, as the VIF for each of the remaining variables is less than 10.
Let’s now use PCA and see how it can help us resolve the problem of multicollinearity.
prin_comp <- prcomp(data[,-1], scale. = T)
Let’s plot the resultant principal components.
biplot(prin_comp, scale=0)
We notice that the first principal component corresponds to the measure of X3, X5 and X6, while the second principal component corresponds to X2 and X4. Let’s look at the loadings to confirm.
prin_comp$rotation
## PC1 PC2 PC3 PC4 PC5
## X2 0.4346055 0.72351531 -0.2173219 0.4731253 -0.1287140
## X3 0.4587593 -0.07708152 -0.4766292 -0.3421379 0.6628458
## X4 0.4451881 -0.58485406 -0.3612434 0.1704407 -0.5479110
## X5 0.4529839 0.22908597 0.3987150 -0.6786263 -0.3504538
## X6 -0.4441522 0.27577053 -0.6603979 -0.4117164 -0.3479134
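The scores stored in prin_comp$x are simply these loadings applied to the standardized predictors. A minimal check, using the same data object as above:
# PC1 scores = standardized predictors times the first loading vector
head(scale(data[,-1]) %*% prin_comp$rotation[, 1])
head(prin_comp$x[, 1])  # the two should match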
Let’s now find the components that explain the maximum variance.
# std. deviation
std_dev <- prin_comp$sdev
#variance
pr_var <- std_dev^2
#proportion of variance explained
prop_varex <- pr_var/sum(pr_var)
prop_varex
## [1] 0.86266701 0.05569619 0.04602583 0.02479025 0.01082073
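We can also look at the cumulative proportion of variance explained (output not shown here):
# cumulative variance explained by the first k components
cumsum(prop_varex)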
#scree plot
plot(prop_varex, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     type = "b")
As we notice, the first component explains about 86% of the variance, while from the second component onwards each component explains less than 6%. We will therefore use only the first component to build the regression model:
data1 <- data.frame(data[,1], prin_comp$x[,1])
colnames(data1) <- c("X1", "PC1")
fit3 <- lm(X1 ~ ., data=data1)
summary(fit3)
##
## Call:
## lm(formula = X1 ~ ., data = data1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.168 -15.059 -3.809 11.915 47.944
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 286.574 3.774 75.94 <2e-16 ***
## PC1 92.013 1.852 49.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.61 on 25 degrees of freedom
## Multiple R-squared: 0.99, Adjusted R-squared: 0.9896
## F-statistic: 2469 on 1 and 25 DF, p-value: < 2.2e-16
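To use fit3 for prediction, a new observation has to be projected onto the same principal component first, which predict() on the prcomp object does for us. A minimal sketch; the predictor values below are made up purely for illustration:
# hypothetical new observation (X2-X6 values invented for illustration)
new_obs <- data.frame(X2 = 3, X3 = 300, X4 = 8, X5 = 9, X6 = 10)
# project onto the fitted principal components (uses the stored centering and scaling)
new_pc <- predict(prin_comp, newdata = new_obs)
# predict X1 from the first component only
predict(fit3, newdata = data.frame(PC1 = new_pc[, "PC1"]))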