Data Loading

library(xlsx)
data <- read.xlsx("./data/All Greens Franchise.xls",1)
str(data)
## 'data.frame':    27 obs. of  6 variables:
##  $ X1: num  231 156 10 519 437 487 299 195 20 68 ...
##  $ X2: num  3 2.2 0.5 5.5 4.4 ...
##  $ X3: num  294 232 149 600 567 571 512 347 212 102 ...
##  $ X4: num  8.2 6.9 3 12 10.6 ...
##  $ X5: num  8.2 4.1 4.3 16.1 14.1 ...
##  $ X6: num  11 12 15 1 5 4 10 12 15 8 ...
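
As an aside (not used in the rest of the analysis), the readxl package can read the same file without the Java dependency that xlsx requires; read_excel returns a tibble, so it is coerced back to a plain data frame here. The object name data_alt is only illustrative.

# alternative loader (illustrative; data_alt is a hypothetical name, not used below)
library(readxl)
data_alt <- as.data.frame(read_excel("./data/All Greens Franchise.xls", sheet = 1))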

Data Exploration

library(GGally)
ggpairs(data)

We notice a very strong correlation between the dependent variable X1 and each of the independent variables.

There could also be a problem of MULTICOLLINEARITY, as the independent variables are highly correlated with each other. It makes little sense to use all of these highly correlated variables in the regression model: doing so leads to unstable estimates, because the variances of the regression coefficients are inflated.
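
As a quick, optional check (not part of the original output), the pairwise correlations can be printed directly; values close to 1 among the predictors X2-X6 confirm the pattern seen in the ggpairs plot.

# correlation matrix of all variables, rounded for readability
round(cor(data), 2)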

Regression Analysis

First, we fit the full model with all the predictors, as follows:

fit1 <- lm(X1 ~ ., data= data)
summary(fit1)
## 
## Call:
## lm(formula = X1 ~ ., data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.338  -9.699  -4.496   4.040  41.139 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -18.85941   30.15023  -0.626 0.538372    
## X2           16.20157    3.54444   4.571 0.000166 ***
## X3            0.17464    0.05761   3.032 0.006347 ** 
## X4           11.52627    2.53210   4.552 0.000174 ***
## X5           13.58031    1.77046   7.671 1.61e-07 ***
## X6           -5.31097    1.70543  -3.114 0.005249 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.65 on 21 degrees of freedom
## Multiple R-squared:  0.9932, Adjusted R-squared:  0.9916 
## F-statistic: 611.6 on 5 and 21 DF,  p-value: < 2.2e-16

All the variables are significant, and both the Multiple R-squared and the Adjusted R-squared are above 99%. But we still have to check whether the model suffers from multicollinearity.

library(car)
## Warning: package 'car' was built under R version 3.3.3
vif(fit1)
##        X2        X3        X4        X5        X6 
##  4.240914 10.122480  7.624391  6.912318  5.818768
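
For intuition, the VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. A minimal sketch for X3, which should reproduce the value reported by vif(fit1):

# regress X3 on the remaining predictors and convert its R-squared to a VIF
r2_x3 <- summary(lm(X3 ~ X2 + X4 + X5 + X6, data = data))$r.squared
1 / (1 - r2_x3)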

The VIF value for X3 is greater than 10, so let’s drop it from the model and rebuild.

fit2 <- lm(X1 ~ .-X3, data= data)
summary(fit2)
## 
## Call:
## lm(formula = X1 ~ . - X3, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.422 -12.858  -6.477  16.160  45.255 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -39.460     34.411  -1.147   0.2638    
## X2            20.444      3.815   5.359 2.22e-05 ***
## X4            16.966      2.093   8.107 4.73e-08 ***
## X5            15.673      1.910   8.206 3.86e-08 ***
## X6            -4.043      1.937  -2.088   0.0486 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.68 on 22 degrees of freedom
## Multiple R-squared:  0.9902, Adjusted R-squared:  0.9884 
## F-statistic: 555.4 on 4 and 22 DF,  p-value: < 2.2e-16
vif(fit2)
##       X2       X4       X5       X6 
## 3.579850 3.795323 5.861520 5.468943

We are good now, as the VIF for each of the remaining variables is below 10.
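
As an optional comparison (not in the original write-up), the two fits can also be contrasted directly: anova() gives a partial F-test for X3 given the other predictors, and AIC() compares the overall fit of the two models.

# nested-model F test and information criteria for the full and reduced fits
anova(fit2, fit1)
AIC(fit1, fit2)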

Let’s now use PCA and see how it can help us resolve the problem of multicollinearity.

prin_comp <- prcomp(data[,-1], scale. = T)
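
Because the predictors are on very different scales, scale. = T standardizes them before the decomposition. The parts of the prcomp object used below are the loadings (rotation), the component scores (x) and the component standard deviations (sdev); a quick look at the standardization parameters and the score matrix:

# column means and standard deviations used to standardize the predictors
prin_comp$center
prin_comp$scale
# score matrix: one row per observation, one column per principal component
dim(prin_comp$x)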

Let’s plot the resultant principal components.

biplot(prin_comp, scale=0)

From the biplot, we notice that the first principal component is aligned mostly with X3, X5 and X6, while the second principal component is driven mainly by X2 and X4. Let’s look at the loadings to confirm.

prin_comp$rotation
##           PC1         PC2        PC3        PC4        PC5
## X2  0.4346055  0.72351531 -0.2173219  0.4731253 -0.1287140
## X3  0.4587593 -0.07708152 -0.4766292 -0.3421379  0.6628458
## X4  0.4451881 -0.58485406 -0.3612434  0.1704407 -0.5479110
## X5  0.4529839  0.22908597  0.3987150 -0.6786263 -0.3504538
## X6 -0.4441522  0.27577053 -0.6603979 -0.4117164 -0.3479134

Let’s now find out how much of the variance each component explains.

# std. deviation
std_dev <- prin_comp$sdev
#variance
pr_var <- std_dev^2
#proportion of variance explained
prop_varex <- pr_var/sum(pr_var)
prop_varex
## [1] 0.86266701 0.05569619 0.04602583 0.02479025 0.01082073
#scree plot
plot(prop_varex, xlab = "Principal Component",
             ylab = "Proportion of Variance Explained",
             type = "b")
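
Equivalently, the cumulative proportion makes the cut-off explicit (summary(prin_comp) reports the same figures):

# cumulative proportion of variance explained by the first k components
cumsum(prop_varex)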

As we can see, the first component explains about 86% of the variance, while each of the remaining components explains less than 6%. We will therefore use only the first component to build the regression model:

data1 <- data.frame(data[,1], prin_comp$x[,1])
colnames(data1) <- c("X1", "PC1")
fit3 <- lm(X1 ~ ., data=data1)
summary(fit3)
## 
## Call:
## lm(formula = X1 ~ ., data = data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.168 -15.059  -3.809  11.915  47.944 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  286.574      3.774   75.94   <2e-16 ***
## PC1           92.013      1.852   49.69   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.61 on 25 degrees of freedom
## Multiple R-squared:   0.99,  Adjusted R-squared:  0.9896 
## F-statistic:  2469 on 1 and 25 DF,  p-value: < 2.2e-16
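
To use fit3 for a new store, the new predictor values must first be projected onto PC1 with the same centering, scaling and rotation learned by prcomp, and only then passed to the regression. A minimal sketch, reusing the first row of the data as a stand-in for a new observation:

# project new predictor values onto the principal components,
# then predict X1 from the first component only
new_obs <- data[1, -1]
pc_scores <- predict(prin_comp, newdata = new_obs)
predict(fit3, newdata = data.frame(PC1 = pc_scores[, "PC1"]))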