1 INTRODUCTION
The motivating data come from the UCI Machine Learning Repository and describe the energy consumption of the DAEWOO steel plant in Gwangyang, South Korea, recorded at 15-minute intervals throughout 2018. The “Steel” dataset consists of the following variables:
X_1: Lagging current reactive power
X_2: Leading current reactive power
X_3: CO2 in ppm
X_4: Lagging current power factor
X_5: Leading current power factor
X_6: Number of Seconds from Midnight
X_7: Status of Days (0 denotes weekdays and 1 denotes weekends)
X_8: Load Type (1 denotes low load, 2 denotes medium load, 3 denotes high load)
Y: Energy Consumption
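A minimal loading sketch (the file name below is an assumption; the raw UCI file uses longer, descriptive column names that we assume have already been renamed to Y and X1 through X8):
# Hypothetical file name; columns assumed renamed to Y, X1, ..., X8 as above
library(readr)
Steel <- read_csv("steel_energy_2018.csv")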
head(Steel, 10)
# A tibble: 10 × 9
Y X1 X2 X3 X4 X5 X6 X7 X8
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3.17 2.95 0 0 73.2 100 900 0 1
2 4 4.46 0 0 66.8 100 1800 0 1
3 3.24 3.28 0 0 70.3 100 2700 0 1
4 3.31 3.56 0 0 68.1 100 3600 0 1
5 3.82 4.5 0 0 64.7 100 4500 0 1
6 3.28 3.56 0 0 67.8 100 5400 0 1
7 3.6 4.14 0 0 65.6 100 6300 0 1
8 3.6 4.28 0 0 64.4 100 7200 0 1
9 3.28 3.64 0 0 66.9 100 8100 0 1
10 3.78 4.72 0 0 62.5 100 9000 0 1
2 PCA
2.1 Data Exploration
summary(Steel)
 Y X1 X2 X3
Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. :0.00000
1st Qu.: 3.20 1st Qu.: 2.30 1st Qu.: 0.000 1st Qu.:0.00000
Median : 4.57 Median : 5.00 Median : 0.000 Median :0.00000
Mean : 27.39 Mean :13.04 Mean : 3.871 Mean :0.01152
3rd Qu.: 51.24 3rd Qu.:22.64 3rd Qu.: 2.090 3rd Qu.:0.02000
Max. :157.18 Max. :96.91 Max. :27.760 Max. :0.07000
X4 X5 X6 X7
Min. : 0.00 Min. : 0.00 Min. : 0 Min. :0.0000
1st Qu.: 63.32 1st Qu.: 99.70 1st Qu.:21375 1st Qu.:0.0000
Median : 87.96 Median :100.00 Median :42750 Median :0.0000
Mean : 80.58 Mean : 84.37 Mean :42750 Mean :0.2849
3rd Qu.: 99.02 3rd Qu.:100.00 3rd Qu.:64125 3rd Qu.:1.0000
Max. :100.00 Max. :100.00 Max. :85500 Max. :1.0000
X8
Min. :1.000
1st Qu.:1.000
Median :1.000
Mean :1.692
3rd Qu.:2.000
Max. :3.000
corrplot::corrplot(cor(Steel), method = "number", type = "upper")
The correlation plot above shows a very high correlation between X_1 and X_3, and likewise between X_2 and X_5. High correlation among the predictor variables can cause multicollinearity problems in a regression model, so we apply Principal Component Analysis (PCA) to remove the high correlation and condense the 8 predictor variables.
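As an optional cross-check (not part of the original analysis, and assuming the car package is installed), variance inflation factors from an ordinary least-squares fit on the raw predictors quantify the same problem:
# Variance inflation factors on the raw predictors; values well above
# 5-10 confirm the multicollinearity visible in the correlation plot.
library(car)
vif(lm(Y ~ ., data = Steel))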
2.2 PCA Result
pca = prcomp(x = Steel[,-1], scale = TRUE, center = TRUE)
For the exploratory plots we use the PCA function from the FactoMineR package and plot the results with the factoextra package.
pca.dat = PCA(Steel[,-1],scale.unit = TRUE, graph = FALSE)
fviz_pca_var(pca.dat, col.var = 'cos2', gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), axes = c(1,2), repel = TRUE)
The individuals factor map is not useful here because of the large number of observations: plotting the individuals would be unreadable, so we skip it and show only the variables map.
The variables map shows how the variables move together: arrows pointing in roughly the same direction indicate variables that increase together. For example, the greater the lagging current reactive power (X1), the greater the CO2 ppm (X3), and indeed their two arrows point in almost the same direction.
The colour gradient encodes each variable’s squared cosine (cos2), which measures how well the variable is represented by the principal components. Status of Days (X7) has the lowest representation in the PCs, while the leading current reactive power (X2), CO2 ppm (X3), and leading current power factor (X5) have the highest.
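If a numeric view is preferred, the squared cosines behind the colour gradient can also be printed from the FactoMineR result:
# Squared cosines of each variable on the first two components
# (higher values = better representation on that component).
round(pca.dat$var$cos2[, 1:2], 3)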
Now let’s explore each variable’s contribution to the first two PCs:
a<-fviz_contrib(pca.dat, choice = "var", axes = 1)
b<-fviz_contrib(pca.dat, choice = "var", axes = 2)
grid.arrange(a, b, ncol = 2, top = 'Contribution of the variables to the first two PCs')
In the first PC, X_1, X_5, X_3, and X_2 are the most influential variables; in the second PC, X_4, X_6, and X_8 contribute the most.
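The percentage contributions behind these bar charts can likewise be printed directly from the PCA result:
# Percentage contribution of each variable to the first two components.
round(pca.dat$var$contrib[, 1:2], 1)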
2.3 Choosing Number of PCs
Let’s take a look at the constructed PCs:
pca.eig = as.data.frame(pca.dat$eig)
knitr::kable(pca.eig)

|        | eigenvalue | percentage of variance | cumulative percentage of variance |
|--------|------------|------------------------|-----------------------------------|
| comp 1 | 3.0477410 | 38.0967627 | 38.09676 |
| comp 2 | 2.7626811 | 34.5335137 | 72.63028 |
| comp 3 | 0.8149288 | 10.1866098 | 82.81689 |
| comp 4 | 0.6016050 | 7.5200627 | 90.33695 |
| comp 5 | 0.4011947 | 5.0149336 | 95.35188 |
| comp 6 | 0.2616521 | 3.2706515 | 98.62253 |
| comp 7 | 0.0667368 | 0.8342105 | 99.45674 |
| comp 8 | 0.0434604 | 0.5432556 | 100.00000 |
We also create a scree plot for a better visualization:
fviz_eig(pca.dat)
The criteria for choosing the number of PCs vary across the literature, but a common rule of thumb is to keep components as long as:
The Cumulative Percentage Variance is more than 80% OR
The Eigenvalue is more than 1
Only components 1 and 2 have eigenvalues greater than 1, and together they explain around 72% of the variance. Based on the eigenvalue criterion, we keep 2 Principal Components.
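The same figures can be verified on the prcomp object defined earlier:
# Eigenvalues and cumulative variance from the prcomp fit;
# these should match the FactoMineR table above.
eig <- pca$sdev^2
data.frame(eigenvalue = eig,
           cum.variance = round(100 * cumsum(eig) / sum(eig), 2))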
2.4 PCA Model
The FactoMineR PCA function does not directly return the component scores in the form we need for regression, so we take them from the prcomp object instead. We define the PCA model as follows:
PC = pca$x[,1:2]
pca$rotation[,1:2]
           PC1          PC2
X1 0.47320214 -0.18523244
X2 -0.44322883 -0.31825860
X3 0.45785030 -0.29844144
X4 -0.06831445 -0.53689361
X5 0.45941515 0.29941890
X6 -0.04861139 -0.46496826
X7 -0.29275743 0.01411026
X8 0.25729735 -0.42547868
The two PC equations can be written (in terms of the standardized predictors):
PC_1 = .47X_1 - .44X_2 + .46X_3 - .07X_4 + .46X_5 - .05X_6 - .29X_7 + .26X_8
PC_2 = -.19X_1 - .32X_2 - .30X_3 - .54X_4 + .30X_5 - .46X_6 + .01X_7 - .43X_8
These equations will be important when we extract the regression model in terms of the original predictors later.
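As a quick sanity check (a minimal sketch, assuming the Steel data and the prcomp object pca from above), the scores in pca$x are exactly the standardized predictors multiplied by the loading matrix, which is what the two equations express:
# prcomp scores = standardized predictors %*% loadings
Z <- scale(Steel[,-1])                      # standardized X1, ..., X8
all.equal(as.numeric(pca$x),
          as.numeric(Z %*% pca$rotation))   # should be TRUE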
3 LINEAR MODEL
3.1 Building the Model
We will use the lm function to build the linear regression model with the two PCs as predictors.
Y = Steel$Y
reg = lm(Y~PC)
summary(reg)
Call:
lm(formula = Y ~ PC)
Residuals:
Min 1Q Median 3Q Max
-37.997 -6.256 -1.323 7.059 82.807
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.38689 0.06051 452.6 <2e-16 ***
PCPC1 15.25102 0.03466 440.0 <2e-16 ***
PCPC2 -10.09135 0.03641 -277.2 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.33 on 35037 degrees of freedom
Multiple R-squared: 0.8853, Adjusted R-squared: 0.8853
F-statistic: 1.352e+05 on 2 and 35037 DF, p-value: < 2.2e-16
Finally, we have a regression model with the PCs as predictor variables. Next, we plot the residuals and check the model’s assumptions.
3.2 Assumptions Checking
par(mfrow=c(2,2))
plot(reg)
Normality (Anderson-Darling test):
H_0: The residuals follow the normal distribution vs.
H_1: The residuals DO NOT follow the normal distribution
nortest::ad.test(reg$residuals)
Anderson-Darling normality test
data: reg$residuals
A = 222.64, p-value < 2.2e-16
Homoscedasticity (Breusch-Pagan test):
H_0 : \sigma_1^2 = \sigma_2^2 (the residual variance is constant) vs.
H_1 : \sigma_1^2 \ne \sigma_2^2 (the residual variance is not constant)
lmtest::bptest(reg)
studentized Breusch-Pagan test
data: reg
BP = 4221.5, df = 2, p-value < 2.2e-16
Autocorrelation (Durbin-Watson test):
lmtest::dwtest(reg)
Durbin-Watson test
data: reg
DW = 0.33855, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0
All three tests reject their null hypotheses at any reasonable significance level: the normality and homoscedasticity assumptions are violated, and the residuals are positively autocorrelated. The autocorrelation is unsurprising because the data are a time series, observed repeatedly throughout the full year.
3.3 Extracting the Linear Model
beta.Z = as.matrix(reg$coefficients[2:3])  # estimated coefficients on PC1 and PC2
V = as.matrix(pca$rotation[,1:2])          # loading matrix of the first two PCs
beta.X = V %*% beta.Z                      # coefficients on the standardized predictors
beta.X
            [,1]
X1 9.086060
X2 -3.548034
X3 9.994360
X4 4.376114
X5 3.985010
X6 3.950782
X7 -4.607241
X8 8.217700
The resulting regression model, written in terms of the standardized predictors (with the intercept from the summary above), is:
Y = 27.39 + 9.09X_1 - 3.55X_2 + 9.99X_3 + 4.38X_4 + 3.99X_5 + 3.95X_6 - 4.61X_7 + 8.22X_8
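To confirm the back-transformation, a minimal check (using the objects defined above) compares the fitted values implied by this equation with those from the lm fit:
# Since PC = Z %*% V, the fitted values b0 + Z %*% beta.X must equal
# the fitted values of the regression on the two PCs.
Z <- scale(Steel[,-1])
yhat <- as.numeric(reg$coefficients[1] + Z %*% beta.X)
all.equal(yhat, unname(fitted(reg)))        # should be TRUE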
4 CONCLUSION
Principal Component Analysis took care of the multicollinearity and the high correlation between the independent variables: the principal components are orthogonal by construction. PCA is a useful preprocessing tool for intermediate modelling such as regression, discriminant analysis, and clustering.
The scope of this article focused on the PCA; the assumption violations in the linear regression model have not been addressed here.