1 INTRODUCTION
The motivating data come from the UCI Machine Learning Repository and describe the energy consumption of the DAEWOO steel plant in Gwangyang, South Korea, recorded at 15-minute intervals throughout 2018. The “Steel” dataset consists of the following variables:
X_1: Lagging current reactive power
X_2: Leading current reactive power
X_3: CO2 in ppm
X_4: Lagging current power factor
X_5: Leading current power factor
X_6: Number of Seconds from Midnight
X_7: Status of Days (0 denotes weekdays and 1 denotes weekends)
X_8: Load Type (1 denotes low load, 2 denotes medium load, 3 denotes high load)
Y: Energy Consumption
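A minimal loading sketch (the file name below is an assumption; the raw UCI file uses longer, descriptive column names that we assume have already been renamed to Y and X1 through X8):
# Hypothetical file name; columns assumed renamed to Y, X1, ..., X8 as above
library(readr)
Steel <- read_csv("steel_energy_2018.csv")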
head(Steel, 10)
# A tibble: 10 × 9
Y X1 X2 X3 X4 X5 X6 X7 X8
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 3.17 2.95 0 0 73.2 100 900 0 1
2 4 4.46 0 0 66.8 100 1800 0 1
3 3.24 3.28 0 0 70.3 100 2700 0 1
4 3.31 3.56 0 0 68.1 100 3600 0 1
5 3.82 4.5 0 0 64.7 100 4500 0 1
6 3.28 3.56 0 0 67.8 100 5400 0 1
7 3.6 4.14 0 0 65.6 100 6300 0 1
8 3.6 4.28 0 0 64.4 100 7200 0 1
9 3.28 3.64 0 0 66.9 100 8100 0 1
10 3.78 4.72 0 0 62.5 100 9000 0 1
2 PCA
2.1 Data Exploration
summary(Steel)
 Y X1 X2 X3
Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. :0.00000
1st Qu.: 3.20 1st Qu.: 2.30 1st Qu.: 0.000 1st Qu.:0.00000
Median : 4.57 Median : 5.00 Median : 0.000 Median :0.00000
Mean : 27.39 Mean :13.04 Mean : 3.871 Mean :0.01152
3rd Qu.: 51.24 3rd Qu.:22.64 3rd Qu.: 2.090 3rd Qu.:0.02000
Max. :157.18 Max. :96.91 Max. :27.760 Max. :0.07000
X4 X5 X6 X7
Min. : 0.00 Min. : 0.00 Min. : 0 Min. :0.0000
1st Qu.: 63.32 1st Qu.: 99.70 1st Qu.:21375 1st Qu.:0.0000
Median : 87.96 Median :100.00 Median :42750 Median :0.0000
Mean : 80.58 Mean : 84.37 Mean :42750 Mean :0.2849
3rd Qu.: 99.02 3rd Qu.:100.00 3rd Qu.:64125 3rd Qu.:1.0000
Max. :100.00 Max. :100.00 Max. :85500 Max. :1.0000
X8
Min. :1.000
1st Qu.:1.000
Median :1.000
Mean :1.692
3rd Qu.:2.000
Max. :3.000
corrplot::corrplot(cor(Steel), method = "number", type = "upper")
The correlation plot above shows a very high correlation between X_1 and X_3, and likewise between X_2 and X_5. High correlation among the predictor variables can cause multicollinearity problems in a regression model, so we apply Principal Component Analysis (PCA) to remove the high correlation and condense the 8 predictor variables.
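As an optional cross-check (not part of the original analysis, and assuming the car package is installed), variance inflation factors from an ordinary least-squares fit on the raw predictors quantify the same problem:
# Variance inflation factors on the raw predictors; values well above
# 5-10 confirm the multicollinearity visible in the correlation plot.
library(car)
vif(lm(Y ~ ., data = Steel))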
2.2 PCA Result
pca = prcomp(x = Steel[,-1], scale = TRUE, center = TRUE)
For the exploratory plots we use the PCA function from the FactoMineR package and plot the results with the factoextra package.
pca.dat = PCA(Steel[,-1],scale.unit = TRUE, graph = FALSE)
fviz_pca_var(pca.dat, col.var = 'cos2', gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), axes = c(1,2), repel = TRUE)
The individuals factor map is not useful here because of the large number of observations: plotting the individuals would be unreadable, so we skip it and show only the variables map.
The variables map shows how the variables move together: arrows pointing in roughly the same direction indicate variables that increase together. For example, the greater the lagging current reactive power (X1), the greater the CO2 ppm (X3), and indeed their two arrows point in almost the same direction.
The colour gradient encodes each variable’s squared cosine (cos2), which measures how well the variable is represented by the principal components. Status of Days (X7) has the lowest representation in the PCs, while the leading current reactive power (X2), CO2 ppm (X3), and leading current power factor (X5) have the highest.
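If a numeric view is preferred, the squared cosines behind the colour gradient can also be printed from the FactoMineR result:
# Squared cosines of each variable on the first two components
# (higher values = better representation on that component).
round(pca.dat$var$cos2[, 1:2], 3)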
Now let’s explore each variable’s contribution to the first two PCs:
a<-fviz_contrib(pca.dat, choice = "var", axes = 1)
b<-fviz_contrib(pca.dat, choice = "var", axes = 2)
grid.arrange(a, b, ncol = 2, top = 'Contribution of the variables to the first two PCs')
In the first PC, X_1, X_5, X_3, and X_2 are the most influential variables; in the second PC, X_4, X_6, and X_8 contribute the most.
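The percentage contributions behind these bar charts can likewise be printed directly from the PCA result:
# Percentage contribution of each variable to the first two components.
round(pca.dat$var$contrib[, 1:2], 1)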
2.3 Choosing Number of PCs
Let’s take a look at the constructed PCs:
pca.eig = as.data.frame(pca.dat$eig)
knitr::kable(pca.eig)

|        | eigenvalue | percentage of variance | cumulative percentage of variance |
|--------|------------|------------------------|-----------------------------------|
| comp 1 | 3.0477410 | 38.0967627 | 38.09676 |
| comp 2 | 2.7626811 | 34.5335137 | 72.63028 |
| comp 3 | 0.8149288 | 10.1866098 | 82.81689 |
| comp 4 | 0.6016050 | 7.5200627 | 90.33695 |
| comp 5 | 0.4011947 | 5.0149336 | 95.35188 |
| comp 6 | 0.2616521 | 3.2706515 | 98.62253 |
| comp 7 | 0.0667368 | 0.8342105 | 99.45674 |
| comp 8 | 0.0434604 | 0.5432556 | 100.00000 |
We also create a scree plot for a better visualization:
fviz_eig(pca.dat)
The criteria for choosing the number of PCs vary across the literature, but a common rule of thumb is to keep components as long as:
The Cumulative Percentage Variance is more than 80% OR
The Eigenvalue is more than 1
Only components 1 and 2 have eigenvalues greater than 1, and together they explain around 72% of the variance. Based on the eigenvalue criterion, we keep 2 Principal Components.
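The same figures can be verified on the prcomp object defined earlier:
# Eigenvalues and cumulative variance from the prcomp fit;
# these should match the FactoMineR table above.
eig <- pca$sdev^2
data.frame(eigenvalue = eig,
           cum.variance = round(100 * cumsum(eig) / sum(eig), 2))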
2.4 PCA Model
The FactoMineR PCA function does not directly return the component scores in the form we need for regression, so we take them from the prcomp object instead. We define the PCA model as follows:
PC = pca$x[,1:2]
pca$rotation[,1:2]
           PC1          PC2
X1 0.47320214 -0.18523244
X2 -0.44322883 -0.31825860
X3 0.45785030 -0.29844144
X4 -0.06831445 -0.53689361
X5 0.45941515 0.29941890
X6 -0.04861139 -0.46496826
X7 -0.29275743 0.01411026
X8 0.25729735 -0.42547868
The two PC equations can be written (in terms of the standardized predictors):
PC_1 = .47X_1 - .44X_2 + .46X_3 - .07X_4 + .46X_5 - .05X_6 - .29X_7 + .26X_8
PC_2 = -.19X_1 - .32X_2 - .30X_3 - .54X_4 + .30X_5 - .46X_6 + .01X_7 - .43X_8
These equations will be important when we extract the regression model in terms of the original predictors later.
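As a quick sanity check (a minimal sketch, assuming the Steel data and the prcomp object pca from above), the scores in pca$x are exactly the standardized predictors multiplied by the loading matrix, which is what the two equations express:
# prcomp scores = standardized predictors %*% loadings
Z <- scale(Steel[,-1])                      # standardized X1, ..., X8
all.equal(as.numeric(pca$x),
          as.numeric(Z %*% pca$rotation))   # should be TRUE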
3 LINEAR MODEL
3.1 Building the Model
We will use the lm function to build the linear regression model with the two PCs as predictors.
Y = Steel$Y
reg = lm(Y~PC)
summary(reg)
Call:
lm(formula = Y ~ PC)
Residuals:
Min 1Q Median 3Q Max
-37.997 -6.256 -1.323 7.059 82.807
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.38689 0.06051 452.6 <2e-16 ***
PCPC1 15.25102 0.03466 440.0 <2e-16 ***
PCPC2 -10.09135 0.03641 -277.2 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.33 on 35037 degrees of freedom
Multiple R-squared: 0.8853, Adjusted R-squared: 0.8853
F-statistic: 1.352e+05 on 2 and 35037 DF, p-value: < 2.2e-16
Finally, we have a regression model with the PCs as predictor variables. Next, we plot the residuals and check the model’s assumptions.
3.2 Assumptions Checking
par(mfrow=c(2,2))
plot(reg)
Normality (Anderson-Darling test):
H_0: The residuals follow the normal distribution vs.
H_1: The residuals DO NOT follow the normal distribution
nortest::ad.test(reg$residuals)
Anderson-Darling normality test
data: reg$residuals
A = 222.64, p-value < 2.2e-16
Homoscedasticity (Breusch-Pagan test):
H_0 : \sigma_1^2 = \sigma_2^2 (the residual variance is constant) vs.
H_1 : \sigma_1^2 \ne \sigma_2^2 (the residual variance is not constant)
lmtest::bptest(reg)
studentized Breusch-Pagan test
data: reg
BP = 4221.5, df = 2, p-value < 2.2e-16
Autocorrelation (Durbin-Watson test):
lmtest::dwtest(reg)
Durbin-Watson test
data: reg
DW = 0.33855, p-value < 2.2e-16
alternative hypothesis: true autocorrelation is greater than 0
All three tests reject their null hypotheses at any reasonable significance level: the normality and homoscedasticity assumptions are violated, and the residuals are positively autocorrelated. The autocorrelation is unsurprising because the data are a time series, observed repeatedly throughout the full year.
3.3 Extracting the Linear Model
beta.Z = as.matrix(reg$coefficients[2:3])  # estimated coefficients on PC1 and PC2
V = as.matrix(pca$rotation[,1:2])          # loading matrix of the first two PCs
beta.X = V %*% beta.Z                      # coefficients on the standardized predictors
beta.X
            [,1]
X1 9.086060
X2 -3.548034
X3 9.994360
X4 4.376114
X5 3.985010
X6 3.950782
X7 -4.607241
X8 8.217700
The resulting regression model, written in terms of the standardized predictors (with the intercept from the summary above), is:
Y = 27.39 + 9.09X_1 - 3.55X_2 + 9.99X_3 + 4.38X_4 + 3.99X_5 + 3.95X_6 - 4.61X_7 + 8.22X_8
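To confirm the back-transformation, a minimal check (using the objects defined above) compares the fitted values implied by this equation with those from the lm fit:
# Since PC = Z %*% V, the fitted values b0 + Z %*% beta.X must equal
# the fitted values of the regression on the two PCs.
Z <- scale(Steel[,-1])
yhat <- as.numeric(reg$coefficients[1] + Z %*% beta.X)
all.equal(yhat, unname(fitted(reg)))        # should be TRUE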
4 CONCLUSION
Principal Component Analysis took care of the multicollinearity and the high correlation between the independent variables: the principal components are orthogonal by construction. PCA is a useful preprocessing tool for intermediate modelling such as regression, discriminant analysis, and clustering.
The scope of this article focused on the PCA; the assumption violations in the linear regression model have not been addressed here.