Introduction

This document reports the analysis of local stability electric grid dataset, hosted by UCI Machine learning repository. The data set contains readings from an electrical system having 4 nodes producing electricity and using the data to determine the system stability.

*Project’s Goal**: Goal is to model the data, so as to accurately predict the stability of electric grid. Also, infer the relationship between predictors and the response.The URL of the dataset is https://archive.ics.uci.edu/ml/datasets/Electrical+Grid+Stability+Simulated+Data+. The scope of this project is to apply concepts learnt in class on our desire data set. We use R language through out the project.

Size of Observations(n) = 10000

Number of predictors(p) = 11

What constitutes an observation: Each observation represents valuation record of a system using multiple nodes. There are 11 predictor variables and 1 outcome variable for each observation.The predictor variables usage are independent on each other.

Response Variables: The Response variable also known as the Output variable is stabf. It tells us if the system is stable or not.

Predictor variables: tau1,tau2,tau3,tau4 are electricity produced from node1,node2,node3 and node 4 respectively. Obviously these will contribute in determining of the system is stable or not. p2,p3,p4 are the power consumed on node2, node3 and node4 respectively. For some reasons energy consumed on node1 is not included in determining whether the system is stable of not. So we won’t worry about p1. That being said, p2,p3,p4 have impact on the response. g1,g2,g3,g4 are the price value of electricity on node1-node 4 at each observation. These variables would also add to factor to determine if the system is stable or not. The usage of the variables were independentof each other. Therefore each can be treated on its own.

We read the data from a csv file and attach to the project. The R function attach() is use to attach data to the project for ease of accessibility.

data = read.csv("electricity19.csv")
attach(data)

Below procedure display the data’s features and observations for better understanding.

names(data) #displays feature names
##  [1] "tau1"  "tau2"  "tau3"  "tau4"  "p2"    "p3"    "p4"    "g1"   
##  [9] "g2"    "g3"    "g4"    "stabf"
head(data)  #displays 10 observations.
##        tau1     tau2     tau3     tau4         p2         p3         p4
## 1 2.9590600 3.079885 8.381025 9.780754 -0.7826036 -1.2573948 -1.7230863
## 2 9.3040972 4.902524 3.047541 1.369357 -1.9400584 -1.8727417 -1.2550120
## 3 8.9717069 8.848428 3.046479 1.214518 -1.2074556 -1.2772101 -0.9204924
## 4 0.7164148 7.669600 4.486641 2.340563 -1.0274733 -1.9389442 -0.9973736
## 5 3.1341116 7.608772 4.943759 9.857573 -1.1255310 -1.8459749 -0.5543050
## 6 6.9992087 9.109247 3.784066 4.267788 -1.8571392 -0.6703968 -1.9021328
##          g1         g2        g3        g4    stabf
## 1 0.6504565 0.85957811 0.8874449 0.9580340 unstable
## 2 0.4134406 0.86241408 0.5621391 0.7817599   stable
## 3 0.1630410 0.76668866 0.8394440 0.1098532 unstable
## 4 0.4462089 0.97674408 0.9293805 0.3627178 unstable
## 5 0.7971095 0.45544995 0.6569467 0.8209235 unstable
## 6 0.2617929 0.07792967 0.5428839 0.4699310   stable
dim(data)
## [1] 10000    12

The dim() function shows the data has 10000 observations and 12 columns.

The summary() function in R shows a numerical summary of each variable in data set. We display the summary statistics of our data set.

summary(data)
##       tau1             tau2              tau3             tau4       
##  Min.   :0.5008   Min.   : 0.5001   Min.   :0.5008   Min.   :0.5005  
##  1st Qu.:2.8749   1st Qu.: 2.8751   1st Qu.:2.8755   1st Qu.:2.8750  
##  Median :5.2500   Median : 5.2500   Median :5.2500   Median :5.2497  
##  Mean   :5.2500   Mean   : 5.2500   Mean   :5.2500   Mean   :5.2500  
##  3rd Qu.:7.6247   3rd Qu.: 7.6249   3rd Qu.:7.6249   3rd Qu.:7.6248  
##  Max.   :9.9995   Max.   : 9.9998   Max.   :9.9994   Max.   :9.9994  
##        p2                p3                p4                g1         
##  Min.   :-1.9999   Min.   :-1.9999   Min.   :-1.9999   Min.   :0.05001  
##  1st Qu.:-1.6249   1st Qu.:-1.6250   1st Qu.:-1.6250   1st Qu.:0.28752  
##  Median :-1.2500   Median :-1.2500   Median :-1.2500   Median :0.52501  
##  Mean   :-1.2500   Mean   :-1.2500   Mean   :-1.2500   Mean   :0.52500  
##  3rd Qu.:-0.8750   3rd Qu.:-0.8750   3rd Qu.:-0.8751   3rd Qu.:0.76243  
##  Max.   :-0.5001   Max.   :-0.5001   Max.   :-0.5000   Max.   :0.99994  
##        g2                g3                g4               stabf     
##  Min.   :0.05005   Min.   :0.05005   Min.   :0.05003   stable  :3620  
##  1st Qu.:0.28755   1st Qu.:0.28751   1st Qu.:0.28749   unstable:6380  
##  Median :0.52500   Median :0.52501   Median :0.52500                  
##  Mean   :0.52500   Mean   :0.52500   Mean   :0.52500                  
##  3rd Qu.:0.76249   3rd Qu.:0.76244   3rd Qu.:0.76243                  
##  Max.   :0.99994   Max.   :0.99998   Max.   :0.99993

Next, we examine the correlation state of our data. We display correlation with cor() R function.

cor(data[,-12])
##              tau1         tau2         tau3          tau4           p2
## tau1  1.000000000  0.015585721 -0.005969607 -0.0172652888 -0.015485308
## tau2  0.015585721  1.000000000  0.014273414 -0.0019647158  0.006573472
## tau3 -0.005969607  0.014273414  1.000000000  0.0043543363 -0.003134261
## tau4 -0.017265289 -0.001964716  0.004354336  1.0000000000  0.010552963
## p2   -0.015485308  0.006573472 -0.003134261  0.0105529626  1.000000000
## p3   -0.015923767  0.007672768 -0.008780459  0.0061687215  0.002388424
## p4   -0.015807066 -0.005962636 -0.017531089 -0.0112110679 -0.006843659
## g1    0.010521128 -0.001742367 -0.011605048 -0.0041489273  0.015603017
## g2    0.015349831  0.015383308  0.007671386  0.0084310883 -0.018032104
## g3   -0.001278658  0.016508464  0.014701706  0.0032598572  0.007555052
## g4    0.005493873 -0.011764008 -0.011497448 -0.0004908815  0.019817004
##                p3           p4           g1           g2           g3
## tau1 -0.015923767 -0.015807066  0.010521128  0.015349831 -0.001278658
## tau2  0.007672768 -0.005962636 -0.001742367  0.015383308  0.016508464
## tau3 -0.008780459 -0.017531089 -0.011605048  0.007671386  0.014701706
## tau4  0.006168722 -0.011211068 -0.004148927  0.008431088  0.003259857
## p2    0.002388424 -0.006843659  0.015603017 -0.018032104  0.007555052
## p3    1.000000000  0.012952865 -0.003218925 -0.011574893 -0.005897288
## p4    0.012952865  1.000000000 -0.013635590  0.002849963 -0.003515084
## g1   -0.003218925 -0.013635590  1.000000000  0.007559018 -0.005836219
## g2   -0.011574893  0.002849963  0.007559018  1.000000000 -0.012808995
## g3   -0.005897288 -0.003515084 -0.005836219 -0.012808995  1.000000000
## g4   -0.010485250  0.017505439  0.012430613 -0.014909362  0.006900185
##                 g4
## tau1  0.0054938729
## tau2 -0.0117640085
## tau3 -0.0114974481
## tau4 -0.0004908815
## p2    0.0198170040
## p3   -0.0104852500
## p4    0.0175054386
## g1    0.0124306133
## g2   -0.0149093622
## g3    0.0069001847
## g4    1.0000000000

For all variables, there isn’t substantial correlation between the predictors. Also, we use graphical plot to show the correlation state of the data set.

Graphical Summaries

First, we use pair() function to create a scatterplot matrix for all the variables. We see a scatterplot for every pair variables in our data set.

pairs(data,col=data$stabf)

From the pair plot above, the red color indicate the distribution of observation with unstable response while black indicate distribution of observations with stable response.

We then go ahead to use the ggpairs function in ggally package in R, to see the correlation among predictors also.

library(GGally)
## Loading required package: ggplot2
ggpairs(data = data[1:11],
        title = "Electricity data Correlation Plot",
        upper = list(continuous = wrap("cor", size = 2)), 
        lower = list(continuous = "smooth")
)

Below,We display the density and frequency plot using histogram on the data.

library(ggplot2) # Data visualization
library(gridExtra)
library(grid)
# Density & Frequency analysis with the Histogram,

# Tau1 
tau1l <- ggplot(data=data, aes(x=tau1))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=stabf)) + 
  xlab("Tau1") +  
  ylab("Frequency") + 
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(tau1)),linetype="dashed",color="grey")

# Tau2
tau2l <- ggplot(data=data, aes(x=tau2))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=stabf)) + 
  xlab("Tau2") +  
  ylab("Frequency") + 
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(tau2)),linetype="dashed",color="grey")

# Tau3
tau3l <- ggplot(data=data, aes(x=tau3))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=stabf)) + 
  xlab("Tau3") +  
  ylab("Frequency") + 
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(tau3)),linetype="dashed",color="grey")

# Tau4 
tau4l <- ggplot(data=data, aes(x=tau2))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=stabf)) + 
  xlab("Tau4") +  
  ylab("Frequency") + 
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(tau4)),linetype="dashed",color="grey")

#P2
p2l <- ggplot(data=data, aes(x=p2))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=stabf)) + 
  xlab("p2") +  
  ylab("Frequency") + 
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(p2)),linetype="dashed",color="grey")

#P3
p3l <- ggplot(data=data, aes(x=p3))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=stabf)) + 
  xlab("p3") +  
  ylab("Frequency") + 
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(p3)),linetype="dashed",color="grey")

#P4
p4l <- ggplot(data=data, aes(x=p4))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=stabf)) + 
  xlab("p4") +  
  ylab("Frequency") + 
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(p4)),linetype="dashed",color="grey")

#g1
g1l <- ggplot(data=data, aes(x=g1))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=stabf)) + 
  xlab("g1") +  
  ylab("Frequency") + 
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(g1)),linetype="dashed",color="grey")

#g2
g2l <- ggplot(data=data, aes(x=g2))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=stabf)) + 
  xlab("g2") +  
  ylab("Frequency") + 
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(g2)),linetype="dashed",color="grey")

#g3
g3l <- ggplot(data=data, aes(x=g3))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=stabf)) + 
  xlab("g3") +  
  ylab("Frequency") + 
  theme(legend.position="none")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(g3)),linetype="dashed",color="grey")

#g4
g4l <- ggplot(data=data, aes(x=g4))+
  geom_histogram(binwidth=0.2, color="black", aes(fill=stabf)) + 
  xlab("g4") +  
  ylab("Frequency") + 
  theme(legend.position="right")+
  ggtitle("Histogram of Sepal Length")+
  geom_vline(data=data, aes(xintercept = mean(g4)),linetype="dashed",color="grey")

# Plot all visualizations
grid.arrange(tau1l + ggtitle(""),
             tau2l + ggtitle(""),
             tau3l + ggtitle(""),
             tau4l  + ggtitle(""),
             p2l + ggtitle(""),
             p3l + ggtitle(""),
             p4l + ggtitle(""),
             g1l + ggtitle(""),
             g2l + ggtitle(""),
             g3l + ggtitle(""),
             g4l + ggtitle(""),
             nrow = 3,
             top = textGrob("Electric Data Frequency Histogram", 
                            gp=gpar(fontsize=15))
)

From the above plots, We can see the density distribution of each attribute broken down by class value. Like the scatterplot matrix, the density plot by class can show the separation of classes. It can also help to us understand the overlap in class values for an attribute.

# Tau1 
tau1l <- ggplot(data=data, aes(x=tau1,colour=stabf), fill=stabf)+
  geom_density(alpha=.3) +
  geom_vline(aes(xintercept = mean(tau1), colour=stabf),linetype="dashed",color="grey",size=1)+
  xlab("Tau1") +  
  ylab("Density") + 
  theme(legend.position="none")

# Tau2 
tau2l <- ggplot(data=data, aes(x=tau2,colour=stabf), fill=stabf)+
  geom_density(alpha=.3) +
  geom_vline(aes(xintercept = mean(tau2), colour=stabf),linetype="dashed",color="grey",size=1)+
  xlab("Tau2") +  
  ylab("Density") + 
  theme(legend.position="none")

# Tau3 
tau3l <- ggplot(data=data, aes(x=tau3,colour=stabf), fill=stabf)+
  geom_density(alpha=.3) +
  geom_vline(aes(xintercept = mean(tau2), colour=stabf),linetype="dashed",color="grey",size=1)+
  xlab("Tau3") +  
  ylab("Density") + 
  theme(legend.position="none")

# Tau4 
tau4l <- ggplot(data=data, aes(x=tau4,colour=stabf), fill=stabf)+
  geom_density(alpha=.3) +
  geom_vline(aes(xintercept = mean(tau4), colour=stabf),linetype="dashed",color="grey",size=1)+
  xlab("Tau4") +  
  ylab("Density") + 
  theme(legend.position="none")

#P2  
p2l <- ggplot(data=data, aes(x=p2,colour=stabf), fill=stabf)+
  geom_density(alpha=.3) +
  geom_vline(aes(xintercept = mean(tau2), colour=stabf),linetype="dashed",color="grey",size=1)+
  xlab("p2") +  
  ylab("Density") + 
  theme(legend.position="none")

#P3  
p3l <- ggplot(data=data, aes(x=p3,colour=stabf), fill=stabf)+
  geom_density(alpha=.3) +
  geom_vline(aes(xintercept = mean(p3), colour=stabf),linetype="dashed",color="grey",size=1)+
  xlab("p3") +  
  ylab("Density") + 
  theme(legend.position="none")

#P4  
p4l <- ggplot(data=data, aes(x=p4,colour=stabf), fill=stabf)+
  geom_density(alpha=.3) +
  geom_vline(aes(xintercept = mean(p4), colour=stabf),linetype="dashed",color="grey",size=1)+
  xlab("p4") +  
  ylab("Density") + 
  theme(legend.position="none")

#g1  
g1l <- ggplot(data=data, aes(x=g1,colour=stabf), fill=stabf)+
  geom_density(alpha=.3) +
  geom_vline(aes(xintercept = mean(g1), colour=stabf),linetype="dashed",color="grey",size=1)+
  xlab("g1") +  
  ylab("Density") + 
  theme(legend.position="none")

#g2  
g2l <- ggplot(data=data, aes(x=g2,colour=stabf), fill=stabf)+
  geom_density(alpha=.3) +
  geom_vline(aes(xintercept = mean(g2), colour=stabf),linetype="dashed",color="grey",size=1)+
  xlab("g2") +  
  ylab("Density") + 
  theme(legend.position="none")

#g3  
g3l <- ggplot(data=data, aes(x=g3,colour=stabf), fill=stabf)+
  geom_density(alpha=.3) +
  geom_vline(aes(xintercept = mean(g3), colour=stabf),linetype="dashed",color="grey",size=1)+
  xlab("g3") +  
  ylab("Density") + 
  theme(legend.position="none")

#g4  
g4l <- ggplot(data=data, aes(x=g4,colour=stabf), fill=stabf)+
  geom_density(alpha=.3) +
  geom_vline(aes(xintercept = mean(g3), colour=stabf),linetype="dashed",color="grey",size=1)+
  xlab("g4") +  
  ylab("Density") + 
  theme(legend.position="right")

# Plot all density visualizations
grid.arrange(tau1l + ggtitle(""),
             tau2l + ggtitle(""),
             tau3l + ggtitle(""),
             tau4l + ggtitle(""),
             p2l + ggtitle(""),
             p3l + ggtitle(""),
             p4l + ggtitle(""),
             g1l + ggtitle(""),
             g2l + ggtitle(""),
             g3l + ggtitle(""),
             g4l + ggtitle(""),
             nrow = 3,
             top = textGrob("Electricity data Density Plot", 
                            gp=gpar(fontsize=15))
)  

Above plots shows the desity plot of the variables in our data set.

We therefore plot all the variables in a single visualization that contain all the boxplots.

tau1SL <- ggplot(data=data, aes(stabf,tau1,fill=stabf)) + 
        geom_boxplot()+
        scale_y_continuous("Tau1", breaks= seq(0,300, by=.5))+
        theme(legend.position="none")

tau2SL <- ggplot(data=data, aes(stabf,tau2,fill=stabf)) + 
        geom_boxplot()+
        scale_y_continuous("Tau2", breaks= seq(0,30, by=.5))+
        theme(legend.position="none")

tau3SL <- ggplot(data=data, aes(stabf,tau3,fill=stabf)) + 
        geom_boxplot()+
        scale_y_continuous("Tau3", breaks= seq(0,30, by=.5))+
        theme(legend.position="none")
tau4SL <- ggplot(data=data, aes(stabf,tau4,fill=stabf)) + 
        geom_boxplot()+
        scale_y_continuous("Tau4", breaks= seq(0,30, by=.5))+
        theme(legend.position="none")
p2SL <- ggplot(data=data, aes(stabf,p2,fill=stabf)) + 
        geom_boxplot()+
        scale_y_continuous("P2", breaks= seq(0,30, by=.5))+
        theme(legend.position="none")
p3SL <- ggplot(data=data, aes(stabf,p3,fill=stabf)) + 
        geom_boxplot()+
        scale_y_continuous("P3", breaks= seq(0,30, by=.5))+
        theme(legend.position="none")
p4SL <- ggplot(data=data, aes(stabf,p4,fill=stabf)) + 
        geom_boxplot()+
        scale_y_continuous("P4", breaks= seq(0,30, by=.5))+
        theme(legend.position="none")
g1SL <- ggplot(data=data, aes(stabf,g1,fill=stabf)) + 
        geom_boxplot()+
        scale_y_continuous("G1", breaks= seq(0,30, by=.5))+
        theme(legend.position="none")
g2SL <- ggplot(data=data, aes(stabf,g2,fill=stabf)) + 
        geom_boxplot()+
        scale_y_continuous("G2", breaks= seq(0,30, by=.5))+
        theme(legend.position="none")
g3SL <- ggplot(data=data, aes(stabf,tau3,fill=stabf)) + 
        geom_boxplot()+
        scale_y_continuous("G3", breaks= seq(0,30, by=.5))+
        theme(legend.position="none")

g4SL <- ggplot(data=data, aes(stabf,g4,fill=stabf)) + 
        geom_boxplot()+
        scale_y_continuous("G4", breaks= seq(0,30, by=.5))+
        theme(legend.position="right")


# Plot all density visualizations
grid.arrange(tau1SL + ggtitle(""),
             tau2SL + ggtitle(""),
             tau3SL + ggtitle(""),
             tau4SL + ggtitle(""),
             p2SL + ggtitle(""),
             p3SL + ggtitle(""),
             p4SL + ggtitle(""),
             g1SL + ggtitle(""),
             g2SL + ggtitle(""),
             g3SL + ggtitle(""),
             g4SL + ggtitle(""),
             nrow = 3,
             top = textGrob("Electricity data Density Plot", 
                            gp=gpar(fontsize=11))
)  

Above boxplots indicate that there is not any case of outliers within our data set.

In this section, we explored our data set. We used several R functions to understand our data set. We also used several graphical functions to look through the data. In the next section, we will carry out classification machine learning methods on our data set.

Classification

From the introduction, we saw that the response variable of our data set is qualitative. In this session, We will use several classification methods in predicting qualitative response.

There are several methods for classification. But in this project we focus on four(4) methods namely: Logistics regression, Linear discriminate analysis, Quadratic discriminate analysis and KNN.

Logistic Regression

Whole dataset for train and test

We fit a logistic regression model to predict the stablity of the electricity using the predictive variables of the data. We use the generalized linear models glm() function in R. We begin by using the whole data set for both traininbg and testing of model.

glm.fits=glm(stabf~tau1+tau2+tau3+tau4+p2+p3+p4+g1+g2+g3+g4, family = binomial, data = data)
summary(glm.fits) #display summary of the fitted model
## 
## Call:
## glm(formula = stabf ~ tau1 + tau2 + tau3 + tau4 + p2 + p3 + p4 + 
##     g1 + g2 + g3 + g4, family = binomial, data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0376  -0.5482   0.2220   0.6123   2.4797  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -11.77892    0.28377 -41.509   <2e-16 ***
## tau1          0.31451    0.01129  27.847   <2e-16 ***
## tau2          0.32251    0.01139  28.316   <2e-16 ***
## tau3          0.31860    0.01139  27.961   <2e-16 ***
## tau4          0.33185    0.01147  28.935   <2e-16 ***
## p2            0.02298    0.06541   0.351    0.725    
## p3            0.08092    0.06530   1.239    0.215    
## p4           -0.09951    0.06486  -1.534    0.125    
## g1            2.68809    0.11044  24.339   <2e-16 ***
## g2            2.91745    0.11225  25.991   <2e-16 ***
## g3            3.12533    0.11357  27.518   <2e-16 ***
## g4            2.85721    0.11183  25.550   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 13091.2  on 9999  degrees of freedom
## Residual deviance:  7813.6  on 9988  degrees of freedom
## AIC: 7837.6
## 
## Number of Fisher Scoring iterations: 6

From the above procedure, we fitted the model using glm() function and used the summary() function in R to display summary of the fitted model. We can see that the predictors with smallest p-value here are tau1,tau2,tau3,tau4,g1,g2,g3,g4. It shows these predictors have significance on the response variable stabf.

It is important to note that variable p4 has negative coefficient, this suggests that p4 has a negative impact on the response variable.

Below we use coef() function to display the coefficients for the fitted model.

coef(glm.fits)
##  (Intercept)         tau1         tau2         tau3         tau4 
## -11.77891924   0.31450829   0.32250716   0.31859685   0.33184614 
##           p2           p3           p4           g1           g2 
##   0.02297856   0.08092072  -0.09951266   2.68808524   2.91744610 
##           g3           g4 
##   3.12533235   2.85720976

Also, we use the summary() function to display the p-values for the coefficients.

summary(glm.fits)$coef
##                 Estimate Std. Error    z value      Pr(>|z|)
## (Intercept) -11.77891924 0.28376700 -41.509123  0.000000e+00
## tau1          0.31450829 0.01129408  27.847191 1.164890e-170
## tau2          0.32250716 0.01138952  28.316140 2.187206e-176
## tau3          0.31859685 0.01139431  27.961036 4.840376e-172
## tau4          0.33184614 0.01146848  28.935491 4.272833e-184
## p2            0.02297856 0.06540543   0.351325  7.253446e-01
## p3            0.08092072 0.06529729   1.239266  2.152470e-01
## p4           -0.09951266 0.06486296  -1.534199  1.249808e-01
## g1            2.68808524 0.11044424  24.338845 7.609620e-131
## g2            2.91744610 0.11224898  25.990847 6.284651e-149
## g3            3.12533235 0.11357489  27.517810 1.074926e-166
## g4            2.85720976 0.11182960  25.549673 5.536863e-144

Next, we use the predict() function to prepedict the probability of the electricity stability, given the values of the predictors.

glm.probs=predict(glm.fits,type = "response")
glm.probs[1:10]
##          1          2          3          4          5          6 
## 0.99721229 0.84559325 0.69910272 0.73784670 0.98603349 0.51121102 
##          7          8          9         10 
## 0.64495319 0.82123837 0.01896633 0.94785159

The glm.probs above displays the probability of the electricity stability.

contrasts(data$stabf)
##          unstable
## stable          0
## unstable        1

From above, we used the constrast() function to assign dummy variable 1 to unstable and 0 to stable.

glm.pred=rep("stable",10000)
glm.pred[glm.probs>.5]="unstable"

The two commands above creates a vector of class predictions based whether the predicted probability of the Electricity grid non-stability is greater than or less than 0.5.

We use the table() function in R to create a confuse matrix to determine how many observations are correctly or incorrectly classified.

table(glm.pred,data$stabf)
##           
## glm.pred   stable unstable
##   stable     2556      771
##   unstable   1064     5609

From above, the table function displays the confession matrix. The diagonal elements of the confusion matrix shows the correct predictions, while off-diagonals shows incorrect predictions. 2556 + 5609 shows the number of correct predictions, while 1064+771 shows the number of incorrect predictions.

We use the the mean() function to use to calculate the fraction of prediction correctness.

mean(glm.pred==data$stabf)
## [1] 0.8165

The performance accuracy of our model using whole data for training and testing is 81.7% with logistic regression.

From above, we used the same set of 10000 observations as both training and testing our model.

The training error rate is 100-81.7 = 18.5%

Train dataset and Held out dataset

Using the whole datset for both training the model and testing the model could be misleading. The train error rate tends to underestimate the test error rate.To solve this issue, we are going to fit the model using part of the data, and then examine how well it predits using the held out data. This method yields a more realistic error rate.

Next,We randomly divid the dataset into train data and test data.

set.seed(1)
smp_siz = floor(0.75*nrow(data)) 
train_ind = sample(seq_len(nrow(data)),size = smp_siz)  
train =data[train_ind,] 
test=data[-train_ind,]

We used the ratio of 75:25 division for the dataset to train and test respectively.

Next, we go through the same process for fitting the model and prediction discussed before. But this time we use the train data set and test set.

glm.fits2=glm(stabf~tau1+tau2+tau3+tau4+p2+p3+p4+g1+g2+g3+g4, family = binomial, data = train)
summary(glm.fits2)
## 
## Call:
## glm(formula = stabf ~ tau1 + tau2 + tau3 + tau4 + p2 + p3 + p4 + 
##     g1 + g2 + g3 + g4, family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0190  -0.5589   0.2223   0.6095   2.3612  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -11.50498    0.32357 -35.556   <2e-16 ***
## tau1          0.30918    0.01300  23.786   <2e-16 ***
## tau2          0.32721    0.01317  24.848   <2e-16 ***
## tau3          0.31319    0.01307  23.955   <2e-16 ***
## tau4          0.33721    0.01327  25.420   <2e-16 ***
## p2            0.09135    0.07576   1.206   0.2279    
## p3            0.15451    0.07548   2.047   0.0407 *  
## p4           -0.08095    0.07515  -1.077   0.2814    
## g1            2.68413    0.12692  21.148   <2e-16 ***
## g2            2.84170    0.12882  22.059   <2e-16 ***
## g3            3.07818    0.13022  23.639   <2e-16 ***
## g4            2.84946    0.12926  22.045   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9803.6  on 7499  degrees of freedom
## Residual deviance: 5871.0  on 7488  degrees of freedom
## AIC: 5895
## 
## Number of Fisher Scoring iterations: 6

Using the train data didnt change the predictors that are significant.The predictors with smallest p-value are still tau1,tau2,tau3,tau4,g1,g2,g3,g4. Also, this Shows these predictors have significance on the response variable stabf. p4 still has negative coefficient which suggests a negative impact on the response variable.

We now use the predict() function to predict the probability of the electricity stability on new model using the test data, given the values of the predictors.

glm.probs2=predict(glm.fits2,test,type = "response")
glm.probs2[1:10]
##         1         4        10        25        34        43        44 
## 0.9971698 0.7262091 0.9427452 0.9839988 0.9949180 0.9418826 0.3909568 
##        55        56        57 
## 0.8424696 0.9662207 0.9423227
contrasts(train$stabf)
##          unstable
## stable          0
## unstable        1
dim(test)
## [1] 2500   12

The dim() funtion displayed the dimension of the test data. The number of observation in the test data is 2500.

glm.pred2=rep("stable",2500)
glm.pred2[glm.probs2>.5]="unstable"
table(glm.pred2,test$stabf)
##           
## glm.pred2  stable unstable
##   stable      651      189
##   unstable    267     1393
glm_correctness = mean(glm.pred2==test$stabf)
glm_correctness
## [1] 0.8176

As usual we used mean() function to calculate the fraction of prediction correctness. We noticed a slight improvement in the accuracy of our model when we used train data and test data. The accuracy is 81.76%. Though the improvement isn’t much but it shows the need to have different set of data for training and testing of model.

Linear Discriminant Analysis

We are going to perform a linear discriminant analysis on our data set. For this model, we use the train and test data set we already have. We fit an LDA model using lda() function on our train data.

require(MASS)
## Loading required package: MASS
lda.fit=lda(stabf~tau1+tau2+tau3+tau4+p2+p3+p4+g1+g2+g3+g4,data=train)
lda.fit
## Call:
## lda(stabf ~ tau1 + tau2 + tau3 + tau4 + p2 + p3 + p4 + g1 + g2 + 
##     g3 + g4, data = train)
## 
## Prior probabilities of groups:
##    stable  unstable 
## 0.3602667 0.6397333 
## 
## Group means:
##              tau1     tau2     tau3     tau4        p2        p3        p4
## stable   4.412792 4.359115 4.397580 4.359946 -1.252509 -1.254498 -1.242513
## unstable 5.719908 5.783466 5.757162 5.749345 -1.242431 -1.247655 -1.257868
##                 g1        g2        g3        g4
## stable   0.4543813 0.4472584 0.4422838 0.4509523
## unstable 0.5651724 0.5683109 0.5726177 0.5692794
## 
## Coefficients of linear discriminants:
##              LD1
## tau1  0.16997592
## tau2  0.17729304
## tau3  0.17184388
## tau4  0.18233386
## p2    0.04676196
## p3    0.07447041
## p4   -0.03928883
## g1    1.45615514
## g2    1.54255667
## g3    1.66818543
## g4    1.55383910

The output of the LDA indicates \(\pi1\) = 0.36 and \(\pi2\) = 0.64. This indicates that 36% of the training observations correspond to the Electricity grid being stable. While 64% indicates unstability of the Electricity grid.

We also see the group means. The Group means shows averge of each predictor within each class and its LDA estimates.

The Coefficients of linear discriminants provides the linear combination of all the predictors that are used to form the LDA decision rule.

We use plot() function to produce a plot of the linear discriminants.

plot(lda.fit)

Next, we use the predict() function on our fitted lda model with the test data.

lda.pred=predict(lda.fit, test)
names(lda.pred)
## [1] "class"     "posterior" "x"

From the result above, the predict() function returned a list with three elements. Namely the class, posterior and x which contains the linear discriminants.

Next, we use the class from fitted model on the table() function produce a confuse matrix. This will determine how many observations that is correctly or incorrectly classified.

lda.class=lda.pred$class
table(lda.class,test$stabf)
##           
## lda.class  stable unstable
##   stable      656      194
##   unstable    262     1388
lda_correctness=mean(lda.class==test$stabf)
lda_correctness
## [1] 0.8176

From the above procedure, with using lda on our data, we got an accuracy of 81.76% which is similar to what we got with logistics regression.

sum(lda.pred$posterior[,1]>=.5)
## [1] 850
sum(lda.pred$posterior[,1]<.5)
## [1] 1650

With the above code we recreated the predictions contained in lda.pred$class using 50% threshold to the posterior probabilities.

Quadratic Discriminant Analysis

We are going to carry out a quadratic discriminat analysis on our data set. We use qda() function in R, which is part of the MASS library.

qda.fit=qda(stabf~tau1+tau2+tau3+tau4+p2+p3+p4+g1+g2+g3+g4,data=train)
qda.fit
## Call:
## qda(stabf ~ tau1 + tau2 + tau3 + tau4 + p2 + p3 + p4 + g1 + g2 + 
##     g3 + g4, data = train)
## 
## Prior probabilities of groups:
##    stable  unstable 
## 0.3602667 0.6397333 
## 
## Group means:
##              tau1     tau2     tau3     tau4        p2        p3        p4
## stable   4.412792 4.359115 4.397580 4.359946 -1.252509 -1.254498 -1.242513
## unstable 5.719908 5.783466 5.757162 5.749345 -1.242431 -1.247655 -1.257868
##                 g1        g2        g3        g4
## stable   0.4543813 0.4472584 0.4422838 0.4509523
## unstable 0.5651724 0.5683109 0.5726177 0.5692794

The output contains group means but doesnt contain coefficient of the linear discriminants. This is because i is not a linear but quadratic model.

Next, We use the predict() function which works in similar fashion as LDA.

qda.class=predict(qda.fit,test)$class
table(qda.class,test$stabf)
##           
## qda.class  stable unstable
##   stable      745      107
##   unstable    173     1475
qda_correctness=mean(qda.class==test$stabf)
qda_correctness
## [1] 0.888

The accuracy of QDA is 88.8% which is greater than LDA and logistics regression performance accuracy. This suggest that QDA captures the true relationship more accurately than LDA and logistics regression.

K-Nearest Neighbors

Next, We carry out K-nearest neigbours analysis on our data set. We use the knn() functio in R. With knn it works differently from other model-fitting functions. Knn take a single command instead of the two-step train test approach.

As other model, we divid the data into train and test and use normalization formular to conver everything into values between 0 and 1.

library(class)
set.seed(1)

##normalization function is created
nor <-function(x) { (x -min(x))/(max(x)-min(x))   }

##Run nomalization on first 11 coulumns of dataset because they are the predictors
data_norm <- as.data.frame(lapply(data[,c(1,2,3,4,5,6,7,8,9,10,11)], nor))

train2 <- sample(1:nrow(data), 0.75 * nrow(data)) 

##extract training set
data_train <- data_norm[train2,] 

##extract testing set
data_test <- data_norm[-train2,] 

##extract 5th column of train dataset because it will be used as 'cl' argument in knn function.
data_target_category <- data[train2,12]
##extract 5th column if test dataset to measure the accuracy
data_test_category <- data[-train2,12]

From above, we used the normalization formular on the train and test data to convert to value between 0 and 1.

Next, we run the knn() function which takes both the normalized train data, the normalized test data, reponse variable and the value of K.

set.seed(1)
##run knn function
knn.pred <- knn(data_train,data_test,cl=data_target_category,k=1)
##create confusion matrix
table(knn.pred,data_test_category)
##           data_test_category
## knn.pred   stable unstable
##   stable      639      174
##   unstable    279     1408
knn_1 = mean(knn.pred==data_test_category)
knn_1
## [1] 0.8188

From the above procedure, we set k=1 and got a prediction accuracy of 81.88%.

We believe increasing the value of k will yield better prediction accuracy. Next, We set k=5 to see what the performance will be.

set.seed(1)
##run knn function
knn.pred <- knn(data_train,data_test,cl=data_target_category,k=5)
##create confusion matrix
table(knn.pred,data_test_category)
##           data_test_category
## knn.pred   stable unstable
##   stable      672       96
##   unstable    246     1486
knn_5=mean(knn.pred==data_test_category)
knn_5
## [1] 0.8632

Setting K=5 increased the performance accuracy of knn on our data set. k=5 with knn gave 86.32% accuracy while k=1 gave us 81.88% accuracy.

Comparing Logistic regression, LDA, QDA and KNN result

To compare this section, we compare the performance from all the models we used so far. Below shows the comparison of the performance of the model used in this section.

data.frame(Logistic = glm_correctness, LDA = lda_correctness, QDA = qda_correctness, Knn1 = knn_1, Knn5 = knn_5)
##   Logistic    LDA   QDA   Knn1   Knn5
## 1   0.8176 0.8176 0.888 0.8188 0.8632

ROC and AUC

We plot the Receiver Operating Characteristic (ROC) curve for our default predicting logistic regression model below.

library(ROCR)
## Warning: package 'ROCR' was built under R version 3.5.3
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.5.3
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
# score the same training data set on which the model was fit
prob = predict.glm(glm.fits, type='response', newdata =  data)
pred = prediction(prob, data$stabf)

# AUC
auc = performance(pred,"auc")@y.values[[1]][1]

# plot the ROC curve
perf <- performance(pred,"tpr","fpr")
plot(perf, col="navyblue", cex.main=1,
     main= paste("Logistic Regression ROC Curve: AUC =", round(auc,3)))
abline(a=0, b = 1, col='darkorange1')

The Area under the ROC curve (AUC) is a widely-used measure of accuracy for classification models. The AUC = 0.891 is unrealistically high. This is because the predictions where made for the same dataset on which the model was estimated.

Next, we plot the ROC curve with our model built with training and test data.

library(ROCR)

# score the same training data set on which the model was fit
prob2 = predict.glm(glm.fits2, type='response', newdata =  test)
pred2 = prediction(prob2, test$stabf)

# AUC
auc2 = performance(pred2,"auc")@y.values[[1]][1]

# plot the ROC curve
perf2 <- performance(pred2,"tpr","fpr")
plot(perf2, col="navyblue", cex.main=1,
     main= paste("Logistic Regression ROC Curve: AUC =", round(auc2,3)))
abline(a=0, b = 1, col='darkorange1')

Using different data for estimating the model and testing the model, the AUC = 0.889. This is more realistic.

Resampling Methods

Resmapling methods do with repeatedly drawing samples from a training set and refitting the model of interest on each sample in order to obtain additional information about the fitted model.

We carryout cross-validation and the bootstrap methods. We use logistics regression for illustartion in this section.

Cross-Validation

The following sections describes the different cross-validation techniques.

The Validation set Approach

We carry out the Validation set approach on the our dataset. We use the set.seed() to random generate sample. We will randomly divide the observations into 5000 train and 5000 validation set.

set.seed(3)
smp_siz = floor(0.5*nrow(data)) 
train_ind = sample(seq_len(nrow(data)),size = smp_siz)  
trainVali =data[train_ind,] 
validationSet=data[-train_ind,]
train_model=glm(stabf~tau1+tau2+tau3+tau4+p2+p3+p4+g1+g2+g3+g4, family = binomial,data = trainVali)

Above, we randomly divided the data set into 5000 observations for train set and 5000 for test set. We trained our model using logistic regression on the train set.

We run the predict() function using the test model and observe the validation set error rate

glm.probs=predict(train_model,validationSet,type = "response")
cont = contrasts(train$stabf)
glm.pred=rep("stable",5000)
glm.pred[glm.probs>.5]="unstable"
tlb = table(glm.pred,validationSet$stabf)
mean(glm.pred!=validationSet$stabf)
## [1] 0.1798

The Validation set error rate is 17.98%.

Then We use the poly() function to estimate the validation error rate for the quadratic and cubic classification.

train_model2=glm(stabf~poly((tau1+tau2+tau3+tau4+p2+p3+p4+g1+g2+g3+g4),2), family = binomial,data = train)
glm.probs2=predict(train_model2,validationSet,type = "response")
cont2 = contrasts(train$stabf)
glm.pred2=rep("stable",5000)
glm.pred2[glm.probs2>.5]="unstable"
tlb2 = table(glm.pred2,validationSet$stabf)
mean(glm.pred2!=validationSet$stabf)
## [1] 0.2526

Validation set error rate for quadradic classification gave 25.2% on our data.

train_model3=glm(stabf~poly((tau1+tau2+tau3+tau4+p2+p3+p4+g1+g2+g3+g4),3), family = binomial,data = train)
glm.probs3=predict(train_model3,validationSet,type = "response")
cont3 = contrasts(train$stabf)
glm.pred3=rep("stable",5000)
glm.pred3[glm.probs3>.5]="unstable"
tlb3 = table(glm.pred3,validationSet$stabf)
mean(glm.pred3!=validationSet$stabf)
## [1] 0.254

Validation set error rate for cubic classification gave 25.28% on our data.

Using validation set approach, we obtained the Validation set error rates for the model with linear, quadractic and cubic terms as 17.98, 25.2 and 25.28 respectively.

Leave-One-Out Cross-Validation (LOOCV)

We use the cv.glm() function in boot library in R to carry out Leave-One-Out Cross-Validation. We run the LOOCV for increasingly complex polynomial fits.

library(boot)
# Cost function for a binary classifier suggested by boot package
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
cv.error=rep(0,3)
for(i in 1:3){
  model=glm(stabf~poly((tau1+tau2+tau3+tau4+p2+p3+p4+g1+g2+g3+g4),i), family = binomial, data = data)
  cv.error[i]=cv.glm(data,model,cost=cost)$delta[1]
}

cv.error
## [1] 0.2532 0.2543 0.2539
# Histogram of the Test Error
hist(cv.error,xlab='Test Error',ylab='Freq',main='Test Error LOOCV',
     col='cyan',border='blue',density=30)

From above procedure, we used loop to iteratively fit polynomial regressions for polynomial of order i = 1 to i=3. We Computed the Leave-One-Out cross-validation. Using the Leave-One-Out Cross-Validation approach, we obtained the error rates for the model with linear, quadractic and cubic terms as 0.2532, 0.2543 and 0.2539 respectively.

K-fOLD Cross-Validation

For K-Fold Cross-Validation, we use cv.glm() function to implement k-fold CV. We set K = 5 first and also use k=10.

For both procedure, we get CV errors corresponding to polynomial fits for orders 1 to 5.

set.seed(17)
cv.error.5=rep(0,5)
for(i in 1:5){
  model=glm(stabf~poly((tau1+tau2+tau3+tau4+p2+p3+p4+g1+g2+g3+g4),i), family = binomial, data = data)
  cv.error.5[i]=cv.glm(data,model,K=5,cost=cost)$delta[1]
}

cv.error.5
## [1] 0.2541 0.2534 0.2532 0.2531 0.2538
# Histogram of the Test Error
hist(cv.error.5,xlab='Test Error',ylab='Freq',main='Test Error LOOCV',
     col='cyan',border='blue',density=30)

The computation of K-fold is shorter than that of LOOCV. The CV error for orders 1 to 5 was displayed and shown with the above historgram. From above procedure, we used loop to iteratively fit polynomial regressions for polynomial of order i = 1 to i=5. We Computed the K-Fold cross-validation with k=5. We obtained the error rates for the model for polynomials from i=1 to i=5 as 0.2541, 0.2534, 0.2532, 0.2531 and 0.2538 respectively.

Below we use K= 10 and still get the CV errors corresponding to polynomial fits for orders 1 to 5.

set.seed(17)
cv.error.10=rep(0,5)
for(i in 1:5){
  model=glm(stabf~poly((tau1+tau2+tau3+tau4+p2+p3+p4+g1+g2+g3+g4),i), family = binomial, data = data)
  cv.error.10[i]=cv.glm(data,model,K=10,cost=cost)$delta[1]
}

cv.error.10
## [1] 0.2533 0.2538 0.2532 0.2537 0.2543
# Histogram of the Test Error
hist(cv.error.10,xlab='Test Error',ylab='Freq',main='Test Error LOOCV',
     col='cyan',border='blue',density=30)

We Computed the K-Fold cross-validation with k=10. We obtained the error rates for the model for polynomials from i=1 to i=5 as 0.2533, 0.2538, 0.2532, 0.2537 and 0.2543 respectively.

The Bootstrap

The bootstrap approach focuses on repeatedly sampling observations from the data set with replacement. We use the boot method within the trainControl function in R.

# load the library
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
## 
##     melanoma
library(e1071)
# load the iris dataset
# define training control
train_control <- trainControl(method="boot", number=100)
# train the model
model <- train(stabf~., data=data, trControl=train_control, method="glm")
# summarize results
print(model)
## Generalized Linear Model 
## 
## 10000 samples
##    11 predictor
##     2 classes: 'stable', 'unstable' 
## 
## No pre-processing
## Resampling: Bootstrapped (100 reps) 
## Summary of sample sizes: 10000, 10000, 10000, 10000, 10000, 10000, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8161362  0.5953132

From the above, using the bootstrap method and then fitting with a lositic model yielded a training error rate of 0.184.

set.seed(1)

boot.fn=function(data,index){
  model=glm(stabf~., family = binomial, data = data,subset = index)
  return(coef(model))
}

boot.fn(data,sample(392,392,replace = T))
##   (Intercept)          tau1          tau2          tau3          tau4 
## -11.028389517   0.292196397   0.324862455   0.205533802   0.327012711 
##            p2            p3            p4            g1            g2 
##  -0.007557283   0.263152346  -0.334260753   1.739338121   2.898179446 
##            g3            g4 
##   3.184296757   3.175513867
boot(data,boot.fn,1000)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = data, statistic = boot.fn, R = 1000)
## 
## 
## Bootstrap Statistics :
##          original        bias    std. error
## t1*  -11.77891924 -3.141154e-02  0.28356410
## t2*    0.31450829  4.606738e-04  0.01172524
## t3*    0.32250716  8.701986e-04  0.01216134
## t4*    0.31859685  7.105152e-04  0.01164373
## t5*    0.33184614  1.055868e-03  0.01174886
## t6*    0.02297856  1.876197e-03  0.06713454
## t7*    0.08092072 -3.563465e-03  0.06519611
## t8*   -0.09951266 -6.855842e-05  0.06357194
## t9*    2.68808524  6.250132e-03  0.11589781
## t10*   2.91744610  2.544534e-03  0.11549954
## t11*   3.12533235  4.673121e-03  0.11373222
## t12*   2.85720976  1.688370e-02  0.11440168

Model Selection and Regularization

Subset Selection

Best Subset Selection

We apply the best subset selection approach on our data set. Although there are other packages to do this but We use the bestglm package in R package to carryout subset selection. We use logistic regression for illustartion. This package is dependent on the leaps package that is used for linear regression but can also be used for logistic regression.

library(leaps)
regfit.full = regsubsets(stabf~.,data)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(stabf ~ ., data)
## 11 Variables  (and intercept)
##      Forced in Forced out
## tau1     FALSE      FALSE
## tau2     FALSE      FALSE
## tau3     FALSE      FALSE
## tau4     FALSE      FALSE
## p2       FALSE      FALSE
## p3       FALSE      FALSE
## p4       FALSE      FALSE
## g1       FALSE      FALSE
## g2       FALSE      FALSE
## g3       FALSE      FALSE
## g4       FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          tau1 tau2 tau3 tau4 p2  p3  p4  g1  g2  g3  g4 
## 1  ( 1 ) " "  "*"  " "  " "  " " " " " " " " " " " " " "
## 2  ( 1 ) " "  "*"  " "  "*"  " " " " " " " " " " " " " "
## 3  ( 1 ) "*"  "*"  " "  "*"  " " " " " " " " " " " " " "
## 4  ( 1 ) "*"  "*"  "*"  "*"  " " " " " " " " " " " " " "
## 5  ( 1 ) "*"  "*"  "*"  "*"  " " " " " " " " " " "*" " "
## 6  ( 1 ) "*"  "*"  "*"  "*"  " " " " " " " " "*" "*" " "
## 7  ( 1 ) "*"  "*"  "*"  "*"  " " " " " " " " "*" "*" "*"
## 8  ( 1 ) "*"  "*"  "*"  "*"  " " " " " " "*" "*" "*" "*"

Asterish indicates that a given variable is included in the corresponding model. From above, the output indicated that the best one-variable model contains taus2. Best two-variable model contains only tau2 and tau4, best three-variable model contains tau1,tau2 and tau4 etc.

We run the above procedure again but set nvmax to 11 variables this time. Parameter nvmax is used to set the maximum number of variables you desire the function to use.

library(leaps)
regfit.full = regsubsets(stabf~.,data,nvmax = 11)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(stabf ~ ., data, nvmax = 11)
## 11 Variables  (and intercept)
##      Forced in Forced out
## tau1     FALSE      FALSE
## tau2     FALSE      FALSE
## tau3     FALSE      FALSE
## tau4     FALSE      FALSE
## p2       FALSE      FALSE
## p3       FALSE      FALSE
## p4       FALSE      FALSE
## g1       FALSE      FALSE
## g2       FALSE      FALSE
## g3       FALSE      FALSE
## g4       FALSE      FALSE
## 1 subsets of each size up to 11
## Selection Algorithm: exhaustive
##           tau1 tau2 tau3 tau4 p2  p3  p4  g1  g2  g3  g4 
## 1  ( 1 )  " "  "*"  " "  " "  " " " " " " " " " " " " " "
## 2  ( 1 )  " "  "*"  " "  "*"  " " " " " " " " " " " " " "
## 3  ( 1 )  "*"  "*"  " "  "*"  " " " " " " " " " " " " " "
## 4  ( 1 )  "*"  "*"  "*"  "*"  " " " " " " " " " " " " " "
## 5  ( 1 )  "*"  "*"  "*"  "*"  " " " " " " " " " " "*" " "
## 6  ( 1 )  "*"  "*"  "*"  "*"  " " " " " " " " "*" "*" " "
## 7  ( 1 )  "*"  "*"  "*"  "*"  " " " " " " " " "*" "*" "*"
## 8  ( 1 )  "*"  "*"  "*"  "*"  " " " " " " "*" "*" "*" "*"
## 9  ( 1 )  "*"  "*"  "*"  "*"  " " " " "*" "*" "*" "*" "*"
## 10  ( 1 ) "*"  "*"  "*"  "*"  " " "*" "*" "*" "*" "*" "*"
## 11  ( 1 ) "*"  "*"  "*"  "*"  "*" "*" "*" "*" "*" "*" "*"

Variable p2 is the last variable to be added at subset 11. This agrees with our earlier discovery showing p2 with high p-value, showing it is less significance.

names(summary(regfit.full))
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"
reg.summary=summary(regfit.full)

The summary() function returns R2, RSS, adjusted R2, Cp, and BIC. We can examine these to try to select the best overall model.

Above shows the R-square values of all the 11 variables.

Below we are plot the BIC for all of the models to set the overall best model and indicates model with smallest statistics.

par(mfrow=c(2,2))
which.min(reg.summary$bic)
## [1] 8
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(8, reg.summary$bic[8], col = "red", cex = 2, pch = 20)

plot(regfit.full,scale ="bic")

From above, we can see that the model with the smallest BIC close to -5100 is the eight-variable model. Coef can be used to see which coefficient estimates associated with this model.

coef(regfit.full,8)
##  (Intercept)         tau1         tau2         tau3         tau4 
## -0.003073225  0.040415194  0.041291656  0.041038582  0.042267964 
##           g1           g2           g3           g4 
##  0.344351235  0.369522837  0.396671063  0.365174195

The best over model will contain only tau1, tau2, tau3, tau4, g1, g2g3 and g4 variables.

We will test this by fitting a logistic regression model with only these eight variables and checking the performance of this model.

glm.fits3=glm(stabf~tau1+tau2+tau3+tau4+g1+g2+g3+g4, family = binomial, data = train)
glm.probs3=predict(glm.fits3,test,type = "response")
glm.pred3=rep("stable",2500)
glm.pred3[glm.probs3>.5]="unstable"
table(glm.pred3,test$stabf)
##           
## glm.pred3  stable unstable
##   stable      644      185
##   unstable    274     1397
mean(glm.pred3==test$stabf)
## [1] 0.8164

The performance of our model using eight variables only is better than using all the model. We observe a 82% accuracy.

Stepwise Selection

Forward Stepwise Selection

The regsubsets() function will also be used for forward stepwise selection, by adding the parameter method=“forward”.

regfit.full = regsubsets(stabf~.,data,nvmax = 11, method = "forward")
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(stabf ~ ., data, nvmax = 11, method = "forward")
## 11 Variables  (and intercept)
##      Forced in Forced out
## tau1     FALSE      FALSE
## tau2     FALSE      FALSE
## tau3     FALSE      FALSE
## tau4     FALSE      FALSE
## p2       FALSE      FALSE
## p3       FALSE      FALSE
## p4       FALSE      FALSE
## g1       FALSE      FALSE
## g2       FALSE      FALSE
## g3       FALSE      FALSE
## g4       FALSE      FALSE
## 1 subsets of each size up to 11
## Selection Algorithm: forward
##           tau1 tau2 tau3 tau4 p2  p3  p4  g1  g2  g3  g4 
## 1  ( 1 )  " "  "*"  " "  " "  " " " " " " " " " " " " " "
## 2  ( 1 )  " "  "*"  " "  "*"  " " " " " " " " " " " " " "
## 3  ( 1 )  "*"  "*"  " "  "*"  " " " " " " " " " " " " " "
## 4  ( 1 )  "*"  "*"  "*"  "*"  " " " " " " " " " " " " " "
## 5  ( 1 )  "*"  "*"  "*"  "*"  " " " " " " " " " " "*" " "
## 6  ( 1 )  "*"  "*"  "*"  "*"  " " " " " " " " "*" "*" " "
## 7  ( 1 )  "*"  "*"  "*"  "*"  " " " " " " " " "*" "*" "*"
## 8  ( 1 )  "*"  "*"  "*"  "*"  " " " " " " "*" "*" "*" "*"
## 9  ( 1 )  "*"  "*"  "*"  "*"  " " " " "*" "*" "*" "*" "*"
## 10  ( 1 ) "*"  "*"  "*"  "*"  " " "*" "*" "*" "*" "*" "*"
## 11  ( 1 ) "*"  "*"  "*"  "*"  "*" "*" "*" "*" "*" "*" "*"

Following the above forward stepwise selection, the best one variable model contains only tau2, and the best two-variable contains taus and tau4 and so on. Due to the nature of our data, the result of forward stepwise is similar to what we got with best stepwise.

Backward Stepwise Selection

The regsubsets() function can also be used for backward stepwise selection, by adding the parameter method=“backward”.

regfit.full = regsubsets(stabf~.,data,nvmax = 11, method = "backward")
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(stabf ~ ., data, nvmax = 11, method = "backward")
## 11 Variables  (and intercept)
##      Forced in Forced out
## tau1     FALSE      FALSE
## tau2     FALSE      FALSE
## tau3     FALSE      FALSE
## tau4     FALSE      FALSE
## p2       FALSE      FALSE
## p3       FALSE      FALSE
## p4       FALSE      FALSE
## g1       FALSE      FALSE
## g2       FALSE      FALSE
## g3       FALSE      FALSE
## g4       FALSE      FALSE
## 1 subsets of each size up to 11
## Selection Algorithm: backward
##           tau1 tau2 tau3 tau4 p2  p3  p4  g1  g2  g3  g4 
## 1  ( 1 )  " "  "*"  " "  " "  " " " " " " " " " " " " " "
## 2  ( 1 )  " "  "*"  " "  "*"  " " " " " " " " " " " " " "
## 3  ( 1 )  "*"  "*"  " "  "*"  " " " " " " " " " " " " " "
## 4  ( 1 )  "*"  "*"  "*"  "*"  " " " " " " " " " " " " " "
## 5  ( 1 )  "*"  "*"  "*"  "*"  " " " " " " " " " " "*" " "
## 6  ( 1 )  "*"  "*"  "*"  "*"  " " " " " " " " "*" "*" " "
## 7  ( 1 )  "*"  "*"  "*"  "*"  " " " " " " " " "*" "*" "*"
## 8  ( 1 )  "*"  "*"  "*"  "*"  " " " " " " "*" "*" "*" "*"
## 9  ( 1 )  "*"  "*"  "*"  "*"  " " " " "*" "*" "*" "*" "*"
## 10  ( 1 ) "*"  "*"  "*"  "*"  " " "*" "*" "*" "*" "*" "*"
## 11  ( 1 ) "*"  "*"  "*"  "*"  "*" "*" "*" "*" "*" "*" "*"

The result we ordained is similar to forward stepwise and backward stepwise.

Shrinkage Methods

For shrinkage methods we carry out two approaches. Namely, regression model and lasso models. We use glmnet R package to perform both ridge regression model and lasso models.

Ridge regression

We use the glmnet() function in R with parameter alpha=0 to fit a ridge regression model. We make use of other R functions to prepare the data for ridge regression. We also convert the outcome class to numerical variable.

library(caret)
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-16
library(magrittr) # only needed the first time you use it
library(dplyr)    # alternative installation of the %>%
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
set.seed(123)
training.samples <- data$stabf %>% 
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- data[training.samples, ]
test.data <- data[-training.samples, ]


# Find the best lambda using cross-validation
set.seed(123) 
# Dumy code categorical predictor variables
x = model.matrix(stabf~., train.data)[,-1]
# Convert the outcome (class) to a numerical variable
y = ifelse(train.data$stabf == "stable", 1, 0)
cv.ridge <- cv.glmnet(x, y, alpha = 0, family = "binomial")

Next, we fit a model using our ridge regression value and then observe the performance accuracy.

# Fit the final model on the training data
model <- glmnet(x, y, alpha = 1, family = "binomial",
                lambda = cv.ridge$lambda.min)
# Display regression coefficients
coef(model)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                     s0
## (Intercept)  8.4020556
## tau1        -0.2365165
## tau2        -0.2380509
## tau3        -0.2330998
## tau4        -0.2347727
## p2           .        
## p3           .        
## p4           .        
## g1          -1.8894645
## g2          -2.0506054
## g3          -2.2428720
## g4          -2.0522462
# Make predictions on the test data
x.test <- model.matrix(stabf ~., test.data)[,-1]
probabilities <- model %>% predict(newx = x.test)
predicted.classes <- ifelse(probabilities > 0.5, "stable", "unstable")
# Model accuracy
observed.classes <- test.data$stabf
mean(predicted.classes == observed.classes)
## [1] 0.815

From the above procedure using lasson method, we observed an accurancy of 81.5%. We make a plot of the cv.ridge variable.

plot(cv.ridge)

The Lasso

Lasso works similar to ridge regression with the difference that lasso has penalty of the sum of the absolute values of the coefficients. The penalty for ridge regression is the sum of the squares of the cofficients. We will use glmnet also for lasso but with alpha parameter set to 1.

# Split the data into training and test set
#install.packages("magrittr") # only needed the first time you use it
#install.packages("dplyr")    # alternative installation of the %>%
library(magrittr) # need to run every time you start R and want to use %>%
library(dplyr)    # alternative, this also loads %>%
library(caret)
set.seed(123)
training.samples <- data$stabf %>% 
  createDataPartition(p = 0.8, list = FALSE)
train.data  <- data[training.samples, ]
test.data <- data[-training.samples, ]


# Find the best lambda using cross-validation
set.seed(123) 
# Dumy code categorical predictor variables
x <- model.matrix(stabf~., train.data)[,-1]
# Convert the outcome to a numerical variable
y <- ifelse(train.data$stabf == "stable", 1, 0)


cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
# Fit the final model on the training data
model <- glmnet(x, y, alpha = 1, family = "binomial",
                lambda = cv.lasso$lambda.min)
# Display regression coefficients
coef(model)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                      s0
## (Intercept) 11.39079462
## tau1        -0.31482599
## tau2        -0.31660179
## tau3        -0.31178392
## tau4        -0.31517472
## p2          -0.03925336
## p3          -0.08143423
## p4           0.07529571
## g1          -2.59707958
## g2          -2.79362124
## g3          -3.02222301
## g4          -2.79597452
# Make predictions on the test data
x.test <- model.matrix(stabf ~., test.data)[,-1]
probabilities <- model %>% predict(newx = x.test)
predicted.classes <- ifelse(probabilities > 0.5, "stable", "unstable")
# Model accuracy
observed.classes <- test.data$stabf
mean(predicted.classes == observed.classes)
## [1] 0.8195

From above, using lasso gave an accuracy of 81.95% which is better than ridge regression with accuracy of 81.5%.

set.seed(123)
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
plot(cv.lasso)

The above plot displays the cross-validation error according to the log of lambda. The vertical dash line indictaes that the log of the optimal value of lambda is approximately -5, which minimizes the pprediction error. This lamdba value will gove the most accuract model.

Dimension Reduction Methods

There are two dimension reduction methods we carry out in this session. These are methods are principal components regression and partial least squares method.

Principal Components Regression

Principal components regression (PCT) can be performed using the prcomp() function in R.

#library (pls)
#set.seed (2)
#pcr.fit=pcr(tau1~., data=data ,scale=TRUE,validation ="CV")
#summary(pcr.fit)
#validationplot(pcr.fit ,val.type="MSEP")
data.pr <- prcomp(data[c(1:11)], center = TRUE, scale = TRUE)
summary(data.pr)
## Importance of components:
##                            PC1     PC2     PC3     PC4     PC5     PC6
## Standard deviation     1.02901 1.02042 1.01906 1.00859 1.00481 1.00041
## Proportion of Variance 0.09626 0.09466 0.09441 0.09248 0.09179 0.09098
## Cumulative Proportion  0.09626 0.19092 0.28533 0.37780 0.46959 0.56057
##                            PC7    PC8     PC9    PC10    PC11
## Standard deviation     0.99198 0.9878 0.98584 0.97672 0.97369
## Proportion of Variance 0.08946 0.0887 0.08835 0.08673 0.08619
## Cumulative Proportion  0.65003 0.7387 0.82709 0.91381 1.00000

The prcomp() function runs PCA on our data’s predictor variables. We excluded the response variable. We tell the R function to center and scale our data and then standardize the data. We then displayed the summary.

The summary call displays the standard deviation (eigenvalues), proportion of variance and the cumulative proportion.

Below we make a plot of the eigenvalues pf the first 12 PCs and cumulative variance.

screeplot(data.pr, type = "l", npcs = 12, main = "Screeplot of the first 12 PCs")
abline(h = 1, col="red", lty=5)
legend("topright", legend=c("Eigenvalue = 1"),
       col=c("red"), lty=5, cex=0.6)

cumpro <- cumsum(data.pr$sdev^2 / sum(data.pr$sdev^2))
plot(cumpro[0:12], xlab = "PC #", ylab = "Amount of explained variance", main = "Cumulative variance plot")
abline(v = 6, col="blue", lty=5)
abline(h = 0.88759, col="blue", lty=5)
legend("topleft", legend=c("Cut-off @ PC6"),
       col=c("blue"), lty=5, cex=0.6)

From the procedure above, the first 6 components has an Eigenvalue > 1 and explains almost 90% of variance. We can reduced the dimensionality from 11 to 6 while losing only 10% variance.

Below we will plot the first two components.

plot(data.pr$x[,1],data.pr$x[,2], xlab="PC1 (44.3%)", ylab = "PC2 (19%)", main = "PC1 / PC2 - plot")

Let us include the response variable in the plot and see how it looks.

library("factoextra")
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
fviz_pca_ind(data.pr, geom.ind = "point", pointshape = 21, 
             pointsize = 2, 
             fill.ind = data$stabf, 
             col.ind = "black", 
             palette = "jco", 
             addEllipses = TRUE,
             label = "var",
             col.var = "black",
             repel = TRUE,
             legend.title = "Stabf") +
  ggtitle("2D PCA-plot from 11 feature dataset") +
  theme(plot.title = element_text(hjust = 0.5))

In above procedure, we made same plot as before but added some colors corresponding to the stability of the electric grid. This added beauty to the PCA. We used the first two components and we can see some separation between stable and unstable variable.

Partial Least Squares

To carry out PLS we use of caret package. We will start by preprocessing by removing the zero variance predictors, center and scale all the remaining predictors using the preProc argument. We train the PLS model and then check the cross-validation profile.

set.seed (1)
#pls.fit=plsr(Salary, data=Hitters ,subset =train ,scale=TRUE,validation ="CV")
#ummary (pls.fit )
dataPLS = data
dataPLS$class = factor(dataPLS$stabf)
# Compile cross-validation settings
set.seed(100)
myfolds <- createMultiFolds(dataPLS$class, k = 5, times = 10)
control <- trainControl("repeatedcv", index = myfolds, selectionFunction = "oneSE")
 
# Train PLS model
mod1 <- train(class ~ ., data = dataPLS,
 method = "pls",
 metric = "Accuracy",
 tuneLength = 20,
 trControl = control,
 preProc = c("zv","center","scale"))
 
# Check CV profile
plot(mod1)

The above plot shows the CV profile, where we learn the accurancy obtained from the trained model with different latent variables(Lv). The PLS uses a discriminate analysis model for training.

We use the varImp() function with parameter 10 to display the 10 most important variable from the PLS.

# Compile models and compare performance
plot(varImp(mod1), 10, main = "PLS-DA")
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings

The above procedure displayed the 10 most important variables given in relative levels (scaled to the range 0 - 100). The response variable has a relative level of 100.

We move to the next session where we discuss moving beyong linearity.

Moving Beyond Linearity

In this session, we focus on generalized addictive models only. There are other models that can also be used here but we will limit our scop to GAM.

Generalized Additive Models

Generalized additive models (GAMs) presents framwork to extending a standard linear model by allowing non-linear functions of each variables while maintaining additivity.

We use the gam() function in the gam package in R. We are going to fit a logistic regression Model using GAMs for probabilities of the Binary response values. One important feature of gam is its ability to allow non-linearity on multiple variables at the sametime. GAM can be used with other non-linear functions.

We run gam function on our data set. The We fit the model using smoothing splines.

library(gam)
## Loading required package: splines
## Loaded gam 1.16
#logistic Regression Model
logitgam1<-gam(I(stabf) ~ s(tau1,df=4) + s(tau2,df=4) + s(tau3,df=4) + s(tau4,df=4) + s(p2,df=4) +s(p3,df=4)+s(p4,df=4) + + s(g1,df=4) + s(g1,df=4) + s(g2,df=4) + s(g3,df=4) + s(g4,df=4) ,data=data,family=binomial)
plot(logitgam1,se=T)

From the plots above, can see that fitting non-linearity on most of the variables is sufficient, but Non-linearity isn’t sufficint for variables p2,p3 and p4.

Tree-Based methods

Tree-based methods involve segmenting the predictor space into a number of simple regions. Bsed on the nature of our data set, we will be using tree-based methods for classifiction.

Fitting Classification Trees

The tree library is used to construct classification trees in R. We will use these methods on our electricity data set.

library(tree)
tree.data=tree(stabf~.,data)
summary(tree.data)
## 
## Classification tree:
## tree(formula = stabf ~ ., data = data)
## Variables actually used in tree construction:
## [1] "tau1" "g3"   "tau2" "g2"   "tau3" "tau4" "g1"  
## Number of terminal nodes:  14 
## Residual mean deviance:  0.9339 = 9326 / 9986 
## Misclassification error rate: 0.2254 = 2254 / 10000

The summary() function shows that only six of the eleven variables are used in the tree construction. The variables are “tau1” “g3” “tau2” “g2” “tau3” “tau4” “g1”. The number of nodes is 14, residual mean deviance is 0.94 and the misclassification error rate is 22.5%.

One interesting feature of tree methods is the fact that we can display the tree graphically. We will go ahead to display the tree graphically below.

plot(tree.data)
text(tree.data,pretty = 0)

We used plot() function to display the tree structure, and the text() function to disply the node labels. The argument pretty=0 is to include the category names of the qualitative predictors rather than displaying a letter for each category. Terminal nodes are stable and unstable.

We print the output corresponding to each branch of the tree below.

tree.data
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 10000 13090.0 unstable ( 0.36200 0.63800 )  
##    2) tau1 < 2.86664 2491  3370.0 stable ( 0.59093 0.40907 )  
##      4) g3 < 0.606507 1470  1780.0 stable ( 0.70612 0.29388 )  
##        8) tau2 < 2.57638 344   147.0 stable ( 0.94477 0.05523 ) *
##        9) tau2 > 2.57638 1126  1480.0 stable ( 0.63321 0.36679 )  
##         18) g2 < 0.556625 607   600.8 stable ( 0.80395 0.19605 ) *
##         19) g2 > 0.556625 519   710.3 unstable ( 0.43353 0.56647 ) *
##      5) g3 > 0.606507 1021  1392.0 unstable ( 0.42507 0.57493 )  
##       10) tau3 < 2.92726 265   293.0 stable ( 0.75849 0.24151 ) *
##       11) tau3 > 2.92726 756   933.9 unstable ( 0.30820 0.69180 ) *
##    3) tau1 > 2.86664 7509  8990.0 unstable ( 0.28606 0.71394 )  
##      6) tau4 < 2.91029 1911  2649.0 stable ( 0.50759 0.49241 )  
##       12) g1 < 0.600142 1109  1443.0 stable ( 0.64472 0.35528 ) *
##       13) g1 > 0.600142 802  1003.0 unstable ( 0.31796 0.68204 ) *
##      7) tau4 > 2.91029 5598  5761.0 unstable ( 0.21043 0.78957 )  
##       14) tau3 < 2.96374 1469  2002.0 unstable ( 0.42342 0.57658 )  
##         28) tau2 < 3.18128 427   507.2 stable ( 0.71897 0.28103 ) *
##         29) tau2 > 3.18128 1042  1277.0 unstable ( 0.30230 0.69770 )  
##           58) g2 < 0.438906 440   609.9 stable ( 0.50682 0.49318 ) *
##           59) g2 > 0.438906 602   514.8 unstable ( 0.15282 0.84718 ) *
##       15) tau3 > 2.96374 4129  3263.0 unstable ( 0.13466 0.86534 )  
##         30) tau2 < 3.59958 1318  1559.0 unstable ( 0.27845 0.72155 )  
##           60) g3 < 0.357665 420   578.4 stable ( 0.54762 0.45238 ) *
##           61) g3 > 0.357665 898   767.1 unstable ( 0.15256 0.84744 ) *
##         31) tau2 > 3.59958 2811  1385.0 unstable ( 0.06724 0.93276 )  
##           62) g2 < 0.492115 1246   978.0 unstable ( 0.13323 0.86677 ) *
##           63) g2 > 0.492115 1565   239.8 unstable ( 0.01470 0.98530 ) *

The above code shows the number of observations in the branch, the deviance, the overall prediction for the branch (stable or unstable), and the fraction of observations in that branch that takes one values of yes and no. The brances that lead to terminal nodes are indicated using asterisks.

To evaluate the performance of the classification tree on the data, we train the tree on the train data and estimate the test error using the test data.

set.seed(2)
train=sample(1:nrow(data),7500)
data.test = data[-train,]
stabf.test = data$stabf[-train]
tree.data=tree(stabf~.,data,subset = train)
tree.pred=predict(tree.data,data.test,type = "class")
table(tree.pred,stabf.test)
##           stabf.test
## tree.pred  stable unstable
##   stable      663      271
##   unstable    269     1297
(663+1297)/2500
## [1] 0.784

We split the observations into trainging set and test set. Build the tree using the training set, and evaluate the performance on the test data. The predict() fucntion is used for this purpose. This procedure gave 78.4% correct predictions of the locations of the test data set.

Next, we carry out pruning of the tree to see if there can be improvement on the tree. We use FUN=prune.misclass in order to indicate that we want the classification error rate to guide the cross-validation and pruning process, rather than the default for the cv.tree() function, which is deviance.

set.seed(3)
cv.data=cv.tree(tree.data,FUN=prune.misclass)
names(cv.data)
## [1] "size"   "dev"    "k"      "method"
cv.data
## $size
## [1] 16 14 12 10  8  6  4  2  1
## 
## $dev
## [1] 1738 1738 1752 1852 2002 2024 2210 2311 2688
## 
## $k
## [1]  -Inf   0.0  12.0  37.0  68.0  73.0  94.5 112.0 324.0
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

The above procedure outputs the number of terminal nodes of each tree considered(size), value of cost-complexity parameter used(k), corss-validation error(dev) and method which is misclass. Tree with 16 terminal nodes and tree with terminal nodes 14 results in the lowest cross-validation error rate, with 1738 cross-validation errors.

par(mfrow=c(1,2))
plot(cv.data$size,cv.data$dev, type = "b")
plot(cv.data$k,cv.data$dev, type = "b")

Above is the plot of the error rate as function of both size and k.

To prune the tree, We use the prune.misclass() function to obtain the 16-node tree.

prune.data=prune.misclass(tree.data,best = 16)
plot(prune.data)
text(prune.data,pretty=0)

To find out how well the pruned tree perform on the test data set, we apply the predict() function.

tree.pred2=predict(prune.data,data.test,type = "class")
table(tree.pred2,stabf.test)
##           stabf.test
## tree.pred2 stable unstable
##   stable      663      271
##   unstable    269     1297
(663+1297)/2500
## [1] 0.784

Using the pruned tree to carry out prediction on test data gave 78.4% accuracy. The pruning process is expected to produce both an interpretable tree and also improved the classification accuracy in some cases.

Bagging and Random Forests

We apply bagging and random forests to our data set, using the randomForest oackage in R. Bagging is a speacil case of a random forest with m=p. RandomForest() function in R can be used for both random forests and bagging.

library("randomForest")
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
set.seed(1)
bag.data=randomForest(stabf~.,data,subset=train,mtry=11,importance=TRUE)
bag.data
## 
## Call:
##  randomForest(formula = stabf ~ ., data = data, mtry = 11, importance = TRUE,      subset = train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 11
## 
##         OOB estimate of  error rate: 9.03%
## Confusion matrix:
##          stable unstable class.error
## stable     2292      396  0.14732143
## unstable    281     4531  0.05839568

The argument mtry=11 to indicates that all 11 predictors should be considered for each split of the tee. Below, We check how well the bagged model perform on the test set.

table(predict(bag.data,data.test,type = "class"), stabf.test)
##           stabf.test
##            stable unstable
##   stable      806       88
##   unstable    126     1480
(1 - ((809+1480)/2500)) * 100
## [1] 8.44
(809+1480)/2500
## [1] 0.9156

The misclassification rate is 8.44%, in others words the accuracy is 91.6%. This performed better than using an optimally-pruned single tree.

To grow randomForest works in similar way as bagging, except that a smaller value is used for the mtry argument. By default, random() uses SqrtRoot of P when building a random forest of classification trees. We use mtry = 3.

set.seed(1)
rf.data=bag.data=randomForest(stabf~.,data,subset=train,mtry=3,importance=TRUE)
tree.pred=predict(rf.data,data.test,type = "class")
table(tree.pred,stabf.test)
##           stabf.test
## tree.pred  stable unstable
##   stable      805       59
##   unstable    127     1509
(805+1509)/2500
## [1] 0.9256

The accuracy is 92.56%; this shows that random forests yield an improvement over bagging in this case.

Below, we use the importance() function, to see the importance of each variable.

importance(rf.data)
##           stable    unstable MeanDecreaseAccuracy MeanDecreaseGini
## tau1 114.7872593 106.9407224         141.18392675         449.2558
## tau2 112.3658441 102.2635128         137.70979617         439.6647
## tau3 104.9953909 102.5314881         131.26815929         419.8940
## tau4 110.7367200 105.8246487         135.22536651         441.6628
## p2    -1.1223243   0.9351217          -0.08600038         108.7003
## p3    -0.9493097  -0.9904667          -1.32423234         108.2267
## p4    -2.8513651  -0.2818706          -2.19585744         107.3779
## g1    83.3364941  75.7485651         104.35650824         338.2807
## g2    79.8714919  76.9761063         101.14355238         344.8114
## g3    84.9086847  75.9031409         100.00287992         354.2238
## g4    82.0388687  73.4268525         100.91822593         338.1250

From the above procedure, two measure of variable of importance are mean deacreased accuracy and mean decreased gini.Plots of these measures will be produced below.

varImpPlot(rf.data)

Boosting

The gbm() function within glm package in R will be used to fit boosted classification trees to our data set.

library(gbm)
## Loaded gbm 2.1.5
set.seed(1)
data = read.csv("electricity19.csv")
data$stabf = as.numeric(data$stabf)
data = transform(data,stabf=stabf-1)
train=sample(1:nrow(data),7500)
data.test = data[-train,]
stabf.test = data$stabf[-train]

boost.data=gbm(stabf~.,data[train,],distribution = "bernoulli",n.trees = 5000, interaction.depth = 4)

The above procedure was used to fit boosted classification tree to the electrucity data set. We ran gmb() with option distribution=“bernoulli” for a binary classification problem. The argument n.trees=500 indictates we wanted 5000 trees and the option interaction.depth=4 limits the depth of each tree.

summary(boost.data)

##       var    rel.inf
## tau2 tau2 14.2353048
## tau4 tau4 14.2032276
## tau1 tau1 13.8088269
## tau3 tau3 13.6710992
## g3     g3 10.6593969
## g2     g2 10.6094066
## g4     g4 10.2077342
## g1     g1 10.1954886
## p4     p4  0.8329761
## p3     p3  0.8010815
## p2     p2  0.7754576

From above procedure, tau2 and tau4 are the most important variables. tau1, tau3,g3,g4 and g1 are also imporatnt variables.

par(mfrow=c(1,2))
plot(boost.data,i='tau2')

plot(boost.data,i='tau4')

In the above procedure, we produced partial dependence plots for the two most important variables. These ilustrate the marginal effect of the selected variables on the response after integrating out the other variable.

Next, We use the boosted model to predict stabf on the test set.

yhat.boost=predict(boost.data,data.test,type = "response",n.trees = 5000)
mean(yhat.boost==stabf.test)
## [1] 4e-04

Prediction with boost model on our data gave 0.0004 test error rate, making the accuracy to be 0.9996. This is almost perfect and performs better than random forest and bagging.

In the above procedure we used the default lambda which is 0.1. I will run the procedure again to make use of lambda as 0.2.

boost.data=gbm(stabf~.,data[train,],distribution = "bernoulli",n.trees = 5000, interaction.depth = 4, shrinkage = 0.2, verbose = F)
yhat.boost=predict(boost.data,data.test,type = "response",n.trees = 5000)
mean(yhat.boost==stabf.test)
## [1] 0.1248

With lambda as 0.2, the output gave 0.125 test error rate, making the performance to be

Support Vector Machine

Support vector machine main objective is to find a hyperplane in an N-dimensional space that distinctly classifies the data points. There are many possible hyperplanes that could be chosen to separte the two classes. In this project, our objective is to find a plane that has the maximum margin.

We use the svm() function in e1071 package in R. We use the training data to fit the model, and the testing data to test the model.

library(e1071)
library(magrittr)
set.seed(1)
smp_siz = floor(0.75*nrow(data)) 
train_ind = sample(seq_len(nrow(data)),size = smp_siz)  
train =data[train_ind,] 
test=data[-train_ind,]

model_svm <- svm(stabf ~., data=train, cost = 1000, gamma = 0.01)
model_svm
## 
## Call:
## svm(formula = stabf ~ ., data = train, cost = 1000, gamma = 0.01)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1000 
##       gamma:  0.01 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  6089

From the above procedure, the cost parameter is used to penalise the model for misclassification. We set the cost to be 100. We fitted the model using svm. Below, we run predict() function pn our fitted model using the test data, and then look at its accurracy.

test_svm <- predict(model_svm, newdata = test %>% na.omit())
yo <- test %>% na.omit()
#table(test_svm, yo$stabf) run this code
(876+1552)/2500
## [1] 0.9712

The result above gave a high accuracy of 97.12% using train data and test data. This accuracy shows that the cost penalty of 100 is suitable for our model.

Conclusion

The scope of this project is to apply machine learning concepts taught in class on our data set. We made use of open dataset. We were able to apply several classification machine learning methods on the electricity data set to achieve the goal of the project.

The goal of the project is to model the data, so as to accurately predict the stability of electric grid. Also, infer the relationship between predictors and the response. We achieved these goals by using sevearl classification methods discussed in lass.

The following table consolidates the results of some of the methods used in this project.

##   # Results of some methods used in the project
## | SI.No | Methods                           | Performance Accuracy  |
## |-------|:---------------------------------:|----------------------:|
## | 1     | Logistic regression               | 0.8176                |
## | 2     | Linear discriminate analysis      | 0.8176                |
## | 3     | Quadratic discriminate analysis   | 0.888                 |
## | 4     | K-Nearest Neighbors (k=5)         | 0.8632                |
## | 5     | Validation set approach           | 0.82                  | 
## | 7     | K-Fold cross-validation           | 0.7459                |
## | 8     | Boostrap                          | 0.816                 |
## | 9     | Classification trees              | 0.784                 |
## | 10    | Trees + Bagging                   | 0.916                 |
## | 11    | Trees + Random forest             | 0.926                 |
## | 12    | Trees + Boosting                  | 0.9996                |
## | 13    | SVM                               | 0.9712                |

Further work will be to apply more classification methods on our dataset and also to expand the scope of the methods we used.

Acknowledgment

I would like to express my sincere gratitute to my University Professor Dr. Gunes Koru (https://drkoru.us/) for the continuous support and excellent teaching techniques. His guidance made this project possible.

I would also like to thank my friend Nelson Ayaji for his assistance through the project. His advice made a difference.

References

  1. https://rstudio-pubs-static.s3.amazonaws.com/90467_c70206f3dc864d53bf36072207ee011d.html
  2. https://www.r-bloggers.com/predicting-creditability-using-logistic-regression-in-r-cross-validating-the-classifier-part-2-2/
  3. “An introduction to Statistical Learning with Applications in R” by Greth James et al