This document reports an analysis of the electrical grid local-stability dataset hosted by the UCI Machine Learning Repository. The data set contains simulated readings from a four-node electrical system, and we use these readings to determine whether the system is stable.
**Project's Goal**: The goal is to model the data so as to accurately predict the stability of the electric grid, and to infer the relationship between the predictors and the response. The URL of the dataset is https://archive.ics.uci.edu/ml/datasets/Electrical+Grid+Stability+Simulated+Data+. The scope of this project is to apply concepts learned in class to our chosen data set. We use the R language throughout the project.
Number of observations (n) = 10000
Number of predictors (p) = 11
What constitutes an observation: Each observation represents one simulated record of the system across its nodes. There are 11 predictor variables and 1 outcome variable for each observation, and the predictor variables are independent of each other.
Response variable: The response variable (also known as the output variable) is stabf. It tells us whether the system is stable or not.
Predictor variables: tau1, tau2, tau3 and tau4 are the electricity produced at node 1, node 2, node 3 and node 4 respectively; these clearly contribute to determining whether the system is stable. p2, p3 and p4 are the power consumed at node 2, node 3 and node 4 respectively. The energy consumed at node 1 (p1) is not included in the dataset, so we do not consider it; p2, p3 and p4 still have an impact on the response. g1, g2, g3 and g4 are the price values of electricity at nodes 1 through 4 for each observation, and these also factor into whether the system is stable. The predictors are independent of each other, so each can be treated on its own.
We read the data from a CSV file and attach it to the project. The R function attach() is used to attach the data for ease of access.
data = read.csv("electricity19.csv")
attach(data)
The procedures below display the data's features and observations for better understanding.
names(data) #displays feature names
## [1] "tau1" "tau2" "tau3" "tau4" "p2" "p3" "p4" "g1"
## [9] "g2" "g3" "g4" "stabf"
head(data) #displays the first 6 observations
## tau1 tau2 tau3 tau4 p2 p3 p4
## 1 2.9590600 3.079885 8.381025 9.780754 -0.7826036 -1.2573948 -1.7230863
## 2 9.3040972 4.902524 3.047541 1.369357 -1.9400584 -1.8727417 -1.2550120
## 3 8.9717069 8.848428 3.046479 1.214518 -1.2074556 -1.2772101 -0.9204924
## 4 0.7164148 7.669600 4.486641 2.340563 -1.0274733 -1.9389442 -0.9973736
## 5 3.1341116 7.608772 4.943759 9.857573 -1.1255310 -1.8459749 -0.5543050
## 6 6.9992087 9.109247 3.784066 4.267788 -1.8571392 -0.6703968 -1.9021328
## g1 g2 g3 g4 stabf
## 1 0.6504565 0.85957811 0.8874449 0.9580340 unstable
## 2 0.4134406 0.86241408 0.5621391 0.7817599 stable
## 3 0.1630410 0.76668866 0.8394440 0.1098532 unstable
## 4 0.4462089 0.97674408 0.9293805 0.3627178 unstable
## 5 0.7971095 0.45544995 0.6569467 0.8209235 unstable
## 6 0.2617929 0.07792967 0.5428839 0.4699310 stable
dim(data)
## [1] 10000 12
The dim() function shows the data has 10000 observations and 12 columns.
The summary() function in R produces a numerical summary of each variable in the data set. We display the summary statistics below.
summary(data)
## tau1 tau2 tau3 tau4
## Min. :0.5008 Min. : 0.5001 Min. :0.5008 Min. :0.5005
## 1st Qu.:2.8749 1st Qu.: 2.8751 1st Qu.:2.8755 1st Qu.:2.8750
## Median :5.2500 Median : 5.2500 Median :5.2500 Median :5.2497
## Mean :5.2500 Mean : 5.2500 Mean :5.2500 Mean :5.2500
## 3rd Qu.:7.6247 3rd Qu.: 7.6249 3rd Qu.:7.6249 3rd Qu.:7.6248
## Max. :9.9995 Max. : 9.9998 Max. :9.9994 Max. :9.9994
## p2 p3 p4 g1
## Min. :-1.9999 Min. :-1.9999 Min. :-1.9999 Min. :0.05001
## 1st Qu.:-1.6249 1st Qu.:-1.6250 1st Qu.:-1.6250 1st Qu.:0.28752
## Median :-1.2500 Median :-1.2500 Median :-1.2500 Median :0.52501
## Mean :-1.2500 Mean :-1.2500 Mean :-1.2500 Mean :0.52500
## 3rd Qu.:-0.8750 3rd Qu.:-0.8750 3rd Qu.:-0.8751 3rd Qu.:0.76243
## Max. :-0.5001 Max. :-0.5001 Max. :-0.5000 Max. :0.99994
## g2 g3 g4 stabf
## Min. :0.05005 Min. :0.05005 Min. :0.05003 stable :3620
## 1st Qu.:0.28755 1st Qu.:0.28751 1st Qu.:0.28749 unstable:6380
## Median :0.52500 Median :0.52501 Median :0.52500
## Mean :0.52500 Mean :0.52500 Mean :0.52500
## 3rd Qu.:0.76249 3rd Qu.:0.76244 3rd Qu.:0.76243
## Max. :0.99994 Max. :0.99998 Max. :0.99993
Next, we examine the correlations among the predictors, which we compute with the cor() function in R.
cor(data[,-12])
## tau1 tau2 tau3 tau4 p2
## tau1 1.000000000 0.015585721 -0.005969607 -0.0172652888 -0.015485308
## tau2 0.015585721 1.000000000 0.014273414 -0.0019647158 0.006573472
## tau3 -0.005969607 0.014273414 1.000000000 0.0043543363 -0.003134261
## tau4 -0.017265289 -0.001964716 0.004354336 1.0000000000 0.010552963
## p2 -0.015485308 0.006573472 -0.003134261 0.0105529626 1.000000000
## p3 -0.015923767 0.007672768 -0.008780459 0.0061687215 0.002388424
## p4 -0.015807066 -0.005962636 -0.017531089 -0.0112110679 -0.006843659
## g1 0.010521128 -0.001742367 -0.011605048 -0.0041489273 0.015603017
## g2 0.015349831 0.015383308 0.007671386 0.0084310883 -0.018032104
## g3 -0.001278658 0.016508464 0.014701706 0.0032598572 0.007555052
## g4 0.005493873 -0.011764008 -0.011497448 -0.0004908815 0.019817004
## p3 p4 g1 g2 g3
## tau1 -0.015923767 -0.015807066 0.010521128 0.015349831 -0.001278658
## tau2 0.007672768 -0.005962636 -0.001742367 0.015383308 0.016508464
## tau3 -0.008780459 -0.017531089 -0.011605048 0.007671386 0.014701706
## tau4 0.006168722 -0.011211068 -0.004148927 0.008431088 0.003259857
## p2 0.002388424 -0.006843659 0.015603017 -0.018032104 0.007555052
## p3 1.000000000 0.012952865 -0.003218925 -0.011574893 -0.005897288
## p4 0.012952865 1.000000000 -0.013635590 0.002849963 -0.003515084
## g1 -0.003218925 -0.013635590 1.000000000 0.007559018 -0.005836219
## g2 -0.011574893 0.002849963 0.007559018 1.000000000 -0.012808995
## g3 -0.005897288 -0.003515084 -0.005836219 -0.012808995 1.000000000
## g4 -0.010485250 0.017505439 0.012430613 -0.014909362 0.006900185
## g4
## tau1 0.0054938729
## tau2 -0.0117640085
## tau3 -0.0114974481
## tau4 -0.0004908815
## p2 0.0198170040
## p3 -0.0104852500
## p4 0.0175054386
## g1 0.0124306133
## g2 -0.0149093622
## g3 0.0069001847
## g4 1.0000000000
Across all variables, there is no substantial correlation between the predictors. We also use graphical plots to show the correlation structure of the data set.
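As a quick numerical check (a minimal sketch using the correlation matrix computed above), we can extract the largest absolute pairwise correlation:
# Largest absolute off-diagonal correlation among the 11 predictors
corr_mat <- cor(data[, -12])
max(abs(corr_mat[upper.tri(corr_mat)]))  # about 0.02, so the predictors are essentially uncorrelated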
Graphical Summaries
First, we use the pairs() function to create a scatterplot matrix of all the variables, giving a scatterplot for every pair of variables in our data set.
pairs(data,col=data$stabf)
In the pairs plot above, red indicates observations with an unstable response and black indicates observations with a stable response.
We then use the ggpairs() function from the GGally package in R to examine the correlations among the predictors as well.
library(GGally)
## Loading required package: ggplot2
ggpairs(data = data[1:11],
title = "Electricity data Correlation Plot",
upper = list(continuous = wrap("cor", size = 2)),
lower = list(continuous = "smooth")
)
Below, we display density and frequency information for the data using histograms.
library(ggplot2) # Data visualization
library(gridExtra)
library(grid)
# Density & Frequency analysis with the Histogram,
# Tau1
tau1l <- ggplot(data=data, aes(x=tau1))+
geom_histogram(binwidth=0.2, color="black", aes(fill=stabf)) +
xlab("Tau1") +
ylab("Frequency") +
theme(legend.position="none")+
ggtitle("Histogram of Sepal Length")+
geom_vline(data=data, aes(xintercept = mean(tau1)),linetype="dashed",color="grey")
# Tau2
tau2l <- ggplot(data=data, aes(x=tau2))+
geom_histogram(binwidth=0.2, color="black", aes(fill=stabf)) +
xlab("Tau2") +
ylab("Frequency") +
theme(legend.position="none")+
ggtitle("Histogram of Sepal Length")+
geom_vline(data=data, aes(xintercept = mean(tau2)),linetype="dashed",color="grey")
# Tau3
tau3l <- ggplot(data=data, aes(x=tau3))+
geom_histogram(binwidth=0.2, color="black", aes(fill=stabf)) +
xlab("Tau3") +
ylab("Frequency") +
theme(legend.position="none")+
ggtitle("Histogram of Sepal Length")+
geom_vline(data=data, aes(xintercept = mean(tau3)),linetype="dashed",color="grey")
# Tau4
tau4l <- ggplot(data=data, aes(x=tau4))+
geom_histogram(binwidth=0.2, color="black", aes(fill=stabf)) +
xlab("Tau4") +
ylab("Frequency") +
theme(legend.position="none")+
ggtitle("Histogram of Sepal Length")+
geom_vline(data=data, aes(xintercept = mean(tau4)),linetype="dashed",color="grey")
#P2
p2l <- ggplot(data=data, aes(x=p2))+
geom_histogram(binwidth=0.2, color="black", aes(fill=stabf)) +
xlab("p2") +
ylab("Frequency") +
theme(legend.position="none")+
ggtitle("Histogram of Sepal Length")+
geom_vline(data=data, aes(xintercept = mean(p2)),linetype="dashed",color="grey")
#P3
p3l <- ggplot(data=data, aes(x=p3))+
geom_histogram(binwidth=0.2, color="black", aes(fill=stabf)) +
xlab("p3") +
ylab("Frequency") +
theme(legend.position="none")+
ggtitle("Histogram of Sepal Length")+
geom_vline(data=data, aes(xintercept = mean(p3)),linetype="dashed",color="grey")
#P4
p4l <- ggplot(data=data, aes(x=p4))+
geom_histogram(binwidth=0.2, color="black", aes(fill=stabf)) +
xlab("p4") +
ylab("Frequency") +
theme(legend.position="none")+
ggtitle("Histogram of Sepal Length")+
geom_vline(data=data, aes(xintercept = mean(p4)),linetype="dashed",color="grey")
#g1
g1l <- ggplot(data=data, aes(x=g1))+
geom_histogram(binwidth=0.2, color="black", aes(fill=stabf)) +
xlab("g1") +
ylab("Frequency") +
theme(legend.position="none")+
ggtitle("Histogram of Sepal Length")+
geom_vline(data=data, aes(xintercept = mean(g1)),linetype="dashed",color="grey")
#g2
g2l <- ggplot(data=data, aes(x=g2))+
geom_histogram(binwidth=0.2, color="black", aes(fill=stabf)) +
xlab("g2") +
ylab("Frequency") +
theme(legend.position="none")+
ggtitle("Histogram of Sepal Length")+
geom_vline(data=data, aes(xintercept = mean(g2)),linetype="dashed",color="grey")
#g3
g3l <- ggplot(data=data, aes(x=g3))+
geom_histogram(binwidth=0.2, color="black", aes(fill=stabf)) +
xlab("g3") +
ylab("Frequency") +
theme(legend.position="none")+
ggtitle("Histogram of Sepal Length")+
geom_vline(data=data, aes(xintercept = mean(g3)),linetype="dashed",color="grey")
#g4
g4l <- ggplot(data=data, aes(x=g4))+
geom_histogram(binwidth=0.2, color="black", aes(fill=stabf)) +
xlab("g4") +
ylab("Frequency") +
theme(legend.position="right")+
ggtitle("Histogram of Sepal Length")+
geom_vline(data=data, aes(xintercept = mean(g4)),linetype="dashed",color="grey")
# Plot all visualizations
grid.arrange(tau1l + ggtitle(""),
tau2l + ggtitle(""),
tau3l + ggtitle(""),
tau4l + ggtitle(""),
p2l + ggtitle(""),
p3l + ggtitle(""),
p4l + ggtitle(""),
g1l + ggtitle(""),
g2l + ggtitle(""),
g3l + ggtitle(""),
g4l + ggtitle(""),
nrow = 3,
top = textGrob("Electric Data Frequency Histogram",
gp=gpar(fontsize=15))
)
From the plots above, we can see the distribution of each attribute broken down by class value. Like the scatterplot matrix, plots broken down by class can show the separation between the classes and help us understand how the class values overlap for each attribute.
# Tau1
tau1l <- ggplot(data=data, aes(x=tau1,colour=stabf), fill=stabf)+
geom_density(alpha=.3) +
geom_vline(aes(xintercept = mean(tau1), colour=stabf),linetype="dashed",color="grey",size=1)+
xlab("Tau1") +
ylab("Density") +
theme(legend.position="none")
# Tau2
tau2l <- ggplot(data=data, aes(x=tau2,colour=stabf), fill=stabf)+
geom_density(alpha=.3) +
geom_vline(aes(xintercept = mean(tau2), colour=stabf),linetype="dashed",color="grey",size=1)+
xlab("Tau2") +
ylab("Density") +
theme(legend.position="none")
# Tau3
tau3l <- ggplot(data=data, aes(x=tau3,colour=stabf), fill=stabf)+
geom_density(alpha=.3) +
geom_vline(aes(xintercept = mean(tau3), colour=stabf),linetype="dashed",color="grey",size=1)+
xlab("Tau3") +
ylab("Density") +
theme(legend.position="none")
# Tau4
tau4l <- ggplot(data=data, aes(x=tau4,colour=stabf), fill=stabf)+
geom_density(alpha=.3) +
geom_vline(aes(xintercept = mean(tau4), colour=stabf),linetype="dashed",color="grey",size=1)+
xlab("Tau4") +
ylab("Density") +
theme(legend.position="none")
#P2
p2l <- ggplot(data=data, aes(x=p2,colour=stabf), fill=stabf)+
geom_density(alpha=.3) +
geom_vline(aes(xintercept = mean(p2), colour=stabf),linetype="dashed",color="grey",size=1)+
xlab("p2") +
ylab("Density") +
theme(legend.position="none")
#P3
p3l <- ggplot(data=data, aes(x=p3,colour=stabf), fill=stabf)+
geom_density(alpha=.3) +
geom_vline(aes(xintercept = mean(p3), colour=stabf),linetype="dashed",color="grey",size=1)+
xlab("p3") +
ylab("Density") +
theme(legend.position="none")
#P4
p4l <- ggplot(data=data, aes(x=p4,colour=stabf), fill=stabf)+
geom_density(alpha=.3) +
geom_vline(aes(xintercept = mean(p4), colour=stabf),linetype="dashed",color="grey",size=1)+
xlab("p4") +
ylab("Density") +
theme(legend.position="none")
#g1
g1l <- ggplot(data=data, aes(x=g1,colour=stabf), fill=stabf)+
geom_density(alpha=.3) +
geom_vline(aes(xintercept = mean(g1), colour=stabf),linetype="dashed",color="grey",size=1)+
xlab("g1") +
ylab("Density") +
theme(legend.position="none")
#g2
g2l <- ggplot(data=data, aes(x=g2,colour=stabf), fill=stabf)+
geom_density(alpha=.3) +
geom_vline(aes(xintercept = mean(g2), colour=stabf),linetype="dashed",color="grey",size=1)+
xlab("g2") +
ylab("Density") +
theme(legend.position="none")
#g3
g3l <- ggplot(data=data, aes(x=g3,colour=stabf), fill=stabf)+
geom_density(alpha=.3) +
geom_vline(aes(xintercept = mean(g3), colour=stabf),linetype="dashed",color="grey",size=1)+
xlab("g3") +
ylab("Density") +
theme(legend.position="none")
#g4
g4l <- ggplot(data=data, aes(x=g4,colour=stabf), fill=stabf)+
geom_density(alpha=.3) +
geom_vline(aes(xintercept = mean(g4), colour=stabf),linetype="dashed",color="grey",size=1)+
xlab("g4") +
ylab("Density") +
theme(legend.position="right")
# Plot all density visualizations
grid.arrange(tau1l + ggtitle(""),
tau2l + ggtitle(""),
tau3l + ggtitle(""),
tau4l + ggtitle(""),
p2l + ggtitle(""),
p3l + ggtitle(""),
p4l + ggtitle(""),
g1l + ggtitle(""),
g2l + ggtitle(""),
g3l + ggtitle(""),
g4l + ggtitle(""),
nrow = 3,
top = textGrob("Electricity data Density Plot",
gp=gpar(fontsize=15))
)
The plots above show the density of each variable in our data set, by class.
Next, we plot boxplots of all the variables in a single visualization.
tau1SL <- ggplot(data=data, aes(stabf,tau1,fill=stabf)) +
geom_boxplot()+
scale_y_continuous("Tau1", breaks= seq(0,300, by=.5))+
theme(legend.position="none")
tau2SL <- ggplot(data=data, aes(stabf,tau2,fill=stabf)) +
geom_boxplot()+
scale_y_continuous("Tau2", breaks= seq(0,30, by=.5))+
theme(legend.position="none")
tau3SL <- ggplot(data=data, aes(stabf,tau3,fill=stabf)) +
geom_boxplot()+
scale_y_continuous("Tau3", breaks= seq(0,30, by=.5))+
theme(legend.position="none")
tau4SL <- ggplot(data=data, aes(stabf,tau4,fill=stabf)) +
geom_boxplot()+
scale_y_continuous("Tau4", breaks= seq(0,30, by=.5))+
theme(legend.position="none")
p2SL <- ggplot(data=data, aes(stabf,p2,fill=stabf)) +
geom_boxplot()+
scale_y_continuous("P2", breaks= seq(0,30, by=.5))+
theme(legend.position="none")
p3SL <- ggplot(data=data, aes(stabf,p3,fill=stabf)) +
geom_boxplot()+
scale_y_continuous("P3", breaks= seq(0,30, by=.5))+
theme(legend.position="none")
p4SL <- ggplot(data=data, aes(stabf,p4,fill=stabf)) +
geom_boxplot()+
scale_y_continuous("P4", breaks= seq(0,30, by=.5))+
theme(legend.position="none")
g1SL <- ggplot(data=data, aes(stabf,g1,fill=stabf)) +
geom_boxplot()+
scale_y_continuous("G1", breaks= seq(0,30, by=.5))+
theme(legend.position="none")
g2SL <- ggplot(data=data, aes(stabf,g2,fill=stabf)) +
geom_boxplot()+
scale_y_continuous("G2", breaks= seq(0,30, by=.5))+
theme(legend.position="none")
g3SL <- ggplot(data=data, aes(stabf,g3,fill=stabf)) +
geom_boxplot()+
scale_y_continuous("G3", breaks= seq(0,30, by=.5))+
theme(legend.position="none")
g4SL <- ggplot(data=data, aes(stabf,g4,fill=stabf)) +
geom_boxplot()+
scale_y_continuous("G4", breaks= seq(0,30, by=.5))+
theme(legend.position="right")
# Plot all boxplot visualizations
grid.arrange(tau1SL + ggtitle(""),
tau2SL + ggtitle(""),
tau3SL + ggtitle(""),
tau4SL + ggtitle(""),
p2SL + ggtitle(""),
p3SL + ggtitle(""),
p4SL + ggtitle(""),
g1SL + ggtitle(""),
g2SL + ggtitle(""),
g3SL + ggtitle(""),
g4SL + ggtitle(""),
nrow = 3,
top = textGrob("Electricity data Density Plot",
gp=gpar(fontsize=11))
)
The boxplots above do not indicate any notable outliers in our data set.
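As a numerical complement to the boxplots, a small sketch using the usual 1.5 × IQR rule counts, for each predictor, how many values fall outside the whiskers:
# Number of values beyond 1.5*IQR from the quartiles, per predictor
sapply(data[, 1:11], function(x) {
  q <- quantile(x, c(.25, .75))
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
})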
In this section, we explored the data set using several numerical and graphical R functions. In the next section, we apply classification machine-learning methods to it.
From the introduction, we saw that the response variable of our data set is qualitative. In this section, we will use several classification methods to predict this qualitative response.
There are many methods for classification, but in this project we focus on four: logistic regression, linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and K-nearest neighbors (KNN).
We fit a logistic regression model to predict the stability of the grid from the predictor variables, using the generalized linear model function glm() in R. We begin by using the whole data set for both training and testing of the model.
glm.fits=glm(stabf~tau1+tau2+tau3+tau4+p2+p3+p4+g1+g2+g3+g4, family = binomial, data = data)
summary(glm.fits) #display summary of the fitted model
##
## Call:
## glm(formula = stabf ~ tau1 + tau2 + tau3 + tau4 + p2 + p3 + p4 +
## g1 + g2 + g3 + g4, family = binomial, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0376 -0.5482 0.2220 0.6123 2.4797
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.77892 0.28377 -41.509 <2e-16 ***
## tau1 0.31451 0.01129 27.847 <2e-16 ***
## tau2 0.32251 0.01139 28.316 <2e-16 ***
## tau3 0.31860 0.01139 27.961 <2e-16 ***
## tau4 0.33185 0.01147 28.935 <2e-16 ***
## p2 0.02298 0.06541 0.351 0.725
## p3 0.08092 0.06530 1.239 0.215
## p4 -0.09951 0.06486 -1.534 0.125
## g1 2.68809 0.11044 24.339 <2e-16 ***
## g2 2.91745 0.11225 25.991 <2e-16 ***
## g3 3.12533 0.11357 27.518 <2e-16 ***
## g4 2.85721 0.11183 25.550 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 13091.2 on 9999 degrees of freedom
## Residual deviance: 7813.6 on 9988 degrees of freedom
## AIC: 7837.6
##
## Number of Fisher Scoring iterations: 6
From the procedure above, we fitted the model using glm() and used the summary() function in R to display a summary of the fitted model. The predictors with the smallest p-values are tau1, tau2, tau3, tau4, g1, g2, g3 and g4, which indicates that these predictors have a significant association with the response variable stabf.
It is also worth noting that p4 has a negative coefficient, suggesting a negative association with the response, although its p-value (0.125) indicates the effect is not statistically significant.
Below we use the coef() function to display the coefficients of the fitted model.
coef(glm.fits)
## (Intercept) tau1 tau2 tau3 tau4
## -11.77891924 0.31450829 0.32250716 0.31859685 0.33184614
## p2 p3 p4 g1 g2
## 0.02297856 0.08092072 -0.09951266 2.68808524 2.91744610
## g3 g4
## 3.12533235 2.85720976
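Because the model is fit on the logit scale, these coefficients are often easier to read as odds ratios; a minimal sketch using the fitted model above:
# Exponentiating the coefficients gives multiplicative effects on the odds of instability
round(exp(coef(glm.fits)), 3)
# For example, a one-unit increase in tau4 multiplies the odds of an unstable grid by about exp(0.332) ~ 1.39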
Also, we use the summary() function to display the p-values for the coefficients.
summary(glm.fits)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.77891924 0.28376700 -41.509123 0.000000e+00
## tau1 0.31450829 0.01129408 27.847191 1.164890e-170
## tau2 0.32250716 0.01138952 28.316140 2.187206e-176
## tau3 0.31859685 0.01139431 27.961036 4.840376e-172
## tau4 0.33184614 0.01146848 28.935491 4.272833e-184
## p2 0.02297856 0.06540543 0.351325 7.253446e-01
## p3 0.08092072 0.06529729 1.239266 2.152470e-01
## p4 -0.09951266 0.06486296 -1.534199 1.249808e-01
## g1 2.68808524 0.11044424 24.338845 7.609620e-131
## g2 2.91744610 0.11224898 25.990847 6.284651e-149
## g3 3.12533235 0.11357489 27.517810 1.074926e-166
## g4 2.85720976 0.11182960 25.549673 5.536863e-144
Next, we use the predict() function to obtain predicted probabilities from the model, given the values of the predictors.
glm.probs=predict(glm.fits,type = "response")
glm.probs[1:10]
## 1 2 3 4 5 6
## 0.99721229 0.84559325 0.69910272 0.73784670 0.98603349 0.51121102
## 7 8 9 10
## 0.64495319 0.82123837 0.01896633 0.94785159
The vector glm.probs above contains, for each observation, the predicted probability that the grid is unstable (the class coded as 1, as shown next).
contrasts(data$stabf)
## unstable
## stable 0
## unstable 1
The contrasts() function above shows that R has created a dummy variable with 1 for unstable and 0 for stable.
glm.pred=rep("stable",10000)
glm.pred[glm.probs>.5]="unstable"
The two commands above create a vector of class predictions based on whether the predicted probability of grid instability is greater than or less than 0.5.
We use the table() function in R to create a confusion matrix and determine how many observations are correctly or incorrectly classified.
table(glm.pred,data$stabf)
##
## glm.pred stable unstable
## stable 2556 771
## unstable 1064 5609
The table() function above displays the confusion matrix. The diagonal elements of the confusion matrix show the correct predictions, while the off-diagonal elements show incorrect predictions: 2556 + 5609 observations are correctly predicted, while 1064 + 771 are incorrectly predicted.
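From the same confusion matrix we can also compute class-specific rates; a small sketch, treating "unstable" as the positive class:
conf_mat <- table(glm.pred, data$stabf)
sensitivity <- conf_mat["unstable", "unstable"] / sum(conf_mat[, "unstable"])  # 5609 / 6380, about 0.88
specificity <- conf_mat["stable", "stable"] / sum(conf_mat[, "stable"])        # 2556 / 3620, about 0.71
c(sensitivity = sensitivity, specificity = specificity)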
We use the mean() function to calculate the fraction of correct predictions.
mean(glm.pred==data$stabf)
## [1] 0.8165
The accuracy of our logistic regression model, using the whole data set for both training and testing, is 81.65%.
So far we have used the same set of 10000 observations both to train and to test the model.
The training error rate is therefore 100 − 81.65 = 18.35%.
Using the whole data set both to train and to test the model can be misleading: the training error rate tends to underestimate the test error rate. To address this, we fit the model on part of the data and then examine how well it predicts on the held-out data. This yields a more realistic error rate.
Next, we randomly divide the data set into a training set and a test set.
set.seed(1)
smp_siz = floor(0.75*nrow(data))
train_ind = sample(seq_len(nrow(data)),size = smp_siz)
train =data[train_ind,]
test=data[-train_ind,]
We split the data 75:25 into training and test sets, respectively.
Next, we go through the same fitting and prediction process as before, but this time we fit on the training set and evaluate on the test set.
glm.fits2=glm(stabf~tau1+tau2+tau3+tau4+p2+p3+p4+g1+g2+g3+g4, family = binomial, data = train)
summary(glm.fits2)
##
## Call:
## glm(formula = stabf ~ tau1 + tau2 + tau3 + tau4 + p2 + p3 + p4 +
## g1 + g2 + g3 + g4, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0190 -0.5589 0.2223 0.6095 2.3612
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.50498 0.32357 -35.556 <2e-16 ***
## tau1 0.30918 0.01300 23.786 <2e-16 ***
## tau2 0.32721 0.01317 24.848 <2e-16 ***
## tau3 0.31319 0.01307 23.955 <2e-16 ***
## tau4 0.33721 0.01327 25.420 <2e-16 ***
## p2 0.09135 0.07576 1.206 0.2279
## p3 0.15451 0.07548 2.047 0.0407 *
## p4 -0.08095 0.07515 -1.077 0.2814
## g1 2.68413 0.12692 21.148 <2e-16 ***
## g2 2.84170 0.12882 22.059 <2e-16 ***
## g3 3.07818 0.13022 23.639 <2e-16 ***
## g4 2.84946 0.12926 22.045 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9803.6 on 7499 degrees of freedom
## Residual deviance: 5871.0 on 7488 degrees of freedom
## AIC: 5895
##
## Number of Fisher Scoring iterations: 6
Using the training data did not materially change which predictors are significant: the predictors with the smallest p-values are still tau1, tau2, tau3, tau4, g1, g2, g3 and g8 and g4 (p3 is now marginally significant at the 5% level). Again, p4 has a negative coefficient, which suggests a negative association with the response.
We now use the predict() function with the new model to obtain predicted probabilities on the test data, given the values of the predictors.
glm.probs2=predict(glm.fits2,test,type = "response")
glm.probs2[1:10]
## 1 4 10 25 34 43 44
## 0.9971698 0.7262091 0.9427452 0.9839988 0.9949180 0.9418826 0.3909568
## 55 56 57
## 0.8424696 0.9662207 0.9423227
contrasts(train$stabf)
## unstable
## stable 0
## unstable 1
dim(test)
## [1] 2500 12
The dim() function displays the dimensions of the test data: it has 2500 observations.
glm.pred2=rep("stable",2500)
glm.pred2[glm.probs2>.5]="unstable"
table(glm.pred2,test$stabf)
##
## glm.pred2 stable unstable
## stable 651 189
## unstable 267 1393
glm_correctness = mean(glm.pred2==test$stabf)
glm_correctness
## [1] 0.8176
As before, we use the mean() function to calculate the fraction of correct predictions. We notice a slight improvement in accuracy, to 81.76%, when using separate training and test data. Although the improvement is small, having separate data sets for training and testing gives a more trustworthy estimate of performance.
Next, we perform linear discriminant analysis on our data set, reusing the training and test sets from above. We fit an LDA model with the lda() function on the training data.
require(MASS)
## Loading required package: MASS
lda.fit=lda(stabf~tau1+tau2+tau3+tau4+p2+p3+p4+g1+g2+g3+g4,data=train)
lda.fit
## Call:
## lda(stabf ~ tau1 + tau2 + tau3 + tau4 + p2 + p3 + p4 + g1 + g2 +
## g3 + g4, data = train)
##
## Prior probabilities of groups:
## stable unstable
## 0.3602667 0.6397333
##
## Group means:
## tau1 tau2 tau3 tau4 p2 p3 p4
## stable 4.412792 4.359115 4.397580 4.359946 -1.252509 -1.254498 -1.242513
## unstable 5.719908 5.783466 5.757162 5.749345 -1.242431 -1.247655 -1.257868
## g1 g2 g3 g4
## stable 0.4543813 0.4472584 0.4422838 0.4509523
## unstable 0.5651724 0.5683109 0.5726177 0.5692794
##
## Coefficients of linear discriminants:
## LD1
## tau1 0.16997592
## tau2 0.17729304
## tau3 0.17184388
## tau4 0.18233386
## p2 0.04676196
## p3 0.07447041
## p4 -0.03928883
## g1 1.45615514
## g2 1.54255667
## g3 1.66818543
## g4 1.55383910
The LDA output indicates \(\hat{\pi}_1\) = 0.36 and \(\hat{\pi}_2\) = 0.64; that is, 36% of the training observations correspond to a stable grid and 64% to an unstable grid.
We also see the group means, which are the averages of each predictor within each class and serve as the LDA estimates of the class means.
The coefficients of linear discriminants provide the linear combination of the predictors that forms the LDA decision rule.
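As a small sketch of what this rule means in practice, the LD1 score of each training observation is the value of this linear combination (returned by predict() in the x component), and its class-wise means show the separation the rule exploits:
# Average discriminant score in each class of the training data
ld_scores <- predict(lda.fit)$x[, 1]
tapply(ld_scores, train$stabf, mean)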
We use plot() function to produce a plot of the linear discriminants.
plot(lda.fit)
Next, we use the predict() function on our fitted lda model with the test data.
lda.pred=predict(lda.fit, test)
names(lda.pred)
## [1] "class" "posterior" "x"
From the result above, the predict() function returns a list with three elements: class, posterior, and x, which contains the linear discriminants.
Next, we pass the predicted classes to the table() function to produce a confusion matrix and determine how many observations are correctly or incorrectly classified.
lda.class=lda.pred$class
table(lda.class,test$stabf)
##
## lda.class stable unstable
## stable 656 194
## unstable 262 1388
lda_correctness=mean(lda.class==test$stabf)
lda_correctness
## [1] 0.8176
Using LDA on our data, we obtain an accuracy of 81.76%, essentially the same as logistic regression.
sum(lda.pred$posterior[,1]>=.5)
## [1] 850
sum(lda.pred$posterior[,1]<.5)
## [1] 1650
With the code above we recreate the predictions contained in lda.pred$class by applying a 50% threshold to the posterior probabilities.
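If a different trade-off between the two types of error were preferred, the same posterior probabilities could be thresholded at another cut-off; for example (a sketch using an arbitrary 90% threshold):
# Number of test observations whose posterior probability of "stable" is at least 0.9
sum(lda.pred$posterior[, "stable"] >= .9)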
Next, we carry out quadratic discriminant analysis on our data set, using the qda() function in R, which is also part of the MASS library.
qda.fit=qda(stabf~tau1+tau2+tau3+tau4+p2+p3+p4+g1+g2+g3+g4,data=train)
qda.fit
## Call:
## qda(stabf ~ tau1 + tau2 + tau3 + tau4 + p2 + p3 + p4 + g1 + g2 +
## g3 + g4, data = train)
##
## Prior probabilities of groups:
## stable unstable
## 0.3602667 0.6397333
##
## Group means:
## tau1 tau2 tau3 tau4 p2 p3 p4
## stable 4.412792 4.359115 4.397580 4.359946 -1.252509 -1.254498 -1.242513
## unstable 5.719908 5.783466 5.757162 5.749345 -1.242431 -1.247655 -1.257868
## g1 g2 g3 g4
## stable 0.4543813 0.4472584 0.4422838 0.4509523
## unstable 0.5651724 0.5683109 0.5726177 0.5692794
The output contains the group means but no coefficients of linear discriminants, because the QDA classifier involves a quadratic rather than a linear function of the predictors.
Next, we use the predict() function, which works in the same fashion as for LDA.
qda.class=predict(qda.fit,test)$class
table(qda.class,test$stabf)
##
## qda.class stable unstable
## stable 745 107
## unstable 173 1475
qda_correctness=mean(qda.class==test$stabf)
qda_correctness
## [1] 0.888
The accuracy of QDA is 88.8%, higher than both LDA and logistic regression. This suggests that QDA captures the true relationship more accurately than the linear methods.
Next, we carry out K-nearest neighbours (KNN) classification on our data set using the knn() function from the class package in R. knn() works differently from the other model-fitting functions: it forms predictions in a single command rather than the usual two-step fit-then-predict approach.
As with the other models, we divide the data into training and test sets, and we apply min-max normalization to rescale every predictor to values between 0 and 1.
library(class)
set.seed(1)
##normalization function is created
nor <-function(x) { (x -min(x))/(max(x)-min(x)) }
##Run normalization on the first 11 columns of the dataset, which are the predictors
data_norm <- as.data.frame(lapply(data[,c(1,2,3,4,5,6,7,8,9,10,11)], nor))
train2 <- sample(1:nrow(data), 0.75 * nrow(data))
##extract training set
data_train <- data_norm[train2,]
##extract testing set
data_test <- data_norm[-train2,]
##extract the 12th column (stabf) of the training rows; it is used as the 'cl' argument in the knn function
data_target_category <- data[train2,12]
##extract the 12th column (stabf) of the test rows to measure the accuracy
data_test_category <- data[-train2,12]
Above, we applied min-max normalization to the predictors so that all values lie between 0 and 1, and split the data into training and test sets.
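As a quick sanity check (a minimal sketch), every normalized predictor should now lie between 0 and 1:
# Overall range of the normalized predictors; it should be 0 to 1
range(sapply(data_norm, range))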
Next, we run the knn() function, which takes the normalized training data, the normalized test data, the response variable for the training rows, and the value of K.
set.seed(1)
##run knn function
knn.pred <- knn(data_train,data_test,cl=data_target_category,k=1)
##create confusion matrix
table(knn.pred,data_test_category)
## data_test_category
## knn.pred stable unstable
## stable 639 174
## unstable 279 1408
knn_1 = mean(knn.pred==data_test_category)
knn_1
## [1] 0.8188
With k = 1 we obtain a prediction accuracy of 81.88%.
Increasing the value of k may yield better prediction accuracy, so next we set k = 5 and examine the performance.
set.seed(1)
##run knn function
knn.pred <- knn(data_train,data_test,cl=data_target_category,k=5)
##create confusion matrix
table(knn.pred,data_test_category)
## data_test_category
## knn.pred stable unstable
## stable 672 96
## unstable 246 1486
knn_5=mean(knn.pred==data_test_category)
knn_5
## [1] 0.8632
Setting k = 5 increased the accuracy of KNN on our data set: k = 5 gives 86.32% accuracy, compared with 81.88% for k = 1.
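Rather than trying values of k one at a time, a short sketch of a loop over several candidate values (the particular k values here are an arbitrary choice for illustration) makes the comparison systematic:
set.seed(1)
k_values <- c(1, 3, 5, 7, 9, 15)
knn_acc <- sapply(k_values, function(k) {
  pred <- knn(data_train, data_test, cl = data_target_category, k = k)
  mean(pred == data_test_category)  # test-set accuracy for this k
})
data.frame(k = k_values, accuracy = knn_acc)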
To conclude this section, we compare the performance of all the models used so far. The table below shows the comparison.
data.frame(Logistic = glm_correctness, LDA = lda_correctness, QDA = qda_correctness, Knn1 = knn_1, Knn5 = knn_5)
## Logistic LDA QDA Knn1 Knn5
## 1 0.8176 0.8176 0.888 0.8188 0.8632
We plot the Receiver Operating Characteristic (ROC) curve for our default predicting logistic regression model below.
library(ROCR)
## Warning: package 'ROCR' was built under R version 3.5.3
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.5.3
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
# score the same training data set on which the model was fit
prob = predict.glm(glm.fits, type='response', newdata = data)
pred = prediction(prob, data$stabf)
# AUC
auc = performance(pred,"auc")@y.values[[1]][1]
# plot the ROC curve
perf <- performance(pred,"tpr","fpr")
plot(perf, col="navyblue", cex.main=1,
main= paste("Logistic Regression ROC Curve: AUC =", round(auc,3)))
abline(a=0, b = 1, col='darkorange1')
The area under the ROC curve (AUC) is a widely used measure of accuracy for classification models. The AUC of 0.891 here is optimistically high, because the predictions were made on the same data set on which the model was estimated.
Next, we plot the ROC curve with our model built with training and test data.
library(ROCR)
# score the same training data set on which the model was fit
prob2 = predict.glm(glm.fits2, type='response', newdata = test)
pred2 = prediction(prob2, test$stabf)
# AUC
auc2 = performance(pred2,"auc")@y.values[[1]][1]
# plot the ROC curve
perf2 <- performance(pred2,"tpr","fpr")
plot(perf2, col="navyblue", cex.main=1,
main= paste("Logistic Regression ROC Curve: AUC =", round(auc2,3)))
abline(a=0, b = 1, col='darkorange1')
Using different data for estimating the model and testing the model, the AUC = 0.889. This is more realistic.
Resampling methods involve repeatedly drawing samples from a training set and refitting the model of interest on each sample in order to obtain additional information about the fitted model.
We carry out cross-validation and the bootstrap, using logistic regression for illustration in this section.
The following subsections describe the different cross-validation techniques.
We first carry out the validation-set approach on our data set, using set.seed() so the random split is reproducible. We randomly divide the observations into a training set of 5000 and a validation set of 5000.
set.seed(3)
smp_siz = floor(0.5*nrow(data))
train_ind = sample(seq_len(nrow(data)),size = smp_siz)
trainVali =data[train_ind,]
validationSet=data[-train_ind,]
train_model=glm(stabf~tau1+tau2+tau3+tau4+p2+p3+p4+g1+g2+g3+g4, family = binomial,data = trainVali)
Above, we randomly divided the data set into 5000 training observations and 5000 validation observations, and trained a logistic regression model on the training portion.
We then run the predict() function with the fitted model on the validation set and compute the validation-set error rate.
glm.probs=predict(train_model,validationSet,type = "response")
cont = contrasts(train$stabf)
glm.pred=rep("stable",5000)
glm.pred[glm.probs>.5]="unstable"
tlb = table(glm.pred,validationSet$stabf)
mean(glm.pred!=validationSet$stabf)
## [1] 0.1798
The validation-set error rate is 17.98%.
We then use the poly() function to estimate the validation error rate for quadratic and cubic versions of the classifier.
train_model2=glm(stabf~poly((tau1+tau2+tau3+tau4+p2+p3+p4+g1+g2+g3+g4),2), family = binomial,data = train)
glm.probs2=predict(train_model2,validationSet,type = "response")
cont2 = contrasts(train$stabf)
glm.pred2=rep("stable",5000)
glm.pred2[glm.probs2>.5]="unstable"
tlb2 = table(glm.pred2,validationSet$stabf)
mean(glm.pred2!=validationSet$stabf)
## [1] 0.2526
The validation-set error rate for the quadratic classifier is 25.26% on our data.
train_model3=glm(stabf~poly((tau1+tau2+tau3+tau4+p2+p3+p4+g1+g2+g3+g4),3), family = binomial,data = train)
glm.probs3=predict(train_model3,validationSet,type = "response")
cont3 = contrasts(train$stabf)
glm.pred3=rep("stable",5000)
glm.pred3[glm.probs3>.5]="unstable"
tlb3 = table(glm.pred3,validationSet$stabf)
mean(glm.pred3!=validationSet$stabf)
## [1] 0.254
The validation-set error rate for the cubic classifier is 25.4% on our data.
Using the validation-set approach, we therefore obtained error rates of 17.98%, 25.26% and 25.4% for the models with linear, quadratic and cubic terms respectively.
We use the cv.glm() function from the boot library in R to carry out leave-one-out cross-validation (LOOCV), running it for increasingly complex polynomial fits.
library(boot)
# Cost function for a binary classifier suggested by boot package
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
cv.error=rep(0,3)
for(i in 1:3){
model=glm(stabf~poly((tau1+tau2+tau3+tau4+p2+p3+p4+g1+g2+g3+g4),i), family = binomial, data = data)
cv.error[i]=cv.glm(data,model,cost=cost)$delta[1]
}
cv.error
## [1] 0.2532 0.2543 0.2539
# Histogram of the Test Error
hist(cv.error,xlab='Test Error',ylab='Freq',main='Test Error LOOCV',
col='cyan',border='blue',density=30)
In the procedure above, we used a loop to iteratively fit polynomial models of order i = 1 to i = 3 and computed the leave-one-out cross-validation error for each. The LOOCV error rates for the models with linear, quadratic and cubic terms are 0.2532, 0.2543 and 0.2539 respectively.
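The cost function defined above simply computes the misclassification rate: it compares each observed 0/1 response r with the predicted probability pi and counts a miss whenever they differ by more than 0.5. A tiny worked example with hypothetical values:
# Observed classes 1, 0, 1 and predicted probabilities 0.9, 0.2, 0.3:
# only the third case is misclassified, so the cost is 1/3
cost(r = c(1, 0, 1), pi = c(0.9, 0.2, 0.3))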
For k-fold cross-validation, we again use the cv.glm() function, setting K = 5 first and then K = 10.
For both procedures, we obtain CV errors corresponding to polynomial fits of orders 1 to 5.
set.seed(17)
cv.error.5=rep(0,5)
for(i in 1:5){
model=glm(stabf~poly((tau1+tau2+tau3+tau4+p2+p3+p4+g1+g2+g3+g4),i), family = binomial, data = data)
cv.error.5[i]=cv.glm(data,model,K=5,cost=cost)$delta[1]
}
cv.error.5
## [1] 0.2541 0.2534 0.2532 0.2531 0.2538
# Histogram of the Test Error
hist(cv.error.5,xlab='Test Error',ylab='Freq',main='Test Error 5-Fold CV',
col='cyan',border='blue',density=30)
The k-fold computation is much faster than LOOCV. The CV errors for polynomial orders 1 to 5 are displayed above and shown in the histogram. Using 5-fold cross-validation, the error rates for polynomial orders i = 1 to i = 5 are 0.2541, 0.2534, 0.2532, 0.2531 and 0.2538 respectively.
Below we use K = 10 and again obtain the CV errors corresponding to polynomial fits of orders 1 to 5.
set.seed(17)
cv.error.10=rep(0,5)
for(i in 1:5){
model=glm(stabf~poly((tau1+tau2+tau3+tau4+p2+p3+p4+g1+g2+g3+g4),i), family = binomial, data = data)
cv.error.10[i]=cv.glm(data,model,K=10,cost=cost)$delta[1]
}
cv.error.10
## [1] 0.2533 0.2538 0.2532 0.2537 0.2543
# Histogram of the Test Error
hist(cv.error.10,xlab='Test Error',ylab='Freq',main='Test Error 10-Fold CV',
col='cyan',border='blue',density=30)
With k = 10 folds, the error rates for polynomial orders i = 1 to i = 5 are 0.2533, 0.2538, 0.2532, 0.2537 and 0.2543 respectively.
The bootstrap approach is based on repeatedly sampling observations from the data set with replacement. We use the "boot" resampling method within caret's trainControl() function in R.
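The core idea can be sketched in a couple of lines: each bootstrap replicate is simply a sample of n row indices drawn with replacement, on which the statistic of interest is recomputed (a minimal illustration, separate from the caret machinery used below):
set.seed(1)
boot_idx <- sample(nrow(data), nrow(data), replace = TRUE)  # one bootstrap sample of row indices
length(unique(boot_idx)) / nrow(data)  # roughly 63% of the original rows appear in any one replicate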
# load the library
library(caret)
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:boot':
##
## melanoma
library(e1071)
# load the iris dataset
# define training control
train_control <- trainControl(method="boot", number=100)
# train the model
model <- train(stabf~., data=data, trControl=train_control, method="glm")
# summarize results
print(model)
## Generalized Linear Model
##
## 10000 samples
## 11 predictor
## 2 classes: 'stable', 'unstable'
##
## No pre-processing
## Resampling: Bootstrapped (100 reps)
## Summary of sample sizes: 10000, 10000, 10000, 10000, 10000, 10000, ...
## Resampling results:
##
## Accuracy Kappa
## 0.8161362 0.5953132
From the above, using bootstrap resampling with a logistic regression model yields an estimated accuracy of 0.816, i.e. an error rate of roughly 0.184.
set.seed(1)
boot.fn=function(data,index){
model=glm(stabf~., family = binomial, data = data,subset = index)
return(coef(model))
}
boot.fn(data,sample(392,392,replace = T))
## (Intercept) tau1 tau2 tau3 tau4
## -11.028389517 0.292196397 0.324862455 0.205533802 0.327012711
## p2 p3 p4 g1 g2
## -0.007557283 0.263152346 -0.334260753 1.739338121 2.898179446
## g3 g4
## 3.184296757 3.175513867
boot(data,boot.fn,1000)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = data, statistic = boot.fn, R = 1000)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* -11.77891924 -3.141154e-02 0.28356410
## t2* 0.31450829 4.606738e-04 0.01172524
## t3* 0.32250716 8.701986e-04 0.01216134
## t4* 0.31859685 7.105152e-04 0.01164373
## t5* 0.33184614 1.055868e-03 0.01174886
## t6* 0.02297856 1.876197e-03 0.06713454
## t7* 0.08092072 -3.563465e-03 0.06519611
## t8* -0.09951266 -6.855842e-05 0.06357194
## t9* 2.68808524 6.250132e-03 0.11589781
## t10* 2.91744610 2.544534e-03 0.11549954
## t11* 3.12533235 4.673121e-03 0.11373222
## t12* 2.85720976 1.688370e-02 0.11440168
We now apply the best-subset selection approach to our data set. Although other packages exist for this, we use the regsubsets() function from the leaps package (the bestglm package, which builds on leaps, provides a similar interface tailored to GLMs). We continue to use logistic regression for illustration.
library(leaps)
regfit.full = regsubsets(stabf~.,data)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(stabf ~ ., data)
## 11 Variables (and intercept)
## Forced in Forced out
## tau1 FALSE FALSE
## tau2 FALSE FALSE
## tau3 FALSE FALSE
## tau4 FALSE FALSE
## p2 FALSE FALSE
## p3 FALSE FALSE
## p4 FALSE FALSE
## g1 FALSE FALSE
## g2 FALSE FALSE
## g3 FALSE FALSE
## g4 FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## tau1 tau2 tau3 tau4 p2 p3 p4 g1 g2 g3 g4
## 1 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" "*" " " "*" " " " " " " " " " " " " " "
## 4 ( 1 ) "*" "*" "*" "*" " " " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" "*" "*" " " " " " " " " " " "*" " "
## 6 ( 1 ) "*" "*" "*" "*" " " " " " " " " "*" "*" " "
## 7 ( 1 ) "*" "*" "*" "*" " " " " " " " " "*" "*" "*"
## 8 ( 1 ) "*" "*" "*" "*" " " " " " " "*" "*" "*" "*"
An asterisk indicates that a given variable is included in the corresponding model. The output shows that the best one-variable model contains only tau2, the best two-variable model contains tau2 and tau4, the best three-variable model contains tau1, tau2 and tau4, and so on.
We run the procedure again, this time setting nvmax to 11; the nvmax parameter sets the maximum number of variables the function will consider.
library(leaps)
regfit.full = regsubsets(stabf~.,data,nvmax = 11)
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(stabf ~ ., data, nvmax = 11)
## 11 Variables (and intercept)
## Forced in Forced out
## tau1 FALSE FALSE
## tau2 FALSE FALSE
## tau3 FALSE FALSE
## tau4 FALSE FALSE
## p2 FALSE FALSE
## p3 FALSE FALSE
## p4 FALSE FALSE
## g1 FALSE FALSE
## g2 FALSE FALSE
## g3 FALSE FALSE
## g4 FALSE FALSE
## 1 subsets of each size up to 11
## Selection Algorithm: exhaustive
## tau1 tau2 tau3 tau4 p2 p3 p4 g1 g2 g3 g4
## 1 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" "*" " " "*" " " " " " " " " " " " " " "
## 4 ( 1 ) "*" "*" "*" "*" " " " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" "*" "*" " " " " " " " " " " "*" " "
## 6 ( 1 ) "*" "*" "*" "*" " " " " " " " " "*" "*" " "
## 7 ( 1 ) "*" "*" "*" "*" " " " " " " " " "*" "*" "*"
## 8 ( 1 ) "*" "*" "*" "*" " " " " " " "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" "*" "*" " " " " "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
Variable p2 is the last variable to be added, entering only in the 11-variable model. This agrees with our earlier finding that p2 has a large p-value and is therefore the least significant predictor.
names(summary(regfit.full))
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
reg.summary=summary(regfit.full)
The names() call above lists the statistics stored in the summary object: the summary() function returns R², RSS, adjusted R², Cp, and BIC, which we can examine to select the best overall model.
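Before focusing on BIC, the other criteria stored in reg.summary can be inspected in the same way; a short sketch:
# Model sizes favoured by adjusted R-squared (maximize) and Cp (minimize)
which.max(reg.summary$adjr2)
which.min(reg.summary$cp)
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R-squared", type = "l")
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")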
Below we plot the BIC for all of the models in order to identify the overall best model, marking the model with the smallest BIC.
par(mfrow=c(2,2))
which.min(reg.summary$bic)
## [1] 8
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(8, reg.summary$bic[8], col = "red", cex = 2, pch = 20)
plot(regfit.full,scale ="bic")
From the plot above, we can see that the model with the smallest BIC (close to −5100) is the eight-variable model. The coef() function can be used to see the coefficient estimates associated with this model.
coef(regfit.full,8)
## (Intercept) tau1 tau2 tau3 tau4
## -0.003073225 0.040415194 0.041291656 0.041038582 0.042267964
## g1 g2 g3 g4
## 0.344351235 0.369522837 0.396671063 0.365174195
The best overall model therefore contains only the variables tau1, tau2, tau3, tau4, g1, g2, g3 and g4.
We test this by fitting a logistic regression model with only these eight variables and checking its performance.
glm.fits3=glm(stabf~tau1+tau2+tau3+tau4+g1+g2+g3+g4, family = binomial, data = train)
glm.probs3=predict(glm.fits3,test,type = "response")
glm.pred3=rep("stable",2500)
glm.pred3[glm.probs3>.5]="unstable"
table(glm.pred3,test$stabf)
##
## glm.pred3 stable unstable
## stable 644 185
## unstable 274 1397
mean(glm.pred3==test$stabf)
## [1] 0.8164
The performance of the model using only these eight variables is 81.64% accuracy, essentially the same as the full model, even though three predictors were dropped.
The regsubsets() function can also be used for forward stepwise selection by adding the argument method="forward".
regfit.full = regsubsets(stabf~.,data,nvmax = 11, method = "forward")
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(stabf ~ ., data, nvmax = 11, method = "forward")
## 11 Variables (and intercept)
## Forced in Forced out
## tau1 FALSE FALSE
## tau2 FALSE FALSE
## tau3 FALSE FALSE
## tau4 FALSE FALSE
## p2 FALSE FALSE
## p3 FALSE FALSE
## p4 FALSE FALSE
## g1 FALSE FALSE
## g2 FALSE FALSE
## g3 FALSE FALSE
## g4 FALSE FALSE
## 1 subsets of each size up to 11
## Selection Algorithm: forward
## tau1 tau2 tau3 tau4 p2 p3 p4 g1 g2 g3 g4
## 1 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" "*" " " "*" " " " " " " " " " " " " " "
## 4 ( 1 ) "*" "*" "*" "*" " " " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" "*" "*" " " " " " " " " " " "*" " "
## 6 ( 1 ) "*" "*" "*" "*" " " " " " " " " "*" "*" " "
## 7 ( 1 ) "*" "*" "*" "*" " " " " " " " " "*" "*" "*"
## 8 ( 1 ) "*" "*" "*" "*" " " " " " " "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" "*" "*" " " " " "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
In the forward stepwise selection above, the best one-variable model contains only tau2, the best two-variable model contains tau2 and tau4, and so on. Given the nature of our data, the forward stepwise results match those of best-subset selection.
The regsubsets() function can also be used for backward stepwise selection by adding the argument method="backward".
regfit.full = regsubsets(stabf~.,data,nvmax = 11, method = "backward")
summary(regfit.full)
## Subset selection object
## Call: regsubsets.formula(stabf ~ ., data, nvmax = 11, method = "backward")
## 11 Variables (and intercept)
## Forced in Forced out
## tau1 FALSE FALSE
## tau2 FALSE FALSE
## tau3 FALSE FALSE
## tau4 FALSE FALSE
## p2 FALSE FALSE
## p3 FALSE FALSE
## p4 FALSE FALSE
## g1 FALSE FALSE
## g2 FALSE FALSE
## g3 FALSE FALSE
## g4 FALSE FALSE
## 1 subsets of each size up to 11
## Selection Algorithm: backward
## tau1 tau2 tau3 tau4 p2 p3 p4 g1 g2 g3 g4
## 1 ( 1 ) " " "*" " " " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" " " "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" "*" " " "*" " " " " " " " " " " " " " "
## 4 ( 1 ) "*" "*" "*" "*" " " " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" "*" "*" " " " " " " " " " " "*" " "
## 6 ( 1 ) "*" "*" "*" "*" " " " " " " " " "*" "*" " "
## 7 ( 1 ) "*" "*" "*" "*" " " " " " " " " "*" "*" "*"
## 8 ( 1 ) "*" "*" "*" "*" " " " " " " "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" "*" "*" " " " " "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
The result obtained with backward stepwise selection matches the forward stepwise and best-subset results.
For shrinkage methods we carry out two approaches, namely ridge regression and the lasso, both using the glmnet R package.
We use the glmnet() function in R with the parameter alpha = 0 to fit a ridge regression model. We use other R functions to prepare the data, including converting the outcome class to a numerical variable.
library(caret)
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-16
library(magrittr) # only needed the first time you use it
library(dplyr) # alternative installation of the %>%
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
set.seed(123)
training.samples <- data$stabf %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- data[training.samples, ]
test.data <- data[-training.samples, ]
# Find the best lambda using cross-validation
set.seed(123)
# Dumy code categorical predictor variables
x = model.matrix(stabf~., train.data)[,-1]
# Convert the outcome (class) to a numerical variable
y = ifelse(train.data$stabf == "stable", 1, 0)
cv.ridge <- cv.glmnet(x, y, alpha = 0, family = "binomial")
Next, we fit the final model using the lambda value selected by cross-validation and then examine its prediction accuracy.
# Fit the final model on the training data
model <- glmnet(x, y, alpha = 1, family = "binomial",
lambda = cv.ridge$lambda.min)
# Display regression coefficients
coef(model)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 8.4020556
## tau1 -0.2365165
## tau2 -0.2380509
## tau3 -0.2330998
## tau4 -0.2347727
## p2 .
## p3 .
## p4 .
## g1 -1.8894645
## g2 -2.0506054
## g3 -2.2428720
## g4 -2.0522462
# Make predictions on the test data
x.test <- model.matrix(stabf ~., test.data)[,-1]
probabilities <- model %>% predict(newx = x.test)
predicted.classes <- ifelse(probabilities > 0.5, "stable", "unstable")
# Model accuracy
observed.classes <- test.data$stabf
mean(predicted.classes == observed.classes)
## [1] 0.815
From the above procedure we observe an accuracy of 81.5%. Below we plot the cross-validation curve stored in cv.ridge.
plot(cv.ridge)
The lasso works similarly to ridge regression, except that its penalty is the sum of the absolute values of the coefficients, whereas the ridge penalty is the sum of their squares. We again use glmnet, this time with the alpha parameter set to 1.
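In symbols, for coefficients \(\beta_j\) and tuning parameter \(\lambda\), the two penalties added to the negative log-likelihood are
\[
\text{ridge: } \lambda \sum_{j=1}^{p} \beta_j^2, \qquad \text{lasso: } \lambda \sum_{j=1}^{p} |\beta_j|.
\]
The absolute-value penalty is what allows the lasso to shrink some coefficients exactly to zero.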
# Split the data into training and test set
#install.packages("magrittr") # only needed the first time you use it
#install.packages("dplyr") # alternative installation of the %>%
library(magrittr) # need to run every time you start R and want to use %>%
library(dplyr) # alternative, this also loads %>%
library(caret)
set.seed(123)
training.samples <- data$stabf %>%
createDataPartition(p = 0.8, list = FALSE)
train.data <- data[training.samples, ]
test.data <- data[-training.samples, ]
# Find the best lambda using cross-validation
set.seed(123)
# Dumy code categorical predictor variables
x <- model.matrix(stabf~., train.data)[,-1]
# Convert the outcome to a numerical variable
y <- ifelse(train.data$stabf == "stable", 1, 0)
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
# Fit the final model on the training data
model <- glmnet(x, y, alpha = 1, family = "binomial",
lambda = cv.lasso$lambda.min)
# Display regression coefficients
coef(model)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 11.39079462
## tau1 -0.31482599
## tau2 -0.31660179
## tau3 -0.31178392
## tau4 -0.31517472
## p2 -0.03925336
## p3 -0.08143423
## p4 0.07529571
## g1 -2.59707958
## g2 -2.79362124
## g3 -3.02222301
## g4 -2.79597452
# Make predictions on the test data
x.test <- model.matrix(stabf ~., test.data)[,-1]
probabilities <- model %>% predict(newx = x.test)
predicted.classes <- ifelse(probabilities > 0.5, "stable", "unstable")
# Model accuracy
observed.classes <- test.data$stabf
mean(predicted.classes == observed.classes)
## [1] 0.8195
From the above, the lasso gives an accuracy of 81.95%, slightly better than the 81.5% obtained with ridge regression.
set.seed(123)
cv.lasso <- cv.glmnet(x, y, alpha = 1, family = "binomial")
plot(cv.lasso)
The plot above displays the cross-validation error as a function of log(lambda). The vertical dashed line indicates that the log of the optimal lambda is approximately −5; this value minimizes the prediction error and gives the most accurate model.
We carry out two dimension reduction methods in this section: principal components regression and partial least squares.
Principal components regression (PCR) is built on principal components analysis, which can be performed with the prcomp() function in R.
#library (pls)
#set.seed (2)
#pcr.fit=pcr(tau1~., data=data ,scale=TRUE,validation ="CV")
#summary(pcr.fit)
#validationplot(pcr.fit ,val.type="MSEP")
data.pr <- prcomp(data[c(1:11)], center = TRUE, scale = TRUE)
summary(data.pr)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.02901 1.02042 1.01906 1.00859 1.00481 1.00041
## Proportion of Variance 0.09626 0.09466 0.09441 0.09248 0.09179 0.09098
## Cumulative Proportion 0.09626 0.19092 0.28533 0.37780 0.46959 0.56057
## PC7 PC8 PC9 PC10 PC11
## Standard deviation 0.99198 0.9878 0.98584 0.97672 0.97369
## Proportion of Variance 0.08946 0.0887 0.08835 0.08673 0.08619
## Cumulative Proportion 0.65003 0.7387 0.82709 0.91381 1.00000
The prcomp() function runs PCA on the predictor variables, with the response variable excluded. We ask the function to center and scale the data so that the predictors are standardized, and then display the summary.
The summary shows, for each component, the standard deviation (the square root of the eigenvalue), the proportion of variance explained, and the cumulative proportion.
Below we make a scree plot of the eigenvalues of the principal components and a plot of the cumulative explained variance.
screeplot(data.pr, type = "l", npcs = 12, main = "Screeplot of the first 12 PCs")
abline(h = 1, col="red", lty=5)
legend("topright", legend=c("Eigenvalue = 1"),
col=c("red"), lty=5, cex=0.6)
cumpro <- cumsum(data.pr$sdev^2 / sum(data.pr$sdev^2))
plot(cumpro[0:12], xlab = "PC #", ylab = "Amount of explained variance", main = "Cumulative variance plot")
abline(v = 6, col="blue", lty=5)
abline(h = 0.88759, col="blue", lty=5)
legend("topleft", legend=c("Cut-off @ PC6"),
col=c("blue"), lty=5, cex=0.6)
From the procedure above, the first 6 components each have an eigenvalue greater than 1, but together they explain only about 56% of the variance; because the predictors are nearly uncorrelated, reducing the dimensionality from 11 to 6 would sacrifice roughly 44% of the variance.
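If we nonetheless wanted to work with a reduced predictor set, the component scores are already stored in the prcomp object; a minimal sketch:
# Scores of the first six principal components, usable as a reduced feature matrix
data_reduced <- data.frame(data.pr$x[, 1:6], stabf = data$stabf)
dim(data_reduced)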
Below we will plot the first two components.
plot(data.pr$x[,1],data.pr$x[,2], xlab="PC1 (9.6%)", ylab = "PC2 (9.5%)", main = "PC1 / PC2 - plot")
Let us include the response variable in the plot and see how it looks.
library("factoextra")
## Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ
fviz_pca_ind(data.pr, geom.ind = "point", pointshape = 21,
pointsize = 2,
fill.ind = data$stabf,
col.ind = "black",
palette = "jco",
addEllipses = TRUE,
label = "var",
col.var = "black",
repel = TRUE,
legend.title = "Stabf") +
ggtitle("2D PCA-plot from 11 feature dataset") +
theme(plot.title = element_text(hjust = 0.5))
In the procedure above, we made the same plot as before but coloured the points by the stability of the electric grid, which makes the PCA plot easier to interpret. Using the first two components, we can see some separation between the stable and unstable observations.
To carry out PLS we use the caret package. We start by preprocessing: the preProc argument removes zero-variance predictors and centers and scales the remaining ones. We then train the PLS model and check the cross-validation profile.
set.seed (1)
dataPLS = data
dataPLS$class = factor(dataPLS$stabf)
# Compile cross-validation settings
set.seed(100)
myfolds <- createMultiFolds(dataPLS$class, k = 5, times = 10)
control <- trainControl("repeatedcv", index = myfolds, selectionFunction = "oneSE")
# Train PLS model
mod1 <- train(class ~ ., data = dataPLS,
method = "pls",
metric = "Accuracy",
tuneLength = 20,
trControl = control,
preProc = c("zv","center","scale"))
# Check CV profile
plot(mod1)
The above plot shows the CV profile: the accuracy obtained from the trained model for different numbers of latent variables (LVs). Because the response is a two-class factor, caret fits a PLS discriminant analysis (PLS-DA) model.
We use the varImp() function, plotting the top 10, to display the 10 most important variables from the PLS model.
# Plot the 10 most important variables for the PLS-DA model
plot(varImp(mod1), 10, main = "PLS-DA")
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
The procedure above displays the 10 most important variables on a relative scale from 0 to 100, where the most important variable is scaled to 100.
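A brief sketch of how the selected number of components and the cross-validated performance of mod1 could be inspected (assuming the caret objects above):
mod1$bestTune          # number of PLS components chosen by the one-standard-error rule
confusionMatrix(mod1)  # average resampled confusion matrix, in percent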
We move to the next section, where we discuss moving beyond linearity.
In this section we focus on generalized additive models only. Other models could also be used here, but we limit our scope to GAMs.
Generalized additive models (GAMs) provide a framework for extending a standard linear model by allowing non-linear functions of each variable while maintaining additivity.
We use the gam() function from the gam package in R to fit a logistic regression model with GAMs for the probabilities of the binary response. One important feature of GAMs is the ability to allow non-linearity in multiple variables at the same time, and GAMs can be combined with a variety of non-linear functions.
We run the gam() function on our data set, fitting the model using smoothing splines.
library(gam)
## Loading required package: splines
## Loaded gam 1.16
#Logistic regression GAM with a smoothing spline (df = 4) for each predictor
logitgam1<-gam(I(stabf) ~ s(tau1,df=4) + s(tau2,df=4) + s(tau3,df=4) + s(tau4,df=4) + s(p2,df=4) + s(p3,df=4) + s(p4,df=4) + s(g1,df=4) + s(g2,df=4) + s(g3,df=4) + s(g4,df=4), data=data, family=binomial)
plot(logitgam1,se=T)
From the plots above, we can see that the smoothing splines capture clear non-linear behaviour for most of the variables, while the fitted functions for p2, p3 and p4 are comparatively flat, suggesting these predictors add little whether entered linearly or non-linearly. This can be checked more formally with the sketch below.
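As a hedged sketch (assuming the gam package and the logitgam1 fit above; the name logitgam2 is ours), one could compare the model above with one that enters p2, p3 and p4 linearly, using an analysis-of-deviance test:
# GAM with linear terms for p2, p3, p4 and smoothing splines for the remaining predictors
logitgam2 <- gam(I(stabf) ~ s(tau1,df=4) + s(tau2,df=4) + s(tau3,df=4) + s(tau4,df=4) +
                   p2 + p3 + p4 +
                   s(g1,df=4) + s(g2,df=4) + s(g3,df=4) + s(g4,df=4),
                 data = data, family = binomial)
anova(logitgam2, logitgam1, test = "Chisq")   # does adding non-linearity in p2-p4 help?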
Tree-based methods involve segmenting the predictor space into a number of simple regions. Based on the nature of our data set, we use tree-based methods for classification.
The tree library is used to construct classification trees in R. We apply these methods to our electricity data set.
library(tree)
tree.data=tree(stabf~.,data)
summary(tree.data)
##
## Classification tree:
## tree(formula = stabf ~ ., data = data)
## Variables actually used in tree construction:
## [1] "tau1" "g3" "tau2" "g2" "tau3" "tau4" "g1"
## Number of terminal nodes: 14
## Residual mean deviance: 0.9339 = 9326 / 9986
## Misclassification error rate: 0.2254 = 2254 / 10000
The summary() function shows that only seven of the eleven variables are used in constructing the tree: tau1, g3, tau2, g2, tau3, tau4 and g1. The tree has 14 terminal nodes, a residual mean deviance of 0.93, and a misclassification error rate of 22.5%.
One attractive feature of tree methods is that the fitted tree can be displayed graphically; we do so below.
plot(tree.data)
text(tree.data,pretty = 0)
We used the plot() function to display the tree structure and the text() function to display the node labels. The argument pretty = 0 includes the category names of the qualitative predictors rather than a single letter for each category. The terminal nodes are labelled stable or unstable.
We print the output corresponding to each branch of the tree below.
tree.data
## node), split, n, deviance, yval, (yprob)
## * denotes terminal node
##
## 1) root 10000 13090.0 unstable ( 0.36200 0.63800 )
## 2) tau1 < 2.86664 2491 3370.0 stable ( 0.59093 0.40907 )
## 4) g3 < 0.606507 1470 1780.0 stable ( 0.70612 0.29388 )
## 8) tau2 < 2.57638 344 147.0 stable ( 0.94477 0.05523 ) *
## 9) tau2 > 2.57638 1126 1480.0 stable ( 0.63321 0.36679 )
## 18) g2 < 0.556625 607 600.8 stable ( 0.80395 0.19605 ) *
## 19) g2 > 0.556625 519 710.3 unstable ( 0.43353 0.56647 ) *
## 5) g3 > 0.606507 1021 1392.0 unstable ( 0.42507 0.57493 )
## 10) tau3 < 2.92726 265 293.0 stable ( 0.75849 0.24151 ) *
## 11) tau3 > 2.92726 756 933.9 unstable ( 0.30820 0.69180 ) *
## 3) tau1 > 2.86664 7509 8990.0 unstable ( 0.28606 0.71394 )
## 6) tau4 < 2.91029 1911 2649.0 stable ( 0.50759 0.49241 )
## 12) g1 < 0.600142 1109 1443.0 stable ( 0.64472 0.35528 ) *
## 13) g1 > 0.600142 802 1003.0 unstable ( 0.31796 0.68204 ) *
## 7) tau4 > 2.91029 5598 5761.0 unstable ( 0.21043 0.78957 )
## 14) tau3 < 2.96374 1469 2002.0 unstable ( 0.42342 0.57658 )
## 28) tau2 < 3.18128 427 507.2 stable ( 0.71897 0.28103 ) *
## 29) tau2 > 3.18128 1042 1277.0 unstable ( 0.30230 0.69770 )
## 58) g2 < 0.438906 440 609.9 stable ( 0.50682 0.49318 ) *
## 59) g2 > 0.438906 602 514.8 unstable ( 0.15282 0.84718 ) *
## 15) tau3 > 2.96374 4129 3263.0 unstable ( 0.13466 0.86534 )
## 30) tau2 < 3.59958 1318 1559.0 unstable ( 0.27845 0.72155 )
## 60) g3 < 0.357665 420 578.4 stable ( 0.54762 0.45238 ) *
## 61) g3 > 0.357665 898 767.1 unstable ( 0.15256 0.84744 ) *
## 31) tau2 > 3.59958 2811 1385.0 unstable ( 0.06724 0.93276 )
## 62) g2 < 0.492115 1246 978.0 unstable ( 0.13323 0.86677 ) *
## 63) g2 > 0.492115 1565 239.8 unstable ( 0.01470 0.98530 ) *
The output above shows, for each branch of the tree, the number of observations, the deviance, the overall prediction for the branch (stable or unstable), and the fraction of observations in the branch belonging to each class. Branches that lead to terminal nodes are marked with asterisks.
To evaluate the performance of the classification tree, we fit the tree on a training set and estimate the test error on a held-out test set.
set.seed(2)
train=sample(1:nrow(data),7500)
data.test = data[-train,]
stabf.test = data$stabf[-train]
tree.data=tree(stabf~.,data,subset = train)
tree.pred=predict(tree.data,data.test,type = "class")
table(tree.pred,stabf.test)
## stabf.test
## tree.pred stable unstable
## stable 663 271
## unstable 269 1297
(663+1297)/2500
## [1] 0.784
We split the observations into a training set and a test set, build the tree on the training set, and evaluate its performance on the test data with the predict() function. This procedure correctly classifies 78.4% of the observations in the test set.
Next, we prune the tree to see whether this improves it. We use FUN=prune.misclass to indicate that the classification error rate, rather than the default deviance, should guide the cross-validation and pruning process in cv.tree().
set.seed(3)
cv.data=cv.tree(tree.data,FUN=prune.misclass)
names(cv.data)
## [1] "size" "dev" "k" "method"
cv.data
## $size
## [1] 16 14 12 10 8 6 4 2 1
##
## $dev
## [1] 1738 1738 1752 1852 2002 2024 2210 2311 2688
##
## $k
## [1] -Inf 0.0 12.0 37.0 68.0 73.0 94.5 112.0 324.0
##
## $method
## [1] "misclass"
##
## attr(,"class")
## [1] "prune" "tree.sequence"
The procedure above reports, for each tree considered, the number of terminal nodes (size), the cross-validation error (dev), the value of the cost-complexity parameter (k), and the method, which is misclass. The trees with 16 and 14 terminal nodes both achieve the lowest cross-validation error rate, with 1738 cross-validation errors.
par(mfrow=c(1,2))
plot(cv.data$size,cv.data$dev, type = "b")
plot(cv.data$k,cv.data$dev, type = "b")
Above are plots of the cross-validation error rate as a function of both size and k.
To prune the tree, we use the prune.misclass() function to obtain the 16-node tree.
prune.data=prune.misclass(tree.data,best = 16)
plot(prune.data)
text(prune.data,pretty=0)
To find out how well the pruned tree performs on the test data set, we apply the predict() function.
tree.pred2=predict(prune.data,data.test,type = "class")
table(tree.pred2,stabf.test)
## stabf.test
## tree.pred2 stable unstable
## stable 663 271
## unstable 269 1297
(663+1297)/2500
## [1] 0.784
Using the pruned tree for prediction on the test data again gives 78.4% accuracy. Pruning is expected to produce a more interpretable tree and, in some cases, to improve classification accuracy; here the accuracy is unchanged.
We apply bagging and random forests to our data set using the randomForest package in R. Bagging is a special case of a random forest with m = p. The randomForest() function can be used for both random forests and bagging.
library("randomForest")
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
set.seed(1)
bag.data=randomForest(stabf~.,data,subset=train,mtry=11,importance=TRUE)
bag.data
##
## Call:
## randomForest(formula = stabf ~ ., data = data, mtry = 11, importance = TRUE, subset = train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 11
##
## OOB estimate of error rate: 9.03%
## Confusion matrix:
## stable unstable class.error
## stable 2292 396 0.14732143
## unstable 281 4531 0.05839568
The argument mtry=11 indicates that all 11 predictors should be considered at each split of the tree, which is exactly bagging. Below, we check how well the bagged model performs on the test set.
table(predict(bag.data,data.test,type = "class"), stabf.test)
## stabf.test
## stable unstable
## stable 806 88
## unstable 126 1480
(1 - ((809+1480)/2500)) * 100
## [1] 8.44
(809+1480)/2500
## [1] 0.9156
The misclassification rate is 8.44%; in other words, the accuracy is about 91.6%. This performs considerably better than the optimally-pruned single tree.
Growing a random forest works in the same way as bagging, except that a smaller value of the mtry argument is used. By default, randomForest() uses sqrt(p) variables at each split when building a random forest of classification trees. Here we use mtry = 3.
set.seed(1)
rf.data=randomForest(stabf~.,data,subset=train,mtry=3,importance=TRUE)
tree.pred=predict(rf.data,data.test,type = "class")
table(tree.pred,stabf.test)
## stabf.test
## tree.pred stable unstable
## stable 805 59
## unstable 127 1509
(805+1509)/2500
## [1] 0.9256
The accuracy is 92.56%, which shows that the random forest yields an improvement over bagging in this case. A sketch of how other values of mtry could be compared is given below.
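A minimal sketch (the tuning loop and the acc vector are ours, not part of the original analysis) of how a few values of mtry could be compared on the test set:
# Test accuracy for several values of mtry (hypothetical tuning loop)
acc <- sapply(c(2, 3, 4, 6, 11), function(m) {
  set.seed(1)
  fit <- randomForest(stabf ~ ., data, subset = train, mtry = m, ntree = 500)
  mean(predict(fit, data.test) == stabf.test)
})
acc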
Below, we use the importance() function to see the importance of each variable.
importance(rf.data)
## stable unstable MeanDecreaseAccuracy MeanDecreaseGini
## tau1 114.7872593 106.9407224 141.18392675 449.2558
## tau2 112.3658441 102.2635128 137.70979617 439.6647
## tau3 104.9953909 102.5314881 131.26815929 419.8940
## tau4 110.7367200 105.8246487 135.22536651 441.6628
## p2 -1.1223243 0.9351217 -0.08600038 108.7003
## p3 -0.9493097 -0.9904667 -1.32423234 108.2267
## p4 -2.8513651 -0.2818706 -2.19585744 107.3779
## g1 83.3364941 75.7485651 104.35650824 338.2807
## g2 79.8714919 76.9761063 101.14355238 344.8114
## g3 84.9086847 75.9031409 100.00287992 354.2238
## g4 82.0388687 73.4268525 100.91822593 338.1250
The output above reports two measures of variable importance: the mean decrease in accuracy and the mean decrease in the Gini index. Plots of these measures are produced below.
varImpPlot(rf.data)
The gbm() function in the gbm package in R is used to fit boosted classification trees to our data set.
library(gbm)
## Loaded gbm 2.1.5
set.seed(1)
data = read.csv("electricity19.csv")
data$stabf = as.numeric(data$stabf)
data = transform(data,stabf=stabf-1)
train=sample(1:nrow(data),7500)
data.test = data[-train,]
stabf.test = data$stabf[-train]
boost.data=gbm(stabf~.,data[train,],distribution = "bernoulli",n.trees = 5000, interaction.depth = 4)
The procedure above fits a boosted classification tree to the electricity data set. We ran gbm() with the option distribution="bernoulli" because this is a binary classification problem. The argument n.trees=5000 requests 5,000 trees, and interaction.depth=4 limits the depth of each tree.
summary(boost.data)
## var rel.inf
## tau2 tau2 14.2353048
## tau4 tau4 14.2032276
## tau1 tau1 13.8088269
## tau3 tau3 13.6710992
## g3 g3 10.6593969
## g2 g2 10.6094066
## g4 g4 10.2077342
## g1 g1 10.1954886
## p4 p4 0.8329761
## p3 p3 0.8010815
## p2 p2 0.7754576
From the output above, tau2 and tau4 are the most important variables, followed closely by tau1 and tau3 and then the price variables g1-g4; p2, p3 and p4 contribute very little.
par(mfrow=c(1,2))
plot(boost.data,i='tau2')
plot(boost.data,i='tau4')
In the procedure above we produced partial dependence plots for the two most important variables. These illustrate the marginal effect of the selected variable on the response after integrating out the other variables.
Next, we use the boosted model to predict stabf on the test set.
yhat.boost=predict(boost.data,data.test,type = "response",n.trees = 5000)
mean(yhat.boost==stabf.test)
## [1] 4e-04
Note that predict() with type="response" returns predicted probabilities, so comparing yhat.boost directly to the 0/1 labels with == does not give a meaningful error rate; the value 4e-04 is merely the fraction of probabilities that happen to equal the labels exactly. To estimate the test error, the probabilities must first be converted to class predictions, as sketched below.
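A minimal sketch of that conversion (the 0.5 cut-off and the pred.class name are our assumptions):
pred.class <- ifelse(yhat.boost > 0.5, 1, 0)   # probabilities to 0/1 class labels
mean(pred.class == stabf.test)                 # test accuracy
mean(pred.class != stabf.test)                 # test error rate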
The boosting fit above used the default shrinkage (lambda) of 0.1. We run the procedure again with the shrinkage set to 0.2.
boost.data=gbm(stabf~.,data[train,],distribution = "bernoulli",n.trees = 5000, interaction.depth = 4, shrinkage = 0.2, verbose = F)
yhat.boost=predict(boost.data,data.test,type = "response",n.trees = 5000)
mean(yhat.boost==stabf.test)
## [1] 0.1248
With lambda set to 0.2, the same comparison returns 0.1248; as above, this is not a valid error rate, because the predicted probabilities would need to be thresholded into class labels before computing an accuracy.
The main objective of a support vector machine is to find a hyperplane in N-dimensional space that distinctly classifies the data points. There are many possible hyperplanes that could separate the two classes; in this project, our objective is to find the plane with the maximum margin.
We use the svm() function in the e1071 package in R. We fit the model on the training data and evaluate it on the testing data.
library(e1071)
library(magrittr)
set.seed(1)
smp_siz = floor(0.75*nrow(data))
train_ind = sample(seq_len(nrow(data)),size = smp_siz)
train =data[train_ind,]
test=data[-train_ind,]
model_svm <- svm(stabf ~., data=train, cost = 1000, gamma = 0.01)
model_svm
##
## Call:
## svm(formula = stabf ~ ., data = train, cost = 1000, gamma = 0.01)
##
##
## Parameters:
## SVM-Type: eps-regression
## SVM-Kernel: radial
## cost: 1000
## gamma: 0.01
## epsilon: 0.1
##
##
## Number of Support Vectors: 6089
In the call above, the cost parameter penalises the model for misclassification; we set it to 1000 and gamma to 0.01. Note that the printed SVM-Type is eps-regression: because stabf was recoded as a numeric 0/1 variable in the boosting step, svm() fit a regression rather than a classifier. Below, we run the predict() function on the fitted model using the test data and then look at its accuracy.
test_svm <- predict(model_svm, newdata = test %>% na.omit())
yo <- test %>% na.omit()
#table(test_svm, yo$stabf)  # confusion table (meaningful once stabf is a factor; see the sketch below)
(876+1552)/2500
## [1] 0.9712
The result above gives a high accuracy of 97.12% on the test data, which suggests that a cost of 1000 is suitable for this model. The sketch below shows how the same model can be refit as an explicit classification SVM.
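A hedged sketch of refitting the SVM as a classifier (the factor conversion and the names svm_clf and pred are ours), using the train and test data frames created above:
# stabf was recoded as 0/1 earlier (0 = stable, 1 = unstable); convert it back to a factor
train$stabf <- factor(train$stabf)
test$stabf  <- factor(test$stabf)
svm_clf <- svm(stabf ~ ., data = train, type = "C-classification",
               kernel = "radial", cost = 1000, gamma = 0.01)
pred <- predict(svm_clf, newdata = test)
table(pred, test$stabf)      # confusion matrix on the test set
mean(pred == test$stabf)     # test accuracy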
The scope of this project was to apply the machine learning concepts taught in class to an openly available data set. We applied several classification methods to the electricity data set and thereby achieved the goals of the project: modelling the data so as to accurately predict the stability of the electric grid, and inferring the relationship between the predictors and the response.
The following table consolidates the results of some of the methods used in this project.
## # Results of some methods used in the project
## | No. | Method                           | Accuracy |
## |-----|:--------------------------------:|---------:|
## | 1   | Logistic regression              | 0.8176   |
## | 2   | Linear discriminant analysis     | 0.8176   |
## | 3   | Quadratic discriminant analysis  | 0.888    |
## | 4   | K-nearest neighbors (k = 5)      | 0.8632   |
## | 5   | Validation set approach          | 0.82     |
## | 6   | K-fold cross-validation          | 0.7459   |
## | 7   | Bootstrap                        | 0.816    |
## | 8   | Classification tree              | 0.784    |
## | 9   | Trees + bagging                  | 0.916    |
## | 10  | Trees + random forest            | 0.926    |
## | 11  | Trees + boosting                 | 0.9996*  |
## | 12  | SVM                              | 0.9712   |
##
## *Reported figure; as noted in the boosting section, it compares predicted
## probabilities directly with the labels and overstates the true accuracy.
Further work would be to apply more classification methods to our data set and to expand the scope of the methods we used.
I would like to express my sincere gratitude to my professor, Dr. Gunes Koru (https://drkoru.us/), for his continuous support and excellent teaching. His guidance made this project possible.
I would also like to thank my friend Nelson Ayaji for his assistance throughout the project; his advice made a difference.