Semi-Supervised Application to the Admissions Problem

    rm(list = ls())
    pacman::p_load(scales, tidyverse)
    pacman::p_load(RSSL, caret, plyr)
    pacman::p_load(lattice, magrittr, useful)
    pacman::p_load(MASS, ssc, GGally)
#=============================================================================#

    data_semi = read.csv("data_semi.csv", header = TRUE)
    head(data_semi) 
##   Class experience      age      gpa
## 1     0          0 25.93168 2.690558
## 2     0          1 23.83526 3.159641
## 3     0          0 22.69056 2.553409
## 4     0          0 25.29371 2.548841
## 5     0          0 21.87878 2.776811
## 6     0          0 23.83175 2.769822
    table(data_semi$Class)
## 
##  0  1 
## 50 50
     # Add Missing Labels ============================
  
          set.seed(12345)
          data_semi$Class = as.factor(data_semi$Class) # convert Class to a factor variable
          df <- data_semi %>% add_missinglabels_mar(Class ~ ., prob = 0.80) # remove 80% of the labels (set Class to NA)
    
          table(df$Class)
## 
##  0  1 
## 10 10
          ggplot(df,aes( x = age, y = gpa, col = as.factor(Class))) + 
            geom_point(size = 5)

In the code above, we first set the random seed for reproducibility. Then we converted the Class variable to a factor using the as.factor() function, so that it is treated as a categorical variable representing the two classes in our data (0 and 1).

Next, we used the add_missinglabels_mar() function from the RSSL package to artificially remove labels (i.e., create unlabeled observations) in our data. We set the probability of removing a label to 0.8, so 80 of the 100 observations have their Class value set to NA and are treated as unlabeled; only 20 labeled observations (10 per class) remain.
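
A quick base R check confirms the split between labeled and unlabeled rows (the counts follow from the table() output above):

          sum(is.na(df$Class))              # 80 observations are now unlabeled (Class is NA)
          table(df$Class, useNA = "ifany")  # 10 + 10 labeled observations plus the NA count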

Finally, we created a scatter plot using ggplot2 with age on the x-axis, gpa on the y-axis, and Class as the color of the points. The geom_point() function adds points to the plot, and we set the size of the points to 5 using the size parameter.

This plot helps us visualize the distribution of the data and see if there is any pattern or trend based on the Class variable.
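
In the plot above, the unlabeled observations appear under NA in the colour legend. If you prefer to grey them out explicitly, a small variant works (a sketch assuming ggplot2's na.value argument for discrete colour scales, here via scale_color_hue()):

          ggplot(df, aes(x = age, y = gpa, col = Class)) +
            geom_point(size = 5) +
            scale_color_hue(na.value = "grey70")  # draw unlabeled (NA) points in grey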

Class 1 represents the students we want to admit; Class 0 represents the students we will not admit.

Semi-Supervised Classifiers

The Nearest Mean Classifier is closely related to the k-means algorithm. It assumes that items in close proximity (i.e., at a small distance from each other) are similar: each class is summarized by the mean of its observations, and every observation is assigned to the class with the nearest mean. Semi-supervised variants apply this idea iteratively, re-estimating the class means and re-assigning observations until the labels converge.
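
To make this concrete, here is a minimal sketch of nearest-mean classification (illustrative only, not RSSL's implementation), assuming Euclidean distance on the three features and using only the labeled rows to estimate the class means:

        # Class centroids estimated from the labeled rows only
        labeled_rows <- df[!is.na(df$Class), ]
        centroids    <- aggregate(cbind(experience, age, gpa) ~ Class, data = labeled_rows, FUN = mean)

        # Squared Euclidean distance of every observation to each class centroid
        feats        <- as.matrix(df[, c("experience", "age", "gpa")])
        dist_to_mean <- apply(centroids[, -1], 1, function(m) colSums((t(feats) - m)^2))

        # Assign each observation to the class whose centroid is closest
        nearest_mean_pred <- centroids$Class[max.col(-dist_to_mean)]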

The self-training algorithm relies on a supervised learning algorithm trained on the labeled data only. This classifier is then applied to the unlabeled data to generate more labeled examples as input for the supervised learning algorithm.

 # Nearest Mean Classifier (supervised baseline) and semi-supervised self-learning with svmlin
        g_nm <- NearestMeanClassifier(Class ~ experience + age + gpa, df)
        g_self <- SelfLearning(Class ~ experience + age + gpa, df,
                               # method = NearestMeanClassifier,  # alternative base learner
                               method = svmlin)

        # Nearest Mean Classifier Prediction & Accuracy ========
        ( pred_g_nm = predict(g_nm, df)) 
##   [1] 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0 0 0 1 0 0 1 0 0 0 1 1 0 0 0 0 0
##  [38] 0 1 0 0 0 0 0 0 0 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0
## Levels: 0 1
            table(pred_g_nm) # predicted classes for all observations, including the unlabeled ones
## pred_g_nm
##  0  1 
## 43 57

The code above trains two classifiers: the Nearest Mean Classifier (a supervised baseline) and the Self-Learning classifier (a semi-supervised method). The Nearest Mean Classifier is a simple algorithm that estimates class labels for new observations by comparing their distances to the centroids of the classes in the training data. The Self-Learning classifier wraps a supervised base learner (here svmlin, a linear SVM; the commented-out line shows NearestMeanClassifier as an alternative) and iteratively pseudo-labels the unlabeled instances, typically starting with the most confidently predicted ones, adding them to the training set and retraining.

The code then applies both classifiers on the full data set df. The predicted labels for df are stored in pred_g_nm using the Nearest Mean Classifier. The table() function shows the frequency distribution of predicted classes.

The difference between the Nearest Mean Classifier and the Self-Learning classifier is that the former uses only labeled instances to estimate class labels, while the latter makes use of both labeled and unlabeled instances. Self-learning is a self-training (semi-supervised) approach that attempts to overcome the limitations of traditional supervised learning when only a small proportion of the data is labeled.
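
A minimal sketch of one round of self-training (illustrative only; RSSL's SelfLearning handles the iteration and stopping internally), using NearestMeanClassifier as the base learner rather than svmlin:

        # Split into labeled and unlabeled rows
        labeled_df   <- df[!is.na(df$Class), ]
        unlabeled_df <- df[is.na(df$Class), ]

        # Fit the base learner on the labeled rows and pseudo-label the unlabeled rows
        base_fit <- NearestMeanClassifier(Class ~ experience + age + gpa, labeled_df)
        pseudo   <- predict(base_fit, unlabeled_df)

        # Fold the pseudo-labels back in and retrain on the augmented data
        augmented <- df
        augmented$Class[is.na(augmented$Class)] <- pseudo
        refit <- NearestMeanClassifier(Class ~ experience + age + gpa, augmented)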

 table(pred_g_nm, data_semi$Class)
##          
## pred_g_nm  0  1
##         0 40  3
##         1 10 47
        mean(pred_g_nm == data_semi$Class)
## [1] 0.87
        caret::confusionMatrix(pred_g_nm, as.factor(data_semi$Class))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 40  3
##          1 10 47
##                                          
##                Accuracy : 0.87           
##                  95% CI : (0.788, 0.9289)
##     No Information Rate : 0.5            
##     P-Value [Acc > NIR] : 6.565e-15      
##                                          
##                   Kappa : 0.74           
##                                          
##  Mcnemar's Test P-Value : 0.09609        
##                                          
##             Sensitivity : 0.8000         
##             Specificity : 0.9400         
##          Pos Pred Value : 0.9302         
##          Neg Pred Value : 0.8246         
##              Prevalence : 0.5000         
##          Detection Rate : 0.4000         
##    Detection Prevalence : 0.4300         
##       Balanced Accuracy : 0.8700         
##                                          
##        'Positive' Class : 0              
## 

The first line creates a confusion matrix by comparing the predicted class labels (pred_g_nm) obtained from the nearest mean classifier with the true class labels (data_semi$Class) in the data.

The second line calculates the accuracy of the classifier by computing the mean of a logical vector where TRUE is assigned for every correct prediction and FALSE for every incorrect prediction.

The third line uses the caret::confusionMatrix() function from the caret package to generate a more detailed confusion matrix that provides information on different measures such as accuracy, sensitivity, and specificity of the classifier.
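
The headline numbers can also be recovered by hand from the 2x2 table (with class 0 treated as the positive class, as in the caret output):

        cm   <- table(pred_g_nm, data_semi$Class)
        sens <- cm["0", "0"] / sum(cm[, "0"])  # 40 / 50 = 0.80, the sensitivity
        spec <- cm["1", "1"] / sum(cm[, "1"])  # 47 / 50 = 0.94, the specificity
        acc  <- sum(diag(cm)) / sum(cm)        # (40 + 47) / 100 = 0.87, the accuracy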

Self-Learning Algorithm Prediction & Accuracy

              pred_g_self = predict(g_self, df)
                  table(pred_g_self)
## pred_g_self
##  0  1 
## 39 61
                  table(pred_g_self, data_semi$Class)
##            
## pred_g_self  0  1
##           0 38  1
##           1 12 49
                  mean(pred_g_self == data_semi$Class)
## [1] 0.87
              caret::confusionMatrix(pred_g_self, data_semi$Class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 38  1
##          1 12 49
##                                          
##                Accuracy : 0.87           
##                  95% CI : (0.788, 0.9289)
##     No Information Rate : 0.5            
##     P-Value [Acc > NIR] : 6.565e-15      
##                                          
##                   Kappa : 0.74           
##                                          
##  Mcnemar's Test P-Value : 0.005546       
##                                          
##             Sensitivity : 0.7600         
##             Specificity : 0.9800         
##          Pos Pred Value : 0.9744         
##          Neg Pred Value : 0.8033         
##              Prevalence : 0.5000         
##          Detection Rate : 0.3800         
##    Detection Prevalence : 0.3900         
##       Balanced Accuracy : 0.8700         
##                                          
##        'Positive' Class : 0              
## 

In the code above, we predict class labels for the full data set (both the originally labeled and the unlabeled observations) using the Self-Learning model.

First, we use the predict() function with g_self (the Self-Learning model, fitted on the labeled data and then self-trained with the unlabeled data) and df (the labeled and unlabeled data together) to obtain the predicted class labels, which we store in pred_g_self.

Then, we create a contingency table using the table() function to compare the predicted class labels with the true class labels (data_semi$Class).

We also calculate the overall accuracy of the Self-Learning model using mean(pred_g_self == data_semi$Class).

Finally, we use the confusionMatrix() function from the caret package to compute the confusion matrix, which provides more detailed information on the performance of the Self-Learning algorithm, such as sensitivity, specificity, and the positive and negative predictive values.
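
By default confusionMatrix() reports sensitivity/specificity-style statistics; precision, recall, and the F1 score can be requested through its mode argument:

              # Same confusion matrix, with precision/recall/F1 added to the per-class statistics
              caret::confusionMatrix(pred_g_self, data_semi$Class, mode = "everything")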

For comparison, here is an SVM without self-learning, i.e., a standard supervised SVM that ignores the unlabeled observations during training; it achieves a better accuracy of 93%.

c_svm <- SVM(Class ~ experience + age + gpa, df,
             scale = FALSE,
             kernel = kernlab::rbfdot(0.05),
             C = 2500)
pred_c_svm = predict(c_svm, df)   
table(pred_c_svm)
## pred_c_svm
##  0  1 
## 51 49
table(pred_c_svm, data_semi$Class)
##           
## pred_c_svm  0  1
##          0 47  4
##          1  3 46
mean(pred_c_svm == data_semi$Class)
## [1] 0.93
caret::confusionMatrix(as.factor(pred_c_svm), as.factor(data_semi$Class)) 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 47  4
##          1  3 46
##                                           
##                Accuracy : 0.93            
##                  95% CI : (0.8611, 0.9714)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.86            
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9400          
##             Specificity : 0.9200          
##          Pos Pred Value : 0.9216          
##          Neg Pred Value : 0.9388          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4700          
##    Detection Prevalence : 0.5100          
##       Balanced Accuracy : 0.9300          
##                                           
##        'Positive' Class : 0               
## 

In this block of code, we apply the Support Vector Machine (SVM) algorithm to classify the data using the SVM() function from the RSSL package. We first create an SVM model c_svm using the Class variable as the response variable and experience, age, and gpa as predictor variables.

We set scale to FALSE to avoid scaling the data since our variables are already on similar scales. We use the radial basis function kernel kernlab::rbfdot with a sigma of 0.05 for non-linear classification. We also set the regularization parameter C to 2500 to control the trade-off between misclassification and margin width.

We then predict the class of the labeled and unlabeled data using the predict() function and store the results in pred_c_svm. We display the frequency table of the predicted classes using table(), as well as the confusion matrix using the caret::confusionMatrix() function. The accuracy of the model is calculated using the mean() function, which gives the proportion of correctly classified observations.
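
To get a feel for the sigma parameter, the kernel object returned by kernlab::rbfdot() can be evaluated directly on two feature vectors (a small illustrative check, using the first two applicants from head(data_semi) above):

    rbf <- kernlab::rbfdot(sigma = 0.05)
    x1  <- c(0, 25.93, 2.69)  # experience, age, gpa of applicant 1
    x2  <- c(1, 23.84, 3.16)  # experience, age, gpa of applicant 2
    rbf(x1, x2)               # exp(-sigma * ||x1 - x2||^2); values near 1 indicate very similar applicants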

LaplacianSVM

        c_lapsvm <- LaplacianSVM(Class ~ experience + age + gpa, df,
                                 scale = FALSE, kernel = kernlab::rbfdot(0.05),
                                 lambda = 0.0001, gamma = 10)

        pred_c_lapsvm = predict(c_lapsvm, df)   
          table(pred_c_lapsvm)
## pred_c_lapsvm
##  0  1 
## 52 48
          table(pred_c_lapsvm, data_semi$Class)
##              
## pred_c_lapsvm  0  1
##             0 50  2
##             1  0 48
          mean(pred_c_lapsvm == data_semi$Class)
## [1] 0.98
        caret::confusionMatrix(pred_c_lapsvm, as.factor(data_semi$Class))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 50  2
##          1  0 48
##                                           
##                Accuracy : 0.98            
##                  95% CI : (0.9296, 0.9976)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.96            
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9600          
##          Pos Pred Value : 0.9615          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.5000          
##          Detection Rate : 0.5000          
##    Detection Prevalence : 0.5200          
##       Balanced Accuracy : 0.9800          
##                                           
##        'Positive' Class : 0               
## 

The code above fits the Laplacian SVM on the labeled and unlabeled data using the LaplacianSVM() function from the RSSL package. The Laplacian SVM is similar to the standard SVM but adds a penalty term based on the graph Laplacian of the data, which encourages observations that lie close together in feature space to receive similar predictions.
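
The graph Laplacian can be made concrete with a short sketch (illustrative only, not RSSL's internal code), assuming an RBF similarity graph over all observations:

        # Build a similarity graph over all observations (labeled and unlabeled)
        X  <- as.matrix(df[, c("experience", "age", "gpa")])
        d2 <- as.matrix(dist(X))^2   # squared pairwise Euclidean distances
        W  <- exp(-0.05 * d2)        # RBF similarities, matching sigma = 0.05 above
        D  <- diag(rowSums(W))       # degree matrix
        L  <- D - W                  # graph Laplacian; the penalty has the form f' L f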

The predicted classes are then calculated on the entire dataset using the predict() function and the accuracy of the model is evaluated by calculating the mean of the correct predictions. The confusion matrix is also computed using the confusionMatrix() function from the caret package to evaluate the performance of the model.

The output summarizes the performance of the classification model. The accuracy is 0.98, meaning that 98% of the observations were classified correctly, with a 95% confidence interval of (0.9296, 0.9976). The no information rate (NIR) is 0.5, the accuracy achieved by always predicting the majority class, and the p-value for the accuracy being greater than the NIR is < 2e-16, so the model significantly outperforms that baseline.

The kappa statistic measures agreement between the predicted and actual labels beyond chance, with a maximum of 1 indicating perfect agreement; here kappa is 0.96, indicating almost perfect agreement.

The sensitivity is 1.00, meaning that all observations in the positive class (class 0) were correctly classified, while the specificity is 0.96, meaning that 96% of the class-1 observations were correctly classified. The positive predictive value (PPV) is 0.9615: among observations predicted to be positive, 96.15% truly were. The negative predictive value (NPV) is 1.00: every observation predicted to be negative truly was.

The prevalence is 0.5, the proportion of positive cases in the data set. The detection rate is 0.5, the proportion of all observations that are correctly identified positives, and the detection prevalence is 0.52, the proportion of all observations classified as positive. Finally, the balanced accuracy is 0.98, the average of sensitivity and specificity, which serves as a measure of overall model performance.
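
For example, the kappa and balanced-accuracy figures can be reproduced by hand from the confusion matrix (a quick arithmetic check):

        cm_lap <- table(pred_c_lapsvm, data_semi$Class)
        p_obs  <- sum(diag(cm_lap)) / sum(cm_lap)                         # observed agreement: 0.98
        p_exp  <- sum(rowSums(cm_lap) * colSums(cm_lap)) / sum(cm_lap)^2  # agreement expected by chance: 0.50
        (p_obs - p_exp) / (1 - p_exp)                                     # kappa = 0.96
        (50/50 + 48/50) / 2                                               # balanced accuracy = 0.98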

Plot dataset

df %>% 
  ggplot(aes(x = gpa, y = age, color = Class, size = Class)) +
  geom_point() +
  scale_size_manual(values = c("0" = 5, "1" = 5), na.value = 1) +  # labeled points large, unlabeled small
  geom_linearclassifier("Nearest Mean"  = NearestMeanClassifier(Class ~ ., df),
                        "Self-Learning" = SelfLearning(Class ~ ., df,
                                                       method = NearestMeanClassifier))

This code creates a plot of the dataset with the Nearest Mean Classifier and the Self-Learning algorithm applied to it. It uses the ggplot() function to draw a scatter plot with gpa on the x-axis and age on the y-axis, with the points coloured by their Class value. Labeled points are drawn at size 5 via scale_size_manual(), while unlabeled (NA) points are drawn smaller (na.value = 1).

The geom_linearclassifier() function adds the linear decision boundaries of the classifiers to the plot. It takes named classifier objects as arguments; here it adds the decision boundaries of the Nearest Mean Classifier and of Self-Learning with NearestMeanClassifier as its base learner.

Visualize

        dt_data = cbind.data.frame(pred_c_lapsvm, data_semi)

        ggplot(dt_data, aes(x = age, y = gpa, col = as.factor(pred_c_lapsvm))) +
          geom_point(size = 5)

        ggplot(dt_data, aes(x = age, y = gpa, col = as.factor(pred_c_lapsvm))) +
          geom_point(size = 5) + facet_grid(~ Class) +   # facet by the true class stored in dt_data
          theme(legend.position = "none")

        table(pred_c_lapsvm, data_semi$Class)
##              
## pred_c_lapsvm  0  1
##             0 50  2
##             1  0 48

The first ggplot code creates a scatterplot of the data with the predicted classes from the Laplacian SVM algorithm shown in different colors. The x-axis represents the “age” feature, the y-axis represents the “gpa” feature, and the color of each point represents the predicted class of that point.

The second ggplot code creates a similar scatterplot, but it additionally separates the points into two facets based on their true class labels. This can be useful for visualizing how well the algorithm is performing on each class.

The table function at the end of the code shows the confusion matrix, which compares the predicted class labels to the true class labels.
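
Column-wise proportions turn the same comparison into per-class accuracies (a small follow-up using base R's prop.table()):

        # Within each true class, the fraction assigned to each predicted label
        prop.table(table(pred_c_lapsvm, data_semi$Class), margin = 2)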