1 Classifiers

1.1 Wine Data

There are no NA’s in the data.

sum(is.na(wine_data))
## [1] 0

The data seems to have skew.

wine_data %>% keep(is.numeric) %>% gather() %>% ggplot(aes(value)) + 
  facet_wrap(~ key, scales = "free") +   geom_density(color="lightblue", fill="blue")

Using the logarithm method, the skewed data was fixed.

wine_data %>% keep(is.numeric) %>% gather() %>% ggplot(aes(value)) + 
  facet_wrap(~ key, scales = "free") +   geom_density(color="lightblue", fill="blue")

KNN - Classifier

Logistic regression works best when we seek to classify with only two classes (either 0 or 1) - In other words, it is limited to only two-class classification problems.

After checking other classifiers, it was determined that there were other forms of classification algorithms that could be employed. The first was KNN.

Below, we will see how KNN was implemented.

gnum <- sample(1:nrow(wine_data), 0.9 * nrow(wine_data)) 
train_winedata <- wine_data[gnum,]
test_winedata <- wine_data[-gnum,] 
train_var <- wine_data[gnum,1]
test_var <-  wine_data[-gnum,1]

We had to specify the optimal number of KNN’s, this was gotten by finding the square root of the number of observations in the training dataset. k = 13.

sqrt(count(train_winedata))

Another way to check the accuracy - Below, the accuracy reads 100. which is great.

knn_mod <- knn(train_winedata,test_winedata,cl=train_var,k=13)
knn13  <- 100 * sum(test_var == knn_mod)/NROW(test_var)
knn13
## [1] 100

A confusion matrix has also been developed to give a breakdown of the model and to show the level of accuracy we have obtained from the KNN.

confusionMatrix(table(knn_mod ,test_var))
## Confusion Matrix and Statistics
## 
##        test_var
## knn_mod 1 2 3
##       1 3 0 0
##       2 0 9 0
##       3 0 0 6
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.8147, 1)
##     No Information Rate : 0.5        
##     P-Value [Acc > NIR] : 3.815e-06  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3
## Sensitivity            1.0000      1.0   1.0000
## Specificity            1.0000      1.0   1.0000
## Pos Pred Value         1.0000      1.0   1.0000
## Neg Pred Value         1.0000      1.0   1.0000
## Prevalence             0.1667      0.5   0.3333
## Detection Rate         0.1667      0.5   0.3333
## Detection Prevalence   0.1667      0.5   0.3333
## Balanced Accuracy      1.0000      1.0   1.0000

1.2 Cancer Data

Renaming Variables For Easy Computation & Less Complexity

cancer_data <- cancer_data[,-1]
names(cancer_data)[10] <- "Class"
names(cancer_data)[1] <- "Clump Thickness"
names(cancer_data)[2] <- "Uniformity of Cell Size"
names(cancer_data)[3] <- "Uniformity of Cell Shape"
names(cancer_data)[4] <- "Marginal Adhesion"
names(cancer_data)[5] <- "Epithelial Cell Size"
names(cancer_data)[6] <- "Bare Nuclei"
names(cancer_data)[7] <- "Bland Chromatin"
names(cancer_data)[8] <- "Normal Nucleoli"
names(cancer_data)[9] <- "Mitoses"

When the data was read, there were instances of NA’s generated by a “?” that was found in the data. It was replaced with the mode of the column.

which(is.na(cancer_data))
##  [1] 3519 3536 3635 3641 3654 3660 3731 3745 3771 3788 3790 3793 3811 3817 3907
## [16] 4113
sum(is.na(cancer_data$`Bare Nuclei`))
## [1] 0
table(cancer_data$Class)
## 
##   2   4 
## 458 241

We had to specify the optimal number of KNN’s, this was gotten by finding the square root of the number of observations in the training dataset. k = 25. A confusion matrix has also been developed to give a breakdown of the model and to show the level of accuracy we have obtained from the KNN.

KNN - Cancer Data

cancernum <- sample(1:nrow(cancer_data), 0.9 * nrow(cancer_data)) 
train_cancerdata <- cancer_data[cancernum,]
test_cancerdata <- cancer_data[-cancernum,] 
train_cancer_var <- cancer_data[cancernum,10]
test_cancer_var <-  cancer_data[-cancernum,10]

Another way to check the accuracy - Below, the accuracy reads 97. which is great.

The highest k should be a 25, after setting k=25, there was a 95% level of accuracy, out of curiosity, we decided to check other k’s in order to see the accuracy level and it was found that 20 was the best k with the highest level of accuracy that can be seen in the confusion matrix table formulated below.

sqrt(count(train_cancerdata))
knn_cancer <- knn(train_cancerdata,test_cancerdata,cl=train_cancer_var,k=20)
knn  <- 100 * sum(test_cancer_var == knn_cancer)/NROW(test_cancer_var)
table(knn_cancer ,test_cancer_var)
##           test_cancer_var
## knn_cancer  2  4
##          2 49  2
##          4  0 19
confusionMatrix(table(knn_cancer ,test_cancer_var))
## Confusion Matrix and Statistics
## 
##           test_cancer_var
## knn_cancer  2  4
##          2 49  2
##          4  0 19
##                                           
##                Accuracy : 0.9714          
##                  95% CI : (0.9006, 0.9965)
##     No Information Rate : 0.7             
##     P-Value [Acc > NIR] : 6.81e-09        
##                                           
##                   Kappa : 0.9301          
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9048          
##          Pos Pred Value : 0.9608          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.7000          
##          Detection Rate : 0.7000          
##    Detection Prevalence : 0.7286          
##       Balanced Accuracy : 0.9524          
##                                           
##        'Positive' Class : 2               
## 

Multiple Logistic Regression - Cancer Data

The response variable (y) needs to be coded as a 0 or a 1 not as a two level factor. Recode one level to be a zero and the other to 1 and it should work. 2 - 0 (Benign), 4 - 1 (Malignant)

cancer_lg <- glm(Class ~., data = train_cancerdata, family = "binomial")
summary(cancer_lg)
## 
## Call:
## glm(formula = Class ~ ., family = "binomial", data = train_cancerdata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.2573  -0.1311  -0.0696   0.0348   2.3851  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                -9.4746292  1.0834100  -8.745  < 2e-16 ***
## `Clump Thickness`           0.5006305  0.1362258   3.675 0.000238 ***
## `Uniformity of Cell Size`   0.0006865  0.1980133   0.003 0.997234    
## `Uniformity of Cell Shape`  0.2708559  0.2179547   1.243 0.213972    
## `Marginal Adhesion`         0.2114479  0.1154587   1.831 0.067045 .  
## `Epithelial Cell Size`      0.0929339  0.1545261   0.601 0.547565    
## `Bare Nuclei`               0.4429241  0.0929331   4.766 1.88e-06 ***
## `Bland Chromatin`           0.3668689  0.1615953   2.270 0.023190 *  
## `Normal Nucleoli`           0.1841442  0.1088510   1.692 0.090701 .  
## Mitoses                     0.5563356  0.2845122   1.955 0.050536 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 814.30  on 628  degrees of freedom
## Residual deviance: 106.54  on 619  degrees of freedom
## AIC: 126.54
## 
## Number of Fisher Scoring iterations: 8

Predictions on Test Data

#using prob
possibility <- cancer_lg %>% predict(test_cancerdata, type = "response")
head(possibility)
##           1           3           5          11          16          28 
## 0.017660473 0.010181544 0.016361256 0.002206665 0.520046460 0.012303680
predicted.classes <- ifelse(possibility > 0.5, 1, 0)
head(predicted.classes)
##  1  3  5 11 16 28 
##  0  0  0  0  1  0

After checking the accuracy, it was seen that it was 97%, which is pretty good.

mean(predicted.classes == test_cancerdata$Class)
## [1] 0.9857143

2 Regression

dim(loan_data)
## [1] 1000   21

The three ways we encoded the loan application data.

Applying Linear Regression

The first step is to use the linear regression approach to make predictions. We create a linear model as below,

Using lm shows that we have 6 significant columns that we can use the predict the loan amount. But in order to go deeper, we decided to use the stepAIC function which takes in the linear model and performs iterations of variable dropping until we get the optimal set of variables that will be needed for the prediction. It helps handle the feature selection efficiently and faster.

cols <- c(1,3,5:11,13:15,17:26)
loan_data[,cols] <- lapply(loan_data[,cols], factor)
str(loan_data)
## 'data.frame':    1000 obs. of  26 variables:
##  $ checking_balance                : Factor w/ 4 levels "0","1","2","3": 4 2 1 4 4 1 1 2 1 2 ...
##  $ months_loan_duration            : int  6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_history                  : Factor w/ 5 levels "0","1","2","3",..: 1 3 1 3 2 3 3 3 3 1 ...
##  $ amount                          : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ savings_balance_.ex.real.estate.: Factor w/ 5 levels "0","1","2","3",..: 1 5 5 5 5 1 2 5 4 5 ...
##  $ employment_length               : Factor w/ 5 levels "0","1","2","3",..: 5 3 4 4 3 3 5 3 4 1 ...
##  $ installment_rate                : Factor w/ 4 levels "1","2","3","4": 4 2 2 2 3 2 3 2 2 4 ...
##  $ personal_status                 : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
##  $ other_debtors                   : Factor w/ 3 levels "0","1","2": 1 1 1 2 1 1 1 1 1 1 ...
##  $ residence_history               : Factor w/ 4 levels "1","2","3","4": 4 2 3 4 4 4 4 2 4 2 ...
##  $ property                        : Factor w/ 3 levels "0","1","2": 3 3 3 3 1 1 3 2 3 2 ...
##  $ age                             : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ installment_plan                : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ housing                         : Factor w/ 3 levels "0","1","2": 3 3 3 1 1 1 3 2 3 3 ...
##  $ existing_credit_facilities      : Factor w/ 4 levels "1","2","3","4": 2 1 1 1 2 1 1 1 1 2 ...
##  $ default.1.no.                   : int  1 2 1 1 2 1 1 1 1 2 ...
##  $ dependents                      : Factor w/ 2 levels "1","2": 1 1 2 2 2 2 1 1 1 1 ...
##  $ landline                        : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 2 1 1 ...
##  $ foreign_worker                  : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ purpose_of_loan.skilldev        : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 1 1 1 ...
##  $ purpose_of_loan.homemaintenance : Factor w/ 2 levels "0","1": 2 2 1 2 1 1 2 1 2 1 ...
##  $ purpose_of_loan.vehicle         : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 2 ...
##  $ jobmangement.self.employed      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 2 ...
##  $ jobskilled.employee             : Factor w/ 2 levels "0","1": 2 2 1 2 2 1 2 1 1 1 ...
##  $ jobunemployed.non.resident      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ jobunskilled.resident           : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 1 2 1 ...
loan_train <- loan_data[1:750,]
loan_test <- loan_data[751:1000,]

linear_mod_loan <- lm(amount ~., data = loan_train)
summary(linear_mod_loan)
## 
## Call:
## lm(formula = amount ~ ., data = loan_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5245.7 -1051.5  -128.9   691.5 11073.6 
## 
## Coefficients: (1 not defined because of singularities)
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        2861.237    951.798   3.006 0.002740 ** 
## checking_balance1                   223.297    174.692   1.278 0.201589    
## checking_balance2                  -617.568    276.077  -2.237 0.025602 *  
## checking_balance3                   -95.135    181.753  -0.523 0.600842    
## months_loan_duration                124.793      5.968  20.912  < 2e-16 ***
## credit_history1                       9.072    261.101   0.035 0.972293    
## credit_history2                    -136.886    192.253  -0.712 0.476693    
## credit_history3                     662.310    352.665   1.878 0.060792 .  
## credit_history4                     110.557    362.942   0.305 0.760749    
## savings_balance_.ex.real.estate.1  -880.300    306.432  -2.873 0.004192 ** 
## savings_balance_.ex.real.estate.2  -527.354    258.476  -2.040 0.041699 *  
## savings_balance_.ex.real.estate.3  -419.855    334.585  -1.255 0.209948    
## savings_balance_.ex.real.estate.4  -436.290    187.201  -2.331 0.020056 *  
## employment_length1                  273.655    369.796   0.740 0.459537    
## employment_length2                   38.920    348.524   0.112 0.911117    
## employment_length3                  172.687    367.615   0.470 0.638679    
## employment_length4                 -157.398    348.086  -0.452 0.651277    
## installment_rate2                  -861.124    227.663  -3.782 0.000168 ***
## installment_rate3                 -1530.881    246.761  -6.204 9.40e-10 ***
## installment_rate4                 -2434.933    205.712 -11.837  < 2e-16 ***
## personal_status1                   -411.204    151.284  -2.718 0.006728 ** 
## other_debtors1                     -243.874    312.869  -0.779 0.435960    
## other_debtors2                      864.825    342.779   2.523 0.011856 *  
## residence_history2                  313.184    223.629   1.400 0.161814    
## residence_history3                   10.494    251.541   0.042 0.966734    
## residence_history4                  164.390    225.194   0.730 0.465638    
## property1                          -720.924    311.670  -2.313 0.021004 *  
## property2                          -859.569    306.904  -2.801 0.005238 ** 
## age                                   4.704      6.996   0.672 0.501529    
## installment_plan1                   -16.279    324.585  -0.050 0.960015    
## installment_plan2                  -246.877    199.169  -1.240 0.215561    
## housing1                            280.167    372.181   0.753 0.451839    
## housing2                            130.455    355.285   0.367 0.713591    
## existing_credit_facilities2          57.907    179.784   0.322 0.747478    
## existing_credit_facilities3         360.723    447.366   0.806 0.420327    
## existing_credit_facilities4         607.167    904.171   0.672 0.502111    
## default.1.no.                       164.473    163.841   1.004 0.315792    
## dependents2                        -131.456    193.917  -0.678 0.498058    
## landline1                           541.066    152.131   3.557 0.000401 ***
## foreign_worker1                      53.071    371.804   0.143 0.886537    
## purpose_of_loan.skilldev1          -594.688    619.877  -0.959 0.337704    
## purpose_of_loan.homemaintenance1   -627.635    611.515  -1.026 0.305074    
## purpose_of_loan.vehicle1           -303.121    607.462  -0.499 0.617939    
## jobmangement.self.employed1        1208.181    266.626   4.531 6.88e-06 ***
## jobskilled.employee1                -80.609    178.120  -0.453 0.651008    
## jobunemployed.non.resident1        -527.262    571.300  -0.923 0.356368    
## jobunskilled.resident1                   NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1762 on 704 degrees of freedom
## Multiple R-squared:  0.6156, Adjusted R-squared:  0.591 
## F-statistic: 25.05 on 45 and 704 DF,  p-value: < 2.2e-16

Obtaining Significant Values For Linear Regression

Below, we will notice that the last variable possesses NA’s. NA as a coefficient in a this regression indicates that the variable in question is linearly related to the other variables. As a result of that, there’s no unique solution to the regression without dropping one of the variables.

Only 59% of the data has been covered or explained with the usage of the model.

step_loan_data <- stepAIC(linear_mod_loan, direction="both", trace = 0)
summary(step_loan_data)
## 
## Call:
## lm(formula = amount ~ checking_balance + months_loan_duration + 
##     savings_balance_.ex.real.estate. + installment_rate + personal_status + 
##     other_debtors + property + landline + purpose_of_loan.homemaintenance + 
##     jobmangement.self.employed, data = loan_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4802.3 -1025.8  -120.3   674.8 10841.5 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        2842.708    329.437   8.629  < 2e-16 ***
## checking_balance1                   267.312    163.412   1.636 0.102309    
## checking_balance2                  -614.435    269.071  -2.284 0.022685 *  
## checking_balance3                   -42.314    168.439  -0.251 0.801720    
## months_loan_duration                127.065      5.574  22.798  < 2e-16 ***
## savings_balance_.ex.real.estate.1  -919.323    301.085  -3.053 0.002345 ** 
## savings_balance_.ex.real.estate.2  -440.053    253.099  -1.739 0.082516 .  
## savings_balance_.ex.real.estate.3  -397.105    329.478  -1.205 0.228496    
## savings_balance_.ex.real.estate.4  -361.668    181.946  -1.988 0.047210 *  
## installment_rate2                  -856.338    223.033  -3.840 0.000134 ***
## installment_rate3                 -1500.217    242.408  -6.189 1.01e-09 ***
## installment_rate4                 -2439.025    199.228 -12.242  < 2e-16 ***
## personal_status1                   -352.045    139.970  -2.515 0.012112 *  
## other_debtors1                     -282.844    303.096  -0.933 0.351034    
## other_debtors2                      887.540    329.640   2.692 0.007255 ** 
## property1                          -557.604    201.795  -2.763 0.005867 ** 
## property2                          -709.971    199.381  -3.561 0.000394 ***
## landline1                           551.101    144.943   3.802 0.000155 ***
## purpose_of_loan.homemaintenance1   -241.672    133.521  -1.810 0.070708 .  
## jobmangement.self.employed1        1292.481    200.710   6.440 2.17e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1755 on 730 degrees of freedom
## Multiple R-squared:  0.6045, Adjusted R-squared:  0.5942 
## F-statistic: 58.73 on 19 and 730 DF,  p-value: < 2.2e-16

The Idea Behind StepAIC

The basic idea of AIC is to penalize the inclusion of additional variables to a model. It adds a penalty that increases the error when including additional terms. The lower the AIC, the better the model. Perfect way to handle the data and model creation.

predicted_loan_amount <- predict(step_loan_data,loan_test)
loan_test$predicted <- predicted_loan_amount
ggplot(loan_test, aes(x=predicted,y=amount, xlab = "Actual Amount", ylab = "Predicted Amount")) + geom_point(size=2) + geom_abline()

previ <- cbind(loan_test$amount,loan_test$predicted)

Lasso Regression

The optimal lambda to create the model is noted as 18.47244.

y <- loan_data$amount
x <- data.matrix(loan_data[,-4])
cross_valid_model <- cv.glmnet(x, y, alpha = 1)

optimal_lambda <- cross_valid_model$lambda.min
optimal_lambda
## [1] 20.27348
optimal_mod <- glmnet(x, y, alpha = 1, lambda = optimal_lambda)
coef(optimal_mod)
## 26 x 1 sparse Matrix of class "dgCMatrix"
##                                           s0
## (Intercept)                      2859.826190
## checking_balance                  -61.723901
## months_loan_duration              132.263674
## credit_history                      .       
## savings_balance_.ex.real.estate.  -41.457778
## employment_length                 -34.065965
## installment_rate                 -803.081480
## personal_status                  -259.024960
## other_debtors                     230.816134
## residence_history                   .       
## property                         -260.261726
## age                                 2.896487
## installment_plan                  -23.566641
## housing                           -38.985912
## existing_credit_facilities        100.714069
## default.1.no.                     299.466902
## dependents                        -23.960244
## landline                          509.171825
## foreign_worker                   -189.552527
## purpose_of_loan.skilldev         -274.283150
## purpose_of_loan.homemaintenance  -392.455846
## purpose_of_loan.vehicle             .       
## jobmangement.self.employed       1401.428908
## jobskilled.employee                 .       
## jobunemployed.non.resident       -376.289381
## jobunskilled.resident               .

Obtaining Significant Values For Lasso

No coefficient is shown for some of the predictors because the lasso regression shrunk the coefficient all the way to zero. This means it was completely dropped from the model because it wasn’t influential enough. Lasso regression has the potential to remove predictors from the model by shrinking the coefficients completely to zero.

After calculation, It was discovered that using Lasso, gave an accuracy of only 58%

y_predicted <- predict(optimal_mod, s = optimal_lambda, newx = x)

#find SST and SSE
sst <- sum((y - mean(y))^2)
sse <- sum((y_predicted - y)^2)

#find R-Squared
rsq <- 1 - sse/sst
rsq
## [1] 0.5810503