There are no NA’s in the data.
sum(is.na(wine_data))
## [1] 0
The data seems to have skew.
wine_data %>% keep(is.numeric) %>% gather() %>% ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") + geom_density(color="lightblue", fill="blue")
Using the logarithm method, the skewed data was fixed.
wine_data %>% keep(is.numeric) %>% gather() %>% ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") + geom_density(color="lightblue", fill="blue")
Logistic regression works best when we seek to classify with only two classes (either 0 or 1) - In other words, it is limited to only two-class classification problems.
After checking other classifiers, it was determined that there were other forms of classification algorithms that could be employed. The first was KNN.
Below, we will see how KNN was implemented.
gnum <- sample(1:nrow(wine_data), 0.9 * nrow(wine_data))
train_winedata <- wine_data[gnum,]
test_winedata <- wine_data[-gnum,]
train_var <- wine_data[gnum,1]
test_var <- wine_data[-gnum,1]
We had to specify the optimal number of KNN’s, this was gotten by finding the square root of the number of observations in the training dataset. k = 13.
sqrt(count(train_winedata))
Another way to check the accuracy - Below, the accuracy reads 100. which is great.
knn_mod <- knn(train_winedata,test_winedata,cl=train_var,k=13)
knn13 <- 100 * sum(test_var == knn_mod)/NROW(test_var)
knn13
## [1] 100
A confusion matrix has also been developed to give a breakdown of the model and to show the level of accuracy we have obtained from the KNN.
confusionMatrix(table(knn_mod ,test_var))
## Confusion Matrix and Statistics
##
## test_var
## knn_mod 1 2 3
## 1 3 0 0
## 2 0 9 0
## 3 0 0 6
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.8147, 1)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 3.815e-06
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 1 Class: 2 Class: 3
## Sensitivity 1.0000 1.0 1.0000
## Specificity 1.0000 1.0 1.0000
## Pos Pred Value 1.0000 1.0 1.0000
## Neg Pred Value 1.0000 1.0 1.0000
## Prevalence 0.1667 0.5 0.3333
## Detection Rate 0.1667 0.5 0.3333
## Detection Prevalence 0.1667 0.5 0.3333
## Balanced Accuracy 1.0000 1.0 1.0000
cancer_data <- cancer_data[,-1]
names(cancer_data)[10] <- "Class"
names(cancer_data)[1] <- "Clump Thickness"
names(cancer_data)[2] <- "Uniformity of Cell Size"
names(cancer_data)[3] <- "Uniformity of Cell Shape"
names(cancer_data)[4] <- "Marginal Adhesion"
names(cancer_data)[5] <- "Epithelial Cell Size"
names(cancer_data)[6] <- "Bare Nuclei"
names(cancer_data)[7] <- "Bland Chromatin"
names(cancer_data)[8] <- "Normal Nucleoli"
names(cancer_data)[9] <- "Mitoses"
When the data was read, there were instances of NA’s generated by a “?” that was found in the data. It was replaced with the mode of the column.
which(is.na(cancer_data))
## [1] 3519 3536 3635 3641 3654 3660 3731 3745 3771 3788 3790 3793 3811 3817 3907
## [16] 4113
sum(is.na(cancer_data$`Bare Nuclei`))
## [1] 0
table(cancer_data$Class)
##
## 2 4
## 458 241
We had to specify the optimal number of KNN’s, this was gotten by finding the square root of the number of observations in the training dataset. k = 25. A confusion matrix has also been developed to give a breakdown of the model and to show the level of accuracy we have obtained from the KNN.
cancernum <- sample(1:nrow(cancer_data), 0.9 * nrow(cancer_data))
train_cancerdata <- cancer_data[cancernum,]
test_cancerdata <- cancer_data[-cancernum,]
train_cancer_var <- cancer_data[cancernum,10]
test_cancer_var <- cancer_data[-cancernum,10]
Another way to check the accuracy - Below, the accuracy reads 97. which is great.
The highest k should be a 25, after setting k=25, there was a 95% level of accuracy, out of curiosity, we decided to check other k’s in order to see the accuracy level and it was found that 20 was the best k with the highest level of accuracy that can be seen in the confusion matrix table formulated below.
sqrt(count(train_cancerdata))
knn_cancer <- knn(train_cancerdata,test_cancerdata,cl=train_cancer_var,k=20)
knn <- 100 * sum(test_cancer_var == knn_cancer)/NROW(test_cancer_var)
table(knn_cancer ,test_cancer_var)
## test_cancer_var
## knn_cancer 2 4
## 2 49 2
## 4 0 19
confusionMatrix(table(knn_cancer ,test_cancer_var))
## Confusion Matrix and Statistics
##
## test_cancer_var
## knn_cancer 2 4
## 2 49 2
## 4 0 19
##
## Accuracy : 0.9714
## 95% CI : (0.9006, 0.9965)
## No Information Rate : 0.7
## P-Value [Acc > NIR] : 6.81e-09
##
## Kappa : 0.9301
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : 1.0000
## Specificity : 0.9048
## Pos Pred Value : 0.9608
## Neg Pred Value : 1.0000
## Prevalence : 0.7000
## Detection Rate : 0.7000
## Detection Prevalence : 0.7286
## Balanced Accuracy : 0.9524
##
## 'Positive' Class : 2
##
The response variable (y) needs to be coded as a 0 or a 1 not as a two level factor. Recode one level to be a zero and the other to 1 and it should work. 2 - 0 (Benign), 4 - 1 (Malignant)
cancer_lg <- glm(Class ~., data = train_cancerdata, family = "binomial")
summary(cancer_lg)
##
## Call:
## glm(formula = Class ~ ., family = "binomial", data = train_cancerdata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2573 -0.1311 -0.0696 0.0348 2.3851
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.4746292 1.0834100 -8.745 < 2e-16 ***
## `Clump Thickness` 0.5006305 0.1362258 3.675 0.000238 ***
## `Uniformity of Cell Size` 0.0006865 0.1980133 0.003 0.997234
## `Uniformity of Cell Shape` 0.2708559 0.2179547 1.243 0.213972
## `Marginal Adhesion` 0.2114479 0.1154587 1.831 0.067045 .
## `Epithelial Cell Size` 0.0929339 0.1545261 0.601 0.547565
## `Bare Nuclei` 0.4429241 0.0929331 4.766 1.88e-06 ***
## `Bland Chromatin` 0.3668689 0.1615953 2.270 0.023190 *
## `Normal Nucleoli` 0.1841442 0.1088510 1.692 0.090701 .
## Mitoses 0.5563356 0.2845122 1.955 0.050536 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 814.30 on 628 degrees of freedom
## Residual deviance: 106.54 on 619 degrees of freedom
## AIC: 126.54
##
## Number of Fisher Scoring iterations: 8
#using prob
possibility <- cancer_lg %>% predict(test_cancerdata, type = "response")
head(possibility)
## 1 3 5 11 16 28
## 0.017660473 0.010181544 0.016361256 0.002206665 0.520046460 0.012303680
predicted.classes <- ifelse(possibility > 0.5, 1, 0)
head(predicted.classes)
## 1 3 5 11 16 28
## 0 0 0 0 1 0
After checking the accuracy, it was seen that it was 97%, which is pretty good.
mean(predicted.classes == test_cancerdata$Class)
## [1] 0.9857143
dim(loan_data)
## [1] 1000 21
The three ways we encoded the loan application data.
The first step is to use the linear regression approach to make predictions. We create a linear model as below,
Using lm shows that we have 6 significant columns that we can use the predict the loan amount. But in order to go deeper, we decided to use the stepAIC function which takes in the linear model and performs iterations of variable dropping until we get the optimal set of variables that will be needed for the prediction. It helps handle the feature selection efficiently and faster.
cols <- c(1,3,5:11,13:15,17:26)
loan_data[,cols] <- lapply(loan_data[,cols], factor)
str(loan_data)
## 'data.frame': 1000 obs. of 26 variables:
## $ checking_balance : Factor w/ 4 levels "0","1","2","3": 4 2 1 4 4 1 1 2 1 2 ...
## $ months_loan_duration : int 6 48 12 42 24 36 24 36 12 30 ...
## $ credit_history : Factor w/ 5 levels "0","1","2","3",..: 1 3 1 3 2 3 3 3 3 1 ...
## $ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ savings_balance_.ex.real.estate.: Factor w/ 5 levels "0","1","2","3",..: 1 5 5 5 5 1 2 5 4 5 ...
## $ employment_length : Factor w/ 5 levels "0","1","2","3",..: 5 3 4 4 3 3 5 3 4 1 ...
## $ installment_rate : Factor w/ 4 levels "1","2","3","4": 4 2 2 2 3 2 3 2 2 4 ...
## $ personal_status : Factor w/ 2 levels "0","1": 1 2 1 1 1 1 1 1 1 1 ...
## $ other_debtors : Factor w/ 3 levels "0","1","2": 1 1 1 2 1 1 1 1 1 1 ...
## $ residence_history : Factor w/ 4 levels "1","2","3","4": 4 2 3 4 4 4 4 2 4 2 ...
## $ property : Factor w/ 3 levels "0","1","2": 3 3 3 3 1 1 3 2 3 2 ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ installment_plan : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ housing : Factor w/ 3 levels "0","1","2": 3 3 3 1 1 1 3 2 3 3 ...
## $ existing_credit_facilities : Factor w/ 4 levels "1","2","3","4": 2 1 1 1 2 1 1 1 1 2 ...
## $ default.1.no. : int 1 2 1 1 2 1 1 1 1 2 ...
## $ dependents : Factor w/ 2 levels "1","2": 1 1 2 2 2 2 1 1 1 1 ...
## $ landline : Factor w/ 2 levels "0","1": 2 1 1 1 1 2 1 2 1 1 ...
## $ foreign_worker : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ purpose_of_loan.skilldev : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 1 1 1 ...
## $ purpose_of_loan.homemaintenance : Factor w/ 2 levels "0","1": 2 2 1 2 1 1 2 1 2 1 ...
## $ purpose_of_loan.vehicle : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 1 2 1 2 ...
## $ jobmangement.self.employed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 2 ...
## $ jobskilled.employee : Factor w/ 2 levels "0","1": 2 2 1 2 2 1 2 1 1 1 ...
## $ jobunemployed.non.resident : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ jobunskilled.resident : Factor w/ 2 levels "0","1": 1 1 2 1 1 2 1 1 2 1 ...
loan_train <- loan_data[1:750,]
loan_test <- loan_data[751:1000,]
linear_mod_loan <- lm(amount ~., data = loan_train)
summary(linear_mod_loan)
##
## Call:
## lm(formula = amount ~ ., data = loan_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5245.7 -1051.5 -128.9 691.5 11073.6
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2861.237 951.798 3.006 0.002740 **
## checking_balance1 223.297 174.692 1.278 0.201589
## checking_balance2 -617.568 276.077 -2.237 0.025602 *
## checking_balance3 -95.135 181.753 -0.523 0.600842
## months_loan_duration 124.793 5.968 20.912 < 2e-16 ***
## credit_history1 9.072 261.101 0.035 0.972293
## credit_history2 -136.886 192.253 -0.712 0.476693
## credit_history3 662.310 352.665 1.878 0.060792 .
## credit_history4 110.557 362.942 0.305 0.760749
## savings_balance_.ex.real.estate.1 -880.300 306.432 -2.873 0.004192 **
## savings_balance_.ex.real.estate.2 -527.354 258.476 -2.040 0.041699 *
## savings_balance_.ex.real.estate.3 -419.855 334.585 -1.255 0.209948
## savings_balance_.ex.real.estate.4 -436.290 187.201 -2.331 0.020056 *
## employment_length1 273.655 369.796 0.740 0.459537
## employment_length2 38.920 348.524 0.112 0.911117
## employment_length3 172.687 367.615 0.470 0.638679
## employment_length4 -157.398 348.086 -0.452 0.651277
## installment_rate2 -861.124 227.663 -3.782 0.000168 ***
## installment_rate3 -1530.881 246.761 -6.204 9.40e-10 ***
## installment_rate4 -2434.933 205.712 -11.837 < 2e-16 ***
## personal_status1 -411.204 151.284 -2.718 0.006728 **
## other_debtors1 -243.874 312.869 -0.779 0.435960
## other_debtors2 864.825 342.779 2.523 0.011856 *
## residence_history2 313.184 223.629 1.400 0.161814
## residence_history3 10.494 251.541 0.042 0.966734
## residence_history4 164.390 225.194 0.730 0.465638
## property1 -720.924 311.670 -2.313 0.021004 *
## property2 -859.569 306.904 -2.801 0.005238 **
## age 4.704 6.996 0.672 0.501529
## installment_plan1 -16.279 324.585 -0.050 0.960015
## installment_plan2 -246.877 199.169 -1.240 0.215561
## housing1 280.167 372.181 0.753 0.451839
## housing2 130.455 355.285 0.367 0.713591
## existing_credit_facilities2 57.907 179.784 0.322 0.747478
## existing_credit_facilities3 360.723 447.366 0.806 0.420327
## existing_credit_facilities4 607.167 904.171 0.672 0.502111
## default.1.no. 164.473 163.841 1.004 0.315792
## dependents2 -131.456 193.917 -0.678 0.498058
## landline1 541.066 152.131 3.557 0.000401 ***
## foreign_worker1 53.071 371.804 0.143 0.886537
## purpose_of_loan.skilldev1 -594.688 619.877 -0.959 0.337704
## purpose_of_loan.homemaintenance1 -627.635 611.515 -1.026 0.305074
## purpose_of_loan.vehicle1 -303.121 607.462 -0.499 0.617939
## jobmangement.self.employed1 1208.181 266.626 4.531 6.88e-06 ***
## jobskilled.employee1 -80.609 178.120 -0.453 0.651008
## jobunemployed.non.resident1 -527.262 571.300 -0.923 0.356368
## jobunskilled.resident1 NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1762 on 704 degrees of freedom
## Multiple R-squared: 0.6156, Adjusted R-squared: 0.591
## F-statistic: 25.05 on 45 and 704 DF, p-value: < 2.2e-16
Below, we will notice that the last variable possesses NA’s. NA as a coefficient in a this regression indicates that the variable in question is linearly related to the other variables. As a result of that, there’s no unique solution to the regression without dropping one of the variables.
Only 59% of the data has been covered or explained with the usage of the model.
step_loan_data <- stepAIC(linear_mod_loan, direction="both", trace = 0)
summary(step_loan_data)
##
## Call:
## lm(formula = amount ~ checking_balance + months_loan_duration +
## savings_balance_.ex.real.estate. + installment_rate + personal_status +
## other_debtors + property + landline + purpose_of_loan.homemaintenance +
## jobmangement.self.employed, data = loan_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4802.3 -1025.8 -120.3 674.8 10841.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2842.708 329.437 8.629 < 2e-16 ***
## checking_balance1 267.312 163.412 1.636 0.102309
## checking_balance2 -614.435 269.071 -2.284 0.022685 *
## checking_balance3 -42.314 168.439 -0.251 0.801720
## months_loan_duration 127.065 5.574 22.798 < 2e-16 ***
## savings_balance_.ex.real.estate.1 -919.323 301.085 -3.053 0.002345 **
## savings_balance_.ex.real.estate.2 -440.053 253.099 -1.739 0.082516 .
## savings_balance_.ex.real.estate.3 -397.105 329.478 -1.205 0.228496
## savings_balance_.ex.real.estate.4 -361.668 181.946 -1.988 0.047210 *
## installment_rate2 -856.338 223.033 -3.840 0.000134 ***
## installment_rate3 -1500.217 242.408 -6.189 1.01e-09 ***
## installment_rate4 -2439.025 199.228 -12.242 < 2e-16 ***
## personal_status1 -352.045 139.970 -2.515 0.012112 *
## other_debtors1 -282.844 303.096 -0.933 0.351034
## other_debtors2 887.540 329.640 2.692 0.007255 **
## property1 -557.604 201.795 -2.763 0.005867 **
## property2 -709.971 199.381 -3.561 0.000394 ***
## landline1 551.101 144.943 3.802 0.000155 ***
## purpose_of_loan.homemaintenance1 -241.672 133.521 -1.810 0.070708 .
## jobmangement.self.employed1 1292.481 200.710 6.440 2.17e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1755 on 730 degrees of freedom
## Multiple R-squared: 0.6045, Adjusted R-squared: 0.5942
## F-statistic: 58.73 on 19 and 730 DF, p-value: < 2.2e-16
The basic idea of AIC is to penalize the inclusion of additional variables to a model. It adds a penalty that increases the error when including additional terms. The lower the AIC, the better the model. Perfect way to handle the data and model creation.
predicted_loan_amount <- predict(step_loan_data,loan_test)
loan_test$predicted <- predicted_loan_amount
ggplot(loan_test, aes(x=predicted,y=amount, xlab = "Actual Amount", ylab = "Predicted Amount")) + geom_point(size=2) + geom_abline()
previ <- cbind(loan_test$amount,loan_test$predicted)
The optimal lambda to create the model is noted as 18.47244.
y <- loan_data$amount
x <- data.matrix(loan_data[,-4])
cross_valid_model <- cv.glmnet(x, y, alpha = 1)
optimal_lambda <- cross_valid_model$lambda.min
optimal_lambda
## [1] 20.27348
optimal_mod <- glmnet(x, y, alpha = 1, lambda = optimal_lambda)
coef(optimal_mod)
## 26 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 2859.826190
## checking_balance -61.723901
## months_loan_duration 132.263674
## credit_history .
## savings_balance_.ex.real.estate. -41.457778
## employment_length -34.065965
## installment_rate -803.081480
## personal_status -259.024960
## other_debtors 230.816134
## residence_history .
## property -260.261726
## age 2.896487
## installment_plan -23.566641
## housing -38.985912
## existing_credit_facilities 100.714069
## default.1.no. 299.466902
## dependents -23.960244
## landline 509.171825
## foreign_worker -189.552527
## purpose_of_loan.skilldev -274.283150
## purpose_of_loan.homemaintenance -392.455846
## purpose_of_loan.vehicle .
## jobmangement.self.employed 1401.428908
## jobskilled.employee .
## jobunemployed.non.resident -376.289381
## jobunskilled.resident .
No coefficient is shown for some of the predictors because the lasso regression shrunk the coefficient all the way to zero. This means it was completely dropped from the model because it wasn’t influential enough. Lasso regression has the potential to remove predictors from the model by shrinking the coefficients completely to zero.
After calculation, It was discovered that using Lasso, gave an accuracy of only 58%
y_predicted <- predict(optimal_mod, s = optimal_lambda, newx = x)
#find SST and SSE
sst <- sum((y - mean(y))^2)
sse <- sum((y_predicted - y)^2)
#find R-Squared
rsq <- 1 - sse/sst
rsq
## [1] 0.5810503