This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button, a document will be generated that includes both the content and the output of any embedded R code chunks, like the ones below. Credit for this walkthrough goes to https://www.youtube.com/watch?v=0gf5iLTbiQM.
I really enjoyed learning how linear regression can be used for classification. Let's start by installing and loading the mlbench package, which provides the PimaIndiansDiabetes2 data set.
install.packages("mlbench")
library(mlbench)
data("PimaIndiansDiabetes2")
Let's remove the rows containing NA values and convert the diabetes column to numeric. The diabetes column is a factor whose underlying numeric codes are 1 for neg and 2 for pos, so converting it with as.numeric() and subtracting 1 gives 0 for negative and 1 for positive.
pidna<-na.omit(PimaIndiansDiabetes2)
pidna1<-pidna
pidna1$diabetes<-as.numeric(pidna1$diabetes)-1
head(pidna1)
##    pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 4         1      89       66      23      94 28.1    0.167  21        0
## 5         0     137       40      35     168 43.1    2.288  33        1
## 7         3      78       50      32      88 31.0    0.248  26        1
## 9         2     197       70      45     543 30.5    0.158  53        1
## 14        1     189       60      23     846 30.1    0.398  59        1
## 15        5     166       72      19     175 25.8    0.587  51        1
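Since the recoding relies on the order of the factor levels, it may be worth a quick sanity check that the levels of the diabetes factor really are neg then pos. A minimal sketch that only reuses objects defined above:

levels(pidna$diabetes)                  # should be "neg" "pos"
table(pidna$diabetes, pidna1$diabetes)  # neg should line up with 0 and pos with 1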
Now let's use the first 300 rows as the training data and the remaining rows as the test data.
train<-pidna1[1:300,]
test<-pidna1[301:392,]
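As a quick check, this split leaves 300 training rows and 92 test rows out of the 392 complete cases. The snippet below verifies the sizes and also sketches a random split one could use instead of simply taking the first 300 rows in order; the random version (train_r/test_r, with a hypothetical seed) is not used in the rest of this document.

nrow(train)  # 300
nrow(test)   # 92

# alternative: a random 300/92 split instead of the first 300 rows
set.seed(1)                   # hypothetical seed, only for reproducibility
idx <- sample(nrow(pidna1), 300)
train_r <- pidna1[idx, ]      # not used below
test_r  <- pidna1[-idx, ]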
Now let's apply linear regression for the classification. In lm(), the formula diabetes ~ . regresses the Y variable (diabetes) on all of the remaining columns as X variables.
lm_res<-lm(diabetes~.,data=pidna1)
summary(lm_res)
##
## Call:
## lm(formula = diabetes ~ ., data = pidna1)
##
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.07966 -0.25711 -0.06177  0.25851  1.03750 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.103e+00  1.436e-01  -7.681 1.34e-13 ***
## pregnant     1.295e-02  8.364e-03   1.549  0.12230    
## glucose      6.409e-03  8.159e-04   7.855 4.07e-14 ***
## pressure     5.465e-05  1.730e-03   0.032  0.97482    
## triceps      1.678e-03  2.522e-03   0.665  0.50631    
## insulin     -1.233e-04  2.045e-04  -0.603  0.54681    
## mass         9.325e-03  3.901e-03   2.391  0.01730 *  
## pedigree     1.572e-01  5.804e-02   2.708  0.00707 ** 
## age          5.878e-03  2.787e-03   2.109  0.03559 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3853 on 383 degrees of freedom
## Multiple R-squared: 0.3458, Adjusted R-squared: 0.3321
## F-statistic: 25.3 on 8 and 383 DF, p-value: < 2.2e-16
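One thing to note: lm_res above was fit on the full pidna1 data set rather than only on the 300 training rows, so the test rows were also seen during fitting. A minimal sketch of fitting on just the training data (lm_train is a name introduced here; its coefficients and summary would differ somewhat from the output shown above):

lm_train <- lm(diabetes ~ ., data = train)   # fit only on the first 300 rows
summary(lm_train)$r.squared                  # training R-squared for comparison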
Glucose is the most significant predictor: the larger the glucose value, the more likely the patient is diabetic. Mass is also significant (larger mass, more likely diabetic), and so are pedigree and age. The R-squared is only around 33%, but we can still go ahead and see how well the model classifies the patients in the test data.
predicted<-predict(lm_res,newdata=test)
predicted
##          577          585          589          592          594 
##  0.244420572  0.447303389  0.892430896  0.247853068  0.154082438 
##          595          596          598          600          604 
##  0.442751307  0.642212947 -0.018711274  0.058500664  0.741087495 
##          607          608          609          610          611 
##  0.808880319 -0.055306518  0.495926885  0.005068651  0.109302808 
##          612          613          615          618          621 
##  0.679045087  0.815361134  0.672183415 -0.254612874  0.257751243 
##          624          626          632          634          638 
##  0.118291584  0.181823036  0.111463508  0.144838721  0.083730581 
##          639          640          641          645          646 
##  0.362644439 -0.060517115  0.112448223  0.162610046  0.502681045 
##          647          648          649          651          652 
##  0.477760667  0.641301903  0.510290979 -0.067016047  0.236016628 
##          653          655          656          657          658 
##  0.336847701  0.094288889  0.443652110 -0.021751634  0.525936449 
##          660          663          664          666          669 
##  0.177437266  0.754965198  0.697322803  0.191315463  0.275578183 
##          670          671          673          674          680 
##  0.621513760  0.759605234  0.153076811  0.560988898  0.026878386 
##          681          683          686          689          690 
## -0.266237055  0.164858232  0.321757633  0.323023058  0.645496863 
##          693          694          696          697          699 
##  0.383535752  0.565978584  0.440350376  0.542901171  0.347472158 
##          701          705          708          710          711 
##  0.293380017  0.121538402  0.211310095  0.152168116  0.404701043 
##          712          714          716          717          719 
##  0.396564616  0.181857019  0.841957328  0.741620901  0.198243687 
##          722          723          724          727          731 
##  0.204339241  0.476820925  0.404912327  0.245937215  0.317860400 
##          733          734          737          739          741 
##  0.747986493  0.097251081  0.199898993  0.105687552  0.656768356 
##          742          743          745          746          748 
##  0.117138330  0.057377935  0.888220259  0.367467410  0.287295888 
##          749          752          754          756          761 
##  0.765829806  0.315565924  0.664499377  0.511247313  0.046555498 
##          764          766 
##  0.440586252  0.225402089 
From this output we can decide, based on the predicted value, whether a patient is classified as diabetic or not. For example, if the predicted value is greater than 0.5 we classify the patient as diabetic, and if it is 0.5 or below we classify her as not diabetic. Note that because this is ordinary linear regression, some predictions fall outside the 0 to 1 range (a few of the values above are even negative).
TAB<- table(test$diabetes,predicted > 0.5)
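The confusion matrix itself is not printed in the original output, but its counts are described below. A small sketch of displaying it and of turning the numeric predictions into hard 0/1 labels (pred_class is a name introduced here for illustration):

TAB                                          # rows: actual diabetes (0/1); columns: predicted > 0.5 (FALSE/TRUE)
pred_class <- ifelse(predicted > 0.5, 1, 0)  # hard class labels at the 0.5 threshold
head(pred_class)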
TAB shows there were 62 cases with a diabetes value of 0 (negative). Of these, 56 were FALSE (predicted value not greater than 0.5), i.e. correctly predicted, and 6 were TRUE (predicted value greater than 0.5), i.e. incorrectly predicted.
TAB also shows there were 30 cases with a diabetes value of 1 (positive). Of these, 9 were FALSE (predicted value not greater than 0.5), i.e. incorrectly predicted, and 21 were TRUE (predicted value greater than 0.5), i.e. correctly predicted.
Now let's compute the misclassification rate, i.e. the fraction of test cases we classified incorrectly.
mcrate <- 1- sum(diag(TAB))/sum(TAB) # (9+6)/(56+6+9+21)
mcrate
## [1] 0.1630435
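Since the same calculation is repeated below for other thresholds, it could be wrapped in a small helper function (a sketch; misclass_rate is a name introduced here, and it reuses the test and predicted objects defined above):

misclass_rate <- function(threshold) {
  tab <- table(test$diabetes, predicted > threshold)  # confusion matrix at this threshold
  1 - sum(diag(tab)) / sum(tab)                       # fraction misclassified
}
misclass_rate(0.5)   # should reproduce mcrate, about 0.163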
Now let's raise the threshold to 0.7 and see how the misclassification rate changes.
TAB_high<- table(test$diabetes,predicted > 0.7)
TAB_high
##    
##     FALSE TRUE
##   0    60    2
##   1    21    9
mcrate_high <- 1- sum(diag(TAB_high))/sum(TAB_high)
mcrate_high
## [1] 0.25
And now let's lower the threshold to 0.3.
TAB_low<- table(test$diabetes,predicted > 0.3)
TAB_low
##    
##     FALSE TRUE
##   0    43   19
##   1     2   28
mcrate_low <- 1- sum(diag(TAB_low))/sum(TAB_low)
mcrate_low
## [1] 0.2282609
From the above calculations we can see that moving the threshold either higher (0.7) or lower (0.3) increased the misclassification rate, so of the thresholds tried, 0.5 works best for this problem.
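Rather than trying only 0.3, 0.5, and 0.7, one could scan a grid of thresholds with the hypothetical misclass_rate() helper sketched earlier (keeping in mind that tuning the threshold on the same test set is somewhat optimistic):

thresholds <- seq(0.2, 0.8, by = 0.05)
rates <- sapply(thresholds, misclass_rate)
data.frame(threshold = thresholds, misclassification = rates)
thresholds[which.min(rates)]   # threshold with the lowest error on this test set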