Linear regression for classification

Credit goes to https://www.youtube.com/watch?v=0gf5iLTbiQM

I really enjoyed learning linear regression for classification.

install.packages("mlbench")

## 
## The downloaded binary packages are in
##  /var/folders/71/vnhmr1ts6w354s54vd5n1xw80000gn/T//RtmpuqeM4m/downloaded_packages

library(mlbench)
data("PimaIndiansDiabetes2")

Let's remove the rows containing NA values and convert the diabetes column to numeric. The diabetes column is a factor whose internal codes are 1 for neg and 2 for pos, so we can convert it with as.numeric and subtract 1.

This gives 0 for negative and 1 for positive
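The conversion relies on R storing a factor as integer codes in level order. Here is a minimal sketch on a made-up four-element factor (the vector `status` is hypothetical and not taken from the data set):

```r
# Hypothetical factor with the same two levels as the diabetes column
status <- factor(c("neg", "pos", "pos", "neg"), levels = c("neg", "pos"))

# Factors are stored as integer codes in level order: neg = 1, pos = 2
codes <- as.numeric(status)

# Subtracting 1 maps neg -> 0 and pos -> 1
y <- codes - 1

# An equivalent form that does not depend on the order of the levels
y_alt <- as.integer(status == "pos")

y
y_alt
```

The second form is a little safer if the level order is ever changed, since it names the positive level explicitly.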

pidna<-na.omit(PimaIndiansDiabetes2)
pidna1<-pidna
pidna1$diabetes<-as.numeric(pidna1$diabetes)-1
head(pidna1)

##    pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 4         1      89       66      23      94 28.1    0.167  21        0
## 5         0     137       40      35     168 43.1    2.288  33        1
## 7         3      78       50      32      88 31.0    0.248  26        1
## 9         2     197       70      45     543 30.5    0.158  53        1
## 14        1     189       60      23     846 30.1    0.398  59        1
## 15        5     166       72      19     175 25.8    0.587  51        1

Now let's use the first 300 rows as training data and the remaining 92 rows as test data.

train<-pidna1[1:300,]
test<-pidna1[301:392,]
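Taking the first 300 rows assumes the rows are in no particular order. If that is in doubt, a random split is a common alternative. Below is a minimal sketch on a stand-in data frame; `toy`, the seed, and the column names are assumptions for illustration, and this is not what the code above does:

```r
set.seed(1)  # arbitrary seed, only so the split is reproducible

# Stand-in data frame with 392 rows, the same number of rows as pidna1
toy <- data.frame(id = 1:392, x = rnorm(392))

# Draw 300 row positions at random, without replacement
train_idx <- sample(nrow(toy), size = 300)

toy_train <- toy[train_idx, ]   # the 300 sampled rows
toy_test  <- toy[-train_idx, ]  # the 92 rows that were not sampled

dim(toy_train)
dim(toy_test)
```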

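Now let's apply linear regression to the classification problem. In the lm formula, a . on the right-hand side stands for all the other columns, so diabetes~. regresses diabetes on every X variable. Note that the call below fits on the full data set pidna1, which includes the test rows. Passing data=train instead would keep the test set truly held out, and the numbers that follow would then differ.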

lm_res<-lm(diabetes~.,data=pidna1)
summary(lm_res)

## 
## Call:
## lm(formula = diabetes ~ ., data = pidna1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.07966 -0.25711 -0.06177  0.25851  1.03750 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.103e+00  1.436e-01  -7.681 1.34e-13 ***
## pregnant     1.295e-02  8.364e-03   1.549  0.12230    
## glucose      6.409e-03  8.159e-04   7.855 4.07e-14 ***
## pressure     5.465e-05  1.730e-03   0.032  0.97482    
## triceps      1.678e-03  2.522e-03   0.665  0.50631    
## insulin     -1.233e-04  2.045e-04  -0.603  0.54681    
## mass         9.325e-03  3.901e-03   2.391  0.01730 *  
## pedigree     1.572e-01  5.804e-02   2.708  0.00707 ** 
## age          5.878e-03  2.787e-03   2.109  0.03559 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3853 on 383 degrees of freedom
## Multiple R-squared:  0.3458, Adjusted R-squared:  0.3321 
## F-statistic:  25.3 on 8 and 383 DF,  p-value: < 2.2e-16

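glucose is the most significant predictor: the higher the glucose, the more likely the patient is diabetic. mass, pedigree and age are also significant at the 5% level, all with positive coefficients, so a larger value of each goes with more diabetes. pregnant, pressure, triceps and insulin are not significant. R-squared is about 0.35 (adjusted 0.33), which is modest, but we can still go ahead and check whether the model classifies the test data correctly.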

predicted<-predict(lm_res,newdata=test)
predicted

##          577          585          589          592          594 
##  0.244420572  0.447303389  0.892430896  0.247853068  0.154082438 
##          595          596          598          600          604 
##  0.442751307  0.642212947 -0.018711274  0.058500664  0.741087495 
##          607          608          609          610          611 
##  0.808880319 -0.055306518  0.495926885  0.005068651  0.109302808 
##          612          613          615          618          621 
##  0.679045087  0.815361134  0.672183415 -0.254612874  0.257751243 
##          624          626          632          634          638 
##  0.118291584  0.181823036  0.111463508  0.144838721  0.083730581 
##          639          640          641          645          646 
##  0.362644439 -0.060517115  0.112448223  0.162610046  0.502681045 
##          647          648          649          651          652 
##  0.477760667  0.641301903  0.510290979 -0.067016047  0.236016628 
##          653          655          656          657          658 
##  0.336847701  0.094288889  0.443652110 -0.021751634  0.525936449 
##          660          663          664          666          669 
##  0.177437266  0.754965198  0.697322803  0.191315463  0.275578183 
##          670          671          673          674          680 
##  0.621513760  0.759605234  0.153076811  0.560988898  0.026878386 
##          681          683          686          689          690 
## -0.266237055  0.164858232  0.321757633  0.323023058  0.645496863 
##          693          694          696          697          699 
##  0.383535752  0.565978584  0.440350376  0.542901171  0.347472158 
##          701          705          708          710          711 
##  0.293380017  0.121538402  0.211310095  0.152168116  0.404701043 
##          712          714          716          717          719 
##  0.396564616  0.181857019  0.841957328  0.741620901  0.198243687 
##          722          723          724          727          731 
##  0.204339241  0.476820925  0.404912327  0.245937215  0.317860400 
##          733          734          737          739          741 
##  0.747986493  0.097251081  0.199898993  0.105687552  0.656768356 
##          742          743          745          746          748 
##  0.117138330  0.057377935  0.888220259  0.367467410  0.287295888 
##          749          752          754          756          761 
##  0.765829806  0.315565924  0.664499377  0.511247313  0.046555498 
##          764          766 
##  0.440586252  0.225402089

From this output we can decide whether each patient is classified as diabetic or not. For example, if a patient's predicted value is above 0.5 she is classified as diabetic, and if it is 0.5 or less she is not.
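The predicted values are plain regression outputs rather than probabilities, which is why some of them are negative (row 618 is about -0.25). The sketch below shows the same fit, predict and threshold pattern on simulated data, fitting on the training rows only. Everything in it (the data, the seed, the names `sim`, `fit`, `score`) is made up for illustration:

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated data: one predictor x and a 0/1 outcome that is more often 1 when x is large
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(1.5 * x))
sim <- data.frame(x = x, y = y)

# Fit on the first 150 rows only and hold out the last 50
sim_train <- sim[1:150, ]
sim_test  <- sim[151:200, ]
fit <- lm(y ~ x, data = sim_train)

# Scores are real numbers; nothing constrains them to lie between 0 and 1
score <- predict(fit, newdata = sim_test)
range(score)

# Classify as 1 when the score exceeds the threshold
pred_class <- as.integer(score > 0.5)
table(actual = sim_test$y, predicted = pred_class)
```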

TAB<- table(test$diabetes,predicted > 0.5)

TAB shows there were 62 cases with an actual diabetes value of 0:

56 were FALSE, i.e. predicted at or below 0.5, so correctly classified
6 were TRUE, i.e. predicted above 0.5, so incorrectly classified

TAB shows there were 30 cases with an actual diabetes value of 1:

9 were FALSE, i.e. predicted at or below 0.5, so incorrectly classified
21 were TRUE, i.e. predicted above 0.5, so correctly classified

Now let's compute the misclassification rate, i.e. the share of test cases we failed to classify correctly.

mcrate <- 1- sum(diag(TAB))/sum(TAB)  #  (9+6)/(56+6+9+21)
mcrate

## [1] 0.1630435
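The overall rate hides the fact that the two kinds of error are not equally common. As a rough sketch, the table for the 0.5 threshold can be rebuilt by hand from the four counts quoted above, and the rates for each class read off it (the names `tab`, `sens` and `spec` are mine, not from the original code):

```r
# Counts for the 0.5 threshold; matrix() fills column by column
tab <- matrix(c(56, 9, 6, 21), nrow = 2,
              dimnames = list(actual = c("0", "1"),
                              predicted = c("FALSE", "TRUE")))

# Misclassification rate: off-diagonal counts over the total, (6 + 9) / 92
mc <- 1 - sum(diag(tab)) / sum(tab)

# Sensitivity: share of actual positives predicted positive, 21 / 30
sens <- tab["1", "TRUE"] / sum(tab["1", ])

# Specificity: share of actual negatives predicted negative, 56 / 62
spec <- tab["0", "FALSE"] / sum(tab["0", ])

round(c(misclassification = mc, sensitivity = sens, specificity = spec), 4)
```

So at this threshold the model misses 9 of the 30 diabetic patients while wrongly flagging only 6 of the 62 non-diabetic ones.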

So let's move the threshold, first higher to 0.7 and then lower to 0.3.

TAB_high<- table(test$diabetes,predicted > 0.7)
TAB_high

##    
##     FALSE TRUE
##   0    60    2
##   1    21    9

mcrate_high <- 1- sum(diag(TAB_high))/sum(TAB_high) 
mcrate_high

## [1] 0.25

TAB_low<- table(test$diabetes,predicted > 0.3)
TAB_low

##    
##     FALSE TRUE
##   0    43   19
##   1     2   28

mcrate_low <- 1- sum(diag(TAB_low))/sum(TAB_low) 
mcrate_low

## [1] 0.2282609

From the calculations above we can see that moving the threshold higher (0.7) or lower (0.3) increased the misclassification rate, to 0.25 and 0.23 against 0.16, so of the three thresholds tried 0.5 is the best for this problem.
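Comparing thresholds one at a time gets repetitive, so the same check can be wrapped in a small function and run over several values at once. This is only a sketch: `misclass_rate` is a hypothetical helper, and the labels and scores are made up rather than taken from the test set.

```r
# Hypothetical helper: share of cases where the thresholded score disagrees with the label
misclass_rate <- function(actual, score, threshold) {
  mean(as.integer(score > threshold) != actual)
}

# Made-up labels and scores, for illustration only
actual <- c(0, 0, 0, 1, 1, 1)
score  <- c(0.10, 0.35, 0.45, 0.55, 0.65, 0.90)

# Evaluate the same three thresholds in one pass
thresholds <- c(0.3, 0.5, 0.7)
rates <- sapply(thresholds, function(t) misclass_rate(actual, score, t))
names(rates) <- thresholds
rates
```

Comparing predictions to labels directly also avoids a trap with the table approach: if every score falls on one side of the threshold, the table has a single column and diag() no longer picks out the correct cells.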