Data Mining Lab

Experiment 5

1) Consider the data set “students_placement_data.csv” and apply decision tree induction to it (data link: https://bit.ly/2EN599e ). Use placement status as the class label (target variable).

2) Draw a decision tree using rpart.plot.

3) Find the accuracy and draw the confusion matrix.

1) Step 1: Import the required packages.

library(rpart)
library(rpart.plot)

## Warning: package 'rpart.plot' was built under R version 3.5.3

2) Step 2: Import the dataset.

m<-read.csv("C:/Users/pradeep/OneDrive/datasets/students_placement_data.csv")
head(m) # Check the first 6 rows.

##   Roll.No Gender Section SSC.Percentage inter_Diploma_percentage
## 1       1      M       A          87.30                     65.3
## 2       2      F       B          89.00                     92.4
## 3       3      F       A          67.00                     68.0
## 4       4      M       A          71.00                     70.4
## 5       5      M       A          67.00                     65.5
## 6       6      M       A          81.26                     68.0
##   B.Tech_percentage Backlogs registered_for_.Placement_Training
## 1             40.00       18                                 NO
## 2             71.45        0                                yes
## 3             45.26       13                                yes
## 4             36.47       17                                yes
## 5             42.52       17                                yes
## 6             62.20        6                                yes
##   placement.status
## 1       Not placed
## 2           Placed
## 3       Not placed
## 4       Not placed
## 5       Not placed
## 6       Not placed

str(m) # Check the structure of the dataset

## 'data.frame':    117 obs. of  9 variables:
##  $ Roll.No                           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Gender                            : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ Section                           : Factor w/ 2 levels "A","B": 1 2 1 1 1 1 1 1 1 1 ...
##  $ SSC.Percentage                    : num  87.3 89 67 71 67 ...
##  $ inter_Diploma_percentage          : num  65.3 92.4 68 70.4 65.5 68 56.5 79.3 89.6 75.5 ...
##  $ B.Tech_percentage                 : num  40 71.5 45.3 36.5 42.5 ...
##  $ Backlogs                          : int  18 0 13 17 17 6 20 3 10 8 ...
##  $ registered_for_.Placement_Training: Factor w/ 2 levels "NO","yes": 1 2 2 2 2 2 2 1 2 2 ...
##  $ placement.status                  : Factor w/ 2 levels "Not placed","Placed": 1 2 1 1 1 1 1 1 1 1 ...
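
The Factor columns shown by str(m) depend on the R version: from R 4.0.0 onward, read.csv keeps strings as character unless stringsAsFactors = TRUE is passed. A minimal sketch of that option, assuming a small inline CSV (toy_csv, a hypothetical stand-in built from three rows of the head(m) output, not the real file):

```r
# Hypothetical inline CSV standing in for students_placement_data.csv.
toy_csv <- "Gender,B.Tech_percentage,placement.status
M,40.00,Not placed
F,71.45,Placed
F,45.26,Not placed"

# stringsAsFactors = TRUE converts text columns to factors (needed on R >= 4.0.0
# to get the Factor columns shown in the str() output).
toy <- read.csv(text = toy_csv, stringsAsFactors = TRUE)
str(toy)

# The class label is a factor with two levels.
levels(toy$placement.status)
```

With the real file, the same argument would be added to the read.csv call in Step 2.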

3) Step 3: Divide the data (117 observations) into training data and test data.

n=nrow(m) # n is total number of rows.
set.seed(101)

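# Use the sample function to partition the data: draw round(0.85*n) = 99 row indices for training.
# Note that because replace = TRUE, a row may be drawn more than once, so the training data contains
# duplicated rows and the test data (every row that was never drawn) is larger than 15 percent: 49 rows here.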
data_index=sample(1:n, size = round(0.85*n),replace = TRUE)
train_data=m[data_index,]
test_data=m[-data_index,]
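
If a strict 85/15 partition with no duplicated rows is wanted, sampling without replacement is the usual alternative. A minimal sketch, assuming a hypothetical toy data frame (toy) with 117 rows in place of m; this is not the split that produced the outputs in this lab:

```r
# Hypothetical stand-in for m: 117 rows, as in the lab data.
toy <- data.frame(Roll.No = 1:117)
n <- nrow(toy)

set.seed(101)
# replace = FALSE: each row index is drawn at most once.
train_index <- sample(1:n, size = round(0.85 * n), replace = FALSE)
train_toy <- toy[train_index, , drop = FALSE]
test_toy  <- toy[-train_index, , drop = FALSE]

# Training and test rows now partition the data exactly.
nrow(train_toy)   # 99
nrow(test_toy)    # 18
length(intersect(train_toy$Roll.No, test_toy$Roll.No))  # 0
```

Because no row is drawn twice, the two parts always add up to the 117 original rows.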

4) Check the structure of training and test data (Optional).

str(train_data)

## 'data.frame':    99 obs. of  9 variables:
##  $ Roll.No                           : int  44 6 84 77 30 36 69 40 73 64 ...
##  $ Gender                            : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 1 2 ...
##  $ Section                           : Factor w/ 2 levels "A","B": 2 1 1 1 2 2 1 2 2 1 ...
##  $ SSC.Percentage                    : num  86 81.3 89 78 72 ...
##  $ inter_Diploma_percentage          : num  92.5 68 88.9 59 88.1 90 61 88.8 83.7 69.2 ...
##  $ B.Tech_percentage                 : num  70.8 62.2 63 51.1 69.6 ...
##  $ Backlogs                          : int  0 6 1 17 0 0 6 0 0 20 ...
##  $ registered_for_.Placement_Training: Factor w/ 2 levels "NO","yes": 2 2 1 1 2 2 1 2 2 2 ...
##  $ placement.status                  : Factor w/ 2 levels "Not placed","Placed": 1 1 1 1 2 2 1 1 1 1 ...

str(test_data)

## 'data.frame':    49 obs. of  9 variables:
##  $ Roll.No                           : int  1 4 7 8 11 12 14 15 17 18 ...
##  $ Gender                            : Factor w/ 2 levels "F","M": 2 2 2 1 1 2 1 2 1 1 ...
##  $ Section                           : Factor w/ 2 levels "A","B": 1 1 1 1 2 1 2 1 2 2 ...
##  $ SSC.Percentage                    : num  87.3 71 71 84.8 82.3 ...
##  $ inter_Diploma_percentage          : num  65.3 70.4 56.5 79.3 76.3 66 88.7 52.2 85 95.1 ...
##  $ B.Tech_percentage                 : num  40 36.5 33.8 61 71.5 ...
##  $ Backlogs                          : int  18 17 20 3 0 16 0 7 0 0 ...
##  $ registered_for_.Placement_Training: Factor w/ 2 levels "NO","yes": 1 2 2 1 1 2 2 2 2 2 ...
##  $ placement.status                  : Factor w/ 2 levels "Not placed","Placed": 1 1 1 1 2 1 2 1 1 2 ...

5) Build a decision tree model using the “rpart” function.

Provide the class label (placement.status) and the predictor attributes/variables in the formula.
Here method is “class” because the target is categorical, so we are doing classification and not numeric prediction (regression).
Two split criteria can be selected through parms: Gini and entropy (“information”). The default split criterion is Gini.

stu_model <- rpart(formula = placement.status ~ Backlogs + Gender + B.Tech_percentage +
                     SSC.Percentage + inter_Diploma_percentage,
                   data = train_data, method = "class", parms = list(split = "gini"))

# Print the model.
print(stu_model)

## n= 99 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 99 28 Not placed (0.71717172 0.28282828)  
##   2) B.Tech_percentage< 67.135 63  2 Not placed (0.96825397 0.03174603) *
##   3) B.Tech_percentage>=67.135 36 10 Placed (0.27777778 0.72222222)  
##     6) SSC.Percentage< 83.58 11  3 Not placed (0.72727273 0.27272727) *
##     7) SSC.Percentage>=83.58 25  2 Placed (0.08000000 0.92000000) *
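
To see what the Gini criterion measures, the impurity of the nodes printed above can be recomputed by hand. A sketch under stated assumptions: gini() and entropy() are helper names introduced here for illustration (they are not rpart functions), and the class counts are derived from the n and loss values in the printed model.

```r
# Helper functions for illustration only (not part of rpart).
gini <- function(counts) {
  p <- counts / sum(counts)
  1 - sum(p^2)
}
entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log2(p))
}

# Class counts (Not placed, Placed) derived from n and loss in print(stu_model).
root  <- c(71, 28)   # node 1: 99 rows, 28 in the minority class
left  <- c(61, 2)    # node 2: B.Tech_percentage <  67.135
right <- c(10, 26)   # node 3: B.Tech_percentage >= 67.135

# Weighted impurity of the two children after the first split.
n <- sum(root)
gini_children <- sum(left) / n * gini(left) + sum(right) / n * gini(right)

gini(root)                   # impurity before the split
gini_children                # weighted impurity after the split
gini(root) - gini_children   # impurity decrease achieved by the split
entropy(root)                # the same node measured with entropy
```

A split is attractive when the weighted impurity of the children is well below that of the parent, which is the case for the B.Tech_percentage split here.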

6) Draw a decision tree.

We use rpart.plot from the rpart.plot package.

type=5 means we want to show the split variable name in the interior nodes.
extra=2 means we want to display the classification rate at the node, expressed as the number of correct classifications and the number of observations in the node.

rpart.plot(stu_model,type=5,extra = 2 )
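
The tree can also be read as a set of if/else rules taken from the splits printed by print(stu_model). A minimal sketch, where tree_rule() is a hypothetical helper written for illustration and the inputs are the six rows shown by head(m):

```r
# Hand-written version of the printed splits (illustration only).
tree_rule <- function(btech, ssc) {
  ifelse(btech < 67.135, "Not placed",
         ifelse(ssc < 83.58, "Not placed", "Placed"))
}

# B.Tech_percentage and SSC.Percentage of the six rows from head(m).
btech <- c(40.00, 71.45, 45.26, 36.47, 42.52, 62.20)
ssc   <- c(87.30, 89.00, 67.00, 71.00, 67.00, 81.26)

rule_pred <- tree_rule(btech, ssc)
print(rule_pred)
```

Only the second row reaches the Placed leaf, which agrees with the placement.status column in head(m).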

7) Apply the model stu_model to our test data using the predict function.

In the predict function, pass the model stu_model and test_data as inputs, and specify type="class" so that predicted class labels are returned rather than class probabilities.

p<-predict(stu_model,test_data,type="class")
print(p)

##          1          4          7          8         11         12 
## Not placed Not placed Not placed Not placed Not placed Not placed 
##         14         15         17         18         19         21 
##     Placed Not placed     Placed     Placed Not placed     Placed 
##         23         26         31         32         34         35 
## Not placed Not placed     Placed Not placed Not placed Not placed 
##         37         41         42         43         45         55 
## Not placed     Placed Not placed Not placed Not placed Not placed 
##         56         57         58         59         60         63 
##     Placed Not placed Not placed     Placed     Placed Not placed 
##         65         66         68         71         74         75 
##     Placed     Placed     Placed     Placed Not placed Not placed 
##         76         85         87         88         89         93 
## Not placed Not placed Not placed Not placed Not placed Not placed 
##        100        101        102        105        106        114 
## Not placed Not placed Not placed Not placed Not placed Not placed 
##        116 
##     Placed 
## Levels: Not placed Placed
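
For comparison, predict can also return class probabilities with type = "prob". A minimal sketch, assuming the kyphosis example data shipped with rpart as a stand-in (it is not the placement data, and fit is a name chosen here for illustration):

```r
library(rpart)

# kyphosis is a small example data set shipped with rpart (not the lab data).
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class")

# type = "prob" returns a matrix with one probability column per class level.
prob <- predict(fit, kyphosis, type = "prob")
head(prob)

# type = "class" returns one predicted label per row, as a factor.
cls <- predict(fit, kyphosis, type = "class")
head(cls)
```

Each row of prob sums to 1; the label form is the one needed for the confusion matrix in the next step.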

8) Print the confusion matrix.

The “table” command builds the confusion matrix. “test_data[,9]” holds the actual class labels (column 9 is placement.status) and “p” holds the predicted class labels, so rows are actual classes and columns are predicted classes. The confusion matrix shows the number of correct predictions and the number of wrong predictions.

t<-table(test_data[,9],p)
print(t)

##             p
##              Not placed Placed
##   Not placed         29      2
##   Placed              6     12

In the above table, (29 + 12) = 41 are correct predictions and (6 + 2) = 8 are wrong predictions.

9) Find the accuracy of the model.

Accuracy of the model is the number of correct predictions on the test set divided by the total number of samples in the test set.

Note: The diagonal elements of the matrix t hold the correct predictions, so the accuracy is sum(diag(t))/sum(t) = 41/49.

print(sum(diag(t))/sum(t))

## [1] 0.8367347
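
Accuracy alone hides how the errors are distributed between the two classes. A sketch that rebuilds the confusion matrix printed in step 8 and derives per-class rates, assuming “Placed” is treated as the positive class (cm and the metric names are introduced here for illustration):

```r
# Confusion matrix from step 8: rows = actual, columns = predicted.
# matrix() fills column by column.
cm <- matrix(c(29, 6, 2, 12), nrow = 2,
             dimnames = list(actual    = c("Not placed", "Placed"),
                             predicted = c("Not placed", "Placed")))

accuracy    <- sum(diag(cm)) / sum(cm)                                   # 41 / 49
recall      <- cm["Placed", "Placed"] / sum(cm["Placed", ])              # 12 / 18
precision   <- cm["Placed", "Placed"] / sum(cm[, "Placed"])              # 12 / 14
specificity <- cm["Not placed", "Not placed"] / sum(cm["Not placed", ])  # 29 / 31

round(c(accuracy = accuracy, recall = recall,
        precision = precision, specificity = specificity), 4)
```

Here the model misses 6 of the 18 students who were actually placed, which the overall accuracy does not show.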